AVX SIMD instructions are free parallelization hidden inside every modern x86 CPU. You can access these vectorized instructions for extra speed, without learning assembly language, by using C++ AVX intrinsics.
Key points:
- Introduction to AVX SIMD intrinsics
- Vectorization and horizontal reductions
- Low latency tricks and branchless programming
- Instruction-level parallelism and out-of-order execution
- Loop unrolling & double loop unrolling

Table of Contents

Part: AVX Optimizations
1. AVX Intrinsics
2. Simple AVX Example
3. CPU Platform Detection
4. Common Bugs & Slugs
5. One-Dimensional Vectorization
6. Horizontal Reductions
7. Vector Dot Product
8. Loop Optimizations
9. Softmax
10. Advanced AVX Techniques

Part: Low-Level Code Optimizations
11. Compile-Time Optimizations
12. Zero Runtime Cost Operations
13. Bitwise Operations
14. Floating-Point Computations
15. Arithmetic Optimizations
16. Branch Prediction
17. Instruction-Level Parallelism
18. Core Pinning
19. Cache Locality
20. Cache Warming
21. Contiguous Memory Blocks
22. False Sharing
23. Memory Pools

Appendix: Long List of Low Latency Techniques
Appendix: License Details