Elevate your AI system performance capabilities with this definitive guide to maximizing efficiency across every layer of your AI infrastructure. In today's era of ever-growing generative models, AI Systems Performance Engineering provides engineers, researchers, and developers with a hands-on set of actionable optimization strategies. Learn to co-optimize hardware, software, and algorithms to build resilient, scalable, and cost-effective AI systems that excel in both training and inference. Authored by Chris Fregly, a performance-focused engineering and product leader, this resource transforms complex AI systems into streamlined, high-impact AI solutions.
Inside, you'll discover step-by-step methodologies for fine-tuning GPU CUDA kernels, PyTorch-based algorithms, and multi-node training and inference systems. You'll also master the art of scaling GPU clusters, distributed model-training jobs, and inference servers for high performance. The book ends with a checklist of 175+ proven, ready-to-use optimizations.
- Co-design and optimize hardware, software, and algorithms to achieve maximum throughput and cost savings
- Implement cutting-edge inference strategies that reduce latency and boost throughput in real-world settings
- Utilize industry-leading scalability tools and frameworks
- Profile, diagnose, and eliminate performance bottlenecks across complex AI pipelines
- Integrate full-stack optimization techniques for robust, reliable AI system performance
Very enlightening book on LLM distributed training and inference at scale: concise, well written, and packed with details. I particularly liked the discussion of software-hardware co-design, and of compute, storage, and network I/O.
Wow, what a highly informative, bleeding-edge tome: the best programming book I've read in the last few years.
I came into this book well prepared, having pre-read the PyTorch docs and the CUDA programming guide, and having been immersed daily in deep-learning neural-net architectures over the last 10 years. Even so, this book brought a lot of new stuff together in a clear, structured, and practical manner.
At a chonky 1,000 pages, it's really a few amazing sub-books combined:
1. The optimal CUDA programming guide on the market. Chapters 6 to 12 covered many techniques: instruction-level parallelism, prefetching, structured sparsity, cooperative tiling, double/triple-buffered pipelines with atomic queues and barriers, persistent kernels, cooperative groups, warp specialization, distributed shared memory, unified memory (and NUMA awareness), PDL (programmatic dependent launch), thread block clusters, async CUDA streams with events, CUDA graphs, and NVTX markers. Every other CUDA book out there is dated in comparison.
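To give a flavor of that material, here's a minimal PyTorch sketch of CUDA graph capture plus NVTX markers (my own toy example using the standard torch.cuda APIs, not code from the book; assumes a CUDA-capable GPU):

```python
import torch

# Toy model and static input/output tensors for graph capture.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Warm up on a side stream so capture sees steady-state allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = model(x)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass once, then replay it with near-zero launch overhead.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = model(x)

torch.cuda.nvtx.range_push("graph_replay")      # shows up on the nsys timeline
x.copy_(torch.randn(64, 1024, device="cuda"))   # update the static input in place
g.replay()                                      # re-launch the captured kernel sequence
torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()
```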
2. A briefer guide to distributed PyTorch and its compiler internals in chapters 13-14. Distribution covered DDP vs. FSDP2, as well as 5D parallelism: data, tensor, pipeline, context, and expert (for MoEs). Also covered were torch.compile internals: TorchDynamo for bytecode capture and graph extraction, AOT Autograd fusion of forward and backward passes, the Prims and ATen IRs, the TorchInductor backend for codegen, and how to minimize graph breaks. This section also briefly covered Triton custom kernels and some high-performance PyTorch specializations:
a. the torchao library, for custom data types and optimizations
b. SuperOffload: Speculation-then-Validation (STV), Heterogeneous Optimizer Computation, Superchip-Aware Casting, and GraceAdam for optimizer efficiency
c. TorchTitan: AsyncTP, AutoParallel, SimpleFSDP
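As a taste of the compiler material, the book's advice boils down to experiments like this minimal torch.compile sketch (mine, not the book's; fullgraph=True makes TorchDynamo raise on any graph break instead of silently splitting the graph):

```python
import torch

def f(x):
    # A data-dependent Python branch like `if x.sum() > 0:` would force a
    # graph break; keeping control flow at the tensor level avoids it.
    return torch.where(x > 0, x * 2.0, x - 1.0).relu()

compiled = torch.compile(f, fullgraph=True)  # error out on any graph break
x = torch.randn(8, 8)
print(torch.allclose(compiled(x), f(x)))
# Running with TORCH_LOGS="graph_breaks" surfaces the reason for each break.
```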
3. A great guide to Prefill-Decode Disaggregation techniques for LLM inference in chapters 15-19. No other book currently covers this, using vLLM, SGLang, and Dynamo. This section covered metrics (TTFT latency, TPOT throughput), speculative decoding and parallel token generation (e.g., Medusa, EAGLE3), dynamic batching plus latency-free dynamic scheduling policies plus dynamic routing, stall-free scheduling (chunked prefill), weight/activation quantization, application-level optimizations, KV-cache tuning with hybrid prefill and fast transfer, and SLO-aware request management with fault tolerance. Chapter 19 especially focused on some very recent dynamic adaptive optimizations around parallelism, quantization precision, network-topology and utilization awareness, speculative KV prefetching with policy switching and compression, dynamic memory allocation, kernel hot-swapping, and dynamic batch and prefill sizing. The future is wild!
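For anyone new to the serving metrics named above: TTFT and TPOT fall out of simple timestamp arithmetic over a streaming response. A back-of-the-envelope sketch (entirely hypothetical client code, not from the book):

```python
import time

def measure_ttft_tpot(stream):
    """Time-to-first-token and time-per-output-token from a token iterator.
    `stream` is a hypothetical stand-in for a streaming vLLM/SGLang
    response that yields one generated token at a time."""
    start = time.perf_counter()
    first = None
    n_tokens = 0
    for _ in stream:
        now = time.perf_counter()
        if first is None:
            first = now  # first token has arrived
        n_tokens += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start                  # prefill-dominated latency
    decode = time.perf_counter() - first  # decode-dominated phase
    tpot = decode / max(n_tokens - 1, 1)  # mean per-token decode latency
    return ttft, tpot
```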
Throughout this, a few overriding themes are interleaved:
1. Mechanical sympathy between hardware, software, and algorithms. The core substrate under study is the NVL72 rack, consisting of Grace-Blackwell superchips optimized with Tensor Cores and interconnected with NVLink and NVSwitch.
2. Continuous profiler-guided analysis, with a strong focus on using Nsight Systems (nsys) and Nsight Compute (ncu) to analyze metrics against the conceptual roofline model and tune away from warp stalls and memory bank conflicts toward GPU occupancy and arithmetic intensity. PyTorch Profiler is also covered (see the sketch after this list), and K8s Prometheus, Meta's HTA, and Linux perf are touched on.
3. Networking nodes into a distributed compute grid to optimize over an AI supercomputer. This covers various stacks: NCCL for distributed GPU comms, NIXL for disaggregated prefill-decode coordination, the SHARP scalable reduction protocol, Magnum IO GPUDirect RDMA/Storage, and NVSHMEM for GPU-to-GPU memory access. In a way it's a return to the go-go years of scheduling jobs on a mainframe!
4. The foundational tradeoff calculations in GPGPU programming: hardware-specific right-sizing of grids, blocks, and on-chip SM shared memory; warp shuffle intrinsics; and interleaved, async-optimized fetching from the memory hierarchy to keep the GPUs busy and bypass the CPU and system DRAM.
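Theme 2 in practice: before dropping down to nsys/ncu, the same profiler-first loop can start with the standard PyTorch Profiler API. A minimal sketch (my example, not the book's; any model/input pair will do):

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(256, 4096, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),   # skip 1, warm 1, record 3
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./prof"),
    record_shapes=True,
) as prof:
    for _ in range(5):
        model(x)
        prof.step()  # advance the wait/warmup/active schedule

# Top kernels by GPU time, the starting point for roofline-style tuning.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```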
Some small critiques:
1. Sometimes I lost track of all the sizing numbers; a summary reference appendix of tables would have been good.
2. The OpenAI Triton section did not have enough guidance compared to the CUDA sections it was supposed to abstract.
3. No mention of the new and shiny cuTile programming model, which is supposed to change everything.
4. A detailed deep dive on NVIDIA's higher-level libraries was lacking: cuBLAS, CUTLASS, cuDNN, TensorRT. TMA remained a bit of a black box.
5. PyTorch-centric at the high level: no coverage of the JAX/Flax stack with MLIR/XLA lowering on TPUs. I suppose that could be another 1,000-page book!
Being at the bleeding edge, the book also covers DeepSeek innovations (3FS, FlashMLA, R1, V3), looks ahead to the Vera Rubin and Feynman chip roadmaps, explores agentic/RL autotuning, and touches on techniques such as Predibase adaptive kernels, DASH (input-aware layer skipping), and LazyLLM dynamic pruning.
AI code generation has introduced great upheaval in the software engineering world. Routine development work is increasingly replaceable. There is creative destruction as competitive edges are eroded. There is a capability bifurcation: the high priests who build the AIs and the masses who use them; the GPU-rich vs. the token-poor. At the high level, agentic devs increasingly face a level playing field of vanishing gradients. This guide on how to build the code that builds the code that generates the code will give you the skills you need to survive and thrive in the industry in the years to come.