Elevate your AI system performance capabilities with this definitive guide to maximizing efficiency across every layer of your AI infrastructure. In today's era of ever-growing generative models, AI Systems Performance Engineering provides engineers, researchers, and developers with a hands-on set of actionable optimization strategies. Learn to co-optimize hardware, software, and algorithms to build resilient, scalable, and cost-effective AI systems that excel in both training and inference. Authored by Chris Fregly, a performance-focused engineering and product leader, this resource transforms complex AI systems into streamlined, high-impact AI solutions.
Inside, you'll discover step-by-step methodologies for fine-tuning GPU CUDA kernels, PyTorch-based algorithms, and multi-node training and inference systems. You'll also master the art of scaling GPU clusters, distributed model-training jobs, and inference servers for high performance. The book ends with a checklist of 175+ proven, ready-to-use optimizations.
- Co-design and optimize hardware, software, and algorithms to achieve maximum throughput and cost savings
- Implement cutting-edge inference strategies that reduce latency and boost throughput in real-world settings
- Utilize industry-leading scalability tools and frameworks
- Profile, diagnose, and eliminate performance bottlenecks across complex AI pipelines
- Integrate full-stack optimization techniques for robust, reliable AI system performance
Very enlightening book on LLM distributed training and inference at scale: concise, well written, and packed with details. I particularly liked the discussion of software-hardware co-design, and of compute, storage, and network I/O.
Wow, what a highly informative, bleeding-edge tome: the best programming book I've read in the last few years.
I came into this book well prepared, having pre-read the PyTorch docs and the CUDA programming guide, and having been immersed daily in deep-learning neural-net architectures over the last 10 years. Even so, this book brought a lot of new stuff together in a clear, structured, and practical manner.
At a chonky 1,000 pages, it's really a few amazing sub-books combined:
1. The optimal CUDA programming guide on the market. Chapters 6 to 12 covered many techniques: instruction-level parallelism, prefetching, structured sparsity, cooperative tiling, double/triple-buffered pipelines with atomic queues and barriers, persistent kernels, cooperative groups, warp specialization, distributed shared memory, unified memory (and NUMA awareness), PDL (programmatic dependent launch), thread block clusters, async CUDA streams with events, CUDA graphs, and NVTX markers. Every other CUDA book out there is dated in comparison.
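To give a flavor of that material, here's a minimal PyTorch sketch of CUDA graph capture plus NVTX markers (my own toy example using the standard torch.cuda APIs, not code from the book; assumes a CUDA-capable GPU):

```python
import torch

# Toy model and static input/output tensors for graph capture.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Warm up on a side stream so capture sees steady-state allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = model(x)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass once, then replay it with near-zero launch overhead.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = model(x)

torch.cuda.nvtx.range_push("graph_replay")      # shows up on the nsys timeline
x.copy_(torch.randn(64, 1024, device="cuda"))   # update the static input in place
g.replay()                                      # re-launch the captured kernel sequence
torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()
```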
2. A briefer guide to distributed PyTorch and its compiler internals in chapters 13-14. Distribution covered DDP vs. FSDP2, as well as 5D parallelism: data, tensor, pipeline, context, and expert (for MoEs). Also covered were torch.compile internals: TorchDynamo for bytecode capture and graph extraction, AOT Autograd fusion of forward and backward passes, the Prims and ATen IRs, the TorchInductor backend for codegen, and how to minimize graph breaks. This section also briefly covered Triton custom kernels and some high-performance PyTorch specializations:
a. the torchao library, for custom data types and optimizations
b. SuperOffload: Speculation-then-Validation (STV), Heterogeneous Optimizer Computation, Superchip-Aware Casting, and GraceAdam for optimizer efficiency
c. TorchTitan: AsyncTP, AutoParallel, SimpleFSDP
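As a taste of the compiler material, the book's advice boils down to experiments like this minimal torch.compile sketch (mine, not the book's; fullgraph=True makes TorchDynamo raise on any graph break instead of silently splitting the graph):

```python
import torch

def f(x):
    # A data-dependent Python branch like `if x.sum() > 0:` would force a
    # graph break; keeping control flow at the tensor level avoids it.
    return torch.where(x > 0, x * 2.0, x - 1.0).relu()

compiled = torch.compile(f, fullgraph=True)  # error out on any graph break
x = torch.randn(8, 8)
print(torch.allclose(compiled(x), f(x)))
# Running with TORCH_LOGS="graph_breaks" surfaces the reason for each break.
```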
3. A great guide to Prefill-Decode Disaggregation techniques for LLM inference in chapters 15-19. No other book currently covers this, using vLLM, SGLang, and Dynamo. This section covered metrics (TTFT latency, TPOT throughput), speculative decoding and parallel token generation (e.g., Medusa, EAGLE3), dynamic batching plus latency-free dynamic scheduling policies plus dynamic routing, stall-free scheduling (chunked prefill), weight/activation quantization, application-level optimizations, KV-cache tuning with hybrid prefill and fast transfer, and SLO-aware request management with fault tolerance. Chapter 19 especially focused on some very recent dynamic adaptive optimizations around parallelism, quantization precision, network-topology and utilization awareness, speculative KV prefetching with policy switching and compression, dynamic memory allocation, kernel hot-swapping, and dynamic batch and prefill sizing. The future is wild!
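For anyone new to the serving metrics named above: TTFT and TPOT fall out of simple timestamp arithmetic over a streaming response. A back-of-the-envelope sketch (entirely hypothetical client code, not from the book):

```python
import time

def measure_ttft_tpot(stream):
    """Time-to-first-token and time-per-output-token from a token iterator.
    `stream` is a hypothetical stand-in for a streaming vLLM/SGLang
    response that yields one generated token at a time."""
    start = time.perf_counter()
    first = None
    n_tokens = 0
    for _ in stream:
        now = time.perf_counter()
        if first is None:
            first = now  # first token has arrived
        n_tokens += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start                  # prefill-dominated latency
    decode = time.perf_counter() - first  # decode-dominated phase
    tpot = decode / max(n_tokens - 1, 1)  # mean per-token decode latency
    return ttft, tpot
```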
Throughout this, a few overriding themes are interleaved:
1. Mechanical sympathy between hardware, software, and algorithms. The core substrate under study is the NVL72 rack, consisting of Grace-Blackwell superchips optimized with Tensor Cores and interconnected with NVLink and NVSwitch.
2. Continuous profiler-guided analysis, with a strong focus on using Nsight Systems (nsys) and Nsight Compute (ncu) to analyze metrics against the conceptual roofline model and tune away from warp stalls and memory bank conflicts toward GPU occupancy and arithmetic intensity. PyTorch Profiler is also covered (see the sketch after this list), and K8s Prometheus, Meta's HTA, and Linux perf are touched on.
3. Networking nodes into a distributed compute grid to optimize over an AI supercomputer. This covers various stacks: NCCL for distributed GPU comms, NIXL for disaggregated prefill-decode coordination, the SHARP scalable reduction protocol, Magnum IO GPUDirect RDMA/Storage, and NVSHMEM for GPU-to-GPU memory access. In a way it's a return to the go-go years of scheduling jobs on a mainframe!
4. The foundational tradeoff calculations in GPGPU programming: hardware-specific right-sizing of grids, blocks, and on-chip SM shared memory; warp shuffle intrinsics; and interleaved, async-optimized fetching from the memory hierarchy to keep the GPUs busy and bypass the CPU and system DRAM.
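Theme 2 in practice: before dropping down to nsys/ncu, the same profiler-first loop can start with the standard PyTorch Profiler API. A minimal sketch (my example, not the book's; any model/input pair will do):

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(256, 4096, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),   # skip 1, warm 1, record 3
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./prof"),
    record_shapes=True,
) as prof:
    for _ in range(5):
        model(x)
        prof.step()  # advance the wait/warmup/active schedule

# Top kernels by GPU time, the starting point for roofline-style tuning.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```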
Some small critiques:
1. Sometimes I lost track of all the sizing numbers; a summary reference appendix of tables would have been good.
2. The OpenAI Triton section did not have enough guidance compared to the CUDA sections it was supposed to abstract.
3. No mention of the new and shiny cuTile programming model, which is supposed to change everything.
4. A detailed deep dive on NVIDIA's higher-level libraries was lacking: cuBLAS, CUTLASS, cuDNN, TensorRT. TMA remained a bit of a black box.
5. PyTorch-centric at the high level: no coverage of the JAX/Flax stack with MLIR/XLA lowering on TPUs. I suppose that could be another 1,000-page book!
Being at the bleeding edge, the book also covers DeepSeek innovations (3FS, FlashMLA, R1, V3), looks ahead to the Vera Rubin and Feynman chip roadmaps, explores agentic/RL autotuning, and touches on techniques such as Predibase adaptive kernels, DASH (input-aware layer skipping), and LazyLLM dynamic pruning.
AI code generation has introduced great upheaval in the software engineering world. Routine development work is increasingly replaceable. There is creative destruction as competitive edges are eroded. There is a capability bifurcation: the high priests who build the AIs and the masses who use them; the GPU-rich vs. the token-poor. At the high level, agentic devs increasingly face a level playing field of vanishing gradients. This guide on how to build the code that builds the code that generates the code will give you the skills you need to survive and thrive in the industry in the years to come.