Rethinking Networking for the AI/ML Era

In her AI Speaker Series presentation at Sutter Hill Ventures, Google Distinguished Engineer Nandita Dukkipati explained how AI/ML workloads have completely broken traditional networking. Here are my notes from her talk:

AI broke our networking assumptions. Traditional networking expected some latency variance and occasional failures. AI workloads demand perfection: high bandwidth, ultra-low jitter (tens of microseconds), and near-flawless reliability. One slow node kills the entire training job.

Why AI is different: These workloads use bulk synchronous parallel computing. Everyone waits at a barrier until every node completes its step. The slowest worker determines overall speed. No "good enough" when 99 of 100 nodes finish fast.
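
To make the barrier effect concrete, here is a minimal simulation sketch. The worker count, the timing range, and the 3x slowdown are my own illustrative assumptions, not figures from the talk.

```python
import random


def bsp_step_time(worker_times):
    """A synchronous step ends only when the slowest worker reaches the barrier."""
    return max(worker_times)


def simulate(num_workers=100, steps=50, straggler_slowdown=3.0, seed=0):
    """Return total job time without and with a single slow worker.

    All parameters are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    total_healthy = 0.0
    total_with_straggler = 0.0
    for _ in range(steps):
        # Healthy workers finish within +/-5% of a nominal 1.0 time unit.
        times = [rng.uniform(0.95, 1.05) for _ in range(num_workers)]
        total_healthy += bsp_step_time(times)

        # Slow down exactly one worker; the other 99 are unchanged.
        slowed = list(times)
        slowed[0] *= straggler_slowdown
        total_with_straggler += bsp_step_time(slowed)
    return total_healthy, total_with_straggler


healthy, degraded = simulate()
print(f"healthy job time:  {healthy:.1f}")
print(f"with 1 straggler:  {degraded:.1f}")
```

In this toy model one slow worker out of 100 stretches the whole job by nearly the full slowdown factor, because every step takes the maximum of the per-worker times rather than the average.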

Real example: Gemini traffic shows bursts lasting hundreds of milliseconds at line rate, yet average utilization is 5x below peak. The bursts are synchronized, so there are no statistical multiplexing benefits. The traffic is both latency sensitive AND bandwidth intensive.
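
A small sketch of why synchronized bursts defeat statistical multiplexing. It assumes a simplified slotted model in which every flow sends at line rate for one slot out of five (matching the 5x peak-to-average figure). The flow count and slot layout are hypothetical.

```python
def aggregate_demand(period_slots, burst_slots, offsets):
    """Sum per-slot demand (in units of line rate) over one period.

    Each entry in `offsets` is the slot where one flow's burst starts.
    """
    demand = [0] * period_slots
    for start in offsets:
        for k in range(burst_slots):
            demand[(start + k) % period_slots] += 1
    return demand


def peak_to_average(demand):
    """Ratio of the busiest slot to the mean demand across the period."""
    return max(demand) / (sum(demand) / len(demand))


NUM_FLOWS, PERIOD, BURST = 10, 5, 1  # illustrative values only

# Every flow bursts in the same slot, as after a training-step barrier.
synchronized = aggregate_demand(PERIOD, BURST, [0] * NUM_FLOWS)

# Bursts spread evenly over the period, as with independent traffic.
staggered = aggregate_demand(PERIOD, BURST, [i % PERIOD for i in range(NUM_FLOWS)])

print(synchronized, peak_to_average(synchronized))  # [10, 0, 0, 0, 0] 5.0
print(staggered, peak_to_average(staggered))        # [2, 2, 2, 2, 2] 1.0
```

Both patterns carry the same average load, but the synchronized one needs five times the capacity at its peak, so the network has to be provisioned for the burst rather than the mean.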

Three Breakthroughs

Falcon (Hardware Transport): Existing hardware transports assumed lossless networks: fundamentally incompatible with Ethernet. Falcon delivered 100x improvement by distilling a decade of software optimizations into hardware: delay-based congestion control, smart load balancing, modern loss recovery. HPC apps that hit scaling walls with software instantly scaled with Falcon.
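
For readers unfamiliar with delay-based congestion control, here is a generic sketch of the idea: grow the window while measured round-trip delay is under a target, and shrink it in proportion to how far delay overshoots. This is not Falcon's actual algorithm. The function name, parameters, and constants are all hypothetical.

```python
def update_cwnd(cwnd, rtt_us, target_us, ai=1.0, beta=0.8, max_decrease=0.5,
                min_cwnd=0.01):
    """Return the new congestion window after one acknowledged packet.

    Generic delay-based AIMD sketch; every constant here is an assumption.
    """
    if rtt_us < target_us:
        # Below the delay target: additive increase of about `ai` per RTT.
        if cwnd >= 1.0:
            cwnd += ai / cwnd
        else:
            cwnd += ai
    else:
        # Above the target: multiplicative decrease scaled by the overshoot,
        # capped so a single sample cannot collapse the window.
        overshoot = (rtt_us - target_us) / rtt_us
        cwnd *= 1.0 - min(beta * overshoot, max_decrease)
    return max(cwnd, min_cwnd)


cwnd = 10.0
for rtt in [20.0, 25.0, 30.0, 100.0, 60.0, 30.0]:  # made-up RTT samples (us)
    cwnd = update_cwnd(cwnd, rtt, target_us=50.0)
    print(f"rtt={rtt:6.1f}us  cwnd={cwnd:.2f}")
```

The appeal of delay as a signal is that it rises before queues overflow, so the sender can back off without waiting for a drop. That matters on Ethernet, where losses do happen.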

CSIG (Congestion Signaling): End-to-end congestion control has blind spots: it can't see reverse-path congestion or available bandwidth. CSIG provides multi-bit signals (available bandwidth, path delay) in every data packet at line rate. No probing needed. The killer feature: it gives information in application context, so you see exactly which paths are congested.
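
A rough sketch of the in-band idea: each hop updates a small field carried in the data packet, so the receiver learns the path's bottleneck without sending probes. The field names, the hop names, and the aggregation rules (minimum for bandwidth, sum for delay) are assumptions for illustration, not the CSIG wire format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signal:
    """Hypothetical in-packet congestion signal."""
    min_available_gbps: float = float("inf")
    path_delay_us: float = 0.0
    bottleneck_hop: Optional[str] = None


def stamp(signal, hop_name, available_gbps, hop_delay_us):
    """Update the signal as the packet traverses one hop."""
    if available_gbps < signal.min_available_gbps:
        # Keep only the tightest link seen so far and remember where it was.
        signal.min_available_gbps = available_gbps
        signal.bottleneck_hop = hop_name
    signal.path_delay_us += hop_delay_us
    return signal


# Made-up path: (hop name, available bandwidth in Gbps, queuing delay in us).
path = [("tor-a", 80.0, 2.0), ("spine-3", 12.5, 9.0), ("tor-b", 60.0, 1.5)]

sig = Signal()
for hop_name, available_gbps, hop_delay_us in path:
    stamp(sig, hop_name, available_gbps, hop_delay_us)

print(sig.bottleneck_hop, sig.min_available_gbps, sig.path_delay_us)
# spine-3 12.5 12.5
```

Because the signal rides on packets the application was already sending, the receiver can attribute congestion to the specific path that flow took.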

Firefly: Jitter kills AI workloads. Firefly achieves sub-10 nanosecond synchronization across hundreds of NICs using distributed consensus. Measured reality: ±5 nanoseconds via oscilloscope. Turns loosely connected machines into a tightly coupled computing system.
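
To give a feel for consensus-style synchronization, here is an idealized sketch in which each NIC repeatedly nudges its clock offset toward the average of its neighbors' offsets. It ignores measurement noise, clock drift, and link asymmetry, and it is not Firefly's actual protocol. The ring topology, gain, and starting offsets are hypothetical.

```python
def consensus_round(offsets, neighbors, gain=0.5):
    """One synchronous round: every node moves toward its neighbors' mean."""
    updated = []
    for i, offset in enumerate(offsets):
        neighbor_mean = sum(offsets[j] for j in neighbors[i]) / len(neighbors[i])
        updated.append(offset + gain * (neighbor_mean - offset))
    return updated


def spread(offsets):
    """Worst-case disagreement between any two clocks."""
    return max(offsets) - min(offsets)


NUM_NICS = 8
# Ring overlay: each NIC only exchanges timestamps with its two neighbors.
neighbors = [[(i - 1) % NUM_NICS, (i + 1) % NUM_NICS] for i in range(NUM_NICS)]

# Made-up initial clock offsets in nanoseconds.
offsets = [0.0, 400.0, -250.0, 900.0, 120.0, -600.0, 50.0, 300.0]
initial_mean = sum(offsets) / NUM_NICS

for _ in range(200):
    offsets = consensus_round(offsets, neighbors)

print(f"spread after 200 rounds: {spread(offsets):.3e} ns")
```

No node needs a global view or a single master clock. Agreement emerges from purely local exchanges, which is what lets this kind of scheme scale to hundreds of NICs.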

The Remaining Challenges

Straggler detection: Even with perfect networking, finding the one slow GPU in thousands remains the hardest problem. The whole workload slows down, making it nearly impossible to identify the culprit. Statistical outlier analysis is too noisy. Active work in progress.
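
A minimal sketch of the kind of statistical outlier analysis mentioned above, using a median/MAD robust z-score. It is my own illustration, not a method from the talk. It assumes per-worker compute times measured before the barrier are available. The sample values and the 3.5 threshold are hypothetical.

```python
import statistics


def robust_outliers(values, threshold=3.5):
    """Return indices whose robust z-score (median/MAD based) exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # No spread at all: nothing can be singled out.
        return []
    return [i for i, v in enumerate(values)
            if 0.6745 * (v - median) / mad > threshold]


# Made-up per-worker compute times (seconds) measured before the barrier.
compute_times = [1.0, 1.01, 0.99, 1.02, 0.98, 1.0, 1.6, 1.01]
print(robust_outliers(compute_times))  # [6]

# What each worker observes as wall-clock step time under a barrier:
# everyone finishes together, so the culprit is hidden.
step_times = [max(compute_times)] * len(compute_times)
print(robust_outliers(step_times))  # []
```

The second call shows the core difficulty: once the barrier equalizes step times, the signal that separates the slow worker from the rest disappears, and real measurements add noise on top of that.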

Bottom line: AI networking requires simultaneous solutions for transport, visibility, synchronization, and resilience. Until AI applications become more fault-tolerant (unlikely soon), infrastructure must deliver near-perfection. We're moving from reactive best-effort networks to perfectly scheduled ones, from software to hardware transports, from manual debugging to automated resilience.

Published on October 31, 2025 10:00

Luke Wroblewski's Blog
