Maximizing Training Efficiency with Ironwood TPUs: A Developer's Guide

The shift towards trillion-parameter systems has amplified the need for computational power, pushing traditional infrastructures to their limits. The seventh-generation Ironwood TPU, Google's custom AI accelerator, is designed to scale effectively, supporting pods of up to 9,216 chips. This is achieved through the integration of the Inter-Chip Interconnect (ICI), Optical Circuit Switches (OCS), the Data Center Network (DCN), and extensive High Bandwidth Memory (HBM) capacity. Ironwood also benefits from tight hardware-software co-design, pairing the compiler-centric XLA stack with Python-native kernel authoring through Pallas and the Mosaic compiler, which together make it practical to train and serve complex systems efficiently.

Key Optimization Strategies for Ironwood

1. Leverage Native FP8 with MaxText

Ironwood introduces native 8-bit floating point (FP8) support in its Matrix Multiply Units (MXUs). Utilizing FP8 for weights, activations, and gradients can potentially double throughput compared to Brain Floating Point 16 (BF16). When configured properly, FP8 training can enhance efficiency without sacrificing quality. Users can begin implementing FP8 recipes using the Qwix library.
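To make the precision trade-off concrete, here is a minimal sketch of per-tensor FP8 (E4M3) quantization in JAX. This only simulates the cast and measures the resulting error; production recipes (scaling strategies, delayed scaling, gradient handling) come from the Qwix library and are not shown here.

```python
# Minimal sketch: simulating FP8 (E4M3) quantization of a weight matrix.
# jnp.float8_e4m3fn is one of the 8-bit formats Ironwood's MXUs consume;
# real FP8 training recipes should come from Qwix, not this hand-rolled cast.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
w_bf16 = jax.random.normal(key, (128, 128), dtype=jnp.bfloat16)

# Per-tensor scale so values fit within E4M3's representable range (~±448).
scale = jnp.max(jnp.abs(w_bf16)).astype(jnp.float32) / 448.0
w_fp8 = (w_bf16.astype(jnp.float32) / scale).astype(jnp.float8_e4m3fn)

# Dequantize to inspect the quantization error introduced by the cast.
w_deq = w_fp8.astype(jnp.float32) * scale
err = jnp.mean(jnp.abs(w_deq - w_bf16.astype(jnp.float32)))
print(w_fp8.dtype, float(err))
```

The point of the scale factor is that E4M3 has only 3 mantissa bits, so keeping values near the top of its dynamic range preserves as much relative precision as possible.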

2. Accelerate with Tokamax Kernels

The Tokamax library offers high-performance JAX kernels optimized for TPUs, addressing specific bottlenecks:

  • Splash Attention: Mitigates I/O bottlenecks by keeping the attention working set in on-chip SRAM, making it well suited to long context lengths.
  • Megablox Grouped Matrix Multiplication (GMM): This technique efficiently manages ragged tensors in Mixture of Experts (MoE) models, enhancing MXU utilization.
  • Kernel Tuning: The library provides utilities for hyperparameter optimization, allowing adjustments to tile sizes and configurations.
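To illustrate what Megablox's grouped matrix multiplication computes, here is a dense reference implementation: tokens are grouped contiguously by expert, and each group is multiplied against its own expert's weights. This is only the mathematical specification, not the Tokamax API; the real kernel performs this over ragged group sizes on-device without the Python loop.

```python
# Reference (non-optimized) grouped matrix multiplication for MoE layers.
# Function and argument names here are illustrative, not the Tokamax API.
import jax.numpy as jnp

def reference_gmm(tokens, weights, group_sizes):
    """tokens: (num_tokens, d_model), sorted so each expert's tokens are
    contiguous; weights: (num_experts, d_model, d_ff);
    group_sizes[e] = number of tokens routed to expert e."""
    outs, start = [], 0
    for e, size in enumerate(group_sizes):
        outs.append(tokens[start:start + size] @ weights[e])
        start += size
    return jnp.concatenate(outs, axis=0)

tokens = jnp.ones((6, 4))                                      # 6 tokens, d_model=4
weights = jnp.stack([jnp.eye(4) * (e + 1) for e in range(2)])  # 2 toy experts
out = reference_gmm(tokens, weights, group_sizes=[4, 2])
print(out.shape)  # (6, 4)
```

Because group sizes vary per batch, a naive padded einsum wastes MXU cycles on zero rows; handling the raggedness directly is what makes the fused kernel worthwhile.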

3. Offload Collectives to SparseCore

Ironwood's fourth-generation SparseCores are tailored for managing irregular memory access patterns. By utilizing specific XLA flags, users can offload collective communication tasks like All-Gather and Reduce-Scatter to SparseCore, allowing TensorCores to focus on primary computations and improving overall efficiency.
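As a mechanical note, XLA/libtpu flags of this kind are typically passed through the LIBTPU_INIT_ARGS environment variable before JAX initializes the TPU runtime. The flag name below is a placeholder, not a real flag; consult the Ironwood documentation for the exact SparseCore-offload flag names. Only the mechanism is sketched here.

```python
# Sketch of the flag-passing mechanism only. The flag name below is a
# HYPOTHETICAL placeholder -- substitute the documented SparseCore
# offload flags for your runtime version.
import os

existing = os.environ.get("LIBTPU_INIT_ARGS", "")
os.environ["LIBTPU_INIT_ARGS"] = (
    existing + " --example_sparsecore_offload_flag=true"  # placeholder flag
).strip()

import jax  # must be imported *after* the environment variable is set
```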

4. Fine-Tune the Memory Pipeline on VMEM

VMEM, a crucial component of the TPU memory architecture, is a fast on-chip SRAM that optimizes kernel performance. Tuning VMEM allocation between current operations and future weight prefetch can enhance execution speed. For instance, increasing VMEM for current operations can boost tile sizes, improving kernel performance by minimizing memory stalls. More details can be found in TPU Pipelining.
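A minimal sketch of adjusting the scoped-VMEM budget, assuming the `xla_tpu_scoped_vmem_limit_kib` flag (in KiB) available in recent XLA/libtpu builds; the value shown is an illustrative starting point for experimentation, not a recommendation, and the right setting depends on your model and kernel mix.

```python
# Sketch: raising the scoped-VMEM budget so kernels can use larger tiles.
# The value (in KiB) is illustrative; tune it against your own profile,
# since VMEM given to current ops is taken away from weight prefetch.
import os

os.environ["LIBTPU_INIT_ARGS"] = (
    os.environ.get("LIBTPU_INIT_ARGS", "")
    + " --xla_tpu_scoped_vmem_limit_kib=65536"
).strip()

import jax  # import after setting flags so the TPU runtime picks them up
```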

5. Choose Optimal Sharding Strategies

MaxText supports various parallelism techniques applicable to all TPUs. The optimal choice depends on model size, architecture, and sequence length:

  • Fully Sharded Data Parallelism (FSDP): Ideal for large models exceeding single chip memory, FSDP shards weights, gradients, and optimizer states across multiple chips.
  • Tensor Parallelism (TP): Effective for very large dimensions, leveraging Ironwood's high arithmetic intensity.
  • Expert Parallelism (EP): Useful for distributing experts in MoE models.
  • Context Parallelism (CP): Necessary for long sequences, sharding activations along the sequence dimension.
  • Hybrid Approaches: Combining strategies can balance compute, memory, and communication for large-scale operations.
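The strategies above are all expressed through the same `jax.sharding` machinery. Here is a minimal FSDP-style sketch: we force eight host "devices" so the example runs on CPU, then shard a weight matrix along a mesh axis the way FSDP shards parameters, gradients, and optimizer state. The axis name is illustrative; MaxText configures these meshes for you.

```python
# Sketch: an FSDP-style parameter layout with jax.sharding. We fake 8
# devices on the host so this runs on CPU; on a pod slice the same mesh
# maps onto real chips. The "fsdp" axis name is illustrative.
import os
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("fsdp",))
# Shard the weight's leading dimension across the fsdp axis, so each
# device holds only 1/8th of the parameter (and, in training, of its
# gradient and optimizer state).
w = jax.device_put(jnp.zeros((1024, 512)), NamedSharding(mesh, P("fsdp", None)))
print(w.sharding)
```

Hybrid layouts are just meshes with more axes (e.g. `("fsdp", "tensor")`) and PartitionSpecs that name more than one of them.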

For more insights on techniques 2-5, refer to the Optimizing Frontier Model Training on TPU v7x Ironwood post in the Developer forums.

The Ironwood Advantage: System-Level Performance

By implementing these optimization strategies alongside Ironwood's architectural strengths, such as the high-speed 3D Torus Inter-Chip Interconnect (ICI) and substantial HBM capacity, users can create a robust platform for training advanced systems. The close integration of hardware, compilers (XLA), and frameworks (JAX, MaxText) ensures maximum performance can be achieved from the infrastructure.

Acknowledgments

A special thanks to Hina Jajoo and Amanda Liang for their contributions to this blog post.