Heterogeneous Parallelism Boosts Multimodal LLM Training by 49%

Yashaswi Karnati, Kamran Jafari, Akash Mehra, Li Ding, Pranav Prashant Thombre, Ali Roshan Ghias, Shifang Xu, Parth Mannan, Yu Yao, Hao Wu, Eric Harper, Ashwath Aithal, Nima Tajbakhsh

Training large multimodal AI models is getting unwieldy. When you pile vision, video, and text into one model, you end up with mismatched parts. The LLM might need one kind of parallelism for its long context windows, while the encoder needs something else for its shorter inputs. A team has now released a solution for this, extending Megatron-LM to let each module in the training graph choose its own sharding layout and GPU placement.

The core problem is that current training frameworks force everything into the same parallelism scheme. This means encoders get stuck with the LLM’s sharding decisions, which can add unnecessary communication and limit how fast they run. It gets worse at long context lengths where the LLM needs context parallelism but the encoders do not. The new approach, called heterogeneous parallelism, lets you decouple these choices. Modules can run on the same GPUs or on completely separate rank sets, and the system handles the tensor reshuffling at module boundaries automatically.
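To make the idea of per-module choices concrete, here is a minimal sketch of what such a description could look like. The names `ModuleLayout` and `placement` are hypothetical and chosen for illustration only; they are not part of the Megatron-LM API, and the real extension may express this quite differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModuleLayout:
    """Hypothetical per-module parallelism description (not a Megatron-LM type)."""
    tp: int                 # tensor-parallel size
    cp: int                 # context-parallel size
    dp: int                 # data-parallel size
    ranks: Tuple[int, ...]  # GPU ranks this module is placed on

    def __post_init__(self) -> None:
        # The parallel sizes must exactly tile the ranks assigned to the module.
        if self.tp * self.cp * self.dp != len(self.ranks):
            raise ValueError(
                f"tp*cp*dp = {self.tp * self.cp * self.dp}, "
                f"but {len(self.ranks)} ranks were assigned"
            )


def placement(a: ModuleLayout, b: ModuleLayout) -> str:
    """Classify how two modules share GPUs."""
    ranks_a, ranks_b = set(a.ranks), set(b.ranks)
    if ranks_a == ranks_b:
        return "colocated"
    if ranks_a.isdisjoint(ranks_b):
        return "non-colocated"
    return "partially overlapping"


# Illustrative values only: an encoder that needs no context parallelism
# sharing eight GPUs with an LLM that uses tensor and context parallelism.
encoder = ModuleLayout(tp=1, cp=1, dp=8, ranks=tuple(range(8)))
llm = ModuleLayout(tp=2, cp=4, dp=1, ranks=tuple(range(8)))
print(placement(encoder, llm))

# The same encoder moved to its own two GPUs, separate from the LLM.
small_encoder = ModuleLayout(tp=1, cp=1, dp=2, ranks=(0, 1))
separate_llm = ModuleLayout(tp=2, cp=4, dp=1, ranks=tuple(range(2, 10)))
print(placement(small_encoder, separate_llm))
```

In the first case both modules occupy the same eight ranks but shard their work differently; in the second, the rank sets are disjoint.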

To make this work, the team built boundary communicators. These transform activations into the destination layout during the forward pass and route gradients back to the source layout during backpropagation. They also added scheduling changes to handle both colocated and non-colocated placement. The result is a flexible system where you can pick the right parallelism for each part of the model.
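The role of a boundary communicator can be illustrated with a single-process toy. The sketch below assumes, purely for illustration, that the source module shards activations along the batch axis and the destination module shards them along the sequence axis. The class and helper names are hypothetical, plain Python lists stand in for tensors, and a real system would move data between GPUs with collective communication rather than list operations.

```python
from typing import List

Matrix = List[List[float]]  # toy activations indexed as [batch][sequence]


def split_rows(x: Matrix, parts: int) -> List[Matrix]:
    """Shard along the batch axis (assumes len(x) is divisible by parts)."""
    step = len(x) // parts
    return [x[i * step:(i + 1) * step] for i in range(parts)]


def merge_rows(shards: List[Matrix]) -> Matrix:
    """Inverse of split_rows: stack batch shards back together."""
    return [row for shard in shards for row in shard]


def split_cols(x: Matrix, parts: int) -> List[Matrix]:
    """Shard along the sequence axis (assumes row length divisible by parts)."""
    step = len(x[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in x] for i in range(parts)]


def merge_cols(shards: List[Matrix]) -> Matrix:
    """Inverse of split_cols: join the sequence pieces of every row."""
    return [sum((shard[r] for shard in shards), []) for r in range(len(shards[0]))]


class BoundaryCommunicator:
    """Hypothetical stand-in for a module-boundary layout converter."""

    def __init__(self, src_parts: int, dst_parts: int) -> None:
        self.src_parts = src_parts
        self.dst_parts = dst_parts

    def forward(self, src_shards: List[Matrix]) -> List[Matrix]:
        # Activations: source layout (batch-sharded) -> destination layout.
        return split_cols(merge_rows(src_shards), self.dst_parts)

    def backward(self, dst_grad_shards: List[Matrix]) -> List[Matrix]:
        # Gradients: destination layout -> back to the source layout.
        return split_rows(merge_cols(dst_grad_shards), self.src_parts)


# A 4x6 activation matrix, batch-sharded over 2 source ranks and
# re-sharded along the sequence axis for 3 destination ranks.
activations = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
comm = BoundaryCommunicator(src_parts=2, dst_parts=3)
src_shards = split_rows(activations, 2)
dst_shards = comm.forward(src_shards)

# Gradients share the activations' shape, so routing them back
# reproduces the source sharding exactly.
assert comm.backward(dst_shards) == src_shards
```

The point of the sketch is the symmetry: whatever layout change is applied to activations on the way forward must be undone for gradients on the way back, so each module only ever sees tensors in its own layout.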

In their benchmarks, colocated setups improved TFLOPS per GPU by up to 49.3% compared to the homogeneous baseline. Non-colocated setups boosted aggregate token throughput by 13% and TFLOPS per GPU by 9.6%. They also verified that loss convergence matches the standard approach. The code is now available as an open-source Megatron-LM extension.

This is a practical fix for a growing bottleneck. As models eat more modalities and longer contexts, rigid parallelism will only become more painful. Letting each module breathe might be the simplest way to keep training efficient.