
Transformer Sequence Parallelism with MPI4Python

Overview

Modern Transformers struggle to scale as sequence lengths grow, with attention becoming one of the biggest computational bottlenecks. To explore how these limits can be pushed, this project implements sequence parallelism for Transformer layers using MPI4Python and GPU acceleration on the CWRU Pioneer HPC Cluster. Sequence parallelism distributes computation by splitting the input along the sequence (token) dimension, allowing each GPU to process a subset of tokens independently while still producing correct attention outputs.

In a standard Transformer, every token must attend to every other token, which tightly couples computations across the full sequence. To enable parallelism, this implementation:

  • Partitions the input tensor across GPUs along the sequence axis
  • Computes local Q/K/V projections independently on each GPU
  • Uses MPI AllGather to assemble full K and V tensors on every GPU
  • Computes attention locally, but only for each GPU’s assigned token slice

This preserves correctness while enabling distributed forward propagation across multiple devices.
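The four steps above can be checked in a single process. The sketch below is a NumPy simulation under toy shapes (B=2, S=8, H=4, p=2, all hypothetical): each "rank" owns S/p tokens, the MPI AllGather is stood in for by a concatenation, and the sharded result is compared against ordinary full-sequence attention.

```python
import numpy as np

rng = np.random.default_rng(0)
B, S, H, p = 2, 8, 4, 2  # toy sizes; the real runs use far larger tensors

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the last two axes.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

x = rng.standard_normal((B, S, H))
Wq, Wk, Wv = (rng.standard_normal((H, H)) for _ in range(3))

# Reference: full-sequence attention on a single "device".
full = attention(x @ Wq, x @ Wk, x @ Wv)

# Sequence parallelism: partition along the token axis, project locally,
# "AllGather" K and V (here: concatenate the shards), then attend using
# only the local Q slice on each rank.
shards = np.split(x, p, axis=1)                       # each [B, S/p, H]
k_full = np.concatenate([s @ Wk for s in shards], 1)  # stands in for MPI AllGather
v_full = np.concatenate([s @ Wv for s in shards], 1)
outs = [attention(s @ Wq, k_full, v_full) for s in shards]
parallel = np.concatenate(outs, axis=1)

print(np.allclose(full, parallel))  # → True: shards reproduce the full result
```

Note that only K and V need to be gathered: each rank's Q slice attends to the full sequence, producing exactly its [B, S/p, H] slice of the output.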

Due to the lack of CUDA-aware support in the MPI configuration, communication operations must use a slower CPU-staging pipeline:

GPU → CPU → MPI → CPU → GPU

This adds significant latency (≈5–15 ms per AllGather), making communication the primary bottleneck at higher GPU counts.
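A back-of-envelope model shows how quickly this latency accumulates. The sketch below uses the ~5–15 ms per-AllGather figure from the measurements and a hypothetical 12-layer model (the layer count is an assumption, not from the experiments), with two AllGathers per layer as described later.

```python
# Staged-collective overhead per forward pass (rough estimate, not measured):
# num_layers is a hypothetical example; latency range is the observed 5-15 ms.
num_layers = 12
allgathers_per_layer = 2  # one each for K and V

for latency_ms in (5, 15):
    overhead_ms = num_layers * allgathers_per_layer * latency_ms
    print(f"{latency_ms} ms/AllGather -> {overhead_ms} ms overhead per pass")
# 5 ms  -> 120 ms per pass
# 15 ms -> 360 ms per pass
```

Against sub-second forward-pass times, hundreds of milliseconds of pure staging overhead is enough to erase the compute savings at higher GPU counts.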

Despite this limitation, the project demonstrates:

  • How to construct a distributed Transformer layer from scratch
  • How sequence parallelism behaves across 1, 2, and 4 GPUs
  • How communication overhead impacts scaling
  • Practical trade-offs in parallelizing attention mechanisms

The results provide meaningful insight into the challenges and design considerations behind parallel deep learning systems, even when ideal speedup is not achieved.

Quick Start

# Setup
bash scripts/setup_env.sh

# Run experiments (p=1,2,4 on same GPUs)
bash scripts/submit_all_gpu.sh

# Monitor
tail -f logs/sp_all_experiments_<JOBID>.out

This project uses three scripts to automate everything on the node:

  • scripts/submit_all_gpu.sh is the one-command entry point: it creates logs/ and experiments/results/, submits the batch job via sbatch scripts/run_all_experiments.sh, and prints how to monitor the job and where results will be saved.
  • scripts/run_all_experiments.sh is the Slurm batch script: it requests 1 node with 4 GPUs, sets up the environment, then runs the strong-scaling experiment with sequence parallelism for P=1,2,4 using srun, saving each run’s metrics to experiments/results/strong_scaling_sp_gpu_p<P>.json and printing a small summary at the end.
  • scripts/setup_env.sh is the environment bootstrapper: it loads the appropriate OpenMPI and CUDA modules, installs torch, mpi4py, numpy, and cupy if needed, and runs small sanity checks to confirm that MPI, CUDA, and PyTorch are working correctly before any experiments start.

Implementation

Sequence Partitioning:

  • Input: [Batch, Seq_len, Hidden] split along sequence dimension
  • Each GPU gets: [B, S/p, H]
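The partitioning itself is a single split along axis 1. A toy shape check (small made-up sizes; assumes seq_len divides evenly by p, as in the real configuration):

```python
import numpy as np

# Toy shape check for the sequence partitioning: [B, S, H] -> p shards of [B, S/p, H].
B, S, H, p = 2, 16, 8, 4
x = np.zeros((B, S, H))
shards = np.split(x, p, axis=1)      # split along the sequence (token) axis
print(len(shards), shards[0].shape)  # → 4 (2, 4, 8)
```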

Attention Layer:

  1. Q/K/V projections: Local (no communication)
  2. AllGather K and V: [B, S/p, H] → [B, S, H] on all ranks
  3. Attention: Local Q with full K/V
  4. Output: [B, S/p, H] per rank

Feed-Forward: No communication (position-independent)

Communication: 2 AllGathers per layer (K and V in attention)

CPU-based MPI: GPU → CPU → MPI → CPU → GPU (adds overhead ~5-15ms per AllGather)

Configuration

Edit experiments/experiment_configs.py:

hidden_dim = 5120      # Model size
seq_len = 4096         # Sequence length (significantly increased)
batch_size = 4         # Batch size (reduced to fit longer sequences)
num_iterations = 10    # Timing iterations

Results

Experimental Configuration

All experiments were run using the following configuration:

  • Hidden dimension: 5120
  • Sequence length: 4096
  • Batch size: 4
  • Iterations: 10 timed forward passes
  • Parallelism: Sequence parallelism across 1, 2, and 4 GPUs
  • Communication backend: MPI4Py using CPU-staged collectives

Each experiment measures:

  • End-to-end forward pass time
  • Speedup relative to 1 GPU
  • Parallel efficiency
  • Throughput (tokens processed per second)

Raw Results

| GPUs | Time (s) | Speedup | Efficiency | Tokens/sec |
|------|----------|---------|------------|------------|
| 1    | 0.947    | 1.00×   | 100%       | 17,295     |
| 2    | 0.816    | 1.16×   | 58%        | 20,072     |
| 4    | 1.010    | 0.94×   | 23%        | 16,218     |
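The derived columns follow directly from the measured forward-pass times. The sketch below recomputes them (tokens per pass is batch_size × seq_len from the configuration; small rounding differences from the published throughput figures are expected):

```python
# Recompute speedup, efficiency, and throughput from the measured times.
times = {1: 0.947, 2: 0.816, 4: 1.010}  # seconds per forward pass
tokens_per_pass = 4 * 4096              # batch_size * seq_len

for p, t in times.items():
    speedup = times[1] / t       # relative to the 1-GPU baseline
    efficiency = speedup / p     # speedup divided by GPU count
    throughput = tokens_per_pass / t
    print(f"p={p}: {speedup:.2f}x, {efficiency:.0%}, {throughput:,.0f} tok/s")
```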

Visualization

Three plots accompany the results:

[Figure: Speedup vs. GPU Count]
[Figure: Parallel Efficiency]
[Figure: Tokens Processed per Second]

Analysis

1. Performance at 2 GPUs

Using 2 GPUs improves runtime from 0.947s → 0.816s, yielding a 16% speedup.
This is a real but modest gain, and it reflects the fundamental trade-off of sequence parallelism:

  • Each GPU computes on half the sequence, reducing local FLOPs.
  • But attention still requires global K and V, forcing 2 AllGather operations.
  • These AllGathers use the slower GPU → CPU → MPI → CPU → GPU staging path.

Even with the communication penalty, the reduction in compute is enough to produce a net speedup.

2. Performance at 4 GPUs

At 4 GPUs, the system shows a slowdown rather than improvement:

  • Runtime increases from 0.816s → 1.010s
  • Speedup drops below 1.0×
  • Efficiency collapses to 23%

Why? Because the communication volume grows faster than the compute reduction:

| GPUs | Tokens per GPU | AllGather Size | Communication/Compute Ratio |
|------|----------------|----------------|-----------------------------|
| 1    | 4096           | N/A            | Low                         |
| 2    | 2048           | Medium         | Noticeable                  |
| 4    | 1024           | Large          | Dominant                    |

At P=4, each GPU's local compute block shrinks, yet every rank must still assemble the full global K and V tensors. Given that collectives occur twice per Transformer layer, the system becomes communication-bound.

The result: More GPUs → More communication → Worse performance.
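This trend can be made concrete with a rough per-layer model (toy estimate, not measured): per-rank attention FLOPs shrink roughly as 1/p, while the gathered K and V bytes each rank receives stay constant.

```python
# Rough per-rank cost model for one attention layer under the experimental
# shapes (B=4, S=4096, H=5120). Not measured -- an illustrative estimate.
B, S, H = 4, 4096, 5120
bytes_per_elem = 4  # assuming fp32

for p in (1, 2, 4):
    # QK^T plus attn@V for the local [B, S/p, *] slice: 2 matmuls, 2*(S/p)*S*H each.
    flops = B * 2 * 2 * (S // p) * S * H
    # Gathered K and V are full [B, S, H] tensors on every rank, for any p.
    comm_bytes = 2 * B * S * H * bytes_per_elem
    print(f"p={p}: ~{flops/1e9:.0f} GFLOP/rank, {comm_bytes/1e6:.0f} MB received/rank")
```

Compute per rank falls by 4× going from p=1 to p=4 while the received communication volume does not fall at all, so the communication/compute ratio grows linearly with p.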

3. Why Communication Dominates

Because the MPI stack is not CUDA-aware, every collective follows:

GPU → CPU → MPI → CPU → GPU

This introduces:

  • PCIe transfer latency
  • Extra CPU memory copying
  • Synchronization stalls across all ranks

Each AllGather incurs ~5–15 ms, and with 2 AllGathers per layer, total overhead overwhelms compute savings at higher GPU counts.

Key Observations

  • Sequence parallelism works functionally and distributes compute correctly.
  • Scaling is limited entirely by communication, not computation.
  • P=2 offers the best trade-off; beyond that, CPU-staged collectives dominate the timeline.
  • Increasing sequence length (4096) amplifies tensor size and thus communication cost.
  • Tokens/sec correlates strongly with efficiency, dropping sharply at 4 GPUs.

What This Means for Real Parallel LLM Training

These experiments reflect a common insight in distributed training research:

Parallelism is only beneficial if communication cost does not exceed compute savings.

Frameworks like Megatron-LM avoid this issue by using CUDA-aware MPI or NCCL collectives, where:

  • GPU memory is exchanged directly
  • the CPU is never involved
  • bandwidth is significantly higher

Our results show exactly why such optimizations are necessary.

Lessons Learned

Implementing sequence parallelism made one thing clear: scaling Transformer layers is not just a mathematical parallelization problem; it is a systems engineering problem. The experiments highlight how:

  • Correctness can be preserved even under distributed partitioning
  • Communication topology matters as much as algorithmic design
  • Real clusters impose practical limits on theoretical parallelism
  • Performance is shaped by the interaction between hardware, memory pathways, and collective communication libraries

Even though 4 GPUs were slower than 1, the project successfully demonstrates the core challenges behind modern distributed LLM training.

Troubleshooting

  • CUDA Out of Memory: Reduce batch_size or seq_len
  • Job Timeout: Increase #SBATCH --time=02:00:00
  • Poor Scaling: Expected with CPU-based MPI; communication overhead dominates at higher GPU counts

Team

| Name         | Institution                     |
|--------------|---------------------------------|
| Harshita M.P | Case Western Reserve University |
| Art Zabarov  | Case Western Reserve University |
