Modern Transformers struggle to scale as sequence lengths grow, with attention becoming one of the biggest computational bottlenecks. To explore how these limitations can be overcome, this project implements sequence parallelism for Transformer layers using mpi4py and GPU acceleration on the CWRU Pioneer HPC Cluster. Sequence parallelism distributes computation by splitting the input along the sequence (token) dimension, allowing each GPU to process a subset of tokens independently while still producing correct attention outputs.
In a standard Transformer, every token must attend to every other token, which tightly couples computations across the full sequence. To enable parallelism, this implementation:
- Partitions the input tensor across GPUs along the sequence axis
- Computes local Q/K/V projections independently on each GPU
- Uses MPI AllGather to assemble full K and V tensors on every GPU
- Computes attention locally, but only for each GPU’s assigned token slice
This preserves correctness while enabling distributed forward propagation across multiple devices.
Due to the lack of CUDA-aware support in the MPI configuration, communication operations must use a slower CPU-staging pipeline:
GPU → CPU → MPI → CPU → GPU
This adds significant latency (≈5–15 ms per AllGather), making communication the primary bottleneck at higher GPU counts.
Despite this limitation, the project demonstrates:
- How to construct a distributed Transformer layer from scratch
- How sequence parallelism behaves across 1, 2, and 4 GPUs
- How communication overhead impacts scaling
- Practical trade-offs in parallelizing attention mechanisms
The results provide meaningful insight into the challenges and design considerations behind parallel deep learning systems, even when ideal speedup is not achieved.
```bash
# Setup
bash scripts/setup_env.sh

# Run experiments (p=1,2,4 on the same GPUs)
bash scripts/submit_all_gpu.sh

# Monitor
tail -f logs/sp_all_experiments_<JOBID>.out
```

This project uses three scripts to automate everything on the node. `scripts/submit_all_gpu.sh` is the one-command entry point: it creates `logs/` and `experiments/results/`, submits the batch job via `sbatch scripts/run_all_experiments.sh`, and prints how to monitor the job and where results will be saved. `scripts/run_all_experiments.sh` is the Slurm batch script: it requests one node with 4 GPUs, sets up the environment, then runs the strong-scaling experiment with sequence parallelism for P=1, 2, and 4 using `srun`, saving each run's metrics to `experiments/results/strong_scaling_sp_gpu_p<P>.json` and printing a small summary at the end. `scripts/setup_env.sh` is the environment bootstrapper: it loads the appropriate OpenMPI and CUDA modules, installs `torch`, `mpi4py`, `numpy`, and `cupy` if needed, and runs small sanity checks to confirm that MPI, CUDA, and PyTorch are working correctly before any experiments start.
Sequence Partitioning:
- Input: `[Batch, Seq_len, Hidden]`, split along the sequence dimension
- Each GPU gets: `[B, S/p, H]`

Attention Layer:
- Q/K/V projections: local (no communication)
- AllGather K and V: `[B, S/p, H]` → `[B, S, H]` on all ranks
- Attention: local Q against the full K/V
- Output: `[B, S/p, H]` per rank

Feed-Forward: no communication (position-independent)

Communication: 2 AllGathers per layer (K and V in attention)

CPU-based MPI: GPU → CPU → MPI → CPU → GPU (adds ~5–15 ms of overhead per AllGather)
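The cost of those two AllGathers can be estimated with back-of-envelope arithmetic from the experiment configuration (fp32 assumed; this is illustrative arithmetic, not measured traffic):

```python
# Per-layer AllGather volume, using the experiment configuration (fp32).
B, S, H, bytes_per_elt = 4, 4096, 5120, 4

def allgather_bytes_received(p):
    """Bytes each rank must receive to assemble one full [B, S, H] tensor."""
    return B * S * H * bytes_per_elt * (p - 1) // p

for p in (2, 4):
    per_layer = 2 * allgather_bytes_received(p)   # K and V, once per layer
    print(f"p={p}: ~{per_layer / 1e6:.0f} MB received per rank per layer")
```

A full `[B, S, H]` tensor at this configuration is about 335 MB, so each rank pulls hundreds of megabytes per layer over the CPU-staged path, and the received volume grows with `(p-1)/p` as more GPUs are added.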
Edit `experiments/experiment_configs.py`:

```python
hidden_dim = 5120       # Model size
seq_len = 4096          # Sequence length (significantly increased)
batch_size = 4          # Batch size (reduced to fit longer sequences)
num_iterations = 10     # Timing iterations
```

All experiments were run using the following configuration:
- Hidden dimension: 5120
- Sequence length: 4096
- Batch size: 4
- Iterations: 10 timed forward passes
- Parallelism: Sequence parallelism across 1, 2, and 4 GPUs
- Communication backend: MPI4Py using CPU-staged collectives
Each experiment measures:
- End-to-end forward pass time
- Speedup relative to 1 GPU
- Parallel efficiency
- Throughput (tokens processed per second)
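The derived metrics follow directly from the measured forward-pass times. A small sketch using the times from the results table (tiny discrepancies with the table come from the times being rounded to three digits):

```python
# Deriving speedup, efficiency, and throughput from measured times.
B, S = 4, 4096                            # tokens per forward pass = B * S
times = {1: 0.947, 2: 0.816, 4: 1.010}    # seconds per forward pass
base = times[1]

for p, t in times.items():
    speedup = base / t                    # relative to the 1-GPU run
    efficiency = speedup / p              # how well extra GPUs are used
    tokens_per_sec = B * S / t
    print(f"p={p}: {speedup:.2f}x speedup, "
          f"{efficiency:.0%} efficiency, {tokens_per_sec:,.0f} tokens/s")
```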
| GPUs | Time (s) | Speedup | Efficiency | Tokens/sec |
|---|---|---|---|---|
| 1 | 0.947 | 1.00× | 100% | 17,295 |
| 2 | 0.816 | 1.16× | 58% | 20,072 |
| 4 | 1.010 | 0.94× | 23% | 16,218 |
Using 2 GPUs improves runtime from 0.947s → 0.816s, yielding a 16% speedup.
This is a real but modest gain, and it reflects the fundamental trade-off of sequence parallelism:
- Each GPU computes on half the sequence, reducing local FLOPs.
- But attention still requires global K and V, forcing 2 AllGather operations.
- These AllGathers use the slower GPU → CPU → MPI → CPU → GPU staging path.
Even with the communication penalty, the reduction in compute is enough to produce a net speedup.
At 4 GPUs, the system shows a slowdown rather than improvement:
- Runtime increases from 0.816s → 1.010s
- Speedup drops below 1.0×
- Efficiency collapses to 23%
Why?
Because the communication volume grows faster than the compute reduction:
| GPUs | Tokens per GPU | AllGather Size | Communication/Compute Ratio |
|---|---|---|---|
| 1 | 4096 | N/A | Low |
| 2 | 2048 | Medium | Noticeable |
| 4 | 1024 | Large | Dominant |
At P=4, each GPU performs less and less local compute yet must exchange more and more data to assemble the global K and V tensors. Because these collectives occur twice per Transformer layer, the system becomes communication-bound.
The result: More GPUs → More communication → Worse performance.
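A toy scaling model makes the trend in the table concrete: per-rank compute shrinks as 1/p while per-rank AllGather traffic grows as (p-1)/p, so their ratio grows linearly with p (this is an illustrative model, not a fitted one):

```python
# Toy model of the communication/compute ratio under sequence parallelism.
def relative_ratio(p):
    compute = 1.0 / p        # each rank's share of attention FLOPs
    comm = (p - 1) / p       # fraction of global K/V each rank must receive
    return comm / compute    # simplifies to p - 1

for p in (1, 2, 4):
    print(f"p={p}: relative comm/compute ratio = {relative_ratio(p):.0f}")
```

Going from 2 to 4 GPUs triples this ratio, which matches the observed collapse from 58% to 23% efficiency.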
Because the MPI stack is not CUDA-aware, every collective follows:
GPU → CPU → MPI → CPU → GPU
This introduces:
- PCIe transfer latency
- Extra CPU memory copying
- Synchronization stalls across all ranks
Each AllGather incurs ~5–15 ms, and with 2 AllGathers per layer, total overhead overwhelms compute savings at higher GPU counts.
- Sequence parallelism works functionally and distributes compute correctly.
- Scaling is limited entirely by communication, not computation.
- P=2 offers the best trade-off; beyond that, CPU-staged collectives dominate the timeline.
- Increasing sequence length (4096) amplifies tensor size and thus communication cost.
- Tokens/sec correlates strongly with efficiency, dropping sharply at 4 GPUs.
These experiments reflect a common insight in distributed training research:
Parallelism is only beneficial if communication cost does not exceed compute savings.
Frameworks like Megatron-LM avoid this issue by using CUDA-aware MPI or NCCL collectives, where:
- GPU buffers are exchanged directly, without CPU involvement
- Transfers run at significantly higher bandwidth over NVLink or PCIe peer-to-peer
Our results show exactly why such optimizations are necessary.
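For contrast, here is a sketch of what that direct GPU-to-GPU path looks like with PyTorch's NCCL backend (the function name is illustrative; a real run would call `dist.init_process_group(backend="nccl")` under `torchrun`, with all tensors resident on the GPU):

```python
# Sketch of a direct GPU-to-GPU AllGather via torch.distributed.
# Illustrative name; with the NCCL backend, no CPU staging occurs.
import torch
import torch.distributed as dist

def gather_full_kv(x_local):
    """AllGather [B, S/p, H] shards into [B, S, H] without CPU staging."""
    world = dist.get_world_size()
    shards = [torch.empty_like(x_local) for _ in range(world)]
    dist.all_gather(shards, x_local)   # NCCL exchanges GPU memory directly
    return torch.cat(shards, dim=1)    # reassemble along the sequence axis
```

The call structure mirrors the CPU-staged version, but the collective operates on device memory throughout, removing both PCIe staging copies per AllGather.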
Implementing sequence parallelism made one thing clear: scaling transformer layers is not just a mathematical parallelization problem—it is a systems engineering problem. The experiments highlight how:
- Correctness can be preserved even under distributed partitioning
- Communication topology matters as much as algorithmic design
- Real clusters impose practical limits on theoretical parallelism
- Performance is shaped by the interaction between hardware, memory pathways, and collective communication libraries
Even though 4 GPUs were slower than 1, the project successfully demonstrates the core challenges behind modern distributed LLM training.
- CUDA Out of Memory: reduce `batch_size` or `seq_len`
- Job Timeout: increase `#SBATCH --time=02:00:00`
- Poor Scaling: expected with CPU-based MPI; communication overhead dominates at higher GPU counts
| Name | Institution |
|---|---|
| Harshita M.P | Case Western Reserve University |
| Art Zabarov | Case Western Reserve University |


