Modern Transformers struggle to scale as sequence lengths grow, with attention becoming one of the biggest computational bottlenecks. To explore how these limitations can be overcome, this project implements sequence parallelism for Transformer layers using mpi4py and GPU acceleration on the CWRU Pioneer HPC Cluster. Sequence parallelism distributes computation by splitting the input along the sequence (token) dimension, allowing each GPU to process a subset of tokens independently while still producing correct attention outputs.
In a standard Transformer, every token must attend to every other token, which tightly couples computations across the full sequence. To enable parallelism, this implementation:
- Partitions the input tensor across GPUs along the sequence axis
- Computes local Q/K/V projections independently on each GPU
- Uses MPI AllGather to assemble full K and V tensors on every GPU
- Computes attention locally, but only for each GPU’s assigned token slice
This preserves correctness while enabling distributed forward propagation across multiple devices.
Due to the lack of CUDA-aware support in the MPI configuration, communication operations must use a slower CPU-staging pipeline:
GPU → CPU → MPI → CPU → GPU
This adds significant latency (≈5–15 ms per AllGather), making communication the primary bottleneck at higher GPU counts.
Despite this limitation, the project demonstrates:
- How to construct a distributed Transformer layer from scratch
- How sequence parallelism behaves across 1, 2, and 4 GPUs
- How communication overhead impacts scaling
- Practical trade-offs in parallelizing attention mechanisms
The results provide meaningful insight into the challenges and design considerations behind parallel deep learning systems, even when ideal speedup is not achieved.
```bash
# Setup
bash scripts/setup_env.sh

# Run experiments (p=1,2,4 on the same GPUs)
bash scripts/submit_all_gpu.sh

# Monitor
tail -f logs/sp_all_experiments_<JOBID>.out
```

This project uses three scripts to automate everything on the node. `scripts/submit_all_gpu.sh` is the one-command entry point: it creates `logs/` and `experiments/results/`, submits the batch job via `sbatch scripts/run_all_experiments.sh`, and prints how to monitor the job and where results will be saved. `scripts/run_all_experiments.sh` is the Slurm batch script: it requests one node with 4 GPUs, sets up the environment, then runs the strong-scaling experiment with sequence parallelism for P=1, 2, and 4 using `srun`, saving each run's metrics to `experiments/results/strong_scaling_sp_gpu_p<P>.json` and printing a small summary at the end. `scripts/setup_env.sh` is the environment bootstrapper: it loads the appropriate OpenMPI and CUDA modules, installs `torch`, `mpi4py`, `numpy`, and `cupy` if needed, and runs small sanity checks to confirm that MPI, CUDA, and PyTorch are working correctly before any experiments start.
Sequence Partitioning:
- Input: `[Batch, Seq_len, Hidden]`, split along the sequence dimension
- Each GPU gets: `[B, S/p, H]`

Attention Layer:
- Q/K/V projections: local (no communication)
- AllGather K and V: `[B, S/p, H]` → `[B, S, H]` on all ranks
- Attention: local Q against the full K/V
- Output: `[B, S/p, H]` per rank

Feed-Forward: no communication (position-independent)

Communication: 2 AllGathers per layer (K and V in attention)

CPU-based MPI: GPU → CPU → MPI → CPU → GPU (adds ~5–15 ms of overhead per AllGather)
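The cost of those two AllGathers can be estimated with back-of-envelope arithmetic from the experiment configuration (fp32 assumed; this is illustrative arithmetic, not measured traffic):

```python
# Per-layer AllGather volume, using the experiment configuration (fp32).
B, S, H, bytes_per_elt = 4, 4096, 5120, 4

def allgather_bytes_received(p):
    """Bytes each rank must receive to assemble one full [B, S, H] tensor."""
    return B * S * H * bytes_per_elt * (p - 1) // p

for p in (2, 4):
    per_layer = 2 * allgather_bytes_received(p)   # K and V, once per layer
    print(f"p={p}: ~{per_layer / 1e6:.0f} MB received per rank per layer")
```

A full `[B, S, H]` tensor at this configuration is about 335 MB, so each rank pulls hundreds of megabytes per layer over the CPU-staged path, and the received volume grows with `(p-1)/p` as more GPUs are added.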
Edit `experiments/experiment_configs.py`:

```python
hidden_dim = 5120       # Model size
seq_len = 4096          # Sequence length (significantly increased)
batch_size = 4          # Batch size (reduced to fit longer sequences)
num_iterations = 10     # Timing iterations
```

All experiments were run using the following configuration:
- Hidden dimension: 5120
- Sequence length: 4096
- Batch size: 4
- Iterations: 10 timed forward passes
- Parallelism: Sequence parallelism across 1, 2, and 4 GPUs
- Communication backend: MPI4Py using CPU-staged collectives
Each experiment measures:
- End-to-end forward pass time
- Speedup relative to 1 GPU
- Parallel efficiency
- Throughput (tokens processed per second)
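The derived metrics follow directly from the measured forward-pass times. A small sketch using the times from the results table (tiny discrepancies with the table come from the times being rounded to three digits):

```python
# Deriving speedup, efficiency, and throughput from measured times.
B, S = 4, 4096                            # tokens per forward pass = B * S
times = {1: 0.947, 2: 0.816, 4: 1.010}    # seconds per forward pass
base = times[1]

for p, t in times.items():
    speedup = base / t                    # relative to the 1-GPU run
    efficiency = speedup / p              # how well extra GPUs are used
    tokens_per_sec = B * S / t
    print(f"p={p}: {speedup:.2f}x speedup, "
          f"{efficiency:.0%} efficiency, {tokens_per_sec:,.0f} tokens/s")
```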
| GPUs | Time (s) | Speedup | Efficiency | Tokens/sec |
|---|---|---|---|---|
| 1 | 0.947 | 1.00× | 100% | 17,295 |
| 2 | 0.816 | 1.16× | 58% | 20,072 |
| 4 | 1.010 | 0.94× | 23% | 16,218 |
Using 2 GPUs improves runtime from 0.947s → 0.816s, yielding a 16% speedup.
This is a real but modest gain, and it reflects the fundamental trade-off of sequence parallelism:
- Each GPU computes on half the sequence, reducing local FLOPs.
- But attention still requires global K and V, forcing 2 AllGather operations.
- These AllGathers use the slower GPU → CPU → MPI → CPU → GPU staging path.
Even with the communication penalty, the reduction in compute is enough to produce a net speedup.
At 4 GPUs, the system shows a slowdown rather than improvement:
- Runtime increases from 0.816s → 1.010s
- Speedup drops below 1.0×
- Efficiency collapses to 23%
Why?
Because the communication volume grows faster than the compute reduction:
| GPUs | Tokens per GPU | AllGather Size | Communication/Compute Ratio |
|---|---|---|---|
| 1 | 4096 | N/A | Low |
| 2 | 2048 | Medium | Noticeable |
| 4 | 1024 | Large | Dominant |
At P=4, each GPU performs less and less local compute yet must exchange more and more data to assemble the global K and V tensors. Because these collectives occur twice per Transformer layer, the system becomes communication-bound.
The result: More GPUs → More communication → Worse performance.
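A toy scaling model makes the trend in the table concrete: per-rank compute shrinks as 1/p while per-rank AllGather traffic grows as (p-1)/p, so their ratio grows linearly with p (this is an illustrative model, not a fitted one):

```python
# Toy model of the communication/compute ratio under sequence parallelism.
def relative_ratio(p):
    compute = 1.0 / p        # each rank's share of attention FLOPs
    comm = (p - 1) / p       # fraction of global K/V each rank must receive
    return comm / compute    # simplifies to p - 1

for p in (1, 2, 4):
    print(f"p={p}: relative comm/compute ratio = {relative_ratio(p):.0f}")
```

Going from 2 to 4 GPUs triples this ratio, which matches the observed collapse from 58% to 23% efficiency.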
Because the MPI stack is not CUDA-aware, every collective follows:
GPU → CPU → MPI → CPU → GPU
This introduces:
- PCIe transfer latency
- Extra CPU memory copying
- Synchronization stalls across all ranks
Each AllGather incurs ~5–15 ms, and with 2 AllGathers per layer, total overhead overwhelms compute savings at higher GPU counts.
- Sequence parallelism works functionally and distributes compute correctly.
- Scaling is limited entirely by communication, not computation.
- P=2 offers the best trade-off; beyond that, CPU-staged collectives dominate the timeline.
- Increasing sequence length (4096) amplifies tensor size and thus communication cost.
- Tokens/sec correlates strongly with efficiency, dropping sharply at 4 GPUs.
These experiments reflect a common insight in distributed training research:
Parallelism is only beneficial if communication cost does not exceed compute savings.
Frameworks like Megatron-LM avoid this issue by using CUDA-aware MPI or NCCL collectives, where:
- GPU buffers are exchanged directly, without CPU involvement
- Transfers run at significantly higher bandwidth over NVLink or PCIe peer-to-peer
Our results show exactly why such optimizations are necessary.
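For contrast, here is a sketch of what that direct GPU-to-GPU path looks like with PyTorch's NCCL backend (the function name is illustrative; a real run would call `dist.init_process_group(backend="nccl")` under `torchrun`, with all tensors resident on the GPU):

```python
# Sketch of a direct GPU-to-GPU AllGather via torch.distributed.
# Illustrative name; with the NCCL backend, no CPU staging occurs.
import torch
import torch.distributed as dist

def gather_full_kv(x_local):
    """AllGather [B, S/p, H] shards into [B, S, H] without CPU staging."""
    world = dist.get_world_size()
    shards = [torch.empty_like(x_local) for _ in range(world)]
    dist.all_gather(shards, x_local)   # NCCL exchanges GPU memory directly
    return torch.cat(shards, dim=1)    # reassemble along the sequence axis
```

The call structure mirrors the CPU-staged version, but the collective operates on device memory throughout, removing both PCIe staging copies per AllGather.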
Implementing sequence parallelism made one thing clear: scaling transformer layers is not just a mathematical parallelization problem—it is a systems engineering problem. The experiments highlight how:
- Correctness can be preserved even under distributed partitioning
- Communication topology matters as much as algorithmic design
- Real clusters impose practical limits on theoretical parallelism
- Performance is shaped by the interaction between hardware, memory pathways, and collective communication libraries
Even though 4 GPUs were slower than 1, the project successfully demonstrates the core challenges behind modern distributed LLM training.
- CUDA Out of Memory: reduce `batch_size` or `seq_len`
- Job Timeout: increase `#SBATCH --time=02:00:00`
- Poor Scaling: expected with CPU-based MPI; communication overhead dominates at higher GPU counts
| Name | Institution |
|---|---|
| Harshita M.P | Case Western Reserve University |
| Art Zabarov | Case Western Reserve University |


