A production-quality terminal-based profiling tool that diagnoses GPU performance bottlenecks in PyTorch models. This tool doesn't optimize your code—it tells you why it's slow and what to fix.
- Roofline Analysis: Determines if your workload is memory-bound or compute-bound
- Kernel Launch Diagnosis: Detects excessive kernel overhead and launch patterns
- Synchronization Detection: Identifies CPU-GPU sync stalls
- Data Pipeline Analysis: Measures HtoD/DtoH transfer overhead
- Tensor Core Utilization: Estimates tensor core usage and precision issues
- Memory Hierarchy: Analyzes cache efficiency (best effort)
- Actionable Recommendations: Clear, prioritized suggestions for each bottleneck
```bash
pip install torch torchvision  # CUDA version required
```

```text
profiler/
├── main.py      # CLI entry point
├── collect.py   # torch.profiler data collection
├── metrics.py   # Derived metrics calculation
├── roofline.py  # Roofline model analysis
├── diagnose.py  # Rule-based diagnosis engine
└── report.py    # Terminal report formatting
```
```bash
# Profile a ResNet50
python main.py --model resnet50 --batch-size 32 --steps 50

# Profile a Transformer
python main.py --model transformer --batch-size 16 --steps 100

# Profile a simple CNN
python main.py --model simple_cnn --batch-size 64 --steps 50
```

```python
from profiler.main import profile_model
import torch
import torch.nn as nn

# Your model
model = nn.Sequential(...)

# Input generator function
def input_gen():
    return torch.randn(32, 3, 224, 224).cuda()

# Profile and diagnose
results = profile_model(model, input_gen, steps=50)

# Access results
bottlenecks = results['diagnosis']['bottlenecks']
primary = results['diagnosis']['primary_bottleneck']
```

```text
╔════════════════════════════════════════════════════════════════════════════╗
║ GPU PERFORMANCE DIAGNOSIS REPORT ║
╚════════════════════════════════════════════════════════════════════════════╝
📊 EXECUTIVE SUMMARY
────────────────────────────────────────────────────────────────────────────
Primary Bottleneck: CPU-Bound Operations Blocking GPU
Severity: CRITICAL
Category: CPU_RESIDENCY
GPU Utilization: 45.3%
Step Time: 28.43 ms
Compute Time: 10.38 ms
🏔️ ROOFLINE ANALYSIS
────────────────────────────────────────────────────────────────────────────
Bottleneck Type: MEMORY-BOUND ⚠️
└─ Arithmetic Intensity: 0.842 FLOPs/Byte
└─ Ridge Point: 12.400 FLOPs/Byte
└─ Status: Below ridge point → memory bandwidth limited
Performance vs. Roofline:
• Achieved: 73.2% of attainable
• Distance from roof: 26.8%
• Severity: MODERATE
Hardware Utilization:
• Memory Bandwidth: 78.4% of 900 GB/s
• Compute Throughput: 23.1% of 19.5 TFLOPS
• Headroom Available: 21.6%
🔍 DETECTED BOTTLENECKS
────────────────────────────────────────────────────────────────────────────
1. 🔴 CPU-Bound Operations Blocking GPU [CRITICAL]
Category: CPU_RESIDENCY
Evidence:
• CPU-heavy ops time: 15.2 ms/step
• CPU overhead: 53.5% of step time
• Top offender: torchvision.transforms (12.1 ms/step)
Explanation:
Your training loop is CPU-bound. Operations are being executed on CPU
instead of GPU, consuming 53.5% of step time. This often happens when
tensors stay on CPU too long or preprocessing happens on CPU instead
of GPU.
2. 🟠 Excessive CPU-GPU Data Movement [SEVERE]
Category: CPU_RESIDENCY
Evidence:
• CPU→GPU transfers per step: 247
• GPU→CPU transfers per step: 89
• Transfer overhead: 18.3%
Explanation:
Detected 336 CPU-GPU data transfers per step. This suggests data is
being moved back and forth between devices repeatedly.
💡 TOP RECOMMENDATIONS
────────────────────────────────────────────────────────────────────────────
Immediate Actions (Critical/Severe Issues):
For CPU-Bound Operations Blocking GPU:
• Move tensor operations to GPU immediately: data = data.cuda() at data loading
• Use GPU-based augmentation (kornia, DALI) instead of CPU (torchvision.transforms)
• Check for tensor.numpy() calls forcing CPU execution
• Profile with 'with_stack=True' to see exact callsites
For Excessive CPU-GPU Data Movement:
• Move all tensors to GPU once at the start
• Avoid .item(), .numpy(), .cpu() in training loop
• Use persistent_workers=True in DataLoader
• Use torch.cuda.stream() to overlap transfers with compute
```
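The recommendations above translate to only a few lines of code. A minimal sketch of a fixed loop, assuming placeholder `dataset`, `model`, `criterion`, and `optimizer` objects:

```python
import torch
from torch.utils.data import DataLoader

# Pinned host memory + non_blocking copies let HtoD transfers overlap with
# compute; accumulating the loss on-GPU removes the per-step .item() sync.
loader = DataLoader(dataset, batch_size=32, num_workers=4,
                    pin_memory=True, persistent_workers=True)

running_loss = torch.zeros((), device='cuda')
for x, y in loader:
    x = x.cuda(non_blocking=True)   # async copy from pinned memory
    y = y.cuda(non_blocking=True)
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    running_loss += loss.detach()   # stays on GPU; no sync here

print(f"avg loss: {running_loss.item() / len(loader):.4f}")  # one sync, after the loop
```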
The tool detects and diagnoses:
- **Roofline Bottlenecks**
  - Memory-bound vs. compute-bound classification
  - Bandwidth/compute efficiency
  - Distance from theoretical peak
- **Kernel Launch Overhead** (see the sketch after this list)
  - Too many small kernels (<20 µs)
  - Poor launch amortization
  - Excessive kernel count
- **CPU-GPU Synchronization**
  - `.item()` / `.cpu()` calls in loops
  - Explicit `synchronize()` calls
  - GPU idle gaps
- **CPU Residency Bottlenecks** ⭐ NEW
  - Data kept on CPU too long
  - Late GPU migration patterns
  - CPU-heavy operations blocking GPU
  - Excessive CPU↔GPU transfers
  - CPU-based preprocessing bottleneck
- **Data Pipeline Issues**
  - HtoD copy overhead
  - Poor compute-copy overlap
  - DataLoader inefficiency
- **Low GPU Occupancy**
  - Insufficient parallelism
  - Small batch sizes
  - Underutilization
- **Memory Hierarchy** (Best Effort)
  - Cache inefficiency
  - Uncoalesced accesses
  - Poor memory patterns
- **Tensor Core Underutilization**
  - FP32-heavy workloads
  - Misaligned tensor shapes
  - Missing mixed precision
- **Compiler/Graph Issues**
  - High kernel diversity
  - Lack of fusion
  - Graph fragmentation
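The small-kernel rule, for example, reduces to a threshold check over profiler events. A minimal sketch, assuming a completed `torch.profiler` run (the 50% trigger ratio is illustrative, not the tool's actual cutoff):

```python
def has_launch_overhead(prof, small_us: float = 20.0) -> bool:
    """Flag a profile where most GPU-side ops average under `small_us` µs."""
    kernels = [e for e in prof.key_averages() if e.self_cuda_time_total > 0]
    if not kernels:
        return False
    small = sum(1 for e in kernels
                if e.self_cuda_time_total / max(e.count, 1) < small_us)
    return small / len(kernels) > 0.5  # illustrative trigger ratio
```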
- Timing: Step time, GPU compute time, idle time
- Kernel Stats: Count, duration, launch overhead
- Memory: HtoD/DtoH transfers, bandwidth achieved
- Compute: FLOPs, achieved throughput, arithmetic intensity
- Precision: FP16/FP32 usage, tensor core utilization
- CPU Residency: CPU-heavy ops, device transfer patterns, preprocessing time
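Most of these map directly onto `torch.profiler` aggregates. A minimal sketch of the extraction, assuming a placeholder `model` and `inputs` (`self_cuda_time_total` and `flops` are fields on the objects returned by `key_averages()`):

```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True, profile_memory=True, with_flops=True) as prof:
    model(inputs)  # placeholder workload

events = prof.key_averages()
gpu_time_ms = sum(e.self_cuda_time_total for e in events) / 1e3  # µs → ms
total_flops = sum(e.flops for e in events if e.flops)
gpu_ops = sum(e.count for e in events if e.self_cuda_time_total > 0)
print(f"{gpu_time_ms:.2f} ms GPU time, {gpu_ops} GPU-side ops, {total_flops:,.0f} FLOPs")
```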
The tool specifically looks for four patterns of keeping data on CPU:
**Pattern 1:** Detects operations with high CPU time but minimal GPU time (CPU/GPU ratio > 10:1).

Example caught:

```python
# This creates a CPU bottleneck
data = data.numpy()                      # Forces CPU execution
result = custom_function(data)           # CPU computation
data = torch.from_numpy(result).cuda()   # Move back
```

**Pattern 2:** Counts `.cuda()`, `.cpu()`, and `.to(device)` calls per step (see the counting sketch after the patterns).
Example caught:

```python
# Moving data back and forth
for batch in dataloader:
    x = batch['data'].cuda()      # Transfer 1
    if x.max().item() > 0:        # .item() forces a CPU sync
        x = x.cpu()               # Transfer 2
        x = preprocess(x)         # CPU work
        x = x.cuda()              # Transfer 3
```

**Pattern 3:** Identifies CPU augmentation/transforms during training.
Example caught:

```python
# torchvision transforms run on CPU
transform = transforms.Compose([
    transforms.RandomCrop(224),   # CPU
    transforms.ColorJitter(),     # CPU
])
```

**Pattern 4:** Detects tensors created on CPU and then moved (vs. created on GPU directly).
Example caught:

```python
# Creating on CPU first
noise = torch.randn(x.shape).cuda()           # CPU alloc + transfer

# Should be:
noise = torch.randn(x.shape, device='cuda')   # Direct GPU alloc
```

See the CPU Residency Guide for detailed examples and fixes.
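Pattern 2's transfer counting can be approximated from profiler event names: CUDA memcpys show up with "Memcpy HtoD" / "Memcpy DtoH" in their keys (the exact strings vary across PyTorch/CUDA versions, so treat the substring match as an assumption):

```python
def transfers_per_step(prof, steps: int) -> tuple[float, float]:
    """Average host↔device copies per profiled step, from event names."""
    htod = sum(e.count for e in prof.key_averages() if "Memcpy HtoD" in e.key)
    dtoh = sum(e.count for e in prof.key_averages() if "Memcpy DtoH" in e.key)
    return htod / steps, dtoh / steps
```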
- Tensor Core Utilization: Estimated from operation types (best effort)
- Cache Metrics: Requires Nsight Compute for hardware counters
- Bytes Transferred: Approximated from memory operations
- Precision Detection: Inferred from kernel names
For production-level profiling, combine this tool with:
- NVIDIA Nsight Systems (timeline visualization)
- NVIDIA Nsight Compute (kernel-level analysis)
- torch.profiler with TensorBoard (detailed traces)
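For the last of these, exporting a trace takes a few lines with the standard API (`./log` is an arbitrary output directory; `model` and `inputs` are placeholders):

```python
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

# Export a trace for timeline inspection in TensorBoard.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             on_trace_ready=tensorboard_trace_handler("./log"),
             with_stack=True) as prof:
    model(inputs)  # placeholder workload
# View with: tensorboard --logdir ./log  (requires torch-tb-profiler)
```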
- Diagnosis over Metrics: Focuses on explaining bottlenecks, not dumping data
- Rule-Based Engine: Clear, explainable diagnostic rules
- Severity Classification: Prioritizes critical issues
- Actionable Output: Every diagnosis includes fix suggestions
- Uses `torch.profiler` for minimal overhead
- Captures both CPU and CUDA events
- Records shapes, memory, and FLOPs
- Warmup steps excluded from analysis
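A minimal sketch of that collection loop, using the standard `schedule` helper to drop warmup steps (the step counts and `train_step` are placeholders):

```python
from torch.profiler import profile, schedule, ProfilerActivity

sched = schedule(wait=1, warmup=5, active=50)  # warmup steps are discarded
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             schedule=sched, record_shapes=True,
             profile_memory=True, with_flops=True) as prof:
    for _ in range(56):   # wait + warmup + active
        train_step()      # placeholder: one forward/backward/optimizer step
        prof.step()       # advance the profiling schedule
```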
Implements the roofline performance model:
- Ridge Point = Peak FLOPS / Peak Bandwidth
- Memory-bound: AI < Ridge Point
- Compute-bound: AI ≥ Ridge Point
- Calculates distance from theoretical maximum
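In code, the classification is a single comparison against the ridge point. A minimal sketch (the peak numbers below are illustrative hardware specs, not measured values):

```python
def roofline(ai: float, peak_flops: float, peak_bw: float):
    """Classify arithmetic intensity `ai` (FLOPs/byte) against the roofline."""
    ridge = peak_flops / peak_bw                 # FLOPs/byte
    attainable = min(peak_flops, ai * peak_bw)   # FLOPs/s at this AI
    kind = "memory-bound" if ai < ridge else "compute-bound"
    return kind, ridge, attainable

kind, ridge, attainable = roofline(0.842, 19.5e12, 900e9)
print(f"{kind}: ridge {ridge:.1f} FLOPs/B, attainable {attainable / 1e12:.2f} TFLOPS")
```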
This tool is designed for ML systems engineers. When adding new diagnosis rules:
- Add clear evidence (metrics)
- Explain why it matters
- Provide actionable suggestions
- Classify severity appropriately
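A new rule might look like the following (a hypothetical shape; the field names are illustrative, not the tool's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Diagnosis:
    name: str
    category: str
    severity: str            # e.g. "CRITICAL", "SEVERE", "MODERATE"
    evidence: list[str]      # the metrics that triggered the rule
    explanation: str         # why it matters
    suggestions: list[str]   # actionable fixes

def check_low_occupancy(metrics: dict) -> Diagnosis | None:
    """Illustrative rule: fire when measured GPU utilization is under 50%."""
    util = metrics.get("gpu_util", 1.0)
    if util < 0.5:
        return Diagnosis(
            name="Low GPU Occupancy",
            category="OCCUPANCY",
            severity="SEVERE",
            evidence=[f"GPU utilization: {util:.1%}"],
            explanation="The GPU sits idle more than half of each step.",
            suggestions=["Increase batch size",
                         "Overlap data loading with compute"],
        )
    return None
```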
This is production-quality diagnostic code. Use it to understand your models better.