A production-quality terminal-based profiling tool that diagnoses GPU performance bottlenecks in PyTorch models. This tool doesn't optimize your code—it tells you why it's slow and what to fix.
- Roofline Analysis: Determines if your workload is memory-bound or compute-bound
- Kernel Launch Diagnosis: Detects excessive kernel overhead and launch patterns
- Synchronization Detection: Identifies CPU-GPU sync stalls
- Data Pipeline Analysis: Measures HtoD/DtoH transfer overhead
- Tensor Core Utilization: Estimates tensor core usage and precision issues
- Memory Hierarchy: Analyzes cache efficiency (best effort)
- Actionable Recommendations: Clear, prioritized suggestions for each bottleneck
```bash
pip install torch torchvision  # CUDA version required
```

```text
profiler/
├── main.py      # CLI entry point
├── collect.py   # torch.profiler data collection
├── metrics.py   # Derived metrics calculation
├── roofline.py  # Roofline model analysis
├── diagnose.py  # Rule-based diagnosis engine
└── report.py    # Terminal report formatting
```
```bash
# Profile a ResNet50
python main.py --model resnet50 --batch-size 32 --steps 50

# Profile a Transformer
python main.py --model transformer --batch-size 16 --steps 100

# Profile a simple CNN
python main.py --model simple_cnn --batch-size 64 --steps 50
```

```python
from profiler.main import profile_model
import torch
import torch.nn as nn

# Your model
model = nn.Sequential(...)

# Input generator function
def input_gen():
    return torch.randn(32, 3, 224, 224).cuda()

# Profile and diagnose
results = profile_model(model, input_gen, steps=50)

# Access results
bottlenecks = results['diagnosis']['bottlenecks']
primary = results['diagnosis']['primary_bottleneck']
```

```text
╔════════════════════════════════════════════════════════════════════════════╗
║ GPU PERFORMANCE DIAGNOSIS REPORT ║
╚════════════════════════════════════════════════════════════════════════════╝
📊 EXECUTIVE SUMMARY
────────────────────────────────────────────────────────────────────────────
Primary Bottleneck: CPU-Bound Operations Blocking GPU
Severity: CRITICAL
Category: CPU_RESIDENCY
GPU Utilization: 45.3%
Step Time: 28.43 ms
Compute Time: 10.38 ms
🏔️ ROOFLINE ANALYSIS
────────────────────────────────────────────────────────────────────────────
Bottleneck Type: MEMORY-BOUND ⚠️
└─ Arithmetic Intensity: 0.842 FLOPs/Byte
└─ Ridge Point: 12.400 FLOPs/Byte
└─ Status: Below ridge point → memory bandwidth limited
Performance vs. Roofline:
• Achieved: 73.2% of attainable
• Distance from roof: 26.8%
• Severity: MODERATE
Hardware Utilization:
• Memory Bandwidth: 78.4% of 900 GB/s
• Compute Throughput: 23.1% of 19.5 TFLOPS
• Headroom Available: 21.6%
🔍 DETECTED BOTTLENECKS
────────────────────────────────────────────────────────────────────────────
1. 🔴 CPU-Bound Operations Blocking GPU [CRITICAL]
Category: CPU_RESIDENCY
Evidence:
• CPU-heavy ops time: 15.2 ms/step
• CPU overhead: 53.5% of step time
• Top offender: torchvision.transforms (12.1 ms/step)
Explanation:
Your training loop is CPU-bound. Operations are being executed on CPU
instead of GPU, consuming 53.5% of step time. This often happens when
tensors stay on CPU too long or preprocessing happens on CPU instead
of GPU.
2. 🟠 Excessive CPU-GPU Data Movement [SEVERE]
Category: CPU_RESIDENCY
Evidence:
• CPU→GPU transfers per step: 247
• GPU→CPU transfers per step: 89
• Transfer overhead: 18.3%
Explanation:
Detected 336 CPU-GPU data transfers per step. This suggests data is
being moved back and forth between devices repeatedly.
💡 TOP RECOMMENDATIONS
────────────────────────────────────────────────────────────────────────────
Immediate Actions (Critical/Severe Issues):
For CPU-Bound Operations Blocking GPU:
• Move tensor operations to GPU immediately: data = data.cuda() at data loading
• Use GPU-based augmentation (kornia, DALI) instead of CPU (torchvision.transforms)
• Check for tensor.numpy() calls forcing CPU execution
• Profile with 'with_stack=True' to see exact callsites
For Excessive CPU-GPU Data Movement:
• Move all tensors to GPU once at the start
• Avoid .item(), .numpy(), .cpu() in training loop
• Use persistent_workers=True in DataLoader
• Use torch.cuda.stream() to overlap transfers with compute
```
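The recommendations above translate to only a few lines of code. A minimal sketch of a fixed loop, assuming placeholder `dataset`, `model`, `criterion`, and `optimizer` objects:

```python
import torch
from torch.utils.data import DataLoader

# Pinned host memory + non_blocking copies let HtoD transfers overlap with
# compute; accumulating the loss on-GPU removes the per-step .item() sync.
loader = DataLoader(dataset, batch_size=32, num_workers=4,
                    pin_memory=True, persistent_workers=True)

running_loss = torch.zeros((), device='cuda')
for x, y in loader:
    x = x.cuda(non_blocking=True)   # async copy from pinned memory
    y = y.cuda(non_blocking=True)
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    running_loss += loss.detach()   # stays on GPU; no sync here

print(f"avg loss: {running_loss.item() / len(loader):.4f}")  # one sync, after the loop
```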
The tool detects and diagnoses:
- **Roofline Bottlenecks**
  - Memory-bound vs. compute-bound classification
  - Bandwidth/compute efficiency
  - Distance from theoretical peak
- **Kernel Launch Overhead** (see the sketch after this list)
  - Too many small kernels (<20 µs)
  - Poor launch amortization
  - Excessive kernel count
- **CPU-GPU Synchronization**
  - `.item()` / `.cpu()` calls in loops
  - Explicit `synchronize()` calls
  - GPU idle gaps
- **CPU Residency Bottlenecks** ⭐ NEW
  - Data kept on CPU too long
  - Late GPU migration patterns
  - CPU-heavy operations blocking GPU
  - Excessive CPU↔GPU transfers
  - CPU-based preprocessing bottleneck
- **Data Pipeline Issues**
  - HtoD copy overhead
  - Poor compute-copy overlap
  - DataLoader inefficiency
- **Low GPU Occupancy**
  - Insufficient parallelism
  - Small batch sizes
  - Underutilization
- **Memory Hierarchy** (Best Effort)
  - Cache inefficiency
  - Uncoalesced accesses
  - Poor memory patterns
- **Tensor Core Underutilization**
  - FP32-heavy workloads
  - Misaligned tensor shapes
  - Missing mixed precision
- **Compiler/Graph Issues**
  - High kernel diversity
  - Lack of fusion
  - Graph fragmentation
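The small-kernel rule, for example, reduces to a threshold check over profiler events. A minimal sketch, assuming a completed `torch.profiler` run (the 50% trigger ratio is illustrative, not the tool's actual cutoff):

```python
def has_launch_overhead(prof, small_us: float = 20.0) -> bool:
    """Flag a profile where most GPU-side ops average under `small_us` µs."""
    kernels = [e for e in prof.key_averages() if e.self_cuda_time_total > 0]
    if not kernels:
        return False
    small = sum(1 for e in kernels
                if e.self_cuda_time_total / max(e.count, 1) < small_us)
    return small / len(kernels) > 0.5  # illustrative trigger ratio
```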
- Timing: Step time, GPU compute time, idle time
- Kernel Stats: Count, duration, launch overhead
- Memory: HtoD/DtoH transfers, bandwidth achieved
- Compute: FLOPs, achieved throughput, arithmetic intensity
- Precision: FP16/FP32 usage, tensor core utilization
- CPU Residency: CPU-heavy ops, device transfer patterns, preprocessing time
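Most of these map directly onto `torch.profiler` aggregates. A minimal sketch of the extraction, assuming a placeholder `model` and `inputs` (`self_cuda_time_total` and `flops` are fields on the objects returned by `key_averages()`):

```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True, profile_memory=True, with_flops=True) as prof:
    model(inputs)  # placeholder workload

events = prof.key_averages()
gpu_time_ms = sum(e.self_cuda_time_total for e in events) / 1e3  # µs → ms
total_flops = sum(e.flops for e in events if e.flops)
gpu_ops = sum(e.count for e in events if e.self_cuda_time_total > 0)
print(f"{gpu_time_ms:.2f} ms GPU time, {gpu_ops} GPU-side ops, {total_flops:,.0f} FLOPs")
```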
The tool specifically looks for four patterns of keeping data on CPU:
**Pattern 1:** Detects operations with high CPU time but minimal GPU time (CPU/GPU ratio > 10:1).

Example caught:

```python
# This creates a CPU bottleneck
data = data.numpy()                      # Forces CPU execution
result = custom_function(data)           # CPU computation
data = torch.from_numpy(result).cuda()   # Move back
```

**Pattern 2:** Counts `.cuda()`, `.cpu()`, and `.to(device)` calls per step (see the counting sketch after the patterns).
Example caught:

```python
# Moving data back and forth
for batch in dataloader:
    x = batch['data'].cuda()      # Transfer 1
    if x.max().item() > 0:        # .item() forces a CPU sync
        x = x.cpu()               # Transfer 2
        x = preprocess(x)         # CPU work
        x = x.cuda()              # Transfer 3
```

**Pattern 3:** Identifies CPU augmentation/transforms during training.
Example caught:

```python
# torchvision transforms run on CPU
transform = transforms.Compose([
    transforms.RandomCrop(224),   # CPU
    transforms.ColorJitter(),     # CPU
])
```

**Pattern 4:** Detects tensors created on CPU and then moved (vs. created on GPU directly).
Example caught:

```python
# Creating on CPU first
noise = torch.randn(x.shape).cuda()           # CPU alloc + transfer

# Should be:
noise = torch.randn(x.shape, device='cuda')   # Direct GPU alloc
```

See the CPU Residency Guide for detailed examples and fixes.
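Pattern 2's transfer counting can be approximated from profiler event names: CUDA memcpys show up with "Memcpy HtoD" / "Memcpy DtoH" in their keys (the exact strings vary across PyTorch/CUDA versions, so treat the substring match as an assumption):

```python
def transfers_per_step(prof, steps: int) -> tuple[float, float]:
    """Average host↔device copies per profiled step, from event names."""
    htod = sum(e.count for e in prof.key_averages() if "Memcpy HtoD" in e.key)
    dtoh = sum(e.count for e in prof.key_averages() if "Memcpy DtoH" in e.key)
    return htod / steps, dtoh / steps
```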
- Tensor Core Utilization: Estimated from operation types (best effort)
- Cache Metrics: Requires Nsight Compute for hardware counters
- Bytes Transferred: Approximated from memory operations
- Precision Detection: Inferred from kernel names
For production-level profiling, combine this tool with:
- NVIDIA Nsight Systems (timeline visualization)
- NVIDIA Nsight Compute (kernel-level analysis)
- torch.profiler with TensorBoard (detailed traces)
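For the last of these, exporting a trace takes a few lines with the standard API (`./log` is an arbitrary output directory; `model` and `inputs` are placeholders):

```python
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

# Export a trace for timeline inspection in TensorBoard.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             on_trace_ready=tensorboard_trace_handler("./log"),
             with_stack=True) as prof:
    model(inputs)  # placeholder workload
# View with: tensorboard --logdir ./log  (requires torch-tb-profiler)
```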
- Diagnosis over Metrics: Focuses on explaining bottlenecks, not dumping data
- Rule-Based Engine: Clear, explainable diagnostic rules
- Severity Classification: Prioritizes critical issues
- Actionable Output: Every diagnosis includes fix suggestions
- Uses `torch.profiler` for minimal overhead
- Captures both CPU and CUDA events
- Records shapes, memory, and FLOPs
- Warmup steps excluded from analysis
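A minimal sketch of that collection loop, using the standard `schedule` helper to drop warmup steps (the step counts and `train_step` are placeholders):

```python
from torch.profiler import profile, schedule, ProfilerActivity

sched = schedule(wait=1, warmup=5, active=50)  # warmup steps are discarded
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             schedule=sched, record_shapes=True,
             profile_memory=True, with_flops=True) as prof:
    for _ in range(56):   # wait + warmup + active
        train_step()      # placeholder: one forward/backward/optimizer step
        prof.step()       # advance the profiling schedule
```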
Implements the roofline performance model:
- Ridge Point = Peak FLOPS / Peak Bandwidth
- Memory-bound: AI < Ridge Point
- Compute-bound: AI ≥ Ridge Point
- Calculates distance from theoretical maximum
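In code, the classification is a single comparison against the ridge point. A minimal sketch (the peak numbers below are illustrative hardware specs, not measured values):

```python
def roofline(ai: float, peak_flops: float, peak_bw: float):
    """Classify arithmetic intensity `ai` (FLOPs/byte) against the roofline."""
    ridge = peak_flops / peak_bw                 # FLOPs/byte
    attainable = min(peak_flops, ai * peak_bw)   # FLOPs/s at this AI
    kind = "memory-bound" if ai < ridge else "compute-bound"
    return kind, ridge, attainable

kind, ridge, attainable = roofline(0.842, 19.5e12, 900e9)
print(f"{kind}: ridge {ridge:.1f} FLOPs/B, attainable {attainable / 1e12:.2f} TFLOPS")
```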
This tool is designed for ML systems engineers. When adding new diagnosis rules:
- Add clear evidence (metrics)
- Explain why it matters
- Provide actionable suggestions
- Classify severity appropriately
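A new rule might look like the following (a hypothetical shape; the field names are illustrative, not the tool's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Diagnosis:
    name: str
    category: str
    severity: str            # e.g. "CRITICAL", "SEVERE", "MODERATE"
    evidence: list[str]      # the metrics that triggered the rule
    explanation: str         # why it matters
    suggestions: list[str]   # actionable fixes

def check_low_occupancy(metrics: dict) -> Diagnosis | None:
    """Illustrative rule: fire when measured GPU utilization is under 50%."""
    util = metrics.get("gpu_util", 1.0)
    if util < 0.5:
        return Diagnosis(
            name="Low GPU Occupancy",
            category="OCCUPANCY",
            severity="SEVERE",
            evidence=[f"GPU utilization: {util:.1%}"],
            explanation="The GPU sits idle more than half of each step.",
            suggestions=["Increase batch size",
                         "Overlap data loading with compute"],
        )
    return None
```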
This is production-quality diagnostic code. Use it to understand your models better.