
GPU Performance Diagnosis Tool for PyTorch

A production-quality terminal-based profiling tool that diagnoses GPU performance bottlenecks in PyTorch models. This tool doesn't optimize your code—it tells you why it's slow and what to fix.

Features

  • Roofline Analysis: Determines if your workload is memory-bound or compute-bound
  • Kernel Launch Diagnosis: Detects excessive kernel overhead and launch patterns
  • Synchronization Detection: Identifies CPU-GPU sync stalls
  • Data Pipeline Analysis: Measures HtoD/DtoH transfer overhead
  • Tensor Core Utilization: Estimates tensor core usage and precision issues
  • Memory Hierarchy: Analyzes cache efficiency (best effort)
  • Actionable Recommendations: Clear, prioritized suggestions for each bottleneck

Requirements

pip install torch torchvision  # CUDA version required

Project Structure

profiler/
├── main.py          # CLI entry point
├── collect.py       # torch.profiler data collection
├── metrics.py       # Derived metrics calculation
├── roofline.py      # Roofline model analysis
├── diagnose.py      # Rule-based diagnosis engine
└── report.py        # Terminal report formatting

Usage

Command Line

# Profile a ResNet50
python main.py --model resnet50 --batch-size 32 --steps 50

# Profile a Transformer
python main.py --model transformer --batch-size 16 --steps 100

# Profile a simple CNN
python main.py --model simple_cnn --batch-size 64 --steps 50

Programmatic API

from profiler.main import profile_model
import torch
import torch.nn as nn

# Your model
model = nn.Sequential(...)

# Input generator function (allocate directly on the GPU; see "Late GPU Migration" below)
def input_gen():
    return torch.randn(32, 3, 224, 224, device='cuda')

# Profile and diagnose
results = profile_model(model, input_gen, steps=50)

# Access results
bottlenecks = results['diagnosis']['bottlenecks']
primary = results['diagnosis']['primary_bottleneck']

Example Output

╔════════════════════════════════════════════════════════════════════════════╗
║                   GPU PERFORMANCE DIAGNOSIS REPORT                         ║
╚════════════════════════════════════════════════════════════════════════════╝

📊 EXECUTIVE SUMMARY
────────────────────────────────────────────────────────────────────────────
Primary Bottleneck: CPU-Bound Operations Blocking GPU
Severity: CRITICAL
Category: CPU_RESIDENCY

GPU Utilization: 45.3%
Step Time: 28.43 ms
Compute Time: 10.38 ms

🏔️  ROOFLINE ANALYSIS
────────────────────────────────────────────────────────────────────────────
Bottleneck Type: MEMORY-BOUND ⚠️
  └─ Arithmetic Intensity: 0.842 FLOPs/Byte
  └─ Ridge Point: 12.400 FLOPs/Byte
  └─ Status: Below ridge point → memory bandwidth limited

Performance vs. Roofline:
  • Achieved: 73.2% of attainable
  • Distance from roof: 26.8%
  • Severity: MODERATE

Hardware Utilization:
  • Memory Bandwidth: 78.4% of 900 GB/s
  • Compute Throughput: 23.1% of 19.5 TFLOPS
  • Headroom Available: 21.6%

🔍 DETECTED BOTTLENECKS
────────────────────────────────────────────────────────────────────────────

1. 🔴 CPU-Bound Operations Blocking GPU [CRITICAL]
   Category: CPU_RESIDENCY

   Evidence:
     • CPU-heavy ops time: 15.2 ms/step
     • CPU overhead: 53.5% of step time
     • Top offender: torchvision.transforms (12.1 ms/step)

   Explanation:
     Your training loop is CPU-bound. Operations are being executed on CPU
     instead of GPU, consuming 53.5% of step time. This often happens when
     tensors stay on CPU too long or preprocessing happens on CPU instead
     of GPU.

2. 🟠 Excessive CPU-GPU Data Movement [SEVERE]
   Category: CPU_RESIDENCY

   Evidence:
     • CPU→GPU transfers per step: 247
     • GPU→CPU transfers per step: 89
     • Transfer overhead: 18.3%

   Explanation:
     Detected 336 CPU-GPU data transfers per step. This suggests data is
     being moved back and forth between devices repeatedly.

💡 TOP RECOMMENDATIONS
────────────────────────────────────────────────────────────────────────────

Immediate Actions (Critical/Severe Issues):

  For CPU-Bound Operations Blocking GPU:
    • Move tensor operations to GPU immediately: data = data.cuda() at data loading
    • Use GPU-based augmentation (kornia, DALI) instead of CPU (torchvision.transforms)
    • Check for tensor.numpy() calls forcing CPU execution
    • Profile with 'with_stack=True' to see exact callsites

  For Excessive CPU-GPU Data Movement:
    • Move all tensors to GPU once at the start
    • Avoid .item(), .numpy(), .cpu() in training loop
    • Use persistent_workers=True in DataLoader
    • Use torch.cuda.stream() to overlap transfers with compute

Detected Bottlenecks

The tool detects and diagnoses:

  1. Roofline Bottlenecks

    • Memory-bound vs compute-bound classification
    • Bandwidth/compute efficiency
    • Distance from theoretical peak
  2. Kernel Launch Overhead (see the rule sketch after this list)

    • Too many small kernels (<20 µs)
    • Poor launch amortization
    • Excessive kernel count
  3. CPU-GPU Synchronization

    • .item(), .cpu() calls in loops
    • Explicit synchronize() calls
    • GPU idle gaps
  4. CPU Residency Bottlenecks ⭐ NEW

    • Data kept on CPU too long
    • Late GPU migration patterns
    • CPU-heavy operations blocking GPU
    • Excessive CPU↔GPU transfers
    • CPU-based preprocessing bottleneck
  5. Data Pipeline Issues

    • HtoD copy overhead
    • Poor compute-copy overlap
    • DataLoader inefficiency
  6. Low GPU Occupancy

    • Insufficient parallelism
    • Small batch sizes
    • Underutilization
  7. Memory Hierarchy (Best Effort)

    • Cache inefficiency
    • Uncoalesced accesses
    • Poor memory patterns
  8. Tensor Core Underutilization

    • FP32-heavy workloads
    • Misaligned tensor shapes
    • Missing mixed precision
  9. Compiler/Graph Issues

    • High kernel diversity
    • Lack of fusion
    • Graph fragmentation
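
As a flavor of how a rule is expressed, here is a minimal sketch of the small-kernel check from item 2. It is illustrative only; the real rules live in diagnose.py and may use different names and thresholds.

# Hypothetical rule: flag steps dominated by kernel launch overhead.
# `kernel_durations_us` is assumed to be a list of per-kernel GPU times in µs.
def check_kernel_launch_overhead(kernel_durations_us, small_us=20.0, ratio_threshold=0.5):
    if not kernel_durations_us:
        return None
    small = [d for d in kernel_durations_us if d < small_us]
    ratio = len(small) / len(kernel_durations_us)
    if ratio <= ratio_threshold:
        return None
    return {
        'name': 'Kernel Launch Overhead',
        'severity': 'SEVERE' if ratio > 0.8 else 'MODERATE',
        'evidence': {
            'total_kernels': len(kernel_durations_us),
            'small_kernels': len(small),
            'small_kernel_ratio': round(ratio, 3),
        },
    }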

Metrics Collected

  • Timing: Step time, GPU compute time, idle time
  • Kernel Stats: Count, duration, launch overhead
  • Memory: HtoD/DtoH transfers, bandwidth achieved
  • Compute: FLOPs, achieved throughput, arithmetic intensity
  • Precision: FP16/FP32 usage, tensor core utilization
  • CPU Residency: CPU-heavy ops, device transfer patterns, preprocessing time

How It Detects CPU Residency Issues

The tool specifically looks for four patterns that keep data on the CPU:

1. CPU-Heavy Operations

Detects operations with high CPU time but minimal GPU time (CPU/GPU ratio > 10:1)

Example caught:

# This creates a CPU bottleneck
data = data.numpy()  # Forces CPU execution
result = custom_function(data)  # CPU computation
data = torch.from_numpy(result).cuda()  # Move back
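
A rough version of this check can be written over torch.profiler's aggregated events. This is a simplified sketch, not the tool's exact implementation; note that event attribute names vary across PyTorch versions (newer releases expose device_time_total alongside the older cuda_time_total).

# Sketch: flag ops whose CPU time dwarfs their GPU time (> 10:1).
# `prof` is a completed torch.profiler.profile context; times are in µs.
def find_cpu_heavy_ops(prof, ratio=10.0, min_cpu_us=1000.0):
    offenders = []
    for evt in prof.key_averages():
        cpu_us = evt.cpu_time_total
        gpu_us = evt.cuda_time_total  # device_time_total on newer PyTorch
        if cpu_us > min_cpu_us and cpu_us > ratio * max(gpu_us, 1.0):
            offenders.append((evt.key, cpu_us, gpu_us))
    return sorted(offenders, key=lambda o: -o[1])  # worst offenders first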

2. Excessive CPU↔GPU Transfers

Counts .cuda(), .cpu(), .to(device) calls per step

Example caught:

# Moving data back and forth
for batch in dataloader:
    x = batch['data'].cuda()  # Transfer 1
    if x.max().item() > 0:    # .item() forces CPU
        x = x.cpu()            # Transfer 2
        x = preprocess(x)      # CPU work
        x = x.cuda()           # Transfer 3
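
An illustrative rewrite that keeps the loop on the GPU, assuming the preprocessing can be expressed as tensor ops (preprocess_gpu is a hypothetical GPU-side equivalent of preprocess):

for batch in dataloader:
    x = batch['data'].cuda(non_blocking=True)    # single HtoD transfer
    # Keep the condition on-device instead of syncing with .item()
    mask = x.max() > 0                           # 0-dim bool tensor, no sync
    x = torch.where(mask, preprocess_gpu(x), x)  # hypothetical GPU preprocess

non_blocking=True only helps when the DataLoader uses pin_memory=True, and torch.where evaluates both branches; even so, this is usually far cheaper than three device transfers per batch.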

3. CPU-Based Preprocessing

Identifies CPU augmentation/transforms during training

Example caught:

# torchvision transforms run on CPU
transform = transforms.Compose([
    transforms.RandomCrop(224),    # CPU
    transforms.ColorJitter(),      # CPU
])
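
A GPU-side alternative, sketched with kornia (assumes kornia is installed via pip install kornia; its augmentations are nn.Modules that operate on batched tensors):

import kornia.augmentation as K
import torch.nn as nn

# GPU equivalent of the CPU pipeline above
gpu_transform = nn.Sequential(
    K.RandomCrop((224, 224)),
    K.ColorJitter(0.2, 0.2, 0.2, 0.1),
).cuda()

x = batch.cuda(non_blocking=True)  # move the raw batch once
x = gpu_transform(x)               # augmentation now runs on the GPU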

4. Late GPU Migration

Detects tensors created on CPU then moved (vs created on GPU directly)

Example caught:

# Creating on CPU first
noise = torch.randn(x.shape).cuda()  # CPU alloc + transfer

# Should be:
noise = torch.randn(x.shape, device='cuda')  # Direct GPU alloc
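
When the new tensor should match an existing tensor's shape, dtype, and device, the *_like constructors are an even shorter form:

noise = torch.randn_like(x)  # inherits x's shape, dtype, and device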

See CPU Residency Guide for detailed examples and fixes.

Limitations

  • Tensor Core Utilization: Estimated from operation types (best effort)
  • Cache Metrics: Requires Nsight Compute for hardware counters
  • Bytes Transferred: Approximated from memory operations
  • Precision Detection: Inferred from kernel names

For production-level profiling, combine this tool with:

  • NVIDIA Nsight Systems (timeline visualization)
  • NVIDIA Nsight Compute (kernel-level analysis)
  • torch.profiler with TensorBoard (detailed traces)

Architecture Notes

Design Principles

  1. Diagnosis over Metrics: Focuses on explaining bottlenecks, not dumping data
  2. Rule-Based Engine: Clear, explainable diagnostic rules
  3. Severity Classification: Prioritizes critical issues
  4. Actionable Output: Every diagnosis includes fix suggestions

Profiling Strategy

  • Uses torch.profiler for minimal overhead (sketched below)
  • Captures both CPU and CUDA events
  • Records shapes, memory, and FLOPs
  • Warmup steps excluded from analysis
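
A minimal sketch of this collection setup (the actual configuration lives in collect.py and may differ; train_step stands in for your own training step):

from torch.profiler import profile, schedule, ProfilerActivity

sched = schedule(wait=1, warmup=3, active=5)  # warmup steps are discarded
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=sched,
    record_shapes=True,    # per-op input shapes
    profile_memory=True,   # allocator events
    with_flops=True,       # FLOP estimates for supported ops
) as prof:
    for _ in range(9):     # wait + warmup + active steps
        train_step()       # your training step (hypothetical)
        prof.step()        # advance the profiler schedule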

Roofline Model

Implements the roofline performance model (sketched in code below):

  • Ridge Point = Peak FLOPS / Peak Bandwidth
  • Memory-bound: AI < Ridge Point
  • Compute-bound: AI ≥ Ridge Point
  • Calculates distance from theoretical maximum
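
In code, the classification reduces to a few lines. The peak numbers below are illustrative (they match the example report above); the tool's actual peak handling lives in roofline.py:

PEAK_FLOPS = 19.5e12          # FLOP/s (illustrative)
PEAK_BW = 0.9e12              # bytes/s, i.e. 900 GB/s (illustrative)
RIDGE = PEAK_FLOPS / PEAK_BW  # ridge point, FLOPs/byte

def classify(flops, bytes_moved, time_s):
    ai = flops / bytes_moved                    # arithmetic intensity
    attainable = min(PEAK_FLOPS, ai * PEAK_BW)  # roofline ceiling at this AI
    achieved = flops / time_s
    return {
        'bound': 'memory' if ai < RIDGE else 'compute',
        'arithmetic_intensity': ai,
        'pct_of_attainable': 100.0 * achieved / attainable,
    }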

Contributing

This tool is designed for ML systems engineers. When adding new diagnosis rules (see the sketch after this list):

  1. Add clear evidence (metrics)
  2. Explain why it matters
  3. Provide actionable suggestions
  4. Classify severity appropriately
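
A hypothetical container for one diagnosis, with field names mirroring the report sections above (the real structure in diagnose.py may differ):

from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    name: str          # e.g. "Kernel Launch Overhead"
    category: str      # e.g. "CPU_RESIDENCY"
    severity: str      # "CRITICAL" | "SEVERE" | "MODERATE" | ...
    evidence: dict     # the metrics that triggered the rule
    explanation: str   # why it matters, in plain language
    suggestions: list[str] = field(default_factory=list)  # actionable fixes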

License

This is production-quality diagnostic code. Use it to understand your models better.
