The rapid growth of large language models (LLMs) has significantly increased the computational demand and energy consumption of modern GPU servers. This project develops a system-level GPU frequency control algorithm to improve power efficiency and reduce energy consumption for LLM inference workloads under varying throughput requirements.
The proposed system supports both:
- Fixed-workload scheduling
- Fixed-interval scheduling
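The two modes are not defined further here. One plausible reading, stated as an assumption, is that fixed-workload scheduling accounts only for the energy spent processing a given number of tokens, while fixed-interval scheduling also charges idle power for the remainder of a fixed time window. The function names and parameters below are hypothetical and are not taken from the repository.

```python
def fixed_workload_energy(tokens, tokens_per_s, active_w):
    """Energy (J) to process a fixed number of tokens at one operating point."""
    return active_w * tokens / tokens_per_s


def fixed_interval_energy(tokens, tokens_per_s, active_w, idle_w, interval_s):
    """Energy (J) over a fixed window: busy time plus idle time (assumed model)."""
    busy_s = tokens / tokens_per_s
    if busy_s > interval_s:
        raise ValueError("workload does not fit in the interval")
    return active_w * busy_s + idle_w * (interval_s - busy_s)
```

Under this reading, idle power only matters in the fixed-interval case, which is one reason an idle-state setting could affect the result there.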
A performance and power model is first derived from experimental data and then used as input to an optimization algorithm that determines the GPU frequency settings and workload allocations.
- Improve energy efficiency of multi-GPU LLM inference systems
- Reduce overall energy consumption under throughput constraints
- Develop system-level frequency and workload optimization strategies
The system consists of three main stages:
Empirical measurements are collected to model:
- GPU performance and power under different frequency settings
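As a minimal sketch of this modeling step, the code below fits simple models to frequency sweeps using ordinary least squares. The model forms are assumptions made for illustration (power as a static term plus a cubic term in frequency, throughput as a linear function of frequency), and the sample values are synthetic, generated from those same formulas rather than measured. The project's actual model may differ.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic, illustrative samples only (not measurements from this project).
freq_ghz = [0.6, 0.9, 1.2, 1.5, 1.8]
power_w = [60.0 + 10.0 * f ** 3 for f in freq_ghz]   # assumed cubic trend
tokens_per_s = [50.0 * f for f in freq_ghz]          # assumed linear trend

# Power is linear in f^3 under the assumed form, so fit against f^3.
k_dyn, p_static = fit_line([f ** 3 for f in freq_ghz], power_w)
t_slope, t_offset = fit_line(freq_ghz, tokens_per_s)


def energy_per_token(f_ghz):
    """Joules per token predicted by the two fitted models."""
    power = p_static + k_dyn * f_ghz ** 3
    throughput = t_slope * f_ghz + t_offset
    return power / throughput
```

With models of this kind, energy per token can be evaluated at any candidate frequency, which is the quantity an optimizer would compare across settings.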
A system-level optimization framework is used to determine:
- GPU frequency configuration
- Workload allocation across GPUs
- Idle-state configuration
Implemented in:
- optimization.py
- opt.py
- calc.py
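The optimization stage can be illustrated with a small exhaustive search. This is a sketch under stated assumptions, not the formulation used in optimization.py: each GPU is either idle or runs at one of a few frequencies, the operating-point table is hypothetical, power is taken at full load for each active GPU, and work is split in proportion to each GPU's throughput.

```python
from itertools import product

# Hypothetical operating points: frequency in MHz -> (tokens/s, watts).
# The key None stands for an idle GPU with an assumed idle power.
OPERATING_POINTS = {
    None: (0.0, 20.0),
    900: (40.0, 90.0),
    1200: (55.0, 130.0),
    1500: (65.0, 180.0),
}


def optimize(num_gpus, target_tps):
    """Return the lowest-power configuration meeting the throughput target."""
    best = None
    for config in product(OPERATING_POINTS, repeat=num_gpus):
        capacity = sum(OPERATING_POINTS[f][0] for f in config)
        if capacity < target_tps:
            continue  # cannot sustain the required throughput
        power = sum(OPERATING_POINTS[f][1] for f in config)
        if best is None or power < best[0]:
            best = (power, config, capacity)
    if best is None:
        return None  # target is infeasible with this many GPUs
    power, config, capacity = best
    # Allocate work in proportion to each GPU's throughput (idle GPUs get 0).
    shares = [OPERATING_POINTS[f][0] / capacity for f in config]
    return {"frequencies": config, "shares": shares, "power_w": power}
```

For example, `optimize(2, 50.0)` with the table above selects one idle GPU and one GPU at 1200 MHz. Exhaustive search grows exponentially with the number of GPUs, so a real formulation would likely use a solver or exploit symmetry between identical GPUs.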
The optimized scheduling strategy is evaluated on a multi-GPU system (ARC cluster) and compared against baseline configurations.
Implemented in:
- final_test.py
- final_test_boot.py
- final_run.sh
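A comparison against a baseline usually reduces to integrating sampled power into energy and reporting the relative saving. The helper below is a minimal sketch of that calculation, assuming evenly spaced power samples in watts. The function names are hypothetical and do not come from final_test.py.

```python
def energy_joules(power_samples_w, interval_s):
    """Integrate evenly spaced power samples (W) with the trapezoidal rule."""
    if len(power_samples_w) < 2:
        return 0.0
    ends = 0.5 * (power_samples_w[0] + power_samples_w[-1])
    inner = sum(power_samples_w[1:-1])
    return interval_s * (ends + inner)


def savings_percent(baseline_j, optimized_j):
    """Relative energy saving of the optimized run versus the baseline."""
    return 100.0 * (baseline_j - optimized_j) / baseline_j
```

Both runs should cover the same workload or the same time window for the percentage to be meaningful.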
.
├── calc.py # Efficiency and energy calculation
├── optimization.py # Core optimization formulation
├── opt.py # Main optimization execution logic
├── final_test.py # Main evaluation script
├── final_test_boot.py # Reboot experiment script
├── final_run.sh # Shell script for configuring GPU frequencies on ARC
├── results/ # Output results and figures
│   ├── benchmark/ # Benchmark measurements
│   ├── boot/ # Boot measurements
│   ├── idle/ # Algorithm evaluation measurements
│   └── pre_results # Pre-experiment measurements on performance and power models
└── README.md