NeuSim is a simulator framework for modeling the performance and power behaviors of neural processing units (NPUs) when running machine learning workloads.
As shown in the above figure, an NPU chip consists of systolic arrays (SAs) for matrix multiplications and SIMD vector units (VUs) for generic vector operations. Each chip has an off-chip high-bandwidth memory (HBM) to store the ML model weights and input/output data, and an on-chip SRAM to exploit data locality and hide HBM access latency. A direct memory access (DMA) engine performs asynchronous memory copies between the HBM and SRAM. Multiple NPU chips can be connected via high-speed inter-chip interconnect (ICI) links to form an NPU pod. A pod is typically arranged as a 2D/3D torus, which is optimized for AllReduce bandwidth. The DMA engine performs remote DMA (RDMA) operations to access another chip's HBM or SRAM.
NeuSim features:
- Detailed performance modeling: NeuSim models each component (e.g., systolic array, vector unit, on-chip SRAM, HBM, ICI) on an NPU chip and reports rich statistics for each tensor operator (e.g., execution time, FLOPS, memory traffic). It helps chip architects and system designers identify microarchitectural bottlenecks (e.g., SA-bound, VU-bound, HBM-bound).
- Power, energy, and carbon modeling: NeuSim models the static/dynamic power and energy consumption of each component on an NPU chip. It also models the embodied and operational carbon emissions.
- Flexibility: NeuSim can be invoked at different levels of granularity, including single operator simulation, end-to-end DNN model simulation, and batch simulations for design space explorations. This provides flexibility to users with different needs.
- Support for popular DNN models: NeuSim takes the model graph definition as an input. It supports various popular DNN architectures, including LLMs (e.g., Llama, DeepSeek), recommendation models (e.g., DLRM), and stable diffusion models (e.g., DiT-XL, GLIGEN).
- Multi-chip simulation: NeuSim supports simulating multi-chip systems with different parallelism strategies (e.g., tensor parallelism, pipeline parallelism, data parallelism, expert parallelism).
- Scalability: A typical use case of NeuSim is the design space exploration: sweeping over millions of NPU hardware configurations (e.g., number of chips) and software parameters (e.g., batch size, parallelism config) to learn the "optimal" setting. NeuSim automatically parallelizes simulation jobs across multiple machines using Ray to speed up large-scale design space explorations.
- Advanced features: NeuSim models advanced architectural features such as power gating and dynamic voltage and frequency scaling (DVFS) to help chip architects explore the trade-offs between performance, power, and energy efficiency.
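The design-space sweep described above can be sketched in a few lines. The toy version below uses an invented cost model and Python's `concurrent.futures` as a stand-in for Ray, purely to illustrate the sweep-then-select structure:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def simulate(cfg):
    """Stand-in cost model (NOT NeuSim's): latency shrinks with chip count."""
    num_chips, batch_size = cfg
    return {"cfg": cfg, "latency": batch_size / num_chips}

# Sweep the cross-product of hardware and software parameters in parallel.
configs = list(product([1, 2, 4], [8, 16]))  # (num_chips, batch_size)
with ThreadPoolExecutor() as pool:
    results = list(pool.map(simulate, configs))

best = min(results, key=lambda r: r["latency"])
print(best["cfg"])  # -> (4, 8)
```

NeuSim performs the same pattern at scale, distributing the `simulate` calls as Ray tasks across machines.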
- Install Miniconda (skip this if you already have conda installed).

- NeuSim is installed as a Python package. Create a conda environment and install NeuSim with pip:

  ```shell
  conda create --name neusim python=3.12.2
  conda activate neusim
  pip install -e .
  ```

  If you want to run unit tests or contribute to the codebase, you may also install the optional development dependencies:

  ```shell
  pip install -e ".[dev]"
  ```
NeuSim can be launched in different ways depending on the use cases, including single operator simulations, single model simulations, and batch simulations for design space explorations.
The neusim/run_scripts/ directory contains several example scripts of NeuSim simulations.
To get started immediately, we provide an automated example script (neusim/run_scripts/example_npusim.sh) that demonstrates the full NeuSim pipeline. It sweeps through various hardware and model configurations to determine the most cost-efficient NPU design that meets specific performance targets.
- Start the Ray server:

  ```shell
  ray start --head --port=6379
  ```

- Run the example script:

  ```shell
  cd neusim/run_scripts
  ./example_npusim.sh
  ```

  You may view the progress of the test runs in the Ray dashboard (at http://127.0.0.1:8265/ by default; it may require port forwarding if you are SSH'ing into a remote machine). After the script finishes with no errors, all jobs under the "Jobs" tab in the Ray dashboard should have the "Status" column set to "SUCCEEDED". An output directory `results` should be created, containing the following folders:

  - `raw/`: the performance simulation results. This is the output of the script `run_sim.py`.
  - `raw_None/`: the power simulation results. This is the output of the script `energy_operator_analysis_main.py`.
  - `carbon_NoPG/dvfs_None/CI0.0624/UTIL0.6/`: the results of the carbon emission analysis without power gating or DVFS, at a carbon intensity of 0.0624 kgCO2e/kWh and an NPU chip duty cycle of 60%. This is the output of the script `carbon_analysis_main.py`.
  - `slo/`: the SLO analysis results. This is the output of the script `slo_analysis_main.py`.
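The carbon-analysis directory name encodes the run parameters. A hypothetical helper reproducing that naming convention (the real scripts may construct the path differently):

```python
def carbon_output_dir(power_gating="NoPG", dvfs=None,
                      carbon_intensity=0.0624, duty_cycle=0.6):
    """Rebuild the carbon-analysis output path from its run parameters."""
    return f"carbon_{power_gating}/dvfs_{dvfs}/CI{carbon_intensity}/UTIL{duty_cycle}"

print(carbon_output_dir())  # -> carbon_NoPG/dvfs_None/CI0.0624/UTIL0.6
```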
The example_npusim.sh script invokes the core components of NeuSim to simulate different DNN models running on various NPU hardware configurations, and analyze the output statistics to find the most cost-efficient NPU configuration that meets the target performance SLOs:
- First, it invokes `run_sim.py` for performance simulations. This script is the main entry point for running a batch of performance simulations. It sweeps over all possible numbers of chips, batch sizes, NPU versions, and parallelism configurations for the given DNN models. It outputs the per-operator performance statistics for each configuration to CSV files, and dumps the end-to-end statistics and the simulation configuration to a JSON file. The `Operator` class contains the descriptions for all the statistics in the CSV files. This script launches multiple Ray tasks to parallelize the simulation jobs.
- Next, it invokes `energy_operator_analysis_main.py` to run power simulations. This script reads the performance statistics generated by `run_sim.py` and computes the power and energy consumption of each operator based on the NPU hardware configuration, power gating, and DVFS settings. (Note: the power simulation could be integrated into `run_sim.py`, but we keep them separate for modularity and flexibility.)
- After that, it invokes `carbon_analysis_main.py` to run the carbon footprint analysis and further aggregate the simulation statistics. This script reads the power and energy statistics generated by `energy_operator_analysis_main.py` and computes the carbon emissions based on the datacenter carbon intensity and NPU chip duty cycle.
- Finally, it invokes `slo_analysis_main.py`, which analyzes the outputs of the previous steps to find the optimal NPU configurations that meet the target SLOs (e.g., request latency for inference workloads).
A more comprehensive experiment script, run_power_gating.sh, demonstrates how to run simulations with different power gating strategies. It has the same structure as example_npusim.sh, but includes more models, NPU versions, and various power gating configurations.
Most scripts under neusim/run_scripts accept the --output_dir argument.
The user can specify the NPU hardware configuration and the model architecture of the simulation by creating new configuration files under configs/.
We provide a set of pre-defined configurations in the configs directory:
- `configs/chips/`: the NPU chip parameters, such as the number of SAs and VUs, core frequency, HBM bandwidth, and on-chip SRAM size.
- `configs/models/`: the model architecture parameters as well as the parallelism configurations. We currently support LLMs (Llama and DeepSeek), DLRM, DiT-XL, and GLIGEN. See Defining New DNN Model Architectures for more details on how to add support for new models.
- `configs/systems/`: the system-level parameters, including the datacenter power usage effectiveness (PUE) and carbon intensity used for the carbon emission analysis.
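For illustration, a chip configuration might look like the sketch below. The field names here are invented for this example and are not NeuSim's actual schema; see the existing files under `configs/chips/` for the real format.

```json
{
  "name": "example-npu",
  "num_systolic_arrays": 4,
  "num_vector_units": 4,
  "core_frequency_ghz": 1.0,
  "hbm_bandwidth_gbps": 1200,
  "sram_size_mb": 128
}
```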
The script neusim/run_scripts/run_sim.py automatically supports new configuration files added to these directories, as long as the file names follow the existing naming conventions:
- `--models`: specifies the model names. For example, if the user adds a new model configuration file `configs/models/llama4-17b.json`, the user can pass `--models="llama4-17b"` to run simulations for this model.
- `--versions`: specifies the NPU chip versions. For example, if the user adds a new chip configuration file `configs/chips/tpuv7.json`, the user can pass `--versions="7"` to run simulations for this NPU version.
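The naming conventions above can be made concrete with two hypothetical helpers (`run_sim.py`'s actual parsing logic may differ):

```python
from pathlib import Path

def model_name(config_path):
    """configs/models/llama4-17b.json -> 'llama4-17b'"""
    return Path(config_path).stem

def chip_version(config_path):
    """configs/chips/tpuv7.json -> '7' (assumes the tpuv<N>.json convention)"""
    return Path(config_path).stem.removeprefix("tpuv")

print(model_name("configs/models/llama4-17b.json"))  # -> llama4-17b
print(chip_version("configs/chips/tpuv7.json"))      # -> 7
```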
The power gating parameters are defined in neusim/npusim/frontend/power_analysis_lib.py. The user can modify the get_power_gating_config() function to add new power gating configurations, including power gating wake-up cycles and power gating policies for each component.
The scripts neusim/run_scripts/energy_operator_analysis_main.py and neusim/run_scripts/carbon_analysis_main.py can be invoked with command line arguments.
The --help option shows all available options. To perform sensitivity study for power gating parameters, these two scripts support overriding the default power gating configurations via the --power_gating_strategy flag as follows:
- `NoPG`: no power gating.
- `Ideal`: ideal power gating with instruction-level temporal granularity and PE/ALU-level spatial granularity. This should yield the most power savings.
- `Full`: same as `Ideal` but with non-zero power-gating factors (`power_level_factors`) and delay cycles.
- `<base_config>_vary_Vth_<value>_<value_sram>`: varies Vth_low (the voltage when logic is power gated) and Vth_sram (the voltage when SRAM cells are power gated) for sensitivity analysis. The values are percentages of Vdd.
- `<base_config>_vary_PG_delay_<value>`: varies the power gating wake-up delay for sensitivity analysis. The value is specified as a ratio over the base config.
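To make the strategy-string format concrete, here is a hypothetical parser for these names; NeuSim's own handling lives in `get_power_gating_config()` and may differ:

```python
import re

def parse_strategy(name):
    """Decode a --power_gating_strategy string into its components."""
    m = re.fullmatch(r"(\w+?)_vary_Vth_([\d.]+)_([\d.]+)", name)
    if m:
        return {"base": m.group(1), "Vth_low_pct": float(m.group(2)),
                "Vth_sram_pct": float(m.group(3))}
    m = re.fullmatch(r"(\w+?)_vary_PG_delay_([\d.]+)", name)
    if m:
        return {"base": m.group(1), "delay_ratio": float(m.group(2))}
    return {"base": name}  # plain strategies: NoPG, Ideal, Full

print(parse_strategy("Full_vary_PG_delay_2.0"))
# -> {'base': 'Full', 'delay_ratio': 2.0}
```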
See neusim/run_scripts/run_power_gating_sensitivity_analysis.sh for examples of how to specify different power gating strategies via the --power_gating_strategy flag.
See neusim/npusim/frontend/power_analysis_lib.py:get_power_gating_config() for how these parameters are handled internally by NeuSim.
Please see neusim/run_scripts/run_single_op_main.py for an example of how to run a single tensor operator simulation. This script is helpful for analyzing a specific operator of interest rather than simulating the entire DNN model.
If the user wants more control over the simulation parameters, such as customizing the chip configs, model configs, and specifying the batch size and parallelism config search space, the best way is to write a custom script that invokes the neusim.npusim.frontend module directly.
See run_scripts/single_model_example.ipynb for an example of creating simulation configurations and running a single experiment.
We currently support LLMs (see neusim/npusim/frontend/llm_ops_generator.py), DLRM (see neusim/npusim/frontend/dlrm_ops_generator.py), DiT-XL (see neusim/npusim/frontend/dit_ops_generator.py), and GLIGEN (see neusim/npusim/frontend/gligen_ops_generator.py). Variants of these models (such as changing the number of layers or hidden dimensions) can be created by adding new configuration files in the configs/models directory.
To add support for new model architectures, the user needs to implement a new model generator class in neusim/npusim/frontend to reflect the model's execution graph. Many commonly used operators, such as GEMM, Conv, and LayerNorm, are implemented in neusim/npusim/backend/npusim_lib.py. Please refer to the existing model generator classes for examples on how to call these operators and implement new model generators.
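As a sketch of that structure, a new generator might look like the toy example below. The operator builders here are made up for illustration; the real ones in `neusim/npusim/backend/npusim_lib.py` have their own signatures and return rich operator objects.

```python
# Hypothetical operator builders standing in for npusim_lib.py's.
def gemm(name, m, k, n):
    return {"name": name, "op": "GEMM", "flops": 2 * m * k * n}

def layernorm(name, rows, dim):
    # Rough illustrative FLOP count, not NeuSim's model.
    return {"name": name, "op": "LayerNorm", "flops": 5 * rows * dim}

class TinyMLPOpsGenerator:
    """Emits the operator graph for a toy 2-layer MLP."""
    def __init__(self, batch, d_in, d_hidden, d_out):
        self.batch, self.d_in = batch, d_in
        self.d_hidden, self.d_out = d_hidden, d_out

    def generate_ops(self):
        return [
            gemm("fc1", self.batch, self.d_in, self.d_hidden),
            layernorm("ln1", self.batch, self.d_hidden),
            gemm("fc2", self.batch, self.d_hidden, self.d_out),
        ]

ops = TinyMLPOpsGenerator(batch=8, d_in=128, d_hidden=256, d_out=64).generate_ops()
print([op["op"] for op in ops])  # -> ['GEMM', 'LayerNorm', 'GEMM']
```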
See run_scripts/new_model_example.ipynb for an example of adding a new model generator class and running simulations for the new model.
To scale out the simulator on multiple machines, we can set up a shared storage directory and configure the Ray cluster. The instructions below show an example of setting up a shared NFS directory and configuring a Ray cluster.
- The NFS server can be any node in the cluster (preferably the head node). To set up an NFS directory, run:

  ```shell
  sudo apt install nfs-kernel-server
  sudo mkdir -p /mnt/[npusim_nfs_share]
  sudo chown nobody:nogroup /mnt/[npusim_nfs_share]
  sudo chmod 777 /mnt/[npusim_nfs_share]
  echo "/mnt/[npusim_nfs_share] *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
  sudo exportfs -a
  sudo systemctl restart nfs-kernel-server
  ```

- On each worker node, mount the NFS directory:

  ```shell
  sudo apt install nfs-common
  sudo mkdir -p /mnt/[npusim_nfs_share]
  sudo mount -t nfs [head_node_ip]:/mnt/[npusim_nfs_share] /mnt/[npusim_nfs_share]
  ```
- The GitHub repository should be cloned inside the shared NFS directory so that all nodes have access to the codebase.

- The Python package `neusim` must be installed on all nodes.

- Launch the Ray runtime on the head node with the `neusim` conda environment:

  ```shell
  conda activate neusim
  ray start --head --port=6379
  ```

- Finally, start the Ray runtime on each worker node with the `neusim` conda environment:

  ```shell
  conda activate neusim
  ray start --address='[head_node_ip]:6379'
  ```

  You may verify that all nodes are connected to the Ray cluster by running:

  ```shell
  ray status
  ```

  Alternatively, the "Cluster" tab in the Ray dashboard also shows the status of all nodes in the cluster.
- The provided scripts under `neusim/run_scripts` can be launched from any node; here we assume they are launched from the head node. Make sure the path uses the NFS shared directory, not a local path, as this path will be used by the other nodes in the cluster.

- Run the test script to verify the setup:

  ```shell
  cd /mnt/[npusim_nfs_share]/.../neusim/run_scripts
  ./example_npusim.sh
  ```

  The test script runs the same tests as in the single-machine setup, but Ray will automatically distribute the tasks across all nodes.

- Other experiment scripts can be launched in the same way as in the single-machine setup; again, make sure to use the NFS shared directory.
We use unittest and pytest for unit testing. To run all tests, execute the following command in the repo's root directory:

```shell
pytest
```

pytest plugins can also be used. For example, to generate a code coverage report, run:

```shell
pytest --cov=.
```

All tests covering a certain module should be placed under the `tests` directory of that module. All test files should be named in the `test_*.py` format. If a unit test requires inputs from files, the input files should be placed under the `tests` directory (preferably inside a sub-directory of `tests`). See `neusim/backend/tests` for examples.
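A minimal test file following these conventions (the function under test is invented for illustration):

```python
# tests/test_example.py
def clamp(x, lo, hi):
    """Toy function under test."""
    return max(lo, min(x, hi))

def test_clamp():
    assert clamp(5, 0, 3) == 3   # above the range
    assert clamp(-1, 0, 3) == 0  # below the range
    assert clamp(2, 0, 3) == 2   # inside the range
```

Because the file matches `test_*.py` and the function starts with `test_`, pytest discovers and runs it automatically.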
Please consider citing us if you find NeuSim useful in your research:
If you use the power modeling features, please cite:
@inproceedings{regate:micro25,
author = {Xue, Yuqi and Huang, Jian},
title = {ReGate: Enabling Power Gating in Neural Processing Units},
year = {2025},
url = {https://doi.org/10.1145/3725843.3756038},
booktitle = {Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture},
address = {Seoul, Korea},
series = {MICRO'25}
}

If you use the performance modeling features, please cite:
@inproceedings{neu10:micro24,
author = {Xue, Yuqi and Liu, Yiqi and Nai, Lifeng and Huang, Jian},
title = {Hardware-Assisted Virtualization of Neural Processing Units for Cloud Platforms},
year = {2024},
url = {https://doi.org/10.1109/MICRO61859.2024.00011},
booktitle = {Proceedings of the 2024 57th IEEE/ACM International Symposium on Microarchitecture},
address = {Austin, TX, USA},
series = {MICRO'24}
}
@inproceedings{v10:isca23,
author = {Xue, Yuqi and Liu, Yiqi and Nai, Lifeng and Huang, Jian},
title = {V10: Hardware-Assisted NPU Multi-tenancy for Improved Resource Utilization and Fairness},
year = {2023},
url = {https://doi.org/10.1145/3579371.3589059},
booktitle = {Proceedings of the 50th Annual International Symposium on Computer Architecture},
address = {Orlando, FL, USA},
series = {ISCA'23}
}
@inproceedings{neucloud:hotos23,
author = {Xue, Yuqi and Liu, Yiqi and Huang, Jian},
title = {System Virtualization for Neural Processing Units},
year = {2023},
url = {https://doi.org/10.1145/3593856.3595912},
booktitle = {Proceedings of the 19th Workshop on Hot Topics in Operating Systems},
address = {Providence, RI, USA},
series = {HotOS'23}
}