- The Problem & Project Aim
- Project Overview
- System Architecture & Modules
- Custom AXI4-Lite Register Map
- Synthesis Metrics & Results
- Security Testing & Verification
- Tools Used
- How to Run Simulation
The Problem Statement: In modern integrated sensing and communication systems, data security is mandatory. However, executing a cryptographic algorithm such as AES-128 purely in software on a general-purpose processor is inefficient: encrypting a single 16-byte block can take thousands of clock cycles. This creates a data bottleneck, ties up the CPU, and drains power.
The Project Aim: The objective of this project was to offload the heavy mathematical workload of encryption into a dedicated hardware IP block. Specifically, the aim was to design, verify, and implement a FIPS-197 compliant AES-128 hardware accelerator from scratch strictly using Verilog HDL. This custom IP core had to be easily integrated into modern SoC architectures using a strict AXI4-Lite Slave Memory interface, allowing an AXI Master to write data, trigger the hardware, and read back the secured ciphertext with minimal overhead.
This project implements a fully unrolled, 10-stage pipelined AES-128 Cryptographic IP Core designed entirely in Verilog HDL. To ensure seamless integration into modern System-on-Chip (SoC) architectures, the cryptographic engine is wrapped in a custom, dual-ported AXI4-Lite Slave Memory interface.
By implementing the cipher entirely in hardware, the core delivers high-throughput ECB encryption in silicon. The accompanying automated Verilog AXI-Master testbench acts as the system controller and builds additional block cipher modes of operation (CBC and CTR) on top of the core by driving the AXI interface, demonstrating the IP's versatility without altering the underlying FIPS-compliant pipeline.
The design is modular and hierarchical, built from the lowest mathematical functions to the highest system wrapper.
The following modules are the foundational building blocks of the AES algorithm.
- `sbox.v` (SubBytes): A non-linear substitution step implemented as a 256-entry Look-Up Table (LUT). It replaces each 8-bit input with an 8-bit output derived from the multiplicative inverse in GF(2^8) followed by an affine transformation, providing the "confusion" in the cipher.
- `shift_rows.v`: Cyclically shifts the rows of the 4x4 byte matrix by different offsets. Because it only requires rewiring (no logic gates), it adds no clock cycles.
- `mix_columns_32bit.v`: The most mathematically heavy module. It takes a 32-bit column of data and multiplies it by a fixed matrix over the Galois Field GF(2^8). This provides the "diffusion" (Avalanche Effect), ensuring a 1-bit change in the input cascades across the entire block.
- `key_expand_stage.v`: Takes the previous round's key and applies a byte rotation, S-Box substitutions, a round constant, and XORs to generate the unique key for the next round on the fly.
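The byte-level math behind these primitives can be cross-checked with a small software model. The Python sketch below is a reference model written for illustration, not code from the repository: the function names (`xtime`, `gf_mul`, `sbox_byte`, `shift_rows`, `mix_column`) are hypothetical, the state is assumed to be a flat list of 16 bytes in column-major order, and the S-box is computed from its FIPS-197 definition rather than copied from `sbox.v`.

```python
def xtime(a):
    """Multiply by 2 in GF(2^8), reducing by the AES polynomial 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) using shift-and-add."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def sbox_byte(x):
    """SubBytes for one byte: multiplicative inverse, then the affine transform."""
    inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
    out = 0x63
    for shift in range(5):  # inv ^ rotl(inv, 1) ^ ... ^ rotl(inv, 4)
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out


def shift_rows(state):
    """Rotate row r of the column-major 4x4 state left by r bytes (pure rewiring)."""
    return [state[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]


def mix_column(col):
    """Multiply one 4-byte column by the fixed AES MixColumns matrix."""
    a0, a1, a2, a3 = col
    return [
        gf_mul(a0, 2) ^ gf_mul(a1, 3) ^ a2 ^ a3,
        a0 ^ gf_mul(a1, 2) ^ gf_mul(a2, 3) ^ a3,
        a0 ^ a1 ^ gf_mul(a2, 2) ^ gf_mul(a3, 3),
        gf_mul(a0, 3) ^ a1 ^ a2 ^ gf_mul(a3, 2),
    ]


# Spot checks against well-known values.
assert sbox_byte(0x00) == 0x63
assert sbox_byte(0x53) == 0xED
assert mix_column([0xDB, 0x13, 0x53, 0x45]) == [0x8E, 0x4D, 0xA1, 0xBC]
```

Values produced this way can be compared byte for byte against the outputs of the corresponding Verilog modules in simulation.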
- `aes_round.v`: Represents one standard AES round. Instantiates the SubBytes, ShiftRows, MixColumns, and AddRoundKey logic in sequence.
- `aes_round_last.v`: The final round (Round 10) must skip the MixColumns step per the AES standard. This module ensures strict FIPS compliance.
- `aes_pipeline.v`: The core engine. Instead of using an iterative state machine, this module physically unrolls the loop. It instantiates 9 standard rounds and 1 final round, wiring them together in a massive 10-stage pipeline entirely in RTL.
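The round structure described above (initial AddRoundKey, 9 standard rounds, 1 final round without MixColumns, with each round key derived from the previous one) can be mirrored in a compact software golden model. The Python sketch below is an illustration under stated assumptions, not the repository's RTL: all function names are hypothetical, the rounds run sequentially instead of as registered pipeline stages, and the S-box is generated from its definition instead of a stored table.

```python
def xtime(a):
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gf_mul(a, b):
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r


def _sbox_entry(x):
    inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
    out = 0x63
    for s in range(5):
        out ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
    return out


SBOX = [_sbox_entry(x) for x in range(256)]
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def next_round_key(key, rnd):
    """One key-expansion stage: derive round key `rnd` (1..10) from the previous key."""
    t = [SBOX[b] for b in key[13:16] + key[12:13]]  # RotWord, then SubWord
    t[0] ^= RCON[rnd - 1]
    out = []
    for i in range(16):
        out.append(key[i] ^ (t[i] if i < 4 else out[i - 4]))
    return out


def sub_shift(state):
    """SubBytes followed by ShiftRows on a column-major 16-byte state."""
    return [SBOX[state[4 * ((c + r) % 4) + r]] for c in range(4) for r in range(4)]


def mix_columns(state):
    out = []
    for c in range(4):
        a0, a1, a2, a3 = state[4 * c:4 * c + 4]
        out += [
            gf_mul(a0, 2) ^ gf_mul(a1, 3) ^ a2 ^ a3,
            a0 ^ gf_mul(a1, 2) ^ gf_mul(a2, 3) ^ a3,
            a0 ^ a1 ^ gf_mul(a2, 2) ^ gf_mul(a3, 3),
            gf_mul(a0, 3) ^ a1 ^ a2 ^ gf_mul(a3, 2),
        ]
    return out


def aes_round(state, rkey):
    """Standard round: SubBytes, ShiftRows, MixColumns, AddRoundKey."""
    return [b ^ k for b, k in zip(mix_columns(sub_shift(state)), rkey)]


def aes_round_last(state, rkey):
    """Final round: identical, but without MixColumns."""
    return [b ^ k for b, k in zip(sub_shift(state), rkey)]


def aes128_encrypt(plaintext, key):
    key = list(key)
    state = [p ^ k for p, k in zip(plaintext, key)]  # initial AddRoundKey
    for rnd in range(1, 10):                          # rounds 1..9
        key = next_round_key(key, rnd)
        state = aes_round(state, key)
    key = next_round_key(key, 10)                     # round 10
    return bytes(aes_round_last(state, key))


# FIPS-197 Appendix C.1 example vector.
key = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
pt = bytes.fromhex("00112233445566778899aabbccddeeff")
assert aes128_encrypt(pt, key).hex() == "69c4e0d86a7b0430d8cdb78070b4c55a"
```

In the hardware pipeline each of these round calls corresponds to one registered stage, so a new block can enter every clock cycle while earlier blocks advance through later rounds.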
- `aes_axi_wrapper.v`: The bridge between the AXI Master and the Verilog pipeline, acting as an AXI4-Lite Slave Memory map.
  - Decouples the external 32-bit SoC bus limit from the engine's 128-bit internal datapath.
  - Contains a strict Busy/Idle hardware lockout mechanism (Status Register at `0x04`) that physically prevents the input data from being corrupted while the engine is busy.
  - Manages the 11-cycle latency state machine, capturing the ciphertext and raising a `Done` flag when finished.
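The lockout and latency behavior described above can be illustrated with a cycle-level software model. The Python sketch below is hypothetical and makes several assumptions that the README does not confirm: the class and method names are invented, a start request is ignored while Busy, `Done` is cleared when the next operation starts, and the cipher is replaced by a caller-supplied stand-in function.

```python
BUSY, IDLE, DONE = 1 << 0, 1 << 1, 1 << 3  # Status bits as listed for register 0x04
LATENCY = 11                                # pipeline latency in clock cycles


class WrapperModel:
    """Cycle-level sketch of the Busy/Idle lockout (not the actual RTL)."""

    def __init__(self, cipher):
        self.cipher = cipher      # callable (plaintext, key) -> ciphertext, assumed
        self.key = 0
        self.plaintext = 0
        self.ciphertext = 0
        self._latched = 0
        self.countdown = 0
        self.done = False

    @property
    def status(self):
        busy = self.countdown > 0
        return (BUSY if busy else IDLE) | (DONE if self.done else 0)

    def write_plaintext(self, value):
        """Accept the write only when idle; return False if it was locked out."""
        if self.countdown > 0:
            return False
        self.plaintext = value
        return True

    def start(self):
        """Model of the auto-clearing Start pulse (assumed to be ignored while Busy)."""
        if self.countdown == 0:
            self.done = False
            self._latched = self.plaintext
            self.countdown = LATENCY

    def clock(self):
        """Advance one clock cycle; capture the result when the latency elapses."""
        if self.countdown > 0:
            self.countdown -= 1
            if self.countdown == 0:
                self.ciphertext = self.cipher(self._latched, self.key)
                self.done = True


# Usage with an XOR stand-in for the cipher (NOT AES, for illustration only).
dut = WrapperModel(cipher=lambda pt, key: pt ^ key)
dut.key = 0xFF
dut.write_plaintext(0x0F)
dut.start()
rejected = not dut.write_plaintext(0xAA)  # attempted overwrite while Busy
cycles = 0
while dut.status & BUSY:
    dut.clock()
    cycles += 1
```

In this model the overwrite attempt is rejected, the Busy bit stays set for 11 calls to `clock()`, and the result is computed from the plaintext that was latched at start.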
The wrapper exposes a 64-byte AXI4-Lite memory map for external control:
| Offset | Register Name | Access | Description |
|---|---|---|---|
| `0x00` | Control | W | Bit 0: Start Engine (Auto-clearing pulse) |
| `0x04` | Status | R | Bit 0: Busy, Bit 1: Idle, Bit 3: Done |
| `0x10` - `0x1C` | Key [0:3] | W | 128-bit Master AES Key |
| `0x20` - `0x2C` | Plaintext [0:3] | W | 128-bit Data Input (Locked when Busy) |
| `0x30` - `0x3C` | Ciphertext [0:3] | R | 128-bit Encrypted Output |
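To show how a 128-bit value maps onto this register layout, the Python sketch below builds the write sequence an AXI Master would issue for one block. The offsets come from the table above; everything else is an assumption for illustration: the helper names are hypothetical, and the word order (most significant 32 bits at the lowest offset) is a guess that should be checked against `aes_axi_wrapper.v`.

```python
CONTROL, STATUS = 0x00, 0x04
KEY_BASE, PLAINTEXT_BASE, CIPHERTEXT_BASE = 0x10, 0x20, 0x30
START_BIT = 1 << 0


def split_words(value128):
    """Split a 128-bit integer into four 32-bit words, MSW first (assumed order)."""
    return [(value128 >> (96 - 32 * i)) & 0xFFFFFFFF for i in range(4)]


def join_words(words):
    """Inverse of split_words: rebuild the 128-bit value from four words."""
    value = 0
    for w in words:
        value = (value << 32) | (w & 0xFFFFFFFF)
    return value


def encrypt_transactions(key128, plaintext128):
    """Return the (offset, data) writes for one block: key, plaintext, then Start."""
    writes = [(KEY_BASE + 4 * i, w) for i, w in enumerate(split_words(key128))]
    writes += [(PLAINTEXT_BASE + 4 * i, w)
               for i, w in enumerate(split_words(plaintext128))]
    writes.append((CONTROL, START_BIT))
    return writes


def ciphertext_offsets():
    """Offsets to read once the Status register reports Done."""
    return [CIPHERTEXT_BASE + 4 * i for i in range(4)]


writes = encrypt_transactions(
    0x000102030405060708090A0B0C0D0E0F,
    0x00112233445566778899AABBCCDDEEFF,
)
for offset, data in writes:
    print(f"write 0x{offset:02X} <- 0x{data:08X}")
```

One block therefore costs nine writes, at least one Status read, and four Ciphertext reads, which is why the bus rather than the cipher core dominates the per-block time.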
The design reproduces the expected ciphertext for the official NIST FIPS-197 standard test vectors. Each round of the mathematical core resolves in a single clock cycle, yielding the following static timing results:
- Target Clock: 10.0 ns (100 MHz)
- Achieved WNS (Worst Negative Slack): +5.8 ns
- Maximum Operating Frequency (Fmax): 238 MHz
- Pipeline Latency: 11 Clock Cycles
Because the 10-stage pipeline is fully unrolled, it outputs a completed 128-bit block every single clock cycle once saturated.
- Peak Internal Engine Throughput: 30.4 Gbps (128-bit datapath processing one block per clock cycle at 238 MHz).
- Theoretical Interface Throughput: 7.6 Gbps (32-bit AXI4-Lite external bus limit: 32 bits × 238 MHz).
- Real-World System Bottleneck: Because AXI4-Lite requires a multi-cycle handshake for every 32-bit transaction, system-level throughput is bound by the AXI Master's transfer rate rather than by the cipher core. The cryptographic engine is far faster than its I/O interface.
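The figures above follow from simple arithmetic on the timing report, which the Python sketch below reproduces. The constants are the values quoted in this section; the `axi_lite_gbps` helper is a rough, hypothetical estimate whose `cycles_per_transaction` and `polls` parameters are assumptions, not measurements. Small differences from the quoted numbers come from rounding Fmax to 238 MHz.

```python
CLOCK_PERIOD_NS = 10.0   # target clock period
WNS_NS = 5.8             # reported worst negative slack (positive = timing met)
BLOCK_BITS = 128         # internal datapath width
BUS_BITS = 32            # AXI4-Lite data width
LATENCY_CYCLES = 11      # pipeline latency

min_period_ns = CLOCK_PERIOD_NS - WNS_NS        # shortest period that still meets timing
fmax_mhz = 1000.0 / min_period_ns               # about 238 MHz
engine_gbps = BLOCK_BITS * fmax_mhz / 1000.0    # one block per cycle when saturated
bus_gbps = BUS_BITS * fmax_mhz / 1000.0         # one 32-bit word per cycle, ideal case
latency_ns = LATENCY_CYCLES * min_period_ns     # first-block latency at Fmax


def axi_lite_gbps(cycles_per_transaction, polls=1):
    """Rough per-block estimate: 4 plaintext writes, 1 Start write,
    `polls` Status reads, and 4 ciphertext reads, plus the pipeline latency.
    `cycles_per_transaction` is a hypothetical parameter, not a measured value."""
    transactions = 4 + 1 + polls + 4
    cycles = transactions * cycles_per_transaction + LATENCY_CYCLES
    return BLOCK_BITS * fmax_mhz / cycles / 1000.0


print(f"Fmax    : {fmax_mhz:.1f} MHz")
print(f"Engine  : {engine_gbps:.2f} Gbps")
print(f"Bus peak: {bus_gbps:.2f} Gbps")
print(f"Latency : {latency_ns:.1f} ns")
```

Calling `axi_lite_gbps` with any realistic handshake cost gives a value well below the ideal bus figure, which is the bottleneck described in the last bullet.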
- FIPS-197 Standard Vectors: The hardware was rigorously tested and verified against official NIST vectors.
- The Avalanche Effect: Cryptographic diffusion was inspected visually in simulation. Modifying a single bit of the input plaintext results in a complete scrambling of the 128-bit state by Round 3, as expected from correctly functioning SubBytes and MixColumns stages.
- Automated RTL Control & Advanced Modes (`tb_aes_axi_lite.v`): A custom Verilog testbench simulates a generic SoC AXI Master. It uses read/write tasks to program the key and drive the memory map, verifying the following modes of operation:
  - ECB (Electronic Codebook): Natively executed in hardware.
  - CBC (Cipher Block Chaining): Testbench-driven XOR chaining that uses the RTL core as a coprocessor.
  - CTR (Counter Mode): Stream-cipher-style operation that encrypts a Nonce+Counter block and XORs the result with the data, masking data patterns while retaining high parallel throughput.
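The way CBC and CTR are layered on top of an ECB-only block primitive can be sketched in a few lines of Python. This is an illustration of the chaining logic, not a translation of `tb_aes_axi_lite.v`: the function names are hypothetical, the counter layout (96-bit nonce followed by a 32-bit big-endian counter) is an assumption, and `toy_block` is a hash-based stand-in for the hardware core that is NOT AES.

```python
import hashlib


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_encrypt(encrypt_block, iv, blocks):
    """CBC: XOR each plaintext block with the previous ciphertext, then encrypt."""
    out, prev = [], iv
    for block in blocks:
        prev = encrypt_block(xor_bytes(block, prev))
        out.append(prev)
    return out


def ctr_crypt(encrypt_block, nonce, blocks, start=0):
    """CTR: encrypt nonce||counter and XOR the keystream with the data.
    The same function both encrypts and decrypts."""
    out = []
    for i, block in enumerate(blocks):
        counter_block = nonce + (start + i).to_bytes(4, "big")  # assumed layout
        out.append(xor_bytes(block, encrypt_block(counter_block)))
    return out


def toy_block(block):
    """Stand-in for one pass through the core (16 bytes in, 16 bytes out). NOT AES."""
    return hashlib.sha256(b"demo-key" + block).digest()[:16]


blocks = [b"A" * 16, b"A" * 16]          # two identical plaintext blocks
iv = bytes(16)
nonce = bytes(12)

ecb = [toy_block(b) for b in blocks]     # identical inputs give identical outputs
cbc = cbc_encrypt(toy_block, iv, blocks)
ctr = ctr_crypt(toy_block, nonce, blocks)
recovered = ctr_crypt(toy_block, nonce, ctr)
```

With ECB the two identical plaintext blocks produce identical outputs, while CBC and CTR produce different ones. CTR blocks are also independent of each other, so they could be issued back to back into a pipelined core, whereas each CBC block must wait for the previous ciphertext.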
- EDA Tool: Xilinx Vivado 2018.3 (Synthesis & Behavioral Simulation)
- Language: Verilog-2001
- Clone this repository.
- Open Xilinx Vivado and create a new RTL project.
- Add the Verilog files to your project hierarchy.
- Set `tb_aes_axi_lite.v` as the top module for simulation.
- Launch Behavioral Simulation.
- In the Tcl Console, observe the automated execution and verification of the FIPS-197 ECB baseline, followed by the CBC and CTR mode tests. Expand the wave viewer to observe the AXI handshaking.