
Custom RTL CNN Inference Accelerator

Overview

Ongoing project: I'm building the RTL and the hardware/software stack piece by piece as I learn, targeting a Xilinx KV260. The aim is a custom weight-stationary compute engine with a sliding-window convolution dataflow operating on variable-size 16-bit fixed-point arrays.

So far: the MVM engine is done! (Met timing at 500 MHz on 16-wide, 16-bit vectors.)

What's Done:

MVM compute core

A fully parameterized VEC_W x VEC_W matrix-vector multiply engine with a baseline memory interconnect: VEC_W parallel dot cores, each computing one output element via a VEC_W-deep DSP48E2 cascade. Weight rows and the activation vector are read from dedicated BRAMs, and results are offloaded to a result BRAM. The engine starts automatically on reset deassertion and runs continuously.


Cycling behaviour:

  • INIT : fills pipeline over VEC_W + DSP_STAGES - 2 cycles; loads first activation and walks weight rows into the compute core
  • STREAM : steady state; new weight row every cycle, new activation every VEC_W cycles, one result per cycle after pipeline fill
  • WRITE_RES : fires when core_result_ready asserts; walks VEC_W result addresses into result BRAM over VEC_W cycles, then returns to STREAM

BRAM read latency (2 cycles, DOB_REG=1) is absorbed into the INIT fill phase. Weight streaming continues during WRITE_RES to avoid stalling the pipeline.
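
A minimal sketch of that sequencing, reusing only the names this README mentions (VEC_W, DSP_STAGES, core_result_ready); the counters, reset, and BRAM address generation are illustrative assumptions, not copied from control_mvm.sv:

```systemverilog
// Control-flow sketch only; fill_cnt / res_cnt / rst are assumed names.
typedef enum logic [1:0] {INIT, STREAM, WRITE_RES} state_t;
state_t state;

localparam int FILL_CYCLES = VEC_W + DSP_STAGES - 2;  // pipeline fill, absorbs BRAM read latency

always_ff @(posedge clk) begin
  if (rst) state <= INIT;                              // engine starts on reset deassertion
  else unique case (state)
    INIT:      if (fill_cnt == FILL_CYCLES - 1) state <= STREAM;
    STREAM:    if (core_result_ready)           state <= WRITE_RES;
    WRITE_RES: if (res_cnt == VEC_W - 1)        state <= STREAM;  // weight reads continue meanwhile
  endcase
end
```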

Module hierarchy:

  • dsp_primitive.sv : DSP48E2 macro instantiation and wrapper
  • dot_product_pe.sv : VEC_W-deep DSP cascade; one dot product per instance, shift register queue handling data dependency hazard
  • mvm_core.sv : VEC_W parallel dot_core instances, one per output element; activation broadcast, weight rows distributed (wiring sketched after this list)
  • control_mvm.sv : 3-state FSM (INIT/STREAM/WRITE_RES); sequences BRAM reads, drives compute core, writes results
  • act_bram.sv/weight_bram.sv : RAMB36E2 SDP, 72-bit width, explicit primitive instantiation for weights matrix and activation(s).
  • results_bram.sv : RAMB36E2 SDP, CLOCK_DOMAINS INDEPENDENT; PL write, PS read
  • mvm_pl_wrapper.sv : temporary top-level wrapper
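
A rough sketch of the broadcast/distribution wiring inside mvm_core.sv, assuming illustrative port names (act_vec, w_rows, results are not taken from the repo):

```systemverilog
// Inside mvm_core.sv (illustrative port names):
for (genvar r = 0; r < VEC_W; r++) begin : g_pe
  dot_product_pe #(.VEC_W(VEC_W)) u_pe (
    .clk        (clk),
    .activation (act_vec),      // one activation vector broadcast to all PEs
    .weight_row (w_rows[r]),    // each PE receives its own weight row
    .result     (results[r])    // one output element per PE
  );
end
```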

Some notable architecture details (so far)

Integrating accumulation stage into DSP48E2 cascaded interconnect

One of my main considerations while working on this was maximizing performance by making synthesis behaviour as predictable as possible and using native FPGA fabric resources. That meant instantiating both the memory and DSP macros manually.

DSP cascade architecture: the typical approach (parallel DSPs feeding a fabric adder tree) vs. this implementation (a cascaded DSP MAC chain).

A typical MVM approach separates multiplication (DSP slices) from accumulation (a LUT-based adder tree); I combined both stages into cascaded DSP chains using PCIN/PCOUT. The three biggest differences this makes:

Routing: DSP slices sit in columns in UltraScale+ with dedicated native interconnect between them. Accumulation through PCIN/PCOUT uses that path; no general-purpose fabric routing. Verified post-implementation that all DSPs per compute instance landed on the same column without Pblock constraints.

Latency: An adder tree adds ⌈log₂(N)⌉ extra pipeline stages on top of DSP multiply latency. PCIN/PCOUT folds accumulation into the chain itself; latency is fixed at N DSP stages, each contributing 4 cycles internally. Tradeoff is that longer cascades mean more cycles to first result, but dedicated routing makes higher frequencies achievable.

Utilization: Exclusively DSP slices; no LUTs burned on adder tree logic.

  • Worth noting: whether the cascade beats an adder tree depends on kernel size. For very large vectors an adder tree might win on latency. But larger matrices also give you more cycles before the next matrix load, which naturally absorbs the extra pipeline fill time. A future optimization would be a mux between both approaches based on vector width.
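
For reference, a condensed sketch of what a PCIN/PCOUT MAC chain looks like at the primitive level. This is not the repo's dot_product_pe.sv: the remaining control/cascade tie-offs are trimmed and names are illustrative, but the OPMODE values are the standard DSP48E2 encodings for P = M and P = M + PCIN:

```systemverilog
module dot_cascade #(parameter int VEC_W = 16) (
  input  logic               clk,
  input  logic signed [15:0] a [VEC_W],   // assumed already staggered (see hazard handling below)
  input  logic signed [15:0] b [VEC_W],
  output logic        [47:0] dot
);
  logic [47:0] pc [VEC_W];                // dedicated PCOUT -> PCIN cascade nets
  logic [47:0] p  [VEC_W];

  for (genvar i = 0; i < VEC_W; i++) begin : g_dsp
    DSP48E2 #(
      .AREG(2), .BREG(2), .MREG(1), .PREG(1),   // 4 internal pipeline stages per slice
      .OPMODEREG(0), .ALUMODEREG(0), .INMODEREG(0),
      .CARRYINREG(0), .CARRYINSELREG(0)
    ) u_dsp (
      .CLK     (clk),
      .A       ({{14{a[i][15]}}, a[i]}),        // sign-extend 16b into the 30b A port
      .B       ({{ 2{b[i][15]}}, b[i]}),        // ...and the 18b B port
      .OPMODE  (i == 0 ? 9'b000000101           // head of chain: P = M
                       : 9'b000010101),         // rest:          P = M + PCIN
      .ALUMODE (4'b0000),                       // add mode: P = Z + X + Y + CIN
      .INMODE  (5'b00000),
      .CARRYIN (1'b0), .CARRYINSEL(3'b000),
      .CEA1(1'b1), .CEA2(1'b1), .CEB1(1'b1), .CEB2(1'b1),
      .CEM (1'b1), .CEP (1'b1),
      .RSTA(1'b0), .RSTB(1'b0), .RSTM(1'b0), .RSTP(1'b0),
      .PCIN  (i == 0 ? 48'd0 : pc[i-1]),        // dedicated column routing, no fabric hops
      .PCOUT (pc[i]),
      .P     (p[i])                             // unused C/D/cascade ports left unconnected
    );
  end

  assign dot = p[VEC_W-1];                      // last slice holds the full dot product
endmodule
```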

Data dependency hazard handling

A major difference between adder tree and PCIN/PCOUT accumulation is data dependency. For maximum throughput you want a new valid result every cycle after initial pipeline fill; each valid set of data should stay in registers for exactly one clock cycle and continuously move down the pipeline.

The problem: accumulation happens at the 3rd stage of the DSP pipeline. So DSP #1 (which should output first + second partial sums) needs its input delayed by one cycle so it arrives at the accumulation stage exactly when PCOUT arrives from DSP #0. By the same logic, the ith DSP along the cascade needs its input delayed by i cycles.

Fixed with a VEC_W x VEC_W shift register queue; each operand pair gets staggered into its DSP at the correct cycle offset. This dependency hazard doesn't exist with an adder tree since all DSP instances output simultaneously and feed into fully parallel independent data lanes.
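
A minimal sketch of that staggering for one operand bus, with assumed names (a_in, a_dly are illustrative, not the repo's):

```systemverilog
// Operand pair i must reach DSP #i's multiplier i cycles after pair 0 reaches
// DSP #0, so that its product and the PCIN partial sum meet at the accumulator.
logic signed [15:0] a_in  [VEC_W];
logic signed [15:0] a_dly [VEC_W][VEC_W];   // the VEC_W x VEC_W stagger queue

always_ff @(posedge clk) begin
  for (int i = 0; i < VEC_W; i++) begin
    a_dly[i][0] <= a_in[i];                 // every lane takes one base cycle
    for (int j = 1; j <= i; j++)
      a_dly[i][j] <= a_dly[i][j-1];         // lane i adds i extra stages
  end
end
// DSP #i consumes a_dly[i][i]: delayed i cycles relative to lane 0.
```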

Sliding window convolution dataflow

Instead of being fed isolated vector x vector operands, the engine streams whole matrices: a full weight matrix row per cycle in steady state. This wasn't planned upfront; it emerged naturally. Weights persist in BRAM across activations, and sliding one new weight row per cycle while the activation stays fixed produces consecutive MVM results for overlapping weight windows. That's structurally identical to a convolution kernel sliding across input data. No architectural changes were needed; the engine already was a convolution engine.

This increases DSP usage compared to a vector-only approach, but it yields a dense convolution output for image processing: one result per cycle in steady state after pipeline fill.
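
A behavioral illustration of the equivalence, with weight rows collapsed to scalars for brevity (toy data, illustrative only, not repo code):

```systemverilog
module conv_equiv_demo;
  localparam int VEC_W = 4, N_ROWS = 8;
  int w [N_ROWS], act [VEC_W], res [N_ROWS-VEC_W+1];
  initial begin
    foreach (w[i])   w[i]   = i;          // streamed "weight rows" (scalars here)
    foreach (act[i]) act[i] = i + 1;      // activation held fixed
    // One new weight row per cycle against a fixed activation is exactly a
    // window of width VEC_W sliding over the weight stream:
    for (int t = 0; t <= N_ROWS - VEC_W; t++) begin
      res[t] = 0;
      for (int k = 0; k < VEC_W; k++)
        res[t] += w[t+k] * act[k];        // dot product of the window at offset t
      $display("window %0d -> %0d", t, res[t]);
    end
  end
endmodule
```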

RAMB36E2 explicit instantiation

Activation and weight BRAMs are instantiated directly as RAMB36E2 primitives in Simple Dual-Port mode at 72-bit width for maximum bandwidth. Explicit instantiation prevents distributed RAM inference.

Things that aren't obvious from the template:

  • SDP mode fixes the port roles: the A-side pins carry the read port and the B-side pins the write port (ADDRARDADDR is the read address, ADDRBWRADDR the write address)
  • 72-bit width spreads the data bus across DOUTADOUT and DOUTBDOUT combined
  • DOB_REG=1 adds a latency cycle the FSM has to absorb
  • REGCEB must be tied high or data won't clock through the output register

Result BRAM uses CLOCK_DOMAINS("INDEPENDENT"); PL writes on the compute clock, PS reads on its own clock when PS integration arrives.
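
An abridged instantiation showing those gotchas in one place. This is a sketch, not the repo's act_bram.sv: reset/cascade tie-offs are trimmed, and I've paired DOA_REG/REGCEAREGCE with their B-side counterparts on the assumption that the 72-bit read bus needs both output-register halves enabled:

```systemverilog
module sdp_bram_72 (
  input  logic        clk,
  input  logic        wr_en,
  input  logic [14:0] rd_addr, wr_addr,  // MSB-justified: low address bits unused at 72-bit width
  input  logic [71:0] wr_data,
  output logic [71:0] rd_data
);
  RAMB36E2 #(
    .CLOCK_DOMAINS ("COMMON"),           // results_bram.sv uses "INDEPENDENT" instead
    .DOA_REG(1), .DOB_REG(1),            // +1 cycle of read latency the FSM must absorb
    .READ_WIDTH_A  (72),                 // SDP: wide read on the A-side pins
    .WRITE_WIDTH_B (72)                  // SDP: wide write on the B-side pins
  ) u_bram (
    .CLKARDCLK   (clk),                  // read clock
    .CLKBWRCLK   (clk),                  // write clock
    .ADDRARDADDR (rd_addr),
    .ADDRBWRADDR (wr_addr),
    .ADDRENA     (1'b1), .ADDRENB(1'b1),
    .ENARDEN     (1'b1),                 // read enable
    .ENBWREN     (wr_en),                // write enable
    .WEA         (4'h0),                 // TDP-only write enables, unused in SDP
    .WEBWE       (8'hFF),                // byte enables for the 72-bit write
    .REGCEAREGCE (1'b1),                 // output-register clock enables:
    .REGCEB      (1'b1),                 //   tie high or data never clocks through
    .RSTRAMARSTRAM(1'b0), .RSTRAMB(1'b0),
    .RSTREGARSTREG(1'b0), .RSTREGB(1'b0),
    .SLEEP       (1'b0),
    .DINADIN     (wr_data[31:0]),        // write data spans the A/B DIN pins
    .DINBDIN     (wr_data[63:32]),
    .DINPADINP   (wr_data[67:64]),       // bits 64..71 ride the parity pins
    .DINPBDINP   (wr_data[71:68]),
    .DOUTADOUT   (rd_data[31:0]),        // read data likewise spans DOUTA+DOUTB
    .DOUTBDOUT   (rd_data[63:32]),
    .DOUTPADOUTP (rd_data[67:64]),
    .DOUTPBDOUTP (rd_data[71:68])
  );
endmodule
```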


Timing and Utilization

The MVM engine met timing at 500 MHz with 0.234 ns of slack! (VEC_W = 16 => 16 x 16 = 256 DSP slices.)

I used dont_touch RTL attributes on the shift register queue array declarations inside the compute primitive, and also on unused BRAM ports; without them, Vivado traces these as unobservable/unused and prunes them.
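
For reference, the attribute form Vivado accepts on a declaration (the declaration itself is the illustrative stagger queue from earlier, not the repo's exact code):

```systemverilog
(* dont_touch = "true" *) logic signed [15:0] a_dly [VEC_W][VEC_W];  // survives pruning
```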


Currently, the final implementation uses 131K flip-flops (for 16-wide, 16-bit vectors), confirmed in the post-fix utilization report. There's serious room for optimization here; my current way of handling resets may be preventing Vivado from mapping registers to SRLs. I'm holding off on that until PS integration, since it would change synthesis behaviour anyway (most of the dont_touch attributes would be gone).
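
The likely mechanism, stated as an assumption to verify: SRL16E/SRL32E primitives have no reset pin, so Vivado only packs a shift register into SRLs when its RTL has no reset term. A sketch of the SRL-friendly form:

```systemverilog
localparam int DEPTH = 16;        // illustrative depth
logic [DEPTH-1:0] dly;
always_ff @(posedge clk)          // no reset branch: eligible for SRL packing
  dly <= {dly[DEPTH-2:0], din};
```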


What's In Progress

PL side:

  • Bias addition into the DSP MAC datapath (one candidate approach sketched after this list)
  • Splitting weight memory into multiple BRAM instances, one per matrix row, to decouple memory lanes and maximize read bandwidth
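
For the bias item, one hypothetical approach (not implemented, just a sketch): feed the bias through the head DSP's C port and select C in the Z multiplexer, so the chain starts from P = M + C instead of P = M. Relative to the dot_cascade sketch above, only the head slice's connections change (CREG/CEC tie-offs omitted):

```systemverilog
// Head DSP only (i == 0); the other slices keep Z = PCIN as before:
.C      ({{32{bias[15]}}, bias}),   // sign-extended bias into the 48-bit C port
.OPMODE (9'b000110101),             // X,Y = M; Z = C  ->  P = M + C
```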

PS-PL integration: The meaningful way to test this on the SoC is through DDR DMA: streaming weights and activations from PS-side memory over AXI. I'm still studying the DMA flow before implementing.

CNN architecture: How many layers, how many running in parallel, what kernel sizes; decisions that depend on how the architecture evolves as I go.


