Ongoing project; building the RTL and hardware/software stack piece by piece as I learn. I aim to use my custom weight-stationary compute engine with sliding window convolution dataflow on variable-size 16-bit fixed-point arrays.
So far: MVM engine is done! (Met timing at 500MHz on 16-wide, 16-bit vectors).
A fully parameterized VEC_W x VEC_W matrix-vector multiply engine with a baseline memory interconnect: VEC_W parallel dot cores, each computing one output element via a VEC_W-deep DSP48E2 cascade. Weight rows and the activation vector are read from dedicated BRAMs; results are offloaded to a result BRAM. The engine starts automatically on reset deassertion and runs continuously.
Cycling behaviour:
- INIT: fills the pipeline over `VEC_W + DSP_STAGES - 2` cycles; loads the first activation and walks weight rows into the compute core
- STREAM: steady state; a new weight row every cycle, a new activation every VEC_W cycles, one result per cycle after pipeline fill
- WRITE_RES: fires when `core_result_ready` asserts; walks VEC_W result addresses into the result BRAM over VEC_W cycles, then returns to STREAM
BRAM read latency (2 cycles, DOB_REG=1) is absorbed into the INIT fill phase. Weight streaming continues during WRITE_RES to avoid stalling the pipeline.
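As a sanity check on the cycle accounting, here is a small Python sketch of the fill/steady-state budget. It is a model, not generated from the RTL; `VEC_W` and `DSP_STAGES` mirror the parameter names above, and the fill formula is the one quoted.

```python
# Hypothetical cycle-budget model of the INIT/STREAM sequencing.
# The fill formula VEC_W + DSP_STAGES - 2 is taken from the text;
# WRITE_RES overlap is ignored since weight streaming continues through it.

def pipeline_fill_cycles(vec_w: int, dsp_stages: int) -> int:
    """Cycles spent in INIT before the first result emerges."""
    return vec_w + dsp_stages - 2

def steady_state_results(total_cycles: int, vec_w: int, dsp_stages: int) -> int:
    """Results produced in a run of total_cycles, assuming one result
    per cycle in STREAM once the INIT fill completes."""
    fill = pipeline_fill_cycles(vec_w, dsp_stages)
    return max(0, total_cycles - fill)
```

With VEC_W = 16 and 4 internal DSP stages (an assumed figure), the fill is 18 cycles, after which throughput is one result per cycle.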
Module hierarchy:
- `dsp_primitive.sv`: DSP48E2 macro instantiation and wrapper
- `dot_product_pe.sv`: VEC_W-deep DSP cascade; one dot product per instance, shift register queue handling the data dependency hazard
- `mvm_core.sv`: VEC_W parallel dot_core instances, one per output element; activation broadcast, weight rows distributed
- `control_mvm.sv`: 3-state FSM (INIT/STREAM/WRITE_RES); sequences BRAM reads, drives the compute core, writes results
- `act_bram.sv` / `weight_bram.sv`: RAMB36E2 SDP, 72-bit width, explicit primitive instantiation for the weight matrix and activation(s)
- `results_bram.sv`: RAMB36E2 SDP, CLOCK_DOMAINS INDEPENDENT; PL write, PS read
- `mvm_pl_wrapper.sv`: temporary top-level wrapper
One of my main goals while working on this was maximizing performance by making synthesis behaviour as predictable as possible and by using native FPGA fabric resources. That meant instantiating both the memory and DSP macros manually.
Typical approach (parallel DSPs feeding a fabric adder tree) vs. this implementation (cascaded DSP MAC chain):
A typical MVM approach separates multiplication (DSP slices) from accumulation (a LUT-based adder tree); I combined both stages into cascaded DSP chains using PCIN/PCOUT. The three biggest differences this makes:
Routing: DSP slices sit in columns in UltraScale+ with dedicated native interconnect between them. Accumulation through PCIN/PCOUT uses that path; no general-purpose fabric routing. Verified post-implementation that all DSPs per compute instance landed on the same column without Pblock constraints.
Latency: An adder tree adds ⌈log₂(N)⌉ extra pipeline stages on top of the DSP multiply latency. PCIN/PCOUT folds accumulation into the chain itself; latency is fixed at N DSP stages, each contributing 4 cycles internally. The tradeoff is that longer cascades mean more cycles to first result, but the dedicated routing makes higher frequencies achievable.
Utilization: Exclusively DSP slices; no LUTs burned on adder tree logic.
- Worth noting: whether the cascade beats an adder tree depends on kernel size. For very large vectors an adder tree might win on latency. But larger matrices also give you more cycles before the next matrix load, which naturally absorbs the extra pipeline fill time. A future optimization would be a mux between both approaches based on vector width.
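To make the crossover concrete, here is a Python sketch of the two first-result latency estimates. This is a simplified model under assumptions from the text: `DSP_LATENCY = 4` matches the "4 cycles per stage" figure, and the cascade cost is modeled to mirror the `VEC_W + DSP_STAGES - 2` fill formula (one extra cycle per PCIN/PCOUT hop); neither formula comes from measured timing.

```python
import math

DSP_LATENCY = 4  # assumed DSP48E2 internal pipeline depth, per the text

def adder_tree_latency(n: int) -> int:
    # multiply in parallel DSPs, then ceil(log2(n)) fabric adder stages
    return DSP_LATENCY + math.ceil(math.log2(n))

def cascade_latency(n: int) -> int:
    # one cycle per PCIN/PCOUT hop; mirrors the INIT fill formula
    # VEC_W + DSP_STAGES - 2 quoted earlier
    return n + DSP_LATENCY - 2
```

Under these assumptions the two are even around N = 4, and the tree's logarithmic depth wins on raw latency for longer vectors, which is exactly the tradeoff the note above describes; the cascade buys routing locality and clock frequency instead.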
A major difference between adder tree and PCIN/PCOUT accumulation is data dependency. For maximum throughput you want a new valid result every cycle after initial pipeline fill; each valid set of data should stay in registers for exactly one clock cycle and continuously move down the pipeline.
The problem: accumulation happens at the 3rd stage of the DSP pipeline. So DSP #1 (which should output the first two partial products summed) needs its input delayed by one cycle so it arrives at the accumulation stage exactly when PCOUT arrives from DSP #0. By the same logic, the i-th DSP along the cascade needs its input delayed by i cycles.
Fixed with a VEC_W x VEC_W shift register queue; each operand pair gets staggered into its DSP at the correct cycle offset. This dependency hazard doesn't exist with an adder tree since all DSP instances output simultaneously and feed into fully parallel independent data lanes.
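The stagger schedule can be checked with a behavioral Python model. This is a sketch, not the RTL: each "DSP" is reduced to one partial-sum register per PCOUT→PCIN hop, and operand pair i is delayed by i cycles, as described above. All names are illustrative.

```python
def staggered_cascade(pairs, vec_w):
    """pairs: list of (activation_vec, weight_row) operand sets, one
    entering per cycle. Stage i consumes the set from cycle t - i, so
    its product meets the partial sum from stage i-1 exactly on time.
    Returns one dot product per operand set, in order."""
    results = []
    pc = [0] * vec_w                      # pcout register of each stage
    for t in range(len(pairs) + vec_w):   # run long enough to drain
        new_pc = [0] * vec_w
        for i in range(vec_w):
            src = t - i                   # operand set stage i sees now
            if 0 <= src < len(pairs):
                a, w = pairs[src]
                prev = pc[i - 1] if i > 0 else 0   # last cycle's PCOUT
                new_pc[i] = prev + a[i] * w[i]
        done = t - (vec_w - 1)            # set completed by final stage
        if 0 <= done < len(pairs):
            results.append(new_pc[vec_w - 1])
        pc = new_pc
    return results
```

Feeding one operand set per cycle, the model emits one correct dot product per cycle after a fill of `vec_w - 1` cycles, which is the throughput goal the shift register queue exists to protect.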
Instead of feeding isolated vector x vector operand pairs, the engine streams whole matrices: a full weight matrix row per cycle in steady state. This wasn't planned upfront; it emerged naturally. Weights persist in BRAM across activations, and sliding one new weight row per cycle while the activation stays fixed produces consecutive MVM results for overlapping weight windows. That's structurally identical to a convolution kernel sliding across input data. No architectural changes needed; the engine just was convolution.
This increases DSP usage compared to a vector-only approach but yields much denser convolution output for image processing, with one result per cycle in steady state after pipeline fill.
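The equivalence is easy to state in a few lines of Python. This is a 1-D behavioral analogy, not the RTL: the weight stream slides one element per cycle past a fixed activation vector, which is exactly a sliding-window cross-correlation with the activation as the kernel.

```python
def sliding_mvm(weight_stream, activation):
    """One result per window position: consecutive dot products over
    overlapping weight windows against a fixed activation vector --
    i.e., sliding-window convolution (cross-correlation form)."""
    k = len(activation)
    return [sum(weight_stream[i + j] * activation[j] for j in range(k))
            for i in range(len(weight_stream) - k + 1)]
```

Each output here corresponds to one steady-state MVM result in the engine; no extra hardware is implied, just a reinterpretation of the weight addressing pattern.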
Activation and weight BRAMs are instantiated directly from RAMB36E2 primitives in Simple Dual-Port mode: write port A, read port B, 72-bit width for maximum bandwidth. Explicit instantiation prevents distributed RAM inference.
Things that aren't obvious from the template:
- SDP assigns port A as write and port B as read
- 72-bit width spreads the data bus across DOUTADOUT and DOUTBDOUT combined
- DOB_REG=1 adds a latency cycle the FSM has to absorb
- REGCEB must be tied high or data won't clock through the output register
Result BRAM uses CLOCK_DOMAINS("INDEPENDENT"); PL writes on the compute clock, PS reads on its own clock when PS integration arrives.
MVM Engine met timing at 500MHz with 0.234ns slack! (VEC_W = 16 => 16 x 16 = 256 DSP slices)
I used dont_touch RTL attributes on the shift register queue array declarations inside the compute primitive, and also on unused BRAM ports; without them, Vivado treats them as unobservable/unused and prunes them.
Currently, the final implementation uses 131K flip-flops (for 16-wide, 16-bit vectors), confirmed in post-fix utilization. There is serious room for optimization here; my current way of handling resets may be preventing Vivado from mapping registers to SRLs. I'm holding off on that until PS integration, since it would change synthesis behavior anyway (most dont_touch attributes would be gone).
PL side:
- Bias addition into the DSP MAC datapath
- Splitting weight memory into multiple BRAM instances, one per matrix row, to decouple memory lanes and maximize read bandwidth
PS-PL integration: The meaningful way to test this on the SoC is through DDR DMA; streaming weights and activations from PS-side memory over AXI. Still studying the DMA flow before implementing.
CNN architecture: How many layers, how many running in parallel, what kernel sizes; decisions that depend on how the architecture evolves as I go.

