This repository contains the implementation and analysis of matrix multiplication (GEMM) operations on the Pynq-Z2 FPGA platform.
GEMM/
├── part01_gemm_vitis/                # HLS implementations
│   ├── 00_gemm_naive/                # Naive implementation
│   ├── 01_inner_loop_pipelined/      # Pipelined version
│   ├── 02_array_partition/           # Array partitioned version
│   └── 03_block_tiled_gemm/          # Block tiled + pipelined + partitioned
├── part02_gemm_vivado/               # Vivado project files
│   ├── drc_report.txt                # Design Rule Check report
│   ├── power_report.txt              # Power analysis report
│   ├── timing_report.txt             # Timing analysis report
│   ├── utilization_report.txt        # Resource utilization report
│   └── gemm_block_design.png         # Block design diagram
└── part03_gemm_fpga/                 # FPGA deployment files
    ├── overlays/
    │   ├── gemm.bit                  # FPGA bitstream
    │   └── gemm.hwh                  # Hardware handoff file
    ├── gemm.tcl                      # TCL script for project recreation
    └── gemm.ipynb                    # Jupyter notebook for execution
| Implementation | Latency (cycles) | Speedup vs Naive | BRAM | DSP | FF | LUT |
|---|---|---|---|---|---|---|
| Naive GEMM | 12,582,933 | 1.00x | 6 (2%) | 5 (2%) | 3,518 (3%) | 4,133 (7%) |
| Pipelined | 2,181,518 | 5.77x | 6 (2%) | 5 (2%) | 49,277 (46%) | 28,498 (53%) |
| Pipelined + Array Partition | 346,510 | 36.31x | 34 (12%) | 40 (18%) | 66,468 (62%) | 45,226 (85%) |
| Block Tiling + Pipelined + Array Partition | 272,057 | 46.25x | 71 (25%) | 160 (72%) | 39,308 (36%) | 36,966 (69%) |
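
The speedup column is the naive latency divided by each variant's latency. A minimal sketch of that arithmetic, using a hypothetical helper `speedup` that is not part of the repository:

```cpp
#include <cstdint>

// Hypothetical helper: speedup of an optimized design relative to a
// baseline, with both latencies given in clock cycles.
double speedup(std::uint64_t baseline_cycles, std::uint64_t optimized_cycles) {
    return static_cast<double>(baseline_cycles) /
           static_cast<double>(optimized_cycles);
}
```

For example, `speedup(12582933, 272057)` evaluates to roughly 46.25, matching the last row of the table.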
Naive GEMM:
- Baseline implementation with basic nested loops
- Lowest resource utilization (at most 7% of any resource) but highest latency (12,582,933 cycles)
- Limited parallelism, resulting in poor performance
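
A naive baseline of this kind might look like the following sketch. The function name `gemm_naive`, the dimension `N = 128`, and the `float` element type are assumptions for illustration; the source does not state them.

```cpp
// Assumption: 128x128 single-precision matrices (not stated in the source).
constexpr int N = 128;

// Naive GEMM: C = A * B using three nested loops and no HLS directives.
// Each output element is one sequential dot product.
void gemm_naive(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k) {
                acc += A[i][k] * B[k][j];
            }
            C[i][j] = acc;
        }
    }
}
```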
Pipelined:
- 5.77x speedup over the naive version
- Large increase in FF (3% to 46%) and LUT (7% to 53%) usage
- BRAM and DSP utilization unchanged at 2% each
- Timing violation observed (slack of -1.19)
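
Inner-loop pipelining can be sketched as below. The function name `gemm_pipelined`, `N = 128`, and `float` are assumptions; only the `PIPELINE` directive itself is the point. Outside an HLS tool the pragma is ignored, so the function also compiles and runs as ordinary C++.

```cpp
// Assumption: 128x128 single-precision matrices (not stated in the source).
constexpr int N = 128;

// GEMM with the innermost loop pipelined. The directive requests an
// initiation interval (II) of 1, i.e. one new k iteration per cycle; the
// tool may report a larger II if the accumulation dependency prevents it.
void gemm_pipelined(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k) {
#pragma HLS PIPELINE II=1
                acc += A[i][k] * B[k][j];
            }
            C[i][j] = acc;
        }
    }
}
```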
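Pipelined + Array Partition:
- 36.31x speedup over the naive version
- Highest FF (62%) and LUT (85%) usage of the four designs, with BRAM at 12% and DSP at 18%
- Timing violation persists (slack of -0.91)
- Array partitioning increases on-chip memory bandwidth by allowing several elements to be accessed in the same cycle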
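
One way to combine pipelining with array partitioning is sketched here. The local buffers `a_buf` and `b_buf`, the cyclic factor of 8, `N = 128`, and `float` are all assumptions rather than the repository's actual settings, and the exact pragma spelling can vary between tool versions.

```cpp
// Assumptions: 128x128 float matrices and a partition/unroll factor of 8.
constexpr int N = 128;
constexpr int P = 8;  // must match the literal factors in the pragmas below
static_assert(N % P == 0, "N must be a multiple of the partition factor");

void gemm_partitioned(const float A[N][N], const float B[N][N], float C[N][N]) {
    // Local copies that can be split across several memories so that
    // P elements along k are readable in the same cycle.
    float a_buf[N][N];
    float b_buf[N][N];
#pragma HLS ARRAY_PARTITION variable=a_buf cyclic factor=8 dim=2
#pragma HLS ARRAY_PARTITION variable=b_buf cyclic factor=8 dim=1

    // Copy the inputs into the partitioned buffers.
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
#pragma HLS PIPELINE II=1
            a_buf[i][j] = A[i][j];
            b_buf[i][j] = B[i][j];
        }
    }

    // Dot products: the k loop is unrolled by 8 and pipelined, which the
    // partitioned buffers can feed without memory-port conflicts.
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=8
                acc += a_buf[i][k] * b_buf[k][j];
            }
            C[i][j] = acc;
        }
    }
}
```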
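Block Tiling + Pipelined + Array Partition:
- Lowest latency of the four designs (272,057 cycles, 46.25x speedup)
- Lower FF (36%) and LUT (69%) usage than the array-partitioned version, at the cost of the highest BRAM (25%) and DSP (72%) usage
- Timing violations resolved (no negative slack)
- Data is loaded and processed at block (tile) granularity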
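
Block tiling with on-chip tile buffers can be sketched as follows. The tile size `T = 32`, the buffer names, `N = 128`, and `float` are assumptions chosen for illustration, not values taken from the repository.

```cpp
// Assumptions: 128x128 float matrices processed in 32x32 tiles.
constexpr int N = 128;
constexpr int T = 32;
static_assert(N % T == 0, "N must be a multiple of the tile size");

void gemm_tiled(const float A[N][N], const float B[N][N], float C[N][N]) {
    float a_tile[T][T];
    float b_tile[T][T];
    float c_tile[T][T];
    // Fully partition along k so a whole row/column slice is readable at once.
#pragma HLS ARRAY_PARTITION variable=a_tile complete dim=2
#pragma HLS ARRAY_PARTITION variable=b_tile complete dim=1

    for (int i0 = 0; i0 < N; i0 += T) {
        for (int j0 = 0; j0 < N; j0 += T) {
            // Clear the accumulator tile for output block (i0, j0).
            for (int i = 0; i < T; ++i)
                for (int j = 0; j < T; ++j)
                    c_tile[i][j] = 0.0f;

            for (int k0 = 0; k0 < N; k0 += T) {
                // Load one tile of A and one tile of B into local buffers.
                for (int r = 0; r < T; ++r) {
                    for (int c = 0; c < T; ++c) {
#pragma HLS PIPELINE II=1
                        a_tile[r][c] = A[i0 + r][k0 + c];
                        b_tile[r][c] = B[k0 + r][j0 + c];
                    }
                }
                // Multiply the tiles; the k loop is fully unrolled so each
                // pipelined (i, j) iteration performs T multiply-adds.
                for (int i = 0; i < T; ++i) {
                    for (int j = 0; j < T; ++j) {
#pragma HLS PIPELINE II=1
                        float acc = c_tile[i][j];
                        for (int k = 0; k < T; ++k) {
#pragma HLS UNROLL
                            acc += a_tile[i][k] * b_tile[k][j];
                        }
                        c_tile[i][j] = acc;
                    }
                }
            }

            // Write the finished output tile back.
            for (int i = 0; i < T; ++i)
                for (int j = 0; j < T; ++j)
                    C[i0 + i][j0 + j] = c_tile[i][j];
        }
    }
}
```

Keeping each tile on chip means every loaded element is reused `T` times before the next load, which is the usual motivation for tiling.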
- Successfully generated GEMM IP core
- Integrated IP into board design
- Generated bitstream for FPGA configuration
The block design (gemm_block_design.png) shows the GEMM IP integrated with AXI interfaces and the other components required for the Pynq-Z2 board implementation.
- overlays/: Contains essential FPGA configuration files
  - gemm.bit: Bitstream file
  - gemm.hwh: Hardware handoff file
- gemm.ipynb: Jupyter notebook for execution and testing
- Hardware GEMM (FPGA) execution time: 0.014266 seconds
- Software GEMM (Cortex-A9) execution time: 20.150390 seconds
- Achieved speedup: 1412.50x
In on-board testing on the Pynq-Z2, the FPGA implementation ran GEMM 1412.50x faster than the software version on the Cortex-A9 (0.014266 s versus 20.150390 s). The final block-tiled design has the lowest latency of the four HLS variants and no negative slack, although it uses 72% of the DSPs and 69% of the LUTs.