
HLS-study-project

Vitis HLS 2024.2
part=xcku035-fbva676-2-e

What is HLS

High-Level Synthesis (HLS) converts an algorithm written in a high-level language into RTL code, which can then be used to implement the design in hardware on an FPGA.

HLS pragma

A pragma is a keyword that gives Vitis HLS a directive, helping to optimize the hardware design and control the behavior of the generated RTL. These directives can be used for performance tuning, resource allocation, and design-flow optimization.
Below are the pragmas I have studied and used so far (the official Vitis HLS documentation also describes them in detail):

pragma HLS pipeline

void sum_array(int in[8], int* out) {
#pragma HLS PIPELINE
    int total = 0;
    for (int i = 0; i < 8; i++) {
        total += in[i];
    }
    *out = total;
}

Executes a loop or function in a pipelined fashion to increase efficiency.
A pipeline splits a computation into multiple stages so that different operations run concurrently: every clock cycle can accept new input data and produce a new result.
You can set the Initiation Interval (II) yourself: the number of clock cycles before a new iteration can start. II=1 gives the highest throughput.
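As a minimal sketch (my addition, not from the original project), the II can also be relaxed explicitly, trading throughput for lower resource pressure:

// Hypothetical variant: II=2 accepts a new input every 2 cycles instead of
// every cycle, which can help when II=1 fails due to timing or port limits.
void sum_array_ii2(int in[8], int* out) {
#pragma HLS PIPELINE II=2
    int total = 0;
    for (int i = 0; i < 8; i++) {
        total += in[i];
    }
    *out = total;
}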

pragma HLS unroll

void sum_array_unroll(int in[8], int* out) {
    int total = 0;
    for (int i = 0; i < 8; i++) {
        #pragma HLS UNROLL
        total += in[i];
    }
    *out = total;
}

Unlike pipeline, unroll expands the loop into multiple parallel compute units. After unrolling, the operations of every loop iteration execute simultaneously in hardware, rather than one after another as in software.
(Illustration taken from the web; image not reproduced here.)
As the reports below show, unroll is usually more efficient than pipeline (lower latency), but it also consumes more resources.

pragma HLS array_partition

array_partition is a directive that structurally splits an array, dividing one large array into several smaller pieces so they can be accessed simultaneously in hardware.
In hardware, an array has only one port by default, so only one read or write can happen at a time. When all the data comes from the same memory, the resulting contention prevents HLS from parallelizing.
Solution: split the array into multiple memory slices, each with its own port.

#pragma HLS array_partition variable=<array_name> type=<partition_type> dim=<dimension>

type=complete (fully split into individual registers) / block (contiguous chunks) -> block must be paired with a factor (Vitis HLS also supports type=cyclic, which interleaves elements round-robin across banks)
dim -> which dimension to partition
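A quick sketch of the dim option (a hypothetical example I'm adding, not from the repo):

// dim=1 splits the first dimension: mat becomes 4 independent row memories,
// so the 4 row sums below can run in parallel (dim=0 would split all dims).
void row_sums(int mat[4][8], int out[4]) {
#pragma HLS array_partition variable=mat type=complete dim=1
    for (int r = 0; r < 4; r++) {
#pragma HLS UNROLL
        int total = 0;
        for (int c = 0; c < 8; c++) {
            total += mat[r][c];
        }
        out[r] = total;
    }
}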

void sum_array_unroll(int in[8], int* out) {
#pragma HLS array_partition variable=in complete
    int total = 0;
    for (int i = 0; i < 8; i++) {
        #pragma HLS UNROLL
        total += in[i];
    }
    *out = total;
}

unroll is often used together with array_partition, because unroll only tells the tool "I want to expand this loop"; whether it actually can be expanded depends on whether the data can be accessed simultaneously.
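When a complete partition is too expensive, a partial unroll can be paired with a matching cyclic partition; a sketch (my addition, assuming two reads per cycle are enough):

// Hypothetical variant: the unroll factor and the cyclic partition factor
// match, so the two elements read in each iteration come from different banks.
void sum_array_factor2(int in[8], int* out) {
#pragma HLS array_partition variable=in type=cyclic factor=2 dim=1
    int total = 0;
    for (int i = 0; i < 8; i++) {
#pragma HLS UNROLL factor=2
        total += in[i];
    }
    *out = total;
}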

pragma HLS DATAFLOW

It lets multiple functions or loops execute concurrently in hardware, like several pipelines chained together.
The classic usage pattern:

#pragma HLS DATAFLOW

read_input(input_stream, buf);
compute(buf, result);
write_output(result, output_stream);

A concrete example from this project (the two-dense-layer model, shown in full later):

void dense_model(int W1[HIDDEN_DIM][IN_DIM], int W2[OUT_DIM][HIDDEN_DIM],
                 int b1[HIDDEN_DIM], int b2[OUT_DIM], int x[IN_DIM], int y[OUT_DIM]) {
#pragma HLS DATAFLOW
    int h[HIDDEN_DIM];
#pragma HLS array_partition variable=h complete

    dense1(W1, x, b1, h);
    dense2(W2, h, b2, y);
}
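The read/compute/write pattern is typically wired up with hls::stream channels; a minimal self-contained sketch (my assumption of the usual shape, not code from this repo):

#include "hls_stream.h"

// Three stages connected by FIFO streams; under DATAFLOW they execute
// concurrently, each stage consuming data as soon as it arrives.
static void read_input(hls::stream<int>& in, hls::stream<int>& buf) {
    for (int i = 0; i < 8; i++) buf.write(in.read());
}
static void compute(hls::stream<int>& buf, hls::stream<int>& result) {
    for (int i = 0; i < 8; i++) result.write(buf.read() * 2);
}
static void write_output(hls::stream<int>& result, hls::stream<int>& out) {
    for (int i = 0; i < 8; i++) out.write(result.read());
}

void stream_top(hls::stream<int>& in, hls::stream<int>& out) {
#pragma HLS DATAFLOW
    hls::stream<int> buf, result;
    read_input(in, buf);
    compute(buf, result);
    write_output(result, out);
}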


Getting started in Vitis

  1. Create a workspace to hold the components you will create
  2. Now you can create a component
  3. Add the .cpp file you want to synthesize, plus your own testbench (you can also skip this step for now)
  4. Set up the target part
  5. Setup is complete
     If you skipped step 3, you can add the files after creation (this is what I do)
     Remember to set the top function (the unit that HLS synthesizes)

How to proceed - using sum_array as an example

#include "ap_int.h"

void sum_array_unroll(int in[8], int* out) {
#pragma HLS array_partition variable=in complete
    int total = 0;
    for (int i = 0; i < 8; i++) {
        #pragma HLS UNROLL
        total += in[i];
    }
    *out = total;
}
//testbench
#include <iostream>
using namespace std;

void sum_array_unroll(int in[8], int* out);

int main() {
    int in[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int result;
    sum_array_unroll(in, &result);
    cout << "sum = " << result << endl;
    if (result == 36) {
        cout << "PASS" << endl;
        return 0;
    } else {
        cout << "FAIL" << endl;
        return 1;
    }
}

Then run C Simulation and check the test output to verify the program is functionally correct.

 sum = 36
 PASS
 INFO: [SIM 211-1] CSim done with 0 errors.
 INFO: [SIM 211-3] *************** CSIM finish ***************
 INFO: [HLS 200-112] Total CPU user time: 2 seconds. Total CPU system time: 1 seconds. Total elapsed time: 7.338 seconds; peak allocated memory: 265.996 MB.
 INFO: [vitis-run 60-791] Total elapsed time: 0h 0m 12s
 C-simulation finished successfully

Next, run C Synthesis to generate the Verilog code and reports.
In the syn->report folder you will find a "<filename>_synth.rpt".

// Without unroll
+ Latency: 
    * Summary: 
    +---------+---------+----------+----------+-----+-----+------------------------------------------------+
    |  Latency (cycles) |  Latency (absolute) |  Interval |                    Pipeline                    |
    |   min   |   max   |    min   |    max   | min | max |                      Type                      |
    +---------+---------+----------+----------+-----+-----+------------------------------------------------+
    |       10|       10|  0.100 us|  0.100 us|    9|    9|  loop auto-rewind stp (delay=1 clock cycles(s))|
    +---------+---------+----------+----------+-----+-----+------------------------------------------------+
* Summary: 
+-----------------+---------+------+--------+--------+-----+
|       Name      | BRAM_18K|  DSP |   FF   |   LUT  | URAM|
+-----------------+---------+------+--------+--------+-----+
|DSP              |        -|     -|       -|       -|    -|
|Expression       |        -|     -|       0|      63|    -|
|FIFO             |        -|     -|       -|       -|    -|
|Instance         |        -|     -|       -|       -|    -|
|Memory           |        -|     -|       -|       -|    -|
|Multiplexer      |        -|     -|       0|      45|    -|
|Register         |        -|     -|      41|       -|    -|
+-----------------+---------+------+--------+--------+-----+
|Total            |        0|     0|      41|     108|    0|
+-----------------+---------+------+--------+--------+-----+
// With unroll
+ Latency: 
    * Summary: 
    +---------+---------+----------+----------+-----+-----+---------+
    |  Latency (cycles) |  Latency (absolute) |  Interval | Pipeline|
    |   min   |   max   |    min   |    max   | min | max |   Type  |
    +---------+---------+----------+----------+-----+-----+---------+
    |        0|        0|      0 ns|      0 ns|    1|    1|       no|
    +---------+---------+----------+----------+-----+-----+---------+
* Summary: 
+-----------------+---------+------+--------+--------+-----+
|       Name      | BRAM_18K|  DSP |   FF   |   LUT  | URAM|
+-----------------+---------+------+--------+--------+-----+
|DSP              |        -|     -|       -|       -|    -|
|Expression       |        -|     -|       0|     245|    -|
|FIFO             |        -|     -|       -|       -|    -|
|Instance         |        -|     -|       -|       -|    -|
|Memory           |        -|     -|       -|       -|    -|
|Multiplexer      |        -|     -|       -|       -|    -|
|Register         |        -|     -|       -|       -|    -|
+-----------------+---------+------+--------+--------+-----+
|Total            |        0|     0|       0|     245|    0|
+-----------------+---------+------+--------+--------+-----+

The reports show that unroll clearly improves performance but uses noticeably more resources,
which demonstrates how important #pragma choices are.
We can also run C/RTL Cosimulation (hardware-level correctness verification).

Dense Layer

#include "ap_int.h"

#define IN_DIM  8
#define OUT_DIM 4

void dense(float W[OUT_DIM][IN_DIM], float x[IN_DIM], float b[OUT_DIM], float y[OUT_DIM]) {
#pragma HLS array_partition variable=W type=complete
#pragma HLS array_partition variable=x type=complete
#pragma HLS array_partition variable=b type=complete
#pragma HLS array_partition variable=y type=complete
#pragma HLS PIPELINE II=1

    for (int i = 0; i < OUT_DIM; i++) {
#pragma HLS UNROLL
        float acc = b[i];
        for (int j = 0; j < IN_DIM; j++) {
#pragma HLS UNROLL
            acc += W[i][j] * x[j];
        }
        y[i] = (acc > 0) ? acc : 0;
    }
}
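In equation form, each output of this layer computes (matching the code above, with the ReLU fused in):

y[i] = max(0, b[i] + Σ_j W[i][j] * x[j]),  for i = 0 .. OUT_DIM-1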

2 dense layers

#include "ap_int.h"

#define IN_DIM  8
#define HIDDEN_DIM 4
#define OUT_DIM 2

void dense1(int W1[HIDDEN_DIM][IN_DIM], int x[IN_DIM], int b1[HIDDEN_DIM], int h[HIDDEN_DIM]) {
#pragma HLS array_partition variable=W1 type=complete
#pragma HLS array_partition variable=x type=complete
#pragma HLS array_partition variable=b1 type=complete
#pragma HLS array_partition variable=h type=complete
#pragma HLS PIPELINE II=1

    for (int i = 0; i < HIDDEN_DIM; i++) {
#pragma HLS UNROLL
        int acc = b1[i];
        for (int j = 0; j < IN_DIM; j++) {
#pragma HLS UNROLL
            acc += W1[i][j] * x[j];
        }
        if (acc < 0) acc = 0;
        h[i] = acc;
    }
}

void dense2(int W2[OUT_DIM][HIDDEN_DIM], int h[HIDDEN_DIM], int b2[OUT_DIM], int y[OUT_DIM]) {
#pragma HLS array_partition variable=W2 type=complete
#pragma HLS array_partition variable=h type=complete
#pragma HLS array_partition variable=b2 type=complete
#pragma HLS array_partition variable=y type=complete
#pragma HLS PIPELINE II=1

    for (int i = 0; i < OUT_DIM; i++) {
#pragma HLS UNROLL
        int acc = b2[i];
        for (int j = 0; j < HIDDEN_DIM; j++) {
#pragma HLS UNROLL
            acc += W2[i][j] * h[j];
        }
        y[i] = acc;
    }
}

void dense_model(int W1[HIDDEN_DIM][IN_DIM], int W2[OUT_DIM][HIDDEN_DIM],
                 int b1[HIDDEN_DIM], int b2[OUT_DIM], int x[IN_DIM], int y[OUT_DIM]) {
#pragma HLS DATAFLOW
    int h[HIDDEN_DIM];
#pragma HLS array_partition variable=h complete

    dense1(W1, x, b1, h);
    dense2(W2, h, b2, y);
}

A dense layer computes a linear combination of its inputs.
Every output neuron is connected to all input neurons,
so it can learn an arbitrary linear transformation, making it suitable for feature extraction, mapping, and the final layer of a classifier.
This can later be used in tasks such as CNNs; see the testbench sketch below.
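A minimal testbench sketch for dense_model, in the same style as the sum_array testbench (the all-ones weights and the expected value 144 are my own illustration, not from the repo):

#include <iostream>
using namespace std;

#define IN_DIM  8
#define HIDDEN_DIM 4
#define OUT_DIM 2

void dense_model(int W1[HIDDEN_DIM][IN_DIM], int W2[OUT_DIM][HIDDEN_DIM],
                 int b1[HIDDEN_DIM], int b2[OUT_DIM], int x[IN_DIM], int y[OUT_DIM]);

int main() {
    // All-ones weights and zero biases: h[i] = sum(x) = 36, y[i] = 4 * 36 = 144.
    int W1[HIDDEN_DIM][IN_DIM], W2[OUT_DIM][HIDDEN_DIM];
    int b1[HIDDEN_DIM] = {0}, b2[OUT_DIM] = {0};
    int x[IN_DIM] = {1, 2, 3, 4, 5, 6, 7, 8};
    int y[OUT_DIM];

    for (int i = 0; i < HIDDEN_DIM; i++)
        for (int j = 0; j < IN_DIM; j++) W1[i][j] = 1;
    for (int i = 0; i < OUT_DIM; i++)
        for (int j = 0; j < HIDDEN_DIM; j++) W2[i][j] = 1;

    dense_model(W1, W2, b1, b2, x, y);

    for (int i = 0; i < OUT_DIM; i++) {
        cout << "y[" << i << "] = " << y[i] << endl;
        if (y[i] != 144) { cout << "FAIL" << endl; return 1; }
    }
    cout << "PASS" << endl;
    return 0;
}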

Attention Score & Softmax

#include "ap_fixed.h"
#include <hls_math.h>

#define DIM 4

typedef ap_fixed<16, 6> data_t;

// ----- Compute Q·K^T -----
void attention_score(data_t Q[DIM], data_t K[DIM], data_t* score_out) {
#pragma HLS array_partition variable=Q complete
#pragma HLS array_partition variable=K complete

    data_t score = 0;
    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        score += Q[i] * K[i];
    }
    *score_out = score;
}

// ----- Softmax over fixed-length 1D input -----
void softmax(data_t input[DIM], data_t output[DIM]) {
#pragma HLS array_partition variable=input complete
#pragma HLS array_partition variable=output complete

    data_t max_val = input[0];
    for (int i = 1; i < DIM; i++) {
#pragma HLS UNROLL
        if (input[i] > max_val) max_val = input[i];
    }

    data_t sum = 0;
    data_t exp_val[DIM];
#pragma HLS array_partition variable=exp_val complete

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        exp_val[i] = hls::exp(input[i] - max_val);
        sum += exp_val[i];
    }

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        output[i] = exp_val[i] / sum;
    }
}

Computing the inner product of Q with each K vector gives their relevance scores.
Softmax then converts the scores into a probability distribution that sums to 1,
helping the model build meaningful contextual relationships.
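In equation form (the max subtraction is the standard numerical-stability trick; here it also keeps the fixed-point exp arguments non-positive):

softmax(s[i]) = exp(s[i] - max_j s[j]) / Σ_k exp(s[k] - max_j s[j])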

Multi-Head Attention -> Transformer Block


// Note: DIM, HEADS, HEAD_DIM, and data_t are defined in the header excerpt
// shown further below.
void attention_head(
    data_t Q_proj[HEAD_DIM], data_t K_proj[DIM][HEAD_DIM], data_t V_proj[DIM][HEAD_DIM], data_t out[HEAD_DIM]) {
#pragma HLS array_partition variable=Q_proj complete
#pragma HLS array_partition variable=K_proj complete dim=2
#pragma HLS array_partition variable=V_proj complete dim=2
#pragma HLS array_partition variable=out complete

    data_t scores[DIM];
#pragma HLS array_partition variable=scores complete

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        attention_score(Q_proj, K_proj[i], &scores[i]);
    }

    data_t weights[DIM];
    softmax(scores, weights);

    for (int i = 0; i < HEAD_DIM; i++) {
#pragma HLS UNROLL
        out[i] = 0;
        for (int j = 0; j < DIM; j++) {
#pragma HLS UNROLL
            out[i] += weights[j] * V_proj[j][i];
        }
    }
}

void multi_head_attention(
    data_t Q[DIM], data_t K[DIM][DIM], data_t V[DIM][DIM],
    data_t W_Q[HEADS][HEAD_DIM][DIM],
    data_t W_K[HEADS][HEAD_DIM][DIM],
    data_t W_V[HEADS][HEAD_DIM][DIM],
    data_t W_O[DIM][HEADS * HEAD_DIM],
    data_t output[DIM]) {

#pragma HLS array_partition variable=Q complete
#pragma HLS array_partition variable=K complete dim=2
#pragma HLS array_partition variable=V complete dim=2
#pragma HLS array_partition variable=W_Q complete dim=2
#pragma HLS array_partition variable=W_K complete dim=2
#pragma HLS array_partition variable=W_V complete dim=2
#pragma HLS array_partition variable=W_O complete dim=2
#pragma HLS array_partition variable=output complete

    data_t concat_heads[HEADS * HEAD_DIM];
#pragma HLS array_partition variable=concat_heads complete

    for (int h = 0; h < HEADS; h++) {
#pragma HLS UNROLL
        data_t Q_proj[HEAD_DIM], K_proj[DIM][HEAD_DIM], V_proj[DIM][HEAD_DIM];
        data_t head_out[HEAD_DIM];
#pragma HLS array_partition variable=Q_proj complete
#pragma HLS array_partition variable=K_proj complete dim=2
#pragma HLS array_partition variable=V_proj complete dim=2
#pragma HLS array_partition variable=head_out complete

        for (int i = 0; i < HEAD_DIM; i++) {
#pragma HLS UNROLL
            Q_proj[i] = 0;
            for (int j = 0; j < DIM; j++) Q_proj[i] += W_Q[h][i][j] * Q[j];
        }
        for (int m = 0; m < DIM; m++) {
            for (int i = 0; i < HEAD_DIM; i++) {
#pragma HLS UNROLL
                K_proj[m][i] = 0;
                for (int j = 0; j < DIM; j++) K_proj[m][i] += W_K[h][i][j] * K[m][j];
            }
        }
        for (int m = 0; m < DIM; m++) {
            for (int i = 0; i < HEAD_DIM; i++) {
#pragma HLS UNROLL
                V_proj[m][i] = 0;
                for (int j = 0; j < DIM; j++) V_proj[m][i] += W_V[h][i][j] * V[m][j];
            }
        }

        attention_head(Q_proj, K_proj, V_proj, head_out);

        for (int i = 0; i < HEAD_DIM; i++) {
#pragma HLS UNROLL
            concat_heads[h * HEAD_DIM + i] = head_out[i];
        }
    }

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        output[i] = 0;
        for (int j = 0; j < HEADS * HEAD_DIM; j++) {
#pragma HLS UNROLL
            output[i] += W_O[i][j] * concat_heads[j];
        }
    }
}


#include "ap_fixed.h"
#include <hls_math.h>
#include "multi_head_attention.h"

#define DIM 4
#define HEADS 2
#define HEAD_DIM 2
#define FF_DIM 4

typedef ap_fixed<16, 6> data_t;

void multi_head_attention(
    data_t Q[DIM], data_t K[DIM][DIM], data_t V[DIM][DIM],
    data_t W_Q[HEADS][HEAD_DIM][DIM],
    data_t W_K[HEADS][HEAD_DIM][DIM],
    data_t W_V[HEADS][HEAD_DIM][DIM],
    data_t W_O[DIM][HEADS * HEAD_DIM],
    data_t output[DIM]);

void dense_ffn(data_t input[DIM], data_t W1[FF_DIM][DIM], data_t b1[FF_DIM],
               data_t W2[DIM][FF_DIM], data_t b2[DIM], data_t output[DIM]) {
#pragma HLS array_partition variable=input complete
#pragma HLS array_partition variable=output complete
#pragma HLS array_partition variable=W1 complete dim=2
#pragma HLS array_partition variable=W2 complete dim=2
#pragma HLS array_partition variable=b1 complete
#pragma HLS array_partition variable=b2 complete

    data_t hidden[FF_DIM];
#pragma HLS array_partition variable=hidden complete

    for (int i = 0; i < FF_DIM; i++) {
#pragma HLS UNROLL
        hidden[i] = b1[i];
        for (int j = 0; j < DIM; j++) hidden[i] += W1[i][j] * input[j];
        if (hidden[i] < 0) hidden[i] = 0;
    }

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        output[i] = b2[i];
        for (int j = 0; j < FF_DIM; j++) output[i] += W2[i][j] * hidden[j];
    }
}
void transformer_block(
    data_t Q[DIM], data_t K[DIM][DIM], data_t V[DIM][DIM],
    data_t W_Q[HEADS][HEAD_DIM][DIM],
    data_t W_K[HEADS][HEAD_DIM][DIM],
    data_t W_V[HEADS][HEAD_DIM][DIM],
    data_t W_O[DIM][HEADS * HEAD_DIM],
    data_t W1[FF_DIM][DIM], data_t b1[FF_DIM],
    data_t W2[DIM][FF_DIM], data_t b2[DIM],
    data_t output[DIM]) {

#pragma HLS array_partition variable=Q complete
#pragma HLS array_partition variable=K complete dim=2
#pragma HLS array_partition variable=V complete dim=2
#pragma HLS array_partition variable=output complete

    data_t attn_out[DIM];
    data_t add1[DIM];
    data_t ffn_out[DIM];
#pragma HLS array_partition variable=attn_out complete
#pragma HLS array_partition variable=add1 complete
#pragma HLS array_partition variable=ffn_out complete

    multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O, attn_out);

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        add1[i] = Q[i] + attn_out[i];
    }

    dense_ffn(add1, W1, b1, W2, b2, ffn_out);

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        output[i] = add1[i] + ffn_out[i];
    }
}

Multi-Head Attention
Each head performs scaled dot-product attention:
Score computation: score = Q·K^T (the 1/sqrt(d_k) scaling factor is omitted in this implementation)
Softmax normalization
Weighted sum with the value matrix V
The outputs of the attention heads are concatenated and fed through a linear projection layer.

Feed-Forward Network (FFN)
Two dense layers with a ReLU activation:
FFN(x) = W2·max(0, W1·x + b1) + b2
This completes an HLS-synthesizable Transformer block.
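In equation form, the block as implemented computes (residual additions only; layer normalization is handled separately in the encoder-block work listed in the progress log below):

add1   = Q + MultiHeadAttention(Q, K, V)
output = add1 + FFN(add1)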

Goals & Progress (updated weekly)

4/15
Started working toward implementing Stable Diffusion in Vitis
Studying how tools like hls4ml convert a model into C++ HLS
Considering new directions (e.g. hls4rl, HLS for obstacle detection)
4/22
Finished implementing a transformer encoder block project,
including:
dense layer (tested successfully)
layer normalization (tested successfully)
gelu (tested successfully)
residual normalization (tested successfully)
multi-head attention (tested successfully)
Finally integrated everything into the transformer encoder block shown in the diagram
All results will be pushed to GitHub once testing is complete
With this architecture in place, I can start branching out to new topics,
continuing toward Stable Diffusion and Edge AI model porting and acceleration
