
ELVul4LLM -- Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation

📜 Introduction

This repository hosts an empirical study of how ensemble learning affects the performance of large language models (LLMs) in code vulnerability detection.

📕 Data

The datasets (e.g., Devign) are stored under the Datas/ directory, which the commands below reference.

💻 Experiments

Install Dependencies

conda env create -f environment.yml

cd transformers
pip install -e .

📖 Baseline LLMs w/o EL

Fine-tuning LLMs for Vulnerability Detection with QLoRA (CodeLlama on Devign Example)

cd CodeLlama

*****Training*****

python train.py \
    --model ../CodeLlama/model \
    --model_path ../CodeLlama/model \
    --save_dir ../CodeLlama/outputs \
    --train_path ../Datas/Devign/train.jsonl \
    --val_path ../Datas/Devign/valid.jsonl \
    --test_path ../Datas/Devign/test.jsonl \
    --num_labels 2 \
    --epochs 15 \
    --lr 2e-5 \
    --batch-size-per-replica 16 \
    --grad-acc-steps 2 \
    --bf16 True \
    --seed 42 \
    --save_total_limit 5

*****Prediction*****

python prediction.py \
    --model_path ../CodeLlama/model \
    --lora_path ../CodeLlama/outputs \
    --train_path ../Datas/Devign/train.jsonl \
    --val_path ../Datas/Devign/valid.jsonl \
    --test_path ../Datas/Devign/test.jsonl \
    --num_labels 2
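After prediction, the per-model outputs can be scored. As a minimal sketch, assuming prediction.py writes one JSON object per line with integer fields "label" (ground truth) and "pred" (predicted class) -- field names are an assumption, adapt them to the script's actual output:

```python
import json

def evaluate(pred_path):
    """Compute accuracy and binary F1 from a predictions JSONL file.

    Assumes each line is {"label": 0/1, "pred": 0/1}; adjust the
    field names to match prediction.py's real output format.
    """
    tp = fp = fn = correct = total = 0
    with open(pred_path) as f:
        for line in f:
            row = json.loads(line)
            y, p = row["label"], row["pred"]
            total += 1
            correct += (y == p)
            tp += (y == 1 and p == 1)
            fp += (y == 0 and p == 1)
            fn += (y == 1 and p == 0)
    acc = correct / total
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, f1
```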

📖 Ensemble Learning

✨ Bagging

  • Preparing Data
cd EL/Bagging
python get_data.py
  • Training LLMs on Bagging Subsets
Use the same training and prediction commands as above, pointing --train_path at each bagging subset.
  • Hard/Soft Voting
cd EL/Bagging
python vote.py
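Conceptually, the two voting schemes combine the bagged models' test-set outputs as follows. This is a minimal stdlib sketch of hard and soft voting, not the exact logic of vote.py; the input layout (one list per base model, aligned by sample) is an assumption:

```python
from collections import Counter

def hard_vote(label_lists):
    """Majority vote over per-model predicted labels.

    label_lists: one list of 0/1 labels per base model,
    all aligned to the same test samples.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*label_lists)]

def soft_vote(prob_lists):
    """Average the per-model P(vulnerable) and threshold at 0.5.

    prob_lists: one list of probabilities per base model.
    """
    return [int(sum(ps) / len(ps) >= 0.5)
            for ps in zip(*prob_lists)]
```

Hard voting only needs the discrete labels, while soft voting needs calibrated class probabilities from each model, which is why prediction outputs should retain them.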

✨ Boosting

  • Preparing Data
cd EL/Boosting
python data_weight.py
  • Training LLMs with Boosting (CodeLlama Example)
python train_boosting.py \
    --model ../CodeLlama/model \
    --model_path ../CodeLlama/model \
    --save_dir ../CodeLlama/outputs \
    --train_path ../Datas/Devign/train_weight.jsonl \
    --val_path ../Datas/Devign/valid_weight.jsonl \
    --test_path ../Datas/Devign/test_weight.jsonl \
    --num_labels 2 \
    --epochs 15 \
    --lr 2e-5 \
    --batch-size-per-replica 16 \
    --grad-acc-steps 2 \
    --bf16 True \
    --seed 42 \
    --save_total_limit 5

*****Prediction*****

python prediction.py \
    --model_path ../CodeLlama/model \
    --lora_path ../CodeLlama/outputs \
    --train_path ../Datas/Devign/train_weight.jsonl \
    --val_path ../Datas/Devign/valid_weight.jsonl \
    --test_path ../Datas/Devign/test_weight.jsonl \
    --num_labels 2
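The reweighting that data_weight.py performs between boosting rounds can be illustrated with an AdaBoost-style update: misclassified samples are up-weighted so the next LLM in the chain focuses on them. This is a sketch of the standard scheme, not necessarily the exact formula used in the repository:

```python
import math

def update_weights(weights, labels, preds):
    """One AdaBoost-style sample-reweighting step.

    Returns the renormalized sample weights and the model weight
    alpha. Assumes binary 0/1 labels; the repository's actual
    weighting scheme in data_weight.py may differ.
    """
    err = sum(w for w, y, p in zip(weights, labels, preds) if y != p)
    err = min(max(err, 1e-10), 1 - 1e-10)     # clamp to avoid log(0)
    alpha = 0.5 * math.log((1 - err) / err)   # weight of this round's model
    new = [w * math.exp(alpha if y != p else -alpha)
           for w, y, p in zip(weights, labels, preds)]
    z = sum(new)
    return [w / z for w in new], alpha
```

The resulting weights would be written back into the *_weight.jsonl files consumed by train_boosting.py.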

✨ Stacking

  • Preparing Data with Predictions of LLMs
cd EL/Stacking
python merge.py
  • Training Meta-Model
*****LR*****
python LR.py

*****RF*****
python RF.py

*****KNN*****
python KNN.py

*****SVM*****
python SVM.py
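The stacking pipeline can be summarized as: merge.py concatenates each base LLM's P(vulnerable) into one feature row per sample, and the meta-model (LR/RF/KNN/SVM) is then fit on those rows. A minimal stdlib sketch, with a hand-rolled logistic regression standing in for LR.py (the repository scripts presumably use a library implementation):

```python
import math

def make_meta_features(base_probs):
    """Stack per-model P(vulnerable) into one feature row per sample.

    base_probs: one probability list per base LLM (the layout
    merge.py is assumed to produce); row i becomes the meta-model
    input for sample i.
    """
    return [list(row) for row in zip(*base_probs)]

def train_lr(X, y, lr=0.5, epochs=500):
    """Tiny SGD logistic regression acting as the meta-model."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi                      # gradient of log-loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict_lr(w, b, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return int(1.0 / (1.0 + math.exp(-z)) >= 0.5)
```

In practice the meta-model should be fit on held-out (validation) predictions rather than the base models' training-set outputs, to avoid leaking the base models' overfitting into the meta-level.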

✨ DGS

  • Preparing Data
cd EL/DGS
python precess_data.py
  • Training
python train_DGS.py \
    --model ../CodeBERT \
    --model_path ../CodeBERT \
    --save_dir ../EL/DGS/outputs \
    --train_path ../Datas/Devign/train_DGS.jsonl \
    --val_path ../Datas/Devign/valid_DGS.jsonl \
    --test_path ../Datas/Devign/test_DGS.jsonl \
    --num_labels 5 \
    --epochs 60 \
    --lr 2e-5 \
    --batch-size-per-replica 16 \
    --grad-acc-steps 2 \
    --bf16 True \
    --seed 42 \
    --save_total_limit 5
