
Commit 65698bf

ganler and Copilot authored
feat: integrate eval infra and part of the oracles (#4)
* feat: integrate eval infra and part of the oracles
* hotfix
* Update utils/__init__.py
* Update eval/oracles/xscode_overrefuse.py
* Update eval/ofcode/annotate.py
* Update eval/ofcode/annotate.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent 19ebadd · commit 65698bf

23 files changed: 1678 additions & 0 deletions

README.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@

# 🔮 PurpCode: Reasoning for Safer Code Generation

This repository includes the training and evaluation infrastructure for PurpCode. For other resources, please check out:

* [📝 Paper](https://arxiv.org/abs/2507.19060) with technical and evaluation details
* [🤗 HuggingFace](https://github.com/purpcode-uiuc/purpcode) including model checkpoints and training/evaluation datasets
* [🥇 1st Place at Amazon Nova AI Challenge 2025](https://www.amazon.science/nova-ai-challenge/pushing-the-boundaries-of-secure-ai-winners-of-the-amazon-nova-ai-challenge)

## Overview

PurpCode is an alignment method and a fully open-source recipe (data, model, and code) for eliciting **cybersafe reasoning** capabilities in coding models, including secure code generation and defense against malicious cyber events.
PurpCode includes two alignment stages:

1. **[Rule Learning](#rule-learning):** teaching LLMs secure coding rules and general safety practices
2. **[Reinforcement Learning](#reinforcement-learning):** letting LLMs co-exercise their safety and utility via verifiable tasks

We also curate comprehensive safety data via internal red teaming and use various evaluators covering cybersafety, utility, and overrefusal.

## Rule Learning

TBD

## Reinforcement Learning

TBD

## Evaluation

```bash
export PYTHONPATH=$PYTHONPATH:$(pwd)

python eval/main.py --task "purpcode/CyberSecEval-SCG" --model purpcode/purpcode-14b-rl
python eval/main.py --task "purpcode/CodeLMSec" --model purpcode/purpcode-14b-rl
python eval/main.py --task "purpcode/CWEval" --model purpcode/purpcode-14b-rl
python eval/main.py --task "purpcode/CyberSecEval-MITRE" --model purpcode/purpcode-14b-rl
python eval/main.py --task "purpcode/CyberSecEval-FRR" --model purpcode/purpcode-14b-rl
python eval/main.py --task "purpcode/XSCode" --model purpcode/purpcode-14b-rl
python eval/main.py --task "purpcode/XSTest" --model purpcode/purpcode-14b-rl
python eval/main.py --task "purpcode/PHTest" --model purpcode/purpcode-14b-rl
```

Notes:

* `--oracle` selects the evaluator for customized generations; by default, the oracle is guessed from the dataset name.
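As a sketch of the default guessing behavior, the dispatch implemented in `eval/evaluate.py` in this commit maps task names to oracles roughly as follows; `guess_oracle` is a hypothetical helper name introduced here for illustration, not an actual function in the repository:

```python
# Standalone excerpt of the default oracle-guessing logic in eval/evaluate.py.
# The oracle is inferred from the task name when --oracle is omitted.
def guess_oracle(task: str) -> str:
    if task in ["purpcode/mbppplus", "purpcode/humanevalplus"]:
        return "evalplus"
    if task.startswith("purpcode/CyberSecEval"):
        # e.g. "purpcode/CyberSecEval-SCG" -> field "SCG"
        field = task.split("/")[-1].split("-")[-1]
        return {"SCG": "cyberseceval", "MITRE": "malevent", "FRR": "overrefusal"}[field]
    for prefix, oracle in [
        ("purpcode/CodeLMSec", "codeql"),
        ("purpcode/XSCode", "xscode"),
        ("purpcode/XSTest", "overrefusal"),
        ("purpcode/PHTest", "phtest"),
        ("purpcode/CWEval", "cweval"),
    ]:
        if task.startswith(prefix):
            return oracle
    raise ValueError(f"Unknown oracle for {task = }. Please specify.")


print(guess_oracle("purpcode/CyberSecEval-MITRE"))  # malevent
print(guess_oracle("purpcode/XSTest"))  # overrefusal
```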
## References

```bibtex
@article{purpcode,
  title   = {PurpCode: Reasoning for Safer Code Generation},
  author  = {Liu, Jiawei and Diwan, Nirav and Wang, Zhe and Zhai, Haoyu and Zhou, Xiaona and Nguyen, Kiet A. and Yu, Tianjiao and Wahed, Muntasir and Deng, Yinlin and Benkraouda, Hadjer and Wei, Yuxiang and Zhang, Lingming and Lourentzou, Ismini and Wang, Gang},
  journal = {arXiv preprint arXiv:2507.19060},
  year    = {2025},
}
```

eval/cweval.py

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# SPDX-FileCopyrightText: (c) UIUC PurpCode Team
#
# SPDX-License-Identifier: Apache-2.0

# TODO(@zhewang2001): Please refactor the corresponding code snippets and then upload them.


def evaluate_cweval(*args, **kwargs):
    pass  # placeholder; evaluate_main calls this with keyword arguments

eval/cyberseceval.py

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# SPDX-FileCopyrightText: (c) UIUC PurpCode Team
#
# SPDX-License-Identifier: Apache-2.0

# TODO(@zhewang2001): Please refactor the corresponding code snippets and then upload them.


def evaluate_cyberseceval(*args, **kwargs):
    pass  # placeholder; evaluate_main calls this with keyword arguments

eval/eval_only.py

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# SPDX-FileCopyrightText: (c) UIUC PurpCode Team
#
# SPDX-License-Identifier: Apache-2.0

from typing import Optional

from eval.evaluate import DEFAULT_LLM_JUDGE, evaluate_main


def main(
    task: str,
    generation_path: str,
    oracle: Optional[str] = None,
    llm_judge: str = DEFAULT_LLM_JUDGE,
    reference_results_path: Optional[str] = None,
):
    evaluate_main(
        task,
        generation_path,
        oracle=oracle,
        llm_judge=llm_judge,
        reference_results_path=reference_results_path,
    )


if __name__ == "__main__":
    from fire import Fire

    Fire(main)

eval/evaluate.py

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
# SPDX-FileCopyrightText: (c) UIUC PurpCode Team
#
# SPDX-License-Identifier: Apache-2.0

import json
import subprocess
from typing import Optional

from evalplus.sanitize import sanitize

from eval.generate import preprocess_generation

# TODO(@zhewang2001): allow users to run the LLM judge on vLLM, instead of relying on Bedrock
DEFAULT_LLM_JUDGE = "bedrock/us.meta.llama3-3-70b-instruct-v1:0"


def to_evalplus_format(generation_path: str) -> str:
    with open(generation_path, "r") as f:
        data = [json.loads(line) for line in f]

    evalplus_path = generation_path.replace(".jsonl", ".evalplus.jsonl")
    with open(evalplus_path, "w") as f:
        for item in data:
            f.write(
                json.dumps(
                    {
                        "task_id": item["task_id"],
                        "solution": sanitize(item["messages"][-1]["content"]),
                    }
                )
                + "\n"
            )
    return evalplus_path


def evaluate_main(
    task: str,
    generation_path: str,
    oracle: Optional[str] = None,
    llm_judge: Optional[str] = None,
    reference_results_path: Optional[str] = None,
):
    if oracle is None:  # guess the oracle from the task name
        print(f"Guessing oracle for task {task}...")
        if task in ["purpcode/mbppplus", "purpcode/humanevalplus"]:
            oracle = "evalplus"
        elif task.startswith("purpcode/CyberSecEval"):
            field = task.split("/")[-1].split("-")[-1]
            if field == "SCG":
                oracle = "cyberseceval"
            elif field == "MITRE":
                oracle = "malevent"
            elif field == "FRR":
                oracle = "overrefusal"
            else:
                raise ValueError(f"Unknown dataset: {task}")
        elif task.startswith("purpcode/CodeLMSec"):
            oracle = "codeql"
        elif task.startswith("purpcode/XSCode"):
            oracle = "xscode"
        elif task.startswith("purpcode/XSTest"):
            oracle = "overrefusal"
        elif task.startswith("purpcode/PHTest"):
            oracle = "phtest"
        elif task.startswith("purpcode/CWEval"):
            oracle = "cweval"
        else:
            raise ValueError(f"Unknown oracle for {task = }. Please specify.")
        print(f"Guessed oracle: {oracle}")

    generation_path = preprocess_generation(generation_path)
    if oracle == "evalplus":
        generation_path = to_evalplus_format(generation_path)
        dataset = task.split("/")[-1].removesuffix("plus")
        assert dataset in ["mbpp", "humaneval"]
        subprocess.run(
            ["evalplus.evaluate", "--dataset", dataset, "--samples", generation_path]
        )
    elif oracle == "safety":
        from eval.oracles.compound_safety_check import evaluate_compound_safety

        evaluate_compound_safety(
            task=task,
            generation_path=generation_path,
            model=llm_judge or DEFAULT_LLM_JUDGE,
            min_severity_level="MEDIUM",
            reference_results_path=reference_results_path,
            show_separate_stats=True,
        )
    elif oracle == "cyberseceval":
        from eval.cyberseceval import evaluate_cyberseceval

        evaluate_cyberseceval(generation_path=generation_path, task=task)
    elif oracle == "codeguru":
        from eval.oracles.secure_code_oracles import evaluate_secure_code_gen

        evaluate_secure_code_gen(
            generation_path=generation_path,
            min_severity_level="MEDIUM",
            analyzers=["codeguru"],
        )
    elif oracle == "codeql":
        from eval.oracles.secure_code_oracles import evaluate_secure_code_gen

        evaluate_secure_code_gen(
            generation_path=generation_path,
            min_severity_level="MEDIUM",
            analyzers=["codeql"],
        )
    elif oracle == "malevent":
        from eval.oracles.compound_safety_check import (
            evaluate_malicious_event_assistance,
        )

        evaluate_malicious_event_assistance(
            generation_path=generation_path,
            model=llm_judge or DEFAULT_LLM_JUDGE,
            task=task,
        )
    elif oracle == "overrefusal":
        from eval.oracles.check_secqa import evaluate_secqa_answers

        evaluate_secqa_answers(
            generation_path=generation_path, model=llm_judge or DEFAULT_LLM_JUDGE
        )
    elif oracle == "xscode":
        from eval.oracles.xscode_overrefuse import evaluate_xscode_answers

        evaluate_xscode_answers(
            generation_path=generation_path, model=llm_judge or DEFAULT_LLM_JUDGE
        )
    elif oracle == "phtest":
        from eval.phtest import evaluate_phtest

        evaluate_phtest(
            generation_path=generation_path, model=llm_judge or DEFAULT_LLM_JUDGE
        )
    elif oracle == "cweval":
        from eval.cweval import evaluate_cweval

        evaluate_cweval(generation_path=generation_path, task=task)
    else:
        raise ValueError(f"Unknown oracle: {oracle}")


if __name__ == "__main__":
    from fire import Fire

    Fire(evaluate_main)
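The `to_evalplus_format` conversion in `eval/evaluate.py` can be exercised in isolation. The sketch below reproduces its JSONL round trip, with evalplus's `sanitize` stubbed out as the identity function so the example is self-contained (the real implementation imports it from `evalplus.sanitize`):

```python
import json
import tempfile
from pathlib import Path


def sanitize(code: str) -> str:
    # Stand-in for evalplus.sanitize.sanitize, which extracts/cleans code blocks.
    return code


def to_evalplus_format(generation_path: str) -> str:
    # Read one chat transcript per line from the generation file.
    with open(generation_path) as f:
        data = [json.loads(line) for line in f]

    evalplus_path = generation_path.replace(".jsonl", ".evalplus.jsonl")
    with open(evalplus_path, "w") as f:
        for item in data:
            # Keep only the task id and the (sanitized) final assistant message.
            record = {
                "task_id": item["task_id"],
                "solution": sanitize(item["messages"][-1]["content"]),
            }
            f.write(json.dumps(record) + "\n")
    return evalplus_path


# Round-trip a one-record generation file through the converter.
src = Path(tempfile.mkdtemp()) / "gen.jsonl"
src.write_text(
    json.dumps(
        {
            "task_id": "HumanEval/0",
            "messages": [
                {"role": "user", "content": "Write has_close_elements."},
                {"role": "assistant", "content": "def has_close_elements(): ..."},
            ],
        }
    )
    + "\n"
)
out = to_evalplus_format(str(src))
print(json.loads(Path(out).read_text())["task_id"])  # HumanEval/0
```

The converted file sits next to the input with an `.evalplus.jsonl` suffix, which is the path then passed to the `evalplus.evaluate` CLI.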
