feat: integrate eval infra and part of the oracles #4

Merged
ganler merged 6 commits into main from eval on Aug 6, 2025

@ganler ganler commented Aug 6, 2025

No description provided.

Copilot AI review requested due to automatic review settings August 6, 2025 23:25

Copilot AI left a comment

Pull Request Overview

This PR integrates the evaluation infrastructure and implements parts of the oracle system for the PurpCode project. The changes focus on building a comprehensive evaluation framework for secure code generation with multiple safety oracles and assessment tools.

Key changes include:

  • Addition of a safety-focused system prompt for secure code evaluation
  • Implementation of evaluation infrastructure with support for multiple oracles (xscode, malicious assistance detection, etc.)
  • Creation of annotation tools for dataset curation and quality assessment

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| utils/init.py | Adds system prompt for safety-focused code generation evaluation |
| eval/main.py | Main entry point for evaluation pipeline combining generation and assessment |
| eval/generate.py | Core generation infrastructure with multi-backend support (HF, vLLM, OpenAI, Bedrock) |
| eval/evaluate.py | Evaluation orchestrator that maps tasks to appropriate oracles |
| eval/oracles/xscode_overrefuse.py | Oracle for evaluating XSCode dataset refusal and security vulnerabilities |
| eval/oracles/malicious_assistance_detection.py | Oracle for detecting malicious code assistance in responses |
| eval/oracles/check_secqa.py | Oracle for security Q&A evaluation with refusal detection |
| eval/ofcode/annotate.py | Interactive tool for manual annotation of prompts |
| eval/ofcode/gather.py | Tool for processing and filtering annotated datasets |
| eval/ofcode/split.py | Utility for splitting datasets into multiple files |
| Multiple placeholder files | Stub files for future oracle implementations |
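
For readers unfamiliar with the pattern, the "orchestrator that maps tasks to appropriate oracles" described for eval/evaluate.py could look roughly like the sketch below. This is an illustration only, not the code in this PR: the names `ORACLES`, `register_oracle`, `refusal_rate`, and `evaluate`, the task key `"secqa"`, and the keyword-based refusal check are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical oracle signature: takes model responses, returns named scores.
Oracle = Callable[[List[str]], Dict[str, float]]

# Hypothetical registry mapping a task name to the oracle that scores it.
ORACLES: Dict[str, Oracle] = {}


def register_oracle(task: str) -> Callable[[Oracle], Oracle]:
    """Decorator that records an oracle function under a task name."""
    def decorator(fn: Oracle) -> Oracle:
        ORACLES[task] = fn
        return fn
    return decorator


@register_oracle("secqa")  # task key is an assumption for illustration
def refusal_rate(responses: List[str]) -> Dict[str, float]:
    """Toy refusal detector: counts responses containing a refusal phrase."""
    markers = ("i can't", "i cannot", "i won't")
    refused = sum(any(m in r.lower() for m in markers) for r in responses)
    rate = refused / len(responses) if responses else 0.0
    return {"refusal_rate": rate}


def evaluate(task: str, responses: List[str]) -> Dict[str, float]:
    """Look up the oracle registered for `task` and apply it to the responses."""
    try:
        oracle = ORACLES[task]
    except KeyError:
        raise ValueError(f"no oracle registered for task {task!r}") from None
    return oracle(responses)
```

Calling `evaluate("secqa", responses)` would dispatch to `refusal_rate`, while an unregistered task name raises `ValueError` instead of silently skipping evaluation. A registry like this keeps the orchestrator unchanged when the stub oracles are filled in later.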

ganler and others added 5 commits August 6, 2025 23:27
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ganler ganler merged commit 65698bf into main Aug 6, 2025
2 checks passed
@ganler ganler deleted the eval branch August 7, 2025 09:01