A specialized evaluation harness for assessing Rust code generation capabilities of language models, designed as a core component of the SigilDERG ecosystem for Rust-focused AI development.
📖 Ecosystem Architecture: For a comprehensive overview of how this project integrates with SigilDERG-Data_Production and SigilDERG-Finetuner, see ARCHITECTURE.md in the Data Production repository.
This evaluation harness is part of an integrated pipeline for training and evaluating Rust code generation models:
- SigilDERG-Data_Production: Generates high-quality, instruction-style Rust code datasets from real-world crates using static analysis and quality filters
- SigilDERG-Finetuner: Fine-tunes language models (like Llama-3.1-8B-Instruct) on Rust code using QLoRA and multi-phase training strategies
- HumanEval Rust (this project): Evaluates model performance on standardized Rust programming problems using the HumanEval benchmark format
- sigil-mmf-codex-priv: Additional components for the ecosystem
This evaluator is designed to work with fine-tuned Rust code generation models, particularly:
- Llama-3.1-8B-Instruct-Rust-QLora: A Phase 1 fine-tuned model produced using the SigilDERG Finetuner
This package requires Python 3.12.10 or later. We recommend using a virtual environment:
# Using venv (recommended)
python3.12 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Or using uv (fast alternative)
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install a Rust toolchain via rustup and ensure a modern compiler with Edition 2021 support (Rust 1.56+; we recommend the latest stable toolchain):
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup default stable
rustc --version

pip install human-eval-rust

📦 Package available on PyPI: https://pypi.org/project/human-eval-rust/
Install all three SigilDERG packages together:
pip install human-eval-rust[ecosystem]

Or install via the pipeline package:
pip install sigil-pipeline[ecosystem]

This installs:
- human-eval-rust>=2.1.0
- sigil-pipeline>=2.2.1
- sigilderg-finetuner>=2.8.0
git clone https://github.com/Superuser666-Sigil/human-eval-Rust.git
cd human-eval-Rust
pip install -e .

Note: the Rust evaluator (rust_execution.py) builds binaries from untrusted code and runs their tests locally, so you should sandbox it.
- Generate completions from your model using the HumanEval Rust prompts
- Save samples in JSONL format with `task_id` and `completion` fields
- Run evaluation to get pass@k metrics and detailed results
from human_eval.data import read_problems, write_jsonl, get_human_eval_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import AutoPeftModelForCausalLM
import torch
# Load your fine-tuned PEFT model (e.g., from HuggingFace)
# For checkpoint subdirectories, use: "repo-name/checkpoint-9000"
model_name = "Superuser666-Sigil/Llama-3.1-8B-Instruct-Rust-QLora/checkpoint-9000"
# Load tokenizer (try checkpoint subfolder first, fallback to repo root)
try:
tokenizer = AutoTokenizer.from_pretrained(model_name)
except Exception:
# Fallback: load from repo root or base model
repo_name = model_name.split("/checkpoint-")[0] if "/checkpoint-" in model_name else model_name
tokenizer = AutoTokenizer.from_pretrained(repo_name)
# Load PEFT model with explicit parameters to avoid TensorFlow loading issues
model = AutoPeftModelForCausalLM.from_pretrained(
model_name,
dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
from_tf=False, # Explicitly prevent TensorFlow loading
use_safetensors=True, # Prefer SafeTensors format
)
# For base models (not PEFT), use:
# model = AutoModelForCausalLM.from_pretrained(
# model_name,
# dtype=torch.bfloat16,
# device_map="auto",
# trust_remote_code=True,
# from_tf=False,
# use_safetensors=True,
# )
# Load HumanEval Rust problems
rust_problems = read_problems(get_human_eval_dataset())
# Generate completions
samples = []
for task_id, problem in rust_problems.items():
prompt = problem["prompt"]
# Generate completion (adjust parameters as needed)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # move inputs to the model's device
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.2,
do_sample=True,
)
completion = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
samples.append(dict(task_id=task_id, completion=completion))
# Save samples
write_jsonl("rust_samples.jsonl", samples)
# Evaluate
# Run: evaluate_functional_correctness rust_samples.jsonl

$ evaluate_functional_correctness rust_samples.jsonl
Reading samples...
164it [00:01, 1959.50it/s]
Running test suites...
100%|...| 164/164 [00:45<00:00, 3.62it/s]
Writing results to rust_samples.jsonl_results.jsonl...
100%|...| 164/164 [00:00<00:00, 42876.84it/s]
{'pass@1': 0.42, 'pass@10': 0.68, 'pass@100': 0.85}

The evaluator provides detailed results in <input>_results.jsonl with per-sample pass/fail status and execution results ("passed", "timed out", or "failed").
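The results file is plain JSONL, so it can be post-processed directly. A minimal sketch that tallies per-sample status from the run above (the file name is the one printed by the CLI):

```python
import json

# Read the per-sample results written by evaluate_functional_correctness
results = []
with open("rust_samples.jsonl_results.jsonl") as f:
    for line in f:
        results.append(json.loads(line))

# Summarize pass/fail status across all samples
passed = sum(1 for r in results if r["result"] == "passed")
print(f"{passed}/{len(results)} samples passed")
```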
The evaluation workflow integrates seamlessly with the SigilDERG Finetuner evaluation system:
- After training: Use the finetuner's evaluation scripts to generate samples
- Run this evaluator: Process the generated samples to get HumanEval metrics
- Compare metrics: Track improvements across training phases
Example integration:
# After Phase 1 training, evaluate checkpoint
python scripts/generate_samples.py \
--checkpoint out/llama8b-rust-qlora-phase1/checkpoint-1000 \
--output eval_samples.jsonl
# Evaluate with HumanEval Rust
evaluate_functional_correctness eval_samples.jsonl \
--problem_file=data/HumanEval_rust.jsonl

The example samples should yield 0.5 pass@1:
$ evaluate_functional_correctness data/example_rust_samples.jsonl --problem_file=data/example_rust_problem.jsonl
Reading samples...
4it [00:00, 1959.50it/s]
Running test suites...
100%|...| 4/4 [00:03<00:00, 1.13it/s]
Writing results to data/example_rust_samples.jsonl_results.jsonl...
100%|...| 4/4 [00:00<00:00, 1536.38it/s]
{'pass@1': 0.5}

# Custom pass@k values
evaluate_functional_correctness samples.jsonl --k=1,5,10,20
# Adjust parallelism (default: 24 workers optimized for H100)
evaluate_functional_correctness samples.jsonl --n_workers=8
# Custom timeout budgets (separate for compile/test/clippy phases)
evaluate_functional_correctness samples.jsonl \
--compile-timeout=15.0 \
--run-timeout=10.0 \
--clippy-timeout=10.0
# Clippy enforcement modes
evaluate_functional_correctness samples.jsonl --clippy-required # Lint failures block completion
evaluate_functional_correctness samples.jsonl # Default: advisory mode (metrics only)
# Sandbox enforcement
evaluate_functional_correctness samples.jsonl --require-sandbox # Strict mode: Firejail required
evaluate_functional_correctness samples.jsonl # Default: fallback to unsandboxed
# Sandboxing options
evaluate_functional_correctness samples.jsonl --sandbox-mode=firejail # Recommended
evaluate_functional_correctness samples.jsonl --sandbox-mode=none # UNSAFE: local dev only
# Non-interactive mode (for CI/automated pipelines)
evaluate_functional_correctness samples.jsonl --allow-no-sandbox # Required when Firejail unavailable
# Policy enforcement (pattern filtering)
evaluate_functional_correctness samples.jsonl --enforce-policy # Default: enabled
evaluate_functional_correctness samples.jsonl --no-enforce-policy # Disable for pure HumanEval compatibility
# See all options
evaluate_functional_correctness --help

The evaluator includes multiple layers of security:
- Pattern-based filtering (optional, enabled by default): Blocks dangerous code patterns before execution (filesystem, network, process operations, unsafe code, etc.). Can be disabled with `--no-enforce-policy` for pure HumanEval compatibility.
- Process isolation: Each evaluation runs in a separate process
- Firejail sandboxing (recommended): Full process jail isolation with resource limits
Policy Enforcement Modes:
- `--enforce-policy` (default): Enables pattern-based filtering for security. Use this for production evaluation of untrusted LLM-generated code.
- `--no-enforce-policy`: Disables pattern filtering for pure HumanEval compatibility. Use this when you need exact 1:1 comparability with the original HumanEval benchmark format (research/publication mode).
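The concrete deny-list lives in the package; as an illustration only, a pattern filter of this kind can be sketched in a few lines of Python (the patterns below are hypothetical examples, not the evaluator's actual rules):

```python
import re

# Hypothetical examples of dangerous-pattern rules; the real evaluator
# ships its own deny-list (filesystem, network, process, unsafe code, ...).
DENY_PATTERNS = [
    r"\bstd::process\b",  # spawning processes
    r"\bstd::net\b",      # network access
    r"\bstd::fs\b",       # filesystem access
    r"\bunsafe\b",        # unsafe blocks
]

def violates_policy(rust_source: str) -> bool:
    """Return True if the completion matches any blocked pattern."""
    return any(re.search(p, rust_source) for p in DENY_PATTERNS)

assert violates_policy("use std::fs::File;")
assert not violates_policy("fn add(a: i32, b: i32) -> i32 { a + b }")
```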
Sandbox Modes:
- `firejail` (recommended): Uses Firejail for Linux process isolation with `--net=none`, private filesystem, memory/CPU limits
- `none`: No sandboxing (UNSAFE - only for local development with trusted code)
- Auto-detect (default): Automatically detects Firejail availability; prompts for installation or unsafe mode if unavailable
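For intuition, a Firejail invocation of the kind described above might be assembled like this. This is a minimal sketch using standard Firejail flags that match the protections named in this README; the evaluator's real invocation lives in rust_execution.py and may differ:

```python
import subprocess

def run_sandboxed(binary_path: str, timeout: float = 10.0):
    """Run a compiled test binary inside a Firejail jail (sketch only)."""
    cmd = [
        "firejail",
        "--quiet",
        "--net=none",       # no network access
        "--private",        # private, throwaway home directory
        "--caps.drop=all",  # drop all Linux capabilities
        "--seccomp",        # default seccomp syscall filter
        binary_path,
    ]
    return subprocess.run(cmd, capture_output=True, timeout=timeout)
```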
Firejail Setup (Linux only):
# Install Firejail
sudo apt-get install firejail # Debian/Ubuntu
# or
sudo dnf install firejail # Fedora/RHEL
# or
sudo yum install firejail # CentOS
# or
sudo pacman -S firejail # Arch Linux

Interactive Installation Flow:
When Firejail is not available, the evaluator presents an interactive prompt:
- Install Firejail: Attempts automatic installation via system package manager
- Cancel: Exit without running evaluation
- Proceed without sandbox: Only after explicit confirmation (UNSAFE)
Non-Interactive Mode:
For CI/CD pipelines or automated scripts, use the --allow-no-sandbox flag to bypass interactive prompts:
# In CI, when Firejail is available
evaluate_functional_correctness samples.jsonl --sandbox-mode=firejail
# In CI, when you've verified the environment is secure
evaluate_functional_correctness samples.jsonl --sandbox-mode=none --allow-no-sandbox

The HumanEval Rust dataset (data/HumanEval_rust.jsonl) contains 164 Rust programming problems. Each problem includes:
- `task_id`: Unique identifier (e.g., "HumanEval/0")
- `prompt`: Function signature and docstring
- `canonical_solution`: Reference implementation
- `test`: Rust test cases using `#[cfg(test)]`
- `entry_point`: Function name
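These fields can be inspected directly with the data helpers used in the Usage example above, as in this short sketch:

```python
from human_eval.data import read_problems, get_human_eval_dataset

# Load the 164 Rust problems and inspect one entry's fields
problems = read_problems(get_human_eval_dataset())
problem = problems["HumanEval/0"]
print(problem["entry_point"])       # e.g., "has_close_elements"
print(problem["prompt"])            # Rust signature + doc comment
print(problem["test"][:80], "...")  # the #[cfg(test)] test module
```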
Sample format:
{"task_id": "HumanEval/0", "prompt": "fn has_close_elements(...) -> bool{", "canonical_solution": "...", "test": "#[cfg(test)]\nmod tests {...}", "entry_point": "has_close_elements"}When using the SigilDERG evaluation pipeline (lambda-package), prompts are automatically enhanced with Rust-specific instructions:
- Includes the Rust function signature and doc comment from the problem
- Adds explicit instructions: "Implement only the requested function in Rust"
- Prohibits `fn main`, tests, example code, and unnecessary comments
- Prohibits `...`, `todo!()`, and `unimplemented!()`
- Includes Rust-specific reminders about imports and mutability
This ensures models generate focused, correct Rust code without extra scaffolding.
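A prompt enhancer of this shape is easy to picture. The sketch below is hypothetical (the lambda-package's actual template and wording will differ) and only mirrors the rules listed above:

```python
def enhance_prompt(problem: dict) -> str:
    """Hypothetical sketch of a Rust-specific prompt enhancer.

    Mirrors the rules above: keep the signature and doc comment,
    demand only the requested function, and ban scaffolding.
    """
    return (
        "Implement only the requested function in Rust.\n"
        "Do not write fn main, tests, example code, or unnecessary comments.\n"
        "Do not use ..., todo!(), or unimplemented!().\n"
        "Remember needed imports and mark bindings mut where required.\n\n"
        f"{problem['prompt']}"
    )
```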
- Data Production → Generate training data with SigilDERG-Data_Production
- Model Fine-Tuning → Train on Rust code with SigilDERG-Finetuner
- Evaluation → Assess performance with this HumanEval Rust harness
- Iteration → Use results to guide further training and data collection
This evaluator provides comprehensive metrics for Rust code generation:
Standard HumanEval Metrics:
- pass@k: Functional correctness at k samples (pass@1, pass@2, pass@10, pass@100)
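pass@k uses the unbiased estimator from the Codex paper: with n samples per task of which c pass, pass@k = 1 - C(n-c, k)/C(n, k), averaged over tasks. A minimal standalone sketch of the per-task estimator (the package computes this internally):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), the
    probability that at least one of k randomly drawn samples
    (out of n total, c of them correct) passes."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 4 correct
print(pass_at_k(10, 4, 1))  # 0.4
```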
Enhanced Metrics (v1.4.4+):
- compile_rate: Fraction of samples that compile successfully
- main_free_rate: Percentage of completions without `fn main()` functions
Result Schema (v1.4.4+): Each evaluation result includes enhanced fields for trust and auditability:
{
"task_id": "HumanEval/0",
"completion": "...",
"compile_ok": true,
"test_ok": true,
"error_type": null,
"stderr": "",
"main_free": true,
"passed": true,
"result": "passed"
}

Error Types:
- `infra_missing_toolchain`: Infrastructure failure (rustc not available)
- `infra_missing_linter`: Clippy not available or failed infrastructure checks
- `compile_error`: Code failed to compile
- `compile_timeout`: Compilation exceeded time budget
- `runtime_error`: Code compiled but crashed during execution
- `test_timeout`: Test execution exceeded time budget
- `assertion_failure`: Tests failed (code ran but assertions failed)
- `clippy_timeout`: Clippy linting exceeded time budget
- `lint_failure`: Clippy found code quality issues (when `--clippy-required` is enabled)
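A quick way to see how failures distribute across these categories is to tally `error_type` over the results file, as in this short sketch (assuming the results path printed by the CLI):

```python
import json
from collections import Counter

# Tally error_type across all per-sample results
counts = Counter()
with open("rust_samples.jsonl_results.jsonl") as f:
    for line in f:
        result = json.loads(line)
        counts[result["error_type"] or "passed"] += 1

for error_type, n in counts.most_common():
    print(f"{error_type}: {n}")
```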
Preflight Checks:
- Validates `rustc` availability before evaluation (fails fast on infrastructure issues)
- Never drops completions silently; all samples are included in results with appropriate status
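A preflight check of this kind is straightforward; a minimal sketch (the evaluator's own check lives in the package and may be stricter):

```python
import shutil
import subprocess

def preflight_rustc() -> str:
    """Fail fast if rustc is unavailable; return its version string."""
    if shutil.which("rustc") is None:
        raise RuntimeError("rustc not found on PATH (infra_missing_toolchain)")
    out = subprocess.run(
        ["rustc", "--version"], capture_output=True, text=True, check=True
    )
    return out.stdout.strip()

print(preflight_rustc())  # e.g., "rustc 1.80.0 (...)"
```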
Together, these metrics provide a complete picture of model performance for Rust code generation, with full auditability for Rule Zero compliance.
Version 2.0.0+ includes optimizations specifically tuned for high-performance GPU evaluation environments (e.g., 1x H100 with 26 vCPUs and 225GB RAM):
- Parallel Workers: 24 (default `--n_workers=24`) - Optimized to saturate 26 vCPUs (reserving 2 for OS/orchestration)
- Timeout Budgets: 10.0 seconds per phase (compile/run/clippy) - Separate budgets ensure fair evaluation
  - `--compile-timeout=10.0` - Default for rustc compilation
  - `--run-timeout=10.0` - Default for test execution
  - `--clippy-timeout=10.0` - Default for linting
- Firejail Memory Limit: 4GB per process - Handles complex, macro-heavy Rust code compilation
For faster evaluation on high-end hardware, reduce timeout budgets:
evaluate_functional_correctness samples.jsonl \
--n_workers=32 \
--compile-timeout=5.0 \
--run-timeout=5.0 \
--clippy-timeout=5.0

This configuration achieves ~200 samples/minute vs ~150 samples/minute with default 10s timeouts.
With 24 workers and 4GB memory per process:
- Maximum Memory Usage: ~96GB (24 workers × 4GB) - Well within 225GB safety margin
- CPU Utilization: ~92% (24/26 vCPUs) - Near-saturation for maximum throughput
These defaults are optimized for production evaluation on high-end hardware. For smaller systems, you can override with --n_workers and timeout flags.
See ADR-008 for timeout budget design rationale.
Docker Support Removed: Version 2.0.0 removes Docker-based sandboxing in favor of Firejail-first architecture:
- Simpler deployment: No Docker daemon required
- Faster startup: No container overhead
- Interactive installation: Prompts to install Firejail if missing
- Non-interactive mode: `--allow-no-sandbox` for CI/CD pipelines
Migration from v1.x:
- If you were using `--sandbox-mode=docker`, switch to `--sandbox-mode=firejail`
- Install Firejail on your system (see Firejail Setup above)
- For CI/CD, use `--allow-no-sandbox` if running in a secure environment without Firejail
Completion Extraction & Cleaning:
- Automatically extracts function bodies from model completions
- Removes extra `main()` functions and standalone code
- Strips markdown code blocks (```rust and ``` fences)
- Handles completions with or without function signatures
- Improves evaluation accuracy by ensuring only the target function is tested
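As an illustration of the cleaning step, a simplified fence-stripping pass might look like this (a sketch only; the package's extractor also handles signatures and `main()` removal):

```python
import re

def strip_markdown_fences(completion: str) -> str:
    """Remove ```rust ... ``` fences, keeping only the code inside
    (or the whole text if no fence is present). Simplified sketch."""
    match = re.search(r"```(?:rust)?\s*\n(.*?)```", completion, re.DOTALL)
    return match.group(1).strip() if match else completion

raw = "```rust\nfn add(a: i32, b: i32) -> i32 { a + b }\n```"
print(strip_markdown_fences(raw))  # fn add(a: i32, b: i32) -> i32 { a + b }
```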
Robust Validation:
- Validates `rustc` availability before evaluation (fails fast if unavailable)
- Prevents silent failures across thousands of samples
Evaluation itself uses very little memory, but you may see the following error message when the system is running low on RAM. Since this can cause some correct programs to fail, we recommend freeing some memory and trying again.
malloc: can't allocate region
This evaluation harness is based on the HumanEval benchmark format described in the original Codex paper. Please cite:
@article{chen2021codex,
title={Evaluating Large Language Models Trained on Code},
author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser and Mohammad Bavarian and Clemens Winter and Philippe Tillet and Felipe Petroski Such and Dave Cummings and Matthias Plappert and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain and William Saunders and Christopher Hesse and Andrew N. Carr and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
year={2021},
eprint={2107.03374},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MIT License
This release hardens Firejail usage with seccomp, capability dropping, CPU/file/process limits, and read-only mounts to reduce risk when running untrusted Rust code.
The evaluator now reports compile rate, main-free rate, clippy pass rate, average compile time, and binary sizes alongside pass@k.
An extended Rust dataset stub is available at data/HumanEval_rust_extended.jsonl and can be regenerated with scripts/generate_extended_dataset.py.
Use human_eval.logging_config.setup_logging to configure structured logging for CLI invocations.
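For example (a minimal sketch; we assume here that setup_logging can be called with no arguments, so check its signature for configurable options):

```python
from human_eval.logging_config import setup_logging

# Configure structured logging before invoking the evaluation entry points
# programmatically. (Assumes the default no-argument invocation; see the
# function's docstring for available options.)
setup_logging()
```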