CodeFixes Benchmark Infrastructure (MVP)

Multi-module Java 17 + Gradle (Groovy DSL) framework for reproducible Java CodeFixes benchmarking.

Modules

bench-core: domain model, config, fingerprints, stage contracts
bench-io: JSONL+Zstd streaming IO, blob store, resume state index
bench-dataset: dataset generation + masking stage
bench-retrieval: Lucene BM25 index/retrieval + leakage filter
bench-augmentation: prompt block construction
bench-llm: OpenAI-compatible generation client
bench-metrics: exact/edit/compile metrics
bench-runner: E2E orchestration, batching, resume/reuse/shuffle
bench-cli: Spring Boot + picocli CLI commands

CLI

bench run --config <path> [--run-id <RUN_ID>] [--resume] [--start-from-stage <STAGE>] [--reuse-from-run <RUN_ID>]
bench inspect-run --run-id <RUN_ID> [--output-dir runs]
bench validate-config --config <path>

Quick Start On Fixed Project

This repository already contains a fixed Java project:

examples/fixed-project

And ready-to-use configs:

examples/run_config.fixed.yaml (normal run with local LLM endpoint)
examples/run_config.fixed.no_llm.yaml (run without an available LLM; still produces numeric artifacts)

If the Gradle wrapper is not present yet, replace ./gradlew with gradle in the commands below.

1) Validate config

./gradlew :bench-cli:bootRun --args='validate-config --config examples/run_config.fixed.no_llm.yaml'

2) Run benchmark now (no LLM required)

RUN_ID=fixed-no-llm-001
./gradlew :bench-cli:bootRun --args="run --config examples/run_config.fixed.no_llm.yaml --run-id ${RUN_ID}"

3) Inspect run summary

./gradlew :bench-cli:bootRun --args="inspect-run --run-id ${RUN_ID}"

4) Get numeric results from artifacts

Dataset size:

zstd -dc runs/${RUN_ID}/dataset/raw_samples*.jsonl.zst | wc -l
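The same count can be taken from a script. The following is a minimal Python sketch, not part of the repository: it assumes the zstd CLI is on PATH (as the command above already does) and that every non-blank line is one JSON object. The helper names parse_jsonl, iter_zst_records, and count_records are hypothetical. Unlike wc -l, it skips blank lines.

```python
import json
import subprocess
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-blank JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_zst_records(path: str) -> Iterator[dict]:
    """Stream records from one .jsonl.zst file by shelling out to `zstd -dc`."""
    proc = subprocess.Popen(
        ["zstd", "-dc", path], stdout=subprocess.PIPE, encoding="utf-8"
    )
    with proc:  # closes the pipe and waits for zstd on exit
        yield from parse_jsonl(proc.stdout)
    if proc.returncode != 0:
        raise RuntimeError(f"zstd failed on {path} (exit code {proc.returncode})")


def count_records(paths: Iterable[str]) -> int:
    """Count records across several compressed JSONL files."""
    return sum(1 for path in paths for _ in iter_zst_records(path))


# Example call (the run id is the one used in this quick start):
# import glob
# count_records(glob.glob("runs/fixed-no-llm-001/dataset/raw_samples*.jsonl.zst"))
```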

Retrieval leakage counters:

for f in runs/${RUN_ID}/retrieving/part-*.jsonl.zst; do zstd -dc "$f"; done | \
  jq -s '{
    samples: length,
    leakage_detected: map(select(.leakage_detected == true)) | length,
    leakage_filtered_total: (map(.leakage_filtered_count) | add)
  }'
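For use inside a script, the jq summary above can be approximated in Python. This is a sketch under assumptions: records are already decoded into dicts (for example from zstd -dc output parsed with json.loads), and summarize_leakage is a hypothetical name. One deliberate difference: for an empty input it reports a total of 0, where jq's add would yield null.

```python
from typing import Iterable


def summarize_leakage(records: Iterable[dict]) -> dict:
    """Aggregate the leakage fields of the retrieving-stage records."""
    samples = 0
    detected = 0
    filtered_total = 0
    for record in records:
        samples += 1
        # Mirrors `.leakage_detected == true`: only a literal true counts.
        if record.get("leakage_detected") is True:
            detected += 1
        # A missing or null count contributes nothing, as in jq's `add`.
        filtered_total += record.get("leakage_filtered_count") or 0
    return {
        "samples": samples,
        "leakage_detected": detected,
        "leakage_filtered_total": filtered_total,
    }
```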

Generation status (how many records contain errors):

for f in runs/${RUN_ID}/generation/part-*.jsonl.zst; do zstd -dc "$f"; done | \
  jq -s '{
    samples: length,
    with_candidates: map(select((.candidates | length) > 0)) | length,
    with_errors: map(select((.errors | length) > 0)) | length
  }'
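A rough Python equivalent of the generation summary, for cases where jq is not available. It assumes records are already decoded into dicts and that candidates and errors are lists (or missing); summarize_generation is a hypothetical name, not a function from the repository.

```python
from typing import Iterable


def summarize_generation(records: Iterable[dict]) -> dict:
    """Count generation records that have candidates and records that have errors."""
    samples = 0
    with_candidates = 0
    with_errors = 0
    for record in records:
        samples += 1
        # A missing or null list counts as empty, matching jq's `null | length == 0`.
        if len(record.get("candidates") or []) > 0:
            with_candidates += 1
        if len(record.get("errors") or []) > 0:
            with_errors += 1
    return {
        "samples": samples,
        "with_candidates": with_candidates,
        "with_errors": with_errors,
    }
```

A record can appear in both counters if it carries candidates and errors at the same time.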

Main metrics:

for f in runs/${RUN_ID}/metrics/part-*.jsonl.zst; do zstd -dc "$f"; done | \
  jq -s '{
    rows: length,
    exact_match_rate: (map(if .metrics.exact_match then 1 else 0 end) | (add // 0) / (length | if . == 0 then 1 else . end)),
    mean_edit_similarity: (map(.metrics.edit_similarity) | (add // 0) / (length | if . == 0 then 1 else . end)),
    compile_pass_rate: (map(if .metrics.compiles then 1 else 0 end) | (add // 0) / (length | if . == 0 then 1 else . end))
  }'
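The three rates are plain means over the metrics rows. A minimal Python sketch of the same aggregation follows, assuming each record has a metrics object with boolean exact_match and compiles fields and a numeric edit_similarity; summarize_metrics is a hypothetical name. It returns 0.0 for every rate when there are no rows.

```python
from typing import Iterable


def summarize_metrics(records: Iterable[dict]) -> dict:
    """Compute exact-match rate, mean edit similarity, and compile pass rate."""
    rows = 0
    exact = 0
    compiled = 0
    similarity_sum = 0.0
    for record in records:
        metrics = record.get("metrics") or {}
        rows += 1
        exact += 1 if metrics.get("exact_match") else 0
        compiled += 1 if metrics.get("compiles") else 0
        # A missing similarity is treated as 0 (an assumption of this sketch).
        similarity_sum += metrics.get("edit_similarity") or 0.0
    denominator = rows or 1  # avoid dividing by zero on an empty run
    return {
        "rows": rows,
        "exact_match_rate": exact / denominator,
        "mean_edit_similarity": similarity_sum / denominator,
        "compile_pass_rate": compiled / denominator,
    }
```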

5) Run with real model endpoint (optional)

If a local vLLM (or any other OpenAI-compatible server) is available at http://127.0.0.1:8000:

RUN_ID=fixed-llm-001
./gradlew :bench-cli:bootRun --args="run --config examples/run_config.fixed.yaml --run-id ${RUN_ID}"

Then rerun the metrics commands from step 4 to get real quality numbers.

Windows PowerShell (path-safe quoting)

$RUN_ID = "fixed-no-llm-001"
./gradlew :bench-cli:bootRun --args='validate-config --config examples/run_config.fixed.no_llm.yaml'
./gradlew :bench-cli:bootRun --args="run --config examples/run_config.fixed.no_llm.yaml --run-id $RUN_ID"
./gradlew :bench-cli:bootRun --args="inspect-run --run-id $RUN_ID"

Artifacts

For each run, outputs are written under runs/<run_id>/:

run_manifest.json
run_config.yaml
dataset/raw_samples.jsonl.zst
masking/part-*.jsonl.zst
retrieving/part-*.jsonl.zst
augmentation/part-*.jsonl.zst
generation/part-*.jsonl.zst
metrics/part-*.jsonl.zst
blobs/sha256/...
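To walk this layout from a script, something like the following Python sketch may help. It relies only on the directory names listed above; stage_parts and STAGE_DIRS are hypothetical names, and the default output directory "runs" is taken from the CLI usage. Part files are sorted by name, which matches numeric order only if the part numbers are zero-padded.

```python
from pathlib import Path

# Stage directories that hold part-*.jsonl.zst files, as listed above.
STAGE_DIRS = ("masking", "retrieving", "augmentation", "generation", "metrics")


def stage_parts(output_dir: str, run_id: str) -> dict[str, list[Path]]:
    """Map each stage name to its part files under <output_dir>/<run_id>/."""
    run_dir = Path(output_dir) / run_id
    return {
        stage: sorted((run_dir / stage).glob("part-*.jsonl.zst"))
        for stage in STAGE_DIRS
    }


# Example: report how many parts each stage produced for one run.
# for stage, parts in stage_parts("runs", "fixed-no-llm-001").items():
#     print(stage, len(parts))
```

A stage that has not run yet simply maps to an empty list.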

Example config

See:

examples/run_config.yaml
examples/run_config.fixed.yaml
examples/run_config.fixed.no_llm.yaml

Notes

The implementation assumes the Gradle wrapper (./gradlew) is available for compile metrics in project/module mode.
Stage outputs include sample_id, stage_config_fingerprint, and input_fingerprint for resume/reuse.
Relative paths inside YAML (dataset.project_path, execution.output_dir) are resolved relative to the config file location.
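The last two notes can be illustrated with a short Python sketch. It is not the repository's implementation: the real fingerprint algorithm is not documented here, so SHA-256 over canonical JSON is an assumption, and the names resolve_config_path, fingerprint, and is_reusable are hypothetical. Only the record field names stage_config_fingerprint and input_fingerprint come from the notes above.

```python
import hashlib
import json
from pathlib import Path


def resolve_config_path(config_file: str, value: str) -> Path:
    """Resolve a YAML path value relative to the directory of the config file."""
    candidate = Path(value)
    if candidate.is_absolute():
        return candidate
    return Path(config_file).parent / candidate


def fingerprint(obj) -> str:
    """Hash a JSON-serializable object (assumed scheme: SHA-256 of sorted-key JSON)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_reusable(record: dict, stage_config_fp: str, input_fp: str) -> bool:
    """A stored stage output is reusable only if both fingerprints still match."""
    return (
        record.get("stage_config_fingerprint") == stage_config_fp
        and record.get("input_fingerprint") == input_fp
    )


# Example with a hypothetical relative value for dataset.project_path:
# resolve_config_path("examples/run_config.fixed.yaml", "fixed-project")
# gives the path examples/fixed-project
```

Under this scheme, changing either the stage configuration or the stage input changes a fingerprint, so the stored output no longer matches and the stage would be recomputed on resume.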
