sckwokyboom/Java-Code-Fixing-Benchmarks

CodeFixes Benchmark Infrastructure (MVP)

Multi-module Java 17 + Gradle (Groovy DSL) framework for reproducible Java CodeFixes benchmarking.

Modules

  • bench-core: domain model, config, fingerprints, stage contracts
  • bench-io: JSONL+Zstd streaming IO, blob store, resume state index
  • bench-dataset: dataset generation + masking stage
  • bench-retrieval: Lucene BM25 index/retrieval + leakage filter
  • bench-augmentation: prompt block construction
  • bench-llm: OpenAI-compatible generation client
  • bench-metrics: exact/edit/compile metrics
  • bench-runner: E2E orchestration, batching, resume/reuse/shuffle
  • bench-cli: Spring Boot + picocli CLI commands

CLI

  • bench run --config <path> [--resume] [--start-from-stage <STAGE>] [--reuse-from-run <RUN_ID>]
  • bench inspect-run --run-id <RUN_ID> [--output-dir runs]
  • bench validate-config --config <path>

Quick Start On Fixed Project

This repository already contains a fixed Java project:

  • examples/fixed-project

And ready-to-use configs:

  • examples/run_config.fixed.yaml (normal run against a local LLM endpoint)
  • examples/run_config.fixed.no_llm.yaml (run without an available LLM; still produces numeric artifacts)

If the Gradle wrapper is not present yet, replace ./gradlew with gradle in the commands below.

1) Validate config

./gradlew :bench-cli:bootRun --args='validate-config --config examples/run_config.fixed.no_llm.yaml'

2) Run benchmark now (no LLM required)

RUN_ID=fixed-no-llm-001
./gradlew :bench-cli:bootRun --args="run --config examples/run_config.fixed.no_llm.yaml --run-id ${RUN_ID}"

3) Inspect run summary

./gradlew :bench-cli:bootRun --args="inspect-run --run-id ${RUN_ID}"

4) Get numeric results from artifacts

Dataset size:

zstd -dc runs/${RUN_ID}/dataset/raw_samples*.jsonl.zst | wc -l

Retrieval leakage counters:

for f in runs/${RUN_ID}/retrieving/part-*.jsonl.zst; do zstd -dc "$f"; done | \
  jq -s '{
    samples: length,
    leakage_detected: map(select(.leakage_detected == true)) | length,
    leakage_filtered_total: (map(.leakage_filtered_count) | add)
  }'

Generation status (how many records contain errors):

for f in runs/${RUN_ID}/generation/part-*.jsonl.zst; do zstd -dc "$f"; done | \
  jq -s '{
    samples: length,
    with_candidates: map(select((.candidates | length) > 0)) | length,
    with_errors: map(select((.errors | length) > 0)) | length
  }'

Main metrics:

for f in runs/${RUN_ID}/metrics/part-*.jsonl.zst; do zstd -dc "$f"; done | \
  jq -s '{
    rows: length,
    exact_match_rate: (map(if .metrics.exact_match then 1 else 0 end) | (add // 0) / (length | if . == 0 then 1 else . end)),
    mean_edit_similarity: (map(.metrics.edit_similarity) | (add // 0) / (length | if . == 0 then 1 else . end)),
    compile_pass_rate: (map(if .metrics.compiles then 1 else 0 end) | (add // 0) / (length | if . == 0 then 1 else . end))
  }'
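
The same aggregation can be sketched in Java for anyone post-processing the metrics rows programmatically. This is a minimal illustration, not the code in bench-metrics: the MetricRow and Summary records and the summarize method are hypothetical names, and only the three field meanings (exact_match, edit_similarity, compiles) come from the jq command above. As in the jq version, an empty input yields rates of 0 instead of a division by zero.

```java
import java.util.List;

class MetricsSummarySketch {
    // Hypothetical in-memory view of one metrics row (fields mirror .metrics.* in the JSONL).
    record MetricRow(boolean exactMatch, double editSimilarity, boolean compiles) {}

    // Hypothetical summary shape, matching the keys produced by the jq command.
    record Summary(int rows, double exactMatchRate, double meanEditSimilarity, double compilePassRate) {}

    static Summary summarize(List<MetricRow> rows) {
        int n = rows.size();
        // Guard against division by zero, like the jq "if . == 0 then 1" clause.
        double denominator = (n == 0) ? 1.0 : n;
        long exact = rows.stream().filter(MetricRow::exactMatch).count();
        double editSum = rows.stream().mapToDouble(MetricRow::editSimilarity).sum();
        long compiled = rows.stream().filter(MetricRow::compiles).count();
        return new Summary(n, exact / denominator, editSum / denominator, compiled / denominator);
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only; these are not benchmark results.
        List<MetricRow> rows = List.of(
                new MetricRow(true, 1.0, true),
                new MetricRow(false, 0.5, true));
        System.out.println(summarize(rows));
    }
}
```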

5) Run with real model endpoint (optional)

If a local vLLM server (or any other OpenAI-compatible server) is available at http://127.0.0.1:8000:

RUN_ID=fixed-llm-001
./gradlew :bench-cli:bootRun --args="run --config examples/run_config.fixed.yaml --run-id ${RUN_ID}"

Then rerun the metrics commands from step 4 to get real quality numbers.

Windows PowerShell (path-safe quoting)

$RUN_ID = "fixed-no-llm-001"
./gradlew :bench-cli:bootRun --args='validate-config --config examples/run_config.fixed.no_llm.yaml'
./gradlew :bench-cli:bootRun --args="run --config examples/run_config.fixed.no_llm.yaml --run-id $RUN_ID"
./gradlew :bench-cli:bootRun --args="inspect-run --run-id $RUN_ID"

Artifacts

For each run, outputs are written under runs/<run_id>/:

  • run_manifest.json
  • run_config.yaml
  • dataset/raw_samples.jsonl.zst
  • masking/part-*.jsonl.zst
  • retrieving/part-*.jsonl.zst
  • augmentation/part-*.jsonl.zst
  • generation/part-*.jsonl.zst
  • metrics/part-*.jsonl.zst
  • blobs/sha256/...

Example config

See:

  • examples/run_config.yaml
  • examples/run_config.fixed.yaml
  • examples/run_config.fixed.no_llm.yaml
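
Relative paths inside these YAML files (dataset.project_path, execution.output_dir) are resolved against the config file's location, not the current working directory. A minimal sketch of that rule follows; the class and method names are hypothetical, the sample values "fixed-project" and "../runs" are made up for illustration, and the actual resolution code in the repository may differ.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class ConfigPathSketch {
    // Resolves a path value read from the YAML against the directory that holds the config file.
    // Absolute values are returned as-is (normalized).
    static Path resolveAgainstConfig(Path configFile, String value) {
        Path candidate = Paths.get(value);
        if (candidate.isAbsolute()) {
            return candidate.normalize();
        }
        Path configDir = configFile.toAbsolutePath().getParent();
        return configDir.resolve(candidate).normalize();
    }

    public static void main(String[] args) {
        Path config = Paths.get("examples/run_config.fixed.no_llm.yaml");
        // Hypothetical YAML values, not taken from the shipped configs.
        System.out.println(resolveAgainstConfig(config, "fixed-project"));
        System.out.println(resolveAgainstConfig(config, "../runs"));
    }
}
```

Under this rule, the same config produces the same resolved paths regardless of the directory the CLI is launched from.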

Notes

  • Compile metrics in project/module mode assume that a Gradle wrapper (./gradlew) is available.
  • Stage outputs include sample_id, stage_config_fingerprint, and input_fingerprint for resume/reuse.
  • Relative paths inside YAML (dataset.project_path, execution.output_dir) are resolved relative to the config file location.
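
To illustrate how the fingerprint fields can support resume/reuse, here is a minimal sketch. It assumes a fingerprint is a SHA-256 hex digest of some canonical text and that a previous record is reusable only when both fingerprints match; both are assumptions, not the documented behavior of bench-core or bench-runner. The StageRecord record and the sha256Hex and canReuse methods are hypothetical names; only the three field names come from the notes above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;

class ResumeSketch {
    // Hypothetical view of the three bookkeeping fields carried by each stage output row.
    record StageRecord(String sampleId, String stageConfigFingerprint, String inputFingerprint) {}

    // Assumption: a fingerprint is the SHA-256 hex digest of a canonical string.
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(text.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // A previous result is reused only if neither the stage config nor the input has changed.
    static boolean canReuse(Map<String, StageRecord> previous, String sampleId,
                            String configFingerprint, String inputFingerprint) {
        StageRecord old = previous.get(sampleId);
        return old != null
                && old.stageConfigFingerprint().equals(configFingerprint)
                && old.inputFingerprint().equals(inputFingerprint);
    }

    public static void main(String[] args) {
        // Made-up config and input strings for illustration.
        String configFp = sha256Hex("top_k=5");
        String inputFp = sha256Hex("class A {}");
        Map<String, StageRecord> previous =
                Map.of("s1", new StageRecord("s1", configFp, inputFp));
        System.out.println(canReuse(previous, "s1", configFp, inputFp));
        System.out.println(canReuse(previous, "s1", sha256Hex("top_k=10"), inputFp));
    }
}
```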

About

Java/Spring Benchmarks for LLM Code-Fixing Tasks
