This repository contains the code used for the experiments and analyses on the Tricky² dataset. All components are organized to support reproducibility, including dataset preprocessing, model training/evaluation scripts, and analysis utilities.
Tricky² is a benchmark designed to evaluate the robustness of automated software-engineering systems, particularly large language models (LLMs), on realistic, multi-origin software defects. It extends prior bug-fixing datasets by introducing a controlled mixture of human-written and LLM-generated bugs, enabling the study of how these defect types differ and interact. The dataset contains three primary splits:
- Human-only: Programs containing naturally occurring bugs from real student or developer submissions.
- LLM-only: Programs where the only defects were injected by large language models using structured prompts.
- Human+LLM (mixed-origin): Programs that contain original human bugs along with additional LLM-injected bugs.
Each program includes:
- The buggy code
- A corresponding reference solution
- A taxonomy label describing the fault type
- Problem metadata (language, difficulty, problem category)
- Test suites for evaluating correctness or attempted repairs
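For orientation, a single entry might look roughly like the following Python dictionary. The field names and values are illustrative assumptions, not the dataset's actual schema; consult the Zenodo release for the exact format.

```python
# Hypothetical shape of one Tricky² entry; field names are illustrative
# assumptions, not the dataset's actual schema.
example_entry = {
    "problem_id": "p0042",
    "language": "python",
    "difficulty": "medium",
    "category": "string-manipulation",
    "origin": "human+llm",       # "human", "llm", or "human+llm"
    "fault_type": "off-by-one",  # taxonomy label for the defect
    "buggy_code": "def count_vowels(s):\n    return sum(1 for c in s[1:] if c in 'aeiou')\n",
    "reference_solution": "def count_vowels(s):\n    return sum(1 for c in s if c in 'aeiou')\n",
    "tests": [
        {"input": "banana", "expected": 3},
        {"input": "", "expected": 0},
    ],
}
```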
The benchmark supports multiple evaluation tasks, including:
- Origin classification – determining whether a bug is human-authored, LLM-generated, or mixed.
- Error identification – localizing the lines or regions responsible for the defect.
- Program repair – producing fixes that pass the provided tests (see the evaluation sketch below).

Tricky² is intended to help researchers study failure modes, interaction effects among multiple bug sources, and the limits of current automated program-analysis and repair models.
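As a rough illustration of the repair task, the sketch below checks a candidate fix against input/expected-output test cases shaped like the hypothetical entry above. The record layout and the `count_vowels` entry point are assumptions; the actual evaluation harness in this repository may differ.

```python
# Minimal sketch of scoring a candidate repair against a test suite.
# Assumes the hypothetical record layout shown above; the real harness
# in this repository may work differently.

def passes_tests(candidate_code: str, tests: list[dict], entry_point: str) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # load the candidate fix
        func = namespace[entry_point]    # e.g. "count_vowels"
        return all(func(t["input"]) == t["expected"] for t in tests)
    except Exception:
        return False                     # crashes or missing functions count as failures


if __name__ == "__main__":
    fixed = "def count_vowels(s):\n    return sum(1 for c in s if c in 'aeiou')\n"
    tests = [{"input": "banana", "expected": 3}, {"input": "", "expected": 0}]
    print(passes_tests(fixed, tests, "count_vowels"))  # expected output: True
```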
All requirements can be installed via `pip install -r requirements.txt`.
The dataset itself is not included in this repository; the full dataset is available on Zenodo.
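If you prefer to fetch the data programmatically, something like the sketch below works once you know the record's file URL. The URL, archive name, and target directory are placeholders, not the actual Zenodo record.

```python
# Placeholder sketch for fetching the dataset archive from Zenodo.
# RECORD_URL and the archive/target names are hypothetical; take the real
# file URL from the Zenodo record page.
import urllib.request
import zipfile

RECORD_URL = "https://zenodo.org/records/<record-id>/files/tricky2.zip"  # placeholder
ARCHIVE = "tricky2.zip"

urllib.request.urlretrieve(RECORD_URL, ARCHIVE)
with zipfile.ZipFile(ARCHIVE) as zf:
    zf.extractall("data/tricky2")  # extract alongside the repository code
```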