Tricky2

This repository contains the code used for the experiments and analyses on the Tricky² dataset. All components are organized to support reproducibility, including dataset preprocessing, model training/evaluation scripts, and analysis utilities.

Tricky² is a benchmark designed to evaluate the robustness of automated software-engineering systems, particularly large language models (LLMs), on realistic, multi-origin software defects. It extends prior bug-fixing datasets by introducing a controlled mixture of human-written bugs and LLM-generated bugs, enabling the study of how these defect types differ and interact. The dataset contains three primary splits:

  • Human-only: Programs containing naturally occurring bugs from real student or developer submissions.
  • LLM-only: Programs where the only defects were injected by large language models using structured prompts.
  • Human+LLM (mixed-origin): Programs that contain original human bugs along with additional LLM-injected bugs.
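
As an illustration of how these splits might be consumed, the sketch below assumes the released data is a JSON Lines file in which each record carries an origin field whose value is human, llm, or mixed; the file name and field names are assumptions for illustration, not the published schema.

import json
from collections import Counter
from pathlib import Path

# Hypothetical file name and schema: one program record per line, with an
# "origin" field whose value is "human", "llm", or "mixed".
DATA_FILE = Path("tricky2.jsonl")

def load_split(origin: str) -> list[dict]:
    """Return every record belonging to one origin split."""
    with DATA_FILE.open(encoding="utf-8") as fh:
        return [r for r in map(json.loads, fh) if r.get("origin") == origin]

if __name__ == "__main__":
    # Quick look at how many programs fall into each split.
    with DATA_FILE.open(encoding="utf-8") as fh:
        print(Counter(json.loads(line).get("origin") for line in fh))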

Each program includes:

  • The buggy code
  • A corresponding reference solution
  • A taxonomy label describing the fault type
  • Problem metadata (language, difficulty, problem category)
  • Test suites for evaluating correctness or attempted repairs
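
These per-program fields map naturally onto a small record type. The dataclass below is a hypothetical in-memory view with illustrative field names; the released dataset may name or nest the fields differently.

from dataclasses import dataclass, field

@dataclass
class Tricky2Record:
    """Hypothetical view of one benchmark program (field names are illustrative)."""
    buggy_code: str                 # the defective program
    reference_solution: str         # corresponding correct solution
    fault_type: str                 # taxonomy label describing the defect
    language: str                   # problem metadata
    difficulty: str
    category: str
    origin: str                     # "human", "llm", or "mixed"
    tests: list[str] = field(default_factory=list)  # test suite for judging repairs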

The benchmark supports multiple evaluation tasks, including:

  1. Origin classification – determining whether a bug is human-authored, LLM-generated, or mixed.
  2. Error identification – localizing the lines or regions responsible for the defect.
  3. Program repair – producing fixes that pass the provided tests.

Tricky² is intended to help researchers study failure modes, interaction effects among multiple bug sources, and the limits of current automated program-analysis and repair models.
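
For the repair task in particular, success is defined by the provided tests. The sketch below shows one way such a check could be wired up, assuming each test is a standalone Python script that receives the candidate file path as its first argument and exits non-zero on failure; this convention, and the helper itself, are assumptions rather than the harness shipped with the benchmark.

import subprocess
import sys
import tempfile
from pathlib import Path

def repair_passes(candidate_code: str, tests: list[str]) -> bool:
    """Return True only if the candidate fix passes every test script.

    Assumes each test is a Python script that takes the candidate's path as
    argv[1] and exits with status 0 on success (an illustrative convention).
    """
    with tempfile.TemporaryDirectory() as tmp:
        candidate = Path(tmp) / "candidate.py"
        candidate.write_text(candidate_code, encoding="utf-8")
        for i, test in enumerate(tests):
            test_file = Path(tmp) / f"test_{i}.py"
            test_file.write_text(test, encoding="utf-8")
            try:
                result = subprocess.run(
                    [sys.executable, str(test_file), str(candidate)],
                    capture_output=True,
                    timeout=30,  # guard against non-terminating repairs
                )
            except subprocess.TimeoutExpired:
                return False
            if result.returncode != 0:
                return False
    return True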

Requirements

All requirements can be installed via

pip install -r requirements.txt

Data

The dataset used in this repository is not included directly; the full dataset is available on Zenodo.
