Detecting and Correcting Hallucinations in LLM-Generated Code

A Deterministic AST-Based Evaluation Pipeline

LLM-generated code often contains knowledge-conflicting hallucinations such as incorrect API calls, nonexistent functions, or missing imports. These issues are subtle and difficult to catch without structured analysis. This repository provides a deterministic, explainable pipeline for detecting and correcting such hallucinations using static analysis, a knowledge base of valid APIs, and prompt-guided inference.

The goal is to offer a reproducible benchmark for evaluating hallucination detection and correction on Python snippets generated by LLMs.

Overview

The system consists of four main components:

1. Knowledge Base

The knowledge base stores:

  • Canonical function names per library
  • Valid aliases (for example, pd for pandas, np for numpy)
  • Version information
  • Additional semantic preferences (such as preferring read_csv over read_excel for CSV paths)

The KB supports deterministic matching of valid APIs and drives the correction logic.
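
As a rough illustration, an entry might pair a library's canonical alias and version with its valid function names, so that checking a call reduces to a dictionary lookup. The field names below are assumptions, not the exact schema of knowledge_base.json:

# Hypothetical KB entry and lookup; field names are illustrative only.
kb = {
    "pandas": {
        "alias": "pd",
        "version": "2.2",
        "functions": ["read_csv", "read_excel", "DataFrame", "concat"],
        "preferences": {"csv_path": "read_csv"},
    }
}

def is_valid_call(library, function):
    """Deterministically check whether a function name exists in the KB for a library."""
    entry = kb.get(library)
    return entry is not None and function in entry["functions"]

print(is_valid_call("pandas", "read_csv"))  # True
print(is_valid_call("pandas", "read_cvs"))  # False -> candidate for fuzzy correction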

2. Dataset Generation

The dataset generation module produces synthetic examples that intentionally include API misuse, missing imports, missing aliases, and semantic errors across several libraries.

Each generated sample includes:

  • A prompt describing the task
  • The hallucinated code
  • The correct ground truth
  • A reason label for the hallucination
  • A boolean indicating whether the detector should catch it

The output CSV serves as a controlled benchmark for evaluating detection precision, recall, and correction quality.
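
For concreteness, one generated sample might look like the record below. The values are illustrative rather than taken from the released CSV, and the name of the boolean field (should_detect) is an assumption:

# Illustrative sample record; field names beyond the CSV header are assumed.
sample = {
    "id": 17,
    "prompt": "Load data.csv into a DataFrame and print the first rows.",
    "ground_truth": "import pandas as pd\ndf = pd.read_csv('data.csv')\nprint(df.head())",
    "hallucination": "import pandas as pd\ndf = pd.read_cvs('data.csv')\nprint(df.head())",
    "reason": "typo in function name",
    "should_detect": True,  # the detector is expected to flag this row
}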

LLM usage (dataset source):

  • The 200-sample CSV was generated using GPT-5 during prototyping and dataset iteration.
  • The dataset generation was performed manually by running a single prompt and saving the resulting CSV output.
  • The exact prompt used is included below for documentation and reproducibility.

Prompt used to generate the dataset:

you are helping me generate a deterministic benchmark dataset for evaluating hallucination detection and correction in llm-generated python code. output a csv with exactly this header in this order: id,prompt,ground_truth,hallucination,reason. generate exactly 200 rows with ids 1..200. each row should have a short realistic python task prompt, a correct minimal code snippet (ground_truth), an incorrect version (hallucination), and a short reason label. only use these libraries in the code when needed: numpy as np, pandas as pd, matplotlib.pyplot as plt, requests, and json, and dont use any other imports. ground_truth must be valid python, short, and generally follow a simple import + function style. most rows should contain a hallucination, but include a smaller set where hallucination == ground_truth exactly, labeled no hallucination. hallucinations should look realistic like typos, wrong function choice, missing imports, incorrect alias usage, or context-inappropriate api choices

3. Detection and Correction

This module is the core of the pipeline. It:

  • Parses code using Python's ast module
  • Extracts import statements, aliases, and function calls
  • Identifies invalid or unknown APIs
  • Uses fuzzy matching to correct misspelled functions
  • Applies semantic rules per library (for example, fixing JSON load/loads misuse)
  • Rewrites incorrect file I/O patterns into correct library calls
  • Inserts missing imports and avoids duplicates
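
A minimal sketch of the extraction step, assuming the detector walks the AST to collect import aliases and attribute-style calls (the actual implementation in the detector/corrector module may differ):

# Sketch only: collect import aliases and attribute calls such as pd.read_cvs.
import ast

def extract_calls(code):
    tree = ast.parse(code)
    aliases, calls = {}, []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for name in node.names:
                aliases[name.asname or name.name] = name.name
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if isinstance(node.func.value, ast.Name):
                calls.append((node.func.value.id, node.func.attr))
    return aliases, calls

aliases, calls = extract_calls("import pandas as pd\ndf = pd.read_cvs('data.csv')")
# aliases == {'pd': 'pandas'}; calls == [('pd', 'read_cvs')]

Resolving pd to pandas through the knowledge base and failing to find read_cvs flags the call; a fuzzy step (for example, difflib.get_close_matches against the KB's function list) can then propose read_csv as the correction.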

4. Evaluation

After running the pipeline on the generated dataset, the evaluation module computes:

  • Detection accuracy
  • Correction accuracy
  • Precision
  • Recall
  • F1 score
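
Precision, recall, and F1 follow their standard definitions over per-row detection outcomes. A sketch, assuming each row records whether a hallucination was present and whether the detector flagged it:

# Standard detection metrics over (has_hallucination, was_flagged) pairs.
def detection_metrics(rows):
    tp = sum(1 for h, f in rows if h and f)
    fp = sum(1 for h, f in rows if not h and f)
    fn = sum(1 for h, f in rows if h and not f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1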

Running the Pipeline

Build the Knowledge Base

python -m src.build_kb

Generate the Dataset

python -m src.hallucination_generator

Run the Detector/Corrector

python -m src.run_on_generated

Evaluate Results

python -m src.evaluator

Extending the System

To support additional libraries:

  • Add the module and its aliases to knowledge_base.json
  • Add new hallucination types in the dataset generator
  • (Optional) Add semantic rules for domain-specific fixes in the detector/corrector module

The pipeline is designed to be modular. Each library's behavior can be independently expanded without affecting the others.
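
As a hypothetical example (field names are assumed, matching the sketch in the Knowledge Base section), registering a new library such as seaborn could be a single new entry in knowledge_base.json:

# Hypothetical: register seaborn in the KB; adjust field names to the actual schema.
import json

with open("knowledge_base.json") as f:
    kb = json.load(f)

kb["seaborn"] = {
    "alias": "sns",
    "version": "0.13",
    "functions": ["lineplot", "scatterplot", "heatmap"],
    "preferences": {},
}

with open("knowledge_base.json", "w") as f:
    json.dump(kb, f, indent=2)

New hallucination types for the library would then be added to the dataset generator so that the benchmark exercises the new entry.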
