LLM-generated code often contains knowledge-conflicting hallucinations such as incorrect API calls, nonexistent functions, or missing imports. These issues are subtle and difficult to catch without structured analysis. This repository provides a deterministic, explainable pipeline for detecting and correcting such hallucinations using static analysis, a knowledge base of valid APIs, and prompt-guided inference.
The goal is to offer a reproducible benchmark for evaluating hallucination detection and correction on Python snippets generated by LLMs.
The system consists of four main components:
The knowledge base stores:
- Canonical function names per library
- Valid aliases (for example, `pd` for pandas, `np` for numpy)
- Version information
- Additional semantic preferences (such as preferring `read_csv` over `read_excel` for CSV paths)
The KB supports deterministic matching of valid APIs and drives the correction logic.
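
As an illustration, a knowledge base entry could be laid out roughly as follows; the field names (`alias`, `version`, `functions`, `preferences`) and the example libraries are assumptions for this sketch and may not match the exact schema that `src.build_kb` writes to `knowledge_base.json`.

```python
# Sketch of a possible knowledge-base layout; field names are illustrative
# and may not match the schema actually produced by src.build_kb.
import json

KB = {
    "pandas": {
        "alias": "pd",                            # canonical import alias
        "version": "2.2",                         # version information
        "functions": ["read_csv", "read_excel", "DataFrame", "concat"],
        "preferences": {"csv_path": "read_csv"},  # prefer read_csv for CSV paths
    },
    "numpy": {
        "alias": "np",
        "version": "1.26",
        "functions": ["array", "mean", "arange", "zeros"],
        "preferences": {},
    },
}

if __name__ == "__main__":
    with open("knowledge_base.json", "w") as f:
        json.dump(KB, f, indent=2)
```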
The dataset generation module produces synthetic examples that intentionally include API misuse, missing imports, missing aliases, and semantic errors across several libraries.
Each generated sample includes:
- A prompt describing the task
- The hallucinated code
- The correct ground truth
- A reason label for the hallucination
- A boolean indicating whether the detector should catch it
The output CSV serves as a controlled benchmark for evaluating detection precision, recall, and correction quality.
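
For illustration only, a single benchmark row written against the documented header (`id,prompt,ground_truth,hallucination,reason`) might look like the hypothetical example below; the values are invented, not taken from the actual 200-sample CSV.

```python
# Hypothetical example of one benchmark row using the documented CSV header.
# The values are illustrative only.
import csv

row = {
    "id": 1,
    "prompt": "Read data.csv into a DataFrame and print the first rows.",
    "ground_truth": "import pandas as pd\ndf = pd.read_csv('data.csv')\nprint(df.head())",
    "hallucination": "import pandas as pd\ndf = pd.read_excel('data.csv')\nprint(df.head())",
    "reason": "context-inappropriate api choice",
}

with open("example_row.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["id", "prompt", "ground_truth", "hallucination", "reason"]
    )
    writer.writeheader()
    writer.writerow(row)
```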
LLM usage (dataset source):
- The 200-sample CSV was generated using GPT-5 during prototyping and dataset iteration.
- The dataset generation was performed manually by running a single prompt and saving the resulting CSV output.
- The exact prompt used is included below for documentation and reproducibility.
Prompt used to generate the dataset:
you are helping me generate a deterministic benchmark dataset for evaluating hallucination detection and correction in llm-generated python code. output a csv with exactly this header in this order: id,prompt,ground_truth,hallucination,reason. generate exactly 200 rows with ids 1..200. each row should have a short realistic python task prompt, a correct minimal code snippet (ground_truth), an incorrect version (hallucination), and a short reason label. only use these libraries in the code when needed: numpy as np, pandas as pd, matplotlib.pyplot as plt, requests, and json, and dont use any other imports. ground_truth must be valid python, short, and generally follow a simple import + function style. most rows should contain a hallucination, but include a smaller set where hallucination == ground_truth exactly, labeled no hallucination. hallucinations should look realistic like typos, wrong function choice, missing imports, incorrect alias usage, or context-inappropriate api choices
This module is the core of the pipeline. It:
- Parses code using Python's `ast` module
- Extracts import statements, aliases, and function calls
- Identifies invalid or unknown APIs
- Uses fuzzy matching to correct misspelled functions
- Applies semantic rules per library (for example, fixing JSON `load`/`loads` misuse)
- Rewrites incorrect file I/O patterns into correct library calls
- Inserts missing imports and avoids duplicates
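
The sketch below illustrates the general idea behind this kind of analysis, assuming a KB shaped like the earlier example; the function name `detect_hallucinations` and the exact checks are illustrative, not the module's actual API.

```python
# Simplified sketch of the detection step: extract import aliases and
# attribute calls with ast, then fuzzy-match unknown functions against the KB.
import ast
import difflib

def detect_hallucinations(code: str, kb: dict) -> list[str]:
    tree = ast.parse(code)
    alias_to_lib = {}  # e.g. {"pd": "pandas"}
    issues = []

    # Collect plain `import X as Y` aliases (from-imports omitted for brevity)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for name in node.names:
                alias_to_lib[name.asname or name.name] = name.name

    # Check attribute calls such as pd.read_csv(...)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            base = node.func.value
            if isinstance(base, ast.Name) and base.id in alias_to_lib:
                lib = alias_to_lib[base.id]
                valid = kb.get(lib, {}).get("functions", [])
                func = node.func.attr
                if valid and func not in valid:
                    guess = difflib.get_close_matches(func, valid, n=1)
                    hint = f" (did you mean {guess[0]}?)" if guess else ""
                    issues.append(f"unknown {lib} function: {func}{hint}")
    return issues

if __name__ == "__main__":
    kb = {"pandas": {"functions": ["read_csv", "read_excel"]}}
    print(detect_hallucinations("import pandas as pd\npd.read_cvs('x.csv')", kb))
```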
After running the pipeline on the generated dataset, the evaluation module computes:
- Detection accuracy
- Correction accuracy
- Precision
- Recall
- F1 score
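
For reference, these detection metrics can be computed from per-sample outcomes roughly as in the sketch below (a standard precision/recall/F1 computation, not necessarily the evaluator module's exact logic).

```python
# Sketch of accuracy/precision/recall/F1 over per-sample detection outcomes.
# `expected` marks samples that truly contain a hallucination;
# `flagged` marks samples the detector reported.

def detection_metrics(expected: list[bool], flagged: list[bool]) -> dict:
    tp = sum(e and f for e, f in zip(expected, flagged))
    fp = sum((not e) and f for e, f in zip(expected, flagged))
    fn = sum(e and (not f) for e, f in zip(expected, flagged))
    tn = sum((not e) and (not f) for e, f in zip(expected, flagged))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(expected) if expected else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

if __name__ == "__main__":
    print(detection_metrics([True, True, False, True], [True, False, False, True]))
```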
Run the pipeline stages in order:

```
python -m src.build_kb
python -m src.hallucination_generator
python -m src.run_on_generated
python -m src.evaluator
```

To support additional libraries:
- Add the module and its aliases to `knowledge_base.json` (see the sketch after this list)
- Add new hallucination types in the dataset generator
- (Optional) Add semantic rules for domain-specific fixes in the detector/corrector module
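
As a hypothetical example (seaborn is not part of the current KB, and the field names follow the earlier sketch rather than the real schema), a new library entry could be registered like this:

```python
# Hypothetical example of registering an additional library (seaborn) in the KB.
# Field names follow the sketch shown earlier and may differ from the real schema.
import json

with open("knowledge_base.json") as f:
    kb = json.load(f)

kb["seaborn"] = {
    "alias": "sns",
    "version": "0.13",
    "functions": ["histplot", "scatterplot", "lineplot"],
    "preferences": {},
}

with open("knowledge_base.json", "w") as f:
    json.dump(kb, f, indent=2)
```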
The pipeline is designed to be modular. Each library's behavior can be independently expanded without affecting the others.