LLM-generated code often contains knowledge-conflicting hallucinations such as incorrect API calls, nonexistent functions, or missing imports. These issues are subtle and difficult to catch without structured analysis. This repository provides a deterministic, explainable pipeline for detecting and correcting such hallucinations using static analysis, a knowledge base of valid APIs, and prompt-guided inference.
The goal is to offer a reproducible benchmark for evaluating hallucination detection and correction on Python snippets generated by LLMs.
The system consists of four main components:
The knowledge base stores:
- Canonical function names per library
- Valid aliases (for example, `pd` for pandas, `np` for numpy)
- Version information
- Additional semantic preferences (such as preferring `read_csv` over `read_excel` for CSV paths)
The KB supports deterministic matching of valid APIs and drives the correction logic.
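
As an illustration, a knowledge base entry could be laid out roughly as follows; the field names (`alias`, `version`, `functions`, `preferences`) and the example libraries are assumptions for this sketch and may not match the exact schema that `src.build_kb` writes to `knowledge_base.json`.

```python
# Sketch of a possible knowledge-base layout; field names are illustrative
# and may not match the schema actually produced by src.build_kb.
import json

KB = {
    "pandas": {
        "alias": "pd",                            # canonical import alias
        "version": "2.2",                         # version information
        "functions": ["read_csv", "read_excel", "DataFrame", "concat"],
        "preferences": {"csv_path": "read_csv"},  # prefer read_csv for CSV paths
    },
    "numpy": {
        "alias": "np",
        "version": "1.26",
        "functions": ["array", "mean", "arange", "zeros"],
        "preferences": {},
    },
}

if __name__ == "__main__":
    with open("knowledge_base.json", "w") as f:
        json.dump(KB, f, indent=2)
```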
The dataset generation module produces synthetic examples that intentionally include API misuse, missing imports, missing aliases, and semantic errors across several libraries.
Each generated sample includes:
- A prompt describing the task
- The hallucinated code
- The correct ground truth
- A reason label for the hallucination
- A boolean indicating whether the detector should catch it
The output CSV serves as a controlled benchmark for evaluating detection precision, recall, and correction quality.
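
For illustration only, a single benchmark row written against the documented header (`id,prompt,ground_truth,hallucination,reason`) might look like the hypothetical example below; the values are invented, not taken from the actual 200-sample CSV.

```python
# Hypothetical example of one benchmark row using the documented CSV header.
# The values are illustrative only.
import csv

row = {
    "id": 1,
    "prompt": "Read data.csv into a DataFrame and print the first rows.",
    "ground_truth": "import pandas as pd\ndf = pd.read_csv('data.csv')\nprint(df.head())",
    "hallucination": "import pandas as pd\ndf = pd.read_excel('data.csv')\nprint(df.head())",
    "reason": "context-inappropriate api choice",
}

with open("example_row.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["id", "prompt", "ground_truth", "hallucination", "reason"]
    )
    writer.writeheader()
    writer.writerow(row)
```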
LLM usage (dataset source):
- The 200-sample CSV was generated using GPT-5 during prototyping and dataset iteration.
- The dataset generation was performed manually by running a single prompt and saving the resulting CSV output.
- The exact prompt used is included below for documentation and reproducibility.
Prompt used to generate the dataset:
you are helping me generate a deterministic benchmark dataset for evaluating hallucination detection and correction in llm-generated python code. output a csv with exactly this header in this order: id,prompt,ground_truth,hallucination,reason. generate exactly 200 rows with ids 1..200. each row should have a short realistic python task prompt, a correct minimal code snippet (ground_truth), an incorrect version (hallucination), and a short reason label. only use these libraries in the code when needed: numpy as np, pandas as pd, matplotlib.pyplot as plt, requests, and json, and dont use any other imports. ground_truth must be valid python, short, and generally follow a simple import + function style. most rows should contain a hallucination, but include a smaller set where hallucination == ground_truth exactly, labeled no hallucination. hallucinations should look realistic like typos, wrong function choice, missing imports, incorrect alias usage, or context-inappropriate api choices
This module is the core of the pipeline. It:
- Parses code using Python's `ast` module
- Extracts import statements, aliases, and function calls
- Identifies invalid or unknown APIs
- Uses fuzzy matching to correct misspelled functions
- Applies semantic rules per library (for example, fixing JSON `load`/`loads` misuse)
- Rewrites incorrect file I/O patterns into correct library calls
- Inserts missing imports and avoids duplicates
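
The sketch below illustrates the general idea behind this kind of analysis, assuming a KB shaped like the earlier example; the function name `detect_hallucinations` and the exact checks are illustrative, not the module's actual API.

```python
# Simplified sketch of the detection step: extract import aliases and
# attribute calls with ast, then fuzzy-match unknown functions against the KB.
import ast
import difflib

def detect_hallucinations(code: str, kb: dict) -> list[str]:
    tree = ast.parse(code)
    alias_to_lib = {}  # e.g. {"pd": "pandas"}
    issues = []

    # Collect plain `import X as Y` aliases (from-imports omitted for brevity)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for name in node.names:
                alias_to_lib[name.asname or name.name] = name.name

    # Check attribute calls such as pd.read_csv(...)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            base = node.func.value
            if isinstance(base, ast.Name) and base.id in alias_to_lib:
                lib = alias_to_lib[base.id]
                valid = kb.get(lib, {}).get("functions", [])
                func = node.func.attr
                if valid and func not in valid:
                    guess = difflib.get_close_matches(func, valid, n=1)
                    hint = f" (did you mean {guess[0]}?)" if guess else ""
                    issues.append(f"unknown {lib} function: {func}{hint}")
    return issues

if __name__ == "__main__":
    kb = {"pandas": {"functions": ["read_csv", "read_excel"]}}
    print(detect_hallucinations("import pandas as pd\npd.read_cvs('x.csv')", kb))
```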
After running the pipeline on the generated dataset, the evaluation module computes:
- Detection accuracy
- Correction accuracy
- Precision
- Recall
- F1 score
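
For reference, these detection metrics can be computed from per-sample outcomes roughly as in the sketch below (a standard precision/recall/F1 computation, not necessarily the evaluator module's exact logic).

```python
# Sketch of accuracy/precision/recall/F1 over per-sample detection outcomes.
# `expected` marks samples that truly contain a hallucination;
# `flagged` marks samples the detector reported.

def detection_metrics(expected: list[bool], flagged: list[bool]) -> dict:
    tp = sum(e and f for e, f in zip(expected, flagged))
    fp = sum((not e) and f for e, f in zip(expected, flagged))
    fn = sum(e and (not f) for e, f in zip(expected, flagged))
    tn = sum((not e) and (not f) for e, f in zip(expected, flagged))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(expected) if expected else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

if __name__ == "__main__":
    print(detection_metrics([True, True, False, True], [True, False, False, True]))
```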
Run the pipeline stages in order:

```
python -m src.build_kb
python -m src.hallucination_generator
python -m src.run_on_generated
python -m src.evaluator
```

To support additional libraries:
- Add the module and its aliases to `knowledge_base.json` (see the sketch after this list)
- Add new hallucination types in the dataset generator
- (Optional) Add semantic rules for domain-specific fixes in the detector/corrector module
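
As a hypothetical example (seaborn is not part of the current KB, and the field names follow the earlier sketch rather than the real schema), a new library entry could be registered like this:

```python
# Hypothetical example of registering an additional library (seaborn) in the KB.
# Field names follow the sketch shown earlier and may differ from the real schema.
import json

with open("knowledge_base.json") as f:
    kb = json.load(f)

kb["seaborn"] = {
    "alias": "sns",
    "version": "0.13",
    "functions": ["histplot", "scatterplot", "lineplot"],
    "preferences": {},
}

with open("knowledge_base.json", "w") as f:
    json.dump(kb, f, indent=2)
```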
The pipeline is designed to be modular. Each library's behavior can be independently expanded without affecting the others.