agohr/llm_proof_feedback

LLM Proof Feedback: Code and Data

Accompanying code and data for the paper

Evaluating LLM-Generated Formative Feedback for Undergraduate Mathematics Through the Lens of Feedback Theory
Aron Gohr, Marie-Amelie Lawn, Kevin Gao, Stephen Heslip, Inigo Serjeant
BEA 2026

Overview

This repository contains:

  1. Feedback data — 65 question–student-answer pairs from a first-year transition-to-proof course, each paired with LLM-generated feedback under four conditions (two models × two workflows).
  2. Human evaluation data — H&T rubric scores and reconciled grades from the grading form used by the authors.
  3. H&T evaluation scripts — for re-running the Hattie & Timperley (2007) meta-evaluation of the LLM feedback using LLM graders.
  4. Grade screening — a lightweight model (GPT-4.1-nano) regrades each feedback item, inferring a grade from the question and feedback alone; Kendall τ is then computed between these regraded grades and the human grades.

For grade-correlation sweeps, workflow parameter searches, and the advanced stress tests, see the companion repository agohr/math_tutor.

The evaluation rubric for all four dimensions is documented in ht_rubric.md as well as in the configuration files in configs/. Human and LLM graders used the same rubric to grade the feedback items.

Repository Layout

configs/                       H&T rubric and regrading prompt configs
tests/feedback_data/           Input data: one JSON file per condition
human_evaluation/              Human H&T scores and matched LLM output
run_ht_evaluation.sh           Runs all H&T grading + regrading jobs
compare_with_human_ratings.py  Compares outputs against human scores
math_tutor.py                  Core batch-processing engine
evaluator.py / token_usage.py  Supporting utilities

Feedback Conditions

File                            Feedback model  Workflow
gpt41__baseline_concise.json    GPT-4.1         Baseline-concise
gpt41__ms_w_example_final.json  GPT-4.1         MS-w-example
gpt5__baseline_concise.json     GPT-5           Baseline-concise
gpt5__ms_w_example_final.json   GPT-5           MS-w-example

Each file contains 65 entries with fields question, answer, llm_feedback, and rating (reconciled human grade, 0–5).
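
A quick sanity check of a condition file can be sketched as follows. This is a minimal sketch, not part of the repository: it assumes each file is a top-level JSON list of objects, that `rating` is stored as a number, and that the file sits directly under `tests/feedback_data/`. The helper names `check_entries` and `load_condition` are hypothetical.

```python
import json
from pathlib import Path

# Field names as listed in this README.
REQUIRED_FIELDS = ("question", "answer", "llm_feedback", "rating")


def check_entries(entries):
    """Return a list of human-readable problems found in the entries."""
    problems = []
    for i, entry in enumerate(entries):
        missing = [f for f in REQUIRED_FIELDS if f not in entry]
        if missing:
            problems.append(f"entry {i}: missing {missing}")
            continue
        rating = entry["rating"]
        # Assumption: ratings are numeric and lie in the documented 0-5 range.
        if not isinstance(rating, (int, float)) or not 0 <= rating <= 5:
            problems.append(f"entry {i}: rating {rating!r} outside 0-5")
    return problems


def load_condition(path):
    """Load one condition file (assumed to be a top-level JSON list)."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    # Assumed location; adjust if the files live elsewhere.
    path = Path("tests/feedback_data/gpt41__ms_w_example_final.json")
    if path.exists():
        entries = load_condition(path)
        print(len(entries), "entries;", len(check_entries(entries)), "problems")
```

If the assumptions hold, each of the four files should report 65 entries and no problems.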

The gpt41__ms_w_example_final.json file is the condition for which a full human H&T meta-evaluation was conducted; the matched scores are in human_evaluation/ht_scores_matched.json.

Installation

pip install -r requirements.txt

Set OPENAI_API_KEY in your environment or in a .env file.
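
Before starting a run, it can help to confirm that the key is actually visible. The sketch below uses only the standard library and assumes a plain `KEY=VALUE` format for `.env`; how the repository's own scripts read `.env` is not shown here and may differ. The function names are hypothetical.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    p = Path(path)
    if not p.exists():
        return values
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def api_key_available(env_path=".env"):
    """True if OPENAI_API_KEY is set in the environment or in the .env file."""
    if os.environ.get("OPENAI_API_KEY"):
        return True
    return bool(read_env_file(env_path).get("OPENAI_API_KEY"))


if __name__ == "__main__":
    print("OPENAI_API_KEY found" if api_key_available() else "OPENAI_API_KEY missing")
```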

Running the H&T Evaluation

bash run_ht_evaluation.sh

This runs four H&T rubric configs × two grader models (GPT-5.4, GPT-4.1) plus grade screening via GPT-4.1-nano in parallel. Outputs are written to tests/feedback_data/output/<model>/. Expect roughly 500–600 API calls and a few minutes of wall time.

Comparing with Human Ratings

python compare_with_human_ratings.py

Reports Kendall τ between regraded and human grades, mean H&T scores per grader and dimension, and item-level agreement between LLM and human H&T scores for the GPT-4.1 + MS-w-example condition.
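
For readers unfamiliar with the statistic, the following is a minimal pure-Python sketch of Kendall's τ-b, the tie-corrected variant that suits integer grades on a 0–5 scale. Which variant `compare_with_human_ratings.py` actually computes is an assumption here, and the function name `kendall_tau_b` is illustrative only.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences of numeric grades."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    # tau-b is undefined when one sequence is constant.
    return (concordant - discordant) / denom if denom else float("nan")


if __name__ == "__main__":
    # Toy grades, not taken from the dataset.
    print(kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4]))
```

The pairwise loop is quadratic in the number of items, which is unproblematic for 65 entries.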

Human Evaluation File

human_evaluation/ht_scores_matched.json contains one entry per question–answer pair (GPT-4.1 + MS-w-example condition) with:

  • question, answer, llm_feedback — the triple that was rated
  • human_grade — reconciled expert grade (0–5)
  • human_ht — scores from the paper grading form: d1_correctness, d2_clarity, d3_process, d4_self_regulation (0–5); all 65 entries are fully annotated
  • llm_ht — corresponding GPT-5.4 LLM grades with chain-of-thought justifications
  • regraded_grade — GPT-4.1-nano grade inferred from question + feedback only

This file was regenerated from feedback_grading.tex (the annotated grading form) and the GPT-5.4 H&T outputs for the GPT-4.1 + MS-w-example condition.
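
Per-dimension means of the human H&T scores can be recomputed along these lines. This is a sketch under assumptions: the file is taken to be a top-level JSON list, and `human_ht` is taken to be an object mapping each dimension name to a number. The layout of `llm_ht`, which also carries justifications, is not assumed, so the sketch reads only `human_ht`. The function name `mean_human_scores` is hypothetical.

```python
import json
from pathlib import Path
from statistics import mean

# Dimension names as listed in this README.
DIMENSIONS = ("d1_correctness", "d2_clarity", "d3_process", "d4_self_regulation")


def mean_human_scores(entries):
    """Mean human H&T score per dimension (assumes human_ht is a dict of numbers)."""
    return {dim: mean(e["human_ht"][dim] for e in entries) for dim in DIMENSIONS}


if __name__ == "__main__":
    path = Path("human_evaluation/ht_scores_matched.json")
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            entries = json.load(fh)
        for dim, value in mean_human_scores(entries).items():
            print(f"{dim}: {value:.2f}")
```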
