LLM Proof Feedback: Code and Data

Accompanying code and data for the paper

Evaluating LLM-Generated Formative Feedback for Undergraduate Mathematics Through the Lens of Feedback Theory
Aron Gohr, Marie-Amelie Lawn, Kevin Gao, Stephen Heslip, Inigo Serjeant
BEA 2026

Overview

This repository contains:

Feedback data — 65 question–student-answer pairs from a first-year transition-to-proof course, each paired with LLM-generated feedback under four conditions (two models × two workflows).
Human evaluation data — Hattie & Timperley (2007) (H&T) rubric scores and reconciled grades from the grading form used by the authors.
H&T evaluation scripts — for re-running the H&T meta-evaluation of the LLM feedback using LLM graders.
Grade screening — regrades each feedback item with a lightweight model (GPT-4.1-nano), using only the question and the feedback, and computes Kendall τ between the inferred grades and the human grades.

For grade-correlation sweeps, workflow parameter searches, and the advanced stress tests, see the companion repository agohr/math_tutor.

The evaluation rubric for all four dimensions is documented in ht_rubric.md as well as in the configuration files in configs/. Human and LLM graders used the same rubric to grade the feedback items.

Repository Layout

configs/                       H&T rubric and regrading prompt configs
tests/feedback_data/           Input data: one JSON file per condition
human_evaluation/              Human H&T scores and matched LLM output
run_ht_evaluation.sh           Runs all H&T grading + regrading jobs
compare_with_human_ratings.py  Compares outputs against human scores
math_tutor.py                  Core batch-processing engine
evaluator.py / token_usage.py  Supporting utilities

Feedback Conditions

| File | Feedback model | Workflow |
| --- | --- | --- |
| `gpt41__baseline_concise.json` | GPT-4.1 | Baseline-concise |
| `gpt41__ms_w_example_final.json` | GPT-4.1 | MS-w-example |
| `gpt5__baseline_concise.json` | GPT-5 | Baseline-concise |
| `gpt5__ms_w_example_final.json` | GPT-5 | MS-w-example |

Each file contains 65 entries with fields question, answer, llm_feedback, and rating (reconciled human grade, 0–5).
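
A quick sanity check of a condition file can be sketched as follows. This is a minimal example, not part of the repository: it assumes the top level of each file is a JSON list of entries and that rating is stored as a number. The helper names validate_entries and load_condition are hypothetical.

```python
import json
from pathlib import Path

# Fields named in this README for every entry of a condition file.
REQUIRED_FIELDS = ("question", "answer", "llm_feedback", "rating")


def validate_entries(entries, expected_count=65):
    """Return a list of problems; an empty list means the data looks fine."""
    problems = []
    if len(entries) != expected_count:
        problems.append(f"expected {expected_count} entries, found {len(entries)}")
    for i, entry in enumerate(entries):
        missing = [name for name in REQUIRED_FIELDS if name not in entry]
        if missing:
            problems.append(f"entry {i}: missing fields {missing}")
            continue
        rating = entry["rating"]
        # Assumption: ratings are numeric (int or float) on the 0-5 scale.
        is_number = isinstance(rating, (int, float)) and not isinstance(rating, bool)
        if not is_number or not 0 <= rating <= 5:
            problems.append(f"entry {i}: rating {rating!r} is outside 0-5")
    return problems


def load_condition(path):
    """Load one condition file, assuming its top level is a JSON list."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)
```

For example, validate_entries(load_condition("tests/feedback_data/gpt41__ms_w_example_final.json")) should return an empty list if the file matches the description above; the exact path is inferred from the repository layout and may need adjusting.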

The gpt41__ms_w_example_final.json file is the condition for which a full human H&T meta-evaluation was conducted; the matched scores are in human_evaluation/ht_scores_matched.json.

Installation

pip install -r requirements.txt

Set OPENAI_API_KEY in your environment or in a .env file.
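
Before launching a run, it can help to confirm where the key will be picked up from. The following is a stand-alone preflight sketch using only the standard library; how the repository's scripts actually read the .env file is not stated here, so parse_dotenv and api_key_source are hypothetical helpers that assume a simple KEY=VALUE file format.

```python
import os
from pathlib import Path


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines; skips blanks, comments, and 'export ' prefixes."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional single or double quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def api_key_source(environ=None, dotenv_path=".env"):
    """Return 'environment', '.env', or None, depending on where the key is set."""
    environ = os.environ if environ is None else environ
    if environ.get("OPENAI_API_KEY"):
        return "environment"
    path = Path(dotenv_path)
    if path.is_file():
        if parse_dotenv(path.read_text(encoding="utf-8")).get("OPENAI_API_KEY"):
            return ".env"
    return None


if __name__ == "__main__":
    print(api_key_source() or "OPENAI_API_KEY not found")
```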

Running the H&T Evaluation

bash run_ht_evaluation.sh

This runs the four H&T rubric configs with each of two grader models (GPT-5.4 and GPT-4.1), plus grade screening via GPT-4.1-nano, in parallel. Outputs are written to tests/feedback_data/output/<model>/. Expect roughly 500–600 API calls and a few minutes of wall time.
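
The call estimate can be reproduced with a back-of-the-envelope calculation. The sketch below rests on assumptions that are not confirmed by the scripts: each job makes one API call per feedback item, every job runs over a single 65-item condition, and there are no retries. The function name estimate_api_calls is hypothetical.

```python
def estimate_api_calls(n_items=65, n_rubric_configs=4, n_graders=2, n_screening_passes=1):
    """Rough call count, assuming one API call per item per job and no retries."""
    ht_calls = n_items * n_rubric_configs * n_graders  # H&T grading jobs
    screening_calls = n_items * n_screening_passes     # grade screening
    return ht_calls + screening_calls


print(estimate_api_calls())  # 585
```

Under these assumptions the total falls inside the 500–600 range quoted above; actual usage depends on how the scripts batch and retry requests.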

Comparing with Human Ratings

python compare_with_human_ratings.py

The script reports Kendall τ between regraded and human grades, mean H&T scores per grader and dimension, and item-level agreement between LLM and human H&T scores for the GPT-4.1 + MS-w-example condition.
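
For readers unfamiliar with the statistic, here is a minimal pure-Python sketch of Kendall τ. It implements the tie-corrected τ-b variant, which is an assumption made because grades on a 0–5 scale contain many ties; the README does not say which variant compare_with_human_ratings.py uses. The grades in the example are made-up toy values, not data from the study.

```python
import math


def kendall_tau_b(x, y):
    """Kendall tau-b rank correlation between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1  # pair tied in x
            if dy == 0:
                ties_y += 1  # pair tied in y
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # Undefined when one of the sequences is constant (or has fewer than 2 items).
    return (concordant - discordant) / denom if denom else float("nan")


human_grades = [1, 2, 2, 3]     # toy values
regraded_grades = [1, 2, 3, 3]  # toy values
print(kendall_tau_b(human_grades, regraded_grades))  # 0.8
```

The quadratic pair loop is fine for 65 items; for larger data a library implementation would be the better choice.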

Human Evaluation File

human_evaluation/ht_scores_matched.json contains one entry per question–answer pair (GPT-4.1 + MS-w-example condition) with:

question, answer, llm_feedback — the triple that was rated
human_grade — reconciled expert grade (0–5)
human_ht — scores from the paper grading form: d1_correctness, d2_clarity, d3_process, d4_self_regulation (0–5); all 65 entries are fully annotated
llm_ht — corresponding GPT-5.4 LLM grades with chain-of-thought justifications
regraded_grade — GPT-4.1-nano grade inferred from question + feedback only

This file was regenerated from feedback_grading.tex (the annotated grading form) and the GPT-5.4 H&T outputs for the GPT-4.1 + MS-w-example condition.
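
Summaries over this file can be computed along the following lines. This is a sketch under assumptions: the file is taken to be a JSON list, and human_ht is taken to be an object keyed by the four dimension names listed above. The helpers mean_scores and exact_grade_matches are hypothetical, and the two entries in the example are invented placeholders rather than real annotations.

```python
from statistics import mean

DIMENSIONS = ("d1_correctness", "d2_clarity", "d3_process", "d4_self_regulation")


def mean_scores(entries, source="human_ht"):
    """Mean score per dimension, assuming entry[source] maps dimension name to a number."""
    return {dim: mean(entry[source][dim] for entry in entries) for dim in DIMENSIONS}


def exact_grade_matches(entries):
    """Count entries where the regraded grade equals the reconciled human grade."""
    return sum(1 for entry in entries if entry["regraded_grade"] == entry["human_grade"])


# Invented placeholder entries, only to show the assumed shape.
toy_entries = [
    {"human_grade": 4, "regraded_grade": 4,
     "human_ht": {"d1_correctness": 5, "d2_clarity": 4, "d3_process": 3, "d4_self_regulation": 2}},
    {"human_grade": 2, "regraded_grade": 3,
     "human_ht": {"d1_correctness": 3, "d2_clarity": 4, "d3_process": 1, "d4_self_regulation": 2}},
]
print(mean_scores(toy_entries))
print(exact_grade_matches(toy_entries))  # 1
```

With the real file, the entries would come from json.load on human_evaluation/ht_scores_matched.json instead of toy_entries.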
