Accompanying code and data for the paper
Evaluating LLM-Generated Formative Feedback for Undergraduate Mathematics Through the Lens of Feedback Theory
Aron Gohr, Marie-Amelie Lawn, Kevin Gao, Stephen Heslip, Inigo Serjeant
BEA 2026
This repository contains:
- Feedback data — 65 question–student-answer pairs from a first-year transition-to-proof course, each paired with LLM-generated feedback under four conditions (two models × two workflows).
- Human evaluation data — H&T rubric scores and reconciled grades from the grading form used by the authors.
- H&T evaluation scripts — for re-running the Hattie & Timperley (2007) meta-evaluation of the LLM feedback using LLM graders.
- Grade screening — regrading each feedback item with a lightweight model to derive a grade and compute Kendall τ with human grades.
For grade-correlation sweeps, workflow parameter searches, and the advanced stress tests, see the companion repository agohr/math_tutor.
The evaluation rubric for all four dimensions is documented in ht_rubric.md as well as in the configuration files in configs/. Human and LLM graders used the same rubric to grade the feedback items.
| Path | Contents |
|---|---|
| `configs/` | H&T rubric and regrading prompt configs |
| `tests/feedback_data/` | Input data: one JSON file per condition |
| `human_evaluation/` | Human H&T scores and matched LLM output |
| `run_ht_evaluation.sh` | Runs all H&T grading + regrading jobs |
| `compare_with_human_ratings.py` | Compares outputs against human scores |
| `math_tutor.py` | Core batch-processing engine |
| `evaluator.py` / `token_usage.py` | Supporting utilities |
| File | Feedback model | Workflow |
|---|---|---|
| `gpt41__baseline_concise.json` | GPT-4.1 | Baseline-concise |
| `gpt41__ms_w_example_final.json` | GPT-4.1 | MS-w-example |
| `gpt5__baseline_concise.json` | GPT-5 | Baseline-concise |
| `gpt5__ms_w_example_final.json` | GPT-5 | MS-w-example |
Each file contains 65 entries with fields question, answer, llm_feedback, and rating (reconciled human grade, 0–5).
The gpt41__ms_w_example_final.json file is the condition for which a full human H&T meta-evaluation was conducted; the matched scores are in human_evaluation/ht_scores_matched.json.
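The field list above can be checked mechanically before running any grading job. The following is a minimal sketch, assuming each condition file is a JSON array of objects; the helper name `validate_entries` and the two sample entries are hypothetical and are not part of the repository or the dataset.

```python
import json

# Field names as documented for the per-condition feedback files.
REQUIRED_FIELDS = ("question", "answer", "llm_feedback", "rating")


def validate_entries(entries):
    """Return a list of problem descriptions; an empty list means no issues found."""
    problems = []
    for i, entry in enumerate(entries):
        missing = [f for f in REQUIRED_FIELDS if f not in entry]
        if missing:
            problems.append(f"entry {i}: missing {missing}")
            continue
        rating = entry["rating"]
        # The reconciled human grade is documented as lying in 0-5.
        if not isinstance(rating, (int, float)) or not 0 <= rating <= 5:
            problems.append(f"entry {i}: rating {rating!r} outside 0-5")
    return problems


# Hypothetical placeholder entries, NOT taken from the dataset.
sample = json.loads(
    '[{"question": "Q", "answer": "A", "llm_feedback": "F", "rating": 3},'
    ' {"question": "Q", "answer": "A", "rating": 7}]'
)
print(validate_entries(sample))
```

To check a real file, replace `sample` with the result of `json.load` on one of the files in `tests/feedback_data/`; if the files turn out to use a different top-level structure, the loop needs adapting.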
Install the dependencies with `pip install -r requirements.txt`, then set `OPENAI_API_KEY` in your environment or in a `.env` file.
Run `bash run_ht_evaluation.sh`. This runs four H&T rubric configs × two grader models (GPT-5.4, GPT-4.1) plus grade screening via GPT-4.1-nano in parallel. Outputs are written to `tests/feedback_data/output/<model>/`. Expect roughly 500–600 API calls and a few minutes of wall time.
Run `python compare_with_human_ratings.py`. It reports Kendall τ between regraded and human grades, mean H&T scores per grader and dimension, and item-level agreement between LLM and human H&T scores for the GPT-4.1 + MS-w-example condition.
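For readers who want to see what the Kendall τ comparison measures, here is a minimal pure-Python sketch. It assumes the tie-corrected τ-b variant, which is a natural choice for grades on a 0–5 scale where ties are common; this README does not state which variant or library `compare_with_human_ratings.py` actually uses, and the function name `kendall_tau_b` and the grade lists are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b between two equal-length sequences, correcting for ties."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        elif dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical grades for illustration only, NOT taken from the dataset.
human_grades = [0, 1, 2, 3, 4, 5]
regraded = [0, 1, 2, 3, 4, 5]
print(kendall_tau_b(human_grades, regraded))
```

The quadratic pair loop is fine for 65 items. A library routine such as `scipy.stats.kendalltau` would be the usual choice for larger inputs or when a p-value is needed.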
human_evaluation/ht_scores_matched.json contains one entry per question–answer pair (GPT-4.1 + MS-w-example condition) with:
- `question`, `answer`, `llm_feedback`: the triple that was rated
- `human_grade`: reconciled expert grade (0–5)
- `human_ht`: scores from the paper grading form for `d1_correctness`, `d2_clarity`, `d3_process`, `d4_self_regulation` (each 0–5); all 65 entries are fully annotated
- `llm_ht`: corresponding GPT-5.4 LLM grades with chain-of-thought justifications
- `regraded_grade`: GPT-4.1-nano grade inferred from question + feedback only
This file was regenerated from feedback_grading.tex (the annotated grading form) and the GPT-5.4 H&T outputs for the GPT-4.1 + MS-w-example condition.
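Item-level agreement between human and LLM H&T scores can be sketched as follows. This assumes that `human_ht` and `llm_ht` each map a dimension name directly to an integer score; the real `llm_ht` entries also carry justifications, so the lookup may need adapting to the actual nesting. The function `agreement_by_dimension` and the sample entries are hypothetical.

```python
# Dimension names as documented for ht_scores_matched.json.
DIMENSIONS = ("d1_correctness", "d2_clarity", "d3_process", "d4_self_regulation")


def agreement_by_dimension(entries, tolerance=0):
    """Per dimension, the fraction of entries with |human - llm| <= tolerance."""
    result = {}
    for dim in DIMENSIONS:
        hits = sum(
            abs(e["human_ht"][dim] - e["llm_ht"][dim]) <= tolerance
            for e in entries
        )
        result[dim] = hits / len(entries)
    return result


# Hypothetical scores for illustration only, NOT taken from the dataset.
entries = [
    {
        "human_ht": {"d1_correctness": 5, "d2_clarity": 4, "d3_process": 3, "d4_self_regulation": 2},
        "llm_ht": {"d1_correctness": 5, "d2_clarity": 3, "d3_process": 3, "d4_self_regulation": 4},
    },
    {
        "human_ht": {"d1_correctness": 4, "d2_clarity": 4, "d3_process": 2, "d4_self_regulation": 1},
        "llm_ht": {"d1_correctness": 4, "d2_clarity": 4, "d3_process": 3, "d4_self_regulation": 1},
    },
]
print(agreement_by_dimension(entries))               # exact agreement
print(agreement_by_dimension(entries, tolerance=1))  # agreement within one point
```

Reporting both exact and within-one agreement is a design choice of this sketch; the repository's script may define item-level agreement differently.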