These scripts help you run manual and automated conversations against the
Student Simulation API, log full transcripts, generate understanding level
predictions, and submit to the /evaluate/mse endpoint.
Create a .env file in the repo root with:
BASE_URL=...
TEAM_API_KEY=...
OPENAI_API_KEY=...
Optional:
LOG_FILE=logs/conversations.jsonl
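As a rough illustration of how a script might consume this file, here is a minimal sketch in Python. The helper names `parse_env` and `resolve_config` are hypothetical, and the fallback for LOG_FILE is an assumption based on the path shown above; the actual scripts may load the file differently.

```python
REQUIRED_KEYS = ("BASE_URL", "TEAM_API_KEY", "OPENAI_API_KEY")
DEFAULT_LOG_FILE = "logs/conversations.jsonl"  # assumed default, taken from the example above


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_config(env):
    """Check the required keys and fill in the optional LOG_FILE."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError("missing keys in .env: " + ", ".join(missing))
    config = dict(env)
    config.setdefault("LOG_FILE", DEFAULT_LOG_FILE)
    return config
```

In use, something like `resolve_config(parse_env(open(".env").read()))` would give a dict with all four keys.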
Manual CLI helper to list students/topics and run a chat by hand.
Examples:
./scripts/knu_api.sh list-students mini_dev
./scripts/knu_api.sh student-topics <student_id>
./scripts/knu_api.sh start <student_id> <topic_id>
./scripts/knu_api.sh interact <conversation_id> "Explain how you would solve x^2 - 5x + 6 = 0"
./scripts/knu_api.sh chat
Notes:
- Reads BASE_URL and TEAM_API_KEY from .env.
- Logs every start and interact to logs/conversations.jsonl (JSONL format).
Automates conversations for a set, generates tutor messages with GPT, and produces a predicted understanding level after each conversation.
Examples:
./scripts/knu_auto_chat.py
./scripts/knu_auto_chat.py --set-type mini_dev --model gpt-5.2 --mode responses
./scripts/knu_auto_chat.py --max-turns 6
Notes:
- Reads BASE_URL, TEAM_API_KEY, and OPENAI_API_KEY from .env.
- Writes a conversation_summary entry with the full transcript and prediction to logs/conversations.jsonl.
- Use --mode chat if your account does not support the responses API.
Submits the latest predictions (per student/topic pair) from
logs/conversations.jsonl to /evaluate/mse.
Examples:
./scripts/knu_submit_mse.py --set-type mini_dev
./scripts/knu_submit_mse.py --set-type mini_dev --dry-run
Notes:
- Picks the most recent conversation_summary per student/topic pair.
- Fails if any required pair is missing.
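The selection rule in the notes above can be sketched as follows. The field names `event`, `student_id`, `topic_id`, and `prediction` are hypothetical, since the log schema is not spelled out here, and the sketch assumes that later lines in the file are more recent.

```python
import json


def latest_summaries(lines):
    """Map (student_id, topic_id) to the last conversation_summary seen."""
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("event") != "conversation_summary":  # hypothetical field name
            continue
        # Later lines overwrite earlier ones, so the most recent entry wins.
        latest[(entry["student_id"], entry["topic_id"])] = entry
    return latest


def require_pairs(latest, required_pairs):
    """Return entries in the required order, failing if any pair is missing."""
    missing = [pair for pair in required_pairs if pair not in latest]
    if missing:
        raise ValueError(f"missing predictions for pairs: {missing}")
    return [latest[pair] for pair in required_pairs]
```

Failing before any submission keeps a partial log from producing a misleading score.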
Submits a tutoring evaluation request to /evaluate/tutoring.
Examples:
./scripts/knu_submit_tutoring.py --set-type mini_dev
Notes:
- Requires at least one conversation per student/topic pair in the set.
One-shot flow: run conversations, then submit predictions.
Examples:
./scripts/knu_run_and_submit.sh --set-type mini_dev --model gpt-5.2 --mode responses
Lists student/topic pairs for a set (one pair per line, space-separated).
Examples:
./scripts/knu_list_pairs.py --set-type dev
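If you want to consume this output from another tool, a parser along these lines may be enough. It assumes that neither ID contains whitespace; `parse_pairs` is a hypothetical helper, not part of the scripts.

```python
def parse_pairs(text):
    """Turn 'student_id topic_id' lines into a list of (student, topic) tuples."""
    pairs = []
    for number, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected 2 fields, got {len(parts)}")
        pairs.append((parts[0], parts[1]))
    return pairs
```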
Runs all dev student-topic pairs in parallel and writes logs per pair.
Examples:
PARALLEL=4 LOG_DIR=new_logs ./scripts/knu_run_dev_parallel.sh
Notes:
- Uses scripts/knu_list_pairs.py to enumerate dev pairs.
- Writes one JSONL log per pair to new_logs/.
- You can set MODEL, MODE, SLEEP, or MAX_TURNS via env vars.
Runs all eval student-topic pairs in parallel, writes logs per pair, scores with strict
student-only diagnostic scoring, and submits to /evaluate/mse.
Examples:
PARALLEL=4 LOG_DIR=eval_logs ./scripts/knu_run_eval_parallel.sh
Notes:
- Uses scripts/knu_list_pairs.py to enumerate eval pairs.
- Writes one JSONL log per pair to eval_logs/.
- Scoring uses scripts/knu_score_only.py --diagnostic-only.
Runs LLM scoring on existing conversations (no new API conversations) and can
optionally submit to /evaluate/mse.
Examples:
./scripts/knu_score_only.py --prompt-version A
./scripts/knu_score_only.py --prompt-version B --submit-mse
./scripts/knu_score_only.py --prompt-version C --set-type mini_dev --mode responses
Notes:
- Uses the most recent conversation_summary per student/topic pair from logs/conversations.jsonl.
- Writes results to logs/score_only_<version>_<timestamp>.json.
Runs A/B/C scoring back-to-back and submits each to /evaluate/mse.
Examples:
./scripts/knu_score_abc.sh --set-type mini_dev --model gpt-5.2 --mode responses
Notes:
- Forwards any args to knu_score_only.py (except --prompt-version and --submit-mse).
Asks each student to self-report their understanding level (1–5) and submits to /evaluate/mse.
Examples:
./scripts/knu_self_report.py --set-type mini_dev
./scripts/knu_self_report.py --set-type mini_dev --no-submit-mse
Notes:
- If the student does not return a number, the script can use an LLM to map their reply to 1–5.
- Disable LLM mapping with --no-llm-parse (defaults to 3 when non-numeric).
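The non-LLM fallback described above could look roughly like this. It is a sketch only: the regex and the choice to take the first standalone digit are assumptions, and the script's real parsing rules may differ.

```python
import re

# A single digit 1-5 that is not part of a longer number such as "10" or "45".
LEVEL_PATTERN = re.compile(r"(?<!\d)[1-5](?!\d)")


def parse_level(reply, default=3):
    """Return the first standalone 1-5 digit in the reply, else the default."""
    match = LEVEL_PATTERN.search(reply)
    return int(match.group()) if match else default
```

For example, `parse_level("Maybe a 4 out of 5")` picks up the 4, while a reply with no usable digit falls back to 3.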
Infers true levels for mini_dev using controlled MSE probes (multiple submissions).
Example:
./scripts/knu_infer_truth.py --set-type mini_dev
Notes:
- Uses multiple /evaluate/mse calls; refuses non-mini_dev unless --force.
- Writes inferred levels to logs/inferred_levels.json.
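One way such probing can work is sketched below. This is not necessarily the strategy knu_infer_truth.py uses. It assumes the endpoint returns the plain mean squared error over all n submitted pairs, and `submit_mse` is a hypothetical callable that takes a list of predictions and returns that number. If every prediction is `base` and one prediction i is raised to `base + 1`, the MSE changes by (2 * (base - t_i) + 1) / n, which can be solved for the true level t_i.

```python
def infer_levels(submit_mse, n, base=3):
    """Recover each true level from n + 1 MSE probes (assumes exact, plain MSE)."""
    baseline = submit_mse([base] * n)
    levels = []
    for i in range(n):
        probe = [base] * n
        probe[i] = base + 1  # perturb a single prediction by one level
        delta = submit_mse(probe) - baseline
        # delta = (2 * (base - t_i) + 1) / n  =>  t_i = base + (1 - n * delta) / 2
        levels.append(round(base + (1 - n * delta) / 2))
    return levels


def make_fake_endpoint(true_levels):
    """Local stand-in for /evaluate/mse, for trying the sketch offline."""
    def submit_mse(predictions):
        errors = [(p - t) ** 2 for p, t in zip(predictions, true_levels)]
        return sum(errors) / len(errors)
    return submit_mse
```

Calling `infer_levels(make_fake_endpoint([1, 5, 3, 2]), n=4)` recovers the hidden list with five submissions, which illustrates why the script limits itself to mini_dev by default.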
All scripts append to logs/conversations.jsonl (JSON Lines). Each line is a
JSON object with event types like start, interact, or
conversation_summary.
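For working with the log directly, a minimal sketch of appending and tallying events is shown below. The `event` key is a hypothetical field name, since the exact schema is not documented here.

```python
import json
from collections import Counter
from pathlib import Path


def append_event(path, event):
    """Append one JSON object as a single line, creating the directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")


def count_events(path):
    """Count log lines per event type, skipping blank lines."""
    counts = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                counts[json.loads(line).get("event", "unknown")] += 1
    return counts
```

Because each line is an independent JSON object, the file can be appended to without rewriting earlier entries.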