An AI tutor for French schoolkids (CP to 3e). Live, paid, and processing children's data. The funny parts are the product; the parts that had to be right are below.
Co-founded with Anthony Thomas; I own the backend. The codebase is private (minors' data, paid product), so this is the engineering story, not the source. Live at docendo.fr.
A kid can be strong on fractions and lost on geometry in the same hour. A single grade hides that. And the students who most need help are often the ones who never raise a hand in class, because asking is admitting you didn't get it.
Docendo does two things about that: it works out where a student actually stands, theme by theme, and it gives them a tutor they're not afraid to look stupid in front of. Both are easy to fake with a chatbot and a quiz. Doing them honestly is where the engineering is.
These shaped every decision below:
- Children's data, French and European. GDPR isn't a checkbox here, it's the design, above all for the voices of minors.
- Live and paid. Real students, Stripe billing, no "it's just a prototype" excuses.
- A team of two. Every system has to be operable by people who also need to sleep.
```mermaid
flowchart LR
    Student["Student / PWA"] --> API["Symfony API"]
    API --> DB[("PostgreSQL")]
    API --> Engine["Adaptive scoring engine<br/>item response theory"]
    API --> Milo["Milo<br/>guardrailed tutor"]
    Milo --> Haiku["Anthropic Haiku"]
    Student -->|"spoken answer"| Talk["MiloTalk"]
    Talk --> Whisper["Self-hosted Whisper<br/>EU servers"]
    Whisper -->|"transcript, audio destroyed"| Milo
    API --> Stripe["Stripe billing"]
```
A raw quiz score tells you how many questions a kid got right. It doesn't tell you what they know. So the level estimate runs on an adaptive model from item response theory: each answer updates an estimate of the student's ability per theme, and separates "strong on one topic, weak on another" instead of averaging them into a meaningless number.
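To make the per-theme idea concrete, here is a minimal sketch using the simplest item response model, the one-parameter (Rasch) model. It is not Docendo's engine: the class name, the learning rate, and the online gradient update are all assumptions chosen for illustration. Only the shape matches the description above: one ability estimate per theme, nudged by every answer.

```python
import math
from collections import defaultdict


def p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) probability that a student answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


class ThemeAbilities:
    """One ability estimate per theme, updated after every answer (hypothetical)."""

    def __init__(self, learning_rate: float = 0.4) -> None:
        self.learning_rate = learning_rate  # assumed step size, not a tuned value
        self.theta = defaultdict(float)     # theme -> ability, starting at 0.0

    def update(self, theme: str, difficulty: float, correct: bool) -> float:
        expected = p_correct(self.theta[theme], difficulty)
        observed = 1.0 if correct else 0.0
        # Gradient step on the Rasch log-likelihood: move by (observed - expected).
        self.theta[theme] += self.learning_rate * (observed - expected)
        return self.theta[theme]


# Same student, same hour: strong on fractions, lost on geometry.
student = ThemeAbilities()
for _ in range(5):
    student.update("fractions", difficulty=0.0, correct=True)
    student.update("geometry", difficulty=0.0, correct=False)

print(dict(student.theta))
```

The two estimates drift in opposite directions instead of cancelling out, which is what a single averaged grade would hide.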
The honest part: that model is only as good as the difficulty you assign each question, and we got that wrong at first. During testing, students were failing questions we'd rated as easy: the ratings didn't match how hard the questions really were. So we revised a large chunk of them and ran a full audit across every level and subject, so a student succeeds at the questions they should succeed at. The engine was the easy half (the formulas are in the literature). Calibrating it against real kids was the half that took work.
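One way to run that kind of audit, sketched under assumptions: compare how often students actually succeed on each question with how often the rating says they should, and flag the ones that disagree. The function name, the tolerance, the minimum sample size, and the data layout are all hypothetical, and the Rasch formula stands in for whatever model the real engine uses.

```python
import math


def p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) success probability, used here as the 'expected' rate."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def flag_misrated(items, responses, tolerance=0.15, min_answers=20):
    """Return {item_id: gap} for items whose observed success rate is off.

    items:     {item_id: rated difficulty}
    responses: iterable of (item_id, student ability, answered correctly)
    gap < 0 means students do worse than the rating predicts,
    i.e. the question is harder than rated.
    """
    stats = {}  # item_id -> [answer count, sum of expected p, correct count]
    for item_id, ability, correct in responses:
        entry = stats.setdefault(item_id, [0, 0.0, 0])
        entry[0] += 1
        entry[1] += p_correct(ability, items[item_id])
        entry[2] += int(correct)

    flagged = {}
    for item_id, (count, expected_sum, correct_count) in stats.items():
        if count < min_answers:
            continue  # too few answers to judge the rating
        gap = correct_count / count - expected_sum / count
        if abs(gap) > tolerance:
            flagged[item_id] = gap
    return flagged


# q1 is rated easy (-1.0) but only 9 of 30 average students get it right.
items = {"q1": -1.0, "q2": 0.0}
responses = [("q1", 0.0, i < 9) for i in range(30)]
responses += [("q2", 0.0, i < 15) for i in range(30)]
flagged = flag_misrated(items, responses)
print(flagged)
```

Here only `q1` is flagged, with a negative gap: the case described above, where a question rated as easy turns out not to be.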
One of Milo's first serious tests: we told it 2 + 2 = 3. It agreed.
That's the real problem with an LLM tutor. By default the model wants to please you, and a teacher who wants to please you teaches nothing. Milo runs on Anthropic's Haiku, but the trait isn't Haiku's, it's every model's default: push a little, it folds. Harmless when you ask for a recipe. Disqualifying when a student says something wrong and the tutor nods along.
So the first job wasn't making Milo smart, the model already is. It was stopping it from agreeing: making it verify what it asserts, hold the truth when a student insists otherwise, and check each message for pedagogical soundness before it goes out. On top of that, one hard rule: Milo never gives the answer. A tutor that hands over the solution is a calculator with vocabulary. Milo is conditioned to send students back to their own reasoning until they solve it themselves. That works against the model's instinct to be helpful and complete, so it isn't one line of prompt, it's a frame of constraints you set, test, and tighten every time a kid finds the gap.
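As a toy illustration of one such constraint, here is a last-line check that a draft reply does not contain the expected answer before it goes out. This is a sketch, not Milo's guardrail: the function names, the retry count, the fallback sentence, and the string-matching rule are assumptions, and `generate` is a stand-in for whatever calls the model. A real frame would layer several checks like this, most of them less naive.

```python
import re
from typing import Callable

# Hypothetical fallback used when every draft gives the answer away.
FALLBACK = "Let's go back a step: how did you get that result?"


def leaks_answer(reply: str, answer: str) -> bool:
    """True if the reply contains the answer as a standalone token."""
    pattern = r"(?<!\w)" + re.escape(answer) + r"(?!\w)"
    return re.search(pattern, reply, flags=re.IGNORECASE) is not None


def guarded_reply(
    generate: Callable[[str], str],
    prompt: str,
    answer: str,
    max_attempts: int = 3,
) -> str:
    """Ask for drafts until one passes the check, else send the fallback."""
    for _ in range(max_attempts):
        draft = generate(prompt)
        if not leaks_answer(draft, answer):
            return draft
        # Feed the failure back so the next draft has a chance to differ.
        prompt += "\nYour last draft revealed the answer. Ask a guiding question instead."
    return FALLBACK


# Fake model: the first draft hands over the solution, the second does not.
drafts = iter(["It's 4!", "What do you get if you count two more after 2?"])
reply = guarded_reply(lambda _prompt: next(drafts), "Student asks: 2 + 2 = ?", "4")
print(reply)
```

The check sits outside the prompt on purpose: a rule the model is merely asked to follow can be argued away by an insistent student, while a check run on the output cannot.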
A student can also explain out loud what they understood of a topic, in a recording of fifteen seconds to five minutes. We transcribe it, the model analyses the spoken explanation for clarity and accuracy, and returns a report: acquired, in progress, still weak. Explaining out loud is one of the best comprehension tests there is, and exactly what a student never does alone.
The catch: that's the recorded voice of a child. Sending it to a third-party transcription API was never on the table. Transcription runs on a self-hosted Whisper on our European servers, the audio is destroyed right after transcription, and the voice never leaves our infrastructure. GDPR-first sold as a feature, not a footnote.
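The "transcribe, then destroy" rule is easiest to keep when the deletion cannot be skipped. A minimal sketch, assuming the audio lands in a temporary file: the function name is hypothetical, `transcribe` is a stand-in for the self-hosted Whisper call, and deleting on failure as well as on success is an assumption about the policy, not something stated above.

```python
import os
import tempfile
from typing import Callable


def transcribe_and_destroy(audio_path: str, transcribe: Callable[[str], str]) -> str:
    """Return the transcript and delete the recording, whatever happens."""
    try:
        return transcribe(audio_path)
    finally:
        # Runs whether transcription succeeded or raised.
        if os.path.exists(audio_path):
            os.remove(audio_path)


# Usage with a fake transcriber and a throwaway file.
fd, path = tempfile.mkstemp(suffix=".wav")
os.close(fd)
text = transcribe_and_destroy(path, lambda _p: "les fractions")
print(text, os.path.exists(path))
```

Note that `os.remove` only unlinks the file; anything stronger (encrypted scratch storage, in-memory buffers that never touch disk) is a separate decision this sketch does not cover.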
The adaptive engine is still being calibrated against real usage, and the early mis-rating of question difficulty is the reminder that a clean model on paper is worthless until it meets actual students. We caught it because we were testing with real kids before opening wide, which is the entire point of testing with real kids before opening wide.
PHP 8.4 / Symfony 7.4, PostgreSQL, FrankenPHP. Milo on Anthropic Haiku, MiloTalk on a self-hosted Whisper. Stripe for billing. PWA, plus OCR to scan handwritten homework.
© 2026 Thibault Lafaurie. Words and diagrams, all rights reserved. Built with Anthony Thomas, codebase private. Live at docendo.fr.