Add JSON Schema and CI validation for data.json #20
Treats data.json as a spec: data/schema.json describes the shape convert_doc.py produces, data/validate.py runs the check locally, and a GitHub Actions workflow regenerates the data from doc_export.html and validates on every PR, so the parser and the committed artifact cannot silently drift.

While wiring this up, the schema surfaced 43 empty-titled subsections in data.json. They came from convert_doc.py appending a subsection for every <h2>, including styling-only ones; this commit includes a one-line skip for empty <h2>s and regenerates data.json. The source Google Doc likely still contains those empty <h2>s.

The schema is strict by default (additionalProperties: false at the top level and on data entries), so future field additions surface as CI failures and prompt a schema bump. Happy to relax if maintainers prefer the schema to document rather than constrain.
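For readers without the diff open, the empty-<h2> skip can be pictured roughly as below. This is a minimal sketch using the standard library's html.parser, not the actual code in convert_doc.py, which may parse the export quite differently; the names H2Collector and subsection_titles are hypothetical.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collect the text of each <h2>, skipping headings with no visible text."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # list of text chunks while inside an <h2>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._buffer is not None:
            title = "".join(self._buffer).strip()
            self._buffer = None
            if not title:
                return  # the skip: a styling-only <h2> yields no subsection
            self.titles.append(title)


def subsection_titles(html):
    """Return the non-empty <h2> titles found in an HTML string."""
    parser = H2Collector()
    parser.feed(html)
    parser.close()
    return parser.titles
```

Without the early return, every styling-only <h2> would append an empty title, which is the kind of phantom entry the schema flagged.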
This is a great addition. Does the change you made to convert_doc fix all the cases?
Thanks! The empty [...] Happy to move the validation call into convert_doc.py so the check runs as part of the conversion; anyone regenerating locally gets the same guardrail without remembering a second command. I'd keep validate.py as a standalone entry point too, so it can be pointed at any data.json independent of regeneration. Want me to add that in this PR?
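The proposal above amounts to one shared check with two callers. A rough sketch of that split follows, under loud assumptions: find_problems is a placeholder standing in for the real schema validation, write_validated and main are hypothetical names, and the path is taken as a required argument here, which may differ from how data/validate.py actually behaves.

```python
import json
import sys
from pathlib import Path


def find_problems(data):
    """Placeholder check (assumption): stands in for the real schema validation."""
    problems = []
    if not isinstance(data, dict):
        problems.append("top level must be a JSON object")
    return problems


def write_validated(data, out_path):
    """Conversion side: refuse to write output that fails validation."""
    problems = find_problems(data)
    if problems:
        raise ValueError("; ".join(problems))
    Path(out_path).write_text(json.dumps(data, indent=2), encoding="utf-8")


def main(argv):
    """Standalone side: validate any JSON file given on the command line."""
    if len(argv) != 1:
        print("usage: validate.py PATH", file=sys.stderr)
        return 2
    data = json.loads(Path(argv[0]).read_text(encoding="utf-8"))
    problems = find_problems(data)
    for problem in problems:
        print(f"{argv[0]}: {problem}", file=sys.stderr)
    return 1 if problems else 0
```

In a real script, main would be called as sys.exit(main(sys.argv[1:])) under the usual `if __name__ == "__main__":` guard, so CI and local runs share the same exit codes.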
Treats data.json as a spec: adds a JSON Schema, a validator, and a GitHub Actions workflow that regenerates data from doc_export.html and validates on every PR, so the parser and the committed artifact cannot silently drift.

While wiring this up the schema surfaced 43 empty-titled subsections in data.json caused by convert_doc.py appending a subsection for every <h2> element including styling-only ones. This PR includes a one-line skip for empty <h2>s and regenerates data.json; CI is green after that fix. The source Google Doc likely still contains those empty <h2>s; I don't have edit access there.

The schema is strict by default (additionalProperties: false at the top level and on data entries) so future field additions surface as CI failures and prompt a schema bump alongside the data change. Easy to relax to true if you'd rather have the schema document the shape without constraining it.

Files:
- data/schema.json: draft-2020-12 schema
- data/validate.py: local + CI validator (run via python data/validate.py)
- .github/workflows/validate.yml: round-trip + validate on push to main and on every PR
- data/convert_doc.py: skip empty <h2> elements
- data/data.json: regenerated (43 phantom subsections removed)
- requirements.txt: added jsonschema