Add JSON Schema and CI validation for data.json#20

Open
ksallee wants to merge 1 commit into
ves-tech:mainfrom
ksallee:spike/data-json-spec
Conversation

@ksallee
Contributor

@ksallee ksallee commented May 12, 2026

Treats data.json as a spec — adds a JSON Schema, a validator, and a GitHub Actions workflow that regenerates data from doc_export.html and validates on every PR, so the parser and the committed artifact cannot silently drift.

While wiring this up, the schema surfaced 43 empty-titled subsections in data.json. They were caused by convert_doc.py appending a subsection for every <h2> element, including styling-only ones. This PR includes a one-line skip for empty <h2>s and regenerates data.json; CI is green after that fix. The source Google Doc likely still contains those empty <h2>s; I don't have edit access there.
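
For illustration, a minimal sketch of the empty-<h2> skip. It assumes a parser built on the standard library's html.parser; the actual convert_doc.py may use a different HTML library, and the names H2Collector and subsection_titles are hypothetical.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collect the text of each <h2>, skipping headings with no visible text."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # list of text chunks while inside an <h2>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._buffer is not None:
            title = "".join(self._buffer).strip()
            self._buffer = None
            if not title:
                return  # styling-only heading: do not create a subsection
            self.titles.append(title)


def subsection_titles(html):
    """Return the non-empty <h2> titles found in an HTML string."""
    parser = H2Collector()
    parser.feed(html)
    parser.close()
    return parser.titles
```

The early return in handle_endtag plays the role of the one-line skip: a heading whose text is empty after stripping whitespace never becomes a subsection.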

The schema is strict by default (additionalProperties: false at the top level and on data entries) so future field additions surface as CI failures and prompt a schema bump alongside the data change. Easy to relax to true if you'd rather have the schema document the shape without constraining it.

Files:

  • data/schema.json — draft-2020-12 schema
  • data/validate.py — local + CI validator (run via python data/validate.py)
  • .github/workflows/validate.yml — round-trip + validate on push to main and on every PR
  • data/convert_doc.py — skip empty <h2> elements
  • data/data.json — regenerated (43 phantom subsections removed)
  • requirements.txt — added jsonschema
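
One way the round-trip step could be expressed, as a sketch only: regenerate the data, then compare it with the committed file. The function name has_drifted is hypothetical, and the workflow in this PR may implement the comparison differently (for example with a diff in the shell).

```python
import json
from pathlib import Path


def has_drifted(committed_path, regenerated):
    """Return True when the committed JSON differs from freshly generated data."""
    committed = json.loads(Path(committed_path).read_text(encoding="utf-8"))
    # Compare parsed values so whitespace and key order do not cause false alarms.
    return committed != regenerated
```

Comparing parsed values rather than raw text keeps the check focused on content, so a formatting-only change to data.json would not fail CI.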

@richardssam
Contributor

This is a great addition. Does the change you made to convert_doc fix all the cases?
What I'm wondering is whether we should be doing the validation as we convert the doc to data.json (using the convert_doc.py script).

@ksallee
Contributor Author

ksallee commented May 26, 2026

> This is a great addition. Does the change you made to convert_doc fix all the cases? What I'm wondering is whether we should be doing the validation as we convert the doc to data.json (using the convert_doc.py script).

Thanks!

The empty <h2> skip cleared all 43 violations the schema caught against the current doc_export.html. It doesn't guarantee every future export is clean, but that's exactly what the schema is for: next time the doc changes shape, CI fails and we either tighten the parser or bump the schema deliberately.

Happy to move the validation call into convert_doc.py so the check runs as part of the conversion; anyone regenerating locally then gets the same guardrail without remembering a second command. I'd keep validate.py as a standalone entry point too, so it can be pointed at any data.json independent of regeneration. Want me to add that in this PR?
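
A possible shape for that change, sketched under assumptions: the conversion step calls a validation function before writing, and refuses to write when it reports problems. The names write_validated, InvalidDataError, and the find_violations callable are hypothetical and not taken from convert_doc.py or validate.py.

```python
import json
from pathlib import Path


class InvalidDataError(ValueError):
    """Raised when generated data fails validation."""


def write_validated(data, out_path, find_violations):
    """Write data as JSON only if find_violations(data) reports nothing.

    find_violations is any callable that returns an iterable of error
    strings, so the same check can back both the converter and a
    standalone validator script.
    """
    violations = list(find_violations(data))
    if violations:
        raise InvalidDataError("; ".join(violations))
    Path(out_path).write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
```

Passing the check in as a callable keeps the validator reusable on its own, which matches the idea of keeping validate.py as a separate entry point.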
