feat(diff): add ODCS semantic diff command and API endpoint by davidb-tada · Pull Request #1112 · datacontract/datacontract-cli

davidb-tada · 2026-03-31T16:06:39Z

feat(diff): semantic diff for ODCS data contracts with text and HTML reports

Summary

Reintroducing the much loved diff functionality, with a wider capability including semantic diff with field level granularity. This PR adds a datacontract diff command that produces a semantic diff of two data contract YAML files, with plain-text and HTML report formats.

Note: Breaking change classification (which changes are breaking vs non-breaking) is out of scope for this PR and will follow in a subsequent one.

# Plain-text diff to stdout
datacontract diff v1.odcs.yaml v2.odcs.yaml

# HTML report saved to a file
datacontract diff v1.odcs.yaml v2.odcs.yaml --format html --output diff.html

A POST /diff endpoint is also added to the API server, accepting the two contracts as YAML strings in the request body.

Motivation

When evolving a data contract, understanding what changed between two versions is a prerequisite to assessing impact. This provides a structured, human-readable diff rather than a raw YAML text diff, with changes grouped by field path and presented at both a summary (rolled-up) and detail (leaf) level.

Screenshots

HTML Report

Text Report

How it works

The diff engine normalises both contracts before diffing — converting named lists (e.g. schema[], customProperties[], authoritativeDefinitions[]) from positional arrays to dicts keyed by their natural identifier.

This eliminates false positives from list reordering and produces stable, meaningful field paths in the output e.g.
schema.orders.properties.order_id.logicalType Changed
rather than
schema[0].properties[1].logicalType Changed

The renderer uses LIST_CONTAINERS to apply list-item prefixes (- ) to named collection entries in the output. This set is derived automatically by reflecting over the ODCS Pydantic models, so it stays in sync with the schema without manual maintenance.

Natural key limitation and upstream path - Future Considerations

This section provides context around the current semantic / natural key implementation and envisaged improvements.

The natural key for each named list — the field used to key items when converting arrays to dicts before diffing — is currently hardcoded in report_helpers.py. Each entry in the table below was determined by inspecting the required array in the ODCS v3 JSON Schema and selecting the field that acts as the stable semantic identifier:

Entity	Natural key	Source
`SchemaObject`	`name`	`required: [name]` in JSON schema
`SchemaProperty`	`name`	`required: [name]` in JSON schema
`Server`	`server`	`required: [server, type]` — `server` chosen as identifier
`SLAProperty`	`property`	`required: [property, value]` — `property` chosen as identifier
`CustomProperty`	`property`	`required: [property, value]` — `property` chosen as identifier
`Role`	`role`	`required: [role]` in JSON schema
`SupportItem`	`channel`	`required: [channel]` in JSON schema
`TeamMember`	`username`	`required: [username]` in JSON schema
`DataQuality`	`name`	no `required` in JSON schema — `name` inferred; positional fallback if absent

The DataQuality case illustrates the root problem: the JSON schema does not declare any field as required for that object, so there is no machine-readable source of truth from which the natural key can be derived. The key must instead be inferred from domain knowledge and hardcoded with a fallback.

The correct long-term fix is for the natural key to be declared in the spec itself — specifically, by ensuring that each named-list entity type in odcs-json-schema-v3.1.0.json has a required array whose first entry is unambiguously its stable identifier. That metadata would then flow through to the Pydantic model package as a non-optional field, allowing helpers.py to derive the natural key table dynamically by reflecting over model_fields rather than maintaining it by hand.

If this PR is merged, the author intends to raise issues and submit PRs to both upstream projects — bitol-io/open-data-contract-standard to add the missing required declarations to the JSON schema, and datacontract/open-data-contract-standard-python to surface those as non-optional Pydantic fields — which would allow a follow-up PR here to simplify the implementation, replace the hardcoded key table and its associated logic with fully derived logic.

Changes

datacontract/reports/diff/diff.py — ODCS semantic diff implementation with diff pipeline
datacontract/reports/diff/contract_diff_report.py — builds the report data dict from raw diff output
datacontract/reports/diff/text_contract_diff_renderer.py — self-contained plain-text renderer
datacontract/reports/diff/html_contract_diff_renderer.py — self-contained HTML renderer
datacontract/reports/common/report_helpers.py — helper code to facilitate semantic diffs
datacontract/reports/common/html_base_renderer.py — common HTML base rendering functionality
datacontract/api.py — POST /diff endpoint
datacontract/cli.py — datacontract diff command
tests/test_reports_diff_*.py — comprehensive test suite covering diff, report building, rendering, and API
tests/test_reports_common_helpers.py — tests for common report helper functions
tests/fixtures/diff/* — new fixture files for ODCS diff testing
tests/fixtures/breaking/* — removed leftover fixture files from the DCS based implementation
API.md — diff endpoint documentation
CHANGELOG.md — unreleased entry
README.md — updated with new diff functionality
pyproject.toml — dependency updates

Testing

uv run pytest tests/test_rep* tests/test_cli.py tests/test_api.py

266 tests, all passing.

Tests pass
ruff format
README.md updated (if relevant)
CHANGELOG.md entry added

jochenchrist · 2026-03-31T17:33:42Z

datacontract/reports/common/__init__.py

I am not a big fan of "common" packages. Mostly there is not much in common. Please move the logic to the actual reports.

OK - The intent was not explicit here: I am laying the foundation for the next PR that will focus on breaking changes.

Both diff and breaking will be sharing that rendering logic, report layout and UX, ensuring they are aligned on these aspects. This is the reason common is being introduced here.

For now, I think they fit better in the reports directory directly. This makes it similarly obvious that they are not specific to either diff or breaking. 2 shared files are not really a reason for a common package.

Also, there are many functions that are only called in one of html_contract_diff_renderer or text_contract_diff_renderer (and in the tests). Please move those to the place where they are called, as @jochenchrist said. Only when they are actually used for the breaking logic, they should be moved into a third file.

jochenchrist · 2026-03-31T17:34:17Z

tests/fixtures/diff/integration/golden_integration_html.html

+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+<title>ODCS Diff — tests/fixtures/diff/integration/report_renderer_integration_v1.yaml → tests/fixtures/diff/integration/report_renderer_integration_v2.yaml</title>
+<style>
+@import url('https://fonts.googleapis.com/css2?family=IBM+Plex+Mono:wght@400;500&family=IBM+Plex+Sans:wght@300;400;500;600&display=swap');


Why need for this font? Is it used anywhere else?

This font is used as part of the html diff report, and will also be used by the breaking report.

This font provides a good look-and-feel (cf screenshot in the PR description).

I would ask to to make the HTML report consistent with the HTML export (following Tailwind Plus styleguide).

We could also remove the HTML report for now to focus the scope of this feature.

I agree with your suggestion to remove this and reduce the scope for now.

davidb-tada · 2026-04-01T14:04:36Z

Upon consideration it may be worthwhile aligning at least the text based report output with the existing structure and color scheme.

Any thoughts on this @jochenchrist ? I am happy to look at this if there is commitment on your side to incorporate this when deemed ready.

jochenchrist · 2026-04-02T17:18:17Z

Upon consideration it may be worthwhile aligning at least the text based report output with the existing structure and color scheme.
Any thoughts on this @jochenchrist ? I am happy to look at this if there is commitment on your side to incorporate this when deemed ready.

I think it is OK when the diff is a bit different to the test results, if contents are different. Of course, from a user perspective, some general UI pattern should be consistent.

jschoedl · 2026-04-02T17:21:29Z

Hi @davidb-tada, thank you for the PR! I scanned through it and will take a deeper look later.

One general note: Our convention is to use the _ prefix for function names that are only used within the same file, whereas you sometimes imported such functions. Not sure if this is still relevant as soon as you moved helper functions to the places of their calling.

datacontract/api.py

datacontract/reports/diff/contract_diff_report.py

jschoedl · 2026-04-02T15:55:49Z

datacontract/reports/diff/diff.py

+if __name__ == "__main__":
+    from datacontract.reports.diff.contract_diff_report import ContractDiffReport
+
+    _positional = []
+    _output_path = None
+    _fmt = "text"
+    _it = iter(sys.argv[1:])
+    for _a in _it:
+        if _a in ("--output", "-o"):
+            _output_path = next(_it, None)
+        elif _a == "--html":
+            _fmt = "html"
+        elif not _a.startswith("--"):
+            _positional.append(_a)
+    if len(_positional) != 2:
+        print("Usage: python diff.py <v1.yaml> <v2.yaml> [--html] [--output <file>]")
+        sys.exit(1)
+    ContractDiffReport().generate(
+        v1_path=_positional[0],
+        v2_path=_positional[1],
+        fmt=_fmt,
+        output_path=_output_path,
+    )


Is this a leftover from your testing? We do not have other __main__ entry points in the codebase except for cli.py.

This was a left over. Removed

jschoedl · 2026-04-02T16:26:07Z

datacontract/reports/diff/html_contract_diff_renderer.py

+from __future__ import annotations
+
+from datacontract.reports.common.html_base_renderer import (
+    _CSS,  # noqa: F401 — re-exported for any callers that import it directly


this import is not used here, right?

Removed - ruff did not detect

jschoedl · 2026-04-02T16:30:49Z

CHANGELOG.md

+hint: Resolve all conflicts manually, mark them as resolved with
+hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
+hint: You can instead skip this commit: run "git rebase --skip".
+hint: To abort and get back to the state before "git rebase", run "git rebase --abort".


CHANGELOG.md

jschoedl · 2026-04-02T16:35:15Z

CLAUDE.md


+6. **Semantic Diff (`datacontract/reports/diff/`)**: Semantic diff engine with HTML and text report renderers.
+
+   **Normalization — planned improvement**: The natural keys used to match list items (e.g. `schema[].name`, `customProperties[].property`) are currently hardcoded in `diff.py` as a 20-line path-to-key table, backed by six helper methods (`_normalize_by`, `_normalize_schema_fields`, `_normalize_quality`, `_normalize_auth_defs`, `_normalize_relationships`, `_normalize_properties`). The intended long-term fix is to encode these natural keys directly in the ODCS JSON Schema — as an `x-natural-key` extension or similar — and carry that through to the upstream `open-data-contract-standard` Pydantic models (via a `__natural_key__` class var or `Field` annotation). This is an improvement beyond just simplifying this code: it makes the identity semantics of each model explicit and authoritative in the schema itself, rather than having them implied by convention or rediscovered by downstream consumers. Once that lands, all six helpers and the key table collapse into a single ~15-line recursive function that walks the model tree by reflection — adding or changing a list field in the ODCS schema would require no changes here. See the `NOTE` in `ContractDiff._normalize()` for the exact upstream change required.


Why do we need this here?

Merged with existing doc in diff.py

tests/fixtures/diff/integration/helper/generate_golden.py

jschoedl · 2026-04-02T16:44:28Z

datacontract/reports/common/html_base_renderer.py

+    return _html.escape(str(s))
+
+
+def _format_value(val: Any, max_len: int = None) -> str:


max_len is unused. Either remove or implement.

Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>

Added granular error handling for invalid contracts (err 422) Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>

Adding match ['key'] or ["key"] Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>

Fixed fixture path Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>

davidb-tada · 2026-04-03T09:09:34Z

Hi @davidb-tada, thank you for the PR! I scanned through it and will take a deeper look later.

One general note: Our convention is to use the _ prefix for function names that are only used within the same file, whereas you sometimes imported such functions. Not sure if this is still relevant as soon as you moved helper functions to the places of their calling.

Making these public for now - This will eventually be refactored

davidb-tada · 2026-04-03T09:20:48Z

Upon consideration it may be worthwhile aligning at least the text based report output with the existing structure and color scheme.

Any thoughts on this @jochenchrist ? I am happy to look at this if there is commitment on your side to incorporate this when deemed ready.

I think it is OK when the diff is a bit different to the test results, if contents are different. Of course, from a user perspective, some general UI pattern should be consistent.

Thanks for the review folks.

Based on all feedback I have decided to close this PR and create a new PR that will focus on the text output with an output format and code design that is better aligned with the existing code base.

I will incorporate all the present feedback into the new code

davidb-tada · 2026-04-03T15:52:32Z

Upon consideration it may be worthwhile aligning at least the text based report output with the existing structure and color scheme.

Any thoughts on this @jochenchrist ? I am happy to look at this if there is commitment on your side to incorporate this when deemed ready.

I think it is OK when the diff is a bit different to the test results, if contents are different. Of course, from a user perspective, some general UI pattern should be consistent.

Thanks for the review folks.

Based on all feedback I have decided to close this PR and create a new PR that will focus on the text output with an output format and code design that is better aligned with the existing code base.

I will incorporate all the present feedback into the new code

@jochenchrist / @jschoedl -> c.f. #1118

jochenchrist reviewed Mar 31, 2026

View reviewed changes

jschoedl requested changes Apr 2, 2026

View reviewed changes

davidb-tada force-pushed the feat/semantic-diff branch 4 times, most recently from c89d387 to 75e4d13 Compare April 3, 2026 08:52

davidb-tada and others added 7 commits April 3, 2026 09:57

feat(diff): add ODCS semantic diff command and API endpoint

0cc89de

Update CHANGELOG.md

f362e26

Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>

Update datacontract/api.py

0da019a

Added granular error handling for invalid contracts (err 422) Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>

Update datacontract/reports/diff/contract_diff_report.py

55971b0

Adding match ['key'] or ["key"] Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>

Moved comments from CLAUDE.md to diff.py

abb0cab

Update tests/fixtures/diff/integration/helper/generate_golden.py

d6e2f3c

Fixed fixture path Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>

Removed unused argument max_len

f0dcc38

davidb-tada force-pushed the feat/semantic-diff branch from 75e4d13 to f0dcc38 Compare April 3, 2026 08:58

Removed _ on public func names

8156b7f

davidb-tada closed this Apr 3, 2026

davidb-tada mentioned this pull request Apr 3, 2026

Added changelog command and API endpoint #1118

Open

4 tasks


		6. Semantic Diff (`datacontract/reports/diff/`): Semantic diff engine with HTML and text report renderers.

		Normalization — planned improvement: The natural keys used to match list items (e.g. `schema[].name`, `customProperties[].property`) are currently hardcoded in `diff.py` as a 20-line path-to-key table, backed by six helper methods (`_normalize_by`, `_normalize_schema_fields`, `_normalize_quality`, `_normalize_auth_defs`, `_normalize_relationships`, `_normalize_properties`). The intended long-term fix is to encode these natural keys directly in the ODCS JSON Schema — as an `x-natural-key` extension or similar — and carry that through to the upstream `open-data-contract-standard` Pydantic models (via a `__natural_key__` class var or `Field` annotation). This is an improvement beyond just simplifying this code: it makes the identity semantics of each model explicit and authoritative in the schema itself, rather than having them implied by convention or rediscovered by downstream consumers. Once that lands, all six helpers and the key table collapse into a single ~15-line recursive function that walks the model tree by reflection — adding or changing a list field in the ODCS schema would require no changes here. See the `NOTE` in `ContractDiff._normalize()` for the exact upstream change required.

		return _html.escape(str(s))


		def _format_value(val: Any, max_len: int = None) -> str:

Conversation

davidb-tada commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

feat(diff): semantic diff for ODCS data contracts with text and HTML reports

Summary

Motivation

Screenshots

HTML Report

Text Report

How it works

Natural key limitation and upstream path - Future Considerations

Changes

Testing

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jschoedl Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davidb-tada Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davidb-tada commented Apr 1, 2026

Uh oh!

jochenchrist commented Apr 2, 2026

Uh oh!

jschoedl commented Apr 2, 2026

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davidb-tada commented Apr 3, 2026

Uh oh!

davidb-tada commented Apr 3, 2026

Uh oh!

davidb-tada commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

davidb-tada commented Mar 31, 2026 •

edited

Loading

jschoedl Apr 2, 2026 •

edited

Loading

davidb-tada Apr 1, 2026 •

edited

Loading