Skip to content

feat(diff): add ODCS semantic diff command and API endpoint#1112

Closed
davidb-tada wants to merge 8 commits intodatacontract:mainfrom
davidb-tada:feat/semantic-diff
Closed

feat(diff): add ODCS semantic diff command and API endpoint#1112
davidb-tada wants to merge 8 commits intodatacontract:mainfrom
davidb-tada:feat/semantic-diff

Conversation

@davidb-tada
Copy link
Copy Markdown

@davidb-tada davidb-tada commented Mar 31, 2026

feat(diff): semantic diff for ODCS data contracts with text and HTML reports

Summary

Reintroducing the much loved diff functionality, with a wider capability including semantic diff with field level granularity. This PR adds a datacontract diff command that produces a semantic diff of two data contract YAML files, with plain-text and HTML report formats.

Note: Breaking change classification (which changes are breaking vs non-breaking) is out of scope for this PR and will follow in a subsequent one.

# Plain-text diff to stdout
datacontract diff v1.odcs.yaml v2.odcs.yaml

# HTML report saved to a file
datacontract diff v1.odcs.yaml v2.odcs.yaml --format html --output diff.html

A POST /diff endpoint is also added to the API server, accepting the two contracts as YAML strings in the request body.

Motivation

When evolving a data contract, understanding what changed between two versions is a prerequisite to assessing impact. This provides a structured, human-readable diff rather than a raw YAML text diff, with changes grouped by field path and presented at both a summary (rolled-up) and detail (leaf) level.

Screenshots

HTML Report

golden_integration_html

Text Report

golden_text_report

How it works

The diff engine normalises both contracts before diffing — converting named lists (e.g. schema[], customProperties[], authoritativeDefinitions[]) from positional arrays to dicts keyed by their natural identifier.

This eliminates false positives from list reordering and produces stable, meaningful field paths in the output e.g.
schema.orders.properties.order_id.logicalType Changed
rather than
schema[0].properties[1].logicalType Changed

The renderer uses LIST_CONTAINERS to apply list-item prefixes (- ) to named collection entries in the output. This set is derived automatically by reflecting over the ODCS Pydantic models, so it stays in sync with the schema without manual maintenance.

Natural key limitation and upstream path - Future Considerations

This section provides context around the current semantic / natural key implementation and envisaged improvements.

The natural key for each named list — the field used to key items when converting arrays to dicts before diffing — is currently hardcoded in report_helpers.py. Each entry in the table below was determined by inspecting the required array in the ODCS v3 JSON Schema and selecting the field that acts as the stable semantic identifier:

Entity Natural key Source
SchemaObject name required: [name] in JSON schema
SchemaProperty name required: [name] in JSON schema
Server server required: [server, type]server chosen as identifier
SLAProperty property required: [property, value]property chosen as identifier
CustomProperty property required: [property, value]property chosen as identifier
Role role required: [role] in JSON schema
SupportItem channel required: [channel] in JSON schema
TeamMember username required: [username] in JSON schema
DataQuality name no required in JSON schema — name inferred; positional fallback if absent

The DataQuality case illustrates the root problem: the JSON schema does not declare any field as required for that object, so there is no machine-readable source of truth from which the natural key can be derived. The key must instead be inferred from domain knowledge and hardcoded with a fallback.

The correct long-term fix is for the natural key to be declared in the spec itself — specifically, by ensuring that each named-list entity type in odcs-json-schema-v3.1.0.json has a required array whose first entry is unambiguously its stable identifier. That metadata would then flow through to the Pydantic model package as a non-optional field, allowing helpers.py to derive the natural key table dynamically by reflecting over model_fields rather than maintaining it by hand.

If this PR is merged, the author intends to raise issues and submit PRs to both upstream projects — bitol-io/open-data-contract-standard to add the missing required declarations to the JSON schema, and datacontract/open-data-contract-standard-python to surface those as non-optional Pydantic fields — which would allow a follow-up PR here to simplify the implementation, replace the hardcoded key table and its associated logic with fully derived logic.

Changes

  • datacontract/reports/diff/diff.py — ODCS semantic diff implementation with diff pipeline
  • datacontract/reports/diff/contract_diff_report.py — builds the report data dict from raw diff output
  • datacontract/reports/diff/text_contract_diff_renderer.py — self-contained plain-text renderer
  • datacontract/reports/diff/html_contract_diff_renderer.py — self-contained HTML renderer
  • datacontract/reports/common/report_helpers.py — helper code to facilitate semantic diffs
  • datacontract/reports/common/html_base_renderer.py — common HTML base rendering functionality
  • datacontract/api.pyPOST /diff endpoint
  • datacontract/cli.pydatacontract diff command
  • tests/test_reports_diff_*.py — comprehensive test suite covering diff, report building, rendering, and API
  • tests/test_reports_common_helpers.py — tests for common report helper functions
  • tests/fixtures/diff/* — new fixture files for ODCS diff testing
  • tests/fixtures/breaking/* — removed leftover fixture files from the DCS based implementation
  • API.md — diff endpoint documentation
  • CHANGELOG.md — unreleased entry
  • README.md — updated with new diff functionality
  • pyproject.toml — dependency updates

Testing

uv run pytest tests/test_rep* tests/test_cli.py tests/test_api.py

266 tests, all passing.

  • Tests pass
  • ruff format
  • README.md updated (if relevant)
  • CHANGELOG.md entry added

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a big fan of "common" packages. Mostly there is not much in common. Please move the logic to the actual reports.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - The intent was not explicit here: I am laying the foundation for the next PR that will focus on breaking changes.

Both diff and breaking will be sharing that rendering logic, report layout and UX, ensuring they are aligned on these aspects. This is the reason common is being introduced here.

Copy link
Copy Markdown
Collaborator

@jschoedl jschoedl Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I think they fit better in the reports directory directly. This makes it similarly obvious that they are not specific to either diff or breaking. 2 shared files are not really a reason for a common package.

Also, there are many functions that are only called in one of html_contract_diff_renderer or text_contract_diff_renderer (and in the tests). Please move those to the place where they are called, as @jochenchrist said. Only when they are actually used for the breaking logic, they should be moved into a third file.

<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>ODCS Diff — tests/fixtures/diff/integration/report_renderer_integration_v1.yaml → tests/fixtures/diff/integration/report_renderer_integration_v2.yaml</title>
<style>
@import url('https://fonts.googleapis.com/css2?family=IBM+Plex+Mono:wght@400;500&family=IBM+Plex+Sans:wght@300;400;500;600&display=swap');
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why need for this font? Is it used anywhere else?

Copy link
Copy Markdown
Author

@davidb-tada davidb-tada Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This font is used as part of the html diff report, and will also be used by the breaking report.

This font provides a good look-and-feel (cf screenshot in the PR description).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would ask to to make the HTML report consistent with the HTML export (following Tailwind Plus styleguide).

We could also remove the HTML report for now to focus the scope of this feature.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your suggestion to remove this and reduce the scope for now.

@davidb-tada
Copy link
Copy Markdown
Author

Upon consideration it may be worthwhile aligning at least the text based report output with the existing structure and color scheme.

image

Any thoughts on this @jochenchrist ? I am happy to look at this if there is commitment on your side to incorporate this when deemed ready.

@jochenchrist
Copy link
Copy Markdown
Contributor

Upon consideration it may be worthwhile aligning at least the text based report output with the existing structure and color scheme.

image Any thoughts on this @jochenchrist ? I am happy to look at this if there is commitment on your side to incorporate this when deemed ready.

I think it is OK when the diff is a bit different to the test results, if contents are different. Of course, from a user perspective, some general UI pattern should be consistent.

@jschoedl
Copy link
Copy Markdown
Collaborator

jschoedl commented Apr 2, 2026

Hi @davidb-tada, thank you for the PR! I scanned through it and will take a deeper look later.

One general note: Our convention is to use the _ prefix for function names that are only used within the same file, whereas you sometimes imported such functions. Not sure if this is still relevant as soon as you moved helper functions to the places of their calling.

Comment on lines +296 to +318
if __name__ == "__main__":
from datacontract.reports.diff.contract_diff_report import ContractDiffReport

_positional = []
_output_path = None
_fmt = "text"
_it = iter(sys.argv[1:])
for _a in _it:
if _a in ("--output", "-o"):
_output_path = next(_it, None)
elif _a == "--html":
_fmt = "html"
elif not _a.startswith("--"):
_positional.append(_a)
if len(_positional) != 2:
print("Usage: python diff.py <v1.yaml> <v2.yaml> [--html] [--output <file>]")
sys.exit(1)
ContractDiffReport().generate(
v1_path=_positional[0],
v2_path=_positional[1],
fmt=_fmt,
output_path=_output_path,
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a leftover from your testing? We do not have other __main__ entry points in the codebase except for cli.py.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a left over. Removed

from __future__ import annotations

from datacontract.reports.common.html_base_renderer import (
_CSS, # noqa: F401 — re-exported for any callers that import it directly
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this import is not used here, right?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed - ruff did not detect

Comment on lines +29 to +32
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

CLAUDE.md Outdated

6. **Semantic Diff (`datacontract/reports/diff/`)**: Semantic diff engine with HTML and text report renderers.

**Normalization — planned improvement**: The natural keys used to match list items (e.g. `schema[].name`, `customProperties[].property`) are currently hardcoded in `diff.py` as a 20-line path-to-key table, backed by six helper methods (`_normalize_by`, `_normalize_schema_fields`, `_normalize_quality`, `_normalize_auth_defs`, `_normalize_relationships`, `_normalize_properties`). The intended long-term fix is to encode these natural keys directly in the ODCS JSON Schema — as an `x-natural-key` extension or similar — and carry that through to the upstream `open-data-contract-standard` Pydantic models (via a `__natural_key__` class var or `Field` annotation). This is an improvement beyond just simplifying this code: it makes the identity semantics of each model explicit and authoritative in the schema itself, rather than having them implied by convention or rediscovered by downstream consumers. Once that lands, all six helpers and the key table collapse into a single ~15-line recursive function that walks the model tree by reflection — adding or changing a list field in the ODCS schema would require no changes here. See the `NOTE` in `ContractDiff._normalize()` for the exact upstream change required.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this here?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged with existing doc in diff.py

return _html.escape(str(s))


def _format_value(val: Any, max_len: int = None) -> str:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_len is unused. Either remove or implement.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@davidb-tada davidb-tada force-pushed the feat/semantic-diff branch 4 times, most recently from c89d387 to 75e4d13 Compare April 3, 2026 08:52
davidb-tada and others added 7 commits April 3, 2026 09:57
Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>
Added granular error handling for invalid contracts (err 422)

Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>
Adding match ['key'] or ["key"]

Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>
Fixed fixture path

Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>
@davidb-tada
Copy link
Copy Markdown
Author

Hi @davidb-tada, thank you for the PR! I scanned through it and will take a deeper look later.

One general note: Our convention is to use the _ prefix for function names that are only used within the same file, whereas you sometimes imported such functions. Not sure if this is still relevant as soon as you moved helper functions to the places of their calling.

Making these public for now - This will eventually be refactored

@davidb-tada
Copy link
Copy Markdown
Author

Upon consideration it may be worthwhile aligning at least the text based report output with the existing structure and color scheme.
image
Any thoughts on this @jochenchrist ? I am happy to look at this if there is commitment on your side to incorporate this when deemed ready.

I think it is OK when the diff is a bit different to the test results, if contents are different. Of course, from a user perspective, some general UI pattern should be consistent.

Thanks for the review folks.

Based on all feedback I have decided to close this PR and create a new PR that will focus on the text output with an output format and code design that is better aligned with the existing code base.

I will incorporate all the present feedback into the new code

@davidb-tada
Copy link
Copy Markdown
Author

Upon consideration it may be worthwhile aligning at least the text based report output with the existing structure and color scheme.
image
Any thoughts on this @jochenchrist ? I am happy to look at this if there is commitment on your side to incorporate this when deemed ready.

I think it is OK when the diff is a bit different to the test results, if contents are different. Of course, from a user perspective, some general UI pattern should be consistent.

Thanks for the review folks.

Based on all feedback I have decided to close this PR and create a new PR that will focus on the text output with an output format and code design that is better aligned with the existing code base.

I will incorporate all the present feedback into the new code

@jochenchrist / @jschoedl -> c.f. #1118

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants