feat(diff): add ODCS semantic diff command and API endpoint#1112
feat(diff): add ODCS semantic diff command and API endpoint#1112davidb-tada wants to merge 8 commits intodatacontract:mainfrom
Conversation
There was a problem hiding this comment.
I am not a big fan of "common" packages. Mostly there is not much in common. Please move the logic to the actual reports.
There was a problem hiding this comment.
OK - The intent was not explicit here: I am laying the foundation for the next PR that will focus on breaking changes.
Both diff and breaking will be sharing that rendering logic, report layout and UX, ensuring they are aligned on these aspects. This is the reason common is being introduced here.
There was a problem hiding this comment.
For now, I think they fit better in the reports directory directly. This makes it similarly obvious that they are not specific to either diff or breaking. 2 shared files are not really a reason for a common package.
Also, there are many functions that are only called in one of html_contract_diff_renderer or text_contract_diff_renderer (and in the tests). Please move those to the place where they are called, as @jochenchrist said. Only when they are actually used for the breaking logic, they should be moved into a third file.
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> | ||
| <title>ODCS Diff — tests/fixtures/diff/integration/report_renderer_integration_v1.yaml → tests/fixtures/diff/integration/report_renderer_integration_v2.yaml</title> | ||
| <style> | ||
| @import url('https://fonts.googleapis.com/css2?family=IBM+Plex+Mono:wght@400;500&family=IBM+Plex+Sans:wght@300;400;500;600&display=swap'); |
There was a problem hiding this comment.
Why need for this font? Is it used anywhere else?
There was a problem hiding this comment.
This font is used as part of the html diff report, and will also be used by the breaking report.
This font provides a good look-and-feel (cf screenshot in the PR description).
There was a problem hiding this comment.
I would ask to to make the HTML report consistent with the HTML export (following Tailwind Plus styleguide).
We could also remove the HTML report for now to focus the scope of this feature.
There was a problem hiding this comment.
I agree with your suggestion to remove this and reduce the scope for now.
|
Upon consideration it may be worthwhile aligning at least the text based report output with the existing structure and color scheme.
Any thoughts on this @jochenchrist ? I am happy to look at this if there is commitment on your side to incorporate this when deemed ready. |
I think it is OK when the diff is a bit different to the test results, if contents are different. Of course, from a user perspective, some general UI pattern should be consistent. |
|
Hi @davidb-tada, thank you for the PR! I scanned through it and will take a deeper look later. One general note: Our convention is to use the |
| if __name__ == "__main__": | ||
| from datacontract.reports.diff.contract_diff_report import ContractDiffReport | ||
|
|
||
| _positional = [] | ||
| _output_path = None | ||
| _fmt = "text" | ||
| _it = iter(sys.argv[1:]) | ||
| for _a in _it: | ||
| if _a in ("--output", "-o"): | ||
| _output_path = next(_it, None) | ||
| elif _a == "--html": | ||
| _fmt = "html" | ||
| elif not _a.startswith("--"): | ||
| _positional.append(_a) | ||
| if len(_positional) != 2: | ||
| print("Usage: python diff.py <v1.yaml> <v2.yaml> [--html] [--output <file>]") | ||
| sys.exit(1) | ||
| ContractDiffReport().generate( | ||
| v1_path=_positional[0], | ||
| v2_path=_positional[1], | ||
| fmt=_fmt, | ||
| output_path=_output_path, | ||
| ) |
There was a problem hiding this comment.
Is this a leftover from your testing? We do not have other __main__ entry points in the codebase except for cli.py.
There was a problem hiding this comment.
This was a left over. Removed
| from __future__ import annotations | ||
|
|
||
| from datacontract.reports.common.html_base_renderer import ( | ||
| _CSS, # noqa: F401 — re-exported for any callers that import it directly |
There was a problem hiding this comment.
this import is not used here, right?
There was a problem hiding this comment.
Removed - ruff did not detect
| hint: Resolve all conflicts manually, mark them as resolved with | ||
| hint: "git add/rm <conflicted_files>", then run "git rebase --continue". | ||
| hint: You can instead skip this commit: run "git rebase --skip". | ||
| hint: To abort and get back to the state before "git rebase", run "git rebase --abort". |
CLAUDE.md
Outdated
|
|
||
| 6. **Semantic Diff (`datacontract/reports/diff/`)**: Semantic diff engine with HTML and text report renderers. | ||
|
|
||
| **Normalization — planned improvement**: The natural keys used to match list items (e.g. `schema[].name`, `customProperties[].property`) are currently hardcoded in `diff.py` as a 20-line path-to-key table, backed by six helper methods (`_normalize_by`, `_normalize_schema_fields`, `_normalize_quality`, `_normalize_auth_defs`, `_normalize_relationships`, `_normalize_properties`). The intended long-term fix is to encode these natural keys directly in the ODCS JSON Schema — as an `x-natural-key` extension or similar — and carry that through to the upstream `open-data-contract-standard` Pydantic models (via a `__natural_key__` class var or `Field` annotation). This is an improvement beyond just simplifying this code: it makes the identity semantics of each model explicit and authoritative in the schema itself, rather than having them implied by convention or rediscovered by downstream consumers. Once that lands, all six helpers and the key table collapse into a single ~15-line recursive function that walks the model tree by reflection — adding or changing a list field in the ODCS schema would require no changes here. See the `NOTE` in `ContractDiff._normalize()` for the exact upstream change required. |
There was a problem hiding this comment.
Why do we need this here?
There was a problem hiding this comment.
Merged with existing doc in diff.py
| return _html.escape(str(s)) | ||
|
|
||
|
|
||
| def _format_value(val: Any, max_len: int = None) -> str: |
There was a problem hiding this comment.
max_len is unused. Either remove or implement.
c89d387 to
75e4d13
Compare
Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>
Added granular error handling for invalid contracts (err 422) Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>
Adding match ['key'] or ["key"] Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>
Fixed fixture path Co-authored-by: Jakob Schödl <jschoedl04@gmail.com>
75e4d13 to
f0dcc38
Compare
Making these public for now - This will eventually be refactored |
Thanks for the review folks. Based on all feedback I have decided to close this PR and create a new PR that will focus on the text output with an output format and code design that is better aligned with the existing code base. I will incorporate all the present feedback into the new code |
@jochenchrist / @jschoedl -> c.f. #1118 |


feat(diff): semantic diff for ODCS data contracts with text and HTML reports
Summary
Reintroducing the much loved diff functionality, with a wider capability including semantic diff with field level granularity. This PR adds a
datacontract diffcommand that produces a semantic diff of two data contract YAML files, with plain-text and HTML report formats.Note: Breaking change classification (which changes are breaking vs non-breaking) is out of scope for this PR and will follow in a subsequent one.
A
POST /diffendpoint is also added to the API server, accepting the two contracts as YAML strings in the request body.Motivation
When evolving a data contract, understanding what changed between two versions is a prerequisite to assessing impact. This provides a structured, human-readable diff rather than a raw YAML text diff, with changes grouped by field path and presented at both a summary (rolled-up) and detail (leaf) level.
Screenshots
HTML Report
Text Report
How it works
The diff engine normalises both contracts before diffing — converting named lists (e.g.
schema[],customProperties[],authoritativeDefinitions[]) from positional arrays to dicts keyed by their natural identifier.This eliminates false positives from list reordering and produces stable, meaningful field paths in the output e.g.
schema.orders.properties.order_id.logicalType Changedrather than
schema[0].properties[1].logicalType ChangedThe renderer uses
LIST_CONTAINERSto apply list-item prefixes (-) to named collection entries in the output. This set is derived automatically by reflecting over the ODCS Pydantic models, so it stays in sync with the schema without manual maintenance.Natural key limitation and upstream path - Future Considerations
This section provides context around the current semantic / natural key implementation and envisaged improvements.
The natural key for each named list — the field used to key items when converting arrays to dicts before diffing — is currently hardcoded in
report_helpers.py. Each entry in the table below was determined by inspecting therequiredarray in the ODCS v3 JSON Schema and selecting the field that acts as the stable semantic identifier:SchemaObjectnamerequired: [name]in JSON schemaSchemaPropertynamerequired: [name]in JSON schemaServerserverrequired: [server, type]—serverchosen as identifierSLAPropertypropertyrequired: [property, value]—propertychosen as identifierCustomPropertypropertyrequired: [property, value]—propertychosen as identifierRolerolerequired: [role]in JSON schemaSupportItemchannelrequired: [channel]in JSON schemaTeamMemberusernamerequired: [username]in JSON schemaDataQualitynamerequiredin JSON schema —nameinferred; positional fallback if absentThe
DataQualitycase illustrates the root problem: the JSON schema does not declare any field asrequiredfor that object, so there is no machine-readable source of truth from which the natural key can be derived. The key must instead be inferred from domain knowledge and hardcoded with a fallback.The correct long-term fix is for the natural key to be declared in the spec itself — specifically, by ensuring that each named-list entity type in
odcs-json-schema-v3.1.0.jsonhas arequiredarray whose first entry is unambiguously its stable identifier. That metadata would then flow through to the Pydantic model package as a non-optional field, allowinghelpers.pyto derive the natural key table dynamically by reflecting overmodel_fieldsrather than maintaining it by hand.If this PR is merged, the author intends to raise issues and submit PRs to both upstream projects —
bitol-io/open-data-contract-standardto add the missingrequireddeclarations to the JSON schema, anddatacontract/open-data-contract-standard-pythonto surface those as non-optional Pydantic fields — which would allow a follow-up PR here to simplify the implementation, replace the hardcoded key table and its associated logic with fully derived logic.Changes
datacontract/reports/diff/diff.py— ODCS semantic diff implementation with diff pipelinedatacontract/reports/diff/contract_diff_report.py— builds the report data dict from raw diff outputdatacontract/reports/diff/text_contract_diff_renderer.py— self-contained plain-text rendererdatacontract/reports/diff/html_contract_diff_renderer.py— self-contained HTML rendererdatacontract/reports/common/report_helpers.py— helper code to facilitate semantic diffsdatacontract/reports/common/html_base_renderer.py— common HTML base rendering functionalitydatacontract/api.py—POST /diffendpointdatacontract/cli.py—datacontract diffcommandtests/test_reports_diff_*.py— comprehensive test suite covering diff, report building, rendering, and APItests/test_reports_common_helpers.py— tests for common report helper functionstests/fixtures/diff/*— new fixture files for ODCS diff testingtests/fixtures/breaking/*— removed leftover fixture files from the DCS based implementationAPI.md— diff endpoint documentationCHANGELOG.md— unreleased entryREADME.md— updated with new diff functionalitypyproject.toml— dependency updatesTesting
uv run pytest tests/test_rep* tests/test_cli.py tests/test_api.py266 tests, all passing.