Turn ambiguous archaeology propositions into a deterministic canonical map for one bounded legacy outcome slice.
This repo is no longer planning the whole end-state decode universe as the current implementation wedge. Phase 1 is intentionally narrow:
- archaeology mode only
- consume derived
claim.v0fromcrucible scan - converge claims into a canonical map
- emit escalations where ambiguity remains
Document extraction mode, entity resolution, and broader claim-resolution surfaces are deferred.
Phase 1 is a deterministic convergence engine for legacy-system archaeology.
For the first Hyperion-style slices, decoding does not need to:
- canonicalize extracted financial rows
- resolve entity identity graphs
- talk to Neo4j
- drive twins
- mutate production databases
It needs to do one thing well:
take messy claims from multiple legacy surfaces and produce the first usable, auditable canonical understanding of the slice.
That does NOT mean every scan fact should pass through decoding. Directly
observed metadata should land in the catalog first and bypass decode entirely.
Phase 1 is NOT:
- document extraction decode
- mutation emission for production targets
- hot-path entity resolution
- a database
- a workflow engine
- a model-assisted reasoner
Those may return later. They are not part of the current build wedge.
Phase 1 should keep a hard split between:
- Observed metadata Facts directly recoverable from scans and normalized into the metadata catalog.
- Derived claims Propositions that are ambiguous, inferential, or contradicted across sources.
decoding only owns the second category.
In other words, decoding consumes only the claim channel from the
two-channel Crucible scanner contract. It does not ingest catalog hydration
records.
Examples that should bypass decoding:
- table and column existence
- file inventory
- directly observed applications, jobs, reports, feeds, mappings, and consumers
- mechanically extractable lineage or dependency edges
Examples that should go through decoding:
- inferred
valid_values livenessassessmentsauthoritative_for- semantic labels
- weak or conflicting dependency edges
An operator should be able to run:
decoding archaeology claims/*.jsonl \
--policy legacy.decode.v0.json \
--output canon-map.jsonl \
--escalations escalations.jsonl \
--convergence convergence.jsonand receive:
- canonical entries for resolved archaeology subjects
- escalations for unresolved or conflicting subjects
- a convergence report showing what is settled and what still needs work
That is the Phase 1 bar.
Phase 1 consumes the derived claim.v0 contract emitted by crucible when
direct observation alone is not enough.
Required shape:
{
"event": "claim.v0",
"claim_id": "sha256:...",
"source": {
"kind": "repo_scan",
"scanner": "crucible.scan.repo@0.1.0",
"artifact_id": "sha256:...",
"locator": {
"kind": "file_range",
"value": "src/close_pack.py#L40-L65"
}
},
"subject": {
"kind": "report",
"id": "hyperion.close_pack_ebitda"
},
"property_type": "depends_on",
"value": {
"kind": "feed",
"id": "fdmee.actuals_load"
},
"confidence": 0.88
}Rules:
claim_idis content-addressed from normalized payload- decoding trusts provenance, not narrative
- identical claims must replay identically
- unknown
source.kindvalues are refusal conditions in Phase 1 - unknown
subject.kindorproperty_typevalues are refusal conditions in Phase 1 - malformed JSON, malformed
claim_id, or property/value shape mismatches are refusal conditions in Phase 1
Phase 1 must keep a hard boundary between invalid input and unresolved meaning.
Refusal (exit 2) conditions:
- malformed JSONL
- missing required fields
- malformed
claim_id - unknown
source.kind - unknown
subject.kind - unknown
property_type valueshape that does not match the frozen property contract- unknown policy keys
Escalation conditions:
- compatible contract, but conflicting propositions
- compatible contract, but insufficient corroboration
- compatible contract, but no declared policy path to resolution
If the decoder accepts a claim into a bucket, that claim has already passed the Phase 1 validity gate.
The decoder should therefore be thought of as a convergence layer above the catalog, not as the ingestion path for all scan output.
If a future implementation finds itself parsing table/file/resource/link catalog records directly, the boundary has drifted and should be corrected.
The earlier archaeology plan bucketed claims around
(table, column, property_type). That is too SQL-centric for the first
Hyperion slices.
We need to reason about more than tables and columns:
- jobs
- reports
- feeds
- mappings
- downstream consumers
- artifacts on disk
So Phase 1 uses a more general bucket key:
(subject.kind, subject.id, property_type)
This is still simple and deterministic, but it fits the real legacy surfaces better.
The simple bucket key works for singular or set-valued properties such as
schema, valid_values, or liveness. It is too coarse for edge properties
like reads or depends_on, where one subject can have many independent
targets.
Phase 1 therefore uses a logical bucket key:
base bucket: (subject.kind, subject.id, property_type)
edge bucket: (subject.kind, subject.id, property_type, value.kind, value.id)
Edge bucket rules apply to:
readswritesdepends_onused_byauthoritative_for
The rest stay on the base bucket key.
Each unique logical bucket key is a bucket.
Duplicate claim handling is deterministic:
- repeated identical
claim_ids collapse to one logical claim before bucketing convergence.claim_countcounts distinct claims after that collapse- source-artifact distinct counting is computed from the surviving distinct claims
- explanation payloads never repeat the same
claim_id
Claims pour into buckets from multiple sources. The bucket moves through a small state machine:
EMPTY -> SINGLE_SOURCE -> CONVERGING -> CONVERGED
|
v
CONFLICTED -> ESCALATED
| State | Meaning |
|---|---|
EMPTY |
no claim yet |
SINGLE_SOURCE |
one claim only |
CONVERGING |
multiple compatible claims |
CONVERGED |
enough evidence to publish canonical entry |
CONFLICTED |
incompatible claims exist |
ESCALATED |
conflict or ambiguity requires human review |
bucket_id must be computed from canonical JSON of the logical bucket key:
- base bucket object:
{"subject":{"kind":"...","id":"..."},"property_type":"..."} - edge bucket object:
{"subject":{"kind":"...","id":"..."},"property_type":"...","value":{"kind":"...","id":"..."}}
The hash format is always sha256:<64 lowercase hex>.
Phase 1 needs a small stable property vocabulary. Do not over-design it.
Initial property types:
| Property type | Typical subjects | Meaning |
|---|---|---|
exists |
all | subject exists |
schema |
table, column, view | structural definition |
constraint |
column, table | not null, FK, check, uniqueness |
reads |
job, procedure, report, consumer | reads from another subject |
writes |
job, procedure, feed | writes to another subject |
depends_on |
report, mapping, artifact | dependency edge |
used_by |
table, column, view, report | downstream usage |
schedule |
job, feed | cadence or trigger info |
valid_values |
column, mapping | allowed values |
semantic_label |
column, report line, mapping | business meaning hint |
liveness |
all | alive, dead, stale, unknown |
authoritative_for |
report, extract, consumer | authoritative output hint |
This list may grow, but Phase 1 should freeze a versioned vocabulary before code starts.
Phase 1 needs a small property-aware comparator registry. Freeze the compatibility rules before code starts:
| Property type | Compatible when |
|---|---|
exists |
both claims are true |
schema |
normalized JSON deep-equal |
constraint |
normalized JSON deep-equal |
reads |
same subject ref |
writes |
same subject ref |
depends_on |
same subject ref |
used_by |
same subject ref |
schedule |
normalized JSON deep-equal |
valid_values |
same sorted set of strings |
semantic_label |
same normalized string |
liveness |
same state, or alive + stale, or stale + unknown |
authoritative_for |
same subject ref |
alive and dead conflict. dead should never auto-win from absence alone.
Phase 1 should stay conservative.
These can resolve with little or no corroboration:
existsfrom high-confidence structural scansschemafrom database metadataconstraintfrom database metadata
These should normally require multiple compatible claims:
readswritesdepends_onused_byschedulevalid_valuessemantic_labelauthoritative_for
liveness is special:
- structural evidence alone is weak
- executed evidence is stronger
- absence of evidence is not death
Phase 1 should prefer alive, stale, or unknown and avoid overclaiming
that something is dead.
Resolved buckets emit canonical entries.
Required shape:
{
"event": "canon_entry.v0",
"bucket_id": "sha256:...",
"subject": {
"kind": "report",
"id": "hyperion.close_pack_ebitda"
},
"property_type": "depends_on",
"canonical_value": {
"kind": "feed",
"id": "fdmee.actuals_load"
},
"policy_id": "legacy.decode.v0",
"convergence": {
"state": "converged",
"source_count": 3,
"claim_count": 4
},
"explain": {
"winner_claim_ids": ["sha256:...", "sha256:..."],
"compatible_claim_ids": ["sha256:...", "sha256:..."],
"resolution_kind": "corroborated"
}
}Frozen field contract:
| Field | Type | Rules |
|---|---|---|
event |
string | exactly canon_entry.v0 |
bucket_id |
string | sha256:<64 lowercase hex> of the logical bucket key |
subject.kind |
enum | same frozen vocabulary as input |
subject.id |
string | same normalized ID as input |
property_type |
enum | same frozen vocabulary as input |
canonical_value |
JSON | normalized value chosen by policy |
policy_id |
string | legacy.decode.v0 for Phase 1 |
convergence.state |
enum | single_source, converging, converged |
convergence.source_count |
integer | number of distinct source artifacts contributing |
convergence.claim_count |
integer | total contributing claims |
explain.winner_claim_ids |
array | sorted winning claim IDs |
explain.compatible_claim_ids |
array | sorted compatible claim IDs included in support |
explain.resolution_kind |
enum | single_source, corroborated, priority_break, liveness_fold |
The explanation payload should stay structured. Free-text commentary can wait.
Unresolved or conflicting buckets emit escalations.
Required shape:
{
"event": "escalation.v0",
"bucket_id": "sha256:...",
"subject": {
"kind": "mapping",
"id": "adj.ebitda.rule.family"
},
"property_type": "semantic_label",
"reason": "conflicted",
"claim_ids": ["sha256:...", "sha256:..."],
"candidate_values": [
{"kind": "scalar", "value": "Adjusted EBITDA rule family"},
{"kind": "scalar", "value": "EBITDA exception class"}
],
"recommended_action": "review",
"summary": "two incompatible semantic interpretations remain"
}Frozen field contract:
| Field | Type | Rules |
|---|---|---|
event |
string | exactly escalation.v0 |
bucket_id |
string | same bucket hash used for canonical entries |
subject.kind |
enum | same frozen vocabulary as input |
subject.id |
string | same normalized ID as input |
property_type |
enum | same frozen vocabulary as input |
reason |
enum | conflicted, missing_corroboration, no_resolution_path |
claim_ids |
array | sorted claim IDs in the bucket |
candidate_values |
array | normalized candidate values under review; each entry is either {"kind":"scalar","value":...} or a subject ref object |
recommended_action |
enum | review, scan_more, fix_scanner, fix_policy |
summary |
string | short human-readable one-line explanation |
Escalations are the bounded review queue. If a bucket cannot produce one of the above reasons, the reason model is still underspecified.
Phase 1 also emits one report summarizing:
- bucket counts by state
- top conflicted subjects
- marginal value by source class
- unresolved areas by surface
At minimum the convergence report must contain:
{
"event": "convergence.v0",
"policy_id": "legacy.decode.v0",
"totals": {
"buckets": 0,
"converged": 0,
"converging": 0,
"single_source": 0,
"conflicted": 0,
"escalated": 0
},
"by_property_type": {},
"by_source_kind": {},
"top_escalations": []
}Phase 1 also needs a frozen minimal policy contract:
{
"policy_id": "legacy.decode.v0",
"auto_resolve": ["exists", "schema", "constraint"],
"min_corroboration": {
"reads": 2,
"writes": 2,
"depends_on": 2,
"used_by": 2,
"schedule": 2,
"valid_values": 2,
"semantic_label": 2,
"authoritative_for": 2
},
"source_priority": {
"liveness": ["db_scan", "file_scan", "repo_scan"]
}
}Phase 1 policy should remain declarative and small. If the engine needs property-specific code beyond the comparator registry and liveness fold, the policy surface is too ambitious.
Phase 1 should ship a narrow CLI:
decoding archaeology <CLAIMS>... --policy <FILE> [OPTIONS]
Arguments:
<CLAIMS>... Claim JSONL files
Options:
--policy <FILE> Archaeology decode policy
--output <FILE> Canon entry JSONL output
--escalations <FILE> Escalation JSONL output
--convergence <FILE> Convergence report JSON output
--json JSON status messages
Exit codes:
0no escalations1escalations emitted2refusal / invalid claim set / invalid policy
Do not ship a broader CLI in Phase 1.
The only accepted diagnostic exception is read-only doctor mode:
decoding doctor health [--json]
decoding doctor capabilities [--json]
decoding doctor robot-docs
decoding doctor --robot-triage
Doctor mode is not part of the archaeology domain surface. It must return
before claim loading, policy loading, bucketing, resolution, and artifact
writing. It must not parse catalog records, read derived claim files, write
.doctor/, or offer doctor --fix.
The current code inventory has no decoding-managed home or repo-local config,
state, cache, receipt, log, or lock path. canon_entry.v0 defaults to stdout,
and --output, --escalations, and --convergence write only to explicit
operator-supplied paths.
The canonical CMD+RVL root remains ~/.cmdrvl/ for any future managed path.
Migration and deprecation records belong under:
~/.cmdrvl/migrations/applied.jsonl~/.cmdrvl/notices/deprecated-paths.jsonl
No first-run copy is required for this version because there are no legacy decoding-managed paths in the inventory.
-
Freeze archaeology vocabulary Finalize
subject.kind,property_type, and the normalized claim schema. -
Bucket state machine Insert claims deterministically and drive state transitions.
-
Convergence tracker Count sources, detect compatible vs conflicting claims, and generate convergence summaries.
-
Archaeology policy engine Encode the conservative resolution rules for structural, behavioral, and semantic properties.
-
canon_entry.v0/escalation.v0outputs Emit stable JSONL outputs with explanation payloads. -
Convergence report Show what settled, what conflicted, and where the next scan should focus.
That is enough for Phase 1.
Phase 1 does not need a massive gold-set system to start. It needs strong determinism and fixture coverage.
Required test layers:
- synthetic bucket transition tests
- conflicting claim fixtures
- replay determinism tests
- mixed-source archaeology fixtures
- explanation payload snapshots
If archaeology decode proves valuable on the first real slices, we can promote repeating fixtures into a larger regression harness later.
Decoding is Phase 1 implementation-ready when this checklist is concrete enough to code without reopening the model:
-
Contract module
- input claim parser
- frozen enum for
source.kind - frozen enums for
subject.kindandproperty_type - bucket-id builder
- normalized value helpers
-
Bucket store
- deterministic grouping by the logical bucket key
- claim ordering by canonical claim ID
- source-artifact distinct counting
-
Comparator registry
- one comparator per frozen property type
- compatibility tests for every comparator
- liveness fold logic isolated and explicit
-
Policy loader
- parse
legacy.decode.v0.json - refuse unknown policy keys in Phase 1
- wire
auto_resolve,min_corroboration, andsource_priority
- parse
-
State machine + resolver
- drive bucket states
- choose canonical value or escalation
- emit structured explanation payloads
-
Output writers
canon_entry.v0JSONLescalation.v0JSONLconvergence.v0JSON
-
Determinism and fixture tests
- replay-identical input test
- mixed-source archaeology fixture
- conflicted bucket fixture
- invalid-contract refusal fixture
Phase 1 coding should start only after the comparator registry and minimal policy contract are frozen.
| Component | Source | LOC estimate |
|---|---|---|
| CLI surface | clap derive + custom validation |
~200-400 |
| Claim contract parser / normalizer | Custom | ~500-800 |
| Bucket key / hashing layer | Custom | ~200-400 |
| Bucket store and ordering | Custom | ~400-700 |
| Comparator registry | Custom | ~300-600 |
| Resolver + state machine | Custom | ~500-900 |
| Policy loader / validator | Custom | ~200-400 |
Output writers (canon_entry, escalation, convergence) |
Custom | ~300-600 |
| Fixture harness and snapshots | Custom | ~300-600 |
| Total | ~2.9-5.4K lines of Rust |
This is intentionally small. If the Phase 1 implementation starts pulling in a database, graph runtime, or model workflow substrate, the plan has drifted.
The implementation should converge on a file layout that keeps contract, resolution, and reporting work from colliding constantly.
Recommended module ownership:
| Path | Responsibility |
|---|---|
src/cli.rs |
Clap surface, exit-code mapping, file loading orchestration |
src/contracts/{mod,claim,canon_entry,escalation,convergence,policy}.rs |
Wire contracts, serde schemas, contract validation |
src/normalize.rs |
Canonical JSON, string normalization, sorted-set helpers, hash helpers |
src/bucket.rs |
Logical bucket keys, edge/base bucket construction, bucket grouping |
src/compare.rs |
Property-aware comparator registry |
src/resolve.rs |
State machine and resolution decisions |
src/report.rs |
Convergence summary generation |
tests/contracts/*.rs |
Parse/refusal and schema tests |
tests/fixtures/*.rs |
Mixed-source archaeology fixtures |
tests/snapshots/*.rs |
Explanation and output snapshots |
The exact filenames can vary slightly, but v0 should preserve this separation.
| Need | Crate | Notes |
|---|---|---|
| CLI | clap |
derive-based CLI surface |
| JSON parsing | serde, serde_json |
contracts, policy, outputs |
| Content hashing | sha2 |
claim_id and bucket_id helpers |
| Deterministic map ordering | indexmap or BTreeMap |
preserve stable rendering where needed |
| Snapshot assertions | insta |
explanation/output snapshots |
Avoid pulling in graph databases, workflow engines, or heavy rule frameworks in Phase 1. The resolver is small enough to keep explicit.
Phase 1 should follow the same standards as the other spine primitives:
#![forbid(unsafe_code)]- clap derive CLI
- MIT license
- CI gate of
fmt -> clippy -> test - cross-platform release builds
- deterministic artifact rendering for every output format
Minimum release/CI surface for v0:
- GitHub Actions or equivalent running:
cargo fmt --checkcargo clippy --all-targets -- -D warningscargo test
- One fixture corpus checked into the repo and run on every PR.
- Snapshot tests for structured explanation payloads.
- A release workflow that builds tagged binaries for the supported platforms.
- A smoke test that runs the CLI on fixture claims and verifies stable output artifacts.
Phase 1 does not need perf benchmarking infrastructure, but it does need deterministic release confidence.
Before each release:
- Run the quality gate locally.
- Verify fixture outputs and explanation snapshots intentionally changed, if at all.
- Bump the crate version semver appropriately.
- Ensure
Cargo.lockis current. - Tag and publish only after the fixture corpus passes cleanly on CI.
These are explicitly parked:
- document extraction mode
- mutation emission for target databases
- entity resolution
canon orghot-path integration- Neo4j / data-fabric graph queries
- extraction-mode gold set infrastructure
- broader cascade machinery for financial claim resolution
Phase 1 should not carry these abstractions.
crucible discovers evidence. decoding converges only the subset of that
evidence that is actually a claim-resolution problem.
legacy estate
-> crucible scan
-> metadata catalog / lineage / inventory
-> derived claim.v0 where needed
-> decoding archaeology
-> canon_entry.v0 + escalation.v0 + convergence.v0
For the first Hyperion slices, that is the entire product surface of
decoding.
Decoding Phase 1 is implementation-ready when all of the following are true:
claim.v0is frozen with stable subject and property vocabularies- archaeology CLI scope is fixed
- canonical entry and escalation outputs are fixed
- a conservative default policy exists
- determinism and fixture tests can be written without open design questions
Decoding Phase 1 is functionally successful when one real legacy slice can be
fed from crucible scan into decoding and produce:
- a useful first canonical map
- a bounded human review queue
- clear next-scan guidance
That is enough to start the manual replacement loop.
decoding is credible for Phase 1 when all of the following are true:
- The same input claim set and policy file always produce byte-for-byte stable
canon_entry.v0,escalation.v0, andconvergence.v0outputs. - Malformed or unknown claims fail fast at the refusal boundary instead of leaking into escalation handling.
- Edge properties such as
depends_onandreadsretain independent targets without collapsing into one bucket. - One real archaeology slice produces a useful canonical map plus a bounded review queue.
- The fixture corpus can catch regressions in bucket identity, comparator behavior, and explanation payloads.
The test strategy should be implemented as named suites, not informal good intentions.
- Contract suite: parse valid
claim.v0, reject malformed or unknown contract shapes, validate policy loading and refusal behavior. - Bucket suite: verify base vs edge bucket identity, bucket hashing, claim ordering, and source-artifact distinct counting.
- Comparator suite: one focused corpus per property type, including compatibility and incompatibility cases.
- Resolver suite: state-transition fixtures for
single_source,converging,converged,conflicted, andescalated. - Snapshot suite: stable snapshots for
canon_entry.v0,escalation.v0, andconvergence.v0. - Real-slice suite: one bounded legacy archaeology fixture representative of the first Hyperion-style slice.
Coverage goals for v0:
- every frozen
property_typeexercised by at least one comparator test - every refusal condition exercised by at least one contract test
- every
resolution_kindexercised by at least one resolver fixture - every
escalation.reasonexercised by at least one resolver fixture
- If edge properties are still collapsing under the bucket store, stop and fix bucket identity before adding more vocabulary.
- If malformed claims can reach the resolver, stop and fix the refusal boundary before adding more policy behavior.
- If explanation payloads are unstable across identical reruns, stop and fix normalization before widening the fixture corpus.
- If the first real slice produces an unbounded escalation queue, stop and tighten the vocabulary/policy surface before adding more property types.