A lineage-aware semantic retrieval system with controlled data lanes, chunk symmetry enforcement, and embedding model evaluation.
This project originated from a structural failure: embedding bias caused by asymmetrical data lanes (EPO XML vs Google text).
The incident revealed that semantic retrieval performance cannot be meaningfully evaluated without:
- Controlled ingestion lanes
- Enforced chunk symmetry
- Schema-level validation
- Policy-bound station contracts
- Family-level normalization
This repository documents the redesigned, governance-controlled semantic retrieval experiment.
The system enforces:
- Single canonical semantic source (Google lane)
- Fixed chunk policy (Claim 1 / Claim set / Specification)
- English isolation gate (en95)
- Equal per-family vector rule
- Station-level input/output schema validation
- Embedding layer contract enforcement
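
The rules above can be pictured as a single pre-embedding check. The following is a minimal sketch, not the repository's actual implementation: the `Chunk` record, the chunk-type labels, the lane label `"google"`, and the reading of `en95` as an English-confidence threshold of 0.95 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Hypothetical identifiers: these labels and the threshold are illustrative
# assumptions, not names taken from the repository.
REQUIRED_CHUNKS = ("claim_1", "claim_set", "specification")
CANONICAL_LANE = "google"
EN_THRESHOLD = 0.95  # assumed interpretation of the "en95" gate


@dataclass(frozen=True)
class Chunk:
    family_id: str
    chunk_type: str
    lane: str
    en_score: float  # assumed: detector confidence that the text is English


def check_contract(chunks):
    """Return a list of violations; an empty list means the batch passes."""
    violations = []
    per_family = defaultdict(Counter)
    for c in chunks:
        # Single canonical semantic source.
        if c.lane != CANONICAL_LANE:
            violations.append(f"{c.family_id}: non-canonical lane {c.lane!r}")
        # English isolation gate.
        if c.en_score < EN_THRESHOLD:
            violations.append(f"{c.family_id}: {c.chunk_type} fails English gate")
        per_family[c.family_id][c.chunk_type] += 1
    # Fixed chunk policy and equal per-family vector rule:
    # every family must contribute exactly one chunk of each required type.
    for family_id, counts in sorted(per_family.items()):
        for chunk_type in REQUIRED_CHUNKS:
            if counts[chunk_type] != 1:
                violations.append(
                    f"{family_id}: expected 1 {chunk_type}, found {counts[chunk_type]}"
                )
        for extra in sorted(set(counts) - set(REQUIRED_CHUNKS)):
            violations.append(f"{family_id}: unexpected chunk type {extra!r}")
    return violations
```

Under these assumptions, a batch would be passed to the embedding stage only when `check_contract(batch)` returns an empty list, so every family reaches the embedding model with the same number and kind of vectors.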
Two experimental tracks are implemented:
- Track A — Embedding model fairness evaluation
- Track B — GPT vs Human semantic judgment stability testing
The work covers:
- Detection and isolation of lane bias in semantic pipelines
- Chunk symmetry enforcement to eliminate structural embedding bias
- Lineage-aware validation across ingestion and embedding stages
- Governance-controlled experimental design
- Embedding model comparison under controlled constraints
- GPT reasoning drift analysis under structured evaluation
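
As one way to picture the judgment stability testing of Track B, the sketch below measures how consistently repeated GPT runs label the same item and whether the most frequent label matches the human label. The function name, the input shapes, and the choice of modal-label agreement as the stability measure are assumptions for illustration; the repository's actual metric may differ.

```python
from collections import Counter


def judgment_stability(gpt_runs, human_labels):
    """Summarize repeated GPT judgments per item.

    gpt_runs:     {item_id: [label from each repeated run]} (non-empty lists)
    human_labels: {item_id: human reference label}
    """
    report = {}
    for item_id, labels in gpt_runs.items():
        # Most frequent label across runs and how often it occurred.
        modal_label, modal_count = Counter(labels).most_common(1)[0]
        report[item_id] = {
            "modal_label": modal_label,
            # 1.0 means every run agreed; lower values indicate drift.
            "stability": modal_count / len(labels),
            "matches_human": modal_label == human_labels[item_id],
        }
    return report
```

A stability well below 1.0 for an item would flag it for closer inspection before its GPT judgment is compared against the human one.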
Repository layout:
- docs/: Experimental design and evaluation reports
- policies/: Governance and contract enforcement rules
- scripts/canonical/: Controlled semantic pipeline
- experiments/: Track A / Track B evaluation artifacts (summary only)
Ongoing experimental research.
Future extensions include cross-language embedding evaluation and LLM-enhanced semantic patent database integration.
- Architecture: semantic_station_contract
- Policies: semantic policies
- Track A execution: track_a_execution
- Consolidated evaluation: embedding_comparison_consolidated

