pickhottea/lineage-aware-semantic-retrieval

Lineage-Aware Semantic Retrieval

A lineage-aware semantic retrieval system with controlled data lanes, chunk symmetry enforcement, and embedding model evaluation.


Motivation

This project originated from a structural failure: embedding bias caused by asymmetrical data lanes (EPO XML vs Google text).

The incident revealed that semantic retrieval performance cannot be meaningfully evaluated without:

  • Controlled ingestion lanes
  • Enforced chunk symmetry
  • Schema-level validation
  • Policy-bound station contracts
  • Family-level normalization

This repository documents the redesigned, governance-controlled semantic retrieval experiment.


Architecture Overview

The system enforces:

  • Single canonical semantic source (Google lane)
  • Fixed chunk policy (Claim 1 / Claim set / Specification)
  • English isolation gate (en95)
  • Equal per-family vector rule
  • Station-level input/output schema validation
  • Embedding layer contract enforcement
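A minimal sketch of how a station-level input check for the rules above might look, assuming chunks arrive as simple records. The `Chunk` fields, the lane label `"google"`, the chunk-type names, and the function `validate_station_input` are all hypothetical and are not taken from the repository's scripts. The English isolation gate (en95) is left out because its exact definition is not given here.

```python
from collections import defaultdict
from dataclasses import dataclass

# Chunk types named in the fixed chunk policy; the identifiers are assumptions.
REQUIRED_CHUNK_TYPES = frozenset({"claim_1", "claim_set", "specification"})
CANONICAL_LANE = "google"  # hypothetical label for the single canonical lane


@dataclass(frozen=True)
class Chunk:
    family_id: str   # patent family the chunk belongs to
    lane: str        # ingestion lane that produced the text
    chunk_type: str  # one of REQUIRED_CHUNK_TYPES
    text: str


def validate_station_input(chunks):
    """Return a list of contract violations; an empty list means the input passes."""
    violations = []
    types_by_family = defaultdict(list)

    for chunk in chunks:
        # Single canonical semantic source: reject text from any other lane.
        if chunk.lane != CANONICAL_LANE:
            violations.append(
                f"{chunk.family_id}: non-canonical lane '{chunk.lane}'"
            )
        types_by_family[chunk.family_id].append(chunk.chunk_type)

    for family_id, types in sorted(types_by_family.items()):
        seen = set(types)
        missing = sorted(REQUIRED_CHUNK_TYPES - seen)
        unexpected = sorted(seen - REQUIRED_CHUNK_TYPES)
        if missing:
            violations.append(f"{family_id}: missing chunk types {missing}")
        if unexpected:
            violations.append(f"{family_id}: unexpected chunk types {unexpected}")
        # Equal per-family vector rule: exactly one chunk per required type.
        if len(types) != len(seen):
            violations.append(f"{family_id}: duplicate chunk types")

    return violations
```

Returning a list of violations rather than raising on the first one lets a station report every broken family in a single pass, which is usually more useful when auditing an ingestion run.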

Fig 1 — Governance-Controlled Semantic Pipeline

Fig 2 — Controlled Embedding Evaluation Flow

Two experimental tracks are implemented:

  • Track A — Embedding model fairness evaluation
  • Track B — GPT vs Human semantic judgment stability testing
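As an illustration of what a Track B stability check could measure, the sketch below computes two simple quantities: the agreement rate between human labels and one model run, and the fraction of items whose model label stays identical across repeated runs. The metrics actually used in the experiments are not stated here, so both functions and their names are assumptions.

```python
def agreement_rate(human_labels, model_labels):
    """Fraction of items where the model label equals the human label."""
    if len(human_labels) != len(model_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(h == m for h, m in zip(human_labels, model_labels))
    return matches / len(human_labels)


def run_stability(runs):
    """Fraction of items whose model label is identical across all runs."""
    if not runs or not runs[0]:
        raise ValueError("need at least one non-empty run")
    n_items = len(runs[0])
    if any(len(run) != n_items for run in runs):
        raise ValueError("all runs must label the same number of items")
    stable = sum(
        1 for i in range(n_items) if len({run[i] for run in runs}) == 1
    )
    return stable / n_items
```

For example, with human labels `["rel", "rel", "irr", "rel"]` and two model runs `["rel", "rel", "irr", "irr"]` and `["rel", "irr", "irr", "irr"]`, the first run agrees with the human labels on three of four items, and three of four items receive the same model label in both runs.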

Core Contributions

  • Detection and isolation of lane bias in semantic pipelines
  • Chunk symmetry enforcement to eliminate structural embedding bias
  • Lineage-aware validation across ingestion and embedding stages
  • Governance-controlled experimental design
  • Embedding model comparison under controlled constraints
  • GPT reasoning drift analysis under structured evaluation

Repository Structure

  • docs/ — Experimental design and evaluation reports
  • policies/ — Governance and contract enforcement rules
  • scripts/canonical/ — Controlled semantic pipeline
  • experiments/ — Track A / Track B evaluation artifacts (summary only)

Status

Ongoing experimental research.
Planned extensions include cross-language embedding evaluation and integration with an LLM-enhanced semantic patent database.
