
feat: add multilingual semantic deduplication and trajectory intelligence layer #15

Open: Abhishek-Kumar-Rai5 wants to merge 1 commit into OpenAgriNet:main from Abhishek-Kumar-Rai5:feat-trajectory-intelligence

Abhishek-Kumar-Rai5 commented May 16, 2026

What this PR does

This PR adds a Dataset Quality Intelligence Layer focused on:

  • multilingual semantic deduplication,
  • transliteration-aware leakage detection,
  • trajectory failure categorization,
  • recovery-pattern mining,
  • hard-example preservation,
  • metadata enrichment for downstream SFT/DPO pipelines.

The implementation is designed for trajectory-aware agentic training pipelines operating on multilingual production logs.


Why this matters

Current pipeline work already covers:

  • schema normalization,
  • export formatting,
  • validation,
  • PII handling.

However, multilingual semantic overlap and trajectory-quality intelligence remain relatively underexplored.

This PR focuses on:

  • preserving difficult trajectories instead of discarding them,
  • improving multilingual split integrity,
  • detecting trajectory failures and recovery behavior,
  • enriching trajectories with metadata useful for downstream training and evaluation.

This is especially relevant for:

  • Hindi + English code-switching,
  • transliterated Hindi,
  • multilingual semantic leakage,
  • recovery-heavy agent trajectories.

Added Modules

Multilingual Semantic Deduplication

multilingual/

  • indic_normalizer.py
  • transliteration.py
  • semantic_dedup.py
  • leakage_detector.py
  • multilingual_metrics.py

Features

  • Unicode + Indic normalization
  • transliteration-aware canonicalization
  • multilingual semantic clustering
  • train/eval leakage detection
  • semantic overlap reporting
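
To make the normalization and leakage-detection features concrete, here is a minimal sketch. It is not the code in `indic_normalizer.py` or `leakage_detector.py`: the function names, the character-trigram Jaccard similarity, and the 0.6 threshold are all assumptions chosen for illustration, and it covers only same-script near-duplicates (cross-script matching would need a transliteration or embedding step that is not shown).

```python
from __future__ import annotations

import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize, casefold, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.split())


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Character n-grams of the text, padded with one space on each side."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_leaks(train: list[str], eval_: list[str], threshold: float = 0.6):
    """Return (eval_text, train_text, score) for pairs at or above threshold.

    Hypothetical helper: the threshold is illustrative, not tuned.
    """
    train_grams = [(t, char_ngrams(normalize(t))) for t in train]
    leaks = []
    for e in eval_:
        e_grams = char_ngrams(normalize(e))
        for t, t_grams in train_grams:
            score = jaccard(e_grams, t_grams)
            if score >= threshold:
                leaks.append((e, t, score))
    return leaks
```

With this sketch, an eval query "kal mosam kaisa rahega" is flagged against a train query "kal mausam kaisa rahega" (the two differ by one spelling variant), while "weather tomorrow" is not.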

Trajectory Intelligence Layer

trajectory/

  • models.py
  • analyzer.py
  • failure_classifier.py
  • recovery_patterns.py
  • difficulty.py
  • hard_example_miner.py
  • metadata_enrichment.py

Features

  • trajectory failure categorization
  • recovery-pattern mining
  • tool-use efficiency analysis
  • difficulty classification
  • hard-example tagging
  • metadata enrichment
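
As a rough illustration of recovery-pattern mining and tool-use efficiency, the sketch below works on a simplified step record. The `Step` schema, the function names, and the definition of a recovery (a failed tool call followed later by a successful call to the same tool) are assumptions for this example and may differ from what `recovery_patterns.py` and `analyzer.py` actually do.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical, simplified view of one tool call in a trajectory."""
    tool: str
    ok: bool


def find_recoveries(steps: list[Step]) -> list[tuple[int, int]]:
    """Return (failure_index, success_index) pairs per tool.

    A recovery here means the earliest unrecovered failure of a tool
    is followed by a later successful call to the same tool.
    """
    pending: dict[str, int] = {}  # tool -> index of earliest open failure
    recoveries = []
    for i, step in enumerate(steps):
        if not step.ok:
            pending.setdefault(step.tool, i)
        elif step.tool in pending:
            recoveries.append((pending.pop(step.tool), i))
    return recoveries


def tool_efficiency(steps: list[Step]) -> float:
    """Fraction of tool calls that succeeded (1.0 for an empty trajectory)."""
    if not steps:
        return 1.0
    return sum(step.ok for step in steps) / len(steps)
```

A trajectory with two failed weather calls, one successful weather call, and one successful call to another tool would yield a single recovery pair and an efficiency of 0.5 under these definitions.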

Key Design Principle

Hard examples are never discarded automatically.

Instead, difficult trajectories are:

  • tagged,
  • categorized,
  • and routed appropriately
    (for example, as repair candidates, DPO-negative candidates, or evaluation-worthy trajectories).
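
The routing idea can be sketched as follows. The route names follow the categories listed above, but the `TrajectorySummary` fields and the decision rules are hypothetical and only meant to show that no branch drops a trajectory.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    KEEP = "keep"
    REPAIR_CANDIDATE = "repair_candidate"
    DPO_NEGATIVE = "dpo_negative_candidate"
    EVAL_WORTHY = "evaluation_worthy"


@dataclass
class TrajectorySummary:
    """Hypothetical summary fields; the real models may differ."""
    trajectory_id: str
    failed_steps: int
    final_success: bool


def route(t: TrajectorySummary) -> Route:
    """Assign a route to every trajectory; nothing is discarded."""
    if t.failed_steps == 0:
        return Route.KEEP
    if t.final_success:
        # Failed along the way but finished successfully: a recovery case.
        return Route.EVAL_WORTHY
    if t.failed_steps == 1:
        # A single unrecovered failure may be fixable (assumed rule).
        return Route.REPAIR_CANDIDATE
    return Route.DPO_NEGATIVE
```

Because every branch returns a route, a hard trajectory always ends up tagged for some downstream use rather than being filtered out.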

Example Capabilities

Multilingual Semantic Deduplication

queries = [
    "कल मौसम कैसा रहेगा",
    "kal mausam kaisa rahega",
    "kal mosam kaisa rahega",
    "weather tomorrow",
]
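
One way to see how romanized spelling variants such as "mausam" and "mosam" can collapse to the same canonical form is a vowel-stripping key. This is a deliberately crude heuristic written for this description, not the logic in `transliteration.py` or `semantic_dedup.py`, and `skeleton_key` and `group_by_key` are hypothetical names. It does not link the Devanagari or English queries to the romanized ones; that would need a transliteration or multilingual embedding step.

```python
from __future__ import annotations

import unicodedata

VOWELS = set("aeiou")


def skeleton_key(text: str) -> str:
    """Casefolded text with Latin vowels removed from each word."""
    text = unicodedata.normalize("NFKC", text).casefold()
    words = []
    for word in text.split():
        stripped = "".join(ch for ch in word if ch not in VOWELS)
        words.append(stripped or word)  # keep all-vowel words unchanged
    return " ".join(words)


def group_by_key(queries: list[str]) -> list[list[str]]:
    """Group queries that share a skeleton key, in first-seen order."""
    groups: dict[str, list[str]] = {}
    for q in queries:
        groups.setdefault(skeleton_key(q), []).append(q)
    return list(groups.values())


queries = [
    "कल मौसम कैसा रहेगा",
    "kal mausam kaisa rahega",
    "kal mosam kaisa rahega",
    "weather tomorrow",
]

for group in group_by_key(queries):
    print(group)
```

Here the two romanized Hindi queries land in one group, and the Devanagari and English queries each form their own group. Vowel stripping can over-merge unrelated words, so a real pipeline would likely confirm candidates with a similarity check before treating them as duplicates.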

Screenshots of all tests passing:

<img width="935" height="746" alt="image" src="https://github.com/user-attachments/assets/3099bfc0-19ab-4b55-bfb3-1f84dd9d6ed7" />
<img width="930" height="743" alt="image" src="https://github.com/user-attachments/assets/8991104c-14cb-4656-bedc-76acff033560" />
<img width="942" height="751" alt="image" src="https://github.com/user-attachments/assets/46a50d12-dbb6-4206-b385-6f3331e70879" />
<img width="935" height="160" alt="image" src="https://github.com/user-attachments/assets/22503f11-daa4-4779-94fc-1e21e52fe65c" />
