feat: add multilingual semantic deduplication and trajectory intelligence layer by Abhishek-Kumar-Rai5 · Pull Request #15 · OpenAgriNet/training_setup_logs

Abhishek-Kumar-Rai5 · 2026-05-16T18:08:50Z

What this PR does

This PR adds a Dataset Quality Intelligence Layer focused on:

multilingual semantic deduplication,
transliteration-aware leakage detection,
trajectory failure categorization,
recovery-pattern mining,
hard-example preservation,
metadata enrichment for downstream SFT/DPO pipelines.

The implementation is designed for trajectory-aware agentic training pipelines operating on multilingual production logs.

Why this matters

Current pipeline work already covers:

schema normalization,
export formatting,
validation,
PII handling.

However, multilingual semantic overlap and trajectory-quality analysis are not yet covered in comparable depth.

This PR focuses on:

preserving difficult trajectories instead of discarding them,
improving multilingual split integrity,
detecting trajectory failures and recovery behavior,
enriching trajectories with metadata useful for downstream training and evaluation.

This is especially relevant for:

Hindi + English code-switching,
transliterated Hindi,
multilingual semantic leakage,
recovery-heavy agent trajectories.

Added Modules

Multilingual Semantic Deduplication

`multilingual/`

indic_normalizer.py
transliteration.py
semantic_dedup.py
leakage_detector.py
multilingual_metrics.py

Features

Unicode + Indic normalization
transliteration-aware canonicalization
multilingual semantic clustering
train/eval leakage detection
semantic overlap reporting
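As a rough illustration of how the normalization and leakage-detection features could fit together, here is a minimal sketch using only the standard library. The names `normalize_text` and `find_exact_leakage` are hypothetical and are not taken from `indic_normalizer.py` or `leakage_detector.py`. Stripping zero-width characters is an assumption that may be too aggressive for scripts where ZWJ/ZWNJ change rendering.

```python
import unicodedata

# Hypothetical helpers for illustration; not the PR's actual API.
# Zero-width characters that often survive copy/paste from chat logs.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}


def normalize_text(text: str) -> str:
    """NFC-normalize, drop zero-width characters, casefold, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return " ".join(text.casefold().split())


def find_exact_leakage(train_texts, eval_texts):
    """Return eval items whose normalized form also appears in the train split."""
    train_keys = {normalize_text(t) for t in train_texts}
    return [e for e in eval_texts if normalize_text(e) in train_keys]


train = ["Kal mausam  kaisa rahega"]
held_out = ["kal mausam kaisa rahega\u200b", "weather tomorrow"]
leaked = find_exact_leakage(train, held_out)
```

This only catches matches that become identical after normalization. Catching paraphrases or cross-script duplicates would need the transliteration and semantic clustering steps listed above.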

Trajectory Intelligence Layer

`trajectory/`

models.py
analyzer.py
failure_classifier.py
recovery_patterns.py
difficulty.py
hard_example_miner.py
metadata_enrichment.py

Features

trajectory failure categorization
recovery-pattern mining
tool-use efficiency analysis
difficulty classification
hard-example tagging
metadata enrichment
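To make failure categorization and recovery-pattern mining concrete, the sketch below assumes a trajectory is a list of tool-call steps with a success flag and an optional error message. The `Step` type, the category names, and the keyword rules are all hypothetical; the PR's `failure_classifier.py` and `recovery_patterns.py` may work quite differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    """One tool call in a trajectory (hypothetical schema)."""
    tool: str
    ok: bool
    error: Optional[str] = None


def classify_failure(step: Step) -> Optional[str]:
    """Map a failed step to a coarse category; return None for successful steps."""
    if step.ok:
        return None
    msg = (step.error or "").lower()
    if "timeout" in msg:
        return "timeout"
    if "invalid" in msg or "schema" in msg:
        return "bad_arguments"
    return "unknown"


def find_recoveries(steps: List[Step]) -> List[Tuple[int, int]]:
    """Pair each failed step with the next successful call to the same tool."""
    pairs = []
    for i, step in enumerate(steps):
        if step.ok:
            continue
        for j in range(i + 1, len(steps)):
            if steps[j].tool == step.tool and steps[j].ok:
                pairs.append((i, j))
                break
    return pairs


steps = [
    Step("weather", False, "Timeout after 5s"),
    Step("weather", True),
    Step("geo", False, "invalid schema"),
]
categories = [classify_failure(s) for s in steps]
recoveries = find_recoveries(steps)
```

Here a "recovery" is simply a retry of the same tool that succeeds. A fuller definition might also count switching to a different tool as recovery.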

Key Design Principle

Hard examples are never discarded automatically.

Instead, difficult trajectories are:

tagged,
categorized,
and routed appropriately (repair candidates, DPO-negative candidates, evaluation-worthy trajectories, etc.).
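The tag-and-route principle can be sketched as a function that only ever adds tags and never filters. The field names (`failed`, `recovered`, `difficulty`) and the routing rules below are assumptions chosen for illustration, not the PR's actual schema or policy.

```python
from typing import Dict, List


def route(traj: Dict) -> Dict:
    """Return a copy of the trajectory with routing tags attached (hypothetical rules)."""
    tags: List[str] = []
    if traj.get("failed") and traj.get("recovered"):
        # Failed, then recovered: interesting behavior worth evaluating on.
        tags.append("eval_worthy")
    elif traj.get("failed"):
        # Failed without recovery: could be repaired or used as a DPO negative.
        tags += ["repair_candidate", "dpo_negative_candidate"]
    if traj.get("difficulty") == "hard":
        tags.append("hard_example")
    return {**traj, "tags": tags}


def route_all(trajectories: List[Dict]) -> List[Dict]:
    """Tag every trajectory; the output always has the same length as the input."""
    return [route(t) for t in trajectories]


batch = [
    {"id": 1, "failed": True, "recovered": True, "difficulty": "hard"},
    {"id": 2, "failed": True, "recovered": False, "difficulty": "easy"},
    {"id": 3, "failed": False, "difficulty": "easy"},
]
routed = route_all(batch)
```

Because `route_all` maps rather than filters, downstream consumers decide what to do with each tag, and no trajectory is silently dropped.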

Example Capabilities

Multilingual Semantic Deduplication

# The same question in Devanagari, two romanized spellings, and English.
queries = [
    "कल मौसम कैसा रहेगा",
    "kal mausam kaisa rahega",
    "kal mosam kaisa rahega",
    "weather tomorrow",
]
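The example in the description stops at the query list. As a hedged sketch of what grouping such queries might look like, the code below clusters them greedily by character-level similarity using the standard library's `difflib`. This is a stand-in technique, not the PR's `semantic_dedup.py`: the function name `cluster_near_duplicates` and the `0.85` threshold are assumptions, and real semantic deduplication would likely rely on transliteration plus multilingual embeddings.

```python
from difflib import SequenceMatcher


def cluster_near_duplicates(texts, threshold=0.85):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    clusters = []
    for text in texts:
        for cluster in clusters:
            # Compare against the cluster's first member as its representative.
            if SequenceMatcher(None, cluster[0], text).ratio() >= threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters


queries = [
    "कल मौसम कैसा रहेगा",
    "kal mausam kaisa rahega",
    "kal mosam kaisa rahega",
    "weather tomorrow",
]
clusters = cluster_near_duplicates(queries)
```

With this threshold the two romanized spellings fall into one cluster, while the Devanagari and English forms stay separate. That remaining gap is what transliteration-aware canonicalization and multilingual semantic clustering are meant to close.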

Screenshots of all tests passing

<img width="935" height="746" alt="image" src="https://github.com/user-attachments/assets/3099bfc0-19ab-4b55-bfb3-1f84dd9d6ed7" />
<img width="930" height="743" alt="image" src="https://github.com/user-attachments/assets/8991104c-14cb-4656-bedc-76acff033560" />
<img width="942" height="751" alt="image" src="https://github.com/user-attachments/assets/46a50d12-dbb6-4206-b385-6f3331e70879" />
<img width="935" height="160" alt="image" src="https://github.com/user-attachments/assets/22503f11-daa4-4779-94fc-1e21e52fe65c" />

feat-trajectory

7948b73

Abhishek-Kumar-Rai5 wants to merge 1 commit into OpenAgriNet:main from Abhishek-Kumar-Rai5:feat-trajectory-intelligence
