
feat: add split leakage and trajectory consistency validators #9

Open
VarshiniGunti wants to merge 2 commits into OpenAgriNet:main from VarshiniGunti:feat-integrity-validators

@VarshiniGunti

Summary

This PR introduces focused data-integrity guardrails for the logs-to-training pipeline by adding two validation layers:

  1. Cross-split leakage detection (train/val/test)
  2. Agent trajectory consistency checks (tool-call/tool-result linkage)

The scope is intentionally narrow and infrastructure-focused so it complements existing broad pipeline PRs without duplicating end-to-end ingestion/export implementations.

What’s Added

1. Split Leakage Validator

  • Added training_setup_logs/integrity.py
  • Normalizes text and computes stable fingerprints to detect near-duplicate overlaps across splits
  • Flags cross-split collisions with split names and sample indices for debugging
  • Useful for preventing train/validation/test contamination and accidental overlap between derived subsets
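
The normalize-then-fingerprint approach described above could look roughly like the sketch below. This is an illustration under stated assumptions, not the contents of integrity.py: the function names (`normalize_text`, `fingerprint`, `find_split_leakage`), the normalization rules, and the choice of SHA-256 are all hypothetical. Exact hashing of normalized text only catches samples that differ in case, punctuation, or whitespace; fuzzier near-duplicate matching would need a different technique.

```python
import hashlib
import re
import unicodedata
from collections import defaultdict
from typing import Dict, List


def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed rules)."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def fingerprint(text: str) -> str:
    """Stable hex digest of the normalized text."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


def find_split_leakage(splits: Dict[str, List[str]]) -> List[dict]:
    """Return one record per fingerprint that appears in more than one split."""
    seen = defaultdict(list)  # fingerprint -> [(split name, sample index), ...]
    for split_name, samples in splits.items():
        for index, sample in enumerate(samples):
            seen[fingerprint(sample)].append((split_name, index))

    collisions = []
    for digest, locations in seen.items():
        # Duplicates inside a single split are not cross-split leakage.
        if len({split for split, _ in locations}) > 1:
            collisions.append({"fingerprint": digest, "locations": locations})
    return collisions


# Made-up sample data: the val prompt matches a train prompt after normalization.
splits = {
    "train": ["How do I treat leaf rust in wheat?", "What is the best sowing date?"],
    "val": ["how do i treat leaf rust in wheat"],
    "test": ["Which fertilizer suits sandy soil?"],
}
collisions = find_split_leakage(splits)
```

Each collision record carries the split names and sample indices, which matches the debugging output described in the bullets above.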

2. Trajectory Consistency Validator

  • Added training_setup_logs/trajectory_validator.py
  • Validates tool-use sequence integrity:
    • tool_call must include tool_call_id
    • tool_result must include tool_call_id
    • each tool_result must map to a prior tool_call
  • Produces structured issues that can be used to exclude or quarantine invalid trajectories
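
The three rules above can be sketched as a single pass over a trajectory. This is a minimal illustration, not the code in trajectory_validator.py: the event schema (a `type` field holding `tool_call` or `tool_result`), the function name `validate_trajectory`, and the issue codes are assumptions made for the example.

```python
from typing import Any, Dict, List


def validate_trajectory(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a list of structured issues; an empty list means the trajectory is consistent."""
    issues = []
    seen_call_ids = set()  # IDs of tool calls encountered so far
    for index, event in enumerate(events):
        event_type = event.get("type")
        if event_type not in ("tool_call", "tool_result"):
            continue  # other event types are ignored in this sketch
        call_id = event.get("tool_call_id")
        if not call_id:
            # Rules 1 and 2: both event types must carry a tool_call_id.
            issues.append({"index": index, "code": f"{event_type}_missing_id"})
            continue
        if event_type == "tool_call":
            seen_call_ids.add(call_id)
        elif call_id not in seen_call_ids:
            # Rule 3: a result must refer to a call that appeared earlier.
            issues.append(
                {"index": index, "code": "orphan_tool_result", "tool_call_id": call_id}
            )
    return issues


valid = [
    {"type": "tool_call", "tool_call_id": "c1"},
    {"type": "tool_result", "tool_call_id": "c1"},
]
invalid = [
    {"type": "tool_result", "tool_call_id": "c2"},  # no prior call with this ID
    {"type": "tool_call"},  # missing tool_call_id
]
```

A caller could quarantine any trajectory for which the returned list is non-empty.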

3. Validator CLI

  • Added training_setup_logs/validate_cli.py
  • New commands:
    • validate-splits
    • validate-trajectories
  • Writes machine-readable JSON reports for CI/review workflows
  • Returns non-zero exit code when violations are found, enabling straightforward gating
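
For readers unfamiliar with the gating pattern, the sketch below shows one way a subcommand can write a JSON report and signal failure through its exit code. It is not the actual validate_cli.py: the `--report` flag, the report layout, and the placeholder check in `check_trajectories` are hypothetical, and only the `validate-trajectories` command name comes from this PR.

```python
import argparse
import json
from pathlib import Path
from typing import List, Optional


def load_jsonl(path: str) -> List[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def check_trajectories(paths: List[str]) -> List[dict]:
    """Placeholder check: flag tool events that lack a tool_call_id."""
    issues = []
    for path in paths:
        for index, event in enumerate(load_jsonl(path)):
            is_tool_event = event.get("type") in ("tool_call", "tool_result")
            if is_tool_event and not event.get("tool_call_id"):
                issues.append({"file": path, "index": index, "code": "missing_tool_call_id"})
    return issues


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(prog="validate_cli")
    subparsers = parser.add_subparsers(dest="command", required=True)
    traj = subparsers.add_parser("validate-trajectories")
    traj.add_argument("files", nargs="+")
    traj.add_argument("--report", default="report.json")  # hypothetical flag
    args = parser.parse_args(argv)

    issues = check_trajectories(args.files)
    report = {"command": args.command, "ok": not issues, "issues": issues}
    Path(args.report).write_text(json.dumps(report, indent=2), encoding="utf-8")
    # A non-zero return value lets CI fail the job when violations exist.
    return 1 if issues else 0


# A script entry point would pass the result to the shell: sys.exit(main())
```

A `validate-splits` subparser could be registered the same way, and a CI step then only needs to run the command and rely on the exit status.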

4. Tests and Fixtures

  • Added tests:
    • tests/test_integrity.py
    • tests/test_trajectory_validator.py
  • Added example fixtures:
    • examples/train_sample.jsonl
    • examples/val_sample.jsonl
    • examples/test_sample.jsonl
    • examples/trajectory_valid.jsonl
    • examples/trajectory_invalid.jsonl

5. Documentation

  • Updated README.md with usage commands and the validator workflow introduced in this PR

Why This Matters

This PR addresses two critical failure modes in training data preparation:

  • Leakage risk: near-identical prompts across splits can inflate offline metrics and hide poor generalization.
  • Broken trajectories: inconsistent tool traces degrade behavior-cloning quality and can introduce invalid supervision.

By adding these checks early, downstream SFT/DPO preparation can rely on cleaner, policy-aligned data.

Validation Performed

  • python -m py_compile for new modules/tests
  • python -m pytest -q (all tests passed)
  • validate-splits CLI run on sample split files (leakage intentionally detected)
  • validate-trajectories CLI run on valid and invalid trajectory fixtures (expected pass/fail behavior)

Files Included

  • training_setup_logs/integrity.py
  • training_setup_logs/trajectory_validator.py
  • training_setup_logs/validate_cli.py
  • tests/test_integrity.py
  • tests/test_trajectory_validator.py
  • examples/train_sample.jsonl
  • examples/val_sample.jsonl
  • examples/test_sample.jsonl
  • examples/trajectory_valid.jsonl
  • examples/trajectory_invalid.jsonl
  • README.md
