feat: add canonical log schema and normalization parser prototype #8

Open
VarshiniGunti wants to merge 1 commit into OpenAgriNet:main from VarshiniGunti:feat-canonical-schema-parser

@VarshiniGunti

Summary

This PR adds the first concrete implementation layer for the training_setup_logs project: a canonical log schema and a normalization parser that together convert heterogeneous production logs into a single, training-ready JSONL format.

Instead of attempting a full end-to-end pipeline in one step, this PR intentionally focuses on a high-leverage foundation that downstream steps can reliably build on (PII redaction, SFT/DPO export, trajectory validation, and split governance).

Problem This PR Solves

Production logs typically vary by source and shape: the event type may arrive under role, type, event_type, or kind; session keys differ between sources; tool fields use different names. Without normalization, every downstream stage has to handle the same source-specific edge cases again.

This PR establishes one consistent event contract so later pipeline stages can be deterministic and composable.

Scope

1. Canonical schema for log events

Introduced a v1 canonical schema in training_setup_logs/schema.py with explicit event types:

  • user
  • assistant
  • tool_call
  • tool_result
  • system
  • error

Event records include:

  • core fields: schema_version, timestamp, session_id, event_type
  • optional fields by event type: content, tool_name, tool_call_id, tool_args, tool_result, error_type, error_message
  • free-form metadata for future compatibility
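The field list above could be expressed roughly as the dataclass below. This is a minimal sketch, not the contents of training_setup_logs/schema.py: the class name `CanonicalEvent`, the `to_dict` helper, the string timestamp, and the validation in `__post_init__` are all assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional

SCHEMA_VERSION = "v1"
EVENT_TYPES = frozenset(
    {"user", "assistant", "tool_call", "tool_result", "system", "error"}
)


@dataclass
class CanonicalEvent:  # hypothetical name, not taken from the PR
    # Core fields (timestamp assumed to be an ISO 8601 string).
    timestamp: str
    session_id: str
    event_type: str
    # Optional fields; which ones are set depends on event_type.
    content: Optional[str] = None
    tool_name: Optional[str] = None
    tool_call_id: Optional[str] = None
    tool_args: Optional[Dict[str, Any]] = None
    tool_result: Any = None
    error_type: Optional[str] = None
    error_message: Optional[str] = None
    # Free-form metadata for future compatibility.
    metadata: Dict[str, Any] = field(default_factory=dict)
    schema_version: str = SCHEMA_VERSION

    def __post_init__(self) -> None:
        # Assumed behavior: reject event types outside the v1 set.
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event_type: {self.event_type!r}")

    def to_dict(self) -> Dict[str, Any]:
        # Drop unset optional fields so each JSONL row stays compact.
        return {k: v for k, v in asdict(self).items() if v is not None}
```

Dropping `None` values on serialization is one possible choice; emitting every key with an explicit null would make rows uniform at the cost of size.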

2. Parser + normalization layer

Implemented training_setup_logs/parser.py to map mixed raw log formats into canonical events:

  • event type aliases from event_type / type / role / kind
  • session id fallback chain: session_id / thread_id / conversation_id
  • content fallback chain: content / message / text
  • tool call/result key normalization:
    • tool_name or tool
    • tool_call_id or call_id
    • tool_args or args
    • tool_result or result or observation
  • rows that cannot be mapped safely (e.g., no recognizable event type or no session id) are skipped rather than guessed
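The fallback chains above might look something like the following. It is a sketch under stated assumptions rather than the code in training_setup_logs/parser.py: `normalize_row`, `first_present`, and the contents of `EVENT_TYPE_ALIASES` are hypothetical, and copying `timestamp` through unchanged is a guess, since the description does not say how timestamps are resolved.

```python
from typing import Any, Dict, Optional, Sequence

# Hypothetical alias table; the real mapping in the PR may differ.
EVENT_TYPE_ALIASES = {
    "user": "user",
    "human": "user",
    "assistant": "assistant",
    "tool_call": "tool_call",
    "tool_result": "tool_result",
    "system": "system",
    "error": "error",
}

# Canonical key -> raw keys to try, in priority order (from the list above).
OPTIONAL_FIELDS = {
    "content": ("content", "message", "text"),
    "tool_name": ("tool_name", "tool"),
    "tool_call_id": ("tool_call_id", "call_id"),
    "tool_args": ("tool_args", "args"),
    "tool_result": ("tool_result", "result", "observation"),
}


def first_present(row: Dict[str, Any], keys: Sequence[str]) -> Any:
    """Return the first non-empty value found under any of the given keys."""
    for key in keys:
        value = row.get(key)
        if value is not None and value != "":
            return value
    return None


def normalize_row(row: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Map one raw row to a canonical event, or return None to skip it."""
    raw_type = first_present(row, ("event_type", "type", "role", "kind"))
    event_type = None
    if raw_type is not None:
        event_type = EVENT_TYPE_ALIASES.get(str(raw_type).strip().lower())
    session_id = first_present(row, ("session_id", "thread_id", "conversation_id"))
    if event_type is None or session_id is None:
        return None  # skip rather than guess

    event: Dict[str, Any] = {
        "schema_version": "v1",
        "timestamp": row.get("timestamp"),  # assumption: passed through as-is
        "session_id": str(session_id),
        "event_type": event_type,
    }
    for canonical_key, candidates in OPTIONAL_FIELDS.items():
        value = first_present(row, candidates)
        if value is not None:
            event[canonical_key] = value
    return event
```

Returning `None` for unmappable rows keeps the skip decision in one place, so a caller can count skipped rows without catching exceptions.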

3. JSONL IO and CLI workflow

Added:

  • JSONL readers/writers in parser module
  • CLI entrypoint training_setup_logs/cli.py for straightforward usage:
    • ingest raw JSONL
    • emit canonical JSONL
    • print parse summary (input rows, output events, skipped rows)
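The ingest, emit, and summarize flow could be sketched as below. Only the `--input` and `--output` flags come from the commands listed in this PR; the function names `read_jsonl`, `convert`, and `build_arg_parser`, the summary dictionary keys, and the choice to let malformed JSON lines raise are assumptions for illustration.

```python
import argparse
import json
from typing import Any, Callable, Dict, Iterator, Optional

Row = Dict[str, Any]


def read_jsonl(path: str) -> Iterator[Row]:
    """Yield one parsed JSON value per non-blank line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                # Assumption: malformed JSON raises instead of being skipped.
                yield json.loads(line)


def convert(
    input_path: str,
    output_path: str,
    normalize: Callable[[Row], Optional[Row]],
) -> Dict[str, int]:
    """Normalize every row, write canonical JSONL, return a parse summary."""
    summary = {"input_rows": 0, "output_events": 0, "skipped_rows": 0}
    with open(output_path, "w", encoding="utf-8") as out:
        for row in read_jsonl(input_path):
            summary["input_rows"] += 1
            event = normalize(row)
            if event is None:
                summary["skipped_rows"] += 1
                continue
            out.write(json.dumps(event, ensure_ascii=False) + "\n")
            summary["output_events"] += 1
    return summary


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a parser with the two flags used in the commands in this PR."""
    parser = argparse.ArgumentParser(description="Normalize raw JSONL logs.")
    parser.add_argument("--input", required=True, help="path to raw JSONL")
    parser.add_argument("--output", required=True, help="path for canonical JSONL")
    return parser
```

Passing the normalization function in as a parameter keeps the IO loop independent of any particular mapping, which also makes it easy to exercise with a stub in tests.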

4. Tests + sample data

Added:

  • tests/test_parser.py covering mixed-shape normalization and skip behavior
  • examples/raw_logs_sample.jsonl as runnable sample input

5. Repo usability improvements

  • requirements-dev.txt for test dependencies
  • .gitignore to avoid generated cache/artifact noise
  • README section documenting parser and test commands

Design Notes

  • Schema-first approach: downstream components should consume a stable contract, not source-specific logs.
  • Strict on what a row must contain, flexible on how fields are named: key aliases are accepted, but unknown/invalid rows are skipped rather than guessed.
  • Versioned schema (v1) from day one: enables additive evolution and migration in future PRs.
  • No policy assumptions baked in yet: PII removal, quality filtering, and preference construction are intentionally separate follow-up concerns.
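To illustrate why carrying a schema version from the start helps, a downstream consumer could gate on `schema_version` and upgrade older events step by step. Nothing like this exists in the PR; `load_event`, `MIGRATIONS`, and the idea of per-version upgrade functions are purely hypothetical.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]

SUPPORTED_VERSION = "v1"

# Hypothetical registry: maps an older version string to a function that
# upgrades an event by one version. Empty today because only v1 exists.
MIGRATIONS: Dict[str, Callable[[Event], Event]] = {}


def load_event(event: Event) -> Event:
    """Return the event at SUPPORTED_VERSION, migrating it if a path exists."""
    version = event.get("schema_version")
    while version != SUPPORTED_VERSION:
        migrate = MIGRATIONS.get(version)
        if migrate is None:
            raise ValueError(f"unsupported schema_version: {version!r}")
        event = migrate(event)
        version = event.get("schema_version")
    return event
```

With only v1 defined, this reduces to a version check; the registry is where a future migration would plug in.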

Validation Performed

Executed and verified:

  1. Parser run on sample logs:
    • python -m training_setup_logs.cli --input examples/raw_logs_sample.jsonl --output examples/canonical_events.jsonl
  2. Unit tests:
    • python -m pytest -q (all tests pass)
  3. Syntax checks:
    • python -m py_compile training_setup_logs/schema.py training_setup_logs/parser.py training_setup_logs/cli.py tests/test_parser.py

Files Added/Updated

  • .gitignore
  • requirements-dev.txt
  • examples/raw_logs_sample.jsonl
  • training_setup_logs/__init__.py
  • training_setup_logs/schema.py
  • training_setup_logs/parser.py
  • training_setup_logs/cli.py
  • tests/test_parser.py
  • README.md

Impact

This PR establishes a reliable normalization contract for the project and reduces implementation risk for all downstream stages. With canonical events in place, subsequent work on PII redaction, dataset export, and validation can be implemented as focused extensions rather than source-specific rewrites.
