feat: add canonical log schema and normalization parser prototype by VarshiniGunti · Pull Request #8 · OpenAgriNet/training_setup_logs

VarshiniGunti · 2026-05-08T13:52:14Z

Summary

This PR adds the first concrete implementation layer for the training_setup_logs project: a canonical log schema and a normalization parser that together convert heterogeneous production logs into a single, training-ready JSONL format.

Instead of attempting a full end-to-end pipeline in one step, this PR intentionally focuses on a high-leverage foundation that downstream steps can reliably build on (PII redaction, SFT/DPO export, trajectory validation, and split governance).

Problem This PR Solves

Production logs vary by source and shape: the event type may appear under role, type, event_type, or kind; session keys differ; tool fields differ. Without normalization, every downstream stage has to handle the same source-specific edge cases again.

This PR establishes one consistent event contract so later pipeline stages can be deterministic and composable.

Scope

1. Canonical schema for log events

Introduced a v1 canonical schema in training_setup_logs/schema.py with explicit event types:

user
assistant
tool_call
tool_result
system
error

Event records include:

core fields: schema_version, timestamp, session_id, event_type
optional fields by event type: content, tool_name, tool_call_id, tool_args, tool_result, error_type, error_message
free-form metadata for future compatibility
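
To make the field list above concrete, here is a minimal sketch of what a v1 event record could look like. It is not the contents of training_setup_logs/schema.py: the names CanonicalEvent, EventType, and to_dict are hypothetical, and the choice to omit unset optional fields on serialization is an assumption.

```python
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Dict, Optional

SCHEMA_VERSION = "v1"


class EventType(str, Enum):
    """The six event types listed in the PR description."""
    USER = "user"
    ASSISTANT = "assistant"
    TOOL_CALL = "tool_call"
    TOOL_RESULT = "tool_result"
    SYSTEM = "system"
    ERROR = "error"


@dataclass
class CanonicalEvent:
    # Core fields, present on every event.
    timestamp: str
    session_id: str
    event_type: EventType
    schema_version: str = SCHEMA_VERSION
    # Optional fields; which ones are set depends on the event type.
    content: Optional[str] = None
    tool_name: Optional[str] = None
    tool_call_id: Optional[str] = None
    tool_args: Optional[Dict[str, Any]] = None
    tool_result: Any = None
    error_type: Optional[str] = None
    error_message: Optional[str] = None
    # Free-form metadata for future compatibility.
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        """Return a JSON-serializable dict, dropping unset optional fields (assumption)."""
        record = asdict(self)
        record["event_type"] = self.event_type.value
        return {key: value for key, value in record.items() if value is not None}
```

A record built this way serializes to one JSONL line with json.dumps(event.to_dict()).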

2. Parser + normalization layer

Implemented training_setup_logs/parser.py to map mixed raw log formats into canonical events:

event type resolved from event_type / type / role / kind, with aliases mapped to canonical types
session id fallback chain: session_id / thread_id / conversation_id
content fallback chain: content / message / text
tool call/result key normalization:
- tool_name or tool
- tool_call_id or call_id
- tool_args or args
- tool_result or result or observation
skip behavior: rows that cannot be mapped safely (e.g., no recognizable type or session id) are dropped rather than guessed
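
The fallback chains above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in training_setup_logs/parser.py: normalize_row, first_present, and the TYPE_ALIASES table are hypothetical, left-to-right precedence within each chain is assumed, and the timestamp is passed through unchanged because the PR text does not say how it is handled.

```python
from typing import Any, Dict, Optional, Sequence

EVENT_TYPES = {"user", "assistant", "tool_call", "tool_result", "system", "error"}

# Hypothetical alias table; the real parser may recognize different labels.
TYPE_ALIASES = {"human": "user", "ai": "assistant", "function_call": "tool_call"}

# Canonical optional field -> raw keys tried in order (from the PR description).
OPTIONAL_KEYS = {
    "content": ("content", "message", "text"),
    "tool_name": ("tool_name", "tool"),
    "tool_call_id": ("tool_call_id", "call_id"),
    "tool_args": ("tool_args", "args"),
    "tool_result": ("tool_result", "result", "observation"),
}


def first_present(raw: Dict[str, Any], keys: Sequence[str]) -> Any:
    """Return the first value that is neither missing nor empty, else None."""
    for key in keys:
        value = raw.get(key)
        if value is not None and value != "":
            return value
    return None


def normalize_row(raw: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Map one raw log row to a canonical event dict, or None to skip it."""
    raw_type = first_present(raw, ("event_type", "type", "role", "kind"))
    session_id = first_present(raw, ("session_id", "thread_id", "conversation_id"))
    if raw_type is None or session_id is None:
        return None  # cannot be mapped safely: skip rather than guess

    label = str(raw_type).strip().lower()
    event_type = TYPE_ALIASES.get(label, label)
    if event_type not in EVENT_TYPES:
        return None  # unrecognized type: skip

    event: Dict[str, Any] = {
        "schema_version": "v1",
        "timestamp": raw.get("timestamp"),  # assumption: passed through as-is
        "session_id": str(session_id),
        "event_type": event_type,
    }
    for canonical_key, candidates in OPTIONAL_KEYS.items():
        value = first_present(raw, candidates)
        if value is not None:
            event[canonical_key] = value
    return event
```

Returning None for unmappable rows keeps the skip decision in one place, so the caller only has to count the skipped rows.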

3. JSONL IO and CLI workflow

Added:

JSONL readers/writers in the parser module
CLI entrypoint training_setup_logs/cli.py for straightforward usage:
- ingest raw JSONL
- emit canonical JSONL
- print parse summary (input rows, output events, skipped rows)
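
The ingest, emit, and summarize flow can be sketched as below. The helper names read_jsonl, write_jsonl, and convert are hypothetical and the actual parser and CLI code may be organized differently; the normalize argument stands in for whatever row-level normalization function the parser exposes.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

Row = Dict[str, Any]


def read_jsonl(path: str) -> Iterator[Row]:
    """Yield one parsed object per non-blank line."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl(path: str, records: Iterable[Row]) -> int:
    """Write records one per line and return how many were written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def convert(
    input_path: str,
    output_path: str,
    normalize: Callable[[Row], Optional[Row]],
) -> Dict[str, int]:
    """Stream raw rows through normalize and return the parse summary."""
    summary = {"input_rows": 0, "output_events": 0, "skipped_rows": 0}

    def events() -> Iterator[Row]:
        for raw in read_jsonl(input_path):
            summary["input_rows"] += 1
            event = normalize(raw)
            if event is None:
                summary["skipped_rows"] += 1
            else:
                yield event

    summary["output_events"] = write_jsonl(output_path, events())
    return summary
```

In this sketch a line that is not valid JSON raises json.JSONDecodeError instead of being counted as skipped; whether the actual CLI tolerates such lines is not stated in the PR.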

4. Tests + sample data

Added:

tests/test_parser.py covering mixed-shape normalization and skip behavior
examples/raw_logs_sample.jsonl as runnable sample input

5. Repo usability improvements

requirements-dev.txt for test dependencies
.gitignore to avoid generated cache/artifact noise
README section documenting parser and test commands

Design Notes

Schema-first approach: downstream components should consume a stable contract, not source-specific logs.
Strict where needed, flexible where practical: unknown/invalid rows are skipped rather than guessed.
Versioned schema (v1) from day one: enables additive evolution and migration in future PRs.
No policy assumptions baked in yet: PII removal, quality filtering, and preference construction are intentionally separate follow-up concerns.
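
One way a downstream stage could rely on the versioned schema is to check schema_version before consuming an event. The sketch below is an assumption about how consumers might use the field, not something this PR implements; require_supported and SUPPORTED_VERSIONS are hypothetical names.

```python
from typing import Any, Dict

# Versions this (hypothetical) consumer knows how to read.
SUPPORTED_VERSIONS = {"v1"}


def require_supported(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return the event unchanged, or raise if its schema version is unknown."""
    version = event.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return event
```

A later schema version could then be handled by adding it to the supported set together with a migration step, without changing how v1 events are read.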

Validation Performed

Executed and verified:

Parser run on sample logs:
- python -m training_setup_logs.cli --input examples/raw_logs_sample.jsonl --output examples/canonical_events.jsonl
Unit tests:
- python -m pytest -q (pass)
Syntax checks:
- python -m py_compile training_setup_logs/schema.py training_setup_logs/parser.py training_setup_logs/cli.py tests/test_parser.py

Files Added/Updated

.gitignore
requirements-dev.txt
examples/raw_logs_sample.jsonl
training_setup_logs/__init__.py
training_setup_logs/schema.py
training_setup_logs/parser.py
training_setup_logs/cli.py
tests/test_parser.py
README.md

Impact

This PR establishes a reliable normalization contract for the project and reduces implementation risk for all downstream stages. With canonical events in place, subsequent work on PII redaction, dataset export, and validation can be implemented as focused extensions rather than source-specific rewrites.

feat: add canonical log schema and normalization parser prototype

7bdf23d

VarshiniGunti wants to merge 1 commit into OpenAgriNet:main from VarshiniGunti:feat-canonical-schema-parser
