feat: add canonical log schema and normalization parser prototype#8
Open
VarshiniGunti wants to merge 1 commit into
Open
feat: add canonical log schema and normalization parser prototype#8VarshiniGunti wants to merge 1 commit into
VarshiniGunti wants to merge 1 commit into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds the first concrete implementation layer for the
training_setup_logsproject: a canonical log schema + normalization parser that converts heterogeneous production logs into a single, training-ready JSONL format.Instead of attempting a full end-to-end pipeline in one step, this PR intentionally focuses on a high-leverage foundation that downstream steps can reliably build on (PII redaction, SFT/DPO export, trajectory validation, and split governance).
Problem This PR Solves
Production logs in real systems typically vary by source and shape (
role,type,event_type,kind; different session keys; different tool fields). Without normalization, every downstream stage has to handle source-specific edge cases repeatedly.This PR establishes one consistent event contract so later pipeline stages can be deterministic and composable.
Scope
1. Canonical schema for log events
Introduced a
v1canonical schema intraining_setup_logs/schema.pywith explicit event types:userassistanttool_calltool_resultsystemerrorEvent records include:
schema_version,timestamp,session_id,event_typecontent,tool_name,tool_call_id,tool_args,tool_result,error_type,error_messagemetadatafor future compatibility2. Parser + normalization layer
Implemented
training_setup_logs/parser.pyto map mixed raw log formats into canonical events:event_type/type/role/kindsession_id/thread_id/conversation_idcontent/message/texttool_nameortooltool_call_idorcall_idtool_argsorargstool_resultorresultorobservation3. JSONL IO and CLI workflow
Added:
training_setup_logs/cli.pyfor straightforward usage:4. Tests + sample data
Added:
tests/test_parser.pycovering mixed-shape normalization and skip behaviorexamples/raw_logs_sample.jsonlas runnable sample input5. Repo usability improvements
requirements-dev.txtfor test dependencies.gitignoreto avoid generated cache/artifact noiseDesign Notes
v1) from day one: enables additive evolution and migration in future PRs.Validation Performed
Executed and verified:
python -m training_setup_logs.cli --input examples/raw_logs_sample.jsonl --output examples/canonical_events.jsonlpython -m pytest -q(pass)python -m py_compile training_setup_logs/schema.py training_setup_logs/parser.py training_setup_logs/cli.py tests/test_parser.pyFiles Added/Updated
.gitignorerequirements-dev.txtexamples/raw_logs_sample.jsonltraining_setup_logs/__init__.pytraining_setup_logs/schema.pytraining_setup_logs/parser.pytraining_setup_logs/cli.pytests/test_parser.pyREADME.mdImpact
This PR establishes a reliable normalization contract for the project and reduces implementation risk for all downstream stages. With canonical events in place, subsequent work on PII redaction, dataset export, and validation can be implemented as focused extensions rather than source-specific rewrites.