Phase 3: JSONL ingestion module by evansenter · Pull Request #11 · evansenter/agent-session-analytics

evansenter · 2025-12-31T04:37:25Z

Summary

Implements the JSONL log ingestion module for parsing Claude Code session logs:

find_log_files() - Discovers JSONL files in ~/.claude/projects/ within date range
parse_tool_use() - Extracts tool info (command, file_path, skill_name) from tool_use blocks
parse_entry() - Parses JSONL entries into Event objects, handles tool_use/tool_result
ingest_file() - Incremental ingestion with mtime/size tracking (only processes changed files)
ingest_logs() - Full ingestion orchestration
update_session_stats() - Aggregates session statistics from events

Also integrates with server.py to provide real data for the ingest_logs MCP tool.

Test plan

Run make check - all 31 tests pass
Verify tool_use parsing for Bash, Read, Edit, Skill, MCP tools
Verify entry parsing for user, assistant, tool_result types
Verify incremental ingestion (unchanged files are skipped)
Verify force re-ingestion works

Closes #3

🤖 Generated with Claude Code

- pyproject.toml with FastMCP, uvicorn, and dev dependencies - Makefile with check, fmt, lint, test, install, uninstall targets - LaunchAgent plist and install/uninstall scripts for auto-start - dev.sh script for development mode with auto-reload - Basic FastMCP server with placeholder tools: - get_status: Returns server status - ingest_logs: Placeholder for log ingestion - query_tool_frequency: Placeholder for frequency queries - Usage guide as MCP resource at session-analytics://guide - Tests for the placeholder tools - README with installation and usage instructions Server runs on port 8081 (to not conflict with event-bus on 8080). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- storage.py with SQLiteStorage class: - Events table with denormalized fields for fast queries - Sessions table for session metadata - Ingestion state tracking for incremental updates - Patterns table for pre-computed insights - Indexes on timestamp, session_id, tool_name, project_path - Data classes: Event, Session, IngestionState, Pattern - CRUD operations for all entities with batch insert support - get_db_stats() for monitoring database health - Updated server.py to use storage for get_status() - Comprehensive test suite (16 tests) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Implements log file discovery and parsing: - find_log_files(): Discovers JSONL files within date range - parse_tool_use(): Extracts tool info (command, file_path, skill_name) - parse_entry(): Parses entries into Event objects - ingest_file(): Incremental ingestion with mtime/size tracking - ingest_logs(): Full ingestion orchestration - update_session_stats(): Aggregates session statistics Integrates with server.py to provide real data for ingest_logs tool. Closes #3 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

claude · 2025-12-31T04:39:31Z

Code Review - Phase 3: JSONL Ingestion Module

This PR implements the core ingestion functionality for parsing Claude Code session logs. Overall, this is well-structured, production-ready code with good test coverage and attention to detail. Here's my detailed review:

✅ Strengths

1. Excellent Architecture & Design

Clean separation of concerns: parsing, ingestion, and storage are well-separated
Incremental ingestion with mtime/size tracking is smart and efficient
Batch insertion for performance (add_events_batch) is the right approach
INSERT OR IGNORE with UNIQUE(session_id, uuid) prevents duplicate events elegantly

2. Robust Error Handling

Graceful degradation throughout (e.g., ingest.py:51-52 logs warnings but continues)
Try-catch blocks in appropriate places (ingest.py:321-328, 437-439)
Skips malformed entries without crashing (parse_entry returns empty list for invalid data)

3. Comprehensive Test Coverage

31 tests covering all major code paths
Tests for incremental ingestion, force re-ingestion, tool parsing, and edge cases
Good use of fixtures for test setup (storage, sample_logs_dir)

4. Code Quality

Type hints throughout (Python 3.10+ style with str | None)
Clear, descriptive function/variable names
Good docstrings with Args/Returns sections
No TODO/FIXME comments - code is complete

5. Performance Considerations

Proper database indexing (storage.py:188-191)
Batch operations instead of row-by-row inserts
Skips unchanged files to avoid redundant work

🔧 Issues & Recommendations

Critical Issues

None found! 🎉

High Priority Improvements

1. SQL Injection Protection (`ingest.py:362-375`)

While currently safe (no user input in queries), the update_session_stats function uses string interpolation in a subquery. Consider parameterizing or adding a comment explaining why it's safe:

# Line 368-371: Subquery is safe as it references the same table
(SELECT git_branch FROM events e2
 WHERE e2.session_id = events.session_id
 ORDER BY timestamp DESC LIMIT 1) as primary_branch

2. Potential Race Condition (`ingest.py:300-307`)

ingest_file checks ingestion state, then reads/processes the file. If a file is modified between the state check and processing, changes could be missed. Consider:

Using file size + mtime as a composite key
Or calculating a hash for critical files

Current risk: Low (JSONL files are append-only in practice)

3. Memory Usage for Large Files (`ingest.py:316-330`)

All events from a file are held in memory before batch insert. For very large sessions (10k+ events), this could be problematic. Consider:

# Process in chunks of 1000 events
BATCH_SIZE = 1000
for i in range(0, len(events), BATCH_SIZE):
    batch = events[i:i+BATCH_SIZE]
    storage.add_events_batch(batch)

Medium Priority Improvements

4. Error Reporting Could Be More Specific (`ingest.py:323-328`)

except json.JSONDecodeError as e:
    logger.debug(f"JSON parse error in {file_path}:{line_num}: {e}")
    errors += 1
except Exception as e:
    logger.warning(f"Error processing {file_path}:{line_num}: {e}")
    errors += 1

Consider using logger.exception() for the generic Exception case to get stack traces in debug mode.

5. Timezone Handling (`ingest.py:132-134`)

timestamp = datetime.fromisoformat(timestamp_str.replace("Z", "+00:00"))
# Convert to naive datetime (remove timezone for SQLite compatibility)
timestamp = timestamp.replace(tzinfo=None)

This converts UTC to naive datetime without adjusting to local time. This is fine if all times are treated as UTC, but document this assumption clearly. Consider storing timezone info or using Unix timestamps.

6. Missing Index (`storage.py:188-191`)

You have indexes for common queries, but command is also frequently queried (for permission gap detection). Consider:

CREATE INDEX IF NOT EXISTS idx_events_command ON events(command) WHERE command IS NOT NULL

7. Summary Entry Handling (`ingest.py:266-276`)

Summary entries fall back to datetime.now() if timestamp is missing. This could cause ordering issues. Consider rejecting malformed summaries instead:

if not timestamp:
    logger.debug(f"Skipping summary entry without timestamp: {uuid}")
    return []

Low Priority / Nice-to-Have

8. Progress Reporting

For ingesting many files, consider a progress indicator (especially useful for initial days=30 ingestion).

9. Validation of Tool Input JSON (`ingest.py:68`)

json.dumps(tool_use.get("input", {})) assumes input is serializable. Add error handling if it could contain non-JSON-safe types.

10. Magic Number (`ingest.py:35`)

cutoff = datetime.now() - timedelta(days=days) - Consider if this should use file mtime or current time (documented behavior is correct, just verify it matches user expectations).

🔒 Security Review

✅ No Serious Security Issues

No command injection risks (all file paths are from filesystem, not user input)
No SQL injection (all queries use parameterized statements)
File access is restricted to ~/.claude/projects/ (safe)
No arbitrary code execution risks

Minor Concerns

Symlink attacks: find_log_files doesn't check for symlinks. An attacker could symlink malicious files into ~/.claude/projects/. Add:
```
if project_dir.is_symlink():
    logger.warning(f"Skipping symlink: {project_dir}")
    continue
```

📊 Performance Analysis

Benchmarks Needed

Time to ingest 10,000 events (estimate: <1s)
Time to ingest 100 files (estimate: <10s)
Database size after 30 days of logs (estimate: <10MB)

Optimizations Applied ✅

Batch inserts
Incremental ingestion
Indexes on common query patterns
INSERT OR IGNORE instead of SELECT-then-INSERT

🧪 Test Coverage Assessment

Well Tested ✅

Tool parsing (Bash, Read, Edit, Skill, MCP)
Entry parsing (user, assistant, tool_result)
Incremental ingestion
Force re-ingestion
File discovery

Missing Tests (Recommend Adding)

Error handling for corrupted JSONL files
Very large files (10k+ events)
Concurrent ingestion (if supported)
Malformed timestamps
Missing required fields in events

📝 Documentation

✅ Good

Clear docstrings with Args/Returns
README explains installation and usage
CLAUDE.md provides project context

Could Improve

Add inline comments for complex logic (e.g., parse_entry state machine)
Document timezone assumptions
Add example usage in docstrings

🎯 Final Verdict

APPROVE with minor suggestions

This is high-quality, production-ready code. The architecture is sound, error handling is robust, and test coverage is comprehensive. The issues I've identified are minor and mostly preventative (addressing edge cases that may never occur in practice).

Before Merging (Optional)

Add chunked batch processing for large files (High Priority Phase 3: Ingestion #3)
Add symlink check (Security)
Add index on command field (Medium Priority Phase 6: Integration #6)

Can Be Addressed in Future PRs

Progress reporting
Enhanced error logging
Additional test coverage for edge cases

Great work! 🚀

evansenter · 2025-12-31T12:57:23Z

Superseded by #19 which contains all Phase 1-7 work plus RFC #18 refactoring.

evansenter and others added 3 commits December 31, 2025 04:14

evansenter closed this Dec 31, 2025

evansenter deleted the phase-3-ingestion branch December 31, 2025 12:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 3: JSONL ingestion module#11

Phase 3: JSONL ingestion module#11
evansenter wants to merge 3 commits into
mainfrom
phase-3-ingestion

evansenter commented Dec 31, 2025

Uh oh!

claude Bot commented Dec 31, 2025

Uh oh!

evansenter commented Dec 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

evansenter commented Dec 31, 2025

Summary

Test plan

Uh oh!

claude Bot commented Dec 31, 2025

Code Review - Phase 3: JSONL Ingestion Module

✅ Strengths

1. Excellent Architecture & Design

2. Robust Error Handling

3. Comprehensive Test Coverage

4. Code Quality

5. Performance Considerations

🔧 Issues & Recommendations

Critical Issues

High Priority Improvements

1. SQL Injection Protection (ingest.py:362-375)

2. Potential Race Condition (ingest.py:300-307)

3. Memory Usage for Large Files (ingest.py:316-330)

Medium Priority Improvements

4. Error Reporting Could Be More Specific (ingest.py:323-328)

5. Timezone Handling (ingest.py:132-134)

6. Missing Index (storage.py:188-191)

7. Summary Entry Handling (ingest.py:266-276)

Low Priority / Nice-to-Have

8. Progress Reporting

9. Validation of Tool Input JSON (ingest.py:68)

10. Magic Number (ingest.py:35)

🔒 Security Review

✅ No Serious Security Issues

Minor Concerns

📊 Performance Analysis

Benchmarks Needed

Optimizations Applied ✅

🧪 Test Coverage Assessment

Well Tested ✅

Missing Tests (Recommend Adding)

📝 Documentation

✅ Good

Could Improve

🎯 Final Verdict

Before Merging (Optional)

Can Be Addressed in Future PRs

Uh oh!

evansenter commented Dec 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. SQL Injection Protection (`ingest.py:362-375`)

2. Potential Race Condition (`ingest.py:300-307`)

3. Memory Usage for Large Files (`ingest.py:316-330`)

4. Error Reporting Could Be More Specific (`ingest.py:323-328`)

5. Timezone Handling (`ingest.py:132-134`)

6. Missing Index (`storage.py:188-191`)

7. Summary Entry Handling (`ingest.py:266-276`)

9. Validation of Tool Input JSON (`ingest.py:68`)

10. Magic Number (`ingest.py:35`)