Skip to content

Phase 3: JSONL ingestion module#11

Closed
evansenter wants to merge 3 commits into
mainfrom
phase-3-ingestion
Closed

Phase 3: JSONL ingestion module#11
evansenter wants to merge 3 commits into
mainfrom
phase-3-ingestion

Conversation

@evansenter

Copy link
Copy Markdown
Owner

Summary

Implements the JSONL log ingestion module for parsing Claude Code session logs:

  • find_log_files() - Discovers JSONL files in ~/.claude/projects/ within date range
  • parse_tool_use() - Extracts tool info (command, file_path, skill_name) from tool_use blocks
  • parse_entry() - Parses JSONL entries into Event objects, handles tool_use/tool_result
  • ingest_file() - Incremental ingestion with mtime/size tracking (only processes changed files)
  • ingest_logs() - Full ingestion orchestration
  • update_session_stats() - Aggregates session statistics from events

Also integrates with server.py to provide real data for the ingest_logs MCP tool.

Test plan

  • Run make check - all 31 tests pass
  • Verify tool_use parsing for Bash, Read, Edit, Skill, MCP tools
  • Verify entry parsing for user, assistant, tool_result types
  • Verify incremental ingestion (unchanged files are skipped)
  • Verify force re-ingestion works

Closes #3

🤖 Generated with Claude Code

evansenter and others added 3 commits December 31, 2025 04:14
- pyproject.toml with FastMCP, uvicorn, and dev dependencies
- Makefile with check, fmt, lint, test, install, uninstall targets
- LaunchAgent plist and install/uninstall scripts for auto-start
- dev.sh script for development mode with auto-reload
- Basic FastMCP server with placeholder tools:
  - get_status: Returns server status
  - ingest_logs: Placeholder for log ingestion
  - query_tool_frequency: Placeholder for frequency queries
- Usage guide as MCP resource at session-analytics://guide
- Tests for the placeholder tools
- README with installation and usage instructions

Server runs on port 8081 (to not conflict with event-bus on 8080).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- storage.py with SQLiteStorage class:
  - Events table with denormalized fields for fast queries
  - Sessions table for session metadata
  - Ingestion state tracking for incremental updates
  - Patterns table for pre-computed insights
  - Indexes on timestamp, session_id, tool_name, project_path
- Data classes: Event, Session, IngestionState, Pattern
- CRUD operations for all entities with batch insert support
- get_db_stats() for monitoring database health
- Updated server.py to use storage for get_status()
- Comprehensive test suite (16 tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements log file discovery and parsing:
- find_log_files(): Discovers JSONL files within date range
- parse_tool_use(): Extracts tool info (command, file_path, skill_name)
- parse_entry(): Parses entries into Event objects
- ingest_file(): Incremental ingestion with mtime/size tracking
- ingest_logs(): Full ingestion orchestration
- update_session_stats(): Aggregates session statistics

Integrates with server.py to provide real data for ingest_logs tool.

Closes #3

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@claude

claude Bot commented Dec 31, 2025

Copy link
Copy Markdown

Code Review - Phase 3: JSONL Ingestion Module

This PR implements the core ingestion functionality for parsing Claude Code session logs. Overall, this is well-structured, production-ready code with good test coverage and attention to detail. Here's my detailed review:


✅ Strengths

1. Excellent Architecture & Design

  • Clean separation of concerns: parsing, ingestion, and storage are well-separated
  • Incremental ingestion with mtime/size tracking is smart and efficient
  • Batch insertion for performance (add_events_batch) is the right approach
  • INSERT OR IGNORE with UNIQUE(session_id, uuid) prevents duplicate events elegantly

2. Robust Error Handling

  • Graceful degradation throughout (e.g., ingest.py:51-52 logs warnings but continues)
  • Try-catch blocks in appropriate places (ingest.py:321-328, 437-439)
  • Skips malformed entries without crashing (parse_entry returns empty list for invalid data)

3. Comprehensive Test Coverage

  • 31 tests covering all major code paths
  • Tests for incremental ingestion, force re-ingestion, tool parsing, and edge cases
  • Good use of fixtures for test setup (storage, sample_logs_dir)

4. Code Quality

  • Type hints throughout (Python 3.10+ style with str | None)
  • Clear, descriptive function/variable names
  • Good docstrings with Args/Returns sections
  • No TODO/FIXME comments - code is complete

5. Performance Considerations

  • Proper database indexing (storage.py:188-191)
  • Batch operations instead of row-by-row inserts
  • Skips unchanged files to avoid redundant work

🔧 Issues & Recommendations

Critical Issues

None found! 🎉

High Priority Improvements

1. SQL Injection Protection (ingest.py:362-375)

While currently safe (no user input in queries), the update_session_stats function uses string interpolation in a subquery. Consider parameterizing or adding a comment explaining why it's safe:

# Line 368-371: Subquery is safe as it references the same table
(SELECT git_branch FROM events e2
 WHERE e2.session_id = events.session_id
 ORDER BY timestamp DESC LIMIT 1) as primary_branch

2. Potential Race Condition (ingest.py:300-307)

ingest_file checks ingestion state, then reads/processes the file. If a file is modified between the state check and processing, changes could be missed. Consider:

  • Using file size + mtime as a composite key
  • Or calculating a hash for critical files

Current risk: Low (JSONL files are append-only in practice)

3. Memory Usage for Large Files (ingest.py:316-330)

All events from a file are held in memory before batch insert. For very large sessions (10k+ events), this could be problematic. Consider:

# Process in chunks of 1000 events
BATCH_SIZE = 1000
for i in range(0, len(events), BATCH_SIZE):
    batch = events[i:i+BATCH_SIZE]
    storage.add_events_batch(batch)

Medium Priority Improvements

4. Error Reporting Could Be More Specific (ingest.py:323-328)

except json.JSONDecodeError as e:
    logger.debug(f"JSON parse error in {file_path}:{line_num}: {e}")
    errors += 1
except Exception as e:
    logger.warning(f"Error processing {file_path}:{line_num}: {e}")
    errors += 1

Consider using logger.exception() for the generic Exception case to get stack traces in debug mode.

5. Timezone Handling (ingest.py:132-134)

timestamp = datetime.fromisoformat(timestamp_str.replace("Z", "+00:00"))
# Convert to naive datetime (remove timezone for SQLite compatibility)
timestamp = timestamp.replace(tzinfo=None)

This converts UTC to naive datetime without adjusting to local time. This is fine if all times are treated as UTC, but document this assumption clearly. Consider storing timezone info or using Unix timestamps.

6. Missing Index (storage.py:188-191)

You have indexes for common queries, but command is also frequently queried (for permission gap detection). Consider:

CREATE INDEX IF NOT EXISTS idx_events_command ON events(command) WHERE command IS NOT NULL

7. Summary Entry Handling (ingest.py:266-276)

Summary entries fall back to datetime.now() if timestamp is missing. This could cause ordering issues. Consider rejecting malformed summaries instead:

if not timestamp:
    logger.debug(f"Skipping summary entry without timestamp: {uuid}")
    return []

Low Priority / Nice-to-Have

8. Progress Reporting

For ingesting many files, consider a progress indicator (especially useful for initial days=30 ingestion).

9. Validation of Tool Input JSON (ingest.py:68)

json.dumps(tool_use.get("input", {})) assumes input is serializable. Add error handling if it could contain non-JSON-safe types.

10. Magic Number (ingest.py:35)

cutoff = datetime.now() - timedelta(days=days) - Consider if this should use file mtime or current time (documented behavior is correct, just verify it matches user expectations).


🔒 Security Review

✅ No Serious Security Issues

  • No command injection risks (all file paths are from filesystem, not user input)
  • No SQL injection (all queries use parameterized statements)
  • File access is restricted to ~/.claude/projects/ (safe)
  • No arbitrary code execution risks

Minor Concerns

  • Symlink attacks: find_log_files doesn't check for symlinks. An attacker could symlink malicious files into ~/.claude/projects/. Add:
    if project_dir.is_symlink():
        logger.warning(f"Skipping symlink: {project_dir}")
        continue

📊 Performance Analysis

Benchmarks Needed

  • Time to ingest 10,000 events (estimate: <1s)
  • Time to ingest 100 files (estimate: <10s)
  • Database size after 30 days of logs (estimate: <10MB)

Optimizations Applied ✅

  • Batch inserts
  • Incremental ingestion
  • Indexes on common query patterns
  • INSERT OR IGNORE instead of SELECT-then-INSERT

🧪 Test Coverage Assessment

Well Tested ✅

  • Tool parsing (Bash, Read, Edit, Skill, MCP)
  • Entry parsing (user, assistant, tool_result)
  • Incremental ingestion
  • Force re-ingestion
  • File discovery

Missing Tests (Recommend Adding)

  • Error handling for corrupted JSONL files
  • Very large files (10k+ events)
  • Concurrent ingestion (if supported)
  • Malformed timestamps
  • Missing required fields in events

📝 Documentation

✅ Good

  • Clear docstrings with Args/Returns
  • README explains installation and usage
  • CLAUDE.md provides project context

Could Improve

  • Add inline comments for complex logic (e.g., parse_entry state machine)
  • Document timezone assumptions
  • Add example usage in docstrings

🎯 Final Verdict

APPROVE with minor suggestions

This is high-quality, production-ready code. The architecture is sound, error handling is robust, and test coverage is comprehensive. The issues I've identified are minor and mostly preventative (addressing edge cases that may never occur in practice).

Before Merging (Optional)

  1. Add chunked batch processing for large files (High Priority Phase 3: Ingestion #3)
  2. Add symlink check (Security)
  3. Add index on command field (Medium Priority Phase 6: Integration #6)

Can Be Addressed in Future PRs

  • Progress reporting
  • Enhanced error logging
  • Additional test coverage for edge cases

Great work! 🚀

@evansenter

Copy link
Copy Markdown
Owner Author

Superseded by #19 which contains all Phase 1-7 work plus RFC #18 refactoring.

@evansenter evansenter closed this Dec 31, 2025
@evansenter evansenter deleted the phase-3-ingestion branch December 31, 2025 12:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Phase 3: Ingestion

1 participant