Skip to content

Phase 5: Pattern detection and insights#13

Closed
evansenter wants to merge 5 commits into
mainfrom
phase-5-patterns
Closed

Phase 5: Pattern detection and insights#13
evansenter wants to merge 5 commits into
mainfrom
phase-5-patterns

Conversation

@evansenter

Copy link
Copy Markdown
Owner

Summary

Implements pattern detection for /improve-workflow integration:

  • compute_tool_frequency_patterns() - Tool usage frequency analysis
  • compute_command_patterns() - Bash command frequency
  • compute_sequence_patterns() - Tool n-gram detection (e.g., "Read → Edit")
  • compute_permission_gaps() - Commands frequently used but not in settings.json
  • get_insights() - Unified insights API for /improve-workflow

New MCP tools:

  • query_sequences - Find common tool patterns/sequences
  • query_permission_gaps - Identify commands needing settings.json entries
  • get_insights - Get pre-computed patterns for workflow improvement

Test plan

  • Run make check - all 69 tests pass
  • Verify tool frequency pattern detection
  • Verify sequence detection (n-gram analysis)
  • Verify permission gap detection
  • Verify get_insights returns organized data

Closes #5

🤖 Generated with Claude Code

evansenter and others added 5 commits December 31, 2025 04:14
- pyproject.toml with FastMCP, uvicorn, and dev dependencies
- Makefile with check, fmt, lint, test, install, uninstall targets
- LaunchAgent plist and install/uninstall scripts for auto-start
- dev.sh script for development mode with auto-reload
- Basic FastMCP server with placeholder tools:
  - get_status: Returns server status
  - ingest_logs: Placeholder for log ingestion
  - query_tool_frequency: Placeholder for frequency queries
- Usage guide as MCP resource at session-analytics://guide
- Tests for the placeholder tools
- README with installation and usage instructions

Server runs on port 8081 (to not conflict with event-bus on 8080).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- storage.py with SQLiteStorage class:
  - Events table with denormalized fields for fast queries
  - Sessions table for session metadata
  - Ingestion state tracking for incremental updates
  - Patterns table for pre-computed insights
  - Indexes on timestamp, session_id, tool_name, project_path
- Data classes: Event, Session, IngestionState, Pattern
- CRUD operations for all entities with batch insert support
- get_db_stats() for monitoring database health
- Updated server.py to use storage for get_status()
- Comprehensive test suite (16 tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements log file discovery and parsing:
- find_log_files(): Discovers JSONL files within date range
- parse_tool_use(): Extracts tool info (command, file_path, skill_name)
- parse_entry(): Parses entries into Event objects
- ingest_file(): Incremental ingestion with mtime/size tracking
- ingest_logs(): Full ingestion orchestration
- update_session_stats(): Aggregates session statistics

Integrates with server.py to provide real data for ingest_logs tool.

Closes #3

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements all query MCP tools:
- query_tool_frequency: Tool usage counts with project filter
- query_timeline: Events in time window with filtering
- query_commands: Bash command breakdown with prefix filter
- query_sessions: Session metadata and token totals
- query_tokens: Token usage grouped by day/session/model

Also adds:
- ensure_fresh_data(): Auto-refresh mechanism (5 min staleness)
- Comprehensive tests for all queries (18 new tests)

Closes #4

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements pattern detection for /improve-workflow integration:
- compute_tool_frequency_patterns(): Tool usage frequency
- compute_command_patterns(): Bash command frequency
- compute_sequence_patterns(): Tool n-gram detection
- compute_permission_gaps(): Commands not in settings.json
- get_insights(): Unified insights API for /improve-workflow

New MCP tools:
- query_sequences: Common tool patterns
- query_permission_gaps: Commands needing settings.json
- get_insights: Pre-computed patterns

Adds 16 new tests (69 total).

Closes #5

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@claude

claude Bot commented Dec 31, 2025

Copy link
Copy Markdown

Code Review: Phase 5 - Pattern Detection and Insights

Thank you for this comprehensive implementation! This PR adds substantial value to the session analytics MCP server. Here's my detailed review:


✅ Strengths

Architecture & Design

  • Well-structured pattern detection: The separation of concerns between patterns.py, queries.py, and storage.py is clean and maintainable
  • N-gram sequence detection: The sliding window approach in compute_sequence_patterns() (lines 144-161 in patterns.py) is elegant and efficient
  • Permission gap analysis: Smart integration with Claude Code's settings.json to identify frequently-used commands that could benefit from auto-approval
  • Metadata storage: Using a flexible JSON metadata field in the Pattern model allows for future extensibility

Code Quality

  • Comprehensive test coverage: 69 tests total with dedicated test files for each module
  • Good error handling: JSON parsing, file I/O, and database operations have appropriate try/catch blocks
  • Clear documentation: Docstrings are thorough and follow consistent formatting
  • Type hints: Proper use of type annotations throughout (e.g., str | None, list[Pattern])

🔍 Issues Found

🐛 Bug: SQL Injection Vulnerability (CRITICAL)

Location: src/session_analytics/queries.py:68-79

where_clause = " AND ".join(conditions)
rows = conn.execute(
    f"""
    SELECT tool_name, COUNT(*) as count
    FROM events
    WHERE {where_clause}  # ⚠️ Dangerous string interpolation
    GROUP BY tool_name
    ORDER BY count DESC
    """,
    params,
).fetchall()

Problem: Using f-string interpolation to build SQL queries can lead to SQL injection if conditions contains user input. While currently the conditions are hardcoded, this pattern is fragile and could be exploited if the code evolves.

Recommendation: Build the WHERE clause more safely:

# Better approach - still use parameterized queries
base_query = """
    SELECT tool_name, COUNT(*) as count
    FROM events
    WHERE timestamp >= ?
"""
if project:
    base_query += " AND project_path LIKE ?"
base_query += """
    GROUP BY tool_name
    ORDER BY count DESC
"""

🐛 Edge Case: Sequence Detection Doesn't Process Last Session

Location: src/session_analytics/patterns.py:157-161

The loop processes sessions as they change, but the final session in the dataset is handled outside the loop. This works, but if the last session is empty or has fewer than sequence_length tools, it's silently skipped without logging.

Recommendation: Add debug logging for edge cases:

# Process last session
if len(session_tools) >= sequence_length:
    for i in range(len(session_tools) - sequence_length + 1):
        seq = tuple(session_tools[i : i + sequence_length])
        sequences[seq] += 1
elif session_tools:
    logger.debug(f"Session {current_session} has only {len(session_tools)} tools, skipping sequence detection")

⚠️ Performance Concern: In-Memory Sequence Processing

Location: src/session_analytics/patterns.py:139-161

The sequence detection loads all tool events for the time period into memory as a list. For users with extensive session histories, this could consume significant memory.

Recommendation: Consider adding a warning if processing very large datasets:

total_events = len(rows)
if total_events > 100000:
    logger.warning(f"Processing {total_events} events for sequence detection - this may take a while")

Or implement streaming/batching for very large datasets.

🔒 Security: File Path Validation Missing

Location: src/session_analytics/patterns.py:183-211

The load_allowed_commands() function accepts an arbitrary settings_path parameter but doesn't validate that it's within expected directories.

Recommendation: Add path validation:

def load_allowed_commands(settings_path: Path = DEFAULT_SETTINGS_PATH) -> set[str]:
    """Load allowed commands from Claude Code settings.json."""
    # Validate path is within expected directories
    try:
        settings_path = settings_path.resolve()
        if not settings_path.is_relative_to(Path.home()):
            logger.warning(f"Settings path outside home directory: {settings_path}")
            return set()
    except (ValueError, RuntimeError):
        logger.warning(f"Invalid settings path: {settings_path}")
        return set()
    
    if not settings_path.exists():
        return set()
    # ... rest of function

🎯 Suggestions for Improvement

1. Add Rate Limiting to Auto-Refresh

Location: src/session_analytics/queries.py:8-40

The ensure_fresh_data() function can trigger expensive ingestion operations. If multiple MCP tools are called in quick succession, this could cause redundant ingestion.

Suggestion: Add a global lock or timestamp check to prevent concurrent ingestion:

_last_ingestion_check = None
_ingestion_lock = threading.Lock()

def ensure_fresh_data(...):
    global _last_ingestion_check
    with _ingestion_lock:
        # Check if we recently ran ingestion
        if _last_ingestion_check and (datetime.now() - _last_ingestion_check).seconds < 10:
            return False
        # ... existing logic
        _last_ingestion_check = datetime.now()

2. Add Configurable Pattern Thresholds

The pattern detection functions have hardcoded defaults (e.g., min_count=3 in tests, threshold=5 for permission gaps). Consider making these user-configurable via environment variables or a config file.

3. Improve Test Coverage for Edge Cases

While test coverage is good, consider adding tests for:

  • Empty database scenarios
  • Sessions with only 1 tool call (for sequence detection)
  • Malformed settings.json files
  • Extremely long sequences (e.g., 100+ tools in one session)
  • Unicode/special characters in command names

4. Add Performance Metrics

Consider adding timing instrumentation to pattern detection functions:

import time

def compute_sequence_patterns(...):
    start = time.time()
    # ... existing logic
    elapsed = time.time() - start
    logger.debug(f"Computed {len(patterns)} sequences in {elapsed:.2f}s")
    return patterns

📊 Test Coverage Assessment

Total Tests: 69 ✅
Test Files: 5 (server, storage, ingest, queries, patterns)

Coverage Highlights:

  • ✅ Pattern detection logic thoroughly tested
  • ✅ Edge cases for empty data handled
  • ✅ Permission gap logic tested with mock settings files
  • ✅ Sequence min_count threshold filtering tested

Missing Coverage:

  • ⚠️ No tests for concurrent access (multiple queries at once)
  • ⚠️ No integration tests for the full MCP server with pattern tools
  • ⚠️ No tests for very large datasets (performance regression testing)

🔐 Security Assessment

Overall: Low risk ✅

Concerns:

  1. SQL injection potential (see bug above) - MEDIUM priority to fix
  2. Path traversal in load_allowed_commands() - LOW priority (limited exposure)
  3. No input validation on days parameter (could pass negative numbers or extremely large values) - LOW priority

Recommendations:

  • Add input validation for all user-facing parameters
  • Consider adding rate limiting to prevent DoS via repeated expensive queries
  • Audit all SQL query construction for injection risks

📝 Minor Suggestions

Code Style

  1. Inconsistent logging levels: Some errors use logger.warning(), others use logger.debug(). Consider establishing a logging convention.

  2. Magic numbers: Line 174 in patterns.py uses " → " as a hardcoded separator. Consider making this a module constant:

    SEQUENCE_SEPARATOR = " → "
    pattern_key=SEQUENCE_SEPARATOR.join(seq)
  3. Duplicate cutoff calculation: Multiple functions compute cutoff = datetime.now() - timedelta(days=days). Consider a helper function.

Documentation

  1. Add examples to docstrings showing expected return values
  2. Document the expected format of the metadata field in Pattern dataclass
  3. Add a CHANGELOG.md entry for this phase

✨ Excellent Implementation Details

Things Done Right:

  1. Batch operations: Using Counter for sequence counting is efficient
  2. Graceful degradation: Returns empty sets/lists when files are missing rather than crashing
  3. Separation of concerns: compute_all_patterns() orchestrates without knowing implementation details
  4. Idempotent operations: Can safely re-run pattern computation without duplicates
  5. Resource cleanup: Using context managers for DB connections

🚀 Recommendation

APPROVE with minor revisions

This is solid work that adds significant value to the project. The SQL injection issue should be addressed before merging, but it's a simple fix. All other issues are non-blocking and can be addressed in follow-up PRs.

Suggested merge strategy:

  1. Fix the SQL injection pattern in queries.py
  2. Add input validation for numeric parameters (days, threshold, etc.)
  3. Merge to main
  4. Create follow-up issues for performance optimization and additional test coverage

Great job on Phase 5! The pattern detection will be extremely valuable for the /improve-workflow integration. 🎉


Files Reviewed:

  • ✅ src/session_analytics/patterns.py (370 lines)
  • ✅ src/session_analytics/server.py (287 lines)
  • ✅ tests/test_patterns.py (310 lines)
  • ✅ pyproject.toml, Makefile, README.md
  • ✅ SQL storage schema and queries

@evansenter

Copy link
Copy Markdown
Owner Author

Superseded by #19 which contains all Phase 1-7 work plus RFC #18 refactoring.

@evansenter evansenter closed this Dec 31, 2025
@evansenter evansenter deleted the phase-5-patterns branch December 31, 2025 12:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Phase 5: Pattern detection

1 participant