Phase 5: Pattern detection and insights by evansenter · Pull Request #13 · evansenter/agent-session-analytics

evansenter · 2025-12-31T04:51:25Z

Summary

Implements pattern detection for /improve-workflow integration:

compute_tool_frequency_patterns() - Tool usage frequency analysis
compute_command_patterns() - Bash command frequency
compute_sequence_patterns() - Tool n-gram detection (e.g., "Read → Edit")
compute_permission_gaps() - Commands frequently used but not in settings.json
get_insights() - Unified insights API for /improve-workflow

New MCP tools:

query_sequences - Find common tool patterns/sequences
query_permission_gaps - Identify commands needing settings.json entries
get_insights - Get pre-computed patterns for workflow improvement

Test plan

Run make check - all 69 tests pass
Verify tool frequency pattern detection
Verify sequence detection (n-gram analysis)
Verify permission gap detection
Verify get_insights returns organized data

Closes #5

🤖 Generated with Claude Code

- pyproject.toml with FastMCP, uvicorn, and dev dependencies - Makefile with check, fmt, lint, test, install, uninstall targets - LaunchAgent plist and install/uninstall scripts for auto-start - dev.sh script for development mode with auto-reload - Basic FastMCP server with placeholder tools: - get_status: Returns server status - ingest_logs: Placeholder for log ingestion - query_tool_frequency: Placeholder for frequency queries - Usage guide as MCP resource at session-analytics://guide - Tests for the placeholder tools - README with installation and usage instructions Server runs on port 8081 (to not conflict with event-bus on 8080). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- storage.py with SQLiteStorage class: - Events table with denormalized fields for fast queries - Sessions table for session metadata - Ingestion state tracking for incremental updates - Patterns table for pre-computed insights - Indexes on timestamp, session_id, tool_name, project_path - Data classes: Event, Session, IngestionState, Pattern - CRUD operations for all entities with batch insert support - get_db_stats() for monitoring database health - Updated server.py to use storage for get_status() - Comprehensive test suite (16 tests) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Implements log file discovery and parsing: - find_log_files(): Discovers JSONL files within date range - parse_tool_use(): Extracts tool info (command, file_path, skill_name) - parse_entry(): Parses entries into Event objects - ingest_file(): Incremental ingestion with mtime/size tracking - ingest_logs(): Full ingestion orchestration - update_session_stats(): Aggregates session statistics Integrates with server.py to provide real data for ingest_logs tool. Closes #3 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Implements all query MCP tools: - query_tool_frequency: Tool usage counts with project filter - query_timeline: Events in time window with filtering - query_commands: Bash command breakdown with prefix filter - query_sessions: Session metadata and token totals - query_tokens: Token usage grouped by day/session/model Also adds: - ensure_fresh_data(): Auto-refresh mechanism (5 min staleness) - Comprehensive tests for all queries (18 new tests) Closes #4 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Implements pattern detection for /improve-workflow integration: - compute_tool_frequency_patterns(): Tool usage frequency - compute_command_patterns(): Bash command frequency - compute_sequence_patterns(): Tool n-gram detection - compute_permission_gaps(): Commands not in settings.json - get_insights(): Unified insights API for /improve-workflow New MCP tools: - query_sequences: Common tool patterns - query_permission_gaps: Commands needing settings.json - get_insights: Pre-computed patterns Adds 16 new tests (69 total). Closes #5 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

claude · 2025-12-31T04:53:39Z

Code Review: Phase 5 - Pattern Detection and Insights

Thank you for this comprehensive implementation! This PR adds substantial value to the session analytics MCP server. Here's my detailed review:

✅ Strengths

Architecture & Design

Well-structured pattern detection: The separation of concerns between patterns.py, queries.py, and storage.py is clean and maintainable
N-gram sequence detection: The sliding window approach in compute_sequence_patterns() (lines 144-161 in patterns.py) is elegant and efficient
Permission gap analysis: Smart integration with Claude Code's settings.json to identify frequently-used commands that could benefit from auto-approval
Metadata storage: Using a flexible JSON metadata field in the Pattern model allows for future extensibility

Code Quality

Comprehensive test coverage: 69 tests total with dedicated test files for each module
Good error handling: JSON parsing, file I/O, and database operations have appropriate try/catch blocks
Clear documentation: Docstrings are thorough and follow consistent formatting
Type hints: Proper use of type annotations throughout (e.g., str | None, list[Pattern])

🔍 Issues Found

🐛 Bug: SQL Injection Vulnerability (CRITICAL)

Location: src/session_analytics/queries.py:68-79

where_clause = " AND ".join(conditions)
rows = conn.execute(
    f"""
    SELECT tool_name, COUNT(*) as count
    FROM events
    WHERE {where_clause}  # ⚠️ Dangerous string interpolation
    GROUP BY tool_name
    ORDER BY count DESC
    """,
    params,
).fetchall()

Problem: Using f-string interpolation to build SQL queries can lead to SQL injection if conditions contains user input. While currently the conditions are hardcoded, this pattern is fragile and could be exploited if the code evolves.

Recommendation: Build the WHERE clause more safely:

# Better approach - still use parameterized queries
base_query = """
    SELECT tool_name, COUNT(*) as count
    FROM events
    WHERE timestamp >= ?
"""
if project:
    base_query += " AND project_path LIKE ?"
base_query += """
    GROUP BY tool_name
    ORDER BY count DESC
"""

🐛 Edge Case: Sequence Detection Doesn't Process Last Session

Location: src/session_analytics/patterns.py:157-161

The loop processes sessions as they change, but the final session in the dataset is handled outside the loop. This works, but if the last session is empty or has fewer than sequence_length tools, it's silently skipped without logging.

Recommendation: Add debug logging for edge cases:

# Process last session
if len(session_tools) >= sequence_length:
    for i in range(len(session_tools) - sequence_length + 1):
        seq = tuple(session_tools[i : i + sequence_length])
        sequences[seq] += 1
elif session_tools:
    logger.debug(f"Session {current_session} has only {len(session_tools)} tools, skipping sequence detection")

⚠️ Performance Concern: In-Memory Sequence Processing

Location: src/session_analytics/patterns.py:139-161

The sequence detection loads all tool events for the time period into memory as a list. For users with extensive session histories, this could consume significant memory.

Recommendation: Consider adding a warning if processing very large datasets:

total_events = len(rows)
if total_events > 100000:
    logger.warning(f"Processing {total_events} events for sequence detection - this may take a while")

Or implement streaming/batching for very large datasets.

🔒 Security: File Path Validation Missing

Location: src/session_analytics/patterns.py:183-211

The load_allowed_commands() function accepts an arbitrary settings_path parameter but doesn't validate that it's within expected directories.

Recommendation: Add path validation:

def load_allowed_commands(settings_path: Path = DEFAULT_SETTINGS_PATH) -> set[str]:
    """Load allowed commands from Claude Code settings.json."""
    # Validate path is within expected directories
    try:
        settings_path = settings_path.resolve()
        if not settings_path.is_relative_to(Path.home()):
            logger.warning(f"Settings path outside home directory: {settings_path}")
            return set()
    except (ValueError, RuntimeError):
        logger.warning(f"Invalid settings path: {settings_path}")
        return set()
    
    if not settings_path.exists():
        return set()
    # ... rest of function

🎯 Suggestions for Improvement

1. Add Rate Limiting to Auto-Refresh

Location: src/session_analytics/queries.py:8-40

The ensure_fresh_data() function can trigger expensive ingestion operations. If multiple MCP tools are called in quick succession, this could cause redundant ingestion.

Suggestion: Add a global lock or timestamp check to prevent concurrent ingestion:

_last_ingestion_check = None
_ingestion_lock = threading.Lock()

def ensure_fresh_data(...):
    global _last_ingestion_check
    with _ingestion_lock:
        # Check if we recently ran ingestion
        if _last_ingestion_check and (datetime.now() - _last_ingestion_check).seconds < 10:
            return False
        # ... existing logic
        _last_ingestion_check = datetime.now()

2. Add Configurable Pattern Thresholds

The pattern detection functions have hardcoded defaults (e.g., min_count=3 in tests, threshold=5 for permission gaps). Consider making these user-configurable via environment variables or a config file.

3. Improve Test Coverage for Edge Cases

While test coverage is good, consider adding tests for:

Empty database scenarios
Sessions with only 1 tool call (for sequence detection)
Malformed settings.json files
Extremely long sequences (e.g., 100+ tools in one session)
Unicode/special characters in command names

4. Add Performance Metrics

Consider adding timing instrumentation to pattern detection functions:

import time

def compute_sequence_patterns(...):
    start = time.time()
    # ... existing logic
    elapsed = time.time() - start
    logger.debug(f"Computed {len(patterns)} sequences in {elapsed:.2f}s")
    return patterns

📊 Test Coverage Assessment

Total Tests: 69 ✅
Test Files: 5 (server, storage, ingest, queries, patterns)

Coverage Highlights:

✅ Pattern detection logic thoroughly tested
✅ Edge cases for empty data handled
✅ Permission gap logic tested with mock settings files
✅ Sequence min_count threshold filtering tested

Missing Coverage:

⚠️ No tests for concurrent access (multiple queries at once)
⚠️ No integration tests for the full MCP server with pattern tools
⚠️ No tests for very large datasets (performance regression testing)

🔐 Security Assessment

Overall: Low risk ✅

Concerns:

SQL injection potential (see bug above) - MEDIUM priority to fix
Path traversal in load_allowed_commands() - LOW priority (limited exposure)
No input validation on days parameter (could pass negative numbers or extremely large values) - LOW priority

Recommendations:

Add input validation for all user-facing parameters
Consider adding rate limiting to prevent DoS via repeated expensive queries
Audit all SQL query construction for injection risks

📝 Minor Suggestions

Code Style

Inconsistent logging levels: Some errors use logger.warning(), others use logger.debug(). Consider establishing a logging convention.
Magic numbers: Line 174 in patterns.py uses " → " as a hardcoded separator. Consider making this a module constant:
```
SEQUENCE_SEPARATOR = " → "
pattern_key=SEQUENCE_SEPARATOR.join(seq)
```
Duplicate cutoff calculation: Multiple functions compute cutoff = datetime.now() - timedelta(days=days). Consider a helper function.

Documentation

Add examples to docstrings showing expected return values
Document the expected format of the metadata field in Pattern dataclass
Add a CHANGELOG.md entry for this phase

✨ Excellent Implementation Details

Things Done Right:

Batch operations: Using Counter for sequence counting is efficient
Graceful degradation: Returns empty sets/lists when files are missing rather than crashing
Separation of concerns: compute_all_patterns() orchestrates without knowing implementation details
Idempotent operations: Can safely re-run pattern computation without duplicates
Resource cleanup: Using context managers for DB connections

🚀 Recommendation

APPROVE with minor revisions ✅

This is solid work that adds significant value to the project. The SQL injection issue should be addressed before merging, but it's a simple fix. All other issues are non-blocking and can be addressed in follow-up PRs.

Suggested merge strategy:

Fix the SQL injection pattern in queries.py
Add input validation for numeric parameters (days, threshold, etc.)
Merge to main
Create follow-up issues for performance optimization and additional test coverage

Great job on Phase 5! The pattern detection will be extremely valuable for the /improve-workflow integration. 🎉

Files Reviewed:

✅ src/session_analytics/patterns.py (370 lines)
✅ src/session_analytics/server.py (287 lines)
✅ tests/test_patterns.py (310 lines)
✅ pyproject.toml, Makefile, README.md
✅ SQL storage schema and queries

evansenter · 2025-12-31T12:57:28Z

Superseded by #19 which contains all Phase 1-7 work plus RFC #18 refactoring.

evansenter and others added 5 commits December 31, 2025 04:14

evansenter closed this Dec 31, 2025

evansenter deleted the phase-5-patterns branch December 31, 2025 12:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 5: Pattern detection and insights#13

Phase 5: Pattern detection and insights#13
evansenter wants to merge 5 commits into
mainfrom
phase-5-patterns

evansenter commented Dec 31, 2025

Uh oh!

claude Bot commented Dec 31, 2025

Uh oh!

evansenter commented Dec 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

evansenter commented Dec 31, 2025

Summary

Test plan

Uh oh!

claude Bot commented Dec 31, 2025

Code Review: Phase 5 - Pattern Detection and Insights

✅ Strengths

Architecture & Design

Code Quality

🔍 Issues Found

🐛 Bug: SQL Injection Vulnerability (CRITICAL)

🐛 Edge Case: Sequence Detection Doesn't Process Last Session

⚠️ Performance Concern: In-Memory Sequence Processing

🔒 Security: File Path Validation Missing

🎯 Suggestions for Improvement

1. Add Rate Limiting to Auto-Refresh

2. Add Configurable Pattern Thresholds

3. Improve Test Coverage for Edge Cases

4. Add Performance Metrics

📊 Test Coverage Assessment

🔐 Security Assessment

📝 Minor Suggestions

Code Style

Documentation

✨ Excellent Implementation Details

Things Done Right:

🚀 Recommendation

Uh oh!

evansenter commented Dec 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant