[aw][perf] vscode_parser.discover_vscode_logs: public API bypasses module-level discovery cache #906

@microsasa

Summary

The public discover_vscode_logs function (src/copilot_usage/vscode_parser.py, lines 199–235, exported in __all__) always runs the full multi-level Path.glob(_GLOB_PATTERN) traversal on every call. The private _cached_discover_vscode_logs (lines 255–325) implements identical logic with an O(1) steady-state cache that avoids the glob when the root directory and its sentinel child are unchanged.

Because discover_vscode_logs is the documented public entry point, callers using it directly (e.g., listing log paths before deciding whether to call get_vscode_summary) pay full glob cost — O(n_window_dirs) filesystem traversal — on each invocation. get_vscode_summary already bypasses this by calling _cached_discover_vscode_logs internally, creating an inconsistency where the public function is slower than the internal one.

What makes it slow

The glob pattern "*/window*/exthost/GitHub.copilot-chat/GitHub Copilot Chat.log" descends four directory levels under each candidate root (session directory, window*, exthost, GitHub.copilot-chat), two of them wildcards. On a mature VS Code installation with many session directories (each containing multiple window*/ subdirectories), this glob can touch hundreds of inodes on every call.

Currently, repeated calls to discover_vscode_logs() with an unchanged directory each pay this full traversal cost. _cached_discover_vscode_logs reduces steady-state cost to two stat calls (root + sentinel child), which is at least an order of magnitude cheaper.
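To make the "two stat calls" idea concrete, here is a minimal sketch of a stat-keyed discovery cache. It is not the project's implementation: the names `identity`, `CacheEntry`, and `cached_discover` are hypothetical, and the choice of sentinel (the session directory of the last log found) and of identity fields (device, inode, mtime_ns) are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

# Pattern quoted in this issue.
GLOB_PATTERN = "*/window*/exthost/GitHub.copilot-chat/GitHub Copilot Chat.log"


def identity(path: Path) -> tuple[int, int, int] | None:
    """Return (device, inode, mtime_ns) from one stat call, or None if it fails."""
    try:
        st = path.stat()
    except OSError:
        return None
    return (st.st_dev, st.st_ino, st.st_mtime_ns)


@dataclass(frozen=True)
class CacheEntry:
    root_id: tuple[int, int, int]
    sentinel: Path | None  # assumption: session directory of the last log found
    sentinel_id: tuple[int, int, int] | None
    logs: tuple[Path, ...]


_cache: dict[Path, CacheEntry] = {}


def cached_discover(root: Path) -> list[Path]:
    """Run the glob only when the root or the sentinel child looks different."""
    root_id = identity(root)
    if root_id is None:  # missing or unreadable root
        _cache.pop(root, None)
        return []
    entry = _cache.get(root)
    if (
        entry is not None
        and entry.root_id == root_id
        and (entry.sentinel is None or identity(entry.sentinel) == entry.sentinel_id)
    ):
        return list(entry.logs)  # steady state: at most two stat calls, no glob
    logs = sorted(root.glob(GLOB_PATTERN))
    # parents[3] of a log path is its session directory:
    # log -> GitHub.copilot-chat -> exthost -> window* -> session
    sentinel = logs[-1].parents[3] if logs else None
    sentinel_id = identity(sentinel) if sentinel is not None else None
    _cache[root] = CacheEntry(root_id, sentinel, sentinel_id, tuple(logs))
    return logs
```

A cache like this trades completeness for speed: it only notices changes that alter the root's or the sentinel's stat result, so the choice of sentinel determines which filesystem changes trigger a fresh glob.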

Concrete fix

Make discover_vscode_logs delegate to _cached_discover_vscode_logs:

def discover_vscode_logs(base_path: Path | None = None) -> list[Path]:
    """Find all VS Code Copilot Chat log files.
    ...
    """
    return _cached_discover_vscode_logs(base_path)

_cached_discover_vscode_logs already handles both the base_path=None (default candidates) and base_path != None (explicit directory) cases with identical externally-visible behaviour. The one implementation difference — the uncached function calls candidate.is_dir() while the cached version uses candidate.stat() — produces the same result for valid and missing paths.
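The equivalence claimed above can be checked in isolation. The sketch below compares the two styles of existence check using hypothetical helper names (`usable_via_is_dir`, `usable_via_stat`); it says nothing about how the real functions use the result. It also shows one case the sentence above does not cover: a path that exists but is a regular file, where the two checks disagree. Whether the cached function guards against that case is not stated in this issue, so it may be worth a test.

```python
import tempfile
from pathlib import Path


def usable_via_is_dir(candidate: Path) -> bool:
    """Style attributed to the uncached function: is_dir() never raises for a missing path."""
    return candidate.is_dir()


def usable_via_stat(candidate: Path) -> bool:
    """Style attributed to the cached function: stat() raises OSError for a missing path."""
    try:
        candidate.stat()
    except OSError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    missing = directory / "does-not-exist"
    regular_file = directory / "file.txt"
    regular_file.write_text("x", encoding="utf-8")

    for path in (directory, missing, regular_file):
        # The two checks agree for the directory and the missing path,
        # and disagree for the regular file.
        print(path.name, usable_via_is_dir(path), usable_via_stat(path))
```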

Expected improvement

Steady-state cost for repeated calls drops from an O(n_inodes) Path.glob traversal to O(1): two stat syscalls (root identity + sentinel child identity). On a VS Code installation with 20 session directories each containing 3 window directories, this avoids traversing 60 window directories (20 × 3), and everything beneath them, on every call.

Testing requirement

Add a test to tests/copilot_usage/test_vscode_parser.py:

  1. Create a temporary VS Code log directory structure (matching _GLOB_PATTERN) and call discover_vscode_logs(tmp_path) twice with no changes between calls.
  2. Spy on Path.glob (or monkeypatch the glob call in vscode_parser) and assert it is called at most once across both calls — the second call must use the cache rather than re-running the glob.

This follows the project's deterministic perf-test convention (call-count assertion, no wall-clock timing), matching the pattern in TestVscodeDiscoveryCache.
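The two steps above could be built on a small call-counting helper like the following. This is a sketch under assumptions: `make_log`, `glob_calls_for_two_lookups`, and `uncached_discover` are hypothetical names, the session and window directory names are made up, and the stand-in function only mimics the uncached behaviour described in this issue so that the block runs without the project installed.

```python
from collections.abc import Callable
from pathlib import Path
from unittest import mock

# Pattern quoted in this issue.
GLOB_PATTERN = "*/window*/exthost/GitHub.copilot-chat/GitHub Copilot Chat.log"


def make_log(root: Path, session: str = "20240101T000000", window: str = "window1") -> Path:
    """Create one log file in a directory layout that matches GLOB_PATTERN."""
    log = root / session / window / "exthost" / "GitHub.copilot-chat" / "GitHub Copilot Chat.log"
    log.parent.mkdir(parents=True)
    log.write_text("", encoding="utf-8")
    return log


def glob_calls_for_two_lookups(discover: Callable[[Path], list[Path]], root: Path) -> int:
    """Call discover(root) twice and return how many times Path.glob ran."""
    real_glob = Path.glob
    calls: list[str] = []

    def counting_glob(self: Path, pattern: str, **kwargs):
        calls.append(pattern)  # record the call, then defer to the real glob
        return real_glob(self, pattern, **kwargs)

    with mock.patch.object(Path, "glob", counting_glob):
        first = discover(root)
        second = discover(root)
    assert first == second  # both lookups must return the same paths
    return len(calls)


def uncached_discover(root: Path) -> list[Path]:
    """Stand-in for the current behaviour: a full glob on every call."""
    return sorted(root.glob(GLOB_PATTERN))
```

In the project's test, the body would then reduce to `make_log(tmp_path)` followed by `assert glob_calls_for_two_lookups(discover_vscode_logs, tmp_path) <= 1`. With the uncached stand-in the helper returns 2, which is the behaviour the test is meant to reject.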

Generated by Performance Analysis · 4.4M

Labels: aw (created by agentic workflow), aw-dispatched (issue has been dispatched to implementer), perf (performance improvement)
