[aw][perf] vscode_parser: get_vscode_summary stats every log file on every call before checking the global summary cache #898

@microsasa

Summary

get_vscode_summary calls safe_file_identity(p) (a stat() syscall) for every discovered log file to build current_ids, and only then compares against _vscode_summary_cache.file_ids. O(n_log_files) stat syscalls happen on every call, even when the global summary cache is perfectly valid.

File & Function

src/copilot_usage/vscode_parser.py · get_vscode_summary

What Makes It Slow

logs = _cached_discover_vscode_logs(base_path)
log_ids: list[tuple[Path, tuple[int, int] | None]] = [
    (p, safe_file_identity(p)) for p in logs   # ← stat() every log file
]
current_ids: frozenset[...] = frozenset(log_ids)  # ← O(n) hash construction

if (
    _vscode_summary_cache is not None
    and _vscode_summary_cache.file_ids == current_ids  # ← cache check is AFTER stats
):
    return _vscode_summary_cache.summary

_cached_discover_vscode_logs already knows which log files exist via its discovery cache. But the file content identity (mtime + size) is not retained there — so get_vscode_summary re-stats every file to detect changes. When the summary is valid (nothing changed), all those stats are wasted.

A VS Code power user accumulates many log files across multiple window sessions. Fifty files across two VS Code installs is easily reached after a week of active use; at 5 µs/stat on a local SSD that is 250 µs wasted per call, and proportionally more on network filesystems.
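The per-stat figure above is an estimate, and the real cost depends on the filesystem. A minimal sketch for measuring it locally follows; the helper name `time_stat_loop`, the file names, and the repeat count are illustrative assumptions and not part of `vscode_parser`.

```python
import os
import tempfile
import time
from pathlib import Path


def time_stat_loop(paths: list[Path], repeats: int = 200) -> float:
    """Return the mean wall-clock seconds spent on one stat() pass over paths."""
    start = time.perf_counter()
    for _ in range(repeats):
        for p in paths:
            os.stat(p)
    return (time.perf_counter() - start) / repeats


with tempfile.TemporaryDirectory() as tmp:
    # 50 empty stand-in files; the count mirrors the example in the text.
    files = [Path(tmp) / f"session-{i}.log" for i in range(50)]
    for f in files:
        f.touch()
    per_pass = time_stat_loop(files)
    print(f"{len(files)} stats per pass: {per_pass * 1e6:.1f} us")
```

Running this against a directory on the filesystem that actually holds the VS Code logs gives a more representative number than a temporary directory.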

Concrete Fix

Extend _VSCodeDiscoveryCache to also store the per-log-file identities computed during discovery. Have _cached_discover_vscode_logs return them alongside the paths so get_vscode_summary can build current_ids from cached data without extra stat calls.

@dataclass(slots=True)
class _VSCodeDiscoveryCache:
    root_id: tuple[int, int]
    child_ids: _ChildIds
    log_paths: tuple[Path, ...]
    log_file_ids: tuple[tuple[int, int] | None, ...]  # ← added: one entry per log_path

When _cached_discover_vscode_logs runs the glob (cache miss), it stats each found log file and stores the results alongside the paths. On the next call (cache hit), the stored log_file_ids are returned without any additional stat() calls.

get_vscode_summary then receives (log_paths, log_file_ids) from _cached_discover_vscode_logs and builds current_ids directly from the cached IDs:

logs, log_file_ids = _cached_discover_vscode_logs(base_path)
current_ids = frozenset(zip(logs, log_file_ids))
# no safe_file_identity() calls needed

On a discovery cache miss (e.g., new log file appeared), the stat calls are unavoidable — they produce the data that fills the cache. On a cache hit, all stat calls are eliminated.

Alternatively, as a simpler intermediate fix that leaves the return type unchanged: when _vscode_summary_cache is not None, first compare the set of discovered log paths (ignoring IDs) against the paths recorded in _vscode_summary_cache.file_ids. If the two path sets are equal, skip the stat loop entirely and reuse the stored IDs for the frozenset comparison; otherwise fall back to statting every file. The path-set check has to come first, because the stored IDs are only meaningful for the same set of paths.
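One possible reading of this intermediate fix is sketched below. The names `paths_match` and `current_ids_lazy` are hypothetical, and `stat_fn` stands in for `safe_file_identity`. Note that when the path sets match, this version never observes an in-place change to an existing file, so it would need to be paired with some other change signal.

```python
from pathlib import Path
from typing import Callable, Optional

FileId = tuple[int, int]
FileIds = frozenset[tuple[Path, Optional[FileId]]]


def paths_match(cached_ids: FileIds, discovered: list[Path]) -> bool:
    """True when the cached entries cover exactly the discovered paths."""
    return {p for p, _ in cached_ids} == set(discovered)


def current_ids_lazy(
    cached_ids: Optional[FileIds],
    discovered: list[Path],
    stat_fn: Callable[[Path], Optional[FileId]],
) -> FileIds:
    """Reuse cached IDs when the path sets agree; otherwise stat every file."""
    if cached_ids is not None and paths_match(cached_ids, discovered):
        # Same paths as last time: skip the stat loop and reuse stored IDs.
        # Caveat: a file modified in place is not noticed on this branch.
        return cached_ids
    return frozenset((p, stat_fn(p)) for p in discovered)
```

The returned frozenset can be compared against the cached `file_ids` exactly as before, so the surrounding cache check does not need to change.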

Expected Improvement

For a user with 50 VS Code log files, steady-state cost drops from 50 stat() calls per invocation to 0 (discovery cache hit path). The savings are proportional to the number of log files and compound with the sibling fix for _scan_child_ids.

Testing Requirement

Monkeypatch safe_file_identity and count calls to it (excluding the root/child discovery path). After warming the caches with one call, a second call with unchanged files must invoke safe_file_identity for individual log files 0 times:

def test_get_vscode_summary_no_log_stats_on_cache_hit(tmp_path, monkeypatch):
    # create log files and warm both discovery cache and summary cache
    ...
    stat_calls: list[Path] = []
    original = vscode_parser.safe_file_identity
    def spy(p: Path) -> tuple[int, int] | None:
        stat_calls.append(p)
        return original(p)
    monkeypatch.setattr(vscode_parser, "safe_file_identity", spy)
    get_vscode_summary(tmp_path)
    log_file_stats = [p for p in stat_calls if p.suffix == ".log"]
    assert log_file_stats == [], "log files must not be stat'd on a warm cache hit"

Generated by Performance Analysis · ● 3.9M

Metadata

Assignees

No one assigned

    Labels

    aw (Created by agentic workflow) · aw-dispatched (Issue has been dispatched to implementer) · perf (Performance improvement)
