feat(benchmark): modernize perf-test.py for new CLI structure #15
Conversation
- Update all CLI commands to use new subcommand structure
- Fix commands to use cwd parameter instead of path arguments
- Commands updated: index generate, query overview, query callgraph, search, validate, analyze, cache clear, query module
- Add validation benchmarks (duplicates, complexity)
- Add diff analysis benchmarks
- Add cache performance benchmarks (cold/warm)
- Add self-benchmark mode (--self flag)
- Add comparison mode for regression tracking
- Save results to gitignored target/perf-history/
- Output follows Google Benchmark JSON format
- Update .gitignore for benchmark directories

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
This PR modernizes the perf-test.py script to align with a new CLI structure where commands operate on the current working directory via the cwd parameter rather than accepting path arguments. It also expands benchmark coverage by adding validation, diff analysis, cache performance, and module query benchmarks, plus a self-benchmark mode to test the engine on its own codebase.
Key Changes:
- Updated CLI commands to use the `cwd` parameter: `query overview`/`callgraph`, `search`, `validate`, and `cache clear` now operate on the current directory
- Added new benchmark functions for validation (duplicates/complexity), diff analysis, cache performance (cold vs warm), and module queries
- Introduced the `--self` flag for benchmarking semfora-engine on its own codebase and the `--compare` flag for regression tracking
- Results now saved to the gitignored `target/perf-history/` directory in Google Benchmark JSON format
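The report layout the script targets is the Google Benchmark JSON format: a top-level `context` object plus a `benchmarks` array of per-run entries. A minimal sketch of what one saved report might contain (the benchmark names and values here are purely illustrative, not taken from the PR):

```python
import json

# Sketch of a Google Benchmark-style JSON report. The "context" and
# "benchmarks" keys follow the Google Benchmark output format; the
# concrete names and timings below are illustrative.
report = {
    "context": {
        "date": "2024-01-01T00:00:00",
        "executable": "semfora-engine",
    },
    "benchmarks": [
        {"name": "query_overview/cold", "real_time": 412.5, "cpu_time": 398.1, "time_unit": "ms"},
        {"name": "query_overview/warm", "real_time": 38.2, "cpu_time": 35.9, "time_unit": "ms"},
    ],
}
print(json.dumps(report, indent=2))
```

Keeping to this shape means existing Google Benchmark tooling (e.g. comparison scripts) can consume the history files directly.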
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 23 comments.
| File | Description |
|---|---|
| scripts/perf-test.py | Comprehensive update to CLI command structure, addition of new benchmark categories (validation, diff, cache, module queries), self-benchmark mode, baseline comparison functionality, and improved metadata tracking (engine version, index size, result counts, token estimates) |
| .gitignore | Added entries to ignore benchmark history and output directories (/target/perf-history/, /benchmark_output/, /benchmark_results/) |
scripts/perf-test.py
Outdated
```python
if change_pct > 10:  # More than 10% slower
    regressions.append((curr.name, base.real_time, curr.real_time, change_pct))
elif change_pct < -10:  # More than 10% faster
    improvements.append((curr.name, base.real_time, curr.real_time, change_pct))
else:
    unchanged.append((curr.name, base.real_time, curr.real_time, change_pct))
```
The comparison logic uses a 10% threshold but doesn't handle the case where baseline time is very small (close to zero). This could lead to false positives where tiny absolute differences (e.g., 0.001s to 0.0015s) trigger regression alerts. Consider adding a minimum absolute time threshold in addition to the percentage threshold.
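One way to combine the two thresholds is a classifier that first discards sub-threshold absolute deltas. A minimal sketch, assuming a 10% relative threshold and a 10 ms absolute floor (the function name and exact floor are illustrative, not from the PR):

```python
def classify_change(base_s: float, curr_s: float,
                    pct_threshold: float = 10.0,
                    min_abs_s: float = 0.010) -> str:
    """Classify a benchmark delta, ignoring tiny absolute differences."""
    delta = curr_s - base_s
    # Sub-threshold absolute changes are noise regardless of percentage.
    if abs(delta) < min_abs_s:
        return "unchanged"
    change_pct = delta / base_s * 100 if base_s > 0 else float("inf")
    if change_pct > pct_threshold:
        return "regression"
    if change_pct < -pct_threshold:
        return "improvement"
    return "unchanged"
```

With this guard, the 0.001 s to 0.0015 s case from the comment (a 50% relative change but only a 0.5 ms absolute one) is classified as unchanged.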
scripts/perf-test.py
Outdated
```python
mode = random.choice(["symbols", "semantic"])
cmd = [str(ENGINE_BIN), "search", pattern, str(path), "--mode", mode, "--limit", "20"]
start = time.perf_counter()
try:
    subprocess.run(cmd, capture_output=True, timeout=30)
```
The command is using the deprecated --mode flag instead of the new --symbols or --related flags. Additionally, it should use cwd=path instead of passing path as an argument, consistent with the new CLI structure used elsewhere in the file (e.g., line 318).
Suggested change:

```diff
-mode = random.choice(["symbols", "semantic"])
-cmd = [str(ENGINE_BIN), "search", pattern, str(path), "--mode", mode, "--limit", "20"]
+flag = random.choice(["--symbols", "--related"])
+cmd = [str(ENGINE_BIN), "search", pattern, flag, "--limit", "20"]
 start = time.perf_counter()
 try:
-    subprocess.run(cmd, capture_output=True, timeout=30)
+    subprocess.run(cmd, capture_output=True, timeout=30, cwd=path)
```
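The cwd-based invocation pattern behind this suggestion can be factored into a small helper that builds the argv once and always passes the repo as `cwd` instead of as an argument. A sketch only; the helper names and `ENGINE_BIN` value are illustrative, not taken from perf-test.py:

```python
import subprocess
from pathlib import Path

ENGINE_BIN = Path("target/release/semfora-engine")  # illustrative path

def build_engine_cmd(*args: str) -> list[str]:
    """Build an engine argv; the repo is supplied via cwd, never as a path argument."""
    return [str(ENGINE_BIN), *args]

def run_engine(repo_path: Path, *args: str, timeout: int = 30):
    # Commands operate on the current working directory, so pass cwd=...
    return subprocess.run(build_engine_cmd(*args), capture_output=True,
                          timeout=timeout, cwd=repo_path)

cmd = build_engine_cmd("search", "main", "--symbols", "--limit", "20")
```

Centralizing the construction makes it harder for a stray `str(path)` argument to reappear in one benchmark but not another.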
scripts/perf-test.py
Outdated
| "avg_latency_ms": round(avg_latency * 1000, 2), | ||
| "min_latency_ms": round(min(successful) * 1000, 2) if successful else 0, | ||
| "max_latency_ms": round(max(successful) * 1000, 2) if successful else 0, | ||
| "p95_latency_ms": round(sorted(successful)[int(len(successful) * 0.95)] * 1000, 2) if len(successful) > 20 else 0, |
The condition checks `len(successful) > 20`, so the p95 calculation incorrectly returns 0 when there are exactly 20 successful requests, even though `int(20 * 0.95) = 19` is a valid index into the sorted list (no IndexError occurs). The condition should be `>= 20`, or better yet a smaller threshold like `> 1`, since percentile calculations are meaningful with just a few data points.
| "p95_latency_ms": round(sorted(successful)[int(len(successful) * 0.95)] * 1000, 2) if len(successful) > 20 else 0, | |
| "p95_latency_ms": round(sorted(successful)[int(len(successful) * 0.95)] * 1000, 2) if len(successful) > 1 else 0, |
```python
def main():
    parser = argparse.ArgumentParser(description="Semfora Performance Test Suite")
    parser.add_argument("--quick", action="store_true", help="Quick smoke test (small repos only)")
    parser.add_argument("--self", action="store_true", help="Benchmark semfora-engine codebase itself")
```
The argument name `--self` is awkward because `self` is the conventional name for instance references in Python (though not a reserved keyword). While it works in argparse when accessed as `args.self`, this is not idiomatic and could cause confusion. Consider using `--self-benchmark` or `--engine-self` instead, or mapping the flag to a different attribute name via `dest`.
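If the user-facing flag should stay `--self`, argparse's `dest` parameter can map it to a non-conflicting attribute name. A minimal sketch of that approach:

```python
import argparse

# Keep the user-facing flag but map it to a clearer attribute name via
# dest=, so code reads args.self_benchmark instead of the awkward args.self.
parser = argparse.ArgumentParser(description="Semfora Performance Test Suite")
parser.add_argument("--self", dest="self_benchmark", action="store_true",
                    help="Benchmark semfora-engine codebase itself")
args = parser.parse_args(["--self"])
print(args.self_benchmark)  # → True
```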
scripts/perf-test.py
Outdated
```python
# Try to extract result count from JSON output
try:
    data = json.loads(output)
    if isinstance(data, list):
        result_counts.append(len(data))
    elif isinstance(data, dict) and "matches" in data:
        result_counts.append(len(data["matches"]))
except:
    pass
```
The function attempts to extract JSON data without first checking if the command succeeded with valid output. If the command fails but still produces some text output that happens to be parseable as JSON, this could lead to incorrect result counts being recorded. The JSON parsing should only occur within the if success: block.
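Guarding the parse on the success flag could look like this sketch (the helper name is hypothetical; the payload shapes follow the snippet above):

```python
import json

def extract_result_count(success: bool, output: str):
    """Return the number of results in a JSON payload, or None if unavailable."""
    if not success:
        # Failed commands may still emit parseable text; don't trust it.
        return None
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return None
    if isinstance(data, list):
        return len(data)
    if isinstance(data, dict) and "matches" in data:
        return len(data["matches"])
    return None
```

Returning `None` rather than appending nothing also lets the caller distinguish "no count available" from "zero results".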
scripts/perf-test.py
Outdated
```python
    for f in index_path.rglob("*"):
        if f.is_file():
            total += f.stat().st_size
except:
```
'except' clause does nothing but pass and there is no explanatory comment.
scripts/perf-test.py
Outdated
```python
    for line in f:
        if line.startswith("VmRSS:"):
            return int(line.split()[1]) / 1024
except:
```
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
 except:
+    # Best-effort metric: ignore any errors (e.g., when /proc is unavailable)
```
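A best-effort reader with a narrow `except` and an explicit fallback might look like this sketch (the function name is hypothetical):

```python
def rss_mb():
    """Return this process's VmRSS in MiB, or None where /proc is unavailable."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024  # kB -> MiB
    except OSError:
        # Best-effort metric: /proc does not exist on macOS/Windows.
        pass
    return None
```

Catching `OSError` instead of everything means genuine bugs (e.g. a `ValueError` from a changed /proc format) still surface during development.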
scripts/perf-test.py
Outdated
```python
        result_counts.append(len(data))
    elif isinstance(data, dict) and "matches" in data:
        result_counts.append(len(data["matches"]))
except:
```
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
 except:
+    # If JSON parsing or structure inspection fails, just skip result count
+    # extraction; timing metrics are still valid without this metadata.
```
scripts/perf-test.py
Outdated
```python
        edge_counts.append(data["edge_count"])
    elif "edges" in data:
        edge_counts.append(len(data["edges"]))
except:
```
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
-except:
+except json.JSONDecodeError:
+    # Ignore malformed JSON output for this iteration; timing is still recorded.
```
scripts/perf-test.py
Outdated
```python
    data = json.loads(output)
    if "modules" in data and data["modules"]:
        module_name = data["modules"][0].get("name", "src")
except:
```
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
 except:
+    # If parsing the overview JSON or accessing its structure fails for any reason,
+    # silently fall back to the default module name ("src") defined above.
```
- Fix stress test to use correct CLI flags (--symbols/--related) and cwd
- Fix p95 calculation threshold (>= 1 instead of > 20)
- Add absolute time threshold (10ms) to comparison logic to avoid false positives on fast operations
- Replace bare except blocks with specific exceptions and comments
- Fix size_str redundancy by extracting value before conditional format
Summary

- Commands use the `cwd` parameter instead of path arguments (commands operate on the current directory)
- Self-benchmark mode (`--self`) for testing without external repos
- Results saved to `target/perf-history/` in Google Benchmark JSON format

Changes

CLI Command Fixes

- `query overview` / `query callgraph` - use `cwd=repo_path`, no path argument
- `search <pattern>` - use `--symbols`/`--related` flags instead of `--mode`
- `validate` - use `--duplicates` flag, operates on current dir
- `cache clear` - use `cwd=repo_path`

New Features

- `benchmark_validation()` - duplicates and complexity benchmarks
- `benchmark_diff_analysis()` - git diff analysis timing
- `benchmark_cache_performance()` - cold vs warm cache comparison
- `benchmark_query_module()` - module query operations
- `run_self_benchmark()` - benchmark semfora-engine on its own codebase
- `compare_reports()` - compare against baseline for regression tracking

Test plan

- `./scripts/perf-test.py --self --no-build` - all benchmarks pass
- Results written to `target/perf-history/`