Skip to content

feat(benchmark): modernize perf-test.py for new CLI structure#15

Merged
Kadajett merged 2 commits intomainfrom
feat/modernize-perf-benchmarks
Dec 18, 2025
Merged

feat(benchmark): modernize perf-test.py for new CLI structure#15
Kadajett merged 2 commits intomainfrom
feat/modernize-perf-benchmarks

Conversation

@Kadajett
Copy link
Contributor

Summary

  • Update all CLI commands in perf-test.py to use new subcommand structure
  • Fix commands to use cwd parameter instead of path arguments (commands operate on current directory)
  • Add new benchmark categories: validation, diff analysis, cache performance
  • Add self-benchmark mode (--self) for testing without external repos
  • Save results to gitignored target/perf-history/ with Google Benchmark JSON format

Changes

CLI Command Fixes

  • query overview / query callgraph - use cwd=repo_path, no path argument
  • search <pattern> - use --symbols/--related flags instead of --mode
  • validate - use --duplicates flag, operates on current dir
  • cache clear - use cwd=repo_path

New Features

  • benchmark_validation() - duplicates and complexity benchmarks
  • benchmark_diff_analysis() - git diff analysis timing
  • benchmark_cache_performance() - cold vs warm cache comparison
  • benchmark_query_module() - module query operations
  • run_self_benchmark() - benchmark semfora-engine on its own codebase
  • compare_reports() - compare against baseline for regression tracking

Test plan

  • Run ./scripts/perf-test.py --self --no-build - all benchmarks pass
  • Verify JSON output follows Google Benchmark format
  • Verify results saved to target/perf-history/

- Update all CLI commands to use new subcommand structure
- Fix commands to use cwd parameter instead of path arguments
- Commands updated: index generate, query overview, query callgraph,
  search, validate, analyze, cache clear, query module
- Add validation benchmarks (duplicates, complexity)
- Add diff analysis benchmarks
- Add cache performance benchmarks (cold/warm)
- Add self-benchmark mode (--self flag)
- Add comparison mode for regression tracking
- Save results to gitignored target/perf-history/
- Output follows Google Benchmark JSON format
- Update .gitignore for benchmark directories

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR modernizes the perf-test.py script to align with a new CLI structure where commands operate on the current working directory via the cwd parameter rather than accepting path arguments. It also expands benchmark coverage by adding validation, diff analysis, cache performance, and module query benchmarks, plus a self-benchmark mode to test the engine on its own codebase.

Key Changes:

  • Updated CLI commands to use cwd parameter: query overview/callgraph, search, validate, and cache clear now operate on current directory
  • Added new benchmark functions for validation (duplicates/complexity), diff analysis, cache performance (cold vs warm), and module queries
  • Introduced --self flag for benchmarking semfora-engine on its own codebase and --compare flag for regression tracking
  • Results now saved to gitignored target/perf-history/ directory in Google Benchmark JSON format

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated 23 comments.

File Description
scripts/perf-test.py Comprehensive update to CLI command structure, addition of new benchmark categories (validation, diff, cache, module queries), self-benchmark mode, baseline comparison functionality, and improved metadata tracking (engine version, index size, result counts, token estimates)
.gitignore Added entries to ignore benchmark history and output directories (/target/perf-history/, /benchmark_output/, /benchmark_results/)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +880 to +885
if change_pct > 10: # More than 10% slower
regressions.append((curr.name, base.real_time, curr.real_time, change_pct))
elif change_pct < -10: # More than 10% faster
improvements.append((curr.name, base.real_time, curr.real_time, change_pct))
else:
unchanged.append((curr.name, base.real_time, curr.real_time, change_pct))
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comparison logic uses a 10% threshold but doesn't handle the case where baseline time is very small (close to zero). This could lead to false positives where tiny absolute differences (e.g., 0.001s to 0.0015s) trigger regression alerts. Consider adding a minimum absolute time threshold in addition to the percentage threshold.

Copilot uses AI. Check for mistakes.
Comment on lines 748 to 752
mode = random.choice(["symbols", "semantic"])
cmd = [str(ENGINE_BIN), "search", pattern, str(path), "--mode", mode, "--limit", "20"]
start = time.perf_counter()
try:
subprocess.run(cmd, capture_output=True, timeout=30)
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The command is using the deprecated --mode flag instead of the new --symbols or --related flags. Additionally, it should use cwd=path instead of passing path as an argument, consistent with the new CLI structure used elsewhere in the file (e.g., line 318).

Suggested change
mode = random.choice(["symbols", "semantic"])
cmd = [str(ENGINE_BIN), "search", pattern, str(path), "--mode", mode, "--limit", "20"]
start = time.perf_counter()
try:
subprocess.run(cmd, capture_output=True, timeout=30)
flag = random.choice(["--symbols", "--related"])
cmd = [str(ENGINE_BIN), "search", pattern, flag, "--limit", "20"]
start = time.perf_counter()
try:
subprocess.run(cmd, capture_output=True, timeout=30, cwd=path)

Copilot uses AI. Check for mistakes.
"avg_latency_ms": round(avg_latency * 1000, 2),
"min_latency_ms": round(min(successful) * 1000, 2) if successful else 0,
"max_latency_ms": round(max(successful) * 1000, 2) if successful else 0,
"p95_latency_ms": round(sorted(successful)[int(len(successful) * 0.95)] * 1000, 2) if len(successful) > 20 else 0,
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The p95 calculation will fail with an IndexError when the list has exactly 20 successful requests. The condition checks len(successful) > 20, but at exactly 20 items, int(20 * 0.95) = 19 which is a valid index. However, the issue is that this will incorrectly return 0 for exactly 20 items. The condition should be >= 20 or better yet, use a smaller threshold like > 1 since percentile calculations are meaningful with just a few data points.

Suggested change
"p95_latency_ms": round(sorted(successful)[int(len(successful) * 0.95)] * 1000, 2) if len(successful) > 20 else 0,
"p95_latency_ms": round(sorted(successful)[int(len(successful) * 0.95)] * 1000, 2) if len(successful) > 1 else 0,

Copilot uses AI. Check for mistakes.
def main():
parser = argparse.ArgumentParser(description="Semfora Performance Test Suite")
parser.add_argument("--quick", action="store_true", help="Quick smoke test (small repos only)")
parser.add_argument("--self", action="store_true", help="Benchmark semfora-engine codebase itself")
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The argument name --self is problematic because self is a Python keyword. While it works in argparse when accessed as args.self, this is not idiomatic Python and could cause confusion. Consider using --self-benchmark or --engine-self instead.

Copilot uses AI. Check for mistakes.
Comment on lines +322 to +330
# Try to extract result count from JSON output
try:
data = json.loads(output)
if isinstance(data, list):
result_counts.append(len(data))
elif isinstance(data, dict) and "matches" in data:
result_counts.append(len(data["matches"]))
except:
pass
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function attempts to extract JSON data without first checking if the command succeeded with valid output. If the command fails but still produces some text output that happens to be parseable as JSON, this could lead to incorrect result counts being recorded. The JSON parsing should only occur within the if success: block.

Copilot uses AI. Check for mistakes.
for f in index_path.rglob("*"):
if f.is_file():
total += f.stat().st_size
except:
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'except' clause does nothing but pass and there is no explanatory comment.

Copilot uses AI. Check for mistakes.
for line in f:
if line.startswith("VmRSS:"):
return int(line.split()[1]) / 1024
except:
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except:
except:
# Best-effort metric: ignore any errors (e.g., when /proc is unavailable)

Copilot uses AI. Check for mistakes.
result_counts.append(len(data))
elif isinstance(data, dict) and "matches" in data:
result_counts.append(len(data["matches"]))
except:
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except:
except:
# If JSON parsing or structure inspection fails, just skip result count
# extraction; timing metrics are still valid without this metadata.

Copilot uses AI. Check for mistakes.
edge_counts.append(data["edge_count"])
elif "edges" in data:
edge_counts.append(len(data["edges"]))
except:
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except:
except json.JSONDecodeError:
# Ignore malformed JSON output for this iteration; timing is still recorded.

Copilot uses AI. Check for mistakes.
data = json.loads(output)
if "modules" in data and data["modules"]:
module_name = data["modules"][0].get("name", "src")
except:
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except:
except:
# If parsing the overview JSON or accessing its structure fails for any reason,
# silently fall back to the default module name ("src") defined above.

Copilot uses AI. Check for mistakes.
- Fix stress test to use correct CLI flags (--symbols/--related) and cwd
- Fix p95 calculation threshold (>= 1 instead of > 20)
- Add absolute time threshold (10ms) to comparison logic to avoid false
  positives on fast operations
- Replace bare except blocks with specific exceptions and comments
- Fix size_str redundancy by extracting value before conditional format
@Kadajett Kadajett merged commit 95b1f30 into main Dec 18, 2025
1 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants