Build GraphOptim — an open-source Python library (published on PyPI) that parses AI-generated code into weighted graph representations and applies graph-theoretic optimization algorithms to detect and fix structural inefficiencies that rule-based linters cannot catch.
The library targets developers and teams who heavily use LLM coding assistants (Copilot, Claude, Cursor, Gemini) and need a post-processing optimization layer that understands code as a graph, not just as text or AST tokens.
AI-generated code contains graph-level structural inefficiencies — redundant paths, inflated cyclomatic complexity, dead nodes, and over-centralized call chains — that rule-based linters (flake8, pylint, ruff) systematically miss because they operate on token/rule patterns, not graph topology.
This hypothesis must be benchmarked empirically before the tool is published. The benchmark pipeline is part of the library itself — users can run it against their own codebases.
Before building the optimizer, validate the hypothesis with a rigorous benchmark.
- Download the HumanEval dataset (OpenAI, 164 problems) from
openai/human-evalon GitHub - Download MBPP (Google, 500 problems) from
google-research-datasets/mbpp - Select 100 problems that have a clean, single-function human reference solution
- For each problem, generate an LLM solution using three models via API:
- Claude (via Anthropic API —
claude-opus-4-5or latest) - GPT-4o (via OpenAI API)
- Gemini 1.5 Pro (via Google Generative AI API)
- Claude (via Anthropic API —
- Use the same prompt for all models:
Write a Python function that solves the following problem. Return only the function code, no explanations. [problem description] - Store all solutions in a structured JSON dataset:
benchmark_dataset.json
For each function (human + 3 LLM variants), extract the following:
# Required libraries:
# pip install radon staticfg networkx scipy pandas matplotlib ast-comments
import ast
import networkx as nx
from radon.complexity import cc_visit
from radon.metrics import h_visit # Halstead metrics
from staticfg import CFGBuilder
def extract_cfg_metrics(source_code: str, func_name: str) -> dict:
"""
Parse source → CFG → weighted DiGraph → extract graph metrics.
Returns a dict of all measurable graph properties.
"""
# Build CFG
cfg = CFGBuilder().build_from_src(func_name, source_code)
G = nx.DiGraph()
for block in cfg:
for exit_ in block.exits:
G.add_edge(block.id, exit_.target.id, weight=1)
N = G.number_of_nodes()
E = G.number_of_edges()
CC = E - N + 2 # McCabe cyclomatic complexity
# Dead node detection: nodes with in-degree=0 that are not the entry node
entry = list(nx.topological_sort(G))[0] if nx.is_directed_acyclic_graph(G) else None
dead_nodes = [n for n in G.nodes if G.in_degree(n) == 0 and n != entry]
# Shortest / longest path (critical path)
try:
avg_path = nx.average_shortest_path_length(G)
diameter = nx.diameter(G.to_undirected()) if nx.is_connected(G.to_undirected()) else None
except Exception:
avg_path = None
diameter = None
# Centrality
betweenness = nx.betweenness_centrality(G)
max_betweenness = max(betweenness.values()) if betweenness else 0
# Radon metrics
cc_results = cc_visit(source_code)
radon_cc = cc_results[0].complexity if cc_results else CC
return {
"nodes": N,
"edges": E,
"cyclomatic_complexity": radon_cc,
"dead_nodes_count": len(dead_nodes),
"avg_shortest_path": avg_path,
"cfg_diameter": diameter,
"max_betweenness_centrality": max_betweenness,
"node_edge_ratio": E / N if N > 0 else 0,
"lines_of_code": len([l for l in source_code.split('\n') if l.strip()])
}| Metric | Formula | What it detects |
|---|---|---|
| Cyclomatic Complexity | E - N + 2P |
Overall branching density |
| Dead Nodes | In-degree=0, not entry | Unreachable code blocks |
| CFG Diameter | Longest shortest path | Unnecessarily deep execution chains |
| Avg Shortest Path | Mean over all node pairs | Execution efficiency |
| Max Betweenness Centrality | Node bottleneck score | Over-centralized functions |
| Node/Edge Ratio | E/N | Branching density |
| Lines of Code | Non-blank lines | Verbosity |
For each metric, run Mann-Whitney U test (non-parametric) between human and each LLM group:
from scipy.stats import mannwhitneyu, kruskal
import pandas as pd
def run_statistical_tests(df: pd.DataFrame) -> pd.DataFrame:
metrics = ['cyclomatic_complexity', 'dead_nodes_count', 'avg_shortest_path',
'cfg_diameter', 'max_betweenness_centrality', 'lines_of_code']
results = []
for metric in metrics:
human = df[df['source'] == 'human'][metric].dropna()
for model in ['claude', 'gpt4o', 'gemini']:
llm = df[df['source'] == model][metric].dropna()
stat, p = mannwhitneyu(human, llm, alternative='two-sided')
results.append({
'metric': metric,
'model': model,
'human_median': human.median(),
'llm_median': llm.median(),
'p_value': p,
'significant': p < 0.05
})
return pd.DataFrame(results)Produce:
benchmark_results.json— raw metrics for all 400 functionsstatistical_report.csv— test results per metric/model pairbenchmark_report.md— human-readable findings with plots
This benchmark becomes the scientific foundation of the library and a key part of the README.
graphoptim/
├── __init__.py
├── parser/
│ ├── __init__.py
│ ├── cfg_builder.py # AST → CFG → NetworkX DiGraph
│ ├── dfg_builder.py # Data Flow Graph builder
│ └── ast_utils.py # AST helpers and node weighers
├── analyzer/
│ ├── __init__.py
│ ├── metrics.py # Graph metric extraction
│ ├── patterns.py # Pattern detectors (dead nodes, redundant paths, etc.)
│ └── reporter.py # Human-readable report generator
├── optimizer/
│ ├── __init__.py
│ ├── passes/
│ │ ├── dead_code.py # Dead node elimination
│ │ ├── path_shortener.py # Redundant branch merging
│ │ ├── centrality.py # Over-centralized node splitting
│ │ └── knapsack.py # Optimal pass selection via 0/1 knapsack
│ └── rewriter.py # Graph changes → code AST rewrite
├── benchmark/
│ ├── __init__.py
│ ├── collector.py # Dataset downloader and LLM solution generator
│ ├── runner.py # Full benchmark pipeline runner
│ └── report.py # Statistical analysis and report generation
├── cli.py # Command-line interface
└── config.py # Configuration dataclasses
The library must have a clean, minimal API. Three main entry points:
import graphoptim as go
# Analyze a single function
report = go.analyze(source_code)
print(report.cyclomatic_complexity) # int
print(report.dead_nodes) # List[DeadNode]
print(report.redundant_paths) # List[RedundantPath]
print(report.bottlenecks) # List[BottleneckNode]
print(report.score) # 0-100 optimization score
print(report.summary()) # human-readable string
# Analyze a whole file
file_report = go.analyze_file("mymodule.py")
for func_report in file_report.functions:
print(func_report.function_name, func_report.score)
# Analyze a whole directory
project_report = go.analyze_project("./src")# Optimize and return improved source code
optimized_code = go.optimize(source_code)
# With configuration — choose which passes to apply
optimized_code = go.optimize(
source_code,
passes=["dead_code", "path_shortener"], # explicit passes
budget={"max_changes": 5, "min_improvement": 0.1} # knapsack budget
)
# Optimize a file in-place (creates .bak backup)
go.optimize_file("mymodule.py", inplace=True)
# Optimize all files in a project
go.optimize_project("./src", output_dir="./src_optimized")from graphoptim.benchmark import Benchmark
bm = Benchmark(
anthropic_api_key="...",
openai_api_key="...",
google_api_key="...",
n_samples=100,
dataset="humaneval" # or "mbpp"
)
results = bm.run()
results.save("my_benchmark_results/")
results.show_report() # opens HTML reportDetection: Nodes in the CFG with in_degree == 0 that are not the function entry point.
Fix: Remove the corresponding AST nodes from the source.
class DeadCodePass:
def detect(self, cfg: nx.DiGraph, ast_tree: ast.AST) -> List[DeadNode]:
"""Find unreachable code blocks."""
...
def fix(self, ast_tree: ast.AST, dead_nodes: List[DeadNode]) -> ast.AST:
"""Remove dead AST nodes."""
...Detection: Two or more paths in the CFG that:
- Start and end at the same nodes
- Have identical or equivalent intermediate operations
- Represent duplicate conditional branches
Algorithm: Build a path matrix, compare path fingerprints (hash of AST node sequence), flag duplicates.
Fix: Merge duplicate branches into a single path, possibly extracting the shared logic into a helper.
Detection: Functions where the critical execution path (longest weighted path in the CFG) is unnecessarily long due to sequential operations that could be restructured.
Algorithm:
- Weight edges by estimated operation cost (I/O ops = high weight, arithmetic = low weight)
- Find the critical path using longest-path algorithm on DAG
- Suggest reordering or early-exit insertions
Detection: Nodes with betweenness_centrality > threshold (default: 0.7) — these are bottleneck nodes that every execution path passes through, indicating the function is over-centralized.
Fix: Suggest extracting the bottleneck node's logic into a helper function to improve testability and reusability.
This is the key innovation. Given a set of available optimization passes, each with:
cost: estimated code change risk (0.0 to 1.0)benefit: estimated complexity reduction (0.0 to 1.0)prerequisites: passes that must run first
Solve the 0/1 Knapsack problem to select the optimal subset of passes within a user-defined budget:
from graphoptim.optimizer.passes.knapsack import KnapsackSelector
selector = KnapsackSelector(budget=0.6) # max total risk = 0.6
available_passes = [dead_code_pass, path_merger, critical_path, centrality]
# Returns the optimal subset maximizing total benefit within budget
selected = selector.select(available_passes, current_metrics)Algorithm: Dynamic programming O(n·W) knapsack where:
- n = number of available passes
- W = budget (discretized into integer units, e.g., 0.6 → 60)
The CLI is a first-class citizen. Users should be able to use graphoptim without writing Python.
# Install
pip install graphoptim
# Analyze a file
graphoptim analyze myfile.py
# Analyze with detailed output
graphoptim analyze myfile.py --verbose --output report.html
# Optimize a file (preview only)
graphoptim optimize myfile.py --dry-run
# Optimize in place
graphoptim optimize myfile.py --inplace
# Optimize a whole project
graphoptim optimize ./src --output ./src_optimized
# Run the benchmark (requires API keys)
graphoptim benchmark --samples 100 --models claude,gpt4o,gemini
# Show config
graphoptim config show
# Set API keys
graphoptim config set anthropic_api_key sk-ant-...CLI output example for graphoptim analyze myfile.py:
GraphOptim Analysis — myfile.py
═══════════════════════════════════════
Functions analyzed: 4
Total optimization score: 62/100
function: process_data() score: 45/100 ⚠ needs attention
├─ Cyclomatic complexity: 12 (threshold: 7)
├─ Dead nodes: 2 (unreachable branches at lines 34, 51)
├─ Betweenness bottleneck: 0.9 (line 22 is critical path bottleneck)
└─ Suggested passes: dead_code, centrality_split
function: validate_input() score: 88/100 ✓ good
function: format_output() score: 71/100 ~ acceptable
function: main() score: 44/100 ⚠ needs attention
Run `graphoptim optimize myfile.py` to apply fixes.
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "graphoptim"
version = "0.1.0"
description = "Graph-theoretic optimizer for AI-generated Python code"
readme = "README.md"
license = {text = "MIT"}
requires-python = ">=3.9"
keywords = ["ai", "code-optimization", "graph-theory", "static-analysis", "llm"]
classifiers = [
"Development Status :: 3 - Alpha",
"Intended Audience :: Developers",
"Topic :: Software Development :: Quality Assurance",
"Programming Language :: Python :: 3.9",
]
dependencies = [
"networkx>=3.0",
"radon>=6.0",
"staticfg>=0.1",
"scipy>=1.10",
"pandas>=2.0",
"rich>=13.0", # Beautiful CLI output
"click>=8.0", # CLI framework
"anthropic>=0.25", # Claude API (optional, for benchmark)
"openai>=1.0", # OpenAI API (optional, for benchmark)
"google-generativeai>=0.5", # Gemini API (optional, for benchmark)
]
[project.optional-dependencies]
benchmark = ["anthropic", "openai", "google-generativeai", "matplotlib", "seaborn"]
dev = ["pytest", "black", "ruff", "mypy"]
[project.scripts]
graphoptim = "graphoptim.cli:main"- The benchmark results — concrete numbers: "LLM-generated code has X% higher cyclomatic complexity, Y% more dead nodes"
- A 30-second quickstart — install → analyze → optimize
- How graph theory applies — one clear diagram showing CFG nodes/edges
- Comparison table with existing tools (flake8, pylint, ruff) showing what GraphOptim catches that they miss
- CI badge and PyPI version badge
Every optimization pass must have unit tests with known-bad and known-good Python snippets:
# tests/test_dead_code_pass.py
def test_dead_code_detected():
code = """
def foo(x):
if x > 0:
return x
return -x
print("never reached") # dead code
"""
report = go.analyze(code)
assert report.dead_nodes_count == 1
assert report.dead_nodes[0].line == 7
def test_dead_code_eliminated():
optimized = go.optimize(code, passes=["dead_code"])
assert "never reached" not in optimized- Run analyze + optimize on real Python stdlib functions
- Verify that optimized code is syntactically valid (
ast.parse()must not raise) - Verify that optimized code passes the original function's tests (where available from HumanEval)
- Store 20 known problematic AI-generated snippets
- After every change, verify all 20 are still correctly detected and optimized
# Clone and set up
git clone https://github.com/<your-org>/graphoptim
cd graphoptim
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,benchmark]"
# API keys for benchmark (set in environment)
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="AI..."
# Run tests
pytest tests/ -v
# Run benchmark (generates benchmark_results/)
graphoptim benchmark --samples 20 --models claude # start small
# Build and check package
python -m build
twine check dist/*| Step | What to build | Done when |
|---|---|---|
| 1 | parser/cfg_builder.py — CFG from Python source |
CFG of a 10-line function has correct nodes/edges |
| 2 | analyzer/metrics.py — extract all 7 metrics |
Metrics match radon's output for CC |
| 3 | benchmark/collector.py — download HumanEval, call APIs |
100 function pairs collected and stored as JSON |
| 4 | benchmark/runner.py — run metrics + stats |
Statistical report shows p-values per metric |
| 5 | optimizer/passes/dead_code.py |
Dead code removed from 5 test cases |
| 6 | optimizer/passes/knapsack.py |
Pass selector returns optimal subset |
| 7 | optimizer/rewriter.py — graph edits → code |
Output is valid Python (ast.parse() passes) |
| 8 | cli.py — analyze and optimize commands |
CLI works end-to-end on a real file |
| 9 | PyPI publish | pip install graphoptim works |
| 10 | benchmark CLI command |
Users can reproduce the benchmark |
- Never break working code. The optimizer must produce syntactically valid, semantically equivalent code. When unsure, emit a warning and skip the pass rather than risking a bad rewrite.
- Preserve docstrings and comments. AST rewrites must carry through all decorators, docstrings, and inline comments.
- Dry-run by default.
optimize()returns the new code as a string; it never writes to disk unless explicitly asked viaoptimize_file(inplace=True). - Graceful degradation. If
staticfgfails to parse a function (e.g., async functions, complex decorators), fall back to AST-only metrics rather than crashing. - Python-first, language-agnostic design. The
parser/andoptimizer/modules should be designed so that a futuregraphoptim-jsorgraphoptim-rspackage can follow the same architecture.
| Tool | Approach | Misses |
|---|---|---|
| flake8 / ruff | Rule-based token patterns | Graph-level redundancy, dead paths |
| pylint | AST heuristics + rules | Topological inefficiencies |
| Graphify | CFG visualization only | No optimization, no metrics |
| SonarQube | Rule-based + some metrics | No graph-theoretic pass selection |
| GraphOptim | Weighted CFG + graph algorithms + knapsack pass selection | — |
This document is the complete specification for GraphOptim v0.1. Start with Phase 0 (benchmark) — the empirical results will validate the core hypothesis and form the scientific credibility of the library.