This document explains the internals of the complexity analyzer: what it measures, how it processes PRs, and how you can fine-tune it for your organization.
The complexity analyzer uses an LLM to estimate the implementation complexity of a GitHub pull request on a 1–10 scale. The goal is to approximate how "hard" the PR was to implement from a developer's perspective—design, implementation, and testing effort—not raw line count or operational risk.
PR URL → Fetch from GitHub → Preprocess Diff → Build Prompt → LLM Analysis → Parse & Return Score
- Uses the GitHub API to fetch the PR diff and metadata (title, additions, deletions, changed files)
- Supports token rotation for high-volume batch analysis
- Requires `GH_TOKEN` or `GITHUB_TOKEN` for private repos or higher rate limits
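As a rough illustration of this step, the sketch below fetches the metadata and the raw diff from the GitHub REST API. The endpoint and diff media type are GitHub's documented behavior, but the function name and structure are illustrative and not the tool's actual code.

```python
# Illustrative fetch sketch (not the tool's actual implementation).
# Assumes the `requests` library and an optional GH_TOKEN/GITHUB_TOKEN env var.
import os
import requests

def fetch_pr(owner: str, repo: str, number: int) -> tuple[dict, str]:
    """Return (metadata, unified diff) for a pull request."""
    token = os.environ.get("GH_TOKEN") or os.environ.get("GITHUB_TOKEN")
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    base = f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}"

    # Metadata: title, additions, deletions, changed_files, ...
    meta = requests.get(base, headers=headers, timeout=30).json()

    # The same endpoint returns the raw diff when asked for the diff media type.
    diff = requests.get(
        base,
        headers={**headers, "Accept": "application/vnd.github.v3.diff"},
        timeout=30,
    ).text
    return meta, diff
```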
The raw diff is processed before being sent to the LLM:
| Step | What Happens |
|---|---|
| Redaction | Secrets (API keys, tokens, passwords), emails, and AWS/GitHub keys are replaced with [REDACTED_SECRET] or [REDACTED_EMAIL] |
| Filtering | Binary files, lockfiles (package-lock.json, pnpm-lock.yaml), markdown, and paths like vendor/, node_modules/, dist/, build/, coverage/ are excluded |
| Chunking | Each file is limited to a configurable number of hunks (default: 2 per file) to keep the diff representative without overwhelming the context |
| Truncation | The combined diff is truncated to a token limit (default: 50,000 tokens) using tiktoken |
| Stats | Builds additions, deletions, changedFiles, byExt, byLang, fileCount from metadata and selected files |
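A simplified sketch of the redaction and filtering steps follows. The regex patterns and function names here are illustrative and deliberately less thorough than what `cli/preprocess.py` actually ships.

```python
# Illustrative preprocessing sketch; simplified, not the contents of cli/preprocess.py.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SECRET_RE = re.compile(r"(?i)(api[_-]?key|token|password|secret)\s*[:=]\s*\S+")
SKIP_PATH_RE = re.compile(r"(^|/)(vendor|node_modules|dist|build|coverage)/")
SKIP_FILE_RE = re.compile(r"(package-lock\.json|pnpm-lock\.yaml|\.md$)")

def redact(text: str) -> str:
    """Replace likely secrets and emails before the diff leaves the machine."""
    text = SECRET_RE.sub("[REDACTED_SECRET]", text)
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def keep_file(path: str) -> bool:
    """Drop vendored, generated, and lockfile paths from the excerpt."""
    return not (SKIP_PATH_RE.search(path) or SKIP_FILE_RE.search(path))
```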
The LLM receives a structured input containing:
- PR URL and title
- Stats summary (additions, deletions, changed files, file list)
- Diff excerpt between `--- DIFF START ---` and `--- DIFF END ---`
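For illustration, the user message could be assembled roughly like this; the exact field order and wording live in the real prompt builder and may differ.

```python
# Illustrative assembly of the user message sent to the LLM.
import json

def build_user_message(url: str, title: str, stats: dict, diff_excerpt: str) -> str:
    return (
        f"PR: {url}\n"
        f"Title: {title}\n"
        f"Stats: {json.dumps(stats)}\n"
        "--- DIFF START ---\n"
        f"{diff_excerpt}\n"
        "--- DIFF END ---"
    )
```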
The LLM (OpenAI, Anthropic, or AWS Bedrock) is given:
- System prompt — Instructions defining what "complexity" means and how to score
- User message — The formatted diff excerpt, stats JSON, and title
The prompt instructs the model to:
- Focus on code and logic complexity, not line count
- Consider: non-trivial logic, modules touched, conceptual difficulty, testing effort
- Avoid conflating: large data/lockfile diffs with high complexity, operational risk with implementation difficulty
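A minimal sketch of the call for the OpenAI provider is shown below; Anthropic and Bedrock follow the same system/user-message shape through their own SDKs. The function name and the absence of error handling are illustrative.

```python
# Illustrative OpenAI-provider call; assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

def score_with_llm(system_prompt: str, user_message: str, model: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    # Return the raw text so the caller can parse the strict-JSON response.
    return response.choices[0].message.content
```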
The LLM must respond with strict JSON:
{"complexity": 5, "explanation": "Multiple modules with non-trivial control flow changes"}The response is parsed, validated, and the complexity score is clamped to 1–10.
The prompt defines complexity as implementation effort, not:
- Raw line count
- Operational risk (e.g., dangerous migrations)
- Deployment complexity
| Factor | Low (1–3) | Medium (4–6) | High (7–10) |
|---|---|---|---|
| Scope | Single file, localized | Several modules/services | Cross-cutting, many components |
| Logic | Guard clauses, simple conditions, decorators | New abstractions, orchestration, mappings | Complex workflows, concurrency, stateful migrations |
| Testing | Minimal or straightforward | Moderate coverage | Broad and deep test changes |
| Data | Simple schema changes | Migration/ETL with mappings | Bidirectional compatibility, rollback considerations |
| Band | Score | Typical Characteristics |
|---|---|---|
| Almost trivial | 1–2 | Single guard clause, config tweak, simple decorator, tiny migration |
| Small but non-trivial | 3–4 | Localized changes, simple business logic, basic mappings |
| Medium | 5–6 | Multiple modules, non-trivial orchestration, DTOs and mappings |
| Large or sophisticated | 7–8 | Cross-cutting changes, complex workflows, concurrency |
| Very complex | 9–10 | Major architectural changes, intricate migrations, distributed systems |
You can adapt the analyzer to your team's norms in several ways.
Use `--prompt-file` (or `-p`) to supply your own prompt:
```bash
complexity-cli analyze-pr "https://github.com/org/repo/pull/123" --prompt-file ./prompts/org-specific.txt
```

What to customize in the prompt:
- Domain-specific examples — Add examples from your codebase (e.g., "Our ETL rivers typically involve X, Y, Z")
- Scoring bands — Adjust the 1–10 bands to match your team's expectations (e.g., if "medium" PRs in your org are usually 4–5, not 5–6)
- Factors to emphasize — E.g., "In our org, data pipeline changes are typically higher complexity due to downstream dependencies"
- Factors to de-emphasize — E.g., "We treat lockfile-only dependency bumps as 1 regardless of size"
- Output format — The prompt must still instruct the LLM to return `{"complexity": N, "explanation": "..."}`
Example custom prompt addition (append to the default or use as a base):
```text
------------------------------------------------------------
ORGANIZATION-SPECIFIC GUIDANCE
------------------------------------------------------------
For [YourOrg]:
- Data pipeline / river changes: Add +1 to base score if multiple sources or targets are involved.
- Infra/DevOps PRs: Treat Terraform/K8s changes as moderate (4–6) when touching multiple environments.
- Frontend-only PRs: UI-only changes without new logic are typically 2–4.
```
| Option | Default | Purpose |
|---|---|---|
| `--max-tokens` | 50000 | Increase if your PRs are large and you want more context; decrease to reduce cost/latency |
| `--hunks-per-file` | 2 | Increase to show more of each file (e.g., 3–4 for refactors); decrease for very large PRs |
```bash
complexity-cli analyze-pr "https://github.com/org/repo/pull/123" --max-tokens 80000 --hunks-per-file 3
```

To exclude more file types or paths, you would need to modify `cli/preprocess.py`:
- `IGNORE_EXT_RE` — File extensions to skip (e.g., add `.snap` for Jest snapshots)
- `IGNORE_PATH_RE` — Path patterns (e.g., add `generated/` or `__fixtures__/`)
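For example, the additions might look like this; the actual patterns in `cli/preprocess.py` will differ, so treat these regexes as purely illustrative.

```python
# Illustrative edit to cli/preprocess.py; the shipped patterns will differ.
import re

# Add .snap (Jest snapshots) alongside whatever extensions are already ignored.
IGNORE_EXT_RE = re.compile(r"\.(lock|min\.js|map|snap)$")

# Add generated/ and __fixtures__/ to the ignored path patterns.
IGNORE_PATH_RE = re.compile(
    r"(^|/)(vendor|node_modules|dist|build|coverage|generated|__fixtures__)/"
)
```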
The README suggests mapping scores to t-shirt sizes for velocity reporting:
| Score | Size | Weight |
|---|---|---|
| 1–2 | XS | 0 |
| 3 | S | 1 |
| 4 | M | 2 |
| 5–6 | L | 3 |
| 7+ | XL | 4 |
You can change these weights in your reporting/BI layer to match how your org values different PR sizes.
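For example, a reporting layer could apply the mapping with a small helper like the one below; the bands and weights are the README's suggestion, not something the analyzer enforces.

```python
# Illustrative score-to-size mapping for a reporting/BI layer.
def tshirt(score: int) -> tuple[str, int]:
    if score <= 2:
        return "XS", 0
    if score == 3:
        return "S", 1
    if score == 4:
        return "M", 2
    if score <= 6:
        return "L", 3
    return "XL", 4
```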
- Run batch analysis on a sample of closed PRs:

  ```bash
  complexity-cli batch-analyze --org yourorg --since 2024-01-01 --until 2024-01-31 --output calibration.csv
  ```
- Manually review a subset and note where scores feel off.
- Update your custom prompt with guidance that corrects those cases.
- Re-run and compare until the distribution matches your expectations.
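To compare runs, a quick look at the score distribution is often enough. The snippet below assumes the output CSV has a `complexity` column, which may not match the actual batch-analyze schema.

```python
# Illustrative distribution check on a calibration run (column name assumed).
import pandas as pd

scores = pd.read_csv("calibration.csv")
print(scores["complexity"].describe())
print(scores["complexity"].value_counts().sort_index())
```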
| Config | Default | Description |
|---|---|---|
| `max_tokens` | 50000 | Max tokens for diff excerpt |
| `hunks_per_file` | 2 | Max hunks per file in excerpt |
| `model` | gpt-5.2 | LLM model (provider-specific) |
| `timeout` | 120 | Request timeout (seconds) |
| `provider` | openai | openai, anthropic, or bedrock |
| `prompt_file` | (embedded) | Path to custom prompt |
| File | Role |
|---|---|
| `cli/analyze.py` | Core analysis flow: fetch → process → LLM → parse |
| `cli/preprocess.py` | Redaction, filtering, chunking, truncation, stats |
| `cli/prompt/default.txt` | Default LLM prompt (scoring rules, factors, examples) |
| `cli/scoring.py` | Parse LLM JSON response, validate, clamp score |
| `cli/llm*.py` | Provider implementations (OpenAI, Anthropic, Bedrock) |