GOSICK-Angel/code-merge-system

🔀 Code Merge System

Ship upstream upgrades to long-lived forks — without the 500-file conflict nightmare.

A multi-agent pipeline that turns months of upstream drift into an auditable, resumable, and safe merge — preserving every fork customization along the way.


The Problem

Teams that maintain a long-lived fork face a brutal reality when syncing with upstream:

  • Hundreds to thousands of file conflicts — impossible to handle manually, one by one
  • Line-level diffs hide semantic intent — LLMs and humans both make the wrong call
  • Fork-only customizations get silently overwritten — APIs, routes, CI jobs, sentinels disappear without a trace
  • One wrong merge creates runtime vulnerabilities or missing features — and they're hard to roll back

git merge gives you a list of conflicts. Code Merge System gives you a decision pipeline.


Quick Start

pip install code-merge-system

export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...

cd /path/to/your-fork-repo
merge upstream/main --dry-run    # preview the plan before touching any files

First run opens a browser UI and walks you through a one-time setup wizard. Your config is saved to .merge/config.yaml — no wizard on subsequent runs.


See It In Action

Plan Review — 124 files analyzed, 87.9% auto-merge confidence, risk distribution across A–E change categories.

Conflict Resolution — Side-by-side intent analysis of fork vs. upstream changes, with LLM-recommended merge strategy (SEMANTIC_MERGE 85% confidence).

Judge Verdict — Independent review agent audits every merged file; CRITICAL/HIGH/MEDIUM/LOW issue breakdown with repair rounds.

Run Report — Full cost accounting ($0.04 for 124 files), per-agent token breakdown, learned memory entries for future runs.


How It Works

Eight phases driven by a state machine. Seven specialized agents. Every write is snapshotted. Any Ctrl+C is safe.

┌─────────────────────────────────────────────────────────────┐
│  CLI / Web UI                                               │
│         │                                                   │
│   Orchestrator ── 8-phase state machine                    │
│         │                                                   │
│  ┌──────┴───────┐                                          │
│  │              │                                           │
│ Agents        Tools              Memory                     │
│ (7 roles)   (50+ deterministic   (L0/L1/L2                  │
│              + AST parsers)       cross-run store)          │
│  │                                                          │
│ LLM layer (Anthropic + OpenAI, credential pool, routing)   │
└─────────────────────────────────────────────────────────────┘

| Phase | What happens |
| --- | --- |
| INITIALIZE | 3-way classification, risk scoring, fork-profile routing |
| PLANNING | Planner generates merge plan with per-file strategy |
| PLAN_REVIEW | PlannerJudge audits the plan; up to 2 revision rounds |
| AWAITING_HUMAN | You review the plan report; fill in any HUMAN_REQUIRED decisions |
| AUTO_MERGING | Executor applies auto-safe files with snapshot-before-write |
| CONFLICT_ANALYSIS | ConflictAnalyst does semantic analysis on risky conflicts |
| JUDGE_REVIEW | Judge + 50+ deterministic scanners audit all merged output |
| COMPLETED | Full report generated; you decide when to git commit |

| Agent | Role | Default Model |
| --- | --- | --- |
| Planner | Generates merge plan | Claude Opus |
| PlannerJudge | Reviews plan (read-only) | GPT-4o |
| ConflictAnalyst | Semantic analysis of high-risk conflicts | Claude Sonnet |
| Executor | Sole write authority; applies merges | GPT-4o |
| Judge | Reviews merged output + runs deterministic checks | Claude Opus |
| HumanInterface | Generates decision templates | Claude Haiku |
| SmokeTest | Post-merge smoke testing | |

Why two LLM providers? Planner and Judge run on Anthropic models; Executor and PlannerJudge run on OpenAI models. Each writer is therefore reviewed by a model from a different provider, which reduces the risk of collusion bias.
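The dual-provider rule can be expressed as a small consistency check. The sketch below is illustrative only: the provider mapping follows the agent table, but the author/reviewer pairing and the function name `same_provider_pairs` are assumptions, not the project's code.

```python
# Agent -> provider, taken from the agent table above.
PROVIDER = {
    "Planner": "anthropic",
    "PlannerJudge": "openai",
    "ConflictAnalyst": "anthropic",
    "Executor": "openai",
    "Judge": "anthropic",
    "HumanInterface": "anthropic",
}

# (author, reviewer) pairs. This pairing is an assumption for illustration:
# the plan is written by Planner and audited by PlannerJudge; merged files
# are written by Executor and audited by Judge.
REVIEW_PAIRS = [("Planner", "PlannerJudge"), ("Executor", "Judge")]


def same_provider_pairs(provider, pairs):
    """Return the pairs whose author and reviewer share a provider."""
    return [(a, r) for a, r in pairs if provider[a] == provider[r]]


violations = same_provider_pairs(PROVIDER, REVIEW_PAIRS)
print(violations)  # an empty list means every reviewer is cross-provider
```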


Features

Shadow conflicts, interface reverse impacts, top-level call drops, config line preservation, scar auto-learning, and business sentinel scanning — the failure modes that git merge misses entirely.

Every file write creates a snapshot of the original. Any failure triggers automatic rollback. You never end up with a half-merged file.
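As a rough illustration of snapshot-before-write, the sketch below copies the original file aside, writes the new content, and restores the copy if a check fails. The helper name `write_with_snapshot`, its signature, and the `validate` callback are hypothetical; this is not the project's `apply_with_snapshot()` implementation.

```python
import os
import shutil
import tempfile
from pathlib import Path


def write_with_snapshot(path: Path, new_text: str, validate) -> bool:
    """Write new_text to path; restore the original if validate(path) fails.

    Hypothetical helper for illustration only.
    """
    existed = path.exists()
    snapshot = None
    if existed:
        # Copy the original aside before touching it.
        fd, tmp_name = tempfile.mkstemp(suffix=".snapshot")
        os.close(fd)
        snapshot = Path(tmp_name)
        shutil.copy2(path, snapshot)
    try:
        path.write_text(new_text, encoding="utf-8")
        if not validate(path):
            raise ValueError("post-write validation failed")
        return True
    except Exception:
        # Roll back: restore the snapshot, or remove a file we created.
        if existed:
            shutil.copy2(snapshot, path)
        elif path.exists():
            path.unlink()
        return False
    finally:
        if snapshot is not None and snapshot.exists():
            snapshot.unlink()
```

A caller would pass something like a syntax check as `validate`; a file that fails the check ends up byte-for-byte as it was before the write.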

State is persisted after every phase. merge resume --run-id <id> picks up exactly where you left off — useful for large merges that take hours.
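The resume behaviour can be pictured as a loop that persists its progress after each phase and skips finished phases on the next start. This is a deliberately simplified, linear sketch: the checkpoint layout (`{"done": [...]}`), the `handlers` mapping, and `run_pipeline` are assumptions, and the real state machine has more states than the eight phases listed here.

```python
import json
import os
from pathlib import Path

PHASES = [
    "INITIALIZE", "PLANNING", "PLAN_REVIEW", "AWAITING_HUMAN",
    "AUTO_MERGING", "CONFLICT_ANALYSIS", "JUDGE_REVIEW", "COMPLETED",
]


def run_pipeline(checkpoint: Path, handlers, stop_after=None):
    """Run phases in order, saving the list of finished phases after each one."""
    done = []
    if checkpoint.exists():
        done = json.loads(checkpoint.read_text(encoding="utf-8"))["done"]
    for phase in PHASES:
        if phase in done:
            continue  # finished in an earlier run; do not repeat it
        handlers[phase]()
        done.append(phase)
        # Write to a temp file and swap it in, so an interrupt never
        # leaves a half-written checkpoint behind.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps({"done": done}), encoding="utf-8")
        os.replace(tmp, checkpoint)
        if phase == stop_after:
            break  # simulates an interruption after this phase
    return done
```

Calling `run_pipeline` a second time with the same checkpoint path continues from the first unfinished phase.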

No TIMEOUT_DEFAULT. No silent fallbacks. Files that need human judgment generate a decisions.yaml template; skipped decisions stay as AWAITING_HUMAN until explicitly resolved.

Python, TypeScript, JavaScript, Go, Rust, Java, and C all use tree-sitter for semantic-level diff — not just line-level.

Decisions, disputes, and metrics are summarized into a SQLite store. Future runs on the same repo load relevant history to inform planning.

CI validation only flags newly introduced failures — not pre-existing ones. Merging into a repo with a known broken test won't block you.
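The idea of flagging only newly introduced failures reduces to a set difference between the failures seen before and after the merge. A minimal sketch, assuming failures are identified by test name; the function name and the sample test IDs are made up for illustration:

```python
def newly_failing(baseline_failures, post_merge_failures):
    """Return failures present after the merge that were not failing before."""
    return sorted(set(post_merge_failures) - set(baseline_failures))


# Hypothetical test IDs: one test was already broken before the merge.
before = ["tests/test_legacy.py::test_known_broken"]
after = [
    "tests/test_legacy.py::test_known_broken",
    "tests/test_auth.py::test_sso_login",
]
print(newly_failing(before, after))
```

Only `test_sso_login` is reported here, because `test_known_broken` was failing before the merge as well.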

Real-time pipeline progress, conflict resolution UI, plan review, judge verdict — all in a local browser app. Use --no-web for pure terminal output or --ci for JSON output in CI.


Compared to Alternatives

Code Merge System git merge / git rebase GitHub/GitLab UI LLM chat (ChatGPT etc.)
Handles 500+ file conflicts ❌ Manual, one-by-one ❌ Context limit
Preserves fork-only features ✅ Auto-detected via scar/sentinel ❌ Easy to overwrite ❌ No repo context
Auditable decision trail ✅ Per-file, with rationale Partial (PR comments)
Resumable after interrupt ✅ Checkpoint after every phase
Deterministic safety checks ✅ 50+ scanners post-merge
Cost ~$0.04 for 124 files Free Free Per-token, no automation

Can You Trust the Output?

A merge tool is only worth as much as the evidence that its output is correct. This project ships a formal evaluation framework and an auditable self-learning loop — and reports their results honestly, including where the numbers are not yet impressive.

Evaluation against human golden merges

We do not ask the LLM judge to grade its own verdict. The framework under doc/evaluation/ measures system output against expert human golden merges as ground truth, scoring five trust dimensions at once — a system that blindly takes upstream and scores 100% "coverage" while losing half the fork's work must still fail:

| Dimension | Question it answers | Key metrics |
| --- | --- | --- |
| Correctness | Did it merge what should merge, correctly? | miss-merge rate, wrong-merge rate, conflict-resolution accuracy |
| Safety | Did it silently drop private changes? | M1–M6 semantic-loss recall, security-sensitive escalation rate, snapshot rollback rate |
| Process Trust | Does it escalate uncertainty instead of guessing? | over-escalation rate, plan-dispute hit rate, Judge↔ground-truth agreement |
| Explainability | Can every decision be replayed? | rationale completeness, discarded_content retention, trace replayability |
| Operational | Stable across re-runs and models? Cost bounded? | decision consistency, $/run, wall-time P95 |

Three dataset tiers feed it: Tier-1 micro-bench (30–60 PRs, runs in CI), Tier-2 real long-span replays (human merge diff = oracle), Tier-3 adversarial injections (does it actually catch M1–M6?). The harness lives in scripts/eval/ (prepare.py → run.py → diff_against_golden.py → summarize.py → gate.py).

Hard gates that veto a release (acceptance.md): wrong-merge rate = 0%, security-sensitive escalation = 100%, private-content retention = 100%, snapshot rollback = 100%, duplicate top-level symbols = 0, hallucinated cross-module references = 0; miss-merge ≤ 2% (Tier-1), each M1–M6 recall ≥ 95%. Soft gates track overall accuracy (≥ 92% Tier-1), determinism (≥ 90% across 3 runs), cross-model consistency (≥ 85%), and cost/latency drift caps.
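The hard gates above amount to a table of thresholds checked against measured metrics. The sketch below encodes the thresholds from this section; the metric key names, the `failed_hard_gates` function, and the choice to treat a missing metric as a failure are assumptions, and this is not the contents of scripts/eval/gate.py.

```python
import operator

# Thresholds from the hard-gate list above; rates are fractions in [0, 1].
# The key names are hypothetical.
HARD_GATES = {
    "wrong_merge_rate": (operator.eq, 0.0),
    "security_escalation_rate": (operator.eq, 1.0),
    "private_content_retention": (operator.eq, 1.0),
    "snapshot_rollback_rate": (operator.eq, 1.0),
    "duplicate_top_level_symbols": (operator.eq, 0),
    "hallucinated_cross_module_refs": (operator.eq, 0),
    "miss_merge_rate_tier1": (operator.le, 0.02),
    # "each M1-M6 recall >= 95%" is checked via the lowest of the six.
    "min_m1_m6_recall": (operator.ge, 0.95),
}


def failed_hard_gates(metrics):
    """Return the names of hard gates the metrics do not satisfy.

    A missing metric counts as a failure rather than a pass.
    """
    failed = []
    for name, (op, threshold) in HARD_GATES.items():
        value = metrics.get(name)
        if value is None or not op(value, threshold):
            failed.append(name)
    return failed
```

A release candidate would be vetoed whenever the returned list is non-empty.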

Honesty over marketing: the version-baseline table in acceptance.md is still seeded with a template row — no release has cleared the full gate yet, so we make no "evaluated & trusted" claim. The framework exists precisely so that claim, when made, is backed by lockable dataset SHAs and per-file golden diffs rather than a "99% merge success" headline.

Self-learning — measured, not assumed

The system improves across runs without weight fine-tuning and without embeddings — a deliberate choice backed by a 24-source survey (see doc/plan/self-learning-system.md): non-parametric, auditable SQLite memory + execution-grounded reflection beats opaque RL on cost and deletability.

| Phase | What it does | Status |
| --- | --- | --- |
| P0 Effectiveness metric | Ablation harness: memory=on vs memory=off decision lift | Landed (merge eval-memory) |
| P1 Grounded feedback loop | Persistent auditable suppression of harmful entries · confidence write-back from judge+compile+ci signals · verified-repair recipe library | Landed, feedback loops opt-in until ablation proves net gain |
| P2 Memory-quality hardening | High-information entries enforced · key invariants pinned against summarization drift | Landed |
| P3 Offline prompt optimization | merge optimize-prompts ranks gate-prompt variants against a golden set and emits a human-review report; never auto-applies | Landed, opt-in |

The governing rule is measure before you activate: a feedback loop only flips to on-by-default after merge eval-memory shows lift > 0 and causally-attributed harm = 0 on a fixed dataset. First baseline (forgejo, 124 files): lift measured at 0.0000 — so the loops stay opt-in. That run was dominated by deterministic mechanisms (take-target + veto), leaving memory no room to act; it does not prove memory worthless, and an LLM-judgment-dense dataset is needed to measure real lift. We report the zero rather than hide it — that is the trust signal.
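The activation rule can be written down in a few lines. In this sketch the rule itself (lift > 0 and attributed harm = 0) comes from the text above, while the definition of lift as a difference in correct-decision rate, the function names, and the sample counts are assumptions made for illustration.

```python
def decision_lift(correct_with_memory: int, correct_without_memory: int, total: int) -> float:
    """Assumed definition: change in correct-decision rate with memory on vs off."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (correct_with_memory - correct_without_memory) / total


def may_enable_by_default(lift: float, attributed_harm: int) -> bool:
    """A feedback loop flips to on-by-default only with positive lift and zero harm."""
    return lift > 0 and attributed_harm == 0


# Made-up counts: memory changed nothing, so the loop stays opt-in.
lift = decision_lift(correct_with_memory=100, correct_without_memory=100, total=124)
print(lift, may_enable_by_default(lift, attributed_harm=0))
```

With a measured lift of exactly zero, as in the first baseline, the rule keeps the loops opt-in even though no harm was attributed.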


Prerequisites

| Requirement | Notes |
| --- | --- |
| Python 3.11+ | mypy strict / Pydantic v2 / async throughout |
| ANTHROPIC_API_KEY | Planner, ConflictAnalyst, Judge, HumanInterface |
| OPENAI_API_KEY | PlannerJudge, Executor (dual-provider anti-collusion) |
| GITHUB_TOKEN (optional) | GitHub integration: pull PR comments, push merge results |
| Node.js (optional) | Web UI development only; the installed wheel bundles web/dist/ |

Target repo must:

  • Be a git repo with a clean working tree (git status shows no uncommitted changes)
  • Have upstream accessible locally — either as a branch or via git fetch <remote>
# If you haven't added upstream yet:
git remote add upstream https://github.com/<owner>/<repo>.git
git fetch upstream

Full Workflow

1. Plan (dry-run)

cd /path/to/your-fork-repo
merge upstream/main --dry-run

The browser UI opens and runs through INITIALIZE → PLANNING → PLAN_REVIEW → AWAITING_HUMAN, then stops. Check the output reports:

.merge/plans/MERGE_PLAN_<run_id>.md   # file-by-file merge strategy
.merge/runs/<run_id>/plan_review.md   # PlannerJudge audit record

2. Merge

merge upstream/main     # remove --dry-run to run for real

Any Ctrl+C is safe — resume with merge resume --run-id <id>.

3. Handle Human Decisions

When the system pauses at AWAITING_HUMAN, fill in .merge/runs/<id>/decisions.yaml:

- file_path: "backend/services/auth/auth.service.ts"
  decision: take_current          # take_target / take_current / semantic_merge / escalate_human
  rationale: "Fork uses SSO — must preserve"
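
Before resuming, the entries can be sanity-checked against the four decision values listed in the comment above. This is a hedged sketch: it assumes the YAML has already been loaded into a list of dicts by a YAML parser of your choice, and `invalid_entries` is a hypothetical helper, not part of the merge CLI.

```python
ALLOWED_DECISIONS = {"take_target", "take_current", "semantic_merge", "escalate_human"}


def invalid_entries(entries):
    """Return (index, reason) for entries with no file_path or an unknown decision."""
    problems = []
    for i, entry in enumerate(entries):
        if not entry.get("file_path"):
            problems.append((i, "missing file_path"))
        elif entry.get("decision") not in ALLOWED_DECISIONS:
            # A missing decision is reported too: nothing is defaulted.
            problems.append((i, f"unknown decision: {entry.get('decision')!r}"))
    return problems


entries = [
    {
        "file_path": "backend/services/auth/auth.service.ts",
        "decision": "take_current",
        "rationale": "Fork uses SSO — must preserve",
    },
    {"file_path": "backend/app.ts"},  # decision left blank
]
print(invalid_entries(entries))
```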

Then resume:

merge resume --run-id <id> --decisions .merge/runs/<id>/decisions.yaml

4. Review and commit

.merge/runs/<run_id>/merge_report.md    # final report
.merge/runs/<run_id>/checkpoint.json    # full state
.merge/runs/<run_id>/logs/run_<id>.log  # complete execution log

The system stops at the working tree. It never auto-commits or auto-pushes — you review, then decide.


All Commands

# Daily use
merge <target-branch>                         # default: browser Web UI
merge <target-branch> --dry-run               # plan only, no file writes
merge <target-branch> --no-web                # terminal output
merge <target-branch> -r                      # re-run setup wizard

# Resume / decisions
merge resume --run-id <id>
merge resume --run-id <id> --decisions decisions.yaml
merge resume --run-id <id> --web              # view history in browser

# Validate
merge validate --config <path>                # check config + all API keys

# Fork profile (only needed when fork deleted ≥30 files)
merge forks-profile init -o .merge/forks-profile.yaml
merge forks-profile diff
merge forks-profile validate

# CI
merge <target-branch> --ci                    # non-interactive, JSON summary to stdout
merge <target-branch> --ci --auto-decisions <yaml>

Troubleshooting

| Symptom | Fix |
| --- | --- |
| API Key not set | Run merge validate --config .merge/config.yaml; check shell env → .merge/.env → ~/.config/code-merge-system/.env |
| working tree dirty | git stash or git commit, then re-run |
| upstream ref not found | Run git fetch upstream; use upstream/main not main |
| Plan review stuck in multiple rounds | Normal: Planner and PlannerJudge are negotiating; after max_plan_revision_rounds=2 it transitions to AWAITING_HUMAN. Check plan_review.md. |
| Run interrupted mid-way | merge resume --run-id <id> (find run_id under .merge/runs/) |
| Want to start over | rm -rf .merge/runs/<id>/, then re-run |

Development

git clone <repo-url> && cd code-merge-system
python3.11 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

pytest tests/unit/ -q               # unit tests (no LLM calls)
pytest tests/integration/ -v        # integration tests (real API, local only)
mypy src                            # type check (strict)
ruff check src/ && ruff format src/ # lint + format

# Web UI (only needed for frontend changes; run each command from the repo root)
cd web && npm install
cd web && npm run dev               # Vite dev server at localhost:5173
cd web && npm run build             # tsc + build → web/dist/
cd web && npm test                  # vitest

Architecture constraints enforced by unit tests — do not violate:

  • No TIMEOUT_DEFAULT on DecisionSource — human decisions must be explicit
  • Judge / PlannerJudge receive ReadOnlyStateView — no state writes from reviewer agents
  • Executor uses apply_with_snapshot() — no direct file writes
  • plan_revision_rounds >= max → AWAITING_HUMAN, not FAILED
  • HumanInterface never fills in default decisions
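
To make the read-only constraint on reviewer agents concrete, here is a minimal sketch of one way such a view could be built with the standard library. It is not the project's `ReadOnlyStateView`; the `read_only_view` helper and the state keys are hypothetical.

```python
from types import MappingProxyType


def read_only_view(state: dict):
    """Return a live, read-only proxy over state.

    The proxy is shallow: nested mutable values would still need their
    own protection.
    """
    return MappingProxyType(state)


state = {"phase": "JUDGE_REVIEW"}
view = read_only_view(state)

try:
    view["phase"] = "COMPLETED"  # a reviewer attempting a state write
except TypeError as exc:
    print("write rejected:", exc)

state["phase"] = "COMPLETED"     # the owner of the dict can still write
print(view["phase"])             # the view reflects the owner's change
```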

Contributing

Contributions are welcome — whether it's a bug report, a feature idea, or a pull request.

Good places to start:

  • 🐛 Report a bug — include your Python version, the command you ran, and the relevant log from .merge/runs/<id>/logs/
  • 💡 Request a feature — describe your fork/upstream scenario and what the system currently gets wrong
  • 🔧 Browse open issues — look for good first issue labels if you want a guided starting point

Before submitting a PR:

  1. Run pytest tests/unit/ — all tests must pass
  2. Run mypy src — no new type errors
  3. Run ruff check src/ — no lint errors
  4. Keep new files under 800 lines; organize by feature layer (models → tools → llm → agents → core → cli)
  5. New agents require a contract yaml under src/agents/contracts/ — see src/agents/contracts/_schema.md

Key docs for contributors:


Documentation

Full index: doc/README.md

| Document | Description |
| --- | --- |
| Onboarding Guide | Start here if you're new to the project |
| Architecture | Layers, data flow, persistence, extension points |
| Flow & State Machine | 13 states, 8 phases |
| Six Lost Patterns + P0/P1/P2 Hardening | How we catch what git merge misses |
| Evaluation Framework | Golden-merge ground truth, 5 trust dimensions, 3 dataset tiers, acceptance gates |
| Self-Learning System | Non-parametric memory + grounded feedback loop, phased rollout |
| Migration-Aware Merge | Handling bulk-copy scenarios |
| Risk Levels | How files are classified A–E |
| Web UI User Journey | Browser-side walkthrough |

License

MIT


Built for teams that maintain long-lived forks and need more than git merge.
