Make the exported knowledge graph intuitive (curation pass, flagged off) #352

Open
swati510 wants to merge 13 commits into main from kg-intuitiveness

@swati510 swati510 commented Jun 3, 2026

Summary

Adds a curation/presentation pass over the deterministic KG skeleton so that the exported knowledge-graph.json is intuitive on every index. The pass produces bounded, dependency-ordered layers; capped, ranked real entry points; one canonical layer-aware tour; typed infra/CI/data nodes; and never-empty summaries. It also adds an optional, self-validated portable artifact.

All behaviour is behind REPOWISE_KG_CURATION (default off). With the flag off, the exported KG is byte-identical to today's. The NetworkX graph, centrality, community detection, and DB tables are never mutated (curation only writes the returned KnowledgeGraphResult; a regression test guards node/edge counts pre/post).
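A minimal sketch of the flag gate and the non-mutation guarantee described above. The flag name and `curate_knowledge_graph` come from this PR; the dict-based result and graph, the accepted truthy spellings, and the `curated` key are assumptions for illustration, not the real module's types.

```python
import copy
import os

FLAG = "REPOWISE_KG_CURATION"  # flag name from this PR


def curation_enabled(env=None):
    """Return True only for an explicit opt-in; an unset flag means off.

    The accepted truthy spellings are an assumption for this sketch.
    """
    env = os.environ if env is None else env
    return env.get(FLAG, "").strip().lower() in {"1", "true", "yes", "on"}


def curate_knowledge_graph(result, graph, env=None):
    """Hypothetical stand-in: returns a curated copy of `result`, only reads `graph`."""
    if not curation_enabled(env):
        return result  # flag off: same object back, so the export is unchanged
    curated = copy.deepcopy(result)
    curated["curated"] = True  # placeholder for the real presentation pass
    return curated
```

A regression test in this style would snapshot node and edge counts before the call and compare them afterwards, with the flag both on and off.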

New module: packages/core/src/repowise/core/analysis/kg_curation.py, wired at pipeline/orchestrator.py:527 (runs in both FAST and STANDARD, before generation/persistence).

Phases

  • 0 — seam: no-op curate_knowledge_graph + flag.
  • 1 — layers: generation/layers.py:infer_layer spine replaces raw-community layers → bounded, partitioned, dependency-ordered. Edge A: CLI hint added to _LAYER_HINTS/_CANONICAL_RANK. Edge B: mega-layers (core/ui) get a two-level primary→sub-group split.
  • 2 — entry points: re-export barrels demoted in the presentation view only (graph is_entry_point untouched); survivors ranked by pagerank+betweenness, capped at 8; full ranked list in project.entry_candidates.
  • 3 — tour: single canonical layer-aware tour via tour.py:build_tour, README-first, each step carries layer_id. LLM enrichment now keeps the curated tour instead of regenerating it.
  • 4 — typing + summaries: infra/CI/data nodes typed; never-empty deterministic summary floor (rich wiki-page summaries still win — backfill kept fill-empty, floor deferred to post-backfill in generate mode).
  • 5 — C4: curated layers flow to the architecture view automatically (locked by test); Mermaid groups externals by category past a threshold.
  • 6 — portable: build_portable_kg → self-contained artifact with meta + embedded validation; save_knowledge_graph_json(..., portable=True).
  • 7 — invariants: validate_kg pure checker + cross-repo invariant suite (many-isolates regression, flat single-package, deep monorepo), AST-untouched guard, determinism.
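Phase 2 can be sketched roughly as follows. The cap of 8 and the pagerank+betweenness ranking come from the description; the `Candidate` type, the plain-sum score, the path tie-break, and the choice to leave barrels out of the full ranked list are assumptions, since the PR text does not spell them out.

```python
from dataclasses import dataclass

MAX_ENTRY_POINTS = 8  # cap stated in phase 2


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view of one entry-point candidate."""
    path: str
    pagerank: float
    betweenness: float
    is_barrel: bool = False  # re-export barrel, demoted in the presentation view


def rank_entry_points(candidates):
    """Return (shown, ranked): the capped list and the full ranked list.

    Assumption: the score is a plain sum of the two centralities.
    """
    survivors = [c for c in candidates if not c.is_barrel]
    ranked = sorted(
        survivors,
        # Highest score first; path breaks ties so the order is deterministic.
        key=lambda c: (-(c.pagerank + c.betweenness), c.path),
    )
    return ranked[:MAX_ENTRY_POINTS], ranked
```

The input candidates are never modified, which mirrors the rule that the graph's own is_entry_point attribute stays untouched.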

Tests

Full unit suite: 3261 passed, 2 xfailed, 0 failures. New tests in tests/unit/analysis/test_kg_curation.py, test_kg_invariants.py, kg_fixtures.py, and tests/unit/server/services/test_c4_curation.py.

Empirical validation

repowise itself: 103 → 11 layers, 0 singletons, largest layer 25.7%, Application catch-all 14.1%, CLI surfaced as its own layer.

Multi-repo (§7.2 acceptance gate): core wins hold on every repo (no layer explosion, 0 singletons, partition intact, count ≤15). However, the balance targets (no mega-layer above 35%, catch-all below 20%) are not met on test-heavy repos (django: Test layer = 63%) or on flat single-package libraries (requests/flask: catch-all 42–59%, because everything sits in one package dir and falls into Application). validate_kg correctly warns on these.
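For illustration, the balance checks could look like the sketch below. The thresholds (35%, 20%, at most 15 layers, no singletons) are the ones quoted here; the function name, the `layer_sizes` mapping, and the warning wording are hypothetical and are not validate_kg's actual interface.

```python
def check_layer_balance(layer_sizes, catch_all="Application",
                        max_layers=15, max_largest=0.35, max_catch_all=0.20):
    """Return a list of warning strings; an empty list means every target holds.

    Pure function: only reads `layer_sizes` (layer name -> member count).
    """
    total = sum(layer_sizes.values())
    warnings = []
    if total == 0:
        return warnings
    if len(layer_sizes) > max_layers:
        warnings.append(f"layer count {len(layer_sizes)} exceeds {max_layers}")
    singletons = sorted(name for name, n in layer_sizes.items() if n == 1)
    if singletons:
        warnings.append("singleton layers: " + ", ".join(singletons))
    # Largest layer; the name breaks ties so the result is deterministic.
    name, size = max(layer_sizes.items(), key=lambda kv: (kv[1], kv[0]))
    if size / total > max_largest:
        warnings.append(f"largest layer {name} holds {size / total:.0%} of members")
    share = layer_sizes.get(catch_all, 0) / total
    if share > max_catch_all:
        warnings.append(f"catch-all {catch_all} holds {share:.0%} of members")
    return warnings
```

A django-shaped input, where Test holds 63 of 100 members, yields a single largest-layer warning, while a flat library whose catch-all dominates trips both the largest-layer and the catch-all checks.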

Why the flag stays OFF

The acceptance gate requires no repo exceeding the 35%/20% balance targets; that currently fails on the repo shapes above, so the default is intentionally left off (no flip commit). Two follow-ups before default-on:

  1. Path-only infer_layer can't subdivide flat libraries (no directory signal) — add a filename-based hint fallback (models.py→Data, etc.).
  2. Consider exempting the Test layer from the largest-primary-layer metric (tests legitimately dominate some repos).
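A rough sketch of the first follow-up, assuming the fallback only runs when the directory-based inference lands in the catch-all. Only the models.py→Data mapping comes from this description; the other table rows, the function name, and the `infer_from_dirs` callable are hypothetical.

```python
from pathlib import PurePosixPath

# Only models.py -> Data comes from the PR text; the other rows are illustrative guesses.
_FILENAME_HINTS = {
    "models.py": "Data",
    "cli.py": "CLI",
    "api.py": "API",
}


def infer_layer_with_fallback(path, infer_from_dirs, catch_all="Application"):
    """Try directory-based inference first, then fall back to the filename."""
    layer = infer_from_dirs(path)
    if layer != catch_all:
        return layer  # a directory signal exists, so keep it
    return _FILENAME_HINTS.get(PurePosixPath(path).name, catch_all)
```

Running the filename table only as a fallback keeps existing directory-based assignments stable, so repos that already partition well would see no change.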

Deliberate divergence from the plan

The plan said to extend to_dict() for the portable artifact. I kept to_dict() byte-identical and added build_portable_kg separately instead, because a new meta key in to_dict() would break the flag-off byte-identity rule.
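The design choice can be sketched as a wrapper around the existing export rather than a change to it. The envelope keys (`meta`, `validation`, `graph`), the `schema_version` field, and the function name `build_portable` are assumptions; the description only says the artifact carries meta plus embedded validation.

```python
import json


def build_portable(kg_dict, validate, schema_version="1"):
    """Wrap an exported KG dict in a self-describing envelope.

    `kg_dict` is embedded unchanged, so the non-portable export keeps its bytes.
    `validate` is any callable returning an iterable of warning strings.
    """
    warnings = list(validate(kg_dict))
    return {
        "meta": {"schema_version": schema_version, "portable": True},
        "validation": {"ok": not warnings, "warnings": warnings},
        "graph": kg_dict,
    }


def dumps_stable(obj):
    """Serialize with sorted keys so repeated exports compare byte for byte."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))
```

Because the envelope sits outside the original dict, the flag-off path can keep serializing the to_dict() output exactly as before.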

@swati510 swati510 requested a review from RaghavChamadiya as a code owner June 3, 2026 13:13

repowise-bot Bot commented Jun 5, 2026

✅ Health: 7.0 (unchanged)
5 hotspots · 5 hidden couplings · 2 with fix history

🚨 Change risk: high (riskier than 86% of this repo's commits · raw 9.7/10)
This change's risk is driven by:

  • large diff (many lines added)
  • scattered, high-entropy change

🩹 Review priority (files here with the most recent bug-fix history — defects cluster, so review these first)

🔥 Hotspots touched (5)
  • .../c4_builder/mermaid.py: 1 commit/90d, 1 dependent · primary owner: Raghav Chamadiya (100%)
  • .../pipeline/orchestrator.py: 33 commits/90d, 15 dependents · primary owner: Raghav Chamadiya (82%)
  • .../generation/knowledge_graph.py: 1 commit/90d, 3 dependents · primary owner: Swati Ahuja (100%)
  • .../generation/test_layers.py: 1 commit/90d, 0 dependents · primary owner: Raghav Chamadiya (100%)
  • .../generation/layers.py: 1 commit/90d, 3 dependents · primary owner: Raghav Chamadiya (100%)
🔗 Hidden coupling (1 file)
  • .../pipeline/orchestrator.py co-changes with these files (not in this PR):
    • .../pipeline/persist.py (9× — 🟢 routine)
    • .../commands/update_cmd.py (8× — 🟢 routine)
    • .../persistence/models.py (7× — 🟢 routine)
    • .../mcp_server/tool_overview.py (6× — 🟢 routine)
    • README.md (6× — 🟢 routine)

Last updated 2026-06-05 07:36 UTC
