Skip to content

Cat 4c: edge count alongside component count, sparse annotation#37

Open
jphein wants to merge 2 commits into
M0nkeyFl0wer:mainfrom
techempower-org:fix/cat4c-edge-count-noise
Open

Cat 4c: edge count alongside component count, sparse annotation#37
jphein wants to merge 2 commits into
M0nkeyFl0wer:mainfrom
techempower-org:fix/cat4c-edge-count-noise

Conversation

@jphein
Copy link
Copy Markdown
Contributor

@jphein jphein commented May 25, 2026

Summary

Closes #15.

Adds edge count alongside component count in the Cat 4c "Edge-type monoculture" section and annotates sparse types (<5 edges) to distinguish real structural signals from noise.

  • Adds per_edge_type_edge_counts: dict[str, int] field to IngestionIntegrityReport
  • format_report() now shows both edge count and component count per type, sorted by edge count descending
  • Types with <5 edges are annotated as [sparse — <5 edges]

Before:

  Edge-type distribution:
    RELATED: 6 edges
    MENTIONS: 1 edge

After:

  Edge-type distribution:
    RELATED:   6 edges, 1 component
    MENTIONS:  1 edge, 1 component  [sparse — <5 edges]
    LINKS_TO:  1 edge, 1 component  [sparse — <5 edges]

Test plan

  • test_per_edge_type_edge_counts_populated — verifies edge counts match ground truth
  • test_format_report_sparse_annotation — verifies sparse annotation appears for low-edge-count types and not for RELATED
  • Existing tests pass unchanged

🫏 Generated with Claude Code

)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 25, 2026 01:10
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds per-edge-type edge counts to the ingestion integrity report and enhances the formatted output to annotate sparse edge types (Issue #15), with new tests to validate both the report field and rendering.

Changes:

  • Added per_edge_type_edge_counts to IngestionIntegrityReport and populated it in score_ingestion_integrity
  • Updated format_report to render per-edge-type edge counts alongside component counts and mark sparse edge types
  • Added tests covering the new report field and sparse annotation behavior in formatted output

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
sme/categories/ingestion_integrity.py Extends report schema and updates formatting to include edge counts + sparse annotations
tests/test_ingestion_integrity.py Adds regression tests for the new report field and format_report output

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +139 to 141
per_edge_type_edge_counts: dict[str, int] = field(default_factory=dict)


Comment thread sme/categories/ingestion_integrity.py Outdated
sparse_threshold = 5
combined = sorted(
report.per_edge_type_components.items(),
key=lambda kv: -report.per_edge_type_edge_counts.get(kv[0], 0),
Comment thread sme/categories/ingestion_integrity.py Outdated
Comment on lines +312 to +313
lines.append(" Per-edge-type edges + components (4c monoculture signal):")
sparse_threshold = 5
@M0nkeyFl0wer
Copy link
Copy Markdown
Owner

Edge counts alongside component counts + sparse annotation is exactly the right fix for #15 — the per-edge-type component count was structurally noisy on small populations and this makes the signal/noise tradeoff visible per-type.

CI is red only on one ruff finding: ambiguous variable name l in tests/test_ingestion_integrity.py:155. Rename to line. Substance is fine after that.

Small follow-ups (non-blocking, worth in same touch):

  • Sparse threshold magic number 5 — pull to a module constant so it shows up in one place if someone wants to tune it later.
  • Sort key on (edge_count,) is non-deterministic on ties — add (edge_count, edge_type_name) for deterministic output.

- Rename ambiguous `l` to `line` in test (clears the only red ruff finding)
- Pull sparse edge-count threshold 5 into named _SPARSE_EDGE_THRESHOLD constant
- Tie-break the 4c render sort on edge_type name so ties are deterministic

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jphein
Copy link
Copy Markdown
Contributor Author

jphein commented May 27, 2026

Pushed fa0ad5f:

  1. Renamed the ambiguous lline in tests/test_ingestion_integrity.py — that was the only red-CI finding (ruff E741); checks are green now on 3.10/3.11/3.12.
  2. Hoisted the sparse threshold (was a bare 5) to a named module constant.
  3. The sort key is now (edge_count, edge_type_name) for deterministic output on ties.

🫏

@M0nkeyFl0wer
Copy link
Copy Markdown
Owner

Verified — _SPARSE_EDGE_THRESHOLD is hoisted, the sort key is (edge_count, edge_type_name) for deterministic output on ties, and the l → line rename clears E741. Tests cover both sparse annotation and the tie-break. Ship.

One small forward thought, not for this PR: 5 as the sparse floor is calibrated for the current small-corpus regime. When the LoCoMo / HotpotQA / MINE integration (issue #43) lands and runs are on graphs 100× larger, 5 will mark almost nothing as sparse and the annotation will quietly stop being useful. A one-line comment on the constant noting it's corpus-size calibrated would save a future reviewer from wondering whether to bump it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cat 4c — per-edge-type component count is structurally noisy on small edge populations

3 participants