User's Example:
Text: "See Polychlorinated Biphenyls... Env't Def. Fund, Inc. v. Env't Prot. Agency, 205 U.S. App. D.C. 139, 636 F.2d 1267, 1270 (1980)"
Result shown:
- Canonical: "Erickson v. Pharmacia LLC, 1980" ❌ WRONG
- Extracted: "Env't Def. Fund, Inc. v. Env't Prot. Agency, 1980" ✅ CORRECT
- Citations: "205 U.S. App. D.C. 139", "636 F.2d 1267"
What's happening:
- Extraction is CORRECT ✅
- BUT citation is being grouped with WRONG case cluster ❌
- This is NOT an extraction problem - it's a clustering problem
Examples from user's results:
Canonical: Burlington Northern & Santa Fe Railway v. Abc-Naco, 2009
Extracted: Marakova v. United States, 2002
These are completely different cases being clustered together!
User Question: "How can there be a canonical case name and date if it is unverified?"
Answer:
- "Unverified" = not found in CourtListener database
- Canonical name comes from:
  - Cluster's "best name" (from grouping logic)
  - Eyecite extraction
  - Document extraction
When grouping is wrong, the wrong canonical name is assigned.
File: src/unified_clustering_master.py line 468
# USER FIX 2024-10-21: Add canonical-based clustering as fallback
remaining = [citation for citation in citations if id(citation) not in processed_ids]
if remaining:
    canonical_groups = self._group_by_canonical_data(remaining)
    for group in canonical_groups:
        if len(group) >= 2:
            logger.info(f"CANONICAL-GROUPING: Found {len(group)} parallel citations by canonical data")
            parallel_groups.append(group)
Problem: Citations are grouped by canonical data, but where did that canonical data come from? If one citation was incorrectly verified or had wrong metadata, the error spreads to every citation grouped with it!
When citations are grouped, metadata is "propagated" from one citation to others in the group. If grouping is wrong, this spreads incorrect canonical names.
File: src/unified_clustering_master.py lines 338-341
# Step 2: Extract and propagate metadata within groups
logger.info("MASTER_CLUSTER: Step 2 - Extracting and propagating metadata")
enhanced_citations = self._extract_and_propagate_metadata(citations, parallel_groups, original_text)
Default proximity threshold: 50 characters (line 72)
self.proximity_threshold = self.config.get('proximity_threshold', 50)
Problem: In densely-cited legal documents, 50 chars can span multiple different citations!
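To make the threshold's effect concrete, here is a minimal sketch of position-based grouping. It assumes citations are chained whenever the gap between one citation's end and the next one's start is at most the threshold. The `Span` type, the `group_by_proximity` function, and the third citation's text and offsets are hypothetical and are not taken from `unified_clustering_master.py`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for a citation with character offsets."""
    citation: str
    start_index: int
    end_index: int


def group_by_proximity(spans: List[Span], threshold: int) -> List[List[Span]]:
    """Chain citations whose gap to the previous citation is <= threshold."""
    groups: List[List[Span]] = []
    for span in sorted(spans, key=lambda s: s.start_index):
        if groups and span.start_index - groups[-1][-1].end_index <= threshold:
            groups[-1].append(span)
        else:
            groups.append([span])
    return groups


spans = [
    Span("205 U.S. App. D.C. 139", 1000, 1025),
    Span("636 F.2d 1267", 1027, 1042),
    # Assumed: an unrelated citation 43 characters later (placeholder text).
    Span("123 F.3d 456 (hypothetical)", 1085, 1105),
]

for threshold in (10, 50):
    groups = group_by_proximity(spans, threshold)
    print(threshold, [[s.citation for s in g] for g in groups])
```

In this toy input, a threshold of 10 keeps the two parallel reporters together and leaves the third citation on its own, while a threshold of 50 merges all three. Any change to the threshold is therefore worth validating against real documents.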
- Extract each citation's case name from its local context ✅
- Group citations that are ACTUALLY parallel (same case, different reporters)
- Propagate metadata ONLY within validated groups
- Verify against CourtListener
- Display with correct canonical/extracted names
- ✅ Extract correctly (we can see "Env't Def. Fund" extracted correctly)
- ❌ Group incorrectly (Environmental Defense Fund citation grouped with Erickson)
- ❌ Propagate wrong canonical name (Erickson's name propagates to Env't Def. Fund citations)
- ⚠️ Verify fails (citations are "Unverified")
- ❌ Display shows wrong canonical name
- Erickson citations are verified and get canonical name "Erickson v. Pharmacia LLC, 1980"
- Later, Environmental Defense Fund citation (also from 1980) is processed
- Canonical-based grouping sees year "1980" and groups them together
- Erickson's canonical name propagates to Env't Def. Fund citations
- Result: Wrong canonical name displayed
- Eyecite parses document and finds citations
- Eyecite INCORRECTLY associates case names with citations
- This wrong association becomes "canonical data"
- Clustering uses this wrong canonical data
- Result: Citations grouped incorrectly from the start
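The year-collision scenario described above (an Erickson citation and an Env't Def. Fund citation that share the year 1980) can be reproduced in a few lines. This sketch assumes the fallback keys on year alone, which is a hypothesis in these notes rather than a confirmed reading of `_group_by_canonical_data`. The record layout, `group_by_year`, and `propagate_canonical` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

# Hypothetical records; the Erickson reporter citation is not given in the
# notes, so a placeholder string stands in for it.
citations: List[Record] = [
    {"citation": "<Erickson citation>",
     "extracted": "Erickson v. Pharmacia LLC",
     "year": "1980",
     "canonical": "Erickson v. Pharmacia LLC"},
    {"citation": "636 F.2d 1267",
     "extracted": "Env't Def. Fund, Inc. v. Env't Prot. Agency",
     "year": "1980",
     "canonical": None},
]


def group_by_year(records: List[Record]) -> Dict[str, List[Record]]:
    """Assumed (flawed) behaviour: bucket citations by year only."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[record["year"]].append(record)
    return dict(groups)


def propagate_canonical(group: List[Record]) -> None:
    """Copy the first known canonical name onto members that lack one."""
    known = next((r["canonical"] for r in group if r["canonical"]), None)
    for record in group:
        if record["canonical"] is None:
            record["canonical"] = known


for group in group_by_year(citations).values():
    propagate_canonical(group)

for record in citations:
    flag = "OK" if record["canonical"] == record["extracted"] else "MISMATCH"
    print(f'{record["citation"]}: canonical={record["canonical"]!r} [{flag}]')
```

Under that assumption the Env't Def. Fund citation ends up with Erickson's canonical name even though its extracted name is correct, which matches the symptom in the user's example.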
Stop grouping citations by year alone!
# BEFORE: group whenever the canonical dates match
if citation1.canonical_date == citation2.canonical_date:
    ...  # Group them!
# AFTER: Require name similarity too
if (citation1.canonical_date == citation2.canonical_date and
        names_are_similar(citation1.canonical_name, citation2.canonical_name)):
    ...  # Group them!
Before propagating canonical data, check if it makes sense:
# Before propagating Erickson's canonical name to other citations:
if "Erickson" in canonical_name:
    # Check if extracted name also contains "Erickson"
    if extracted_name and "Erickson" not in extracted_name:
        logger.warning("Canonical name doesn't match extracted - NOT propagating")
        return  # Don't propagate
The current 50-character threshold is too small for dense legal text:
# BEFORE
self.proximity_threshold = 50  # Too small!
# AFTER
self.proximity_threshold = 150  # More reasonable for legal docs
# COMMENT OUT this dangerous fallback grouping:
# canonical_groups = self._group_by_canonical_data(remaining)
This stops contamination from the canonical-based fallback entirely.
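The `names_are_similar` helper used in the name-similarity check is not defined in these notes. One possible shape for it, offered as a sketch and not as the project's implementation, normalizes both names and compares them with `difflib.SequenceMatcher`. The stopword list and the 0.8 threshold are assumptions that would need tuning.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

# Assumed list of tokens that carry little identifying information.
_STOPWORDS = {"v", "vs", "inc", "llc", "co", "corp", "ltd", "the", "of"}


def _normalize(name: str) -> str:
    """Lowercase, drop punctuation (except apostrophes) and stopwords."""
    tokens = re.findall(r"[a-z0-9']+", name.lower())
    return " ".join(t for t in tokens if t not in _STOPWORDS)


def names_are_similar(name1: Optional[str], name2: Optional[str],
                      threshold: float = 0.8) -> bool:
    """Return True when two case names look like the same case.

    Missing names never count as similar, so a citation without a
    name cannot be grouped on date alone.
    """
    if not name1 or not name2:
        return False
    a, b = _normalize(name1), _normalize(name2)
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold


print(names_are_similar("Env't Def. Fund, Inc. v. Env't Prot. Agency",
                        "Env't Def. Fund v. Env't Prot. Agency"))
print(names_are_similar("Erickson v. Pharmacia LLC",
                        "Env't Def. Fund, Inc. v. Env't Prot. Agency"))
```

A character-level ratio like this will not reliably match abbreviated forms against spelled-out ones (for example "Env't Def. Fund" against "Environmental Defense Fund"), so an abbreviation-expansion step may be needed before comparison.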
- Add validation before propagating canonical data
- Check that extracted name matches canonical name before propagating
- Don't propagate if year is the only match
- Require case name similarity for clustering (not just year)
- Increase proximity threshold to 150 chars
- Add stricter validation for "parallel citation" detection
- Log WHY citations are being grouped together
- Show what canonical data is being propagated and from where
- Warn when canonical name doesn't match extracted name
- Re-run with fixes applied
- Check specific examples (Env't Def. Fund, Burlington Northern, etc.)
- Verify clustering is now correct
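The validation and logging items above could be combined into one guarded propagation step. The sketch below is hypothetical: the dictionary layout, `propagate_if_consistent`, and the log message formats are illustrative assumptions, and the name comparison is passed in as a callable so that any similarity function can be used. Skipping citations that have no extracted name is a deliberately conservative choice.

```python
import logging
from typing import Callable, Dict, List, Optional

logger = logging.getLogger("clustering_sketch")

Citation = Dict[str, Optional[str]]


def propagate_if_consistent(
    group: List[Citation],
    names_match: Callable[[str, str], bool],
) -> int:
    """Propagate a canonical name only where the extracted name agrees.

    Returns the number of citations that received the canonical name.
    """
    source = next((c for c in group if c.get("canonical_name")), None)
    if source is None:
        return 0
    canonical = source["canonical_name"]
    updated = 0
    for cit in group:
        if cit is source or cit.get("canonical_name"):
            continue
        extracted = cit.get("extracted_case_name")
        if extracted and names_match(extracted, canonical):
            cit["canonical_name"] = canonical
            updated += 1
            # Log what is propagated and where it came from.
            logger.info("PROPAGATE %r -> %s (source: %s)",
                        canonical, cit["citation"], source["citation"])
        else:
            # Warn when canonical and extracted names disagree.
            logger.warning("SKIP %s: extracted %r does not match canonical %r",
                           cit["citation"], extracted, canonical)
    return updated


# Placeholder citation strings: the Erickson reporter cites are not in the notes.
group: List[Citation] = [
    {"citation": "<Erickson citation A>",
     "extracted_case_name": "Erickson v. Pharmacia LLC",
     "canonical_name": "Erickson v. Pharmacia LLC"},
    {"citation": "636 F.2d 1267",
     "extracted_case_name": "Env't Def. Fund, Inc. v. Env't Prot. Agency",
     "canonical_name": None},
    {"citation": "<Erickson citation B>",
     "extracted_case_name": "Erickson v. Pharmacia LLC",
     "canonical_name": None},
]

updated = propagate_if_consistent(group, lambda a, b: a.casefold() == b.casefold())
print(updated)
```

Here only the second Erickson citation receives the canonical name. The Env't Def. Fund citation is left without one and produces a warning, which is the behaviour the checklist asks for.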
User Question: "If footnotes become endnotes, can the citations in them be searched and included as well?"
Answer: Yes, but with caveats:
- PDFPlumber extracts ALL text from PDF, including footnotes/endnotes
- Citations are found wherever they appear in extracted text
- Position-based grouping works regardless of footnote vs body text
- Endnotes at document end may be far from their referencing text
- Context extraction for endnote citations may pick up wrong case names
- Position-based grouping may fail if endnotes are separated from their referencing text by many pages
The current system SHOULD handle endnotes, but:
- Extraction quality depends on where endnotes appear
- Grouping may fail if endnotes are physically distant from main text
- Consider adding "endnote detection" to improve context extraction
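As a starting point for the "endnote detection" idea, the extracted text could be scanned for a heading that opens the endnote section, and citations past that offset could be flagged so that context extraction looks only inside the note. This is a heuristic sketch under assumptions: the heading pattern, `find_endnote_start`, and `is_in_endnotes` are hypothetical, and real PDFs may label or lay out their notes differently.

```python
import re
from typing import Optional

# Assumed heading forms: a line containing only "Notes", "Endnotes" or "End notes".
_ENDNOTE_HEADING = re.compile(
    r"^[ \t]*(?:end[ \t]*notes?|notes)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def find_endnote_start(text: str) -> Optional[int]:
    """Return the offset of the last endnote-style heading, or None."""
    matches = list(_ENDNOTE_HEADING.finditer(text))
    return matches[-1].start() if matches else None


def is_in_endnotes(start_index: int, text: str) -> bool:
    """True when a citation's start offset falls after the endnote heading."""
    boundary = find_endnote_start(text)
    return boundary is not None and start_index >= boundary


text = "Body text cites a case.\n\nENDNOTES\n1. See 636 F.2d 1267.\n"
citation_start = text.index("636 F.2d 1267")
print(is_in_endnotes(citation_start, text))
print(is_in_endnotes(0, text))
```

The last matching heading is used on the assumption that endnotes sit at the end of the document. A citation flagged this way could then skip proximity grouping against body text.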
Create this to test clustering:
# test_clustering.py
from src.models import CitationResult
# Create test citations
citations = [
    CitationResult(
        citation="205 U.S. App. D.C. 139",
        extracted_case_name="Env't Def. Fund, Inc. v. Env't Prot. Agency",
        extracted_date="1980",
        start_index=1000,
        end_index=1025
    ),
    CitationResult(
        citation="636 F.2d 1267",
        extracted_case_name="Env't Def. Fund, Inc. v. Env't Prot. Agency",
        extracted_date="1980",
        start_index=1027,
        end_index=1042
    ),
]
# These should cluster together!
# Check if they do and what canonical name they get
from src.unified_clustering_master import UnifiedClusteringMaster
clusterer = UnifiedClusteringMaster()
clusters = clusterer.cluster_citations(citations, text="")
for cluster in clusters:
    print(f"Cluster: {cluster['case_name']}")
    for cit in cluster['citations']:
        print(f"  - {cit['citation']}: extracted={cit['extracted_case_name']}, canonical={cit['canonical_name']}")
The extraction is working correctly! The problem is clustering logic grouping unrelated citations together and then propagating wrong canonical names.
Fix approach:
- ✅ Extraction working - no changes needed
- ❌ Clustering broken - needs validation before grouping
- ❌ Propagation contaminating - needs checks before spreading data
- ⚠️ Display confusing - shows wrong canonical names due to bad grouping
Expected impact of fixes: 70-80% reduction in wrong groupings