The case name extraction logic was picking up incorrect or truncated case names from document headers and captions, leading to contamination in citation extraction.
- Document headers like "CARTER, Respondent, v. MARY E. JONES, Appellant" were not being properly filtered
- Sentence fragments like "This was later cited in Smith v. Johnson" were being extracted instead of just "Smith v. Johnson"
- Role words (Respondent, Appellant, etc.) in case names were not being cleaned properly
- Context window was too small (40 chars) to capture full case names
- Strategy 2 & 3: Added robust cleaning of role words from both ends of party names
- Handles patterns like "CARTER, Respondent, v. MARY E. JONES, Appellant" → "CARTER v. MARY E. JONES"
- Cleans up double commas and extra spaces
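The role-word cleaning described above could look roughly like the following. This is a minimal sketch, not the code in the repository: the `ROLE_WORDS` list and the names `clean_party` and `clean_case_name` are assumptions made for illustration.

```python
import re

# Hypothetical role-word list; the real module may use a different set.
ROLE_WORDS = ("Respondent", "Appellant", "Petitioner", "Appellee")

# A role word, optionally plural, with any comma directly before or after it.
_ROLE_RE = re.compile(
    r",?\s*\b(?:" + "|".join(ROLE_WORDS) + r")s?\b,?", re.IGNORECASE
)


def clean_party(party: str) -> str:
    """Strip role words from one party name, then tidy punctuation."""
    cleaned = _ROLE_RE.sub(" ", party)
    cleaned = re.sub(r",\s*,", ",", cleaned)  # collapse double commas
    cleaned = re.sub(r"\s+", " ", cleaned)    # collapse extra spaces
    return cleaned.strip(" ,")


def clean_case_name(name: str) -> str:
    """Clean both sides of 'X v. Y' independently."""
    parts = re.split(r"\s+v\.\s+", name, maxsplit=1)
    if len(parts) != 2:
        # No "v." separator found: clean the whole string as one party.
        return clean_party(name)
    return f"{clean_party(parts[0])} v. {clean_party(parts[1])}"


print(clean_case_name("CARTER, Respondent, v. MARY E. JONES, Appellant"))
```

Splitting on `v.` first keeps the cleanup of each party independent, so a role word at either end of either name is handled by the same substitution.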
- Lines 1185-1186: Added pattern to detect case caption headers with role words
- Lines 1196-1216: Enhanced logic to filter case captions while preserving legitimate case discussion
- Distinguishes between headers ("CARTER, Respondent, v. MARY E. JONES, Appellant") and discussion ("In Smith v. Jones, the court held...")
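One way to tell a caption header from ordinary case discussion is to require a role word attached to both parties. The pattern below is an assumed illustration of that idea; the actual pattern added at lines 1185-1186 is not shown in this summary, and `is_caption_header` is a hypothetical name.

```python
import re

# Assumed role words; the real list may differ.
_ROLES = r"(?:Respondent|Appellant|Petitioner|Appellee)s?"

# Caption header: ", <Role>" right before "v." and ", <Role>" after the
# second party. Ordinary discussion text lacks both markers.
CAPTION_RE = re.compile(
    rf",\s*{_ROLES}\s*,?\s+v\.\s+.+?,\s*{_ROLES}\b", re.IGNORECASE
)


def is_caption_header(text: str) -> bool:
    """Return True when the text looks like a case caption header."""
    return CAPTION_RE.search(text) is not None


print(is_caption_header("CARTER, Respondent, v. MARY E. JONES, Appellant"))
print(is_caption_header("In Smith v. Jones, the court held..."))
```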
- Lines 2989-2999: Added role word removal from extracted case names
- Removes patterns like ", Respondent" and "Appellant, " from case names
- Cleans up resulting punctuation
- Lines 1735-1754: Added logic to extract case names from longer sentences
- Detects signal words (see, cited, established, etc.) to identify sentence contamination
- Uses precise regex pattern to extract just the case name: `r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\s+v\.\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b"`
- Example: "As established in Davis v. Wilson" → "Davis v. Wilson"
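The sentence-extraction step can be sketched as follows, using the regex quoted above unchanged. The function name `extract_from_sentence` and the exact `SIGNAL_WORDS` tuple are assumptions; the summary only lists "see, cited, established, etc."

```python
import re

# Regex quoted in this summary, unchanged.
CASE_NAME_RE = re.compile(
    r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\s+v\.\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b"
)

# Partial, assumed list of signal words.
SIGNAL_WORDS = {"see", "cited", "established"}


def extract_from_sentence(candidate: str) -> str:
    """Return only the case name when the candidate looks like a sentence."""
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", candidate)}
    if not words & SIGNAL_WORDS:
        return candidate  # no signal word: leave the candidate untouched
    match = CASE_NAME_RE.search(candidate)
    return match.group(0) if match else candidate


print(extract_from_sentence("As established in Davis v. Wilson"))
print(extract_from_sentence("This was later cited in Smith v. Johnson"))
```

Because the pattern accepts any run of capitalized words before `v.`, a capitalized word directly preceding the name (as in "In Smith v. Jones") would be matched as part of it, and all-caps or initialed names such as "MARY E. JONES" would not match at all, so those cases presumably need handling elsewhere.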
- Line 1448: Increased default context window from 40 to 100 characters
- Ensures enough context to capture complete case names
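To illustrate why the window size matters, here is a small sketch with a made-up long case name. `context_before` is a hypothetical helper, not a function from the codebase, and the example text is invented.

```python
def context_before(text: str, citation_start: int, window: int = 100) -> str:
    """Return up to `window` characters immediately preceding the citation."""
    return text[max(0, citation_start - window):citation_start]


# Invented example with a deliberately long case name.
text = (
    "The court ruled in Northern Example Manufacturing Company v. "
    "Southern Sample Distribution Corporation, 123 F.3d 456"
)
start = text.index("123 F.3d 456")

print(repr(context_before(text, start, window=40)))  # first party is cut off
print(repr(context_before(text, start)))             # full case name is included
```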
- Lines 1949-1963: Made contamination indicators more specific
- Removed overly broad terms like "established" that caused false positives
- Now only rejects clear contamination patterns like "held that", "the court held that", etc.
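The narrowed contamination check might be sketched like this. The phrase tuple contains only the two examples named above; the full list at lines 1949-1963 is not reproduced here, and `is_contaminated` is an assumed name.

```python
# Only the two phrases named in this summary; the real list is longer.
CONTAMINATION_PHRASES = ("held that", "the court held that")


def is_contaminated(candidate: str) -> bool:
    """Reject a candidate only when it contains a clear contamination phrase."""
    lowered = candidate.lower()
    return any(phrase in lowered for phrase in CONTAMINATION_PHRASES)


print(is_contaminated("the court held that Smith v. Jones"))
print(is_contaminated("As established in Davis v. Wilson"))
```

With "established" no longer in the list, the second candidate is not rejected outright and can go on to the sentence-extraction step instead.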
All test cases now pass:
- ✅ Simple case: "Smith v. Johnson" from "The court ruled in Smith v. Johnson, 123 F.3d 456"
- ✅ Signal words: "Davis v. Wilson" from "As established in Davis v. Wilson, 456 U.S. 789"
- ✅ Header contamination: Properly filters "CARTER, Respondent, v. MARY E. JONES, Appellant"
- ✅ Role word cleaning: Removes "Respondent" and "Appellant" from case names
- More accurate citation extraction - No longer picks up document headers as case names
- Cleaner case names - Role words and signal phrases properly removed
- Better contamination detection - Distinguishes between headers and legitimate case discussion
- Improved clustering - Correct case names lead to better citation clustering
- src/unified_clustering_master.py: Enhanced primary case name extraction
- src/unified_case_extraction_master.py: Multiple improvements to extraction and filtering logic
Created comprehensive test suites:
- test_case_name_extraction_fixes.py: Full pipeline testing
- test_simple_case_extraction.py: Focused testing of specific scenarios
All tests pass successfully, confirming the fixes work as expected.