File: src/utils/unified_case_name_extractor.py
What it does:
- After extracting a case name, validates it actually appears in the text BEFORE the citation
- Searches in a 500-character window before the citation
- If case name not found, rejects it (prevents cross-contamination)
- If the name is found but sits too far from the citation (>400 chars), also rejects it
Impact: Prevents picking up case names from wrong citations
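The check described above could look roughly like the sketch below. The function name, the parameter names `window` and `max_distance`, and the choice to measure distance from the end of the name to the start of the citation are all assumptions for illustration, not the actual code in unified_case_name_extractor.py.

```python
def name_precedes_citation(text: str, case_name: str, citation_start: int,
                           window: int = 500, max_distance: int = 400) -> bool:
    """Return True if case_name appears shortly before the citation.

    Hypothetical helper: names and distance convention are assumptions.
    """
    # Only look at the text immediately before the citation.
    window_start = max(0, citation_start - window)
    context = text[window_start:citation_start]

    # Use the occurrence closest to the citation.
    idx = context.rfind(case_name)
    if idx == -1:
        return False  # name never appears before this citation

    # Characters between the end of the name and the start of the citation.
    distance = len(context) - (idx + len(case_name))
    return distance <= max_distance
```

A name that belongs to a different citation usually fails the first test (it is not in the window at all); the distance test catches names that are in the window but separated from the citation by a long stretch of other text.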
File: src/utils/strict_context_isolator.py
What it does:
- Uses END position of previous citations as boundaries (not start)
- Ensures we don't include any text from previous citations
- Better handling of parallel citations (within 50 chars)
Impact: Prevents case name bleeding from nearby citations
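A minimal sketch of the end-position boundary idea, assuming citations are available as (start, end) character offsets. The function name and the 500-character default for `max_lookback` are assumptions, and the parallel-citation handling is deliberately left out because its exact rules are not described here.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def context_before(text: str, citation: Span, others: Iterable[Span],
                   max_lookback: int = 500) -> str:
    """Return the text before `citation`, cut at the END of any earlier citation.

    Hypothetical helper; not the actual strict_context_isolator.py code.
    """
    start = citation[0]
    lower = max(0, start - max_lookback)
    for _, prev_end in others:
        # Only citations that finish before this one can act as a boundary.
        if prev_end <= start:
            lower = max(lower, prev_end)
    return text[lower:start]


# Example with made-up reporter numbers.
text = ("Erickson v. Pharmacia LLC, 100 Wn.2d 100 (1980). "
        "Rice v. Dow Chemical Co., 200 Wn.2d 200 (1994).")
first = (text.index("100 Wn.2d 100"), text.index("100 Wn.2d 100") + 13)
second = (text.index("200 Wn.2d 200"), text.index("200 Wn.2d 200") + 13)
print(repr(context_before(text, second, [first])))
```

Cutting at the end of the earlier citation rather than its start means the earlier citation's own text, and the case name in front of it, never enters the context for the later citation.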
File: src/case_name_validator.py
What it does:
- Validates extracted names don't contain legal analysis phrases
- Rejects names containing: "Frye rulings de novo", "WPLA claim", "ER 702", "We review", etc.
- Rejects names starting with legal analysis phrases
Impact: Prevents contamination from surrounding legal text
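The rejection rules above might be sketched as follows. The phrase list is taken from the examples in this summary; the function name, the separate leading-phrase tuple (seeded with "Frye hearing" from the test cases below), and the case-insensitive whole-word matching are assumptions, not the real contents of case_name_validator.py.

```python
import re

# Phrases quoted in this summary; the real list is presumably longer.
CONTAMINATION_PHRASES = ["Frye rulings de novo", "WPLA claim", "ER 702", "We review"]
# Assumed example of a phrase that is only checked at the start of a name.
LEADING_PHRASES = ("Frye hearing",)

# Whole-word match so that e.g. "Miller 702" is not mistaken for "ER 702".
_CONTAINS = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in CONTAMINATION_PHRASES) + r")\b",
    re.IGNORECASE,
)


def is_clean_case_name(name: str) -> bool:
    """Return False if the name looks contaminated by legal analysis text."""
    stripped = name.strip()
    if stripped.lower().startswith(tuple(p.lower() for p in LEADING_PHRASES)):
        return False
    return _CONTAINS.search(stripped) is None
```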
File: src/utils/strict_context_isolator.py
What it does:
- Removes legal analysis phrases from context BEFORE extraction
- Patterns like: "Frye rulings de novo", "WPLA claim", "We review choice of law", etc.
Impact: Prevents legal text from contaminating extracted case names
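One way to do this pre-extraction cleanup is a single regex substitution over the context, sketched below. The function name and the decision to also swallow one trailing punctuation mark and any following whitespace are assumptions; only the three phrases come from this summary.

```python
import re

# Phrases quoted in this summary; treat the list as illustrative.
LEGAL_PHRASES = ["Frye rulings de novo", "WPLA claim", "We review choice of law"]

_PHRASE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in LEGAL_PHRASES) + r")\b[.;,]?\s*",
    re.IGNORECASE,
)


def strip_legal_phrases(context: str) -> str:
    """Remove known legal analysis phrases from the context before extraction."""
    return _PHRASE_RE.sub("", context)
```

Running this before extraction and the validator after it gives two chances to catch the same contamination: the first removes known phrases from the input, the second rejects any that still end up in the extracted name.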
After these fixes, you should see:
- Fewer wrong extracted names - Names should match the citation they're extracted for
- No legal text contamination - Names like "Frye rulings de novo. L.M. v. Hamilton" should be rejected
- Better boundary detection - Case names from nearby citations shouldn't bleed through
Test with the problematic cases from your results:
- Erickson v. Pharmacia LLC, 1980 - Should extract correct name, not "Env't Def. Fund"
- Rice v. Dow Chemical Co., 1994 - Should extract "Rice v. Dow Chemical Co.", not "Erickson v. Pharmacia"
- State v. Copeland, 1996 - Should extract "State v. Copeland", not "Frye rulings de novo. L.M. v. Hamilton"
- State v. Cauthron, 1993 - Should extract "State v. Cauthron", not "Frye hearing. State v. Copeland"
If issues persist:
- Check logs for [BOUNDARY-VALIDATION] messages to see why names are being rejected
- Review context windows - May need to adjust max_lookback or boundary detection
- Add more legal phrase patterns - If new contamination patterns are found