Case Name Bleeding Fixes - Implementation Summary

Fixes Implemented

1. ✅ Validate Extracted Name Appears Before Citation

File: src/utils/unified_case_name_extractor.py

What it does:

  • After extracting a case name, validates that it actually appears in the text BEFORE the citation
  • Searches a 500-character window immediately before the citation
  • If the case name is not found in that window, rejects it (prevents cross-contamination)
  • If the name is found but sits more than 400 characters before the citation, also rejects it

Impact: Prevents picking up case names from wrong citations
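
The check described above can be sketched as follows. This is a minimal illustration, not the code in unified_case_name_extractor.py: the function name, the constant names, and the choice to measure distance from the end of the name to the start of the citation are all assumptions. Only the 500 and 400 character limits come from this summary.

```python
SEARCH_WINDOW = 500  # characters searched before the citation (from this summary)
MAX_DISTANCE = 400   # names found farther back than this are rejected (from this summary)


def name_precedes_citation(text: str, case_name: str, citation_start: int) -> bool:
    """Return True if case_name occurs shortly before the citation.

    Hypothetical helper: citation_start is the offset of the citation in text.
    """
    window_start = max(0, citation_start - SEARCH_WINDOW)
    window = text[window_start:citation_start]

    # Take the last occurrence, since it is the one closest to the citation.
    idx = window.rfind(case_name)
    if idx == -1:
        return False  # not in the window: probably belongs to another citation

    # Assumption: distance runs from the end of the name to the citation start.
    name_end = window_start + idx + len(case_name)
    return citation_start - name_end <= MAX_DISTANCE
```

A name that was extracted for the wrong citation typically fails the first test (it is not in the window at all), while the distance test catches names that are technically present but separated from the citation by a long stretch of unrelated text.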


2. ✅ Improve Citation Boundary Detection

File: src/utils/strict_context_isolator.py

What it does:

  • Uses END position of previous citations as boundaries (not start)
  • Ensures we don't include any text from previous citations
  • Better handling of parallel citations (within 50 chars)

Impact: Prevents case name bleeding from nearby citations
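
One way to picture the boundary rule is the sketch below. It is not the implementation in strict_context_isolator.py: the function name and the (start, end) tuple representation are hypothetical, the default lookback of 500 is an assumed value, and treating citations within 50 characters as sharing one case name is an assumed reading of "parallel citations".

```python
from typing import List, Tuple

PARALLEL_GAP = 50  # citations this close together count as parallel (from this summary)


def context_start(citations: List[Tuple[int, int]], index: int,
                  max_lookback: int = 500) -> int:
    """Return the earliest offset that may be searched for citations[index].

    citations holds (start, end) offsets sorted by start. The default
    max_lookback is an assumed value, not taken from the real code.
    """
    start, _ = citations[index]
    floor = max(0, start - max_lookback)

    # Assumption: parallel citations share one case name, so walk back past them.
    i = index
    while i > 0 and citations[i][0] - citations[i - 1][1] <= PARALLEL_GAP:
        i -= 1

    if i > 0:
        # The boundary is the END of the previous unrelated citation, not its
        # start, so none of that citation's own text enters the context.
        floor = max(floor, citations[i - 1][1])
    return floor
```

For example, with citations at (30, 45), (120, 135) and (137, 150), the context for the second and third citations both start at offset 45, the end of the first citation. A purely distance-based parallel test like this one would also group two unrelated citations that happen to sit close together, so the real code may need additional signals.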


3. ✅ Reject Legal Analysis Text in Extracted Names

File: src/case_name_validator.py

What it does:

  • Validates extracted names don't contain legal analysis phrases
  • Rejects names containing: "Frye rulings de novo", "WPLA claim", "ER 702", "We review", etc.
  • Rejects names starting with legal analysis phrases

Impact: Prevents contamination from surrounding legal text
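
A minimal sketch of this kind of rejection, assuming a simple phrase list, is shown below. The function and variable names are hypothetical, and the list holds only the phrases quoted above; the list in case_name_validator.py is longer ("etc.").

```python
import re

# Only the phrases quoted in this summary; the real list contains more.
LEGAL_ANALYSIS_PHRASES = ["Frye rulings de novo", "WPLA claim", "ER 702", "We review"]

# Word boundaries keep "ER 702" from matching inside text such as "Baker 702".
_PHRASE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in LEGAL_ANALYSIS_PHRASES) + r")\b",
    re.IGNORECASE,
)


def is_clean_case_name(name: str) -> bool:
    """Return False if the name contains (or starts with) a legal analysis phrase."""
    return _PHRASE_RE.search(name) is None
```

Because the search covers the whole string, a name that starts with one of the phrases is rejected by the same test as a name that merely contains one.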


4. ✅ Remove Legal Analysis Phrases from Context

File: src/utils/strict_context_isolator.py

What it does:

  • Removes legal analysis phrases from context BEFORE extraction
  • Patterns like: "Frye rulings de novo", "WPLA claim", "We review choice of law", etc.

Impact: Prevents legal text from contaminating extracted case names
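
The pre-extraction cleanup could look roughly like the following. This is a sketch under assumptions rather than the code in strict_context_isolator.py: the names are hypothetical, the pattern list contains only the examples quoted above, and replacing each match with a single space is an illustrative choice.

```python
import re

# Only the example patterns quoted in this summary; the real list is longer.
LEGAL_ANALYSIS_PATTERNS = [
    r"Frye rulings de novo",
    r"WPLA claim",
    r"We review choice of law",
]

# Also consume one trailing punctuation mark and any following whitespace.
_STRIP_RE = re.compile(
    r"\b(?:" + "|".join(LEGAL_ANALYSIS_PATTERNS) + r")\b[.;,]?\s*",
    re.IGNORECASE,
)


def strip_legal_analysis(context: str) -> str:
    """Remove legal analysis phrases so they cannot be absorbed into a case name."""
    cleaned = _STRIP_RE.sub(" ", context)
    # Collapse the whitespace runs left behind by the removals.
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

Note that removing text shifts character offsets, so any citation positions computed on the original text would need to be recomputed against the cleaned context.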


Expected Results

After these fixes, you should see:

  1. Fewer wrong extracted names - Names should match the citation they're extracted for
  2. No legal text contamination - Names like "Frye rulings de novo. L.M. v. Hamilton" should be rejected
  3. Better boundary detection - Case names from nearby citations shouldn't bleed through

Testing

Test with the problematic cases from your results:

  1. Erickson v. Pharmacia LLC, 1980 - Should extract the correct name, not "Env't Def. Fund"
  2. Rice v. Dow Chemical Co., 1994 - Should extract "Rice v. Dow Chemical Co.", not "Erickson v. Pharmacia"
  3. State v. Copeland, 1996 - Should extract "State v. Copeland", not "Frye rulings de novo. L.M. v. Hamilton"
  4. State v. Cauthron, 1993 - Should extract "State v. Cauthron", not "Frye hearing. State v. Copeland"
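
These four cases can be kept as a small regression table. The harness below is only a sketch: the `extract` callable is a stand-in for however the real extractor is invoked (its actual signature is not shown in this summary), and the expected name for the first case is left unset because the summary only says which name must not appear.

```python
from typing import Callable, List, NamedTuple, Optional


class Case(NamedTuple):
    label: str               # citation label as listed above
    expected: Optional[str]  # None where the summary gives no exact expected name
    forbidden: str           # wrong name that was being extracted before the fixes


CASES = [
    Case("Erickson v. Pharmacia LLC, 1980", None, "Env't Def. Fund"),
    Case("Rice v. Dow Chemical Co., 1994", "Rice v. Dow Chemical Co.",
         "Erickson v. Pharmacia"),
    Case("State v. Copeland, 1996", "State v. Copeland",
         "Frye rulings de novo. L.M. v. Hamilton"),
    Case("State v. Cauthron, 1993", "State v. Cauthron",
         "Frye hearing. State v. Copeland"),
]


def failures(extract: Callable[[str], str]) -> List[str]:
    """Run every case through extract (hypothetical) and describe the failures."""
    problems = []
    for case in CASES:
        got = extract(case.label)
        if case.expected is not None and got != case.expected:
            problems.append(f"{case.label}: expected {case.expected!r}, got {got!r}")
        elif case.forbidden in got:
            problems.append(f"{case.label}: still contains {case.forbidden!r}")
    return problems
```

An empty list from `failures` means none of the four known bleeding patterns reproduced.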

Next Steps

If issues persist:

  1. Check logs for [BOUNDARY-VALIDATION] messages to see why names are being rejected
  2. Review context windows - May need to adjust max_lookback or boundary detection
  3. Add more legal phrase patterns - If new contamination patterns are found