Fix OCR double period artifacts #25

Open
aqilaziz wants to merge 1 commit into ilhamfp:main from aqilaziz:fix-ocr-double-periods

aqilaziz commented May 6, 2026

Summary

  • normalize OCR/tokenization double-period artifacts into a single period
  • preserve valid ellipsis (...)
  • add parser unit tests for the new behavior

Fixes #14
Also covers the list-marker case described in #19.
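
The normalization described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `collapse_double_periods` and the regex are assumptions, and the list-marker input is only a guess at the kind of case #19 describes. The idea is to match a run of exactly two periods, using lookarounds so that a three-period ellipsis is left alone.

```python
import re

# Hypothetical pattern (not taken from the PR): two periods that are neither
# preceded nor followed by another period, so "..." never matches.
_DOUBLE_PERIOD = re.compile(r"(?<!\.)\.\.(?!\.)")


def collapse_double_periods(text: str) -> str:
    """Replace each run of exactly two periods with a single period."""
    return _DOUBLE_PERIOD.sub(".", text)


# Illustrative inputs only; the real test cases live in test_ocr_correct.py.
print(collapse_double_periods("End of sentence.. Next one"))
print(collapse_double_periods("and so on..."))
print(collapse_double_periods("a.. first item"))
```

A lookaround-based pattern keeps the rule to a single substitution and avoids a separate step that protects and restores ellipses. It does not handle spaced variants such as `. .`, which would need their own rule if the OCR output contains them.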

Verification

  • python -m pytest scripts/parser/test_ocr_correct.py -q
  • python -m py_compile scripts/parser/ocr_correct.py scripts/parser/test_ocr_correct.py
  • git diff --check

Development

Successfully merging this pull request may close these issues.

[Parser] Double periods from OCR not cleaned up