ocr_line vs. ocrx_line  #19

@kba

@zdenop asked this on the hOCR mailing list in 2012 and never received an answer:

I need clarification of ocr_line vs. ocrx_line

The hOCR spec defines ocrx_line as:

  • any kind of "line" returned by an OCR system that differs from the
    standard ocr_line above
  • might be some kind of "logical" line

hocr-tools provides this example of ocr_line[1]:

 <span class='ocr_line' title='bbox 461 648 2077 707'>Alice was beginning to get very tired of sitting by her sister on the bank,</span>

And tesseract-ocr (r729) produces this hOCR output:

  <span class='ocr_line' id='line_2' title="bbox 464 651 2074 704">
      <span class='ocrx_word' id='word_5' title="bbox 464 651 569 688">Alice</span>
      <span class='ocrx_word' id='word_6' title="bbox 591 665 667 688">was</span>
       ...
      <span class='ocrx_word' id='word_19' title="bbox 1962 660 2074 704">bank,</span>
  </span>

Does tesseract-ocr's ocr_line meet the criteria of a "standard ocr_line", or
should it use ocrx_line?
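
To make the structural difference between the two examples concrete, here is a minimal Python sketch that reads the Tesseract fragment above using only the standard library. It does not settle which class is correct. It only shows what a consumer sees: a line span whose text sits in nested ocrx_word spans rather than directly in the line. The names `HocrLines`, `parse_bbox` and `contains` are hypothetical helpers made up for this illustration, and the words elided by "..." in the original output are simply left out.

```python
from html.parser import HTMLParser

# Tesseract fragment from above; the words elided by "..." are omitted.
HOCR = """
<span class='ocr_line' id='line_2' title="bbox 464 651 2074 704">
  <span class='ocrx_word' id='word_5' title="bbox 464 651 569 688">Alice</span>
  <span class='ocrx_word' id='word_6' title="bbox 591 665 667 688">was</span>
  <span class='ocrx_word' id='word_19' title="bbox 1962 660 2074 704">bank,</span>
</span>
"""

LINE_CLASSES = ("ocr_line", "ocrx_line")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


def contains(outer, inner):
    """True if the bbox `inner` lies entirely within the bbox `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


class HocrLines(HTMLParser):
    """Collect line spans and the ocrx_word spans nested inside them."""

    def __init__(self):
        super().__init__()
        self.lines = []    # one dict per ocr_line / ocrx_line span
        self._stack = []   # class attribute of every currently open <span>
        self._word = None  # word dict currently receiving text, if any

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attributes = dict(attrs)
        cls = attributes.get("class", "")
        bbox = parse_bbox(attributes.get("title", ""))
        inside_line = any(c in LINE_CLASSES for c in self._stack)
        self._stack.append(cls)
        if cls in LINE_CLASSES:
            self.lines.append({"class": cls, "bbox": bbox, "words": []})
        elif cls == "ocrx_word" and inside_line:
            self._word = {"bbox": bbox, "text": ""}
            self.lines[-1]["words"].append(self._word)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            if self._stack.pop() == "ocrx_word":
                self._word = None

    def handle_data(self, data):
        # Only text inside a word span is kept; whitespace between spans is ignored.
        if self._word is not None:
            self._word["text"] += data


parser = HocrLines()
parser.feed(HOCR)
line = parser.lines[0]
texts = [w["text"] for w in line["words"]]
words_inside = all(contains(line["bbox"], w["bbox"]) for w in line["words"])
print(line["class"], line["bbox"], texts, words_inside)
```

For this fragment every word box lies within the line box, so the span at least behaves like a physical line enclosing its words. Whether that is enough to count as a "standard ocr_line" is exactly the open question.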
