@zdenop asked this on the hOCR mailing list in 2012 and never received an answer:
I need clarification of ocr_line vs. ocrx_line
The hOCR spec defines ocrx_line as:
- any kind of "line" returned by an OCR system that differs from the
standard ocr_line above
- might be some kind of "logical" line
hocr-tools provides this example of ocr_line[1]:
<span class='ocr_line' title='bbox 461 648 2077 707'>Alice was beginning to get very tired of sitting by her sister on the bank,</span>
And tesseract-ocr (r729) produces this hOCR output:
<span class='ocr_line' id='line_2' title="bbox 464 651 2074 704">
<span class='ocrx_word' id='word_5' title="bbox 464 651 569 688">Alice</span>
<span class='ocrx_word' id='word_6' title="bbox 591 665 667 688">was</span>
...
<span class='ocrx_word' id='word_19' title="bbox 1962 660 2074 704">bank,</span>
</span>
Does tesseract-ocr's ocr_line meet the criteria of a "standard ocr_line", or
should it use ocrx_line?
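
To make the structural difference between the two samples concrete, here is a minimal sketch using only Python's standard `html.parser`. It does not settle the spec question; it only reports, for each line element, its class and the classes of the spans nested inside it. The names `LineInspector` and `inspect_lines` are hypothetical helpers made up for this illustration, and both sample strings are shortened versions of the snippets quoted above.

```python
from html.parser import HTMLParser


class LineInspector(HTMLParser):
    """Record each ocr_line/ocrx_line span and the classes of spans nested in it."""

    LINE_CLASSES = {"ocr_line", "ocrx_line"}

    def __init__(self):
        super().__init__()
        self.lines = []   # one dict per line element, in document order
        self._stack = []  # per open <span>: the line dict it opened, or None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        line_class = next((c for c in classes if c in self.LINE_CLASSES), None)
        if line_class is not None:
            entry = {"class": line_class, "children": []}
            self.lines.append(entry)
            self._stack.append(entry)
            return
        # Not a line element: attribute it to the innermost open line, if any.
        enclosing = next((e for e in reversed(self._stack) if e is not None), None)
        if enclosing is not None:
            enclosing["children"].extend(classes)
        self._stack.append(None)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()


def inspect_lines(hocr):
    """Return a list of (line_class, [classes of nested spans]) tuples."""
    parser = LineInspector()
    parser.feed(hocr)
    parser.close()
    return [(line["class"], line["children"]) for line in parser.lines]


# Shortened hocr-tools style sample: the line holds plain text only.
HOCR_TOOLS_SAMPLE = (
    "<span class='ocr_line' title='bbox 461 648 2077 707'>"
    "Alice was beginning to get very tired</span>"
)

# Shortened tesseract-ocr style sample: the line wraps ocrx_word spans.
TESSERACT_SAMPLE = (
    "<span class='ocr_line' id='line_2' title=\"bbox 464 651 2074 704\">"
    "<span class='ocrx_word' id='word_5' title=\"bbox 464 651 569 688\">Alice</span> "
    "<span class='ocrx_word' id='word_6' title=\"bbox 591 665 667 688\">was</span>"
    "</span>"
)

if __name__ == "__main__":
    print(inspect_lines(HOCR_TOOLS_SAMPLE))
    print(inspect_lines(TESSERACT_SAMPLE))
```

Both samples are reported with the class ocr_line; the only difference the sketch finds is that the tesseract-ocr line contains nested ocrx_word spans while the hocr-tools line contains bare text.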