Conversation
|
|
||
| ```html | ||
| ... | ||
| <span class="ocrx_line"> |
There was a problem hiding this comment.
ocr_lines nested in ocrx_line? That's doesn't look right to me.
There was a problem hiding this comment.
It's ocr_line nested in ocrx_line, in this case a single heading split over two lines.
But I'll gladly make a better example if you have an idea. What i've seen in the wild is just replacements for ocr_line, e.g. https://github.com/jwilk/ocrodjvu/blob/master/lib/hocr.py.
There was a problem hiding this comment.
It's ocr_line nested in ocrx_line
Yeah, I fixed my original mistake...
There was a problem hiding this comment.
Sadly, I don't know what is the right way in this case.
There was a problem hiding this comment.
ocrx_line is engine-specific line markup. It exists for those cases where your OCR engine outputs text lines that don't correspond to "normal" text lines.
The most common case is if you apply an engine that's not capable of column segmentation to a multi-column document and you want to prevent subsequent processing stages from assuming that the text lines it gets contain text in reading order.
Basically, if you use ocrx_line instead of ocr_line, you're (intentionally) breaking most subsequent processing, since most OCR output processing will look for ocr_line tags (and assume they are in reading order).
There was a problem hiding this comment.
Tom, thanks for clarifying this for us.
No description provided.