ocrx_line example #39

Open
kba wants to merge 1 commit into master from ocrx_line-example

Owner

@kba commented Oct 1, 2016

No description provided.

@kba force-pushed the ocrx_line-example branch from cd35c43 to b69b342 on October 1, 2016 15:06

```html
...
<span class="ocrx_line">
```

Collaborator

@amitdo commented Oct 2, 2016

ocr_lines nested in ocrx_line? That doesn't look right to me.

Owner Author

It's ocr_line nested in ocrx_line, in this case a single heading split over two lines.

But I'll gladly make a better example if you have an idea. What I've seen in the wild is just replacements for ocr_line, e.g. https://github.com/jwilk/ocrodjvu/blob/master/lib/hocr.py

Collaborator

> It's ocr_line nested in ocrx_line

Yeah, I fixed my original mistake...

Collaborator

Sadly, I don't know what the right way is in this case.

Contributor

ocrx_line is engine-specific line markup. It exists for those cases where your OCR engine outputs text lines that don't correspond to "normal" text lines.

The most common case is when you apply an engine that's not capable of column segmentation to a multi-column document and you want to prevent subsequent processing stages from assuming that the text lines they receive contain text in reading order.

Basically, if you use ocrx_line instead of ocr_line, you're (intentionally) breaking most subsequent processing, since most OCR output processing will look for ocr_line tags (and assume they are in reading order).
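
A minimal sketch of that effect, using only Python's standard `html.parser`. The hOCR fragment and the names `LineCollector` and `collect_lines` are made up for illustration; they are not taken from this PR or from any existing hOCR tool. It assumes a downstream consumer that selects lines purely by class name.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment (an assumption, not the example from the PR):
# a heading split over two ocr_line elements and wrapped in an ocrx_line,
# followed by an engine-specific line that spans two columns.
HOCR = """
<div class="ocr_page">
  <span class="ocrx_line">
    <span class="ocr_line">A heading split</span>
    <span class="ocr_line">over two lines</span>
  </span>
  <span class="ocrx_line">left column text right column text</span>
</div>
"""


class LineCollector(HTMLParser):
    """Collect the text of every element carrying a given class name.

    Sketch only: it assumes well-nested markup without void elements.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.lines = []
        self._stack = []   # one bool per open element: does it match?
        self._depth = 0    # number of open matching elements
        self._buffer = []  # text pieces of the outermost matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        wanted = self.wanted_class in classes
        self._stack.append(wanted)
        if wanted:
            if self._depth == 0:
                self._buffer = []
            self._depth += 1

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if not self._stack:
            return
        if self._stack.pop():
            self._depth -= 1
            if self._depth == 0:
                # Normalize whitespace before storing the line text.
                self.lines.append(" ".join(" ".join(self._buffer).split()))


def collect_lines(hocr, wanted_class):
    parser = LineCollector(wanted_class)
    parser.feed(hocr)
    parser.close()
    return parser.lines


normal_lines = collect_lines(HOCR, "ocr_line")
engine_lines = collect_lines(HOCR, "ocrx_line")
print(normal_lines)
print(engine_lines)
```

A consumer that only looks for `ocr_line` never sees the two-column line, because it is marked up only as `ocrx_line`. That is the intentional breakage described above.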

Collaborator

Tom, thanks for clarifying this for us.

@amitdo mentioned this pull request Oct 22, 2016
kba added a commit that referenced this pull request Nov 30, 2017