Boilerplate removal header post processing incorrect #36

@tfmorris

Description

The conditional here is wrong:
https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/JusTextBoilerplateRemoval.java#L350
causing the algorithm to attempt to reclassify non-headings, not just headings. Conditionals that are inverted just to save a little indentation make my head hurt and are error-prone, so I'd recommend using normal logic that matches the algorithm descriptions. That is, in this case, instead of:

        if (!(paragraph.isHeading() && paragraph.getClassType().equalsIgnoreCase("bad")
                && !paragraph.getContextFreeClass().equalsIgnoreCase("bad"))) {
            continue;
        }

use

        if (paragraph.isHeading() && paragraph.getClassType().equalsIgnoreCase("bad")
                && !paragraph.getContextFreeClass().equalsIgnoreCase("bad")) {
            // ... existing heading reclassification logic goes here ...
        }

The current code goes pathologically wrong for documents with a large number of empty elements (45,000 "paragraphs" in the example I looked at, many of which were consecutive <br> elements). In this case the 200-character distance limit is never reached to trigger the loop exit, so each element scans most of the elements that follow it, giving roughly O(n²) processing of 45,000 elements.

This suggests a couple of other possible improvements:

  • compress runs of more than 2 consecutive <br> elements
  • introduce a limit on the maximum number of elements scanned, in addition to the maximum number of characters
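
A minimal sketch of both ideas follows. It is not code from the project: `HeadingScanSketch`, `Para`, `compressBreakRuns`, `goodParagraphFollows`, and the `maxElements` parameter are all hypothetical names, and the forward scan is only loosely modeled on the heading post-processing described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; Para is a stand-in, not the project's paragraph class. */
final class HeadingScanSketch {

    /** Hypothetical minimal paragraph: tag name, text, and current class. */
    static final class Para {
        final String tag;
        final String text;
        final String classType;

        Para(String tag, String text, String classType) {
            this.tag = tag;
            this.text = text;
            this.classType = classType;
        }
    }

    /** Keeps at most maxRun consecutive br elements and drops the rest of each run. */
    static List<Para> compressBreakRuns(List<Para> paragraphs, int maxRun) {
        List<Para> out = new ArrayList<>();
        int run = 0;
        for (Para p : paragraphs) {
            run = "br".equalsIgnoreCase(p.tag) ? run + 1 : 0;
            if (run <= maxRun) {
                out.add(p);
            }
        }
        return out;
    }

    /**
     * Scans forward from the element after index start and reports whether a
     * "good" paragraph is found before either limit is hit: accumulated text
     * length above maxChars, or more than maxElements elements inspected.
     */
    static boolean goodParagraphFollows(List<Para> paragraphs, int start,
                                        int maxChars, int maxElements) {
        int chars = 0;
        for (int j = start + 1;
             j < paragraphs.size() && j - start <= maxElements;
             j++) {
            Para next = paragraphs.get(j);
            if ("good".equalsIgnoreCase(next.classType)) {
                return true;
            }
            chars += next.text.length();
            if (chars > maxChars) {
                return false; // character distance limit reached
            }
        }
        return false; // element limit or end of document reached
    }
}
```

With the element limit in place, empty elements can no longer keep the scan running to the end of the document, so the work per heading is bounded by `maxElements` regardless of how much text the intervening elements contain.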
