fix: detect bold/italic/underline from semantic HTML tags (<b>, <strong>, <i>, <em>, <u>)#94

Open
vibeyclaw wants to merge 1 commit into alphanome-ai:main from vibeyclaw:fix/highlighted-text-classifier-b-tags

Conversation

@vibeyclaw

Problem

Fixes #61

The HighlightedTextClassifier was silently skipping text styled via semantic HTML tags like <b> and <strong>. As shown in the issue, many SEC filings use bare <b> tags rather than style="font-weight:bold" to apply bold formatting, so those text elements were never classified as HighlightedTextElement or promoted to TitleElement.

Root cause: _compute_effective_style in text_styles_metrics.py only walked the tag tree looking at style="..." attributes. It had no knowledge of the implied CSS properties that HTML semantic tags carry.

Fix

Extended _compute_effective_style to recognise a small set of semantic tags and map them to their implied CSS properties after any inline style attribute is processed (so inline styles still win):

| Tag | Implied CSS |
| --- | --- |
| <b>, <strong> | font-weight: bold |
| <i>, <em> | font-style: italic |
| <u> | text-decoration: underline |

The setdefault pattern already used throughout the function ensures correct cascade precedence: an explicit inline style always wins over the implied tag style.
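To illustrate the precedence rule described above, here is a minimal, self-contained sketch. It is not the actual code in text_styles_metrics.py: the names IMPLIED_STYLES, parse_inline_style, and effective_style are hypothetical, and the ancestor chain is modelled as a plain list of (tag name, style attribute) pairs rather than real parsed HTML nodes.

```python
# Hypothetical sketch of the setdefault-based cascade; not the sec-parser implementation.

# Assumed mapping from semantic tag names to the CSS property they imply.
IMPLIED_STYLES = {
    "b": ("font-weight", "bold"),
    "strong": ("font-weight", "bold"),
    "i": ("font-style", "italic"),
    "em": ("font-style", "italic"),
    "u": ("text-decoration", "underline"),
}


def parse_inline_style(style):
    """Parse 'a: b; c: d' into {'a': 'b', 'c': 'd'}, skipping malformed parts."""
    result = {}
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue
        prop, value = declaration.split(":", 1)
        result[prop.strip().lower()] = value.strip().lower()
    return result


def effective_style(chain):
    """Compute the effective style for a chain of (tag_name, style_attr) pairs.

    The chain is ordered from the innermost element to the outermost ancestor,
    so values set by inner elements are never overwritten by outer ones.
    """
    effective = {}
    for tag_name, style_attr in chain:
        # Inline declarations are applied first, so they win on the same element.
        for prop, value in parse_inline_style(style_attr or "").items():
            effective.setdefault(prop, value)
        # The implied tag style only fills in a property that is still unset.
        implied = IMPLIED_STYLES.get(tag_name.lower())
        if implied is not None:
            effective.setdefault(*implied)
    return effective
```

In this sketch, a bare <b> inside a <div> yields font-weight: bold, while <b style="font-weight: normal"> yields font-weight: normal, because the inline declaration is recorded before the implied one is considered.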

Tests added

  • test_should_detect_bold_from_b_tag
  • test_should_detect_bold_from_strong_tag
  • test_should_detect_italic_from_i_tag
  • test_should_detect_italic_from_em_tag
  • test_inline_style_should_override_semantic_tag (precedence check)
  • Two new test_title_step parametrize cases (bold via <b> tag and bold via <strong> tag) verifying end-to-end promotion to TitleElement

All 12 tests pass. No existing tests were modified.

…ng>, <i>, <em>, <u>)

The HighlightedTextClassifier was not detecting text styled via semantic
HTML tags such as <b> and <strong>. The _compute_effective_style function
only examined inline CSS style attributes, missing the implied font-weight
that <b> and <strong> carry by default.

Fix: Extend _compute_effective_style to map semantic tag names to their
implied CSS properties after any inline style attribute is processed.
Inline styles still take precedence (processed first with setdefault semantics).

Supported tag → CSS property mappings added:
  <b>, <strong>  → font-weight: bold
  <i>, <em>      → font-style: italic
  <u>            → text-decoration: underline

Adds tests covering <b>, <strong>, <i>, <em>, inline-style override, and
end-to-end HighlightedTextClassifier / TitleClassifier integration.

Fixes alphanome-ai#61

Development

Successfully merging this pull request may close these issues.

Make HighlightedTextClassifier work with <b> tags
