Parse partial HTML as-is? #160

maxcorbeau · 2025-04-19T10:06:08Z

When I try to parse HTML, selectolax does a few extra things (not shocking for an HTML parser):

adds extra tags (html/head/body from what I can see)
strips invalid tags (e.g. a <tr> encountered outside a <table>)

from selectolax.parser import HTMLParser, parse_fragment

sample = "<tr><i>foo</i><i>bar</i></tr>"

# Full-document parse: html/head/body are added, and <tr> is stripped
# because it is not part of a table.
print(f"{HTMLParser(sample).html=}")
# => <html><head></head><body><i>foo</i><i>bar</i></body></html>

# Fragment parse: <tr> is still lost, and we get a flat list of nodes
# instead of a tree.
print(f"{[x.html for x in parse_fragment(sample)]=}")
# => ['<i>foo</i>', '<i>bar</i>']

Is there a way to use selectolax in loose mode (i.e. don't remove/add any tags)?

The reason I want to use selectolax is speed (I get ~3x to 4x vs. lxml, ~20x vs. bs4).

If selectolax can't do it, I think I'm going to end up using a pure-Rust XML parser.

pygarap · 2025-11-24T00:53:50Z

@maxcorbeau With PR #188 being merged, you can do it! (just wait for the release)

from selectolax.lexbor import LexborHTMLParser

sample = "<tr><i>foo</i><i>bar</i></tr>"
print(f"{LexborHTMLParser(sample, is_fragment=True).html=}")
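
For comparison, here is a minimal sketch of what "as-is" parsing looks like, using Python's standard-library html.parser instead of selectolax. It is a tokenizer that reports tags in source order and does not apply HTML5 tree-construction rules, so the <tr> survives and nothing is added. The names TagRecorder and record_events are hypothetical helpers made up for this illustration, and being pure Python it is unlikely to come close to selectolax on speed.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical helper: records tags and text exactly in source order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, whether or not it is valid in context.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def record_events(markup):
    """Feed the markup through the tokenizer and return the recorded events."""
    recorder = TagRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.events


sample = "<tr><i>foo</i><i>bar</i></tr>"
events = record_events(sample)
for kind, value in events:
    print(kind, value)
```

This only yields a flat event stream, so building a tree (and any CSS-selector support) would still be up to the caller.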
