HTML to DOM

A short note on parsing HTML documents in Java using the W3C DOM (Document Object Model). The DOM is worth learning because it is a widely implemented standard: once you know how it works, you can apply the same concepts to HTML (and XML) documents in JavaScript, Python, .NET, …

Valid HTML

This approach should successfully parse any document that can be transformed to a DOM according to the W3C standard, including valid HTML documents in HTML syntax or XHTML syntax. It uses the bootstrapping approach (described in the DOM Level 3 Core specification) and the LS feature (described in the DOM Level 3 Load and Save specification). Note that Load and Save is specified for XML input, so a document in HTML syntax can be expected to load only if it is also well-formed XML.

Obtain a DOM object from a valid HTML document
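import java.io.File;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSParser;

// LOGGER is assumed to be an SLF4J-style logger defined elsewhere.
Document doc;
File inputFile = new File("input.html");
String inputUrl = inputFile.toURI().toURL().toExternalForm();

// Bootstrap a DOM implementation that supports the Load and Save ("LS") feature.
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
LSParser builder = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
doc = builder.parseURI(inputUrl);

Element docE = doc.getDocumentElement();
LOGGER.info("Main tag name: {}.", docE.getTagName());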
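
When the markup is already in memory rather than in a file, the same LS bootstrapping can be fed through an LSInput. The following is a minimal sketch under that assumption, not a definitive implementation; the class name LsStringParseSketch and the method rootTagName are hypothetical, and checked bootstrapping exceptions are wrapped so the helper is easy to call.

```java
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

// Hypothetical helper class, for illustration only.
final class LsStringParseSketch {

    /** Parses markup held in a String and returns the tag name of its root element. */
    static String rootTagName(String markup) {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementationLS impl =
                    (DOMImplementationLS) registry.getDOMImplementation("LS");

            // An LSInput can carry the document as character data instead of a URI.
            LSInput input = impl.createLSInput();
            input.setStringData(markup);

            LSParser parser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
            Document doc = parser.parse(input);
            return doc.getDocumentElement().getTagName();
        } catch (ReflectiveOperationException e) {
            // newInstance() declares reflective exceptions when no implementation is found.
            throw new IllegalStateException("No DOM implementation could be bootstrapped", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>t</title></head><body/></html>";
        System.out.println(rootTagName(xhtml));
    }
}
```

Parse failures surface as unchecked exceptions from the LS API, so callers that expect untrusted input may want to catch them explicitly.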

XHTML

This approach uses the JAXP DocumentBuilder API (which reports parse errors through SAX types such as SAXException) rather than the standard bootstrapping approach. I recommend using the previous approach instead where applicable. (This one may also be limited to XML documents, though I have had success loading a document in HTML syntax with it.)

Obtain a DOM object from a valid XHTML document
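import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// LOGGER is assumed to be an SLF4J-style logger defined elsewhere.
Document doc;
File inputFile = new File("input.html");
String inputUrl = inputFile.toURI().toURL().toExternalForm();

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
doc = builder.parse(inputUrl);

Element docE = doc.getDocumentElement();
LOGGER.info("Main tag name: {}.", docE.getTagName());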
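
One detail worth knowing with this API: a DocumentBuilderFactory is not namespace-aware by default, while XHTML elements live in the http://www.w3.org/1999/xhtml namespace. The sketch below, which assumes the XHTML is available as a String, turns namespace awareness on and looks elements up by namespace. The class name XhtmlNamespaceSketch and the method countParagraphs are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
final class XhtmlNamespaceSketch {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Counts the XHTML p elements in a well-formed XHTML string. */
    static int countParagraphs(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // Namespace-qualified lookup: only p elements in the XHTML namespace match.
            NodeList paragraphs = doc.getElementsByTagNameNS(XHTML_NS, "p");
            return paragraphs.getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XHTML input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"" + XHTML_NS + "\"><body>"
                + "<p>first</p><p>second</p></body></html>";
        System.out.println(countParagraphs(xhtml)); // prints 2
    }
}
```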

Real-life HTML

HTML documents in the wild are seldom valid. For these cases you can use the jsoup library, which parses malformed HTML leniently and can convert the result to a W3C DOM.
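
To see why a lenient parser is needed, the following sketch feeds a small piece of tag soup (an unclosed br element) to the strict JAXP parser, which rejects it. The class name StrictParseCheck and the method isWellFormedXml are hypothetical names chosen for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
final class StrictParseCheck {

    /** Returns true if the strict XML parser accepts the markup, false if it rejects it. */
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and stays silent otherwise,
            // which avoids the parser's default error output on stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the markup is not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser could not be set up or input not read", e);
        }
    }

    public static void main(String[] args) {
        // Common in real pages: <br> and <p> left unclosed.
        System.out.println(isWellFormedXml("<html><body><p>one<br>two</body></html>"));
        // The same content written as well-formed markup.
        System.out.println(isWellFormedXml("<html><body><p>one<br/>two</p></body></html>"));
    }
}
```

The first call reports false and the second true, which is the gap a lenient parser such as jsoup is meant to bridge.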

Obtain a DOM object from a real-life HTML document
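import java.io.File;
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

Document doc;
File inputFile = new File("input.html");

// Parse leniently with jsoup, then convert the jsoup tree to a W3C DOM.
org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(inputFile, StandardCharsets.UTF_8.name());
doc = new W3CDom().fromJsoup(jsoupDoc);

Element docE = doc.getDocumentElement();
LOGGER.info("Main tag name: {}.", docE.getTagName());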
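
Whichever approach produced the Document, the same org.w3c.dom API applies afterwards. As a rough sketch of that idea, the following lists the href of every a element. The class name DomLinkLister and its methods are hypothetical, and the small parse helper is only a stand-in so the example runs without extra dependencies.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
final class DomLinkLister {

    /** Returns the href value of every a element that has one, in document order. */
    static List<String> hrefs(Document doc) {
        List<String> result = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                result.add(anchor.getAttribute("href"));
            }
        }
        return result;
    }

    /** Stand-in loader for the demo; any of the approaches above could supply the Document. */
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the demo input", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><body><a href=\"a.html\">A</a><a>no link</a></body></html>");
        System.out.println(hrefs(doc)); // prints [a.html]
    }
}
```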

Refs

  • Parsing to a DOM and related technologies in Java: see the JAXP tutorial (focus on the parts related to the DOM)

  • Comparison of HTML parsers (Wikipedia)

  • W3C DOM4 (Recommendation 19 November 2015), a snapshot of the DOM Living Standard