This should successfully parse any document that can be transformed to a DOM according to the W3C standard, including valid HTML documents in HTML syntax or XHTML syntax. It uses the bootstrapping approach (described in the DOM Level 3 Core specification) and the LS feature (described in the DOM Level 3 Load and Save specification).
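The same bootstrapping steps can be tried without a file on disk. The following is a minimal sketch, assuming the markup is already held in a string and is well-formed. The class name LsParseSketch, the helper rootTagName, and the sample markup are hypothetical and not part of the original snippet.

```java
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

// Hypothetical helper class; the name and the sample markup are illustrative only.
final class LsParseSketch {

    // Parses markup held in a string and returns the tag name of the document element.
    static String rootTagName(String markup) throws Exception {
        // Bootstrapping: look up a DOM implementation that supports the "LS" feature.
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
        LSParser parser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);

        // Wrap the string in an LSInput instead of pointing the parser at a URI.
        LSInput input = impl.createLSInput();
        input.setStringData(markup);

        Document doc = parser.parse(input);
        return doc.getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: a small, well-formed XHTML document.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>t</title></head><body/></html>";
        System.out.println("Main tag name: " + rootTagName(xhtml));
    }
}
```

Feeding the parser through an LSInput keeps the example self-contained. The file-based version, which passes a URI to parseURI instead, follows.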
Document doc;
File inputFile = new File("input.html");
String inputUrl = inputFile.toURI().toURL().toExternalForm();
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
LSParser builder = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
doc = builder.parseURI(inputUrl);
Element docE = doc.getDocumentElement();
LOGGER.info("Main tag name: {}.", docE.getTagName());

This approach uses the JAXP DocumentBuilderFactory API (whose parse errors surface as SAX exceptions) rather than the standard bootstrapping approach. I recommend using the previous one instead where applicable. (It might also be specific to loading XML documents, though I have had success loading an HTML document in HTML syntax with it.)
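To illustrate the caveat about XML input, here is a minimal sketch, assuming in-memory markup rather than a file. The class name JaxpParseSketch, the helper parses, and both sample strings are hypothetical. It checks whether DocumentBuilder accepts a given piece of markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name and the sample markup are illustrative only.
final class JaxpParseSketch {

    // Returns true if the JAXP parser accepts the markup, false if it reports a parse error.
    static boolean parses(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement() != null;
        } catch (SAXException e) {
            // Markup that is not well-formed XML is rejected with a SAX exception.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed markup: every element is closed.
        System.out.println(parses("<html><body><p>ok</p><br/></body></html>"));
        // HTML-style markup with an unclosed <br>: not well-formed XML.
        System.out.println(parses("<html><body><p>text<br></p></body></html>"));
    }
}
```

An HTML document that happens to be well-formed XML goes through, while HTML-only constructs such as an unclosed br element are rejected. The parser's default error handler may also print the error to standard error. The file-based snippet for this approach follows.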
Document doc;
File inputFile = new File("input.html");
String inputUrl = inputFile.toURI().toURL().toExternalForm();
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
doc = builder.parse(inputUrl);
Element docE = doc.getDocumentElement();
LOGGER.info("Main tag name: {}.", docE.getTagName());

HTML documents in the wild are seldom valid. For these cases, you may use the jsoup library, which tolerates malformed markup, and convert its result to a W3C DOM.
Document doc;
File inputFile = new File("input.html");
org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(inputFile, StandardCharsets.UTF_8.name());
doc = new W3CDom().fromJsoup(jsoupDoc);
Element docE = doc.getDocumentElement();
LOGGER.info("Main tag name: {}.", docE.getTagName());

- Parsing from DOM and related technologies in Java: see the JAXP tutorial (focus on the parts related to the DOM)
- Comparison of HTML parsers (Wikipedia)
- W3C DOM4 (Recommendation 19 November 2015), a snapshot of the DOM Living Standard