Just like there’s a need to parse XML, there is also a demand for parsing HTML. Here are some of the reasons why people want to parse HTML:
- Displaying the webpage as intended, as a nicely formatted document. Web browsers have their own HTML parsers.
- Extracting valuable information from webpages (like links and/or titles).
- Extracting the text from a webpage (for searching).
- Extracting tables for import into spreadsheets or databases (web scraping).
- Extracting other information (like weather data) and displaying it on other sites (web scraping).
- Web crawlers need to collect all links from multiple webpages in order to follow them.
The most popular approach to HTML parsing is to first convert the HTML to valid XML and then use DOM or SAX to parse the document. However, HTML (and certainly older documents) does not always comply with the XML standards. Take, for example, the following piece of HTML.
The code below is not well-formed HTML, because the em and strong elements overlap:
<!-- WRONG! NOT WELL-FORMED HTML! -->
<p>Normal <em>emphasized <strong>strong emphasized</em> strong</strong></p>
This is what it should look like:
<!-- Correct: Well-formed HTML. -->
<p>Normal <em>emphasized <strong>strong emphasized</strong></em> <strong>strong</strong></p>
<p>Alternatively <em>emphasized</em> <strong><em>strong emphasized</em> strong</strong></p>
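You can see the difference programmatically. The minimal sketch below uses only the JDK’s built-in XML parser to check whether a fragment is well-formed: the version with overlapping tags fails to parse as XML, while the corrected version succeeds.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class WellFormedCheck {

    // Returns true if the fragment parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String fragment) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(fragment)));
            return true;
        } catch (Exception e) { // SAXException on malformed input
            return false;
        }
    }

    public static void main(String[] args) {
        String bad  = "<p>Normal <em>emphasized <strong>strong emphasized</em> strong</strong></p>";
        String good = "<p>Normal <em>emphasized <strong>strong emphasized</strong></em> <strong>strong</strong></p>";
        System.out.println(isWellFormed(bad));  // false: em and strong overlap
        System.out.println(isWellFormed(good)); // true
    }
}
```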
The process of correcting this is called tag balancing, and there are programs which can do it automatically. The oldest and best known is HTML Tidy. In short, it transforms bad HTML into well-formed XHTML, which is actually a form of XML. A good HTML parser should therefore combine a tag balancer with a parser optimized for XHTML. There are some good ones, which I list below:
All of the HTML parsers listed here are open source.
- HTML Parser – A parser which comes with a large number of examples that can be used directly. It also includes a Swing-based GUI which can be used to create complex queries to get anything from an HTML document.
- TagSoup – A classic SAX-based HTML parser by John Cowan. It was used for many years in Apache Nutch (pre-1.0).
- jTidy – One of the first Java HTML parsers; a Java port of the original HTML Tidy.
- jSoup – This one is relatively young and is probably the best available. It takes HTML from a file, a text String, or a URL and parses it into a Document object with additional features. In addition to the common way of selecting what you want from the HTML with XPath, jSoup also offers selection by CSS selectors.
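To illustrate the CSS-selector style, here is a small jSoup sketch (assuming the jsoup library is on the classpath; the HTML string is invented for the example):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDemo {
    public static void main(String[] args) {
        // Parse HTML from a String; Jsoup.connect(url).get() would fetch a live page instead.
        String html = "<html><head><title>Demo</title></head><body>"
                    + "<a href='http://example.com'>Example</a>"
                    + "<a href='http://example.org'>Example Org</a>"
                    + "</body></html>";
        Document doc = Jsoup.parse(html);

        System.out.println(doc.title()); // text of the <title> element

        // CSS selector: every <a> element that has an href attribute.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}
```

Note that jSoup’s tag balancer is built in: even badly nested input like the earlier example is repaired silently during parsing.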
Tip: HTML is often queried with XPath to grab the part you are interested in. To create a good XPath expression, use a browser plugin that can help: you just select the text in the browser and the plugin gives you the XPath.
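Once the HTML has been tidied into well-formed XHTML, such an XPath expression can be evaluated with the standard javax.xml.xpath API. A minimal sketch (the XHTML string is invented for the example):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                     + "<a href='http://example.com'>Example</a>"
                     + "<a href='http://example.org'>Example Org</a>"
                     + "</body></html>";

        // Parse the well-formed XHTML into a W3C DOM tree.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Select the href attribute of every <a> element.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate(
                "//a/@href", doc, XPathConstants.NODESET);

        for (int i = 0; i < hrefs.getLength(); i++) {
            System.out.println(hrefs.item(i).getNodeValue());
        }
    }
}
```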
For a full list of available HTML parsers see https://en.m.wikipedia.org/wiki/Comparison_of_HTML_parsers