What is Web scraping?
Wikipedia tells us the following:
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Web scraping a web page involves fetching it and extracting data from it. Fetching is the downloading of a page (which a browser does when you view the page). Web crawling is therefore a main component of web scraping: it fetches pages for later processing. Once a page has been fetched, extraction can take place. The content of a page may be parsed, searched, reformatted, its data copied into a spreadsheet, and so on. Web scrapers typically take something out of a page to make use of it for another purpose somewhere else. An example would be finding and copying names and phone numbers, or companies and their URLs, to a list (contact scraping).
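The fetch and extract steps can be sketched with jSoup, the Java HTML parser used later in this article. To keep the sketch self-contained it parses a fixed HTML string; a real scraper would fetch the page over HTTP with Jsoup.connect(url).get(). The page content here is made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchExtract {

    // Extraction step: parse the HTML and pull out just the page title.
    static String pageTitle(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Fetching would normally be done over HTTP:
        //   Document doc = Jsoup.connect("https://example.com/").get();
        // Here a fixed snippet stands in for the downloaded page.
        String html = "<html><head><title>Products</title></head>"
                    + "<body><p>ACME Corp, tel. 010-1234567</p></body></html>";

        System.out.println(pageTitle(html)); // Products
        System.out.println(Jsoup.parse(html).select("p").first().text());
    }
}
```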
In a different article I already introduced you to HTML parsers and what you can do with them. In this article we will scrape a complete database from the web, import it into an open source database (MySQL), and make it much more beautiful than it already was.
Of course you can see this as theft, and indeed most governments have laws to prevent it. My government, the Dutch government, is no exception: we have the Database Protection Act, which protects databases from being stolen. But if we read it carefully, it is about personal or company databases, which can be regarded as intellectual property. Fortunately for us, governmental databases don't fall under this law and can therefore be freely scraped.
One prerequisite for scraping complete databases is that all records are similar, so our parser doesn't need different logic for each record type.
A second prerequisite is that we have a single point of entry that lists all records in the database.
Let's go. We take this database as an example. Have a look inside and you will see it's pretty boring: it lists all animal pharmaceutical products for which endorsement has been requested since the start of the Dutch Veterinary Medicines Act. To get the full list, go into the database and click the link that shows the recently added products, study the URL, and change the start and end dates to 1-1-1900 and 1-1-2030. Also change the number of products to list to 25000. This gives you the following URL:
Copy and paste it into a browser to check that the list is complete. Of course it is. Now open your favorite IDE and make sure that the jSoup jar is on the classpath. You can download the jar from jsoup.org. Next, have a look at the jSoup documentation. There's an example of how to extract all the links from a page. Apply that to our page and store all links in a Java collection such as an ArrayList.
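That link-collecting step might look like the sketch below, modeled on the "list links" example in the jSoup cookbook. The HTML snippet and base URL are made-up stand-ins for the product list page; a real run would fetch the list with Jsoup.connect(listUrl).get() on the URL constructed above:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkCollector {

    // Collect every hyperlink on the page into an ArrayList.
    static List<String> collectLinks(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            // "abs:href" resolves relative links against the base URL.
            links.add(a.attr("abs:href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Stand-in for the fetched product list; a real run would use
        //   Document doc = Jsoup.connect(listUrl).get();
        String html = "<table>"
                    + "<tr><td><a href='/product?id=1'>Product 1</a></td></tr>"
                    + "<tr><td><a href='/product?id=2'>Product 2</a></td></tr>"
                    + "</table>";
        System.out.println(collectLinks(html, "https://example.org"));
    }
}
```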
In the next step we loop through all links, retrieve each page, and parse the resulting Documents. Since each Document contains an HTML table with two columns, you need selector expressions to get the contents of each cell (note that jSoup uses CSS-style selectors through its select() method, not XPath).
Now that we have the basics in place, we can start creating a mashup from our data.