The Apache Nutch Virtual Appliance (NVA) is being worked on.
Inspired by the Google Appliance, a black-box search appliance, I recently decided to start working on an Apache Nutch Virtual Appliance.
Project in progress
Since this is a project in progress I can’t show you anything yet, but as development progresses you will of course be kept informed through this page.
From the user’s perspective, I thought the best model to follow is the Google Appliance. This implies the following specifications:
Area of Application. Just like the Google Appliance, our NVA will crawl intranets and business networks only, so we have no need to crawl the whole Internet. This eliminates the need for separate Hadoop instances.
Plug & play – The user shouldn’t be bothered with technical details such as complicated configurations or difficult XML files. The only thing the user should have to do is plug the VA into the network; the rest should work automatically.
Basic configuration should be done when the VA is first started and connected to the network.
Fetching, parsing and indexing should all happen in the background to keep the system responsive during the whole process of crawling and indexing.
During fetching, parsing and indexing, the user should be able to follow the progress on a dashboard, preferably from a web browser.
We are not concerned with crawl speed, since the whole process will run in the background.
The search interface should be clear and intuitive.
Everything should be Open Source, including the hypervisor. We will use Apache Nutch 1.x because of performance considerations. The development hypervisor will be Oracle VirtualBox. The guest OS is Linux; the distribution will probably be Debian.
The search interface will be based on Apache Velocity. This is at present the most promising framework for creating Solr-based, user-friendly interfaces.
Timelines & Planning
I am planning to do this in my spare time, so it’s difficult for me to give an accurate estimate of the delivery date.
Project members sought
If you are an Apache Nutch developer and interested in joining me on this project, please contact me.
Nutch version 2 has been out for 7 years now, while version 1 is also still available and under active development. Currently we can choose from two major branches, 1.x and 2.x. The main difference is that the 2.x branch supports NoSQL databases for its storage, whereas Nutch 1.x keeps its crawl data in Hadoop-based files and sends its index to Apache Solr.
The backend for the NoSQL connectivity is provided by Apache Gora, which provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores, distributed in-memory key/value stores, in-memory data grids, in-memory caches, distributed multi-model stores, and hybrid in-memory architectures.
Gora also enables analysis of data with extensive Apache Hadoop MapReduce™ and Apache Spark™ support. Gora uses the Apache Software License v2.0. Gora graduated from the Apache Incubator in January 2012 to become a top-level Apache project.
At DigitalPebble the two versions were compared to determine which was faster. It was concluded that Nutch 1.x was still the faster on all fronts, mainly because Gora is responsible for a lot of overhead.
At present there’s no need to upgrade from 1 to 2, and this will probably not change in the coming years.
I already introduced you to Solr in a different article. This time I will go further into Solr and one of its magical features: the “more like this” functionality.
We have all seen this, since most modern web search systems also have it.
One time however I was completely surprised by the Solr implementation.
I was indexing a Wiki Garden for a large Dutch governmental organization. This Wiki Garden was a collection of wikis used by the organization to document just about everything you can think of: policies, ideas, project documentation, as well as meeting minutes and discussion forums. All in all the Wiki Garden contained a few hundred thousand documents. Not really Big Data yet, but it came close.
I crawled and indexed the whole garden with Apache Nutch 1.1, which by default stores its index in Apache Solr. After I finished building a nice-looking interface, we decided to add a “more like this” (mlt) link to every single search result in order to find similar documents.
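To give an idea of what such a link can look like, here is a minimal sketch that builds a Solr request with the MoreLikeThis parameters switched on. The host, the core name `nutch`, the field names `title` and `content`, and the helper name `build_mlt_url` are assumptions for illustration; they are not taken from the original setup and should be adapted to the actual schema.

```python
from urllib.parse import urlencode


def build_mlt_url(doc_id,
                  base_url="http://localhost:8983/solr/nutch/select",  # assumed host and core
                  fields=("title", "content"),                         # assumed field names
                  count=5):
    """Build a Solr query URL that asks for documents similar to doc_id."""
    # Escape double quotes so the id can be used inside a quoted phrase.
    safe_id = doc_id.replace('"', '\\"')
    params = {
        "q": 'id:"{}"'.format(safe_id),  # select the document to compare against
        "mlt": "true",                   # enable the MoreLikeThis component
        "mlt.fl": ",".join(fields),      # fields used to find similar documents
        "mlt.mintf": 2,                  # ignore terms occurring fewer than 2 times in the source document
        "mlt.mindf": 5,                  # ignore terms occurring in fewer than 5 documents
        "mlt.count": count,              # number of similar documents to return
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_mlt_url("http://wiki.example.org/Meeting_2010-03-01"))
```

The thresholds for `mlt.mintf` and `mlt.mindf` are arbitrary starting values; on a small index they may need to be lowered before any similar documents show up.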
When Lucene term vectors are stored for the indexed fields, the vector-space model of information retrieval allows us to compute the ‘distance’ between two documents and express it as a number. This number is really a similarity score: it ranges from 0 (nothing in common) to 1 (exactly equal, or the same document).
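The idea behind such a score can be sketched with plain cosine similarity over term counts. This is a simplified illustration, assuming whitespace tokenization and raw term frequencies; it is not Lucene’s actual scoring formula, which also weights terms by how rare they are in the index. The function names are made up for this example.

```python
import math
from collections import Counter


def term_vector(text):
    """Turn a text into a bag-of-words vector (term -> frequency)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors: 0 = nothing shared, 1 = same direction."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc1 = term_vector("minutes of the weekly project meeting")
    doc2 = term_vector("report of the weekly project meeting")
    print(cosine_similarity(doc1, doc2))
```

Two documents with no terms in common score 0, and a document compared with itself scores 1, which matches the range described above.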
The scores 0 and 1 themselves are not of interest, but what was of interest to us were scores from 0.5 to 0.9, so we investigated these. What we found was pretty awesome, especially in the range between 0.9 and 1. These documents were often from different authors describing the same subject (for example a report of the same meeting or a review of the same scientific article)!
Practical applications of document similarity
Here are a few I can think of:
Plagiarism (fraud) detection.
In bibliographic research (to find similar research papers).
In biochemical research (to find genetic similarities in nucleotide sequences). In fact this is already done using BLAST. To see BLAST at work, visit NCBI.