“An ecosystem is a complex, interconnected system: the success of Apple’s products depends on how its separate products work together.” Nature’s balance is such a complex system too, dependent on many organisms working together.
The Lucene ecosystem likewise depends on a number of Apache software products that work very closely together as a single whole. In this article we’ll summarize the components that together make up the Lucene ecosystem.
Apache Lucene is not an out-of-the-box solution. Many people looking for a ready-to-use search solution will be disappointed after downloading Lucene: out of the box, it doesn’t do anything!
In its current form Lucene is only a software library, a tool that Java programmers can use to develop search solutions. It is, however, a full-featured information retrieval (IR) library, covering virtually every aspect of the academic discipline of Information Retrieval.
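To get a feel for what an IR library does at its core, here is a toy inverted index, the data structure at the heart of Lucene. This is a deliberately simplified sketch in Python, not Lucene’s actual Java API: every class and method name here is invented for illustration.

```python
from collections import defaultdict

class ToyIndex:
    """A toy inverted index: maps each term to the set of documents containing it.
    (Illustrative only -- real Lucene adds analysis, scoring, storage, and much more.)"""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of doc ids
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: return documents containing every query term
        terms = query.lower().split()
        if not terms:
            return set()
        result = self.postings[terms[0]].copy()
        for term in terms[1:]:
            result &= self.postings[term]
        return result

index = ToyIndex()
index.add(1, "Lucene is a search library")
index.add(2, "Solr is a search server built on Lucene")
print(index.search("search lucene"))   # {1, 2}
```

Looking up documents by term, instead of scanning documents for terms, is what makes full-text search fast; Lucene builds far more sophisticated versions of exactly this structure.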
Solr is often mentioned in the same breath as Lucene. It solves exactly the problem described above: where Lucene does nothing out of the box, Solr is a working web application built on Lucene that can be used immediately. After installation and configuration, Solr is a full-featured enterprise search application.
The Solr package comes with a manual and a number of example (XML) files.
When you follow the manual, you will be able to index the example files and search them immediately. Solr offers much more than simple indexing and searching. You can create complex queries and retrieve exactly what you are searching for.
Solr is not really for end users. It offers a simple but powerful REST-like interface that web developers can use to create powerful search applications. Like other REST-style APIs, Solr can be used with XML or JSON, and next to these there is client support for the major scripting and programming languages such as PHP, Perl and Python.
Where Lucene only offers a Java API, Solr offers multiple APIs that web developers are already familiar with, while at the same time unlocking all the IR features that Lucene has to offer.
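As a sketch of what that REST-like interface looks like, the snippet below builds a Solr query URL. The `q`, `fl`, `rows` and `wt` parameters are standard Solr query parameters; the host, port and core name (`mycore`) are hypothetical placeholders for your own installation.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name -- adjust for your own setup.
base = "http://localhost:8983/solr/mycore/select"

params = {
    "q": 'title:"lucene ecosystem"',   # the query itself
    "fl": "id,title,score",            # which fields to return
    "rows": 10,                        # maximum number of results
    "wt": "json",                      # response format: JSON instead of XML
}

url = base + "?" + urlencode(params)
print(url)
# An HTTP GET on this URL against a running Solr would return the matching
# documents as JSON -- no Java code required.
```

Switching `wt` between `json` and `xml` is all it takes to change the response format, which is why the same endpoint serves so many different client languages.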
Search and IR applications can only work with plain text. There exists, however, a large number of other formats on the Web. PDFs, Office documents (Word, Excel, PowerPoint), images, video and audio files are all commonly found on the web and contain textual and meta information. Photos, for example, contain EXIF metadata. Depending on the camera (or phone), photos carry information about the camera and its settings, but also the location, date and time when the photo was taken. This can be useful to display the photo within its context (on a map, together with events that took place at the same time).
This is how Google can display photos on Google Maps.
In the same way, music (MP3) files contain meta-information about the artist, year, album and genre of the song. Apple uses this extensively in its iTunes product to sort tracks by date, artist, album or year.
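That MP3 metadata is surprisingly simple to read. The oldest variant, an ID3v1 tag, is just the last 128 bytes of the file, laid out as fixed-width fields. Below is a minimal sketch using only the Python standard library; the fake tag at the bottom is constructed by hand purely for demonstration.

```python
import struct

def read_id3v1(data: bytes):
    """Parse an ID3v1 tag (the last 128 bytes of an MP3 file), if present.
    Layout: 'TAG' + title(30) + artist(30) + album(30) + year(4) + rest."""
    tag = data[-128:]
    if len(tag) != 128 or not tag.startswith(b"TAG"):
        return None   # no ID3v1 tag in this file
    _, title, artist, album, year = struct.unpack("3s30s30s30s4s", tag[:97])
    clean = lambda b: b.rstrip(b"\x00 ").decode("latin-1")
    return {"title": clean(title), "artist": clean(artist),
            "album": clean(album), "year": clean(year)}

# A hand-built fake tag, just to demonstrate the parser:
fake = (b"TAG" + b"My Song".ljust(30, b"\x00")
        + b"Some Artist".ljust(30, b"\x00")
        + b"Some Album".ljust(30, b"\x00")
        + b"1999" + b"\x00" * 31)
print(read_id3v1(fake)["artist"])   # Some Artist
```

Real-world tagging has long since moved on to the richer ID3v2 format, which is considerably more complex; this is exactly the kind of format-specific drudgery that a dedicated extraction toolkit takes off your hands.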
Initially, Apache Tika was nothing more than a system to detect the file types of all the available media types. Then, once the media type was known, it could delegate each file to the parser responsible for it, which would extract the text from the file.
This work was originally done as part of the development of Apache Nutch (see below). Later on, the Apache Software Foundation decided to integrate the detector with the parsers, and Tika became a stand-alone application for media/MIME type detection and text extraction.
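Media-type detection itself usually comes down to inspecting the first few “magic” bytes of a file. The sketch below shows the idea behind such a detector; the magic numbers listed are real and well known, but the table is of course a tiny fraction of the hundreds of types a tool like Tika recognizes.

```python
# A tiny magic-byte detector -- the core idea behind MIME type detection.
# Each entry maps a file's leading bytes to its media type.
MAGIC = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"PK\x03\x04": "application/zip",   # also the container for .docx, .xlsx, .jar
}

def detect(data: bytes) -> str:
    """Return the media type of the data, based on its leading magic bytes."""
    for magic, mime in MAGIC.items():
        if data.startswith(magic):
            return mime
    return "application/octet-stream"   # fallback: unknown binary data

print(detect(b"%PDF-1.7 ..."))   # application/pdf
```

Once the type is known, the file can be routed to the matching parser, which is precisely the detect-then-delegate pipeline described above.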
Apache Nutch was an attempt by the Apache Software Foundation to create a world-class web crawler and search engine that could compete with Google.
It was successful pretty soon: in June 2003, a working 100-million-page demonstration system was built. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two facilities were later spun out into their own subproject, called Hadoop.
Hadoop was created as a subproject of Apache Nutch to handle very large amounts of data, also known as big data. It consists of two major components:
- MapReduce – A programming framework that allows code to be executed on multiple machines simultaneously. This solves a processing scalability problem: if the system requires more processing power, extra machines (nodes) can simply be added to the cluster. MapReduce was first invented and described by Google in a research paper.
- HDFS – Hadoop Distributed File System – This solves the problem of scaling data storage, in the same spirit as MapReduce: when more storage is required, additional nodes can be added to the cluster. Together, all the nodes behave as a single file system which can be read, written and formatted.
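The MapReduce programming model is easy to demonstrate on a single machine. Below is the classic word-count example in plain Python: a map phase emits (word, 1) pairs, a shuffle groups the pairs by word, and a reduce phase sums the counts. In a real Hadoop job the same two functions would run spread across many nodes; this single-process sketch only illustrates the model.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.lower().split()]

def reduce_phase(word, counts):
    # Reduce: sum all the counts emitted for one word.
    return (word, sum(counts))

lines = ["hadoop scales out", "hadoop stores big data"]

# "Shuffle": sorting the mapped pairs brings equal keys together,
# which is what the framework does between the map and reduce phases.
pairs = sorted(p for line in lines for p in map_phase(line))
counts = dict(reduce_phase(word, [c for _, c in group])
              for word, group in groupby(pairs, key=itemgetter(0)))

print(counts["hadoop"])   # 2
```

Because both phases operate on independent key/value pairs, the framework can split the input across as many nodes as are available, which is exactly the scalability property described above.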
Both MapReduce and HDFS are also designed for failover: whenever a node in the cluster fails, the other nodes take over its work.
“The NameNode controls all other nodes in the cluster”
In analogy with XML, each cluster has a top node, called the NameNode. This node doesn’t participate in the map and reduce jobs itself. It is the node that controls all active nodes in the cluster and reassigns (delegates) jobs when one or more nodes fail or break.
Hadoop vs Google
You can test this yourself: Yahoo! Search uses Hadoop. Perform a search with the same search terms on both Google and Yahoo! and compare the results.
All sites using Hadoop
The link below lists a comprehensive number of sites using Hadoop, along with their cluster sizes, petabytes stored and numbers of processors. Very interesting! The list opens at ‘Y’ (for Yahoo!) since this is the biggest currently known cluster.
The Yahoo! cluster occupies a multi-storey data warehouse. It takes several hours to boot up the whole cluster; the boot process is finished when the last node has reported itself to the NameNode.