This is a true story
Back in 2008, I was working on a Search project at a large Dutch governmental organization with branches all over the world. It was a research project in which two commercial Enterprise Search products were evaluated and compared with an Open Source equivalent. Besides being a simple product comparison, this story touches on some core Information Retrieval subjects.
The Open Source candidate was Apache Nutch and I was the consultant to implement it.
When I first came into the office, I had a meeting with my Project Manager, who showed me around the department and introduced me to some of the other engineers, as well as to some of the management staff and officers I was going to work with. Everything in this organization was highly confidential, so I can’t tell you many details, partly because I didn’t get all the information myself and partly because the project was classified (the organization was the Dutch Ministry of Defense). My PM was a nice and reasonable man with the rank of Colonel. I already knew the organization: 20 years earlier I had served in it in uniform, at another location.
This time, however, I was hired as a civilian for my Nutch expertise. I was actively involved in the community at the time, and my name could be found in several Apache documents on the Internet.
At the end of the introduction, my PM gave me my security chip card. He explicitly warned me not to forget it or, even worse, lose it, since it was my key to the building.
A few days later, however, Murphy’s law struck. On my way home I stopped to do some shopping and lost my chip card in the supermarket.
After I noticed I had lost it, I went back to the supermarket and asked if it had been found. Of course no one had found it (or reported finding it).
I contacted my PM and told him what had happened. As I mentioned before he was a reasonable man and told me not to worry too much about it. The next day we would go to the Security department who would have a solution for this problem. I wasn’t the only one. It had happened before.
The next day I reported my problem to the security office. The MP in charge was obviously not happy with the situation and let me know it the ‘Military way’. He knew there was a new procedure for this; he just didn’t know the exact details. Those were in a recent Word document somewhere on the organization’s Intranet. He instructed me to go and search for the document and appointed two other security officers to help me. By the time we started searching it was 4 PM. I sat down at my workstation and started to have a look at the Intranet.
The Intranet was basically a large collection of Office Documents, hosted on a Microsoft Internet Information Server (IIS) which came with its own full-text Indexing and Search server, the Microsoft Index Server.
Searching with Microsoft; a lot of work and no results.
My first query was “procedure AND verloren AND pas”. Since the organization was Dutch, I had to search in Dutch. Literally translated this means “procedure AND lost AND pass”. The AND operator tells the search server to return only documents that contain all three terms: procedure, lost and pass.
The search returned a few thousand documents which matched the query. I looked into a few of them. I saw reports about stolen passes in the Netherlands Antilles and several other documents that didn’t contain the information I was looking for.
Searching with operators like AND, OR, NOT and NEAR is called Boolean searching. The queries are called Boolean queries.
Boolean queries are very common and popular because they can be nested into complex queries like: ((cat OR cats) AND (dog OR dogs) NEAR (food OR foods)) NOT (medicine OR medicines).
Professional researchers and academic librarians still use them when searching for scientific documents in databases like MEDLINE, CAS, Analytical Abstracts, TOXLINE (a subset of MEDLINE), EMBASE, DERWENT or BEILSTEIN, to name just a few. (There are many more, but listing them here falls outside the scope of this article.)
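To make the Boolean idea concrete, here is a minimal sketch in Python. It is only an illustration, not how Microsoft Index Server works internally: the four toy documents and the helper names (`build_index`, `term`, `and_`, `or_`, `not_`) are my own assumptions. Each term maps to the set of documents that contain it, and the operators become plain set operations. NEAR is left out because it also needs the position of every term in a document.

```python
from collections import defaultdict

# Hypothetical toy documents; not taken from the Intranet in this story.
DOCS = {
    1: "procedure for a lost pass",
    2: "stolen pass reported in the antilles",
    3: "lost card procedure for visitors",
    4: "dog food and cat food",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


INDEX = build_index(DOCS)
ALL_IDS = set(DOCS)


def term(word):
    """Ids of all documents containing the word (empty set if unknown)."""
    return INDEX.get(word.lower(), set())


def and_(*sets):
    return set.intersection(*sets)


def or_(*sets):
    return set.union(*sets)


def not_(ids):
    return ALL_IDS - ids


# procedure AND lost AND pass
strict = and_(term("procedure"), term("lost"), term("pass"))
# procedure AND lost AND (pass OR card)
wider = and_(term("procedure"), term("lost"), or_(term("pass"), term("card")))

print(sorted(strict))  # [1]
print(sorted(wider))   # [1, 3]
```

Note that a document either matches or it doesn't. Nothing in this model says which of the matching documents is the best one, and that is exactly the problem I ran into.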
I changed my query to “procedure AND (lost NEAR (pass OR card))”, so that ‘lost’ had to appear close to either ‘pass’ or ‘card’.
Again this resulted in a list of over a thousand documents, all matching the query but none of them what I was looking for. It was now 7 PM and we were getting tired. The security officers were used to leaving at 5 PM.
Finally, the officer in charge decided to phone his superior and ask him if he knew where the document could be. The superior officer had written the document himself and remembered exactly where he had stored it. A few days later I had a new chip card and was able to walk in and out of the office. I had learned from this exercise: I would never lose my chip card again, and the document in question was obviously a needle in a haystack and therefore a good test case.
A few months later I had implemented all the requirements in Nutch (officials are very strict, especially when it comes to security and forms) and I got the opportunity to crawl and index the complete Intranet. Like the organization itself, the Intranet consisted of four departments: Army, Air Force, Navy and Military Police. Each department had about 20,000 to 30,000 Office documents, so the complete organization had an estimated 100,000 documents.
We crawled one department at a time. Each crawl took about 3 hours, so crawling the complete Intranet took about 12 hours. (We used only a single virtual machine, no Hadoop cluster.)
Searching with Nutch; Result in less than a minute!
After the crawl had finished, the first thing I did was run a test search. In the search box I typed (what else) “procedure lost pass”. (The Boolean operators are not necessary for Nutch. Like Google, Nutch regards every term as relevant.)
The result: Bingo! What had taken three men 3 hours (nine person-hours, or roughly a full day in FTE terms) was now done in just ten seconds. A saving of one full-time-equivalent day! How was this possible?
The answer lies in a few technical differences between the Microsoft Search system and Nutch. I will explain them here:
- Result ranking – The only thing Microsoft takes into account is the Term Frequency (TF) of a document. The reasoning is simple: the more often the search terms occur in a document, the more relevant the document. Nutch does the same thing but adds some more intelligence, such as references from other documents (also known as anchors, incoming links or inLinks). The more documents link to a document, the more important it must be, and if the links come from important documents, the ‘score’ must be higher still. This ‘score’, as I call it, is known at Google as ‘PageRank’. In fact, Nutch borrowed the idea from Google: both Nutch and Hadoop were inspired by Google publications.
- Similarity – Lucene (the engine behind Nutch) has this built in. Using the Vector-Space Model of Information Retrieval, it’s possible to compute a ‘distance’ between two documents, or between a document and a search query. Think of the force vectors we all learned about in high school with Newton’s second law of motion, where we could predict the motion of an object by computing the resultant of all the forces applied to it. The Vector-Space Model is similar. By treating every term in a document as one dimension of a vector, a whole document (or phrase) can be expressed as a single vector, and comparing two such vectors yields one single value. The closer the vectors of two documents point in the same direction, the more similar the documents are. This is typically used in the More Like This (mlt) feature which we often see on the web. For this to work, the terms of each document (and their positions) need to be stored in the index, which Lucene supports.
- Document parts (or Meta-Data) – Documents consist of parts like the title, chapters, paragraphs, and body text. When search terms are found in the title of a document or one of its headings (h1, h2, h3 etc.), it is likely the document is more relevant to the query and the score is boosted by a factor (configurable in Nutch).
- Results clustering – This is what actually did the magic in my case. Based on the similarities described above, search results can be grouped together in conceptual clusters. By applying statistical analysis to these groups, relations can be found and related terms can be displayed next to the search results. In my case, I saw (next to the results) the related term ‘badge’. One click on this term narrowed down the results, and my document was number 1 in the list. I clicked it and found what three men had not been able to find a few months earlier.
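The similarity idea from the list above can be sketched in a few lines of Python. This is a simplification under my own assumptions, not Lucene’s actual scoring, which is more elaborate (among other things it gives rare terms more weight). The documents and the function names `vectorize` and `cosine` are made up for the illustration. Every text becomes a vector of term counts, and the cosine of the angle between two vectors is the similarity: 1.0 means the same direction, 0.0 means no terms in common.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors (0.0 to 1.0)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents; the names and texts are invented for this example.
docs = {
    "badge_procedure": "procedure for a lost badge or pass",
    "antilles_report": "report on a stolen pass in the antilles",
    "canteen_menu": "canteen menu for this week",
}

query = vectorize("procedure lost pass")

# Rank the documents by their similarity to the query, best match first.
ranked = sorted(
    docs,
    key=lambda name: cosine(query, vectorize(docs[name])),
    reverse=True,
)
print(ranked)  # ['badge_procedure', 'antilles_report', 'canteen_menu']
```

Unlike a Boolean query, this gives every document a score, so the best match ends up at the top instead of somewhere among thousands of equally ‘matching’ documents.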
Results clustering is not a standard feature of Nutch; it was implemented using a plugin, the Carrot2 plugin. Carrot2 is an Open Source engine for clustering search results. You can see it in action online; play around with it, especially with the graphical representations. When I search for my own name, the resulting view makes it obvious that the person “Evert Wagenaar” has something to do with Nutch, Solr, Lucene, Indexing, Facebook and LinkedIn, and has some followers on Twitter. Type in your own name and see what you can discover about yourself. It’s fun!
Please note: you won’t find this view in the online version of Carrot2. Instead, you will need to download the workbench version for your platform, which is an Eclipse Rich Client Application. Although carrot2.org looks like a search engine, it actually isn’t one. It uses public APIs from Google, Yahoo!, Bing and DuckDuckGo to provide the search functionality, and Carrot2 does the clustering itself using different algorithms. You don’t have to go to carrot2.org to access it: I downloaded Carrot2 and installed it on my Apache Tomcat server, so you can run it from there as well.
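How a related term like ‘badge’ can surface next to the results is easier to see with a toy example. The sketch below is emphatically not one of Carrot2’s algorithms; it is a much cruder stand-in that I made up for illustration, with invented result titles and hypothetical helper names (`related_terms`, `narrow`). It simply counts in how many results each non-query term appears, offers the most common one as a suggestion, and narrows the list when you ‘click’ it.

```python
from collections import Counter

# Hypothetical result titles for the query "procedure lost pass".
results = [
    "procedure lost badge replacement",
    "badge request form for lost pass",
    "stolen pass report antilles",
    "lost badge procedure security office",
    "annual report lost equipment",
]
query_terms = {"procedure", "lost", "pass"}
stopwords = {"for", "the", "a", "of", "in"}


def related_terms(results, query_terms, stopwords, top_n=3):
    """Count in how many results each non-query term appears."""
    doc_freq = Counter()
    for text in results:
        words = set(text.lower().split()) - query_terms - stopwords
        doc_freq.update(words)
    return doc_freq.most_common(top_n)


def narrow(results, word):
    """Keep only the results that contain the chosen related term."""
    return [r for r in results if word in r.lower().split()]


suggestions = related_terms(results, query_terms, stopwords)
print(suggestions[0])  # ('badge', 3)
print(narrow(results, "badge"))
```

In this made-up data the word ‘badge’ never appears in the query, yet it shows up in three of the five results, which is why it makes a useful suggestion for narrowing down.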
What was the outcome of this study?
I can’t tell. Just like everything at the Ministry of Defense, this is classified information. I’m already in breach by telling you this much.
Managers and CEOs. It’s time to wake up!
Start looking at how much time the employees in your organization spend searching for the information they need to do their jobs! Like everyone else, you know that time = money.
If it’s only 1 FTE per week you should hire me! I’ll do the job for you for a fixed price. Your investment will pay off in 3 months. This is guaranteed. Not good? Money back!