I have five years of experience with Apache Tomcat in a real-world (enterprise) environment, so I know what I’m talking about. Yes, I know my Java, Apache Lucene, Tomcat and a few others.
My adventure started at ABN AMRO in Amsterdam, where I came in to troubleshoot a memory leak that made the Tomcat server unstable. The problem was solved in 2 weeks, but I only left 5 years later.
The problem I was hired for was believed by ABN AMRO IT management to be caused by a web application (the Global Intranet), which was the homepage of every workstation inside ABN AMRO worldwide. That was a little more than 100 thousand workstations, spread across the globe.
The reason I was called was that the IT staff knew the Global Intranet used a certain library called Lucene. When you searched Google at that time for Lucene and my location (the Netherlands), my name came up at the top of the list. And yes, I knew (and still know) something about Apache Lucene.
Lucene was little known at that time, which is what made it a suspect for the IT staff to blame. “Onbekend maakt onbemind” is a Dutch expression; translated into English it means “unknown makes unloved”. When I arrived, I found a hectic environment of “experts” from a wide range of countries: the US, GB, India, Ireland, Scotland, Norway, Canada and more. At first sight nobody had a method or a plan. They all seemed to work unstructured, without any goals.
I soon learned that the Irish guy had been hired for his knowledge of load testing, and indeed he had brought an impressive collection of tools, which he used to put the server under heavy load with the only goal (as far as I could see) of bringing it down. He didn’t succeed.
When I joined the team, my first thought was to clearly define the goals and write down a strategy to follow until the goals were met. The list of goals was as follows:
1. Find out if the problem was really a memory leak.
2. If it was really a memory leak, provide proof.
3. Track down where the problem came from.
4. Find a procedure to repeat the problem.
5. Find a solution to fix the problem.
Of course nobody took the effort to read my plan. Everyone was too busy with their own thing, which was proving that Lucene was indeed the root cause of the memory leak.
I call this ‘wishful thinking’, because showing the manager that he was right would increase the chance of being hired again, which everyone involved wanted.
Most of these people (except for the Indians and Egyptians) were bad, mean and willing to stick their arm up to the elbow into the manager’s ass just to get an extension. The whole culture inside the bank was sick. The truth was less important than making the manager happy, and one way to do that was to support his (debatable) ideas.
I wasn’t like them and was the only one focused on finding the real cause, together with the Indians and Egyptians, who were assigned to me.
Logically I started with point 1. I wanted to see the actual free memory. Every JDK comes with a set of tools to measure Java memory as well as many other properties. These build on JMX (Java Management Extensions).
One JMX-based program to measure (and graph) Java memory is JConsole. Running it against the live system showed me that memory usage was gradually increasing, but that didn’t provide any proof yet.
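For readers who want to see the same kind of number without a GUI, here is a minimal sketch that reads the used heap through the standard `java.lang.management` API. It is not what I ran at the bank, just an illustration. The class name `HeapWatch` and the helper `keepsGrowing` are made up for this example, and a rising line on its own is only a hint, not proof of a leak.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

class HeapWatch {

    /** Used heap in bytes, read from the JVM's own memory MXBean. */
    static long usedHeapBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return heap.getUsed();
    }

    /**
     * Hypothetical helper: true if every sample is higher than the one before.
     * A steadily rising line is a hint, not proof of a leak.
     */
    static boolean keepsGrowing(long[] samples) {
        for (int i = 1; i < samples.length; i++) {
            if (samples[i] <= samples[i - 1]) {
                return false;
            }
        }
        return samples.length > 1;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] samples = new long[5];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = usedHeapBytes();
            System.out.printf("used heap: %d KiB%n", samples[i] / 1024);
            Thread.sleep(1000); // one sample per second
        }
        System.out.println("keeps growing: " + keepsGrowing(samples));
    }
}
```

In a healthy application the used heap goes up and down as the garbage collector does its work, so you would expect `keepsGrowing` to report false over a long enough window.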
I needed a way to repeat the problem (without disturbing the live environment), so I asked IT for a test environment, which they gave me the next day.
I skipped to point 4. Repeating the problem in my test environment now had the highest priority.
I knew about an open source load-testing tool called Apache JMeter. I read the documentation, decided it was an excellent tool for the job, and downloaded it.
In the JMeter documentation I saw it had a nice feature: sped-up simulation of the access logs. This was exactly what I needed. I asked IT for the log files from the period in which the problem had occurred, and 2 hours later I hit the jackpot! I could repeat the problem within roughly one hour, just by replaying the log files at a faster tempo.
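To make the idea of "replaying at a faster tempo" concrete, here is a small sketch of the timing arithmetic only. It is not JMeter code and says nothing about how JMeter does it internally. I assume access log lines in the common log format with a timestamp like `[10/Oct/2005:13:55:36 +0200]`; the class name `LogReplayTiming`, the sample lines and the speed-up factor are all made up for illustration.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class LogReplayTiming {

    // Timestamp layout of the common log format, e.g. 10/Oct/2005:13:55:36 +0200
    private static final DateTimeFormatter CLF_TIME =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    /** Extracts the timestamp between the first '[' and the following ']'. */
    static OffsetDateTime timestampOf(String logLine) {
        int open = logLine.indexOf('[');
        int close = logLine.indexOf(']', open);
        return OffsetDateTime.parse(logLine.substring(open + 1, close), CLF_TIME);
    }

    /**
     * For each line, the number of milliseconds after the start of the replay
     * at which the request should be fired, with real time divided by speedup.
     */
    static List<Long> replayOffsetsMillis(List<String> logLines, int speedup) {
        List<Long> offsets = new ArrayList<>();
        if (logLines.isEmpty()) {
            return offsets;
        }
        OffsetDateTime start = timestampOf(logLines.get(0));
        for (String line : logLines) {
            long realMillis = Duration.between(start, timestampOf(line)).toMillis();
            offsets.add(realMillis / speedup);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real bank logs.
        List<String> lines = Arrays.asList(
                "10.0.0.1 - - [10/Oct/2005:13:55:36 +0200] \"GET /home HTTP/1.1\" 200 2326",
                "10.0.0.2 - - [10/Oct/2005:13:55:46 +0200] \"GET /search HTTP/1.1\" 200 512",
                "10.0.0.3 - - [10/Oct/2005:13:56:36 +0200] \"GET /home HTTP/1.1\" 200 2326");
        System.out.println(replayOffsetsMillis(lines, 10));
    }
}
```

With a speed-up of 10, a full day of traffic fits in under two and a half hours, which is how a problem that takes days to appear in production can show up within an hour in a test environment.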
After I had repeated the problem 3 times it became obvious to me what was going on. There was no Lucene memory leak. It was the Global Intranet itself that caused the problem.
OOME in Java
Java itself has great memory management built in: the garbage collector (GC). Normally this works fine, and programmers don’t need to manage memory since the GC does it automatically. In rare cases a Java program can throw an OutOfMemoryError (OOME). On our server such an error meant the end: the entire Java process stopped. It’s the death of the server, and the only way to get it running again is to restart the server, including the web application.
In fact, everyone had overlooked the issues that can arise when an Apache Tomcat server is running 24/7 and is also in use 24/7. In the case of a global intranet we can expect the following to happen:
1. Asia wakes up: all the employees go to work and switch on their computers, which causes the server to instantiate objects that consume memory on the Apache Tomcat server.
2. In the meantime the Middle East wakes up and does the same as in step 1.
3. Asia is done for the day and goes home. The objects on the server don’t suddenly disappear; they wait until their time-to-live has expired before they are destroyed (invalidation).
4. Europe wakes up and starts with step 1.
5. The Middle East goes home.
6. The US wakes up and starts with step 1.
7. Europe goes home.
The world keeps turning and the process never stops. When the US goes home, Asia is already at work, and so on.
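The cycle above can be sketched as a toy model. Everything in it is an assumption for illustration: the class name `FollowTheSun`, the four regions, their start hours (UTC), the 8-hour working day and the head counts, which I simply chose so that they add up to the roughly 100 thousand workstations mentioned earlier. The only point is to show how a time-to-live keeps objects of one region alive while the next region is already creating its own.

```java
class FollowTheSun {

    // Hypothetical regions: {start of working day in UTC hours, number of users}.
    private static final int[][] REGIONS = {
            {1, 30000},   // Asia
            {5, 10000},   // Middle East
            {8, 35000},   // Europe
            {14, 25000}   // US
    };

    private static final int WORK_HOURS = 8;

    /** Sessions of one region still alive at the given hour (0-23, UTC). */
    static int liveSessions(int hour, int startHour, int ttlHours, int users) {
        // Hours since this region started work, wrapped around midnight.
        int sinceStart = Math.floorMod(hour - startHour, 24);
        // Sessions live during the working day plus the time-to-live afterwards.
        return sinceStart < WORK_HOURS + ttlHours ? users : 0;
    }

    /** Sessions of all regions together that are alive at the given hour. */
    static int totalLive(int hour, int ttlHours) {
        int total = 0;
        for (int[] region : REGIONS) {
            total += liveSessions(hour, region[0], ttlHours, region[1]);
        }
        return total;
    }

    public static void main(String[] args) {
        for (int hour = 0; hour < 24; hour++) {
            System.out.printf("%02d:00 UTC  ttl 0h: %6d   ttl 4h: %6d%n",
                    hour, totalLive(hour, 0), totalLive(hour, 4));
        }
    }
}
```

In this toy model, at 09:00 UTC there are 45000 live sessions without a time-to-live and 75000 with a 4-hour one, because the Asian sessions are still waiting to expire while Europe is already logged in.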
Java is object oriented, and with each user session objects are instantiated that consume memory on our Apache Tomcat server. You can imagine that the Global Intranet on our Apache Tomcat server needed a system in place that could carefully manage the number of live objects.
A quick look inside the web application told me that the original developers (oyster.com) had taken this into account: they had built in a caching system for this and had even made it configurable.
After I altered the cache configuration, the problem was solved. Just to make sure, I ran the JMeter test a few more times without crashing the server. The changes were documented and the problem was gone forever.
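I can’t show the Global Intranet’s own cache here, so the following is only a generic sketch of what "a configurable cache that limits the number of live objects" can look like in plain Java. It uses the well-known `LinkedHashMap` idiom with `removeEldestEntry`; the class name `BoundedCache` and the parameter `maxEntries` are my own, hypothetical names and have nothing to do with the original code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** A small least-recently-used cache that never holds more than maxEntries items. */
class BoundedCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries; // the "configurable" part of this sketch

    BoundedCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after every put: drop the least recently used entry when over the limit.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedCache<String, String> cache = new BoundedCache<>(2);
        cache.put("asia", "page A");
        cache.put("europe", "page B");
        cache.get("asia");             // touch "asia" so "europe" becomes the eldest
        cache.put("us", "page C");     // pushes "europe" out
        System.out.println(cache.keySet());
    }
}
```

Note that `LinkedHashMap` is not thread-safe, so in a real web application you would wrap it (for example with `Collections.synchronizedMap`) or use a cache library instead.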
So I had finished my job and wanted to go home, but then my project managers decided that I was useful to their organization. They had requirements for some new projects, so I was asked to stay for a while and work on more Apache Tomcat projects.
- Attempts to make a career just by putting your arms in the boss’s ass don’t work. Commitment to your job and trying to find the truth (even if it’s not what the manager wants to hear) works better in the end.
- Apache Tomcat is a good and stable platform, even in multinationals with 100,000 or more concurrent clients.
- Apache Lucene is a good, stable IR library that can easily handle thousands of concurrent searches per second, as long as the index isn’t extremely large.
- Trust open source software! It’s usually better and more stable than software from Microsoft or other commercial suppliers. Modern open source software (especially Apache) is more stable than commercial variants because it has been tested thoroughly by a large community of users, developers, administrators and automated systems.
- If you start working with a new system, take the time to read the documentation. It’s always better to spend extra time making sure you are working with a well-configured system than to find out years later that you have invested millions in troubleshooting by expensive hired consultants who actually weren’t the “experts” they claimed to be.
- In case you doubt this story, contact some of my LinkedIn connections who were there. Here are some names: Adrena Keating, Wim Fokkert, Melle de Wit, Satnam Agrawal, Ron Warlich. None of them knows I’m mentioning them here, so if you want to contact them, please refer to this article.
To experience the performance of my Tomcat hosting environment yourself, you can try a few of my Tomcat applications:
1 and 2 are Lucene based. You can try to put them under heavy load, especially 2. It’s already under a load of 5 searches per second. I tested it with 2000 queries per second, which still didn’t affect performance. Currently even complex queries take about 1 to 2 ms.