Welcome!

AJAX & REA Authors: John Funnell, Bob Little, Kevin Hoffman, Maureen O'Gara, Onkar Singh

Related Topics: Java

Java: Article

Crunching Big Data with Java

One Team, One Month, One JVM

Finally, on to implementation! As luck would have it, I happen to have at my disposal a very handy dataflow framework implemented in Java, (now in beta 2 and available as a free download at www.pervasivedatarush.com). Looking at the component framework, I see that I'm missing a few of the pieces I need. This isn't a problem since the framework is very extensible. I'll just have to code up the new operators in Java, plug them in, and move on to the next step.

The encoders are easy: the Apache project already includes several encodings, including Soundex and Metaphone, two of the most popular encodings. It only took a few minutes to wrap those library calls in an operator in our dataflow framework.

The Group Pairs operator was not too challenging either, since it's very like an inner join, but without the redundant pairings. It took a few days to get that implementation working and unit tested.

It was easy to find good algorithm definitions for many of the field comparison methods we wanted to use. Again, Apache already had an implementation of the Levenshtein Edit Distance measure. The other comparisons were coded from their algorithms. Based on the algorithm, each comparison took a day or so to code, unit test, and then begin using in the application.,p> For the classification operator, we started with a simple weighted average. The operator takes as input the results of each of the field comparisons, applies the specified weight, and the outputs the average. The score is normalized to be between 0 and 1, where 1 is a perfect match. This operator had to be written from scratch, but was very easy to put together and begin using. The application was written in a way to allow applying another classification technique easily. In the future, we may implement a classification technique based on modeling, for example: using a Support Vector Machine (SVM). Using such a model supports "learning" what comparison results lead to good matches. This can be a much more powerful and flexible classification method than simple weighted averages.

This is a key point: dataflow frameworks are very flexible and extensible. As discussed in the classification phase, it will be very easy in the future to design a new classification method that can be plugged into the overall application without affecting either the upstream or downstream components. As long as the new operator conforms to the defined contract, plugging it in place of another operator is very simple, almost plug-and-play.

All of the pieces were already in place for filtering since we're using a threshold comparison to the record score. We plug in a compare to constant operator pipelined to a row selection operator. Any candidate record pairs that don't meet or exceed the base record score are filtered out of the output.

The Results
This fuzzy matching application was built for a specific customer, but in the beginning of the project we didn't have access to the customer's data (which I have found to be a common occurrence). So we needed to generate data. The Febrl project comes with a very useful data generator that creates original records and then duplicates records with a tunable amount of "editing." We generated a data set with one million original records and one million duplicates. The distribution of the data into blocks was done using the state and city fields with Soundex encoding on city. Given the data set, a total of 329 million candidate pairs were generated. The Levenshtein comparison was used on nine fields.

We ran the performance tests for this project on a four-way box with dual-core Opteron processors (for a total of eight cores). The box has 16GB of memory and runs 64-bit Windows Server 2003.

Our first runs on this lab machine ran the whole process in about 46 minutes. Overall, the performance seemed okay, but not great. On to profiling ...

Using the Java profiler in NetBeans, we were able to see that the Levenshtein edit distance process was slinging lots of character arrays and was the bottleneck in processing. Looking closer at the algorithm implementation, we found that it was coded as a static method, and so allocated character data arrays for each call. The implementation was rewritten using code optimized for many calls. The data arrays are now allocated once (re-allocated as needed for size) and the code was optimized for speed. This combination of changes made the implementation run twice as fast.

Once the comparison code was running faster, we tackled optimizing the blocking phase. Using our plug-in to JConsole, which can be seen in Figure 4, we noted that the number of processes in our dataflow graph was unusually high. Due to the task-based partitioning per column (a k a vertical partitioning) and the high number of input columns, the Read, Sort, and Group Pairs operators were doing more work than needed. I realized that we should have thought of that first: obviously, cutting down on the number of columns dragged through any application is always good for performance. So, we added an operator to select only the fields we absolutely needed as early in the pipeline as possible. That gave us another big boost in performance.

At this point, we're running the same data in about nine minutes, as compared to the 46 minutes we started with. That feels much better. That works out to a candidate record rate of about 600K records per second. The CPU utilization looks great: all eight cores pegged at nearly 100% once the application gets past the start-up phase, as can be seen in the JConsole snapshot captured in Figure 5.

At last, actual customer data arrives and we happily run it through our application. Using back-of-the -envelope calculations, we think the customer data will run through in about 10 minutes. I transfer the data over to our lab machine, start up the application, and crank up JConsole to monitor the performance. To my surprise and dismay, the CPU utilization is spotty. It's pegged at 100% at times, but down to 20% or so at others. What's going on? I enable performance monitoring of the application in the framework and run it again. I quickly see, again using our JConsole plug-in, that several partitions of the processing end very quickly. One or two other partitions continue to work, with one finally working all on its own. What's going on?

It's got to be the data! Sure enough, looking at the frequency distribution of the data, two partitions just happen to get several of the biggest groups. The hash partitioner is doing a great job since the number of records distributed to each partition is almost even. The near-quadratic nature of the data generators is the problem. How can this be fixed? Our first thought was to use a load-balance partitioner at the end of the blocking phase to spread the work out evenly over the rest of the application. Using the flexibility of the dataflow framework, this was easy to try out. And sure enough, it helps, but not enough. What really needs to happen is to implement the pair generation in a way that's already load-balanced. Next step: more creative thinking of how to partition the problem. We must use all of the CPU at all times - it's our quest and we won't rest till the mission is accomplished! We're prototyping a solution and will have it in the customer's hands soon. Initial results look good with faster run times and full CPU utilization, even with unbalanced data. Running the latest on an HP Proliant with four quad-core Xeon processors we are processing more than one million candidate record pairs per second. That's over 10 million Levenshtein field measures per second. I'll leave the redesigned blocking algorithm as an exercise for the reader...

Upon delivering the application to the end user we realized the customer wants the record pairs “rolled up” into sets. For instance, if A matches B and B matches C and A matches C, the application outputs three record pairs: AB, AC and BC. What the customer really wants is A, B and C in the same set so a “winner” record can be selected. The customer had been thinking about the problem for a while and wrote a program in SQL to do the roll-up. The SQL ran in about three hours over 14 million matched pairs. That is way too slow given the matching itself runs in minutes. Realizing the roll-up is a disjoint set problem, we found an algorithm for combining disjoint sets, coded the algorithm in Java and incorporated the code into an operator within DataRush, our dataflow framework. Running against the same data using the DataRush solution on the customer’s same 8-core machine, the runtime dropped to 22 seconds. Java rocks! What a difference! Same output 490 times faster: needless to say the customer is very happy with the DataRush implementation.

This experience exemplifies a problem we run into often: using an RDBMS for a job it was never meant to accomplish. Running outside of the database will often provide a much more efficient solution. In this particular case, the savings includes a massive speed-up without the friction of loading and unloading the database.

Concluding Thoughts
Java can perform well and scale on large data analytic problems. But a different architectural approach is needed. The traditional way of optimizing "hot" sections of code simply isn't enough to deal with massive data volumes. Applications must be designed from the beginning with parallelism in mind.

We have shown that a successful approach for data analytic problems is to take a data-oriented view of the solution. This type of analysis leads to designs that are transferred easily to implementations using dataflow techniques. It was demonstrated during the course of this article how a small team of three developers created a fuzzy matching application in a short amount of time using these techniques. The debugging and profiling tools built into Java and the dataflow framework used were instrumental in optimizing the application not only for minimal runtime, but for optimal resource (CPU) utilization. The development focus on parallelization and utilization lead to good scalability, allowing the application to show much faster runtimes on configurations with more processing cores. This is important because we don't want to have to re-code as more cores arrive on the scene.

The productivity of Java, its excellent IDE support, and the wide variety of Java libraries available make it an excellent platform for software development. My team's work on the fuzzy matching application demonstrates that Java can also be used for applications that are a mix of data-intensive and compute-intensive elements. The Java platform provides an excellent mix of design-time and runtime performance and scalability. With new architectural approaches and dataflow library extensions, Java can be turned into a formidable data-crunching machine.

References
  •   http://en.wikipedia.org/wiki/Dataflow.
  •   Mattson, Sanders and Massingill. Patterns for Parallel Programming. Addison-Wesley.
  •   Febrl - A Freely Available Record Linkage System with a Graphical User Interface. http://crpit.com/confpapers/CRPITV80Christen.pdf.
  •   The Febrl project on the web: http://sourceforge.net/projects/febrl.

More Stories By Jim Falgout

Jim Falgout is solutions architect for Pervasive Software, where he applied dataflow principles to help architect Pervasive DataRush. He is active in the Java development community; in May of 2007, he presented a technical paper titled 'Unleashing the Power of Multi-Core Processors: Scalable Data Processing in Java Technology' at JavaOne.

Comments (2) View Comments

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


Most Recent Comments
Eman 04/05/08 10:33:42 AM EDT

Funny, Cos, you are pointing out how Java isn't all that "free & open" like its corp. creator claims it is... the beauty of open source + patent law = morass of bear traps

Frankly, I haven't seen any Java framework that holds a match to this DataRush thing... download and see for yourself.

Cos 03/27/08 08:05:17 PM EDT

Daah! Check US Patent 7,020,699
Filed: December 19, 2001