Crunching Big Data with Java

One Team, One Month, One JVM

What type of processing is the data going through? The processing may be some sort of data transformation, data mining analytics, business intelligence, data aggregation, matching, or data cleansing task, but a key factor is that it applies to the whole data set. Multi-core machines provide a great opportunity for streamlining this processing. To take full advantage of these hardware resources, however, software developers need better approaches.

And the amount of data gets larger every day. The ease and reduced cost of capturing, transferring, and storing information have resulted in huge collections of records, transactions, and events. As capacities for processing (more cores) and storage (faster and higher-capacity disks) continue to increase at dizzying speeds, the processing requirements increase almost as fast. Plus, no one ever wants to let go of that data, because once you have it, you want to squeeze every possible ounce of value out of it.

Data-Oriented Thinking
Within the disciplines of data mining, data quality, matching, searching, and other data-intensive efforts, there are compute-intensive algorithms for specific functions. It seems natural to solve our performance and scalability problems by focusing on these algorithms and optimizing them using standard techniques.

While the above approach helps, it's only one part of solving the problem. A more general approach is needed, a different way of attacking the problem. Instead of thinking in terms of the functions to run on the data (the algorithms) or even the objects to use to process the data, I suggest you think in terms of the data itself. Consider the way the data needs to flow to be transformed, processed, mined, cleansed, or matched to reach the end goal. This dataflow way of thinking isn't a step back to the flow chart days of COBOL, but a higher-level data-oriented abstraction. Once you can break a data-intensive, analytic problem into such a form, it can be mapped to a dataflow graph that may be easily implemented in Java. We'll see how in the course of this article.

A dataflow is a graph structure that consists of two basic building blocks: a processing node and a data edge. A processing node (or process) transforms its input data in some way and writes the transformed data to its output queue. The edges in the graph are dataflow queues: blocking queues used to transmit data between the processing nodes. A data processing application is built in a dataflow framework by stitching together dataflow processes to provide the desired data transformations. See Figure 1 for a depiction of a simple dataflow graph.
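To make the node-and-edge idea concrete, here is a minimal sketch using only the standard java.util.concurrent classes. It is not the API of any particular dataflow framework; the class name TinyDataflow, the END marker, and the queue capacity of 2 are all assumptions made for illustration. Two processes run on their own threads, and the calling thread acts as the final consumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration: reader -> (queue) -> upper-caser -> (queue) -> collector.
final class TinyDataflow {
    // End-of-data marker, compared by identity so it can never collide with a real record.
    private static final String END = new String("<end>");

    static List<String> run(List<String> input) {
        // The edges of the graph: small bounded queues, so put() blocks when a queue is full.
        BlockingQueue<String> raw = new ArrayBlockingQueue<>(2);
        BlockingQueue<String> transformed = new ArrayBlockingQueue<>(2);

        // Processing node 1: emits the input records one at a time.
        Thread reader = new Thread(() -> {
            try {
                for (String record : input) {
                    raw.put(record);
                }
                raw.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Processing node 2: transforms each record as soon as it arrives.
        Thread upperCaser = new Thread(() -> {
            try {
                for (String r = raw.take(); r != END; r = raw.take()) {
                    transformed.put(r.toUpperCase(Locale.ROOT));
                }
                transformed.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        upperCaser.start();

        // The calling thread drains the last queue until it sees the end marker.
        List<String> output = new ArrayList<>();
        try {
            for (String r = transformed.take(); r != END; r = transformed.take()) {
                output.add(r);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a", "b", "c"))); // prints [A, B, C]
    }
}
```

Neither process knows about the other. Each one only reads from its input queue and writes to its output queue, and the bounded queues keep a fast producer from running far ahead of a slow consumer.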

If you use command line shells such as sh or bash, you're already familiar with this design methodology. From the shell command line you can run command pipelines such as:

awk -F, '{print $3}' < file.txt | sort | uniq -c | sort -rn

(This pipeline reads the third column of the input CSV file, sorts the values, gets a count of each distinct value, and then sorts the distinct values by their frequency in reverse order.) This command pipeline is hooked together using the standard input and standard output data streams of each command. The shell creates a process for each command, hooks the outputs to inputs, and then lets the data flow. None of the commands needs to know about the others; each just needs to honor the contract of reading its input from standard input and writing its output to standard output. Contract-based programming in action!

It's the same with dataflow programming. A dataflow process defines a contract: its input and output dataflow queues. A process may also have properties that can be set, analogous to the command line options of shell commands. Building a dataflow graph is very similar to building a shell pipeline. However, dataflow processes can normally have multiple inputs and multiple outputs. Shell commands can have two outputs, standard output and standard error, but only one input. There are other differences as well, but the analogy is useful for understanding the concepts. Figure 1 depicts the given shell command line example implemented as a dataflow graph. In the dataflow graph, the awk command is replaced with a reader process. The uniq command is replaced with a Group operator that uses a row count aggregator.
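For comparison, the computation performed by the shell pipeline can be written in a few lines of plain Java. This sketch uses java.util.stream, not a dataflow framework, and the names ColumnFrequency and thirdColumnCounts are hypothetical. It also makes two assumptions the shell version does not: rows with fewer than three fields are skipped, and ties in frequency are broken alphabetically.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration of the read / group-and-count / sort-by-frequency steps.
final class ColumnFrequency {
    static List<Map.Entry<String, Long>> thirdColumnCounts(List<String> csvLines) {
        return csvLines.stream()
                // "Reader" step: split each row and keep the third field.
                .map(line -> line.split(",", -1))
                .filter(fields -> fields.length >= 3) // assumption: skip short rows
                .map(fields -> fields[2])
                // "Group" step with a row count aggregator.
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet().stream()
                // Final sort: highest count first, then by value for a stable order.
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.comparingByKey()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of("1,a,pear", "2,b,apple", "3,c,pear");
        System.out.println(thirdColumnCounts(lines)); // prints [pear=2, apple=1]
    }
}
```

The grouping step removes the need for the first sort in the shell pipeline, which exists there only because the counting command requires sorted input.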

It's important to note that dataflow implements pipeline parallelism by its very nature. Each process in a dataflow graph works largely independently of the other processes in the graph. As a process handles its input and does its transformations, it writes the resulting data to its output, where the next process can pick it up right away. Pipelining is a very powerful construct that allows for simple parallelism. It's one of the basic building blocks of parallel algorithm structure.

Let's look more closely at pipelining. Referring to the dataflow graph in Figure 1, the ReadText process begins reading the input file and outputs the third column of data for every input row. The ReadText process can write each record to its output as it goes; it doesn't have to process all of its input before producing any output. It's a pipeline-friendly process. The Sort process, however, isn't pipeline-friendly. It must read all of its input data and process it before producing any output. To do so, Sort may have to create temporary space on local disk for merging, since it may be asked to handle more input data than it can sort in memory.
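The difference between the two kinds of process can be measured. The sketch below is single-threaded and uses plain iterators as stand-ins for dataflow queues, so it illustrates the behavior rather than any framework; every name in it is hypothetical. It counts how many input rows have been pulled from the source by the time the first output record appears. The sort here is done in memory for brevity, whereas a data-scalable Sort would spill to disk as described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration: how much input a stage consumes before its first output.
final class PipelineFriendliness {
    // Wraps a source and counts how many records have been pulled from it.
    static Iterator<String> counting(Iterator<String> source, AtomicInteger pulled) {
        return new Iterator<String>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public String next() { pulled.incrementAndGet(); return source.next(); }
        };
    }

    // Pipeline-friendly: one record in, one record out, nothing buffered.
    static Iterator<String> thirdColumn(Iterator<String> rows) {
        return new Iterator<String>() {
            @Override public boolean hasNext() { return rows.hasNext(); }
            @Override public String next() { return rows.next().split(",")[2]; }
        };
    }

    // Not pipeline-friendly: must drain all of its input before emitting anything.
    // In-memory only; an assumption made to keep the sketch short.
    static Iterator<String> sort(Iterator<String> values) {
        List<String> all = new ArrayList<>();
        values.forEachRemaining(all::add);
        Collections.sort(all);
        return all.iterator();
    }

    static int pulledBeforeFirstOutput(boolean withSort) {
        AtomicInteger pulled = new AtomicInteger();
        List<String> rows = List.of("1,a,pear", "2,b,apple", "3,c,fig");
        Iterator<String> stage = thirdColumn(counting(rows.iterator(), pulled));
        if (withSort) {
            stage = sort(stage);
        }
        stage.next(); // ask for the first output record only
        return pulled.get();
    }

    public static void main(String[] args) {
        System.out.println(pulledBeforeFirstOutput(false)); // prints 1
        System.out.println(pulledBeforeFirstOutput(true));  // prints 3
    }
}
```

With only the column extraction in place, the first output needs a single input row. Once the sort is added, all three rows are consumed before anything comes out, which is why a downstream process sits idle until Sort has finished reading.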

This illustrates that attacking huge collections of data with the dataflow approach requires some new rules: any data-oriented framework must be data scalable. In other words, it should be able to handle gigabytes to terabytes of data without relying on in-memory algorithms that fail when billions of rows of data have to be handled.

A dataflow graph can also support the "divide and conquer" technique to enhance scalability. This is another basic parallel algorithm structure. With this technique, the input data is partitioned in some way and the same algorithm is applied to each partition. This allows the data to be processed in parallel. On multi-core hardware, divide and conquer (also known as horizontal partitioning) can provide a huge boost to performance.
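Horizontal partitioning can be sketched with a standard thread pool. The example below splits a list into contiguous slices, counts the distinct values in each slice on its own thread, and merges the partial counts. The names PartitionedCount, count, and countSlice are hypothetical, and slicing by position is just one possible partitioning scheme chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration: the same counting algorithm applied to each partition in parallel.
final class PartitionedCount {
    static Map<String, Long> count(List<String> values, int partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            // Divide: cut the input into roughly equal contiguous slices.
            int chunk = (values.size() + partitions - 1) / partitions;
            List<Future<Map<String, Long>>> parts = new ArrayList<>();
            for (int start = 0; start < values.size(); start += chunk) {
                List<String> slice =
                        values.subList(start, Math.min(start + chunk, values.size()));
                parts.add(pool.submit(() -> countSlice(slice)));
            }
            // Conquer: merge the partial counts from every partition.
            Map<String, Long> total = new HashMap<>();
            for (Future<Map<String, Long>> part : parts) {
                part.get().forEach((value, n) -> total.merge(value, n, Long::sum));
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // The per-partition algorithm: identical for every slice.
    private static Map<String, Long> countSlice(List<String> slice) {
        Map<String, Long> counts = new HashMap<>();
        for (String value : slice) {
            counts.merge(value, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The result contains a=3, b=1, c=1 (HashMap iteration order is not guaranteed).
        System.out.println(count(List.of("a", "b", "a", "c", "a"), 2));
    }
}
```

Counting merges cleanly because partial counts can simply be added together. Operations whose partial results cannot be combined this easily need a smarter partitioning scheme, such as sending every record with the same key to the same partition.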


About the Author

Jim Falgout is solutions architect for Pervasive Software, where he applied dataflow principles to help architect Pervasive DataRush. He is active in the Java development community; in May of 2007, he presented a technical paper titled 'Unleashing the Power of Multi-Core Processors: Scalable Data Processing in Java Technology' at JavaOne.

Most Recent Comments
Eman 04/05/08 10:33:42 AM EDT

Funny, Cos, you are pointing out how Java isn't all that "free & open" like its corp. creator claims it is... the beauty of open source + patent law = morass of bear traps

Frankly, I haven't seen any Java framework that holds a match to this DataRush thing... download and see for yourself.

Cos 03/27/08 08:05:17 PM EDT

Daah! Check US Patent 7,020,699
Filed: December 19, 2001