Crunching Big Data with Java
One Team, One Month, One JVM


What type of processing is the data going through? The processing may be some sort of data transformation, data mining analytics, business intelligence, data aggregation, matching, or data cleansing task, but a key factor is that it applies to the whole data set. Multi-core machines provide a great opportunity for streamlining this processing. To take full advantage of these hardware resources, however, software developers need better approaches.

And the amount of data gets larger every day. The ease and reduced cost of capturing, transferring, and storing information have resulted in huge collections of records, transactions, and events. As capacities for processing (more cores) and storage (faster and higher-capacity disks) continue to increase at dizzying speeds, the processing requirements increase almost as fast. Plus, no one ever wants to let go of that data once they have it, because once you have the data, you want to squeeze every possible ounce of value out of it.

Data-Oriented Thinking
Within the disciplines of data mining, data quality, matching, searching, and other data-intensive efforts, there are compute-intensive algorithms for performing specific functions. It seems natural to solve our performance and scalability problems by focusing on these algorithms and optimizing them using standard techniques.

While the above approach helps, it's only one part of solving the problem. A more general approach is needed, a different way of attacking the problem. Instead of thinking in terms of the functions to run on the data (the algorithms) or even the objects to use to process the data, I suggest you think in terms of the data itself. Consider the way the data needs to flow to be transformed, processed, mined, cleansed, or matched to reach the end goal. This dataflow way of thinking isn't a step back to the flow chart days of COBOL, but a higher-level data-oriented abstraction. Once you can break a data-intensive, analytic problem into such a form, it can be mapped to a dataflow graph that may be easily implemented in Java. We'll see how in the course of this article.

A dataflow is a graph structure that consists of two basic building blocks: a processing node and a data edge. A processing node (or process) transforms its input data in some way and writes the transformed data to its output queue. The edges in the graph are dataflow queues: blocking queues used to transmit data between the processing nodes. A data processing application is built in a dataflow framework by stitching together dataflow processes to provide the desired data transformations. See Figure 1 for a depiction of a simple dataflow graph.
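To make the node-and-edge idea concrete, here is a minimal sketch in plain Java. It assumes nothing beyond `java.util.concurrent`; the class name `MiniDataflow`, the `node` helper, and the end-of-data marker are hypothetical names chosen for illustration and are not taken from any dataflow framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

final class MiniDataflow {
    // Hypothetical end-of-data marker. It is a distinct object compared by
    // identity, so it can never collide with a real data value.
    private static final String END = new String("<end-of-data>");

    /** Starts a processing node that reads from one edge and writes to another. */
    static Thread node(BlockingQueue<String> in, BlockingQueue<String> out,
                       Function<String, String> transform) {
        Thread t = new Thread(() -> {
            try {
                for (String item = in.take(); item != END; item = in.take()) {
                    out.put(transform.apply(item)); // blocks while the edge is full
                }
                out.put(END); // pass the marker downstream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    /** Builds the graph source -> trim -> upper-case and collects the output. */
    static List<String> run(List<String> input) {
        BlockingQueue<String> a = new ArrayBlockingQueue<>(4); // the edges are
        BlockingQueue<String> b = new ArrayBlockingQueue<>(4); // bounded blocking
        BlockingQueue<String> c = new ArrayBlockingQueue<>(4); // queues

        // The source runs on its own thread so a full edge cannot stall the collector.
        Thread source = new Thread(() -> {
            try {
                for (String item : input) a.put(item);
                a.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        source.start();
        node(a, b, String::trim);
        node(b, c, String::toUpperCase);

        List<String> result = new ArrayList<>();
        try {
            for (String item = c.take(); item != END; item = c.take()) {
                result.add(item);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while collecting output", e);
        }
        return result;
    }
}
```

Because every edge is bounded, a fast producer is automatically throttled by a slow consumer, which keeps memory use flat regardless of how much data passes through the graph.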

If you use command line shells such as sh or bash, you're already familiar with this design methodology. From the shell command line you can run command pipelines such as:

awk -F, '{print $3}' < file.txt | sort | uniq -c | sort -rn

(This pipeline reads the third column of the input CSV file, sorts the values, gets a count of each distinct value, and then sorts the distinct values by frequency in descending order.) This command pipeline is hooked together using the standard input and standard output data streams of each command. The shell creates a process for each command, hooks the outputs to inputs, and then lets the data flow. None of the commands needs to know about the others; each just needs to honor the contract of reading its input from standard input and writing its output to standard output. Contract-based programming in action!

It's the same with dataflow programming. A dataflow process defines a contract: its input and output dataflow queues. A process may also have properties that can be set, analogous to the command line options of shell commands. Building a dataflow graph is very similar to building a shell pipeline. However, dataflow processes can normally have multiple inputs and multiple outputs. Shell commands can have two outputs, standard output and standard error, but only one input. There are other differences as well, but the analogy is useful for understanding the concepts. Figure 1 depicts the given shell command line example implemented as a dataflow graph. In the dataflow graph, the awk command is replaced with a reader process. The uniq command is replaced with a Group operator that uses a row count aggregator.
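As an illustration of what a grouping step with a row count aggregator does, the following sketch counts runs of equal adjacent values, which is the same job `uniq -c` performs on sorted input. The class and method names are hypothetical and do not come from any framework. The sketch assumes the input is already sorted, so it only ever holds one group in memory at a time.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class RowCountGroup {
    /**
     * Counts runs of equal adjacent values. Assumes the input is sorted,
     * so each distinct value forms exactly one run.
     */
    static List<Map.Entry<String, Long>> countRuns(Iterable<String> sorted) {
        List<Map.Entry<String, Long>> groups = new ArrayList<>();
        String current = null;
        long count = 0;
        for (String value : sorted) {
            if (count > 0 && value.equals(current)) {
                count++; // same group: just bump the aggregate
            } else {
                if (count > 0) {
                    // Group boundary: emit the finished group.
                    groups.add(new AbstractMap.SimpleEntry<>(current, count));
                }
                current = value;
                count = 1;
            }
        }
        if (count > 0) {
            groups.add(new AbstractMap.SimpleEntry<>(current, count)); // last group
        }
        return groups;
    }
}
```

For the sorted input `a, a, b, c, c, c` this yields the groups `a=2`, `b=1`, and `c=3`. In a real dataflow graph each group would be written to the output queue at its boundary rather than collected into a list.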

It's important to note that dataflow implements pipeline parallelism by its very nature. Each process in a dataflow graph works somewhat independently of the other processes in the graph. As a process handles its input and does its transformations, it writes the resulting data to its output. Pipelining is a very powerful construct that allows for simple parallelism. It's one of the basic building blocks of parallel algorithm structure.

Let's look more closely at pipelining. Referring to the dataflow graph in Figure 1, the ReadText process begins reading the input file and outputs the third column of data for every input row. The ReadText process can write each record to its output as it goes; it doesn't have to process all of its input before producing any output. It's a pipeline-friendly process. The Sort process, however, isn't pipeline-friendly. It must read all of its input data and process it before producing any output. To do so, Sort may have to create temporary space on local disk for merging. This is required because the Sort process may be asked to handle more input data than it can sort in memory.
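The contrast between a pipeline-friendly stage and a blocking one can be sketched as follows. The names are hypothetical stand-ins for the ReadText and Sort nodes in Figure 1, not a framework API, and this simplified sort buffers everything in memory. A data-scalable sort would spill sorted runs to disk and merge them, as described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class StreamingVersusBlocking {
    // Hypothetical end-of-data marker, compared by identity.
    private static final String EOF = new String("<eof>");

    interface Stage { void run() throws InterruptedException; }

    /** Runs a stage on its own thread, as a dataflow engine would. */
    static Thread start(Stage stage) {
        Thread t = new Thread(() -> {
            try {
                stage.run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    /** Pipeline-friendly: emits the third column of each row as soon as it is read. */
    static void readThirdColumn(List<String> rows, BlockingQueue<String> out)
            throws InterruptedException {
        for (String row : rows) {
            String[] cols = row.split(",");
            if (cols.length >= 3) out.put(cols[2]);
        }
        out.put(EOF);
    }

    /** Not pipeline-friendly: must consume all input before emitting anything. */
    static void sort(BlockingQueue<String> in, BlockingQueue<String> out)
            throws InterruptedException {
        List<String> buffer = new ArrayList<>(); // in-memory only; see the caveat above
        for (String s = in.take(); s != EOF; s = in.take()) buffer.add(s);
        Collections.sort(buffer);
        for (String s : buffer) out.put(s);
        out.put(EOF);
    }

    static List<String> run(List<String> rows) {
        BlockingQueue<String> toSort = new ArrayBlockingQueue<>(2);
        BlockingQueue<String> sorted = new ArrayBlockingQueue<>(2);
        start(() -> readThirdColumn(rows, toSort));
        start(() -> sort(toSort, sorted));

        List<String> result = new ArrayList<>();
        try {
            for (String s = sorted.take(); s != EOF; s = sorted.take()) result.add(s);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while collecting output", e);
        }
        return result;
    }
}
```

The reader hands each value downstream immediately, while the first sorted value cannot appear until the reader has signaled end of data. Any stage placed after the sort therefore sits idle until the sort finishes reading.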

This illustrates that attacking huge collections of data with the dataflow approach requires some new rules: any data-oriented framework must be data scalable. In other words, it should be able to handle gigabytes to terabytes of data without relying on in-memory algorithms that fail when billions of rows of data have to be handled.

A dataflow graph can also support the "divide and conquer" technique to enhance scalability. This is another basic parallel algorithm structure. With this technique, the input data is partitioned in some way and the same algorithm is applied to each partition. This allows the data to be processed in parallel. On multi-core hardware, divide and conquer (also known as horizontal partitioning) can provide a huge boost to performance.
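A minimal sketch of horizontal partitioning, using only the standard `java.util.concurrent` executor classes, might look like this. The class name and the choice of hash partitioning are assumptions made for illustration. Partitioning by the hash of the value guarantees that equal values land in the same partition, so the per-partition counts can be merged without any further combining.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PartitionedCount {
    /** Counts occurrences of each value using one worker per partition. */
    static Map<String, Long> count(List<String> values, int partitions) {
        if (partitions < 1) throw new IllegalArgumentException("partitions must be >= 1");

        // Divide: route each value to a partition chosen by its hash.
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < partitions; i++) parts.add(new ArrayList<>());
        for (String v : values) {
            parts.get(Math.floorMod(v.hashCode(), partitions)).add(v);
        }

        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            // Conquer: apply the same algorithm to every partition in parallel.
            List<Future<Map<String, Long>>> futures = new ArrayList<>();
            for (List<String> part : parts) {
                Callable<Map<String, Long>> task = () -> countOne(part);
                futures.add(pool.submit(task));
            }
            // Merge: keys are disjoint across partitions, so putAll is enough.
            Map<String, Long> merged = new TreeMap<>();
            for (Future<Map<String, Long>> f : futures) merged.putAll(f.get());
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while counting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** The per-partition algorithm: a plain sequential count. */
    static Map<String, Long> countOne(List<String> part) {
        Map<String, Long> counts = new HashMap<>();
        for (String v : part) counts.merge(v, 1L, Long::sum);
        return counts;
    }
}
```

This sketch partitions an in-memory list for brevity. In a dataflow graph the partitioner would itself be a node that routes records onto several output queues, with one copy of the downstream subgraph per partition.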




About Jim Falgout
Jim Falgout is solutions architect for Pervasive Software, where he applied dataflow principles to help architect Pervasive DataRush. He is active in the Java development community; in May of 2007, he presented a technical paper titled 'Unleashing the Power of Multi-Core Processors: Scalable Data Processing in Java Technology' at JavaOne.
