By Jim Falgout | March 30, 2008 04:00 AM EDT | Reads: 25,331
What type of processing is the data going through? The processing may be some sort of data transformation, data mining analytics, business intelligence, data aggregation, matching, or data cleansing task, but a key factor is that it applies to the whole data set. Multi-core machines provide a great opportunity for streamlining this processing. To take full advantage of these hardware resources, however, software developers need better approaches.
And the amount of data gets larger every day. The ease and reduced cost of capturing, transferring, and storing information have resulted in huge collections of records, transactions, and events. As capacities for processing (more cores) and storage (faster and higher-capacity disks) continue to increase at dizzying speeds, the processing requirements increase almost as fast. Plus, no one ever wants to let go of that data: once you have it, you want to squeeze every possible ounce of value out of it.
Data-Oriented Thinking
Within the disciplines of data mining, data quality, matching, searching, and other data-intensive efforts, there are compute-intensive algorithms for performing specific functions. It seems natural to solve our performance and scalability problems by focusing on these algorithms and optimizing them with standard techniques.
While the above approach helps, it's only one part of solving the problem. A more general approach is needed, a different way of attacking the problem. Instead of thinking in terms of the functions to run on the data (the algorithms) or even the objects to use to process the data, I suggest you think in terms of the data itself. Consider the way the data needs to flow to be transformed, processed, mined, cleansed, or matched to reach the end goal. This dataflow way of thinking isn't a step back to the flow chart days of COBOL, but a higher-level data-oriented abstraction. Once you can break a data-intensive, analytic problem into such a form, it can be mapped to a dataflow graph that may be easily implemented in Java. We'll see how in the course of this article.
A dataflow is a graph structure that consists of two basic building blocks: a processing node and a data edge. A processing node (or process) transforms its input data in some way and writes the transformed data to its output queue. The edges in the graph are dataflow queues: blocking queues used to transmit data between the processing nodes. A data processing application is built in a dataflow framework by stitching together dataflow processes to provide the desired data transformations. See Figure 1 for a depiction of a simple dataflow graph.
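To make the node-and-edge idea concrete, here is a minimal sketch in plain Java. It is not the API of any dataflow framework: the class name `DataflowSketch`, the `EOS` end-of-stream marker, and the queue capacity of 4 are all assumptions made for illustration. Each processing node is a thread, and each edge is a `java.util.concurrent.BlockingQueue`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class DataflowSketch {
    // Hypothetical end-of-stream marker, compared by identity so that
    // no real record can be mistaken for it.
    static final String EOS = new String("<eos>");

    // Runs a two-node graph: source -> upper-case transform -> collector.
    static List<String> run(List<String> input) throws InterruptedException {
        // The edges: small bounded queues, so a fast producer blocks
        // until its consumer catches up.
        BlockingQueue<String> edge1 = new ArrayBlockingQueue<>(4);
        BlockingQueue<String> edge2 = new ArrayBlockingQueue<>(4);

        // Node 1: writes every input record to its output queue.
        Thread source = new Thread(() -> {
            try {
                for (String record : input) edge1.put(record);
                edge1.put(EOS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Node 2: transforms each record as soon as it arrives.
        Thread upper = new Thread(() -> {
            try {
                for (String r = edge1.take(); r != EOS; r = edge1.take()) {
                    edge2.put(r.toUpperCase());
                }
                edge2.put(EOS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start();
        upper.start();

        // The calling thread acts as the final node and collects the output.
        List<String> out = new ArrayList<>();
        for (String r = edge2.take(); r != EOS; r = edge2.take()) out.add(r);
        source.join();
        upper.join();
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of("a", "b", "c"))); // prints [A, B, C]
    }
}
```

Neither node knows anything about the other; each only reads from and writes to its queues. The bounded capacity is a design choice of this sketch that keeps a fast producer from flooding memory.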
If you use command line shells such as sh or bash, you're already familiar with this design methodology. From the shell command line you can run command pipelines such as:
awk -F, '{print $3}' < file.txt | sort | uniq -c | sort -rn
(This pipeline reads the third column of the input CSV file, sorts the values, gets a count of each distinct value, and then sorts the distinct values by their frequency in reverse order.) This command pipeline is hooked together using the standard input and standard output data streams of each command. The shell creates a process for each command, hooks the outputs to inputs, and then lets the data flow. None of the commands needs to know about the others; each just needs to honor the contract of reading its input from standard input and writing its output to standard output. Contract-based programming in action!
It's the same with dataflow programming. A dataflow process defines a contract: its input and output dataflow queues. A process may also have properties that can be set, analogous to the command-line options of shell commands. Building a dataflow graph is very similar to building a shell pipeline. However, dataflow processes can normally have multiple inputs and multiple outputs. Shell commands can have two outputs (standard output and standard error) but only one input. There are other differences as well, but the analogy is useful for understanding the concepts. Figure 1 depicts the given shell command line example implemented as a dataflow graph. In the dataflow graph, the awk command is replaced with a reader process. The uniq command is replaced with a Group operator that uses a row count aggregator.
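For readers who think in Java rather than shell, the four stages of that pipeline can be sketched as four small methods, each honoring a simple contract: a list of records in, a list of records out. This is only an illustration of what each stage computes. It runs the stages one after another rather than concurrently, and the class and method names (`ColumnFrequency`, `thirdColumn`, `countRuns`, and so on) are hypothetical, not part of any framework. It assumes Java 9 or later for `Map.entry` and `List.of`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ColumnFrequency {
    // Stage 1 (the awk step): emit the third comma-separated field of
    // each line, or an empty string when the line has fewer fields.
    static List<String> thirdColumn(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(",", -1))
                .map(fields -> fields.length > 2 ? fields[2] : "")
                .collect(Collectors.toList());
    }

    // Stage 2 (sort): order the values so equal values become adjacent.
    static List<String> sorted(List<String> values) {
        return values.stream().sorted().collect(Collectors.toList());
    }

    // Stage 3 (uniq -c): count runs of equal adjacent values.
    // Like uniq, this relies on its input already being sorted.
    static List<Map.Entry<String, Long>> countRuns(List<String> sortedValues) {
        List<Map.Entry<String, Long>> runs = new ArrayList<>();
        for (String value : sortedValues) {
            int last = runs.size() - 1;
            if (last >= 0 && runs.get(last).getKey().equals(value)) {
                runs.set(last, Map.entry(value, runs.get(last).getValue() + 1));
            } else {
                runs.add(Map.entry(value, 1L));
            }
        }
        return runs;
    }

    // Stage 4 (sort -rn): order the counted values by descending frequency.
    static List<Map.Entry<String, Long>> byCountDescending(
            List<Map.Entry<String, Long>> counts) {
        return counts.stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "1,x,red", "2,y,blue", "3,z,red",
                "4,w,red", "5,v,blue", "6,u,green");
        System.out.println(
                byCountDescending(countRuns(sorted(thirdColumn(lines)))));
    }
}
```

As in the shell version, no stage knows about its neighbors, so any of them could be swapped out or reused in a different graph.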
It's important to note that dataflow implements pipeline parallelism by its very nature. Each process in a dataflow graph works (somewhat) independently of the other processes in the graph. As a process handles its input and does its transformations, it writes the resulting data to its output. Pipelining is a very powerful construct that allows for simple parallelism. It's one of the basic building blocks of parallel algorithm structure.
Let's look more closely at pipelining. Referring to the dataflow graph in Figure 1, the ReadText process begins reading the input file and outputs the third column of data for every input row. ReadText can write each record to its output as it goes; it doesn't have to process all of its input before producing any output. It's a pipeline-friendly process. The Sort process, however, isn't pipeline-friendly. It must read all of its input data and process it before producing any output. To do so, Sort may have to create temporary space on local disk for merging. This is required because the Sort process may be asked to handle more input data than it can sort in memory.
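The difference between the two kinds of process can be shown with a small sketch built on plain Java iterators. The names here (`PipelineFriendliness`, `CountingIterator`, `map`, `sort`) are hypothetical and chosen for illustration. Note also that this `sort` is in-memory only, so it is exactly the kind of implementation that is not data scalable; a real Sort process would spill sorted runs to disk and merge them, as described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

class PipelineFriendliness {
    // Wraps an iterator and counts how many records have been pulled from it.
    static class CountingIterator<T> implements Iterator<T> {
        final Iterator<T> source;
        int pulled = 0;

        CountingIterator(Iterator<T> source) { this.source = source; }

        public boolean hasNext() { return source.hasNext(); }

        public T next() {
            pulled++;
            return source.next();
        }
    }

    // Pipeline-friendly: each output record needs exactly one input
    // record, so output can start flowing immediately.
    static <A, B> Iterator<B> map(Iterator<A> in, Function<A, B> f) {
        return new Iterator<B>() {
            public boolean hasNext() { return in.hasNext(); }
            public B next() { return f.apply(in.next()); }
        };
    }

    // Not pipeline-friendly: the smallest record might arrive last, so
    // all input must be read before the first record can be written.
    static <T extends Comparable<T>> Iterator<T> sort(Iterator<T> in) {
        List<T> buffer = new ArrayList<>();
        while (in.hasNext()) buffer.add(in.next());
        Collections.sort(buffer);
        return buffer.iterator();
    }

    public static void main(String[] args) {
        CountingIterator<String> a =
                new CountingIterator<>(List.of("c", "a", "b").iterator());
        Iterator<String> mapped = map(a, s -> s.toUpperCase());
        mapped.next();
        System.out.println("map pulled " + a.pulled + " record(s) for its first output");

        CountingIterator<String> b =
                new CountingIterator<>(List.of("c", "a", "b").iterator());
        Iterator<String> ordered = sort(b);
        ordered.next();
        System.out.println("sort pulled " + b.pulled + " record(s) for its first output");
    }
}
```

With three input records, the mapping stage needs one record to produce its first output, while the sorting stage needs all three.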
This illustrates that attacking huge collections of data with the dataflow approach requires some new rules: any data-oriented framework must be data scalable. In other words, it should be able to handle gigabytes to terabytes of data without relying on in-memory algorithms that fail when billions of rows of data have to be handled.
A dataflow graph can also support the "divide and conquer" technique to enhance scalability. This is another basic parallel algorithm structure. With this technique, the input data is partitioned in some way and the same algorithm is applied to each partition. This allows the data to be processed in parallel. On computers with multi-core hardware, divide and conquer (a.k.a. horizontal partitioning) can provide a huge boost to performance.
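A minimal sketch of horizontal partitioning, using only the standard `java.util.concurrent` executor classes: the input is cut into contiguous slices, the same counting algorithm runs on every slice in parallel, and the partial results are merged. The class name `PartitionedCount`, the slicing scheme, and the choice of a value-frequency count as the per-partition algorithm are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PartitionedCount {
    // Counts value frequencies by splitting the input into up to
    // `partitions` slices (partitions must be at least 1).
    static Map<String, Long> count(List<String> values, int partitions)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            // Divide: one task per contiguous slice of the input.
            List<Future<Map<String, Long>>> partials = new ArrayList<>();
            int chunk = (values.size() + partitions - 1) / partitions;
            for (int start = 0; start < values.size(); start += chunk) {
                List<String> slice =
                        values.subList(start, Math.min(start + chunk, values.size()));
                partials.add(pool.submit(() -> countOne(slice)));
            }
            // Conquer: merge the per-partition counts into one result.
            Map<String, Long> total = new TreeMap<>();
            for (Future<Map<String, Long>> partial : partials) {
                partial.get().forEach((k, v) -> total.merge(k, v, Long::sum));
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    // The same algorithm is applied to every partition.
    static Map<String, Long> countOne(List<String> slice) {
        Map<String, Long> counts = new HashMap<>();
        for (String value : slice) counts.merge(value, 1L, Long::sum);
        return counts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count(List.of("a", "b", "a", "c", "a", "b"), 3));
    }
}
```

Counting merges cleanly because partial counts can simply be added together. Algorithms whose partial results cannot be combined this easily need a more careful partitioning scheme.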
Copyright © 2008 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Jim Falgout
Jim Falgout is solutions architect for Pervasive Software, where he applied dataflow principles to help architect Pervasive DataRush. He is active in the Java development community; in May of 2007, he presented a technical paper titled 'Unleashing the Power of Multi-Core Processors: Scalable Data Processing in Java Technology' at JavaOne.
Eman 04/05/08 10:33:42 AM EDT
Funny, Cos, you are pointing out how Java isn't all that "free & open" like its corp. creator claims it is... the beauty of open source + patent law = morass of bear traps. Frankly, I haven't seen any Java framework that holds a match to this DataRush thing... download and see for yourself.

Cos 03/27/08 08:05:17 PM EDT
Daah! Check US Patent 7,020,699