By Colin Hendricks | November 14, 2007 07:45 AM EST | Reads: 24,431
The basic software architecture pattern to follow is a dispatcher-worker pattern. This pattern consists of a centralized controller (the dispatcher) that sends chunks of work to multiple worker threads. This basic pattern is often used in software applications that must do a high volume of work.
The central problem is to determine how to partition the chunks of work to maximize the efficiency of threads and database interactions. Clearly, each thread must operate on an independent logical unit of work. Otherwise, concurrent threads might end up waiting on one another or incorrectly altering results related to other threads. Each thread must be able to complete its job independently from all other threads.
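As a rough illustration of the pattern, the sketch below uses the standard `java.util.concurrent` executor API. The class name `BatchDispatcher`, the chunk size, and the summing `processChunk` body are hypothetical stand-ins for real batch work, not something the pattern prescribes. Inside an application server, threads would normally be obtained from the container rather than created directly, so treat the fixed thread pool here as an assumption made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative dispatcher: splits work into independent chunks and hands them to worker threads. */
class BatchDispatcher {

    /** Splits the input into chunks of at most chunkSize items; no item appears in two chunks. */
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    /** Sends each chunk to a worker thread and combines the per-chunk results. */
    static long dispatch(List<Integer> orderIds, int chunkSize, int workerCount) {
        ExecutorService workers = Executors.newFixedThreadPool(workerCount);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (List<Integer> chunk : partition(orderIds, chunkSize)) {
                tasks.add(() -> processChunk(chunk));
            }
            long total = 0;
            // invokeAll blocks until every worker has finished its chunk.
            for (Future<Long> result : workers.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("batch run failed", e);
        } finally {
            workers.shutdown();
        }
    }

    /** Hypothetical worker body: reads only its own chunk and shares no mutable state. */
    static long processChunk(List<Integer> chunk) {
        long sum = 0;
        for (int id : chunk) {
            sum += id;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            ids.add(i);
        }
        System.out.println(dispatch(ids, 3, 4));
    }
}
```

Because each worker receives its own copy of a chunk and returns a value instead of updating shared state, no thread has to wait on another until the dispatcher collects the results.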
Next, the units of work should be designed to minimize database interactions because these are very expensive. Each one involves asking the database to retrieve some data (which could require spinning a physical hard drive) and then moving that data over a network to the JEE server. Minimizing the frequency and volume of these interactions is the single most important factor in JEE batch processing performance. There are many ways to minimize database access and optimize the necessary database interactions.
Be Lazy (Avoid Unnecessary Work)
Only Get the Required Data
The easiest way to minimize database interactions is to carefully construct the batch algorithm to only get and operate on the data it really needs. This may seem obvious, but modern software often has layers such as Data Access Objects (DAOs) or perhaps an object model based on Hibernate that tend to return fully populated objects rather than just the few fields required. It's often convenient to reuse an existing data layer that does these things, but only do so if the time required to retrieve the extra data is acceptable. Otherwise, create new DAOs or JDBC statements to get just the specific data required.
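A narrow JDBC statement of this kind might look like the following sketch. The `orders` table, its `order_id`, `total_weight`, and `status` columns, and the class name `OrderWeightReader` are all hypothetical; only the `java.sql` calls are real API.

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

/** Illustrative reader that fetches two columns instead of fully populated order objects. */
class OrderWeightReader {

    // Hypothetical schema: the query names only the columns this batch step uses.
    static final String SQL =
        "SELECT order_id, total_weight FROM orders WHERE status = ?";

    /** Returns order id -> weight for every order in the given status. */
    static Map<Long, BigDecimal> loadWeights(Connection connection, String status)
            throws SQLException {
        Map<Long, BigDecimal> weights = new HashMap<>();
        try (PreparedStatement statement = connection.prepareStatement(SQL)) {
            statement.setString(1, status);
            try (ResultSet rows = statement.executeQuery()) {
                while (rows.next()) {
                    weights.put(rows.getLong("order_id"), rows.getBigDecimal("total_weight"));
                }
            }
        }
        return weights;
    }
}
```

Compared with loading a full order object graph, each row here carries only a key and one value across the network.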
Only Do the Work Required for Each Run
Another way to avoid bringing back unnecessary data and doing pointless work is to create a configurable batch process. Batch processes often do several different but related operations, and not all of them are always necessary. A little extra development work is required to provide input parameters that allow certain operations to be switched off for certain batch runs, but avoiding unnecessary work can provide worthwhile performance improvements.
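One simple way to expose such switches is an input parameter that lists the operations to run. In the sketch below, the `Operation` names and the comma-separated parameter format are assumptions chosen for illustration.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative run configuration: selects which operations a batch run performs. */
class BatchOptions {

    /** Hypothetical operations that a single batch job might bundle together. */
    enum Operation { RECALCULATE_SHIPPING, REFRESH_TOTALS, SEND_NOTIFICATIONS }

    /**
     * Parses a comma-separated list such as "refresh_totals,send_notifications".
     * A null or blank value means "run everything".
     */
    static Set<Operation> parse(String enabled) {
        if (enabled == null || enabled.trim().isEmpty()) {
            return EnumSet.allOf(Operation.class);
        }
        Set<Operation> selected = EnumSet.noneOf(Operation.class);
        for (String name : enabled.split(",")) {
            // valueOf throws IllegalArgumentException for an unknown operation name.
            selected.add(Operation.valueOf(name.trim().toUpperCase(Locale.ROOT)));
        }
        return selected;
    }

    public static void main(String[] args) {
        Set<Operation> selected = parse(args.length > 0 ? args[0] : null);
        for (Operation operation : Operation.values()) {
            System.out.println(operation + (selected.contains(operation) ? ": run" : ": skipped"));
        }
    }
}
```

Each step of the batch job can then check `selected.contains(...)` before it issues any query, so a skipped operation costs no database work at all.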
Only Work on Data that Has Changed
In this same vein of avoiding unnecessary work, it is often possible to implement a feature that tracks which data has changed (and requires new batch operations) and which data has not changed (and can safely be ignored). Depending on the rate of change of the data, ignoring unchanged values can lead to a large performance improvement.
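A common way to track changes is a last-modified timestamp on each row, compared against the time of the previous successful run. The sketch below shows that comparison in memory; the `OrderRow` type and its `lastModified` field are hypothetical. In a real batch job the same predicate would more likely sit in the SQL `WHERE` clause, so that unchanged rows never leave the database.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Illustrative change tracking based on a last-modified timestamp. */
class ChangeFilter {

    /** Hypothetical row carrying a timestamp maintained by the application or a trigger. */
    static final class OrderRow {
        final long id;
        final Instant lastModified;

        OrderRow(long id, Instant lastModified) {
            this.id = id;
            this.lastModified = lastModified;
        }
    }

    /** Returns the ids of rows modified strictly after the previous successful batch run. */
    static List<Long> changedSince(List<OrderRow> rows, Instant lastRun) {
        List<Long> changed = new ArrayList<>();
        for (OrderRow row : rows) {
            if (row.lastModified.isAfter(lastRun)) {
                changed.add(row.id);
            }
        }
        return changed;
    }
}
```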
Use Data Warehousing Techniques to Compress Data over Time
The size of the dataset can be further reduced by exploiting common data warehouse data modeling techniques such as the concept of slowly changing dimensions. Data warehouses are often modeled to contain dimension tables and fact tables. The dimension tables contain all the descriptive attributes on which data is sliced. Fact tables contain the actual aggregated data. For example, there may be a fact table containing order totals with a foreign key to a dimension table that captures the name of the salesperson for the order, allowing the creation of a report to slice order totals by salesperson.
Slowly changing dimensions and slowly changing facts are methods that can be used to compress the volume of this data if the data changes over time. The idea is to put date ranges on the dimensions and facts rather than repeating the same values for each date in the time period. For example, a salesperson's name could change over time if she gets married. Without date range effectiveness on this dimension, it is necessary to capture the name as it was at each batch run to preserve historical data even though the data most likely does not change often. This is repetitive and wasteful. If the dimension has a date range, then the batch process need only store a row for each different value.
The same can be done with facts if the model requires storing facts at different times. If the result of the computation happens to be the same value as it was the last time, the batch process could just store a date range with the answer rather than storing the same answer multiple times.
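The effect of date ranges can be sketched in a few lines. The example below collapses one-value-per-run snapshots into one row per run of equal values; `RangeRow`, `validFrom`, and `validTo` are hypothetical names, and the code assumes exactly one snapshot per batch run, ordered by date.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

/** Illustrative compression of per-run snapshots into date-ranged rows. */
class DateRangeCompressor {

    /** Hypothetical date-ranged row: one value that held from validFrom through validTo. */
    static final class RangeRow {
        final LocalDate validFrom;
        LocalDate validTo; // inclusive; extended while the value stays the same
        final String value;

        RangeRow(LocalDate validFrom, LocalDate validTo, String value) {
            this.validFrom = validFrom;
            this.validTo = validTo;
            this.value = value;
        }
    }

    /** Emits a new row only when the value differs from the previous snapshot. */
    static List<RangeRow> compress(SortedMap<LocalDate, String> snapshots) {
        List<RangeRow> rows = new ArrayList<>();
        RangeRow current = null;
        for (Map.Entry<LocalDate, String> snapshot : snapshots.entrySet()) {
            if (current != null && current.value.equals(snapshot.getValue())) {
                current.validTo = snapshot.getKey(); // same value: stretch the range
            } else {
                current = new RangeRow(snapshot.getKey(), snapshot.getKey(), snapshot.getValue());
                rows.add(current);
            }
        }
        return rows;
    }
}
```

Five daily runs in which the salesperson's name changes once produce two rows instead of five, and the saving grows with the length of the history.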
Optimize Database Interactions
Eliminating unnecessary work is the best way to limit database interactions, but, clearly, some interactions must happen. Further strategies can be used to make sure those interactions are as efficient as possible.
Caching
One approach is to take advantage of caching technologies. Batch processes often require access to some set of master data that is reused throughout the process. This master data should be loaded from the database just once and then cached in memory within the application server context and reused.
This can be done using singletons or static variables that hold the data, or caching tools like JBoss Cache, GigaSpaces, Tangosol Coherence, etc. These latter tools provide benefits such as replicating the cached values across multiple JVM instances but introduce added complexity to the application.
One caveat for caching is that it may solve a database interaction problem but create a memory constraint problem because the in-memory cache in the application server tier may grow too large. RAM has become much cheaper in recent years, but most 32-bit JVMs are still limited to 2-4GB of heap space. Be careful that the cache will not exceed the memory space available and cause disk swapping on the application server. Also, consider 64-bit computing architectures that allow larger heap space.
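For the simplest case, a single JVM holding master data in memory, a load-once cache can be sketched with `ConcurrentHashMap.computeIfAbsent`. The class name `MasterDataCache` and the loader function are hypothetical; in a real job the loader would wrap the database lookup. The sketch is unbounded, so the memory caveat above applies to it directly.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Illustrative in-memory cache: each key is loaded once and then shared by all worker threads. */
class MasterDataCache<K, V> {

    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    /** The loader stands in for the database lookup; it must not return null. */
    MasterDataCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, invoking the loader only if the key is not cached yet. */
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }

    /** Number of cached entries; worth monitoring because this sketch never evicts. */
    int size() {
        return entries.size();
    }
}
```

A typical use is to hold one instance in a static field or singleton and call `cache.get(key)` from every worker, so repeated requests for the same key cost one database round trip in total.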
Data Streaming
Another approach for optimizing database interactions is to favor a smaller number of denormalized queries that retrieve large volumes of data rather than a larger number of more granular queries that retrieve small volumes of data. Relational databases are very good at creating an execution plan for a few complex queries and then streaming back the results as quickly as possible. They perform less well when asked to execute lots of small queries that appear to be randomly organized.
For example, consider a batch process that needs to compute the shipping cost on a large set of orders. You could choose to define each chunk of work to be a single order. The dispatcher could ask the database for the master list of order IDs and send each ID to a worker thread for processing. That worker thread could then ask the database for the details of each order, do its work, and save the answer back to the database.
To the database, this approach will feel like it's getting slammed by lots of concurrent users asking for different orders all at the same time. There will be high contention for resources such as database connections and access to the order table. It will look somewhat like a denial-of-service attack.
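One possible way to apply the "fewer, larger queries" idea to the order example is to make each chunk of work a contiguous range of order IDs, so that a worker fetches its whole range with a single statement. This is an assumption made for illustration, not a design taken from the text; the `OrderRangePlanner` class, the `orders` table, and its columns are hypothetical, and the approach presumes reasonably dense numeric IDs.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative planner: turns an ID span into ranges that each map to one query. */
class OrderRangePlanner {

    // Hypothetical query a worker would run once per range instead of once per order.
    static final String RANGE_SQL =
        "SELECT order_id, total_weight, destination FROM orders WHERE order_id BETWEEN ? AND ?";

    /** Inclusive range of order IDs handled by a single worker. */
    static final class IdRange {
        final long fromInclusive;
        final long toInclusive;

        IdRange(long fromInclusive, long toInclusive) {
            this.fromInclusive = fromInclusive;
            this.toInclusive = toInclusive;
        }
    }

    /** Splits [minId, maxId] into consecutive ranges of at most ordersPerChunk IDs each. */
    static List<IdRange> plan(long minId, long maxId, long ordersPerChunk) {
        if (ordersPerChunk <= 0) {
            throw new IllegalArgumentException("ordersPerChunk must be positive");
        }
        List<IdRange> ranges = new ArrayList<>();
        for (long from = minId; from <= maxId; from += ordersPerChunk) {
            ranges.add(new IdRange(from, Math.min(from + ordersPerChunk - 1, maxId)));
        }
        return ranges;
    }
}
```

With 10,000 orders and 1,000 orders per chunk, the database sees 10 range queries instead of 10,000 single-row lookups.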
Copyright © 2007 SYS-CON Media, Inc. — All Rights Reserved.
More Stories By Colin Hendricks
Colin Hendricks is CTO of Rome Corp. He has worked as a software developer and consultant on high-performance, server-side Java systems for the past 10 years.
Snehal Antani 07/27/08 08:06:36 PM EDT
Kalyan, to answer your questions:
"what are the hiccups?": A key issue with batch processing using Java and application servers relates to JDBC cursors, transactions, and holding cursors across transactions. Checkpointing (committing work periodically so you can restart the job if needed) is important in batch. Checkpointing is achieved by using transactions, JTA transactions specifically. Unfortunately, if you use a Type-4 JDBC driver with XA, you're not able to keep cursors open across transactions. You therefore cannot easily run a "select account from table1" type of query that retrieves all of the accounts to process and apply some checkpoint strategy as you process those records. There are a few approaches to getting around this: first, we've built a stateful session bean (SFSB) pattern where reads to the DB are done in a local transaction and the writes to the database are done in the global transaction; second, executing smaller queries that are bounded by the checkpoint intervals versus one very large query; third, if you are on z/OS and your data is in DB2 z/OS, using the Type-2 JDBC driver, which allows you to hold cursors across transactions; fourth, using Last Participant Support (LPS), which is the ability to use a single 1-PC resource in a 2-PC (XA) transaction. This problem will plague *every* Java batch solution and is a pain due to limitations in XA. The WebSphere XD Compute Grid (aka WebSphere Batch) forum has some posts on this topic; please feel free to ask more questions there: http://www-128.ibm.com/developerworks/forums/forum.jspa?forumID=1240&sta.... Within Compute Grid, we've built the SFSB pattern as part of our Batch Datastream Framework (BDS Framework) to make it simpler to leverage. Using LPS or Type-2 drivers is pretty straightforward in WebSphere. Another important gotcha is workload management and ensuring your batch processing doesn't negatively impact your online transaction processing (OLTP) workloads (and vice versa). The only way to have a good solution in this area is to use a software stack that integrates with the database and the workload manager. Basically, you need an integrated batch and OLTP platform, not just a batch container.
"app's performance would depend on database specifics": Yes, of course, but this is business as usual. DB vendors have their own knobs and runtime behaviors that will differ, so each has to be optimized in its own way.
"what sort of frameworks have you worked with": I've found Hibernate to not be very good for batch processing. You can read more about why here: http://forum.hibernate.org/viewtopic.php?t=988575&view=next&sid=0aada757.... I've seen customers use IBatis, OpenJPA, raw JDBC, Pure Query, and SQLJ/Static SQL. As the article mentions, getting down to the raw SQL query for batch can be crucial for performance. I tend to stick to raw JDBC, and I use the Batch Data Stream Framework (BDS Framework) to manage the connections, prepared statements, restarting, etc. You can read more about this at: http://www-128.ibm.com/developerworks/forums/thread.jspa?threadID=190623...
Kalyan 11/13/07 04:06:33 PM EST
This article looks pretty good in its content. A couple of questions, though:
# Have you used this architecture on any of the systems that you have implemented? If so, what are the hiccups that you have come across?
# Though you discourage using stored procs for performance reasons, you suggest tweaking some database configuration to see if one can get better performance. Wouldn't this make the app's performance (though not its logic) dependent on database specifics?
Interacting with databases is the most important part of any batch processing application that has to save data to the persistent store. It'd be interesting to see what sort of frameworks (Hibernate, iBATIS, etc.) you have worked with in this kind of architecture.
Snehal Antani 08/13/07 04:06:11 PM EDT
Interesting article. I recently published an article describing your Dispatcher-Worker pattern for highly parallel batch jobs in the context of WebSphere XD Compute Grid. http://www.ibm.com/developerworks/websphere/techjournal/0707_antani/0707... An interesting extension to your description is depicted in figure 6 of my article: establishing endpoint affinity, which enables new caching opportunities. The downside of using straight JEE5 multi-threading packages, versus building on an existing enterprise Java batch framework like Compute Grid, is that the developer would have to manage threading, which, for enterprise adopters composed of large development teams, could be more trouble than it's worth.