High-Performance Batch Processing with Java Enterprise Edition

The benefits

On the other hand, you could choose to define the chunks as an arbitrary number of orders, perhaps 1,000. The dispatcher could query the database for all required order columns rather than just the order ID. As each row is returned, the dispatcher would send all the order data required to compute shipping costs to a worker. Each worker would not have to query the database to do its work because everything it needs is provided as input.
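As a rough illustration of this second approach, the sketch below shows a dispatcher handing each worker a fully populated order row, so the worker never queries the database. Everything named here is an assumption made for illustration: the `OrderRow` and `ShippingCost` classes, the pricing rule, and the plain `ExecutorService`, which stands in for the JMS transport discussed in this article.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical order row: every column a worker needs, read once by the dispatcher.
class OrderRow {
    final long orderId;
    final BigDecimal weightKg;
    final String region;

    OrderRow(long orderId, BigDecimal weightKg, String region) {
        this.orderId = orderId;
        this.weightKg = weightKg;
        this.region = region;
    }
}

// Hypothetical result produced by a worker.
class ShippingCost {
    final long orderId;
    final BigDecimal cost;

    ShippingCost(long orderId, BigDecimal cost) {
        this.orderId = orderId;
        this.cost = cost;
    }
}

class ShippingDispatcher {

    // Placeholder pricing rule, invented for this sketch only.
    static ShippingCost computeShipping(OrderRow row) {
        BigDecimal ratePerKg = "EU".equals(row.region)
                ? new BigDecimal("2.00")
                : new BigDecimal("3.50");
        return new ShippingCost(row.orderId, row.weightKg.multiply(ratePerKg));
    }

    // Hands each complete row to a worker thread; workers do no database reads.
    static List<ShippingCost> dispatch(List<OrderRow> rows, int workerCount) {
        ExecutorService workers = Executors.newFixedThreadPool(workerCount);
        try {
            List<Future<ShippingCost>> pending = new ArrayList<>();
            for (OrderRow row : rows) {
                pending.add(workers.submit(() -> computeShipping(row)));
            }
            List<ShippingCost> results = new ArrayList<>();
            for (Future<ShippingCost> future : pending) {
                results.add(future.get()); // results come back in dispatch order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("dispatch interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker failed", e.getCause());
        } finally {
            workers.shutdown();
        }
    }
}
```

Inside an application server, the thread pool would normally be replaced by container-managed workers such as message-driven beans; the point of the sketch is only that the worker's input carries all the data it needs.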

As each worker completes its work, it would place the results on a persistence queue rather than immediately sending an individual insert or update to the database. Every time this queue reaches 1,000 entries, the batch process would send a bulk insert/update statement to the database.
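The persistence queue described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name `BulkWriteBuffer` is hypothetical, and the `bulkWriter` callback is assumed to wrap whatever bulk insert/update the batch process sends to the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Collects worker results and hands them to a bulk writer once a threshold is reached.
class BulkWriteBuffer<T> {
    private final int threshold;
    private final Consumer<List<T>> bulkWriter; // assumed to issue one bulk statement per call
    private List<T> pending = new ArrayList<>();

    BulkWriteBuffer(int threshold, Consumer<List<T>> bulkWriter) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.threshold = threshold;
        this.bulkWriter = bulkWriter;
    }

    // Called by worker threads; synchronized so concurrent workers cannot lose entries.
    synchronized void add(T result) {
        pending.add(result);
        if (pending.size() >= threshold) {
            flush();
        }
    }

    // Writes whatever is pending; call once at the end of the run for the final partial batch.
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = pending;
        pending = new ArrayList<>();
        bulkWriter.accept(batch);
    }
}
```

With a threshold of 1,000, adding 2,500 results triggers two bulk writes of 1,000 entries, and a final `flush()` writes the remaining 500. For simplicity the write happens while the lock is held, so workers wait during a flush; a production version might hand the batch to a separate writer thread instead.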

The result is that the database is allowed to do a few, high-volume things as fast as it is able, rather than swapping between numerous small tasks.

Optimize Physical Database Access
Databases often respond slowly when they receive multiple requests that contend for data located on the same physical media. Avoiding this contention will speed up the batch process. It's often possible to specify how database tables are distributed across physical disk drives, so tables that are likely to receive large numbers of concurrent requests can be placed on separate drives.

Use Database Tricks
Relational databases offer many configuration options and interaction methods that can be used intelligently to optimize a batch process. Performance monitoring tools should be used to watch the behavior of the database as the batch process runs. This will allow the optimal configuration of settings such as how much memory to commit to the database's shared cache.

Transactions
A database uses transactions to group multiple changes into a single logical unit of work. These changes are then either all committed and stored or all rolled back and discarded together. The database must maintain a log of these changes to keep track of which ones belong together, so large transactions result in a large transaction log, and large logs can hurt performance. There are a couple of ways to avoid this problem in a batch process. You could choose not to group changes into explicit transactions at all: most databases include an autocommit feature that commits each change immediately. Alternatively, you could make sure that each thread independently commits its own relatively small transaction. In any case, it's not wise to have large, long-running transactions as part of a batch process.
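One way to keep each thread's transactions small is to commit after a fixed number of rows. The following is a sketch under stated assumptions: the `orders` table, its column names, the `CostUpdate` class, and the `commitInterval` parameter are all hypothetical, and only standard `java.sql` calls are used.

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical unit of work: one shipping cost to store for one order.
class CostUpdate {
    final long orderId;
    final BigDecimal cost;

    CostUpdate(long orderId, BigDecimal cost) {
        this.orderId = orderId;
        this.cost = cost;
    }
}

class ChunkedCommitter {
    // Table and column names are assumptions made for this sketch.
    static final String UPDATE_SQL =
            "UPDATE orders SET shipping_cost = ? WHERE order_id = ?";

    // Applies the updates, committing after every commitInterval rows.
    // Returns the number of commits issued.
    static int applyUpdates(Connection conn, List<CostUpdate> updates, int commitInterval)
            throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        int commits = 0;
        try (PreparedStatement ps = conn.prepareStatement(UPDATE_SQL)) {
            int sinceCommit = 0;
            for (CostUpdate update : updates) {
                ps.setBigDecimal(1, update.cost);
                ps.setLong(2, update.orderId);
                ps.executeUpdate();
                if (++sinceCommit == commitInterval) {
                    conn.commit(); // closes one small transaction
                    commits++;
                    sinceCommit = 0;
                }
            }
            if (sinceCommit > 0) {
                conn.commit(); // final partial chunk
                commits++;
            }
        } catch (SQLException e) {
            conn.rollback(); // discards only the current, uncommitted chunk
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
        return commits;
    }
}
```

If a failure occurs, earlier chunks stay committed, so the job needs some way to record how far it got before it can be restarted safely.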

Prepared Statements
Most databases support the idea of precompiled SQL statements called prepared statements. A prepared statement is a SQL statement with placeholders for parameters that will be supplied later on with actual data values. The statement can be compiled once and then reused even if the parameters change. This saves compilation time on the database platform and improves performance.

Batch processes usually execute the same SQL statements over and over with different parameter values, which is a perfect situation for prepared statements. Dynamic statements, whose SQL text is rebuilt for every execution, should be avoided.
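In JDBC, this might look roughly like the sketch below, which prepares one statement and reuses it for every row, queuing the parameter sets with `addBatch()` and sending them together with `executeBatch()`. The `shipping_costs` table, its columns, and the `CostRow` class are hypothetical, and whether the statement is actually compiled only once depends on the driver and database in use.

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical row to persist: one computed shipping cost.
class CostRow {
    final long orderId;
    final BigDecimal cost;

    CostRow(long orderId, BigDecimal cost) {
        this.orderId = orderId;
        this.cost = cost;
    }
}

class ShippingCostWriter {
    // Table and column names are assumptions made for this sketch.
    static final String INSERT_SQL =
            "INSERT INTO shipping_costs (order_id, cost) VALUES (?, ?)";

    // Prepares the statement once, binds new parameters per row, and sends
    // all rows in one batch. Returns the number of statements in the batch.
    static int writeAll(Connection conn, List<CostRow> rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (CostRow row : rows) {
                ps.setLong(1, row.orderId);
                ps.setBigDecimal(2, row.cost);
                ps.addBatch(); // queue this parameter set; no SQL text is rebuilt
            }
            int[] counts = ps.executeBatch();
            return counts.length;
        }
    }
}
```

Binding values through placeholders rather than concatenating them into the SQL text also avoids quoting and injection problems, which is a side benefit beyond performance.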

Application Server Clustering
One of the benefits of using the JEE platform for batch processing is that you can leverage its ability to cluster multiple application servers. If JMS is used as the transport mechanism to move messages from the dispatcher to the workers and the JMS implementation supports clustered, distributed queues (as many do), the workers can reside on different physical machines. This provides a method to scale the performance of the batch process by adding application servers. A powerful cluster can be assembled using multiple, inexpensive commodity application servers and it can grow with the requirements of the batch process.

Conclusion
While the JEE platform was originally designed for building enterprise Web applications, it has grown into a versatile Java server platform that can successfully solve many problems. Batch processing is a common enterprise requirement. The JEE platform can provide an excellent batch processing platform as long as care is taken to optimize database interactions.

More Stories By Colin Hendricks

Colin Hendricks is CTO of Rome Corp. He has worked as a software developer and consultant on high-performance, server-side Java systems for the past 10 years.

Most Recent Comments
Snehal Antani 07/27/08 08:06:36 PM EDT

Kalyan, to answer your questions:

"what are the hiccups?": a key issue with batch processing using Java and application servers relates to JDBC cursors, transactions, and holding cursors across transactions. Checkpointing (committing work periodically so you can restart the job if needed) is important in batch. Checkpointing is achieved by using transactions, JTA transactions specifically. Unfortunately, if you use a Type-4 JDBC driver with XA, you're not able to keep cursors open across transactions, so you cannot easily run a "select account from table1" type of query that retrieves all of the accounts to process and then apply some checkpoint strategy as you process those records. There are a few approaches to getting around this: first, use a stateful session bean (SFSB) pattern, which we've built, where reads from the DB are done in a local transaction and writes to the database are done in the global transaction; second, execute smaller queries that are bounded by the checkpoint intervals instead of one very large query; third, if you are on z/OS and your data is in DB2 z/OS, use the Type-2 JDBC driver, which allows you to hold cursors across transactions; fourth, use Last Participant Support, which is the ability to use a single 1-PC resource in a 2-PC (XA) transaction. This problem will plague *every* Java batch solution and is a pain due to limitations in XA. The WebSphere XD Compute Grid (aka WebSphere Batch) forum has some posts on this topic; please feel free to ask more questions there: http://www-128.ibm.com/developerworks/forums/forum.jspa?forumID=1240&sta.... Within Compute Grid, we've built the SFSB pattern as part of our Batch Datastream Framework (BDS Framework) to make it simpler to leverage. Using LPS or Type-2 drivers is pretty straightforward in WebSphere.

Another important gotcha is workload management and ensuring your batch processing doesn't negatively impact your online transaction (OLTP) workloads (and vice versa). The only way to have a good solution in this area is to use a software stack that integrates with the database and the workload manager. Basically, you need an integrated batch and OLTP platform, not just a batch container.

"app's performance would depend on database specifics": yes, of course, but this is business-as-usual. DB vendors have their own knobs and runtime behaviors that will differ, therefore each has to be optimized in its own way.

"what sort of frameworks have you worked with": I've found Hibernate to not be very good for batch processing. You can read more about why here: http://forum.hibernate.org/viewtopic.php?t=988575&view=next&sid=0aada757.... I've seen customers use IBatis, OpenJPA, raw JDBC, Pure Query, and SQLJ/Static SQL. As the article mentions, getting down to the raw SQL query for Batch can be crucial for performance. I tend to stick to raw JDBC and I use the Batch Data Stream Framework (BDS Framework) to manage the connections, prepared statements, restarting, etc. You can read more about this at: http://www-128.ibm.com/developerworks/forums/thread.jspa?threadID=190623...

Kalyan 11/13/07 04:06:33 PM EST

This article looks pretty good in its content. A couple of questions, though:

# Have you used this architecture on any of the systems that you have implemented? If so, what are the hiccups that you have come across?

# Though you discourage using stored procs for performance reasons, you suggest tweaking some database configuration to see if one can get better performance. Wouldn't this make the app's performance (though not its logic) dependent on database specifics?

Interacting with databases is the most important part of any batch processing application that has to save data to the persistent store. It'd be interesting to see what sort of frameworks (Hibernate, iBATIS, etc.) you have worked with in this kind of architecture.

Snehal Antani 08/13/07 04:06:11 PM EDT

Interesting article. I recently published an article describing your Dispatcher-Worker pattern for highly parallel batch jobs in the context of WebSphere XD Compute Grid.

http://www.ibm.com/developerworks/websphere/techjournal/0707_antani/0707...

An interesting extension to your description is depicted in Figure 6 of my article: establishing endpoint affinity, which enables new caching opportunities.

The drawback of using straight JEE5 multi-threading packages rather than building on an existing enterprise Java batch framework like Compute Grid is that the developer would have to manage threading, which, for enterprise adopters composed of large development teams, could be more trouble than it's worth.