Wednesday, July 22, 2009

Consideration of running batch programs without Stored Procedures

A few days ago, I've joined a discussion with one of my company's customer about software application architecture. Once again, the good old debate about using stored procedure(SP) was brought onto the table.

The customer, which used stored procedure for database operation in her legacy systems quite intensively, seems giving in this time. They agreed that for online transactions, DB operations can be implemented in Java based Data Persistence Layer(DPL), like DAO or Entity EJB. However, for batch programs, we proposed Stored Procedures can still be use. The customers seems happy. After all, this design looks nature, low risk, and DB programmers hired by the customer won't lost their job.

Running batch programs with SP seems a very nature decision. Batch programs usually involve large amount of data processing in database. And usually the time window for a batch program to run is very limited. Many application systems can only allocate 8 hours for running dozens of batch job everyday. So performance is also a key issue. SP usually is the fastest answer for this type of requirement.

Actually, from my experience, SP may not always help. In my career I've saw a weekly batch program with thousands of SQL scripts took more than 24 hours to finish. The problem is, no matter how fast SP can be execute, resource available for SP programs is only CPU and memory of one DB server.

The believers of Object Oriented Programming continuously have been selling us to put all persistent logic in DPL. In spite of the performance issue, using SP to update data in a batch process still defeat the whole propose of using DPL. After some time, the old data integrity issue will still emerge, as if no DPL is being used. So, if you think the hybrid solution I proposed to my client is the ultimate answer, you are in fact fooling yourself.

So, I re-think again and again, how a batch program can be implement with Java while all requirements, mainly on performance can be met? I have some answer below. If you are OOP people, please read them and kindly share your through with me. If you are SP people, please let me know how can you maintenance a batch program written in SP that need more than 24 hours to run without database re-design.

Problem 1: Performance

As I mentioned above, performance is usually the biggest hurdle against implementing batch programs outside the database. Why this have to be slower? The answer is simple. In this way the system have to send the data from DB to another server thought the network. Transform the data to another structure (e.g. Java Objects). And write the result back to DB thought the network again. Comparing with using SP, all these are additional overhead. As we cannot completely avoid these overhead. There is one way to compensate these.

When you run a batch program with SP, you have to use the CPU in the DB server. No matter how expansive your server box is. All you can have are usually 2 or at most 4 CPUs. With this limited computation power, they have to perform all the operation including query, calculation, and update.

When you perform your batch program outside your DB server. You are actually moving your calculation to other CPUs. In this way, not only you can have more CPUs to do the job, as you can have different servers to do independent batch programs. Also, your DB server would have less work to do, so queries and updates will go faster. With proper design, I think the performance will at least comparable with the traditional SP approach.

Problem 2: Memory Management

Surprisingly, there's a factor that make batch process with program outside the DB difficult, or sometimes consider impossible. It is memory management.

It is very common for a DB contains many Gigabytes or even Terabytes of data. In SP, it is relatively easy to create a temporary data set to store transient data for calculation. DB usually can handle this without much difficulties. However, for a program outside DB, memory available are usually a few Gigabytes, or less than 2 Gb for a Java program. If your program have to perform sorting or grouping on a large amount of data. You may easily hit the Out of Memory error.

The answer to this problem, is saving the data in temporary files on the disk. For example, with Java, you may implement common Java data structure such as List or Map to serialize the content and save them in temp files. As such actions will be wrapped in the data structure class, data read/write from the temp files would be hide away from the business logic.

Conclusion

After the discussion above, I have to say migrating batch programs from SP to something outside DB require considerable effort, and it is not risk free. However, we should not under-estimate the benefit of isolating the data persistence logic in the long run. If your application have a complicated database scheme, data in the database would gradually turn into a chaos if you let programmers to update the data directly in their business logic, even only in batch programs. Then causing painful maintenance issues afterward.

2 comments:

Михайло said...

I agree that batch gives more capabilities.
But in real environment where system administrator have maintain many systems and computers getting difficult works for one program with many computers.

Unknown said...

I agree that by always batch job running should be very carefully planned, but such planning is required no matter you run your batch programs inside or outside the DB. Though, you'll need different architecture for different option.