Parallel Batch Performance Considerations

DESCRIPTION

With the laws of physics providing a nice brick wall that chip builders are heading towards for processor clock speed, we are heading into the territory where simply buying a new machine won't necessarily make your batch go faster. So if you can't go short, go wide! This session looks at some of the performance issues and techniques of splitting your batch jobs into parallel streams to do more at once.

TRANSCRIPT
© 2009 IBM Corporation
Session IK: Parallel Batch Performance Considerations
Martin Packer, [email protected]
Motivations
Increased Window Challenge
● Workloads growing in an accelerated fashion:
● Business success
● Mergers and acquisitions
– Standardisation of processes and applications
● More processing
– Regulation
– Analytics
– “Just because”
● Shortened Window
● Challenge will outstrip “single actor” speed-up
● For SOME installations and applications
– CPU
– Disk
– Tape
– Etc
● Important to assess where on the (possibly) bell-shaped curve you are
This is still 10 – 15 years away
Maybe more
Maybe never
zEC12 – Overall Attributes Highlights (compared to z196)
• 50% more cores in a CP chip
– Up to 5.7% faster core running frequency
– Up to 25% capacity improvement over z196 uni-processor
• Bigger caches and shorter latency
– Total L2 per core is 33% bigger
– Total on-chip shared L3 is 100% bigger
– Unique private L2 designed to reduce L1 miss latency by up to 45%
• 3rd Generation High Frequency, 2nd Generation Out of Order Design
– Numerous pipeline improvements based on z10 and z196 designs
– Number of instructions in flight is increased by 25%
• New 2nd level Branch Prediction Table for enterprise scale program footprint
– 3.5x more branches
• Dedicated co-processor per core with improved performance and additional capability
– New hardware support for Unicode UTF8<>UTF16 bulk conversions
• Multiple innovative architectural extensions for software exploitation
Other Motivations
● Take advantage of capacity overnight
● Proactively move work out of the peak
● Resilience
● Consider what happens when some part of the batch fails
● Note: This presentation doesn't deal with the case of concurrent batch
● Some challenges are different
● Some are the same
– You just have 24 hours to get stuff done, not e.g. 8
Issues
● Main issue is to break the “main loop” down into e.g. 8 copies, each acting on a subset of the data
● Assuming the program in question has this pattern *
● Not the only issue but necessary to drive cloning
● With this pattern dividing the Master File is the key thing …
* This is a simplification of the more general “the whole stream needs cloning” case
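As a sketch of the splitting pattern above (not from the original session; the record layout, key position, and data are invented for illustration), hashing each record's key into N buckets gives each clone its own subset of the Master File:

```python
# Sketch: split a "master file" into N subsets, one per clone.
# Assumes the first comma-separated field of each record is the key.
from zlib import crc32

def split_master(records, n_clones):
    """Assign each record to one of n_clones buckets by hashing its key."""
    buckets = [[] for _ in range(n_clones)]
    for rec in records:
        key = rec.split(",")[0]                      # hypothetical key field
        buckets[crc32(key.encode()) % n_clones].append(rec)
    return buckets

records = ["A001,100", "B002,250", "C003,75", "D004,300"]
buckets = split_master(records, 2)
assert sum(len(b) for b in buckets) == len(records)  # no record lost or duplicated
```

Hashing keeps all records for one key in the same clone, which matters for the locking and merge issues discussed later; a key-range split would work too if the key distribution is even.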
More Issues
● Reworking the “results”
– Probably some kind of merge process
● Handling inter-clone issues
– Locking
– I/O bottlenecks
● Provisioning resource
– Concurrent use of memory and CPU greatly increased
● Scheduling and choreography
– Streams in lockstep or not
– Recovery boundaries
– Automation of cloning in the schedule
A Note On Reworking
● Consider the “merge at the end” portion:
● Probably valuable to separate data merge from “presentation”
– “Presentation” here means e.g. reports, persistent output
● Consider an “architected” intermediate file
– XML or JSON or whatever
● Use the architected intermediate file for other purposes
– e.g. PDF format reporting
– Alongside original purpose
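A minimal sketch of the JSON flavour of an “architected” intermediate file, assuming each clone writes a file of per-account totals (the account names and amounts are invented):

```python
# Sketch: merge per-clone JSON intermediate files before any "presentation".
import json

def merge_clone_outputs(json_texts):
    """Combine per-clone totals dictionaries into one merged dictionary."""
    merged = {}
    for text in json_texts:
        for account, amount in json.loads(text).items():
            merged[account] = merged.get(account, 0) + amount
    return merged

clone1 = json.dumps({"acctA": 10, "acctB": 5})   # what clone 1 might write
clone2 = json.dumps({"acctB": 7, "acctC": 3})    # what clone 2 might write
print(merge_clone_outputs([clone1, clone2]))     # {'acctA': 10, 'acctB': 12, 'acctC': 3}
```

Because the merged file is plain data, the same input can feed the original report step and, say, a PDF formatter, without rerunning the main loop.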
Implementation
● Implementation consists of three obvious steps:
– Analysis
– Make changes – or Implementation :-)
– Monitoring
(and loop back around)
Analysis
● Look for e.g.
– CPU-intensive steps
– Database-I/O intensive steps
● Prefer other tune-ups
– Be clear whether other tune-ups get you there
– Some may effectively do cloning for you
● Take a forward-looking view
– Lead time
– Keep list of potential jobs to clone later on
● Assess whether “code surgery” will be required
Making Changes
● Splitting the transaction file
● Changing the program to expect a subset of the data
● Merging the results
● Refactoring JCL
● Changing the schedule
● Reducing data contention
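The “changing the program to expect a subset” and “merging the results” steps can be sketched as follows. This is a stand-in, not the real batch program: actual clones would be separate jobs driven by the scheduler, not threads, and the worker body here is invented:

```python
# Sketch: run N copies of the "main loop", each on its own subset,
# then merge. ThreadPoolExecutor stands in for N parallel batch jobs.
from concurrent.futures import ThreadPoolExecutor

def clone_main_loop(subset):
    # Stand-in for one clone's per-record processing.
    return sum(subset)

def run_clones(subsets):
    with ThreadPoolExecutor(max_workers=len(subsets)) as pool:
        return list(pool.map(clone_main_loop, subsets))

data = list(range(100))
subsets = [data[i::4] for i in range(4)]       # a 4-Up split
results = run_clones(subsets)
assert sum(results) == sum(data)               # merge reproduces the 1-Up answer
```

The closing assertion is the point of the sketch: whatever the degree of parallelism, the merged result must match what the unsplit job would have produced.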
Monitoring
● Monitoring isn't terribly different from any other batch monitoring.
● Usual tools, including:
● Scheduler-based monitoring tools - for how the clones are progressing against the planned schedule.
● SMF - for timings, etc.
● Logs
● Need to demonstrate the application still functions correctly
● Work on “finding the sweet spot”:
– e.g. Is 2 the best, or 4, or 8? *
● Work on “balance”
* Note bias to “powers of two”
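One way to reason about the sweet spot before measuring is a simple Amdahl-style model. This is a sketch: the base elapsed time and serial fraction below are invented numbers, not measurements from this case study:

```python
# Sketch: estimate "longest leg" elapsed time for candidate degrees
# of parallelism, assuming a fixed serial fraction of the work.
def longest_leg(base_minutes, n_up, serial_fraction):
    """Elapsed = serial part + parallel part divided across n clones."""
    return base_minutes * (serial_fraction + (1 - serial_fraction) / n_up)

base, serial = 32.0, 0.10          # hypothetical 1-Up time and serial fraction
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(longest_leg(base, n, serial), 2))
# Gains shrink as n grows: 8-Up -> 16-Up saves far less than 1-Up -> 2-Up,
# which is why measuring a few powers of two usually locates the sweet spot.
```

Real batch rarely follows the model exactly (contention grows with n), so treat it as a guide for which degrees of parallelism are worth actually running.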
Case Study
● Our application
● Not meant to be identical to yours
● Scales nicely through iterations
● Process important
● Stepwise progress
● Use e.g. DB2 Accounting Trace to guide
● In the following:
● 0-Up is the original unmodified program
● 1-Up is prepared for 2-Up etc and has
– Reporting removed & replaced by writing a report data file
● Report writing and “fan out” and “fan in” add minimal elapsed time / CPU
● 2 Key metrics:
● Total CPU cost
● Longest Leg Elapsed Time
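Given per-clone figures (e.g. from SMF or DB2 Accounting Trace), the two metrics reduce to a sum and a max. The numbers here are invented, purely to show the shape of the calculation:

```python
# Sketch: the two key metrics from per-clone measurements.
def batch_metrics(clones):
    """clones: list of (cpu_minutes, elapsed_minutes), one tuple per clone."""
    total_cpu = round(sum(cpu for cpu, _ in clones), 2)   # Total CPU cost
    longest_leg = max(elapsed for _, elapsed in clones)   # Longest Leg Elapsed Time
    return total_cpu, longest_leg

four_up = [(4.0, 7.5), (4.2, 8.1), (3.9, 7.0), (4.1, 7.8)]  # hypothetical 4-Up run
print(batch_metrics(four_up))  # (16.2, 8.1)
```

The max (not the average) matters for the window: the batch is done only when the slowest clone finishes, which is why “balance” gets its own bullet above.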
Don't Commit: Won't Go Beyond 1-Up
45% DB2 CPU, 45% Non-DB2 CPU
[Chart: Total CPU and Max Elapsed, in minutes, for 0-Up and 1-Up]
Commit Every Update: Scales Nicely Up To 8-Up
50% DB2 CPU, 50% Non-DB2 CPU
[Chart: Total CPU and Max Elapsed, in minutes, for 0-Up through 32-Up]
8 Balanced Partitions: 50% Elapsed / CPU Reduction Up To 32-Up
Up to 16-Up almost all time is Non-DB2 CPU; at 32-Up, 50% is “Queue”
16 and 32 partitions made no difference
[Chart: Total CPU and Max Elapsed, in minutes, for 0-Up through 32-Up]
Case Study Lessons
● Applications mustn't break when you try to clone
● “Sweet spot” in our case is around 8-up
● Might still drive further if CPU increase acceptable
● Elapsed time got better at 16-Up
● Data Management work can help
● Partitioning very nice in our case
● Environmental conditions matter
● In our case CPU contention limited scalability
● DB2 Accounting Trace guided us:
● Explained why preloading data into DB2 buffer pools did nothing