
Page 1: J2EE Batch Processing

1

Batch Processing With J2EE
Design, Architecture and Performance

Chris Adkin

28th December 2008

Page 2: J2EE Batch Processing

2

Introduction

For the last two years I have worked on a project testing the performance and scalability of batch processes using a J2EE application server.

This presentation summarises my findings and conclusions based upon this work.

Page 3: J2EE Batch Processing

3

Introduction

There is scarce information in the public domain:
- on batch processing using J2EE.
- on the end to end tuning of J2EE architectures which use Oracle for persistence.
There is a lack of information within the DBA community on performance tuning with respect to J2EE that goes beyond JDBC usage.
Most J2EE material only goes as far down to the database as persistence frameworks and JDBC.
The available information is not as “joined up” as it could be.
Hopefully, this presentation may fill some of these gaps and bridge the divide between J2EE and database tuning.

Page 4: J2EE Batch Processing

4

Design Considerations

Page 5: J2EE Batch Processing

5

Design and Architecture Considerations

Use third party tools and frameworks:
- Spring Batch
- Quartz
J2EE application server extensions:
- IBM WebSphere Compute Grid
Write your own infrastructure; devx has a good example.

Page 6: J2EE Batch Processing

6

Considerations For Available Infrastructures

Quartz
- Not a full blown batch infrastructure and execution engine, just a scheduler.
Spring Batch
- Version 2.0 was not available at the time of inception of my project.
- Spring Batch 1.0 is only designed to run on one JVM and was written for JSE 1.4.
- Earlier versions of Spring Batch can compromise the transaction integrity of the application server; refer to this article.

Page 7: J2EE Batch Processing

7

Considerations For Available Infrastructures

WebSphere Compute Grid
- IBM has a long track record in both the J2EE and batch processing worlds.
- “a complete out-of-the-box solution for building and deploying Java-based batch applications on all platforms supported by WebSphere”, according to this article.
- Integrates the most tightly with WebSphere of all the available options, but also ties you into WebSphere.
- Requires WebSphere Network Deployment as a pre-requisite.
- Not just a batch job processing infrastructure but a grid as well.
- Comes with full tooling for developing batch jobs.

Page 8: J2EE Batch Processing

8

Infrastructure Considerations

Workload partitioning and scalability Can the workload be subdivided for distribution amongst

worker threads and nodes in a J2EE application server cluster ?

Does the infrastructure scale across JVM threads ? A grid ? J2EE application servers in a cluster ? Multiple JVMs via JMS and associated queuing

technologies ?

Page 9: J2EE Batch Processing

9

Infrastructure Considerations

Job traceability Does the framework give visibility of each stage of

processing that a job is at ?. Can the level of logging / tracing / auditing be changed

for individual batch jobs and how fine grained is this ?.

Exception handling Does the framework allow for this ?.

Page 10: J2EE Batch Processing

10

Infrastructure Considerations

Resource consumption management Control over CPU utilisation.

Extensibility Do you have to get your hands dirty with maintaining

the framework or can you just ‘Drop’ your business logic into it ?.

Is the framework flexible in handling the delivery of jobs from different sources ?, JMS, web services ? Etc . . .

Is the framework flexible in integrating with different end points ?.

Page 11: J2EE Batch Processing

11

Infrastructure Considerations

Scheduling and event notification Does the framework provide a scheduling

mechanism or can it easily hook into a third party scheduler products.

In particular the more popular schedulers such as BMC Control-M or Tivoli Maestro ?

Does the framework provide hooks into a pager and / or email event notification system ?.

Page 12: J2EE Batch Processing

12

Infrastructure Considerations

Resilience If a job or batch fails, will it bring the whole application

server down ?. If a batch fails, does it roll back and leave the application

in a consistent state ?. Can batches be re-started without any special steps

having to be performed.

Page 13: J2EE Batch Processing

13

Batch Environment Components

Batch execution environment
- The actual batch run time environment.
- Batch ‘container’ software to provide the services for a batch to run.
Scheduling
- Does the environment provide this, or hooks into third party schedulers?
The application itself

Page 14: J2EE Batch Processing

14

What Does J2EE Provide For A Batch Environment

- Pooling for the efficient management of resources.
- Access to logging frameworks: Apache log4j, Java Util Logging (JUL).
- Rich integration infrastructure via:
  - J2EE Connector Architecture and JDBC
  - Java Message Service
  - Web Services
  - Web Service based publish / subscribe style event processing via WS-Notification
  - Session Initiation Protocol (SIP)
  - Service Component Architecture (provided in WebSphere 7 via a feature pack).

Page 15: J2EE Batch Processing

15

What Does J2EE Provide For A Batch Environment

- Asynchronous processing via message driven beans.
- Transaction support via JTS, with an API via JTA.
- Scalability across multiple Java Virtual Machines: most J2EE application server vendors offer clustered solutions.
- Scalability across multiple Java threads: threading is not supported in the EJB container by definition of the J2EE standard; however, it can be simulated using a client JVM or asynchronous beans.
- Security via JAAS.

Page 16: J2EE Batch Processing

16

ORM Considerations

- Many frameworks are available: iBATIS, TopLink, Spring, Hibernate and IBM pureQuery.
- The Java Persistence API lessens the need for such frameworks.
- Few frameworks utilise the Oracle array interface.
- Use of a framework can vastly reduce the amount of code required to be written.
- A “half way house” is to use a JDBC wrapper.

Page 17: J2EE Batch Processing

17

ORM Considerations

Questions to ask when choosing an ORM:
- Can custom SQL be used?
- Can SQL be hinted?
- Does it have caching capabilities?
- Does it allow stored procedures, both PL/SQL and Java, to be called?
- Does it allow for any batch / bulk operations to be performed? e.g. wrappers for the JDBC batching API.

A hybrid approach can be adopted, for example:
- Read only entity beans for access to standing data; these have been highly optimised of late, as per this article.
- “Hand rolled” JDBC for bulk operations, leveraging things such as the Oracle array interface.

Page 18: J2EE Batch Processing

18

Caching Considerations

What is the percentage split between read and write activity against data stored in the database?
- Read intensive: caching needs to be seriously considered.
- Write intensive: consider stored procedures and leveraging bulk operations as much as possible.
  - Some databases (including Oracle) support Java stored procedures.
  - Leverage the skills of J2EE developers within the database!
Whatever you do, frequently accessed standing data should always be cached.

Page 19: J2EE Batch Processing

19

Caching Considerations

- Processing that takes place within a batch job may not reuse the same data, but batch jobs that follow on from one another might.
- Java objects talking to Java objects is faster than Java objects talking to relational data. However, if there is a reporting requirement, most reporting tools run off relational data.
- Is a relational database going to be ‘fronted’ with some sort of cache, or is an in memory object cache to be used without any database featuring at all?

Page 20: J2EE Batch Processing

20

Caching Considerations

- Is a custom caching design going to be used?
  - “Scale proof” this using Network Deployment friendly memory structures such as DistributedMap.
- Is an off the shelf caching (and grid) solution such as Coherence or WebSphere eXtreme Scale to be used?
  - These are intrusive technologies that need to be factored into development.
- An in memory relational database caching solution, e.g. TimesTen, can be easily retrofitted into the technical infrastructure:
  - Does the integration layer expect objects or relational data?

Page 21: J2EE Batch Processing

21

Design Challenges

Resource Utilisation

Using a database for persistence incurs performance penalties:
- Network round trips.
- Latency in data retrieval and modification.
- The object relational impedance mismatch, the “Vietnam of Computer Science”.

Page 22: J2EE Batch Processing

22

Design Challenges

Resource Utilisation
- Well designed and written batch processes may saturate CPU capacity on the application server:
  - Good for throughput.
  - Spare CPU capacity may be required to run multiple batches at once in “catch up” scenarios.
  - Not so good for any other non-batch activities using the environment.
- Sustained spikes in J2EE application server CPU utilisation whilst batch processes are running, and low CPU activity at other times.

Page 23: J2EE Batch Processing

23

Design Challenges

ORM (Object Relational Mapping) frameworks
- There are a multitude of ORM frameworks on the market.
- ORM frameworks abstract away the underlying database.
- Little or no support for JDBC batching and the Oracle array interface.
- Focus on item by item processing and not on database features conducive to achieving good performance and scalability.
- J2EE persistence has come a long way with Java EE 5 in the form of the Java Persistence API, both in terms of functionality and performance.
- Good for programmer productivity, as less “hand cranked” code is required.

Page 24: J2EE Batch Processing

24

Design Challenges

Raw JDBC
- Statement batching support, available from JDBC 2.0 onwards.
- Support for batch retrieval via fetch size configuration.
- Can result in having to produce more “hand cranked” code than that required with an ORM framework.
- Provides access to vendor specific performance related features such as the Oracle array interface.
- Requires more skill on the part of the Java programmer in terms of SQL and database knowledge; the development team might require a DBA.
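The statement batching and fetch size features mentioned above can be sketched as follows. This is only an illustration: the JOBS table, its column, the batch size of 100 and the helper for counting round trips are all invented, and a live `java.sql.Connection` (e.g. from the Oracle thin driver) is assumed for the two JDBC methods.

```java
// Sketch of JDBC 2.0 statement batching and fetch-size tuning. The schema
// and batch size are hypothetical; a live Connection is assumed.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

public class JdbcBatchSketch {

    // Insert many rows in few network round trips via executeBatch().
    static void insertJobs(Connection con, List<String> jobIds) throws SQLException {
        try (PreparedStatement ps =
                 con.prepareStatement("INSERT INTO jobs (job_id) VALUES (?)")) {
            int pending = 0;
            for (String id : jobIds) {
                ps.setString(1, id);
                ps.addBatch();
                if (++pending == 100) {         // flush every 100 rows
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) ps.executeBatch(); // flush the remainder
        }
    }

    // Batch retrieval: ask the driver to fetch 100 rows per round trip.
    static void readJobs(Connection con) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement("SELECT job_id FROM jobs")) {
            ps.setFetchSize(100);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    process(rs.getString(1));
                }
            }
        }
    }

    static void process(String jobId) { /* business logic goes here */ }

    // Round trips needed to send n rows in batches of size b.
    static int roundTrips(int n, int b) {
        return (n + b - 1) / b;
    }

    public static void main(String[] args) {
        // 150,000 single-row inserts versus batches of 100:
        System.out.println(roundTrips(150_000, 1));   // 150000 round trips
        System.out.println(roundTrips(150_000, 100)); // 1500 round trips
    }
}
```

The round trip arithmetic in `main` is the whole argument for batching: the same 150,000 rows cost two orders of magnitude fewer network round trips when sent 100 at a time.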

Page 25: J2EE Batch Processing

25

Design Challenges

SQLJ
- Is essentially a JDBC wrapper; SQLJ calls are translated into JDBC calls by a pre-processor.
- Can achieve similar results to JDBC with less coding.
- Support for statement batching.
- SQLJ syntax can be checked at compile time.
- Does not support the Oracle array interface.
- An IBM SQLJ reference. Oracle SQLJ examples.

Page 26: J2EE Batch Processing

26

Design Challenges

Can the Oracle array interface be leveraged?
- Despite all the choices available, only raw JDBC provides access to the Oracle array interface.
- There may come a point in scaling your architecture when the Oracle array interface needs to be used, in order to:
  - Minimize network round trips.
  - Minimize parsing.
  - Leverage bulk operations within the database.

Page 27: J2EE Batch Processing

27

Design Challenges

Can Oracle 11g client side caching be used?
- An extension of the technology that allows results to be cached in the server shared pool, but on the “client side”.
- Requires the use of the thick JDBC driver.
- Can vastly reduce network round trips, data access latency and CPU utilisation on the database server.
- An excerpt from the 360 degree programming blog:

“Running the Nile benchmark[3] with Client Result Cache enabled and simulating up to 3000 users results in:
- Up to 6.5 times less server CPU usage
- 15-22% response time improvement
- 7% improvement in mid-tier CPU usage”

Page 28: J2EE Batch Processing

28

To Batch Or Not To Batch

When real time asynchronous processing is applicable
- Processing needs to take place as soon as the source data arrives, which does not all come at the same time.
- When the processing window is too small to process all the jobs in one batch, and when the jobs arrive continuously throughout the day.
- Jobs are delivered asynchronously.

Page 29: J2EE Batch Processing

29

To Batch Or Not To Batch

When a batch environment is applicable
- If the jobs processed are delivered in batches, this will to a degree enforce batch type processing.
- When files need to be generated for delivery to another organisation.
- If migrating from a non-J2EE legacy batch environment to J2EE, stick to batch in the first iterations of development, rather than jumping to J2EE and an event processing architecture in one “quantum leap”.

Page 30: J2EE Batch Processing

30

A “Third Way” Hybrid Environment

A real world example of where this is in operation:
- Most retailers aggregate sales information from their point of sale (POS) systems for processing at the head office.
- Larger retailers tender so many transactions that processing them within a single batch window is not practical.
- Therefore, for some retailers, information from the POS systems is continuously trickled to the head office and then batched up for processing when a certain number of files have been received.

Page 31: J2EE Batch Processing

31

Our Batch Process Design

J2EE tier
- WebSphere launch client to instigate batch processes.
- The client uses Java threads to fire off multiple requests at the application server and hence ‘simulate’ threading within the application server.
- A batch session bean to process arrays of jobs within a loop inside the WebSphere application server.
- Stateless session beans.
- Container managed transactions (JTS); each job is processed within its own transaction.
- Application configurable maximum threads per batch process and maximum jobs per thread.
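The launch client threading scheme described above can be sketched as below. Everything here is illustrative: `BatchBean` merely stands in for the stateless session bean's remote interface (which a real client would look up via JNDI), and the thread and chunk limits are the configurable "max threads" and "jobs per thread" parameters.

```java
// Illustrative sketch of the launch client: the job list is split into
// per-thread chunks and each chunk is sent to a (stand-in) session bean.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LaunchClientSketch {

    interface BatchBean {                        // stand-in for the bean's remote interface
        void processJobs(List<Integer> jobIds);  // one JTS transaction per job inside the bean
    }

    // Split jobIds into chunks of at most jobsPerThread.
    static List<List<Integer>> partition(List<Integer> jobIds, int jobsPerThread) {
        List<List<Integer>> chunks = new ArrayList<>();
        for (int i = 0; i < jobIds.size(); i += jobsPerThread) {
            chunks.add(jobIds.subList(i, Math.min(i + jobsPerThread, jobIds.size())));
        }
        return chunks;
    }

    // Fire one client task per chunk, running at most maxThreads at once.
    static void runBatch(BatchBean bean, List<Integer> jobIds,
                         int maxThreads, int jobsPerThread) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        for (List<Integer> chunk : partition(jobIds, jobsPerThread)) {
            pool.submit(() -> bean.processJobs(chunk));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> jobs = new ArrayList<>();
        for (int i = 0; i < 250; i++) jobs.add(i);

        AtomicInteger processed = new AtomicInteger();
        runBatch(ids -> processed.addAndGet(ids.size()), jobs, 4, 100);
        System.out.println(processed.get()); // prints 250
    }
}
```

The point of the sketch is that concurrency lives entirely in the client JVM, which keeps the EJB container free of application-managed threads, as the J2EE specification requires.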

Page 32: J2EE Batch Processing

32

Our Batch Process Design

Persistence (Oracle) tier
- Raw JDBC and the Oracle thin driver.
- Some use of JDBC statement batching.
- Oracle 10g release 2 for the database.
- Limited use of stored procedures.
- J2EE tier data caching limited to standing data:
  - Data cached in XML within the application server.
  - When a standing data table is accessed for the first time it is cached.
  - All subsequent retrievals are via XPath.
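A minimal sketch of that standing data cache, using only the JDK's XML and XPath APIs. The table name, XML shape and canned loader are invented for illustration; the real application would populate the document from the database over JDBC on first access.

```java
// Sketch of the standing data cache: on first access a table's rows are
// cached as an XML document; subsequent retrievals go via XPath.
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class StandingDataCache {

    private final Map<String, Document> cache = new ConcurrentHashMap<>();

    // In the real application this would read the table over JDBC; here a
    // canned XML string stands in for a COUNTRY standing data table.
    private Document load(String table) {
        try {
            String xml = "<rows>"
                       + "<row><code>GB</code><name>United Kingdom</name></row>"
                       + "<row><code>FR</code><name>France</name></row>"
                       + "</rows>";
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // First call caches the table; all later retrievals are XPath lookups.
    public String lookup(String table, String code) {
        Document doc = cache.computeIfAbsent(table, this::load);
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            return xp.evaluate("/rows/row[code='" + code + "']/name", doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        StandingDataCache cache = new StandingDataCache();
        System.out.println(cache.lookup("COUNTRY", "FR")); // prints France
    }
}
```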

Page 33: J2EE Batch Processing

33

Our Batch Process Design

- Not a true batch implementation as such.
- Web GUI, web service(s) and hand held units can be, and are, used whilst ‘batch’ processes run.
- ‘Batch’ in the sense that large numbers of jobs are processed together within specific time windows.
- All batch control is via the WebSphere launch client; no GUI based job control.

Page 34: J2EE Batch Processing

34

Performance Monitoring and Tuning “Tool Kit”

Application Server and Client JVM
- Verbose garbage collection output
- WebSphere Performance Monitoring Infrastructure (PMI)
- WebSphere performance advisor
- Java thread dumps
- JProfiler Java profiler
Oracle Database
- 10g performance infrastructure: advisors, ADDM, time model etc.
Operating System Tools
- prstat, sar, vmstat, iostat etc.
Veritas volume management monitoring tools
- vxstat

Page 35: J2EE Batch Processing

35

Performance Monitoring and Tuning “Tool Kit”

Available IBM WebSphere tools not used on the project:
- IBM Support Assistant plugins, namely the thread analyzer and verbose garbage collection output analyzer.
- ITCAM for Response Time Tracking.
- ITCAM for WebSphere.
Available Sun tools not used on the project:
- jstat
- jconsole

Page 36: J2EE Batch Processing

36

Batch Architecture Deployment Diagram

[Diagram: the Launch Client calls the WebSphere 6.1 Application Server via RMI. Within the EJB container sit the Domain Interface Layer, Domain Layer, Data Access Interface Layer, Data Access Layer and Utility Services (batch manager, logging, exception handling, standing data cache etc). The Data Access Layer talks to the Oracle Database Server via JDBC.]

Page 37: J2EE Batch Processing

37

Software Architecture

Classical horizontally layered architecture
- Apache Struts: out of the can MVC framework.
- Business logic tier implemented using stateless session beans, the session façade, business delegate and service locator patterns.
- Data access layer written using stateless session beans, raw JDBC and the data transfer object pattern.
- Utility layer providing logging, exception handling, service locator, EJB home caching, standing data cache, and parameters and controls functionality.

Page 38: J2EE Batch Processing

38

Software Architecture

Vertical layering also
- Functional areas divided into vertical slices that go through both the business logic / domain layers and the data access / integration layer.
- Loose coupling of vertical slices via ‘manager’ beans, the session façade design pattern and coarse interfaces.

Page 39: J2EE Batch Processing

39

Software Performance Features

Domain / business logic layer
- Cached standing data.
- EJB home caching (service locator design pattern).
- Use of the session façade pattern with coarse interfaces.
- All beans are stateless:
  - IBM consider this to be a best practice.
  - Unlike calls to stateful session beans, calls to stateless beans can be load balanced across all members of a cluster.
  - The J2EE community regards stateless beans as being better than stateful beans for performance.

Page 40: J2EE Batch Processing

40

Software Performance Features

Data Access Layer
- Use of data transfer objects.
- JDBC connection pooling, with the min and max settings on the JDBC pool set to the same value to prevent connection storms.
- JDBC statement batching used in places.
- JDBC prepared and callable statements used so as not to abuse the Oracle database shared pool.
  - Soft parsing may still be an issue, but can be reduced slightly by using session_cached_cursors.

General design
- Batch process threading for scale out.

Page 41: J2EE Batch Processing

41

Batch Design Sequence Diagram

Participants: Batch Client, J2EE Container, Database.

1: Start the batch process
2: Create a batch record with start time
3: Get the number of threads and number of jobs per thread parameters
4: Retrieve the parameters
5: Returns
6: Get the list of SPRs / Jobs to be processed
7: Retrieve the SPRs / Job Ids
8: Returns a list of SPRs / Job Ids
9: Create the number of threads and pass the ‘job list’ as a parameter
10: Each thread makes a call to a bean method, sending the ‘job list’ as a parameter
11: Loop through each SPR / Job Id within the ‘job list’ to process them
12: On completion, each thread ends here
13: Update the batch record with status and end time

Page 42: J2EE Batch Processing

42

Where Does The Source Data For Our Batch Processes Originate?

- Flat files delivered via FTP
- Web services
- A third party off the shelf package via JNI
- Hand held units using J2ME

Page 43: J2EE Batch Processing

43

Design Critique

Page 44: J2EE Batch Processing

44

Pros

- Design can scale out via threads.
- Design can scale out across multiple JVMs.
- Design is simple and clean.
- Because of the online usage, the row by row processing simplifies the design; complex code might be required to allow for both batch array processing and online usage.

Page 45: J2EE Batch Processing

45

Pros

- If a single job fails, the whole batch does not need to be rolled back.
- CPU usage of the batch can be controlled by changing the number of threads.
- Provides a framework for the batch infrastructure.

Page 46: J2EE Batch Processing

46

Cons

Inefficiencies by design when accessing the database:
- Limited opportunities for leveraging the JDBC batching API and the Oracle array interface.
- Design is prone to a lot of ‘chatter’ between the application and database servers.
- Large “soft parse overhead”.

Page 47: J2EE Batch Processing

47

Cons

- HHU job retrieval may be more conducive to an event processing architecture than a batch architecture:
  - Better for more even CPU utilisation.
- We have to maintain the infrastructure code as well as the business logic / domain code.
- Is there a better way of simulating threading that could reduce the role of the launch client, message driven beans perhaps?
  - i.e. limiting the role of the launch client in batch processing will be better for performance and scalability.

Page 48: J2EE Batch Processing

48

Network Round Trip Overheads

Database utilisation: network round trip overhead. From “Designing Applications For Performance And Scalability”:

“When more than one row is being sent between the client and the server, performance can be greatly enhanced by batching these rows together in a single network roundtrip rather than having each row sent in individual network roundtrips. This is in particular useful for INSERT and SELECT statements, which frequently process multiple rows and the feature is commonly known as the array interface.”

There is minimal scope for leveraging the array interface (and also the JDBC batching API) using our design.

Page 49: J2EE Batch Processing

49

Parsing Overheads

- Best J2EE programming practice dictates that resources should be released as soon as they are no longer required.
- All cached prepared statement objects are discarded when the associated connection is released.
- This could be coded around, but would lead to code that is both convoluted and prone to statement cache leaks.

Page 50: J2EE Batch Processing

50

Parsing Overheads

- The statement API is more efficient than the prepared statement JDBC API for the first execution of a statement.
- Subsequent executions of a prepared statement are more efficient and more scalable.
- Using the statement API would be less resource intensive on the application server, but more resource intensive on the database.

Page 51: J2EE Batch Processing

51

Parsing Overheads

Should the prepared statement cache size be set to zero?
- No point in bearing the overheads associated with cached statement object creation.
- The cache will also create unnecessary pressure on the JVM heap.

Page 52: J2EE Batch Processing

52

Parsing Overheads

Why is parsing such a concern?
- Oracle’s Tom Kyte and the Oracle Real World Performance group stress that the importance of parsing and efficient cursor use cannot be overstated when it comes to the scalability of applications that use Oracle.
- This is not a problem unique to Oracle; WebSphere and DB2 material advocates the use of static SQL for the very same reason of avoiding parsing.

Page 53: J2EE Batch Processing

53

Parsing Overheads

Database utilisation: soft parse overhead. The “Designing Applications For Performance And Scalability” Oracle white paper describes the type of SQL usage in our design as:

“Category 2 – continued soft parsing
The second category of application is coded such that the hard parse is replaced by a soft parse. The application will do this by specifying the SQL statement using a bind variable at run-time including the actual value . . . Continued . . .

Page 54: J2EE Batch Processing

54

Parsing Overheads

Database utilisation: soft parse overhead. The application code will now look somewhat similar to:

    loop
      cursor cur;
      number eno := <some value>;
      parse(cur, “select * from emp where empno=:x”);
      bind(cur, “:x”, eno);
      execute(cur);
      fetch(cur);
      close(cur);
    end loop;”

Refer to “Soft things can hurt” !!!

Page 55: J2EE Batch Processing

55

Parsing Overhead

The Oracle Automatic Database Diagnostic Monitor (ADDM) reports on the performance impact of continuous soft parsing:

    FINDING 3: 13% impact (211 seconds)
    -----------------------------------
    Soft parsing of SQL statements was consuming significant database time.

    RECOMMENDATION 1: Application Analysis, 13% benefit (211 seconds)
      ACTION: Investigate application logic to keep open the frequently used
      cursors. Note that cursors are closed by both cursor close calls and
      session disconnects.

Page 56: J2EE Batch Processing

56

Parsing Overhead

“Category 3” processing as per the white paper is more efficient, and is what we should really be striving for, as per the pseudocode below:

    “cursor cur;
    number eno;
    parse(cur, “select * from emp where empno=:x”);
    loop
      eno := <some value>;
      bind(cur, “:x”, eno);
      execute(cur);
      fetch(cur);
    end loop;
    close(cur);”
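The difference between the two categories can be made concrete with a toy simulation (not real JDBC or PL/SQL): a fake cursor counts parse calls, standing in for the soft parse work the database would do. Category 2 parses on every iteration; category 3 parses once and then only re-binds.

```java
// Toy simulation contrasting the white paper's parse patterns: the parse
// counter stands in for the database's per-call soft parse work.
import java.util.concurrent.atomic.AtomicInteger;

public class ParsePatterns {

    static class FakeCursor {
        final AtomicInteger parses;
        FakeCursor(AtomicInteger parses) { this.parses = parses; }
        void parse(String sql) { parses.incrementAndGet(); } // one soft parse
        void bind(String name, int value) { }
        void execute() { }
        void fetch() { }
        void close() { }
    }

    // Category 2: parse inside the loop, one soft parse per row.
    static int category2(int rows) {
        AtomicInteger parses = new AtomicInteger();
        for (int eno = 0; eno < rows; eno++) {
            FakeCursor cur = new FakeCursor(parses);
            cur.parse("select * from emp where empno=:x");
            cur.bind(":x", eno);
            cur.execute();
            cur.fetch();
            cur.close();
        }
        return parses.get();
    }

    // Category 3: parse once, then bind/execute/fetch per row.
    static int category3(int rows) {
        AtomicInteger parses = new AtomicInteger();
        FakeCursor cur = new FakeCursor(parses);
        cur.parse("select * from emp where empno=:x");
        for (int eno = 0; eno < rows; eno++) {
            cur.bind(":x", eno);
            cur.execute();
            cur.fetch();
        }
        cur.close();
        return parses.get();
    }

    public static void main(String[] args) {
        System.out.println(category2(150_000)); // 150000 soft parses
        System.out.println(category3(150_000)); // 1 parse
    }
}
```

In JDBC terms, category 3 corresponds to preparing a `PreparedStatement` once and re-executing it with new bind values, which is exactly what our connection-release pattern prevents.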

Page 57: J2EE Batch Processing

57

Testing Environment

Page 58: J2EE Batch Processing

58

Monitoring And Tuning The Software

Lots of things to monitor and tune:
- Client JVM
- Server JVM
- Application server Object Request Broker
- EJB container
- JDBC connection pool usage and statement cache
- Application code
- Database usage and resource utilisation
- Application server resource utilisation, mainly CPU
- Network between the application server and database server
- Number of threads per batch job
- Number of jobs per thread

Page 59: J2EE Batch Processing

59

Testing Environment

- Performance targets based on actual run times of batch processes from the legacy environment.
- In testing, 200% of the equivalent legacy workload was used and the database was artificially ‘aged’ to give it the appearance of containing two years worth of data.
- Oracle 10g database flashback used to reproduce tests.
- A large full table scan was used to clear out the Oracle buffer cache and the cache on the storage array, to prevent results from being skewed when repeating the same test after making a performance optimization.

Page 60: J2EE Batch Processing

60

Test Work Load

- Apart from the processing of flat files, most batch runs process between 120,000 and 180,000 jobs.
- Little reference will be made to this in the presentation, in that what we refer to as a ‘job’ will have little meaning to other people unless they are using the same application.
- However, there is a consensus that a ‘job’ is something that requires a discrete set of actions to be performed against it in order to be processed.

Page 61: J2EE Batch Processing

61

Hardware and Software Platforms

- IBM WebSphere Application Server 6.1 base edition, 32 bit.
- Oracle Enterprise Edition 10.2.0.4.0 (10g release 2).
- Solaris 10.
- 1 x 4 CPU (single core) Fujitsu Siemens Prime Power 450 with 32GB RAM to host the database.
- 1 x 4 CPU (single core) Fujitsu Siemens Prime Power 450 with 32GB RAM to host the application server.
- 100Mb Ethernet network.
- EMC CX3-20F storage array for the database, accessed via fibre channel.

Page 62: J2EE Batch Processing

62

Hardware and Software Platforms

EMC CX3-20F storage array for the database, accessed via fibre channel, with:
- Two Intel Xeon based storage processors.
- Two trays of disks, with 15 disks per tray.
- 1GB cache.

Page 63: J2EE Batch Processing

63

EMC CX3-20F Configuration

- Despite being ‘batch’ oriented, from a database perspective the ratio of logical reads to block changes is 92%.
- Some people dislike RAID 5; we, however, think it is perfectly suitable for read intensive workloads:
  - i.e. spread the database files across as many disks as possible.
  - Some disks will be lost to EMC vault disk usage.
- RAID 1 was used for the redo logs and archived redo log files.
- Cache on the array was split 50/50 between read and write usage, as per EMC recommended best practice.
- The size of the database in terms of application segments was approximately 25GB, not that large really.

Page 64: J2EE Batch Processing

64

Database Statistics

- A classical approach to ascertaining application scalability is to look at resource consumption, latching in particular; refer to Tom Kyte’s runstats package.
- The main problems with this were:
  - Flashing the database back between tests would result in the loss of any resource consumption data loaded into a table.
  - Runstats is designed for capturing statistics within a single Oracle session.
- This information could be written to a file, but this would result in expending effort in developing such a tool.
- Fortunately, Oracle 10g provides an out of the box solution to this in the form of the db time model . . .

Page 65: J2EE Batch Processing

65

Database Statistics

What is db time?
- A statistic that comes with the 10g performance management infrastructure.
- The sum total of time spent in non-idle database calls by foreground processes across all sessions.
- Not to be confused with “wall clock time”!
- Provides a single high level metric for monitoring database utilisation: higher db time = higher database utilisation.
- Makes tuning ‘simply’ a matter of reducing db time.
- Refer to this presentation from the architect at Oracle who invented this.

Page 66: J2EE Batch Processing

66

Monitoring And Tuning The Software

So as not to be drowned by statistics, the following high level statistics were chosen for monitoring purposes:
- Oracle CPU usage
- Oracle database time
- Average database load (active sessions)
- WebSphere application server CPU usage

Page 67: J2EE Batch Processing

67

Database Statistics

Database load is a 10g statistic that usually accompanies db time, but what is this?
- Active sessions as reported by the 10g Automatic Database Diagnostic Monitor.
- Calculated as: average database load = db time / wall clock time.
- Higher average database load = greater database utilisation.
- High database utilisation = good throughput from the application server.
- Low database utilisation = some bottleneck in the application server is throttling throughput through to the database.
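A worked example of the ratio above, with invented sample figures: average database load is just db time divided by the wall clock time over which it accumulated, i.e. the average number of active sessions.

```java
// Average database load = db time / wall clock time (sample figures invented).
public class DbLoad {

    static double averageLoad(double dbTimeSeconds, double wallClockSeconds) {
        return dbTimeSeconds / wallClockSeconds;
    }

    public static void main(String[] args) {
        // 120 seconds of db time accumulated over a 60 second interval means
        // that, on average, two sessions were active in the database.
        System.out.println(averageLoad(120.0, 60.0)); // prints 2.0
    }
}
```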

Page 68: J2EE Batch Processing

68

How The db time Model Should Help

- If, to begin with, the CPU usage on the application server is high and the db time expended in the database is low, this would imply some sort of bottleneck in the application server.
- If a bottleneck is addressed in the application server and db time goes up, methods for reducing the db time should then be looked at.

Page 69: J2EE Batch Processing

69

Identifying Performance Bottlenecks

How do we know where the bottleneck is?
- The Tivoli Performance Viewer EJB summary report is a good place to start.
- In the example screen shot on the next slide, the total time expended by the batch manager session bean can be compared to the sum total time expended by the dbaccess module beans.
- Separate beans for accessing the database not only separate the integration layer access from the business logic, but help with performance tuning.

Page 70: J2EE Batch Processing

70

Identifying Bottlenecks

[Screen shot: Tivoli Performance Viewer EJB summary report]

Page 71: J2EE Batch Processing

71

Identifying Bottlenecks

From the screen shot on the previous slide (ScheduleManager is not associated with the batch processes):
- Batch manager bean time = 429,276,448
- Time spent in dbaccess beans = 1,737,440
- Db access time as a % of the total = 0.40%
- The bottleneck might be on the application server!

There is also an EJB method summary report for drilling down further.
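The slide's arithmetic can be reproduced directly; the figures are the ones read off the Tivoli Performance Viewer report above.

```java
// Share of total bean time spent in the dbaccess beans, using the figures
// from the Tivoli Performance Viewer slide.
public class BeanTimeShare {

    static double percent(double part, double total) {
        return 100.0 * part / total;
    }

    public static void main(String[] args) {
        double batchManagerTime = 429_276_448.0; // batch manager session bean
        double dbAccessTime     = 1_737_440.0;   // sum over the dbaccess beans
        // Roughly 0.40% of the time is spent in the database access layer,
        // pointing the finger at the application server side.
        System.out.printf("%.2f%%%n", percent(dbAccessTime, batchManagerTime));
    }
}
```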

Page 72: J2EE Batch Processing

72

The ‘Carrot’ Model

Documents the thread usage in a J2EE application server’s generic components:
- HTTP Server
- Web Container
- EJB Container (driven by the number of active ORB threads)
- JDBC Connection Pool
- Database

Page 73: J2EE Batch Processing

73

The ‘Carrot’ Model

Typically, utilisation should be high towards the ‘front’ of the application server (the HTTP server) and gradually dwindle off towards the end (the database); hence the ‘carrot’ analogy. Unless, that is, the application architecture is similar to the Microsoft Pet Store .Net versus J2EE benchmark, i.e. there is little business logic outside the database.

Page 74: J2EE Batch Processing

74

The ‘Carrot’ Model

- In summary, most of the load on the software stack will be carried by the J2EE application server.
- Measuring the CPU on both the J2EE application and Oracle database servers will show how well the ‘carrot’ model applies to our architecture and design.

Page 75: J2EE Batch Processing

75

The ‘Carrot’ Model

[Chart: “J2EE Component Utilisation”. Threads used (0 to 200) by component: HTTP Server, Web Container, ORB threads, JDBC Connection Pool, Database Sessions.]

Page 76: J2EE Batch Processing

76

Software Configuration Base Line

Page 77: J2EE Batch Processing

77

Oracle Initialisation Parameters

commit_write                    BATCH, NOWAIT
cursor_sharing                  SIMILAR
cursor_space_for_time           TRUE
db_block_size                   8192
db_flashback_retention_target   999999
log_archive_max_processes       4
open_cursors                    65535
optimizer_index_cost_adj        100
optimizer_dynamic_sampling      1
optimizer_index_caching         0
pga_aggregate_target            4294967296
processes                       500
query_rewrite_enabled           TRUE
session_cached_cursors          100
sga_max_size                    5368709120
sga_target                      4697620480
statistics_level                TYPICAL
undo_management                 AUTO
undo_retention                  691200
undo_tablespace                 UNDO
workarea_size_policy            AUTO

Page 78: J2EE Batch Processing

78

WebSphere Configuration

Server JVM: -server -Xms2000m -Xmx2500m

Client JVM: -client -Xms200m -Xmx500m

JDBC Connection Pool: Min connections 100, Max connections 100

ORB configuration: Min threads 100, Max threads 100, JNI reader thread pool set to 100, Fragment size set to 3000

Page 79: J2EE Batch Processing

79

Application Configuration

Threads per batch process 100

Jobs per thread 100

Log4j logging level INFO

Page 80: J2EE Batch Processing

80

Notes On Oracle Parameter Settings

Cursor management has a major impact on the scalability of applications that use Oracle.

With this in mind, cursor_sharing, session_cached_cursors and cursor_space_for_time have all been explicitly set.

"Designing applications for performance and scalability" has some salient points regarding these parameters, which will be covered in the next few slides.

Page 81: J2EE Batch Processing

81

Notes On Oracle Parameter Settings

A separate JTS transaction per job results in heavy usage of the Oracle log buffer and its associated synchronisation mechanisms. The redo allocation latch is a unique point of serialisation within the database engine, therefore the log buffer needs to be used with care.

Asynchronous and batched commit writes were introduced for this purpose; they help to prevent log file sync waits.

Page 82: J2EE Batch Processing

82

Tuning

Page 83: J2EE Batch Processing

83

Disclaimer

Tuning efforts of different projects will yield different results from those detailed here due to differences in the:-
 Software stack component versions, e.g. using Oracle 10.1 and not 10.2, WebSphere 6.0 or 7.0 and not 6.1, 64 bit WebSphere and not 32 bit
 Software stack component vendors, e.g. you may be using WebLogic or JBoss, and DB2 instead of Oracle
 J2EE application server and database server topology
 J2EE and database initialisation parameters
 Application architecture, design and coding
 Server hardware
 Data
 Etc . . .

Page 84: J2EE Batch Processing

84

Disclaimer

Despite all the reasons as to why your results might vary from those presented, the technical precepts behind what has been done should hold true for more than just the application tested here.

Page 85: J2EE Batch Processing

85

A Note On The Results

The tuning efforts were mainly focussed on tuning the software stack from an environment perspective.

In practice there were many more 'tweaks' made than those presented here; the optimisations have been distilled down to those which made the greatest impact.

Despite this, the biggest performance and scalability gains often come from:-
 The architecture
 The design
 Coding practices used

Page 86: J2EE Batch Processing

86

A Note On The Results

The next set of findings relate to the most ubiquitous type of batch process in our software.

This is a batch process that:-
 retrieves a list of jobs from the database
 partitions the jobs into 'chunks'
 invokes beans in the application server via child threads, with these 'chunks' attached as objects
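The chunking and fan-out described above can be sketched with a plain ExecutorService. This is an illustrative outline, not the project's actual code: the class name, chunk size and processChunk stand-in are hypothetical, and the pool size reflects the 16 to 32 thread range that Finding 2 arrives at.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedBatchDriver {

    // Partition the job list into fixed-size chunks.
    public static <T> List<List<T>> partition(List<T> jobs, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < jobs.size(); i += chunkSize) {
            chunks.add(jobs.subList(i, Math.min(i + chunkSize, jobs.size())));
        }
        return chunks;
    }

    // Stand-in for the remote/local bean invocation with the chunk attached.
    public static void processChunk(List<Integer> chunk) {
        // the real batch process would invoke a session bean here
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> jobs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) jobs.add(i);

        // A bounded pool: far fewer threads than jobs.
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (List<Integer> chunk : partition(jobs, 100)) {
            pool.submit(() -> processChunk(chunk));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

The driver thread stays free while the pool drains the chunks; throughput is then governed by the pool size rather than the job count.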

Page 87: J2EE Batch Processing

87

Finding 1: pass by copy overhead

Symptom
 db time, database load and CPU utilisation on the database server were all low.
 CPU utilisation on the application server was at 100%.
Root cause
 Database access beans were invoked by remote method calls.
Action
 Set pass by reference to 'On' on the Object Request Broker.
Result
 Elapsed time 01:19:11 -> 00:41:58
 WebSphere CPU utilisation 96% -> 66%
 Db time / avg sessions 23470 / 4.1 -> 40071 / 14.5

Page 88: J2EE Batch Processing

88

Finding 2: threading

Symptom
 High db time and database load.
 High CPU time attributed to the com.ibm.ws.util.ThreadPool$Worker.run method (visible via a Java profiler).
Root cause
 Batch process threading set too high: 100 threads for a 4 CPU box !!!

Page 89: J2EE Batch Processing

89

Finding 2: threading

Action
 Lower the number of threads; the optimum was between 16 and 32, depending on the individual batch process.
Result (threads 100 -> 32)
 Elapsed run time 00:41:58 -> 00:36:45
 Db time / avg sessions 40071 / 14.5 -> 21961 / 8.9
 WebSphere CPU utilisation 66% -> 73%

Page 90: J2EE Batch Processing

90

Finding 3: db file sequential read overhead

Symptom
 The "db file sequential read" wait event accounted for 73.6% of total call time.
Root cause
 Job by job processing = heavy index range scanning.
Action
 Compress the most heavily used indexes.
Result
 Elapsed run time 00:36:45 -> 00:36:38
 Db time / avg sessions 21961 / 8.9 -> 9354 / 3.6
 WebSphere CPU utilisation 73% -> 74%

Page 91: J2EE Batch Processing

91

Finding 4: Physical read intensive objects

Symptom
 ADDM advised that there were physical read intensive objects.
Root cause
 With a batch process the same data is rarely read twice, except for standing / lookup data.
Action
 'Pin' hot objects into a 'keep' pool configured in the db cache.
Result
 Elapsed run time 00:36:38 -> 00:26:36
 Db time / avg sessions 9354 / 3.6 -> 4105 / 2.3
 WebSphere CPU utilisation 74% -> 87%

Page 92: J2EE Batch Processing

92

Finding 5: Server JVM heap configuration and ergonomics

Symptom
 Major garbage collections take place once a minute.
Root cause
 Heap incorrectly configured.
Action
 Tune the JVM parameters.
Result
 Elapsed run time 00:26:36 -> 00:25:01
 Db time / avg sessions 4105 / 2.3 -> 3598 / 2.4
 WebSphere CPU utilisation 87% -> 86%

Page 93: J2EE Batch Processing

93

Finding 5: Server JVM heap configuration and ergonomics

The most effective JVM parameter settings were found to be those used by IBM in a WebSphere 6.1 benchmark on Solaris submitted to SPEC.

This resulted in one major garbage collection every 10 minutes.

Minimum heap size = 2880 MB
Maximum heap size = 2880 MB
initialHeapSize="2880" maximumHeapSize="2880" verboseModeGarbageCollection="true"
-server -Xmn780m -Xss128k -XX:-ScavengeBeforeFullGC -XX:+UseParallelGC -XX:ParallelGCThreads=24 -XX:PermSize=128m -XX:MaxTenuringThreshold=16 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseParallelOldGC

Page 94: J2EE Batch Processing

94

Finding 5: Server JVM heap configuration and ergonomics

Usage of the JVM configuration from the IBM benchmark came after a lot of testing and experimentation via trial and error.

The Sun JVM tuning material supports this approach.

The heap is probably oversized for our requirements, but for a “first cut” at getting the configuration correct it is not a bad start.

Page 95: J2EE Batch Processing

95

Finding 6: Client JVM heap configuration and ergonomics

Symptom
 Major garbage collections take place more than once a minute.
Root cause
 Heap incorrectly configured.
Action
 Tune the JVM parameters.
Result
 Elapsed run time 00:25:01 -> 00:24:20
 Db time / avg sessions 3598 / 2.4 -> 3704 / 2.5
 WebSphere CPU utilisation 86% -> 86%

Page 96: J2EE Batch Processing

96

Finding 6: Client JVM heap configuration and ergonomics

Client JVM configuration

JVM options: -server -Xms600m -Xmx600m -XX:+UseMPSS -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:MaxTenuringThreshold=3 -XX:SurvivorRatio=2 -Xss128k -Dcom.ibm.CORBA.FragmentSize=3000 -Dsun.rmi.dgc.client.gcInterval=4200000 -Dsun.rmi.dgc.server.gcInterval=4200000

Server diagnostic trace turned off

Page 97: J2EE Batch Processing

97

Finding 6: Database Block Size

Symptom
 Significant latching around the db cache.
Root cause
 Block size too small.
Action
 Increase the block size from 8K to 16K.
 Larger block size = fewer index leaf blocks = fewer index branch blocks = smaller indexes = less physical and logical IO; less logical IO = less latching.
Result
 Elapsed run time 00:24:20 -> 00:21:25
 Db time / avg sessions 3704 / 2.5 -> 2623 / 2
 WebSphere CPU utilisation 86% -> 93%

Page 98: J2EE Batch Processing

98

Finding 7: JVM aggressive optimizations

Symptom
 No symptom as such; load still on the application server.
Root cause
 N/A
Action
 Further experimentation with the server JVM options resulted in aggressive optimizations being used.
Result
 Elapsed run time 00:21:25 -> 00:18:36
 Db time / avg sessions 2623 / 2 -> 2516 / 2.1
 WebSphere CPU utilisation 93% -> 85%

Page 99: J2EE Batch Processing

99

Finding 7: JVM aggressive optimizations

-XX:+AggressiveOpts had to be used with -XX:+UnlockDiagnosticVMOptions -XX:-EliminateZeroing, otherwise the application server would not start up !!!

The following excerpt from the Java Tuning White Paper should be heeded:-

“Enables a technique for improving the performance of uncontended synchronization. An object is "biased" toward the thread which first acquires its monitor via a bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled; some applications with certain patterns of locking may see slowdowns, though attempts have been made to minimize the negative impact.”

Page 100: J2EE Batch Processing

100

A Note On The Results

The other type of batch process in our software involved reading from and writing to files, after the contents of the files / database tables had been validated against standing data.

This type of batch process was highly ‘Chatty’ by design.

Page 101: J2EE Batch Processing

101

Tuning Finding: ‘Chatty’ Batch Process Design

Symptom
 Low CPU usage on the WebSphere server.
 Low CPU usage on the database server.
Root cause
 An Oracle stored procedure was called to validate each record field in the file records being read and written: performance death by network round trips !!!
Action
 Modify the code to perform validation using pure Java code against standing data cached within the application server.
Results
 See next slide.

Page 102: J2EE Batch Processing

102

Tuning Finding: ‘Chatty’ Batch Process Design

Finding: excessive calls to Oracle stored procedures

Results

Validation Method  Lines In File  Threads  Run Time (mm:ss)  % Improvement Over PL/SQL  WebSphere CPU  Oracle CPU
PL/SQL             15000          8        02:18             NA                         68             60
Java               15000          8        01:31             34%                        77             68
Java               15000          4        01:48             24%                        51             56
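The validation rework behind these numbers can be sketched as below: standing data is loaded once per run and each record field is checked in-process, instead of one stored procedure round trip per field. The class and method names are hypothetical, not taken from the project.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of caching standing / lookup data inside the
// application server so that field validation is a local map lookup
// rather than a network call to an Oracle stored procedure.
public class StandingDataCache {

    private final Map<String, Set<String>> validValuesByField = new ConcurrentHashMap<>();

    // In the real application this would be populated from the database
    // once at batch start-up, e.g. via a single bulk SELECT per field.
    public void load(String field, Set<String> validValues) {
        validValuesByField.put(field, validValues);
    }

    // Pure-Java validation: no network round trip per record field.
    public boolean isValid(String field, String value) {
        Set<String> valid = validValuesByField.get(field);
        return valid != null && valid.contains(value);
    }
}
```

With 15,000 lines of several fields each, this turns tens of thousands of round trips into one bulk load, which is consistent with the run-time improvement in the table above.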

Page 103: J2EE Batch Processing

103

Other Findings

With some batch processes "cursor: pin S" wait events were observed; these accounted for up to 7.2% of total call time.

Investigating this led me to the fact that from 10.2.0.3.0 onwards the library cache pin has been replaced by a mutex.

In 11g even more of what were library cache latches have been replaced with mutexes.

This is notable because one of the ways of comparing the scalability of different tuning efforts is to measure and compare latching activity.

Page 104: J2EE Batch Processing

104

Tuning Results Summary

Page 105: J2EE Batch Processing

105

Types Of Batch Processes

The following graphs capture these statistics for an 'atypical batch process' that has had all the tuning recommendations applied:-
 the average percentage CPU usage
 db time
 elapsed time

Page 106: J2EE Batch Processing

106

Batch Elapsed Time

[Chart: Batch Elapsed Time, time in seconds (0 to 1200) against thread counts of 4, 8, 16 and 32]

Page 107: J2EE Batch Processing

107

Batch DB Time

[Chart: Batch DB Time, db time (0 to 5000) against thread counts of 4, 8, 16 and 32]

Page 108: J2EE Batch Processing

108

Batch average db load

[Chart: average db load in sessions (0 to 7) against thread counts of 4, 8, 16 and 32]

Page 109: J2EE Batch Processing

109

Server % CPU Utilisation / Thread Count

[Chart: % Database CPU Usage and % App Server CPU Usage (0 to 90) against thread counts of 4, 8, 16 and 32]

Page 110: J2EE Batch Processing

110

Critique Of Tools Used

Page 111: J2EE Batch Processing

111

Critique Of Tools Used

Oracle 10g db time model
 This worked very well for measuring database utilisation.
 It does not, however, give any indication of how heavy utilisation is compared to the total capacity that the database tier can provide.
 Both the Oracle diagnostics and tuning packs need to be licensed in order to use the tools that accompany the time model, namely ADDM and the workload repository.
 These extra options are not cheap.
 The "ASH Masters" provide a low cost alternative to the 10g performance infrastructure.

Page 112: J2EE Batch Processing

112

Critique Of Tools Used

JProfiler (Java profiler)
 Provides detailed information on:-
  Heap usage
  Thread lock monitor usage
  CPU usage, at method, class, package and bean level
  JDBC usage
  CPU profiling with drill downs all the way to JDBC calls
  JNDI lookup activity
 Worked well for:-
  highlighting the RMI pass by copy overhead
  diagnosing an earlier issue whereby a 'singleton' object was being created thousands of times, resulting in excessive CPU and heap usage

Page 113: J2EE Batch Processing

113

Critique Of Tools Used

JProfiler:-
 Used on the grounds that:-
  It was extremely easy to configure
  It attached to the JVM of WebSphere 6.1
  Other products were more suited to JSE program profiling
  Some profilers could not attach to the WebSphere JVM, or could, but not that of version 6.1
  Other profilers came with unwieldy proprietary IDEs that we did not require
 Had a 100% performance overhead on the application server and should therefore not be used on production environments.
 kill -3 can be used to generate thread dumps, the "poor man's profiler" according to some; this is much less intrusive than using a full blown Java profiler.
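An even lighter weight option, which the standard library itself provides, is to walk the live threads' stacks in-process. A minimal sketch (class name hypothetical):

```java
import java.util.Map;

// A minimal in-process equivalent of the kill -3 thread dump: iterate
// every live thread's stack via java.lang.Thread. Useful as a cheap
// "poor man's profiler" when you cannot signal the JVM from outside.
public class ThreadDumper {

    public static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            sb.append('"').append(e.getKey().getName()).append('"')
              .append(" state=").append(e.getKey().getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Calling dump() periodically and diffing the hot frames gives a crude sampling profile with none of the attach overhead of a full profiler.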

Page 114: J2EE Batch Processing

114

Critique Of Tools Used

Tivoli Performance Monitoring Infrastructure (PMI)
 Comes with a number of summary reports, of which the EJB report was particularly useful.
 If too many data points are graphed, the PMI viewer can become painfully slow.
 Turning some data points on can have a major impact on performance.
 One project member used the WebSphere PerfServlet to query PMI statistics and graph them using Big Brother and round robin graphing.

Page 115: J2EE Batch Processing

115

Critique Of Tools Used

WebSphere performance advisor
 The only useful information it provided was regarding turning off the diagnostic trace service.
 Relies on PMI data points being turned on in order to generate 'useful' advice.
 Turning some data points on can have a detrimental effect on performance, to reiterate what was mentioned on earlier slides.
 Perhaps more useful when running WebSphere with the IBM JVM, as this is more tightly integrated into the performance monitoring infrastructure than the Sun JVM.

Page 116: J2EE Batch Processing

116

Conclusions

Page 117: J2EE Batch Processing

117

Bottlenecks In Distributed Object Architectures

This relates to Martin Fowler's "First Law of Distributed Object Design": don't distribute your objects.

Even if remote interfaces are used and the beans are deployed to a WebSphere application server in a single node configuration, the pass by copy overhead is still considerable.

Page 118: J2EE Batch Processing

118

Bottlenecks In Distributed Object Architectures

WebSphere application server provides a "quick win" for this situation in the form of the Object Request Broker pass by reference setting.
 !!!! CAUTION !!!! This should not be used when the invoking beans assume that the objects they pass will not have been altered by the invoked beans.

For scale out, prefer shared nothing architectures, as per this article from Sun.

WebSphere Network Deployment uses a shared nothing architecture.
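The caution above can be demonstrated with a plain local call, which is exactly what pass by reference turns a remote invocation into. In this hypothetical sketch, a bean method mutates its argument and the caller sees the change; under pass by copy the caller's list would have been untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrates the ORB pass by reference hazard: with the setting on, a
// "remote" call behaves like this local one, so mutations made by the
// invoked bean are visible to the caller.
public class PassByReferenceHazard {

    // Stand-in for a bean method that (perhaps unintentionally) mutates its argument.
    public static void invokedBean(List<String> jobs) {
        jobs.clear();          // with pass by reference, the caller's list is emptied
        jobs.add("REWRITTEN");
    }

    public static void main(String[] args) {
        List<String> jobs = new ArrayList<>(Arrays.asList("job-1", "job-2"));
        invokedBean(jobs);          // local call == pass by reference semantics
        System.out.println(jobs);   // prints [REWRITTEN]
    }
}
```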

Page 119: J2EE Batch Processing

119

Tuning Multi Tiered Applications

When multiple layers and tiers are involved, an all encompassing approach needs to be taken to tuning the software stack:-
 Tuning the database in isolation may not result in the performance and scalability goals being met.
 Tuning the J2EE application in isolation may not result in the performance and scalability goals being met.
 Refer to "Why you can't see your real performance problems" by Cary Millsap.

Page 120: J2EE Batch Processing

120

Tuning Multi Tiered Applications

Bottlenecks need to be identified and targeted wherever they exist in the application stack.

A prime example of this is that the impact of database tuning would have been negligible had the pass by copy bottleneck not been addressed.

Page 121: J2EE Batch Processing

121

Threading

A given hardware platform can only support a finite number of threads.

There will be a “sweet spot” at which a given number of threads will give the best throughput for a given application on a given software stack.

Past a certain threshold, the time spent on context switching, thread synchronization and waiting on contention within the database, will result in diminishing returns from upping the thread count.

Page 122: J2EE Batch Processing

122

Avoid ‘Chatty’ Designs

'Chatty' ???
 Yes, designs that can result in excessive chatter between certain components.
 This can be particularly bad when there is a network involved.
 "Designing and Coding Applications for Performance and Scalability" by IBM recommends putting processing closest to the resource that requires it (section 2.5.9).

Page 123: J2EE Batch Processing

123

Avoid ‘Chatty’ Designs

A subtly different angle on this is that 'Chatty' designs should be avoided:-
 Specifically, avoid designs that incur frequent network round trips between the database and the application server.
 Tuning finding 3 supports this.

Page 124: J2EE Batch Processing

124

Avoid ‘Chatty’ Designs

Low CPU consumption on both the application server and the database server could be a sign of 'Chatty' software, i.e. excessive calls to the database, making network round trips the bottleneck.

Perform processing exclusively within the application server where possible, but not when there are database features available specifically for carrying this work out.

Page 125: J2EE Batch Processing

125

Avoid ‘Chatty’ Designs

Operations that involve significant bulk data manipulation should be done in the database.

Always look to minimise network round trips by leveraging:-
 Stored procedures
 Array interfaces, both in Oracle and the JDBC API
 Tuning the JDBC fetch size
 Inline views
 Merge statements
 Subquery factoring
 SQL statement consolidation
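Two of these round-trip reducers, the JDBC batch (array) interface and the fetch size, can be sketched as follows. The table and column names are illustrative only; in the application the connection would come from the server's JDBC pool.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

// Sketch of reducing network round trips with standard JDBC:
// batched writes and a larger fetch size for reads.
public class RoundTripReduction {

    public static void bulkInsert(Connection conn, List<String> names) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO jobs (name) VALUES (?)")) {
            for (String name : names) {
                ps.setString(1, name);
                ps.addBatch();      // queue locally instead of one trip per row
            }
            ps.executeBatch();      // send the whole batch in one round trip
        }
    }

    public static void bulkRead(Connection conn) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("SELECT name FROM jobs")) {
            ps.setFetchSize(500);   // fetch many rows per trip (the Oracle driver default is 10)
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rs.getString(1);
                }
            }
        }
    }
}
```

Both techniques trade a little client-side memory for far fewer trips across the network, which is exactly the cost that 'Chatty' designs pay per row.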

Page 126: J2EE Batch Processing

126

Avoid ‘Chatty’ Designs

'Chatty-ness' can be a problem within the application server also:-
 There are two vertical layers of domain (business) logic within the application which are invariably called together.
 These could be consolidated into one layer, with the benefits of:-
  Code path length reduction
  Allowing for SQL statement consolidation
 This has not been addressed to date, as all of our performance goals have been achieved without having to carry this work out.

Page 127: J2EE Batch Processing

127

JVM Tuning

The Java Virtual Machine is a platform in its own right, therefore it deserves a certain amount of attention when it comes to tuning.

When using the Sun JVM, use the appropriate garbage collection 'ergonomics' for your application.

As per some of Sun's tuning material, there can be an element of trial and error in JVM tuning.

Use verbose garbage collection output to minimise major garbage collections.

Look at what tuning experts have done on your platform in the past to get ideas; www.spec.org is not a bad place to look, as per the example used in this material.

Page 128: J2EE Batch Processing

128

Row by Row Processing Scalability and Performance

There was great concern over the row by row access to the persistence layer.
 However, a bottleneck is only an issue if it prevents performance goals from being achieved.
 It would be interesting to find the level of application server throughput required to make the database become the bottleneck.
 This would require more application server instances, i.e. WebSphere Network Deployment.

Page 129: J2EE Batch Processing

129

Is The Database The Bottleneck ?

db time does not help when measuring resource usage and time spent in the database relative to the total available capacity.

However, as we have gone from 14.5 to 2.5 in terms of average database load (db time / elapsed time), we can infer that:-
 An average load of 2.5 sessions suggests that the database is not the bottleneck.
 There is ample spare resource capacity on the database tier.

This conforms to the 'Carrot' model.

Page 130: J2EE Batch Processing

130

Is The Database The Bottleneck ?

Parsing was raised as a concern; the % Non-Parse CPU figure in the "Automatic Workload Repository" excerpt on the next slide will dispel this.

This report was captured whilst running an atypical batch process with all the tuning changes applied and 32 threads.

The "Parse CPU to Parse Elapsd" ratio is not optimal; however, as parse CPU accounts for only a small share of total CPU (% Non-Parse CPU is 94.13), this is not a major concern.

Page 131: J2EE Batch Processing

131

Is The Database The Bottleneck ?

Buffer Nowait %:              99.99    Redo NoWait %:      100.00
Buffer Hit %:                 99.33    In-memory Sort %:   100.00
Library Hit %:                99.99    Soft Parse %:        99.99
Execute to Parse %:           91.14    Latch Hit %:         99.91
Parse CPU to Parse Elapsd %:  24.76    % Non-Parse CPU:     94.13

Page 132: J2EE Batch Processing

132

There Is Always A Bottleneck

In all applications there are always performance and scalability bottlenecks:-
 A J2EE application server will usually be bound by CPU capacity and memory access latency, from a pure resource usage point of view.
 A relational database will usually be constrained by physical and logical IO.
 In the J2EE world where a database is used for persistence, tuning will involve moving the bottleneck between the application server and the database.

Page 133: J2EE Batch Processing

133

Useful Resources

IBM resources
 Designing and Coding Applications For Performance and Scalability in WebSphere Application Server
 WebSphere Application Server V6 Performance and Scalability Handbook
 IBM WebSphere Application Server V6.1 on the Solaris 10 Operating System

Page 134: J2EE Batch Processing

134

Useful Resources

IBM WebSphere Compute Grid resources
 WebSphere Extended Deployment Compute Grid
 Executing Batch Programs In Parallel With WebSphere Extended Deployment Compute Grid
 Compute Grid Run Time
 Compute Grid Applications
 Swiss Re Use Of Compute Grid
 Compute Grid Discussion Forum

Links provided courtesy of Snehal Antani of IBM.

Page 135: J2EE Batch Processing

135

Useful Resources

Sun Resources
 Albert Leigh's Blog
 Dileep Kumar's Blog
 Scaling Your J2EE Applications Part 1
 Scaling Your J2EE Applications Part 2
 Java Tuning White Paper
 J2SE and J2EE Performance Best Practices, Tips And Techniques

Page 136: J2EE Batch Processing

136

Useful Resources

Oracle Resources
 Oracle Real World Performance Blog
 360 Degree DB Programming Blog
 Oracle Technology Network JDBC Resources
 Designing Applications For Performance And Scalability - An Oracle White Paper
 Best Practices For Developing Performant Applications

Page 137: J2EE Batch Processing

137

Useful Resources

Other resources
 Standard Performance Evaluation Corporation (SPEC) jAppServer 2004 Results
 JProfiler