CA Workload Automation (DE) Internals and Troubleshooting
Lee Stecklov
Terms of This Presentation
This presentation was based on current information and resource allocations as of October
2009 and is subject to change or withdrawal by CA at any time without notice.
Notwithstanding anything in this presentation to the contrary, this presentation shall not serve
to (i) affect the rights and/or obligations of CA or its licensees under any existing or future
written license agreement or services agreement relating to any CA software product; or (ii)
amend any product documentation or specifications for any CA software product. The
development, release and timing of any features or functionality described in this presentation
remain at CA's sole discretion. Notwithstanding anything in this presentation to the contrary,
upon the general availability of any future CA product release referenced in this presentation,
CA will make such release available (i) for sale to new licensees of such product; and (ii) to
existing licensees of such product on a when and if-available basis as part of CA maintenance
and support, and in the form of a regularly scheduled major product release. Such releases
may be made available to current licensees of such product who are current subscribers to CA
maintenance and support on a when and if-available basis. In the event of a conflict between
the terms of this paragraph and any other information contained in this presentation, the
terms of this paragraph shall govern.
2 October 13-16 2009 CA Workload Automation (DE) Internals and Troubleshooting Copyright © 2009 CA
For Informational Purposes Only
Certain information in this presentation may outline CA's general product direction. All
information in this presentation is for your informational purposes only and may not be
incorporated into any contract. CA assumes no responsibility for the accuracy or completeness
of the information. To the extent permitted by applicable law, CA provides this document “as
is” without warranty of any kind, including without limitation, any implied warranties of
merchantability, fitness for a particular purpose, or non-infringement. In no event will CA be
liable for any loss or damage, direct or indirect, from the use of this document, including,
without limitation, lost profits, lost investment, business interruption, goodwill, or lost data,
even if CA is expressly advised of the possibility of such damages.
Abstract
> CA dSeries Workload Automation is a Java-based, client-server,
cross-platform scheduler consisting of three components: the
CA dSeries Workload Automation server, Agents, and the
Desktop Client. In addition to the three-way communication
among these components, the server communicates with a
relational database system through JDBC. This presentation
discusses the server's architectural components, the life
cycle of an application, environmental factors, and
troubleshooting.
Architecture: background
> The Server
Pure Java (1.6 except AIX – 1.5)
Installed and started by any user
Launched by a shell script on Unix/Linux, or as a service or from the
command line on Windows
CA WA Architectural diagram – Server
[Diagram not reproduced; the only label recovered is CLI.]
Architecture: Server
> Design
Ground-breaking Object-Oriented design written in Java.
– Object-oriented design allows flexibility and adaptability.
– Multi-threaded, using Java threading
– JDBC connectors for the database, with enhanced caching
– Java RMI interface for Graphical Client and Command Line
– SOAP Web Services interface
– Single instance or High Availability
Architecture: Server
The Result: Highly scalable, high performance scheduling
server.
Benchmarking 4 million jobs per day
Number of generations triggered: 38,098
Number of jobs run in a day: 3,809,800
Components
CA WA Server R11.1 SP1 (Build 58) with default system agent
Database (Oracle 11g)
30 Unix Agents (physical)
30 Unix Agents (physical)
20 Unix Agents (physical)
20 Unix Agents (physical)
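As a quick sanity check on the benchmark figures above, the arithmetic can be worked through in a few lines of Python. This is only a sketch: the variable names are illustrative, and the even spread across 24 hours is an assumption, since the slide does not say how the load was distributed over the day.

```python
# Figures taken from the benchmark slide.
generations_triggered = 38_098
jobs_run_per_day = 3_809_800

# Average number of jobs in each Application generation.
jobs_per_generation = jobs_run_per_day / generations_triggered

# Average rate, assuming (hypothetically) an even spread over 24 hours.
seconds_per_day = 24 * 60 * 60
jobs_per_second = jobs_run_per_day / seconds_per_day

print(f"jobs per generation: {jobs_per_generation:.0f}")
print(f"average jobs per second: {jobs_per_second:.1f}")
```

The two figures are consistent with each generation containing 100 jobs, and the "4 million" headline is the daily total rounded up.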
Architecture: Server
> Performance considerations: Memory
The maximum Java heap size is fixed when the JVM starts
– -Xmx defines the maximum heap size, which is 1 GB by default. However, the
process only acquires the memory it needs
Typical memory profile for continuous scheduling:
[Chart: Memory usage of server process. Y axis: Memory (KB), 0 to 180,000. X axis: Time (minutes), 0 to 4,400.]
Architecture: Server
> Performance considerations: CPU
Java multi-threading lets the server make use of all available
CPUs
[Chart: CPU usage of server process. Y axis: % CPU usage, 0% to 40%. X axis: Time (minutes), 0 to 4,400.]
Architecture: background
> Database considerations
– Configuration, Schedule and run-time data stored in relational
database (Oracle / MSSQL / DB2)
– JDBC used for database connection
– High-volume, simple SQL: insert, update, delete
– To ensure persistence, all run-time data is saved in database
– Database requirements are straightforward: one tablespace
(Oracle), or one database (MSSQL).
Architecture: background
> Relational Database Schema
– Logically divided into Runtime, History and Definition tables
– Runtime tables are used as queue and persistence tables – they
are highly transactional.
– History tables record job history, less volatile but grow over
time
– Definition tables contain XML definitions saved in CLOB data
types.
Architecture: communications
> The Server
Communications: TCP/IP
– Server requires 4 LISTENING ports
– Manager
– Client
– RMI export
– RMI Registry
Architectural diagram – the network
[Diagram not reproduced. Labels recovered from the diagram:]
Server: Manager Port 7507, Client Port 7500, RMI Ports 7598/7599
Agent: Port 7520
Database: DB Port, e.g. 1521
Other labels: SNMP Receiver, CLI
Architecture: communications
> Server
Communications: TCP/IP
– All ports bind to default installation address
EXAMPLE:
tcp 0 0 simon.ca.com:58500 *:* LISTEN
tcp 0 0 simon.ca.com:58598 *:* LISTEN
tcp 0 0 simon.ca.com:58599 *:* LISTEN
tcp 0 0 simon.ca.com:58507 *:* LISTEN
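To check that all four server ports are listening, the netstat output can be compared against the ports you expect. The sketch below parses the sample shown above. The function name and the six-column layout are assumptions, and real netstat output varies by platform, so treat this as a starting point rather than a finished tool.

```python
# Sample netstat output copied from the slide.
NETSTAT_SAMPLE = """\
tcp 0 0 simon.ca.com:58500 *:* LISTEN
tcp 0 0 simon.ca.com:58598 *:* LISTEN
tcp 0 0 simon.ca.com:58599 *:* LISTEN
tcp 0 0 simon.ca.com:58507 *:* LISTEN
"""


def listening_ports(netstat_text):
    """Return the set of local ports that appear in LISTEN state."""
    ports = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed layout: proto, recv-q, send-q, local address, foreign address, state
        if len(fields) >= 6 and fields[-1].startswith("LISTEN"):
            port = fields[3].rsplit(":", 1)[-1]
            if port.isdigit():
                ports.add(int(port))
    return ports


# The four ports used by the server in the slide's example.
expected = {58500, 58507, 58598, 58599}
missing = expected - listening_ports(NETSTAT_SAMPLE)
print("missing ports:", sorted(missing))
```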
Architectural diagram – UI
[Diagram not reproduced; the only label recovered is CLI.]
Architecture: UI
> Desktop Client
– Based on the Eclipse graphical platform, pure Java
– Installs on the user's workstation
– Started by a Windows 'executable' stub
– Personal workspace is C:\Documents and Settings\%user%\workspace
– Can automatically fail over to the primary server
Architecture: UI
The Desktop Client
Architecture: UI
> Desktop Client
– Creates a Java RMI (Remote Method Invocation) interface to the
Server
– Primary Functions:
Define Workload
Monitor
Server Administration
Architecture: UI
> Web Server
– Monitor and Control workload
– Apache Tomcat
– Web UI
Architecture: Web Services
> The Web Services interface uses Apache Tomcat and Axis 1.
It uses Simple Object Access Protocol (SOAP) over HTTP or HTTPS
and is described with Web Services Description Language (WSDL)
You can use web services to invoke the following CA WA functions:
– Trigger an Event in addition to its usual schedule
– Bypass a scheduled Event
– Hold (postpone) an Event
– Release a held Event
– Suspend an Event from triggering
– Resume a suspended Event
– Replace an Event's next scheduled execution with a new time
Architecture: Application Life Cycle
> Scheduler thread
The scheduler initiates the Application Life Cycle.
– Loads the TDR (Time Driven Request), which is an XML
representation of the event.
– RUN.TEST2200910080000000017: this TDR is scheduled for
2009-10-08 00:00.
– Processes JavaScript, and sends the Application definition to the
Manager.
– Builds the new TDR, and modifies the Event
Architecture: Application Life Cycle
> Manager
The Manager builds, executes and tracks.
– Builds the job definitions into memory, executes JavaScript if
necessary
– Sends commands to Agent
– Receives and processes state information from Agent
– Persists states in RDBMS
– Persists relevant job information in RDBMS for history reporting
Architecture: Application Life Cycle
Manager completes application life cycle.
– If the application is complete, terminates the thread, updates
the database
– By querying the database, we can see how many applications are active. This is
what the desktop client does when subscribing.
– The same information can be seen in esp_wss_appl:
SQL> select count(*) from esp_wss_appl where STATUS=1;
COUNT(*)
----------
45
There are 45 active applications.
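The same count can be scripted. The sketch below runs the slide's query through Python's built-in sqlite3 module purely so that the example is self-contained. A real server uses Oracle, MSSQL or DB2, so you would substitute the matching database driver. Only the table name esp_wss_appl and the STATUS column come from the slide. The APPL_NAME column, the sample rows and the use of 0 for a non-active status are hypothetical.

```python
import sqlite3


def count_active_applications(connection):
    """Count rows in esp_wss_appl whose STATUS is 1 (active, per the slide)."""
    cursor = connection.cursor()
    cursor.execute("SELECT COUNT(*) FROM esp_wss_appl WHERE STATUS = 1")
    return cursor.fetchone()[0]


def make_demo_database():
    """Build an in-memory stand-in table; the schema and rows are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE esp_wss_appl (APPL_NAME TEXT, STATUS INTEGER)")
    conn.executemany(
        "INSERT INTO esp_wss_appl VALUES (?, ?)",
        [("PAYROLL.1", 1), ("BACKUP.7", 1), ("REPORTS.3", 0)],
    )
    conn.commit()
    return conn


demo = make_demo_database()
print("active applications:", count_active_applications(demo))
```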
Architecture: Job Life Cycle
> Job Life Cycle
Data flow is best demonstrated by the job life cycle.
– Jobs are built when the Application is created
– When a job has satisfied all its requirements (resources, time
constraints, predecessors, etc.), the server considers it 'runnable' and can
send the 'run' AFM (message) to the Agent
AGENT LSESP5 WINNT1/VERIFYD.444/MAIN RUN . Data(Command=C:\Batches\test.bat) MFUser(LEE)
– The Agent acknowledges receipt
– At this point, the job is in READY state on the server
Troubleshooting
> The most common problem
Job stays in READY state
– Telnet Smoke Test
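The slide does not spell out the smoke test. Presumably it means using telnet to confirm that a TCP port on the other side can be reached. A minimal Python equivalent is sketched below under that assumption. The function name is illustrative, and it only proves that something accepts connections on the port, not that the Agent or Server behind it is healthy.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `can_connect("agenthost.example.com", 7520)` run from the server machine checks the Agent port from the network diagram, and `can_connect("serverhost.example.com", 7507)` run from the Agent machine checks the Manager port. Both host names are placeholders.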
Troubleshooting
> Tracelog and AFM log
Tracelog
– Logs errors – termed exceptions in Java
– Various 'filter ids' correspond to server subsystems
– Filter ids can be turned on as needed (on the fly or
permanently)
Afmlog
– Logs AFM traffic (remote and internal)
– Can be used to watch agent traffic
Audit log
– Who did what and when
Troubleshooting
> Logging
Exceptions
– Basis of troubleshooting server problems.
– Exception logging includes call stacks
Searching Log Files
– Quick test (Unix or Windows using GNU grep):
grep -i exception tracelog.txt |egrep -v -i 'send|connect|ping|ack'
– Case-insensitive search for 'exception', excluding
exceptions that contain strings commonly found in agent
connection exceptions
20071228 18:06:00.974 [ID: 0] SS: org.mozilla.javascript.EvaluatorException: Property 0 not found.
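Where GNU grep is not available, the same filter can be approximated in Python. This is a sketch of the slide's grep pipeline, not a replacement for it. The function name is illustrative, and the two sample lines are copied from log excerpts in this presentation.

```python
import re

# Same patterns the slide passes to egrep -v -i.
NOISE = re.compile(r"send|connect|ping|ack", re.IGNORECASE)


def interesting_exceptions(lines):
    """Yield lines mentioning 'exception' that do not match the noise patterns."""
    for line in lines:
        if "exception" in line.lower() and not NOISE.search(line):
            yield line.rstrip("\n")


sample = [
    "20071228 18:06:00.974 [ID: 0] SS: org.mozilla.javascript.EvaluatorException: Property 0 not found.",
    "20071231 17:57:09.505 [ID: 0] HAC: Could not rollback the connection; "
    "the exception is Closed Connection - java.sql.SQLException: Closed Connection",
]
for hit in interesting_exceptions(sample):
    print(hit)
```

To scan a real log, pass an open file instead of the sample list, for example `interesting_exceptions(open("tracelog.txt", errors="replace"))`. As with the grep version, the patterns match substrings, so 'ack' also hides lines containing words such as "rollback" or "stack".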
Troubleshooting
> Logging…
Examining the exception stack
Troubleshooting
> Logging…
Example: JavaScript Exception stack
20071228 00:05:46.845 [ID: 0] SS: org.mozilla.javascript.EvaluatorException: Property 0 not found.
at org.mozilla.javascript.DefaultErrorReporter.runtimeError(Unknown Source)
at org.mozilla.javascript.Context.reportRuntimeError(Unknown Source)
at org.mozilla.javascript.Context.reportRuntimeError(Unknown Source)
at org.mozilla.javascript.Context.reportRuntimeError1(Unknown Source)
at org.mozilla.javascript.ScriptableObject.setAttributes(Unknown Source)
at cybermation.library.script.javascript.CybJSFlatObjectFactory.restore(CybJSFlatObjectFactory.java:593)
– In this example, the SS thread is throwing an exception
while evaluating JavaScript.
– A good starting point….
Troubleshooting
> Logging…
Example: SQL error Exception stack
20071231 17:57:09.505 [ID: 0] HAC: PRIMARY SQLException on update/insert -
java.sql.SQLException: Io exception: Connection reset by peer: socket write error
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:125)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:162)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:274)
….
20071231 17:57:09.505 [ID: 0] HAC: Could not rollback the connection; the exception is Closed Connection - java.sql.SQLException: Closed Connection
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:125)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:162)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:227)
at oracle.jdbc.driver.PhysicalConnection.rollback(PhysicalConnection.java:994)
at com.cybermation.espresso.rdbms.DBPooledConnectionWrapper.rollback(DBPooledConnectionWrapper.java:141)
at com.cybermation.espresso.rdbms.RelationalDatabaseManager.unconditionalRollback(RelationalDatabaseManager.java:382)
– High Availability thread reporting an I/O error on RDBMS
Troubleshooting
> Logging…
SQL I/O error example
– 911 Thread monitors critical errors
20071231 17:57:09.525 [ID: 0] 911-Service:
************************************************************
* The thread HAC reported a fatal server error.
* Server 911 service is initiating the shutdown procedure.
************************************************************
– The server is programmed to shut down on certain SQL errors.
– In this case, the cause could have been a simple network glitch or
the database being down. The actual error 'SQLCode: 17002
SQLState' was returned by the JDBC driver.
– This tells us that the problem occurred at the JDBC network
interface
Troubleshooting
> Logging..
Server monitoring
– 911 Thread responsible for server health
– Server state thread tracks memory and database
connection pools
20080103 14:37:03.754 [ID: 0] Thread-5: Esp Server State:
Memory:
Using 18 MB of Currently Allocated 38 MB. Total Free Memory is 998 MB of the Max Heap Size 1016 MB
Threads:
[Thread group=system;active groups=6;active threads=131]
[Thread group=main;active groups=4;active threads=119]
[Thread group=RMI Runtime;active groups=0;active threads=0]
Database Connections:
…..
In Use Connections:
Connection Wrapper[usageCount: 241, lastAccessTime: 2008-01-03 14:37:03.204, Used by Thread: RDBOutputMessageQueue_13]
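The memory line in the server state report is regular enough to parse, which is handy when plotting heap usage over time. The sketch below extracts the four figures from the line shown above. The pattern is inferred from this single sample, so it is an assumption that other builds word the line identically. The function and key names are illustrative.

```python
import re

# Pattern inferred from the one sample line in the slide.
MEMORY_LINE = re.compile(
    r"Using (?P<used>\d+) MB of Currently Allocated (?P<allocated>\d+) MB\. "
    r"Total Free Memory is (?P<free>\d+) MB of the Max Heap Size (?P<max_heap>\d+) MB"
)


def parse_memory_line(line):
    """Return the four MB figures as a dict of ints, or None if the line does not match."""
    match = MEMORY_LINE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


sample = (
    "Using 18 MB of Currently Allocated 38 MB. "
    "Total Free Memory is 998 MB of the Max Heap Size 1016 MB"
)
stats = parse_memory_line(sample)
print(stats)
print(f"heap in use: {100 * stats['used'] / stats['max_heap']:.1f}%")
```

In the sample, the free figure equals the maximum heap size minus the memory in use (1016 - 18 = 998), so a steadily shrinking free value is the number to watch.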
Troubleshooting
>Quick Log Searches
Using GNU grep or the Windows findstr command
– Track server memory usage
Unix: grep Memory tracelog.txt
Windows: findstr Memory tracelog.txt
– Track the progress of the SS thread:
grep SS: tracelog.txt
– Track one particular Application generation:
grep INITDIR.889 tracelog.txt
– Track High Availability Component:
grep HAC tracelog.txt
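For searches that need more structure than grep gives, each tracelog entry can be split into its fields. The sketch below assumes the layout seen in the excerpts in this presentation: a timestamp, an [ID: n] tag, a thread name, then the message. The meaning of the ID field is not stated in the slides, and the function and key names are illustrative.

```python
import re
from datetime import datetime

# Layout inferred from the log excerpts: "yyyymmdd hh:mm:ss.mmm [ID: n] thread: message"
LOG_LINE = re.compile(
    r"^(\d{8} \d{2}:\d{2}:\d{2}\.\d{3}) \[ID: (\d+)\] ([^:]+): ?(.*)$"
)


def parse_entry(line):
    """Split one tracelog line into fields; return None for continuation lines."""
    match = LOG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None  # e.g. an "at ..." stack frame
    stamp, entry_id, thread, message = match.groups()
    return {
        "time": datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f"),
        "id": int(entry_id),
        "thread": thread,
        "message": message,
    }


def entries_for_thread(lines, thread_name):
    """Return parsed entries whose thread field equals thread_name exactly."""
    parsed = (parse_entry(line) for line in lines)
    return [entry for entry in parsed if entry and entry["thread"] == thread_name]


sample = [
    "20071228 18:06:00.974 [ID: 0] SS: org.mozilla.javascript.EvaluatorException: Property 0 not found.",
    "at org.mozilla.javascript.Context.reportRuntimeError(Unknown Source)",
    "20071231 17:57:09.505 [ID: 0] HAC: PRIMARY SQLException on update/insert -",
]
for entry in entries_for_thread(sample, "HAC"):
    print(entry["time"], entry["message"])
```

Unlike `grep HAC`, which matches the string anywhere in a line, this compares the thread field only, so a message that merely mentions HAC is not returned.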
Troubleshooting
> Thread Dumps
Very helpful in case of server hangs
– A Java feature available on Unix/Linux only
– kill -3 pid_of_server creates a thread dump in
ESP_HOME/MonitorAndStatus/server.log (Linux/Solaris/HP) or
<ESP_HOME>javacorennnn.nnnn.txt (AIX).
– Thread dump contains call stacks for all active threads
– Capture 3 thread dumps 1-2 minutes apart
– Valuable tool for support / development to analyze
problems not found in log files.
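Capturing the three dumps can be scripted so that the spacing is consistent. The sketch below sends signal 3 (SIGQUIT) to the server process the requested number of times. It is Unix-only, like kill -3 itself. The function name and the 90-second default are assumptions chosen to fall inside the 1-2 minute range above, and the signal sender and sleep function are parameters only so that the logic can be exercised without a live server.

```python
import os
import signal
import time


def capture_thread_dumps(pid, count=3, interval_seconds=90,
                         send_signal=os.kill, sleep=time.sleep):
    """Send SIGQUIT (signal 3) to pid `count` times, pausing between sends."""
    for attempt in range(count):
        send_signal(pid, signal.SIGQUIT)  # equivalent to: kill -3 pid
        if attempt < count - 1:
            sleep(interval_seconds)
```

Call it as `capture_thread_dumps(pid_of_server)` from an account that is allowed to signal the server process. The dumps are written by the JVM to the locations listed above, not by this script.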
Thank You