pcp
TRANSCRIPT
2
Conclusions
• You don’t need RAC to use Parallel Concurrent Processing (PCP)!
• If you have PCP enabled, secondary nodes must be defined during the upgrade to R12.
• Tuning of TCP, SQL*Net and PMON parameters can minimize PCP failover time.
• Implement Failover Sensitive Workshifts.
3
Concurrent Processing Server
Allows scheduling of jobs: batch jobs, or Requests in Oracle terms. Processes concurrent programs as a Request. Requests can be grouped together into Request Sets. Different types of concurrent managers handle different types of requests. A concurrent program can be assigned to a responsibility, and that responsibility can be assigned to users, giving them permission to run the concurrent program.
Concurrent managers may have limits on the concurrent programs that can be run, and the times that they can be started. Requests have priorities, a status, and log and out files in the above directory.
4
Definitions
• CP => Concurrent Processing
• DCD => Dead Connection Detection
• ICM => Internal Concurrent Manager
• IM => Internal Monitor
• CRM => Conflict Resolution Manager
• PCP => Parallel Concurrent Processing
• PMON => Process Monitor for ICM
5
Concurrent Request
6
Phase and Status of Concurrent Requests
Phase      Status      Description - Action
Pending    Normal      The request is waiting to be picked up by the next available manager.
Pending    Standby     Waiting for the CRM to resolve a conflict. The CRM could be slow, or an incompatible program is running.
Running    Normal      The request is running normally.
Completed  Normal      The request has finished successfully.
Completed  Error       The request has finished with an error. Check the logs.
Completed  Warning     The request has finished with a warning. Check the logs.
Inactive   No Manager  The request won’t run without a manager. Specialization rules aren’t configured properly.
7
PCP Failover
[Diagram: PCP nodes RH7, RH8 and RH9 each connect as SQL*Net clients to the database listener on the DB node (RH8), which is configured through sqlnet.ora. TCP_KEEPALIVE takes 240 seconds before issuing DCD.]
8
Concurrent Managers
9
Concurrent Managers
Manager Type                 Service Instance                     Program
Internal Concurrent Manager  Internal Manager                     FNDLIBR
Conflict Resolution Manager  Conflict Resolution Manager          FNDCRM
Internal Monitor             Internal Monitor:Node                FNDIMON
                             Service Manager: Node                FNDSM
Concurrent Manager           Standard Manager                     FNDLIBR
Concurrent Manager           Inventory Manager                    INVLIBR
Concurrent Manager           Session History Cleanup              FNDLIBR
Concurrent Manager           PA Streamline Manager                PALIBR
Transaction Manager          CRP Inquiry Manager                  CYQLIB
Transaction Manager          FastFormula Transaction Manager      FFTM
Transaction Manager          PO Document Approval Manager         POXCON
Transaction Manager          Transaction Manager                  FNDTMTST
                             Scheduler/Prerelease Manager         FNDSVC
                             OAM Generic Collection Service:Node  FNDSVC
10
Concurrent Processing
1. The Concurrent Processing server communicates with the database using Oracle SQL*Net.
2. The concurrent program log or output file from a request is passed back as a report to the Report Review Agent.
3. The Report Review Agent passes a file containing the entire report to the forms server.
4. The Forms Services component passes the report back to the user’s browser one page at a time. Profile options can be used to control the size of the files and pages passed, to suit report volume and available network capacity.
[Diagram: Web Browser (Java interface, JInitiator), Web Server (HTML interface), Forms Server, Reports Server, Report Review Agent, SQL*Net, .rdx, Requests/Log/Out files, Service Manager (FNDSM), ICM (FNDLIBR), Internal Monitor (FNDIMON), Standard Manager (FNDLIBR), FNDCRM.]
11
Internal Concurrent Manager
• The Internal Concurrent Manager (ICM) starts, sets the number of active processes, monitors, and terminates all other concurrent processes through requests made to the Service Manager, including restarting any failed processes.
• The ICM also starts, stops, and restarts the Service Manager for each node.
• The ICM will perform process migration during an instance or node failure.
• The ICM will be active on a single node.
• This is also true in a PCP environment, where the ICM will be active on at least one node at all times.
12
Internal Concurrent Manager
• The ICM really does not have any scheduling responsibilities. It has NOTHING to do with scheduling requests, or deciding which manager will run a particular request. The function of the ICM is to run 'queue control' requests; requests to startup or shutdown other managers.
• The ICM is responsible for startup and shutdown of the whole concurrent processing facility, and it monitors the other managers periodically, and restarts them if they should go down. It can also take over the Conflict Resolution manager's job, and resolve incompatibilities.
• If the ICM itself should go down, requests will continue to run normally, except for 'queue control' requests. Restart the ICM with 'startmgr'; no need to kill the other managers first.
13
Internal Concurrent Manager
14
Service Manager
FNDSM process - Communicates with the Internal Concurrent Manager, Concurrent Manager, and non-Manager Service processes.
• The Service Manager (SM) spawns and terminates manager and service processes (these could be Forms or Apache Listeners, Metrics or Reports Servers, and any other process controlled through Generic Service Management).
• When the ICM terminates, the SM that resides on the same node as the ICM will also terminate.
• The SM is “chained” to the ICM. The SM will only reinitialize after termination when there is a function it needs to perform (start or stop a process), so there may be periods of time when the SM is not active, and this would be normal.
15
Service Manager
• All processes initialized by the SM inherit the same environment as the SM.
• The SM’s environment is set by the APPSORA.env file and the gsmstart.sh script.
• The apps_<sid> listener must be active on each CP node to support the SM connection to the local instance.
• There should be a Service Manager active on each node where a Concurrent or non-Manager service process will reside.
16
FNDSM Failure
FNDSM failover as noted in the concurrent manager log:
Could not contact Service Manager FNDSM_RH8_VIS. The TNS alias could not be located, the listener process on RH8 could not be contacted, or the listener failed to spawn the Service Manager process.
Found dead process: spid=(962754), cpid=(2259578), Service Instance=(1045)
CONC-SM TNS FAIL
Call to PingProcess failed for WFMAILER
CONC-SM TNS FAIL
Call to StopProcess failed for WFMAILER
CONC-SM TNS FAIL
Call to PingProcess failed for FNDCPGSC
17
FNDSM Failover
Found dead process: spid=(716870), cpid=(2259580), Service Instance=(2009)
Found dead process: spid=(1442020), cpid=(2259579), Service Instance=(2010)
Starting WFMGSMD Concurrent Manager : 15-AUG-2008 13:28:56
Starting WFMGSMDB Concurrent Manager : 15-AUG-2008 13:28:56
Starting WFALSNRSVCB Concurrent Manager : 15-AUG-2008 13:28:57
Starting STANDARD Concurrent Manager : 15-AUG-2008 13:30:31
Starting Internal Concurrent Manager Concurrent Manager : 15-AUG-2008 13:30:32
18
Internal Monitor
(FNDIMON process) - Communicates with the Internal Concurrent Manager.
• This manager/service is used to implement Parallel Concurrent Processing.
• You do not need to run this manager/service unless you are using Parallel Concurrent Processing.
• The Internal Monitor (IM) monitors the Internal Concurrent Manager, and restarts any failed ICM on the local node. It monitors whether the ICM is still running, and if the ICM crashes, it will restart it on another node.
• During a node failure in a PCP environment the IM will restart the ICM on a surviving node (multiple ICMs may be started on multiple nodes, but only the first ICM started will eventually remain active; all others will gracefully terminate).
• There should be an Internal Monitor defined on each node where the ICM may migrate.
19
Standard Manager
• (FNDLIBR process) - Communicates with the Service Manager and any client application process.
• The Standard Manager is a worker process that initiates, and executes client requests on behalf of Applications batch, and OLTP clients.
20
Standard Manager
21
Standard Manager - OAM
Since no secondary node is defined, the Standard Manager will not fail over.
“Failover Processes” in the Work Shifts definition are the number of processes that will run (3) when the Standard Manager fails over to the secondary node.
The Standard Manager is active on RH9, even though no primary node is defined.
22
Transaction Manager
A Transaction Manager communicates with the Service Manager, and any user process initiated on behalf of Forms, or a Standard Manager request.
A Transaction Manager:
• Supports synchronous processing of requests from a client program.
• Gets a request from a client program to run a server-side program synchronously.
• Returns a status/results to the client program.
• At runtime, a number of these managers are started, as defined.
• Doesn’t poll the concurrent request table for new requests.
• Only 1 transaction manager is needed per database, not 1 per instance.
23
Transaction Managers
Some of the Transaction Managers in R12
24
Configuring Transaction Managers for RAC
• R11i Transaction Managers use DBMS_PIPE
– This does not work across RAC instances
– RAC users must perform additional configuration
• Requires complicated configuration or additional hardware
• R12 Transaction Managers use AQ
– Works across RAC instances
– Simplifies configuration
– Reduces complexity
– Profile Option can switch between mechanisms
• DBMS_PIPE can be used for non-RAC users if performance becomes an issue
25
Configuring Transaction Managers for RAC
• Edit $ORACLE_HOME/dbs/<context_name>_ifile.ora and add these parameters:
• _lm_global_posts=TRUE
• _immediate_commit_propagation=TRUE
• Change the profile option 'Concurrent: TM Transport Type' to 'QUEUE', and verify that the transaction manager works across the RAC instance. ATG RUP3 (4334965) or higher provides an option to use AQs in place of Pipes.
• Profile “Concurrent:TM Transport Type”
• Set to QUEUE
• Pipes are more efficient but require a Transaction Manager to be running on each DB Instance.
• Navigate to the Concurrent > Manager > Define screen, and set up the primary and secondary node names for transaction managers.
26
Configuring Transaction Managers for RAC
• Transaction Managers allow a client to make a request for a program to be run on the server immediately. The client then waits for the program to complete and can receive program results from the server. As the client and server are two separate database sessions, the communication between them has been handled using the DBMS_PIPE package.
• Unfortunately the DBMS_PIPE package does not extend to communications between sessions on different RAC instances. On an Applications instance using RAC, the client and server are very likely to be on different instances, causing transactions to time out for long periods or fail completely. The current workaround is to manually set up Transaction managers to connect to all RAC instances, which not only takes up additional resources, it may require additional middle-tier hardware or a complicated configuration that is difficult to maintain.
27
R12 Transaction Managers
• In R12, the Transaction Managers use the AQ mechanism, so they work on RAC when connected to either instance.
• This greatly simplifies the configuration and reduces the complexity for RAC administrators. A Profile Option has been introduced to allow users to switch between the two transports DBMS_PIPE or AQ.
28
Concurrent:PCP Instance Check
• Concurrent processing provides database instance-sensitive failover capabilities. When an instance is down, all managers connecting to it switch to a secondary middle-tier node.
• However, if you prefer to handle instance failover separately from such middle-tier failover (for example, using TNS connection-time failover mechanism instead), use the profile option Concurrent:PCP Instance Check.
• When this profile option is set to OFF, Parallel Concurrent Processing will not provide database instance failover support; however, it will continue to provide middle-tier node failover support when a node goes down.
29
Conflict Resolution Manager
• Concurrent managers read requests to start concurrent programs. The Conflict Resolution Manager checks concurrent program definitions for incompatibility rules.
• If a program is identified as Run Alone, then the Conflict Resolution Manager prevents the concurrent managers from starting other programs in the same conflict domain.
• When a program lists other programs as being incompatible with it, the Conflict Resolution Manager prevents the program from starting until any incompatible programs in the same domain have completed running.
• To enable/disable the Conflict Resolution Manager, use the system profile option 'Concurrent: Use ICM'. Setting this to 'No' (the default) allows the CRM to be started.
• Setting it to 'Yes' causes the CRM to be shut down, and the Internal Manager (ICM) will take over the conflict resolution duties.
• If the CRM will not start (it is started automatically by the ICM), check this profile option.
30
Conflict Resolution Manager
• Use the system profile option 'Concurrent: Use ICM'. 'No' allows the CRM to be started.
• Setting it to 'Yes' causes the CRM to shutdown. The Internal Manager (ICM) will take over the conflict resolution duties.
• Using the ICM to resolve conflicts is not recommended.
• The CRM's sole purpose is to resolve conflicts, while the ICM has other functions to perform as well.
• Setting this option to 'YES' is not recommended.
31
Generic Service Management
• An E-Business Suite system depends on a variety of services, such as Forms Listeners, HTTP Servers, Concurrent Managers, and Workflow Mailers. These services are composed of one or more processes. In the past, many of these processes had to be individually started and monitored by system administrators.
• Management of these processes is complicated, since these services can be distributed across multiple host machines.
• The introduction of Generic Service Management in Release 11i helped simplify the management of these processes by providing a fault tolerant service framework and a central management console built into Oracle Applications Manager.
• Service Management is an extension of Concurrent Processing, and provides a framework for managing processes on multiple host machines. With Service Management, virtually any application tier service can be integrated into this framework.
• Patch 2221688 introduces GSM.
32
GSM
33
Generic Services
34
GSM and Multiple Nodes
• GSM enables users to manage Applications services across multiple middle-tier nodes.
• This includes services on Web/Forms nodes that previously have had no concurrent processing footprint.
• Users configuring GSM in a multiple-node system should be sure to have followed the instructions for Parallel Concurrent Processing.
• This includes setting the environment variable APPLDCP=ON and assigning a primary node for all defined managers and services (if not already defined).
35
Seeded GSM Services
When configuring GSM the following GSM Services are seeded automatically:
– Forms Listener
– Metrics Server
– Metrics Client
– Reports Server
– Apache Listener
LINUX users should not activate the Reports Server under GSM.
36
Starting GSM
Apps Listener: listener.ora
gsmstart.sh: exec FNDSM
37
adcmctl.sh
adcmctl.sh calls:
• startmgr.sh
• batchmgr.sh
• CONCSUB
• FNDSVCRG
38
FNDSVCRG – Service Controller Utility
• FNDSVCRG is an executable introduced as a part of the Seeded GSM Services. It provides improved coordination between the GSM monitoring of these services and their command-line control scripts.
• The $FND_TOP/bin/FNDSVCRG executable is called from the adcmctl.sh control script before and after the script starts or stops the service. FNDSVCRG connects to the database using JDBC and validates the configuration of the Seeded GSM Service.
39
Verify GSM
• To verify GSM is working, start the concurrent managers.
• Once GSM is enabled, the ICM uses Service Managers to start all concurrent managers and activated services.
• If the ICM is successfully starting the managers, then GSM has been configured properly.
• If managers and/or services fail to start, errors should appear in the ICM log file.
40
Service Manager Log
• Each Service Manager maintains its own log file named FNDSMxxxx.mgr, located in the same directory as concurrent manager log files.
• If you cannot locate the Service Manager log file, it is likely that the Service Managers are not starting properly and there is a configuration issue that needs troubleshooting.
41
Test – Kill services and see if GSM restarts them
Kill FNDSM
applvis 9007 1 0 11:53 ? 00:00:00 FNDSM
applvis 9159 9155 0 11:55 ? 00:00:00 FNDLIBR
applvis 9161 5683 0 11:55 pts/3 00:00:00 grep FND
[applvis@rh9 scripts]$ kill -9 9007
[applvis@rh9 scripts]$ ps -ef |grep FND
applvis 9159 9155 0 11:55 ? 00:00:00 FNDLIBR
applvis 9169 1 0 11:55 ? 00:00:00 FNDSM
applvis 9249 5683 0 11:57 pts/3 00:00:00 grep FND
Kill FNDCRM
[applvis@rh9 scripts]$ ps -ef |grep FNDCRM
applvis 8886 1 0 11:52 ? 00:00:00 FNDCRM
APPS/ZGA13053E1E1B7BA773417089054DA88F194EAC0D687728CC2551870E6B78C4B439 EADB287342795115A88DBC85788CCB4 FND FNDCRM N 10 c LOCK Y RH9 1302318
[applvis@rh9 scripts]$ kill -9 8886
[applvis@rh9 scripts]$ ps -ef |grep FNDCRM
applvis 9457 9392 0 12:09 ? 00:00:00 FNDCRM
APPS/ZG26430816FA3570354BC57DE47FF105D145F8DE226EFE58CE04B416633DCB90126 7BFECFA7585114F7090060EFE1147BE FND FNDCRM N 10 c LOCK Y RH9 1302343
Both of these services were started before I could enter the grep command to find the corresponding process.
42
11i - Defining PCP Details
In Release 11i, the Secondary Node doesn’t need to be filled in for failover to occur.
43
R12 PCP Details
In Release 12, failover won’t occur if there is no Secondary Node defined.
44
R12 PCP Setup
The only Standard Manager set up to fail over is the “Standard Manager”.
45
R12 Manager Failover
46
PCP Failover
[Diagram: PCP nodes RH7, RH8 and RH9 each connect as SQL*Net clients to the database listener on the DB node (RH8), which is configured through sqlnet.ora. TCP_KEEPALIVE takes 240 seconds before issuing DCD.]
47
Parallel Concurrent Processing
• Parallel concurrent processing allows distribution of concurrent managers across multiple nodes.
• Benefits are better performance, availability and scalability (load balancing).
• Parallel Concurrent Processing (PCP) is activated along with Generic Service Management (GSM); it cannot be activated independently of GSM.
• With parallel concurrent processing implemented with GSM, the Internal Concurrent Manager (ICM) tries to assign valid nodes for concurrent managers and other service instances.
48
Parallel Concurrent Processing
• There should be only one ICM and CRM, at any given time, although the ICM and CRM could be configured to run on several of the nodes.
• Concurrent Managers migrate to the surviving node when one of the concurrent nodes goes down.
49
Parallel Concurrent Processing
What’s wrong with this picture?
[Diagram: the database, Web Browser (Java interface, JInitiator), Web Server (HTML interface), Forms Server and Reports Server, with two concurrent processing nodes. Each node shows a Report Review Agent, SQL*Net, .rdx, Requests/Logs/Out files, ICM (FNDLIBR), Service Manager (FNDSM), Internal Monitor (FNDIMON), Standard Manager (FNDLIBR) and FNDCRM.]
50
APPLDCP Profile Option
Starting with Release 11.5.10, FND.H, the APPLDCP environment variable is ignored. R12 GSM requires the value of APPLDCP to be set to “ON”. The value is hard-coded in afpcsq.lpc version 115.35, thereby ignoring the value of APPLDCP.
As per ATG Development:
As of file "afpcsq.lpc" version 115.35 or higher, APPLDCP is internally hard-coded to "ON" when the Generic Service Management (GSM) is enabled--"keeping in mind, use of the GSM is required".
In short, at "afpcsq.lpc" version 115.35 or higher with the GSM enabled, the setting of the APPLDCP environment variable is ignored--this is the "default behavior on all R12 releases."
NOTE: As per ARU, "Patch 11i.FND.H" (3262159) and "Oracle Applications Release 11.5.10" (3140000) contain "afpcsq.lpc" version 115.37.
From Note: 753678.1
51
PCP Failover Mechanisms
• TCP keepalive
• PMON – ICM Process Monitor
• Dead Connection Detection
• Connection Failure Recovery – R12
• 10g Timeout Parameters (untested)
– sqlnet.inbound_connect_timeout (server)
– sqlnet.send_timeout (client and/or server)
– sqlnet.recv_timeout (client and/or server)
52
11i PCP Failure
• TCP Failure
• ICM Lock is released; FNDIMON pings the ICM node; if the ping fails, check PMON
• PMON detects a “dead process”, crashed ICM
• reviver.sh
• DCD
53
R12 PCP Failure
• TCP Failure
• PMON detects a “dead process”
• ICM Shutdown
– Look for error messages ORA-3113, ORA-3114 or ORA-1041
• reviver.sh
• DCD
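One way to look for those error messages is a small grep wrapper over a manager log. This is only a sketch: the function name `find_icm_ora_errors` is made up for this example, and the pattern assumes the errors may appear either as written on the slide (ORA-3113) or zero-padded (ORA-03113).

```shell
#!/bin/sh
# Hypothetical helper (not an Oracle utility): list the lines of a manager
# log that mention ORA-3113, ORA-3114 or ORA-1041, with line numbers.
# $1 = path to the log file to scan
find_icm_ora_errors() {
    # 0* accepts zero-padded codes; the trailing group rejects longer
    # codes such as ORA-31130.
    grep -n -E 'ORA-0*(3113|3114|1041)([^0-9]|$)' "$1"
}

# Example call (the path is a placeholder, not a real location):
# find_icm_ora_errors /path/to/icm_log_file
```

The function returns a non-zero status when nothing matches, so it can also be used in an `if` to decide whether the log needs a closer look.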
54
Reviver
From the CM log file:
• The ICM has lost its database connection and is shutting down.
• Spawning reviver process to restart the ICM when the database becomes available again.
• Spawned reviver process 10910.
[Flowchart: ICM and reviver. ICM boxes: Start; Lost DB Connection? (Yes/No); Starts to Shutdown; Spawn Reviver; Receive Shutdown? (Yes/No); Exit. Reviver boxes: Sleep; Attempt to Get DB Connection (Yes/No); Kill Previous DB Session; ICM Started? (Yes/No); Start ICM; Exit.]
55
reviver.log
The ICM has lost its database connection and is shutting down.
Spawning reviver process to restart the ICM when the database becomes available again.
Spawned reviver process 10910.
56
TCP
TCP/IP is a connection-oriented protocol; TCP implements packet timeout and retransmission in an effort to guarantee the safe and sequenced order of data packets.
If a timely acknowledgement is not received in response to the probe packet, the TCP/IP stack will retransmit the packet some number of times before timing out.
After TCP/IP gives up, SQL*Net receives notification that the probe failed.
57
TCP Keepalive
At this time, client-side SQL*Net connections do not enable keepalive for TCP connections by default. However, it is possible to enable this by adding the ENABLE=BROKEN parameter to the SQL*Net connect string (the DESCRIPTION clause of the TNS alias).
**WARNING** Keepalive intervals are typically set to 2 hours or more (i.e., it can take more than 2 hours to notice a dead server even if keepalive is enabled). To make keepalive useful for PCP and TAF, the keepalive interval needs to be reduced to a smaller value (such as 2 minutes).
If there are a lot of idle connections on your network, then reducing the keepalive interval can increase network traffic significantly.
58
ENABLE=BROKEN
Sample TNS alias to enable keepalive (notice the ENABLE=BROKEN clause)
VIS_BALANCE =
  (DESCRIPTION =
    (ENABLE = BROKEN)
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = rh8)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = rh6)(PORT = 1521))))
59
TCP Keepalive
• **WARNING** Keepalive intervals are typically set to 2 hours or more (ie: it can take more than 2 hours to notice a dead server even if keepalive is enabled).
• To make keepalive useful for TAF, the keepalive interval would need to be reduced to a smaller value (such as 2 minutes). Note: 249213.1
60
TCP KeepAlive Parameters for Linux
tcp_keepalive_time: the time between the last data packet sent and the first keepalive probe
tcp_keepalive_intvl: the time between keepalive probes
tcp_keepalive_probes: the number of probes to be sent before declaring the connection dead
Default Settings
tcp_keepalive_time = 7200 seconds
tcp_keepalive_intvl = 75 seconds
tcp_keepalive_probes = 9
A total of 7875 seconds (7200 + 9 × 75), or 2 hours 11 minutes and 15 seconds.
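The arithmetic behind that total can be sketched in a few lines of shell. The function name `keepalive_detect_seconds` is invented for this illustration; it simply adds the idle time to the probe interval multiplied by the probe count.

```shell
#!/bin/sh
# Hypothetical helper (not part of Linux or Oracle): time for TCP keepalive
# to declare an idle connection dead.
# $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
keepalive_detect_seconds() {
    echo $(( $1 + $2 * $3 ))
}

# Linux defaults from the slide: 7200 + 75 * 9
keepalive_detect_seconds 7200 75 9
```

Passing other values to the same function shows how strongly the idle time dominates the result, which is why the later slides reduce tcp_keepalive_time first.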
61
TCP Keepalive
Initial Settings
– tcp_keepalive_time = 200 secs
– tcp_keepalive_intvl = 20 secs
– tcp_keepalive_probes = 2
• After 200 seconds of no response, TCP sends the first of 2 probes, 20 seconds apart.
• TCP notifies SQL*Net of the failure, and SQL*Net removes the offending connection.
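As a sketch of how these values might be written down for a Linux host, the snippet below prints them as sysctl.conf-style lines. The function name `print_keepalive_sysctl` is hypothetical; the `net.ipv4.*` keys are the standard Linux names for these settings. The snippet only prints text and changes nothing on the system.

```shell
#!/bin/sh
# Hypothetical helper: print sysctl.conf-style lines for the three Linux
# keepalive settings. Applying them is left to the administrator.
# $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
print_keepalive_sysctl() {
    printf 'net.ipv4.tcp_keepalive_time = %s\n' "$1"
    printf 'net.ipv4.tcp_keepalive_intvl = %s\n' "$2"
    printf 'net.ipv4.tcp_keepalive_probes = %s\n' "$3"
}

print_keepalive_sysctl 200 20 2
# Idle time plus all probe intervals: 200 + 20 * 2 = 240 seconds
echo "detection time: $(( 200 + 20 * 2 )) seconds"
```

The 240-second result lines up with the "TCP_KEEPALIVE takes 240 seconds" note on the PCP Failover diagram.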
62
TCP Retries
• tcp_retries1 (default: 3): The number of times TCP will attempt to retransmit a packet on an established connection normally, without the extra effort of getting the network layers involved.
• tcp_retries2 (default: 15): The maximum number of times a TCP packet is retransmitted in established state before giving up.
• tcp_syn_retries (default: 5): The maximum number of times initial SYNs for an active TCP connection attempt will be retransmitted. The default value of 5 corresponds to approximately 180 seconds.
63
TCP Retries
Now let’s consider changing the following TCP parameters from their default values:
tcp_retries1 = 2
tcp_retries2 = 2
tcp_syn_retries = 2
In this example, the time to initialize the PCP failover was an average of 8 seconds after changing these TCP parameters.
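Before and after a change like this, it helps to confirm what the kernel is actually using. The sketch below reads each setting from a directory and compares it with a wanted value. The function name `check_tcp_setting` is hypothetical; `/proc/sys/net/ipv4` is where Linux exposes these settings, and the wanted value of 2 is taken from the slide.

```shell
#!/bin/sh
# Hypothetical helper: compare one kernel TCP setting with a wanted value.
# $1 = directory holding the setting files (on Linux: /proc/sys/net/ipv4)
# $2 = setting name, $3 = wanted value
check_tcp_setting() {
    current=$(cat "$1/$2" 2>/dev/null) || { echo "$2: not readable"; return 2; }
    if [ "$current" = "$3" ]; then
        echo "$2: ok ($current)"
    else
        echo "$2: is $current, want $3"
        return 1
    fi
}

# Report on the three retry settings discussed above (read-only).
for name in tcp_retries1 tcp_retries2 tcp_syn_retries; do
    check_tcp_setting /proc/sys/net/ipv4 "$name" 2 || true
done
```

Taking the directory as an argument keeps the check read-only and lets it be tried against a scratch directory first.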
64
Disconnect TCP Connection from RH9
From the ICM log:
The Internal Concurrent Manager has encountered an error.
Review concurrent manager log file for more detailed information. : 12-JAN-2009 15:22:55 -
Shutting down Internal Concurrent Manager : 12-JAN-2009 15:22:55
12-JAN-2009 15:22:55
The ICM has lost its database connection and is shutting down.
Spawning reviver process to restart the ICM when the database becomes available again.
Spawned reviver process 1541.
The VIS_0112@VIS internal concurrent manager has terminated with status 1 - giving up.
Found dead process: spid=(17963), cpid=(1302176), ORA pid=(26), manager=(0/1)
65
PMON & fnd_concurrent_queues
PMON updates the work_start column in the fnd_concurrent_queues table every 4 PMON cycles
fdpsrp() (running_processes correction):
ICM cannot obtain exclusive lock on FND_CONCURRENT_QUEUES
Oracle error code returned: 1
This message is information and does not indicate a problem with CP functionality.
remote call function (FNDIMON)
15-AUG-2008 10:06:02 - Function to call: PingProcess
66
PMON – ICM Lock – 11i
• If the “ICM lock” is not available, FNDIMON will now ping the node of the ICM.
• If the ping succeeds, we conclude that the ICM is fine. What????
• If the ping fails, we further check whether it has been over “quesiz” pmon cycles since the ICM updated the work_start column in fnd_concurrent_queues.
• If it has been more than four pmon cycles we conclude that the ICM is dead.
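The decision described on this slide can be sketched as a small shell function. This is an illustration of the slide's wording only, not Oracle's code: the function name `icm_verdict` and the "undetermined" outcome (ping failed, but not enough cycles have passed yet) are assumptions made for this example. It covers the case where FNDIMON could not obtain the ICM lock.

```shell
#!/bin/sh
# Sketch of the 11i check described above; NOT Oracle's implementation.
# Applies when FNDIMON could not obtain the "ICM lock".
# $1 = did the ping of the ICM node succeed? (yes|no)
# $2 = pmon cycles since the ICM last updated work_start
# $3 = cycle threshold (four on the slide)
icm_verdict() {
    if [ "$1" = "yes" ]; then
        echo "alive"          # node answers ping, so the ICM is assumed fine
    elif [ "$2" -gt "$3" ]; then
        echo "dead"           # node unreachable and work_start is stale
    else
        echo "undetermined"   # assumption: keep waiting for more cycles
    fi
}

icm_verdict yes 0 4
icm_verdict no 5 4
```

Written this way, the weakness the slide points at is easy to see: a successful ping says the node is up, not that the ICM process on it is healthy.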
67
PMON “found dead process”
On RH9 the PMON found a dead process. The PMON takes about 1 second to run, then sleeps for 2 minutes:
Process monitor session started : 18-JAN-2009 21:46:05
Found dead process: spid=(16977), cpid=(1321475), Service Instance=(36543)
Process monitor session ended : 18-JAN-2009 21:46:06
The Internal Concurrent Manager has encountered an error.
Review concurrent manager log file for more detailed information. : 18-JAN-2009 22:02:01
68
PMON – node RH9 is down
From the ICM log:
Process monitor session started : 12-JAN-2009 15:18:27
Internal Concurrent Manager found node RH9 to be down. Adding it to the list of unavailable nodes.
CONC-SM TNS FAIL
Call to PingProcess failed for XDPCTRLS
69
PMON
Process monitor session started : 18-JAN-2009 22:38:57
CONC-SM TNS FAIL
Call to PingProcess failed for OAMGCS
18-JAN-2009 22:38:58 - Node:(RH7), Service Manager:(FNDSM_RH7_VIS) currently unreachable by TNS
Found dead process: spid=(11234), cpid=(1321563), ORA pid=(167), manager=(0/4)
Process monitor session ended : 18-JAN-2009 22:38:58
70
PMON
Shutting down Internal Concurrent Manager : 18-JAN-2009 22:02:01
18-JAN-2009 22:02:01
The ICM has lost its database connection and is shutting down.
Spawning reviver process to restart the ICM when the database becomes available again.
Spawned reviver process 10910.
71
PMON runs every 2 minutes
Process monitor session ended : 18-JAN-2009 21:49:05
Process monitor session started : 18-JAN-2009 21:51:05
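The two-minute gap in this log is consistent with a 30-second sleep and 4 PMON cycles, the values used in most of the failover tests. The sketch below works on the assumption that the PMON period is the ICM sleep time multiplied by the number of PMON cycles (the `sleep` and `pmon` arguments of batchmgr.sh). The function name is invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: seconds between PMON runs, assuming
# period = ICM sleep seconds * number of pmon cycles.
# $1 = sleep seconds, $2 = pmon cycles
pmon_period_seconds() {
    echo $(( $1 * $2 ))
}

pmon_period_seconds 30 4   # sleep 30 secs, 4 cycles
pmon_period_seconds 15 2   # the shorter setting used in two of the tests
```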
72
Edit ICM Runtime Parameters
73
Edit PMON Parameters
74
Edit PMON Parameters
ICM parameters are read from batchmgr.sh when adcmctl.sh runs. Changing these parameters here does not change batchmgr.sh!
75
$FND_TOP/bin/batchmgr.sh
Make sure the PMON changes are made in the $FND_TOP/bin/batchmgr.sh file.
# FILENAME
#   batchmgr
# DESCRIPTION
#   fire up Internal Concurrent Manager process
# USAGE
#   batchmgr arg1=val1 arg2=val2 ...
#
#   Parameters may be sent via the environment.
#
# ARGUMENTS                                    DEFAULT
#   [appmgr|sysmgr]=username/password
#   [sleep=sleep_seconds]                      15
#   [mgrname=manager_name]                     icm
#   [logfile=log_filename]                     $FND_TOP/$APPLLOG/$mgrname.mgr
#   [restart=N|mim minutes between restarts]   N
#   [mailto="user1 user2..."]                  current user
#   [PRINTER=printer_name]
#   [pmon=iterations]                          4
#   [quesiz=pmon_iterations]                   1
#   [diag=Y|N]                                 N
76
Reviver
From the CM log file:
• The ICM has lost its database connection and is shutting down.
• Spawning reviver process to restart the ICM when the database becomes available again.
• Spawned reviver process 10910.
[Flowchart: ICM and reviver. ICM boxes: Start; Lost DB Connection? (Yes/No); Starts to Shutdown; Spawn Reviver; Receive Shutdown? (Yes/No); Exit. Reviver boxes: Sleep; Attempt to Get DB Connection (Yes/No); Kill Previous DB Session; ICM Started? (Yes/No); Start ICM; Exit.]
77
reviver.log
reviver.sh starting up...
[ Mon Jan 12 20:02:15 MST 2009 ] - Read APPS username/password.
[ Mon Jan 12 20:02:45 MST 2009 ] - Attempting database connection...
[ Mon Jan 12 20:02:45 MST 2009 ] - Successful database connection.
[ Mon Jan 12 20:02:45 MST 2009 ] - Killing previous ICM session...
1 row updated.
Commit complete.
[ Mon Jan 12 20:02:45 MST 2009 ] - Looking for a running ICM process...
[ Mon Jan 12 20:02:45 MST 2009 ] - ICM now running, reviver.sh complete.
78
reviver.sh
reviver.sh – code summary
Sleep 30
Test_connection
Kill_old_icm
– Get session
– Alter system kill session
Check_running_icm
– Fnd_conc.ecm_alive
start_icm
– startmgr.sh
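The control flow in that summary can be sketched as a loop. This is not Oracle's reviver.sh: the four hook functions are placeholders invented for this example (a real script would connect to the database and call startmgr.sh), and the attempt limit is an assumption added so the sketch always terminates.

```shell
#!/bin/sh
# Sketch only: the flow from the summary above, NOT Oracle's reviver.sh.
# Placeholder hooks, invented for illustration:
test_connection()   { return 0; }   # can we reach the database?
kill_old_icm()      { echo "killing previous ICM session"; }
check_running_icm() { return 1; }   # is an ICM already running?
start_icm()         { echo "starting ICM"; }

# $1 = seconds to sleep before each connection attempt (30 in the summary)
# $2 = maximum attempts before giving up (an assumption, not in the summary)
revive() {
    attempt=0
    while [ "$attempt" -lt "$2" ]; do
        sleep "$1"
        attempt=$(( attempt + 1 ))
        if test_connection; then
            kill_old_icm
            if check_running_icm; then
                echo "ICM now running"
            else
                start_icm
            fi
            return 0
        fi
    done
    echo "database still unavailable"
    return 1
}

revive 0 3
```

With the placeholder hooks as written, the call at the end reports that it kills the previous session and starts the ICM; swapping in a failing `test_connection` exercises the give-up path.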
79
Dead Connection Detection
• Dead Connection Detection (DCD) is a feature of SQL*Net 2.1 and later, including Oracle Net8. DCD detects when a partner in a SQL*Net V2 client/server or server/server connection has terminated unexpectedly, and releases the resources associated with it.
80
Implement DCD
• Implement by:
adding SQLNET.EXPIRE_TIME = 1 (Minutes) to the sqlnet.ora file
If the connection is idle for the time interval specified in minutes by the SQLNET.EXPIRE_TIME parameter, the server-side process sends a small 10-byte packet to the client. The packet is sent using TCP/IP.
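A quick way to confirm that the setting is in place is to read it back from the file. The sketch below is one possible check; the function name `expire_time_minutes` is hypothetical, and it assumes the parameter is written on a single line as `SQLNET.EXPIRE_TIME = <minutes>`.

```shell
#!/bin/sh
# Hypothetical helper: print the SQLNET.EXPIRE_TIME value (in minutes) found
# in a sqlnet.ora file, or print nothing if the parameter is not set.
# $1 = path to the sqlnet.ora file
expire_time_minutes() {
    awk -F= '
        toupper($1) ~ /^[ \t]*SQLNET\.EXPIRE_TIME[ \t]*$/ {
            gsub(/[ \t]/, "", $2)   # drop spaces around the value
            value = $2              # the last setting in the file is kept
        }
        END { if (value != "") print value }
    ' "$1"
}

# Example call (the location is an assumption; adjust to your environment):
# expire_time_minutes "$TNS_ADMIN/sqlnet.ora"
```

Commented-out lines are ignored because the parameter name must be the first thing on the line.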
81
DCD – ICM Lock
• ICM and IM can use the DCD functionality of the Network (TCP sqlnet).
• ICM is a client process connected to a DCD enabled DB dedicated server process.
• ICM holds the named PL/SQL Lock, the “ICM lock”.
• IM is continuously trying to check whether it can get the same named PL/SQL Lock.
82
DCD – ICM Lock
• As soon as the “ICM lock” is released by the DB / DCD, FNDIMON pings the ICM node, and the IM deduces that the ICM has crashed.
– If the ping succeeds, we conclude that the ICM is fine.
• Obviously, the ICM can be down even if TCP is working; this is bad logic.
– If the ping fails, FNDIMON determines whether it has been over four pmon cycles since the ICM updated the work_start column in fnd_concurrent_queues.
– If it has been more than four pmon cycles, FNDIMON concludes the ICM is dead.
• DCD comes into the picture here, after the ICM has crashed and the DB needs to identify that the ICM is gone.
• The DB needs to clean up the dedicated server process resources corresponding to the ICM client process.
83
FNDIMON has the ICM Lock
Check if the ICM updated the work_start column in fnd_concurrent_queues.
Be aware that if a TCP failure is not detected, failover will not occur. The following excerpt from a concurrent manager log shows this:
fdpsrp() (running_processes correction):
ICM cannot obtain exclusive lock on FND_CONCURRENT_QUEUES
Oracle error code returned: 1
This message is information and does not indicate a problem with CP functionality.
remote call function (FNDIMON)
15-AUG-2008 10:06:02 - Function to call: PingProcess
The PingProcess continues until the CP processes resume, or a TCP failure is detected and failover begins.
84
11i PCP Failure
• TCP Failure
• ICM Lock is released; FNDIMON pings the ICM node; if the ping fails, check PMON
• PMON detects a “dead process”, crashed ICM
• reviver.sh
• DCD
85
R12 PCP Failure
• TCP Failure
• PMON detects a “dead process”
• ICM Shutdown
– Look for error messages ORA-3113, ORA-3114 or ORA-1041
• reviver.sh
• DCD
86
Test PCP Failover Parameters
• Test to explore the effect of DCD, PMON and TCP failover methods.
• Variables: sqlnet.expire_time, PMON sleep and number of cycles, and the following TCP keepalive parameters:
• tcp_keepalive_time
• tcp_keepalive_intvl
• tcp_keepalive_probes
• tcp_retries1 (default: 3, new value 2)
• tcp_retries2 (default: 15, new value 2)
• tcp_syn_retries (default: 5, new value 2)
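As a rough guide to how the three keepalive parameters interact: on Linux an idle connection to a dead peer is dropped after tcp_keepalive_time seconds of silence plus tcp_keepalive_probes unanswered probes sent tcp_keepalive_intvl seconds apart. The sketch below works through that arithmetic. The function name is made up for the example. The 200 / 20 / 2 values are the ones used in the failover tests, and 7200 / 75 / 9 are the usual Linux defaults.

```python
def keepalive_detection_secs(keepalive_time, keepalive_intvl, keepalive_probes):
    """Worst-case seconds before an idle dead connection is dropped by TCP keepalive."""
    return keepalive_time + keepalive_intvl * keepalive_probes


# Values used in the failover tests: 200 s idle, then 2 probes 20 s apart.
print(keepalive_detection_secs(200, 20, 2))   # 240

# Usual Linux defaults: a little over two hours before the dead peer is noticed.
print(keepalive_detection_secs(7200, 75, 9))  # 7875
```

With the tested values the result is 240 seconds, which is why keepalive alone cannot bring failover below about four minutes.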
87
Failover Test Results
Failover time / Failback time | Expire_time | PMON sleep | PMON cycles | tcp KA time | tcp KA intvl | tcp KA probes | tcp retries1 | tcp retries2 | tcp syn retries
241 secs / | 1 minute | 30 secs | 4 | 200 | 20 | 2 | 3 | 15 | 5
250 secs / 50 secs | 5 minutes | 30 secs | 4 | 200 | 20 | 2 | 3 | 15 | 5
262 secs / 100 secs | 10 minutes | 30 secs | 4 | 200 | 20 | 2 | 3 | 15 | 5
300 secs / 75 secs | 1 minute | 15 secs | 2 | 200 | 20 | 2 | 3 | 15 | 5
285 secs / 35 min | 10 minutes | 30 secs | 4 | 1000 | 60 | 10 | 3 | 15 | 5
8 secs / 105 secs | 1 minute | 30 secs | 4 | 1000 | 60 | 10 | 2 | 2 | 2
10 secs / 42 secs | 1 minute | 30 secs | 4 | 200 | 20 | 2 | 2 | 2 | 2
7 secs / 40 secs | 10 minutes | 30 secs | 4 | 200 | 20 | 2 | 2 | 2 | 2
6 secs / 34 secs | 1 minute | 15 secs | 2 | 200 | 20 | 2 | 2 | 2 | 2
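One way to read these results is to group the nine runs by their tcp_retries1 / tcp_retries2 / tcp_syn_retries settings and compare the average failover time. The sketch below does that. The figures are copied from the results above, while the helper name and data layout are made up for the illustration.

```python
from statistics import mean

# (failover_secs, tcp_retries1, tcp_retries2, tcp_syn_retries), one tuple per test run
results = [
    (241, 3, 15, 5), (250, 3, 15, 5), (262, 3, 15, 5), (300, 3, 15, 5), (285, 3, 15, 5),
    (8, 2, 2, 2), (10, 2, 2, 2), (7, 2, 2, 2), (6, 2, 2, 2),
]


def mean_failover_by_retries(rows):
    """Average failover time for each (retries1, retries2, syn_retries) combination."""
    groups = {}
    for secs, *retries in rows:
        groups.setdefault(tuple(retries), []).append(secs)
    return {key: mean(values) for key, values in groups.items()}


summary = mean_failover_by_retries(results)
for retries, avg in summary.items():
    print(retries, avg)
```

In these runs the default retry settings average roughly four and a half minutes to fail over, and the runs with all three retry parameters set to 2 average under ten seconds, whatever the expire_time, PMON or keepalive values. Nine runs is a small sample, so treat this as an indication rather than a benchmark.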
88
All Services are UP
89
Concurrent Managers
• Processes - Actual = 1 and Target = 1, manager is running
• Processes - Actual = 0 and Target = 1, manager is not running
90
Actual Processes = 0
Example of Actual Processes = 0. In this example, the CRM is not running.
91
PCP Setup
PCP setup – this screen is continued on the next slide
92
Primary and Secondary Nodes
The CRM, ICM and Standard Manager will fail over.
Any concurrent programs not assigned to the Standard Manager will not fail over.
93
TCP Failure
• TCP disconnected at 2:57:25
• 10 seconds after the TCP connection was pulled, OAM reported the status above.
• It took 10 seconds for OAM to register a failure of services on RH9.
94
CRM is DOWN
If any subordinate service fails, the failure rolls up to the Dashboard.
95
CRM Failure
CRM has failed, Actual Processes = 0
96
PCP Failover from RH9 to RH7
Adding Node:(RH9), to unavailable list
Found dead process: spid=(9696), cpid=(1321449), ORA pid=(80), manager=(0/0)
Found dead process: spid=(9784), cpid=(1321458), ORA pid=(114), manager=(0/0)
Found dead process: spid=(9783), cpid=(1321457), ORA pid=(104), manager=(0/0)
Found running request 4413565 attached to dead manager process.
Attempting to restart request.
Internal Concurrent Manager found node RH9 to be down. Adding it to the list of unavailable nodes.
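When checking a manager log for this failover signature, the “Found dead process” entries can be pulled out with a regular expression. The sketch below is a minimal example built only on the message format shown in this excerpt. The function name and returned field names are hypothetical, and other releases may format the message differently.

```python
import re

# Pattern taken from the log excerpt above; finditer copes with entries that
# are run together on one line as well as one entry per line.
DEAD_PROCESS = re.compile(
    r"Found dead process: spid=\((\d+)\), cpid=\((\d+)\), "
    r"ORA pid=\((\d+)\), manager=\((\d+)/(\d+)\)"
)


def parse_dead_processes(log_text):
    """Return one dict per 'Found dead process' entry found in log_text."""
    return [
        {"spid": int(m.group(1)), "cpid": int(m.group(2)), "ora_pid": int(m.group(3))}
        for m in DEAD_PROCESS.finditer(log_text)
    ]


sample = (
    "Adding Node:(RH9), to unavailable list\n"
    "Found dead process: spid=(9696), cpid=(1321449), ORA pid=(80), manager=(0/0)\n"
    "Found dead process: spid=(9784), cpid=(1321458), ORA pid=(114), manager=(0/0)\n"
    "Found dead process: spid=(9783), cpid=(1321457), ORA pid=(104), manager=(0/0)\n"
)
for entry in parse_dead_processes(sample):
    print(entry)
```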
97
GSM tries to restart the services
TCP and TNS are unavailable:
Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.
Check that your system has enough resources to start a concurrent manager process.
Contac : 18-JAN-2009 21:43:42
Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.
Check that your system has enough resources to start a concurrent manager process.
Contac : 18-JAN-2009 21:43:42
Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.
98
ICM and CRM are DOWN
99
RH9 is DOWN
Not really down, just not on the network
100
PCP is DOWN
This is momentary as GSM figures out what to do.
101
Failover to Secondary Node
The ICM and CRM failed over to RH7 in about 1 minute and 30 seconds
102
Failover from RH9 to RH7
Starting Internal Concurrent Manager Concurrent Manager : 18-JAN-2009 21:51:23
: Started ICM on Target RH7.
Process monitor session ended : 18-JAN-2009 21:52:53
: Migration of ICM has completed.
Shutting down Internal Concurrent Manager : 18-JAN-2009 21:53:23
The VIS_0118@VIS internal concurrent manager has terminated successfully - exiting.
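The migration time can be read straight off the timestamps in this log. The sketch below parses the DD-MON-YYYY HH:MM:SS format by hand, so it does not depend on the machine's locale. The helper name is made up for the example.

```python
from datetime import datetime

MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}


def parse_cm_timestamp(text):
    """Parse a concurrent manager log timestamp such as '18-JAN-2009 21:51:23'."""
    date_part, time_part = text.split()
    day, month, year = date_part.split("-")
    hour, minute, second = (int(part) for part in time_part.split(":"))
    return datetime(int(year), MONTHS[month.upper()], int(day), hour, minute, second)


started = parse_cm_timestamp("18-JAN-2009 21:51:23")   # ICM started on RH7
migrated = parse_cm_timestamp("18-JAN-2009 21:52:53")  # process monitor session ended
shutdown = parse_cm_timestamp("18-JAN-2009 21:53:23")  # old ICM shutting down

print((migrated - started).total_seconds())   # 90.0
print((shutdown - started).total_seconds())   # 120.0
```

The 90 seconds between the start on RH7 and the end of the process monitor session matches the failover time of about 1 minute and 30 seconds quoted for the ICM and CRM.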
103
ICM Failover to RH7
Starting Internal Concurrent Manager Concurrent Manager : 18-JAN-2009 21:51:23
: Started ICM on Target RH7.
Process monitor session ended : 18-JAN-2009 21:52:53
: Migration of ICM has completed.
Shutting down Internal Concurrent Manager : 18-JAN-2009 21:53:23
The VIS_0118@VIS internal concurrent manager has terminated successfully - exiting.
104
RH9 not available
105
Request Failover
106
Standard Manager Failover Configuration
• Note that the Inventory Manager, MRP Manager and OAM Metrics Collection Manager are not set up to fail over.
107
Managers with a Secondary Node
• Note that the Inventory Manager, MRP Manager and OAM Metrics Collection Manager are not set up to fail over.
108
Failback
FAILBACK – TCP connected at 31:40
The host, RH9, becomes available on OAM about 2 minutes later.
109
RH9 available
110
ICM Failback
111
Concurrent Manager Log
Starting Internal Concurrent Manager Concurrent Manager : 18-JAN-2009 22:53:33
: Started ICM on Target RH9.
Process monitor session ended : 18-JAN-2009 22:55:03
: Migration of ICM has completed.
Shutting down Internal Concurrent Manager : 18-JAN-2009 22:55:33
The VIS_0118@VIS internal concurrent manager has terminated successfully - exiting.
112
113
Failback Complete
Total Failback Time 3 minutes and 45 seconds
114
Standard Manager before Failover
The Standard Manager has 3 Actual and Target processes.
115
Standard Manager is DOWN
116
Standard Manager has 2 Processes on Failover
After 3 minutes and 30 seconds the Standard Manager started on RH7
117
Shutdown of CP
118
Concurrent Processing Load Balancing
Two types of Load Balancing
• Load Balancing with both nodes running – no failover
• Load Balancing during failover
119
PCP Load Balancing
• One of the benefits Parallel Concurrent Processing provides:
– failover in case of node failure
• Maintain throughput and keep the business running during node failures.
• When a node fails, the processes that were running on the failed node are restarted on secondary nodes.
• However, a resource-intensive node may overload the secondary node when it fails over.
120
PCP Load Balancing
• If too many processes are running on the secondary node when the primary node fails over, the secondary node may not have the capacity to process the requests from additional concurrent managers.
• R12 introduces Failover Sensitive Workshifts. This enhancement allows the System Administrator to configure how many processes fail over for each workshift. With this added control, System Administrators can enjoy the benefits of PCP failover without risking performance issues through overloaded resources.
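The effect of a per-workshift failover process count can be illustrated with a small capacity calculation. Everything in the sketch below is hypothetical: the node names, the second manager and its process counts are invented, and the model only counts processes rather than reproducing how the ICM applies workshifts.

```python
from dataclasses import dataclass


@dataclass
class Manager:
    name: str
    primary: str
    secondary: str
    processes: int           # processes in the workshift during normal running
    failover_processes: int  # processes allowed once the primary node has failed


def processes_on_node(managers, node, failed=frozenset()):
    """Count manager processes expected on `node`, given a set of failed nodes."""
    if node in failed:
        return 0
    total = 0
    for m in managers:
        if m.primary == node:
            total += m.processes
        elif m.secondary == node and m.primary in failed:
            total += m.failover_processes
    return total


managers = [
    # Runs 3 processes normally, reduced to 2 when it fails over (example values).
    Manager("Standard Manager", "node1", "node2", processes=3, failover_processes=2),
    # Invented manager that already keeps node2 busy.
    Manager("Hypothetical Manager", "node2", "node1", processes=4, failover_processes=4),
]

print(processes_on_node(managers, "node2"))                    # 4
print(processes_on_node(managers, "node2", failed={"node1"}))  # 6
```

Without the reduced failover count, node2 would have to carry 7 processes after node1 fails. The failover setting caps it at 6 in this example.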
121
R12 Failover Sensitive Workshifts
122
Failover Sensitive Workshifts
123
Failover Sensitive Workshifts
• Conversely, if a failover occurs from node 1 to node 2, we may want to reduce the failover processes; however, this doesn’t work.
• The “failover processes” setting takes effect only if the node fails.
124
Failover Processes
PO Document Approval Manager and the Standard Manager will reduce the number of processes when RH7 fails. When RH9 fails, the number of failover processes for managers that run on RH7 is not reduced.
125
Failover Sensitive Workshifts
It’s clear: to run an R11i or R12 system during a failover, there are two choices:
• Run the servers at 35% or less utilization
• Reduce the number of processes that are allowed during failover
For most businesses the second option is the most practical.