pcp

1

Parallel Concurrent Processing

Mike Swing, TruTek

2009

2

Conclusions

• You don’t need RAC to use Parallel Concurrent Processing (PCP)!
• If you have PCP enabled, secondary nodes must be defined during the upgrade to R12.
• Tuning of TCP, SQL*Net and PMON parameters can minimize PCP failover time.
• Implement failover-sensitive work shifts.

3

Concurrent Processing Server

• Allows scheduling of jobs: batch jobs, or Requests in Oracle terms.
• Processes concurrent programs as Requests.
• Requests can be grouped together into Request Sets.
• Different types of concurrent managers handle different types of requests.
• A concurrent program can be assigned to a responsibility, and that responsibility can be assigned to users, giving them permission to run the concurrent program.
• Concurrent managers may have limits on the concurrent programs that can be run and on the times at which they can be started.
• Requests have priorities, a status, and log and out files in the above directory.

4

Definitions

• CP => Concurrent Processing
• DCD => Dead Connection Detection
• ICM => Internal Concurrent Manager
• IM => Internal Monitor
• CRM => Conflict Resolution Manager
• PCP => Parallel Concurrent Processing
• PMON => Process Monitor for ICM

5

Concurrent Request

6

Phase and Status of Concurrent Requests

Phase      Status      Description - Action
Pending    Normal      The request is waiting to be picked up by the next available manager.
Pending    Standby     Waiting for CRM to resolve conflict. CRM could be slow or an incompatible program is running.
Running    Normal      The request is running normally.
Completed  Normal      The request has finished successfully.
Completed  Error       The request has finished with an error. Check the logs.
Completed  Warning     The request has finished with a warning. Check the logs.
Inactive   No Manager  The request won’t run without a manager. Specialization rules aren’t configured properly.

7

PCP Failover

[Diagram: PCP nodes RH7, RH8 and RH9, each with a SQL*Net client, and the database with its database listener on DB node RH8 (sqlnet.ora). TCP_KEEPALIVE takes 240 seconds before issuing DCD.]

8

Concurrent Managers

9

Concurrent Managers

Manager Type                 Service Instance                     Program
Internal Concurrent Manager  Internal Manager                     FNDLIBR
Conflict Resolution Manager  Conflict Resolution Manager          FNDCRM
Internal Monitor             Internal Monitor:Node                FNDIMON
                             Service Manager: Node                FNDSM
Concurrent Manager           Standard Manager                     FNDLIBR
Concurrent Manager           Inventory Manager                    INVLIBR
Concurrent Manager           Session History Cleanup              FNDLIBR
Concurrent Manager           PA Streamline Manager                PALIBR
Transaction Manager          CRP Inquiry Manager                  CYQLIB
Transaction Manager          FastFormula Transaction Manager      FFTM
Transaction Manager          PO Document Approval Manager         POXCON
Transaction Manager          Transaction Manager                  FNDTMTST
                             Scheduler/Prerelease Manager         FNDSVC
                             OAM Generic Collection Service:Node  FNDSVC

10

Concurrent Processing

1. The Concurrent Processing server communicates with the database using Oracle SQL*Net.
2. The concurrent program log or output file from a request is passed back as a report to the Report Review Agent.
3. The Report Review Agent passes a file containing the entire report to the forms server.
4. The Forms Services component passes the report back to the user’s browser one page at a time. Profile options can be used to control the size of the files and pages passed, to suit report volume and available network capacity.

[Diagram labels: Web Browser (Java interface, JInitiator), Web Server (HTML interface), Forms Server, Reports Server, Report Review Agent (.rdx), SQL*Net, Requests, Log, Out, Service Manager (FNDSM), ICM (FNDLIBR), Internal Monitor (FNDIMON), Standard Manager (FNDLIBR), FNDCRM.]

11

Internal Concurrent Manager

• The Internal Concurrent Manager (ICM) starts, sets the number of active processes, monitors, and terminates all other concurrent processes through requests made to the Service Manager, including restarting any failed processes.
• The ICM also starts, stops, and restarts the Service Manager for each node.
• The ICM will perform process migration during an instance or node failure.
• The ICM will be active on a single node.
• This is also true in a PCP environment, where the ICM will be active on at least one node at all times.

12

Internal Concurrent Manager

• The ICM does not have any scheduling responsibilities. It has NOTHING to do with scheduling requests, or deciding which manager will run a particular request. The function of the ICM is to run 'queue control' requests: requests to start up or shut down other managers.
• The ICM is responsible for startup and shutdown of the whole concurrent processing facility. It monitors the other managers periodically and restarts them if they go down. It can also take over the Conflict Resolution Manager's job and resolve incompatibilities.
• If the ICM itself goes down, requests will continue to run normally, except for 'queue control' requests. Restart the ICM with 'startmgr'; there is no need to kill the other managers first.

13

Internal Concurrent Manager

14

Service Manager

FNDSM process - Communicates with the Internal Concurrent Manager, Concurrent Manager, and non-Manager Service processes.

• The Service Manager (SM) spawns and terminates manager and service processes (these could be Forms or Apache Listeners, Metrics or Reports Server, and any other process controlled through Generic Service Management).
• When the ICM terminates, the SM that resides on the same node as the ICM will also terminate.
• The SM is “chained” to the ICM. The SM will only reinitialize after termination when there is a function it needs to perform (start or stop a process), so there may be periods of time when the SM is not active, and this is normal.

15

Service Manager

• All processes initialized by the SM inherit the same environment as the SM.
• The SM’s environment is set by the APPSORA.env file and the gsmstart.sh script.
• The apps_<sid> listener must be active on each CP node to support the SM connection to the local instance.
• There should be a Service Manager active on each node where a Concurrent or non-Manager service process will reside.

16

FNDSM Failure

FNDSM failover as noted in the concurrent manager log:

Could not contact Service Manager FNDSM_RH8_VIS. The TNS alias could not be located, the listener process on RH8 could not be contacted, or the listener failed to spawn the Service Manager process.

Found dead process: spid=(962754), cpid=(2259578), Service Instance=(1045)

CONC-SM TNS FAIL
Call to PingProcess failed for WFMAILER
CONC-SM TNS FAIL
Call to StopProcess failed for WFMAILER
CONC-SM TNS FAIL
Call to PingProcess failed for FNDCPGSC

17

FNDSM Failover

Found dead process: spid=(716870), cpid=(2259580), Service Instance=(2009)
Found dead process: spid=(1442020), cpid=(2259579), Service Instance=(2010)

Starting WFMGSMD Concurrent Manager : 15-AUG-2008 13:28:56
Starting WFMGSMDB Concurrent Manager : 15-AUG-2008 13:28:56
Starting WFALSNRSVCB Concurrent Manager : 15-AUG-2008 13:28:57
Starting STANDARD Concurrent Manager : 15-AUG-2008 13:30:31
Starting Internal Concurrent Manager Concurrent Manager : 15-AUG-2008 13:30:32

18

Internal Monitor

(FNDIMON process) - Communicates with the Internal Concurrent Manager.

• This manager/service is used to implement Parallel Concurrent Processing.
• You do not need to run this manager/service unless you are using Parallel Concurrent Processing.
• The Internal Monitor (IM) monitors the Internal Concurrent Manager and restarts any failed ICM on the local node. It monitors whether the ICM is still running, and if the ICM crashes, it will restart it on another node.
• During a node failure in a PCP environment, the IM will restart the ICM on a surviving node (multiple ICMs may be started on multiple nodes, but only the first ICM started will remain active; all others will terminate gracefully).
• There should be an Internal Monitor defined on each node to which the ICM may migrate.

19

Standard Manager

• (FNDLIBR process) - Communicates with the Service Manager and any client application process.

• The Standard Manager is a worker process that initiates, and executes client requests on behalf of Applications batch, and OLTP clients.

20

Standard Manager

21

Standard Manager - OAM

Since no secondary node is defined, the Standard Manager will not fail over.

“Failover Processes” in the Work Shifts definition is the number of processes that will run (3) when the Standard Manager fails over to the secondary node.

The Standard Manager is active on RH9, even though no primary node is defined.

22

Transaction Manager

A Transaction Manager communicates with the Service Manager, and any user process initiated on behalf of Forms or a Standard Manager request.

A Transaction Manager:
• Supports synchronous processing of requests from a client program.
• Gets a request from a client program to run a server-side program synchronously.
• Returns a status/results to the client program.
• At runtime, a number of these managers are started, as defined.
• Doesn’t poll the concurrent request table for new requests.
• Only 1 transaction manager is needed per database, not 1 per instance.

23

Transaction Managers

Some of the Transaction Managers in R12

24

Configuring Transaction Managers for RAC

• R11i Transaction Managers use DBMS_PIPE
  – This does not work across RAC instances
  – RAC users must perform additional configuration
  – Requires complicated configuration or additional hardware
• R12 Transaction Managers use AQ
  – Works across RAC instances
  – Simplifies configuration
  – Reduces complexity
  – Profile option can switch between mechanisms
  – DBMS_PIPE can be used for non-RAC users if performance becomes an issue

25


• Edit $ORACLE_HOME/dbs/<context_name>_ifile.ora and add these parameters:
  – _lm_global_posts=TRUE
  – _immediate_commit_propagation=TRUE
• Change the profile option ‘Concurrent: TM Transport Type’ to ‘QUEUE’, and verify that the transaction manager works across the RAC instances. ATG RUP3 (4334965) or higher provides an option to use AQs in place of Pipes.
• Profile “Concurrent:TM Transport Type”
  – Set to QUEUE
  – Pipes are more efficient but require a Transaction Manager to be running on each DB instance.
• Navigate to the Concurrent > Manager > Define screen, and set up the primary and secondary node names for transaction managers.

26


• Transaction Managers allow a client to make a request for a program to be run on the server immediately. The client then waits for the program to complete and can receive program results from the server. As the client and server are two separate database sessions, the communication between them has been handled using the DBMS_PIPE package.

• Unfortunately the DBMS_PIPE package does not extend to communications between sessions on different RAC instances. On an Applications instance using RAC, the client and server are very likely to be on different instances, causing transactions to time out for long periods or fail completely. The current workaround is to manually set up Transaction managers to connect to all RAC instances, which not only takes up additional resources, it may require additional middle-tier hardware or a complicated configuration that is difficult to maintain.

27

R12 Transaction Managers

• In R12, the Transaction Managers use the AQ mechanism, so they work on RAC when connected to either instance.

• This greatly simplifies the configuration and reduces the complexity for RAC administrators. A Profile Option has been introduced to allow users to switch between the two transports DBMS_PIPE or AQ.

28

Concurrent:PCP Instance Check

• Concurrent processing provides database instance-sensitive failover capabilities. When an instance is down, all managers connecting to it switch to a secondary middle-tier node.

• However, if you prefer to handle instance failover separately from such middle-tier failover (for example, using TNS connection-time failover mechanism instead), use the profile option Concurrent:PCP Instance Check.

• When this profile option is set to OFF, Parallel Concurrent Processing will not provide database instance failover support; however, it will continue to provide middle-tier node failover support when a node goes down.

29

Conflict Resolution Manager

• Concurrent managers read requests to start concurrent programs. The Conflict Resolution Manager checks concurrent program definitions for incompatibility rules.
• If a program is identified as Run Alone, then the Conflict Resolution Manager prevents the concurrent managers from starting other programs in the same conflict domain.
• When a program lists other programs as being incompatible with it, the Conflict Resolution Manager prevents the program from starting until any incompatible programs in the same domain have completed running.
• To enable or disable the Conflict Resolution Manager, use the system profile option 'Concurrent: Use ICM'. Setting it to 'No' (the default) allows the CRM to be started.
• Setting it to 'Yes' causes the CRM to be shut down, and the Internal Manager (ICM) takes over the conflict resolution duties.
• If the CRM will not start (it is started automatically by the ICM), check this profile option.

30

Conflict Resolution Manager

• Use the system profile option 'Concurrent: Use ICM'. 'No' allows the CRM to be started.

• Setting it to 'Yes' causes the CRM to shut down. The Internal Manager (ICM) will take over the conflict resolution duties.

• Using the ICM to resolve conflicts is not recommended.

• The CRM's sole purpose is to resolve conflicts, while the ICM has other functions to perform as well.

• Setting this option to 'YES' is not recommended.

31

Generic Service Management

• An E-Business Suite system depends on a variety of services, such as Forms Listeners, HTTP Servers, Concurrent Managers, and Workflow Mailers. These services are composed of one or more processes. In the past, many of these processes had to be individually started and monitored by system administrators.
• Management of these processes is complicated, since these services can be distributed across multiple host machines.
• The introduction of Generic Service Management in Release 11i helped simplify the management of these processes by providing a fault-tolerant service framework and a central management console built into Oracle Applications Manager.
• Service Management is an extension of Concurrent Processing, and provides a framework for managing processes on multiple host machines. With Service Management, virtually any application tier service can be integrated into this framework.
• Patch 2221688 introduces GSM.

32

GSM

33

Generic Services

34

GSM and Multiple Nodes

• GSM enables users to manage Applications services across multiple middle-tier nodes.
• This includes services on Web/Forms nodes that previously have had no concurrent processing footprint.
• Users configuring GSM in a multiple-node system should be sure to have followed the instructions for Parallel Concurrent Processing.
• This includes setting the environment variable APPLDCP=ON and assigning a primary node for all defined managers and services (if not already defined).

35

Seeded GSM Services

When configuring GSM, the following GSM Services are seeded automatically:
– Forms Listener
– Metrics Server
– Metrics Client
– Reports Server
– Apache Listener

Linux users should not activate the Reports Server under GSM.

36

Starting GSM

[Diagram: Apps Listener (listener.ora), then gsmstart.sh, then exec FNDSM.]

37

adcmctl.sh

adcmctl.sh calls:
• startmgr.sh
• batchmgr.sh
• CONCSUB
• FNDSVCRG

38

FNDSVCRG – Service Controller Utility

• FNDSVCRG is an executable introduced as a part of the Seeded GSM Services. It provides improved coordination between the GSM monitoring of these services and their command-line control scripts.

• The $FND_TOP/bin/FNDSVCRG executable is called from the adcmctl.sh control script before and after the script starts or stops the service. FNDSVCRG connects to the database using JDBC and validates the configuration of the Seeded GSM Service.

39

Verify GSM

• To verify GSM is working, start the concurrent managers.
• Once GSM is enabled, the ICM uses Service Managers to start all concurrent managers and activated services.
• If the ICM is successfully starting the managers, then GSM has been configured properly.
• If managers and/or services fail to start, errors should appear in the ICM log file.

40

Service Manager Log

• Each Service Manager maintains its own log file named FNDSMxxxx.mgr, located in the same directory as concurrent manager log files.

• If you cannot locate the Service Manager log file, it is likely that the Service Managers are not starting properly and there is a configuration issue that needs troubleshooting.
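A quick way to check for Service Manager logs is to count the FNDSM*.mgr files in the manager log directory. The sketch below is illustrative only: `count_fndsm_logs` is a made-up helper name, and the idea that the directory is `$APPLCSF/$APPLLOG` is an assumption to verify on your own system. The demo runs against a throwaway directory.

```shell
#!/bin/sh
# Count Service Manager log files (FNDSM*.mgr) in a log directory.
# count_fndsm_logs is a hypothetical helper, not an Oracle utility.
# The directory is passed by the caller; $APPLCSF/$APPLLOG is only a guess
# at where concurrent manager logs live on a given installation.
count_fndsm_logs() {
    n=0
    for f in "$1"/FNDSM*.mgr; do
        # The glob stays literal when nothing matches, so test for existence.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Demo with a temporary directory and invented file names.
dir=$(mktemp -d)
touch "$dir/FNDSM1234.mgr" "$dir/w1234.mgr"
echo "Service Manager logs found: $(count_fndsm_logs "$dir")"
rm -rf "$dir"
```

A count of zero would be the cue to start troubleshooting the Service Manager configuration, as the slide suggests.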

41

Test – Kill services and see if GSM restarts them

Kill FNDSM

applvis 9007 1 0 11:53 ? 00:00:00 FNDSM
applvis 9159 9155 0 11:55 ? 00:00:00 FNDLIBR
applvis 9161 5683 0 11:55 pts/3 00:00:00 grep FND

[applvis@rh9 scripts]$ kill -9 9007
[applvis@rh9 scripts]$ ps -ef |grep FND
applvis 9159 9155 0 11:55 ? 00:00:00 FNDLIBR
applvis 9169 1 0 11:55 ? 00:00:00 FNDSM
applvis 9249 5683 0 11:57 pts/3 00:00:00 grep FND

Kill FNDCRM

[applvis@rh9 scripts]$ ps -ef |grep FNDCRM
applvis 8886 1 0 11:52 ? 00:00:00 FNDCRM

APPS/ZGA13053E1E1B7BA773417089054DA88F194EAC0D687728CC2551870E6B78C4B439 EADB287342795115A88DBC85788CCB4 FND FNDCRM N 10 c LOCK Y RH9 1302318

[applvis@rh9 scripts]$ kill -9 8886

[applvis@rh9 scripts]$ ps -ef |grep FNDCRM
applvis 9457 9392 0 12:09 ? 00:00:00 FNDCRM

APPS/ZG26430816FA3570354BC57DE47FF105D145F8DE226EFE58CE04B416633DCB90126 7BFECFA7585114F7090060EFE1147BE FND FNDCRM N 10 c LOCK Y RH9 1302343

Both of these services were restarted before I could enter the grep command to find the corresponding process.

42

11i - Defining PCP Details

In Release 11i, the Secondary Node doesn’t need to be filled in for failover to occur.

43

R12 PCP Details

In Release 12, failover won’t occur if there is no Secondary Node defined.

44

R12 PCP Setup

The only Standard Manager set up to fail over is the “Standard Manager”.

45

R12 Manager Failover

46

PCP Failover

[Diagram: PCP nodes RH7, RH8 and RH9, each with a SQL*Net client, and the database with its database listener on DB node RH8 (sqlnet.ora). TCP_KEEPALIVE takes 240 seconds before issuing DCD.]

47

Parallel Concurrent Processing

• Parallel concurrent processing allows distribution of concurrent managers across multiple nodes.
• Benefits are better performance, availability and scalability (load balancing).
• Parallel Concurrent Processing (PCP) is activated along with Generic Service Management (GSM); it cannot be activated independently of GSM.
• With parallel concurrent processing implemented with GSM, the Internal Concurrent Manager (ICM) tries to assign valid nodes for concurrent managers and other service instances.

48


• There should be only one ICM and CRM, at any given time, although the ICM and CRM could be configured to run on several of the nodes.

• Concurrent Managers migrate to the surviving node when one of the concurrent nodes goes down.

49


What’s wrong with this picture?

[Diagram labels: Database, Data, Web Browser (Java interface, JInitiator), Web Server (HTML interface), Forms Server, Reports Server. Each of the following appears twice: Report Review Agent (SQL*Net, .rdx), Requests, Logs, Out, ICM (FNDLIBR), Service Manager (FNDSM), Internal Monitor (FNDIMON), Standard Manager (FNDLIBR), FNDCRM.]

50

APPLDCP Profile Option

Starting with Release 11.5.10 (FND.H), the APPLDCP environment variable is ignored. R12 GSM requires the value of APPLDCP to be “ON”. The value is hard-coded in afpcsq.lpc version 115.35, so the APPLDCP setting is ignored.

As per ATG Development:

As of file "afpcsq.lpc" version 115.35 or higher, APPLDCP is internally hard-coded to "ON" when the Generic Service Management (GSM) is enabled--"keeping in mind, use of the GSM is required".

In short, at "afpcsq.lpc" version 115.35 or higher with the GSM enabled, the setting of the APPLDCP environment variable is ignored--this is the "default behavior on all R12 releases."

NOTE: As per ARU, "Patch 11i.FND.H" (3262159) and "Oracle Applications Release 11.5.10" (3140000) contain "afpcsq.lpc" version 115.37.

From Note: 753678.1

51

PCP Failover Mechanisms

• TCP keepalive
• PMON – ICM Process Monitor
• Dead Connection Detection
• Connection Failure Recovery – R12
• 10g Timeout Parameters (untested)
  – sqlnet.inbound_connect_timeout (server)
  – sqlnet.send_timeout (client and/or server)
  – sqlnet.recv_timeout (client and/or server)

52

11i PCP Failure

• TCP Failure
• ICM Lock is released, FNDIMON pings ICM node; if ping fails, check PMON
• PMON detects a “dead process”, crashed ICM
• reviver.sh
• DCD

53

R12 PCP Failure

• TCP Failure
• PMON detects a “dead process”
• ICM Shutdown
  – Look for error messages ORA-3113, ORA-3114 or ORA-1041
• reviver.sh
• DCD
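One way to look for those messages is to scan the manager log with grep. This is a minimal sketch under stated assumptions: `find_conn_errors` is a hypothetical helper name, the log path is supplied by the caller, and the sample log lines in the demo are made up.

```shell
#!/bin/sh
# Print lines of a manager log that mention the connection-loss errors named
# on this slide. The pattern tolerates zero padding (ORA-3113 or ORA-03113)
# and refuses longer codes such as ORA-31130.
# find_conn_errors is a hypothetical helper; pass the log path as $1.
find_conn_errors() {
    grep -E 'ORA-0*(3113|3114|1041)([^0-9]|$)' "$1"
}

# Demo with a throwaway log; the sample lines are for illustration only.
log=$(mktemp)
printf '%s\n' 'ORA-03113: end-of-file on communication channel' \
              'Process monitor session started' > "$log"
find_conn_errors "$log"
rm -f "$log"
```

grep exits non-zero when nothing matches, so the function can also be used directly in an `if` test.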

54

Reviver

From the CM log file:
• The ICM has lost its database connection and is shutting down.
• Spawning reviver process to restart the ICM when the database becomes available again.
• Spawned reviver process 10910.

[Flowchart with two parts, ICM and REVIVER. Nodes: Start; Lost DB Connection? (Yes/No); Spawn Reviver; Receive Shutdown? (Yes/No); Starts to Shutdown; Exit; Sleep; Attempt to Get DB Connection (Yes/No); Kill Previous DB Session; ICM Started? (Yes/No); Start ICM; Exit.]

55

reviver.log

The ICM has lost its database connection and is shutting down.

Spawning reviver process to restart the ICM when the database becomes available again.

Spawned reviver process 10910.

56

TCP

TCP/IP is a connection-oriented protocol; TCP implements packet timeout and retransmission in an effort to guarantee safe, in-order delivery of data packets.

If a timely acknowledgement is not received in response to the probe packet, the TCP/IP stack will retransmit the packet some number of times before timing out.

After TCP/IP gives up, SQL*Net receives notification that the probe failed.

57

TCP Keepalive

At this time, client-side SQL*Net connections do not enable keepalive for TCP connections by default. However, it is possible to enable it by adding the ENABLE=BROKEN parameter to the SQL*Net connect string (the DESCRIPTION clause of the TNS alias).

**WARNING** Keepalive intervals are typically set to 2 hours or more (i.e., it can take more than 2 hours to notice a dead server even if keepalive is enabled). To make keepalive useful for PCP and TAF, the keepalive interval needs to be reduced to a smaller value (such as 2 minutes).

If there are a lot of idle connections on your network, reducing the keepalive interval can increase network traffic significantly.

58

ENABLE=BROKEN

Sample TNS alias to enable keepalive (notice the ENABLE=BROKEN clause):

VIS_BALANCE =
  (DESCRIPTION =
    (ENABLE = BROKEN)
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = rh8)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = rh6)(PORT = 1521))
    )
  )

(The CONNECT_DATA clause is not shown.)

59

TCP Keepalive

• **WARNING** Keepalive intervals are typically set to 2 hours or more (i.e., it can take more than 2 hours to notice a dead server even if keepalive is enabled).

• To make keepalive useful for TAF, the keepalive interval would need to be reduced to a smaller value (such as 2 minutes). Note: 249213.1

60

TCP KeepAlive Parameters for Linux

tcp_keepalive_time: the time between the last data packet sent and the first keepalive probe
tcp_keepalive_intvl: the time between keepalive probes
tcp_keepalive_probes: the number of probes to be sent before declaring the connection dead

Default settings:
tcp_keepalive_time = 7200 seconds
tcp_keepalive_intvl = 75 seconds
tcp_keepalive_probes = 9

A total of 7875 seconds (7200 + 9 × 75), or 2 hours, 11 minutes and 15 seconds.
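The total is the idle time before the first probe plus one probe interval for each unanswered probe. A minimal shell sketch of that arithmetic follows; `keepalive_detect_secs` is a made-up helper name, not an operating system or Oracle tool.

```shell
#!/bin/sh
# Worst-case seconds before TCP keepalive declares an idle connection dead:
# idle time before the first probe + (probe interval * number of probes).
# keepalive_detect_secs is a hypothetical helper used only for illustration.
keepalive_detect_secs() {
    time_s=$1; intvl_s=$2; probes=$3
    echo $(( time_s + intvl_s * probes ))
}

# Linux defaults from this slide: 7200 s, 75 s, 9 probes.
keepalive_detect_secs 7200 75 9
```

Running it with the defaults prints 7875, the figure given on the slide.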

61

TCP Keepalive

Initial settings:
– tcp_keepalive_time = 200 seconds
– tcp_keepalive_intvl = 20 seconds
– tcp_keepalive_probes = 2

• After 200 seconds of no response, TCP sends the first of 2 probes, 20 seconds apart.

• TCP notifies SQL*Net of the failure, and SQL*Net removes the offending connection.
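With these values the worst case is 200 + 2 × 20 = 240 seconds, which matches the 240 seconds quoted on the PCP Failover slide. On Linux the three settings are kernel parameters under net.ipv4. The sketch below only prints the corresponding sysctl.conf lines and changes nothing on the host; `print_keepalive_sysctl` is a hypothetical helper name, and loading the lines (for example with `sysctl -p`, as root) is left to the administrator.

```shell
#!/bin/sh
# Print sysctl.conf lines for the keepalive settings on this slide.
# print_keepalive_sysctl is an illustrative helper; it only writes to stdout.
# Arguments: keepalive time (s), probe interval (s), number of probes.
print_keepalive_sysctl() {
    printf 'net.ipv4.tcp_keepalive_time = %s\n'   "$1"
    printf 'net.ipv4.tcp_keepalive_intvl = %s\n'  "$2"
    printf 'net.ipv4.tcp_keepalive_probes = %s\n' "$3"
}

print_keepalive_sysctl 200 20 2
```

Printing rather than applying keeps the example safe to run on any machine; review the output before adding it to /etc/sysctl.conf.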

62

TCP Retries

• tcp_retries1 (default: 3): the number of times TCP will attempt to retransmit a packet on an established connection normally, without the extra effort of getting the network layers involved.
• tcp_retries2 (default: 15): the maximum number of times a TCP packet is retransmitted in the established state before giving up.
• tcp_syn_retries (default: 5): the maximum number of times initial SYNs for an active TCP connection attempt will be retransmitted. The default value of 5 corresponds to approximately 180 seconds.

63

TCP Retries

Now let’s consider changing the following TCP parameters from their default values:
tcp_retries1 = 2
tcp_retries2 = 2
tcp_syn_retries = 2

In this example, the time to initiate the PCP failover averaged 8 seconds after changing these TCP parameters.

64

Disconnect TCP Connection from RH9

From the ICM log:

The Internal Concurrent Manager has encountered an error.
Review concurrent manager log file for more detailed information. : 12-JAN-2009 15:22:55 -
Shutting down Internal Concurrent Manager : 12-JAN-2009 15:22:55
12-JAN-2009 15:22:55
The ICM has lost its database connection and is shutting down.
Spawning reviver process to restart the ICM when the database becomes available again.
Spawned reviver process 1541.
The VIS_0112@VIS internal concurrent manager has terminated with status 1 - giving up.
Found dead process: spid=(17963), cpid=(1302176), ORA pid=(26), manager=(0/1)

65

PMON & fnd_concurrent_queues

PMON updates the work_start column in the fnd_concurrent_queues table every 4 PMON cycles.

fdpsrp() (running_processes correction):
ICM cannot obtain exclusive lock on FND_CONCURRENT_QUEUES
Oracle error code returned: 1
This message is information and does not indicate a problem with CP functionality.

remote call function (FNDIMON)
15-AUG-2008 10:06:02 - Function to call: PingProcess

66

PMON – ICM Lock – 11i

• If the “ICM lock” is not available, FNDIMON will now ping the node of the ICM.

• If the ping succeeds, we conclude that the ICM is fine. What????

• If the ping fails, we further check whether it has been over “quesiz” pmon cycles since the ICM updated the work_start column in fnd_concurrent_queues.

• If it has been more than four pmon cycles we conclude that the ICM is dead.
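The check described on this slide can be written as a small decision function. This is only a sketch of the logic as the slide states it, not Oracle's code: the function name `icm_presumed_dead` and its arguments are invented for illustration.

```shell
#!/bin/sh
# Sketch of the 11i check described on this slide.
# icm_presumed_dead is a hypothetical name; arguments:
#   $1 ping_ok         1 if the ping of the ICM node succeeded, else 0
#   $2 cycles_elapsed  PMON cycles since the ICM last updated work_start
#   $3 threshold       cycles allowed before giving up (4 on this slide)
# Returns 0 (true) when the ICM is concluded dead, 1 otherwise.
icm_presumed_dead() {
    ping_ok=$1; cycles_elapsed=$2; threshold=$3
    if [ "$ping_ok" -eq 1 ]; then
        return 1    # ping succeeded: the ICM is assumed fine, even if it crashed
    fi
    [ "$cycles_elapsed" -gt "$threshold" ]
}

if icm_presumed_dead 0 5 4; then
    echo "ICM dead"
else
    echo "ICM assumed alive"
fi
```

The early return on a successful ping is the weak spot the slide questions: a reachable node says nothing about whether the ICM process itself is still running.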

67

PMON “found dead process”

On RH9 the PMON found a dead process. The PMON takes about 1 second to run, then sleeps for 2 minutes:

Process monitor session started : 18-JAN-2009 21:46:05
Found dead process: spid=(16977), cpid=(1321475), Service Instance=(36543)
Process monitor session ended : 18-JAN-2009 21:46:06

The Internal Concurrent Manager has encountered an error.
Review concurrent manager log file for more detailed information. : 18-JAN-2009 22:02:01

68

PMON – node RH9 is down

From the ICM log:

Process monitor session started : 12-JAN-2009 15:18:27

Internal Concurrent Manager found node RH9 to be down. Adding it to the list of unavailable nodes.

CONC-SM TNS FAILCall to PingProcess failed for XDPCTRLS

69

PMON

Process monitor session started : 18-JAN-2009 22:38:57
CONC-SM TNS FAIL
Call to PingProcess failed for OAMGCS
18-JAN-2009 22:38:58 - Node:(RH7), Service Manager:(FNDSM_RH7_VIS) currently unreachable by TNS
Found dead process: spid=(11234), cpid=(1321563), ORA pid=(167), manager=(0/4)

Process monitor session ended : 18-JAN-2009 22:38:58

70

PMON

Shutting down Internal Concurrent Manager : 18-JAN-2009 22:02:01
18-JAN-2009 22:02:01
The ICM has lost its database connection and is shutting down.
Spawning reviver process to restart the ICM when the database becomes available again.
Spawned reviver process 10910.

71

PMON runs every 2 minutes

Process monitor session ended : 18-JAN-2009 21:49:05

Process monitor session started : 18-JAN-2009 21:51:05

72

Edit ICM Runtime Parameters

73

Edit PMON Parameters

74

Edit PMON Parameters

ICM parameters are read from batchmgr.sh when adcmctl.sh runs. Changing these parameters here does not change batchmgr.sh!

75

$FND_TOP/bin/batchmgr.sh

Make sure the PMON changes are made in the $FND_TOP/bin/batchmgr.sh file.

# FILENAME
#   batchmgr
# DESCRIPTION
#   fire up Internal Concurrent Manager process
# USAGE
#   batchmgr arg1=val1 arg2=val2 ...
#
#   Parameters may be sent via the environment.
#
# ARGUMENTS                                   DEFAULT
#   [appmgr|sysmgr]=username/password
#   [sleep=sleep_seconds]                     15
#   [mgrname=manager_name]                    icm
#   [logfile=log_filename]                    $FND_TOP/$APPLLOG/$mgrname.mgr
#   [restart=N|mim minutes between restarts]  N
#   [mailto="user1 user2..."]                 current user
#   [PRINTER=printer_name]
#   [pmon=iterations]                         4
#   [quesiz=pmon_iterations]                  1
#   [diag=Y|N]                                N
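If the ICM runs its process monitor once every `pmon` sleep cycles, the gap between PMON sessions is roughly sleep × pmon. That reading is an assumption rather than something the slide states. It is consistent with the 2-minute gap between PMON sessions in the log shown earlier when the 30-second sleep and 4 cycles from the failover tests are used. A small sketch with a made-up helper name:

```shell
#!/bin/sh
# Approximate seconds between PMON sessions, assuming (not confirmed by the
# slide) that the process monitor runs once every `pmon` sleep cycles.
# pmon_interval_secs is a hypothetical helper used only for illustration.
pmon_interval_secs() {
    sleep_secs=$1; pmon_iterations=$2
    echo $(( sleep_secs * pmon_iterations ))
}

# sleep=30 and pmon=4, the values used in the failover tests.
pmon_interval_secs 30 4
```

With the sleep=15 and pmon=4 values listed in the header above, the same arithmetic gives 60 seconds.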

76

Reviver

From the CM log file:
• The ICM has lost its database connection and is shutting down.
• Spawning reviver process to restart the ICM when the database becomes available again.
• Spawned reviver process 10910.

[Flowchart with two parts, ICM and REVIVER. Nodes: Start; Lost DB Connection? (Yes/No); Spawn Reviver; Receive Shutdown? (Yes/No); Starts to Shutdown; Exit; Sleep; Attempt to Get DB Connection (Yes/No); Kill Previous DB Session; ICM Started? (Yes/No); Start ICM; Exit.]

77

reviver.log

reviver.sh starting up...
[ Mon Jan 12 20:02:15 MST 2009 ] - Read APPS username/password.
[ Mon Jan 12 20:02:45 MST 2009 ] - Attempting database connection...
[ Mon Jan 12 20:02:45 MST 2009 ] - Successful database connection.
[ Mon Jan 12 20:02:45 MST 2009 ] - Killing previous ICM session...
1 row updated.
Commit complete.
[ Mon Jan 12 20:02:45 MST 2009 ] - Looking for a running ICM process...
[ Mon Jan 12 20:02:45 MST 2009 ] - ICM now running, reviver.sh complete.

78

reviver.sh

reviver.sh – code summary

Sleep 30
Test_connection
Kill_old_icm
    Get session
    Alter system kill session
Check_running_icm
    Fnd_conc.ecm_alive
start_icm
    startmgr.sh
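The summary above can be read as a small control loop: wait for a database connection, kill the old ICM session, then start the ICM if none is running. The sketch below mirrors only that flow. Every function body is a stand-in written for this example (the real script connects to the database and calls startmgr.sh), and `SLEEP_SECS` and `DB_UP_AFTER` are invented knobs that simulate an outage.

```shell
#!/bin/sh
# Control-flow sketch of the reviver.sh summary. All function bodies are
# simulated stand-ins (assumptions), not the contents of the real script.
SLEEP_SECS=${SLEEP_SECS:-0}      # the slide shows "Sleep 30"; 0 keeps the demo fast
DB_UP_AFTER=${DB_UP_AFTER:-3}    # simulated: the database answers on the 3rd attempt
attempt=0

test_connection()   { attempt=$((attempt + 1)); [ "$attempt" -ge "$DB_UP_AFTER" ]; }
kill_old_icm()      { echo "Killing previous ICM session..."; }
check_running_icm() { return 1; }   # simulated: no ICM is running yet
start_icm()         { echo "Starting ICM (real script: startmgr.sh)"; }

reviver() {
    # Keep retrying until a database connection succeeds.
    until test_connection; do
        echo "Attempt $attempt: database not available, sleeping"
        sleep "$SLEEP_SECS"
    done
    echo "Successful database connection after $attempt attempts"
    kill_old_icm
    # Start the ICM only if one is not already running.
    if check_running_icm; then
        echo "ICM now running"
    else
        start_icm
    fi
}

reviver
```

In the demo the first two connection attempts fail, the third succeeds, and the ICM is started because the stubbed check reports that none is running.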

79

Dead Connection Detection

• Dead Connection Detection (DCD) is a feature of SQL*Net 2.1 and later, including Oracle Net8. DCD detects when a partner in a SQL*Net V2 client/server or server/server connection has terminated unexpectedly, and releases the resources associated with it.

80

Implement DCD

• Implement by:

adding SQLNET.EXPIRE_TIME = 1 (Minutes) to the sqlnet.ora file

If the connection is idle for the time interval specified in minutes by the SQLNET.EXPIRE_TIME parameter, the server-side process sends a small 10-byte packet to the client. The packet is sent using TCP/IP.
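To confirm that DCD is configured, you can read the value back from sqlnet.ora. The sketch below is illustrative: `dcd_expire_time` is a hypothetical helper, the file path is supplied by the caller, and the location under `$TNS_ADMIN` mentioned in the comment is an assumption about the local layout. The demo uses a throwaway file.

```shell
#!/bin/sh
# Report the SQLNET.EXPIRE_TIME value found in a sqlnet.ora file, if any.
# dcd_expire_time is a hypothetical helper; the file path is passed as $1.
dcd_expire_time() {
    # Match the parameter name case-insensitively (comment lines starting
    # with # do not match) and print only the value after the equals sign.
    grep -i '^[[:space:]]*SQLNET\.EXPIRE_TIME[[:space:]]*=' "$1" 2>/dev/null |
        tail -n 1 | sed 's/^[^=]*=[[:space:]]*//; s/[[:space:]]*$//'
}

# Example against a temporary file; on a real server the file usually lives
# under $TNS_ADMIN (an assumption about the local layout).
tmp=$(mktemp)
echo 'SQLNET.EXPIRE_TIME = 1' > "$tmp"
echo "expire_time minutes: $(dcd_expire_time "$tmp")"
rm -f "$tmp"
```

An empty result means the parameter is not set in that file, in which case DCD is not enabled through it.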

81

DCD – ICM Lock

• ICM and IM can use the DCD functionality of the Network (TCP sqlnet).

• ICM is a client process connected to a DCD enabled DB dedicated server process.

• ICM holds the named PL/SQL Lock, the “ICM lock”.

• IM is continuously trying to check whether it can get the same named PL/SQL Lock.

82

DCD – ICM Lock

• As soon as the “ICM lock” is released by the DB / DCD, FNDIMON pings the ICM node, and the IM deduces that the ICM has crashed.
  – If the ping succeeds, we conclude that the ICM is fine.
    • Obviously, the ICM can be down even if TCP is working; this is bad logic.
  – If the ping fails, FNDIMON determines whether it has been over four pmon cycles since the ICM updated the work_start column in fnd_concurrent_queues.
  – If it has been more than four pmon cycles, FNDIMON concludes the ICM is dead.
• DCD comes into the picture here after the ICM has crashed and the DB needs to identify that the ICM is gone.
• The DB needs to clean up the dedicated server process resources corresponding to the ICM client process.

83

FNDIMON has the ICM Lock

Check if the ICM updated the work_start column in fnd_concurrent_queues.

Be aware that if a TCP failure is not detected, failover will not occur. The following excerpt from a concurrent manager log shows:

fdpsrp() (running_processes correction):
ICM cannot obtain exclusive lock on FND_CONCURRENT_QUEUES
Oracle error code returned: 1
This message is information and does not indicate a problem with CP functionality.

remote call function (FNDIMON)
15-AUG-2008 10:06:02 - Function to call: PingProcess

The PingProcess continues until the CP processes resume, or until a TCP failure is detected and failover begins.

84

11i PCP Failure

• TCP Failure
• ICM Lock is released, FNDIMON pings ICM node, if ping fails, check PMON
• PMON detects a “dead process”, crashed ICM
• reviver.sh
• DCD

85

R12 PCP Failure

• TCP Failure
• PMON detects a “dead process”
• ICM Shutdown
– Look for error messages ORA-3113, ORA-3114 or ORA-1041
• reviver.sh
• DCD

86

Test PCP Failover Parameters

• Test to explore the effect of DCD, PMON and TCP failover methods.
• Variables: sqlnet.expire_time, pmon sleep and number of cycles, and the following TCP Keepalive parameters:
• tcp_keepalive_time
• tcp_keepalive_intvl
• tcp_keepalive_probes
• tcp_retries1 (default: 3, new value 2)
• tcp_retries2 (default: 15, new value 2)
• tcp_syn_retries (default: 5, new value 2)

87

Failover Test Results

Failover time / Failback time   Expire_time   PMON Sleep   PMON Cycles   tcp KA time   tcp KA intvl   tcp KA probes   tcp retries   tcp retries2   tcp syn retries
241 secs /                      1 minute      30 secs      4             200           20             2               3             15             5
250 secs / 50 secs              5 minutes     30 secs      4             200           20             2               3             15             5
262 secs / 100 secs             10 minutes    30 secs      4             200           20             2               3             15             5
285 secs / 35 min               10 minutes    30 secs      4             1000          60             10              3             15             5
7 secs / 40 secs                10 minutes    30 secs      4             200           20             2               2             2              2
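On Linux, an idle connection to a vanished peer is declared dead after tcp_keepalive_time seconds of idleness plus tcp_keepalive_probes unanswered probes sent tcp_keepalive_intvl seconds apart. The sketch below works through that arithmetic for the keepalive values used in most of the tests, assuming the table values are in seconds; the function name is illustrative.

```python
def keepalive_detection_secs(time_secs, intvl_secs, probes):
    """Worst-case seconds for TCP keepalive to declare an idle peer dead."""
    return time_secs + intvl_secs * probes

# tcp_keepalive_time=200, tcp_keepalive_intvl=20, tcp_keepalive_probes=2
print(keepalive_detection_secs(200, 20, 2))  # 240
```

The 240-second result is in line with the 241-second failover measured in the first row. The last row suggests that lowering the tcp retries settings, rather than the keepalive settings, is what brought the failover time down to 7 seconds.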


88

All Services are UP

89

Concurrent Managers

• Processes - Actual = 1 and Target = 1, manager is running
• Processes - Actual = 0 and Target = 1, manager is not running

90

Actual Processes = 0

Example of Actual Processes = 0: in this example the CRM is not running.

91

PCP Setup

PCP setup – this screen is continued on the next slide

92

Primary and Secondary Nodes

The CRM, ICM and Standard Manager will fail over.

Any concurrent programs not assigned to the Standard Manager will not fail over.

93

TCP Failure

• TCP disconnected at 2:57:25
• 10 seconds after the TCP connection was pulled, OAM reported the status above.
• It took 10 seconds for OAM to register a failure of services on RH9.

94

CRM is DOWN

If any of the subordinate services fail, the failure rolls up to the Dashboard.

95

CRM Failure

CRM has failed, Actual Processes = 0

96

PCP Failover from RH9 to RH7

Adding Node:(RH9), to unavailable list
Found dead process: spid=(9696), cpid=(1321449), ORA pid=(80), manager=(0/0)
Found dead process: spid=(9784), cpid=(1321458), ORA pid=(114), manager=(0/0)
Found dead process: spid=(9783), cpid=(1321457), ORA pid=(104), manager=(0/0)
Found running request 4413565 attached to dead manager process.
Attempting to restart request.
Internal Concurrent Manager found node RH9 to be down. Adding it to the list of unavailable nodes.
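When reviewing a long ICM log, it can help to pull out just the dead-process entries. The following is a minimal sketch that assumes the log lines have exactly the format shown above; the pattern and the function name `dead_processes` are illustrative.

```python
import re

# Matches entries such as:
# Found dead process: spid=(9696), cpid=(1321449), ORA pid=(80), manager=(0/0)
DEAD_PROCESS = re.compile(
    r"Found dead process: spid=\((\d+)\), cpid=\((\d+)\), ORA pid=\((\d+)\)"
)

def dead_processes(log_text):
    """Return a list of (spid, cpid, ora_pid) tuples found in the log text."""
    return [tuple(int(g) for g in m.groups())
            for m in DEAD_PROCESS.finditer(log_text)]

log = (
    "Adding Node:(RH9), to unavailable list\n"
    "Found dead process: spid=(9696), cpid=(1321449), ORA pid=(80), manager=(0/0)\n"
    "Found dead process: spid=(9784), cpid=(1321458), ORA pid=(114), manager=(0/0)\n"
    "Found dead process: spid=(9783), cpid=(1321457), ORA pid=(104), manager=(0/0)\n"
)

for spid, cpid, ora_pid in dead_processes(log):
    print(spid, cpid, ora_pid)
```

Because the pattern is applied with `finditer`, it also works when several entries run together on one line.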

97

GSM tries to restart the services

TCP and TNS are unavailable:

Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.
Check that your system has enough resources to start a concurrent manager process.
Contac : 18-JAN-2009 21:43:42

Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.
Check that your system has enough resources to start a concurrent manager process.
Contac : 18-JAN-2009 21:43:42

Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.

98

ICM and CRM are DOWN

99

RH9 is DOWN

Not really down, just not on the network

100

PCP is DOWN

This is momentary as GSM figures out what to do.

101

Failover to Secondary Node

The ICM and CRM failed over to RH7 in about 1 minute and 30 seconds

102

Failover from RH9 to RH7

Starting Internal Concurrent Manager Concurrent Manager : 18-JAN-2009 21:51:23
: Started ICM on Target RH7.
Process monitor session ended : 18-JAN-2009 21:52:53
: Migration of ICM has completed.
Shutting down Internal Concurrent Manager : 18-JAN-2009 21:53:23
The VIS_0118@VIS internal concurrent manager has terminated successfully - exiting.
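The failover duration can be read directly from the log timestamps. The sketch below parses the two timestamps shown in this log and subtracts them; it assumes an English locale so that `%b` matches month abbreviations such as JAN, and the helper name `parse_ts` is illustrative.

```python
from datetime import datetime

def parse_ts(text):
    """Parse a concurrent manager log timestamp such as 18-JAN-2009 21:51:23."""
    return datetime.strptime(text, "%d-%b-%Y %H:%M:%S")

started = parse_ts("18-JAN-2009 21:51:23")   # ICM start on RH7
ended = parse_ts("18-JAN-2009 21:52:53")     # process monitor session ended

print((ended - started).total_seconds())  # 90.0
```

The 90-second difference matches the roughly 1 minute and 30 seconds reported for the ICM and CRM failover to RH7.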

103

ICM Failover to RH7

Starting Internal Concurrent Manager Concurrent






104

RH9 not available

105

Request Failover

106

Standard Manager Failover Configuration

• Note that the Inventory Manager, MRP Manager and OAM Metrics Collection Manager are not set up to fail over.

107

Managers with a Secondary Node

• Note that the Inventory Manager, MRP Manager and OAM Metrics Collection Manager are not set up to fail over.

108

Failback

FAILBACK – tcp connected at 31:40

The host, RH9, becomes available on OAM about 2 minutes later.

109

RH9 available

110

ICM Failback

111

Concurrent Manager Log

Starting Internal Concurrent Manager Concurrent






113

Failback Complete

Total Failback Time 3 minutes and 45 seconds

114

Standard Manager before Failover

The Standard Manager has 3 Actual and Target processes.

115

Standard Manager is DOWN

116

Standard Manager has 2 Processes on Failover

After 3 minutes and 30 seconds the Standard Manager started on RH7

117

Shutdown of CP

118

Concurrent Processing Load Balancing

Two types of Load Balancing

• Load Balancing with both nodes running – no failover

• Load Balancing during failover

119

PCP Load Balancing

• One of the benefits Parallel Concurrent Processing provides is failover in case of node failure:

– maintain throughput and keep the business running during node failures.

• When a node fails, the processes that were running on the failed node are restarted on secondary nodes.

• However, when a resource-intensive node fails over, its workload may overload the secondary node.

120

PCP Load Balancing

• If too many processes are running on the secondary node when the primary node fails over, the secondary node may not have the capacity to process the requests from additional concurrent managers.

• R12 introduces Failover Sensitive Workshifts. This enhancement allows the System Administrator to configure how many processes fail over for each workshift. With this added control, System Administrators can enjoy the benefits of PCP failover without risking performance issues through overloaded resources.

121

R12 Failover Sensitive Workshifts

122

Failover Sensitive Workshifts

123


• Conversely, if a failover occurs from node 1 to node 2, we may want to reduce the failover processes; however, this doesn’t work.

• The “failover processes” value takes effect only if the node fails.

124

Failover Processes

The PO Document Approval Manager and the Standard Manager will reduce the number of processes when RH7 fails. When RH9 fails, the number of failover processes for managers that run on RH7 is not reduced.

125


It’s clear: to run an R11i or R12 system during a failover, there are two choices:

• Run the servers at 35% or less utilization
• Reduce the number of processes that are allowed during failover

For most businesses the second option is the most practical.
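The first option can be illustrated with simple arithmetic. This is a rough sketch that assumes two identical nodes and assumes the failed node's entire load moves to the survivor unchanged; the function name is illustrative and real workloads will not add up this neatly.

```python
def utilization_after_failover(node_utilization):
    """Utilization of the surviving node when its twin's load moves onto it.

    Assumes two identical nodes running at the same utilization (0.0 to 1.0).
    """
    return node_utilization * 2

print(utilization_after_failover(0.35))  # 0.7
print(utilization_after_failover(0.60))  # 1.2, more than one node can serve
```

Under these assumptions, nodes running at 35% leave the survivor at 70% after a failover, while nodes running at 60% would need the second option, fewer failover processes, to stay within capacity.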

