Exadata MAA Best Practices Series
Session #11: Troubleshooting Exadata
Dan Norris
X Team
Exadata MAA Best Practices Series
1. E-Business Suite on Exadata
2. Siebel on Exadata
3. PeopleSoft on Exadata
4. Exadata and OLTP Applications
5. Using Resource Manager on Exadata
6. Migrating to Exadata
7. Using DBFS on Exadata
8. Exadata Monitoring
9. Exadata Backup & Recovery
10. Exadata MAA
11. Troubleshooting Exadata
12. Exadata Patching & Upgrades
13. Exadata Health Check
2
Troubleshooting Exadata
Agenda
• Key Points and Customer Takeaways
• Business Takeaways
• Best Practices Takeaways
3
Key Points and Customer Takeaways
4
Troubleshooting Exadata
Key Points
• Leverage existing troubleshooting knowledge
• Involve the right personnel
• Become familiar with the proper tools and techniques
5
Key Point #1
Leverage existing troubleshooting knowledge
Business Value
Standard components = reuse existing troubleshooting skills
6
Leverage existing knowledge
• Standard components
• Database and networking knowledge applies
• OS utilities are the same as on any non-Exadata database machine server
7
Leverage existing knowledge
A few examples
• Oracle Enterprise Linux or Oracle Solaris
– Configuration, diagnostics, performance data all the same
• Oracle Database
– AWR, log files, trace files all the same
• Oracle Grid Infrastructure
– Log files, trace files, configuration all the same
8
Key Point #2
Involve the right personnel
Business Value
The fastest way to resolve problems is to ask the right personnel to help
9
Involve the Right Personnel
• DBAs are best to troubleshoot database issues
• Sysadmins are best to troubleshoot OS issues
• Network admins are best to troubleshoot network issues
• All can stretch and learn, but that takes time
• Establish a triage team to expedite diagnosis
10
Involve the Right Personnel
Examples
• Sysadmins
– DO: Update RPM packages and kernel versions
– DON'T: Understand most Oracle database wait events
• DBAs
– DO: Analyze wait events in an Oracle instance
– DON'T: Set shell limits and kernel parameters
• Network admins
– DO: Configure routes and subnets
– DON'T: Configure ASM diskgroups
11
Key Point #3
Become familiar with the proper tools and techniques
Business Value
The right tools provide the best information leading to the fastest and lowest-risk resolution
12
Become Familiar With the Proper Tools
• Traditional database and OS tools may be used
• There are Exadata-specific tools and utilities
• Start with high-level tools, then drill down
• Know what the tools do
• Establish baselines
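One way to make "establish baselines" concrete is to save the output of a validation command while the system is healthy and compare later captures against it. The following is a minimal sketch, not an Oracle-provided tool: the function name and the two captures (including the changed temperature reading) are hypothetical and exist only for illustration.

```python
import difflib

def changes_since_baseline(baseline: str, current: str) -> list:
    """Return the lines removed (-) or added (+) relative to the baseline."""
    delta = difflib.unified_diff(
        baseline.splitlines(), current.splitlines(),
        fromfile="baseline", tofile="current", lineterm="", n=0,
    )
    body = list(delta)[2:]  # drop the '---' / '+++' file header lines
    return [line for line in body if not line.startswith("@@")]

# Hypothetical captures of the same attributes taken at two points in time.
baseline = "fanStatus: normal\ntemperatureReading: 22.0\ntemperatureStatus: normal"
current = "fanStatus: normal\ntemperatureReading: 31.0\ntemperatureStatus: normal"

for line in changes_since_baseline(baseline, current):
    print(line)
```

An empty result means the capture still matches the baseline; any other result is a starting point for drilling down.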
13
Become Familiar With the Proper Tools
Checking Oracle Clusterware
• Oracle Clusterware basic validation commands:
crsctl stat res -t
crsctl query css votedisk
crsctl check cluster -all
• Logfile (local on each node): <GRID_HOME>/log/<hostname>/alert<hostname>.log
• Doc: http://download.oracle.com/docs/cd/E11882_01/rac.112/e16794/crsref.htm
14
Become Familiar With the Proper Tools
Checking Oracle Clusterware
$ /u01/app/11.2.0/grid/bin/crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       dbm1db01
               ONLINE  ONLINE       dbm1db02
ora.LISTENER.lsnr
               ONLINE  ONLINE       dbm1db01
               ONLINE  ONLINE       dbm1db02
ora.RECO.dg
               ONLINE  ONLINE       dbm1db01
               ONLINE  ONLINE       dbm1db02
ora.SYSTEMDG.dg
               ONLINE  ONLINE       dbm1db01
               ONLINE  ONLINE       dbm1db02
ora.asm
               ONLINE  ONLINE       dbm1db01                 Started
               ONLINE  ONLINE       dbm1db02
<remaining output omitted>
15
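When the resource list is long, it can help to scan the `crsctl stat res -t` output for rows whose TARGET and STATE differ. The sketch below parses captured text only and assumes the 11.2 layout shown on this slide; the function name is hypothetical, and the sample is an excerpt of the output above.

```python
SAMPLE = """\
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       dbm1db01
               ONLINE  ONLINE       dbm1db02
ora.asm
               ONLINE  ONLINE       dbm1db01                 Started
               ONLINE  ONLINE       dbm1db02
"""

def target_state_mismatches(output):
    """Return (resource, server, target, state) where TARGET differs from STATE."""
    mismatches, resource = [], None
    for line in output.splitlines():
        fields = line.split()
        if not fields or set(line.strip()) == {"-"}:
            continue  # blank line or separator rule
        if not line[0].isspace():
            resource = fields[0]  # unindented line: a resource name (or a heading)
            continue
        if fields[0].isdigit():
            fields = fields[1:]  # cluster resource rows start with an instance number
        if len(fields) >= 2 and fields[0] != fields[1]:
            server = fields[2] if len(fields) > 2 else None
            mismatches.append((resource, server, fields[0], fields[1]))
    return mismatches

print(target_state_mismatches(SAMPLE))
```

For the healthy sample the list is empty. A row such as TARGET=ONLINE with STATE=OFFLINE would be reported together with its resource name and server.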
Become Familiar With the Proper Tools
Checking Oracle Clusterware
$ /u01/app/11.2.0/grid/bin/crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   2055c345d1b14f9dbfed3e5da092de61 (o/192.168.74.181/SYSTEMDG_CD_02_dbm1cel01) [SYSTEMDG]
 2. ONLINE   350147b3b3f84f2ebfb30b1f530bef66 (o/192.168.74.182/SYSTEMDG_CD_04_dbm1cel02) [SYSTEMDG]
 3. ONLINE   98667ccebb844f2abf98790a0b7a79f4 (o/192.168.74.183/SYSTEMDG_CD_02_dbm1cel03) [SYSTEMDG]
Located 3 voting disk(s).
$
$ /u01/app/11.2.0/grid/bin/crsctl check cluster -all
**************************************************************
dbm1db01:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
dbm1db02:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
16
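The `crsctl check cluster -all` output groups CRS messages under each node name, so a per-node summary is easy to derive. This is a rough sketch under the assumption that a healthy service line ends with "is online", as in the output above; the function name is hypothetical.

```python
SAMPLE = """\
**************************************************************
dbm1db01:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
dbm1db02:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
"""

def services_not_online(output):
    """Map node name -> CRS messages that do not report 'is online'."""
    problems, node = {}, None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("CRS-"):
            if not line.endswith("is online"):
                problems.setdefault(node, []).append(line)
        elif line.endswith(":"):
            node = line[:-1]  # a 'hostname:' line starts a new node section
    return problems

print(services_not_online(SAMPLE))
```

An empty dictionary means every listed service on every node reported online.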
Become Familiar With the Proper Tools
Checking Oracle ASM
• Oracle ASM basic validation commands:
asmcmd lsdg
srvctl status diskgroup -g <dgname>
select * from gv$asm_diskgroup;
• Logfile (local on each node): <DIAG_DEST>/asm/+asm/<inst_name>/trace/alert<inst_name>.log
• Docs: http://download.oracle.com/docs/cd/E11882_01/server.112/e16102/asm_util004.htm#sthref1059
http://download.oracle.com/docs/cd/E11882_01/rac.112/e16795/srvctladmin.htm#BAJJCCGJ
17
Become Familiar With the Proper Tools
Checking Oracle ASM
$ asmcmd lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB   Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  N         512   4096  4194304  14745600  14382056          1340509         6520773              0             N  DATA/
MOUNTED  NORMAL  N         512   4096  4194304   7402752   6886408           672977         3106715              0             N  RECO/
MOUNTED  NORMAL  N         512   4096  4194304    894720    893120            81338          405891              0             Y  SYSTEMDG/
$
$ srvctl status diskgroup -g data
Disk Group data is running on dbm1db01,dbm1db02
$
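Because `asmcmd lsdg` prints a header row followed by one row per disk group, the output can be turned into records keyed by column name. The sketch below does that and applies three illustrative checks (not MOUNTED, offline disks present, negative Usable_file_MB); the function names and the choice of checks are assumptions, not an Oracle recommendation.

```python
SAMPLE = """\
State    Type    Rebal  Sector  Block       AU  Total_MB   Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  N         512   4096  4194304  14745600  14382056          1340509         6520773              0             N  DATA/
MOUNTED  NORMAL  N         512   4096  4194304   7402752   6886408           672977         3106715              0             N  RECO/
MOUNTED  NORMAL  N         512   4096  4194304    894720    893120            81338          405891              0             Y  SYSTEMDG/
"""

def parse_lsdg(output):
    """Return one dict per disk group, keyed by the header column names."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]

def needs_attention(group):
    """Illustrative checks only; pick thresholds that match your own baseline."""
    return (group["State"] != "MOUNTED"
            or int(group["Offline_disks"]) > 0
            or int(group["Usable_file_MB"]) < 0)

groups = parse_lsdg(SAMPLE)
flagged = [g["Name"] for g in groups if needs_attention(g)]
print(flagged)
```

Keying on the header row rather than fixed positions keeps the sketch tolerant of versions that print additional columns.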
18
Become Familiar With the Proper Tools
Checking Oracle ASM
SQL> select inst_id,name,state,total_mb,usable_file_mb,offline_disks
2> from gv$asm_diskgroup;

   INST_ID NAME         STATE         TOTAL_MB USABLE_FILE_MB OFFLINE_DISKS
---------- ------------ ----------- ---------- -------------- -------------
         2 DATA         MOUNTED       14745600        6520773             0
         2 RECO         MOUNTED        7402752        3106715             0
         2 SYSTEMDG     MOUNTED         894720         405891             0
         1 DATA         MOUNTED       14745600        6520773             0
         1 RECO         MOUNTED        7402752        3106715             0
         1 SYSTEMDG     MOUNTED         894720         405891             0

6 rows selected.

SQL>
19
Become Familiar With the Proper Tools
Checking Oracle Exadata Storage Cell
• Oracle Exadata storage cell validation commands:
cellcli -e list alerthistory
cellcli -e list cell detail
imageinfo
imagehistory
/var/log/cellos/validations.log
• Logfiles:
$ADR_BASE/diag/asm/cell/<hostname>/trace/alert.log
/var/log/messages
20
Become Familiar With the Proper Tools
Checking Oracle Exadata Storage Cell
[celladmin@dbm1cel01 ~]$ cellcli -e list alerthistory
<some output omitted>
69 2011-01-17T02:00:24-08:00 info "BBU on disk contoller at adapter 0 is going into a learn cycle. All Logical Volumes on harddisks will go into WriteThrough caching mode. Write Throughput will be lower."
70_1 2011-01-17T04:37:05-08:00 critical "All Logical drives are in WriteThrough caching mode. Either battery is in a learn cycle or it needs to be replaced. Please contact Oracle Support"
70_2 2011-01-17T06:00:25-08:00 clear "Battery is back to a good state"
[celladmin@dbm1cel01 ~]$
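In the alert history above, alert 70 appears twice (70_1 critical, 70_2 clear), which suggests a simple triage step: group entries by alert number and report only those whose most recent entry is still critical or warning. The sketch below assumes each alert occupies one line in the layout shown (name, timestamp, severity, quoted message); the regular expression and function name are illustrative.

```python
import re

SAMPLE = '''\
69 2011-01-17T02:00:24-08:00 info "BBU on disk contoller at adapter 0 is going into a learn cycle. All Logical Volumes on harddisks will go into WriteThrough caching mode. Write Throughput will be lower."
70_1 2011-01-17T04:37:05-08:00 critical "All Logical drives are in WriteThrough caching mode. Either battery is in a learn cycle or it needs to be replaced. Please contact Oracle Support"
70_2 2011-01-17T06:00:25-08:00 clear "Battery is back to a good state"
'''

# name (with optional _sequence), timestamp, severity, quoted message
ALERT = re.compile(r'^\s*(\d+)(?:_(\d+))?\s+(\S+)\s+(\w+)\s+"(.*)"\s*$')

def open_alerts(output):
    """Return alert numbers whose latest entry is still critical or warning."""
    history = {}
    for line in output.splitlines():
        match = ALERT.match(line)
        if match:
            alert_id, _seq, _when, severity, _message = match.groups()
            history.setdefault(alert_id, []).append(severity)
    return [a for a, sev in history.items() if sev[-1] in ("critical", "warning")]

print(open_alerts(SAMPLE))
```

For the sample the result is empty because the critical entry for alert 70 was followed by a clear.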
21
Become Familiar With the Proper Tools
Checking Oracle Exadata Storage Cell
[celladmin@dbm1cel01 ~]$ cellcli -e list cell detail
name: dbm1cel01
bmcType: IPMI
cellVersion: OSS_11.2.2.1.1_LINUX.X64_101105
cpuCount: 16
fanCount: 12/12
fanStatus: normal
id: <serialnumber removed>
interconnectCount: 6
interconnect1: bondib0
iormBoost: 0.0
ipaddress1: 192.168.74.181/22
kernelVersion: 2.6.18-194.3.1.0.3.el5
locatorLEDStatus: off
makeModel: SUN MICROSYSTEMS SUN FIRE X4275 SERVER SATA
metricHistoryDays: 7
notificationMethod: mail
notificationPolicy: critical,warning,clear
<continued on next slide>
22
Become Familiar With the Proper Tools
Checking Oracle Exadata Storage Cell
<continued from previous slide>
offloadEfficiency: 6.7G
powerCount: 2/2
powerStatus: normal
smtpFrom: "Oracle Database Machine1"
smtpFromAddr: [email protected]
smtpPort: 25
smtpServer: mail-relay.corp.com
smtpToAddr: [email protected],[email protected]
smtpUseSSL: FALSE
status: online
temperatureReading: 22.0
temperatureStatus: normal
upTime: 48 days, 4:01
cellsrvStatus: running
msStatus: running
rsStatus: running
[celladmin@dbm1cel01 ~]$
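The `list cell detail` output is a list of `attribute: value` pairs, so a quick health summary can compare a handful of status attributes against the values seen on a healthy cell. The sketch below uses the values from the output above as the expected set; that set, and the function names, are assumptions for illustration.

```python
SAMPLE = """\
name: dbm1cel01
fanStatus: normal
powerStatus: normal
status: online
temperatureReading: 22.0
temperatureStatus: normal
upTime: 48 days, 4:01
cellsrvStatus: running
msStatus: running
rsStatus: running
"""

# Values observed on the healthy cell shown above (an assumed baseline).
EXPECTED = {
    "status": "online",
    "fanStatus": "normal",
    "powerStatus": "normal",
    "temperatureStatus": "normal",
    "cellsrvStatus": "running",
    "msStatus": "running",
    "rsStatus": "running",
}

def parse_detail(output):
    """Split each 'attribute: value' line at the first colon."""
    attrs = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            attrs[key.strip()] = value.strip()
    return attrs

def unexpected(attrs):
    """Return the attributes whose value differs from the expected baseline."""
    return {k: attrs.get(k) for k, want in EXPECTED.items() if attrs.get(k) != want}

print(unexpected(parse_detail(SAMPLE)))
```

Splitting at the first colon keeps values such as `48 days, 4:01` intact.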
23
Become Familiar With the Proper Tools
Checking Oracle Exadata Storage Cell
$ imageinfo
Kernel version: 2.6.18-194.3.1.0.3.el5 #1 SMP Tue Aug 31 22:41:13 EDT 2010 x86_64
Cell version: OSS_11.2.2.1.1_LINUX.X64_101105
Cell rpm version: cell-11.2.2.1.1_LINUX.X64_101105-1

Active image version: 11.2.2.1.1.101105
Active image activated: 2010-11-24 23:31:47 -0500
Active image status: success
Active system partition on device: /dev/md6
Active software partition on device: /dev/md8

In partition rollback: Impossible

<continued on next slide>
24
Become Familiar With the Proper Tools
Checking Oracle Exadata Storage Cell
<continued from previous slide>
Cell boot usb partition: /dev/sdac1
Cell boot usb version: 11.2.2.1.1.101105

Inactive image version: 11.2.1.2.6
Inactive image activated: 2010-05-20 12:22:11 -0400
Inactive image status: success
Inactive system partition on device: /dev/md5
Inactive software partition on device: /dev/md7

Boot area has rollback archive for the version: 11.2.1.2.6
Rollback to the inactive partitions: Possible
25
Become Familiar With the Proper Tools
Checking InfiniBand Network
• InfiniBand network validation commands:
verify-topology
iblinkinfo
ibstatus
• Logfile: /var/log/messages
26
Become Familiar With the Proper Tools
Checking InfiniBand Network
# /opt/oracle.SupportTools/ibdiagtools/verify-topology
[ DB Machine Infiniband Cabling Topology Verification Tool ]
[Version 11.2.1.3.b]
External non-Exadata-image nodes found...Ignoring those
Looking at 1 rack(s).....
Spine switch check: Are any Exadata nodes connected ..............[SUCCESS]
Spine switch check: Any inter spine switch connections............[SUCCESS]
Spine switch check: Correct number of spine-leaf links............[SUCCESS]
Leaf switch check: Inter-leaf link check..........................[SUCCESS]
Leaf switch check: Correct number of leaf-spine connections.......[SUCCESS]
Check if all hosts have 2 CAs to different switches...............[SUCCESS]
Leaf switch check: cardinality and even distribution..............[SUCCESS]
#
May need to use "-t halfrack" or "-t quarterrack" in those configurations
27
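Each `verify-topology` check ends with a bracketed result, so the checks that did not report `[SUCCESS]` can be pulled out of a captured run. The sketch below assumes the line layout shown above (check name, a run of dots, bracketed status); what the tool prints for a failed check is not shown in this session, so the code simply treats any other bracketed status as worth a look.

```python
import re

SAMPLE = """\
[ DB Machine Infiniband Cabling Topology Verification Tool ]
[Version 11.2.1.3.b]
External non-Exadata-image nodes found...Ignoring those
Looking at 1 rack(s).....
Spine switch check: Are any Exadata nodes connected ..............[SUCCESS]
Spine switch check: Any inter spine switch connections............[SUCCESS]
Leaf switch check: Inter-leaf link check..........................[SUCCESS]
Check if all hosts have 2 CAs to different switches...............[SUCCESS]
"""

# check name, a run of at least two dots, then the bracketed status
CHECK = re.compile(r"^(.*?)\.{2,}\s*\[(\w+)\]\s*$")

def parse_checks(output):
    """Return (check name, status) for every line that ends in a bracketed status."""
    results = []
    for line in output.splitlines():
        match = CHECK.match(line)
        if match:
            results.append((match.group(1).strip(), match.group(2)))
    return results

checks = parse_checks(SAMPLE)
not_success = [(name, status) for name, status in checks if status != "SUCCESS"]
print(len(checks), not_success)
```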
Become Familiar With the Proper Tools
Checking InfiniBand Network
# iblinkinfo
Switch 0x002128469d83a0a0 SUN DCS 36P QDR dbm1sw-ib2.us.oracle.com:
  1   1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 35 1[ ] "dbm1cel02 C 192.168.73.97 HCA-1" ( )
  1   2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 39 1[ ] "dbm1cel01 C 192.168.73.96 HCA-1" ( )
<some output omitted>
  1   9[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 7 1[ ] "dbm1db03 S 192.168.73.90 HCA-1" ( )
  1  10[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 9 1[ ] "dbm1db02 S 192.168.73.89 HCA-1" ( )
  1  11[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
<some output omitted>
Switch 0x002128469d7da0a0 SUN DCS 36P QDR dbm1sw-ib3.us.oracle.com:
  4   1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 36 2[ ] "dbm1cel02 C 192.168.73.97 HCA-1" ( )
  4   2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 40 2[ ] "dbm1cel01 C 192.168.73.96 HCA-1" ( )
<some output omitted>
28
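The `iblinkinfo` listing mixes active links with unused switch ports that show `Down/ Polling`, so a list of ports that are not `Active/ LinkUp` is most useful when compared with a baseline of which ports are expected to be cabled. The sketch below assumes the line layout shown on this slide (other versions of the utility may format links differently); the regular expression and function name are illustrative.

```python
import re

SAMPLE = '''\
Switch 0x002128469d83a0a0 SUN DCS 36P QDR dbm1sw-ib2.us.oracle.com:
  1   1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 35 1[ ] "dbm1cel02 C 192.168.73.97 HCA-1" ( )
  1  10[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 9 1[ ] "dbm1db02 S 192.168.73.89 HCA-1" ( )
  1  11[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
Switch 0x002128469d7da0a0 SUN DCS 36P QDR dbm1sw-ib3.us.oracle.com:
  4   1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 36 2[ ] "dbm1cel02 C 192.168.73.97 HCA-1" ( )
'''

# LID, port number, then the 'state/ physical state' pair inside ==( ... )==>
LINK = re.compile(r"^\s*(\d+)\s+(\d+)\[.*?\]\s*==\(.*?(\w+)/\s*(\w+)\)==>")

def links_not_up(output):
    """Return (switch, port, state, phys_state) for ports not Active/LinkUp."""
    found, switch = [], None
    for line in output.splitlines():
        if line.startswith("Switch"):
            switch = line.split()[-1].rstrip(":")  # last token is the switch name
            continue
        match = LINK.match(line)
        if match and (match.group(3), match.group(4)) != ("Active", "LinkUp"):
            found.append((switch, int(match.group(2)), match.group(3), match.group(4)))
    return found

print(links_not_up(SAMPLE))
```

A port in this list is not necessarily a fault; an uncabled port reports the same state, which is why the baseline matters.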
Become Familiar With the Proper Tools
Checking InfiniBand Network
# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:000a:7871
        base lid:        0xd
        sm lid:          0x2f
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0002:c903:000a:7872
        base lid:        0xe
        sm lid:          0x2f
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
#
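On a host, `ibstatus` reports one block per HCA port, and both ports are expected to be `ACTIVE` and `LinkUp` as in the output above. The following sketch collects the attributes per port from captured text and lists any port that is not in that state; the function name is hypothetical and the parsing assumes the layout shown on this slide.

```python
import re

SAMPLE = """\
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:000a:7871
        base lid:        0xd
        sm lid:          0x2f
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0002:c903:000a:7872
        base lid:        0xe
        sm lid:          0x2f
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
"""

PORT = re.compile(r"Infiniband device '(\S+)' port (\d+) status:")

def port_states(output):
    """Return {(device, port): {attribute: value}} from ibstatus text."""
    ports, current = {}, None
    for line in output.splitlines():
        match = PORT.search(line)
        if match:
            current = ports.setdefault((match.group(1), int(match.group(2))), {})
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")  # split at the first colon only
            current[key.strip()] = value.strip()
    return ports

ports = port_states(SAMPLE)
inactive = [p for p, a in ports.items()
            if not a.get("state", "").endswith("ACTIVE")
            or not a.get("phys state", "").endswith("LinkUp")]
print(inactive)
```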
29
Become Familiar With the Proper Tools
Checking Hosts
• Host troubleshooting:
ipmitool sel list
/opt/oracle.oswatcher/osw/archive
• Logfile: /var/log/messages
30
Become Familiar With the Proper Tools
Checking Hosts
# ipmitool sel list
<some output omitted>
8be | 10/03/2010 | 01:21:44 | System Firmware Progress | Management controller initialization | Asserted
8bf | 10/03/2010 | 01:21:44 | System Firmware Progress | Secondary CPU Initialization | Asserted
8c0 | 10/03/2010 | 01:21:58 | System Firmware Progress | Video initialization | Asserted
8c1 | 10/03/2010 | 01:22:05 | System Firmware Progress | Keyboard controller initialization | Asserted
8c2 | 10/03/2010 | 01:22:10 | System Firmware Progress | Option ROM initialization | Asserted
8c3 | 10/03/2010 | 01:22:13 | System Firmware Progress | Option ROM initialization | Asserted
8c4 | 10/03/2010 | 01:22:16 | System Firmware Progress | Option ROM initialization | Asserted
8c5 | 10/03/2010 | 01:22:19 | System Firmware Progress | Option ROM initialization | Asserted
8c6 | 10/03/2010 | 01:22:22 | System Firmware Progress | Option ROM initialization | Asserted
8c7 | 10/03/2010 | 01:22:44 | System Firmware Progress | System boot initiated | Asserted
#
31
Become Familiar With the Proper Tools
Checking Hosts
# ls -l /opt/oracle.oswatcher/osw/archive/
total 228
drwxr-sr-x 2 root cellusers 20480 Jan 26 12:00 ExadataDiagCollect
drwxr-sr-x 2 root cellusers 24576 Jan 26 12:00 ExadataOSW
drwxr-sr-x 2 root cellusers 20480 Jan 26 12:00 ExadataRDS
drwxr-sr-x 2 root cellusers 20480 Jan 26 12:18 oswiostat
drwxr-sr-x 2 root cellusers 20480 Jan 26 12:00 oswmeminfo
drwxr-sr-x 2 root cellusers 20480 Jan 26 12:31 oswmpstat
drwxr-sr-x 2 root cellusers 20480 Jan 26 12:08 oswnetstat
drwxr-sr-x 2 root cellusers  4096 Jan 19 12:29 oswprvtnet
drwxr-sr-x 2 root cellusers 20480 Jan 26 12:00 oswps
drwxr-sr-x 2 root cellusers 20480 Jan 26 12:00 oswslabinfo
drwxr-sr-x 2 root cellusers 20480 Jan 26 12:33 oswtop
drwxr-sr-x 2 root cellusers 20480 Jan 26 12:18 oswvmstat
32
Business Takeaways
33
Troubleshooting Exadata
Business Takeaways
• Standard components = reuse existing knowledge
• Involving the right personnel = fastest resolution
• The right tools = fastest resolution + lowest risk
34
Best Practice Takeaways
35
Troubleshooting Exadata
Best Practice Takeaways
• Know your baselines
• Following Oracle Exadata and MAA Best Practices avoids need for many troubleshooting situations
• Oracle Sun Database Machine X2-2 Diagnosability and Troubleshooting Best Practices (Doc ID 1274324.1)
• Oracle Database Machine Monitoring Best Practices (Doc ID 1110675.1)
36
Appendix
37
Best Practices
Additional Resources sponsored by MAA and X-Team
• MAA and Exadata OTN website contains best practices and architectural solutions
– MAA OTN website: http://www.oracle.com/goto/maa
– Sun Oracle Database Machine and Exadata OTN website: http://www.oracle.com/technetwork/database/exadata/index-089737.html
• Openworld presentations
– http://openworld.vportal.net
• Oracle Exadata Best Practices (Doc ID 757552.1)
• Exadata Hardware Alert: All Logical Drives Are In Writethrough Caching Mode (Doc ID 1283341.1)
38
Sponsors
Exadata MAA Team and X Team
• Operational and configuration best practices
– Optimized and integrated for Exadata
– Generic practices for other platforms
– Examples: Migration, Backup/Recovery, Monitoring, Troubleshooting, Patching, MAA, Consolidation, Active Data Guard, Cloning/Reporting, Application Failover
• Applications MAA and Scalability
– Optimized and integrated for Exadata and Exalogic
– Examples: E-Business Suite, Siebel, Peoplesoft, Fusion Middleware
• Exadata Strategic Customer Program
39