exadata cell metrics

Upload: monowar-mukul-ocm-11g-dba

Post on 13-May-2015

Exadata Cell metrics

Exadata CELLSRV periodically records important runtime properties, called metrics, for cell components such as CPUs, cell disks, grid disks, flash cache, and IORM statistics. These metrics are recorded in memory. Based on its own metric collection schedule, the Management Server (MS) gets the set of metric data accumulated by CELLSRV.

Management Server (MS) provides Exadata cell management and configuration functions. MS is responsible for sending alerts, and it collects some statistics in addition to those collected by CELLSRV. Each cell is managed individually with the Exadata cell command-line interface (CellCLI).

Locate the MS process
--------------------------------
$ ps -ef | grep ms.err
1000 3940 3723 0 01:42 pts/0 00:00:00 grep ms.err
root 24541 24540 0 Sep28 ? 00:01:32 /usr/java/jdk1.5.0_15/bin/java -Xms256m -Xmx512m -Djava.library.path=/opt/oracle/

Check the Alert History
------------------------
MS triggers an alert when it discovers a:

• Cell hardware issue
• Cell software or configuration issue
• CELLSRV internal error
• Metric that has exceeded a threshold defined in the cell

CellCLI> list alerthistory
 1   2013-09-26T22:51:15-04:00 critical "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
 2_1 2013-09-26T22:52:07-04:00 warning  "Hugepage allocation failure in service cellsrv. Number of Hugepages allocated is 0, failed to allocate 110"
 3   2013-09-26T22:54:08-04:00 critical "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
 4   2013-09-28T13:05:21-04:00 critical "RS-7445 [Serv RS_BACKUP is absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
 5   2013-09-28T22:05:38-04:00 critical "RS-7445 [Serv CELLSRV is absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"

Create a threshold to check for disk I/O errors
----------------------------
CellCLI> create threshold CD_IO_ERRS_MIN comparison='>', warning=0, -
> occurrences=1, observation=1
Threshold CD_IO_ERRS_MIN successfully created

CellCLI> list threshold CD_IO_ERRS_MIN detail
 name:        CD_IO_ERRS_MIN
 comparison:  >
 observation: 1
 occurrences: 1
 warning:     0.0

CellCLI> list alerthistory where severity='warning';
 2_1 2013-09-26T23:02:12-04:00 warning  "Hugepage allocation failure in service cellsrv. Number of Hugepages allocated is 0, failed to allocate 110"

CellCLI> list alerthistory where severity='critical';
 1   2013-09-26T23:01:18-04:00 critical "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
 3   2013-09-26T23:04:11-04:00 critical "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
 4   2013-10-01T06:42:39-04:00 critical "RS-7445 [Serv CELLSRV is absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"

CellCLI> list alerthistory where severity='clear';

CellCLI> list alerthistory where severity='info';
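When the alert history has been captured to a text file (for example from a dcli run), the same severity filtering can be done offline. The following is a minimal Python sketch, assuming each alert occupies one line in the layout shown above (name, timestamp, severity, quoted message). The regular expression and the function name group_by_severity are illustrative and not part of CellCLI.

```python
import re
from collections import defaultdict

# Hypothetical pattern for one line of `list alerthistory` output as shown
# in this document: name, ISO 8601 timestamp, severity, quoted message.
ALERT_LINE = re.compile(
    r'^\s*(?P<name>\S+)\s+(?P<time>\S+)\s+(?P<severity>\w+)\s+"(?P<message>.*)"\s*$'
)


def group_by_severity(text):
    """Return {severity: [(name, time, message), ...]} for parseable lines."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = ALERT_LINE.match(line)
        if match:
            groups[match["severity"]].append(
                (match["name"], match["time"], match["message"])
            )
    return dict(groups)


# Sample lines copied from the alert history listings above.
sample = '''
 1 2013-09-26T23:01:18-04:00 critical "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
 2_1 2013-09-26T23:02:12-04:00 warning "Hugepage allocation failure in service cellsrv. Number of Hugepages allocated is 0, failed to allocate 110"
 4 2013-10-01T06:42:39-04:00 critical "RS-7445 [Serv CELLSRV is absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
'''

alerts = group_by_severity(sample)
for severity, entries in sorted(alerts.items()):
    print(severity, [name for name, _, _ in entries])
```

Multi-line alert messages, such as the threshold alerts that CellCLI prints over several lines, would not match this single-line pattern and would need to be joined first.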

MetricType:
- cumulative: Cumulative statistics since the metric was created
- instantaneous: Value at the time that the metric is collected
- rate: Rates computed by averaging statistics over observation periods
- transition: Collected at the time when the value of the metric has changed; typically captures important transitions in hardware status

CellCLI> list metriccurrent attributes name,metrictype,metricobjectname,metricvalue,collectionTime where metrictype='Rate'

Monitoring Exadata with Active Requests
----------------------------------------
CellCLI> LIST ACTIVEREQUEST WHERE IoType = 'predicate pushing' DETAIL

ioType identifies the type of active request. Possible values are file initialization, read, write, predicate pushing, filtered backup read, and predicate push read.

Check retention period for metric and alert history
-------------------------------------------------------
CellCLI> list cell attributes metricHistoryDays
 7

CellCLI> alter cell metrichistorydays=5
Cell qr03cel02 successfully altered

CellCLI> list cell attributes metrichistorydays
 5

CellCLI> list cell attributes name,interconnectCount
 qr03cel02 2

Configure the cell to automatically send an email and/or SNMP message to a designated set of Exadata administrators
-------------------------------------------------------------------------------------------------------------------
alter cell smtpServer='my_mail.example.com', -
 smtpFromAddr='[email protected]', -
 smtpFrom='monowar mukul', -
 smtpToAddr='[email protected]', -
 notificationPolicy='critical,warning,clear', -
 notificationMethod='mail'

Watching for Undelivered Alerts
---------------------------------
It is important to check the storage servers periodically to make sure that raised alerts have actually been delivered (via email and/or to Grid or Cloud Control).

CellCLI> LIST ALERTHISTORY where notificationState != 1 and examinedBy=''

dcli -g cell_group cellcli -e "LIST ALERTHISTORY where notificationState != 1 and examinedBy=\'\' "
 1   2013-09-26T23:01:18-04:00 critical "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
 2_1 2013-09-26T23:02:12-04:00 warning  "Hugepage allocation failure in service cellsrv. Number of Hugepages allocated is 0, failed to allocate 110"
 3   2013-09-26T23:04:11-04:00 critical "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
 4   2013-10-01T06:42:39-04:00 critical "RS-7445 [Serv CELLSRV is absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"

Drop Alert History
---------------------
CellCLI> drop alerthistory all
Alert 1 successfully dropped
Alert 2_1 successfully dropped
Alert 3 successfully dropped

Checking Threshold
-------------------
CellCLI> list threshold
 cl_fsut./
 cl_fsut./u01

CellCLI> create threshold cl_fsut."/u01" comparison='>', warning=80
Threshold cl_fsut."/u01" successfully created

CellCLI> list threshold detail
 name:       cl_fsut./
 comparison: >
 warning:    70.0

 name:       cl_fsut./u01
 comparison: >
 warning:    80.0

CellCLI> alter threshold cl_fsut."/" comparison='>', warning=50
Threshold cl_fsut."/" successfully altered

CellCLI> list threshold detail
 name:       cl_fsut./
 comparison: >
 warning:    50.0

 name:       cl_fsut./u01
 comparison: >
 warning:    80.0

Execute the following command inside the cell operating system. It creates a 512-MB file on the root file system, which increases the utilization metric. After the metric crosses the threshold, an alert is generated.

$ dd if=/dev/zero of=/tmp/file.out bs=1024 count=500000

[celladmin@qr03cel02 ~]$ dd if=/dev/zero of=/tmp/file.out bs=1024 count=500000
500000+0 records in
500000+0 records out
512000000 bytes (512 MB) copied, 4.25551 seconds, 120 MB/s
[celladmin@qr03cel02 ~]$ cellcli
CellCLI: Release 11.2.3.1.0 - Production on Mon Sep 30 01:36:45 EDT 2013

Copyright (c) 2007, 2011, Oracle. All rights reserved.
Cell Efficiency Ratio: 26M

CellCLI> list alerthistory
 1_1 2013-09-30T01:32:46-04:00 warning "The warning threshold for the following metric has been crossed.
 Metric Name : CL_FSUT
 Metric Description : Percentage of total space on this file system that is currently used
 Object Name : /
 Current Value : 56.0 %
 Threshold Value : 50.0 % "

CellCLI> alter alerthistory 1_1 examinedby='investigator'
Alert 1_1 successfully altered

CellCLI> list alerthistory detail
 name:              1_1
 alertMessage:      "The warning threshold for the following metric has been crossed.
 Metric Name : CL_FSUT
 Metric Description : Percentage of total space on this file system that is currently used
 Object Name : /
 Current Value : 56.0 %
 Threshold Value : 50.0 % "
 alertSequenceID:   1
 alertShortName:    CL_FSUT
 alertType:         Stateful
 beginTime:         2013-09-30T01:32:46-04:00
 endTime:
 examinedBy:        investigator
 metricObjectName:  "/"
 metricValue:       56.0
 notificationState: 0
 sequenceBeginTime: 2013-09-30T01:32:46-04:00
 severity:          warning
 alertAction:       "Examine the metric value that is violating the specified threshold, and take appropriate actions if needed."
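The CL_FSUT alert above fired because the collected value (56.0 %) satisfied the threshold's comparison ('>') against its warning level (50.0 %). The Python sketch below models only that single comparison. It is a simplification for illustration, not how MS is implemented: it ignores the occurrences and observation attributes, the critical parameter and the operators other than '>' are assumptions not exercised in this document, and the function name evaluate is hypothetical.

```python
import operator

# Comparison operators keyed by their threshold spelling. Only '>' appears
# in this document; the others are included as an assumption.
COMPARISONS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


def evaluate(value, comparison, warning=None, critical=None):
    """Return 'critical', 'warning', or 'normal' for a single metric sample."""
    crossed = COMPARISONS[comparison]
    # Check the more severe level first so it takes precedence.
    if critical is not None and crossed(value, critical):
        return "critical"
    if warning is not None and crossed(value, warning):
        return "warning"
    return "normal"


# The CL_FSUT sample for "/" from the alert above: 56.0 against warning=50.
print(evaluate(56.0, ">", warning=50.0))
# A value below the warning level raises nothing.
print(evaluate(45.0, ">", warning=50.0))
```

With warning=0 and comparison '>', as in the CD_IO_ERRS_MIN threshold, any non-zero error count in a sample would be classified as a warning by this model.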

The value of the name attribute is a composite of abbreviations.
• CL_ (cell)
• CD_ (cell disk)
• GD_ (grid disk)
• FC_ (flash cache)
• DB_ (database)
• CG_ (consumer group)
• CT_ (category)
• N_ (interconnect network)

-- Monitoring IORM with cellcli command.
I/O-related metrics:
• IO_RQ (number of requests)
• IO_BY (number of MB)
• IO_TM (I/O latency)
• IO_WT (I/O wait time)

Suffixes:
• _R for read
• _W for write
• _SM for small I/O
• _LG for large I/O
• _SEC to signify per second
• _RQ to signify per request

Examples:
• CD_IO_WT_R_SM is the cell disk (CD_) I/O wait time (IO_WT) to read (_R) small blocks (_SM).
• GD_IO_RQ_W_LG_SEC is the number of requests (IO_RQ) to write (_W) large blocks (_LG) per second (_SEC) on a grid disk (GD_).
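The naming scheme is regular enough to decode mechanically. The following Python sketch builds lookup tables from the abbreviations listed above and splits a metric name on underscores; the function name decode is hypothetical, and any token not listed in this document (for example FSUT in CL_FSUT) is passed through unchanged rather than guessed.

```python
# Lookup tables taken from the abbreviation lists in this document.
OBJECT_PREFIXES = {
    "CL": "cell",
    "CD": "cell disk",
    "GD": "grid disk",
    "FC": "flash cache",
    "DB": "database",
    "CG": "consumer group",
    "CT": "category",
    "N": "interconnect network",
}
IO_MEASURES = {
    "IO_RQ": "number of requests",
    "IO_BY": "number of MB",
    "IO_TM": "I/O latency",
    "IO_WT": "I/O wait time",
}
QUALIFIERS = {
    "R": "read",
    "W": "write",
    "SM": "small I/O",
    "LG": "large I/O",
    "SEC": "per second",
    "RQ": "per request",
}


def decode(name):
    """Describe a metric name such as CD_IO_WT_R_SM in plain words."""
    parts = name.upper().split("_")
    described = [OBJECT_PREFIXES.get(parts[0], parts[0])]
    rest = parts[1:]
    # The I/O measure spans two tokens (IO + RQ/BY/TM/WT), so test the pair.
    if len(rest) >= 2 and "_".join(rest[:2]) in IO_MEASURES:
        described.append(IO_MEASURES["_".join(rest[:2])])
        rest = rest[2:]
    # Remaining tokens are qualifiers; unknown tokens are kept as-is.
    described.extend(QUALIFIERS.get(token, token) for token in rest)
    return ", ".join(described)


print(decode("CD_IO_WT_R_SM"))
print(decode("GD_IO_RQ_W_LG_SEC"))
```

The measure is matched before the qualifiers so that the RQ in IO_RQ (number of requests) is not confused with a trailing _RQ (per request).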