inside oracle asm lc cern ukoug07
DESCRIPTION
inside Oracle ASMTRANSCRIPT
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it
A Closer Look inside Oracle ASM
UKOUG Conference 2007Luca Canali, CERN IT
LCG
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 2
Outline
• Oracle ASM for DBAs– Introduction and motivations
• ASM is not a black box – Investigation of ASM internals– Focus on practical methods and troubleshooting
• ASM and VLDB– Metadata, rebalancing and performance
• Lessons learned from CERN’s production DB services
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 3
ASM
• Oracle Automatic Storage Management– Provides the functionality of a volume manager
and filesystem for Oracle (DB) files• Works with RAC
– Oracle 10g feature aimed at simplifying storage management
• Together with Oracle Managed Files and the Flash Recovery Area
– An implementation of S.A.M.E. methodology• Goal of increasing performance and reducing cost
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 4
ASM for a Clustered Architecture
• Oracle architecture of redundant low-cost components
Servers
SAN
Storage
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 5
ASM Disk Groups
• Example: HW = 4 disk arrays with 8 disks each• An ASM diskgroup is created using all available disks
– The end result is similar to a file system on RAID 1+0 – ASM allows to mirror across storage arrays – Oracle RDBMS processes directly access the storage– RAW disk access
Mirroring
Striping Striping
Failgroup1 Failgroup2
ASM Diskgroup
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 6
Files, Extents, and Failure Groups
Files and extent pointers
Failgroupsand ASMmirroring
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 7
ASM Is not a Black Box
• ASM is implemented as an Oracle instance– Familiar operations for the DBA– Configured with SQL commands– Info in V$ views – Logs in udump and bdump– Some ‘secret’ details hidden in X$TABLES and
‘underscore’ parameters
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 8
Selected V$ Views and X$ Tables
View Name X$ Table DescriptionV$ASM_DISKGROUP X$KFGRP performs disk discovery and lists
diskgroups
V$ASM_DISK X$KFDSK, X$KFKID performs disk discovery, lists disks and their usage metrics
V$ASM_FILE X$KFFIL lists ASM files, including metadata
V$ASM_ALIAS X$KFALS lists ASM aliases, files and directories
V$ASM_TEMPLATE X$KFTMTA ASM templates and their properties
V$ASM_CLIENT X$KFNCL lists DB instances connected to ASM
V$ASM_OPERATION X$KFGMG lists current rebalancing operations
N.A. X$KFKLIB available libraries, includes asmlib
N.A. X$KFDPARTNER lists disk-to-partner relationships
N.A. X$KFFXP extent map table for all ASM files
N.A. X$KFDAT allocation table for all ASM disks
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 9
ASM Parameters
• Notable ASM instance parameters:
*.asm_diskgroups='TEST1_DATADG1','TEST1_RECODG1'
*.asm_diskstring='/dev/mpath/itstor*p*'*.asm_power_limit=5*.shared_pool_size=70M*.db_cache_size=50M*.large_pool_size=50M*.processes=100
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 10
More ASM Parameters
• Underscore parameters – Several undocumented parameters– Typically don’t need tuning– Exception: _asm_ausize and _asm_stripesize
• May need tuning for VLDB in 10g
• New in 11g, diskgroup attributes– V$ASM_ATTRIBUTE, most notable
• disk_repair_time• au_size
– X$KFENV shows ‘underscore’ attributes
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 11
ASM Storage Internals
• ASM Disks are divided in Allocation Units (AU) – Default size 1 MB (_asm_ausize)– Tunable diskgroup attribute in 11g
• ASM files are built as a series of extents– Extents are mapped to AUs using a file extent map – When using ‘normal redundancy’, 2 mirrored extents
are allocated, each on a different failgroup– RDBMS read operations access only the primary
extent of a mirrored couple (unless there is an IO error)
– In 10g the ASM extent size = AU size
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 12
ASM Metadata Walkthrough
• Three examples follow of how to read data directly from ASM.
• Motivations:– Build confidence in the technology, i.e.
‘get a feeling’ of how ASM works– It may turn out useful one day to
troubleshoot a production issue.
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 13
Example 1: Direct File Access 1/2
Goal: Reading ASM files with OS tools, using metadata information from X$ tables
1. Example: find the 2 mirrored extents of the RDBMS spfile
2. sys@+ASM1> select GROUP_KFFXP Group#, DISK_KFFXP Disk#, AU_KFFXP AU#, XNUM_KFFXP Extent# from X$KFFXP where number_kffxp=(select file_number from v$asm_alias where name='spfiletest1.ora');
GROUP# DISK# AU# EXTENT#---------- ---------- ---------- ---------- 1 16 17528 0 1 4 14838 0
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 14
Example 1: Direct File Access 2/2
3. Find the disk path sys@+ASM1> select disk_number,path fromv$asm_disk where GROUP_NUMBER=1 and disk_numberin (16,4); DISK_NUMBER PATH ----------- ------------------------------------
4 /dev/mpath/itstor417_1p1 16 /dev/mpath/itstor419_6p1
4. Read data from disk using ‘dd’ dd if=/dev/mpath/itstor419_6p1 bs=1024kcount=1 skip=17528 |strings
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 15
X$KFFXP
Column Name DescriptionNUMBER_KFFXP ASM file number. Join with v$asm_file and
v$asm_alias
COMPOUND_KFFXP File identifier. Join with compound_index in v$asm_file
INCARN_KFFXP File incarnation id. Join with incarnation in v$asm_file
XNUM_KFFXP ASM file extent number (mirrored extent pairs have the same extent value)
PXN_KFFXP Progressive file extent number
GROUP_KFFXP ASM disk group number. Join with v$asm_disk and v$asm_diskgroup
DISK_KFFXP ASM disk number. Join with v$asm_disk
AU_KFFXP Relative position of the allocation unit from the beginning of the disk.
LXN_KFFXP 0->primary extent,1->mirror extent, 2->2nd mirror copy (high redundancy and metadata)
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 16
Example 2: A Different Way
A different metadata table to reach the same goal of reading ASM files directly from OS:
sys@+ASM1> select GROUP_KFDAT Group# ,NUMBER_KFDAT Disk#, AUNUM_KFDAT AU# from X$KFDAT where fnum_kfdat=(select file_number from v$asm_alias where name='spfiletest1.ora');
GROUP# DISK# AU#---------- ---------- ---------- 1 4 14838 1 16 17528
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 17
X$KFDAT
Column Name (subset) DescriptionGROUP_KFDAT Diskgroup number, join with
v$asm_diskgroup
NUMBER_KFDAT Disk number, join with v$asm_disk
COMPOUND_KFDAT Disk compund_index, join with v$asm_disk
AUNUM_KFDAT Disk allocation unit (relative position from the beginning of the disk), join with x$kffxp.au_kffxp
V_KFDAT Flag: V=this Allocation Unit is used; F=AU is free
FNUM_KFDAT File number, join with v$asm_file
XNUM_KFDAT Progressive file extent number join with x$kffxp.pxn_kffxp
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 18
Example 3: Yet Another Way
Using the internal package dbms_diskgroup
declare fileType varchar2(50); fileName varchar2(50); fileSz number; blkSz number; hdl number; plkSz number; data_buf raw(4096);begin fileName := '+TEST1_DATADG1/TEST1/spfiletest1.ora'; dbms_diskgroup.getfileattr(fileName,fileType,fileSz,
blkSz); dbms_diskgroup.open(fileName,'r',fileType,blkSz,
hdl,plkSz, fileSz); dbms_diskgroup.read(hdl,1,blkSz,data_buf); dbms_output.put_line(data_buf);end;/
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 19
DBMS_DISKGROUP
• Can be used to read/write ASM files directly – It’s an Oracle internal package– Does not require a RDBMS instance – 11g’s asmcmd cp command uses dbms_diskgroup
Procedure Name Parameters
dbms_diskgroup.open
(:fileName, :openMode, :fileType, :blkSz, :hdl,:plkSz, :fileSz)
dbms_diskgroup.read (:hdl, :offset, :blkSz, :data_buf)
dbms_diskgroup.createfile
(:fileName, :fileType, :blkSz, :fileSz, :hdl, :plkSz, :fileGenName)
dbms_diskgroup.close (:hdl)
dbms_diskgroup.commitfile (:handle)
dbms_diskgroup.resizefile (:handle,:fsz)
dbms_diskgroup.remap (:gnum, :fnum, :virt_extent_num)
dbms_diskgroup.getfileattr (:fileName, :fileType, :fileSz, :blkSz)
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 20
File Transfer Between OS and ASM
• The supported tools (10g)– RMAN– DBMS_FILE_TRANSFER– FTP (XDB)– WebDAV (XDB)– They all require a RDBMS instance
• In 11g, all the above plus asmcmd – cp command– Works directly with the ASM instance
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 21
Strace and ASM 1/3
Goal: understand strace output when using ASM storage
• Example:read64(15,"#33\0@\"..., 8192, 473128960)=8192 • This is a read operation of 8KB from FD 15 at offset
473128960• What is the segment name, type, file# and block# ?
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 22
Strace and ASM 2/3
1. From /proc/<pid>/fd I find that FD=15 is/dev/mpath/itstor420_1p1
2. This is disk 20 of D.G.=1 (from v$asm_disk)3. From x$kffxp I find the ASM file# and extent#:
• Note: offset 473128960 = 451 MB + 27 *8KBsys@+ASM1>select number_kffxp,
xnum_kffxp from x$kffxp where group_kffxp=1 and disk_kffxp=20 and au_kffxp=451;
NUMBER_KFFXP XNUM_KFFXP------------ ---------- 268 17
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 23
Strace and ASM 3/3
4. From v$asm_alias I find the file alias for file 268: USERS.268.612033477
5. From v$datafile view I find the RDBMS file#: 96. From dba extents finally find the owner and
segment name relative to the original IO operation:
sys@TEST1>select owner,segment_name,segment_type from dba_extents where FILE_ID=9 and 27+17*1024*1024 between block_id and block_id+blocks;
OWNER SEGMENT_NAME SEGMENT_TYPE ----- ------------ ------------SCOTT EMP TABLE
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 24
Investigation of Fine Striping
• An application: finding the layout of fine-striped files– Explored using strace of an oracle session
executing ‘alter system dump logfile ..’ – Result: round robin distribution over 8 x 1MB
extents
Fine striping size = 128KB (1MB/8)
……
A0
B0
……
……
A1
B1
……
……A2
B2……
……
A3
B3
……
……
A4
B4
……
……
A5
B5
……
……
A6
B6
……
……
A7
B7
……
AU
= 1
MB
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 25
Metadata Files
• ASM diskgroups contain ‘hidden files’ – Not listed in V$ASM_FILE (file# <256)– Details are available in X$KFFIL – In addition the first 2 AUs of each disk are marked as
file#=0 in X$KFDAT– Example (10g):
GROUP# FILE# FILESIZE_AFTER_MIRR RAW_FILE_SIZE---------- ---------- ------------------- ------------- 1 1 2097152 6291456 1 2 1048576 3145728 1 3 264241152 795869184 1 4 1392640 6291456 1 5 1048576 3145728 1 6 1048576 3145728
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 26
ASM Metadata 1/2
• File#0, AU=0: disk header (disk name, etc), Allocation Table (AT) and Free Space Table (FST)
• File#0, AU=1: Partner Status Table (PST) • File#1: File Directory (files and their extent pointers) • File#2: Disk Directory• File#3: Active Change Directory (ACD)
– The ACD is analogous to a redo log, where changes to the metadata are logged.
– Size=42MB * number of instances
Source: Oracle Automatic Storage Management, Oracle Press Nov 2007, N. Vengurlekar, M. Vallath, R.Long
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 27
ASM Metadata 2/2
• File#4: Continuing Operation Directory (COD). – The COD is analogous to an undo tablespace. It
maintains the state of active ASM operations such as disk or datafile drop/add. The COD log record is either committed or rolled back based on the success of the operation.
• File#5: Template directory• File#6: Alias directory• 11g, File#9: Attribute Directory• 11g, File#12: Staleness registry, created when
needed to track offline disks
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 28
ASM Rebalancing
• Rebalancing is performed (and mandatory) after space management operations– Goal: balanced space allocation across disks– Not based on performance or utilization– ASM spreads every file across all disks in a diskgroup
• ASM instances are in charge of rebalancing– Extent pointers changes are communicated to the RDBMS
• RDBMS’ ASMB process keeps an open connection to ASM• This can be observed by running strace against ASMB
– In RAC, extra messages are passed between the cluster ASM instances
• LMD0 of the ASM instances are very active during rebalance
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 29
ASM Rebalancing and VLDB
• Performance of Rebalancing is important for VLDB
• An ASM instance can use parallel slaves– RBAL coordinates the rebalancing operations– ARBx processes pick up ‘chunks’ of work. By
default they log their activity in udump• Does it scale?
– In 10g serialization wait events can limit scalability
– Even at maximum speed rebalancing is not always I/O bound
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 30
ASM Rebalancing Performance
• Tracing ASM rebalancing operations– 10046 trace of the +arbx processes
• Oradebug setospid … • oradebug event 10046 trace name context forever, level 12• Process log files (in bdump) with orasrp (tkprof will not work)
• Main wait events from my tests with RAC (6 nodes)– DFS lock handle
• Waiting for CI level 5 (cross instance lock)– Buffer busy wait – ‘unaccounted for’– enq: AD - allocate AU– enq: AD - deallocate AU– log write(even)– log write(odd)
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 31
ASM Single Instance Rebalancing
• Single instance rebalance– Faster in RAC if you can rebalance with only 1
node up (I have observed: 20% to 100% speed improvement)
– Buffer busy wait can be the main event• It seems to depend on the number of files in the
diskgroup.• Diskgroups with a small number of (large) files have
more contention (+arbx processes operate concurrently on the same file)
• Only seen in tests with 10g
– 11g has improvements regarding rebalancing contention
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 32
Rebalancing, an Example
ASM Rebalancing Performance (RAC)
0
1000
2000
300040005000
6000
7000
0 2 4 6 8 10 12
Diskgroup Rebalance Parallelism
Rate
, MB/
min
Oracle 11g
Oracle 10g
Data: D.Wojcik, CERN IT
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 33
Rebalancing Workload
• When ASM mirroring is used (e.g. with normal redundancy) – Rebalancing operations can move more data
than expected• Example:
– 5 TB (allocated): ~100 disks, 200 GB each – A disk is replaced (diskgroup rebalance)
• The total IO workload is 1.6 TB (8x the disk size!)• How to see this: query v$asm_operation, the column
EST_WORK keeps growing during rebalance
• The issue: excessive repartnering
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 34
ASM Disk Partners
• ASM diskgroup with normal redundancy– Two copies of each extents are written to
different ‘failgroups’ • Two ASM disks are partners:
– When they have at least one extent set in common (they are the 2 sides of a mirror for some data)
• Each ASM disk has a limited number of partners– Typically 10 disk partners: X$KFDPARTNER– Helps to reduce the risk associated with 2
simultaneous disk failures
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 35
Free and Usable Space
• When ‘ASM mirroring’ is used not all the free space should be occupied
• V$ASM_DISKGROUP.USABLE_FILE_MB:– Amount of free space that can be safely utilized
taking mirroring into account, and yet be able to restore redundancy after a disk failure
– it’s calculated for the case of the worst scenario, anyway it is a best practice not to have it go negative (it can)
– This can be a problem when deploying a small number of large LUNs and/or failgroups
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 36
Fast Mirror Resync
• ASM 10g with normal redundancy does not allow to offline part of the storage – A transient error in a storage array can cause
several hours of rebalancing to drop and add disks
– It is a limiting factor for scheduled maintenances– 11g has new feature ‘fast mirror resync’
• Redundant storage can be put offline for maintenance• Changes are accumulated in the staleness registry
(file#12)• Changes are applied when the storage is back online
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 37
Read Performance, Random I/O
IOPS measured with SQL (synthetic test)
Small Random I/O (8KB block, 32GB probe table) 64 SATA HDs (4 arrays, 1 instance)
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
Workload, number of oracle sessions
I/O s
mal
l rea
d pe
r sec
(IO
PS)
8675 IOPS
~130 IOPS per disk Destroking, only the external part of the disks is used
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 38
Read Performance, Sequential I/O
Sequential I/O Throughput, 80GB probe table64 SATA HDs (4 arrays and 4 RAC nodes)
0
100
200
300
400
500
600
700
800
1 2 4 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
Workload, number of parallel query slaves
MB
/sec
Limited by HBAs -> 4 x 2 Gb
(measured with parallel query)
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 39
Implementation Details
• Multipathing– Linux Device Mapper (2.6 kernel)
• Block devices– RHEL4 and 10gR2 allow to skip raw devices mapping– External half of the disk for data disk groups
• JBOD config– No HW RAID – ASM used to mirror across disk arrays
• HW:– Storage arrays (Infortrend): FC controller, SATA disks– FC (Qlogic): 4Gb switch and HBAs (2Gb in older HW)– Servers are 2x CPUs, 4GB RAM, 10.2.0.3 on RHEL4,
RAC of 4 to 8 nodes
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 40
Conclusions
• CERN deploys RAC and ASM on Linux on commodity HW– 2.5 years of production, 110 Oracle 10g RAC
nodes and 300TB of raw disk space (Dec 2007)• ASM metadata
– Most critical part, especially rebalancing– Knowledge of some ASM internals helps
troubleshooting• ASM on VLDB
– Know and work around pitfalls in 10g– 11g has important manageability and
performance improvements
CERN - IT DepartmentCH-1211 Genève 23
Switzerlandwww.cern.ch/it Inside Oracle ASM, UKOUG Dec 2007 - 41
Q&A
Q&A• Links:
– http://cern.ch/phydb– http://twiki.cern.ch/twiki/bin/view/PSSGroup/ASM_Internals– http://www.cern.ch/canali