CERN IT Department, CH-1211 Genève 23, Switzerland – www.cern.ch/it
Implementing ASM Without HW RAID – A User's Experience
Luca Canali, CERN
Dawid Wojcik, CERN
UKOUG, Birmingham, December 2008
Outline
• Introduction to ASM – disk groups, failgroups, normal redundancy
• Scalability and performance of the solution
• Possible pitfalls, sharing experiences
• Implementation details, monitoring, and tools to ease ASM deployment
Architecture and main concepts
• Why ASM?
  – Provides the functionality of a volume manager and of a cluster file system
  – Raw access to storage for performance
• Why ASM-provided mirroring?
  – Allows the use of lower-cost storage arrays
  – Allows mirroring across storage arrays
    • Arrays are not single points of failure
    • Array (HW) maintenance can be done in a rolling fashion
  – Stretch clusters
ASM and cluster DB architecture
• Oracle architecture of redundant low-cost components
Diagram: servers connected through the SAN to the storage arrays.
Files, extents, and failure groups
Files and extent pointers
Failgroups and ASM mirroring
ASM disk groups
• Example: HW = 4 disk arrays with 8 disks each
• An ASM diskgroup is created using all available disks
  – The end result is similar to a file system on RAID 1+0
  – ASM allows mirroring across storage arrays
  – Oracle RDBMS processes access the storage directly (raw disk access)
Diagram: an ASM diskgroup built from two failgroups (Failgroup1 and Failgroup2) – extents are striped across the disks of each failgroup and mirrored between the failgroups.
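A minimal sketch of how such a diskgroup can be created from the ASM instance; the diskgroup name and the disk discovery paths below are illustrative, not the actual CERN configuration:

  -- normal redundancy: every extent is mirrored across failgroups,
  -- so each storage array is declared as its own failgroup
  CREATE DISKGROUP data_dg NORMAL REDUNDANCY
    FAILGROUP array1 DISK '/dev/mpath/rstor101_*'
    FAILGROUP array2 DISK '/dev/mpath/rstor102_*';

With this layout the loss of a whole array still leaves one mirror copy of every extent in the surviving failgroup.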
Performance and scalability
• ASM with normal redundancy
  – Stress tested for CERN's use
  – Scales and performs well
Case study: the largest cluster I have ever installed, RAC5
• The test used: 14 servers
Multipathed Fibre Channel
• 8 FC switches: 4Gbps (10Gbps uplink)
Many spindles
• 26 storage arrays (16 SATA disks each)
Case study: I/O metrics for the RAC5 cluster
• Measured, sequential I/O
  – Read: 6 GB/s
  – Read-write: 3+3 GB/s
• Measured, small random I/O
  – Read: 40K IOPS (8 KB read operations)
• Note:
  – 410 SATA disks, 26 HBAs on the storage arrays
  – Servers: 14 x 4+4 Gbps HBAs, 112 cores, 224 GB of RAM
How the test was run
• A custom SQL-based DB workload:
  – IOPS: randomly probe a large table (several TB) with many parallel query slaves, each reading a single block at a time
  – MBPS: read a large (several TB) table with parallel query
• The test table used for the RAC5 cluster was 5 TB in size
  – Created inside a diskgroup of 70 TB
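The actual workload generator was a custom CERN tool; the queries below are only a hedged illustration of the two access patterns described above (table, index and bind names are made up):

  -- MBPS test: full scan of a multi-TB table with parallel query
  SELECT /*+ FULL(t) PARALLEL(t, 56) */ COUNT(*) FROM test_big_table t;

  -- IOPS test: many concurrent sessions probing random keys,
  -- so that each probe costs roughly one single-block read
  SELECT /*+ INDEX(t test_big_table_pk) */ payload
  FROM   test_big_table t
  WHERE  id = :random_id;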
Possible pitfalls
• Production stories
  – Sharing experiences
  – 3 years in production, 550 TB of raw capacity
Rebalancing speed
• Rebalancing is performed (and is mandatory) after space management operations
  – Typically after HW failures (to restore the mirror)
  – Goal: balanced space allocation across disks
  – Not based on performance or utilization
  – The ASM instances are in charge of rebalancing
• Scalability of rebalancing operations?
  – In 10g, serialization wait events can limit scalability
  – Even at maximum speed, rebalancing is not always I/O bound
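The rebalance speed is chosen per operation; a minimal sketch, run on the ASM instance (the diskgroup name is illustrative):

  -- power 0-11 in 10g; later releases accept higher values
  ALTER DISKGROUP data_dg REBALANCE POWER 11;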
Rebalancing, an example
Chart: ASM rebalancing performance (RAC) – rebalance rate (MB/min, 0 to 7000) versus diskgroup rebalance parallelism (0 to 12), comparing Oracle 10g and Oracle 11g.
VLDB and rebalancing
• Rebalancing operations can move more data than expected
• Example:
  – 5 TB allocated: ~100 disks, 200 GB each
  – A disk is replaced (diskgroup rebalance)
    • The total I/O workload is 1.6 TB (8x the disk size!)
    • How to see this: query v$asm_operation – the EST_WORK column keeps growing during the rebalance (see the query sketch below)
• The issue: excessive repartnering
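A simple way to watch this from the ASM instance (a sketch, not the exact CERN monitoring):

  -- repeat while the rebalance runs: EST_WORK keeps being revised upwards
  -- as extents are repartnered, well beyond the size of the replaced disk
  SELECT operation, state, power, sofar, est_work, est_rate, est_minutes
  FROM   v$asm_operation;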
Rebalancing issues wrap-up
• Rebalancing can be slow
  – Many hours for very large diskgroups
• Associated risk
  – A 2nd disk failure while rebalancing
  – Worst case: loss of the diskgroup because partner disks fail (see the sketch below)
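Which disks are partners can be inspected through the internal x$kfdpartner fixed view; it is undocumented and its columns may vary between versions, so the following is only a sketch run on the ASM instance:

  -- count the partners of each disk in diskgroup 1
  SELECT disk, COUNT(number_kfdpartner) AS partners
  FROM   x$kfdpartner
  WHERE  grp = 1
  GROUP  BY disk
  ORDER  BY disk;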
Fast Mirror Resync
• ASM 10g with normal redundancy does not allow offlining part of the storage
  – A transient error in a storage array can cause several hours of rebalancing to drop and re-add the disks
  – It is a limiting factor for scheduled maintenance
• 11g has the new 'fast mirror resync' feature
  – A great feature for rolling interventions on HW
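In 11g (with the diskgroup compatibility attributes raised to 11.1) the feature is driven by the disk_repair_time attribute plus explicit offline/online commands; a minimal sketch with illustrative diskgroup and failgroup names:

  -- let disks stay offline up to 12 hours before ASM drops them
  ALTER DISKGROUP data_dg SET ATTRIBUTE 'disk_repair_time' = '12h';

  -- take one storage array (failgroup) offline for a rolling intervention
  ALTER DISKGROUP data_dg OFFLINE DISKS IN FAILGROUP array1;

  -- after the maintenance only the changed extents are resynchronized
  ALTER DISKGROUP data_dg ONLINE DISKS IN FAILGROUP array1;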
ASM and filesystem utilities
• Only a few tools can access ASM
  – asmcmd, dbms_file_transfer, XDB, FTP
  – Limited operations (no copy, rename, etc.)
  – They require open DB instances
  – File operations are difficult in 10g
• 11g asmcmd has the copy command
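In 10g, getting a file out of (or into) ASM typically means going through a database instance, for example with dbms_file_transfer between directory objects; a sketch where the directory paths and file names are purely illustrative:

  CREATE DIRECTORY src_dir AS '+DATA_DG1/mydb/datafile';
  CREATE DIRECTORY dst_dir AS '/backup/staging';

  BEGIN
    DBMS_FILE_TRANSFER.COPY_FILE(
      source_directory_object      => 'SRC_DIR',
      source_file_name             => 'users.282.657641105',
      destination_directory_object => 'DST_DIR',
      destination_file_name        => 'users01.dbf');
  END;
  /

From 11g onwards the same copy can be done directly with the asmcmd cp command.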
ASM and corruption
• ASM metadata corruption
  – Can be caused by bugs
  – One case in production, after a disk eviction
• Physical data corruption
  – ASM automatically fixes most corruption on the primary extent
    • Typically when doing a full backup
  – Secondary extent corruption goes undetected until a disk failure or rebalance exposes it
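Block corruptions found while reading for a backup can be cross-checked from the database instance; a hedged sketch using standard views:

  -- after an RMAN backup (or BACKUP VALIDATE CHECK LOGICAL DATABASE),
  -- blocks that could not be read cleanly are listed here
  SELECT file#, block#, blocks, corruption_type
  FROM   v$database_block_corruption;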
Disaster recovery
• Corruption issues were fixed by using a physical standby to move to 'fresh' storage
• For HA, our experience is that disaster recovery is needed
  – Standby DB
  – On-disk (flash) copy of the DB
Implementation details
Storage deployment
• Current storage deployment for Physics Databases at CERN
  – SAN, FC (4 Gb/s) storage enclosures with SATA disks (8 or 16)
  – Linux x86_64; no ASMLib, device mapper instead (naming persistence + HA)
  – Over 150 FC storage arrays (production, integration and test) and ~2000 LUNs exposed
  – Biggest DB over 7 TB (more to come when the LHC starts – estimated growth up to 11 TB/year)
Storage deployment
• ASM implementation details
  – Storage in JBOD configuration (1 disk -> 1 LUN)
  – Each disk partitioned at the OS level
    • 1st partition – 45% of the disk size – faster part of the disk – short stroke
    • 2nd partition – the rest – slower part – full stroke
Diagram: outer sectors – short stroke; inner sectors – full stroke.
Storage deployment
• Two diskgroups created for each cluster
  – DATA – data files and online redo logs – outer part of the disks
  – RECO – flash recovery area destination – archived redo logs and on-disk backups – inner part of the disks
• One failgroup per storage array
Diagram: the DATA_DG1 and RECO_DG1 diskgroups span Failgroup1 to Failgroup4, one failgroup per storage array.
Storage management
• SAN setup in JBOD configuration – many steps, can be time-consuming
  – Storage level
    • Logical disks
    • LUNs
    • Mappings
  – FC infrastructure – zoning
  – OS – creating the device mapper configuration
    • multipath.conf – naming persistency + HA
Storage management
• Storage manageability
  – DBAs set up the initial configuration
  – ASM – extra maintenance in case of storage interventions (disk failures)
  – Problems
    • How to quickly set up the SAN configuration
    • How to manage disks and keep track of the mappings: physical disk -> LUN -> Linux disk -> ASM disk

Example: SCSI [1:0:1:3] & [2:0:1:3] -> /dev/sdn & /dev/sdax -> /dev/mpath/rstor901_3 -> ASM – TEST1_DATADG1_0016
Storage management
• Solution
  – Configuration DB – a repository of FC switches, port allocations and all SCSI identifiers for all nodes and storage arrays
    • Big initial effort
    • Easy to maintain
    • High ROI
  – Custom tools
    • Tools to identify
      – SCSI (block) devices <-> device mapper device <-> physical storage and FC port
      – Device mapper device <-> ASM disk
    • Automatic generation of the device mapper configuration
Storage management
[ ~]$ lssdisks.py
The following storages are connected:
* Host interface 1:
Target ID 1:0:0: - WWPN: 210000D0230BE0B5 - Storage: rstor316, Port: 0
Target ID 1:0:1: - WWPN: 210000D0231C3F8D - Storage: rstor317, Port: 0
Target ID 1:0:2: - WWPN: 210000D0232BE081 - Storage: rstor318, Port: 0
Target ID 1:0:3: - WWPN: 210000D0233C4000 - Storage: rstor319, Port: 0
Target ID 1:0:4: - WWPN: 210000D0234C3F68 - Storage: rstor320, Port: 0
* Host interface 2:
Target ID 2:0:0: - WWPN: 220000D0230BE0B5 - Storage: rstor316, Port: 1
Target ID 2:0:1: - WWPN: 220000D0231C3F8D - Storage: rstor317, Port: 1
Target ID 2:0:2: - WWPN: 220000D0232BE081 - Storage: rstor318, Port: 1
Target ID 2:0:3: - WWPN: 220000D0233C4000 - Storage: rstor319, Port: 1
Target ID 2:0:4: - WWPN: 220000D0234C3F68 - Storage: rstor320, Port: 1
SCSI Id Block DEV MPath name MP status Storage Port
------------- ---------------- -------------------- ---------- ------------------ -----
[0:0:0:0] /dev/sda - - - -
[1:0:0:0] /dev/sdb rstor316_CRS OK rstor316 0
[1:0:0:1] /dev/sdc rstor316_1 OK rstor316 0
[1:0:0:2] /dev/sdd rstor316_2 FAILED rstor316 0
[1:0:0:3] /dev/sde rstor316_3 OK rstor316 0
[1:0:0:4] /dev/sdf rstor316_4 OK rstor316 0
[1:0:0:5] /dev/sdg rstor316_5 OK rstor316 0
[1:0:0:6] /dev/sdh rstor316_6 OK rstor316 0
. . .
. . .
Custom-made script:
  SCSI id (host, channel, id) -> storage name and FC port
  SCSI ID -> block device -> device mapper name and status -> storage name and FC port
Storage management
[ ~]$ listdisks.py
DISK NAME GROUP_NAME FG H_STATUS MODE MOUNT_S STATE TOTAL_GB USED_GB
---------------- ------------------ ------------- ---------- ---------- ------- -------- ------- ------ -----
rstor401_1p1 RAC9_DATADG1_0006 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.5
rstor401_1p2 RAC9_RECODG1_0000 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 119.9 1.7
rstor401_2p1 -- -- -- UNKNOWN ONLINE CLOSED NORMAL 111.8 111.8
rstor401_2p2 -- -- -- UNKNOWN ONLINE CLOSED NORMAL 120.9 120.9
rstor401_3p1 RAC9_DATADG1_0007 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.6
rstor401_3p2 RAC9_RECODG1_0005 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8
rstor401_4p1 RAC9_DATADG1_0002 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.5
rstor401_4p2 RAC9_RECODG1_0002 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8
rstor401_5p1 RAC9_DATADG1_0001 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.5
rstor401_5p2 RAC9_RECODG1_0006 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8
rstor401_6p1 RAC9_DATADG1_0005 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.5
rstor401_6p2 RAC9_RECODG1_0007 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8
rstor401_7p1 RAC9_DATADG1_0000 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.6
rstor401_7p2 RAC9_RECODG1_0001 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8
rstor401_8p1 RAC9_DATADG1_0004 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.6
rstor401_8p2 RAC9_RECODG1_0004 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8
rstor401_CRS1
rstor401_CRS2
rstor401_CRS3
rstor402_1p1 RAC9_DATADG1_0015 RAC9_DATADG1 RSTOR402 MEMBER ONLINE CACHED NORMAL 111.8 59.9
. . .
. . .
Custom-made script:
  device mapper name -> ASM disk and status
Storage management
[ ~]$ gen_multipath.py
# multipath default configuration for PDB
defaults {
udev_dir /dev
polling_interval 10
selector "round-robin 0"
. . .
}
. . .
multipaths {
multipath {
wwid 3600d0230006c26660be0b5080a407e00
alias rstor916_CRS
}
multipath {
wwid 3600d0230006c26660be0b5080a407e01
alias rstor916_1
}
. . .
}
Custom-made script:
  device mapper alias – naming persistency and multipathing (HA)
  SCSI [1:0:1:3] & [2:0:1:3] -> /dev/sdn & /dev/sdax -> /dev/mpath/rstor916_1
Storage monitoring
• ASM-based mirroring means
  – Oracle DBAs need to be alerted of disk failures and evictions
  – Dashboard – global overview – custom solution – RACMon
• ASM-level monitoring
  – Oracle Enterprise Manager Grid Control
  – RACMon – alerts on missing disks and failgroups, plus dashboard (see the query sketch below)
• Storage-level monitoring
  – RACMon – LUNs' health and storage configuration details – dashboard
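At the ASM level, the kind of check such monitoring performs can be approximated with a query like the following (a sketch only, not the actual RACMon code; candidate disks not yet belonging to a diskgroup would need extra filtering):

  -- disks that are missing, offline or being dropped deserve an alert
  SELECT group_number, failgroup, name, path, mount_status, mode_status, state
  FROM   v$asm_disk
  WHERE  mount_status = 'MISSING'
     OR  mode_status <> 'ONLINE'
     OR  state       <> 'NORMAL';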
Storage monitoring
• ASM instance level monitoring
• Storage level monitoring
Screenshots: a new failing disk on RSTOR614; a new disk installed on RSTOR903, slot 2.
Conclusions
• Oracle ASM diskgroups with normal redundancy
  – Used at CERN instead of HW RAID
  – Performance and scalability are very good
  – Allow the use of low-cost HW
  – Require more admin effort from the DBAs than high-end storage
  – 11g has important improvements
• Custom tools to ease administration
Q&A
Thank you
• Links:
  – http://cern.ch/phydb
  – http://www.cern.ch/canali