IBM SAN Volume Controller Performance Analysis
DESCRIPTION
Introduction; Storage Problems and Limitations with Native Storage; SVC Overview; SVC Physical and Logical Overview; Performance and Scalability Implications; Types of Problems; Performance Analysis Techniques; Performance Analysis Tools for SVC; Performance Analysis Metrics for SVC; Online Banking Example
TRANSCRIPT
IBM Global Technology Services
© 2008 IBM Corporation
SAN Volume Controller Performance Analysis
July 25, 2008
Business Unit or Product Name
© 2008 IBM Corporation
Trademarks & Disclaimer
The following terms are trademarks of the IBM Corporation:
Enterprise Storage Server® - Abbreviated: ESS
TotalStorage® Expert - Abbreviated: TSE
FAStT/DS4000/DS8000
AIX®
IBM SAN Volume Controller
Other trademarks appearing in this report may be considered trademarks of their respective companies:
SANavigator and EFCM are trademarks of McDATA Corporation.
UNIX is a registered trademark of The Open Group in the United States and other countries, licensed exclusively through X/Open Company Limited.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
EMC is a registered trademark of EMC Corporation.
HP-UX is a registered trademark of Hewlett-Packard Company.
Solaris is a registered trademark of Sun Microsystems, Inc.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Disclaimer
The views expressed in this presentation are those of the author and are not necessarily those of IBM.
Abstract
The SAN Volume Controller (SVC) is a flexible, scalable platform for block-level storage virtualization. While the SVC adds flexibility in storage provisioning and provides enhancements that support higher availability, it also adds complexity to performance design. This impact is most acute in performance analysis, because a new striping layer is added to the data path that can, and does, make analysis more complex. We will provide a technical overview of a SAN environment with SVC and explore the performance and scalability considerations when using SVC. We will then review some of the tools, metrics, and methods necessary to identify root causes of the most common performance issues.
Table of Contents
Introduction
Storage Problems and Limitations with Native Storage
SVC Overview
SVC Physical and Logical Overview
Performance and Scalability Implications
Types of Problems
Performance Analysis Techniques
Performance Analysis Tools for SVC
Performance Analysis Metrics for SVC
Online Banking Example
Summary
SVC High Level Logical View
[Figure: SVC Combined Physical & Logical View - Virtual Disks Mapped to Hosts]
SVC Cluster containing I/O Group 1 and I/O Group 2.
Virtual Disks mapped to hosts: vdisk0 - vdisk4, 20GB each.
Managed Disk Groups: mdiskgrp0 [FAStT Group] - 40GB (mdisk0 - mdisk3, 10GB each, backed by 10GB FAStT LUNs); mdiskgrp1 [ESS Group] - 60GB (mdisk4 - mdisk6, 20GB each, backed by 20GB ESS LUNs).
Virtual Disks are associated with particular I/O Groups.
Managed Disk Groups are accessible by all I/O Groups in the Cluster.
Performance and Scalability Limitations
Shared resources!
– Cache, fibre ports, CPU, Fabric
Cache implications
– Completely random workloads - 'cache unfriendly'
– Highly sequential workloads - e.g. large DB hot backups
Fabric implications
– Increases the number of fabric hops!
– Additional fabric traffic to synchronize write data
– Traffic flows in and out of the same ports:
• Read cache misses
• Write synchronizations
Types of Problems
Application
– Configuration
– Design issues
– Defects
– DB queries, etc.
Host
– Multi-pathing software compatibility
– HBA microcode/device driver
– OS compatibility
SVC
– Microcode level, performance features
– Front-end contention: I/O group, node
– Backend contention: MDG, mdisk
Backend Storage
– Front end: port, cache, NVS
– Backend: controller, RAID group (disks)
Fabric
– ISL congestion
Performance Analysis Process
1. Gather host multi-pathing, SVC, and storage configuration/firmware levels
2. Ensure device support and compatibility
– SVC Support Matrix - if host or storage devices are unsupported, resolve!
• http://www-03.ibm.com/systems/storage/software/virtualization/svc/interop.html
– Update SVC firmware to the latest level (ensure host multi-pathing is supported and configured correctly)
3. After resolving configuration issues:
– Gather end-to-end response time (i.e. host iostat/perfmon data)
– If elongated response time exists, drill down to the next layer
4. Measurement points
– Application: transactional latency
– Host: LV & disk I/O response times, disk utilization, throughput
– Fabric: throughput, utilization
– SVC: I/O Group, MD Group, MDisks, VDisks
– Storage: depends on technology
• EMC: FA, cache, DA, disk, volume
• DS8K/ESS: front-end port, array (physical), volume
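The host-level measurement step above can be scripted. As a minimal sketch, the following flags devices with elongated service times in captured `iostat -x` output; the column layout assumed here matches older sysstat versions with an "await" column, and the 20 ms threshold is illustrative, not an official guideline.

```python
# Flag devices whose average I/O wait time ("await", in ms) exceeds
# a threshold in `iostat -x` output. Column names vary by sysstat
# version; this sketch assumes an "await" column is present.

def slow_devices(iostat_text, await_ms=20.0):
    header = None
    slow = []
    for line in iostat_text.strip().splitlines():
        cols = line.split()
        if not cols:
            continue
        if cols[0] in ("Device:", "Device"):
            header = cols
            continue
        if header and len(cols) == len(header):
            row = dict(zip(header, cols))
            if float(row.get("await", 0)) > await_ms:
                slow.append((cols[0], float(row["await"])))
    return slow

sample = """Device:  rrqm/s wrqm/s r/s  w/s  rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda      0.00   1.20   5.0  3.0  80.0   64.0   18.0     0.10     4.50  1.20  0.96
sdb      0.00   0.00  90.0 40.0 2880.0 1280.0  32.0     9.80   125.00  7.60 98.80"""

print(slow_devices(sample))  # [('sdb', 125.0)]
```

A device that shows up here repeatedly is a candidate for drilling down to the next layer (SVC vdisk, then mdisk, then backend storage).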
Performance Analysis Tools for SVC
IBM TotalStorage Productivity Center (TPC)
– Complex and expensive to deploy
– Provides lots of detail
Native command line interface
– Data in XML format, but no publicly available post-processing tools
– A custom-written text parser is not ideal
– XSL and ANT, or other XML parsers/viewers, are good options
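Because the CLI statistics come out as XML, a small script is often simpler than XSL/ANT for one-off analysis. The sketch below uses Python's standard library; note that the element and attribute names ("vdsk", "id", "ro", "wo") are illustrative placeholders, not the actual SVC statistics schema.

```python
# Minimal sketch of post-processing SVC XML statistics without
# XSL/ANT. Element/attribute names here are hypothetical stand-ins
# for whatever the real statistics dump contains.
import xml.etree.ElementTree as ET

sample = """<stats>
  <vdsk id="vdisk0" ro="12345" wo="678"/>
  <vdsk id="vdisk1" ro="90" wo="12"/>
</stats>"""

root = ET.fromstring(sample)
totals = [(v.get("id"), int(v.get("ro")) + int(v.get("wo")))
          for v in root.iter("vdsk")]
for name, total in totals:
    print(name, total)  # combined read + write operation count
```

The same pattern (parse once, reduce to a sorted table) scales to per-interval statistics files collected over a problem window.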
SVC Key Performance Metrics
IO Group
– Front-end & backend latency (read/write), queue time (read/write), throughput (read/write), transfer size (read/write), I/O rates (read/write)
– Cache hits
Node
– Same as IO Group + CPU + port-to-local-node send & receive I/O rates
MD Group
– Front-end & backend latency (read/write), queue time (read/write), throughput (read/write), transfer size (read/write), I/O rates (read/write)
MDisk
– Backend latency (read/write), queue time (read/write), throughput (read/write), transfer size (read/write), I/O rates (read/write)
VDisk
– Front-end latency, queue time (read/write), throughput (read/write), transfer size (read/write), I/O rates (read/write)
– NVS full & delays, cache hits
Explanations
– Overall response time = vdisk response time
– If an I/O is a cache hit, then you only have the vdisk response time
– Backend response time = mdisk fabric response time (i.e. from the point we send the I/O to the controller to when we get it back)
– Backend queue = mdisk queue time (inside SVC waiting to be sent onto the fabric, plus fabric response time)
– Backend responses are also for 32K tracks, so a vdisk doing 256K I/O will need many backend I/Os to complete (if it is a cache miss); many of these will be concurrent
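The last point is worth making concrete. Since backend transfers happen in 32 KB tracks, the backend I/O count for a cache-miss vdisk I/O is just the transfer size divided by the track size, rounded up; the helper below is a small worked example of that arithmetic.

```python
# Worked example of the 32K-track note above: a 256 KB vdisk read
# that misses cache fans out into 256 / 32 = 8 backend mdisk I/Os,
# most of which the SVC issues concurrently.
TRACK_KB = 32

def backend_ios(vdisk_io_kb):
    # Ceiling division: a 40 KB I/O still touches two 32 KB tracks.
    return -(-vdisk_io_kb // TRACK_KB)

print(backend_ios(256))  # 8
print(backend_ios(40))   # 2
```

This is why large-transfer workloads can show modest vdisk I/O rates but much higher backend (mdisk) I/O rates during cache misses.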
Real World Example: Online Banking Application (OLB) – Problem Statement
An online banking application, and other applications that rely on SAN I/O, are experiencing intermittent, severe performance impacts.
Performance impacts are typified by a daily performance degradation between 3:15 am and 6:00 am.
SVC response time outside of problem window is acceptable.
OLB – Host Impact – Increase in copy times
[Chart: Sum of copy time (s), scale 0-600, by hour from 9/20/2007 14:00 to 9/21/2007 8:00, broken out by host and file system. Series cover /apps/olbfs, /data/olb_input, /data/archive, /data/output, /localtest, and /ora/SOMEDB/data001/DBs/export across Host1 - host16.]
OLB: Performance Analysis – Host Configuration
Collect host configuration data
– Prior to microcode 4.3.1, it is very important that host multi-pathing software communicates with the SVC preferred node!
– Try to use IBM SDD/PCM, as they work!
– If using others (DMP/MPxIO), only one multi-pathing package should be active
– Special procedures and/or configuration changes may be required for non-IBM multi-pathing
Hosts were running improperly configured MPxIO
– Needed a patch and an SVC configuration change
http://www-1.ibm.com/support/docview.wss?rs=591&context=STC7HAC&context=STCWGAV&context=STCWGBP&dc=DB520&dc=DB530&dc=DB510&dc=DB550&q1=mpxio&uid=ssg1S1002938&loc=en_US&cs=utf-8&lang=en
Hosts were running an unsupported DMP configuration
– Needed a patch from Veritas to fix
– VxVM 5.0 requires RP3 (Rolling Patch 3 and Hotfix 127320-02)
Identify and repair host configuration
OLB: Upgrade SVC to Latest Firmware
Make sure you are at least at 4.x.
Latest SVC Firmware (4.2.x) has many fixes
Fixes to increase mdisk q-depth settings
Versus 3.x, SVC 4.x takes advantage of all node ports
Cache partitioning available for governing workloads
4.x provides enhanced performance metrics
OLB: Gather End to End Response Time
Initially gather enough information to confirm there are I/O related issues
Identify if I/O throughput degradation is systemic
– All devices on given host
– All devices on all hosts
– All devices on a given SVC or SAN component
In this case all hosts were impacted by throughput degradation
Watch for large transfer sizes as destages from cache to backend storage are done in 32 KB writes.
OLB: Gather SVC MD Group data
SVC MD GROUP

Cluster  MD Group          Avg Read  Avg Write  Avg Total  Avg Read  Avg Write  Avg Total  Avg Read  Avg Write  Avg
                           IO Rate   IO Rate    IO Rate    MB/s      MB/s       MB/s       Size KB   Size KB    Size KB
SVC001   SVC1_12345_R5_1   459.80    119.80     579.60     29.00     28.00      57.00      64.70     518.90     109.20
SVC001   SVC1_22222_R1_3   395.10     78.50     473.60     28.10     29.30      57.40      72.80     381.90     125.40
SVC001   SVC1_12345_R5_3   309.30    124.00     433.30     22.80     20.50      43.40      74.80     308.70     106.00
SVC001   SVC1_12345_R5_2   293.10     60.20     353.30     17.70     18.00      35.70      62.50     359.70     105.00
SVC001   SVC1_33333_R5_9   286.70     91.30     378.00     14.60      7.40      22.00      56.00      75.70      67.90
SVC001   SVC1_12345_R5_4   233.30    102.20     335.60     17.50     13.10      30.60      77.90     242.90      99.10
SVC001   SVC1_33333_R5_2   224.70     78.20     302.90      8.40      2.30      10.80      32.20      33.10      34.00
SVC001   SVC1_33333_R5_0   207.70     68.10     275.80      8.30      1.90      10.20      30.20      22.00      31.30
SVC001   SVC1_33333_R5_1   197.30    105.30     302.60     10.30      6.90      17.20      41.10      76.40      70.50
SVC001   SVC1_33333_R5_4   191.80     87.80     279.60     16.50      3.00      19.40      76.30      37.70      62.10
Focus on the MDGs with the most throughput during the problem period
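Narrowing to the busiest MD Groups is a simple sort. The sketch below ranks a few rows from the table above by total data rate (MB/s); the names and values are copied from the table, and the top-3 cutoff is an arbitrary illustration.

```python
# Sketch of "focus on the MDGs with the most throughput": rank
# MD Group statistics by total data rate and keep the top few.
# (name, total MB/s) pairs taken from the MD Group table above.
mdg_stats = [
    ("SVC1_12345_R5_1", 57.00),
    ("SVC1_22222_R1_3", 57.40),
    ("SVC1_33333_R5_2", 10.80),
    ("SVC1_12345_R5_3", 43.40),
]

top = sorted(mdg_stats, key=lambda r: r[1], reverse=True)[:3]
for name, mb_s in top:
    print(name, mb_s)
```

The same ranking applied at the vdisk level identifies which hosts to investigate next.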
OLB: Drill Down To Vdisk
What are these hosts doing during this time period?
VDISK

VDisk    Servers        Avg Read  Avg Write  Avg Total  Avg Read  Avg Write  Avg Total  Avg Read  Avg Write  Avg
                        IO Rate   IO Rate    IO Rate    MB/s      MB/s       MB/s       Size KB   Size KB    Size KB
vdisk1   Host1, Host2   72.8      1.1        73.9       3         0          3          13.7      8          13.7
vdisk2   Host1, Host2   69.4      0.1        69.5       2.9       0          2.9        13.8      8          13.8
vdisk3   Host1, Host2   68.4      39.7       108.1      3.2       1.6        4.8        58.7      41         47.2
vdisk4   Host1, Host2   40.9      4          45         2.6       0          2.6        17.3      8          17.1
vdisk5   Host3          19.7      2.4        22.1       1.9       0          1.9        12.2      6.7        17
vdisk6   Host4          5.4       2.1        7.4        0.5       0          0.5        11.1      6.1        15.3
OLB: Identify Processes and Scheduled Jobs Initiating I/O
Check native schedulers (cron/at) for:
– Application users
– DB users
– root
Check 3rd party schedulers (Autosys)
Cron entries for db servers on hosts with high I/O identified 103 database backup schedules within problem period!
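The scheduler check can also be automated. The sketch below counts crontab entries whose hour field falls inside the 3-6 am problem window; the sample entries and script paths are illustrative, and in practice each user's crontab (application, DB, root) would be read in turn.

```python
# Sketch of the cron audit: find jobs scheduled inside the problem
# window (hour field between 3 and 6 am). Sample entries and paths
# are hypothetical; crontab fields are min hour dom month dow cmd.
sample_crontab = """15 3 * * * /opt/oracle/bin/rman_backup.sh DB01
30 3 * * * /opt/oracle/bin/rman_backup.sh DB02
0 22 * * * /usr/local/bin/logrotate.sh"""

def jobs_in_window(crontab_text, start_hr=3, end_hr=6):
    hits = []
    for line in crontab_text.splitlines():
        fields = line.split()
        # Only plain numeric hour fields are handled in this sketch
        # (ranges, lists, and */n steps are ignored).
        if len(fields) >= 6 and fields[1].isdigit():
            if start_hr <= int(fields[1]) < end_hr:
                hits.append(" ".join(fields[5:]))
    return hits

print(jobs_in_window(sample_crontab))  # the two 3 am backup commands
```

Run against every DB host, this kind of audit is what surfaced the 103 overlapping backup schedules.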
OLB – Root Cause
The root cause of the online banking performance degradation is a flooding of the SAN Volume Controller by streaming read I/Os originating from Oracle RMAN backups initiated on 103 databases within a 20-minute period.
This read I/O flood is cache hostile, causing other read and write requests to queue and creating performance degradation.
With the current host read-ahead settings, at peak (concurrent Oracle RMAN incremental backups between 3 am and 6 am), the SVC is not able to process the combined volume and composition of I/O without a flow-on performance impact.
OLB: Actions Taken During Analysis
Action                  SVC MB/s  SVC CPU  SVC Read Resp (ms)
Initial inspection      600       80       120
SVC 4.2.03 upgrade      800       80       90
Host - DMP patch        1300      70       65
Host - MPxIO corrected  2350      60       60
Target peak             2500      60       15
OLB Final Recommendations (by priority):
1. Implement the production backup policy/strategy in the test environment.
– Veritas snapshot backups for hosts operating large databases: reduce data transferred / schedule!
2. Tune RMAN, Oracle, and the SVC to control I/O composition and I/O availability.
– Scheduling / transfer size / isolation / governance on vdisk
3. Add a new I/O Group to SVC001. Isolation!
4. Replace the SVC 2145-8F4 nodes currently in use with 2145-8G4. Hardware upgrade!
SVC Performance Analysis Summary
Identify performance requirements/expectations!
Determine compatibility/Resolve incompatibilities
Utilize latest SVC firmware if possible
Measure hosts
Measure SVC
Measure backend storage
Identify bottlenecks and resolve
Appendix A: Additional Resources
These publications are also relevant as further information sources:
IBM System Storage SAN Volume Controller, SG24-6423-05
Get More Out of Your SAN with IBM Tivoli Storage Manager, SG24-6687
IBM Tivoli Storage Area Network Manager: A Practical Introduction, SG24-6848
IBM System Storage: Implementing an IBM SAN, SG24-6116
IBM System Storage Open Software Family SAN Volume Controller: Planning Guide, GA22-1052
IBM System Storage Master Console: Installation and User’s Guide, GC30-4090
IBM System Storage Open Software Family SAN Volume Controller: Installation Guide, SC26-7541
IBM System Storage Open Software Family SAN Volume Controller: Service Guide, SC26-7542
IBM System Storage Open Software Family SAN Volume Controller: Configuration Guide, SC26-7543
IBM System Storage Open Software Family SAN Volume Controller: Command-Line Interface User's Guide, SC26-7544
IBM System Storage Open Software Family SAN Volume Controller: CIM Agent Developer's Reference, SC26-7545
IBM TotalStorage Multipath Subsystem Device Driver User's Guide, SC30-4096
IBM System Storage Open Software Family SAN Volume Controller: Host Attachment Guide, SC26-7563
Biography
Brett Allison has been doing distributed-systems performance work since 1997, including J2EE application analysis, UNIX/NT, and storage technologies. His current role is Performance and Capacity Management team lead for ITDS. He has developed tools, processes, and service offerings to support storage performance and capacity. He has spoken at a number of conferences and is the author of several white papers on performance.