masters stretched svc-cluster-2012-04-13 v2
TRANSCRIPT
Stretched SVC Cluster
ATS Masters - Storage
© 2012 IBM Corporation
Agenda
A little history
Options and issues
Requirements and restrictions
Terminology
SVC Split I/O Group = SVC Stretched Cluster = SVC Split Cluster
– Two independent SVC nodes in two independent sites + one independent site for Quorum
– Acts just like a single I/O Group with distributed high availability
Distributed I/O groups – NOT an HA configuration and not recommended; if one site fails:
– Manual volume move required
– Some data still in cache of the offline I/O Group
[Diagram: a single I/O Group stretched across Site 1 and Site 2, versus I/O Group 1 and I/O Group 2 each confined to one site]
Storwize V7000 Split I/O Group not an option:
– Single enclosure includes both nodes
– Physical distribution across two sites not possible
Early Days - separate racks in machine room
[Diagram: servers; SVC + UPS in each rack; Fabric A and Fabric B]
Protected from physical problems: plumbing leak, fire
“A chance to tackle the guy with a chainsaw”
Where’s the quorum disk?
Lots of cross-cabling
Fabric presence on both sides of machine room
[Diagram: servers, SVC + UPS, and Fabric A / Fabric B switches on both sides of the machine room]
Cross cabling only needed for fabrics and SVC nodes
Requirement for zero-hop between nodes in an I/O Group
Needed to “zone away” ISL connections within an I/O Group
Each fabric has two sets of zones for the two sets of ports.
Quorum concern remains
Server cluster also stretched across machine room
[Diagram: server cluster stretched across the machine room; SVC + UPS and Fabric A / Fabric B switches on both sides]
Cross cabling only needed for fabrics and SVC nodes
Requirement for zero-hop between nodes in an I/O Group
Needed to “zone away” ISL connections within an I/O Group
Each fabric has two sets of zones for the two sets of ports.
Quorum concern remains
Can do cluster failover
Where’s the storage?
SVC V4.3 Vdisk (Volume) Mirroring
[Diagram: stretched server cluster; SVC + UPS and Fabric A / Fabric B switches on both sides of the machine room]
Cross cabling only needed for fabrics and SVC nodes
Requirement for zero-hop between nodes in an I/O Group
Needed to “zone away” ISL connections within an I/O Group
Each fabric has two sets of zones for the two sets of ports.
Quorum concern remains
SVC Volume has a copy on either side of machine room
Can do cluster failover, and mirroring allows a single Volume to be visible on both sides
SVC V5.1 – put LW SFPs in nodes for 10km distance
[Diagram: stretched server cluster; SVC + UPS and Fabric A / Fabric B switches at each site]
LW SFPs w/SM fibre allow up to 10 km
Requirement for zero-hop between nodes in an I/O Group
Needed to “zone away” ISL connections within an I/O Group
Each fabric has two sets of zones for the two sets of ports.
Quorum concern remains
SVC Volume has a copy at each site
Where's the quorum?
Can do cluster failover, and mirroring allows a single Volume to be visible on both sides
SVC V5.1 – stretched cluster with 3rd site -1
[Diagram: stretched server cluster; SVC + UPS and Fabric A / Fabric B switches at each site]
LW SFPs w/SM fibre allow up to 10 km
Can do cluster failover, and mirroring allows a single Volume to be visible on both sides
Active / passive storage devices (like DS3/4/5K):
Each quorum disk storage controller must be connected to both sites
Ok, but?
SVC V5.1 – stretched cluster with 3rd site -2
[Diagram: stretched server cluster; SVC + UPS and Fabric A / Fabric B switches at each site]
LW SFPs w/SM fibre allow up to 10 km
Can do cluster failover, and mirroring allows a single Volume to be visible on both sides
Active / passive storage devices (like DS3/4/5K):
Each quorum disk storage controller must be connected to both sites
SVC V5.1 – stretched cluster with 3rd site -3
[Diagram: stretched server cluster; SVC + UPS and Fabric A / Fabric B switches at each site]
LW SFPs w/SM fibre allow up to 10 km
Can do cluster failover, and mirroring allows a single Volume to be visible on both sides
Active / passive storage devices (like DS3/4/5K):
Each quorum disk storage controller must be connected to both sites
LOTS OF CROSS CABLING!
SVC V6.3-option 1: Same as V5 but farther using DWDM
[Diagram: stretched server cluster; SVC + UPS and Fabric A / Fabric B switches at each site, linked by DWDM]
DWDM allows up to 40 km; speed drops after 10 km
Can do cluster failover, and mirroring allows a single Volume to be visible on both sides
Active / passive storage devices (like DS3/4/5K):
Each quorum disk storage controller must be connected to both sites
SVC V6.3-option 1 (cont)
[Diagram: server cluster and SVC + UPS at each site; Fabric A and Fabric B stretched across both sites over DWDM]
User chooses number of ISLs on SAN
Still no hops between nodes in an I/O Group
These connections can be on DWDM too
Two ports per SVC node attached to local switches
Two ports per SVC node attached to remote switches via DWDM
Hosts and storage attached to local switches; need enough ISLs
3rd site quorum (not shown) attached to both fabrics
Active or passive DWDM over shared single mode fibre(s)
0–10 km: Fibre Channel distance supported up to 8 Gbps
11–20 km: Fibre Channel distance supported up to 4 Gbps
21–40 km: Fibre Channel distance supported up to 2 Gbps
SVC V6.3 option 2: Dedicated ISLs for nodes (can use DWDM)
[Diagram: server cluster and SVC + UPS at each site; public Fabrics A and B for hosts and storage; dedicated private Fabrics C and D for node-to-node traffic, each with at least 1 ISL between sites (trunked if more than 1)]
User chooses number of ISLs on public SAN
Only half of all SVC ports used for host I/O
Two ports per SVC node attached to public fabrics
Two ports per SVC node attached to dedicated fabrics
Hosts and storage attached to public fabrics
3rd site quorum (not shown) attached to public fabrics
Distance now up to 300km
Apps may require less
Configuration Using Brocade Virtual Fabrics
[Diagram: server cluster and SVC + UPS at each site; each physical switch partitioned into a public and a private virtual fabric (Fabric A and Fabric B at each site)]
Physical switches are partitioned into two logical switches, two virtual fabrics
Note: ISLs/Trunks for private fabrics are dedicated rather than being shared, to guarantee dedicated bandwidth is available for node-to-node traffic
Configuration Using CISCO VSANs
[Diagram: server cluster and SVC + UPS at each site; each switch partitioned into public and private VSANs (VSAN A and VSAN B at each site)]
Note ISLs/Trunks for private VSANs are dedicated rather than being shared to guarantee dedicated bandwidth available for node to node traffic
Switches/fabrics are partitioned using VSANs
1 ISL per I/O group, configured as trunk
User chooses number of ISLs on public SAN
Two ports per SVC node attached to public VSANs
Two ports per SVC node attached to private VSANs
Hosts and storage attached to public VSANs
3rd site quorum (not shown) attached to public VSANs
Split I/O Group – Distance
The new Split I/O Group configurations will support distances of up to 300km (same recommendation as for Metro Mirror)
However, for a typical split I/O group deployment only 1/2 to 1/3 of this distance is recommended, because there will be 2 to 3 times as much latency depending on which distance extension technology is used
The following charts explain why
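To put rough numbers on it (illustrative arithmetic based on the ~0.01 ms/km round trip figure used later in this deck): 300 km adds about 3 ms per round trip, so a write path that needs 2 round trips sees roughly 6 ms of added latency and one that needs 3 round trips roughly 9 ms – hence the recommendation to stay at around 150 km or 100 km respectively.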
Metro/Global Mirror
Technically SVC supports distances up to 8000 km – SVC will tolerate a round trip delay of up to 80 ms between nodes
The same code is used for all inter-node communication
• Global Mirror, Metro Mirror, Cache Mirroring, Clustering
• SVC's proprietary SCSI protocol only has 1 round trip
In practice, applications are not designed to support a Write I/O latency of 80 ms
Hence Metro Mirror is deployed for shorter distances (up to 300km) and Global Mirror is used for longer distances
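As a quick sanity check (illustrative arithmetic, not from the original slide): light in glass covers roughly 200,000 km/sec, so one round trip over 8000 km takes 2 × 8000 km ÷ 200,000 km/sec = 0.08 sec = 80 ms – exactly the maximum round trip delay SVC tolerates.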
Metro Mirror: Application latency = 1 long distance round trip
Data center 1 Data center 2
4) Metro Mirror Data transfer to remote site
5) Acknowledgment
Steps 1–6 affect application latency; steps 7–10 should not affect the application
Server Cluster 1
1) Write request from host
2) Xfer ready to host
3) Data transfer from host
6) Write completed to host
7a) Write request from SVC
8a) Xfer ready to SVC
9a) Data transfer from SVC
10a) Write completed to SVC
SVC Cluster 1
Server Cluster 2
7b) Write request from SVC
8b) Xfer ready to SVC
9b) Data transfer from SVC
10b) Write completed to SVC
SVC Cluster 2
1 round trip
Split I/O Group – Preferred Node local, Write uses 1 round trip
Data center 1 Data center 2
Steps 1–6 affect application latency; steps 7–10 should not affect the application
Server Cluster 1
1) Write request from host
2) Xfer ready to host
3) Data transfer from host
6) Write completed to host
Server Cluster 2
4) Cache Mirror Data transfer to remote site
5) Acknowledgment
SVC Split I/O Group
7b) Write request from SVC
8b) Xfer ready to SVC
9b) Data transfer from SVC
10b) Write completed to SVC
1 round trip
2 round trips – but SVC write cache hides this latency from the host
Node 1 Node 2
Split I/O Group – Preferred node remote, Write = 3 round trips
Data center 1 Data center 2
Server Cluster 1
1) Write request from host
2) Xfer ready to host
3) Data transfer from host
6) Write completed to host
Server Cluster 2
7b) Write request from SVC
8b) Xfer ready to SVC
9b) Data transfer from SVC
10b) Write completed to SVC
4) Cache Mirror Data transfer to remote site
5) Acknowledgment
SVC Split I/O Group
2 round trips
1 round trip
2 round trips – but SVC write cache hides this latency from the host
Node 1 Node 2
Steps 1–6 affect application latency; steps 7–10 should not affect the application
Help with some round trips
Some switches and distance extenders use extra buffers and proprietary protocols to eliminate one of the round trips worth of latency for SCSI Write commands
These devices are already supported for use with SVC
No benefit or impact for inter-node communication
Does benefit host to remote SVC I/Os
Does benefit SVC to remote storage controller I/Os
Split I/O Group – Preferred Node Remote with help, 2 round trips
Data center 1 Data center 2
Steps 1 to 12 affect application latency; steps 13 to 22 should not affect the application
Server Cluster 1
1) Write request from host
2) Xfer ready to host
3) Data transfer from host
12) Write completed to host
Server Cluster 2
4) Write+ data transfer to remote site
8) Cache Mirror Data transfer to remote site
9) Acknowledgment
SVC Split I/O Group
11) Write completion to remote site
21) Write completion to remote site
16) Write+ data transfer to remote site
Distance Extenders
5) Write request to SVC
6) Xfer ready from SVC
7) Data transfer to SVC
10) Write completed from SVC
13) Write request from SVC
14) Xfer ready to SVC
15) Data transfer from SVC
22) Write completed to SVC
17) Write request to storage
18) Xfer ready from storage
19) Data transfer to storage
20) Write completed from storage
1 round trip
1 round trip
1 round trip – hidden from the host
Node 1 Node 2
Long Distance Impact
Additional latency because of long distance
Light speed in glass: ~200,000 km/sec
– 1 km distance = 2 km round trip
Additional round trip time because of distance:
– 1 km = 0.01 ms
– 10 km = 0.10 ms
– 25 km = 0.25 ms
– 100 km = 1.00 ms
– 300 km = 3.00 ms
SCSI protocol:
– Read: 1 I/O operation = 0.01 ms / km
• Initiator requests data and target provides data
– Write: 2 I/O operations = 0.02 ms / km
• Initiator announces amount of data, target acknowledges
• Initiator sends data, target acknowledges
– SVC's proprietary SCSI protocol for node-to-node traffic has only 1 round trip
Fibre Channel frame:
– User data per FC frame (Fibre Channel payload): up to 2048 bytes = 2 KB
• Even for very small user data (< 2 KB) a complete frame is required
• Large user data is split across multiple frames
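The figures above lend themselves to a small calculator. A minimal sketch (illustrative only; the round-trip counts per operation follow the earlier charts, and the constant and function names are made up here):

    # Added latency from distance, using the rules of thumb above:
    # light in glass ~200,000 km/s => one round trip adds ~0.01 ms per km.
    RTT_MS_PER_KM = 0.01

    ROUND_TRIPS = {
        "svc_node_to_node": 1,   # SVC's proprietary protocol: 1 round trip
        "scsi_read": 1,          # initiator requests data, target provides it
        "scsi_write": 2,         # xfer-ready handshake, then the data transfer
        "write_preferred_node_remote": 3,  # per the split I/O group charts
    }

    def added_latency_ms(distance_km, operation):
        """Extra latency in ms that the distance adds to one operation."""
        return ROUND_TRIPS[operation] * RTT_MS_PER_KM * distance_km

    for km in (10, 100, 300):
        print(km, "km:",
              added_latency_ms(km, "scsi_write"), "ms per SCSI write,",
              added_latency_ms(km, "write_preferred_node_remote"),
              "ms when the preferred node is remote")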
SVC Split I/O Group – Quorum Disk
SVC creates three Quorum disk candidates on the first three managed MDisks
One Quorum disk is active
SVC 5.1 and later:
– SVC is able to handle the Quorum disk management in a very flexible way, but in a Split I/O Group configuration a well defined setup is required.
– Disable the dynamic quorum feature using the "override" flag for V6.2 and later
• svctask chquorum -MDisk <mdisk_id or name> -override yes
• This flag is currently not configurable in the GUI
“Split Brain” situation:
– SVC uses the quorum disk to decide which SVC node(s) should survive
No access to the active Quorum disk:
– In a standard situation (no split brain): SVC will select one of the other Quorum candidates as active Quorum
SVC Split I/O Group – Quorum Disk
Quorum disk requirements: Only certain storage supported
– Must be placed in a third, independent site
– Storage box must be fibre channel connected
– ISLs with one hop to Quorum storage system are supported
Supported infrastructure:
– WDM equipment similar to Metro Mirror
– Link requirement similar to Metro Mirror
• Max round trip delay time is 80 ms, 40 ms each direction
– FCIP to Quorum disk can be used with the following requirements:
• Max round trip delay time is 80 ms, 40 ms each direction
• If fabrics are not merged, routers required
• Independent long distance equipment from each site to Site 3 is required
iSCSI storage not supported
Requirement for active / passive storage devices (like DS3/4/5K):
– Each quorum disk storage controller must be connected to both sites
3rd-site quorum supports Extended Quorum
Split I/O Group without ISLs between SVC nodes
Minimum distance   Maximum distance   Maximum link speed
>= 0 km            <= 10 km           8 Gbps
> 10 km            <= 20 km           4 Gbps
> 20 km            <= 40 km           2 Gbps
SVC 6.3:
– Similar to the support statement in SVC 6.2
– Additional: support for active WDM devices
– Quorum disk requirement similar to Remote Copy (MM/GM) requirements:
• Max. 80 ms round trip delay time, 40 ms each direction
• FCIP connectivity supported for quorum disk
• No support for iSCSI storage systems
Split I/O Group without ISLs between SVC nodes (Classic Split I/O Group)
SVC 6.2 and earlier:
– Two ports on each SVC node needed to be connected to the "remote" switch
– No ISLs between SVC nodes
– Third site required for Quorum disk
– ISLs with max. 1 hop can be used for storage traffic and Quorum disk attachment
SVC 6.2 (late) update:
– Distance extension to max. 40 km with passive WDM devices
• Up to 20 km at 4 Gb/s or up to 40 km at 2 Gb/s
• LongWave SFPs required for long distances
• LongWave SFPs must be supported by the switch and WDM vendor
Split I/O Group without ISLs between SVC nodes
[Diagram: SVC node 1, Server 1, storage, and Switches 1 and 2 at Site 1; SVC node 2, Server 2, storage, and Switches 3 and 4 at Site 2; active Quorum at Site 3]
Split I/O Group without ISLs between SVC nodes
[Diagram: Switches 1 and 2 with SVC node 1 and Server 1 at Site 1; Switches 3 and 4 with SVC node 2 and Server 2 at Site 2; storage at both sites; server ISLs between the sites; Switches 5 and 6 and the active Quorum at Site 3]
Split I/O Group without ISLs between SVC nodes
[Diagram: same configuration as the previous slide, with a DS4700 (Controller A and Controller B) holding the active Quorum at Site 3]
SAN and Buffer-to-Buffer Credits
Buffer-to-Buffer (B2B) credits
– Are used as a flow control method by Fibre Channel technology and represent the number of frames a port can store
• Correct credit sizing provides best performance
Light must cover the distance 2 times
– Submit data from Node 1 to Node 2
– Submit acknowledge from Node 2 back to Node 1
B2B Calculation depends on link speed and distance
– Number of multiple frames in flight increase equivalent to the link speed
Split I/O Group without ISLs: Long distance configuration
SVC Buffer to Buffer credits
– 2145-CF8 / CG8 have 41 B2B credits
• Enough for 10 km at 8 Gb/sec with 2 KB payload
– All earlier models:
• Use 1/2/4 Gb/sec Fibre Channel adapters
• Have 8 B2B credits, which is enough for 4 km at 4 Gb/sec
Recommendation 1:
– Use CF8 / CG8 nodes for more than 4km distance for best performance
Recommendation 2:
– SAN switches do not auto-negotiate B2B credits, and 8 B2B credits is the default setting, so change the B2B credits in the switch to 41 as well
Link speed   FC frame length   Required B2B credits for 10 km distance   Max distance with 8 B2B credits
1 Gb/sec     1 km              5                                          16 km
2 Gb/sec     0.5 km            10                                         8 km
4 Gb/sec     0.25 km           20                                         4 km
8 Gb/sec     0.125 km          40                                         2 km
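The table values follow a common rule of thumb: with full 2 KB frames, required credits ≈ distance (km) × link speed (Gb/s) ÷ 2. A minimal sketch that reproduces the table above (illustrative only; the function names are made up here):

    # Rule-of-thumb B2B credit sizing, assuming full 2 KB frames as in the table.
    import math

    def credits_needed(distance_km, speed_gbps):
        """Approximate B2B credits needed to keep a link of this length filled."""
        return math.ceil(distance_km * speed_gbps / 2)

    def max_distance_km(credits, speed_gbps):
        """Approximate distance a given number of credits can keep filled."""
        return 2 * credits / speed_gbps

    for speed in (1, 2, 4, 8):
        print(f"{speed} Gb/s: {credits_needed(10, speed)} credits needed for 10 km, "
              f"8 credits reach {max_distance_km(8, speed):.0f} km")
    # CF8/CG8 nodes have 41 credits: just enough for 10 km at 8 Gb/s (40 needed)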
Split I/O Group with ISLs between SVC nodes
[Diagram: private SANs 1 and 2 and public SANs 1 and 2 at both sites, linked by ISLs over WDM; SVC-01, Servers 1 and 2, and a Quorum candidate on storage at Site 1; SVC-02, Servers 3 and 4, and a Quorum candidate on storage at Site 2; active Quorum (Controller A and Controller B) at Site 3]
Long distance with ISLs between SVC nodes
Some switches and distance extenders use extra buffers and proprietary protocols to eliminate one of the round trips worth of latency for SCSI Write commands
– These devices are already supported for use with SVC
– No benefit or impact inter-node communication
– Does benefit Host to remote SVC I/Os
– Does benefit SVC to remote Storage Controller I/Os
Consequences:
– Metro Mirror is deployed for shorter distances (up to 300km)
– Global Mirror is used for longer distances
– Split I/O Group supported distance will depend on application latency restrictions
• 100 km for live data mobility (150 km with distance extenders)
• 300 km for fail-over / recovery scenarios
• SVC supports up to 80 ms latency, far greater than most application workloads would tolerate
Split I/O Group Configuration: Examples
[Diagram: the same Split I/O Group with ISLs configuration as on the previous slide – private and public SANs at Sites 1 and 2 linked over WDM, Quorum candidates at Sites 1 and 2, active Quorum at Site 3]
Example 1)
Configuration with live data mobility:
VMware ESX with VMotion or AIX with live partition mobility
Distance between sites: 12 km
-> SVC 6.3: Configurations with or without ISLs are supported
-> SVC 6.2: Only the configuration without ISLs is supported
Example 2)
Configuration with live data mobility:
VMware ESX with VMotion or AIX with live partition mobility
Distance between sites: 70 km
-> Only SVC 6.3 Split I/O Group with ISLs is supported
Example 3)
Configuration without live data mobility:
VMware ESX with SRM, AIX HACMP, or MS Cluster
Distance between sites: 180 km
-> Only SVC 6.3 Split I/O Group with ISLs is supported, or
-> Metro Mirror configuration
Because of the long distance: only in an active / passive configuration
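The three examples follow the distance limits quoted earlier (40 km without ISLs, 100 km / 150 km for live data mobility with ISLs, 300 km overall). A small illustrative helper built on those limits (a sketch only, not an official support matrix – always check the current SVC support statements):

    # Which stretched-cluster options apply, using the distance limits quoted above.
    def supported_options(distance_km, live_data_mobility, distance_extenders=False):
        options = []
        if distance_km <= 40:
            options.append("Split I/O Group without ISLs (SVC 6.2 passive / 6.3 active WDM)")
        mobility_limit = 150 if distance_extenders else 100
        if distance_km <= 300 and (not live_data_mobility or distance_km <= mobility_limit):
            options.append("SVC 6.3 Split I/O Group with ISLs")
        if not live_data_mobility and distance_km <= 300:
            options.append("Metro Mirror (active / passive)")
        return options

    # Example 1: 12 km with VMotion -> both Split I/O Group variants
    # Example 2: 70 km with VMotion -> only the variant with ISLs
    # Example 3: 180 km, no live data mobility -> with ISLs, or Metro Mirror
    for args in ((12, True), (70, True), (180, False)):
        print(args, supported_options(*args))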
Split I/O Group - Disaster Recovery
Split I/O groups provide distributed HA functionality
Usage of Metro Mirror / Global Mirror is recommended for disaster protection
Both major Split I/O Group sites must be connected to the MM / GM infrastructure
Without ISLs between SVC nodes:
– All SVC ports can be used for MM / GM connectivity
With ISLs between SVC nodes:
– Only MM / GM connectivity to the public SAN network is supported
– Only 2 FC ports per SVC node will be available for MM or GM and will also be used for host to SVC and SVC to disk system I/O
• Thus limited capability currently
• Congestion on GM ports would affect host I/O, but not node-to-node (heartbeats, etc.)
• Might need more than one cluster to handle all traffic
– More expensive, more ports and paths to deal with
Summary
SVC Split I/O Group:
– Is a very powerful solution for automatic and fast handling of storage failures
– Transparent for servers
– Perfect fit in a virtualized environment (like VMware VMotion, AIX Live Partition Mobility)
– Transparent for all OS based clusters
– Distances up to 300 km (SVC 6.3) are supported
Two possible scenarios:
– Without ISLs between SVC nodes (classic SVC Split I/O Group)
• Up to 40 km distance with support for active (SVC 6.3) and passive (SVC 6.2) WDM
– With ISLs between SVC nodes:
• Up to 100 km distance for live data mobility (150 km with distance extenders)
• Up to 300 km for fail-over / recovery scenarios
Long distance performance impact can be optimized by:
– Load distribution across both sites
– Appropriate SAN Buffer to Buffer credits