performance of and experiences with lustre over a long ... · system name pfs2 pfs3 pfs4 users...

24
Roland Laifer Lustre tools for ldiskfs investigation and lightweight I/O statistics LAD‘15 STEINBUCH CENTRE FOR COMPUTING - SCC www.kit.edu KIT University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association Performance of and experiences with Lustre over a long distance InfiniBand connection Roland Laifer

Upload: others

Post on 13-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

STEINBUCH CENTRE FOR COMPUTING - SCC

www.kit.edu KIT – University of the State of Baden-Württemberg and

National Laboratory of the Helmholtz Association

Performance of and experiences with Lustre over a long distance InfiniBand connection

Roland Laifer

Page 2: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

2 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Overview

Lustre systems at KIT

and details of our complex InfinBand fabric

Investigation of Lustre related InfiniBand (IB) problems

based on two examples

Lustre performance over long distance IB

for throughput and metatdata

Page 3: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

3 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Lustre systems at KIT - diagram

• One IB fabric

• Some file systems

mounted on all clusters

Systems at Campus South (CS)

Cluster 1 Cluster 2 Cluster 3 Lustre 1-4

InfiniBand

Cluster 4

Lustre 5-7

Systems at Campus North (CN)

Cluster 5 Lustre 8-10

InfiniBand

33 km 320 Gbit/s

Page 4: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

4 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Lustre systems at KIT - details

System name pfs2 pfs3 pfs4

Users universities,

all clusters

universities,

tier 2 cluster

(phase 1)

universities,

tier 2 cluster

(phase 2)

Lustre server

version

DDN

Exascaler 2.3

DDN

Exascaler 2.1

DDN

Exascaler 2.3

# of clients 3100 540 1200

# of servers 21 17 23

# of file systems 4 3 3

# of OSTs 2*20, 2*40 1*20, 2*40 1*14, 1*28, 1*70

Capacity (TiB) 2*427, 2*853 1*427, 2*853 1*610, 1*1220,

1*3050

Throughput (GB/s) 2*8, 2*16 1*8, 2*16 1*10, 1*20, 1*50

Storage hardware DDN SFA12K DDN SFA12K DDN ES7K

# of enclosures 20 20 16

# of disks 1200 1000 1120

Page 5: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

5 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Long distance connection - overview

Two dark fibres for reliability

Allow transparent failover

IB failover same as with

redundant switches

Length is 28 and 33 km

Use the same dark fibres for

different protocols

10 GE, 100 GE, IB, FC

Underlying technology is

dense wavelength division

multiplexing (DWDM)

Spacing was reduced from

100 GHz to 50 GHz to provide

additional channels for IB

Page 6: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

6 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Long distance connection - details

MetroX Transponder Filter

(MUX / DEMUX) Dark fibre

FDR10 10 GE Coloured light on

different fibres

Coloured light on

same fibre

10 m 2 km 2 m 33 km

Mellanox MetroX IB switches

Special hardware with enough buffers to fill full length of dark fibre

Obsidian's Longbow is an alternative product

Page 7: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

7 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

IB network details

• Up/down routing

• 284 IB switches

• 3139 IB hosts

QDR

FDR10

FDR

EDR

FDR switch

8x

31x

18x

9x 9x 9x

29x

18x

9x

8x

9x 2x

DWDM 8x 12x 4x

34x

12x

14x

6x

24x 12x 12x

Campus South (CS)

Campus North (CN)

Page 8: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

8 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Ex. 1: Investigate Lustre connectivity problems

Initial problem

Failover servers of all file systems reported new client connections

Obviously many clients had problems to reach active servers

New file system servers were failing over Lustre services

Reason was that ping on IB had failed (and bad configuration)

2 clients on new cluster lost connection to everything on IB

Happened while IB throughput benchmark was running

Further investigation

Problem disappeared after reboot of the switch with the bad clients

Problem was reproducible

Even with running the benchmark on few clients (including the 2 bad)

IB subnet manager showed healthy fabric

Replacing the EDR switch with the 2 bad clients did not help

Page 9: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

9 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Ex. 1: Investigate Lustre connectivity problems

QDR

FDR10

FDR

EDR

FDR switch

8x

31x

18x

9x 9x 9x

29x

18x

9x

8x

9x 2x

DWDM 8x 12x 4x

34x

12x

14x

6x

24x 12x 12x

Campus South (CS)

Campus North (CN)

Failed client

connections

Replaced

switch

Bad

switch

Failed ping

on IB

Clients lost

IB connection

Page 10: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

10 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Ex. 1: Investigate Lustre connectivity problems

Solution

Routes to 2 bad clients used the same port on core EDR switch

Replacing that switch fixed the problem

Do not ignore problems on few bad clients

Investigate routing to check which components are shared

Possible root cause

Management communication still worked

Therefore no new route assignment by subnet manager

Maybe subnet manager used alternative connection

Data communication on bad EDR switch port was blocked

Possibly caused backlog and full switch buffers on other switches

Cascading blocked ports on complete fabric might be possible

We do not know how such issues could be clearly investigated

Page 11: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

11 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Ex. 2: Investigate Lustre performance problems

QDR

FDR10

FDR

EDR

FDR switch

8x

31x

18x

9x 9x 9x

29x

18x

9x

8x

9x 2x

DWDM 8x 12x 4x

34x

12x

14x

6x

24x 12x 12x

Campus South (CS)

Campus North (CN)

Components used to

measure throughput Bottleneck

Page 12: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

12 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Ex. 2: Investigate Lustre performance problems

Insufficient performance with iozone over long distance IB

On 50 GB/s file system only reached 22 GB/s

Peak IB bandwidth is 8 * FDR10 = 38 GB/s

With one FDR10 connection reached 4.6 GB/s which is good

Investigate IB network topology

MetroX have 8 FDR10 connections but file system has 10 OSS

IB has static routing, i.e. some OSS share same FDR10 connection

Lustre evenly distributes files to OSTs, i.e. performance per OSS is

half of FDR10

Peak with 10 OSS is FDR10 / 2 * 10 = 24 GB/s

Measured 22 GB/s are good

Page 13: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

13 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Ex. 2: Investigate Lustre performance problems

Double check

With ibtracert checked which OSS used same FDR10 cable

MDS used one cable, i.e. only 7 connections were used by OSS

Created directories with different OST stripe index for iozone

Assigned only half number of files to OSTs with shared connections

Measured 29.2 GB/s with 7 connections

Would be 33.4 GB/s with 8 connections, i.e. 12% below peak

Page 14: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

14 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Performance measurement details

QDR

FDR10

FDR

EDR

FDR switch

8x

31x

18x

9x 9x 9x

29x

18x

9x

8x

9x 2x

DWDM 8x 12x 4x

34x

12x

14x

6x

24x 12x 12x

Used servers at CS

Used clients at CS

Used servers at CN

Used clients at CN

Page 15: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

15 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Performance measurement details

Done while some of the systems were in production

Just show trends, no focus on best performance

Write performance measured with iozone

Options: -+m <file_name> -i 0 -+n -r 1024k -t <thread_count> -s 8g

Metadata performance measured with mdtest

Options: -u -n 10000 -i 3 -p 10 -d <lustre_dir>

Used clients

CN: RH7, Mellanox OFED, FDR Connect-IB, Exascaler 2.3

CS: RH6, RH OFED, FDR ConnectX-3, Exascaler 2.1

Used file systems

CN: EF4024 (MDT), 28 OSTs on ES7K, 6 TB disks, Exascaler 2.3

CS: EF3015 (MDT), 40 OSTs on SFA12K, 3 TB disks, Exasc. 2.1

Page 16: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

16 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Write performance to file system at CN

Remote CS clients on

older hardware and

software are slightly slower

0

5

10

15

20

25

Th

rou

gh

pu

t (G

B/s

)

Number of clients

Write perf with 14 threads per client

Clients CN

Clients CS

Page 17: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

17 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Write performance to file system at CS

Same performance

from both sites

0

2

4

6

8

10

12

14

1 2 3 4 5 6

Th

rou

gh

pu

t (G

B/s

)

Number of clients

Write perf with 20 threads per client

Clients CN

Clients CS

Page 18: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

18 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

File creation performance to file system at CN

Much slower

performance from

remote site

With high load from

many clients delay on

server is dominating

0

5000

10000

15000

20000

25000

30000

35000

40000

1 2 3 4 5 6 40

Op

era

tio

ns / s

Number of clients

File creation with 2 tasks per client

Clients CN

Clients CS

Page 19: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

19 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

File creation performance to file system at CS

Same performance from

both sites

Single client performance

is not good, impact of

older server version?

0

10000

20000

30000

40000

50000

1 2 3 4 5 6 40

Op

era

tio

ns / s

Number of clients

File creation with 2 tasks per client

Clients CN

Clients CS

Page 20: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

20 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

File stat performance to file system at CN

Remote clients are

slower

Using more clients

helps to hide latency

0

20000

40000

60000

80000

100000

120000

1 2 3 4 5 6 40

Op

era

tio

ns / s

Number of clients

File stat with 2 tasks per client

Clients CN

Clients CS

Page 21: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

21 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

File stat performance to file system at CS

Remote clients are

slower

Using more clients

helps to hide latency

0

20000

40000

60000

80000

100000

120000

140000

160000

1 2 3 4 5 6 40

Op

era

tio

ns / s

Number of clients

File stat with 2 tasks per client

Clients CN

Clients CS

Page 22: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

22 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

File remove performance to file system at CN

Remote clients are

slower

Using more clients

helps to hide latency

0

10000

20000

30000

40000

50000

60000

70000

80000

1 2 3 4 5 6 40

Op

era

tio

ns / s

Number of clients

File remove with 2 tasks per client

Clients CN

Clients CS

Page 23: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

23 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

File remove performance to file system at CS

Remote clients are

slower

Using more clients

helps to hide latency

0

10000

20000

30000

40000

50000

60000

1 2 3 4 5 6 40

Op

era

tio

ns / s

Number of clients

File remove with 2 tasks per client

Clients CN

Clients CS

Page 24: Performance of and experiences with Lustre over a long ... · System name pfs2 pfs3 pfs4 Users universities, all clusters universities, tier 2 cluster (phase 1) universities, tier

Roland Laifer – Lustre tools for ldiskfs investigation and lightweight I/O statistics – LAD‘15

INSTITUTS-, FAKULTÄTS-, ABTEILUNGSNAME (in der Masteransicht ändern)

24 2016-09-21 Steinbuch Centre for Computing Roland Laifer – Lustre over a long distance InfiniBand connection – LAD‘16

Experiences and summary

Experiences with complex IB network at KIT

Analyzing network related problems is not easy

We saw very few critical problems

Long distance IB connection just worked as expected

Performance over long distance IB

Throughput is similar to local usage

Metadata performance depends on distance

With 33 km drops to about one-third for some operations

Impact can be reduced by using more clients

All my talks about Lustre

http://www.scc.kit.edu/produkte/lustre.php

[email protected]