managing and monitoring large scale data transfers - networkshop44

Managing and monitoring large scale data transfers (WLCG FTS service as an example) Brian Davies

Upload: jisc

Post on 14-Jan-2017


Page 1: Managing and monitoring large scale data transfers - Networkshop44

Managing and monitoring large scale data transfers (WLCG FTS service as an example)

Brian Davies

Page 2: Managing and monitoring large scale data transfers - Networkshop44

Managing and monitoring large scale data transfers

(WLCG FTS service as an example)

Brian Davies

Networkshop44

22/03/16

Page 3: Managing and monitoring large scale data transfers - Networkshop44

Outline
• Outline of the data transfer monitoring
• What is the File Transfer Service (FTS)
• Monitoring at different levels
  – Central FTS data transfer monitoring
  – Virtual Organisation (VO) specific
  – User monitoring
• Federated failover
• Use of “generic” monitoring tools
  – Site monitoring in conjunction with VO monitoring

Page 4: Managing and monitoring large scale data transfers - Networkshop44

WLCG has a lot of data transfers to monitor
• 167 sites in 43 countries on six continents
• Storage endpoints containing 250 PB (disk) and 300 PB (tape)
  – Organised and chaotic access
  – Supporting single/multiple endpoints for single/multiple Virtual Organisations
  – Vary in size and scope
    • 10 TB to 10s of PB of total storage (disk and tape)
    • 1/10 GE NICs, 1/10/100 Gbps, R&E networks and private OPN
    • 10 TB to 1 PB filesystems/object stores, 1-300 disk servers per site
    • Multiple filesystems (XFS, HDFS, Ceph, GPFS, Lustre)
• Central production and user initiated
• In the last two years WLCG has moved 0.5 EB of data
  – Over 1 billion files
• Worker node (WN) jobs produce a lot of data which also has to be stored/moved
  – One VO runs 200k concurrent jobs which last from 10 minutes to 72 hours
  – 0 to 100s of input files, 2-3 output files
    • Individual file open times 1-10000 s
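To put the totals above into perspective, the following is a rough back-of-the-envelope calculation. It assumes decimal (SI) units, a flat two-year window and exactly one billion files, none of which the slide states precisely, so the results are order-of-magnitude figures only.

```python
# Back-of-the-envelope figures derived from the numbers quoted on the slide.
EB = 10**18  # decimal exabyte in bytes (assumption: SI units)

total_bytes = 0.5 * EB          # data moved in the last two years
total_files = 1_000_000_000     # "over 1 billion files", taken as a lower bound
seconds = 2 * 365 * 24 * 3600   # two years, ignoring leap days

avg_rate_bytes = total_bytes / seconds          # sustained average in bytes/s
avg_rate_gbps = avg_rate_bytes * 8 / 1e9        # same figure in Gbit/s
avg_file_mb = total_bytes / total_files / 1e6   # upper bound on the mean file size
files_per_second = total_files / seconds        # lower bound on the file rate

print(f"average rate: {avg_rate_bytes / 1e9:.1f} GB/s ({avg_rate_gbps:.0f} Gbit/s)")
print(f"mean file size: at most {avg_file_mb:.0f} MB")
print(f"files per second: at least {files_per_second:.0f}")
```

Under these assumptions the sustained average comes out at roughly 8 GB/s (about 63 Gbit/s) across the whole infrastructure, with a mean file size of no more than about 500 MB.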

Page 5: Managing and monitoring large scale data transfers - Networkshop44

Transfers to a single site/1day/1VO

Page 6: Managing and monitoring large scale data transfers - Networkshop44

Easily fill our networks*

*Not all the time

Page 7: Managing and monitoring large scale data transfers - Networkshop44

Data movements vary greatly
• File size from ~10 B to ~10 GB
• Latency between hosts from 0.1 ms to 350 ms (just for the UK)
• Different workflows require different data movement
  – WAN: SE<->SE, SE->WN, WN->SE
  – LAN: WN<->SE, SE<->SE
• Different tools to monitor different workflows
• Different storage middleware
  – Native GridFTP, BeStMan, DPM, dCache, StoRM
• Different transfer protocols
  – gsiFTP, HTTP/WebDAV, xrootd, NFSv4.1, S3
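One reason the latency range above matters is the bandwidth-delay product: the amount of data that must be in flight to keep a link full. The sketch below evaluates it for the two latency extremes quoted on the slide. The 10 Gbit/s link speed is an assumption chosen for illustration, and the function name is hypothetical.

```python
def bdp_bytes(link_gbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the link."""
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_second * rtt_ms / 1e3

# Latency extremes from the slide; the 10 Gbit/s link is an assumed example.
for rtt_ms in (0.1, 350.0):
    mb = bdp_bytes(10.0, rtt_ms) / 1e6
    print(f"RTT {rtt_ms:6.1f} ms -> {mb:8.3f} MB in flight to fill 10 Gbit/s")
```

At 0.1 ms a single stream needs only about 0.125 MB of buffer to fill the assumed link, while at 350 ms it needs about 437.5 MB, which is why the same transfer settings cannot suit every pair of hosts.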

Page 8: Managing and monitoring large scale data transfers - Networkshop44

File Transfer Service (FTS) moves data!
• EGI middleware stack
• Can handle many VOs
  – 22 (HEP and non-HEP)
• Checksum validation of files
• Retry of failed transfers
• Auto-optimisation of transfer parameters to maximise throughput
• Ability to set limits suitable for varied storage setups
• Web friendly GUI also available!! Federated Failover
  – Mainly use command line tools or higher level control systems
• Handles many file transfers (~1.5M a day)
  – From a single file to thousands of files per submission
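The checksum validation and retry behaviour listed for FTS can be illustrated with a minimal sketch. This is not the FTS implementation: the function names are hypothetical, the copy step is a stand-in callable, and Adler-32 is an assumed choice of checksum that the slides do not specify.

```python
import zlib
from typing import Callable


def adler32_hex(data: bytes) -> str:
    """Adler-32 checksum as 8 hex digits (assumed checksum algorithm)."""
    return f"{zlib.adler32(data) & 0xFFFFFFFF:08x}"


def transfer_with_retry(copy: Callable[[], bytes], expected: str,
                        max_attempts: int = 3) -> int:
    """Run copy() until its result matches the expected checksum.

    Returns the number of attempts used; raises RuntimeError if every
    attempt fails or produces data with the wrong checksum.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            data = copy()
        except OSError as exc:          # transport failure: try again
            last_error = exc
            continue
        if adler32_hex(data) == expected:
            return attempt
        last_error = ValueError("checksum mismatch")
    raise RuntimeError(f"transfer failed after {max_attempts} attempts") from last_error


# Stand-in copy that fails once, then delivers a truncated file, then succeeds.
payload = b"example file contents"
calls = []

def flaky_copy() -> bytes:
    calls.append(1)
    if len(calls) == 1:
        raise OSError("connection reset")
    if len(calls) == 2:
        return payload[:-1]
    return payload

attempts_used = transfer_with_retry(flaky_copy, adler32_hex(payload))
print(f"succeeded after {attempts_used} attempts")
```

Validating after every attempt means a truncated or corrupted copy is treated the same way as a failed connection, so the caller only ever sees a verified file or a clear error.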

Page 9: Managing and monitoring large scale data transfers - Networkshop44

Web GUI

Page 10: Managing and monitoring large scale data transfers - Networkshop44

Monitoring at different levels
• An overview of all transfers is needed to see problematic sites
  – But we also need the ability to look at individual transfers
• Web GUIs, reading log files
  – We even have web GUIs which parse log files
• People using the monitoring vary:
  – Site admins, regional support, VO users, middleware developers
• Management and technical
  – Different systems work well for different use cases
• What is of interest?
  – Do transfers complete or fail?
  – How fast do they complete?
• How can I tell if my changes improve/worsen the system?
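The two questions of interest above (do transfers complete, and how fast) can be answered per link from a list of transfer records. The sketch below is a minimal illustration: the record layout, the site names and the function name are all hypothetical and do not reflect the schema of any real FTS monitoring feed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Transfer:
    source: str
    destination: str
    size_bytes: int
    duration_s: float
    ok: bool


def summarise(transfers):
    """Per (source, destination): (success rate, mean rate in MB/s of successes)."""
    stats = defaultdict(lambda: {"ok": 0, "failed": 0, "bytes": 0, "seconds": 0.0})
    for t in transfers:
        link = stats[(t.source, t.destination)]
        if t.ok:
            link["ok"] += 1
            link["bytes"] += t.size_bytes
            link["seconds"] += t.duration_s
        else:
            link["failed"] += 1
    report = {}
    for key, s in stats.items():
        total = s["ok"] + s["failed"]
        rate_mb_s = s["bytes"] / s["seconds"] / 1e6 if s["seconds"] > 0 else 0.0
        report[key] = (s["ok"] / total, rate_mb_s)
    return report


# Hypothetical records for one link.
sample = [
    Transfer("SITE-A", "SITE-B", 1_000_000_000, 10.0, True),
    Transfer("SITE-A", "SITE-B", 1_000_000_000, 20.0, True),
    Transfer("SITE-A", "SITE-B", 1_000_000_000, 5.0, False),
]
report = summarise(sample)
for (src, dst), (success, rate) in report.items():
    print(f"{src} -> {dst}: {success:.0%} success, {rate:.1f} MB/s")
```

Keeping the same summary before and after a configuration change gives a simple way to judge whether the change improved or worsened the system.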

Page 11: Managing and monitoring large scale data transfers - Networkshop44

Central FTS Monitoring (dashboards and server GUIs)

Three Main VOs usage varies

Page 12: Managing and monitoring large scale data transfers - Networkshop44

Overview to see if single site is having issues

Page 13: Managing and monitoring large scale data transfers - Networkshop44

View smaller selections…

Able to make sub-selections to diagnose problems not at the world scale

Page 14: Managing and monitoring large scale data transfers - Networkshop44

Comparison between inter-SE rates

Sites want to know if they are better than their collaborators/competitors

Page 15: Managing and monitoring large scale data transfers - Networkshop44

Ability to delve into greater detail at the server level

Many embedded links to further monitoring

Page 16: Managing and monitoring large scale data transfers - Networkshop44

Down to individual transfers

Page 17: Managing and monitoring large scale data transfers - Networkshop44

To the log file

Page 18: Managing and monitoring large scale data transfers - Networkshop44

Which the VO can then re-interpret

Page 19: Managing and monitoring large scale data transfers - Networkshop44

Transfer optimisation within FTS to increase individual transfers

Page 20: Managing and monitoring large scale data transfers - Networkshop44

Listing Errors (Helps find most important errors to solve)
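Ranking failures by how often each reason occurs is a simple way to find the most important errors to solve. The sketch below shows the idea with Python's `collections.Counter`. The site names and error strings are hypothetical and are not taken from real FTS output.

```python
from collections import Counter

# Hypothetical failed-transfer records: (destination site, error reason).
failures = [
    ("SITE-B", "checksum mismatch"),
    ("SITE-C", "connection timed out"),
    ("SITE-B", "checksum mismatch"),
    ("SITE-B", "no space left on device"),
    ("SITE-C", "connection timed out"),
    ("SITE-B", "checksum mismatch"),
]


def rank_errors(records):
    """Return ((site, reason), count) pairs, most frequent first."""
    return Counter(records).most_common()


ranking = rank_errors(failures)
for (site, reason), count in ranking:
    print(f"{count:3d}  {site:8s} {reason}")
```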

Page 21: Managing and monitoring large scale data transfers - Networkshop44

Single failure mode failed transfers file list

Page 22: Managing and monitoring large scale data transfers - Networkshop44

History of a single file

Page 23: Managing and monitoring large scale data transfers - Networkshop44

Dedicated transfers to monitor rates

Page 24: Managing and monitoring large scale data transfers - Networkshop44

Users Gather their own information

• But systems change which breaks the monitoring.

Page 25: Managing and monitoring large scale data transfers - Networkshop44

AAA et al. for federated failover
• VOs each have their own system (AAA/FAX)
  – But they perform similar actions
  – Copy data from remote storage to the WN if a local copy does not exist
• Allows storage-less sites to be used
• Helps to reduce failures caused by local storage related issues
• Hierarchical redirection
  – Local->regional->continental->global (or another convention)
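The hierarchical redirection described above can be sketched as a lookup that asks each tier in turn and stops at the first one that holds the file. This is an illustration of the idea only, not the AAA or FAX implementation: the tier catalogues are plain dictionaries, and the file paths and `example.org` hosts are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

Lookup = Callable[[str], Optional[str]]


def locate(filename: str, tiers: Sequence[Tuple[str, Lookup]]) -> Tuple[str, str]:
    """Ask each tier in order and return (tier name, URL) for the first hit."""
    for name, lookup in tiers:
        url = lookup(filename)
        if url is not None:
            return name, url
    raise FileNotFoundError(filename)


# Hypothetical catalogues, ordered from nearest to farthest.
local_files = {}
regional_files = {"/store/file1.root": "root://regional.example.org//store/file1.root"}
global_files = {
    "/store/file1.root": "root://global.example.org//store/file1.root",
    "/store/file2.root": "root://global.example.org//store/file2.root",
}
tiers = [
    ("local", local_files.get),
    ("regional", regional_files.get),
    ("global", global_files.get),
]

for path in ("/store/file1.root", "/store/file2.root"):
    tier, url = locate(path, tiers)
    print(f"{path}: served from {tier} tier at {url}")
```

Because the nearest tier is always asked first, a job at a storage-less site simply falls through to the regional or global tier without any change to the job itself.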

Page 26: Managing and monitoring large scale data transfers - Networkshop44

Example of global network

Page 27: Managing and monitoring large scale data transfers - Networkshop44

FAX backup transfer mechanism also monitored

Page 28: Managing and monitoring large scale data transfers - Networkshop44

• Outline of the scale of the data transport issue for WLCG
• What is the File Transfer Service (FTS)
• Monitoring at different levels
  – Central FTS data transfer monitoring
  – VO specific
  – User monitoring
• Federated failover
• Use of “generic” monitoring tools
  – Site monitoring in conjunction with VO monitoring

Page 29: Managing and monitoring large scale data transfers - Networkshop44

Generic network monitoring tools

• Sites have access to established programs
  – Ping, traceroute, tracepath, ganglia, iftop, cacti
• Organising host testing and port blocking can be troublesome
• Separate “off the shelf” hardware and monitoring
  – http://atlas.ripe.net
• perfSONAR toolkit

Page 30: Managing and monitoring large scale data transfers - Networkshop44

Overview of perfSONAR in WLCG/OSG
• Goals:
  – Find and isolate “network” problems; alerting in time
  – Characterize network use (base-lining)
  – Provide a source of network metrics for higher level services
• Choice of a standard open source tool: perfSONAR
  – Benefiting from the R&E community consensus
• Tasks achieved:
  – Finalized core deployment and commissioned perfSONAR network
  – Monitoring in place to create a baseline of the current situation between sites
  – Developed test coverage and made it possible to run “on-demand” tests to quickly isolate problems and identify problematic links

Shawn McKee UoM

Page 31: Managing and monitoring large scale data transfers - Networkshop44

Importance of Measuring Our Networks
• End-to-end network issues are difficult to spot and localize
  – Network problems are multi-domain, complicating the process
  – Standardizing on specific tools and methods allows groups to focus resources more effectively and better self-support
  – Performance issues involving the network are complicated by the number of components involved end-to-end
• perfSONAR provides a number of standard metrics we can use
• Latency measurements provide one-way delays and packet loss metrics
  – Packet loss is almost always very bad for performance
• Bandwidth tests measure achievable throughput and track TCP retries (using iperf3)
  – Provides a baseline to watch for changes; identify bottlenecks
• Traceroute/Tracepath track network topology
  – All measurements are only useful when we know the exact path they are taking through the network
  – Tracepath additionally measures MTU but is frequently blocked

Shawn McKee UoM
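The claim that packet loss is almost always very bad for performance can be made concrete with the well-known Mathis et al. approximation for a single loss-limited TCP stream: rate ≈ (MSS / RTT) × (C / √p), with C = √(3/2). The slides do not cite this formula; it is added here as an illustrative model, and it describes classic Reno-style congestion control rather than every modern TCP variant. The MSS and RTT values below are assumptions.

```python
import math


def mathis_limit_mbps(mss_bytes: int, rtt_ms: float, loss: float) -> float:
    """Approximate upper bound on single-stream TCP throughput in Mbit/s."""
    c = math.sqrt(3 / 2)                       # constant from the Mathis model
    bits_per_rtt = mss_bytes * 8               # one segment per round trip
    return bits_per_rtt / (rtt_ms / 1e3) * c / math.sqrt(loss) / 1e6


# Assumed example: 1460-byte MSS over a 100 ms path at several loss rates.
for loss in (1e-6, 1e-4, 1e-2):
    limit = mathis_limit_mbps(1460, 100.0, loss)
    print(f"loss {loss:.0e}: at most about {limit:.1f} Mbit/s per stream")
```

In this model a hundredfold increase in loss rate cuts the achievable rate by a factor of ten, and at 0.01% loss a single stream on the assumed 100 ms path is limited to roughly 14 Mbit/s regardless of link capacity.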

Page 32: Managing and monitoring large scale data transfers - Networkshop44

Current perfSONAR Deployment

• 246 active perfSONAR instances
• 202 running latest version (3.5+)
• 95 sonars in latency mesh
  – 8930 links measured at 10 Hz
  – Packet loss, one-way latency, jitter, TTL, packet reordering
• 115 sonars in traceroute mesh
  – 13110 links
  – Hourly traceroutes, path MTU
• 102 sonars in bandwidth mesh
  – 10920 links (iperf3)

Shawn McKee UoM

https://www.google.com/fusiontables/DataSource?docid=1QT4r17HEufkvnqhJu24nIptZ66XauYEIBWWh5Kpa#map:id=3
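The link counts quoted for each mesh are consistent with a full mesh in which every host tests to every other host in both directions, giving n × (n − 1) directed links. The short check below assumes that interpretation, which the slide does not state explicitly. It also assumes "10 Hz" means ten probe packets per second on each link.

```python
def full_mesh_links(hosts: int) -> int:
    """Directed links in a full mesh: every host tests to every other host."""
    return hosts * (hosts - 1)


# (hosts, links) as quoted on the slide for each mesh.
quoted = {"latency": (95, 8930), "traceroute": (115, 13110), "bandwidth": (102, 10920)}

results = {}
for mesh, (hosts, links) in quoted.items():
    expected = full_mesh_links(hosts)
    results[mesh] = expected
    status = "matches" if expected == links else f"differs from quoted {links}"
    print(f"{mesh}: {hosts} hosts -> {expected} links ({status})")

# Probe packets per second across the whole latency mesh at 10 per link.
latency_packets_per_second = 8930 * 10
print(f"latency mesh: {latency_packets_per_second} probe packets per second")
```

The latency and traceroute figures match the formula exactly. The bandwidth figure does not: 102 hosts would give 10302 links, whereas 10920 corresponds to 105 hosts, and the slide gives no indication of which number is the intended one.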

Page 33: Managing and monitoring large scale data transfers - Networkshop44

Generic network monitoring tools

• Sites have access to established programs
  – Ping, traceroute, tracepath, ganglia, iftop, cacti
• Organising bi-directional host testing and port blocking can be troublesome
• Separate “off the shelf” hardware and monitoring
  – http://atlas.ripe.net
• perfSONAR toolkit

Page 34: Managing and monitoring large scale data transfers - Networkshop44

Overview Dashboards

Page 35: Managing and monitoring large scale data transfers - Networkshop44

Dedicated monitoring Tools for the TCP layer

Page 36: Managing and monitoring large scale data transfers - Networkshop44

Central Service Monitoring

Page 37: Managing and monitoring large scale data transfers - Networkshop44

Analysis of the results garners useful information

Page 38: Managing and monitoring large scale data transfers - Networkshop44

Range of connections and rates on single host

Page 39: Managing and monitoring large scale data transfers - Networkshop44

Comparison between hosts at a single site

Page 40: Managing and monitoring large scale data transfers - Networkshop44

Conclusions
• We have a lot of data to move (but successfully do so)
  – In many workflows
• FTS is one method of doing it
• Federated failover
  – Automatic retries at multiple levels help make problems transparent to the user
• Lots of monitoring to ensure both a high success rate of transfers and a high throughput, both per file and overall
  – Monitoring needs to be done at multiple levels
• Generic monitoring tools are also useful
• Thank you

[email protected]

Page 41: Managing and monitoring large scale data transfers - Networkshop44

Contact

Thank you

Brian [email protected]