managing and monitoring large scale data transfers - networkshop44
TRANSCRIPT
Managing and monitoring large scale data transfers
(WLCG FTS service as an example)
Brian Davies
Networkshop44, 22/03/16
Outline
• Outline of the data transfer monitoring
• What is the File Transfer Service (FTS)?
• Monitoring at different levels
  – Central FTS data transfer monitoring
  – Virtual Organisation (VO) specific
  – User monitoring
• Federated failover
• Use of "generic" monitoring tools
  – Site monitoring in conjunction with VO monitoring
WLCG has a lot of data transfers to monitor
• 167 sites in 43 countries on six continents
• Storage endpoints containing 250PB (disk) and 300PB (tape)
  – Organised and chaotic access
  – Supporting single/multiple endpoints for single/multiple Virtual Organisations
  – Vary in size and scope
    • 10TB to tens of PB of total storage (disk and tape)
    • 1/10GE NICs; 1/10/100Gbps links; R&E networks and private OPN
    • 10TB-1PB filesystems/object stores; 1-300 disk servers per site
    • Multiple filesystems (XFS, HDFS, CEPH, GPFS, Lustre)
• Central production and user-initiated transfers
• Over the last two years WLCG has moved 0.5EB of data
  – Over 1 billion files
• Worker node (WN) jobs produce a lot of data which also has to be stored/moved
  – One VO runs 200k concurrent jobs, each lasting 10 minutes to 72 hours
  – 0-100s of input files and 2-3 output files per job
  – Individual file open times of 1-10000s
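The headline figure of 0.5EB over two years implies a substantial sustained rate, which makes the "easily fill our networks" point on the next slide concrete. A quick back-of-the-envelope check (the ~63 Gb/s result is our arithmetic, not a number from the talk, and assumes decimal exabytes):

```python
# Back-of-the-envelope: 0.5 EB moved over two years as a sustained rate.
moved_bytes = 0.5e18                    # 0.5 EB (decimal units assumed)
seconds = 2 * 365 * 24 * 3600           # two years, ignoring leap days
avg_rate_gbps = moved_bytes * 8 / seconds / 1e9
# roughly 63 Gb/s sustained, before counting retries or job-local I/O
```

Peak rates are of course far above this average, since access is bursty ("organised and chaotic").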
Transfers to a single site/1day/1VO
Easily fill our networks*
*Not all the time
Data movements vary greatly
• File sizes from ~10B to ~10GB
• Latency between hosts from 0.1ms to 350ms (just for the UK)
• Different workflows require different data movement
  – WAN: SE<->SE, SE->WN, WN->SE
  – LAN: WN<->SE, SE<->SE
• Different tools to monitor different workflows
• Different storage middleware
  – Native gridFTP, BeStMan, DPM, dCache, StoRM
• Different transfer protocols
  – gsiFTP, HTTP/WebDAV, xrootd, NFSv4.1, S3
File Transfer Service (FTS) moves data!
• EGI middleware stack
• Can handle many VOs
  – 22 (HEP and non-HEP)
• Checksum validation of files
• Retry of failed transfers
• Auto-optimisation of transfer parameters to maximise throughput
• Ability to set limits suitable for varied storage setups
• Web-friendly GUI also available!
  – Mainly used via command-line tools or higher-level control systems
• Handles many file transfers (~1.5M a day)
  – Single to thousands of files per submission
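Two of the FTS features above, checksum validation and retry of failed transfers, combine into one loop: copy, verify, and retry on mismatch. A minimal local sketch of that pattern (plain file copy stands in for the real gridFTP/HTTP transfer; the function names are ours, not the FTS API):

```python
import os
import shutil
import zlib

def adler32_hex(path):
    """ADLER32 checksum of a file, the checksum type commonly used on the grid."""
    value = 1  # zlib's adler32 starting value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")

def transfer_with_retry(src, dst, max_retries=3):
    """Copy src -> dst, validate the checksum, retry on mismatch.
    Returns the attempt number that succeeded."""
    expected = adler32_hex(src)
    for attempt in range(1, max_retries + 1):
        shutil.copyfile(src, dst)          # stand-in for the WAN transfer
        if adler32_hex(dst) == expected:
            return attempt
        os.remove(dst)                     # discard the corrupt copy and retry
    raise RuntimeError("transfer failed after %d attempts" % max_retries)
```

The real service adds per-link concurrency limits and auto-tuning on top of this core verify-and-retry behaviour.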
Web GUI
• An overview of all transfers is needed to spot problematic sites
  – But we also need the ability to look at individual transfers
• Web GUIs, reading log files
  – We even have web GUIs which parse log files
• The people using the monitoring vary:
  – Site admins, regional support, VO users, middleware developers
  – Management and technical
  – Different systems work well for different use cases
• What is of interest?
  – Do transfers complete or fail?
  – How fast do they complete?
  – How can I tell if my changes improve or worsen the system?
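The two questions above (do transfers complete, and how fast?) reduce to two numbers per time window: success rate and aggregate throughput; comparing the windows before and after a change answers "did it improve the system?". A minimal sketch over per-transfer records (the tuple layout is ours, for illustration):

```python
def summarise(transfers):
    """transfers: list of (bytes_moved, seconds, succeeded) per transfer.
    Returns (success_rate, aggregate_throughput_bytes_per_sec)."""
    done = [t for t in transfers if t[2]]
    success_rate = len(done) / len(transfers)
    throughput = sum(b for b, _, _ in done) / max(sum(s for _, s, _ in done), 1e-9)
    return success_rate, throughput

def improved(before, after):
    """Crude comparison of two windows: better on both axes?"""
    r0, t0 = summarise(before)
    r1, t1 = summarise(after)
    return r1 >= r0 and t1 >= t0
```

Real dashboards slice the same numbers by site, VO, and link, which is why the overview and the drill-down views are both needed.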
Monitoring at different Levels
Central FTS Monitoring (dashboards and server GUIs)
Usage varies across the three main VOs
Overview to see if single site is having issues
View smaller selections… Able to make sub-selections to diagnose problems not at the world scale
Comparison between inter-SE rates
Sites want to know if they are better than their collaborators/competitors
Ability to delve into greater detail at the server level
Many embedded links to further monitoring
Down to individual transfers
To the log file
Which the VO can then re-interpret
Transfer optimisation within FTS to increase individual transfer rates
Listing Errors (Helps find most important errors to solve)
File list of failed transfers for a single failure mode
History of a single file
Dedicated transfers to monitor rates
Users gather their own information
• But systems change, which breaks the monitoring
AAA et al. for federated failover
• VOs each have their own system (AAA/FAX)
  – But they perform similar actions
  – Copies data to the WN from remote storage if a local copy does not exist
• Allows storage-less sites to be used
• Helps to reduce failures caused by local storage issues
• Hierarchical redirection
  – Local -> regional -> continental -> global (or another convention)
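The hierarchical redirection above is, at its core, an ordered lookup: ask the local redirector first, and widen the search one tier at a time until a replica is found. A minimal sketch, with a hypothetical catalogue layout standing in for the real xrootd redirector chain:

```python
TIERS = ["local", "regional", "continental", "global"]

def locate_replica(lfn, catalogues):
    """Walk the redirection hierarchy for a logical file name (lfn).
    catalogues maps tier -> {lfn: replica_url}; this data layout is
    illustrative, not the actual AAA/FAX interface."""
    for tier in TIERS:
        url = catalogues.get(tier, {}).get(lfn)
        if url is not None:
            return tier, url
    raise LookupError("no replica of %s found in the federation" % lfn)
```

Because a miss at one tier silently escalates to the next, a broken local SE costs extra WAN latency rather than a failed job, which is exactly the transparency the failover mechanism aims for.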
Example of global network
FAX backup transfer mechanism also monitored
Outline
• Outline of the scale of the data transport issue for WLCG
• What is the File Transfer Service (FTS)?
• Monitoring at different levels
  – Central FTS data transfer monitoring
  – VO specific
  – User monitoring
• Federated failover
• Use of "generic" monitoring tools
  – Site monitoring in conjunction with VO monitoring
Generic network monitoring tools
• Sites have access to established programs
  – ping, traceroute, tracepath, ganglia, iftop, cacti
• Organising bi-directional host testing and port blocking can be troublesome
• Separate "off the shelf" hardware and monitoring
  – http://atlas.ripe.net
• perfSONAR toolkit
• Goals:
  – Find and isolate "network" problems; alert in time
  – Characterise network use (base-lining)
  – Provide a source of network metrics for higher-level services
• Choice of a standard open-source tool: perfSONAR
  – Benefiting from the R&E community consensus
• Tasks achieved:
  – Finalised core deployment and commissioned the perfSONAR network
  – Monitoring in place to create a baseline of the current situation between sites
  – Developed test coverage and made it possible to run "on-demand" tests to quickly isolate problems and identify problematic links

Shawn McKee, UoM
Overview of perfSONAR in WLCG/OSG
Importance of Measuring Our Networks
• End-to-end network issues are difficult to spot and localise
  – Network problems are multi-domain, complicating the process
  – Standardising on specific tools and methods allows groups to focus resources more effectively and better self-support
  – Performance issues involving the network are complicated by the number of components involved end-to-end
• perfSONAR provides a number of standard metrics we can use
• Latency measurements provide one-way delays and packet-loss metrics
  – Packet loss is almost always very bad for performance
• Bandwidth tests measure achievable throughput and track TCP retries (using iperf3)
  – Provides a baseline to watch for changes and identify bottlenecks
• Traceroute/tracepath track network topology
  – All measurements are only useful when we know the exact path they take through the network
  – Tracepath additionally measures MTU, but is frequently blocked

Shawn McKee, UoM
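The claim that packet loss is almost always very bad for performance can be made quantitative with the well-known Mathis et al. approximation for steady-state TCP throughput, BW ≈ (MSS/RTT)·(C/√p). A sketch, taking the constant C as 1 for simplicity (the example numbers are illustrative, not measurements from the talk):

```python
from math import sqrt

def mathis_bound_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate upper bound on single-stream TCP throughput
    (Mathis et al. model, constant C taken as 1)."""
    return (mss_bytes * 8) / (rtt_s * sqrt(loss_rate))

# On a 100 ms RTT path with a 1460-byte MSS, raising loss from
# 0.001% to 0.1% (100x) cuts the bound by sqrt(100) = 10x.
clean = mathis_bound_bps(1460, 0.100, 1e-5)
lossy = mathis_bound_bps(1460, 0.100, 1e-3)
```

This square-root dependence on loss, multiplied by RTTs of up to 350 ms on WLCG paths, is why the latency mesh tracks packet loss so closely.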
Current perfSONAR Deployment
246 active perfSONAR instances; 202 running the latest version (3.5+)
• 95 sonars in the latency mesh
  – 8930 links measured at 10Hz
  – packet loss, one-way latency, jitter, TTL, packet reordering
• 115 sonars in the traceroute mesh
  – 13110 links; hourly traceroutes, path MTU
• 102 sonars in the bandwidth mesh
  – 10920 links (iperf3)

Shawn McKee, UoM
https://www.google.com/fusiontables/DataSource?docid=1QT4r17HEufkvnqhJu24nIptZ66XauYEIBWWh5Kpa#map:id=3
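The link counts follow from full-mesh testing: each host measures towards every other host, giving n(n-1) ordered pairs. The latency and traceroute figures match this exactly (95 and 115 sonars respectively); the bandwidth mesh figure presumably reflects a slightly different participant count at the time:

```python
def mesh_links(n_hosts):
    """Ordered host pairs in a full mesh: every host tests every other host."""
    return n_hosts * (n_hosts - 1)
```

This quadratic growth is also why bandwidth tests run far less often than the 10Hz latency probes: every added sonar adds hundreds of links to schedule.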
Overview Dashboards
Dedicated monitoring Tools for the TCP layer
Central Service Monitoring
Analysis of the results garners useful information
Range of connections and rates on single host
Comparison between hosts at a single site
Conclusions
• We have a lot of data to move (but we successfully do so)
  – In many workflows
• FTS is a method for doing it
• Federated failover
  – Automatic retries at multiple levels help make problems transparent to the user
• Lots of monitoring, to ensure both a high success rate of transfers and high throughput, per file and overall
  – Monitoring needs to be done at multiple levels
• Generic monitoring tools are also useful
• Thank you