Improving ICT Support for Large-scale Science

DESCRIPTION

RNP presentation at the 5th J-PAS collaboration meeting.

TRANSCRIPT

Page 1: Improving ICT Support for Large-scale Science


RNP – Brazilian National Education and Research Network

ICT Support for Large-scale Science
5th J-PAS collaboration meeting / 11-Sep-2012

Leandro N. Ciuffo – leandro.ciuffo@rnp.br
Alex S. Moura – [email protected]

Page 2: Improving ICT Support for Large-scale Science

About RNP

Page 3: Improving ICT Support for Large-scale Science

• Qualified as a non-profit Social Organization (OS), maintained with federal public resources

• The government budget includes items to cover the costs of the network as well as RNP's operating costs

• Supported by MCTI, MEC and now MinC (Culture)

• RNP is monitored by MCTI, CGU and TCU

• Additional projects are supported by sectorial funds, directly or through the management contract, plus MS (Health)

• A "Research Unit" of the Ministry of S&T
  http://www.mct.gov.br/index.php/content/view/741.html

LNCC – Laboratório Nacional de Computação Científica
INPE – Instituto Nacional de Pesquisas Espaciais
INPA – Instituto Nacional de Pesquisas da Amazônia

Page 4: Improving ICT Support for Large-scale Science

Build and operate R&E networks

• Maintenance and continued renewal of infrastructure

• The RNP backbone of 2000 has been renewed 3 times (2004, 2005 and 2011), with large increases in maximum link capacity, from 25 Mbps to 10 Gbps (a factor of 400)

• Metro networks have been built in capital cities to provide access to the Point of Presence (PoP) at 1 Gbps or more – www.redecomep.rnp.br

• International capacity has increased from 300 Mbps in 2002 to over 20 Gbps since 2009 (a factor of 70). RNP has also played a major role in building the RedCLARA (Latin American regional) network, linking R&E networks from more than 12 countries – www.redclara.net

• Testbed networks for network experimentation, especially project GIGA (with CPqD) since 2003 and the EU-BR FIBRE project (2011-2014)

Page 5: Improving ICT Support for Large-scale Science

Ipê Network – RNP Backbone
http://www.rnp.br/backbone/

Bandwidth – minimum: 20 Mbps; maximum: 10 Gbps; aggregated: 250 Gbps
External connectivity: Americas Light (to USA), RedCLARA (to Europe), commodity Internet

[Backbone map with PoPs including Manaus, Boa Vista, Macapá, Fortaleza, Salvador, Brasília, Rio de Janeiro, São Paulo, Florianópolis and Porto Alegre]

Page 6: Improving ICT Support for Large-scale Science

MCTI Roadmap 2012-2015

http://www.mcti.gov.br/index.php/content/view/335668.html

Page 7: Improving ICT Support for Large-scale Science

Metro Networks
http://redecomep.rnp.br

[Diagram: institutions A, B and C connected through a metro ring to the RNP PoP, which links to the Ipê network]

• 23 cities operational
• 6 under deployment
• 13 planned
• 1,980 km

Page 8: Improving ICT Support for Large-scale Science
Page 9: Improving ICT Support for Large-scale Science

São Paulo

Page 10: Improving ICT Support for Large-scale Science

Florianópolis

Page 11: Improving ICT Support for Large-scale Science

Why is your data-driven research relevant to RNP?

Page 12: Improving ICT Support for Large-scale Science

Evolution of science paradigms

• Empirical science: describing natural phenomena
• Theoretical modeling: e.g. Kepler's and Newton's laws
• Computational simulations: simulating complex phenomena ("in silico")
• Data-intensive research: unifying theory, experiment and simulation at scale ("big data")

Page 13: Improving ICT Support for Large-scale Science

Key components of a new research infrastructure

[Diagram of components: Instruments, Data Repositories, Local services and workflows, Publishers, Publishing, Registering, Harvesting & Indexing, Scientific portal, Discovering, Big Data Processing, Information Visualization, Users]

Page 14: Improving ICT Support for Large-scale Science
Page 15: Improving ICT Support for Large-scale Science

Network Requirements Workshop

1. What science is being done?

2. What instruments and facilities are used?

3. What is the process/workflow of science?

4. How will the use of instruments and facilities, the process of science, and other aspects of the science change over the next 5 years?

5. What is coming beyond 5 years out?

6. Are new instruments or facilities being built, or are there other significant changes coming?

Page 16: Improving ICT Support for Large-scale Science

J-PAS data transfer requirements (to be validated)

[Diagram with the figures: T80S – 35 TB/year raw; ~270 TB/year of images; CSIC]

Page 17: Improving ICT Support for Large-scale Science

International collaboration

Page 18: Improving ICT Support for Large-scale Science

Academic Networks Worldwide

Page 19: Improving ICT Support for Large-scale Science

RedCLARA Logical Topology – August 2012

http://www.redclara.net

[Logical topology map; link capacities of 622 Mbps, 1 Gbps, 2.5 Gbps and 10 Gbps; 10 Gbps connectivity at SP, with POA foreseen]

Page 20: Improving ICT Support for Large-scale Science

Hybrid Networks

• Since the beginning of the Internet, NRENs have provided the routed IP service

• Around 2002, NRENs began to provide two network services:
  – routed IP (the traditional Internet)
  – end-to-end virtual circuits (a.k.a. "lightpaths")

• The lightpath service is intended for users with high QoS needs, usually guaranteed bandwidth, which is implemented by segregating their traffic from the general routed IP traffic

• The GLIF organisation (www.glif.is) coordinates international collaboration using lightpaths

Page 21: Improving ICT Support for Large-scale Science

High-bandwidth research connectivity
(lightpaths for supporting international collaboration)

GLIF world map, 2011 – http://www.glif.is

Page 22: Improving ICT Support for Large-scale Science

GLIF links in South America

• RNP networks
  • Ipê backbone (29,000 km)
  • metro networks in state capitals
• GIGA optical testbed, from RNP and CPqD
  • links 20 research institutions in 7 cities (750 km)
• KyaTera research network in São Paulo
  • links research institutions in 11 cities (1,500 km)

Page 23: Improving ICT Support for Large-scale Science

Examples of use of international lightpaths

Page 24: Improving ICT Support for Large-scale Science

Why is all this relevant to you?

Page 25: Improving ICT Support for Large-scale Science

Why?

• R&E networks in Brazil, and especially RNP, are funded by government agencies to provide quality network services to the national R&E community.

• In most cases, this is handled simply by providing R&E institutions with a connection to our networks, which operate standard Internet services of good quality.

• However, there are times when this is not enough…

Page 26: Improving ICT Support for Large-scale Science

Network Requirements and Expectations

• Expected data transfer rates: as a first step in improving your network performance, it is critical to have a baseline understanding of what speed you should expect from your network connection under ideal conditions.

• The following shows roughly how long it takes to transfer 1 Terabyte of data across networks of various speeds (a rough line-rate calculation follows the list):

10 Mbps network: 300 h (12.5 days)
100 Mbps network: 30 h
1 Gbps network: 3 h
10 Gbps network: 20 min
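A rough cross-check of these figures, as an illustrative shell sketch (not part of the original slides): it computes the ideal line-rate time for 1 TB, which comes out somewhat lower than the numbers above, presumably because those allow for protocol overhead and real-world efficiency.

  # Illustrative sketch: ideal transfer time for 1 TB at the nominal line rate.
  for mbps in 10 100 1000 10000; do
      awk -v r="$mbps" 'BEGIN {
          bits = 8.0e12                 # 1 terabyte expressed in bits
          secs = bits / (r * 1.0e6)     # seconds at the nominal line rate
          printf "%6d Mbps: %7.1f hours\n", r, secs / 3600
      }'
  done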

Page 27: Improving ICT Support for Large-scale Science

Network Requirements and Expectations
http://fasterdata.es.net/fasterdata-home/requirements-and-expectations/

Page 28: Improving ICT Support for Large-scale Science

Science and Enterprise network requirements are in conflict

• In some cases, the standard Internet services are not good enough for high-performance or data-intensive projects.

• These projects see inadequate performance for critical applications and are sensitive to perturbations caused by security devices:
  - numerous cases of firewalls causing problems
  - often difficult to diagnose
  - router filters can often provide equivalent security without the performance impact

Page 29: Improving ICT Support for Large-scale Science

Recommended approaches

Page 30: Improving ICT Support for Large-scale Science

Remedies which can be applied

• Tuning of networking software is generally necessary on high-bandwidth, long-latency connections, because of the peculiarities of TCP implementations

• Where QoS requirements are high, it is often necessary to use lightpaths to avoid interference from cross traffic

• In many cases, both of these approaches are required

Page 31: Improving ICT Support for Large-scale Science

The Cipó Experimental Service

• We are now beginning to deploy dynamic circuits as an experimental service on our network
  – This will also interoperate with similar services in other networks

Page 32: Improving ICT Support for Large-scale Science

Getting support

• If you need advice or assistance with these network problems, it is important to get in touch with network support:
  1. at your own institution
  2. at your state network provider – www.rnp.br/pops/index.php
  3. in the case of specific circuit (lightpath) services, you may contact RNP directly at [email protected]

Page 33: Improving ICT Support for Large-scale Science

RNP Website: Backbone Operations

Page 34: Improving ICT Support for Large-scale Science

RNP Website: Backbone Operations – http://www.rnp.br/ceo/

Tools to help verify network performance statistics

Page 35: Improving ICT Support for Large-scale Science

RNP Backbone Statistics

http://www.rnp.br/ceo/

Network Panorama

Page 36: Improving ICT Support for Large-scale Science

RNP Backbone Statistics

http://www.rnp.br/ceo/

Page 37: Improving ICT Support for Large-scale Science

Network Diagnostic Tool (NDT)

• Test your bandwidth from your computer to the RNP PoP:
  • São Paulo: http://ndt.pop-sp.rnp.br
  • Rio de Janeiro: http://ndt.pop-rj.rnp.br
  • Florianópolis: http://ndt.pop-sc.rnp.br

Page 38: Improving ICT Support for Large-scale Science

Recommended Approach

• On a high-speed network, it takes less time to transfer 1 Terabyte of data than one might expect.

• It is usually sub-optimal to try to get 900 megabits per second of throughput on a 1 gigabit per second network path in order to move one or two terabytes of data per day. The disk subsystem can also be a bottleneck: simple storage systems often have trouble filling a 1 gigabit per second pipe.

• In general, it is not a good idea to try to completely saturate the network, as you will likely end up causing problems both for yourself and for others using the same link. A good rule of thumb is that, for periodic transfers, it should be straightforward to get throughput equivalent to 1/4 to 1/3 of a shared path that has nominal background load.

• For example, if you know your receiving host is connected to 1 Gbps Ethernet, then a target speed of 150-200 Mbps is reasonable. You can adjust the number of parallel streams (as described on the tools page) to achieve this.

• Many labs and large universities are connected at speeds of at least 1 Gbps, and most LANs are at least 100 Mbps, so if you don't get at least 20 Mbps, there may be a problem that needs to be addressed.

Page 39: Improving ICT Support for Large-scale Science

Performance using TCP

• There are 3 important variables (there are others) that affect TCP performance: packet loss, latency (or RTT – Round Trip Time), and buffer/window size. All are interrelated.

• The optimal buffer size is twice the bandwidth × delay product of the connection, which is equivalent to:
  buffer size = bandwidth x RTT

• For example, if ping reports 50 ms and the end-to-end path is all 1G or 10G Ethernet, the TCP receive buffer (an operating-system parameter) should be:
  0.05 s x 1 Gbit/s / 8 bits per byte = 6.25 MBytes
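The same arithmetic from the command line, as a minimal sketch (the 50 ms RTT and 1 Gbps rate are just the example values above; substitute your own ping result and link speed):

  # Illustrative sketch: TCP buffer size = bandwidth x RTT.
  rtt_ms=50      # round-trip time reported by ping, in milliseconds
  gbps=1         # end-to-end link speed, in Gbit/s
  awk -v rtt="$rtt_ms" -v bw="$gbps" 'BEGIN {
      bytes = (rtt / 1000) * (bw * 1.0e9) / 8   # seconds x bits-per-second / 8
      printf "Suggested TCP receive buffer: %.2f MBytes\n", bytes / 1.0e6
  }'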

Page 40: Improving ICT Support for Large-scale Science

TCP Congestion Avoidance Algorithms

• The TCP reno congestion avoidance algorithm was the default in all TCP implementations for many years. However, as networks got faster and faster, it became clear that reno would not work well for high bandwidth-delay product networks. To address this, a number of new congestion avoidance algorithms were developed, including:
  • reno: traditional TCP, used by almost all other operating systems (default)
  • cubic: CUBIC-TCP
  • bic: BIC-TCP
  • htcp: Hamilton TCP
  • vegas: TCP Vegas
  • westwood: optimized for lossy networks

• Most Linux distributions now use cubic by default, and Windows now uses Compound TCP. If you are using an older version of Linux, be sure to change the default from reno to cubic or htcp (see the sketch below).

• More details can be found at: http://en.wikipedia.org/wiki/TCP_congestion_avoidance_algorithm
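On Linux, the algorithm can be checked and changed with sysctl; a minimal sketch (run as root; the htcp module may need to be loaded first):

  # Illustrative sketch: inspect and change the TCP congestion avoidance algorithm.
  sysctl net.ipv4.tcp_available_congestion_control   # algorithms currently available
  sysctl net.ipv4.tcp_congestion_control             # algorithm in use now
  modprobe tcp_htcp                                  # load htcp if it is not built in
  sysctl -w net.ipv4.tcp_congestion_control=htcp     # switch the system default to htcp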

Page 41: Improving ICT Support for Large-scale Science


Page 42: Improving ICT Support for Large-scale Science

MTU Issues

• Jumbo Ethernet frames can increase performance by a factor of 2-4.

• The ping tool can be used to verify the MTU size. For example, on Linux you can do:
  ping -s 8972 -M do -c 4 10.200.200.12

• Other tools that can help verify the MTU size are scamper and tracepath.
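The arithmetic behind the ping test above, plus a tracepath check, as a minimal sketch (reusing the slide's example address):

  # Illustrative sketch: verify that the whole path carries 9000-byte jumbo frames.
  # 8972 bytes of ICMP payload + 8-byte ICMP header + 20-byte IP header = 9000 bytes;
  # -M do sets "don't fragment", so the ping only succeeds if every hop accepts 9000-byte packets.
  ping -s 8972 -M do -c 4 10.200.200.12
  # tracepath reports the path MTU hop by hop.
  tracepath -n 10.200.200.12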

Page 43: Improving ICT Support for Large-scale Science

Say No to scp: Why you should avoid scp over a WAN

• In a Unix environment, scp, sftp and rsync are commonly used to copy data between hosts.

• While these tools work fine in a local environment, they perform poorly over a WAN.

• The OpenSSH versions of scp and sftp have a built-in 1 MB buffer (only 64 KB in OpenSSH older than version 4.7) that severely limits performance on a WAN.

• rsync is not part of the OpenSSH distribution, but typically uses ssh as its transport (and is subject to the limitations imposed by the underlying ssh implementation).

• DO NOT USE THESE TOOLS if you need to transfer large data sets across a network path with an RTT of more than around 25 ms.

• More information is available at fasterdata.es.net.

Page 44: Improving ICT Support for Large-scale Science

Why you should avoid scp over a WAN (cont.)

• The following results are typical: scp is 10x slower than single-stream GridFTP, and 50x slower than parallel GridFTP.

• Sample results: Berkeley, CA to Argonne, IL (near Chicago). RTT = 53 ms, network capacity = 10 Gbps.
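For reference, a parallel GridFTP transfer typically looks something like the sketch below; the hosts, paths, stream count and buffer size are placeholders chosen for illustration, not values from the slides.

  # Illustrative sketch: parallel GridFTP transfer with globus-url-copy.
  #   -vb          print transfer performance while running
  #   -p 4         open 4 parallel TCP streams
  #   -tcp-bs 16M  use a 16 MB TCP buffer per stream
  globus-url-copy -vb -p 4 -tcp-bs 16M \
      file:///data/bigfile.tar \
      gsiftp://dtn.example.org/data/bigfile.tar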

Page 45: Improving ICT Support for Large-scale Science

Demilitarizing your network for science

Page 46: Improving ICT Support for Large-scale Science

A Simple Science DMZ

• A simple Science DMZ has several essential components. These include dedicated access to high-performance wide area networks and advanced services infrastructures, high-performance network equipment, and dedicated science resources such as Data Transfer Nodes. [Diagram of a simple Science DMZ showing these components and data paths]

Page 47: Improving ICT Support for Large-scale Science

Science DMZ: Supercomputer Center Network

• [Diagram of a simplified supercomputer center network.] While this may not look much like the previous simple Science DMZ diagram, the same principles are used in its design.

Page 48: Improving ICT Support for Large-scale Science

Science DMZ: Big Data Site

• For sites that handle very large data volumes (e.g. big experiments such as the LHC), individual data transfer nodes are not enough.

• Data transfer clusters are needed: groups of machines serving data from multi-petabyte data stores.

• The same principles of the Science DMZ apply: dedicated systems are used for data transfer, and the path to the wide area is clean, simple, and easy to troubleshoot. Test and measurement are integrated in multiple locations to enable fault isolation. This network is similar to the supercomputer center example in that the wide area data path covers the entire network front-end.

Page 49: Improving ICT Support for Large-scale Science

Data Transfer Node (DTN)

• Computer systems used for wide area data transfers perform far better if they are purpose-built and dedicated to that function. These systems, which we call Data Transfer Nodes (DTNs), are typically PC-based Linux servers built with high-quality components and configured specifically for wide area data transfer.

• ESnet has assembled a reference implementation of a host that can be deployed as a DTN or as a high-speed GridFTP test machine.

• The host can fill a 10 Gbps network connection with disk-to-disk data transfers using GridFTP.

• The total cost of this server was around $10K, or $12.5K with the more expensive RAID controller. If your DTN is used only as a data cache (RAID0) rather than as a reliable storage server (RAID5), you can get by with the less expensive RAID controller.

• Key aspects of the configuration include: a recent version of the Linux kernel running the ext4 file system, 2 RAID controllers, and 16 disks.

Page 50: Improving ICT Support for Large-scale Science

DTN Hardware Description

• Chassis: AC SuperMicro SM-936A-R1200B 3U 19" rack case with dual 1200 W PS
• Motherboard: SuperMicro X8DAH+F version 1.0c
• CPU: 2 x Intel Xeon Nehalem E5530 2.4 GHz
• Memory: 6 x 4 GB DDR3-1066MHz ECC/REG
• I/O Controller: 2 x 3ware SAS 9750SA-8i (about $600) or 3ware SAS 9750-24i4e (about $1500)
• Disks: 16 x Seagate 500 GB SAS HDD 7,200 RPM ST3500620SS
• Network Controller: Myricom 10G-PCIE2-8B2-2S+E

• Linux distribution
  • Most recent distribution of CentOS Linux
  • Install the 3ware driver: http://www.3ware.com/support/download.asp
  • Install the ext4 utilities: yum install e4fsprogs.x86_64
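A minimal sketch of preparing one of the RAID volumes with ext4 (the device name and mount point are placeholders; the slides do not prescribe specific mkfs or mount options):

  # Illustrative sketch: create and mount an ext4 file system on a RAID volume seen as /dev/sdb.
  mkfs.ext4 /dev/sdb                  # format the volume with ext4
  mkdir -p /data1
  mount -o noatime /dev/sdb /data1    # noatime avoids a metadata write on every read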

Page 51: Improving ICT Support for Large-scale Science

DTN Tuning

• Add to /etc/sysctl.conf, then run sysctl -p:
  # standard TCP tuning for 10GE
  net.core.rmem_max = 33554432
  net.core.wmem_max = 33554432
  net.ipv4.tcp_rmem = 4096 87380 33554432
  net.ipv4.tcp_wmem = 4096 65536 33554432
  net.ipv4.tcp_no_metrics_save = 1
  net.core.netdev_max_backlog = 250000

• Add to /etc/rc.local:
  # increase the size of data the kernel will read ahead (this favors sequential reads)
  /sbin/blockdev --setra 262144 /dev/sdb
  /sbin/blockdev --setra 262144 /dev/sdc
  /sbin/blockdev --setra 262144 /dev/sdd

  # increase txqueuelen
  /sbin/ifconfig eth2 txqueuelen 10000
  /sbin/ifconfig eth3 txqueuelen 10000

  # make sure cubic and htcp are loaded
  /sbin/modprobe tcp_htcp
  /sbin/modprobe tcp_cubic
  # set default to htcp
  /sbin/sysctl net.ipv4.tcp_congestion_control=htcp

  # with the Myricom 10G NIC, increasing interrupt coalescing helps a lot:
  /usr/sbin/ethtool -C ethN rx-usecs 75
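A quick sanity check after applying the settings, as a sketch that assumes the same device and interface names as above:

  # Illustrative sketch: confirm the tuning took effect.
  sysctl net.core.rmem_max net.ipv4.tcp_congestion_control   # expect 33554432 and htcp
  blockdev --getra /dev/sdb                                   # expect 262144
  ip link show eth2 | grep -o 'qlen [0-9]*'                   # expect qlen 10000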

Page 52: Improving ICT Support for Large-scale Science

DTN Tuning (cont.)

• Tools
  • Install a data transfer tool such as GridFTP – see the GridFTP quick start page. Information on other tools can be found on the tools page.

• Performance results for this configuration (back-to-back testing using GridFTP)
  - memory to memory, 1 x 10GE NIC: 9.9 Gbps
  - memory to memory, 4 x 10GE NICs: 38 Gbps
  - disk to disk: 9.6 Gbps (1.2 GBytes/sec), using large files on all 3 disk partitions in parallel
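Before involving GridFTP or disks, a memory-to-memory (network-only) baseline can be measured with iperf; a sketch, with the server hostname as a placeholder:

  # Illustrative sketch: memory-to-memory throughput test with iperf (no disks involved).
  # On the remote DTN, start a server with a 32 MB window:
  iperf -s -w 32M
  # On the local DTN, run 4 parallel streams for 30 seconds:
  iperf -c dtn.example.org -w 32M -P 4 -t 30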

Page 53: Improving ICT Support for Large-scale Science

References (1/3)

• TCP Performance Tuning for WAN Transfers – NASA HECC Knowledge Base
  http://www.nas.nasa.gov/hecc/support/kb/TCP-Performance-Tuning-for-WAN-Transfers_137.html

• Google's software-defined/OpenFlow backbone drives WAN links to 100 per cent utilization – Computerworld
  http://www.computerworld.com.au/article/427022/google_software-defined_openflow_backbone_drives_wan_links_100_per_cent_utilization/

• Achieving 98 Gbps of cross-country TCP traffic using 2.5 hosts, 10 x 10G NICs, and 10 TCP streams
  http://www.internet2.edu/presentations/jt2012winter/20120125-Pouyoul-JT-lighting.pdf

• Tutorials / Talks
  • Achieving the Science DMZ: Eli Dart, Eric Pouyoul, Brian Tierney, and Joe Breen, Joint Techs, January 2012 (watch the webcast). Tutorial in 4 sections: Overview and Architecture, Building a Data Transfer Node, Bulk Data Transfer Tools and perfSONAR, Case Study: University of Utah's Science DMZ
  • How to Build a Low Cost Data Transfer Node: Eric Pouyoul, Brian Tierney and Eli Dart, Joint Techs, July 2011
  • High Performance Bulk Data Transfer (includes TCP tuning tutorial): Brian Tierney and Joe Metzger, Joint Techs, July 2010
  • Science Data Movement: Deployment of a Capability: Eli Dart, Joint Techs, January 2010
  • Bulk Data Transfer Tutorial: Brian Tierney, September 2009
  • Internet2 Performance Workshop, current slides
  • SC06 Tutorial on high performance networking: Phil Dykstra, Nov 2006

Page 54: Improving ICT Support for Large-scale Science

References (2/3)

• Papers
  • O'Reilly ONLamp article on TCP tuning

• Tuning
  • PSC TCP performance tuning guide
  • SARA server performance tuning guide

• Troubleshooting
  • Fermilab network troubleshooting methodology
  • GÉANT2 network tuning knowledge base
  • Network and OS tuning
  • Linux IP tuning info
  • Linux TCP tuning info
  • A Comparison of Alternative Transport Protocols

Page 55: Improving ICT Support for Large-scale Science

References (3/3)

• Network performance measurement tools
  • Convert Bytes/sec to bits/sec, etc.
  • Measurement Lab tools
  • Speed Guide's performance tester and TCP analyzer (mostly useful for home users)
  • ICSI's Netalyzr
  • CAIDA taxonomy
  • SLAC tool list
  • iperf vs ttcp vs nuttcp comparison
  • Sally Floyd's list of bandwidth estimation tools
  • Linux Foundation's TCP testing page

• Others
  • bufferbloat.net: site devoted to pointing out the problems with large network buffers on slower networks, such as homes or wireless

Page 56: Improving ICT Support for Large-scale Science

Thank you / Obrigado!

Leandro Ciuffo – [email protected]
Alex Moura – [email protected]
Twitter: @RNP_pd