perfSONAR Update, Shawn McKee/University of Michigan, LHCONE/LHCOPN Meeting, Cambridge, UK, February 9th, 2015


perfSONAR Update
Shawn McKee/University of Michigan
LHCONE/LHCOPN Meeting, Cambridge, UK
February 9th, 2015

Overview of Talk
- Overview of Status (Changes, Issues)
- Status of perfSONAR Monitoring for LHCONE/LHCOPN
- Discussion

Slide footer: February 9, 2015, LHCONE/OPN-Cambridge, Shawn McKee

LHCONE MaDDash, 09 Feb 2015
- We still have a couple hosts with issues: ps01-nl.geant.nl (called perfSONAR-latency) and the Internet2 host at ManLan (called Internet2 perfSONAR) both show issues.
- NOTE: labels are now generated from Mesh registration information.

LHCOPN MaDDash, 09 Feb 2015
- We still have a couple hosts with issues.
- Kisti had firewall issues: updated today.
- Still LOTS of orange on latency mesh.
- BW mesh much better, but red throughput is worth examining.
- NOTE: labels are now generated from Mesh registration information.

OSG Network Service
Open Science Grid (OSG) has deployed a network service for WLCG (and LHCONE).
It consists of:
- A datastore based upon Esmond (the new MA in perfSONAR v3.4)
- A GUI using MaDDash
- A service monitoring component built on OMD
- A mesh-creation-configuration utility built on registered information in OIM and GOCDB
Demo of how the mesh creation works (slides have to be used for this since X509 credentials are needed).

Demo slides (titles only):
- OIM / Mesh Config / Hostgroups
- OIM / Mesh Config / Parameters
- OIM / Mesh Config / Configs
- Mesh Config: Adding Tests
- MyOSG / Mesh Config
- MyOSG / Mesh Config (us-atlas)

perfSONAR Monitoring Pages
We have 3 versions of our perfSONAR monitoring pages:
- Prototype at maddash.aglt2.org
- Testing at OSG's ITB instance
- Production at OSG's production instance
Main monitoring types are MaDDash and OMD/Check_MK:
- Prototype: https://maddash.aglt2.org/WLCGperfSONAR/check_mk
- Testing: https://perfsonar-itb.grid.iu.edu/WLCGperfSONAR/check_mk/
- Production: https://pfomd.grid.iu.edu/WLCGperfSONAR/check_mk
Notes:
- OSG instances rely upon the OSG Datastore.
- An X509 cert is needed to view the check_mk/OMD pages (any IGTF cert).
- The OSG datastore is currently DOWN for resource consumption debugging.

Prototype OMD for LHCONE perfSONARs
https://maddash.aglt2.org/WLCGperfSONAR/check_mk
- OMD (Open Monitoring Distribution) wraps a set of Nagios packages into a single preconfigured RPM.
- Needs an x509 credential from an IGTF CA.
- Very green now!

Prototype OMD for LHCOPN perfSONARs
- Almost all green.
- A few tests are failing. Why?
- The problem is in the Perl library used to get/parse the HTTPS page. There is a conflict in a library that the OMD host installs.
- The fix for the HTTPS issues requires a newer version that conflicts with the version needed by other software on the host.
- The solution will be to update the tests to utilize a Python library that will directly read the JSON host information via HTTPS.
https://maddash.aglt2.org/WLCGperfSONAR/check_mk/

Issue for LHCONE Monitoring
- OSG has assigned a subnet for LHC-related monitoring and the network service components: /24
- All production perfSONAR monitoring is on that subnet.
- The network hosting OSG's subnet is a campus production network, and it is NOT willing to allow this subnet to set up a peering with an LHCONE VRF.
- Attempted solution: utilize SOCKS5 proxying via AGLT2 to access LHCONE-only endpoints.
- Problem: not really working. May require software version changes.
- For now we are keeping the Prototype instances at AGLT2 running to provide the needed coverage.

Production OMD for LHCONE perfSONARs
https://pfomd.grid.iu.edu/WLCGperfSONAR/check_mk/
- Notice the difference in what the production instance measures vs the prototype instance.
- Certain hosts are not allowing ICMP pings from the OSG subnet.
- Some checks are not working from the production host on certain systems.

Production OMD for LHCOPN perfSONARs
https://pfomd.grid.iu.edu/WLCGperfSONAR/check_mk/
- Similar issues for the production LHCOPN monitoring.
- BNL hosts are not allowing ICMP pings from the OSG subnet.
- Some checks are not working from the production host on certain systems.

OSG Network Datastore
- All perfSONAR metrics should be collected into the OSG network datastore.
- This is an Esmond datastore from perfSONAR (postgresql+cassandra backends).
- Loaded via RSV probes; currently one probe per perfSONAR instance every 15 minutes.
- Probes have a bug: TWICE the BW as measured by the node.
- Datastore on pfsd.grid.iu.edu
- JSON at: (URL missing)
- Python API at: (URL missing)
- Perl API at https://code.google.com/p/perfsonar-ps/wiki/MeasurementArchivePerlAPI
- Currently the datastore is down for debugging resource usage.
- All LHCONE and (LHC)OPN data should be stored there.

Network Datastore Access via JSON

PuNDIT Project

Next Steps
- We are working on getting ALL WLCG/OSG perfSONAR instances fully updated and properly configured.
- We need to be reliably gathering all network metrics centrally.
- Feb 16 is the deadline for sites to update and configure instances.
- There are some bugs we know of in the data acquisition chain that need fixing. Effort on this is ongoing.
- As we fix known issues and get to reliable operation, we can free up time to pursue possible issues in the network itself, rather than in the framework that gets us network metrics.

Discussion/Questions/Comments?
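As an illustration of the "Network Datastore Access via JSON" slide, here is a minimal sketch of how a client might build a query against an Esmond measurement archive and summarize the returned data points. It is not taken from the talk. The host name `datastore.example.org` is a placeholder (the real JSON URL is missing from the transcript), and the archive path, the query parameter names (`source`, `destination`, `event-type`, `time-range`) and the `{"ts": ..., "val": ...}` point layout are assumptions based on the usual Esmond REST conventions; check them against the deployed datastore before relying on them. The helper names `metadata_url` and `mean_value` are hypothetical.

```python
import json
from urllib.parse import urlencode

# Placeholder host: the real datastore JSON URL is not preserved in the slides.
# The path follows the usual Esmond perfSONAR archive layout (assumption).
ARCHIVE_BASE = "https://datastore.example.org/esmond/perfsonar/archive/"


def metadata_url(source, destination, event_type="throughput", time_range=86400):
    """Build a metadata query URL for one source/destination pair.

    time_range is in seconds (86400 = the last 24 hours).
    """
    params = {
        "source": source,
        "destination": destination,
        "event-type": event_type,
        "time-range": time_range,
        "format": "json",
    }
    return ARCHIVE_BASE + "?" + urlencode(params)


def mean_value(points_json):
    """Average the 'val' field of a JSON list of {"ts": ..., "val": ...} points.

    Returns None when the list is empty.
    """
    points = json.loads(points_json)
    values = [p["val"] for p in points]
    return sum(values) / len(values) if values else None


# Hand-written sample payload (not real measurements): throughput in bits/s.
sample = '[{"ts": 1423440000, "val": 9.0e8}, {"ts": 1423440900, "val": 1.1e9}]'

if __name__ == "__main__":
    print(metadata_url("ps1.example.org", "ps2.example.org"))
    print(mean_value(sample))
```

The sketch deliberately makes no network call: fetching the URL would also need an IGTF X509 client certificate wherever the service requires one, which is outside the scope of this example.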
Useful URLs
- Open Science Grid Networking: https://www.opensciencegrid.org/bin/view/Documentation/NetworkingInOSG
- LHCOPN instructions for perfSONAR-PS (needs update): https://twiki.cern.ch/twiki/bin/view/LHCOPN/PerfsonarPS
- MaDDash Monitoring:
  webui/index.cgi?dashboard=LHCONE%20testing%20sites
  webui/index.cgi?dashboard=LHCONE%20Mesh%20Config
- OMD Monitoring:
  https://maddash.aglt2.org/WLCGperfSONAR/check_mk/index.py?start_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fview.py%3Fview_name%3Dhostgroup%26hostgroup%3DLHCONE
  https://pfomd.grid.iu.edu/WLCGperfSONAR/check_mk/index.py?start_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fview.py%3Fview_name%3Dhostgroup%26hostgroup%3DLHCONE

LHCONE Network Matrices: 28Apr2014
Panels: OWAMP (Latency), BWCTL (Bandwidth). Legend fragments: "No packet loss", "packet loss>", "Gb", "0.5".