Overlay HPC Information
DESCRIPTION
In this presentation from ISC'14, Christian Kniep of Bull presents "Understand Your Cluster by Overlaying Multiple Information Layers." Kniep uses Docker technology in a novel way to ease the administration of InfiniBand networks. "Today's data center managers are burdened by a lack of aligned information across multiple layers. Work-flow events like 'job XYZ starts', aligned with performance metrics and events extracted from log facilities, are low-hanging fruit that is on the edge of becoming usable thanks to open-source software like Graphite, StatsD, logstash and the like. This talk aims to show the benefits of merging multiple layers of information within an InfiniBand cluster, using use-cases for system operations and management-level personnel. Mr. Kniep held two BoF sessions which described the lack of InfiniBand monitoring (ISC'12) and generic HPC monitoring (ISC'13). This year's session aims to propose a way to fix it. To drill into the issue, Mr. Kniep uses his recently started project QNIBTerminal to spin up a complete cluster stack using LXC containers."
TRANSCRIPT
©Bull 2012
Overlay HPC Information
Christian Kniep
R&D HPC Engineer
2014-06-25
About Me
‣ 10y+ SysAdmin
‣ 8y+ SysOps
‣ B.Sc. (2008-2011)
‣ 6y+ DevOps
‣ 1y+ R&D
@CQnib - http://blog.qnib.org - https://github.com/ChristianKniep
My ISC History - Motivation
My ISC History - Description
HPC Software Stack (rough estimate)
Hardware:    HW-sensors/-errors
OS:          Kernel, Userland tools
MiddleWare:  MPI, ISV-libs
Software:    End user application
Excel:       KPI, SLA
Services:    Storage, Job Scheduler
[Diagram: each layer annotated with the personas concerned - HW, SysOps, SysOps/Mgmt, User, Power User/ISV, ISV/Mgmt, Mgmt]
HPC Software Stack (goal)
Hardware:    HW-sensors/-errors
OS:          Kernel, Userland tools
MiddleWare:  MPI, ISV-libs
Software:    End user application
Excel:       KPI, SLA
Services:    Storage, Job Scheduler
[Diagram: Log/Events and Perf columns spanning all layers]
QNIBTerminal - History
‣ Cluster of n*1000+ IB nodes
  • Hard to debug
  • No useful tools in sight
  • Created my own (Graphite update in late 2013)
Achieved HPC Software Stack
Hardware:    IB-sensors/-errors
OS:          Kernel, Userland tools
MiddleWare:  MPI, ISV-libs
Software:    End user application
Excel:       KPI, SLA
Services:    Storage, Job Scheduler
[Diagram: Log/Events and Perf columns spanning all layers]
QNIBTerminal - Implementation
QNIBTerminal - blog.qnib.org
[Diagram: container stack, one container per service, grouped by role]
Log/Events:  logstash, elasticsearch, kibana (ELK)
Services:    haproxy, dns (helixdns), etcd
Performance: carbon, graphite-web, graphite-api, grafana
Compute:     slurmctld, compute0..compute<N> (each running slurmd)
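The Performance path in the stack above is fed through Carbon, which accepts metrics via Graphite's plaintext protocol: one `path value timestamp` line per metric, usually on TCP port 2003. A minimal sketch of a sender (the host name `graphite` and the metric path are assumptions for illustration):

```python
import socket
import time

def carbon_line(path, value, timestamp=None):
    """Format one metric in Carbon's plaintext protocol: 'path value timestamp\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(path, value, host="graphite", port=2003):
    """Ship a single metric to a Carbon cache/relay (host name is an assumption)."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(carbon_line(path, value).encode("ascii"))

# Example: an InfiniBand port counter for compute0 (hypothetical metric path)
# send_metric("qnib.compute0.ib0.port_xmit_data", 123456)
```

Anything that can open a TCP socket can feed the stack this way, which is what makes overlaying job events and IB counters cheap.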
DEMONSTRATION
Future Work
More Services
‣ Improve work-flow for log-events
‣ Nagios(-like) node is missing
‣ Cluster-FileSystem
‣ LDAP
‣ Additional dashboards
‣ Inventory
‣ Using InfiniBand for communication traffic
Graph Representation
‣ Graph inventory needed
  • Hierarchical view is not enough
  • GraphDB seems to be a good idea
[Diagram: example topology with compute nodes comp0/comp1/comp2, switches ibsw0/ibsw2, and services eth1, eth10, ldap12, lustre0]
Pseudo-query:
RETURN node=comp* WHERE ROUTE TO lustre_service INCLUDES ibsw2
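The pseudo-query above could be answered by any graph store; the shape of it is just "shortest route, then membership test". A minimal Python sketch over a plain adjacency dict (node names are taken from the slide's diagram, the edge set itself is an assumption for illustration):

```python
from collections import deque

# Example topology using the slide's node names; edges are assumed.
EDGES = {
    "comp0":   ["ibsw0"],
    "comp1":   ["ibsw0"],
    "comp2":   ["ibsw2"],
    "ibsw0":   ["comp0", "comp1", "ibsw2"],
    "ibsw2":   ["ibsw0", "comp2", "lustre0"],
    "lustre0": ["ibsw2"],
}

def route(graph, src, dst):
    """Shortest path between two nodes via BFS, or None if unreachable."""
    seen = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = seen[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen[nxt] = node
                queue.append(nxt)
    return None

def nodes_routing_through(graph, prefix, service, via):
    """All 'comp*'-style nodes whose route to `service` includes `via`."""
    hits = []
    for node in graph:
        if node.startswith(prefix):
            path = route(graph, node, service)
            if path and via in path:
                hits.append(node)
    return sorted(hits)
```

With a real graph database the same question becomes a one-line path query, but the point of the slide stands either way: a hierarchical inventory cannot express "which compute nodes cross this switch".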
Conclusion
‣ Training
  • New SysOps could start on a virtual cluster
  • 'Strangulate' a node to replay an error
‣ Showcase
  • Showing a customer his (to-be) software stack
  • Convince the SysOps team 'they have nothing to fear'
‣ Complete toolchain could be automated
  • Testing
  • Verification
  • Q&A
‣ n*1000 containers through clustering
Conclusion
‣ n*100 containers are easy (50 on my laptop)
  • Running a 300-node cluster stack
Log AND Performance Management
‣ Metrics w/o logs are useless!
  • ... and the other way around
  • Overlapping is king
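The overlap the slides argue for can start as something very simple: merge the metric stream and the log-event stream into one timeline keyed by timestamp, so a load spike sits right next to the job event that caused it. A minimal sketch (metric paths, timestamps, and log messages are invented for illustration):

```python
# Hypothetical samples: a load spike on compute0 bracketed by Slurm job events.
metrics = [
    (100, "qnib.compute0.load", 0.3),
    (160, "qnib.compute0.load", 7.9),
    (220, "qnib.compute0.load", 0.4),
]
events = [
    (150, "slurm: job 4711 started on compute0"),
    (210, "slurm: job 4711 finished"),
]

def overlay(metrics, events):
    """Merge both streams into a single timeline sorted by timestamp."""
    timeline = [(ts, "metric", "%s=%s" % (path, val)) for ts, path, val in metrics]
    timeline += [(ts, "event", msg) for ts, msg in events]
    return sorted(timeline)

for ts, kind, text in overlay(metrics, events):
    print(ts, kind, text)
```

In the stack shown earlier this join happens visually (Grafana panels over Graphite plus Kibana over elasticsearch), but the principle is the same: neither stream explains the spike on its own.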