Host Health Monitoring with Docker Run
Health Monitoring circa 1999
• Nagios Core
  • Event scheduler
  • Event processor
  • Alert manager
• Host groups config
  • Ping
  • HTTP
  • SSH
• Nagios Remote Plugin Executor
  • SNMP
  • load
  • disk
photo credit: https://en.wikipedia.org/wiki/Nagios
Health Monitoring circa 2012
• AMI
  • Chef / Ansible
• ELB / Health Check
  • Protocol: HTTP (or HTTPS, TCP, SSL)
  • Port: 80
  • Path: /index.html
  • Timeout / Interval: 5s / 30s
  • Unhealthy / Healthy Threshold: 2 / 10
• EC2 / Status Checks
  • Loss of network
  • Loss of power
  • Host software problems
  • Host hardware problems
• ASG
photo credit: http://aws.amazon.com/architecture/, http://blog.domenech.org/2012/11/aws-ec2-auto-scaling-basic-configuration.html
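The ELB health check settings on this slide behave like a small state machine: a target flips state only after a run of consecutive results that contradict its current state. The following is a minimal shell sketch of that idea, not the ELB implementation. It assumes the instance starts out healthy, and `replay_checks` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: replay a sequence of health check results ("ok" or "fail")
# and print the resulting state. Assumes the instance starts out healthy.
# usage: replay_checks UNHEALTHY_THRESHOLD HEALTHY_THRESHOLD result...
replay_checks() {
  unhealthy_threshold=$1
  healthy_threshold=$2
  shift 2
  state=healthy
  streak=0   # consecutive results that contradict the current state
  for result in "$@"; do
    if [ "$state" = healthy ] && [ "$result" = fail ]; then
      streak=$((streak + 1))
      if [ "$streak" -ge "$unhealthy_threshold" ]; then
        state=unhealthy
        streak=0
      fi
    elif [ "$state" = unhealthy ] && [ "$result" = ok ]; then
      streak=$((streak + 1))
      if [ "$streak" -ge "$healthy_threshold" ]; then
        state=healthy
        streak=0
      fi
    else
      streak=0   # a result that agrees with the current state resets the count
    fi
  done
  echo "$state"
}

# Rough timing implied by the slide's numbers (30s interval, thresholds 2 / 10).
interval=30
echo "detect failure after about $((interval * 2))s"
echo "recover after about $((interval * 10))s"
replay_checks 2 10 ok fail fail   # prints "unhealthy"
```

With these numbers a failing host is taken out of service after roughly a minute, while a recovered host needs about five minutes of passing checks before it receives traffic again.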
But you probably still need…
• Nagios for monitoring
• or Zabbix, Ganglia, Sensu…
• or OpsView, SolarWinds…
• or Pingdom, Datadog…
• To provide system feedback
• ASG SetInstanceHealth
photo credit: http://itomibhaa.deviantart.com/art/Who-watches-the-Watchmen-276285938
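The "system feedback" step is what lets an external monitor tell the ASG to replace an instance. A minimal sketch using the AWS CLI's `aws autoscaling set-instance-health` command is shown here. `mark_unhealthy` is a hypothetical helper, the instance ID is made up, and `DRY_RUN` is an assumption added so the sketch can run without AWS credentials.

```shell
#!/bin/sh
# Sketch only: feed a monitoring verdict back to the Auto Scaling group.
# mark_unhealthy is a hypothetical helper. With DRY_RUN set to a non-empty
# value it prints the command instead of calling AWS.
mark_unhealthy() {
  instance_id=$1
  ${DRY_RUN:+echo} aws autoscaling set-instance-health \
    --instance-id "$instance_id" \
    --health-status Unhealthy
}

# Made-up instance ID, dry run only.
DRY_RUN=1
mark_unhealthy i-0123456789abcdef0
```

Once an instance is marked Unhealthy, the ASG is expected to terminate and replace it, which is the feedback loop the slide refers to.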
Health Monitoring circa 2016, the age of containers
• Generic AMI
  • Docker
• ECS
  • Container scheduling and re-scheduling as a service
• ASG / EC2 / Status Checks
  • Simple monitoring container
photo credit: https://github.com/docker/swarm
Diagram: three instances in an ASG, each running ecs-agent and dockerd; ECS places containers on them behind an api ELB and a rails ELB
• api 128 MB
• registry 256 MB
• rails web.2 1024 MB
• data worker.1 512 MB
• rails web.3 1024 MB
• data worker.2 512 MB
• rails worker.2 256 MB
• rails worker.3 256 MB
• rails web.1 1024 MB
• rails worker.1 256 MB
• rails worker.4 256 MB
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
photo credit: http://paper-replika.com/index.php?option=com_content&view=article&id=76&Itemid=207693
> reschedule task
Container schedulers are the new watchmen
• Container process monitoring
• Service health check monitoring
• Automatic re-scheduling
photo credit: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_life_cycle.html
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Still need to configure an ASG to maintain capacity…
Failure Scenarios
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
Still need a monitor…
Health Monitoring circa 2016, the age of containers
• Schedule a monitor process in container cluster
• Describe ASG and ECS membership
• Mark all instances not registered with ECS as unhealthy
• `docker run` a user space health check on every instance
• Mark instances that fail to connect to Docker as unhealthy
• Mark instances that fail the user space health check as unhealthy
No Nagios server + plugins!
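The membership comparison in the steps above comes down to a set difference: every instance the ASG knows about that ECS does not. The following is a minimal shell sketch with made-up instance IDs. `unregistered_instances` is a hypothetical helper, and the sketch assumes the two ID lists have already been fetched from the ASG and ECS APIs as space-separated strings. It is not the approach of the author's Go monitor.

```shell
#!/bin/sh
# Sketch only: list ASG instances that are missing from the ECS cluster.
# Both arguments are space-separated instance IDs; unregistered_instances
# is a hypothetical helper name.
unregistered_instances() {
  asg_ids=$1
  ecs_ids=$2
  for id in $asg_ids; do
    case " $ecs_ids " in
      *" $id "*) ;;                  # registered with ECS, nothing to do
      *) printf '%s\n' "$id" ;;      # in the ASG but unknown to ECS
    esac
  done
}

# Made-up IDs: i-bbb is in the ASG but never registered with ECS.
unregistered_instances "i-aaa i-bbb i-ccc" "i-aaa i-ccc"   # prints "i-bbb"
```

Each ID printed this way is a candidate to be marked unhealthy so that the ASG replaces it. The remaining instances would then go through the `docker run` check.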
Partial Failure Scenarios: battle scars
• web.2 container crashes
• web.2 port unresponsive
• ecs-agent fails
• dockerd fails
• Instance hardware fails
• Instance fails to register with ECS
• Instance userspace gets wacky
• Disk full
• Disk partition corrupt / read-only
• Network packet loss
• CPU steal
• Kernel bugs triggered
• Security vulnerabilities
• Security breaches
• …
User Space Health Check
$ docker run busybox sh -c 'dmesg | grep "Remounting filesystem read-only"'
# why not: $ docker run health-check
To package, distribute and run common binaries and scripts: top, netstat, smartmontools, etc.
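Note that `grep` exits 0 when it finds the message, so with the one-liner on this slide a successful exit means the host is in trouble. A packaged health check might invert that so a non-zero exit means unhealthy. The following is a minimal sketch under that assumption. `check_kernel_log` is a hypothetical name and the sample log lines are made up.

```shell
#!/bin/sh
# Sketch only: read kernel log text on stdin and report health through the
# exit status (0 healthy, 1 unhealthy). check_kernel_log is a hypothetical name.
check_kernel_log() {
  if grep -q "Remounting filesystem read-only"; then
    echo "unhealthy: a filesystem was remounted read-only"
    return 1
  fi
  echo "healthy"
}

# Made-up sample lines standing in for real dmesg output.
printf '%s\n' "EXT4-fs (xvda1): mounted filesystem with ordered data mode" | check_kernel_log
printf '%s\n' "EXT4-fs (xvda1): Remounting filesystem read-only" | check_kernel_log || true
```

On a real host the input could come from `docker run busybox dmesg`, assuming the container is allowed to read the kernel log.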
Thanks!
Slides available on Medium / SlideShare
https://medium.com/@nzoschke/host-health-monitoring-with-docker-run-46315eb38286
http://www.slideshare.net/nzoschke/host-health-monitoring-with-docker-run
Open source Golang monitor available on GitHub
https://github.com/convox/rack/blob/master/api/workers/cluster.go
Questions / feedback to @nzoschke or [email protected]