Lessons learned running large real-world Docker environments
Oct 27th 2015
Alois [email protected]@ruxit.com
Dec 3rd 2015
App #1
App #2
App #1 depends on App #2
Where is this specified?
Unwanted dependencies break architecture
#1 – The Death Star of Service Dependencies
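One way to answer "Where is this specified?" is to keep an explicit list of allowed dependencies and compare it with the calls actually observed. The following is only a minimal sketch: the app names, the DECLARED table and the hard-coded observed calls are hypothetical and do not come from the talk.

```python
# Hypothetical declaration of which app may call which (illustrative names).
DECLARED = {
    "app1": {"app2"},
    "app2": set(),
}


def undeclared_dependencies(declared, observed_calls):
    """Return the observed (caller, callee) pairs that are not declared."""
    return sorted(
        (caller, callee)
        for caller, callee in set(observed_calls)
        if callee not in declared.get(caller, set())
    )


# Observed calls would normally come from monitoring or tracing data;
# they are hard-coded here to keep the sketch self-contained.
observed = [("app1", "app2"), ("app2", "app1")]
violations = undeclared_dependencies(DECLARED, observed)
print(violations)
```

Here the call from app2 back to app1 is reported, because only App #1 depending on App #2 was declared.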
#2 – The Network Retransmission Episode
[Figure: network diagram with the label "Retransmissions" repeated on several connections]
• Hardware defect in a single network interface card (NIC)
• NIC worked well under low load
• Retransmissions only under heavy load
• Affected communications to other machines in the datacenter
• Still not sure about the exact defect on the NIC
What was the problem?
#2 – The Network Retransmission Episode
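A retransmission problem like this one can be made visible by watching the kernel's TCP counters. The sketch below assumes a Linux host, where /proc/net/snmp exposes OutSegs and RetransSegs in a header line followed by a value line; the function name and the sample text are made up for illustration and are not from the talk.

```python
def tcp_retransmit_ratio(snmp_text):
    """Return RetransSegs / OutSegs parsed from /proc/net/snmp-style text."""
    tcp_lines = [
        line.split()
        for line in snmp_text.splitlines()
        if line.startswith("Tcp:")
    ]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp section found")
    # First "Tcp:" line holds the counter names, second holds the values.
    header, values = tcp_lines[0][1:], tcp_lines[1][1:]
    counters = dict(zip(header, (int(v) for v in values)))
    out_segs = counters["OutSegs"]
    return counters["RetransSegs"] / out_segs if out_segs else 0.0


# Shortened, made-up sample in the same two-line layout as /proc/net/snmp.
SAMPLE = (
    "Tcp: InSegs OutSegs RetransSegs\n"
    "Tcp: 1000 2000 100\n"
)
ratio = tcp_retransmit_ratio(SAMPLE)
print(ratio)
```

The counters are cumulative, so to catch a NIC that misbehaves only under heavy load it is more useful to read the file twice and compare the difference between the two samples.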
Campfire stories
#1 – The Death Star of Service Dependencies
#2 – The Network Retransmission Episode
#3 – The Hungry Container Breakdown
• Shared /logs partition on host
• No log rotation, no archiving for app logs
• No proper log management used for Docker environment
• Shared /logs partition on a single host ran out of space
What was the problem?
#3 – The Hungry Container Breakdown
• Container health checks failed
• Marathon terminated task and rescheduled new one
• Still no free space on /logs
• Termination and rescheduling
• /var/lib/docker ran out of space
• Mesos slave unable to run Docker tasks
How the problem evolved over time
#3 – The Hungry Container Breakdown
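The chain of events above starts with a partition silently filling up, so a simple free-space check on the affected paths would have raised the alarm early. This is a minimal sketch, assuming a 90% threshold and an injectable probe function; both are illustrative choices and not part of the talk.

```python
import shutil


def used_fraction(path):
    """Return the used share (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def paths_over_threshold(paths, threshold=0.9, probe=used_fraction):
    """Return the paths whose filesystem is fuller than threshold."""
    return [p for p in paths if probe(p) > threshold]


# On a real host one might call:
#   paths_over_threshold(["/logs", "/var/lib/docker"])
# A fake probe is used here so the sketch runs anywhere.
fake_usage = {"/logs": 0.99, "/var/lib/docker": 0.50}
full = paths_over_threshold(list(fake_usage), probe=fake_usage.get)
print(full)
```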
• Log management tools for app logs, e.g. Fluentd and Logstash; set the Docker log driver with --log-driver=none|syslog
• Remove containers when they exit: --rm=true
• Run Mesos slave with --docker_remove_delay=VALUE
How the problem could have been avoided
#3 – The Hungry Container Breakdown
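The two docker run options named above can be combined on one command line. The helper below is a hypothetical sketch that only assembles the argument list; the function name and the image name are assumptions made for illustration.

```python
def docker_run_args(image, log_driver="syslog", remove=True):
    """Build an argv list for `docker run` with log driver and cleanup set."""
    args = ["docker", "run", f"--log-driver={log_driver}"]
    if remove:
        # Remove the container's filesystem once the container exits.
        args.append("--rm=true")
    args.append(image)
    return args


cmd = docker_run_args("example/app:latest")  # image name is hypothetical
print(" ".join(cmd))
```

Building the command as a list keeps it safe to pass to subprocess.run without shell quoting issues.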
Campfire stories
#1 – The Death Star of Service Dependencies
#2 – The Network Retransmission Episode
#3 – The Hungry Container Breakdown
#4 – The Day Orchestration Stood Still
• Marathon 0.8.x keeps all versions of applications for recovery (by default)
• High frequency of microservices deployments
• Slowdown through ZooKeeper (zk) overload
What was the problem?
#4 – The Day Orchestration Stood Still
• Respective parameter (zk_max_versions) was not set to a proper limit: --zk_max_versions=20
How the problem could have been avoided
#4 – The Day Orchestration Stood Still
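To illustrate why a version limit matters when deployments are frequent, the sketch below shows the effect of capping a version history at 20 entries. It is not Marathon's implementation; the function name and the version labels are hypothetical.

```python
def prune_versions(versions, max_versions=20):
    """Keep only the newest max_versions entries; versions are oldest first."""
    if max_versions <= 0:
        raise ValueError("max_versions must be positive")
    return versions[-max_versions:]


history = [f"v{i}" for i in range(1, 101)]  # 100 hypothetical deployments
kept = prune_versions(history)
print(len(history), "->", len(kept))
```

Without a cap the stored history grows with every deployment, which is the growth pattern that overloaded ZooKeeper in this story.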
Campfire stories
#1 – The Death Star of Service Dependencies
#2 – The Network Retransmission Episode
#3 – The Hungry Container Breakdown
#4 – The Day Orchestration Stood Still
#5 – The Mushroom Cloud Effect
• Massive load testing in preparation for Black Friday
• Tests ran for 3 days
• No impact to real users, only backend services affected
• Many components to take into account
What was the problem?
[Diagram: Service (1) to Container (1..*); Container (*) to Host (1). Figures shown: 174 / 3.4k and 22 / 13.3k]
#5 – The Mushroom Cloud Effect
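With many components to take into account, it helps to model the Service, Container and Host relationship from the slide explicitly. The sketch below assumes the multiplicities read as "one service has one or more containers" and "many containers run on one host"; the class, the field names and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    container_id: str
    service: str  # each container belongs to exactly one service
    host: str     # each container runs on exactly one host


def services_by_host(containers):
    """Map each host name to the set of services running on it."""
    result = defaultdict(set)
    for c in containers:
        result[c.host].add(c.service)
    return dict(result)


containers = [
    Container("c1", "checkout", "host-a"),
    Container("c2", "checkout", "host-b"),
    Container("c3", "search", "host-a"),
]
mapping = services_by_host(containers)
print(mapping)
```

A lookup like this shows which services are exposed when a single host is under pressure.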
Campfire stories
#1 – The Death Star of Service Dependencies
#2 – The Network Retransmission Episode
#3 – The Hungry Container Breakdown
#4 – The Day Orchestration Stood Still
#5 – The Mushroom Cloud Effect
Free trial - https://ruxit.com/docker-monitoring/
Blog - https://blog.ruxit.com/
@ruxit
What lessons have you learned?