Easing Your Way into Docker: Lessons from a Journey to Production (2016)


Docker at SpareFoot Lessons From a Journey to Production

DevOps Days Austin May 3, 2016

Who am I? Steve Woodruff

❏ Director of DevOps at SpareFoot implementing CI/CD

❏ Spent 10+ years at Motorola doing embedded development (C, C++)

❏ Spent 5 years at IBM as a sys admin in a large server farm (Linux, AIX, Solaris)

[email protected] Twitter: @sjwoodr GitHub: sjwoodr

● Think Hotels.com for self storage
● All infrastructure in AWS
● 40 Developers on 7 Teams
  ○ Continuous Delivery

● Docker in production since 2014


The Beginning: SpareFoot + Docker

Hackathon! Docker + Fig (now Compose) allowed us to run our production architecture locally.

Yim - Call Center Application

Used exclusively by our call center

Chrome ONLY

Node version n+1

React + Flux


CI and Deployments

Janky shell scripts… slow builds, etc.

Used Bamboo to build images

feature branches were built/deployed to Dev

master branch was built/deployed to Staging

Dynamically created custom container start script

Tried to auto-detect when the containers had started before kicking off post-deploy tests

Build times were rather long

Spent an awful long time doing docker push (to our registry) and docker pull (on the target hosts)
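As a rough illustration (not SpareFoot's actual scripts), the build-push-pull-run cycle described above might look something like this in Python; the registry, image name, and target hosts are all hypothetical:

import subprocess

IMAGE = "registry.example.com/yim"                    # hypothetical private registry + repo
HOSTS = ["app1.example.com", "app2.example.com"]      # hypothetical target hosts

def build_and_push(build_id):
    tag = "%s:%s" % (IMAGE, build_id)
    subprocess.check_call(["docker", "build", "-t", tag, "."])
    subprocess.check_call(["docker", "push", tag])    # this push step was painfully slow
    return tag

def deploy(tag):
    # Serial deploy: pull the image and restart the container on each host in turn
    for host in HOSTS:
        remote = ("docker pull {0}; "
                  "docker rm -f yim >/dev/null 2>&1; "
                  "docker run -d --name yim -p 3000:3000 {0}").format(tag)
        subprocess.check_call(["ssh", host, remote])

if __name__ == "__main__":
    deploy(build_and_push("bamboo-1234"))

Pushing and pulling the full image for every deploy, one host at a time, is exactly where the build and deploy time went.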

OK, so Docker feels like a solution… and we kind of know how to do this. But…

● Continuous Integration / Delivery?

○ Docker Registry

○ Bamboo

○ Deployments

● Host Volumes and Port Forwarding rules?

○ Not saved with the source code

● Get Docker to run in local, dev, staging, and production environments?

○ Configuration?

Docker in Production (technically)!

We had 2 load-balanced EC2 instances running a Node app.

Now we have 2 load-balanced EC2 instances running Docker containers that run the same Node app!

(Diagram: before and after, an ELB listening on 443 in front of two instances, each serving the app on port 3000; the only new piece is that the app now runs inside a container on each instance.)

Yim: Trouble in Docker Paradise

Hosting our own Docker registry was a bad idea

Stability was a problem

No level of access control on the registry itself

Mimicking servers: 1 container per host. Need orchestration, please!

Amazon Linux AMI -> old version of Docker… doh!

Docker push/pull to/from the registry was very slow

Build: push to the registry. Deploy: pull from the registry to each host, serially.

Performance was fine…

…but stability was the issue

This internal-facing Node.js app was moved to a pair of EC2 instances and out of Docker after about 4 months of pain and suffering

Yim: Lessons Learned

We need orchestration

Rolling our own Docker deployments was confusing to OPS and to the Dev team

Running our own Docker registry is a bad idea

Stability was a problem

No level of access control on the registry itself

Our S3 backend would grow by several GB per month with no automated cleanup

No easy way to roll back failed deploys; just fix it and deploy again…

All this culminated in a poor build process and hurt CI velocity: longer builds, longer deploys, no real gain

Like everyone else…

…we were “deconstructing the monolith”

(Diagram: before, a single application sitting on a monolithic library and its data.)

(Diagram: after, the application talks to an API gateway fronting several microservices, each with its own REST API and data.)

A Better Docker Registry

With Yim we learned that rolling our own Registry was a bad idea.

● Limited access control
● We have to maintain it

Let’s try Quay…

● Has access control
● Robots, yusss!
● We don’t have to maintain it
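A Quay robot account gives CI hosts pull/push credentials without sharing a personal login. A minimal sketch of logging a build host in before pushing or pulling; the robot name and token below are placeholders, not SpareFoot's:

import subprocess

def quay_login(robot_user, robot_token):
    # robot_user looks like "orgname+robotname"; both values here are hypothetical
    subprocess.check_call(["docker", "login",
                           "-u", robot_user,
                           "-p", robot_token,
                           "quay.io"])

quay_login("sparefoot+bamboo", "<robot token>")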

We’ve learned some things...

● Easier than we thought
● Quay was the glue we needed
  ○ Use an off-the-shelf solution
  ○ We like Quay.io
● Bolting on to our existing CI pipeline worked really well
  ○ Developers didn’t have to learn a new process
  ○ Microservice consumers can pull tagged versions
  ○ We can automate tests against all versions

Now we talk containers from local -> dev -> staging but NOT in production.

(Diagram: image tags flowing through environments)

● Feature branch (BRANCH A) build -> service1:dev-branch-name -> Dev
● master build -> service1:stage -> Staging
● Promotion -> service1:prod -> Production
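A hedged sketch of how a CI job might retag and push an image to Quay for each environment, matching the tag scheme above; the Quay organization name is a placeholder:

import subprocess

REPO = "quay.io/sparefoot"    # hypothetical Quay organization

def push_as(image, tag):
    target = "%s/%s:%s" % (REPO, image, tag)
    subprocess.check_call(["docker", "tag", image, target])
    subprocess.check_call(["docker", "push", target])

push_as("service1", "dev-my-feature")   # feature branch build -> Dev
push_as("service1", "stage")            # master build -> Staging
push_as("service1", "prod")             # promotion -> Production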

Production - What is still needed

● Orchestration
  ○ Yim sucked because we tried to do this ourselves
● Better Deployments
  ○ With rollbacks
● Configuration Management
  ○ We have things to hide

Production - Orchestration

Production - Software Selection

Choosing orchestration software / container service in early 2015

● StackEngine: lacked docker-compose support
● Kubernetes: PhD required
● Mesosphere: nice, but slow to deploy
● EC2 Container Service: lacked docker-compose support and custom AMIs
● Tutum (now Docker Cloud)
● Rancher

Production - Enter Rancher

After running proof-of-concepts of both Tutum and Rancher, we decided to continue down our path to production deploys with Rancher.

Had more mature support for docker-compose files.

Tutum added this after our evaluation had ended

Did not require us to orchestrate the deployments through their remote endpoint

Rancher server runs on our EC2 instances and we are in full control of all the things

Had a full API we can work with, in addition to the custom rancher-compose CLI

Had a very active user community and a beta forum where the Rancher development team answered questions and even troubleshot configuration problems.

Overlaying Docker on AWS

(Diagram: an ELB in front of EC2 instances, with containers, including an HAProxy layer, running on each instance.)

Overlaying Docker on AWS

Why the extra HAProxy layer?

Allows us to create the ELB and leave them alone

When we deploy new versioned services, we update the service alias / HAProxy links

Allows for fast rollback to the previous version of the service

Deployments and Rollbacks

Developers can deploy to production whenever they want

HipChat bot commands to deploy and rollback/revert

Deployments to each of the 3 environments use rancher-compose to:

● Deploy new versioned services / containers
● Create or update service aliases / HAProxy links
● Delete older versioned services, keeping only the current and previous versions

When things go haywire…

We simply roll back

Production deploy creates a docker-compose-rollback.yml file

Query Rancher API to get list of running services

Allows us to change haproxy and service alias links back to the previous version

Super fast to rollback, no containers need to be spun up!
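A minimal sketch of the kind of Rancher API query that rollback flow relies on, assuming a Rancher v1-era API; the server URL, keys, and project ID are placeholders, and the field names are a best recollection rather than a guaranteed contract:

import requests

RANCHER_URL = "http://rancher.internal:8080/v1"   # hypothetical Rancher server
AUTH = ("ACCESS_KEY", "SECRET_KEY")               # API keypair (placeholders)

def active_services(project_id):
    # List services in the given environment/project and keep the active ones,
    # so a rollback file can point the aliases back at the previous version.
    resp = requests.get("%s/projects/%s/services" % (RANCHER_URL, project_id),
                        auth=AUTH)
    resp.raise_for_status()
    return [s["name"] for s in resp.json().get("data", [])
            if s.get("state") == "active"]

print(active_services("1a5"))   # e.g. versioned service names for the current and previous deploys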

Overlaying Docker on AWS - Rollback!

(Diagram sequence: the same ELB / EC2 / containers picture, with the service alias links flipped back to the previous version's containers; the ELB and EC2 instances are untouched.)

Secret Management

We’re already using SaltStack to manage our EC2 minions (VMs)

Salt Grains are used for some common variables used in salt states

Salt Pillar Data exists which is configuration data available only to certain minions

This Salt Pillar Data is already broken down by environment (dev/stage/prod)

We should just use this data to dynamically create the docker-compose and rancher-compose files!

Technical Challenge - docker-compose

We needed to support a single docker-compose.yml file, maintained by developers of an app or service

They don’t want to maintain local, dev, stage, and prod versions of this file

Changes to multiple files would be error-prone

Must support differences in the architecture or configuration of services across environments

Secret Secret, I’ve got a Secret

A templated rancher-compose file

{% set sf_env = grains['bookingservice-env'] %}

{% set version = grains['bookingservice-version'] %}

bookingservice-{{ sf_env }}-{{ version }}:
  scale: 1

We use a scale of 1 because we use global host scheduling combined with host affinity, so that one container of this service is deployed to each VM in the specified environment (dev/stage/prod). This lets us spin up a new Rancher host and easily deploy to the new host VM.

A templated docker-compose file

A Closer Look

MYSQL_SPAREFOOT_HOST: {{ salt['pillar.get']('bookingservice-dev:MYSQL_SPAREFOOT_HOST') }}

MYSQL_SPAREFOOT_DB: {{ salt['pillar.get']('bookingservice-dev:MYSQL_SPAREFOOT_DB') }}

MYSQL_SPAREFOOT_USER: {{ salt['pillar.get']('bookingservice-dev:MYSQL_SPAREFOOT_USER') }}

MYSQL_SPAREFOOT_PASS: {{ salt['pillar.get']('bookingservice-dev:MYSQL_SPAREFOOT_PASS') }}

MYSQL_SPAREFOOT_PORT: {{ salt['pillar.get']('bookingservice-dev:MYSQL_SPAREFOOT_PORT') }}

APP_LOG_FILE: {{ salt['pillar.get']('bookingservice-dev:APP_LOG_FILE') }}

REDIS_HOST: {{ salt['pillar.get']('bookingservice-dev:REDIS_HOST') }}

REDIS_PORT: {{ salt['pillar.get']('bookingservice-dev:REDIS_PORT') }}

Deployments with rancher-compose

Deployments to Dev and Staging are done via Bamboo

Deployments to Production are done by developers via HipChat commands

In the end, everything is invoking our salt-deploy.py script

Set some Salt grains for the target env, version, build ID, and image tag in Quay.io

Services get versioned with a timestamp and Bamboo build ID

Render jinja2 / inject Salt grains and pillar data via Salt minion Python code

caller.sminion.functions['cp.get_template'](cwd + '/docker-compose.yml', cwd + '/docker-compose-salt.yml')

caller.sminion.functions['cp.get_template'](cwd + '/rancher-compose.yml', cwd + '/rancher-compose-salt.yml')

Invokes rancher-compose create / up

Cleanup keeps the live version of a service and the live-1 version; the rest are purged.
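Putting those steps together, a stripped-down sketch of what a salt-deploy.py-style script does; the grain names mirror the template shown earlier, while the stack name and rancher-compose flags are assumptions rather than SpareFoot's exact invocation:

import subprocess
import salt.client

def deploy(service, env, version, cwd):
    caller = salt.client.Caller()
    # Record target env and version as grains so the jinja templates above
    # (grains['bookingservice-env'], grains['bookingservice-version']) can see them;
    # build ID and image tag grains would be set the same way.
    caller.sminion.functions['grains.setval']('%s-env' % service, env)
    caller.sminion.functions['grains.setval']('%s-version' % service, version)
    # Render the templated compose files, injecting grains and pillar data
    caller.sminion.functions['cp.get_template'](cwd + '/docker-compose.yml',
                                                cwd + '/docker-compose-salt.yml')
    caller.sminion.functions['cp.get_template'](cwd + '/rancher-compose.yml',
                                                cwd + '/rancher-compose-salt.yml')
    # Hand the rendered files to rancher-compose to create / upgrade the stack
    subprocess.check_call(['rancher-compose',
                           '-p', '%s-%s' % (service, env),       # hypothetical stack name
                           '-f', cwd + '/docker-compose-salt.yml',
                           '-r', cwd + '/rancher-compose-salt.yml',
                           'up', '-d'])

deploy('bookingservice', 'dev', '20160503-1234', '/opt/deploys/bookingservice')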

Surprise! Rancher Adds Variable Support

Does the support for interpolating variables, added in Rancher 0.41, deprecate the work we've done with Salt and rendering jinja2 templates?

No. We already maintain this data in grains and pillars, so we just reuse it.

Rancher's implementation uses environment variables on the host running rancher-compose to fill in the blanks.

That would require logic to load those env variables based on the target env (dev/stage/prod), so we might as well pull the data from Salt pillar, which already has a separate pillar for each service, broken down by target environment.


So we deployed our first microservice and...

...Everything worked...

… Until it didn’t.

The Day Rancher Died

(Diagram sequence: the ELB / EC2 / containers picture as the Rancher outage unfolded.)

Where are we now?

52 Microservices in production with Rancher + Docker

5-10 Deployments per day on average

Busiest services handling around 50 requests / second

Consumer facing applications being containerized in development

New teams cutting their teeth

Keep on “Strangling” (the strangler pattern for incrementally replacing the monolith)*

* DO NOT: google image search for “strangling hands”

Finally

Start small

Fail (a lot)

Move on and apply what you learned

Thank you!

These Slides: http://bit.ly/1SVGaRA

Reach out:

● Steve ([email protected], Twitter @sjwoodr)

Questions?