CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos
![Page 1: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/1.jpg)
CI AND CD AT SCALE
SCALING JENKINS WITH DOCKER AND APACHE MESOS
Carlos Sanchez
csanchez.org / @csanchez
See online at http://carlossg.github.io/presentations
![Page 2: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/2.jpg)
ABOUT ME
Senior Software Engineer @ CloudBees
Contributor to the Jenkins Mesos plugin and the Java Marathon client
Author of Jenkins Kubernetes plugin
Long time OSS contributor at Apache, Eclipse, Puppet,…
![Page 3: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/3.jpg)
OUR USE CASE
Scaling Jenkins
Your mileage may vary
![Page 4: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/4.jpg)
SCALING JENKINS
Two options:
More build agents per master
More masters
![Page 5: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/5.jpg)
SCALING JENKINS: MORE BUILD AGENTS
Pros
Multiple plugins to add more agents, even dynamically
Cons
The master is still a SPOF
Handling multiple configurations, plugin versions,...
There is a limit on how many build agents can be attached
![Page 6: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/6.jpg)
SCALING JENKINS: MORE MASTERS
Pros
Different sub-organizations can self service and operate independently
Cons
Single Sign-On
Centralized configuration and operation
![Page 7: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/7.jpg)
CLOUDBEES JENKINS ENTERPRISE EDITION
CloudBees Jenkins Operations Center
![Page 8: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/8.jpg)
CLOUDBEES JENKINS PLATFORM - PRIVATE SAAS EDITION
The best of both worlds
CloudBees Jenkins Operations Center with multiple masters
Dynamic build agent creation in each master
ElasticSearch for Jenkins metrics and Logstash
![Page 9: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/9.jpg)
BUT IT IS NOT TRIVIAL
![Page 10: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/10.jpg)
ARCHITECTURE
Docker Docker Docker
![Page 11: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/11.jpg)
Isolated Jenkins masters
Isolated build agents and jobs
Memory and CPU limits
![Page 12: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/12.jpg)
How would you design your infrastructure if you couldn't login? Ever.
Kelsey Hightower
![Page 13: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/13.jpg)
EMBRACE FAILURE!
![Page 14: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/14.jpg)
CLUSTER SCHEDULING
Running in public cloud, private cloud, VMs or bare metal
Starting with AWS and OpenStack
HA and fault tolerant
With Docker support of course
![Page 15: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/15.jpg)
MESOSPHERE MARATHON
![Page 16: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/16.jpg)
TERRAFORM
![Page 17: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/17.jpg)
TERRAFORM

    resource "aws_instance" "worker" {
      count = 1
      instance_type = "m3.large"
      ami = "ami-xxxxxx"
      key_name = "tiger-csanchez"
      security_groups = ["sg-61bc8c18"]
      subnet_id = "subnet-xxxxxx"
      associate_public_ip_address = true
      tags {
        Name = "tiger-csanchez-worker-1"
        "cloudbees:pse:cluster" = "tiger-csanchez"
        "cloudbees:pse:type" = "worker"
      }
      root_block_device {
        volume_size = 50
      }
    }
![Page 18: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/18.jpg)
TERRAFORM
State is managed
Runs are idempotent
terraform apply
Sometimes it is too automatic
Changing image id will restart all instances
![Page 19: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/19.jpg)
![Page 20: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/20.jpg)
Preinstall packages: Mesos, Marathon, Docker
Cached docker images
Other drivers: XFS, NFS,...
Enhanced networking driver (AWS)
![Page 21: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/21.jpg)
STORAGE
Handling distributed storage
Servers can start in any host of the cluster
And they can move when they are restarted
Jenkins masters need persistent storage, agents (typically) don't
Supporting EBS (AWS) and external NFS
![Page 22: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/22.jpg)
SIDEKICK CONTAINER
A privileged container that manages mounting for other containers
Can execute commands in the host and other containers
![Page 23: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/23.jpg)
SIDEKICK CONTAINER CASTLE
Running in Marathon in each host
"constraints": [ [ "hostname", "UNIQUE" ] ]
![Page 24: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/24.jpg)
A lot of magic happening with nsenter, both in host and other containers
![Page 25: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/25.jpg)
Jenkins master container requests data on startup using entrypoint
REST call to Castle
Castle checks authentication
Creates necessary storage in the backend
EBS volumes from snapshots
Directories in NFS backend
![Page 26: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/26.jpg)
Mounts storage in requesting container
EBS is mounted to host, then bind mounted into container
NFS is mounted directly in container
Listens to Docker event stream for killed containers
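A minimal sketch of watching for dead containers, assuming the events arrive as JSON lines such as those printed by `docker events --format '{{json .}}'`. The function name is hypothetical, and what Castle does once it sees such an event is not described in the talk, so the sketch only extracts the container IDs.

```python
import json


def killed_containers(event_lines):
    """Yield the IDs of containers that died, given JSON-encoded Docker events.

    Assumes one JSON object per line with the "Type", "Action" and "Actor"
    fields that the Docker events API emits.
    """
    for line in event_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        if event.get("Type") == "container" and event.get("Action") == "die":
            yield event["Actor"]["ID"]
```

In practice `event_lines` could be the standard output of a long-running `docker events` subprocess; a caller would then release whatever storage belonged to each yielded container.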
![Page 27: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/27.jpg)
CASTLE: BACKUPS AND CLEANUP
Periodically takes S3 snapshots from EBS volumes in AWS
Cleanups happening at different stages and periodically
EMBRACE FAILURE!
![Page 28: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/28.jpg)
PERMISSIONS
Containers should not run as root
Container user id != host user id
i.e. jenkins user in container is always 1000 but matches ubuntu user in host
![Page 29: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/29.jpg)
CAVEATS
Only a limited number of EBS volumes can be mounted
Docs say /dev/sd[f-p], but /dev/sd[q-z] seem to work too
Sometimes the device gets corrupt and no more EBS volumes can be mounted there
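The device-name limit can be illustrated with a small sketch that hands out the next unused name from the documented /dev/sd[f-p] range. The function name and the way the in-use set is tracked are assumptions for illustration, not how Castle does it.

```python
import string

# Letters f..p, the device range the slide attributes to the AWS docs.
DOCUMENTED_LETTERS = string.ascii_lowercase[5:16]


def next_free_device(in_use, letters=DOCUMENTED_LETTERS):
    """Return the first /dev/sdX name not in `in_use`, or None if all are taken."""
    for letter in letters:
        device = "/dev/sd" + letter
        if device not in in_use:
            return device
    return None  # the host cannot take another EBS volume in this range
```

Passing a wider `letters` string would cover the /dev/sd[q-z] names that seemed to work too.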
NFS users must be centralized and match in cluster and NFS server
![Page 30: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/30.jpg)
MEMORY
Scheduler needs to account for container memory requirements and host available memory
Prevent containers from using more memory than allowed
Memory constraints translate to Docker --memory
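As a rough sketch of that translation, a scheduler memory figure in megabytes can be turned into `docker run` arguments like this. The function name is hypothetical; `--memory` with an `m` suffix is the standard Docker flag.

```python
def docker_memory_args(mem_mb):
    """Translate a scheduler memory limit in MB into `docker run` arguments."""
    if mem_mb <= 0:
        raise ValueError("memory limit must be positive")
    # Docker accepts a unit suffix; "m" means megabytes.
    return ["--memory", f"{int(mem_mb)}m"]
```

For example, a 512 MB build agent would be started with `--memory 512m`.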
![Page 31: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/31.jpg)
WHAT DO YOU THINK HAPPENS WHEN?
Your container goes over memory quota?
![Page 32: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/32.jpg)
![Page 33: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/33.jpg)
WHAT ABOUT THE JVM?
![Page 34: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/34.jpg)
WHAT ABOUT THE CHILD PROCESSES?
![Page 35: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/35.jpg)
OTHER CONSIDERATIONS
ZOMBIE REAPING PROBLEM
![Page 36: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/36.jpg)
ZOMBIE REAPING PROBLEM
Zombie processes are processes that have terminated but have not (yet) been waited for by their parent processes.
The task of the init process (PID 1) is to "adopt" orphaned child processes
source
![Page 37: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/37.jpg)
THIS IS A PROBLEM IN DOCKER
Jenkins build agents run multiple processes
But Jenkins masters too, and they are long running
![Page 38: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/38.jpg)
TINI
Systemd or SysV init is too heavyweight for containers
All Tini does is spawn a single child (Tini is meant to be run in a container), and wait for it to exit all the while reaping zombies and performing signal forwarding.
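The behaviour described in that quote can be sketched in a few lines. This is an illustration of the idea only, not Tini's actual implementation (Tini is a small C program); the function name is hypothetical and the sketch assumes a POSIX system.

```python
import os
import signal
import sys


def run_as_init(argv):
    """Spawn argv as a child, forward signals to it, and reap every child."""
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)  # replaces the forked process on success
        except OSError:
            os._exit(127)  # command could not be executed

    # Forward termination signals to the main child.
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, lambda signum, frame: os.kill(child, signum))

    while True:
        # wait() returns for any child. When this runs as PID 1, orphaned
        # processes are re-parented to it and are reaped here as well.
        pid, status = os.wait()
        if pid == child:
            if os.WIFEXITED(status):
                return os.WEXITSTATUS(status)
            return 128 + os.WTERMSIG(status)  # child was killed by a signal


if __name__ == "__main__":
    sys.exit(run_as_init(sys.argv[1:]))
```

Used as a container entrypoint, it would wrap the real command so that defunct processes do not accumulate.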
PROCESS REAPING
Docker 1.9 gave us trouble at scale, rolled back to 1.8
Lots of defunct processes
![Page 39: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/39.jpg)
NETWORKING
Jenkins masters open several ports
HTTP
JNLP Build agent
SSH server (Jenkins CLI type operations)
![Page 40: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/40.jpg)
NETWORKING: HTTP
We use a simple nginx reverse proxy for
Mesos
Marathon
ElasticSearch
CJOC
Jenkins masters
Gets destination host and port from Marathon
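A minimal sketch of that lookup, assuming the proxy configuration is rendered from a task list shaped like the one Marathon's `/v2/tasks` endpoint returns (`appId`, `host`, `ports`). The talk does not say how the nginx configuration is actually generated, so the function and the upstream naming are illustrative.

```python
def nginx_upstreams(tasks):
    """Render one nginx upstream block per Marathon app from a task list."""
    by_app = {}
    for task in tasks:
        if task.get("ports"):  # skip tasks that expose no port
            address = f"{task['host']}:{task['ports'][0]}"
            by_app.setdefault(task["appId"], []).append(address)

    blocks = []
    for app_id in sorted(by_app):
        name = app_id.strip("/").replace("/", "_")  # "/master1" -> "master1"
        servers = "".join(f"    server {addr};\n" for addr in by_app[app_id])
        blocks.append(f"upstream {name} {{\n{servers}}}\n")
    return "\n".join(blocks)
```

Regenerating and reloading the configuration whenever the task list changes keeps the proxy pointing at wherever Marathon placed each master.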
![Page 41: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/41.jpg)
NETWORKING: HTTP
Doing both
domain based routing master1.pse.example.com
path based routing pse.example.com/master1
because not everybody can touch the DNS or get a wildcard SSL certificate
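Both routing styles resolve to the same master name. A small sketch of that resolution, using the example domain from the slide; the function name is hypothetical and the real proxy would express this in nginx configuration rather than Python.

```python
def master_for_request(host, path, base_domain="pse.example.com"):
    """Return the master name a request targets, or None if there is none."""
    host = host.lower().split(":")[0]  # drop an optional port
    suffix = "." + base_domain
    if host.endswith(suffix):
        # Domain based routing: master1.pse.example.com
        return host[: -len(suffix)] or None
    if host == base_domain:
        # Path based routing: pse.example.com/master1
        return path.lstrip("/").split("/", 1)[0] or None
    return None
```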
![Page 42: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/42.jpg)
NETWORKING: JNLP
Build agents started dynamically in Mesos cluster can connect to masters internally
Build agents manually started outside cluster get host and port destination from HTTP, then connect directly
![Page 43: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/43.jpg)
NETWORKING: SSH
SSH Gateway Service
Tunnel SSH requests to the correct host
Simple configuration needed in client

    Host=*.ci.cloudbees.com
      ProxyCommand=ssh -q -p 22 ssh.ci.cloudbees.com tunnel %h

allows you to run

    ssh master1.ci.cloudbees.com
![Page 45: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/45.jpg)
![Page 46: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/46.jpg)
A 300 JENKINS MASTERS CLUSTER
3 Mesos masters (m3.xlarge: 4 vCPU, 15GB, 2x40 SSD)
80 Mesos slaves (m3.xlarge)
7 Mesos slaves dedicated to ElasticSearch: (r3.2xlarge: 8 vCPU, 61GB, 1x160 SSD)
Total: 1.5TB 376 CPUs
Running 300 masters and ~3 concurrent jobs per master
Masters: 2GB 0.1 CPU / Build agents: 512MB 0.1 CPU
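The figures above can be checked with a little arithmetic. The sketch below only uses numbers from the slide; CPU shares are kept in millicores to avoid floating point rounding, and the variable names are illustrative.

```python
# Worker nodes from the slide: (count, vCPUs each, memory in GB each).
WORKERS = [
    (80, 4, 15),  # m3.xlarge Mesos slaves
    (7, 8, 61),   # r3.2xlarge slaves dedicated to ElasticSearch
]

MASTERS = 300
JOBS_PER_MASTER = 3                     # "~3 concurrent jobs per master"
MASTER_MEM_MB, MASTER_MCPU = 2048, 100  # 2GB, 0.1 CPU
AGENT_MEM_MB, AGENT_MCPU = 512, 100     # 512MB, 0.1 CPU


def cluster_cpus():
    """Total vCPUs across all Mesos slaves."""
    return sum(count * cpus for count, cpus, _ in WORKERS)


def requested_resources():
    """Return (memory in GB, CPUs) requested by masters plus build agents."""
    agents = MASTERS * JOBS_PER_MASTER
    mem_mb = MASTERS * MASTER_MEM_MB + agents * AGENT_MEM_MB
    mcpu = MASTERS * MASTER_MCPU + agents * AGENT_MCPU
    return mem_mb / 1024, mcpu / 1000


if __name__ == "__main__":
    print(cluster_cpus())         # vCPUs offered by the 87 slaves
    print(requested_resources())  # what 300 masters and 900 agents ask for
```

The slave vCPUs add up to the 376 CPUs quoted on the slide. The requested memory comes to about 1050 GB, which fits within the nominal 1200 GB of the 80 general purpose slaves, so memory rather than CPU is the resource that fills up first.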
![Page 47: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/47.jpg)
![Page 48: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/48.jpg)
![Page 49: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/49.jpg)
![Page 50: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/50.jpg)
![Page 51: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/51.jpg)
![Page 52: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/52.jpg)
TERRAFORM AWS
Instances
Keypairs
Security Groups
S3 buckets
ELB
VPCs
![Page 53: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/53.jpg)
AWS
Resource limits: VPCs, S3 snapshots, some instance sizes
Rate limits: affect the whole account
Retrying is your friend, but with exponential backoff
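A minimal sketch of retrying with exponential backoff. The function name, the default limits and the random jitter are assumptions added for illustration; the talk only says to retry with exponential backoff.

```python
import random
import time


def retry_with_backoff(call, attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter (an addition here) spreads out clients that
            # would otherwise all retry at the same moment.
            sleep(random.uniform(0, delay))
```

Because rate limits affect the whole account, spreading retries out matters more than retrying quickly. A real caller would catch only the throttling errors of its AWS client rather than every exception.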
![Page 54: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/54.jpg)
AWS
Running with a patched Terraform to overcome timeouts and AWS eventual consistency

    <?xml version="1.0" encoding="UTF-8"?>
    <DescribeVpcsResponse xmlns="http://ec2.amazonaws.com/doc/2015-10-01/">
      <requestId>8f855bob-3421-4cff-8c36-4b517eb0456c</requestId>
      <vpcSet>
        <item>
          <vpcId>vpc-30136159</vpcId>
          <state>available</state>
          <cidrBlock>10.16.0.0/16</cidrBlock>
          ...
    </DescribeVpcsResponse>
    2016/05/18 12:55:57 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DescribeVpcAttribute Details:
    --[ RESPONSE] ------------------------------------
    HTTP/1.1 400 Bad Request
    <Response><Errors><Error><Code>InvalidVpcID.NotFound</Code><Message>The vpc ID 'vpc-30136159' does not exist</Message></Error></Errors>
![Page 55: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/55.jpg)
TERRAFORM OPENSTACK
Instances
Keypairs
Security Groups
Load Balancer
Networks
![Page 56: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/56.jpg)
OPENSTACK
Custom flavors
Custom images
Different CLI commands
No two OpenStack installations are the same
![Page 57: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/57.jpg)
THE FUTURE
New framework using Netflix Fenzo
Runs under Marathon, exposes REST API that masters call
Affinity
Reduce number of frameworks
Faster to spawn new build agents because framework is not started
Pipeline durable builds, can survive a restart of the master
Dedicated workers for builds
![Page 58: CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos](https://reader034.vdocuments.net/reader034/viewer/2022042619/588303d31a28abe70d8b6059/html5/thumbnails/58.jpg)
THANKS
csanchez.org
csanchez
carlossg