Inside the Chef Push Jobs Service - ChefConf 2015

Upload: chef

Post on 11-Aug-2015

Chef Push in 2015
Mark Anderson, 2015-04-01

Mark Anderson
Engineer, Chef

The basics of Chef Push

Why Chef Push?

If you want to run a command on a set of nodes
• `knife ssh` can be problematic
  • Key distribution/revocation
  • Access control/User accounts
  • Difficult to audit
  • Extra work required if the node is behind a firewall
  • Doesn’t really scale very far past tens of nodes
• None of the alternative systems suited our needs

Why Chef Push?

• We wanted a remote execution system that
  • Is robust under network and client failure
  • Gates execution on a quorum being available
  • Provides presence information
  • Scales to hundreds if not thousands of nodes
  • Is integrated with the Chef authentication and authorization system
  • Works behind firewalls and NAT

Push jobs in a command line

• knife job start --quorum 90% 'chef-client' --search 'role:webapp'
  • Finds all nodes with role webapp
  • Submits a job to the push server
  • Checks quorum; 90% of the nodes listed must be available
  • Starts the job chef-client on available nodes
  • Gathers successes and failures
  • And will do this for ten nodes... or a thousand
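
The quorum gate described on this slide can be sketched in a few lines. This is only an illustration, not the push server's code: the function name `quorum_met` and the choice to round the threshold up are assumptions.

```python
import math


def quorum_met(requested_nodes, available_nodes, quorum_percent):
    """Return True if enough of the requested nodes are available to start the job."""
    requested = set(requested_nodes)
    if not requested:
        return False
    # Only nodes that were actually requested count toward the quorum.
    available = requested & set(available_nodes)
    # Assumption: the threshold rounds up, so 90% of 11 nodes means 10 nodes.
    needed = math.ceil(len(requested) * quorum_percent / 100)
    return len(available) >= needed


nodes = [f"web{i}" for i in range(10)]
print(quorum_met(nodes, nodes[:9], 90))  # 9 of 10 nodes available
print(quorum_met(nodes, nodes[:8], 90))  # 8 of 10 nodes available
```

With ten matching nodes and a 90% quorum, the job would start when nine nodes respond and be refused when only eight do.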

The lifecycle of a job

[Diagram: message flow between Server and Client, in order: Job Accepted, Send Command, Clients ACK, Wait for Quorum, Start Exec, Clients Exec, Collect Results]
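
One way to read the lifecycle is as a small server-side state machine. The sketch below is an assumption-laden illustration: the state names follow the slide, while `DONE`, `QUORUM_FAILED`, and the `advance` helper are hypothetical and not taken from the push server.

```python
from enum import Enum


class JobState(Enum):
    ACCEPTED = "job accepted"
    COMMAND_SENT = "send command"
    WAITING_FOR_QUORUM = "wait for quorum"
    EXECUTING = "start exec"
    COLLECTING = "collect results"
    DONE = "done"                    # hypothetical terminal state
    QUORUM_FAILED = "quorum failed"  # hypothetical terminal state


# Allowed transitions, in the order shown on the slide.
TRANSITIONS = {
    JobState.ACCEPTED: {JobState.COMMAND_SENT},
    JobState.COMMAND_SENT: {JobState.WAITING_FOR_QUORUM},
    JobState.WAITING_FOR_QUORUM: {JobState.EXECUTING, JobState.QUORUM_FAILED},
    JobState.EXECUTING: {JobState.COLLECTING},
    JobState.COLLECTING: {JobState.DONE},
}


def advance(current, target):
    """Move to the target state, or raise if the lifecycle does not allow it."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target


state = JobState.ACCEPTED
for nxt in (JobState.COMMAND_SENT, JobState.WAITING_FOR_QUORUM,
            JobState.EXECUTING, JobState.COLLECTING, JobState.DONE):
    state = advance(state, nxt)
print(state.name)
```

The point of the table is that execution can only be reached through the quorum wait, which is the gating behaviour the talk describes.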

Chef Push Server

• Erlang service
• Extends the Chef REST API
  • Job creation and tracking
  • Push client configuration
• Controls the clients via ZeroMQ
  • Heartbeats to track node availability
  • Command execution
  • All ZeroMQ packets are signed

Chef Push Client

• Simple Ruby client
  • Receives heartbeats from the server
  • Sends back heartbeats to the server
  • Executes commands
• Configuration requirements are minimal
  • The client initiates all connections to the server
  • Most configuration is via a Chef API call to the config endpoint
  • Using that info, the client opens ZeroMQ connections to the server
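
Heartbeat-based presence can be sketched as "a node is available if it was heard from recently". The class below is a hypothetical illustration, not the server's logic: `HeartbeatTracker`, the 10 second default interval, and the rule of tolerating three missed heartbeats are all assumptions. Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class HeartbeatTracker:
    """Track the last heartbeat per node and derive availability from it."""

    def __init__(self, interval_s=10.0, missed_allowed=3):
        self.interval_s = interval_s          # expected time between heartbeats
        self.missed_allowed = missed_allowed  # assumption: tolerate a few misses
        self.last_seen = {}

    def record(self, node, now):
        """Remember that a heartbeat from this node arrived at time `now`."""
        self.last_seen[node] = now

    def is_available(self, node, now):
        """A node is available if its last heartbeat is recent enough."""
        last = self.last_seen.get(node)
        if last is None:
            return False
        return now - last <= self.interval_s * self.missed_allowed


tracker = HeartbeatTracker()
tracker.record("web1", now=0.0)
print(tracker.is_available("web1", now=25.0))  # within 3 intervals
print(tracker.is_available("web1", now=31.0))  # more than 3 intervals missed
```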

Chef Push Networking

[Diagram: the Client talks to three server components (Message switch, Heartbeat generator, REST API) over HTTPS, PUB/SUB, and DEALER/ROUTER links]

Chef Push knife extension

• All control for push is via extensions to the Chef API
  • Node status
  • Job control
    • start
    • stop
    • status
  • Job listing

Access control

• Access rights controlled by groups
  • ‘push_job_writers’ group controls job creation and deletion
  • ‘push_job_readers’ group controls read access to job status and results
• Whitelist for commands
  • The client rejects commands that aren’t on the whitelist
• We’d like to do finer grained access control in the future
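
The client-side whitelist check amounts to a lookup that fails closed. The sketch below assumes the whitelist is a mapping from a job name to the command line to run; that shape, the `restart-app` entry, and the `resolve_command` helper are illustrative assumptions rather than the push client's actual code.

```python
# Assumption: the whitelist maps a job name to the command line it may run.
WHITELIST = {
    "chef-client": "chef-client",
    "restart-app": "service webapp restart",  # hypothetical entry
}


def resolve_command(job_command, whitelist=WHITELIST):
    """Return the command line for a whitelisted job, or refuse to run it."""
    try:
        return whitelist[job_command]
    except KeyError:
        raise PermissionError(f"command {job_command!r} is not whitelisted") from None


print(resolve_command("chef-client"))
try:
    resolve_command("rm -rf /")
except PermissionError as err:
    print(err)
```

Because the lookup is by exact name, anything not listed is rejected without the client ever interpreting the requested string.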

Status:

• Version 1.0 scales to 2k nodes
• Works with Chef 12
• Open source since Fall 2014
• We’ve been working on new features since last spring
  • But Chef 12 had to go out first
  • Required features from Enterprise Chef
  • Open sourcing Chef Push is pretty meaningless without an open source server

New Features in Chef Push 2.0

End to End Encryption

• Breaking change to the protocol
• End to end encryption of every packet
  • Required for us to implement parameter passing and output return features
• Built on the ZeroMQ4 implementation of CurveCP
• CurveCP provides a framework which is
  • Fast
  • Crypto hardened against modern attacks
  • Forward secrecy
• We still bootstrap the authentication using the Chef Client key

Command environment and config files

Enhanced control for the job execution environment
• A config file up to 100k
• Effective user
• Working directory
• Environment variables
  • User defined variables
  • Special variables for
    • job id
    • job file location
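
To make the execution-environment controls concrete, here is a rough sketch of running a command with a chosen working directory, user-defined variables, and two special variables. The variable names `PUSH_JOB_ID` and `PUSH_JOB_FILE` and the `run_job` helper are hypothetical, not the names the push client uses, and switching the effective user is left out because it needs elevated privileges.

```python
import os
import subprocess
import sys
import tempfile


def run_job(command, job_id, job_file, user_env=None, cwd=None):
    """Run a command with user-defined and job-specific environment variables."""
    env = dict(os.environ)
    env.update(user_env or {})
    # Hypothetical names for the special variables mentioned on the slide.
    env["PUSH_JOB_ID"] = job_id
    env["PUSH_JOB_FILE"] = job_file
    return subprocess.run(command, env=env, cwd=cwd,
                          capture_output=True, text=True, check=False)


with tempfile.TemporaryDirectory() as workdir:
    result = run_job(
        [sys.executable, "-c",
         "import os; print(os.environ['PUSH_JOB_ID'], os.environ['APP_ENV'])"],
        job_id="job-123",
        job_file=os.path.join(workdir, "job.cfg"),
        user_env={"APP_ENV": "staging"},
        cwd=workdir,
    )
    print(result.stdout.strip())
```

The child process here is just a Python one-liner that echoes two of the variables, which keeps the example portable.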

Command output capture

• New flag for job
  • capture_output: boolean
• Capture is all or nothing
  • All nodes in the job
  • Both stdout and stderr
• Stored on server with job description
• No streaming output … yet

Server Sent Event Feeds

Two event feeds
• Per org feed
  • Job start
  • Job completion summary
  • Runs forever
• Per job feed with fine grained execution data
  • Job voting start
  • Quorum votes by node
  • Job start
  • Completion state by node
  • Job completion
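
A consumer of such a feed needs to split the `text/event-stream` format into events. The parser below is a minimal sketch of the standard Server-Sent Events framing (`event:` and `data:` fields, blank line ends an event); the event names and JSON payloads in the sample are made up for illustration and are not the push server's actual feed contents.

```python
def parse_sse(lines):
    """Yield (event_name, data) tuples from an iterable of event-stream lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; skip events without data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keepalive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical sample feed; real event names and payloads may differ.
sample = [
    "event: job_start\n",
    'data: {"job": "abc123"}\n',
    "\n",
    ": keepalive\n",
    "event: job_complete\n",
    'data: {"job": "abc123", "status": "complete"}\n',
    "\n",
]
events = list(parse_sse(sample))
for name, payload in events:
    print(name, payload)
```

The same generator works on a streaming HTTP response read line by line, which suits a feed that "runs forever".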

Stable at 10k connected nodes

• Previously we’ve been advertising around 2k as the limit
• 10k connected nodes demonstrated
  • 10 sec heartbeats
  • c3.2xlarge chef server in standalone mode
  • Push server consumes 2 cores and about 2GB
• Up to 1k nodes in a single job
  • Around 1.5-2k nodes we start seeing some stampede problems
• Not done scaling; there are a few tweaks left to do

Demo some improvements

Current work: Scalability and Stability drive

• That test was done with real push clients
  • 20 m3.2xlarge nodes
  • Each running 500 Docker containers
• But we also do a lot of testing using a simulator
  • Understanding the limits of our current system
  • SystemTap is amazing for this kind of work

Scaling and Tuning

Axes of scaling tested
• Number of active clients
• Heartbeat rate for a client
• Number of clients in a single job

Below 10k clients there is a pretty linear trade between heartbeat rate and number of connected clients; heartbeats/sec was a useful metric

Must use care to avoid stampedes in job execution
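
The linear trade can be worked through with simple arithmetic. The sketch below counts only client heartbeats and uses the 10k clients and 10 second interval quoted in the talk; the helper names and the 5 second comparison are illustrative assumptions.

```python
def heartbeats_per_sec(clients, interval_s):
    """Client heartbeat load on the server for a given fleet size and interval."""
    return clients / interval_s


def max_clients(budget_per_sec, interval_s):
    """How many clients fit within a fixed heartbeats/sec budget."""
    return int(budget_per_sec * interval_s)


load = heartbeats_per_sec(10_000, 10)
print(load)                  # 1000.0 heartbeats/sec
print(max_clients(load, 5))  # 5000 clients at 5 sec heartbeats
```

Halving the heartbeat interval halves the number of clients that fit in the same heartbeats/sec budget, which is the linear relationship the slide describes.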

Lessons from scaling

• A port in ZeroMQ is bound to a single thread
• All communications go through a single ‘command switch’
  • Client heartbeats and all command messages go through the switch
  • The switch ended up being a bottleneck at around 2k messages/sec
• Experiment: multiple command switches
  • Exercises some weaknesses in the ZeroMQ - Erlang interface
  • Not as big of a win as hoped, ended up being more complex than we’d like

Remaining work for 2.0

Nearly feature complete but:
• Remaining work for new features
  • Knife push extensions for everything
  • Documentation
• Windows testing and stability
  • Committed to making Windows a first class citizen
• CentOS 7
• Polish around installation and cookbooks
• Upgrade tooling for 1.0->2.0
• Bug fixes
  • Please file bugs

Roadmap for 2.1 and beyond

Platform Support

• Currently we support
  • Ubuntu 10.04, 12.04, 14.04 LTS
  • CentOS 5, 6, and 7 soon
  • Windows (client only)
• Investigating client support for
  • AIX
  • Solaris

Features for 2.x releases

• Key rotation support
  • Multiple keys break some assumptions around how we auth in push
  • Needs fixes on Chef Server as well as Push
• Better access control
  • Controlling access on a node by node basis
  • Examining persistent jobs as a first class object with their own ACLs - look for the RFC

Features for 2.x releases

• Integration into Chef Client package
  • Delayed joining the two because of the protocol breaking changes in 2.0
  • Future server versions will be backward compatible.

Features for 2.x releases

Scaling
• Rate limited job execution
  • Prevent stampede effect
  • Protects both push and chef server
  • Starting 1k chef client runs at once is a bad idea anyways
  • Per-job and server global limits
• Multiple socket command switch
  • Biggest scaling bottleneck
  • Infrastructure for distributed server
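
Rate-limited execution could be as simple as spreading start times so that only a bounded number of nodes begin each second. The sketch below is a guess at the idea, not the planned design: `start_schedule`, the per-second granularity, and taking the stricter of the per-job and server-wide limits are all assumptions.

```python
def start_schedule(nodes, job_limit_per_sec, server_limit_per_sec):
    """Return (offset_seconds, node) pairs that respect both rate limits."""
    # Assumption: the stricter of the per-job and server-global limits applies.
    per_second = min(job_limit_per_sec, server_limit_per_sec)
    if per_second <= 0:
        raise ValueError("rate limit must be positive")
    return [(index // per_second, node) for index, node in enumerate(nodes)]


nodes = [f"web{i}" for i in range(1000)]
schedule = start_schedule(nodes, job_limit_per_sec=100, server_limit_per_sec=50)
print(schedule[0], schedule[-1])
```

With these example limits, 1k chef-client runs are spread over 20 seconds instead of all starting at once.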

Future major releases - 3.x and beyond

• Move push connections to front ends in tiered Chef
  • Push will be running on all of the front end nodes
  • We expect this to improve scaling
• Better HA support
  • Move to a true active-active model on BE
• Scaling
  • Our goal is to scale with Chef server

Future major releases - 3.x and beyond

Protocol changes required
• Complex networks difficult; proxies are hard
• ZeroMQ was helpful at first, but hitting limitations
  • Stability problems at scale
  • Erlang doesn’t need a lot of what ZeroMQ brings
• Backward compatibility will be a priority

• Office hours
  • Currently Monday and Wednesday 12:00PST
• chef-push is the master repository
  • github.com/chef/chef-push
  • File issues here
  • Specific issues and PRs are fine to file against the individual repos
  • Pull requests always welcome
• RFCs for major new features