DockerCon14: Cluster Management and Containerization

Benjamin Hindman, @benh, Twitter, Inc.

Upload: docker-inc

Post on 14-Jun-2015

TRANSCRIPT

Page 1: DockerCon14 Cluster Management and Containerization

Cluster Management and Containerization

Benjamin Hindman, @benh

Twitter, Inc.

Page 2: DockerCon14 Cluster Management and Containerization

$  whoami  

2006 - 2011   2009 -   2010 -

Page 3: DockerCon14 Cluster Management and Containerization

cluster management

Page 4: DockerCon14 Cluster Management and Containerization

cluster management (server/IT automation)

Page 5: DockerCon14 Cluster Management and Containerization

cluster  management  

① configuration/package  management  

② deployment  

③ naming  

Page 6: DockerCon14 Cluster Management and Containerization

cluster  management  

① configuration/package  management  

② deployment  

③ naming  

④ monitoring  

Page 7: DockerCon14 Cluster Management and Containerization

cluster  management  

① configuration/package  management  

② deployment  

③ naming  

④ monitoring  

ops  

Page 8: DockerCon14 Cluster Management and Containerization

cluster  management  

① configuration/package  management  

② deployment  

③ naming  

④ monitoring  

ops  developers  

Page 9: DockerCon14 Cluster Management and Containerization

cluster  management  

configuration/package  management  

naming  deployment  

Page 10: DockerCon14 Cluster Management and Containerization

configuration/package  management  

“what/how  do  things  get  installed?”  

(10’s  of  machines)  

hosts.txt  

web1.twttr.com  web2.twttr.com  web3.twttr.com  web4.twttr.com  

$ ssh host './configure && make install'

Page 11: DockerCon14 Cluster Management and Containerization

configuration/package  management  

“what/how  do  things  get  installed?”  

(10’s  of  machines)  

hosts.txt  

web1.twttr.com  web2.twttr.com  web3.twttr.com  web4.twttr.com  

$ ssh host rpm -ivh pkg-x.y.z.rpm
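The per-host loop that hosts.txt implies can be sketched roughly as follows. This is only an illustration: the helper name `build_ssh_commands` is made up here, the host names and package file name are the slide's placeholders, and the commands are printed rather than executed.

```python
import shlex


def build_ssh_commands(hosts, remote_command):
    """Return one ssh argv list per host; nothing is executed here."""
    return [["ssh", host, remote_command] for host in hosts]


# Host names as listed in the slide's hosts.txt.
hosts = ["web1.twttr.com", "web2.twttr.com", "web3.twttr.com", "web4.twttr.com"]

# The package file name is the slide's placeholder, not a real package.
commands = build_ssh_commands(hosts, "rpm -ivh pkg-x.y.z.rpm")

for argv in commands:
    # shlex.join quotes the remote command so it stays a single argument.
    print(shlex.join(argv))
```

Passing the remote command as one quoted argument keeps the whole command on the remote side, which matters once it contains `&&` or other shell operators.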

Page 12: DockerCon14 Cluster Management and Containerization

deployment  

“what  should  run  where?”  “how  should  it  be  started/stopped?”  

$ ssh host nohup myapp

(10’s of machines)

hosts.txt  

web1.twttr.com  web2.twttr.com  web3.twttr.com  web4.twttr.com  

Page 13: DockerCon14 Cluster Management and Containerization

deployment  

“what  should  run  where?”  “how  should  it  be  started/stopped?”  

$ ssh host monit start myapp

(10’s of machines)

hosts.txt  

web1.twttr.com  web2.twttr.com  web3.twttr.com  web4.twttr.com  

Page 14: DockerCon14 Cluster Management and Containerization

deployment  

“what  should  run  where?”  “how  should  it  be  started/stopped?”  

$ scp myapp host:
$ ssh host monit myapp

(10’s  of  machines)  

hosts.txt  

web1.twttr.com  web2.twttr.com  web3.twttr.com  web4.twttr.com  

Page 15: DockerCon14 Cluster Management and Containerization

deployment  

“what  should  run  where?”  “how  should  it  be  started/stopped?”  

(10’s  of  machines)  

$ ssh host 'git pull && monit myapp'

hosts.txt  

web1.twttr.com  web2.twttr.com  web3.twttr.com  web4.twttr.com  

Page 16: DockerCon14 Cluster Management and Containerization

“how  should  apps  find  each  other?”  

naming  

webhosts.txt  

web1.twttr.com  web2.twttr.com  web3.twttr.com  web4.twttr.com  

dbhosts.txt  

db1.twttr.com  db2.twttr.com  db3.twttr.com  db4.twttr.com  

(10’s  of  machines)  

Page 17: DockerCon14 Cluster Management and Containerization

(10’s  of  machines)  

(100’s -> 1000’s of machines)

to  scale,  need  more  automation  

Page 18: DockerCon14 Cluster Management and Containerization

Twitter,  circa  2010  

webhosts.txt  dbhosts.txt  

$  ssh  host  …  

(configuration/package  management)  

(deployment)  

Page 19: DockerCon14 Cluster Management and Containerization

MySQL   Cassandra   Rails   Hadoop   memcached  

Page 20: DockerCon14 Cluster Management and Containerization

challenges

Page 21: DockerCon14 Cluster Management and Containerization

challenges ①  failures  

Page 22: DockerCon14 Cluster Management and Containerization

failures  

Page 23: DockerCon14 Cluster Management and Containerization

MySQL   Cassandra   Rails   Hadoop   memcached  

Page 25: DockerCon14 Cluster Management and Containerization

types  of  failure:  fault  domains  

machine  (disk,  memory,  CPU,  etc)  

rack  (switch,  PDU)  

datacenter  
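One way to plan around these fault domains is to spread the replicas of a service across racks, so a single switch or PDU failure cannot take out all of them. The sketch below is a hypothetical illustration; the machine and rack names and the function `spread_across_racks` are made up for the example.

```python
from collections import defaultdict
from itertools import zip_longest


def spread_across_racks(machines, replicas):
    """Pick `replicas` machines, visiting racks round-robin.

    `machines` maps machine name -> rack name.
    """
    by_rack = defaultdict(list)
    for machine, rack in sorted(machines.items()):
        by_rack[rack].append(machine)
    # Interleave racks: first machine of each rack, then the second of each, ...
    interleaved = [
        machine
        for group in zip_longest(*by_rack.values())
        for machine in group
        if machine is not None
    ]
    return interleaved[:replicas]


machines = {
    "m1": "rack-a", "m2": "rack-a",
    "m3": "rack-b", "m4": "rack-b",
    "m5": "rack-c",
}
placement = spread_across_racks(machines, 3)
print(placement)
```

The same idea extends one level up by treating datacenters as the domain instead of racks.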

Page 26: DockerCon14 Cluster Management and Containerization

challenges ②   maintenance  

(aka  “planned  failures”)    

Page 27: DockerCon14 Cluster Management and Containerization

maintenance

① upgrading software (i.e., installing and uninstalling packages)

ops  developers  

Page 28: DockerCon14 Cluster Management and Containerization

maintenance

① upgrading software (i.e., installing and uninstalling packages)

② replacing machines, switches, PDUs, etc.

Page 29: DockerCon14 Cluster Management and Containerization

MySQL   Cassandra   Rails   Hadoop   memcached  

Page 31: DockerCon14 Cluster Management and Containerization

challenges ③   utilization  

Page 32: DockerCon14 Cluster Management and Containerization

Rails  

Hadoop  

memcached  

utilization  

Page 33: DockerCon14 Cluster Management and Containerization

utilization  

Rails  

Hadoop  

memcached  

Page 34: DockerCon14 Cluster Management and Containerization

utilization  

Rails  

Hadoop  

memcached

buy fewer machines, or run more applications!

Page 35: DockerCon14 Cluster Management and Containerization

challenges ①  failures  

② maintenance  

③ utilization  

Page 36: DockerCon14 Cluster Management and Containerization

challenges ①  failures  

② maintenance  

③ utilization  

planning  for  failure?  

Page 37: DockerCon14 Cluster Management and Containerization

planning  for  failure  

Page 38: DockerCon14 Cluster Management and Containerization

challenges ①  failures  

② maintenance  

③ utilization  

planning  for  utilization?  

Page 39: DockerCon14 Cluster Management and Containerization

planning for utilization

intra-machine resource sharing: share a single machine’s resources between multiple applications (multi-tenancy)

intra-datacenter resource sharing: share multiple machines’ resources between multiple applications

Page 40: DockerCon14 Cluster Management and Containerization

Twitter,  circa  2010  

what  software  can  help  me!?  

Page 41: DockerCon14 Cluster Management and Containerization

cluster  management  

industry  academia  

Page 42: DockerCon14 Cluster Management and Containerization

different software

academia:
• MPI (Message Passing Interface)

industry:
• Apache (mod_perl, mod_php)
• web services (Java, Ruby, …)

Page 43: DockerCon14 Cluster Management and Containerization

different scale (at first)

academia:
• 1,000’s of machines

industry:
• 10’s of machines

Page 44: DockerCon14 Cluster Management and Containerization

cluster management

academia:
• PBS (Portable Batch System)
• TORQUE
• SGE (Sun Grid Engine)

industry:
• ssh
• Puppet/Chef
• Capistrano/Ansible

cluster managers

Page 45: DockerCon14 Cluster Management and Containerization

a cluster manager provides a level of indirection between hardware resources (machines) and applications/jobs
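A toy sketch of that indirection: applications state what they need and the manager picks the machine, so no caller ever names a host. The `Machine` and `ClusterManager` classes and the first-fit policy are assumptions made for this example, not how any particular cluster manager is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    free_cpus: float
    free_mem_mb: int
    tasks: list = field(default_factory=list)


class ClusterManager:
    """Applications ask for resources; the manager chooses the machine."""

    def __init__(self, machines):
        self.machines = machines

    def run(self, task, cpus, mem_mb):
        # First fit: take the first machine with enough free resources.
        for machine in self.machines:
            if machine.free_cpus >= cpus and machine.free_mem_mb >= mem_mb:
                machine.free_cpus -= cpus
                machine.free_mem_mb -= mem_mb
                machine.tasks.append(task)
                return machine.name
        return None  # no machine has enough free resources


manager = ClusterManager([Machine("m1", 4, 8192), Machine("m2", 8, 16384)])
print(manager.run("rails", cpus=2, mem_mb=4096))
print(manager.run("hadoop", cpus=4, mem_mb=8192))
print(manager.run("memcached", cpus=2, mem_mb=4096))
```

Because placement is decided per request, two applications can end up sharing one machine instead of each owning a static partition.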

Page 46: DockerCon14 Cluster Management and Containerization

MySQL   Cassandra   Rails   Hadoop   memcached  

Page 47: DockerCon14 Cluster Management and Containerization

cluster  manager  

Rails   Hadoop   memcached   …  

…  

Page 48: DockerCon14 Cluster Management and Containerization

cluster management

academia:
• PBS (Portable Batch System)
• TORQUE
• SGE (Sun Grid Engine)

industry:
• ssh
• Puppet/Chef
• Capistrano/Ansible

batch computation!

Page 49: DockerCon14 Cluster Management and Containerization

Mesos  is  a  modern  general  purpose  cluster  manager  (i.e.,  not  just  focused  on  batch  scheduling)  

Page 50: DockerCon14 Cluster Management and Containerization

Mesos  

service   batch   storage   …  

…  

streaming  

support  many  different  types  of  computation/scheduling  

Page 51: DockerCon14 Cluster Management and Containerization

Mesos  

service   batch   storage   …  streaming  

(1)  coordinate  for  resources  

Page 52: DockerCon14 Cluster Management and Containerization

Mesos  

service   batch   storage   …  streaming  

(2)  launch  tasks  

Page 53: DockerCon14 Cluster Management and Containerization

Mesos  

service   batch   storage   …  streaming  

(3)  launch  tasks  

Page 54: DockerCon14 Cluster Management and Containerization

Mesos  

service   batch   storage   …  streaming  

Page 56: DockerCon14 Cluster Management and Containerization

Mesos  

service   batch   storage   …  streaming  

(4)  task  termination  

Page 57: DockerCon14 Cluster Management and Containerization

Mesos  

service   batch   storage   …  streaming  

(5)  task  status  update  
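The numbered steps on these slides can be sketched as a toy offer-and-launch cycle. Everything here is a simplification for illustration: `ToyMaster`, its method names, and the state strings are invented and are not the Mesos API.

```python
class ToyMaster:
    """Toy stand-in for the cluster manager; not the Mesos API."""

    def __init__(self, free_cpus):
        self.free_cpus = free_cpus
        self.running = {}   # task name -> cpus in use
        self.updates = []   # (task name, state) pairs sent to the scheduler

    def offer(self):
        # Coordinate for resources: advertise what is currently free.
        return self.free_cpus

    def launch(self, task, cpus):
        # Launch a task against what was offered.
        if cpus > self.free_cpus:
            raise ValueError("task needs more than was offered")
        self.free_cpus -= cpus
        self.running[task] = cpus
        self.updates.append((task, "RUNNING"))

    def terminate(self, task):
        # Task termination frees the task's resources ...
        self.free_cpus += self.running.pop(task)
        # ... and a status update tells the scheduler what happened.
        self.updates.append((task, "FINISHED"))


master = ToyMaster(free_cpus=8)
offered = master.offer()
master.launch("batch-job", cpus=min(4, offered))
master.terminate("batch-job")
print(master.updates, master.offer())
```

Once the task terminates, its CPUs show up in the next offer and can go to a different scheduler.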

Page 58: DockerCon14 Cluster Management and Containerization

challenges revisited ①  failures  

② maintenance  

③ utilization  

Page 60: DockerCon14 Cluster Management and Containerization

Mesos  

service   batch   storage   …  streaming  

Page 63: DockerCon14 Cluster Management and Containerization

Mesos  

service   batch   storage   …  streaming  

(5)  task  status  update  

Page 64: DockerCon14 Cluster Management and Containerization

challenges revisited ①  failures  

② maintenance  

③ utilization  

Page 65: DockerCon14 Cluster Management and Containerization

Mesos  

service   batch   storage   …  streaming  

(1)  when  resources  become  idle,  can  be  scheduled  and  reused  by  other  schedulers  

Page 70: DockerCon14 Cluster Management and Containerization

Mesos  

service   batch   storage   …  streaming  

(1)  when  resources  become  idle,  can  be  scheduled  and  reused  by  other  schedulers  

(2)  multi-­‐tenancy  on  individual  machines    

Page 72: DockerCon14 Cluster Management and Containerization

multi-­‐tenancy  

task!

task!

containers  

task!

Page 73: DockerCon14 Cluster Management and Containerization

containerization

started leveraging containerization technology in ~2011

2011  

LXC  

2012  

cgroups  

2013  

Docker  (preliminary)  

2014  
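As a rough illustration of what cgroups-based isolation involves, a task's resource request can be translated into values for the cgroup v1 control files `cpu.shares` and `memory.limit_in_bytes`. The mapping of 1024 shares per CPU is an assumed convention for this sketch, and the function name `cgroup_settings` is made up; nothing is written to the cgroup filesystem.

```python
def cgroup_settings(cpus, mem_mb):
    """Translate a task's resource request into cgroup v1 control values.

    Assumes a convention of 1024 cpu.shares per CPU.
    """
    return {
        "cpu.shares": int(cpus * 1024),
        "memory.limit_in_bytes": mem_mb * 1024 * 1024,
    }


settings = cgroup_settings(cpus=0.5, mem_mb=256)
for control_file, value in settings.items():
    # A real isolator would write each value under the task's cgroup directory.
    print(control_file, value)
```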

Page 74: DockerCon14 Cluster Management and Containerization

how  Mesos  has  changed  cluster  management  at  Twitter  today  

Page 75: DockerCon14 Cluster Management and Containerization

configuration/package  management  

developers  

(1)  bundle  services  as  jar,  tar/gzip  

(2)  upload  to  HDFS  

Page 76: DockerCon14 Cluster Management and Containerization

configuration/package  management  (planning)  

developers  

(1)  bundle  services  as  jar,  tar/gzip,  and  using  Docker    

(2)  upload  to  HDFS  (or  use  registry)  

Page 77: DockerCon14 Cluster Management and Containerization

deployment  

Apache  Aurora  (incubating),  a  scheduler  for  running  stateless  services  written  in  any  language  (but  primarily  used  at  Twitter  for  JVM  services)  

Page 78: DockerCon14 Cluster Management and Containerization

deployment  (via  Aurora)  

developers  

(1)  describe  service  using  Python  based  DSL  

(2)  submit  service  to  Aurora  using  CLI  

Page 79: DockerCon14 Cluster Management and Containerization

deployment  (via  Marathon)  

developers  

(1)  describe  services  using  JSON  

(2)  submit  service  to  Marathon  via  REST  
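A minimal sketch of those two steps, assuming Marathon's `/v2/apps` endpoint and a hypothetical Marathon address. The app name, command, and sizes are placeholders, and the request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical Marathon address; replace with a real one.
MARATHON_URL = "http://marathon.example.com:8080"

# (1) describe the service using JSON (placeholder values).
app = {
    "id": "myapp",
    "cmd": "./myapp",
    "cpus": 0.5,
    "mem": 256,
    "instances": 4,
}

# (2) submit the service to Marathon via REST.
request = urllib.request.Request(
    MARATHON_URL + "/v2/apps",
    data=json.dumps(app).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending is left commented out so the sketch runs without a cluster:
# urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```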

Page 80: DockerCon14 Cluster Management and Containerization

naming  

Apache  ZooKeeper  

using  Apache  ZooKeeper  and  server  sets  (github.com/twitter/commons)  

Page 81: DockerCon14 Cluster Management and Containerization

naming  

(1)  task  gets  launched  on  machine  

Apache  ZooKeeper  

using  Apache  ZooKeeper  and  server  sets  (github.com/twitter/commons)  

Page 82: DockerCon14 Cluster Management and Containerization

naming  

(2)  service  gets  registered  in  a  server  set  in  ZooKeeper  

(1)  task  gets  launched  on  machine  

Apache  ZooKeeper  

using  Apache  ZooKeeper  and  server  sets  (github.com/twitter/commons)  

Page 83: DockerCon14 Cluster Management and Containerization

naming  

(2)  service  gets  registered  in  a  server  set  in  ZooKeeper  

(1)  task  gets  launched  on  machine  

(3)  other  services  use  ZooKeeper  to  find  services  they  need  

Apache  ZooKeeper  

using  Apache  ZooKeeper  and  server  sets  (github.com/twitter/commons)  

Page 84: DockerCon14 Cluster Management and Containerization

naming  

(2)  service  gets  registered  in  a  server  set  in  ZooKeeper  

(1)  task  gets  launched  on  machine  

(3)  other  services  use  ZooKeeper  to  find  services  they  need  

(4)  services  connect  directly  with  one  another  

Apache  ZooKeeper  

using  Apache  ZooKeeper  and  server  sets  (github.com/twitter/commons)  
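The register-then-lookup flow above can be sketched with an in-memory stand-in. `ToyServerSet` is hypothetical and keeps its data in a dict; a real server set stores members in ZooKeeper, and the port number here is only an example.

```python
import random


class ToyServerSet:
    """In-memory stand-in for a server set; a real one lives in ZooKeeper."""

    def __init__(self):
        self.members = {}  # service name -> set of (host, port)

    def register(self, service, host, port):
        # (2) a launched task registers where it is listening.
        self.members.setdefault(service, set()).add((host, port))

    def unregister(self, service, host, port):
        # Called when a task goes away, so clients stop seeing it.
        self.members.get(service, set()).discard((host, port))

    def lookup(self, service):
        # (3) clients ask the server set where a service currently runs.
        return sorted(self.members.get(service, set()))


server_set = ToyServerSet()
server_set.register("db", "db1.twttr.com", 3306)
server_set.register("db", "db2.twttr.com", 3306)

# (4) the client connects directly to one of the returned endpoints.
host, port = random.choice(server_set.lookup("db"))
print(host, port)
```

Compared with a static dbhosts.txt, the member list changes as tasks are launched and removed, so clients look it up instead of hard-coding it.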

Page 85: DockerCon14 Cluster Management and Containerization

naming  alternative  

(2)  update  HAProxy  with  new  service  location  

(1)  task  gets  launched  on  machine  

(3)  other  services  send  traffic  through  HAProxy  

ZooKeeper/server sets require injecting code into your clients!
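Step (2) of the HAProxy alternative could look roughly like this: regenerate a backend section from the current task locations. The function name, server names, and ports are assumptions for the sketch, and reloading HAProxy with the new configuration is not shown.

```python
def haproxy_backend(service, endpoints):
    """Render an HAProxy backend section for the tasks of one service."""
    lines = [f"backend {service}"]
    for i, (host, port) in enumerate(endpoints, start=1):
        # "check" turns on HAProxy health checks for that server.
        lines.append(f"    server {service}{i} {host}:{port} check")
    return "\n".join(lines)


config = haproxy_backend(
    "web", [("web1.twttr.com", 8080), ("web2.twttr.com", 8080)]
)
print(config)
```

Clients only need the proxy's address, which is why this approach avoids adding discovery code to them.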

Page 86: DockerCon14 Cluster Management and Containerization

where  are  we  today?  

ops  developers  

Page 87: DockerCon14 Cluster Management and Containerization

where  are  we  today?  

ops  developers  

deploys  decoupled  from  ops  (many  deploys  per  day,  per  service)  

maintenance  consists  of  “draining”  hosts,  getting  tasks  rescheduled,  then  pulling  the  cord  
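Draining a host can be sketched as moving its tasks elsewhere before the machine is taken out of service. This is a hypothetical illustration: the `drain` function, host names, and task names are made up, and a real scheduler would check free resources rather than rotate blindly.

```python
def drain(host, placement, spare_hosts):
    """Move every task off `host` so it can be taken out of service.

    `placement` maps host -> list of task names and is updated in place.
    """
    tasks = placement.pop(host, [])
    for i, task in enumerate(tasks):
        # Round-robin over the remaining hosts; a real scheduler would
        # also check free resources before rescheduling.
        target = spare_hosts[i % len(spare_hosts)]
        placement.setdefault(target, []).append(task)
    return tasks


placement = {"m1": ["rails-0", "rails-1"], "m2": ["memcached-0"], "m3": []}
moved = drain("m1", placement, spare_hosts=["m2", "m3"])
print(moved, placement)
```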

Page 88: DockerCon14 Cluster Management and Containerization

wait … don’t virtual machines solve my cluster management challenges?

Page 89: DockerCon14 Cluster Management and Containerization

wait … don’t virtual machines solve my cluster management challenges?

No.  

VMs  are  neither  sufficient  nor  necessary!  

Page 90: DockerCon14 Cluster Management and Containerization

big  computers  

small applications

Page 91: DockerCon14 Cluster Management and Containerization

IaaS  public private

Page 92: DockerCon14 Cluster Management and Containerization
Page 93: DockerCon14 Cluster Management and Containerization
Page 94: DockerCon14 Cluster Management and Containerization

challenges revisited ①  failures  

② maintenance  

③ utilization  

Page 95: DockerCon14 Cluster Management and Containerization

challenges revisited ①  failures  

② maintenance  

③ utilization  

with public or private IaaS, failures still occur (on EC2, availability zones take the place of racks and regions take the place of datacenters)

Page 96: DockerCon14 Cluster Management and Containerization

challenges revisited ①  failures  

② maintenance  

③ utilization

the provider wins with public IaaS, and private IaaS gives better resource sharing, but a static partition of VMs is still a static partition!

Page 97: DockerCon14 Cluster Management and Containerization

MySQL   Cassandra   Rails   Hadoop   memcached  

Page 98: DockerCon14 Cluster Management and Containerization

cluster  manager  

Rails   Hadoop   memcached   …  

…  

Page 99: DockerCon14 Cluster Management and Containerization

Mesos:  level  of  abstraction  

Mesos

build and run using resources

Page 100: DockerCon14 Cluster Management and Containerization

Mesos:  level  of  abstraction  

IaaS  

Mesos  

provision  and  manage  machines  

build  and  run  using  resources  

Page 101: DockerCon14 Cluster Management and Containerization

Mesos  on  IaaS  

IaaS  

Mesos  

use  OpenStack  or  EC2  to  run  Mesos  

Page 102: DockerCon14 Cluster Management and Containerization

Mesos  on  IaaS/hardware  

IaaS  

Mesos  

hardware

use OpenStack or EC2 or physical machines to run Mesos

Page 103: DockerCon14 Cluster Management and Containerization

physical machines virtual machines

aggregation  not  virtualization  

Page 104: DockerCon14 Cluster Management and Containerization

physical machines datacenter computer

aggregation  not  virtualization  

Page 105: DockerCon14 Cluster Management and Containerization

small  computers  

?

big applications

Page 106: DockerCon14 Cluster Management and Containerization

"small" computers?

Page 107: DockerCon14 Cluster Management and Containerization

power wall

time

complex single core

simple many core

Page 108: DockerCon14 Cluster Management and Containerization

"big" applications?

Page 109: DockerCon14 Cluster Management and Containerization

applications  don’t  fit  on  a  single  computer  anymore  

Page 110: DockerCon14 Cluster Management and Containerization
Page 111: DockerCon14 Cluster Management and Containerization

"BIG"  

Page 112: DockerCon14 Cluster Management and Containerization
Page 113: DockerCon14 Cluster Management and Containerization

(1) lots of data …
(2) lots of users …

growing every day

Page 114: DockerCon14 Cluster Management and Containerization

these  applications  need  lots  of  resources  (CPUs,  memory,  I/O)  

Page 115: DockerCon14 Cluster Management and Containerization

these  applications  need  datacenters  

Page 116: DockerCon14 Cluster Management and Containerization

the  datacenter  is  the  new  computer  

Page 117: DockerCon14 Cluster Management and Containerization

desktop computer

server datacenter

OS  

OS  

OS  

the  datacenter  computer  needs  an  OS  

Page 118: DockerCon14 Cluster Management and Containerization

operating  system  

“a  collection  of  software  that  manages  the  computer  hardware  resources  and  provides  common  services  for  computer  programs”  

- Wikipedia

Page 119: DockerCon14 Cluster Management and Containerization

datacenter  operating  system  

“a  collection  of  software  that  manages  the  datacenter  computer  hardware  resources  and  provides  common  services  for  computer  programs”  

- Wikipedia

Page 121: DockerCon14 Cluster Management and Containerization

today

Your App

API

tomorrow

datacenter OS provides common functionality every new distributed system re-implements:

• failure detection
• package distribution
• task starting
• resource isolation
• resource monitoring
• task killing, cleanup
• …

Page 122: DockerCon14 Cluster Management and Containerization

today

Your App

API

tomorrow

provides common functionality every new distributed system re-implements:

• failure detection
• package distribution
• task starting
• resource isolation
• resource monitoring
• task killing, cleanup
• …

datacenter  OS  

Don’t  reinvent  the  wheel!  
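Failure detection is a typical example of functionality that each distributed system tends to rebuild. The sketch below shows one simple, assumed approach based on heartbeat timeouts; the class name `HeartbeatDetector` and the timeout value are made up, and time is passed in explicitly so the example is deterministic.

```python
class HeartbeatDetector:
    """Mark a machine as failed when its last heartbeat is too old."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # machine name -> time of last heartbeat

    def heartbeat(self, machine, now):
        self.last_seen[machine] = now

    def failed(self, now):
        # A machine counts as failed once it has been silent past the timeout.
        return sorted(
            machine
            for machine, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


detector = HeartbeatDetector(timeout_s=10)
detector.heartbeat("m1", now=0)
detector.heartbeat("m2", now=0)
detector.heartbeat("m1", now=8)
print(detector.failed(now=15))
```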

Page 123: DockerCon14 Cluster Management and Containerization

case  studies  

distributed  in-­‐memory  analytics  framework  

distributed  cron  scheduler  (with  dependencies)  

github.com/apache/spark  

github.com/airbnb/chronos  

Page 124: DockerCon14 Cluster Management and Containerization

~Presentation()  

Page 125: DockerCon14 Cluster Management and Containerization

cluster  management  w/  Docker  +  Mesos  

①  configuration/package  management  

②  deployment  

③  naming  

Page 126: DockerCon14 Cluster Management and Containerization

datacenter  OS  

IaaS  

Mesos  

hardware  

provide  common  functionality  via  an  API  (kernel)  

your  distributed  system  

Page 127: DockerCon14 Cluster Management and Containerization

Mesos  0.19.0  released  today!  

mesos.apache.org  

mesos.apache.org/blog  

@ApacheMesos