applica’on*modeling*and*hardware* par’’oning*mechanisms...

25
Applica’on Modeling and Hardware Par’’oning Mechanisms for Resource Management Sarah Bird, Henry Cook, Krste Asanovic, John Kubiatowicz, Dave PaFerson ParLab Arch and OS Groups

Upload: others

Post on 25-Jul-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Applica'on  Modeling  and  Hardware  Par''oning  Mechanisms  for  

Resource  Management  Sarah  Bird,  Henry  Cook,  Krste  Asanovic,  John  Kubiatowicz,  

Dave  PaFerson  

ParLab  Arch  and  OS  Groups  

Page 2: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Resource  Alloca'on  Objec'ves  

•  Each  par''on  receives  a  vector  of  basic  resources  dedicated  to  it  –  Some  number  of  processing  elements  (e.g.,  cores)  

–  A  por'on  of  physical  memory    

–  A  por'on  of  shared  cache  memory    

–  A  frac'on  of  memory  bandwidth  

•  Allocate  minimum  resources  necessary  for  each  applica'ons  QoS  requirements  

Cell1  

Cell2  

Cell3  

Resource    Sub-­‐Group  

Resource  Group  

Cell3  

•  Allocate  remaining  resources  to  meet  some  system-­‐level  objec've  

–  Best  performance  

–  Lowest  Energy  •  Doesn’t  require  applica'on  developers  to  worry  about  low-­‐level  resources  

Page 3: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Tessellation K

ernel (Trusted)

System-­‐wide  Adapta'on  Loop  

MicrosoV/UPCRC  

Policy Service

STRG Validator & Resource Planner

Partition Mapping and Multiplexing

Layer

Resource Plan Executor

Partition Mechanism

Layer QoS

Enforcement Partition

Implementation Channel

Authenticator

Partitionable Hardware Resources

Cores Physical Memory

Memory Bandwidth

Cache/ Local Store Disks NICs

Major Change Request

ACK / NACK

Cell-Creation & Resizing Requests from Users Admission

Control

Minor Changes

Cell   Cell   Cell  

All  system  resources  

Cell  group  with  frac'on  of  resources  

Cell  

Space-­‐Time  Resource  Graph    

(STRG)  

(Current Resources)

Global Policies / User Policies and Preferences

Resource-Allocation and Adaptation

Mechanism

Offline Models and Behavioral

Parameters

Online Performance Monitoring,

Model Building, and Prediction

Performance Counters

ACK / NACK

Page 4: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

APPLICATION  MODELING  Techniques  and  Tradeoffs  

Page 5: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Mo'va'on  

• Programmers  are  unlikely  to  know  exactly  how  low-­‐level  resources  effect  performance  

– Developers  are  concerned  applica'on-­‐level  metrics    •  e.g.,  frames/sec,  requests/sec  

– Opera'ng  system  has  to  make  decisions  about  resource  quali'es  

•  e.g.,  number  of  cores,  cache  slices,  memory  bandwidth  

• Automa'cally  construc'ng  performance  models  is  a  good  way  to  bridge  the  gap  between  applica'on-­‐level  metrics  and  hardware  resources  

Page 6: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Model  Building  

•  Collect  data  points  of  performance  for  specific  resource  alloca'on  vectors  

 Li  =  PMi(r(0,i),  r(1,i),  …,  r(n-­‐1,i))  •  Use  mul'variate  regression  techniques  to  fit  a  model  to  the  data  points  

Linear  and  Quadra'c   GPRS   KCCA  

• Advantages  • Very  Accurate  

• Disadvantages  • Can  overfit  the  data  • Computa'onally  expensive  to  build  • Doesn’t  work  with  simple  op'mizers  

• Advantages  • Can  represent  all  output  metrics  in  one  model  • Successful  in  the  past  

• Disadvantages  • Doesn’t  work  with  simple  op'mizers  

• Advantages  • Simple  to  build  • Work  well  with  simple  op'mizers  

• Disadvantages  • Poten'ally  inaccurate  • Can’t  represent  variable  interac'on  

Page 7: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Model  Accuracy  

Page 8: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Model  Crea'on  Online  vs.  Offline  Training  

• Offline  profiling  op'ons  – Profile  applica'ons  in  advance  

•  Distribute  with  applica'on  •  iTunes  App  Store  or  Android  Market  

– Create  applica'on  profiles  in  the  Cloud  •  Record  performance  and  resource  sta's'cs  from  users  

•  MSR  is  currently  doing  this  to  make  perf.  models  for  app  developers  

• Online  profiling  op'ons  – Install  'me  profiling  

•  Opera'ng  system  tests  out  a  variety  of  configura'ons  

– Online  refinement  of  models  •  Opera'ng  system  starts  with  a    generic  model  

•  Retrains  the  model  with  new  informa'on  as  the  applica'on  runs  

Page 9: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Performance  Isola'on  • Without  performance  isola'on,  

–   An  applica'ons  performance  could  vary  widely  as  a  result  of  concurrently  running  applica'ons  

– Inaccurate  models  – Requires  different  models  of  the  applica'on  based  on  the  system  load  

• Performance  predictability  is  an  important  component  for  applica'on  modeling  

– Other  advantages  •  BeFer  tuned,  more  efficient  applica'ons  

•  Easier  to  make  QoS  guarantees  

Page 10: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

RESOURCE  ALLOCATION  USING  CONVEX  OPTIMIZATION  

An  Example  

Page 11: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Minimizing  the  Urgency  of  The  System  

 La  =  PMa(r(0,a),  r(1,a),  …,  r(n-­‐1,a))    La  

Ua(La)  

Con'nuously    Minimize  (subject  to  

restric'ons  on  the  total  amount  of  

resources)  

 Lb  =  PMb(r(0,b),  r(1,b),  …,  r(n-­‐1,b))    Lb  

Ub(Lb)  

 Li  =  PMi(r(0,i),  r(1,i),  …,  r(n-­‐1,i))    Li  

Ui(Li)  

Performance  Model  Urgency  Func'on  

[Burton  Smith  (MSR),  Opera'ng  System  Resource  Management  (Keynote),  IPDPS  2010]  

Page 12: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

 Li  =  PM(r(0,  i),  r(1,  i),  …,  r(n-­‐1,  i))  

 Li  

Ui  (Li)  

Service    Requirement  

si  =  slope  

Ui  (Li)  =  MAX(si  ·∙  (Li  -­‐  di),  0)  

Urgency  Func'on  

• Reflects  the  importance  of  cell  Ci  to  the  user  

di  

[Burton  Smith  (MSR),  Opera'ng  System  Resource  Management  (Keynote),  IPDPS  2010]  

Performance  Model  (PM)  Expected  to  decrease  with  resources  

r(1,i):  Alloca'on  of  resource  of  type  1  to  Cell  Ci  

 Li  

Ui  (Li)  

si  =  slope  

Page 13: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Performance-­‐Aware  Convex  Op'miza'on  for  Resource  Alloca'on  

•  Advantages  •  Convex  op'miza'on  is  relaJvely  inexpensive  op'miza'on  problem  with  a  single  extreme  point  

•  Urgency  Func'on  Slopes  allow  the  system  to  express  relaJve  prioriJes  of  applica'on  

•  Priori'es  change  as  a  func'on  of  performance  

•  Urgency  Func'on  Intercept  encapsulates  QoS  requirements  

•  And  addi'onal  process  can  be  used  to  represent  system  energy  

[Burton  Smith  (MSR),  Opera'ng  System  Resource  Management  (Keynote),  IPDPS  2010]  

Very  sensi've  to  the  performance  models  of  the  applica'ons  

Page 14: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

HARDWARE  PARTITIONING  Approaches  +  Mechanisms  

Page 15: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

GSFm:  Globally  Synchronized  Frames  Bandwidth  Par''oning  for  the  memory  hierarchy  

•  Frame-­‐Based  QoS  System  – Transac'ons  are  labeled  with  a  frame  number  

– Head  frame  moves  through  the  network  with  a  top  priority  

A

A VC 0 (Fr0) VC 1 (Fr1) VC 2 (Fr2)

C B VC 0 (Fr0) VC 1 (Fr1) VC 2 (Fr2)

Proc A Proc B Proc C Proc D

B C

B A D

Frame 0 Frame 1 Frame 2 Frame 3 Frame 4 Frame 5

frame window

shift

Frame 0 Frame 1 Frame 2 Frame 3 Frame 4 Frame 5

with  Jae  Lee,  MIT  

Page 16: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

GSFm:  Globally  Synchronized  Frames  Bandwidth  Par''oning  for  the  memory  hierarchy  

•  Uses  source-­‐side  suppression  •  Applica'ons  are  given  a  bandwidth  alloca'on  per  frame  

–  Credits  per  resource  •  Memory  channels  and  memory  bank,  

network  link,  etc  – Memory  transac'ons  are  charged  for  all  possible  resources  

•  Delayed  into  a  future  frame  if  the  app  doesn’t  have  enough  credits  

off-­‐chip  DRAMs  0   1   2   3  

4   5   6   7  

8   9   10   11  

12   13   14   15  

dir  0  

mem  ctrl  0   dir  1  

mem  ctrl  1  

off-­‐chip  DRAMs  

1  

Networks   Memory  

channel   bank  c2hREQ   h2cRESP   h2cREQ   c2cRESP  

1H   1P   0   0   1T   1T  

H:  header-­‐only  message  P:  header+payload  message  T:  memory  transac'on  

Page 17: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Advantages  of  GSFm  

• Minimum  Bandwidth  Guarantees  – Flexible  – Differen'ated  – Weighted  sharing  of  the  excess  bandwidth  

• Good  U'liza'on  – Early  frame  reclama'on  – Excess  bandwidth  

• Minimal  Hardware  Requirements  – Reasonable  area  – Distributed     1234567812

3456

78

0

0.02

0.04

0.06

accepted throughput [flits/cycle/node]

0.02!

0.04!

0.06!

0!

Page 18: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Cache  Par''oning  

•  Way-­‐Based  –  Simple  Indexing  –  Changes  the  replacement  policy  –  Reduced  associa'vity    –  Limited  by  number  of  ways  –  No  locality  for  NUCA  systems  

•  Bank-­‐Based  –  Locality  in  NUCA  –  More  complex  indexing  –  Requires  flush  on  reconfigura'on  –  Limited  by  number  of  banks  

Page 19: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

A  SIMPLE  EVALUATION  RAMP  Gold  Experiments  

Page 20: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Experimental  Planorm  

•  RAMP  Gold:  FPGA-­‐Based  Simulator  Target  Machine  

–  64  single-­‐issue  in-­‐order  cores  @  1GHz  

•  Par''onable  into  sets  of  8  –  Private  L1  Instruc'on  and  Data  Caches  each  32KB  –  Shared  L2  Cache  8MB  inclusive  10  ns  latency  

•  Par''onable  into  8  slices  using  page  coloring  

– Memory  bandwidth  with  magic  interconnect  

•  Par''onable  into  3.4  GB/s  units  assigned  to  a  set  of  cores  

•  ROS  Kernel  Code  – Microbenchmarks  &  PARSEC  Benchmarks  

Page 21: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Applica'on  Modeling  

•  Use  10  sample  points  – 18.5%  of  the  54  Possible  Alloca'ons    – Selected  using  Audze-­‐Eglais  Design  of  Experiments  

•  Create  a  model  of  the  applica'on  –  Input:  Resources  – Output:  Performance  

•  Explore  different  types  of  models  –  Linear  – Quadra'c  – KCCA  (Machine  Learning)  – Gene'cally  Programmed  Response  Surfaces  (GPRS)  

•  Run  all  54  Alloca'ons  to  test  model  accuracy  Latency  

Memory  alloca'on  

Page 22: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Scheduling  Experiment  

•  Evaluate  Objec've  Func'on  for  a  Pair  of  Benchmarks  using  the  models    

–  Race  to  Halt  •  Min  Max(Cycle1,  Cycle2)  

–  Least  Cycles  •  Min  (Cycle1*Cores1  +  Cycle2*Cores2)  

–  Lowest  Energy  •  Min  Σi  (resource  u'liza'on  i*energy  parameter  i)  

– Using  MATLAB’s  fmincon  

•  Run  all  54  possible  resource  alloca'ons  for  each  pair  of  benchmarks  

–  Assumes  that  all  resources  must  be  allocated  

Page 23: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Spa'al  Par''oning  Results  

•  Time-­‐Mux’ing  is  on  average  of  2x  worse  than  the  best  spa'al  par''on  

•  However  the  worst  spa'al  par''on  is  quite  bad.  

•  Naively  dividing  the  machine  in  half  is  1.75x  worse  than  the  best  spa'al  par''on  

•  Linear  model  is  within  8%  of  op'mal  every  'me  

•  Quadra'c  Model  is  within  3%  of  op'mal  every  'me  

•  KCCA  does  well  in  some  cases  but  very  poorly  in  others  

0.00  

0.50  

1.00  

1.50  

2.00  

2.50  

3.00  

3.50  

Blackscholes  and  Streamcluster  

Bodytrack  and  Streamcluster  

Blackscholes  and  Fluidanimate  

Loop  Micro  and  Random  Access  Micro  

Sum  of  C

ycles  on

 All  Co

res  (Normalized

 to  Best)  

Best  Spa'al  Par''oning   Time  Mul'plexing  

Divide  the  Machine  in  Half   Predic'on  with  Linear  Model  

Predic'on  with  Quadra'c  Model   Predic'on  with  KCCA  Model  

Worst  Spa'al  Par''oning  

(7.15)  

Page 24: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

Conclusions  

•  Simple  applica'on  models  show  promise  – S'll  lots  of  challenges    

•  When  to  build?  

•  How  to  store?  •  Variability  

• Resource  alloca'on  using  convex  op'miza'on  poten'ally  very  interes'ng  

– Lots  of  parameters  to  tune  •  How  do  we  set  the  urgency  func'ons?  

– How  does  it  compare  with  other  op'ons?  

Page 25: Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms ...parlab.eecs.berkeley.edu/sites/all/parlab/files/june2010-retreat.pdf · Applica’on*Modeling*and*Hardware* Par’’oning*Mechanisms*for*

QUESTIONS?