Benchmarking and Performance Evaluations

Todd Mytkowicz, Microsoft Research

Let’s poll for an upcoming election

I ask 3 of my co-workers who they are voting for.

• My approach does not deal with
  – Variability
  – Bias

Issues with my approach

Variability
source: http://www.pollster.com

My approach is not reproducible

Issues with my approach (II)

Bias
source: http://www.pollster.com

My approach is not generalizable
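
The two issues can be separated with a small simulation. This is a minimal Python sketch, not part of the original talk: the support levels (50% in the electorate, 80% among co-workers) and the helper name `run_polls` are made up for illustration. Polling more co-workers shrinks the spread between repeated polls (variability), but the result stays far from the electorate (bias).

```python
import random
import statistics

def run_polls(support, sample_size, num_polls, rng):
    """Return the estimated support from each of num_polls independent polls."""
    estimates = []
    for _ in range(num_polls):
        votes = sum(rng.random() < support for _ in range(sample_size))
        estimates.append(votes / sample_size)
    return estimates

rng = random.Random(42)        # fixed seed so the sketch is repeatable
ELECTORATE_SUPPORT = 0.50      # assumed true support among all voters
COWORKER_SUPPORT = 0.80        # assumed support among my co-workers

# Both polls only ever ask co-workers; they differ in sample size.
small = run_polls(COWORKER_SUPPORT, 3, 1000, rng)
large = run_polls(COWORKER_SUPPORT, 1000, 1000, rng)

for name, estimates in (("3 co-workers", small), ("1000 co-workers", large)):
    spread = statistics.stdev(estimates)                      # variability
    offset = statistics.mean(estimates) - ELECTORATE_SUPPORT  # bias
    print(f"{name}: spread={spread:.3f}, offset from electorate={offset:+.3f}")
```

A larger sample makes the poll reproducible; only sampling from the right population makes it generalizable.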

Take Home Message

• Variability and Bias are two different things
  – Difference between reproducible and generalizable!

Do we have to worry about Variability and Bias when we benchmark?

Let’s evaluate the speedup of my whizbang idea

What do we do about Variability?

• Statistics to the rescue
  – mean
  – confidence interval

Intuition for T-Test

• 1-6 is uniformly likely (p = 1/6)
• Throw die 10 times: calculate mean

Trial   Mean of 10 throws
1       4.0
2       4.3
3       4.9
4       3.8
5       4.3
6       2.9
…       …
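
The die experiment is easy to repeat in code. The following Python sketch (names such as `mean_of_throws` are illustrative, and the seed is arbitrary) runs many trials of 10 throws each. Individual trial means scatter, as in the table, but they cluster around the true mean of 3.5; the t-test quantifies how wide that scatter is expected to be for a given number of samples.

```python
import random
import statistics

def mean_of_throws(num_throws, rng):
    """Throw a fair six-sided die num_throws times and return the mean."""
    return statistics.mean([rng.randint(1, 6) for _ in range(num_throws)])

rng = random.Random(0)   # fixed seed so the sketch is repeatable
trial_means = [mean_of_throws(10, rng) for _ in range(10_000)]

# The first few trials look like the table: each mean differs from the last.
print([round(m, 1) for m in trial_means[:6]])

# Over many trials the means center on 3.5; their spread should be near
# sqrt((35 / 12) / 10), the standard deviation of one throw over sqrt(10).
print("mean of trial means:", round(statistics.mean(trial_means), 3))
print("spread of trial means:", round(statistics.stdev(trial_means), 3))
```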

Back to our Benchmark: Managing Variability

> x=scan('file')
Read 20 items
> t.test(x)

        One Sample t-test

data:  x
t = 49.277, df = 19, p-value < 2.2e-16
95 percent confidence interval:
 1.146525 1.248241
sample estimates:
mean of x
 1.197383
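
For readers who want to see what sits behind that interval, here is a hedged Python sketch of the same calculation: mean ± t_crit × s / sqrt(n). The critical values are standard two-sided 95% values of Student's t, hard-coded to keep the sketch free of dependencies; the function name and the five sample speedups are hypothetical. The second half cross-checks the R output above, assuming the test was run against the default null mean of 0, so that t = mean / standard error.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {4: 2.776, 9: 2.262, 19: 2.093, 29: 2.045}

def confidence_interval_95(samples):
    """Return (mean, low, high) for a 95% confidence interval on the mean.

    Raises KeyError if len(samples) - 1 is not in the small table above.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(n)
    half_width = T_CRIT_95[n - 1] * std_err
    return mean, mean - half_width, mean + half_width

# Hypothetical speedup measurements, for illustration only.
mean, low, high = confidence_interval_95([1.21, 1.18, 1.25, 1.15, 1.20])
print(f"mean={mean:.3f}, 95% CI=({low:.3f}, {high:.3f})")

# Cross-check against the R output: mean of x and t statistic with df = 19.
slide_mean, slide_t = 1.197383, 49.277
slide_std_err = slide_mean / slide_t          # assumes null mean of 0
slide_half = T_CRIT_95[19] * slide_std_err
slide_low, slide_high = slide_mean - slide_half, slide_mean + slide_half
print(f"reconstructed CI=({slide_low:.4f}, {slide_high:.4f})")
```

The reconstructed interval should agree with the one R printed to about four decimal places, the limit set by the rounded critical value.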

So we can handle Variability. What about Bias?

Evaluating compiler optimizations

System              = gcc -O2 perlbench
System + Innovation = gcc -O3 perlbench

Madan: speedup = 1.18 ± 0.0002
Conclusion: O3 is good

Todd: speedup = 0.84 ± 0.0002
Conclusion: O3 is bad

Why does this happen?

Differences in our experimental setup

Madan: HOME=/home/madan
Todd:  HOME=/home/toddmytkowicz

[Figure: memory layout (env, stack, text) of each user's process]

Runtime of SPEC CPU 2006 perlbench depends on who runs it!
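
One way to probe for this kind of bias is to rerun the same command while varying only the size of the environment. The Python sketch below is an illustration under stated assumptions, not the procedure used in the talk: the variable name `BENCH_PADDING`, the padding sizes, and the stand-in workload are all hypothetical, and the command should be replaced with the real benchmark. Whether any effect shows up depends on the machine and the program.

```python
import os
import subprocess
import sys
import time

def time_with_env_padding(cmd, pad_bytes):
    """Run cmd with an extra env variable of pad_bytes characters.

    Returns the wall-clock time of the run in seconds.
    """
    env = dict(os.environ)
    env["BENCH_PADDING"] = "x" * pad_bytes   # hypothetical variable name
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start

# Stand-in workload so the sketch runs anywhere; substitute the benchmark.
cmd = [sys.executable, "-c", "sum(range(100000))"]

# Same command, same machine; only the environment size changes.
timings = {pad: time_with_env_padding(cmd, pad) for pad in range(0, 4096, 512)}
for pad, seconds in timings.items():
    print(f"padding={pad:5d} bytes  time={seconds:.4f}s")
```

In a real study each padding size would be run many times, with a confidence interval per setting, so that bias from the setup is not confused with run-to-run variability.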

Bias from linking order

[Figure: speedup for each of 32 randomly generated linking orders]

Order of .o files can lead to contradictory conclusions
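
Generating the randomized linking orders is simple to script. This Python sketch only builds the link command lines and does not run them; the object file names, the output name, and the helper `random_link_commands` are hypothetical, and a real build would add its own flags and libraries.

```python
import random

def random_link_commands(object_files, num_orders, output="perlbench", seed=0):
    """Return num_orders gcc link commands, each with a shuffled .o order."""
    rng = random.Random(seed)   # seeded so the same orders can be rebuilt
    commands = []
    for _ in range(num_orders):
        order = list(object_files)
        rng.shuffle(order)
        commands.append(["gcc", "-o", output, *order])
    return commands

# Hypothetical object files; a real benchmark has many more, so repeated
# orders are far less likely than with this short list.
objs = ["av.o", "gv.o", "hv.o", "op.o", "pp.o", "sv.o"]
commands = random_link_commands(objs, 32)
for cmd in commands[:3]:
    print(" ".join(cmd))
```

Each resulting binary would then be benchmarked under both -O2 and -O3 to see how much the measured speedup moves with the order alone.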

Where exactly does Bias come from?

Interactions with hardware buffers

[Figure sequence: the O2 code layout drawn across Page N and Page N + 1, with regions labeled “Dead Code”, “Code affected by O3”, and “Hot code”]

[Figure: O2 and O3 layouts across Page N and Page N + 1. In one layout, O3 is better than O2; in another, O2 is better than O3.]

[Figure: the same comparison across Cacheline N and Cacheline N + 1. Again, O3 is better than O2 in one layout and O2 is better than O3 in the other.]
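
The boundary effect in these figures comes down to address arithmetic. The Python sketch below assumes 64-byte cache lines and 4096-byte pages, which are common but not universal, and uses a hypothetical 48-byte hot loop. Moving the loop by 32 bytes, as a change in link order or in the code ahead of it might, changes how many cache lines it occupies.

```python
def blocks_touched(start_addr, size, block_size):
    """Return indices of aligned blocks overlapped by [start, start + size)."""
    first = start_addr // block_size
    last = (start_addr + size - 1) // block_size
    return list(range(first, last + 1))

CACHE_LINE = 64      # assumed cache line size in bytes
PAGE = 4096          # assumed page size in bytes
HOT_LOOP_SIZE = 48   # hypothetical size of the hot code in bytes

# Same code, two placements: aligned to a line, then shifted by 32 bytes.
for start in (0x1000, 0x1020):
    lines = blocks_touched(start, HOT_LOOP_SIZE, CACHE_LINE)
    print(f"start={hex(start)}: {len(lines)} cache line(s)")

# The same arithmetic applies to pages: a loop starting 16 bytes before a
# page boundary spans two pages.
print(blocks_touched(0xFF0, HOT_LOOP_SIZE, PAGE))
```

The aligned placement fits in one cache line while the shifted one straddles two, which is the kind of layout change that can favor O2 in one build and O3 in another.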

Other Sources of Bias

• JIT
• Garbage Collection
• CPU Affinity
• Domain specific (e.g. size of input data)
• How do we manage these?

Other Sources of Bias

How do we manage these?
  – JIT:
    • ngen to remove impact of JIT
    • “warmup” phase to JIT code before measurement
  – Garbage Collection:
    • Try different heap sizes (JVM)
    • “warmup” phase to build data structures
    • Ensure program is not “leaking” memory
  – CPU Affinity:
    • Try to bind threads to CPUs (SetProcessAffinityMask)
  – Domain Specific:
    • Up to you!
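
Two of these mitigations, a warmup phase and CPU pinning, can be sketched in a small harness. This Python sketch is an assumption-laden illustration: it uses the Linux-only `os.sched_setaffinity` in place of the Windows `SetProcessAffinityMask` named above, pins the whole process rather than individual threads, and the warmup and run counts and the workload are arbitrary.

```python
import os
import statistics
import time

def pin_to_one_cpu():
    """Best-effort pin of this process to one allowed CPU (Linux only).

    Returns True if the pin succeeded, False if unsupported or refused.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        cpu = min(os.sched_getaffinity(0))   # pick a CPU we may run on
        os.sched_setaffinity(0, {cpu})
        return True
    except OSError:
        return False

def benchmark(fn, warmup=5, runs=20):
    """Call fn warmup times unmeasured, then return runs timings in seconds."""
    for _ in range(warmup):
        fn()                                 # warm caches, build data structures
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings

def workload():
    """Stand-in workload; replace with the code under test."""
    return sorted(range(50_000, 0, -1))

pinned = pin_to_one_cpu()
timings = benchmark(workload)
print(f"pinned={pinned}, mean={statistics.mean(timings):.6f}s, "
      f"stdev={statistics.stdev(timings):.6f}s")
```

The measured timings can then be fed to a t-test, so that both bias from the setup and variability between runs are addressed.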

R for the T-Test

• Where to download
  – http://cran.r-project.org
• Simple intro to get data into R
• Simple intro to do t.test

Some Conclusions

• Performance Evaluations are hard!
  – Variability and Bias are not easy to deal with
• Other experimental sciences go to great effort to work around variability and bias
  – We should too!