
  • Ehsan Totoni, Babak Behzad, Swapnil Ghike, Josep Torrellas

  • Illinois-Intel Parallelism Center (Slide 2)

    • Increasing number of transistors on chip
      – Power and energy limited
      – Single-thread performance limited
      – => parallelism

    • Many options: heavy multicore, manycore, “SIMD-like” GPUs, low-power designs
      – Intel’s SCC represents manycore

  • Illinois-Intel Parallelism Center (Slide 3)

    • Various trade-offs
      – Performance, power, and energy
      – Programmability
      – Different for each application

    • Each architecture has advantages
      – No single best solution
      – (But manycore is a good option)

  • Illinois-Intel Parallelism Center (Slide 4)

    • Benchmark different parallel applications
      – On all architectures

    • Analyze the results, learn about the architectures

    • Where does each architecture stand?
      – In the various trade-offs

    • What would likely be better for the future?
      – Reasonable in all metrics: power, performance…
      – For most applications

  • Illinois-Intel Parallelism Center (Slide 5)

    • Jacobi: nearest-neighbor communication
      – A standard floating-point benchmark (see the halo-exchange sketch after this list)

    • NAMD: molecular dynamics
      – A dynamic scientific application

    • NQueens: state-space search
      – Integer intensive
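    The following is a minimal sketch, not the benchmark code from this study, of the nearest-neighbor pattern Jacobi uses: a 1-D decomposition in C with MPI, where N, ITERS, and the initialization are illustrative assumptions.

      /* 1-D Jacobi sweep with halo exchange: each rank owns N interior
       * points plus two ghost cells and trades one boundary value with
       * each neighbor per iteration. */
      #include <mpi.h>
      #include <string.h>

      #define N     1024   /* interior points per rank (assumed) */
      #define ITERS 100    /* iteration count (assumed) */

      int main(int argc, char **argv) {
          double cur[N + 2], next[N + 2];
          int rank, size;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          /* Neighbors in the 1-D decomposition; chain ends talk to no one. */
          int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
          int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

          for (int i = 0; i < N + 2; i++) cur[i] = (double)rank;

          for (int it = 0; it < ITERS; it++) {
              /* Nearest-neighbor communication only. */
              MPI_Sendrecv(&cur[1],     1, MPI_DOUBLE, left,  0,
                           &cur[N + 1], 1, MPI_DOUBLE, right, 0,
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Sendrecv(&cur[N],     1, MPI_DOUBLE, right, 1,
                           &cur[0],     1, MPI_DOUBLE, left,  1,
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE);

              /* Relax: each point becomes the average of its neighbors. */
              for (int i = 1; i <= N; i++)
                  next[i] = 0.5 * (cur[i - 1] + cur[i + 1]);
              memcpy(&cur[1], &next[1], N * sizeof(double));
          }

          MPI_Finalize();
          return 0;
      }

    Because each rank exchanges only two values per sweep regardless of core count, this pattern maps well onto a mesh interconnect like the SCC’s.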

  • Illinois-Intel Parallelism Center (Slide 6)

    • CG: Conjugate Gradient, from the NAS Parallel Benchmarks (NPB)
      – A lot of fine-grain communication

    • Integer Sort: represents commercial workloads
      – Integer intensive, heavy communication (see the bucket-exchange sketch below)
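    As a rough illustration of why the sort is communication-heavy, here is a sketch of the bucketed key exchange at its core. This is not the NPB IS source; the key count and range are assumptions.

      /* Bucketed integer-sort skeleton: classify local keys by destination
       * rank, then exchange bucket sizes all-to-all. A full sort would
       * follow with MPI_Alltoallv to move the keys and a local sort. */
      #include <mpi.h>
      #include <stdlib.h>

      #define KEYS    4096    /* keys per rank (assumed) */
      #define MAX_KEY 65536   /* key range (assumed) */

      int main(int argc, char **argv) {
          int rank, size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          int *keys    = malloc(KEYS * sizeof(int));
          int *sendcnt = calloc(size, sizeof(int));
          int *recvcnt = malloc(size * sizeof(int));

          srand(rank + 1);
          for (int i = 0; i < KEYS; i++) keys[i] = rand() % MAX_KEY;

          /* Count how many local keys fall in each rank's key range. */
          int range = MAX_KEY / size;
          for (int i = 0; i < KEYS; i++) {
              int dest = keys[i] / range;
              if (dest >= size) dest = size - 1;   /* clamp the last bucket */
              sendcnt[dest]++;
          }

          /* The heavy-communication step: every rank talks to every other. */
          MPI_Alltoall(sendcnt, 1, MPI_INT, recvcnt, 1, MPI_INT,
                       MPI_COMM_WORLD);

          free(keys); free(sendcnt); free(recvcnt);
          MPI_Finalize();
          return 0;
      }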

  • Illinois-Intel Parallelism Center (Slide 7)

    • All codes reasonably optimized, but not extensively!
      – Programmer productivity matters

    • MPI and Charm++ codes for the CPUs

    • CUDA and OpenCL codes for the GPU
      – Reasonably optimized, with the same complexity as the CPU codes
      – A fair apples-to-apples comparison

  • Illinois-Intel Parallelism Center (Slide 8)

    • Multi-core: Intel Core i7
      – 4 cores, 8 threads, 2.8 GHz

    • Low-power: Intel Atom D
      – 2 cores, 4 threads, 1.8 GHz

    • “SIMD-like” GPU: Nvidia ION2
      – 16 CUDA cores, low power, 475 MHz

    • Manycore: Intel’s SCC
      – 48 cores, 800 MHz

  • Illinois-Intel Parallelism Center (Slide 9)

    • “Single-Chip Cloud Computer” (SCC)
      – Just a name; not only for the cloud
      – An experimental chip for research on manycore

    • Slower than production chips, but we can analyze it
      – 48 Pentium cores, mesh interconnect

    • Not cache coherent, so message-passing programs

  • Illinois-Intel Parallelism Center (Slide 10)

    • We ported the Charm infrastructure to the SCC
      – To run Charm++ and MPI parallel codes
      – Charm++ is message-passing parallel objects

    • No conceptual difficulty
      – The SCC is similar to a small cluster

    • Just the headaches of an experimental chip
      – Old compiler, OS, and libraries…

  • Illinois-Intel Parallelism Center (Slide 11)

    Apps can use the parallelism in the architecture.

    [Figure: speedup vs. number of threads (1 to 8, Core i7) for NAMD, Jacobi, NQueens, CG, and Sort]

  • Illinois-Intel Parallelism Center (Slide 12)

    [Figure: power (W) vs. number of threads (1 to 8, Core i7) for NAMD, Jacobi, NQueens, CG, and Sort; annotations mark about 85 W at one thread and 150 W at eight]

  • Illinois-Intel Parallelism Center (Slide 13)

    Speedup offsets the power increase.

    [Figure: normalized energy vs. number of threads (1 to 8, Core i7) for NAMD, Jacobi, NQueens, CG, and Sort]
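    A one-line derivation, not from the slides but standard energy accounting, shows why speedup can offset higher power: at n threads the run takes t(n) = t(1)/S(n), so

      \[
        \frac{E(n)}{E(1)} = \frac{P(n)\,t(n)}{P(1)\,t(1)} = \frac{P(n)}{P(1)\,S(n)} .
      \]

    With the roughly 85 W one-thread and 150 W eight-thread figures above, the power ratio is about 1.76, so any app with S(8) > 1.76 finishes the job with less energy than the serial run.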

  • Illinois-Intel Parallelism Center (Slide 14)

    [Figure: speedup vs. number of threads (1 to 4, Atom) for NAMD, Jacobi, NQueens, CG, and Sort]

  • Illinois-Intel Parallelism Center (Slide 15)

    [Figure: power (W) vs. number of threads (1 to 4, Atom) for NAMD, Jacobi, NQueens, CG, and Sort; annotations mark about 29 W at one thread and 31 W at four]

  • Illinois-Intel Parallelism Center (Slide 16)

    Speedup with near-constant power: since P(4)/P(1) ≈ 31/29, almost any speedup reduces energy.

    [Figure: normalized energy vs. number of threads (1 to 4, Atom) for NAMD, Jacobi, NQueens, CG, and Sort]

  • Illinois-Intel Parallelism Center (Slide 17)

    [Figure: ION2 GPU results for Jacobi, NAMD, NQueens, and Sort: speedup and power (W) on the left axis, relative energy on the right axis]

  • Illinois-Intel Parallelism Center (Slide 18)

    All apps scale, except CG (fine-grain collectives). And with no change in source code!

    [Figure: speedup vs. number of cores (up to 48, SCC) for NAMD, Jacobi, NQueens, CG, and Sort]

  • Illinois-Intel Parallelism Center (Slide 19)

    Power differs across apps (communication-bound apps leave processors stalled).

    [Figure: power (W) vs. number of cores (up to 48, SCC) for NAMD, Jacobi, NQueens, CG, and Sort; annotations mark roughly 40 W, 60 W, and 85 W]

  • Illinois-Intel Parallelism Center (Slide 20)

    Energy improves with scale, but not for CG: an app’s scalability is needed to offset the power increase!

    [Figure: normalized energy vs. number of cores (up to 48, SCC) for NAMD, Jacobi, NQueens, CG, and Sort]
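    The same identity as before applies here; reading the annotations on the previous slide as roughly 40 W near one core and up to 85 W with all cores busy (my interpretation, not a stated figure):

      \[
        \frac{E(48)}{E(1)} = \frac{P(48)}{P(1)\,S(48)} \approx \frac{85}{40\,S(48)} ,
      \]

    so an app needs S(48) above roughly 2 to break even on energy; CG’s flat scaling never gets there.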

  • Illinois-Intel Parallelism Center (Slide 21)

    [Figure: relative speed per app (Jacobi, NAMD, NQueens, CG, Sort) on Atom (4 threads), Core i7 (8 threads), SCC (48 threads), and ION2 (16 threads)]

    The GPU is good for some apps: regular, highly parallel ones.

    Multicore is good for most apps, even dynamic, irregular, less parallel ones.

    The SCC beats multicore on NQueens, and it is only an experimental chip!

    Low-power is really slow!

  • Illinois-Intel Parallelism Center (Slide 22)

    [Figure: power (W) per app (Jacobi, NAMD, NQueens, CG, Sort) on Atom (4 threads), Core i7 (8 threads), SCC (48 threads), and ION2 (16 threads)]

    The GPU and Atom have low consumption.

    Multicore has high consumption: high frequency, fancy features.

    The SCC has moderate power: more parallelism, lower frequency.

    Power is nearly the same across different apps.

  • Illinois-Intel Parallelism Center (Slide 23)

    [Figure: relative energy per app (Jacobi, NAMD, NQueens, CG, Sort) on Atom (4 threads), Core i7 (8 threads), SCC (48 threads), and ION2 (16 threads)]

    Multicore is energy efficient! Its high speed offsets its power.

    The low-power design is not energy efficient! Its low speed offsets its power savings.

    The SCC is OK, despite its low speed.

  • Illinois-Intel Parallelism Center (Slide 24)

    • Multicore is still effective for many apps
      – Even dynamic, irregular, heavy-communication ones

    • The GPU is exceptional for some apps
      – But not for all apps
      – Programmability and portability issues

    • The low-power design is just low power
      – Slow, so not energy efficient

  • Illinois-Intel Parallelism Center (Slide 25)

    • The SCC is a good option
      – Low power, and it can run legacy code fast

    • A balanced design point
      – Lower power than multicore, faster than the low-power design
      – None of the GPU’s programmability and portability issues

    • Experimental and slow, but it can be improved

  • Illinois-Intel Parallelism Center (Slide 26)

    • Sequential performance, floating point
      – Apps show good scaling on more SCC cores
      – Easy with more CMOS transistors
      – The interconnect also has to improve with it (still easy)

    • A global collectives network
      – Low cost, like the BlueGene supercomputers
      – To support fine-grain global communications, as in CG (see the sketch below)
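    To make “fine-grain global communications, as in CG” concrete, here is a hedged sketch of the collective at the heart of CG: every iteration reduces a handful of dot products across all cores, so a tiny allreduce sits on the critical path. The vector length and iteration count are illustrative.

      /* Each CG iteration needs global dot products: a small
       * MPI_Allreduce on the critical path of every step. */
      #include <mpi.h>

      #define LOCAL_N 1000   /* local vector length (assumed) */
      #define ITERS   100    /* iteration count (assumed) */

      int main(int argc, char **argv) {
          double x[LOCAL_N], y[LOCAL_N];
          MPI_Init(&argc, &argv);

          for (int i = 0; i < LOCAL_N; i++) { x[i] = 1.0; y[i] = 2.0; }

          for (int it = 0; it < ITERS; it++) {
              double local = 0.0, global;

              /* Local partial dot product... */
              for (int i = 0; i < LOCAL_N; i++) local += x[i] * y[i];

              /* ...then a fine-grain global collective: one double,
               * reduced across all cores, every iteration. This is the
               * traffic a cheap hardware collectives network would speed up. */
              MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                            MPI_COMM_WORLD);
          }

          MPI_Finalize();
          return 0;
      }

    This latency-bound pattern is why CG scaled poorly on the SCC, and why the deck suggests a BlueGene-style collectives network.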