primehpc fx10 - performance evaluations and approach ... · toshiyuki shimizu next generation...

22
Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations and approach towards the next - SC12 Booth presentation

Upload: others

Post on 16-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012

PRIMEHPC FX10 - Performance evaluations and

approach towards the next -

SC12 Booth presentation

Page 2: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

Outline

Introduction of Fujitsu petascale supercomputers and key technologies of massively parallel computers K computer and FX10 performance evaluations HPC-ACE: CPU architecture extension VISIMPACT: Hybrid parallel execution support

Challenges towards Exascale computing Summary

Copyright 2012 FUJITSU LIMITED SC12 Booth presentation for FX10 1

Page 3: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

Fujitsu HPC from workplace to top-end

PRIMERGY x86 Clusters

PRIMEHPC FX10 Supercomputers

Celsius workstations

SC12 Booth presentation for FX10 Copyright 2012 FUJITSU LIMITED 2

Page 4: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

Customers of large-scale HPC systems

Copyright 2012 FUJITSU LIMITED SC12 Booth presentation for FX10

Customer Type Peak RIKEN (Kobe AICS) The K computer 11.28 PF National Computational Infrastructure, Australia (*) x86 Cluster (CX400), FX10 1.2 PF The University of Tokyo FX10 1.13 PF Central Weather Bureau of Taiwan (*) FX10 > 1 PF Kyushu University x86 Cluster (CX400), FX10 691 TF HPC Wales, UK x86 Cluster >300 TF Japan Atomic Energy Agency x86 Cluster, FX1, SMP 214 TF Institute for Molecular Science x86 Cluster (RX300), FX10 >168 TF Institute for Molecular Science (*) x86 Cluster (CX400) >136 TF Japan Aerospace Exploration Agency FX1, SMP >135 TF RIKEN (Wako Lab. RICC) x86 Cluster (RX200) 108 TF Institute for Solid State Physics of the Univ. of Tokyo (*) FX10 90 TF NAGOYA University x86 Cluster (HX600), FX1, SMP 60 TF A*STAR, Singapore x86 Cluster (BX900) > 45 TF A Manufacturer x86 Cluster >250 TF

(*) To be operated, Type definitions: FX10=PRIMEHPC FX10, x86 Cluster=Clusters based on PRIMERGY x86 server, SMP= SPARC Enterprise SMP server

3

Page 5: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

Massively parallel computing products

Single CPU / node architecture for multicore CPU FX1(4 cores) → K(8 cores) → FX10(16 cores) High memory bandwidth balanced w/ CPU performance

Key technologies for massively parallel are developed and inherited

CY2008~ 40GF, 4-core CPU

CY2012~ 236.5GF, 16-core CPU

CY2010~ 128GF, 8-core CPU

Copyright 2012 FUJITSU LIMITED SC12 Booth presentation for FX10 4

Page 6: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

Key technologies for massively parallel

Describe two technologies and show evaluation results of their effects by using real applications HPC-ACE: CPU architecture extension Number of floating register extension Floating-point reciprocal approximation instructions Conditional execution (move, store, and mask generation)

instructions

VISIMPACT: Hybrid parallel execution model (process & thread) support Automatic parallelization compiler Hardware inter-core barrier and shared L2 cache

Copyright 2012 FUJITSU LIMITED SC12 Booth presentation for FX10 5

Page 7: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

HPC-ACE extension and its effect • # of register extension • Conditional execution instructions • Floating-point reciprocal approximations

Copyright 2012 FUJITSU LIMITED SC12 Booth presentation for FX10 6

Page 8: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

HPC-ACE: # of register extension

Enhancement of SPARC V9 specification # of integer regs from 32 to 64 # of DP FP regs from 32 to 256

The extended registers can be used as same as standard registers with any instructions

Increasing # of register enables to explorer more parallelism Large loops can be software pipelined

Copyright 2012 FUJITSU LIMITED

Loop0

Loop1

Loop0 Loop1

LoopN

Limited by the # of regs.

Increase instruction level parallelism

Ext.

V9

Ext.

V9 V9 Ext.

Integer registers FP registers

Reg. window

32

224

32 32

160

32

SC12 Booth presentation for FX10 7

Page 9: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

HPC-ACE: Conditional execution instructions

Copyright 2012 FUJITSU LIMITED SC12 Booth presentation for FX10

subroutine sub(a, b, c, x, n) real*8 a(n), b(n), c(n), x(n) do i = 1 , n if ( x(i) .gt. 0.0 ) then a(i) = b(i) * c(i) else a(i) = b(i) - c(i) endif enddo end

Conditional execution instructions

: .L100: : calculations of b(i)*c(i) and b(i)-c(i) fcmpgted,s %f32,%f34,%f32 fselmovd,s %f42,%f40,%f32,%f42 std,s %f42,[%o5+%xg1] add %xg1,16,%xg1 bne,pt %icc, .L100 nop :

• Conditional branch is eliminated • SIMDization and software

pipelining can be widely applied

8

Page 10: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

HPC-ACE: Floating-point reciprocal approx.

Copyright 2012 FUJITSU LIMITED

real(kind=8),dimension(1024) :: a, b, c do i=1,1024 a(i) = b(i) / sqrt(c(i)) end do

.LL1: ldd [%o2 + %g2], %f2 ldd [%o1 + %g2], %f0 subcc %g3, 1, %g3 add %g2, 8, %g1 fsqrtd %f2, %f2 fdivd %f0, %f2, %f0 std %f0, [%o0 + %g2] mov %g1, %g2 bne,pt %icc, .LL1 nop

No reciprocal approximations

SC12 Booth presentation for FX10

• Limited overlapping of instructions

.LL1: ldd [%o2 + %g2], %f2 ldd [%o1 + %g2], %f0 subcc %g3, 1, %g3 add %g2, 8, %g1 fmuld %f8, %f2, %f10 frsqrtad %f2, %f2 fmuld %f10, %f2, %f14 fnmsubd %f2, %f14, %f8, %f14 fmaddd %f2, %f14, %f2, %f2 fmuld %f10, %f2, %f12 fnmsubd %f2, %f12, %f8, %f12 fmaddd %f2, %f12, %f2, %f2 fmuld %f10, %f2, %f10 fnmsubd %f2, %f10, %f8, %f6 fcmpeqd %f10, %f16, %f18 fmaddd %f2, %f6, %f2, %f2 fselmovd %f4, %f2, %f18, %f2 fmuld %f0, %f2, %f0 std %f0, [%o0 + %g2] mov %g1, %g2 bne,pt %icc, .LL1 nop

Reciprocal approximation instructions

• Divide and Sqrt are pipelined • Software pipelining can also

be applied 9

Page 11: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

HPC-ACE improves application efficiency

Copyright 2012 FUJITSU LIMITED SC12 Booth presentation for FX10

2% 5% 7% 15%

17% 19%

8%

19% 8%

20%

14%

12%

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

50%

Nano tech. Lifescience2

Fluiddynamics

Lifescience1

ConditionalexecutionReciprocalapproximations256regs

SPARC-V9

Effic

ienc

y

Applications

% of peak

10

Page 12: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

Performance efficiency of real applications Measured by petascale K computer and PRIMEHPC FX10 K computer & FX10 runs broad applications efficiently

Copyright 2012 FUJITSU LIMITED

* Measured by K computer: Results are from trial use & not the final. SC12 Booth presentation for FX10

Applications # of Cores Efficiency IPC System (Peak) A* 98,304 30% 1.7 K(1.57PF) B* 98,304 41% 1.5 C* 98,304 32% 1.3 D* 98,304 27% 1.0 E* 98,304 12% 0.6 F* 98,304 52% 1.5 G* 98,304 9% 0.7 H* 20,480 3% 0.6 K(0.33PF) I* 32,768 23% 0.9 K(0.52PF) J* 98,304 40% 1.5 K(1.57PF) K 6,144 30% 1.6 FX10(0.1PF)

11

Page 13: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

VISIMPACT and its effect • Hybrid execution model support • Scalability and efficiency • Load imbalance

Copyright 2012 FUJITSU LIMITED SC12 Booth presentation for FX10 12

Page 14: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

VISIMPACT (Virtual Single Processor by Integrated Multi-core Parallel Architecture)

Hybrid parallel execution is preferable for Good scalability (reduce communications by reducing the # of processes) Larger usable memory per process Efficient use of memory (smaller work area)

But… Programing is harder due to two level of parallelization

Flat MPI Hybrid parallel

MPI VISIMPACT MPI

• Automatic multithreading • HW barrier between cores • Shared L2 cache

16 procs×1 thread 4 process×4 thread

Copyright 2012 FUJITSU LIMITED SC12 Booth presentation for FX10 13

Page 15: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

Evaluations with practical meteorological apps.

WRF and COSMO were evaluated Operational weather forecasting codes Regional models

Different computational characteristics WRF uses “IO Quilting” mechanism COSMO is memory intensive WRF is thread-parallelized by OpenMP, but COSMO is not

Copyright 2012 FUJITSU LIMITED SC12 Booth presentation for FX10 14

Page 16: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

Advantage of hybrid parallelization on FX10

Flat-MPI (1x1536) spends much time for communication

Copyright 2012 FUJITSU LIMITED

Reduction of Communication Cost

(WRF V3.3.1, 1200x700x45 grids, by courtesy of CWB)

0

0.2

0.4

0.6

0.8

1

1.2

1x1536 2x768 4x384 8x192 16x96

Rela

tive

elap

sed

time

# of threads & processes

Communication Calculation

WRF V3.3.1

SC12 Booth presentation for FX10

Difference of Node Scalability

0

1.5

3

4.5

6

7.5

9

0

1,200

2,400

3,600

4,800

6,000

7,200

96 192 288 384 576 768 1,152 1,536

Spee

dup

Elap

sed

time

(sec

)

# of coresTime (1 thread / proc) Time (16 threads / proc)Speedup (1 thread / proc) Speedup (16 threads / proc)

Node scalability of hybrid execution (16 threads/proc) is better †:LoadImbalanceRatio=(MaxCalculationTime÷AvgCalculationTime)-1

21% load imbalance 9.8%

The communication includes load imbalance between processes Load imbalance ratio† of Flat-MPI is 21%, while 9.8% for 16x96 Load imbalance between threads in a process is in the calculation time

*

*

15

Page 17: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

Copyright 2012 FUJITSU LIMITED

VISIMPACT parallelizes with threads and improves performance when cores are increased

Automatic thread-parallelization on FX10

Sustained Performance & Efficiency

(COSMO RAPS 5.0, 421x461x50 grids)

0123456

050

100150200250300

96 192 288 384

Effic

ienc

y (%

)

Perf

orm

ance

(Gflo

ps)

# of coresPerformance (1 thread / proc) Performance (2 threads / proc)Efficiency (1 thread / proc) Efficiency (2 threads / proc)

COSMO-DE(RAPS 5.0)

SC12 Booth presentation for FX10

Effective B/F Ratio & Memory BW

0

15

30

45

60

0

1

2

3

4

96 192 288 384 Mem

ory

BW p

er n

ode

(GB/

s)

Byte

/flo

p ra

tio

# of cores

Bandwidth (1 thread / proc) Bandwidth (2 threads / proc)Byte/flop (1 thread / proc) Byte/flop (2 threads / proc)

Thread parallelization reduces byte/flop and utilizes memory BW For 384 cores, load imbalance ratio† was mitigated from 15% to 7.9%

†:LoadImbalanceRatio=(MaxCalculationTime÷AvgCalculationTime)-1

15% 7.9%

16

Page 18: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

Summary of VISIMPACT (hybrid execution) Real applications run on the hybrid parallel mode using

VISIMPACT show better performance than Flat MPI By eliminating memory copy By reducing load imbalance between processes

Hybrid parallel execution model is less memory usage Intrinsic and indispensable features for many core environment

Reducing a load imbalance between threads in a process should be studied

Copyright 2012 FUJITSU LIMITED SC12 Booth presentation for FX10 17

Page 19: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

Challenges towards Exascale computing

Copyright 2012 FUJITSU LIMITED SC12 Booth presentation for FX10 18

Page 20: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

Challenges toward exascale computing Fujitsu is developing a 100 Petaflops capable system as a

midterm goal

Copyright 2012 FUJITSU LIMITED SC12 Booth presentation for FX10

National project to develop the next generation supercomputer is expected

Panels: Booth #1943 “Oakleaf/Kashiwa Alliance”

Feasibility study

Sets the goal & a time line

Fujitsu has started two-year feasibility study to set the goal & schedule since July 2012

White-paper

Projects the system around 2018

2012 2014 2016 2018 2010 2020

Participates two consecutive national projects for exascale Fujitsu contributed to develop the whitepaper

• “Report on Strategic Direction/Development of HPC in Japan”

19

Page 21: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

K computer 2010 2015 2020

Trans-Exa system No.1 in Top500 (June, Nov. 2011)

Exascale system

Summary K computer & FX10 employ massively parallel technologies HPC-ACE VISIMPACT

Evaluation running real applications Good performance efficiency Good scalability and smaller load imbalance

Research and develop massively parallel architecture and approach toward exascale step-by-step

SC12 Booth presentation for FX10 Copyright 2012 FUJITSU LIMITED 20

Page 22: PRIMEHPC FX10 - Performance evaluations and approach ... · Toshiyuki Shimizu Next Generation Technical Computing Unit FUJITSU LIMITED November, 2012 PRIMEHPC FX10 - Performance evaluations

SC12 Booth presentation for FX10