gtc japan 2014

Large-Scale Graph Processing Using GPUs on TSUBAME

éÖDĝD�³Ì��¯D�¯ð�ÉËÀs�uG�

��(��,9,7� ��

��> ��Ì��$��ĉČ§��$��ĎČ§��Ç��

��78/0;��:,.5;��$)��7<06�+087�+��.8:0;��C��$)��"*��(0;6,�� +�C��!�!��&��!�A��''��C� ��

��C�%�&��=66�-4;0.<487�#9<4.,6��7B74-,7/�

��''�� (��$�� =;<:0��$�'��(,90��$��'8:,20(05�' ��C� ��

TSUBAME2 System Overview 11PB (7PB HDD, 4PB Tape, 200TB SSD)

[Figure: storage configuration, attached to the Infiniband QDR Networks. Labels recovered from the diagram:]

Parallel File System Volumes: "Global Work Space" #1, #2, #3 on SFA10k #1 to #5; GPFS#1 to GPFS#4; mount points /data0, /data1, /work0, /work1, /gscr; QDR IB (×4) × 20

Home Volumes: SFA10k #6; HOME, System application, iSCSI; "cNFS/Clustered Samba w/ GPFS" and "NFS/CIFS/iSCSI by BlueARC"; QDR IB (×4) × 8, 10GbE × 2

Capacities shown: 1.2PB, 3.6PB

Thin nodes: 1408 nodes (32 nodes x 44 racks)

HP ProLiant SL390s G7, 1408 nodes
CPU: Intel Westmere-EP 2.93GHz, 6 cores × 2 = 12 cores/node
GPU: NVIDIA Tesla K20X, 3 GPUs/node
Mem: 54GB (96GB)
SSD: 60GB x 2 = 120GB (120GB x 2 = 240GB)

Medium nodes

HP ProLiant DL580 G7, 24 nodes
CPU: Intel Nehalem-EX 2.0GHz, 8 cores × 4 = 32 cores/node
GPU: NVIDIA Tesla S1070, NextIO vCORE Express 2070
Mem: 128GB
SSD: 120GB x 4 = 480GB

Fat nodes

HP ProLiant DL580 G7, 10 nodes
CPU: Intel Nehalem-EX 2.0GHz, 8 cores × 4 = 32 cores/node
GPU: NVIDIA Tesla S1070
Mem: 256GB (512GB)
SSD: 120GB x 4 = 480GB

Computing Nodes: 17.1 PFlops (SFP), 5.76 PFlops (DFP), 224.69 TFlops (CPU), ~100TB MEM, ~200TB SSD

Interconnects: Full-bisection Optical QDR Infiniband Network

Core Switch: Voltaire Grid Director 4700 × 12 (12 switches), IB QDR: 324 ports

Edge Switch: Voltaire Grid Director 4036 × 179 (179 switches), IB QDR: 36 ports

Edge Switch (w/ 10GbE ports): Voltaire Grid Director 4036E × 6 (6 switches), IB QDR: 34 ports, 10GbE: 2 ports

2.4 PB HDD + 4PB Tape

Graph500 (http://www.graph500.org)

•  Metric: TEPS (Traversed Edges Per Second)
•  Application areas: Cybersecurity, Medical Informatics, Social Networks, Data Enrichment, Symbolic Networks
•  Kernels
–  concurrent search (Breadth First Search: BFS)
–  optimization (Single Source Shortest Path)
–  edge-oriented (Maximal Independent Set)
•  Green Graph500
–  http://green.graph500.org/

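•  BFS on a Kronecker Graph
–  edgefactor 16 (= m/n): the number of edges m is 16 times the number of vertices n
–  SCALE sets the graph size: 2^SCALE vertices and 2^(SCALE+4) edges
–  e.g. SCALE 30: about 1 billion vertices and 17.2 billion undirected edges, i.e. 34.4 billion directed edges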

•  ¨�®¨©«��D±G�²¯J��N��|A/'–  PG��z:¨©«�� ="�� A�{�'

Input parameters
•  SCALE
•  edgefactor (=16)

Benchmark flow: Graph Generation → Graph Construction → BFS → Validation → results (BFS and Validation: 64 iterations)
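The relation between the two input parameters and the reported metric can be sketched in a few lines. This is a minimal illustration, not part of the benchmark code: the names `ProblemSize`, `problem_size`, and `teps` are hypothetical, and the sketch assumes the usual Graph500 convention of 2^SCALE vertices and edgefactor × 2^SCALE edges.

```cpp
#include <cstdint>

// Problem size implied by the two input parameters (hypothetical helper type).
struct ProblemSize {
    std::uint64_t vertices;  // n = 2^SCALE
    std::uint64_t edges;     // m = edgefactor * n
};

// Assumes `scale` is small enough that edgefactor * 2^scale fits in 64 bits.
inline ProblemSize problem_size(unsigned scale, std::uint64_t edgefactor = 16) {
    const std::uint64_t n = std::uint64_t{1} << scale;
    return {n, edgefactor * n};
}

// TEPS = edges traversed by one BFS / time of that BFS in seconds.
inline double teps(std::uint64_t traversed_edges, double seconds) {
    return static_cast<double>(traversed_edges) / seconds;
}
```

With the default edgefactor of 16, `problem_size(30)` gives 2^30 vertices and 2^34 edges.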

(')��!� �Z�ßÿzGuã«ÊĐWQUZÕÒ�•  dis��GuZġĕ�

–  "*��(0;6,�� +�� Ě�

•  �G|IS`Z��ġĕõN�äğ¢�ÅJ�–  ��G|�C��G|�

•  �cw{x�Gè¸Z�=,6�&,46�%�&��7B74-,7/�–  ��(-9;Z��esio��|Ā�

•  ÞÃ×�G|Y''�bġĕ�–  oqy�¨V ��(�ü§�

�•  �ßÿq{�Gp�

–  =;<:0W�$�'Zć¬�–  �$�M_Xa��öô��$�M_XayG�öô�

�

Ć¹Zq�l�ZÄđWò¾�•  ĄĒ»Zď�¢È�F±ċìªF�Ĕ§ª�

–  ��swnZ��vldªF�}Gldª�–  dis��GuZÑî�

'

•  ÍĠĊØZÝñFÅąčª�–  ±Ĝ�·��FÏ¿®��ZĂ��

•  � �'��$�!��'((�!&�!��&0&�!��!��0<.�

–  zGuï°Zlq{��·ÓFåÚÆ¦�NÈ�'

F*H�

�$!�18

67!

>(L

)��B-�Q ��°ª¥«¡�

��D!

��¬�«�§

:M�

&@9CO�

�£°�� K�!�2�

�B-¢° I/O


��}��¦£�� A~µ'�$!�18¯�K��¨©«;5¯'E��°ª¥«¡��

�� 4236@��..060:,<0/�!,9&0/=.0�•  '�ßÿzGuĄĒã«��G��Gi'

–  dis��GuW±Ĝ�·��bġĕQS�q�l�b ê�

–  þªRa��Ząč·bÐĖ�'

•  Ûę�–  ²¤Zz�eqZ'��b¬´YĈ¬�

•  ��Y^a©Ċ�•  �)��#907"*!��0<.�

–  ��ÑHZdis��GuVZ�ėqkG��j��

•  (')��!� �–  �$)VZ#=<�81�.8:0XzGuë«F�d�m�r�Z½¬�

•  �$)��$)¡q{�G��jzGuàç�Z²Ĉª�

•  �$)�GqZ¶£tG{��–  dis��Gu¼OZ²ĈXzGu�èóZ½¬�

•  ��'�fG�w{�

Hamar Overview

[Figure: a Distributed Array is partitioned into Local Arrays held by Rank 0, Rank 1, ..., Rank n. Map and Reduce operate on each Local Array; Shuffle performs the data transfer between ranks. Each Local Array is a Virtualized Data Object that holds Device (GPU) Data and Host (CPU) Data, moved by Memcpy (H2D, D2H).]

Map/Reduce code sample

    class MapImpl : public hamar::function::cuda::Map<MapContext> {
     public:
      // Return type is assumed to be void; the slide omits it.
      __host__ __device__ void Operate(MapContext *context) {
        KeyType key = context->input_key();
        ValueType value = context->input_value();
        context->Emit(key, value);
      }
    };

    class ReduceImpl : public hamar::function::cuda::Reduce<ReduceContext> {
     public:
      __host__ __device__ void Operate(ReduceContext *context) {
        KeyType key = context->input_key();
        // Pointer type is assumed; the slide writes "ValueType values".
        ValueType *values = context->input_values();
        int n = context->num_input_values();
        // Sum of values[0] + ... + values[n - 1]
        ValueType sum = values[0];
        for (int i = 1; i < n; ++i) sum += values[i];
        context->Emit(key, sum);
      }
    };

Map/Reduce code sample (cont'd)

    int main() {
      MapImpl map;
      ReduceImpl reduce;

      Environment env;
      env.Init();  // MPI/CUDA initialization

      Directory object(&env);
      object.Init(path);  // path: location of the input data

      object.Map(map);
      object.Reduce(reduce);

      object.Destroy();

      env.Destroy();  // MPI/CUDA finalization
    }
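To make the data flow behind the sample concrete, the following host-only sketch runs an identity Map, a Shuffle, and a summing Reduce over several simulated ranks in one process. It does not use the Hamar API: `map_shuffle_reduce` and the type aliases are hypothetical names, and routing a key to rank (key mod number of ranks) is an assumption made for illustration.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Key = unsigned;
using Value = long;
using KeyValue = std::pair<Key, Value>;

// ranks[r] holds the key-value pairs initially stored on simulated rank r.
// Returns, for each rank, the per-key sums of the pairs routed to it.
std::vector<std::map<Key, Value>> map_shuffle_reduce(
        const std::vector<std::vector<KeyValue>>& ranks) {
    const std::size_t p = ranks.size();

    // Map + Shuffle: emit every pair unchanged and send it to its owner rank.
    std::vector<std::vector<KeyValue>> received(p);
    for (const std::vector<KeyValue>& local : ranks) {
        for (const KeyValue& kv : local) {
            const KeyValue emitted = kv;                  // Map: identity
            received[emitted.first % p].push_back(emitted);  // owner = key mod p (assumed)
        }
    }

    // Reduce: sum all values that share a key on each rank.
    std::vector<std::map<Key, Value>> result(p);
    for (std::size_t r = 0; r < p; ++r) {
        for (const KeyValue& kv : received[r]) {
            result[r][kv.first] += kv.second;
        }
    }
    return result;
}
```

After the Shuffle, every value for a given key lives on one rank, which is what lets the Reduce run locally without further communication.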

Highly Accelerated MapReduce with Out-of-core support on GPUs

•  Hierarchical memory management for large-scale data parallel processing using multi-GPUs
–  Support out-of-core processing on GPU devices
–  Overlapping computation and communication

[Figure: Map, Shuffle, and Reduce stages. Within each stage the data is split into chunks; each chunk is copied between CPU and GPU by Memcpy (H2D, D2H) and processed on the GPU.]

Map/Reduce Implementation

•  Initialization before each operation
–  Remove unnecessary keys
–  Reordering data structures
•  Optimizations for GPU accelerators
–  Assign a warp (32 threads) per key to avoid warp divergence in Map/Reduce
–  Overlapping computation on GPU and data transfer between CPU and GPU

[Figure: pipeline Sort → Scan → Map/Reduce. Sort key-value pairs for Scan; compact keys to unique; overlap computation and data transfer.]
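The Sort and Scan steps that prepare the data for Map/Reduce can be sketched on the host as follows. This is an illustrative CPU version under stated assumptions, not the Hamar implementation: `Groups` and `group_by_key` are hypothetical names, and the standard library's `std::stable_sort` and `std::partial_sum` stand in for the GPU sort and scan.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

struct Groups {
    std::vector<int> unique_keys;      // one entry per distinct key
    std::vector<std::size_t> offsets;  // values of unique_keys[i] are values[offsets[i] .. offsets[i+1])
    std::vector<long> values;          // all values, reordered so equal keys are adjacent
};

Groups group_by_key(std::vector<std::pair<int, long>> pairs) {
    // Sort: bring equal keys together (stable, so values keep their input order).
    std::stable_sort(pairs.begin(), pairs.end(),
                     [](const std::pair<int, long>& a, const std::pair<int, long>& b) {
                         return a.first < b.first;
                     });

    // Flag the first element of every run of equal keys.
    const std::size_t n = pairs.size();
    std::vector<std::size_t> head(n), seg(n);
    for (std::size_t i = 0; i < n; ++i) {
        head[i] = (i == 0 || pairs[i].first != pairs[i - 1].first) ? 1 : 0;
    }

    // Scan: an inclusive prefix sum of the flags numbers the runs 1, 2, 3, ...
    std::partial_sum(head.begin(), head.end(), seg.begin());

    // Compact: each run head writes its key to slot (run number - 1).
    Groups g;
    const std::size_t num_keys = (n == 0) ? 0 : seg[n - 1];
    g.unique_keys.resize(num_keys);
    g.offsets.resize(num_keys + 1);
    g.values.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        if (head[i] == 1) {
            g.unique_keys[seg[i] - 1] = pairs[i].first;
            g.offsets[seg[i] - 1] = i;
        }
        g.values[i] = pairs[i].second;
    }
    g.offsets[num_keys] = n;
    return g;
}
```

With keys compacted this way, one warp can be assigned to each entry of `unique_keys` and walk its contiguous slice of `values`.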

GPU-based External Sort Implementation

•  Out-of-core GPU sorting algorithm *1
–  Adopted Sample-based Parallel Sorting Algorithm
–  Overlapping computation on GPU and data transfer between CPU and GPU

1. Divide input data into chunks, then sort on GPU for each chunk
2. Swap intermediate data on CPU
3. Sort intermediate data on GPU

*1: Y. Ye et al., "GPUMemSort: A High Performance Graphics Co-processors Sorting Algorithm for Large Scale In-Memory Data", GSTF International Journal on Computing, 2011
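The three steps of the external sort can be illustrated with a host-only sketch. It is a simplified sample-based scheme written for this note, not the GPUMemSort or Hamar code: `external_sort` is a hypothetical name, `std::sort` stands in for the GPU sort, and the sketch assumes the buckets produced in step 2 are small enough to fit in device memory, which skewed data could violate.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// `chunk` is the number of elements assumed to fit in device memory at once.
std::vector<int> external_sort(const std::vector<int>& input, std::size_t chunk) {
    if (input.empty() || chunk == 0) return input;

    // Step 1: divide the input into chunks and sort each chunk (GPU step).
    std::vector<std::vector<int>> chunks;
    for (std::size_t b = 0; b < input.size(); b += chunk) {
        const std::size_t e = std::min(b + chunk, input.size());
        chunks.emplace_back(input.begin() + b, input.begin() + e);
        std::sort(chunks.back().begin(), chunks.back().end());
    }
    const std::size_t k = chunks.size();

    // Pick k-1 splitters from evenly spaced samples of the sorted chunks.
    std::vector<int> samples;
    for (const std::vector<int>& c : chunks) {
        for (std::size_t s = 1; s < k; ++s) samples.push_back(c[s * c.size() / k]);
    }
    std::sort(samples.begin(), samples.end());
    std::vector<int> splitters;
    for (std::size_t s = 1; s < k; ++s) {
        splitters.push_back(samples[s * samples.size() / k]);
    }

    // Step 2: swap intermediate data on the host. Bucket i receives the
    // elements x with splitters[i-1] < x <= splitters[i].
    std::vector<std::vector<int>> buckets(k);
    for (const std::vector<int>& c : chunks) {
        for (int x : c) {
            const std::size_t i = static_cast<std::size_t>(
                std::lower_bound(splitters.begin(), splitters.end(), x) - splitters.begin());
            buckets[i].push_back(x);
        }
    }

    // Step 3: sort each bucket (GPU step) and concatenate in bucket order.
    std::vector<int> out;
    out.reserve(input.size());
    for (std::vector<int>& b : buckets) {
        std::sort(b.begin(), b.end());
        out.insert(out.end(), b.begin(), b.end());
    }
    return out;
}
```

Because the buckets cover disjoint, increasing key ranges, the final concatenation needs no merge pass.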

Application Example: GIM-V
Generalized Iterative Matrix-Vector multiplication *1

•  Easy description of various graph algorithms by implementing combine2, combineAll, assign functions
•  PageRank, Random Walk with Restart, Connected Component
–  v' = M ×G v, where
   v'_i = assign(v_i, combineAll_i({x_j | j = 1..n, x_j = combine2(m_{i,j}, v_j)}))  (i = 1..n)
–  Iterative 2-phase MapReduce operations

[Figure: v' = M ×G v. Stage 1 applies combine2 to m_{i,j} and v_j; stage 2 applies combineAll and assign to produce v'_i.]

*1: Kang, U. et al., "PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations", IEEE International Conference on Data Mining, 2009

Straightforward implementation using Hamar
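As an example of the three GIM-V functions, one PageRank iteration can be written as below. This is a sequential host-only sketch, not the Hamar implementation: `Edge` and `gimv_pagerank_step` are hypothetical names, and it assumes the PageRank instantiation from the PEGASUS paper with a column-normalized matrix M and damping factor c.

```cpp
#include <cstddef>
#include <vector>

// Non-zero matrix element m_{i,j}: row i, column j.
struct Edge {
    std::size_t row;
    std::size_t col;
    double weight;
};

// One GIM-V iteration v' = M xG v specialised to PageRank:
//   combine2(m_ij, v_j)  = c * m_ij * v_j
//   combineAll_i({x_j})  = (1 - c) / n + sum of x_j
//   assign(v_i, v_new)   = v_new
std::vector<double> gimv_pagerank_step(const std::vector<Edge>& m,
                                       const std::vector<double>& v,
                                       double c = 0.85) {
    const std::size_t n = v.size();

    // Stage 1: combine2 on every non-zero element, accumulated per row i.
    std::vector<double> partial(n, 0.0);
    for (const Edge& e : m) {
        partial[e.row] += c * e.weight * v[e.col];
    }

    // Stage 2: combineAll adds the teleport term; assign overwrites v_i.
    std::vector<double> next(n);
    for (std::size_t i = 0; i < n; ++i) {
        next[i] = (1.0 - c) / static_cast<double>(n) + partial[i];
    }
    return next;
}
```

In a MapReduce setting, stage 1 corresponds to the first Map/Reduce phase keyed by row index and stage 2 to the second; here both run as plain loops.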

Weak Scaling Performance [Sato, Shirahata et al. Cluster2014]

•  PageRank application on TSUBAME 2.5
•  Data size is larger than GPU memory capacity

[Figure: "SCALE 23 - 24 per Node". x-axis: Number of Compute Nodes (0 to 1200); y-axis: Performance [MEdges/sec] (0 to 3000). Series: 1CPU (S23 per node), 1GPU (S23 per node), 2CPUs (S24 per node), 2GPUs (S24 per node), 3GPUs (S24 per node). Annotations: 2.81 GE/s on 3072 GPUs (SCALE 34); 2.10x speedup (3 GPUs vs. 2 CPUs).]

Breakdown

•  Performance on 3 GPUs compared with 2 CPUs
–  SCALE 33, 1024 nodes
–  Map: 2.82x, Reduce: 1.11x, Sort: 5.04x speedup
•  Overlapping communication effectively

[Figure: stacked bars of Elapsed time [ms] (0 to 70000) for 1CPU, 1GPU, 2CPUs, 2GPUs, 3GPUs, broken down into Map, Shuffle, Reduce, Sort, Others.]

Towards Multilevel data management on Hamar using GPUs and NVMs [GTC2014]

How to design local storage for next-gen supercomputers?
–  Designed a local I/O prototype using 16 mSATA SSDs
–  Capacity: 4TB
–  Read bandwidth: 8 GB/s

[Figure: prototype hardware: mother board, RAID card, mSATA SSDs.]

[Figure: I/O performance of multiple mSATA SSDs. x-axis: # mSATAs (0 to 20); y-axis: Bandwidth [MB/s] (0 to 9000). Series: Raw mSATA 4KB, RAID0 1MB, RAID0 64KB. Annotation: 7.39 GB/s from 16 mSATA SSDs (Enabled RAID0).]

[Figure: I/O performance from GPU to multiple mSATA SSDs. x-axis: Matrix Size [GB] (0.274 to 140); y-axis: Throughput [GB/s] (0 to 3.5). Series: Raw 8 mSATA, 8 mSATA RAID0 (1MB), 8 mSATA RAID0 (64KB). Annotation: 3.06 GB/s from 8 mSATA SSDs to GPU.]

Sorting for Rapidly Increasing Datasets [Shamoto, Sato et al]

•  The need to process huge datasets is increasing due to the growth of data collection in various fields
–  Sensor data
–  SNS network
•  Fast sorting methods
–  Distributed Sorting: Sorting for distributed systems
   •  Splitter-based parallel sort
   •  Radix sort
   •  Merge sort
–  Sorting on heterogeneous architectures
   •  Many sorting algorithms are accelerated by many cores and high memory bandwidth.
•  How to sort on large-scale heterogeneous systems remains unclear

Existing Sorting Algorithms

Splitter-based parallel sorting
–  The flow of the algorithm
   1. Local sort: Each process sorts its own array
   2. Select splitters: Choose criteria for data segmentation
   3. Data transfer: Transfer data segments
   4. Local merge: Merge sorted arrays
–  Low communication costs
–  Drawback: computation costs start to dominate the overall performance

Sorting on GPU
–  There are many attempts to accelerate sorting
   •  Thrust sort [D. Merrill et al., 2011]
      –  Fast sorting for one compute node
   •  A GPU external sort [Y. Ye et al., 2010]
      –  Handles GPU memory overflows
   •  A multi-node GPU sort [K. L. Spafford et al., 2011]
      –  Does not sort huge data sets

Utilize GPU accelerators for splitter-based parallel sorting
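The four steps of splitter-based parallel sorting can be sketched by simulating the processes inside one program. This is a minimal illustration under stated assumptions, not HykSort or the authors' code: `splitter_sort` is a hypothetical name, splitters are chosen by regular sampling (one of several possible criteria), and the "data transfer" is a copy between vectors rather than MPI communication.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

using Array = std::vector<long>;

// procs[r] is the array owned by simulated process r. On return, every array
// is sorted and all keys on process r are <= all keys on process r + 1.
std::vector<Array> splitter_sort(std::vector<Array> procs) {
    const std::size_t p = procs.size();
    if (p == 0) return procs;

    // 1. Local sort: each process sorts its own array (the phase offloaded to the GPU).
    for (Array& a : procs) std::sort(a.begin(), a.end());

    // 2. Select splitters: sample each sorted array, then take p-1 global splitters.
    Array samples;
    for (const Array& a : procs) {
        for (std::size_t s = 1; s < p && !a.empty(); ++s) {
            samples.push_back(a[s * a.size() / p]);
        }
    }
    std::sort(samples.begin(), samples.end());
    Array splitters;
    for (std::size_t s = 1; s < p && !samples.empty(); ++s) {
        splitters.push_back(samples[s * samples.size() / p]);
    }

    // 3. Data transfer: cut each sorted array at the splitters; segment d goes to process d.
    std::vector<std::vector<Array>> inbox(p);
    for (const Array& a : procs) {
        Array::const_iterator first = a.begin();
        for (std::size_t d = 0; d < p; ++d) {
            Array::const_iterator last =
                (d < splitters.size()) ? std::upper_bound(first, a.end(), splitters[d])
                                       : a.end();
            inbox[d].emplace_back(first, last);
            first = last;
        }
    }

    // 4. Local merge: each process merges the sorted segments it received.
    for (std::size_t d = 0; d < p; ++d) {
        Array merged;
        for (const Array& seg : inbox[d]) {
            Array tmp;
            std::merge(merged.begin(), merged.end(), seg.begin(), seg.end(),
                       std::back_inserter(tmp));
            merged.swap(tmp);
        }
        procs[d] = merged;
    }
    return procs;
}
```

Only step 3 moves data between processes, which is why the communication cost stays low and the local sort and merge come to dominate at scale.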

GPU implementation for Splitter-based Parallel Sorting

•  Offloading the most time-consuming phase to GPU accelerators

[Figure: Elapsed time [s] (0 to 40) vs. # of processes (4 to 2048, 2 processes per node), broken down into: synchronization costs, data transfer and merge, local sort (original), merge (remaining arrays), select splitters.]

[Figure: algorithm flow. The local sort (unsorted to sorted) runs on the GPU, followed by select splitters, data transfer, and merge.]

•  2 ~ 1024 nodes (4 ~ 2048 GPUs) on TSUBAME2.5
•  2 processes per node, and each node holds 2GB of 64-bit integers

Weak Scaling Performance

[Figure: Keys/second (millions) (0 to 30000) vs. # of processes (0 to 2000, 2 processes per node). Series: HykSort 1thread (single-threaded implementation), HykSort 6threads (multi-threaded implementation), HykSort GPU + 6threads (GPU implementation based on the multi-threaded implementation). Annotations when the # of processes is 2048: x1.4, x3.6.]

Performance Prediction

•  PCIe_#: # GB/s bandwidth of interconnect between CPU and GPU

[Figure: two panels of Keys/second (millions) (0 to 60000) vs. # of processes (0 to 2000, 2 processes per node). Series: HykSort 6threads, HykSort GPU + 6threads, PCIe_10, PCIe_50, PCIe_100, PCIe_200, Prediction of our implementation. One panel is labeled "K20x x4 faster than K20x".]

•  8.8% reduction of overall runtime when the accelerators work 4 times faster than K20x
•  x2.2 speedup when the PCIe bandwidth increases to 50GB/s

\W]�•  (')��!� YLOa�$)b¬JS��ßÿzGuã«�÷ZPĞø�– ��!�&��ßÿzGuĄĒã«��G��Gi��– '964<<0:�-,;0/Z�ētG{��

•  �$)Zæù¢X½¬NúÔ¢�– tG{Eqh��E0<.�

•  �$)Z��YÙ\_XJßÿZzGuner�– úâ¢Xq{�G��jzGuë«F�#=<�81�.8:0d�m�r�Z½¬�
