A Scalable Double In-memory Checkpoint and Restart Scheme Towards Exascale
-
Gengbin Zheng, Xiang Ni, Laxmikant V. Kale
Parallel Programming Lab, University of Illinois at Urbana-Champaign
Charm++ Workshop 2012
-
Motivation
As machines grow in size, MTBF decreases
  Jaguar had an average of 2.33 failures/day from 2008 to 2010
Applications have to tolerate faults
Challenges for exascale:
  Disk-based (NFS reliable disk) checkpointing is slow
  System-level checkpointing can be expensive
  Scalable checkpointing/restart can be a communication-intensive process
  Job schedulers prevent fault tolerance support in the runtime
-
Motivation (cont.)
Applications on future exascale machines need fast, low-cost, and scalable fault tolerance support
Previous work: double in-memory checkpoint/restart scheme
  In the production version of Charm++ since 2004
-
Double In-memory Checkpoint/Restart Protocol
[Figure: objects A-J distributed across PE0-PE3, with legend entries "object", "checkpoint 1", "checkpoint 2", and "restored object". After PE1 crashes (1 processor lost), the remaining processors PE0, PE2, and PE3 hold the restored objects.]
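The protocol pictured on this slide can be illustrated with a small simulation. This is a minimal sketch, not the Charm++ implementation: it assumes each PE keeps one checkpoint of its objects locally and a second copy on a "buddy" PE, that the buddy is simply the next PE in a ring, and that recovered objects are spread round-robin over the survivors. The names `buddy_of`, `checkpoint`, and `restart` are hypothetical.

```python
import copy


def buddy_of(pe, num_pes):
    """Hypothetical buddy assignment: the next PE in a ring."""
    return (pe + 1) % num_pes


def checkpoint(objects, num_pes):
    """Keep two in-memory copies of each PE's objects.

    objects maps PE -> {object name: state}.
    Returns PE -> {owner PE: snapshot of the owner's objects}.
    """
    store = {pe: {} for pe in range(num_pes)}
    for pe, objs in objects.items():
        store[pe][pe] = copy.deepcopy(objs)                      # checkpoint 1 (local)
        store[buddy_of(pe, num_pes)][pe] = copy.deepcopy(objs)   # checkpoint 2 (buddy)
    return store


def restart(store, failed_pe, num_pes):
    """Rebuild the objects of every PE after failed_pe loses its memory."""
    survivors = [pe for pe in range(num_pes) if pe != failed_pe]
    # Survivors roll back to their own local checkpoint.
    restored = {pe: copy.deepcopy(store[pe][pe]) for pe in survivors}
    # The failed PE's objects survive only in its buddy's memory.
    lost = store[buddy_of(failed_pe, num_pes)][failed_pe]
    # Assumption: spread the recovered objects round-robin over the survivors.
    for i, (name, state) in enumerate(sorted(lost.items())):
        restored[survivors[i % len(survivors)]][name] = copy.deepcopy(state)
    return restored


objects = {
    0: {"A": 1, "B": 2, "C": 3},
    1: {"D": 4, "E": 5},
    2: {"F": 6, "G": 7},
    3: {"H": 8, "I": 9, "J": 10},
}
store = checkpoint(objects, num_pes=4)
restored = restart(store, failed_pe=1, num_pes=4)
```

In this toy run PE1 is lost, and its objects D and E reappear on surviving PEs from the copy held by PE2. Two copies per object are why memory usage doubles.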
-
Runtime Support for FT
Automatically checkpointing threads
  Including stack and heap (isomalloc)
User helper functions
  To pack and unpack data
  Checkpointing only the live variables
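The idea of user helper functions that pack and unpack only the live variables can be sketched as follows. This is an illustration in plain Python, not the Charm++ API; the `Particle` class, its fields, and the `pack`/`unpack` method names are assumptions made for the example.

```python
import pickle


class Particle:
    """Toy object; the field names are illustrative only."""

    def __init__(self, position, velocity):
        self.position = position
        self.velocity = velocity
        # Temporary buffer that is recomputed every step, so it is not "live"
        # at checkpoint time.
        self.scratch = [0.0] * 1024

    def pack(self):
        # Serialize only the live variables; scratch is deliberately skipped.
        return pickle.dumps({"position": self.position,
                             "velocity": self.velocity})

    @classmethod
    def unpack(cls, blob):
        # Rebuild the object; the constructor recreates the scratch buffer.
        state = pickle.loads(blob)
        return cls(state["position"], state["velocity"])


p = Particle([1.0, 2.0, 3.0], [0.1, 0.0, -0.1])
blob = p.pack()
q = Particle.unpack(blob)
```

Skipping the scratch buffer keeps the checkpoint much smaller than a dump of the whole object, which matters when every checkpoint is held in memory twice.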
-
Local Disk-Based Protocol
Double in-memory checkpointing
  Memory concern
  Pick a checkpointing time when the global state is small
    MD, N-body, quantum chemistry
Double in-disk checkpointing
  Makes use of local disk (or SSD)
  Also does not rely on any reliable storage
  Useful for applications with a very big memory footprint
-
Previous Results: Performance Comparisons with Traditional Disk-based Checkpointing
Checkpoint overhead (s) by problem size (MB)

Problem size (MB)  double in-memory (Myrinet)  double in-memory (100Mb)  Local Disk  double in-disk (Myrinet)  NFS disk
6.4                0.004                       0.042                     0.218       0.387                     2.196
12.8               0.008                       0.045                     0.244       0.405                     2.234
25.6               0.016                       0.097                     0.623       0.78                      2.353
51.2               0.032                       0.171                     1.198       1.148                     2.546
102.4              0.061                       0.304                     1.216       1.585                     3.52
204.8              0.14                        0.6                       1.598       2.164                     8.3
409.6              0.27                        1.188                     2.112       3.582                     17.65
819.2              0.517                       2.37                      4.669       6.854                     33.2
1638.4             1.035                       4.71                      6.901       14.012                    79.71
3276.8             2.029                       9.43                      11.762      26.629                    129.87
6553.6             3.845                       18.83                     21.481      47.06                     215.78
-
Previous Results: Restart with Load Balancing
LeanMD, Apoa1, 128 processors
[Figure: two plots of simulation time per step (s) versus timestep, one "Without LB" and one "With LB", each covering 605 timesteps. In both series the time per step spikes to about 3.3-3.4 s around the 206th step. Without LB it then stays near 1.9-2.1 s per step; with LB it returns to roughly 0.9-1.5 s per step.]
-
Previous Result: Recovery Performance
10 crashes
128 processors
Checkpoint every 10 time steps
-
LeanMD with Apoa1 benchmark
90K atoms
8498 objects
-
FT on MPI-based Charm++
Practical challenge: job scheduler
  Job scheduler kills the entire job when a process fails
MPI-based Charm++ is portable across major supercomputers
A fault injection scheme in the MPI machine layer
  DieNow(): the MPI process stops responding
  Fault detection by keep-alive messages
  Spare processors to replace failed ones
Demonstrated on 64K cores of a BG/P machine
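Fault detection by keep-alive messages and replacement by spare processors can be sketched as below. This is a hedged illustration, not the Charm++ machine layer: the class `KeepAliveDetector`, the function `assign_spares`, the timeout value, and the explicit `now` timestamps are all assumptions chosen to keep the example deterministic.

```python
class KeepAliveDetector:
    """Tracks the last keep-alive time seen from each PE (illustrative names)."""

    def __init__(self, pes, timeout):
        self.timeout = timeout
        self.last_seen = {pe: 0.0 for pe in pes}

    def heartbeat(self, pe, now):
        # Record that a keep-alive message from pe arrived at time now.
        self.last_seen[pe] = now

    def failed(self, now):
        # A PE is suspected dead once it has been silent longer than timeout.
        return sorted(pe for pe, seen in self.last_seen.items()
                      if now - seen > self.timeout)


def assign_spares(failed_pes, spares):
    """Map each failed PE to a spare PE, in order."""
    if len(failed_pes) > len(spares):
        raise RuntimeError("not enough spare processors")
    return dict(zip(failed_pes, spares))


detector = KeepAliveDetector(pes=range(4), timeout=2.0)
for pe in range(4):
    detector.heartbeat(pe, now=1.0)
for pe in (0, 2, 3):              # PE1 has stopped responding (injected fault)
    detector.heartbeat(pe, now=2.0)
failed = detector.failed(now=3.5)
replacement = assign_spares(failed, spares=[4, 5])
```

Here PE1 has been silent for 2.5 time units at the check, which exceeds the 2.0 timeout, so it is reported as failed and mapped to spare PE4.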
-
Performance at Large Scale
-
Optimization for Scalability
Communication bottlenecks
  Checkpoint/restart takes O(P) time
Optimizations:
  Collectives (barriers)
    Switch the O(P) barrier to a tree-based barrier
  Stale message handling
    Epoch number
    A phase to discard stale messages as quickly as possible
  Small messages
    Streaming optimization
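Stale message handling with an epoch number can be illustrated with a short sketch. It assumes that every message carries the epoch in which it was sent and that each receiver increments its own epoch on restart, so anything sent before the rollback can be dropped on arrival. The `Message` and `Receiver` types are hypothetical and do not reflect the actual Charm++ message format.

```python
from dataclasses import dataclass


@dataclass
class Message:
    epoch: int      # epoch in which the sender created the message
    payload: str


class Receiver:
    def __init__(self):
        self.epoch = 0
        self.delivered = []
        self.discarded = 0

    def on_restart(self):
        # Rolling back to a checkpoint starts a new epoch.
        self.epoch += 1

    def receive(self, msg):
        # Messages from an older epoch were sent before the rollback: drop them.
        if msg.epoch < self.epoch:
            self.discarded += 1
            return False
        self.delivered.append(msg.payload)
        return True


r = Receiver()
r.receive(Message(epoch=0, payload="step 41"))
r.on_restart()
r.receive(Message(epoch=0, payload="step 42 (stale)"))
r.receive(Message(epoch=1, payload="step 41 (replayed)"))
```

The comparison is a single integer check per message, which is what lets stale traffic be discarded quickly.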
-
LeanMD Checkpoint Time before/after Optimization
-
Checkpoint Time for Jacobi/AMPI
Kraken
-
LeanMD Restart Time
-
Conclusions and Future Work
In-memory checkpointing after optimization is scalable towards exascale
A short paper was accepted at the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2012)
Future work:
  Non-blocking checkpointing
Memory usage increases by a factor of 2.
Log scale. Varied the problem size from 6.4 MB to as big as 6 GB. 32 processors.
Run time with multiple
The original spanning tree cannot handle failed processors.
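One way to picture a spanning tree that tolerates failed processors is to build it over the surviving PEs only, so a tree-based barrier keeps a depth that grows logarithmically with the number of live PEs. The sketch below is an assumption about how this could look, not a description of the Charm++ fix; `build_tree`, `depth`, and the fanout of 2 are hypothetical.

```python
def build_tree(live_pes, fanout=2):
    """Return {pe: parent} for a fanout-ary tree over the live PEs only."""
    order = sorted(live_pes)
    parent = {}
    for i, pe in enumerate(order):
        # Position i in the sorted list hangs under position (i - 1) // fanout.
        parent[pe] = None if i == 0 else order[(i - 1) // fanout]
    return parent


def depth(parent):
    """Longest path, in hops, from any PE up to the root."""
    def hops(pe):
        n = 0
        while parent[pe] is not None:
            pe = parent[pe]
            n += 1
        return n
    return max(hops(pe) for pe in parent)


all_pes = range(8)
failed = {1}
tree = build_tree(pe for pe in all_pes if pe not in failed)
tree_depth = depth(tree)
```

Because the tree is indexed by position among the live PEs rather than by raw PE number, no surviving PE ends up with a failed parent.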