
Universitat Politecnica de Catalunya

Measurement and Tools Project Report

Dimemas and Multi-level Cache Simulations

Author: Mario Almeida

Supervisor: Alejandro Ramirez Bellido

June 22, 2012

Contents

1 Introduction

2 Methodology
  2.1 Dimemas Simulation
  2.2 Multi-Level Cache Simulation

3 Results
  3.1 Dimemas Simulation
  3.2 Multi-Level Cache Simulation

4 Conclusions

A Used Scripts
  A.1 Dimemas instrumentation
    A.1.1 Generating Dimemas Configuration
    A.1.2 Running experiments
    A.1.3 Graph generator
    A.1.4 Generating graphs
  A.2 Pin tool instrumentation
    A.2.1 Generate and Compile Application and DCache tool
    A.2.2 Running the experiments
    A.2.3 Importing the results to a database
    A.2.4 Generating graphs

Abstract

This report describes the simulation and benchmarking steps taken to predict the parallel performance of an application using Dimemas and cache-level simulations. Using Dimemas [3], the time behaviour of the NAS [1] integer sort benchmark was simulated for the architecture of MareNostrum [4], the supercomputer of the Barcelona Supercomputing Center. The performance was evaluated as a function of the architecture's latency, bandwidth, connectivity and CPU speed. For the cache-level simulations, Intel's Pin tool was used to benchmark a simple parallel application as a function of the cache and cluster sizes.

1 Introduction

This report describes the simulation and benchmarking steps taken to predict the parallel performance of an application using Dimemas [3] and cache-level simulations.

Previous work focused on benchmarking a PARSEC [2] ray-tracing application on the multi-processor Boada server. For that purpose, EXTRAE and Paraver [5] were used to instrument the application and to provide a detailed quantitative analysis of its performance.

Following the study of measurement tools and techniques, this report describes the use of Dimemas to simulate the time behaviour of another benchmark application on MareNostrum, the supercomputer of the Barcelona Supercomputing Center. This time the traces were taken from a NAS benchmark application, also running on the Boada server. The performance of the application in this simulation environment was evaluated as a function of the architecture's latency, bandwidth, connectivity and CPU speed.

To conclude this study on performance analysis, cache-level simulations were performed using Intel's Pin tool. The chosen application was a simple parallel application that performs distributed arithmetic operations. It represents the typical master-slave paradigm with an embarrassingly parallel workload. To evaluate the cache architecture, the total cache miss rates per cache level were calculated as a function of the cache sizes, associativity, number of threads and cluster size.

2 Methodology

This section presents the two simulation configurations: Dimemas and multi-level cache simulations. Each subsection describes the tools, configuration values and metrics used.


Boada Server
Bandwidth          1 Gb/s
Latency            6-10 us
Number of cores    12
RAM                24 GB

Table 1: Boada server configuration.

2.1 Dimemas Simulation

The application chosen for this experiment was integer sort, from the NAS Parallel Benchmarks. The NAS benchmarks are a set of programs designed to help evaluate the performance of parallel supercomputers. In this case, the benchmark was run on the Boada server, whose attributes are described in table 1.

For the architecture simulation, the MareNostrum supercomputer configuration was used; its parameters are shown in table 2. Note that a simplification was made: each processor was considered to run a single thread. Starting from MareNostrum's original architecture, multiple simulations were performed, changing its attributes. For this purpose, the script in section A.1.1 was created to generate Dimemas configuration files, and another (section A.1.2) to automate the variations. The attributes changed in the simulated architecture were the latency, CPU speed, bandwidth and number of buses. All the measurements were stored in a sqlite3 database and then queried to automatically generate, using gnuplot, the graphs (section A.1.3) presented in section 3.

To conclude, the changed attributes were fixed, one after another, at a chosen optimal value, in order to find a final architecture that needs fewer resources while having execution times similar to those of the original MareNostrum configuration.
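The one-parameter-at-a-time variation described above can be sketched as follows. This is a minimal Python illustration, not the script actually used (that one is the bash script in section A.1.2). The function name `one_at_a_time` and the shortened sweep lists are assumptions made for the example; the baseline values are the defaults set in that script.

```python
def one_at_a_time(baseline, sweeps):
    """Yield one configuration per swept value, changing a single
    parameter and keeping all the others at their baseline value."""
    for name, values in sweeps.items():
        for value in values:
            config = dict(baseline)  # copy, so the baseline is never modified
            config[name] = value
            yield config


# Baseline taken from the defaults of the experiment script.
baseline = {"buses": 0, "latency": 0.000008, "bandwidth": 250.0, "cpu": 1.0}

# Shortened, illustrative sweep lists (the real script uses longer ones).
sweeps = {
    "buses": [1, 2, 4, 8, 16],
    "latency": [0.000001, 0.0001, 0.01],
    "bandwidth": [240.0, 220.0, 200.0],
    "cpu": [0.95, 0.9, 0.5],
}

configs = list(one_at_a_time(baseline, sweeps))
for config in configs:
    print(config)
```

Each generated configuration differs from the baseline in exactly one attribute, which is what allows the impact of each attribute to be plotted separately.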

2.2 Multi-Level Cache Simulation

To conclude this study on performance analysis, cache-level simulations were performed using Intel's Pin tool. The chosen application was a simple parallel application that performs distributed arithmetic operations. It represents the typical master-slave paradigm with an embarrassingly parallel workload.

To evaluate the cache architecture, the Pin tool dcache application was changed to support multiple levels of cache shared by parallel processors. The implemented cache architecture is represented in figure 1. As the figure shows, the level-two cache is shared within a cluster and the level-three cache is shared globally.

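[Figure 1: diagram of the cache hierarchy. Processors P0 to P7 form one cluster and P8 to P15 another. Each processor has a private L1 cache, each cluster shares an L2 cache, and a single L3 cache is shared globally. Size of L1 = 16 KB, size of L2 = 1 MB, size of L3 = 4 MB.]

Figure 1: Cache architecture for a cluster size of 8 and a total of 16 processors.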
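The hierarchy of figure 1 can be summarised by the mapping from a processor to the caches it uses. The sketch below is only an illustration: the function name `cache_path` is hypothetical, and grouping consecutive processor numbers into clusters is an assumption read off the figure (P0 to P7 in one cluster, P8 to P15 in the other).

```python
def cache_path(pid, cluster_size):
    """Return the caches a memory access from processor `pid` may visit,
    in lookup order: private L1, cluster-shared L2, globally shared L3."""
    cluster = pid // cluster_size  # consecutive processors share a cluster
    return (f"L1[{pid}]", f"L2[{cluster}]", "L3")


# 16 processors with a cluster size of 8, as in figure 1.
for pid in (0, 7, 8, 15):
    print(pid, cache_path(pid, cluster_size=8))
```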

For these experiments, the total cache miss rates per cache level were calculated as a function of the cache sizes, the number of processors and the cluster size. Some experiments were also performed varying the cache associativity and the number of cache lines per cache set.
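To make the metric concrete, the following sketch shows how a miss rate depends on cache size, line size and associativity for a single cache level. It is a simplified stand-in, not the modified dcache tool: the class name `SetAssociativeCache` is hypothetical and LRU replacement is an assumption chosen for brevity. The number of sets is computed as size / (line size * associativity).

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement (an assumption)."""

    def __init__(self, size_bytes, line_bytes, associativity):
        self.line_bytes = line_bytes
        self.associativity = associativity
        self.num_sets = size_bytes // (line_bytes * associativity)
        # One ordered dict per set: keys are tags, order encodes recency.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Record one access and return True on a hit, False on a miss."""
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(ways) >= self.associativity:
            ways.popitem(last=False)  # evict the least recently used line
        ways[tag] = True
        return False

    def miss_rate(self):
        """Miss rate in percent over all accesses seen so far."""
        total = self.hits + self.misses
        return 100.0 * self.misses / total if total else 0.0


# 16 KB cache, 32-byte lines, 4-way: read 1 KB sequentially, 4 bytes at a time.
cache = SetAssociativeCache(16 * 1024, 32, 4)
for address in range(0, 1024, 4):
    cache.access(address)
print(cache.num_sets, cache.misses, cache.miss_rate())
```

In this sequential example only the first access to each 32-byte line misses, so one access in eight is a miss.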

3 Results

This section describes the results of both experiments, together with the resulting charts and their discussion.

3.1 Dimemas Simulation

Starting with the initial architecture of MareNostrum, the first experiment consisted of varying the number of buses and observing the impact on the execution time of the application. The results of this experiment are depicted in figure 2.

As can be observed in figure 2, the execution time decreases as the number of buses increases. This result was expected, since this is a multi-threaded application in which data is transferred between threads, and adding more buses increases the amount of data that can be transferred in parallel. It can also be seen that from sixteen buses onwards the execution time starts to stabilize. This is probably because most of the data to be sent is already sent in parallel, so adding further buses does not affect the performance.

[Figure 2: plot "Execution time with variable buses". x axis: buses (0 to 20); y axis: execution time (s) (0 to 1600); one series each for #Procs = 2, 4, 8, 16 and 32.]

Figure 2: Execution time of IntegerSort depending on the number of buses.

The second experiment consisted of varying the available bandwidth, starting from the initial MareNostrum configuration. The results are shown in figure 3.

[Figure 3: plot "Execution time with variable bandwidth". x axis: bandwidth (170 to 250); y axis: execution time (s) (0 to 140).]

Figure 3: Execution time of IntegerSort depending on the bandwidth (MB/s).

Figure 3 shows that the bandwidth has a bigger impact on performance when the application is run on a smaller number of threads. For example, a variation of 40 MB/s can increase the execution time by 20 seconds for four threads, but for 32 threads the changes are almost unnoticeable. This is probably because the master thread has to send the initial data to all slaves: as the number of slaves increases, the data is divided into smaller chunks that can be sent in parallel, and thus takes less time.
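The explanation above can be illustrated with a deliberately simple model. Everything in this sketch is an assumption made for illustration: each transfer is taken to cost latency + size / bandwidth, all chunks are assumed to travel fully in parallel, and the 1 GB data size is hypothetical. It does not reproduce the Dimemas simulation or the measured values; it only shows why the sensitivity to bandwidth shrinks as the data is split among more slaves.

```python
def scatter_time(total_bytes, n_slaves, latency_s, bandwidth_mb_s):
    """Time for the master to hand out equal chunks to all slaves,
    assuming every chunk travels in parallel and each transfer costs
    latency + size / bandwidth (1 MB taken as 1e6 bytes)."""
    chunk_bytes = total_bytes / n_slaves
    return latency_s + chunk_bytes / (bandwidth_mb_s * 1e6)


def bandwidth_penalty(total_bytes, n_slaves, latency_s, high_bw, low_bw):
    """Extra time caused by lowering the bandwidth from high_bw to low_bw."""
    slow = scatter_time(total_bytes, n_slaves, latency_s, low_bw)
    fast = scatter_time(total_bytes, n_slaves, latency_s, high_bw)
    return slow - fast


data = 1e9  # hypothetical amount of initial data, in bytes
for threads in (4, 32):
    slaves = threads - 1  # one thread acts as the master
    penalty = bandwidth_penalty(data, slaves, 8e-6, 250.0, 210.0)
    print(threads, "threads:", penalty, "s lost for a 40 MB/s drop")
```

Under these assumptions the time lost to a given bandwidth drop is inversely proportional to the number of slaves, which matches the qualitative trend described for figure 3.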

The third experiment consisted of varying the processing capacity of the CPU. As can be observed in figure 4, increasing the processing power of each processor decreases the execution time. The impact is more noticeable for processing capacities below 100%. This leaves little room for saving resources by decreasing the CPU power, since a small decrease has a big impact on the execution time.

[Figure 4: plot "Execution time with variable cpu". x axis: cpu (0 to 5); y axis: execution time (s) (0 to 500).]

Figure 4: Execution time of IntegerSort depending on the available CPU (%).

To conclude the experiments on the variation of the architecture parameters, figure 5 shows the impact of latency on the execution time.

For figure 5 a logarithmic scale was chosen for the x axis, since changes of the same order of magnitude as the initial MareNostrum value do not have a significant impact on the execution time. The latency can be increased to significantly bigger values without much effect on the performance, since the latency values in MareNostrum are very small. Only for latencies close to 0.01 seconds does the execution time start to increase noticeably. This attribute should have a bigger impact on more communication-intensive applications.

[Figure 5: plot "Execution time with variable latency". x axis: latency (1e-06 to 1, logarithmic); y axis: execution time (s) (0 to 600).]

Figure 5: Execution time of IntegerSort depending on the latency (s).

Figure 6: Execution time of IntegerSort depending on the number of threads.

To conclude, table 2 compares two less resource-demanding configurations with the original MareNostrum configuration, all achieving similar execution times. The chosen number of threads was 32, due to its better performance, as shown in figure 6.

Parameters            MareNostrum   Config 1   Config 2
CPU (%)               1.0           0.95       0.9
Latency (s)           0.000008      0.0001     0.001
Bandwidth (MB/s)      250.0         240        230
Number of buses       20+ *         16         16
Execution time (s)    12.506        13.150     13.779

Table 2: Comparison between the execution times of the initial MareNostrum configuration and its less resource-demanding configurations.

Table 2 confirms the predictions made in the previous experiments. The chosen values increase the execution time by about 0.6 seconds (Config 1) and 1.3 seconds (Config 2), while reducing the CPU speed and the bandwidth by 4 to 10% and increasing the latency by one to two orders of magnitude.
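The trade-off in table 2 can be checked with a few lines of arithmetic. The values below are copied from the table; the dictionary layout and the helper `relative_change` are just illustrative choices. The number of buses is left out because the MareNostrum entry ("20+") is not a single number.

```python
table2 = {
    "MareNostrum": {"cpu": 1.0, "latency": 0.000008, "bandwidth": 250.0, "time": 12.506},
    "Config 1": {"cpu": 0.95, "latency": 0.0001, "bandwidth": 240.0, "time": 13.150},
    "Config 2": {"cpu": 0.9, "latency": 0.001, "bandwidth": 230.0, "time": 13.779},
}


def relative_change(new, old):
    """Change from old to new, as a percentage of old."""
    return 100.0 * (new - old) / old


base = table2["MareNostrum"]
for name in ("Config 1", "Config 2"):
    cfg = table2[name]
    print(
        name,
        f"time {cfg['time'] - base['time']:+.3f} s "
        f"({relative_change(cfg['time'], base['time']):+.1f}%),",
        f"cpu {relative_change(cfg['cpu'], base['cpu']):+.1f}%,",
        f"bandwidth {relative_change(cfg['bandwidth'], base['bandwidth']):+.1f}%,",
        f"latency x{cfg['latency'] / base['latency']:.1f}",
    )
```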

3.2 Multi-Level Cache Simulation

As previously mentioned, the chosen application was a simple parallel application that performs distributed arithmetic operations. It represents the typical master-slave paradigm with an embarrassingly parallel workload.

[Figure 7: plot "MissRate of cache L2 per cluster size (Lsize=[16,1,4])". x axis: cluster size (2 to 8); y axis: total miss rate (%) (28 to 50); one series each for #procs = 2, 4, 8 and 16.]

Figure 7: MissRate of Cache L2 for L1, L2 and L3 sizes of, respectively, 16 KB, 1 MB and 4 MB.

To evaluate the cache architecture, it was varied along multiple factors, such as the cluster size, the cache sizes and the cache line sizes. To start these experiments, the cache architecture was set as shown in figure 1. It has 16 processors, each with one L1 cache of 16 KB. The level-two cache has 1 MB and is shared within a cluster, with a cluster size of 8. Finally, the level-three cache is globally shared and has a size of 4 MB.

The first experiment consisted of varying the cluster size, as shown in figure 7, and verifying its impact on the L2 cache miss rate. As can be seen, for the numbers of threads used in this experiment the impact of the cluster size on the miss rates was not very significant. For up to 4 threads it has almost no impact at all, but when the system has more than 8 threads it can reduce the miss rate by 2%. It is interesting to note that, in this experiment, the more threads share the same L2 cache, the lower the miss rate becomes.

Since most cache size configurations produced similar variations in the cluster size experiment, the next step consisted of verifying the impact of the cache sizes on the miss rates. The first step was to vary the size of the non-shared L1 cache; the results are presented in figure 8.

[Figure 8: plot "MissRate of cache 1 per L1 size". x axis: size of cache L1 (15 to 65); y axis: total miss rate (%) (8 to 15); recovered legend entry: #procs = 16.]

Figure 8: MissRate of Cache L1 for a variable L1 cache size (KB).

Looking at figure 8, it might seem strange that a smaller number of threads has such a low miss rate. This is because of the master/slave paradigm, which makes the accesses to data more sparse as the number of threads increases. For bigger numbers of threads the miss rates can reach values close to 15%. As expected, bigger L1 caches achieve smaller miss rates, although the difference is no greater than 2%.

Although the experiments were performed for more L1 cache sizes, the L1 cache size was fixed at 16 KB in order to study the impact of the L2 cache size. The variation of the L2 cache size is presented in figure 9. As can be observed, the miss rate of the L2 cache for 2 threads is high, close to 50%. This is probably due to the low miss rate of the L1 cache: the accesses that do not hit in L1 should have lower predictability. For bigger numbers of threads the miss rates are still high, although they do not exceed 33%.

[Figure 9: plot "MissRate of cache 2 per L2 size (Lsize=[16,.,.])". x axis: size of cache L2 (1 to 4); y axis: total miss rate (%) (30 to 50); recovered legend entry: #procs = 16.]

Figure 9: MissRate of Cache L2 for a variable L2 cache size (MB) and an L1 cache size of 16 KB.

[Figure 10: plot "MissRate of cache L3 per L3 size (Lsize=[16,1,.])". x axis: size of cache L3 (4 to 16); y axis: total miss rate (%) (0 to 100); recovered legend entry: #procs = 16.]

Figure 10: MissRate of Cache L3 for a variable L3 cache size (MB) and an L1 cache size of 16 KB.

Finally, the impact of the L3 cache size on the L3 miss rate is shown in figure 10. It seems that accesses that do not hit in the previous two cache levels will hardly hit in the third level. The only exception is the 2-thread case, for which the set of accessed data is bigger. This probably shows that either the application does not justify the use of three levels of cache, or the data accessed by each thread at each moment is too small.

4 Conclusions

Dimemas made it possible to experiment with the theoretical performance of the application on the MareNostrum architecture. By varying each parameter, it was possible to create graphs depicting its impact on the execution time. By the end of the experiment it was possible to suggest an architecture with fewer resources that achieves results similar to those of the initial MareNostrum architecture. This architecture is presented in table 2 and confirms the predictions made in the Dimemas experiments. The chosen values increase the execution time by about 0.6 to 1.3 seconds, while reducing the CPU speed and the bandwidth by 4 to 10% and increasing the latency by one to two orders of magnitude.

For the second experiment, the impact of the cluster size and of the cache sizes was presented for a simple parallel application performing arithmetic calculations. The experiments showed that the impact of the cluster size on the miss rate was not very significant: for more than 8 threads it can reduce the miss rate by 2%. Overall, the more threads share the same L2 cache, the lower the miss rate becomes. This is because of the master/slave paradigm, which makes the accesses to data more sparse as the number of threads increases. As expected, bigger L1 caches achieve smaller miss rates. For big numbers of threads, the miss rates in the L2 cache were high, although they did not exceed 33%. In general, accesses that did not hit in the first two cache levels hardly hit in the third level. The experiments showed that either the application does not justify the use of three levels of cache, or the data accessed by each thread at each moment is too small.

Scripting the experiments had a huge impact on the time needed to perform them. Some of the experiments produced thousands of results. The technique that proved most efficient was to script the generation of results, output them to a SQL database and perform queries to generate the graphs through gnuplot.
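A minimal sketch of this store-then-query workflow is given below, using Python's standard `sqlite3` module instead of the command-line tool and gnuplot used in the appendix. The table schema is the one created by the script in section A.1.2 and the two rows are Config 1 and Config 2 from table 2 (32 threads, 16 buses); the in-memory database and the helper name `series` are assumptions made for the example.

```python
import sqlite3

# In-memory database standing in for out/results/res.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dimemas (procs INTEGER, buses INTEGER, latency REAL, "
    "bandwidth REAL, cpu REAL, runtime REAL)"
)

# Config 1 and Config 2 from table 2.
rows = [
    (32, 16, 0.0001, 240.0, 0.95, 13.150),
    (32, 16, 0.001, 230.0, 0.9, 13.779),
]
conn.executemany("INSERT INTO dimemas VALUES (?, ?, ?, ?, ?, ?)", rows)


def series(conn, column, procs):
    """Return (column value, runtime) pairs for one thread count,
    i.e. the data behind one curve of a plot."""
    allowed = {"buses", "latency", "bandwidth", "cpu"}
    if column not in allowed:  # column names cannot be bound as parameters
        raise ValueError(f"unknown column: {column}")
    query = (
        f"SELECT {column}, runtime FROM dimemas "
        f"WHERE procs = ? ORDER BY {column}"
    )
    return conn.execute(query, (procs,)).fetchall()


print(series(conn, "bandwidth", 32))
```

Keeping every measurement in one table means that a new graph only needs a new query, not a new run of the experiments.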

References

[1] http://www.nas.nasa.gov/publications/npb.html, NAS benchmark.

[2] http://parsec.cs.princeton.edu/, PARSEC benchmark.

[3] http://www.bsc.es/computer-sciences/performance-tools/dimemas, Dimemas.

[4] http://en.wikipedia.org/wiki/MareNostrum, MareNostrum.

[5] http://www.bsc.es/computer-sciences/performance-tools/paraver, Paraver.

A Used Scripts

A.1 Dimemas instrumentation

A.1.1 Generating Dimemas Configuration

#!/bin/bash

if [ $# -ne 6 ]
then
    echo "$0: Wrong number of arguments."
    echo "$0: <input.trf> <nthreads> <nbuses> <latency> <bandwidth> <%cpuspeed>"
    exit 1
fi

cat beginofconfig

#Bandwidth definition
echo -e "\n\n\"environment information\" {\"\", 0, \"\", 128, $5, $3, 3};;\n"

#Latency and %cpu speed definitions
for (( i=0; i<=127; i++ ))
do
    echo "\"node information\" {0, $i, \"\", 1, 1, 1, 0.0, $4, $6, 0.0, 0.0};;"
done

#File name and number of processors definitions
echo ""
echo -n "\"mapping information\" {\"$1\", $2, [$2] "
echo -n "{0"

for (( i=1; i<=$2-1; i++ ))
do
    echo -n ", $i"
done

echo "}};;"

cat endofconfig

A.1.2 Running experiments

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

cat logo

echo "Removing out folder (force)"
rm -rf out

echo "Creating out folder"
mkdir out
mkdir out/cfg
mkdir out/prv
mkdir out/details
mkdir out/results

echo "Creating sqlite3 database"
sqlite3 out/results/res.db 'CREATE TABLE dimemas (procs INTEGER, buses INTEGER, latency REAL, bandwidth REAL, cpu REAL, runtime REAL);'

echo "Setting default values"
LATENCY="0.000008"
BANDWIDTH="250.0"
BUSES="0"
CPU="1.0"

for i in 02 04 08 16 32
do
    #echo "---------------------------------"
    # Trace file names use zero-padded thread counts; strip the leading zero.
    if [ ${i:0:1} == 0 ]
    then
        #echo "Setting nthreads to ${i:1}"
        nthreads=${i:1}
    else
        #echo "Setting nthreads to ${i}"
        nthreads=$i
    fi

    echo -n "Generating results for $nthreads"

    #BUSES-----------------------------------------------------
    for j in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
    do
        #echo "Generating configuration file for BUSES = $j"
        ./configgen in/mpi_ping_$i.trf $nthreads $j $LATENCY $BANDWIDTH $CPU > out/cfg/config-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU.cfg
        #echo "Converting to paraver trace..."
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU.prv out/cfg/config-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU.cfg > out/details/detail-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU
        #echo "Outputing results."
        echo -n "$nthreads,$j,$LATENCY,$BANDWIDTH,$CPU," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU | awk "{print \$3}" >> out/results/res-$nthreads.csv
    done

    echo -n "."

    #LATENCY-------------------------------------------------
    for j in 0.000001 0.00001 0.0001 0.001 0.01 0.1 1.0
    do
        ./configgen in/mpi_ping_$i.trf $nthreads $BUSES $j $BANDWIDTH $CPU > out/cfg/config-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU.prv out/cfg/config-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU.cfg > out/details/detail-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU
        echo -n "$nthreads,$BUSES,$j,$BANDWIDTH,$CPU," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU | awk "{print \$3}" >> out/results/res-$nthreads.csv
    done

    echo -n "."

    #BANDWIDTH-------------------------------------------------
    for j in 250.0 245.0 240.0 235.0 230.0 225.0 220.0 215.0 210.0 205.0 200.0 195.0 190.0 185.0 180.0 175.0 170.0
    do
        ./configgen in/mpi_ping_$i.trf $nthreads $BUSES $LATENCY $j $CPU > out/cfg/config-$nthreads-$BUSES-$LATENCY-$j-$CPU.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$BUSES-$LATENCY-$j-$CPU.prv out/cfg/config-$nthreads-$BUSES-$LATENCY-$j-$CPU.cfg > out/details/detail-$nthreads-$BUSES-$LATENCY-$j-$CPU
        echo -n "$nthreads,$BUSES,$LATENCY,$j,$CPU," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$BUSES-$LATENCY-$j-$CPU | awk "{print \$3}" >> out/results/res-$nthreads.csv
    done

    echo -n "."

    #CPU SPEED-------------------------------------------------
    for j in 5.0 4.0 3.0 2.0 1.0 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05
    do
        ./configgen in/mpi_ping_$i.trf $nthreads $BUSES $LATENCY $BANDWIDTH $j > out/cfg/config-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j.prv out/cfg/config-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j.cfg > out/details/detail-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j
        echo -n "$nthreads,$BUSES,$LATENCY,$BANDWIDTH,$j," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j | awk "{print \$3}" >> out/results/res-$nthreads.csv
    done

    echo "."
    echo "Importing to database"
    echo ".separator \",\"" > out/results/command
    echo ".import out/results/res-${nthreads}.csv dimemas" >> out/results/command
    sqlite3 out/results/res.db < out/results/command
    rm out/results/command
done

echo "Generating best configuration 1"
./configgen in/mpi_ping_32.trf 32 16 0.0001 240.0 0.95 > out/cfg/config-32-16-0.0001-240.0-0.95.cfg
./Dimemas3 -S 32K -pa out/prv/paraver-32-16-0.0001-240.0-0.95.prv out/cfg/config-32-16-0.0001-240.0-0.95.cfg > out/details/detail-32-16-0.0001-240.0-0.95
echo -n "32,16,0.0001,240.0,0.95," > out/results/optimal.csv
grep Execution out/details/detail-32-16-0.0001-240.0-0.95 | awk "{print \$3}" >> out/results/optimal.csv

echo "Generating best configuration"
./configgen in/mpi_ping_16.trf 16 16 0.0001 230.0 0.9 > out/cfg/config-16-16-0.0001-230.0-0.9.cfg
./Dimemas3 -S 32K -pa out/prv/paraver-16-16-0.0001-230.0-0.9.prv out/cfg/config-16-16-0.0001-230.0-0.9.cfg > out/details/detail-16-16-0.0001-230.0-0.9
echo -n "16,16,0.0001,230.0,0.9," >> out/results/optimal.csv
grep Execution out/details/detail-16-16-0.0001-230.0-0.9 | awk "{print \$3}" >> out/results/optimal.csv

./graphall buses
./graphall cpu
./graphall bandwidth
./graphall latency

echo "All done!"
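The result extraction step of the script (`grep Execution ... | awk "{print \$3}"`) takes the third whitespace-separated field of the line containing "Execution". A rough Python equivalent is sketched below. The layout of the Dimemas report is not shown in this document, so the sample text is purely hypothetical, and the function name `execution_time` is an assumption; unlike the shell pipeline, this sketch only looks at the first matching line.

```python
def execution_time(report_text):
    """Return the third whitespace-separated field of the first line
    containing 'Execution', as a float, or None if there is no such line."""
    for line in report_text.splitlines():
        if "Execution" in line:
            fields = line.split()
            if len(fields) >= 3:
                return float(fields[2])
    return None


# Hypothetical report text, only meant to exercise the function.
sample = "Some header\nExecution time:  12.506\nMore output\n"
runtime = execution_time(sample)

# Build one CSV row in the column order of the dimemas table:
# procs, buses, latency, bandwidth, cpu, runtime.
row = ",".join(str(v) for v in (32, 16, 0.0001, 240.0, 0.95, runtime))
print(row)
```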

A.1.3 Graph generator

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

latency="0.000008"
bandwidth="250.0"
buses="0"
cpu="1.0"
aux=""
aux2=""

if [ "$1" == "latency" ]
then
    comp=$latency
    aux="set log x"
    aux2="set mxtics 10"
fi
if [ "$1" == "bandwidth" ]
then
    comp=$bandwidth
fi
if [ "$1" == "buses" ]
then
    comp=$buses
fi
if [ "$1" == "cpu" ]
then
    comp=$cpu
fi


echo "Generating Graph"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0  # dashed
set style line 81 lt rgb "#808080"  # grey

set grid back linestyle 81
set border 3 back linestyle 80 # Remove border on top and right.  These
# borders are useless and make it harder
# to see plotted lines near the border.
# Also, put it in grey; no need for so much emphasis on a border.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10    # Makes logscale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow.  Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 lt rgb "#A00000" lw 2 pt 1
set style line 2 lt rgb "#00A000" lw 2 pt 6
set style line 3 lt rgb "#5060D0" lw 2 pt 2
set style line 4 lt rgb "#F25900" lw 2 pt 9
set style line 5 lw 2 pt 9

#set key top right

#set xrange [0:1]
#set yrange [0:1]

#plot "template.dat" \
#index 0 title "Example line" w lp ls 1, \
#"" index 1 title "Another example" w lp ls 2

#set style data lines
set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "Execution time with variable $1"
set xlabel "$1"
$aux
$aux2
set ylabel "Execution time (s)"

plot "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 2 UNION select $1, runtime from dimemas where procs = 2 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 1 title '#Procs = 2',\
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 4 UNION select $1, runtime from dimemas where procs = 4 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 2 title '#Procs = 4',\
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 8 UNION select $1, runtime from dimemas where procs = 8 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 3 title '#Procs = 8',\
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 16 UNION select $1, runtime from dimemas where procs = 16 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 with lines ls 4 title '#Procs = 16',\
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 32 UNION select $1, runtime from dimemas where procs = 32 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 5 title '#Procs = 32'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 10cm,20cm
set output "out/results/$1.pdf"
replot
EOF

echo "Done"

A.1.4 Generating graphs

./graphall buses
./graphall latency
./graphall cpu
./graphall bandwidth

echo "Generating Graph"
gnuplot << EOF
set datafile separator ","
set nokey

set title "Execution time depending on the number of threads"
set xlabel "Number of threads"

set xtics (0, 2, 4, 8, 16, 32, 34)

set ylabel "Execution time (s)"

set style line 1 lt rgb "#A00000" lw 50

plot "out/results/comparisonThreads.csv" using 1:2 with imp ls 1

set term postscript eps enhanced color
set output "out/results/comparison.pdf"
replot
EOF

A.2 Pin tool instrumentation

A.2.1 Generate and Compile Application and DCache tool

#!/bin/bash

#clusterSize
#    const UINT32 cacheSize = 256*KILO;
#    const UINT32 lineSize = 1;
#    const UINT32 associativity = 256;
#<clusterSize> <L1cachesize> <L1lineSize> <L1assoc> <L2cachesize> <L2lineSize> <L2assoc> <L3cachesize> <L3lineSize> <L3assoc> <nThreads>
#     $1            $2           $3          $4          $5           $6          $7          $8           $9         $10       $11

if [ $# -ne 11 ]
then
    echo "$0: Wrong number of arguments."
    echo "$0: <clusterSize> <L1cachesize> <L1lineSize> <L1assoc> <L2cachesize> <L2lineSize> <L2assoc> <L3cachesize> <L3lineSize> <L3assoc> <nThreads>"
    exit 1
fi

threadsAndMaster=$((${11}-1))
#echo "TreadsAndMaster = $threadsAndMaster"

#echo -n "INPUT="
#echo "$1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} ${11}"

#echo "Saving backup of dcache file"
mv -f dcache.cpp dcache_backup.cpp

echo "
#include <iostream>
#include <fstream>
#include <cassert>
#include <cstdio>

#include \"pin.H\"


typedef UINT32 CACHE_STATS; // type of cache hit/miss counters

#include \"pin_cache.H\"

KNOB<string> KnobOutputFile(KNOB_MODE_WRITEONCE, \"pintool\",
    \"o\", \"allcache.out\", \"specify dcache file name\");

PIN_LOCK lock;

INT32 numThreads = 0;
// Positional parameters above 9 need braces in bash: \${11}, not \$11.
const INT32 MaxNumThreads = ${11};
const INT32 clusterSize = $1;

struct THREAD_DATA
{
    UINT64 Hits;
    UINT64 Miss;
};

THREAD_DATA l1count[MaxNumThreads];
THREAD_DATA l2count[clusterSize];

VOID ThreadStart(THREADID threadid, CONTEXT *ctxt, INT32 flags, VOID *v)
{
    GetLock(&lock, threadid+1);
    numThreads++;
    ReleaseLock(&lock);

    ASSERT(numThreads <= MaxNumThreads, \"Maximum number of threads exceeded\n\");
}

namespace DL1
{
    // 1st level data cache: 32 kB, 32 B lines, 32-way associative
    const UINT32 cacheSize = $2*KILO;
    const UINT32 lineSize = $3;
    const UINT32 associativity = $4;
    const CACHE_ALLOC::STORE_ALLOCATION allocation = CACHE_ALLOC::STORE_NO_ALLOCATE;

    const UINT32 max_sets = cacheSize / (lineSize * associativity);
    const UINT32 max_associativity = associativity;

    typedef CACHE_ROUND_ROBIN(max_sets, max_associativity, allocation) CACHE;
}
LOCALVAR DL1::CACHE dl1(\"L1 Data Cache\", DL1::cacheSize, DL1::lineSize, DL1::associativity);

namespace UL2
{
    // 2nd level unified cache: 2 MB, 64 B lines, direct mapped
    const UINT32 cacheSize = $5*MEGA;
    const UINT32 lineSize = $6;
    const UINT32 associativity = $7;
    const CACHE_ALLOC::STORE_ALLOCATION allocation = CACHE_ALLOC::STORE_ALLOCATE;

    const UINT32 max_sets = cacheSize / (lineSize * associativity);

    typedef CACHE_DIRECT_MAPPED(max_sets, allocation) CACHE;
}
LOCALVAR UL2::CACHE ul2(\"L2 Cluster-shared Cache\", UL2::cacheSize, UL2::lineSize, UL2::associativity);

namespace UL3
{
    // 3rd level unified cache: 16 MB, 64 B lines, direct mapped
    const UINT32 cacheSize = $8*MEGA;
    const UINT32 lineSize = $9;
    const UINT32 associativity = ${10};
    const CACHE_ALLOC::STORE_ALLOCATION allocation = CACHE_ALLOC::STORE_ALLOCATE;

    const UINT32 max_sets = cacheSize / (lineSize * associativity);

    typedef CACHE_DIRECT_MAPPED(max_sets, allocation) CACHE;
}
LOCALVAR UL3::CACHE ul3(\"L3 Globally-shared Cache\", UL3::cacheSize, UL3::lineSize, UL3::associativity);

LOCALFUN VOID Fini(int code, VOID * v)
{
    std::ofstream out(KnobOutputFile.Value().c_str());

    out <<
        \"#\n\"
        \"# DCACHE stats\n\"
        \"#\n\";

    out << dl1;
    out << ul2;
    out << ul3;

    out.close();

    // The counters are UINT64, so print them as unsigned long long.
    for (int i=0; i<numThreads; i++)
    {
        printf(\"%d L1 Hits: %llu\n\", i, (unsigned long long) l1count[i].Hits);
        printf(\"%d L1 Miss: %llu\n\", i, (unsigned long long) l1count[i].Miss);
        printf(\"%d L1 Hit rate: %f\n\n\", i, (100.0 * l1count[i].Hits/(l1count[i].Hits+l1count[i].Miss)));
    }

    for (int i=0; i<clusterSize; i++)
    {
        printf(\"%d L2 Hits: %llu\n\", i, (unsigned long long) l2count[i].Hits);
        printf(\"%d L2 Miss: %llu\n\", i, (unsigned long long) l2count[i].Miss);
        printf(\"%d L2 Hit rate: %f\n\n\", i, (100.0 * l2count[i].Hits/(l2count[i].Hits+l2count[i].Miss)));
    }
}

LOCALFUN VOID Ul2Access(ADDRINT addr, UINT32 size, CACHE_BASE::ACCESS_TYPE accessType, THREADID tid)
{
    // second level unified cache
    const BOOL dl2Hit = ul2.Access(addr, size, accessType);

    // third level unified cache
    int cid = tid/(MaxNumThreads/clusterSize);
    if ( ! dl2Hit)
    {
        GetLock(&lock, tid+1);
        l2count[cid].Miss++;
        ReleaseLock(&lock);
        ul3.Access(addr, size, accessType);
    } else

23

152 l2count [ c id ] . Hits++;153 }154155 LOCALFUN VOID MemRefMulti (ADDRINT addr , UINT32 size , CACHE BASE

: : ACCESS TYPE accessType , THREADID t i d )156 {157 // f i r s t l e v e l D−cache158 const BOOL dl1Hit = dl1 . Access ( addr , size , accessType ) ;159160 i f ( ! d l1Hit ) {161 l1count [ t i d ] . Miss++;162 Ul2Access ( addr , size , accessType , t i d ) ;163 }164 else165 {166 l1count [ t i d ] . Hits++;167 }168 }169170 LOCALFUN VOID MemRefSingle (ADDRINT addr , UINT32 size , CACHE BASE

: : ACCESS TYPE accessType , THREADID t i d )171 {172 // f i r s t l e v e l D−cache173 const BOOL dl1Hit = dl1 . Acce s sS ing l eL ine ( addr , accessType ) ;174175 i f ( ! d l1Hit ) {176 l1count [ t i d ] . Miss++;177 Ul2Access ( addr , size , accessType , t i d ) ;178 }179 else180 {181 l1count [ t i d ] . Hits++;182 }183 }184185 LOCALFUN VOID I n s t r u c t i o n ( INS ins , VOID ∗v )186 {187 i f ( INS IsMemoryRead ( i n s ) )188 {189 const UINT32 s ize = INS MemoryReadSize ( i n s ) ;190 const AFUNPTR countFun = ( s ize <= 4 ? (AFUNPTR)

MemRefSingle : (AFUNPTR) MemRefMulti ) ;191192 // only pred icated−on memory i n s t r u c t i o n s a c c e s s D−cache193 INS Ins e r tPred i ca t edCa l l (194 ins , IPOINT BEFORE, countFun ,195 IARG MEMORYREAD EA,196 IARG MEMORYREAD SIZE,197 IARG UINT32 , CACHE BASE : : ACCESS TYPE LOAD,

24

198 IARG THREAD ID,199 IARG END) ;200 }201202 i f ( INS IsMemoryWrite ( i n s ) )203 {204 const UINT32 s ize = INS MemoryWriteSize ( i n s ) ;205 const AFUNPTR countFun = ( s ize <= 4 ? (AFUNPTR)

MemRefSingle : (AFUNPTR) MemRefMulti ) ;206207 // only pred icated−on memory i n s t r u c t i o n s a c c e s s D−cache208 INS Ins e r tPred i ca t edCa l l (209 ins , IPOINT BEFORE, countFun ,210 IARG MEMORYWRITE EA,211 IARG MEMORYWRITE SIZE,212 IARG UINT32 , CACHE BASE : : ACCESS TYPE STORE,213 IARG THREAD ID,214 IARG END) ;215 }216 }217218 GLOBALFUN i n t main ( i n t argc , char ∗argv [ ] )219 {220 PIN Init ( argc , argv ) ;221222 for ( INT32 t =0; t<MaxNumThreads ; t++)223 {224 l1count [ t ] . Hits = 0 ;225 l1count [ t ] . Miss =0;226 }227228 for ( i n t i =0; i<c l u s t e r S i z e ; i++)229 {230 l2count [ i ] . Hits =0;231 l2count [ i ] . Miss =0;232 }233234 PIN AddThreadStartFunction ( ThreadStart , 0) ;235 INS AddInstrumentFunction ( In s t ruc t i on , 0) ;236 PIN AddFiniFunction ( Fini , 0) ;237238 // Never r e tu rn s239 PIN StartProgram ( ) ;240241 return 0 ; // make compi le r happy242 }” > dcache . cpp243244 make > makeres245


echo "
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
typedef struct
{
    double *a;
    double *b;
    double sum;
    int veclen;
} DOTDATA;


#define NUMTHRDS $threadsAndMaster
#define VECLEN 1000000

DOTDATA dotstr;
pthread_t callThd[NUMTHRDS];
pthread_mutex_t mutexsum;

void *dotprod(void *arg)
{
    int i, start, end, len;
    long offset;
    // printf(\"%d\n\",(int) arg);
    double mysum, *x, *y;
    offset = (long) arg;

    // Extrae_eventandcounters(1,8);

    len = dotstr.veclen;
    // printf(\"%d\n\", len);
    start = offset * (len / NUMTHRDS);
    end = start + (len / NUMTHRDS);
    x = dotstr.a;
    y = dotstr.b;

    // each thread accumulates its own slice of the dot product
    mysum = 0;
    for (i = start; i < end; i++)
        mysum += (x[i] * y[i]);

    // Extrae_eventandcounters(1,9);

    pthread_mutex_lock(&mutexsum);

    // Extrae_eventandcounters(1,10);
    dotstr.sum += mysum;
    // Extrae_eventandcounters(1,11);

    pthread_mutex_unlock(&mutexsum);

    // Extrae_eventandcounters(1,0);

    // pthread_exit((void *) 0);
    return NULL;
}


int main(int argc, char *argv[])
{
    long i;
    double *a, *b;
    void *status;
    pthread_attr_t attr;

    clock_t begin, end;
    double time_spent;

    begin = clock();
    // Extrae_init();

    // Extrae_eventandcounters(1,1);

    a = (double *) malloc(NUMTHRDS * VECLEN * sizeof(double));
    b = (double *) malloc(NUMTHRDS * VECLEN * sizeof(double));

    // Extrae_eventandcounters(1,2);

    for (i = 0; i < VECLEN * NUMTHRDS; i++)
    {
        a[i] = 1;
        b[i] = a[i];
    }

    dotstr.veclen = VECLEN;
    dotstr.a = a;
    dotstr.b = b;
    dotstr.sum = 0;

    // Extrae_eventandcounters(1,3);

    pthread_mutex_init(&mutexsum, NULL);

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    for (i = 0; i < NUMTHRDS; i++)
    {
        // Extrae_eventandcounters(1,4);
        pthread_create(&callThd[i], &attr, dotprod, (void *) i);
        // Extrae_eventandcounters(1,3);
    }

    pthread_attr_destroy(&attr);
    // Extrae_eventandcounters(1,5);

    for (i = 0; i < NUMTHRDS; i++)
    {
        // Extrae_eventandcounters(1,6);
        pthread_join(callThd[i], &status);
        // Extrae_eventandcounters(1,7);
    }

    printf(\"Sum = %f \n\", dotstr.sum);
    free(a);
    free(b);

    end = clock();
    time_spent = (double) (end - begin) / CLOCKS_PER_SEC;
    printf(\"Execution time: %f \n\", time_spent);

    // Extrae_fini();

    pthread_mutex_destroy(&mutexsum);
    pthread_exit(NULL);
}
" > dotprod.c

#echo "Compiling dotprod"
gcc -o dotprod dotprod.c -lpthread

#echo "Running pin tool"
cd /scratch/boada-1/etm022/pin
./pin -t /scratch/boada-1/etm022/pin/source/tools/Memory/obj-intel64/dcache.so -- /scratch/boada-1/etm022/pin/source/tools/Memory/dotprod > /scratch/boada-1/etm022/pin/source/tools/Memory/results/res-$1-$2-$3-$4-$5-$6-$7-$8-$9-${10}-${11}.res

mv allcache.out /scratch/boada-1/etm022/pin/source/tools/Memory/results/res-$1-$2-$3-$4-$5-$6-$7-$8-$9-${10}-${11}.allcache

cd /scratch/boada-1/etm022/pin/source/tools/Memory
echo "done!"

A.2.2 Running the experiments


#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#
# clusterSize
# const UINT32 cacheSize = 256*KILO;
# const UINT32 lineSize = 1;
# const UINT32 associativity = 256;
# <clusterSize> <L1cachesize> <L1lineSize> <L1assoc> <L2cachesize> <L2lineSize> <L2assoc> <L3cachesize> <L3lineSize> <L3assoc> <nThreads>
# $1 $2 $3 $4 $5 $6 $7 $8 $9 $10 $11

rm -rf results
mkdir results

total=$((3 * 4 * 3 * 3 * 2 * 3 * 2))
n=0
res1=$(date +%s.%N)


#clusterSize
for cs in 2 4 8
do
  for mt in 2 4 8 16
  do
    #L1cacheSize
    for l1c in 16 32 64
    do
      #L1lineSize
      for l1l in 32 #64 128
      do
        #L1assoc
        for l1a in 1 #2 4
        do
          #L2cacheSize
          for l2c in 1 2 4
          do
            #L2lineSize
            for l2l in 32 64 #128
            do
              #L2assoc
              for l2a in 1 #2 4
              do
                #L3cacheSize
                for l3c in 4 8 16
                do
                  #L3lineSize
                  for l3l in 32 64 #128
                  do
                    #L3assoc
                    for l3a in 1 #2 4
                    do
                      clear
                      cat logo
                      echo "--------------------------by aknahs"
                      echo -n "Generating [$n/$total] ..."
                      res2=$(date +%s.%N)
                      printf "Elapsed: %.3F\n" $(echo "$res2 - $res1" | bc)

                      n=$(($n + 1))
                      #echo "------------------------------------------------------------------"
                      #echo "Generating CPP and Make"
                      #echo "./genMakeCPP $cs $l1c $l1l $l1a $l2c $l2l $l2a $l3c $l3l $l3a $mt"
                      ./genMakeCPP $cs $l1c $l1l $l1a $l2c $l2l $l2a $l3c $l3l $l3a $mt
                      echo "."
                    done
                  done
                done
              done
            done
          done
        done
      done
    done
  done
done
echo "all done."

grep "Total Miss Rate" results/*.allcache | awk 'BEGIN{n=0; printf "Cluster Size, L1 Cache Size, L1 Line Size, L1 Association, L2 Cache Size, L2 Line Size, L2 Association, L3 Cache Size, L3 Line Size, L3 Association, Number of threads, Total Miss Caches\n"} {split($1,a,"."); split(a[1],b,"-"); printf "%d,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n", n%3 +1, b[2],b[3],b[4],b[5],b[6],b[7],b[8],b[9],b[10],b[11],b[12],$5; ++n}' >> results/brutal_db.csv


A.2.3 Importing the results to a database

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

rm -rf power
mkdir power

sqlite3 power/res.db 'CREATE TABLE res (cachelevel INTEGER, cluster INTEGER, l1size INTEGER, l1line INTEGER, l1assoc INTEGER, l2size INTEGER, l2line INTEGER, l2assoc INTEGER, l3size INTEGER, l3line INTEGER, l3assoc INTEGER, threads INTEGER, missrate REAL);'

echo "Importing to database"
echo ".separator \",\"" > power/command
echo ".import brutal_db.csv res" >> power/command
sqlite3 power/res.db < power/command
rm power/command

echo "done"

./graphall

A.2.4 Generating graphs

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

#sqlite3 power/res.db 'CREATE TABLE res (cachelevel INTEGER, cluster INTEGER, l1size INTEGER, l1line INTEGER, l1assoc INTEGER, l2size INTEGER, l2line INTEGER, l2assoc INTEGER, l3size INTEGER, l3line INTEGER, l3assoc INTEGER, threads INTEGER, missrate REAL);'

mkdir power/cluster2
mkdir power/cluster4
mkdir power/cluster8

#for instrumentation level
for set in 1 2 3
do
  #for each cluster size
  for cs in 2 4 8
  do
    #for each level of cache
    for l in 1 2 3
    do

      if [ $set == 1 ]
      then
        filename="power/cluster${cs}/L${l}MissRate-L1Size16-cluster${cs}"
        sql="select l${l}size, missrate from res where cluster = $cs and l1size = 16 and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
        title="MissRate of cache ${l} per L${l} size (Lsize = [16,.,.])"
        xlabel="Size of cache L${l}"
      fi

      if [ $set == 2 ]
      then
        filename="power/cluster${cs}/L${l}MissRate-cluster${cs}"
        sql="select l${l}size, missrate from res where cluster = $cs and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
        title="MissRate of cache ${l} per L${l} size"
        xlabel="Size of cache L${l}"
      fi

      if [ $set == 3 ]
      then
        filename="power/cluster${cs}/L${l}MissRate-L1Size16-L2size1-cluster${cs}"
        sql="select l${l}size, missrate from res where cluster = $cs and l1size = 16 and l2size = 1 and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
        title="MissRate of cache L${l} per L${l} size (Lsize = [16,1,.])"
        xlabel="Size of cache L${l}"
      fi

      if [[ $set = 1 && $l = 1 ]]
      then
        continue
      fi

      if [[ $set == 3 && ( $l == 1 || $l == 2 ) ]]
      then
        continue
      fi


      echo "Generating Graph for set $set on cache level $l"
      gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0 # dashed
set style line 81 lt rgb "#808080" # grey

set grid back linestyle 81
set border 3 back linestyle 80 # Remove border on top and right.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10 # Makes logscale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 lt rgb "#A00000" lw 2 ps 1 pt 1
set style line 2 lt rgb "#00A000" lw 2 ps 1 pt 6
set style line 3 lt rgb "#5060D0" lw 2 ps 1 pt 2
set style line 4 lt rgb "#F25900" lw 2 ps 1 pt 9

#set key top right

#set xrange [0:1]
set yrange [0:100]

#plot "template.dat" \
#index 0 title "Example line" w lp ls 1, \
#"" index 1 title "Another example" w lp ls 2

#set style data lines
#set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with points ls 1 title '#procs = 2',\
"< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with points ls 2 title '#procs = 4',\
"< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with points ls 3 title '#procs = 8',\
"< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with points ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 30cm,15cm
set output "${filename}.pdf"
replot
EOF
    done
  done
done

echo "Done"

filename="power/L2MissRate-L1Size32-L2size4-l3size4-varCluster"
sql="select cluster, missrate from res where l1size = 32 and l2size = 4 and l3size = 4 and cachelevel = 2 and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache L2 per cluster size (Lsize = [32,4,4])"
xlabel="Cluster size"

echo "Generating Graph for set variable clusters"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0 # dashed
set style line 81 lt rgb "#808080" # grey

set grid back linestyle 81
set border 3 back linestyle 80 # Remove border on top and right.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10 # Makes logscale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 ps 1 pt 1
set style line 2 ps 1 pt 6
set style line 3 ps 1 pt 2
set style line 4 ps 1 pt 9

#set key top right

#set xrange [0:1]
#set yrange [0:1]

#plot "template.dat" \
#index 0 title "Example line" w lp ls 1, \
#"" index 1 title "Another example" w lp ls 2

#set style data lines
#set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with lp ls 1 title '#procs = 2',\
"< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with lp ls 2 title '#procs = 4',\
"< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with lp ls 3 title '#procs = 8',\
"< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with lp ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 30cm,15cm
set output "${filename}.pdf"
replot
EOF

filename="power/L2MissRate-L1Size16-L2size1-l3size4-varCluster"
sql="select cluster, missrate from res where l1size = 16 and l2size = 1 and l3size = 4 and cachelevel = 2 and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache L2 per cluster size (Lsize = [16,1,4])"
xlabel="Cluster size"

echo "Generating Graph for set variable clusters"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0 # dashed
set style line 81 lt rgb "#808080" # grey

set grid back linestyle 81
set border 3 back linestyle 80 # Remove border on top and right.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10 # Makes logscale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 ps 1 pt 1
set style line 2 ps 1 pt 6
set style line 3 ps 1 pt 2
set style line 4 ps 1 pt 9

#set key top right

#set xrange [0:1]
#set yrange [0:1]

#plot "template.dat" \
#index 0 title "Example line" w lp ls 1, \
#"" index 1 title "Another example" w lp ls 2

#set style data lines
#set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with lp ls 1 title '#procs = 2',\
"< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with lp ls 2 title '#procs = 4',\
"< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with lp ls 3 title '#procs = 8',\
"< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with lp ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 30cm,15cm
set output "${filename}.pdf"
replot
EOF
