MeToo: Stochastic Modeling of Memory Traffic Timing Behavior


Yipeng Wang, Yan Solihin
Dept. of Electrical and Computer Engineering
NC State University
{ywang50,solihin}@ncsu.edu

Ganesh Balakrishnan
Memory System Architecture and Performance
Advanced Micro Devices
[email protected]

Abstract—The memory subsystem (memory controller, bus, and DRAM) is becoming a bottleneck in computer system performance. Optimizing the design of the multicore memory subsystem requires a good understanding of the representative workload. A common practice in designing the memory subsystem is to rely on trace simulation. However, the conventional method of relying on traditional traces faces two major challenges. First, many software users are apprehensive about sharing their code (source or binaries) due to the proprietary nature of the code or secrecy of data, so representative traces are sometimes not available. Second, there is a feedback loop where memory performance affects processor performance, which in turn alters the timing of memory requests that reach the bus. Such a feedback loop is difficult to capture with traces.

In this paper, we present MeToo, a framework for generating synthetic memory traffic for memory subsystem design exploration. MeToo uses a small set of statistics that summarizes the performance behavior of the original applications, and generates synthetic traces or executables stochastically, allowing applications to remain proprietary. MeToo uses novel methods for mimicking the memory feedback loop. We validate MeToo clones and show a very good fit with the original applications' behavior, with an average error of only 4.2%, which is a small fraction of the errors obtained using geometric inter-arrival (commonly used in queueing models) and uniform inter-arrival.

Keywords: workload cloning, memory subsystem, memory controller, memory bus, DRAM

    I. INTRODUCTION

The memory subsystem (memory controller, bus, and DRAM) is becoming a performance bottleneck in many computer systems. The continuing fast pace of growth in the number of cores on a die places an increasing demand for data bandwidth on the memory subsystem [1]. In addition, the memory subsystem's share of power consumption is also increasing, partly due to the increased bandwidth demand, and partly due to scarce opportunities to enter a low power mode. Thus, optimizing the memory subsystem is becoming increasingly critical to achieving the performance and power goals of a multicore system.

Optimizing the design of the multicore processor's memory subsystem requires a good understanding of the representative workload. Memory subsystem designers often rely on synthetically generated traces, or on workload-derived traces (collected by trace capture hardware) that are replayed. A trace may include a list of memory accesses, each identified by whether it is a read or write, the address that is accessed, and the time at which the access occurs. Experiments with these traces lead to decisions about memory controller design (scheduling policy, queue size, etc.), memory bus design (clock frequency, width, etc.), and DRAM-based main memory design (internal organization, interleaving, page policy, etc.).

This conventional method of relying on traditional traces faces two major challenges. First, many software users are apprehensive about sharing their code (source or binaries) due to the proprietary nature of the code or secrecy of data, so representative traces are sometimes not available. These users may include national laboratories, industrial research labs, banking and trading firms, etc. Even when a representative trace is available, there is another major challenge: the trace may not capture the complexity of the interaction between the processor and the memory subsystem. Specifically, there is a feedback loop where memory performance affects processor performance, which in turn alters the timing of memory requests that reach the bus. In a multicore system, the processor core (and cache) configuration and the application together determine the processor performance, which affects the timing of the memory requests that are received by the memory subsystem. The memory requests and the memory configuration together determine the memory performance. More importantly, the memory subsystem has a feedback loop with the core which affects the timing of future memory requests. The degree to which the feedback loop affects processor performance depends on events in the memory subsystem (e.g. row buffer hits, bank conflicts), the characteristics of the applications (e.g. a pointer-chasing application is more latency sensitive than others), and the characteristics of the mix of applications co-scheduled together.

Due to this feedback loop, a conventional trace is woefully inadequate for evaluating a memory subsystem design. To get an idea of the extent of the problem, Figure 1(a) shows the inter-arrival timing distribution of the SPEC 2006 benchmark bzip2 when running on a system with a DDR3 DRAM with 1.5ns cycle time (i.e. 667 MHz) versus 12ns cycle time. The x-axis shows the inter-arrival timing gap between two consecutive memory bus requests, while the y-axis shows the number of occurrences of each gap. Notice that the inter-arrival distribution is highly dependent on DRAM speed. With a slow DRAM, the inter-arrival distribution shifts to the right significantly because the slow DRAM's feedback loop forces processor speed to decline as instructions waiting for data from DRAM block the pipeline. If one uses trace simulation, the feedback loop will be missing. Figure 1(b) shows DRAM power consumption when the DRAM cycle time is varied from 1.5ns to 6ns and 12ns, normalized to the 1.5ns case. Traditional traces show increased DRAM power consumption as the cycle time increases, but severely under-estimate the power consumption compared to when the feedback loop is accounted for. Other design aspects are also affected. For example, estimating the size of the memory controller's scheduling queue depends on how bursty requests may be, but burstiness in turn depends heavily on DRAM speed.

Note, however, that even if the feedback loop were hypothetically accounted for, the inter-arrival distribution still has to be modeled correctly. Many synthetically generated traces assume standard inter-arrival probability distributions, such as uniform or geometric.

[Figure 1: two inter-arrival histograms for bzip2 (fast DRAM vs. slow DRAM, gaps up to roughly 300 cycles), and a bar chart of DRAM power at 1.5ns, 6ns, and 12ns cycle times with and without feedback.]

Figure 1. Missing feedback loop influences the accuracy of trace simulation: inter-arrival time distribution with fast vs. slow DRAM (a), and DRAM power consumption with vs. without the feedback loop (b). Full evaluation parameters are shown in Table II.

[Figure 2: bar charts of DRAM power (watts) and DRAM bandwidth (GB/s) for orig, geo, and uniform inter-arrival across the benchmarks.]

Figure 2. Comparison of DRAM power and bandwidth for geometric, uniform, and original inter-arrival of memory requests for various benchmarks.

A uniform inter-arrival (not to be confused with constant inter-arrival) is one where the inter-arrival time values are randomly generated with equal probabilities. A geometric inter-arrival follows the outcome of a series of Bernoulli trials over time. A Bernoulli trial is one with either a successful outcome (e.g. whether a memory reference results in a Last Level Cache/LLC miss) or a failure. A series of Bernoulli trials results in a stream of LLC misses reaching the memory subsystem, with their inter-arrival times following a geometric distribution. A Bernoulli process can be viewed as the discrete-time version of Poisson arrival in the M/*/1 queueing model, which is widely used in modeling queueing performance. We generate traces with uniform and geometric inter-arrival and compare them against the original application in terms of DRAM power and bandwidth in Figure 2, assuming 667MHz DDR3 DRAM. The figure shows that uniform and geometric inter-arrival do not represent the original benchmarks' behavior well across most benchmarks. While not shown, we also observe a similar mismatch on other metrics such as the average queue length of the memory controller and the average bus queueing delay.
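To make the two baselines concrete, the following minimal sketch (not from the paper) generates both kinds of inter-arrival sequences with NumPy; N_REQUESTS and MEAN_GAP are assumed parameters, not values used in the evaluation:

```python
# Sketch: baseline inter-arrival times (in bus cycles) for a synthetic trace.
import numpy as np

rng = np.random.default_rng(0)
N_REQUESTS = 100_000
MEAN_GAP = 40  # assumed mean inter-arrival time in cycles

# Uniform inter-arrival: every gap in [1, 2*MEAN_GAP - 1] is equally likely.
uniform_gaps = rng.integers(1, 2 * MEAN_GAP, size=N_REQUESTS)

# Geometric inter-arrival: each cycle a request arrives with probability
# p = 1/MEAN_GAP (a Bernoulli trial), so the gaps follow a geometric
# distribution, the discrete-time analogue of Poisson arrivals.
geometric_gaps = rng.geometric(1.0 / MEAN_GAP, size=N_REQUESTS)

arrival_times = np.cumsum(geometric_gaps)  # absolute request times
print(uniform_gaps.mean(), geometric_gaps.mean())
```

Both sequences have the same mean rate; what differs (and what Figure 2 shows matters) is the shape of the gap distribution.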

In this paper, we propose Memory Traffic Timing (MeT2, or, more pronunciation-friendly, MeToo), a new method for solving the proprietary and representativeness problems of memory traffic cloning. To solve the proprietary problem, we propose to collect a small set of statistics of the original application's memory behavior through novel profiling. We use the statistics to generate a proxy application (or clone) that produces similar memory behavior. To ensure the representativeness of the timing of memory accesses, we propose a novel method for generating the instructions that form the clone binary such that the feedback loop is captured by injecting dependences between instructions. We demonstrate that our clones can mimic the original applications' memory traffic timing behavior much better than uniform or geometric traffic patterns, and that they can capture the performance feedback loop well. We demonstrate the results both for single applications running alone as well as for co-scheduled applications running on a multicore platform, where MeToo achieves an average absolute error of only 4.2%. We believe this is a new capability that can prove valuable for memory subsystem design exploration.

In the rest of the paper, Section II discusses related work, Section III gives an overview of MeToo, Section IV discusses MeToo's profiling method, Section V describes MeToo's clone generation process, Section VI presents the evaluation methodology, and Section VII shows and discusses validation and design exploration results.

    II. RELATED WORK

The most related work to ours is workload cloning, where synthetic traces or executables are generated as proxies (or substitutes) for the original proprietary applications in computer design evaluation. Clones are generated based on statistical profiles of the original applications, and are designed to reproduce targeted performance behavior. There are cloning techniques that mimic instruction-level behavior [2]-[8], cache behavior [9], [10], cache coherence behavior [11], data sharing patterns in parallel programs [7], [8], and I/O behavior [12], [13]. In these studies, the memory access pattern is largely simplified or ignored; for example, [2], [4] assume single-strided memory accesses. In other studies that focus on memory behavior, timing information is embedded into the traces and is assumed to be unaffected by the components to which the traces are fed. The feedback loop between the components and the inter-arrival pattern of the events that make up the trace is largely ignored. Examples include cache miss and coherence requests in SynFull [11] and the I/O cloning studies in [12], [13]. In some contexts, ignoring the feedback loop may be sufficient, e.g. I/O request performance is determined mostly by disk latency. However, as we have shown, this is not the case for requests to the memory subsystem.

The closest of the studies cited above are West [9], STM [10], SynFull [11], and MEMST [14]. West and STM are capable of estimating cache miss ratios and generating synthetic memory traces past the LLC. However, the traces do not include timing information. SynFull captures memory access traces past the L2 cache, but the memory accesses are generated with constant inter-arrival time, and the feedback loop is unaccounted for. MEMST generates memory requests by cloning LLC misses. MEMST stochastically replays the timing of when memory requests are generated, hence its memory request trace includes timing information. However, the feedback loop for memory request timing is not modeled, apart from coincidental back pressure that arises in the special case when the memory bus is over-utilized. In contrast, MeToo captures the memory performance feedback loop fully: any delay that occurs for any reason in the memory subsystem will affect the instruction execution schedule, which in turn changes when the next memory requests are generated.

[Figure 3: block diagram. The application, processor, and cache hierarchy (fixed) are profiled on hardware or a simulator; the profile feeds a clone generator, which produces a trace clone or an executable clone whose performance results drive design exploration of the memory controller and DRAM.]

Figure 3. MeToo cloning framework.

    III. OVERVIEW

In this section, we give an overview of the proposed MeToo cloning framework. The goal of MeToo is to summarize the traffic (or requests) to the memory subsystem, and to generate a trace or binary that can be used for memory subsystem design exploration. As such, we make several assumptions.

Figure 3 illustrates MeToo's assumptions and framework. The left portion of the figure shows that MeToo is designed only for design exploration of the memory subsystem (memory controller, memory bus, and DRAM), with the design of the processor and cache hierarchy assumed to have been determined and fixed. In other words, a single clone can be used for evaluating multiple memory subsystem configurations, but only for one processor and cache configuration. Experimenting with different processor and cache hierarchy models is possible, but it requires profiling to be repeated for each combination of processor and cache hierarchy configuration. Even so, we experimented with a limited set of changes to the processor parameters and found that the same clone can still be accurate and applicable over these processor configurations. A consequence of this assumption is that the profile used in MeToo focuses on memory traffic generated past the Last Level Cache (LLC). This includes LLC misses (corresponding to read or write misses in a write-back cache), write backs, and hardware-generated prefetch requests.

Figure 3 also illustrates the basic flow of the MeToo framework. First, an application is run on a processor and cache hierarchy model and its profile is collected. The profile is fed into a clone generator, which produces an executable clone that includes feedback-loop-sensitive timing information. The executable clone can be run on a full-system simulator or on real hardware. Alternatively, a traditional trace can also be generated, but without the feedback loop mechanism.

    IV. METOO PROFILING METHOD

    A. Capturing Cache Access Pattern

Memory traffic depends on cache accesses at the LLC level. There are three ways an LLC produces memory requests (i.e. requests that reach the memory subsystem). First, a read or write that misses in the LLC will generate a read request to the memory subsystem, assuming the LLC is a write-back and write-allocate cache. Second, a block write back generates a write request to the memory subsystem. Finally, hardware prefetches generate read requests to the memory subsystem. Note that our MeToo clones must capture the traffic corresponding to all three types of requests, but only one type of request is generated by the execution of instructions on the processor. Thus, a MeToo clone can only consist of instructions that will be executed on the same processor model, yet a challenge is that it must capture and generate all three types of requests. Let us discuss how MeToo handles each type of request.

[Figure 4: a two-way cache set initially holding clean blocks A (MRU) and C (LRU). Case 1 (st A, ld B, st A, ld C) leaves one future write back; Case 2 (st C, ld A, st A, ld C) leaves two. West's write statistics are identical in both cases: 50% writes, 50% to MRU and 50% to LRU.]

Figure 4. Illustration of how West's write statistics cannot capture the number of write backs accurately. Shaded blocks indicate dirty cache blocks.

The first type of request MeToo must capture faithfully is memory traffic that results from loads/stores generated by the processor that produce read memory requests. To achieve that, we capture the per-set stack distance profile [15] at the last level cache. Let Sij represent the percentage of accesses that occur to the ith cache set and jth stack position, where the 0th stack position represents the most recently used stack position, the 1st stack position represents the second most recently used stack position, etc. Let Si denote the percentage of accesses that occur to the ith set, summed over all stack positions. These two statistics alone provide sufficient information to generate a synthetic memory trace that reproduces the first type of requests (read/write misses), assuming stochastic clone generation is utilized. These are the only statistics that are similar to the ones used in West [9]. All other statistics discussed from here on are unique to MeToo.
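As an illustration, here is a minimal sketch of one way such a per-set stack distance profile might be collected; the geometry (NUM_SETS, BLOCK_BITS) and the unbounded per-set LRU stacks are simplifying assumptions, not the paper's implementation:

```python
# Sketch: per-set stack distance profiling. Each set keeps an unbounded LRU
# stack of block tags; positions beyond the associativity correspond to
# misses. On each access we record (set, position), then promote to MRU.
from collections import defaultdict

NUM_SETS = 1024              # assumed LLC set count
BLOCK_BITS = 6               # 64B blocks

stacks = [[] for _ in range(NUM_SETS)]   # per-set LRU stacks, MRU at index 0
counts = defaultdict(int)                # counts[(i, j)] -> number of accesses

def access(addr):
    block = addr >> BLOCK_BITS
    i = block % NUM_SETS                 # set index from low-order block bits
    stack = stacks[i]
    j = stack.index(block) if block in stack else len(stack)
    if block in stack:
        stack.remove(block)
    counts[(i, j)] += 1
    stack.insert(0, block)               # promote to MRU

def profiles():
    # Normalize the raw counts into the Sij and Si probabilities.
    total = sum(counts.values())
    S_ij = {key: c / total for key, c in counts.items()}
    S_i = defaultdict(float)
    for (i, _), p in S_ij.items():
        S_i[i] += p
    return S_ij, S_i
```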

To deal with the second type of memory request (i.e. ones that occur due to LLC write backs), we found that West's statistics are insufficient. West captures the fraction of accesses that are writes in each set and stack distance position. However, the number of write backs is not determined by the fraction of accesses that are writes. Instead, it depends on the number of dirty blocks. West's write statistics do not capture the latter. To illustrate the difference, Figure 4 shows two cases that are indistinguishable in West, but contribute to different numbers of write backs. The figure shows a two-way cache set initially populated with clean blocks A and C. Cases 1 and 2 each show four memory accesses that produce indistinguishable results in West's write statistics: 50% of the accesses are writes, half of the writes occur to the MRU block and the other half to the LRU block. In both cases, the final cache content includes blocks A and C. However, in Case 1, only one block is dirty, while in Case 2, both cached blocks are dirty. This results in a different number of eventual write backs. The difference between the two cases is simple: a store to a dirty block does not increase the number of future write backs, while a store to a clean block does. In order to distinguish the two cases, MeToo records WCij and WDij, representing the probability of writes to clean and dirty blocks respectively, at the ith cache set and jth stack position.

[Figure 5: histogram of ICI sizes for bzip2 (x-axis: ICI size, 0 to 500; y-axis: number of instances).]

Figure 5. Example of the Instruction Count Interval (ICI) distribution for bzip2.

In addition to this, we also record the average dirty block ratio fD (the fraction of all cache blocks that are dirty) and the write ratio fW (the fraction of all requests to the memory subsystem that are write backs). The clone generator uses these two values to guide the generation of loads and stores for the clone. For example, the generator will generate more store instructions when the dirty block ratio is found to be too low during the generation process. The generator tries to converge to all of these statistics during generation.
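A small sketch of the bookkeeping this implies, assuming the set index and stack position come from a profiler like the one above; eviction handling is omitted for brevity:

```python
# Sketch: distinguishing writes to clean vs. dirty blocks, so the number of
# future write backs (not just the write fraction) is captured.
from collections import defaultdict

dirty = set()            # block addresses currently dirty (evictions omitted)
wc = defaultdict(int)    # wc[(i, j)]: writes to clean blocks (adds a write back)
wd = defaultdict(int)    # wd[(i, j)]: writes to dirty blocks (no new write back)

def record_write(set_idx, pos, block):
    if block in dirty:
        wd[(set_idx, pos)] += 1
    else:
        wc[(set_idx, pos)] += 1
        dirty.add(block)     # block becomes dirty: one more future write back

def dirty_ratio(cached_blocks):
    # fD: fraction of the currently cached blocks that are dirty
    return sum(b in dirty for b in cached_blocks) / max(1, len(cached_blocks))
```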

The final type of request is due to hardware-generated prefetches. The LLC hardware prefetcher can contribute significantly to memory traffic. One option to deal with this is to ensure that the real or simulated hardware includes the same prefetcher as during profiling. However, such an option cannot capture or reproduce the timing of prefetch requests. Thus, we pursue a different approach, which is to gather αk, the percentage of traffic-generating instructions that trigger hardware prefetches. Then, we generate clones with software prefetch requests inserted in such a way that we match αk. In the design exploration runs, hardware prefetchers at the LLC can be turned off.

    B. Capturing the Timing of Memory Requests

In the previous section, we discussed how we can generate memory requests corresponding to the read, write, and prefetch memory requests of the original applications. In this section, we investigate how the inter-arrival timing of the requests can be captured in MeToo clones.

The first question is how the inter-arrival timing information between memory traffic requests should be represented. One choice is to capture the inter-arrival as the number of processor clock cycles that elapse between the current request and the one immediately in the past. Each inter-arrival can then be collected in a histogram to represent the inter-arrival distribution. One problem with this approach is that the number of clock cycles is rigid and cannot be scaled to take into account the memory performance feedback loop.

Thus, we pursue an alternative choice: measure the inter-arrival time in terms of the number of instructions that separate consecutive memory requests. We refer to this as the Instruction Count Interval (ICI). An example ICI profile is shown in Figure 5. The x-axis shows the number of instructions that separate two consecutive memory requests (i.e. the ICI size), while the y-axis shows the number of occurrences of each ICI size. The x-axis is truncated at 500 instructions. In the particular example in the figure, there are many instances of 20-40 instructions separating two consecutive memory requests, indicating a high level of Memory Level Parallelism (MLP).
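A minimal sketch of how such an ICI histogram might be built from a stream of retired instructions; is_traffic is a hypothetical predicate marking the instructions that generate memory subsystem requests:

```python
# Sketch: Instruction Count Interval (ICI) histogram. The ICI size is the
# number of instructions between two consecutive traffic-generating ones.
from collections import Counter

def ici_histogram(instructions, is_traffic):
    hist = Counter()
    since_last = None             # instructions since the last traffic inst
    for inst in instructions:
        if is_traffic(inst):
            if since_last is not None:
                hist[since_last] += 1
            since_last = 0        # start a new interval
        elif since_last is not None:
            since_last += 1
    return hist                   # hist[k]: occurrences of ICI size k
```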

The ICI profile alone does not account for the behavior we observe where certain ICI sizes tend to recur. ICI size recurrence can be attributed to the behavior of loops. To capture this temporal recurrence, we collect the ICI recurrence probability in addition to the ICI distribution. The ICI sizes are split into bins, where each bin represents similar ICI sizes. For each bin, the probability that the ICI sizes in that bin repeat is recorded.
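A sketch of how the per-bin recurrence probability might be estimated from the observed sequence of ICI sizes; the bin width of 8 is an assumed choice:

```python
# Sketch: recurrence probability Rec_k per bin of similar ICI sizes.
from collections import Counter

def recurrence_probability(ici_sizes, bin_width=8):
    same, total = Counter(), Counter()
    for prev, cur in zip(ici_sizes, ici_sizes[1:]):
        b = prev // bin_width              # bin of the current ICI size
        total[b] += 1
        if cur // bin_width == b:          # next ICI falls in the same bin
            same[b] += 1
    return {b: same[b] / total[b] for b in total}
```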

Together, the ICI distribution and the recurrence probability profile summarize the inter-arrival timing of memory subsystem requests. However, they still suffer from two shortcomings. First, the number of instructions separating two memory requests may not be linearly proportional to the number of clock cycles separating them: not all instructions are equal or take the same amount of execution time on the processor. Second, the statistics are not yet capable of capturing the feedback loop. We deal with these shortcomings next.

    C. Capturing the Timing Feedback Loop

The ICI distribution and recurrence profiles discussed above may not fully capture the inter-arrival timing of memory subsystem requests due to several factors. First, in an out-of-order processor pipeline, the amount of Instruction Level Parallelism (ILP) determines the time it takes to execute instructions: higher ILP allows faster execution of instructions, while lower ILP causes slower execution.

One of the determinants of ILP is data dependences between instructions. For a pair of consecutive (traffic-generating) memory request instructions, the ILP of the instructions executed between them determines how soon the second traffic-generating instruction occurs. Figure 6(a) illustrates two cases of how dependences can influence the inter-arrival pattern of memory requests. In the figure, each circle represents one instruction. Larger circles represent traffic-generating instructions, while smaller circles represent other instructions such as non-memory instructions, or loads/stores that do not cause memory traffic requests. The arrows indicate true dependences between instructions. Let us suppose that the issue queue has 4 entries. In the first case, instructions will block the pipeline until the first memory request is satisfied, so the second traffic-generating instruction must wait until the pipeline is unblocked. In the second case, while the first traffic-generating instruction is waiting, only one non-traffic instruction needs to wait. The other instructions do not block the pipeline, and the second traffic-generating instruction can be issued much earlier since it does not need to wait for the first request to be satisfied. The figure emphasizes the point that although both cases show the same ICI size and the same number of instructions with dependences, the inter-arrival time of memory requests may be very different in the two cases.

[Figure 6: sequences of traffic-generating instructions (T) and other instructions (N) with true-dependence arrows, contrasting (a) dependence chains hanging off the first T that block or do not block the pipeline, and (b) a T-T pair with a dependence at ICI size 2 vs. an independent T-T pair at ICI size 4.]

Figure 6. An example of how the inter-arrival timing of memory requests is affected by dependences between traffic-generating memory instructions and other instructions (a) or between the traffic-generating memory instructions themselves (b).

To capture such characteristics, we first categorize the instructions between a pair of traffic-generating instructions into two sets: a dependent set consisting of instructions that directly or indirectly depend on the first traffic-generating instruction, and an independent set consisting of instructions that do not. For each pair of traffic-generating instructions, we collect the sizes of the dependent and independent sets and calculate the ratio of the former to the latter. We refer to this as the dependence ratio, R. Across all traffic-generating instruction pairs with the same ICI size k, the dependence ratios are averaged into Rk, to keep the profile compact and reduce the complexity of the clone generation process.

Next, for each of the dependent and independent sets, we analyze the instructions in the set to discover true dependences within the ICI. If two instructions are found to have a true dependence, we measure their dependence distance δ, defined as the instruction count between the producer and the consumer instructions. Let Dδ be a counter that represents the number of occurrences of dependence distance δ (where δ varies) within an ICI. We collect the occurrences of dependence distances over all ICIs of size k and represent them in a 2D matrix Dk, separately for the dependent set and for the independent set. We collect the dependence distance profile separately for the two sets because instructions in the dependent set have a larger influence on the inter-arrival timing of requests.
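A sketch of this classification for a single ICI, under an assumed representation where each instruction carries the sequence ids of its producers; unlike the paper, it pools the dependence distances of both sets into one counter for brevity:

```python
# Sketch: classify instructions inside one ICI into the dependent set
# (transitively reachable from the first traffic-generating instruction)
# and the independent set, and count dependence distances.
# Each instruction is (seq_id, src_ids): its program-order position and
# the positions of the producers of its source operands.
from collections import defaultdict

def analyze_ici(insts, traffic_seq_id):
    dependent = {traffic_seq_id}
    dep_dist = defaultdict(int)            # occurrences of each distance delta
    for seq_id, src_ids in insts:          # insts are in program order
        if any(s in dependent for s in src_ids):
            dependent.add(seq_id)          # directly or indirectly dependent
        for s in src_ids:
            dep_dist[seq_id - s] += 1      # dependence distance delta
    n_dep = len(dependent) - 1             # exclude the traffic inst itself
    n_indep = len(insts) - n_dep
    r = n_dep / max(1, n_indep)            # one sample contributing to R_k
    return r, dep_dist
```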

So far, we have only discussed the dependence of non-traffic-generating instructions on each other or on a traffic-generating instruction within an ICI. However, the dependence between two traffic-generating instructions is an even more important determinant of the inter-arrival timing of memory subsystem requests. For example, consider the instructions shown in Figure 6(b). The top portion of the figure shows a pair of traffic-generating instructions with an ICI size of 2. However, since the second traffic-generating instruction depends on the first one, the inter-arrival time of the two memory requests is at least the time it takes for the first instruction to get its data. In the second pattern, although the ICI size of 4 is larger than in the first, there is no dependence between the two traffic-generating instructions, so the inter-arrival time of the memory requests may be shorter. To distinguish these situations, for each pair of traffic-generating instructions, we record whether they have a true dependence relationship. To achieve that, we define Mk as the fraction of pairs of traffic-generating instructions that have a dependence when their ICI size is k.

The dependence distance statistics provide the timing feedback loop between memory performance and processor performance. When a request is stalled in the memory controller queue for whatever reason, the instruction that caused the request will also stall in the processor pipeline, and cause dependent instructions to stall as well, unable to execute. In turn, this leads to an increase in the inter-arrival time of the next memory subsystem request. The increase depends on whether the next traffic-generating instruction depends on the one currently stalled. These effects are captured in the statistics collected for the MeToo clone generation process.

    D. Capturing Instruction Types

In addition to the ILP, another determinant of how the ICI affects inter-arrival time is the instruction mix that the ICI is comprised of. For example, an integer division (DIV) may take more cycles to execute than simpler arithmetic instructions. The numbers of floating-point and integer functional units may also influence the ICI execution time. Similar to the ILP information, we profile the instruction type mix in each ICI, separately for the dependent and independent sets. We average the number of instructions of the same type within the same ICI interval. The final profile is a 2D matrix with elements Nkm, where the ICI size k varies and the instruction type m (e.g. 0 for integer addition, 1 for floating-point addition, and so on) varies. Control flow is removed by converting branches into arithmetic instructions while preserving their data dependences during clone generation.

    E. Capturing Spatial Locality

The inter-arrival time between two traffic-generating requests at the memory subsystem is also affected by how long the first traffic-generating instruction takes to retrieve data from the DRAM, which depends on the DRAM latency to complete the request, including the read/write order, bank conflicts, and row buffer hits or misses. These events depend on the temporal and spatial locality of the requests. West [9] assumes that any cache miss goes to a random address, which is too simplistic because it ignores locality. To capture locality, we collect the probability that the next request falls in the same address region of size r as the current request. The probability is denoted Pr. Pr may be collected for various values of r, but we find that tracking the probability for a 1KB region provides sufficient accuracy.
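A minimal sketch of measuring Pr over a request address trace, with r = 1KB as in the paper; the address list is a hypothetical input:

```python
# Sketch: P_r, the probability that consecutive memory requests fall in the
# same r-byte region (r = 1024 here).
def region_locality(addresses, region_bytes=1024):
    same = 0
    for prev, cur in zip(addresses, addresses[1:]):
        if prev // region_bytes == cur // region_bytes:
            same += 1
    return same / max(1, len(addresses) - 1)
```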

F. Implementation of the Profiler

To summarize, Table I shows the statistics collected by the profiler. Figure 7 shows the profiling infrastructure that MeToo implements. We implement the infrastructure on top of the gem5 [16] processor simulator and the DRAMSim2 [17] DRAM simulator. Each part of the profiler is a standalone module, separated from gem5's original code, and works independently.

To collect the statistics shown in Table I, we place profiling probes at three places while the original applications run on a simulator. To collect cache statistics, we use a cache stack distance profiler on the simulator's data cache input ports. To collect the instruction statistics, we profile the decode stage of the processor pipeline.

Table I. Statistics that are collected for MeToo.

Sij: stack distance profile probability at the ith set and jth stack position
Si: stack distance profile probability at the ith set
WCij: probability of a write to a clean block at the ith set and jth stack position
WDij: probability of a write to a dirty block at the ith set and jth stack position
fD: fraction of LLC blocks that are dirty
fW: fraction of memory requests that are write backs
Reck: recurrence probability of an ICI of size k
Rk: dependence ratio (number of dependent instructions divided by number of independent instructions) for an ICI of size k
Nkm: number of instructions of type m in an ICI of size k
Pr: probability that the next memory request is to the same region of size r as the current memory request
Dk: occurrences of dependence distances over all ICIs of size k
Mk: probability that a pair of traffic-generating instructions exhibits a true dependence relation
fmem: fraction of instructions that are memory instructions
αk and ᾱk: percentage of traffic-generating instructions in an ICI of size k that trigger prefetches (single instance, and averaged over all ICIs of size k)

[Figure 7: the benchmark plus the core, cache structure, and DRAM structure configurations feed a gem5 + DRAMSim2 simulation, inside which the instruction pattern profiler, cache access profiler, and locality profiler collect the statistics.]

Figure 7. Profiling infrastructure in MeToo.

During each instruction's lifetime in the pipeline, we record true dependence relations along with the producer and consumer instructions, and track whether each instruction will become a traffic-generating instruction. We maintain this list with a window size of N. When a new instruction is added and the size of the list exceeds N, we evict the oldest instruction. When the evicted instruction is a traffic-generating instruction, we scan the list from the oldest instruction to the youngest to find the next traffic-generating instruction. If such an instruction is found, the two traffic-generating instructions become a pair that forms an ICI. Then, we collect statistics related to the ICI, such as the ICI size, dependence ratio, instruction type mix, etc. If a second traffic-generating instruction is not found, the next traffic-generating instruction is too far from the evicted instruction and the ICI size is larger than N, so we ignore the statistics. In our current setup, we use N = 1500, which means that very large ICI sizes are ignored. In practice, ignoring the statistics for very large ICIs is fine because their inter-arrival timing is very likely larger than the memory subsystem latency, which means they behave as if they appear in isolation.

Finally, the locality statistics are collected between the cache and the memory subsystem.

As can be seen from the profiles, we do not use any actual timing information. Actual timing information is architecture dependent and sensitive to hardware changes. However, information such as instruction order, instruction types, and dependences is independent of the configuration of the memory subsystem. Using architecture-independent statistics enables the feedback loop from new memory subsystem configurations to affect processor performance, which in turn determines the memory request inter-arrival pattern. This also distinguishes MeToo from prior cloning studies, which either rely on static time stamps and hence ignore the memory performance feedback loop, or do not take timing into account at all.

    V. METOO CLONE GENERATION

After profiling, a MeToo clone is generated stochastically. Essentially, multiple random numbers are generated, and for each instruction to be generated for the clone, a random walk is performed over the statistics collected during profiling, shown in Table I. This process is repeated until we have generated the number of instructions desired for the clone. Through the random walk, the clone that is produced conforms to the behavior of the original application, but without using the original application's code or data, thereby ensuring that the application remains proprietary.

One important design issue related to the random walk is whether to adjust the statistics dynamically during the generation process, or to keep the statistics unchanged. The issue is related to how clone generation should converge to the statistics: weak versus strict convergence [9]. Strict convergence requires the sequence of instructions generated for the clone to fully match the statistics of the profiled original application. This requires the statistics to be recalculated after each generated instruction. At the end of generation, all the counts in the statistics should be zero or very close to zero. This is akin to picking balls from a bin of balls of different colors without returning them: after all balls are picked, the picked balls have the same mix of colors as the bin had before picking. Weak convergence, on the other hand, does not adjust or recalculate the statistics after an instruction is generated. This is akin to picking a ball from the bin and returning it after each pick. The mix of colors of the picked balls is not guaranteed to converge to the original mix, but with a larger number of picks it tends to converge. For cloning, the number of instructions generated has a similar effect to the number of picked balls. Strict convergence is very important for generating small clones with a small number of instructions, because weak convergence may not have sufficient trials for the samples to converge to the population. For large clones with a large number of instructions, however, weak convergence

[Figure 8: flow chart of the clone generation process. Step 1: start (warm-up). Step 2: choose a traffic-generating or cache-hit instruction based on the current ICI. Step 3: determine the address, registers, and load/store type for a cache-hit instruction. Step 4: determine the address, registers, and load/store type for a traffic-generating instruction and choose the next ICI. Step 5: generate non-memory instructions. Step 6: end when the target number of instructions is reached.]

Figure 8. A high-level view of the clone generation process.

is good enough. A drawback of strict convergence is that it is not always possible to satisfy when many different statistics have to be satisfied simultaneously. Convergence is not guaranteed, and clone generation may stall without converging. With weak convergence, clone generation is guaranteed to complete. MeToo's clone generation therefore uses a mixed approach. For the distribution statistics, i.e. the instruction count distribution, dependence distance distribution, instruction type mix distribution, and most of the cache access pattern statistics, the generation process uses strict convergence; these are important characteristics that directly influence the intensity of traffic requests. For other parameters, such as the ICI recurrence probability and the locality statistics, we use weak convergence. High accuracy is less important for these parameters since they only represent general behavior.
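The two convergence modes can be pictured as the ball-picking analogy above turned into code: sampling without replacement (strict) versus with replacement (weak). A minimal sketch, with `counts` as a hypothetical profile mapping values to occurrence counts:

```python
# Sketch: strict vs. weak convergence as sampling from a discrete profile.
import random

def draw_strict(counts):
    # Without replacement: decrement the profile after each draw, so the
    # generated sequence converges exactly to the profiled counts.
    choices = [v for v, c in counts.items() for _ in range(c)]
    v = random.choice(choices)
    counts[v] -= 1
    if counts[v] == 0:
        del counts[v]
    return v

def draw_weak(counts):
    # With replacement: the profile is left untouched; the sample mix only
    # tends toward the profile as the number of draws grows.
    values = list(counts)
    weights = [counts[v] for v in values]
    return random.choices(values, weights=weights, k=1)[0]
```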

Figure 8 shows a high-level flow chart of clone generation. Each grey box in the figure represents a generation step that involves multiple substeps. Due to space limitations, we can only describe the high-level steps of the generation process:

1) As an initialization step, warm up the caches with randomly generated memory accesses.

2) Choose the type of memory instruction (ld/st) to generate: either a traffic-generating memory instruction or a memory instruction that hits in a cache (L1 or L2). For the first memory instruction, generate a traffic-generating instruction and choose a random ICI size based on the ICI distribution. Otherwise, check whether the number of instructions generated so far has satisfied the ICI size. If not, generate a memory instruction that results in a cache hit and go to Step 3. If the ICI size is satisfied, start a new ICI by generating a new traffic-generating instruction and go to Step 4.

3) Determine the address and registers to be used by the instruction, and whether the instruction should be a load or a store. For the address, use Si to choose the set and Sij to choose the stack distance position. For the registers, use Rk and Dk to determine whether the instruction should depend on a previous instruction or on the previous traffic-generating instruction. For load vs. store, check whether the block is dirty or clean, then use WDij or WCij respectively (consulting fD and fW as well) to determine whether the instruction should be a load or a store. Finally, generate the instruction and continue to Step 5.

4) Determine the address and registers to be used by the instruction, and whether the instruction should be a load or a store. For the address, use Pr to determine whether the address should be within the region of size r around the previous traffic-generating instruction. For the registers, use Mk to determine whether this instruction should depend on the previous traffic-generating instruction. Determine load vs. store based on WCij for the case where j is larger than the cache associativity; fD and fW are also consulted as needed. Finally, generate the instruction, choose a new ICI size based on the ICI recurrence probability and the ICI size distribution, and continue to Step 5.

5) Generate at least one non-memory instruction, based on the ratio of non-memory to memory instructions in the original benchmark. Determine each non-memory instruction's type using Nkm, and determine its registers using Rk and Dk. Go to Step 6.

6) Determine whether the expected number of instructions has been generated. If so, stop the generation. If no more instructions can be generated, also stop. Otherwise, go to Step 2.

These steps run in a loop until the desired number of traffic instruction pairs has been generated (a sketch of the loop is given below). Since we incorporate the strict convergence approach, there is a probability that not all statistics fully converge; in that case, the final number of instructions may be slightly fewer than the target. The generated instructions are emitted as assembly and bundled into a source file that can be compiled into an executable clone.
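Purely as an illustration of the control flow (not the authors' implementation), here is a runnable skeleton of the loop with the per-step work reduced to stubs; the profile dictionary, the placeholder instruction tuples, and choose_ici_size are all hypothetical, and a real generator would consult the Table I statistics inside each step:

```python
# Skeleton of the Figure 8 generation loop, with Steps 3-5 reduced to stubs.
import random

def choose_ici_size(profile):
    # Random walk over the ICI size distribution (Steps 2/4).
    sizes, weights = zip(*profile["ici"].items())
    return random.choices(sizes, weights=weights, k=1)[0]

def generate_clone(profile, target_insts):
    code, ici_left = [], choose_ici_size(profile)
    while len(code) < target_insts:          # Step 6: stop at the target
        if ici_left > 0:
            code.append(("load_hit",))       # Step 3 (stub): cache-hit inst
            ici_left -= 1
        else:
            code.append(("load_miss",))      # Step 4 (stub): traffic-gen inst
            ici_left = choose_ici_size(profile)
        code.append(("add",))                # Step 5 (stub): non-memory inst
    return code

clone = generate_clone({"ici": {2: 5, 10: 3, 40: 1}}, 50)
```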

    VI. VALIDATION AND DESIGN EXPLORATION METHOD

To generate and evaluate the clones, we use the full-system simulator gem5 [16] to model the processor and cache hierarchy, and DRAMSim2 [17] to model the DRAM memory subsystem and the memory controller. The profiling run is performed with a processor and cache hierarchy model with the parameters shown in Table II. The L2 cache is intentionally chosen to be relatively small so that its cache misses stress test MeToo's cloning accuracy.

We use benchmarks from SPEC2006 [18] for evaluation. We fast-forward each benchmark to skip the initialization stage and clone one particular phase of the application consisting of 100 million instructions (to capture other phases, we can choose other 100-million-instruction windows as identified by SimPoint [19]). The clone that we generate contains 1 million memory references, yielding a clone that is roughly 20-50x smaller than the original application, allowing much faster simulation. For example, the original benchmarks usually run for several hours, while the clones finish in several minutes. The clone generation itself typically takes a couple of minutes.

[Figure 9: per-benchmark histograms of inter-arrival times (0 to 300 bus cycles) for zeusmp, libquantum, bzip2, gcc, mcf, astar, gromacs, lbm, milc, and hmmer, one panel for the original and one for the clone of each benchmark.]

Figure 9. Comparison of the inter-arrival timing distribution of memory requests of the original applications vs. their clones.

[Figure 10: cumulative distribution curves (0-100%, inter-arrival times 1 to about 300 cycles) overlaying orig and clone, for the standalone benchmarks libquantum, gcc, bzip2, mcf, astar, gromacs, lbm, milc, zeusmp, and hmmer, and for the co-scheduled pairs gcc_astar, gromacs_libquantum, bzip2_gcc, mcf_lbm, milc_lbm, mcf_libquantum, gcc_zeusmp, libquantum_hmmer, hmmer_zeusmp, and astar_gromacs.]

Figure 10. Cumulative distribution of inter-arrival times of memory requests of benchmarks vs. clones, in standalone (top two rows) and co-scheduled (bottom two rows) modes.

Table II. Baseline hardware configuration used for profiling and clone generation.

Core: dual core, each 2GHz, x86 ISA, 8-wide out-of-order superscalar, 64-entry IQ
Cache: L1-D: 32KB, 4-way, 64B block, LRU; L2: 256KB, 8-way, 64B block, LRU, non-inclusive
Bus: 64-bit wide, 1GHz
DRAM: DDR3, 2GB; 8 banks/rank, 32K rows/bank, 1KB row buffer; 1.5ns cycle time; address mapping chan:row:col:bank:rank

We validate MeToo clones in two workload scenarios: 10 memory-intensive benchmarks running alone, and 10 multi-programmed workloads constructed by randomly pairing individual benchmarks on a dual-core processor configuration. The benchmarks categorized as memory intensive are those with an L2 MPKI (misses per kilo-instructions) of at least 3 and a ratio of L2 cache misses to all memory references of at least 1%. For co-scheduled benchmarks, we run both benchmarks for the same number of instructions. If one benchmark finishes earlier than the other, we stall its core and wait for the other benchmark to finish. We collect results only after both benchmarks finish their simulation windows.

MeToo profiles are collected using the hardware configuration shown in Table II, and are then used to generate clones. The clones are validated in two steps. In the first step, they are compared against the original benchmarks using the baseline machine configuration (Table II). The purpose of this step is to see whether the clones faithfully replicate the behavior of the benchmarks on the same machine configuration for which the profiles were collected.

The second step of the validation tests the clones' ability to be used for design exploration of the memory subsystem in place of the original benchmarks. Thus, in the second step, we run the previously generated clones on new hardware configurations that were not used for collecting profiles. These configurations are listed in Table III. We vary the memory subsystem configuration significantly: the bus width from 1 to 16 bytes and the bus clock frequency from 100 MHz to 1 GHz. We also vary the DRAM cycle time from 1.5ns to 12ns, and vary the address mapping scheme by permuting the address bits used for addressing channels, ranks, banks, rows, and columns. This has the effect of producing a different degree of bank conflicts for the same memory access stream. We also vary the L2 prefetcher so that we can specifically test the ability of MeToo's clone generation to reproduce the effect of hardware prefetches using software prefetch instructions. Finally, we mentioned in Section III our assumption that MeToo clones are designed for design exploration of the memory subsystem while keeping the processor and caches fixed. However, since we capture instruction dependences and mixes, we added a limited range of core configuration variations to test how well MeToo's memory performance feedback loop works under different core configurations.

Table III. Hardware configurations used for memory design exploration.

Core: pipeline width 2, 4, 8, 16; IQ size 8, 16, 32, 64, 128
Cache: L2 prefetcher: stride, with degree 0, 2, 4, 8
Bus: width 1, 2, 4, 8, 16 bytes; speed 100 MHz, 500 MHz, 1 GHz
DRAM: cycle time 1.5ns, 3ns, 6ns, 12ns; address mapping chan:row:col:bank:rank, chan:rank:row:col:bank, chan:rank:bank:col:row, chan:row:col:rank:bank

    VII. EVALUATION RESULTS

    A. Validation

In order to validate MeToo clones, we collect the inter-arrival time distribution statistics shown in Figure 9. The first and third rows of the charts show the inter-arrival time distribution for the original benchmarks, while the second and fourth rows show the distribution for the MeToo clones. The x-axis shows the inter-arrival time (in bus clock cycles), while the y-axis shows the number of occurrences of each inter-arrival time. To improve legibility, we aggregate inter-arrival times of 300 clock cycles or more into a single bar, and smooth all inter-arrival times smaller than 300 cycles with a 5-cycle moving-average window. The figure shows that for most benchmarks there are dominant inter-arrival times, with significant variation between benchmarks. Some benchmarks, such as libquantum, lbm, and milc, have mostly small inter-arrival times (less than 50 cycles). Others, such as bzip2, mcf, zeusmp, and hmmer, have significant inter-arrival times between 50 and 200 cycles. Some benchmarks, such as gcc, gromacs, hmmer, and astar, have inter-arrival times larger than 200 cycles.

The figure also shows that MeToo clones are able to capture the dominant inter-arrival times in each benchmark with surprising accuracy. The dominant inter-arrival times are mostly replicated very well, though some of them tend to become slightly diffused. For example, in libquantum, the two dominant inter-arrival times are more diffused in the clone. Small peaks in the inter-arrival times may also become more diffused; for example, the small peaks at about 80 cycles for mcf and milc are diffused. The reason for this diffusion effect is that MeToo clones are generated stochastically, hence there is additional randomness in MeToo that was not present in the original benchmarks. However, the diffusion effect is mild. Overall, the figure shows that MeToo clones match the inter-arrival distributions of the original benchmarks well. Recall that MeToo captures inter-arrival statistics in terms of ICI size, rather than number of clock cycles. The good match in inter-arrival timing shows that MeToo successfully translates the number of instructions in the intervals into the number of clock cycles in the intervals.

[Figure 11: four bar charts comparing orig vs. clone for each standalone benchmark and co-scheduled pair: bus queueing delay (cycles), DRAM power (watts), DRAM bandwidth (GB/s), and average queue size.]

Figure 11. Comparison of results obtained by the original benchmarks vs. clones in terms of DRAM power, aggregate DRAM bandwidth, memory controller scheduling queue length, and bus queueing delay.

To get a better idea of how well the MeToo clones and the original benchmarks fit, we superimpose the cumulative probability distributions of the inter-arrival timing statistics of the clones and benchmarks in Figure 10. The x-axis shows the inter-arrival time in clock cycles, while the y-axis shows the fraction of all inter-arrival instances that are smaller than or equal to that inter-arrival time. The top two rows show individual benchmarks running alone, compared to their clones running alone. The bottom two rows show co-scheduled benchmarks running together on different cores, compared to their respective clones co-scheduled on different cores. As the figure shows, MeToo clones match the inter-arrival patterns of the original benchmarks both in standalone mode and in co-scheduled mode. We define the fitting error as the absolute distance between the curve of the clone and the curve of the original benchmark, averaged across all inter-arrival times. The fitting error, averaged across all standalone benchmarks, is small (4.2%). We expected the fitting error to be higher for co-scheduled benchmarks because there is an additional memory feedback loop: one benchmark's memory request timing causes conflicts in shared resources (e.g. the memory scheduling queue, channels, banks, etc.), and the conflicts end up affecting the other benchmark's memory request timing. However, our results indicate an even smaller fitting error for the co-scheduled benchmarks (2.8%). This is good news, as it indicates that MeToo clones do not lose accuracy, but instead gain accuracy, in the co-scheduled execution mode common in multicore systems. A probable cause is that co-scheduling increases the degree of randomness in the inter-arrival times coming from two different benchmarks, moving their degree of randomness closer to that of the MeToo clones.
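A minimal sketch of computing the fitting error just defined from two samples of inter-arrival gaps, using empirical CDFs on a common grid; the 300-cycle limit mirrors the truncation used in the figures:

```python
# Sketch: mean absolute distance between the clone's and the original's
# cumulative inter-arrival distributions.
import numpy as np

def fitting_error(gaps_orig, gaps_clone, max_gap=300):
    grid = np.arange(1, max_gap + 1)
    cdf_o = np.searchsorted(np.sort(gaps_orig), grid, side="right") / len(gaps_orig)
    cdf_c = np.searchsorted(np.sort(gaps_clone), grid, side="right") / len(gaps_clone)
    return np.mean(np.abs(cdf_o - cdf_c))
```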

Now let us look at whether matching the inter-arrival timing of memory requests translates into a good match in DRAM power consumption, DRAM bandwidth, average memory controller queue size, and bus queueing delay. These are end metrics that are important for the design of the memory subsystem; for example, the average queue size gives an indication of an appropriate memory controller scheduling queue size. The average queue size is measured as the average number of entries in the queue at the time a new request is inserted. For the queue length and bus queueing delay, we use a 12ns DRAM cycle time and a 1-byte-wide memory bus, in order to stress the bus and memory controller. Figure 11 shows these metrics for benchmarks and clones running alone as well as for co-scheduled benchmarks and co-scheduled clones running on a multicore configuration. Across all cases, the average absolute errors are 0.25 watts, 0.18 GB/s, 0.03, and 1.06 ns for DRAM power consumption, DRAM bandwidth, average queue length, and queueing delay, respectively, which are small.

Figure 12 compares the errors in the four metrics from the previous figure for MeToo clones versus geometric inter-arrival and uniform inter-arrival. The y-axis shows the average absolute errors (across all workloads), normalized to the largest bar in each case. The figure shows that the errors of MeToo clones are only a fraction (at most 35%) of the errors from using geometric or uniform inter-arrival. This indicates that the geometric/Poisson arrival pattern assumed in most queueing models is highly inaccurate for use in memory subsystem performance studies.
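For reference, the two baseline arrival models can be generated as below; this is a hedged sketch assuming NumPy, with mean_gap (an integer number of cycles) set to the profiled mean inter-arrival time. Both baselines match only the mean of the inter-arrival distribution, which is one reason they fit so much worse than MeToo clones:

```python
import numpy as np

rng = np.random.default_rng(0)

def geometric_inter_arrivals(n, mean_gap):
    # Memoryless gaps (discrete-time analog of Poisson arrivals):
    # fire with probability p = 1/mean_gap each cycle, so E[gap] = mean_gap.
    return rng.geometric(1.0 / mean_gap, size=n)

def uniform_inter_arrivals(n, mean_gap):
    # Gaps drawn uniformly from [1, 2*mean_gap - 1]; the mean is also mean_gap.
    return rng.integers(1, 2 * mean_gap, size=n)
```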

[Figure 12 chart: normalized average error (y-axis, 0 to 1.2) for power, bandwidth, queue size, and queue delay; bars for clone, geo, and uniform]

Figure 12. The fitting errors of MeToo clones, normalized to geometric or uniform inter-arrival clones.

    B. Design Exploration Results

Finally, Figure 13 shows the fitting errors of MeToo clones vs. geometric inter-arrival clones for various machine configurations, with varying bus width, bus frequency, DRAM cycle time, DRAM address mapping scheme, pipeline width, issue queue size, and prefetching degree. None of these configurations was used for profiling and clone generation. The parameters for these configurations are shown in Table III. Across all cases, the errors of MeToo clones largely remain constant at about 5%, except in very extreme cases, for example when the instruction queue is very small (8 entries), which is not a memory subsystem design exploration parameter. The fitting errors for MeToo clones are typically only one third of the fitting errors obtained using clones that rely on geometric inter-arrival. The results demonstrate that MeToo clones can substitute for the original applications across a wide range of memory subsystem configurations, and even across some processor pipeline configurations.
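An exploration of this kind amounts to a one-factor-at-a-time sweep. The sketch below is hypothetical: the sweep values mirror the Figure 13 panels, but BASELINE and simulate() are placeholders standing in for the Table III settings and a gem5 + DRAMSim2 run, not the authors' tooling:

```python
# Hypothetical one-factor-at-a-time design exploration sweep.
BASELINE = dict(bus_width=8, bus_mhz=500, dram_cycle_ns=6,
                pipeline_width=4, iq_size=64, prefetch_degree=2,
                addr_mapping="default")

SWEEPS = {
    "bus_width":       [16, 8, 4, 2, 1],
    "bus_mhz":         [1000, 500, 100],
    "dram_cycle_ns":   [1.5, 3, 6, 12],
    "pipeline_width":  [16, 8, 4, 2],
    "iq_size":         [128, 64, 32, 16, 8],
    "prefetch_degree": [0, 2, 4, 8],
}

def explore(simulate, clone):
    """Run the clone under each single-parameter deviation from BASELINE."""
    results = {}
    for param, values in SWEEPS.items():
        for value in values:
            config = dict(BASELINE, **{param: value})
            results[(param, value)] = simulate(clone, config)
    return results
```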

    VIII. CONCLUSION

In this paper, we have presented MeToo, a framework for generating synthetic memory traffic that can be used for memory subsystem design exploration in place of proprietary applications or workloads. MeToo relies on novel profiling and clone generation methods that capture the memory performance feedback loop, in which memory performance affects processor performance, which in turn affects the inter-arrival timing of memory requests. We validated the clones and showed that they achieve an average error of 4.2%, which is only a fraction of the errors obtained with the competing approaches of geometric inter-arrival (commonly used in queueing models) and uniform inter-arrival.

[Figure 13 charts: fitting error (y-axes, 0 to 20) for geo vs. clone across seven panels: bus width (16, 8, 4, 2, 1), bus speed (1G, 500M, 100M), pipeline width (16, 8, 4, 2), DRAM cycle time (1.5, 3, 6, 12), IQ size (128, 64, 32, 16, 8), prefetching degree (0, 2, 4, 8), and address mapping scheme]

    Figure 13. The fitting errors of MeToo vs. geometric inter-arrival clones, across various hardware configurations.

    ACKNOWLEDGEMENT

This research was supported in part by NSF Award CNS-0834664. Balakrishnan was a PhD student at NCSU when he contributed to this work. We thank the anonymous reviewers and Kirk Cameron for valuable feedback that improved this paper.

    REFERENCES

[1] B. Rogers et al., "Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling," in Proc. of the 36th International Symposium on Computer Architecture, 2009.

[2] A. Joshi, L. Eeckhout, and L. John, "The return of synthetic benchmarks," in Proc. of the 2008 SPEC Benchmark Workshop, 2008.

[3] R. H. Bell, Jr. and L. K. John, "Efficient power analysis using synthetic testcases," in Proc. of the International Workload Characterization Symposium, IEEE, 2005.

[4] A. M. Joshi et al., "Automated microprocessor stressmark generation," in Proc. of the 14th International Symposium on High Performance Computer Architecture, 2008.

[5] K. Ganesan, J. Jo, and L. K. John, "Synthesizing memory-level parallelism aware miniature clones for SPEC CPU2006 and ImplantBench workloads," in Proc. of the International Symposium on Performance Analysis of Systems and Software, 2010.

[6] K. Ganesan and L. K. John, "Automatic generation of miniaturized synthetic proxies for target applications to efficiently design multicore processors," IEEE Transactions on Computers, vol. 99, p. 1, 2013.

[7] E. Deniz, A. Sen, J. Holt, and B. Kahne, "Using software architectural patterns for synthetic embedded multicore benchmark development," in Proc. of the International Symposium on Workload Characterization, 2012.

[8] E. Deniz, A. Sen, B. Kahne, and J. Holt, "Minime: Pattern-aware multicore benchmark synthesizer," IEEE Transactions on Computers, vol. 64, no. 8, pp. 2239–2252, Aug. 2015.

[9] G. Balakrishnan and Y. Solihin, "WEST: Cloning data cache behavior using stochastic traces," in Proc. of the 18th International Symposium on High Performance Computer Architecture, 2012.

[10] A. Awad and Y. Solihin, "STM: Cloning the spatial and temporal memory access behavior," in Proc. of the 20th International Symposium on High Performance Computer Architecture, 2014.

[11] M. Badr and N. Enright Jerger, "SynFull: Synthetic traffic models capturing cache coherent behaviour," in Proc. of the International Symposium on Computer Architecture, 2014.

[12] Z. Kurmas and K. Keeton, "Synthesizing representative I/O workloads using iterative distillation," in Proc. of the International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2003.

[13] C. Delimitrou, S. Sankar, K. Vaid, and C. Kozyrakis, "Storage I/O generation and replay for datacenter applications," in Proc. of the International Symposium on Performance Analysis of Systems and Software, 2011.

[14] G. Balakrishnan and Y. Solihin, "MEMST: Cloning memory behavior using stochastic traces," in Proc. of the International Symposium on Memory Systems, 2015.

[15] R. L. Mattson, J. Gecsei, D. Slutz, and I. Traiger, "Evaluation Techniques for Storage Hierarchies," IBM Systems Journal, vol. 9, no. 2, 1970.

[16] N. Binkert et al., "The gem5 simulator," SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, 2011.

[17] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, "DRAMSim2: A cycle accurate memory system simulator," Computer Architecture Letters, vol. 10, no. 1, pp. 16–19, Jan.–June 2011.

[18] J. L. Henning, "SPEC CPU2006 benchmark descriptions," SIGARCH Comput. Archit. News, vol. 34, no. 4, pp. 1–17, Sep. 2006.

[19] E. Perelman, G. Hamerly, M. V. Biesbrouck, T. Sherwood, and B. Calder, "Using SimPoint for accurate and efficient simulation," in Proc. of the International Conference on Measurement and Modeling of Computer Systems, 2003.