Designing On-chip Memory Systems for Throughput Architectures. Ph.D. Proposal. Jeff Diamond. Advisor: Stephen Keckler.


DESCRIPTION

Designing On-chip Memory Systems for Throughput Architectures. Ph.D. Proposal, Jeff Diamond, Advisor: Stephen Keckler. Turning to Heterogeneous Chips. "We'll be seeing a lot more than 2-4 cores per chip really quickly" – Bill Mark, 2005. AMD Trinity. nVIDIA Tegra 3. Intel Ivy Bridge.

TRANSCRIPT


Designing On-chip Memory Systems for Throughput Architectures
Ph.D. Proposal
Jeff Diamond
Advisor: Stephen Keckler
Turning to Heterogeneous Chips

AMD - TRINITY

Intel Ivy Bridge
"We'll be seeing a lot more than 2-4 cores per chip really quickly" – Bill Mark, 2005

nVIDIA Tegra 3
Throughput is a more efficient way to accelerate parallel applications. Leverage the on-chip L3 cache to deal with reduced bandwidth compared to dedicated GPUs. Can mention the power wall: supercomputers, sockets, mobile, all power capped.

Punchline: we need throughput architectures, but even they need to be more power efficient.

Talk Outline: Introduction, The Problem, Throughput Architectures, Dissertation Goals, The Solution, Modeling Throughput Performance, Architectural Enhancements, Thread Scheduling, Cache Policies, Methodology, Proposed Work

After going through the outline points, reread the first slide: throughput.

Throughput Architectures (TA)
Key features: use explicit parallelism to break the application into threads; optimize hardware for performance density, not single thread performance.
Benefits: drop voltage and peak frequency for a quadratic improvement in power efficiency. Cores are smaller and more energy efficient, with less need for out-of-order execution, register renaming, branch prediction, fast synchronization, and low latency ALUs. Further economize by multithreading each core, and amortize expense using SIMD.
Performance per area, power per area, performance per watt. Scratchpad like Godson-T. Some simplify the memory system.
Scope: Highly Threaded TA
Architecture continuum:
Multithreading: a large number of threads masks long latency; a small amount of cache, primarily for bandwidth.
Caching: large amounts of cache to reduce latency; a small number of threads.
Can we get the benefits of both?


Power 7: 4 threads/core, ~1MB/thread. SPARC T4: 8 threads/core, ~80KB/thread. GTX 580: 48 threads/core, ~2KB/thread.
CMP end vs. GPU end. On heterogeneous chips, they tend to pair cores from both ends of the spectrum. The initial part of this study dealt with the left side, but the rest of this study focuses on the right side. Agnostic to SIMD.
So why not just add some cache? In a way, heterogeneous chips do this. It turns out that doesn't work so well. To understand why, we need to understand the issues with highly threaded TAs: they move pressure from the cores to the memory system.

Problem - Technology Mismatch
Computation is cheap, data movement is expensive: a hit in the L1 cache costs 2.5x the power of a 64-bit FMADD; a move across the chip, 50x; a fetch from DRAM, 320x.
Exponential growth in cores saturates off-chip bandwidth, so performance is capped. Latency to off-chip DRAM is now hundreds of cycles, so hundreds of threads must be in flight to cover that latency.
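As a rough cross-check, these ratios are consistent with the absolute energies quoted later in the deck (20pJ per FMADD op, 50pJ per L1 access); the cross-chip and DRAM figures below are back-of-envelope values implied by the ratios, not measured numbers:

$$E_{\text{FMADD}} \approx 20\,\text{pJ},\quad E_{\text{L1}} \approx 2.5 \times 20 = 50\,\text{pJ},\quad E_{\text{cross-chip}} \approx 50 \times 20 = 1\,\text{nJ},\quad E_{\text{DRAM}} \approx 320 \times 20 = 6.4\,\text{nJ}$$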

The Downward Spiral
Little's Law: the number of threads needed is proportional to average latency.
On-chip resources are an opportunity cost: thread contexts, in-flight memory accesses.
Too many threads create negative feedback: adding threads to cover latency increases latency (slower register access, thread scheduling), reduces locality (which reduces bandwidth and DRAM efficiency and the effectiveness of caching), and leads to parallel starvation.
When more threads increase latency, you need more threads to cover that latency.
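Little's Law as used here, stated compactly (the latency figure below is an illustrative round number, not a measurement):

$$N_{\text{threads}} \approx (\text{memory requests issued per cycle}) \times L_{\text{AVG}}$$

For example, sustaining one outstanding memory request per cycle against a DRAM latency of a few hundred cycles requires a few hundred requests, and hence hundreds of threads, in flight.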

Talk Outline: Introduction, The Problem, Throughput Architectures, Dissertation Goals, The Solution, Modeling Throughput Performance, Architectural Enhancements, Thread Scheduling, Cache Policies, Methodology, Proposed Work

After going through the outline points, reread the first slide: throughput.

Goal: Increase Parallel Efficiency
Problem: too many threads!
Increase parallel efficiency, i.e. reduce the number of threads needed to achieve a given level of performance; this improves throughput performance.
Apply low latency caches and leverage the upward spiral. It is difficult to mix multithreading and caching; caches are typically used just for bandwidth amplification.
Important factors: thread scheduling, instruction scheduling (per-thread parallelism).
Also increase per-thread ILP.

Contributions
Quantifying the impact of single thread performance on throughput performance.
Developing a mathematical analysis of throughput performance.
Building a novel hybrid trace-based simulation infrastructure.
Demonstrating unique architectural enhancements in thread scheduling and cache policies.
But before we can do any of this, we have to understand throughput performance.

Talk Outline: Introduction, The Problem, Throughput Architectures, Dissertation Goals, The Solution, Modeling Throughput Performance, Architectural Enhancements, Thread Scheduling, Cache Policies, Methodology, Proposed Work

After going through the outline points, reread the first slide: throughput.

Talk Outline: Introduction, The Problem, Throughput Architectures, Dissertation Goals, The Solution, Modeling Throughput Performance, Cache Performance, The Valley, Architectural Enhancements, Thread Throttling, Cache Policies, Methodology, Proposed Work

After going through the outline points, reread the first slide: throughput.

Mathematical Analysis
Why take a mathematical approach? Be very precise about what we want to optimize. Understand the relationships and sensitivities of throughput performance to single thread performance, cache improvements, and application characteristics. Suggest the most fruitful architectural improvements.

One contribution of my dissertation.

Spare the details, just the basics: show the model of throughput performance and of performance from caches. This is one of the key contributions of this work.
Modeling Throughput Performance


N_T = total active threads; P_CHIP = total throughput performance; P_ST = single thread performance; L_AVG = average latency per instruction; Power_CHIP = E_AVG (joules) x P_CHIP.
If measured per cycle, we get IPC per thread and latency in cycles; multiply by frequency (cycles/sec) to get true performance.
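A compact statement of the model these definitions imply (a reconstruction from the definitions and notes; the slide's own equation is not preserved in this transcript):

$$P_{\text{CHIP}} = N_T \times P_{\text{ST}}, \qquad P_{\text{ST}} \approx \frac{1}{L_{\text{AVG}}}\ \text{(IPC per thread)}, \qquad \mathrm{Power}_{\text{CHIP}} = E_{\text{AVG}} \times P_{\text{CHIP}}$$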

Key insight: latency varies with the number of threads. Latency is most important, and caching can be used to reduce latency.

Need to address power, which is performance x energy.
Cache As A Performance Unit
Key: how much cache does a thread need?

FMADD vs. SRAM:
FMADD - active power: 20pJ/op; leakage power: 1 watt/mm^2.
SRAM - area: 2-11KB SRAM or 8-40KB eDRAM, shared through pipelining; active power: 50pJ per L1 access, 1.1nJ per L2 access; leakage power: 70 milliwatts/mm^2, 1.4 watts/MB (eDRAM: 350 milliwatts/MB).
Caches make loads 150x faster and 300x more energy efficient, and use 10-15x less power/mm^2 than FPUs. One FPU = ~64KB SRAM / 256KB eDRAM.
STT-MRAM: 2010 ISCA paper at 32nm, scaled to match 28nm nVidia numbers = 36pJ per access.
Source: 2011 28nm Bill Dally keynote at IPDPS.
Performance From Caching
Ignore changes to DRAM latency & off-chip BW (we will simulate these). Assume ideal caches. What is the maximum performance benefit?

Memory Intensity, M=1-A

N_T = total active threads on chip; A = arithmetic intensity of the application (fraction of non-memory instructions); L = average latency per instruction.
For power, replace L with E, the average energy per instruction: qualitatively identical, but the differences are more dramatic.
Key insight: although qualitatively identical, only performance (latency) improvements improve parallel efficiency and avoid the downward spiral.
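One plausible form of the average-latency term implied by these definitions (a sketch only; the exact expression from the slide is not in the transcript). With hit rate H, cache hit latency L_hit, DRAM latency L_mem, and non-memory instruction latency L_alu:

$$L = A\,L_{\text{alu}} + (1-A)\big[H\,L_{\text{hit}} + (1-H)\,L_{\text{mem}}\big], \qquad P_{\text{CHIP}} \approx \frac{N_T}{L}$$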

Ideal Cache = Frequency Cache
Hit rate depends on the amount of cache and the application working set. Store the items used the most times; this is the concept of frequency. Once we know an application's memory access characteristics, we can model throughput performance.
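A minimal sketch (not the proposal's tooling) of how an ideal frequency cache's hit rate can be estimated from an address trace: count accesses per line, keep the most frequently used lines up to capacity, and credit their accesses as hits. The function name and parameters are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Estimate the hit rate of an ideal "frequency" (LFU-like) cache that holds
// the cacheCapacityLines most frequently accessed lines of the whole trace.
double idealFrequencyHitRate(const std::vector<uint64_t>& addressTrace,
                             size_t cacheCapacityLines,
                             unsigned lineBytes = 64) {
    if (addressTrace.empty()) return 0.0;
    std::unordered_map<uint64_t, uint64_t> accessesPerLine;
    for (uint64_t addr : addressTrace)
        ++accessesPerLine[addr / lineBytes];            // touches per cache line

    std::vector<uint64_t> counts;
    counts.reserve(accessesPerLine.size());
    for (const auto& kv : accessesPerLine) counts.push_back(kv.second);
    std::sort(counts.begin(), counts.end(), std::greater<uint64_t>());

    uint64_t hits = 0;
    for (size_t i = 0; i < counts.size() && i < cacheCapacityLines; ++i)
        hits += counts[i];                               // most-used lines stay resident
    return static_cast<double>(hits) / addressTrace.size();
}
```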

Modeling Cache Performance

First, describe the horizontal axis as representing the fraction of the entire application working set. After exposing the hit rate, describe how it is divided up, and why you so quickly find yourself on the leftmost side of the graph.

Integrating the frequency distribution yields the hit rate of an ideal cache. Latency varies linearly with the miss rate, which is H_C upside down; performance is approximately 1 over the miss rate.
Performance Per Thread

P_S(t) is a steep reciprocal.
Here we show how fast performance drops as a function of threads. Emphasize just how quickly we're at the leftmost edge of the hit rate, that this is an ideal cache, and that there's so little cache per thread so fast.
Talk Outline: Introduction, The Problem, Throughput Architectures, Dissertation Goals, The Solution, Modeling Throughput Performance, Cache Performance, The Valley, Architectural Enhancements, Thread Throttling, Cache Policies, Methodology, Proposed Work

After going through the outline points, reread the first slide: throughput.

Valley in Cache Space

Valley Annotated

Figure: the valley, annotated with Cache Regime, MT Regime, Valley Width, No Cache, Cache.
Talk about thread throttling in CMPs: never with a valley involved, never done with TAs. There is already evidence of impact.
"Threads vs. Caches: Modeling the Behavior of Parallel Workloads," Zvika Guz, Oved Itzhak, Idit Keidar, Avinoam Kolodny, Avi Mendelson, and Uri C. Weiser, 2010.
"Many-Core vs. Many-Thread Machines: Stay Away From the Valley," 2008.
We've uncovered a lot. Don't know which point is higher.

Prior Work
Hong et al., 2009, 2010: simple, cacheless GPU models, used to predict the MT peak.
Guz et al., 2008, 2010: graphed throughput performance with an assumed cache profile; identified the valley structure; validated against PARSEC benchmarks; no mathematical analysis; didn't analyze the bandwidth limited regime; focused on CMP benchmarks.
Galal et al., 2011: excellent mathematical analysis, focused on FPU+register design.
Valley Annotated

Figure: the valley, annotated with Cache Regime, MT Regime, Valley Width, No Cache, Cache.
Talk about thread throttling in CMPs: never with a valley involved, never done with TAs. There is already evidence of impact.
"Threads vs. Caches: Modeling the Behavior of Parallel Workloads," Zvika Guz, Oved Itzhak, Idit Keidar, Avinoam Kolodny, Avi Mendelson, and Uri C. Weiser, 2010.
"Many-Core vs. Many-Thread Machines: Stay Away From the Valley," 2008.
We've uncovered a lot. Don't know which point is higher.

Energy vs. Latency

* Bill Dally, IPDPS Keynote, 2011

Need to simplify; compare with the latency numbers.

Valley Energy Efficiency


Point: the cache regime is the most energy efficient. What's wrong with a high latency cache? You will get a boost in peak performance by virtue of more threads running, but you won't improve parallel efficiency, so you won't get high hit rates.

Now we have hit rate and BW information, and we know how to use them; we also know the arithmetic intensity. This immediately indicates two key areas to attack the problem: dynamically finding the optimum operating points, and preserving as much of the cache peak as we can by making caches act more like ideal LFU caches.

Talk Outline: Introduction, The Problem, Throughput Architectures, Dissertation Goals, The Solution, Modeling Throughput Performance, Cache Performance, The Valley, Architectural Enhancements, Thread Throttling, Cache Policies, Methodology, Proposed Work

After going through the outline points, reread the first slide: throughput.

Talk Outline: Introduction, The Problem, Throughput Architectures, Dissertation Goals, The Solution, Modeling Throughput Performance, Cache Performance, The Valley, Architectural Enhancements, Thread Throttling, Cache Policies, Methodology, Proposed Work

After going through the outline points, reread the first slide: throughput.

Thread Throttling
We have real time information: arithmetic intensity, bandwidth utilization, current hit rate. We can match it to an approximate/conservative locality profile and approximate the optimum operating points (sketched below), shut down or activate threads to increase performance, and concentrate power and overclock.

Will do two elements: thread throttling and thread scheduling.
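A minimal sketch of the kind of controller this implies, assuming the hardware exposes per-interval hit rate and off-chip bandwidth counters and that both bandwidth and 1/hit-rate are roughly linear in thread count (as argued on the next slide). All names, thresholds, and the two-sample fitting scheme are hypothetical, not the proposal's evaluated mechanism.

```cpp
#include <algorithm>

// Hypothetical per-interval counters exposed by the chip.
struct IntervalStats {
    double hitRate;        // cache hit rate observed this interval
    double bwUtilization;  // fraction of off-chip bandwidth consumed
};

// Two-sample linear estimate: given observations at thread counts n1 < n2,
// pick the largest thread count that keeps bandwidth under budget and the
// projected miss rate (1/hitRate is ~linear in threads) below a target.
int chooseThreadCount(int n1, const IntervalStats& s1,
                      int n2, const IntervalStats& s2,
                      int maxThreads, double bwBudget, double maxMissRate) {
    // Linear models fitted through the two samples.
    double bwSlope  = (s2.bwUtilization - s1.bwUtilization) / (n2 - n1);
    double invHit1  = 1.0 / s1.hitRate;
    double invHit2  = 1.0 / s2.hitRate;
    double invSlope = (invHit2 - invHit1) / (n2 - n1);

    int best = n1;
    for (int n = 1; n <= maxThreads; ++n) {
        double bw       = s1.bwUtilization + bwSlope * (n - n1);
        double invHit   = std::max(invHit1 + invSlope * (n - n1), 1.0);
        double missRate = 1.0 - 1.0 / invHit;
        if (bw <= bwBudget && missRate <= maxMissRate)
            best = n;   // highest thread count satisfying both constraints
    }
    return best;
}
```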

Prior Work
Several studies in the CMP and GPU area scale back threads: CMPs when miss rates get too high, GPUs when off-chip bandwidth is saturated. Prior attempts are simple and unidirectional. We have two complex points to hit and three different operating regimes. Mathematical analysis lets us approximate both points with as few as two samples, because both off-chip bandwidth and 1/hit-rate are nearly linear for a wide range of applications.
Talk Outline: Introduction, The Problem, Throughput Architectures, Dissertation Goals, The Solution, Modeling Throughput Performance, Cache Performance, The Valley, Architectural Enhancements, Thread Throttling, Cache Policies (indexing, replacement), Methodology, Proposed Work

After going through the outline points, reread the first slide: throughput.

Mathematical Analysis
Need to work like an LFU cache, which is hard to implement in practice. There is still very little cache per thread, and policies make big differences for small caches; associativity is a big issue. Cannot cache every line referenced; going beyond dead line prediction, stream lines with lower reuse.
Cache Conflict Misses
Different addresses map to the same way. Programmers prefer power-of-2 array sizes, and power-of-2 strides are pathological. A prime number of banks/sets is thought ideal, but there is no efficient implementation, and Mersenne primes are not so convenient: 3, 7, 15, 31, 63, 127, 255. An early paper on prime strides for vector computers showed a 3x speedup. Kharbutli, HPCA '04, showed prime sets as a hash function for caches worked well. Odd-sets work as well, with the fastest implementation of DIV-MOD. Silver Bullet, e.g., allowed banks with the same conflict rate.
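A minimal sketch of the odd-set indexing idea (illustrative only, not the proposal's exact hardware design): with a power-of-2 set count the index is just the low-order line-address bits, so power-of-2 strides collide; taking the line address modulo an odd (or prime) set count spreads those strides across sets.

```cpp
#include <cstdint>

// Conventional power-of-2 indexing: low-order line-address bits select the set.
// A stride that is a multiple of numSetsPow2 maps every access to one set.
uint32_t pow2SetIndex(uint64_t lineAddr, uint32_t numSetsPow2) {
    return static_cast<uint32_t>(lineAddr & (numSetsPow2 - 1));
}

// Odd-set indexing: choose an odd (ideally prime) set count close to the
// power of 2 and index by modulo, so power-of-2 strides no longer alias to
// a single set. (Real designs avoid a full divider, e.g. via Mersenne-style
// modulo tricks or hashing; the plain % here is only for illustration.)
uint32_t oddSetIndex(uint64_t lineAddr, uint32_t numSetsOdd) {
    return static_cast<uint32_t>(lineAddr % numSetsOdd);
}

// Example: with 64 sets, a stride of 64 lines always hits set (addr & 63);
// with 63 sets, the same stride walks through sets 0..62 before repeating.
```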

Early Study Using PARSEC
PARSEC L2 with 64 threads.
The mathematical model shows us that low hit rates are important.

(Re)placement Policies
Not all data should be cached; see recent papers on LLC caches and hard drive cache algorithms. Favor frequency over recency, but frequency is hard to implement; ARC is a good compromise. With direct mapping, replacement dominates. Look for explicit approaches: priority classes, epochs.

We have done very little preliminary analysis, but we believe the simple approach can work because of the PC-based methods above.
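A sketch of what an explicit priority-class/epoch replacement decision could look like (a hypothetical policy illustrating the idea, not the design evaluated in the proposal): lines carry a class assigned at insertion, lines untouched for a full epoch are demoted, and the victim is the lowest effective class.

```cpp
#include <array>
#include <cstdint>
#include <limits>

struct Line {
    uint64_t tag = 0;
    bool     valid = false;
    uint8_t  priority = 0;   // class assigned at insertion (e.g., a PC-based hint)
    uint32_t epoch = 0;      // epoch in which the line was last touched
};

template <size_t Ways>
int chooseVictim(const std::array<Line, Ways>& set, uint32_t currentEpoch) {
    int victim = 0;
    int lowestScore = std::numeric_limits<int>::max();
    for (int w = 0; w < static_cast<int>(Ways); ++w) {
        if (!set[w].valid) return w;                  // free way wins immediately
        int effPriority = set[w].priority;
        if (currentEpoch > set[w].epoch + 1 && effPriority > 0)
            --effPriority;                            // demote stale lines by one class
        if (effPriority < lowestScore) {              // evict lowest effective class
            lowestScore = effPriority;
            victim = w;
        }
    }
    return victim;
}
```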

Prior Work
Belady solved it all, but was light on implementation details: three hierarchies of methods, the best one utilizing information about prior line usage. Approximations: the ARC cache (ghost entries, recency and frequency groups), generational caches, multiqueue, and Qureshi 2006, 2007 (adaptive insertion policies).
If you can't do it all, use two.
Talk Outline: Introduction, The Problem, Throughput Architectures, Dissertation Goals, The Solution, Modeling Throughput Performance, Cache Performance, The Valley, Architectural Enhancements, Thread Throttling, Cache Policies (indexing, replacement), Methodology (applications, simulation), Proposed Work

After going through the outline points, reread the first slide: throughput.

Benchmarks
Initially studied regular HPC kernels/applications in a CMP environment: dense matrix multiply, fast Fourier transform, HOMME weather simulation.
Added CUDA throughput benchmarks: Parboil (old school MPI, coarse grained) and Rodinia (fine grained, varied); benchmarks typical of historical GPGPU applications.
Will add irregular benchmarks: sparse MM, adaptive finite elements, photon mapping.

Used the largest input sets.
Subset of Benchmarks
For cache analysis, we choose the 6 benchmarks with the highest memory intensity. We will add 4 more with medium intensity once we compensate for the compiler.

Preliminary Results
Most of the benchmarks should benefit: small working sets, concentrated working sets, hit rate curves that are easy to predict.
Typical Concentration of Locality

People thought there'd be no reuse, but they were wrong.

Scratchpad Locality

Golden addresses; will redraw as a stack graph.

Hybrid Simulator Design
Flow: C++/CUDA source, PTX intermediate (via NVCC), Ocelot functional sim (modified), custom simulator.
Goals: fast simulation; overcome compiler issues for a reasonable base case.
Custom trace module: assembly listing, dynamic trace blocks, attachment points, compressed trace data.

Simulate a different architecture than traced. This is another contribution: we can simulate a different architecture than we trace.
Talk Outline: Introduction, The Problem, Throughput Architectures, Dissertation Goals, The Solution, Modeling Throughput Performance, Cache Performance, The Valley, Architectural Enhancements, Thread Throttling, Cache Policies (indexing, replacement), Methodology (applications, simulation), Proposed Work

After going through the outline points, reread the first slide: throughput.

Phase 1 - HPC Applications
Looked at GEMM, FFT & HOMME in a CMP setting. Learned implementation algorithms and alternative algorithms; this expertise allows for credible throughput analysis. Valuable lessons in multithreading and caching.
Dense matrix multiply: blocking to maximize arithmetic intensity (see the sketch below); enough contexts to cover latency.
Fast Fourier transform: pathologically hard on the memory system; communication & synchronization.
HOMME weather modeling: intra-chip scaling is incredibly difficult; memory system performance variation; replacing data movement with computation.
First author publications: PPoPP 2008, ISPASS 2011 (Best Paper).
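A minimal sketch of the blocking idea for dense matrix multiply (illustrative only, not the dissertation's GEMM kernel): each tile of C reuses tiles of A and B many times from fast memory, raising arithmetic intensity roughly in proportion to the block size.

```cpp
// Blocked (tiled) dense matrix multiply: C += A * B for n x n row-major
// matrices. Each tile is reused kBlock times, versus once per element in
// the naive triple loop, so far fewer bytes move per floating-point op.
constexpr int kBlock = 32;   // illustrative tile size

void blockedGemm(const float* A, const float* B, float* C, int n) {
    for (int ii = 0; ii < n; ii += kBlock)
        for (int kk = 0; kk < n; kk += kBlock)
            for (int jj = 0; jj < n; jj += kBlock)
                // Multiply one tile; these loops touch only ~3*kBlock^2 values.
                for (int i = ii; i < ii + kBlock && i < n; ++i)
                    for (int k = kk; k < kk + kBlock && k < n; ++k) {
                        float a = A[i * n + k];
                        for (int j = jj; j < jj + kBlock && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```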

Phase 2 - Benchmark Characterization
Memory access characteristics of the Rodinia and Parboil benchmarks. Apply the mathematical analysis, validate the model, find optimum operating points for the benchmarks, and find the optimum TA topology for the benchmarks. NEARLY COMPLETE.
Phase 3 - Evaluate Enhancements
Automatic thread throttling, low latency hierarchical cache, benefits of odd-sets/odd-banking, benefits of explicit placement (priority/epoch). NEED FINAL EVALUATION and the explicit placement study.

Final Phase - Extend Domain
Study regular HPC applications in a throughput setting. Add at least two irregular benchmarks: less likely to benefit from caching, new opportunities for enhancement. Explore the impact of future TA topologies: memory cubes, TSV DRAM, etc.

Proposed Timeline
Phase 1 - HPC applications: completed.
Phase 2 - Mathematical model & benchmark characterization: May-June.
Phase 3 - Architectural enhancements: July-August.
Phase 4 - Domain enhancement / new features: September-November.
Conclusion
Dissertation goals: quantify the degree to which single thread performance affects throughput performance for an important class of applications; improve parallel efficiency through thread scheduling, cache topology, and cache policies.
Feasibility: regular benchmarks show promising memory behavior; the cycle accurate simulator is nearly completed.
Related Publications To Date

One Outlier

Priority Scheduling
Talk Outline: Introduction, Throughput Architectures - The Problem, Dissertation Overview, Modeling Throughput Performance, Throughput, Caches, The Valley, Methodology, Architectural Enhancements, Thread Scheduling, Cache Policies, Odd-set/Odd-bank Caches, Placement Policies, Cache Topology, Dissertation Timeline

Small caches have severe issues.

Modeling Throughput Performance


N_T = total active threads; P_CHIP = total throughput performance; P_ST = single thread performance; L_AVG = average latency per instruction; Power_CHIP = E_AVG (joules) x P_CHIP.
If measured per cycle, we get IPC per thread and latency in cycles; multiply by frequency (cycles/sec) to get true performance.

Key insight: latency varies with the number of threads. Latency is most important, and caching can be used to reduce latency.

Need to address power, which is performance x energy.
Phase 1 - HPC Applications
Looked at GEMM, FFT & HOMME in a CMP setting. Learned implementation algorithms and alternative algorithms; this expertise allows for credible throughput analysis. Valuable lessons in multithreading and caching.
Dense matrix multiply: blocking to maximize arithmetic intensity; need enough contexts to cover latency.
Fast Fourier transform: pathologically hard on the memory system; communication & synchronization.
HOMME weather modeling: intra-chip scaling is incredibly difficult; memory system performance variation; replacing data movement with computation.
Most significant publications:

Odd Banking - Scratchpad
Talk Outline: Introduction, Throughput Architectures - The Problem, Dissertation Overview, Modeling Throughput Performance, Throughput, Caches, The Valley, Methodology, Architectural Enhancements, Thread Scheduling, Cache Policies, Odd-set/Odd-bank Caches, Placement Policies, Cache Topology, Dissertation Timeline

Small caches have severe issues.

Problem - Technology Mismatch
Computation is cheap, data movement is expensive. Exponential growth in cores saturates off-chip bandwidth, so performance is capped. Latency to off-chip DRAM is now hundreds of cycles, so hundreds of threads per core are needed to mask it.

* Bill Dally, IPDPS Keynote, 2011. Still communicating across the perimeter; transfer rates grow slowly; the ratio of BW/flop is worse in 2017.

Talk Outline: Introduction, Throughput Architectures - The Problem, Dissertation Overview, Modeling Throughput Performance, Throughput, Caches, The Valley, Methodology, Architectural Enhancements, Thread Scheduling, Cache Policies, Odd-set/Odd-bank Caches, Placement Policies, Cache Topology, Dissertation Timeline

Small caches have severe issues.

The Power Wall
Socket power is economically capped. DARPA's UHPC Exascale Initiative: supercomputers are now power capped; 10-20x power efficiency needed by 2017. Supercomputing Moore's Law: double power efficiency every year. The post-PC client era requires >20x the power efficiency of the desktop. Even throughput architectures aren't efficient enough!
Short Latencies Also Matter
Does not include the feedback from caching, which would amplify the impact of scheduling. Note the inversions: 5-6x throughput performance from latency, 2x throughput performance from scheduling.

Importance of Scratchpad
Talk Outline: Introduction, Throughput Architectures - The Problem, Dissertation Overview, Modeling Throughput Performance, Throughput, Caches, The Valley, Methodology, Architectural Enhancements, Thread Scheduling, Cache Policies, Odd-set/Odd-bank Caches, Placement Policies, Cache Topology, Dissertation Timeline

Small caches have severe issues.

Work Finished To Date
Mathematical analysis, architectural algorithms, benchmark characterization. Nearly finished the full chip simulator; it currently simulates one core at a time.

Almost ready to publish 2 papers.
Benchmark Characterization (May-June)
Latency sensitivity with cache feedback and multiple blocks per core; global caching and BW across cores; validate the mathematical model against the benchmarks; compiler controls.

Architectural Evaluation (July-August)
Priority thread scheduling, automatic thread throttling, optimized cache topology (low latency / fast path, odd-set banking, explicit epoch placement).

These 2 papers should not take 4 months; this is very conservative, just going by paper deadlines.

Extending the Domain (Sep-Nov)
Extend benchmarks: port HPC applications/kernels to the throughput environment; add at least two irregular applications, e.g. sparse MM, photon mapping, adaptive finite elements. Extend topologies and enhancements: explore the design space of emerging architectures; examine optimizations beneficial to irregular applications.
Will likely start earlier.
Questions?
Contributions
Mathematical analysis of throughput performance: caching, saturated bandwidth, sensitivities to application characteristics, latency. Quantify the importance of single thread latency. Demonstrate novel enhancements: valley based thread throttling, priority scheduling, subcritical caching techniques.
Backup slides: HOMME; Dense Matrix Multiply; PARSEC L2 64KB Hit Rates; Odd Banking, L1 Cache Access; Local vs. Global Working Sets; Dynamic Working Sets

Fast Fourier Transform (blocked)

Performance From Caching
Assume ideal caches. Ignore changes to DRAM latency & off-chip BW.
