The Memory Gap: to Tolerate or to Reduce?
Jean-Luc Gaudiot
Professor
University of California, Irvine
April 2nd, 2002
Outline
- **The problem: the Memory Gap**
- Simultaneous Multithreading
- Decoupled Architectures
- Memory Technology
- Processor-In-Memory
The Memory Latency Problem
- Technological trend: memory latency is getting longer relative to microprocessor speed (40% per year)
- Problem: the conventional memory hierarchy is insufficient:
  - Many applications have large data sets that are accessed non-contiguously.
  - Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994].
- Domain: benchmarks with large data sets: symbolic, signal-processing, and scientific programs
Some Solutions

| Solution | Limitations |
| --- | --- |
| Larger caches | Slow. Work well only if the working set fits the cache and there is temporal locality. |
| Hardware prefetching | Cannot be tailored for each application. Behavior is based on past and present execution-time behavior. |
| Software prefetching | Must ensure the overheads of prefetching do not outweigh the benefits, which forces conservative prefetching. Adaptive software prefetching is required to change the prefetch distance at run time. Hard to insert prefetches for irregular access patterns. |
| Multithreading | Solves the throughput problem, not the memory latency problem. |
Limitation of Present Solutions
- Huge cache
  - Slow, and works well only if the working set fits the cache and there is some kind of locality
- Prefetching
  - Hardware prefetching
    - Cannot be tailored for each application
    - Behavior is based on past and present execution-time behavior
  - Software prefetching
    - Must ensure the overheads of prefetching do not outweigh the benefits
    - Hard to insert prefetches for irregular access patterns
- SMT
  - Enhances utilization and throughput at the thread level
Outline
- The problem: the memory gap
- **Simultaneous Multithreading**
- Decoupled Architectures
- Memory Technology
- Processor-In-Memory
Simultaneous Multi-Threading (SMT)
- Horizontal and vertical sharing
- Hardware support for multiple threads
- Functional resources shared by multiple threads
- Shared caches
- Highest utilization with a multi-program or parallel workload
SMT Compared to SS
- Superscalar processors execute multiple instructions per cycle
- Superscalar functional units sit idle due to I-fetch stalls, conditional branches, and data dependencies
- SMT dispatches instructions from multiple instruction streams, allowing efficient execution and latency tolerance
  - Vertical sharing (TLP and block multithreading)
  - Horizontal sharing (ILP and simultaneous multiple-thread instruction dispatch)
[Figure: issue-slot diagrams over 7 cycles (INT, MEM, FP units) for eight threads. The superscalar processor, issuing from a single thread and stalling, completes 9 instructions; SMT fills the same slots from multiple threads and completes 20 instructions.]
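The effect in the figure can be reproduced with a toy issue-slot model (an illustrative sketch, not the simulator behind these results): every cycle has a fixed number of issue slots, a superscalar can fill them only from one thread, and SMT fills leftover slots from the other ready threads.

```python
def issued(per_thread_ready, width, smt):
    """Count instructions issued over a run of cycles.

    per_thread_ready[t][c] = instructions thread t has ready in cycle c.
    A superscalar (smt=False) issues only from thread 0; SMT (smt=True)
    fills the same `width` slots from all threads, in thread order.
    """
    cycles = len(per_thread_ready[0])
    total = 0
    for c in range(cycles):
        slots = width
        threads = range(len(per_thread_ready)) if smt else [0]
        for t in threads:
            take = min(slots, per_thread_ready[t][c])
            total += take
            slots -= take
    return total

# 4 threads, 7 cycles, 3-wide issue; thread 0 stalls in cycles 2-4.
ready = [
    [3, 2, 0, 0, 0, 2, 2],   # thread 0: stalled mid-run
    [2, 2, 2, 2, 2, 2, 2],
    [1, 2, 1, 2, 1, 2, 1],
    [2, 1, 2, 1, 2, 1, 2],
]
print(issued(ready, width=3, smt=False))  # superscalar: 9
print(issued(ready, width=3, smt=True))   # SMT: 21
```

With these made-up ready counts the superscalar wastes every slot while thread 0 stalls, whereas SMT keeps the slots full, mirroring the 9-versus-20 contrast on the slide.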
CMP Compared to SS
- CMP uses thread-level parallelism to increase throughput
- CMP has layout efficiency
  - More functional units
  - Faster clock rate
- CMP's hardware partitioning limits performance
  - Smaller level-1 resources cause increased miss rates
  - Execution resources are not available from across the partition
[Figure: issue-slot diagrams over the same cycles for the superscalar processor (9 instructions) and a two-processor CMP (CMP-p2), each half with its own INT/MEM and FP units (13 instructions).]
Wide Issue SS Inefficiencies
- Architecture and software limitations
  - Limited program ILP => idle functional units
  - Increased waste from speculative execution
- Technology issues
  - Area grows as O(d^3) (d = issue or dispatch width)
  - Area grows an additional O(t log2(t)) (t = number of SMT threads)
  - Increased wire delays (increased area, tighter spacings, thinner oxides, thinner metal)
  - Increased memory access delays relative to the processor clock
  - Larger pipeline penalties
- Problems solved through:
  - CMP - localizes processor resources
  - SMT - efficient use of functional units, latency tolerance
  - Both CMP and SMT - thread-level parallelism
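A back-of-the-envelope calculation (a sketch using only the O(d^3) and O(t log2 t) growth terms quoted above, with all constants set to 1) shows how quickly issue width dominates area:

```python
import math

def relative_area(d, t=1):
    """Toy area model: O(d^3) in issue width d, times an
    O(t*log2(t)) SMT overhead factor for t > 1 threads."""
    smt = t * math.log2(t) if t > 1 else 1.0
    return d**3 * smt

# Doubling issue width from 4 to 8 grows core area ~8x in this model.
print(relative_area(8) / relative_area(4))       # -> 8.0
# Adding 4 SMT threads at d=8 costs another factor of t*log2(t) = 8.
print(relative_area(8, t=4) / relative_area(8))  # -> 8.0
```

The constants are fictional; the point is only the relative scaling, which is why the slides argue for partitioning (CMP) rather than ever-wider issue.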
POSM Configurations
- All architectures above have eight threads
- Which configuration has the highest performance for an average workload?
- Run benchmarks on the various configurations and find the optimal performance point
[Figure: four eight-thread configurations. A wide-issue SMT (one core with fetch, iL1/dL1, TLB, decode/rename, reorder buffer, instruction queues, out-of-order logic, INT and FP units, a level-2 cache, and an external interface); a two-processor POSM and a four-processor POSM (each processor with its own iL1/dL1 and TLB, sharing the level-2 cache through an L2 crossbar); and an eight-processor CMP (P1-P8, each with iL1/dL1, sharing the level-2 cache through an L2 crossbar).]
Superscalar, SMT, CMP, and POSM Processors
- CMP and SMT both have higher throughput than the superscalar
- The combination of CMP and SMT has the highest throughput
- Experimental results:
[Figure: issue-slot diagrams for the four architectures over the same cycles: superscalar (9 instructions), SMT (20 instructions), CMP-p2 (13 instructions), and POSM-p2, which pairs two SMT cores (33 instructions).]
Equivalent Functional Units
- smt.p1 has the highest performance through vertical and horizontal sharing
- cmp.p8 shows a linear increase in performance

[Figure: IPC versus number of threads (1-8) for smt.p1.f2.t8.d16, posm.p2.f2.t4.d8, posm.p4.f1.t2.d4, and cmp.p8.f1.t1.d2; the y-axis runs from 0 to 10.]
Equivalent Silicon Area and System Clock Effects
- smt.p1 throughput is limited
- smt.p1 and posm.p2 have equivalent single-thread performance
- posm.p4 and cmp.p8 have the highest throughput

[Figure: normalized IPC (NIPC) versus number of threads (1-8) for smt.p1.f2.t8.d9, posm.p2.f2.t4.d6, posm.p4.f1.t2.d4, and cmp.p8.f1.t1.d2; the y-axis runs from 0 to 10.]
Synthesis
- "Comparable silicon resources" are required for a fair processor evaluation
- POSM.p4 has 56% more throughput than the wide-issue SMT.p1
- Future wide-issue processors are difficult to implement, increasing the POSM advantage
  - Smaller technology spacings have higher routing delays due to parasitic resistance and capacitance
  - The larger the processor, the larger the O(d^2 t log2(t)) and O(d^3 t) impact on area and delays
- SMT works well with deep pipelines
- The ISA and micro-architecture affect SMT overhead
  - A 4-thread x86 SMT would have 1/8th the SMT overhead
  - Layout and micro-architecture techniques reduce SMT overhead
Outline
- The problem: the memory gap
- Simultaneous Multithreading
- **Decoupled Architectures**
- Memory Technology
- Processor-In-Memory
The HiDISC Approach

Observation:
- Software prefetching impacts compute performance
- PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching

Approach:
- Add a processor to manage prefetching -> hide the overhead
- The compiler explicitly manages the memory hierarchy
- The prefetch distance adapts to the program's runtime behavior
Decoupled Architectures

[Figure: four architectures compared, each above a 2nd-level cache and main memory. MIPS (conventional): a single 8-issue processor with registers and a cache. DEAP (decoupled): a Computation Processor (CP) plus an Access Processor (AP) sharing registers and a cache. CAPP: a processor plus a Cache Management Processor (CMP). HiDISC (new decoupled): a Computation Processor, an Access Processor, and a Cache Management Processor, one processor per level of the memory hierarchy.]

DEAP: [Kurian, Hulina, & Coraor '94]
PIPE: [Goodman '85]
Other decoupled processors: ACRI, ZS-1, WA
What is HiDISC?
- A dedicated processor for each level of the memory hierarchy
- Each level of the memory hierarchy is explicitly managed using instructions generated by the compiler
- Memory latency is hidden by converting data-access predictability into data-access locality (just-in-time fetch)
- Instruction-level parallelism is exploited without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput
[Figure: the HiDISC organization. A Computation Processor (CP) works out of the registers and L1 cache; an Access Processor (AP) issues loads and stores; a Cache Management Processor (CMP) manages the L2 cache and higher levels. The processors communicate through a Load Data Queue, a Store Address Queue, a Store Data Queue, and a Slip Control Queue.]
Slip Control Queue

The Slip Control Queue (SCQ) adapts dynamically:
- Late prefetches: prefetched data arrived after the load had been issued
- Useful prefetches: prefetched data arrived before the load had been issued

```
if (prefetch_buffer_full())
    ; /* don't change size of SCQ */
else if ((2 * late_prefetches) > useful_prefetches)
    increase size of SCQ;
else
    decrease size of SCQ;
```
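As a sketch of this policy (the counter names, step size, and bounds are hypothetical; the slide does not specify them), one adaptation step could look like:

```python
def adjust_scq_size(scq_size, late_prefetches, useful_prefetches,
                    prefetch_buffer_full, max_size=128):
    """One adaptation step for the Slip Control Queue (SCQ).

    Many late prefetches mean the access/cache-management streams are
    not running far enough ahead of the compute stream, so the slip
    (SCQ size) is increased; otherwise it is trimmed so data is not
    fetched uselessly far ahead.
    """
    if prefetch_buffer_full:
        return scq_size                      # don't change the size
    if 2 * late_prefetches > useful_prefetches:
        return min(scq_size + 1, max_size)   # increase the slip
    return max(scq_size - 1, 0)              # decrease the slip

print(adjust_scq_size(10, late_prefetches=5, useful_prefetches=8,
                      prefetch_buffer_full=False))   # -> 11
```

This is the same threshold test as the pseudocode above, with a unit step standing in for whatever increment the hardware actually uses.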
Decoupling Programs for HiDISC (Discrete Convolution - Inner Loop)

Inner-loop convolution:

```
for (j = 0; j < i; ++j)
    y[i] = y[i] + (x[j] * h[i-j-1]);
```

Computation Processor code:

```
while (not EOD)
    y = y + (x * h);
send y to SDQ
```

Access Processor code:

```
for (j = 0; j < i; ++j) {
    load (x[j]);
    load (h[i-j-1]);
    GET_SCQ;
}
send (EOD token)
send address of y[i] to SAQ
```

Cache Management Processor code:

```
for (j = 0; j < i; ++j) {
    prefetch (x[j]);
    prefetch (h[i-j-1]);
    PUT_SCQ;
}
```

SAQ: Store Address Queue; SDQ: Store Data Queue; SCQ: Slip Control Queue; EOD: End of Data
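To see that the split streams compute the same result as the original loop, here is a minimal software simulation (an illustrative sketch: the load data queue is a `deque`, and the SCQ handshake and store queues are omitted):

```python
from collections import deque

def convolution_decoupled(x, h, i):
    """Compute y[i] = sum_j x[j]*h[i-j-1], split the way HiDISC
    splits it: an access stream generates addresses and fills a load
    data queue (LDQ); the compute stream only drains operand pairs."""
    EOD = object()                    # end-of-data token
    ldq = deque()                     # load data queue

    # Access Processor: walks the index space and pushes loaded values
    for j in range(i):
        ldq.append((x[j], h[i - j - 1]))
    ldq.append(EOD)

    # Computation Processor: consumes operand pairs until EOD,
    # never computing an address itself
    y = 0
    while True:
        item = ldq.popleft()
        if item is EOD:
            break
        xv, hv = item
        y = y + xv * hv
    return y

x = [1, 2, 3, 4]
h = [1, 1, 1, 1]
# y[3] = x[0]*h[2] + x[1]*h[1] + x[2]*h[0] = 1 + 2 + 3 = 6
print(convolution_decoupled(x, h, 3))   # -> 6
```

In the real machine the two loops run concurrently, with the SCQ bounding how far the access stream may slip ahead; here they are sequential only for clarity.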
Benchmarks
| Benchmark | Source | Lines of Source Code | Description | Data Set Size |
| --- | --- | --- | --- | --- |
| LLL1 | Livermore Loops [45] | 20 | 1024-element arrays, 100 iterations | 24 KB |
| LLL2 | Livermore Loops | 24 | 1024-element arrays, 100 iterations | 16 KB |
| LLL3 | Livermore Loops | 18 | 1024-element arrays, 100 iterations | 16 KB |
| LLL4 | Livermore Loops | 25 | 1024-element arrays, 100 iterations | 16 KB |
| LLL5 | Livermore Loops | 17 | 1024-element arrays, 100 iterations | 24 KB |
| Tomcatv | SPECfp95 [68] | 190 | 33x33-element matrices, 5 iterations | <64 KB |
| MXM | NAS kernels [5] | 113 | Unrolled matrix multiply, 2 iterations | 448 KB |
| CHOLSKY | NAS kernels | 156 | Cholesky matrix decomposition | 724 KB |
| VPENTA | NAS kernels | 199 | Invert three pentadiagonals simultaneously | 128 KB |
| Qsort | Quicksort sorting algorithm [14] | 58 | Quicksort | 128 KB |
Simulation Parameters

| Parameter | Value | Parameter | Value |
| --- | --- | --- | --- |
| L1 cache size | 4 KB | L2 cache size | 16 KB |
| L1 cache associativity | 2 | L2 cache associativity | 2 |
| L1 cache block size | 32 B | L2 cache block size | 32 B |
| Memory latency | Variable (0-200 cycles) | Memory contention time | Variable |
| Victim cache size | 32 entries | Prefetch buffer size | 8 entries |
| Load queue size | 128 | Store address queue size | 128 |
| Store data queue size | 128 | Total issue width | 8 |
Simulation Results
[Figure: four plots of performance versus main memory latency (0-200 cycles) for the LLL3, Tomcatv, Cholsky, and Vpenta benchmarks, each comparing MIPS, DEAP, CAPP, and HiDISC.]
VLSI Layout Overhead (I)
- Goal: assess the cost-effectiveness of the HiDISC architecture
- Cache has become a major portion of the chip area
- Methodology: extrapolate a HiDISC VLSI layout from the MIPS R10000 processor (0.35 μm, 1996)
- The space overhead of HiDISC is extrapolated to be 11.3% more than a comparable MIPS processor
- The benchmarks should be run again using these parameters and new memory architectures
VLSI Layout Overhead (II)
| Component | Original MIPS R10K (0.35 μm) | Extrapolation (0.15 μm) | HiDISC (0.15 μm) |
| --- | --- | --- | --- |
| D-Cache (32 KB) | 26 mm² | 6.5 mm² | 6.5 mm² |
| I-Cache (32 KB) | 28 mm² | 7 mm² | 14 mm² |
| TLB Part | 10 mm² | 2.5 mm² | 2.5 mm² |
| External Interface Unit | 27 mm² | 6.8 mm² | 6.8 mm² |
| Instruction Fetch Unit and BTB | 18 mm² | 4.5 mm² | 13.5 mm² |
| Instruction Decode Section | 21 mm² | 5.3 mm² | 5.3 mm² |
| Instruction Queue | 28 mm² | 7 mm² | 0 mm² |
| Reorder Buffer | 17 mm² | 4.3 mm² | 0 mm² |
| Integer Functional Unit | 20 mm² | 5 mm² | 15 mm² |
| FP Functional Units | 24 mm² | 6 mm² | 6 mm² |
| Clocking & Overhead | 73 mm² | 18.3 mm² | 18.3 mm² |
| Total Size without L2 Cache | 292 mm² | 73.2 mm² | 87.9 mm² |
| Total Size with on-chip L2 Cache | - | 129.2 mm² | 143.9 mm² |
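The 11.3% figure quoted on the previous slide can be checked against the table totals: it corresponds to the totals with the on-chip L2 cache included (core-only, without L2, the overhead is about 20%):

```python
def overhead(hidisc_mm2, baseline_mm2):
    """Relative area overhead of HiDISC over the extrapolated MIPS layout."""
    return (hidisc_mm2 - baseline_mm2) / baseline_mm2

print(round(100 * overhead(143.9, 129.2), 1))  # with on-chip L2 -> 11.4
print(round(100 * overhead(87.9, 73.2), 1))    # without L2      -> 20.1
```

The small gap between 11.4% here and the slide's 11.3% is presumably rounding in the table entries.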
The Flexi-DISC
Fundamental characteristics:
- Inherently highly dynamic at execution time
- Dynamically reconfigurable central computation kernel (CK)
- Multiple levels of caching and processing around the CK
  - Adjustable prefetching
- Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources

[Figure: concentric rings around the Computation Kernel - a Low Level Cache Access Ring and a Memory Interface Ring.]
The Flexi-DISC
- Partitioning of the Computation Kernel
  - It can be allocated to different portions of one application or to different applications
- The CK requires a separate next ring to feed it with data
- The variety of target applications makes the memory accesses unpredictable
- Identical processing units for the outer rings
  - Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved

[Figure: the same ring diagram with the Computation Kernel partitioned among Applications 1, 2, and 3.]
Multiple HiDISC: McDISC
- Problem: all extant large-scale multiprocessors perform poorly when faced with a tightly-coupled parallel program.
- Reason: extant machines have a long latency when communication is needed between nodes. This long latency kills performance when executing tightly-coupled programs. (Note that multithreading à la Tera does not help when there are dependencies.)
- The McDISC solution: provide the network interface processor (NIP) with a programmable processor to execute not only OS code (e.g. Stanford FLASH) but also user code generated by the compiler.
- Advantage: the NIP, executing user code, fetches data before it is needed by the node processors, eliminating the network fetch latency most of the time.
- Result: fast execution (speedup) of tightly-coupled parallel programs.
The McDISC System: Memory-Centered Distributed Instruction Set Computer
[Figure: a McDISC node. The compiler splits a program into computation, access, and cache management instructions for the Computation Processor (CP), Access Processor (AP), and Cache Management Processor (CMP). Each node adds a Disc Processor (DP) with a disc cache and a RAID disc farm, an Adaptive Signal PIM (ASP), an Adaptive Graphics PIM (AGP), and a Network Interface Processor (NIP). Nodes connect through register links to CP neighbors in a 3-D torus of pipelined rings. Sensor inputs (FLIR, SAR, VIDEO, ESS, SES) feed a dynamic database; outputs support situation awareness, understanding, inference analysis, decision process/targeting, and network management, going to displays and the network.]
Summary
- A processor for each level of the memory hierarchy
- Adaptive memory hierarchy management
- Reduces memory latency for systems with high memory bandwidths (PIMs, RAMBUS)
- 2x speedup for scientific benchmarks
- 3x speedup for matrix decomposition/substitution (Cholesky)
- 7x speedup for matrix multiply (MXM) (similar results expected for ATR/SLD)
Outline
- The problem: the memory gap
- Simultaneous Multithreading
- Decoupled Architectures
- **Memory Technology**
- Processor-In-Memory
Memory Technology
- New DRAM technologies
  - DDR DRAM, SLDRAM, and DRDRAM
  - Most new DRAM technologies achieve higher bandwidth
- Integrating memory and processor on a single chip (PIM and IRAM)
  - Bandwidth and memory access latency improve sharply
New Memory Technologies (Cont.)
- Rambus DRAM (RDRAM)
  - A memory interleaving system integrated onto a single memory chip
  - Four outstanding requests with a pipelined microarchitecture
  - Operates at much higher frequencies than SDRAM
- Direct Rambus DRAM (DRDRAM)
  - Direct control of all row and column resources concurrently with data transfer operations
  - Current DRDRAM can achieve 1.6 GB/s of bandwidth by transferring on both clock edges
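The 1.6 GB/s figure follows from the channel parameters of the original Direct Rambus design, a 400 MHz clock driving a 2-byte (16-bit) data path on both clock edges:

```python
clock_hz = 400e6         # DRDRAM channel clock
bytes_per_transfer = 2   # 16-bit data path
transfers_per_cycle = 2  # data moves on both clock edges

bandwidth = clock_hz * transfers_per_cycle * bytes_per_transfer
print(bandwidth / 1e9)   # -> 1.6 (GB/s)
```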
Intelligent RAM (IRAM)
- Merges processor and memory technology
- All memory accesses remain within a single chip
  - Bandwidth can be as high as 100 to 200 GB/s
  - Access latency is less than 20 ns
- A good solution for data-intensive streaming applications
Vector IRAM
- A cost-effective system
  - Incorporates vector processing units and the memory system on a single chip
- Beneficial for multimedia applications with critical DSP features
- Good energy efficiency
- Attractive for future mobile computing processors
Outline
- The problem: the memory gap
- Simultaneous Multithreading
- Decoupled Architectures
- Memory Technology
- **Processor-In-Memory**
Overview of the System
Proposed DCS (Data-intensive Computing System) Architecture
DCS System (Cont’d)
- Programming
  - Different from the conventional programming model
  - Applications are divided into two separate sections
    - Software: executed by the host processor
    - Hardware: executed by the CMP
  - The programmer must use CMP instructions
- CMP
  - Several CMPs can be connected to the system bus
  - Variable CMP size and configuration, depending on the amount and complexity of the job it has to handle
  - Variable size, function, and location of the logic inside the CMP, to better handle the application
- Memory, coprocessors, I/O
CMP Architecture
- CMP (Computational Memory Processor) architecture
  - The heart of this work
  - Responsible for executing the core operations of data-intensive applications
  - Attached to the system bus
  - CMP instructions are encapsulated in normal memory operations
  - Consists of many ACME (Application-specific Computational Memory Element) cells interconnected through dedicated communication links
- CMC (Computing Memory Cluster)
  - A small number of ACME cells are put together to form a CMC
  - The network connecting the CMCs is separate from the memory decoder
CMP Architecture
CMC Architecture
ACME Architecture
- ACME (Application-specific Computational Memory Element) architecture
  - ACME memory, configuration cache, CE (Computing Element), FSM
  - The CE is the reconfigurable computing unit and consists of many CCs (Computing Cells)
  - The FSM governs the overall execution of the ACME
Inside the Computing Elements
Synchronization and Interface
- Three different kinds of communication
  - Host processor with the CMP (eventually with each ACME)
    - Done by synchronization variables (specific memory locations) located inside the memory of each ACME cell
    - Example: start and end signals for operations, CMP instructions for each ACME
  - ACME to ACME
    - Two different approaches:
      - Host-mediated: simple, but not practical for frequent communications
      - Distributed: expensive and complex, but efficient
  - CMP to CMP
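A minimal sketch of the host-to-ACME handshake described above (the flag locations, flag values, and the doubling "kernel" are all hypothetical; real synchronization variables would live at fixed locations in each ACME's memory):

```python
# Toy model: one ACME's local memory as a flat array, with
# synchronization variables at fixed (made-up) addresses.
START, END, RESULT = 0, 1, 2

def host_issue(acme_mem, operand_addr, value):
    """Host writes an operand into ACME memory, then raises the start flag."""
    acme_mem[operand_addr] = value
    acme_mem[START] = 1

def acme_step(acme_mem, operand_addr):
    """ACME polls the start flag, runs its kernel, and raises the end flag."""
    if acme_mem[START] == 1:
        acme_mem[RESULT] = acme_mem[operand_addr] * 2  # stand-in kernel
        acme_mem[START] = 0
        acme_mem[END] = 1

mem = [0] * 16
host_issue(mem, operand_addr=8, value=21)
acme_step(mem, operand_addr=8)
print(mem[END], mem[RESULT])   # -> 1 42
```

Because the flags are ordinary memory locations, the host needs no special bus protocol: start/end signalling rides on normal memory reads and writes, exactly as the slide's "CMP instructions are encapsulated in normal memory operations" suggests.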
Benefits of the Paradigm
- All the benefits of being a PIM
  - Increased bandwidth and reduced latency
  - Faster computation
    - Parallel execution among many ACMEs
- Effective use of the full memory bandwidth
- Efficient coexistence of software and hardware
- More parallel execution inside the ACMEs, by configuring the structure with the application in mind
- Scalability
Implementation of the CMP
- Projection of how the CMP could be implemented:
  - According to the 2000 edition of the ITRS (International Technology Roadmap for Semiconductors), in 2008:
    - A high-end MPU with 1.381 billion transistors will be in production in a 0.06 um technology on a 427 mm² die
    - If half of the die is allocated to memory, 8.13 Gbits of storage will be available, leaving 690 million transistors for logic
    - This allows 2048 ACME cells, each with 512 KB of memory and 315K transistors for logic, control, and everything else inside the ACME, with the rest of the resources (36M transistors) for interconnect
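The transistor budget above roughly balances; a quick check of the slide's own numbers:

```python
acme_cells = 2048
logic_per_acme = 315_000       # transistors per ACME cell (logic + control)
interconnect = 36_000_000      # transistors reserved for interconnect
logic_budget = 690_000_000     # logic transistors from the ITRS projection

used = acme_cells * logic_per_acme + interconnect
print(used)                    # -> 681120000
print(used <= logic_budget)    # -> True
```

About 681M of the 690M available logic transistors are accounted for, so the proposed partitioning fits the projected die with a small margin.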
Motion Estimation of MPEG
- Finds the motion vector for each macroblock in a frame
- Absorbs about 70% of the total execution time of MPEG encoding
- A huge number of simple additions, subtractions, and comparisons
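The kernel in question is sum-of-absolute-differences (SAD) block matching; here is a minimal full-search sketch (the frame contents are made up, while the 8x8 block and 8-pixel displacement match the example on the next slides):

```python
def sad(cur, ref, bx, by, dx, dy, n=8):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))

def motion_vector(cur, ref, bx, by, n=8, disp=8):
    """Full search over all displacements within +/- disp pixels."""
    best = None
    for dy in range(-disp, disp + 1):
        for dx in range(-disp, disp + 1):
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]

# Reference frame holds a bright square; the current frame shifts it
# right by 2 pixels, so the best match is found at displacement (-2, 0).
W = 32
ref = [[(8 <= x < 16 and 8 <= y < 16) * 100 for x in range(W)] for y in range(W)]
cur = [[(10 <= x < 18 and 8 <= y < 16) * 100 for x in range(W)] for y in range(W)]
print(motion_vector(cur, ref, bx=10, by=8))   # -> (-2, 0)
```

Each candidate displacement costs 64 subtractions, 64 absolute values, and 63 additions plus one comparison, which is why the slide calls this a huge number of simple operations and why it maps well onto ACME-style parallel hardware.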
Example ME execution
- One ACME structure finds the motion vector for a macroblock
  - Executes in pipelined fashion, reusing the data
Example ME execution
- Performance
  - For an 8x8 macroblock with 8-pixel displacement
  - 276 clock cycles to find the motion vector for one macroblock
- Performance comparison with other architectures