
Processing in Memory
Felix Kaiser

Abstract—In the age of Big Data, memory accesses are among the most important operations for many applications. One of the biggest problems of computing today is the mismatch between the development of memory speed and of floating-point performance. One approach to make this problem less relevant is so-called Processing in Memory. This paper first discusses the different definitions of this phrase, then summarizes some of its implementations. Finally, results regarding speed and power are presented, followed by a conclusion on whether the implementations are realizable.

I. INTRODUCTION

With the development of ever faster processing units, it is expected that application speed will scale as well. That is often not the case. For the last decades, the transistor count per socket has doubled nearly every two years [9], which led to an immense acceleration of circuits. At a certain point, however, memory architectures and interconnects stopped scaling at the same pace. The result is extremely unbalanced systems in which the processing units often spend more time waiting for memory than doing actual work. An approach already mentioned in the 1960s, called Processing in Memory (PIM), is meant to accelerate memory-dependent applications. The idea is to bring the computing units closer to, or even into, the memory. This paper's goal is to present Processing in Memory together with related approaches and its subclasses. It also gives an overview of PIM architectures and discusses their advantages while considering their problems. The conclusion summarizes the current state of research and offers a speculation about the future.

II. MOTIVATION

All arguments for Processing in Memory arise from the need for more speed or less power. This section first discusses the different discrepancies between computation and communication, and then highlights their actual influence on power and speed.

A. System Unbalances

The problem of system unbalance is originally a consequence of Moore's Law [9]. One of the derived Moore's Laws says that the peak floating-point operations per second per socket grow at a rate of 50-60% per year [1]. The original von Neumann architecture would require a memory nearly as fast as the ALU to be balanced (cite). In the von Neumann architecture, one memory access and one computation each take exactly one pipeline stage, so a processor without additional optimizations would scale only with the memory latency. Unfortunately, memory latency shows the worst development of all of these characteristics, as Fig. 1 shows.
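To make this concrete, here is a minimal back-of-the-envelope model (not taken from the paper): if each of N instructions needs one computation of duration t_c and one memory access of duration t_m, the runtime is

    T = N * (t_c + t_m) ≈ N * t_m    for t_m >> t_c,

so without caching or instruction reordering, execution time is set almost entirely by the memory latency.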

Fig. 1. Development of Memory against Network against Computation Speed [1]

Computer architects realized this early and developed techniques such as memory hierarchies, i.e. caching. The result was that the actual speed depends strongly on the caching strategies and prefetch algorithms used, and even when these are very good, the memory bandwidth can become a bottleneck for problems with generally poor spatial locality. This shows that some problems with irregular memory accesses will not become faster through any of these techniques. To solve this, both memory latency and memory bandwidth need to be improved. As Fig. 1 shows, memory bandwidth, like computation, is actually increasing exponentially. The problem is that it grows by 23% per year against the 50-60% per year of computation, and because of the exponential growth this is hard to compensate: the FLOPS grow by a factor of 4.5 per decade relative to the memory bandwidth. Even more problematic is memory latency, which does not improve at all but gets worse, at a rate of 4% per year. Relative to the FLOPS, this amounts to a gap growing by more than a factor of 100 per decade.
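As a small worked check (our arithmetic, not a figure from [1]): compounding annual growth rates r_f for FLOPS and r_m for memory turns a modest yearly difference into a large per-decade gap,

    gap per decade = ((1 + r_f) / (1 + r_m))^10.

For instance, the quoted factor of 4.5 per decade corresponds to an effective annual gap of roughly 16%, since 1.16^10 ≈ 4.4; the exact factor depends on which ends of the quoted ranges are assumed.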

With the development of multiprocessor systems and clusters, the problem of communication arises: network bandwidth and latency become important for High Performance Computing systems. Fig. 1 shows that these are even worse than the memory characteristics. Interconnect bandwidth grows at only 20% per year, and from 1990 until now the FLOPS have grown against it by nearly 9 times per decade. Interconnect latency is an even bigger issue, because it causes a lot of overhead when applications communicate irregularly; over the last decades the FLOPS have grown 30 times faster per decade. This is even more problematic than the other characteristics, because network latency was already far behind the others in the late 1990s, as Fig. 1 shows.


Fig. 2. Growth of total pins vs. growth of supply pins in relation to increasing supply current [2]

B. Influence on Power

Another big issue is the increase in power dissipation (watts per cm²). Reasons for this are the growing transistor density, which results from process technology scaling, and rising clock frequencies. The outcome is not only a bigger heat problem and tighter cooling constraints but also the problem of getting the power into the chip. As Fig. 2 shows, the pin limitation problem scales dramatically with the supply current: the more current a chip needs, the more pins are required to carry power into the chip, and these pins cannot be used for communication anymore. With too high a power requirement, no pins remain for communication at all. Considering how the scaling of communication works, the power problem gets even worse. Traditionally, faster communication is achieved by increasing the number of bus channels; unfortunately, this is limited by the PCB's area, cost, and thermal issues. The alternative approach is to increase the bus frequency, whose disadvantage is the power limit of the PCB [3]. With these insights it can be asserted that designers nowadays have to find the optimum between logic and communication, because both the communication and the power budget are limited by the pins.
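The pin argument can be stated as a simple relation (a sketch with assumed numbers, not taken from [2]): each supply pin carries at most a current I_pin, so a chip drawing power P at supply voltage V needs at least

    N_supply >= I_supply / I_pin = P / (V * I_pin)

supply pins. With, say, P = 100 W, V = 1 V, and I_pin = 0.5 A, that is already 200 pins reserved for power delivery alone, and lowering supply voltages makes this worse because I = P/V rises.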

C. Consequences of the Walls

To overcome these walls, the techniques sketched in the previous subsections for hiding latency or avoiding memory accesses were developed. The latency-hiding techniques are based on reordering instructions; memory access avoidance is usually based on caching and, in some cases, on prediction, e.g. prefetching. None of these techniques works in situations where the memory accesses are extremely irregular, and they do not come close to compensating the whole gap between FLOPS and communication or memory accesses. So nowadays it is important not only to increase the FLOPS but also to keep the FLOPS per memory operation and the FLOPS per interconnect operation low. This is called system balance [1]. It requires a paradigm shift in which the data, not the processing units, become the center of the system. Siegl et al. call this Data-Centric Computing or Near-Data Processing.
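A minimal Python sketch (illustrative numbers, not figures from [1]) of this balance metric: it computes how many floating-point operations a machine can execute per double-precision operand its memory can deliver; kernels performing fewer flops per operand than this value are memory-bound.

    # System balance: peak FLOPS per memory operand delivered.
    # All numbers below are hypothetical, for illustration only.
    def machine_balance(peak_gflops: float, mem_bw_gbs: float,
                        bytes_per_operand: int = 8) -> float:
        operands_per_s = mem_bw_gbs * 1e9 / bytes_per_operand
        return peak_gflops * 1e9 / operands_per_s

    # A machine with 1000 GFLOP/s and 100 GB/s must execute 80 flops
    # per loaded double to stay compute-bound:
    print(machine_balance(peak_gflops=1000.0, mem_bw_gbs=100.0))  # 80.0

A kernel with irregular accesses rarely reaches even a handful of flops per operand, which is why such codes gain nothing from faster ALUs.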

III. NEAR-DATA-PROCESSING

The phrase Processing in Memory (PIM) is, because of its lack of a fixed definition, used in many different contexts, some of which do not match the meaning of PIM in this paper [2].

Fig. 3. Near-Data Processing taxonomy, ordered from close to the CPU to far away [2]

Therefore the nomenclature and taxonomy of Siegl et al. are used. As mentioned before, they call every approach that brings computation closer to the data Near-Data Processing (NDP). Processing in Memory (PIM) and Near Memory Processing (NMP) are in this taxonomy just two subclasses of NDP, which operate in and near memory, respectively [2]. NDP is divided into five subclasses (Fig. 3). The first kind of NDP in this taxonomy is computing in caches, which, as mentioned in II, is an almost universally applied method; a huge share of the area of a system on chip is reserved for SRAM nowadays. PIM and NMP are explained extensively in IV. Processing in Flash and Computing on Disk are two subclasses in which processing units are placed in non-volatile storage architectures; these are not covered in this paper. Intelligent Network describes the idea of implementing processing units in crossbars or network interface controllers to save latency and network load.

IV. PROCESSING IN MEMORY AND NEAR MEMORY PROCESSING

As mentioned in III, PIM and NMP are subclasses of NDP. They are the terms for NDP with volatile memory, especially with DRAM cells. Because a clear distinction between PIMs and similar architectures is rather difficult, Siegl et al. try to define a PIM as a passive acceleration architecture: it offers transparent storage for data in its default mode but can be instructed by the host processor to execute computations.

A. History of PIM

Research in Processing in Memory started in the 1960s with the invention of the single-transistor cell by Dennard. In the first PIM architectures, researchers connected these DRAM cells in arrays combined with some hardwired logic. Because this approach is rather limited, Stone proposed in 1970 a first architecture with a general-purpose logic-in-memory processor. Ten years later, Fuchs et al. developed the so-called PIXEL-planes architecture, effectively the first PIM accelerator for graphics processing. Fuchs realized that memory bandwidth is a huge bottleneck for graphics processing, so he decided to use the bandwidth advantages of PIM to build massively parallel GPUs. Such architectures were later called 2D-integrated PIM, a term that only arose with the advent of 3D-stacked PIM, which is explained in this section as well [2].


Fig. 4. Traditional interconnecting on a PCB, 2.5D integration and true 3D-stacking [2]

From this point on, PIM developed into similar architectures with either small logic blocks for small groups of DRAM cells in the array or somewhat more complex general-purpose logic for a bigger DRAM array. Embedding logic and memory on the same die has the significant advantage of enabling very wide on-chip bus interfaces with low latency. This provides an enormous speed-up, and it also avoids leaving the chip for memory accesses, which costs a lot of power. Unfortunately, DRAM cells built in a standard logic technology have very high leakage. The outcome is higher power usage, because the cells need more refresh cycles; these cells would also be bigger and more expensive, which would lead to less memory storage. Conversely, logic implemented in a DRAM technology would be rather slow. These are the key arguments against 2D-integrated PIM. Another argument against it is programmability, because a software developer has to think about another hierarchy level of memory and logic; however, this is not a problem exclusive to 2D-integrated PIM and can be solved by specialized compilers.
During the time 2D PIMs were researched intensely, Kleiner et al. proposed an architecture with L1 and L2 caches stacked on top of a RISC processor, the stacked dies being connected by vertical vias. This allows different technologies to be used for the memory and the logic, a technique called 3D-stacking. Although it solves the problem of memory and logic on one die, it causes another one: stacking dies interferes with heat dissipation. Kleiner et al. asserted that there would be no heat problems, which disagrees with current research: temperatures above 85°C increase the leakage of the DRAM cells dramatically, so the self-refresh rate must be doubled, which requires much more energy and causes a performance overhead [10]. Another problem is yield: if one Through-Silicon Via (TSV) on one die is defective, the whole TSV becomes defective, which means that defects accumulate over the different planes. Thanks to the technological foundations of 3D interconnection, these problems have been manageable since 2009; that heat management in these stacks is still complicated is explained in more detail in VI. An easier application of 3D stacks are pure memory stacks, because they create less heat.
Another technique for Near Memory Processing is 2.5D integration. The idea is to combine different components by placing them on the same silicon interposer. This makes it unnecessary to go over the Printed Circuit Board (PCB) to interconnect these components, which provides wider bus interfaces and shorter connections with lower capacitance. The wider bus interfaces also provide higher bandwidth compared to the traditional approach, although not compared to 3D-stacking.

Fig. 5. A vault in a 3D memory stack [8]

2.5D integration is thus a compromise between 3D-stacking and the traditional approach: on the one hand it has the bandwidth and latency advantages over the traditional approach, and on the other hand it accumulates less heat than 3D-stacking, because in 2.5D the dies are not actually stacked.

B. Subclasses of Processing in Memory

The concrete implementations vary, and because of that Siegl et al. categorized them as well. Some approaches where PIM can be applied are listed below; this partitioning by Siegl et al. does not claim to be exhaustive [2].

1) Between Sense Amplifier and Column Decoder
2) Behind the Column Decoder
3) Memory embedded into a Pipeline of a Processor
4) In front of a Bus/Crossbar Switch
5) In 3D-Stack

• In each Vault
• Processing Dies in Memory Stack

The list above is sorted by closeness to the DRAM cells. Points 1-3 are all 2D-integrated architectures, point 4 can be implemented in different ways, and point 5 has to be, as the name implies, a 3D-stacked architecture. The first point denotes PIM architectures where the hardwired logic sits directly behind the sense amplifiers. It does not differ much from the second one; the difference is that in the first case the logic must be replicated more often, which is an advantage in terms of parallelization. The third point sits at almost the same position in the hierarchy as the second but differs in the type of architecture: it describes architectures which embed the memory into a pipeline of a general-purpose processor, where the memory behaves like registers or a cache. The fourth point is similar to the intelligent networks but integrates the memory controller into the network interface controller. The fifth point unites all architectures which use 3D-stacking; these can be divided into those with processing elements in each vault (Fig. 5) and those with a bigger GPP-like architecture for all vaults, integrated in a logic die, usually together with the memory controller.

In addition to these categories, O. Mutlu [4] presents an in-memory processing architecture which operates in front of a sense amplifier.


Fig. 6. In-memory processing architecture by O. Mutlu (after [4])

This must be viewed with reservations, because the processing abilities are limited; it is, however, the closest possible option for placing logic next to the DRAM cells.

V. EXAMPLES FOR PIM ARCHITECTURES

This section presents some concrete examples of the design ideas mentioned in IV-B.

A. PIM before Sense Amplifiers

This subsection presents an architecture by O. Mutlu [4] that uses the behaviour of the capacitors in DRAM cells to realize logic operations. As Fig. 6 shows, three transistors are necessary for these operations. Before an operation starts, the cells need to be refreshed so that the values are clearly 1 or 0. For an ordinary read operation, one of the switches (transistors) is closed and a sense amplifier detects whether the capacitor was charged. In O. Mutlu's architecture, all three switches in this picture are closed at the same time. A common sense amplifier simply detects whether the charge was bigger than half the supply voltage, and the three parallel capacitors behave like one bigger capacitor: if at least two of them were charged, the outcome is a 1 at the sense amplifier. The realized function is a majority function. The majority function with three inputs can be written as:

AB + BC + AC

Although a majority function of three one-bit values is not a frequently used operation by itself, it becomes interesting because the Boolean function can be rewritten as:

C(A + B) + ¬C(AB)

With this technique, the value of C decides which operation is executed on A and B, without much additional hardware: C = 1 yields A + B, while C = 0 yields AB. Applied to whole words in memory, this realizes bitwise "and" and "or" operations.
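A small Python model (an illustration of the logic only, not Mutlu's circuit) makes the selection behaviour visible: applied bitwise, an all-ones C row turns the majority into OR and an all-zeros C row into AND.

    # 3-input majority, evaluated bitwise on machine words:
    # MAJ(A, B, C) = AB + BC + AC
    def majority(a: int, b: int, c: int) -> int:
        return (a & b) | (b & c) | (a & c)

    A, B = 0b1100, 0b1010
    assert majority(A, B, 0b1111) == (A | B)  # C = 1...1 selects OR
    assert majority(A, B, 0b0000) == (A & B)  # C = 0...0 selects AND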

Fig. 7. Schematic of a PIM node in the DIVA architecture [5]

Fig. 8. Modules in a NAM node: EXTOLL links, openHMC controller, and NAM intelligence on an FPGA, attached to a Hybrid Memory Cube [6]

B. Memory embedded into a Pipeline of a GPP

One of many examples of memory embedded into the pipeline of a GPP is the DIVA project [5]. DIVA stands for Data-Intensive Architecture. Such an architecture contains several PIM nodes, each of which can be connected to up to two neighboring nodes. As shown in Fig. 7, each of these nodes consists of a proportionally large amount of memory, a 32-bit processor pipeline, an interface to the neighboring PIMs, and an interface to the memory bus of the host processor. This architecture was actually implemented as a prototype in a VLSI process. What differentiates it essentially from the processors of its time is that the biggest part of the area is used for memory. Because it was a prototype, SRAM was used instead of DRAM, owing to the technology issues mentioned in IV-A. From today's point of view this does not seem special, because it resembles the usual architectures with caches.

C. PIM in front of a Bus/Crossbar Switch

This approach is special compared to the other ones, because it addresses not the problems of memory speed but those of network speed. The system in which this architecture is implemented is an HPC (High Performance Computing) cluster [6]. It is a special cluster divided into two parts. One part is, like most clusters nowadays, built out of COTS (commercial off-the-shelf) hardware, namely Intel XEON processors. The other part is built out of Intel XEON PHI accelerators; in this part of the cluster there are also some NAM nodes. NAM, or Network Attached Memory (similar to the more familiar Network Attached Storage, just with volatile memory), is a concept in which memory is attached almost directly to the network. Such a node therefore needs to contain a Network Interface Controller (NIC) and a memory controller. In Fig. 8 the NIC is called EXTOLL Link, and the openHMC Controller is the memory controller. For the prototype, these units are located on an FPGA (Field-Programmable Gate Array).


The utilized memory is a so-called Hybrid Memory Cube (HMC). This is a 3D-stacked memory by Micron which offers high bandwidth and also enables some small PIM features such as integer operations. The actual PIM features, however, are located in the NAM intelligence, which sits on the same FPGA as the other units. It first interprets incoming packets and passes them to the openHMC Controller if they are ordinary memory accesses to the NAM. It also gives processors in the cluster the opportunity to offload computations. This can be useful, for example, for reductions: the network bandwidth and energy otherwise spent spreading the data to different nodes for a global sum can be saved by adding a streaming processor to this node.
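A rough Python estimate (assumed, illustrative parameters; not measurements from [6]) of why such a reduction offload pays off: summing an array held in the NAM over the network moves the whole array, while the streaming processor moves only the result.

    ELEM_SIZE = 8  # bytes per double

    def traffic_host_side(n_elems: int) -> int:
        # host pulls the whole array out of the NAM to sum it
        return n_elems * ELEM_SIZE

    def traffic_nam_side() -> int:
        # NAM sums in place; only the scalar result crosses the link
        return ELEM_SIZE

    # 64M doubles: 512 MiB over the network vs. 8 bytes
    print(traffic_host_side(64 * 2**20), traffic_nam_side())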

D. 3D-stacked PIM architecture

PIM architectures in a 3D stack usually use, as mentioned in IV, 3D-stacked DRAM dies with a logic die at the bottom of the stack. The memory controller can be placed in this die, and additional functional units can be added. A distinction is made between architectures where each vault has its own PIM features, as shown in Fig. 9, and architectures where a GPP, often a RISC processor, is placed in the stack. Systems of both kinds are presented in VI. The PIMs in each vault can be fixed-function or programmable PIMs; in Fig. 9 both are integrated.

VI. SIMULATION OF PIM ARCHITECTURES

For the evaluation of speed, power, and the heat problems, two different papers are considered. E. Azarkhish et al. present a simulator which estimates speed and power of a PIM architecture with an integrated RISC processor [7]. Yuxiong Zhu et al., in contrast, developed a simulation of the thermal issues.

A. Speed- and Power Simulation

E. Azarkhish et al. created, in their own words, a worst-case scenario for a programmable PIM in a 3D stack and let it compete with a single thread on a host processor [7]. This was done in a self-developed simulation environment called SMCSim, which was verified against a cycle-accurate model. The system is composed of a host SoC (System on Chip) with two ARM cores and a 3D-stacked memory with a programmable PIM in its logic layer. The cores of the host are Cortex-A15 at 2 GHz with 32 KB of instruction cache and 64 KB of data cache; the L2 is a shared 2 MB cache with an associativity of 8. As 3D-stacked memory, an HMC with 512 MB of memory, 16 vaults, and 4 memory dies is used. The PIM in this memory is similar to one of the ARM cores but is frequency- and voltage-scaled, not only to save power but also to have a realistic chance of meeting the heat constraints. The result of the simulated experiments was, even in this worst-case construction, a speedup of 2x, or 1.5x if the host side had an accelerator as well. Additionally, the PIM reduced the energy by 70% and 55%, respectively.

Fig. 9. Logic view on a heterogeneous architecture with both fixed-function and programmable PIMs in each vault [8]

B. Simulation of Heat Issues

Y. Zhu et al. present a simulation considering the heat emission of different PIM solutions in one system, which is simulated as realistically as possible. The aim was to identify under which circumstances a system with integrated PIM is thermally feasible [8]. As shown in Fig. 9, the simulated system combines fixed-function PIMs and programmable PIMs, both in each vault. Additionally, the heat emission of a host CPU on the same silicon interposer is taken into account. The heat constraints of the host CPU are similar to those of an Intel XEON processor with four cores. The memory stack is like an HMC with DDR4 DRAMs; its capacity amounts to 4 GB in total. The logic die is, as shown in Fig. 9, an array of vault controllers combined with the fixed-function and the programmable PIMs, connected by an on-chip network. The fixed-function PIMs are modeled as simple logic or arithmetic functions such as adders, multipliers, AND, OR, XOR, shifting, dot product, memory mover, compare-and-swap, fetch-and-add, test-and-set, sorting, and scatter-gather. The programmable PIMs are modeled as ARM Cortex-A9 cores with a 2 GHz clock rate but modified to have an in-order pipeline. The modeled area matches that of the DRAM dies; the reason for this is lower simulation complexity.
A thermal analysis of the memory stack without the host shows that either a commodity-server or a high-end-server active cooling solution needs to be picked; the high-end-server cooling costs nearly 250 $. When picking the high-end-server solution, the simulation shows that the logic die cannot exceed 5.16 W per vault, otherwise one of the DRAM dies can become hotter than 85°C.
After the cooling solutions, the impact of the distance between host and memory stack on the PIM temperature is analyzed. The result was that the temperature of the PIMs drops at a nearly linear rate, from 75.23 to 72.63°C, as the distance grows from 1 to 10 mm while the PIM is passive. For the following simulations, the distance is set to 10 mm. Next, it is examined how many fixed-function PIMs fit into the logic layer when only the power budget and the thermal constraints are considered. The result was up to 1168 simple ones as explained before, or 224 complex ones; the complex fixed-function PIMs, for example floating-point units, were constrained by the area.


After that, the memory stack including the PIM and the CPU is simulated, this time with real workloads, namely HPC benchmarks: CG and MG from the NAS parallel benchmark suite. CG uses the inverse power method to estimate the largest eigenvalue of a symmetric positive definite sparse matrix. The dominating operation is an FMA operation:

a = b + c * d
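To see why this operation matters for PIM, consider a generic sparse matrix-vector product in CSR format (a textbook sketch, assumed here to stand in for CG's kernel): every nonzero triggers exactly this FMA, and the access x[col[j]] jumps irregularly through memory, defeating caches and prefetchers.

    # Generic CSR sparse matrix-vector product: y = M @ x
    def spmv_csr(row_ptr, col, val, x):
        y = [0.0] * (len(row_ptr) - 1)
        for i in range(len(y)):
            for j in range(row_ptr[i], row_ptr[i + 1]):
                y[i] = y[i] + val[j] * x[col[j]]  # a = b + c * d
        return y

    # M = [[4, 1], [0, 3]], x = [1, 2]  ->  y = [6, 6]
    print(spmv_csr([0, 2, 3], [0, 1, 1], [4.0, 1.0, 3.0], [1.0, 2.0]))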

These memory access patterns have little data locality, which makes a PIM sensible here. MG approximates the solution of a three-dimensional discrete Poisson equation; in this workload many stencil operations occur, and these are offloaded to the fixed-function PIMs. The peak temperature during simulations with host and fixed-function PIMs does not exceed 68.53°C for MG and 68.78°C for CG. Another simulation, with the programmable PIMs, shows peak temperatures of 68.78°C for MG and 69.19°C for CG; with both kinds activated, the results are 69.74 and 69.59°C. With the PIM disabled, the execution temperatures for CG and MG are 65.51 and 68.28°C. For MG this means the temperature with PIM is only slightly higher. Y. Zhu et al. conclude that the thermal impact is dominated by the host CPU and that this impact is a function of the distance to the memory stack. Other implications for PIM-based systems stated in the conclusion of their paper are:

• The standalone memory stack needs at least a commodity-server cooling solution.

• The PIM-based memory stack dominates the thermal constraint in the whole system, because the DRAM dies cannot tolerate temperatures above 85°C.

• The scale of the fixed-function PIMs is constrained by the area of the logic die, while that of the heterogeneous PIM is constrained by the thermal/power budget [8], [10].

VII. CONCLUSION

Most of the papers, especially the ones presented here, say that PIM is the solution to the increasing unbalance in today's systems. Their statement is that with fewer memory accesses and less communication, both power and cycles can be saved: power is saved not only by avoiding the communication and memory accesses themselves but also by avoiding the times during which the CPU cannot work while waiting for them. To give a survey of this topic, different systems and different kinds of systems were presented in this paper, and each of them seems to solve one or more of these problems. Their biggest problem is realization. The 2D-integration methods cannot be applied efficiently because of the technology differences between DRAM and logic processes; therefore mainly the 3D-stacked systems are currently studied. The biggest criticism of these is that the integrated PIMs in such systems have only ever been simulated. This is probably because such a realization is very difficult, owing to the heat that accumulates in the memory stacks, and additionally rather expensive.
The paper by Y. Zhu et al. simulated the heat issues extensively. Its essence is a set of implications for how a 3D-stacked memory with PIM, in combination with a host processor, needs to be arranged to keep the temperature low enough for the DRAM.

These implications can be very helpful for someone who actually wants to implement such a system. The biggest criticism of this work is, again, that it is only a simulation. Moreover, the evaluated system is made up, and there is no proof, or even a hint, that the simulated system would be a sensible PIM system with a speedup and lower power consumption. They also assume, without proof, that 5 W of peak power are enough for a PIM. Summing up, PIM seems like the way to go for reducing the discrepancy between computations and memory accesses or communication, but a glance at the heat problems shows that there is still a long way to go.

REFERENCES

[1] John McCalpin, invited talk, Supercomputing Conference 2016.
[2] P. Siegl, R. Buchty, and M. Berekovic, "Data-Centric Computing Frontiers."
[3] B. M. Rogers, A. Krishna, G. B. Bell, K. Vu, X. Jiang, and Y. Solihin, "Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling," June 2009, New York, NY, USA.
[4] Onur Mutlu, "Introduction to Computer Architecture," Spring 2015, http://www.ece.cmu.edu/~ece447/s15/doku.php?id=start
[5] Jeffrey Draper and Jeff Sondeen, "Implementation of a 32-bit RISC Processor for the Data-Intensive Architecture Processing-In-Memory Chip," 2002.
[6] http://www.deep-er.eu/
[7] Erfan Azarkhish, David Rossi, Igor Loi, and Luca Benini, "Design and Evaluation of a Processing-in-Memory Architecture for the Smart Memory Cube."
[8] Yuxiong Zhu, Borui Wang, Dong Li, and Jishen Zhao, "Integrated Thermal Analysis for Processing in Die-Stacking Memory."
[9] Moore's Law, https://en.wikipedia.org/wiki/Moore%27s_law
[10] JEDEC, DDR3 SDRAM Specification, JESD79-3E, July 2010.