
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 46, NO. 1, JANUARY 2011 173

A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling

Jason Howard, Saurabh Dighe, Sriram R. Vangal, Gregory Ruhl, Member, IEEE, Nitin Borkar, Shailendra Jain, Vasantha Erraguntla, Michael Konow, Michael Riepen, Matthias Gries, Guido Droege, Tor Lund-Larsen, Sebastian Steibl, Shekhar Borkar, Vivek K. De, and Rob Van Der Wijngaart

Abstract—This paper describes a multi-core processor that integrates 48 cores, 4 DDR3 memory channels, and a voltage regulator controller in a 6×4 2D-mesh network-on-chip architecture. Located at each mesh node is a five-port virtual cut-through packet-switched router shared between two IA-32 cores. Core-to-core communication uses message passing while exploiting 384 KB of on-die shared memory. Fine-grain power management takes advantage of 8 voltage and 28 frequency islands to allow independent DVFS of cores and mesh. At the nominal 1.1 V supply, the cores operate at 1 GHz while the 2D-mesh operates at 2 GHz. As performance and voltage scale, the processor dissipates between 25 W and 125 W. The 567 mm² processor is implemented in 45 nm Hi-K CMOS and has 1.3 billion transistors.

Index Terms—2D-routing, CMOS digital integrated circuits, DDR3 controllers, dynamic voltage frequency scaling (DVFS), IA-32, message passing, network-on-chip (NoC).

I. INTRODUCTION

A FUNDAMENTAL shift in microprocessor design from frequency scaling to increased core counts has facilitated the emergence of many-core architectures. Recent many-core designs have proven to optimize performance while achieving higher energy efficiency [1]. However, the complexity of maintaining coherency across traditional memory hierarchies in many-core designs is causing a dilemma. Simply stated, the computational value gained through additional cores will at some point be exceeded by the protocol overhead needed to maintain cache coherency among the cores. Architectural techniques can be used to delay this crossover point for only so long. Alternatively, another approach is to eliminate cache coherency altogether and rely on software to maintain data consistency between cores. Many-core architectures also face steep design challenges with respect to power consumption.

Manuscript received April 15, 2010; revised July 16, 2010; accepted August 30, 2010. Date of publication November 09, 2010; date of current version December 27, 2010. This paper was approved by Guest Editor Tanay Karnik.

J. Howard, S. Dighe, S. Vangal, G. Ruhl, N. Borkar, S. Borkar, and V. K. De are with Intel Corporation, Hillsboro, OR 97124 USA (e-mail: [email protected]).

S. Jain and V. Erraguntla are with Intel Labs, Bangalore, India.

M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, and S. Steibl are with Intel Labs, Braunschweig, Germany.

R. Van Der Wijngaart is with Intel Labs, Santa Clara, CA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSSC.2010.2079450

The seemingly endless compaction and density increase of transistors, as stated by Moore’s Law [2], has positively impacted the growth in core counts while negatively impacting thermal gradients via the exponential surge in power density. In an effort to mitigate these effects, many-core architectures will be required to employ a variety of power-saving techniques.

The prototype processor (Fig. 1) described in this paper is an evolutionary approach toward many-core Network-on-Chip (NoC) architectures that remove dependence on hardware maintained cache coherency while remaining in a constrained power budget. The 48 cores communicate over an on-die network utilizing a message passing architecture that allows data sharing with software maintained memory consistency. The processor also uses voltage and frequency islands to combine the advantages of Dynamic Voltage and Frequency Scaling (DVFS) for improving energy efficiency.

The remainder of the paper is organized as follows. Section II gives a more in-depth architectural description of the 48-core processor and describes key building blocks. The section also highlights enhancements made to the IA-32 core and describes accompanying non-core logic. Router architectural details and packet formats are also described, followed by an explanation of the DDR3 memory controller. Section III presents a novel message passing based software protocol used to maintain data consistency in shared memory. Section IV describes the DVFS power reduction techniques. Details of an on-die voltage regulator controller are also discussed. Experimental results, chip measurement, and programming methodologies are given in Section V.

II. TOP LEVEL ARCHITECTURE

The processor is implemented in 45 nm high-K metal-gate CMOS [3] with a total die area of 567 mm² and contains 1.3 billion transistors. The architecture integrates 48 Pentium™ class IA-32 cores [4] using a “tiled” design methodology; the 24 tiles are arrayed in a 6×4 grid with 2 cores per tile. High-speed, low-latency routers are also embedded within each tile to provide a 2D-mesh interconnect network with sufficient bandwidth, an essential ingredient in complex, many-core NoCs. Four DDR3 memory channels reside on the periphery of the 2D-mesh network to provide up to 64 GB of system memory. Additionally, an 8-byte bidirectional high-speed I/O interface is used for all off-die communication.

Included within a tile are two 256 KB unified L2 caches, one for each core, and supporting network interface (NI) logic required for core-to-router communication.




Fig. 1. Block diagram and tile architecture.

Fig. 2. Full-chip and tile micrograph and characteristics.

Each tile’s NI logic also features a Message Passing Buffer (MPB), or 16 KB of on-die shared memory (24 tiles × 16 KB = 384 KB chip-wide). The MPB is used to increase performance of a message passing programming model whereby cores communicate through local shared memory.

Total die power is kept at a minimum by dynamically scaling both voltage and performance. Fine grained voltage change commands are transmitted over the on-die network to a Voltage Regulator Controller (VRC). The VRC interfaces with two on-package voltage regulators. Each voltage regulator has a 4-rail output to supply 8 on-die voltage islands. Further power savings are achieved through active frequency scaling at a tile granularity. Frequency change commands are issued to a tile’s un-core logic, whereby frequency adjustments are processed. Tile performance is scalable from 300 MHz at 700 mV to 1.3 GHz at 1.3 V. The on-chip network scales from 60 MHz at 550 mV to 2.6 GHz at 1.3 V. The design target for nominal usage is 1 GHz for tiles and 2 GHz for the 2-D network, when supplied by 1.1 V. Full-chip and tile micrograph and characteristics are shown in Fig. 2.

A. Core Architecture

The core is an enhanced version of the second generation Pentium processor [4]. L1 instruction and data caches have been upsized to 16 KB, over the previous 8 KB design, and support 4-way set associativity and both write-through and write-back modes for increased performance. Additionally, data cache lines have been modified with a new status bit used to mark the content of the cache line as Message Passing Memory Type (MPMT). The MPMT is introduced to differentiate between normal memory data and message passing data. The cache line’s MPMT bit is determined by page table information found in the core’s TLB and must be set up properly by the operating system.

The Pentium instruction set architecture has been extended to include a new instruction, INVDMB, used to support software-managed coherency. When executed, an INVDMB instruction invalidates all MPMT cache lines in a single clock cycle. Subsequently, reads or writes to the MPMT cache lines are guaranteed to miss and data will be fetched or written. The instruction gives the programmer direct control of cache management while in a message passing environment.
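As a concrete illustration, the C sketch below shows how software might wrap INVDMB for software-managed coherency. The paper defines the instruction's behavior but not its encoding or a compiler intrinsic, so mpb_invalidate() and the barrier inside it are hypothetical placeholders rather than a published API.

```c
/* Hedged sketch of software-managed coherency around INVDMB.
 * mpb_invalidate() is a hypothetical wrapper: the actual opcode bytes
 * of INVDMB are not published in this paper, so a real build would emit
 * them here (e.g. via .byte directives or a toolchain intrinsic).       */
static inline void mpb_invalidate(void)
{
    __asm__ volatile("" ::: "memory");   /* compiler barrier only here   */
}

/* Before touching message-passing (MPMT) memory, invalidate those lines
 * so the following reads are guaranteed to miss and fetch fresh data.   */
void read_fresh_message(volatile unsigned char *mpmt_src,
                        unsigned char *dst, unsigned long n)
{
    mpb_invalidate();                     /* drop stale MPMT cache lines  */
    for (unsigned long i = 0; i < n; i++)
        dst[i] = mpmt_src[i];             /* misses and fetches from MPB  */
}
```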



Fig. 3. Router architecture and latency.

The addressable space of the core has been extended from 32 bits to 36 bits to support 64 GB of system memory. This is accomplished using a 256-entry look-up table (LUT) extension. To ensure proper physical routing of system addresses, the LUTs also provide destination and routing information. A bypass status bit in an LUT entry allows for direct access to the local tile’s MPB. To provide applications and designers the most flexibility, the LUT can be reconfigured dynamically.
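A minimal sketch of this address extension follows, assuming each of the 256 LUT entries covers one 16 MB segment of the 32-bit core address space (2^32 / 256) and supplies the upper bits of the 36-bit system address; the entry layout and field widths are illustrative, not the actual hardware format.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical LUT entry: the text lists the kinds of fields (extended
 * address bits, destination, route, MPB bypass) but not their packing. */
typedef struct {
    uint16_t sys_addr_high;  /* upper bits of the 36-bit system address  */
    uint8_t  route;          /* tile/router the request is sent toward   */
    uint8_t  dest_id;        /* agent within the destination tile        */
    bool     bypass_to_mpb;  /* direct access to the local tile's MPB    */
} lut_entry_t;

static lut_entry_t lut[256];  /* one entry per 16 MB segment              */

/* Translate a 32-bit core address into a 36-bit system address plus
 * routing information, mirroring the mechanism described in the text.  */
uint64_t translate(uint32_t core_addr, uint8_t *route, uint8_t *dest_id)
{
    lut_entry_t e = lut[core_addr >> 24];         /* top 8 bits index LUT */
    *route   = e.route;
    *dest_id = e.dest_id;
    /* 36-bit address = LUT-supplied upper bits + 24-bit segment offset   */
    return ((uint64_t)e.sys_addr_high << 24) | (core_addr & 0x00FFFFFFu);
}
```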

B. L2 Cache and Controller

Each core’s L1 data and instruction caches are reinforced by a unified 256 KB 4-way write-back L2 cache. The L2 cache uses a 32-byte line size to match the line sizes internal to the core’s L1 caches. Salient features of the L2 cache include: a 10-cycle hit latency, in-line double-error-detection and single-error-correction for improved performance, several programmable sleep modes for power reduction, and a programmable time-out and retry mechanism for increased system reliability. Evicted cache lines are determined through a strict least-recently-used (LRU) algorithm. After every reset deassertion in the L2 cache, 4000 cycles are needed to initialize the state array and LRU bits. During that time the grants to the core will be deasserted and no requests will be accepted.

Several architectural attributes of the core were considered during the design of the L2 cache and associated cache controller. Simplifications to the controller’s pipeline were made due to the core limitation of only one outstanding read/write request at a given time. Additionally, inclusion is not maintained between the core’s L1 caches and the core’s L2 cache. This eliminates the necessity for snoop or inquire cycles between L1 and L2 caches and allows data to be evicted from the L2 cache without an inquire cycle to the L1 cache. Finally, there is no allocate-on-write capability in the L1 caches of the core. Thus, an L1 cache write miss and L2 cache write hit does not write the cache line back into the L1 cache.

High post-silicon visibility into the L2 cache was achieved through a comprehensive scan approach. All L2 cache lines were made scan addressable, including both tag and data arrays and LRU status bits. A full self-test feature was also included that allowed the L2 cache controller to write either random or programmed data to all cache lines, followed by a read comparison of the results.

C. Router Architecture

The 5-port router [5] uses two 144-bit unidirectional links to connect with 4 neighboring routers and one local port while creating the 2-D mesh on-die network. As an alternative to the wormhole routing used in earlier work [1], virtual cut-through switching is used for reduced mesh latency. The router has 4 pipe stages (Fig. 3) and an operational frequency of 2 GHz at 1.1 V. The first stage includes link traversal for incoming packets and the input buffer write. Switch arbitration is done in the second stage, and the third and fourth stages are the VC allocation and switch traversal stages, respectively. Two message classes (MCs) and eight virtual channels (VCs) ensure deadlock-free routing and maximize bandwidth utilization. Two VCs are reserved: VC6 for request MCs and VC7 for response MCs.

Dimension-ordered XY routing eliminates network deadlock, and route pre-computation in the previous hop allows fast output port identification on packet arrival. Input port and output port arbitrations are done concurrently using a centralized conflict-free wrapped wave-front arbiter [6] formed using a 5×5 array of asymmetric cells (Fig. 4). A cell with a row (column) token that is unable to use the token passes the token to the right (down), wrapping around at the end of the array.



Fig. 4. Wrapped wave-front arbiter.

These tokens propagate in a wave-front from the priority diagonal group. If a cell with a request receives both a row and a column token, it grants the request and stops the propagation of tokens.
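The token-passing behavior described above can be captured by a small behavioral model. The C sketch below is a software approximation of a wrapped wave-front arbiter, granting along diagonals starting from the priority diagonal; it is not a description of the actual asymmetric-cell circuit.

```c
#include <stdbool.h>

#define N 5   /* 5x5 array: 5 input ports by 5 output ports */

/* Behavioral model of wrapped wave-front arbitration.  req[i][j] is true
 * if input i requests output j; at most one grant is issued per row and
 * per column.  'prio' selects the priority diagonal, which a real design
 * would rotate each cycle for fairness.                                 */
void wavefront_arbitrate(const bool req[N][N], bool grant[N][N], int prio)
{
    bool row_free[N], col_free[N];
    for (int i = 0; i < N; i++) { row_free[i] = true; col_free[i] = true; }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) grant[i][j] = false;

    /* Diagonals are visited in wave-front order starting at 'prio'.
     * Cells on one diagonal never share a row or column, so they can all
     * be granted together, exactly as the token wave allows.            */
    for (int d = 0; d < N; d++) {
        for (int i = 0; i < N; i++) {
            int j = (prio + d + i) % N;   /* wrapped diagonal member      */
            if (req[i][j] && row_free[i] && col_free[j]) {
                grant[i][j] = true;       /* row and column tokens used   */
                row_free[i] = false;
                col_free[j] = false;
            }
        }
    }
}
```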

Crossbar switch allocation is done in a single clock cycle, on a packet granularity. No-load router latency is 4 clock cycles, including link traversal. Individual links offer 64 GB/s of interconnect bandwidth, enabling the total network to support 256 GB/s of bisection bandwidth.
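These numbers can be reproduced with a back-of-the-envelope calculation, assuming 128 of the 144 link bits carry payload (the remainder presumed sideband/ECC, which the text does not specify) and the 2 GHz mesh clock:

\begin{align*}
BW_{\text{link}} &= 2\ \text{directions} \times 128\,\text{bit} \times 2\,\text{GHz} = 512\,\text{Gb/s} = 64\,\text{GB/s},\\
BW_{\text{bisection}} &= 4\ \text{links} \times 64\,\text{GB/s} = 256\,\text{GB/s},
\end{align*}

where 4 links cross the bisection cut of the 6×4 mesh along its narrow (4-router) dimension.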

D. Communication and Router Packet Format

A packet is the granularity at which all 2-D mesh agents communicate with each other. However, a packet may be subdivided into a collection of one or multiple “FLITs” or “Flow control units”. The header (Fig. 5) is the first FLIT of any packet and contains routing related fields and flow control commands. Protocol layer packets are divided into request and response packets through the message class field. When required, data payload FLITs will follow the header FLIT. Important fields within the header FLIT include (an illustrative layout is sketched after the list):

Route—identifies the tile/router a packet will traverse to
DestID—identifies the final agent within a tile a packet is addressed to
SourceID—identifies the mesh agent within each node/tile a packet is from
Command—identifies the type of MC (request or response)
TransactionID—a unique identifier assigned at packet generation time
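For illustration only, the header fields above might be packed as in the C bitfield below; the actual widths and ordering are defined by Fig. 5 and are not reproduced here, so the numbers chosen are placeholders.

```c
#include <stdint.h>

/* Illustrative header-FLIT layout for a request packet.  The field set
 * follows the list above; the widths and ordering are placeholders, not
 * the encoding shown in Fig. 5.                                         */
typedef struct {
    uint32_t route          : 8;   /* tiles/routers the packet traverses */
    uint32_t dest_id        : 4;   /* agent within the destination tile  */
    uint32_t source_id      : 4;   /* agent within the originating tile  */
    uint32_t command        : 6;   /* message class: request or response */
    uint32_t transaction_id : 8;   /* unique ID assigned at generation   */
    /* ... address, byte enables and flow-control bits fill the rest of
     * the 144-bit FLIT ... */
} header_flit_t;
```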

Packetization of core requests and responses into FLITs is handled by the associated un-core logic. It is incumbent on the user to correctly packetize all FLITs generated off die.

E. DDR3 Memory Controller

Memory transactions are serviced by four DDR3 [7] integrated memory controllers (IMCs) positioned at the periphery of the 2D-mesh network. The controllers support DDR3-800, 1066 and 1333 speed grades and reach 75% bandwidth efficiency with rank and bank interleaving while applying closed-page mode. By supporting dual rank and two DIMMs per channel, a system memory of 64 GB is realized using 8 GB DIMMs. An overview of the IMC is shown in Fig. 6.

All memory access requests enter the IMC through the Mesh Interface Unit (MIU). The MIU reassembles memory transactions from the 2-D mesh packet protocol and passes the transaction to the Access Controller (ACC) block, or controller state machine. The Analog Front End (AFE) circuits provide the actual I/O buffers and fine-grain DDR3-compliant compensation and training control. The IMC’s AFE is a derivative of productized IP [8].

The ACC block is responsible for buffering and issuing up to eight data transfers in order while interleaving control sequences such as refresh and ZQ calibration commands. Control sequence interleaving results in a 5X increase in the achievable bandwidth, since activate and precharge delays can be hidden behind data transfers on the DDR3 bus. The ACC also applies closed-page mode by precharging activated memory pages as soon as one burst access is finished, using auto-precharge. A complete feature list of the IMC is shown in Table I.
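As a rough cross-check of the stated 75% efficiency, and assuming standard 64-bit (8-byte) DDR3 channels (a width the text does not state explicitly), the sustained memory bandwidth works out to roughly:

\begin{align*}
BW_{\text{peak/channel}} &= 1333\,\text{MT/s} \times 8\,\text{B} \approx 10.7\,\text{GB/s},\\
BW_{\text{sustained}} &\approx 0.75 \times 4 \times 10.7\,\text{GB/s} \approx 32\,\text{GB/s}.
\end{align*}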

III. MESSAGE PASSING

Shared memory coherency is maintained through software in an effort to eliminate the communication and hardware overhead required for a memory coherent 2D-mesh. Inspired by software coherency models such as SHMEM [9], MPI and OpenMP [10], the message passing protocol is based on one-sided “put and get” primitives that efficiently move data from the L1 cache of one core to the L1 cache of another [11]. As described earlier, the new Message Passing Memory Type (MPMT) is introduced in conjunction with the new INVDMB instruction as an architectural enhancement to optimize data sharing using these software procedures.



Fig. 5. Request (a) and Response (b) Header FLITs.

Fig. 6. Integrated memory controller block diagram.

The MPMT retains all the performance benefits of a conventional cache line, but distinguishes itself by addressing non-coherent shared memory. The strict message passing protocol proceeds as follows (Fig. 7(a)):

A core initiates a message write by first invalidating all message passing data cache lines. Next, when the core attempts to write data from a private address to a message passing address, a write miss occurs and the data is written to memory. Similarly, when a core reads a message, it begins by invalidating all message passing data cache lines. The read attempt will cause a read miss, and the data from memory will be fetched.
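A minimal C sketch of one-sided put/get under this protocol, reusing the hypothetical mpb_invalidate() wrapper from Section II-A; the function names and the use of plain memcpy are illustrative, since the actual SCC communication library is not described in this paper.

```c
#include <string.h>
#include <stdint.h>

#define MPB_SIZE (16 * 1024)          /* 16 KB message passing buffer/tile */

extern void mpb_invalidate(void);      /* hypothetical INVDMB wrapper      */

/* One-sided "put": copy a message from private memory into a tile's MPB
 * (mapped through the LUT as MPMT memory).  Invalidating the MPMT lines
 * first forces the subsequent writes to miss, so the data actually goes
 * out to the shared buffer instead of lingering in the L1 cache.        */
void mpb_put(volatile uint8_t *mpb_dst, const uint8_t *src, size_t len)
{
    mpb_invalidate();                           /* step 1: drop MPMT lines */
    memcpy((void *)mpb_dst, src, len);          /* step 2: write misses    */
}

/* One-sided "get": the mirror operation for the reader. */
void mpb_get(uint8_t *dst, const volatile uint8_t *mpb_src, size_t len)
{
    mpb_invalidate();                           /* stale lines removed     */
    memcpy(dst, (const void *)mpb_src, len);    /* read misses fetch fresh */
}
```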

The 16 KB MPB, found in each tile and used as on-die shared memory, further optimizes the design by decreasing the latency of shared memory accesses. Messages that are smaller than 16 KB see a 15x latency improvement when passed through the MPB rather than sent through main memory (Fig. 7(b)). However, messages larger than 16 KB lose this performance edge, since the MPB is completely filled and the remaining portion must be sent to main memory.

IV. DVFS

To maximize power savings, the processor is implemented using 8 Voltage Islands (VIs) and 28 Frequency Islands (FIs) (Fig. 8).



TABLE I. INTEGRATED MEMORY CONTROLLER FEATURES

Fig. 7. Message passing protocol (a) and message passing versus DDR3-800 (b).

Software based power management protocols take advantage of the voltage/frequency islands through Dynamic Voltage and Frequency Scaling (DVFS). Two voltage islands supply the 2-D mesh and die periphery, with the remaining 6 voltage islands being divided among the core area. The on-die VRC interfaces with two on-package voltage regulators [12] to scale the voltages of the 2-D mesh and core area dynamically from 0 V to 1.3 V in 6.25 mV steps. Since the VRC acts as any other 2-D mesh agent, the VRC is addressable by all cores. Upon reception of a new voltage change command, the VRC and on-package voltage regulators respond in under a millisecond. VIs for idle cores can be set to 0.7 V, a safe voltage for state retention, or completely collapsed to 0 V if retention is unnecessary. Voltage level isolation and translation circuitry allow VIs with active cores to continue execution with no impact from collapsed VIs and provide a clean interface across voltage domains.
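Since the VRC is addressable like any other mesh agent, a voltage change amounts to a write to it. The sketch below assumes a memory-mapped command register; the address, command encoding, and island numbering are hypothetical, while the 6.25 mV step size and sub-millisecond response come from the text.

```c
#include <stdint.h>

/* Hypothetical memory-mapped VRC interface: the VRC is reachable like any
 * other mesh agent, but the actual register map and command encoding are
 * not published here, so VRC_CMD_REG and the bit layout are placeholders. */
#define VRC_CMD_REG   ((volatile uint32_t *)0xF8000000u)  /* placeholder   */
#define VRC_STEP_UV   6250u                /* 6.25 mV steps, per the text  */

/* Request a new supply voltage for one of the 8 voltage islands.
 * Returns the step count actually encoded.                               */
uint32_t vrc_set_voltage(uint8_t island, uint32_t target_uv)
{
    uint32_t steps = target_uv / VRC_STEP_UV;         /* quantize request  */
    *VRC_CMD_REG = ((uint32_t)island << 16) | steps;   /* placeholder format */
    /* The VRC and on-package regulators settle in under a millisecond, so
     * software would typically poll a status register (not shown) before
     * resuming voltage-sensitive work.                                    */
    return steps;
}
```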

The processor’s 28 FIs are divided as such: one FI for each tile (24 total), one FI for the entire 2-D mesh, and the remaining three FIs for the system interface, VRC, and memory controllers, respectively. Similar to the VIs, all core area FIs are dynamically adjustable to an integer division (up to 16) of the globally distributed clock. However, unlike voltage changes, the response time of frequency changes is significantly faster—around 20 ns when a 1 GHz clock is being used. Thus, frequency changes are much more common than voltage changes in power optimized software. Deterministic first-in-first-out (FIFO) based clock crossing units (CCFs) are used for synchronization across clocking domains [5]. Frequency aware read pointers ensure that the same FIFO location is not read from and written to simultaneously. Embedded level shifters (Fig. 9) within the clock crossing unit handle voltage translation.
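Choosing a tile frequency then reduces to picking an integer divider of the global clock. A small sketch, assuming a 2 GHz global clock (the nominal mesh frequency; the actual distributed clock frequency is not stated here):

```c
#include <stdint.h>

#define GLOBAL_CLK_MHZ 2000u   /* assumed globally distributed clock      */
#define MAX_DIV        16u     /* integer dividers of 1..16, per the text */

/* Pick the smallest divider whose resulting frequency does not exceed the
 * requested tile frequency; the change takes effect in roughly 20 ns, far
 * faster than a voltage change, which is why power-aware software prefers
 * frequency moves.                                                        */
uint32_t pick_divider(uint32_t target_mhz)
{
    uint32_t div = (GLOBAL_CLK_MHZ + target_mhz - 1) / target_mhz; /* ceil */
    if (div < 1u)       div = 1u;
    if (div > MAX_DIV)  div = MAX_DIV;
    return div;          /* tile then runs at GLOBAL_CLK_MHZ / div         */
}
```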



Fig. 8. Voltage/frequency islands, clock crossing FIFOs and clock gating.

Fig. 9. Voltage level translation circuit.

V. EXPERIMENTAL RESULTS & PROGRAMMING

The board and packaged processor used for evaluation and testing are shown in Fig. 10. The die is packaged in a 14-layer (5-4-5) 1567-pin LGA package with 970 signal pins, most of which are allocated to the 4 DDR3 channels. A standard Xeon server socket is used to house the package on the board. A standard PC running customized software is able to interface with the processor and populate the DDR3 memory with a bootable OS for every core. After reset de-assertion, all 48 cores boot independent OS threads. Silicon has been validated to be fully functional.

The Lower-Upper Symmetric Gauss-Seidel solver (LU) and Block Tri-diagonal solver (BT) benchmarks from the NAS parallel benchmarks [13] were successfully ported to the processor architecture with minimal effort. LU employs a symmetric successive over-relaxation scheme to solve regular, sparse, block 5×5 lower and upper triangular systems of equations. LU uses a “pencil decomposition” to assign a column block of a 3-dimensional discretization grid to each core. A 2-dimensional pipeline algorithm is used to propagate a wavefront communication pattern across the cores. BT solves multiple independent systems of block tri-diagonal equations with 5×5 blocks. BT decomposes the problem into larger numbers of blocks that are distributed among the cores with a cyclic distribution. The communication patterns are regular and emphasize nearest-neighbor communication as the algorithm sweeps successively over each plane of blocks. It is important to note that these two benchmarks have distinctly different communication patterns. Results for runs on the processor with cores running at 533 MHz and the mesh at 1 GHz, for the benchmarks running on a 102×102×102 discretization grid, are shown in Fig. 11. As expected, the speedup is effectively linear across the range of problems studied.

Measured maximum frequency for both the core and router as a function of voltage is shown in Fig. 12. Silicon measurements were taken while maintaining a constant case temperature of 50 °C using a 400 W external chiller. As voltage for the router is scaled from 550 mV to 1.34 V, a resulting Fmax increase is observed from 60 MHz to 2.6 GHz. Likewise, as voltage for the IA core is scaled from 730 mV to 1.32 V, Fmax increases from 300 MHz to 1.3 GHz. The offset between the two profiles is explained by the difference in design target points; the core was designed for 1 GHz operation at 1.1 V while the router was designed for 2 GHz operation at the same voltage.

The chart shown in Fig. 13 illustrates the increase in total power as supply voltage is scaled.



Fig. 10. Package and test board.

Fig. 11. NAS parallel benchmark results with increasing core count.

Fig. 12. Maximum frequency (Fmax) versus supply.

During data collection, the processor’s operating frequency was always set to the Fmax of the current supply voltage being measured. Thus, we see a cubic trend of power increase, since power is proportional to frequency times voltage squared. At 700 mV the processor consumes 25 W, while at 1.28 V it consumes 201 W.

Fig. 13. Measured chip power versus supply.

During nominal operation we see a 1 GHz core and a 2 GHz router operating at 1.14 V and consuming 125 W of power. Fig. 14 presents the measured power breakdown at full power and low power operation. When the processor is dissipating 125 W, 69% of this power, or 88 W, is attributed to the cores. When both voltage and frequency are reduced and the processor dissipates 25 W, only 21% of this power, or 5.1 W, is due to the cores. At this low power operation we see the memory controllers becoming the major power consumer, largely because the analog I/O voltages cannot be scaled due to the DDR3 spec.
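The cubic trend noted above follows from the usual dynamic-power relation once frequency is raised along with supply voltage; a sketch, ignoring leakage and the fixed DDR3 I/O rails:

\begin{equation*}
P_{\text{dyn}} = \alpha\, C_{\text{sw}} V^{2} f, \qquad f \propto V \;\Rightarrow\; P_{\text{dyn}} \propto V^{3},
\end{equation*}

which is why running each supply point at its Fmax yields a roughly cubic power-versus-voltage curve in Fig. 13; leakage and the non-scalable I/O supplies keep the measured curve from following V³ exactly.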

VI. CONCLUSION

In this paper, we have presented a 48-core IA-32 processor in a 45 nm Hi-K CMOS process that utilizes a 2D-mesh network and 4 DDR3 channels. The processor uses a new message passing protocol and 384 KB of on-die shared memory for increased performance of core-to-core communication. It employs dynamic voltage and frequency scaling with 8 voltage islands and 28 frequency islands for power management. Silicon operates over a wide voltage and frequency range, 0.7 V and 125 MHz up to 1.3 V and 1.3 GHz. Measured results show a power consumption of 125 W at 50 °C when operating under typical conditions, 1.14 V and 1 GHz. With active DVFS, measured power is reduced by 80% to 25 W at 50 °C. These results demonstrate the feasibility of many-core architectures and high-performance, energy-efficient computing in the near future.



Fig. 14. Measured full power and low power breakdowns.

ACKNOWLEDGMENT

The authors thank Yatin Hoskote, D. Finan, D. Jenkins, H. Wilson, G. Schrom, F. Paillet, T. Jacob, S. Yada, S. Marella, P. Salihundam, J. Lindemann, T. Apel, K. Henriss, T. Mattson, J. Rattner, J. Schutz, M. Haycock, G. Taylor, and J. Held for their leadership, encouragement, and support, and the entire mask design team for chip layout.

REFERENCES

[1] S. Vangal et al., “An 80-tile 1.28 TFLOPS network-on-chip in 65 nm CMOS,” ISSCC Dig. Tech. Papers, pp. 98–99, Feb. 2007.

[2] G. Moore, “Cramming more components onto integrated circuits,” Electronics, vol. 38, no. 8, Apr. 1965.

[3] K. Mistry et al., “A 45 nm logic technology with high-k + metal gate transistors, strained silicon, 9 Cu interconnect layers, 193 nm dry patterning, and 100% Pb-free packaging,” IEDM Dig. Tech. Papers, Dec. 2007.

[4] J. Schutz, “A 3.3 V 0.6 μm BiCMOS superscalar microprocessor,” ISSCC Dig. Tech. Papers, pp. 202–203, Feb. 1994.

[5] P. Salihundam et al., “A 2 Tb/s 6×4 mesh network with DVFS and 2.3 Tb/s/W router in 45 nm CMOS,” in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2010.

[6] Y. Tamir and H.-C. Chi, “Symmetric crossbar arbiters for VLSI communication switches,” IEEE Trans. Parallel Distrib. Syst., vol. 4, no. 1, pp. 13–27, Jan. 1993.

[7] JEDEC Solid State Technology Association, DDR3 SDRAM Specification, JESD79-3B, Apr. 2008.

[8] R. Kumar and G. Hinton, “A family of 45 nm IA processors,” ISSCC Dig. Tech. Papers, pp. 58–59, Feb. 2009.

[9] SHMEM Technical Note for C, Cray Research, Inc., SG-2516 2.3, 1994.

[10] L. Smith and M. Bull, “Development of hybrid mode MPI/OpenMP applications,” Scientific Programming, vol. 9, no. 2–3, pp. 83–98, 2001.

[11] T. Mattson et al., “The Intel 48-core single-chip cloud computer (SCC) processor: Programmer’s view,” in Int. Conf. High Performance Computing, 2010.

[12] G. Schrom, F. Paillet, and J. Hahn, “A 60 MHz 50 W fine-grain package-integrated VR powering a CPU from 3.3 V,” in Applied Power Electronics Conf., 2010.

[13] D. H. Bailey et al., “The NAS parallel benchmarks,” Int. J. Supercomputer Applications, vol. 5, no. 3, pp. 63–73, 1991.

Jason Howard is a senior technical research lead for the Advanced Microprocessor Research team within Intel Labs, Hillsboro, Oregon. During his time with Intel Labs, Howard has worked on projects ranging from high performance low power digital building blocks to the 80-Tile TeraFLOPs NoC Processor. His research interests include alternative microprocessor architectures, energy efficient design techniques, variation aware and tolerant circuitry, and exascale computing.

Jason Howard received the B.S. and M.S. degrees in electrical engineering from Brigham Young University, Provo, UT, in 1998 and 2000, respectively. He joined Intel Corporation in 2000. He has authored and co-authored several papers and has several patents issued and pending.

Saurabh Dighe received his M.S. degree in Computer Engineering from the University of Minnesota, Minneapolis, in 2003. He was with Intel Corporation, Santa Clara, working on front end logic and validation methodologies for the Itanium processor and the Core processor design team. Currently he is a member of the Advanced Microprocessor Research team at Intel Labs, Oregon, involved in the definition, implementation and validation of future Tera-scale computing technologies like the Intel Teraflops processor and the 48-Core IA-32 Message Passing Processor. His research interests are in the area of energy efficient computing and low power high performance circuits.

Sriram R. Vangal (S’90–M’98) received the B.S. degree from Bangalore University, India, the M.S. degree from the University of Nebraska, Lincoln, USA, and the Ph.D. degree from Linköping University, Sweden, all in Electrical Engineering. He is currently a Principal Research Scientist with Advanced Microprocessor Research, Intel Labs. Sriram was the technical lead for the advanced prototype team that designed the industry’s first single-chip 80-core, sub-100 W “Polaris” TeraFLOPS processor (2006) and co-led the development of the 48-core “Rock-Creek” prototype (2009). His research interests are in the area of low-power high-performance circuits, power-aware computing and NoC architectures. He has published 20 journal and conference papers and has 16 issued patents with 8 pending.



Gregory Ruhl (M’07) received the B.S. degree in computer engineering and the M.S. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, in 1998 and 1999, respectively.

He joined Intel Corporation, Hillsboro, OR, in 1999 as a part of the Rotation Engineering Program where he worked on the PCI-X I/O switch, Gigabit Ethernet validation, and clocking circuit and test automation research projects. After completing the REP program, he joined Intel’s Circuits Research Lab where he worked on design, research and validation on a variety of topics ranging from SRAMs and signaling to terascale computing. In 2009, Greg became a part of the Microprocessor Research Lab within Intel Labs where he has since been designing and working on tera- and exa-scale research silicon and near threshold voltage computing projects.

Nitin Borkar received the M.Sc. degree in physics from the University of Bombay, India, in 1982, and the M.S.E.E. degree from Louisiana State University in 1985.

He joined Intel Corporation in Portland, OR, in 1986. He worked on the design of the i960 family of embedded microcontrollers. In 1990, he joined the i486DX2 microprocessor design team and led the design and the performance verification program. After successful completion of the i486DX2 development, he worked on high-speed signaling technology for the Teraflop machine. He now leads the prototype design team in the Microprocessor & Programming Research Laboratory, developing novel technologies in the high-performance low power circuit areas and applying those towards future computing and systems research.

Shailendra Jain received the B.Tech. degree in electronics engineering from Devi Ahilya Vishwavidyalaya, Indore, India, in 1999, and the M.Tech. degree in Microelectronics and VLSI Design from IIT Madras, India, in 2001. With Intel since 2004, he is currently a technical research lead at the Bangalore Design Lab of Intel Labs, Bangalore, India. His research interests include near-threshold voltage range digital circuit design, energy-efficient design techniques for TeraFLOPs NoC processors and floating-point arithmetic units, and many-core advanced rapid prototyping. He has co-authored ten papers in these areas.

Vasantha Erraguntla received her B.E. in Electrical Engineering from Osmania University, India, and an M.S. in Computer Engineering from the University of Louisiana. She joined Intel in 1991 to be a part of the Teraflop machine design team and worked on its high-speed router technology. Since June 2004, Vasantha has been heading Intel Labs’ Bangalore Design Lab to facilitate the world’s first programmable Terascale processor and the 48 IA-core Single-Chip Cloud Computer. Vasantha has co-authored over 13 IEEE journal and conference papers and holds 3 patents with 2 pending. She is also a member of IEEE.

She served on the organizing committee of the 2008 and 2009 International Symposium on Low Power Electronics and Design (ISLPED) and on the Technical Program Committee of ISLPED 2007 and the Asia Solid State Circuits Conference (A-SSCC) in 2008 and 2009. She is also a Technical Program Committee member for energy-efficient digital design for ISSCC 2010 and ISSCC 2011. She is also serving on the Organizing Committee for the VLSI Design Conference 2011.

Michael Konow manages an engineering team working on research and prototyping of future processor architectures within Intel Labs, Braunschweig, Germany. During his time with Intel Labs, Michael led the development of the first in-socket FPGA prototype of an x86 processor and has worked on several FPGA and silicon prototyping projects. His research interests include future microprocessor architectures and FPGA prototyping technologies.

Michael received his diploma degree in Electrical Engineering from the University of Braunschweig in 1996. Since then he has worked on the development of integrated circuits for various companies and a wide range of applications. He joined Intel in 2000 and Intel Labs in 2005.

Michael Riepen is a senior technical research engineer in the Advanced Processor Prototyping Team within Intel Labs, Braunschweig, Germany. During his time in Intel Labs, Michael worked on FPGA prototyping of processor architectures as well as on efficient many-core pre-silicon validation environments. His research interests include exascale computing and many-core programmability as well as efficient validation methodologies.

Michael Riepen received the Master of Computer Science degree from the University of Applied Sciences Wedel, Germany, in 1999. He joined Intel Corporation in 2000. He has authored and co-authored several papers and has several patents issued and pending.

Matthias Gries joined Intel Labs at Braunschweig, Germany, in 2007, where he is working on architectures and design methods for memory subsystems. Before, he spent three years at Infineon Technologies in Munich, Germany, refining micro-architectures for network applications at the Corporate Research and Communication Solutions departments. He was a post-doctoral researcher at the University of California, Berkeley, in the Computer-Aided Design group, implementing design methods for application-specific programmable processors from 2002 to 2004. He received the Doctor of Technical Sciences degree from the Swiss Federal Institute of Technology (ETH) Zurich in 2001 and the Dipl.-Ing. degree in electrical engineering from the Technical University Hamburg-Harburg, Germany, in 1996.

His interests include architectures, methods and tools for developing x86 platforms, resource management and MP-SoCs.

Guido Droege received his Diploma in electrical engineering from the Technical University Braunschweig, Germany, in 1992 and the Ph.D. degree in 1997. His academic work focused on analog circuit design automation. After graduation, Droege worked at an ASIC company, Sican GmbH, and later Infineon Technologies. He designed RF circuits for telecommunication and worked on MEMS technology for automotive applications. In 2001 he joined Intel Corporation, where he started with high-speed interface designs for optical communication. As part of Intel Labs he was responsible for the analog frontend of several silicon prototype designs. Currently, he works in the area of high-bandwidth memory research.



Tor Lund-Larsen is the Engineering Manager for the Advanced Memory Research team within Intel Labs Germany in Braunschweig, Germany. During his time with Intel Labs, Tor has worked on projects ranging from FPGA-based prototypes for multi-radio and adaptive clocking to analog clocking concepts and memory controllers. His research interests include multi-level memory, computation-in-memory for many-core architectures, resiliency and high bandwidth memory.

Tor Lund-Larsen received MBA and MSE degrees from the Engineering and Manufacturing Management program at the University of Washington, Seattle, in 1996. He joined Intel Corporation in 1997.

Sebastian Steibl is the Director of Intel Labs Braunschweig in Germany and leads a team of researchers and engineers in developing technologies ranging from next generation Intel CPU architectures, high bandwidth memory and memory architectures to emulation and FPGA many-core prototyping methodology. His research interests include on-die message passing and embedded many-core microprocessor architectures.

Sebastian Steibl has a degree in Electrical Engineering from the Technical University of Braunschweig and holds three patents.

Shekhar Borkar is an Intel Fellow, an IEEE Fellow and director of Exascale Technologies in Intel Labs. He received B.S. and M.S. degrees in physics from the University of Bombay, India, in 1979, and the M.S. degree in electrical engineering from the University of Notre Dame, Notre Dame, IN, in 1981, and joined Intel Corporation, where he has worked on the design of the 8051 family of microcontrollers, the iWarp multi-computer, and high-speed signaling technology for Intel supercomputers. Shekhar’s research interests are low power, high performance digital circuits, and high speed signaling. He has published over 100 articles and holds 60 patents.

Vivek K. De is an Intel Fellow and director of Circuit Technology Research in Intel Labs. In his current role, De provides strategic direction for future circuit technologies and is responsible for aligning Intel’s circuit research with technology scaling challenges.

De received his bachelor’s degree in electrical engineering from the Indian Institute of Technology in Madras, India, in 1985 and his master’s degree in electrical engineering from Duke University in 1986. He received a Ph.D. in electrical engineering from Rensselaer Polytechnic Institute in 1992. De has published more than 185 technical papers and holds 169 patents with 33 patents pending.

Rob Van Der Wijngaart is a senior software engineer in the Developer Products Division of Intel’s Software and Services Group. At Intel he has worked on parallel programming projects ranging from benchmark development, programming model design and implementation, and algorithmic research to fine-grain power management. He developed the first application to break the TeraFLOP-on-a-chip barrier for the 80-Tile TeraFLOPs NoC Processor.

Van der Wijngaart received an M.S. degree in Applied Mathematics from Delft University of Technology, Netherlands, and a Ph.D. degree in Mechanical Engineering from Stanford University, CA, in 1982 and 1989, respectively. He joined Intel in 2005. Before that he worked at NASA Ames Research Center for 12 years, focusing on high performance computing. He was one of the implementers of the NAS Parallel Benchmarks.