intel xeon phi hotchips architecture presentation

31
Intel® Xeon Phi™ coprocessor (codename Knights Corner) George Chrysos Senior Principal Engineer Hot Chips, August 28, 2012

Upload: thomasdem

Post on 28-Oct-2014

2.263 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Intel Xeon Phi Hotchips Architecture Presentation

Intel® Xeon Phi™ coprocessor (codename Knights Corner)

George Chrysos Senior Principal Engineer Hot Chips, August 28, 2012

Page 2: Intel Xeon Phi Hotchips Architecture Presentation

Legal Disclaimers Copyright © 2012 Intel Corporation. All rights reserved.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.

EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF

INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU

SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND

REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL

OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall

have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm%20

Intel, the Intel logo, Xeon, Intel Core and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries Other names and brands may be claimed as the property of others.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and

functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other

products.

For more complete information about performance and benchmark results, visit Performance Test Disclosure

This document contains information on products in the design phase of development.

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSE3 instruction sets and other optimizations. Intel does not

guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides

for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

WARNING: Altering clock frequency and/or voltage may: (i) reduce system stability and useful life of the system and processor; (ii) cause the processor and other system components to fail; (iii) cause reductions in system performance; (iv) cause additional heat or other

damage; and (v) affect system data integrity. Intel has not tested, and does not warranty, the operation of the processor beyond its specif ications. Intel assumes no responsibility that the processor, including if used with altered clock frequencies and/or voltages, will be fit

for any particular purpose. For more information, visit Overclocking Intel Processors

Warning: Altering PC memory frequency and/or voltage may (i) reduce system stability and use life of the system, memory and processor; (ii) cause the processor and other system components to fail; (iii) cause reductions in system performance; (iv) cause additional

heat or other damage; and (v) affect system data integrity. Intel assumes no responsibility that the memory, included if used with altered clock frequencies and/or voltages, will be fit for any particular purpose. Check with memory manufacturer for warranty and additional

details

Available on select Intel® Core™ Intel® Xeon® and Intel® Xeon Phi™ processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information

including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.

Requires a system with a 64-bit enabled processor, chipset, BIOS and software. Performance will vary depending on the specific hardware and software you use. Consult your PC manufacturer for more information. For more information, visit

http://www.intel.com/info/em64t

Requires a system with Intel® Turbo Boost Technology. Intel Turbo Boost Technology and Intel Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and

system configuration. For more information, visit http://www.intel.com/go/turbo

ENERGY STAR is a system-level energy specification, defined by the Environmental Protection Agency, that relies on all system components, such as processor, chipset, power supply, etc.) For more information, visit http://www.intel.com/technology/epa/index.html

Page 3: Intel Xeon Phi Hotchips Architecture Presentation

Intel® Many Integrated Core (Intel MIC) Architecture

Targeted at highly parallel HPC workloads

• Physics, Chemistry, Biology, Financial Services

Power efficient cores, support for parallelism

• Cores: less speculation, threads, wider SIMD

• Scalability: high BW on die interconnect and memory

General Purpose Programming Environment

• Runs Linux (full service, open source OS)

• Runs applications written in Fortran, C, C++, …

• Supports X86 memory model, IEEE 754

• x86 collateral (libraries, compilers, Intel® VTune™ debuggers, etc)

Copyright © 2012 Intel Corporation. All rights reserved. 3 Visual and Parallel Computing Group

Page 4: Intel Xeon Phi Hotchips Architecture Presentation

Knights Corner Coprocessor

Copyright © 2012 Intel Corporation. All rights reserved. 4 Visual and Parallel Computing Group

KN

KNC Card

KN

Intel® Xeon®

Processor PCIe x16

>= 8GB GDDR5 memory

TCP/IP

System Memory

> 50 Cores

Linux OS

GDDR5

Channel … PC e x16

KNC Card GDDR5

Channel

GDDR5

Channel … GDDR5

Channel

GD

DR

5

Ch

an

ne

l …

G

DD

R5

Ch

an

ne

l

Page 5: Intel Xeon Phi Hotchips Architecture Presentation

Knights Corner – Power Efficient

Performance per Watt of a prototype Knights Corner Cluster compared to the 2 Top Graphics Accelerated Clusters

Copyright © 2012 Intel Corporation. All rights reserved. 5 Visual and Parallel Computing Group

1381 1380 1266

0

200

400

600

800

1000

1200

1400

MF

LOP

S/W

att

Higher is Better Source: www.green500.org

Intel Corp Knights Corner Top500 #150 72.9 kW

Nagasaki Univ. ATI Radeon Top500 #456 47 kW

Barcelona Supercomputing Center Nvidia Tesla 2090 Top500 #177 81.5 kW

+ + +

Page 6: Intel Xeon Phi Hotchips Architecture Presentation

Knights Corner Micro-architecture

PCIe

Client

Logic

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD GDDR MC

GDDR MC

GDDR MC

GDDR MC

Copyright © 2012 Intel Corporation. All rights reserved. 6 Visual and Parallel Computing Group

Page 7: Intel Xeon Phi Hotchips Architecture Presentation

Knights Corner Core

X86 specific logic < 2% of core + L2 area

L2 Ctl

L1 TLB and 32KB

Code Cache

T0 IP

4 Threads In-Order

TLB Miss

Code Cache Miss

Decode uCode

16B/Cycle (2 IPC)

Pipe 0

X87 RF Scalar RF

X87 ALU 0 ALU 1

VPU RF

VPU 512b SIMD

Pipe 1

TLB Miss Handler

L2 TLB

T1 IP

T2 IP

T3 IP

L1 TLB and 32KB Data Cache DCache Miss

TLB Miss

To On-Die Interconnect

HWP

Core

512KB L2 Cache

PPF PF D0 D1 D2 E WB

Copyright © 2012 Intel Corporation. All rights reserved. 7 Visual and Parallel Computing Group

Page 8: Intel Xeon Phi Hotchips Architecture Presentation

Vector Processing Unit

PPF PF D0 D1 D2 E WB

VC2 V1-V4 WB D2 E VC1

VC2 V1 V2 D2 E VC1 V3 V4

DEC

VPU

RF

3R, 1W

Mask

RF

Scatter

Gather

ST

LD

EMU Vector ALUs

16 Wide x 32 bit

8 Wide x 64 bit

Fused Multiply Add

Copyright © 2012 Intel Corporation. All rights reserved. 8 Visual and Parallel Computing Group

Page 9: Intel Xeon Phi Hotchips Architecture Presentation

Interconnect

Core

L2

Data

Core

L2

Core

L2

Core

L2

TD TD TD TD

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

BL - 64 Bytes

AD

AK

Copyright © 2012 Intel Corporation. All rights reserved. 9 Visual and Parallel Computing Group

BL – 64 Bytes

AD

AK

Command and Address

Coherence and Credits

Page 10: Intel Xeon Phi Hotchips Architecture Presentation

Distributed Tag Directories

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

Tag Directories track cache-lines in all L2s

TAG Core Valid Mask State

Copyright © 2012 Intel Corporation. All rights reserved. 10 Visual and Parallel Computing Group

TAG Core Valid Mask State

Page 11: Intel Xeon Phi Hotchips Architecture Presentation

Interleaved Memory Access

Copyright © 2012 Intel Corporation. All rights reserved. 11 Visual and Parallel Computing Group

Core

L2

Core

L2 GD

DR

MC

Co

re

L2

Co

re

L2

GDDR MC

Co

re

L2

Co

re

L2

GDDR MC

Core

L2

Core

L2 GD

DR

MC

TD TD TD

T

D

TD

T

D

TD TD

Page 12: Intel Xeon Phi Hotchips Architecture Presentation

Interconnect: 2X AD/AK

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

BL - 64 Bytes

AD

AK

Copyright © 2012 Intel Corporation. All rights reserved. 12 Visual and Parallel Computing Group

BL – 64 Bytes

AD

AK

2x

Page 13: Intel Xeon Phi Hotchips Architecture Presentation

Multi-threaded Triad – Saturation for 1 AD/AK Ring

Results measured in development labs at Intel on Knights Corner prototype hardware and systems. For more information go to http://www.intel.com/performance

Copyright © 2012 Intel Corporation. All rights reserved. 13 Visual and Parallel Computing Group

0 5 10 15 20 25 30 35 40 45 50

Pe

rfo

rma

nce

Cores Running

Simulation Data indicates

saturation for a single

AD/AK ring

Page 14: Intel Xeon Phi Hotchips Architecture Presentation

0 5 10 15 20 25 30 35 40 45 50

Multi-threaded Triad – Benefit of Doubling AD/AK

Results measured in development labs at Intel on Knights Corner prototype hardware and systems. For more information go to http://www.intel.com/performance

Silicon Data for

2 AD + AK rings > 40%

Copyright © 2012 Intel Corporation. All rights reserved. 14 Visual and Parallel Computing Group

Pe

rfo

rma

nce

Cores Running

Simulation Data indicates

saturation for a single

AD/AK ring

Page 15: Intel Xeon Phi Hotchips Architecture Presentation

Streams Triad for (i=0; i<HUGE; i++) A[i] = k*B[i] + C[i];

Without Streaming Stores Read A, B, C, Write A 256 Bytes transferred to/from memory per iteration

With Streaming Stores Read B, C, Write A 192 Bytes transferred to/from memory per iteration

Streaming Stores

Copyright © 2012 Intel Corporation. All rights reserved. 15 Visual and Parallel Computing Group

Page 16: Intel Xeon Phi Hotchips Architecture Presentation

0 5 10 15 20 25 30 35 40 45 50

Multi-threaded Triad — with Streaming Stores

Results measured in development labs at Intel on Knights Corner prototype hardware and systems. For more information go to http://www.intel.com/performance

Silicon Data

Streaming Stores > 30%

Copyright © 2012 Intel Corporation. All rights reserved. 16 Visual and Parallel Computing Group

Pe

rfo

rma

nce

Cores Running

Page 17: Intel Xeon Phi Hotchips Architecture Presentation

Cache Hierarchy Micro-architecture Choices

L2 TLB 64 entry, holds PTEs and PDEs vs. no L2 TLB

Dcache Capability Simultaneous 512b load and 512b store vs. 1 load or store per cycle

L2 Cache 512 KB vs. 256 KB

Hardware Prefetcher 16 stream detectors, prefetch into the L2 vs. no HWP (rely only on software prefetching)

Copyright © 2012 Intel Corporation. All rights reserved. 17 Visual and Parallel Computing Group

Page 18: Intel Xeon Phi Hotchips Architecture Presentation

Per-Core ST Performance Improvement (per cycle)

0.0

0.5

1.0

1.5

2.0

2.5

3.0

Spec FP 2006

Performance impact of KNC core uArch improvements

Results measured in development labs at Intel on Knights Corner and Knights Ferry prototype hardware and systems. For more information go to http://www.intel.com/performance

Copyright © 2012 Intel Corporation. All rights reserved. 18 Visual and Parallel Computing Group

>1.8x Average Performance/Cycle Improvement – 1 Core, 1 Thread

Page 19: Intel Xeon Phi Hotchips Architecture Presentation

0

5

10

15

20

25

30

35

40

45

50

Memory BW L2 Cache BW L1 Cache BW

Relative BW Relative BW/Watt

Caches – For or Against?

Copyright © 2012 Intel Corporation. All rights reserved. 19 Visual and Parallel Computing Group

Coherent Caches are a key MIC Architecture Advantage

Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance.

Caches:

high data BW

low energy per byte of data supplied

programmer friendly (coherence just works)

Page 20: Intel Xeon Phi Hotchips Architecture Presentation

Example: Stencils

L2$ Sized

spatial time-step simulation of a physical system

Copyright © 2012 Intel Corporation. All rights reserved. 20 Visual and Parallel Computing Group

Cache blocking promotes much higher performance and performance/watt vs. memory streaming

Page 21: Intel Xeon Phi Hotchips Architecture Presentation

Power Management: All On and Running

PCIe

Client

Logic GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

PCIe IO

GD

DR

IO G

DD

R IO

GDDR MC

GDDR MC

GDDR MC

GDDR MC

Copyright © 2012 Intel Corporation. All rights reserved. 21 Visual and Parallel Computing Group

Page 22: Intel Xeon Phi Hotchips Architecture Presentation

Core C1: Clock Gate Core

PCIe

Client

Logic GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

GD

DR

IO G

DD

R IO

GDDR MC

GDDR MC

GDDR MC

GDDR MC

Copyright © 2012 Intel Corporation. All rights reserved. 22 Visual and Parallel Computing Group

PCIe IO

When all 4T on a core have halted, core clock gates itself

Page 23: Intel Xeon Phi Hotchips Architecture Presentation

Core C6: Power Gate Core

PCIe

Client

Logic GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

GD

DR

IO G

DD

R IO

GDDR MC

GDDR MC

GDDR MC

GDDR MC

Copyright © 2012 Intel Corporation. All rights reserved. 23 Visual and Parallel Computing Group

PCIe IO

C1 time-out, power gate core, save leakage, requires core-re-init

Page 24: Intel Xeon Phi Hotchips Architecture Presentation

Package Auto C3

PCIe

Client

Logic GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

GD

DR

IO G

DD

R IO

GDDR MC

GDDR MC

GDDR MC

GDDR MC

Copyright © 2012 Intel Corporation. All rights reserved. 24 Visual and Parallel Computing Group

PCIe IO

Timeout when all cores have been in C6, clock gate the L2 and interconnect

Page 25: Intel Xeon Phi Hotchips Architecture Presentation

Package C6

PCIe

Client

Logic GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

GD

DR

IO G

DD

R IO

GDDR MC

GDDR MC

GDDR MC

GDDR MC

Copyright © 2012 Intel Corporation. All rights reserved. 25 Visual and Parallel Computing Group

PCIe IO

Host Driver can initiate Package C6 – Uncore Voltage Off, requires partial restart

Page 26: Intel Xeon Phi Hotchips Architecture Presentation

Summary

Intel® Xeon Phi™ coprocessor provides:

Performance and Performance/Watt for highly parallel HPC with cores, threads, wide-SIMD, caches, memory BW

Intel Architecture general purpose programming environment

advanced power management technology

Copyright © 2012 Intel Corporation. All rights reserved. 26 Visual and Parallel Computing Group

KNC delivers programmability and performance/watt for highly parallel HPC

Page 27: Intel Xeon Phi Hotchips Architecture Presentation

Thank You

Knights Corner brought to you by:

IAG (Intel Architecture Group)

• DCSG (Data Center and Systems Group)

• VPG (Visual and Parallel Group) MIC

– HW Architecture

– HW Design

– SW

SSG (Software and Services Group) MIC

IL PCL (Intel Labs – Parallel Computing Lab)

Copyright © 2012 Intel Corporation. All rights reserved. 27 Visual and Parallel Computing Group

Page 28: Intel Xeon Phi Hotchips Architecture Presentation
Page 29: Intel Xeon Phi Hotchips Architecture Presentation

Vector Processor: 512b SIMD Width

Shared Multiplier Circuit for SP/DP

RF3 RF2 RF1 RF0

SP

15 DP7

SP

14

SP

13 DP6

SP

12

SP

11 DP5

SP

10

SP

9 DP4

SP

8

SP

7 DP3

SP

6

SP

5 DP2

SP

4

SP

3 DP1

SP

2

SP

1 DP0

SP

0

Copyright © 2012 Intel Corporation. All rights reserved. 29 Visual and Parallel Computing Group

16 wide SP SIMD, 8 wide DP SIMD 2:1 Ratio good for circuit optimization

Page 30: Intel Xeon Phi Hotchips Architecture Presentation

Gather/Scatter Address Machinery

gather-prime

loop: gather-step; jump-mask-not-zero loop

Copyright © 2012 Intel Corporation. All rights reserved. 30 Visual and Parallel Computing Group

Index0

+

Base Address

Addr0

Index1

+

Addr1

Index2

+

Addr2

Index3

+

Addr3

Index4

+

Addr4

Index5

+

Addr5

Index6

+

Addr6

Index7

+

Addr7

1

1

1

1

1

1

1

1

Clear

Clear = =

Access Address

Find

First

Gather/Scatter machine takes advantage of cache-line locality

Gather Instruction Loop

Scalar Register

Vector Register

Mask Register

To TLB/ DCACHE

Page 31: Intel Xeon Phi Hotchips Architecture Presentation

Package Deep C3

PCIe

Client

Logic GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

GDDR5

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

Core

L2

Core

L2

Core

L2

Core

L2

TD TD TD TD

GD

DR

IO G

DD

R IO

GDDR MC

GDDR MC

GDDR MC

GDDR MC

Copyright © 2012 Intel Corporation. All rights reserved. 31 Visual and Parallel Computing Group

PCIe IO

Host Driver Initiated – L2/Ring/TDs dropped to retention V, memory in self refresh