processor technology update final draft - arm … 5x camera ... xeon-e5 2650 v3 cortex-a57...

ARM Processor Technology Update: ARM Cortex®-A72 Processor, Taking Mobile Performance and Efficiency To New Levels. ARM Tech Forum, June 2015. Ian Smythe, Director of Marketing Programs, CPU Group.



ARM Processor Technology Update

ARM Cortex®-A72 Processor: Taking Mobile Performance and Efficiency To New Levels

ARM Tech Forum, June 2015

Ian Smythe

Director of Marketing Programs

CPU Group


Processing Solutions for Consumer Markets


Accelerating the Pace of Innovation

2009 to 2014:

Display: 5x
Camera: 4x
Connectivity: 20x
Sensors: 3x
Video: 34x
CPU: 17x
GPU: 40x
Memory Bandwidth: 16x

By GalaxyOptimus (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

By Creative Tools. Watermark removed by User:Ainali [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons


ARM®v8-A Architecture: Mobile Leadership in 2015

Asus Pegasus X002
Huawei Honor 4X
Huawei Ascend Y550
Lenovo A858T
Lenovo Lemon K3
Lenovo Sisley S90
Lenovo Vibe X2 Pro
LG Flex 2
Galaxy S6 Edge
HTC Desire 820
HTC Desire 510
Meizu M1 Note
Oppo R5
Oppo 1105
Samsung Galaxy A7
Samsung Galaxy Mega 2
Samsung Galaxy Note 4
Vivo X5Max
Xiaomi Redmi 2

Just some of the ARMv8-A architecture-based phones announced so far

Unsubsidized price estimates* from $100 to $750

*Pricing information from www.gsmarena.com


Scales efficiently to significantly higher performance in larger screen devices

Fits even more compute in a smaller footprint with less power

Cortex®-A Processors: Scalable for Large Screen Devices

By Google (Open Source OS Screenshot) [CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/)], via Wikimedia Commons


Cortex-A72 as ‘big’ core increases performance and efficiency

ARM big.LITTLE™: Must-Have for Longer Battery Life

Technology Evolution

big.LITTLE Cluster switching to big.LITTLE MP

big.LITTLE with Intelligent Power Allocation


3.5x performance of Cortex-A15 in smartphone power envelope

Maximizes sustained device performance

75% less energy for same workloads, enabling slimmer and cooler devices

Compelling scalable solutions

Smartphones to large-screen compute solutions

16nm FF+ POP enables high frequency designs to 2.5GHz+

Designed with the system in mind

CoreLink CCI-500 interconnect

Mali-T880 GPU, V550 Video, DP550 Display

MMU-400, NIC-400, ELA-500

ARM Cortex-A72: Highest Performance ARM Cortex CPU


Compelling single-threaded performance

Large performance increase across all workloads including integer, memory-intensive, crypto, floating point, etc.

Baseline microarchitecture similar to Cortex-A57

Significant advancements in power efficiency

Re-optimized every logical block from Cortex-A57

Power reduction enables sustained operation at Fmax

Area reduction lowers costs and static power

Feature support for enterprise and mobile SoCs

Cortex-A72: Increased Performance and Reduced Power


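Cortex-A72: Accelerating Usable Performance

Increase in sustained performance within smartphone power budget: 3.5x

[Figure: sustained performance within the smartphone power budget. Axis labels: 2014, 2015, 2016, Premium. Bars as listed: Cortex-A15, 28nm, 1.6 GHz; Cortex-A57, 20nm, 2.0 GHz; Cortex-A57, 14/16nm, 2.3 GHz; Cortex-A72, 14/16nm, 2.5 GHz. Data labels: 1.9x, 2.6x, 3.5x.]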


Cortex-A72: Reducing Power Consumption

Energy consumed for same mobile workloads

75% less energy at target process

50% less energy

At iso-process 40-60% further reductions on average across multiple workloads

Combined with Cortex-A53:

[Figure: energy consumed by Cortex-A15, Cortex-A57 and Cortex-A72 for the same mobile workloads. Process labels: 28nm, 20nm, 16FF+. Bar labels: 1.6GHz; 2GHz max; 1.3 GHz @ equivalent performance; 1.1 GHz @ equivalent performance; 2.2GHz max; 2.5GHz max.]
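The energy percentages on this slide can be turned into a rough "same work per charge" multiplier. The sketch below is only illustrative arithmetic: the function name is hypothetical, and it assumes the CPU is the only energy consumer, which is not true of a real handset.

```python
def workloads_per_charge(energy_saving: float) -> float:
    """Return how many times more of the same workload a fixed energy budget covers.

    energy_saving is the fractional reduction quoted on the slide (0.75 for "75% less").
    Assumption: the CPU is the only consumer drawing on the budget.
    """
    if not 0.0 <= energy_saving < 1.0:
        raise ValueError("energy_saving must be in [0, 1)")
    return 1.0 / (1.0 - energy_saving)


# The two reductions quoted on the slide.
for saving in (0.50, 0.75):
    print(f"{saving:.0%} less energy -> {workloads_per_charge(saving):.1f}x the work per charge")
```

In practice the display, radios and memory also draw power, so the device-level gain is smaller than the CPU-only figure.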


Intel workloads measured on Dell Venue Pro II. SPEC benchmarks measured using gcc compiler v4.9 with -O3 flag.

Cortex-A72 measured on RTL with realistic memory system with the same compiler settings.

Multi-threaded workloads use 2C4T Core-M CPU and estimated on 4C Cortex-A72 configuration with 2MB L2 cache.

Cortex-A72: More performance in constrained envelopes

Compelling Mobile SoCs for smartphone, tablet, and laptop form factors

[Figure: normalized performance, scale 0 to 1.8, Core-M 2 GHz (14FF) versus Cortex-A72 2.5 GHz est (16nm). Single-thread: Geekbench ST, SPECint, SPECfp. Multi-thread: Geekbench MT (4T), SPECintRate (4T). Memory: STREAM Add, STREAM Copy, STREAM Scale, STREAM Triad. Power labels: 4W, <1W.]


big.LITTLE Technology: Right Core for the Right Task

[Diagram: big Cluster and LITTLE Cluster, each with an L2 Cache, connected by a Cache Coherent Interconnect, with Interrupt Control.]

Architecturally Identical Processors

High performance tuned “big” cores

High efficiency tuned “LITTLE” cores

Hardware Coherency

Cache Coherent Interconnect (CCI)

L1 and L2 snooping between clusters

Seamless & Automatic Task Allocation

Global Task Scheduling (big.LITTLE MP)

Heterogeneous Computing

Up to 1.8x higher performance vs. LITTLE-only*

45% to 65% CPU power savings vs. big-only*

[Figure: relative big.LITTLE power for Cortex-A15 with Cortex-A7 and for Cortex-A57 with Cortex-A53. Annotation: 35%† lower power.]

* Measured across a set of common use-cases on a 4xCortex-A57.4xCortex-A53 big.LITTLE device

† Average power across high-end gaming and low-utilisation workloads
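The "right core for the right task" idea can be sketched as a load-threshold placement rule. This is a toy illustration and not ARM's Global Task Scheduling implementation: the `Task` class, the `place` function and both threshold values are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical thresholds on a 0.0-1.0 tracked-load scale; not values from the slides.
UP_THRESHOLD = 0.7    # move a task to a big core at or above this load
DOWN_THRESHOLD = 0.3  # move a task back to a LITTLE core at or below this load


@dataclass
class Task:
    name: str
    load: float               # recent average load: 0.0 (idle) to 1.0 (always running)
    cluster: str = "LITTLE"   # tasks start on the high-efficiency cluster


def place(task: Task) -> str:
    """Pick the cluster for a task, keeping its current cluster between the thresholds."""
    if task.load >= UP_THRESHOLD:
        task.cluster = "big"
    elif task.load <= DOWN_THRESHOLD:
        task.cluster = "LITTLE"
    # Between the two thresholds the task stays put, which avoids ping-ponging.
    return task.cluster


tasks = [Task("ui-render", 0.9), Task("audio", 0.1), Task("sync", 0.5)]
placement = {t.name: place(t) for t in tasks}
print(placement)
```

The gap between the two thresholds is a design choice: a wider gap means fewer migrations at the cost of slower reaction to load changes.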


The combination of High Performance ‘big’ and High Efficiency ‘LITTLE’ CPUs delivers optimal power efficiency and user experience within the thermal constraints

big.LITTLE: Optimizing for Power Efficiency

Measured Power and Performance during Web Browsing

[Figure: power and page load time, scale 0 to 2, for big.LITTLE*, LITTLE-only* and big-only*, split into LITTLE Cluster and big Cluster. Lower is better.]

*Measurements taken from the same SoC


Three Ways big.LITTLE Has Improved in 2015

big.LITTLE Validation Suite

Validation suite simplifies tuning and shortens time to market

[Diagram labels: Testcases, Report Generator]

ARM Intelligent Power Allocation (IPA)

Native support for IPA, the new Linux Thermal Framework

[Figure labels: Traditional, IPA]

ARMv8 Cortex-A CPUs

big.LITTLE devices in 2015 will achieve higher performance efficiency

[Figure: big.LITTLE (ARMv8) versus big.LITTLE (ARMv7), scale 0 to 3, across AnTuTu HTML5, Epic Citadel, Vellamo HTML5, Quadrant CPU, AnTuTu CPU, Octane, AndEBench and WebXPRT. Additional label: Cortex-A57.]
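One way to picture power allocation under a thermal limit is a shared budget split across the big cluster, the LITTLE cluster and the GPU in proportion to what each requests. The sketch below shows only that splitting step; it is not the Linux or ARM implementation, and the function name, actor names and milliwatt figures are all hypothetical.

```python
def split_budget(budget_mw: float, requests_mw: dict) -> dict:
    """Share a power budget between actors in proportion to their requests.

    If the total request fits within the budget, every actor gets what it asked for.
    Otherwise each request is scaled down by the same factor.
    """
    total = sum(requests_mw.values())
    if total <= budget_mw:
        return dict(requests_mw)
    scale = budget_mw / total
    return {name: request * scale for name, request in requests_mw.items()}


# Hypothetical requests in milliwatts against a 3000 mW sustainable budget.
grants = split_budget(3000, {"big": 3000, "LITTLE": 1000, "gpu": 2000})
print(grants)
```

A real governor would also derive the budget itself from temperature readings over time; that control loop is deliberately left out here.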


Single thread performance is crucial for gaming, video playback and web browsing applications

big.LITTLE software migrates latency sensitive threads to High Performance CPUs to reduce execution time and deliver an improved mobile user experience

big.LITTLE Delivers a Richer User Experience

b: “big” High Performance CPU

L: “LITTLE” High Efficiency CPU

[Figure: big.LITTLE User Experience Improvement. Normalized application speed (higher is better), LITTLE-only (4L) versus big.LITTLE (1b+4L). Applications as listed: Angry Birds, Audio Player, Photo Editor, Facebook, Castle Master, Asphalt 8. Data labels as listed: 125%, 138%, 159%, 157%, 140%, 119%.]

[Figure: Web Page Load Time Performance. Normalised time, scale 0 to 1, for 4b+4L, 2b+4L, 1b+4L, 4L, 2L and 1L. Annotation: 40%.]


big.LITTLE with Cortex-A72 for Entry to Mid-range

Configurations with High Performance CPUs offer greater user experience and higher power efficiency benefits relative to LITTLE only

Topologies with Cortex-A72 CPU as big core offer improved user experience at reduced area

[Figure: normalised user experience, scale 0 to 2, LITTLE-only (SMP) versus big.LITTLE (1b+4L), for Angry Birds, Temple Run, Video Playback and Asphalt 8.]

[Diagram: LITTLE-only and big.LITTLE with Cortex-A72 topologies, with arrows for increasing single thread performance, increasing user experience and increasing energy efficiency. Area labels as listed: 1.09x, 1.0x, 1.3x, 0.98x.]

Cortex-A72 with 2MB L2 for 2 cores, 1MB L2 for 1 core

Cortex-A53 1MB L2 for MP4, 512kB L2 for 2 MP2 and Octa-LITTLE 2nd cluster


Standalone Devices Companion Devices Tethered Embedded Deeply Embedded

Embedded OS | Rich OS

Always aware, lowest-power | High-efficiency performance, constrained power budget

Peripheral Autonomous Compute

ARM at the Heart of the Wearables Market


Processing Solutions for Networking and Infrastructure Markets


Range of SoCs Addressing Infrastructure

Highly Accelerated | Balanced | Massively Multicore

QorIQ LS2, ThunderX, Tile-MX 100, MPSoC

Opteron™ A1100, Stratix® 10, X-Gene™

One Size Does Not Fit All


Cortex-A57 Networking Solutions Gather Pace

ARMv8 SoCs in deployments now, many more coming.

Freescale LS2085 and LS2045

Cortex-A57 based 8-core and 4-core complex

SDN Switching, NFV Solutions

Networking Applications: Enterprise Routing, Data Center Solutions, OpenFlow switching, Enterprise Switching, Security Appliances/IPS/IDS, DPI, ADC/WAN-Opt

HiSilicon 32-core

First 16nm FinFET ARMv8-A networking chip

32-core ARM Cortex-A57 SoC

Networking applications: Next Generation BTS, Core Routers, Virtualized appliances, SDN

AMD Hierofalcon, Seattle platforms


Enterprise Compute Requirements

Specialised Processing

L1, Content Delivery, Security

Diverse requirements

Trend: Advanced modulation schemes

Need: DSPs, Accelerators

Data Plane Processing

Throughput driven, IO intensive

Deterministic performance

Trend: Higher packet rates

Need: Small Cores at Maximum Efficiency

Control Plane Processing

Fast Event Processing

Complex signalling

Trend: Evolving Software

Need: Efficient, High Compute Performance

MAC Scheduling

Real Time, Latency Driven

Multiple core processing

Trend: More Complexity (LTE-A, 5G)

Need: High Compute, Low Latency Performance

High Bandwidth, Low Latency Interconnect

Wide Range of Implementations from Few to Many Coherent Devices


Extensible Architecture for Heterogeneous Multi-core Solutions

[Diagram: CoreLink™ CCN-512 Cache Coherent Network with Snoop Filter and 1-32MB L3 cache. Attached blocks: Cortex-A72 and Cortex-A53 clusters, further Cortex CPU or CHI masters, DSPs over ACE, four DMC-520 memory controllers (x72 DDR4-3200), PCIe, 10-40 GbE, DPI, Crypto, SATA, NIC-400 network interconnects (Flash, USB, SRAM, GPIO, PCIe, AHB), GIC-500 interrupt control, and CoreLink MMU-500 for I/O virtualisation.]

Up to 4 cores per cluster

Up to 12 coherent clusters

Integrated L3 cache

Up to 24 I/O coherent interfaces for accelerators and I/O

Peripheral address space

Heterogeneous processors: CPU, GPU, DSP and accelerators

Virtualized Interrupts

Up to Quad channel DDR3/4 x72
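The cluster limits quoted for this interconnect bound the coherent core count at 4 x 12 = 48. The sketch below checks a proposed topology against those two limits only; the function name and the example topology are hypothetical, and no other CCN-512 constraint is modelled.

```python
from typing import Sequence

MAX_CORES_PER_CLUSTER = 4   # "Up to 4 cores per cluster"
MAX_CLUSTERS = 12           # "Up to 12 coherent clusters"


def total_cores(cores_per_cluster: Sequence[int]) -> int:
    """Validate a topology against the two slide limits and return its core count."""
    if len(cores_per_cluster) > MAX_CLUSTERS:
        raise ValueError(f"at most {MAX_CLUSTERS} coherent clusters")
    for cores in cores_per_cluster:
        if not 1 <= cores <= MAX_CORES_PER_CLUSTER:
            raise ValueError(f"each cluster needs 1 to {MAX_CORES_PER_CLUSTER} cores")
    return sum(cores_per_cluster)


# Largest configuration allowed by the two limits.
max_cores = total_cores([MAX_CORES_PER_CLUSTER] * MAX_CLUSTERS)

# Hypothetical mixed topology: two quad-core clusters plus one dual-core cluster.
mixed = total_cores([4, 4, 2])
print(max_cores, mixed)
```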


Maximizing Throughput Density: per mm2, per Watt

[Figure: relative performance (Spec2K6 rate), scale 0 to 1.2, on a 20 Thread Workload. Platforms as listed: Xeon-E5 2650 V3 (10 cores 20 threads), Cortex-A57 (20 cores 20 threads), Cortex-A72 (20 cores 20 threads), Xeon-E5 2660 V3 (10 cores 20 threads). Frequency labels: 2.3 GHz, 2.7 GHz, 2.6 GHz, 2.5 GHz. Power labels: 105W, 105W, <30W, <30W. Other labels: POP Optimizations (twice).]

Comparison for equivalent number of threads

Platforms used:

Xeon-E5 2660 10C20T platform (measured)

Xeon-E5 2650 10C20T platform (measured)

Gcc compiler v4.9 with -O3 flag

Estimated result on example 20C ARM Cortex platforms with CCN-508, 28MB total L2+L3 cache

Per-core measurements on RTL with relevant memory system

Gcc compiler v4.9 with -O3 flag

Scaled to 20T based on modelled and empirical results

Power estimated in 16nm based on ARM internal implementations for entire CPU + interconnect

ARM Solution Benefits:

Less than 1/3rd the power for equivalent performance

Allows more specialized computing or significantly greater thread density in the same power budget
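The "less than 1/3rd the power" claim follows directly from the two power labels on the chart. The sketch below redoes that arithmetic, assuming equal throughput on the 20-thread workload and taking 30W as the upper bound of "<30W"; the variable names are illustrative.

```python
# Power labels read off the slide; the ARM figures are stated there as estimates.
xeon_power_w = 105.0      # Xeon-E5 platforms, 10 cores / 20 threads
arm_power_w_max = 30.0    # upper bound of "<30W" for the 20-core Cortex platforms

# Assumption: both platforms deliver the same throughput ("equivalent performance").
relative_perf = 1.0

power_ratio = arm_power_w_max / xeon_power_w
perf_per_watt_gain = (relative_perf / arm_power_w_max) / (relative_perf / xeon_power_w)

print(f"ARM power as a fraction of Xeon power: below {power_ratio:.2f}")
print(f"Throughput per watt advantage: above {perf_per_watt_gain:.1f}x")
```

Because 30W is only a ceiling, the real ratio would be somewhat more favourable than these bounds.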


Cortex-A72: Ideal for Dense Compute Environments

Cortex-A72 is <20% size

Single Broadwell CPU + 256K L2¹: ~8mm2

Cortex-A72 MP4 + 2MB L2³: ~8mm2

Single Cortex-A72 core²: ~1.15mm2

A quad core Cortex-A72 with 8x L2 cache RAM is the same size

¹ Source: Estimated from die-shot image provided by Intel at IDF 2014.

² ³ Source: ARM trial implementations on TSMC 16FF+, using ARM Artisan libraries
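The area claims on this slide can be cross-checked with a few divisions. The sketch below uses the approximate figures as quoted; treating the per-core figure as excluding L2 is an assumption, so the leftover area is only a rough indication.

```python
# Approximate areas as quoted on the slide.
broadwell_core_mm2 = 8.0   # single Broadwell CPU + 256K L2 (estimated from a die shot)
a72_core_mm2 = 1.15        # single Cortex-A72 core
a72_mp4_mm2 = 8.0          # quad-core Cortex-A72 cluster + 2MB L2

# "<20% size": one Cortex-A72 core relative to one Broadwell core with its L2.
core_ratio = a72_core_mm2 / broadwell_core_mm2

# "8x L2 cache RAM": 2MB versus 256KB, both expressed in KB.
l2_ratio = (2 * 1024) / 256

# Assumption: the per-core figure excludes L2, so the remainder of the cluster
# is roughly what is left for the 2MB L2 and shared cluster logic.
non_core_mm2 = a72_mp4_mm2 - 4 * a72_core_mm2

print(f"core area ratio: {core_ratio:.1%}")
print(f"L2 capacity ratio: {l2_ratio:.0f}x")
print(f"area outside the four cores: about {non_core_mm2:.1f} mm2")
```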


ARM Ecosystem

ARM Scalable ISA

This diagram is a sample representation of the ARM Partner Ecosystem for illustration purposes only


Mobile

Cortex-A72 delivers 3.5x performance of Cortex-A15 in the smartphone envelope

Compelling scalable solutions from smartphone to large-screen compute

Designed with the system in mind: CPU, CCI, GPU, Video, MMU, NIC, ELA

Wearables from Cortex-M to Cortex-A

Infrastructure

Cortex-A72 (and Cortex-A57) are ideal for dense, high-throughput computing

Small footprint for greater density on-die for larger core counts

Scalable configurations with larger (40+) core counts using ARM CoreLink CCN products

Deliver maximum throughput per mm2, per watt and per chip

Enterprise ready feature set and ecosystem

Summary


Thank you