© 2017 Synopsys, Inc. 1

High Performance, Energy Efficient Implementation of ARM® Processors

Alan Gibbons, Principal Consultant, Synopsys Inc.

September 13th, 2017

The Power Gap for Application Processors

[Figure: The Power Gap. Source: ITRS System Driver Chapter 2010 Updates]

• User experience demands:

– High levels of device functionality, HD graphics, fast response, long battery life

• Can continue to extend CPU processing power to satisfy the performance requirements

• Cannot continue to extend energy source to match – battery technology is not evolving fast enough

• Solution:

– Need to become much smarter in how we spend our power/energy budget

– Must maximize our entitled performance within an energy (or power) budget

– Energy efficiency is a system problem and needs a system solution, so industry collaboration is also essential

Collaboration Enables Early Adoption of ARM’s Latest IP

Year    Frequency    Process

2010    1.0 GHz      40nm

2013    2.0+ GHz     28nm+

2016    3.0+ GHz     16nm

2017    3.5+ GHz     7nm

• ARM & Synopsys have been collaborating for over 20 years

– Rapid, optimized implementation and verification of synthesizable ARM IP and sub-systems

– Low power methodology and the development of energy efficient ARM based sub-systems

– High performance verification, emulation and prototyping

• Why collaborate?

– To deliver a better User experience and better QoR

– Get early versions of IP, libraries and tools working together from the outset

– De-risk the implementation for our mutual customers

Latest Collaboration – ARM DynamIQ™ CPU Sub-System

• Development of optimized implementations

– ARM Cortex-A75 CPU

– ARM Cortex-A55 CPU

– DynamIQ™ Shared Unit (DSU)

• Tuned implementations

– Optimizing performance, power and area

– TSMC 16FF+ technology

– ARM libraries and Synopsys EDA tools

• Available now

– RIs are available for download from SolvNet

– New QuickStart Implementation Kits (QIKs)

– Additional support via Synopsys Design Services

Cortex-A75 Implementation – Major Challenges

• Power: Analyze Library for Balanced Power/Performance/TAT

– OCV

– Sequential Flops

– Multi-Vt & gate-length

– Multibit

• Performance: Manage Crosstalk and Optimization for Best Frequency

– Concurrent Clock & Data (CCD)

– Global Route-based Optimization

– Crosstalk Optimization

– PrimeTime Delaycalc

• Area: Determine an Optimum Floorplan

– Placement Controls

– Macro Placement

– Data Flow Analysis

– Bounds

Performance and Power Managed Concurrently

Reduce Power / Meet Timing

IC Compiler II

• place_opt CCD

• New global route-based opt.

• CCS receiver cap modeling

• PrimeTime delay calc in route_opt

• Redundant VIA insertion

• Incremental timing-driven multibit register banking and de-banking

• Clock gating optimization

• Low power placement

• High effort leakage flow

DC Graphical

• Timing-driven multibit register banking and de-banking

• Physical-aware clock gating

• Low power placement

• Enhanced physical guidance (eSPG)

• Enhanced layer-aware optimization

• Placement pre-clustering

PrimeTime ECO

• Path-based analysis (PBA)

• Clock skew ECO

• Physical-aware ECO

• Leakage-aware timing ECO

Cortex-A75 CPU Power Optimization Flow

• Deliver energy efficient performance

– Not simply “high performance, low power”, rather highest entitled performance within a power/energy budget

– Optimal point(s) on a power v. performance curve

• Key considerations

– Vt class availability

– Multibit (MB) banking/de-banking

– Leakage vs. timing vs. dynamic optimization

– Leave headroom (both timing and power) for ECO

• Library impacts all these decisions

[Flow diagram: Meet Power Target, across IC Compiler II, DC Graphical and PrimeTime ECO]

– MB banking/de-banking: 1bit, 2bit, 4bit

– VT selection: across 12 vt/channel options

– Flop selection: QL (leakage) vs. Q (std) vs. QA (area)

– SI TNS reduction, very congested, clock NDRs

– fix_eco_power to meet leakage target, expect 15-20% reduction
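One way to read "highest entitled performance within a power/energy budget" is as a constrained pick from the points on a power v. performance curve. The Python sketch below is only an illustration of that idea: the curve values, the budget, and the names `OperatingPoint` and `best_within_budget` are hypothetical and do not come from the slides or from any Synopsys tool.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OperatingPoint:
    freq_ghz: float   # achieved frequency (hypothetical value)
    power_mw: float   # total power at that frequency (hypothetical value)


def best_within_budget(points: Sequence[OperatingPoint],
                       budget_mw: float) -> Optional[OperatingPoint]:
    """Return the fastest point whose power fits the budget, or None."""
    feasible = [p for p in points if p.power_mw <= budget_mw]
    if not feasible:
        return None
    # Highest frequency wins; ties go to the lower-power point.
    return max(feasible, key=lambda p: (p.freq_ghz, -p.power_mw))


# Hypothetical power v. performance curve, for illustration only.
curve = [
    OperatingPoint(2.0, 450.0),
    OperatingPoint(2.5, 640.0),
    OperatingPoint(3.0, 900.0),
    OperatingPoint(3.3, 1250.0),
]
chosen = best_within_budget(curve, budget_mw=1000.0)
print(chosen)
```

With these made-up numbers the 3.3 GHz point is rejected because it exceeds the budget, so the selection falls back to the fastest point that still fits.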

Vt Distribution Through Cortex-A75 Full Flow

Vt Class/Channel Mix Changes As Implementation Progresses

[Figure: Vt class/channel distribution at DC Graphical, ICC II route_opt and PrimeTime ECO, across ULVT, LVT and SVT, each at c16, c18, c20 and c24]

• In DC, datapath delay is prioritized; faster cells are used

• Pessimism reduces through the flow and CTS brings in useful skew, cells are swapped for power

• ECO brings in 4 new SVT classes and does positive slack recovery
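Positive slack recovery can be pictured as swapping cells on a path that has spare slack to slower, lower-leakage Vt classes, stopping before the slack goes negative. The following Python sketch shows that idea for a single path with a simple greedy loop. It is a simplification under stated assumptions: the per-class delay and leakage numbers are invented, channel lengths (c16 to c24) are ignored, and the greedy order is not claimed to be what PrimeTime ECO does.

```python
# Hypothetical relative delay and leakage per Vt class (not library data).
VT_CLASSES = {
    "ULVT": {"delay_ps": 10.0, "leakage": 8.0},
    "LVT":  {"delay_ps": 12.0, "leakage": 3.0},
    "SVT":  {"delay_ps": 15.0, "leakage": 1.0},
}
ORDER = ["ULVT", "LVT", "SVT"]  # fastest/leakiest to slowest/least leaky


def recover_leakage(path_cells, slack_ps):
    """Greedily swap cells on one path to slower Vt classes while slack >= 0."""
    cells = list(path_cells)
    changed = True
    while changed:
        changed = False
        for i, vt in enumerate(cells):
            idx = ORDER.index(vt)
            if idx + 1 == len(ORDER):
                continue  # already in the slowest class
            slower = ORDER[idx + 1]
            extra = VT_CLASSES[slower]["delay_ps"] - VT_CLASSES[vt]["delay_ps"]
            if extra <= slack_ps:
                cells[i] = slower      # accept the swap
                slack_ps -= extra      # and pay for it out of the slack
                changed = True
    return cells, slack_ps


def total_leakage(cells):
    """Sum the (hypothetical) leakage of all cells on the path."""
    return sum(VT_CLASSES[vt]["leakage"] for vt in cells)


before = ["ULVT", "ULVT", "LVT", "ULVT"]
after, remaining = recover_leakage(before, slack_ps=6.0)
print(after, remaining, total_leakage(before), total_leakage(after))
```

A real flow evaluates many overlapping paths at once, so one cell swap affects the slack of every path through that cell; the single-path version above only conveys the trade of timing headroom for leakage.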

Floorplanning

• Floorplanning is, and always has been, a key element in ARM CPU implementation

• Module and macro placement critical for hitting aggressive QoR targets

• Design topology, power switch mesh and power supply network also key considerations

• Macro placement, bounds and blockages impact both timing and power

[Figure: floorplans for Cortex-A57, Cortex-A72 and Cortex-A73, annotated with bounds, blockage and magnet placement. Sources: SV SNUG 2015, SV SNUG 2016]

QoR Challenges in Placement

• Design topology

– Analysis of Critical Module Timing on Cortex-A75 CPU

• Critical paths to & from CORE sub-module

– CORE connects heavily to DSIDE and DENGINE

– Critical paths seen throughout the flow (challenging to fix downstream)

• CORE being “pushed” out of center of core area and near IOs

• Created FMAX-limited paths due to long-path buffering across block

[Figure: floorplan showing the placement of DENGINE, DSIDE, ISIDE and CORE]

Cortex-A75 CPU Floorplan Changes

Move RAMs To Guide Module Placement

• Move CORE module back into the center

[Figure: two floorplans, each showing the placement of CORE, DSIDE, ISIDE and DENGINE]

Floorplan changes that allowed CORE to float to the center and close to DSIDE
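The effect of pulling CORE back toward the center can be approximated with a connectivity-weighted Manhattan distance between module centers. The Python sketch below compares two floorplans on that metric. Only the statement that CORE connects heavily to DSIDE and DENGINE comes from the slides; the coordinates, the connection weights and the function name `weighted_wirelength` are hypothetical.

```python
def weighted_wirelength(centers, connections):
    """Sum of connection weight times Manhattan distance between module centers."""
    total = 0.0
    for (a, b), weight in connections.items():
        (ax, ay), (bx, by) = centers[a], centers[b]
        total += weight * (abs(ax - bx) + abs(ay - by))
    return total


# Hypothetical relative connection counts between modules.
connections = {
    ("CORE", "DSIDE"): 10,
    ("CORE", "DENGINE"): 8,
    ("CORE", "ISIDE"): 3,
}

# Hypothetical module centers on a 10 x 10 block (arbitrary units).
core_at_edge = {"CORE": (9, 1), "DSIDE": (7, 7), "DENGINE": (2, 7), "ISIDE": (2, 2)}
core_at_center = {"CORE": (5, 5), "DSIDE": (7, 7), "DENGINE": (2, 7), "ISIDE": (2, 2)}

edge_cost = weighted_wirelength(core_at_edge, connections)
center_cost = weighted_wirelength(core_at_center, connections)
print(edge_cost, center_cost)
```

A lower weighted distance means fewer long, heavily buffered connections, which is the mechanism the slides associate with the FMAX-limited paths.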

Crosstalk: An Ongoing Challenge on ARM Cores

• Lower geometry processes always have a crosstalk component

• ARM CPUs have traditional SI prevention

– Clock NDRs

– Congestion-aware placement

– Logic and density controls

• The Cortex-A73 flow used NDRs to dramatically reduce crosstalk

• We have used all these techniques plus more on the Cortex-A75 CPU

[Figure source: SV SNUG 2016]

Early Results - Cortex-A75 CPU

[Figure: First Full Flow Trial. TNS, WNS and Leakage (%) at the place_opt, clock_opt, route_auto and route_opt stages]

• When starting the Cortex-A75 CPU flow, used best practices from Cortex-A73 RI

– Clock NDRs

– Crosstalk threshold noise ratio of 20%

– Congestion optimization for placer settings

– 80ps max_transition limit

• Early results: crosstalk still an issue

– Large TNS increase at route_auto stage

– Leakage increase at route_opt stage

More attention needed to address crosstalk
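For readers less familiar with the two timing metrics tracked above: WNS is the worst (most negative) endpoint slack and TNS is the sum of all negative endpoint slacks. A minimal Python sketch of that bookkeeping follows. The slack values are hypothetical, and some tools clamp WNS to zero when timing is met, which this sketch does not do.

```python
def wns_tns(endpoint_slacks_ps):
    """Return (WNS, TNS) for a list of endpoint slacks in picoseconds."""
    if not endpoint_slacks_ps:
        return 0.0, 0.0
    wns = min(endpoint_slacks_ps)                      # worst single endpoint
    tns = sum(s for s in endpoint_slacks_ps if s < 0)  # violating endpoints only
    return wns, tns


# Hypothetical endpoint slacks, for illustration only.
slacks = [-23.0, -12.0, 5.0, 40.0, -3.0]
wns, tns = wns_tns(slacks)
print(wns, tns)
```

A large TNS jump with a modest WNS change, as reported at route_auto, indicates many endpoints degrading at once rather than one bad path, which fits a crosstalk-driven effect.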

Crosstalk Mitigation – Cortex-A75 CPU

[Figure: After and Before panels. Label: decreasing net crosstalk delta delay]

New optimization solutions with better control of placement in both core area and macro channels resulted in dramatic FMAX & power improvements

Concurrent Clock and Data Optimization - CCD

• place_opt with data only: higher area & power

• place_opt w/ useful skew: lower area & power

[Figure: useful skew example. Labels: -100ps, 200ps; 50ps, 50ps; Delay 150ps; Apply useful skew; Datapath area/power recovery; CLK 90ps 10ps (twice); Size down/swap LVT; Size up clock buf]

Cortex-A75 CPU               WNS     TNS     Leakage Power

place_opt + route_opt        -23     -95     100%

place_opt CCD + route_opt    -20     -68     98%

Cortex-A75 CPU               WNS     TNS     Leakage Power

route_opt (Baseline)         -112    -134    100%

Power CCD + route_opt        -99     -126    99%
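The numbers in the useful-skew figure (-100ps and 200ps becoming 50ps and 50ps with a 150ps delay) are consistent with balancing the slack on either side of a register by delaying its clock. The Python sketch below works through that arithmetic. It is one reading of the figure, not a description of the CCD engine: it considers setup slack only, ignores hold checks, and the function name is hypothetical.

```python
def balance_with_useful_skew(slack_in_ps, slack_out_ps):
    """Shift a register's clock arrival to balance slack on both sides.

    Delaying the register's clock by `skew` adds `skew` to the slack of the
    path ending at the register and removes it from the path starting there.
    """
    skew = (slack_out_ps - slack_in_ps) / 2.0
    return skew, slack_in_ps + skew, slack_out_ps - skew


# Slack values taken from the figure labels: -100ps into the register, 200ps out.
skew, new_in, new_out = balance_with_useful_skew(-100.0, 200.0)
print(skew, new_in, new_out)
```

Once both stages have positive slack, the remaining margin can be spent on sizing down or Vt-swapping datapath cells, which matches the "datapath area/power recovery" label in the figure.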

DC Graphical + IC Compiler II + PrimeTime ECO Results on Cortex-A75 CPU

[Figure: Cortex-A75 CPU PPA. TNS (ns), and FMAX and Leakage as % of final, across the implementation stages DCG, place_opt, clock_opt, route_opt and Signoff ECO]

Summary

Manage power and performance concurrently – power is not an afterthought

Analyze library for balanced power, performance and turnaround time

Determine an optimum floorplan – module and memory placement critical

Manage crosstalk as well as concurrent optimization of clock and data

Latest RIs Available Now

Synopsys Reference Implementations (RIs) for Cortex-A75/-A55 are ready

• CPU and DSU flows

• TSMC 16nm FFC process

• ARM POP™ IP – core optimized standard cells & fast cache RAMs

• Complete implementation and static verification flows

Contact your Synopsys AC for additional information

RIs available on SolvNet (solvnet.synopsys.com/ARM-RI)