
Copyright 2016 FUJITSU

Fujitsu Forum 2016

#FujitsuForum


Computing Beyond Moore’s Law: Architecture and Device Innovations

TAKESHI HORIE Head of Computer Systems Laboratory FUJITSU LABORATORIES LTD.


Why computing now?

Data explosion

Data is generated by many IoT devices, and the amount of data is exploding.

Computing creates knowledge and intelligence from data, but traditional computing cannot handle this volume.

End of Moore’s law

For 50 years we have enjoyed device technology scaling, but that is ending.

We must fundamentally rethink computing architecture.


Demand for Computing and Fujitsu Computer Systems


Computer performance

Since ENIAC was developed 70 years ago, computer performance has doubled every 1.5 years.

[Figure: Computations per second per computer, 1930-2010 (log scale, 1E+00 to 1E+12), starting from ENIAC (1946, U.S. federal government). Trend line: 2x / 1.5 years.]
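As a rough check on the trend line, the doubling rate can be compounded over the 70 years since ENIAC. This is a back-of-the-envelope sketch: the function name is illustrative, and the result is only as precise as the "2x / 1.5 years" approximation.

```python
def growth_factor(years: float, doubling_period: float = 1.5) -> float:
    """Total performance multiple after `years`, doubling every `doubling_period` years."""
    return 2.0 ** (years / doubling_period)


# 70 years of doubling every 1.5 years
factor_70_years = growth_factor(70)
print(f"Growth over 70 years: {factor_70_years:.2e}x")
```

Under this assumption, the cumulative improvement is on the order of 10^14.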


Computing demand for scientific applications

Although computing has enabled applications in a variety of fields, much higher computing power is still required to solve the complex problems of the real world.

Heart simulation: joint research with the University of Tokyo

Tsunami simulation: joint research with Tohoku University - International Research Institute for Disaster

Life science and drug manufacturing

Global change prediction for reducing disaster

Industrial innovation

New material and energy creation

Origin of matter and the universe


Computing demand for financial applications

Tokyo Stock Exchange, Inc. (TSE) is one of the world's top trading markets and lists around 3,800 stocks. Daily trading value exceeds three trillion yen.

Trading volume is increasing year by year.

For high-frequency trading, response time was reduced from 2 ms to 500 μs in 5 years.

[Figure: Trading volume in TSE 1st section, 1949-2015 (millions, axis 0 to 900).]

[Figure: Response time of TSE, 2010-2015: 2 ms, then 900 μs, then 500 μs.]


Fujitsu computer systems

[Figure: Timeline of Fujitsu computer systems, 1950-2010s, in four categories: Supercomputer, Mainframe, Enterprise Servers, Ubiquitous Terminal.]

FACOM100 (1954)

FACOM230-10 (1965)

M-190 (1976)

OASYS100 (1980)

VP-100 (1982)

M-780 (1985)

FM TOWNS (1989)

M-1800 (1990)

DS90 (1991)

VPP-500 (1992)

FM V (1993)

GS21 (2002)

PRIMEQUEST (2005)

PRIMEHPC FX10 (2011)

Arrows (2011)

SPARC M10 (2013)


Fujitsu microprocessors

[Figure: Fujitsu microprocessor roadmap by period (-1999, 2000-2003, 2004-2007, 2008-2011, 2012-2015, 2016-), plotted along high performance and high reliability.]

Mainframe: GS8600, GS8800, GS8800B, GS8900, GS21, GS21 600, GS21 900, GS21 M2600

UNIX: SPARC64, SPARC64 II, SPARC64 GP, SPARC64 V, SPARC64 V+, SPARC64 VI, SPARC64 VII, SPARC64 X, SPARC64 X+

Supercomputer: SPARC64 VIIIfx, SPARC64 IXfx, SPARC64 XIfx, K computer

High performance technologies: Store Ahead, Branch History, Prefetch; Single-chip CPU; Non-Blocking $, O-O-O Execution, Super-Scalar; L2$ on Die; HPC-ACE, System on Chip, Hardware Barrier; Multi-core, Multi-thread; Virtual Machine Architecture, Software On Chip, High-speed Interconnect

High reliability technologies: $ECC, Register/ALU Parity, Instruction Retry, $ Dynamic degradation, RC/RT/History


Fujitsu high performance computing

Fujitsu provides many HPC solutions to satisfy various customer demands.

Support for both supercomputers with original CPUs and x86 cluster systems.

Post-K will be developed in collaboration with RIKEN and ARM.

• Supercomputer PRIMEHPC: PRIMEHPC FX10, PRIMEHPC FX100

• K computer (co-developed with RIKEN)

• Large-Scale SMP System: RX900

• x86 Cluster: CX400/CX600 (KNL), BX900/BX400

• Post-K (co-developed with RIKEN and ARM)


IoT and Data Explosion


IoT connects everything

By 2020, 50 billion devices will be connected and will generate data constantly.

[Figure: Billions of connected devices by year, 1990-2020 (axis 10 to 50), compared with the worldwide population (source: Cisco). Annotations: only 1 million PCs were connected to the Internet; the number of devices exceeded the worldwide population; more than 50 billion devices in 2020.]


Data explosion

As the amount of data explodes, it exceeds the capability of traditional ICT. New processing is needed to create valuable information from unstructured data.

The amount of data will reach 40 zettabytes by 2020 and 1 yottabyte by 2030 (1 ZB = 10^21 bytes, 1 YB = 10^24 bytes).

[Figure: Amount of data by year, 1990-2020, with markers at 1 ZB, 40 ZB, and 1 YB. Unstructured data: IoT, sensors. Structured data: business data, RDB.]
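The two projections above imply a compound annual growth rate, which can be worked out directly. This is a minimal sketch; the helper name is illustrative, and it assumes smooth exponential growth between the 2020 and 2030 figures.

```python
def annual_growth_rate(start: float, end: float, years: float) -> float:
    """Compound annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


ZB = 10 ** 21  # bytes in one zettabyte
YB = 10 ** 24  # bytes in one yottabyte

# 40 ZB in 2020 growing to 1 YB in 2030
rate = annual_growth_rate(40 * ZB, 1 * YB, 2030 - 2020)
print(f"Implied growth: {rate:.1%} per year")
```

Going from 40 ZB to 1 YB is a 25x increase in ten years, or roughly 38% per year under this assumption.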


New information processing for data explosion

[Figure: Computing evolves from numeric data to information, knowledge, and intelligence. Axes: volume, and quality for value.]


Technology Trend for Computing


Microprocessor trend

Transistor counts are growing exponentially, following Moore’s law.

Single-thread performance
• Increased by 60%/year (until 2005)
• Slowed down to +20%/year (since 2005)

Power and operating frequency
• Power restrictions have limited operating frequency (since 2005)

Performance growth is limited by power consumption.

[Figure: 40 years of processor trends: transistor counts (K), performance, frequency (MHz), power (W), core counts. Source: Stanford, K. Rupp]

https://www.karlrupp.net/wp-content/uploads/2015/06/40-years-processor-trend.png
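To make the slowdown in single-thread performance concrete, the two growth rates can be compared by how long each takes to deliver a 10x improvement. This is a small sketch with an illustrative helper name; it assumes steady compounding at the quoted rates.

```python
import math


def years_to_multiply(target: float, annual_rate: float) -> float:
    """Years needed to grow by a factor of `target` at `annual_rate` per year."""
    return math.log(target) / math.log(1.0 + annual_rate)


years_at_60 = years_to_multiply(10, 0.60)  # pre-2005 rate
years_at_20 = years_to_multiply(10, 0.20)  # post-2005 rate
print(f"10x at +60%/year: {years_at_60:.1f} years")
print(f"10x at +20%/year: {years_at_20:.1f} years")
```

At +60%/year a 10x gain takes about 5 years; at +20%/year it takes more than 12.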


Memory trend

Next-generation memories are required to fill the DRAM-NAND gap.

DRAM density has saturated. NAND flash density is growing, but with limited endurance.

There is a big performance gap between DRAM and NAND flash.

[Figure: Memory IC capacity [Gb/die] by year, 2000-2016 (log scale, 10^-3 to 10^3). Growth rates: NAND +32%/yr, DRAM +18%/yr, MRAM +52%/yr, PCM +95%/yr, ReRAM +140%/yr. Source: ISSCC, VLSI Circuits & Tech., ASSCC, IEDM]

[Figure: Memory hierarchy by access time (ns, us, ms): CPU cache (SRAM), memory (DRAM), SSD (flash), HDD (magnetic). Next-generation memory fills the 1000x performance gap.]


Moore’s law

Device technology scaling has brought higher performance as well as higher power efficiency over the past 50 years.

The trade-off line is determined by the device technology of each generation. As technology scales, the trade-off line moves upward.

$\text{Power efficiency} \times (\text{Performance})^2 = K \propto s^5$, where $s$ is the scaling factor.

The technology node will reach 7 nm in 2020 (the physical limit of current transistor technology).

Technology scaling will no longer be a driver for computing.

[Figure: Power efficiency (a.u., 1 to 10^4) versus performance (a.u., 10^2 to 10^5), with trade-off lines for 1990, 2000, 2010, and 2020, mobile and server design points, and the advancement of the Moore’s limit line.]
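The trade-off relation can be explored numerically. The sketch below reads it as: within one technology generation K is fixed, so power efficiency falls with the square of performance, while scaling by a factor s raises K by s^5. The function names are illustrative, the units are arbitrary, and the proportionality constant is assumed to be 1.

```python
def power_efficiency(performance: float, k: float) -> float:
    """Power efficiency on the trade-off line: efficiency * performance**2 = k."""
    return k / performance ** 2


def scaled_k(k: float, s: float) -> float:
    """Trade-off constant after scaling the technology by factor s (K is proportional to s**5)."""
    return k * s ** 5


k_now = 1e6  # arbitrary units, chosen only for illustration

# Within one generation, doubling performance costs 4x in efficiency.
print(power_efficiency(100, k_now), power_efficiency(200, k_now))

# A scaling factor of 2 lifts the whole trade-off line by 2**5 = 32x.
print(scaled_k(k_now, 2.0) / k_now)
```

This is why the trade-off line moves upward only as long as scaling continues: once s stops growing, K is fixed and performance can only be bought with efficiency.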


Computing innovations beyond Moore’s law

To overcome the limit of Moore’s law in both performance and power efficiency, realize beyond-Moore’s-law computing through two approaches:

• Computing architecture innovation
• Device innovation

[Figure: Power efficiency (a.u., 1 to 10^4) versus performance (a.u., 10^2 to 10^5). The Moore’s law region is bounded by the Moore’s limit line; the beyond-Moore’s-law region lies past it.]


Computing Architecture Innovation


Data explosion and challenges

Overcome challenges to realize new information processing.

Challenges
• Process Technology
• Network Bandwidth
• Power Consumption
• Computing Power

[Figure: Amount of data by year, 2000-2030. Structured and unstructured data grow to 40 ZB (40 × 10^21 B) and then 1 YB (10^24 B), running into the limits of power, transmission, integration, and processing. Data is refined into information, knowledge, and intelligence (the essence of intelligence).]


Computing architecture innovation

Create a new computing paradigm for the data explosion.

[Figure: The data explosion chart from the preceding slide (40 ZB, 1 YB, limits of power, transmission, integration, and processing), with added labels: Moore’s law computing, cloud computing system, new computing architecture, Hyperconnected Cloud.]


Hyperconnected Cloud

R&D vision and strategy: “Hyperconnected Cloud”

Web-scale ICT provides computing and data processing power through service-oriented connection.

AI and security are embedded at every layer to create knowledge in a safe and secure society.


New computing architecture

Evolving from numeric computing to intelligence computing.

[Figure: Computing approaches plotted by type of processing (numeric, media, knowledge, intelligence) and degree of specialization: conventional computing, supercomputers, accelerators, neural computing (inference), neural computing (learning), brain-inspired computing, quantum computers. Marker: end of Moore’s law.]


Approach for new computing architecture

[Figure, built up in four steps on the axes processing (numeric, media, knowledge, intelligence) and specialization: (1) conventional computing, with supercomputers and quantum computers; (2) domain specific computing, adding accelerators; (3) new computing paradigm, with scientific computing and domain specific computing; (4) future computing technologies.]

What is domain specific computing?

Achieve extremely high performance, simple operation, and low cost by specializing hardware and software for specific application domains.

• Optimize the architecture to the characteristics of the specific domain
• Optimize hardware and software to the major functions of the domain

[Figure: Domain specific computing, with a hardware configuration optimized for the domain. Example domains: media search; big data analysis; control, compression; encryption, attack detection; combinatorial optimization.]


Three areas for domain specific computing

• Media processing
• Rivalling quantum computing
• Neural computing

[Figure: Domain specific computing, with a hardware configuration optimized for the domain. Example domains: media search; big data analysis; control, compression; encryption, attack detection; combinatorial optimization.]



Rivalling Quantum Computing for Combinatorial Optimization Demonstration 1


What is combinatorial optimization?

Application examples: power delivery, disaster recovery, investment portfolios.

Combinatorial optimization: find the shortest tour that visits every city.

Number of combinations: (N-1)!/2. For example, 32 cities give on the order of 10^33 combinations.

Combinatorial explosion

[Figure: Four cities connected by candidate routes.]
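The (N-1)!/2 count is easy to evaluate exactly, which shows how quickly the search space explodes. This is a short sketch with an illustrative function name.

```python
import math


def tour_count(n_cities: int) -> int:
    """Number of distinct round trips through n_cities: (N-1)!/2."""
    if n_cities < 3:
        raise ValueError("the (N-1)!/2 formula applies to 3 or more cities")
    return math.factorial(n_cities - 1) // 2


for n in (4, 8, 16, 32):
    print(f"{n:2d} cities: {tour_count(n):.3e} tours")
```

Dividing by 2 reflects that a tour and its reverse have the same length; fixing the starting city accounts for the (N-1)! instead of N!.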


Strategy to solve combinatorial optimization

Create a high-speed and widely applicable architecture.

[Figure: Approaches plotted by speed (slow to fast) and applicability (limited problems to practical problems): conventional processor, quantum computer*, and our goal.]

* Quantum annealing type

Example problems:
• Locating power grid failure
• Pick-up and delivery of 2000 depots
• Locating failures in 20-breaker power grid
• Map coloring


Architecture to meet usability and scalability for combinatorial optimization

• Solve practical problems by using CMOS digital design
• Realize scalability for larger problems and speed enhancement

Features
• Minimize the volume of data to move, using a parallel and hierarchical structure
• Accelerate the search for paths by parallel score calculation and transition facilitation

[Figure: Proposed new computing architecture. Speed-up by parallel score calculation and transition facilitation; further speed-up achieved by parallelism; multiple engines for larger problems.]

Press release on Oct. 20th 2016
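The slides do not detail how the engine works, so the following is only a generic simulated-annealing sketch for a small traveling salesman problem, not the hardware design. It reads "parallel score calculation" as scoring every candidate move at each step (done sequentially here) and "transition facilitation" as raising an acceptance offset whenever no move is accepted. Both readings, and all names and parameters, are assumptions made for illustration.

```python
import math
import random


def tour_length(tour, pts):
    """Total length of the closed tour visiting pts in the given order."""
    n = len(tour)
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % n]]) for i in range(n))


def two_opt_delta(tour, pts, i, j):
    """Length change if the segment tour[i+1..j] is reversed (a 2-opt move)."""
    n = len(tour)
    a, b = pts[tour[i]], pts[tour[i + 1]]
    c, d = pts[tour[j]], pts[tour[(j + 1) % n]]
    return math.dist(a, c) + math.dist(b, d) - math.dist(a, b) - math.dist(c, d)


def anneal(pts, steps=2000, t_start=1.0, t_end=0.001, offset_step=0.01, seed=0):
    """Simulated annealing over 2-opt moves, with an escape offset (assumed)."""
    rng = random.Random(seed)
    n = len(pts)
    tour = list(range(n))
    best, best_len = tour[:], tour_length(tour, pts)
    offset = 0.0
    moves = [(i, j) for i in range(n - 1) for j in range(i + 2, n)
             if not (i == 0 and j == n - 1)]
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # Score every candidate move; hardware could do this in parallel.
        deltas = [two_opt_delta(tour, pts, i, j) for i, j in moves]
        accepted = [m for m, d in zip(moves, deltas)
                    if d - offset <= 0
                    or rng.random() < math.exp(-(d - offset) / temp)]
        if accepted:
            i, j = rng.choice(accepted)
            tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
            offset = 0.0
            current = tour_length(tour, pts)
            if current < best_len:
                best, best_len = tour[:], current
        else:
            # Nothing accepted: relax the threshold to escape a local minimum.
            offset += offset_step
    return best, best_len


cities = [(0, 0), (1, 1), (1, 0), (0, 1), (0.5, 1.5), (2, 0.5), (1.5, 1.5), (0.2, 0.8)]
route, length = anneal(cities)
print(route, round(length, 3))
```

Scoring all moves per step makes each step more expensive on a CPU, but it is the part that maps naturally onto parallel digital hardware.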


Evaluation of our prototype

A 12,000x speedup was confirmed using a 32-city traveling salesman problem.

Engine performance was evaluated using an FPGA implementation.

[Figure: Time to solution (sec, log scale from 0.1 to 10,000) for a conventional processor*, an FPGA implementation, parallel score calculation, and transition facilitation (this work). Annotated speedup factors: 2x, 1000x, 6x; 12,000x overall.]

* 3.5-GHz Intel Xeon E5


Current status and future plan

High-speed, widely applicable architecture for optimization
• Operates 12,000 times faster than a conventional processor
• A 1,000,000-times speedup is envisioned using higher-layer parallelism

[Figure: From a single engine, to integrating many engines, to upper-layer parallelism. A 12,000-times speedup was achieved using internal-engine parallelism; further speed-up and larger network sizes come from using the upper layers.]
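The headline numbers can be cross-checked with simple arithmetic. The sketch below multiplies the three factors annotated on the evaluation chart (2x, 1000x, 6x) and computes how much more the upper layers would need to contribute to reach the envisioned 1,000,000x. Attributing each factor to a particular stage is an assumption based on the order of the chart labels.

```python
import math

# Assumed mapping of chart annotations to stages (order of the labels).
stage_speedups = {
    "FPGA implementation": 2,
    "parallel score calculation": 1000,
    "transition facilitation": 6,
}

engine_speedup = math.prod(stage_speedups.values())
target_speedup = 1_000_000
upper_layer_factor = target_speedup / engine_speedup

print(f"Engine speedup: {engine_speedup:,}x")
print(f"Needed from upper-layer parallelism: {upper_layer_factor:.1f}x")
```

The product matches the reported 12,000x, leaving a factor of roughly 83x to come from integrating many engines.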


Ecosystem of combinatorial optimizer

[Figure: Ecosystem of the combinatorial optimizer, built by Fujitsu, universities, and research institutes. Layers: architecture (combinatorial optimizer engine: high-speed engine, scalable solution), software development environment, and application (application to practical problems: Delivery, Distribution; Manufacturing, CAD; Decision Making AI).]

Next step: make an SDK with the engine available for joint research projects.



Neural Computing Demonstration 5


Neural computing comes back again

Since 2012, deep learning algorithms and enhanced computing capability have enabled much higher object recognition rates than ever before.

Manual design: input image -> feature extraction -> features -> classification -> results

Automatic extraction (deep learning): input image -> feature extraction and classification, both automatic -> results

[Figure: General object recognition rate (axis 0.00 to 0.30), 2011-2015: neural computing versus conventional machine learning algorithms. Large difference, improving every year.]

[Figure: Neural network (feedforward): inputs, weights w_ij, and outputs y_1, y_2, ..., y_n, used for learning and inference.]
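The feedforward network in the figure can be sketched in a few lines. This example covers inference only: each output is y_j = f(sum over i of w_ij * x_i). The sigmoid activation, the absence of bias terms, and the example weights are assumptions made for illustration; learning, which adjusts w_ij, is not shown.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation, assumed here as the nonlinearity f."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(x, w):
    """One layer: w[i][j] is the weight from input i to output j."""
    n_out = len(w[0])
    return [sigmoid(sum(w[i][j] * x[i] for i in range(len(x))))
            for j in range(n_out)]


def forward(x, layers):
    """Inference: pass the input through each weight matrix in turn."""
    for w in layers:
        x = layer_forward(x, w)
    return x


# Hypothetical 2-3-1 network with made-up weights.
hidden = [[0.5, -0.2, 0.1],
          [0.3, 0.8, -0.5]]
output = [[1.0], [-1.0], [0.5]]
print(forward([1.0, -1.0], [hidden, output]))
```

Deeper networks simply add more weight matrices to `layers`, which is where the growth in computation and memory comes from.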


Computing for deeper neural network

To achieve higher accuracy, neural networks have become deeper and larger.

• Processing speed: learning with a deeper neural network is time-consuming
• Processing capacity: the limited memory size on a GPU is critical for larger neural networks

[Figure: Neural network size trend. Memory size [GB] (axis 0 to 18) by year, 1998-2016: GPU memory size versus NN size (batch = 8) for LeNet, AlexNet, VGGNet, and ResNet, reaching ~16 GB.]


Fastest learning w/ HPC technology

Developed high-speed technology to process deep learning.

Using "AlexNet," 64 GPUs in parallel achieve 27 times the speed of a single GPU, for the world's fastest processing.

[Figure: Learning speed at the same accuracy, 1 GPU versus 64 GPUs. Our approach (64 GPUs): 27x faster learning speed (60x faster execution speed) and 1.8x faster than the conventional approach (64 GPUs).]

Press release on Aug. 9th 2016
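One way to read the 27x figure is as parallel efficiency: the achieved speedup divided by the number of GPUs. This is a minimal sketch; the function name is illustrative.

```python
def parallel_efficiency(speedup: float, n_workers: int) -> float:
    """Fraction of ideal linear scaling achieved."""
    return speedup / n_workers


efficiency = parallel_efficiency(27, 64)
print(f"Parallel efficiency on 64 GPUs: {efficiency:.0%}")
```

A 27x speedup on 64 GPUs corresponds to about 42% of ideal linear scaling, which illustrates how much communication and synchronization cost remains even in an optimized system.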


Doubles deep learning neural network scale

Developed technology to streamline the use of GPU internal memory, supporting the growing neural network scale that heightens machine learning accuracy.

Enabled neural network machine learning at a scale up to twice what was possible with previous technology.

[Figure: Conventional versus our approach with the same memory. Annotations: 2x more images, 4% more accuracy.]

Response after press release: “How A New Technology Promises To Make Learning More Powerful Than It Already Is” by Kelvin Murae, Forbes

Press release on Sep. 21st 2016



Media Processing Demonstration 2


Needs for image retrieval

People routinely create and store numerous documents that contain images, such as presentation materials.

The massive stock of stored images is not sufficiently reused.

10% of office work time is wasted searching for documents.

A more intuitive search method is needed: “Search by image” increases productivity.


Partial image retrieval

Find images based on matches with a part of the query image

[Figure: A query image is searched against a massive image DB. The search results include partial matches and enlarged or reduced images.]

A general-purpose server takes a long time to process the massive calculations required for partial matching.

Partial image retrieval must be accelerated to search for a target image intuitively and efficiently.
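To show why partial matching is computationally heavy, here is a brute-force sketch that slides a query patch over one image and scores every position by the sum of absolute differences (SAD). The slides do not describe the actual matching algorithm; SAD is a stand-in chosen for simplicity, the function names are illustrative, and scale changes (enlarged or reduced images) are not handled.

```python
def sad(image, patch, top, left):
    """Sum of absolute differences between patch and the image window at (top, left)."""
    return sum(abs(image[top + r][left + c] - patch[r][c])
               for r in range(len(patch))
               for c in range(len(patch[0])))


def find_patch(image, patch):
    """Return (top, left, score) of the best-matching window, trying every position."""
    height, width = len(image), len(image[0])
    p_height, p_width = len(patch), len(patch[0])
    best = None
    for top in range(height - p_height + 1):
        for left in range(width - p_width + 1):
            score = sad(image, patch, top, left)
            if best is None or score < best[2]:
                best = (top, left, score)
    return best


# Tiny grayscale example: the patch is cut from rows 1-2, columns 2-3.
image = [[r * 10 + c for c in range(6)] for r in range(5)]
patch = [row[2:4] for row in image[1:3]]
print(find_patch(image, patch))
```

The cost is proportional to the number of window positions times the patch size, for every image in the database, which is the kind of regular, massive calculation that benefits from hardware acceleration.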


Image search acceleration system

We developed technology for instantaneous searches of a target image from a massive volume of images.

[Figure: A client with a visual, intuitive user interface sends a query by image to the server and receives the results (“I found it!”). The server holds the database and the partial image retrieval engine, which combines a CPU and an FPGA for matching, feature extraction, I/O processing, and overall control.]

Press release on Feb. 2nd 2016


Demonstration


Performance

• Search performance: more than 50 times
• Power consumption: less than 1/30†
• Cubic volume of space: less than 1/50†

† for equivalent search performance

[Figure: Throughput. Conventional server: 200 images/sec. Media domain specific server: 12,000 images/sec (more than 50 times).]

“Search by image” makes document creation more productive.


Device Innovation


Device innovations for beyond Moore’s law

Novel Packaging Technology

System in Package to be replaced by new and different types of integration and scaling

• 2.5D integration with Interposer

• 3D stacked ICs

Beyond CMOS

New technology that may take the place of silicon CMOS technology

• New channel materials: compound semiconductors, graphene, and CNTs (carbon nanotubes)

• New principle devices: tunneling FET, spin FET, Mott FET, …

Device innovation accelerates further innovation in architecture


Summary


Computing architecture innovations

The demand for computing performance is unlimited.

We will continue to innovate computing architecture and penetrate new applications as data explodes.

[Figure: Penetration of new applications versus computing paradigm shifts: vector, graphics processing, accelerators, neural computing, quantum computers.]