computing beyond moore’s law: architecture and device innovations
TRANSCRIPT
0 Copyright 2016 FUJITSU
Fujitsu Forum 2016
#FujitsuForum
Computing Beyond Moore’s Law: Architecture and Device Innovations
TAKESHI HORIE Head of Computer Systems Laboratory FUJITSU LABORATORIES LTD.
Why computing now?
Data explosion
Data is generated by many IoT devices, and the amount of data is exploding.
Computing creates knowledge and intelligence from data, but traditional computing cannot handle it.
End of Moore’s law
For 50 years we have enjoyed device technology scaling, but that is ending.
Fundamentally rethink new computing architecture
Demand for Computing and Fujitsu Computer Systems
Computer performance
Since ENIAC was developed 70 years ago, computer performance has doubled every 1.5 years.
[Figure: computations per second per computer (log scale, 1E+00 to 1E+12) versus year, 1930 to 2010, starting from ENIAC (1946, U.S. federal government); trend line: 2x / 1.5 years]
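As a rough check of the trend line, the sketch below compounds a doubling every 1.5 years over the 70 years since ENIAC. The function name is an illustrative assumption, and the result is an idealized extrapolation, not a value read from the chart.

```python
def growth_factor(years, doubling_period_years=1.5):
    """Total performance multiple after `years` of steady doubling."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # 70 years of doubling every 1.5 years (idealized, assumes a constant rate).
    print(f"{growth_factor(70):.2e}")
```

Under this idealized assumption the multiple comes out at roughly fourteen orders of magnitude.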
Computing demand for scientific applications
Although computing has enabled applications in a variety of fields, much higher computing power is still required to solve complex real-world problems.
Heart simulation Joint research with the
University of Tokyo
Tsunami simulation Joint research with Tohoku University
- International Research Institute for Disaster
Life science and drug manufacturing
Global change prediction for reducing disaster
Industrial innovation
New material and energy creation
Origin of matter and the universe
Computing demand for financial applications
Tokyo Stock Exchange, Inc. (TSE) is one of the world's top trading markets and lists around 3,800 issues. Daily trading value exceeds three trillion yen.
Trading volume is increasing year by year.
For high-frequency trading, response time was reduced from 2 ms to 500 μs in 5 years.
[Figure: trading volume in TSE 1st section (millions, axis 0 to 900), 1949 to 2015]
[Figure: response time of TSE: 2 ms (2010), 900 μs (2012), 500 μs (2015)]
Fujitsu computer systems
1950 1960 1970 1980 1990 2000 2010
FACOM100 (1954)
FACOM230-10 (1965)
M-190 (1976)
M-780 (1985)
M-1800 (1990)
VPP-500 (1992)
FM V (1993)
OASYS100 (1980)
PRIMEHPC FX10 (2011)
VP-100 (1982)
FM TOWNS (1989)
PRIMEQUEST (2005)
GS21 (2002)
DS90 (1991)
Arrows (2011)
SPARC M10 (2013)
Supercomputer
Mainframe
Enterprise Servers
Ubiquitous Terminal
Fujitsu microprocessors
[Figure: Fujitsu microprocessor roadmap by period (to 1999, 2000 - 2003, 2004 - 2007, 2008 - 2011, 2012 - 2015, 2016 onward) for three product lines, with the technologies introduced for high performance and high reliability]
Mainframe: GS8600, GS8800, GS8800B, GS8900, GS21, GS21 600, GS21 900, GS21 M2600
UNIX: SPARC64, SPARC64 II, SPARC64 GP, SPARC64 V, SPARC64 V+, SPARC64 VI, SPARC64 VII, SPARC64 X, SPARC64 X+
Supercomputer: SPARC64 VIIIfx (K computer), SPARC64 IXfx, SPARC64 XIfx
High performance: Store Ahead, Branch History, Prefetch; Single-chip CPU; Non-Blocking $, O-O-O Execution, Super-Scalar; L2$ on Die; Multi-core, Multi-thread; HPC-ACE, System on Chip, Hardware Barrier; Virtual Machine Architecture, Software On Chip, High-speed Interconnect
High reliability: $ECC, Register/ALU Parity, Instruction Retry, $ Dynamic degradation, RC/RT/History
Fujitsu provides many HPC solutions to satisfy various customer demands
Support for both supercomputers with original CPU and x86 cluster systems
Post-K will be developed in collaboration with RIKEN and ARM
Supercomputer PRIMEHPC
PRIMEHPC FX10 PRIMEHPC FX100
K computer
(Co-developed with RIKEN) Large-Scale SMP System
RX900
x86 Cluster
CX400/CX600(KNL) BX900/BX400
Post-K (Co-developed with RIKEN and ARM)
Fujitsu high performance computing
IoT and Data Explosion
IoT connects everything
By 2020, 50 billion devices will be connected and will generate data constantly.
[Figure: billions of connected devices (axis 10 to 50) versus year, 1990 to 2020, compared with the worldwide population (source: Cisco). Annotations: only 1 million PCs were connected to the Internet; the number of devices exceeded the worldwide population; more than 50 billion devices in 2020]
Data explosion
As the amount of data explodes, it exceeds the capability of traditional ICT.
New processing is needed to create valuable information from unstructured data.
[Figure: amount of data versus year, 1990 to 2020 and beyond, passing 1 ZB, 40 ZB and 1 YB; 1 ZB = 10^21 bytes, 1 YB = 10^24 bytes]
The amount of data will reach 40 zettabytes by 2020 and 1 yottabyte by 2030.
Unstructured data: IoT, sensors
Structured data: business data, RDB
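The two projections above imply a compound annual growth rate, which the following sketch works out. It assumes steady exponential growth between 2020 and 2030; the function name is illustrative and not from the slides.

```python
ZB = 10 ** 21  # bytes in one zettabyte
YB = 10 ** 24  # bytes in one yottabyte


def annual_growth_rate(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1.0 / years) - 1.0


# 40 ZB in 2020 growing to 1 YB in 2030 (a factor of 25 in 10 years).
rate = annual_growth_rate(40 * ZB, 1 * YB, 10)
print(f"{rate:.1%}")
```

Under that assumption, the data volume would have to grow by a little under 40% per year.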
New information processing for data explosion
Information
Knowledge
Intelligence/ Knowledge
Volume
Quality for
Value
Numeric
Data Computing
Technology Trend for Computing
Microprocessor trend
Transistor counts are growing exponentially, following Moore’s law
Single-thread performance
• Increased by 60%/year (until 2005)
• Slowed down to +20%/year (since 2005)
Power and operating frequency
• Power restrictions limit operating frequency (since 2005)
Performance growth is limited by power consumption
Source: Stanford, K. Rupp
[Figure legend: transistor counts (K), performance, frequency (MHz), power (W), core counts]
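To put the two growth rates on the same footing as a doubling period, the sketch below converts an annual growth rate into the time needed to double. The helper name is an illustrative assumption.

```python
import math


def doubling_time_years(annual_growth):
    """Years needed to double at a constant annual growth rate (0.6 means +60%/year)."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


for label, growth in [("until 2005", 0.60), ("since 2005", 0.20)]:
    print(f"{label}: doubles every {doubling_time_years(growth):.2f} years")
```

At +60%/year, performance doubles in about 1.5 years; at +20%/year, the doubling time stretches to almost 4 years.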
Memory trend
[Figure: memory IC capacity (Gb/die, log scale, 10^-3 to 10^3) versus year, 2000 to 2016 (source: ISSCC, VLSI Circuits & Tech., ASSCC, IEDM). Growth per year: NAND +32%, DRAM +18%, MRAM +52%, PCM +95%, ReRAM +140%]
[Figure: memory hierarchy by access time (ns, μs, ms): CPU cache (SRAM), memory (DRAM), next-generation memory, SSD (Flash), HDD (magnetic); 1000x performance gap]
Next-generation memories are required to fill the DRAM-NAND gap.
DRAM density has saturated. NAND Flash density is growing, but with limited endurance.
There is a big performance gap between DRAM and NAND Flash.
Moore’s law
Device technology scaling has brought both higher performance and higher power efficiency over the past 50 years.
The trade-off line is determined by the device technology of each generation. As technology scales, the trade-off line moves upward.
The technology node will reach 7 nm in 2020 (the physical limit of current transistor technology).
Power efficiency × (Performance)^2 = K ∝ s^5, where s is the scaling factor
[Figure: power efficiency (a.u., 1 to 10^4) versus performance (a.u., 10^2 to 10^5), with trade-off lines for 1990, 2000, 2010 and 2020 running from mobile to server; advancement of the Moore’s limit line]
Technology scaling will no longer be a driver for computing
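The trade-off relation on this slide can be sketched numerically. The code below assumes the relation holds exactly as written (power efficiency × performance^2 = K, with K proportional to s^5); the function names and the sample value of K are illustrative assumptions in arbitrary units.

```python
def power_efficiency(performance, k):
    """Power efficiency on the trade-off line: efficiency * performance**2 = k."""
    return k / performance ** 2


def scaled_constant(k, s):
    """Trade-off constant after scaling by factor s, using K proportional to s**5."""
    return k * s ** 5


k = 1e8  # hypothetical constant for one technology generation (arbitrary units)
for perf in (1e2, 1e3, 1e4):
    print(perf, power_efficiency(perf, k))

# Scaling by a factor of 2 lifts the whole trade-off line by 2**5 = 32.
print(scaled_constant(k, 2.0) / k)
```

Along one line, a 10x gain in performance costs 100x in power efficiency, which is why the slide treats moving the line itself as the real source of progress.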
Computing innovations beyond Moore’s law
To overcome the limit of Moore’s law in both performance and power efficiency, realize beyond-Moore’s-law computing through two approaches:
• Computing architecture innovation
• Device innovation
[Figure: power efficiency (a.u., 1 to 10^4) versus performance (a.u., 10^2 to 10^5), showing the Moore’s law region, the Moore’s limit line, and the beyond-Moore’s-law region reached by the two approaches]
Computing Architecture Innovation
Data explosion and challenges
Overcome challenges to realize new information processing
[Figure: amount of data versus year, 2000 to 2030, split into structured and unstructured data, reaching 40 ZB (40 × 10^21 B) and then 1 YB (10^24 B); the data explosion runs into the limits of power, transmission, integration, and processing]
Essence of Intelligence: Data, Information, Knowledge, Intelligence
Challenges
• Process Technology
• Network Bandwidth
• Power Consumption
• Computing Power
Computing architecture innovation
Create new computing paradigm for data explosion
[Figure: amount of data versus year, 2000 to 2030, split into structured and unstructured data, reaching 40 ZB (40 × 10^21 B) and then 1 YB (10^24 B); the data explosion runs into the limits of power, transmission, integration, and processing. Labels: New Computing Architecture, Moore’s Law Computing, Hyperconnected Cloud, Cloud Computing System]
Essence of Intelligence: Data, Information, Knowledge, Intelligence
Challenges
• Process Technology
• Network Bandwidth
• Power Consumption
• Computing Power
Hyperconnected Cloud R&D vision and strategy: “Hyperconnected Cloud”
Web scale ICT provides computing and data processing power through service-oriented connection
AI and security are embedded at every layer to create knowledge in a safe and secure society
New computing architecture
Evolving from numeric computing to intelligence computing
[Figure: computing approaches arranged by degree of specialization and by type of processing (numeric, media, knowledge, intelligence): conventional computing, supercomputers, accelerators, neural computing (learning), neural computing (inference), brain inspired computing, quantum computers; annotated with the end of Moore’s Law]
Approach for new computing architecture
Evolving from numeric computing to intelligence computing
[Figure, built up over four slides: the same arrangement of computing approaches by specialization and type of processing (numeric, media, knowledge, intelligence), covering conventional computing, supercomputers / scientific computing, accelerators, neural computing (learning), neural computing (inference), brain inspired computing and quantum computers. Each slide adds one grouping, in this order:]
Conventional Computing
Domain Specific Computing
New Computing Paradigm
Future Computing Technologies
What is domain specific computing?
Achieve extremely high performance, simple operation and low cost by specializing hardware and software in specific application domains
• Optimize architecture to the characteristics of the specific domain
• Optimize hardware and software to the major functions of the domain
[Figure: domain specific computing, with a hardware configuration optimized for the domain. Example domains: media search, big data analysis; control, compression; encryption, attack detection; combinatorial optimization]
Three areas for domain specific computing
• Media processing
• Rivalling quantum computing
• Neural computing
[Figure: domain specific computing, with a hardware configuration optimized for the domain. Example domains: media search, big data analysis; control, compression; encryption, attack detection; combinatorial optimization]
Computing Architecture Innovation
Rivalling Quantum Computing for Combinatorial Optimization Demonstration 1
What is combinatorial optimization?
Application examples: power delivery, disaster recovery, investment portfolio
[Figure: four cities connected by candidate routes]
Combinatorial optimization example: find the shortest tour that visits every city.
Number of combinations: (N-1)!/2; e.g., 32 cities give on the order of 10^33 combinations
Combinatorial explosion
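The count quoted above can be reproduced directly from the formula (N-1)!/2. This is a minimal sketch; the function name is an illustrative assumption.

```python
import math


def tour_count(num_cities):
    """Number of distinct round trips through `num_cities` cities: (N-1)!/2."""
    return math.factorial(num_cities - 1) // 2


for n in (4, 8, 16, 32):
    print(n, f"{tour_count(n):.3e}")
```

For 32 cities the result has 34 digits, which matches the order of 10^33 stated on the slide.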
Strategy to solve combinatorial optimization
Create high-speed and widely applicable architecture
[Figure: speed (slow to fast) versus range of problems (limitation of problems to applicable to practical problems), positioning the conventional processor, the quantum computer (quantum annealing type) and our goal. Example problems shown: locating power grid failure; pick-up and delivery of 2000 depots; locating failures in 20-breaker power grid; map coloring]
Proposed new computing architecture
Architecture to meet usability and scalability for combinatorial optimization
• Solve practical problems by using CMOS digital design
• Realize scalability for larger problems and speed enhancement
Features
• Minimize the volume of data to move by using a parallel and hierarchical structure
• Accelerate the search for paths by parallel score calculation and transition facilitation
[Figure: multiple engines for larger problems; further speedup achieved by parallelism; speedup by parallel score calculation and transition facilitation]
Press release on Oct. 20th 2016
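The slides do not describe the search algorithm itself, so the following is only a software sketch of how the two named features could be read. It assumes a simulated-annealing-style search over 2-opt moves on a symmetric traveling salesman problem: "parallel score calculation" is modeled as scoring every candidate move from the same state before choosing one, and "transition facilitation" as an acceptance offset that grows whenever no move is accepted. All names, parameters and the choice of 2-opt are hypothetical, not Fujitsu's design.

```python
import math
import random


def tour_length(tour, dist):
    """Total length of the closed tour."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def two_opt_delta(tour, dist, i, j):
    """Length change if the segment tour[i+1..j] is reversed (symmetric distances)."""
    a, b = tour[i], tour[i + 1]
    c, d = tour[j], tour[(j + 1) % len(tour)]
    return dist[a][c] + dist[b][d] - dist[a][b] - dist[c][d]


def search(dist, steps=2000, temperature=1.0, offset_step=0.1, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    tour = list(range(n))
    cur_len = tour_length(tour, dist)
    best, best_len = tour[:], cur_len
    # All 2-opt moves except the one that only reverses the whole cycle.
    moves = [(i, j) for i in range(n - 1) for j in range(i + 2, n)
             if not (i == 0 and j == n - 1)]
    offset = 0.0
    for _ in range(steps):
        # Assumed "parallel score calculation": score every move from the same state.
        deltas = [two_opt_delta(tour, dist, i, j) for i, j in moves]
        # Metropolis test, relaxed by the current offset.
        accepted = [k for k, delta in enumerate(deltas)
                    if delta - offset <= 0
                    or rng.random() < math.exp(-(delta - offset) / temperature)]
        if accepted:
            k = rng.choice(accepted)
            i, j = moves[k]
            tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
            cur_len += deltas[k]
            offset = 0.0
            if cur_len < best_len - 1e-12:
                best, best_len = tour[:], cur_len
        else:
            # Assumed "transition facilitation": make the next acceptance easier.
            offset += offset_step
    return best, best_len


if __name__ == "__main__":
    points = [(0, 0), (1, 1), (1, 0), (0, 1)]
    dist = [[((px - qx) ** 2 + (py - qy) ** 2) ** 0.5 for qx, qy in points]
            for px, py in points]
    print(search(dist, steps=200))
```

In software the scoring loop runs sequentially; the point of a hardware engine would be to evaluate all candidates at once, which is where a large speedup from parallel scoring could come from.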
Evaluation of our prototype
A 12,000x speedup was confirmed on a 32-city traveling salesman problem.
Engine performance was evaluated using an FPGA implementation.
[Figure: time to solution (sec, log scale, 0.1 to 10,000) relative to a conventional processor*: FPGA, 2x; parallel score calculation, 1000x; transition facilitation, 6x; this work, 12,000x]
*3.5-GHz Intel Xeon E5
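The per-stage factors in the chart multiply out to the headline figure. The sketch below assumes the three factors are independent and compose multiplicatively, which the chart suggests but does not state; the dictionary keys are illustrative labels.

```python
from math import prod

# Per-stage speedups as read from the chart (assumed to compose multiplicatively).
stage_speedups = {
    "FPGA implementation": 2,
    "parallel score calculation": 1000,
    "transition facilitation": 6,
}

total = prod(stage_speedups.values())
print(total)
```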
Current status and future plan
High-speed, widely applicable architecture for optimization
• Operates 12,000 times faster than a conventional processor
• 1,000,000 times speedup envisioned using higher-layer parallelism
[Figure: engine; integrating many engines; upper-layer parallelism]
Achieved 12,000 times speedup using internal-engine parallelism
Further speedup and larger network size by using upper layers
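The gap between the achieved and the envisioned speedup indicates how much the upper layers would have to contribute. This sketch assumes the upper-layer gain simply multiplies the engine-level gain; the variable names are illustrative.

```python
achieved_speedup = 12_000       # internal-engine parallelism (from the slide)
envisioned_speedup = 1_000_000  # target with higher-layer parallelism (from the slide)

# Additional factor the upper layers would need to supply, if gains multiply.
required_upper_layer_factor = envisioned_speedup / achieved_speedup
print(f"{required_upper_layer_factor:.1f}x")
```

Under that assumption, the upper layers would need to add a factor of a little over 80.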
Ecosystem of combinatorial optimizer
[Figure: ecosystem of the combinatorial optimizer, involving Fujitsu, universities and research institutes across three layers: architecture (combinatorial optimizer engine: high-speed engine, scalable solution), software development environment, and application (application to practical problems: delivery, distribution; manufacturing, CAD; decision making AI)]
Next step: make an SDK with the engine available for joint research projects
Computing Architecture Innovation
Neural Computing Demonstration 5
Neural computing comes back again
Since 2012, deep learning algorithms and enhanced computing capability have enabled a much higher object recognition rate than ever before.
Manual design: input image, feature extraction (manually designed features), classification, results
Automatic extraction (deep learning): input image, feature extraction (automatically extracted features), classification, results
[Figure: general object recognition rate (axis 0.00 to 0.30), 2011 to 2015: a large difference between conventional machine learning algorithms and neural computing, which is improving every year]
[Figure: feedforward neural network from input to output, with weights w_ij and outputs y_1, y_2, ..., y_n; learning and inference]
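The feedforward network in the figure computes each output from a weighted sum of its inputs. The sketch below shows one common form of that inference step, assuming a sigmoid activation and a bias term, neither of which is specified on the slide; all names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(x, w, b):
    """One layer: y_j = sigmoid(sum_i w[i][j] * x[i] + b[j])."""
    return [sigmoid(sum(w[i][j] * x[i] for i in range(len(x))) + b[j])
            for j in range(len(b))]


def feedforward(x, layers):
    """Inference: pass the input through each (weights, biases) pair in turn."""
    for w, b in layers:
        x = layer_forward(x, w, b)
    return x


# Two inputs, a hidden layer of three units, and one output (arbitrary weights).
hidden = ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1, -0.1])
output = ([[1.0], [-1.0], [0.5]], [0.0])
print(feedforward([1.0, 2.0], [hidden, output]))
```

Learning adjusts the weights w_ij from data; inference, as here, only evaluates the network with fixed weights.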
Computing for deeper neural network
To achieve higher accuracy, neural networks have become deeper and larger.
• Processing speed: learning with a deeper neural network is time consuming
• Processing capacity: the limited memory size of a GPU is critical for larger neural networks
[Figure: neural network size trend: memory size (GB, axis 0 to 18) versus year, 1998 to 2016, comparing GPU memory size (up to ~16 GB) with neural network size at batch size 8 for LeNet, AlexNet, VGGNet and ResNet]
Fastest learning w/ HPC technology
Developed high-speed technology to process deep learning
Using "AlexNet," 64 GPUs in parallel achieve 27 times the speed of a single GPU, the world's fastest processing
[Figure: learning speed at the same accuracy: our approach (64 GPUs) is 1.8x faster than the conventional approach (64 GPUs) and 27x faster than 1 GPU in learning speed (60x faster in execution speed)]
Press release on Aug. 9th 2016
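The figures on this slide can be turned into parallel efficiencies. The sketch below assumes efficiency is defined as speedup divided by GPU count, and it infers the conventional 64-GPU speedup from the 1.8x ratio; that inferred value is not stated on the slide.

```python
def parallel_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / num_gpus


learning_speedup = 27.0   # our approach, 64 GPUs versus 1 GPU (from the slide)
execution_speedup = 60.0  # execution speed, 64 GPUs versus 1 GPU (from the slide)
conventional_speedup = learning_speedup / 1.8  # inferred from "1.8x faster"

print(f"learning efficiency:  {parallel_efficiency(learning_speedup, 64):.0%}")
print(f"execution efficiency: {parallel_efficiency(execution_speedup, 64):.0%}")
print(f"conventional speedup: {conventional_speedup:.0f}x")
```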
Doubles deep learning neural network scale
Developed technology to streamline the internal memory use of GPUs, supporting the growing neural network scale that heightens machine learning accuracy
Enabled neural network machine learning at up to twice the scale possible with previous technology
[Figure: conventional versus our approach with the same memory: 2x more images, 4% more accuracy]
Response after press release: “How A New Technology Promises To Make Learning More Powerful Than It Already Is” By Kelvin Murae, Forbes
Press release on Sep. 21st 2016
Computing Architecture Innovation
Media Processing Demonstration 2
Needs for image retrieval
Offices routinely create and store numerous documents that contain images, such as presentation materials.
The massive volume of stored image materials is not sufficiently reused.
10% of work time at offices is wasted searching for documents.
A more intuitive search method is needed: “Search by image” increases productivity
Partial image retrieval
Find images based on matches with a part of the query image
[Figure: a query image is searched against a massive image database; the search results include partial matches and enlarged or reduced images]
A general-purpose server takes a long time to process the massive calculations needed for partial matching.
Partial image retrieval must be accelerated to search for a target image intuitively and efficiently.
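To illustrate why partial matching is computationally heavy, the sketch below performs a brute-force search for a small patch inside a larger grayscale image using the sum of absolute differences. This is a deliberately simple stand-in: the slides mention feature extraction and matching but do not describe the actual method, and all names here are hypothetical.

```python
def find_patch(image, patch):
    """Return ((row, col), cost) of the best match of `patch` inside `image`.

    Both arguments are 2-D lists of pixel values. Every placement is scored,
    so the work grows with image area times patch area.
    """
    ih, iw = len(image), len(image[0])
    ph, pw = len(patch), len(patch[0])
    best_pos, best_cost = None, None
    for r in range(ih - ph + 1):
        for c in range(iw - pw + 1):
            cost = sum(abs(image[r + dr][c + dc] - patch[dr][dc])
                       for dr in range(ph) for dc in range(pw))
            if best_cost is None or cost < best_cost:
                best_pos, best_cost = (r, c), cost
    return best_pos, best_cost


# 4x4 test image with distinct pixel values, and a 2x2 patch cut from it.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patch = [[6, 7], [10, 11]]
print(find_patch(image, patch))
```

Repeating this kind of scan across a large database, and across scales for enlarged or reduced images, is the workload that motivates a dedicated accelerator.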
Image search acceleration system
We developed technology for instantaneous searches for a target image in a massive volume of images
[Figure: a client with a visual, intuitive user interface sends a query by image to the server and receives the results ("I found it!"). The server holds the database and the partial image retrieval engine, which combines a CPU and an FPGA for matching, feature extraction, I/O processing and overall control]
Press release on Feb. 2nd 2016
Demonstration
Performance
• Search performance: more than 50 times
• Power consumption: less than 1/30†
• Cubic volume of space: less than 1/50†
† for equivalent search performance
[Figure: throughput: conventional server, 200 images/sec; media domain specific server, 12,000 images/sec (more than 50 times)]
“Search by image” makes document creation more productive
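The throughput figures translate directly into search time for a given collection. The sketch below uses the two throughputs from the chart; the database size of one million images is a hypothetical value chosen only for illustration.

```python
def scan_time_seconds(num_images, images_per_second):
    """Time to compare a query against every image at a fixed throughput."""
    return num_images / images_per_second


conventional_rate = 200       # images/sec, conventional server (from the chart)
domain_specific_rate = 12000  # images/sec, media domain specific server (from the chart)
database_size = 1_000_000     # hypothetical collection size

speedup = domain_specific_rate / conventional_rate
print(f"throughput ratio: {speedup:.0f}x")
print(f"conventional:    {scan_time_seconds(database_size, conventional_rate):.0f} s")
print(f"domain specific: {scan_time_seconds(database_size, domain_specific_rate):.1f} s")
```

The ratio of the two charted throughputs is 60, consistent with the "more than 50 times" claim.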
Device Innovation
Device innovations for beyond Moore’s law
Novel Packaging Technology
System in Package to be replaced by new and different types of integration and scaling
• 2.5D integration with Interposer
• 3D stacked ICs
Beyond CMOS
New technology that may take the place of silicon CMOS technology
• New channel materials: compound semiconductors, graphene and carbon nanotubes (CNTs)
• New principle devices: Tunneling FET, Spin FET, Mott FET, …
Device innovation accelerates further innovation in architecture
Summary
Computing architecture innovations
The demand for computing performance is unlimited.
We will continue to innovate computing architecture and penetrate new applications with data explosion.
[Figure: penetration increasing with each computing paradigm shift: vector, graphics processing, accelerators, neural computing, brain inspired computing, quantum computers]