Designing for intensity: parallelism from analytics to AI

© Copyright 2017 FUJITSU, Fujitsu Forum 2017 #FujitsuForum

Uploaded by fujitsu-global, posted 21-Jan-2018

Designing for intensity: parallelism from analytics to AI

Ian Godfrey, Director of the Solutions Business for Fujitsu Systems Europe

Manju Annie Oommen, Global Product Marketing Manager, Fujitsu

Agenda

1. HPC Diversifies
2. What Changed over the years
3. Similarities between HPC and Deep Learning optimization
4. Co-creating solutions
5. Q & A

HPC Diversifies: Hunger for compute power

Connected devices worldwide, the size of the digital universe and the number of applications are all increasing:

2016: 6.4Bn devices, 10 zettabytes, 1000s of apps

>2020: 28Bn devices, 180 zettabytes, 20K new apps

10 times more data to be generated by 2025

Emergence of High Performance Data Analytics (HPDA):

• Fraud and anomaly detection: identifying harmful or potentially harmful patterns and causes in real time, using graph analysis, semantic analysis or other high-performance analytics techniques.
• Marketing: promoting products or services using complex algorithms to discern potential customers' demographics, buying preferences and habits.
• Business intelligence: using HPDA to identify opportunities to advance the market position and competitiveness of businesses by better understanding themselves, their competitors and the evolving dynamics of the markets they participate in.
• Other commercial HPDA: an example of such a high-potential workload is the use of HPDA to manage large IT infrastructures, ranging from on-premise data centers to public clouds and Internet-of-Things (IoT) infrastructures, which involves solving complex problems.

Existing HPC users:
• Intelligence community, FSI
• Data-driven science/engineering (e.g. biology)
• Knowledge discovery
• ML/DL/cognitive AI

New commercial users:
• Fraud/anomaly detection
• Business intelligence
• Affinity marketing
• Personalized medicine

Customer benefits:
• Fastest processing/transformation of large-volume data
• Real-time analysis to extract invisible insight from the data
• Accelerated deep-learning technology by GPU computation

HPDA to grow robustly to be a $54Bn market

Source: Information from analysts and various telecommunication firms

Neural Networks are Old: What changed?

Scale drives deep learning progress. Availability of:
• More data
• Faster compute/hardware
• Better algorithms

Best results are obtained by training a large neural network and/or by feeding in more data.

Repetitive training

History:
1943: First electrical model of a neural network
1958: Perceptron
1986: Backpropagation
1990s: Convolutional networks (LeCun)
2006: Deep Belief Network (Hinton)
2013/14: Google buys DeepMind

HPC speeding up deep learning research

What does deep learning deal with?

Deep learning is the machine's perception of:
• Images: faces, self-driving
• Sound: voice search, music generation, translation
• Text: CRM, search, ads
• Time series: health data, sensors, finance

ARTIFICIAL INTELLIGENCE: a program that can sense, reason, act and adapt

MACHINE LEARNING: algorithms whose performance improves when exposed to more data over time

DEEP LEARNING: multi-layered neural networks that learn from vast amounts of data

Unsupervised learning: cluster analysis

Supervised learning: time series, unstructured data

Network types: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), RNN + Long Short-Term Memory (LSTM)

Reinforcement learning

Industry segmentation and use cases

Healthcare: • Pharmaceutical • Genomics • Imagery and medical diagnostics

Marketing automation: • CRM • Market classification • Demand prediction • Document generation

Manufacturing/Industrial: • Enterprise Resource Planning • Predictive maintenance/analysis

Defense and social security: • Surveillance and security • Cyber security • Image recognition • Motion detection

Transport/Logistics: • Autonomous cars • Motion detection • Networked car/co-ordinated traffic • Commercial drones • Optimized routes

Consumer/e-commerce/Retail: • Sentiment analysis • Classification • Recommendation engine • Demand prediction • Automated consulting • Search • Emails • Personalization • Smart assistant • Chatbots • Machine transcription • Machine translation

Others: • Education • Fintech • Gaming • Telco • Media

Industry-wide presence of Deep Learning

Sector-wise: Manufacturing 43%, Distribution 26%, Public Sector 18%, Financial 9%, Social Infra 4%

Application-wise: Call center 28%, Knowledge utilization 20%, Manufacturing 16%, Demand prediction 13%, Fintech 9%, Maintenance 8%, Healthcare 6%

Source: Based on projects & PoCs in Fujitsu

"Artificial Intelligence is the new Electricity…" (Andrew Ng)

DL is not a vertical market. It is more akin to an algorithm or method of computation, like an FFT.

Intersect360 Research tracks AI (including deep learning, machine learning, cognitive computing, etc.) as part of the hyperscale market:
• Similar to, but distinct from, HPC
• Low precision, intensely parallel, strong affinity to public cloud
• Cloud providers and end users are in early stages of investment for their applications
• AI may become a pervasive technology that is embedded in non-hyperscale manifestations


Fujitsu shaping HPC Diversification

HPC: the foundation for accelerating AI technology

… & FX100: for simulation and pre-processing technology

Zinrai Deep Learning & DLU: for a high-speed learning environment

Digital Annealer: for combinatorial optimization problems

Diagram labels: HPC, Deep Learning, Quantum Computing

Proximity in AI and HPC

Diagram labels: HPC (supercomputing), AI/DL (hyperscale), multi-node

Characterising Performance Computing

Computational scope:
• Primary focus is performance
• Compute-intensive algorithms
• Maths solvers
• Applications arbitrarily scalable
• It is still "HPC" on only a few nodes: there is entry-level HPC
• Largest supercomputers cost >$100 million

Customer usage:
• Problem-solving
• Data analysis
• Scientific simulation
• Technical modelling
• Virtual prototyping
• Top-tier users push boundaries and influence technology throughout industry

Convolutional Neural Network Breakthrough

Timeline: 2012: deep DNN "first blood"; 2013: Network in Network; 2014: deeper networks

Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In NIPS (2012):

"One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000."

Use of 2 GPUs: the layers themselves are split between the GPUs (model parallelism)
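The quoted caption describes splitting a network's layers between two GPUs. Here is a minimal sketch of that idea, often called model parallelism. It assumes a single fully connected layer with no activation, whose output neurons are divided into two halves. Each half is computed by a hypothetical "device", which here is just a function call, and the halves are gathered afterwards. Plain Python lists stand in for GPU memory, and the helper names are illustrative. This is not the implementation used in the paper.

```python
def dense(inputs, weights, biases):
    """out[j] = sum_i inputs[i] * weights[i][j] + biases[j] (no activation)."""
    return [sum(a * row[j] for a, row in zip(inputs, weights)) + biases[j]
            for j in range(len(biases))]

def split_dense(inputs, weights, biases):
    """Each hypothetical device owns half of the output columns (neurons)."""
    half = len(biases) // 2
    w_a = [row[:half] for row in weights]    # columns held by device A
    w_b = [row[half:] for row in weights]    # columns held by device B
    out_a = dense(inputs, w_a, biases[:half])
    out_b = dense(inputs, w_b, biases[half:])
    return out_a + out_b                     # gather: the only communication step

x = [1.0, 2.0, 3.0]
W = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 1.0, 1.1, 1.2]]
b = [0.0, 0.1, 0.2, 0.3]
print(split_dense(x, W, b) == dense(x, W, b))
```

Each device performs exactly the same multiply-adds it would in the unsplit layer. The split therefore changes where the work runs and what must be exchanged, not the result.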

Neural Network starting point

$$\mathrm{Act}_L[j] = \sigma\Big(\sum_i \mathrm{Act}_{L-1}[i] \times W_L[i][j] + \mathrm{Bias}_L[j]\Big)$$

Diagram: a feed-forward network with 3 neurons and 1 hidden layer. The inputs Act_{L-1}[1], Act_{L-1}[2] and Act_{L-1}[3] are multiplied by the weights W_L[1][1], W_L[2][1] and W_L[3][1]. The products are summed and passed through the activation function σ (e.g. tanh, ReLU).

Fundamental multiply-add structure
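The formula above can be written as two nested loops whose body is a single multiply-add. Below is a minimal pure-Python sketch. The function and variable names are illustrative and not from any library. It uses the slide's shape: three inputs feeding one neuron.

```python
import math

def relu(z):
    return max(0.0, z)

def layer_forward(act_prev, W, bias, sigma=math.tanh):
    """Act_L[j] = sigma(sum_i Act_{L-1}[i] * W_L[i][j] + Bias_L[j])."""
    out = []
    for j in range(len(bias)):            # one output neuron per j
        z = bias[j]
        for i, a in enumerate(act_prev):  # sum over incoming connections
            z += a * W[i][j]              # the fundamental multiply-add
        out.append(sigma(z))
    return out

# Three inputs feeding a single neuron, as in the diagram
act_prev = [1.0, 2.0, 3.0]
W = [[0.5], [0.25], [-0.5]]
bias = [1.0]
print(layer_forward(act_prev, W, bias, sigma=relu))
print(layer_forward(act_prev, W, bias))
```

Choosing σ is a separate step from the multiply-add loop. This separation is why optimised libraries can treat the weighted sum as a matrix product and apply the activation element-wise afterwards.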

Vectorisation in Linear Algebra

Core intensive code in the Linpack benchmark:

c     DGEFA: row elimination with column indexing
         do 30 j = kp1, n
            t = a(l,j)
            if (l .eq. k) go to 20
               a(l,j) = a(k,j)
               a(k,j) = t
   20       continue
            call daxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
   30    continue

c     DGESL: back substitution
         do 40 kb = 1, n
            k = n + 1 - kb
            b(k) = b(k)/a(k,k)
            t = -b(k)
            call daxpy(k-1,t,a(1,k),1,b(1),1)
   40    continue

c     DAXPY: dy = dy + da*dx, code for unequal or non-unit increments
      do 10 i = 1,n
        dy(iy) = dy(iy) + da*dx(ix)
        ix = ix + incx
        iy = iy + incy
   10 continue

Fujitsu K computer (Source: https://www.top500.org/lists/2017/06/)
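For readers less at home in Fortran, the DAXPY kernel at the heart of these loops computes dy ← da·dx + dy element by element over strided vectors. Below is a Python rendering of that loop, written as a sketch. It uses 0-based indexing and handles only non-negative increments. Reference BLAS also supports negative increments, which this version does not.

```python
def daxpy(n, da, dx, incx, dy, incy):
    """Sketch of BLAS DAXPY: dy <- da*dx + dy over n strided elements (in place)."""
    if n <= 0 or da == 0.0:      # the reference routine returns early in both cases
        return dy
    ix = iy = 0
    for _ in range(n):
        dy[iy] += da * dx[ix]    # one multiply-add per element
        ix += incx
        iy += incy
    return dy

print(daxpy(4, 2.0, [1.0, 2.0, 3.0, 4.0], 1, [1.0, 1.0, 1.0, 1.0], 1))
print(daxpy(2, 1.0, [10.0, 0.0, 20.0], 2, [0.0, 0.0], 1))
```

With unit strides the iterations are independent and touch contiguous memory. That independence is what lets a compiler map the loop onto SIMD instructions.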

Network Illustration (Source: Nervana)

Fully connected network (convolution not present for now):
• Input layer: N = 28 × 28 pixels = 784 input units
• Hidden layer: N = 100 hidden units (user-defined parameter); weights W_{i→j}: 784 × 100, biases b_j: 100
• Output layer: N = 10 output units (one for each digit); each unit i encodes the probability of the input image being the digit i; weights W_{i→j}: 100 × 10, biases b_j: 10
• Cost function: c(output, truth)

Total parameters: the sum of all weight and bias counts above
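The layer sizes above determine the total parameter count. Each fully connected layer contributes n_in × n_out weights plus n_out biases. Below is a short calculation; the function name is illustrative.

```python
def count_parameters(sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each fully connected layer."""
    total = 0
    for n_in, n_out in zip(sizes, sizes[1:]):
        total += n_in * n_out + n_out
    return total

# Layer sizes from the slide: 784 inputs, 100 hidden units, 10 outputs
print(count_parameters([784, 100, 10]))  # 79510
```

For the 784-100-10 network this gives 78,400 + 100 + 1,000 + 10 = 79,510 parameters. The first weight matrix dominates, which is typical of fully connected input layers.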

CNN Computing Operations: dense matrix multiplies, recurrent layers, convolutions, all-reduce

Deep Learning ingredients:
1. Randomly seed weights
2. Forward pass
3. Cost
4. Backward pass
5. Update weights
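The five ingredients above map directly onto a training loop. Below is a minimal sketch under a few assumptions: the simplest possible "network" (a single linear neuron y = w·x + b), a half-squared-error cost and plain stochastic gradient descent. The data set, learning rate and epoch count are illustrative choices, not taken from the talk.

```python
import random

def train(data, epochs=200, lr=0.05, seed=0):
    """Fit y = w*x + b with plain SGD, following the five steps in order."""
    rng = random.Random(seed)
    # 1. Randomly seed weights
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    epoch_cost = 0.0
    for _ in range(epochs):
        epoch_cost = 0.0
        for x, target in data:
            # 2. Forward pass
            y = w * x + b
            # 3. Cost: half squared error for this sample
            epoch_cost += 0.5 * (y - target) ** 2
            # 4. Backward pass: d(cost)/dw and d(cost)/db
            err = y - target
            dw, db = err * x, err
            # 5. Update weights
            w -= lr * dw
            b -= lr * db
    return w, b, epoch_cost

data = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b, last_cost = train(data)
print(round(w, 3), round(b, 3), last_cost)
```

In a real network, step 2 is a chain of layer evaluations like the multiply-add loop on the earlier slide, and step 4 is backpropagation through that chain. The loop structure stays the same.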

Parallelisation Hierarchy

• Vectorisation: is SIMD parallelism used well?
• Scalar tuning: what happens in the pipeline?
• Memory: is cache usage maximised or RAM access streamlined?
• Threading: do cores cooperate efficiently?
• Communication: can coordination in a distributed or heterogeneous system be improved?
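The memory level of this hierarchy is often addressed by loop tiling (cache blocking). Loops are reordered so that a small tile of data is reused many times while it is still in cache. Below is a minimal sketch with illustrative function names. Pure Python will not show the speed-up. The sketch only illustrates the loop structure that compiled code or an optimised BLAS exploits, and checks that tiling leaves the result unchanged.

```python
def matmul_naive(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a = A[i][k]
            for j in range(p):
                C[i][j] += a * B[k][j]
    return C

def matmul_blocked(A, B, bs=2):
    """Same multiply-adds, grouped into bs x bs tiles of i, k and j."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, m, bs):
            for jj in range(0, p, bs):
                # Work on one tile; min() handles sizes that are not multiples of bs.
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, m)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, p)):
                            C[i][j] += a * B[k][j]
    return C

A = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
B = [[1, 0, 2, 1], [0, 1, 1, 0], [2, 1, 0, 1], [1, 1, 1, 1], [0, 2, 1, 0]]
print(matmul_blocked(A, B) == matmul_naive(A, B))
```

The tile size is a tuning parameter. In compiled code it is chosen so that the working tiles fit in a particular cache level.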

Naïve Nested Loops in CNN Algorithms

Forward propagation, backward propagation, convolution
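The slide's loop listings did not survive extraction. To illustrate what a naive forward convolution looks like, here is a sketch with six nested loops: output channels, output rows, output columns, input channels, kernel rows and kernel columns. It assumes stride 1, no padding, and cross-correlation (no kernel flip), which is what most deep-learning frameworks compute under the name convolution. The function name is illustrative.

```python
def conv2d_forward(image, kernels, bias):
    """Naive direct convolution.

    image:   [C_in][H][W]
    kernels: [C_out][C_in][K][K]
    bias:    [C_out]
    Returns  [C_out][H-K+1][W-K+1]
    """
    c_in, h, w = len(image), len(image[0]), len(image[0][0])
    k = len(kernels[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    out = []
    for co in range(len(kernels)):                 # output channels
        plane = []
        for y in range(out_h):                     # output rows
            row = []
            for x in range(out_w):                 # output columns
                acc = bias[co]
                for ci in range(c_in):             # input channels
                    for ky in range(k):            # kernel rows
                        for kx in range(k):        # kernel columns
                            acc += image[ci][y + ky][x + kx] * kernels[co][ci][ky][kx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

img = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # 1 channel, 3x3
ker = [[[[1, 0], [0, -1]]]]                   # 1 output channel, 1 input channel, 2x2
print(conv2d_forward(img, ker, [0]))          # [[[-4, -4], [-4, -4]]]
```

Every level of the parallelisation hierarchy applies to this nest. The innermost loops can be vectorised, the output loops can be threaded, and the loop order determines cache behaviour.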

A short word on Tensors

Tensors are systems of components organized by one or more indices that transform according to specific rules under a set of transformations.

• The number of indices is called the rank of the tensor: a tensor of rank 0 is a scalar, and a tensor of rank 1 is a vector.
• In N-dimensional space, a tensor of rank n has N^n components.
• Tensors are important in many areas of physics (general relativity, electromagnetic theory).
• Tensor relations hold independently of the choice of reference frame, which makes tensors ideal for expressing universal physical laws.
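Two of the points above can be checked numerically. The first is the N^n component count. The second is the idea that a tensor transforms by a fixed rule while certain quantities stay frame-independent. The sketch uses a rank-2 Cartesian tensor under a rotation of the reference frame, with the rule T'_ij = Σ_kl R_ik R_jl T_kl. The trace is one such invariant. Rotations are only one example of the transformations the slide refers to, and the names are illustrative.

```python
import math

def components(N, rank):
    """Number of components of a tensor of the given rank in N dimensions."""
    return N ** rank

def transform_rank2(T, R):
    """Rank-2 transformation rule: T'_ij = sum_kl R_ik * R_jl * T_kl."""
    n = len(T)
    return [[sum(R[i][k] * R[j][l] * T[k][l] for k in range(n) for l in range(n))
             for j in range(n)] for i in range(n)]

theta = math.pi / 6
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]    # rotation of the 2-D reference frame
T = [[2.0, 1.0],
     [1.0, 3.0]]
T_rot = transform_rank2(T, R)

print(components(3, 2))                                    # 9
print(T_rot[0][0] + T_rot[1][1], T[0][0] + T[1][1])        # traces agree up to rounding
```

The individual components of T_rot differ from those of T. Their trace does not, because the trace is built from the tensor in a way that the transformation rule leaves unchanged.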

Optimised Functions

Software libraries: tensor functions hand-coded for CPUs or GPUs (e.g. Intel MKL-DNN)

Emergence of dedicated processing units and ISAs: tensor arithmetic in hardware
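One common way to map convolutions onto dense matrix multiplies is "im2col". Every input patch is unfolded into a row, so the convolution becomes a matrix product. This is one technique among several; direct and Winograd-style algorithms are also used. The slide does not say which approach MKL-DNN takes. Below is a minimal sketch with illustrative helper names. It assumes a single channel, stride 1 and no padding.

```python
def im2col(image, k):
    """Unfold every k x k patch of a single-channel image into one row."""
    h, w = len(image), len(image[0])
    return [[image[y + ky][x + kx] for ky in range(k) for kx in range(k)]
            for y in range(h - k + 1) for x in range(w - k + 1)]

def conv_as_gemm(image, kernel):
    """Cross-correlation computed as (patch matrix) x (flattened kernel)."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, k)
    out_w = len(image[0]) - k + 1
    flat = [sum(p * q for p, q in zip(patch, flat_kernel)) for patch in patches]
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(im2col(img, 2)[0])                       # [1, 2, 4, 5]
print(conv_as_gemm(img, [[1, 0], [0, -1]]))    # [[-4, -4], [-4, -4]]
```

With one kernel this is a matrix-vector product. With many output channels the flattened kernels form a matrix, and the whole layer becomes one GEMM that a tuned library or tensor hardware can execute efficiently. The cost is the extra memory needed for the duplicated patch data.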

Multi-threading CNN Training

Chart: training with 1, 4, 16 and 64 threads on CIFAR-10 with Intel-Caffe, 1000 iterations, full solver

The dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class: 50,000 training images and 10,000 test images.

MPI Parallelism in CFD

• The global model is decomposed into 8 balanced MPI domains, with domain surfaces adapted to cell weights.
• A halo sits at the interface between domains.
• Processes communicate with MPI primitives: MPI_Send, MPI_Recv, MPI_Wait, MPI_Alltoall, MPI_Allreduce, MPI_Barrier.
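The halo exchange described above can be illustrated without MPI. Below is a minimal serial sketch under a few assumptions: a 1-D array, a three-point averaging stencil, and domains simulated as slices within one process. The helper names are illustrative. In a real code each domain would live on its own MPI rank, and the halo cells would arrive through calls such as MPI_Send/MPI_Recv (or MPI_Sendrecv).

```python
def stencil(u):
    """One smoothing step: interior points become the mean of themselves and
    their two neighbours; the two end points are held fixed."""
    if len(u) < 3:
        return list(u)
    return [u[0]] + [(u[i - 1] + u[i] + u[i + 1]) / 3 for i in range(1, len(u) - 1)] + [u[-1]]

def decomposed_step(u, n_domains):
    """Apply `stencil` domain by domain, with one halo cell on each internal side."""
    n = len(u)
    result = []
    for d in range(n_domains):
        lo, hi = d * n // n_domains, (d + 1) * n // n_domains   # balanced split
        has_left, has_right = lo > 0, hi < n
        # "Halo exchange": copy the neighbouring domains' edge cells.
        local = ([u[lo - 1]] if has_left else []) + u[lo:hi] + ([u[hi]] if has_right else [])
        updated = stencil(local)
        # Keep only the cells this domain owns (drop the halo cells again).
        start = 1 if has_left else 0
        end = len(updated) - 1 if has_right else len(updated)
        result.extend(updated[start:end])
    return result

u = [float(i * i % 7) for i in range(16)]
print(decomposed_step(u, 8) == stencil(u))  # True: same answer as the undecomposed update
```

With one-cell halos, each domain sees everything its three-point stencil needs. That is why the decomposed update matches the global one exactly. Wider stencils need correspondingly wider halos.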


MPI in Deep Learning
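In data-parallel deep learning, the MPI collective that typically matters most is an all-reduce of gradients. Below is a minimal serial simulation of that pattern under a few assumptions: a one-parameter model y = w·x, equally sized data shards, and a plain-Python stand-in for MPI_Allreduce. The helper names are hypothetical, and no MPI library is used.

```python
def gradient(w, batch):
    """Gradient of the mean of 0.5 * (w*x - t)**2 over the batch."""
    return sum((w * x - t) * x for x, t in batch) / len(batch)

def allreduce_mean(values):
    """Mimics MPI_Allreduce with MPI_SUM followed by division by the rank count:
    every rank receives the same averaged value."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

def data_parallel_step(w, shards, lr):
    local_grads = [gradient(w, shard) for shard in shards]  # local compute on each rank
    averaged = allreduce_mean(local_grads)                   # communication step
    return w - lr * averaged[0]                              # identical update everywhere

data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]   # two equally sized shards, one per simulated rank
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, shards, lr=0.05)
print(round(w, 6))
```

With equal shard sizes, the average of the per-shard mean gradients equals the gradient over the whole data set. Every rank therefore follows the same trajectory a single process would. The all-reduce is the only communication per step, so its cost largely governs how far this approach scales.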


MPI Parallel Performance

AI evolution driving CPU and GPU releases

Chart (performance by segment: datacenter, edge and cloud; inference and training) showing:
• Intel® Xeon Phi™ Processor (Knights Mill)
• Intel® Xeon® Processor (Skylake)
• Intel® Xeon® Processor + FPGA
• Intel® Lake Crest deep neural network processor (Intel® Nervana)
• NVIDIA Tesla P4/P40, NVIDIA Drive PX, NVIDIA Pascal P100
• Google TPU
• FPGA SoC (Intel/Xilinx)
• FUJITSU PRIMERGY CX600, K computer

Fujitsu Gateway: Intelligent Application Platform

• Cloud services: cloud bursting (Gateway), on-premise cloud
• UNCAI (Artificial Intelligence): smart city surveillance, manufacturing process optimisation
• HPC for Data Analytics: reference architecture based on PRIMERGY with a parallel file system

Products and solutions:
• Workstations: CELSIUS
• Interconnect and accelerators: Intel & Mellanox cluster interconnect, NVIDIA GPGPU
• Servers: PRIMERGY RX2540 M4 and RX2530 M4 (SKL-based); CX400 M4 (SKL-based) with CX2550 M4 (HPC), CX2570 M4 (GPU) and CX2580 M4 (FPGA) nodes; CX600 M1 (KNL/KNM-based)
• Storage: entry ETERNUS, high-end ETERNUS, NetApp storage, DDN storage
• PRIMEFLEX for HPC solutions
• Deployment scales: workgroup, departmental, data center, cloud
• Liquid cooling + immersion cooling (FY2018)
• Engineering Cloud, Industry 4.0, MONOZUKURI

New PRIMEFLEX Options

• Reference designs defined for AI/deep learning frameworks
• PRIMEFLEX configuration tool provided for fast definition of a complete solution
• PRIMEFLEX for HPC integrated solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack
• Reference architecture for off-premise use, with cloud-bursting capability

DL/HPC trends

• The DL opportunity represents 6–7% of the hyperscale market (speculative figure, likely 100% y/y growth).
• DL is not a vertical market. It is more akin to an algorithm or method of computation, like an FFT.
• AI/DL exists in proximity to HPC:
   • Driven by the same architectural objective: performance and scale
   • Converged math and programming methodologies
   • Technological cross-fertilization in software (compilers, libraries, tools) and hardware (processors, memory, interconnect)

Source: Intersect360 Research, 2016

Summary

• Combine algorithmic expertise in HPC and ML: Fujitsu has the rare capability to combine technologies and provide a fully optimized solution.
• The shape of a network is subject to skilled programming; optimise through algorithmic and modelling discoveries. The relationship between depth and result quality remains largely empirical.
• AI usage is primarily in the cloud today; customers look for simplified, integrated solutions.


Deep Learning Networks

Image Identity

Unsupervised Learning

Genome, market segmentation, fraud detection, astronomical data analysis, Google News

Page 2: Designing for intensity: parallelism from analytics to AI

1 copy Copyright 2017 FUJITSU

Designing for intensity parallelism from analytics to AI

Ian Godfrey

Director of the Solutions Business for Fujitsu Systems Europe

Manju Annie Oommen

Global Product Marketing Manager Fujitsu

2 copy Copyright 2017 FUJITSU

Agenda

HPC Diversifies

1

Co-creating solutions

4

Q amp A

5

Similarities between HPC and Deep

Learning optimization

3

What Changed over the years

2

3 copy Copyright 2017 FUJITSU

HPC Diversifies Hunger for compute power

Increasing connected devices worldwide

Size of digital universeincreasing

Driving more applications

64Bn Devices

10 Zettabytes

1000s of apps

2016

28Bn Devices

180 Zettabytes

20K New apps

gt2020

10 times more data to be generated by 2025 Emergence of High Performance Data Analytics

Fraud and anomaly detectionIdentifying harmful potentially harmful patterns and causes using graphical semantic analysis or other high performance analytics techniques real time

MarketingPromote products or services using complex algorithms to discern potential customers demographics buyingpreferences and habits

Business intelligenceUses HPDA to identify opportunities to advance the market position and competitiveness of businesses by better understanding themselves their competitors and the evolving dynamics of the markets they participate in

Other Commercial HPDAAn example of such a high-potential workload is the use of HPDA to manage large IT infrastructures ranging from on premise data centers to public clouds and Internet-of-Things (IoT) Infrastructures- involves solving complex problems

Existing HPC usersbull Intelligence

community FSIbull Data-driven

scienceengineering (eg biology)

bull Knowledge discovery

bull MLDL cognitive AI

New commercial users

bull Fraudanomaly detection

bull Business intelligence

bull Affinity marketingbull Personalized

medicine

Fastest processingtransformationof large volume data

Real-time analysisto extract invisible insight from the data

Accelerated deep-learning technologyby GPU computation

HPDA to grow robustly to be a $54Bn market

Cust

om

er

be

ne

fits

2

3

1

Source Information from analysts and various tele communication firms

4 copy Copyright 2017 FUJITSU

Neural Networks are Old ndash What changed

Scale drives deep learning progress

Availability of

More Data

Faster ComputeHardware

Better Algorithm

Best results are obtained by training a large neural network orand by feeding in more data

RepetitiveTraining

His

tory

1943 First electrical model of neural network

1958 Perceptron

1986 Backpropogation

1990s Convolutional Networks (LeCun)

2006 Deep Belief Network (Hinton)

201314 Google buys Deep Mind

HPC speeding up Deep learning Research

5 copy Copyright 2017 FUJITSU

What does deep learning deal with

Deep Learning

Dee

p L

earn

ing

is t

he

mac

hin

ersquos

per

cep

tio

n o

f Imagesbull Facesbull Self driving

Soundbull Voice searchbull Music Genbull Translation

Textbull CRMbull Search +bull Ads

Time Seriesbull Health databull Sensorsbull Finance

ARTIFICIAL INTELLIGENCEA program that can sense reasons act and adapt

MACHINE LEARNINGAlgorithms whose performance improve when

exposed to more data over time

DEEP LEARNINGMulti-layered neural networks learn from

vast amounts of data

Unsupervised LearningSupervised Learning

Cluster Analysis Time Series Unstructured

Convolutional Neural Network(CNN)

Recurrent Neural Network(RNN)

RNN+ Long-short term Memory(LSTM)

Reinforcement Learning

6 copy Copyright 2017 FUJITSU

Industry segmentation and use cases

Healthcare

bull Pharmaceuticalbull Genomicsbull Imagery and medical

diagnostic

Marketing Automation

bull CRMbull Market Classificationbull Demand Predictionbull Document Generation

bull Enterprise Resource Planning

bull Predictive MaintenanceAnalysis

bull Machine transcriptionbull Machine translation

Defense and Social Security

bull Surveillance and Security

bull Cyber securitybull Image recognitionbull Motion detection

Consumere-commerceRetail

TransportLogistics

bull Autonomous carsbull Motion detectionbull Networked carCo-

ordinated trafficbull Commercial Dronesbull Optimized route

bull Sentiment Analysisbull Classificationbull Recommendation enginebull Demand predictionbull Automated consulting

bull Search bull Emailsbull Personalizationbull Smart Assistantbull Chatbots

Others

bull Educationbull Fintechbull Gamingbull Telcobull Media

Manufacturing Industrial

7 copy Copyright 2017 FUJITSU

Industry wide presence of Deep Learning

Social Infra4 Financial

9

Public Sector18

Distribution26

Manufacturing43

Sector wise

Call center28

Knowledge Utilization

20

Manufacturing16

Demand Prediction

13

Maintenance 8

Fintech9

Healthcare6

Application wise

Source Based on projects amp PoCs in Fujitsu

Artificial Intelligence is the new ElectricityhellipAndrew Ng

DL is not a vertical market It is more akin to an algorithm or method of computation like an FFT

Intersect360 Research tracks AI (including deep learning machine learning cognitive computing etc) as part of the hyper scale market

Similar to but distinct from HPC

Low precision intensely parallel strong affinity to public cloud

Cloud providers and end users are in early stages of investment for their applications

AI may become a pervasive technology that is embedded in non-hyperscale manifestations

8 copy Copyright 2017 FUJITSU

Fujitsu shaping HPC Diversification

9 copy Copyright 2017 FUJITSU

HPC the foundation to accelerating AI technology

ampFX100

for simulation andpre-processing technology

Zinrai Deep Learning amp DLUfor a high-speed learning environment

Digital Annealerfor combinatorial optimal solutions

Quantum

Computing

Deep

Learning

HPC

10 copy Copyright 2017 FUJITSU

Proximity in AI and HPC

HPC AIDL

HyperscaleSupercomputing

Multi-node

11 copy Copyright 2017 FUJITSU

Characterising Performance Computing

Computational scope Customer usage

Primary focus is performance

Compute-intensive algorithms

Maths solvers

Applications arbitrarily scalable

Is still ldquoHPCrdquo on only a few nodes ndash there is entry-level HPC

Largest supercomputers are gt$100 million

Problem-solving

Data Analysis

Scientific Simulation

Technical Modelling

Virtual Prototyping

Top tier users push boundaries and influence technology throughout industry

12 copy Copyright 2017 FUJITSU

Convolutional Neural Network Breakthrough

Krizhevsky A Sutskever I Hinton G Imagenet classification with deep convolutional neural networks In NIPS (2012)

Deeper Network

in Network

Deep DNN first blood

One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts

at the bottom The GPUs communicate only at certain layers The networkrsquos input is 150528-

dimensional and the number of neurons in the networkrsquos remaining layers is given by 253440ndash

186624ndash64896ndash64896ndash43264ndash4096ndash4096ndash1000

2014 2013 2012

Use of 2 GPUs ndash data parallelism

13 copy Copyright 2017 FUJITSU

Neural Network starting point

119860119888119905 119871 119895 = 120590 119860119888119905 119871 minus 1 119894 119909 119882 119871 119894 119895 + 119861119894119886119904 119871 [119895]

119860119888119905 119871 minus 1 1

119860119888119905 119871 minus 1 2

119860119888119905 119871 minus 1 3

119882 119871 1][1

119882 119871 3][1

119882 119871 2][1 120590

Activation function

eg tanh ReLu

Weight

Feed-forward network

3 neurons 1 hidden layer

Fundamental multiply-add structure

14 copy Copyright 2017 FUJITSU

Vectorisation in Linear Algebra

Core intensive code in Linpack benchmark

do 30 j = kp1 n

t = a(lj)

if (l eq k) go to 20

a(lj) = a(kj)

a(kj) = t

20 continue

call daxpy(n-kta(k+1k)1a(k+1j)1)

30 continue

do 40 kb = 1 n

k = n + 1 - kb

b(k) = b(k)a(kk)

t = -b(k)

call daxpy(k-1ta(1k)1b(1)1)

40 continue

do 10 i = 1n

dy(iy) = dy(iy) + dadx(ix)

ix = ix + incx

iy = iy + incy

10 continue

Fujitsu K computer

Source httpswwwtop500orglists201706

15 copy Copyright 2017 FUJITSU

Network Illustration

Source Nervana

119882119894rarr119895 784 times 100

119887119895 100

119882119894rarr119895 100 times 10

119887119895 10

Total

parameters119888(119900119906119905119901119906119905 119905119903119906119905ℎ)

Cost function

N = 10 output units

(one for each digit)

Each unit i encodes the

probability of the input image

of being of the digit iN = 100 hidden units

(user-defined parameter)

N = 28 x 28 pixels

= 784 input units

Fully connected network

convolution not present for now

16 copy Copyright 2017 FUJITSU

CNN Computing Operations

Dense Matrix Multiplies

Recurrent Layers

Convolutions All-Reduce

Deep Learning ingredients

1 Randomly seed weights

2 Forward-pass

3 Cost

4 Backward-pass

5 Update weights

17 copy Copyright 2017 FUJITSU

Parallelisation Hierarchy

Vectorisation ndash Is SIMD parallelism used well

Scalar tuning ndash What happens in the pipeline

Memory ndash Is cache usage maximised or RAM access streamlined

Threading ndash do cores cooperation efficiently

Communication ndash can coordination in a distributed or

heterogeneous system be improved

18 copy Copyright 2017 FUJITSU

Naiumlve Nested Loops in CNN Algorithms

Forward Propagation

Backward Propagation Convolution

19 copy Copyright 2017 FUJITSU

A short word on Tensors

Tensors are systems of components organized by one or more indices that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

Tensor rank 0 is a scalar

Tensor rank 1 is a vector

Tensors are important in many areas of physics (general relativity electromagnetic theory)

In N-dimensional space a tensor of rank n has Nn components

Transformation rules are independent of choice of reference frame ndash ideal for expressing universal physical laws

20 copy Copyright 2017 FUJITSU

Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware

21 copy Copyright 2017 FUJITSU

Multi-threading CNN Training

1 thread

4 threads

16 threads

64 threads

Training on CIFAR-10 with Intel-Caffe 1000 iterations Full Solver

Dataset consists of 60000 32x32 colour images

in 10 classes with 6000 images per class ndash

50000 training images and 10000 test images

22 copy Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into

8 balanced MPI domainsHalo at interface

between domains

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution


Agenda

1. HPC Diversifies
2. What Changed over the years
3. Similarities between HPC and Deep Learning optimization
4. Co-creating solutions
5. Q&A

HPC Diversifies: Hunger for compute power

Increasing connected devices worldwide • Size of digital universe increasing • Driving more applications

2016: 6.4Bn devices • 10 zettabytes • 1000s of apps

Beyond 2020: 28Bn devices • 180 zettabytes • 20K new apps

10 times more data to be generated by 2025: emergence of High Performance Data Analytics (HPDA)

Fraud and anomaly detection: identifying harmful or potentially harmful patterns and causes in real time, using graph and semantic analysis or other high-performance analytics techniques

Marketing: promoting products or services using complex algorithms to discern potential customers' demographics, buying preferences and habits

Business intelligence: using HPDA to identify opportunities to advance the market position and competitiveness of businesses, by better understanding themselves, their competitors and the evolving dynamics of the markets they participate in

Other commercial HPDA: an example of such a high-potential workload is the use of HPDA to manage large IT infrastructures, ranging from on-premise data centers to public clouds and Internet-of-Things (IoT) infrastructures, which involves solving complex problems

Existing HPC users
• Intelligence community, FSI
• Data-driven science/engineering (e.g. biology)
• Knowledge discovery
• ML/DL, cognitive AI

New commercial users
• Fraud/anomaly detection
• Business intelligence
• Affinity marketing
• Personalized medicine

Customer benefits
• Fastest processing/transformation of large-volume data
• Real-time analysis to extract invisible insight from the data
• Accelerated deep-learning technology by GPU computation

HPDA to grow robustly to be a $54Bn market

Source: Information from analysts and various telecommunication firms

Neural Networks are Old – What Changed?

Scale drives deep learning progress, through the availability of:
• More data
• Faster compute/hardware
• Better algorithms

Best results are obtained by training a large neural network and/or by feeding in more data

Repetitive training

History
1943: First electrical model of a neural network
1958: Perceptron
1986: Backpropagation
1990s: Convolutional networks (LeCun)
2006: Deep belief network (Hinton)
2013/14: Google buys DeepMind

HPC speeding up deep learning research

What Does Deep Learning Deal With?

Deep learning is the machine's perception of:
• Images: faces, self-driving
• Sound: voice search, music generation, translation
• Text: CRM, search, ads
• Time series: health data, sensors, finance

ARTIFICIAL INTELLIGENCE: a program that can sense, reason, act and adapt
MACHINE LEARNING: algorithms whose performance improves when exposed to more data over time
DEEP LEARNING: multi-layered neural networks learn from vast amounts of data

Unsupervised learning • Supervised learning • Reinforcement learning
Cluster analysis • Time series • Unstructured data
Convolutional neural network (CNN) • Recurrent neural network (RNN) • RNN + long short-term memory (LSTM)

Industry Segmentation and Use Cases

Healthcare
• Pharmaceutical • Genomics • Imagery and medical diagnostics

Marketing Automation
• CRM • Market classification • Demand prediction • Document generation

Manufacturing / Industrial
• Enterprise resource planning
• Predictive maintenance/analysis
• Machine transcription • Machine translation

Defense and Social Security
• Surveillance and security
• Cyber security • Image recognition • Motion detection

Consumer / e-commerce / Retail
• Sentiment analysis • Classification • Recommendation engine • Demand prediction • Automated consulting
• Search • Emails • Personalization • Smart assistant • Chatbots

Transport / Logistics
• Autonomous cars • Motion detection • Networked car • Co-ordinated traffic • Commercial drones • Optimized routes

Others
• Education • Fintech • Gaming • Telco • Media

Industry-wide Presence of Deep Learning

Sector-wise share:
Social Infra 4% • Financial 9% • Public Sector 18% • Distribution 26% • Manufacturing 43%

Application-wise share:
Call center 28% • Knowledge utilization 20% • Manufacturing 16% • Demand prediction 13% • Maintenance 8% • Fintech 9% • Healthcare 6%

Source: Based on projects & PoCs in Fujitsu

"Artificial Intelligence is the new Electricity…" – Andrew Ng

DL is not a vertical market. It is more akin to an algorithm or method of computation, like an FFT.

Intersect360 Research tracks AI (including deep learning, machine learning, cognitive computing, etc.) as part of the hyperscale market:
• Similar to, but distinct from, HPC
• Low precision, intensely parallel, strong affinity to public cloud
• Cloud providers and end users are in early stages of investment for their applications
• AI may become a pervasive technology that is embedded in non-hyperscale manifestations


Fujitsu shaping HPC Diversification

HPC: the Foundation for Accelerating AI Technology

& FX100: for simulation and pre-processing technology
Zinrai Deep Learning & DLU: for a high-speed learning environment
Digital Annealer: for finding optimal solutions to combinatorial problems

Quantum Computing • Deep Learning • HPC

Proximity in AI and HPC

Diagram labels: HPC • AI/DL • Hyperscale • Supercomputing • Multi-node

Characterising Performance Computing

Computational scope and customer usage

Primary focus is performance

Compute-intensive algorithms

Maths solvers

Applications arbitrarily scalable

It is still “HPC” on only a few nodes: there is entry-level HPC

The largest supercomputers cost over $100 million

Problem-solving

Data Analysis

Scientific Simulation

Technical Modelling

Virtual Prototyping

Top-tier users push boundaries and influence technology throughout industry

Convolutional Neural Network Breakthrough

Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)

2012: Deep DNN "first blood" • 2013: Network in Network • 2014: Deeper networks

"One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000."

Use of 2 GPUs: model parallelism, with each GPU holding part of every layer
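The caption describes each GPU computing its own share of a layer's neurons, with the GPUs exchanging activations only at certain layers; splitting a layer's neurons across devices like this is usually called model parallelism. The single-process sketch below shows why the split is exact for one fully connected layer, assuming both devices hold the full input activations (as they do after one of the communicating layers). The helper names, the two-way split and the numbers are illustrative assumptions, not code from the paper.

```python
from typing import Iterator, List, Tuple

Matrix = List[List[float]]  # W[i][j]: weight from input i to output neuron j


def layer(x: List[float], W: Matrix, b: List[float]) -> List[float]:
    """One fully connected layer with ReLU: max(0, sum_i x[i]*W[i][j] + b[j])."""
    return [max(0.0, sum(x[i] * W[i][j] for i in range(len(x))) + b[j])
            for j in range(len(b))]


def split_columns(W: Matrix, b: List[float],
                  parts: int) -> Iterator[Tuple[Matrix, List[float]]]:
    """Hypothetical helper: give each device a contiguous slice of output neurons."""
    n_out = len(b)
    step = -(-n_out // parts)  # ceiling division
    for start in range(0, n_out, step):
        cols = range(start, min(start + step, n_out))
        yield [[row[j] for j in cols] for row in W], [b[j] for j in cols]


x = [0.5, -1.0, 2.0]  # full input, available on both devices
W = [[0.1, 0.2, -0.3, 0.4],
     [0.5, -0.6, 0.7, 0.8],
     [-0.9, 1.0, 1.1, -1.2]]
b = [0.0, 0.1, -0.1, 0.2]

full = layer(x, W, b)
# Each "GPU" computes half of the output neurons; concatenating is the exchange
halves = [layer(x, W_part, b_part) for W_part, b_part in split_columns(W, b, 2)]
print(full == halves[0] + halves[1])  # True
```

Because each output neuron depends only on its own column of weights, the halves can be computed independently; the cost of the approach is the activation exchange whenever the next layer needs outputs from both devices.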

Neural Network starting point

$$\mathrm{Act}_L[j] = \sigma\Big(\sum_i \mathrm{Act}_{L-1}[i] \times W_L[i][j] + \mathrm{Bias}_L[j]\Big)$$

Diagram: inputs Act_{L-1}[1], Act_{L-1}[2] and Act_{L-1}[3] are multiplied by the weights W_L[1][1], W_L[2][1] and W_L[3][1], summed with the bias and passed through the activation function σ (e.g. tanh, ReLU)

Feed-forward network: 3 neurons, 1 hidden layer

Fundamental multiply-add structure
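The equation above is a multiply-add over the previous layer's activations followed by an activation function. A minimal pure-Python sketch of it for one fully connected layer is shown here; the function name dense_forward, the default choice of ReLU and the example values are illustrative assumptions.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]  # W[i][j]: weight from neuron i in layer L-1 to neuron j in layer L


def relu(z: float) -> float:
    return max(0.0, z)


def dense_forward(act_prev: Vector, W: Matrix, bias: Vector,
                  sigma: Callable[[float], float] = relu) -> Vector:
    """Hypothetical helper: Act_L[j] = sigma(sum_i Act_{L-1}[i] * W_L[i][j] + Bias_L[j])."""
    n_in, n_out = len(act_prev), len(bias)
    if len(W) != n_in or any(len(row) != n_out for row in W):
        raise ValueError("W must have shape len(act_prev) x len(bias)")
    act = []
    for j in range(n_out):
        z = bias[j]
        for i in range(n_in):
            z += act_prev[i] * W[i][j]  # the fundamental multiply-add
        act.append(sigma(z))
    return act


# Three inputs feeding one neuron, as in the slide's diagram (values are arbitrary)
act_prev = [1.0, 2.0, 3.0]
W = [[0.5], [-1.0], [0.25]]
bias = [0.1]
print(dense_forward(act_prev, W, bias))             # [0.0]: ReLU clips the negative sum
print(dense_forward(act_prev, W, bias, math.tanh))  # tanh keeps the sign
```

The inner loop is the same multiply-add pattern as the Linpack daxpy kernel; stacking it over all j and over a batch of inputs turns a layer into a dense matrix multiply.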

Vectorisation in Linear Algebra

Core intensive code in the Linpack benchmark:

c     dgefa (LU factorisation): row interchange, then update column j with daxpy
      do 30 j = kp1, n
         t = a(l,j)
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
         call daxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
   30 continue

c     dgesl (solve): back substitution, one daxpy per column
      do 40 kb = 1, n
         k = n + 1 - kb
         b(k) = b(k)/a(k,k)
         t = -b(k)
         call daxpy(k-1,t,a(1,k),1,b(1),1)
   40 continue

c     daxpy loop for non-unit increments: dy = dy + da*dx
      do 10 i = 1,n
         dy(iy) = dy(iy) + da*dx(ix)
         ix = ix + incx
         iy = iy + incy
   10 continue

Fujitsu K computer

Source: https://www.top500.org/lists/2017/06/
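The daxpy loop is the kernel that dominates the Linpack code above: y ← a·x + y over strided vectors. Below is a hedged Python transliteration of that strided loop (0-based indices), plus a back-substitution driver in the style of the dgesl snippet. It mirrors the reference BLAS loop structure but is only a sketch, not a replacement for a tuned BLAS; solve_upper is a hypothetical name.

```python
from typing import List


def daxpy(n: int, da: float, dx: List[float], incx: int,
          dy: List[float], incy: int) -> None:
    """In place: dy = dy + da*dx over n strided elements (0-based indexing).

    Follows the reference BLAS loop, including the convention that a
    negative increment walks the vector from the far end.
    """
    if n <= 0 or da == 0.0:
        return
    ix = 0 if incx >= 0 else (-n + 1) * incx
    iy = 0 if incy >= 0 else (-n + 1) * incy
    for _ in range(n):
        dy[iy] = dy[iy] + da * dx[ix]  # one multiply-add per element
        ix += incx
        iy += incy


def solve_upper(cols: List[List[float]], b: List[float]) -> List[float]:
    """Hypothetical dgesl-style back substitution; cols[k] is column k of U."""
    n = len(b)
    for k in range(n - 1, -1, -1):
        b[k] = b[k] / cols[k][k]
        t = -b[k]
        daxpy(k, t, cols[k], 1, b, 1)  # b[0:k] += t * U[0:k, k]
    return b


# U = [[2, 1], [0, 4]] stored by columns; solve U x = [5, 8]
print(solve_upper([[2.0, 0.0], [1.0, 4.0]], [5.0, 8.0]))  # [1.5, 2.0]
```

With 0-based k, the Fortran call daxpy(k-1, ...) becomes daxpy(k, ...): the same elements above the diagonal are updated. Each call streams through memory with unit stride, which is what makes the loop easy to vectorise.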

Network Illustration

Source: Nervana

Input layer: N = 28 x 28 pixels = 784 input units
Hidden layer: N = 100 hidden units (user-defined parameter)
Output layer: N = 10 output units (one for each digit); each unit i encodes the probability of the input image being the digit i

Input to hidden: weights W_{i→j} (784 × 100), biases b_j (100)
Hidden to output: weights W_{i→j} (100 × 10), biases b_j (10)
Total parameters: all of the weights and biases above

Cost function: c(output, truth)

Fully connected network: convolution not present for now
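The parameter count for this 784-100-10 network follows directly from the shapes on the slide: each fully connected layer contributes n_in × n_out weights plus n_out biases. A small sketch of that arithmetic, with a hypothetical helper name:

```python
from typing import List


def count_parameters(layer_sizes: List[int]) -> int:
    """Hypothetical helper: weights plus biases for a chain of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # W is n_in x n_out, b has n_out entries
    return total


print(count_parameters([28 * 28, 100, 10]))  # 79510 = 784*100 + 100 + 100*10 + 10
```

The first weight matrix alone holds 78,400 of the 79,510 parameters (about 98.6%), which is why the input-to-hidden matrix multiply dominates both the memory footprint and the arithmetic of this network.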

CNN Computing Operations
• Dense matrix multiplies
• Recurrent layers
• Convolutions
• All-reduce

Deep Learning ingredients
1. Randomly seed weights
2. Forward pass
3. Cost
4. Backward pass
5. Update weights
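The five ingredients map one-to-one onto a training loop. The sketch below applies them to a single sigmoid neuron (logistic regression) with a cross-entropy cost and full-batch gradient descent, in plain Python. The model, learning rate, epoch count and toy data are assumptions chosen for illustration; deep learning frameworks run the same five steps with optimised tensor kernels and many layers.

```python
import math
import random
from typing import List, Tuple

Sample = Tuple[List[float], float]  # (features, label in {0, 1})


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(data: List[Sample], epochs: int = 200, lr: float = 0.5,
          seed: int = 0) -> Tuple[List[float], float, List[float]]:
    """Hypothetical example: full-batch gradient descent for one sigmoid neuron."""
    rng = random.Random(seed)
    n_features = len(data[0][0])
    m = len(data)
    # 1. Randomly seed weights
    w = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
    b = 0.0
    history = []
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        cost = 0.0
        for x, y in data:
            # 2. Forward pass
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            # 3. Cost: binary cross-entropy (p clipped so log() never sees 0)
            pc = min(max(p, 1e-12), 1.0 - 1e-12)
            cost -= y * math.log(pc) + (1.0 - y) * math.log(1.0 - pc)
            # 4. Backward pass: for sigmoid + cross-entropy, dCost/dz = p - y
            err = p - y
            for i in range(n_features):
                grad_w[i] += err * x[i]
            grad_b += err
        history.append(cost / m)
        # 5. Update weights along the batch-averaged gradient
        w = [wi - lr * gw / m for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b, history


# Toy data (an assumption): label is 1 when x0 + x1 > 1
data = [([0.0, 0.0], 0.0), ([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0),
        ([1.0, 1.0], 1.0), ([2.0, 1.0], 1.0), ([1.0, 2.0], 1.0)]
w, b, history = train(data)
print(f"mean cost: {history[0]:.3f} at start, {history[-1]:.3f} after training")
```

Steps 2 to 4 run once per sample and their gradients are summed; that per-sample work is what the threading and MPI slides later distribute across cores and nodes.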

Parallelisation Hierarchy

• Vectorisation – Is SIMD parallelism used well?
• Scalar tuning – What happens in the pipeline?
• Memory – Is cache usage maximised, or RAM access streamlined?
• Threading – Do cores cooperate efficiently?
• Communication – Can coordination in a distributed or heterogeneous system be improved?
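At the memory level of this hierarchy, a standard technique is loop tiling (cache blocking): the loops are reorganised so that a small block of each operand is reused while it is still in cache. The sketch below shows the transformation for a matrix multiply. Pure Python will not show the cache benefit, so treat it only as an illustration of the loop structure that compiled code or a tuned BLAS exploits; the block size of 32 and the function names are arbitrary assumptions.

```python
from typing import List

Matrix = List[List[float]]


def matmul_naive(A: Matrix, B: Matrix) -> Matrix:
    """Straightforward triple loop: C = A x B."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            a = A[i][p]
            for j in range(m):
                C[i][j] += a * B[p][j]
    return C


def matmul_tiled(A: Matrix, B: Matrix, bs: int = 32) -> Matrix:
    """Hypothetical tiled version: same result, all three loops blocked by bs."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, bs):
        for pp in range(0, k, bs):
            for jj in range(0, m, bs):
                # Work on one bs x bs block of each operand while it is cache-resident
                for i in range(ii, min(ii + bs, n)):
                    Ci, Ai = C[i], A[i]
                    for p in range(pp, min(pp + bs, k)):
                        a, Bp = Ai[p], B[p]
                        for j in range(jj, min(jj + bs, m)):
                            Ci[j] += a * Bp[j]
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_tiled(A, B, bs=1))  # [[19.0, 22.0], [43.0, 50.0]]
```

For every element of C the tiled version still adds the products in ascending p order, so it produces bit-identical results to the naive loop; only the memory access pattern changes.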

Naïve Nested Loops in CNN Algorithms

Figure panels: Forward Propagation • Backward Propagation Convolution
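Written naïvely, a convolution layer is a deep nest of loops around a single multiply-add. A hedged sketch of the forward pass for one image follows (stride 1, no padding, computed as cross-correlation, as most deep learning frameworks do). The [channel][row][col] layout and the function name are assumptions for illustration.

```python
from typing import List

Tensor3 = List[List[List[float]]]        # [channel][row][col]
Tensor4 = List[List[List[List[float]]]]  # [out_ch][in_ch][k_row][k_col]


def conv2d_forward(x: Tensor3, w: Tensor4, bias: List[float]) -> Tensor3:
    """Hypothetical direct convolution: stride 1, no padding."""
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    c_out, kh, kw = len(w), len(w[0][0]), len(w[0][0][0])
    h_out, w_out = h - kh + 1, wd - kw + 1
    y = [[[0.0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for co in range(c_out):                  # output feature maps
        for r in range(h_out):               # output rows
            for c in range(w_out):           # output columns
                acc = bias[co]
                for ci in range(c_in):       # input channels
                    for u in range(kh):      # kernel rows
                        for v in range(kw):  # kernel columns
                            acc += x[ci][r + u][c + v] * w[co][ci][u][v]
                y[co][r][c] = acc
    return y


x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]  # 1 channel, 3 x 3
w = [[[[1.0, 0.0], [0.0, -1.0]]],                         # diagonal difference
     [[[1.0, 1.0], [1.0, 1.0]]]]                          # 2 x 2 box sum
print(conv2d_forward(x, w, [0.0, 1.0]))
```

Adding a loop over the images in a batch gives seven nested loops. Optimised libraries reorder, block and vectorise these loops, or lower the convolution to a matrix multiply, rather than executing them in this order.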

A Short Word on Tensors

Tensors are systems of components, organized by one or more indices, that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

A rank-0 tensor is a scalar

A rank-1 tensor is a vector

Tensors are important in many areas of physics (general relativity, electromagnetic theory)

In N-dimensional space, a tensor of rank n has N^n components

Transformation rules are independent of the choice of reference frame, which makes tensors ideal for expressing universal physical laws
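Two statements on this slide can be checked numerically: a rank-n tensor in N dimensions has N^n components, and quantities built from tensors by the transformation rules do not depend on the reference frame. The sketch below rotates a rank-2 Cartesian tensor in 2-D using T'_ij = Σ_k Σ_l R_ik R_jl T_kl and checks that its trace, a frame-independent scalar, is unchanged. The rotation angle and tensor values are arbitrary assumptions.

```python
import itertools
import math
from typing import List

Matrix = List[List[float]]


def n_components(N: int, rank: int) -> int:
    # One component per choice of each of the `rank` indices from N values
    return sum(1 for _ in itertools.product(range(N), repeat=rank))


def rotation(theta: float) -> Matrix:
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def transform_rank2(T: Matrix, R: Matrix) -> Matrix:
    """T'_ij = sum_k sum_l R_ik R_jl T_kl (equivalently R T R^T)."""
    N = len(T)
    return [[sum(R[i][k] * R[j][l] * T[k][l]
                 for k in range(N) for l in range(N))
             for j in range(N)] for i in range(N)]


T = [[1.0, 2.0], [3.0, 4.0]]
T_rot = transform_rank2(T, rotation(0.7))
print(n_components(3, 2), n_components(4, 3))        # 9 64
print(T[0][0] + T[1][1], T_rot[0][0] + T_rot[1][1])  # trace unchanged (up to rounding)
```

The individual components of T_rot differ from those of T, but the trace is the same in both frames; this is the property that makes tensor equations frame-independent.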

Optimised Functions

Software libraries: tensor functions hand-coded for CPUs or GPUs (e.g. Intel MKL-DNN)

Emergence of dedicated processing units and ISAs: tensor arithmetic in hardware

Multi-threading CNN Training

Chart series: 1 thread • 4 threads • 16 threads • 64 threads

Training on CIFAR-10 with Intel-Caffe, 1000 iterations, full solver

The dataset consists of 60,000 32×32 colour images in 10 classes, with 6000 images per class: 50,000 training images and 10,000 test images

MPI Parallelism in CFD

Global model decomposed into 8 balanced MPI domains, with a halo at the interface between domains

Domain surfaces adapted to cell weights

Communicate between processes with MPI primitives: MPI_Send, MPI_Recv, MPI_Wait, MPI_Alltoall, MPI_Allreduce, MPI_Barrier
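The halo idea can be demonstrated without an MPI installation. The sketch below simulates, in a single process, a 1-D grid split into 8 balanced subdomains. On each step every subdomain copies one boundary cell from each neighbour into its halo (the job MPI_Send/MPI_Recv do between ranks on a real cluster) and then applies a 3-point averaging stencil; the result matches the undecomposed computation exactly. The stencil, grid size and fixed boundary values are assumptions for illustration, not the CFD code from the slide.

```python
from typing import List

Grid = List[float]


def stencil_step(u: Grid) -> Grid:
    """Reference: one 3-point averaging step on the whole grid, end values fixed."""
    return ([u[0]]
            + [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
            + [u[-1]])


def split(u: Grid, n_domains: int) -> List[Grid]:
    """Balanced decomposition into contiguous subdomains (owned cells only)."""
    size, rem = divmod(len(u), n_domains)
    parts, start = [], 0
    for d in range(n_domains):
        end = start + size + (1 if d < rem else 0)
        parts.append(u[start:end])
        start = end
    return parts


def decomposed_step(parts: List[Grid]) -> List[Grid]:
    """Hypothetical helper: fill a one-cell halo from each neighbour, then update."""
    last = len(parts) - 1
    new_parts = []
    for d, own in enumerate(parts):
        # Halo exchange: read the neighbours' values from the previous step
        left = [parts[d - 1][-1]] if d > 0 else []
        right = [parts[d + 1][0]] if d < last else []
        local = left + own + right
        new_own = []
        for i in range(len(left), len(left) + len(own)):
            if i == 0 or i == len(local) - 1:  # physical boundary: value fixed
                new_own.append(local[i])
            else:
                new_own.append((local[i - 1] + local[i] + local[i + 1]) / 3.0)
        new_parts.append(new_own)
    return new_parts


grid = [0.0] * 40
grid[0], grid[-1] = 1.0, 2.0  # fixed boundary values (arbitrary)
parts = split(grid, 8)
for _ in range(50):
    grid = stencil_step(grid)
    parts = decomposed_step(parts)
print(sum(parts, []) == grid)  # True: the decomposition does not change the answer
```

Only the halo cells cross subdomain boundaries, so communication grows with the interface size while computation grows with the subdomain volume; that ratio is why balanced domains with small interfaces scale well.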


MPI in Deep Learning
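In data-parallel training, the all-reduce primitive used in CFD appears again. Each worker computes a gradient on its shard of the mini-batch, the gradients are summed across workers (MPI_Allreduce on a real cluster), and every worker applies the same averaged update. The single-process sketch below simulates this for a linear least-squares model and checks that, with equal-sized shards, the averaged shard gradients equal the full-batch gradient. The model, the data and the simulated allreduce_sum helper are assumptions; a real implementation would call MPI (for example through mpi4py) or a framework's collective-communication library.

```python
from typing import List, Tuple

Sample = Tuple[List[float], float]  # (features, target)


def grad_mse(w: List[float], batch: List[Sample]) -> List[float]:
    """Gradient of mean((w . x - y)^2) over a batch."""
    g = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            g[i] += 2.0 * err * xi / len(batch)
    return g


def allreduce_sum(per_rank: List[List[float]]) -> List[float]:
    """Hypothetical stand-in for MPI_Allreduce with MPI_SUM: every rank gets the sum."""
    return [sum(vals) for vals in zip(*per_rank)]


def data_parallel_step(w: List[float], batch: List[Sample], n_workers: int,
                       lr: float) -> Tuple[List[float], List[float]]:
    shard = len(batch) // n_workers
    if shard * n_workers != len(batch):
        raise ValueError("this sketch assumes equal-sized shards")
    # Each worker computes a gradient on its own shard of the mini-batch
    local = [grad_mse(w, batch[r * shard:(r + 1) * shard]) for r in range(n_workers)]
    # All-reduce, then divide by the worker count to get the mean gradient
    avg = [g / n_workers for g in allreduce_sum(local)]
    # Every worker applies the same update, so the model replicas stay identical
    return [wi - lr * gi for wi, gi in zip(w, avg)], avg


# Toy data (an assumption): y = 2*x0 - x1, split across 4 simulated workers
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0),
        ([1.0, 2.0], 0.0), ([3.0, 1.0], 5.0), ([1.0, 3.0], -1.0), ([2.0, 2.0], 2.0)]
w = [0.0, 0.0]
for _ in range(200):
    w, _ = data_parallel_step(w, data, n_workers=4, lr=0.1)
print(w)  # approaches [2.0, -1.0]
```

The only communication per step is one all-reduce of a gradient-sized vector, which is why the interconnect bandwidth, and not just node count, governs how far data-parallel training scales.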


MPI Parallel Performance

AI Evolution Driving CPU and GPU Releases

Chart axes: Performance; Datacenter/Edge, Cloud/Datacenter; Inference/Training

Intel® Xeon Phi™ Processor (Knights Mill) • Intel® Xeon® Processor (Skylake) • Intel® Xeon® Processor + FPGA • Intel® Lake Crest deep neural network processor • Intel® Nervana • NVIDIA Tesla P4/P40 • NVIDIA Drive PX • Google TPU • NVIDIA Pascal 100 • FPGA SoC (Intel/Xilinx) • FUJITSU PRIMERGY CX600 • K Computer

Fujitsu Gateway – Intelligent Application Platform

Cloud services: cloud bursting – Gateway; on-premise cloud – UNCAI (Artificial Intelligence)

Smart city surveillance • Manufacturing process optimisation • HPC for data analytics

Reference architecture: based on PRIMERGY with a parallel file system

Products and solutions:
• CELSIUS
• Intel & Mellanox cluster interconnect • NVIDIA GPGPU
• PRIMERGY RX2540 M4 (SKL based) • PRIMERGY RX2530 M4 (SKL based)
• PRIMEFLEX for HPC solutions
• CX600 M1 (KNL, KNM based)
• CX400 M4 (SKL based) with CX2550 M4 (HPC), CX2570 M4 (GPU), CX2580 M4 (FPGA)
• Entry ETERNUS storage • High-end ETERNUS storage • NetApp storage • DDN storage
• Liquid cooling + immersion cooling (FY2018)

Workgroup • Departmental • Data Center • Cloud

Engineering Cloud • Industry 4.0 • MONOZUKURI

New PRIMEFLEX Options

• Reference designs defined for AI deep learning frameworks
• PRIMEFLEX configuration tool provided for fast definition of a complete solution
• PRIMEFLEX for HPC integrated solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack
• Reference architecture for off-premise deployment
• Cloud-bursting capability

DL/HPC Trends

DL opportunity represents 6–7% of the hyperscale market
   – Speculative figure; likely 100% y/y growth
DL is not a vertical market
   – It is more akin to an algorithm or method of computation, like an FFT
AI/DL exists in proximity to HPC
   – Driven by the same architectural objective: performance and scale
   – Converged math and programming methodologies
   – Technological cross-fertilization
      • Software: compilers, libraries, tools
      • Hardware: processors, memory, interconnect

Source: Intersect360 Research, 2016

Summary

Combine algorithmic expertise in HPC and ML: Fujitsu has the rare capability to combine technologies and provide a fully optimized solution

The shape of a network is subject to skilled programming: optimise through algorithmic and modelling discoveries. The relationship between depth and result quality remains largely empirical

AI usage is primarily on the cloud today: customers look for simplified, integrated solutions


Deep Learning Networks
Image • Identity

Unsupervised Learning
Genome • Market segmentation • Fraud detection • Astronomical data analysis • Google News

Page 4: Designing for intensity: parallelism from analytics to AI

3 copy Copyright 2017 FUJITSU

HPC Diversifies Hunger for compute power

Increasing connected devices worldwide

Size of digital universeincreasing

Driving more applications

64Bn Devices

10 Zettabytes

1000s of apps

2016

28Bn Devices

180 Zettabytes

20K New apps

gt2020

10 times more data to be generated by 2025 Emergence of High Performance Data Analytics

Fraud and anomaly detectionIdentifying harmful potentially harmful patterns and causes using graphical semantic analysis or other high performance analytics techniques real time

MarketingPromote products or services using complex algorithms to discern potential customers demographics buyingpreferences and habits

Business intelligenceUses HPDA to identify opportunities to advance the market position and competitiveness of businesses by better understanding themselves their competitors and the evolving dynamics of the markets they participate in

Other Commercial HPDAAn example of such a high-potential workload is the use of HPDA to manage large IT infrastructures ranging from on premise data centers to public clouds and Internet-of-Things (IoT) Infrastructures- involves solving complex problems

Existing HPC usersbull Intelligence

community FSIbull Data-driven

scienceengineering (eg biology)

bull Knowledge discovery

bull MLDL cognitive AI

New commercial users

bull Fraudanomaly detection

bull Business intelligence

bull Affinity marketingbull Personalized

medicine

Fastest processingtransformationof large volume data

Real-time analysisto extract invisible insight from the data

Accelerated deep-learning technologyby GPU computation

HPDA to grow robustly to be a $54Bn market

Cust

om

er

be

ne

fits

2

3

1

Source Information from analysts and various tele communication firms

4 copy Copyright 2017 FUJITSU

Neural Networks are Old ndash What changed

Scale drives deep learning progress

Availability of

More Data

Faster ComputeHardware

Better Algorithm

Best results are obtained by training a large neural network orand by feeding in more data

RepetitiveTraining

His

tory

1943 First electrical model of neural network

1958 Perceptron

1986 Backpropogation

1990s Convolutional Networks (LeCun)

2006 Deep Belief Network (Hinton)

201314 Google buys Deep Mind

HPC speeding up Deep learning Research

5 copy Copyright 2017 FUJITSU

What does deep learning deal with

Deep Learning

Dee

p L

earn

ing

is t

he

mac

hin

ersquos

per

cep

tio

n o

f Imagesbull Facesbull Self driving

Soundbull Voice searchbull Music Genbull Translation

Textbull CRMbull Search +bull Ads

Time Seriesbull Health databull Sensorsbull Finance

ARTIFICIAL INTELLIGENCEA program that can sense reasons act and adapt

MACHINE LEARNINGAlgorithms whose performance improve when

exposed to more data over time

DEEP LEARNINGMulti-layered neural networks learn from

vast amounts of data

Unsupervised LearningSupervised Learning

Cluster Analysis Time Series Unstructured

Convolutional Neural Network(CNN)

Recurrent Neural Network(RNN)

RNN+ Long-short term Memory(LSTM)

Reinforcement Learning

6 copy Copyright 2017 FUJITSU

Industry segmentation and use cases

Healthcare

bull Pharmaceuticalbull Genomicsbull Imagery and medical

diagnostic

Marketing Automation

bull CRMbull Market Classificationbull Demand Predictionbull Document Generation

bull Enterprise Resource Planning

bull Predictive MaintenanceAnalysis

bull Machine transcriptionbull Machine translation

Defense and Social Security

bull Surveillance and Security

bull Cyber securitybull Image recognitionbull Motion detection

Consumere-commerceRetail

TransportLogistics

bull Autonomous carsbull Motion detectionbull Networked carCo-

ordinated trafficbull Commercial Dronesbull Optimized route

bull Sentiment Analysisbull Classificationbull Recommendation enginebull Demand predictionbull Automated consulting

bull Search bull Emailsbull Personalizationbull Smart Assistantbull Chatbots

Others

bull Educationbull Fintechbull Gamingbull Telcobull Media

Manufacturing Industrial

7 copy Copyright 2017 FUJITSU

Industry wide presence of Deep Learning

Social Infra4 Financial

9

Public Sector18

Distribution26

Manufacturing43

Sector wise

Call center28

Knowledge Utilization

20

Manufacturing16

Demand Prediction

13

Maintenance 8

Fintech9

Healthcare6

Application wise

Source Based on projects amp PoCs in Fujitsu

Artificial Intelligence is the new ElectricityhellipAndrew Ng

DL is not a vertical market It is more akin to an algorithm or method of computation like an FFT

Intersect360 Research tracks AI (including deep learning machine learning cognitive computing etc) as part of the hyper scale market

Similar to but distinct from HPC

Low precision intensely parallel strong affinity to public cloud

Cloud providers and end users are in early stages of investment for their applications

AI may become a pervasive technology that is embedded in non-hyperscale manifestations

8 copy Copyright 2017 FUJITSU

Fujitsu shaping HPC Diversification

9 copy Copyright 2017 FUJITSU

HPC the foundation to accelerating AI technology

ampFX100

for simulation andpre-processing technology

Zinrai Deep Learning amp DLUfor a high-speed learning environment

Digital Annealerfor combinatorial optimal solutions

Quantum

Computing

Deep

Learning

HPC

10 copy Copyright 2017 FUJITSU

Proximity in AI and HPC

HPC AIDL

HyperscaleSupercomputing

Multi-node

11 copy Copyright 2017 FUJITSU

Characterising Performance Computing

Computational scope Customer usage

Primary focus is performance

Compute-intensive algorithms

Maths solvers

Applications arbitrarily scalable

Is still ldquoHPCrdquo on only a few nodes ndash there is entry-level HPC

Largest supercomputers are gt$100 million

Problem-solving

Data Analysis

Scientific Simulation

Technical Modelling

Virtual Prototyping

Top tier users push boundaries and influence technology throughout industry

12 copy Copyright 2017 FUJITSU

Convolutional Neural Network Breakthrough

Krizhevsky A Sutskever I Hinton G Imagenet classification with deep convolutional neural networks In NIPS (2012)

Deeper Network

in Network

Deep DNN first blood

One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts

at the bottom The GPUs communicate only at certain layers The networkrsquos input is 150528-

dimensional and the number of neurons in the networkrsquos remaining layers is given by 253440ndash

186624ndash64896ndash64896ndash43264ndash4096ndash4096ndash1000

2014 2013 2012

Use of 2 GPUs ndash data parallelism

13 copy Copyright 2017 FUJITSU

Neural Network starting point

119860119888119905 119871 119895 = 120590 119860119888119905 119871 minus 1 119894 119909 119882 119871 119894 119895 + 119861119894119886119904 119871 [119895]

119860119888119905 119871 minus 1 1

119860119888119905 119871 minus 1 2

119860119888119905 119871 minus 1 3

119882 119871 1][1

119882 119871 3][1

119882 119871 2][1 120590

Activation function

eg tanh ReLu

Weight

Feed-forward network

3 neurons 1 hidden layer

Fundamental multiply-add structure

14 copy Copyright 2017 FUJITSU

Vectorisation in Linear Algebra

Core intensive code in Linpack benchmark

do 30 j = kp1 n

t = a(lj)

if (l eq k) go to 20

a(lj) = a(kj)

a(kj) = t

20 continue

call daxpy(n-kta(k+1k)1a(k+1j)1)

30 continue

do 40 kb = 1 n

k = n + 1 - kb

b(k) = b(k)a(kk)

t = -b(k)

call daxpy(k-1ta(1k)1b(1)1)

40 continue

do 10 i = 1n

dy(iy) = dy(iy) + dadx(ix)

ix = ix + incx

iy = iy + incy

10 continue

Fujitsu K computer

Source httpswwwtop500orglists201706

15 copy Copyright 2017 FUJITSU

Network Illustration

Source Nervana

119882119894rarr119895 784 times 100

119887119895 100

119882119894rarr119895 100 times 10

119887119895 10

Total

parameters119888(119900119906119905119901119906119905 119905119903119906119905ℎ)

Cost function

N = 10 output units

(one for each digit)

Each unit i encodes the

probability of the input image

of being of the digit iN = 100 hidden units

(user-defined parameter)

N = 28 x 28 pixels

= 784 input units

Fully connected network

convolution not present for now

16 copy Copyright 2017 FUJITSU

CNN Computing Operations

Dense Matrix Multiplies

Recurrent Layers

Convolutions All-Reduce

Deep Learning ingredients

1 Randomly seed weights

2 Forward-pass

3 Cost

4 Backward-pass

5 Update weights

17 copy Copyright 2017 FUJITSU

Parallelisation Hierarchy

Vectorisation ndash Is SIMD parallelism used well

Scalar tuning ndash What happens in the pipeline

Memory ndash Is cache usage maximised or RAM access streamlined

Threading ndash do cores cooperation efficiently

Communication ndash can coordination in a distributed or

heterogeneous system be improved

18 copy Copyright 2017 FUJITSU

Naiumlve Nested Loops in CNN Algorithms

Forward Propagation

Backward Propagation Convolution

19 copy Copyright 2017 FUJITSU

A short word on Tensors

Tensors are systems of components organized by one or more indices that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

Tensor rank 0 is a scalar

Tensor rank 1 is a vector

Tensors are important in many areas of physics (general relativity electromagnetic theory)

In N-dimensional space a tensor of rank n has Nn components

Transformation rules are independent of choice of reference frame ndash ideal for expressing universal physical laws

20 copy Copyright 2017 FUJITSU

Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware

21 copy Copyright 2017 FUJITSU

Multi-threading CNN Training

1 thread

4 threads

16 threads

64 threads

Training on CIFAR-10 with Intel-Caffe 1000 iterations Full Solver

Dataset consists of 60000 32x32 colour images

in 10 classes with 6000 images per class ndash

50000 training images and 10000 test images

22 copy Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into

8 balanced MPI domainsHalo at interface

between domains

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ



Deep Learning Networks

Image Identity


Unsupervised Learning

Genome, market segmentation, fraud detection, astronomical data analysis, Google News


Neural Networks are Old: What Changed?

Scale drives deep learning progress, through the availability of:

• More data
• Faster compute hardware
• Better algorithms

The best results are obtained by training a large neural network and/or feeding it more data, with repeated training runs.

History

1943 First electrical model of a neural network
1958 Perceptron
1986 Backpropagation
1990s Convolutional networks (LeCun)
2006 Deep Belief Network (Hinton)
2013/14 Google buys DeepMind

HPC is speeding up deep learning research.


What does deep learning deal with?

Deep learning is the machine's perception of:

• Images: faces, self-driving
• Sound: voice search, music generation, translation
• Text: CRM, search, ads
• Time series: health data, sensors, finance

ARTIFICIAL INTELLIGENCE: a program that can sense, reason, act and adapt

MACHINE LEARNING: algorithms whose performance improves when exposed to more data over time

DEEP LEARNING: multi-layered neural networks that learn from vast amounts of data

Unsupervised learning; supervised learning

Cluster analysis; time series; unstructured data

Convolutional Neural Network (CNN)

Recurrent Neural Network (RNN)

RNN + Long Short-Term Memory (LSTM)

Reinforcement Learning


Industry segmentation and use cases

Healthcare
• Pharmaceutical • Genomics • Imagery and medical diagnostics

Marketing Automation
• CRM • Market classification • Demand prediction • Document generation

• Enterprise resource planning
• Predictive maintenance/analysis
• Machine transcription • Machine translation

Defense and Social Security
• Surveillance and security
• Cyber security • Image recognition • Motion detection

Consumer/e-commerce/Retail

Transport/Logistics
• Autonomous cars • Motion detection • Networked car • Co-ordinated traffic • Commercial drones • Optimized route

• Sentiment analysis • Classification • Recommendation engine • Demand prediction • Automated consulting
• Search • Emails • Personalization • Smart assistant • Chatbots

Others
• Education • Fintech • Gaming • Telco • Media

Manufacturing/Industrial


Industry-wide presence of Deep Learning

Sector-wise: Social infrastructure 4%, Financial 9%, Public sector 18%, Distribution 26%, Manufacturing 43%

Application-wise: Call center 28%, Knowledge utilization 20%, Manufacturing 16%, Demand prediction 13%, Maintenance 8%, Fintech 9%, Healthcare 6%

Source: based on projects & PoCs in Fujitsu

"Artificial Intelligence is the new Electricity…" (Andrew Ng)

DL is not a vertical market. It is more akin to an algorithm or method of computation, like an FFT.

Intersect360 Research tracks AI (including deep learning, machine learning, cognitive computing, etc.) as part of the hyperscale market:

• Similar to, but distinct from, HPC
• Low precision, intensely parallel, strong affinity to public cloud
• Cloud providers and end users are in early stages of investment for their applications
• AI may become a pervasive technology that is embedded in non-hyperscale manifestations
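To make the "low precision" point concrete: deep learning commonly uses IEEE 754 half precision (binary16), whereas HPC solvers such as Linpack work in 64-bit doubles. The minimal sketch below uses only Python's standard `struct` module, whose `'e'` format code packs binary16; the helper name `to_half` is illustrative and not taken from any library.

```python
import struct

def to_half(x: float) -> float:
    """Round a float to the nearest IEEE 754 binary16 value and widen it back."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# binary16 has only 10 explicit mantissa bits, so 0.1 is stored approximately ...
print(to_half(0.1))     # 0.0999755859375
# ... and above 2048 not every integer is representable (ties round to even).
print(to_half(2049.0))  # 2048.0
print(to_half(2050.0))  # 2050.0
```

Each value occupies 2 bytes instead of 8, so the same memory bandwidth moves four times as many operands, one reason low precision suits intensely parallel DL workloads.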


Fujitsu shaping HPC Diversification


HPC: the foundation for accelerating AI technology

• … & FX100: for simulation and pre-processing technology
• Zinrai Deep Learning & DLU: for a high-speed learning environment
• Digital Annealer: for finding optimal solutions to combinatorial problems

[Diagram: HPC, Deep Learning, Quantum Computing]


Proximity in AI and HPC

[Diagram: HPC, AI/DL, hyperscale, supercomputing, multi-node]


Characterising Performance Computing

Computational scope:
• Primary focus is performance
• Compute-intensive algorithms
• Maths solvers
• Applications arbitrarily scalable
• It is still "HPC" on only a few nodes: there is entry-level HPC
• The largest supercomputers cost more than $100 million

Customer usage:
• Problem-solving
• Data analysis
• Scientific simulation
• Technical modelling
• Virtual prototyping
• Top-tier users push boundaries and influence technology throughout industry


Convolutional Neural Network Breakthrough

Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)

[Figure: timeline of CNN milestones: 2012 deep DNN "first blood", 2013 Network in Network, 2014 deeper networks]

"One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440-186,624-64,896-64,896-43,264-4096-4096-1000."

Use of 2 GPUs: model parallelism (each GPU holds half of each layer's kernels)
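The quoted caption describes one GPU running the top layer-parts and the other the bottom ones, exchanging data only at certain layers; splitting a layer's neurons across devices like this is usually called model parallelism. A toy sketch under stated assumptions: Python lists stand in for GPU memory, a dense layer stands in for a convolution, and `layer_cols` and `model_parallel_layer` are hypothetical names.

```python
from typing import List

Matrix = List[List[float]]

def layer_cols(x: List[float], w: Matrix, cols: range) -> List[float]:
    """Compute output neurons j in `cols` of y[j] = sum_i x[i] * w[i][j]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in cols]

def model_parallel_layer(x: List[float], w: Matrix) -> List[float]:
    """Split one layer's output neurons between two simulated devices."""
    n_out = len(w[0])
    half = n_out // 2
    dev0 = layer_cols(x, w, range(0, half))       # "GPU 0" owns the first half of the neurons
    dev1 = layer_cols(x, w, range(half, n_out))   # "GPU 1" owns the rest
    # Communication point: both halves are gathered so the next layer sees every neuron.
    return dev0 + dev1

x = [1.0, 2.0, 3.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 1.0],
     [1.0, 1.0, 0.0, 1.0]]
print(model_parallel_layer(x, w))   # [4.0, 5.0, 4.0, 6.0]
```

Data parallelism, by contrast, would give each GPU a full copy of `w` and a different subset of the input images.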


Neural Network starting point

$$\mathrm{Act}_L[j] = \sigma\left(\sum_i \mathrm{Act}_{L-1}[i] \times W_L[i][j] + \mathrm{Bias}_L[j]\right)$$

[Diagram: inputs Act_{L-1}[1], Act_{L-1}[2], Act_{L-1}[3], weighted by W_L[1][1], W_L[2][1], W_L[3][1], feed the activation function σ]

σ: activation function, e.g. tanh, ReLU; W: weight

Feed-forward network: 3 neurons, 1 hidden layer

Fundamental multiply-add structure
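The layer formula translates directly into a multiply-add loop. A minimal pure-Python sketch, assuming the slide's indexing (W_L[i][j] connects input i to output j); `layer_forward` and `relu` are illustrative names, not from a framework.

```python
import math
from typing import Callable, List

def relu(z: float) -> float:
    return max(0.0, z)

def layer_forward(act_prev: List[float], W: List[List[float]], bias: List[float],
                  sigma: Callable[[float], float] = math.tanh) -> List[float]:
    """Act_L[j] = sigma(sum_i Act_{L-1}[i] * W_L[i][j] + Bias_L[j])."""
    act = []
    for j in range(len(bias)):
        z = bias[j]
        for i, a in enumerate(act_prev):   # the fundamental multiply-add
            z += a * W[i][j]
        act.append(sigma(z))
    return act

# Three inputs feeding one neuron, as in the slide's diagram
act_prev = [0.5, -1.0, 2.0]
W = [[0.25], [0.5], [0.125]]
bias = [0.25]
print(layer_forward(act_prev, W, bias, relu))   # [0.125]
```

Stacking calls to `layer_forward`, each feeding its output to the next, gives the feed-forward network.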


Vectorisation in Linear Algebra

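Core compute-intensive code in the Linpack benchmark:

c     dgefa: row elimination with column indexing
      do 30 j = kp1, n
         t = a(l,j)
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
         call daxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
   30 continue

c     dgesl: solve u*x = y by back substitution
      do 40 kb = 1, n
         k = n + 1 - kb
         b(k) = b(k)/a(k,k)
         t = -b(k)
         call daxpy(k-1,t,a(1,k),1,b(1),1)
   40 continue

c     daxpy (unequal increments): dy = dy + da*dx
      do 10 i = 1,n
         dy(iy) = dy(iy) + da*dx(ix)
         ix = ix + incx
         iy = iy + incy
   10 continue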

Fujitsu K computer

Source: https://www.top500.org/lists/2017/06
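The daxpy loop at the bottom of the Linpack listing (y := a·x + y over strided elements) is the kernel that vectorisation targets. A minimal Python rendering, assuming positive increments only (reference BLAS also handles negative ones) and 0-based start offsets `ix`, `iy` standing in for Fortran array-section arguments such as `a(k+1,k)`; this is an illustrative translation, not a library API.

```python
from typing import List

def daxpy(n: int, da: float, dx: List[float], incx: int,
          dy: List[float], incy: int, ix: int = 0, iy: int = 0) -> None:
    """In place dy := da*dx + dy over n strided elements (positive increments only)."""
    if n <= 0 or da == 0.0:      # same early exit as the reference routine
        return
    for _ in range(n):
        dy[iy] += da * dx[ix]    # one multiply-add per element: the vectorisable core
        ix += incx
        iy += incy

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 10.0, 10.0, 10.0]
daxpy(4, 2.0, x, 1, y, 1)
print(y)    # [12.0, 14.0, 16.0, 18.0]

# Stride 2 touches every other element, as with a non-unit incx/incy
y2 = [0.0] * 4
daxpy(2, 1.0, x, 2, y2, 2)
print(y2)   # [1.0, 0.0, 3.0, 0.0]
```

In the Linpack loops both increments are 1, and because Fortran arrays are column-major, `a(k+1,k)` with stride 1 walks contiguously down a column, which is what lets the compiler vectorise the loop.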


Network Illustration

Source: Nervana

• Input layer: N = 28 × 28 pixels = 784 input units
• Hidden layer: N = 100 hidden units (user-defined parameter); weights W_{i→j}: 784 × 100; biases b_j: 100
• Output layer: N = 10 output units (one for each digit); each unit i encodes the probability of the input image being the digit i; weights W_{i→j}: 100 × 10; biases b_j: 10
• Total parameters: all weights and biases of both layers
• Cost function: c(output, truth)

Fully connected network: convolution is not present for now.
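The slide lists the weight and bias shapes, so the total parameter count can be worked out directly. A short sketch; `dense_params` is an illustrative helper name.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights (n_in x n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

layers = [(28 * 28, 100), (100, 10)]   # 784 inputs -> 100 hidden -> 10 outputs
per_layer = [dense_params(n_in, n_out) for n_in, n_out in layers]
print(per_layer)        # [78500, 1010]
print(sum(per_layer))   # 79510
```

Almost all of the parameters sit in the first, input-facing layer, which is typical of fully connected networks on image inputs.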


CNN Computing Operations

• Dense matrix multiplies
• Recurrent layers
• Convolutions
• All-reduce

Deep learning ingredients:

1. Randomly seed weights
2. Forward pass
3. Cost
4. Backward pass
5. Update weights
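The five ingredients map one-to-one onto a training loop. A deliberately tiny sketch that fits a single linear neuron y ≈ w·x + b with stochastic gradient descent on a squared-error cost; there are no layers or convolutions, and `train` is an illustrative name.

```python
import random
from typing import List, Tuple

def train(data: List[Tuple[float, float]], epochs: int = 200, lr: float = 0.05,
          seed: int = 0) -> Tuple[float, float]:
    rng = random.Random(seed)
    # 1. Randomly seed weights
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(epochs):
        for x, y in data:
            # 2. Forward pass
            y_hat = w * x + b
            # 3. Cost: 0.5 * (y_hat - y)**2, whose derivative w.r.t. y_hat is err
            err = y_hat - y
            # 4. Backward pass: gradients of the cost w.r.t. w and b
            grad_w, grad_b = err * x, err
            # 5. Update weights (plain stochastic gradient descent)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noise-free samples of y = 2x + 1 on [-1, 1]
data = [(x / 10, 2 * (x / 10) + 1) for x in range(-10, 11)]
w, b = train(data)
print(round(w, 3), round(b, 3))   # close to 2.0 and 1.0
```

In a real network steps 2 and 4 become the dense multiplies and convolutions listed above, and with several workers an all-reduce combines the gradients before step 5.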


Parallelisation Hierarchy

• Vectorisation: is SIMD parallelism used well?
• Scalar tuning: what happens in the pipeline?
• Memory: is cache usage maximised, or RAM access streamlined?
• Threading: do cores cooperate efficiently?
• Communication: can coordination in a distributed or heterogeneous system be improved?


Naïve Nested Loops in CNN Algorithms

[Figure: loop listings for forward propagation, backward propagation and convolution]
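Only the panel titles of this slide survive in the transcript. As a stand-in, here is a naïve nested-loop convolution forward pass and the matching weight-gradient loop used in backward propagation, assuming one channel, stride 1, no padding, and the cross-correlation form most deep learning frameworks compute; the function names are hypothetical.

```python
from typing import List

Grid = List[List[float]]

def conv2d_forward(inp: Grid, k: Grid) -> Grid:
    """Naive 'valid' 2-D convolution: one channel, stride 1, no padding."""
    out_h = len(inp) - len(k) + 1
    out_w = len(inp[0]) - len(k[0]) + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for oy in range(out_h):                    # output rows
        for ox in range(out_w):                # output columns
            acc = 0.0
            for ky in range(len(k)):           # kernel rows
                for kx in range(len(k[0])):    # kernel columns
                    acc += inp[oy + ky][ox + kx] * k[ky][kx]
            out[oy][ox] = acc
    return out

def conv2d_backward_weights(inp: Grid, grad_out: Grid, kh: int, kw: int) -> Grid:
    """dL/dk[ky][kx] = sum over outputs of grad_out[oy][ox] * inp[oy+ky][ox+kx]."""
    grad_k = [[0.0] * kw for _ in range(kh)]
    for oy in range(len(grad_out)):
        for ox in range(len(grad_out[0])):
            for ky in range(kh):
                for kx in range(kw):
                    grad_k[ky][kx] += grad_out[oy][ox] * inp[oy + ky][ox + kx]
    return grad_k

inp = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 9.0]]
k = [[1.0, 0.0],
     [0.0, -1.0]]
print(conv2d_forward(inp, k))   # [[-4.0, -4.0], [-4.0, -4.0]]
print(conv2d_backward_weights(inp, [[1.0, 1.0], [1.0, 1.0]], 2, 2))   # [[12.0, 16.0], [24.0, 28.0]]
```

Adding batch and channel dimensions deepens the loop nest further, which is why optimised libraries restructure these loops rather than run them as written.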


A short word on Tensors

• Tensors are systems of components, organized by one or more indices, that transform according to specific rules under a set of transformations
• The number of indices is called the rank of the tensor
• A tensor of rank 0 is a scalar
• A tensor of rank 1 is a vector
• Tensors are important in many areas of physics (general relativity, electromagnetic theory)
• In N-dimensional space, a tensor of rank n has N^n components
• Transformation rules are independent of the choice of reference frame, which makes tensors ideal for expressing universal physical laws
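Two of these points can be checked numerically: the N^n component count, and the rank-2 transformation rule T'_ij = Σ_k Σ_m R_ik R_jm T_km under a change of frame. A minimal sketch for a 2-D rotation R; the function names are illustrative, and the trace is used as an example of a frame-independent quantity.

```python
import math
from itertools import product
from typing import List

def n_components(N: int, n: int) -> int:
    """Number of components of a rank-n tensor in N-dimensional space."""
    return N ** n

def rotate_rank2(T: List[List[float]], theta: float) -> List[List[float]]:
    """Apply T'_ij = sum_km R_ik R_jm T_km for a 2-D rotation R by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    R = [[c, -s], [s, c]]
    Tp = [[0.0, 0.0], [0.0, 0.0]]
    for i, j, k, m in product(range(2), repeat=4):
        Tp[i][j] += R[i][k] * R[j][m] * T[k][m]
    return Tp

print(n_components(3, 0), n_components(3, 1), n_components(3, 2))   # 1 3 9

T = [[2.0, 1.0], [1.0, 3.0]]
Tp = rotate_rank2(T, math.pi / 6)   # the same tensor expressed in a rotated frame
# Individual components change with the frame, but the trace is an invariant.
print(round(T[0][0] + T[1][1], 9), round(Tp[0][0] + Tp[1][1], 9))   # 5.0 5.0
```

Deep learning frameworks use "tensor" more loosely, for any multi-dimensional array, but the rank and component-count bookkeeping is the same.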


Optimised Functions

Software libraries:
• Tensor functions hand-coded for CPUs or GPUs
• Intel MKL-DNN

Emergence of dedicated processing units and ISAs:
• Tensor arithmetic in hardware


Multi-threading CNN Training

[Chart: CNN training with 1, 4, 16 and 64 threads]

Training on CIFAR-10 with Intel-Caffe, 1000 iterations, full solver. The dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class: 50,000 training images and 10,000 test images.
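The chart's measurements did not survive in the transcript. As an illustration of one common way to thread a layer, the sketch below partitions a dense layer's output neurons into contiguous blocks, one task per worker, using `concurrent.futures` from the Python standard library. This is not Intel-Caffe's own threading scheme, and because of CPython's global interpreter lock, pure-Python arithmetic like this will not actually run faster with more threads: the sketch shows the work decomposition, not the speedup. Function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

def dense_cols(x: List[float], W: List[List[float]], cols: range) -> List[float]:
    """y[j] = sum_i x[i] * W[i][j] for the output neurons j in `cols`."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in cols]

def dense_threaded(x: List[float], W: List[List[float]], n_threads: int) -> List[float]:
    """Partition one layer's output neurons into contiguous blocks, one task per block."""
    n_out = len(W[0])
    step = -(-n_out // n_threads)                 # ceiling division
    blocks = [range(s, min(s + step, n_out)) for s in range(0, n_out, step)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(pool.map(lambda cols: dense_cols(x, W, cols), blocks))
    return [y for part in parts for y in part]    # map() keeps block order

x = [1.0, -2.0, 0.5]
W = [[float(i + j) for j in range(10)] for i in range(3)]   # 3 inputs x 10 outputs
print(dense_threaded(x, W, 4)[:3])   # [-1.0, -1.5, -2.0]
serial = dense_cols(x, W, range(10))
print(all(dense_threaded(x, W, n) == serial for n in (1, 4, 16, 64)))   # True
```

Each output neuron is computed by exactly one worker, so results match the serial version bit for bit; schemes that instead split a sum across threads can change the floating-point summation order and hence the last bits.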


MPI Parallelism in CFD

• Global model decomposed into 8 balanced MPI domains
• Halo at the interface between domains
• Domain surfaces adapted to cell weights
• Processes communicate with MPI primitives: MPI_Send, MPI_Recv, MPI_Wait, MPI_Alltoall, MPI_Allreduce, MPI_Barrier
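The halo idea can be shown in one dimension. A minimal sketch, assuming Python lists stand in for MPI ranks and a 3-point averaging stencil stands in for a CFD solver; all names are hypothetical, and in a real code each lookup of a neighbour's edge cell would be an MPI message (for example MPI_Send/MPI_Recv, or MPI_Sendrecv).

```python
from typing import List, Optional, Tuple

Halo = Tuple[Optional[float], List[float], Optional[float]]

def sweep_global(u: List[float]) -> List[float]:
    """One 3-point averaging sweep over the whole array; the two end points stay fixed."""
    n = len(u)
    return [u[i] if i in (0, n - 1) else (u[i - 1] + u[i] + u[i + 1]) / 3.0
            for i in range(n)]

def decompose(u: List[float], n_ranks: int) -> List[List[float]]:
    """Equal contiguous sub-domains (assumes len(u) is divisible by n_ranks)."""
    size = len(u) // n_ranks
    return [u[r * size:(r + 1) * size] for r in range(n_ranks)]

def exchange_halos(domains: List[List[float]]) -> List[Halo]:
    """Attach one ghost cell from each neighbour; None marks a global boundary."""
    last = len(domains) - 1
    return [(domains[r - 1][-1] if r > 0 else None,
             d,
             domains[r + 1][0] if r < last else None)
            for r, d in enumerate(domains)]

def sweep_local(left: Optional[float], d: List[float], right: Optional[float]) -> List[float]:
    """Same stencil as sweep_global, applied to one sub-domain plus its halo cells."""
    ext = ([left] if left is not None else []) + d + ([right] if right is not None else [])
    off = 0 if left is None else 1          # index of d[0] inside ext
    out = []
    for j in range(off, off + len(d)):
        if j == 0 or j == len(ext) - 1:     # no halo on this side: a global end point
            out.append(ext[j])
        else:
            out.append((ext[j - 1] + ext[j] + ext[j + 1]) / 3.0)
    return out

u = [float(i * i) for i in range(16)]
domains = exchange_halos(decompose(u, 8))   # 8 sub-domains, as on the slide
parallel = [v for left, d, right in domains for v in sweep_local(left, d, right)]
print(parallel == sweep_global(u))          # True
```

Because each sub-domain sees its neighbours' edge values through the halo, the decomposed sweep reproduces the global one exactly; an iterative solver repeats the exchange before every sweep.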


MPI in Deep Learning
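Only this slide's title survives in the transcript. One widely used way to apply MPI to deep learning training is synchronous data parallelism: every rank holds a replica of the model, computes gradients on its own shard of the minibatch, and an allreduce (MPI_Allreduce with MPI_SUM) gives every rank the same summed gradient. A minimal sketch with the communication simulated in plain Python and a single linear neuron as the "model"; all names are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]

def allreduce_sum(per_rank: List[List[float]]) -> List[List[float]]:
    """Simulated MPI_Allreduce with MPI_SUM: every rank receives the element-wise total."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]

def local_grad(params: List[float], shard: List[Sample]) -> List[float]:
    """Summed gradient of 0.5*(w*x + b - y)**2 over one rank's shard of the batch."""
    w, b = params
    gw = gb = 0.0
    for x, y in shard:
        err = w * x + b - y
        gw += err * x
        gb += err
    return [gw, gb]

def data_parallel_step(params: List[float], batch: List[Sample],
                       n_ranks: int, lr: float) -> List[float]:
    """One synchronous data-parallel SGD step across n_ranks simulated ranks."""
    shards = [batch[r::n_ranks] for r in range(n_ranks)]   # each rank's slice of the batch
    summed = allreduce_sum([local_grad(params, s) for s in shards])
    g = summed[0]   # identical on every rank, so all model replicas stay in sync
    return [p - lr * gi / len(batch) for p, gi in zip(params, g)]

data = [(float(x), 2.0 * x + 1.0) for x in range(8)]   # noise-free y = 2x + 1
params = [0.0, 0.0]
for _ in range(2000):
    params = data_parallel_step(params, data, n_ranks=4, lr=0.05)
print([round(p, 3) for p in params])   # approaches [2.0, 1.0]
```

Mathematically each step equals a full-batch gradient step on one process; the allreduce is the only communication, which is why its latency and bandwidth dominate multi-node scaling.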


MPI Parallel Performance


AI evolution driving CPU and GPU releases

[Figure: processors positioned by performance, deployment (datacenter, edge, cloud) and workload (training, inference)]

• Intel® Xeon Phi™ Processor (Knights Mill)
• Intel® Xeon® Processor (Skylake)
• Intel® Xeon® Processor + FPGA
• Intel® Lake Crest deep neural network processor
• Intel® Nervana
• NVIDIA Tesla P4/P40
• NVIDIA Drive PX
• NVIDIA Pascal 100
• Google TPU
• FPGA SoC (Intel/Xilinx)
• FUJITSU PRIMERGY CX600
• K computer


Fujitsu Gateway: Intelligent Application Platform

Cloud services
• Cloud bursting: Gateway
• On-premise cloud

UNCAI Artificial Intelligence
• Smart city surveillance
• Manufacturing process optimisation

HPC for Data Analytics
• Based on PRIMERGY with a parallel file system

Reference Architecture

Products and Solutions
• PRIMEFLEX for HPC solutions
• CELSIUS
• PRIMERGY RX2530 M4 and RX2540 M4 (SKL based)
• PRIMERGY CX400 M4 (SKL based) with CX2550 M4 (HPC), CX2570 M4 (GPU) and CX2580 M4 (FPGA)
• CX600 M1 (KNL, KNM based)
• Intel & Mellanox cluster interconnect; NVIDIA GPGPU
• Entry ETERNUS storage; high-end ETERNUS storage; NetApp storage; DDN storage
• Liquid cooling + immersion cooling (FY2018)
• Workgroup, departmental, data center and cloud deployments

Engineering Cloud • Industry 4.0 • MONOZUKURI


New PRIMEFLEX Options

• Reference designs defined for AI deep learning frameworks
• PRIMEFLEX configuration tool provided for fast definition of a complete solution
• PRIMEFLEX for HPC Integrated Solutions incorporate the Fujitsu Intelligent Application Platform as the application platform within the software stack
• Reference architecture for off-premise deployment
• Cloud-bursting capability


DL/HPC trends

DL opportunity represents 6-7% of the hyperscale market (speculative figure; likely 100% y/y growth).

DL is not a vertical market. It is more akin to an algorithm or method of computation, like an FFT.

AI/DL exists in proximity to HPC:
• Driven by the same architectural objective: performance and scale
• Converged math and programming methodologies
• Technological cross-fertilization:
  • Software: compilers, libraries, tools
  • Hardware: processors, memory, interconnect

Source: Intersect360 Research, 2016


Summary

• Combine algorithmic expertise in HPC and ML: Fujitsu has the rare capability to combine technologies and provide a fully optimized solution
• The shape of a network is subject to skilled programming; optimise through algorithmic and modelling discoveries. The relationship between depth and result quality remains largely empirical
• AI usage is primarily on cloud today; customers look for simplified, integrated solutions


31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 6: Designing for intensity: parallelism from analytics to AI

5 copy Copyright 2017 FUJITSU

What does deep learning deal with

Deep Learning

Dee

p L

earn

ing

is t

he

mac

hin

ersquos

per

cep

tio

n o

f Imagesbull Facesbull Self driving

Soundbull Voice searchbull Music Genbull Translation

Textbull CRMbull Search +bull Ads

Time Seriesbull Health databull Sensorsbull Finance

ARTIFICIAL INTELLIGENCEA program that can sense reasons act and adapt

MACHINE LEARNINGAlgorithms whose performance improve when

exposed to more data over time

DEEP LEARNINGMulti-layered neural networks learn from

vast amounts of data

Unsupervised LearningSupervised Learning

Cluster Analysis Time Series Unstructured

Convolutional Neural Network(CNN)

Recurrent Neural Network(RNN)

RNN+ Long-short term Memory(LSTM)

Reinforcement Learning

6 copy Copyright 2017 FUJITSU

Industry segmentation and use cases

Healthcare

bull Pharmaceuticalbull Genomicsbull Imagery and medical

diagnostic

Marketing Automation

bull CRMbull Market Classificationbull Demand Predictionbull Document Generation

bull Enterprise Resource Planning

bull Predictive MaintenanceAnalysis

bull Machine transcriptionbull Machine translation

Defense and Social Security

bull Surveillance and Security

bull Cyber securitybull Image recognitionbull Motion detection

Consumere-commerceRetail

TransportLogistics

bull Autonomous carsbull Motion detectionbull Networked carCo-

ordinated trafficbull Commercial Dronesbull Optimized route

bull Sentiment Analysisbull Classificationbull Recommendation enginebull Demand predictionbull Automated consulting

bull Search bull Emailsbull Personalizationbull Smart Assistantbull Chatbots

Others

bull Educationbull Fintechbull Gamingbull Telcobull Media

Manufacturing Industrial

7 copy Copyright 2017 FUJITSU

Industry wide presence of Deep Learning

Social Infra4 Financial

9

Public Sector18

Distribution26

Manufacturing43

Sector wise

Call center28

Knowledge Utilization

20

Manufacturing16

Demand Prediction

13

Maintenance 8

Fintech9

Healthcare6

Application wise

Source Based on projects amp PoCs in Fujitsu

Artificial Intelligence is the new ElectricityhellipAndrew Ng

DL is not a vertical market It is more akin to an algorithm or method of computation like an FFT

Intersect360 Research tracks AI (including deep learning machine learning cognitive computing etc) as part of the hyper scale market

Similar to but distinct from HPC

Low precision intensely parallel strong affinity to public cloud

Cloud providers and end users are in early stages of investment for their applications

AI may become a pervasive technology that is embedded in non-hyperscale manifestations

8 copy Copyright 2017 FUJITSU

Fujitsu shaping HPC Diversification

9 copy Copyright 2017 FUJITSU

HPC the foundation to accelerating AI technology

ampFX100

for simulation andpre-processing technology

Zinrai Deep Learning amp DLUfor a high-speed learning environment

Digital Annealerfor combinatorial optimal solutions

Quantum

Computing

Deep

Learning

HPC

10 copy Copyright 2017 FUJITSU

Proximity in AI and HPC

HPC AIDL

HyperscaleSupercomputing

Multi-node

11 copy Copyright 2017 FUJITSU

Characterising Performance Computing

Computational scope Customer usage

Primary focus is performance

Compute-intensive algorithms

Maths solvers

Applications arbitrarily scalable

Is still ldquoHPCrdquo on only a few nodes ndash there is entry-level HPC

Largest supercomputers are gt$100 million

Problem-solving

Data Analysis

Scientific Simulation

Technical Modelling

Virtual Prototyping

Top tier users push boundaries and influence technology throughout industry

12 copy Copyright 2017 FUJITSU

Convolutional Neural Network Breakthrough

Krizhevsky A Sutskever I Hinton G Imagenet classification with deep convolutional neural networks In NIPS (2012)

Deeper Network

in Network

Deep DNN first blood

One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts

at the bottom The GPUs communicate only at certain layers The networkrsquos input is 150528-

dimensional and the number of neurons in the networkrsquos remaining layers is given by 253440ndash

186624ndash64896ndash64896ndash43264ndash4096ndash4096ndash1000

2014 2013 2012

Use of 2 GPUs ndash data parallelism

13 copy Copyright 2017 FUJITSU

Neural Network starting point

119860119888119905 119871 119895 = 120590 119860119888119905 119871 minus 1 119894 119909 119882 119871 119894 119895 + 119861119894119886119904 119871 [119895]

119860119888119905 119871 minus 1 1

119860119888119905 119871 minus 1 2

119860119888119905 119871 minus 1 3

119882 119871 1][1

119882 119871 3][1

119882 119871 2][1 120590

Activation function

eg tanh ReLu

Weight

Feed-forward network

3 neurons 1 hidden layer

Fundamental multiply-add structure

14 copy Copyright 2017 FUJITSU

Vectorisation in Linear Algebra

Core intensive code in Linpack benchmark

do 30 j = kp1 n

t = a(lj)

if (l eq k) go to 20

a(lj) = a(kj)

a(kj) = t

20 continue

call daxpy(n-kta(k+1k)1a(k+1j)1)

30 continue

do 40 kb = 1 n

k = n + 1 - kb

b(k) = b(k)a(kk)

t = -b(k)

call daxpy(k-1ta(1k)1b(1)1)

40 continue

do 10 i = 1n

dy(iy) = dy(iy) + dadx(ix)

ix = ix + incx

iy = iy + incy

10 continue

Fujitsu K computer

Source httpswwwtop500orglists201706

15 copy Copyright 2017 FUJITSU

Network Illustration

Source Nervana

119882119894rarr119895 784 times 100

119887119895 100

119882119894rarr119895 100 times 10

119887119895 10

Total

parameters119888(119900119906119905119901119906119905 119905119903119906119905ℎ)

Cost function

N = 10 output units

(one for each digit)

Each unit i encodes the

probability of the input image

of being of the digit iN = 100 hidden units

(user-defined parameter)

N = 28 x 28 pixels

= 784 input units

Fully connected network

convolution not present for now

16 copy Copyright 2017 FUJITSU

CNN Computing Operations

Dense Matrix Multiplies

Recurrent Layers

Convolutions All-Reduce

Deep Learning ingredients

1 Randomly seed weights

2 Forward-pass

3 Cost

4 Backward-pass

5 Update weights

17 copy Copyright 2017 FUJITSU

Parallelisation Hierarchy

Vectorisation ndash Is SIMD parallelism used well

Scalar tuning ndash What happens in the pipeline

Memory ndash Is cache usage maximised or RAM access streamlined

Threading ndash do cores cooperation efficiently

Communication ndash can coordination in a distributed or

heterogeneous system be improved

18 copy Copyright 2017 FUJITSU

Naiumlve Nested Loops in CNN Algorithms

Forward Propagation

Backward Propagation Convolution

19 copy Copyright 2017 FUJITSU

A short word on Tensors

Tensors are systems of components organized by one or more indices that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

Tensor rank 0 is a scalar

Tensor rank 1 is a vector

Tensors are important in many areas of physics (general relativity electromagnetic theory)

In N-dimensional space a tensor of rank n has Nn components

Transformation rules are independent of choice of reference frame ndash ideal for expressing universal physical laws

20 copy Copyright 2017 FUJITSU

Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware

21 copy Copyright 2017 FUJITSU

Multi-threading CNN Training

1 thread

4 threads

16 threads

64 threads

Training on CIFAR-10 with Intel-Caffe 1000 iterations Full Solver

Dataset consists of 60000 32x32 colour images

in 10 classes with 6000 images per class ndash

50000 training images and 10000 test images

22 copy Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into

8 balanced MPI domainsHalo at interface

between domains

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacutethorn

yumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacute

thornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans Medium ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucirc

uumlyacutethornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 7: Designing for intensity: parallelism from analytics to AI

6 copy Copyright 2017 FUJITSU

Industry segmentation and use cases

Healthcare

bull Pharmaceuticalbull Genomicsbull Imagery and medical

diagnostic

Marketing Automation

bull CRMbull Market Classificationbull Demand Predictionbull Document Generation

bull Enterprise Resource Planning

bull Predictive MaintenanceAnalysis

bull Machine transcriptionbull Machine translation

Defense and Social Security

bull Surveillance and Security

bull Cyber securitybull Image recognitionbull Motion detection

Consumere-commerceRetail

TransportLogistics

bull Autonomous carsbull Motion detectionbull Networked carCo-

ordinated trafficbull Commercial Dronesbull Optimized route

bull Sentiment Analysisbull Classificationbull Recommendation enginebull Demand predictionbull Automated consulting

bull Search bull Emailsbull Personalizationbull Smart Assistantbull Chatbots

Others

bull Educationbull Fintechbull Gamingbull Telcobull Media

Manufacturing Industrial

7 copy Copyright 2017 FUJITSU

Industry wide presence of Deep Learning

Social Infra4 Financial

9

Public Sector18

Distribution26

Manufacturing43

Sector wise

Call center28

Knowledge Utilization

20

Manufacturing16

Demand Prediction

13

Maintenance 8

Fintech9

Healthcare6

Application wise

Source Based on projects amp PoCs in Fujitsu

Artificial Intelligence is the new ElectricityhellipAndrew Ng

DL is not a vertical market It is more akin to an algorithm or method of computation like an FFT

Intersect360 Research tracks AI (including deep learning machine learning cognitive computing etc) as part of the hyper scale market

Similar to but distinct from HPC

Low precision intensely parallel strong affinity to public cloud

Cloud providers and end users are in early stages of investment for their applications

AI may become a pervasive technology that is embedded in non-hyperscale manifestations

8 copy Copyright 2017 FUJITSU

Fujitsu shaping HPC Diversification

9 copy Copyright 2017 FUJITSU

HPC the foundation to accelerating AI technology

ampFX100

for simulation andpre-processing technology

Zinrai Deep Learning amp DLUfor a high-speed learning environment

Digital Annealerfor combinatorial optimal solutions

Quantum

Computing

Deep

Learning

HPC

10 copy Copyright 2017 FUJITSU

Proximity in AI and HPC

HPC AIDL

HyperscaleSupercomputing

Multi-node

11 copy Copyright 2017 FUJITSU

Characterising Performance Computing

Computational scope Customer usage

Primary focus is performance

Compute-intensive algorithms

Maths solvers

Applications arbitrarily scalable

Is still ldquoHPCrdquo on only a few nodes ndash there is entry-level HPC

Largest supercomputers are gt$100 million

Problem-solving

Data Analysis

Scientific Simulation

Technical Modelling

Virtual Prototyping

Top tier users push boundaries and influence technology throughout industry

12 copy Copyright 2017 FUJITSU

Convolutional Neural Network Breakthrough

Krizhevsky A Sutskever I Hinton G Imagenet classification with deep convolutional neural networks In NIPS (2012)

Deeper Network

in Network

Deep DNN first blood

One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts

at the bottom The GPUs communicate only at certain layers The networkrsquos input is 150528-

dimensional and the number of neurons in the networkrsquos remaining layers is given by 253440ndash

186624ndash64896ndash64896ndash43264ndash4096ndash4096ndash1000

2014 2013 2012

Use of 2 GPUs ndash data parallelism

13 copy Copyright 2017 FUJITSU

Neural Network starting point

$$\mathrm{Act}[L][j] = \sigma\Big(\sum_{i} \mathrm{Act}[L-1][i] \times W[L][i][j] + \mathrm{Bias}[L][j]\Big)$$

[Figure: neuron j = 1 with inputs Act[L-1][1], Act[L-1][2], Act[L-1][3], weights W[L][1][1], W[L][2][1], W[L][3][1], and activation function σ (e.g. tanh, ReLU)]

Feed-forward network: 3 neurons, 1 hidden layer

Fundamental multiply-add structure
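The neuron equation can be traced in code by writing the multiply-add loop out explicitly. Below is a minimal Python sketch. The names `dense_forward` and `relu` are hypothetical helpers, and the weights are stored as `W[i][j]` to match the indices in the formula. It is an illustration only, not an optimised implementation.

```python
import math


def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def dense_forward(act_prev, W, bias, sigma=math.tanh):
    """Compute Act[L][j] = sigma(sum_i Act[L-1][i] * W[L][i][j] + Bias[L][j]).

    act_prev: activations of layer L-1 (length n_in)
    W:        weights indexed W[i][j] (n_in rows, n_out columns)
    bias:     biases of layer L (length n_out)
    """
    out = []
    for j in range(len(bias)):
        z = bias[j]
        for i in range(len(act_prev)):
            z += act_prev[i] * W[i][j]  # the fundamental multiply-add
        out.append(sigma(z))
    return out


# Three inputs feeding one neuron, as in the slide's figure.
print(dense_forward([1.0, 2.0, 3.0], [[0.5], [1.0], [0.25]], [0.5], sigma=relu))
# → [3.75]
```

Every output neuron repeats the same multiply-add pattern over all of its inputs. This is why a whole layer collapses into a matrix-vector (or, for a mini-batch, matrix-matrix) product.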


Vectorisation in Linear Algebra

Compute-intensive core of the Linpack benchmark code

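```fortran
c     From DGEFA (LU factorisation): row interchange, then
c     row elimination with column indexing via DAXPY
         do 30 j = kp1, n
            t = a(l,j)
            if (l .eq. k) go to 20
               a(l,j) = a(k,j)
               a(k,j) = t
   20       continue
            call daxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
   30    continue
c     From DGESL: back substitution, solve U*x = y
         do 40 kb = 1, n
            k = n + 1 - kb
            b(k) = b(k)/a(k,k)
            t = -b(k)
            call daxpy(k-1,t,a(1,k),1,b(1),1)
   40    continue
c     From DAXPY (unequal increments): dy = dy + da*dx
      do 10 i = 1,n
        dy(iy) = dy(iy) + da*dx(ix)
        ix = ix + incx
        iy = iy + incy
   10 continue
```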

[Figure: Fujitsu K computer; source: https://www.top500.org/lists/2017/06/]
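Most of the time in the Linpack loops above goes into DAXPY. Its iterations are independent of one another, so the hardware can execute several of them at once in SIMD lanes. The Python sketch below mirrors the strided DAXPY loop. It uses 0-based indexing and assumes positive increments; the reference BLAS routine also handles negative ones. `daxpy_blocked` is a hypothetical helper that imitates a vector unit by processing fixed-width blocks of iterations, then runs a scalar remainder loop. Pure Python gains no speed from this. The point is the loop shape that a vectorising compiler produces.

```python
def daxpy(n, da, dx, incx, dy, incy):
    """dy := da*dx + dy over n strided elements (positive increments only)."""
    if n <= 0 or da == 0.0:
        return dy
    ix = iy = 0
    for _ in range(n):
        dy[iy] = dy[iy] + da * dx[ix]  # one multiply-add per element
        ix += incx
        iy += incy
    return dy


def daxpy_blocked(da, dx, dy, width=4):
    """Unit-stride DAXPY processed `width` lanes at a time, mimicking SIMD."""
    n = len(dx)
    main = n - n % width
    for base in range(0, main, width):
        # The lanes in one block do not depend on each other, so a vector
        # unit could execute them as a single instruction.
        for lane in range(width):
            dy[base + lane] += da * dx[base + lane]
    for i in range(main, n):  # scalar remainder loop
        dy[i] += da * dx[i]
    return dy


print(daxpy(3, 2.0, [1.0, 2.0, 3.0], 1, [10.0, 10.0, 10.0], 1))
# → [12.0, 14.0, 16.0]
```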


Network Illustration

Source: Nervana

Fully connected network (no convolution for now)

Input layer: N = 28 × 28 pixels = 784 input units

Hidden layer: N = 100 hidden units (user-defined parameter); weights W_{i→j}: 784 × 100; biases b_j: 100

Output layer: N = 10 output units (one for each digit); each unit i encodes the probability that the input image is the digit i; weights W_{i→j}: 100 × 10; biases b_j: 10

Total parameters: all weights and biases above

Cost function: c(output, truth)
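The slide's total parameter count is not legible in this transcript, but it follows from the layer sizes it gives: 784 × 100 + 100 + 100 × 10 + 10 = 79,510. The short Python sketch below checks the arithmetic. `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    Each layer transition n_in -> n_out has an n_in x n_out weight matrix
    W_{i->j} plus n_out biases b_j.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total


print(dense_param_count([28 * 28, 100, 10]))  # → 79510
```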


CNN Computing Operations

Dense matrix multiplies

Recurrent layers

Convolutions

All-reduce

Deep Learning ingredients:
1. Randomly seed the weights
2. Forward pass
3. Cost
4. Backward pass
5. Update the weights
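The five ingredients can be mapped onto the smallest possible model: a single linear neuron, pred = w·x + b, trained with a mean-squared-error cost and plain gradient descent. The Python sketch below does this. The data (points on y = 2x + 1), the learning rate and the epoch count are illustrative assumptions. Real networks use the same loop with many layers, and with mini-batches instead of the full dataset.

```python
import random


def train(xs, ys, lr=0.05, epochs=500, seed=0):
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # 1. randomly seed weights
    n = len(xs)
    cost = None
    for _ in range(epochs):
        preds = [w * x + b for x in xs]  # 2. forward pass
        cost = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n  # 3. cost (MSE)
        # 4. backward pass: gradients of the MSE with respect to w and b
        dw = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        db = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        w -= lr * dw  # 5. update weights
        b -= lr * db
    # `cost` is the value from the last forward pass, before the final update.
    return w, b, cost


w, b, cost = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b, cost)  # w and b end up close to 2 and 1
```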


Parallelisation Hierarchy

Vectorisation: is SIMD parallelism used well?

Scalar tuning: what happens in the pipeline?

Memory: is cache usage maximised, or RAM access streamlined?

Threading: do cores cooperate efficiently?

Communication: can coordination in a distributed or heterogeneous system be improved?
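At the memory level of this hierarchy, a standard technique is loop tiling (blocking). The loops of a matrix multiply are reordered so that a small block of each operand is reused many times while it is still in cache. The Python sketch below shows the loop structure only; interpreted Python will not show the cache benefit. `matmul_naive`, `matmul_tiled` and the `tile` size are hypothetical. In compiled code the tile size would be tuned to the cache.

```python
def matmul_naive(A, B):
    """C = A @ B with the plain i-k-j triple loop."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a = A[i][k]
            for j in range(p):
                C[i][j] += a * B[k][j]
    return C


def matmul_tiled(A, B, tile=2):
    """Same arithmetic, but the loops walk tile x tile blocks so each block
    of A, B and C is reused before moving on."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            C[i][j] += a * B[k][j]
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(matmul_tiled(A, A) == matmul_naive(A, A))  # → True
```

For each output element, the tiled version still adds the k terms in ascending order, so both versions produce identical results.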


Naïve Nested Loops in CNN Algorithms

[Figure: nested-loop code for forward propagation and backward propagation of a convolution layer]
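The original loop listings are not recoverable from this transcript. The Python sketch below shows what naïve nested loops for a convolution layer typically look like. Several assumptions apply:
- Stride is 1 and there is no padding.
- Data layout is `x[channel][row][col]` and `w[out_channel][in_channel][row][col]`.
- Like most CNN frameworks, it computes cross-correlation (no kernel flip).
- The backward function covers only the weight gradient; the gradient with respect to the input is a similar six-deep loop.

Function names are hypothetical.

```python
def conv2d_forward(x, w):
    """y[k][p][q] = sum_{c,r,s} x[c][p+r][q+s] * w[k][c][r][s]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    P, Q = H - R + 1, W - S + 1  # output size for stride 1, no padding
    y = [[[0.0] * Q for _ in range(P)] for _ in range(K)]
    for k in range(K):                      # output channels
        for p in range(P):                  # output rows
            for q in range(Q):              # output columns
                acc = 0.0
                for c in range(C):          # input channels
                    for r in range(R):      # kernel rows
                        for s in range(S):  # kernel columns
                            acc += x[c][p + r][q + s] * w[k][c][r][s]
                y[k][p][q] = acc
    return y


def conv2d_backward_weights(x, dy, R, S):
    """dL/dw[k][c][r][s] = sum_{p,q} dy[k][p][q] * x[c][p+r][q+s]."""
    C = len(x)
    K, P, Q = len(dy), len(dy[0]), len(dy[0][0])
    dw = [[[[0.0] * S for _ in range(R)] for _ in range(C)] for _ in range(K)]
    for k in range(K):
        for c in range(C):
            for r in range(R):
                for s in range(S):
                    acc = 0.0
                    for p in range(P):
                        for q in range(Q):
                            acc += dy[k][p][q] * x[c][p + r][q + s]
                    dw[k][c][r][s] = acc
    return dw


x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]  # one 3x3 input channel
w = [[[[1, 0], [0, -1]]]]                # one 2x2 kernel
print(conv2d_forward(x, w))              # → [[[-4.0, -4.0], [-4.0, -4.0]]]
```

Six nested loops per layer, repeated for every image in a batch, is why these kernels are normally rewritten as matrix multiplies or hand-tuned library calls.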


A short word on Tensors

Tensors are systems of components, organized by one or more indices, that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

A tensor of rank 0 is a scalar

A tensor of rank 1 is a vector

Tensors are important in many areas of physics (general relativity, electromagnetic theory)

In N-dimensional space, a tensor of rank n has N^n components

The transformation rules are independent of the choice of reference frame, which makes tensors ideal for expressing universal physical laws
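The N^n component count follows from enumerating every index combination: each of the n indices ranges over N values. The Python sketch below checks this with `itertools.product`; `tensor_components` is a hypothetical helper. Deep learning frameworks usually use "tensor" more loosely, to mean a multi-dimensional array whose rank is its number of indices. Such arrays do not enforce the transformation rules that define a tensor in physics.

```python
from itertools import product


def tensor_components(N, rank):
    """All index tuples of a rank-`rank` tensor in N-dimensional space."""
    return list(product(range(N), repeat=rank))


for rank in range(4):
    # rank 0 has a single component (the empty index tuple): a scalar
    print(rank, len(tensor_components(3, rank)))  # 3**rank components
```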


Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware


Multi-threading CNN Training

[Figure: CNN training runs with 1, 4, 16 and 64 threads]

Training on CIFAR-10 with Intel-Caffe: 1000 iterations, full solver

The dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class: 50,000 training images and 10,000 test images


MPI Parallelism in CFD

Global model decomposed into 8 balanced MPI domains

Halo at the interface between domains

Processes communicate with MPI primitives: MPI_Send, MPI_Recv, MPI_Wait, MPI_Alltoall, MPI_Allreduce, MPI_Barrier

Domain surfaces adapted to cell weights
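The domain-and-halo pattern can be simulated in a single Python process. The sketch below makes these assumptions:
- A 1-D grid of 16 cells is split into 8 balanced domains, echoing the slide.
- Each domain carries one halo cell on each side.
- A simple averaging (Jacobi) stencil stands in for a real CFD update.
- Both outer boundaries are fixed at 0.

`exchange_halos` copies each domain's edge values into its neighbours' halos. In an MPI code each domain would live on its own rank, and this copy would be done with MPI_Send/MPI_Recv (or MPI_Sendrecv). All function names are hypothetical. The final check confirms that the decomposed run matches a serial run exactly.

```python
def split_with_halos(u, parts):
    """Split a 1-D array into balanced chunks, each padded with one halo cell per side."""
    n = len(u)
    bounds = [n * p // parts for p in range(parts + 1)]
    return [[0.0] + u[bounds[p]:bounds[p + 1]] + [0.0] for p in range(parts)]


def exchange_halos(domains, left_bc, right_bc):
    """Fill each halo from the neighbouring domain's edge (or the boundary value)."""
    last = len(domains) - 1
    for p, d in enumerate(domains):
        d[0] = domains[p - 1][-2] if p > 0 else left_bc
        d[-1] = domains[p + 1][1] if p < last else right_bc


def jacobi_step(d):
    """Update interior cells from their neighbours; halos are only read."""
    return [d[0]] + [0.5 * (d[i - 1] + d[i + 1]) for i in range(1, len(d) - 1)] + [d[-1]]


def gather(domains):
    """Reassemble the global array from the domain interiors."""
    return [v for d in domains for v in d[1:-1]]


def serial_step(u, left_bc, right_bc):
    padded = [left_bc] + u + [right_bc]
    return [0.5 * (padded[i - 1] + padded[i + 1]) for i in range(1, len(padded) - 1)]


u = [float(i * i) for i in range(16)]
domains = split_with_halos(u, 8)
serial = list(u)
for _ in range(10):
    exchange_halos(domains, 0.0, 0.0)  # communication phase
    domains = [jacobi_step(d) for d in domains]  # independent computation phase
    serial = serial_step(serial, 0.0, 0.0)
print(gather(domains) == serial)  # → True
```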


MPI in Deep Learning
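All-reduce, listed earlier among the CNN computing operations, is where MPI enters data-parallel training. Each worker computes gradients on its own share of the mini-batch. An all-reduce (MPI_Allreduce in MPI) then sums them, so every worker applies the same averaged update. The Python sketch below simulates one common all-reduce algorithm, the ring (a reduce-scatter followed by an all-gather), with all workers in a single process. MPI libraries may choose other algorithms. `ring_allreduce` and `allreduce_mean` are hypothetical names.

```python
def ring_allreduce(grads):
    """Sum equal-length gradient vectors held by len(grads) simulated workers."""
    n = len(grads)
    length = len(grads[0])
    bounds = [length * c // n for c in range(n + 1)]
    buf = [list(g) for g in grads]  # each worker's local buffer

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    def ring_pass(offset, accumulate):
        # Worker r sends chunk (r + offset) % n to worker r + 1. Payloads are
        # snapshotted first so that all sends in a step happen "at once".
        sends = [(r, (r + offset) % n) for r in range(n)]
        payload = [[buf[r][i] for i in chunk(c)] for r, c in sends]
        for (r, c), data in zip(sends, payload):
            dst = buf[(r + 1) % n]
            for i, v in zip(chunk(c), data):
                dst[i] = dst[i] + v if accumulate else v

    for step in range(n - 1):  # reduce-scatter: worker r ends with full chunk (r + 1) % n
        ring_pass(-step, accumulate=True)
    for step in range(n - 1):  # all-gather: circulate the completed chunks
        ring_pass(1 - step, accumulate=False)
    return buf


def allreduce_mean(grads):
    """Average gradients across workers, as in synchronous data-parallel SGD."""
    n = len(grads)
    return [[v / n for v in worker] for worker in ring_allreduce(grads)]


print(allreduce_mean([[2.0, 4.0], [4.0, 8.0]]))  # → [[3.0, 6.0], [3.0, 6.0]]
```

In each step every worker sends and receives only one chunk, about 1/n of the gradient vector. This keeps per-link traffic roughly constant as workers are added.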


MPI Parallel Performance


AI evolution driving CPU and GPU releases

[Figure: axes labelled Performance, Data center / Edge / Cloud, and Inference / Training. Products shown: Intel® Xeon Phi™ Processor (Knights Mill); Intel® Xeon® Processor (Skylake); Intel® Xeon® Processor + FPGA; Intel® Lake Crest deep neural network processor; Intel® Nervana; NVIDIA Tesla P4/P40; NVIDIA Drive PX; Google TPU; NVIDIA Pascal 100; FPGA SoC (Intel/Xilinx); FUJITSU PRIMERGY CX600; K computer]


Fujitsu Gateway: Intelligent Application Platform

[Figure: Fujitsu products and solutions map]

Solutions: Cloud Services (cloud bursting, Gateway, on-premise cloud); UNCAI / Artificial Intelligence; Smart City Surveillance; Manufacturing process optimisation; HPC for Data Analytics (reference architecture based on PRIMERGY with a parallel file system); Engineering Cloud; Industry 4.0; MONOZUKURI

PRIMEFLEX for HPC

Products, from workgroup and departmental to data center: CELSIUS; PRIMERGY RX2530 M4 (SKL-based); PRIMERGY RX2540 M4 (SKL-based); PRIMERGY CX400 M4 (SKL-based); CX2550 M4 (HPC); CX2570 M4 (GPU); CX2580 M4 (FPGA); CX600 M1 (KNL/KNM-based); Intel & Mellanox cluster interconnect; NVIDIA GPGPU; entry ETERNUS storage; high-end ETERNUS storage; NetApp storage; DDN storage; liquid cooling (+ immersion cooling, FY2018)


New PRIMEFLEX Options

Reference designs defined for AI/deep learning frameworks

PRIMEFLEX configuration tool provided for fast definition of a complete solution

PRIMEFLEX for HPC integrated solutions incorporate the Fujitsu Intelligent Application Platform as the application platform within the software stack

Reference architecture for off-premise cloud-bursting capability


DL/HPC trends

The DL opportunity represents 6–7% of the hyperscale market

Speculative figure; likely 100% year-on-year growth

DL is not a vertical market: it is more akin to an algorithm or method of computation, like an FFT

AI/DL exists in proximity to HPC

Driven by the same architectural objective: performance and scale

Converged maths and programming methodologies

Technological cross-fertilization:

• Software: compilers, libraries, tools

• Hardware: processors, memory, interconnect

Source: Intersect360 Research, 2016


Summary

Combine algorithmic expertise in HPC and ML: Fujitsu has the rare capability to combine technologies and provide a fully optimised solution

The shape of a network is subject to skilled programming; optimise it through algorithmic and modelling discoveries. The relationship between depth and result quality remains largely empirical

AI usage is primarily in the cloud today; customers look for simplified, integrated solutions



Deep Learning Networks

Image Identity



Unsupervised Learning

Genome, market segmentation, fraud detection, astronomical data analysis, Google News

Page 10: Designing for intensity: parallelism from analytics to AI

9 © Copyright 2017 FUJITSU

HPC: the foundation for accelerating AI technology

FX100 for simulation and pre-processing technology

Zinrai Deep Learning & DLU for a high-speed learning environment

Digital Annealer for finding optimal solutions to combinatorial problems

Stack (figure): Quantum Computing, Deep Learning, HPC

10 © Copyright 2017 FUJITSU

Proximity in AI and HPC

Figure labels: HPC, AI/DL; Hyperscale, Supercomputing; Multi-node

11 © Copyright 2017 FUJITSU

Characterising Performance Computing

Computational scope

Primary focus is performance

Compute-intensive algorithms

Maths solvers

Applications arbitrarily scalable

Still "HPC" on only a few nodes: there is entry-level HPC

The largest supercomputers cost more than $100 million

Customer usage

Problem-solving

Data Analysis

Scientific Simulation

Technical Modelling

Virtual Prototyping

Top-tier users push boundaries and influence technology throughout industry

12 © Copyright 2017 FUJITSU

Convolutional Neural Network Breakthrough

Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)

Figure timeline: 2014 Deeper Network; 2013 Network in Network; 2012 Deep DNN "first blood"

"One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000."

Use of 2 GPUs: each layer's neurons are split between the GPUs (model parallelism)
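The quoted caption describes splitting each layer between two GPUs that communicate only at certain layers. Below is a minimal sketch of that idea for a single fully connected layer. It assumes plain Python lists stand in for the two devices. Each "device" holds a slice of the layer's output neurons (columns of the weight matrix) and computes its share independently. The slices are then concatenated, which is the point where real GPUs would exchange data. All function names are hypothetical.

```python
def dense(x, weights, bias):
    """Fully connected layer without activation: out[j] = sum_i x[i] * weights[i][j] + bias[j]."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]


def split_columns(weights, bias, parts):
    """Partition the layer's output neurons (columns) into contiguous slices, one per device."""
    n_out = len(bias)
    step = (n_out + parts - 1) // parts
    shards = []
    for start in range(0, n_out, step):
        stop = min(start + step, n_out)
        shards.append(([row[start:stop] for row in weights], bias[start:stop]))
    return shards


def model_parallel_dense(x, weights, bias, devices=2):
    """Each simulated device evaluates only its own neurons; the results are then gathered."""
    partial_outputs = [dense(x, w, b) for w, b in split_columns(weights, bias, devices)]
    out = []
    for part in partial_outputs:  # gather step: where the devices would communicate
        out.extend(part)
    return out


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    weights = [[0.1, 0.2, 0.3, 0.4],
               [0.5, 0.6, 0.7, 0.8],
               [0.9, 1.0, 1.1, 1.2]]
    bias = [0.0, 0.1, 0.2, 0.3]
    print(model_parallel_dense(x, weights, bias))
    print(dense(x, weights, bias))  # same values, computed on one "device"
```

Splitting by output neuron means no device needs another device's weights for its own slice. Only the activations must be exchanged before the next layer.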

13 © Copyright 2017 FUJITSU

Neural Network starting point

$$\mathrm{Act}_L[j] = \sigma\Big(\sum_i \mathrm{Act}_{L-1}[i] \times W_L[i][j] + \mathrm{Bias}_L[j]\Big)$$

Figure: inputs $\mathrm{Act}_{L-1}[1]$, $\mathrm{Act}_{L-1}[2]$, $\mathrm{Act}_{L-1}[3]$ multiplied by weights $W_L[1][1]$, $W_L[2][1]$, $W_L[3][1]$, summed, and passed through $\sigma$

Activation function $\sigma$: e.g. tanh, ReLU

Weight: $W_L[i][j]$

Feed-forward network: 3 neurons, 1 hidden layer

Fundamental multiply-add structure
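The slide's formula is one multiply-add per input, followed by an activation. Below is a minimal sketch in plain Python. It assumes ReLU as the default σ (the slide lists tanh or ReLU as examples) and uses 0-based indices. `layer_forward` is a hypothetical name.

```python
import math


def relu(z):
    return z if z > 0.0 else 0.0


def layer_forward(act_prev, W, bias, sigma=relu):
    """Act_L[j] = sigma(sum_i Act_{L-1}[i] * W_L[i][j] + Bias_L[j]).

    act_prev: activations of layer L-1 (length n_in)
    W:        W[i][j] is the weight from input i to neuron j (n_in x n_out)
    bias:     one bias per neuron of layer L (length n_out)
    """
    act = []
    for j in range(len(bias)):
        z = bias[j]
        for i in range(len(act_prev)):
            z += act_prev[i] * W[i][j]  # the fundamental multiply-add
        act.append(sigma(z))
    return act


if __name__ == "__main__":
    act_prev = [1.0, -2.0, 0.5]  # three inputs, as in the slide's figure
    W = [[0.2, -0.1], [0.4, 0.3], [-0.6, 0.8]]
    bias = [0.1, 1.0]
    print(layer_forward(act_prev, W, bias))             # ReLU
    print(layer_forward(act_prev, W, bias, math.tanh))  # tanh
```

The inner loop is exactly the pattern that BLAS routines and vector units accelerate. Written for all neurons at once, it becomes a matrix-vector product.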

14 © Copyright 2017 FUJITSU

Vectorisation in Linear Algebra

Core intensive code in the Linpack benchmark

c     dgefa: row elimination with column indexing
      do 30 j = kp1, n
         t = a(l,j)
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
         call daxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
   30 continue
c
c     dgesl: back-substitution, solve u*x = y
      do 40 kb = 1, n
         k = n + 1 - kb
         b(k) = b(k)/a(k,k)
         t = -b(k)
         call daxpy(k-1,t,a(1,k),1,b(1),1)
   40 continue
c
c     daxpy: dy = dy + da*dx with unequal increments
      do 10 i = 1,n
         dy(iy) = dy(iy) + da*dx(ix)
         ix = ix + incx
         iy = iy + incy
   10 continue

Fujitsu K computer

Source: https://www.top500.org/lists/2017/06/
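Both Linpack fragments reduce to daxpy (y := a·x + y). This loop vectorises well because every iteration is an independent multiply-add. Below is a minimal Python transcription under several assumptions. It handles positive increments only. A flat column-major list stands in for the Fortran array (`a[i + j*n]` is element (i, j)). Indices are 0-based. The offset parameters `x0`/`y0` are hypothetical stand-ins for Fortran passing `a(1,k)` or `b(1)` by reference.

```python
def daxpy(n, da, dx, incx, dy, incy, x0=0, y0=0):
    """dy[y0 + i*incy] += da * dx[x0 + i*incx] for i in range(n); positive increments only."""
    if n <= 0 or da == 0.0:  # the reference daxpy also returns early in these cases
        return
    ix, iy = x0, y0
    for _ in range(n):
        dy[iy] += da * dx[ix]
        ix += incx
        iy += incy


def back_substitute(a, n, b):
    """Solve U x = b in place, mirroring loop 40 of dgesl.

    a: n-by-n upper-triangular U stored column-major as a flat list (a[i + j*n] = U[i][j])
    b: right-hand side, overwritten with the solution x
    """
    for kb in range(1, n + 1):
        k = n - kb                  # 0-based equivalent of k = n + 1 - kb
        b[k] = b[k] / a[k + k * n]
        t = -b[k]
        # Column k above the diagonal starts at a[k*n] with stride 1 (Fortran a(1,k))
        daxpy(k, t, a, 1, b, 1, x0=k * n, y0=0)
    return b


if __name__ == "__main__":
    # U = [[2, 1, 1], [0, 3, 2], [0, 0, 4]] stored column by column
    U = [2.0, 0.0, 0.0, 1.0, 3.0, 0.0, 1.0, 2.0, 4.0]
    print(back_substitute(U, 3, [7.0, 12.0, 12.0]))  # solution of U x = b
```

Because each update touches a contiguous column (stride 1), a compiler or the hardware can apply SIMD instructions to the daxpy loop. This is why Linpack performance tracks vector throughput.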

15 © Copyright 2017 FUJITSU

Network Illustration

Source: Nervana

Fully connected network (convolution not present for now)

Input layer: N = 28 × 28 pixels = 784 input units

Hidden layer: N = 100 hidden units (user-defined parameter); weights $W_{i \rightarrow j}$: 784 × 100; biases $b_j$: 100

Output layer: N = 10 output units (one for each digit); each unit i encodes the probability that the input image is the digit i; weights $W_{i \rightarrow j}$: 100 × 10; biases $b_j$: 10

Total parameters

Cost function: $c(\text{output}, \text{truth})$
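The slide labels the parameter blocks but not their total. The count is simple arithmetic: 784 × 100 + 100 + 100 × 10 + 10 = 79,510 trainable parameters for the 784-100-10 network. A small helper generalises the count to any fully connected stack. `count_parameters` is a hypothetical name.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network.

    layer_sizes lists units per layer, input first, e.g. [784, 100, 10].
    Each adjacent pair contributes an n_in x n_out weight matrix W_{i->j}
    and n_out biases b_j.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total


if __name__ == "__main__":
    print(count_parameters([784, 100, 10]))  # 79510
```

The hidden-layer width (100, a user-defined parameter) dominates the count through the 784 × 100 weight matrix.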

16 © Copyright 2017 FUJITSU

CNN Computing Operations

Dense Matrix Multiplies

Recurrent Layers

Convolutions

All-Reduce

Deep Learning ingredients

1 Randomly seed weights

2 Forward-pass

3 Cost

4 Backward-pass

5 Update weights
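The five ingredients map one-to-one onto a training loop. Below is a minimal sketch that assumes the smallest possible "network": one linear neuron with no activation. It is fitted to toy data y = 2x + 1 using a mean-squared-error cost and plain gradient descent. The data, learning rate and epoch count are illustrative choices, not taken from the slides.

```python
import random


def train(xs, ys, lr=0.05, epochs=500, seed=0):
    rng = random.Random(seed)
    # 1. Randomly seed weights
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    n = len(xs)
    history = []
    for _ in range(epochs):
        # 2. Forward pass
        preds = [w * x + b for x in xs]
        # 3. Cost: mean squared error
        errs = [p - y for p, y in zip(preds, ys)]
        history.append(sum(e * e for e in errs) / n)
        # 4. Backward pass: gradients of the cost with respect to w and b
        dw = sum(2.0 * e * x for e, x in zip(errs, xs)) / n
        db = sum(2.0 * e for e in errs) / n
        # 5. Update weights
        w -= lr * dw
        b -= lr * db
    return w, b, history


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [2.0 * x + 1.0 for x in xs]
    w, b, history = train(xs, ys)
    print(f"w={w:.4f} b={b:.4f} first cost={history[0]:.4f} last cost={history[-1]:.2e}")
```

In a real CNN, steps 2 and 4 are dominated by the dense matrix multiplies and convolutions listed above. Step 5 is where an all-reduce enters once the work is spread over several nodes.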

17 © Copyright 2017 FUJITSU

Parallelisation Hierarchy

Vectorisation: is SIMD parallelism used well?

Scalar tuning: what happens in the pipeline?

Memory: is cache usage maximised, or RAM access streamlined?

Threading: do cores cooperate efficiently?

Communication: can coordination in a distributed or heterogeneous system be improved?
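As one illustration of the memory level, loop tiling (blocking) restructures a matrix multiply so that small blocks are reused while they are still in cache. Below is a minimal sketch that assumes row-major matrices stored as Python lists and a hypothetical block size. In pure Python, interpreter overhead hides any cache benefit. The point is the loop structure, which compiled code such as BLAS exploits, not the speed.

```python
def matmul_naive(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = A[i][k]
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C


def matmul_tiled(A, B, block=2):
    """Same result as matmul_naive, but works through block x block tiles."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # Update one tile of C from one tile of A and one tile of B
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + block, p)):
                            C[i][j] += aik * B[k][j]
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    B = [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]
    print(matmul_tiled(A, B) == matmul_naive(A, B))
```

Both versions accumulate each C[i][j] over k in the same order, so the results match exactly. Only the order in which memory is visited changes.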

18 © Copyright 2017 FUJITSU

Naïve Nested Loops in CNN Algorithms

Panels: Forward Propagation; Backward Propagation; Convolution
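The naïve forward convolution is a deep loop nest over output channels, output positions, input channels and kernel positions. Below is a minimal sketch under these assumptions: a single image, stride 1, no padding, and nested Python lists indexed [channel][row][column]. As in most deep learning frameworks, it computes cross-correlation, so the kernel is not flipped. `conv2d_forward` is a hypothetical name.

```python
def conv2d_forward(inp, weights, bias):
    """Naive direct convolution (cross-correlation), stride 1, no padding.

    inp:     [C_in][H][W]
    weights: [C_out][C_in][K][K]
    bias:    [C_out]
    returns: [C_out][H-K+1][W-K+1]
    """
    c_in, h, w = len(inp), len(inp[0]), len(inp[0][0])
    c_out, k = len(weights), len(weights[0][0])
    h_out, w_out = h - k + 1, w - k + 1
    out = [[[0.0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for co in range(c_out):                      # output feature maps
        for y in range(h_out):                   # output rows
            for x in range(w_out):               # output columns
                acc = bias[co]
                for ci in range(c_in):           # input channels
                    for ky in range(k):          # kernel rows
                        for kx in range(k):      # kernel columns
                            acc += weights[co][ci][ky][kx] * inp[ci][y + ky][x + kx]
                out[co][y][x] = acc
    return out


if __name__ == "__main__":
    image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one 3x3 channel
    kernel = [[[[1, 0], [0, 1]]]]                 # one 2x2 filter
    print(conv2d_forward(image, kernel, [0.0]))
```

Each of the six loops is a candidate for the levels of the parallelisation hierarchy. The innermost loops suit vectorisation, the output positions suit threading, and the output channels (or batch) can be spread across nodes.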

19 © Copyright 2017 FUJITSU

A short word on Tensors

Tensors are systems of components, organized by one or more indices, that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

A rank-0 tensor is a scalar

A rank-1 tensor is a vector

Tensors are important in many areas of physics (general relativity, electromagnetic theory)

In N-dimensional space, a tensor of rank n has $N^n$ components

The transformation rules are independent of the choice of reference frame, which makes tensors ideal for expressing universal physical laws
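For Cartesian tensors, the "specific rules" are linear. Under an orthogonal change of axes R, a rank-2 tensor transforms as T'_ij = Σ_a Σ_b R_ia R_jb T_ab. Below is a minimal 2-D sketch that assumes a rotation matrix for R. The components change with the frame, but a frame-independent quantity such as the trace does not. A rank-n tensor in N dimensions has N**n components. The function names are hypothetical.

```python
import math


def rotation(theta):
    """2-D rotation matrix (orthogonal)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def transform_rank2(T, R):
    """T'_{ij} = sum_a sum_b R[i][a] * R[j][b] * T[a][b]."""
    n = len(T)
    return [[sum(R[i][a] * R[j][b] * T[a][b] for a in range(n) for b in range(n))
             for j in range(n)] for i in range(n)]


def n_components(N, rank):
    """Components of a rank-`rank` tensor in N-dimensional space."""
    return N ** rank


if __name__ == "__main__":
    T = [[1.0, 2.0], [3.0, 4.0]]
    T_rot = transform_rank2(T, rotation(math.pi / 6))
    print(T_rot)                                          # components differ
    print(T[0][0] + T[1][1], T_rot[0][0] + T_rot[1][1])   # trace unchanged (up to rounding)
    print(n_components(3, 0), n_components(3, 1), n_components(3, 2))
```

Deep learning frameworks use "tensor" more loosely, to mean a multi-dimensional array. The rank/index vocabulary carries over, but the transformation law usually does not.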

20 © Copyright 2017 FUJITSU

Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware

21 © Copyright 2017 FUJITSU

Multi-threading CNN Training

Figure panels: 1 thread, 4 threads, 16 threads, 64 threads

Training on CIFAR-10 with Intel-Caffe, 1000 iterations, full solver

The dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class: 50,000 training images and 10,000 test images

22 © Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into 8 balanced MPI domains

Halo at the interface between domains

Domain surfaces adapted to cell weights

Communicate between processes with MPI primitives: MPI_Send, MPI_Recv, MPI_Wait, MPI_Alltoall, MPI_Allreduce, MPI_Barrier
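The halo idea can be shown without an MPI installation. Below is a minimal serial sketch under these assumptions: a 1-D grid, a three-point averaging stencil, fixed (non-periodic) boundaries, and equally sized domains rather than domains weighted by cell cost. Each domain keeps one ghost cell on each side. A halo-exchange step copies neighbours' edge values into those ghost cells, which is what MPI_Send/MPI_Recv pairs would do between processes. Every domain then updates its own cells independently. The function names are hypothetical.

```python
def stencil_global(u):
    """One Jacobi-style sweep: interior points become the mean of themselves and their neighbours."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def decompose(u, n_domains):
    """Split u into equal contiguous domains, each padded with one halo cell per side."""
    if len(u) % n_domains:
        raise ValueError("this sketch assumes len(u) is divisible by n_domains")
    size = len(u) // n_domains
    return [[0.0] + u[d * size:(d + 1) * size] + [0.0] for d in range(n_domains)]


def exchange_halos(domains):
    """Copy edge values into neighbours' halo cells (stands in for MPI_Send/MPI_Recv)."""
    for d in range(len(domains)):
        if d > 0:
            domains[d][0] = domains[d - 1][-2]   # left halo <- left neighbour's last owned cell
        if d < len(domains) - 1:
            domains[d][-1] = domains[d + 1][1]   # right halo <- right neighbour's first owned cell


def stencil_decomposed(u, n_domains=8):
    domains = decompose(u, n_domains)
    exchange_halos(domains)
    result, last = [], len(domains) - 1
    for d, dom in enumerate(domains):
        new = dom[:]
        # The global boundary cells (first of domain 0, last of the final domain) stay fixed
        lo = 2 if d == 0 else 1
        hi = len(dom) - 2 if d == last else len(dom) - 1
        for i in range(lo, hi):
            new[i] = (dom[i - 1] + dom[i] + dom[i + 1]) / 3.0
        result.extend(new[1:-1])                 # drop halos, keep owned cells
    return result


if __name__ == "__main__":
    u = [float((i * i) % 7) for i in range(32)]
    print(stencil_decomposed(u, 8) == stencil_global(u))
```

After one exchange per sweep, every domain has all the neighbour data its stencil needs. This is why communication volume scales with the interface (halo) size rather than with the domain volume.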

23 © Copyright 2017 FUJITSU

MPI in Deep Learning
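In data-parallel training, each MPI rank computes gradients on its own shard of the mini-batch. An all-reduce (MPI_Allreduce with a sum, then division by the number of ranks) then gives every rank the same averaged gradient. Below is a minimal serial sketch under these assumptions: 4 simulated ranks, equal shard sizes, and the mean-squared-error gradient of a one-weight linear model. `allreduce_mean` only mimics the semantics of the MPI call; it is not an MPI binding.

```python
def local_gradient(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over one rank's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def allreduce_mean(values):
    """Every rank contributes one value and receives the same mean (SUM all-reduce, then / size)."""
    total = sum(values)
    return [total / len(values)] * len(values)


def data_parallel_gradient(w, xs, ys, ranks=4):
    shard = len(xs) // ranks  # assumes equal shard sizes
    grads = [local_gradient(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
             for r in range(ranks)]
    return allreduce_mean(grads)


if __name__ == "__main__":
    xs = [float(i) for i in range(8)]
    ys = [3.0 * x for x in xs]
    print(data_parallel_gradient(1.0, xs, ys))  # identical value on every rank
    print(local_gradient(1.0, xs, ys))          # full-batch gradient, for comparison
```

With equal shard sizes, the average of the per-shard mean gradients equals the full-batch gradient. Every rank can therefore apply the same weight update and the model replicas stay in sync.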

24 © Copyright 2017 FUJITSU

MPI Parallel Performance

25 © Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Figure labels: Performance; Datacenter/Edge; Cloud; Datacenter; Inference; Training

Intel® Xeon Phi™ Processor (Knights Mill)

Intel® Xeon Processor (Skylake)

Intel® Xeon® Processor + FPGA

Intel® Lake Crest deep neural network processor

Intel® Nervana

NVIDIA Tesla P4/P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SoC (Intel/Xilinx)

FUJITSU PRIMERGY CX600

K computer

26 © Copyright 2017 FUJITSU

Fujitsu Gateway: Intelligent Application Platform

Cloud Services: cloud bursting; Gateway; on-premise cloud

UNCAI Artificial Intelligence

Smart City Surveillance

Manufacturing process optimisation

HPC for Data Analytics

Reference Architecture: based on PRIMERGY with Parallel File System

Products and Solutions

CELSIUS

Intel & Mellanox Cluster Interconnect

NVIDIA GPGPU

PRIMERGY RX2540 M4 (SKL based)

© FUJITSU LIMITED 2017

PRIMEFLEX for HPC Solutions

Products: CX600 M1 (KNL/KNM based)

Entry ETERNUS storage

Cloud

PRIMERGY RX2530 M4 (SKL based)

High-end ETERNUS storage

NetApp storage, DDN storage

Workgroup, Data Center, Departmental

Liquid Cooling + immersion cooling (FY2018)

CX400 M4 (SKL based): CX2550 M4 (HPC), CX2570 M4 (GPU), CX2580 M4 (FPGA)

Engineering Cloud

Industry 4.0

MONOZUKURI

27 © Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Reference architecture for off-premise deployment

Cloud-bursting capability

28 © Copyright 2017 FUJITSU

DL/HPC trends

The DL opportunity represents 6-7% of the hyperscale market

Speculative figure; likely 100% y/y growth

DL is not a vertical market

It is more akin to an algorithm or method of computation, like an FFT

AI/DL exists in proximity to HPC

Driven by the same architectural objective: performance and scale

Converged math and programming methodologies

Technological cross-fertilization:

• Software: compilers, libraries, tools

• Hardware: processors, memory, interconnect

Source: Intersect360 Research, 2016

29 © Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise in HPC and ML: Fujitsu has the rare capability to combine technologies and provide a fully optimized solution

The shape of a network is subject to skilled programming; optimise through algorithmic and modelling discoveries. The relationship between depth and result quality remains largely empirical

AI usage is primarily in the cloud today; customers look for simplified, integrated solutions


31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 11: Designing for intensity: parallelism from analytics to AI

10 copy Copyright 2017 FUJITSU

Proximity in AI and HPC

HPC AIDL

HyperscaleSupercomputing

Multi-node

11 copy Copyright 2017 FUJITSU

Characterising Performance Computing

Computational scope Customer usage

Primary focus is performance

Compute-intensive algorithms

Maths solvers

Applications arbitrarily scalable

Is still ldquoHPCrdquo on only a few nodes ndash there is entry-level HPC

Largest supercomputers are gt$100 million

Problem-solving

Data Analysis

Scientific Simulation

Technical Modelling

Virtual Prototyping

Top tier users push boundaries and influence technology throughout industry

12 copy Copyright 2017 FUJITSU

Convolutional Neural Network Breakthrough

Krizhevsky A Sutskever I Hinton G Imagenet classification with deep convolutional neural networks In NIPS (2012)

Deeper Network

in Network

Deep DNN first blood

One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts

at the bottom The GPUs communicate only at certain layers The networkrsquos input is 150528-

dimensional and the number of neurons in the networkrsquos remaining layers is given by 253440ndash

186624ndash64896ndash64896ndash43264ndash4096ndash4096ndash1000

2014 2013 2012

Use of 2 GPUs ndash data parallelism

13 copy Copyright 2017 FUJITSU

Neural Network starting point

119860119888119905 119871 119895 = 120590 119860119888119905 119871 minus 1 119894 119909 119882 119871 119894 119895 + 119861119894119886119904 119871 [119895]

119860119888119905 119871 minus 1 1

119860119888119905 119871 minus 1 2

119860119888119905 119871 minus 1 3

119882 119871 1][1

119882 119871 3][1

119882 119871 2][1 120590

Activation function

eg tanh ReLu

Weight

Feed-forward network

3 neurons 1 hidden layer

Fundamental multiply-add structure

14 copy Copyright 2017 FUJITSU

Vectorisation in Linear Algebra

Core intensive code in Linpack benchmark

do 30 j = kp1 n

t = a(lj)

if (l eq k) go to 20

a(lj) = a(kj)

a(kj) = t

20 continue

call daxpy(n-kta(k+1k)1a(k+1j)1)

30 continue

do 40 kb = 1 n

k = n + 1 - kb

b(k) = b(k)a(kk)

t = -b(k)

call daxpy(k-1ta(1k)1b(1)1)

40 continue

do 10 i = 1n

dy(iy) = dy(iy) + dadx(ix)

ix = ix + incx

iy = iy + incy

10 continue

Fujitsu K computer

Source httpswwwtop500orglists201706

15 copy Copyright 2017 FUJITSU

Network Illustration

Source Nervana

119882119894rarr119895 784 times 100

119887119895 100

119882119894rarr119895 100 times 10

119887119895 10

Total

parameters119888(119900119906119905119901119906119905 119905119903119906119905ℎ)

Cost function

N = 10 output units

(one for each digit)

Each unit i encodes the

probability of the input image

of being of the digit iN = 100 hidden units

(user-defined parameter)

N = 28 x 28 pixels

= 784 input units

Fully connected network

convolution not present for now

16 copy Copyright 2017 FUJITSU

CNN Computing Operations

Dense Matrix Multiplies

Recurrent Layers

Convolutions All-Reduce

Deep Learning ingredients

1 Randomly seed weights

2 Forward-pass

3 Cost

4 Backward-pass

5 Update weights

17 copy Copyright 2017 FUJITSU

Parallelisation Hierarchy

Vectorisation ndash Is SIMD parallelism used well

Scalar tuning ndash What happens in the pipeline

Memory ndash Is cache usage maximised or RAM access streamlined

Threading ndash do cores cooperation efficiently

Communication ndash can coordination in a distributed or

heterogeneous system be improved

18 copy Copyright 2017 FUJITSU

Naiumlve Nested Loops in CNN Algorithms

Forward Propagation

Backward Propagation Convolution

19 copy Copyright 2017 FUJITSU

A short word on Tensors

Tensors are systems of components organized by one or more indices that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

Tensor rank 0 is a scalar

Tensor rank 1 is a vector

Tensors are important in many areas of physics (general relativity electromagnetic theory)

In N-dimensional space a tensor of rank n has Nn components

Transformation rules are independent of choice of reference frame ndash ideal for expressing universal physical laws

20 copy Copyright 2017 FUJITSU

Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware

21 copy Copyright 2017 FUJITSU

Multi-threading CNN Training

1 thread

4 threads

16 threads

64 threads

Training on CIFAR-10 with Intel-Caffe 1000 iterations Full Solver

Dataset consists of 60000 32x32 colour images

in 10 classes with 6000 images per class ndash

50000 training images and 10000 test images

22 copy Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into

8 balanced MPI domainsHalo at interface

between domains

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacutethorn

yumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacute

thornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans Medium ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucirc

uumlyacutethornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 12: Designing for intensity: parallelism from analytics to AI

11 copy Copyright 2017 FUJITSU

Characterising Performance Computing

Computational scope Customer usage

Primary focus is performance

Compute-intensive algorithms

Maths solvers

Applications arbitrarily scalable

Is still ldquoHPCrdquo on only a few nodes ndash there is entry-level HPC

Largest supercomputers are gt$100 million

Problem-solving

Data Analysis

Scientific Simulation

Technical Modelling

Virtual Prototyping

Top tier users push boundaries and influence technology throughout industry

12 copy Copyright 2017 FUJITSU

Convolutional Neural Network Breakthrough

Krizhevsky A Sutskever I Hinton G Imagenet classification with deep convolutional neural networks In NIPS (2012)

Deeper Network

in Network

Deep DNN first blood

One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts

at the bottom The GPUs communicate only at certain layers The networkrsquos input is 150528-

dimensional and the number of neurons in the networkrsquos remaining layers is given by 253440ndash

186624ndash64896ndash64896ndash43264ndash4096ndash4096ndash1000

2014 2013 2012

Use of 2 GPUs ndash data parallelism

13 copy Copyright 2017 FUJITSU

Neural Network starting point

119860119888119905 119871 119895 = 120590 119860119888119905 119871 minus 1 119894 119909 119882 119871 119894 119895 + 119861119894119886119904 119871 [119895]

119860119888119905 119871 minus 1 1

119860119888119905 119871 minus 1 2

119860119888119905 119871 minus 1 3

119882 119871 1][1

119882 119871 3][1

119882 119871 2][1 120590

Activation function

eg tanh ReLu

Weight

Feed-forward network

3 neurons 1 hidden layer

Fundamental multiply-add structure

14 copy Copyright 2017 FUJITSU

Vectorisation in Linear Algebra

Core intensive code in Linpack benchmark

do 30 j = kp1 n

t = a(lj)

if (l eq k) go to 20

a(lj) = a(kj)

a(kj) = t

20 continue

call daxpy(n-kta(k+1k)1a(k+1j)1)

30 continue

do 40 kb = 1 n

k = n + 1 - kb

b(k) = b(k)a(kk)

t = -b(k)

call daxpy(k-1ta(1k)1b(1)1)

40 continue

do 10 i = 1n

dy(iy) = dy(iy) + dadx(ix)

ix = ix + incx

iy = iy + incy

10 continue

Fujitsu K computer

Source httpswwwtop500orglists201706

15 copy Copyright 2017 FUJITSU

Network Illustration

Source Nervana

119882119894rarr119895 784 times 100

119887119895 100

119882119894rarr119895 100 times 10

119887119895 10

Total

parameters119888(119900119906119905119901119906119905 119905119903119906119905ℎ)

Cost function

N = 10 output units

(one for each digit)

Each unit i encodes the

probability of the input image

of being of the digit iN = 100 hidden units

(user-defined parameter)

N = 28 x 28 pixels

= 784 input units

Fully connected network

convolution not present for now

16 copy Copyright 2017 FUJITSU

CNN Computing Operations

Dense Matrix Multiplies

Recurrent Layers

Convolutions All-Reduce

Deep Learning ingredients

1 Randomly seed weights

2 Forward-pass

3 Cost

4 Backward-pass

5 Update weights

17 copy Copyright 2017 FUJITSU

Parallelisation Hierarchy

Vectorisation ndash Is SIMD parallelism used well

Scalar tuning ndash What happens in the pipeline

Memory ndash Is cache usage maximised or RAM access streamlined

Threading ndash do cores cooperation efficiently

Communication ndash can coordination in a distributed or

heterogeneous system be improved

18 copy Copyright 2017 FUJITSU

Naiumlve Nested Loops in CNN Algorithms

Forward Propagation

Backward Propagation Convolution

19 copy Copyright 2017 FUJITSU

A short word on Tensors

Tensors are systems of components organized by one or more indices that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

Tensor rank 0 is a scalar

Tensor rank 1 is a vector

Tensors are important in many areas of physics (general relativity electromagnetic theory)

In N-dimensional space a tensor of rank n has Nn components

Transformation rules are independent of choice of reference frame ndash ideal for expressing universal physical laws

20 copy Copyright 2017 FUJITSU

Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware

21 copy Copyright 2017 FUJITSU

Multi-threading CNN Training

1 thread

4 threads

16 threads

64 threads

Training on CIFAR-10 with Intel-Caffe 1000 iterations Full Solver

Dataset consists of 60000 32x32 colour images

in 10 classes with 6000 images per class ndash

50000 training images and 10000 test images

22 copy Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into

8 balanced MPI domainsHalo at interface

between domains

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacutethorn

yumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacute

thornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans Medium ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucirc

uumlyacutethornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 13: Designing for intensity: parallelism from analytics to AI

12 copy Copyright 2017 FUJITSU

Convolutional Neural Network Breakthrough

Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)

[Figure: timeline of CNN milestones: 2012 "Deep DNN first blood", 2013 "Network in Network", 2014 "Deeper Network"]

"One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000." (Krizhevsky et al., 2012)

Use of 2 GPUs – model parallelism (the layers are split between the two GPUs)
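The caption describes each GPU running only part of a split layer. A minimal pure-Python sketch of that idea, assuming one fully connected layer whose output neurons are divided between simulated devices (list slices stand in for GPU memory; all helper names are hypothetical): each device computes its share of the outputs from the full input, and concatenating the shares reproduces the unsplit layer.

```python
import random

def dense(x, weights, bias):
    """Fully connected layer: out[j] = sum_i x[i] * weights[i][j] + bias[j]."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]

def split_columns(weights, bias, parts):
    """Partition a layer's output neurons (columns of weights) into `parts` slices."""
    n_out = len(bias)
    step = -(-n_out // parts)  # ceiling division
    return [([row[s:s + step] for row in weights], bias[s:s + step])
            for s in range(0, n_out, step)]

def model_parallel_dense(x, weights, bias, devices=2):
    """Each simulated device owns one slice of the neurons and sees the full input.

    Concatenating the partial outputs is the point where real GPUs would
    have to exchange data before the next layer can run.
    """
    out = []
    for w_part, b_part in split_columns(weights, bias, devices):
        out.extend(dense(x, w_part, b_part))
    return out

rng = random.Random(0)
n_in, n_out = 6, 4
x = [rng.uniform(-1, 1) for _ in range(n_in)]
W = [[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_in)]
b = [rng.uniform(-1, 1) for _ in range(n_out)]

print(model_parallel_dense(x, W, b, devices=2) == dense(x, W, b))  # True
```

Because each slice performs exactly the same additions in the same order as the unsplit layer, the two results match bit for bit; the cost of the split shows up only as communication between devices.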


Neural Network starting point

$$\mathrm{Act}_L[j] = \sigma\Big(\sum_i \mathrm{Act}_{L-1}[i] \times W_L[i][j] + \mathrm{Bias}_L[j]\Big)$$

[Figure: feed-forward network with 3 neurons and 1 hidden layer; the inputs Act_{L-1}[1], Act_{L-1}[2] and Act_{L-1}[3] are multiplied by the weights W_L[1][1], W_L[2][1] and W_L[3][1], summed and passed through the activation function σ (e.g. tanh, ReLU)]

Fundamental multiply-add structure
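The multiply-add structure can be made concrete: for each output j, the neuron starts from Bias_L[j], accumulates Act_{L-1}[i] × W_L[i][j] over its inputs, and applies σ. A minimal sketch, with made-up input and weight values and `relu` as a hypothetical helper; W is indexed W[i][j] as in the slide's notation.

```python
import math

def relu(z):
    return max(0.0, z)

def neuron(prev_act, weights_col, bias, sigma=math.tanh):
    """Act_L[j] = sigma(sum_i Act_{L-1}[i] * W_L[i][j] + Bias_L[j]) for one j."""
    acc = bias
    for a, w in zip(prev_act, weights_col):
        acc += a * w  # one multiply-add per input connection
    return sigma(acc)

def layer(prev_act, W, bias, sigma=math.tanh):
    """Apply `neuron` to every output j; column j of W holds that neuron's weights."""
    return [neuron(prev_act, [row[j] for row in W], bias[j], sigma)
            for j in range(len(bias))]

# Three inputs feeding one neuron, as in the figure (values are illustrative).
act_prev = [1.0, 2.0, 3.0]
W = [[0.5], [0.25], [0.25]]
bias = [0.5]
print(layer(act_prev, W, bias, sigma=relu))  # [2.25]
```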


Vectorisation in Linear Algebra

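Core compute-intensive code in the Linpack benchmark:

c     DGEFA (LU factorisation): row elimination with column indexing
      do 30 j = kp1, n
         t = a(l,j)
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
         call daxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
   30 continue

c     DGESL (back substitution): now solve u*x = y
      do 40 kb = 1, n
         k = n + 1 - kb
         b(k) = b(k)/a(k,k)
         t = -b(k)
         call daxpy(k-1,t,a(1,k),1,b(1),1)
   40 continue

c     DAXPY inner loop (dy = dy + da*dx) for non-unit increments
      do 10 i = 1,n
         dy(iy) = dy(iy) + da*dx(ix)
         ix = ix + incx
         iy = iy + incy
   10 continue

Fujitsu K computer (Source: https://www.top500.org/lists/2017/06/)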
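For readers who do not work in Fortran, a minimal Python sketch of what the DAXPY calls above compute, including the strided access of the non-unit-increment loop. The function mirrors the BLAS argument order but is an illustration, not the reference implementation; Python lists are 0-based, so the Fortran start index `(-n+1)*incx + 1` becomes `(-n+1)*incx`. The unit-stride variant is the easiest case for a compiler to vectorise, because every element update is independent.

```python
def daxpy(n, da, dx, incx, dy, incy):
    """dy <- dy + da * dx over n elements, with strides incx and incy.

    As in the reference BLAS routine, a negative increment walks the vector
    from its far end, and the call is a no-op when n <= 0 or da == 0.
    """
    if n <= 0 or da == 0.0:
        return dy
    ix = (-n + 1) * incx if incx < 0 else 0
    iy = (-n + 1) * incy if incy < 0 else 0
    for _ in range(n):
        dy[iy] += da * dx[ix]  # one multiply-add per element
        ix += incx
        iy += incy
    return dy

def daxpy_unit_stride(da, dx, dy):
    """Contiguous case: in compiled code, SIMD lanes can process several
    of these independent updates per instruction."""
    return [y + da * x for x, y in zip(dx, dy)]

print(daxpy(3, 2.0, [1.0, 2.0, 3.0], 1, [10.0, 20.0, 30.0], 1))        # [12.0, 24.0, 36.0]
print(daxpy(3, 1.0, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2, [0.0] * 3, 1))  # [1.0, 3.0, 5.0]
print(daxpy(3, 1.0, [1.0, 2.0, 3.0], -1, [0.0] * 3, 1))                # [3.0, 2.0, 1.0]
```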


Network Illustration

Source: Nervana

Input layer: N = 28 × 28 pixels = 784 input units

Hidden layer: N = 100 hidden units (user-defined parameter); weights W_{i→j}: 784 × 100; biases b_j: 100

Output layer: N = 10 output units (one for each digit); each unit i encodes the probability that the input image is the digit i; weights W_{i→j}: 100 × 10; biases b_j: 10

Total parameters: 784 × 100 + 100 + 100 × 10 + 10 = 79,510

Cost function: c(output, truth)

Fully connected network (convolution not present for now)
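A minimal sketch of a forward pass through the 784–100–10 network on this slide, including the parameter count. The tanh hidden activation, softmax output, Gaussian initialisation and cross-entropy cost are assumptions for illustration; the slide only names a cost function c(output, truth). Pure Python keeps the example dependency-free, which makes it slow but easy to read.

```python
import math
import random

N_IN, N_HIDDEN, N_OUT = 28 * 28, 100, 10  # sizes from the slide

def init_layer(n_in, n_out, rng):
    # Small random weights W[i][j] (i -> j) and zero biases b[j] (assumed scheme).
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_out)] for _ in range(n_in)]
    return W, [0.0] * n_out

def dense(x, W, b):
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(probs, digit):
    # c(output, truth) with a one-hot truth vector reduces to -log p[truth].
    return -math.log(probs[digit])

rng = random.Random(42)
W1, b1 = init_layer(N_IN, N_HIDDEN, rng)
W2, b2 = init_layer(N_HIDDEN, N_OUT, rng)

n_params = sum(len(r) for r in W1) + len(b1) + sum(len(r) for r in W2) + len(b2)
print(n_params)  # 79510

image = [rng.random() for _ in range(N_IN)]  # stand-in for one 28x28 image
hidden = [math.tanh(v) for v in dense(image, W1, b1)]
probs = softmax(dense(hidden, W2, b2))
print(len(probs), round(sum(probs), 6))  # 10 1.0
print(cross_entropy(probs, digit=3))
```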


CNN Computing Operations

Dense Matrix Multiplies

Recurrent Layers

Convolutions

All-Reduce

Deep Learning ingredients

1. Randomly seed weights
2. Forward pass
3. Cost
4. Backward pass
5. Update weights
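A minimal sketch of the five ingredients on a deliberately tiny problem: a single sigmoid neuron learning the logical OR of two bits. The data set, the binary cross-entropy cost, the learning rate and the epoch count are all assumptions chosen for illustration; a CNN runs the same loop, with convolutions and matrix multiplies inside the forward and backward passes.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data set (an assumption for illustration): logical OR of two bits.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]

def train(epochs=1000, lr=1.0, seed=0):
    rng = random.Random(seed)
    # 1. Randomly seed weights
    w = [rng.uniform(-1.0, 1.0) for _ in range(2)]
    b = rng.uniform(-1.0, 1.0)
    history = []
    for _ in range(epochs):
        grad_w, grad_b, cost = [0.0, 0.0], 0.0, 0.0
        for x, t in DATA:
            # 2. Forward pass
            y = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            # 3. Cost (binary cross-entropy)
            cost -= t * math.log(y) + (1.0 - t) * math.log(1.0 - y)
            # 4. Backward pass: d(cost)/dz = y - t for sigmoid + cross-entropy
            dz = y - t
            grad_w[0] += dz * x[0]
            grad_w[1] += dz * x[1]
            grad_b += dz
        # 5. Update weights (gradient descent on the averaged gradient)
        n = len(DATA)
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
        history.append(cost / n)
    return w, b, history

w, b, history = train()
preds = [round(sigmoid(w[0] * x[0] + w[1] * x[1] + b)) for x, _ in DATA]
print(preds)                     # [0, 1, 1, 1]
print(history[-1] < history[0])  # True
```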


Parallelisation Hierarchy

Vectorisation – Is SIMD parallelism used well?

Scalar tuning – What happens in the pipeline?

Memory – Is cache usage maximised, or RAM access streamlined?

Threading – Do cores cooperate efficiently?

Communication – Can coordination in a distributed or heterogeneous system be improved?
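The Memory tier of this hierarchy is often addressed by loop tiling (cache blocking), an illustrative choice here rather than a technique named on the slide. A minimal pure-Python sketch, assuming matrices stored as lists of rows and an arbitrary default tile size; in CPython the interpreter overhead hides any cache benefit, so the sketch only shows the loop structure a compiled kernel would use to keep small blocks of A, B and C resident while they are reused.

```python
def matmul_naive(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, tile=32):
    """Same product, computed block by block; the i, k, j order inside a
    block also walks rows of B and C contiguously."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_c, row_a = C[i], A[i]
                    for k in range(kk, min(kk + tile, m)):
                        a_ik, row_b = row_a[k], B[k]
                        for j in range(jj, min(jj + tile, p)):
                            row_c[j] += a_ik * row_b[j]
    return C

A = [[(i * 7 + k) % 5 for k in range(7)] for i in range(5)]
B = [[(k * 3 + j) % 4 for j in range(3)] for k in range(7)]
print(matmul_tiled(A, B, tile=2) == matmul_naive(A, B))  # True
```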


Naïve Nested Loops in CNN Algorithms

[Figure: naïve nested-loop code for forward propagation and backward propagation of a convolution]
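The slide's loop code did not survive extraction. As a stand-in, a minimal sketch of what naïve nested loops for a convolution layer look like, assuming a single channel, a single filter, stride 1 and no padding (the "valid" cross-correlation that CNN frameworks call convolution). The backward pass accumulates gradients for both the kernel (dw) and the input (dx) from the incoming gradient dy; the function names are hypothetical.

```python
def conv2d_forward(x, w):
    """Valid 2-D cross-correlation, stride 1: y[i][j] = sum_{p,q} x[i+p][j+q] * w[p][q]."""
    in_h, in_w = len(x), len(x[0])
    k_h, k_w = len(w), len(w[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    y = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):              # output rows
        for j in range(out_w):          # output columns
            for p in range(k_h):        # kernel rows
                for q in range(k_w):    # kernel columns
                    y[i][j] += x[i + p][j + q] * w[p][q]
    return y

def conv2d_backward(x, w, dy):
    """Gradients of the loss w.r.t. the input (dx) and the kernel (dw), given dy."""
    k_h, k_w = len(w), len(w[0])
    dx = [[0.0] * len(x[0]) for _ in x]
    dw = [[0.0] * k_w for _ in range(k_h)]
    for i in range(len(dy)):
        for j in range(len(dy[0])):
            for p in range(k_h):
                for q in range(k_w):
                    dw[p][q] += dy[i][j] * x[i + p][j + q]
                    dx[i + p][j + q] += dy[i][j] * w[p][q]
    return dx, dw

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0], [0, -1]]
print(conv2d_forward(x, w))  # [[-4.0, -4.0], [-4.0, -4.0]]
dx, dw = conv2d_backward(x, w, [[1.0, 1.0], [1.0, 1.0]])
print(dw)  # [[12.0, 16.0], [24.0, 28.0]]
```

The six-deep loop nest (four here, plus channels and batch in a real layer) is what optimised libraries restructure for vectorisation and cache reuse.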


A short word on Tensors

Tensors are systems of components, organized by one or more indices, that transform according to specific rules under a set of transformations.

The number of indices is called the rank of the tensor.

A tensor of rank 0 is a scalar.

A tensor of rank 1 is a vector.

Tensors are important in many areas of physics (general relativity, electromagnetic theory).

In N-dimensional space, a tensor of rank n has N^n components.

The transformation rules are independent of the choice of reference frame, which makes tensors ideal for expressing universal physical laws.
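The component count is easy to check in code. A minimal sketch, assuming tensors are represented as nested Python lists (the array view that deep-learning libraries adopt, which ignores the transformation rules): the hypothetical helper `components(N, n)` enumerates every index tuple of a rank-n tensor in N-dimensional space, and `rank` counts the nesting depth needed to reach a scalar.

```python
from itertools import product

def components(N, n):
    """All index tuples (i1, ..., in) of a rank-n tensor in N-dimensional space."""
    return list(product(range(N), repeat=n))

def rank(t):
    """Number of indices needed to reach a scalar in a nested-list tensor."""
    r = 0
    while isinstance(t, list):
        t = t[0]
        r += 1
    return r

for n in range(4):
    print(n, len(components(3, n)))  # prints: 0 1, then 1 3, 2 9, 3 27

print(rank(5.0), rank([1, 2, 3]), rank([[1, 2], [3, 4]]))  # 0 1 2
```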


Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware


Multi-threading CNN Training

[Chart: results for 1, 4, 16 and 64 threads]

Training on CIFAR-10 with Intel-Caffe: 1000 iterations, full solver

The dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class: 50,000 training images and 10,000 test images.


MPI Parallelism in CFD

Global model decomposed into 8 balanced MPI domains, with domain surfaces adapted to cell weights

Halo at the interface between domains

Communicate between processes with MPI primitives: MPI_Send, MPI_Recv, MPI_Wait, MPI_Alltoall, MPI_Allreduce, MPI_Barrier
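The decomposition and halo exchange described above can be simulated without MPI. In this sketch, an illustration rather than CFD code, a 1-D field is split into 8 equal domains (uniform cell weights assumed), each domain receives one ghost cell from each neighbour, and a 3-point averaging stencil is applied locally; the gathered result matches the stencil applied to the undivided field. In a real MPI code, the list indexing in the hypothetical `exchange_halos` would be replaced by MPI_Send/MPI_Recv (or MPI_Sendrecv) between neighbouring ranks.

```python
def split_domains(field, n_domains):
    """Split a 1-D field into n_domains contiguous, balanced chunks."""
    n = len(field)
    bounds = [n * r // n_domains for r in range(n_domains + 1)]
    return [field[bounds[r]:bounds[r + 1]] for r in range(n_domains)]

def exchange_halos(chunks, left_bc, right_bc):
    """Pad each chunk with one ghost cell per side, taken from its neighbours."""
    padded = []
    for r, c in enumerate(chunks):
        left = chunks[r - 1][-1] if r > 0 else left_bc
        right = chunks[r + 1][0] if r < len(chunks) - 1 else right_bc
        padded.append([left] + c + [right])
    return padded

def smooth(padded):
    """3-point averaging stencil applied to the interior (non-ghost) cells."""
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]

field = [float(i * i % 17) for i in range(64)]
# Reference: stencil on the undivided field with zero boundary values.
reference = smooth([0.0] + field + [0.0])
# Decomposed: 8 domains, halo exchange, local stencil, then gather.
chunks = split_domains(field, 8)
local = [smooth(p) for p in exchange_halos(chunks, 0.0, 0.0)]
gathered = [v for chunk in local for v in chunk]
print(gathered == reference)  # True
```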


MPI in Deep Learning
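One widely used way MPI enters deep-learning training is data parallelism: each rank computes gradients on its own shard of a mini-batch, and an all-reduce sums them so every rank applies the same update. A minimal pure-Python simulation of that pattern follows; it is an illustration, not the method shown on the original slide, the all-reduce is emulated in-process rather than through MPI_Allreduce, and the model (one scalar weight with a squared-error loss) is an assumption.

```python
def grad_sse(w, batch):
    """Gradient of 0.5 * sum((w*x - y)^2) with respect to the scalar weight w."""
    return sum((w * x - y) * x for x, y in batch)

def allreduce_sum(values):
    """Emulated all-reduce: every rank ends up holding the sum of all contributions."""
    total = sum(values)
    return [total] * len(values)

def data_parallel_step(w, batch, n_ranks, lr):
    shards = [batch[r::n_ranks] for r in range(n_ranks)]   # split the mini-batch
    local = [grad_sse(w, shard) for shard in shards]       # per-rank gradients
    summed = allreduce_sum(local)                          # communication step
    return [w - lr * g / len(batch) for g in summed]       # same update on every rank

batch = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]      # data from y = 3x
w_ranks = data_parallel_step(0.0, batch, n_ranks=4, lr=0.1)
w_serial = 0.0 - 0.1 * grad_sse(0.0, batch) / len(batch)
print(w_ranks, w_serial)  # [2.25, 2.25, 2.25, 2.25] 2.25
```

With these small integer-valued inputs the sharded and serial sums agree exactly; with general floating-point data the regrouped sum can differ from the serial one in the last bits.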


MPI Parallel Performance


AI evolution driving CPU and GPU releases

[Chart: processors and accelerators placed by performance, by deployment (datacenter, edge, cloud) and by workload (inference, training)]

Intel® Xeon Phi™ Processor (Knights Mill)
Intel® Xeon Processor (Skylake)
Intel® Xeon® Processor + FPGA
Intel® Lake Crest deep neural network processor
Intel® Nervana
NVIDIA Tesla P4/P40
NVIDIA Drive PX
Google TPU
NVIDIA Pascal 100
FPGA SoC (Intel/Xilinx)
FUJITSU PRIMERGY CX600
K computer


Fujitsu Gateway – Intelligent Application Platform

[Diagram: Fujitsu solutions and products for HPC and AI, spanning workgroup, departmental and data center scale]

Solutions:
Cloud services: cloud bursting (Gateway), on-premise cloud
UNCAI Artificial Intelligence
Smart city surveillance
Manufacturing process optimisation
HPC for Data Analytics: reference architecture based on PRIMERGY with a parallel file system
PRIMEFLEX for HPC
Engineering Cloud
Industry 4.0
MONOZUKURI

Products:
CELSIUS
PRIMERGY RX2530 M4 and RX2540 M4 (SKL based)
PRIMERGY CX400 M4 (SKL based) with CX2550 M4 (HPC), CX2570 M4 (GPU) and CX2580 M4 (FPGA)
PRIMERGY CX600 M1 (KNL/KNM based)
Intel & Mellanox cluster interconnect, NVIDIA GPGPU
Storage: entry ETERNUS, high-end ETERNUS, NetApp, DDN
Liquid cooling + immersion cooling (FY2018)


New PRIMEFLEX Options

Reference designs defined for AI deep learning frameworks

PRIMEFLEX configuration tool provided for fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate the Fujitsu Intelligent Application Platform as the application platform within the software stack

Reference architecture for off-premise cloud-bursting capability


DL/HPC trends

DL opportunity represents 6–7% of the hyperscale market

Speculative figure; likely 100% y/y growth

DL is not a vertical market: it is more akin to an algorithm or method of computation, like an FFT

AI/DL exists in proximity to HPC:
• Driven by the same architectural objective – performance and scale
• Converged math and programming methodologies
• Technological cross-fertilization
  • Software: compilers, libraries, tools
  • Hardware: processors, memory, interconnect

Source: Intersect360 Research, 2016


Summary

Combine algorithmic expertise in HPC and ML: Fujitsu has the rare capability to combine technologies and provide a fully optimized solution.

The shape of a network is subject to skilled programming; optimise through algorithmic and modelling discoveries. The relationship between depth and result quality remains largely empirical.

AI usage is primarily in the cloud today; customers look for simplified, integrated solutions.


Deep Learning Networks

Image Identity


Unsupervised Learning

Genome, Market Segmentation, Fraud Detection, Astronomical data analysis, Google News

Page 14: Designing for intensity: parallelism from analytics to AI

13 copy Copyright 2017 FUJITSU

Neural Network starting point

119860119888119905 119871 119895 = 120590 119860119888119905 119871 minus 1 119894 119909 119882 119871 119894 119895 + 119861119894119886119904 119871 [119895]

119860119888119905 119871 minus 1 1

119860119888119905 119871 minus 1 2

119860119888119905 119871 minus 1 3

119882 119871 1][1

119882 119871 3][1

119882 119871 2][1 120590

Activation function

eg tanh ReLu

Weight

Feed-forward network

3 neurons 1 hidden layer

Fundamental multiply-add structure

14 copy Copyright 2017 FUJITSU

Vectorisation in Linear Algebra

Core intensive code in Linpack benchmark

do 30 j = kp1 n

t = a(lj)

if (l eq k) go to 20

a(lj) = a(kj)

a(kj) = t

20 continue

call daxpy(n-kta(k+1k)1a(k+1j)1)

30 continue

do 40 kb = 1 n

k = n + 1 - kb

b(k) = b(k)a(kk)

t = -b(k)

call daxpy(k-1ta(1k)1b(1)1)

40 continue

do 10 i = 1n

dy(iy) = dy(iy) + dadx(ix)

ix = ix + incx

iy = iy + incy

10 continue

Fujitsu K computer

Source httpswwwtop500orglists201706

15 copy Copyright 2017 FUJITSU

Network Illustration

Source Nervana

119882119894rarr119895 784 times 100

119887119895 100

119882119894rarr119895 100 times 10

119887119895 10

Total

parameters119888(119900119906119905119901119906119905 119905119903119906119905ℎ)

Cost function

N = 10 output units

(one for each digit)

Each unit i encodes the

probability of the input image

of being of the digit iN = 100 hidden units

(user-defined parameter)

N = 28 x 28 pixels

= 784 input units

Fully connected network

convolution not present for now

16 copy Copyright 2017 FUJITSU

CNN Computing Operations

Dense Matrix Multiplies

Recurrent Layers

Convolutions All-Reduce

Deep Learning ingredients

1 Randomly seed weights

2 Forward-pass

3 Cost

4 Backward-pass

5 Update weights

17 copy Copyright 2017 FUJITSU

Parallelisation Hierarchy

Vectorisation ndash Is SIMD parallelism used well

Scalar tuning ndash What happens in the pipeline

Memory ndash Is cache usage maximised or RAM access streamlined

Threading ndash do cores cooperation efficiently

Communication ndash can coordination in a distributed or

heterogeneous system be improved

18 copy Copyright 2017 FUJITSU

Naiumlve Nested Loops in CNN Algorithms

Forward Propagation

Backward Propagation Convolution

19 copy Copyright 2017 FUJITSU

A short word on Tensors

Tensors are systems of components organized by one or more indices that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

Tensor rank 0 is a scalar

Tensor rank 1 is a vector

Tensors are important in many areas of physics (general relativity electromagnetic theory)

In N-dimensional space a tensor of rank n has Nn components

Transformation rules are independent of choice of reference frame ndash ideal for expressing universal physical laws

20 copy Copyright 2017 FUJITSU

Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware

21 copy Copyright 2017 FUJITSU

Multi-threading CNN Training

1 thread

4 threads

16 threads

64 threads

Training on CIFAR-10 with Intel-Caffe 1000 iterations Full Solver

Dataset consists of 60000 32x32 colour images

in 10 classes with 6000 images per class ndash

50000 training images and 10000 test images

22 copy Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into

8 balanced MPI domainsHalo at interface

between domains

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacutethorn

yumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacute

thornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans Medium ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucirc

uumlyacutethornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 15: Designing for intensity: parallelism from analytics to AI

14 copy Copyright 2017 FUJITSU

Vectorisation in Linear Algebra

Core intensive code in Linpack benchmark

do 30 j = kp1 n

t = a(lj)

if (l eq k) go to 20

a(lj) = a(kj)

a(kj) = t

20 continue

call daxpy(n-kta(k+1k)1a(k+1j)1)

30 continue

do 40 kb = 1 n

k = n + 1 - kb

b(k) = b(k)a(kk)

t = -b(k)

call daxpy(k-1ta(1k)1b(1)1)

40 continue

do 10 i = 1n

dy(iy) = dy(iy) + dadx(ix)

ix = ix + incx

iy = iy + incy

10 continue

Fujitsu K computer

Source httpswwwtop500orglists201706

15 copy Copyright 2017 FUJITSU

Network Illustration

Source Nervana

119882119894rarr119895 784 times 100

119887119895 100

119882119894rarr119895 100 times 10

119887119895 10

Total

parameters119888(119900119906119905119901119906119905 119905119903119906119905ℎ)

Cost function

N = 10 output units

(one for each digit)

Each unit i encodes the

probability of the input image

of being of the digit iN = 100 hidden units

(user-defined parameter)

N = 28 x 28 pixels

= 784 input units

Fully connected network

convolution not present for now

16 copy Copyright 2017 FUJITSU

CNN Computing Operations

Dense Matrix Multiplies

Recurrent Layers

Convolutions All-Reduce

Deep Learning ingredients

1 Randomly seed weights

2 Forward-pass

3 Cost

4 Backward-pass

5 Update weights

17 copy Copyright 2017 FUJITSU

Parallelisation Hierarchy

Vectorisation ndash Is SIMD parallelism used well

Scalar tuning ndash What happens in the pipeline

Memory ndash Is cache usage maximised or RAM access streamlined

Threading ndash do cores cooperation efficiently

Communication ndash can coordination in a distributed or

heterogeneous system be improved

18 copy Copyright 2017 FUJITSU

Naiumlve Nested Loops in CNN Algorithms

Forward Propagation

Backward Propagation Convolution

19 copy Copyright 2017 FUJITSU

A short word on Tensors

Tensors are systems of components organized by one or more indices that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

Tensor rank 0 is a scalar

Tensor rank 1 is a vector

Tensors are important in many areas of physics (general relativity electromagnetic theory)

In N-dimensional space a tensor of rank n has Nn components

Transformation rules are independent of choice of reference frame ndash ideal for expressing universal physical laws

20 copy Copyright 2017 FUJITSU

Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware

21 copy Copyright 2017 FUJITSU

Multi-threading CNN Training

1 thread

4 threads

16 threads

64 threads

Training on CIFAR-10 with Intel-Caffe 1000 iterations Full Solver

Dataset consists of 60000 32x32 colour images

in 10 classes with 6000 images per class ndash

50000 training images and 10000 test images

22 copy Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into

8 balanced MPI domainsHalo at interface

between domains

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacutethorn

yumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacute

thornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans Medium ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucirc

uumlyacutethornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 16: Designing for intensity: parallelism from analytics to AI

15 copy Copyright 2017 FUJITSU

Network Illustration

Source Nervana

119882119894rarr119895 784 times 100

119887119895 100

119882119894rarr119895 100 times 10

119887119895 10

Total

parameters119888(119900119906119905119901119906119905 119905119903119906119905ℎ)

Cost function

N = 10 output units

(one for each digit)

Each unit i encodes the

probability of the input image

of being of the digit iN = 100 hidden units

(user-defined parameter)

N = 28 x 28 pixels

= 784 input units

Fully connected network

convolution not present for now

16 copy Copyright 2017 FUJITSU

CNN Computing Operations

Dense Matrix Multiplies

Recurrent Layers

Convolutions All-Reduce

Deep Learning ingredients

1 Randomly seed weights

2 Forward-pass

3 Cost

4 Backward-pass

5 Update weights

17 copy Copyright 2017 FUJITSU

Parallelisation Hierarchy

Vectorisation ndash Is SIMD parallelism used well

Scalar tuning ndash What happens in the pipeline

Memory ndash Is cache usage maximised or RAM access streamlined

Threading ndash do cores cooperation efficiently

Communication ndash can coordination in a distributed or

heterogeneous system be improved

18 copy Copyright 2017 FUJITSU

Naiumlve Nested Loops in CNN Algorithms

Forward Propagation

Backward Propagation Convolution

19 copy Copyright 2017 FUJITSU

A short word on Tensors

Tensors are systems of components organized by one or more indices that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

Tensor rank 0 is a scalar

Tensor rank 1 is a vector

Tensors are important in many areas of physics (general relativity electromagnetic theory)

In N-dimensional space a tensor of rank n has Nn components

Transformation rules are independent of choice of reference frame ndash ideal for expressing universal physical laws

20 copy Copyright 2017 FUJITSU

Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware

21 copy Copyright 2017 FUJITSU

Multi-threading CNN Training

1 thread

4 threads

16 threads

64 threads

Training on CIFAR-10 with Intel-Caffe 1000 iterations Full Solver

Dataset consists of 60000 32x32 colour images

in 10 classes with 6000 images per class ndash

50000 training images and 10000 test images

22 copy Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into

8 balanced MPI domainsHalo at interface

between domains

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

New PRIMEFLEX Options

Reference designs defined for AI/deep learning frameworks

PRIMEFLEX configuration tool provided for fast definition of a complete solution

PRIMEFLEX for HPC integrated solutions incorporate the Fujitsu Intelligent Application Platform as the application platform within the software stack

Reference architecture for off-premise, cloud-bursting capability

DL/HPC trends

The DL opportunity represents 6-7% of the hyperscale market (a speculative figure, with likely 100% y/y growth)

DL is not a vertical market: it is more akin to an algorithm or method of computation, like an FFT

AI/DL exists in proximity to HPC:
• Driven by the same architectural objective: performance and scale
• Converged math and programming methodologies
• Technological cross-fertilization in software (compilers, libraries, tools) and hardware (processors, memory, interconnect)

Source: Intersect360 Research, 2016

Summary

Combine algorithmic expertise in HPC and ML: Fujitsu has the rare capability to combine technologies and provide a fully optimized solution

The shape of a network is subject to skilled programming; optimise it through algorithmic and modelling discoveries. The relationship between depth and result quality remains largely empirical

AI usage is primarily in the cloud today; customers look for simplified, integrated solutions

Deep Learning Networks

Image Identity

Unsupervised Learning

Genome, Market Segmentation, Fraud Detection, Astronomical data analysis, Google News

CNN Computing Operations

• Dense matrix multiplies
• Recurrent layers
• Convolutions
• All-reduce

Deep Learning ingredients

1. Randomly seed weights
2. Forward pass
3. Cost
4. Backward pass
5. Update weights
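The five ingredients correspond directly to the steps of a gradient-descent training loop. The minimal sketch below labels each step. It assumes a single linear neuron (y = w·x + b) with a mean-squared-error cost and hand-derived gradients, not a CNN. In real frameworks, step 4 comes from automatic differentiation, and the forward and backward passes are built from the dense matrix multiplies and convolutions listed above. The function name `train_linear` and its defaults are illustrative.

```python
import random


def train_linear(xs, ys, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w*x + b by full-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    # 1. Randomly seed weights
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    n = len(xs)
    cost = float("inf")
    for _ in range(epochs):
        # 2. Forward pass: predictions for every sample
        preds = [w * x + b for x in xs]
        # 3. Cost: mean squared error of the predictions
        cost = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        # 4. Backward pass: gradients of the cost with respect to w and b
        dw = sum(2.0 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        db = sum(2.0 * (p - y) for p, y in zip(preds, ys)) / n
        # 5. Update weights: take a step against the gradient
        w -= lr * dw
        b -= lr * db
    return w, b, cost


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [2.0 * x + 1.0 for x in xs]  # ground truth: w = 2, b = 1
    w, b, cost = train_linear(xs, ys)
    print(f"w={w:.4f} b={b:.4f} cost={cost:.2e}")
```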

Parallelisation Hierarchy

Vectorisation: is SIMD parallelism used well?

Scalar tuning: what happens in the pipeline?

Memory: is cache usage maximised, or RAM access streamlined?

Threading: do cores cooperate efficiently?

Communication: can coordination in a distributed or heterogeneous system be improved?
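The memory level of this hierarchy is often addressed with loop tiling, also called cache blocking. The loops are reordered so that a tile of each operand is reused while it is still in cache. The pure-Python sketch below shows only the loop structure, using matrix multiplication as the example kernel. The block size `bs` is an arbitrary assumption. Python's interpreter overhead hides any cache effect, so the benefit only appears in compiled code such as C or Fortran.

```python
from typing import List

Matrix = List[List[float]]


def matmul_naive(a: Matrix, b: Matrix) -> Matrix:
    """Textbook i-j-p loop order: the inner loop walks down a column of b."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def matmul_blocked(a: Matrix, b: Matrix, bs: int = 32) -> Matrix:
    """Cache-blocked matmul over bs x bs tiles.

    Each tile of a, b and c is reused while it would still be cache-resident
    in a compiled language. Inside a tile the i-p-j order makes the innermost
    loop unit-stride over rows of b and c, the access pattern a compiler can
    also vectorise with SIMD instructions.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, bs):
        for pp in range(0, k, bs):
            for jj in range(0, m, bs):
                for i in range(ii, min(ii + bs, n)):
                    a_row, c_row = a[i], c[i]
                    for p in range(pp, min(pp + bs, k)):
                        a_ip, b_row = a_row[p], b[p]
                        for j in range(jj, min(jj + bs, m)):
                            c_row[j] += a_ip * b_row[j]
    return c
```

Both functions compute the same product. Only the order in which memory is touched changes, and that order is what the cache and SIMD levels of the hierarchy care about.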

Naïve Nested Loops in CNN Algorithms

Forward Propagation

Backward Propagation Convolution
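A direct sketch of such naive nested loops follows. It assumes stride 1, no padding, no bias and a single image with no batch dimension. As in most deep learning frameworks, the "convolution" is computed as a cross-correlation, meaning the filter is not flipped. Both passes are six-deep loop nests (seven with a batch dimension). Optimised libraries reorder, block and vectorise exactly these loops. The names `conv_forward`, `conv_backward` and `zeros` are illustrative.

```python
def zeros(*shape):
    """Nested lists of 0.0 with the given shape."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [zeros(*shape[1:]) for _ in range(shape[0])]


def conv_forward(x, w):
    """y[k][i][j] = sum over c, r, s of w[k][c][r][s] * x[c][i+r][j+s].

    x: C x H x W input, w: K x C x R x S filters, y: K x P x Q output.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    P, Q = H - R + 1, W - S + 1
    y = zeros(K, P, Q)
    for k in range(K):                          # output channel
        for i in range(P):                      # output row
            for j in range(Q):                  # output column
                acc = 0.0
                for c in range(C):              # input channel
                    for r in range(R):          # filter row
                        for s in range(S):      # filter column
                            acc += w[k][c][r][s] * x[c][i + r][j + s]
                y[k][i][j] = acc
    return y


def conv_backward(x, w, dy):
    """Given dL/dy, return (dL/dx, dL/dw) using the same loop nest as the forward pass."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    P, Q = len(dy[0]), len(dy[0][0])
    dx, dw = zeros(C, H, W), zeros(K, C, R, S)
    for k in range(K):
        for i in range(P):
            for j in range(Q):
                g = dy[k][i][j]
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            dw[k][c][r][s] += g * x[c][i + r][j + s]   # weight gradient
                            dx[c][i + r][j + s] += g * w[k][c][r][s]   # input gradient
    return dx, dw
```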

A short word on Tensors

Tensors are systems of components, organized by one or more indices, that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

A tensor of rank 0 is a scalar; a tensor of rank 1 is a vector

Tensors are important in many areas of physics (general relativity, electromagnetic theory)

In N-dimensional space, a tensor of rank n has N^n components

The transformation rules are independent of the choice of reference frame, which makes tensors ideal for expressing universal physical laws
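For Cartesian tensors under a rotation R of the axes, the rank-1 rule is v'_i = Σ_k R_ik v_k. The rank-2 rule applies one factor of R per index: T'_ij = Σ_k Σ_l R_ik R_jl T_kl. Contracting all indices gives a rank-0 quantity, such as the trace Σ_i T_ii, and that value is the same in every frame. The sketch below checks this numerically in N = 2 dimensions, where a rank-2 tensor has 2^2 = 4 components. The helper names are illustrative.

```python
import math


def rotation(theta):
    """2-D rotation matrix R (orthogonal: R times its transpose is I)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def transform_vector(v, rot):
    """Rank 1: v'_i = sum_k R_ik v_k."""
    n = len(v)
    return [sum(rot[i][k] * v[k] for k in range(n)) for i in range(n)]


def transform_rank2(t, rot):
    """Rank 2: T'_ij = sum_k sum_l R_ik R_jl T_kl (one factor of R per index)."""
    n = len(t)
    return [[sum(rot[i][k] * rot[j][l] * t[k][l] for k in range(n) for l in range(n))
             for j in range(n)] for i in range(n)]


def trace(t):
    """Contracting both indices gives a rank-0 tensor (a scalar)."""
    return sum(t[i][i] for i in range(len(t)))


def n_components(dim, rank):
    """In dim-dimensional space a tensor of the given rank has dim**rank components."""
    return dim ** rank


if __name__ == "__main__":
    T = [[1.0, 2.0], [3.0, 4.0]]
    T_rot = transform_rank2(T, rotation(0.7))
    print(trace(T), trace(T_rot))  # equal up to rounding: the trace is frame-independent
    print(n_components(3, 2))      # 9
```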

Optimised Functions

Software libraries: tensor functions hand-coded for CPUs or GPUs, e.g. Intel MKL-DNN

Emergence of dedicated processing units and ISAs: tensor arithmetic in hardware
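One classic way to turn a convolution into a dense matrix multiply is im2col followed by GEMM. A dense matrix multiply is the kind of routine hand-tuned libraries execute fastest, and Caffe's CPU convolution layer works this way. The plain-Python sketch below uses the same assumptions as a naive loop nest: stride 1, no padding and a single image. It does not describe how Intel MKL-DNN implements convolutions internally. The `gemm` helper stands in for an optimised BLAS routine, and all function names are illustrative.

```python
def im2col(x, R, S):
    """Unroll each R x S patch of x (C x H x W) into one column.

    Returns a (C*R*S) x (P*Q) matrix as a list of rows, with P = H-R+1, Q = W-S+1.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    P, Q = H - R + 1, W - S + 1
    rows = []
    for c in range(C):
        for r in range(R):
            for s in range(S):
                # One row per filter tap (c, r, s); one column per output pixel (i, j).
                rows.append([x[c][i + r][j + s] for i in range(P) for j in range(Q)])
    return rows


def gemm(a, b):
    """Plain matrix multiply; stands in for an optimised BLAS GEMM call."""
    return [[sum(a_ik * b[k][j] for k, a_ik in enumerate(a_row)) for j in range(len(b[0]))]
            for a_row in a]


def conv_im2col(x, w):
    """Stride-1, unpadded convolution (cross-correlation) computed as im2col + GEMM."""
    K, C, R, S = len(w), len(w[0]), len(w[0][0]), len(w[0][0][0])
    P, Q = len(x[0]) - R + 1, len(x[0][0]) - S + 1
    # Each filter becomes one row, flattened in the same (c, r, s) order as im2col.
    w_mat = [[w[k][c][r][s] for c in range(C) for r in range(R) for s in range(S)]
             for k in range(K)]
    out = gemm(w_mat, im2col(x, R, S))  # K x (P*Q)
    # Fold each output row back into a P x Q feature map.
    return [[out[k][i * Q:(i + 1) * Q] for i in range(P)] for k in range(K)]
```

The trade-off is memory. The im2col buffer duplicates each input value roughly R·S times, in exchange for handing all of the arithmetic to one large, highly tuned matrix multiply.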

Multi-threading CNN Training

[Chart: CNN training with 1, 4, 16 and 64 threads]

Training on CIFAR-10 with Intel-Caffe, 1000 iterations, full solver

The dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class: 50,000 training images and 10,000 test images
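Thread-scaling charts like this one are often read against Amdahl's law. The law bounds the speed-up on n threads when only a fraction p of the runtime parallelises: S(n) = 1 / ((1 - p) + p/n). The sketch below evaluates it for the thread counts on the slide. The parallel fraction p = 0.95 is purely hypothetical and is not fitted to the Intel-Caffe measurements. Real scaling can fall further below this bound because of memory bandwidth and synchronisation costs.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Amdahl's law: S(n) = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    P_HYPOTHETICAL = 0.95  # assumed parallel fraction, not a measured value
    for n in (1, 4, 16, 64):
        s = amdahl_speedup(P_HYPOTHETICAL, n)
        print(f"{n:2d} threads: speed-up {s:5.2f}, parallel efficiency {s / n:6.1%}")
```

With p = 0.95, the speed-up can never exceed 1 / (1 - p) = 20, however many threads are added.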

MPI Parallelism in CFD

Global model decomposed into 8 balanced MPI domains, with a halo at the interface between domains
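The decomposition can be illustrated with a one-dimensional stand-in for a CFD mesh. Each of 8 domains stores its owned cells plus one halo cell on each side. Before every solver sweep, each halo is refreshed from the neighbouring domain's edge cell.

In a real MPI code, each domain would live on its own rank. The refresh would then be point-to-point messages, for example MPI_Send/MPI_Recv, or non-blocking calls completed with MPI_Wait.

The sketch below makes these assumptions:
* All domains are simulated in one Python process.
* "Balanced" simply means equal cell counts (differing by at most one).
* A Jacobi sweep for the 1-D Laplace equation is the example workload.

Because the halos hold exactly the values a single global array would see, the decomposed result matches the undecomposed one bit for bit.

```python
from typing import List


def split_with_halos(u: List[float], ndomains: int) -> List[List[float]]:
    """Give each domain a near-equal share of cells plus one halo cell per side."""
    n = len(u)
    assert n >= ndomains, "every domain must own at least one cell"
    bounds = [d * n // ndomains for d in range(ndomains + 1)]
    return [[0.0] + u[bounds[d]:bounds[d + 1]] + [0.0] for d in range(ndomains)]


def exchange_halos(domains, left_bc, right_bc):
    """Copy each neighbour's edge cell into the local halo.

    In an MPI code each assignment would be a message between ranks
    (e.g. MPI_Send/MPI_Recv); here every domain lives in one process.
    """
    last = len(domains) - 1
    for d, loc in enumerate(domains):
        loc[0] = domains[d - 1][-2] if d > 0 else left_bc        # from left neighbour
        loc[-1] = domains[d + 1][1] if d < last else right_bc    # from right neighbour


def jacobi_sweep(domains):
    """One Jacobi sweep for the 1-D Laplace equation, using only local data."""
    return [[loc[0]] + [0.5 * (loc[i - 1] + loc[i + 1]) for i in range(1, len(loc) - 1)] + [loc[-1]]
            for loc in domains]


def solve_decomposed(u, ndomains, sweeps, left_bc, right_bc):
    domains = split_with_halos(u, ndomains)
    for _ in range(sweeps):
        exchange_halos(domains, left_bc, right_bc)  # communication phase
        domains = jacobi_sweep(domains)             # computation phase
    return [v for loc in domains for v in loc[1:-1]]  # gather the owned cells


if __name__ == "__main__":
    u0 = [0.0] * 40
    print(solve_decomposed(u0, ndomains=8, sweeps=100, left_bc=1.0, right_bc=0.0)[:5])
```

Alternating a communication phase with a purely local computation phase is the pattern that lets each rank work independently between halo exchanges.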

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacutethorn

yumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacute

thornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans Medium ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucirc

uumlyacutethornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 18: Designing for intensity: parallelism from analytics to AI

17 copy Copyright 2017 FUJITSU

Parallelisation Hierarchy

Vectorisation ndash Is SIMD parallelism used well

Scalar tuning ndash What happens in the pipeline

Memory ndash Is cache usage maximised or RAM access streamlined

Threading ndash do cores cooperation efficiently

Communication ndash can coordination in a distributed or

heterogeneous system be improved

18 copy Copyright 2017 FUJITSU

Naiumlve Nested Loops in CNN Algorithms

Forward Propagation

Backward Propagation Convolution

19 copy Copyright 2017 FUJITSU

A short word on Tensors

Tensors are systems of components organized by one or more indices that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

Tensor rank 0 is a scalar

Tensor rank 1 is a vector

Tensors are important in many areas of physics (general relativity electromagnetic theory)

In N-dimensional space a tensor of rank n has Nn components

Transformation rules are independent of choice of reference frame ndash ideal for expressing universal physical laws

20 copy Copyright 2017 FUJITSU

Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware

21 copy Copyright 2017 FUJITSU

Multi-threading CNN Training

1 thread

4 threads

16 threads

64 threads

Training on CIFAR-10 with Intel-Caffe 1000 iterations Full Solver

Dataset consists of 60000 32x32 colour images

in 10 classes with 6000 images per class ndash

50000 training images and 10000 test images

22 copy Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into

8 balanced MPI domainsHalo at interface

between domains

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacutethorn

yumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacute

thornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans Medium ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucirc

uumlyacutethornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 19: Designing for intensity: parallelism from analytics to AI

18 copy Copyright 2017 FUJITSU

Naiumlve Nested Loops in CNN Algorithms

Forward Propagation

Backward Propagation Convolution

19 copy Copyright 2017 FUJITSU

A short word on Tensors

Tensors are systems of components organized by one or more indices that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

Tensor rank 0 is a scalar

Tensor rank 1 is a vector

Tensors are important in many areas of physics (general relativity electromagnetic theory)

In N-dimensional space a tensor of rank n has Nn components

Transformation rules are independent of choice of reference frame ndash ideal for expressing universal physical laws

20 copy Copyright 2017 FUJITSU

Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware

21 copy Copyright 2017 FUJITSU

Multi-threading CNN Training

1 thread

4 threads

16 threads

64 threads

Training on CIFAR-10 with Intel-Caffe 1000 iterations Full Solver

Dataset consists of 60000 32x32 colour images

in 10 classes with 6000 images per class ndash

50000 training images and 10000 test images

22 copy Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into

8 balanced MPI domainsHalo at interface

between domains

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacutethorn

yumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacute

thornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans Medium ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucirc

uumlyacutethornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 20: Designing for intensity: parallelism from analytics to AI

19 copy Copyright 2017 FUJITSU

A short word on Tensors

Tensors are systems of components organized by one or more indices that transform according to specific rules under a set of transformations

The number of indices is called the rank of the tensor

Tensor rank 0 is a scalar

Tensor rank 1 is a vector

Tensors are important in many areas of physics (general relativity electromagnetic theory)

In N-dimensional space a tensor of rank n has Nn components

Transformation rules are independent of choice of reference frame ndash ideal for expressing universal physical laws

20 copy Copyright 2017 FUJITSU

Optimised Functions

Software Libraries

Tensor functions hand-coded for CPUs or GPUs

Intel MKL-DNN

Emergence of dedicated processing units and ISAs

Tensor Arithmetic in hardware

21 copy Copyright 2017 FUJITSU

Multi-threading CNN Training

1 thread

4 threads

16 threads

64 threads

Training on CIFAR-10 with Intel-Caffe 1000 iterations Full Solver

Dataset consists of 60000 32x32 colour images

in 10 classes with 6000 images per class ndash

50000 training images and 10000 test images

22 copy Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into

8 balanced MPI domainsHalo at interface

between domains

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions


31 © Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

32 © Copyright 2017 FUJITSU

Unsupervised Learning

Genome, market segmentation, fraud detection

Astronomical data analysis, Google News
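The slide only names application areas. To make "unsupervised" concrete for one of them, market segmentation, here is a minimal k-means (Lloyd's algorithm) sketch. It is an illustration rather than a method from the talk: the two-feature customer points, the function name and the choice of k are hypothetical.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's k-means on 2-D points; no labels are used (unsupervised)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    clusters = []
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for c, members in enumerate(clusters):
            if members:
                updated.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
            else:
                updated.append(centroids[c])  # leave an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return centroids, clusters


if __name__ == "__main__":
    # Hypothetical customers described by (spend, visits); two segments are expected.
    customers = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
    centres, segments = kmeans(customers, k=2)
    for centre, members in zip(centres, segments):
        print(centre, members)
```

No labels are supplied: the segments emerge only from distances between points, which is what distinguishes this from the supervised image-identity case above.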


20 © Copyright 2017 FUJITSU

Optimised Functions

Software libraries: tensor functions hand-coded for CPUs or GPUs (e.g. Intel MKL-DNN)

Emergence of dedicated processing units and ISAs: tensor arithmetic in hardware
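Hand-coded libraries such as Intel MKL-DNN get much of their speed from restructuring loops (blocking for cache, vectorisation, threading). The sketch below shows only the loop-tiling idea on a plain matrix multiply. The function names and tile size are illustrative assumptions, this is not MKL-DNN code, and in pure Python it will not run faster: the benefit comes from cache reuse in compiled kernels.

```python
def matmul_naive(a, b):
    """Reference C = A @ B for lists of lists (A is n x m, B is m x p)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c


def matmul_tiled(a, b, tile=32):
    """Same product, computed one tile x tile block at a time (loop tiling)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # In a compiled kernel these three blocks would stay resident in cache.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    assert matmul_tiled(a, b, tile=2) == matmul_naive(a, b)
    print(matmul_tiled(a, b, tile=2))
```

Tiling changes only the order in which blocks are visited; for each output element the k-terms are still accumulated in ascending order, so both versions return identical results.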

21 © Copyright 2017 FUJITSU

Multi-threading CNN Training

[Chart: runs with 1, 4, 16 and 64 threads]

Training on CIFAR-10 with Intel-Caffe: 1000 iterations, full solver

The dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class: 50,000 training images and 10,000 test images.
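The slide's timing chart did not survive extraction, so no measured speedups are reproduced here. As a hedged way to reason about going from 1 to 64 threads, Amdahl's law gives the best-case speedup when a fraction f of each training iteration stays serial: S(p) = 1 / (f + (1 - f) / p). The value f = 0.05 below is purely hypothetical.

```python
def amdahl_speedup(serial_fraction, threads):
    """Best-case speedup on `threads` workers when `serial_fraction` of the work cannot be parallelised."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)


def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear scaling that is achieved."""
    return speedup / threads


if __name__ == "__main__":
    f = 0.05  # hypothetical serial fraction, not measured from the talk
    for p in (1, 4, 16, 64):
        s = amdahl_speedup(f, p)
        print(f"{p:>2} threads: speedup {s:5.2f}, efficiency {parallel_efficiency(s, p):6.1%}")
```

With f = 0.05 the speedup can never exceed 1/f = 20, however many threads are added, which is why real scaling curves flatten. Measured timings T(1)/T(p) from a run like the one above can be compared against this bound.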

22 © Copyright 2017 FUJITSU

MPI Parallelism in CFD

The global model is decomposed into 8 balanced MPI domains, with a halo at the interface between domains.

Processes communicate with MPI primitives: MPI_Send, MPI_Recv, MPI_Wait, MPI_Alltoall, MPI_Allreduce, MPI_Barrier.

Domain surfaces are adapted to cell weights.
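The decomposition described on this slide can be sketched serially. This is a minimal illustration, assuming a 1-D mesh, a 3-point averaging stencil in place of a real CFD solver, and hypothetical per-cell weights; all function names are invented for the example. Each simulated rank owns a contiguous, weight-balanced range of cells plus one ghost (halo) cell on each side, and `halo_exchange` stands in for what MPI_Sendrecv, or MPI_Isend/MPI_Irecv followed by MPI_Wait, would do between neighbouring processes.

```python
def partition_by_weight(weights, nparts):
    """Split cells 0..n-1 into `nparts` contiguous (start, end) ranges of roughly equal total weight."""
    n = len(weights)
    if not 1 <= nparts <= n:
        raise ValueError("need 1 <= nparts <= number of cells")
    total = sum(weights)
    bounds, acc, start = [], 0.0, 0
    for i, w in enumerate(weights):
        closed = len(bounds)
        if closed == nparts - 1:
            break  # every remaining cell belongs to the last domain
        acc += w
        cells_left = n - (i + 1)
        domains_left = nparts - closed - 1
        # Close the domain at its share of the weight, or early enough that each
        # remaining domain still gets at least one cell.
        if acc >= total * (closed + 1) / nparts or cells_left == domains_left:
            bounds.append((start, i + 1))
            start = i + 1
    bounds.append((start, n))
    return bounds


def halo_exchange(domains):
    """Copy edge cells into neighbours' ghost cells (the job of MPI_Sendrecv in a real code)."""
    for r in range(len(domains) - 1):
        left, right = domains[r], domains[r + 1]
        left[-1] = right[1]  # right's first owned cell -> left's right ghost
        right[0] = left[-2]  # left's last owned cell -> right's left ghost


def smooth_local(loc, first, last):
    """One 3-point averaging sweep over owned cells; the two global end cells stay fixed."""
    new = loc[:]
    lo = 2 if first else 1  # on the first rank, local index 1 is global cell 0
    hi = len(loc) - 2 if last else len(loc) - 1
    for i in range(lo, hi):
        new[i] = (loc[i - 1] + loc[i] + loc[i + 1]) / 3.0
    return new


def smooth_serial(u, steps):
    for _ in range(steps):
        u = [u[0]] + [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)] + [u[-1]]
    return u


def smooth_decomposed(u, weights, nparts, steps):
    """Same computation split across `nparts` simulated ranks, one ghost cell per side."""
    domains = [[0.0] + u[s:e] + [0.0] for s, e in partition_by_weight(weights, nparts)]
    last = len(domains) - 1
    for _ in range(steps):
        halo_exchange(domains)
        domains = [smooth_local(d, r == 0, r == last) for r, d in enumerate(domains)]
    return [x for d in domains for x in d[1:-1]]


if __name__ == "__main__":
    n = 40
    u = [float(i % 5) for i in range(n)]   # hypothetical initial field
    w = [1.0 + (i % 7) for i in range(n)]  # hypothetical per-cell cost
    print(partition_by_weight(w, 8))
    assert smooth_decomposed(u, w, 8, steps=5) == smooth_serial(u, 5)
```

Because every rank refreshes its halo before each sweep, the decomposed result matches the serial one exactly; without the exchange, cells next to an interface would read stale neighbour values.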

23 © Copyright 2017 FUJITSU

MPI in Deep Learning
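The content of this slide did not survive extraction. One common way MPI appears in deep learning, popularised by frameworks such as Horovod, is synchronous data parallelism: each process computes gradients on its own shard of the mini-batch, and an MPI_Allreduce sums them so that every process applies the same averaged update. The serial sketch below assumes a toy linear model and invented helper names; `allreduce_sum` only mimics the result of MPI_Allreduce with MPI_SUM, not its communication algorithm.

```python
def allreduce_sum(per_rank):
    """Serial stand-in for MPI_Allreduce with MPI_SUM: every rank ends up with the elementwise sum."""
    total = [sum(column) for column in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def local_gradient(w, b, xs, ys):
    """Gradient of the summed squared error of the model y = w*x + b over one shard."""
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        err = w * x + b - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    return [gw, gb]


def data_parallel_step(w, b, xs, ys, nranks, lr):
    """One synchronous SGD step with the mini-batch split across `nranks` simulated workers."""
    shards = [(xs[r::nranks], ys[r::nranks]) for r in range(nranks)]  # strided split of the batch
    grads = [local_gradient(w, b, sx, sy) for sx, sy in shards]        # one per rank on a real cluster
    gw, gb = (g / len(xs) for g in allreduce_sum(grads)[0])           # sum, then average over the batch
    return w - lr * gw, b - lr * gb


if __name__ == "__main__":
    xs = [i / 10.0 for i in range(32)]  # hypothetical data for y = 3x + 1
    ys = [3.0 * x + 1.0 for x in xs]
    w, b = 0.0, 0.0
    for _ in range(2000):
        w, b = data_parallel_step(w, b, xs, ys, nranks=4, lr=0.05)
    print(round(w, 3), round(b, 3))
```

Since the summed shard gradients equal the full-batch gradient, each step matches single-process training up to floating-point rounding, and the fitted parameters should approach w = 3, b = 1. This is the same MPI_Allreduce primitive listed on the CFD slide.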

24 © Copyright 2017 FUJITSU

MPI Parallel Performance

25 © Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

[Chart: performance of AI processors across data center, edge and cloud, for inference and training]

• Intel® Xeon Phi™ Processor (Knights Mill)
• Intel® Xeon® Processor (Skylake)
• Intel® Xeon® Processor + FPGA
• Intel® Lake Crest deep neural network processor
• Intel® Nervana
• NVIDIA Tesla P4/P40
• NVIDIA Drive PX
• Google TPU
• NVIDIA Pascal 100
• FPGA SoC (Intel/Xilinx)
• FUJITSU PRIMERGY CX600
• K Computer

26 © Copyright 2017 FUJITSU

Fujitsu Gateway: Intelligent Application Platform

Cloud Services

Cloud bursting: Gateway

On-premise cloud: UNCAI

Artificial Intelligence: Smart City surveillance, manufacturing process optimisation

HPC for Data Analytics: based on PRIMERGY with a parallel file system

Reference Architecture

Products and Solutions

CELSIUS

Intel & Mellanox cluster interconnect, NVIDIA GPGPU

PRIMERGY RX2540 M4 (SKL-based)

PRIMEFLEX for HPC Solutions

Products: CX600 M1 (KNL/KNM-based)

Entry ETERNUS storage

Cloud

PRIMERGY RX2530 M4 (SKL-based)

High-end ETERNUS storage

NetApp storage, DDN storage

Workgroup, Data Center, Departmental

Liquid cooling + immersion cooling (FY2018)

CX400 M4 (SKL-based): CX2550 M4 (HPC), CX2570 M4 (GPU), CX2580 M4 (FPGA)

Engineering Cloud

Industry 4.0

MONOZUKURI

© FUJITSU LIMITED 2017

27 © Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI deep learning frameworks

PRIMEFLEX configuration tool provided for fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate the Fujitsu Intelligent Application Platform as the application platform within the software stack

Reference architecture for off-premise cloud-bursting capability

28 © Copyright 2017 FUJITSU

DL/HPC trends

The DL opportunity represents 6-7% of the hyperscale market

Speculative figure; likely 100% y/y growth

DL is not a vertical market

It is more akin to an algorithm or method of computation, like an FFT

AI/DL exists in proximity to HPC

Driven by the same architectural objective: performance and scale

Converged math and programming methodologies

Technological cross-fertilization:

• Software: compilers, libraries, tools

• Hardware: processors, memory, interconnect

Source: Intersect360 Research, 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacutethorn

yumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacute

thornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans Medium ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucirc

uumlyacutethornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 22: Designing for intensity: parallelism from analytics to AI

21 copy Copyright 2017 FUJITSU

Multi-threading CNN Training

1 thread

4 threads

16 threads

64 threads

Training on CIFAR-10 with Intel-Caffe 1000 iterations Full Solver

Dataset consists of 60000 32x32 colour images

in 10 classes with 6000 images per class ndash

50000 training images and 10000 test images

22 copy Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into

8 balanced MPI domainsHalo at interface

between domains

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacutethorn

yumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacute

thornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans Medium ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucirc

uumlyacutethornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 23: Designing for intensity: parallelism from analytics to AI

22 copy Copyright 2017 FUJITSU

MPI Parallelism in CFD

Global model decomposed into

8 balanced MPI domainsHalo at interface

between domains

Communicate between processes with

MPI primitivesMPI_Send MPI_Recv MPI_Wait

MPI_AllToAll MPI_AllReduce MPI_BarrierDomain surfaces

adapted to cell

weights

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacutethorn

yumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacute

thornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans Medium ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucirc

uumlyacutethornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 24: Designing for intensity: parallelism from analytics to AI

23 copy Copyright 2017 FUJITSU

MPI in Deep Learning

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacutethorn

yumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacute

thornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans Medium ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucirc

uumlyacutethornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 25: Designing for intensity: parallelism from analytics to AI

24 copy Copyright 2017 FUJITSU

MPI Parallel Performance

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer

26 copy Copyright 2017 FUJITSU

Fujitsu Gateway ndashIntelligent Application Platform

Cloud Services

Cloud bursting ndash

Gateway

On premise cloud ndash

UNCAIArtificial Intelligence

Smart City Surveillance

Manufacturing process

optimisation

HPC for Data Analytics

Based on PRIMERGY

with Parallel File

System

Reference Architecture

Products and Solutions

CELSIUS

Intel amp Mellanox

Cluster InterconnectNVDIA GPGPU

PRIMERGY

RX2540 M4

SKL based

copy FUJITSU LIMITED 201726

PRIMEFLEX for HPC

Solutions

ProductsCX600 M1

KNL KNM

based

Entry ETERNUS

storage Cloud

PRIMERGY

RX2530 M4

SKL based

High-end ETERNUS

storage

NetApp storageDDN storage

Workgroup Data CenterDepartmental

Liquid Cooling

+ immersion cooling

FY2018

CX400 M4

SKL based

CX2550

M4

HPC

CX2570

M4

GPU

CX2580

M4

FPGA

Engineering Cloud

Industry 40

MONOZUKURI

27 copy Copyright 2017 FUJITSU

New PRIMEFLEX Options

Reference designs defined for AI Deep Learning frameworks

PRIMEFLEX configuration tool provided for

fast definition of a complete solution

PRIMEFLEX for HPC Integrated Solutions incorporate Fujitsu Intelligent Application Platform as the application platform within the software stack

Ref arch for off-premise

Cloud-bursting capability

28 copy Copyright 2017 FUJITSU

DLHPC trends

DL opportunity represents 6-7 of Hyperscale Market

Speculative figure likely 100 yy growth

DL is not a vertical market

It is more akin to an algorithm or method of computation like an FFT

AIDL exists in proximity to HPC

Driven by same architectural objective ndash performance and scale

Converged math and programming methodologies

Technological cross-fertilization

bull Software compilers libraries tools

bull Hardware processors memory interconnect

Source Intersect360 Research 2016

29 copy Copyright 2017 FUJITSU

Summary

Combine algorithmic expertise on HPC and MLFujitsu has the rare capability to combine technologies amp provide fully optimized solution

Shape of a network is subject to skilled programming Optimise through algorithmic and modelling discoveries Relationship between depth and result quality remains largely empirical

AI usage is primarily on Cloud todayCustomer looks for simplified integrated solutions

30 copy Copyright 2017 FUJITSU

Fujitsu Sans Light ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacutethorn

yumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789

notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucircuumlyacute

thornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

Fujitsu Sans Medium ndash abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

0123456789 notrdquopound$^amp()_+-=[]rsquo~ltgt| copyuml~iexclcentcurrenyenbrvbarsectumlordflaquoraquonot-

regmacrdegplusmnsup2sup3microparamiddotcedilsup1ordmfrac14frac12frac34iquestAgraveAacuteAcircAtildeAumlAringCcedilEgraveAEligEacuteEcircEumlIgraveIacuteIcircIumlETHNtildeOgraveOacuteOcircOtildeOumltimesOslashUgraveUacuteUcircUumlYacuteTHORNszligagraveaacuteacircatildeaumlaringaeligccedilegraveeacuteecirceumligraveiacuteicirciumlethntildeograveoacuteocircotildeoumldivideoslashugraveuacuteucirc

uumlyacutethornyumlĐıŒœŠšŸŽžƒʼˆˇˉ˙˚˛˜˝-‒ndashmdash―lsquorsquosbquoldquordquobdquodaggerDaggerbullhellippermillsaquorsaquoolinefrasl⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉eurotradeΩrarrpart∆prodsumminusradicinfinintasympnelegesdotlozfifl

31 copy Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity

BACK

32 copy Copyright 2017 FUJITSU

Unsupervised Learning

Genome Market Segmentation Fraud Detection

Astronomical data analysisGoogle NewsBACK

Page 26: Designing for intensity: parallelism from analytics to AI

25 copy Copyright 2017 FUJITSU

AI evolution driving CPU and GPU releases

Performance

Intelreg Xeon Phitrade Processor

Knights Mill

Intelreg Xeon Processor

Skylake

Lake

Crest

Intelreg Xeonreg Processor + FPGA

Intelreg Lake Crest Deep neural network processor

Da

tace

nte

rEd

ge

Clo

ud

Da

tace

nte

r

Infe

ren

ceTr

ain

ing

Intelreg Nervana

NVIDIA Tesla P4P40

NVIDIA Drive PX

Google TPU

NVIDIA Pascal 100

FPGA SOC(IntelXilinx)

FUJITSU

PRIMERGY CX600

K Computer


Page 27: Designing for intensity: parallelism from analytics to AI

26 © Copyright 2017 FUJITSU

Fujitsu Gateway – Intelligent Application Platform

[Diagram: Fujitsu HPC and AI portfolio spanning Workgroup, Departmental, Data Center and Cloud deployments.]

Cloud Services: cloud bursting – Gateway; on-premise cloud
UNCAI Artificial Intelligence: smart city surveillance; manufacturing process optimisation
HPC for Data Analytics: based on PRIMERGY with a parallel file system
Reference Architecture
PRIMEFLEX for HPC Solutions

Products and Solutions:
• CELSIUS
• Intel & Mellanox cluster interconnect
• NVIDIA GPGPU
• PRIMERGY RX2540 M4 (SKL-based)
• PRIMERGY RX2530 M4 (SKL-based)
• CX600 M1 (KNL/KNM-based)
• CX400 M4 (SKL-based): CX2550 M4 (HPC), CX2570 M4 (GPU), CX2580 M4 (FPGA)
• Entry ETERNUS storage, high-end ETERNUS storage, NetApp storage, DDN storage
• Liquid cooling + immersion cooling (FY2018)

Engineering Cloud, Industry 4.0, MONOZUKURI


Page 28: Designing for intensity: parallelism from analytics to AI

27 © Copyright 2017 FUJITSU

New PRIMEFLEX Options

• Reference designs defined for AI deep learning frameworks
• PRIMEFLEX configuration tool provided for fast definition of a complete solution
• PRIMEFLEX for HPC Integrated Solutions incorporate the Fujitsu Intelligent Application Platform as the application platform within the software stack
• Reference architecture for off-premise cloud-bursting capability


Page 29: Designing for intensity: parallelism from analytics to AI

28 © Copyright 2017 FUJITSU

DL/HPC trends

• DL opportunity represents 6–7% of the hyperscale market (a speculative figure; likely 100% y/y growth)
• DL is not a vertical market: it is more akin to an algorithm or method of computation, like an FFT
• AI/DL exists in proximity to HPC:
  • driven by the same architectural objective: performance and scale
  • converged math and programming methodologies
  • technological cross-fertilization in software (compilers, libraries, tools) and hardware (processors, memory, interconnect)

Source: Intersect360 Research, 2016


Page 30: Designing for intensity: parallelism from analytics to AI

29 © Copyright 2017 FUJITSU

Summary

• Combine algorithmic expertise in HPC and ML: Fujitsu has the rare capability to combine these technologies and provide a fully optimized solution
• The shape of a network depends on skilled programming. Optimise through algorithmic and modelling discoveries. The relationship between depth and result quality remains largely empirical
• AI usage today is primarily in the cloud; customers look for simplified, integrated solutions


Page 32: Designing for intensity: parallelism from analytics to AI

31 © Copyright 2017 FUJITSU

Deep Learning Networks

Image Identity


Page 33: Designing for intensity: parallelism from analytics to AI

32 © Copyright 2017 FUJITSU

Unsupervised Learning

Genome, Market Segmentation, Fraud Detection, Astronomical data analysis, Google News