Accelerating AI from the Cloud to the Edge


TRANSCRIPT

Page 1: Accelerating AI from the Cloud to the Edge

Wei Li, PhD, Vice President, Software and Services Group; GM, Machine Learning and Translation

Dec 2017

Accelerating AI

Page 2: Accelerating AI from the Cloud to the Edge

Approximate data generated per day by 2020:

Avg. Internet User: 1.5 GB¹
Smart Hospital: 3,000 GB²
Autonomous Driving: 4,000 GB³
Airplane: 40,000 GB²
Smart Factory: 1,000,000 GB²

All numbers are approximated. Sources:
1. http://www.cisco.com/c/en/us/solutions/service-provider/vni-network-traffic-forecast/infographic.html
2. http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
3. https://datafloq.com/read/self-driving-cars-create-2-petabytes-data-annually/172

Page 3: Accelerating AI from the Cloud to the Edge

Data Deluge

Compute Breakthrough

Innovation Surge

Mainframes → Standards-based servers → Cloud computing

AI compute cycles will grow 12X by 2020. The 12X figure measures compute demand (volume × per-server performance/throughput). Overall compute demand is growing tremendously: early deployments had relatively low utilization, and over the forecast period this transitions to broader deployments powering multiple applications, with higher utilization, improved hardware performance, and continued software optimizations driving compute growth.

Page 4: Accelerating AI from the Cloud to the Edge

Consumer: Smart Assistants, Chatbots, Search, Personalization, Augmented Reality, Robots

Health: Enhanced Diagnostics, Drug Discovery, Patient Care, Research, Sensory Aids

Finance: Algorithmic Trading, Fraud Detection, Research, Personal Finance, Risk Mitigation

Retail: Support, Experience, Marketing, Merchandising, Loyalty, Supply Chain, Security

Government: Defense, Data Insights, Safety & Security, Resident Engagement, Smarter Cities

Energy: Oil & Gas Exploration, Smart Grid, Operational Improvement, Conservation

Transport: Autonomous Cars, Automated Trucking, Aerospace, Shipping, Search & Rescue

Industrial: Factory Automation, Predictive Maintenance, Precision Agriculture, Field Automation

Other: Advertising, Education, Gaming, Professional & IT Services, Telco/Media, Space Exploration, Sports

Page 5: Accelerating AI from the Cloud to the Edge

Step 1: Training (in the data center, over hours/days/weeks)

Massive data sets, labeled or tagged, are fed in as input data. Forward and backward propagation create the "deep neural net" math model. Output classification, e.g.: 90% person, 8% traffic light.

Step 2: Inference (at the edge or in the data center, instantaneous)

New input from camera and sensors runs through the trained neural network model using forward propagation only. Output classification, e.g.: 97% person.
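A minimal sketch of the two steps (not from the deck, and not Intel code): a tiny one-layer NumPy classifier stands in for the "deep neural net," with random data playing the role of the labeled data set and the camera input. It contrasts training's forward-and-backward loop with inference's single forward pass.

```python
# Illustrative only: one-layer logistic classifier in NumPy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))          # "massive labeled data set" stand-in
y = (X[:, 0] > 0).astype(float)         # labels/tags
W = rng.normal(scale=0.1, size=(64,))   # the "deep neural net" math model

def forward(x, w):
    return 1.0 / (1.0 + np.exp(-x @ w))            # sigmoid classification

# Step 1: Training (hours/days/weeks at scale) -- forward AND backward pass
for _ in range(100):
    p = forward(X, W)                              # forward propagation
    grad = X.T @ (p - y) / len(y)                  # backward propagation
    W -= 0.5 * grad                                # update the model

# Step 2: Inference (instantaneous) -- forward pass only, on new input
new_input = rng.normal(size=(64,))                 # e.g. camera frame features
print(f"{forward(new_input, W):.0%} person")       # output classification
```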

Page 6: Accelerating AI from the Cloud to the Edge

HARDWARE for the cloud / the edge

Page 7: Accelerating AI from the Cloud to the Edge

Intel® Xeon® Processors

Intel® Core™ & Atom™ Processors

Intel® FPGA

Intel® Xeon Phi™ Processors*

Intel® Nervana™ Neural Network Processors

Intel® Processor Graphics

Movidius VPU (Vision)

Intel® GNA (IP)* (Speech)

CPU + accelerators, spanning the data center to the edge

Page 8: Accelerating AI from the Cloud to the Edge

Potent performance: train in HOURS instead of days, with up to 113X² performance vs. 2-3 year old servers (2.2x excluding optimized SW¹)

*Optimization notice slide 31

Page 9: Accelerating AI from the Cloud to the Edge

Intel® Nervana™ Neural Network Processor

Compute Density

Blazing Data Access

High-speed Scalability

Page 10: Accelerating AI from the Cloud to the Edge

Project Brainwave for real-time AI: "A major leap forward in both performance and flexibility for cloud-based serving of deep learning models."

– Doug Burger, Distinguished Engineer, Microsoft


Page 11: Accelerating AI from the Cloud to the Edge

* Google Clips: https://www.theverge.com/2017/10/4/16402682/google-clips-camera-announced-price-release-date-wireless

Page 12: Accelerating AI from the Cloud to the Edge

HOW DO WE DO IT: Innovate Hardware Technologies

Page 13: Accelerating AI from the Cloud to the Edge

Training: Expert-led trainings, hands-on workshops, exclusive remote access, and more.

Tools: Latest libraries, frameworks, tools, and technologies from Intel.

Community: Collaborate with industry luminaries, developers, and Intel engineers.

Intel® Nervana™ DevCloud

Page 14: Accelerating AI from the Cloud to the Edge

Experiences and tools: Intel® Computer Vision SDK, Movidius (VPU)

Platforms: Intel® Nervana™ Cloud & Appliance, Intel® Nervana™ DL Studio

Frameworks: MLlib, BigDL

Libraries: Intel® Math Kernel Library (MKL-DNN), Intel® Data Analytics Acceleration Library (DAAL), Intel® Nervana™ Graph*, Intel® Distribution for Python

* Other names and brands may be claimed as the property of others.

Page 15: Accelerating AI from the Cloud to the Edge

Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN): computational primitives

Intel® Machine Learning Scaling Library (Intel® MLSL): communication primitives

Both plug into deep learning frameworks.

* Other names and brands may be claimed as the property of others.

Page 16: Accelerating AI from the Cloud to the Edge

INFERENCE THROUGHPUT

Up to 138x: the Intel® Xeon® Platinum 8180 processor delivers up to 138x higher Intel-optimized Caffe GoogleNet v1 inference throughput with Intel® MKL, compared to the Intel® Xeon® processor E5-2699 v3 with BVLC-Caffe.

Inference uses FP32. Batch sizes: Caffe GoogleNet v1: 256; AlexNet: 256. Configuration details on slides 24, 28. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel measured as of June 2017

* Optimization Notice – Slide 31

TRAINING THROUGHPUT

Up to 113x: the Intel® Xeon® Platinum 8180 processor delivers up to 113x higher Intel-optimized Caffe AlexNet training throughput with Intel® MKL, compared to the Intel® Xeon® processor E5-2699 v3 with BVLC-Caffe.

Deliver significant AI performance with hardware and software optimizations on the Intel® Xeon® Scalable family: optimized frameworks and optimized Intel® MKL libraries. Inference and training throughput uses FP32 instructions.

Page 17: Accelerating AI from the Cloud to the Edge

* Blog by Valeriu Codreanu, Damian Podareanu, Zheng Meyer-Zhao, Vikram Saletore: http://blog.surf.nl/en/imagenet-1k-training-on-intel-xeon-phi-in-less-than-40-minutes/ System details: https://www.bsc.es/news/bsc-news/marenostrum-4-begins-operation

System specs: 2S Intel® Xeon® Processor 8160

* We acknowledge PRACE for awarding us access to resource MareNostrum 4, based in Spain at Barcelona Supercomputing Center (BSC).

[Chart: ResNet-50 scaling from 4 to 256 nodes, comparing Ideal, SURFsara/Intel, IBM Caffe, IBM Torch, and Facebook]

ResNet-50 training time of 44 minutes on 512 Intel® Xeon® processors

* Optimization notice slide 31.

Page 18: Accelerating AI from the Cloud to the Edge

ResNet-50 time to train: 31 minutes
Intel® Xeon® Platinum 8160 processor, 1600 nodes
Batch size = 16K, 75.3% Top-1 accuracy

AlexNet time to train: 11 minutes
Intel® Xeon® Platinum 8160 processor, 1024 nodes
Batch size = 32K, 58.6% Top-1 accuracy

Measured on Intel Caffe and Intel® Machine Learning Scaling Library (Intel® MLSL)

• Technical report by Y. You, Z. Zhang, C-J. Hsieh, J. Demmel, K. Keutzer: https://people.eecs.berkeley.edu/~youyang/publications/imagenet_minutes.pdf

Large-batch training method with the Layer-wise Adaptive Rate Scaling (LARS) algorithm, as sketched below.
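A minimal NumPy sketch of the LARS idea from the cited You et al. report: each layer gets its own effective learning rate, scaled by the ratio of the weight norm to the gradient norm. The hyperparameter values here are illustrative, not the paper's tuned settings.

```python
# Sketch of a Layer-wise Adaptive Rate Scaling (LARS) update for one layer.
import numpy as np

def lars_step(w, grad, lr=0.1, eta=0.001, weight_decay=0.0005):
    """One LARS update for a single layer's weights (illustrative values)."""
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    # Layer-wise trust ratio: ||w|| / (||grad|| + wd * ||w||)
    local_lr = eta * w_norm / (g_norm + weight_decay * w_norm + 1e-12)
    return w - lr * local_lr * (grad + weight_decay * w)

w = np.random.default_rng(1).normal(size=(1024,))
g = np.random.default_rng(2).normal(size=(1024,))
w = lars_step(w, g)   # each layer scales its own effective learning rate
```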

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel measured or estimated as of November 2017.

Page 19: Accelerating AI from the Cloud to the Edge

Get excellent multi-node scaling and generational performance with your existing hardware

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel measured as of June 2017

* Optimization Notice: Slide 31

Configuration Details 1 (link)

[Charts: higher is better]

Page 20: Accelerating AI from the Cloud to the Edge

FRAMEWORK HARDWARE

Deep learning training in hours

* Other names and brands may be claimed as the property of others.

Page 21: Accelerating AI from the Cloud to the Edge

* Other names and brands may be claimed as the property of others.

FRAMEWORK HARDWARE

Deep learning training with a huge dataset for stunning accuracy

Page 22: Accelerating AI from the Cloud to the Edge

* Other names and brands may be claimed as the property of others.

Deep learning inference for a smooth user experience at lower cost

FRAMEWORK HARDWARE

www.pikazoapp.com

Page 23: Accelerating AI from the Cloud to the Edge

Similar properties:

* Other names and brands may be claimed as the property of others.

Page 24: Accelerating AI from the Cloud to the Edge

Get the best AI performance: workload optimization checklist (a blocked-compute sketch follows the list)

Scaling: improve load balancing; reduce synchronization events and all-to-all comms; OpenMP, MPI

Utilize all the cores: reduce synchronization events and serial code; improve load balancing

Vectorize/SIMD: unit-strided access per SIMD lane; high vector efficiency; data alignment

Efficient memory/cache use: blocking; data reuse; prefetching; memory allocation
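To make the "blocking" and "data reuse" items concrete, here is a small NumPy sketch (not from the deck): a blocked matrix multiply keeps a tile of the operands hot in cache and reuses it many times before moving on. The block size of 64 is illustrative; production libraries such as Intel® MKL tune it per microarchitecture.

```python
# Cache-blocking illustration: tiled matrix multiply with data reuse.
import numpy as np

def blocked_matmul(A, B, bs=64):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for k in range(0, n, bs):
            a = A[i:i+bs, k:k+bs]            # tile stays cache-resident...
            for j in range(0, n, bs):
                C[i:i+bs, j:j+bs] += a @ B[k:k+bs, j:j+bs]  # ...and is reused
    return C

A = np.random.default_rng(0).random((256, 256))
B = np.random.default_rng(1).random((256, 256))
assert np.allclose(blocked_matmul(A, B), A @ B)   # same result, better locality
```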

Page 25: Accelerating AI from the Cloud to the Edge

The naïve convolution algorithm is slow: it is neither vectorization friendly nor cache friendly, even though convolution has a good compute-to-memory ratio.

The optimized implementation of the convolution algorithm in MKL-DNN uses:

Effective usage of SIMD registers, with vectorization across channels (see the layout sketch below).

A blocked data format for cache-friendly data blocking, data reuse, and effective prefetching.

Parallelization across mini-batch, output channels, and spatial domains, which utilizes all computational units and improves load balancing for small convolutions.

Optimizations: AVX-512 vectorization, data reuse, parallelization
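A short sketch of what the blocked data format looks like, assuming the MKL-DNN-style "nChw16c" layout in which 16 channels sit contiguously in memory, one value per AVX-512 FP32 lane. Shapes are illustrative.

```python
# Reorder NCHW activations to a blocked nChw16c layout for SIMD-friendly access.
import numpy as np

N, C, H, W, BLK = 1, 64, 28, 28, 16      # BLK=16 matches AVX-512 FP32 lanes
x_nchw = np.random.default_rng(0).random((N, C, H, W)).astype(np.float32)

# NCHW -> nChw16c: split channels into C/16 blocks of 16 and move each block
# of 16 channels to the innermost (unit-stride) dimension.
x_blocked = (x_nchw.reshape(N, C // BLK, BLK, H, W)
                   .transpose(0, 1, 3, 4, 2)
                   .copy())               # copy() makes the new layout physical
print(x_blocked.shape)                    # (1, 4, 28, 28, 16)

# A vector FMA can now load x_blocked[n, cb, h, w, :] as one contiguous
# 16-wide register, giving unit-strided access across channels.
```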

Page 26: Accelerating AI from the Cloud to the Edge

Distributed SW implementation: optimizing DL frameworks with the Intel® Machine Learning Scaling Library (Intel® MLSL) & Intel® MPI (a minimal allreduce sketch follows).

Support for parallel training algorithms: data/model parallelism; synchronous, asynchronous & hybrid SGD.

Increasing scaling efficiency: large mini-batch training methods; communication volume reduction.
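A minimal sketch of synchronous data-parallel SGD, using mpi4py as a generic stand-in for the MLSL communication primitives (the deck does not show MLSL code; this is an assumption-labeled illustration). Each rank computes gradients on its own data shard, an allreduce sums them, and every rank applies the identical averaged update.

```python
# Synchronous data-parallel SGD via allreduce (run: mpirun -n 4 python sync_sgd.py)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
w = np.zeros(1024, dtype=np.float32)     # model replica (data parallelism)

for step in range(10):
    # Each rank computes gradients on its own shard of the mini-batch.
    local_grad = (np.random.default_rng(comm.rank * 100 + step)
                    .normal(size=w.shape).astype(np.float32))
    # Communication primitive: sum gradients across all nodes.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    w -= 0.01 * (global_grad / comm.size)   # identical update on every rank
```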

Page 27: Accelerating AI from the Cloud to the Edge

Massive scale-out solution for deep learning training with large datasets

Hybrid approach (the group splitting is sketched below):
• Nodes are split into groups
• Each group performs synchronous SGD
• Groups communicate with asynchronous SGD
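A sketch of the group splitting behind the hybrid approach, again with mpi4py standing in for the actual Intel® Distribution of Caffe*/MLSL implementation; the group size of 4 is an illustrative assumption.

```python
# Split the world communicator into groups: sync SGD inside each group,
# async exchange between group leaders.
from mpi4py import MPI

world = MPI.COMM_WORLD
GROUP_SIZE = 4
color = world.rank // GROUP_SIZE          # which group this node belongs to

# Sub-communicator: synchronous SGD (allreduce) runs inside each group.
group_comm = world.Split(color, key=world.rank)

# Asynchronous SGD across groups would be handled between group leaders
# (rank 0 of each group_comm), e.g. via nonblocking sends of model deltas.
if group_comm.rank == 0:
    print(f"group {color}: leader is world rank {world.rank}")
```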

Results:
• Kurth, Thorsten, Jian Zhang, Nadathur Satish, Ioannis Mitliagkas, Evan Racah, Mostofa Ali Patwary, Tareq Malas et al. "Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data", arXiv:1708.05256 (2017).
• Scaled training of a single model to peak performance of 11.73-15.07 PFLOP/s and sustained performance of 11.41-13.27 PFLOP/s on 9600 Intel® Xeon Phi™ nodes
• Implemented on top of the Intel® Distribution of Caffe* and the Intel® Machine Learning Scaling Library (Intel® MLSL)

* Other names and brands may be claimed as the property of others.

Page 28: Accelerating AI from the Cloud to the Edge

Intel® Xeon® processors' INT8 capability for deep learning inference: lower response time, higher throughput, less memory.

IA-optimized frameworks provide an automated process to ease deployment:

– Collect statistics

– Calibrate scaling factor

– Quantize

Result: no significant accuracy loss compared with FP32 (a minimal calibration sketch follows the table).

Topology: FP32 accuracy (Top-1 / Top-5) vs. INT8 accuracy (Top-1 / Top-5)

ResNet-50: FP32 72.50% / 90.87%; INT8 71.76% / 90.48%

GoogLeNet v3: FP32 74.50% / 92.42%; INT8 74.21% / 92.27%

SSD: FP32 77.68% (mAP); INT8 77.37% (mAP)
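A minimal NumPy sketch of the three calibration steps for symmetric INT8 quantization of a single tensor. Real IA-optimized frameworks calibrate per layer over a sample dataset; the distribution and scale here are illustrative assumptions.

```python
# Collect statistics -> calibrate scaling factor -> quantize.
import numpy as np

acts = np.random.default_rng(0).normal(scale=3.0, size=100_000).astype(np.float32)

# 1. Collect statistics over representative inputs.
max_abs = np.abs(acts).max()

# 2. Calibrate a scaling factor mapping the observed range onto [-127, 127].
scale = 127.0 / max_abs

# 3. Quantize (and dequantize, to check accuracy loss vs. FP32).
q = np.clip(np.round(acts * scale), -127, 127).astype(np.int8)
restored = q.astype(np.float32) / scale
print(f"max abs error: {np.abs(acts - restored).max():.4f}")
```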

Page 29: Accelerating AI from the Cloud to the Edge

Accelerating AI from the cloud to the edge: both hardware and software

Page 30: Accelerating AI from the Cloud to the Edge

Visit www.intel.com/ai for more information

Page 31: Accelerating AI from the Cloud to the Edge

Configuration details: 32/64-node CPU system, Intel® Xeon® Gold 6148 processor with 10Gb Ethernet / OPA

Benchmark segment: AI/ML. Benchmark type: Training. Benchmark metric: Images/sec or time-to-train in seconds. Framework: Caffe. Topology: ResNet-50, VGG-16, GoogleNet V3. # of nodes: 32/64. Platform: Wolfpass (Skylake). Sockets: 2S.

Processor: Xeon processor code-named Skylake, B0, ES2*, 24c, 2.4GHz, 145W, 2666MT/s, QL1K, CPUID=0x50652

BIOS: SE5C620.86B.01.00.0412.020920172159. Enabled cores: 24 cores/socket. Slots: 12. Total memory: 192GB. Memory configuration: 12x16GB DDR4 2R, 1.2V, RDIMM, 2666MT/s. Memory comments: Micron MTA18ASF2G72PDZ-2G6B1. SSD: 800GB, model ATA INTEL SSDSC2BA80 (SCSI).

OS: Oracle Linux Server 7.3, Linux kernel 3.10.0-514.6.2.0.1.el7.x86_64.knl1

Ethernet configuration: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 03)

Omni-Path configuration: Intel Omni-Path HFI Silicon PCIe Adapter 100 Series [discrete]; OFED version 10.2.0.0.158_72; 48-port OPA switch with dual leaf switches per rack, 48 nodes per rack, 24 spine switches

HT: ON. Turbo: ON. Computer type: Server. Framework version: internal Caffe version.

Topology versions: internal ResNet-50, VGG-16, and GoogleNet V3 topologies

Batch size: ResNet-50: 128 x # of nodes; VGG-16: 64 x # of nodes; GoogleNet V3: 64 x # of nodes

Dataset: ImageNet ILSVRC 2012 (Endeavor location), JPEG resized to 256x256

MKL-DNN: aab753280e83137ba955f8f19d72cb6aaba545ef. MKL: mklml_lnx_2018.0.1.20171007. MLSL: 2017.2.018.

Compiler: Intel compiler 2017.4.196

Page 32: Accelerating AI from the Cloud to the Edge

Optimization Notice
• Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

• For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Page 33: Accelerating AI from the Cloud to the Edge

System Configuration | Network Fabric | Minibatch Size | Top-1 Accuracy | Measured TTT

64-node Xeon system, SKX-6148 based * | 10Gb Ethernet | 8192 | 75.9% | 7.3 hours

64-card GPU system, P100 based (Facebook Research **) | 50Gb Ethernet | 8192 | 76.3% | 4 hours

[Chart: ResNet-50 time-to-train performance and throughput scaling (images/sec) from 1 node to 32/64 nodes for VGG-16, GoogLeNet V3, and ResNet-50 over 10Gb Ethernet and Omni-Path (OPA); roughly 28.6x-32.0x scaling at 32 nodes and up to 64.0x at 64 nodes]

Intel Internal measured data on system configuration noted above, Configuration slide 30

** https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf

Measured with Intel® Distribution of Caffe* and Intel® Machine Learning Scaling Library (Intel® MLSL)

Page 34: Accelerating AI from the Cloud to the Edge

90% scaling efficiency with up to 74% Top-1 accuracy on 256 nodes. Configuration Details 2.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel measured or estimated as of November 2017.
