
AI: What Makes It Hard and Fun!

Pradeep K Dubey

Intel Senior Fellow and Fellow of IEEE

Director, Parallel Computing Lab, Intel Labs

IEEE CASS-SCV Industry Forum, San Jose, May 22, 2019

Machines: Crunch Numbers

Humans: Make Decisions

Division of Labor Between Man and Machine Is Getting Disrupted: Faster than Anyone Predicted!

Machines: Number Crunching AND Decision Making

MILS: Machine Intelligence Led Services

Mills

Information Revolution

http://bits.blogs.nytimes.com/2014/06/11/intelligence-too-big-for-a-single-machine/

“We’re seeing a rebirth of artificial intelligence driven by the cloud, huge amounts of data and the learning algorithms of software,”

- Larry Smarr, founding director of the California Institute for Telecommunications and Information Technology

Intelligence Too Big for a Single Machine

MILS


FROM: A World of Analytical Models

Computational Fluid Dynamics

Start with Mathematical Model: Model, Simulate, Predict

TO: A World of Data-Driven Models

Event Detection from Social Media

Start with Data: Initial State, Increment, Steer

Inside-Out, Outside-In

“Achieving Exascale is imperative not only to better the scientific community, but also to better the lives of everyday Americans. Aurora and the next-generation of Exascale supercomputers will apply HPC and AI technologies to areas such as cancer research, climate modeling, and veterans’ health treatments. The innovative advancements that will be made with Exascale will have an incredibly significant impact on our society.”

- Rick Perry, US Secretary of Energy

Intel – partnered with Argonne National Laboratory – driving the convergence of HPC and AI

Recently announced: Intel to Build First Exascale Supercomputer for U.S. DOE

Aurora will accelerate the convergence of traditional HPC, data analytics, and AI

Intel’s data-centric portfolio at the heart of Aurora, integrated with Cray’s “Shasta” system

Leading research & academia programs already engaged to harness Aurora and enable software ecosystem

Aurora: Accelerating Science by Applying HPC & AI at Scale

DEEP LEARNING IN CANCER TREATMENT: Precision medicine

NEURAL NETWORKS IN MATERIALS SCIENCE: Discovery of transformative materials

NEXT GEN MOLECULAR MODELING: Advanced biofuel development

EXTREME SCALE AERODYNAMIC FLOW SIMULATIONS: Next generation aircraft & nuclear reactors

Virtuous Cycle of Compute: A Functional View

Sense Act

Reason

The Future: Third Wave of AI

Deep Neural Networks Getting Augmented: NN + X + Memory

Such As: CNN + Bayes Net + Sparse Embeddings

AI Compute Needs

AlphaGo Zero needs: 2 EF-days to train

Need a 100 ExaFlop machine to train within an hour *

* https://blog.openai.com/ai-and-compute/
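As a rough check of the compute figure above: 2 EF-days is 48 EF-hours, so training within one hour needs about 48 EFLOP/s sustained. The 100 ExaFlop figure is consistent with that if roughly half of peak is sustained. The 50% utilization below is an assumption made for this sketch, not a number from the slide.

```python
# Back-of-envelope check of the training-compute figure quoted above.
TRAIN_COST_EF_DAYS = 2.0     # from the slide: about 2 exaflop/s-days
HOURS_PER_DAY = 24.0
ASSUMED_UTILIZATION = 0.5    # assumption for illustration, not from the slide


def sustained_ef_needed(cost_ef_days, target_hours):
    """Sustained EFLOP/s required to finish the job in `target_hours`."""
    return cost_ef_days * HOURS_PER_DAY / target_hours


def peak_ef_needed(cost_ef_days, target_hours, utilization):
    """Peak EFLOP/s required if only `utilization` of peak is sustained."""
    return sustained_ef_needed(cost_ef_days, target_hours) / utilization


sustained = sustained_ef_needed(TRAIN_COST_EF_DAYS, 1.0)
peak = peak_ef_needed(TRAIN_COST_EF_DAYS, 1.0, ASSUMED_UTILIZATION)
print(f"sustained: {sustained} EF, peak at assumed utilization: {peak} EF")
```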


AI: What makes it hard and fun!

LEARNING WITH LESS DATA AND SUPERVISION

DEEP NEURAL NETWORKS GETTING AUGMENTED: DATA-DRIVEN + ANALYTICAL + MEMORY

LEARNING MODELS THAT ARE EASIER TO REASON

CONTINUOUS LEARNING FOR MISSION-CRITICAL AI

Better model building


THROUGHPUT, ACCURACY, AND MODEL SIZE TRADEOFFS: SPARSIFICATION AND PRUNING

SELF-LEARNING AND PERSONALIZATION AT THE EDGE

More efficient and pervasive model deployment
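The sparsification and pruning tradeoff listed above can be illustrated with magnitude pruning, one common technique. This is a minimal sketch on a flat list of weights; the function name and the sample weights are hypothetical, and real deployments typically prune per layer and fine-tune afterwards to recover accuracy.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of a flat weight list."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_drop = int(round(sparsity * len(weights)))
    # Indices ordered by magnitude, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


# Hypothetical weights, for illustration only.
weights = [0.9, -0.05, 0.3, -1.2, 0.01, 0.6, -0.2, 0.08]
pruned = magnitude_prune(weights, sparsity=0.5)
kept = sum(1 for w in pruned if w != 0.0)
print(pruned, "kept:", kept)
```

Half the weights are stored and multiplied after pruning; whether accuracy survives depends on the model and is not something this sketch measures.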

Compute architecture needs of AI

REDUCING ARITHMETIC PRECISION WHILE PRESERVING ACCURACY: ALL 32 → 16, 8, 4, 2 …

FEEDING THE COMPUTE: MEMORY AND NETWORK; COMPUTE NEAR NETWORK AND MEMORY

DOMAIN-SPECIFIC ARCHITECTURES: TRADITIONAL, SPATIAL, NEUROMORPHIC, QUANTUM

STRONG-SCALING AI TO HPC SCALE ON CLOUD INFRASTRUCTURE: LARGE BATCH SIZE AND 2nd ORDER METHODS

DELIVERING PERFORMANCE-PRODUCTIVITY: FaaS AND HIGHER ORDER LANGUAGE CONSTRUCTS

Productivity and Scaling needs of AI
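The reduced-precision item above (from 32 bits down to 16, 8, 4, 2) can be made concrete with symmetric linear quantization, one simple scheme among many. This is a sketch under stated assumptions: a single shared scale per tensor, hypothetical sample weights, and reconstruction error as a stand-in for the accuracy impact, which in practice has to be measured on the model.

```python
def quantize_symmetric(values, num_bits):
    """Map floats to signed integers of `num_bits` using one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [qi * scale for qi in q]


def max_abs_error(values, num_bits):
    """Worst-case reconstruction error after a quantize/dequantize round trip."""
    q, scale = quantize_symmetric(values, num_bits)
    return max(abs(v - r) for v, r in zip(values, dequantize(q, scale)))


# Hypothetical weights, for illustration only.
weights = [0.823, -0.414, 0.071, -1.27, 0.335, 0.0, 0.952, -0.6]
for bits in (16, 8, 4, 2):
    print(bits, "bits -> max abs error", max_abs_error(weights, bits))
```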

Thinking Differently

Deep learning
▪ Known answers
▪ Generate procedures with training

Conventional
▪ Known procedures
▪ Generate answers

Probabilistic Computing
▪ Learn distributions
▪ Compute with uncertainties in natural data

Quantum
▪ Answers superimposed
▪ Select and measure the answer

Graph Analytics

▪ Graph edges and vertices represent relationships

▪ Big, sparse data structures

Example Application: Pedestrian Intent Estimation

Problem
• Does the pedestrian want to cross?
• At what moment will he attempt to cross?
• What direction will he follow?

Approach
• Generative model for possible collision scenarios
• Observed values (speed, pose, trajectory)
• Refine scenarios with observations: maximum likelihood
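The maximum-likelihood step above can be sketched as scoring a few candidate scenarios against observed values and keeping the best one. Everything specific here is hypothetical: the three scenarios, their Gaussian parameters, and the use of only speed and heading as observations. It is not the model used in the talk.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


# Hypothetical scenarios: (mean, std) of walking speed in m/s and of
# heading in degrees, where 0 means walking straight toward the road.
SCENARIOS = {
    "crosses":       {"speed": (1.4, 0.3), "heading": (0.0, 20.0)},
    "waits_at_curb": {"speed": (0.1, 0.2), "heading": (0.0, 60.0)},
    "walks_along":   {"speed": (1.3, 0.3), "heading": (90.0, 20.0)},
}


def log_likelihood(scenario, observations):
    """Sum of log densities, treating observations as independent."""
    params = SCENARIOS[scenario]
    return sum(
        gaussian_log_pdf(obs[key], *params[key])
        for obs in observations
        for key in ("speed", "heading")
    )


def most_likely_scenario(observations):
    """Maximum-likelihood choice among the candidate scenarios."""
    return max(SCENARIOS, key=lambda s: log_likelihood(s, observations))


observed = [{"speed": 1.2, "heading": 15.0}, {"speed": 1.5, "heading": 5.0}]
best = most_likely_scenario(observed)
print(best)
```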

GAIL: Generative Adversarial Imitation Learning *

Given a dataset of trajectories (sequences of (state, action) tuples of an expert's behavior), find a policy 𝜋: 𝑆 → 𝐴

Inverse Reinforcement Learning

* https://arxiv.org/pdf/1606.03476.pdf

RAIL: Risk-Averse Imitation Learning

Deep RL Symposium at NIPS 2017; Extended Abstract in AAMAS 2018: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Heavy Tail Problem of Generative Adversarial Imitation Learning (GAIL) [Ho/Ermon 2016]

Implication: GAIL agents prone to catastrophic failure

Our Solution: The RAIL objective function

Evaluation:
• RAIL is a better choice than GAIL in risk-sensitive applications
• RAIL converges almost as fast as GAIL in mean
• RAIL preserves the scalability of GAIL

Conditional Value at Risk (CVaR)
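For readers unfamiliar with CVaR, the sketch below computes an empirical version over a sample of trajectory costs: the average of the costs at or above the α-quantile (the Value at Risk). It illustrates why a heavy tail matters even when the mean looks fine. The cost values are made up, and this is only the risk measure, not the RAIL objective function itself.

```python
import math


def value_at_risk(costs, alpha):
    """Smallest cost c such that at least a fraction alpha of samples are <= c."""
    ordered = sorted(costs)
    idx = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[idx]


def conditional_value_at_risk(costs, alpha):
    """Mean of the costs at or above VaR_alpha (empirical tail average)."""
    var = value_at_risk(costs, alpha)
    tail = [c for c in costs if c >= var]
    return sum(tail) / len(tail)


# Hypothetical per-trajectory costs: mostly benign, one catastrophic outlier.
costs = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1, 1.0, 9.0]
mean_cost = sum(costs) / len(costs)
var_90 = value_at_risk(costs, 0.9)
cvar_90 = conditional_value_at_risk(costs, 0.9)
print(f"mean={mean_cost:.2f} VaR90={var_90:.2f} CVaR90={cvar_90:.2f}")
```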

Results:

Wild Wild East *

* https://factordaily.com/india-driving-dataset-iiit-hyderabad/ ** http://idd.insaan.iiit.ac.in

Current driving datasets: well-delineated lanes, a small number of categories of traffic participants, low variation in object/background appearance, and strict adherence to traffic rules.

IDD: A novel dataset for unstructured conditions: where the above assumptions are largely not satisfied.

Indian Driving Data (IDD) **: 10,004 images, finely annotated with 34 classes, from 182 drive sequences on Indian roads.

Probabilistic Computing Taking Shape


Sources: Uber, Google Source: blogs@Intel

Intel Science and Technology Center for Probabilistic Computing

KU Leuven/UCLA, Marian Verhelst / Guy Van den Broeck: Deep arithmetic networks for efficient learning, inference, and decision making

Duke, Alvin Lebeck: Accelerating Markov Chain Monte Carlo Inference

Northeastern, Jan-Willem van de Meent: Design and Evaluation of Deep Probabilistic Programs

MIT, Michael Carbin / Vikash Mansinghka: Probabilistic Programming with Fast, Verified, Programmable Inference

Oxford University, Yarin Gal: Real-World Benchmarks and a Public Competition in Bayesian Deep Learning

CIMAT, Jean-Bernard Hayet: Motion and intents prediction for unmotorized road users

Harvard, David Brooks (funded under Intel JUMP): Energy-Efficient Probabilistic Graphical Models and Inference






Graph Analytics

Improve social-network analysis, fraud-ring detection, anomaly detection

Challenges

▪ Sparse and irregular memory accesses

▪ Small data accesses with frequent synchronization

▪ Scaling to very large datasets
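The "sparse and irregular memory accesses" challenge above shows up directly in a kernel such as PageRank over a compressed sparse row (CSR) graph: the inner loop reads `rank[u]` at data-dependent indices, so caches and prefetchers get little help. The sketch below is a plain pull-style PageRank on a tiny hypothetical graph, meant only to show the access pattern, not an optimized implementation.

```python
def pagerank_csr(row_ptr, col_idx, num_vertices, damping=0.85, iters=50):
    """Pull-style PageRank over a CSR graph of incoming edges.

    For each vertex v, col_idx[row_ptr[v]:row_ptr[v + 1]] lists the
    sources u of edges u -> v. Assumes every vertex has an out-edge.
    """
    out_degree = [0] * num_vertices
    for u in col_idx:
        out_degree[u] += 1
    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iters):
        new_rank = []
        for v in range(num_vertices):
            acc = 0.0
            for e in range(row_ptr[v], row_ptr[v + 1]):
                u = col_idx[e]                  # data-dependent index
                acc += rank[u] / out_degree[u]  # irregular gather from rank[]
            new_rank.append((1.0 - damping) / num_vertices + damping * acc)
        rank = new_rank
    return rank


# Hypothetical 4-vertex graph with edges 0->1, 0->2, 1->2, 2->0, 3->0, 3->2.
ROW_PTR = [0, 2, 3, 6, 6]
COL_IDX = [2, 3, 0, 0, 1, 3]
ranks = pagerank_csr(ROW_PTR, COL_IDX, 4, iters=100)
print(ranks)
```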

Real-Time Decision Making


Is this a botnet?

How do “bad actors” manifest?

Diseases spread?


DARPA Graph Analytics Challenge

Program to develop new technologies to realize 1,000x performance-per-watt gains in the ability to handle graph analytics

Re-imagined Architecture

Fully Integrated Seamless Scaling

Key Technologies and Scalability

Network as First-class Citizen

Optimized for Small Messages

Architected to Scale

Aggressive MCP/MPP

Near-Memory Atomics

Support for irregular memory accesses

Global Memory Model

High IO & Memory bandwidth

Memory Form-Factor

Dense “Easy,” Sparse/Irregular Really Hard

[Figure: bar chart of “useful” ops/s as % of peak (axis 0 to 100) for Large DGEMM, SpGEMM, MD force, PageRank, and CG Solver, grouped as dense vs. sparse/irregular. Bar labels in extraction order: 93.5, 0.22, 12.1, 0.44, 1.10.]

Data collected on single-socket Intel Xeon® Platinum 8180, w/ turbo disabled and frequency of 1.7 GHz

Spatial architectures*


▪ Software defines custom instruction pipeline for computation of interest

– Distributed (and minimal) control logic and data storage

– Capture dataflow to minimize on-chip data movement

– Potential for big boost to power efficiency and thus performance

▪ Programmability is not a motivation to migrate to spatial architectures

* https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01220-hyperflex-architecture-fpga-socs.pdf

* A. Parashar et al., "Efficient Spatial Processing Element Control via Triggered Instructions," in IEEE Micro, vol. 34, no. 3, pp. 120-137, May-June 2014.

“Programmatic Control of a Compiler for Generating High-performance Spatial Hardware,” Hongbo Rong; https://arxiv.org/pdf/1711.07606.pdf

Can we address the Achilles’ heel of Spatial Architectures?


Spatial architectures less intuitive than von Neumann architectures

▪ We’re used to thinking sequentially

▪ Even experienced parallel programmers struggle with thinking spatially

Domain-specific languages might hide this, but narrow scope too much

T2S: General-purpose, high-productivity, high-performance spatial programming *

• “General-purpose”: loop nests with affine dependences

• “High-productivity”: 20 to 30 LOC, written in under 0.5 hr

• “high-performance”: ninja/product performance

* Programmatic Control of a Compiler for Generating High-performance Spatial Hardware; Hongbo Rong; https://arxiv.org/pdf/1711.07606.pdf

Performance-productive generation of spatial architectures*


T2S

For average programmers

Intel FPGAs

CGRA

PolySA

For perf. engineers

ARE (Affine Recurrence Equations)

URE (Uniform Recurrence Equations)

Compute partition

Loop nest transformations

User-managed cache

Streaming loads/stores

Uniformalization

OpenCL

Assembly & C

C(i,j)+=A(i,k)*B(k,j)

C(i,j,k)=C(i,j,k-1)+A(i,j-1,k)*B(i-1,j,k)

A(i,j,k)=A(i,j-1,k)

B(i,j,k)=B(i-1,j,k)
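The equations above show matrix multiply in affine form (one line) and in uniform form (three lines), where every value is read from a neighbor at a constant offset, which is what maps onto a systolic array. The sketch below checks that the two forms agree on a small input. The boundary handling (inputs enter at j = 0 and i = 0, accumulation starts from 0 at k = 0) is an assumption of this sketch, since the slide does not spell it out.

```python
def matmul_are(a, b):
    """Affine form: C(i,j) += A(i,k) * B(k,j)."""
    n, depth, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(depth):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_ure(a, b):
    """Uniform form: each cell reads only neighbors at constant offsets."""
    n, depth, m = len(a), len(b), len(b[0])
    A, B, C = {}, {}, {}
    for i in range(n):
        for j in range(m):
            for k in range(depth):
                # Boundary cells read the inputs; interior cells read a neighbor.
                a_in = a[i][k] if j == 0 else A[i, j - 1, k]
                b_in = b[k][j] if i == 0 else B[i - 1, j, k]
                c_in = 0 if k == 0 else C[i, j, k - 1]
                A[i, j, k] = a_in                 # A(i,j,k) = A(i,j-1,k)
                B[i, j, k] = b_in                 # B(i,j,k) = B(i-1,j,k)
                C[i, j, k] = c_in + a_in * b_in   # C(i,j,k) = C(i,j,k-1) + ...
    return [[C[i, j, depth - 1] for j in range(m)] for i in range(n)]


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_ure(a, b))
```

In the uniform form, A values flow along j and B values flow along i, so each processing element only talks to its immediate neighbors.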

* “Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations”, Nitish Srivastava et al. To appear in FCCM 2019

Temporal-to-Spatial (T2S): Performance-Productive Programming of Spatial Architectures*

• Close to or better than ninja GEMM on both architectures in a fraction of the time
  • 2 weeks (vs. 18 months) for FPGA
  • 3 days (vs. 3 months) for CGRA

• First time MTTKRP, TTM, and TTMc have been implemented on any spatial architecture

* “Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations”, Nitish Srivastava et al. To appear in FCCM 2019






Quantum co-processor: augmenting, not replacing, traditional HPC systems

Applications Space: HPC

~50+ Qubits: Proof of concept

• Computational power exceeds supercomputers

• Learning test bed for quantum “system”

~1000+ Qubits: Small problems

• Limited error correction

• Chemistry, materials design

• Optimization

~1M+ Qubits: Commercial scale

• Fault tolerant operation

• Cryptography

• Machine Learning

Intel – QuTech Research Collaboration

Atomic Layer Control, Patterning, Packaging

Metal Gate / High k on 300mm Silicon Wafer

24nm Pitch Lines

Assembly and Packaging Research

QuTech’s expertise in qubit operation and control

Intel Labs:
• Algorithms
• System Architecture
• Control Electronics

TMG Components Research

Combining Intel capabilities with Delft expertise

A quantum system requires a full stack

Intel-QuTech Research Collaboration

Spin Qubits in Silicon: single-electron transistors, where the qubit is the spin state

Superconducting Qubits: very high quality microwave circuit

[Figure: entanglement gap vs. short-range interaction strength, with insulator and conductor regimes]

[Figure: quantum full stack, with labels Applications, Libraries, Algorithms, Application Algorithms, Compilers/Runtimes, Control Electronics, Quantum Chip, Qubit Simulator, and System Research]

www.nersc.gov/systems/edison-cray-xc30/

High Perf QuBit Simulation

Practical Quantum Supremacy:

• Quantum compared to state-of-the-art solver

• In the presence of realistic noise

• On a realistic problem

Problem: Max-Cut graph partitioning

Quantum: Quantum Approximate Optimization Algorithm (QAOA)

Classical: akmaxsat solver

[Figure legend: state-of-the-art classical solver vs. quantum heuristics]

Quantum Advantage: For Intractable, Yet Useful Computational Problems*

* [classical optimization] Guerreschi G.G. & Smelyanskiy M., arXiv:1701.01450

* [compiler of quantum circuits] Guerreschi G.G. & Park J., Quantum Science and Technology, 3(4), 045003 (2018)

* [performance crossover] Guerreschi G.G. & Matsuura A.Y., arXiv:1812.07589; to appear in Scientific Reports (2019)
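The Max-Cut problem used in the comparison above asks for a split of the vertices into two sets that maximizes the number of edges crossing the split. The sketch below states the objective and solves a tiny hypothetical instance by exhaustive search. It is neither QAOA nor the akmaxsat solver; it only shows what both are trying to optimize and why exhaustive search (2^n assignments) does not scale.

```python
from itertools import product


def cut_value(edges, assignment):
    """Number of edges whose endpoints fall on different sides."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


def max_cut_brute_force(num_vertices, edges):
    """Try all 2^n side assignments; feasible only for very small graphs."""
    best_value, best_assignment = -1, None
    for bits in product((0, 1), repeat=num_vertices):
        value = cut_value(edges, bits)
        if value > best_value:
            best_value, best_assignment = value, bits
    return best_value, best_assignment


# Hypothetical instance: a 5-cycle plus the chord (0, 2).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
best_value, best_assignment = max_cut_brute_force(5, edges)
print(best_value, best_assignment)
```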

Learned Indices *

* The Case for Learned Index Structures: https://arxiv.org/abs/1712.01208 and “Everything Is a Model” http://deliprao.com/archives/262
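The idea behind a learned index is that a model predicts where a key sits in a sorted array, and a short local search corrects the prediction. The sketch below uses a single least-squares line and the worst-case training error as the search window. The class name and sample keys are hypothetical, and this is a simplification, not the recursive model index described in the cited paper.

```python
import bisect


class LinearLearnedIndex:
    """Position ~ slope * key + intercept, plus a bounded local search."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in sorted_keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(sorted_keys))
        self.slope = cov / var_k if var_k else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case prediction error over stored keys bounds the search window.
        self.max_err = max(abs(self._predict(k) - p) for p, k in enumerate(sorted_keys))

    def _predict(self, key):
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of `key`, or None if it is absent."""
        guess = self._predict(key)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None


# Hypothetical sorted keys, for illustration only.
keys = [3, 7, 8, 15, 21, 22, 40, 41, 57, 90]
index = LinearLearnedIndex(keys)
print(index.lookup(21), index.lookup(9), index.max_err)
```

The search window shrinks as the model fits the key distribution better, which is where the potential win over a generic B-tree comes from.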

DSAIL: launched late 2018 at MIT in collaboration with Google and Microsoft

https://www.csail.mit.edu/news/google-intel-and-microsoft-team-wcsail-new-data-driven-initiative

Machine Learning Based End-to-End Query Optimizer*

[Figure: JOB (Join Order Benchmark) queries; performance relative to PostgreSQL (axis 0.5 to 2.5) vs. episodes (0 to 100); series: RL (Row Vectors), RL (Histogram), RL (1-Hot), Postgres]

* Neo: A Learned Query Optimizer; Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, Nesime Tatbul; Under Review

Machine Learning Based End-to-End Query Optimizer*

[Figure: JOB (Join Order Benchmark) queries; performance relative to Oracle (axis 0.5 to 2.5) vs. episodes (0 to 100); series: RL (Row Vectors), RL (Histogram), RL (1-Hot), Oracle, PostgreSQL on Oracle]

* Neo: A Learned Query Optimizer; Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, Nesime Tatbul; Under Review

Machine Programming*

A process in which some or all of the steps of turning a user’s intent into an executable program are automated.

Three pillars: Intention, Invention, Adaptation

* Three Pillars paper (MAPL ’18): Justin Gottschlich, Armando Solar-Lezama, Nesime Tatbul, Michael Carbin, Martin Rinard, Regina Barzilay, Saman Amarasinghe, Joshua B Tenenbaum, Tim Mattson; https://arxiv.org/abs/1803.07244

Summary

• AI is driving a virtuous cycle of compute: an insatiable appetite for compute

• AI impacts all parts of stack: algorithm, software, architecture … and ALUs too!

• Today’s HPC will get augmented with deep learning, and other forms of ML and graph analytics

• Quantum computing: challenges ahead, opportunities enormous!

• Disruptive and exciting applications of tomorrow go well beyond self-driving, for example machine programming

We are at an unprecedented convergence of massive compute with massive data ...

This confluence will have a lasting impact on both how we do computing and what computing can do for us!

Thank You

Notice and Disclaimers

Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications, product descriptions, and plans at any time, without notice.

All products, dates, and figures are preliminary for planning purposes and are subject to change without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

The Intel products discussed herein may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com.

Intel® Itanium®, Intel® Xeon®, Xeon Phi™, Pentium®, Intel SpeedStep® and Intel NetBurst® , Intel®, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Copyright © 2012, Intel Corporation. All rights reserved.

*Other names and brands may be claimed as the property of others.


Notice and Disclaimers Continued …

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Notices & Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration.

No product or component can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks .

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks .

Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

*Other names and brands may be claimed as property of others.

Intel, the Intel logo, Xeon, and Optane are trademarks of Intel Corporation in the United States and other countries.

© 2019 Intel Corporation.
