AI: What Makes It Hard and Fun!
Pradeep K Dubey
Intel Senior Fellow and Fellow of IEEE
Director, Parallel Computing Lab, Intel Labs
IEEE CASS-SCV Industry Forum, San Jose, May 22, 2019
Machines: Crunch Numbers
Humans: Make Decisions
Division of Labor Between Man and Machine Is Getting Disrupted: Faster than Anyone Predicted!
Machines: Number Crunching AND Decision Making
MILS: Machine Intelligence Led Services
Mills
Information Revolution
http://bits.blogs.nytimes.com/2014/06/11/intelligence-too-big-for-a-single-machine/
“We’re seeing a rebirth of artificial intelligence driven by the cloud, huge amounts of data and the learning algorithms of software.”
- Larry Smarr, founding director of the California Institute for Telecommunications and Information Technology
Intelligence Too Big for a Single Machine
MILS
FROM: A World of Analytical Models (Inside-Out)
Computational Fluid Dynamics
Start with Mathematical Model
Model, Simulate, Predict
TO: A World of Data-Driven Models (Outside-In)
Event Detection from Social Media
Start with Data
Initial State, Increment, Steer
“Achieving Exascale is imperative not only to better the scientific community, but also to better the lives of everyday Americans. Aurora and the next-generation of Exascale supercomputers will apply HPC and AI technologies to areas such as cancer research, climate modeling, and veterans’ health treatments. The innovative advancements that will be made with Exascale will have an incredibly significant impact on our society.”
- Rick Perry, US Secretary of Energy
Intel – partnered with Argonne National Laboratory – driving the convergence of HPC and AI
Recently announced: Intel to Build First Exascale Supercomputer for U.S. DOE
Aurora will accelerate the convergence of traditional HPC, data analytics, and AI
Intel’s data-centric portfolio at the heart of Aurora, integrated with Cray’s “Shasta” system
Leading research & academia programs already engaged to harness Aurora and enable software ecosystem
Aurora: Accelerating Science by Applying HPC & AI at Scale
DEEP LEARNING IN CANCER TREATMENT: Precision medicine
NEURAL NETWORKS IN MATERIALS SCIENCE: Discovery of transformative materials
NEXT GEN MOLECULAR MODELING: Advanced biofuel development
EXTREME SCALE AERODYNAMIC FLOW SIMULATIONS: Next generation aircraft & nuclear reactors
Virtuous Cycle of Compute: A Functional View
Sense Act
Reason
The Future: Third Wave of AI
Deep Neural Networks Getting Augmented: NN + X + Memory
Such as: CNN + Bayes Net + Sparse Embeddings
AI Compute Needs
AlphaGo Zero needs: 2 EF-days to train
Need a 100 ExaFlop machine to train within an hour *
* https://blog.openai.com/ai-and-compute/
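The arithmetic behind this estimate can be checked directly. The following is a minimal sketch, assuming that one "EF-day" means one exaflop per second sustained for a full day and that the machine runs at a constant utilization. The function name and the `utilization` parameter are illustrative and do not come from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def required_exaflops(ef_days, target_hours, utilization=1.0):
    """Peak rate (in EF/s) needed to finish `ef_days` of work in `target_hours`."""
    total_exa_ops = ef_days * SECONDS_PER_DAY   # total work, in units of 10^18 operations
    target_seconds = target_hours * 3600
    return total_exa_ops / (target_seconds * utilization)

# 2 EF-days compressed into one hour at perfect utilization.
print(required_exaflops(2, 1))        # 48.0

# The same job at 48% sustained utilization needs roughly a 100 EF/s peak machine.
print(required_exaflops(2, 1, 0.48))
```

At perfect utilization the requirement is 48 EF/s, so the 100 ExaFlop figure on the slide leaves roughly a factor of two of headroom for real-world efficiency.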
AI: What makes it hard and fun!
LEARNING WITH LESS DATA AND SUPERVISION
DEEP NEURAL NETWORKS GETTING AUGMENTED: DATA-DRIVEN + ANALYTICAL + MEMORY
LEARNING MODELS THAT ARE EASIER TO REASON
CONTINUOUS LEARNING FOR MISSION-CRITICAL AI
Better model building
THROUGHPUT, ACCURACY, AND MODEL SIZE TRADEOFFS: SPARSIFICATION AND PRUNING
SELF-LEARNING AND PERSONALIZATION AT THE EDGE
More efficient and pervasive model deployment
Compute architecture needs of AI
REDUCING ARITHMETIC PRECISION WHILE PRESERVING ACCURACY: ALL 32 → 16, 8, 4, 2 …
FEEDING THE COMPUTE: MEMORY AND NETWORK; COMPUTE NEAR NETWORK AND MEMORY
DOMAIN-SPECIFIC ARCHITECTURES: TRADITIONAL, SPATIAL, NEUROMORPHIC, QUANTUM
STRONG-SCALING AI TO HPC SCALE ON CLOUD INFRASTRUCTURE: LARGE BATCH SIZE AND 2nd ORDER METHODS
DELIVERING PERFORMANCE-PRODUCTIVITY: FaaS AND HIGHER ORDER LANGUAGE CONSTRUCTS
Productivity and Scaling needs of AI
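To make the reduced-precision item above concrete, here is a minimal sketch of one common scheme: symmetric, per-tensor linear quantization to 8-bit integers. The talk does not say which scheme it has in mind, so the function names and sample weights below are assumptions chosen for illustration.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0   # one scale for the whole tensor
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [qi * scale for qi in q]

weights = [0.02, -1.27, 0.635, 0.5, -0.004]   # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is bounded by half the scale, which is why accuracy can often be preserved when the value range is narrow. Going below 8 bits usually needs more care, for example per-channel scales or retraining.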
Thinking Differently
Deep learning
▪ Known answers
▪ Generate procedures with training
Conventional
▪ Known procedures
▪ Generate answers
Probabilistic Computing
▪ Learn distributions
▪ Compute with uncertainties in natural data
Quantum
▪ Answers superimposed
▪ Select and measure the answer
Graph Analytics
▪ Graph edges and vertices represent relationships
▪ Big, sparse data structures
Example Application: Pedestrian Intent Estimation
Problem
• Does the pedestrian want to cross?
• At what moment will he attempt to cross?
• What direction will he follow?
Approach
• Generative model for possible collision scenarios
• Observed values (speed, pose, trajectory)
• Refine scenarios with observations: maximum likelihood
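The "refine scenarios with observations" step can be illustrated with a toy maximum-likelihood selection. This is only a sketch under strong assumptions: each scenario is reduced to a predicted speed profile, observation noise is Gaussian with a fixed standard deviation, and the scenario names and numbers are invented. The actual generative model is not described in the slides.

```python
import math

def log_likelihood(observed, predicted, sigma):
    """Gaussian log-likelihood of observed speeds given a scenario's predicted speeds."""
    return sum(
        -0.5 * ((o - p) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
        for o, p in zip(observed, predicted)
    )

# Hypothetical scenarios: predicted pedestrian speed (m/s) at successive time steps.
scenarios = {
    "keeps walking along the curb": [1.4, 1.4, 1.4, 1.4],
    "slows down and waits": [1.4, 1.0, 0.5, 0.0],
    "accelerates to cross": [1.4, 1.6, 1.9, 2.2],
}
observed = [1.4, 1.5, 1.8, 2.1]   # made-up sensor readings

scores = {
    name: log_likelihood(observed, predicted, sigma=0.2)
    for name, predicted in scenarios.items()
}
best = max(scores, key=scores.get)
print(best)   # accelerates to cross
```

A fuller treatment would also use pose and trajectory, and would keep a distribution over scenarios rather than a single best guess.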
GAIL: Generative Adversarial Imitation Learning *
Given a dataset of trajectories, each a sequence of (state, action) tuples of expert behavior, find a policy 𝜋: 𝑆 → 𝐴
Inverse Reinforcement Learning
* https://arxiv.org/pdf/1606.03476.pdf
RAIL: Risk-Averse Imitation Learning
Deep RL Symposium at NIPS-2017; Extended Abstract in AAMAS-2018: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
Heavy Tail Problem of Generative Adversarial Imitation Learning (GAIL) [Ho/Ermon 2016]
Implication: GAIL agents are prone to catastrophic failure
Our Solution: The RAIL objective function
Evaluation:
• RAIL is a better choice than GAIL in risk-sensitive applications
• RAIL converges almost as fast as GAIL in mean
• RAIL preserves the scalability of GAIL
Conditional Value at Risk (CVaR)
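CVaR at level α is the expected cost over the worst (1 − α) fraction of outcomes, which is what makes it sensitive to heavy tails. Below is a minimal sketch of the standard empirical estimator. It does not reproduce the RAIL objective itself, which is not given in the slides, and the sample costs are invented.

```python
import math

def cvar(costs, alpha=0.9):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of sampled costs."""
    ordered = sorted(costs)
    # round() guards against floating-point noise such as 10 * (1 - 0.7) = 3.0000000000000004
    n_tail = max(1, math.ceil(round(len(ordered) * (1 - alpha), 9)))
    tail = ordered[-n_tail:]
    return sum(tail) / len(tail)

# Made-up trajectory costs with one catastrophic outcome.
costs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(sum(costs) / len(costs))   # 14.5  (the mean hides the tail)
print(cvar(costs, alpha=0.9))    # 100.0 (worst 10% of outcomes)
print(cvar(costs, alpha=0.8))    # 54.5  (worst 20% of outcomes)
```

Two policies with the same mean cost can have very different CVaR, which is the distinction a risk-averse objective is meant to capture.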
Results:
Wild Wild East *
* https://factordaily.com/india-driving-dataset-iiit-hyderabad/
** http://idd.insaan.iiit.ac.in
Current driving datasets: well-delineated lanes, a small number of categories of traffic participants, low variation in object/background appearance, and strict adherence to traffic rules.
IDD: A novel dataset for unstructured conditions, where the above assumptions are largely not satisfied.
Indian Driving Data (IDD) **: 10,004 images, finely annotated with 34 classes, from 182 drive sequences on Indian roads.
Probabilistic Computing Taking Shape
Sources: Uber, Google Source: blogs@Intel
Intel Science and Technology Center for Probabilistic Computing
KU Leuven/UCLA: Marian Verhelst/Guy Van den Broeck. Deep arithmetic networks for efficient learning, inference and decision making
Duke: Alvin Lebeck. Accelerating Markov Chain Monte Carlo Inference
Northeastern: Jan-Willem van de Meent. Design and Evaluation of Deep Probabilistic Programs
MIT: Michael Carbin/Vikash Mansinghka. Probabilistic Programming with Fast, Verified, Programmable Inference
Oxford University: Yarin Gal. Real-World Benchmarks and a Public Competition in Bayesian Deep Learning
CIMAT: Jean-Bernard Hayet. Motion and intents prediction for unmotorized road users
Harvard: David Brooks (Funded under Intel JUMP). Energy-Efficient Probabilistic Graphical Models and Inference
Graph Analytics
Improve social-network analysis, fraud-ring detection, anomaly detection
Is this a Botnet?
How do “bad actors” manifest? Diseases spread?
Challenges
▪ Sparse and irregular memory accesses
▪ Small data accesses with frequent synchronization
▪ Scaling to very large datasets
Real-Time Decision Making
DARPA Graph Analytics Challenge
Program to develop new technologies to realize 1,000x performance-per-watt gains in the ability to handle graph analytics
Re-imagined Architecture
Fully Integrated Seamless Scaling
Key Technologies and Scalability
Network as First-class Citizen
Optimized for Small Messages
Architected to Scale
Aggressive MCP/MPP
Near-Memory Atomics
Support for irregular memory accesses
Global Memory Model
High IO & Memory bandwidth
Memory Form-Factor
Dense “Easy,” Sparse/Irregular Really Hard
“Useful” Ops/s as % of Peak:
Large DGEMM: 93.5
SpGEMM: 0.22
MD force: 12.1
PageRank: 0.44
CG Solver: 1.10
Data collected on single socket Intel Xeon® Platinum 8180, w/ turbo disabled and frequency of 1.7 GHz
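Part of the reason PageRank sits so far below dense GEMM is visible in its inner loop: every edge triggers a small, data-dependent read or write at an address that cannot be predicted from the loop index. The sketch below is a minimal plain-Python PageRank over a CSR adjacency structure. It is an illustration only, not the benchmark code behind the numbers above, and the damping factor and iteration count are conventional defaults rather than values from the talk.

```python
def pagerank_csr(row_ptr, col_idx, n, damping=0.85, iters=50):
    """PageRank over out-edges stored in CSR form.

    Node u links to col_idx[row_ptr[u]:row_ptr[u + 1]].
    """
    rank = [1.0 / n] * n
    for _ in range(iters):
        new_rank = [(1.0 - damping) / n] * n
        dangling = 0.0
        for u in range(n):
            start, end = row_ptr[u], row_ptr[u + 1]
            degree = end - start
            if degree == 0:
                dangling += rank[u]          # node with no out-edges
                continue
            share = damping * rank[u] / degree
            for e in range(start, end):
                # Scattered update: the target index comes from the data,
                # so accesses are sparse, irregular, and hard to prefetch.
                new_rank[col_idx[e]] += share
        spread = damping * dangling / n      # redistribute dangling mass uniformly
        rank = [r + spread for r in new_rank]
    return rank

# Directed 3-cycle 0 -> 1 -> 2 -> 0: by symmetry every node gets rank 1/3.
print(pagerank_csr([0, 1, 2, 3], [1, 2, 0], n=3))
```

Each edge performs one multiply-add per several bytes of irregularly addressed data, so the computation is bound by memory latency and bandwidth rather than by arithmetic throughput.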
Spatial architectures*
▪ Software defines custom instruction pipeline for computation of interest
– Distributed (and minimal) control logic and data storage
– Capture dataflow to minimize on-chip data movement
– Potential for big boost to power efficiency and thus performance
▪ Programmability is not a motivation to migrate to spatial architectures
* https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01220-hyperflex-architecture-fpga-socs.pdf
* A. Parashar et al., "Efficient Spatial Processing Element Control via Triggered Instructions," in IEEE Micro, vol. 34, no. 3, pp. 120-137, May-June 2014.
“Programmatic Control of a Compiler for Generating High- performance Spatial Hardware” Hongbo Rong; https://arxiv.org/pdf/1711.07606.pdf
Can we address the Achilles’ heel of Spatial Architectures?
Spatial architectures less intuitive than von Neumann architectures
▪ We’re used to thinking sequentially
▪ Even experienced parallel programmers struggle with thinking spatial
Domain-specific languages might hide this, but narrow scope too much
T2S: General-purpose, high-productivity, high-performance spatial programming *
• “General-purpose”: loop nests with affine dependences
• “high-productivity”: 20 or 30 LOC, written in under 0.5 hr
• “high-performance”: ninja/product performance
* Programmatic Control of a Compiler for Generating High- performance Spatial Hardware; Hongbo Rong; https://arxiv.org/pdf/1711.07606.pdf
Performance-productive generation of spatial architectures*
T2S
For average programmers
Intel FPGAs
CGRA
PolySA
For perf. engineers
ARE (Affine Recurrence Equations)
URE (Uniform Recurrence Equations)
Compute partition
Loop nest transformations
User-managed cache
Streaming loads/stores
Uniformalization
OpenCL
Assembly & C
C(i,j)+=A(i,k)*B(k,j)
C(i,j,k)=C(i,j,k-1)+A(i,j-1,k)*B(i-1,j,k)
A(i,j,k)=A(i,j-1,k)
B(i,j,k)=B(i-1,j,k)
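The four equations above show matrix multiply first as an affine recurrence and then in uniform form, where every value is read from a neighbor at a constant offset. The sketch below checks that the uniform form computes the same result. It is plain Python, not T2S code, and the boundary conditions are assumptions the slide does not state: A enters at j = -1, B enters at i = -1, and C starts at zero at k = -1.

```python
def gemm_ure(a_in, b_in):
    """Matrix multiply written as the uniform recurrence equations on the slide."""
    I, K, J = len(a_in), len(b_in), len(b_in[0])
    # Arrays are shifted by +1 so that list index 0 holds the boundary at index -1.
    A = [[[0.0] * (K + 1) for _ in range(J + 1)] for _ in range(I + 1)]
    B = [[[0.0] * (K + 1) for _ in range(J + 1)] for _ in range(I + 1)]
    C = [[[0.0] * (K + 1) for _ in range(J + 1)] for _ in range(I + 1)]
    for i in range(1, I + 1):
        for k in range(1, K + 1):
            A[i][0][k] = a_in[i - 1][k - 1]      # assumed boundary: A(i, -1, k)
    for j in range(1, J + 1):
        for k in range(1, K + 1):
            B[0][j][k] = b_in[k - 1][j - 1]      # assumed boundary: B(-1, j, k)
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            for k in range(1, K + 1):
                A[i][j][k] = A[i][j - 1][k]      # A flows along j
                B[i][j][k] = B[i - 1][j][k]      # B flows along i
                C[i][j][k] = C[i][j][k - 1] + A[i][j - 1][k] * B[i - 1][j][k]
    return [[C[i][j][K] for j in range(1, J + 1)] for i in range(1, I + 1)]

print(gemm_ure([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19.0, 22.0], [43.0, 50.0]]
```

Because every dependence is a unit step along one axis, each (i, j) point can be mapped to a processing element that only talks to its immediate neighbors, which is the property systolic and spatial designs exploit.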
* “Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations”, Nitish Srivastava et al. To appear in FCCM 2019
Temporal-to-Spatial (T2S): Performance-Productive Programming of Spatial Architectures*
• Close-to or better-than ninja GEMM on both architectures in a fraction of the time
• 2 weeks (vs. 18 months) for FPGA
• 3 days (vs. 3 months) for CGRA
• First time MTTKRP, TTM, and TTMc have been implemented on any spatial architecture
* “Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations”, Nitish Srivastava et al. To appear in FCCM 2019
Quantum co-processor: augmenting, not replacing, traditional HPC systems
Applications Space: HPC
~50+ Qubits: Proof of concept
• Computational power exceeds supercomputers
• Learning test bed for quantum “system”
~1000+ Qubits: Small problems
• Limited error correction
• Chemistry, materials design
• Optimization
~1M+ Qubits: Commercial scale
• Fault tolerant operation
• Cryptography
• Machine Learning
Intel – QuTech Research Collaboration
Atomic Layer Control
Patterning
Packaging
24nm Pitch Lines
Assembly and Packaging Research
Metal Gate / High k on 300mm Silicon Wafer
QuTech’s Expertise in qubit operation and control
Intel Labs:
• Algorithms
• System Architecture
• Control Electronics
TMG Components Research
Combining Intel capabilities with Delft expertise
A quantum system requires a full stack
Intel-QuTech Research Collaboration
Spin Qubits in Silicon
Single electron transistors, where qubit is spin state
Superconducting Qubits
Very high quality microwave circuit
[Figure labels: Entanglement Gap, Insulator, Conductor, Short-range Interaction Strength]
Application Algorithms
Quantum Chip
Control Electronics
Qubit Simulator
Compilers/Runtimes
Algorithms
Libraries
Applications
System Research
www.nersc.gov/systems/edison-cray-xc30/
High Perf Qubit Simulation
Practical Quantum Supremacy:
• Quantum compared to state-of-the-art solver
• In the presence of realistic noise
• On a realistic problem
Problem: Max-Cut graph partitioning
Quantum (quantum heuristics): Quantum Approximate Optimization Algorithm (QAOA)
Classical (state-of-the-art classical solver): akmaxsat Solver
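For readers unfamiliar with the benchmark problem, Max-Cut asks for a split of a graph's vertices into two groups that maximizes the number of edges crossing between them. The sketch below states the objective through exhaustive search. It is neither QAOA nor akmaxsat, and the function name and example graph are chosen here purely for illustration.

```python
from itertools import product

def max_cut_brute_force(n, edges):
    """Try every 2-coloring of n vertices and return the best cut found."""
    best_value, best_assignment = -1, None
    for assignment in product((0, 1), repeat=n):
        # An edge is "cut" when its endpoints land in different groups.
        value = sum(1 for u, v in edges if assignment[u] != assignment[v])
        if value > best_value:
            best_value, best_assignment = value, assignment
    return best_value, best_assignment

# 5-cycle: an odd cycle cannot be fully cut, so the optimum is 4 of 5 edges.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
value, assignment = max_cut_brute_force(5, ring)
print(value)   # 4
```

The search space doubles with every added vertex, which is why both classical solvers and quantum heuristics are compared on how well they scale rather than on small instances like this one.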
Quantum Advantage: For Intractable, Yet Useful Computational Problems*
* [classical optimization] Guerreschi G.G. & Smelyanskiy M., arXiv:1701.01450
* [compiler of quantum circuits] Guerreschi G.G. & Park J., Quantum Science and Technology, 3(4), 045003 (2018)
* [performance crossover] Guerreschi G.G. & Matsuura A.Y., arXiv:1812.07589; to appear in Scientific Reports (2019)
Learned Indices *
* The Case for Learned Index Structures: https://arxiv.org/abs/1712.01208 and ”Everything Is a Model” http://deliprao.com/archives/262
DSAIL Launched late 2018 @ MIT in collaboration with Google-Microsoft <link>
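The core idea of a learned index is that a model predicts where a key sits in a sorted array, and a short bounded search corrects the prediction. The sketch below is a minimal version that uses a single least-squares line in place of the staged models described in the paper. The class and method names are hypothetical.

```python
import bisect

class LearnedIndex:
    """Sorted-array index: a linear model guesses the position, bisect fixes it."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        self.a = cov / var if var else 0.0      # slope of position ~ a * key + b
        self.b = mean_p - self.a * mean_k
        # Worst prediction error over the stored keys bounds the search window.
        self.err = max(abs(self._predict(k) - p) for p, k in enumerate(self.keys))

    def _predict(self, key):
        guess = int(round(self.a * key + self.b))
        return min(max(guess, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        guess = self._predict(key)
        lo = max(0, guess - self.err)
        hi = min(len(self.keys), guess + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None

index = LearnedIndex([3, 8, 15, 16, 23, 42, 99, 1000])
print(index.lookup(23), index.lookup(5))   # 4 None
```

When the key distribution is close to what the model has learned, the error window is small and the lookup touches only a few entries. The outlier 1000 in this example widens the window, which hints at why the paper uses a hierarchy of models rather than one line.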
Machine Learning Based End-to-End Query Optimizer*
[Figure: performance relative to PostgreSQL (y-axis, 0.5 to 2.5) versus training episodes (x-axis, 0 to 100) on JOB (Join Order Benchmark) queries. Series: RL (Row Vectors), RL (Histogram), RL (1-Hot), Postgres.]
* Neo: A Learned Query Optimizer; Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, Nesime Tatbul; Under Review
Machine Learning Based End-to-End Query Optimizer*
[Figure: performance relative to Oracle (y-axis, 0.5 to 2.5) versus training episodes (x-axis, 0 to 100) on JOB (Join Order Benchmark) queries. Series: RL (Row Vectors), RL (Histogram), RL (1-Hot), Oracle, PostgreSQL on Oracle.]
* Neo: A Learned Query Optimizer; Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, Nesime Tatbul; Under Review
Machine Programming*
A process in which some or all of the steps of turning a user’s intent into an executable program are automated.
Three pillars: Intention, Invention, Adaptation
* Three Pillars paper (MAPL ’18): Justin Gottschlich, Armando Solar-Lezama, Nesime Tatbul, Michael Carbin, Martin Rinard, Regina Barzilay, Saman Amarasinghe, Joshua B Tenenbaum, Tim Mattson; https://arxiv.org/abs/1803.07244
Summary
• AI is driving a virtuous cycle of compute: an insatiable appetite for compute
• AI impacts all parts of stack: algorithm, software, architecture … and ALUs too!
• Today’s HPC will get augmented with deep learning, and other forms of ML and graph analytics
• Quantum computing: challenges ahead, opportunities enormous!
• Disruptive and exciting applications of tomorrow, such as machine programming, go well beyond self-driving
We are at an unprecedented convergence of massive compute with massive data ...
This confluence will have a lasting impact on both how we do computing and what computing can do for us!
Thank You
Notice and Disclaimers
Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications, product descriptions, and plans at any time, without notice.
All products, dates, and figures are preliminary for planning purposes and are subject to change without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
The Intel products discussed herein may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com.
Intel® Itanium®, Intel® Xeon®, Xeon Phi™, Pentium®, Intel SpeedStep® and Intel NetBurst® , Intel®, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Copyright © 2012, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Notice and Disclaimers Continued …
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
Notices & Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration.
No product or component can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks .
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks .
Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
*Other names and brands may be claimed as property of others.
Intel, the Intel logo, Xeon, and Optane are trademarks of Intel Corporation in the United States and other countries.
© 2019 Intel Corporation.