Neural Networks and Deep Learning for Physicists
TRANSCRIPT
February 19th, 2015
Data Science Consulting
Héloïse Nonne, Data Scientist
Big Data & deep learning
CINaM, Aix-Marseille University
Big Data?
• Explosion of data size
• Falling cost of data storage
• Increase of computing power
“Information is the oil of the 21st century, and analytics is the combustion engine.”
Peter Sondergaard, Senior Vice President, Gartner Research
The falling cost of data storage

Storage cost for 1 GB:
1980: $300,000 | 1990: $1,000 | 2000: $100 | 2014: $0.10

1956: IBM 350 RAMAC, capacity 3.75 MB
Data growing exponentially
• Over 90% of all the data in the world was created in the past 2 years
• Every year, 2 ZB are now generated (1 ZB (zettabyte) = 1 trillion GB)
• IDC (International Data Corporation) predicts a generation of 40 ZB in 2020
• Around 100 hours of video are uploaded to YouTube every minute
• Today’s datacenters occupy an area of land equal in size to almost 6,000 football fields
Where does data come from?
Two approaches to large databases
• High-tech hardware: roughly double the cost of commodity; roughly 5% failure rate
• Commodity (≠ low end) hardware: roughly half the cost; roughly 10-15% failure rate
Total failure rate = product of local failure rates
-> Design for failure at the software level
Source: www.tomshardware.com
Distribution algorithm: MapReduce
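The MapReduce pattern can be illustrated with a single-process word count; a minimal sketch (the function names are mine for illustration; real frameworks such as Hadoop distribute these phases across many machines):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (key, value) pairs -- here (word, 1) for every word.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values -- here by summing the counts.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big compute", "big storage"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])  # 3
```

In a real cluster, each map and reduce task runs on the node that holds its share of the data, which is exactly the colocation principle of a distributed file system.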
Key principles of a DFS
• Duplication of data
• Distribution of data
• Colocation of processing with the data
• Parallel processing
• Horizontal and vertical elasticity
Hadoop Distributed File System (HDFS) / Computing
Distribution of data over multiple servers
Yes but, what for?
“Big Data is about having an understanding of what your relationship is with the people who are the most important to you, and an awareness of the potential in that relationship.”
Joe Rospars, Chief Digital Strategist, Obama for America
The underlying trends of Big Data

The massive digitalization of the economic, industrial, and social spheres opens the field to new approaches in marketing, finance, and industry.

The challenge for executive and operational management is to master this opportunity in order to face profound changes in the markets and to anticipate evolving customer expectations, usages, processes, and infrastructures.

Data Science, or the art of mastering Big Data, tends to supplant its technological aspect, owing to its strategic importance.

Big Data and Data Science profoundly redefine the relationships between business functions, statistics, and technology.
• Digitalization of social relations -> Marketing
• Digital enterprise -> Finance
• Digital factory -> Industry
• Data monetization -> TMT / Banking
Quantmetry: Big Data & Data Science
• Quantmetry is a “pure player” consulting firm in Data Science and Big Data
• We help companies create value through the analysis of their data
• We are a multidisciplinary team of consultants, data scientists, and Big Data experts
• We base our recommendations on mathematical and statistical models
• Creation and development of specific products around Big Data technologies
• Technological and scientific watch
• Research and development in Data Science
Examples of data projects
• Marketing, targeting
• Smart meters: prediction of electricity or water consumption
• Identification of the most effective molecules in chemotherapy against breast cancer
• Prediction of Vélib bike-share station occupancy
• Optimization of air routes as a function of traffic
• Prediction of breakdowns in automobile fleets
• Prediction of drought using satellite photos
• Fraud detection (social security, insurance, taxes)
Data Science Process

Collection -> Preparation -> Modeling / Data mining -> Interpretation -> Actions
• Reporting
• Visualization
• Analysis
• Predictions
Artificial intelligence and neurons
Artificial intelligence (1956)
How to mimic the brain?
Build artificial intelligences able to think
and act like humans
• Information travels as electric signals (spikes) along the dendrites and axon
• Neuron gets activated if electric signal is higher than a threshold at the synapse
• Activation is more intense if the frequency of the signal is high
McCulloch & Pitts, Rosenblatt (1950s): the perceptron

a(x) = w1·x1 + w2·x2 + b
h(x) = g(a(x))

Artificial neuron = a computational unit that makes a computation based on the information it gets from other neurons
• x = input vector (real valued): the electric signal
• w = connection weights: excitation or inhibition of the neuron
• b = neuron bias: simulates a threshold (in combination with the weights)
• g = activation function: activation of the neuron
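The definitions above fit in a few lines of code; a minimal sketch (the weights, bias, and inputs are illustrative values, not from the talk):

```python
import math

def neuron(x, w, b, g):
    # Pre-activation: a(x) = w.x + b
    a = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Activation: h(x) = g(a(x))
    return g(a)

sigmoid = lambda a: 1.0 / (1.0 + math.exp(-a))

# Illustrative input and parameters: a(x) = 0.8*1.0 - 0.4*0.5 + 0.1 = 0.7
h = neuron(x=[1.0, 0.5], w=[0.8, -0.4], b=0.1, g=sigmoid)
print(round(h, 3))  # 0.668
```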
Activation functions
• Heaviside (perceptron): g(a) = 1 if a > 0, 0 otherwise
• Linear function: g(a) = a
• Sigmoid: g(a) = 1 / (1 + exp(−a))
• Tanh: g(a) = (e^a − e^(−a)) / (e^a + e^(−a))

Linear function: does not introduce nonlinearity and does not bound the output -> not very interesting.
Heaviside function: a little too harsh -> a smoother activation is preferable to extract valuable information.
Sigmoid and tanh are commonly used (with softmax).
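The four activation functions listed above, written as plain Python functions:

```python
import math

def heaviside(a):
    # Perceptron activation: hard threshold at 0.
    return 1.0 if a > 0 else 0.0

def linear(a):
    # Identity: no nonlinearity, unbounded output.
    return a

def sigmoid(a):
    # Smooth, bounded in (0, 1).
    return 1.0 / (1.0 + math.exp(-a))

def tanh(a):
    # Smooth, bounded in (-1, 1).
    return (math.exp(a) - math.exp(-a)) / (math.exp(a) + math.exp(-a))

print(heaviside(0.3), sigmoid(0.0), tanh(0.0))  # 1.0 0.5 0.0
```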
Capacity of a neuron: how much can it do?

Sigmoid function: g(a) = 1 / (1 + exp(−a)), output ∈ (0, 1)
h(x) = p(y = 1|x)
Interpretation: the output is the probability of belonging to a given class (y = 0 or 1).
A neuron can solve linearly separable problems.
Boolean functions

[Figure: 2x2 truth tables in the (x1, x2) plane for OR(x1, x2), AND(x1, x2), and two AND functions with one input negated; each of these is linearly separable]
The XOR affair (1969)
Minsky and Papert (1969), Perceptrons: an introduction to computational geometry

XOR(x1, x2) is impossible with only two layers.
[Figure: XOR truth table in the (x1, x2) plane; not linearly separable]
OK with three layers: an intermediate layer builds a better representation (with AND functions).
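The three-layer solution can be written down explicitly; a minimal sketch with hand-chosen weights for illustration (here the hidden units compute OR and NAND, and the output unit ANDs them together):

```python
def step(a):
    # Heaviside activation.
    return 1 if a > 0 else 0

def xor(x1, x2):
    # Hidden layer builds intermediate features:
    # h1 = OR(x1, x2), h2 = NAND(x1, x2).
    h1 = step(x1 + x2 - 0.5)
    h2 = step(-x1 - x2 + 1.5)
    # Output layer: AND(h1, h2) = XOR(x1, x2).
    return step(h1 + h2 - 1.5)

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

No single neuron can draw this decision boundary, but the intermediate representation makes the problem linearly separable for the output unit.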
Multilayer neural networks
Can they recognize objects?
Can they build their own representations like humans?
Towards a multiply distributed representation
Multilayer neural networks:
Each layer is a distributed representation.
The units are not mutually exclusive (neurons can all be activated simultaneously).
This is different from a partition of the input (where the input belongs to a specific cluster).
The treachery of images
The CAR concept
• An infinity of possible images!
• A high-level abstraction represented by
pixels
• Many problems:
– Orientation
– Perspective
– Reflection
– Irrelevant background
A CAR detector
Build a CAR detector: decompose the problem
• What are the different shapes?
• How are they combined?
• Orientation?
• Perspective?
Pixels -> low-level abstraction -> intermediate-level abstraction -> … -> high-level abstraction -> Car
Spectrum of machine learning tasks (Hinton’s view)
Statistics
• Low-dimensional data (<100 dimensions)
• Lots of noise in the data
• Little structure, and what structure there is can be captured by a rather simple model
Main problem: separate the true structure from the noise

Artificial Intelligence
• High-dimensional data (>100 dimensions)
• Noise should not be a problem
• A huge amount of structure, very complicated
Main problem: represent the complicated structure so that it can be learned
Training a NN / Learning
Training / learning is an optimization problem
M examples with n features x1, x2, …, xn
Two-class {0, 1} classification
Prediction: 1 if f(x) = p(y = 1|x) > 0.5, 0 otherwise
• Classification error is not a smooth function
• Better to optimize a smooth upper-bound substitute: the loss function
Learning algorithm
Backpropagation algorithm
• Invented in 1969 (Bryson and Ho)
• Independently re-discovered in the mid-1980s by several groups
• 1989: first successful application to a deep neural network (LeCun): recognition of hand-written digits

1. Initialize the parameters θ = (w, b)
2. For i = 1…M iterations, for each training example (x^(t), y^(t)):
   Δ = −∇_θ l(f(x^(t); θ), y^(t)) − λ∇_θ Ω(θ)
   θ = θ + αΔ
• The gradient tells in which direction the biggest decrease in the loss function is, i.e. how we can change the parameters to reduce the loss
• α: hyperparameter = learning rate

Important things: a good loss function, an initialization method, and an efficient way of computing the gradient many times (for each example!)
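The update rule above can be sketched for the single logistic neuron of the earlier slides, with cross-entropy loss and no regularizer (λ = 0). The toy dataset and hyperparameters are illustrative, not from the talk:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Linearly separable toy data: ([x1, x2], y).
data = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0),
        ([1.0, 1.0], 1), ([0.9, 0.9], 1)]

w, b, alpha = [0.0, 0.0], 0.0, 0.5   # theta = (w, b), alpha = learning rate
for epoch in range(500):
    for x, y in data:
        f = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        # For the cross-entropy loss, -dl/da = y - f, so Delta = (y - f) * x
        # for the weights and (y - f) for the bias; theta = theta + alpha*Delta.
        delta = y - f
        w = [wi + alpha * delta * xi for wi, xi in zip(w, x)]
        b += alpha * delta

preds = [1 if sigmoid(w[0] * x[0] + w[1] * x[1] + b) > 0.5 else 0
         for x, _ in data]
print(preds)  # matches the labels [0, 0, 0, 1, 1]
```

For a multilayer network the same update applies to every layer, with the gradients supplied by backpropagation.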
Training a NN / Learning
For each training example, do forward propagation -> get f(x)
Then backpropagate -> modify (w, b) for each layer
Many tricks for training a NN
• Mini-batch learning
• Regularization and the bias-variance trade-off:
  • Variance: how much the model varies across training sets; reduced when λ ≫ 0
  • Bias: how far away from the true model we are; small when λ ∼ 0
• Tuning hyperparameters for better generalization: do not optimize too much
• Early stopping
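Early stopping, the last trick above, can be sketched as follows (the validation-loss curve and patience value are made up for illustration):

```python
# Stop training when the validation loss has not improved for `patience`
# consecutive epochs, and keep the parameters from the best epoch.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]  # illustrative curve

patience, best, best_epoch, wait = 2, float("inf"), -1, 0
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, best_epoch, wait = loss, epoch, 0  # improvement: reset patience
    else:
        wait += 1
        if wait >= patience:
            break                                # no improvement: stop early

print(best_epoch, best)  # 3 0.5
```

In practice one would checkpoint (w, b) at each improvement and restore the checkpoint from `best_epoch`, rather than letting the network keep overfitting the training set.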
Deep learning
Why is it so difficult?
Usually better to use only 1 layer! Why?
• Underfitting situation: a very difficult optimization problem; we would do better with a better optimization procedure
• Saturated units -> vanishing gradient (close to 0) -> updates are difficult
• But saturation corresponds to the nonlinearity of NNs, their interesting part
• Overfitting situation: too many layers -> too fancy a model
• Not enough data!!!! -> But with big data, things tend to improve

-> Better optimization
-> Better initialization and better regularization
2006: The Breakthrough
Before 2006: training deep neural networks was unsuccessful! (except for CNN)
2006: 3 seminal papers
• Hinton, Osindero, and Teh,
A Fast Learning Algorithm for Deep Belief Nets
Neural Computation, 2006
• Bengio, Lamblin, Popovici, Larochelle,
Greedy Layer-Wise Training of Deep Networks
Advances in neural information processing systems, 2007
• Ranzato, Poultney, Chopra, LeCun,
Efficient Learning of Sparse Representations with an Energy-Based Model
Advances in neural information processing systems, 2006
The main point: greedy learning
Find the good representation using unsupervised training -> let the neural network learn by itself!
• Recognize the difference between a character and a random image -> try to understand instead of copying -> less overfitting and improved generalization
• Unsupervised pretraining: train layer by layer (greedy learning) -> local extraction of information -> the previous layer is seen as raw input representing features
• Each layer is able to find the most common features in the training inputs (more common than random)

Once a good representation has been found at each level, it can be used to initialize and successfully train a deep neural network with the usual supervised gradient-based optimization (backpropagation).
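The greedy layer-by-layer scheme can be sketched with small autoencoders: each layer is trained to reconstruct the output of the previous layer, and the learned encoders then initialize the deep network. The sizes, random data, and plain gradient-descent autoencoder below are illustrative, not the setup used in the papers:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=200):
    # One-hidden-layer autoencoder: encode H = sigmoid(X W), decode R = H V,
    # trained by gradient descent on the squared reconstruction error.
    n_in = X.shape[1]
    W = rng.normal(0, 0.1, (n_in, n_hidden))   # encoder weights
    V = rng.normal(0, 0.1, (n_hidden, n_in))   # decoder weights
    for _ in range(epochs):
        H = sigmoid(X @ W)
        R = H @ V
        err = R - X
        gV = H.T @ err                          # gradient w.r.t. decoder
        gH = err @ V.T * H * (1 - H)            # backprop through sigmoid
        gW = X.T @ gH                           # gradient w.r.t. encoder
        W -= lr * gW / len(X)
        V -= lr * gV / len(X)
    return W                                    # keep only the encoder

X = rng.random((50, 8))        # unlabelled training data (illustrative)
layer_sizes = [6, 4]           # two hidden layers to pretrain greedily
weights, inp = [], X
for n_hidden in layer_sizes:
    W = train_autoencoder(inp, n_hidden)
    weights.append(W)
    inp = sigmoid(inp @ W)     # this representation feeds the next layer

print([w.shape for w in weights])  # [(8, 6), (6, 4)]
```

The stacked encoder weights would then initialize a deep network that is fine-tuned with supervised backpropagation.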
MNIST
Result of pretraining
Larochelle, Bengio, Louradour, Lamblin JMLR (2009)
Many unsupervised learning techniques
• Restricted Boltzmann machines
• Stacked denoising autoencoders
• Semi-supervised embeddings
• Stacked kernel PCA
• Stacked independent subspace analysis
• …

This partially solves the problem of unlabelled data:
• Pre-train on unlabelled data
• Fine-tune using labelled data (supervised learning)
Pretraining does help deep learning
Why does unsupervised pre-training help deep learning?
Erhan, Courville, Manzagol, Bengio (2011)
Google Brain
2012: Google’s Large Scale Deep Learning Experiments
• An artificial neural network
• Computation spread across 16,000 CPUs
• Models with more than 1 billion connections
The next steps
Deep learning is good for:
• Automatic speech recognition
• Image recognition
• Natural language processing
• How well can deep learning be adapted to distributed systems (Big Data)?
• Online learning?
• Application to other problems?
  • Time series (consumption prediction)
  • Scoring (churn prediction, marketing)
  • Application to clustering
• How much more data?
Questions?
@heloisenonne
www.quantmetry.com