[ Introduction ] Deep Learning

정상근

2015-11-03

WHAT MACHINES CAN DO? (2015)

Applications

Video Understanding (Real-time Genre Detection)

Google

http://techcrunch.com/2014/11/18/new-google-research-project-can-auto-caption-complex-images/

Image Understanding

Google


DNN for Image Understanding

• Image to Natural Language (By Google)



Semantic Guessing

:: Because a DNN can map symbols into a vector space, the relationships between symbols can be estimated 'mathematically'.

Ex) King – Man + Woman ≈ Queen
:: This implies that a list of numbers carries semantic meaning.

Microsoft, “Linguistic Regularities in Continuous Space Word Representations”, 2013
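The analogy arithmetic can be sketched with a few hand-made vectors. This is only an illustration: the three-dimensional vectors and their axis meanings are assumptions invented for the example, not embeddings learned by a DNN.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
# Assumed axes: [royalty, maleness, femaleness]
VECTORS = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = (w for w in VECTORS if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

print(analogy("king", "man", "woman"))
```

With real embeddings the nearest word is found the same way, only over a much larger vocabulary.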

Semantic Guessing - Demo

http://deeplearner.fz-qqq.net/

Ex) korea – kimchi, china – ?

Image Completion

:: Constraints are applied to a model trained with a Shape Boltzmann Machine so that the image is restored in the desired way

:: Demo - https://vimeo.com/38359771

Hand Writing by Machine

:: The machine imitates a person's handwriting and can write cursive text itself.

:: Demo - http://www.cs.toronto.edu/~graves/handwriting.html

[Figure: handwriting written by a machine]

Music Composition

:: Sheet music generation using a Recurrent Neural Network

:: Demo - https://highnoongmt.wordpress.com/2015/05/22/lisls-stis-recurrent-neural-networks-for-folk-music-generation/

Neural Machine Translation

:: http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/

:: Attempts language-to-language translation at the level of the neural network

:: Demo - http://104.131.78.120/

Bernard Vauquois' pyramid showing comparative depths of intermediary representation, interlingual machine translation at the peak, followed by transfer-based, then direct translation.[ http://en.wikipedia.org/wiki/Machine_translation]

http://en.wikipedia.org/wiki/Interlingual_machine_translation

Play for Fun – “Learn how to play game”

:: The system reads the game console's memory directly and uses deep learning to learn how to play on its own

ARTIFICIAL INTELLIGENCE

Overview

Overview – Artificial Intelligence & Cognitive Science

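Timeline (axis: 1940 · 1980 · 2000 · 2010)

(1913) Bohr : atomic model
(1915) Einstein : theory of relativity
(1936) Turing : Turing machine
(1939~45) von Neumann : computer architecture
(1948) Shannon : binary representation, information theory
(1955) Chomsky : logical linguistics
(1957) Rosenblatt : Perceptron (1st Neural Network)
(1960) Back-propagation Algo. (neural network training)
(1980) John Searle : Chinese Room argument
(1989) Berners-Lee : World Wide Web

[ Computer Science ]
- Computer architecture established
- Symbolic AI (Computationalism) [ Rule-based AI ]
- Connectionist AI (Connectionism) [ Neural-network-based AI ]
- Purely statistical AI* [ Statistics-based AI ]
- Connectionist AI [ DNN ]

* : an attempt to implement AI by purely statistical methods alone, without considering the structure of the human brain

[ Cognitive Science ]
- Birth of cognitive science
- Cognitivism : “mind = computer”
- Embodied cognition
- “Mind-body dualism”, “Mind-body monism”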

Historical View - Artificial Intelligence

[Timeline axis: 1940 · 1980 · 2000 · 2010]

Computer architecture established

Rule Based AI

Data Driven AI

Decision Tree

Statistic Only based AI (HMM, CRF, SVM ..)

NN based AI

DNN based AI

Strong AI : Machines can be as intelligent as humans.

Weak AI : Machines can only partially imitate human intelligence.

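Computationalism vs. Connectionism

Computationalism
- Abstracts and symbolizes the structure of the brain
- Focuses on individual symbols and the rules between them
- Holds that mental activity can be explained through symbol manipulation
- Pursues learning with rules specialized for a particular domain

Connectionism
- Models the structure of the brain itself at a low level
- Focuses on how neurons learn in response to stimuli from the external environment
- Holds that symbol manipulation alone cannot sufficiently explain mental activity
- Pursues general learning methods that apply across many fields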

[ Representation ]

Cat

[0, 0, 0, 1, 0, … ]

One-Hot Representation

[ 34.2, 93.2, 45.3, … ]

Distributed Representation

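Artificial Intelligence (in the traditional sense)

Every attempt to implement human intelligence in a machine. It covers all areas, including human thinking, memory, understanding, learning, perception, and control.

Field | Research | Notes
Knowledge Representation | Knowledge Representation, Commonsense Knowledge | Linked to Ontology
Planning | Automated planning | Game, Robotics, Multi-Agent Cooperation
Learning | Machine Learning | Used in many areas
Communication | Natural Language Processing |
Perception | Computer Vision, Speech Recognition | Interface
Motion and Manipulation | Robotics |
… | … | …

:: AI in the traditional sense is an attempt to implement human intelligence. Recently, research on intelligence that augments what humans lack has also become active.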

Machine Learning

An attempt to implement in machines the part of intelligence related to 'learning'*

Main purpose : Prediction / Inference

Training Time : Known Data + Known Responses → Model

Running Time : Model + New Data → Predicted Responses

Reproduce known knowledge

- Data Mining techniques can be used for data processing, extraction, and interpretation of the final results

Applications : http://en.wikipedia.org/wiki/Machine_learning#Applications

* Note that training a machine does not necessarily mean making it the same as a human. Making machines good at things that humans cannot do at all is also a goal of machine learning.

Data Mining

An attempt to discover patterns in data

Main purpose : Pattern Discovery

Unknown Data → Miner → Pattern

Produce unknown knowledge

- Because it discovers unknown knowledge, it is closely related to Unsupervised Learning within Machine Learning.

- Part of Data Mining concerns 'intelligent discovery by humans', but much of it concerns 'what humans cannot discover'.

Applications : http://en.wikipedia.org/wiki/Data_mining#Data_mining

Machine Learning and Other AI Tasks

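Most recent AI problems tend to be solved using empirical data. In this respect, machine learning techniques often spread to other fields.

Conversely, research results from other fields sometimes influence the discovery and solution of new problems in Machine Learning.

Ex) Speech recognition task and ML

Recorded speech data + transcribed script → HMM Training (EM Algo.) → HMM Model

The EM Algorithm used for HMM model training is one of the representative parameter training algorithms in Machine Learning.

Ex) Morphological analysis task and ML

Morpheme-tagged data → CRF training (LBFGS Algo.) → CRF Model

CRF, one of the representative techniques for morphological tagging, was published in the Machine Learning community and is a representative algorithm that has been applied successfully across natural language processing.

The ML community mostly conducts research focused on mathematical and statistical sophistication, and shows progress by applying the algorithm to toy examples (natural language, vision, speech ..) and demonstrating improved performance. When an innovative and promising ML technique is published, it sometimes spreads to other fields.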

Summary

• Statistics quantifies numbers
• Data Mining explains patterns
• Machine Learning predicts with models
• Artificial Intelligence behaves and reasons
• Cognitive Science is the scientific study of the mind and its processes

Artificial Intelligence : deals with intelligence (human intelligence → human + α)

Machine Learning : deals with the ability to learn

Data Mining : shares much of its technology

Cognitive Science : deals with information-processing activity between humans, between humans and animals, and between humans and artifacts

ARTIFICIAL INTELLIGENCE PROBLEMS

Statistical View

Problem Formulation (from a statistical point of view)

Human : Act, Recognize, Understand, Learn, Plan, Communicate

Raw data → Process → Selection / Grouping / Learning

The final form rarely departs from these three forms.

Selection & Grouping

Selection
- Classification : choosing one item
- Ranking : choosing several items and putting them in order

Grouping
- Clustering : grouping several items
- Hierarchical Clustering : grouping several items into a structure

Statistical Approach to Classification

The simplest classification is a 'line-drawing' problem.

The problem of drawing a line that separates two groups: left / right, up / down, inside / outside, + / -

The y = ax + b problem (Linear)

Regression – 회귀 분석

http://wolfpack.hnu.ac.kr/lecture/Regression/ch1_introduction.pdf

Regression : dictionary meaning - “Go back to an earlier and worse condition”

:: While studying the correlation between the heights of parents and their children (928 people), Francis Galton (1822 ~ 1911) found that height does not grow or shrink without limit but has a “tendency to return to the overall average height”. He named this regression analysis.

:: Karl Pearson (1903) examined the heights of 1,078 father-son pairs and derived a linear relationship

Father's height = 33.71 + 0.516 * son's height

Support Vector Machine (SVM) – one of the most successful classifiers

• How should the line be drawn? → Maximum Margin (1963 – Vapnik)

• How can problems that a straight line cannot separate be solved? → Kernel Trick (1992 – Vapnik)

[Figure legend: support vector]

:: When drawing the line that separates two classes (black/blank), make the distance between the line and the support vectors as large as possible

:: If each point in the original space is moved into a new dimension using a kernel function, the problem can become separable by a straight line.

http://en.wikipedia.org/wiki/Support_vector_machine
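The idea of moving points into a new dimension can be sketched with an explicit feature map. This is a simplification: a real SVM kernel computes inner products in the new space implicitly, and the data, the mapping x → (x, x²), and the threshold here are all assumptions chosen for illustration.

```python
# 1-D points: the inner class sits between the two halves of the outer class,
# so no single threshold on x separates them.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

def feature_map(x):
    """Explicit mapping x -> (x, x^2); an assumption standing in for a kernel."""
    return (x, x * x)

def classify(x, threshold=1.0):
    """Separate the classes with the straight line x2 = threshold in the new space."""
    _, squared = feature_map(x)
    return 1 if squared > threshold else -1

print([classify(x) for x in points])
```

In the mapped space the two classes lie on opposite sides of a horizontal line, which is what makes a linear separator sufficient.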

Statistical Learning

There are many variations, but most follow the form below.

Inference : Feature Extraction → Prediction (with parameters θ)

Learning : Evaluation Function, the distance between (Reference ~ Prediction): how closely predicted? → Parameter Update (θ → θ')

Feature Design / Evaluation Function / Parameter Update

Feature Design : features describe a real-world object

Evaluation Function : distance between prediction and reference

Parameter Update : how to update the parameters to fit the data

Using well-designed features is the core of statistical machine learning (Feature Engineering)

Recently, even this is learned by the machine itself (DNN)

WHY DEEP LEARNING ?

Why Deep Learning - “Learning Representation”

color = ‘red’ shape = ‘round’ leafs = ‘yes’ dot = ‘yes’ …

Numbers

- The model learns by itself a representation that distinguishes an apple as an 'apple'

No more handcraft feature engineering!

http://www.iro.umontreal.ca/~bengioy/dlbook/intro.html


Why Deep Learning - “Distributed Representation” (1)

[ Representation ]

Cat

[0, 0, 0, 1, 0, … ]

One-Hot Representation

[ 34.2, 93.2, 45.3, … ]

Distributed Representation

:: What makes DNNs significant compared with earlier AI methodologies is that they do not rely on symbols when representing real objects in the real world.

Why Deep Learning - “Distributed Representation” (2)

- Similar things should be represented 'similarly'
- It should be possible to overcome the Curse of Dimensionality

Apple = 001

Pear = 010

Ball = 100

Distance(Apple ~ Pear) = Distance(Apple ~ Ball)
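The distance claim is easy to check numerically. The one-hot codes below are the ones on the slide; the distributed vectors and their axis meanings are hypothetical values made up for the comparison.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# One-hot codes from the slide
one_hot = {"apple": [0, 0, 1], "pear": [0, 1, 0], "ball": [1, 0, 0]}

# Hypothetical distributed vectors (assumed axes: [edible, round])
distributed = {"apple": [0.9, 0.8], "pear": [0.9, 0.5], "ball": [0.0, 1.0]}

for name, table in (("one-hot", one_hot), ("distributed", distributed)):
    d_pear = euclidean(table["apple"], table["pear"])
    d_ball = euclidean(table["apple"], table["ball"])
    print(f"{name}: apple~pear={d_pear:.3f}, apple~ball={d_ball:.3f}")
```

Under one-hot coding every pair of distinct symbols is equally far apart, whereas a distributed code can place the apple closer to the pear than to the ball.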

Why Deep Learning - “Reusable Learning Result”

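- Previously each problem was solved separately, and the results were hard to combine organically
- With Deep Learning, a problem solved in another domain can be brought directly into the current problem

[Figure: morphological analysis, syntactic parsing, and normalization shown first as separate modules, then as connected modules]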

Why Deep Learning - “Design Network Solve Problem”

- New problems can be solved depending on which kinds of intelligence are combined and how.

Plate recognition + Apple recognition + “~ on ~” → Meaning : “Apple on Plate”

Why Deep Learning - “Unlabeled Data >>>>>>>>>>>>>>>>>> Tagged Data”

- A learning method that can make use of large amounts of unlabeled data

[ Previous Machine Learning ] [ Deep Learning ]

Small Tagged Data

P( y | x)

Large Raw Data

P(x)

Small Tagged Data

P( y | x)

NEURAL NETWORK

Review

One Learning Algorithm

Neuro-Rewiring Experiment

Auditory cortex learns to see

Auditory Cortex

[Roe et al., 1992]

Slide from Andrew Ng

:: If the neural pathway connected to hearing is cut and the pathway from the optic nerve is connected there instead, the auditory cortex becomes able to 'see'.


Somatosensory cortex learns to see

Somatosensory Cortex

:: If the part connected to the sense of touch is cut and connected instead to the pathway from the optic nerve, the somatosensory cortex becomes able to 'see'.

[Metin & Frost, 1989]



Seeing with your tongue

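Low-resolution gray-scale image → converted into electrical signals → the signals are delivered continuously to the tongue → at some point the person becomes able to 'see' with the tongue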


Neurons Firing Off in Real-time

http://www.dailymail.co.uk/sciencetech/article-2581184/The-dynamic-mind-Stunning-3D-glass-brain-shows-neurons-firing-real-time.html

A neuron keeps receiving signals, combines ('sums') them, and 'fires' once a certain threshold is exceeded.

Neurons in Brain

Illustrative Example ( Apple Tree )

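[Plot: apple size (0 to 4) against day (10 to 50)]

- “For a certain apple tree, the sizes of its apples were measured and recorded by date over several years”
- Suppose the farmer can take apples to market only when they exceed a certain size.

- Q : Can the apples be sold on Day 50 this year, or not?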

Illustrative Example

Day 0 : default size = 5
Day 10 : size = 10
Day 20 : size = 15
Day 30 : size = 20
Day 40 : size = 25

If size > 30, sell an apple!

Situation 1 : Until last year, this apple tree bore fruit following the trend above. Condition : an apple can be sold once its size exceeds 30.

Question : Can an apple be sold on Day 50 this year?

Regression Problem

Sell

Illustrative Example

[Plot: size (5 to 30) against day (10 to 50)]

y = ax + b

Size = 0.5*day + 5

Activation point to sell an apple!

Regression learns the parameters ‘a’ and ‘b’ from the data
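A minimal sketch of how 'a' and 'b' could be learned from the apple measurements, using ordinary least squares. The helper name `fit_line` is made up for this example, and least squares is an assumption; the slides do not say which fitting method is meant.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Measurements from the slide: size on day 0, 10, 20, 30, 40
days = [0, 10, 20, 30, 40]
sizes = [5, 10, 15, 20, 25]

a, b = fit_line(days, sizes)
predicted = a * 50 + b
print(f"Size = {a:.2f}*day + {b:.2f}; predicted size on day 50: {predicted:.1f}")
```

The fitted line reproduces Size = 0.5*day + 5, so the prediction for day 50 lands on the threshold value of 30 itself.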

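Apple Selling Example vs. Neural Network Framework

Apple selling example : y = ax + b, transforms the input into a new value (day → size)

Neural network framework : Y = WX + b

Apple selling example : if y > 30, sell an apple; reinterprets the new value to produce the final result (size → sell (1) / don't sell (0))

Neural network framework : Activation function (Step Function), normalization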

Perceptron – Simplest ANN (1)

http://natureofcode.com/book/chapter-10-neural-networks/

A perceptron consists of one or more inputs, a processor, and a single output.

Step 1: Receive inputs.

Step 2: Weight inputs.

Step 3: Sum inputs.

Step 4: Generate output.

The Perceptron Algorithm:
1) For every input, multiply that input by its weight.
2) Sum all of the weighted inputs.
3) Compute the output of the perceptron based on that sum passed through an activation function (the sign of the sum).

Sum = W0 * input0 + W1 * input1

if (sum > 0) return 1; else return -1;

[Figure: input 0 and input 1 feed the processor, which produces an output of +1 or -1]

Perceptron – Learning Rule

:: Video (https://www.youtube.com/watch?v=vGwemZhPlsA)
:: The slope (w) and bias (b) are changed repeatedly to search for a line that separates the O and X points
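The search for a separating line can be sketched with the classic perceptron update rule. The learning rate, epoch count, zero initialization, and the OR data set are assumptions chosen to keep the example small.

```python
def predict(weights, bias, inputs):
    """Weighted sum followed by the sign activation (+1 or -1)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else -1

def train(samples, learning_rate=0.1, epochs=20):
    """Adjust the weights and bias whenever a sample is misclassified."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # 0 when correct
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Logical OR with targets in {-1, +1}; this data is linearly separable
OR_DATA = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(OR_DATA)
print([predict(weights, bias, x) for x, _ in OR_DATA])
```

Each update nudges the line toward the misclassified point, which is the repeated change of slope and bias described above.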

Limitation of ‘Perceptron’

Perceptron can do : Linearly Separable!

Perceptron cannot do : Not Linearly Separable!

What if multiple perceptrons are combined?

input, input → OR (solver) and NOT AND (solver) → output : XOR
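A sketch of XOR built from several perceptrons. The slide names the OR and NOT AND units; combining their outputs with an AND unit, and the specific weights and biases, are assumptions made for this example.

```python
def perceptron(weights, bias, inputs):
    """Single perceptron with a step activation that outputs 0 or 1."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def xor(a, b):
    or_out = perceptron([1, 1], -0.5, (a, b))             # OR solver
    nand_out = perceptron([-1, -1], 1.5, (a, b))          # NOT AND solver
    return perceptron([1, 1], -1.5, (or_out, nand_out))   # AND of the two (assumed)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))
```

No single perceptron can produce this truth table, but two layers of them can.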

Multilayer Perceptron (MLP)

The single-hidden layer Multi-Layer Perceptron (MLP).

An MLP can be viewed as a logistic regressor, where the input is first transformed using a learnt non-linear transformation

D is the size of input vector x
L is the size of output vector f(x)

G : Scoring Function for top-layer

x

S : Activation Function for hidden layer

[ Softmax Function ]

[ tanh Function ]
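A minimal forward pass for a single-hidden-layer MLP, with tanh as the hidden activation S and softmax as the top-layer scoring function G. All parameter values and layer sizes are hypothetical and picked only so that the example runs.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def affine(matrix, bias, vector):
    """Compute matrix * vector + bias, one row at a time."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(matrix, bias)]

def mlp_forward(x, w1, b1, w2, b2):
    hidden = [math.tanh(z) for z in affine(w1, b1, x)]   # S : tanh activation
    return softmax(affine(w2, b2, hidden))               # G : softmax scoring

# Hypothetical parameters: D = 3 inputs, 2 hidden units, L = 2 outputs
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
B1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [-1.0, 1.0]]
B2 = [0.0, 0.0]

probs = mlp_forward([1.0, 0.5, -1.0], W1, B1, W2, B2)
print(probs)
```

The hidden layer is the learnt non-linear transformation, and the softmax on top plays the role of the logistic regressor.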

Feed Forward Propagation

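Compare the correct answer with the predicted answer → error

Which direction should learning proceed in? → In the direction that makes the error smaller

Which direction makes the error smaller?

How much must my knowledge be corrected to make the error smaller?

How to determine the direction (1)

If this curve represents the overall error, we have to determine the direction in which the error decreases at this point.

How to determine the direction (2)

If we compute the slope at this point and move in the downhill direction, the error should become smaller

“Differentiation”, “Slope = Gradient”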

Gradient Descent

“A brief introduction to neural network”, David Kriesel, dkriesel.com

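Gradient Descent

[Figure: surface of the cost J(θ0, θ1) over the parameters θ0 and θ1] Best-case

Gradient Descent

[Figure: surface of the cost J(θ0, θ1) over the parameters θ0 and θ1] Local Minimum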

Minima and Maxima
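The effect of the starting point can be sketched on a one-dimensional function with two minima. The function, learning rate, and step count are assumptions chosen for illustration; they are not taken from the slides.

```python
def f(x):
    """A curve with a lower minimum on the left and a shallower one on the right."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, learning_rate=0.01, steps=1000):
    for _ in range(steps):
        x -= learning_rate * grad_f(x)   # step against the gradient
    return x

left = gradient_descent(-2.0)
right = gradient_descent(2.0)
print(f"start -2.0 -> x={left:.3f}, f(x)={f(left):.3f}")
print(f"start  2.0 -> x={right:.3f}, f(x)={f(right):.3f}")
```

Both runs stop where the slope is zero, but only the run that starts on the left reaches the lower of the two minima; the other one settles in a local minimum.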

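How should the error be corrected?

How much did each part contribute to the error? = differentiation

How far should we move in the direction that corrects the error? = learning rate x derivative

Correct the error repeatedly, working downward = Back-Propagate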

MLP – Training (Weight Optimization)

- How to learn the weights??

“Backpropagation Algorithm”

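Obtain the final result : Feed Forward and Prediction

Find the difference between that result and the result we want : Cost Function

Work out what causes that difference : Differentiation

Estimate while moving backward down the network : Back Propagation

Learn the new parameter values : Weight Update

Cf) The derivative of “velocity” is “acceleration”; “acceleration” is what changes “velocity”

[Figure: network from Input to Output]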
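The five steps can be traced on the smallest possible network: one input, one tanh hidden unit, and one linear output. The network shape, the squared-error cost, the starting weights, and the learning rate are all assumptions made to keep the sketch readable.

```python
import math

def forward(w1, w2, x):
    """Feed forward: hidden = tanh(w1 * x), output = w2 * hidden."""
    hidden = math.tanh(w1 * x)
    return hidden, w2 * hidden

def loss(w1, w2, x, target):
    """Cost function: half the squared difference from the target."""
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2

def backward(w1, w2, x, target):
    """Gradients of the loss w.r.t. w1 and w2 via the chain rule."""
    hidden, y = forward(w1, w2, x)
    d_y = y - target                          # dLoss/dy
    d_w2 = d_y * hidden                       # dLoss/dw2
    d_hidden = d_y * w2                       # error propagated back to the hidden unit
    d_w1 = d_hidden * (1 - hidden ** 2) * x   # tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2

w1, w2, x, target = 0.5, -0.3, 1.5, 1.0
initial_loss = loss(w1, w2, x, target)
for step in range(200):
    g1, g2 = backward(w1, w2, x, target)
    w1 -= 0.1 * g1   # weight update
    w2 -= 0.1 * g2
print(initial_loss, loss(w1, w2, x, target))
```

The error at the output is passed back through `w2` to reach the hidden unit, which is the backward estimation step in miniature.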

Summary : Neural Network – Core Components

[Figure: a single-hidden-layer network. Raw Data enters as inputs x1 ... x6 (Visible Layer); hidden units z1, z2, z3 (Hidden Layer) each apply a summation ∑ and an activation; S produces the Score]

Sensing : Vector Form Representation
Neuron structure : Edge Connection
Summation : Matrix Product
Fire : Activation Function
Decision : Scoring Function

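A single-hidden-layer NN is shown to aid understanding

[Figure: raw data passed through the network as vectors; labels: Matrix operation (1), Parameter Update (2)]

Application-specific operation, Ex) predicted stock price

The error between the predicted value and the actual value is propagated to the network below, Ex) error = actual - predicted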

Summary : Neural Network – Process

DEEP NEURAL NETWORK

Why old ANN was not successful?

Pre-Training Distributed Representation

Deep Learning

Initialization Local Minima Computation Power Data

Activation Function Understanding ANN

Initialization Techniques

Big Data

…

Only the representative bottlenecks are shown

Remind – # of Parameters & Local Minima

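$$a_1 = f(W_{11}x_1 + W_{12}x_2 + W_{13}x_3 + b_1)$$
$$a_2 = f(W_{21}x_1 + W_{22}x_2 + W_{23}x_3 + b_2)$$
$$a_3 = f(W_{31}x_1 + W_{32}x_2 + W_{33}x_3 + b_3)$$

In Matrix notation

$$\mathbf{z} = W\mathbf{x} + \mathbf{b}, \qquad \mathbf{a} = f(\mathbf{z})$$

$$W = \begin{pmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \\ W_{31} & W_{32} & W_{33} \end{pmatrix}, \qquad \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$$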

- The deeper and more complex the network, the larger the number of parameters.
- The more parameters there are, the more likely training is to fall into a local minimum

Initialization Problem

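[Figure: the single-hidden-layer network from Input (Raw Data, x1 ... x6, Visible Layer) through hidden units z1, z2, z3 (Hidden Layer) to the Score S at the Output, with weights W' and W'']

What is the vector form of the symbol “hello”?

How should the initial summation weights be decided?

“Random Initialization”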

Deeper Network, Harder Learning

- It has been shown that the deeper the network, the better the final performance
- However, the deeper it gets, the harder error propagation becomes

- “Vanishing gradient problem”
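Why error propagation gets harder with depth can be sketched by multiplying the local derivatives along a chain of units. The sketch assumes sigmoid activations, a weight of 1.0 on every connection, and a pre-activation of 0 (where the sigmoid derivative is at its maximum of 0.25); none of these choices come from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_through_layers(depth, weight=1.0, pre_activation=0.0):
    """Product of local derivatives along a chain of `depth` sigmoid units."""
    grad = 1.0
    for _ in range(depth):
        s = sigmoid(pre_activation)
        grad *= weight * s * (1.0 - s)   # sigmoid'(z) = s * (1 - s) <= 0.25
    return grad

for depth in (1, 5, 10, 20):
    print(depth, gradient_through_layers(depth))
```

Even in this best case for the sigmoid, the signal that reaches the lowest layer shrinks by a factor of four per layer.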

Pre-Training

Large Raw Data

P(x)

Small Tagged Data

P( y | x)

“Pretraining”

Unsupervised Learning

Supervised Learning

- The development of pretraining improved NN performance dramatically
- There are two families: AutoEncoders and Restricted Boltzmann Machines

- RBM is not covered today.

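If we know what a 'ladder' is, couldn't we also reconstruct it?

Generation

What makes it possible to generate a ladder?

The core, skeleton, and information that make up a 'ladder' …… (Essence)

The core information is probably a smaller amount of information than the original ladder … (nothing superfluous)

Illustrative Image

Extract the core information that can explain the original data

Regenerate the data from the core information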

Deep Learning – Auto Encoder

Original Data → (Encoding) → Abstracted Data → (Decoding) → Original Data'

The original data X is projected onto H, and X' is then generated again from H.

- Compare with:
- Compression algorithms (Zip, MPEG, PNG … )
- Principal Component Analysis (PCA)
- Kernel function in SVM (original space → hyper space)

In X → H → X', the smaller Difference(X, X') is, the more complete the abstraction can be considered. If such a projection is trained perfectly,

- the Abstracted Data can itself be regarded as a feature that describes the original data.

- Feature learning can be said to happen automatically

Deep Learning – Auto Encoder

Input (x) → Hidden → Z (prediction of x : x')

Encoding/Decoding Error (reconstruction)

… real value case

Cross-entropy … bit vector or vectors of bit probability

Note that it is purely unsupervised learning!
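Two common ways to measure the encoding/decoding error can be sketched as follows. Using the squared error for the real-valued case is an assumption (the formula did not survive on the slide); the cross-entropy is the standard form for bit vectors or bit probabilities. The example vectors are made up.

```python
import math

def squared_error(x, z):
    """Sum of squared differences; an assumed choice for real-valued inputs."""
    return sum((a - b) ** 2 for a, b in zip(x, z))

def cross_entropy(x, z, eps=1e-12):
    """Cross-entropy between a bit vector x and reconstructed probabilities z."""
    total = 0.0
    for a, b in zip(x, z):
        b = min(max(b, eps), 1.0 - eps)   # keep log() away from 0
        total -= a * math.log(b) + (1 - a) * math.log(1 - b)
    return total

x = [1, 0, 1, 1]                    # original input
good = [0.9, 0.1, 0.8, 0.9]         # a close reconstruction
bad = [0.4, 0.6, 0.5, 0.3]          # a poor reconstruction
print(squared_error(x, good), squared_error(x, bad))
print(cross_entropy(x, good), cross_entropy(x, bad))
```

No labels appear anywhere: the input itself is the training target, which is why this is unsupervised.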

Deep Learning – Auto Encoder for Weight Optimization (1)

First, you would train a sparse autoencoder on the raw inputs x(k) to learn primary features h(1)(k) on the raw input.

Next, you would feed the raw input into this trained sparse autoencoder, obtaining the primary feature activations h(1)(k) for each of the inputs x(k). You would then use these primary features as the "raw input" to another sparse autoencoder to learn secondary features h(2)(k) on these primary features.

http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders

Deep Learning – Auto Encoder for Weight Optimization (2)

You would then treat these secondary features as "raw input" to a softmax classifier, training it to map secondary features to digit labels.

Finally, you would combine all three layers together to form a stacked autoencoder with 2 hidden layers and a final softmax classifier layer capable of classifying the MNIST digits as desired.

Deep Learning – Auto Encoder – Denoising Auto Encoder

P. Vincent, H. Larochelle, Y. Bengio and P.A. Manzagol, Extracting and Composing Robust Features with Denoising Autoencoders

Original Data → (add noise) → Noisy Data → (Encoding) → Abstracted Data → (Decoding) → Original Data'

Noise is added to the data X to create NX; the model is trained to project NX onto H and then to generate X' again from H

- If the original data can be recovered even though noise has been added, that is the 'important' information.
- The model learns features that are robust to errors

http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/217
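The noise-adding step can be sketched as masking noise, where each input component is zeroed with some probability. Masking is one possible kind of corruption and is an assumption here, as are the function name `corrupt`, the corruption level, and the sample vector.

```python
import random

def corrupt(x, corruption_level, rng):
    """Masking noise: set each component to 0 with probability corruption_level."""
    return [0.0 if rng.random() < corruption_level else value for value in x]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
x = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8]
noisy_x = corrupt(x, corruption_level=0.3, rng=rng)
print(noisy_x)
```

During training the autoencoder would receive `noisy_x` as input while the reconstruction error is still measured against the clean `x`.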

Illustration – Denoising Auto-Encoding

Original data → add noise → Encode → Hidden Layer (Abstracted Info.) → Decode → reconstructed data

The hidden layer is trained so that the original data can be restored as it was

Deep Generative Models

“Learning Deep Generative Models”, Ruslan Salakhutdinov

Abstraction

Generation

Generated Numbers by machine

http://deeplearning.net/tutorial/rbm.html

Here are the samples generated by the RBM after training. Each row represents a mini-batch of negative particles (samples from independent Gibbs chains). 1000 steps of Gibbs sampling were taken between each of those rows.

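DEEP LEARNING INTRO 2

A More Intuitive Understanding of Deep Learning

How?

How can a DNN work out the features of an object by itself?

Latent Variable (Hidden Variable) : the core of the Deep Neural Network, the Essence of Modern Machine Learning

𝑥 : something observable that exists in the real world; observable, countable, P(x)

ℎ : a virtual value that does not exist in this world; it can only be inferred indirectly; a value that can be anything

[Figure: the whole semantic region that ℎ can take, unbounded (∞)]

Tie the two variables 𝑥 and ℎ together so that the two appear together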

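$P(x, h)$ : how often the two appear together

$$P(x, h) = P(x \mid h)\,P(h)$$

$$P(x) = \int_h P(x \mid h)\,p(h)\,dh \quad \text{: continuous}$$

$$P(x) = \sum_h P(x \mid h)\,p(h) \quad \text{: discrete}$$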
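The discrete case can be checked with a tiny table. The latent values, the observations, and all probabilities below are hypothetical numbers invented for the illustration.

```python
# Hypothetical discrete distributions, chosen only for illustration
P_H = {"fruit": 0.6, "toy": 0.4}            # p(h)
P_X_GIVEN_H = {                             # P(x | h)
    "fruit": {"red": 0.7, "blue": 0.3},
    "toy":   {"red": 0.2, "blue": 0.8},
}

def joint(x, h):
    """P(x, h) = P(x | h) * p(h)"""
    return P_X_GIVEN_H[h][x] * P_H[h]

def marginal(x):
    """P(x) = sum over h of P(x | h) * p(h)"""
    return sum(joint(x, h) for h in P_H)

print(marginal("red"), marginal("blue"))
```

Summing the joint probability over every value of the latent variable recovers the probability of the observation alone.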

Search for an ℎ that tends to appear together with x

[Figure: within the whole semantic region (∞), the region that ℎ associated with 𝑥 can take]

ℎ may be a cause of 𝑥 or a result of 𝑥

ℎ can still take any value

What if the number of 𝑥 used to find the ℎ that often appears together with it is 100? 1,000? 10,000? 100,000? 1,000,000? 10,000,000? …

[Figure: the semantic region that ℎ associated with a large number of 𝑥 can take]

What if another variable 𝑦 is associated as well?

𝑥, ℎ, 𝑦 : make the three appear together, P(x, y, h)

[Figure: the semantic region that ℎ associated with a large number of 𝑥 and 𝑦 can take]

What if another variable z is associated? Another variable z1? Another variable z2? ……

Tools that can shrink the semantic region of a latent variable:

1) A large amount of data

2) Structural association

Latent Variable In DNN

“apple”

[ Task ]

Input : a 3x3 = 9 grid of observations x; output : y

[ What We Want ]

Something that describes x and causes y

[ Design Structure ]

x x x x x x x x x → h h h h h → y

To make h an abstraction, encoding, semantic extraction, summary, … of X:

- dim(h) < dim(x)

“Under-complete”

Each h is connected to every x

Single Layer : x (9 units) → h (5 units) → y

Multilayer - 2 : x (9 units) → h (5 units) → h (3 units) → y

Multilayer - N : x (9 units) → h (5 units) → h (3 units) → ….. → y

Number of h >>>> number of x, y

Intuitive Interpretation of Latent Variable in DNN

“apple”

[Figure: x (6 units) → h (4 units) → h (3 units) → h (2 units) → y “apple”, with an Abstraction step between successive layers]

x : Observation

h : Something → Representation

y : Class (“apple”)

Latent variables learned (found) through a well-designed structure and a large amount of data become able to describe the features of an object.

What does Representation Learning mean for us?

Object → feature extraction (by rules made by people) → features (color = ‘red’ shape = ‘round’ …) → AI Algorithm

Object → feature computation (by learned parameters) → numbers → AI Algorithm

Classical Machine Learning Vs. Deep Learning based ML

[ NLP ]
Word Level → Word Embedding
Phrase Level → Phrase Embedding
Sentence Level → Sentence Embedding
Document Level → Document Embedding

[ Vision ]
Object → Number

Le and Mikolov, “Distributed Representations of Sentences and Documents”

Mikolov et al., “Distributed Representations of Words and Phrases and their Compositionality”

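Object / phenomenon (Observation) → numbers (Semantic)

Representation Learning makes possible a Semantic Filter, Semantic Glasses, or Semantic Converter that turns real-world objects and phenomena into numbers.

Analog to Digital Vs. Object to Semantic

Analog → Analog / Digital Converter → Digital (Numbers)

Object → Semantic Converter → Semantic (Numbers)

Note that the Analog → Digital and Object → Semantic transformations have a similar structure

The flow of information processing in the future?

Past : Numbers → information processing

Future : Numbers → Digital / Semantic Converter → information processing

A converter that turns digital information into semantic information is a core asset of ICT. A Semantic Converter cannot be obtained in a short time, and it cannot be copied either.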

PROBLEM SOLVING

Deep Learning

S2S – Arithmetic Calculation

Sequence 2 Sequence Learning

Input : Math Expression

Output : Numbers

342 + 21 = 363

Input sequence : 3 4 2 + 2 1

Output sequence : ∎ 3 6 3

∎ padding
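How an expression and its answer might be turned into padded symbol sequences can be sketched as follows. The vocabulary, the left-padding convention, the sequence lengths, and the helper name `encode` are assumptions for illustration; the slides only show the symbols and the padding mark.

```python
PAD = "∎"
VOCAB = list("0123456789+-") + [PAD]
INDEX = {symbol: i for i, symbol in enumerate(VOCAB)}

def encode(text, length):
    """Left-pad `text` to `length` symbols and one-hot encode each symbol."""
    padded = PAD * (length - len(text)) + text
    vectors = []
    for symbol in padded:
        one_hot = [0] * len(VOCAB)
        one_hot[INDEX[symbol]] = 1
        vectors.append(one_hot)
    return padded, vectors

source, source_vectors = encode("342+21", 6)
target, target_vectors = encode(str(342 + 21), 4)
print(source, target)
```

Each symbol becomes one one-hot vector, so the model sees the arithmetic problem purely as a sequence of symbols.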

Sequence Modeling for Arithmetic Calculation

[Figure: the input symbols 3 4 2 + 2 1 pass through a symbol-to-vector lookup table (one-hot) into the input layer, hidden layer, and output layer of an N-to-M RNN, which emits ∎ 3 6 3]

Sequence Modeling for Arithmetic Calculation – Performance (1)

Converges to an error below 1 at around iteration 50

[Plot: Accuracy (0 to 1) over iterations 1 to 97; series: add, sub]

[Plot: Difference (0 to 400) over iterations 1 to 97; series: add, sub]

Sequence Modeling for Arithmetic Calculation – Performance (2)

Starting from a model trained on 'addition' examples and then training it on 'subtraction' makes training much faster. Likewise, starting from a model trained on 'subtraction' and then training it on 'addition' also speeds up training.

[Plot: Accuracy (0 to 1) over iterations 1 to 97; series: add, sub, sub->add, add->sub]

[Plot: Difference (0 to 600) over iterations 1 to 97; series: add, sub, sub->add, add->sub]

Pointer Networks

“Combinatorial Optimization Problem”

Convex Hull : http://rendon.x10.mx/andrews-convex-hull-algorithm/
Delaunay triangulation : http://mathmunch.org/2013/05/08/circling-squaring-and-triangulating/
Traveling Salesman Problem : http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/AproxAlgor/TSP/tsp.htm

Pointer Networks. Vinyals et al.

Uses an attention model

Pointer Networks - Idea

Graph

Solution

Sequence(Input)

Sequence(Output)

Algorithm Deep Learning

Pointer Network - Performance

Pointer Network – Performance (TSP Problem)

SUMMARY

Deep Learning


Algorithm Finding


Summary

Deep Learning = Representation Paradigm Shift

Deep Learning = Design Architecture

Deep Learning = Data, Data, Data

Deep Learning = Beyond Pattern Recognition

Q/A

Thank you.

정상근, Ph.D

Intelligence Architect

Senior Researcher, AI Tech. Lab. SKT Future R&D

Contact : [email protected], [email protected]