
Page 1:

Jong Youl Choi, Computer Science Department (jychoi@cs.indiana.edu)

Page 2:

Social Bookmarking

[Figure: socialized tags and bookmarks]

Page 3:


Page 4:

Principles of Machine Learning
▪ Bayes' theorem and maximum likelihood

Machine Learning Algorithms
▪ Clustering analysis
▪ Dimension reduction
▪ Classification

Parallel Computing
▪ General parallel computing architecture
▪ Parallel algorithms

Page 5:

Definition: algorithms or techniques that enable a computer (machine) to "learn" from data. Related to many areas such as data mining, statistics, and information theory.

Algorithm Types
▪ Unsupervised learning
▪ Supervised learning
▪ Reinforcement learning

Topics
▪ Models: Artificial Neural Network (ANN), Support Vector Machine (SVM)
▪ Optimization: Expectation-Maximization (EM), Deterministic Annealing (DA)

Page 6:

Posterior probability of $\theta_i$, given $X$ (Bayes' theorem):

$$P(\theta_i \mid X) = \frac{P(X \mid \theta_i)\,P(\theta_i)}{P(X)}$$

▪ $\theta_i \in \Theta$ : parameter
▪ $X$ : observations
▪ $P(\theta_i)$ : prior (or marginal) probability
▪ $P(X \mid \theta_i)$ : likelihood

Maximum Likelihood (ML)
▪ Used to find the most plausible $\theta_i \in \Theta$, given $X$
▪ Computing the maximum likelihood (ML) or log-likelihood is an optimization problem:

$$\hat{\theta} = \arg\max_{\theta_i \in \Theta} P(X \mid \theta_i) = \arg\max_{\theta_i \in \Theta} \ln P(X \mid \theta_i)$$
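To make the optimization concrete, here is a minimal sketch (assuming NumPy and SciPy are available) that finds the ML estimate of a single Gaussian's parameters by numerically maximizing the log-likelihood; the data and starting values are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=1000)   # hypothetical observations

def neg_log_likelihood(params, data):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                   # keeps sigma positive
    # negative Gaussian log-likelihood, summed over observations
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (data - mu)**2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(X,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # close to the closed-form X.mean(), X.std()
```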

Page 7:

Problem: estimate the hidden parameters $\theta = \{\mu, \sigma\}$ from data drawn from $k$ Gaussian distributions.

Gaussian distribution:

$$P(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Maximum likelihood, with Gaussian ($P = \mathcal{N}$):

$$\hat{\theta} = \arg\max_{\theta} \sum_i \ln \mathcal{N}(x_i \mid \mu, \sigma^2)$$

Solve either by brute force or by a numerical method.

(Mitchell, 1997)

Page 8:

Problems in ML estimation
▪ The observation X is often not complete
▪ A latent (hidden) variable Z exists
▪ Hard to explore the whole parameter space

Expectation-Maximization algorithm
▪ Objective: to find the ML estimate over the latent distribution $P(Z \mid X, \theta)$
▪ Steps (sketched below):
  0. Init: choose a random $\theta^{\text{old}}$
  1. E-step: compute the expectation over $P(Z \mid X, \theta^{\text{old}})$
  2. M-step: find the $\theta^{\text{new}}$ which maximizes the expected likelihood
  3. Update $\theta^{\text{old}} \leftarrow \theta^{\text{new}}$ and go to step 1
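A minimal sketch of these steps for a one-dimensional mixture of two Gaussians, assuming NumPy; the data, initial parameters, and iteration count are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# 0. Init: arbitrary starting means, std devs, and mixing weights
mu, sigma, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(100):
    # 1. E-step: responsibilities P(Z | X, theta_old), shape (n, 2)
    r = pi * gaussian(X[:, None], mu, sigma)
    r /= r.sum(axis=1, keepdims=True)
    # 2. M-step: parameters maximizing the expected log-likelihood
    nk = r.sum(axis=0)
    mu = (r * X[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (X[:, None] - mu)**2).sum(axis=0) / nk)
    pi = nk / len(X)          # 3. theta_old <- theta_new, repeat

print(mu, sigma, pi)
```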

Page 9:

Definition: grouping unlabeled data into clusters, for the purpose of inferring hidden structures or information.

Dissimilarity measurement
▪ Distance: Euclidean (L2), Manhattan (L1), …
▪ Angle: inner product, …
▪ Non-metric: rank, intensity, …

Types of Clustering
▪ Hierarchical: agglomerative or divisive
▪ Partitioning: K-means, VQ, MDS, …

(Matlab help page)

Page 10:

Find K partitions with the total intra-cluster variance minimized:

$$\min_{y_1,\dots,y_K} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - y_k\|^2$$

Iterative method (sketched below)
▪ Initialization: randomized centroids $y_i$
▪ Assignment of each $x$ to its nearest centroid ($y_i$ fixed)
▪ Update of each $y_i$ to the mean of its assigned points ($x$ fixed)

Problem? The iteration can become trapped in local minima.

(MacKay, 2003)
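A minimal sketch of this iterative method (Lloyd's algorithm) in NumPy; the toy data, K, and iteration count are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.5, (100, 2)) for c in [(0, 0), (4, 0), (2, 3)]])
K = 3
y = X[rng.choice(len(X), K, replace=False)]   # initialization: random centroids

for _ in range(50):
    # Assignment step: nearest centroid for each point (y fixed)
    labels = np.argmin(((X[:, None, :] - y)**2).sum(axis=2), axis=1)
    # Update step: each centroid = mean of its assigned points (x fixed)
    y = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print(y)   # depends on initialization; may be a local minimum
```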

Page 11:

▪ Deterministically avoids local minima
▪ No stochastic process (random walk)
▪ Traces the global solution by changing the level of randomness

Statistical Mechanics
▪ Gibbs distribution: $P(E_x) = \frac{1}{Z}\exp(-E_x / T)$
▪ Helmholtz free energy $F = D - TS$
  ▪ Average energy $D = \langle E_x \rangle$
  ▪ Entropy $S = -\sum_x P(E_x) \ln P(E_x)$
  ▪ Equivalently, $F = -T \ln Z$

In DA, we minimize F.

[Figure: local and global maxima/minima of a function (Maxima and Minima, Wikipedia)]

Page 12:

Analogy to the physical annealing process: control the energy (randomness) by temperature (high → low).

Starting with high temperature (T = ∞)
▪ Soft (or fuzzy) association probability
▪ Smooth cost function with one global minimum

Lowering the temperature (T → 0)
▪ Hard association
▪ The full complexity is revealed, and clusters emerge

Minimization of F, using $E(x, y_j) = \|x - y_j\|^2$, is carried out iteratively (see the sketch below).
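A minimal sketch of the annealing loop, assuming the squared-distance cost above; the geometric cooling schedule and starting temperature are assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.5, (100, 2)) for c in [(0, 0), (4, 0), (2, 3)]])
K = 3
y = X.mean(axis=0) + rng.normal(0, 1e-3, (K, 2))  # start near the data mean

T = 10.0
while T > 0.01:
    # Soft (Gibbs) association: P(j | x) proportional to exp(-||x - y_j||^2 / T)
    d2 = ((X[:, None, :] - y)**2).sum(axis=2)
    p = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)
    p /= p.sum(axis=1, keepdims=True)
    # Centroid update minimizing the free energy F at this temperature
    y = (p[:, :, None] * X[:, None, :]).sum(axis=0) / p.sum(axis=0)[:, None]
    T *= 0.9   # cooling: associations harden as T -> 0

print(y)
```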

Page 13:

Definition: the process of transforming high-dimensional data into a low-dimensional representation, to improve accuracy, aid understanding, or remove noise.

Curse of dimensionality: complexity grows exponentially in volume as extra dimensions are added.

Types
▪ Feature selection: choose representatives (e.g., filters, …)
▪ Feature extraction: map to a lower dimension (e.g., PCA, MDS, …)

(Koppen, 2000)

Page 14:

Finding a map of the principal components (PCs) of the data into an orthogonal space, such that $y = Wx$, where $W \in \mathbb{R}^{h \times d}$ ($h \ll d$).

▪ PCs: the variables with the largest variances
▪ Orthogonality
▪ Linearity: optimal least mean-square error

Limitations?
▪ Strict linearity
▪ Assumes a specific distribution
▪ Large-variance assumption

[Figure: data points in the (x1, x2) plane with principal axes PC 1 and PC 2]
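A minimal PCA sketch via the SVD, assuming NumPy; the correlated toy data and the choice h = 2 are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # correlated features
h = 2                                                       # target dimension

Xc = X - X.mean(axis=0)                    # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:h]                                 # top-h principal components (h x d)
Y = Xc @ W.T                               # projected coordinates y = W x

var_explained = s[:h]**2 / (s**2).sum()    # variance captured by the top PCs
print(Y.shape, var_explained)
```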

Page 15:

Like PCA, reduction of dimension by $y = Rx$, where $R$ is a random matrix with i.i.d. columns and $R \in \mathbb{R}^{p \times d}$ ($p \ll d$).

Johnson-Lindenstrauss lemma: when projecting to a randomly selected subspace, distances are approximately preserved.

Generating R
▪ An orthogonalized R is hard to obtain
▪ Gaussian R
▪ Simple approach: choose $r_{ij} \in \{+\sqrt{3},\, 0,\, -\sqrt{3}\}$ with probability 1/6, 4/6, 1/6, respectively
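A minimal sketch of this simple approach, assuming NumPy; the $1/\sqrt{p}$ scaling (so expected norms are preserved) and the toy dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, p = 200, 1000, 50
X = rng.normal(size=(n, d))

# Sparse entries {+sqrt(3), 0, -sqrt(3)} with probabilities 1/6, 4/6, 1/6
R = rng.choice([np.sqrt(3), 0.0, -np.sqrt(3)], size=(p, d),
               p=[1/6, 4/6, 1/6]) / np.sqrt(p)
Y = X @ R.T                                  # project d -> p dimensions

# Pairwise distance ratios should concentrate near 1 (J-L lemma)
i, j = rng.integers(0, n, 100), rng.integers(0, n, 100)
orig = np.linalg.norm(X[i] - X[j], axis=1)
proj = np.linalg.norm(Y[i] - Y[j], axis=1)
print((proj / np.clip(orig, 1e-9, None)).mean())
```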

Page 16:

Dimension reduction preserving the distance proximities observed in the original data set.

Loss functions
▪ Inner product
▪ Distance
▪ Squared distance

Classical MDS: minimizing STRAIN. Given the squared-distance matrix $\Delta^{(2)}$:
▪ From $\Delta^{(2)}$, find the inner-product matrix $B$ (double centering): $B = -\frac{1}{2} J \Delta^{(2)} J$, with $J = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$
▪ From $B$, recover the coordinates $X'$ (i.e., $B = X'X'^T$)
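A minimal classical-MDS sketch in NumPy following the two steps above; the toy data and the 2-D embedding are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))                       # hypothetical original data
D2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)   # squared-distance matrix

n = len(D2)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J                               # double centering

w, V = np.linalg.eigh(B)                            # eigenvalues, ascending
idx = np.argsort(w)[::-1][:2]                       # top-2 eigenvalues
X_prime = V[:, idx] * np.sqrt(w[idx])               # so that B ~ X' X'^T

print(X_prime.shape)   # (100, 2) coordinates recovered from B
```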

Page 17:

SMACOF: minimizing STRESS by majorization.

Majorization: for a complex $f(x)$, find an auxiliary simple $g(x, y)$ s.t.:
▪ $g(x, y) \ge f(x)$ for all $x$
▪ $g(y, y) = f(y)$
Repeatedly minimizing $g$ over $x$ then drives $f$ downhill.

Majorization for STRESS: minimize $\mathrm{tr}(X^T B(Y)\,Y)$; the resulting update $X \leftarrow \frac{1}{n} B(Y)\,Y$ is known as the Guttman transform (sketched below).

(Cox, 2001)
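A minimal SMACOF sketch with unit weights, assuming NumPy; each iteration applies the Guttman transform, and the toy dissimilarities and iteration count are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)
Xtrue = rng.normal(size=(50, 5))
delta = np.linalg.norm(Xtrue[:, None] - Xtrue[None, :], axis=-1)  # targets

n, h = len(delta), 2
Y = rng.normal(size=(n, h))                      # random initial embedding

for _ in range(200):
    d = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(d > 0, delta / d, 0.0)  # off-diagonal terms of B(Y)
    B = -ratio
    np.fill_diagonal(B, 0.0)
    np.fill_diagonal(B, -B.sum(axis=1))          # rows of B(Y) sum to zero
    Y = B @ Y / n                                # Guttman transform

stress = ((delta - np.linalg.norm(Y[:, None] - Y[None, :], axis=-1))**2).sum() / 2
print(stress)   # STRESS decreases monotonically under this update
```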

Page 18:

Competitive and unsupervised learning process for clustering and visualization.

Result: similar data end up closer together in the model space.

[Figure: input vectors mapped onto a grid of model vectors]

Learning
▪ Choose the model vector $m_j$ most similar to $x_i$ (the winner)
▪ Update the winner and its neighbors by $m_k \leftarrow m_k + \alpha(t)\,h(t)\,(x_i - m_k)$
  ▪ $\alpha(t)$ : learning rate
  ▪ $h(t)$ : neighborhood size
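A minimal sketch of this learning rule as a one-dimensional self-organizing map, assuming NumPy; the decay schedules for $\alpha(t)$ and the neighborhood width are assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.uniform(0, 1, size=(1000, 2))      # hypothetical input data
K = 10
m = rng.uniform(0, 1, size=(K, 2))         # model vectors on a 1-D grid

T = 5000
for t in range(T):
    x = X[rng.integers(len(X))]
    winner = np.argmin(((m - x)**2).sum(axis=1))    # best-matching model vector
    alpha = 0.5 * (1 - t / T)                       # decaying learning rate
    width = max(K / 2 * (1 - t / T), 0.5)           # shrinking neighborhood
    h = np.exp(-((np.arange(K) - winner)**2) / (2 * width**2))
    m += alpha * h[:, None] * (x - m)               # update winner + neighbors

print(m)   # neighboring model vectors end up near similar inputs
```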

Page 19:

Definition: a procedure that divides data into a given set of categories, based on a training set, in a supervised way.

Generalization vs. specialization
▪ Hard to achieve both
▪ Avoid overfitting (overtraining):
  ▪ Early stopping
  ▪ Holdout validation
  ▪ K-fold cross-validation (sketched below)
  ▪ Leave-one-out cross-validation

[Figure: training vs. validation error; validation error rises in the overfitting regime while training error keeps falling (Overfitting, Wikipedia)]
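A minimal K-fold cross-validation sketch, assuming NumPy; the 1-nearest-neighbor classifier is a hypothetical stand-in for any model:

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def one_nn_predict(Xtr, ytr, Xte):
    d = ((Xte[:, None, :] - Xtr[None, :, :])**2).sum(-1)
    return ytr[np.argmin(d, axis=1)]

k = 5
idx = rng.permutation(len(X))
folds = np.array_split(idx, k)
errors = []
for i in range(k):
    test = folds[i]                                   # held-out fold
    train = np.concatenate(folds[:i] + folds[i+1:])   # remaining k-1 folds
    pred = one_nn_predict(X[train], y[train], X[test])
    errors.append((pred != y[test]).mean())

print(np.mean(errors))   # cross-validated error estimate
```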

Page 20:

Perceptron: a computational unit with a binary threshold.

Abilities
▪ Linearly separable decision surface
▪ Can represent boolean functions (AND, OR, NOT)

Network (multilayer) of perceptrons
▪ Various network architectures and capabilities

[Figure: perceptron computing a weighted sum followed by an activation function (Jain, 1996)]
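A minimal perceptron sketch (weighted sum plus binary threshold), trained with the classic error-correction rule on the boolean AND function; the learning rate and epoch count are assumptions:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                     # boolean AND

w, b, lr = np.zeros(2), 0.0, 0.1

def predict(x):
    return int(w @ x + b > 0)                  # activation: binary threshold

for _ in range(20):                            # a few epochs suffice here
    for xi, ti in zip(X, y):
        o = predict(xi)
        w += lr * (ti - o) * xi                # error-correction update
        b += lr * (ti - o)

print([predict(xi) for xi in X])               # expect [0, 0, 0, 1]
```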

Page 21:

Learning the weights: random initialization, then iterative updating.

Error-correction training rules
▪ Difference between the training data and the output: $E(t, o)$
▪ Gradient descent (batch learning): with $E = \sum_i E_i$, update $w \leftarrow w - \eta \nabla_w E$
▪ Stochastic approach (on-line learning): update the gradient for each result

Various error functions (see the sketch below)
▪ Adding a weight-regularization term ($\sum_i w_i^2$) to avoid overfitting
▪ Adding momentum ($\Delta w_i(n-1)$) to expedite convergence
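A minimal batch gradient-descent sketch for a single linear unit with squared error, adding the two refinements above (an L2 regularization term and momentum); all constants are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)  # targets

w = np.zeros(3)
eta, lam, mu = 0.05, 1e-3, 0.9     # learning rate, L2 weight, momentum
v = np.zeros(3)                    # momentum buffer (previous update)

for _ in range(200):               # batch learning: gradient over all examples
    o = X @ w                                   # unit output
    grad = -(t - o) @ X / len(X) + lam * w      # dE/dw plus regularization term
    v = mu * v - eta * grad                     # momentum reuses the last update
    w += v

print(w)   # close to [1.0, -2.0, 0.5]
```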

Page 22:

Q: How do we draw the optimal linear separating hyperplane? A: By maximizing the margin.

Margin maximization
▪ The distance between $H_{+1}$ ($w \cdot x + b = +1$) and $H_{-1}$ ($w \cdot x + b = -1$) is $\frac{2}{\|w\|}$
▪ Thus, to maximize the margin, $\|w\|$ should be minimized

[Figure: separating hyperplane with the margin between H+1 and H-1]
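A short worked check of that distance, as a math sketch; $x_+$ and $x_-$ denote hypothetical points lying on $H_{+1}$ and $H_{-1}$:

```latex
% Distance between H_{+1}: w \cdot x + b = +1 and H_{-1}: w \cdot x + b = -1.
% Take x_+ on H_{+1} and x_- on H_{-1}, and project their difference
% onto the unit normal w / \|w\|:
\begin{align*}
  w \cdot x_+ + b &= +1, \qquad w \cdot x_- + b = -1 \\
  w \cdot (x_+ - x_-) &= 2 \\
  \text{margin} &= \frac{w}{\|w\|} \cdot (x_+ - x_-) = \frac{2}{\|w\|}
\end{align*}
```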

Page 23:

Constraint optimization problem: given a training set $\{x_i, y_i\}$ with $y_i \in \{+1, -1\}$,

Minimize: $\frac{1}{2}\|w\|^2$ subject to $y_i(w \cdot x_i + b) \ge 1$ for all $i$

Lagrangian equation with saddle points:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$$

▪ Minimized w.r.t. the primal variables $w$ and $b$
▪ Maximized w.r.t. the dual variables $\alpha_i$ (all $\alpha_i \ge 0$)

An $x_i$ with $\alpha_i > 0$ (not $\alpha_i = 0$) is called a support vector (SV).
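To see the dual picture in practice, a minimal sketch assuming scikit-learn is available; the separable toy data and the large C (approximating a hard margin) are assumptions:

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

rng = np.random.default_rng(11)
X = np.vstack([rng.normal(-2, 0.7, (50, 2)), rng.normal(2, 0.7, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C: near hard margin
print(len(clf.support_vectors_))              # few points have alpha_i > 0
print(clf.dual_coef_)                         # y_i * alpha_i for those SVs
```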

Page 24:

Soft margin (non-separable case)
▪ Slack variables $\xi_i$; the dual variables become box-constrained, $0 \le \alpha_i \le C$
▪ Optimization with the additional constraint: $y_i(w \cdot x_i + b) \ge 1 - \xi_i$, with $\xi_i \ge 0$

Non-linear SVM
▪ Map the non-linear input to a feature space: $x \mapsto \phi(x)$
▪ Kernel function: $k(x, y) = \langle \phi(x), \phi(y) \rangle$
▪ Kernel classifier with support vectors $s_i$: $f(x) = \mathrm{sgn}\!\left(\sum_i \alpha_i y_i\, k(s_i, x) + b\right)$

[Figure: mapping from input space to feature space]
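A minimal non-linear SVM sketch, again assuming scikit-learn; the RBF kernel and the ring-shaped toy data are assumptions chosen so that no linear hyperplane works:

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

rng = np.random.default_rng(12)
# Two concentric rings: inseparable in input space, separable in feature space
r = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)   # soft margin via C
print(clf.score(X, y))   # near 1.0; the learned boundary is a circle here
```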

Page 25:

Memory Architecture

Shared Memory
▪ Symmetric Multiprocessor (SMP)
▪ OpenMP, POSIX threads (pthreads), MPI
▪ Easy to manage but expensive

Distributed Memory
▪ Commodity, off-the-shelf processors
▪ MPI
▪ Cost effective but hard to maintain

(Barney, 2007)

Decomposition Strategy
▪ Task: e.g., Word, IE, …
▪ Data: e.g., scientific problems
▪ Pipelining: Task + Data (see the sketch below)
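A minimal data-decomposition sketch using Python's multiprocessing (an assumption; the slides discuss MPI/OpenMP): partition the data, compute partial results in parallel, then merge:

```python
from multiprocessing import Pool
import numpy as np

def partial_sum_of_squares(chunk):
    # Each worker handles one data partition independently
    return float((chunk**2).sum())

if __name__ == "__main__":
    data = np.arange(1_000_000, dtype=float)
    chunks = np.array_split(data, 4)     # data decomposition: 4 partitions
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum_of_squares, chunks)
    print(sum(partials))                 # merge step combines partial results
```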

Page 26:

Shrinking
▪ Recall: only support vectors ($\alpha_i > 0$) are used in the SVM optimization
▪ Predict whether each data point is an SV or a non-SV
▪ Remove non-SVs from the problem space

Parallel SVM
▪ Partition the problem
▪ Each unit finds its support vectors
▪ Merge the data hierarchically
▪ Loop until convergence (sketched below)

(Graf, 2005)
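A minimal cascade-style sketch in the spirit of this scheme, assuming scikit-learn; the partitioning, C, and the single bottom-up pass (rather than looping to convergence) are simplifications:

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

def train_svs(X, y):
    # Shrinking: train an SVM and keep only its support vectors
    clf = SVC(kernel="linear", C=10.0).fit(X, y)
    return X[clf.support_], y[clf.support_]

rng = np.random.default_rng(13)
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.array([-1] * 200 + [1] * 200)

# Partition the problem; each unit finds its own support vectors
parts = [train_svs(X[i::4], y[i::4]) for i in range(4)]

# Merge hierarchically: pair up units, retrain on combined SVs
while len(parts) > 1:
    parts = [train_svs(np.vstack([Xa, Xb]), np.concatenate([ya, yb]))
             for (Xa, ya), (Xb, yb) in zip(parts[0::2], parts[1::2])]

final_X, final_y = parts[0]
print(len(final_X))   # support vectors surviving one cascade pass
```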

Page 27: