practical machine learning tutorial

Upload: nikola001

Post on 14-Apr-2018

233 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/29/2019 Practical Machine Learning Tutorial

    1/34

    CS 294-34: Practical Machine LearningTutorial

    Ariel Kleiner

    Content inspired by Fall 2006tutorial lecture by Alexandre Bouchard-Cote and Alex Simma

    January 22, 2008

  • 7/29/2019 Practical Machine Learning Tutorial

    2/34

    Machine Learning Draws Heavily On. . .

    Probability and Statistics

    Optimization

    Algorithms and Data Structures

  • 7/29/2019 Practical Machine Learning Tutorial

    3/34

    Probability: Foundations

    A probability space (,F,P) consists ofa set of "possible outcomes"a set1 Fof events, which are subsets of a probability measure P : F [0,1] which assignsprobabilities to events in F

    Example: Coin-Flipping

    Consider a coin that lands on heads with probability p [0,1].In this case,

    = {H,T}F= {, {H}, {T}, {H,T}}P() = 0,P({H}) = p,P({T}) = 1 p,P({H,T}) = 1

    1

    Actually, F is a -field. See Durretts Probability: Theory and Examplesfor thorough coverage of the measure-theoretic basis for probability theory.

  • 7/29/2019 Practical Machine Learning Tutorial

    4/34

    Probability: Random Variables

    A random variable is an assignment of (often numeric)values to outcomes in .

    For a set A in range of a random variable X, the probabilitythat X falls in A is written as P(X A). We have thusinduced a probability distribution for the random variable.

    Example Continued: Coin-Flipping

    Suppose that we bet $1 on each coin flip. LetX : {H,T} { 1,1} be a random variable denoting ourwinnings: X = 1 if the coin lands on heads, and X = 1 if thecoin lands on tails. Furthermore,

    P(X {1}) = p,P(X {1}) = 1 p.

  • 7/29/2019 Practical Machine Learning Tutorial

    5/34

    Probability: Common Discrete Distributions

    Various distributions arise commonly and are useful in boththeory and application. The following are among the most

    common discrete distributions for a random variable X:Bernoulli(p): p [0,1]; X {0, 1}

    P(X = 1) = p,P(X = 0) = 1 p

    Binomial(p, n): p [0,1],n N; X {0, . . . , n}P(X = x) =

    n

    x

    px(1 p)nx

    The multinomial distribution generalizes the Bernoulli and

    the Binomial beyond binary outcomes for individualexperiments.

    Poisson(): (0,); X N

    P(X = x) =

    ex

    x!

  • 7/29/2019 Practical Machine Learning Tutorial

    6/34

    Probability: More on Random Variables

    Notation: X P means "X has the distribution given by P"The cumulative distribution function (cdf) of a random

    variable X Rm is defined for x Rm as F(x) = P(X x).We say that X has a density function p if we can write

    P(X x) = x p(y)dy (densities exist under certainregularity conditions).

    In practice, the continuous random variables with which wewill work will have densities.

    For convenience, in the remainder of this lecture we willassume that all random variables take values in some

    countable numeric set, R, or a real vector space.

  • 7/29/2019 Practical Machine Learning Tutorial

    7/34

    Probability: Common Continuous Distributions

    The following are among the most common continuousdistributions for a random variable X (all have densities p):

    Uniform(a,b): a,b R, a< b; X [a,b]

    p(x) =1

    b a

    Normal(, 2): R, R++; X R

    p(x) =1

    2exp

    (x )

    2

    22

    The normal distribution can be easily generalized to themultivariate case, in which X Rm. In this context, becomes a real vector and is replaced by a covariance

    matrix.

    Other continuous distributions that also frequently arise arethe Beta, Gamma, and Dirichlet.

  • 7/29/2019 Practical Machine Learning Tutorial

    8/34

    Probability: DistributionsOther Distribution Types

    Exponential Family

    encompasses distributions of the form

    P(X = x) = h(x) exp(()T(x) A())

    includes many commonly encountered distributions

    well-studied and has various nice analytical properties

    while being fairly general

    Graphical Models

    Graphical models provide a flexible framework for building

    complex models involving many random variables whileallowing us to leverage conditional independence relationships

    among them to control computational tractability.

  • 7/29/2019 Practical Machine Learning Tutorial

    9/34

    Probability: Expectation

    Intuition: the expection of a random variable is its

    "average" value under its distribution.

    Formally, the expectation of a random variable X, denotedE[X], is its Lebesgue integral with respect to itsdistribution.

    If X takes values in some countable numeric set

    X, then

    E[X] =xX

    xP(X = x)

    .

    If X

    R

    m has a density p, then

    E[X] =

    Rm

    xp(x)dx

    .

    These ideas generalize naturally to expectations of

    functions of random variables.

  • 7/29/2019 Practical Machine Learning Tutorial

    10/34

    Probability: More on Expectation

    Expection is linear: E[aX + b] = aE[X] + b. Also, if Y isalso a random variable, then E[X + Y] = E[X] + E[Y].

    Expectation is monotone: if X Y, then E[X] E[Y].Expectations also obey various inequalities, including

    Jensens, Cauchy-Schwarz, and Chebyshevs.

    Variance

    The variance of a random variable X is defined as

    Var(X) = E[(X

    E[X])2] = E[X2]

    (E[X])2

    and obeys the following for a, b R:

    Var(aX + b) = a2Var(X).

  • 7/29/2019 Practical Machine Learning Tutorial

    11/34

    Probability: Independence

    Intuition: two random variables are independent if knowingthe value of one yields no knowledge about the value of

    the other.

    Formally, two random variables X and Y are independentiff P(X A,Y B) = P(X A)P(Y B) for all(measurable) subsets A and B in the ranges of X and Y.

    If X,Y have densities pX(x),pY(y), then they are

    independent if pX,Y(x, y) = pX(x)pY(y).

  • 7/29/2019 Practical Machine Learning Tutorial

    12/34

    Probability: Conditioning

    Intuition: conditioning allows us to capture the probabilisticrelationships between different random variables.

    For events A and B, P(A|B) is the probability that A willoccur given that we know that event B has occurred. IfP(B) > 0, then

    P(A|B) =

    P(A B)P(B)

    .

    If X,Y have joint density p(x, y), the conditional density ofX given Y is as follows:

    p(y|x) =p(x, y)

    p(x) , for p(x) > 0

    where p(x) =

    p(x, y)dy.

    If X and Y are independent, thenP(Y = y

    |X = x) = P(Y = y) and

    P(X = x|Y = y) = P(X = x).

  • 7/29/2019 Practical Machine Learning Tutorial

    13/34

    Probability: More on Conditional Probability

    For any events A and B (e.g., we might have A = {Y 5}),

    P(A B) = P(A|B)P(B)

    Bayes Theorem:

    P(A|B)P(B) = P(A B) = P(B A) = P(B|A)P(A)

    Equivalently, if P(B) > 0,

    P(A|B) =P(B

    |A)P(A)

    P(B)

    Bayes Theorem provides a means of inverting the "order"of conditioning.

  • 7/29/2019 Practical Machine Learning Tutorial

    14/34

    Probability: Law of Large Numbers

    Strong Law of Large Numbers

    Let X1,X2,X3, . . . be independent identically distributed (i.i.d.)random variables with E|Xi| < . Then

    1

    n

    n

    i=1

    Xi E[X1]

    with probability 1 as n .

    Application: Monte Carlo Methods

    How can we compute an (approximation of) an expectationE[f(X)] with respect to some distribution P of X? (assume thatwe can draw independent samples from P).

    A Solution: Draw a large number of samples x1, . . . , xn from P.

    Compute E[f(X)]

    f(x1)++f(xn)n

    .

    P b bili C l Li i Th

  • 7/29/2019 Practical Machine Learning Tutorial

    15/34

    Probability: Central Limit Theorem

    The Central Limit Theorem provides insight into the

    distribution of a normalized sum of independent randomvariables. In contrast, the law of large numbers onlyprovides a single limiting value.

    Intuition: The sum of a large number of small, independent,random terms is asymptotically normally distributed.

    This theorem is heavily used in statistics.

    Central Limit Theorem

    Let X1,X2,X3, . . . be i.i.d. random variables with E[Xi] = ,

    Var(Xi) = 2 (0,). Then, as n ,

    1n

    ni=1

    Xi

    d N(0, 1)

    St ti ti F ti t B i

  • 7/29/2019 Practical Machine Learning Tutorial

    16/34

    Statistics: Frequentist Basics

    Given data (i.e., realizations of random variables)

    x1, x2, . . . ,xn which is generally assumed to be i.i.d.Based on this data, we would like to estimate some(unknown) value associated with the distribution from

    which the data was generated.

    In general, our estimate will be a function(x1, . . . , xn) ofthe data (i.e., a statistic).

    Examples

    Given the results of n independent flips of a coin,

    determine the probability p with which it lands on heads.

    Simply determine whether or not the coin is fair.

    Find a function that distinguishes digital images of fivesfrom those of other handwritten digits.

    St ti ti P t E ti ti

  • 7/29/2019 Practical Machine Learning Tutorial

    17/34

    Statistics: Parameter Estimation

    In practice, we often seek to select from some class ofdistributions a single distribution corresponding to our data.

    If our model class is parametrized by some (possiblyuncountable) set of values, then this problem is that ofparameter estimation.

    That is, from a set of distributions {p(x) : }, we willselect that corresponding to our estimate (x

    1, . . . ,x

    n) of

    the parameter.

    How can we obtain estimators in general?

    One answer: maximize the likelihoodl(; x1, . . . , xn) = p(x1, . . . , xn) =

    ni=1 p(xi) (or,

    equivalently, log likelihood) of the data.

    Maximum Likelihood Estimation

    (x1, . . . , xn) = argmax

    n

    i=1

    p(xi) = argmax

    n

    i=1

    ln p(xi)

    St ti ti M i Lik lih d E ti ti

  • 7/29/2019 Practical Machine Learning Tutorial

    18/34

    Statistics: Maximum Likelihood EstimationExample: Normal Mean

    Suppose that our data is real-valued and known to be

    drawn i.i.d. from a normal distribution with variance 1 butunknown mean.

    Goal: estimate the mean of the distribution.

    Recall that a univariate N(, 1) distribution has density

    p(x) = 12 exp(12 (x )2).Given data x1, . . . , xn, we can obtain the maximumlikelihood estimate by maximizing the log likelihood w.r.t. :

    dd

    ni=1

    ln p(xi) =

    ni=1

    dd12 (xi )2 =

    ni=1

    (xi ) = 0

    (x1, . . . , xn) = argmax

    n

    i=1

    ln p(xi) =1

    n

    n

    i=1

    xi

    Statistics: Criteria for Estimator Evaluation

  • 7/29/2019 Practical Machine Learning Tutorial

    19/34

    Statistics: Criteria for Estimator Evaluation

    Bias: B() = E[(X1, . . . ,Xn)] Variance: Var((X1, . . . , Xn)) = E[( E[])2]Loss/Risk

    A loss function L(, (X1, . . . ,Xn)) assigns a penalty to anestimate when the true value of interest is .The risk is the expectation of the loss function:

    R() = E[L(, (X1, . . . ,Xn))].Example: squared loss is given by L(, ) = ( )2.

    Bias-Variance Decomposition

    Under squared loss,

    E[L(, )] = E[( )2] = [B()]2 + Var()

    Consistency: Does (X1, . . . ,Xn)p

    as n

    ?

    Statistics: Criteria for Estimator Evaluation

  • 7/29/2019 Practical Machine Learning Tutorial

    20/34

    Statistics: Criteria for Estimator EvaluationExample: Evaluation of Maximum Likelihood Normal Mean Estimator

    Recall that, in this example, X1, . . . ,Xni.i.d. N(, 1) and the

    maximum likelihood estimator for is

    (X1, . . . ,Xn) =1

    n

    ni=1

    Xi

    Therefore, we have the following:

    Bias:

    B() = E[(X1, . . . ,Xn)] = E

    1

    n

    ni=1

    Xi

    =

    1

    n

    ni=1

    E[Xi] =1

    n

    ni=1

    = 0

    Variance: Var() = Var

    1n

    ni=1 Xi

    = 1

    n2

    ni=1 Var(Xi) =

    1n

    Consistency: 1nni=1 Xi E[X1] = with probability 1 asn , by the strong law of large numbers.

    Statistics: Bayesian Basics

  • 7/29/2019 Practical Machine Learning Tutorial

    21/34

    Statistics: Bayesian Basics

    The Bayesian approach treats statistical problems bymaintaining probability distributions over possible

    parameter values.That is, we treat the parameters themselves as randomvariables having distributions:

    1 We have some beliefs about our parameter values beforewe see any data. These beliefs are encoded in the prior

    distribution P().2 Treating the parameters as random variables, we can

    write the likelihood of the data X as a conditionalprobability: P(X|).

    3 We would like to update our beliefs about based on thedata by obtaining P(

    |X), the posterior distribution.

    Solution: by Bayes theorem,

    P(|X) = P(X|)P()P(X)

    where

    P(X) =

    P(X|)P()d

    Statistics: More on the Bayesian Approach

  • 7/29/2019 Practical Machine Learning Tutorial

    22/34

    Statistics: More on the Bayesian Approach

    Within the Bayesian framework, estimation and predictionsimply reduce to probabilistic inference. This inference

    can, however, be analytically and computationallychallenging.

    It is possible to obtain point estimates from the posterior invarious ways, such as by taking the posterior mean

    E|X[] =

    P(|X)dor the mode of the posterior:

    argmax

    P(|X)

    .Alternatively, we can directly compute the predictivedistribution of a new data point Xnew, having already seen

    data X:

    P(Xnew|X) = P(Xnew|)P(|X)d

    Statistics: Bayesian Approach for the Normal Mean

  • 7/29/2019 Practical Machine Learning Tutorial

    23/34

    Statistics: Bayesian Approach for the Normal Mean

    Suppose that X| N(, 1) and we place a prior N(0,1) over (i.e., N(0,1)):P(X = x|) = 1

    2exp

    (x )

    2

    2

    P() =

    1

    2exp

    2

    2

    Then, if we observe X = 1,

    P(|X = 1) = P(X = 1|)P()P(X = 1)

    P(X = 1|)P()

    =

    12

    exp(1 )2

    2

    1

    2exp

    22

    N(0.5,0.5)

    Statistics: Bayesian Prior Distributions

  • 7/29/2019 Practical Machine Learning Tutorial

    24/34

    Statistics: Bayesian Prior Distributions

    Important Question: How do we select our prior distribution?Different possible approaches:

    based on actual prior knowledge about the system or data

    generation mechanismtarget analytical and computational tractability; e.g., useconjugate priors (those which yield posterior distributions

    in the same family)

    allow the data to have "maximal impact" on the posterior

    Statistics: Parametric vs Non-Parametric Models

  • 7/29/2019 Practical Machine Learning Tutorial

    25/34

    Statistics: Parametric vs. Non Parametric Models

    All of the models considered above are parametricmodels,in that they are determined by a fixed, finite number of

    parameters. This can limit the flexibility of the model.

    In contrast, we can endow our models with a potentially

    infinite number of parameters which is allowed to grow aswe see more data. Such models are callednon-parametric.

    Although non-parametric models yield greater modeling

    flexibility, they are generally statistically andcomputationally less efficient.

    Statistics: Generative vs Discriminative Models

  • 7/29/2019 Practical Machine Learning Tutorial

    26/34

    Statistics: Generative vs. Discriminative Models

    Suppose that, based on data (x1, y1), . . . , (xn, yn), wewould like to obtain a model whereby we can predict thevalue of Y based on an always-observed random variableX.

    Generative Approach: model the full joint distributionP(X,Y), which fully characterizes the relationship betweenthe random variables.

    Discriminative Approach: only model the conditionaldistribution P(Y

    |X)

    Both approaches have strengths and weaknesses and areuseful in different contexts.

    Linear Algebra: Basics

  • 7/29/2019 Practical Machine Learning Tutorial

    27/34

    Linear Algebra: Basics

    Matrix Transpose

    For an m

    nmatrix A with (A)ij

    = aij

    , its transpose AT isdefined by (AT)ij = aji.

    (AB)T = BTAT

    Matrix Inverse

    The inverse of a square matrix A Rnn is the matrix A1such that A1A = I.This notion generalizes to non-square matrices via left-and right-inverses.

    Not all matrices have inverses.If A and B are invertible, then (AB)1 = B1A1.Computation of inverses generally requires O(n3) time.However, given a matrix A and a vector b, we can compute

    a vector x such that Ax = b in O(n2

    ) time.

    Linear Algebra: Basics

  • 7/29/2019 Practical Machine Learning Tutorial

    28/34

    Linear Algebra: Basics

    Trace

    For a square matrix A Rnn, its trace is defined astr(A) =

    ni=1(A)ii.

    tr(AB) = tr(BA)

    Eigenvectors and EigenvaluesGiven a matrix A Rnn, u Rn\{0} is called aneigenvector of A with R the corresponding eigenvalue if

    Au= u

    .

    An n nmatrix can have no more than ndistincteigenvector/eigenvalue pairs.

    Linear Algebra: Basics

  • 7/29/2019 Practical Machine Learning Tutorial

    29/34

    Linear Algebra: Basics

    More definitions

    A matrix A is called symmetric if it is square and(A)ij = (A)ji,i,j.

    A symmetric matrix A is positive semi-definite (PSD) if allof its eigenvalues are greater than or equal to 0.

    Changing the above inequality to >, , or < yields thedefinitions of positive definite, negative semi-definite, andnegative definite matrices, respectively.

    A positive definite matrix is guaranteed to have an inverse.

    Linear Algebra: Matrix Decompositions

  • 7/29/2019 Practical Machine Learning Tutorial

    30/34

    ea geb a at eco pos t o s

    Eigenvalue Decomposition

    Any symmetric matrix AR

    nn can be decomposed as follows:

    A = UUT

    where is a diagonal matrix with the eigenvalues of A on itsdiagonal, U has the corresponding eigenvectors of A as its

    columns, and UUT = I.

    Singular Value Decomposition

    Any matrix A Rmn can be decomposed as follows:

    A = UVT

    where UUT = VVT = I and is diagonal.

    Other Decompositions: LU (into lower and upper triangularmatrices); QR; Cholesky (only for PSD matrices)

    Optimization: Basics

  • 7/29/2019 Practical Machine Learning Tutorial

    31/34

    p

    We often seek to find optima (minima or maxima) of some

    real-valued vector function f : Rn R. For example, wemight have f(x) = xTx.Furthermore, we often constrain the value of x in some

    way: for example, we might require that x 0.In standard notation, we write

    minxX f(x)s.t. gi(x) 0, i = 1, . . . ,N

    hi(x) = 0, i = 1, . . . ,M

    Every such problem has a (frequently useful)

    corresponding Lagrange dual problem which lower-boundsthe original, primal problem and, under certain conditions,has the same solution.

    It is only possible to solve these optimization problemsanalytically in special cases, though we can often find

    solutions numerically.

    Optimization: A Simple Example

  • 7/29/2019 Practical Machine Learning Tutorial

    32/34

    p p p

    Consider the following unconstrained optimization problem:

    minxRn

    Ax b22 = minxRn

    (Ax b)T(Ax b)

    .

    In fact, this is the optimization problem that we must solve

    to perform least-squares regression.

    To solve it, we can simply set the gradient of the objectivefunction equal to 0.

    The gradient of a function f(x) : Rn R is the vector ofpartial derivatives with respect to the components of x:

    xf(x) =

    f

    x1, . . .

    f

    xn

    Optimization: A Simple Example

  • 7/29/2019 Practical Machine Learning Tutorial

    33/34

    p p p

    Thus, we have

    xAx b22 = x

    (Ax b)T(Ax b)

    = xxTATAx 2xTATb+ bTb

    = 2ATAx 2ATb= 0

    and so the solution is

    x = (ATA)1ATb

    (if (ATA)1 exists).

    Optimization: Convexity

  • 7/29/2019 Practical Machine Learning Tutorial

    34/34

    p y

    In the previous example, we were guaranteed to obtain a

    global minimum because the objective function wasconvex.

    A differentiable function f : Rn R is convex if its Hessian(matrix of second derivatives) is everywhere PSD (if n= 1,then this corresponds to the second derivative being

    everywhere non-negative)2.An optimization problem is called convex if its objectivefunction f and inequality constraint functions g1, . . . , gN areall convex, and its equality constraint functions h1, . . . ,hMare linear.

    For a convex problem, all minima are in fact global minima.

    In practice, we can efficiently compute minima for problemsin a number of large, useful classes of convex problems.

    2This definition is in fact a special case of the general definition forarbitrary vector functions.