classification and clustering method

Upload: brandon-dean

Post on 02-Jun-2018


  • 8/10/2019 CLASSIFICATION AND CLUSTERING METHOD


    The task is to learn to assign instances to predefined classes.

    Requires supervised learning: the training data has to specify what we are trying to learn (the classes).

    A classifier is a mathematical function, implemented by a classification algorithm, that maps input data to a category.


    Two common approaches:

      Probabilistic

      Geometric

    Useful for many search-related tasks:

      Spam detection

      Sentiment classification

      Online advertising


    Decision trees

    Naïve Bayes classifier

    Neural networks

    Quadratic classifiers

    Support vector machines


    A decision tree is a tree-structured prediction model where each internal node denotes a test on an attribute, each outgoing branch represents an outcome of the test, and each leaf node is labeled with a class or class distribution.

    Attribute to be predicted: dependent variable

    Attributes that help in predicting the dependent variable: independent variables


    The figure below shows a decision tree with tests on attributes X and Y:


    Consider the captain of a cricket team who has to decide whether to bat or field first if the team wins the toss.

    He decides to collect the statistics of the last ten matches in which the winning captain decided to bat first, and to compare them in order to decide what to do.


    INDEPENDENT VARIABLES                                  DEPENDENT VARIABLE

    Outlook     Humidity    No. of batsmen in team > 6     Final outcome
    Sunny       High        Yes                            Won
    Overcast    High        No                             Lost
    Sunny       Low         No                             Lost
    Sunny       High        No                             Won
    Overcast    Low         Yes                            Lost
    Sunny       Low         Yes                            Won
    Sunny       Low         No                             Lost
    Sunny       High        No                             Won
    Sunny       Low         Yes                            Won
    Sunny       Low         Yes                            Won


    Dependent variable: game won or lost
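
    One way to see which independent variable a decision tree might test first is to score each column of the cricket table by information gain. The slides do not name a splitting criterion, so the entropy-based (ID3-style) criterion and the names `MATCHES`, `entropy`, and `information_gain` below are assumptions made for illustration only.

```python
from collections import Counter
from math import log2

# The ten matches from the cricket table:
# (outlook, humidity, more than 6 batsmen, final outcome).
MATCHES = [
    ("Sunny", "High", "Yes", "Won"),
    ("Overcast", "High", "No", "Lost"),
    ("Sunny", "Low", "No", "Lost"),
    ("Sunny", "High", "No", "Won"),
    ("Overcast", "Low", "Yes", "Lost"),
    ("Sunny", "Low", "Yes", "Won"),
    ("Sunny", "Low", "No", "Lost"),
    ("Sunny", "High", "No", "Won"),
    ("Sunny", "Low", "Yes", "Won"),
    ("Sunny", "Low", "Yes", "Won"),
]
ATTRIBUTES = ["Outlook", "Humidity", "Batsmen > 6"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Outcome entropy minus the weighted entropy after splitting on one column."""
    outcomes = [row[-1] for row in rows]
    remainder = 0.0
    for value in set(row[column] for row in rows):
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(outcomes) - remainder


gains = {name: information_gain(MATCHES, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

    Under this criterion the outlook column scores highest on these ten rows, because both overcast matches were lost and that branch becomes a pure leaf. A different criterion (for example Gini impurity) could rank the columns differently on other data.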


    The Naïve Bayes classifier works on a simple and comparatively intuitive concept.

    It makes use of the variables contained in the data sample by observing them individually, independent of each other.

    It is based on the Bayes rule of conditional probability.

    It makes use of all the attributes contained in the data, and analyses them individually as though they are equally important and independent of each other.
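
    Written out, the Bayes rule and the independence assumption look roughly as follows. The symbols are assumptions introduced here, not taken from the slides: C stands for a class and x_1, ..., x_n for the attribute values of one instance.

```latex
% Bayes rule of conditional probability for a class C given attributes x_1..x_n
P(C \mid x_1, \dots, x_n) = \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)}

% Treating the attributes as independent given the class turns the joint
% likelihood into a product of per-attribute terms
P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)

% The denominator is the same for every class, so the predicted class is
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```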


    Suppose the training data consists of various animals (say elephants, monkeys and giraffes), and our classifier has to classify any new instance that it encounters.

    We know that elephants have a trunk, huge tusks and a short tail, and are extremely big. Monkeys are short, jump around a lot, and can climb trees, whereas giraffes are tall and have a long neck and short ears.


    The Naïve Bayes classifier will consider each of these attributes separately when classifying a new instance.

    When checking to see if the new instance is an elephant, the Naïve Bayes classifier will not check whether it has a trunk and has huge tusks and is large. Rather, it will separately check whether the new instance has a trunk, whether it has tusks, whether it is large, etc. It works under the assumption that each attribute is independent of the other attributes in the sample.
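
    The slides give no numeric data for the animal example, so the sketch below applies the same attribute-by-attribute check to the cricket table from earlier in the deck instead. The function name `naive_bayes_scores` and the new match being classified are hypothetical, and no smoothing is applied, so an attribute value never seen with a class drives that class's score to zero.

```python
from collections import Counter

# Cricket table from earlier in the slides:
# (outlook, humidity, more than 6 batsmen, final outcome).
MATCHES = [
    ("Sunny", "High", "Yes", "Won"),
    ("Overcast", "High", "No", "Lost"),
    ("Sunny", "Low", "No", "Lost"),
    ("Sunny", "High", "No", "Won"),
    ("Overcast", "Low", "Yes", "Lost"),
    ("Sunny", "Low", "Yes", "Won"),
    ("Sunny", "Low", "No", "Lost"),
    ("Sunny", "High", "No", "Won"),
    ("Sunny", "Low", "Yes", "Won"),
    ("Sunny", "Low", "Yes", "Won"),
]


def naive_bayes_scores(rows, instance):
    """Return P(class) times the product of P(attribute value | class), per class."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)  # prior P(class)
        for column, value in enumerate(instance):
            # Each attribute is checked on its own, ignoring all the others.
            matching = sum(
                1 for row in rows if row[-1] == label and row[column] == value
            )
            score *= matching / count  # P(attribute value | class), unsmoothed
        scores[label] = score
    return scores


# Hypothetical new match: sunny, high humidity, more than 6 batsmen.
scores = naive_bayes_scores(MATCHES, ("Sunny", "High", "Yes"))
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

    The scores are unnormalized; dividing each by their sum would give class probabilities, but the ranking of the classes is the same either way.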


    In clustering, the task is to learn a classification from the data. No predefined classification is required.

    It is an unsupervised learning task: the training data doesn't specify what we are trying to learn (the clusters).

    Clustering algorithms divide a data set into natural groups (clusters).

    They often use a distance measure to quantify dissimilarity.


    General outline of clustering algorithms:

    1. Decide how items will be represented (e.g., feature vectors)

    2. Define a similarity measure between pairs or groups of items (e.g., cosine similarity)

    3. Determine what makes a good clustering

    4. Iteratively construct clusters that are increasingly good

    5. Stop after a local/global optimum clustering is found

    Steps 3 and 4 differ the most across algorithms.
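
    As a small illustration of steps 1 and 2, the sketch below represents items as plain lists of numbers and compares them with cosine similarity. The function name `cosine_similarity` and the sample vectors are assumptions chosen for the example.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1; orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

    Cosine similarity ignores vector length, which suits representations such as term counts where only the proportions matter; a distance such as the Euclidean one would be the more natural choice when magnitudes are meaningful.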


    Segment a customer database based on similar buying patterns

    Group houses in a town into neighborhoods based on similar features

    Identify similar Web usage patterns


    Hierarchical clustering
      Has two versions: agglomerative (bottom up) and divisive (top down).

    Overlapping clustering
      Uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership.

    Exclusive clustering
      Data are grouped in an exclusive way, so that a given datum belongs to only one definite cluster.
      E.g., K-means clustering

    Probabilistic clustering
      Uses a completely probabilistic approach.
      E.g., mixture of Gaussians


    The hierarchy can be visualized as a dendrogram, a tree data structure that illustrates hierarchical clustering techniques.

    Each level shows the clusters for that level.

    Leaf: individual clusters

    Root: one cluster

    A cluster at level i is the union of its child clusters at level i+1.


    Divisive (top down)
      Initially all items are in one cluster.
      Large clusters are successively divided.

    Agglomerative (bottom up)
      Initially each item is in its own cluster.
      Clusters are iteratively merged together.

    How do we know how to divide or combine clusters?
      Define a division or combination cost.
      Perform the division or combination with the lowest cost.
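
    The agglomerative variant can be sketched as a loop that repeatedly performs the lowest-cost merge. This is a minimal sketch under stated assumptions: the cost used here is the smallest point-to-point distance between two clusters, the loop stops at a requested number of clusters rather than building the full hierarchy, and the names `agglomerative`, `closest_pair_cost`, and `POINTS` are hypothetical.

```python
from itertools import combinations
from math import dist


def closest_pair_cost(cluster_a, cluster_b):
    """Combination cost: smallest distance between a point in each cluster."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerative(points, target_clusters, cost=closest_pair_cost):
    """Start with one cluster per point and merge the cheapest pair each round."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the lowest combination cost.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cost(clusters[pair[0]], clusters[pair[1]]),
        )
        # combinations() yields i < j, so deleting j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Illustrative 2-D points: two tight pairs and one isolated point.
POINTS = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(POINTS, target_clusters=3)
print(result)
```

    Recording each merge instead of stopping early would yield the full bottom-up hierarchy. Re-scanning every pair on every round is simple but slow for large data sets.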


    [Figure: four panels showing the points A, B, C, D, E, F and G at successive stages of clustering]


    [Figure: four panels showing the points A, B, C, D, E, F and G at successive stages of clustering]


    Single linkage: smallest distance between points in the two clusters

    Complete linkage: largest distance between points in the two clusters

    Average linkage: average distance between points in the two clusters

    Average group linkage: distance between the centroids of the two clusters
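
    The four linkage measures can be compared side by side on one pair of clusters. The sketch assumes Euclidean distance between points (the slides do not fix a distance), and the function names and the sample clusters `LEFT` and `RIGHT` are hypothetical.

```python
from math import dist


def single_linkage(a, b):
    """Smallest distance between a point in a and a point in b."""
    return min(dist(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Largest distance between a point in a and a point in b."""
    return max(dist(p, q) for p in a for q in b)


def average_linkage(a, b):
    """Mean distance over every cross-cluster pair of points."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


def average_group_linkage(a, b):
    """Distance between the two cluster centroids."""
    return dist(centroid(a), centroid(b))


# Two illustrative clusters on a line.
LEFT = [(0.0, 0.0), (2.0, 0.0)]
RIGHT = [(4.0, 0.0), (8.0, 0.0)]
print(
    single_linkage(LEFT, RIGHT),
    complete_linkage(LEFT, RIGHT),
    average_linkage(LEFT, RIGHT),
    average_group_linkage(LEFT, RIGHT),
)
```

    On these collinear points average linkage and average group linkage happen to coincide; in general they differ, and the choice of linkage changes which clusters get merged first.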


    [Figure: the points A to G illustrating single linkage, complete linkage, average linkage, and average group linkage]


    K-means is one of the simplest unsupervised learning algorithms that solve the clustering problem.

    K-means always maintains exactly K clusters.

    Clusters are represented as centroids (centers of mass).

    The main idea is to define K centroids, one for each cluster.


    Basic algorithm:

    Step 1: Choose K cluster centroids

    Step 2: Assign points to the closest centroid

    Step 3: Recompute cluster centroids

    Step 4: Go to Step 2, repeating until the centroids stop changing

    Tends to converge quickly

    Can be sensitive to the choice of initial centroids
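
    The four steps map onto a short loop. This is a sketch under stated assumptions rather than a reference implementation: the initial centroids are passed in by the caller (the slides do not say how to choose them), distance is Euclidean, an empty cluster keeps its previous centroid, and the names `k_means` and `POINTS` are hypothetical.

```python
from math import dist


def k_means(points, initial_centroids, max_iterations=100):
    """Return (centroids, clusters) after alternating assignment and update."""
    centroids = [tuple(c) for c in initial_centroids]  # Step 1: chosen by caller
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Step 2: assign each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)), key=lambda i: dist(point, centroids[i])
            )
            clusters[nearest].append(point)
        # Step 3: recompute each centroid as the mean of its assigned points.
        # Assumption: an empty cluster simply keeps its old centroid.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: repeat, stopping once no centroid moves.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Illustrative data: two well-separated groups of three points each.
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(POINTS, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids, clusters)
```

    Passing different `initial_centroids` to the same data is a quick way to observe the sensitivity to initialization: a poor starting choice can settle on a worse local optimum.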
