AMEETA AGRAWAL. Outline: Parser Evaluation, Text Clustering, Common N-Grams classification method (CNG)


TRANSCRIPT

Page 1:

AMEETA AGRAWAL

Page 2:

Outline
- Parser Evaluation
- Text Clustering
- Common N-Grams classification method (CNG)

Page 3:

Parser Evaluation
- PARSEVAL measure
- Precision, recall, F-measure and cross-brackets

http://www.haskell.org/communities/11-2009/html/Parsing-ParseModule.jpg

Page 4:

Why automate evaluation?
- Manual inspection is slow and error-prone.
- Human evaluators may introduce bias.

Page 5:

Parser output vs. gold standard: formal definitions
A gold standard is a 2-tuple GS = (S, A), where:
- S = (s1, s2, ..., sn) is a finite sequence of grammatical structures, i.e. constituents, dependency links or sentences.
- A = (a1, a2, ..., an) is a finite sequence of analyses. For each i, 1 ≤ i ≤ n, ai ∈ A is the analysis of si ∈ S.

Let P be a parser. The parser output O(P, GS) = (P(s1), P(s2), ..., P(s|S|)) is a sequence of analyses such that P(si), for each i, 1 ≤ i ≤ n with n = |S|, is the analysis assigned by parser P to sentence si ∈ S.

Page 6:

Parser evaluation
Compare each element in O(P, GS) to the corresponding element in A.

[Figure: the sets of analyses in parser evaluation]

Page 7:

PARSEVAL
Parsers are usually evaluated using the PARSEVAL measures (Black et al. 1991). To compute the PARSEVAL measures:
- The parse trees are decomposed into labelled constituents (LC). An LC is a triple consisting of the starting and ending points of a constituent's span in a sentence, and the constituent's label.
- For each sentence, the set of LCs obtained from the parser's tree (PT) and the set obtained from the gold standard parse tree (GT) are compared.

Page 8:

Labelled vs. unlabelled
- Labelled PARSEVAL: two analyses match if and only if both the brackets and the labels (POS and syntactic tags) match.
- Unlabelled PARSEVAL: compares only the brackets.

Page 9:

PARSEVAL measures
- Precision
- Recall
- F-score
- Cross-brackets

Page 10:

Precision, recall, F-score

Precision = (# of correct constituents in parser output) / (total # of constituents in parser output)

Recall = (# of correct constituents in parser output) / (total # of constituents in gold standard)

F-score: the harmonic mean of precision and recall:
F = 2 · (labelled precision) · (labelled recall) / ((labelled precision) + (labelled recall))
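A minimal sketch (mine, not from the slides) of these formulas, treating each labelled constituent as a (start, end, label) triple:

```python
def parseval(parser_lcs, gold_lcs):
    """Labelled PARSEVAL precision, recall and F-score.
    parser_lcs, gold_lcs: sets of (start, end, label) triples."""
    correct = len(parser_lcs & gold_lcs)        # constituents in both sets
    precision = correct / len(parser_lcs)
    recall = correct / len(gold_lcs)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Toy example: 2 of the 3 output constituents appear in the gold set of 4.
gold = {(0, 2, "NP"), (2, 5, "VP"), (3, 5, "NP"), (0, 5, "S")}
output = {(0, 2, "NP"), (2, 5, "VP"), (2, 4, "NP")}
print(parseval(output, gold))   # approximately (0.667, 0.5, 0.571)
```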

Page 11:

Crossing brackets
The number of bracketed sequences in the parser output that overlap (cross) the gold standard structure, often reported as a mean per sentence.

Non-crossing and crossing brackets: the phrase boundaries [i, j] and [i', j'] are boundaries in the gold standard and the parser output respectively. The pair [i, j], [i', j'] is defined as a pair of crossing brackets if they overlap without one containing the other, that is, if i < i' < j < j'.
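A one-function sketch of the crossing test in the definition above:

```python
def crossing(span_a, span_b):
    """True if spans (i, j) and (i', j') overlap without either
    containing the other, i.e. i < i' < j < j' (or the mirror case)."""
    i, j = span_a
    a, b = span_b
    return (i < a < j < b) or (a < i < b < j)

print(crossing((2, 6), (4, 8)))  # True: the spans cross
print(crossing((2, 6), (3, 5)))  # False: containment, not crossing
```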

Page 12:

Labelled PARSEVAL example
Consider the following two sentences:
- Time flies like an arrow.
- He ate the cake with a spoon.

Ambiguous sentences for a parser...

Page 13:

Gold standard parse trees
(S (NP (NN time) (NN flies))

(VP (VB like)

(NP (DT an) (NN arrow))))

(S (NP (PRP he))

(VP (VBD ate) (NP (DT the)

(NN cake))

(PP (IN with)

(NP (DT a) (NN spoon)))))

Page 14:

Parser output parse trees
(S (NP (NN time))

(VP (VB flies)

(PP (IN like)

(NP (DT an) (NN arrow)))))

(S (NP (PRP he))

(VP (VBD ate)

(NP (DT the) (NN cake)

(PP (IN with)

(NP (DT a) (NN spoon))))))

Page 15:

Labelled edges of parse trees - 1

Page 16:

Labelled edges of parse trees - 2

Page 17:

Result
- Precision = 73.9% (17/23)
- Recall = 77.2% (17/22)
- F-score = 75.5%

Page 18:

Unlabelled PARSEVAL example
A) [[He [hit [the post]]] [while [[the all-star goalkeeper] [was [out [of [the goal]]]]]]]
B) [He [[hit [the post]] [while [[the [[all-star] goalkeeper]] [was [out of [the goal]]]]]]]

A) is the gold standard structure and B) the parser output, adapted from Lin 1998.

- Precision = 75.0% (9/12)
- Recall = 81.8% (9/11)
- F-score = 78.3%
- Crossing brackets = 1 pair

Page 19:

Strengths & Weakness: PARSEVALStrength:The state-of-the-art parsers obtain up to 90% precision and recall

on the Penn Treebank data (Bod, 2003; Charniak and Johnson, 2005)

Weaknesses:Evaluation based on phrase-structure constituents abstracts

away from basic predicate-argument relationships which are important for correctly capturing the semantics of the sentence (Lin, 1998; Carroll et al., 2002).

Also, using the same resource for training and testing may result in the parser learning systematic errors which are present in both the training and testing material (Rehbein and van Genabith, 2007).

Other metrics: the Leaf-Ancestor metric (G. Sampson, 1980s)19

Page 20:

Text Clustering
- Task definition
- Partitional clustering: simple K-means
- Hierarchical clustering: divisive & agglomerative
- Evaluation of clustering: inter-cluster similarity, cluster purity, entropy or information gain

http://www.miner3d.com/images/kmeans_medium.jpg

Page 21:

Clustering
Partition unlabeled examples into disjoint subsets (clusters), so that:
- examples within a cluster are very similar
- examples in different clusters are very different

Discover new categories in an unsupervised manner:
- inter-cluster distances are maximized
- intra-cluster distances are minimized

Page 22:

Notion of a cluster can be ambiguous
How many clusters?

[Figure: the same points plausibly grouped into two, four, or six clusters]

(Data Mining, Cluster Analysis: Basic Concepts and Algorithms, by Tan, Steinbach, Kumar)

Page 23:

Ambiguous web queries
Web queries are often truly ambiguous:
- jaguar
- NLP
- paris hilton

It seems like word sense disambiguation should help; consider the different senses of jaguar: animal, car, OS X...

In practice, WSD doesn't help for web queries: disambiguation is either impossible ("jaguar") or trivial ("jaguar car"). Instead, "cluster" the results into useful groupings.

Page 24:

Clusty: the clustering search engine

Page 25:

Types of clustering
- Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. E.g. K-means.
- Hierarchical clustering: a set of nested clusters organized as a hierarchical tree. E.g. agglomerative and divisive.
- Density-based clustering: arbitrary-shaped clusters; a cluster is regarded as a region in which the density of data objects exceeds a threshold. E.g. DBSCAN and OPTICS.

Page 26:

Partitional clustering

[Figure: original points and a partitional clustering of them]

Page 27:

Hierarchical clustering

[Figure: points p1-p4 clustered by traditional and non-traditional hierarchical clustering, with the corresponding traditional and non-traditional dendrograms]

Page 28:

Other types of clustering
- Hard vs. soft: in hard clustering, each document is a member of exactly one cluster; in soft clustering, a document has fractional membership in several clusters.
- Exclusive vs. non-exclusive: in non-exclusive clustering, points may belong to multiple clusters; this can represent multiple classes or 'border' points.
- Fuzzy vs. non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights must sum to 1. Probabilistic clustering has similar characteristics.
- Partial vs. complete: in some cases, we only want to cluster some of the data.
- Heterogeneous vs. homogeneous: clusters of widely different sizes, shapes, and densities.

Page 29:

K-means clustering
- Documents are represented as length-normalized vectors in a real-valued space (use normalized, TF/IDF-weighted vectors).
- Initial centroids are often chosen randomly, so the clusters produced vary from one run to another.
- The centroid is, typically, the mean of the points in the cluster.
- 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.

Page 30:

K-means algorithm
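The algorithm itself appeared as a figure on this slide; the following is a minimal NumPy sketch of the standard Lloyd-style iteration (names and defaults are mine, and empty clusters are not handled):

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Minimal Lloyd-style K-means on an (n, d) array of points."""
    rng = np.random.default_rng(seed)
    # Pick k initial centroids at random from the data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (an empty cluster would yield NaN; real code must handle it).
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```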

Page 31:

Stopping criteria
- A fixed number of iterations has been completed.
- Assignment of documents to clusters does not change between iterations.
- Centroids do not change between iterations.
- The distance between the centroids and data points falls below a threshold.

Page 32:

Two different K-means clusterings

[Figure: the same original points clustered two ways: a sub-optimal clustering and an optimal clustering]

Page 33:

Choosing initial centroids - 1

[Figure: K-means converging over iterations 1-6 from one choice of initial centroids]

Page 34:

Choosing initial centroids - 1

[Figure: the same six iterations shown as separate panels]

Page 35:

Choosing initial centroids - 2

[Figure: K-means over iterations 1-5 from a different choice of initial centroids]

Page 36:

Choosing initial centroids - 2

[Figure: the same five iterations shown as separate panels]

Page 37:

Strengths & weaknesses: K-meansStrength: Relatively efficient: complexity is O( n * K * I )

n = number of points, K = number of clusters, I = number of iterations, normally, k, i << n.

Weakness: Sensitive to the initial clusters Need to specify k, the number of clusters, in advance Very sensitive to noise and outliers May have a problem when clusters have different

sizes Not suitable to discover clusters with non-convex

shapes Often terminates at a local optimum.

Page 38:

Hierarchical clustering
Two main types of hierarchical clustering:
- Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
- Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).

Traditional hierarchical algorithms use a similarity or distance matrix, and merge or split one cluster at a time.

Page 39:

Agglomerative clustering
Bottom-up. The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains

The key operation is the computation of the proximity of two clusters (a sketch in code follows below).
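A naive single-link sketch of this loop (names mine; production code would use something like scipy.cluster.hierarchy.linkage instead):

```python
def single_link_agglomerative(dist):
    """Agglomerative clustering over a precomputed symmetric distance
    matrix `dist` (list of lists). Proximity of two clusters is MIN
    (single link). Returns the sequence of merges."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}       # every point starts alone
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under single link.
        a, b = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda ab: min(dist[i][j]
                                      for i in clusters[ab[0]]
                                      for j in clusters[ab[1]]))
        clusters[a] += clusters.pop(b)           # merge cluster b into a
        merges.append((a, b))
    return merges
```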

Page 40:

Agglomerative example
Start with clusters of individual points and a proximity matrix.

[Figure: points p1-p5 and their proximity matrix]

Page 41:

Agglomerative example
After some merging steps, we have some clusters.

[Figure: clusters C1-C5 and the corresponding proximity matrix]

Page 42:

Agglomerative example
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1-C5, with C2 and C5 about to be merged, and the proximity matrix]

Page 43:

Agglomerative example
The question is: "How do we update the proximity matrix?"

[Figure: the merged cluster C2 U C5 and the proximity matrix with unknown (?) entries for its proximity to C1, C3 and C4]

Page 44:

Inter-cluster similarity

[Figure: two groups of points p1-p5 and their proximity matrix. Similarity?]

- MIN
- MAX
- Group average
- Distance between centroids
- Other methods driven by an objective function (Ward's Method uses squared error)

Page 45:

Inter-cluster similarity
(Same figure and list as above, highlighting one of the similarity definitions in turn.)

Page 46:

Inter-cluster similarity
(Same figure and list as above, highlighting one of the similarity definitions in turn.)

Page 47:

Inter-cluster similarity
(Same figure and list as above, highlighting one of the similarity definitions in turn.)

Page 48:

Inter-cluster similarity
(Same figure and list as above, highlighting one of the similarity definitions in turn.)

Page 49:

Inter-cluster similarity
(Same figure and list as above, highlighting one of the similarity definitions in turn.)

Page 50:

Hierarchical clustering comparison

[Figure: clusterings of the same six points produced by MIN, MAX, Group Average, and Ward's Method]

http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4

Page 51:

Divisive clustering
- Top-down: split a cluster iteratively.
- Start with all objects in one cluster and subdivide them into smaller pieces.
- Less popular than agglomerative clustering.

Page 52:

Hierarchical: strengths & weaknesses
Strengths:
- Do not have to assume any particular number of clusters; any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
- The clusters may correspond to meaningful taxonomies, e.g. the animal kingdom in the biological sciences.

Weaknesses:
- When clusters are merged or split, the decision is permanent; erroneous decisions are impossible to correct later.
- Do not scale well: space complexity is O(n^2), where n = total # of points; time complexity is O(n^3) in many cases, since there are n steps, and at each step a proximity matrix of size n^2 must be updated and searched.

Page 53:

Evaluating clustering
Internal criterion:
- high intra-cluster similarity and low inter-cluster similarity

External criteria (compare against a gold standard produced by humans):
- Purity
- Normalized mutual information
- Rand index
- F measure

Page 54:

Purity
Each cluster is assigned to the class which is most frequent in the cluster. The accuracy of this assignment is then measured by counting the number of correctly assigned documents and dividing by N, the total number of data points.

Page 55:

Purity

Purity as an external evaluation criterion for cluster quality: cluster 1 has 5 members of its majority class (x), cluster 2 has 4 (o), and cluster 3 has 3 (⋄). Purity is (5 + 4 + 3) / 17 ≈ 0.71. Purity lies between 0 and 1.
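A small sketch of the computation, with cluster ids and class labels as parallel lists (hypothetical data):

```python
from collections import Counter

def purity(clusters, classes):
    """Sum of each cluster's majority-class count, divided by N."""
    majority_total = sum(
        Counter(cls for k, cls in zip(clusters, classes) if k == c).most_common(1)[0][1]
        for c in set(clusters))
    return majority_total / len(classes)
```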

Page 56:

Pitfall of purity
Purity is 1 if each document gets its own cluster. Thus, we cannot use purity to trade off the quality of the clustering against the number of clusters.

Solution: Normalized Mutual Information

Page 57:

Normalized Mutual Information

NMI is mutual information divided by the average of the cluster and class entropies:

NMI(Ω, C) = I(Ω; C) / [ (H(Ω) + H(C)) / 2 ]

I(Ω; C) = Σ_k Σ_j P(ω_k ∩ c_j) · log [ P(ω_k ∩ c_j) / (P(ω_k) · P(c_j)) ]

where I is mutual information, H is entropy, and P(ω_k), P(c_j), and P(ω_k ∩ c_j) are the probabilities of a document being in cluster ω_k, class c_j, and in the intersection of ω_k and c_j, respectively.
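For reference, a sketch using scikit-learn, whose arithmetic-mean normalization (the default in recent versions) matches the [H(Ω) + H(C)] / 2 denominator above; the label vectors are hypothetical:

```python
from sklearn.metrics import normalized_mutual_info_score

clusters = [0, 0, 0, 1, 1, 1, 2, 2]   # cluster assignment per document
classes  = [0, 0, 1, 1, 1, 1, 2, 2]   # gold class label per document
print(normalized_mutual_info_score(classes, clusters))  # a value in [0, 1]
```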

Page 58:

Mutual information
MI measures the amount of information by which our knowledge about the classes increases when we are told what the clusters are.
- MI is minimal if the clustering is random.
- MI is maximal if K = N, i.e. one-document clusters.

So MI has the same problem as purity: it does not penalize large cardinalities, even though fewer clusters are usually preferable. The normalization by the denominator [H(Ω) + H(C)] / 2 fixes this problem, since entropy tends to increase with the number of clusters.

NMI lies between 0 and 1.

Page 59:

CNG - Common N-Gram analysis
- Definition
- Example
- Similarity measure

http://afflatus.ucd.ie/attachment/2009_6/tn_1246380626644.jpg

http://home.arcor.de/David-Peters/n-Grams.png

Page 60:

N-grams
- An n-gram model is a type of probabilistic model for predicting the next item in a sequence.
- Items can be phonemes, syllables, letters, words or base pairs.
- An n-gram splits a sequence into chunks of n consecutive items.

Page 61:

N-grams example
"I don't know what to say"
- 1-gram (unigram): I, don't, know, what, to, say
- 2-gram (bigram): I don't, don't know, know what, what to, to say
- 3-gram (trigram): I don't know, don't know what, know what to, etc.
- ... n-gram

"TEXT"
- unigram: {T, E, X, T}
- bigram: {_T, TE, EX, XT, T_}
- trigram: {_TE, TEX, EXT, XT_, T__}
- ... n-gram
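A quick character-level sketch in the spirit of the "TEXT" example; padding conventions vary, so the exact sets may differ slightly from the slide:

```python
def char_ngrams(text, n, pad="_"):
    """Character n-grams of `text`, padded so that edge characters
    appear in n-grams at both ends."""
    padded = pad * (n - 1) + text + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("TEXT", 2))  # ['_T', 'TE', 'EX', 'XT', 'T_']
print(char_ngrams("TEXT", 3))  # ['__T', '_TE', 'TEX', 'EXT', 'XT_', 'T__']
```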

Page 62:

Why do we want to predict items?
- Author attribution
- Plagiarism detection
- Malicious code detection
- Genre classification
- Sentiment classification
- Spam identification
- Language and encoding identification
- Spelling correction

Page 63:

Common N-grams method
- Compares the similarity of two texts, or of audio or video data files.
- Builds a byte-level n-gram author profile of an author's writing.
- The profile is a small set of L pairs {(x1, f1), (x2, f2), ..., (xL, fL)} of frequent n-grams and their normalized frequencies, generated from training data.

Two important operations (sketched in code below and further on):
- choose the optimal set of n-grams for a profile
- calculate the similarity between two profiles
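A minimal sketch of the first operation under these definitions: the profile keeps the L most frequent byte n-grams with their normalized frequencies (function name is mine):

```python
from collections import Counter

def cng_profile(data: bytes, n: int, L: int) -> dict:
    """Top-L byte n-grams of `data`, mapped to normalized frequencies."""
    grams = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(L)}
```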

Page 64:

Common N-grams method
- Does not use any language-dependent information (no information about the space character, newline character, uppercase or lowercase).
- The approach does not depend on a specific language, so it does not require segmentation for languages such as Chinese or Thai.
- There is no text preprocessing, so we avoid the need for taggers, parsers and feature selection.

Page 65:

How do n-grams work?
"Marley was dead: to begin with. There is no doubt whatever about that. ..."
(from A Christmas Carol by Charles Dickens)

With n = 3, the text yields the trigrams: Mar, arl, rle, ley, ey_, y_w, _wa, was, ...

Sorted by frequency (the profile keeps the top L = 5):
_th 0.015
___ 0.013
the 0.013
he_ 0.011
and 0.007
_an 0.007
nd_ 0.007
ed_ 0.006

(Detection of New Malicious Code Using N-grams Signatures, © 2004 T. Abou-Assaleh, N. Cercone, V. Keselj & R. Sweidan)

Page 66:

Comparing profiles

Dickens, A Christmas Carol:                _th 0.015, ___ 0.013, the 0.013, he_ 0.011, and 0.007
Dickens, A Tale of Two Cities:             _th 0.016, the 0.014, he_ 0.012, and 0.007, nd_ 0.007
Carroll, Alice's Adventures in Wonderland: _th 0.017, ___ 0.017, the 0.014, he_ 0.014, ing 0.007

How similar is each pair of profiles?

Page 67:

Similarity measure
In order to "normalize" the differences between two profiles, we divide them by the average frequency for a given n-gram, (f1(s) + f2(s)) / 2. E.g. a difference of 0.1 for an n-gram with frequencies 0.9 and 0.8 in the two profiles is weighted less than the same difference for an n-gram with frequencies 0.2 and 0.1.

d(profile1, profile2) = Σ_{s ∈ profile1 ∪ profile2} ( (f1(s) − f2(s)) / ((f1(s) + f2(s)) / 2) )²
                      = Σ_{s ∈ profile1 ∪ profile2} ( 2 · (f1(s) − f2(s)) / (f1(s) + f2(s)) )²

where s is any n-gram from one of the two profiles, and f1(s) and f2(s) are that n-gram's frequencies in the two profiles (taken as 0 when s does not occur in a profile).
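A direct sketch of the measure over dictionary profiles like those built by cng_profile above:

```python
def cng_dissimilarity(p1: dict, p2: dict) -> float:
    """Sum of squared normalized frequency differences over the union
    of n-grams; identical profiles give 0."""
    total = 0.0
    for s in set(p1) | set(p2):
        f1, f2 = p1.get(s, 0.0), p2.get(s, 0.0)   # 0 when s is absent
        total += (2 * (f1 - f2) / (f1 + f2)) ** 2
    return total
```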

Page 68:

Profile dissimilarity algorithm

Returns a non-negative number, which is a measure of dissimilarity.

For identical texts, the dissimilarity is 0.

Page 69:

Text classification using CNG
- Given a test document, a test profile is produced.
- The distances between the test profile and the author profiles are calculated.
- The test document is classified using the k-nearest-neighbours method with k = 1: the test document is attributed to the author whose profile is closest to the test profile.
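Putting the earlier sketches together, a hypothetical end-to-end 1-NN attribution (author_texts maps author names to training bytes):

```python
def attribute(test_text: bytes, author_texts: dict, n: int = 3, L: int = 1000):
    """Attribute test_text to the author with the closest CNG profile."""
    test_profile = cng_profile(test_text, n, L)
    return min(author_texts,
               key=lambda author: cng_dissimilarity(
                   test_profile, cng_profile(author_texts[author], n, L)))
```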

Page 70:

Strengths & Weaknesses: CNG methodStrengths:

Easy to computeEasy to test

Weaknesses:Computational resources for trainingImbalanced datasetsAutomatic selection of N and L

Page 71:

As an aside: ordering doesn't matter
Aoccdrnig to rscheearch at an Elingsh

uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, olny taht the frist and lsat ltteres are at the rghit pcleas. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by ilstef, but the wrod as a wlohe.

“Humans are interesting” – Ryuk

Page 72:

References
- E. Black et al. A procedure for quantitatively comparing the syntactic coverage of English grammars. 1991.
- Tuomo Kakkonen. Framework and resources for natural language parser evaluation. 2007.
- Cornoiu Sorina. Solving the heterogeneity problem in e-government using n-grams.
- Alberto Barron-Cedeno and Paolo Rosso. On automatic plagiarism detection based on n-grams comparison.
- V. Keselj, N. Cercone et al. N-gram-based author profiles for authorship attribution. 2003.
- V. Keselj, N. Cercone. CNG method with weighted voting.
- T. Abou-Assaleh, N. Cercone et al. N-gram-based detection of new malicious code. 2004.
- Tan, Steinbach, Kumar. Introduction to Data Mining (book).

Page 73:

Thank you! Questions?

Page 74:

Evaluating K-means clusters
The most common measure is the Sum of Squared Error (SSE):
- For each point, the error is the distance to the nearest cluster centroid; to get SSE, we square these errors and sum them:

SSE = Σ_{i=1..K} Σ_{x ∈ C_i} dist²(m_i, x)

- x is a data point in cluster C_i and m_i is the representative point for cluster C_i; one can show that m_i corresponds to the center (mean) of the cluster.
- Given two clusterings, we can choose the one with the smallest error.
- One easy way to reduce SSE is to increase K, the number of clusters.
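A one-function NumPy sketch against the output of the earlier kmeans sketch:

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of squared Euclidean distances of points to their centroids."""
    return float(((points - centroids[labels]) ** 2).sum())
```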

Page 75:

Measures of cluster validity
Numerical measures:
- External index: used to measure the extent to which cluster labels match externally supplied class labels. E.g. entropy.
- Internal index: used to measure the goodness of a clustering structure without respect to external information. E.g. Sum of Squared Error (SSE).
- Relative index: used to compare two different clusterings or clusters. Often an external or internal index is used for this function, e.g. SSE or entropy.

Page 76:

External Measures of Cluster Validity: Entropy and Purity

Page 77:

Cluster validity
For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.

For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"!

Then why do we want to evaluate them?
- To avoid finding patterns in noise
- To compare clustering algorithms
- To compare two sets of clusters
- To compare two clusters

Page 78:

Clusters found in random data

[Figure: random points, and the "clusters" that K-means and complete link nevertheless find in them]

Page 79:

K-means clustering
- Partitional clustering approach
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid
- The number of clusters, K, must be specified

Page 80:

Text clustering
Text clustering is quite different...
- Feature representations of text will typically have a large number of dimensions (10^3 to 10^6).
- Euclidean distance isn't necessarily the best distance metric for feature representations.
- Typically use normalized, TF/IDF-weighted vectors and cosine similarity.
- Optimize computations for sparse vectors.

Applications:
- During retrieval, add other documents in the same cluster as the initially retrieved documents to improve recall.
- Clustering the results of retrieval to present more organized results to the user (e.g. Clusty, Northernlight folders).
- Automated production of hierarchical taxonomies of documents for browsing purposes (e.g. Yahoo).

Page 81:

Cluster similarity: MIN (single link)
Based on the two most similar (closest) points in the different clusters; determined by one pair of points, i.e., by one link in the proximity graph.

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00

[Figure: the corresponding single-link dendrogram over points 1-5]

Page 82:

Strengths: MIN

[Figure: original points and the two clusters found]

• Can handle non-elliptical shapes

Page 83:

Limitation: MIN

[Figure: original points and the two clusters found]

• Sensitive to noise and outliers

Page 84:

Cluster similarity: MAX (complete linkage)
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters; determined by all pairs of points in the two clusters.

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00

[Figure: the corresponding complete-link dendrogram over points 1-5]

Page 85:

Strength: MAX

[Figure: original points and the two clusters found]

• Less susceptible to noise and outliers

Page 86:

Limitation: MAX

[Figure: original points and the two clusters found]

• Tends to break large clusters
• Biased towards globular clusters

Page 87:

Cluster similarity: group average
Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

proximity(Cluster_i, Cluster_j) = [ Σ_{p_i ∈ Cluster_i, p_j ∈ Cluster_j} proximity(p_i, p_j) ] / ( |Cluster_i| · |Cluster_j| )

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00

[Figure: the corresponding group-average dendrogram over points 1-5]

Page 88:

Strength & Limitation: Group average Compromise between Single and

Complete Link Strength:

Less susceptible to noise and outliers Limitation:

Biased towards globular clusters

Page 89:

Cluster similarity: Ward's method
- Similarity of two clusters is based on the increase in squared error when the two clusters are merged.
- Similar to group average if the distance between points is the distance squared.
- Less susceptible to noise and outliers.
- Biased towards globular clusters.
- Can be used to initialize K-means.
