Text Similarity & Clustering
Qinpei Zhao, 15 Feb 2011
Outline
String matching metrics
Implementation and applications
Online resources
Location-based clustering
String Matching Metrics
Exact String Matching
Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T.
Example: T = “AGCTTGA”, P = “GCT”
Applications:
Searching keywords in a file
Search engines (like Google)
Database searching
Approximate String Matching
Determine whether a text string T of length n and a pattern string P of length m “partially” match. Consider the string “approximate”. Which of these are partial matches?
aproximate, approximately, appropriate, proximate, approx, approximat, apropos, approxximate
A partial match can be thought of as one that has k differences from the string, where k is some small integer (for instance 1 or 2). A difference occurs if string1.charAt(j) != string2.charAt(j), or if string1.charAt(j) does not appear in string2 (or vice versa). The former case is known as a revise (substitution) difference; the latter is a delete or insert difference. What about two characters that appear out of position, for instance approximate vs. apporximate?
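A rough sketch of this difference count (the helper below is hypothetical, not from the slides): it counts position-wise revise differences and treats the length gap as insert/delete differences. It also shows why such out-of-position shifts defeat a naive position-wise count, motivating the edit distance that follows.

```java
// Naive k-differences check in the spirit of the charAt comparisons above.
// Hypothetical helper; position mismatches count as revise differences,
// the length gap counts as insert/delete differences.
public class KDifferences {
    static int countDifferences(String s1, String s2) {
        int shorter = Math.min(s1.length(), s2.length());
        int diffs = 0;
        for (int j = 0; j < shorter; j++) {
            if (s1.charAt(j) != s2.charAt(j)) diffs++;      // revise difference
        }
        return diffs + Math.abs(s1.length() - s2.length()); // insert/delete differences
    }

    public static void main(String[] args) {
        System.out.println(countDifferences("approximate", "approximat")); // 1
        // a single deletion shifts every later position, inflating the count:
        System.out.println(countDifferences("approximate", "aproximate")); // 9, not 1
    }
}
```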
Example queries: Keanu Reeves, Samuel Jackson, Schwarzenegger, …
A misspelled query: “Schwarrzenger”
Query errors: limited knowledge about the data; typos; limited input device (e.g. cell phone) input
Data errors: typos; web data; OCR
Applications: spell checking; query relaxation; …
Similarity functions: edit distance; q-gram; cosine; …
Approximate String Matching
Edit distance (Levenshtein distance)
Given two strings T and P, the edit distance is the minimum number of substitutions, insertions, and deletions that transform T into P.
Time complexity by dynamic programming: O(mn)
Edit distance DP table (Wagner & Fischer, 1974) for T = “temp”, P = “tmp”:

        ""   t   m   p
  ""     0   1   2   3
  t      1   0   1   2
  e      2   1   1   2
  m      3   2   1   2
  p      4   3   2   1
Dynamic programming: m[i][j] = min{ m[i-1][j] + 1, m[i][j-1] + 1, m[i-1][j-1] + d(i,j) }, where d(i,j) = 0 if T[i] = P[j], and d(i,j) = 1 otherwise.
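A minimal sketch of this dynamic program (variable names are illustrative; the cost term implements d(i,j) by comparing characters):

```java
// Edit distance in O(mn) time via the recurrence above.
public class EditDistance {
    static int editDistance(String t, String p) {
        int n = t.length(), m = p.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;   // delete the whole prefix of T
        for (int j = 0; j <= m; j++) d[0][j] = j;   // insert the whole prefix of P
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = (t.charAt(i - 1) == p.charAt(j - 1)) ? 0 : 1; // d(i,j)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(editDistance("temp", "tmp")); // 1, as in the table above
    }
}
```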
Q-grams
Q-grams are substrings of fixed length q; e.g., the 2-grams of “bingo” are {bi, in, ng, go}.
Count filter: if ed(T, P) <= k, then
# of common grams >= # of T grams - k * q
Q-grams
T = “bingo”, P = “going”, padded with '#' (q = 2):
gram1 = {#b, bi, in, ng, go, o#}
gram2 = {#g, go, oi, in, ng, g#}
unique(gram1 ∪ gram2) = {#b, bi, in, ng, go, o#, #g, oi, g#}
gram1.length = (T.length + (q - 1) * 2 + 1) - q = 6
gram2.length = (P.length + (q - 1) * 2 + 1) - q = 6
L = gram1.length + gram2.length = 12
common terms = L - unique(gram1 ∪ gram2).length = 3 (in, ng, go)
Similarity = (L - |difference|) / L = 2 * common terms / L = 6 / 12 = 0.5, where |difference| = L - 2 * common terms counts the grams the two strings do not share.
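A sketch of this q-gram similarity, assuming '#' padding and a multiset intersection for the common grams, as in the worked example above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Q-gram similarity: pad with '#', extract q-grams, count common grams,
// and score 2 * common / (|gram1| + |gram2|).
public class QGramSimilarity {
    static List<String> qGrams(String s, int q) {
        String padded = "#".repeat(q - 1) + s + "#".repeat(q - 1);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + q <= padded.length(); i++)
            grams.add(padded.substring(i, i + q));
        return grams;
    }

    static double similarity(String t, String p, int q) {
        List<String> g1 = qGrams(t, q), g2 = qGrams(p, q);
        Map<String, Integer> counts = new HashMap<>();
        for (String g : g1) counts.merge(g, 1, Integer::sum);
        int common = 0;
        for (String g : g2) {                       // multiset intersection
            Integer c = counts.get(g);
            if (c != null && c > 0) { common++; counts.put(g, c - 1); }
        }
        return 2.0 * common / (g1.size() + g2.size()); // = (L - |difference|) / L
    }

    public static void main(String[] args) {
        System.out.println(similarity("bingo", "going", 2)); // 0.5
    }
}
```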
Cosine similarity
Given two vectors A and B, the angle θ between them is expressed via the dot product and magnitudes:
cos(θ) = (A · B) / (|A| |B|)
Implementation: cosine similarity = (common terms) / (sqrt(number of terms in String1) * sqrt(number of terms in String2))
Cosine similarity
T = “bingo right”, P = “going right”
T1 = {bingo, right}, P1 = {going, right}
L1 = unique(T1).length = 2; L2 = unique(P1).length = 2
unique(T1 ∪ P1) = {bingo, right, going}; L3 = unique(T1 ∪ P1).length = 3
Common terms = (L1 + L2) - L3 = 1
Similarity = common terms / (sqrt(L1) * sqrt(L2)) = 1 / 2 = 0.5
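A sketch of this token-set cosine similarity (splitting on whitespace is an assumption; the slides do not specify the tokenization):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Token-set cosine similarity: common terms / (sqrt(L1) * sqrt(L2)).
public class CosineSimilarity {
    static double similarity(String t, String p) {
        Set<String> t1 = new HashSet<>(Arrays.asList(t.split("\\s+")));
        Set<String> p1 = new HashSet<>(Arrays.asList(p.split("\\s+")));
        Set<String> union = new HashSet<>(t1);
        union.addAll(p1);
        int common = t1.size() + p1.size() - union.size(); // (L1 + L2) - L3
        return common / (Math.sqrt(t1.size()) * Math.sqrt(p1.size()));
    }

    public static void main(String[] args) {
        System.out.println(similarity("bingo right", "going right")); // 0.5
    }
}
```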
Dice coefficient
Similar to cosine similarity:
Dice's coefficient = (2 * common terms) / (number of terms in String1 + number of terms in String2)
For the example above: 2 * 1 / (2 + 2) = 0.5.
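The same token sets scored with Dice's coefficient, as a sketch under the same tokenization assumption; on this example it coincides with the cosine score because both sets have equal size:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Dice's coefficient: 2 * common terms / (L1 + L2).
public class DiceCoefficient {
    static double dice(String t, String p) {
        Set<String> t1 = new HashSet<>(Arrays.asList(t.split("\\s+")));
        Set<String> p1 = new HashSet<>(Arrays.asList(p.split("\\s+")));
        Set<String> union = new HashSet<>(t1);
        union.addAll(p1);
        int common = t1.size() + p1.size() - union.size();
        return 2.0 * common / (t1.size() + p1.size());
    }

    public static void main(String[] args) {
        System.out.println(dice("bingo right", "going right")); // 2 * 1 / 4 = 0.5
    }
}
```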
Implementation & Applications
Similarity metrics
Edit distance
Q-gram
Cosine distance
Dice coefficient
…
Similarity between two strings: Demo
Compared strings | Edit distance (%) | Q-grams q=2 (%) | Q-grams q=3 (%) | Q-grams q=4 (%) | Cosine distance (%)
Pizza Express Café vs. Pizza Express | 72% | 78.79% | 74.29% | 70.27% | 81.65%
Lounasravintola Pinja Ky – Ravintoloita vs. Lounasravintola Pinja | 54% | 67.74% | 67.19% | 65.15% | 63.25%
Kioski Piirakkapaja vs. Kioski Marttakahvio | 47% | 45.00% | 33.33% | 31.82% | 50.00%
Kauppa Kulta Keidas vs. Kauppa Kulta Nalle | 68% | 66.67% | 63.41% | 60.47% | 66.67%
Ravintola Beer Stop Pub vs. Baari, Beer Stop R-kylä | 39% | 41.67% | 36.00% | 30.77% | 50.00%
Ravintola Beer Stop Pub vs. Baari, Wanha Mestari R-kylä | 19% | 7.69% | 0.00% | 0.00% | 0.00%
Ravintola Foxie s Bar Siirry hakukenttään vs. Baari, Foxie Karsikko | 31% | 25.00% | 15.15% | 11.76% | 23.57%
Play baari vs. Ravintola Bar Play – Ravintoloita | 21% | 31.11% | 17.02% | 8.16% | 31.62%
Applications in MOPSI
Duplicate record cleaning
Spell checking, e.g. “Communication” vs. “comunication”
Query relevance/expansion
Text level: annotation recommendation *, keyword clustering *, MOPSI search engine **
Annotation recommendation
[Demo screenshot: recommendations returned in 500 ms]
String clustering
The similarity between every string pair is calculated as a basis for determining the clusters.
Using the vector model for clustering: a similarity measure is required to calculate the similarity between two strings.
String clustering (Cont.)
The final step in creating clusters is to determine when two objects (words) belong to the same cluster.
Hierarchical agglomerative clustering (HAC) – start with unclustered items and perform pair-wise similarity measures to determine the clusters (see the sketch after this list).
Hierarchical divisive clustering – start with one cluster and break it down into smaller clusters.
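A minimal HAC sketch under these definitions. The sim() placeholder stands for any of the string metrics above; the single-linkage merge rule and the threshold stopping criterion are assumptions, as the slides do not fix either:

```java
import java.util.ArrayList;
import java.util.List;

// Agglomerative clustering: start with singletons, repeatedly merge the most
// similar pair of clusters until no pair reaches the similarity threshold.
public class StringHAC {
    // placeholder: fraction of matching prefix characters
    // (swap in edit distance, q-gram, cosine, or Dice similarity here)
    static double sim(String a, String b) {
        int k = 0, n = Math.min(a.length(), b.length());
        while (k < n && a.charAt(k) == b.charAt(k)) k++;
        return (double) k / Math.max(a.length(), b.length());
    }

    // single linkage: similarity of the closest pair across two clusters
    static double linkage(List<String> c1, List<String> c2) {
        double best = 0;
        for (String a : c1) for (String b : c2) best = Math.max(best, sim(a, b));
        return best;
    }

    static List<List<String>> cluster(List<String> items, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String s : items) { List<String> c = new ArrayList<>(); c.add(s); clusters.add(c); }
        while (true) {
            int bi = -1, bj = -1;
            double best = threshold;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkage(clusters.get(i), clusters.get(j));
                    if (d >= best) { best = d; bi = i; bj = j; }
                }
            if (bi < 0) return clusters;                  // no pair reaches threshold
            clusters.get(bi).addAll(clusters.remove(bj)); // merge the best pair
        }
    }

    public static void main(String[] args) {
        List<String> names = List.of("bingo", "bingos", "going", "goings");
        System.out.println(cluster(names, 0.5)); // [[bingo, bingos], [going, goings]]
    }
}
```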
Objectives of a Hierarchy of Clusters
Reduce the overhead of search: perform top-down searches of the cluster centroids in the hierarchy and trim the branches that are not relevant.
Provide a visual representation of the information space: visual cues on the size of clusters (size of ellipse) and the strength of the linkage between clusters (dashed line, solid line, …).
Expand the retrieval of relevant items: a user, once having identified an item of interest, can request to see other items in the cluster. The user can increase the specificity of items by going to child clusters, or increase the generality of items being reviewed by going to a parent cluster.
Keyword clustering (semantic)
Thesaurus-based: WordNet – an advanced web interface to browse the WordNet database.
Thesauri are not available for every language, e.g. Finnish.
Example
Resources
Useful resources
Similarity metrics (http://staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html )
Similarity metrics (javascript) (http://cs.joensuu.fi/~zhao/Link/ )
Flamingo package (http://flamingo.ics.uci.edu/releases/4.0/ )
WordNet (http://wordnet.princeton.edu/wordnet/related-projects/ )
Location-based clustering
DBSCAN – density-based clustering (KDD'96)
Parameters: MinPts, eps
Time complexity: O(log n) per neighbourhood query (getNeighbours); O(n log n) in total
Advantages: handles clusters of arbitrary shape; noise is considered explicitly
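A minimal DBSCAN sketch. The neighbour search below is the naive O(n²) version; the O(n log n) total on this slide assumes a spatial index behind getNeighbours. Following the usual definition, the eps-neighbourhood includes the point itself:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// DBSCAN: grow a cluster from each unvisited core point; points whose
// eps-neighbourhood holds fewer than MinPts points become noise (or border
// points if a later cluster reaches them).
public class Dbscan {
    static final int NOISE = -1, UNVISITED = 0;

    static int[] dbscan(double[][] pts, double eps, int minPts) {
        int[] label = new int[pts.length]; // 0 = unvisited
        int cluster = 0;
        for (int i = 0; i < pts.length; i++) {
            if (label[i] != UNVISITED) continue;
            List<Integer> seeds = getNeighbours(pts, i, eps);
            if (seeds.size() < minPts) { label[i] = NOISE; continue; }
            label[i] = ++cluster;                          // new core point -> new cluster
            for (int k = 0; k < seeds.size(); k++) {       // expand the cluster
                int j = seeds.get(k);
                if (label[j] == NOISE) label[j] = cluster; // border point
                if (label[j] != UNVISITED) continue;
                label[j] = cluster;
                List<Integer> nb = getNeighbours(pts, j, eps);
                if (nb.size() >= minPts) seeds.addAll(nb); // j is core too
            }
        }
        return label;
    }

    static List<Integer> getNeighbours(double[][] pts, int i, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < pts.length; j++) {
            double dx = pts[i][0] - pts[j][0], dy = pts[i][1] - pts[j][1];
            if (Math.sqrt(dx * dx + dy * dy) <= eps) out.add(j);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] pts = { {0, 0}, {0, 0.1}, {0.1, 0}, {5, 5}, {5, 5.1}, {9, 9} };
        System.out.println(Arrays.toString(dbscan(pts, 0.5, 2))); // [1, 1, 1, 2, 2, -1]
    }
}
```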
DBSCAN result
[Map figures: Joensuu (29.76, 62.60); Helsinki (24, 60)]
Gaussian Mixture Model
Maximum likelihood estimation (Expectation-Maximization algorithm)
Parameters required: number of components; iteration number
Advantages: probabilistic (fuzzy) memberships
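An illustrative EM sketch for a one-dimensional two-component mixture (the location data here is 2D lon/lat, which would need covariance matrices; the data values and initialization below are made up):

```java
// EM for a 1D Gaussian mixture: alternate E-step (responsibilities) and
// M-step (weights, means, variances) for a fixed iteration number.
public class GmmEm {
    static double gauss(double x, double mu, double variance) {
        return Math.exp(-(x - mu) * (x - mu) / (2 * variance)) / Math.sqrt(2 * Math.PI * variance);
    }

    public static void main(String[] args) {
        double[] x = {0.9, 1.0, 1.1, 4.9, 5.0, 5.1};    // toy data, two clumps
        int K = 2, iters = 50;                          // components, iteration number
        double[] w = {0.5, 0.5}, mu = {0.0, 6.0}, variance = {1.0, 1.0};
        double[][] r = new double[x.length][K];         // responsibilities

        for (int it = 0; it < iters; it++) {
            // E-step: posterior probability of each component for each point
            for (int i = 0; i < x.length; i++) {
                double sum = 0;
                for (int k = 0; k < K; k++) { r[i][k] = w[k] * gauss(x[i], mu[k], variance[k]); sum += r[i][k]; }
                for (int k = 0; k < K; k++) r[i][k] /= sum;
            }
            // M-step: re-estimate parameters from the responsibilities
            for (int k = 0; k < K; k++) {
                double nk = 0, m = 0, v = 0;
                for (int i = 0; i < x.length; i++) { nk += r[i][k]; m += r[i][k] * x[i]; }
                m /= nk;
                for (int i = 0; i < x.length; i++) v += r[i][k] * (x[i] - m) * (x[i] - m);
                w[k] = nk / x.length; mu[k] = m; variance[k] = Math.max(v / nk, 1e-6);
            }
        }
        System.out.printf("means: %.2f, %.2f%n", mu[0], mu[1]); // ~1.00 and ~5.00
    }
}
```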
GMMs
[Map figures of the GMM result: Joensuu (29.76, 62.60); Helsinki (24, 60)]
My activity area
Thanks!