Text Similarity & Clustering
Qinpei Zhao, 15 Feb 2011
Outline
String matching metrics
Implementation and applications
Online resources
Location-based clustering
String Matching Metrics
Exact String Matching
Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T.
Example: T = “AGCTTGA”, P = “GCT”
Applications:
Searching keywords in a file
Search engines (like Google)
Database searching
Approximate String Matching
Determine whether a text string T of length n and a pattern string P of length m “partially” match. Consider the string “approximate”. Which of these are partial matches?
aproximate, approximately, appropriate, proximate, approx, approximat, apropos, approxximate
A partial match can be thought of as one that has k differences from the string, where k is some small integer (for instance 1 or 2). A difference occurs if string1.charAt(j) != string2.charAt(j), or if string1.charAt(j) does not appear in string2 (or vice versa). The former case is known as a revise (substitution) difference; the latter is a delete or insert difference. What about two characters that appear out of position, for instance approximate vs. apporximate?
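A rough sketch of this difference count (the helper below is hypothetical, not from the slides): it counts position-wise revise differences and treats the length gap as insert/delete differences. It also shows why such out-of-position shifts defeat a naive position-wise count, motivating the edit distance that follows.

```java
// Naive k-differences check in the spirit of the charAt comparisons above.
// Hypothetical helper; position mismatches count as revise differences,
// the length gap counts as insert/delete differences.
public class KDifferences {
    static int countDifferences(String s1, String s2) {
        int shorter = Math.min(s1.length(), s2.length());
        int diffs = 0;
        for (int j = 0; j < shorter; j++) {
            if (s1.charAt(j) != s2.charAt(j)) diffs++;      // revise difference
        }
        return diffs + Math.abs(s1.length() - s2.length()); // insert/delete differences
    }

    public static void main(String[] args) {
        System.out.println(countDifferences("approximate", "approximat")); // 1
        // a single deletion shifts every later position, inflating the count:
        System.out.println(countDifferences("approximate", "aproximate")); // 9, not 1
    }
}
```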
Example queries: Keanu Reeves, Samuel Jackson, Schwarzenegger, …
A misspelled query: “Schwarrzenger”
Query errors: limited knowledge about the data; typos; limited input device (e.g. cell phone) input
Data errors: typos; web data; OCR
Applications: spell checking; query relaxation; …
Similarity functions: edit distance; q-gram; cosine; …
Approximate String Matching
Edit distance (Levenshtein distance)
Given two strings T and P, the edit distance is the minimum number of substitutions, insertions, and deletions that transform T into P.
Time complexity by dynamic programming: O(mn)
Edit distance DP table (Wagner & Fischer, 1974) for T = “temp”, P = “tmp”:

        ""   t   m   p
  ""     0   1   2   3
  t      1   0   1   2
  e      2   1   1   2
  m      3   2   1   2
  p      4   3   2   1
Dynamic programming: m[i][j] = min{ m[i-1][j] + 1, m[i][j-1] + 1, m[i-1][j-1] + d(i,j) }, where d(i,j) = 0 if T[i] = P[j], and d(i,j) = 1 otherwise.
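A minimal sketch of this dynamic program (variable names are illustrative; the cost term implements d(i,j) by comparing characters):

```java
// Edit distance in O(mn) time via the recurrence above.
public class EditDistance {
    static int editDistance(String t, String p) {
        int n = t.length(), m = p.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;   // delete the whole prefix of T
        for (int j = 0; j <= m; j++) d[0][j] = j;   // insert the whole prefix of P
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = (t.charAt(i - 1) == p.charAt(j - 1)) ? 0 : 1; // d(i,j)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(editDistance("temp", "tmp")); // 1, as in the table above
    }
}
```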
Q-grams
Q-grams are substrings of fixed length q; e.g., the 2-grams of “bingo” are {bi, in, ng, go}.
Count filter: if ed(T, P) <= k, then
# of common grams >= # of T grams - k * q
Q-grams
T = “bingo”, P = “going”, padded with '#' (q = 2):
gram1 = {#b, bi, in, ng, go, o#}
gram2 = {#g, go, oi, in, ng, g#}
unique(gram1 ∪ gram2) = {#b, bi, in, ng, go, o#, #g, oi, g#}
gram1.length = (T.length + (q - 1) * 2 + 1) - q = 6
gram2.length = (P.length + (q - 1) * 2 + 1) - q = 6
L = gram1.length + gram2.length = 12
common terms = L - unique(gram1 ∪ gram2).length = 3 (in, ng, go)
Similarity = (L - |difference|) / L = 2 * common terms / L = 6 / 12 = 0.5, where |difference| = L - 2 * common terms counts the grams the two strings do not share.
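A sketch of this q-gram similarity, assuming '#' padding and a multiset intersection for the common grams, as in the worked example above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Q-gram similarity: pad with '#', extract q-grams, count common grams,
// and score 2 * common / (|gram1| + |gram2|).
public class QGramSimilarity {
    static List<String> qGrams(String s, int q) {
        String padded = "#".repeat(q - 1) + s + "#".repeat(q - 1);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + q <= padded.length(); i++)
            grams.add(padded.substring(i, i + q));
        return grams;
    }

    static double similarity(String t, String p, int q) {
        List<String> g1 = qGrams(t, q), g2 = qGrams(p, q);
        Map<String, Integer> counts = new HashMap<>();
        for (String g : g1) counts.merge(g, 1, Integer::sum);
        int common = 0;
        for (String g : g2) {                       // multiset intersection
            Integer c = counts.get(g);
            if (c != null && c > 0) { common++; counts.put(g, c - 1); }
        }
        return 2.0 * common / (g1.size() + g2.size()); // = (L - |difference|) / L
    }

    public static void main(String[] args) {
        System.out.println(similarity("bingo", "going", 2)); // 0.5
    }
}
```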
Cosine similarity
Given two vectors A and B, the angle θ between them is expressed via the dot product and magnitudes:
cos(θ) = (A · B) / (|A| |B|)
Implementation: cosine similarity = (common terms) / (sqrt(number of terms in String1) * sqrt(number of terms in String2))
Cosine similarity
T = “bingo right”, P = “going right”
T1 = {bingo, right}, P1 = {going, right}
L1 = unique(T1).length = 2; L2 = unique(P1).length = 2
unique(T1 ∪ P1) = {bingo, right, going}; L3 = unique(T1 ∪ P1).length = 3
Common terms = (L1 + L2) - L3 = 1
Similarity = common terms / (sqrt(L1) * sqrt(L2)) = 1 / 2 = 0.5
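A sketch of this token-set cosine similarity (splitting on whitespace is an assumption; the slides do not specify the tokenization):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Token-set cosine similarity: common terms / (sqrt(L1) * sqrt(L2)).
public class CosineSimilarity {
    static double similarity(String t, String p) {
        Set<String> t1 = new HashSet<>(Arrays.asList(t.split("\\s+")));
        Set<String> p1 = new HashSet<>(Arrays.asList(p.split("\\s+")));
        Set<String> union = new HashSet<>(t1);
        union.addAll(p1);
        int common = t1.size() + p1.size() - union.size(); // (L1 + L2) - L3
        return common / (Math.sqrt(t1.size()) * Math.sqrt(p1.size()));
    }

    public static void main(String[] args) {
        System.out.println(similarity("bingo right", "going right")); // 0.5
    }
}
```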
Dice coefficient
Similar to cosine similarity:
Dice's coefficient = (2 * common terms) / (number of terms in String1 + number of terms in String2)
For the example above: 2 * 1 / (2 + 2) = 0.5.
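The same token sets scored with Dice's coefficient, as a sketch under the same tokenization assumption; on this example it coincides with the cosine score because both sets have equal size:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Dice's coefficient: 2 * common terms / (L1 + L2).
public class DiceCoefficient {
    static double dice(String t, String p) {
        Set<String> t1 = new HashSet<>(Arrays.asList(t.split("\\s+")));
        Set<String> p1 = new HashSet<>(Arrays.asList(p.split("\\s+")));
        Set<String> union = new HashSet<>(t1);
        union.addAll(p1);
        int common = t1.size() + p1.size() - union.size();
        return 2.0 * common / (t1.size() + p1.size());
    }

    public static void main(String[] args) {
        System.out.println(dice("bingo right", "going right")); // 2 * 1 / 4 = 0.5
    }
}
```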
Implementation & Applications
Similarity metrics
Edit distance
Q-gram
Cosine distance
Dice coefficient
…
Similarity between two strings: Demo
Compared strings | Edit distance (%) | Q-grams q=2 (%) | Q-grams q=3 (%) | Q-grams q=4 (%) | Cosine distance (%)
Pizza Express Café vs. Pizza Express | 72% | 78.79% | 74.29% | 70.27% | 81.65%
Lounasravintola Pinja Ky – Ravintoloita vs. Lounasravintola Pinja | 54% | 67.74% | 67.19% | 65.15% | 63.25%
Kioski Piirakkapaja vs. Kioski Marttakahvio | 47% | 45.00% | 33.33% | 31.82% | 50.00%
Kauppa Kulta Keidas vs. Kauppa Kulta Nalle | 68% | 66.67% | 63.41% | 60.47% | 66.67%
Ravintola Beer Stop Pub vs. Baari, Beer Stop R-kylä | 39% | 41.67% | 36.00% | 30.77% | 50.00%
Ravintola Beer Stop Pub vs. Baari, Wanha Mestari R-kylä | 19% | 7.69% | 0.00% | 0.00% | 0.00%
Ravintola Foxie s Bar Siirry hakukenttään vs. Baari, Foxie Karsikko | 31% | 25.00% | 15.15% | 11.76% | 23.57%
Play baari vs. Ravintola Bar Play – Ravintoloita | 21% | 31.11% | 17.02% | 8.16% | 31.62%
Applications in MOPSI
Duplicate record cleaning
Spell checking, e.g. “Communication” vs. “comunication”
Query relevance/expansion
Text level: annotation recommendation *, keyword clustering *, MOPSI search engine **
Annotation recommendation
[Demo screenshot: recommendations returned in 500 ms]
String clustering
The similarity between every string pair is calculated as a basis for determining the clusters.
Using the vector model for clustering: a similarity measure is required to calculate the similarity between two strings.
String clustering (Cont.)
The final step in creating clusters is to determine when two objects (words) belong to the same cluster.
Hierarchical agglomerative clustering (HAC) – start with unclustered items and perform pair-wise similarity measures to determine the clusters (see the sketch after this list).
Hierarchical divisive clustering – start with one cluster and break it down into smaller clusters.
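A minimal HAC sketch under these definitions. The sim() placeholder stands for any of the string metrics above; the single-linkage merge rule and the threshold stopping criterion are assumptions, as the slides do not fix either:

```java
import java.util.ArrayList;
import java.util.List;

// Agglomerative clustering: start with singletons, repeatedly merge the most
// similar pair of clusters until no pair reaches the similarity threshold.
public class StringHAC {
    // placeholder: fraction of matching prefix characters
    // (swap in edit distance, q-gram, cosine, or Dice similarity here)
    static double sim(String a, String b) {
        int k = 0, n = Math.min(a.length(), b.length());
        while (k < n && a.charAt(k) == b.charAt(k)) k++;
        return (double) k / Math.max(a.length(), b.length());
    }

    // single linkage: similarity of the closest pair across two clusters
    static double linkage(List<String> c1, List<String> c2) {
        double best = 0;
        for (String a : c1) for (String b : c2) best = Math.max(best, sim(a, b));
        return best;
    }

    static List<List<String>> cluster(List<String> items, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String s : items) { List<String> c = new ArrayList<>(); c.add(s); clusters.add(c); }
        while (true) {
            int bi = -1, bj = -1;
            double best = threshold;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkage(clusters.get(i), clusters.get(j));
                    if (d >= best) { best = d; bi = i; bj = j; }
                }
            if (bi < 0) return clusters;                  // no pair reaches threshold
            clusters.get(bi).addAll(clusters.remove(bj)); // merge the best pair
        }
    }

    public static void main(String[] args) {
        List<String> names = List.of("bingo", "bingos", "going", "goings");
        System.out.println(cluster(names, 0.5)); // [[bingo, bingos], [going, goings]]
    }
}
```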
Objectives of a Hierarchy of Clusters
Reduce the overhead of search: perform top-down searches of the cluster centroids in the hierarchy and trim the branches that are not relevant.
Provide a visual representation of the information space: visual cues on the size of clusters (size of ellipse) and the strength of the linkage between clusters (dashed line, solid line, …).
Expand the retrieval of relevant items: a user, once having identified an item of interest, can request to see other items in the cluster. The user can increase the specificity of items by going to child clusters, or increase the generality of items being reviewed by going to a parent cluster.
Keyword clustering (semantic)
Thesaurus-based: WordNet – an advanced web interface to browse the WordNet database.
Thesauri are not available for every language, e.g. Finnish.
Example
Resources
Useful resources
Similarity metrics (http://staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html )
Similarity metrics (javascript) (http://cs.joensuu.fi/~zhao/Link/ )
Flamingo package (http://flamingo.ics.uci.edu/releases/4.0/ )
WordNet (http://wordnet.princeton.edu/wordnet/related-projects/ )
Location-based clustering
DBSCAN – density-based clustering (KDD'96)
Parameters: MinPts, eps
Time complexity: O(log n) per neighbourhood query (getNeighbours); O(n log n) in total
Advantages: handles clusters of arbitrary shape; noise is considered explicitly
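A minimal DBSCAN sketch. The neighbour search below is the naive O(n²) version; the O(n log n) total on this slide assumes a spatial index behind getNeighbours. Following the usual definition, the eps-neighbourhood includes the point itself:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// DBSCAN: grow a cluster from each unvisited core point; points whose
// eps-neighbourhood holds fewer than MinPts points become noise (or border
// points if a later cluster reaches them).
public class Dbscan {
    static final int NOISE = -1, UNVISITED = 0;

    static int[] dbscan(double[][] pts, double eps, int minPts) {
        int[] label = new int[pts.length]; // 0 = unvisited
        int cluster = 0;
        for (int i = 0; i < pts.length; i++) {
            if (label[i] != UNVISITED) continue;
            List<Integer> seeds = getNeighbours(pts, i, eps);
            if (seeds.size() < minPts) { label[i] = NOISE; continue; }
            label[i] = ++cluster;                          // new core point -> new cluster
            for (int k = 0; k < seeds.size(); k++) {       // expand the cluster
                int j = seeds.get(k);
                if (label[j] == NOISE) label[j] = cluster; // border point
                if (label[j] != UNVISITED) continue;
                label[j] = cluster;
                List<Integer> nb = getNeighbours(pts, j, eps);
                if (nb.size() >= minPts) seeds.addAll(nb); // j is core too
            }
        }
        return label;
    }

    static List<Integer> getNeighbours(double[][] pts, int i, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < pts.length; j++) {
            double dx = pts[i][0] - pts[j][0], dy = pts[i][1] - pts[j][1];
            if (Math.sqrt(dx * dx + dy * dy) <= eps) out.add(j);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] pts = { {0, 0}, {0, 0.1}, {0.1, 0}, {5, 5}, {5, 5.1}, {9, 9} };
        System.out.println(Arrays.toString(dbscan(pts, 0.5, 2))); // [1, 1, 1, 2, 2, -1]
    }
}
```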
DBSCAN result
[Map figures: Joensuu (29.76, 62.60); Helsinki (24, 60)]
Gaussian Mixture Model
Maximum likelihood estimation (Expectation-Maximization algorithm)
Parameters required: number of components; iteration number
Advantages: probabilistic (fuzzy) memberships
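An illustrative EM sketch for a one-dimensional two-component mixture (the location data here is 2D lon/lat, which would need covariance matrices; the data values and initialization below are made up):

```java
// EM for a 1D Gaussian mixture: alternate E-step (responsibilities) and
// M-step (weights, means, variances) for a fixed iteration number.
public class GmmEm {
    static double gauss(double x, double mu, double variance) {
        return Math.exp(-(x - mu) * (x - mu) / (2 * variance)) / Math.sqrt(2 * Math.PI * variance);
    }

    public static void main(String[] args) {
        double[] x = {0.9, 1.0, 1.1, 4.9, 5.0, 5.1};    // toy data, two clumps
        int K = 2, iters = 50;                          // components, iteration number
        double[] w = {0.5, 0.5}, mu = {0.0, 6.0}, variance = {1.0, 1.0};
        double[][] r = new double[x.length][K];         // responsibilities

        for (int it = 0; it < iters; it++) {
            // E-step: posterior probability of each component for each point
            for (int i = 0; i < x.length; i++) {
                double sum = 0;
                for (int k = 0; k < K; k++) { r[i][k] = w[k] * gauss(x[i], mu[k], variance[k]); sum += r[i][k]; }
                for (int k = 0; k < K; k++) r[i][k] /= sum;
            }
            // M-step: re-estimate parameters from the responsibilities
            for (int k = 0; k < K; k++) {
                double nk = 0, m = 0, v = 0;
                for (int i = 0; i < x.length; i++) { nk += r[i][k]; m += r[i][k] * x[i]; }
                m /= nk;
                for (int i = 0; i < x.length; i++) v += r[i][k] * (x[i] - m) * (x[i] - m);
                w[k] = nk / x.length; mu[k] = m; variance[k] = Math.max(v / nk, 1e-6);
            }
        }
        System.out.printf("means: %.2f, %.2f%n", mu[0], mu[1]); // ~1.00 and ~5.00
    }
}
```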
GMMs
[Map figures of the GMM result: Joensuu (29.76, 62.60); Helsinki (24, 60)]
My activity area
Thanks!