Issues in Text Similarity and Categorization

Jordan Smith – MUMT 611 – 27 March 2008


Outline

Why text?

Text categorization:
  Some sample problems
  Comparison to MIR
  Document indexing

Detailed example


Why text?

28.9% of MIR queries refer to lyric fragments (Bainbridge et al. 2003)

Easy to collect! (Knees et al. 2005, Geleijnse & Korst 2006)

Accurate ground truth (Logan et al. 2004)

Information about mood, “content”


Why text?

Potential applications:

  Genre, mood categorization (Maxwell 2007)
  Similarity searches (Mahadero et al. 2005)
  Hit-song prediction (Dhanaraj & Logan 2005)
  Musical document retrieval (Google)
  Accompany query-by-humming (Suzuki et al. 2007, Fujihara et al. 2006)


Some text categorization problems

  Indexing
  Document organization
  Filtering
  Web content hierarchy
  Language identification
  etc.


What is text categorization?

“Text categorization may be defined as the task of assigning a Boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$, where $D$ is a domain of documents and $C = \{c_1, \ldots, c_{|C|}\}$ is a set of pre-defined categories.”

(Sebastiani 2002)
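The definition above can be read as filling in a table with one Boolean per document/category pair. The following is a minimal sketch of that reading only; the toy documents, the category keyword sets, and the `classify` rule are all hypothetical illustrations, not anything taken from Sebastiani or from the systems discussed later.

```python
from itertools import product

# Hypothetical toy data: two documents (D) and two categories (C).
documents = {
    "d1": "love me tender love me true",
    "d2": "smoke on the water fire in the sky",
}
categories = {
    "love": {"love", "tender", "heart"},
    "fire": {"fire", "smoke", "burn"},
}

def classify(text, keywords):
    """Toy decision rule (an assumption): True if any keyword occurs in the text."""
    return not keywords.isdisjoint(text.split())

# One Boolean value for each pair <d_j, c_i> in D x C.
decisions = {
    (d, c): classify(text, categories[c])
    for (d, text), c in product(documents.items(), categories)
}

for pair, value in sorted(decisions.items()):
    print(pair, value)
```

A real categorizer replaces `classify` with a trained model, but the output keeps the same shape: |D| x |C| Boolean decisions.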


Text vs. music

Music classification:
  extract features
  train classifiers
  evaluate classifier

Text categorization:
  extract features
  train classifiers
  evaluate classifier

Same

Not the same
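The three stages listed above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the word-overlap "classifier", the function names, and the genre-labelled toy lyrics are all hypothetical, chosen to keep the example dependency-free rather than to reflect any system cited in these slides.

```python
from collections import Counter

def extract_features(text):
    """Text-specific stage: bag-of-words counts."""
    return Counter(text.lower().split())

def train(labelled_docs):
    """Sum the word counts of each category's training documents."""
    profiles = {}
    for text, label in labelled_docs:
        profiles.setdefault(label, Counter()).update(extract_features(text))
    return profiles

def predict(profiles, text):
    """Pick the category whose profile shares the most word occurrences."""
    features = extract_features(text)

    def overlap(label):
        return sum(min(n, profiles[label][w]) for w, n in features.items())

    # sorted() makes tie-breaking deterministic.
    return max(sorted(profiles), key=overlap)

def evaluate(profiles, labelled_docs):
    """Fraction of held-out documents assigned their true label."""
    hits = sum(predict(profiles, text) == label for text, label in labelled_docs)
    return hits / len(labelled_docs)

# Hypothetical toy corpus.
train_set = [
    ("love you baby", "pop"),
    ("my heart my love", "pop"),
    ("darkness and doom", "metal"),
    ("blood and doom", "metal"),
]
test_set = [("baby love", "pop"), ("doom and darkness", "metal")]

model = train(train_set)
accuracy = evaluate(model, test_set)
print(accuracy)
```

Only `extract_features` is specific to text; `train` and `evaluate` would look the same if the features were acoustic.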


Text feature extraction

Convert each document $d_j$ into a vector

$d_j = \langle w_{1j}, w_{2j}, \ldots, w_{|T|j} \rangle$

where $T$ is the set of terms $\{t_1, t_2, \ldots, t_{|T|}\}$.

Indexing systems differ in:
  Definition of set of terms
  Computation of weights


Indexing techniques

“Set of words” indexing

Terms: every word that occurs in the corpus
Weights: binary
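As a minimal sketch of set-of-words indexing: the term set is every distinct word in the corpus, and each weight is 1 or 0 depending on whether the term occurs in the document. Whitespace tokenization, lowercasing, and the function names are simplifying assumptions made for this illustration.

```python
def build_vocabulary(corpus):
    """Sorted list of every distinct word in the corpus (the term set T)."""
    return sorted({word for doc in corpus for word in doc.lower().split()})

def binary_vector(doc, vocabulary):
    """Weight w_kj is 1 if term t_k occurs in the document, else 0."""
    words = set(doc.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

# Hypothetical two-document corpus.
corpus = ["la la la love", "love hurts"]
vocab = build_vocabulary(corpus)
vectors = [binary_vector(doc, vocab) for doc in corpus]

print(vocab)
print(vectors)
```

Note that the repeated "la" leaves no trace in the binary vector; that lost information is what bag-of-words weighting keeps.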



“Bag of words” indexing

Terms: every word that occurs in the corpus
Weights: tf-idf

term frequency / inverse document frequency:

$$\text{tf-idf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k)}$$

$\#(t_k, d_j)$: frequency of term $t_k$ in document $d_j$
$\#_{Tr}(t_k)$: number of documents that $t_k$ occurs in
$|Tr|$: number of documents in the training set

Normalization:
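The normalization formula itself did not survive on this slide. The sketch below computes the tf-idf weight defined above and then, as an assumption, applies cosine (unit-length) normalization, which is one common choice; the function name, whitespace tokenization, and toy corpus are likewise illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(corpus, normalize=True):
    """Return (vocabulary, vectors) using tf-idf = #(t, d) * log(|Tr| / #Tr(t))."""
    docs = [doc.lower().split() for doc in corpus]
    vocabulary = sorted({w for doc in docs for w in doc})
    # #Tr(t_k): number of documents in which term t_k occurs.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    n_docs = len(docs)

    vectors = []
    for doc in docs:
        counts = Counter(doc)  # #(t_k, d_j)
        vec = [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocabulary]
        if normalize:
            # Assumed cosine normalization: divide by the vector's Euclidean length.
            norm = math.sqrt(sum(w * w for w in vec))
            if norm > 0:
                vec = [w / norm for w in vec]
        vectors.append(vec)
    return vocabulary, vectors

# Hypothetical two-document corpus.
corpus = ["la la la love", "love hurts"]
vocab, raw = tfidf_vectors(corpus, normalize=False)
_, unit = tfidf_vectors(corpus, normalize=True)
print(vocab)
print(raw)
print(unit)
```

In this toy corpus "love" occurs in every document, so its weight is log(2/2) = 0: a term that appears everywhere carries no discriminating information.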



Phrase indexing

Terms: all word sequences that occur in the corpus
Weights: binary, tf-idf
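A minimal sketch of how phrase terms might be enumerated for one document. The slide says "all word sequences"; capping the sequence length at a hypothetical `max_n` is an assumption made here to keep the term set manageable, and the function name is illustrative.

```python
def word_ngrams(text, max_n=2):
    """All contiguous word sequences of length 1..max_n in the text."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]

phrases = word_ngrams("hit me baby", max_n=2)
print(phrases)
```

The resulting phrases can then be weighted exactly like single words, with binary or tf-idf weights.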



“The Darmstadt Indexing Approach”

Terms: properties of the words, documents, categories
Weights: various


Feature reduction techniques

Remove function words (the, for, in, etc.)
Remove words that are least frequent:
  in each document
  in the corpus

Remainder:
  low and mid-range frequency words



Sebastiani 2002
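The two removal steps on the feature reduction slide can be sketched as a filter over the vocabulary. The stop list, the `min_doc_freq` threshold, and the function name are assumptions chosen for illustration; the slides do not specify particular values.

```python
from collections import Counter

# Assumed, deliberately tiny stop list; real lists are much longer.
FUNCTION_WORDS = {"the", "for", "in", "a", "of", "and"}

def reduce_terms(corpus, min_doc_freq=2):
    """Keep terms that are not function words and occur in enough documents."""
    docs = [set(doc.lower().split()) for doc in corpus]
    doc_freq = Counter(w for doc in docs for w in doc)
    return sorted(
        term
        for term, df in doc_freq.items()
        if term not in FUNCTION_WORDS and df >= min_doc_freq
    )

# Hypothetical three-document corpus.
corpus = [
    "love in the summer",
    "the summer of love",
    "dancing in the dark",
]
kept = reduce_terms(corpus)
print(kept)
```

Here "the", "in", and "of" are dropped as function words, and "dancing" and "dark" are dropped because each occurs in only one document.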



Latent Semantic Analysis (LSA):

Search: Demographic shifts in the U.S. with economic impact.

Result: The nation grew to 249.6 million people in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West.

Sebastiani 2002


A word on speech

“Expert” feature reduction:
  Rhymingness
  Iambicness of meter


Example: Hit song prediction

Goal:
  Measure some unknown, global, intrinsic property

Features:
  Acoustic: Mel-frequency cepstral coefficients
  Lyric: probabilistic latent semantic analysis

Classifiers:
  Support vector machines
  Boosting classifiers

Corpus:
  1700 #1 hits from 1956 to 2004

Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference on Music Information Retrieval, London UK. 488–91.


Example: Hit song detection

Results of PLSA:

The best features are contraindications: they signal that a song is not a hit.


Example: Genre classification

Logan, B., A. Kositsky, and P. Moreno. 2004. Semantic analysis of song lyrics. Proceedings of IEEE International Conference on Multimedia and Expo. 1–7.


References

Sebastiani, F. 1999. Machine learning in automated text categorization. Technical report, Consiglio Nazionale delle Ricerche. Pisa, Italy. 1–59.

Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference on Music Information Retrieval, London UK. 488–91.

Logan, B., A. Kositsky, and P. Moreno. 2004. Semantic analysis of song lyrics. Proceedings of IEEE International Conference on Multimedia and Expo. 1–7.

Mahadero, J., Á. Martínez, and P. Cano. 2005. Natural language processing of lyrics. Proceedings of the 13th Annual ACM International Conference on Multimedia. 475–8.

Maxwell, T. 2007. Exploring the music genome: Lyric clustering with heterogeneous features. M.Sc. Thesis. University of Edinburgh.


Query-by-asking
