Issues in Text Similarity and Categorization


Page 1: Issues in Text Similarity and Categorization

Issues in Text Similarity and Categorization

Jordan Smith – MUMT 611 – 27 March 2008

Page 2: Issues in Text Similarity and Categorization

Outline

- Why text?
- Text categorization:
  - Some sample problems
  - Comparison to MIR
  - Document indexing
- Detailed example

Page 3: Issues in Text Similarity and Categorization

Why text?

- 28.9% of MIR queries refer to lyric fragments (Bainbridge et al. 2003)
- Easy to collect! (Knees et al. 2005, Geleijnse & Korst 2006)
- Accurate ground truth (Logan et al. 2004)
- Information about mood, “content”

Page 4: Issues in Text Similarity and Categorization

Why text?

Potential applications:

- Genre, mood categorization (Maxwell 2007)
- Similarity searches (Mahadero et al. 2005)
- Hit-song prediction (Dhanaraj & Logan 2005)
- Musical document retrieval (Google)
- Accompany query-by-humming (Suzuki et al. 2007, Fujihara et al. 2006)

Page 5: Issues in Text Similarity and Categorization

Some text categorization problems

- Indexing
- Document organization
- Filtering
- Web content hierarchy
- Language identification
- etc.

Page 6: Issues in Text Similarity and Categorization

What is text categorization?

“Text categorization may be defined as the task of assigning a Boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$, where $D$ is a domain of documents and $C = \{c_1, \ldots, c_{|C|}\}$ is a set of pre-defined categories.”

(Sebastiani 2002)
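
The definition above can be read as a function that returns True or False for every document/category pair. The following is a minimal sketch of that reading, not a method from the talk: the toy documents, the category keyword sets, and the `categorize` rule are all hypothetical and stand in for a trained classifier.

```python
from itertools import product

# Hypothetical toy data: a small domain of documents D and a set of categories C.
documents = {"d1": "love me tender love me true", "d2": "highway to hell"}
categories = {"ballad": {"love", "tender"}, "rock": {"highway", "hell"}}


def categorize(doc_text, keywords):
    """Hypothetical decision rule: True if the document shares a word with the category."""
    return bool(set(doc_text.split()) & keywords)


# One Boolean value for each pair <d_j, c_i> in D x C.
decisions = {
    (d, c): categorize(text, categories[c])
    for (d, text), c in product(documents.items(), categories)
}

for pair, value in sorted(decisions.items()):
    print(pair, value)
```

Because each pair gets its own Boolean value, a document may belong to several categories or to none under this formulation.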

Page 7: Issues in Text Similarity and Categorization

Text vs. music

Music classification:

- extract features
- train classifiers
- evaluate classifier

Text categorization:

- extract features
- train classifiers
- evaluate classifier

Same

Not the same

Page 8: Issues in Text Similarity and Categorization

Text feature extraction

Convert each document $d_j$ into a vector

$d_j = \langle w_{1j}, w_{2j}, \ldots, w_{|T|j} \rangle$

where $T$ is the set of terms $\{t_1, t_2, \ldots, t_{|T|}\}$.

Different indexing systems:

- Definition of set of terms
- Computation of weights

Page 9: Issues in Text Similarity and Categorization

Indexing techniques

“Set of words” indexing

- Terms: every word that occurs in the corpus
- Weights: binary
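
As a rough illustration of set-of-words indexing, the sketch below builds the term set from a tiny corpus and converts each document into a binary vector. The corpus, the whitespace tokenization, and the function names `build_vocabulary` and `set_of_words_vector` are assumptions made for the example.

```python
def build_vocabulary(corpus):
    """Terms: every distinct word that occurs anywhere in the corpus."""
    return sorted({word for doc in corpus for word in doc.lower().split()})


def set_of_words_vector(doc, vocabulary):
    """Binary weights: 1 if the term occurs in the document, else 0."""
    present = set(doc.lower().split())
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical two-document corpus of lyric fragments.
corpus = ["love me do", "all you need is love"]
vocabulary = build_vocabulary(corpus)
vectors = [set_of_words_vector(doc, vocabulary) for doc in corpus]

print(vocabulary)
for vector in vectors:
    print(vector)
```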

Page 10: Issues in Text Similarity and Categorization

Indexing techniques

“Bag of words” indexing

- Terms: every word that occurs in the corpus
- Weights: tf-idf

term frequency / inverse document frequency:

$\text{tf-idf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k)}$

- $\#(t_k, d_j)$: frequency of term $t_k$ in document $d_j$
- $\#_{Tr}(t_k)$: number of documents that $t_k$ occurs in

Normalization:
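
The normalization formula itself is not preserved in this transcript. The sketch below assumes cosine normalization, which is the form Sebastiani describes for tf-idf weights: each weight is divided by the Euclidean length of the document's tf-idf vector. It also assumes that |Tr| is the number of documents in the training set, and that words are split on whitespace. The function name `tfidf_vectors` and the corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(corpus):
    """Return (vocabulary, vectors) with cosine-normalized tf-idf weights."""
    docs = [doc.lower().split() for doc in corpus]
    vocabulary = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)  # |Tr|, assumed to be the number of training documents
    # #Tr(t_k): number of documents in which each term occurs
    doc_freq = Counter(word for doc in docs for word in set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)  # #(t_k, d_j): term frequency in this document
        weights = [
            counts[term] * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ]
        # Assumed cosine normalization: divide by the vector's Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights))
        if norm > 0:
            weights = [w / norm for w in weights]
        vectors.append(weights)
    return vocabulary, vectors


corpus = ["love love me do", "all you need is love", "let it be"]
vocabulary, vectors = tfidf_vectors(corpus)
for doc, vector in zip(corpus, vectors):
    print(doc, [round(w, 3) for w in vector])
```

A term that occurs in every document gets weight zero, since log(|Tr| / #Tr(t_k)) = log 1 = 0, which is the intended effect of the inverse document frequency factor.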

Page 11: Issues in Text Similarity and Categorization

Indexing techniques

Phrase indexing

- Terms: all word sequences that occur in the corpus
- Weights: binary, tf-idf
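
One simple way to approximate phrase indexing is to treat every contiguous word sequence up to some maximum length as a term. The sketch below does this for a single lyric line; the length cap `max_len` and the function name `word_ngrams` are assumptions for the example, since the slide does not say how sequences are bounded.

```python
def word_ngrams(text, max_len=2):
    """List all contiguous word sequences of length 1..max_len in the text."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    ]


phrases = word_ngrams("all you need is love", max_len=2)
print(phrases)
```

The resulting phrases can then be weighted with binary or tf-idf weights in the same way as single words, at the cost of a much larger term set.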

Page 12: Issues in Text Similarity and Categorization

Indexing techniques

“The Darmstadt Indexing Approach”

- Terms: properties of the words, documents, categories
- Weights: various

Page 13: Issues in Text Similarity and Categorization

Feature reduction techniques

- Remove function words (the, for, in, etc.)
- Remove words that are least frequent:
  - in each document
  - in the corpus

Remainder: low and mid-range frequency words
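
The two reduction steps can be sketched as a filter over the corpus vocabulary. This is only an illustration: the stop-word list, the threshold `min_doc_freq`, and the function name `reduce_vocabulary` are hypothetical, and only corpus-level document frequency is handled here, not per-document frequency.

```python
from collections import Counter

# Hypothetical, deliberately tiny list of function words.
STOP_WORDS = {"the", "for", "in", "a", "of", "to"}


def reduce_vocabulary(corpus, min_doc_freq=2):
    """Drop function words and terms that occur in fewer than min_doc_freq documents."""
    docs = [set(doc.lower().split()) for doc in corpus]
    doc_freq = Counter(word for doc in docs for word in doc)
    return sorted(
        word
        for word, freq in doc_freq.items()
        if word not in STOP_WORDS and freq >= min_doc_freq
    )


corpus = [
    "the girl in the song",
    "a song for the girl",
    "dancing in the street",
]
kept = reduce_vocabulary(corpus)
print(kept)
```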

Page 14: Issues in Text Similarity and Categorization

Feature reduction techniques

Sebastiani 2002

Page 15: Issues in Text Similarity and Categorization

Feature reduction techniques

Latent Semantic Analysis (LSA):

Search: Demographic shifts in the U.S. with economic impact.

Result: The nation grew to 249.6 million people in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West.

Sebastiani 2002
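
For reference, LSA is commonly formulated as a truncated singular value decomposition of the term-by-document matrix. The notation below is an assumption and does not appear on the slide: $X$ is the $|T| \times |D|$ matrix whose columns are the document vectors $d_j$, $k$ is the number of retained dimensions, and $\hat{d}_j$ is the reduced representation of a document or query.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{d}_j = \Sigma_k^{-1} U_k^{\top} d_j
```

Queries are mapped into the same $k$-dimensional space and compared with documents there, which is how a result can be retrieved even when it shares few or no terms with the query.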

Page 16: Issues in Text Similarity and Categorization

A word on speech

“Expert” feature reduction:

- Rhymingness
- Iambicness of meter

Page 17: Issues in Text Similarity and Categorization

Example: Hit song prediction

Goal: Measure some unknown, global, intrinsic property

Features:

- Acoustic: Mel-Frequency Cepstral Coefficients
- Lyric: Probabilistic Latent Semantic Analysis

Classifiers:

- Support vector machines
- Boosting classifiers

Corpus: 1700 #1 hits from 1956 to 2004

Dhanaraj, R. and B. Logan. 2005. Automatic Prediction of Hit Songs. International Conference on Music Information Retrieval, London UK. 488-91.

Page 18: Issues in Text Similarity and Categorization

Example: Hit song detection

Results of PLSA:

Best features are for contraindication

Page 19: Issues in Text Similarity and Categorization

Example: Genre classification

Logan, B., A. Kositsky and P. Moreno. 2004. Semantic Analysis of Song Lyrics. Proceedings of IEEE International Conference on Multimedia and Expo. 1-7.

Page 20: Issues in Text Similarity and Categorization

References

Sebastiani, F. 1999. Machine learning in automated text categorization. Technical report, Consiglio Nazionale delle Ricerche. Pisa, Italy. 1–59.

Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference on Music Information Retrieval, London UK. 488–91.

Logan, B., A. Kositsky, and P. Moreno. 2004. Semantic analysis of song lyrics. Proceedings of IEEE International Conference on Multimedia and Expo. 1–7.

Mahadero, J., Á. Martínez, and P. Cano. 2005. Natural language processing of lyrics. Proceedings of the 13th Annual ACM International Conference on Multimedia. 475–8.

Maxwell, T. 2007. Exploring the music genome: Lyric clustering with heterogeneous features. M.Sc. Thesis. University of Edinburgh.

Page 21: Issues in Text Similarity and Categorization

Query-by-asking