Comments from Pre-submission Presentation


Page 1: Comments from Pre-submission Presentation

Comments from Pre-submission Presentation

Q: Check why kNN is so much lower (about 10%) than SVM on the Reuters and 20 Newsgroups corpora.

A: Refer to the following four references: [Joachims 98], [Debole 03 STM], [Dumais 98 Inductive], [Yang 99 Re-examination].

Page 2: Comments from Pre-submission Presentation

[Joachims 98] [Debole 03] [Dumais 98] Results on the Reuters Corpus

                 Bayes   Rocchio  C4.5   kNN   SVM (linear)  SVM (poly)  SVM (rbf)
Micro-BEP (%)    69.84   79.14    77.78  82.5  84.2          86          86

                 kNN    SVM (linear)
Micro-F1         85.4   92.0

                 NBayes  DT    SVM (linear)
Micro-BEP        81.5    88.4  92.0

Page 3: Comments from Pre-submission Presentation

[Yang 99 Re-examination] Significance Test

Micro-level analysis (s-test)

SVM > kNN >> {LLSF, NNet} >> NB

Macro-level analysis

{SVM, kNN, LLSF} >> {NB, NNet}

Error-rate based comparison

{SVM, kNN} > LLSF > NNet >> NB
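
For context, the s-test here is usually described as a paired sign test over the per-decision correctness of two classifiers. A minimal sketch under that assumption (the function name, inputs, and the exact tail computation are illustrative, not from the slides):

    from math import comb

    def sign_test(decisions_a, decisions_b, gold):
        """Two-sided sign test on parallel lists of binary decisions."""
        n_a = sum(a == g != b for a, b, g in zip(decisions_a, decisions_b, gold))  # only A correct
        n_b = sum(b == g != a for a, b, g in zip(decisions_a, decisions_b, gold))  # only B correct
        n = n_a + n_b
        if n == 0:
            return 1.0                         # the two classifiers never disagree
        k = min(n_a, n_b)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * tail)              # two-sided binomial p-value with p = 0.5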

Page 4: Comments from Pre-submission Presentation

Comments from Pre-submission Presentation

2. Explain why BEP and F1 are used in Chapter 7.

- Add a reference.

Page 5: Comments from Pre-submission Presentation

Breakeven point (1)

BEP was first proposed by Lewis [1992]. He himself later pointed out that BEP is not a good effectiveness measure, because:

1. there may be no parameter setting that yields the breakeven; in this case the final BEP value, obtained by interpolation, is artificial;

2. having P = R is not necessarily desirable, and it is not clear that a system achieving a high BEP can be tuned to score high on other effectiveness measures.
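
For reference (standard definitions, not taken from the slides), precision, recall, F1 and the interpolated breakeven point can be written as:

    P = \frac{TP}{TP + FP}, \qquad
    R = \frac{TP}{TP + FN}, \qquad
    F_1 = \frac{2PR}{P + R}

    \mathrm{BEP} \approx \frac{P^{\ast} + R^{\ast}}{2},
    \quad \text{where } (P^{\ast}, R^{\ast}) \text{ is the closest precision/recall pair obtained while varying the decision threshold.}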

Page 6: Comments from Pre-submission Presentation

Breakeven point (2)

Yang [1999 Re-examination] also noted that, when for no parameter value P and R come close enough to each other, the interpolated breakeven may not be a reliable indicator of effectiveness.

Page 7: Comments from Pre-submission Presentation

Comments from Pre-submission Presentation

3. Adding more qualitative analysis would be better.

Page 8: Comments from Pre-submission Presentation

Analysis and Proposal: Empirical observation

Feature      Category 00_acq              Category 03_earn
             idf     rf      chi2         idf     rf      chi2
acquir       3.553   4.368   850.66       3.553   1.074    81.50
stake        4.201   2.975   303.94       4.201   1.082    31.26
payout       4.999   1       10.87        4.999   7.820    44.68
dividend     3.567   1.033   46.63        3.567   4.408   295.46

Comparison of the idf, rf and chi2 values of four features in two categories of the Reuters corpus
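
A minimal sketch of how such per-category statistics can be computed from document counts. It assumes the relevance-frequency form rf = log2(2 + a / max(1, c)) used in the supervised term-weighting literature and the standard one-degree-of-freedom chi-square statistic; the function name, the log base of idf, and the variable names are illustrative, not taken from the slides:

    import math

    def term_category_stats(a, b, c, d):
        """a: positive-category docs containing the term, b: positive docs without it,
        c: negative docs containing the term, d: negative docs without it."""
        n = a + b + c + d
        df = a + c                                 # document frequency over the collection
        idf = math.log(n / df)                     # inverse document frequency
        rf = math.log2(2 + a / max(1, c))          # relevance frequency (assumed form)
        # one-degree-of-freedom chi-square between term occurrence and category
        chi2 = n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))
        return idf, rf, chi2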

Page 9: Comments from Pre-submission Presentation

Comments from Pre-submission Presentation

4. In Chapter 7, remove Joachims' results; referring to them via a quotation is fine.

Page 10: Comments from Pre-submission Presentation

Comments from Pre-submission Presentation

5. Tone down “best” claims:

use “to our knowledge (experience, understanding)”.

Pay attention to this usage when giving presentations.

Page 11: Comments from Pre-submission Presentation

Introduction: Other Text Representations

• Word senses (meanings) [Kehagias 2001]

the same word assumes different meanings in different contexts

• Term clustering [Lewis 1992]

group words with a high degree of pairwise semantic relatedness

• Semantic and syntactic representation [Scott & Matwin 1999]

relationships between words, e.g. phrases, synonyms and hypernyms

Page 12: Comments from Pre-submission Presentation

Introduction: Other Text Representations

• Latent Semantic Indexing [Deerwester 1990]: a feature reconstruction technique

• Combination approach [Peng 2003]: combines two types of indexing terms, i.e. words and 3-grams

In general, these higher-level representations did not show good performance in most cases

Page 13: Comments from Pre-submission Presentation

Literature Review: Knowledge-based Representation

• Theme Topic Mixture Model (a graphical model) [Keller 2004]

• Using keywords from summarization [Li 2003]

Page 14: Comments from Pre-submission Presentation

Literature Review: 2. How to weight a term (feature)

[Salton 1988] elaborated three considerations:

1. term occurrences closely represent the content of a document

2. other factors with discriminating power help separate relevant documents from irrelevant ones

3. the effect of document length should be taken into account

Page 15: Comments from Pre-submission Presentation

Literature Review: 2. How to weight a term (feature)

1. Term Frequency Factor

Binary representation (1 for present and 0 for absent)

Term frequency (tf): number of times a term occurs in a document

log(tf): a log operation to scale down the effect of unfavorably high term frequencies

Inverse term frequency (ITF)
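
A small sketch of these term-frequency variants for a single term in a single document; the log form shown is the common 1 + log(tf) convention, which the slides do not fix explicitly, and ITF is left out because its exact definition is not given here:

    import math

    def tf_variants(tf):
        """tf: raw count of the term in the document."""
        binary = 1 if tf > 0 else 0                   # present / absent
        log_tf = 1 + math.log(tf) if tf > 0 else 0    # dampens very high term frequencies
        return binary, tf, log_tf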

Page 16: Comments from Pre-submission Presentation

Literature Review: 2. How to weight a term (feature)

2. Collection Frequency Factor

idf: the most commonly used factor

Probabilistic idf: also known as the term relevance weight

Feature selection metrics: chi^2, information gain, gain ratio, odds ratio, etc.

Page 17: Comments from Pre-submission Presentation

Literature Review: 2. How to weight a term (feature)

3. Normalization Factor

Combine the above two factors using multiplication.

To eliminate the length effect, we use cosine normalization to limit the term weights to the range (0, 1)
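
A minimal end-to-end sketch of the weighting scheme outlined on the last three slides: a term-frequency factor multiplied by a collection-frequency factor (plain idf here) and then cosine-normalized per document. The exact variants used in the thesis may differ; this is only an illustration.

    import math
    from collections import Counter

    def tfidf_cosine(docs):
        """docs: list of token lists; returns one {term: weight} dict per document."""
        n = len(docs)
        df = Counter(t for doc in docs for t in set(doc))        # document frequency
        idf = {t: math.log(n / df[t]) for t in df}               # collection-frequency factor

        weighted = []
        for doc in docs:
            tf = Counter(doc)
            w = {t: (1 + math.log(tf[t])) * idf[t] for t in tf}  # tf factor x idf factor
            norm = math.sqrt(sum(v * v for v in w.values()))     # cosine normalization
            weighted.append({t: (v / norm if norm else 0.0) for t, v in w.items()})
        return weighted

    # e.g. tfidf_cosine([["stake", "acquire", "acquire"], ["dividend", "payout"]])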