.: Data Mining in Bioinformatics, Page 1
Data Mining in Bioinformatics, Day 4: Text Mining
Karsten Borgwardt
February 21 to March 4, 2011
Machine Learning & Computational Biology Research Group, MPIs Tübingen
What is text mining?
Definition
Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature.
Motivation
Most knowledge is stored as text, both in industry and in academia.
This alone makes text mining an integral part of knowledge discovery!
Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text.
What is text mining?
Common tasks
Information retrieval: find documents in a collection that are relevant to a user or to a query
Document ranking: rank all documents in the collection
Document selection: classify documents into relevant and irrelevant
Information filtering: search newly created documents for information that is relevant to a user
Document classification: assign a document to a category that describes its content
Keyword co-occurrence: find groups of keywords that co-occur in many documents
Evaluating text mining
Precision and recall
Let the set of documents that are relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}.
The precision is the percentage of retrieved documents that are relevant to the query:

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|   (1)

The recall is the percentage of relevant documents that were retrieved by the query:

recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|   (2)
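Equations (1) and (2) can be sketched directly with Python sets; the document IDs below are hypothetical toy data.

```python
# Minimal sketch of precision and recall for a document query,
# following Equations (1) and (2).
def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    """Fraction of relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant)

relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc2", "doc3", "doc5"}

print(precision(relevant, retrieved))  # 2 of 3 retrieved are relevant
print(recall(relevant, retrieved))     # 2 of 4 relevant were retrieved
```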
Text representation
Tokenization
The process of identifying keywords in a document.
Not all words in a text are relevant: text mining ignores stop words.
The stop words form the stop list.
Stop lists are context-dependent.
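A minimal tokenization sketch; the stop list below is a toy example (in practice it would be tuned to the context, as noted above).

```python
import re

# Toy, context-independent stop list for illustration only
STOP_LIST = {"the", "is", "of", "in", "a", "an", "and", "for"}

def tokenize(text, stop_list=STOP_LIST):
    """Lowercase the text, split on non-letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_list]

print(tokenize("The binding of the protein is studied in yeast"))
# ['binding', 'protein', 'studied', 'yeast']
```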
Text representation
Vector space model
Given #d documents and #t terms, model each document as a vector v in a #t-dimensional space.

Weighted term-frequency matrix
A matrix TF of size #d × #t whose entries measure the association of a term and a document:
If a term t does not occur in a document d, then TF(d, t) = 0.
If a term t does occur in a document d, then TF(d, t) > 0.
Text representation
If term t occurs in document d, then TF(d, t) can be defined as:

TF(d, t) = 1

TF(d, t) = freq(d, t), the frequency of t in d

TF(d, t) = freq(d, t) / ∑_{t′ ∈ T} freq(d, t′)

TF(d, t) = 1 + log(1 + log(freq(d, t)))
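The four weighting schemes above can be compared on a single toy document (the term list is hypothetical); note the dampened variant is only defined for terms that actually occur.

```python
import math
from collections import Counter

# One toy document as a list of tokens
doc = ["gene", "protein", "gene", "expression"]
counts = Counter(doc)          # raw frequencies freq(d, t)
total = sum(counts.values())   # sum of freq(d, t') over all terms t'

def tf_binary(t):  return 1 if counts[t] > 0 else 0
def tf_raw(t):     return counts[t]
def tf_norm(t):    return counts[t] / total
def tf_damped(t):  return 1 + math.log(1 + math.log(counts[t]))  # needs counts[t] > 0

print(tf_binary("gene"), tf_raw("gene"), tf_norm("gene"), round(tf_damped("gene"), 3))
```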
Text representation
Inverse document frequency
The IDF represents the scaling factor, or importance, of a term. A term that appears in many documents is scaled down:

IDF(t) = log( (1 + |d|) / |d_t| )   (3)

where |d| is the number of all documents and |d_t| is the number of documents containing term t.

TF-IDF measure
The product of term frequency and inverse document frequency:

TF-IDF(d, t) = TF(d, t) · IDF(t)   (4)
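Equations (3) and (4) combined, using raw term frequency as TF(d, t); the three-document corpus is a toy example.

```python
import math

# Toy corpus: each document is a list of tokens
corpus = [
    ["gene", "expression", "cancer"],
    ["gene", "protein"],
    ["protein", "folding"],
]

def idf(term):
    """IDF(t) = log((1 + |d|) / |d_t|), Equation (3)."""
    n_docs = len(corpus)
    n_with_term = sum(1 for doc in corpus if term in doc)
    return math.log((1 + n_docs) / n_with_term)

def tf_idf(doc, term):
    """TF-IDF(d, t) with raw frequency as TF, Equation (4)."""
    return doc.count(term) * idf(term)

print(round(tf_idf(corpus[0], "cancer"), 3))  # rare term: high weight
print(round(tf_idf(corpus[0], "gene"), 3))    # common term: scaled down
```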
Measuring similarity
Cosine measure
Let v1 and v2 be two document vectors. The cosine similarity is defined as

sim(v1, v2) = (v1^T v2) / (|v1| |v2|)   (5)
Kernels
Depending on how we represent a document, there are many kernels available for measuring the similarity of these representations:
Vectorial representation: vector kernels such as the linear, polynomial, and Gaussian RBF kernels
One long string: string kernels that count common k-mers in two strings (more on that later in the course)
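Equation (5) in a few lines of Python; the two term-frequency vectors are hypothetical.

```python
import math

def cosine(v1, v2):
    """Cosine similarity: (v1^T v2) / (|v1| |v2|), Equation (5)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Two toy document vectors over the same three terms
d1 = [2, 1, 0]
d2 = [1, 1, 1]
print(round(cosine(d1, d2), 3))
```

Note that cosine similarity only depends on the angle between the vectors, so it is insensitive to document length.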
Keyword co-occurrence
Problem
Find sets of keywords that often co-occur.
A common problem in the biomedical literature: find associations between genes, proteins or other entities using co-occurrence search.
Keyword co-occurrence search is an instance of a more general problem in data mining, called association rule mining.
Association rules
Definitions
Let I = {I1, I2, ..., Im} be a set of items (keywords).
Let D be the database of transactions T (the collection of documents).
A transaction T ∈ D is a set of items: T ⊆ I (a document is a set of keywords).
Let A be a set of items with A ⊆ T. An association rule is an implication of the form

A ⊆ T ⇒ B ⊆ T,   (6)

where A, B ⊆ I and A ∩ B = ∅.
Association rules
Support and confidence
The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B:

support(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D}|   (7)

The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B:

confidence(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D | A ⊆ T}|   (8)
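Equations (7) and (8) for a keyword rule A ⇒ B; the transaction database of keyword sets below is a toy example.

```python
# Toy database: each transaction is the keyword set of one document
transactions = [
    {"p53", "apoptosis", "cancer"},
    {"p53", "apoptosis"},
    {"p53", "kinase"},
    {"apoptosis", "cancer"},
]

def support(A, B):
    """Fraction of transactions containing both A and B, Equation (7)."""
    hits = sum(1 for T in transactions if A <= T and B <= T)
    return hits / len(transactions)

def confidence(A, B):
    """Among transactions containing A, fraction also containing B, Equation (8)."""
    hits_A = sum(1 for T in transactions if A <= T)
    hits_AB = sum(1 for T in transactions if A <= T and B <= T)
    return hits_AB / hits_A

print(support({"p53"}, {"apoptosis"}))     # 2 of 4 transactions
print(confidence({"p53"}, {"apoptosis"}))  # 2 of 3 p53 transactions
```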
Association rules
Strong rules
Rules that satisfy both a minimum support threshold (minsup) and a minimum confidence threshold (minconf) are called strong association rules, and these are the ones we are after!

Finding strong rules
1. Search for all frequent itemsets (sets of items that occur in at least minsup % of all transactions)
2. Generate strong association rules from the frequent itemsets
Association rules
Apriori algorithm
Makes use of the Apriori property: if an itemset A is frequent, then any subset B of A (B ⊆ A) is frequent as well. Conversely, if B is infrequent, then any superset A of B (A ⊇ B) is infrequent as well.

Steps
1. Determine the frequent items, i.e. the frequent k-itemsets with k = 1
2. Join all pairs of frequent k-itemsets that differ in at most 1 item; these are the candidates C_{k+1} for being frequent (k+1)-itemsets
3. Check the frequency of these candidates C_{k+1}: the frequent ones form the frequent (k+1)-itemsets (trick: immediately discard any candidate that contains an infrequent k-itemset)
4. Repeat from Step 2 until no more candidates are frequent
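The four steps above can be sketched compactly in Python; the transactions and the minsup value in the example are toy choices.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all itemsets occurring in at least a minsup fraction of transactions."""
    n = len(transactions)

    def frequent(candidates):
        return {s for s in candidates
                if sum(1 for T in transactions if s <= T) / n >= minsup}

    # Step 1: frequent 1-itemsets
    items = {frozenset([i]) for T in transactions for i in T}
    level = frequent(items)
    result = set(level)
    k = 1
    while level:
        # Step 2: join pairs of frequent k-itemsets differing in one item
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Step 3: Apriori pruning (all k-subsets must be frequent), then count
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = frequent(candidates)
        result |= level
        k += 1  # Step 4: repeat until no candidate survives
    return result

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(sorted(tuple(sorted(s)) for s in apriori(transactions, 0.5)))
```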
Transduction
Known test set
Classification on text databases often means that we know all the data we will work with before training.
Hence the test set is known a priori.
This setting is called 'transductive'.
Can we define classifiers that exploit the known test set? Yes!

Transductive SVM (Joachims, ICML 1999)
Trains the SVM on both the training and the test set.
Uses the test data to maximise the margin.
Inductive vs. transductive
Classification
Task: predict label y from features x

Classic inductive setting
Strategy: learn a classifier on (labelled) training data
Goal: the classifier shall generalise to unseen data from the same distribution

Transductive setting
Strategy: learn a classifier on (labelled) training data AND a given (unlabelled) test dataset
Goal: predict class labels for this particular dataset
Why transduction?
Really necessary?
The classic approach works: train on the training dataset, test on the test dataset.
That is what we usually do in practice, for instance in cross-validation.
We usually ignore or neglect the fact that these settings are transductive.

The benefits of transductive classification
Inductive setting: infinitely many potential classifiers
Transductive setting: finite number of equivalence classes of classifiers
f and f′ are in the same equivalence class ⇔ f and f′ classify points from the training and test dataset identically
Why transduction?
Idea of transductive SVMs
Risk on test data ≤ risk on training data + a confidence interval (which depends on the number of equivalence classes).
Theorem by Vapnik (1998): the larger the margin, the lower the number of equivalence classes that contain a classifier with this margin.
Hence: find a hyperplane that separates the classes in the training data AND in the test data with maximum margin.
Why transduction?
Transduction on text
Transductive SVM
Linearly separable case

min_{w, b, y*}  (1/2) ||w||²

s.t.  ∀ i = 1, ..., n:  y_i [w^T x_i + b] ≥ 1
      ∀ j = 1, ..., k:  y*_j [w^T x*_j + b] ≥ 1
Transductive SVM
Non-linearly separable case

min_{w, b, y*, ξ, ξ*}  (1/2) ||w||² + C ∑_{i=1}^{n} ξ_i + C* ∑_{j=1}^{k} ξ*_j

s.t.  ∀ i = 1, ..., n:  y_i [w^T x_i + b] ≥ 1 − ξ_i
      ∀ j = 1, ..., k:  y*_j [w^T x*_j + b] ≥ 1 − ξ*_j
      ∀ i = 1, ..., n:  ξ_i ≥ 0
      ∀ j = 1, ..., k:  ξ*_j ≥ 0
Transductive SVM
Optimisation
How to solve this optimisation problem?
Not so nice: it is a combination of an integer and a convex optimisation problem.
Joachims' approach: find an approximate solution by iterative application of the inductive SVM:
Train the inductive SVM on the training data, predict on the test data, and assign these labels to the test data.
Retrain on all data, with special slack weights (C*_−, C*_+) for the test data.
Outer loop: repeat and slowly increase (C*_−, C*_+).
Inner loop: within each repetition, repeatedly switch pairs of 'misclassified' test points.
This is a local search yielding an approximate solution to the optimisation problem.
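A highly simplified sketch of this iterative scheme, under loud assumptions: a plain least-squares linear classifier stands in for the inductive SVM, only the outer relabel-and-retrain loop is shown, and the slack weights C*_−, C*_+ and the pair-switching inner loop are omitted. The toy data are two Gaussian blobs.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    """Least-squares stand-in for training an inductive SVM (NOT Joachims' solver)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ w)

# Toy data: two labelled points, many unlabelled test points
X_train = np.array([[2.0, 2.0], [-2.0, -2.0]])
y_train = np.array([1.0, -1.0])
X_test = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])

w = fit_linear(X_train, y_train)                  # train on labelled data only
for _ in range(5):                                # simplified outer loop
    y_star = predict(w, X_test)                   # assign labels to test data
    w = fit_linear(np.vstack([X_train, X_test]),  # retrain on all data
                   np.concatenate([y_train, y_star]))

print(predict(w, np.array([[2.0, 2.0]])))
```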
Inductive SVM for TSVM
Variant of inductive SVM

min_{w, b, y*, ξ, ξ*}  (1/2) ||w||² + C ∑_{i=1}^{n} ξ_i + C*_− ∑_{j : y*_j = −1} ξ*_j + C*_+ ∑_{j : y*_j = +1} ξ*_j

s.t.  ∀ i = 1, ..., n:  y_i [w^T x_i + b] ≥ 1 − ξ_i
      ∀ j = 1, ..., k:  y*_j [w^T x*_j + b] ≥ 1 − ξ*_j

Three different penalty costs:
C for points from the training dataset
C*_− for points from the test dataset currently in class −1
C*_+ for points from the test dataset currently in class +1
Experiments
Average P/R-breakeven point on the Reuters dataset fordifferent training set sizes and a test size of 3,299
Experiments
Average P/R-breakeven point on the Reuters dataset for 17training documents and varying test set size for the TSVM
Experiments
Average P/R-breakeven point on the WebKB category’course’ for different training set sizes
Experiments
Average P/R-breakeven point on the WebKB category’project’ for different training set sizes
Summary
Results
A transductive version of the SVM
Maximises the margin on training and test data
The implementation uses a variant of the classic inductive SVM
The solution is approximate and fast
Works well on text, in particular for small training samples and large test sets
References and further reading
References
[1] T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. ICML, 1999: 200-209.
[2] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Elsevier, Morgan Kaufmann Publishers, 2006.
The end
See you tomorrow! Next topic: Graph Mining