.: Data Mining in Bioinformatics, Page 1
Data Mining in Bioinformatics, Day 4: Text Mining
Karsten Borgwardt
February 21 to March 4, 2011
Machine Learning & Computational Biology Research Group, MPIs Tübingen
What is text mining?
Definition
Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature.
Motivation
Most knowledge is stored as text, both in industry and in academia.
This alone makes text mining an integral part of knowledge discovery!
Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text.
What is text mining?
Common tasks
Information retrieval: find documents in a collection that are relevant to a user or to a query
Document ranking: rank all documents in the collection
Document selection: classify documents into relevant and irrelevant
Information filtering: search newly created documents for information that is relevant to a user
Document classification: assign a document to a category that describes its content
Keyword co-occurrence: find groups of keywords that co-occur in many documents
Evaluating text mining
Precision and recall
Let the set of documents that are relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}.
The precision is the percentage of retrieved documents that are relevant to the query:

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|   (1)

The recall is the percentage of relevant documents that were retrieved by the query:

recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|   (2)
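Equations (1) and (2) can be sketched directly with Python sets; the document IDs below are hypothetical toy data.

```python
# Minimal sketch of precision and recall for a document query,
# following Equations (1) and (2).
def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    """Fraction of relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant)

relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc2", "doc3", "doc5"}

print(precision(relevant, retrieved))  # 2 of 3 retrieved are relevant
print(recall(relevant, retrieved))     # 2 of 4 relevant were retrieved
```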
Text representation
Tokenization
The process of identifying keywords in a document.
Not all words in a text are relevant: text mining ignores stop words.
The stop words form the stop list.
Stop lists are context-dependent.
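A minimal tokenization sketch; the stop list below is a toy example (in practice it would be tuned to the context, as noted above).

```python
import re

# Toy, context-independent stop list for illustration only
STOP_LIST = {"the", "is", "of", "in", "a", "an", "and", "for"}

def tokenize(text, stop_list=STOP_LIST):
    """Lowercase the text, split on non-letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_list]

print(tokenize("The binding of the protein is studied in yeast"))
# ['binding', 'protein', 'studied', 'yeast']
```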
Text representation
Vector space model
Given #d documents and #t terms, model each document as a vector v in a #t-dimensional space.

Weighted term-frequency matrix
A matrix TF of size #d × #t whose entries measure the association of a term and a document:
If a term t does not occur in a document d, then TF(d, t) = 0.
If a term t does occur in a document d, then TF(d, t) > 0.
Text representation
If term t occurs in document d, then TF(d, t) can be defined as:

TF(d, t) = 1

TF(d, t) = freq(d, t), the frequency of t in d

TF(d, t) = freq(d, t) / ∑_{t′ ∈ T} freq(d, t′)

TF(d, t) = 1 + log(1 + log(freq(d, t)))
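The four weighting schemes above can be compared on a single toy document (the term list is hypothetical); note the dampened variant is only defined for terms that actually occur.

```python
import math
from collections import Counter

# One toy document as a list of tokens
doc = ["gene", "protein", "gene", "expression"]
counts = Counter(doc)          # raw frequencies freq(d, t)
total = sum(counts.values())   # sum of freq(d, t') over all terms t'

def tf_binary(t):  return 1 if counts[t] > 0 else 0
def tf_raw(t):     return counts[t]
def tf_norm(t):    return counts[t] / total
def tf_damped(t):  return 1 + math.log(1 + math.log(counts[t]))  # needs counts[t] > 0

print(tf_binary("gene"), tf_raw("gene"), tf_norm("gene"), round(tf_damped("gene"), 3))
```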
Text representation
Inverse document frequency
The IDF represents the scaling factor, or importance, of a term. A term that appears in many documents is scaled down:

IDF(t) = log( (1 + |d|) / |d_t| )   (3)

where |d| is the number of all documents and |d_t| is the number of documents containing term t.

TF-IDF measure
The product of term frequency and inverse document frequency:

TF-IDF(d, t) = TF(d, t) · IDF(t)   (4)
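Equations (3) and (4) combined, using raw term frequency as TF(d, t); the three-document corpus is a toy example.

```python
import math

# Toy corpus: each document is a list of tokens
corpus = [
    ["gene", "expression", "cancer"],
    ["gene", "protein"],
    ["protein", "folding"],
]

def idf(term):
    """IDF(t) = log((1 + |d|) / |d_t|), Equation (3)."""
    n_docs = len(corpus)
    n_with_term = sum(1 for doc in corpus if term in doc)
    return math.log((1 + n_docs) / n_with_term)

def tf_idf(doc, term):
    """TF-IDF(d, t) with raw frequency as TF, Equation (4)."""
    return doc.count(term) * idf(term)

print(round(tf_idf(corpus[0], "cancer"), 3))  # rare term: high weight
print(round(tf_idf(corpus[0], "gene"), 3))    # common term: scaled down
```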
Measuring similarity
Cosine measure
Let v1 and v2 be two document vectors. The cosine similarity is defined as

sim(v1, v2) = (v1^T v2) / (|v1| |v2|)   (5)
Kernels
Depending on how we represent a document, there are many kernels available for measuring the similarity of these representations:
Vectorial representation: vector kernels such as the linear, polynomial, and Gaussian RBF kernels
One long string: string kernels that count common k-mers in two strings (more on that later in the course)
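Equation (5) in a few lines of Python; the two term-frequency vectors are hypothetical.

```python
import math

def cosine(v1, v2):
    """Cosine similarity: (v1^T v2) / (|v1| |v2|), Equation (5)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Two toy document vectors over the same three terms
d1 = [2, 1, 0]
d2 = [1, 1, 1]
print(round(cosine(d1, d2), 3))
```

Note that cosine similarity only depends on the angle between the vectors, so it is insensitive to document length.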
Keyword co-occurrence
Problem
Find sets of keywords that often co-occur.
A common problem in the biomedical literature: find associations between genes, proteins or other entities using co-occurrence search.
Keyword co-occurrence search is an instance of a more general problem in data mining, called association rule mining.
Association rules
Definitions
Let I = {I1, I2, ..., Im} be a set of items (keywords).
Let D be the database of transactions T (the collection of documents).
A transaction T ∈ D is a set of items: T ⊆ I (a document is a set of keywords).
Let A be a set of items with A ⊆ T. An association rule is an implication of the form

A ⊆ T ⇒ B ⊆ T,   (6)

where A, B ⊆ I and A ∩ B = ∅.
Association rules
Support and confidence
The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B:

support(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D}|   (7)

The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B:

confidence(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D | A ⊆ T}|   (8)
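Equations (7) and (8) for a keyword rule A ⇒ B; the transaction database of keyword sets below is a toy example.

```python
# Toy database: each transaction is the keyword set of one document
transactions = [
    {"p53", "apoptosis", "cancer"},
    {"p53", "apoptosis"},
    {"p53", "kinase"},
    {"apoptosis", "cancer"},
]

def support(A, B):
    """Fraction of transactions containing both A and B, Equation (7)."""
    hits = sum(1 for T in transactions if A <= T and B <= T)
    return hits / len(transactions)

def confidence(A, B):
    """Among transactions containing A, fraction also containing B, Equation (8)."""
    hits_A = sum(1 for T in transactions if A <= T)
    hits_AB = sum(1 for T in transactions if A <= T and B <= T)
    return hits_AB / hits_A

print(support({"p53"}, {"apoptosis"}))     # 2 of 4 transactions
print(confidence({"p53"}, {"apoptosis"}))  # 2 of 3 p53 transactions
```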
Association rules
Strong rules
Rules that satisfy both a minimum support threshold (minsup) and a minimum confidence threshold (minconf) are called strong association rules, and these are the ones we are after!

Finding strong rules
1. Search for all frequent itemsets (sets of items that occur in at least minsup % of all transactions)
2. Generate strong association rules from the frequent itemsets
Association rules
Apriori algorithm
Makes use of the Apriori property: if an itemset A is frequent, then any subset B of A (B ⊆ A) is frequent as well. Conversely, if B is infrequent, then any superset A of B (A ⊇ B) is infrequent as well.

Steps
1. Determine the frequent items, i.e. the frequent k-itemsets with k = 1
2. Join all pairs of frequent k-itemsets that differ in at most 1 item; these are the candidates C_{k+1} for being frequent (k+1)-itemsets
3. Check the frequency of these candidates C_{k+1}: the frequent ones form the frequent (k+1)-itemsets (trick: immediately discard any candidate that contains an infrequent k-itemset)
4. Repeat from Step 2 until no more candidates are frequent
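The four steps above can be sketched compactly in Python; the transactions and the minsup value in the example are toy choices.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all itemsets occurring in at least a minsup fraction of transactions."""
    n = len(transactions)

    def frequent(candidates):
        return {s for s in candidates
                if sum(1 for T in transactions if s <= T) / n >= minsup}

    # Step 1: frequent 1-itemsets
    items = {frozenset([i]) for T in transactions for i in T}
    level = frequent(items)
    result = set(level)
    k = 1
    while level:
        # Step 2: join pairs of frequent k-itemsets differing in one item
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Step 3: Apriori pruning (all k-subsets must be frequent), then count
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = frequent(candidates)
        result |= level
        k += 1  # Step 4: repeat until no candidate survives
    return result

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(sorted(tuple(sorted(s)) for s in apriori(transactions, 0.5)))
```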
Transduction
Known test set
Classification on text databases often means that we know all the data we will work with before training.
Hence the test set is known a priori.
This setting is called 'transductive'.
Can we define classifiers that exploit the known test set? Yes!

Transductive SVM (Joachims, ICML 1999)
Trains the SVM on both the training and the test set.
Uses the test data to maximise the margin.
Inductive vs. transductive
Classification
Task: predict label y from features x

Classic inductive setting
Strategy: learn a classifier on (labelled) training data
Goal: the classifier shall generalise to unseen data from the same distribution

Transductive setting
Strategy: learn a classifier on (labelled) training data AND a given (unlabelled) test dataset
Goal: predict class labels for this particular dataset
Why transduction?
Really necessary?
The classic approach works: train on the training dataset, test on the test dataset.
That is what we usually do in practice, for instance in cross-validation.
We usually ignore or neglect the fact that these settings are transductive.

The benefits of transductive classification
Inductive setting: infinitely many potential classifiers
Transductive setting: finite number of equivalence classes of classifiers
f and f′ are in the same equivalence class ⇔ f and f′ classify points from the training and test dataset identically
Why transduction?
Idea of transductive SVMs
Risk on test data ≤ risk on training data + a confidence interval (which depends on the number of equivalence classes).
Theorem by Vapnik (1998): the larger the margin, the lower the number of equivalence classes that contain a classifier with this margin.
Hence: find a hyperplane that separates the classes in the training data AND in the test data with maximum margin.
Why transduction?
Transduction on text
Transductive SVM
Linearly separable case

min_{w, b, y*}  (1/2) ||w||²

s.t.  ∀ i = 1, ..., n:  y_i [w^T x_i + b] ≥ 1
      ∀ j = 1, ..., k:  y*_j [w^T x*_j + b] ≥ 1
Transductive SVM
Non-linearly separable case

min_{w, b, y*, ξ, ξ*}  (1/2) ||w||² + C ∑_{i=1}^{n} ξ_i + C* ∑_{j=1}^{k} ξ*_j

s.t.  ∀ i = 1, ..., n:  y_i [w^T x_i + b] ≥ 1 − ξ_i
      ∀ j = 1, ..., k:  y*_j [w^T x*_j + b] ≥ 1 − ξ*_j
      ∀ i = 1, ..., n:  ξ_i ≥ 0
      ∀ j = 1, ..., k:  ξ*_j ≥ 0
Transductive SVM
Optimisation
How to solve this optimisation problem?
Not so nice: it is a combination of an integer and a convex optimisation problem.
Joachims' approach: find an approximate solution by iterative application of the inductive SVM:
Train the inductive SVM on the training data, predict on the test data, and assign these labels to the test data.
Retrain on all data, with special slack weights (C*_−, C*_+) for the test data.
Outer loop: repeat and slowly increase (C*_−, C*_+).
Inner loop: within each repetition, repeatedly switch pairs of 'misclassified' test points.
This is a local search yielding an approximate solution to the optimisation problem.
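A highly simplified sketch of this iterative scheme, under loud assumptions: a plain least-squares linear classifier stands in for the inductive SVM, only the outer relabel-and-retrain loop is shown, and the slack weights C*_−, C*_+ and the pair-switching inner loop are omitted. The toy data are two Gaussian blobs.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    """Least-squares stand-in for training an inductive SVM (NOT Joachims' solver)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ w)

# Toy data: two labelled points, many unlabelled test points
X_train = np.array([[2.0, 2.0], [-2.0, -2.0]])
y_train = np.array([1.0, -1.0])
X_test = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])

w = fit_linear(X_train, y_train)                  # train on labelled data only
for _ in range(5):                                # simplified outer loop
    y_star = predict(w, X_test)                   # assign labels to test data
    w = fit_linear(np.vstack([X_train, X_test]),  # retrain on all data
                   np.concatenate([y_train, y_star]))

print(predict(w, np.array([[2.0, 2.0]])))
```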
Inductive SVM for TSVM
Variant of inductive SVM

min_{w, b, y*, ξ, ξ*}  (1/2) ||w||² + C ∑_{i=1}^{n} ξ_i + C*_− ∑_{j : y*_j = −1} ξ*_j + C*_+ ∑_{j : y*_j = +1} ξ*_j

s.t.  ∀ i = 1, ..., n:  y_i [w^T x_i + b] ≥ 1 − ξ_i
      ∀ j = 1, ..., k:  y*_j [w^T x*_j + b] ≥ 1 − ξ*_j

Three different penalty costs:
C for points from the training dataset
C*_− for points from the test dataset currently in class −1
C*_+ for points from the test dataset currently in class +1
Experiments
Average P/R-breakeven point on the Reuters dataset fordifferent training set sizes and a test size of 3,299
Experiments
Average P/R-breakeven point on the Reuters dataset for 17training documents and varying test set size for the TSVM
Experiments
Average P/R-breakeven point on the WebKB category’course’ for different training set sizes
Experiments
Average P/R-breakeven point on the WebKB category’project’ for different training set sizes
Summary
Results
A transductive version of the SVM
Maximises the margin on training and test data
The implementation uses a variant of the classic inductive SVM
The solution is approximate and fast
Works well on text, in particular for small training samples and large test sets
References and further reading
References
[1] T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. ICML, 1999: 200-209.
[2] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Elsevier, Morgan Kaufmann Publishers, 2006.
The end
See you tomorrow! Next topic: Graph Mining