An Introduction To Categorization
Soam Acharya, PhD, soamdev@yahoo.com, 1/15/2003
What is Categorization?
• {c1 … cm}: set of predefined categories
• {d1 … dn}: set of candidate documents
• Fill the decision matrix with values {0,1}: aij = 1 if dj is assigned to ci, 0 otherwise
• Categories are symbolic labels
      d1   …   …   dn
c1    a11  …   …   a1n
…     …    …   …   …
cm    am1  …   …   amn
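The decision matrix can be sketched in a few lines of Python; the categories, documents, and the toy keyword rule used to fill it are hypothetical stand-ins:

```python
# Minimal sketch of a categorization decision matrix (hypothetical data).
# Rows are categories c1..cm, columns are documents d1..dn; matrix[i][j]
# is 1 when document dj is assigned to category ci, else 0.
categories = ["sports", "finance"]          # c1 .. cm (made-up labels)
documents = ["doc1", "doc2", "doc3"]        # d1 .. dn

texts = {"doc1": "stock market rally", "doc2": "cup final score", "doc3": "bond yields"}
keywords = {"sports": {"cup", "score"}, "finance": {"stock", "bond", "yields"}}

# Fill the matrix with {0,1} decisions (here: a toy keyword-overlap rule).
matrix = [
    [1 if keywords[c] & set(texts[d].split()) else 0 for d in documents]
    for c in categories
]
print(matrix)  # [[0, 1, 0], [1, 0, 1]]
```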
Uses
• Document organization
• Document filtering
• Word sense disambiguation
• Web
  – Internet directories
  – Organization of search results
• Clustering
Categorization Techniques
• Knowledge systems
• Machine Learning
Knowledge Systems
• Manually build an expert system
  – Makes categorization judgments
  – Sequence of rules per category
  – If <boolean condition> then category
  – Example: if document contains “buena vista home entertainment” then document category is “Home Video”
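A knowledge-system classifier of this kind can be sketched as an ordered rule list; apart from the “buena vista home entertainment” rule from the slide, the second rule and the function name are hypothetical:

```python
# Sketch of a knowledge-system (rule-based) classifier: an ordered
# sequence of if-<boolean condition>-then-category rules.
rules = [
    (lambda doc: "buena vista home entertainment" in doc, "Home Video"),
    (lambda doc: "box office" in doc, "Movies"),  # made-up second rule
]

def categorize(doc: str) -> str:
    # Apply rules in order; the first matching condition decides.
    for condition, category in rules:
        if condition(doc.lower()):
            return category
    return "Uncategorized"

print(categorize("New from Buena Vista Home Entertainment ..."))  # Home Video
```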
UltraSeek Content Classification Engine
UltraSeek CCE
Knowledge System Issues
• Scalability
  – Build
  – Tune
• Requires Domain Experts
• Transferability
Machine Learning Approach
• Build a classifier for a category
  – Training set
  – Hierarchy of categories
• Submit candidate documents for automatic classification
• Effort goes into building the classifier, not into acquiring domain knowledge
Machine Learning Process
[Diagram: training-set documents and a taxonomy drawn from a DB feed document preprocessing, which feeds classifier training]
Training Set
• Initial corpus can be divided into:
  – Training set
  – Test set
• Role of workflow tools
Document Preprocessing
• Document Conversion:
  – Converts file formats (.doc, .ppt, .xls, .pdf, etc.) to text
• Tokenizing/Parsing:
  – Stemming
  – Document vectorization
• Dimension reduction
Document Vectorization
• Convert document text into “bag of words”
• Each document is a vector of n weighted terms
Example document vector (term, weight):
  Federal express  3
  Severe           3
  Mountain         2
  Exactly          1
  Simple           5
  Flight           2
  Y2000-Q3         1
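The bag-of-words step above can be sketched with the standard library; the sample sentence is made up:

```python
from collections import Counter

# Bag-of-words sketch: a document becomes a term -> count mapping,
# i.e. a sparse vector of n weighted terms (weights here are raw counts).
def vectorize(text: str) -> Counter:
    tokens = text.lower().split()  # trivial tokenizer; no stemming
    return Counter(tokens)

vec = vectorize("simple flight simple mountain flight simple")
print(vec["simple"], vec["flight"], vec["mountain"])  # 3 2 1
```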
Document Vectorization
• Use the tfidf function for term weighting
• tfidf values may be normalized
  – All vectors of equal length
  – Weights fall in [0,1]

tfidf(tk, dj) = #(tk, dj) · log( |Tr| / #(tk) )

where:
  #(tk, dj) = number of times tk occurs in dj
  #(tk) = number of training documents in which tk occurs at least once
  |Tr| = cardinality of the training set
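The slide's tfidf formula translates directly into code; the tiny training set below is hypothetical:

```python
import math

# tfidf sketch following the slide's formula:
#   tfidf(tk, dj) = #(tk, dj) * log(|Tr| / #(tk))
def tfidf(term, doc_tokens, training_docs):
    tf = doc_tokens.count(term)                       # #(tk, dj)
    df = sum(1 for d in training_docs if term in d)   # #(tk)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(training_docs) / df)     # |Tr| = len(training_docs)

train = [["flight", "delay"], ["flight", "food"], ["mountain", "hike"]]
print(tfidf("mountain", ["mountain", "mountain", "trail"], train))
# 2 * log(3) ~ 2.197: frequent in the doc, rare in the training set
```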
Dimension Reduction
• Reduce dimensionality of the vector space
• Why?
  – Reduce computational complexity
  – Address the “overfitting” problem
    • Overtuning the classifier
• How?
  – Feature selection
  – Feature extraction
Feature Selection
• Also known as “term space reduction”
• Remove “stop” words
• Identify the “best” words to be used in categorizing per topic
  – Document frequency of terms
    • Keep terms that occur in the highest number of documents
  – Other measures
    • Chi square
    • Information gain
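The document-frequency criterion can be sketched as follows; the stop list and sample documents are made up:

```python
from collections import Counter

# Feature selection by document frequency: after removing stop words,
# keep the k terms that occur in the highest number of documents.
STOP = {"the", "a", "of", "and"}  # tiny illustrative stop list

def select_features(docs, k):
    df = Counter()
    for doc in docs:
        df.update(set(doc) - STOP)  # count each term at most once per doc
    return [term for term, _ in df.most_common(k)]

docs = [["the", "flight", "delay"], ["a", "flight", "crew"], ["flight", "delay"]]
print(select_features(docs, 2))  # ['flight', 'delay']
```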
Feature Extraction
• Synthesize new features from existing features
• Term clustering
  – Use clusters/centroids instead of terms
  – Based on co-occurrence and co-absence
• Latent Semantic Indexing
  – Compresses vectors into a lower-dimensional space
Creating a Classifier
• Define a function, the Categorization Status Value (CSV), such that for a document d:
  – CSVi: D → [0,1]
  – Expresses confidence that d belongs in ci
• CSV may be computed as:
  – A Boolean
  – A probability
  – A vector distance
Creating a Classifier
• Define a threshold, thresh, such that if CSVi(d) > thresh(i), d is categorized under ci; otherwise, it is not
• CSV thresholding:
  – Fixed value across all categories
  – Varied per category
• Optimize via testing
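Per-category thresholding can be sketched in a few lines; the scores, category names, and threshold values below are hypothetical:

```python
# CSV thresholding sketch: per-category thresholds decide membership.
# csv_scores maps category -> CSVi(d) in [0,1]; thresh maps
# category -> its tuned cutoff (all values here are made up).
def assign_categories(csv_scores, thresh):
    return [c for c, score in csv_scores.items() if score > thresh[c]]

scores = {"Home Video": 0.82, "Movies": 0.40, "Music": 0.15}
thresholds = {"Home Video": 0.5, "Movies": 0.5, "Music": 0.5}
print(assign_categories(scores, thresholds))  # ['Home Video']
```

Varying the threshold per category (rather than fixing one value) is what the slide's "optimize via testing" step would tune on the test set.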
Naïve Bayes Classifier
P(ci | dj) = P(ci) · P(dj | ci) / P(dj)

• Probability that document dj belongs in category ci
• Training-set terms/weights present in dj are used to estimate the probability of dj belonging to ci
Naïve Bayes Classifier
If wkj is binary (0,1) and pki is short for P(wkx = 1 | ci), then after further derivation the original equation takes a log-odds form:

log [ P(ci | dj) / (1 − P(ci | dj)) ] =
    Σk wkj · log [ pki (1 − p̄ki) / ( p̄ki (1 − pki) ) ]
  + Σk log [ (1 − pki) / (1 − p̄ki) ]  +  log [ P(ci) / (1 − P(ci)) ]

where p̄ki = P(wkx = 1 | c̄i). The first sum depends on dj and can be used for the CSV; the remaining terms are constants for all docs.
Naïve Bayes Classifier
• Relies on the independence assumption: term occurrences are treated as independent given the category
• Feature selection can be counterproductive, since discarded terms may still carry evidence
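A minimal binary (Bernoulli) naive Bayes sketch, with Laplace smoothing added so no probability is zero; the two-category training data and vocabulary are hypothetical:

```python
import math

# Minimal Bernoulli naive Bayes sketch (binary term features wkj).
# train() estimates P(ci) and pki = P(term k present | ci) with
# Laplace smoothing; score() sums log-probabilities, relying on the
# independence assumption discussed above.
def train(docs_by_cat, vocab):
    n = sum(len(d) for d in docs_by_cat.values())
    model = {}
    for c, docs in docs_by_cat.items():
        prior = math.log(len(docs) / n)
        p = {t: (sum(t in d for d in docs) + 1) / (len(docs) + 2) for t in vocab}
        model[c] = (prior, p)
    return model

def score(model, doc, vocab):
    best, best_s = None, -math.inf
    for c, (prior, p) in model.items():
        s = prior + sum(math.log(p[t] if t in doc else 1 - p[t]) for t in vocab)
        if s > best_s:
            best, best_s = c, s
    return best

vocab = {"flight", "stock", "goal"}
data = {"travel": [{"flight"}, {"flight", "goal"}],
        "finance": [{"stock"}, {"stock", "flight"}]}
model = train(data, vocab)
print(score(model, {"stock"}, vocab))  # finance
```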
k-NN Classifier
• Compute closeness between candidate documents and category documents
CSVi(dj) = Σ over the k nearest training documents dz of  RSV(dj, dz) · azi

where:
  RSV(dj, dz) = similarity between dj and training-set document dz
  azi = confidence score indicating whether dz belongs to category ci
k-NN Classifier
• k nearest neighbors
  – Find the k nearest neighbors among all training documents and use their categories
  – k can also indicate the number of top-ranked training documents per category to compare against
• Similarity computation can be:
  – Inner product
  – Cosine coefficient
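A k-NN scorer along these lines, using the cosine coefficient as the similarity; the sparse vectors and categories in the training set are made up:

```python
import math

# k-NN sketch: rank training vectors by cosine similarity to the
# candidate, then let the k nearest vote with their similarity as
# the per-category confidence (CSV).
def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_csv(candidate, training, k):
    # training: list of (term-weight vector, category) pairs
    neighbours = sorted(training, key=lambda tc: -cosine(candidate, tc[0]))[:k]
    csv = {}
    for vec, cat in neighbours:
        csv[cat] = csv.get(cat, 0.0) + cosine(candidate, vec)
    return csv

train_set = [({"flight": 2, "delay": 1}, "travel"),
             ({"stock": 3}, "finance"),
             ({"flight": 1, "hotel": 2}, "travel")]
scores = knn_csv({"flight": 1}, train_set, 2)
print(max(scores, key=scores.get))  # travel
```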
Support Vector Machines
• A “decision surface” (hyperplane) that best separates the data points of two classes
• Support vectors are the training docs that best define the hyperplane
[Diagram: the optimal hyperplane separates the two classes with maximum margin]
Support Vector Machines
• Training process involves finding the support vectors
• Only care about support vectors in the training set, not other documents
Neural Networks
• Train a network to learn the mapping from input terms to categories
• One neural net per category
  – Too expensive
• One network overall
  – Perceptron approach, without a hidden layer
  – Three-layered network
Classifier Committees
• Combine multiple classifiers
• Majority voting
• Category specialization
• Mixed results
Classification Performance
• Category ranking evaluation
  – Recall = (categories found and correct) / (total categories correct)
  – Precision = (categories found and correct) / (total categories found)
• Micro- and macro-averaging over categories
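The micro/macro distinction can be sketched with hypothetical per-category counts:

```python
# Micro vs macro averaging over per-category contingency counts.
# Each entry: (found_and_correct, total_found, total_correct),
# i.e. TP, TP+FP, TP+FN for that category (numbers are made up:
# c1 is a common category, c2 a rare one).
counts = {"c1": (8, 10, 16), "c2": (1, 1, 4)}

def precision(tp, found, _correct): return tp / found
def recall(tp, _found, correct): return tp / correct

# Macro: average the per-category measures; each category counts equally.
macro_p = sum(precision(*v) for v in counts.values()) / len(counts)
macro_r = sum(recall(*v) for v in counts.values()) / len(counts)
# Micro: pool the raw counts first; common categories dominate.
micro_p = sum(v[0] for v in counts.values()) / sum(v[1] for v in counts.values())
micro_r = sum(v[0] for v in counts.values()) / sum(v[2] for v in counts.values())
print(macro_p, micro_p, macro_r, micro_r)
```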
Classification Performance
• Hard to evaluate conclusively
• Two studies– Yiming Yang, 1997– Yiming Yang and Xin Liu, 1999
• SVM, kNN >> Neural Net > Naïve Bayes
• Performance converges for common categories (with many training docs)
Computational Bottlenecks
• Quiver
  – # of topics
  – # of training documents
  – # of candidate documents
Categorization and the Internet
• Classification as a service
  – Standardizing vocabulary
  – Confidentiality
  – Performance
• Use of hypertext in categorization
  – Augment existing classifiers to take advantage
Hypertext and Categorization
• An already categorized document links to documents within same category
• Neighboring documents tend to be in a similar category
• Hierarchical nature of categories
• Metatags
Augmenting Classifiers
• Inject anchor text for a document into that document
  – Treat anchor text as separate terms
• Depends on dataset
• Mixed experimental results
• Links may be noisy
  – Ads
  – Navigation
Topics and the Web
• Topic distillation
  – Analysis of hyperlink graph structure
• Authorities: popular pages
• Hubs: pages that link to authorities
[Diagram: hubs pointing to authorities]
Topic Distillation
• Kleinberg’s HITS algorithm
• Start from an initial set of pages: the root set
  – Use this to create an expanded set
• Weight propagation phase
  – Each node has an authority score and a hub score
  – Alternate the two updates:
    • Authority = sum of the current hub weights of all nodes pointing to it
    • Hub = sum of the authority scores of all pages it points to
  – Normalize node scores and iterate until convergence
• Output is a set of hubs and authorities
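The weight-propagation phase translates directly into code; the four-node link graph below is hypothetical:

```python
import math

# Sketch of HITS weight propagation on a tiny made-up link graph:
# alternate the authority and hub updates, normalize, and iterate.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}  # node -> pages it links to

hub = {n: 1.0 for n in graph}
auth = {n: 1.0 for n in graph}
for _ in range(50):
    # authority score = sum of hub weights of all nodes pointing to it
    auth = {n: sum(hub[m] for m in graph if n in graph[m]) for n in graph}
    # hub score = sum of authority scores of all pages it points to
    hub = {n: sum(auth[t] for t in graph[n]) for n in graph}
    for d in (auth, hub):  # normalize both score vectors
        norm = math.sqrt(sum(v * v for v in d.values())) or 1.0
        for n in d:
            d[n] /= norm

# a1 has two in-links, so it is the top authority; h1 links to both
# authorities, so it is the top hub.
print(max(auth, key=auth.get), max(hub, key=hub.get))  # a1 h1
```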
Conclusion
• Why Classify?
• The Classification Process
• Various Classifiers
• Which ones are better?
• Other applications