![Page 1: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/1.jpg)
TEXT
Mining & Retrieval
![Page 2: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/2.jpg)
![Page 3: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/3.jpg)
TEXT
INFORMATION
THOUGHTS
OPINIONS
FEELINGS
STORIES
DOCUMENTARIES
NEWS
LANGUAGES
EVERY DAY LIFE
SCIENCE
DISCUSSIONS
POLITICS
PERSONALITIES
SOCIAL CONNECTIONS
DISASTERS
COMMERCE
![Page 4: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/4.jpg)
![Page 5: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/5.jpg)
Text Mining / Retrieval
• RETRIEVAL: discovery of text relevant to an information need
• MINING: discovery of new information in text (or reformulating information already there)
• natural language processing
• computational linguistics
• data mining, statistics
![Page 6: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/6.jpg)
Text: Peculiarities
![Page 7: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/7.jpg)
Text: Peculiarities
• Unstructured
• Word dependencies (context, grammars)
• Different languages, styles
• Noisy (misspellings, typos, scanning errors…)
• Burdensome formatting (HTML, XML…)
• Humor, sarcasm, ambiguity, etc.
![Page 8: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/8.jpg)
Representing Text
• “Bag of words”, i.e. Vector Space Model
break the document into its constituent words and put them in a table
![Page 9: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/9.jpg)
Indexing for Retrieval
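Document Collection → Forward Index → Inverted Index

Forward Index

| Doc | Term |
|---|---|
| D1 | Apple, Pear, Pear |
| D2 | Cat, Dog |
| D3 | Cat, Cat, Tiger |
| … | … |

Inverted Index

| Term | Doc |
|---|---|
| Apple | D1 1 |
| Pear | D1 2 |
| Cat | D2 1, D3 2 |
| … | … |

Conceptually, a document is a vector of terms

| Apple | Pear | Cat | Tiger | … |
|---|---|---|---|---|
| 1 | 2 | 0 | 0 | … |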
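A minimal sketch of how the toy collection on this slide could be turned into a forward index, an inverted index, and a term vector. The function names are illustrative assumptions, not part of the slides.

```python
from collections import Counter, defaultdict

# Toy collection from this slide.
docs = {
    "D1": ["Apple", "Pear", "Pear"],
    "D2": ["Cat", "Dog"],
    "D3": ["Cat", "Cat", "Tiger"],
}

def build_forward_index(docs):
    """Map each document id to its term counts."""
    return {doc_id: Counter(terms) for doc_id, terms in docs.items()}

def build_inverted_index(forward):
    """Map each term to {doc_id: count}, i.e. its postings."""
    inverted = defaultdict(dict)
    for doc_id, counts in forward.items():
        for term, count in counts.items():
            inverted[term][doc_id] = count
    return dict(inverted)

def to_vector(counts, vocabulary):
    """Lay a document's counts out along a fixed vocabulary order."""
    return [counts.get(term, 0) for term in vocabulary]

forward = build_forward_index(docs)
inverted = build_inverted_index(forward)
vocabulary = ["Apple", "Pear", "Cat", "Tiger"]
d1_vector = to_vector(forward["D1"], vocabulary)
```

Looking up `inverted["Cat"]` gives the postings for "Cat" without scanning every document, which is the reason retrieval systems keep the inverted form.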
![Page 10: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/10.jpg)
Representing Text
• Preprocessing
  – Clean-up
    • remove formatting, tables, HTML…
  – Remove stopwords
    • the, of, to, a, in, and, that, for, is
  – Stem words
    • Porter Stemmer – heuristic
    • statistical, brute-force (lookup tables)
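The three preprocessing steps can be sketched as a small pipeline. The stopword list is the one on the slide; the suffix-stripping function is a deliberately crude stand-in and is not the Porter algorithm, which applies several ordered rule phases.

```python
import re

# Stopword list as given on the slide.
STOPWORDS = {"the", "of", "to", "a", "in", "and", "that", "for", "is"}

# Hypothetical toy suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def clean(text):
    """Replace HTML-like tags with spaces and lowercase the text."""
    return re.sub(r"<[^>]+>", " ", text).lower()

def tokenize(text):
    """Split cleaned text into runs of lowercase letters."""
    return re.findall(r"[a-z]+", text)

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Clean up, drop stopwords, then stem what remains."""
    tokens = tokenize(clean(text))
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

tokens = preprocess("<p>The cats jumped in the boxes</p>")
```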
![Page 11: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/11.jpg)
Representing Text
• Preserving some meaning of the words:
  – Part of Speech tagging
  – Word Sense Disambiguation
  – Semantic annotation
![Page 12: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/12.jpg)
EYE DROPS OFF SHELF
PROSTITUTES APPEAL TO POPE
KIDS MAKE NUTRITIOUS SNACKS
STOLEN PAINTING FOUND BY TREE
LUNG CANCER IN WOMEN MUSHROOMS
QUEEN MARY HAVING BOTTOM SCRAPED
DEALERS WILL HEAR CAR TALK AT NOON
MINERS REFUSE TO WORK AFTER DEATH
MILK DRINKERS ARE TURNING TO POWDER
DRUNK GETS NINE MONTHS IN VIOLIN CASE
JUVENILE COURT TO TRY SHOOTING DEFENDANT
COMPLAINTS ABOUT NBA REFEREES GROWING UGLY
PANDA MATING FAILS; VETERINARIAN TAKES OVER
MAN EATING PIRANHA MISTAKENLY SOLD AS PET FISH
ASTRONAUT TAKES BLAME FOR GAS IN SPACECRAFT
QUARTER OF A MILLION CHINESE LIVE ON WATER
INCLUDE YOUR CHILDREN WHEN BAKING COOKIES
OLD SCHOOL PILLARS ARE REPLACED BY ALUMNI
GRANDMOTHER OF EIGHT MAKES HOLE IN ONE
HOSPITALS ARE SUED BY 7 FOOT DOCTORS
LAWMEN FROM MEXICO BARBECUE GUESTS
TWO SOVIET SHIPS COLLIDE, ONE DIES
ENRAGED COW INJURES FARMER WITH AX
LACK OF BRAINS HINDERS RESEARCH
RED TAPE HOLDS UP NEW BRIDGE
SQUAD HELPS DOG BITE VICTIM
IRAQI HEAD SEEKS ARMS
HERSHEY BARS PROTEST
![Page 13: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/13.jpg)
Representing Text
• Vector Space Model:
$D = (t_1, w_{d1};\ t_2, w_{d2};\ \ldots;\ t_v, w_{dv})$
w: binary, count, TFIDF

| Apple | Pear | Cat | Tiger | … |
|---|---|---|---|---|
| 1 | 2 | 0 | 0 | … |
![Page 14: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/14.jpg)
TFIDF
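The slide's formula did not survive extraction, so the sketch below assumes one common TF-IDF variant: raw term frequency times log(N / df), where N is the number of documents and df is the number of documents containing the term. Other weightings (log-scaled tf, smoothed idf) are equally plausible readings.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return {doc_id: {term: weight}} using raw tf times log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

# Toy collection reused from the indexing slide.
docs = {
    "D1": ["Apple", "Pear", "Pear"],
    "D2": ["Cat", "Dog"],
    "D3": ["Cat", "Cat", "Tiger"],
}
weights = tfidf(docs)
```

In D3, "Tiger" occurs once and "Cat" twice, yet "Tiger" gets the larger weight under this variant because it appears in only one document.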
![Page 15: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/15.jpg)
![Page 16: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/16.jpg)
Problems
• Synonymy
  – multiple words that have similar meanings
• Polysemy
  – words that have more than one meaning
![Page 17: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/17.jpg)
Latent Semantic Indexing
• Index by the hidden “meaning” of text
“words that are used in the same contexts tend to have similar meanings”
• using Singular Value Decomposition
  – a linear algebra technique for factorization of matrices
![Page 18: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/18.jpg)
Latent Semantic Indexing
$X = R\,S\,T^{T}$
• distribution of terms for a concept (concept language model)
• distribution of concepts in a document
• importance of each concept (the diagonal of $S$)
![Page 19: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/19.jpg)
Latent Semantic Indexing
1. Index using concepts instead of terms
2. Query represented like another document
3. Retrieve documents “closest” to query
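Step 3 is often implemented with cosine similarity. The sketch below assumes the documents and the query have already been mapped into a two-concept space; the coordinates are made up for illustration and do not come from an actual SVD.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Return document ids ordered from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Hypothetical concept-space coordinates for three documents and a query.
doc_vecs = {"D1": [0.9, 0.1], "D2": [0.1, 0.8], "D3": [0.2, 0.9]}
query_vec = [0.0, 1.0]
ranking = rank(query_vec, doc_vecs)
```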
![Page 20: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/20.jpg)
Latent Semantic Analysis
• Document categorization (plagiarism)
• Comparing terms (synonymy)
• Works with any language
• Tolerant of noise (misspellings)
• Faults:
  – requires lots of memory
  – how many concepts should we use?
![Page 21: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/21.jpg)
Probabilistic Text Retrieval
[http://nlp.stanford.edu]
Language Model
Generative Model
![Page 22: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/22.jpg)
Probabilistic Text Retrieval
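using chain rule:
$P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_1 t_2)\,P(t_4 \mid t_1 t_2 t_3)$
unigram language model:
$P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2)\,P(t_3)\,P(t_4)$
bigram language model:
$P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2)\,P(t_4 \mid t_3)$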
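To make the difference concrete, here is a small sketch that estimates both models from counts on a toy sentence. It assumes plain maximum-likelihood estimates with no smoothing and non-empty input sequences; the function names are illustrative.

```python
from collections import Counter

def train(tokens):
    """Count unigrams and adjacent-pair bigrams in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def unigram_prob(seq, unigrams):
    """P(t1..tn) as a product of independent term probabilities."""
    total = sum(unigrams.values())
    p = 1.0
    for t in seq:
        p *= unigrams[t] / total
    return p

def bigram_prob(seq, unigrams, bigrams):
    """P(t1) times the product of P(t_i | t_{i-1}), estimated from counts."""
    total = sum(unigrams.values())
    p = unigrams[seq[0]] / total
    for prev, cur in zip(seq, seq[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

unigrams, bigrams = train("the cat sat on the mat".split())
p_uni = unigram_prob(["the", "cat"], unigrams)
p_bi = bigram_prob(["the", "cat"], unigrams, bigrams)
p_bi_reversed = bigram_prob(["cat", "the"], unigrams, bigrams)
```

The unigram model gives "the cat" and "cat the" the same probability, while the bigram model is sensitive to word order.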
![Page 23: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/23.jpg)
Probabilistic Text Retrieval
• Query likelihood model
Each document d has a language model $M_d$
$P(d \mid q) = \dfrac{P(q \mid d)\,P(d)}{P(q)}$
Naïve Bayes with each document as a class
$P(q \mid M_d) \approx \prod_{t \in V} P(t \mid M_d)^{\mathrm{tf}(t,q)}$
![Page 24: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/24.jpg)
Probabilistic Text Retrieval
• Estimating $P(t \mid M_d)$:
$P(t \mid M_d) = \dfrac{\mathrm{tf}(t,d)}{\mathrm{Length}_d}$
• Prior for terms not appearing in the document (smoothing):
$P'(t \mid M_d) = \dfrac{\mathrm{collectionFreq}(t)}{\mathrm{collectionSize}}$
![Page 25: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/25.jpg)
Probabilistic Text Retrieval
• In practice: a mixture of the document language model $M_d$ and the collection language model $M_c$
$P(t \mid d) = \lambda P(t \mid M_d) + (1 - \lambda)\,P(t \mid M_c)$
![Page 26: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/26.jpg)
Probabilistic Text Retrieval
• In summary:
$P(d \mid q) \propto P(d) \prod_{t \in q} \big(\lambda P(t \mid M_d) + (1 - \lambda)\,P(t \mid M_c)\big)$
• Rank the documents by $P(d \mid q)$
• Return the top few results
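The ranking rule in this summary can be sketched directly. The code assumes a uniform prior P(d), so the prior drops out of the ranking, non-empty documents, and λ = 0.5; the two documents and the query are illustrative only.

```python
from collections import Counter

def rank_by_query_likelihood(query, docs, lam=0.5):
    """Score each document by the product over query terms of
    lam * P(t|Md) + (1 - lam) * P(t|Mc), assuming a uniform prior P(d)."""
    collection = Counter()
    for terms in docs.values():
        collection.update(terms)
    collection_size = sum(collection.values())

    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = 1.0
        for t in query:
            p_doc = tf[t] / len(terms)                    # P(t|Md)
            p_coll = collection[t] / collection_size      # P(t|Mc)
            score *= lam * p_doc + (1 - lam) * p_coll
        scores[doc_id] = score
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores

docs = {
    "d1": "xyzzy reports a profit but revenue is down".split(),
    "d2": "quorus narrows quarter loss but revenue decreases further".split(),
}
ranking, scores = rank_by_query_likelihood(["revenue", "down"], docs)
```

Document d2 never mentions "down", but the collection term keeps its score above zero, which is the point of the mixture.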
![Page 27: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/27.jpg)
Extensions
• Latent Dirichlet Allocation
[Figure: a word w generated from topics T1–T5, general English (B), and a document-specific topic (Md)]
![Page 28: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/28.jpg)
Modeling Text (an aside)
• Generate your own Computer Science paper:
http://pdos.csail.mit.edu/scigen
![Page 29: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/29.jpg)
Text Retrieval
Information Need
Query
Text Collection
Search Results
Start Here
![Page 30: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/30.jpg)
Query Expansion
• Fixing spelling errors
• Stemming
• Alternative query suggestion
  – Query log mining
• Synonyms from a thesaurus
  – Medical terms: MeSH (Medical Subject Headings)
  – Manually or automatically created thesauruses
![Page 31: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/31.jpg)
Pseudo-relevance Feedback
1. Assume top retrieved documents are relevant OR ask user to rate returned documents
2. Extract important words from these documents
3. Append to the query
4. Try again
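Steps 1 to 3 can be sketched as a query-expansion function. It assumes "important words" simply means the most frequent terms not already in the query; real systems usually weight terms (for example by TF-IDF). All names and data below are hypothetical.

```python
from collections import Counter

def expand_query(query, ranked_docs, top_k=2, n_terms=2, stopwords=frozenset()):
    """Append the most frequent new terms from the top_k ranked documents."""
    counts = Counter()
    for terms in ranked_docs[:top_k]:
        counts.update(t for t in terms if t not in stopwords and t not in query)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    best = sorted(counts, key=lambda t: (-counts[t], t))[:n_terms]
    return list(query) + best

# Documents in the order a first retrieval pass might have returned them.
ranked_docs = [
    ["jaguar", "cat", "habitat", "cat"],
    ["jaguar", "cat", "prey"],
    ["jaguar", "car", "dealer"],
]
expanded = expand_query(["jaguar"], ranked_docs, top_k=2, n_terms=2)
```

Step 4 would then rerun retrieval with `expanded` in place of the original query.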
![Page 32: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/32.jpg)
Pseudo-relevance Feedback
• Rocchio algorithm for relevance feedback
$q_{opt} = \arg\max_{q}\,[\mathrm{sim}(q, C_r) - \mathrm{sim}(q, C_n)]$
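In practice the optimum is usually approximated by moving the query toward the centroid of the relevant documents and away from the centroid of the non-relevant ones. The sketch below uses that standard update; the weights alpha, beta, and gamma are assumed values, not taken from the slides.

```python
def centroid(vectors, dim):
    """Component-wise mean of a list of vectors (zero vector if empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)."""
    dim = len(query)
    c_r = centroid(relevant, dim)
    c_n = centroid(nonrelevant, dim)
    updated = [alpha * q + beta * r - gamma * n for q, r, n in zip(query, c_r, c_n)]
    # Negative term weights are commonly clipped to zero.
    return [max(0.0, w) for w in updated]

new_query = rocchio(
    query=[1.0, 0.0, 0.0],
    relevant=[[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
    nonrelevant=[[0.0, 0.0, 1.0]],
)
```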
![Page 33: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/33.jpg)
Retrieval Evaluation
• Want
  – Results that address my information need best
  – These results should be on top of the returned list
  – Diverse set of results to choose from
  – Timely?
• Relevance
  – What user says it is
![Page 34: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/34.jpg)
Retrieval Evaluation
![Page 35: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/35.jpg)
Retrieval Evaluation
![Page 36: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/36.jpg)
Retrieval Evaluation
• Mean Average Precision
[roughly the area under the Precision-Recall curve, averaged over queries]
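A short sketch of the computation, assuming binary relevance judgments: average precision is the mean of the precision values at each rank where a relevant document appears, and MAP averages that over queries. The runs below are made-up examples.

```python
def average_precision(ranked, relevant):
    """Mean of precision@rank over the ranks where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Average the per-query average precision over all (ranked, relevant) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = 1/2
]
map_score = mean_average_precision(runs)
```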
![Page 37: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/37.jpg)
Retrieval Evaluation
• In web search results users usually don’t look past the top 5 results
  – Use cutoff: Metric @ 5 or Metric @ 10
• Comparison between systems:
  – Control dataset, queries, relevance judgments
  – Text Retrieval Conference (TREC)
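As one instance of "Metric @ k", precision at a cutoff can be sketched as below. The ranking and the relevance judgments are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are judged relevant."""
    top = ranked[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k

ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d3", "d8"}
p_at_5 = precision_at_k(ranked, relevant, 5)
p_at_10 = precision_at_k(ranked, relevant, 10)
```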
![Page 38: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/38.jpg)
Beyond Retrieval
• Named entity recognition
• Summarization
• Template filling
• Text categorization
• Sentiment analysis
• Taxonomy extraction
• Hypothesis formation
• Social network extraction/analysis
![Page 39: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/39.jpg)
Text Categorization
• Spam detection
• News monitoring
• Faceted search
• Automated labeling
• Authorship attribution
![Page 40: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/40.jpg)
Text Categorization
• Classes already known:
  – Naïve Bayes
  – SVM
  – k-Nearest Neighbor
  – Neural Nets
• Discovering classes:
  – kNN Clustering
  – LSI
![Page 41: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/41.jpg)
Text Categorization
[Diagram with numbered steps 1–4: positive reviews and negative reviews → feature extraction → classifier training (models Mp and Mn) → classifying instances of unlabeled reviews]
• Who likes my product?
• What features do they like?
• Do people like my competitor’s product?
• What experiences do people have with my product?
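The review pipeline can be sketched with a multinomial Naïve Bayes classifier, one of the methods listed for known classes. The sketch assumes bag-of-words features and add-one smoothing; the training reviews are invented for illustration.

```python
import math
from collections import Counter

def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns class counts, term counts, vocabulary."""
    class_docs = Counter()
    term_counts = {}
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        term_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab

def classify(tokens, model):
    """Pick the label with the highest log prior plus smoothed log likelihood."""
    class_docs, term_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        counts = term_counts[label]
        total = sum(counts.values())
        score = math.log(class_docs[label] / n_docs)
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                score += math.log((counts[t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    (["great", "love", "it"], "positive"),
    (["love", "great", "sound"], "positive"),
    (["terrible", "broke", "fast"], "negative"),
    (["hate", "terrible", "battery"], "negative"),
]
model = train_nb(training)
label = classify(["love", "great", "battery"], model)
```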
![Page 42: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/42.jpg)
Text Categorization
![Page 43: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/43.jpg)
Text Summarization
• Information overload
• Article summaries
• CliffsNotes
• TV Guide
• Medical summary
• Document “preview”
![Page 44: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/44.jpg)
Text Summarization
• Extraction
  – copying most important parts of the document
• Abstraction
  – paraphrasing sections of the document
• Single document vs multiple documents
• Generic vs query-focused
![Page 45: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/45.jpg)
Text Summarization
• Finding important text:
  – position
  – cue phrase indicators
  – word/phrase frequency
  – query and title overlap
  – discourse structure criteria
  – formatting
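Two of these cues, word frequency and position, are enough for a toy extractive summarizer. The scoring below (average content-word frequency plus a small bonus for the lead sentence) is an assumed heuristic for illustration, not a method from the slides.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "to", "a", "in", "and", "that", "for", "is"}

def content_words(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(text, n_sentences=1, position_bonus=0.1):
    """Return the n highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))
    scored = []
    for i, sentence in enumerate(sentences):
        tokens = content_words(sentence)
        if not tokens:
            continue
        score = sum(freq[w] for w in tokens) / len(tokens)
        if i == 0:
            score += position_bonus  # position cue: favor the lead sentence
        scored.append((score, i))
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n_sentences]
    return [sentences[i] for _, i in sorted(top, key=lambda pair: pair[1])]

text = ("Oil spread across the gulf. The weather was mild. "
        "Crews raced to contain the oil in the gulf.")
summary = summarize(text, n_sentences=1)
```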
![Page 46: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/46.jpg)
Named Entity Extraction
People, locations, companies, events…
Alberto Maria Segre
Dr Segre
Professor Segre
alberto
AMS
A. M. Segre
![Page 47: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/47.jpg)
Named Entity Extraction
• Vocabulary matching
  – Problem: vocabulary transfer
• Rule-based
  – Regular expressions, rules of thumb
• Bootstrapping
  – Using “seed” NEs to find rules
• Machine learning
  – SVM, HMM, Decision Trees, Maximum Entropy…
[Nadeau & Sekine: Survey (2006)]
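As an illustration of the rule-based approach, the sketch below uses a single regular expression: a title followed by one or more capitalized words is taken to be a person name. The title list and the sample sentence are assumptions, and a real system would combine many such rules.

```python
import re

# Hypothetical rule: title, optional period, then capitalized words.
PERSON_AFTER_TITLE = re.compile(
    r"\b(?:Dr|Professor|CEO)\.?\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
)

def find_people(text):
    """Return the names that follow a known title."""
    return PERSON_AFTER_TITLE.findall(text)

text = "Yesterday Dr Segre met Professor Alberto Segre and BP Group CEO Tony Hayward."
names = find_people(text)
```

The rule misses untitled mentions such as a bare first name, which is one reason bootstrapping and machine learning are listed as alternatives.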
![Page 48: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/48.jpg)
Named Entity Extraction
• Features:
  – Case
  – Digit
  – Character
  – Punctuation
  – Morphology
  – Part-of-speech
  – Dictionary entry
  – Meta information
  – Corpus frequency
As Gulf spill spreads, blame game begins
When BP looks at the spreading oil slick in the Gulf of Mexico that now threatens flora, fauna and livelihoods along the coasts of Louisiana, Mississippi, Alabama and Florida, it's really seeing money floating away on the tide.

That's why it may be trying to shift some of the blame for the massive undersea leak to Transocean, which was running the rig that exploded on April 20 and eventually sank, leaving one of the worst oil spills in history in its wake.

"It wasn't our accident, but we are absolutely responsible for the oil, for cleaning it up, and that's what we intend to do," BP Group CEO Tony Hayward told NBC's "TODAY" show.
http://www.msnbc.msn.com/id/36917929/ns/business-us_business/
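A few of the surface features listed on this slide (case, digit, punctuation, a crude morphology cue) can be computed per token as sketched below. The feature names are hypothetical; features such as part-of-speech, dictionary entry, or corpus frequency would need external resources and are left out.

```python
import string

def token_features(token):
    """Surface features of a single token for a named-entity classifier."""
    return {
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_punctuation": any(ch in string.punctuation for ch in token),
        "suffix3": token[-3:].lower(),  # crude morphology cue
        "length": len(token),
    }

features = {tok: token_features(tok) for tok in ["BP", "Transocean", "20", "NBC's"]}
```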
![Page 49: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/49.jpg)
Entity Disambiguation
Entities can be referred to differently
Alberto Maria Segre
Dr Segre
Professor Segre
alberto
AMS
A. M. Segre
AI Professor
Masters students adviser at UI
![Page 50: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/50.jpg)
Entity Disambiguation
• Rules
  – Name use, emails, greetings, templates
• Outside sources
  – Wikipedia, ontologies, dictionaries…
• Entity profiles
  – Context
![Page 51: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/51.jpg)
Web Mining
• Peculiarities:
  – Linked structure
  – Multimedia
  – Spam
  – Huge dataset
  – Much used
  – Variety of topics
  – Variety of authors
![Page 52: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/52.jpg)
Web Crawling
![Page 53: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/53.jpg)
Web Crawling
• Selection policy: which web pages to crawl?
• Focused crawlers
  – Relevance to the query
• Exploratory crawlers
  – Depth-first, breadth-first, URL, anchor text, quality of in-links, number of in-links, PageRank
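One of the exploratory policies, breadth-first selection, can be sketched over an in-memory link graph. No pages are fetched here: the `links` dictionary is a hypothetical stand-in for the out-links a real crawler would extract from downloaded pages.

```python
from collections import deque

def crawl_bfs(seed, links, max_pages=10):
    """Visit pages breadth-first from the seed, up to max_pages."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in seen:  # never queue a page twice
                seen.add(target)
                frontier.append(target)
    return order

links = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": ["a"]}
order = crawl_bfs("a", links)
```

Replacing the queue with a stack gives depth-first order, and replacing it with a priority queue keyed on a score such as in-link count gives the other policies on the list.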
![Page 54: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/54.jpg)
Web Crawling
PageRank measures the importance of a page
• important pages point to other important pages
• number of times you visit a page on a random walk
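The random-walk view leads to a simple iterative computation. This sketch assumes a damping factor of 0.85, a fixed number of iterations, and that every link target also appears as a key in the graph; the graph itself is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeated redistribution of rank along links."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the random-jump share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links[page]
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
```

Page "c" collects links from three pages and ends up with the highest score, while "d" has no in-links and keeps only the random-jump share.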
![Page 55: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/55.jpg)
Web Crawling
• Other policy considerations
  – re-visit policy
  – politeness policy (robots.txt)
    • Robots Exclusion Protocol
  – parallelization policy
• Identify yourself as a bot
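A politeness check can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, the bot name, and the URLs are hypothetical, and the rules are parsed from a string so that nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

# Hypothetical bot name; a real crawler would also send it in the User-Agent header.
USER_AGENT = "ExampleCourseBot"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch(USER_AGENT, "http://example.com/index.html")
blocked = parser.can_fetch(USER_AGENT, "http://example.com/private/data.html")
delay = parser.crawl_delay(USER_AGENT)  # seconds to wait between requests
```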
![Page 56: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/56.jpg)
Web Graph Mining
• Authority (search results)
• Overview sites
• Social analysis
• Relationships between topics (site maps)
![Page 57: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS](https://reader038.vdocuments.net/reader038/viewer/2022102908/56649e4a5503460f94b3e5ae/html5/thumbnails/57.jpg)
Web Content Mining
• Sociology
• Epidemiology
• Marketing
• Disaster detection
• Finding people
• Finding information
• …