Information Retrieval Overview - Home :: Northeastern University
1
many slides courtesy James Allan @UMass Amherst
some slides courtesy ChengXiang Zhai @Urbana-Champaign
Information Retrieval Overview
2
INSTRUCTOR: Virgil Pavlu [email protected]
GRADER: if we get one, he/she will have final authority on homework grading issues
http://ccs.neu.edu/~jaa/ISU535.06X2
administrivia
3
overview
• what is information retrieval?
• how does it work?
• a simple retrieval model
4
what is IR?
5
[Figure: list of ten results with judgments R, N, R, R, R, R, N, N, N, N]
evaluation
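A rough sketch of how a judged list like the one on this slide could be scored. It assumes that R means relevant, N means non-relevant, and that the letters appear in rank order; none of this is stated on the slide. The function names are illustrative, and the average precision here assumes every relevant document appears somewhere in the list.

```python
def precision_at_k(judgments, k):
    """Fraction of the top-k results judged relevant ('R')."""
    top = judgments[:k]
    return sum(1 for j in top if j == "R") / k


def average_precision(judgments):
    """Mean of precision@k over the ranks k that hold a relevant result.

    Assumption: all relevant documents appear somewhere in the list.
    """
    hits = 0
    total = 0.0
    for rank, judgment in enumerate(judgments, start=1):
        if judgment == "R":
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Judgments read off the slide, assumed to be in rank order.
ranking = list("RNRRRRNNNN")
print(precision_at_k(ranking, 5))            # 0.8
print(round(average_precision(ranking), 2))  # 0.81
```

Both numbers reward rankings that place relevant items near the top; moving an N above an R lowers them.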
6
web search
7
web search
8
metasearch
9
user feedback
10
[Diagram labels: Access (select information); Mining (create knowledge); Organization (add structure/annotations)]
text management applications
11
[Diagram labels]
Text
Natural Language Content Analysis
Search, Filtering, Categorization, Summarization, Clustering, Extraction, Mining, Visualization
Retrieval Applications
Mining Applications
Information Access
Knowledge Acquisition
Information Organization
text management applications
12
A dog is chasing a boy on the playground
Det Noun Aux Verb Det Noun Prep Det Noun
Lexical analysis (part-of-speech tagging): the tags above
Syntactic analysis (parsing): Noun Phrase, Complex Verb, Noun Phrase, Noun Phrase; Prep Phrase; Verb Phrase; Verb Phrase; Sentence
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Inference: Scared(x) if Chasing(_,x,_). + the facts above gives Scared(b1)
Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back…
natural language processing
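A minimal sketch of the inference step on this slide, applying the rule Scared(x) if Chasing(_,x,_) to the facts from the semantic analysis. The tuple encoding of facts and the function name are assumptions made for illustration; the slide does not prescribe a representation.

```python
# Facts from the semantic analysis step, stored as (predicate, arguments) pairs.
facts = {
    ("Dog", ("d1",)),
    ("Boy", ("b1",)),
    ("Playground", ("p1",)),
    ("Chasing", ("d1", "b1", "p1")),
}


def infer_scared(facts):
    """Apply the rule Scared(x) if Chasing(_, x, _) once; return the derived facts."""
    derived = set()
    for predicate, args in facts:
        if predicate == "Chasing" and len(args) == 3:
            # The second argument of Chasing is the one being chased.
            derived.add(("Scared", (args[1],)))
    return derived


print(infer_scared(facts))  # {('Scared', ('b1',))}
```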
13
what we can do in NLP
A dog is chasing a boy on the playground
Det Noun Aux Verb Det Noun Prep Det Noun
Noun Phrase, Complex Verb, Noun Phrase, Noun Phrase; Prep Phrase; Verb Phrase; Verb Phrase; Sentence
POS tagging: 97%
Parsing: partial >90%(?)
Semantics: some aspects
- Entity/relation extraction
- Word sense disambiguation
- Anaphora resolution
Speech act analysis: ???
Inference: ???
14
… يَجِبُ عَلَى الإنْسَانِ أن يَكُونَ أمِيْنَاً وَصَادِقَاً مَعَ نَفْسِهِ وَمَعَ أَهْلِهِ وَجِيْرَانِهِ وَأَنْ يَبْذُلَ كُلَّ جُهْدٍ فِي إِعْلاءِ
شَأْنِ الوَطَنِ وَأَنْ يَعْمَلَ عَلَى مَا …
How can a computer make sense out of this string?
Arabic text
- What are the basic units of meaning (words)?
- What is the meaning of each word?
- How are words related with each other?
- What is the “combined meaning” of words?
- What is the “meta-meaning”? (speech act)
- Handling a large chunk of text
- Making sense of everything
Levels: Morphology, Syntax, Semantics, Pragmatics, Discourse, Inference
natural language processing
15
[Diagram: a User sends the query “robotics applications” to a Retrieval System over a database/collection of text docs; the results split into relevant docs (Robotics) and non-relevant docs (others)]
search (ad-hoc IR)
16
• Stable & long term interest, dynamic info source
• System must make a delivery decision immediately as a document “arrives”
[Diagram: arriving documents … pass through a Filtering System that holds “my interest”]
information filtering
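A hedged sketch of the delivery decision described on this slide: the interest is fixed, documents arrive one at a time, and each gets an immediate yes/no. The word-overlap score, the threshold value, and all names are assumptions chosen for illustration, not the method from the lecture.

```python
def tokenize(text):
    """Lowercase and split on whitespace; returns a set of words."""
    return set(text.lower().split())


def make_filter(interest, threshold=0.5):
    """Build a per-document delivery decision for a long-term interest."""
    profile = tokenize(interest)
    if not profile:
        raise ValueError("interest must contain at least one word")

    def deliver(document):
        # Share of the profile's words that occur in the arriving document.
        overlap = len(profile & tokenize(document)) / len(profile)
        return overlap >= threshold

    return deliver


deliver = make_filter("robotics applications")
stream = [
    "new robotics applications in surgery",
    "stock markets fall again",
]
for doc in stream:
    # Each decision is made as the document arrives, without seeing later ones.
    print(deliver(doc), doc)
```

Unlike ad-hoc search, nothing is ranked here: the system never sees the whole collection, so it has to commit to a threshold in advance.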
17
collaborative filtering
18
• Pre-given categories and labeled document examples (Categories may form hierarchy)
• Classify new documents
• A standard supervised learning problem
[Diagram: a Categorization System, trained on labeled examples for Sports, Business, Education, Science, …, assigns new documents to categories such as Sports, Business, Education]
categorization
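To make the supervised setup concrete, here is a small sketch that learns from labeled examples and classifies new text. Multinomial Naive Bayes with add-one smoothing is used only as a familiar stand-in; the slide does not name a learning method. The training sentences are invented toy data.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """labeled_docs: list of (text, category) pairs. Returns the model as a tuple."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in labeled_docs:
        words = text.lower().split()
        class_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(model, text):
    """Return the category with the highest smoothed log-probability."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_docs in class_counts.items():
        # log P(category) + sum of log P(word | category), add-one smoothed
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen in training
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Toy labeled examples (invented for this sketch).
training = [
    ("the team won the match", "Sports"),
    ("the coach praised the players", "Sports"),
    ("shares fell as the company cut profit forecasts", "Business"),
    ("the bank raised interest rates", "Business"),
]
model = train(training)
print(classify(model, "the players won"))       # Sports
print(classify(model, "company profit rates"))  # Business
```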
19
• Discover “natural structure”
• Group similar objects together
• Objects can be documents, terms, or passages
clustering
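A minimal sketch of grouping similar objects without any labels. It uses single-pass clustering with Jaccard word overlap; both the algorithm and the 0.3 threshold are assumptions picked for brevity, and the example documents are made up.

```python
def jaccard(a, b):
    """Overlap of two word sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(documents, threshold=0.3):
    """Single pass: each document joins the first cluster whose seed is similar enough."""
    word_sets = [set(d.lower().split()) for d in documents]
    clusters = []  # lists of document indices; the first index is the seed
    for i, words in enumerate(word_sets):
        for members in clusters:
            if jaccard(words, word_sets[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([i])
    return clusters


docs = [
    "dow falls 355 points",
    "fries cooked in new oil",
    "dow takes another beating falling 355 points",
    "new cooking oil for french fries",
]
print(cluster(docs))  # [[0, 2], [1, 3]]
```

The same function would work on terms or passages, as long as each object can be turned into a set of features.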
20
IR vs Databases
21
Information Retrieval
Databases
Library & Info Science
Machine Learning, Pattern Recognition, Data Mining
Natural Language Processing
Applications: Web, Bioinformatics…
Statistics, Optimization
Software engineering, Computer systems
Models
Algorithms
Applications
Systems
related areas
22
Info Retrieval: ACM SIGIR, ACM CIKM, TREC
Databases: ACM SIGMOD, VLDB, PODS, ICDE
Info. Science: ASIS, JCDL
Learning/Mining: ICML, NIPS, UAI, AAAI, ACM SIGKDD
NLP: ACL, COLING, EMNLP, ANLP, HLT
Applications: ISMB, RECOMB, PSB, WWW
Statistics: ??
Software/systems: ??
publications/societies
23
• Formal Models, Language Models, Fusion/Combination
• Text Representation and Indexing, XML and Metadata
• Performance, Compression, Scalability, Architectures, Mobile Applications
• Web IR, Intranet/Enterprise Search, Citation and Link Analysis, Digital Libraries, Distributed IR
• Cross-language Retrieval, Multilingual Retrieval, Machine Translation for IR
• Video and Image Access, Audio and Speech Retrieval, Music Retrieval
• Machine Learning for IR, Text Data Mining, Clustering, Text Categorization
• Topic Detection and Tracking, Content-Based Filtering, Collaborative Filtering, Agents
• Summarization, Question Answering, Natural Language Processing for IR, Information Extraction, Lexical Acquisition
• Interfaces, Visualization, Interactive IR, User Models, User Studies
• Specialized Applications of IR, including Genomic IR, IR in Software Engineering, and IR for Chemical Structures
SIGIR 2003 topics
24
overview
• what is information retrieval?
• how does it work?
• a simple retrieval model
25
• boolean
• geometric: vector space model
• probabilistic: language models
• statistical: bayesian networks
• graph-like: PageRank
basic approaches
26
• Much of IR depends upon the idea that similar vocabulary → relevant to the same queries
• Usually look for documents matching query words
• “Similar” can be measured in many ways
  – String matching/comparison
  – Same vocabulary used
  – Probability that documents arise from same model
  – Same meaning of text
relevant items are similar
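One way to measure "same vocabulary used" is the cosine between term-count vectors. This is a sketch under simple assumptions (whitespace tokenization, raw counts, no weighting); the function name is illustrative.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine of the angle between the two term-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


query = "robotics applications"
print(cosine(query, "applications of robotics in industry"))
print(cosine(query, "french fries with new cooking oil"))
```

The first document shares both query words and scores well above the second, which shares none and scores 0.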
27
• An effective and popular approach
• Compares words without regard to order
• Consider reordering words in a headline
  – Random: beating takes points falling another Dow 355
  – Alphabetical: 355 another beating Dow falling points
  – “Interesting”: Dow points beating falling 355 another
  – Actual: Dow takes another beating, falling 355 points
bag of words
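The headline example can be checked directly: once word order is thrown away, the actual headline and its random reordering are the same object. A minimal sketch, assuming lowercasing and punctuation stripping as the only normalization:

```python
import string
from collections import Counter


def bag_of_words(text):
    """Lowercase, strip punctuation, and count words; order is discarded."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


actual = "Dow takes another beating, falling 355 points"
shuffled = "beating takes points falling another Dow 355"
print(bag_of_words(actual) == bag_of_words(shuffled))  # True
```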
28
16 × said
14 × McDonalds
12 × fat
11 × fries
8 × new
6 × company french nutrition
5 × food oil percent reduce taste Tuesday
4 × amount change health Henstenburg make obesity
3 × acids consumer fatty polyunsaturated US
2 × amounts artery Beemer cholesterol clogging director down eat estimates expert fast formula impact initiative moderate plans restaurant saturated trans win
1 × … added addition adults advocate affect afternoon age Americans Asia battling beef bet brand Britt Brook Browns calorie center chain chemically … crispy customers cut … vegetable weapon weeks Wendys Wootan worldwide years York
what is this about ?
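A listing in this "count × words" shape can be produced with a few lines of counting. This is a sketch only: it drops apostrophes and splits on non-alphanumerics, and it does not remove common words, which the slide's listing appears to have done. The sample string is invented.

```python
import re
from collections import Counter, defaultdict


def frequency_listing(text):
    """Group words by how often they occur, most frequent group first."""
    # Drop apostrophes so "McDonald's" is counted as "mcdonalds".
    normalized = text.lower().replace("'", "").replace("’", "")
    counts = Counter(re.findall(r"[a-z0-9]+", normalized))
    by_count = defaultdict(list)
    for word, n in counts.items():
        by_count[n].append(word)
    return [(n, sorted(by_count[n])) for n in sorted(by_count, reverse=True)]


sample = "fat fries fat oil fries fat"
for n, words in frequency_listing(sample):
    print(f"{n} × {' '.join(words)}")
# 3 × fat
# 2 × fries
# 1 × oil
```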
29
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit.
Neither company could immediately be reached for comment....
the text
30
• Text representation
  – what makes a “good” representation?
  – how is a representation generated from text?
  – what are retrievable objects and how are they organized?
• Representing information needs
  – what is an appropriate query language?
  – how can interactive query formulation and refinement be supported?
• Comparing representations
  – what is a “good” model of retrieval?
  – how is uncertainty represented?
text representation
31
[Diagram: a concept map in which a general topic branches into Subtopic 1 ... Subtopic i ... Subtopic M, and hypertext connecting Doc 1, Doc 2, ... Doc N]
hypertext
32
[Diagram: a general topic with Subtopic 1 ... Subtopic i ... Subtopic M above Doc 1, Doc 2, ... Doc N; link types shown: Doc → Doc links, Term → Term links, Doc → Term links, Term → Doc links]
hypertext
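Two of the link types on this slide, Doc → Term and Term → Doc, can be built from the text alone; the Term → Doc direction is what is usually called an inverted index. A minimal sketch with invented documents and illustrative names:

```python
from collections import defaultdict


def build_links(docs):
    """docs: dict mapping doc id -> text. Returns (doc_to_terms, term_to_docs)."""
    doc_to_terms = {}
    term_to_docs = defaultdict(set)
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        doc_to_terms[doc_id] = terms        # Doc -> Term links
        for term in terms:
            term_to_docs[term].add(doc_id)  # Term -> Doc links (inverted index)
    return doc_to_terms, dict(term_to_docs)


# Toy documents (invented for this sketch).
docs = {
    "Doc 1": "robotics applications in industry",
    "Doc 2": "applications of text retrieval",
}
doc_to_terms, term_to_docs = build_links(docs)
print(sorted(term_to_docs["applications"]))  # ['Doc 1', 'Doc 2']
print(sorted(term_to_docs["robotics"]))      # ['Doc 1']
```

Doc → Doc and Term → Term links need extra information (explicit hyperlinks, or a similarity measure), so they are left out here.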
33
overview
• what is information retrieval?
• how does it work?
• a simple retrieval model
34
retrieval process
35
statistical language model
36
statistical language model
37
statistical language model
38
• Information Retrieval?
  – Indexing, retrieving, and organizing text by probabilistic or statistical techniques that reflect semantics without actually understanding
  – Search engines
• Core idea
  – Bag of words captures much of the “meaning”
  – Objects that use vocabulary the same way are related
• Statistical language model
  – Documents used to estimate a topic model
  – Query reflects a topic, too
  – Documents of topics that are likely to produce the query are most likely to be relevant
conclusion
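The last bullet group can be sketched as query-likelihood ranking: estimate a word distribution from each document, then rank documents by how likely that distribution is to produce the query. The smoothing scheme (interpolation with collection statistics, weight `lam`) is an assumption added so that unseen words do not zero out a score; the slides do not specify one. The documents are toy data.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | document model), smoothed with collection statistics.

    query, doc, collection are non-empty lists of words; lam is a hypothetical
    interpolation weight between document and collection estimates.
    """
    doc_counts, coll_counts = Counter(doc), Counter(collection)
    score = 0.0
    for word in query:
        p_doc = doc_counts[word] / len(doc)
        p_coll = coll_counts[word] / len(collection)
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0:
            return -math.inf  # word never occurs anywhere in the collection
        score += math.log(p)
    return score


docs = {
    "d1": "mcdonalds cuts fat in french fries".split(),
    "d2": "dow takes another beating falling 355 points".split(),
}
collection = [word for words in docs.values() for word in words]
query = "french fries".split()
ranked = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], collection),
    reverse=True,
)
print(ranked)  # ['d1', 'd2']
```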
39
http://ccs.neu.edu/~jaa/ISU535.06X2/schedulen.html
in this course