TRANSCRIPT
Surfacing Information in Large Text Collections
Eugene Agichtein, Microsoft Research
Example: Angina treatments
Sources:
• PDR
• Web search results
• Structured databases (e.g., drug info, WHO drug adverse effects DB, etc.)
• Medical reference and literature (MedLine)
Example queries:
• guideline for unstable angina
• unstable angina management
• herbal treatment for angina pain
• medications for treating angina
• alternative treatment for angina pain
• treatment for angina
• angina treatments
Research Goal
Seamless, intuitive, efficient, and robust access to knowledge in unstructured sources
Some approaches:
• Retrieve the relevant documents or passages
• Question answering
• Construct domain-specific “verticals” (MedLine)
• Extract entities and relationships
• Network of relationships: Semantic Web
Semantic Relationships “Buried” in Unstructured Text
Web, newsgroups, web logs
Text databases (PubMed, CiteSeer, etc.)
Newspaper archives
• Corporate mergers, succession, location
• Terrorist attacks
(Message Understanding Conferences)
…A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris…
RecommendedTreatment
Drug      Condition
statins   recurrent myocardial infarction
statins   strokes
statins   unstable angina pectoris
What Structured Representation Can Do for You:
… allow precise and efficient querying
… allow returning answers instead of documents
… support powerful query constructs
… allow data integration with (structured) RDBMS
… provide useful content for the Semantic Web
(Large Text Collection → Structured Relation)
Challenges in Information Extraction
Portability
• Reduce effort to tune for new domains and tasks
• MUC systems: experts would take 8–12 weeks to tune
Scalability, Efficiency, Access
• Enable information extraction over large collections
• 1 sec/document × 5 billion docs ≈ 158 CPU years
Approach: learn from data (“bootstrapping”)
• Snowball: Partially Supervised Information Extraction
• Querying Large Text Databases for Efficient Information Extraction
The Snowball System: Overview
Snowball: Text Database → structured relation (with confidence scores)

Organization          Location        Conf
Microsoft             Redmond         1
IBM                   Armonk          1
Intel                 Santa Clara     1
AG Edwards            St Louis        0.9
Air Canada            Montreal        0.8
7th Level             Richardson      0.8
3Com Corp             Santa Clara     0.8
3DO                   Redwood City    0.7
3M                    Minneapolis     0.7
MacWorld              San Francisco   0.7
157th Street          Manhattan       0.52
15th Party Congress   China           0.3
15th Century Europe   Dark Ages       0.1
…
Snowball: Getting User Input
User input:
• a handful of example instances
• integrity constraints on the relation (e.g., Organization is a “key”, Age > 0, etc.)
Extraction loop:
1. Get examples
2. Find example occurrences in text
3. Tag entities
4. Generate extraction patterns
5. Extract tuples
6. Evaluate tuples
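The loop above can be sketched end-to-end in a few functions. This is an illustrative toy, not the actual Snowball implementation: the helper names, the tiny corpus, and the crude "middle context as pattern" heuristic are all assumptions for the sketch.

```python
import re

def find_occurrences(corpus, seeds):
    """Find sentences mentioning both attributes of a seed tuple."""
    occurrences = []
    for sentence in corpus:
        for org, loc in seeds:
            if org in sentence and loc in sentence:
                occurrences.append((sentence, org, loc))
    return occurrences

def generate_patterns(occurrences):
    """Turn each occurrence into a crude context pattern: the text
    between the two entities, e.g. "'s headquarters in"."""
    patterns = set()
    for sentence, org, loc in occurrences:
        i, j = sentence.find(org), sentence.find(loc)
        if -1 < i < j:
            middle = sentence[i + len(org):j].strip()
            if 0 < len(middle) <= 30:
                patterns.add(middle)
    return patterns

def extract_tuples(corpus, patterns):
    """Apply the patterns to find new (organization, location) candidates."""
    tuples = set()
    for sentence in corpus:
        for p in patterns:
            m = re.search(r"(\w[\w ]*?)\s*" + re.escape(p) + r"\s*([A-Z]\w+)",
                          sentence)
            if m:
                tuples.add((m.group(1).strip(), m.group(2)))
    return tuples

# Toy corpus and one seed tuple (invented for illustration):
corpus = [
    "Microsoft's headquarters in Redmond will expand.",
    "IBM's headquarters in Armonk announced earnings.",
    "Intel's headquarters in Santa Clara hosted the event.",
]
seeds = {("Microsoft", "Redmond")}
patterns = generate_patterns(find_occurrences(corpus, seeds))
new_tuples = extract_tuples(corpus, patterns)   # bootstraps ("IBM", "Armonk")
```

In the real system the extracted tuples would then be evaluated and fed back as new seeds for the next iteration.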
ACM DL 2000
Organization   Headquarters
Microsoft      Redmond
IBM            Armonk
Intel          Santa Clara
Evaluating Patterns and Tuples: Expectation Maximization
EM-Spy algorithm:
• “Hide” labels for some seed tuples
• Iterate the EM algorithm to convergence on tuple/pattern confidence values
• Set threshold t such that 90% of spy tuples have confidence above t
• Re-initialize Snowball using the new seed tuples
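The spy-based thresholding step can be sketched as follows; `spy_threshold` and the sample confidence values are hypothetical, and the 90% cutoff is the one from the slide.

```python
def spy_threshold(spy_scores, keep_fraction=0.9):
    """Return threshold t such that `keep_fraction` of the hidden
    ("spy") seed tuples score at or above t."""
    ranked = sorted(spy_scores, reverse=True)
    cutoff_index = int(len(ranked) * keep_fraction) - 1
    return ranked[max(cutoff_index, 0)]

# Confidence values EM assigned to the hidden seed ("spy") tuples
# (made-up numbers for illustration):
spies = [1.0, 0.9, 0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.52]
t = spy_threshold(spies)                 # 90% of spies score >= t
accepted = [s for s in spies if s >= t]  # tuples kept as new seeds
```

Since the spies are known-good tuples, a threshold that keeps most of them is likely to keep most other good tuples while rejecting low-confidence noise.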
Organization          Headquarters    Initial   Final
Microsoft             Redmond         1         1
IBM                   Armonk          1         0.8
Intel                 Santa Clara     1         0.9
AG Edwards            St Louis        0         0.9
Air Canada            Montreal        0         0.8
7th Level             Richardson      0         0.8
3Com Corp             Santa Clara     0         0.8
3DO                   Redwood City    0         0.7
3M                    Minneapolis     0         0.7
MacWorld              San Francisco   0         0.7
157th Street          Manhattan       0         0.52
15th Party Congress   China           0         0.3
15th Century Europe   Dark Ages       0         0.1
…
Adapting Snowball for New Relations
Large parameter space:
• Initial seed tuples (randomly chosen, multiple runs)
• Acceptor features: words, stems, n-grams, phrases, punctuation, POS
• Feature selection techniques: OR, NB, Freq, “support”, combinations
• Feature weights: TF*IDF, TF, TF*NB, NB
• Pattern evaluation strategies: NN, Constraint violation, EM, EM-Spy
Automatically estimate parameter values:
• Estimate operating parameters based on occurrences of seed tuples
• Run cross-validation on hold-out sets of seed tuples for optimal performance
• Seed occurrences that do not have close “neighbors” are discarded
Example Task 1: DiseaseOutbreaks
Proteus: 0.409
Snowball: 0.415
SDM 2006
Example Task 2: Bioinformatics
100,000+ gene and protein synonyms extracted from 50,000+ journal articles
Approximately 40% of confirmed synonyms were not previously listed in the curated authoritative reference (SWISSPROT)
ISMB 2003
“APO-1, also known as DR6…”“MEK4, also called SEK1…”
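A minimal sketch of how appositive cues like the ones above yield synonym pairs; the single regex here is an assumption that stands in for the many patterns a real extraction system would learn.

```python
import re

# Match cue phrases of the form "X, also known as Y" / "X, also called Y",
# taken from the slide's examples; entity names may contain hyphens.
CUE = re.compile(r"(\w[\w-]*),\s*also\s+(?:known\s+as|called)\s+(\w[\w-]*)")

def synonyms(text):
    """Return (name, synonym) pairs found via the appositive cue."""
    return [(m.group(1), m.group(2)) for m in CUE.finditer(text)]

pairs = synonyms("APO-1, also known as DR6. MEK4, also called SEK1.")
# pairs == [("APO-1", "DR6"), ("MEK4", "SEK1")]
```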
Snowball Used in Various Domains
News: NYT, WSJ, AP [DL’00, SDM’06]
• CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
Medical literature: PDRHealth, Micromedex, … [Ph.D. Thesis]
• AdverseEffects, DrugInteractions, RecommendedTreatments
Biological literature: GeneWays corpus [ISMB’03]
• Gene and Protein Synonyms
Limits of Bootstrapping for Extraction
Task is “easy” when context term distributions diverge from the background
Quantify divergence as relative entropy (Kullback-Leibler divergence)
After calibration, the metric predicts whether bootstrapping is likely to work
[Figure: term frequencies of background words (“the”, “to”, “and”) vs. context words (“said”, “’s”, “company”, “mrs”, “won”, “president”)]

KL(LM_C || LM_BG) = Σ_{w ∈ V} LM_C(w) log( LM_C(w) / LM_BG(w) )
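The metric can be computed directly from the two language models. The toy distributions below are made up for illustration; a real context model would be estimated from the terms surrounding seed-tuple occurrences.

```python
import math

def kl_divergence(lm_c, lm_bg):
    """KL(LM_C || LM_BG) = sum over w of LM_C(w) * log(LM_C(w) / LM_BG(w))."""
    return sum(p * math.log(p / lm_bg[w])
               for w, p in lm_c.items() if p > 0)

# Toy unigram distributions over a tiny shared vocabulary (invented):
lm_context = {"outbreak": 0.5, "epidemic": 0.3, "the": 0.2}
lm_background = {"outbreak": 0.01, "epidemic": 0.01, "the": 0.98}

score = kl_divergence(lm_context, lm_background)
# A large score means the relation's context words are distinctive,
# i.e., bootstrapping is more likely to work for this task.
```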
CIKM 2005
President George W Bush’s three-day visit to India
Extracting All Relation Instances From a Text Database
Brute-force approach: feed all documents to the information extraction system (expensive for large collections)
• Often, only a tiny fraction of documents are useful
• Many databases are not crawlable
• Often a search interface is available, with an existing keyword index
How to identify “useful” documents?

(Information Extraction System: Text Database → Structured Relation)
Accessing Text DBs via Search Engines
(Text Database → Search Engine → Information Extraction System → Structured Relation)
Search engines impose limitations:
• Limit on documents retrieved per query
• Support simple keywords and phrases
• Ignore “stopwords” (e.g., “a”, “is”)
Text-Centric Task I: Information Extraction
Information extraction applications extract structured relations from unstructured text
May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…
Date        Disease Name      Location
Jan. 1995   Malaria           Ethiopia
July 1995   Mad Cow Disease   U.K.
Feb. 1995   Pneumonia         U.S.
May 1995    Ebola             Zaire
Information Extraction System
(e.g., NYU’s Proteus)
Disease Outbreaks in The New York Times
Information Extraction tutorial yesterday by AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan
Executing a Text-Centric Task
1. Retrieve documents from database
2. Process documents
3. Extract output tokens
(Text Database → Extraction System → Output Tokens)
Similar to the relational world, there are two major execution paradigms:
• Scan-based: retrieve and process documents sequentially
• Index-based: query the database (e.g., [case fatality rate]), retrieve and process documents in the results
Unlike the relational world:
• Indexes are only “approximate”: the index is on keywords, not on tokens of interest
• Choice of execution plan affects output completeness (not only speed)
→ the underlying data distribution dictates what is best
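The two paradigms can be contrasted in a few lines; `extract` and `search` are hypothetical stand-ins for the extraction system and the collection's keyword index, and the documents are invented.

```python
def scan_based(documents, extract):
    """Retrieve and process every document sequentially (complete, slow)."""
    tokens = set()
    for doc in documents:
        tokens.update(extract(doc))
    return tokens

def index_based(queries, search, extract):
    """Query the keyword index and process only the retrieved documents.
    Faster, but "approximate": documents never matched by any query
    contribute nothing, so output may be incomplete."""
    tokens = set()
    for q in queries:
        for doc in search(q):
            tokens.update(extract(doc))
    return tokens

docs = [
    "The case fatality rate of the Ebola outbreak in Zaire rose.",
    "Malaria cases were reported in Ethiopia in January.",
    "The stock market closed higher today.",
]
# Stand-in extraction system: returns disease-name tokens in a document.
extract = lambda d: {w for w in ("Ebola", "Malaria") if w in d}
# Stand-in keyword search over the collection.
search = lambda q: [d for d in docs if q in d]

full = scan_based(docs, extract)                                # complete
partial = index_based(["case fatality rate"], search, extract)  # misses Malaria
```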
QXtract: Querying Text Databases for Robust Scalable Information EXtraction
Problem: learn keyword queries to retrieve “promising” documents

User-provided seed tuples:
DiseaseName   Location   Date
Malaria       Ethiopia   Jan. 1995
Ebola         Zaire      May 1995

Extracted relation:
DiseaseName       Location   Date
Malaria           Ethiopia   Jan. 1995
Ebola             Zaire      May 1995
Mad Cow Disease   The U.K.   July 1995
Pneumonia         The U.S.   Feb. 1995

[Diagram: seed tuples → query generation → queries → search engine over the text database → promising documents → information extraction system → extracted relation]
Learning Queries to Retrieve Promising Documents
1. Get a document sample with “likely negative” and “likely positive” examples.
2. Label sample documents using the information extraction system as an “oracle.”
3. Train classifiers to “recognize” useful documents.
4. Generate queries from the classifier model/rules.
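A toy sketch of steps 2–4: the count-difference scoring below is a simple stand-in for the classifiers used in the actual work, and the sample documents, stopword list, and `oracle` are all made up for illustration.

```python
from collections import Counter

def learn_queries(sample, oracle, num_queries=2):
    """Label documents with the extraction system as an oracle, then
    promote terms frequent in useful documents (and rare in useless
    ones) into keyword queries."""
    stop = {"in", "of", "the"}
    useful, useless = Counter(), Counter()
    for doc in sample:
        target = useful if oracle(doc) else useless
        target.update(w for w in doc.lower().split() if w not in stop)
    scores = {w: useful[w] - useless[w] for w in useful}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_queries]

sample = [
    "Ebola outbreak reported in Zaire",
    "Cholera outbreak spreads in Sudan",
    "stock market outbreak of optimism",
]
# Stand-in oracle: "useful" means the extraction system found a tuple.
oracle = lambda doc: "Zaire" in doc or "Sudan" in doc
queries = learn_queries(sample, oracle)   # favors terms like "ebola"
```

Terms like "outbreak" that also occur in useless documents score lower, so the learned queries target documents likely to contain extractable tuples.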
[Diagram: user-provided seed tuples → seed sampling over the text database → documents labeled + / − by the extraction system → classifier training → query generation → queries to the search engine]
SIGMOD 2003 Demonstration
Querying Graph
The querying graph is a bipartite graph, containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
[Diagram: tokens t1–t5 linked to documents d1–d5]
Tokens (tuples): <SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>
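Reachability over such a bipartite querying graph can be computed by alternating breadth-first search: a token's query retrieves documents, and each retrieved document yields the tokens it contains. The small graph below is invented for illustration.

```python
from collections import deque

retrieves = {         # token -> documents its keyword query returns
    "t1": ["d1"],
    "t2": ["d1", "d2"],
    "t3": ["d3"],
    "t4": ["d4"],
    "t5": ["d4", "d5"],
}
contains = {          # document -> tokens appearing in it
    "d1": ["t1", "t2"],
    "d2": ["t3"],
    "d3": ["t3"],
    "d4": ["t4", "t5"],
    "d5": ["t5"],
}

def reachable_tokens(seeds):
    """BFS alternating query edges and containment edges."""
    seen_t, seen_d = set(seeds), set()
    queue = deque(seeds)
    while queue:
        token = queue.popleft()
        for doc in retrieves.get(token, []):
            if doc not in seen_d:
                seen_d.add(doc)
                for t in contains.get(doc, []):
                    if t not in seen_t:
                        seen_t.add(t)
                        queue.append(t)
    return seen_t

# Fraction of all tokens reachable from seed token t1:
frac = len(reachable_tokens({"t1"})) / len(retrieves)
```

Here t4 and t5 form a separate component, so starting from t1 only 3 of the 5 tokens are ever reached, no matter how many queries are issued.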
Sizes of Connected Components
[Diagram: reachability graph partitioned into In, Core (strongly connected), and Out components]
How many tuples are in the largest Core + Out?
Conjecture:
• Degree distribution in reachability graphs follows a “power law.”
• Then, the reachability graph has at most one giant component.
Define reachability as the fraction of tuples in the largest Core + Out.
NYT Reachability Graph: Outdegree Distribution
[Plots: outdegree distributions for MaxResults=10 and MaxResults=50]
Matches the power-law distribution
NYT: Component Size Distribution
MaxResults=10: CG / |T| = 0.297 (most tuples not “reachable”)
MaxResults=50: CG / |T| = 0.620 (“reachable”)
Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
Estimate Cost of Retrieval Methods
Alternatives: Scan, Filtered Scan, Tuples, QXtract
General cost model for text-centric tasks:
• Information extraction, summary construction, etc.
Estimate the expected cost of each access method:
• Parametric model describing all retrieval steps
• Extended analysis to arbitrary degree distributions
• Parameter estimates can be “piggybacked” at runtime
Cost estimates can be provided to a query optimizer for near-optimal execution
SIGMOD 2006
Optimized Execution of Text-Centric Tasks
[Plot: estimated execution cost of the Scan, Filtered Scan, and Tuples strategies]
Current Research Agenda
Seamless, intuitive, and robust access to knowledge in biological and medical sources
Some research problems:
• Robust query processing over unstructured data
• Intelligently interpreting user information needs
• Text mining for bio- and medical informatics
• Modeling implicit network structures:
  • Entity graphs in Wikipedia
  • Protein-protein interaction networks
  • Semantic maps of MedLine
Deriving Actionable Knowledge from Unstructured (Text) Data
Extract actionable rules from medical text (Medline, patient reports, …)
• Joint project (early stages) with medical school, GT
Epidemiology surveillance (w/ SPH)
Query processing over unstructured data:
• Tune extraction for the query workload
• Index structures to support effective extraction
• Queries over extracted and “native” tables
Text Mining for Bioinformatics
• Impossible to keep up with the literature and experimental notes
• Automatically update ontologies and indexes
• Automate the tedious work of post-wetlab search
• Identify (and assign text labels to) DNA structures
Mining Text and Sequence Data
PSB 2004
ROC50 scores for each class and method