1
Accessing, Managing, and Mining Unstructured Data
Eugene Agichtein
2
The Web:
- 20B+ pages of machine-readable text (some of it useful); (mostly) human-generated for human consumption
- Both an "artificial" and a "natural" phenomenon
- Still growing? Local and global structure (links)
- Headaches: dynamic vs. static content; people figured out how to make money
- Positives: everything (almost) is on the web; people (eventually) can find info; people (on average) are not evil
3
Wait, there is more:
- Blogs, Wikipedia
- Hidden web: > 25 million databases, accessible via keyword search interfaces (e.g., MedLine, CancerLit, USPTO, …); 100x more data than the surface web
- (Transcribed) speech from …
- Genetic sequence annotations
- Biological & medical literature
- Medical records, reports, alerts, 911 calls
- Classified
4
Outline
Unstructured data (text, web, …) is:
- Important (really!)
- Not so unstructured
Main tasks/requirements and challenges
Example problem: query optimization for text-centric tasks
Fundamental research problems/directions
5
Unstructured data = natural language text (for this talk)
- An incredibly powerful and flexible means of communicating knowledge: papers, news, web pages, lecture notes, patient records, shopping lists…
- Local structures: syntax (English syntax, HTML layout)
- Semantics: implicit, ambiguous, subjective ("I saw a man with a chainsaw")
- Need an incredibly powerful and flexible decoder
6
Some more structure
- Explicit link structure: Web, blogs, Wikipedia, citations
- Implicit link structure: co-occurrence of entities within the same document/context implies a link between entities; occurrence of the same entity in multiple documents implies a link between documents
- Physical location: page primarily "about" Atlanta; user somewhere around N. Decatur Rd; e-mail sender is two floors down
7
Global Problem Space
- Crawling (accessing) the data
- Storing (multiple versions of) the data
- "Understanding" the data → information
- Indexing information
- Integration from multiple sources
- User-driven information retrieval
- Exploiting unstructured data in applications
- System-driven knowledge discovery
- Building a nuclear/hydro/wind power plant
8
To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
Information extraction applications extract structured relations from unstructured text
May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…
Date       Disease Name     Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  U.K.
Feb. 1995  Pneumonia        U.S.
May 1995   Ebola            Zaire
Information Extraction System
(e.g., NYU’s Proteus)
Disease Outbreaks in The New York Times
9
An Abstract View of Text-Centric Tasks
(Diagram: Text Database → Extraction System → Output Tokens)
1. Retrieve documents from database
2. Process documents
3. Extract output tokens
Task                    Token
Information Extraction  Relation Tuple
Database Selection      Word (+Frequency)
Focused Crawling        Web Page about a Topic
For the rest of the talk
10
Executing a Text-Centric Task
(Diagram: Text Database → Extraction System → Output Tokens)
1. Retrieve documents from database
2. Process documents
3. Extract output tokens
Similar to relational world
Two major execution paradigms:
- Scan-based: retrieve and process documents sequentially
- Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results
Unlike the relational world
- Indexes are only "approximate": the index is on keywords, not on the tokens of interest
- Choice of execution plan affects output completeness (not only speed)
→underlying data distribution dictates what is best
11
Execution Plan Characteristics
(Diagram: Text Database → Extraction System → Output Tokens)
1. Retrieve documents from database
2. Process documents
3. Extract output tokens

Execution plans have two main characteristics:
- Execution time
- Recall (fraction of tokens retrieved)
Question: How do we choose the fastest execution plan for reaching a target recall?
“What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?”
12
Outline
Description and analysis of crawl- and query-based plans:
- Scan (crawl-based)
- Filtered Scan (crawl-based)
- Iterative Set Expansion (query-based/index-based)
- Automatic Query Generation (query-based/index-based)
Optimization strategy
Experimental results and conclusions
13
Scan
(Diagram: Text Database → Extraction System → Output Tokens)
1. Retrieve docs from database
2. Process documents
3. Extract output tokens

Scan retrieves and processes documents sequentially (until reaching the target recall).
Execution time = |Retrieved Docs| · (R + P), where R is the time for retrieving a document and P is the time for processing a document.
Question: How many documents does Scan retrieve to reach the target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper).
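The Scan plan can be sketched as a simple loop (a minimal sketch, not the paper's implementation; `extract` and the cost constants R and P are illustrative stand-ins):

```python
def scan(documents, extract, target_recall, total_tokens, R=1.0, P=2.0):
    """Scan plan: process documents sequentially until target recall is reached.

    documents: iterable of document texts, retrieved in sequence
    extract: function mapping a document to the set of tokens it yields
    Returns (tokens found, estimated time = |retrieved docs| * (R + P)).
    """
    found = set()
    retrieved = 0
    for doc in documents:
        retrieved += 1
        found |= extract(doc)  # process the document, extract its tokens
        if len(found) / total_tokens >= target_recall:
            break              # stop as soon as the recall target is met
    return found, retrieved * (R + P)
```

Filtered Scan would follow the same loop but skip documents the classifier rejects before calling `extract`.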
14
Estimating Recall of Scan
Modeling Scan for token t: what is the probability of seeing t (with frequency g(t)) after retrieving S documents? This is a "sampling without replacement" process:
- After retrieving S documents, the frequency of token t follows a hypergeometric distribution
- Recall for token t is the probability that the frequency of t in the S documents is > 0
(Figure: sampling S of the D documents for token t, e.g., <SARS, China>; g(t) = frequency of token t)
15
Estimating Recall of Scan
Modeling Scan: multiple "sampling without replacement" processes, one for each token. Overall recall is the average recall across tokens.
→ We can compute the number of documents required to reach the target recall.
(Figure: parallel sampling processes for tokens t1 … tM over documents d1 … dN, e.g., <SARS, China>, <Ebola, Zaire>)
Execution time = |Retrieved Docs| · (R + P)
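The hypergeometric recall model can be computed directly with the standard library (a minimal sketch; the function names and the linear search for the document budget are illustrative, not the paper's code):

```python
from math import comb

def token_recall(g, S, N):
    """P(a token with document frequency g appears in a sample of S of N docs).

    Sampling without replacement: the number of matching documents in the
    sample is hypergeometric, so recall = P(count > 0) = 1 - C(N-g, S)/C(N, S).
    """
    if S >= N:
        return 1.0
    return 1.0 - comb(N - g, S) / comb(N, S)

def scan_recall(token_freqs, S, N):
    """Expected overall recall of Scan after S documents: average over tokens."""
    return sum(token_recall(g, S, N) for g in token_freqs) / len(token_freqs)

def docs_for_recall(token_freqs, target, N):
    """Smallest S whose expected recall meets the target (linear-search sketch)."""
    for S in range(N + 1):
        if scan_recall(token_freqs, S, N) >= target:
            return S
    return N
```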
16
Outline
Description and analysis of crawl- and query-based plans:
- Scan (crawl-based)
- Filtered Scan (crawl-based)
- Iterative Set Expansion (query-based)
- Automatic Query Generation (query-based)
Optimization strategy
Experimental results and conclusions
17
Iterative Set Expansion
(Diagram: Text Database → Extraction System → Output Tokens)
1. Query database with seed tokens (e.g., [Ebola AND Zaire])
2. Process retrieved documents
3. Extract tokens from docs (e.g., <Malaria, Ethiopia>)
4. Augment seed tokens with new tokens

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P is the time for processing a document, and Q is the time for answering a query.
Question: How many queries and how many documents does Iterative Set Expansion need to reach the target recall?
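The four steps above can be sketched as a loop (a minimal sketch; `query_db` and `extract` are hypothetical callbacks standing in for the keyword search interface and the extraction system):

```python
def iterative_set_expansion(seeds, query_db, extract, target_recall,
                            total_tokens, R=1.0, P=2.0, Q=0.5):
    """Iterative Set Expansion: query with known tokens, extract new tokens
    from the results, and feed those back as queries.

    seeds: initial tokens; query_db(token) -> list of document ids
    extract(doc_id) -> set of tokens contained in that document
    Returns (tokens found, time = |docs| * (R + P) + |queries| * Q).
    """
    found = set(seeds)
    queue = list(seeds)
    seen_docs = set()
    queries = 0
    while queue and len(found) / total_tokens < target_recall:
        token = queue.pop(0)
        queries += 1
        for doc in query_db(token):      # 1. query with a seed token
            if doc in seen_docs:
                continue
            seen_docs.add(doc)           # 2. process the retrieved document
            for new in extract(doc):     # 3. extract tokens
                if new not in found:
                    found.add(new)
                    queue.append(new)    # 4. augment seed tokens
    return found, len(seen_docs) * (R + P) + queries * Q
```

The returned time follows the slide's formula, |Retrieved Docs| · (R + P) + |Queries| · Q.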
18
Querying Graph
The querying graph is a bipartite graph of tokens and documents:
- Each token (transformed into a keyword query) retrieves documents
- Documents contain tokens
(Figure: bipartite graph linking tokens t1 … t5, e.g., <SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>, to documents d1 … d5)
19
Using the Querying Graph for Analysis
We need to compute:
- The number of documents retrieved after sending Q tokens as queries (estimates time)
- The number of tokens that appear in the retrieved documents (estimates recall)

To estimate these we need to compute:
- The degree distribution of the tokens discovered by retrieving documents
- The degree distribution of the documents retrieved by the tokens
(Not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)

Elegant analysis framework based on generating functions; details in the paper.
(Figure: the bipartite querying graph of tokens and documents, as on the previous slide)
20
Recall Limit: Reachability Graph
In the reachability graph, t1 has an edge to t2 if t1 retrieves a document d1 that contains t2.
Upper recall limit: determined by the size of the biggest connected component.
(Figure: the bipartite querying graph over tokens t1 … t5 and documents d1 … d5, and the corresponding reachability graph over the tokens)
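This recall ceiling can be checked on a concrete querying graph: tokens not reachable from the seeds can never be discovered by Iterative Set Expansion. A minimal sketch (the dict-based graph encoding is an assumption for illustration):

```python
from collections import deque

def reachable_tokens(seeds, token_to_docs, doc_to_tokens):
    """BFS over the bipartite querying graph: a token reaches every token
    contained in any document it retrieves. The set reachable from the seeds
    bounds what Iterative Set Expansion can ever discover from those seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        token = queue.popleft()
        for doc in token_to_docs.get(token, []):
            for other in doc_to_tokens.get(doc, []):
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
    return seen
```

The upper recall limit for a seed set is then `len(reachable_tokens(...))` divided by the total number of tokens.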
21
Automatic Query Generation
Details in the papers
- Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation.
- Automatic Query Generation avoids this problem by creating queries offline (using machine learning) that are designed to return documents with tokens.
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far:
- Takes as input a target recall
- Gives as output the time for each plan to reach the target recall (time = infinity if a plan cannot reach the target recall)
Time and recall depend on task-specific properties of the database:
- Token degree distribution
- Document degree distribution
Next, we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model Parameters
Token and document degree distributions belong to known distribution families, so we can characterize a distribution with only a few parameters!

Task                          Document Distribution  Token Distribution
Information Extraction        Power-law              Power-law
Content Summary Construction  Lognormal              Power-law (Zipf)
Focused Resource Discovery    Uniform                Uniform

(Figures: log-log histograms of number of documents vs. document degree, fit y = 43060·x^-3.3863, and number of tokens vs. token degree, fit y = 5492.2·x^-2.0254)
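A power-law exponent like the fits shown in the figures can be estimated by least squares in log-log space (a minimal illustration, not necessarily the estimator used in the paper; log-log regression is also known to be a rough fit on heavy tails, where maximum likelihood is preferable):

```python
from math import exp, log

def fit_power_law(degree_counts):
    """Fit count = a * degree^b by least squares on log(count) vs log(degree).

    degree_counts: dict mapping degree -> number of items with that degree.
    Returns (a, b); for a power-law degree distribution, b is negative.
    """
    xs = [log(d) for d in degree_counts]
    ys = [log(c) for c in degree_counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = exp(mean_y - b * mean_x)
    return a, b
```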
25
Parameter Estimation
Naïve solution for parameter estimation:
- Start with a separate "parameter-estimation" phase
- Perform random sampling on the database
- Stop when cross-validation indicates high confidence
We can do better than this!
- No need for a separate sampling phase
- Sampling is equivalent to executing the task:
→Piggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
- Pick the most promising execution plan for the target recall, assuming "default" parameter values
- Start executing the task
- Update parameter estimates during execution
- Switch plans if the updated statistics indicate so

Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper).
(Figure: the correct but unknown distribution vs. the initial default estimate and successively updated estimates)
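The plan-switching idea can be sketched as an optimizer loop (a minimal sketch; the cost functions in `plans`, `execute_step`, and `update_stats` are hypothetical stand-ins for the paper's analytic models):

```python
def choose_plan(plans, stats, target_recall):
    """Pick the plan whose cost model predicts the least time to target recall.

    plans: dict name -> cost function (stats, target_recall) -> predicted time,
    returning float('inf') if the plan cannot reach the target.
    """
    return min(plans, key=lambda name: plans[name](stats, target_recall))

def run_with_switching(plans, execute_step, update_stats, stats, target_recall):
    """Optimizer loop: start from default stats, re-estimate during execution,
    and re-pick the plan whenever the updated statistics change the winner."""
    recall, plan = 0.0, None
    while recall < target_recall:
        plan = choose_plan(plans, stats, target_recall)
        recall = execute_step(plan)   # run one batch of the chosen plan
        stats = update_stats(stats)   # refine parameter estimates on the fly
    return plan
```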
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Task: Disease Outbreaks, using the Snowball IE system over 182,531 documents from The New York Times (16,921 tokens).
(Figure: execution time in seconds, log scale, vs. recall from 0.0 to 1.0 for Scan, Filtered Scan, Automatic Query Generation, and Iterative Set Expansion. Solid lines: actual time; dotted lines: time predicted with correct parameters.)
29
Experimental Results (Information Extraction)
(Figure: execution time in seconds, log scale, vs. recall for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and the OPTIMIZED plan. Solid lines: actual time; green line: time with the optimizer. Results are similar in the other experiments; see paper.)
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Global Problem Space
- Crawling (accessing) the data
- "Understanding" the data → information
- Indexing information
- Integration from multiple sources
- User-driven information retrieval
- Exploiting unstructured data in applications
- System-driven knowledge discovery
32
Some Research Directions
- Modeling explicit and implicit network structures
  - Modeling evolution of explicit structure on the web, blogspace, Wikipedia
  - Modeling implicit link structures in text, collections, the web
  - Exploiting implicit & explicit social networks (e.g., for epidemiology)
- Knowledge discovery from biological and medical data
  - Automatic sequence annotation: bioinformatics, genetics
  - Actionable knowledge extraction from medical articles
- Robust information extraction, retrieval, and query processing
  - Integrating information in structured and unstructured sources
  - Query processing over unstructured text
  - Robust search/question answering for medical applications
  - Confidence estimation for extraction from text and other sources
  - Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
  - Accuracy (≠ authority) of online sources
- Information diffusion/propagation in online sources
  - Information propagation on the web, news
  - In collaborative sources (Wikipedia, MedLine)
33
Page Quality: In Search of an Unbiased Web Ranking [Cho, Roy, Adams, SIGMOD 2005]
"Popular pages tend to get even more popular, while unpopular pages get ignored by an average user."
34
Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay [Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004]
35
Modeling Social Networks for Epidemiology, security, …
Email exchange mapped onto cubicle locations.
36
Some Research Directions (outline recap; see slide 32)
37
Applying Text Mining for Bioinformatics
100,000+ gene and protein synonyms extracted from 50,000+ journal articles
Approximately 40% of confirmed synonyms not previously listed in curated authoritative reference (SWISSPROT)
ISMB 2003
"APO-1, also known as DR6…"
"MEK4, also called SEK1…"
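Synonym pairs like the examples above can be pulled out with a simple lexical pattern (a deliberately minimal, hypothetical sketch; real extraction systems use far richer patterns plus statistical filtering):

```python
import re

# Minimal pattern in the spirit of the examples: capture (name, synonym)
# from phrases like "X, also known as Y" or "X, also called Y".
SYNONYM_PATTERN = re.compile(
    r"(\w[\w-]*),\s+also\s+(?:known\s+as|called)\s+(\w[\w-]*)")

def extract_synonyms(text):
    """Return the (term, synonym) pairs matched by the pattern."""
    return SYNONYM_PATTERN.findall(text)
```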
38
Examples of Entity-Relationship Extraction
"We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex."
(Diagram: CBF-A interacts with CBF-C, forming the CBF-A-CBF-C complex; CBF-B associates with the complex.)
39
Another ExampleZ-100 is an arabinomannan extracted from Mycobacterium tuberculosis that has various immunomodulatory activities, such as the induction of interleukin 12, interferon gamma (IFN-gamma) and beta-chemokines. The effects of Z-100 on human immunodeficiency virus type 1 (HIV-1) replication in human monocyte-derived macrophages (MDMs) are investigated in this paper. In MDMs, Z-100 markedly suppressed the replication of not only macrophage-tropic (M-tropic) HIV-1 strain (HIV-1JR-CSF), but also HIV-1 pseudotypes that possessed amphotropic Moloney murine leukemia virus or vesicular stomatitis virus G envelopes. Z-100 was found to inhibit HIV-1 expression, even when added 24 h after infection. In addition, it substantially inhibited the expression of the pNL43lucDeltaenv vector (in which the env gene is defective and the nef gene is replaced with the firefly luciferase gene) when this vector was transfected directly into MDMs. These findings suggest that Z-100 inhibits virus replication, mainly at HIV-1 transcription. However, Z-100 also downregulated expression of the cell surface receptors CD4 and CCR5 in MDMs, suggesting some inhibitory effect on HIV-1 entry. Further experiments revealed that Z-100 induced IFN-beta production in these cells, resulting in induction of the 16-kDa CCAAT/enhancer binding protein (C/EBP) beta transcription factor that represses HIV-1 long terminal repeat transcription. These effects were alleviated by SB 203580, a specific inhibitor of p38 mitogen-activated protein kinases (MAPK), indicating that the p38 MAPK signalling pathway was involved in Z-100-induced repression of HIV-1 replication in MDMs. These findings suggest that Z-100 might be a useful immunomodulator for control of HIV-1 infection.
40
(Screenshot: AliBaba, Ulf Leser, http://wbi.informatik.hu-berlin.de:8080/ — a query over PubMed, visualized, with extracted info and links to databases)
41
Mining Text and Sequence Data
Agichtein & Eskin, PSB 2004
(Figure: ROC50 scores for each class and method)
42
Some Research Directions (outline recap; see slide 32)
43
Structure and evolution of blogspace [Kumar, Novak, Raghavan, Tomkins, CACM 2004, KDD 2006]
Fraction of nodes in components of various sizes within Flickr and Yahoo! 360 timegraph, by week.
44
Connected Components Visualization
Disease Outbreaks, The New York Times, 1995
Structure of implicit entity-entity networks in text [Agichtein&Gravano, ICDE 2003]
45
Some Research Directions (outline recap; see slide 32)
46
Thank You
Details: http://www.mathcs.emory.edu/~eugene/