
1

Accessing, Managing, and Mining Unstructured Data

Eugene Agichtein

2

The Web
20B+ of machine-readable text (some of it useful)
(Mostly) human-generated for human consumption
Both “artificial” and “natural” phenomenon
Still growing?
Local and global structure (links)
Headaches:
  Dynamic vs. static content
  People figured out how to make money
Positives:
  Everything (almost) is on the web
  People (eventually) can find info
  People (on average) are not evil

3

Wait, there is more
Blogs, Wikipedia
Hidden web: > 25 million databases
  Accessible via keyword search interfaces
  E.g., MedLine, CancerLit, USPTO, …
  100x more data than surface web

(Transcribed) speech from Genetic sequence annotations Biological & Medical literature Medical records, reports, alerts, 911 calls

Classified

4

Outline

Unstructured data (text, web, …) is
  Important (really!)
  Not so unstructured

Main tasks/requirements and challenges

Example problem: query optimization for text-centric tasks

Fundamental research problems/directions

5

Unstructured data = natural language text (for this talk)
Incredibly powerful and flexible means of communicating knowledge
  Papers, news, web pages, lecture notes, patient records, shopping lists…
Local structures: syntax
  English syntax
  HTML layout

Semantics implicit, ambiguous, subjective

I saw a man with a chainsaw

Need incredibly powerful and flexible decoder

6

Some more structure
Explicit link structure
  Web, Blogs, Wikipedia, citations
Implicit link structure
  Co-occurrence of entities within same document/context implies link between entities
  Occurrence of same entity in multiple documents implies link between documents
Physical location
  Page primarily “about” Atlanta
  User somewhere around N. Decatur Rd
  E-mail sender is two floors down

More on this later

7

Global Problem Space

Crawling (accessing) the data
Storing (multiple versions of) data
“Understanding” the data information
Indexing information
Integration from multiple sources
User-driven information retrieval
Exploiting unstructured data in applications
System-driven knowledge discovery
Building a nuclear/hydro/wind power plant

8

To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]

Information extraction applications extract structured relations from unstructured text

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…

Information Extraction System (e.g., NYU’s Proteus)

Disease Outbreaks in The New York Times

Date        Disease Name      Location
Jan. 1995   Malaria           Ethiopia
July 1995   Mad Cow Disease   U.K.
Feb. 1995   Pneumonia         U.S.
May 1995    Ebola             Zaire

9

An Abstract View of Text-Centric Tasks

[Figure: Text Database, Extraction System, Output Tokens]

1. Retrieve documents from database
2. Process documents
3. Extract output tokens

Task                     Token
Information Extraction   Relation Tuple
Database Selection       Word (+Frequency)
Focused Crawling         Web Page about a Topic

For the rest of the talk

10

Executing a Text-Centric Task

1. Retrieve documents from database
2. Process documents
3. Extract output tokens

Similar to relational world

Two major execution paradigms
  Scan-based: Retrieve and process documents sequentially
  Index-based: Query database (e.g., [case fatality rate]), retrieve and process documents in results

Unlike the relational world

  Indexes are only “approximate”: index is on keywords, not on tokens of interest
  Choice of execution plan affects output completeness (not only speed)

→ Underlying data distribution dictates what is best

11

Execution Plan Characteristics

1. Retrieve documents from database
2. Process documents
3. Extract output tokens

Execution plans have two main characteristics:
  Execution time
  Recall (fraction of tokens retrieved)

Question: How do we choose the fastest execution plan for reaching a target recall?

“What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?”

12

Outline

Description and analysis of crawl- and query-based plans
  Crawl-based: Scan, Filtered Scan
  Query-based (index-based): Iterative Set Expansion, Automatic Query Generation

Optimization strategy

Experimental results and conclusions

13

Scan

1. Retrieve docs from database
2. Process documents
3. Extract output tokens

Scan: retrieves and processes documents sequentially (until reaching target recall)

Execution time = |Retrieved Docs| · (R + P)
  R = time for retrieving a document
  P = time for processing a document

Question: How many documents does Scan retrieve to reach target recall?

Filtered Scan: uses a classifier to identify and process only promising documents (details in paper)

14

Estimating Recall of Scan

Modeling Scan for token t:
  What is the probability of seeing t (with frequency g(t)) after retrieving S documents?
  A “sampling without replacement” process

After retrieving S documents, frequency of token t follows hypergeometric distribution

Recall for token t is the probability that frequency of t in S docs > 0
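The per-token recall above can be sketched in a few lines. This is a minimal illustration, assuming the database holds N documents, token t occurs in g of them, and Scan has retrieved S documents uniformly at random without replacement. The function name and argument names are hypothetical and not taken from the paper.

```python
from math import comb

def token_recall(N: int, g: int, S: int) -> float:
    """Probability that a token occurring in g of N documents shows up
    at least once among S documents sampled without replacement."""
    if not (0 <= g <= N and 0 <= S <= N):
        raise ValueError("need 0 <= g <= N and 0 <= S <= N")
    # Hypergeometric probability of frequency 0 is C(N - g, S) / C(N, S);
    # recall for the token is the complement of that.
    return 1.0 - comb(N - g, S) / comb(N, S)
```

For a token that appears in a single document, this reduces to S / N, which matches the intuition that rare tokens need a long scan.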

[Figure: sampling S documents (d1 … dS) out of the N documents (d1 … dN) of database D for token t = <SARS, China>. Caption: probability of seeing token t after retrieving S documents, where g(t) = frequency of token t]

15

Estimating Recall of Scan

Modeling Scan:
  Multiple “sampling without replacement” processes, one for each token
  Overall recall is average recall across tokens

→ We can compute number of documents required to reach target recall
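One way to turn this into a number is sketched below, under the assumption that the token frequencies g(t) are known (or estimated) and that every token occurs in at least one document. Expected recall is averaged over tokens, and the smallest S that reaches the target is found by binary search, which is valid because expected recall never decreases as S grows. All function names are hypothetical.

```python
from math import comb

def scan_recall(N, freqs, S):
    """Expected recall of Scan: the average, over tokens, of the chance
    that each token is seen in S of N documents (no replacement)."""
    total = comb(N, S)
    return sum(1.0 - comb(N - g, S) / total for g in freqs) / len(freqs)

def docs_for_recall(N, freqs, target):
    """Smallest number of documents whose expected recall reaches target."""
    lo, hi = 0, N
    while lo < hi:
        mid = (lo + hi) // 2
        if scan_recall(N, freqs, mid) >= target:
            hi = mid          # mid documents suffice; try fewer
        else:
            lo = mid + 1      # not enough documents yet
    return lo

def scan_time(num_docs, R, P):
    """Execution time = |Retrieved Docs| * (R + P)."""
    return num_docs * (R + P)
```

Plugging the result of `docs_for_recall` into `scan_time` gives the predicted cost of Scan for a given target recall.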

[Figure: one sampling process per token t1, t2, …, tM (e.g., <SARS, China>, <Ebola, Zaire>) over the documents d1 … dN of database D]

Execution time = |Retrieved Docs| · (R + P)

16

Outline

Description and analysis of crawl- and query-based plans
  Crawl-based: Scan, Filtered Scan
  Query-based: Iterative Set Expansion, Automatic Query Generation

17

Iterative Set Expansion

1. Query database with seed tokens
2. Process retrieved documents
3. Extract tokens from docs
4. Augment seed tokens with new tokens (query generation)

Examples: query [Ebola AND Zaire]; token <Malaria, Ethiopia>

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q
  R = time for retrieving a document
  P = time for processing a document
  Q = time for answering a query

Question: How many queries and how many documents does Iterative Set Expansion need to reach target recall?
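The four steps can be simulated on a toy in-memory database. The sketch below makes strong simplifying assumptions: the database is a dictionary from document id to the tokens it contains, a query for a token returns exactly the documents containing it, and the seed tokens are distinct. The function name and data layout are illustrative only.

```python
from collections import deque

def iterative_set_expansion(docs, seeds, R, P, Q):
    """docs maps a document id to the set of tokens it contains.
    Returns (discovered tokens, retrieved doc ids, estimated time)."""
    # Toy inverted index standing in for the database's search interface.
    index = {}
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index.setdefault(tok, set()).add(doc_id)

    found = set(seeds)
    pending = deque(seeds)
    retrieved = set()
    queries = 0
    while pending:
        tok = pending.popleft()
        queries += 1                       # step 1: query with a token
        for doc_id in index.get(tok, ()):
            if doc_id in retrieved:
                continue
            retrieved.add(doc_id)          # step 2: retrieve and process
            for new_tok in docs[doc_id]:   # step 3: extract tokens
                if new_tok not in found:
                    found.add(new_tok)     # step 4: augment the seeds
                    pending.append(new_tok)
    # Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
    time = len(retrieved) * (R + P) + queries * Q
    return found, retrieved, time
```

Tokens that no query ever leads to stay undiscovered, which is the recall limitation of this plan.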

18

Querying Graph

The querying graph is a bipartite graph, containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

[Figure: bipartite querying graph between tokens t1 to t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) and documents d1 to d5]

19

Using Querying Graph for Analysis

We need to compute the:
  Number of documents retrieved after sending Q tokens as queries (estimates time)
  Number of tokens that appear in the retrieved documents (estimates recall)

To estimate these we need to compute the:
  Degree distribution of the tokens discovered by retrieving documents
  Degree distribution of the documents retrieved by the tokens
  (Not the same as the degree distribution of a randomly chosen token or document – it is easier to discover documents and tokens with high degrees)

[Figure: querying graph between tokens t1 to t5 and documents d1 to d5]

Elegant analysis framework based on generating functions – details in the paper
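The bias toward high-degree nodes has a compact form in standard random-graph theory: a node reached by following a random edge has degree k with probability proportional to k times the fraction of nodes with degree k. The sketch below illustrates only that standard result; it is not the paper's full generating-function derivation, and the function name is hypothetical.

```python
def edge_biased_distribution(p):
    """p maps degree k -> fraction of nodes with degree k.
    Returns the degree distribution of a node reached by following
    a uniformly random edge: q(k) = k * p(k) / mean degree."""
    mean = sum(k * pk for k, pk in p.items())
    if mean == 0:
        raise ValueError("graph has no edges")
    return {k: k * pk / mean for k, pk in p.items()}
```

With half the nodes at degree 1 and half at degree 3, three quarters of the nodes reached through an edge have degree 3, even though only half of all nodes do.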


20

Recall Limit: Reachability Graph

t1 retrieves document d1 that contains t2

[Figure: querying graph (tokens t1 to t5, documents d1 to d5) and the reachability graph derived from it over tokens t1 to t5]

Upper recall limit: determined by the size of the biggest connected component
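A small sketch of this limit, assuming the querying graph is given as two dictionaries (which documents each token's query retrieves, and which tokens each document contains). It builds the token-to-token reachability graph and measures the fraction of tokens reachable from a set of seeds, which can never exceed the component the seeds fall in. Counting the seeds themselves as discovered is an assumption of this sketch, and all names are hypothetical.

```python
from collections import deque

def reachability_graph(retrieves, contains):
    """retrieves: token -> documents returned by its query.
    contains: document -> tokens found in it.
    Returns token -> set of tokens reachable in one query step."""
    return {
        tok: {t for d in ds for t in contains.get(d, ())} - {tok}
        for tok, ds in retrieves.items()
    }

def recall_limit(graph, seeds, all_tokens):
    """Fraction of all tokens reachable from the seeds (breadth-first)."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        tok = queue.popleft()
        for nxt in graph.get(tok, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen & set(all_tokens)) / len(all_tokens)
```

In the slide's terms, an edge from t1 to t2 exists because t1 retrieves a document that contains t2.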


21

Automatic Query Generation

Details in the papers

Iterative Set Expansion has recall limitation due to iterative nature of query generation

Automatic Query Generation avoids this problem by creating queries offline (using machine learning), which are designed to return documents with tokens

22

Outline

Description and analysis of crawl- and query-based plans



23

Summary of Cost Analysis

Our analysis so far:
  Takes as input a target recall
  Gives as output the time for each plan to reach target recall (time = infinity, if plan cannot reach target recall)

Time and recall depend on task-specific properties of database:
  Token degree distribution
  Document degree distribution

Next, we show how to estimate degree distributions on-the-fly
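Given the per-plan time estimates described in this summary, the selection step itself is small. The sketch below assumes the estimates are already available as a dictionary from plan name to predicted time, with infinity marking a plan that cannot reach the target recall. The function name and the numbers in the usage note are made up for illustration.

```python
import math

def choose_plan(estimates):
    """estimates maps plan name -> predicted time to reach the target
    recall (math.inf if the plan cannot reach it).
    Returns the fastest feasible plan, or None if no plan is feasible."""
    name, time = min(estimates.items(), key=lambda kv: kv[1])
    if math.isinf(time):
        return None
    return name
```

For example, with hypothetical estimates of 5000 s for Scan, 3200 s for Filtered Scan, infinity for Iterative Set Expansion, and 4100 s for Automatic Query Generation, the function returns "Filtered Scan".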

24

Estimating Cost Model Parameters

Token and document degree distributions belong to known distribution families

Can characterize distributions with only a few parameters!

Task                           Document Distribution   Token Distribution
Information Extraction         Power-law               Power-law
Content Summary Construction   Lognormal               Power-law (Zipf)
Focused Resource Discovery     Uniform                 Uniform

[Figure: log-log plot of number of documents vs. document degree, with power-law fit y = 43060 x^(-3.3863)]

[Figure: log-log plot of number of tokens vs. token degree, with power-law fit y = 5492.2 x^(-2.0254)]
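Fits of the form y = a · x^b, like the two shown in the plots, can be obtained by a straight-line fit in log-log space. The sketch below uses ordinary least squares on the logarithms. This is one simple estimator chosen for illustration (others, such as maximum likelihood, exist), and it is an assumption that it resembles how the plotted trend lines were produced.

```python
import math

def fit_power_law(degrees, counts):
    """Least-squares fit of log(count) = log(a) + b * log(degree).
    Returns (a, b) for the model count = a * degree**b.
    Degrees and counts must all be positive."""
    xs = [math.log(d) for d in degrees]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope = power-law exponent
    a = math.exp(my - b * mx)     # intercept gives the scale factor
    return a, b
```

Only two numbers come out of the fit, which is the point of the slide: a whole degree distribution is summarized by a few parameters.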

25

Parameter Estimation

Naïve solution for parameter estimation:
  Start with separate, “parameter-estimation” phase
  Perform random sampling on database
  Stop when cross-validation indicates high confidence

We can do better than this!

No need for separate sampling phase
Sampling is equivalent to executing the task:
→ Piggyback parameter estimation into execution

26

On-the-fly Parameter Estimation

Pick most promising execution plan for target recall assuming “default” parameter values
Start executing task
Update parameter estimates during execution
Switch plan if updated statistics indicate so

Important:
  Only Scan acts as “random sampling”
  All other execution plans need parameter adjustment (see paper)

[Figure: correct (but unknown) distribution compared with the initial default estimation and successive updated estimations]

27

Outline

Description and analysis of crawl- and query-based plans



28

Correctness of Theoretical Analysis

Solid lines: Actual time
Dotted lines: Predicted time with correct parameters

Task: Disease Outbreaks
Snowball IE system
182,531 documents from NYT
16,921 tokens

[Figure: execution time (secs, log scale from 100 to 100,000) vs. recall (0.00 to 1.00) for Scan, Filt. Scan, Automatic Query Gen., and Iterative Set Expansion]

29

Experimental Results (Information Extraction)

Solid lines: Actual time
Green line: Time with optimizer

(results similar in other experiments – see paper)

[Figure: execution time (secs, log scale from 100 to 100,000) vs. recall (0.00 to 1.00) for Scan, Filt. Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Global Problem Space

Crawling (accessing) the data
“Understand” the data information
Indexing information
Integration from multiple sources
User-driven information retrieval
Exploiting unstructured data in applications
System-driven knowledge discovery

32

Some Research Directions

Modeling explicit and implicit network structures
  Modeling evolution of explicit structure on web, blogspace, wikipedia
  Modeling implicit link structures in text, collections, web
  Exploiting implicit & explicit social networks (e.g., for epidemiology)

Knowledge Discovery from Biological and Medical Data
  Automatic sequence annotation bioinformatics, genetics
  Actionable knowledge extraction from medical articles

Robust information extraction, retrieval, and query processing
  Integrating information in structured and unstructured sources
  Robust search/question answering for medical applications
  Confidence estimation for extraction from text and other sources
  Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
  Accuracy (!=authority) of online sources

Information diffusion/propagation in online sources
  Information propagation on the web
  In collaborative sources (wikipedia, MedLine)

33

Page Quality: In Search of an Unbiased Web Ranking [Cho, Roy, Adams, SIGMOD 2005]

“popular pages tend to get even more popular, while unpopular pages get ignored by an average user”

34

Sic Transit Gloria Telae: Towards an Understanding of the Web’s Decay [Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004]

35

Modeling Social Networks for Epidemiology, security, …

Email exchange mapped onto cubicle locations.

36




Robust information extraction, retrieval, and query processing
  Integrating information in structured and unstructured sources
  Query processing over unstructured text
  Robust search/question answering for medical applications
  Confidence estimation for extraction from text and other sources
  Detecting reliable signals from (noisy) text data (e.g., medical surveillance)


37

Applying Text Mining for Bioinformatics

100,000+ gene and protein synonyms extracted from 50,000+ journal articles

Approximately 40% of confirmed synonyms not previously listed in curated authoritative reference (SWISSPROT)

ISMB 2003

“APO-1, also known as DR6…”
“MEK4, also called SEK1…”
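To make the two example phrases concrete, here is a toy pattern matcher for exactly these surface forms. It is only an illustration: the hand-written regular expression and the assumption that names are runs of letters, digits, and hyphens are mine, and this is not the extraction system used for the results reported here.

```python
import re

# Toy pattern: "<NAME>, also known as <NAME>" or "<NAME>, also called <NAME>".
# Names are assumed to consist of letters, digits, and hyphens only.
SYNONYM = re.compile(
    r"([A-Za-z0-9-]+), also (?:known as|called) ([A-Za-z0-9-]+)"
)

def extract_synonyms(text):
    """Return (name, synonym) pairs matched by the toy pattern."""
    return SYNONYM.findall(text)
```

A fixed pattern like this misses most phrasings found in real articles, which is why pattern-learning systems are used in practice.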

38

Examples of Entity-Relationship Extraction

“We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.”

[Figure: extracted entities (CBF-A, CBF-C, CBF-B, CBF-A-CBF-C complex) connected by the relations “interact”, “complex”, and “associates”]

39

Another Example

Z-100 is an arabinomannan extracted from Mycobacterium tuberculosis that has various immunomodulatory activities, such as the induction of interleukin 12, interferon gamma (IFN-gamma) and beta-chemokines. The effects of Z-100 on human immunodeficiency virus type 1 (HIV-1) replication in human monocyte-derived macrophages (MDMs) are investigated in this paper. In MDMs, Z-100 markedly suppressed the replication of not only macrophage-tropic (M-tropic) HIV-1 strain (HIV-1JR-CSF), but also HIV-1 pseudotypes that possessed amphotropic Moloney murine leukemia virus or vesicular stomatitis virus G envelopes. Z-100 was found to inhibit HIV-1 expression, even when added 24 h after infection. In addition, it substantially inhibited the expression of the pNL43lucDeltaenv vector (in which the env gene is defective and the nef gene is replaced with the firefly luciferase gene) when this vector was transfected directly into MDMs. These findings suggest that Z-100 inhibits virus replication, mainly at HIV-1 transcription. However, Z-100 also downregulated expression of the cell surface receptors CD4 and CCR5 in MDMs, suggesting some inhibitory effect on HIV-1 entry. Further experiments revealed that Z-100 induced IFN-beta production in these cells, resulting in induction of the 16-kDa CCAAT/enhancer binding protein (C/EBP) beta transcription factor that represses HIV-1 long terminal repeat transcription. These effects were alleviated by SB 203580, a specific inhibitor of p38 mitogen-activated protein kinases (MAPK), indicating that the p38 MAPK signalling pathway was involved in Z-100-induced repression of HIV-1 replication in MDMs. These findings suggest that Z-100 might be a useful immunomodulator for control of HIV-1 infection.

40

Query

PubMed visualized

Extracted info

Links to databases

AliBaba, Ulf Leser, http://wbi.informatik.hu-berlin.de:8080/

41

Mining Text and Sequence Data

Agichtein & Eskin, PSB 2004

ROC50 scores for each class and method

42






43

Structure and evolution of blogspace [Kumar, Novak, Raghavan, Tomkins, CACM 2004, KDD 2006]

Fraction of nodes in components of various sizes within Flickr and Yahoo! 360 timegraph, by week.

44

Connected Components Visualization

DiseaseOutbreaks, New York Times 1995

Structure of implicit entity-entity networks in text [Agichtein & Gravano, ICDE 2003]

45





Information diffusion/propagation in online sources
  Information propagation on the web, news
  In collaborative sources (wikipedia, MedLine)

46

Thank You

Details: http://www.mathcs.emory.edu/~eugene/
