Scalable Probabilistic Databases with Factor Graphs and MCMC

Michael Wick, Andrew McCallum, and Gerome Miklau
VLDB 2010

[includes some slides from recent McCallum talk]

Outline
  Background of research
  Key contributions
  FACTORIE language
  Models for information extraction
  MCMC with database “assist”
  Experimental results
  Implications for information extraction more generally

Background of research
  McCallum: an ML researcher crossing the bridge to DB
    Mostly tools and apps (incl. IE) for undirected models
  “Probabilistic databases” undergoing significant evolution (see survey by Dalvi et al., CACM, 2009):
    Early PDB systems attached probabilities to tuples:
      0.7: Employs(IBM,John)
      0.95: Employs(IBM,Mary)
      etc.
    Aggregation queries etc. under global independence
    Around 2005, model-based approaches took over, but faced the same issues (expressive power, complexity) as in AI
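
To make the early tuple-level picture concrete, here is a minimal sketch of an aggregation query under global independence. It uses the two tuple probabilities from the slide; the choice of queries (expected number of Employs tuples, and the probability that at least one holds) and all variable names are assumptions made for illustration.

```python
import math

# Tuple probabilities taken from the slide; tuples are assumed independent.
employs = {("IBM", "John"): 0.7, ("IBM", "Mary"): 0.95}

# Expected count: by linearity of expectation, just the sum of the probabilities.
expected_count = sum(employs.values())

# Probability that at least one tuple is true: one minus the probability
# that all are false, which factorizes under independence.
p_at_least_one = 1.0 - math.prod(1.0 - p for p in employs.values())
```

With these numbers the expected count is about 1.65 and the probability of at least one true tuple is about 0.985. The factorization is what breaks once tuples are correlated, which is where model-based approaches come in.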

Key contributions
  Increasingly sophisticated CRF-like models for extraction, entity resolution, schema mapping, etc.
  FACTORIE for model construction and inference
  Efficient MCMC inference on relational worlds
    Handles very large models without blowing up
    Efficient local computation for each MC step
  Integration with database technology:
    Possible world = database, MC step = database update
    Query evaluation directly on database
    Incremental re-evaluation after each MC step

Factor graphs
  Nodes are variables and factors (potentials on sets of variables)
  Links connect variables to factors that include them
  P(x_1, …, x_n) = (1/Z) Π_j F_j(s_j) and (in this paper)
    F_j(s_j) = exp(ϕ_j(s_j) · θ_j) with features ϕ_j
  FACTORIE uses loops in a way analogous to BUGS (plates)
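
The log-linear factorization above can be sketched on a toy model. This is not FACTORIE code; the three binary variables, the single "agree" feature, and the weights are all hypothetical, and Z is computed by brute-force enumeration, which is only feasible for tiny models.

```python
import math
from itertools import product

def agree(a, b):
    # Feature vector phi_j with one indicator: 1.0 if the two values agree.
    return [1.0 if a == b else 0.0]

# Each factor: (scope s_j as variable indices, feature function phi_j, weights theta_j).
factors = [
    ((0, 1), agree, [1.5]),
    ((1, 2), agree, [0.5]),
]

def log_score(x):
    """Sum over factors of phi_j(s_j) . theta_j, i.e. the log of the unnormalized product."""
    total = 0.0
    for scope, phi, theta in factors:
        feats = phi(*(x[i] for i in scope))
        total += sum(f * w for f, w in zip(feats, theta))
    return total

# Enumerate all 2^3 worlds to obtain the normalizer Z.
worlds = list(product([0, 1], repeat=3))
Z = sum(math.exp(log_score(x)) for x in worlds)

def prob(x):
    return math.exp(log_score(x)) / Z
```

The enumeration over all worlds is exactly the step that does not scale, which motivates sampling-based inference.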

MCMC (Metropolis-Hastings)
  Worlds x, evidence e, posterior π(x) = P(x | e) = P(x,e)/P(e)
  Proposal distribution q(x’ | x) determines neighborhood of x
  MH samples x’ from q(x’ | x), accepts with probability
    α(x’ | x) = min(1, [π(x’) q(x | x’)] / [π(x) q(x’ | x)])
              = min(1, [P(x’,e) q(x | x’)] / [P(x,e) q(x’ | x)])
  For graphical models (and BLOG), P(x,e) is a product of local conditional probabilities (or potentials)
  If the change from x to x’ is local (e.g., a single tuple becomes true or false), almost all terms in P(x,e) and P(x’,e) cancel out
  Hence the per-step computation cost is independent of model size
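
The cancellation argument can be illustrated with a small sketch. The model here is hypothetical (a chain of binary variables with one agreement weight W per neighbouring pair), and the proposal is a uniform single-variable flip, which is symmetric, so the q terms drop out of the acceptance ratio. Only the factors touching the flipped variable are evaluated.

```python
import math
import random

W = 0.8  # hypothetical weight added to the log-score when neighbours agree

def full_log_score(x):
    """log P(x,e) up to a constant; touches every factor, so cost grows with len(x)."""
    return sum(W for i in range(len(x) - 1) if x[i] == x[i + 1])

def local_log_score(x, i):
    """Only the factors that mention variable i (at most two in a chain)."""
    s = 0.0
    if i > 0 and x[i - 1] == x[i]:
        s += W
    if i < len(x) - 1 and x[i] == x[i + 1]:
        s += W
    return s

def mh_step(x, rng):
    """One Metropolis-Hastings step; modifies x in place, returns True if accepted."""
    i = rng.randrange(len(x))        # proposal: pick one variable uniformly
    before = local_log_score(x, i)
    x[i] = 1 - x[i]                  # flip it (symmetric proposal, so q cancels)
    after = local_log_score(x, i)
    log_ratio = after - before       # all factors not touching i cancel
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return True
    x[i] = 1 - x[i]                  # reject: undo the flip
    return False
```

Note that mh_step never calls full_log_score; the work per step is bounded by the number of factors adjacent to the changed variable, regardless of chain length.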

MCMC on values

[Figure sequence: node labels B(H1) to B(H5), A(H1) to A(H5), Earthquake(Ra), and Earthquake(Rb), repeated across several panels.]

Integration with DB technology
  Databases are designed for
    storing lots of data
    efficient processing of queries on lots of data
  How much can we borrow from DB technology to help with probabilistic IE?

Optimizing query evaluation
  In databases, running a query can be expensive, especially if it involves scanning all the data:
    Aggregation, e.g., #{x,y: R(x,y) ^ R(y,x)}
    Quantifier alternation, chains of literals, etc.
  A materialized view is a cached database table representation of any query result
  Incremental view maintenance updates the materialized view whenever any tuple changes, without recomputing it from scratch
    E.g., if R(A,B) is set to true, check R(B,A) and add 1
  So the query can be re-evaluated much faster after each MC step
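
The R(A,B) example can be sketched in a few lines. This is an in-memory illustration, not database code. To match the slide's "add 1", the count is assumed to be over unordered pairs {a, b} with a ≠ b; the class and method names are hypothetical.

```python
class SymmetricPairCount:
    """Maintains the number of unordered pairs {a, b}, a != b, with R(a,b) and R(b,a) both true."""

    def __init__(self):
        self.R = set()   # the true tuples of R
        self.count = 0   # the materialized query result

    def set_true(self, a, b):
        if (a, b) in self.R:
            return                       # no change, nothing to maintain
        self.R.add((a, b))
        if a != b and (b, a) in self.R:  # one lookup instead of a full scan
            self.count += 1

    def set_false(self, a, b):
        if (a, b) not in self.R:
            return
        self.R.remove((a, b))
        if a != b and (b, a) in self.R:
            self.count -= 1

    def recompute(self):
        """Full scan over R, shown only for comparison with the incremental count."""
        return sum(1 for (a, b) in self.R if a < b and (b, a) in self.R)
```

Each update costs a constant number of set operations, whereas recompute scans every tuple; after an MC step that changes one tuple, only the incremental path is needed.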

Drawbacks of black-box DB technology
  Modifying tuples in a disk-resident DB is expensive
  DB technology designed mostly for atomic transactions; 500/second on $10K system
  Difficult to add new types of optimization, e.g., maintaining efficient summaries (min, etc.)
  Not suitable for some data types, e.g., images
  A “database” sounds like a “possible world”, but only under Herbrand semantics

Experiments - NER

Skip-chain CRF includes links between labels for identical tokens (but not across docs!!)

Experiments - NER
  Proposal distribution:
    Choose up to five documents at random
    Choose one label variable at random among these
    Choose a label at random
  Data: 1788 NYT articles
  Query: # B-PER labels (evaluate every 10k MC steps)
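
A rough sketch of the three-step proposal and the B-PER count query follows. Several details are assumptions not stated on the slide: the label set, reading "up to five" as a uniformly chosen number between 1 and 5, and choosing the label variable uniformly over all tokens of the selected documents. The acceptance test is omitted; only proposal generation and the incremental count update are shown.

```python
import random

# Hypothetical label set; the slide only names B-PER.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]

def propose(docs, rng, max_docs=5):
    """docs: list of non-empty label lists. Returns (doc index, token index, new label)."""
    k = rng.randint(1, min(max_docs, len(docs)))      # "up to five" documents (assumed reading)
    chosen = rng.sample(range(len(docs)), k)
    # One label variable, uniformly among the tokens of the chosen documents.
    positions = [(d, t) for d in chosen for t in range(len(docs[d]))]
    d, t = rng.choice(positions)
    return d, t, rng.choice(LABELS)                   # a label at random

def count_b_per(docs):
    """The query, evaluated by a full scan."""
    return sum(lab == "B-PER" for doc in docs for lab in doc)

def apply_update(docs, d, t, new_label, count):
    """Apply one label change and maintain the B-PER count incrementally."""
    old = docs[d][t]
    docs[d][t] = new_label
    return count + (new_label == "B-PER") - (old == "B-PER")
```

Because a step changes a single label, the query answer changes by at most one, so the count never needs a full rescan between samples.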

17650 plus/minus 50

Essentially each B-PER decision is independent; too many parameters, too little context, no parameter uncertainty!

Summary
  A serious attempt to create scalable, nontrivial probability models and inference technology for IE
  Experiments not totally convincing:
    Efficiency: documents are independent!
    Reasonableness of answers: counts far too precise
  Not clear if FACTORIE is “elegantly” usable to create very complex models
  Some continuing work…
