PaleoDeepDive: A DeepDive Application




Hazy Research Group, led by Christopher Ré (Zifei Shan, Mikhail Sushkov, Feiran Wang, Ce Zhang)

The Architecture of DeepDive

Rigorous Probabilistic Framework

Executive Summary

Simpler Feature Engineering

PaleoDeepDive

We gratefully acknowledge the support from Toshiba, Google, DARPA DEFT and XDATA, ONR, and NSF. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of Toshiba, Google, DARPA, ONR, NSF, or the US government.

Takeaways

DeepDive enables macroscopic science by building a “dark data” extraction system.

Developers should think about features, not algorithms. It is possible to abstract probabilistic inference for use by domain scientists using standard SQL and Python.

We can achieve quality comparable to (or better than) that of human volunteers.

We demonstrate these three takeaways!

DeepDive is the underlying framework

DeepDive has three features!

Unstructured text

Structured Resources

Relational Input Tables

Variables, factors, and connections between variables and factors

Factor Graph

e.g., Appear(Taxon, Formation) relation

Relational Output Tables

Text spans

Freebase, Bing results…

Statistical Inference & Learning

Feature Extraction

Feature extraction with Python & SQL

DeepDive supports different high-level languages to specify a factor graph, e.g., Markov logic networks, WinBUGS, etc.

PaleoDeepDive is built with a combination of Python and SQL

Python, SQL

Input relations (e.g., Coref task)

Phrase

PID   DOC   TEXT
P1    D1    Columbia Fm.
P2    D1    Columbia

Phrase A is coreferent to phrase B if the edit distance between A and B is smaller than 5 and they appear in the same document.

We write an SQL query to generate all phrase pairs that appear in the same document, and pair it with a Python function.

    SELECT t0.PID, t0.TEXT,
           t1.PID, t1.TEXT
    FROM   Phrase t0, Phrase t1
    WHERE  t0.DOC = t1.DOC
    USEPYTHON pyfunc

We write a Python function to process all phrase pairs and make predictions.

    def pyfunc(p1, t1, p2, t2):
        if edit_dist(t1, t2) < 5:
            emit("Coref", p1, p2)

DeepDive will learn the weight automatically
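The coreference rule above can be tried outside DeepDive with a small self-contained sketch. It is only an illustration under stated assumptions: SQLite stands in for the database, a plain Levenshtein function stands in for `edit_dist`, the `emit` call is replaced by returning tuples, and the hypothetical helper `coref_candidates` adds a `t0.PID < t1.PID` condition (not in the query above) to skip self-pairs and mirrored pairs. No weights are learned here; the sketch only generates the candidate pairs.

```python
import sqlite3


def edit_dist(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def coref_candidates(rows):
    """rows: iterable of (PID, DOC, TEXT). Returns sorted ("Coref", p1, p2) tuples.

    Hypothetical helper: the extra PID ordering condition is an assumption
    added for this sketch, not part of the poster's query.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Phrase (PID TEXT, DOC TEXT, TEXT TEXT)")
    conn.executemany("INSERT INTO Phrase VALUES (?, ?, ?)", rows)
    pairs = conn.execute(
        "SELECT t0.PID, t0.TEXT, t1.PID, t1.TEXT "
        "FROM Phrase t0, Phrase t1 "
        "WHERE t0.DOC = t1.DOC AND t0.PID < t1.PID"
    ).fetchall()
    conn.close()
    # Apply the rule from the poster: same document and edit distance below 5.
    return sorted(("Coref", p1, p2)
                  for p1, t1, p2, t2 in pairs
                  if edit_dist(t1, t2) < 5)
```

With the two rows of the Phrase table, "Columbia Fm." and "Columbia" differ by four deletions, so the pair (P1, P2) is emitted as a Coref candidate.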

Extract, Error analysis, Extractor

Application

Write/Improve extractors

DeepDive is able to integrate a diverse set of signals & feedback

DeepDive supports an “E3 loop” for feature engineering

Domain Experts

Unstructured Data

Structured Knowledge Base

- Training labels
- HTML documents
- Scanned articles
- Maps, photos, images

- Freebase
- Macrostrat
- Dictionaries

- Training labels
- Heuristics & rules
- Hard constraints

More-Structured Signal

More-Supervised Signal

Crowd

The more signals we use, the better quality we can expect!

Our DimmWitted system is able to run Gibbs sampling at a speed of 100 million variables/sec! (http://arxiv.org/abs/1403.7550)
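To make the idea of Gibbs sampling over a factor graph concrete, here is a minimal single-threaded sketch for Boolean variables. It is not DimmWitted and makes no attempt at its throughput; the function name `gibbs_marginals`, the factor encoding, and the default sweep counts are all assumptions chosen for illustration.

```python
import math
import random


def gibbs_marginals(n_vars, factors, n_sweeps=2000, burn_in=200, seed=0):
    """Estimate P(x_i = 1) for Boolean variables in a small factor graph.

    factors: list of (weight, var_ids, fn), where fn maps a tuple of 0/1
    values to 0 or 1. The unnormalized log-probability of an assignment is
    the sum of weight * fn(values) over all factors.
    """
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_vars)]

    # Index factors by variable so each update only touches its neighbors.
    touching = [[] for _ in range(n_vars)]
    for factor in factors:
        for v in set(factor[1]):
            touching[v].append(factor)

    def local_score(v, value):
        # Score of the factors around v when v is set to the given value.
        x[v] = value
        return sum(w * fn(tuple(x[u] for u in vs)) for w, vs, fn in touching[v])

    counts = [0] * n_vars
    for sweep in range(n_sweeps):
        for v in range(n_vars):
            # Conditional log-odds of x[v] = 1 given all other variables.
            delta = local_score(v, 1) - local_score(v, 0)
            p_one = 1.0 / (1.0 + math.exp(-delta))
            x[v] = 1 if rng.random() < p_one else 0
        if sweep >= burn_in:
            for v in range(n_vars):
                counts[v] += x[v]
    return [c / (n_sweeps - burn_in) for c in counts]
```

For example, a single variable with one unary factor of weight log 4 has an exact marginal of 0.8, and the estimate should land close to that value.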

               

Geoscience Research Group, led by Shanan Peters (Jackson Borchardt, Tim Foltz)

PaleoDeepDive: A DeepDive Application

Statistical Inference using Familiar Data-Processing Languages

Try out DeepDive! http://deepdive.stanford.edu

How does climate change impact biodiversity?

T. Rex are found dating to the upper Cretaceous.

(“T. Rex”, “Cretaceous”)

380 volunteer scientists have manually read 11K journal articles since 1994!

Extract biodiversity-related relations from journal articles.

The DeepDive Approach

The central task of PaleoDeepDive is to extract relations from unstructured text automatically:

PaleoDeepDive achieves quality comparable to (or better than) that of human volunteers, at lower cost.

[Figure: Total Diversity plotted against Geological Time (M.A.), with one curve for Human and one for PaleoDeepDive.]

Different Sources of Signals

DeepDive uses a joint probability model that enables rigorous probabilistic interpretation

[Figure: Example Extractor. One panel shows Accuracy, comparing Actual against Ideal; the other shows # Extractions by Output Probability, annotated “Goal”, “Candidates for improvement”, and “Output to users”.]

We expect that 8 of 10 extractions with probability 0.8 will be correct.
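This expectation can be checked by grouping extractions into probability buckets and comparing each bucket's empirical accuracy with its probability range. The sketch below is one simple way to do that, assuming each extraction comes with an output probability and a correctness label; `calibration_table` and its bucket layout are hypothetical names chosen for illustration, not part of DeepDive.

```python
def calibration_table(predictions, n_buckets=10):
    """Bucket predictions by probability and report empirical accuracy.

    predictions: iterable of (probability, is_correct) pairs.
    Returns rows of (bucket_low, bucket_high, count, accuracy) for the
    non-empty buckets, in increasing order of probability.
    """
    counts = [0] * n_buckets
    correct = [0] * n_buckets
    for prob, is_correct in predictions:
        # A probability of exactly 1.0 is placed in the top bucket.
        b = min(int(prob * n_buckets), n_buckets - 1)
        counts[b] += 1
        correct[b] += int(bool(is_correct))
    return [
        (b / n_buckets, (b + 1) / n_buckets, counts[b], correct[b] / counts[b])
        for b in range(n_buckets)
        if counts[b]
    ]
```

A well-calibrated system produces rows whose accuracy falls inside the bucket's range, for instance ten extractions at probability 0.8 of which eight are correct.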

PaleoDeepDive