Managing Spreadsheets Michael Cafarella Zhe Shirley Chen, Jun Chen, Junfeng Zhang, Dan Prevo University of Michigan New England Database Summit February 1, 2013

Page 1:

Managing Spreadsheets

Michael Cafarella

Zhe Shirley Chen, Jun Chen, Junfeng Zhang, Dan Prevo

University of Michigan

New England Database Summit

February 1, 2013

Page 2:

Spreadsheets: The Good Parts

A “Swiss Army Knife” for data: storing, sharing, transforming

Sophisticated users who are not DBAs

Contain lots of data, found nowhere else

Everyone uses them; almost wholly ignored by DB community

Thanks, Jeremy!

Page 3:

Spreadsheets: The Awful Parts

• Users toss in data, worry about schemas later (well, never)
• Spreadsheets designed for humans, not query processors
• No explicit schemas:
  • Poor data integrity (Zeeberg et al, 2004)
  • Integration very hard

• Tumor suppressor gene Deleted In Esophageal Cancer 1
• aka, DEC1
• aka, (according to Excel) 01-DEC
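
The DEC1 mishap can be imitated with a toy auto-formatter. This is a minimal sketch that assumes a cell is treated as a date whenever it looks like a month abbreviation followed by a day number; it is not Excel's actual parsing logic, and the function name `coerce_like_spreadsheet` and the pattern are hypothetical.

```python
import re

MONTHS = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"}

# Hypothetical pattern: three letters followed by a one- or two-digit number.
_DATE_LIKE = re.compile(r"^([A-Za-z]{3})(\d{1,2})$")


def coerce_like_spreadsheet(cell: str) -> str:
    """Imitate eager date coercion: 'DEC1' is read as 1 December."""
    match = _DATE_LIKE.match(cell.strip())
    if match and match.group(1).upper() in MONTHS:
        day = int(match.group(2))
        if 1 <= day <= 31:
            return f"{day:02d}-{match.group(1).upper()}"
    return cell  # anything that does not look like a date is left untouched


for gene in ["DEC1", "TP53", "BRCA1"]:
    print(gene, "->", coerce_like_spreadsheet(gene))
```

Only the first identifier is rewritten, and once the file is saved the original gene name is no longer recoverable from the cell.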

Page 5:

A Data Tragedy

• Spreadsheets build, then entomb, our best, most expensive, data
  • >400,000 just from ClueWeb09
  • From gov’ts, WTO, many other sources
  • How many inside firewall?
• Application vision: Ad-hoc integration & analysis for any dataset
• Challenge: recover relations from any spreadsheet, w/little human effort

Page 6:

Closeup

Desired tuple:

One hierarchy error yields many bad tuples

Too many datasets to process manually

Page 7:

Agenda

• Spreadsheets: An Overview
• Extracting Data
  • Hierarchy Extraction
  • Manual Repairs
• Experimental Results
• Demo
• Related and Future Work

Page 9:

Extracting Tuples

1. Extract frame, attribute hierarchy trees
2. Map values to attributes; create tuples
3. Apply manual repairs, repeat

• How many repairs for 100% accuracy?
• Yields tuples, not relations
• We won’t discuss: relation assembly
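
Step 2 can be sketched as pairing every data cell with the attribute paths of its row and column. The sketch assumes the LEFT and TOP hierarchies have already been extracted as root-to-leaf paths; the function name `make_tuples` and the toy values are hypothetical and do not come from any real spreadsheet.

```python
from typing import Dict, List, Tuple


def make_tuples(
    left_paths: Dict[int, List[str]],      # data row -> attribute path in LEFT tree
    top_paths: Dict[int, List[str]],       # data column -> attribute path in TOP tree
    values: Dict[Tuple[int, int], float],  # (row, column) -> cell value
) -> List[Tuple]:
    """Emit one tuple per data cell: left path + top path + value."""
    tuples = []
    for (row, col), value in sorted(values.items()):
        tuples.append((*left_paths[row], *top_paths[col], value))
    return tuples


# Toy data frame with invented values.
left = {0: ["Male", "Under 18"], 1: ["Male", "18 and over"]}
top = {0: ["2009"], 1: ["2010"]}
cells = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}

rows = make_tuples(left, top, cells)
for row in rows:
    print(row)
```

Because each path is reused for every cell in its row or column, a single wrong parent in a hierarchy corrupts all tuples built from that row or column.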

Page 10:

1. Frame Detection

• Key assumption: inputs are data frames
• Locate metadata in top/left regions
• Locate data in center block

Page 11:

Closeup

Page 12:

1. Frame Detection

• Key assumption: inputs are data frames
  • Locate metadata in top/left regions
  • Locate data in center block
  • ~72% of spreadsheets fit; others not relational
• Each non-empty row labeled one of TITLE, HEADER, DATA, FOOTNOTE
• Reconstruct regions from labels
• Infer labels with linear-chain Conditional Random Field (Lafferty et al, 2001)
  • Layout features: has bold cell? Merged cell?
  • Text features: contains ‘table’, ‘total’? Indented text? Numeric cells? Year cells?
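
Row labeling with a linear chain can be illustrated by Viterbi decoding over the four labels. The sketch below is not a trained CRF: the emission and transition scores are hand-set stand-ins for learned feature weights, and the feature names (`n_cells`, `has_bold`, `numeric_frac`) are hypothetical.

```python
from typing import Dict, List

LABELS = ["TITLE", "HEADER", "DATA", "FOOTNOTE"]


def emission(row: Dict[str, float], label: str) -> float:
    """Hand-set scores standing in for learned feature weights."""
    numeric = row.get("numeric_frac", 0.0)  # fraction of numeric cells
    bold = row.get("has_bold", 0.0)         # 1.0 if any cell is bold
    cells = row.get("n_cells", 1.0)         # number of non-empty cells
    if label == "DATA":
        return 2.0 * numeric
    if label == "HEADER":
        return bold + (0.5 if cells > 1 else -0.5) - numeric
    # TITLE and FOOTNOTE: favor single-cell text rows
    return (1.0 if cells == 1 else -1.0) - numeric


def transition(prev: str, cur: str) -> float:
    """Assumed prior: labels rarely move backwards down the sheet."""
    return 0.0 if LABELS.index(cur) >= LABELS.index(prev) else -5.0


def viterbi(rows: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring label sequence for the rows."""
    score = {lab: emission(rows[0], lab) for lab in LABELS}
    back: List[Dict[str, str]] = []
    for row in rows[1:]:
        new_score, pointers = {}, {}
        for cur in LABELS:
            best_prev = max(LABELS, key=lambda p: score[p] + transition(p, cur))
            new_score[cur] = (score[best_prev] + transition(best_prev, cur)
                              + emission(row, cur))
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    path = [max(LABELS, key=lambda lab: score[lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


sheet = [
    {"n_cells": 1, "has_bold": 1.0, "numeric_frac": 0.0},
    {"n_cells": 4, "has_bold": 1.0, "numeric_frac": 0.0},
    {"n_cells": 4, "numeric_frac": 0.75},
    {"n_cells": 4, "numeric_frac": 0.75},
    {"n_cells": 1, "numeric_frac": 0.0},
]
print(viterbi(sheet))
```

The first and last rows have identical text-only features; the transition scores are what separate TITLE from FOOTNOTE, which is the benefit of labeling rows jointly instead of one at a time.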

Page 13:

2. Hierarchy Extraction

Page 14:

Closeup

Page 16:

2. Hierarchy Extraction

1. One task for TOP, one for LEFT

2. Create boolean random var for each candidate parent relationship

3. Build conditional random field to obtain best variable assignment
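
Step 2 of this list can be sketched as enumerating ordered (parent, child) candidates and giving each a boolean variable. The `window` pruning rule is an assumption made here to keep the candidate set small; the talk does not say how candidates are chosen, and the attribute names are invented.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def candidate_pairs(attrs: List[str], window: int = 3) -> List[Tuple[int, int]]:
    """Enumerate ordered (parent, child) index pairs at most `window` rows apart."""
    pairs = []
    for i, j in combinations(range(len(attrs)), 2):
        if j - i <= window:
            pairs.append((i, j))  # attribute i is a candidate parent of j
            pairs.append((j, i))  # attribute j is a candidate parent of i
    return pairs


left_attrs = ["Male", "Under 18", "18 and over", "Female"]
pairs = candidate_pairs(left_attrs, window=2)

# One boolean variable per candidate; True means "this parent edge is present".
variables: Dict[Tuple[int, int], bool] = {pair: False for pair in pairs}
print(len(variables), "candidate variables")
```

Inference then amounts to choosing the True/False assignment of these variables that scores best.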

Page 17:

2. Hierarchy Extraction

Page 18:

2. Hierarchy Extraction

• CRFs use potential functions to incorporate features
• Node potentials represent single parent/child match
  • Share style? Near each other? WS-separated?
• Edge potentials tie pairs of parent/child decisions
  • Share style pairs? Share text? Indented similarly?
• Spreadsheet potentials ensure a legal tree
  • One-parent potential: -∞ weight for multiple parents
  • Directional potential: -∞ weight when parent edges go in opposite directions
• Run Loopy Belief Propagation for node + edge; post-inference test and repair for spreadsheet
• Real sheets yielded 1K-8K variables; inference <0.13 sec
• Approach adapted from (Pimplikar, Sarawagi, 2012)
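
The interaction between soft node potentials and the hard spreadsheet potentials can be sketched on a tiny example. This sketch swaps in exhaustive search for Loopy Belief Propagation, omits edge potentials, and reads "opposite directions" as "some parents above their children and some below"; that reading, the potential values, and all names are assumptions.

```python
import math
from itertools import product
from typing import Dict, Optional, Tuple

Pair = Tuple[int, int]  # (parent_row, child_row)


def score(assignment: Dict[Pair, bool], node_pot: Dict[Pair, float]) -> float:
    """Sum node potentials of chosen edges; illegal trees score -inf."""
    chosen = [pair for pair, on in assignment.items() if on]
    # One-parent potential: no child may have two parents.
    children = [child for _, child in chosen]
    if len(children) != len(set(children)):
        return -math.inf
    # Directional potential: all parent edges must point the same way.
    directions = {child > parent for parent, child in chosen}
    if len(directions) > 1:
        return -math.inf
    return sum(node_pot[pair] for pair in chosen)


def best_assignment(node_pot: Dict[Pair, float]) -> Optional[Dict[Pair, bool]]:
    """Try every True/False assignment; feasible only for a handful of variables."""
    pairs = list(node_pot)
    best, best_score = None, -math.inf
    for bits in product([False, True], repeat=len(pairs)):
        assignment = dict(zip(pairs, bits))
        current = score(assignment, node_pot)
        if current > best_score:
            best, best_score = assignment, current
    return best


# Invented node potentials for five candidate edges.
node_pot = {(0, 1): 2.0, (0, 2): 1.5, (1, 2): 1.0, (2, 1): 0.5, (1, 0): -1.0}
best = best_assignment(node_pot)
print(sorted(pair for pair, on in best.items() if on))
```

With 1K-8K variables exhaustive search is out of the question, which is why approximate inference is needed on real sheets.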

Page 19:

3. Manual Repair

• User reviews, repairs extraction
• Goal: reduce user burden
  • Extractor makes repeated mistakes, either within spreadsheet or within corpus
  • Headache for user to repeat fixes
• Our sol’n: after each repair, add repair potentials to CRF
  • Links user-repaired nodes to a set of nodes throughout CRF
  • Incorporates info on node similarity
  • Edges are generated heuristically
• After each repair, re-run inference
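
One way to picture repair potentials is as a nudge applied to every node that resembles the repaired one. The sketch below assumes Jaccard similarity over hypothetical feature sets and a simple additive update; the talk only says the links are generated heuristically, so the similarity measure, threshold, and strength are all illustrative.

```python
import math
from typing import Dict, FrozenSet, Tuple

Pair = Tuple[int, int]  # (parent_row, child_row)


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Set-overlap similarity, a stand-in for the node similarity measure."""
    return len(a & b) / len(a | b) if a | b else 0.0


def apply_repair(
    node_pot: Dict[Pair, float],
    features: Dict[Pair, FrozenSet[str]],
    repaired: Pair,
    is_parent: bool,
    strength: float = 2.0,
    min_sim: float = 0.5,
) -> Dict[Pair, float]:
    """Return new node potentials after propagating one user repair."""
    sign = 1.0 if is_parent else -1.0
    updated = dict(node_pot)
    for pair, feats in features.items():
        if pair == repaired:
            continue
        sim = jaccard(feats, features[repaired])
        if sim >= min_sim:
            updated[pair] += sign * strength * sim
    updated[repaired] = sign * math.inf  # the user's decision is pinned
    return updated


features = {
    (0, 1): frozenset({"child_indented", "parent_bold"}),
    (4, 5): frozenset({"child_indented", "parent_bold"}),
    (4, 6): frozenset({"child_indented"}),
    (1, 2): frozenset({"same_style"}),
}
node_pot = {pair: -0.5 for pair in features}

# The user confirms that row 0 is the parent of row 1.
updated = apply_repair(node_pot, features, repaired=(0, 1), is_parent=True)
print(updated)
```

Inference would then be re-run on the updated potentials, so that one fix can flip several similar decisions at once.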

Page 21:

Experiments

• General survey of spreadsheet use
• Evaluate:
  • Standalone extraction accuracy
  • Manual repair effectiveness
• Test sets:
  • SAUS: 1,322 files from 2010 Statistical Abstract of the United States
  • WEB: 410,554 files from 51,252 domains, crawled from ClueWeb09

Page 22:

Spreadsheets in the Wild

Very common for Web-published gov’t data

Domain                # files   % total
bts.gov               12,435    3.03%
census.gov            7,862     1.91%
stat.co.jp            6,633     1.62%
bankofengland.co.uk   5,520     1.34%
ers.usda.gov          4,328     1.05%
agr.gc.ca             4,186     1.02%
wto.org               3,863     0.94%
doh.wa.gov            3,579     0.87%
nsf.gov               2,770     0.67%
nces.ed.gov           2,177     0.53%

Page 23:

Spreadsheets in the Wild

Page 24:

Standalone Extraction

• 100 random H-Sheets from SAUS, WEB
• Three metrics
  • Pairs: parent/child pairs labeled correctly (F1)
  • Tuples: relational tuples labeled correctly (F1)
  • Sheets: % of sheets labeled 100% correctly
• Two methods
  • Baseline uses just formatting, position
  • Hierarchy uses our approach
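
The Pairs and Sheets metrics can be sketched with set arithmetic. This assumes F1 is computed over sets of (parent, child) pairs in the standard way; the helper names and the toy pairs are hypothetical.

```python
from typing import List, Set, Tuple

Pair = Tuple[int, int]  # (parent_row, child_row)


def f1(predicted: Set[Pair], gold: Set[Pair]) -> float:
    """Harmonic mean of precision and recall over parent/child pairs."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def perfect_sheet_rate(sheets: List[Tuple[Set[Pair], Set[Pair]]]) -> float:
    """Fraction of sheets whose predicted pairs match the gold pairs exactly."""
    return sum(pred == gold for pred, gold in sheets) / len(sheets)


gold = {(0, 1), (0, 2), (3, 4)}
pred = {(0, 1), (1, 2), (3, 4)}  # one wrong parent for row 2

print(f1(pred, gold))
print(perfect_sheet_rate([(pred, gold), (gold, gold)]))
```

The Sheets metric is the strictest of the three: a single wrong pair leaves Pairs F1 fairly high but counts the whole sheet as a failure.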

Page 25:

Standalone Extraction

Page 26:

Manual Repair: Effectiveness

• Gather 10 topic areas from SAUS, WEB
• Expert provides ground-truth hierarchies
• Extract; repeatedly repair and recompute

Page 27:

Manual Repair: Ordering

• Good ordering: errors steadily decrease
• Bad: extended periods of slow decrease

Page 28:

End-To-End Extraction

• What is overall utility of our extractor?
• Final metric: Correct tuples per manual repair

Test set    # Tuples   # Errors   # Repairs   Tuples/Repair
SAUS R50    530.76     5.46       2.06        257.65
SAUS Arts   454.8      25.4       13.1        34.72
SAUS Fin.   266.1      29.9       13.5        19.71
WEB R50     520.28     11.38      3.84        135.49
WEB BTS     65.6       2.7        1           65.6
WEB USDA    350.3      6.8        1.7         206.06
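
The last column appears to be # Tuples divided by # Repairs. The check below recomputes it from the figures on this slide; that reading of the metric is inferred from the numbers, not stated in the talk.

```python
# (test set, # tuples, # repairs, reported tuples/repair), copied from the slide.
results = [
    ("SAUS R50", 530.76, 2.06, 257.65),
    ("SAUS Arts", 454.8, 13.1, 34.72),
    ("SAUS Fin.", 266.1, 13.5, 19.71),
    ("WEB R50", 520.28, 3.84, 135.49),
    ("WEB BTS", 65.6, 1.0, 65.6),
    ("WEB USDA", 350.3, 1.7, 206.06),
]

recomputed = {name: round(tuples / repairs, 2)
              for name, tuples, repairs, _ in results}

for name, _, _, reported in results:
    print(f"{name:10s} reported={reported:7.2f} recomputed={recomputed[name]:7.2f}")
```

All six rows agree with the reported values to two decimal places.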

Page 30:

Demo Details

• Ran SAUS corpus through extractor
• Simple ad hoc integration analysis tool on top of extracted data
• Early version of relation reconstruction
• Early version of data ranking, join finding

Page 31:

Related Work

• Spreadsheet as interface
  • (Witkowski et al, 2003), (Liu et al, 2009)
• Spreadsheet extraction
  • User-provided rules: (Ahmad et al, 2003), (Hung et al, 2011)
  • No explicit user rules: (Abraham and Erwig, 2007), (Cunha et al, 2009)
• Ad hoc integration for found data
  • (Cafarella et al, 2009), (Pimplikar and Sarawagi, 2012), (Yakout et al, 2012)
• Semi-automatic data programming
  • Wrangler (Guo, et al, 2011)

Page 32:

Conclusions and Future Work

• Spreadsheet extraction opens new datasets
• Manual repair ensures accuracy, low user burden
• Ongoing and Future Work
  • Relation assembly
  • Data relevance ranking
  • Join finding