Managing Spreadsheets

Michael Cafarella

Zhe Shirley Chen, Jun Chen, Junfeng Zhang, Dan Prevo

University of Michigan

New England Database Summit

February 1, 2013

2

Spreadsheets: The Good Parts

A “Swiss Army Knife” for data: storing, sharing, transforming

Sophisticated users who are not DBAs

Contain lots of data, found nowhere else

Everyone uses them; almost wholly ignored by DB community

Thanks, Jeremy!

3

Spreadsheets: The Awful Parts

Users toss in data, worry about schemas later (well, never)

Spreadsheets designed for humans, not query processors

No explicit schemas: poor data integrity (Zeeberg et al, 2004)

Integration very hard

• Tumor suppressor gene Deleted in Esophageal Cancer 1

• aka DEC1

• aka (according to Excel) 01-DEC

5

A Data Tragedy

Spreadsheets build, then entomb, our best, most expensive data

• >400,000 just from ClueWeb09

• From gov’ts, WTO, many other sources

• How many inside firewall?

Application vision: ad-hoc integration & analysis for any dataset

Challenge: recover relations from any spreadsheet, w/little human effort

6

Closeup

Desired tuple:

One hierarchy error yields many bad tuples

Too many datasets to process manually

7

Agenda

Spreadsheets: An Overview

Extracting Data

• Hierarchy Extraction

• Manual Repairs

Experimental Results

Demo

Related and Future Work

8




9

Extracting Tuples

1. Extract frame, attribute hierarchy trees

2. Map values to attributes; create tuples

3. Apply manual repairs, repeat

How many repairs for 100% accuracy?

Yields tuples, not relations

We won’t discuss: relation assembly
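Step 2 can be illustrated with a minimal Python sketch. It assumes the frame and both attribute hierarchies have already been extracted. The parent maps, the data values, and the helper names `path_to_root` and `make_tuples` are all hypothetical, chosen only for illustration; this is not the extractor from the talk.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical parent maps (attribute -> parent; None marks a root).
LEFT_PARENT: Dict[str, Optional[str]] = {"Male": None, "White": "Male", "Black": "Male"}
TOP_PARENT: Dict[str, Optional[str]] = {"Smokers": None, "Current": "Smokers", "Former": "Smokers"}

# Made-up data cells keyed by (LEFT leaf attribute, TOP leaf attribute).
DATA: Dict[Tuple[str, str], float] = {
    ("White", "Current"): 24.0,
    ("Black", "Former"): 18.5,
}


def path_to_root(label: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the attribute path from the root down to `label`."""
    path: List[str] = []
    node: Optional[str] = label
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def make_tuples(data, left_parent, top_parent) -> List[tuple]:
    """Annotate each data value with its full LEFT and TOP attribute paths."""
    return [
        tuple(path_to_root(row, left_parent) + path_to_root(col, top_parent) + [value])
        for (row, col), value in data.items()
    ]


for t in make_tuples(DATA, LEFT_PARENT, TOP_PARENT):
    print(t)
```

In this toy setup, changing a single entry of `LEFT_PARENT` changes every tuple built from that row, which is one way to see why a single hierarchy error can yield many bad tuples.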

10

1. Frame Detection

Key assumption: inputs are data frames

Locate metadata in top/left regions

Locate data in center block

11

Closeup

12

1. Frame Detection

Key assumption: inputs are data frames

Locate metadata in top/left regions

Locate data in center block

~72% of spreadsheets fit; others not relational

Each non-empty row labeled one of TITLE, HEADER, DATA, FOOTNOTE

Reconstruct regions from labels

Infer labels with linear-chain Conditional Random Field (Lafferty et al, 2001)

• Layout features: has bold cell? Merged cell?

• Text features: contains ‘table’, ‘total’? Indented text? Numeric cells? Year cells?
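The row features and the region-reconstruction step can be sketched in Python as follows. This is only an illustration under stated assumptions: the feature names, the year pattern, and the helpers `row_features` and `regions` are hypothetical, the merged-cell feature is omitted because it needs layout metadata, and the CRF that would consume these features is not implemented here.

```python
import re
from itertools import groupby
from typing import Dict, List, Tuple

YEAR = re.compile(r"^(19|20)\d{2}$")  # assumed definition of a "year cell"


def is_number(cell: str) -> bool:
    """True if the cell parses as a number once thousands separators are removed."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def row_features(cells: List[str], bold: List[bool]) -> Dict[str, bool]:
    """Boolean layout and text features for one non-empty row."""
    text = " ".join(cells).lower()
    stripped = [c.strip() for c in cells if c.strip()]
    return {
        "has_bold_cell": any(bold),
        "contains_table": "table" in text,
        "contains_total": "total" in text,
        "indented_text": bool(cells) and cells[0].startswith(" "),
        "has_numeric_cell": any(is_number(c) for c in stripped),
        "has_year_cell": any(YEAR.match(c) is not None for c in stripped),
    }


def regions(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse per-row labels into (label, first_row, last_row) runs."""
    out, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        out.append((label, start, start + length - 1))
        start += length
    return out


print(row_features(["  Total", "1,234", "2010"], [False, True, False]))
print(regions(["TITLE", "HEADER", "HEADER", "DATA", "DATA", "FOOTNOTE"]))
```

Grouping consecutive identical labels is the simplest possible reading of "reconstruct regions from labels"; a real sheet might need extra handling for interleaved labels.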

13

2. Hierarchy Extraction

14

Closeup

16


1. One task for TOP, one for LEFT

2. Create boolean random var for each candidate parent relationship

3. Build conditional random field to obtain best variable assignment
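Step 2 above can be sketched as enumerating one boolean variable per candidate (child, parent) pair along a single axis. The `window` cutoff that limits which pairs count as candidates, the attribute labels, and the function name `candidate_pairs` are assumptions made for this sketch; the talk does not say how candidates are chosen.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (child index, parent index) along one axis


def candidate_pairs(attrs: List[str], window: int = 3) -> List[Pair]:
    """All ordered (child, parent) pairs at most `window` positions apart."""
    return [
        (child, parent)
        for child, parent in permutations(range(len(attrs)), 2)
        if abs(child - parent) <= window
    ]


# Hypothetical LEFT attribute column, listed top to bottom.
left_attrs = ["Male", "White", "Black", "Female", "White", "Black"]

# One boolean random variable per candidate relationship, initially off.
variables: Dict[Pair, bool] = {pair: False for pair in candidate_pairs(left_attrs)}
print(len(variables))
```

The same routine would be run once for the TOP attributes and once for the LEFT attributes, matching the one-task-per-region split in step 1.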

17


18

2. Hierarchy Extraction

CRFs use potential functions to incorporate features

Node potentials represent single parent/child match

• Share style? Near each other? WS-separated?

Edge potentials tie pairs of parent/child decisions

• Share style pairs? Share text? Indented similarly?

Spreadsheet potentials ensure a legal tree

• One-parent potential: -∞ weight for multiple parents

• Directional potential: -∞ weight when parent edges go in opposite directions

Run Loopy Belief Propagation for node + edge; post-inference test and repair for spreadsheet

Real sheets yielded 1K-8K variables; inference <0.13 sec

Approach adapted from (Pimplikar, Sarawagi, 2012)
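The two spreadsheet potentials can be read as hard constraints on a candidate assignment. The Python sketch below is one possible interpretation, not the talk's implementation: it treats "opposite directions" as some parents lying before their child and others after, and the names `spreadsheet_potential` and `score` and the node weights are hypothetical. It only scores a given assignment; it does not run loopy belief propagation.

```python
import math
from typing import Dict, Tuple

Pair = Tuple[int, int]  # (child index, parent index) along one axis


def spreadsheet_potential(assignment: Dict[Pair, bool]) -> float:
    """Return 0.0 for a legal assignment and -inf for an illegal one."""
    chosen = [pair for pair, on in assignment.items() if on]
    children = [child for child, _ in chosen]
    # One-parent potential: no child may have multiple parents.
    if len(children) != len(set(children)):
        return -math.inf
    # Directional potential: all parent edges must point the same way.
    directions = {parent < child for child, parent in chosen}
    if len(directions) > 1:
        return -math.inf
    return 0.0


def score(assignment: Dict[Pair, bool], node_weight: Dict[Pair, float]) -> float:
    """Sum of node potentials for active pairs, plus the hard constraints."""
    soft = sum(node_weight[pair] for pair, on in assignment.items() if on)
    return soft + spreadsheet_potential(assignment)


legal = {(1, 0): True, (2, 0): True, (2, 1): False}
two_parents = {(2, 0): True, (2, 1): True}
print(spreadsheet_potential(legal), spreadsheet_potential(two_parents))
```

A check of this kind could serve as the post-inference test: any assignment that scores -∞ would be handed to a repair step.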

19

3. Manual Repair

User reviews, repairs extraction

Goal: reduce user burden

Extractor makes repeated mistakes, either within spreadsheet or within corpus

Headache for user to repeat fixes

Our sol’n: after each repair, add repair potentials to CRF

• Links user-repaired nodes to a set of nodes throughout CRF

• Incorporates info on node similarity

• Edges are generated heuristically

After each repair, re-run inference
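The idea of linking a repaired node to similar nodes can be sketched as below. The talk only says that edges are generated heuristically, so the similarity rule used here (same indentation and same bold style), the `Node` fields, and the function names are assumptions invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    text: str
    indent: int
    bold: bool


def similar(a: Node, b: Node) -> bool:
    """Assumed heuristic: nodes are similar if indentation and bold style match."""
    return a.indent == b.indent and a.bold == b.bold


def repair_links(repaired: Node, nodes: List[Node]) -> List[Node]:
    """Nodes that a repair potential would tie to the user-repaired node."""
    return [n for n in nodes if n is not repaired and similar(n, repaired)]


nodes = [Node("White", 2, False), Node("Black", 2, False), Node("Male", 0, True)]
print(repair_links(nodes[0], nodes))
```

After a user fixes the parent of one node, the linked nodes would receive the same correction as a strong hint, and inference would be re-run.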

20




21

Experiments

General survey of spreadsheet use

Evaluate:

• Standalone extraction accuracy

• Manual repair effectiveness

Test sets:

• SAUS: 1,322 files from 2010 Statistical Abstract of the United States

• WEB: 410,554 files from 51,252 domains, crawled from ClueWeb09

22

Spreadsheets in the Wild

Very common for Web-published gov’t data

Domain # files % total

bts.gov 12,435 3.03%

census.gov 7,862 1.91%

stat.co.jp 6,633 1.62%

bankofengland.co.uk 5,520 1.34%

ers.usda.gov 4,328 1.05%

agr.gc.ca 4,186 1.02%

wto.org 3,863 0.94%

doh.wa.gov 3,579 0.87%

nsf.gov 2,770 0.67%

nces.ed.gov 2,177 0.53%

23

Spreadsheets in the Wild

24

Standalone Extraction

100 random H-Sheets from SAUS, WEB

Three metrics

• Pairs: parent/child pairs labeled correctly (F1)

• Tuples: relational tuples labeled correctly (F1)

• Sheets: % of sheets labeled 100% correctly

Two methods

• Baseline uses just formatting, position

• Hierarchy uses our approach
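The Pairs metric is an F1 score over parent/child pairs. A minimal Python sketch of that computation follows; the example pairs are made up, and treating each pair as a (child, parent) tuple of labels is an assumption about how the pairs are represented.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (child label, parent label)


def f1(predicted: Set[Pair], gold: Set[Pair]) -> float:
    """Harmonic mean of precision and recall over parent/child pairs."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example: two of three predicted pairs are correct.
gold = {("White", "Male"), ("Black", "Male"), ("White", "Female")}
predicted = {("White", "Male"), ("Black", "Male"), ("Black", "Female")}
print(f1(predicted, gold))
```

The Tuples metric could be computed the same way by passing sets of extracted tuples instead of pairs.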

25

Standalone Extraction

26

Manual Repair: Effectiveness

Gather 10 topic areas from SAUS, WEB

Expert provides ground-truth hierarchies

Extract; repeatedly repair and recompute

27

Manual Repair: Ordering

Good ordering: errors steadily decrease

Bad: extended periods of slow decrease

28

End-To-End Extraction

What is overall utility of our extractor?

Final metric: correct tuples per manual repair

Dataset      # Tuples   # Errors   # Repairs   Tuples/Repair

SAUS R50     530.76     5.46       2.06        257.65

SAUS Arts    454.8      25.4       13.1        34.72

SAUS Fin.    266.1      29.9       13.5        19.71

WEB R50      520.28     11.38      3.84        135.49

WEB BTS      65.6       2.7        1           65.6

WEB USDA     350.3      6.8        1.7         206.06
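The final column is the number of tuples divided by the number of repairs, rounded to two decimals. The short Python check below recomputes it from the reported figures; the dictionary layout and the function name `tuples_per_repair` are just conveniences for this sketch.

```python
from typing import Dict, Tuple

# (# tuples, # repairs, reported tuples/repair), copied from the slide.
RESULTS: Dict[str, Tuple[float, float, float]] = {
    "SAUS R50": (530.76, 2.06, 257.65),
    "SAUS Arts": (454.8, 13.1, 34.72),
    "SAUS Fin.": (266.1, 13.5, 19.71),
    "WEB R50": (520.28, 3.84, 135.49),
    "WEB BTS": (65.6, 1.0, 65.6),
    "WEB USDA": (350.3, 1.7, 206.06),
}


def tuples_per_repair(tuples: float, repairs: float) -> float:
    """Correct tuples obtained per manual repair, rounded to two decimals."""
    return round(tuples / repairs, 2)


for name, (tuples, repairs, reported) in RESULTS.items():
    print(name, tuples_per_repair(tuples, repairs), reported)
```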

29




30

Demo Details

Ran SAUS corpus through extractor

Simple ad hoc integration analysis tool on top of extracted data

Early version of relation reconstruction

Early version of data ranking, join finding

31

Related Work

Spreadsheet as interface

• (Witkowski et al, 2003), (Liu et al, 2009)

Spreadsheet extraction

• User-provided rules: (Ahmad et al, 2003), (Hung et al, 2011)

• No explicit user rules: (Abraham and Erwig, 2007), (Cunha et al, 2009)

Ad hoc integration for found data

• (Cafarella et al, 2009), (Pimplikar and Sarawagi, 2012), (Yakout et al, 2012)

Semi-automatic data programming

• Wrangler (Guo et al, 2011)

32

Conclusions and Future Work

Spreadsheet extraction opens new datasets

Manual repair ensures accuracy, low user burden

Ongoing and Future Work

• Relation assembly

• Data relevance ranking

• Join finding
