From Dirt to Shovels: Automatic Tool Generation for Ad Hoc Data. David Walker, Princeton University, with David Burke, Kathleen Fisher, Peter White & Kenny Q. Zhu


Post on 18-Dec-2015


Page 1: From Dirt to Shovels: Automatic Tool Generation for Ad Hoc Data David Walker Princeton University with David Burke, Kathleen Fisher, Peter White & Kenny

From Dirt to Shovels: Automatic Tool Generation for Ad Hoc Data

David Walker

Princeton University

with David Burke, Kathleen Fisher, Peter White & Kenny Q. Zhu

Page 2:

who am I?

why am I here?

Page 3:

Our Common Communication Infrastructure

Much information is represented in standardized data formats:
• Web pages in HTML
• Pictures in JPEG
• Movies in MPEG
• “Universal” information format XML
• Standard relational database formats

A plethora of data processing tools:
• Visualizers (browsers display JPEG, HTML, ...)
• Query languages allow users to extract information (SQL, XQuery)
• Programmers get easy access through standard libraries
  ► Java XML libraries --- JAXP
• Many applications handle it natively and convert back and forth
  ► MS Word

Page 4:

Ad Hoc Data

Massive amounts of data are stored in XML, HTML or relational databases but there’s even more data that isn’t.

An ad hoc data format is any nonstandard, but structured data format for which convenient parsing, querying, visualizing, transformation tools are not available. (not natural language)

Page 5:

Ad Hoc Data from Web Server Logs (CLF)

207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30

244.133.108.200 - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/ddorg/confirm HTTP/1.0" 200 941

Page 6:

Ad Hoc Data from Crashreporter.log

Sat Jun 24 06:38:46 2006 crashdump[2164]: Started writing crash report to: /Logs/Crash/Exit/ pro.crash.log

Sun Jun 25 07:23:46 2006 crashreporterd[120]: mach_msg() reply failed: (ipc/send) invalid destination port

Page 7:

AT&T Phone Call Provisioning Data

9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii152272|EDTF_6|0|MARVINS1|UNO|10|1000295291

9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii15222|EDTF_6|0|MARVINS1|UNO|10|1000295291|20|1000295291|17|1001649600|19|1001649600|27|1001649600|29|1001649600|IA0288|1001714400|IE0288|1001714400|EDTF_CRTE|1001908800|EDTF_OS_1|1001995201|16|1021309814|26|1054589982

9152271|9152271|1|0|0|0|0||no_ii152271|EDTF_1|0|SC1MF1F|UNO|EDTF_CRTE|1001649600|EDTF_OS_10|1001649601

9152270|9152270|1|0|0|0|0||no_ii152270|EDTF_1|0|marshak1|UNO|EDTF_CRTE|1001563200|EDTF_OS_10|1001649601

Page 8:

Ad Hoc data from DNS Packets

00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r
00000010: 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.
00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'
00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste
00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........
00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............
00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....
00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail
00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man.............
00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000 ................
000000a0: 0603 6e73 30c0 0cc0 0c00 0200 0100 000e ..ns0...........
000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100 0100 ......_gc...!...
000000c0: 0002 5800 1d00 0000 640c c404 7068 7973 ..X.....d...phys
000000d0: 0872 6573 6561 7263 6803 6174 7403 636f .research.att.co

Page 9:

Ad Hoc data from www.investors.com

Date: 3/21/2005 1:00PM PACIFIC
Investor's Business Daily ®
Stock List Name: DAVE

Stock Company Price Price Volume EPS RS
Symbol Name Price Change % Change % Change Rating Rating

AET Aetna Inc 73.68 -0.22 0% 31% 64 93
GE General Electric Co 36.01 0.13 0% -8% 59 56
HD Home Depot Inc 37.99 -0.89 -2% 63% 84 38
IBM Intl Business Machines 89.51 0.23 0% -13% 66 35
INTC Intel Corp 23.50 0.09 0% -47% 39 33

Data provided by William O'Neil + Co., Inc. © 2005. All Rights Reserved.
Investor's Business Daily is a registered trademark of Investor's Business Daily, Inc.
Reproduction or redistribution other than for personal use is prohibited.
All prices are delayed at least 20 minutes.

Page 10:

Ad Hoc data from www.geneontology.org

!autogenerated-by: DAG-Edit version 1.419 rev 3
!saved-by: gocvs
!date: Fri Mar 18 21:00:28 PST 2005
!version: $Revision: 3.223 $
!type: % is_a is a
!type: < part_of part of
!type: ^ inverse_of inverse of
!type: | disjoint_from disjoint from
$Gene_Ontology ; GO:0003673
<biological_process ; GO:0008150
%behavior ; GO:0007610 ; synonym:behaviour
%adult behavior ; GO:0030534 ; synonym:adult behaviour
%adult feeding behavior ; GO:0008343 ; synonym:adult feeding behaviour
% feeding behavior ; GO:0007631
%adult locomotory behavior ; GO:0008344 ;

...

Page 11:

The Challenge of Ad Hoc Data

Data arrives “as is.”

Documentation is often out-of-date or nonexistent.

Data is buggy.
• Missing data, “extra” data, …
• Human error, malfunctioning machines, software bugs (e.g. race conditions on log entries), …
• Errors are sometimes the most interesting portion of the data.

Data sources may be enormous
• AT&T sources can generate up to 2GB/second

There are no software libraries, manuals, or armies of consultants to help you....

Page 12:

Goal: An end-to-end, real-time data analysis, transformation and programming framework

[Diagram labels: Email; Raw Data; ASCII log files; Binary Traces; Data Entry: Create Format Description; Data Analysis; Data Exit: Data Transformation; External Systems]

• Description libraries
• Automatic inference
• Manual customization
• Visual support

• database queries
• grep support
• google-style search
• binary viewer/editor

• anomaly detection
• statistical classification
• format-independent algorithms
• plug-and-play

• export to XML, HTML, S, database, Excel
• language support for custom rewriting
• plug-and-play

Page 13:

The PADS System (version 1.0) [pldi 05, popl 06, popl 07]

[Diagram: PADS Data Description (written by hand) → PADS Compiler → Generated Libraries (Parsing, Printing, Traversal) + PADS Runtime System (I/O, Error Handling). “Ad Hoc” Data Source → XML Converter → XML; Data Profiler → Analysis Report; Graphing Tool → Graph Information; Query Engine; Custom App → ?. The tools are generic description-directed programs, coded once.]

Page 14:

Trivial Example

Data Sources:

“0, 24”

“foo, 16”

“bar, end”

Description:

type payload = union { int32 i; stringFW(3) s2; };

type source = struct { ‘\”’; payload p1; “,”; payload p2; ‘\”’; }

Key points to know:
• Descriptions based on programming language “types”
• Broad collection of “base types” (ints, strings, dates, ip addresses...)
• Structured types include “structs,” “unions” and “arrays”
• ... but has many other features: dependency, constraints, recursion, ...
• has formal semantics & proven properties
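To make the trivial description concrete, here is a minimal Python sketch of the kind of parser such a description denotes. It is hand-written for illustration and is not the code the PADS compiler generates. The function names (`parse_payload`, `parse_source`, `expect`) are hypothetical, the input uses straight ASCII quotes, and skipping blanks after the comma is an assumption made so the three sample sources parse.

```python
import re

INT_RE = re.compile(r"-?\d+")

def expect(text, pos, literal):
    """Consume a literal at pos or fail (struct literal fields)."""
    if not text.startswith(literal, pos):
        raise ValueError(f"expected {literal!r} at offset {pos}")
    return pos + len(literal)

def parse_payload(text, pos):
    """Ordered union: try the int32 branch, then the fixed-width string."""
    m = INT_RE.match(text, pos)
    if m:
        return ("i", int(m.group())), m.end()
    if pos + 3 <= len(text):
        return ("s2", text[pos:pos + 3]), pos + 3   # stringFW(3)
    raise ValueError(f"no union branch matches at offset {pos}")

def parse_source(text):
    """struct { '"'; payload p1; ","; payload p2; '"' }"""
    pos = expect(text, 0, '"')
    p1, pos = parse_payload(text, pos)
    pos = expect(text, pos, ",")
    while pos < len(text) and text[pos] == " ":     # assumption: skip blanks
        pos += 1
    p2, pos = parse_payload(text, pos)
    pos = expect(text, pos, '"')
    if pos != len(text):
        raise ValueError("trailing input")
    return {"p1": p1, "p2": p2}

for line in ['"0, 24"', '"foo, 16"', '"bar, end"']:
    print(line, "->", parse_source(line))
```

Each result records which union branch was taken, which is the information the later tagging step relies on.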

Page 15:

The PADS System (version 2.0)

[Diagram: Raw Data → Format Inference (Tokenization → Structure Discovery → Format Refinement, with a Scoring Function) → Data Description → PADS Compiler → Profiler → Analysis Report; XMLifier → XML]

Page 16:

Structure Discovery: Overview

Top-down, divide-and-conquer algorithm:
• Compute various statistics from tokenized data
• Guess a top-level type constructor
• Partition tokenized data into smaller chunks
• Recursively analyze and compute types from smaller chunks

tokenize:

“0, 24” → “ INT , INT ”

“foo, 16” → “ STR , INT ”

“bar, end” → “ STR , STR ”
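The tokenize step can be sketched in a few lines of Python. This is an illustration only: the token names and regular expressions below are assumptions chosen to cover the three sample sources, not the token set used by the actual system.

```python
import re

# Hypothetical token classes, tried in order; whitespace is discarded.
TOKEN_SPEC = [
    ("INT", r"-?\d+"),
    ("STR", r"[A-Za-z]+"),
    ("QUOTE", r'"'),
    ("COMMA", r","),
    ("WS", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(line):
    """Return a list of (token_kind, lexeme) pairs for one source line."""
    tokens, pos = [], 0
    while pos < len(line):
        m = TOKEN_RE.match(line, pos)
        if m is None:
            raise ValueError(f"cannot tokenize at offset {pos}: {line[pos:]!r}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

for line in ['"0, 24"', '"foo, 16"', '"bar, end"']:
    print(line, "->", [kind for kind, _ in tokenize(line)])
```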

Page 17:

Structure Discovery: Overview

[Diagram: discover. From the token sequences “ INT , INT ”, “ STR , INT ”, “ STR , STR ”, the candidate structure so far is a struct of the form “ ? , ? ”. The sources are partitioned into chunks for the first “?” (INT, STR, STR) and for the second “?” (INT, INT, STR).]

Page 18:

Structure Discovery: Overview

[Diagram: discover, continued. Starting from the struct “ ? , ? ”, the first “?” is refined into a union whose branches come from the chunks INT and STR, STR; the second “?” (chunks INT, INT, STR) is still to be analyzed.]

Page 19:

Structure Discovery: Details

Compute frequency distribution histogram for each token. (And recompute at every level of recursion).

“ INT , INT ”

“ STR , INT ”

“ STR , STR ”

[Histogram: for each token (Quote, Comma, Integer, String), percentage of sources (0 to 100) against number of occurrences per source (1, 2)]
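The per-token histograms can be computed as below. This is a minimal sketch, assuming each source has already been reduced to a list of token kinds; `token_histograms` is a hypothetical helper name, and fractions of sources stand in for the percentages on the slide.

```python
from collections import Counter

def token_histograms(chunks):
    """Map each token kind to {occurrences per source: fraction of sources}."""
    n = len(chunks)
    kinds = sorted({kind for chunk in chunks for kind in chunk})
    hists = {}
    for kind in kinds:
        # How many sources contain this token 0 times, once, twice, ...
        per_source = Counter(chunk.count(kind) for chunk in chunks)
        hists[kind] = {occ: cnt / n for occ, cnt in sorted(per_source.items())}
    return hists

chunks = [
    ["QUOTE", "INT", "COMMA", "INT", "QUOTE"],
    ["QUOTE", "STR", "COMMA", "INT", "QUOTE"],
    ["QUOTE", "STR", "COMMA", "STR", "QUOTE"],
]
for kind, hist in token_histograms(chunks).items():
    print(kind, hist)
```

On the three sample sources, Quote occurs exactly twice and Comma exactly once in every source, while Integer and String are spread over zero, one and two occurrences.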

Page 20:

Structure Discovery: Details

Cluster tokens into groups with similar histograms
• Similar histograms
  ► strong evidence tokens coexist in same description component
  ► use symmetric relative entropy to measure similarity
• Only the “shape” of the histogram matters
  ► normalize histograms by sorting columns in descending size
  ► result: comma & quote grouped together

[Histogram repeated from Page 19]
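A sketch of the clustering step follows. Several choices here are assumptions rather than the system's actual method: the symmetric form used is the average of the two relative entropies to the midpoint distribution (one standard symmetrization), the grouping is a simple greedy pass, and the `threshold` value is made up for illustration.

```python
import math

def normalize(hist, width):
    """Keep only the shape: sort columns by descending size, pad with zeros."""
    cols = sorted(hist.values(), reverse=True)
    return cols + [0.0] * (width - len(cols))

def relative_entropy(p, q):
    """R(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_relative_entropy(p, q):
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * relative_entropy(p, mid) + 0.5 * relative_entropy(q, mid)

def cluster(hists, threshold=0.01):
    """Greedily group tokens whose normalized histograms are close."""
    width = max(len(h) for h in hists.values())
    shapes = {tok: normalize(h, width) for tok, h in hists.items()}
    groups = []
    for tok in sorted(shapes):
        for group in groups:
            if symmetric_relative_entropy(shapes[tok], shapes[group[0]]) <= threshold:
                group.append(tok)
                break
        else:
            groups.append([tok])
    return groups

example_hists = {
    "QUOTE": {2: 1.0},
    "COMMA": {1: 1.0},
    "INT": {0: 1 / 3, 1: 1 / 3, 2: 1 / 3},
    "STR": {0: 1 / 3, 1: 1 / 3, 2: 1 / 3},
}
print(cluster(example_hists))
```

Because only the shape matters, Quote (always twice) and Comma (always once) normalize to the same histogram and land in one group.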

Page 21:

Structure Discovery: Details

Find most promising token group to divide and conquer:
• Structs == Groups with high coverage & low “residual mass”
• Arrays == Groups with high coverage, sufficient width & high “residual mass”
• Unions == Other token groups

Struct involving comma, quote identified in histogram above

Overall procedure gives good starting point for rewriting system

[Histogram repeated from Page 19]
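The three rules above can be read as a small decision procedure. The sketch below is one possible reading, not the system's definition: the precise formulas for coverage, residual mass and width, and the thresholds `min_coverage`, `max_mass` and `min_width`, are all assumptions made for illustration.

```python
def coverage(hist):
    """Fraction of sources in which the token occurs at least once."""
    return sum(frac for occ, frac in hist.items() if occ > 0)

def residual_mass(hist):
    """Share of token-bearing sources outside the most common nonzero count."""
    nonzero = [frac for occ, frac in hist.items() if occ > 0]
    total = sum(nonzero)
    return 1.0 - max(nonzero) / total if total else 1.0

def width(hist):
    """Number of distinct nonzero occurrence counts observed."""
    return sum(1 for occ, frac in hist.items() if occ > 0 and frac > 0)

def classify(group_hists, min_coverage=0.9, max_mass=0.1, min_width=3):
    """Guess a type constructor for a group of token histograms."""
    cov = min(coverage(h) for h in group_hists)
    rm = max(residual_mass(h) for h in group_hists)
    wid = max(width(h) for h in group_hists)
    if cov >= min_coverage and rm <= max_mass:
        return "struct"
    if cov >= min_coverage and wid >= min_width:
        return "array"
    return "union"

# Quote/Comma group from the running example: present in every source,
# always with the same count.
print(classify([{2: 1.0}, {1: 1.0}]))
```

A token that appears in every source but with widely varying counts (for example a field separator in variable-length records) would fall into the array case under these assumed thresholds.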

Page 22:

Format Refinement

Reanalyze example data with aid of rough description

Rewrite format description to:
• simplify presentation
  ► merge & rewrite structures
• improve precision
  ► reorganize description structure
  ► add constraints (sortedness, uniqueness, linear relations, functional dependencies)
• fill in missing details
  ► find completions where structure discovery bottoms out
  ► refine base types (termination conditions for strings, integer sizes)

Page 23:

Format Refinement

Three main sub-phases
• Phase 1: Tagging/Table generation
  ► Convert rough description into tagged description + relational table
• Phase 2: Constraint inference
  ► Analyze table and infer constraints
  ► Use TANE algorithm [Huhtala et al. 99]
• Phase 3: Format rewriting
  ► Use inferred constraints & type isomorphisms to rewrite rough description
  ► Greedy search to optimize information-theoretic score
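Phases 1 and 2 can be illustrated on records of the form “0, 24” and “foo, beg”. The sketch below is a brute-force stand-in and is not the TANE algorithm: it only looks for constant columns and pairwise-equal columns. The column names `id1` to `id6` and the helper names are hypothetical, and the records use straight ASCII quotes.

```python
import re

RECORD_RE = re.compile(r'"(\w+), (\w+)"')
COLUMNS = ("id1", "id2", "id3", "id4", "id5", "id6")

def to_row(record):
    """Phase 1: tag a record. id1/id2 hold the union branch taken (1 = int,
    2 = alpha); id3..id6 hold the leaf values, None where a branch is unused."""
    m = RECORD_RE.fullmatch(record)
    if m is None:
        raise ValueError(f"record does not fit the rough description: {record!r}")
    first, second = m.groups()
    row = dict.fromkeys(COLUMNS)
    if first.isdigit():
        row["id1"], row["id3"] = 1, int(first)
    else:
        row["id1"], row["id4"] = 2, first
    if second.isdigit():
        row["id2"], row["id5"] = 1, int(second)
    else:
        row["id2"], row["id6"] = 2, second
    return row

def constant_columns(rows):
    """Phase 2a: columns whose non-missing entries all share one value."""
    found = {}
    for col in COLUMNS:
        values = {r[col] for r in rows if r[col] is not None}
        if len(values) == 1:
            found[col] = values.pop()
    return found

def equal_columns(rows):
    """Phase 2b: pairs of columns that agree on every row."""
    return [(a, b) for i, a in enumerate(COLUMNS) for b in COLUMNS[i + 1:]
            if all(r[a] == r[b] for r in rows)]

records = ['"0, 24"', '"foo, beg"', '"bar, end"', '"0, 56"',
           '"baz, middle"', '"0, 12"', '"0, 33"']
rows = [to_row(r) for r in records]
print(constant_columns(rows))   # constant leaf values
print(equal_columns(rows))      # unions that always take the same branch
```

On these records the sketch finds that the first integer is always 0 and that both unions always take the same branch, which is the kind of constraint the rewriting phase can then exploit.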

Page 24:

Refinement: Simple Example

Page 25:

“0, 24”
“foo, beg”
“bar, end”
“0, 56”
“baz, middle”
“0, 12”
“0, 33”
…

Page 26:

[Diagram: structure discovery on the same records (“0, 24”, “foo, beg”, “bar, end”, …) yields struct { “ union(int | alpha) , union(int | alpha) ” }]

Page 27:

[Diagram: tagging/table gen. The discovered description is tagged as struct { “ union (id1) { int (id3) | alpha (id4) } , union (id2) { int (id5) | alpha (id6) } ” } and the records are loaded into a relational table:]

id1  id2  id3  id4  id5  id6
1    1    0    --   24   --
2    2    --   foo  --   beg
...  ...  ...  ...  ...  ...

Page 28:

[Diagram: constraint inference on the tagged table (columns id1 to id6) yields:]

id3 = 0

id1 = id2 (first union is “int” whenever second union is “int”)

Page 29:

[Diagram: rule-based structure rewriting. Using the constraints id3 = 0 and id1 = id2, the description struct { “ union(int | str) , union(int | str) ” } is rewritten to struct { “ union( struct { 0 , int } | struct { str , str } ) ” }]

more accurate:
-- first int = 0
-- rules out “int , alpha-string” records

Page 30:

Biggest Weakness

Degree of success often hinges on the inference system having a tokenization scheme that matches the tokenization scheme of the data source.

Good tokens capture high-level, human abstractions compactly.

Techniques for learning tokenizations from data directly?

Techniques for using multiple, ambiguous tokenization schemes simultaneously?

Page 31:

Related Work

Most common domains for grammar inference:
• xml/html
• natural language

Systems that focus on ad hoc data are rare, and the few that exist don’t support the PADS tool suite:
• Rufus system ’93, TSIMMIS ’94, Potter’s Wheel ’01

Top-down structure discovery
• Arasu & Garcia-Molina ’03 (extracting data from web pages)

Grammar induction using MDL & grammar rewriting search
• Stolcke and Omohundro ’94 “Inducing probabilistic grammars...”
• T. W. Hong ’02, Ph.D. thesis on information extraction from web pages
• Higuera ’01 “Current trends in grammar induction”

Page 32:

Conclusions

Still a work in progress, but we are able to produce XML and statistical reports fully automatically from ad hoc data sources.

We’ve tested on approximately 15 real, mostly systemy data sources (web logs, crash reports, AT&T phone call data, etc.) with what we believe is relatively good success

For papers & software, see our website at: http://www.padsproj.org/

[email protected]@cs.princeton.edu

Page 33:

End