Cimple 1.0: A Community Information Management Workbench
Pedro DeRose, University of Wisconsin-Madison
Preliminary Examination


Page 1: Pedro DeRose University of Wisconsin-Madison

Pedro DeRose, University of Wisconsin-Madison

Cimple 1.0: A Community Information Management Workbench

Preliminary Examination

Page 2

The CIM Problem

Numerous online communities: database researchers, movie fans, legal professionals, bioinformatics, enterprise intranets, tech support groups

Each community = many data sources + many members

Database community: home pages, project pages, DBworld, DBLP, conference pages...

Movie fan community: review sites, movie home pages, theatre listings...

Legal profession community: law firm home pages...

Page 3

The CIM Problem

Members often want to discover, query, and monitor information in the community

Database community: What is new in the past week in the database community? Any interesting connections between researchers X and Y? Find all citations of this paper in the past week on the Web. What are the current hot topics? Who has moved where?

Legal profession community: Which lawyers have moved where? Which law firms have taken on which cases?

Solving this problem is becoming increasingly crucial; initial efforts at UW, Y!R, MSR, Washington, IBM Almaden

Page 4

Planned Contributions of Thesis
1. Decompose the overall CIM problem [IEEE 06, CIDR 07, VLDB 07]

[Figure: web pages from community data sources are crawled daily (Day 1 ... Day n) and turned into ER graphs (e.g., HV Jagadish served in SIGMOD-07); incremental expansion and leveraging the community feed a community portal offering user services: keyword search, query, browse, mine ...]

Page 5

Planned Contributions of Thesis
2. Provide concrete solutions to key sub-problems

• creating ER graphs: key novelty is composing plans from operators [VLDB 07]; facilitates developing, maintaining, and optimizing
• leveraging communities: wiki solution with key novelties [ICDE 08]; combines contributions from both humans and machines; combines both structured and text contributions

[Figure: example plans composed from operators such as ExtractMbyName, MatchMbyName, MatchMStrict, Union, and CreateE over sources {s1 ... sn} and DBLP, and ExtractLabel and CreateR over conference main pages, producing conference entities, person entities, and c(person, label) relationships.]

Page 6

Planned Contributions of Thesis
3. Capture solutions in the Cimple 1.0 workbench [VLDB 07]

• empty portal shell, including basic services and admin tools
• set of general operators, and means to compose them

• simple implementation of operators

• end-to-end development methodology

• extraction/integration plan optimizers

• developers can employ Cimple tools to quickly build and maintain portals

• will release publicly to drive research and evaluation in CIM

Page 7

Planned Contributions of Thesis
4. Evaluate solutions and workbench on several domains

• use Cimple to build portals for multiple communities

• evaluate ease of development, extensibility, and accuracy of portal

• have built DBLife, a portal for the database community [CIDR 07, VLDB 07]

• will build a second portal for a non-research domain (e.g., movies, NFL)

Page 8

Planned Thesis Chapters

Selecting initial data sources

Creating the daily ER graph

Merging daily graphs into the global ER graph

Incrementally expanding sources and data

Leveraging community members

Developing the Cimple 1.0 workbench

Evaluating the solutions and workbench

Page 9

Selecting Initial Sources

Current solutions often use a "shotgun" approach: select as many potentially relevant sources as possible. This brings in lots of noisy sources, which can lower accuracy.

Communities often show an 80-20 phenomenon: a small set of sources already covers 80% of interesting activity.

Select these 20% of sources; e.g., for the database community, the sites of prominent researchers, conferences, departments, etc.

Can incrementally expand later, semi-automatically or via mass collaboration.

Crawl sources periodically; e.g., DBLife crawls ~10,000 pages (~160 MB) daily.
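The 80-20 selection idea above can be sketched as a greedy procedure: rank sources by observed activity and keep the smallest prefix that reaches the coverage target. This is an illustrative Python sketch, not Cimple's implementation; the source names and activity counts are invented.

```python
# Hypothetical sketch of 80-20 source selection: greedily pick the most
# active sources until they cover ~80% of observed community activity.

def select_core_sources(activity_by_source, coverage=0.8):
    """Return the smallest set of top sources whose activity share >= coverage."""
    total = sum(activity_by_source.values())
    selected, covered = [], 0
    for source, count in sorted(activity_by_source.items(),
                                key=lambda kv: kv[1], reverse=True):
        if covered / total >= coverage:
            break
        selected.append(source)
        covered += count
    return selected

# Invented activity counts for illustration.
activity = {"DBLP": 500, "DBworld": 300, "conf-pages": 120,
            "misc-blog-1": 40, "misc-blog-2": 40}
print(select_core_sources(activity))  # the two biggest sources already cover 80%
```

In this toy data, two of five sources reach the 80% threshold, matching the "small set of sources covers most activity" observation.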

Page 10

Creating the ER Graph – Current Solutions

Manual, e.g., DBLP: requires a lot of human effort.

Semi-automatic but domain-specific, e.g., Yahoo! Finance, Citeseer: difficult to adapt to new domains.

Semi-automatic and general: many solutions from the database, WWW, and Semantic Web communities, e.g., Rexa, Libra, Flink, Polyphonet, Cora, Deadliner. These often use monolithic solutions (e.g., learning methods such as CRFs); they require little human effort, but can be difficult to tailor to individual communities.

[Figure: ER fragment: HV Jagadish served in SIGMOD-07.]

Page 11

Proposed Solution: A Compositional Approach

[Figure: web pages with mentions flow into two plans. Discover Entities: Union over {s1 ... sn} \ DBLP with ExtractMbyName and MatchMbyName, plus ExtractMbyName and the stricter MatchMStrict over DBLP, then CreateE. Discover Relationships: ExtractLabel over conference main pages with conference and person entities, producing c(person, label), then CreateR. Together they discover entities such as HV Jagadish and SIGMOD-07 and relationships such as served in.]

Page 12

Benefits of the Proposed Solution

Easier to develop, maintain, and extend; e.g., 2 students needed less than 1 week to create the plans for the initial DBLife.

Provides opportunities for optimization; e.g., extraction and integration plans allow for plan rewriting.

Can achieve high accuracy with relatively simple operators by exploiting community properties; e.g., finding people names on seminar pages yields talks with 88% F1.

Page 13

Creating Plans to Discover Entities

[Figure: entity-discovery plan: Union over sources s1 ... sn, then ExtractM, MatchM, and CreateE, producing entities such as Raghu Ramakrishnan.]

Page 14

These operators address well-known problems (mention recognition, entity disambiguation...) with many sophisticated solutions.

In DBLife, we find simple implementations can already work surprisingly well:

• it is often easy to collect entity names from community sources (e.g., DBLP); ExtractMByName then finds variations of those names
• entity names within a community are often unique; MatchMByName matches mentions by name

These simple methods work with 98% F1 in DBLife.

Creating Plans to Discover Entities (cont.)

[Figure: the same entity-discovery plan: Union over sources s1 ... sn, then ExtractM, MatchM, CreateE.]
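The two simple operators described above can be sketched in a few lines. Cimple's real implementations are Perl modules; this Python sketch uses invented helper names and a made-up page, and generates only a few of the many plausible name variations.

```python
import re

def name_variations(full_name):
    """ExtractMByName-style: generate simple variations of a person name."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return {full_name, f"{last}, {first}", f"{first[0]}. {last}"}

def extract_mentions(text, names):
    """Find occurrences of any variation of each known name in a page."""
    mentions = []
    for name in names:
        for var in name_variations(name):
            for m in re.finditer(re.escape(var), text):
                mentions.append({"root": name, "match": var, "pos": m.start()})
    return mentions

def match_by_name(mentions):
    """MatchMByName-style: group mentions that share a root name."""
    groups = {}
    for mention in mentions:
        groups.setdefault(mention["root"], []).append(mention)
    return groups

page = "PC members include J. Naughton (UW) and Jeffrey Naughton."
found = extract_mentions(page, ["Jeffrey Naughton"])
print(len(found), sorted(m["match"] for m in found))
```

Because names are near-unique within a community, grouping by root name is often all the matching that is needed, which is why these simple operators work so well in DBLife.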

Page 15

Extending Plans to Handle Difficult Spots

Can decide which operators to apply where; e.g., stricter operators over more ambiguous data.

Provides optimization opportunities, similarly to relational query plans; see ICDE-07 for a way to optimize such plans.

[Figure: source-aware plan: ExtractMbyName and MatchMbyName over {s1 ... sn} \ DBLP, and ExtractMbyName with the stricter MatchMStrict over DBLP, combined with Union and CreateE. DBLP's "Chen Li" page illustrates the ambiguity: it lists both "Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB 2007." and "Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation.", publications by two different researchers named Chen Li.]
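The source-aware idea above amounts to picking a matcher per source. This is an illustrative Python sketch, not Cimple's MatchMStrict: the strictness criterion (requiring an overlapping coauthor) and all names are assumptions chosen to mirror the Chen Li example.

```python
def match_by_name(m1, m2):
    """Cheap matcher: mentions with the same name refer to the same entity."""
    return m1["name"] == m2["name"]

def match_strict(m1, m2):
    """Stricter matcher: also require an overlapping coauthor before merging."""
    return m1["name"] == m2["name"] and bool(set(m1["coauthors"]) & set(m2["coauthors"]))

def choose_matcher(source):
    """Source-aware plans apply the strict matcher only to ambiguous sources."""
    return match_strict if source == "DBLP" else match_by_name

a = {"name": "Chen Li", "coauthors": {"Bin Wang", "Xiaochun Yang"}}
b = {"name": "Chen Li", "coauthors": {"Ping-Qi Pan", "Jian-Feng Hu"}}
print(choose_matcher("DBLP")(a, b))      # False: the two Chen Li's stay separate
print(choose_matcher("homepage")(a, b))  # True: cheap matcher would merge them
```

Applying the expensive matcher only where the data is ambiguous is also what opens the plan up to rewriting, as in relational query optimization.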

Page 16

Creating Plans to Discover Relationships

Categorize relations into general classes; e.g., co-occur, label, neighborhood...

Then provide operators for each class; e.g., ComputeCoStrength, ExtractLabels, neighborhood selection...

And compose them into a plan for each relation type: this makes plans easier to develop, plans are relatively simple to understand, and new plans can easily be added for new relation types.

Page 17

Illustrating Example: Co-occur

To find the affiliated(person, org) relationship (e.g., affiliated(Raghu, UW Madison), affiliated(Raghu, Yahoo! Research)), categorized as a co-occur relationship...

...compose a simple co-occur plan.

This simple plan already finds affiliations with 80% F1.

[Figure: co-occur plan: Union over sources s1 ... sn, cross product (×) of person entities and org entities, ComputeCoStrength, Select (strength > θ), CreateR.]
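The co-occur plan above can be sketched end to end: count page-level co-occurrences (a stand-in for ComputeCoStrength), keep pairs above a threshold θ (Select), and emit them as relations (CreateR). This is an assumed, simplified strength measure; the page data is invented.

```python
from collections import Counter
from itertools import product

def co_occurrence_relations(pages, theta):
    """Return (person, org) pairs whose page co-occurrence count exceeds theta."""
    strength = Counter()
    for people, orgs in pages:            # mentions found on each page
        for pair in product(people, orgs):
            strength[pair] += 1           # ComputeCoStrength (simplified)
    return {pair for pair, s in strength.items() if s > theta}  # Select + CreateR

# Invented per-page mention sets for illustration.
pages = [({"Raghu"}, {"UW-Madison"}),
         ({"Raghu"}, {"UW-Madison", "Yahoo! Research"}),
         ({"Raghu"}, {"Yahoo! Research"}),
         ({"Jeff"}, {"UW-Madison"})]
print(co_occurrence_relations(pages, theta=1))
```

With θ = 1, only pairs seen on at least two pages survive, so the one-off (Jeff, UW-Madison) co-occurrence is filtered out while both of Raghu's affiliations are kept.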

Page 18

Illustrating Example: Label

[Figure: an ICDE'07 (Istanbul, Turkey) page listing General Chairs (Ling Liu, Adnan Yazici), Program Committee Chairs (Asuman Dogac, Tamer Ozsu, Timos Sellis), and Program Committee Members (Ashraf Aboulnaga, Sibel Adali, ...). The plan for served-in(person, conf): ExtractLabel over conference main pages with conference entities and person entities, producing c(person, label), then CreateR.]

Page 19

Illustrating Example: Neighborhood

[Figure: a UCLA Computer Science Seminars page listing talks ("Clustering and Classification" by Yi Ma, UIUC; "Mobility-Assisted Routing" by Konstantinos Psounis, USC; each with a contact person). The plan for gave-talk(person, venue): extract c(person, neighborhood) from seminar pages with person entities and org entities, then CreateR.]

Page 20

Creating Daily ER Graphs

[Figure: each day (Day 1 ... Day n), web pages from the data sources are processed into a daily ER graph (e.g., HV Jagadish served in SIGMOD-07), and the daily graphs are merged into the global ER graph.]

Page 21

Merging Daily ER Graphs

[Figure: on day n+1, the daily ER graph is merged into the global ER graph by a Match operator followed by an Enrich operator; e.g., the new edge AnHai gave talk Stanford from the daily graph is added alongside the existing AnHai gave talk UIUC in the global graph.]
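The Match/Enrich merge step can be sketched as follows. The graph representation (dicts of entity names and edge triples) and the by-name matching are simplifying assumptions; Cimple's actual matching is more sophisticated.

```python
def merge_daily(global_graph, daily_graph):
    """Merge a daily ER graph into the global graph (Match, then Enrich)."""
    # Match: map daily entities onto existing global entities by name;
    # unmatched daily entities become new global entities.
    global_graph["entities"] |= daily_graph["entities"]
    # Enrich: add relationships discovered today.
    global_graph["edges"] |= daily_graph["edges"]
    return global_graph

# Invented example mirroring the figure: a new talk appears on day n+1.
g = {"entities": {"AnHai", "UIUC"},
     "edges": {("AnHai", "gave talk", "UIUC")}}
day = {"entities": {"AnHai", "Stanford"},
       "edges": {("AnHai", "gave talk", "Stanford")}}
merge_daily(g, day)
print(sorted(g["edges"]))
```

Using sets makes the merge idempotent: re-running the same daily graph adds nothing, which matters when crawls overlap from day to day.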

Page 22

Cimple Workflow

[Figure: a blackboard-style workflow. A Crawler reads datasources.xml (http://..., http://...) and writes index.xml and crawledPages/...; Discover Entities plans (ExtractM, MatchM, CreateE, one pipeline for Person and one for Publication) write entities.xml; Discover Relationships writes relationships.xml; Merge ER Graphs writes globalGraph.xml; Services (superhomepages, browsing, search) run on top.]

Page 23

Example Plan Specification

# Look for names in pages and mark them as mentions
EXTRACT_MENTIONS = tasks/getMentions/extractMentions
$EXTRACT_MENTIONS PerlSearch $CRAWLER->index.xml $NAME_VARIATIONS

# Match mentions
MATCH_MENTIONS = tasks/getEntities/matchMentions
$MATCH_ENTITIES RootNames $EXTRACT_MENTIONS->mentions.xml

# Create entities
CREATE_ENTITIES = tasks/getEntities/createEntities
$CREATE_ENTITIES RootNames $MATCH_ENTITIES->mentionGroups.xml

discoverPeopleEntities.cfg

#!/usr/bin/perl
###################################################################
# Arguments: <moduleDir> <fileIndex> <variationsFile>
# moduleDir: the relative path to the module
# fileIndex: the file index of crawled files
# variationsFile: a file containing mention name variations
#
# Finds mentions in crawled files by searching for name variations
###################################################################
use dblife::utils::CrawledDataAccess;
use dblife::utils::OutputAccess;

# First get arguments
my ($moduleDir, $fileIndex, $variationsFile) = @ARGV;

# Parse the crawled file index for info by URL
my %urlToInfo;
open(FILEINDEX, "< $fileIndex");
while(<FILEINDEX>) {
  if(/^<file>/) { ... }
  elsif(/^<\/file>/) { ... }
  ...
}
close(FILEINDEX);

# Output as we go
open(OUT, "> $moduleDir/output/output.xml");
...

# Search through crawled files for variations
foreach my $url (keys %urlToInfo) { ... }
...

close(OUT);

ExtractMentions.pl

Page 24

Example Operator APIs

<mention id="mention-537390">
  <mentionType>people</mentionType>
  <rootName>Jeffrey F. Naughton</rootName>
  <nameVariation>(Jeffrey|Jefferson|Jeff)\s+Naughton</nameVariation>
  <filepath>www.cse.uconn.edu/icde04/icde04cfp.txt</filepath>
  <url>http://www.cse.uconn.edu/icde04/icde04cfp.txt</url>
  <mentionLeftContextLength>100</mentionLeftContextLength>
  <mentionRightContextLength>100</mentionRightContextLength>
  <mentionStartPos>3786</mentionStartPos>
  <mentionEndPos>3798</mentionEndPos>
  <mentionMatch>Jeff Naughton</mentionMatch>
  <mentionWindow>PROGRAM CHAIR 6. Mary Fernandez, AT&T Gail Mitchell, BBN Technologies 7. Ling Liu, Georgia Tech 8. Jeff Naughton, U. Wisc. SEMINAR CHAIR 9. Guy Lohman, IBM Almaden Mitch Cherniack, Brandeis U. 10. Panos Chrysanth</mentionWindow>
</mention>

<mention id="mention-1143">
  <mentionType>people</mentionType>
  <rootName>Guy M. Lohman</rootName>
  <nameVariation>Guy\s+Lohman</nameVariation>
  <filepath>www.cse.uconn.edu/icde04/icde04cfp.txt</filepath>
  <url>http://www.cse.uconn.edu/icde04/icde04cfp.txt</url>
  <mentionLeftContextLength>100</mentionLeftContextLength>
  <mentionRightContextLength>100</mentionRightContextLength>
  <mentionStartPos>3850</mentionStartPos>
  <mentionEndPos>3859</mentionEndPos>
  <mentionMatch>Guy Lohman</mentionMatch>
  <mentionWindow>il Mitchell, BBN Technologies 7. Ling Liu, Georgia Tech 8. Jeff Naughton, U. Wisc. SEMINAR CHAIR 9. Guy Lohman, IBM Almaden Mitch Cherniack, Brandeis U. 10. Panos Chrysanthis, U. Pittsburgh 11. Aidong Zhang, SU</mentionWindow>
</mention>

extractMentions/output.xml

<mentionGroup>
  <mention>mention-545376</mention>
  <mention>mention-2510</mention>
  <mention>mention-2511</mention>
  <mention>mention-2512</mention>
</mentionGroup>

<mentionGroup>
  <mention>mention-537390</mention>
  <mention>mention-537678</mention>
  <mention>mention-538967</mention>
  <mention>mention-539718</mention>
  <mention>mention-539973</mention>
  <mention>mention-539974</mention>
  <mention>mention-540184</mention>
  <mention>mention-540265</mention>
  <mention>mention-554093</mention>
  <mention>mention-540732</mention>
  <mention>mention-540733</mention>
  <mention>mention-541072</mention>
  <mention>mention-541073</mention>
  <mention>mention-541964</mention>
  <mention>mention-542572</mention>
  <mention>mention-198996</mention>
  <mention>mention-543725</mention>
  <mention>mention-544495</mention>
  <mention>mention-545057</mention>
  ...
</mentionGroup>

matchMentions/output.xml

<entity id="entity-1979">
  <entityName>Daniel Urieli</entityName>
  <entityType>people</entityType>
  <mention>mention-545376</mention>
  <mention>mention-2510</mention>
  <mention>mention-2511</mention>
  <mention>mention-2512</mention>
</entity>

<entity id="entity-1980">
  <entityName>Jeffrey F. Naughton</entityName>
  <entityType>people</entityType>
  <mention>mention-537390</mention>
  <mention>mention-537678</mention>
  <mention>mention-538967</mention>
  <mention>mention-539718</mention>
  <mention>mention-539973</mention>
  <mention>mention-539974</mention>
  <mention>mention-540184</mention>
  <mention>mention-540265</mention>
  <mention>mention-554093</mention>
  <mention>mention-540732</mention>
  <mention>mention-540733</mention>
  <mention>mention-541072</mention>
  <mention>mention-541073</mention>
  <mention>mention-541964</mention>
  <mention>mention-542572</mention>
  <mention>mention-198996</mention>
  <mention>mention-543725</mention>
  <mention>mention-544495</mention>
  <mention>mention-545057</mention>
  ...
</entity>

createEntities/output.xml

Page 25

Experimental Evaluation: Building DBLife

Initial DBLife (May 31, 2005)

Data Sources (846): researcher homepages (365), department/organization homepages (94), conference homepages (30), faculty hubs (63), group pages (48), project pages (187), colloquia pages (50), event pages (8), DBWorld (1), DBLP (1)

Core Entities (489): researchers (365), department/organizations (94), conferences (30)

Relation Plans (8): authored, co-author, affiliated with, gave talk, gave tutorial, in panel, served in, related topic

Operators: DBLife-specific implementation of MatchMStrict

Time: 2 days, 2 persons; 2 days, 2 persons; 1 day, 1 person; 2 days, 2 persons

Maintenance and Expansion

Data Source Maintenance (adding new sources, updating relocated pages, updating source metadata): 1 hour/month, 1 person

Current DBLife (Mar 21, 2007)

Data Sources (1,075): researcher homepages (463), department/organization homepages (103), conference homepages (54), faculty hubs (99), group pages (56), project pages (203), colloquia pages (85), event pages (11), DBWorld (1), DBLP (1)

Mentions (324,188): researchers (125,013), departments/organizations (30,742), conferences (723), publications (55,242), topics (112,468)

Entities (16,674): researchers (5,767), departments/organizations (162), conferences (232), publications (9,837), topics (676)

Relation Instances (63,923): authored (18,776), co-author (24,709), affiliated with (1,359), served in (5,922), gave talk (1,178), gave tutorial (119), in panel (135), related topic (11,725)

Page 26

Relatively Easy to Deploy, Extend, and Debug

DBLife has been deployed and extended by several developers (CS at IL, CS at WI, Biochemistry at WI, Yahoo! Research); development started after only a few hours of Q&A.

Developers quickly grasped our compositional approach: they easily zoomed in on target components and could quickly tune, debug, or replace individual components; e.g., a new student extended the ComputeCoStrength operator and added the "affiliated" plan in just a couple of days.

Page 27

Accuracy in DBLife

Mean accuracy over 20 randomly chosen researchers

Experiment                                       Mean F1   Mean Precision   Mean Recall
Discovering entities with source-aware plan        0.98        0.99            0.97
Finding "on panel" relations (labels)              0.89        0.92            0.95
Finding "gave tutorial" relations (labels)         0.92        1.00            0.90
Finding "gave talk" relations (neighborhood)       0.88        1.00            0.87
Finding "served in" relations (labels)             0.77        0.81            0.84
Finding "affiliated" relations (co-occurrence)     0.80        0.83            0.85
Finding "authored" relations (DBLP plan)           0.84        0.98            0.76
Discovering entities with default plan             0.98        0.96            1.00
Extracting mentions with ExtractMByName            0.98        0.98            0.99
Matching entities across daily graphs              1.00        1.00            1.00

Page 28

Proposed Future Work: Data Model

Requirements: relatively easy to understand and manipulate for non-technical users; a temporal dimension to represent changing data.

Candidate: Temporal ER Graph

[Figure: on day n, the mentions "HV Jagadish" and "Jagadish, HV" (each mentionOf the entity HV Jagadish, extractedFrom http://cse.umich.edu/...) support the relationship HV Jagadish served in SIGMOD-07; on day n+1, a single mention "HV Jagadish" supports the same relationship, and the graph retains both days.]

Page 29

Data Model Implementation

Current implementation is ad hoc: a combination of XML, an RDBMS, and unstructured files, with no clean separation between data and operators.

Will explore a more principled implementation: efficient storage and processing, with abstraction through an API.

Page 30

Proposed Future Work: Operator Model

Identify a core set of efficient operators that encompass many common CIM extraction/integration tasks

Data acquisition: query search engines; crawl sources (e.g., Crawler)

Data extraction: extract mentions (e.g., ExtractM); discover relations (e.g., ComputeCoStrength, ExtractLabels)

Data integration: match mentions (e.g., MatchM); match entities over time (e.g., the Match operator in the merge plan); match entities across portals

Data manipulation: select subgraphs; join graphs

Page 31

Operator Model Extensibility

Input parameters that can be tuned for each domain

e.g., MatchMByName: "HV Jagadish" = "Jagadish, HV"; input parameters: name permutation rules

e.g., MatchMByNeighborhood: "Cong Yu, HV Jagadish, Yun Chen" = "Yun Chen, HV Jagadish, Arnab Nandi"; input parameters: window size, overlapping tokens required
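The neighborhood matcher above can be sketched with its two tunable parameters made explicit. This Python sketch is illustrative only: the tokenization and default parameter values are assumptions, not Cimple's actual MatchMByNeighborhood.

```python
def neighborhood_match(ctx1, ctx2, window=8, min_overlap=2):
    """Match two mention contexts if enough tokens in their windows overlap.

    window and min_overlap are the domain-tunable input parameters."""
    tokens1 = set(ctx1.split()[:window])
    tokens2 = set(ctx2.split()[:window])
    return len(tokens1 & tokens2) >= min_overlap

# The example from the slide: two author lists sharing HV Jagadish and Yun Chen.
c1 = "Cong Yu, HV Jagadish, Yun Chen"
c2 = "Yun Chen, HV Jagadish, Arnab Nandi"
print(neighborhood_match(c1, c2))
```

Raising min_overlap or shrinking window makes the operator stricter, which is exactly the kind of per-domain tuning the slide describes.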

Page 32

Proposed Future Work: Plan Model

Need a language for developers to compose operators

Workflow language: specifies operator order and how operators are composed; used by the current Cimple implementation

Declarative language: use a Datalog-like language to compose operators; existing work [Shen, VLDB 07] seems promising; provides opportunities for data-specific optimizations

Page 33

Planned Thesis Chapters

Selecting initial data sources

Creating the daily ER graph

Merging daily graphs into the global ER graph

Incrementally expanding sources and data

Leveraging community members

Developing the Cimple 1.0 workbench

Evaluating the solutions and workbench

Page 34

Incrementally Expanding

Most prior work periodically re-runs the source discovery step

Cimple leverages a common community property: important new sources and entities are often mentioned in certain community sources (e.g., DBWorld); monitor these sources with simple extraction plans

[Example DBWorld message:
Message type: conf. ann.
Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data
Call for Participation. Workshop on "Management of Uncertain Data" in conjunction with VLDB 2007. http://mud.cs.utwente.nl ...]
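A simple extraction plan over such announcements can be sketched as a URL scrape. This Python sketch is an assumption about what "simple extraction plan" means here; a real plan would also extract the conference name and message type.

```python
import re

def extract_new_sources(message):
    """Pull candidate new-source URLs out of a DBWorld-style announcement."""
    return re.findall(r"https?://\S+", message)

# Message text follows the DBWorld example shown above.
msg = ("Call for Participation. Workshop on 'Management of Uncertain Data' "
       "in conjunction with VLDB 2007. http://mud.cs.utwente.nl ...")
print(extract_new_sources(msg))
```

Each extracted URL becomes a candidate source that can be vetted (semi-automatically or by the community) before being added to datasources.xml.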

Page 35

Leveraging Community Members

Machine-only contributions: little human effort, a reasonable initial portal, automatic updates; but inaccuracies from imperfect methods and limited coverage

Human-only contributions: accurate, and can provide data not in the sources; but significant human effort, difficult to solicit, typically unstructured

Combine contributions from both humans and machines: gains the benefits of both; the machine's initial portal encourages human contributions

Combine both structured and text contributions: allows structured services (e.g., querying)

Page 36

Illustrating Example

Madwiki solution (machine-assisted development of wikipedias)

[Example: machine-generated wiki text with s-slots:

<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=Professor #>
<strong>Interests:</strong>
<# person(id=1).interests(id=3).topic(id=4){name}=Parallel DB #>

renders as "David J. DeWitt, Professor, Interests: Parallel DB". A user edit changes the title and adds an organization:

<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=John P. Morgridge Professor #>
<# person(id=1){organization}=UW #> since 1976
<strong>Interests:</strong>
<# person(id=1).interests(id=3).topic(id=4){name}=Parallel DB #>

A further edit refines the organization and adds an interest:

<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=John P. Morgridge Professor #>
<# person(id=1){organization}=UW-Madison #> since 1976
<strong>Interests:</strong>
<# person(id=1).interests(id=3).topic(id=4){name}=Parallel DB #>
<# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>

rendering as "David J. DeWitt, John P. Morgridge Professor, UW-Madison since 1976, Interests: Parallel DB, Privacy".]

Page 37

The Madwiki Architecture

Community wikipedia, backed by a structured database

Challenges include:
• How to model the underlying databases G and T
• How to export a view Vi as a wiki page Wi
• How to translate edits to a wiki page Wi into edits to G
• How to propagate changes to G to affected wiki pages

[Figure: data sources feed the database G (with T); views V1, V2, V3 over G are exported as wiki pages W1, W2, W3; a user edit u1 to page W3' flows back through V3' and T3' into G; machine M also writes to G.]

Page 38

Modeling the Underlying Database G

Use an entity-relationship graph: commonly employed by current community portals, and familiar to ordinary, database-illiterate users

[Figure: example ER graph: the person (id = 1, name = David J. DeWitt, organization = UW-Madison) has interests edges (id = 3, id = 5) to the topics Parallel DB (id = 4) and Privacy (id = 6), an interests edge (id = 8) to the topic Statistics (id = 7), and a services edge (id = 11, as = general chair) to the conference SIGMOD 02 (id = 12).]

Page 39

Storing ER Graph G

Use an RDBMS for efficient querying and concurrency

Extend with temporal functionality for undo [Snodgrass 99]

Organization_m
xid  value   start       stop        who
1    UW      2007-04-01  9999-12-31  M
2    MITRE   2007-05-20  9999-12-31  M

Organization_u
xid  value       start       stop        who
2    Purdue      2007-05-02  9999-12-31  U1
1    UW-Madison  2007-05-27  9999-12-31  U2

Entity_ID
id   etype
1    person
4    topic
12   conf
...

Relationship_ID
id   rtype      eid1  eid2
3    interests  1     4
11   services   1     12
...

Page 40

Reconciling Human and Machine Edits

When edits conflict, must reconcile to choose attribute values

Reconciliation policy encoded as a view over the attribute tables:
• "latest": the current value is the latest value, whether human or machine
• "humans-first": the current value is the latest human value, or the latest machine value if there are no human values

Organization_p
xid  value       start       stop        who
1    UW          2007-04-01  2007-05-27  M
2    Purdue      2007-05-02  9999-12-31  U1
1    UW-Madison  2007-05-27  9999-12-31  U2
...
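The two policies above can be sketched over rows shaped like the temporal attribute tables (value, start, who, with "M" marking machine edits). In Madwiki the policy is an RDBMS view; this Python sketch is an assumed, simplified equivalent for one attribute.

```python
def reconcile(rows, policy="humans-first"):
    """Pick the current attribute value under a reconciliation policy."""
    if policy == "latest":
        candidates = rows                       # newest value from anyone
    else:  # humans-first
        human = [r for r in rows if r["who"] != "M"]
        candidates = human or rows              # fall back to machine values
    # ISO date strings compare correctly lexicographically.
    return max(candidates, key=lambda r: r["start"])["value"]

# Invented history: a human correction followed by a later machine edit.
rows = [{"value": "UW",         "start": "2007-04-01", "who": "M"},
        {"value": "UW-Madison", "start": "2007-05-27", "who": "U2"},
        {"value": "MITRE",      "start": "2007-06-01", "who": "M"}]
print(reconcile(rows, "latest"), "/", reconcile(rows, "humans-first"))
```

The two policies diverge exactly when a machine edit postdates a human one: "latest" surfaces the machine value, while "humans-first" keeps the human correction.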

Page 41

Exporting Views over G as Wiki Pages

Views over the ER graph G select sub-graphs

Use s-slots to represent the sub-graph's data in wiki pages

[Example: the wiki text

<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){organization}=UW-Madison #>
<strong>Interests:</strong>
<# person(id=1).interests(id=3).topic(id=4){name}=Parallel DB #>
<# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>

renders as "David J. DeWitt, UW-Madison, Interests: Parallel DB, Privacy" and is exported from the sub-graph containing person id = 1 (David J. DeWitt, UW-Madison) and its interests edges (id = 3, id = 5) to the topics Parallel DB (id = 4) and Privacy (id = 6).]

Page 42

Translating Wiki Edits to Database Edits

Use wiki edits to infer ER edits to the underlying view

Map ER edits to RDBMS edits over G; see ICDE 08 for details of the mapping

[Example: a user edit changes the organization s-slot from UW to UW-Madison and adds a new interest s-slot <# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>; the inferred ER edits update the organization attribute of person id = 1 and insert a new interests edge (id = 5) to a new topic node Privacy (id = 6).]
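The edit-translation step can be sketched as parsing s-slots out of the old and new wiki text and diffing the (path, attribute) → value pairs. The regex follows the s-slot syntax shown in the examples; the diffing logic is an illustrative assumption, not Madwiki's actual mapping.

```python
import re

# Matches s-slots like: <# person(id=1){organization}=UW-Madison #>
SLOT = re.compile(r"<#\s*(.+?)\{(\w+)\}=(.+?)\s*#>")

def parse_slots(wiki_text):
    """Map each (entity path, attribute) pair to its s-slot value."""
    return {(path.strip(), attr): value.strip()
            for path, attr, value in SLOT.findall(wiki_text)}

def infer_edits(old_text, new_text):
    """Diff two wiki-page versions into inferred ER edits."""
    old, new = parse_slots(old_text), parse_slots(new_text)
    edits = []
    for key, value in new.items():
        if key not in old:
            edits.append(("insert", key, value))
        elif old[key] != value:
            edits.append(("update", key, value))
    return edits

old = "<# person(id=1){organization}=UW #>"
new = "<# person(id=1){organization}=UW-Madison #>"
print(infer_edits(old, new))
```

The resulting insert/update operations are what would then be mapped onto the temporal attribute tables of G.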

Page 43

Resolving Ambiguous Wiki Edits

Users can edit data, views, and schemas; e.g., change the attribute age of person X to 42, display the attribute homepage of conference entities on wiki pages, or add a new attribute pages to the publication entity schema.

Wiki edits can be ambiguous; e.g., deleting a title from a wiki page could mean: change the attribute title of person X to NULL; do not display the attribute title of people entities on wiki pages; or delete the attribute title from the people entity schema.

Recognize ambiguous edits, then ask users to clarify their intention.

Page 44

Propagating Database Edits to Wiki Pages

Update wiki pages when G changes, similarly to view update

Eager propagation: pre-materialize wiki text and immediately propagate all changes; raises complex concurrency issues

Simpler alternative, lazy propagation: materialize wiki text on the fly when requested by users; the underlying RDBMS manages edit concurrency; preliminary evaluation indicates lazy propagation is tractable

Page 45

Madwiki Evaluation

S-slot expressiveness: a prototype shows s-slots can express all structured data in DBLife superhomepages except aggregation and top-k

Materializing and serving wiki pages: materializing and serving time increases linearly with page size; the vast majority of current pages are small, hence served in < 1 second

Page 46

Proposed Future Work

Structured data in wiki pages: s-slot tags can be confusing and cumbersome for users; propose to explore a structured data representation closer to natural text; key challenge: how to avoid requiring an intractable NLP solution

Reconciling conflicting edits: preliminary work identifies the problem but does not provide a satisfactory solution; propose to explore reconciling edits by learning editor trustworthiness; lots of existing work, but not in CIM settings; key challenge: edits are sparse, and not missing at random

Page 47

Developing the Cimple 1.0 Workbench

Set of tools for compositional portal building:

• empty portal shell, including basic services and admin tools (browsing, keyword search...)
• set of general operators, and means to compose them (MatchM, ExtractM...)
• simple implementations of operators (MatchMbyName, ExtractMbyName...)
• end-to-end development methodology (1. select sources, 2. discover entities...)
• extraction/integration plan optimizers (see VLDB 07 for an optimization solution)

Page 48

Proposed Evaluation: A Second Domain

Use Cimple to build a portal for a non-research domain

Evaluate the portal's ease of development, extensibility, and accuracy

Movie Domain: movie, actor, director, appeared in, directed...; sources include critic homepages, celebrity news sites, fan sites; existing portals (e.g., IMDB) are arguably an unfair advantage

NFL Domain: player, coach, team, played for, coached...; sources include homepages for players, teams, tournaments, stadiums

Digital Camera Domain: manufacturers, cameras, stores, reviewers, makes, sells, reviewed...; sources include manufacturer homepages, online stores, reviewer sites

Page 49

Conclusions

CIM is an interesting and increasingly crucial problem: many online communities; initial efforts at UW, Y!R, MSR, Washington, IBM Almaden

Preliminary work on CIM seems promising: decomposed CIM and developed initial solutions; began work on the Cimple 1.0 workbench; developed the DBLife portal

Much work remains to realize the potential of this approach: formal data, operator, and plan models; more natural editing and robust reconciliation for Madwiki; the Cimple 1.0 workbench; a second portal

Page 50

Proposed Schedule

February: formal data, operator, and plan model for Cimple

March: solution for reconciling human and machine edits based on trust

June: more natural structured data representation for Madwiki wiki text

July: completion of the Cimple 1.0 beta workbench

August: deployment of a second portal built using the Cimple 1.0 workbench; public release of the Cimple 1.0 workbench

Page 51

Leveraging Communities – Prior Work

Automatic methods are imperfect: extraction/integration errors and incomplete coverage

Some solutions build portals with human-only contributions (Wikipedia, Intellipedia, umasswiki.com, ecolicommunity.org...): given an active, trusted community, they can achieve excellent accuracy, but they can require lots of effort to maintain and are typically unstructured

Recent work explores allowing structured user contributions (Semantic Wikipedia, WikiLens, MetaWeb...): these focus on extending the wiki language to express structured data, and do not explore synergistic leveraging of human and automatic contributions