WinaCS Project Web Entity Extraction and Mapping Discovering and Propagating Context

Data and Information Systems Laboratory, University of Illinois Urbana-Champaign. CS 512, Jan 18, 2010. WinaCS Project: Web Entity Extraction and Mapping. Discovering and Propagating Context. Tim Weninger, Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL.



TRANSCRIPT

Page 1: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context

Data and Information Systems Laboratory, University of Illinois Urbana-Champaign

CS 512, Jan 18, 2010

WinaCS Project: Web Entity Extraction and Mapping

Discovering and Propagating Context

Tim Weninger

Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL

Page 2: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


Past, Present, Future

Past – Entity search and retrieval is one of the dreams of the Web – TBL (Tim Berners-Lee)

Present – Ranking and Retrieval: a bi-directional approach

1) Information Networks
2) Web mining and Information Extraction
   a) List Finding
   b) Entity-page Discovery
   c) Entity-page Mapping

Future – InfoBase Project: information extraction via Schema Discovery

Page 3: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


Finding lists on the Web is Hard! (KDD Explorations Dec. 2010)


1. Google Sets
2. WebTables
3. Mining Data Records (MDR)
4. World Wide Tables (WWT)
5. Tag Path Clustering
6. RoadRunner
7. SEAL
8. Visual List Extraction
9. VIsion-based Page Segmentation (VIPS)
10. Visualized Element Nodes Table extraction (VENTex)

Page 4: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


Why is finding lists important?

List 1:
• Jiawei Han
• ChengXiang Zhai
• Kevin Chang
• Dan Roth
• Marianne Winslett

List 2:
• Jiawei Han
• ChengXiang Zhai
• Kevin Chang
• Dan Roth
• Marianne Winslett
• Sarita Adve
• Tarek Abdelzaher
• Vikram Adve
• Gul Agha
• …

List 3:
• Charu Aggarwal
• Deepayan Chakrabarti
• Ed Chang
• Kevin Chang
• Olivier Chapelle
• Chris Clifton
• Jiawei Han
• …

Uses: CORRECTION, INFERENCE, DISAMBIGUATION, RECOMMENDATION, ETC.

Page 5: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


Our list finding algorithm (Accepted: WWW 2011)

Page 6: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


List Finding for Entity Page Discovery

[Figure: DOM tree of an example page. The HTML root has DIV children containing UL, LI and P nodes whose leaves are the links hrefA to hrefH and hrefX to hrefZ. Two groups of sibling lists are marked as Data Region 1 and Data Region 2.]
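To make the idea of finding lists in a DOM tree concrete, here is a minimal Python sketch. It is plain tag-path grouping (item 5 in the earlier survey list), not the WWW 2011 algorithm, whose details are not given on the slides. The names `LinkPathCollector`, `candidate_lists` and `PAGE`, and the sample HTML, are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never get a closing tag; they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LinkPathCollector(HTMLParser):
    """Records the tag path (root down to <a>) of every link on a page."""

    def __init__(self):
        super().__init__()
        self.stack = []                   # currently open tags
        self.groups = defaultdict(list)   # tag path -> hrefs found on it

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self.groups["/".join(self.stack)].append(href)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def candidate_lists(html, min_size=2):
    """Return groups of links that share one tag path."""
    collector = LinkPathCollector()
    collector.feed(html)
    return {p: h for p, h in collector.groups.items() if len(h) >= min_size}


# Hypothetical page, loosely shaped like the tree on the slide.
PAGE = """
<html><body>
<div><ul>
  <li><a href="A">a</a></li><li><a href="B">b</a></li>
  <li><a href="C">c</a></li><li><a href="D">d</a></li>
</ul></div>
<div><p><a href="G">g</a></p><p><a href="H">h</a></p></div>
</body></html>
"""

for path, hrefs in candidate_lists(PAGE).items():
    print(path, hrefs)
```

Grouping by the full tag path is the simplest possible criterion. It cannot tell two sibling lists with identical markup apart, which is one reason the slides call list finding hard.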

Page 7: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


Growing Parallel Paths (Accepted: WWW 2011)

[Figure: tag paths to link anchors compared across Pages A to F. Paths such as HTML, DIV, UL, LI, A and TABLE, TR, TD, A lead to the links X, Y, Z, W, U and V, and paths through P lead to other links. Steps are numbered 1 to 6, with the labels "Path" and "Result:".]

Page 8: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


Mapping Pages to Records (CIKM’10)

Structured Data (columns: Name, URL, Zipcode)

• Tarek Abdelzaher
• Sarita Adve
• Vikram Adve
• Gul Agha
• Eyal Amir
• Dan Roth
• Jiawei Han

Web Pages

• llvm.cs.uiuc.edu/~vadve/Home.html
• rsim.cs.illinois.edu/~sadve/
• www.cs.illinois.edu/homes/hanj/
• l2r.cs.uiuc.edu/~danr/

Mappings

[Figure: links between records in the structured data and Web pages.]

Page 9: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


Mapping Pages to Records (CIKM’10)

[Figure: link structure of cs.illinois.edu, starting at / (root). The site pages /people, /people/faculty, /people/faculty/jiawei-han, /people/faculty/dan-roth, /people/faculty/vikram-adve and /research/research/areas/data are connected by links with the anchor texts People, Faculty, Research, Data Mining, Jiawei Han, Dan Roth and Vikram Adve. Links with the anchor text Personal Site lead to www.cs.illinois.edu/homes/hanj/, llvm.cs.uiuc.edu/~vadve/Home.html and l2r.cs.uiuc.edu/~danr/.]

Example

Ap1 = {People, Faculty, Dan Roth, Personal Site}
Ap2 = {Research, Data Mining, Dan Roth, Personal Site}

Bag of Anchors: {Research: 1, People: 1, Faculty: 1, Data Mining: 1, Dan Roth: 2, Personal Site: 2}

Sorted Bag of Anchors: Au;v1 = {Dan Roth: 2/2 = 1, Research: 1/2 = 0.5, Data Mining: 1/2 = 0.5, Personal Site: 2/5 = 0.4, People: 1/3 = 0.33, Faculty: 1/3 = 0.33}
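The arithmetic in this example can be reproduced with a short Python sketch. The two anchor paths and the denominators are copied from the slide. The slide does not say what the denominators count, so treating them as site-wide counts of each anchor text is an assumption, and the names `SITE_COUNTS` and `sorted_bag_of_anchors` are hypothetical.

```python
from collections import Counter

# Anchor-text paths from the slide's example.
AP1 = ["People", "Faculty", "Dan Roth", "Personal Site"]
AP2 = ["Research", "Data Mining", "Dan Roth", "Personal Site"]

# Denominators copied from the slide. ASSUMPTION: they are treated here
# as site-wide occurrence counts of each anchor text.
SITE_COUNTS = {
    "Dan Roth": 2, "Research": 2, "Data Mining": 2,
    "Personal Site": 5, "People": 3, "Faculty": 3,
}


def sorted_bag_of_anchors(paths, site_counts):
    """Count anchors over all paths, normalize, and sort by score."""
    bag = Counter(anchor for path in paths for anchor in path)
    scored = {anchor: n / site_counts[anchor] for anchor, n in bag.items()}
    # sorted() is stable, so ties keep their first-seen order.
    return sorted(scored.items(), key=lambda item: -item[1])


for anchor, score in sorted_bag_of_anchors([AP1, AP2], SITE_COUNTS):
    print(f"{anchor}: {score:.2f}")
```

The normalization pushes down anchor texts such as Personal Site that occur on many paths, so the entity name Dan Roth ends up at the top of the sorted bag.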

Page 10: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


CSMap

Locations of top 25 computer science departments. Automatically generated by extracting and ranking 5-digit numbers from entity Web pages.
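A minimal Python sketch of extracting and ranking 5-digit numbers follows. The slides do not state the ranking criterion, so ranking by the number of pages that mention a number is an assumption. The function name `rank_zip_candidates` and the sample page texts are hypothetical.

```python
import re
from collections import Counter

# Exactly five digits, not embedded in a longer run of digits.
FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")


def rank_zip_candidates(pages):
    """Rank 5-digit numbers by how many pages mention them."""
    counts = Counter()
    for text in pages:
        # set() counts each number at most once per page.
        counts.update(set(FIVE_DIGITS.findall(text)))
    return counts.most_common()


# Hypothetical page texts for one entity.
PAGES = [
    "Contact: 100 Example St, Urbana, IL 61801",
    "Mailing address: Urbana, IL 61801. Phone 217-555-0100",
    "Office 12345, Urbana, IL 61801",
]

print(rank_zip_candidates(PAGES))
```

Counting pages rather than raw occurrences keeps one page that repeats a number many times from dominating the ranking.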

Page 11: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


Next Steps: The hard part!

Infer categories/schemas from a set of Web pages

Example:

What do these entities have in common?

• Name
• Address
• ZipCode
• Publications
• Collaborators
• Organizations

How can we infer this schema? Wikipedia?

How can we populate it?

Page 12: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


Idea! Propagating schemas

Page 13: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


Next Steps: The hardest part!

Name              Address  ZipCode  Organizations  Collaborators  Publications
Jiawei Han        A        1        FK             FK             FK
Tarek Abdelzaher  B        2        FK             FK             FK
Gerald DeJong     C        3        FK             FK             FK
Michael Heath     D        4        FK             FK             FK

Legend: Given, Inferred

This can be modeled as a heterogeneous information network.

Thus, ranking and clustering are possible. So are semantic search, keyword search and typal search.

Cube operations are possible.
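As a rough illustration of treating such a table as a heterogeneous information network, the Python sketch below turns rows into typed nodes and links. The names and the Address and ZipCode values are the slide's placeholders. The organization ids stand in for the slide's "FK" cells and are invented purely for illustration, as are the function names `build_network` and `rank_by_degree`.

```python
from collections import defaultdict

# Name, Address and ZipCode values are the slide's placeholders.
# ASSUMPTION: the organization ids replace the slide's "FK" cells and
# are made up; they carry no real-world meaning.
ROWS = [
    ("Jiawei Han", "A", "1", ["org-1"]),
    ("Tarek Abdelzaher", "B", "2", ["org-1", "org-2"]),
    ("Gerald DeJong", "C", "3", ["org-2"]),
    ("Michael Heath", "D", "4", ["org-1"]),
]


def build_network(rows):
    """Return adjacency sets over typed nodes such as ("Person", name)."""
    edges = defaultdict(set)

    def link(u, v):
        edges[u].add(v)
        edges[v].add(u)

    for name, address, zipcode, orgs in rows:
        person = ("Person", name)
        link(person, ("Address", address))
        link(person, ("ZipCode", zipcode))
        for org in orgs:
            link(person, ("Organization", org))
    return edges


def rank_by_degree(edges, node_type):
    """Rank nodes of one type by their number of neighbours."""
    nodes = [(n, len(nb)) for n, nb in edges.items() if n[0] == node_type]
    return sorted(nodes, key=lambda item: -item[1])


network = build_network(ROWS)
print(rank_by_degree(network, "Organization"))
```

Degree is only the crudest ranking signal. The point of the sketch is that once every cell is a typed node, ranking, clustering and search can all run over the same graph.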

Page 14: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


WinaCS – An information network based Web search engine

Page 15: WinaCS  Project Web Entity Extraction and Mapping  Discovering and Propagating Context


Questions? Challenges?