Querying Web-Sources within a Data Federation
Lynn Wu (1), Aykut Firat (2), Tarik Alatovic (3), Stuart Madnick (1)
(1) MIT Sloan School of Management, (2) Northeastern University, (3) INSEAD
International Conference on Information Systems (ICIS), December 11, 2006


Motivating Scenario

You want:
• The current stock quotes of all companies listed on the Stock Exchange that are in the biotechnology industry.
• And you want to see each of the stock quotes in all the major currencies.

Good News
All of the necessary information is available (and for free) on the Web …

Listing of companies in an industry

Stock price for any company

Conversion between any two currencies

So what’s the problem?

Process – Part 1
Web sites are not like relational (SQL) databases.

Must go step-by-step: first find all the biotech companies.

Process – Part 2
Then must find the stock price of each, one-by-one.

Biotechnology                      Ticker
Acadia Pharmaceuticals Inc.        ACAD
Accentia Biopharmaceuticals, I     ABPI
Achillion Pharmaceuticals, Inc     ACHN
Acorda Therapeutics, Inc.          ACOR
Adherex Technologies Inc.          ADH
Advanced Cell Technology Inc.      ACTC.OB
Advanced Life Sciences Holding     ADLS
Advaxis Inc.                       ADXS.OB
Adventrx Pharmaceuticals Inc.      ANX
Alfacell Corp.                     ACEL
Alnylam Pharmaceuticals Inc.       ALNY
……

237 Biotech firms

Process – Part 3
Then must convert stock price of each, one-by-one.

Ticker     Price($)
ACAD       8.96
ABPI       3.55
ACHN       17.52
ACOR       17.13
ADH        0.31
ACTC.OB    0.76
ADLS       2.43
ADXS.OB    0.14
ANX        2.42
ACEL       1.62
ALNY       23.81

Ticker     Price($)   EURO       JPY
ACAD       8.96       6.8096     792.2289
ABPI       3.55       2.698      313.8853
ACHN       17.52      13.3152    1549.09
ACOR       17.13      13.0188    1514.607
ADH        0.31       0.2356     27.4097
ACTC.OB    0.76       0.5776     67.19798
ADLS       2.43       1.8468     214.8567
ADXS.OB    0.14       0.1064     12.37858
ANX        2.42       1.8392     213.9725
ACEL       1.62       1.2312     143.2378
ALNY       23.81      18.0956    2105.242

General Scenario

Users often have to browse through many websites and collect and process a lot of information manually.

Wouldn’t it be great if you could get all the stock quotes in the biotech industry using one query?

select ticker, price from yahooF where ticker IN (select companyticker from companytable where industry='Biotechnology')

Why is this so difficult?

Websites have various capability restrictions.
• Web sites do not accept general queries (e.g., SQL).
• Assuming they somehow accepted general queries, there are still problems. For example:

select price from yahooF
This is not answerable, as Yahoo! Finance requires at least one ticker at a time to get the stock quote.

select exchanged, expressed, rate, date from olsen where expressed='USD' and date='12/10/06'
This is not answerable either: the query must specify both currencies.

Existing Solutions
• Commercial databases can incorporate heterogeneous data sources through the use of wrappers:
  • However, there is no general-purpose wrapper that can query the entire Web.
  • Need to construct one wrapper per website.
  • This is our focus: how can these be improved?
• Other options:
  • Using highly expressive context-free grammars to express the capability restrictions.
  • Has not been used widely in commercial systems due to their complexity.

How does a Federated database system handle the problem?

Example: IBM DB2

[Figure: A query (Select .. from s1, s2, s3 …) enters the IBM DB2 federation engine, which uses the wrapper request-reply protocol to reach one wrapper per web source (Wrapper for S1, Wrapper for S2, Wrapper for S3). Each wrapper contains its own Capability Handler and Data Extraction component for its website (S1-website, S2-website, S3-website).]

For web sites (S1, S2, S3), each wrapper must be custom crafted.

Research Contribution

• Offer a complete, practical, and scalable solution to easily incorporate websites into a data federation.
• Abstract wrapper components into separate reasoning engines:
  • Capability reasoning engine for query planning and execution
  • Data extraction engine

Our Solution

Two-Layered Architecture: current IBM solution
[Figure: Query (Select .. from s1, s2, s3 …), IBM DB2 federation engine, wrapper request-reply protocol, then one wrapper per web source (Wrapper for S1, S2, S3), each with its own Capability Handler and Data Extraction component, over S1-website, S2-website, S3-website.]

Three-Layered Architecture: with capability declaration
[Figure: Query (Select .. from s1, s2, s3 …), IBM DB2 federation engine, wrapper request-reply protocol, then a single wrapper made of a Capability Engine (query planning with capability declaration) and a Data Extraction Engine. The figure labels the Capability Record Declarations (CR for S1, CR for S2, CR for S3) and the Data Extraction Spec Files (DE for S1, DE for S2, DE for S3), over S1-website, S2-website, S3-website.]

Adding a web source is simple.

• Define the data extraction rules.
• Define the capability record.

No procedural coding involved at all!

Data Extraction: Cameleon Engine

• Extract data from web pages using declarative specifications that extract specific fields within a website.

• Can answer rudimentary queries involving only a single website.

[Figure: Example data extraction rules for Yahoo! Finance, annotated with the input parameter and with the regular expression that identifies the region and extracts the price.]
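To make the idea of declarative extraction concrete, here is a minimal Python sketch. It is not Cameleon's actual spec-file syntax, which the slides do not reproduce: the `YAHOO_SPEC` dictionary, the placeholder URL, the regular expression, and the sample HTML are all assumptions made for illustration. The point is only that the per-site knowledge lives in data (an input parameter plus one pattern per attribute), while the extraction function is generic.

```python
import re

# Hypothetical declarative spec; NOT Cameleon's real file format.
YAHOO_SPEC = {
    "relation": "YahooF",
    "url": "https://finance.example.com/q?s={Ticker}",  # placeholder URL
    "inputs": ["Ticker"],                                # input parameter
    "attributes": {
        # Pattern locates the region and captures the price (assumed markup).
        "Price": r"Last Trade:.*?<b>([\d.]+)</b>",
    },
}


def extract(spec, page, inputs):
    """Apply every attribute pattern in the spec to one fetched page."""
    row = dict(inputs)
    for name, pattern in spec["attributes"].items():
        match = re.search(pattern, page, re.DOTALL)
        row[name] = float(match.group(1)) if match else None
    return row


# Made-up page fragment standing in for a fetched quote page.
SAMPLE_PAGE = "<td>Last Trade:</td><td><b>8.96</b></td>"
row = extract(YAHOO_SPEC, SAMPLE_PAGE, {"Ticker": "ACAD"})
```

Supporting another site in this sketch would mean writing another spec dictionary, not another function.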

Cameleon Studio tool enables quick creation and testing of the data extraction rules

Capability Record

• For Yahoo! Finance, we have two attributes of interest: Ticker and Price.
• Cameleon extracts the data and forms a table with these two columns.

Capability records:

relation('YahooF',
  [['Ticker', string, bound(1)],
   ['Price', number, free]],
  ['=']).

relation(olsen,
  [['Exchanged', string, bound(1)],
   ['Expressed', string, bound(1)],
   ['Rate', number, free],
   ['Date', string, bound(1)]],
  ['=']).

relation('companytable',
  [['Industry', string, bound(1)],
   ['CompanyTicker', string, free]],
  ['=']).

Reading the YahooF record:
• Ticker, bound(1): must provide one (and only one) Ticker at a time (some sites allow up to 50 Tickers at a time).
• Price, free: Price is the value returned.
• ['=']: can only use the equality (=) operator.
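As an illustration of how a capability record can be checked against a query, the following Python sketch encodes the YahooF and olsen records and reports which required bindings a set of predicates fails to supply. The class and function names are hypothetical, and `bound(1)` is modelled simply as "a value must be supplied with a supported operator"; this is a reading of the slide, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str
    type: str
    bound: bool  # True: caller must supply a value; False: value is returned


@dataclass(frozen=True)
class CapabilityRecord:
    relation: str
    attributes: tuple
    operators: tuple


YAHOOF = CapabilityRecord(
    "YahooF",
    (Attribute("Ticker", "string", True), Attribute("Price", "number", False)),
    ("=",),
)
OLSEN = CapabilityRecord(
    "olsen",
    (
        Attribute("Exchanged", "string", True),
        Attribute("Expressed", "string", True),
        Attribute("Rate", "number", False),
        Attribute("Date", "string", True),
    ),
    ("=",),
)


def missing_bindings(record, predicates):
    """Return bound attributes that no supported predicate supplies.

    predicates: iterable of (attribute_name, operator, value) triples.
    """
    supplied = {name for name, op, _ in predicates if op in record.operators}
    return [a.name for a in record.attributes
            if a.bound and a.name not in supplied]


# "select price from yahooF": no ticker given.
no_ticker = missing_bindings(YAHOOF, [])
# olsen query that fixes only one of the two currencies.
one_currency = missing_bindings(
    OLSEN, [("Expressed", "=", "USD"), ("Date", "=", "12/10/06")]
)
```

Under these assumptions, `no_ticker` names Ticker and `one_currency` names Exchanged, matching the two unanswerable queries shown earlier in the deck.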

IBM DB2
• Uses wrappers to access non-relational data sources.
• DB2 first decomposes the original query into query fragments and then sends them to wrappers.
• The wrapper sends the result back to DB2, which then assembles the final results.

DB2 XML Wrapper (Adapted from IBM).

Request-Reply-Compensate Protocol

Request-Reply-Compensate protocol example

Query fragment:
select price * 1.3 from YahooF where ticker in ('GE', 'IBM', 'MSFT');

Request:
HXP: Price
Table: YahooF
Predicates: ticker in ('GE', 'IBM', 'MSFT')

Wrapper plan 1:
HXP: Price
Table: YahooF
Predicate: ticker = 'GE'

Wrapper plan 2:
HXP: Price
Table: YahooF
Predicate: ticker = 'IBM'

Wrapper plan 3:
HXP: Price
Table: YahooF
Predicate: ticker = 'MSFT'
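The example can be sketched in Python as follows. The request carries only Price as its head expression, so one plausible reading is that the multiplication by 1.3 is applied by the federation engine after the wrapper replies (the "compensate" step). That reading, the function names, and the quote values are assumptions for illustration; this is not DB2's wrapper API.

```python
def plan_requests(table, column, values):
    """Split an IN predicate into one single-value wrapper plan per value."""
    return [
        {"hxp": "Price", "table": table, "predicate": (column, "=", value)}
        for value in values
    ]


def execute(plans, fetch_price, compensate):
    """Run each plan against the source, then apply the compensation step."""
    return [compensate(fetch_price(plan["predicate"][2])) for plan in plans]


# Made-up quotes standing in for the web source.
FAKE_QUOTES = {"GE": 10.0, "IBM": 20.0, "MSFT": 30.0}

plans = plan_requests("YahooF", "ticker", ["GE", "IBM", "MSFT"])
results = execute(
    plans,
    fetch_price=FAKE_QUOTES.__getitem__,
    compensate=lambda price: price * 1.3,  # work the source cannot do itself
)
```

Three plans are produced, one per ticker, which mirrors wrapper plans 1 to 3 on the slide.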

Query Planning
• Now we have a capability record defined.
• Add a secondary mini query planner that is designed specifically to work with capability records.
  • Can answer queries involving multiple web sources.
  • Specifies an execution order for the query fragments.
  • Independent query fragments are executed first.
  • They are followed by dependent query fragments that can use the prior results.
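The ordering rule (independent fragments first, dependent fragments afterwards) amounts to a topological sort of the fragment dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the fragment names and the dependency map are hypothetical and stand in for whatever the planner derives from the capability records.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# fragment -> set of fragments whose results it needs (hypothetical names)
DEPENDS_ON = {
    "companytable_fragment": set(),                   # independent
    "yahooF_fragment": {"companytable_fragment"},     # needs tickers first
}


def execution_order(depends_on):
    """Return fragments ordered so every fragment follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, which in this sketch would mean no valid execution order exists.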

Our Solution: Example 1

Find all the stock quotes of biotech companies.

SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER IN (SELECT COMPANYTICKER FROM COMPANYTABLE WHERE INDUSTRY = 'BIOTECHNOLOGY' AND COMPANYTICKER < 'AD')

Independent query fragment:
SELECT COMPANYTICKER, INDUSTRY FROM COMPANYTABLE WHERE INDUSTRY = 'BIOTECHNOLOGY' AND COMPANYTICKER < 'AD'

Depends on the previous query fragment:
SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER = [<unbound kind>]

Example Query

Independent query fragment:
SELECT COMPANYTICKER, INDUSTRY FROM COMPANYTABLE WHERE INDUSTRY = 'BIOTECHNOLOGY' AND COMPANYTICKER < 'AD'

COMPANYTICKER   INDUSTRY
---------------------------------------------
ACAD            Biotechnology
ACAM            Biotechnology
ACOR            Biotechnology
ACEL            Biotechnology

Depends on the previous query fragment:
SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER = [<unbound kind>]

The dependent fragment is executed once per ticker returned above:
SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = 'ACAD'
SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = 'ACAM'
SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = 'ACOR'
SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = 'ACEL'

TICKER   PRICE
-------------------------------------------
ACAD     14.90
ACAM     6.51
ACOR     5.10
ACEL     3.18
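A small Python sketch of the binding step: the rows returned by the independent fragment are substituted, one at a time, into the dependent fragment. The template string and helper name are hypothetical, and plain string formatting is used only because the values come from a trusted intermediate result; it is not a recommendation for building SQL from user input.

```python
# Dependent fragment with its unbound value left as a placeholder.
TEMPLATE = "SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = '{value}'"


def bind_fragment(template, rows, column):
    """Produce one bound query per row of the independent fragment's result."""
    return [template.format(value=row[column]) for row in rows]


# Result of the independent fragment, as shown on the slide.
independent_result = [
    {"COMPANYTICKER": "ACAD", "INDUSTRY": "Biotechnology"},
    {"COMPANYTICKER": "ACAM", "INDUSTRY": "Biotechnology"},
    {"COMPANYTICKER": "ACOR", "INDUSTRY": "Biotechnology"},
    {"COMPANYTICKER": "ACEL", "INDUSTRY": "Biotechnology"},
]

queries = bind_fragment(TEMPLATE, independent_result, "COMPANYTICKER")
for query in queries:
    print(query)
```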

Example 2

Now you want the stock price in all major currencies.

select yahooF.ticker, yahooF.price * exchange.rate, exchange.currency from
  (select ticker, price from yahooF
   where ticker IN (select companyticker from companytable where industry='biotechnology')) as yahooF,
  (select currency, olsen.rate from
     (select currency from currency_map where currency <> 'USD') currency_map,
     (select exchanged, 'USD', rate, '12/10/06' from olsen where expressed='USD' and date='12/10/06') olsen
   where currency_map.currency = olsen.exchanged and currency_map.currency <> 'USD') as exchange

Example 2

Get all the exchange rates against the USD on Dec 10 2006

select olsen.rate from
  (select currency from currency_map where currency <> 'USD') currency_map,
  (select exchanged, 'USD', rate, '12/10/06' from olsen where expressed='USD' and date='12/10/06') olsen
where currency_map.currency = olsen.exchanged and currency_map.currency <> 'USD'

Query fragment 1: the subquery on currency_map.
Query fragment 2: the subquery on olsen.

Example 2 (continued)

select olsen.rate from
  (select currency from currency_map where currency <> 'USD') currency_map,
  (select exchanged, 'USD', rate, '12/10/06' from olsen where expressed='USD' and date='12/10/06') olsen
where currency_map.currency = olsen.exchanged and currency_map.currency <> 'USD'

Query fragment 2:
(select exchanged, 'USD', rate, '12/10/06' from olsen where expressed='USD' and date='12/10/06') olsen

Capability record:
relation(olsen,[['Exchanged',string, bound(1)],['Expressed',string, bound(1)],['Rate',number, free], ['Date',string, bound(1)]],['=']).

Query fragment 2 does not bind Exchanged, which the capability record requires, so the planner rewrites it to take the value from currency_map.

Modified query fragment 2:
(select exchanged, 'USD', rate, '12/10/06' from olsen where expressed='USD' and date='12/10/06' and exchanged in (select currency from currency_map where currency <> 'USD'))

Example 2: results

Currency   Rate
----------------------------------------
AUD        1.46
CAD        1.32
HKD        7.72
JPY        113.00

TICKER   PRICE
-------------------------------------------
ACAD     14.90
ACAM     6.51
ACOR     5.10

select ticker, price * exchange.rate, exchange.currency

TICKER   PRICE($)   PRICExRATE   CURRENCY
-------------------------------------------
ACAD     14.90      21.754       AUD
ACAD     14.90      19.668       CAD
ACAD     14.90      115.028      HKD
ACAD     14.90      1683.7       JPY
ACAM     6.51       9.505        AUD
ACAM     6.51       8.593        CAD
ACAM     6.51       50.257       HKD
ACAM     6.51       735.63       JPY
ACOR     5.10       7.446        AUD
ACOR     5.10       6.732        CAD
ACOR     5.10       39.372       HKD
ACOR     5.10       576.3        JPY
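The final table is the cross product of the price rows and the exchange-rate rows, with each price multiplied by each rate. The short Python sketch below reproduces that arithmetic from the values on the slide; the variable names are illustrative, and results are rounded to three decimals as in the table.

```python
from itertools import product

# Exchange rates against the USD, taken from the slide.
rates = {"AUD": 1.46, "CAD": 1.32, "HKD": 7.72, "JPY": 113.00}  # JPY: Japanese yen
# Stock prices in USD, taken from the slide.
prices = {"ACAD": 14.90, "ACAM": 6.51, "ACOR": 5.10}

# One output row per (ticker, currency) pair: ticker, price, price x rate, currency.
rows = [
    (ticker, price, round(price * rate, 3), currency)
    for (ticker, price), (currency, rate) in product(prices.items(), rates.items())
]

for ticker, price, converted, currency in rows:
    print(f"{ticker:<6}{price:>8.2f}{converted:>12.3f}  {currency}")
```

Three tickers and four currencies give twelve rows, the same count as the result table.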

Conclusion

• Three-layered architecture for querying web sources.
• Instead of burying capability handling in each wrapper, we created a generic capability handler.
• Using this capability handler, adding a web source to a federated database is as simple as declaring the extraction rules and capability record for the source.
• This was implemented and successfully tested.
• This makes millions of semi-structured web sites into useful “databases.”