Livnat Sharabani, SDBI 2006: The Hidden Web


Page 1: The Hidden Web

Livnat Sharabani

SDBI 2006

Page 2: Based on

“Distributed search over the hidden web: Hierarchical database sampling and selection” (Luis Gravano, Panagiotis G. Ipeirotis, Columbia University, VLDB 2002)

“When one sample is not enough: Improving text database selection using shrinkage” (Luis Gravano, Panagiotis G. Ipeirotis, Columbia University, SIGMOD 2004)

Page 3: Content

- What is the hidden web?
- Content Summary.
- Database classification.
- Combined Algorithm.
- Shrinkage.
- Experimental Results.
- Summary.

Page 4: What is the hidden web?

- The “hidden web” (or “invisible web”) is the content that you cannot retrieve (“see”) in the results of general web search engines.
- The “surface web” (or “visible web”) is what you see in the results pages of general web search engines.

Page 5: “Surface” web vs. “Hidden” web

Page 6: Why Are Some Pages Invisible?

- Technical barriers:
  - Pages that can only be reached when typing or judgment is required.
  - Dynamically generated pages.
- Pages that search engines choose to exclude:
  - Links containing ‘?’ (these can be a spider trap).
  - Flash and Shockwave content (spiders are optimized for HTML).

Page 7: The hidden web - majority

- The majority of the hidden web consists of text databases on the web that are “hidden” behind search interfaces.

Page 8: “Surface” web vs. “Hidden” web

Surface web:
- Link structure.
- The content is crawlable.
- The content is indexed by search engines like Google.

Hidden web:
- Documents are “hidden” in databases.
- The content is not crawlable.
- Each collection needs to be queried individually.


Page 10: Metasearchers

- A metasearcher is a tool for searching over multiple hidden databases simultaneously through a single query interface.
- A metasearcher performs three main tasks:
  - Database selection.
  - Query translation.
  - Result merging.

[Figure: a query is sent to the metasearcher, which searches DB1, DB2, and DB3 on the web and returns the results.]

Page 11: DB Content Summary

Statistics that characterize the database content:
- Document frequencies (df) of the words that appear in the database.
- Number of documents stored in the database.

Examples:

CNN.fn (Num Docs: 44,730)

Word      df
Breast    124
Cancer    44
...       ...

CANCERLIT (Num Docs: 148,944)

Word      df
Breast    121,134
Cancer    91,688
...       ...

Page 12: Typical DB Selection Algorithm

- A typical database selection algorithm relies on the database content summary to make its decision.
- Given a content summary, the algorithm estimates how relevant the database is for a given query.

Page 13: bGlOSS

- The algorithm: estimate the number of documents that are expected to contain all the words in the query.
- Example: for the query “breast cancer”, bGlOSS calculates:
  - CANCERLIT: |C| = 148,944, df(breast) = 121,134, df(cancer) = 91,688
    148,944 * (121,134 / 148,944) * (91,688 / 148,944) ≈ 74,569
  - CNN.fn: |C| = 44,730, df(breast) = 124, df(cancer) = 44
    44,730 * (124 / 44,730) * (44 / 44,730) ≈ 0
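The bGlOSS arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation on this slide, assuming query words occur independently; the function names `bgloss_estimate` and `rank_databases` and the dictionary layout of the summaries are made up for this sketch.

```python
from math import prod

def bgloss_estimate(num_docs, dfs):
    """Estimated number of documents that contain all query words,
    assuming the words occur independently of each other."""
    if num_docs <= 0:
        return 0.0
    return num_docs * prod(df / num_docs for df in dfs)

# Content summaries taken from the slides (layout is an assumption).
summaries = {
    "CANCERLIT": {"num_docs": 148_944, "df": {"breast": 121_134, "cancer": 91_688}},
    "CNN.fn": {"num_docs": 44_730, "df": {"breast": 124, "cancer": 44}},
}

def rank_databases(query_words, summaries):
    """Score every database for the query and sort by descending score."""
    scores = {}
    for name, summary in summaries.items():
        # A word missing from the summary is treated as df = 0.
        dfs = [summary["df"].get(word, 0) for word in query_words]
        scores[name] = bgloss_estimate(summary["num_docs"], dfs)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank_databases(["breast", "cancer"], summaries)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Treating a missing word as df = 0 makes the whole score 0, which is exactly why incomplete content summaries hurt database selection.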

Page 14: Database Selection

- Database selection is based on the content summaries.
- How does the metasearcher obtain a DB content summary?
  - Exported by the DB itself.
  - Manually generated description.
  - Use a technique that automates the extraction of content summaries from searchable text DBs.

Page 15: Content Summary construction

- Pioneering work by J. Callan and M. Connell was presented at SIGMOD ’99.
- Their algorithm extracts a document sample from a given database D and computes the frequency of each observed word w in the sample.

Page 16: Content Summary construction

The algorithm:
1. Start with a comprehensive word dictionary.
2. Pick a word and send it as a query to database D.
3. Retrieve the top k documents returned.
4. If the number of retrieved documents exceeds a pre-specified threshold, stop sampling. Otherwise return to step 2.
5. For each word w in the retrieved documents, calculate SampleDF(w).
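The five steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `search(word, k)` is a hypothetical stand-in for the database's search interface, the toy `DOCS` collection is invented, words are split on whitespace, and sampling stops once the sample reaches `max_docs`.

```python
import random
from collections import Counter

def sample_database(search, dictionary, k=4, max_docs=300, max_queries=1000, seed=0):
    """Query-based sampling sketch that always picks words from the dictionary.

    `search(word, k)` is a hypothetical callable that returns up to k
    documents (plain strings) matching `word`.
    """
    rng = random.Random(seed)
    sample = {}  # dict used as an ordered set of distinct documents
    for _ in range(max_queries):
        if len(sample) >= max_docs:        # step 4: enough documents, stop
            break
        word = rng.choice(dictionary)      # step 2: pick a query word
        for doc in search(word, k):        # step 3: keep the top k documents
            sample.setdefault(doc, None)
    sample_df = Counter()                  # step 5: SampleDF(w)
    for doc in sample:
        sample_df.update(set(doc.lower().split()))
    return list(sample), sample_df

# Toy database and search interface, invented for this example.
DOCS = [
    "breast cancer treatment",
    "cancer research funding",
    "basketball season opens",
    "diet and cancer risk",
]

def toy_search(word, k):
    return [doc for doc in DOCS if word in doc.split()][:k]

docs, sample_df = sample_database(
    toy_search, ["cancer", "diet", "basketball"], k=2, max_docs=3
)
print(len(docs), sample_df.most_common(3))
```

The `max_queries` cap is an extra safeguard added here so the loop ends even if the database never returns enough documents.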

Page 17: Content Summary construction

- There are two main versions of this algorithm, which differ in how they pick the query words:
  - RS-Ord (Random Sampling, Other Resource): picks a random word from the dictionary.
  - RS-Lrd (Random Sampling, Learned Resource): picks a word from the previously retrieved documents.
- Neither version retrieves the actual document frequency of each word w. Hence two DBs that differ significantly in size might be assigned similar content summaries.


Page 19: Database Classification

- Classifying a database into a hierarchy of topics is another way to characterize the content of the database.
- Example: “CANCERLIT” can be classified under the category “health”.

Page 20: Topics hierarchy

[Figure: the topics hierarchy.]

Page 21: Automatic Document Classifier I

- Queries closely associated with a topical category retrieve mainly documents about that category.
  Example: a query for “breast” and “cancer” is likely to retrieve documents related to health.
- By observing the number of matches generated for each query at a database, we can classify the database.
  Example: if a database generates a large number of matches for queries associated with health and few matches for other categories, we can classify the database under the category health.

Page 22: Automatic Document Classifier II

- A rule-based document classifier uses a set of rules that define classification decisions. Examples:
  - “Jordan” AND “basketball” => sports
  - “hepatitis” => health
- A database can be classified into more than one category.

Page 23: Automatic Document Classifier III

- The algorithm defines for each subcategory c_i:
  - Coverage(c_i): the number of documents estimated to belong to c_i.
  - Specificity(c_i): the fraction of documents estimated to belong to c_i.
- The algorithm classifies a database into a category c_i if the values of Coverage(c_i) and Specificity(c_i) exceed two pre-specified thresholds.

Pages 24-33: Example

Rules:
- “soccer” => sport
- “basketball” => sport
- “diet” => health
- “diabetes” => health

Pre-defined thresholds:
- Coverage(c_i) = 100
- Specificity(c_i) = 0.5

Document frequencies:

Word          df
soccer        300
basketball    200
diet          140
diabetes      12
cancer        250

Coverage(sport) = 300 + 200 = 500
Coverage(health) = 140 + 12 = 152

Specificity(sport) = 500 / (500 + 152) = 0.77
Specificity(health) = 152 / (500 + 152) = 0.23

               sport    health
Coverage       500      152
Specificity    0.77     0.23

Only sport exceeds both thresholds, so the database is classified under sport.

The word “cancer” does not appear in any rule, so it affects neither coverage nor specificity.
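The coverage and specificity arithmetic of this example can be checked with a short Python sketch. It follows the simplified definition used on these slides (coverage is the sum of the document frequencies of a category's rule words); the function names are made up for the sketch. Note that 140 + 12 = 152 for health.

```python
# Rules and document frequencies from the example slides.
rules = {"soccer": "sport", "basketball": "sport", "diet": "health", "diabetes": "health"}
match_counts = {"soccer": 300, "basketball": 200, "diet": 140, "diabetes": 12, "cancer": 250}

def coverage_and_specificity(rules, match_counts):
    """Coverage: summed matches of the rule words of each category.
    Specificity: each category's share of the total coverage."""
    coverage = {}
    for word, category in rules.items():
        coverage[category] = coverage.get(category, 0) + match_counts.get(word, 0)
    total = sum(coverage.values())
    specificity = {c: (n / total if total else 0.0) for c, n in coverage.items()}
    return coverage, specificity

def classify(coverage, specificity, min_coverage=100, min_specificity=0.5):
    """Keep the categories that exceed both thresholds."""
    return [
        c for c in coverage
        if coverage[c] > min_coverage and specificity[c] > min_specificity
    ]

coverage, specificity = coverage_and_specificity(rules, match_counts)
categories = classify(coverage, specificity)
print(coverage, {c: round(s, 2) for c, s in specificity.items()}, categories)
```

Words such as “cancer” that appear in no rule are never looked up, so they change neither measure.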

Page 34: QProber

View Demo


Page 36: Construct Content Summary

Algorithm outline:
1. Retrieve a document sample.
2. Generate a preliminary content summary.
3. Categorize the database.
4. Estimate the absolute frequencies of the words retrieved from the database.


Page 38: Document Sample

Document sample for category c:
  newdocs = Ø
  For each subcategory c_i of c:
    For each query q relevant for c_i:
      newdocs = newdocs ∪ {top k documents returned for q}
      If q consists of a single word w
        then ActualDF(w) = #matches returned for q.
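The pseudocode above translates almost line by line into Python. In this sketch `search(query, k)` is a hypothetical interface that returns the number of matches together with the top k documents, and the toy collection and queries are invented for illustration.

```python
def sample_for_category(subcategory_queries, search, k=4):
    """Collect a document sample for one category.

    `subcategory_queries` maps each subcategory c_i to its queries, each
    query being a tuple of words. `search(query, k)` is a hypothetical
    callable returning (number_of_matches, top_k_documents).
    """
    newdocs = set()
    actual_df = {}
    for subcategory, queries in subcategory_queries.items():
        for query in queries:
            num_matches, top_docs = search(query, k)
            newdocs.update(top_docs)
            if len(query) == 1:
                # A single-word query reveals that word's real document frequency.
                actual_df[query[0]] = num_matches
    return newdocs, actual_df

# Toy database, invented for this example.
DOCS = [
    "jordan bulls win",
    "maradona scores",
    "maradona retires",
    "diet and fat",
    "diabetes diet",
]

def toy_search(query, k):
    hits = [doc for doc in DOCS if all(word in doc.split() for word in query)]
    return len(hits), hits[:k]

queries = {
    "sport": [("jordan", "bulls"), ("maradona",)],
    "health": [("diabetes",), ("diet", "fat")],
}
newdocs, actual_df = sample_for_category(queries, toy_search, k=1)
print(sorted(newdocs), actual_df)
```

Only the single-word queries contribute to `actual_df`; multi-word queries add documents to the sample but say nothing about any individual word's frequency.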

Page 39: Document Sample – Example I

[Figure: topics hierarchy with root START; top-level categories Sport, Arts, Science, and Health; subcategories Basketball and Soccer.]

Rules:

Category    Queries
Sport       “Jordan” AND “bulls”, “Romario” AND “soccer”, “Maradona”, “swimming”, etc.
Health      “diabetes”, “diet” AND “fat”, “stomach”, etc.
...

We know ActualDF(.) for the single-word queries.


Page 41: Content Summary

Build content summary for category c:
  For each word w in newdocs:
    SampleDF(w) = #documents in newdocs that contain w.
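Computing SampleDF is a plain document-frequency count over the sample. A minimal sketch, assuming documents are strings and words are separated by whitespace (the function name `sample_df` is chosen for this example):

```python
from collections import Counter

def sample_df(newdocs):
    """SampleDF(w): number of sampled documents that contain w."""
    df = Counter()
    for doc in newdocs:
        # set() makes a word count once per document, however often it repeats.
        df.update(set(doc.lower().split()))
    return df

newdocs = ["maradona scores", "maradona retires", "jordan bulls win"]
df = sample_df(newdocs)
print(df["maradona"], df["jordan"])
```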


Page 43: Categorizing the Database

- The algorithm is recursive.
- We go down the topics hierarchy according to the coverage and the specificity.
- Categorization:
    If Coverage(c_i) > threshold1 and Specificity(c_i) > threshold2
    Then getContentSummary(c_i)
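The recursive descent can be sketched as below. To keep the example short, coverage and specificity are given as precomputed dictionaries, whereas the real algorithm would obtain them by probing the database at each level. The hierarchy shape (Basketball and Soccer under Sport) and all numbers are assumptions made for illustration.

```python
def categorize(node, hierarchy, coverage, specificity, t_coverage, t_specificity):
    """Return the categories visited while descending the topic hierarchy.

    A subcategory is entered only if both its coverage and its
    specificity exceed the thresholds.
    """
    visited = [node]
    for child in hierarchy.get(node, []):
        if (coverage.get(child, 0) > t_coverage
                and specificity.get(child, 0.0) > t_specificity):
            visited.extend(
                categorize(child, hierarchy, coverage, specificity,
                           t_coverage, t_specificity)
            )
    return visited

# Hypothetical hierarchy and probe results.
hierarchy = {
    "START": ["Sport", "Arts", "Science", "Health"],
    "Sport": ["Basketball", "Soccer"],
}
coverage = {"Sport": 500, "Health": 152, "Basketball": 300, "Soccer": 40}
specificity = {"Sport": 0.77, "Health": 0.23, "Basketball": 0.88, "Soccer": 0.12}

path = categorize("START", hierarchy, coverage, specificity, 100, 0.5)
print(path)
```

Because several children can pass the thresholds at once, the result is a list of visited categories rather than a single path, which matches the remark that a database can belong to more than one category.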

Page 44: Document Sample – Example II

[Figure: topics hierarchy with root START; top-level categories Sport, Arts, Science, and Health; subcategories Basketball and Soccer; label “NBA statistics”.]

Requirements:
- Coverage(c_i) > x1
- Specificity(c_i) > x2


Page 46: Estimating absolute document Frequencies

To estimate the absolute document frequencies, the paper uses Zipf’s observation, which was later refined by Mandelbrot:

$f = P(r+p)^{-B}$

Pages 47-49: Estimating absolute document Frequencies

$f = P(r+p)^{-B}$

- f: the frequency of the word.
- r: the rank of the word (by its frequency).
- P, p, B: parameters of the specific document collection.

Page 50: Estimating absolute document Frequencies - Example

Rules:

Category    Queries
Sport       “Jordan” AND “Bulls”, “Romario” AND “soccer”, “Maradona”, “swimming”, etc.

Word        SampleDF    ActualDF
Jordan      45          ---
Bulls       80          ---
Maradona    40          6800
Romario     32          ---
...

Rank (by SampleDF):
- r(“Bulls”) = 1
- r(“Jordan”) = 2
- r(“Maradona”) = 3
- r(“Romario”) = 4

Page 51: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed

5151

Estimating absolute document frequencies

Estimate actual word frequencies:
1. Sort the words in descending order of their SampleDF(.). Determine the rank r_i of each word w_i.
2. Estimate P, p, B from the ActualDF(.) values you have.
3. Estimate the absolute document frequency for all words in the sample.
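
The three steps above can be sketched in Python as follows. This is a minimal sketch, not the papers' implementation: the function names `fit_mandelbrot` and `estimate_df`, the grid search over p, and the known (rank, ActualDF) pairs are all assumptions made for illustration. For a fixed p, the law f = P(r + p)^{-B} is a straight line in log space, so each candidate p is fitted by ordinary least squares and the best fit is kept.

```python
import math


def fit_mandelbrot(ranks, freqs, p_grid):
    """Fit f = P * (r + p) ** (-B) to known (rank, frequency) pairs.

    For a fixed p the model is linear in log space:
        log f = log P - B * log(r + p)
    so each candidate p is fitted by least squares and the best is kept.
    Needs at least two distinct ranks.
    """
    ys = [math.log(f) for f in freqs]
    n = len(ys)
    mean_y = sum(ys) / n
    best = None
    for p in p_grid:
        xs = [math.log(r + p) for r in ranks]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        err = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, math.exp(intercept), p, -slope)
    _, P, p, B = best
    return P, p, B


def estimate_df(rank, P, p, B):
    """Estimated absolute document frequency of the word at this rank."""
    return P * (rank + p) ** (-B)


# Hypothetical data: ActualDF values for a few words of known rank,
# generated here from made-up parameters so the fit can be checked.
TRUE_P, TRUE_p, TRUE_B = 10000.0, 1.0, 1.2
known_ranks = [1, 3, 10, 50]
known_dfs = [estimate_df(r, TRUE_P, TRUE_p, TRUE_B) for r in known_ranks]

p_candidates = [i / 10 for i in range(0, 51)]  # p in 0.0, 0.1, ..., 5.0
P, p, B = fit_mandelbrot(known_ranks, known_dfs, p_candidates)

# Step 3: estimate the frequency of a word with no known ActualDF (rank 2).
estimated = estimate_df(2, P, p, B)
```

With only one known ActualDF the three parameters cannot be pinned down, which is why the sketch assumes several known values.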


Estimating absolute document frequencies - Example

Rank: r(“Bulls”) = 1, r(“Jordan”) = 2, r(“Maradona”) = 3, r(“Romario”) = 4

According to “Maradona” (and other words with a known ActualDF), estimate P, p and B.

Estimate the ActualDF of “Jordan”, “Bulls”, etc.

Rules (category Sport): “Jordan” and “Bulls”, “Romario” and “soccer”, “Maradona”, “swimming”, etc.

Word       SampleDF   ActualDF
Jordan     45         ---
Bulls      80         ---
Maradona   40         6800
Romario    32         ---
…


Content Summary Problems

The sparse data problem:
The content summary tends to include the most frequent words but generally misses many other words that appear in only a few documents.
Example: The word “hemophilia” appears in 0.1% of the PubMed documents. A typical content summary for PubMed will not include “hemophilia”, so the metasearcher will consider PubMed not relevant to a query containing “hemophilia”.


Content Summary Problems

Disproportion:
Some words might be disproportionately represented in the content summary.

Challenge:
Improving the quality of the content summary without necessarily increasing the document sample size.


Shrinkage

When multiple databases correspond to similar topic categories, they tend to have similar content summaries.

The content summaries of databases under similar topics can mutually complement each other.


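Category Content Summary

[Figure: topic hierarchy with Root at the top, the categories Sport and Health below it, then Heart and D3, then D1 and D2.]

One database: ^DB = 1000 (estimated size), Df(“hypertension”) = 480, P(“hypertension”) = 0.48

Another database: ^DB = 2000 (estimated size), Df(“hypertension”) = 0, P(“hypertension”) = 0

Category: P(“hypertension”) = ((2000*0) + (1000*0.48)) / 3000 = 0.16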
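
The category figure above is a size-weighted average of the per-database probabilities. A minimal Python sketch of that arithmetic follows; the function name `category_probability` and the data layout are assumptions made for illustration, not taken from the papers.

```python
def category_probability(databases, word):
    """Size-weighted average of P(word) over the databases in a category.

    `databases` is a list of (estimated_size, {word: probability}) pairs.
    A word missing from a database summary counts as probability 0.
    """
    total_size = sum(size for size, _ in databases)
    weighted = sum(size * probs.get(word, 0.0) for size, probs in databases)
    return weighted / total_size


# The two databases from the slide.
databases = [
    (1000, {"hypertension": 0.48}),
    (2000, {"hypertension": 0.0}),
]
p_category = category_probability(databases, "hypertension")
```

Weighting by estimated size means a large database with no occurrences pulls the category probability down more than a small one would.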


Shrunk content summary I

To create a shrunk content summary we must first create the category content summaries for all the categories in the hierarchy.

Consider a path in the topic hierarchy C_1, …, C_m where C_i = parent(C_{i+1}).

[Figure: the path Root -> C1 -> C2 -> C3 -> D.]


Shrunk content summary II

A shrunk content summary for database D classified under categories C_1, …, C_m is:

P_R(w|D) = λ_{m+1} * P(w|D) + Σ_{i=0}^{m} λ_i * P(w|C_i)

Where:
P_R(w|D) is the shrunk probability of word w in D, C_0 denotes the Root, and λ_0, …, λ_{m+1} are the mixture weights, with λ_0 + … + λ_{m+1} = 1.


Shrunk content summary III

[Figure: the path Root -> C1 -> C2 -> C3 -> D, annotated with the probability of a word w at each node.]

P(w|Root) = 0.01
P(w|C1) = 0.3
P(w|C2) = 0.78
P(w|C3) = 0.4
P(w|D) = 0.6

Shrunk content summary:

0.01*λ0 + 0.3*λ1 + 0.78*λ2 + 0.4*λ3 + 0.6*λ4
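
To make the expression above concrete, the sketch below evaluates it in Python. The slide does not give the λ values, so the weights here are hypothetical, chosen only so that they sum to 1 (assumed, as is usual for mixture weights) and so that the database's own summary (λ4) gets the largest weight. The function name `shrunk_probability` is likewise an assumption.

```python
def shrunk_probability(probabilities, weights):
    """Weighted mixture of P(w|.) along the path Root, C1, ..., Cm, D."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per node on the path")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights are assumed to sum to 1")
    return sum(lam * prob for lam, prob in zip(weights, probabilities))


# P(w|Root), P(w|C1), P(w|C2), P(w|C3), P(w|D) from the slide.
path_probabilities = [0.01, 0.3, 0.78, 0.4, 0.6]

# Hypothetical weights lambda0..lambda4; not taken from the papers.
lambdas = [0.05, 0.10, 0.15, 0.20, 0.50]

shrunk = shrunk_probability(path_probabilities, lambdas)
```

With these made-up weights the shrunk value lies between the smallest and largest probabilities on the path, as any convex combination must.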


Shrunk content summary IV

The category weights: λ_{m+1} is the highest among the λ_i's, which means the highest weight is given to the original content summary.

The shrunk content summary incorporates information from multiple content summaries and thus can be closer to the complete (and unknown) content summary.


Shrunk content summary – is it always good?

Not always. If the “uncertainty” associated with the score is low, don’t use shrinkage:

The sample size: if the database sample includes most of the documents from the DB (a small DB), then the sample is sufficiently complete. In this case shrinkage is not needed and might be undesirable.

The frequency of the query words: if all the query words appear in almost all of the sample documents, then the distribution of the words over the DB is “certain”. The same goes if every query word appears in close to no sample document.
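
One way to see why these two cases count as “certain” is the textbook standard error of a sampled proportion, with the finite-population correction for sampling without replacement. The sketch below only illustrates that intuition; it is not the uncertainty measure used in the papers, and the function name and example numbers are assumptions.

```python
import math


def sample_fraction_std_error(p, sample_size, db_size):
    """Standard error of a document-frequency fraction p measured on a sample.

    Uses the finite-population correction, so the error shrinks to zero
    as the sample approaches the whole database.
    """
    if db_size <= 1 or sample_size >= db_size:
        return 0.0
    fpc = math.sqrt((db_size - sample_size) / (db_size - 1))
    return math.sqrt(p * (1 - p) / sample_size) * fpc


# Hypothetical numbers: a 300-document sample in each case.
mid_word_large_db = sample_fraction_std_error(0.50, 300, 100_000)
common_word_large_db = sample_fraction_std_error(0.99, 300, 100_000)
rare_word_large_db = sample_fraction_std_error(0.01, 300, 100_000)
mid_word_small_db = sample_fraction_std_error(0.50, 300, 320)
```

The error is smallest when the word appears in nearly all or nearly none of the sample documents, and when the sample covers most of the database, matching the two cases on the slide.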


Experiments Result

The papers refer to 2 aspects:
Content summary quality.
Database selection accuracy.

The papers show that exploiting the content summaries of similarly classified databases increases content summary quality and improves database selection for a given query.


Content summary quality I

Comparing coverage of the retrieved vocabulary. RS-ORD and RS-LRD vs. different Rulers.

[Figure: plot with axes “Specificity” and “% retrieved words”.]


Content summary quality II

Comparing rank of words. RS-ORD and RS-LRD vs. different Rulers.


Content summary quality III

Comparing the number of queries sent to the database. RS-ORD and RS-LRD vs. different Rulers.


Database selection using shrinkage

Shrinkage improves the selection of relevant databases.


Summary I

Database selection is critical to building efficient metasearchers that interact with a potentially large number of databases.

Metasearchers use the database content summaries to select the most relevant databases for a given query.


Summary II

The papers present methods to improve the database content summary:
Creating a content summary with estimation of actual document frequencies.
Categorizing databases in a classification scheme.
A method that exploits the content summaries of similarly classified databases and combines them using shrinkage.


The End

"The invisible portion of the Web will continue to grow exponentially before the tools to uncover the hidden Web are ready for general use" (http://brightplanet.com/technology/deepweb.asp)

QUESTIONS?


Appendix

The metasearcher Turbo10 - http://turbo10.com/index.html