
Livnat Sharabani

SDBI 2006

The Hidden Web

Based on:

“Distributed search over the hidden web: Hierarchical database sampling and selection” (Luis Gravano, Panagiotis G. Ipeirotis, Columbia University, VLDB 2002)

“When one sample is not enough: Improving text database selection using shrinkage” (Luis Gravano, Panagiotis G. Ipeirotis, Columbia University, SIGMOD 2004)

Content

- What is the hidden web?
- Content summary.
- Database classification.
- Combined algorithm.
- Shrinkage.
- Experiment results.
- Summary.

What is the hidden web?

The “hidden web” / “invisible web” is what you cannot retrieve ("see") in the results pages of general web search engines.

The “surface web” / “visible web” is what you see in the results pages from general web search engines.


Why Are Some Pages Invisible?

Technical barriers:
- Typing or judgment is required.
- Pages are dynamically generated.

Pages search engines choose to exclude:
- Links containing ‘?’ (can be a spider trap).
- Flash, Shockwave (spiders are HTML-optimized).

The hidden web - majority

Most of the hidden web consists of text databases on the web that are “hidden” behind search interfaces.

“Surface” web vs. “Hidden” web

Surface web:
- Link structure.
- The content is crawlable.
- The content is indexed by search engines like Google.

Hidden web:
- Documents are “hidden” in databases.
- The content is not crawlable.
- Each collection must be queried individually.


Metasearchers

A metasearcher is a tool for searching over multiple hidden databases simultaneously through a single query interface.

A metasearcher performs three main tasks:
- Database selection.
- Query translation.
- Result merging.

[Figure: a query goes to the metasearcher, which searches DB1, DB2 and DB3 on the web and returns the results.]

DB Content Summary

Statistics that characterize the database content:
- Document frequencies (df) of the words that appear in the database.
- Number of documents stored in the database.

Examples:

CNN.fn (Num Docs: 44,730)
Word      df
Breast    124
Cancer    44
…         …

CANCERLIT (Num Docs: 148,944)
Word      df
Breast    121,134
Cancer    91,688
…         …

Typical DB Selection Algorithm

A typical database selection algorithm relies on the database content summary to make its decision.

Given a content summary, the algorithm estimates how relevant the database is for a given query.

bGlOSS

The algorithm: calculate the number of documents expected to contain all the words in the query.

Example: for the query “breast cancer”, bGlOSS calculates:

CANCERLIT: |C| = 148,944, df(breast) = 121,134, df(cancer) = 91,688
148,944 * (121,134 / 148,944) * (91,688 / 148,944) = ~74,569

CNN.fn: |C| = 44,730, df(breast) = 124, df(cancer) = 44
44,730 * (124 / 44,730) * (44 / 44,730) = ~0
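The bGlOSS estimate above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the query words occur independently of each other, which is what the product of the df fractions implies. The function names `bgloss_estimate` and `rank_databases` are illustrative and do not come from the papers.

```python
def bgloss_estimate(num_docs, dfs):
    """Estimate how many documents contain all query words,
    assuming the words occur independently of each other."""
    estimate = float(num_docs)
    for df in dfs:
        estimate *= df / num_docs
    return estimate


# Content summaries from the slides: (number of documents, df per word).
summaries = {
    "CANCERLIT": (148_944, {"breast": 121_134, "cancer": 91_688}),
    "CNN.fn": (44_730, {"breast": 124, "cancer": 44}),
}


def rank_databases(query_words, summaries):
    """Score every database for the query and sort by descending score."""
    scores = {}
    for name, (num_docs, df) in summaries.items():
        # A word missing from the summary is treated as df = 0.
        scores[name] = bgloss_estimate(num_docs, [df.get(w, 0) for w in query_words])
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_databases(["breast", "cancer"], summaries)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A metasearcher would then send the query only to the top-ranked databases, here CANCERLIT.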

Database Selection

Database selection is based on the content summary.

How does the metasearcher obtain the DB content summary?
- Exported by the DB itself.
- Manually generated description.
- A technique that automates the extraction of content summaries from searchable text DBs.

Content Summary construction

Pioneering work by J. Callan and M. Connell was presented at SIGMOD ’99.

Their algorithm extracts a document sample from a given database D and computes the frequency of each observed word w in the sample.

The algorithm:

1. Start with a comprehensive word dictionary.

2. Pick a word and send it as a query to database D.

3. Retrieve the top k documents returned.

4. If the number of retrieved documents exceeds a pre-specified threshold, stop sampling. Otherwise return to step 2.

5. For each word w in the retrieved documents, calculate SampleDF(w).
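The five steps can be sketched as a short loop. This is only an illustration under stated assumptions: the `search(word, k)` callable, the in-memory `TOY_DOCS` database, and the `max_queries` safety cap are all hypothetical stand-ins for a real search interface, and words are picked uniformly at random from the dictionary.

```python
import random
from collections import Counter


def sample_database(search, dictionary, k=4, max_docs=300, max_queries=1000, seed=0):
    """Query-based sampling: probe with dictionary words until enough
    documents are collected, then count SampleDF for every word seen."""
    rng = random.Random(seed)
    sample = {}  # doc_id -> text, so repeated hits are stored once
    for _ in range(max_queries):
        if len(sample) >= max_docs:  # step 4: stop at the threshold
            break
        word = rng.choice(dictionary)  # step 2: pick a word
        for doc_id, text in search(word, k):  # step 3: top k documents
            sample[doc_id] = text
    sample_df = Counter()
    for text in sample.values():  # step 5: SampleDF(w)
        sample_df.update(set(text.lower().split()))
    return sample, sample_df


# Hypothetical toy database standing in for a real search interface.
TOY_DOCS = {
    1: "breast cancer treatment",
    2: "cancer research funding",
    3: "soccer world cup",
    4: "basketball season opens",
}


def toy_search(word, k):
    """Return up to k (doc_id, text) pairs whose text contains the word."""
    hits = [(i, t) for i, t in TOY_DOCS.items() if word in t.split()]
    return hits[:k]


sample, sample_df = sample_database(
    toy_search, ["cancer", "soccer", "basketball"], k=2, max_docs=4, max_queries=200
)
print(sorted(sample), sample_df.most_common(3))
```

Note that `sample_df` counts documents in the sample only; it says nothing about how many documents in the full database contain each word.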

There are two main versions of this algorithm, which differ in how they pick words from the dictionary:
- RS-Ord (Random Sampling, Other Resource): picks a random word from the dictionary.
- RS-Lrd (Random Sampling, Learned Resource): picks a word from the previously retrieved documents.

Neither version retrieves the actual document frequency of each word w. Hence two DBs that differ significantly in size might be assigned similar content summaries.


Database Classification

Classifying a database into a hierarchy of topics is another way to characterize its content.

Example: “CANCERLIT” can be classified under the category “health”.

Topics hierarchy

[Figure: the topics hierarchy.]

Automatic Document Classifier I

Queries closely associated with a topical category retrieve mainly documents about that category.
Example: the query “breast” AND “cancer” is likely to retrieve documents related to health.

By observing the number of matches generated for each query at a database, we can classify the database.
Example: if a database generates a large number of matches for queries associated with health and few matches for other categories, we can classify the database under the category health.

Automatic Document Classifier II

A rule-based document classifier uses a set of rules that define the classification decisions. Examples:
- “Jordan” AND “basketball” => sports
- “hepatitis” => health

A database can be classified under more than one category.

Automatic Document Classifier III

For each subcategory ci, the algorithm defines:
- Coverage(ci): the number of documents estimated to belong to ci.
- Specificity(ci): the fraction of documents estimated to belong to ci.

The algorithm classifies a database into a category ci if the values of Coverage(ci) and Specificity(ci) exceed two pre-specified thresholds.

Example

Rules:
- “soccer” => sport
- “basketball” => sport
- “diet” => health
- “diabetes” => health

Pre-defined thresholds:
- Coverage(ci) = 100
- Specificity(ci) = 0.5

Document frequencies:

Word         df
soccer       300
basketball   200
diet         140
diabetes     12
cancer       250

Coverage(sport) = 300 + 200 = 500

Coverage(health) = 140 + 12 = 152

Specificity(sport) = 500 / (500 + 152) = 0.77

Specificity(health) = 152 / (500 + 152) = 0.23

              sport   health
Coverage      500     152
Specificity   0.77    0.23

The word “cancer” does not appear in the rules, so it affects neither coverage nor specificity.
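The example can be checked with a short script. This is a sketch, assuming that each rule is a single probe word, that coverage is the sum of match counts over a category's rules, and that specificity is a category's share of the total coverage. With the frequencies from the table, the health coverage works out to 140 + 12 = 152. The function names are illustrative.

```python
# Rules and match counts from the example slides.
rules = {
    "soccer": "sport",
    "basketball": "sport",
    "diet": "health",
    "diabetes": "health",
}
match_counts = {"soccer": 300, "basketball": 200, "diet": 140, "diabetes": 12, "cancer": 250}


def coverage_and_specificity(rules, match_counts):
    """Coverage: matches summed per category.
    Specificity: each category's share of the total coverage."""
    coverage = {}
    for word, category in rules.items():
        coverage[category] = coverage.get(category, 0) + match_counts.get(word, 0)
    total = sum(coverage.values())
    specificity = {c: (cov / total if total else 0.0) for c, cov in coverage.items()}
    return coverage, specificity


def classify(rules, match_counts, min_coverage=100, min_specificity=0.5):
    """Keep the categories whose coverage and specificity both exceed the thresholds."""
    coverage, specificity = coverage_and_specificity(rules, match_counts)
    return [
        c for c in coverage
        if coverage[c] > min_coverage and specificity[c] > min_specificity
    ]


coverage, specificity = coverage_and_specificity(rules, match_counts)
categories = classify(rules, match_counts)
print(coverage, {c: round(s, 2) for c, s in specificity.items()}, categories)
```

Words without a rule, such as “cancer”, never enter the computation, so only sport passes both thresholds here.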

QProber (demo)

Construct Content Summary

Algorithm outline:

1. Retrieve a document sample.

2. Generate a preliminary content summary.

3. Categorize the database.

4. Estimate the absolute frequencies of the words retrieved from the database.


Document Sample

Document sample for category c:
  newdocs = Ø
  For each subcategory ci of c:
    For each query q relevant for ci:
      newdocs = newdocs U {top k documents returned for q}
      If q consists of a single word w,
        then ActualDF(w) = #matches returned for q.
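A minimal Python rendering of this sampling step follows. It assumes a hypothetical `search(query, k)` interface that returns the total number of matches together with the top k documents, and it models each query as a tuple of words combined with AND. The toy database and the per-subcategory match totals are illustrative additions.

```python
def sample_category(search, probes_by_subcategory, k=4):
    """Send every probe query of every subcategory to the database.
    Collect the top-k documents, record ActualDF for single-word
    queries, and total the matches per subcategory."""
    new_docs = {}
    actual_df = {}
    matches = {}
    for subcategory, queries in probes_by_subcategory.items():
        matches[subcategory] = 0
        for query in queries:
            num_matches, top_docs = search(query, k)
            matches[subcategory] += num_matches
            new_docs.update(top_docs)
            if len(query) == 1:
                # The match count of a single-word query is that word's ActualDF.
                actual_df[query[0]] = num_matches
    return new_docs, actual_df, matches


# Hypothetical toy database; a real one sits behind a search form.
TOY_DOCS = {
    1: "jordan leads bulls to victory",
    2: "maradona scores again",
    3: "maradona retires from soccer",
    4: "romario joins soccer club",
    5: "diet and diabetes advice",
}


def toy_search(query, k):
    """Conjunctive search: return (number of matches, top-k docs)."""
    hits = {i: t for i, t in TOY_DOCS.items() if all(w in t.split() for w in query)}
    return len(hits), dict(list(hits.items())[:k])


probes = {
    "sport": [("jordan", "bulls"), ("romario", "soccer"), ("maradona",)],
    "health": [("diabetes",)],
}
new_docs, actual_df, matches = sample_category(toy_search, probes, k=1)
print(sorted(new_docs), actual_df, matches)
```

The match totals per subcategory are what a classifier would later turn into coverage and specificity.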

Document Sample – Example I

[Figure: topic hierarchy with root START, categories Sport, Arts, Science and Health, and subcategories Basketball and Soccer.]

Rules:
Sport: “Jordan” AND “bulls”, “Romario” AND “soccer”, “Maradona”, “swimming”, etc.
Health: “diabetes”, “diet” AND “fat”, “stomach”, etc.
…

For the single-word queries we know ActualDF(.).


Content Summary

Build content summary for category c:
  For each word w in newdocs:
    SampleDF(w) = #documents in newdocs that contain w.


Categorizing the Database

The algorithm is recursive. We go down the topics hierarchy according to the “Coverage” and the “Specificity”.

Categorization:
  If Coverage(ci) > threshold1 and Specificity(ci) > threshold2
    then getContentSummary(ci)
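The recursive descent can be sketched as follows. This simplified illustration only tracks which categories are visited. It assumes a hypothetical `coverage_of(category)` callable that returns the coverage measured by probing, and it computes specificity among sibling categories only. The hierarchy and the coverage numbers are invented for the example.

```python
def classify_recursive(category, hierarchy, coverage_of, min_coverage, min_specificity):
    """Return the categories visited while pushing the database down
    the hierarchy; a child is entered only if it passes both thresholds."""
    visited = [category]
    children = hierarchy.get(category, [])
    coverage = {c: coverage_of(c) for c in children}
    total = sum(coverage.values())
    for child in children:
        specificity = coverage[child] / total if total else 0.0
        if coverage[child] > min_coverage and specificity > min_specificity:
            visited += classify_recursive(
                child, hierarchy, coverage_of, min_coverage, min_specificity
            )
    return visited


# Hypothetical hierarchy and coverage values, for illustration only.
hierarchy = {"root": ["sport", "health"], "sport": ["basketball", "soccer"]}
toy_coverage = {"sport": 500, "health": 152, "basketball": 420, "soccer": 80}

path = classify_recursive("root", hierarchy, toy_coverage.get, 100, 0.5)
print(path)
```

In the full algorithm each visited category would also contribute sampled documents to the content summary, which this sketch leaves out.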

Document Sample – Example II

[Figure: topic hierarchy with root START, categories Sport, Arts, Science and Health, and subcategories Basketball and Soccer; example words: NBA, statistics.]

Requirements:
- Coverage(ci) > x1
- Specificity(ci) > x2


Estimating absolute document Frequencies

To estimate the absolute document frequencies, the paper uses Zipf’s observation, as later refined by Mandelbrot:

f = P(r + p)^{-B}

- f: the frequency of the word.
- r: the rank of the word (by its frequency).
- P, p, B: parameters of the specific document collection.

Estimate actual word frequencies:

1. Sort the words in descending order of their SampleDF(.). Determine the rank ri of each word wi.

2. Estimate P, p and B from the ActualDF(.) values you have.

3. Estimate the absolute document frequency for all words in the sample.

Estimating absolute document Frequencies - Example

Rules:
Sport: “Jordan” AND “Bulls”, “Romario” AND “soccer”, “Maradona”, “swimming”, etc.

Word       SampleDF   ActualDF
Jordan     45         ---
Bulls      80         ---
Maradona   40         6800
Romario    32         ---
…

Rank: r(“Bulls”) = 1, r(“Jordan”) = 2, r(“Maradona”) = 3, r(“Romario”) = 4

Using Maradona (and other words with a known ActualDF), estimate P, p and B.

Then estimate the ActualDF of “Jordan”, “Bulls”, etc.
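One possible way to estimate P, p and B is sketched below. It is not necessarily the fitting procedure used in the papers. For each candidate p on a small grid, it fits log f = log P - B log(r + p) by ordinary least squares and keeps the best fit. Only the point (rank 3, ActualDF 6800) comes from the slide; the other known points are made up for illustration.

```python
import math


def fit_mandelbrot(known, p_grid=None):
    """known: list of (rank, actual_df) pairs.
    Returns (P, p, B) minimizing the squared error in log space."""
    if p_grid is None:
        p_grid = [i / 10 for i in range(0, 101)]  # candidate p values 0.0 .. 10.0
    ys = [math.log(f) for _, f in known]
    n = len(known)
    my = sum(ys) / n
    best = None
    for p in p_grid:
        xs = [math.log(r + p) for r, _ in known]
        mx = sum(xs) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        if sxx == 0:
            continue
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        intercept = my - slope * mx
        err = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, math.exp(intercept), p, -slope)
    _, P, p, B = best
    return P, p, B


def estimate_df(rank, P, p, B):
    """Estimated absolute document frequency of the word at this rank."""
    return P * (rank + p) ** (-B)


# (rank 3, 6800) is Maradona from the slide; the other points are hypothetical.
known_points = [(3, 6800), (10, 2500), (40, 700)]
P, p, B = fit_mandelbrot(known_points)
for word, rank in [("Bulls", 1), ("Jordan", 2), ("Romario", 4)]:
    print(word, round(estimate_df(rank, P, p, B)))
```

The more single-word probes there are with a known ActualDF, the better constrained the three parameters become.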

Content Summary Problems

The sparse data problem:
- The content summary tends to include the most frequent words but generally misses many other words that appear in only a few documents.
- Example: the word “hemophilia” appears in 0.1% of the PubMed documents. A typical content summary for PubMed will not include “hemophilia”, causing the metasearcher to treat PubMed as a non-relevant database for queries containing “hemophilia”.

Disproportion:
Some words might be disproportionately represented in the content summary.

Challenge:
Improving the quality of the content summary without necessarily increasing the document sample size.


Shrinkage

When multiple databases correspond to similar topic categories, they tend to have similar content summaries.

The content summaries of databases under similar topics can therefore complement each other.

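Category Content Summary

[Figure: hierarchy with Root, the categories Sport and Health, the subcategory Heart, and the databases D1, D2 and D3.]

For two databases under the same category:
- Estimated size = 1000, df(“hypertension”) = 480, P(“hypertension”) = 0.48
- Estimated size = 2000, df(“hypertension”) = 0, P(“hypertension”) = 0

For the category:
P(“hypertension”) = ((2000 * 0) + (1000 * 0.48)) / 3000 = 0.16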
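The category-level probability above is a size-weighted average of the per-database probabilities. A minimal sketch follows, assuming each database is described by its estimated size and a word-to-probability map; the function name is illustrative.

```python
def category_probability(databases, word):
    """Size-weighted average of P(word | database) over the
    databases classified under one category."""
    total_size = sum(size for size, _ in databases)
    weighted = sum(size * probs.get(word, 0.0) for size, probs in databases)
    return weighted / total_size


# (estimated size, {word: P(word | database)}) for the two databases in the slide.
databases = [
    (1000, {"hypertension": 0.48}),
    (2000, {"hypertension": 0.0}),
]
p_category = category_probability(databases, "hypertension")
print(round(p_category, 2))
```

Weighting by size keeps a small database from dominating the category summary.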

Shrunk content Summary I

To create a shrunk content summary, we must first create the category content summaries for all the categories in the hierarchy.

Consider a path in the topic hierarchy C1, …, Cm where Ci = parent(Ci+1).

[Figure: the path Root, C1, C2, C3 leading down to database D.]

Shrunk content Summary II

A shrunk content summary for database D classified under categories C1 … Cm is:

P_shrunk(w|D) = λ_{m+1} P(w|D) + Σ_{i=0}^{m} λ_i P(w|C_i)

Where:
- C_0 denotes the root of the hierarchy.
- The weights λ_0, …, λ_{m+1} are non-negative and sum to 1.

Shrunk content Summary III

[Figure: the path Root, C1, C2, C3 leading down to database D.]

P(w|D) = 0.6
P(w|C3) = 0.4
P(w|C2) = 0.78
P(w|C1) = 0.3
P(w|Root) = 0.01

Shrunk content summary:

0.01*λ0 + 0.3*λ1 + 0.78*λ2 + 0.4*λ3 + 0.6*λ4
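The weighted sum can be evaluated once concrete weights are chosen. The slide leaves the λ values symbolic, so the weights below are hypothetical and serve only to show the mechanics. The sketch assumes the weights sum to 1, with the largest weight on the database's own summary.

```python
def shrunk_probability(path_probs, lambdas):
    """Mix P(w | Root), P(w | C1), ..., P(w | D) with the weights λ0 ... λ(m+1)."""
    if len(path_probs) != len(lambdas):
        raise ValueError("need exactly one weight per summary on the path")
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(lam * prob for lam, prob in zip(lambdas, path_probs))


# Probabilities from the slide, ordered Root, C1, C2, C3, D.
path_probs = [0.01, 0.3, 0.78, 0.4, 0.6]
# Hypothetical weights λ0 ... λ4; the slide does not give values.
lambdas = [0.05, 0.10, 0.15, 0.20, 0.50]

p_shrunk = shrunk_probability(path_probs, lambdas)
print(round(p_shrunk, 4))
```

A word missing from the database sample, with P(w|D) = 0, would still get a non-zero shrunk probability from the category summaries, which is how shrinkage addresses sparse samples.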

Shrunk content Summary IV

The category weights: λm+1 is the highest among the λi’s, which means the highest weight is given to the original content summary.

The shrunk content summary incorporates information from multiple content summaries, and thus it can be closer to the complete (and unknown) content summary.

Shrunk content summary – is it always good?

Not always. If the “uncertainty” associated with the score is low, don’t use shrinkage:
- The sample size: if the database sample includes most of the documents from the DB (a small DB), then the sample is sufficiently complete. In this case shrinkage is not needed and might be undesirable.
- The frequency of the query words: if all the query words appear in almost all of the sample documents, then the distribution of the words over the DB is “certain”. The same goes if every query word appears in close to no sample documents.


Experiment Results

The papers address two aspects:
- Content summary quality.
- Database selection accuracy.

The papers show that exploiting the content summaries of similarly classified databases increases content summary quality and improves database selection for a given query.

Content summary quality I

Comparing coverage of the retrieved vocabulary: RS-Ord and RS-Lrd vs. different Rulers.

[Figure: % of retrieved words vs. specificity.]

Content summary quality II

Comparing rank of words: RS-Ord and RS-Lrd vs. different Rulers.

Content summary quality III

Comparing the number of queries sent to the database: RS-Ord and RS-Lrd vs. different Rulers.

Database selection using shrinkage

Shrinkage improves the selection of relevant databases.


Summary I

Database selection is critical to building efficient metasearchers that interact with a potentially large number of databases.

A metasearcher uses the database content summaries to select the most relevant databases for a given query.

Summary II

The papers present methods to improve the database content summary:
- Creating a content summary with estimates of actual document frequencies.
- Categorizing databases in a classification scheme.
- Exploiting the content summaries of similarly classified databases and combining them using shrinkage.

The End

"The invisible portion of the Web will continue to grow exponentially before the tools to uncover the hidden Web are ready for general use" (http://brightplanet.com/technology/deepweb.asp)

QUESTIONS?

Appendix

The metasearcher Turbo10: http://turbo10.com/index.html
