TRANSCRIPT

Livnat Sharabani
SDBI 2006
The Hidden Web

Based on:
"Distributed search over the hidden web: Hierarchical database sampling and selection" (Panagiotis G. Ipeirotis, Luis Gravano, Columbia University, VLDB 2002)
"When one sample is not enough: Improving text database selection using shrinkage" (Panagiotis G. Ipeirotis, Luis Gravano, Columbia University, SIGMOD 2004)

Content

What is the hidden web? Content Summary. Database Classification. Combined Algorithm. Shrinkage. Experiment Results. Summary.

What is the hidden web?

The "hidden web" / "invisible web" is what you cannot retrieve ("see") in the search results from general web search engines.
The "surface web" / "visible web" is what you see in the result pages of general web search engines.

"Surface" web vs. "Hidden" web

Why Are Some Pages Invisible?

Technical barriers:
- When typing or judgment is required.
- Dynamically generated pages.

Pages search engines choose to exclude:
- Links containing '?' (can be a spider trap).
- Flash, Shockwave (spiders are HTML-optimized).

The hidden web - majority

Text databases on the web which are "hidden" behind search interfaces.

"Surface" web vs. "Hidden" web

Surface web:
- Link structure.
- The content is crawlable.
- The content is indexed by search engines like Google.

Hidden web:
- Documents "hidden" in databases.
- The content is not crawlable.
- Need to query each collection individually.

Content

What is the hidden web? Content Summary. Database Classification. Combined Algorithm. Shrinkage. Experiment Results. Summary.

Metasearchers

A metasearcher is a tool for searching over multiple hidden databases simultaneously through a single query interface.
A metasearcher performs three main tasks:
- Database selection.
- Query translation.
- Result merging.

[Diagram: the metasearcher sends the query over the web to DB1, DB2, and DB3 and merges the results.]

DB Content Summary

Statistics that characterize the database content:
- Document frequencies of the words appearing in the database.
- Number of documents stored in the database.

Examples:

CNN.fn (Num Docs: 44,730)
Word    df
Breast  124
Cancer  44
...

CANCERLIT (Num Docs: 148,944)
Word    df
Breast  121,134
Cancer  91,688
...

Typical DB Selection Algorithm

A typical database selection algorithm depends on the database content summary to make its decision.
Given a content summary, the algorithm estimates how relevant the database is for a given query.

bGlOSS

The algorithm: calculate the number of documents expected to contain all the words in the query.
Example: for the query "breast cancer", bGlOSS calculates:

CANCERLIT: |C| = 148,944, df(breast) = 121,134, df(cancer) = 91,688
148,944 * (121,134/148,944) * (91,688/148,944) ≈ 74,569

CNN.fn: |C| = 44,730, df(breast) = 124, df(cancer) = 44
44,730 * (124/44,730) * (44/44,730) ≈ 0

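The bGlOSS estimate is a simple product of independent word probabilities scaled by the collection size. A minimal sketch (function name mine) reproduces the slide's two calculations:

```python
def bgloss_score(num_docs, df, query_words):
    """bGlOSS estimate: expected number of documents containing all
    query words, assuming the words occur independently."""
    score = num_docs
    for w in query_words:
        score *= df.get(w, 0) / num_docs
    return score

# Content summaries from the slide.
cancerlit = {"breast": 121_134, "cancer": 91_688}
cnn_fn = {"breast": 124, "cancer": 44}
print(bgloss_score(148_944, cancerlit, ["breast", "cancer"]))  # ≈ 74,568.5 (the slide's ~74,569)
print(bgloss_score(44_730, cnn_fn, ["breast", "cancer"]))      # ≈ 0.12
```

With these scores, a selection algorithm would route the query "breast cancer" to CANCERLIT rather than CNN.fn.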
Database Selection

Database selection is based on the content summary.
How does the metasearcher obtain the DB content summary?
- Exported by the DB itself.
- Manually generated description.
- A technique that automates the extraction of content summaries from searchable text DBs.

Content Summary Construction

Pioneering work by J. Callan and M. Connell was presented at SIGMOD '99.
Their algorithm extracts a document sample from a given database D and computes the frequency of each observed word w in the sample.

Content Summary Construction

The algorithm:
1. Start with a comprehensive word dictionary.
2. Pick a word and send it as a query to database D.
3. Retrieve the top k documents returned.
4. If the number of retrieved documents exceeds a pre-specified threshold, stop sampling. Otherwise, return to step 2.
5. For each word w in the retrieved documents, calculate SampleDF(w).

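The five steps above can be sketched in Python. `search(word, k)` is an assumed stand-in for the database's query interface, and the budget and threshold values are illustrative:

```python
import random

def query_based_sample(search, dictionary, k=4, max_docs=300, max_queries=500):
    """Sketch of Callan & Connell-style query-based sampling.
    `search(word, k)` is an assumed interface returning (doc_id, words)
    pairs for the top-k documents matching a one-word query."""
    sample = {}                           # doc_id -> set of words
    for _ in range(max_queries):          # query budget: avoid looping forever
        if len(sample) >= max_docs:       # pre-specified threshold reached
            break
        word = random.choice(dictionary)  # RS-Ord style: random dictionary word
        for doc_id, words in search(word, k):
            sample[doc_id] = set(words)
    # Step 5: SampleDF(w) = number of sampled documents that contain w
    sample_df = {}
    for words in sample.values():
        for w in words:
            sample_df[w] = sample_df.get(w, 0) + 1
    return sample_df
```

An RS-Lrd variant would instead pick the next probe word from the documents already in `sample`.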
Content Summary Construction

There are two main versions of this algorithm that differ in how they pick words from the dictionary:
- RS-Ord (Random Sampling, Other Resource) picks a random word from the dictionary.
- RS-Lrd (Random Sampling, Learned Resource) picks a word from the previously retrieved documents.

Both versions do not retrieve the actual document frequency of each word w; hence two DBs that differ significantly in size might be assigned similar content summaries.

Content

What is the hidden web? Content Summary. Database Classification. Combined Algorithm. Shrinkage. Experiment Results. Summary.

Database Classification

Classifying a database into a hierarchy of topics is another way to characterize the content of a database.
Example: "CANCERLIT" can be classified under the category "health".

Topics Hierarchy

[Diagram: a hierarchy of topic categories.]

Automatic Document Classifier I

Queries closely associated with topical categories retrieve mainly documents about that category.
Example: "breast" and "cancer" is likely to retrieve documents related to health.
By observing the number of matches generated for each query at a database, we can classify the database.
Example: if a database generates a large number of matches for queries associated with health and few matches for other categories, we can classify the database under the category health.

Automatic Document Classifier II

A rule-based document classifier uses a set of rules defining the classification decisions. Examples:
"Jordan" AND "basketball" => sports
"hepatitis" => health

A database can be classified into more than one category.

Automatic Document Classifier III

The algorithm defines, for each subcategory ci:
- Coverage(ci) - the number of documents estimated to belong to ci.
- Specificity(ci) - the fraction of documents estimated to belong to ci.

The algorithm classifies a database into a category ci if the values of Coverage(ci) and Specificity(ci) exceed two pre-specified thresholds.

Example

Rules:
"soccer" => sport
"basketball" => sport
"diet" => health
"diabetes" => health

Pre-defined thresholds:
Coverage(ci) = 100
Specificity(ci) = 0.5

Document frequencies:
soccer      300
basketball  200
diet        140
diabetes    12
Cancer      250

Coverage(sport) = 300 + 200 = 500
Coverage(health) = 140 + 12 = 152
Specificity(sport) = 500/(500+152) ≈ 0.77
Specificity(health) = 152/(500+152) ≈ 0.23

            sport  health
Coverage    500    152
Specificity 0.77   0.23

The word "cancer" does not appear in the rules, so it affects neither coverage nor specificity.

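The example's arithmetic can be reproduced with a short sketch (function and variable names mine; note that the table's diet = 140 and diabetes = 12 give Coverage(health) = 152):

```python
# Rule-based classification sketch. Each rule maps a probe word to a
# category; match_counts holds the number of matches the database
# reported for each one-word probe.
RULES = {"soccer": "sport", "basketball": "sport",
         "diet": "health", "diabetes": "health"}

def classify(match_counts, rules, cov_thresh=100, spec_thresh=0.5):
    coverage = {}
    for word, category in rules.items():
        coverage[category] = coverage.get(category, 0) + match_counts.get(word, 0)
    total = sum(coverage.values())
    chosen = []
    for category, cov in coverage.items():
        specificity = cov / total if total else 0.0
        if cov > cov_thresh and specificity > spec_thresh:
            chosen.append(category)
    return coverage, chosen

# Document frequencies from the example slide ("cancer" matches no
# rule, so it affects neither coverage nor specificity).
counts = {"soccer": 300, "basketball": 200, "diet": 140,
          "diabetes": 12, "cancer": 250}
coverage, chosen = classify(counts, RULES)
print(coverage)  # {'sport': 500, 'health': 152}
print(chosen)    # ['sport']  (Specificity(sport) = 500/652 ≈ 0.77)
```

Only "sport" clears both thresholds; "health" has enough coverage (152 > 100) but too little specificity (≈0.23 < 0.5).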
QProber

View Demo

Content

What is the hidden web? Content Summary. Database Classification. Combined Algorithm. Shrinkage. Experiment Results. Summary.

Construct Content Summary

Algorithm outline:
1. Retrieve a document sample.
2. Generate a preliminary content summary.
3. Categorize the database.
4. Estimate the absolute frequencies of the words retrieved from the database.

Document Sample

Document sample for category c:
newdocs = Ø
For each subcategory ci of c:
  For each query q relevant to ci:
    newdocs = newdocs ∪ {top k documents returned for q}
    If q consists of a single word w, then ActualDF(w) = #matches returned for q.

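A sketch of this loop, with `search` and `num_matches` as assumed stand-ins for the database's query interface:

```python
def sample_category(search, num_matches, queries_for, subcategories, k=4):
    """Hierarchical sampling for one category c (names mine).
    `search(q, k)` yields (doc_id, words) pairs for the top-k results
    and `num_matches(q)` is the total match count the database
    reports for query q."""
    newdocs = {}       # doc_id -> set of words   (newdocs on the slide)
    actual_df = {}     # ActualDF(w), known only for one-word queries
    for ci in subcategories:
        for q in queries_for(ci):
            for doc_id, words in search(q, k):
                newdocs[doc_id] = set(words)
            if len(q.split()) == 1:             # q is a single word w
                actual_df[q] = num_matches(q)   # ActualDF(w) = #matches
    return newdocs, actual_df
```

Multi-word probes (e.g. "Jordan" AND "bulls") contribute documents to the sample but no ActualDF entry, which is exactly the gap the frequency-estimation step later fills.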
Document Sample - Example I

[Diagram: hierarchy with Start → Sport, Arts, Science, Health; Sport → Basketball, Soccer.]

Rules:
Sport: "Jordan" AND "bulls", "Romario" AND "soccer", "Maradona", "swimming", etc.
Health: "diabetes", "diet" AND "fat", "stomach", etc.
...

For the single-word queries we know ActualDF(·).

Content Summary

Build the content summary for category c:
For each word w in newdocs:
  SampleDF(w) = #documents in newdocs that contain w.

Categorizing the Database

The algorithm is recursive: we go down the topic hierarchy according to Coverage and Specificity.
Categorization:
If Coverage(ci) > threshold1 and Specificity(ci) > threshold2,
then getContentSummary(ci).

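The recursion can be sketched as follows; `children`, `coverage`, and `specificity` are assumed inputs, and the numbers in the usage example are illustrative, not from the slides:

```python
def categorize(category, children, coverage, specificity,
               cov_thresh=100, spec_thresh=0.5):
    """Recursive descent through the topic hierarchy: enter a
    subcategory only when both Coverage and Specificity clear their
    thresholds. `children` maps a category to its subcategories;
    `coverage` and `specificity` are callables backed by
    probe-query estimates."""
    chosen = [category]
    for ci in children.get(category, []):
        if coverage(ci) > cov_thresh and specificity(ci) > spec_thresh:
            chosen += categorize(ci, children, coverage, specificity,
                                 cov_thresh, spec_thresh)
    return chosen

# Illustrative numbers.
children = {"root": ["sport", "health"], "sport": ["basketball", "soccer"]}
cov = {"sport": 500, "health": 150, "basketball": 300, "soccer": 80}
spec = {"sport": 0.7, "health": 0.2, "basketball": 0.6, "soccer": 0.3}
print(categorize("root", children, cov.get, spec.get))
# ['root', 'sport', 'basketball']
```

Here the descent stops at "health" (specificity too low) and "soccer" (coverage too low), so the database is pushed down only along the root → sport → basketball path.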
Document Sample - Example II

[Diagram: the same hierarchy; the database (e.g. NBA statistics) is classified downward from Start into Sport, then Basketball.]

Requirements:
Coverage(ci) > x1
Specificity(ci) > x2

Estimating Absolute Document Frequencies

To estimate the absolute document frequencies, the paper uses Zipf's observation, as later refined by Mandelbrot:

f = P(r + p)^(-B)

where:
f => the frequency of the word.
r => the rank of the word (by its frequency).
P, p, B => parameters of the specific document collection.

Estimating Absolute Document Frequencies - Example

Rules (Sport): "Jordan" AND "Bulls", "Romario" AND "soccer", "Maradona", "swimming", etc.

Rank (by sample frequency):
r("Bulls") = 1, r("Jordan") = 2, r("Maradona") = 3, r("Romario") = 4

Word      SampleDF  ActualDF
Jordan    45        ---
Bulls     80        ---
Maradona  40        6,800
Romario   32        ---
...

Estimating Absolute Document Frequencies

Estimating the actual word frequencies:
1. Sort the words in descending order of SampleDF(·) to determine the rank ri of each word wi.
2. Estimate P, p, B from the ActualDF(·) values you have.
3. Estimate the absolute document frequency of every word in the sample.

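A crude stand-in for step 2's curve fitting: grid-search the shift p, then solve the log-linear regression log f = log P - B·log(r + p). This is only a sketch under synthetic data, not the papers' actual fitting procedure:

```python
import math

def fit_mandelbrot(known):
    """Fit f = P * (r + p)^(-B) to (rank, frequency) pairs with a
    grid search over p plus a log-linear least-squares fit."""
    best = None
    for step in range(0, 41):                 # try p = 0.0, 0.5, ..., 20.0
        p = step / 2
        xs = [math.log(r + p + 1e-9) for r, _ in known]
        ys = [math.log(f) for _, f in known]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        if sxx == 0:                          # need distinct ranks
            continue
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        logP = my - slope * mx
        err = sum((logP + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, math.exp(logP), p, -slope)
    _, P, p, B = best
    return P, p, B

def estimate_df(rank, P, p, B):
    """Absolute document frequency predicted for a given rank (step 3)."""
    return P * (rank + p) ** (-B)

# Synthetic check: data generated with P=10000, p=2, B=1.2 is recovered.
known = [(r, 10000 * (r + 2) ** -1.2) for r in (3, 10, 30)]
P, p, B = fit_mandelbrot(known)
```

In the deck's example, `known` would hold the (rank, ActualDF) pairs of the single-word probes such as "Maradona", and `estimate_df` would then fill in the missing ActualDF of "Jordan", "Bulls", etc.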
Estimating Absolute Document Frequencies - Example

According to "Maradona" (and any other words with a known ActualDF), estimate P, p, and B.
Then estimate the ActualDF of "Jordan", "Bulls", etc.

Content Summary Problems

The sparse-data problem: the content summary tends to include the most frequent words but generally misses many other words that appear in only a few documents.
Example: the word "hemophilia" appears in 0.1% of PubMed documents. A typical content summary for PubMed will not include "hemophilia", causing the metasearcher to judge PubMed non-relevant for a query containing "hemophilia".

Content Summary Problems

Disproportion: some words might be disproportionately represented in the content summary.

Challenge: improve the quality of the content summary without necessarily increasing the document sample size.

Content

What is the hidden web? Content Summary. Database Classification. Combined Algorithm. Shrinkage. Experiment Results. Summary.

Shrinkage

When multiple databases correspond to similar topic categories, they tend to have similar content summaries.
The content summaries of databases under similar topics can therefore mutually complement each other.

Category Content Summary

[Diagram: Root → Sport, Health; databases D1, D2, and D3 (Heart) classified under Health.]

D1: estimated size = 1,000, df("hypertension") = 480, P("hypertension") = 0.48
D2: estimated size = 2,000, df("hypertension") = 0, P("hypertension") = 0

Category estimate: P("hypertension") = ((2000*0) + (1000*0.48)) / 3000 = 0.16

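The category-level probability is a size-weighted average over the databases under the category. A couple of lines (function name mine) check the slide's arithmetic:

```python
def category_prob(dbs):
    """P(w | category): total document frequency of the word across
    the databases under the category, divided by their total
    (estimated) size. `dbs` holds (size, df) pairs."""
    total_size = sum(size for size, _ in dbs)
    total_df = sum(df for _, df in dbs)
    return total_df / total_size

# From the slide: D1 has ~1,000 docs with df("hypertension") = 480,
# D2 has ~2,000 docs with df("hypertension") = 0.
print(category_prob([(1000, 480), (2000, 0)]))  # 0.16
```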
Shrunk Content Summary I

To create a shrunk content summary, we must first create the category content summaries for all the categories in the hierarchy.
Consider a path in the topic hierarchy c1, ..., cm where ci = parent(ci+1).

[Diagram: path Root → c1 → c2 → c3 → database D.]

Shrunk Content Summary II

A shrunk content summary for a database D classified under categories c1...cm is:

P_shrunk(w|D) = λ_{m+1} * P(w|D) + Σ_{i=0..m} λ_i * P(w|ci)

where c0 is the root category, P(w|D) comes from D's own sample-based content summary, and the weights λi sum to 1.

Shrunk Content Summary III

[Diagram: path Root → C1 → C2 → C3 → database D.]

P(w|Root) = 0.01
P(w|C1) = 0.3
P(w|C2) = 0.78
P(w|C3) = 0.4
P(w|D) = 0.6

Shrunk content summary:
0.01*λ0 + 0.3*λ1 + 0.78*λ2 + 0.4*λ3 + 0.6*λ4

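The λ-weighted mixture can be sketched directly. The weights below are illustrative only (the papers learn them, giving the database's own summary the largest weight):

```python
def shrunk_prob(path_probs, lambdas):
    """Shrinkage mixture of P(w|.) along the path Root, c1, ..., cm, D.
    The weights must sum to 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(l * p for l, p in zip(lambdas, path_probs))

# P(w|.) along the path from the slide: Root, C1, C2, C3, D.
probs = [0.01, 0.3, 0.78, 0.4, 0.6]
# Illustrative weights: the database's own summary (last entry)
# gets the largest weight.
lambdas = [0.05, 0.1, 0.15, 0.2, 0.5]
print(shrunk_prob(probs, lambdas))  # ≈ 0.5275
```

Even if w never appeared in D's sample (P(w|D) = 0), the category terms keep its shrunk probability above zero, which is exactly how shrinkage combats the sparse-data problem.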
Shrunk Content Summary IV

The category weights: λ_{m+1} is the highest among the λi's, which means the highest weight is given to the original content summary.
The shrunk content summary incorporates information from multiple content summaries and thus can be closer to the complete (and unknown) content summary.

Shrunk Content Summary - Is It Always Good?

Not always. If the "uncertainty" associated with the score is low, don't use shrinkage:
- Sample size: if the database sample includes most of the documents in the DB (a small DB), the sample is sufficiently complete. In this case shrinkage is not needed and might be undesirable.
- Frequency of the query words: if every query word appears in almost all of the sample documents, the distribution of the words over the DB is "certain". The same holds if every query word appears in close to no sample documents.

Content

What is the hidden web? Content Summary. Database Classification. Combined Algorithm. Shrinkage. Experiment Results. Summary.

Experiment Results

The papers address two aspects: content summary quality and database selection accuracy.
The papers show that exploiting the content summaries of similarly classified databases increases content summary quality and improves database selection for a given query.

Content Summary Quality I

Comparing the coverage of the retrieved vocabulary: RS-Ord and RS-Lrd vs. the different rule-based probers.

[Chart: % of retrieved words vs. specificity.]

Content Summary Quality II

Comparing the rank of words: RS-Ord and RS-Lrd vs. the different rule-based probers.

Content Summary Quality III

Comparing the number of queries sent to the database: RS-Ord and RS-Lrd vs. the different rule-based probers.

Database Selection Using Shrinkage

Shrinkage improves the selection of relevant databases.

Content

What is the hidden web? Content Summary. Database Classification. Combined Algorithm. Shrinkage. Experiment Results. Summary.

Summary I

Database selection is critical to building efficient metasearchers that interact with a potentially large number of databases.
The metasearcher uses the database content summary to select the most relevant databases for a given query.

Summary II

The papers present methods to improve the database content summary:
- Creating a content summary with estimates of actual document frequencies.
- Categorizing databases in a classification scheme.
- A method that exploits the content summaries of similarly classified databases and combines them using shrinkage.

The End

"The invisible portion of the Web will continue to grow exponentially before the tools to uncover the hidden Web are ready for general use" (http://brightplanet.com/technology/deepweb.asp)

QUESTIONS?

Appendix

The metasearcher Turbo10 - http://turbo10.com/index.html