Downloading Textual Hidden-Web Content Through Keyword Queries
Alexandros Ntoulas Petros Zerfos Junghoo Cho
University of California, Los Angeles
Computer Science Department
{ntoulas, pzerfos, cho}@cs.ucla.edu
JCDL, June 8th 2005
April 10, 2023
Motivation
I would like to buy a used ’98 Ford Taurus
Technical specs?
Reviews?
Classifieds?
Vehicle history?
Google?
Why can’t we use a search engine?
Search engines today employ crawlers that find pages by following links
Many useful pages are available only after issuing queries (e.g. Classifieds, USPTO, PubMed, LoC, …)
Search engines cannot reach such pages: there are no links to them (Hidden-Web)
In this talk: how can we download Hidden-Web content?
Outline
Interacting with Hidden-Web sites
Algorithms for selecting queries for the Hidden-Web sites
Experimental evaluation of our algorithms
Interacting with Hidden-Web pages (1)
1. The user issues a query through a query interface
Example query: liver
Interacting with Hidden-Web pages (2)
1. The user issues a query through a query interface
2. A result list is presented to the user
[Figure: Result List Page]
Interacting with Hidden-Web pages (3)
1. The user issues a query through a query interface
2. A result list is presented to the user
3. The user selects and views the “interesting” results
Querying a Hidden-Web site
Procedure
while ( there are available resources ) do
(1) select a query to send to the site
(2) send query and acquire result list
(3) download the pages
done
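The procedure above can be sketched in Python roughly as follows. This is only an illustration: the function name, the three callbacks, and the idea of measuring "available resources" as a fixed number of queries are assumptions made for the example, not part of the talk.

```python
from typing import Callable, Iterable, Set


def crawl_hidden_web(
    select_query: Callable[[Set[str]], str],
    issue_query: Callable[[str], Iterable[str]],
    download: Callable[[str], None],
    budget: int,
) -> Set[str]:
    """Repeat select / issue / download until the query budget runs out.

    `budget` counts queries here, which is a simplifying assumption;
    the talk speaks of available resources in general.
    """
    downloaded: Set[str] = set()
    issued: Set[str] = set()
    while budget > 0:                        # while there are available resources
        query = select_query(issued)         # (1) select a query to send to the site
        issued.add(query)
        for page_id in issue_query(query):   # (2) send query and acquire result list
            if page_id not in downloaded:
                download(page_id)            # (3) download the pages
                downloaded.add(page_id)
        budget -= 1
    return downloaded


# Toy usage with a made-up "site" that maps each query to its result list.
site = {"liver": ["d1", "d2"], "heart": ["d2", "d3"]}
planned = iter(["liver", "heart"])
fetched = []
pages = crawl_hidden_web(lambda issued: next(planned), lambda q: site[q], fetched.append, budget=2)
```

Pages already downloaded are skipped, so a page returned by several queries is fetched only once.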
How should we select the queries ? (1)
S: set of pages in Web site (pages as points)
qi: set of pages returned if we issue query qi (queries as circles)
How should we select the queries ? (2)
Find the queries (circles) that cover the maximum number of pages (points)
Equivalent to the set-covering problem in graph theory
Challenges during query selection
In practice we don’t know which pages will be returned by which queries (qi are unknown)
Even if we did know qi, the set-covering problem is NP-Hard
We will present approximation algorithms to the query selection problem
We will assume single-keyword queries
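To make the covering view concrete, here is a minimal sketch of the classic greedy heuristic for the idealized case where every result set qi is known in advance. It is a textbook approximation shown for intuition, not the algorithm of the talk; the function name and the toy data are made up.

```python
from typing import Dict, List, Set


def greedy_cover(results: Dict[str, Set[int]], max_queries: int) -> List[str]:
    """Pick up to max_queries queries, each time taking the query that
    adds the largest number of not-yet-covered pages."""
    covered: Set[int] = set()
    chosen: List[str] = []
    for _ in range(max_queries):
        best = max(results, key=lambda q: len(results[q] - covered), default=None)
        if best is None or not (results[best] - covered):
            break                      # no query contributes a new page
        chosen.append(best)
        covered |= results[best]
    return chosen


# Toy example: pages are integers, queries are circles around some of them.
toy_results = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4}}
picked = greedy_cover(toy_results, max_queries=5)
```

In a real crawl the result sets are unknown before the queries are issued, which is why the talk turns to estimating them.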
Some background (1)
Assumption: When we issue query qi to a Web site, all pages containing qi are returned
P(qi): fraction of pages from site we get back after issuing qi
Example: q = liver
No. of docs in DB: 10,000
No. of docs containing liver: 3,000
P(liver) = 0.3
Some background (2)
P(q1/\q2): fraction of pages containing both q1 and q2 (intersection of q1 and q2)
P(q1\/q2): fraction of pages containing either q1 or q2 (union of q1 and q2)
Cost and benefit:
How much benefit do we get out of a query?
How costly is it to issue a query?
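The quantities P(q), P(q1/\q2) and P(q1\/q2) can be illustrated on a tiny in-memory collection. The five-page corpus and the helper names below are invented for the example and assume, as in the talk, that a query returns every page containing the keyword.

```python
from typing import Dict, Set

# Made-up "site": page id -> keywords appearing on that page.
corpus: Dict[int, Set[str]] = {
    1: {"liver", "cancer"},
    2: {"liver"},
    3: {"heart"},
    4: {"cancer", "heart"},
    5: {"kidney"},
}


def matching(q: str) -> Set[int]:
    """Pages returned for keyword q (all pages containing q)."""
    return {pid for pid, words in corpus.items() if q in words}


def p(q: str) -> float:
    """P(q): fraction of the site returned by q."""
    return len(matching(q)) / len(corpus)


def p_and(q1: str, q2: str) -> float:
    """P(q1 /\\ q2): fraction of pages containing both keywords."""
    return len(matching(q1) & matching(q2)) / len(corpus)


def p_or(q1: str, q2: str) -> float:
    """P(q1 \\/ q2): fraction of pages containing either keyword."""
    return len(matching(q1) | matching(q2)) / len(corpus)
```

On this toy data the union obeys P(q1\/q2) = P(q1) + P(q2) - P(q1/\q2), the identity that the adaptive algorithm relies on.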
Cost function
The cost to issue a query and download the Hidden-Web pages:
Cost(qi) = cq + cr P(qi) + cd P(qi)
(1) cq: cost for issuing a query
(2) cr P(qi): cost for retrieving a result item (cr) times no. of results
(3) cd P(qi): cost for downloading a document (cd) times no. of docs
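As a small worked example of the cost function, the sketch below evaluates it for one query. The function name is hypothetical and the weights are illustrative values chosen for the example.

```python
def query_cost(p_q: float, c_q: float, c_r: float, c_d: float) -> float:
    """Cost(q) = c_q + c_r * P(q) + c_d * P(q)."""
    return c_q + c_r * p_q + c_d * p_q


# Illustrative weights (assumed for this example): issuing a query and
# retrieving a result item are cheap, downloading a document is expensive.
cost_liver = query_cost(p_q=0.3, c_q=100, c_r=100, c_d=10_000)
# 100 + 100 * 0.3 + 10,000 * 0.3, i.e. about 3,130 cost units
```

With weights like these the document download term dominates, so a query that returns many pages is expensive even though issuing it is cheap.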
Problem formalization
Find the set of queries q1,…,qn
which maximizes
P(q1\/…\/qn)
Under the constraint:
$\sum_{i=1}^{n} \mathrm{Cost}(q_i) \le t$
(t: the available resources)
Query selection algorithms
Random: Select a query randomly from a precompiled list (e.g. a dictionary)
Frequency-based: Select queries from a precompiled list in order of frequency (e.g. frequencies taken from a corpus previously downloaded from the Web)
Adaptive: Analyze previously downloaded pages to determine “promising” future queries
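The two non-adaptive policies are simple enough to sketch directly. The function names, the `issued` set used to avoid repeating a query, and the toy term list are assumptions made for this illustration.

```python
import random
from collections import Counter
from typing import Sequence, Set


def random_policy(terms: Sequence[str], issued: Set[str], rng: random.Random) -> str:
    """Random: pick any not-yet-issued term from a precompiled list."""
    candidates = [t for t in terms if t not in issued]
    return rng.choice(candidates)


def frequency_policy(term_counts: Counter, issued: Set[str]) -> str:
    """Frequency-based: pick the most frequent not-yet-issued term."""
    for term, _count in term_counts.most_common():
        if term not in issued:
            return term
    raise ValueError("all terms in the list have been issued")


# Made-up term frequencies standing in for a previously downloaded corpus.
counts = Counter({"data": 50, "liver": 30, "kidney": 5})
first = frequency_policy(counts, issued=set())
second = frequency_policy(counts, issued={first})
```

Neither policy looks at the pages already downloaded from the target site, which is the gap the adaptive policy fills.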
Adaptive query selection
Assume we have issued q1,…,qi-1.
To find a promising query qi we need to estimate P(q1\/…\/qi-1\/qi)
P((q1\/…\/qi-1) \/ qi) = P(q1\/…\/qi-1) + P(qi) - P(q1\/…\/qi-1) P(qi|q1\/…\/qi-1)
P(q1\/…\/qi-1): known (by counting) since we have issued q1,…,qi-1
P(qi|q1\/…\/qi-1): can measure by counting the occurrences of qi within the pages returned by q1,…,qi-1
What about P(qi)?
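A short numeric sketch of the decomposition above. The function names and the input values are invented for illustration.

```python
def union_estimate(p_prev: float, p_q: float, p_q_given_prev: float) -> float:
    """P(prev \\/ q) = P(prev) + P(q) - P(prev) * P(q | prev),
    where prev stands for q1 \\/ ... \\/ qi-1."""
    return p_prev + p_q - p_prev * p_q_given_prev


def new_fraction(p_prev: float, p_q: float, p_q_given_prev: float) -> float:
    """Fraction of the site that q would add on top of what is already covered."""
    return union_estimate(p_prev, p_q, p_q_given_prev) - p_prev


# Suppose half the site is covered, the candidate matches 20% of the site,
# and 10% of the pages downloaded so far already contain the candidate.
coverage_after = union_estimate(p_prev=0.5, p_q=0.2, p_q_given_prev=0.1)  # about 0.65
gain = new_fraction(p_prev=0.5, p_q=0.2, p_q_given_prev=0.1)              # about 0.15
```

The last term removes the overlap: pages matching the candidate that were already retrieved by earlier queries.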
Estimating P(qi)
Independence estimator
P(qi) ~ P(qi|q1\/…\/qi-1)
Zipf estimator [IG02]
Rank queries based on frequency of occurrence and fit a power law distribution
Use fitted distribution to estimate P(qi)
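Both estimators can be sketched in a few lines. The independence estimator below simply reuses the frequency observed in the downloaded pages. The power-law fit is a plain least-squares line in log-log space, a deliberate simplification and not the exact procedure of [IG02]. All names and data are made up for the example.

```python
import math
from typing import List, Sequence, Set, Tuple


def independence_estimate(downloaded: List[Set[str]], term: str) -> float:
    """P(q) ~ P(q | q1 \\/ ... \\/ qi-1): fraction of downloaded pages containing the term."""
    return sum(term in page for page in downloaded) / len(downloaded)


def fit_power_law(freqs: Sequence[float]) -> Tuple[float, float]:
    """Fit f(r) = c * r**(-alpha) to frequencies sorted by rank r = 1..n,
    using ordinary least squares on log f versus log r.
    Requires at least two strictly positive frequencies."""
    ranked = sorted(freqs, reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope   # (c, alpha)


downloaded_pages = [{"liver", "cancer"}, {"liver"}, {"liver", "heart"}, {"liver", "cancer", "heart"}]
p_cancer = independence_estimate(downloaded_pages, "cancer")

# Synthetic frequencies that follow f(r) = 1000 / r exactly.
c, alpha = fit_power_law([1000 / r for r in range(1, 6)])
```

The fitted curve can then be evaluated at a term's rank to predict its frequency, which is the role the fitted distribution plays in the Zipf estimator.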
Query selection algorithm
foreach qi in [potential queries] do
  Pnew(qi) = P(q1\/…\/qi-1\/qi) - P(q1\/…\/qi-1)
  Estimate Efficiency(qi) = Pnew(qi) / Cost(qi)
done
return qi with maximum Efficiency(qi)
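A minimal Python sketch of this selection step, assuming the independence estimator for P(qi) and the cost function Cost(qi) = cq + cr P(qi) + cd P(qi). The function signature, the `p_prev` argument (the fraction of the site already covered) and the toy pages are assumptions made for the example.

```python
from typing import Iterable, List, Optional, Set


def select_next_query(
    candidates: Iterable[str],
    downloaded: List[Set[str]],
    p_prev: float,
    c_q: float,
    c_r: float,
    c_d: float,
) -> Optional[str]:
    """Return the candidate with the highest Pnew(q) / Cost(q), or None if there are no candidates."""
    best_term, best_eff = None, float("-inf")
    for term in candidates:
        # P(q | q1 \/ ... \/ qi-1), measured by counting in the downloaded pages
        p_given = sum(term in page for page in downloaded) / len(downloaded)
        p_q = p_given                        # independence estimator (assumed)
        p_new = p_q - p_prev * p_given       # P(prev \/ q) - P(prev)
        cost = c_q + c_r * p_q + c_d * p_q
        efficiency = p_new / cost
        if efficiency > best_eff:
            best_term, best_eff = term, efficiency
    return best_term


# Toy pages, all retrieved by an earlier query "liver".
seen = [{"liver", "cancer"}, {"liver", "heart"}, {"liver", "heart", "kidney"}, {"liver"}]
next_query = select_next_query(["cancer", "heart", "kidney"], seen, p_prev=0.3, c_q=100, c_r=100, c_d=10_000)
```

Already-issued keywords are assumed to be excluded from `candidates` by the caller.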
Other practical issues
Efficient calculation of P(qi|q1\/…\/qi-1)
Selection of the initial query
Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)
Please refer to our paper for the details
Experimental evaluation
Applied our algorithms to 4 different sites
Hidden-Web site              | No. of documents | Limit in the no. of results
PubMed medical library       | ~13 million      | no limit
Books section of Amazon      | ~4.2 million     | 32,000
DMOZ: Open directory project | ~3.8 million     | 10,000
Arts section of DMOZ         | ~429,000         | 10,000
Policies
Random-16K: Pick query randomly from 16,000 most popular terms
Random-1M: Pick query randomly from 1,000,000 most popular terms
Frequency-based: Pick query based on frequency of occurrence
Adaptive
Coverage of policies
What fraction of the Web sites can we download by issuing queries ?
Study P(q1\/…\/qi) as i increases
Coverage of policies for PubMed
Adaptive gets ~80% with ~83 queries
Frequency needs 103 for the same coverage
Coverage of policies for DMOZ (whole)
Adaptive outperforms others
Coverage of policies for DMOZ (arts)
Adaptive performs best in topic-specific texts
Other experiments
Impact of the initial query
Impact of the various parameters of the cost function
Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)
Please refer to our paper for the details
Related work
Issuing queries to databases
  Acquire language model [CCD99]
  Estimate fraction of the Web indexed [LG98]
  Estimate relative size and overlap of indexes [BB98]
  Build multi-keyword queries that can return a large number of documents [BF04]
Harvesting approaches/cooperative databases (OAI [LS01], DP9 [LMZN02])
Conclusion
An adaptive algorithm for issuing queries to Hidden-Web sites
Our algorithm is highly efficient (downloaded >90% of a site with ~100 queries)
Allows users to tap into unexplored information on the Web
Allows the research community to download, mine, study, understand the Hidden-Web
References
[IG02] P. Ipeirotis, L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. VLDB 2002.
[CCD99] J. Callan, M.E. Connell, A. Du. Automatic discovery of language models for text databases. SIGMOD 1999.
[LG98] S. Lawrence, C.L. Giles. Searching the World Wide Web. Science 280(5360):98-100, 1998.
[BB98] K. Bharat, A. Broder. A technique for measuring the relative size and overlap of public web search engines. WWW 1998.
[BF04] L. Barbosa, J. Freire. Siphoning hidden-web data through keyword-based interfaces.
[LS01] C. Lagoze, H.V. Sompel. The Open Archives Initiative: Building a low-barrier interoperability framework. JCDL 2001.
[LMZN02] X. Liu, K. Maly, M. Zubair, M.L. Nelson. DP9-An OAI Gateway Service for Web Crawlers. JCDL 2002.
Thank you !
Questions ?
Impact of the initial query
Does it matter what the first query is?
Crawled PubMed with queries:
data (1,344,999 results)
information (308,474 results)
return (29,707 results)
pubmed (695 results)
Impact of the initial query
Algorithm converges regardless of initial query
Incorporating the document download cost
Cost(qi) = cq + cr P(qi) + cd Pnew(qi)
Crawled PubMed with cq = 100, cr = 100, cd = 10,000
Incorporating document download cost
Adaptive uses resources more efficiently
Document cost is a significant portion of the cost
Can we get all the results back ?
…
Downloading from sites limiting the number of results (1)
Site returns qi’ instead of qi
For qi+1 we need to estimate P(qi+1|q1\/…\/qi)
Downloading from sites limiting the number of results (2)
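Assuming qi’ is a random sample of qi
$$P(q_{i+1} \mid q_1 \vee \dots \vee q_i) = \frac{1}{P(q_1 \vee \dots \vee q_i)} \Big[ P\big(q_{i+1} \wedge (q_1 \vee \dots \vee q_{i-1})\big) + P(q_{i+1} \wedge q_i) - P\big(q_{i+1} \wedge q_i \wedge (q_1 \vee \dots \vee q_{i-1})\big) \Big]$$
$$\frac{P(q_{i+1} \wedge q_i)}{P(q_{i+1} \wedge q_i')} = \frac{P(q_i)}{P(q_i')}$$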
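The random-sample assumption lets us scale up what is observed in the truncated result list qi’ to the full result set qi. The sketch below illustrates that scaling. The function name and the numbers are invented, and it assumes P(qi) itself is available, for example because the site reports the total number of matches.

```python
def scale_truncated_overlap(p_next_and_returned: float, p_full: float, p_returned: float) -> float:
    """Estimate P(q_next /\\ q) from P(q_next /\\ q') by assuming q' is a random
    sample of q, i.e. P(q_next /\\ q) / P(q_next /\\ q') = P(q) / P(q')."""
    return p_next_and_returned * p_full / p_returned


# Made-up numbers for a site with 1,000,000 pages:
# the query matches 50,000 pages, but only 10,000 are returned,
# and 2,000 of the returned pages contain the next candidate keyword.
site_size = 1_000_000
estimate = scale_truncated_overlap(
    p_next_and_returned=2_000 / site_size,
    p_full=50_000 / site_size,
    p_returned=10_000 / site_size,
)
# about 0.01, i.e. roughly 10,000 pages contain both keywords
```

If the site returns results in a ranked rather than random order, the sample may be biased and the scaled estimate correspondingly less reliable.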
Impact of the limit of results
How does the limit of results affect our algorithms ?
Crawled DMOZ but restricted the algorithms to 1,000 results instead of 10,000
DMOZ with a result cap of 1,000
Adaptive still outperforms frequency-based