8/10/2019 An Approach Deep Web Crawling

1/13

An Approach to Deep Web Crawling by Sampling

Moh Shohibul Wafa (213210388


2/13

What is the deep web? Also called the hidden web or invisible web

The opposite of the surface web

Content is generated dynamically by a search interface. The search interface can take the form of:

HTML form

Web service

Content is usually stored in a database

Usually not indexed by search engines

This is why some people define the surface web as the web that can be indexed by search engines


3/13

Deep web vs. surface web


4/13

Deep and surface web may overlap

Some content hidden behind HTML forms or web services is also available on normal HTML pages

Some search engines try to index parts of the deep web

Google also crawls the deep web

Jayant Madhavan, David Ko, Łucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Halevy (2008). Google's Deep-Web Crawl. VLDB

Only part of the deep web has been successfully indexed
http://www.cs.cornell.edu/~lucja/Publications/I03.pdf


5/13

Deep web crawling

Crawl and index the deep web so that the hidden data can be surfaced

Unlike the surface web, there are no hyperlinks to follow

Two tasks

Find deep web data sources, i.e., html forms, web services

B. He, M. Patel, Z. Zhang, K. C.-C. Chang: Accessing the Deep Web: A Survey. Communications of the ACM, 2007

Given a data source, download the data from this data source

We focus on the second task
http://www.almaden.ibm.com/people/binhe/pubs/dwsurvey-cacm07.pdf


6/13

Crawling a deep web data source

The only interface is an HTML form or a web service

If the data is hidden behind an HTML form

Fill in the form

Select and send queries

Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho (2005). Downloading Hidden Web Content. UCLA Computer Science.

Yan Wang, Jianguo Lu, Jessica Chen: Crawling Deep Web Using a New Set Covering Algorithm. ADMA 2009: 326-337.

Jianguo Lu, Yan Wang, Jie Liang, Jessica Chen, Jiming Liu: An Approach to Deep Web Crawling by Sampling. Web Intelligence 2008: 718-724

Extract the relevant data from the returned HTML pages

If the data is hidden behind a web service

Select and send queries

Form filling and data extraction are not needed
http://oak.cs.ucla.edu/~cho/papers/ntoulas-hidden.pdf
http://en.wikipedia.org/wiki/UCLA
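
The HTML form case above (fill in the form, send a query, extract data from the returned page) might look roughly like the following sketch. It is only an illustration: the endpoint http://example.org/search, the field name q, and the canned result page are hypothetical, and link extraction stands in for whatever data a real source returns. No network access is performed.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_query_url(action_url, field_name, term):
    """Fill a single-field search form by encoding the term into a GET URL."""
    return action_url + "?" + urlencode({field_name: term})


class ResultLinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a returned result page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_result_links(html_text):
    """Return all link targets found in the HTML text, in document order."""
    parser = ResultLinkParser()
    parser.feed(html_text)
    return parser.links


# Hypothetical form endpoint and a canned response instead of a live request.
url = build_query_url("http://example.org/search", "q", "deep web")
page = '<html><body><a href="/doc/1">A</a><a href="/doc/2">B</a></body></html>'
links = extract_result_links(page)
```

For a web service, only the query-selection part remains, since the service returns structured results directly.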


7/13

The problem

Minimize the cost while retrieving most of the data

Some people try to minimize the number of queries, whereas we minimize the number of documents retrieved

Minimize the OR (Overlapping Rate) while reaching a high Hit Rate (HR)

S(qj, DB): the set of results of query qj on database DB.
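
The slide text gives only the definition of S(qj, DB); the formulas for HR and OR are not present here. The sketch below assumes the usual reading: HR is the fraction of the database covered by the distinct documents retrieved, and OR is the total number of documents retrieved (duplicates counted) divided by the number of distinct documents. Both definitions and the function names are assumptions for illustration.

```python
def hit_rate(results_per_query, db_size):
    """Assumed HR: distinct documents retrieved divided by the database size."""
    distinct = set().union(*results_per_query) if results_per_query else set()
    return len(distinct) / db_size


def overlapping_rate(results_per_query):
    """Assumed OR: documents retrieved (with duplicates) per distinct document."""
    distinct = set().union(*results_per_query) if results_per_query else set()
    total = sum(len(result) for result in results_per_query)
    return total / len(distinct) if distinct else 0.0


# Three queries against a hypothetical database of 10 documents.
results = [{1, 2, 3}, {3, 4}, {4, 5, 6}]
hr = hit_rate(results, db_size=10)      # 6 distinct documents out of 10
orate = overlapping_rate(results)       # 8 retrieved / 6 distinct
```

Under these assumed definitions an OR of 1 means no document was retrieved twice, so the goal is to keep OR close to 1 while pushing HR toward 1.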


8/13

Sampling-based approach

Queries are selected from a sample set of documents

Differs from the incremental approach

Steps

Send random queries to TotalDB;

Retrieve the matching documents and construct SampleDB;

Analyze all documents in SampleDB and construct QueryPool;

Use set covering algorithms to select queries;

Send the selected queries to TotalDB to retrieve documents.
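
Steps 3 and 4 above can be sketched as follows. This is a minimal illustration using the plain greedy set covering heuristic (pick the term that covers the most not-yet-covered sample documents); it is not necessarily the algorithm used by the authors. The sample documents, function names, and the whitespace tokenizer are assumptions. Steps 1, 2 and 5 require a live data source and are left out.

```python
def build_query_pool(sample_docs):
    """Step 3: map each term to the set of SampleDB document ids containing it."""
    pool = {}
    for doc_id, text in sample_docs.items():
        for term in set(text.lower().split()):
            pool.setdefault(term, set()).add(doc_id)
    return pool


def greedy_select_queries(pool, all_ids, target_coverage=0.99):
    """Step 4: greedy set covering over the query pool.

    Repeatedly picks the term that adds the most uncovered documents until
    the target fraction of SampleDB is covered. Ties go to the
    alphabetically first term so the result is deterministic.
    """
    covered, selected = set(), []
    goal = target_coverage * len(all_ids)
    while len(covered) < goal:
        best_term, best_gain = None, 0
        for term in sorted(pool):
            gain = len(pool[term] - covered)
            if gain > best_gain:
                best_term, best_gain = term, gain
        if best_term is None:  # no term adds anything new
            break
        selected.append(best_term)
        covered |= pool[best_term]
    return selected


# A tiny hypothetical SampleDB of four documents.
sample_docs = {
    1: "deep web crawling",
    2: "web service query",
    3: "deep database sampling",
    4: "query sampling",
}
pool = build_query_pool(sample_docs)
selected = greedy_select_queries(pool, set(sample_docs))
```

In this toy sample the two selected terms cover all four documents without retrieving any document twice, which is the low-overlap behaviour the method aims for.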

Can the queries cover most of the data source?

Can a low OR in SampleDB be projected to TotalDB?

Does SampleDB need to be very large?


9/13

Hypothesis 1: the vocabulary learnt from the sample can cover most of the documents in TotalDB

Impact of sample size on HR. Queries are selected from SampleDB and cover more than 99% of the documents in SampleDB. The HR in the plot is obtained when the queries are sent to TotalDB. Relative query pool size is 20.


10/13

Hypothesis 2: a low OR in SampleDB can be projected to TotalDB

Sample size is 3000, relative query pool size is 20.

This method produces a smaller OR when HR is high.

Impact of sample size on OR. HR is 89%, relative query pool size is 20.


11/13

Hypothesis 3: both the sample size and the query pool size do not need to be very large

Comparison of our method on the four corpora with queries selected randomly from the sample. The X axis is the Overlapping Rate, the Y axis is the Hit Rate. Sample size is 3000, relative query pool size is 20. Our method achieves a much smaller OR when HR is high.


12/13

Conclusions

This paper proposes an efficient and effective deep web crawling method. The method can recover most of the data in a text data source with a low overlapping rate. Using a sample of about 2,000 documents, we can efficiently select a set of queries that covers most of the data source at low cost. We also empirically identify appropriate sizes for the sample and the query pool.

Using a sample to predict the characteristics of the total population is common in many fields, and sampling of data sources is well studied. Our Hypothesis 1 is related to the result by Callan et al. [4], which says that using around 500 documents from a sample, one can predict rather accurately the ctf (total term frequency) ratio for the total DB. That result coincides with our Hypothesis 1.


13/13

An Approach to Deep Web Crawling by Sampling. Jianguo Lu, Yan Wang, Jie Liang, Jessica Chen. School of Computer Science, University of Windsor
