AN EFFICIENT APPROACH FOR ILLUSTRATING WEB DATA OF USER SEARCH RESULTS

Upload: neha-singh

Post on 26-Jun-2015


Page 1: JIIT;Project 2013-14,Project Presentation

AN EFFICIENT APPROACH FOR ILLUSTRATING WEB DATA OF USER SEARCH RESULTS

Page 2

[Figure: system flow diagram. Box labels:]

• INPUT URL
• WRAPPER GENERATION
• DATA EXTRACTION
• SEARCH ENGINE
• EXTRACTOR
• SEARCH RESULT RECORD
• CONTENT LINE EXTRACTION
• DATA ALIGNMENT
• ANNOTATORS
• LINE SEPARATOR
• BLOCK EXTRACTION
• ANNOTATION WRAPPER
• ANNOTATED GROUPS
• COMBINING ANNOTATORS
• NEW RESULT PAGE

Page 3

GOOGLE SEARCH CONTENT LINE

• LINK
• TEXT
• LINK-TEXT
• LINK-HEAD
• TEXT-HEAD
• LINK-TEXT-HEAD
• HR-LINE
• BLANK LINE
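As a rough illustration of how a rendered line might be mapped to one of these eight type codes, here is a minimal Python sketch. The `ContentLine` record and its fields are hypothetical, and the reading of HEAD as "the line is rendered as a heading" is an assumption; the slide lists only the code names.

```python
from dataclasses import dataclass


@dataclass
class ContentLine:
    """One rendered line of a result page (hypothetical representation)."""
    link_text: str = ""       # text inside <a> elements on this line
    plain_text: str = ""      # text outside any <a> element
    is_heading: bool = False  # assumption: the line is rendered as a heading
    is_rule: bool = False     # the line comes from an <hr> tag


def type_code(line: ContentLine) -> str:
    """Return one of the eight content-line type codes for a line."""
    if line.is_rule:
        return "HR-LINE"
    has_link = bool(line.link_text.strip())
    has_text = bool(line.plain_text.strip())
    if not has_link and not has_text:
        return "BLANK"
    parts = []
    if has_link:
        parts.append("LINK")
    if has_text:
        parts.append("TEXT")
    if line.is_heading:
        parts.append("HEAD")
    return "-".join(parts)
```

A page is then represented as the sequence of type codes of its lines, which is the form the later block-comparison steps work on.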

Page 4

GOOGLE SEARCH BLOCKS

To identify similar blocks, we check block similarity on the basis of:

• TYPE distance
• SHAPE distance
• POSITION distance
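The three distances could be combined along the following lines. This is only a sketch: the slide does not define the distances, so the `Block` fields, the use of a normalized edit distance for type and shape, the pixel-based position distance, the `page_width` default, and the equal weights are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Block:
    """A candidate block of content lines (hypothetical representation)."""
    types: List[str]    # type code of each content line
    indents: List[int]  # left offset of each line, in pixels (assumed)
    left: int           # left offset of the whole block, in pixels (assumed)


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def normalized_edit_distance(a: Sequence, b: Sequence) -> float:
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def type_distance(a: Block, b: Block) -> float:
    return normalized_edit_distance(a.types, b.types)


def shape_distance(a: Block, b: Block) -> float:
    # Compare indentation profiles relative to each block's leftmost line.
    rel_a = [x - min(a.indents, default=0) for x in a.indents]
    rel_b = [x - min(b.indents, default=0) for x in b.indents]
    return normalized_edit_distance(rel_a, rel_b)


def position_distance(a: Block, b: Block, page_width: int = 1000) -> float:
    return min(abs(a.left - b.left) / page_width, 1.0)


def block_distance(a: Block, b: Block, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of the three distances; 0.0 means identical blocks."""
    parts = (type_distance(a, b), shape_distance(a, b), position_distance(a, b))
    return sum(w * p for w, p in zip(weights, parts))
```

Two blocks would then be treated as similar when `block_distance` falls below some threshold chosen for the site being processed.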

Page 5

Candidate Content Line Separators

• blank line (e.g., the <p> tag)
• visual line (e.g., the <HR> tag)

First line of block

(1) the line following an HR-LINE is a first line;
(2) if there is only one line starting with a number in a block, this line is a first line;
(3) if only one line in a block has the smallest position code, this line is a first line;
(4) if there is only one BLANK line in a block, the line following the BLANK line is the first line.
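The four first-line rules on this slide can be sketched as a single function. The tuple layout `(type_code, position_code, text)` is a hypothetical representation, and trying the rules in the order (1) to (4) is an assumption; the slide does not say how conflicts between rules are resolved.

```python
from typing import List, Optional, Tuple

Line = Tuple[str, int, str]  # (type_code, position_code, text), assumed layout


def find_first_line(block: List[Line]) -> Optional[int]:
    """Return the index of the block's first line, or None if no rule applies."""
    # Rule 1: the line following an HR-LINE.
    for i in range(len(block) - 1):
        if block[i][0] == "HR-LINE":
            return i + 1

    # Rule 2: exactly one line starts with a number.
    numbered = [i for i, (_, _, text) in enumerate(block)
                if text.strip()[:1].isdigit()]
    if len(numbered) == 1:
        return numbered[0]

    # Rule 3: exactly one line has the smallest position code.
    if block:
        smallest = min(pos for _, pos, _ in block)
        leftmost = [i for i, (_, pos, _) in enumerate(block) if pos == smallest]
        if len(leftmost) == 1:
            return leftmost[0]

    # Rule 4: exactly one BLANK line, and a line follows it.
    blanks = [i for i, (code, _, _) in enumerate(block) if code == "BLANK"]
    if len(blanks) == 1 and blanks[0] + 1 < len(block):
        return blanks[0] + 1

    return None
```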

Page 6

Tag path

• Sibling
• Child

Tag Tree
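To make the idea of a tag path concrete, the following sketch walks an HTML fragment with Python's standard `html.parser` and records the root-to-node path of every non-empty text node. It models child steps only; the sibling steps named on the slide are not represented. The class name, the helper name, and the short list of void elements are illustrative choices, not part of the project.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects (tag_path, text) pairs for the text nodes of a document."""

    # Elements without a closing tag; a deliberately short, partial list.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.paths = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag,
            # which tolerates unclosed inner tags.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(("/".join(self.stack), text))


def text_node_paths(html: str):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths
```

Text nodes that share the same path, such as all result titles under `html/body/div/a`, are natural candidates for the same data-unit group.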

Page 7

Wrapper Integration

Page 8

Relationships between data unit (U) and text node (T):

• One-to-One Relationship: T = U
• One-to-Many Relationship: T ⊃ U
• Many-to-One Relationship: T ⊂ U
• One-to-Nothing Relationship: T ≠ U

Page 9

Five common features shared by the data units

• Tag Path (TP)
• Data Content (DC)
• Data Type (DT)
• Adjacency (AD)
• Presentation Style (PS)

Page 10

Alignment Algorithm

Here we apply our data alignment algorithm to place semantically identical data units in the same group.
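The slide does not spell out the algorithm, so the following is only a simplified sketch of feature-based alignment, not the project's actual method. It assumes each data unit is a dictionary with hypothetical keys `text`, `tag_path`, `data_type`, and `style`, scores similarity with equal weights over four of the five features (adjacency is left out), and greedily assigns each unit to the most similar existing group. The `threshold` value is an arbitrary assumption.

```python
def content_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the word sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def similarity(u: dict, v: dict) -> float:
    """Equal-weight similarity over tag path, data type, style, and content."""
    return 0.25 * (
        (u["tag_path"] == v["tag_path"])
        + (u["data_type"] == v["data_type"])
        + (u["style"] == v["style"])
        + content_overlap(u["text"], v["text"])
    )


def align(records, threshold: float = 0.5):
    """Group data units from several records into columns.

    records: list of records, each a list of data-unit dicts.
    Returns a list of columns, each a list of (record_index, unit) pairs.
    """
    columns = []
    for r, units in enumerate(records):
        for unit in units:
            best, best_score = None, threshold
            for col in columns:
                # A column holds at most one unit per record.
                if any(idx == r for idx, _ in col):
                    continue
                score = sum(similarity(unit, m) for _, m in col) / len(col)
                if score >= best_score:
                    best, best_score = col, score
            if best is None:
                columns.append([(r, unit)])
            else:
                best.append((r, unit))
    return columns
```

For two book records, each with a title and a price, this yields one column holding both titles and one holding both prices.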

Page 11

After Alignment Algorithm

Data units with the same semantics are aligned in the same column, as shown in the figure.

Page 12

Output

Page 13

Annotators

• Table annotator
• Query-based annotator
• Schema value annotator
• Frequency-based annotator
• Same-prefix annotator
• Common knowledge based annotator

Page 14

Probability

Let P(L) be the probability that annotator L identifies a correct label for a group of data units when L is applicable. P(L) is essentially the success rate of L. Specifically, suppose L is applicable to N cases and, among these cases, M are annotated correctly; then P(L) = M/N.
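The success rate P(L) = M/N is straightforward to compute. The sketch below also shows one plausible way to combine several annotators that agree on a label: assuming they err independently, the chance that at least one is right is 1 minus the product of their failure rates. That combination rule is an assumption added here for illustration; the slide defines only P(L). Both function names are hypothetical.

```python
from math import prod


def success_rate(correct: int, applicable: int) -> float:
    """P(L) = M / N for an annotator applicable to N cases with M correct."""
    if applicable <= 0:
        raise ValueError("the annotator must be applicable to at least one case")
    return correct / applicable


def combined_probability(rates) -> float:
    """Probability that at least one of several independent annotators is correct."""
    return 1.0 - prod(1.0 - p for p in rates)
```

For example, an annotator that is correct in 45 of 50 applicable cases has P(L) = 0.9, and pairing it with a second annotator at 0.8 gives a combined value of 0.98 under the independence assumption.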

Page 15

Labelling

A suitable annotator is applied to each group, and the data are then labelled.

Page 16

Annotation Wrapper

attribute = <label; prefix; suffix; separators; unit index>

• compare the suffixes of all the data units
• compare the prefixes of all the data units
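The prefix and suffix comparisons can be illustrated with a small sketch that derives a rule from an annotated group and applies it to a new data unit. The `AnnotationRule` class mirrors the five fields of the attribute tuple; `build_rule` and `apply_rule` are hypothetical helpers, and the character-level comparison is a simplifying assumption.

```python
import os
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AnnotationRule:
    label: str
    prefix: str
    suffix: str
    separators: str
    unit_index: int


def common_prefix(texts: List[str]) -> str:
    return os.path.commonprefix(texts)


def common_suffix(texts: List[str]) -> str:
    # The common suffix is the common prefix of the reversed strings.
    return os.path.commonprefix([t[::-1] for t in texts])[::-1]


def build_rule(label: str, texts: List[str],
               separators: str = "", unit_index: int = 0) -> AnnotationRule:
    """Derive prefix and suffix by comparing all data units of one group."""
    prefix = common_prefix(texts)
    # Remove the prefix first so prefix and suffix cannot overlap.
    suffix = common_suffix([t[len(prefix):] for t in texts])
    return AnnotationRule(label, prefix, suffix, separators, unit_index)


def apply_rule(rule: AnnotationRule, text: str) -> Optional[Tuple[str, str]]:
    """Return (label, value) if the text matches the rule, else None."""
    if len(text) < len(rule.prefix) + len(rule.suffix):
        return None
    if text.startswith(rule.prefix) and text.endswith(rule.suffix):
        end = len(text) - len(rule.suffix)
        return rule.label, text[len(rule.prefix):end]
    return None
```

Once such rules exist for a site, new result pages can be annotated by matching alone, without running the annotators again.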

Page 17

New Result Page

This new result page contains fewer result records, but every piece of result data on it is annotated, which makes the page more efficient to use.

Page 18

Tools

Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Other tools:

• Web-Harvest
• HtmlUnit
