prospectus presentation
TRANSCRIPT
UNDERSTANDING DEEP WEB
SEARCH INTERFACES
Prospectus Presentation, April 03, 2009
Ritu Khare
Presentation Order
Problem Statement: The Deep Web & Challenges; Search Interface Understanding; Challenges and Significance
Literature Review Results: Settings; Reductionist Analysis; Holistic Analysis
Research Questions and Design Ideas: Techniques vs. Heterogeneity; Semantics and Artificial Designer
Problem: Understanding Semantics
of Search Interfaces
How do existing approaches solve this problem? What are the research gaps? How to fill them?
This presentation uses this space for writing additional facts.
The Deep Web; Challenges in Accessing the Deep Web; Doors of Opportunity
The SIU Process; About the Stages of the Process; Why SIU is Challenging?
Why SIU is Significant?
PROBLEM STATEMENT
The Deep Web
What is the DEEP WEB? The portion of Web resources that is not returned by search engines through traditional crawling and indexing.
Where do the contents LIE? Online databases.
[Diagram: the Web as seen by search engines: Surface Web, Almost Visible Web, Deep Web]
HAS MANY OTHER NAMES!! Hidden Web, Dark Web, Invisible Web, subject-specific databases, data-intensive Web sites.
The Deep Web
How are the contents ACCESSED? By filling in HTML forms on search interfaces.
How are they PRESENTED to users? Dynamic pages / result pages / response pages.
Challenges in Accessing Deep Web Contents
QUICK FACT! The Deep Web includes 307,000 sites, 450,000 databases, and 1,258,000 interfaces.
Its content is 500 times larger than that of the rest of the Web (BrightPlanet.com, 2001).
It grew 3-7 times from 2000 to 2004 (He et al., 2007a).
A user manually reconciles information obtained from different sources and visits several interfaces before finding the right information.
Not scalable: alternative approaches, i.e., Invisible Web directories and search engine browse directories, cover only 37% of the deep Web (He et al., 2007a).
The deep Web remains invisible on the Web.
Opportunities in Accessing Deep Web Contents
HTML forms on search interfaces provide a useful way of discovering the underlying database structure. The labels attached to fields are very expressive and meaningful. Instructions for users to enter data may provide information on data constraints (such as range of data, domain of data) and integrity constraints (mandatory/optional attributes).
In the last decade, several prominent researchers have focused on the PROBLEM OF UNDERSTANDING SEARCH INTERFACES.
INTERESTING FACT! There exist at least 10 million high-quality HTML forms on the deep Web.
Search Interface Understanding (SIU) Process
[Diagram: the SIU process. A search interface (input), backed by an online DB, passes through A. Representation, B. Parsing, C. Segmentation, and D. Segment Processing to produce a system-tagged search interface (output) and an extracted DB; E. Evaluation compares the output against a manually tagged search interface.]
The SIU process is challenging because search interfaces are designed autonomously by different designers and thus do not have a standard structure (Halevy, 2005).
A. Representation and Modeling
This stage formalizes the information to be extracted from a search interface:
Interface component: any text or form element.
Semantic label: the meaning of a component from a user's standpoint.
Segment: a composite component formed by a group of related components.
Segment label: the semantic label of the segment.
This stage builds up the foundation for the process.
A. Representation and Modeling
Zhang et al. (2004) represent an interface as a list of query conditions.
Segment label = Query Condition.
A segment consists of the following semantic labels: an attribute name, an operator, and a value.
[Example segment: Attribute-name | Operator | Value]
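Zhang et al.'s query-condition representation can be sketched as a small data model. This is only an illustration; the class and field names below are my own, not the paper's.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QueryCondition:
    """One segment of a search interface, per Zhang et al. (2004):
    an attribute name, one or more operators, and one or more values."""
    attribute_name: str                                  # e.g. "Title"
    operators: List[str] = field(default_factory=list)   # e.g. ["contains"]
    values: List[str] = field(default_factory=list)      # e.g. ["<textbox>"]

# A book-search interface might then be a list of query conditions:
interface = [
    QueryCondition("Title", ["contains"], ["<textbox>"]),
    QueryCondition("Price", ["less than", "greater than"], ["<textbox>"]),
]
assert interface[0].attribute_name == "Title"
```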
B. Parsing
The interface is parsed into a workable memory structure. This can be done in two modes: by reading the HTML source code, or by rendering the page in a Web browser, either manually or automatically using a visual layout engine.
This stage is the first task physically performed on the interface.
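The source-code mode of parsing can be sketched with Python's built-in html.parser. This is a toy illustration of collecting texts and form elements in document order; real systems (e.g., DEQUE) build a full DOM tree per FORM element.

```python
from html.parser import HTMLParser

class FormParser(HTMLParser):
    """Collect form elements and surrounding texts, in document order."""
    def __init__(self):
        super().__init__()
        self.components = []  # ordered ("text"/"element", value, attrs) tuples

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            self.components.append(("element", tag, dict(attrs)))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.components.append(("text", text, {}))

p = FormParser()
p.feed('<form>Gene Name: <input type="text" name="gene"></form>')
# p.components now holds the text label followed by the textbox element
```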
B. Parsing
He et al. (2007b) parse the interface into an interface expression (IEXP) with constructs t (corresponding to any text), e (corresponding to any form element), and | (corresponding to a row delimiter).
The IEXP for the figure is: t|te|teee
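The IEXP construction can be sketched as a mapping from an ordered component stream to the t/e/| alphabet. A minimal sketch only; the real IEXP is derived while parsing the page, and the component list here is hand-made to match the slide's example.

```python
def to_iexp(components):
    """Map an ordered list of interface components to an IEXP string:
    t = text, e = form element, | = row delimiter."""
    symbols = {"text": "t", "element": "e", "row_break": "|"}
    return "".join(symbols[kind] for kind in components)

# The slide's example: a title row, then a row with a label and one
# element, then a row with a label and three elements -> t|te|teee.
components = ["text", "row_break",
              "text", "element", "row_break",
              "text", "element", "element", "element"]
assert to_iexp(components) == "t|te|teee"
```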
C. Segmentation
A segment has a semantic existence but no physically defined boundaries, making this stage a challenging one. It involves:
Grouping of semantically related components (a sub-problem is associating a surrounding text with a form element).
Assignment of semantic labels to components.
Techniques used: rules, heuristics, machine learning.
C. Segmentation
He et al. (2007b) use a heuristic-based method, LEX, to group elements and text labels together. One heuristic used by LEX is that a text and a form element that lie on the same line are likely to belong to one segment. In the figure, the 3 components "Gene Name", the radio button with options 'Exact Match' and 'Ignore Case', and the textbox belong to one segment.
[Figure: a logical attribute composed of an attribute label, a constraint element, and domain elements]
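LEX's same-line heuristic can be sketched as grouping components by their rendered row. A simplified illustration only: LEX combines several heuristics, not just this one, and the line numbers here are assumed inputs from a layout engine.

```python
def group_by_line(components):
    """Group (line_number, component) pairs into candidate segments,
    using the heuristic that components rendered on the same line
    likely belong to one segment."""
    segments = {}
    for line, comp in components:
        segments.setdefault(line, []).append(comp)
    return [segments[line] for line in sorted(segments)]

# The Gene Name example: label, radio options, and textbox share line 1.
components = [(1, "Gene Name"), (1, "radio:Exact Match/Ignore Case"),
              (1, "textbox"), (2, "Organism"), (2, "select")]
groups = group_by_line(components)
assert groups == [["Gene Name", "radio:Exact Match/Ignore Case", "textbox"],
                  ["Organism", "select"]]
```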
D. Segment Processing
In this stage, each segment is further tagged with additional meta-information regarding itself and its components.
Post-processing of extracted information: normalization, stemming, removal of stop words.
He et al. (2007b)'s LEX extracts meta-information about each extracted segment using the Naïve Bayes classification technique. The extracted information for a segment includes domain type (finite, infinite), unit (miles, sec), value type (numeric, character, etc.), and layout order position (in the IEXP).
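Naïve Bayes tagging of segment meta-information can be sketched with a tiny categorical classifier built from scratch. The features and training examples below are invented for illustration and are not LEX's actual feature set.

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (feature_tuple, label) pairs. Returns label
    counts, per-(position, label) value counts, and the set of values
    seen at each feature position (used for add-one smoothing)."""
    label_counts = Counter(label for _, label in examples)
    feat_counts = defaultdict(Counter)
    vocab = defaultdict(set)
    for feats, label in examples:
        for i, v in enumerate(feats):
            feat_counts[(i, label)][v] += 1
            vocab[i].add(v)
    return label_counts, feat_counts, vocab

def predict_nb(model, feats):
    """Pick the label maximizing P(label) * prod_i P(feature_i | label)."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    best, best_p = None, -1.0
    for label, lc in label_counts.items():
        p = lc / total
        for i, v in enumerate(feats):
            c = feat_counts[(i, label)]
            p *= (c[v] + 1) / (sum(c.values()) + len(vocab[i]))  # add-one
        if p > best_p:
            best, best_p = label, p
    return best

# Hypothetical training data: (element kind, has-units text?) -> domain type
data = [(("select", "no"), "finite"), (("radio", "no"), "finite"),
        (("textbox", "yes"), "infinite"), (("textbox", "no"), "infinite")]
model = train_nb(data)
assert predict_nb(model, ("select", "no")) == "finite"
```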
E. Evaluation
How accurate is the extracted information? The system-generated segmented and tagged interface is compared with the manually segmented and tagged interface. The results are evaluated based on standard metrics (precision, recall, accuracy, etc.).
An approach is usually tested on a set of interfaces belonging to a particular domain.
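The scoring of system output against the manual gold standard can be sketched as set comparison over (component, label) pairs; the example pairs are illustrative.

```python
def precision_recall_f1(system, gold):
    """Score system-extracted (component, label) pairs against the
    manually tagged gold standard."""
    system, gold = set(system), set(gold)
    tp = len(system & gold)                      # correct extractions
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("Gene Name", "attribute-name"), ("Exact Match", "operator"),
        ("textbox", "operand")]
system = [("Gene Name", "attribute-name"), ("Exact Match", "operand")]
p, r, f = precision_recall_f1(system, gold)  # p = 0.5, r = 1/3
```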
Why SIU is Significant?
Researchers have proposed solutions to make deep Web contents more useful to users. These solutions can be divided into the following categories based on goals:
To increase content visibility on search engines: building a dynamic page repository (Raghavan and Garcia-Molina, 2001); building a database content repository (Madhavan et al., 2008).
To increase domain-specific usability: meta-search engines (Wu et al., 2004; He et al., 2004; Chang, He and Zhang, 2005; Pei et al., 2006; He and Chang, 2003; Wang et al., 2004).
To attain knowledge organization: derivation of ontologies (Benslimane et al., 2007).
These solutions can only be materialized by leveraging the opportunities provided by search interfaces.
SIGNIFICANCE: SIU is a pre-requisite for several advanced deep Web applications.
Review Settings; Reductionist Analysis; Holistic Analysis
Progress Made
LITERATURE REVIEW RESULTS
Literature Review Process
Reviewed research works that propose approaches for performing the SIU process in the context of the deep Web.
ALIAS: For quick reference, each work is assigned an alias.

S.No. | Reference | Alias
1 | Raghavan and Garcia-Molina (2001) | LITE
2 | Kalijuvee et al. (2001) | CombMatch
3 | Wu et al. (2004) | FieldTree
4 | Zhang et al. (2004) | HSP
5 | Shestakov et al. (2005) | DEQUE
6 | Pei et al. (2006) | AttrList
7 | He et al. (2007) | LEX
8 | Benslimane et al. (2007) | FormModel
9 | Nguyen et al. (2008) | LabelEx
Literature Review Process
The review was done in 2 phases:
Reductionist Analysis: The works were decomposed into small pieces. Each work was visualized as a 2-dimensional grid where the horizontal sections refer to stages of the SIU process. For each stage, the works were analyzed in vertical degrees of analysis known as stage-specific dimensions.
Holistic Analysis: Each work was studied in its entirety within a big-picture context. Composite dimensions were created out of the stage-specific dimensions.
DIMENSION: facilitates comparison among different works by placing them under the same umbrella.
Reductionist Analysis: Representation

Work | Segment: Segment Contents | Text Label : Form Element | Meta-information
HSP | Conditional Pattern: Attribute-name, Operator*, and Value+ | 1:M | -
DEQUE | Field segment: f, Name(f), Label(f), where f = field | 1:1 | JavaScript functions; visible and invisible values; Subinfo(F) = {action, method, enctype}; Iset(F) = initial field set that can be submitted without completing the form; domain(f), type(f); F = form
AttrList | Attribute: attribute-name, description, and form element | 1:1 | Domain information for each attribute (set of values and data types)
LEX | Logical Attribute Ai: attribute label L, list of domain elements {Ej, ..., Ek}, and element labels | 1:M (1:1 in the case of element label : form element) | Site information and form constraints: Ai = (P, U, Re, Ca, DT, DF, VT), where Ai = ith attribute, P = layout order position, U = unit, Re = relationship type, Ca = domain element constraint, DT = domain type, DF = default value, VT = value type; Ei = (N, Fe, V, DV), where N = internal name, Fe = format, V = set of values, DV = default value

A: Representation | B: Parsing | C: Segmentation | D: Segment Processing | E: Evaluation
Reductionist Analysis: Parsing

Work | Input Mode | Basic Step: Description | Cleaning Up | Resulting Structure
LITE | HTML source code and visual interface | Pruning: isolate elements that directly influence the layout of form elements and labels | Discard images; ignore styling information such as font size, font style, and style sheets | Pruned page
CombMatch | HTML source code | Chunk partitioning, and finding meta-information about each chunk: find bounding HTML tags, text strings delimited by table cell tags, etc. | Stop phrases ("optional", "required", "*"); text-formatting HTML tags | Chunk list and table index list; each chunk is represented as an 8-tuple describing meta-information
DEQUE | HTML text and visual interface | Preparing the form database: a DOM tree is created for each FORM element | Ignore font size, typefaces, and styling information | Pruned tree
LEX | HTML source code | Interface expression generation: t = text, e = element, '|' = row delimiter (<BR>, <P>, or </TR>) | - | String
Reductionist Analysis: Segmentation

Work | Problem Description | Segmentation Criteria | Technique
CombMatch | Assigning a text label to an input element | Combination of string-similarity and spatial-similarity algorithms | Heuristics (string properties, proximity, and layout)
HSP | Finding the 3-tuple <attribute name, operators, values> | Grammar (set of rules) based on productions and preferences | Rules (best-effort parser to build a parse tree)
LEX | Assigning text labels to attributes, and assigning element labels to domain elements | Ending colon, textual similarity with element name, vertical alignment, distance, preference to current row | Heuristics (string properties, layout, and proximity)
LabelEx | Assigning a text label to a form element | Classifiers (Naïve Bayes and decision tree); features considered include spatial features, element type, font type, internal similarity, alignment, label placement, distance | Supervised machine learning
Reductionist Analysis: Segment Processing

Work | Technique for Extracting Meta-information / Post-processing
HSP | The Merger module reports conflicting tokens (that occur in two query conditions) and missing tokens (that do not occur in any query condition)
LEX | Naïve Bayes classification (supervised machine learning); post-processing removes meaningless stop words (the, with, any, etc.)
FormModel | Learning by examples (machine learning)
LabelEx | Heuristics for reconciling multiple labels assigned to an element and for handling dangling elements
Reductionist Analysis: Evaluation

Work | Test Domains | Yahoo Subject Categories | Compared With | Metrics
LITE | Semiconductor industry, movies, database technology | Science; Entertainment; Computers & Internet | CombMatch (in terms of methodology) | Accuracy
HSP | Airfare, automobile, book, job, real estate, car rental, hotel, movies, music records | Business & Economy; Recreation & Sports; Entertainment | 4 datasets from different sources collected by the authors | Precision, Recall
LabelEx | Airfare, automobiles, books, movies | Business & Economy; Recreation & Sports; Entertainment | Barbosa et al. (2007)'s and HSP's datasets; classifier ensemble with or without mapping reconciliation (MR); generic classifier vs. domain-specific classifier; generic classifier with MR vs. domain-specific classifier with MR; HSP and LEX (in terms of methodology) | Recall, Precision, F-measure
Holistic Analysis

Work | Type of Semantics | Techniques | Human Involvement | Target Application
LITE | Partial form capabilities (label associated with form element) | Heuristics | None | Deep Web crawler (search engine visibility)
HSP | Query capability (attribute name, operator, and values) | Rules | Manual specification of grammar rules | Meta-searchers (domain-specific usability)
LEX | Components belonging to the same logical attribute (labels and form elements) | Heuristics | None | Meta-searchers (domain-specific usability)
LEX | Meta-information | Supervised machine learning | Training data for classifier | Meta-searchers (domain-specific usability)
FormModel | Structural units (groups of fields belonging to the same entity) | NOT REPORTED | Unknown | Ontology derivation (knowledge organization)
FormModel | Partial form capabilities (label associated with form element) | Heuristics | None | Ontology derivation (knowledge organization)
FormModel | Meta-information | Supervised machine learning | Training data for learning by examples | Ontology derivation (knowledge organization)
LabelEx | Partial form capabilities (label associated with form element) | Supervised machine learning | Classifier training data was manually tagged | Deep Web in general (search engine visibility, domain-specific usability)
Progress Made
SEMANTICS modeled and extracted (Stages A and B):
from merely stating what we see, to stating what is meant by what we see;
from merely associating labels with form elements, to discovering query capabilities;
from no meta-information to a lot of meta-information that might be useful for the target application.
TECHNIQUES employed (Stages C and D):
a mild transition from naïve techniques (rule-based and heuristic-based) to sophisticated techniques (supervised machine learning).
DOMAINS explored (Stage E):
only commercial domains so far: books, used cars, movies, etc.;
still-unexplored non-commercial domains: yahoo.com subject categories such as regional, society and culture, education, arts and humanities, science, reference, and others.
Techniques vs. Design Heterogeneity; Techniques vs. Domain Heterogeneity; Simulating a Human Designer
RESEARCH QUESTIONS
Research Questions
R.Q. #1 Technique vs. Design Heterogeneity: What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
R.Q. #2 Technique vs. Domains: How can we design approaches that work well for arbitrary domains, and thus prevent the need to design domain-specific approaches?
R.Q. #3 Simulating a Human Designer: How can we make a machine understand an interface in the same way as a human designer does?
Derived from the Holistic and Reductionist Analyses
Research Question #1: What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
Elaborating the Question
Techniques: rules, heuristics, and machine learning.
Design: the arrangement of interface components.
Handling heterogeneity in design: being able to perform the following tasks for any kind of design: segmentation (semantic tagging and grouping, of which label assignment is a part) and segment processing.
Technique is a dimension of the Segmentation and Segment Processing stages.
Research Question #1: What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
[Figures: heterogeneity in the automobile domain and in the movie domain, with components annotated as operator, attribute-name, operand, and multiple attribute-names]
Research Question #1: What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
Existing Efforts to Answer
A 2002 study (Kushmerick, 2002) suggests the superiority of machine learning techniques over rule-based and heuristic-based techniques for handling design heterogeneity in general.
A 2008 study (Nguyen et al., 2008) compared the label-assignment accuracy (a part of grouping accuracy) of three approaches: rule-based (HSP), heuristic-based (LEX), and machine learning-based (LabelEx). The machine learning technique outperformed the other two.
This question has been only partially explored.
Investigating R.Q. #1: Technique vs. Design Heterogeneity

Experiment Description | Evaluation Metric | Result | Compared With | Improvement
A machine learning technique based on Hidden Markov Models (HMMs) was designed and tested on a dataset belonging to the biology domain | Grouping accuracy (label assignment included) | 86% | Heuristic-based state-of-the-art approach LEX | 10%
(same experiment) | Semantic tagging accuracy | 90% | A heuristic-based algorithm designed for comparison | 17%

Tasks to test: segmentation (grouping and semantic tagging) and segment processing.
Planned comparisons: segmentation performance, machine learning vs. rule-based; various machine learning techniques (classification vs. HMM vs. others); segment processing performance, rules vs. heuristics vs. machine learning.
However, there is NO comparative study in terms of overall grouping, semantic tagging, and segment processing.
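The HMM-based tagging explored here can be sketched with a tiny Viterbi decoder over interface components. A toy model only: the states, observations, and probabilities below are invented for illustration and are not the thesis's actual model.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely semantic-label sequence for a sequence of observed
    interface components, via dynamic programming."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][o], p)
                             for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(V[-1], key=V[-1].get)
    return path[best]

# Two invented states and a tiny observation alphabet (text vs. element).
states = ["attribute-name", "operand"]
start_p = {"attribute-name": 0.9, "operand": 0.1}
trans_p = {"attribute-name": {"attribute-name": 0.2, "operand": 0.8},
           "operand": {"attribute-name": 0.5, "operand": 0.5}}
emit_p = {"attribute-name": {"text": 0.9, "element": 0.1},
          "operand": {"text": 0.2, "element": 0.8}}
tags = viterbi(["text", "element"], states, start_p, trans_p, emit_p)
assert tags == ["attribute-name", "operand"]
```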
Investigating R.Q. #1: Technique vs. Design Heterogeneity

Experiment Description | Evaluation Metric | Result | Compared With
Monitoring human intervention (IN PROGRESS): rule-based = manual crafting; heuristics = manual observations; machine learning = manual tagging | - | - | Rule-based vs. heuristics vs. machine learning
The HMM was trained using the unsupervised training algorithm Baum-Welch | P(O|λ) | Not promising | -

Human intervention is a dimension of the holistic analysis.
Designing unsupervised techniques.
There is NO comparative study to measure human intervention in these techniques.
Research Question #2: How can we design approaches that work well for arbitrary domains, and thus prevent the need to design domain-specific approaches?
Elaborating the Question
Domain heterogeneity: the deep Web is heterogeneous in terms of domains, i.e., it has databases belonging to all 14 subject categories of Yahoo (Arts & Humanities, Business and Economy, Computers and Internet, Education, Entertainment, Government, Health, etc.).
How do we design generic approaches that work for many domains? How do interface designs differ across domains? Which technique should be employed?
The domain tested is a dimension of the Evaluation stage.
Research Question #2: How can we design approaches that work well for arbitrary domains, and thus prevent the need to design domain-specific approaches?
Existing Efforts to Answer
2004: A single grammar (rule-based) generates reasonably good segmentation performance (grouping & semantic tagging) for all domains (Zhang et al., 2004).
Higher accuracy can be attained using domain-specific techniques, which are not feasible to design using rules (Nguyen et al., 2008).
2008: For label assignment (a portion of grouping), domain-specific classifiers result in higher accuracy than generic classifiers (Nguyen et al., 2008).
Still missing: a comparison of domain-specific and generic approaches for overall segmentation performance; the design differences across domains; generic approaches that yield equally good results for as many domains as possible.
The deep Web has a balanced domain distribution.
Investigating R.Q. #2
[Figures: transition diagrams of design tendencies among component types (attribute-name, text-trivial, operand, operator), with transition probabilities, for four domains: Movie, Biology, References & Education, and Automobile]
Design tendencies of designers from different domains are different.
Investigating R.Q. #2: Technique vs. Domain

Domain | Experiment Description | Evaluation Metric | Winner (Improvement)
Movie | Domain-specific HMM vs. generic HMM | Segmentation accuracy | Generic HMM (4.4%)
References & Education | Domain-specific HMM vs. generic HMM | Segmentation accuracy | Domain-specific HMM (7%)
Automobile | Domain-specific HMM vs. generic HMM | Segmentation accuracy | Domain-specific HMM (8%)
Biology | Domain-specific HMM vs. generic HMM | Segmentation accuracy | Domain-specific HMM (36%)

All experiments were done using the machine learning technique HMM.
What is the correlation between design topology and the performance of a domain-specific model?
Research Question #3: How can we make a machine understand the interface and extract semantics from it in the same way as a human designer does?
A human designer/user naturally understands the design and semantics of an interface based on visual cues and on his or her prior experiences.
A machine cannot really "see" an interface and does not have any implicit Web search experience. (How much do visual layout engines assist?)
Hence, there is a difference between the way a machine perceives an interface and the way a designer perceives it.
How can we reconcile these differences?
Investigating R.Q. #3: Simulating a Human Designer
Hypothesis: A machine can be made to understand the interface in the same way as a human designer does if it is enabled to discover the deep source of knowledge that created the interface in the first place.
[Diagram: a designer/modeler, drawing on Web design knowledge and a conceptual model, understands/designs the search interface; an SIU system works backwards to understand the design, attach semantic labels, derive segments and query capabilities, and recover the DB schema]
Existing methods have been able to understand design, attach semantic labels, and derive segments and query capabilities.
Extracting the DB schema and the conceptual model is still an open question.
Connecting the Dots
[Diagram: R.Q. 1 and R.Q. 2 concern understanding the search interface (deriving segments, semantic labels, and query capabilities); R.Q. 3 concerns recovering the Web design knowledge and conceptual model behind the interface, toward a conceptual-model-based interface]
Suggestions, Comments, Thoughts, Ideas, Questions…
THANK YOU!
Acknowledgements: To my prospectus committee members.
References: [1] to [42] (in the prospectus report).