prospectus presentation
TRANSCRIPT
UNDERSTANDING DEEP WEB
SEARCH INTERFACES
Prospectus Presentation, April 03, 2009
Ritu Khare
Presentation Order
Problem Statement: The Deep Web & Challenges; Search Interface Understanding; Challenges and Significance
Literature Review Results: Settings; Reductionist Analysis; Holistic Analysis
Research Questions and Design Ideas: Techniques vs. Heterogeneity; Semantics and Artificial Designer
Problem: Understanding Semantics
of Search Interfaces
How do existing approaches solve this problem? What are the research gaps? How to fill them?
This presentation uses this space for writing additional facts.
The Deep Web; Challenges in Accessing the Deep Web; Doors of Opportunity
The SIU Process; About the Stages of the Process; Why SIU is Challenging?
Why SIU is Significant?
PROBLEM STATEMENT
The Deep Web
What is the DEEP WEB? The portion of Web resources that is not returned by search engines through traditional crawling and indexing.
Where do the contents LIE? Online databases.
[Diagram: the Web as seen by search engines: Surface Web, Almost Visible Web, Deep Web]
HAS MANY OTHER NAMES!! Hidden Web, Dark Web, Invisible Web, subject-specific databases, data-intensive Web sites.
The Deep Web
How are the contents ACCESSED? By filling in HTML forms on search interfaces.
How are they PRESENTED to users? Dynamic pages / result pages / response pages.
Challenges in Accessing Deep Web Contents
QUICK FACT! The Deep Web includes 307,000 sites, 450,000 databases, and 1,258,000 interfaces.
Its content is 500 times larger than that of the rest of the Web (BrightPlanet.com, 2001).
It grew 3-7 times from 2000 to 2004 (He et al., 2007a).
A user manually reconciles information obtained from different sources and visits several interfaces before finding the right information.
Not scalable: alternative approaches, i.e., Invisible Web directories and search engine browse directories, cover only 37% of the deep Web (He et al., 2007a).
The deep Web remains invisible on the Web.
Opportunities in Accessing Deep Web Contents
HTML forms on search interfaces provide a useful way of discovering the underlying database structure. The labels attached to fields are very expressive and meaningful. Instructions for users to enter data may provide information on data constraints (such as range of data, domain of data) and integrity constraints (mandatory/optional attributes).
In the last decade, several prominent researchers have focused on the PROBLEM OF UNDERSTANDING SEARCH INTERFACES.
INTERESTING FACT! There exist at least 10 million high-quality HTML forms on the deep Web.
Search Interface Understanding (SIU) Process
[Diagram: the SIU process. A search interface (input), backed by an online DB, passes through A. Representation, B. Parsing, C. Segmentation, and D. Segment Processing to produce a system-tagged search interface (output) and an extracted DB; E. Evaluation compares the output against a manually tagged search interface.]
The SIU process is challenging because search interfaces are designed autonomously by different designers and thus do not have a standard structure (Halevy, 2005).
A. Representation and Modeling
This stage formalizes the information to be extracted from a search interface:
Interface component: any text or form element.
Semantic label: the meaning of a component from a user's standpoint.
Segment: a composite component formed by a group of related components.
Segment label: the semantic label of the segment.
This stage builds up the foundation for the process.
A. Representation and Modeling
Zhang et al. (2004) represent an interface as a list of query conditions.
Segment label = Query Condition.
A segment consists of the following semantic labels: an attribute name, an operator, and a value.
[Example segment: Attribute-name | Operator | Value]
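Zhang et al.'s query-condition representation can be sketched as a small data model. This is only an illustration; the class and field names below are my own, not the paper's.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QueryCondition:
    """One segment of a search interface, per Zhang et al. (2004):
    an attribute name, one or more operators, and one or more values."""
    attribute_name: str                                  # e.g. "Title"
    operators: List[str] = field(default_factory=list)   # e.g. ["contains"]
    values: List[str] = field(default_factory=list)      # e.g. ["<textbox>"]

# A book-search interface might then be a list of query conditions:
interface = [
    QueryCondition("Title", ["contains"], ["<textbox>"]),
    QueryCondition("Price", ["less than", "greater than"], ["<textbox>"]),
]
assert interface[0].attribute_name == "Title"
```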
B. Parsing
The interface is parsed into a workable memory structure. This can be done in two modes: by reading the HTML source code, or by rendering the page in a Web browser, either manually or automatically using a visual layout engine.
This stage is the first task physically performed on the interface.
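The source-code mode of parsing can be sketched with Python's built-in html.parser. This is a toy illustration of collecting texts and form elements in document order; real systems (e.g., DEQUE) build a full DOM tree per FORM element.

```python
from html.parser import HTMLParser

class FormParser(HTMLParser):
    """Collect form elements and surrounding texts, in document order."""
    def __init__(self):
        super().__init__()
        self.components = []  # ordered ("text"/"element", value, attrs) tuples

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            self.components.append(("element", tag, dict(attrs)))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.components.append(("text", text, {}))

p = FormParser()
p.feed('<form>Gene Name: <input type="text" name="gene"></form>')
# p.components now holds the text label followed by the textbox element
```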
B. Parsing
He et al. (2007b) parse the interface into an interface expression (IEXP) with constructs t (corresponding to any text), e (corresponding to any form element), and | (corresponding to a row delimiter).
The IEXP for the figure is: t|te|teee
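The IEXP construction can be sketched as a mapping from an ordered component stream to the t/e/| alphabet. A minimal sketch only; the real IEXP is derived while parsing the page, and the component list here is hand-made to match the slide's example.

```python
def to_iexp(components):
    """Map an ordered list of interface components to an IEXP string:
    t = text, e = form element, | = row delimiter."""
    symbols = {"text": "t", "element": "e", "row_break": "|"}
    return "".join(symbols[kind] for kind in components)

# The slide's example: a title row, then a row with a label and one
# element, then a row with a label and three elements -> t|te|teee.
components = ["text", "row_break",
              "text", "element", "row_break",
              "text", "element", "element", "element"]
assert to_iexp(components) == "t|te|teee"
```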
C. Segmentation
A segment has a semantic existence but no physically defined boundaries, making this stage a challenging one. It involves:
Grouping of semantically related components (a sub-problem is associating a surrounding text with a form element).
Assignment of semantic labels to components.
Techniques used: rules, heuristics, machine learning.
C. Segmentation
He et al. (2007b) use a heuristic-based method, LEX, to group elements and text labels together. One heuristic used by LEX is that a text and a form element that lie on the same line are likely to belong to one segment. In the figure, the 3 components "Gene Name", the radio button with options 'Exact Match' and 'Ignore Case', and the textbox belong to one segment.
[Figure: a logical attribute composed of an attribute label, a constraint element, and domain elements]
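LEX's same-line heuristic can be sketched as grouping components by their rendered row. A simplified illustration only: LEX combines several heuristics, not just this one, and the line numbers here are assumed inputs from a layout engine.

```python
def group_by_line(components):
    """Group (line_number, component) pairs into candidate segments,
    using the heuristic that components rendered on the same line
    likely belong to one segment."""
    segments = {}
    for line, comp in components:
        segments.setdefault(line, []).append(comp)
    return [segments[line] for line in sorted(segments)]

# The Gene Name example: label, radio options, and textbox share line 1.
components = [(1, "Gene Name"), (1, "radio:Exact Match/Ignore Case"),
              (1, "textbox"), (2, "Organism"), (2, "select")]
groups = group_by_line(components)
assert groups == [["Gene Name", "radio:Exact Match/Ignore Case", "textbox"],
                  ["Organism", "select"]]
```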
D. Segment Processing
In this stage, each segment is further tagged with additional meta-information regarding itself and its components.
Post-processing of extracted information: normalization, stemming, removal of stop words.
He et al. (2007b)'s LEX extracts meta-information about each extracted segment using the Naïve Bayes classification technique. The extracted information for a segment includes domain type (finite, infinite), unit (miles, sec), value type (numeric, character, etc.), and layout order position (in the IEXP).
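Naïve Bayes tagging of segment meta-information can be sketched with a tiny categorical classifier built from scratch. The features and training examples below are invented for illustration and are not LEX's actual feature set.

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (feature_tuple, label) pairs. Returns label
    counts, per-(position, label) value counts, and the set of values
    seen at each feature position (used for add-one smoothing)."""
    label_counts = Counter(label for _, label in examples)
    feat_counts = defaultdict(Counter)
    vocab = defaultdict(set)
    for feats, label in examples:
        for i, v in enumerate(feats):
            feat_counts[(i, label)][v] += 1
            vocab[i].add(v)
    return label_counts, feat_counts, vocab

def predict_nb(model, feats):
    """Pick the label maximizing P(label) * prod_i P(feature_i | label)."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    best, best_p = None, -1.0
    for label, lc in label_counts.items():
        p = lc / total
        for i, v in enumerate(feats):
            c = feat_counts[(i, label)]
            p *= (c[v] + 1) / (sum(c.values()) + len(vocab[i]))  # add-one
        if p > best_p:
            best, best_p = label, p
    return best

# Hypothetical training data: (element kind, has-units text?) -> domain type
data = [(("select", "no"), "finite"), (("radio", "no"), "finite"),
        (("textbox", "yes"), "infinite"), (("textbox", "no"), "infinite")]
model = train_nb(data)
assert predict_nb(model, ("select", "no")) == "finite"
```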
E. Evaluation
How accurate is the extracted information? The system-generated segmented and tagged interface is compared with the manually segmented and tagged interface. The results are evaluated based on standard metrics (precision, recall, accuracy, etc.).
An approach is usually tested on a set of interfaces belonging to a particular domain.
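The scoring of system output against the manual gold standard can be sketched as set comparison over (component, label) pairs; the example pairs are illustrative.

```python
def precision_recall_f1(system, gold):
    """Score system-extracted (component, label) pairs against the
    manually tagged gold standard."""
    system, gold = set(system), set(gold)
    tp = len(system & gold)                      # correct extractions
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("Gene Name", "attribute-name"), ("Exact Match", "operator"),
        ("textbox", "operand")]
system = [("Gene Name", "attribute-name"), ("Exact Match", "operand")]
p, r, f = precision_recall_f1(system, gold)  # p = 0.5, r = 1/3
```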
Why SIU is Significant?
Researchers have proposed solutions to make deep Web contents more useful to users. These solutions can be divided into the following categories based on goals:
To increase content visibility on search engines: building a dynamic page repository (Raghavan and Garcia-Molina, 2001); building a database content repository (Madhavan et al., 2008).
To increase domain-specific usability: meta-search engines (Wu et al., 2004; He et al., 2004; Chang, He and Zhang, 2005; Pei et al., 2006; He and Chang, 2003; Wang et al., 2004).
To attain knowledge organization: derivation of ontologies (Benslimane et al., 2007).
These solutions can only be materialized by leveraging the opportunities provided by search interfaces.
SIGNIFICANCE: SIU is a pre-requisite for several advanced deep Web applications.
Review Settings; Reductionist Analysis; Holistic Analysis
Progress Made
LITERATURE REVIEW RESULTS
Literature Review Process
Reviewed research works that propose approaches for performing the SIU process in the context of the deep Web.
ALIAS: For quick reference, each work is assigned an alias.

S.No. | Reference | Alias
1 | Raghavan and Garcia-Molina (2001) | LITE
2 | Kalijuvee et al. (2001) | CombMatch
3 | Wu et al. (2004) | FieldTree
4 | Zhang et al. (2004) | HSP
5 | Shestakov et al. (2005) | DEQUE
6 | Pei et al. (2006) | AttrList
7 | He et al. (2007) | LEX
8 | Benslimane et al. (2007) | FormModel
9 | Nguyen et al. (2008) | LabelEx
Literature Review Process
The review was done in 2 phases:
Reductionist Analysis: The works were decomposed into small pieces. Each work was visualized as a 2-dimensional grid where the horizontal sections refer to stages of the SIU process. For each stage, the works were analyzed in vertical degrees of analysis known as stage-specific dimensions.
Holistic Analysis: Each work was studied in its entirety within a big-picture context. Composite dimensions were created out of the stage-specific dimensions.
DIMENSION: facilitates comparison among different works by placing them under the same umbrella.
Reductionist Analysis: Representation

Work | Segment: Segment Contents | Text Label : Form Element | Meta-information
HSP | Conditional Pattern: Attribute-name, Operator*, and Value+ | 1:M | -
DEQUE | Field segment: f, Name(f), Label(f), where f = field | 1:1 | JavaScript functions; visible and invisible values; Subinfo(F) = {action, method, enctype}; Iset(F) = initial field set that can be submitted without completing the form; domain(f), type(f); F = form
AttrList | Attribute: attribute-name, description, and form element | 1:1 | Domain information for each attribute (set of values and data types)
LEX | Logical Attribute Ai: attribute label L, list of domain elements {Ej, ..., Ek}, and element labels | 1:M (1:1 in the case of element label : form element) | Site information and form constraints: Ai = (P, U, Re, Ca, DT, DF, VT), where Ai = ith attribute, P = layout order position, U = unit, Re = relationship type, Ca = domain element constraint, DT = domain type, DF = default value, VT = value type; Ei = (N, Fe, V, DV), where N = internal name, Fe = format, V = set of values, DV = default value

A: Representation | B: Parsing | C: Segmentation | D: Segment Processing | E: Evaluation
Reductionist Analysis: Parsing

Work | Input Mode | Basic Step: Description | Cleaning Up | Resulting Structure
LITE | HTML source code and visual interface | Pruning: isolate elements that directly influence the layout of form elements and labels | Discard images; ignore styling information such as font size, font style, and style sheets | Pruned page
CombMatch | HTML source code | Chunk partitioning, and finding meta-information about each chunk: find bounding HTML tags, text strings delimited by table cell tags, etc. | Stop phrases ("optional", "required", "*"); text-formatting HTML tags | Chunk list and table index list; each chunk is represented as an 8-tuple describing meta-information
DEQUE | HTML text and visual interface | Preparing the form database: a DOM tree is created for each FORM element | Ignore font size, typefaces, and styling information | Pruned tree
LEX | HTML source code | Interface expression generation: t = text, e = element, '|' = row delimiter (<BR>, <P>, or </TR>) | - | String
Reductionist Analysis: Segmentation

Work | Problem Description | Segmentation Criteria | Technique
CombMatch | Assigning a text label to an input element | Combination of string-similarity and spatial-similarity algorithms | Heuristics (string properties, proximity, and layout)
HSP | Finding the 3-tuple <attribute name, operators, values> | Grammar (set of rules) based on productions and preferences | Rules (best-effort parser to build a parse tree)
LEX | Assigning text labels to attributes, and assigning element labels to domain elements | Ending colon, textual similarity with element name, vertical alignment, distance, preference to current row | Heuristics (string properties, layout, and proximity)
LabelEx | Assigning a text label to a form element | Classifiers (Naïve Bayes and decision tree); features considered include spatial features, element type, font type, internal similarity, alignment, label placement, distance | Supervised machine learning
Reductionist Analysis: Segment Processing

Work | Technique for Extracting Meta-information / Post-processing
HSP | The Merger module reports conflicting tokens (that occur in two query conditions) and missing tokens (that do not occur in any query condition)
LEX | Naïve Bayes classification (supervised machine learning); post-processing removes meaningless stop words (the, with, any, etc.)
FormModel | Learning by examples (machine learning)
LabelEx | Heuristics for reconciling multiple labels assigned to an element and for handling dangling elements
Reductionist Analysis: Evaluation

Work | Test Domains | Yahoo Subject Categories | Compared With | Metrics
LITE | Semiconductor industry, movies, database technology | Science; Entertainment; Computers & Internet | CombMatch (in terms of methodology) | Accuracy
HSP | Airfare, automobile, book, job, real estate, car rental, hotel, movies, music records | Business & Economy; Recreation & Sports; Entertainment | 4 datasets from different sources collected by the authors | Precision, Recall
LabelEx | Airfare, automobiles, books, movies | Business & Economy; Recreation & Sports; Entertainment | Barbosa et al. (2007)'s and HSP's datasets; classifier ensemble with or without mapping reconciliation (MR); generic classifier vs. domain-specific classifier; generic classifier with MR vs. domain-specific classifier with MR; HSP and LEX (in terms of methodology) | Recall, Precision, F-measure
Holistic Analysis

Work | Type of Semantics | Techniques | Human Involvement | Target Application
LITE | Partial form capabilities (label associated with form element) | Heuristics | None | Deep Web crawler (search engine visibility)
HSP | Query capability (attribute name, operator, and values) | Rules | Manual specification of grammar rules | Meta-searchers (domain-specific usability)
LEX | Components belonging to the same logical attribute (labels and form elements) | Heuristics | None | Meta-searchers (domain-specific usability)
LEX | Meta-information | Supervised machine learning | Training data for classifier | Meta-searchers (domain-specific usability)
FormModel | Structural units (groups of fields belonging to the same entity) | NOT REPORTED | Unknown | Ontology derivation (knowledge organization)
FormModel | Partial form capabilities (label associated with form element) | Heuristics | None | Ontology derivation (knowledge organization)
FormModel | Meta-information | Supervised machine learning | Training data for learning by examples | Ontology derivation (knowledge organization)
LabelEx | Partial form capabilities (label associated with form element) | Supervised machine learning | Classifier training data was manually tagged | Deep Web in general (search engine visibility, domain-specific usability)
Progress Made
SEMANTICS modeled and extracted (Stages A and B):
from merely stating what we see, to stating what is meant by what we see;
from merely associating labels with form elements, to discovering query capabilities;
from no meta-information to a lot of meta-information that might be useful for the target application.
TECHNIQUES employed (Stages C and D):
a mild transition from naïve techniques (rule-based and heuristic-based) to sophisticated techniques (supervised machine learning).
DOMAINS explored (Stage E):
only commercial domains so far: books, used cars, movies, etc.;
still-unexplored non-commercial domains: yahoo.com subject categories such as regional, society and culture, education, arts and humanities, science, reference, and others.
Techniques vs. Design Heterogeneity; Techniques vs. Domain Heterogeneity; Simulating a Human Designer
RESEARCH QUESTIONS
Research Questions
R.Q. #1 Technique vs. Design Heterogeneity: What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
R.Q. #2 Technique vs. Domains: How can we design approaches that work well for arbitrary domains, and thus prevent the need to design domain-specific approaches?
R.Q. #3 Simulating a Human Designer: How can we make a machine understand an interface in the same way as a human designer does?
Derived from the Holistic and Reductionist Analyses
Research Question #1: What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
Elaborating the Question
Techniques: rules, heuristics, and machine learning.
Design: the arrangement of interface components.
Handling heterogeneity in design: being able to perform the following tasks for any kind of design: segmentation (semantic tagging and grouping, of which label assignment is a part) and segment processing.
Technique is a dimension of the Segmentation and Segment Processing stages.
Research Question #1: What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
[Figures: heterogeneity in the automobile domain and in the movie domain, with components annotated as operator, attribute-name, operand, and multiple attribute-names]
Research Question #1: What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
Existing Efforts to Answer
A 2002 study (Kushmerick, 2002) suggests the superiority of machine learning techniques over rule-based and heuristic-based techniques for handling design heterogeneity in general.
A 2008 study (Nguyen et al., 2008) compared the label-assignment accuracy (a part of grouping accuracy) of three approaches: rule-based (HSP), heuristic-based (LEX), and machine learning-based (LabelEx). The machine learning technique outperformed the other two.
This question has been only partially explored.
Investigating R.Q. #1: Technique vs. Design Heterogeneity

Experiment Description | Evaluation Metric | Result | Compared With | Improvement
A machine learning technique based on Hidden Markov Models (HMMs) was designed and tested on a dataset belonging to the biology domain | Grouping accuracy (label assignment included) | 86% | Heuristic-based state-of-the-art approach LEX | 10%
(same experiment) | Semantic tagging accuracy | 90% | A heuristic-based algorithm designed for comparison | 17%

Tasks to test: segmentation (grouping and semantic tagging) and segment processing.
Planned comparisons: segmentation performance, machine learning vs. rule-based; various machine learning techniques (classification vs. HMM vs. others); segment processing performance, rules vs. heuristics vs. machine learning.
However, there is NO comparative study in terms of overall grouping, semantic tagging, and segment processing.
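The HMM-based tagging explored here can be sketched with a tiny Viterbi decoder over interface components. A toy model only: the states, observations, and probabilities below are invented for illustration and are not the thesis's actual model.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely semantic-label sequence for a sequence of observed
    interface components, via dynamic programming."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][o], p)
                             for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(V[-1], key=V[-1].get)
    return path[best]

# Two invented states and a tiny observation alphabet (text vs. element).
states = ["attribute-name", "operand"]
start_p = {"attribute-name": 0.9, "operand": 0.1}
trans_p = {"attribute-name": {"attribute-name": 0.2, "operand": 0.8},
           "operand": {"attribute-name": 0.5, "operand": 0.5}}
emit_p = {"attribute-name": {"text": 0.9, "element": 0.1},
          "operand": {"text": 0.2, "element": 0.8}}
tags = viterbi(["text", "element"], states, start_p, trans_p, emit_p)
assert tags == ["attribute-name", "operand"]
```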
Investigating R.Q. #1: Technique vs. Design Heterogeneity

Experiment Description | Evaluation Metric | Result | Compared With
Monitoring human intervention (IN PROGRESS): rule-based = manual crafting; heuristics = manual observations; machine learning = manual tagging | - | - | Rule-based vs. heuristics vs. machine learning
The HMM was trained using the unsupervised training algorithm Baum-Welch | P(O|λ) | Not promising | -

Human intervention is a dimension of the holistic analysis.
Designing unsupervised techniques.
There is NO comparative study to measure human intervention in these techniques.
Research Question #2: How can we design approaches that work well for arbitrary domains, and thus prevent the need to design domain-specific approaches?
Elaborating the Question
Domain heterogeneity: the deep Web is heterogeneous in terms of domains, i.e., it has databases belonging to all 14 subject categories of Yahoo (Arts & Humanities, Business and Economy, Computers and Internet, Education, Entertainment, Government, Health, etc.).
How do we design generic approaches that work for many domains? How do interface designs differ across domains? Which technique should be employed?
The domain tested is a dimension of the Evaluation stage.
Research Question #2: How can we design approaches that work well for arbitrary domains, and thus prevent the need to design domain-specific approaches?
Existing Efforts to Answer
2004: A single grammar (rule-based) generates reasonably good segmentation performance (grouping & semantic tagging) for all domains (Zhang et al., 2004).
Higher accuracy can be attained using domain-specific techniques, which are not feasible to design using rules (Nguyen et al., 2008).
2008: For label assignment (a portion of grouping), domain-specific classifiers result in higher accuracy than generic classifiers (Nguyen et al., 2008).
Still missing: a comparison of domain-specific and generic approaches for overall segmentation performance; the design differences across domains; generic approaches that yield equally good results for as many domains as possible.
The deep Web has a balanced domain distribution.
Investigating R.Q. #2
[Figures: transition diagrams of design tendencies among component types (attribute-name, text-trivial, operand, operator), with transition probabilities, for four domains: Movie, Biology, References & Education, and Automobile]
Design tendencies of designers from different domains are different.
Investigating R.Q. #2: Technique vs. Domain

Domain | Experiment Description | Evaluation Metric | Winner (Improvement)
Movie | Domain-specific HMM vs. generic HMM | Segmentation accuracy | Generic HMM (4.4%)
References & Education | Domain-specific HMM vs. generic HMM | Segmentation accuracy | Domain-specific HMM (7%)
Automobile | Domain-specific HMM vs. generic HMM | Segmentation accuracy | Domain-specific HMM (8%)
Biology | Domain-specific HMM vs. generic HMM | Segmentation accuracy | Domain-specific HMM (36%)

All experiments were done using the machine learning technique HMM.
What is the correlation between design topology and the performance of a domain-specific model?
Research Question #3: How can we make a machine understand the interface and extract semantics from it in the same way as a human designer does?
A human designer/user naturally understands the design and semantics of an interface based on visual cues and on his or her prior experiences.
A machine cannot really "see" an interface and does not have any implicit Web search experience. (How much do visual layout engines assist?)
Hence, there is a difference between the way a machine perceives an interface and the way a designer perceives it.
How can we reconcile these differences?
Investigating R.Q. #3: Simulating a Human Designer
Hypothesis: A machine can be made to understand the interface in the same way as a human designer does if it is enabled to discover the deep source of knowledge that created the interface in the first place.
[Diagram: a designer/modeler, drawing on Web design knowledge and a conceptual model, understands/designs the search interface; an SIU system works backwards to understand the design, attach semantic labels, derive segments and query capabilities, and recover the DB schema]
Existing methods have been able to understand design, attach semantic labels, and derive segments and query capabilities.
Extracting the DB schema and the conceptual model is still an open question.
Connecting the Dots
[Diagram: R.Q. 1 and R.Q. 2 concern understanding the search interface (deriving segments, semantic labels, and query capabilities); R.Q. 3 concerns recovering the Web design knowledge and conceptual model behind the interface, toward a conceptual-model-based interface]
Suggestions, Comments, Thoughts, Ideas, Questions…
THANK YOU!
Acknowledgements: To my prospectus committee members.
References: [1] to [42] (in the prospectus report).