March 2000 Gio XIT 1
Increasing the Precision when
Obtaining Information from the Web
Gio Wiederhold Stanford University
4 April 2000
related report: www-db.stanford.edu/pub/gio/1999/miti.htm
EPFL seminar
Supported by the AFOSR- New World Vistas Program
March 2000 Gio XIT 2
Growth Factors
Consumer Pull
Research & Innovation
Toolbuilding
Product building & marketing
General Technology Push
Business needs
Government responsibilities
Information Technology
March 2000 Gio XIT 3
Trends 1998 : 1999
• Users of the Internet: 40% (1998), 52% (1999) of U.S. population
• Growth of Net Sites (now 2.2M public sites with 288M pages)
• Expected growth in E-commerce by Internet users [BW, 6 Sep. 1999]
    segment          1998    1999
    books            7.2%   16.0%
    music & video    6.3%   16.4%
    toys             3.1%   10.3%
    travel           2.6%    4.0%
    tickets          1.4%    4.2%
    Overall          8.0%   33.0% = $9.5 billion
An unsustainable trend cannot be sustained [Herbert Stein]
[Figure: E-penetration, Toys, with new services. Axes: Year / %. Plotted values: 98: 0.3, 99: 1, 00: 3, 01: 9, 02: 27, 03: 81, 04: **. Centroid, in 1999 ~1% of total market.]
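The yearly figures in the chart (0.3, 1, 3, 9, 27, 81) roughly triple each year. As an illustrative sketch, assuming strict tripling from the 1% point in 1999 (the function names are hypothetical and not part of the talk), the projection exceeds 100% of the market within the chart's own time range, which is one way to read the quoted remark about unsustainable trends:

```python
def project_tripling(start_year, start_percent, years):
    """Project a market share that triples every year (an assumed growth rule)."""
    series = []
    value = start_percent
    for offset in range(years):
        series.append((start_year + offset, value))
        value *= 3
    return series


def first_impossible_year(series):
    """Return the first year whose projected share exceeds 100%, or None."""
    for year, percent in series:
        if percent > 100:
            return year
    return None


projection = project_tripling(1999, 1, 6)
impossible = first_impossible_year(projection)
```

With these assumptions the projected share for 2003 is 81% and the value for 2004 is no longer a valid percentage.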
March 2000 Gio XIT 4
Expect continuing growth
• Hardware technology will continue to lead and encourage broader usage
• Communication technology will continue to lead and become more economical
• User interfaces will improve and not be a barrier to the acceptance of technology
• Government policies will not hinder open interaction - or not be able to
March 2000 Gio XIT 5
The Problem of Information Growth:
"We are drowning in information but starved for knowledge. This level of information is clearly impossible to be handled by present means. Uncontrolled and unorganized information is no longer a resource in an information society, instead it becomes the enemy."
-- John Naisbitt, author of 1982 bestseller Megatrends
. . . and it’s not getting better
Dealing with this issue requires Precision:
• Helpful for casual users -- reduce human filtering when browsing
• Essential for business -- regular tasks require automation
needs Knowledge
March 2000 Gio XIT 6
Knowledge from Experts
encoded for reuse
• The product: Information
Observations
Filters
Analyses
Aggregation of instances
Integration of sources
Data
Data + Knowledge -> Information
March 2000 Gio XIT 7
Precision to be improved in
• Relevance of Information for the Customer
  – modeling the customer
• Timeliness of Information
  – resolving temporal mismatch for past data
• Search for Information
  – precision versus recall
• Meaning of the Information (our focus here)
  – resolving semantic mismatch
Service model to achieve these objectives: services add value by increasing precision
March 2000 Gio XIT 8
Search techniques add value
Yahoo humans catalog and organize useful web sites.
Junglee integrates diverse sources using wrappers.
AltaVista automatically surfs and indexes the web.
Excite also tracks queries and classifies customers.
Firefly provides customer control over their profiles.
Cookies track users’ activities between sessions.
Alexa collects webpages and their usage.
Google ranks the reference importance of web pages.
. . .
March 2000 Gio XIT 9
Problems for search engines and progress
• Unsuitable source representations
  • part classification: HTML --- XML
  • print formats: PostScript, Adobe PDF
  • non-text: images, sound, video
  • hidden in databases behind CGI scripts
• Inconsistent semantics
  • context distinct / scope / view
• Naïve modeling of customers
  • roles & growth
Search engines cannot solve all problems
Being improved.
Rate?
March 2000 Gio XIT 10
Large quantities affect cost
The human genome: ~ 4 000 000 000 base pairs
Genes, and gene abnormalities
1 human
~10 000 proteins? diseases
Everybody’s genes: 6 000 000 000 humans
Small organic molecules (affect proteins, suitable for drugs): ~2 000 000 molecules
Metabolic pathways: <1000 systems
Nature Progress
March 2000 Gio XIT 11
Need for precision
adapted from Warren Powell, Princeton Un.
[Figure: data errors versus information quantity, showing the human limit, the acceptable limit, "human with tools?", and the Information Wall.]
More precision is needed as data volume increases --- a small error rate still leads to too many errors
• False Positives have to be investigated (attractive-looking supplier - makes toys, not real cars; apparent drug-target with poor annotation)
• False Negatives cause lost opportunities, suboptimal to some degree
False positives = poor precision typically cost more than false negatives = poor recall
Testing false lead in pharmaceutics costs > $ 100 000 in stage 1.
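The arithmetic behind this slide can be sketched as follows. Only the $100 000 cost per false lead comes from the slide. The candidate volume borrows the ~2 000 000 molecules figure from the previous slide, and the prevalence, false-positive rate, and recall are made-up values chosen purely for illustration. The function name and its parameters are hypothetical.

```python
def screening_outcome(candidates, true_rate, false_positive_rate, recall,
                      cost_per_false_lead):
    """Estimate precision and false-lead cost for one screening pass.

    true_rate:            assumed fraction of candidates that are truly relevant
    false_positive_rate:  assumed fraction of irrelevant items wrongly reported
    recall:               assumed fraction of relevant items actually found
    """
    relevant = candidates * true_rate
    irrelevant = candidates - relevant
    true_positives = relevant * recall
    false_negatives = relevant - true_positives
    false_positives = irrelevant * false_positive_rate
    precision = true_positives / (true_positives + false_positives)
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "precision": precision,
        "false_lead_cost": false_positives * cost_per_false_lead,
    }


# Illustrative numbers only: a 0.1% false-positive rate over 2 000 000 candidates.
outcome = screening_outcome(
    candidates=2_000_000,
    true_rate=0.0001,
    false_positive_rate=0.001,
    recall=0.9,
    cost_per_false_lead=100_000,
)
```

Under these assumptions an error rate of one in a thousand still yields roughly two thousand false leads, far more than the true hits, and the number of false leads grows in proportion to the data volume while precision stays equally poor.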
March 2000 Gio XIT 12
Heterogeneity among Domains
If interoperation involves distinct domains, mismatch ensues
• Autonomy conflicts with consistency
  – Local Needs have Priority
  – Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems
• Representation and Access Conventions
• Naming and Ontology
March 2000 Gio XIT 13
Semantic Mismatches
Information comes from many autonomous sources
• Differing viewpoints (by source)
  – differing terms for similar items      { lorry, truck }
  – same terms for dissimilar items        trunk (luggage, car)
  – differing coverage                     vehicles (DMV, AIA)
  – differing granularity                  trucks (shipper, manuf.)
  – different scope                        student museum fee, Stanford
• Hinders use of information from disjoint sources
  – missed linkages: loss of information, opportunities
  – irrelevant linkages: overload on user or application program
• Poor precision when merged
ok for web browsing, poor for business
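A minimal sketch of how the two kinds of bad linkage arise when sources are merged on spelling alone. The lorry/truck and trunk examples come from the slide. The two source dictionaries, their meaning strings, and the function names are hypothetical.

```python
# Each source maps its own term to the meaning it intends (illustrative data).
source_a = {"lorry": "heavy goods vehicle", "trunk": "luggage chest"}
source_b = {"truck": "heavy goods vehicle", "trunk": "car storage compartment"}


def naive_term_join(a, b):
    """Link entries whose terms are spelled identically, ignoring meaning."""
    return sorted(set(a) & set(b))


def linkage_report(a, b):
    """Return (irrelevant linkages, missed linkages) of the naive join."""
    linked = naive_term_join(a, b)
    # Same term, dissimilar items: the join links them anyway.
    irrelevant = [term for term in linked if a[term] != b[term]]
    # Differing terms, similar items: the join never links them.
    missed = sorted(
        (term_a, term_b)
        for term_a, meaning_a in a.items()
        for term_b, meaning_b in b.items()
        if meaning_a == meaning_b and term_a != term_b
    )
    return irrelevant, missed


irrelevant, missed = linkage_report(source_a, source_b)
```

Here the only link the naive join produces ("trunk") is an irrelevant one, and the one link that matters (lorry with truck) is missed.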
March 2000 Gio XIT 14
Proposed Solutions
Specify and standardize terminology usage: ontology
• Globally: all interacting sources
  – wonderful for users and their programs
  – long time to achieve: 2 sources (UAL, BA), 3 (+ trucks), 4, … all?
  – costly maintenance, since all sources evolve
  – who has the authority to dictate conformance?
• Domain-specific: XML DTD assumption
  – small, focused, cooperating groups
  – high quality, some examples: genomics, arthritis, Shakespeare plays
  – allows sharable, formal tools
  – ongoing, local maintenance affecting users: annual updates
  – poor interoperation, users still face inter-domain mismatches
• solves only part of the problem
March 2000 Gio XIT 15
Domains and Consistency
• a domain will contain many objects
• the object configuration is consistent
• within a domain all terms are consistent &
• relationships among objects are consistent
• context is implicit
No committee is needed to forge compromises * within a domain
Compromises hide valuable details
Domain Ontology
March 2000 Gio XIT 16
Objective of Scalable Knowledge Composition
Provide for Maintainable Application Ontologies
• devolve maintenance onto many domain-specific experts / authorities
• provide an algebra to compute composed ontologies that are limited to their articulation terms
• enable interpretation within the source contexts
SKC
March 2000 Gio XIT 17
Sample Operation: INTERSECTION
Source Domain 1: Owned and maintained by Store
Result contains shared terms, useful for purchasing
Source Domain 2: Owned and maintained by Factory
Articulation
March 2000 Gio XIT 18
Tools to create articulations
Graph matcher for Articulation-creating Expert
Vehicle ontology
Transport ontology
Suggestions for articulations
March 2000 Gio XIT 19
continue from initial point
Tool suggests terms for further articulation:
• by spelling similarity
• by graph position
• by term match nexus
Expert response:
1. Okay
2. False
3. Irrelevant to this articulation
All results are recorded
Okay's are converted into articulation rules
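The suggest-and-review loop can be sketched as below. This covers only the spelling-similarity heuristic and uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure. The slides do not say how the actual tool scores spelling, and the term lists, the threshold, and the function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def suggest_by_spelling(terms_a, terms_b, threshold=0.8):
    """Propose term pairs whose spelling similarity reaches the threshold."""
    suggestions = []
    for a in terms_a:
        for b in terms_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                suggestions.append((a, b, score))
    # Most similar pairs first, so the expert sees the likeliest matches early.
    return sorted(suggestions, key=lambda s: s[2], reverse=True)


def review(suggestions, expert):
    """Record every expert verdict; keep only 'okay' pairs as articulation rules."""
    log, rules = [], []
    for a, b, _score in suggestions:
        verdict = expert(a, b)  # expected: "okay", "false", or "irrelevant"
        log.append((a, b, verdict))
        if verdict == "okay":
            rules.append((a, b))
    return log, rules


# Hypothetical terms from a vehicle ontology and a transport ontology.
vehicle_terms = ["Truck", "Car"]
transport_terms = ["Trucks", "Cart"]

# A stand-in for the human expert: a lookup of prepared answers.
answers = {("Truck", "Trucks"): "okay", ("Car", "Cart"): "false"}
suggestions = suggest_by_spelling(vehicle_terms, transport_terms)
log, rules = review(suggestions, lambda a, b: answers.get((a, b), "irrelevant"))
```

Keeping the full log, including rejected pairs, mirrors the slide's point that all results are recorded, so that a rejected suggestion need not be shown to the expert again.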
March 2000 Gio XIT 20
Candidate Match Nexus
Term linkages automatically extracted from Webster’s* / Oxford dictionary +
* freely available
+ restricted
Based on processing headwords definitions using algebra primitives
Notice presence of 2 domains: chemistry, transport
March 2000 Gio XIT 21
Using the Match Nexus
Experiment: on government structures of NATO countries
SKEIN system resolved over 70% of unmatched terms
March 2000 Gio XIT 22
Using the Match Nexus
March 2000 Gio XIT 23
An Ontology Algebra
A knowledge-based algebra for ontologies
The Articulation Ontology (AO) consists of matching rules that link domain ontologies
Intersection: create a subset ontology, keep sharable entries
Union: create a joint ontology, merge entries
Difference: create a distinct ontology, remove shared entries
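A rough, set-based sketch of the three operations. It assumes an ontology is just a set of terms and an articulation ontology is a set of (term in source 1, term in source 2) matching rules. That is a deliberate simplification for illustration, not the graph-based representation the SKC project may actually use, and all term names are invented.

```python
def intersection(onto1, onto2, rules):
    """Subset ontology: the sharable entries, i.e. rule pairs present in both sources."""
    return {(a, b) for a, b in rules if a in onto1 and b in onto2}


def difference(onto1, onto2, rules):
    """Distinct ontology: entries of onto1 that no matching rule shares with onto2."""
    shared = {a for a, _b in intersection(onto1, onto2, rules)}
    return onto1 - shared


def union(onto1, onto2, rules):
    """Joint ontology: matched terms merged into one entry, the rest kept as-is."""
    matched = intersection(onto1, onto2, rules)
    merged = {frozenset(pair) for pair in matched}
    merged |= {frozenset([t]) for t in onto1 - {a for a, _b in matched}}
    merged |= {frozenset([t]) for t in onto2 - {b for _a, b in matched}}
    return merged


# Hypothetical store and factory vocabularies with two matching rules.
store = {"price", "shoes", "customer"}
factory = {"cost", "footwear", "machine"}
articulation = {("shoes", "footwear"), ("price", "cost")}

shared = intersection(store, factory, articulation)
store_only = difference(store, factory, articulation)
joint = union(store, factory, articulation)
```

Note that the matching rules, not string equality, decide what counts as shared, which is what makes the algebra knowledge-based.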
March 2000 Gio XIT 24
INTERSECTION support
Store Ontology
Articulation ontology
Matching rules that use terms from the 2 source domains
Factory Ontology
Terms useful for purchasing
March 2000 Gio XIT 25
Other Basic Operations
typically prior intersections
UNION: merging entire ontologies
DIFFERENCE: material fully under local control
Articulation ontology
March 2000 Gio XIT 26
Features of an algebra
Operations can be composed
Operations can be rearranged
Alternate arrangements can be evaluated
Optimization is enabled
The record of past operations can be kept and reused when sources change
March 2000 Gio XIT 27
Knowledge Composition
[Figure: knowledge resources A, B, C, D, and E are linked by articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ D), and (C ∩ E). The composed knowledge for applications using A, B, C, E is the union of the articulation knowledge for (A ∩ B), (B ∩ C), and (C ∩ E). Legend: ∪ = union, ∩ = intersection.]
March 2000 Gio XIT 28
SKC Primitive Operations
Unary
• Summarize - abstract
• Glossarize - list terms
• Filter - reduce instances
• Extract - move into context
Binary
• Match - data corroboration
• Difference - distance measure
• Intersect - use of articulation
• Union - search broadening
Constructors
• create object
• create set
Connectors
• match object
• match set
Editors
• insert value
• edit value
• move value
• delete value
Converters
• object - value
• object indirection
• reference indirection
Model and Instance
March 2000 Gio XIT 29
Exploiting the result .
Processing & query evaluation is best performed within Source Domains & by their engines
Result has links to source
Avoid n² problem of interpreter mapping
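The scaling argument can be illustrated with a small count. It assumes that direct interpretation would need one mapping for every ordered pair of sources, and that delegating evaluation to the sources needs only one link back per source. Both function names are hypothetical.

```python
def pairwise_interpreters(n_sources):
    """Mappings needed if every source must be interpreted in every other source's terms."""
    return n_sources * (n_sources - 1)


def source_links(n_sources):
    """Links needed if results point back to each source and its own engine evaluates them."""
    return n_sources


# Five sources, as in the knowledge resources A to E of the composition slide.
comparison = (pairwise_interpreters(5), source_links(5))
```

The first count grows quadratically with the number of sources, the second only linearly.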
March 2000 Gio XIT 30
Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams
of modest size
Empowerment: autonomously maintainable
* based on experience with software
March 2000 Gio XIT 31
Summary
To sustain the growth of web usage
1. The value of the results has to keep increasing
   precision, relevance; not volume, nor recall
2. Value is provided by experts,
   encoded as models of diverse resources, customers
Problems to be addressed
• mismatches
• quality
• maintenance
Clear, scalable models + Tools for these tasks
March 2000 Gio XIT 32
Acknowledgments
Supported by AF Office of Scientific Research
– New World Vistas program
Participants
• David Maluf, postdoc, PhD EE, McGill Univ., 1997.
• Jan Jannink, PhD candidate, CS, grad. June 2000?
• Shrish Agarwal, MS graduate, CS, 1999.
• Prasenjit Mitra, PhD candidate, EE, grad. 2001?
• Martin Kerstens, PhD, summer visitor from CWI.
• Stefan Decker, postdoc, PhD Univ. Karlsruhe 1999.
March 2000 Gio XIT 33
Seminar Course on Intelligent Information Systems
• April-June 2000, at 14:15 - 15:15, room ?
  Presentations in English -- but I'll try to manage discussions in French and/or German.
• I plan to cover the material in an integrating fashion, drawing from concepts in databases, artificial intelligence, software engineering, and business principles.
 1. 13/4 Historical background, enabling technology: ARPA, Internet, DB, OO, AI, IR, XML.
 2. 27/4 Search engines and methods (recall, precision, overload, semantic problems).
 3. 4/5 Digital libraries, information resources. Value of services, copyright.
 4. 11/5 E-commerce. Client-servers. Portals. Payment mechanisms, dynamic pricing.
 5. 19/5 Mediated systems. Functions, interfaces, and standards. Intelligence in processing. Role of humans and automation, maintenance.
 6. 26/5 Software composition. Distribution of functions. Parallelism. [ww D.Beringer]
 7. 31/5 Application to Bioinformatics.
 8. 15/6 Educational challenges. Expected changes in teaching and learning.
 9. 22/6 Privacy protection and security. Security mediation.
10. 29/6 Summary and projection for the future.
• Feedback and comments are appreciated.