Indexing and searching heterogeneous information. LLNL – Nov. 3, 2006. Edward A. Fox, Virginia Tech, [email protected], http://fox.cs.vt.edu. Outline: Acknowledgements, Publications; Introduction: Problem, Digital Libraries; New Efforts: Personalization, Superimposed Info; 5S, ETANA, Structure
![Page 1: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/1.jpg)
1
Indexing and searching heterogeneous information
LLNL – Nov. 3, 2006
Edward A. Fox
Virginia Tech
[email protected]
http://fox.cs.vt.edu
![Page 2: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/2.jpg)
2
Outline
• Acknowledgements, Publications
• Introduction: Problem, Digital Libraries
• New Efforts: Personalization, Superimposed Info
• 5S, ETANA, Structure
• Hybrid Partitioned Inverted Indices
• Discovering Ranking Functions
• Text + CBIR + Metadata + GIS
• Meta-search, Union DLs
• LinkFusion, SimFusion
• Summary
![Page 3: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/3.jpg)
3
Acknowledgements: Students
• Pavel Calado, William Cameron, Yuxin Chen, Fernando Das Neves, Robert France, Marcos Gonçalves, S.H. Kim, Aaron Krowne, Ming Luo, Paul Mather, Sanghee Oh, Unni Ravindranathan, Ryan Richardson, Rao Shen, Ohm Sornil, Hussein Suleman, Ricardo Torres, Manas Tungare, Wensi Xi, Seungwon Yang, Xiaoyan Yu, Baoping Zhang, Qinwei Zhu, …
![Page 4: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/4.jpg)
4
Acknowledgements: Faculty, Staff
• Lillian Cassel, Lois Delcambre, Debra Dudley, Roger Ehrich, Joanne Eustis, Weiguo Fan, James Flanagan, C. Lee Giles, Rohit Kelapure, Neill Kipp, Douglas Knight, Deborah Knox, Aaron Krowne, Alberto Laender, David Maier, Gail McMillan, Claudia Medeiros, Manuel Perez-Quinones, Jeffrey Pomerantz, Naren Ramakrishnan, Layne Watson, Barbara Wildemuth, …
![Page 5: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/5.jpg)
5
Other Collaborators (Selected)
• Brazil: FUA, UFMG, UNICAMP
• Case Western Reserve University
• Emory, Notre Dame, Oregon State
• Germany: Univ. Oldenburg
• Mexico: UDLA (Puebla), Monterrey
• College of NJ, Hofstra, Penn State, Villanova
• University of Arizona
• University of Florida, Univ. of Illinois
• University of Virginia
![Page 6: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/6.jpg)
Acknowledgements: Support
• ACM, Adobe, AOL, CAPES, CNI, CONACyT, DFG, IBM, Microsoft, NASA, NDLTD, NLM, NSF (IIS-9986089, 0086227, 0080748, 0325579, 0535057; ITR-0325579; DUE-0121679, 0136690, 0121741, 0333601, 0435059, 0532825), OCLC, SOLINET, SUN, SURA, UNESCO, US Dept. Ed. (FIPSE), VTLS
![Page 7: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/7.jpg)
7
Publications – 1 of 2
• N. J. Belkin, P. Kantor, E. A. Fox and J. A. Shaw. Combining the Evidence of Multiple Query Representations for Information Retrieval. Information Processing & Management, 31(3), 431-448, May-June 1995.
• Fan, W., Luo, M., Wang, L., Xi, W., and Fox, E. A. Tuning before feedback: Combining ranking discovery and blind feedback for robust retrieval. SIGIR 2004, 27th Annual Int’l ACM SIGIR Conf. on R&D in Information Retrieval, Sheffield, England, 25-29 July 2004
• Weiguo Fan; Gordon, M.D.; Pathak, P.; Wensi Xi; Fox, E.A.; Ranking function optimization for effective web search by genetic programming: an empirical study, in the Proceedings of 37th Hawaii International Conf. on System Sciences (HICSS), 5-8 Jan. 2004, 105 - 112
• Edward A. Fox, Fernando Das Neves, Xiaoyan Yu, Rao Shen, Seonho Kim, and Weiguo Fan. Exploring the computing literature with visualization and stepping stones & pathways. CACM 49(4): 52-58, April 2006
• Edward A. Fox and Paul Mather. Scalable Storage for Digital Libraries. Chapter 12 in Multimedia Information Retrieval and Management: Technological Fundamentals and Applications, eds. D. Feng, W.C. Siu and H.J. Zhang, Berlin: Springer, 2003, pp. 265-288
• E. Fox and J. Shaw. Combination of Multiple Searches. In Proc. of The Second Text REtrieval Conference (TREC-2) (Aug. 30 - Sept. 1, 1993, NIST, Gaithersburg, MD), NIST Special Pub. 500-215, 1994, ed. D. K. Harman, 243-252
• Marcos Andre Goncalves, Robert K. France, and Edward A. Fox, MARIAN: Flexible Interoperability for Federated Digital Libraries. In Proc. 5th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'2001, September 4-8, 2001, Darmstadt, Germany, Springer, LNCS 2163 / 2001, pp. 173-186
• Ananth Raghavan, Naga Srinivas Vemuri, Rao Shen, Marcos Andre Goncalves, Weiguo Fan, and Edward A. Fox. Incremental, Semi-automatic, Mapping-Based Integration of Heterogeneous Collections into Archaeological Digital Libraries: Megiddo Case Study. In Proc. ECDL2005, Vienna, Sept. 18-23, 2005, 139-150
![Page 8: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/8.jpg)
8
Publications – 2 of 2
• Rao Shen, Naga Srinivas Vemuri, Weiguo Fan, Ricardo da S. Torres, and Edward A. Fox. Exploring Digital Libraries: Integrating Browsing, Searching, and Visualization. In Proc. JCDL 2006, June 11-15, 2006, Chapel Hill, NC, 1-10
• Ricardo da Silva Torres, Alexandre X. Falcao, Baoping Zhang, Weiguo Fan, Edward A. Fox, Marcos Andre Goncalves, Pavel Calado. A new framework to combine descriptors for content-based image retrieval. In Proc. 14th Conf. Information and Knowledge Management, CIKM 2005, 31 Oct. - 5 Nov. 2005 Bremen, Germany, 335-336
• Li Wang, Weiguo Fan, Rui Yang, Wensi Xi, Ming Luo, Ye Zhou, Edward A. Fox, Ranking Function Discovery by Genetic Programming for Robust Retrieval, Text Retrieval Evaluation Conference-2003, Nov 17-23, NIST, Washington DC, 9 pages
• Wensi Xi, Edward A. Fox, Weiguo Fan, Benyu Zhang, Zheng Chen, Jun Yan, Dong Zhuang. SimFusion: Measuring Similarity using Unified Relationship Matrix. In Proc. SIGIR 2005, 28th Annual International ACM SIGIR Conf., Salvador, Brazil, August 15-19, 2005, 130-137, http://doi.acm.org/10.1145/1076034.1076059
• W. Xi, B. Zhang, Z. Chen, Y. Lu, S. Yan, W.Y. Ma, E.A. Fox. Link Fusion: A Unified Link Analysis Framework for Multi-type Inter-related Data Objects. In Proc. Thirteenth International World Wide Web Conf., WWW2004, NY, U.S.A. 19-22 May 2004, 10 pages
• Wensi Xi, Ohm Sornil, Ming Luo, and Edward A. Fox. Hybrid Partition Inverted Files: Experimental Validation. In "Research and Advanced Technology for Digital Libraries, 6th European Conference, ECDL 2002, Rome, Italy, September 16-18, 2002, Proceedings", eds. Maristella Agosti and Constantino Thanos, LNCS 2458, Springer, pp. 422-431.
• Wensi Xi, Ohm Sornil, and Edward A. Fox. Hybrid Partition Inverted Files for Large-Scale Digital Libraries. Proc. Digital Library: IT Opportunities and Challenges in the New Millennium, July 9-11, 2002, Beijing Library Press, Beijing, China, 404-418
• Baoping Zhang, Yuxin Chen, Weiguo Fan, Edward A. Fox, Marcos Andre Goncalves, Marco Cristo, Pavel Calado. Intelligent GP Fusion from Multiple Sources for Text Classification. In Proc. 14th Conf. on Information and Knowledge Management, CIKM 2005, 31st October - 5 Nov 2005 Bremen, Germany, 477-484
![Page 9: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/9.jpg)
9
Outline
• Acknowledgements, Publications
• Introduction: Problem, Digital Libraries
• New Efforts: Personalization, Superimposed Info
• 5S, ETANA, Structure
• Hybrid Partitioned Inverted Indices
• Discovering Ranking Functions
• Text + CBIR + Metadata + GIS
• Meta-search, Union DLs
• LinkFusion, SimFusion
• Summary
![Page 10: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/10.jpg)
10
Problem Characterization
• Distributed (space)
• Content (streams)
• Indexing (space, structure)
– Features
– Type/sub-type: Image, texture; link, citation
– Descriptors: words or phrases or concepts
– High dimensionality
• Searching (scenario)
![Page 11: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/11.jpg)
11
Efficiency / Effectiveness
• Effectiveness
– Very common measures: Precision, Recall, F1, 10-precision, R-Precision
– Usefulness, usability, task support, …
• Efficiency
– Time
– Space
– Performance, Resource use, …
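The effectiveness measures named above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the document IDs and relevance judgments are made up, and "10-precision" is read here as precision at a cutoff of 10 retrieved documents, which is an assumption about the slide's shorthand.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(relevant)


def f1(retrieved, relevant):
    """Harmonic mean of precision and recall."""
    p, r = precision(retrieved, relevant), recall(retrieved, relevant)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def precision_at_k(ranked, relevant, k):
    """Precision over the top k ranks (missing ranks count as misses)."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_k(ranked, relevant, len(relevant))


# Hypothetical ranked result list and relevance judgments.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5", "d8"}

print(precision(ranked, relevant))           # 2 of 5 retrieved are relevant
print(recall(ranked, relevant))              # 2 of 4 relevant were retrieved
print(f1(ranked, relevant))
print(precision_at_k(ranked, relevant, 10))  # "10-precision" under the assumption above
print(r_precision(ranked, relevant))         # precision at rank 4
```

Precision and recall are set-based, while the cutoff measures depend on rank order, which is why they are common when only the top of a result list matters.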
![Page 12: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/12.jpg)
12
![Page 13: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/13.jpg)
13
![Page 14: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/14.jpg)
14
![Page 15: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/15.jpg)
15
CC2001 Information Management Areas
IM1. Information models and systems*
IM2. Database systems*
IM3. Data modeling*
IM4. Relational DBs
IM5. Database query languages
IM6. Relational DB design
IM7. Transaction processing
IM8. Distributed DBs
IM9. Physical DB design
IM10. Data mining
IM11. Information storage and retrieval
IM12. Hypertext and hypermedia
IM13. Multimedia information & systems
IM14. Digital libraries
* Core components
![Page 16: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/16.jpg)
16
DL Curriculum Framework
Course structure
• Semester 1: DL collections: development/creation
• Semester 2: DL services and sustainability
Core DL topics
• Digitization, Storage, Interchange
• Digital objects, Composites, Packages
• Metadata, Cataloging, Author submission
• Naming, Repositories, Archives
• Spaces (conceptual, geographic, 2/3D, VR)
• Architectures (agents, buses, wrappers/mediators), Interoperability
• Services (searching, linking, browsing, etc.)
• Intellectual property rights mgmt., Privacy, Protection (watermarking)
• Archiving and preservation, Integrity
Related topics
• Documents, E-publishing, Markup
• Info. Needs, Relevance, Evaluation, Effectiveness
• Thesauri, Ontologies, Classification, Categorization
• Bibliographic information, Bibliometrics, Citations
• Routing, Filtering, Community filtering
• Search & search strategy, Info seeking behavior, User modeling, Feedback
• Info summarization, Visualization
• Multimedia streams/structures, Capture/representation, Compression/coding
• Content-based analysis, Multimedia indexing
• Multimedia presentation, rendering
![Page 17: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/17.jpg)
17
Digital Library Content
Content Types
• Text Documents: Articles, Reports, Books
• Video, Audio: Speech, Music
• Geographic Information: (Aerial) Photos
• Software, Programs: Models, Simulations
• Bio Information: Genome (human, animal, plant)
• Images and Graphics: 2D, 3D, VR, CAT
![Page 18: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/18.jpg)
18
Outline
• Acknowledgements, Publications
• Introduction: Problem, Digital Libraries
• New Efforts: Personalization, Superimposed Info
• 5S, ETANA, Structure
• Hybrid Partitioned Inverted Indices
• Discovering Ranking Functions
• Text + CBIR + Metadata + GIS
• Meta-search, Union DLs
• LinkFusion, SimFusion
• Summary
![Page 19: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/19.jpg)
19
Personalizing A Course Website Using the NSDL
William Cameron², Boots Cassel², Edward Fox¹, Manuel Perez-Quinones¹, Manas Tungare¹, Xiaoyan Yu¹
¹Virginia Tech, ²Villanova
![Page 20: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/20.jpg)
20
Syllabus Collection … Towards an intelligent educational system
[Figure: syllabus collection architecture. Components: Crawler, Syllabus Classifier, Extractor, Editor, Publisher, Searcher, Recommender, Resource Classifier. Data: Potential Syllabus Text, Unstructured Syllabus Text, Structured Syllabus Text, Other NSDL Resources. Supporting elements: Syllabus Ontology, Classification Scheme, Services.]
![Page 21: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/21.jpg)
21
Search
• With the collection in place, we have full-text search
• Results point to the local copy in our collection as well as to the original document
• Try it out: http://doc.cs.vt.edu/search/
![Page 22: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/22.jpg)
22
Syllabus Ontology
• Standard, machine understandable
• Ontology Editor: Protégé
• Syllabus Schema: SylVia
• http://doc.cs.vt.edu/ontologies/
![Page 23: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/23.jpg)
23
Creating new syllabus
• Web-based application to support entry of syllabi into collection
• Moodle Plug-in in the works
• Uses CC 2001 to select topics for a course
![Page 24: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/24.jpg)
24
Information Extraction
• Plans to automatically extract information from syllabi documents collected
• Rule-based Approach
• Statistics-based Approach
• Apply the best extractor on the unstructured syllabi
![Page 25: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/25.jpg)
25
Superimposed Tools for VT
Uma Murthy and Edward A. Fox
Department of Computer Science, Virginia Tech
18 October 2006
![Page 26: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/26.jpg)
26
Origin of Superimposed Information (SI)
• This basic need had been addressed in diverse ways, with varying degrees of success, for many years:
– concordances, annotations, comments
– bookmarks, concept maps, digital annotations, …
• The term “SI” was coined in 1999 by researchers, currently collaborating with us, now at Portland State University
– Lois Delcambre
– David Maier
![Page 27: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/27.jpg)
27
Layers in an SI system
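[Figure: a Superimposed Layer sits above a Base Layer of information sources (Information Source 1, Information Source 2, …, Information Source n); marks connect the superimposed layer to the base layer.]
* Source: ICDE04 presentation by Murthy et al.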
![Page 28: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/28.jpg)
28
Annotating an image
![Page 29: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/29.jpg)
29
Searching over annotations
![Page 30: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/30.jpg)
30
Searching over images/sub-images
![Page 31: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/31.jpg)
31
Summary
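[Figure: a Superimposed Layer sits above a Base Layer of information sources (Information Source 1, Information Source 2, …, Information Source n); marks connect the superimposed layer to the base layer.]
* Source: ICDE04 presentation by Murthy et al.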
![Page 32: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/32.jpg)
32
Outline
• Acknowledgements, Publications
• Introduction: Problem, Digital Libraries
• New Efforts: Personalization, Superimposed Info
• 5S, ETANA, Structure
• Hybrid Partitioned Inverted Indices
• Discovering Ranking Functions
• Text + CBIR + Metadata + GIS
• Meta-search, Union DLs
• LinkFusion, SimFusion
• Summary
![Page 33: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/33.jpg)
33
Informal 5S & DL Definitions
DLs are complex systems that
• help satisfy info needs of users (societies)
• provide info services (scenarios)
• organize info in usable ways (structures)
• present info in usable ways (spaces)
• communicate info with users (streams)
![Page 34: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/34.jpg)
34
5Ss
| Ss | Examples | Objectives |
|---|---|---|
| Streams | Text; video; audio; image | Describes properties of the DL content such as encoding and language for textual material or particular forms of multimedia data |
| Structures | Collection; catalog; hypertext; document; metadata | Specifies organizational aspects of the DL content |
| Spaces | Measure; measurable, topological, vector, probabilistic | Defines logical and presentational views of several DL components |
| Scenarios | Searching, browsing, recommending | Details the behavior of DL services |
| Societies | Service managers, learners, teachers, etc. | Defines managers, responsible for running DL services; actors, that use those services; and relationships among them |
![Page 35: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/35.jpg)
35
Taxonomy of DL Services
• Infrastructure Services
– Repository-Building
   • Creational: Acquiring, Cataloging, Crawling (focused), Describing, Digitizing, Federating, Harvesting, Purchasing, Submitting
   • Preservational: Conserving, Converting, Copying/Replicating, Emulating, Renewing, Translating (format)
– Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Measuring, Publicizing, Rating, Reviewing (peer), Surveying, Translating (language)
• Information Satisfaction Services: Browsing, Collaborating, Customizing, Filtering, Providing access, Recommending, Requesting, Searching, Visualizing
![Page 36: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/36.jpg)
36
5S and DL formal definitions and compositions (April 2004 TOIS)
[Figure: dependency graph of the formal definitions; "d.N" is the definition number shown on the slide.]
• Mathematical foundations: relation (d.1), function (d.2), sequence (d.3), tuple (d.4)*, language (d.5), graph (d.6), grammar (d.7); measurable (d.12), measure (d.13), probability (d.14), vector (d.15), topological (d.16) spaces; event (d.10), state (d.18); transmission (d.23)
• 5S: streams (d.9), structures (d.10), spaces (d.18), scenarios (d.21), societies (d.24); services (d.22)
• DL concepts: structural metadata specification (d.25), descriptive metadata specification (d.26), structured stream (d.29), digital object (d.30), collection (d.31), metadata catalog (d.32), repository (d.33), indexing service (d.34), searching service (d.35), hypertext (d.36), browsing service (d.37), digital library (minimal) (d.38)
![Page 37: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/37.jpg)
37
A Minimal DL in the 5S Framework
[Figure: layered composition diagram. Base: Streams, Structures, Spaces, Scenarios, Societies. Intermediate concepts: Structured Stream, Structural Metadata Specification, Descriptive Metadata Specification, hypertext, services (indexing, browsing, searching). Upper concepts: Digital Object, Metadata Catalog, Collection, Repository, Minimal DL.]
![Page 38: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/38.jpg)
38
![Page 39: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/39.jpg)
39
ETANA-DL
• Archaeological DL
• Integrated DL
– Heterogeneous data handling
• Applies and extends the OAI-PMH
– Open Archives Initiative Protocol for Metadata Harvesting
• Design considerations
– Componentized
– Extensible
– Portable
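As background for the harvesting step, OAI-PMH requests are plain HTTP requests whose query string carries a `verb` and its arguments. The sketch below only builds a `ListRecords` request URL; the base URL and the set name are hypothetical placeholders, and nothing here reflects how ETANA-DL extends the protocol.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used for illustration only.
BASE_URL = "http://example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL.

    `verb`, `metadataPrefix`, `set`, and `from` are standard OAI-PMH
    request arguments; `oai_dc` is the Dublin Core format that every
    OAI-PMH repository is required to support.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec        # optional selective harvesting by set
    if from_date:
        params["from"] = from_date      # optional lower bound on datestamp
    return base_url + "?" + urlencode(params)


print(list_records_url(BASE_URL))
# "pottery" is a made-up set name for illustration.
print(list_records_url(BASE_URL, set_spec="pottery", from_date="2006-01-01"))
```

A harvester would issue such requests repeatedly, following resumption tokens returned by the repository, and map the returned metadata into its own catalog.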
![Page 40: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/40.jpg)
40
![Page 41: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/41.jpg)
41
Heterogeneous data handling
| Site | Artifact Type | Original data source | Number of records harvested |
|---|---|---|---|
| Bab edh-Dhra’ | Pottery | cp6 database file | 786 |
| Lahav | Figurine | Tab-delimited text file | 563 |
| Madaba | Locus field record | Tables in Access DB | 786 |
| Mozan | Publication | PDF files | 19 |
| Nimrin | Bone field record | Table in Oracle DB | 7419 |
| Nimrin | Seed field record | Table in Oracle DB | 429 |
| Nimrin | Locus field record | Table in Oracle DB | 2101 |
| Umayri | Bone field record | 2 tables in Access DB | 2122 |
| Total | | | 18404 |
![Page 42: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/42.jpg)
42
ETANA Spaces
1. Geographic distribution of found artifacts
2. Temporal dimension (as inferred by archaeologists)
3. Metric or vector spaces
   1. used to support retrieval operations, and to calculate distance (and similarity)
   2. used to browse / constrain searches spatially
4. 3D models of the past, used to reconstruct and visualize archaeological ruins
5. 2D interfaces for human-computer interaction
![Page 43: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/43.jpg)
43
ETANA Structures
1. Site Organization
   1. Region, site, partition, sub-partition, locus, …
2. Temporal orderings (ages, periods)
3. Taxonomies
   1. for bones, seeds, building materials, …
4. Stratigraphic relationships
   1. above, beneath, coexistent
![Page 44: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/44.jpg)
44
ETANA Streams
1. successive photos and drawings of excavation sites, loci, unearthed artifacts
2. audio and video recordings of excavation activities and discussions
3. textual reports
4. 3D models used to reconstruct and visualize archaeological ruins.
![Page 45: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/45.jpg)
45
Degree of Structure
Chaotic: Web
Organized: DLs
Structured: DBs
![Page 46: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/46.jpg)
46
Digital Objects (DOs)
• Born digital
• Digitized version of “real” object
– Is the DO version the same, better, or worse?
– Decision for ETDs: structured + rendered
• Surrogate for “real” object
– Not covered explicitly in metamodel for a minimal DL
– Crucial in metamodel for archaeology DL
![Page 47: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/47.jpg)
47
Metadata Objects (MDOs)
• MARC
• Dublin Core
• RDF
• IMS
• OAI (Open Archives Initiative)
• Crosswalks, mappings
• Ontologies
• Topic maps, concept maps
![Page 48: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/48.jpg)
48
Also Important: Epub, SGML, XML
• 5S perspective: streams, structures, scenarios
• Authoring
• Rendering, presenting
• Tagging, Markup, DOM
• Semi-structured information
• Dual-publishing, eBooks
• Styles (XSL, XSLT)
• Structured queries
![Page 49: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/49.jpg)
49
XML-based DL Log Standard
• Log analysis
– Is a source of information on:
   • How patrons really use DL services
   • How systems behave while supporting user information seeking activities
– Is used to:
   • Evaluate and enhance services
   • Guide allocation of resources
• Common practice in the web setting
– Supported by web servers, proxy caches
• DL Logging can be more detailed
![Page 50: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/50.jpg)
50
DL Logging Features
• Captures high level user and system behaviors
• Organized according to the 5S framework
– Hierarchical organization (XML-based)
– Centered on the notion of events
• Records only events related to initial user inputs and final system outputs
• Helps to understand user interactions and the perceived value of responses
![Page 51: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/51.jpg)
51
The XML Log Format
[Figure: tree diagram of the XML log format, rooted at Log. Element names shown: Transaction, SessionId, MachineInfo, Timestamp, Statement, SessionInfo, RegisterInfo, Event, Action, StatusInfo, Search, Browse, Update, StoreSysInfo, SearchBy, QueryString, Collection, Catalog, PresentationInfo, Timeout.]
![Page 52: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/52.jpg)
52
Outline
• Acknowledgements, Publications
• Introduction: Problem, Digital Libraries
• New Efforts: Personalization, Superimposed Info
• 5S, ETANA, Structure
• Hybrid Partitioned Inverted Indices
• Discovering Ranking Functions
• Text + CBIR + Metadata + GIS
• Meta-search, Union DLs
• LinkFusion, SimFusion
• Summary
![Page 53: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/53.jpg)
53
Hybrid Partitioned Inverted Indices for Large-Scale Digital Libraries
Ohm Sornil
The National Institute for Development Administration (NIDA)
Bangkok, Thailand
Edward A. Fox
Department of Computer Science
Virginia Tech, USA
![Page 54: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/54.jpg)
54
Inverted Index
Document 1 = Information retrieval is searching and indexing
Document 2 = Indexing is building an index
Document 3 = An inverted file is an index
Document 4 = Building an inverted file is indexing

| Vocabulary | Inverted List (document; position) |
|---|---|
| an | (2;4), (3;1), (3;5), (4;2) |
| and | (1;5) |
| building | (2;3), (4;1) |
| file | (3;3), (4;4) |
| index | (2;5), (3;6) |
| indexing | (1;6), (2;1), (4;6) |
| information | (1;1) |
| inverted | (3;2), (4;3) |
| is | (1;3), (2;2), (3;4), (4;5) |
| retrieval | (1;2) |
| searching | (1;4) |
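The inverted lists above can be reproduced with a short sketch. It assumes the simplest possible tokenization (lowercasing and splitting on whitespace) and 1-based word positions; the function name is illustrative, not from the talk.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to a list of (document id, word position) postings."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        # Assumed tokenization: lowercase, split on whitespace, positions start at 1.
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].append((doc_id, pos))
    return dict(index)


docs = {
    1: "Information retrieval is searching and indexing",
    2: "Indexing is building an index",
    3: "An inverted file is an index",
    4: "Building an inverted file is indexing",
}

index = build_inverted_index(docs)
for word in sorted(index):
    print(word, index[word])
```

Because documents are processed in id order and words in position order, each posting list comes out sorted, which is what list-merging query algorithms usually rely on.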
![Page 55: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/55.jpg)
55
Inverted Index Partitioning
• The Inverted Index Partitioning Problem is NP-complete
Two previously proposed schemes:
– Document Partitioning
   • Postings are stored at the same node as are their documents
   • Aggressively balances the load
– Term Partitioning
   • Every posting of a term is stored in one node
   • Normally no attempt to balance the load
![Page 56: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/56.jpg)
56
Hybrid Partitioning Scheme
$$C = \frac{1}{N c} \sum_{i=1}^{V} q(i)\, Z(i)$$
C: Average number of chunks required for a node to retrieve an average term
c: Chunk size
q(i): Query selection distribution
Z(i): Term-frequency distribution
N: Number of nodes in the system
V: Upper limit of the sum over terms (not defined on the slide; presumably the vocabulary size)
• Attempts to balance the load
• Groups postings into chunks
Chunk Size Selection Scheme
– Suggests a reasonable chunk size for a particular operating condition
![Page 57: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/57.jpg)
57
Inverted Index Partitioning
Given
• 4 Disks
• Collection (4 docs)
d1: <a, b, a, c, b>
d2: <a, d, e, a>
d3: <b, c, a, b>
d4: <b>

Term Partitioning
Node1: a = (d1;1), (d1;3), (d2;1), (d2;4), (d3;3)
Node2: b = (d1;2), (d1;5), (d3;1), (d3;4), (d4;1)
Node3: c = (d1;4), (d3;2)
Node4: d = (d2;2); e = (d2;3)

Document Partitioning
Node1: a = (d1;1), (d1;3); b = (d1;2), (d1;5); c = (d1;4)
Node2: a = (d2;1), (d2;4); d = (d2;2); e = (d2;3)
Node3: a = (d3;3); b = (d3;1), (d3;4); c = (d3;2)
Node4: b = (d4;1)

Hybrid Partitioning
• Assume: Chunk Size = 4 postings
Node1: a = (d1;1), (d1;3), (d2;1), (d2;4)
Node2: b = (d1;2), (d1;5), (d3;1), (d3;4)
Node3: a = (d3;3); c = (d1;4), (d3;2)
Node4: b = (d4;1); d = (d2;2); e = (d2;3)
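The example above can be explored with a small sketch. Document partitioning is reproduced directly. For hybrid partitioning, the slide does not say how chunks are placed on nodes, so the code assumes a greedy rule (largest chunk first, onto the currently least-loaded node); that rule and all function names are illustrative assumptions, and the resulting placement need not match the slide's even though it also splits every posting list into chunks of at most 4 postings.

```python
from collections import defaultdict

docs = {"d1": list("abacb"), "d2": list("adea"), "d3": list("bcab"), "d4": list("b")}


def build_postings(docs):
    """Term -> list of (doc id, position) postings, positions starting at 1."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id], start=1):
            postings[term].append((doc_id, pos))
    return dict(postings)


def document_partition(docs, num_nodes):
    """Each node stores all postings of the documents assigned to it."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for i, doc_id in enumerate(sorted(docs)):
        for pos, term in enumerate(docs[doc_id], start=1):
            nodes[i % num_nodes][term].append((doc_id, pos))
    return [dict(n) for n in nodes]


def hybrid_partition(postings, num_nodes, chunk_size):
    """Split posting lists into chunks, then place chunks greedily (assumed rule)."""
    chunks = []
    for term in sorted(postings):
        plist = postings[term]
        for start in range(0, len(plist), chunk_size):
            chunks.append((term, plist[start:start + chunk_size]))
    chunks.sort(key=lambda c: -len(c[1]))    # largest chunks first (stable sort)
    nodes = [[] for _ in range(num_nodes)]
    loads = [0] * num_nodes                  # postings stored per node
    for term, chunk in chunks:
        target = loads.index(min(loads))     # least-loaded node so far
        nodes[target].append((term, chunk))
        loads[target] += len(chunk)
    return nodes, loads


postings = build_postings(docs)
doc_nodes = document_partition(docs, 4)
hybrid_nodes, hybrid_loads = hybrid_partition(postings, 4, chunk_size=4)
print(hybrid_loads)
```

Under this greedy rule the per-node loads for the example collection come out as 4, 4, 3, and 3 postings, the same load profile as the hybrid layout on the slide, whereas term partitioning leaves the nodes holding 5, 5, 2, and 2 postings.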
![Page 58: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/58.jpg)
58
Conclusion
• Performance Measures:
– Hybrid > Term > Document
• Hybrid partitioning scheme performs better than the other two schemes in a variety of conditions
– Large collection
– Multiprogramming level
– Query skew
– System scaling
![Page 59: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/59.jpg)
59
Conclusion (cont.)
• Observations from the results
– Node Utilization (best is middle range)
• Results: Document > Hybrid > Term
– Load Fluctuation
• Results: Term > Hybrid > Document
![Page 60: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/60.jpg)
60
Outline
• Acknowledgements, Publications
• Introduction: Problem, Digital Libraries
• New Efforts: Personalization, Superimposed Info
• 5S, ETANA, Structure
• Hybrid Partitioned Inverted Indices
• Discovering Ranking Functions
• Text + CBIR + Metadata + GIS
• Meta-search, Union DLs
• LinkFusion, SimFusion
• Summary
![Page 61: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/61.jpg)
61
Tuning Before Feedback: Combining Ranking Discovery and Blind Feedback for Robust Retrieval
• The ranking function plays an important role in IR performance
• Blind feedback (pseudo-relevance feedback) has been found very useful for ad hoc retrieval
• Why not combine ranking function optimization with blind feedback to improve robustness?
![Page 62: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/62.jpg)
62
Blind Feedback
• Automatically adds more terms to a user’s query to enhance the performance of search engines by assuming top ranked docs relevant
• Some examples
– Rocchio (performs better in our exp.)
– Dec-Hi
– Kullback-Leibler Divergence (KLD)
– Chi-Square
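A minimal sketch of Rocchio-style blind feedback follows. It uses the positive part of the Rocchio formula only (original query weights plus a weighted centroid of the top-ranked documents, which are assumed relevant) and raw term frequencies as weights. The parameter values, the toy documents, and the function name are illustrative assumptions, not the settings used in the experiments.

```python
from collections import Counter


def rocchio_expand(query, ranked_docs, top_k=2, alpha=1.0, beta=0.75, num_new_terms=1):
    """Expand a query using the top_k ranked documents as pseudo-relevant.

    new_weight(t) = alpha * q(t) + beta * mean tf of t over the top_k documents.
    Only the num_new_terms strongest new terms are added to the query.
    """
    q_vec = Counter(query)
    top = ranked_docs[:top_k]
    centroid = Counter()
    for doc in top:
        for term, tf in Counter(doc).items():
            centroid[term] += tf / len(top)
    weights = {t: alpha * w for t, w in q_vec.items()}
    for term, w in centroid.items():
        weights[term] = weights.get(term, 0.0) + beta * w
    # Rank candidate expansion terms by weight (ties broken alphabetically).
    candidates = sorted((t for t in weights if t not in q_vec),
                        key=lambda t: (-weights[t], t))
    keep = set(q_vec) | set(candidates[:num_new_terms])
    return {t: weights[t] for t in keep}


query = ["inverted", "index"]
ranked_docs = [                      # hypothetical initial ranking, best first
    ["inverted", "file", "index", "file"],
    ["index", "partition", "file"],
    ["music", "video"],
]
expanded = rocchio_expand(query, ranked_docs)
print(expanded)
```

In this toy run the term "file" is added because it is frequent in the two top-ranked documents, while terms from the lower-ranked third document are never considered.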
![Page 63: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/63.jpg)
63
RF Discovery Problem
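Ranked list (relevant documents first):
| Order | Doc. | Rele. |
|---|---|---|
| 1 | A | 1 |
| 2 | D | 1 |
| 3 | F | 1 |
| 4 | G | 1 |
| 5 | B | 0 |
| 6 | C | 0 |
| 7 | E | 0 |

Ranked list (relevant documents interleaved):
| Order | Doc. | Rele. |
|---|---|---|
| 1 | A | 1 |
| 2 | B | 0 |
| 3 | C | 0 |
| 4 | D | 1 |
| 5 | E | 0 |
| 6 | F | 1 |
| 7 | G | 1 |

Input: Feedback Training Data → Ranking Function Discovery → Output: Ranking Function f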
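To make the difference between the two rankings concrete, the sketch below scores each with average precision. Using average precision as the quality (fitness) measure is an assumption for illustration; the slide does not name the measure used during discovery.

```python
def average_precision(relevance_flags):
    """Mean of precision values taken at each rank where a relevant doc appears."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance_flags, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# (document, relevance) pairs in rank order, copied from the two lists above.
relevant_first = [("A", 1), ("D", 1), ("F", 1), ("G", 1), ("B", 0), ("C", 0), ("E", 0)]
interleaved = [("A", 1), ("B", 0), ("C", 0), ("D", 1), ("E", 0), ("F", 1), ("G", 1)]

ap_first = average_precision([rel for _, rel in relevant_first])
ap_interleaved = average_precision([rel for _, rel in interleaved])
print(ap_first, ap_interleaved)
```

The relevant-first ordering scores 1.0 and the interleaved ordering about 0.64, so a discovery procedure guided by such a measure would prefer ranking functions that produce the first ordering.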
![Page 64: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/64.jpg)
64
Ranking Function Optimization
• Ranking Function Tuning is an art! – Paul Kantor
• Why not adaptively discover RF by Genetic Programming?
– Huge search space
– Discrete objective function
– Modeling advantage
• What is GP?
– Problem solving systems designed based on principles of evolution and heredity. Widely used for structure discovery, functional form discovery, other data mining and optimization tasks
![Page 65: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/65.jpg)
65
An Example of GP-based RF
(log (+ (* df (log (log (* (* (/ n df) (* (* (/ n df) (* (* df_max_Col tf) (+ df_max_Col tf_avg))) (* (/ tf tf_max) (log tf_avg_Col)))) (* (/ (* (* (/ n df) (* (* df_max_Col tf) (+ df_max_Col tf_avg))) (* (/ tf tf_max) (log tf_avg_Col))) (+ (* length df) tf_avg_Col)) (log tf_avg_Col)))))) (+ (* (* df_max_Col tf) (/ (* (* (/ (/ (* tf 6.720) (/ df N)) (* df_max_Col tf)) (* (* tf N) (+ df_max_Col tf_avg))) (* (/ tf tf_max) (log tf_avg_Col))) (+ (* length df) (* (* (/ tf tf_max) (+ (* length df) (* 2.812 1))) tf_avg)))) (+ (/ df tf_avg) tf))))
tf Query term frequency in the document ( vector )
tf_query Query term frequency in the query ( vector )
tf_max The maximum term frequency in a document ( scalar )
Length Document length in the number of words ( scalar )
Length_avg Average document length in the number of words ( scalar )
N Number of documents in the collection ( scalar )
tf_avg Average term frequency in the current document (scalar)
tf_avg_Col Average term frequency for all the documents in the collection ( scalar )
df_max_Col Maximum document frequency for a word in the collection ( scalar )
df Document frequency for the query words ( vector )
![Page 66: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/66.jpg)
66
The ARRANGER Engine
1. Split the training data into training and validation sets
2. Generate an initial population of random “ranking functions”
3. Evaluate the fitness of each “ranking function” in the population and record the 10 best ones
4. If the stopping criterion is not met, generate the next generation of the population by genetic transformation and go to Step 3
5. Validate the recorded best “ranking functions” and select the best one as the RF
[Figure: example ranked list, given as order, document, relevance: 1 A 1; 2 B 0; 3 C 0; 4 D 1; 5 E 0; 6 F 1; 7 G 1]
[Figure: flowchart with the boxes Start, Initialize Population, Evaluate Fitness, Apply Crossover, Stop?, Validate and Output, End]
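The fitness evaluation in Step 3 needs a single number per candidate ranking function. As a sketch only, the code below uses average precision over a ranked list with binary relevance judgments, applied to the seven-document example (A to G) from this slide. The choice of average precision and the names `average_precision` and `candidates` are assumptions; the slides do not state which fitness measure was used.

```python
def average_precision(relevance):
    """relevance: list of 0/1 judgments in ranked order (rank 1 first).

    Assumption: every relevant document appears somewhere in the list,
    so we divide by the number of relevant documents found.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Ranked list from the slide: A=1, B=0, C=0, D=1, E=0, F=1, G=1.
slide_ranking = [1, 0, 0, 1, 0, 1, 1]
fitness = average_precision(slide_ranking)

# Hypothetical candidates: each ranking function orders the same documents.
candidates = {
    "rf_a": [1, 0, 0, 1, 0, 1, 1],
    "rf_b": [1, 1, 1, 1, 0, 0, 0],
}
best_first = sorted(candidates,
                    key=lambda name: average_precision(candidates[name]),
                    reverse=True)
```

Keeping the top few entries of `best_first` in each generation corresponds to recording the best ranking functions for later validation.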
![Page 67: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/67.jpg)
67
An Integrated Model
[Figure: diagram with the elements User Queries, Blind Feedback, Ranking Tuning, Multiple user queries with relevance information, New Ranking Function, and New Search Results]
![Page 68: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/68.jpg)
68
Outline
• Acknowledgements, Publications
• Introduction: Problem, Digital Libraries
• New Efforts: Personalization, Superimposed Info
• 5S, ETANA, Structure
• Hybrid Partitioned Inverted Indices
• Discovering Ranking Functions
• Text + CBIR + Metadata + GIS
• Meta-search, Union DLs
• LinkFusion, SimFusion
• Summary
![Page 69: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/69.jpg)
69
Text + CBIR + Metadata + GIS
• Combined retrieval across multiple types of information
• Ex.: bio-diversity information systems
• Architecture, approach, prototype, validation
• Novel aspects:
– Learn set of descriptors for a collection
– Application to fisheries, archaeology
![Page 70: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/70.jpg)
70
Textual information retrieval
Query on Google using Sunset and Rio de Janeiro
Query result
![Page 71: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/71.jpg)
71
Content-Based Information Retrieval
![Page 72: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/72.jpg)
72
Motivation for Integration
• Query 1:
– List all metadata information related to fishes which have been observed at the Mississippi River.
• Query 2:
– Retrieve fish images which contain a shape similar to this example.
• Query 3:
– List all metadata information related to fishes which both have been observed at the Mississippi River and contain a shape similar to a given example.
![Page 73: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/73.jpg)
73
Longer Integrated Query
• Retrieve descriptions of all fish whose shape is similar to that shown in the figure below, which belong to genus “Notropis”, which have “large eyes” and a “dorsal stripe”, and which have been observed within the catchments of the “Tennessee” river
![Page 74: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/74.jpg)
74
System’s Architecture
[Figure: architecture diagram with the labels Interface; Mediator (Data Insertion Module, Query Processing Module); GIS; DBMS; Databases (Geo. DB, Metadata, Image DB)]
![Page 75: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/75.jpg)
75
[Figure: system architecture diagram. Labels:
– Interface: Query Specification, Visualization
– Query Mediator: Analysis, Merging, Execution
– BIS Manager
– Content-Based Image Search Component (CBISC)
– Metadata-Based Search Component (ESSEX)
– Geographic Data Search Component (GDSC); Web Feature Server (WFS)
– Image Collection: Image Metadata, Image Descriptors, Images
– EcoCollection: Metadata, Taxonomic Trees; OAI
– GeoCollection: Metadata, Maps
– HTTP requests shown: ListDescriptors, GetImages, keywords, GetCapabilities, GetFeatureType, GetFeature]
![Page 76: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/76.jpg)
76
Feature Extraction Model
Feature Vector: [0.98, 0.91, 0.73, …]
[Figure labels: R, G, B]
![Page 77: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/77.jpg)
77
CBISC Architecture
![Page 78: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/78.jpg)
78
CBISC Configuration Tool
![Page 79: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/79.jpg)
79
Outline
• Acknowledgements, Publications
• Introduction: Problem, Digital Libraries
• New Efforts: Personalization, Superimposed Info
• 5S, ETANA, Structure
• Hybrid Partitioned Inverted Indices
• Discovering Ranking Functions
• Text + CBIR + Metadata + GIS
• Meta-search, Union DLs
• LinkFusion, SimFusion
• Summary
![Page 80: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/80.jpg)
80
Meta-Search
• Contexts:
– Web: search engine built atop others
– Federated search: bring together results from distributed partial content sites
• Approach
– Send query out to multiple sites
– Merge results from sites
– Combine those results for ranking as follows:
![Page 81: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/81.jpg)
81
Combination
• For a document, combine the sim values from each system involved:
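• CombMIN = minimum of the individual similarities
• CombMAX = maximum of the individual similarities
• CombSUM = sum of the individual similarities
• CombMNZ = CombSUM × number of systems with non-zero similarities
• CombMNZ is often best; otherwise CombSUM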
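The four combination rules can be sketched in a few lines. In this sketch each system returns a dictionary from document id to similarity, and a document that a system did not retrieve is treated as having similarity 0 for that system; that convention, the function names `fuse` and `rank`, and the sample scores are assumptions made for illustration.

```python
def fuse(scores_by_system, method="CombMNZ"):
    """scores_by_system: list of dicts mapping doc id -> similarity."""
    docs = set().union(*scores_by_system)
    fused = {}
    for doc in docs:
        # Assumption: a system that did not retrieve the doc contributes 0.
        sims = [scores.get(doc, 0.0) for scores in scores_by_system]
        nonzero = sum(1 for s in sims if s != 0)
        if method == "CombMIN":
            fused[doc] = min(sims)
        elif method == "CombMAX":
            fused[doc] = max(sims)
        elif method == "CombSUM":
            fused[doc] = sum(sims)
        elif method == "CombMNZ":
            fused[doc] = sum(sims) * nonzero
        else:
            raise ValueError(f"unknown method: {method}")
    return fused

def rank(fused):
    """Order doc ids by fused score, highest first; ties broken by id."""
    return sorted(fused, key=lambda d: (-fused[d], d))

# Hypothetical similarity scores from two search systems.
system_a = {"d1": 0.9, "d2": 0.4}
system_b = {"d1": 0.5, "d3": 0.8}
ranking = rank(fuse([system_a, system_b], "CombMNZ"))
```

CombMNZ rewards documents that several systems agree on, which is why `d1`, retrieved by both systems, moves well ahead of documents found by only one. In practice the similarity values from different systems usually need to be normalized to a common scale before they are combined.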
![Page 82: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/82.jpg)
82
DL Integration
• What is “DL Integration”?
– Hide distribution
– Hide heterogeneity
– Enable autonomy of individual components
• Why integration?
– Island DLs
– Inability to seamlessly and transparently access knowledge across DLs
Utilize various autonomous DLs in concert
![Page 83: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/83.jpg)
83
Introduction and Problem Description
[Figure: five individual digital libraries (DL1, DL2, DL3, DL4, DL5), shown separately and as members of a UnionDL]
![Page 84: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/84.jpg)
84
Related Work on Integrating Services in DLs
[Figure: concept map of related work. Systems from the 1980s, 1990s, and 2000s that integrate searching and browsing, and systems that integrate searching and browsing with other services, including clustering and visualization. Systems named: I3R, RABBIT, CODER, DataWeb, PESTO, SenseMaker, MIX, ScentTrails, BBQ, ODL, MARIAN, Stepping Stones & Pathways, EtanaViz, CitiViz]
![Page 85: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/85.jpg)
85
Reconceptualization of Related Work on Semantic Interoperability
[Figure: concept map of semantic interoperability in DLs. Labels: intermediary-based and mapping-based approaches; mediator, wrapper, ontology; federation, union archiving; schema mapping; proactive standardization; reactive interpretation; automatically generate. Example systems: CITIDEL, Dienst, Fedora, NDLTD, Infobus, …]
Key: Blue indicates focus of our work.
![Page 86: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/86.jpg)
86
Formal Definition of DL Integration
• $DL_i = (R_i, DM_i, Serv_i, Soc_i)$, $1 \le i \le n$
– $R_i$ is a network-accessible repository
– $DM_i$ is a set of metadata catalogs for all collections
– $Serv_i$ is a set of services
– $Soc_i$ is a society
• UnionRep
• UnionCat
• UnionServices
• UnionSociety
![Page 87: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/87.jpg)
87
Formal Definition of DL Integration (Cont.)
• DL integration problem definition:
Given n individual libraries, integrate the n DLs to create a UnionDL.
![Page 88: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/88.jpg)
88
Taxonomy of Union Services
Infrastructure Services Information Satisfaction Services
Essential Add_value Essential Add_value
indexing
harvesting
mapping
(Schema registry with analyses & mapping)
(data) cleaning
(focused) crawling
copying (replicating)
logging
(format) translating
(Service to support annotation)
(Metadata validation)
searching
browsing
access control
binding
comparison
(forum) discussion
(query) expansion
filtering
recommendation
visualization
Note: Suggested NSDL services are shown in blue.
![Page 89: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/89.jpg)
89
Union Catalog Integration
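[Figure: Union ArchDL catalog integration. Virtual Nimrin (VN) and Halif DigMaster (HD) each have their own metadata format (VN Metadata Format, HD Metadata Format) and catalog (VN Catalog, HD Catalog). For each, a Wrapper and a Mapping Tool connect the local format to the Global Metadata Format and the Union Catalog]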
![Page 90: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/90.jpg)
90
Data Mapping (state-of-the-art)
![Page 91: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/91.jpg)
91
local schema global schema
![Page 92: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/92.jpg)
92
Mapping recommendation
![Page 93: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/93.jpg)
93
Mapping confirmation
Mapping history
![Page 94: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/94.jpg)
94
Outline
• Acknowledgements, Publications
• Introduction: Problem, Digital Libraries
• New Efforts: Personalization, Superimposed Info
• 5S, ETANA, Structure
• Hybrid Partitioned Inverted Indices
• Discovering Ranking Functions
• Text + CBIR + Metadata + GIS
• Meta-search, Union DLs
• LinkFusion, SimFusion
• Summary
![Page 95: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/95.jpg)
95
Link Fusion
A Unified Link Analysis for Multi-Type Interrelated Data Objects
Wensi Xi¹, Benyu Zhang², Zheng Chen², Yizhou Lu³, Shuicheng Yan³, Wei-Ying Ma², Edward A. Fox¹
¹Virginia Tech, ²Microsoft Research Asia, ³Peking University
![Page 96: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/96.jpg)
96
Traditional way of representing relationships
• Space: Vector and probability spaces (e.g., Salton and Wang’s works)
• Database: Sets of attributes represent relationships, and are used to design databases (e.g., Fuhr and Frieder’s works).
• Networks: Belief, inference and spreading activation networks (e.g., Turtle, Ribeiro-Neto, and Acid’s works)
Problems:
• Not easily used to combine multiple types of data objects and relationships
• Need to find representations that are closer to reality and are more dynamic
![Page 97: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/97.jpg)
97
Example: Collaborative Filtering, Recommender System
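[Figure: User objects linked to each other by human relationships, Web page objects linked to each other by hyperlink/content similarity, and users linked to web pages by Browse]
• Inter-type relationship: Browse.
• Intra-type relationship: Hyperlink, Human relationships.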
![Page 98: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/98.jpg)
98
Example: Query Expansion, Web Clustering
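[Figure: Query objects linked to each other by content similarity, Web page objects linked to each other by hyperlink/content similarity, and queries linked to web pages by Reference]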
• Inter-type relationship: Reference.
• Intra-type relationship: Hyperlink, Content Similarity
![Page 99: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/99.jpg)
99
Attribute Reinforcement Assumption
The specific attribute of a data object in one data type can be reinforced both by the attributes of related data objects in the same data space and by the attributes of related data objects from other data spaces.
[Figure: data objects in data spaces, connected by intra-type relationships within a space and inter-type relationships across spaces]
![Page 100: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/100.jpg)
100
Link Fusion algorithm (1)
• Consider two types of objects, $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_n\}$. Their intra-type relationships are $R_x$ and $R_y$; their inter-type relationships are $R_{xy}$ and $R_{yx}$.
• Adjacency matrices $L_x$, $L_y$, $L_{xy}$, and $L_{yx}$ represent the relationships $R_x$, $R_y$, $R_{xy}$, and $R_{yx}$, respectively.
• Suppose $w_x$ is the attribute vector of objects in X, and $w_y$ is the attribute vector of objects in Y.
• $w_x$ is reinforced by both the intra- and inter-type relationships from X and Y, and so is $w_y$. With weights α and β, the Link Fusion algorithm can be represented as:
$$w_y = \alpha L_y^T w_y + \beta L_{xy}^T w_x$$
$$w_x = \alpha L_x^T w_x + \beta L_{yx}^T w_y$$
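The mutual reinforcement of the two attribute vectors can be sketched as a fixed-point iteration: each vector is repeatedly recomputed from its own intra-type links and from the other type's inter-type links. The weights `alpha` and `beta`, the normalization to unit sum after each step, the uniform starting vectors, and the tiny example graphs are all assumptions made for this sketch.

```python
def transpose_times(matrix, vector):
    """Return matrix^T applied to vector, for a list-of-rows matrix."""
    cols = len(matrix[0])
    return [sum(matrix[i][j] * vector[i] for i in range(len(matrix)))
            for j in range(cols)]

def normalize(vector):
    """Scale a vector to sum to 1 (assumed normalization; keeps values bounded)."""
    total = sum(vector)
    return [v / total for v in vector] if total else vector

def link_fusion(Lx, Ly, Lxy, Lyx, alpha=0.5, beta=0.5, iterations=50):
    """Lx: m x m, Ly: n x n, Lxy: m x n, Lyx: n x m adjacency matrices."""
    wx = [1.0 / len(Lx)] * len(Lx)
    wy = [1.0 / len(Ly)] * len(Ly)
    for _ in range(iterations):
        # w_x = alpha * Lx^T w_x + beta * Lyx^T w_y
        new_wx = [alpha * a + beta * b for a, b in
                  zip(transpose_times(Lx, wx), transpose_times(Lyx, wy))]
        # w_y = alpha * Ly^T w_y + beta * Lxy^T w_x
        new_wy = [alpha * a + beta * b for a, b in
                  zip(transpose_times(Ly, wy), transpose_times(Lxy, wx))]
        wx, wy = normalize(new_wx), normalize(new_wy)
    return wx, wy

# Hypothetical data: 2 objects of type X, 3 objects of type Y.
Lx = [[0, 1], [1, 0]]                   # x1 <-> x2
Ly = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]  # y1 -> y2 -> y3 -> y1
Lxy = [[1, 0, 0], [1, 0, 0]]            # both x objects point to y1
Lyx = [[1, 0], [0, 1], [1, 0]]          # y1, y3 -> x1; y2 -> x2
wx, wy = link_fusion(Lx, Ly, Lxy, Lyx)
```

In this example `y1` receives links from both objects in X, so it ends up with the largest attribute value in Y.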
![Page 101: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/101.jpg)
101
Link Fusion algorithm (2)
• In a more generalized scenario, suppose there are n data types. The importance attribute of one type of object can be reinforced by both inter- and intra-type links as:
$$w_M = \alpha_M L_M^T w_M + \sum_{N \neq M} \beta_{NM} L_{NM}^T w_N$$
• This can also be written in matrix form as $w = A^T w$, where α and β are weights for different attributes:
$$A = \begin{bmatrix} \alpha_1 L_1' & \beta_{12} L_{12}' & \cdots & \beta_{1n} L_{1n}' \\ \beta_{21} L_{21}' & \alpha_2 L_2' & \cdots & \beta_{2n} L_{2n}' \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{n1} L_{n1}' & \beta_{n2} L_{n2}' & \cdots & \alpha_n L_n' \end{bmatrix}$$
• Iterative calculation yields the principal eigenvector of A, which can be interpreted as the value of the data objects with respect to a specific attribute.
![Page 102: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/102.jpg)
102
Link Fusion algorithm (3)
• If we consider web pages as a homogeneous data space, connected via intra-type relationships (hyperlinks), Link Fusion reduces to the PageRank algorithm.
• If we consider the Hub and Authority attributes of web pages as two different types of objects, connected via inter-type relationships (hyperlinks), Link Fusion reduces to the HITS algorithm.
• Thus, Link Fusion can be considered an extension of traditional link analysis algorithms.
![Page 103: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/103.jpg)
103
SimFusion: Measuring Similarity Using the Unified Relationship Matrix
Wensi Xi¹, Edward Fox¹, Weiguo Fan¹, Benyu Zhang², Zheng Chen², Jun Yan³, Dong Zhuang⁴
¹Department of Computer Science, Virginia Tech
²Microsoft Research Asia
³School of Mathematical Science, Peking University
⁴Department of Computer Science, Beijing Institute of Technology
SIGIR 2005, Salvador, Brazil, August 15-19, 2005
![Page 104: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/104.jpg)
104
Motivation
• To achieve desired improvements with advanced information systems, we need to combine and integrate information from a variety of sources.
• Entities from different domains can be considered as objects containing information:
– Web pages or scientific papers
– Queries
– Users
• Information contained by objects may include:
– Contents: papers, web pages
– Attributes: popularity, authority
– Relationships: reference, hyperlink, similarity
![Page 105: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/105.jpg)
105
Research Statement
Problem:
“How can the broad variety of heterogeneous data and relationships be effectively and efficiently integrated to improve the performance of various information retrieval related tasks?”
Solution:
Use matrices to represent multi-relationships, and use matrix calculations to integrate them (so as to improve searching and clustering).
![Page 106: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/106.jpg)
106
Example
[Figure: sparse relationship matrices over Users (U), Queries (Q), and Documents (D), together with primed counterparts U', Q', and D', illustrating collaborative filtering, log-based clustering, cluster-based retrieval, and user modeling]
![Page 107: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/107.jpg)
107
Unified Relationship Matrix (URM)
• Consider two types of objects, X={x1…xm} and Y={y1…yn}. Their intra-type relationships are Rx and Ry, while the inter-type relationships are Rxy and Ryx.
• Adjacency matrices Lx, Ly, Lxy, and Lyx represent the relationships of Rx, Ry, Rxy, and Ryx, respectively.
• All the relationships can be represented in a single unified matrix:
$$L_{urm} = \begin{bmatrix} L_x & L_{xy} \\ L_{yx} & L_y \end{bmatrix}$$
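Assembling the unified matrix for two object types is a matter of placing the four adjacency matrices side by side as blocks: the intra-type matrices on the diagonal and the inter-type matrices off the diagonal. A minimal sketch follows; the function name `build_urm` and the toy matrices are illustrative assumptions.

```python
def build_urm(Lx, Lxy, Lyx, Ly):
    """Stack four list-of-rows matrices into [[Lx, Lxy], [Lyx, Ly]].

    Lx is m x m, Lxy is m x n, Lyx is n x m, Ly is n x n.
    """
    top = [row_x + row_xy for row_x, row_xy in zip(Lx, Lxy)]
    bottom = [row_yx + row_y for row_yx, row_y in zip(Lyx, Ly)]
    return top + bottom

# Hypothetical example: two objects of type X, one object of type Y.
Lx = [[0, 1],
      [1, 0]]   # x1 and x2 are related to each other
Lxy = [[1],
       [0]]     # x1 is related to y1
Lyx = [[1, 0]]  # y1 is related to x1
Ly = [[0]]      # no intra-type relationship in Y

L_urm = build_urm(Lx, Lxy, Lyx, Ly)
```

The result is a single (m + n) x (m + n) matrix in which every object, regardless of type, has one row and one column.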
![Page 108: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/108.jpg)
108
Unified Relationship Matrix (2)
• In a more generalized scenario, suppose there are N data types, and objects from different data spaces are connected by intra- and inter-type relationships.
• All the relationships can be represented in a Unified Relationship Matrix:
$$L_{urm} = \begin{bmatrix} L_1 & L_{12} & \cdots & L_{1N} \\ L_{21} & L_2 & \cdots & L_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ L_{N1} & L_{N2} & \cdots & L_N \end{bmatrix}$$
• Diagonal sub-matrices are for intra-relationships; others are for inter-relationships.
![Page 109: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/109.jpg)
109
Unified Relationship Matrix (3)
• Provides a general way of viewing data objects and relationships
• Data objects from different spaces are now all in the “unified” space.
• Previous inter- and intra-type relationships are now all intra-type relationships in the “unified” space.
![Page 110: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/110.jpg)
110
Similarity Reinforcement Assumption
The similarity of two data objects in one data type can be reinforced by the similarity values of other data objects to which they are related.
[Figure: User Space, Query Space, and Document Space, connected by reference, select, and reading relationships]
![Page 111: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/111.jpg)
111
SimFusion Algorithm
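• Suppose there are N data spaces; objects in these spaces are connected by inter- and intra-type relationships as in the URM below:
$$L_{urm} = \begin{bmatrix} \lambda_{11} L_1 & \lambda_{12} L_{12} & \cdots & \lambda_{1N} L_{1N} \\ \lambda_{21} L_{21} & \lambda_{22} L_2 & \cdots & \lambda_{2N} L_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_{N1} L_{N1} & \lambda_{N2} L_{N2} & \cdots & \lambda_{NN} L_N \end{bmatrix}$$
• A Unified Similarity Matrix is built to represent the pair-wise similarities of data objects:
$$S_{usm} = \begin{bmatrix} 1 & s_{12} & \cdots & s_{1T} \\ s_{21} & 1 & \cdots & s_{2T} \\ \vdots & \vdots & \ddots & \vdots \\ s_{T1} & s_{T2} & \cdots & 1 \end{bmatrix}$$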
![Page 112: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/112.jpg)
112
SimFusion Algorithm (2)
• The similarity reinforcement assumption can be represented as:
$$S_{usm}^{new} = L_{urm} \, S_{usm}^{original} \, L_{urm}^T$$
• Such reinforcement calculation can be continued as:
$$S_{usm}^{n} = L_{urm} \, S_{usm}^{n-1} \, L_{urm}^T = L_{urm}^{n} \, S_{usm}^{0} \, (L_{urm}^{n})^T$$
• The calculation has been proven to converge, and is named the SimFusion algorithm.
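The reinforcement step, multiplying the similarity matrix by the URM on the left and by its transpose on the right, can be sketched in plain Python as below. Starting from the identity matrix, using a row-normalized URM, and the 3 x 3 toy values are assumptions for illustration; the slides do not specify the initialization or any stopping rule.

```python
def matmul(A, B):
    """Multiply two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def simfusion(L_urm, S0, iterations=3):
    """Apply S <- L_urm * S * L_urm^T the given number of times."""
    S = S0
    Lt = transpose(L_urm)
    for _ in range(iterations):
        S = matmul(matmul(L_urm, S), Lt)
    return S

# Hypothetical row-normalized URM over three objects.
L_urm = [[0.5, 0.5, 0.0],
         [0.25, 0.5, 0.25],
         [0.0, 0.5, 0.5]]
S0 = identity(3)  # assumed start: each object is similar only to itself
S2 = simfusion(L_urm, S0, iterations=2)
```

Two iterations give the same matrix as forming the square of the URM once and applying it on both sides of `S0`, which is the closed form stated on the slide.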
![Page 113: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/113.jpg)
113
Real World Examples
• Consider the space of scientific papers. Then with reference relationships (intra-type), SimFusion reduces to co-citation or bibliographic coupling algorithms.
• Consider the document-term “contain” relationship and build a URM as below:
$$L_{urm} = \begin{bmatrix} 0 & L_{dt} \\ L_{dt}^T & 0 \end{bmatrix}$$
Here the USM is the identity matrix, and SimFusion reduces to VSM document similarity calculation.
• Others… (Raghavan, Beeferman)
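To see why the document-term case gives vector space model (VSM) similarities, one reinforcement step can be written out under the stated assumption that the starting similarity matrix is the identity $I$. With the URM built from the document-term matrix $L_{dt}$ and zero diagonal blocks, the step is:

$$S_{usm}^{1} = L_{urm} \, I \, L_{urm}^T = \begin{bmatrix} 0 & L_{dt} \\ L_{dt}^T & 0 \end{bmatrix} \begin{bmatrix} 0 & L_{dt} \\ L_{dt}^T & 0 \end{bmatrix} = \begin{bmatrix} L_{dt} L_{dt}^T & 0 \\ 0 & L_{dt}^T L_{dt} \end{bmatrix}$$

The upper-left block $L_{dt} L_{dt}^T$ holds the inner product of every pair of document term vectors, which is the unnormalized VSM document similarity; the lower-right block gives the corresponding term-term values.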
![Page 114: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/114.jpg)
114
Summary
• Acknowledgements, Publications
• Introduction: Problem, Digital Libraries
• New Efforts: Personalization, Superimposed Info
• 5S, ETANA, Structure
• Hybrid Partitioned Inverted Indices
• Discovering Ranking Functions
• Text + CBIR + Metadata + GIS
• Meta-search, Union DLs
• LinkFusion, SimFusion
• Summary
![Page 115: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/115.jpg)
115
Selected Links
• http://fox.cs.vt.edu
• CITIDEL (computing education resources)
– www.citidel.org
• NDLTD (electronic theses and dissertations worldwide)
– www.ndltd.org and etdguide.org
• Virginia Tech Digital Library Research Laboratory
– DLRL, www.dlib.vt.edu
![Page 116: Outline](https://reader036.vdocuments.net/reader036/viewer/2022062407/56812fa9550346895d952a7f/html5/thumbnails/116.jpg)
116
Questions? Discussion?
Thank You!