search engine technology for digital libraries state of the art and future 7th international...

Search Engine Technology for Digital Libraries

State of the Art and Future

7th International Bielefeld Conference

Jürgen Oesterle

[email protected]

Fast Search & Transfer Deutschland GmbH

Most prominent problems with digital libraries

„Multiple content sources“ problem

• In a typical digital library, you have to provide a combined search on many different collections at a time

• The format of the content varies between these collections

• The availability of structure varies between these collections

• The availability of external reference data varies between these collections

• The availability of meta data varies between these collections

• The kind of content might vary between these collections

On these grounds, it‘s extremely difficult to provide equal ranking among the documents in a results set, coming from different content sources.


„Meta data“ problem

• You can‘t tell how the meta data was generated (Author? Editor? Automatically assigned?)

• You can‘t tell in advance what meta data is available (Title, author, keywords, date, publisher, place, etc.)

• You don‘t know the original purpose of the meta data (Quick summary for reader? Condensed description? Normalization of content for search?)

• You can‘t assume uniform availability and quality of meta data even on one collection


„Distributed documents“ problem

• Documents are often really hypertext, i.e. their parts are distributed over a site, with links between them

„Multiple languages“ problem

• Documents are often in many different languages

„Availability of classification schemas“ problem

• If classification is of interest (and of help while searching), the underlying classification taxonomies are not standardized across collections


„Inaccurate queries“ problem

• Users typically lack domain specific knowledge

• Users don‘t have proper terminology to hand

• Users don‘t include all potential synonyms and variations in the query

• Users have a problem but aren‘t sure how to phrase it (i.e. how the same problem is phrased in the documents)

On these grounds, it‘s extremely difficult to provide a perfectly relevant result set as first response. Intelligent suggestions for refinement or expansion are needed.

Technologies are underway to solve the problems..

Meta data extraction

• Automatic extraction of keywords

• Structural analysis

• Normalization of existing meta data

• Use external reference data & citation analysis

Part of speech tagging & normalization

Extraction of specific syntactic patterns

Statistic analysis of the extracted patterns

Suffering from chronical rhinitis, the patient was treated......

Vpart Prep Adj N Det N Vcop Vpart

Vpart Prep Adj N Det N Vcop Vpart

P(„chronical rhinitis“)

P(„chronical “) * P(„rhinits“)log

Identification of new terminologychronical rhinitis







Investigations in E. coli

B. C. AbracadabraDepartment of Molecular MedicineUniversity of Wisconsin

S. MiheevAnalytical LaboratoryRussian Academy of ScienecesMoscow

Journal of Cancer Research

Issue 5, 2003 -12

Abstract

1. Introduction

2. Materials and Methods

In this study we investigate………


B. C. AbracadabraDepartment of Molecular MedicineUniversity of Wisconsin

S. MiheevAnalytical LaboratoryRussian Academy of ScienecesMoscow


Issue 5, 2003 -12

Abstract

1. Introduction

2. Materials and Methods

In this study we investigate………

Journal title

Article title

Affiliation

1. Analyse structure

2. Determine text block features

3. Classify text blocks

4. Apply structure grammar







Artikel 1

Artikel 6

Artikel 7

Artikel 8

Artikel 2

Artikel 5Artikel 4

Artikel 3

Citation graph

Infer relative importance of Article 5

and

Use textual context of citation to obtain good descriptors of it







Artikel 1

Artikel 6

Artikel 7

Artikel 8

Artikel 2

Artikel 5Artikel 4

Artikel 3

Citation graph

Infer relatedness of Article 8 and Article 7 because they are cited by the same articles


Equal ranking

• Test runs with representative queries

• Check typical ranking position per content source

• Assign static rank boosts per content source, based on results

Retrieval Engine

Content source A

Content source E

Content source D

Content source BContent source C

• only abstracts• rich meta data• no external references

• few meta data• indexed in citation index• full text articles

• web data• hard to crawl, distributed documents• unreliable meta data• web anchor text as external reference

• full text documents• PDF, DOC conversion problems

• full text documents• indexed in citation index• rich meta data

High boost

High boost

Medium boost

Medium boost

Low boost


Proper treatment of queries

• Deal with orthographic variation• Deal with morphological variation• Deal with vocabulary variation• Deal with special-interest queries (e.g. restrict on user homepages, find definitions, narrow down on articles)

Cerebral infarct

Cerebral infarctsApoplexyApoplectic insultStroke

“Cerebral infarct”Cerebral infarktSerebral infarctCetebral ingarct

Cerebral diseaseInfarction

Cerebral infarct / medicineCerebral infarct / biology

Cerebral infarct / conferences

Infarctus cérébral

Phrasing Doc typeclassification

Spellchecking

Synonymy Thesaurussupport

RefinementCharacternormalization

Lemmatization

Topic classification

Ambiguequeries



B. C. Abracadabra[author info]

S. Miheev[author info]


Issue 5, 2003 -12

Journal contents

Current Issue

This issue

Personal Profile

While crawling for documents

Abstract

Chapter 1

Chapter 2

Chapter 3

Chapter 4

Introduction

recognize links that point to „other parts of the

document“

Abstract

Introduction

Chapter 1

Chapter 2

Chapter 3

Chapter 4follow these links and put together a complete document.

Smart data aggregation


CrawlingDocument processing

INDEX

Results

Query processing

Result processing

Query

Doc

Smart data aggregation(e.g. restoring distributed documents)

Advanced linguistic processing(e.g. terminology extraction, classification, structural analysis)

Proper treatment of queries(e.g. covering morphol.+semant. variation)

Queryrefinement suggestions

(e.g. covering morphol.+semant. variation)

Citation index

Scirus

Evolution of Digital Libraries

Traditional DL

Full text search engine

Next generation DL

• data base• pure predefined meta data• exact match• data is

• heterogenuous• not normalized• incomplete• unreliable

• inverted index• full text• exact match• data is

• heterogenuous• not normalized• redundant• unreliable

• inverted index + linguistics + smart data aggregation• extracted information• fuzzy search• data is

• homogenuous• auto-normalized• auto-completed• reliable