tovek presentation by livio costantini

CE v5.8 12/04/23

Livio Costantini

Tovek’s Tools – Software to Access Unstructured Information

Auhofstrasse 25/2

1130 Wien

E-mail: [email protected]

Tel. 0043-1-8794274

Mobile: 0043-664-9919154

AGENDA

• Basic notions of Information Retrieval (IR)

• Verity Query Language – Topic tree

• Tovek Tools (Enterprise Search Engine & Analytical System)

Distinctions between Data Retrieval and Text Retrieval (1/2)

• The type of query
– Data Retrieval: direct and precise ("I want to know X"); the correct answer is there, and you know it.
– Text Retrieval: indirect and ambiguous ("I want to know about X"); a "correct" answer to your question may not even exist.

• Query representation
– Data Retrieval: the formal search query and the user's information need are closely mapped; the relation is deterministic.
– Text Retrieval: the relation between a formal query and the representation of an adequate answer is probabilistic.

• Criterion for success
– Data Retrieval: correctness. Data retrieval systems (DBMS) should retrieve the correct answers.
– Text Retrieval: utility. As there are no or few "correct" answers, text retrieval systems ideally retrieve the most useful documents.

• Representing data or information
– Data Retrieval: the ways of representing data are finite; there aren't too many variants for the term "ZIP code."
– Text Retrieval: the ways of representing documents are virtually unlimited, as language is ambiguous (the effect of semantic indeterminacy).

Distinctions between Data Retrieval and Text Retrieval (2/2)

• A query's target area
– Data Retrieval: because there aren't many ways of representing data (a unit of information), the number of possible alternative queries for data is small, and the target area is also small.
– Text Retrieval: many ways of representing documents mean many more possible queries for a given document. The semantic target area is large, and in a large collection the number of documents retrieved can overwhelm the user.

• Zero or no useful results
– Data Retrieval: means that the data really doesn't exist in the database.
– Text Retrieval: a negative search result does not necessarily mean that there are no useful documents in the database (the problem of the end-point of searching).

• Types of searches
– Data Retrieval: just one to support, exact matching.
– Text Retrieval: at least three types to support: sample ("give me a few documents about X"), exhaustive ("give me everything about X"), and existence ("are there any documents about X at all?").

• Delegation of searching
– Data Retrieval: fairly easy to do; queries are straightforward and not too dependent on context.
– Text Retrieval: open to interpretation; it's difficult to know exactly what the query was intended to retrieve.

The Data Retrieval and Document Retrieval Models

The most prominent differences all arise from a more fundamental problem: the indeterminacy of representation.

The indeterminacy of representation results from the effects of semantic ambiguity and system (“corpus”) size.

These differences influence the design, use and management of the two kinds of system.

Semantic ambiguity is a measure of the number of different senses a word or phrase has.

System (corpus) size is the number of times that a given word or phrase is used to represent an item of information.

Generation of Text Retrieval Technology – Intellectual Text Processing

Performed intellectually by subject-area experts. No full-text index search, only keywords.

Definition and classification

• The process of assigning text identifiers (keywords and/or metadata) to the information items (documents).

• Metadata: data about the data, such as Title, Author Name, Publication Date, etc.

• Keywords: the human experts use a controlled indexing vocabulary and a thesaurus.

A thesaurus designed for indexing is:

– a list of every important term (single-word or multi-word) in a given domain of knowledge; and

– a set of related terms for each term in the list.

Key Benefits

• Helps to bridge the semantic gap

• Increases search efficiency and effectiveness

• Provides additional information to the users

• Avoids the need to perform more complex filtering operations

Criticisms

• Is too expensive and time-consuming

• Is subjective and depends on context

• Is too complicated

• Has no clear and precise rules

Generation of Text Retrieval Technology – Automatic Text Processing

Distinctions

• Boolean Retrieval Model (e.g. STAIRS, IBM)
– Uses the operators AND, OR, NOT and proximity operators. The rank order of retrieved documents is arbitrary; no relevance is assigned to each document retrieved.

• Natural Language Processing
– Based on syntactic and morphological analysis, usually supported by a controlled dictionary. Automatic semantic network representation and free-text queries.

• Probabilistic Approach
– Probabilistic models treat the process of document retrieval as a multistage random experiment. Similarities are thus represented as probabilities. Relevance is usually calculated by examining how many times a query term appears in a document, compensated by the frequency of the query term in the collection (term frequency–inverse document frequency, tf–idf).

• Concept Retrieval
– A search technology that makes it possible to search for subjects or concepts rather than individual words or phrases in documents. Retrieved documents are ranked by relevance. Usually the user is responsible for specifying the concept definition.

Full Text Index: a data structure that stores a list of occurrences and positions of each atomic search criterion (word), typically in the form of a hash table or binary tree, allowing full-text search.
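A minimal sketch of such a full-text index, assuming whitespace tokenisation and a Python dict as the hash table. The function name `build_index` and the sample documents are illustrative only, not part of any Tovek or Verity product.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lower-cased, whitespace-separated tokens
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

# Hypothetical two-document collection
docs = {
    "d1": "nuclear research and export control",
    "d2": "export of nuclear material",
}
index = build_index(docs)
print(index["nuclear"])  # occurrences and positions of one search term
```

Storing positions, not just document ids, is what later makes proximity searches possible.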

What is the goal of text retrieval?

A “good” text retrieval system is able to:

Capabilities

• Extract meaningful, useful information

while

• Withholding non-relevant information

Determining relevance

Relevance is subjective in nature; it may be determined by:

• The user who posed the retrieval problem

– Realistic but based on many personal factors

• An external judge

Measuring Retrieval Effectiveness - Precision & Recall

• Recall

– the ratio of the number of relevant-useful documents retrieved to the total number of relevant documents in the database

– Measures how well the search engine retrieves all of the relevant-useful documents

• Precision

– the ratio of the number of relevant-useful documents retrieved to the total number of documents retrieved (relevant and irrelevant)

– Measures how well the search engine retrieves only the relevant-useful documents
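The two ratios above can be sketched as follows. The function name and the document ids are made up for illustration; relevance judgements are assumed to be given as a set.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents retrieved, 3 relevant documents in the database
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).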

Results Analysis of Precision and Recall of a Query

A document is considered relevant if it is judged useful by the user who originated the query.

Result Analysis and Explanation

• Low precision and low recall
– Indicates that the search engine has retrieved many irrelevant documents and has missed many important results.

• High precision and low recall
– Indicates that the system was selective and has retrieved a good number of relevant documents but missed some important results.

• High recall and low precision
– Indicates that the system has retrieved a good number of relevant results but has also retrieved many irrelevant results in the process.

• High precision and high recall
– Indicates good retrieval performance. Providing access to all and only those documents which are relevant (high precision and high recall) is the criterion for the most efficient search engine.

AGENDA

• Basic notions of Information Retrieval

• Verity Query Language & Topic tree


The Problem

80% of information is held in unstructured textual documents. Imagine it as an iceberg!

Using standard tools and basic search engines (or not using them at all), you can find only the proverbial tip of the iceberg.

Even if you can see the whole iceberg, you can become frustrated by the inability to see what is inside.

Verity Query Language (VQL)

Operators for full-text searching

• Evidence
– Understanding: an evidence operator specifies either a basic word search or an expanded word list based on the original search word.
– Operators: Word; Stem; Thesaurus; Wildcard; Soundex; Typo

• Proximity
– Understanding: a proximity search looks for documents where two or more separately matching term occurrences are within a specified distance, where distance is the number of intermediate words or characters.
– Operators: Phrase; Sentence; Paragraph; Near/n; Order

• Relational
– Understanding: searches the document fields (metadata) defined in the collection (such as Title, Author, Published Date, etc.) as a filtering function. Numeric or textual searches are accepted depending on the format of the fields.
– Operators: Equal =; Greater than or equal >=; Less than or equal <=; etc. Contains; Ends; etc.

• Concept
– Understanding: combines the meaning of search elements to identify a concept in a document. Documents retrieved are ranked by relevance. Weights between 0.1 and 1.0 are assigned to each keyword or phrase based on its relative importance in meeting the search objective.
– Operators: Accrue; And; Or; All; Any

Evidence Operators

Expand the keyword into a list of related words

• <Stem>
– Selects documents that include one or more variations of the search word you specify, e.g.: <STEM>export. Note: by default, words and phrases are stemmed.

• <Word>
– Selects documents that include one or more instances of only the word you specify, without locating stemmed variations, e.g.: <WORD>export finds documents that contain the word “export” but not “exporting”, “exported”, etc.

• <Case>
– Performs a case-sensitive search based on the case of the word or phrase specified, e.g.: <CASE>EMIS finds EMIS (acronym for electromagnetic isotope separation) and not emis (the past participle of the French verb émettre).

• Question mark ?
– Specifies one of any alphanumeric character, as in organi?ation, which locates organisation and organization.

• Asterisk *
– Specifies zero or more of any alphanumeric character, as in test*, which locates not only test and tests but also testimony, testosterone, etc.
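The behaviour of the two wildcard characters can be imitated with a regular expression. This is only a sketch of the matching rule described above, not Verity's implementation; the helper names are hypothetical and case-insensitive matching is an assumption.

```python
import re

def wildcard_to_regex(pattern):
    """Translate ? (one alphanumeric) and * (zero or more) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("[A-Za-z0-9]")
        elif ch == "*":
            parts.append("[A-Za-z0-9]*")
        else:
            parts.append(re.escape(ch))
    # Assumption: matching ignores case
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern, word):
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None

print(matches("organi?ation", "organisation"))
print(matches("test*", "testimony"))
```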

Proximity Operators

Specify the relative location of specific words

• <Phrase>
– Selects documents that include a phrase you specify. A phrase is a grouping of two or more words that occur next to each other, e.g.: <Phrase>(export, control) or “export control”

• <Sentence>
– Selects documents that include all the words you specify in the same sentence, e.g.: nuclear <Sentence> research

• <Paragraph>
– Selects documents that include all the words you specify in the same paragraph, e.g.: nuclear <Paragraph> proliferation

• <Near/N>
– Selects documents containing all specified search terms within N words of each other, where N is an integer, e.g.: nuclear <NEAR/5> weapon

• <Order>
– Specifies that search elements must occur in the same order as in the query statement. It is always placed in front of an operator, e.g.: ballistic <ORDER><NEAR/5> missile
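A rough sketch of a NEAR/N test with an optional ORDER constraint. Distance is taken here as the difference between word positions, which is an assumption; the function `near` is illustrative and does not reproduce the engine's exact counting.

```python
def near(words, a, b, n, ordered=False):
    """True if a and b occur within n word positions of each other."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    for i in pos_a:
        for j in pos_b:
            distance = j - i
            if ordered and distance <= 0:
                continue  # with ORDER, a must come before b
            if 0 < abs(distance) <= n:
                return True
    return False

text = "the ballistic and cruise missile programme".split()
print(near(text, "ballistic", "missile", 5))                 # within 5 words
print(near(text, "missile", "ballistic", 5, ordered=True))   # wrong order
```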

Concept Operators

Combine the meaning of search elements (words) to find a concept

• <Accrue>
– Selects documents that include at least one of the search elements you specify. The more search elements that are present, the higher the score will be, e.g.: plutonium <ACCRUE> Pu, or simply plutonium, Pu. Documents with both terms are listed first.

• <And>
– Selects documents that include all the search elements you specify. Documents are relevance-ranked, e.g.: Germany <AND> hot cells

• <OR>
– Selects documents that include at least one of the search elements you specify, e.g.: electromagnetic isotope separation <OR> EMIS <OR> calutron

• <NOT>
– The <NOT> modifier followed by a word or phrase excludes documents which contain that word or phrase, e.g.: missile <AND> <NOT> short range

Note: AND, OR and NOT are treated as operators by default and do not require brackets. To use them as literal words, enclose them in double quotes. All other operators must be enclosed in brackets.

Relational Operators

Search in the metadata (such as Title, Date, etc.) defined in the collection

• Title
– Selects documents that include in the Title the search elements you specify.

• Date
– Numeric or textual searches are accepted depending on the format of the fields: Equal =; Greater than or equal >=; Less than or equal <=; etc. Contains; Ends; etc.

Sort option: the resulting documents can be sorted by score, date or title, in ascending or descending order.
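Field filtering and sorting can be pictured with plain Python records. The field names, dates and scores below are invented for illustration; a real collection defines its own fields.

```python
from datetime import date

# Hypothetical result records with metadata fields
docs = [
    {"title": "Export control report", "date": date(2004, 5, 1), "score": 0.82},
    {"title": "Nuclear research review", "date": date(2005, 2, 10), "score": 0.91},
    {"title": "Export licence summary", "date": date(2005, 7, 3), "score": 0.40},
]

def title_contains(doc, term):
    """Textual test on the Title field (case-insensitive by assumption)."""
    return term.lower() in doc["title"].lower()

# Title contains "export" AND Date >= 1 January 2005
hits = [d for d in docs
        if title_contains(d, "export") and d["date"] >= date(2005, 1, 1)]

# Sort option: by score, descending
by_score = sorted(docs, key=lambda d: d["score"], reverse=True)
print([d["title"] for d in hits])
```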

Concept Retrieval - Fuzzy Logic Approach

Characteristic

• The process of searching for subjects or concepts rather than individual words or phrases.

• In building up a concept (Topic tree), an expert familiar with the subject of the search assigns weights to search terms.

Advantages

• Topic trees provide a convenient means of encapsulating the knowledge of an expert in a hierarchical structure.

• Topic trees have the ability to understand the context of the text and retrieve every document related to a ‘topic’ of interest.

• Retrieval results are presented in relevance-ranked order, giving users access to the most important information.

• Topic trees are available to end users as a shared resource.

Design a Topic Tree - Knowledge Elicitation Process

Extracting knowledge from subject area experts

Subject Area Expert Knowledge Engineer

The Knowledge Engineer extracts and organizes the knowledge of the Subject-Area Expert and expresses it in a hierarchic format which can be used in a “Topic Tree” environment.

Topic Tree – An Introduction

• Topic trees mitigate semantic ambiguity

• Topic trees are predefined queries in tree-like form that can be used for Searching, Data Mining and Taxonomy Classification

• A Topic Tree includes a definition of the relationship between keywords and provides rules for evaluating and scoring documents

• Topic Tree Properties

– Structure: establishes the relationships between nodes

– Weights: define the relative importance of words and nodes

– Operators: interpret rules for the search engine

The Importance of Topic Trees

• Topic Trees are corporate intellectual property, or business rules, to be reused by employees

• Topic Trees are available to end users as a shared resource

• Topic Trees provide a convenient means of encapsulating the expert’s knowledge in a hierarchical structure

• Topic Trees include all the components of the Verity Query Language (Conceptual and Proximity Operators, Modifiers and Weights)

• Topic Trees have the ability to understand the context of a text and retrieve documents related to a “topic” of interest

Topic tree - Accrue Operator

The Accrue operator takes a “the more the better” approach when assigned to a topic or to a search: the more of the children specified under a topic using the Accrue operator are found in a document, the more closely the document is considered to be related to your search. Documents which contain the largest number of highly weighted children are ranked highest in the result list.
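The “more the better” idea can be illustrated with a small scoring function. The exact Verity formula is not given in this presentation, so the sketch assumes a probabilistic sum, which is at least as large as the best child and grows with every additional child found. The name `accrue_score` and the weights are hypothetical.

```python
def accrue_score(text, weighted_terms):
    """Combine the weights of the terms found in the text (assumed formula)."""
    words = set(text.lower().split())
    remaining = 1.0
    for term, weight in weighted_terms.items():
        if term in words:
            remaining *= 1.0 - weight   # each hit reduces the "missing" part
    return 1.0 - remaining

# Hypothetical weights assigned by the topic designer
terms = {"plutonium": 0.8, "pu": 0.5}
both = accrue_score("plutonium pu separation", terms)
one = accrue_score("plutonium separation", terms)
none = accrue_score("uranium enrichment", terms)
print(both, one, none)
```

A document containing both children scores higher than one containing only the strongest child, which matches the ranking behaviour described above.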

Topic tree – Sentence; Any; Word; Stem Operators

• The Word operator performs the basic search and selects documents that include one or more instances of the exact word specified as the search element.

• The Stem operator widens the search to include the expanded word list, based on the original search word. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem.

• The Sentence operator is used to indicate that the children of a sub-topic must be located within the same sentence in a document.

• The Any operator is used to retrieve a document which contains at least one of the search elements specified.

Topic Trees – A knowledge representation of the “Ferrari” concept

Topic trees are predefined queries in tree-like form that can be used for Searching, Mining and Taxonomy Classification

Topic tree - Algorithm for Scoring

Elements and rationale

• Weight
– Represents the relative contribution of a child (keyword) to the overall score produced by a Topic tree. The designer attributes importance weights to sub-concepts to reflect the fact that some words, phrases or other concepts are more important than others in expressing the overall concept.

• Operators
– Operators are used in conjunction with the weight of the child (keyword) to compute the score for each topic-node during the search.

• Hierarchical Structure
– Interprets the relationships between the topic-nodes and determines the overall score of the topic tree. The position of each topic-node within the hierarchical structure influences the calculation of the score.

• Score
– A numerical score assigned to each document in the search result list, representing how well the document meets the information need of the user who issued the search.
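How weights, operators and structure interact can be sketched with a tiny recursive evaluator. It assumes the classic fuzzy-logic reading (AND as minimum, OR as maximum of the child scores); the actual scoring algorithm of the engine is not specified here, and the tree layout and names are invented for the example.

```python
def score(node, words):
    """Score a topic-tree node against the set of words in a document."""
    if "word" in node:                      # leaf: keyword present or absent
        base = 1.0 if node["word"] in words else 0.0
    else:
        child_scores = [score(child, words) for child in node["children"]]
        if node["op"] == "AND":
            base = min(child_scores)        # assumed fuzzy AND
        elif node["op"] == "OR":
            base = max(child_scores)        # assumed fuzzy OR
        else:
            raise ValueError("unsupported operator: " + node["op"])
    return node.get("weight", 1.0) * base   # weight scales the contribution

# Hypothetical topic tree: (export AND control) weighted 0.9, OR licence weighted 0.5
tree = {"op": "OR", "children": [
    {"op": "AND", "weight": 0.9, "children": [
        {"word": "export"}, {"word": "control"}]},
    {"word": "licence", "weight": 0.5},
]}

print(score(tree, {"export", "control", "policy"}))
print(score(tree, {"licence"}))
```

Moving a keyword deeper into the tree, or changing the operator of its parent, changes the final score even if the keyword's own weight stays the same.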

Topic tree - Quality Assurance procedures and Testing process

• Quality Assurance
– Procedures to check the performance of the topic trees against a “representative” collection of reports, amongst which the reports dealing with the concepts covered by the topic trees have been identified in advance. Retrieval effectiveness is measured by precision and recall.

• Enrich the original keywords
– Retrieved reports are examined for words that may serve as new keywords.

• Proximity operator
– Look for excessively restrictive proximity conditions that did not allow combinations of keywords to contribute to the retrieval of the document in the manner expected.

• Keywords too general
– Consider whether some keywords should be eliminated or used with new or more restrictive proximity conditions.

Probabilistic Approach in Text Retrieval System

Drawbacks

• Synonymy
– Different words describe the same idea. Thus, a query in a search engine may fail to retrieve a relevant document that does not contain the words which appeared in the query.

• Polysemy
– The same word has multiple meanings. So a search may retrieve irrelevant documents containing the desired words in the wrong meaning. For example, a botanist and a computer scientist looking for the word "tree" probably desire different sets of documents.

• Search keywords
– Search keywords must precisely match document terms; matching on word substrings (stemming) might result in a "false positive match".

• Semantic sensitivity
– Documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".

The probability that a specific document will be judged relevant to a specific query, is based on the assumption that the words are distributed differently in relevant and non relevant documents. The probability formula is usually derived from Bayes' theorem.
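As a reminder of how Bayes' theorem enters, the relation can be written as below. The symbols are chosen here for illustration: R means "relevant", d is a document and q a query.

```latex
P(R \mid d, q) = \frac{P(d \mid R, q)\, P(R \mid q)}{P(d \mid q)}
\qquad\Longrightarrow\qquad
\frac{P(R \mid d, q)}{P(\bar{R} \mid d, q)}
  = \frac{P(d \mid R, q)}{P(d \mid \bar{R}, q)} \cdot \frac{P(R \mid q)}{P(\bar{R} \mid q)}
```

The first factor on the right is where the assumption about different word distributions in relevant and non-relevant documents is used; the second factor does not depend on the document and so does not affect the ranking.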

AGENDA

• Basic notions of Information Retrieval

• Concept-based Retrieval – Topic tree


Tovek’s Tools - Enterprise Search Engine & Analytical System

Desktop & Client-Server Application

• Tovek Index Manager – Collection Builder
– Based on Verity’s technology; includes the ability to handle multi-language collections and different document formats.

• Tovek Agent – Enterprise Search Engine
– Combines simple and advanced searching capabilities with automatic clustering. It can also handle collections in different languages within the same instance.

• Tovek Editor – Create and Maintain Topic Trees
– Provides users with an easy-to-use, graphical way of creating Topic trees; it displays Topic trees in a way that allows users to quickly and easily view and change their definitions.

• Tovek Info Rating – Context Analysis & Data Visualisation Tool
– Enables analysts to rapidly exploit large quantities of text information, including advanced search and retrieval, pattern analysis and data visualization.

• Tovek Harvester – Mine documents’ context
– Based on statistical methods, it determines relevant themes and consequently helps to cope with information overload; relevant information is made accessible very quickly and without any administrative effort.

Tovek Index Manager – Collection Builder

a) The Index Manager is designed to be an administrator's tool to create and manipulate the collection as a whole

b) Based on Verity’s technology

c) Remote and local collections

d) Different types of documents: e-mail, Microsoft Word, PDF, etc.

e) Multi-language collections

f) ODBC connection

Agent - Enterprise Search Engine

Simple and Advanced Search Capabilities - Based on Verity Query Language (VQL)

Features and capabilities

• Simple and Highly Structured Query
– Ability to accept all known legacy search methods, including keyword search with the support of Evidence, Proximity, Relational and Concept operators, alone or combined as Topic trees.

• Query By Example
– Users can submit sample data as input and the system returns references to related documents ranked by relevance.

• Refine By Example
– Based on the results of the natural language retrieval, users can quickly refine their search to focus precisely on the context they require.

• Automatic Clustering
– Analyzes large sets of documents, or even users’ queries, and automatically groups together documents that have a high likelihood of being relevant to the same information need.

Tovek Agent - User Interface – Automatic Clustering

Ability to create hierarchy of collections, which can be used individually or concatenated

Tovek Agent

Selecting Collections - Find documents that satisfy specific criteria, e.g. Nuclear, test

Screenshot labels: Selecting collections; Search Pane & Search Elements; Result List; Documents found; Total documents; Document fields or Metadata

Tovek Agent – Collection Fields

View / Fields on the result list heading

Available Fields in the selected collection; Grouping

Tovek Agent – Query History & Query in Time

Query History (Tools / Query History): possibility to execute an old query

Query in Time (Tools / Query in Time)

Tovek Agent – View document

Highlights found search elements (words)

Tovek Agent - Document Properties

Examine the matched words (highlighted) in the selected document

Tovek Agent – Extract adjacent words

Capacity to extract highlighted words from selected documents, together with the words adjacent to (preceding or following) the highlighted ones.

Search Criteria: President
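The idea of extracting a matched word with its neighbours can be sketched in a few lines. This is not the Tovek Agent implementation; the function name, the `window` parameter and the sample text are assumptions made for the example.

```python
def adjacent_words(text, term, window=1):
    """Return (preceding words, matched word, following words) for each hit."""
    words = text.split()
    found = []
    for i, w in enumerate(words):
        # Assumption: ignore case and trailing punctuation when matching
        if w.strip(".,;:!?").lower() == term.lower():
            before = words[max(0, i - window):i]
            after = words[i + 1:i + 1 + window]
            found.append((before, w, after))
    return found

text = "The President opened the session. Later the president left."
for before, hit, after in adjacent_words(text, "President"):
    print(before, hit, after)
```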

Tovek Agent - Multiple languages search capability

Tovek Agent – Exporting documents (Menu Tools)

XML Export; HTML Export

Tovek Agent - Export of selected documents

Ability to export selected documents from the result list in different formats (XML, HTML, text), which can then be analysed further

HTML Export: List of Documents; Found words; Full text

Tovek Query Editor

For advanced users to construct more complex queries and to create topic trees

InfoRating

InfoRating is an analytical and data visualization tool that assists users in performing context analysis, together with a graphical representation of aggregated documents

• Provides a context analysis by matching an extracted list of documents against a set of queries

• Documents in the results list can be visualized in multiple ways

• Information is presented graphically in ways that make it easy to observe trends and general characteristics

• Organizes documents by the criteria and categories the user has requested; the conclusions are then delivered to the user

• Categorizes documents into navigable structures to assist users in finding relevant information and in understanding the context of a collection

Connection Chart

Relationships between queries and documents, together with their scores

Possibility to add comments to the queries and/or documents

Screenshot labels: Switches for the main pane; Query pane; Main pane; Documents pane

Cross Matrix

Upper panel – number of documents matching each possible pairing of two queries

Lower panel – documents matching the selected element of the Cross Matrix
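A cross matrix of this kind can be approximated by intersecting the result sets of every pair of queries. The query names and document ids below are invented; the sketch treats a pair as unordered, since the set of documents matching both queries is the same either way.

```python
from itertools import combinations

# Hypothetical input: query name -> ids of the documents it matched
matches = {
    "nuclear": {"d1", "d2", "d3"},
    "export": {"d2", "d3", "d4"},
    "missile": {"d3", "d5"},
}

def cross_matrix(matches):
    """For every pair of queries, the documents matching both."""
    return {(a, b): matches[a] & matches[b]
            for a, b in combinations(sorted(matches), 2)}

cells = cross_matrix(matches)
for pair, docs in cells.items():
    print(pair, len(docs), sorted(docs))   # upper panel count, lower panel list
```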

Summary Graph

Visualisation of the results of the queries in combination with different fields (Source or Date), e.g. queries within weeks

Harvester

Harvester Approach

• Automatic assignment of keywords
– The goal of Harvester is to automatically extract “relevant terms” (e.g., keywords) from a given corpus of information.

• Each keyword is assigned a weight (Relevance)
– The weight is a statistical measure used to evaluate how important a descriptor is to a document in a collection or corpus.

• The tf–idf weight (term frequency–inverse document frequency)
– The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

• Generation of descriptors
– Descriptors are formed out of all pairs of keywords that appear near each other in at least one document in the corpus.

• Time dependent (keywords and descriptors)
– Keywords and descriptors are time dependent; new trends in new documents are therefore reflected in the regular reassignment of keywords.
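The tf–idf weight can be sketched as follows. This uses one common textbook variant (relative term frequency times the natural logarithm of the inverse document frequency); Harvester's actual weighting is not documented here, and the small corpus is made up.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return, for each document, a dict mapping word -> tf-idf weight."""
    tokenised = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenised)
    # Number of documents in which each word occurs at least once
    doc_freq = Counter(word for words in tokenised for word in set(words))
    weights = []
    for words in tokenised:
        counts = Counter(words)
        weights.append({
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights

# Hypothetical three-document corpus
corpus = [
    "reactor fuel reactor safety",
    "fuel export control",
    "export control policy",
]
weights = tf_idf(corpus)
print(max(weights[0], key=weights[0].get))  # strongest keyword of document 0
```

A word that is frequent in one document but rare in the corpus ("reactor") receives a high weight, while a word shared by several documents ("fuel") is weighted down.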

Harvester – Show & Hide Cluster

(Chart / Show Clusters; Chart / Hide All)

Harvester – Part of a Cluster

Visualization of a “Descriptor” Centrifuge and relation with Partner words

Screenshot labels: Word List; Word History; Partner Words; Result List; Working Pane; Descriptors; Words Neighborhood

Descriptors can be used as input queries in concert with Tovek Agent

Visualization of a “Descriptor” - IAEA - and the relation with Partner words

Visualization of a “Descriptor” - Temelin - and the relation with Partner words
