tovek presentation by livio costantini

CE v5.8 12/04/23
1
Livio Costantini
Tovek’s Tools: Software to Access Unstructured Information
Auhofstrasse 25/2
1130 Wien
E-mail: [email protected]
Tel. 0043-1-8794274
Mobile: 0043-664-9919154

AGENDA
• Basic notions of Information Retrieval (IR)
• Verity Query Language – Topic tree
• Tovek Tools (Enterprise Search Engine & Analytical System)

Distinctions between Data Retrieval and Text Retrieval (1/2)
• The type of query
– Data Retrieval: direct and precise ("I want to know X"); the correct answer is there, and you know it.
– Text Retrieval: indirect and ambiguous ("I want to know about X"); a "correct" answer to your question may not even exist.
• Query representation
– Data Retrieval: the formal search query and the user's information need are closely mapped; the relation is deterministic.
– Text Retrieval: the relation between a formal query and the representation of an adequate answer is probabilistic.
• Criterion for success
– Data Retrieval: correctness. Data retrieval systems (DBMS) should retrieve the correct answers.
– Text Retrieval: utility. As there are few or no "correct" answers, text retrieval systems ideally retrieve the most useful documents.
• Representing data or information
– Data Retrieval: the ways of representing data are finite; there aren't too many variants for the term "ZIP code."
– Text Retrieval: the ways of representing documents are virtually unlimited, as language is ambiguous (the effect of semantic indeterminacy).

Distinctions between Data Retrieval and Text Retrieval (2/2)
• A query's target area
– Data Retrieval: because there aren't many ways of representing data (a unit of information), the number of possible alternative queries for the data is small, and the target area is also small.
– Text Retrieval: many ways of representing documents mean many more possible queries for a given document. The semantic target area is large, and in a large collection the number of documents retrieved can be overwhelming.
• Zero or no useful results
– Data Retrieval: means that the data really doesn't exist in the database.
– Text Retrieval: a negative search result does not necessarily mean that there are no useful documents in the database. The end-point of searching.
• Types of searches
– Data Retrieval: just one to support: exact matching.
– Text Retrieval: at least three types to support: sample ("give me a few documents about X"), exhaustive ("give me everything about X"), and existence ("are there any documents about X at all?").
• Delegation of searching
– Data Retrieval: fairly easy to do; queries are straightforward and not too dependent on context.
– Text Retrieval: open to interpretation; it's difficult to know exactly what the query was intended to retrieve.

The Data Retrieval and Document Retrieval Models
The most prominent differences all arise from the more fundamental problem of the representation of indeterminacy.
The representation of indeterminacy is a result of the effects of semantic ambiguity and system ("corpus") size.
These differences influence the design, use and management of the two kinds of system.
Semantic ambiguity is a measure of the number of different senses a word and/or phrase has.
System (corpus) size is the number of times that a given word and/or phrase is used to represent an item of information.

Generation of Text Retrieval Technology: Intellectual Text Processing
Performed intellectually by subject-area experts. No full-text index search, only keywords.
Definition and classification
• The process of assigning text identifiers (keywords and/or metadata) to the information items (documents).
• Metadata: data about the data, such as Title, Author Name, Publication Date, etc.
• Keywords: the human experts use a controlled indexing vocabulary and a thesaurus.
Key Benefits
• Helps to bridge the semantic gap
• Increases search efficiency and effectiveness
• Provides additional information to the users
• Avoids the need to perform more complex filter operations
Criticisms
• Is too expensive and time-consuming
• Is subjective and depends on context
• Is too complicated
• Has no clear and precise rules
A thesaurus designed for indexing is:
– a list of every important term (single-word or multi-word) in a given domain of knowledge; and
– a set of related terms for each term in the list.

Generation of Text Retrieval Technology: Automatic Text Processing
• Boolean Retrieval Model (STAIRS, IBM): uses the operators AND, OR, NOT and a proximity operator. The rank order of retrieved documents is arbitrary; no relevance is assigned to the documents retrieved.
• Natural Language Processing: based on syntactic and morphological analysis, usually supported by a controlled dictionary. Automatic semantic network representation and free-text queries.
• Probabilistic Approach: probabilistic models treat the process of document retrieval as a multistage random experiment. Similarities are thus represented as probabilities. Relevance is usually calculated by examining how many times a query term appears in a document, compensated by the frequency of the query term in the collection (term frequency–inverse document frequency, tf–idf).
• Concept Retrieval: a search technology that makes it possible to search for subjects or concepts rather than individual words or phrases in documents. Retrieved documents are ranked by relevance. Usually the user is responsible for specifying the concept definition.
Full Text Index: a data structure that stores a list of the occurrences and positions of each atomic search criterion (word), typically in the form of a hash table or binary tree, allowing full-text search.
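As an illustration of the full-text index and the Boolean model described on this slide, here is a minimal Python sketch. It assumes a hash-table (dict) index, whitespace tokenization and lowercasing; the names `build_index` and `docs_with` and the sample documents are hypothetical and are not taken from STAIRS or from any Verity or Tovek product.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def docs_with(index, word):
    """Return the set of document ids that contain the word."""
    return set(index.get(word.lower(), {}))

# Hypothetical three-document collection.
docs = {
    1: "nuclear research reactor",
    2: "export control of nuclear material",
    3: "research on export markets",
}
index = build_index(docs)

# Boolean operators become set operations; the results carry no ranking.
nuclear_and_research = docs_with(index, "nuclear") & docs_with(index, "research")
export_or_reactor = docs_with(index, "export") | docs_with(index, "reactor")
research_not_nuclear = docs_with(index, "research") - docs_with(index, "nuclear")
```

Because the index also keeps word positions, the same structure could support proximity operators; the Boolean results themselves are unordered sets, which matches the "arbitrary rank order" remark above.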

Determining relevance
What is the goal of Text Retrieval?
Capabilities: a "good" text retrieval system is able to:
• Extract meaningful, useful information
while
• Withholding non-relevant information
Relevance is subjective in nature; it may be determined by:
• The user who posed the retrieval problem
– Realistic, but based on many personal factors
• An external judge

Measuring Retrieval Effectiveness - Precision & Recall
• Recall
– The ratio of the number of relevant (useful) documents retrieved to the total number of relevant documents in the database
– Measures how well the search engine retrieves all of the relevant documents
• Precision
– The ratio of the number of relevant (useful) documents retrieved to the total number of documents retrieved (relevant and irrelevant)
– Measures how well the search engine retrieves only the relevant documents
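The two ratios above can be computed directly from sets of document identifiers. The sketch below is a minimal illustration; the function name `precision_recall` and the sample identifiers are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 5 relevant documents in the collection, 2 in common.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
```

In this example precision is 2/4 and recall is 2/5: half of what was returned is useful, but more than half of the useful documents were missed.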

Results Analysis of Precision and Recall of a Query
A document is considered relevant if it is judged useful by the user who originated the query.
• Low precision and low recall: indicates that the search engine has retrieved many irrelevant documents and has missed many important results.
• High precision and low recall: indicates that the system was selective and has retrieved a good number of relevant documents but missed some important results.
• High recall and low precision: indicates that the system has retrieved a good number of relevant results but has also retrieved many irrelevant results in the process.
• High precision and high recall: indicates good retrieval performance. Providing access to all and only those documents which are relevant (high precision and high recall) is the criterion for the most efficient search engine.

AGENDA
• Basic notions of Information Retrieval
• Verity Query Language & Topic tree
• Tovek Tools (Enterprise Search Engine & Analytical System)

The Problem
80% of information is held in unstructured text documents. Picture it as an iceberg.
Using standard tools and basic search engines (or not using them at all), you can find only the proverbial tip of the iceberg.
Even if you can see the whole, you can become frustrated by the inability to see what is inside.

Verity Query Language (VQL)
Operators for full-text searching
• Evidence: an evidence operator specifies either a basic word search or an expanded word list based on the original search word. Operators: Word; Stem; Thesaurus; Wildcard; Soundex; Typo.
• Proximity: a proximity search looks for documents where two or more separately matching term occurrences are within a specified distance, where distance is the number of intermediate words or characters. Operators: Phrase; Sentence; Paragraph; Near/N; Order; Before; After.
• Relational: searches the document fields (metadata) defined in the collection (such as Title, Author, Published Date, etc.) as a filtering function. Numeric or textual searches are accepted depending on the format of the fields. Operators: Equal (=); Greater than or equal (>=); Less than or equal (<=); etc.; Contains; Ends; etc.
• Concept: combines the meaning of search elements to identify a concept in a document. Retrieved documents are relevance-ranked. Operators: Accrue; And; Or; All; Any. Weights between 0.1 and 1.0 are assigned to each keyword or phrase based on its relative importance in meeting the search objective.

Evidence Operators
Expand the keyword into a list of related words
• <Stem>: selects documents that include one or more variations of the search word you specify, e.g. <STEM>export. Note: by default, words and phrases are stemmed.
• <Word>: selects documents that include one or more instances of only the word you specify, without locating stemmed variations, e.g. <WORD>export. NB: this searches for documents that contain the word "export" but not "exporting", "exported", etc.
• <Case>: performs a case-sensitive search based on the case of the word or phrase specified, e.g. <CASE>EMIS finds EMIS (acronym for electromagnetic isotope separation) and not emis (émis, the past participle of the French verb émettre).
• Question mark (?): specifies exactly one alphanumeric character, as in organi?ation, which locates organization and organisation.
• Asterisk (*): specifies zero or more alphanumeric characters, as in test*, which locates not only test and tests but also testimony, testosterone, etc.
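The wildcard behaviour described above can be sketched by translating a pattern into a regular expression. This is only an illustration of the idea in Python, not how Verity implements wildcards; the helper names `wildcard_to_regex` and `matches` are hypothetical, and ASCII-only, case-insensitive matching is an assumption.

```python
import re

def wildcard_to_regex(pattern):
    """Translate '?' (one alphanumeric) and '*' (zero or more) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("[A-Za-z0-9]")
        elif ch == "*":
            parts.append("[A-Za-z0-9]*")
        else:
            parts.append(re.escape(ch))  # every other character matches literally
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern, word):
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None

both_spellings = [w for w in ("organization", "organisation", "organiation")
                  if matches("organi?ation", w)]
test_family = [w for w in ("test", "tests", "testimony", "contest")
               if matches("test*", w)]
```

Here `both_spellings` keeps the two spellings with exactly one character in the wildcard position, and `test_family` keeps every word that starts with "test", which shows why a trailing asterisk can pull in unrelated words such as "testimony".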

Proximity Operators
Specify the relative location of specific words
• <Phrase>: selects documents that include a phrase you specify. A phrase is a grouping of two or more words that occur next to each other, e.g. <Phrase>(export, control) or "export control"
• <Sentence>: selects documents that include all the words you specify within a sentence, e.g. nuclear <Sentence> research
• <Paragraph>: selects documents that include all the words you specify within a paragraph, e.g. nuclear <Paragraph> proliferation
• <Near/N>: selects documents containing all specified search terms within N words of each other, where N is an integer, e.g. nuclear <NEAR/5> weapon
• <Order>: specifies that search elements must occur in the same order as in the query statement. It is always placed in front of an operator, e.g. ballistic <ORDER><NEAR/5> missile
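A proximity test of the Near/N kind can be sketched from word positions. The Python example below is a simplified illustration under stated assumptions: text is split on whitespace, "within N words" is taken to mean a position difference of at most N, and the optional `ordered` flag imitates the effect of <ORDER>. The function name `near` is hypothetical and the exact distance rule used by Verity may differ.

```python
def near(text, first, second, n, ordered=False):
    """True if `first` and `second` occur within n word positions of each other."""
    words = text.lower().split()
    pos_first = [i for i, w in enumerate(words) if w == first.lower()]
    pos_second = [i for i, w in enumerate(words) if w == second.lower()]
    for a in pos_first:
        for b in pos_second:
            gap = b - a
            if ordered and gap <= 0:
                continue  # with ordering, `second` must come after `first`
            if 0 < abs(gap) <= n:
                return True
    return False

text = "the ballistic and long range missile program"
within_five = near(text, "ballistic", "missile", 5)          # positions 1 and 5
within_three = near(text, "ballistic", "missile", 3)
reversed_ordered = near(text, "missile", "ballistic", 5, ordered=True)
reversed_unordered = near(text, "missile", "ballistic", 5)
```

The last two calls show the difference the ordering constraint makes: the same pair of words satisfies the unordered test but fails when the query order is enforced.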

Concept Operators
Combine the meaning of search elements (words) to find a concept
• <Accrue>: selects documents that include at least one of the search elements you specify. The more search elements that are present, the higher the score will be, e.g. plutonium <ACCRUE> Pu, or plutonium, Pu. Documents with both terms are listed first.
• <And>: selects documents that include all the search elements you specify. Documents are relevance-ranked, e.g. Germany <AND> hot cells
• <Or>: selects documents that include at least one of the search elements you specify, e.g. electromagnetic isotope separation <OR> EMIS <OR> calutron
• <Not>: the <NOT> modifier followed by a word or phrase excludes documents which contain that word or phrase, e.g. missile <AND> <NOT> short range
Note: AND, OR and NOT are treated as operators by default and do not require angle brackets. To use them as literal words, enclose them in double quotes. All other operators must be enclosed in angle brackets.

Relational Operators
Search in the metadata fields (such as Title, Date, etc.) defined in the collection
• Title: selects documents that include the search elements you specify in the Title.
• Date and other fields: numeric or textual searches are accepted depending on the format of the field. Operators: Equal (=); Greater than or equal (>=); Less than or equal (<=); etc.; Contains; Ends; etc.
• Sort option: the resulting documents can be sorted by score, date or title, in ascending or descending order.

Concept Retrieval - Fuzzy Logic Approach
Characteristic
• The process of searching for subjects or concepts rather than individual words or phrases.
• In building up a concept (Topic tree), an expert familiar with the subject of the search assigns weights to search terms.
• A Topic tree provides a convenient means of encapsulating the knowledge of an expert in a hierarchical structure.
Advantages
• Topic trees have the ability to understand the context of the text and retrieve every document related to a "topic" of interest.
• Retrieval results are presented in relevance-ranked order, giving users access to the most important information.
• Topic trees are available to end users as a shared resource.

Design a Topic Tree - Knowledge Elicitation Process
Extracting knowledge from subject area experts
Subject Area Expert Knowledge Engineer
The Knowledge Engineer extracts and organizes the knowledge of the Subject-Area Expert and expresses it in a hierarchical format which can be used in a "Topic Tree" environment.

Topic Tree – An Introduction
• Topic trees mitigate semantic ambiguity
• Topic trees are predefined queries in tree-like form that can be used for Searching, Data Mining and Taxonomy Classification
• A Topic tree includes a definition of the relationships between keywords and provides rules for evaluating and scoring documents
• Topic Tree Properties
– Structure: establishes the relationships between nodes
– Weights: define the relative importance of words and nodes
– Operators: interpret rules for the search engine

The Importance of Topic Trees
• Topic Trees are corporate intellectual property, or business rules, to be reused by employees
• Topic Trees are available to end users as a shared resource
• Topic Trees provide a convenient means of encapsulating the expert’s knowledge in a hierarchical structure
• Topic Trees include all the components of the Verity Query Language (Conceptual and Proximity Operators, Modifiers and Weights)
• Topic Trees have the ability to understand the context of a text and retrieve documents related to a "topic" of interest

Topic tree - Accrue Operator
The Accrue operator takes a "the more the better" approach when assigned to a topic or to a search: the more children of an Accrue topic that are found in a document, the more closely the document is considered related to your search. Documents which contain the most highly weighted children are ranked highest in the result list.
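The slides do not give the formula behind Accrue, so the following Python sketch only illustrates one simple way to obtain "the more the better" behaviour. It is an assumption, not Verity's actual algorithm: the weights of the children found in a document are combined so that each additional match raises the score while keeping it between 0 and 1. The function name `accrue_score` and the weights are hypothetical.

```python
def accrue_score(child_weights, found):
    """Combine the weights of the children found in a document (illustrative rule)."""
    miss = 1.0
    for term, weight in child_weights.items():
        if term in found:
            miss *= 1.0 - weight  # every extra match shrinks the remaining "miss"
    return 1.0 - miss

# Hypothetical child terms and weights for a "plutonium" topic.
weights = {"plutonium": 0.8, "pu": 0.5, "reprocessing": 0.6}

no_match = accrue_score(weights, set())
one_match = accrue_score(weights, {"pu"})
two_matches = accrue_score(weights, {"pu", "plutonium"})
all_matches = accrue_score(weights, {"pu", "plutonium", "reprocessing"})
```

With these weights the score rises from 0 for no matches towards 1 as more children are found, and a highly weighted child contributes more than a lightly weighted one, which is the ranking behaviour the slide describes.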

Topic tree – Sentence, Any, Word and Stem Operators
• Word operator: performs the basic search and selects documents that include one or more instances of the exact word specified as the search element.
• Stem operator: widens the search to include the expanded word list based on the original search word. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem.
• Sentence operator: indicates that the children of a sub-topic must be located within the same sentence in a document.
• Any operator: retrieves a document which contains at least one of the search elements specified.

Topic Trees – A knowledge representation of the “Ferrari Concept”
Topic trees are predefined queries in tree-like form that can be used for Searching, Mining and Taxonomy Classification

Topic tree - Algorithm for Scoring
Score: a numerical score assigned to each document in the search result list, representing how well the document meets the information need of the user who issued the search.
Elements and rationale
• Weight: represents the relative contribution of a child (keyword) to the overall score produced by a Topic tree. The designer assigns importance weights to sub-concepts to reflect the fact that some words, phrases or other concepts are more important than others in expressing the overall concept.
• Operators: used in conjunction with the weight of the child (keyword) to compute the score for each topic-node during the search.
• Hierarchical Structure: interprets the relationships between the topic-nodes and determines the overall score of the topic tree. The position of each topic-node within the hierarchical structure influences the calculation of the score.
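How weights, operators and hierarchy can interact is easier to see in a small recursive evaluator. The Python sketch below is an illustration under explicit assumptions and not the scoring algorithm of Verity or Tovek: leaf nodes score 1.0 when their word is present, AND and OR use the fuzzy-logic minimum and maximum, ACCRUE uses an illustrative "the more the better" combination, and each node's result is multiplied by its weight. The `Node` class, the `score` function and the example tree are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    weight: float = 1.0
    operator: str = "WORD"  # one of "WORD", "AND", "OR", "ACCRUE"
    children: list = field(default_factory=list)

def score(node, words):
    """Score a topic node against the set of lowercased words of one document."""
    if node.operator == "WORD":
        raw = 1.0 if node.name.lower() in words else 0.0
    else:
        child_scores = [score(child, words) for child in node.children]
        if node.operator == "AND":
            raw = min(child_scores, default=0.0)
        elif node.operator == "OR":
            raw = max(child_scores, default=0.0)
        elif node.operator == "ACCRUE":
            miss = 1.0
            for s in child_scores:
                miss *= 1.0 - s
            raw = 1.0 - miss
        else:
            raise ValueError(f"unknown operator: {node.operator}")
    return node.weight * raw

# Hypothetical two-level tree for a "Ferrari" concept.
tree = Node("ferrari-concept", operator="ACCRUE", children=[
    Node("ferrari", weight=0.8),
    Node("racing", weight=0.5, operator="AND",
         children=[Node("formula"), Node("race")]),
])

def doc_score(text):
    return score(tree, set(text.lower().split()))

full_match = doc_score("Ferrari won the formula one race")
brand_only = doc_score("Ferrari dealership opens")
racing_only = doc_score("formula race results")
```

A document that satisfies both branches scores highest, and the brand-only document outranks the racing-only one because the "ferrari" child carries the larger weight; moving a node to a different level of the tree changes which operator and weight apply to it, which is the sense in which position influences the score.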

Topic tree - Quality Assurance Procedures and Testing Process
• Quality Assurance: procedures to check the performance of the topic trees against a "representative" collection of reports, in which the reports dealing with the concepts covered by the topic trees have been identified in advance. Retrieval effectiveness is measured by precision and recall.
• Enrich the original keywords: retrieved reports are examined for words that may serve as new keywords.
• Proximity operator: look for excessively restrictive proximity conditions that do not allow combinations of keywords to contribute to the retrieval of a document in the manner expected.
• Keywords too general: consider whether some keywords should be eliminated or used with new or more restrictive proximity conditions.

Probabilistic Approach in Text Retrieval System
The probability that a specific document will be judged relevant to a specific query is based on the assumption that words are distributed differently in relevant and non-relevant documents. The probability formula is usually derived from Bayes' theorem.
Drawbacks
• Synonymy: different words describe the same idea. Thus, a query in a search engine may fail to retrieve a relevant document that does not contain the words which appeared in the query.
• Polysemy: the same word has multiple meanings. So a search may retrieve irrelevant documents containing the desired words in the wrong meaning. For example, a botanist and a computer scientist looking for the word "tree" probably desire different sets of documents.
• Search keywords: search keywords must precisely match document terms; word substrings (stemming) might result in a "false positive match".
• Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".

AGENDA
• Basic notions of Information Retrieval
• Concept-based Retrieval – Topic tree
• Tovek Tools (Enterprise Search Engine & Analytical System)

Tovek’s Tools - Enterprise Search Engine & Analytical System
Desktop & Client-Server Application
• Tovek Index Manager – Collection Builder: based on Verity’s technology; includes the ability to handle multi-language collections and different document formats.
• Tovek Agent – Enterprise Search Engine: combines simple and advanced search capabilities with automatic clustering. It can also handle collections in different languages within the same instance.
• Tovek Editor – Create and Maintain Topic Trees: provides users with an easy-to-use, graphical way of creating Topic trees; it displays Topic trees in a way that lets users quickly and easily view and change their definitions.
• Tovek Info Rating – Context Analysis & Data Visualisation Tool: enables analysts to rapidly exploit large quantities of text information, including advanced search and retrieval, pattern analysis and data visualization.
• Tovek Harvester – Mines document context: based on a statistical method, it determines relevant themes and thereby helps to cope with information overload; relevant information is made accessible very quickly and without any administrative effort.

Tovek Index Manager – Collection Builder
a) The Index Manager is designed to be an administrator's tool to create and manipulate the collection as a whole
b) Based on Verity’s technology
c) Remote and local collections
d) Different types of documents: e-mail, Microsoft Word, PDF, etc.
e) Multi-language collections
f) ODBC connection

Tovek Agent - Enterprise Search Engine
Simple and Advanced Search Capabilities - Based on Verity Query Language (VQL)
Features and capabilities
• Simple and Highly Structured Query: accepts all known legacy search methods, including keyword search with the support of Evidence, Proximity, Relational and Concept operators, alone or combined as Topic trees.
• Query By Example: users can submit sample data as input, and the system returns references to related documents ranked by relevance.
• Refine By Example: based on the results of the natural language retrieval, users can quickly refine their search to focus precisely on the context they require.
• Automatic Clustering: analyzes large sets of documents, or even users' queries, and automatically groups together documents that have a high likelihood of being relevant to the same information need.

Tovek Agent - User Interface – Automatic Clustering
Ability to create hierarchy of collections, which can be used individually or concatenated

Tovek Agent
Selecting Collections - Find documents that satisfy specific criteria, e.g. nuclear, test
Selecting collections
Search Pane & Search Elements
Result List
Documents found
Total documents
Document fields or Metadata

Tovek Agent – Collection Fields
View / Fields on the result list heading
Available Fields in the selected collection Grouping

Tovek Agent – Query History & Query in Time
Query History (Tools / Query History): possibility to execute an old query
Query in Time (Tools / Query in Time)

Tovek Agent – View document
Highlights found search elements (words)

Tovek Agent - Document Properties
Examine the matched words (highlighted) in the selected document

Tovek Agent - Extract adjacent words
Capacity to extract highlighted words from selected documents, together with the words adjacent to (preceding or following) the highlighted ones.
Search Criteria: President

Tovek Agent - Multiple languages search capability

Tovek Agent – Exporting documents (Menu Tools)
XML Export; HTML Export

Tovek Agent - Export of selected documents
Ability to export selected documents from the result list in different formats (XML, HTML, text), which can then be analysed further
HTML Export
List of Documents
Found words
Full text

Tovek Query Editor
For advanced users, to construct more complex queries and create topic trees

InfoRating
InfoRating is an analytical and data visualization tool that assists users in performing context analysis, together with a graphical representation of aggregated documents.
• Provides a context analysis by matching an extracted list of documents against a set of queries
• Documents in the results list can be visualized in multiple ways
• Information is presented graphically in ways that make it easy to observe trends and general characteristics
• Organizes documents by the criteria and categories the user has requested; the conclusions are then delivered to the user
• Categorizes documents into navigable structures to assist users in finding relevant information and in understanding the context of a collection

Connection Chart
Relationships between queries and documents, together with their scores
Possibility to add comments to the queries and/or documents
Switches for the main pane
Query pane
Main pane
Documents pane

Cross Matrix
Upper panel – Number of documents matching each possible pair of queries
Lower panel – Documents matching the selected element of the Cross Matrix
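The idea behind the Cross Matrix can be sketched as pairwise set intersections. The Python example below is a hedged illustration and not InfoRating's implementation: it assumes each query has already produced a set of matching document ids, and the names `cross_matrix` and `matches` as well as the sample data are hypothetical.

```python
from itertools import combinations

def cross_matrix(matches):
    """For every pair of queries, return the set of documents matching both."""
    return {
        (a, b): matches[a] & matches[b]
        for a, b in combinations(sorted(matches), 2)
    }

# Hypothetical query results: query name -> ids of matching documents.
matches = {
    "nuclear": {1, 2, 3},
    "export": {2, 3, 4},
    "missile": {5},
}
matrix = cross_matrix(matches)

# "Upper panel": the count for each pair of queries.
counts = {pair: len(doc_ids) for pair, doc_ids in matrix.items()}
# "Lower panel": the documents behind one selected cell.
selected = sorted(matrix[("export", "nuclear")])
```

Keeping the document sets rather than only the counts is what allows a selected cell to be expanded into its list of documents.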

Summary Graph Visualisation of the results of the queries in combination with different fields (Source or Date )
(e.g. queries within weeks)

Harvester
The goal of Harvester is to automatically extract "relevant terms" (e.g., keywords) from a given corpus of information.
Harvester Approach
• Automatic assignment of keywords: each keyword is assigned a weight (relevance).
• The tf–idf weight (term frequency–inverse document frequency): the weight is a statistical measure used to evaluate how important a descriptor is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
• Generation of descriptors: descriptors are formed out of all pairs of keywords that appear near each other in at least one document in the corpus.
• Time dependence (keywords and descriptors): keywords and descriptors are time dependent; new trends in new documents are therefore reflected in the regular reassignment of keywords.
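The tf–idf weighting mentioned on this slide can be sketched in a few lines of Python. This is one common textbook variant (relative term frequency multiplied by the logarithm of the inverse document frequency) and is only an assumption about the kind of calculation involved; the slides do not state the exact formula Harvester uses. The function name `tf_idf` and the sample corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: weight} dict per document in the corpus."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))  # count each word once per document
    weights = []
    for words in tokenized:
        term_freq = Counter(words)
        weights.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return weights

corpus = [
    "centrifuge enrichment plant",
    "enrichment plant safety",
    "plant safety report",
]
weights = tf_idf(corpus)
top_keyword = max(weights[0], key=weights[0].get)
```

In this tiny corpus "plant" occurs in every document, so its weight is zero, while "centrifuge" occurs in only one document and becomes the top keyword of the first document, which illustrates how frequency in the corpus offsets frequency in the document.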

Harvester – Show & Hide Cluster
(Chart / Show Clusters; Chart / Hide All)

Harvester – Part of a Cluster

Visualization of a “Descriptor” - Centrifuge - and the relation with Partner words
Word List
Word History
Partner Words
Result List
Working Pane Descriptors
Words Neighborhood

Descriptors can be used as input queries in concert with Tovek Agent

Visualization of a “Descriptor” - IAEA - and the relation with Partner words

Visualization of a “Descriptor” - Temelin - and the relation with Partner words