Towards an Automatic Semantic Integration of Information

Fourth International Conference on Topic Maps – Research and Applications (TMRA 2008) Towards an Automatic Semantic Integration of Information Dr. Jörg Wurzer, iQser AG Prof. Dr. Stefan Smolnik, European Business School (EBS) Leipzig, October 16, 2008

DESCRIPTION

With their expanding information assets and the increasing importance of the knowledge factor, organizations are increasingly challenged to efficiently support knowledge management processes with appropriate integration and retrieval technologies. Besides traditional information retrieval approaches, the use of semantic technologies like Topic Maps is also becoming more important. This paper proposes a technology framework for the automatic semantic integration of information. Based on various information repositories, topics and topic associations are created automatically in real time. In addition, the first results from a proof of concept in conjunction with the European company EADS provide further insights into the proposed framework's applicability in practice.

TRANSCRIPT

Page 1: Towards an automatic semantic integration of information

Fourth International Conference on Topic Maps – Research and Applications (TMRA 2008)

Towards an Automatic Semantic Integration of Information

Dr. Jörg Wurzer, iQser AG
Prof. Dr. Stefan Smolnik, European Business School (EBS)

Leipzig, October 16, 2008

Page 2: Towards an automatic semantic integration of information

Agenda

• Status quo and motivation

• New paradigm: information access by context

• Proof of Concept at EADS

• Technical architecture

• Analysis process & queries

• Further research & questions

Page 3: Towards an automatic semantic integration of information

Motivation

• The quantity of digital information is still growing, by 60% per year according to IDC (2008)

• Information is dispersed over documents and various applications/databases

• Growing need for creating knowledge based on available information

• Profound knowledge for management decisions, completing tasks and business processes, development of new products, sales and marketing campaigns

• Topic Maps can adopt the new results of research in semantic technologies

Page 4: Towards an automatic semantic integration of information

Today's solution I: full-text search

• Advantages: easy to use, generally accepted, and very familiar to users

• Disadvantages:

  • Result quality depends on the keyword selection

  • Results are presented as long document lists, which have to be assessed intellectually by the users

  • The result set does not necessarily consider the user’s intention

  • Each application has its own search functionality (no standards)

Page 5: Towards an automatic semantic integration of information

Today's solution II: directory hierarchy

• Advantages: content like documents can be organized considering their meaning, context, and applicability

• Disadvantages:

  • A manually created hierarchy provides a static view of the content, but in practice users need different views, for example by customer, project, or product

  • Documents are usually needed in several contexts and are then stored redundantly; as a result, every copy has to be edited when the content changes

  • Directory hierarchies often reflect the current state of knowledge; however, some documents cannot be placed appropriately in the hierarchy

Page 6: Towards an automatic semantic integration of information

New paradigm: access content in any context

• Automatically created topic maps of all content object types

• Multiple links between the content objects establish a semantic, non-hierarchical network; links are created semantically

• Users choose their focus of interest and the topic map provides the related content; for example, customers are linked to projects, contracts, products, employees, and service calls

• Exploring the available data by navigating through a topic map

• The content could be located in heterogeneous sources and could be stored in different formats or data models; even external content could be included

Page 7: Towards an automatic semantic integration of information
Page 8: Towards an automatic semantic integration of information

Proof of Concept of iQser Middleware at EADS

• Division: Defence and Communication Systems

• Requirements:

  • Analysis of unstructured data of military information

  • Automatically created network of content objects

  • Automatically created network of main concepts

  • All links between documents have to be justified

• Benchmark: a system with a manually created ontology

Page 9: Towards an automatic semantic integration of information

Application screenshot (modified data due to confidentiality)

Page 10: Towards an automatic semantic integration of information

Results

• The created topic map provides transparent relations between documents

• The terms tree provides users with an overview of the document base’s content as well as of related fundamental facts

• In the PoC for EADS, the concept tree shows, for example, that “Biber” is a bridge tank, and it shows the location of the anti-missile defense

• The quality of the tree’s information and of the topic map is high and can compete with that of a manually created ontology

Page 11: Towards an automatic semantic integration of information

Uniform Information Layer (UIL)

• Single point of access for all content object types

• One connector for each type of structured and unstructured content from any source (document, database, application); the connector transforms the source data into a semantically typed, generic content object and writes modified data back to the source

• No redundantly stored data

• Searching across heterogeneous sources including the web is possible

• Users can specify search queries by means of attributes
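
The connector idea can be sketched as follows. This is a minimal illustration, not the iQser API: the names `ContentObject`, `Connector`, `InMemoryCustomerConnector`, and `search` are hypothetical, and a plain dictionary stands in for a real database.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class ContentObject:
    """Generic, semantically typed content object (hypothetical shape)."""
    source_id: str          # where the original lives, e.g. a row key
    semantic_type: str      # e.g. "Customer", "Email", "Project"
    attributes: Dict[str, str] = field(default_factory=dict)


class Connector(Protocol):
    """One connector per source type; method names are illustrative only."""
    def read(self) -> List[ContentObject]: ...
    def write_back(self, obj: ContentObject) -> None: ...


class InMemoryCustomerConnector:
    """Stands in for a database connector; the 'table' is a plain dict."""

    def __init__(self, rows: Dict[str, Dict[str, str]]) -> None:
        self.rows = rows

    def read(self) -> List[ContentObject]:
        return [ContentObject(key, "Customer", dict(row))
                for key, row in self.rows.items()]

    def write_back(self, obj: ContentObject) -> None:
        # Modified data goes back to the source, so nothing is stored twice.
        self.rows[obj.source_id] = dict(obj.attributes)


def search(connectors: List[Connector], semantic_type: str,
           **attrs: str) -> List[ContentObject]:
    """Single point of access: query every connector by type and attributes."""
    hits = []
    for connector in connectors:
        for obj in connector.read():
            if obj.semantic_type == semantic_type and all(
                obj.attributes.get(k) == v for k, v in attrs.items()
            ):
                hits.append(obj)
    return hits
```

In this sketch the layer keeps no copy of the data: each search reads through the connectors, and edits are written straight back to the source.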

Page 12: Towards an automatic semantic integration of information

Architecture of iQser Semantic Middleware

Page 13: Towards an automatic semantic integration of information

Analysis process

• All content changes (and changes of the topic map) trigger an event

• All user actions are tracked

• Any change, or a specified number of user actions, triggers the analysis process

• Combination of three analysis methods: Syntax Analyzer, Pattern Analyzer, Semantic Analyzer

• More analyzers could be included according to customers’ needs

• A pair of content objects can have n relations, each with a calculated weight
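
The trigger logic described on this slide might look like the sketch below. The class `AnalysisTrigger`, its method names, and the threshold value are assumptions made for illustration; the slides do not specify how many user actions start an analysis run.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AnalysisTrigger:
    """Runs the analysis on every content change or after N user actions."""
    run_analysis: Callable[[str], None]
    action_threshold: int = 3           # assumed value, not from the slides
    _pending_actions: int = field(default=0, init=False)

    def on_content_changed(self, object_id: str) -> None:
        # Every change of content (or of the topic map) triggers the analysis.
        self.run_analysis(f"change:{object_id}")

    def on_user_action(self, object_id: str) -> None:
        # User actions are counted and trigger the analysis only in batches.
        self._pending_actions += 1
        if self._pending_actions >= self.action_threshold:
            self._pending_actions = 0
            self.run_analysis("user-actions")
```

Passing the analysis as a callable keeps the sketch independent of which analyzers are plugged in.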

Page 14: Towards an automatic semantic integration of information

Syntax Analyzer

• Each content object can have multiple key attributes defined in the content provider

• Examples: full name of a person, sender and recipient of an email, project ID

• The Syntax Analyzer checks whether these key attributes are related to attributes of other content objects in the data pool
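
A minimal sketch of key-attribute matching follows. It assumes that "related" means a case-insensitive substring match, which the slides do not state; the function name `syntax_links` and the data layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Each content object: id -> attribute name -> value.
ContentPool = Dict[str, Dict[str, str]]


def syntax_links(pool: ContentPool,
                 key_attrs: Dict[str, List[str]]) -> Set[Tuple[str, str, str]]:
    """Return (source_id, target_id, matched_value) for key-attribute matches."""
    links = set()
    for source_id, keys in key_attrs.items():
        for key in keys:
            value = pool[source_id].get(key)
            if not value:
                continue
            for target_id, attrs in pool.items():
                if target_id == source_id:
                    continue
                # Assumption: "related" means a case-insensitive substring match.
                if any(value.lower() in other.lower() for other in attrs.values()):
                    links.add((source_id, target_id, value))
    return links
```

With a person whose key attribute is the full name and an email whose sender field carries the same name, the function links the two objects and records the matching value as the justification for the link.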

Page 15: Towards an automatic semantic integration of information

Pattern Analyzer

• The Pattern Analyzer extracts the meaningful words according to significance

• Transforms a selected set of words into a data query; the result is a list of similar content objects

• The similarity is described by a weight between 0 and 1

• The Pattern Analyzer considers the context of used words in a text; it therefore reflects the different use of words in different contexts
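
The slides do not say how significance or similarity is computed. As a stand-in, the sketch below uses TF-IDF weighting and cosine similarity, a common technique that also yields weights between 0 and 1. All function names are hypothetical, and the context sensitivity mentioned in the last bullet is not modelled.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-zäöüß]+", text.lower())


def tfidf_vectors(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Weight each word by term frequency times inverse document frequency."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter(word for c in counts.values() for word in c)
    n = len(docs)
    return {
        doc_id: {w: tf * math.log(n / doc_freq[w]) for w, tf in c.items()}
        for doc_id, c in counts.items()
    }


def top_words(vector: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most significant words: the selected set used as the query."""
    return dict(sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:k])


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def similar_objects(docs: Dict[str, str], query_id: str,
                    k: int = 5) -> List[Tuple[str, float]]:
    """Rank all other objects by similarity to the query object's top words."""
    vectors = tfidf_vectors(docs)
    query = top_words(vectors[query_id], k)
    scored = [(d, cosine(query, v)) for d, v in vectors.items() if d != query_id]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Words that occur in every document receive a weight of zero here, so they never contribute to the query or to the similarity.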

Page 16: Towards an automatic semantic integration of information

Semantic Analyzer

• Background: the meaning of words and sentences in a language is not defined abstractly but indirectly manifested in the daily use of language

• The Semantic Analyzer evaluates the tracked user actions

• If two content objects are selected, edited, or created in a sequence, the Semantic Analyzer creates a link between these objects

• The weight of such a link grows if the same sequence of content objects occurs again

• The weights of content object links can shrink if a weight has a value larger than 1

• The topic map is self-optimizing considering the customers’ interests
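
The link-weight behaviour can be sketched as follows. Two points are assumptions rather than facts from the slides: the fixed increment per repeated sequence, and the reading of the shrink rule as a rescaling of all weights once any weight exceeds 1. The class `SemanticLinker` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable


class SemanticLinker:
    """Derives weighted links from sequences of tracked user actions."""

    def __init__(self, increment: float = 0.4) -> None:
        self.increment = increment      # assumed step size, not from the slides
        self.weights: Dict[FrozenSet[str], float] = defaultdict(float)

    def track(self, object_ids: Iterable[str]) -> None:
        """Link each pair of objects that were used directly after one another."""
        ids = list(object_ids)
        for first, second in zip(ids, ids[1:]):
            if first != second:
                self.weights[frozenset((first, second))] += self.increment
        self._rescale()

    def _rescale(self) -> None:
        # Assumed reading of the shrink rule: once any weight exceeds 1,
        # scale all weights down so that the strongest link is exactly 1.
        top = max(self.weights.values(), default=0.0)
        if top > 1.0:
            for pair in self.weights:
                self.weights[pair] /= top

    def weight(self, a: str, b: str) -> float:
        return self.weights.get(frozenset((a, b)), 0.0)
```

Under this reading, links that users keep following stay strong, while links that are no longer used lose weight relative to them over time.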

Page 17: Towards an automatic semantic integration of information

Querying associated information

• Users can specify search queries aiming at a precise result by means of

  • attributes

  • semantic types

  • relations (context search)

• All changes in the data pool and in the topic map can be used to trigger or control a process
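
A query that combines the three criteria could look like the sketch below. The `Topic` class and the `query` function are hypothetical and do not reflect the middleware's actual query interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Topic:
    topic_id: str
    semantic_type: str
    attributes: Dict[str, str] = field(default_factory=dict)
    related: Set[str] = field(default_factory=set)   # ids of linked topics


def query(topics: List[Topic], semantic_type: Optional[str] = None,
          related_to: Optional[str] = None, **attributes: str) -> List[str]:
    """Filter topics by semantic type, by context (relation), and by attributes."""
    hits = []
    for topic in topics:
        if semantic_type is not None and topic.semantic_type != semantic_type:
            continue
        if related_to is not None and related_to not in topic.related:
            continue
        if any(topic.attributes.get(k) != v for k, v in attributes.items()):
            continue
        hits.append(topic.topic_id)
    return hits
```

For example, `query(topics, semantic_type="Project", related_to="cust1", status="open")` would return the open projects in the context of one customer.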

Page 18: Towards an automatic semantic integration of information

Further research

• Developing more applications as concrete use cases based on the iQser Semantic Middleware

• Developing and evaluating additional analysis methods

• Implementing complex queries with multiple contexts

Page 20: Towards an automatic semantic integration of information

Technical details

• Hardware: Pentium(R) Dual Core 3 GHz, 2 GB RAM

• Software: Windows XP 2002 SP3, JBoss 4.0.4 GA, Sun JDK 1.5_12

• JBoss JVM heap size configuration: -Xms128m -Xmx512m

• 3 GB of data (Word, Excel, PowerPoint, Plain Text, HTML) are indexed and analyzed in 14 hours

• More than 70% of CPU time was spent on I/O waits

• Less than 400 MB of memory was needed
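
To put the indexing figure into perspective, the average throughput can be derived from the numbers above. The sketch assumes 1 GB = 1024 MB and a constant rate over the whole run; the function name is hypothetical.

```python
def throughput_mb_per_hour(data_gb: float, hours: float) -> float:
    """Average throughput, assuming 1 GB = 1024 MB."""
    return data_gb * 1024 / hours


rate = throughput_mb_per_hour(3, 14)
print(f"{rate:.0f} MB/h, {rate * 1024 / 3600:.0f} KB/s")
```

This works out to roughly 219 MB per hour, or about 62 KB per second, which is consistent with a run dominated by I/O waits rather than by computation.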