Fourth International Conference on Topic Maps – Research and Applications (TMRA 2008)

Towards an Automatic Semantic Integration of Information

Dr. Jörg Wurzer, iQser AG

Prof. Dr. Stefan Smolnik, European Business School (EBS)

Leipzig, October 16, 2008

Agenda

• Status quo and motivation

• New paradigm: information access by context

• Proof of Concept at EADS

• Technical architecture

• Analysis process & queries

• Further research & questions

Motivation

• The quantity of digital information is still growing: 60% per year (IDC 2008)

• Information is dispersed over documents and various applications/databases

• Growing need for creating knowledge based on available information

• Profound knowledge for management decisions, completing tasks and business processes, development of new products, sales and marketing campaigns

• Topic Maps can adopt the new results of research in semantic technologies

Today's solution I: full-text search

• Advantages: easy to use, generally accepted, familiar to users

• Disadvantages:

• Result quality depends on the keyword selection

• Results are presented as long document lists, which users have to assess manually

• The result set does not necessarily consider the user’s intention

• Each application has its own search functionality (no standards)

Today's solution II: directory hierarchy

• Advantages: content like documents can be organized considering their meaning, context, and applicability

• Disadvantages:

• A manually created hierarchy provides a static view of the content, but in practice users need different views, for example by customer, project, or product

• Documents are usually needed in several contexts and are then stored redundantly; problem: every copy has to be edited when the document changes

• Directory hierarchies often reflect the current state of knowledge; however, some documents cannot be placed appropriately in the hierarchy

New paradigm: access content in any context

• Automatically created topic maps of all content object types

• Multiple links between the content objects establish a semantic, non-hierarchical network; links are created semantically

• The user chooses a focus of interest; a topic map provides the related content; example: customers are linked to projects, contracts, products, employees, and service calls.

• Exploring the available data by navigating through a topic map

• The content could be located in heterogeneous sources and could be stored in different formats or data models; even external content could be included

Proof of Concept of iQser Middleware at EADS

• Division Defence and Communication Systems

• Requirements:

• Analysis of unstructured data of military information

• Automatically created network of content objects

• Automatically created network of main concepts

• All links between documents have to be justified

• Benchmark: a system with a manually created ontology

Application screenshot (modified data due to confidentiality)

• The created topic map provides transparent relations between documents

• The terms tree provides users with an overview of the document base’s content as well as of related fundamental facts

• In the PoC for EADS, the concept tree shows that “Biber” is a bridge-laying tank and shows the location of the anti-missile defense

• The information quality of both the tree and the topic map is high and can compete with that of a manually created ontology

Results

Uniform Information Layer (UIL)

• Single point of access for all content object types

• Connector for each type of structured and unstructured content from any source (document, database, application): transforms data into a semantically typed generic content object and stores modified data back.

• No redundantly stored data

• Searching across heterogeneous sources including the web is possible

• Users can specify search queries by means of attributes
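
The connector idea can be illustrated with a minimal Python sketch. All names here (`ContentObject`, `Connector`, `DictConnector`, `UniformInformationLayer`) are hypothetical and not the iQser API; plain dictionaries stand in for real sources such as a database or a file system.

```python
from dataclasses import dataclass, field
from typing import Iterator, Protocol


@dataclass
class ContentObject:
    """Generic, semantically typed content object (hypothetical shape)."""
    source_id: str        # points back to the source; the data is not stored twice
    semantic_type: str    # e.g. "customer", "document"
    attributes: dict[str, str] = field(default_factory=dict)


class Connector(Protocol):
    """One connector per source type (assumed interface)."""
    def read(self) -> Iterator[ContentObject]: ...
    def write_back(self, obj: ContentObject) -> None: ...


class DictConnector:
    """Toy connector: a dict of rows stands in for a real source."""

    def __init__(self, prefix: str, semantic_type: str, rows: dict[str, dict[str, str]]) -> None:
        self._prefix = prefix
        self._semantic_type = semantic_type
        self._rows = rows

    def read(self) -> Iterator[ContentObject]:
        # Transform each source row into a generic content object.
        for key, row in self._rows.items():
            yield ContentObject(f"{self._prefix}:{key}", self._semantic_type, dict(row))

    def write_back(self, obj: ContentObject) -> None:
        # Store modified attributes back in the source.
        key = obj.source_id.split(":", 1)[1]
        self._rows[key] = dict(obj.attributes)


class UniformInformationLayer:
    """Single point of access over all connectors."""

    def __init__(self, connectors: list[Connector]) -> None:
        self._connectors = list(connectors)

    def objects(self) -> Iterator[ContentObject]:
        for connector in self._connectors:
            yield from connector.read()
```

In this sketch the layer holds no data of its own: objects are produced on demand from the sources and changes are written back through the connector that owns them.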

Architecture of iQser Semantic Middleware

• All content changes (and changes of the topic map) trigger an event

• All user actions are tracked

• Every change, or a specified number of user actions, triggers the analysis process

• Combination of three analysis methods: Syntax Analyzer, Pattern Analyzer, Semantic Analyzer

• More analyzers could be included according to customers' needs

• Pairs of content objects can have n relations with calculated weights
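
A minimal Python sketch of the trigger logic described above, under stated assumptions: the class name `AnalysisTrigger`, the callback signature, and the threshold value are illustrative and not taken from the middleware.

```python
from typing import Callable


class AnalysisTrigger:
    """Starts the analysis on every content change or after N user actions."""

    def __init__(self, analyze: Callable[[str], None], action_threshold: int = 3) -> None:
        self._analyze = analyze              # stands in for the analysis process
        self._threshold = action_threshold   # assumed "specified number" of actions
        self._pending_actions = 0

    def on_content_change(self, object_id: str) -> None:
        # Every content change triggers the analysis immediately.
        self._analyze(f"content change: {object_id}")

    def on_user_action(self, object_id: str) -> None:
        # User actions are tracked; the analysis runs once enough have accumulated.
        self._pending_actions += 1
        if self._pending_actions >= self._threshold:
            self._pending_actions = 0
            self._analyze("user actions")
```

Usage: `AnalysisTrigger(print, action_threshold=2)` would print a line for each content change and for every second user action.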

Analysis process

Syntax Analyzer

• Each content object can have multiple key attributes defined in the content provider

• Examples: full name of a person, sender and recipient of an email, project ID

• The Syntax Analyzer checks whether these key attributes are related to attributes of other content objects in the data pool
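
As a rough illustration, key-attribute matching could look like the following Python sketch. It assumes that "related" means an exact value match, which is a simplification; the function name and data layout are hypothetical.

```python
def syntax_links(
    objects: dict[str, dict[str, str]],
    key_attributes: dict[str, list[str]],
) -> set[tuple[str, str, str]]:
    """Return (source_id, target_id, shared_value) for every key-attribute match.

    objects:        object id -> attribute name -> value
    key_attributes: object id -> names of its key attributes
    """
    links = set()
    for source_id, attrs in objects.items():
        for key in key_attributes.get(source_id, []):
            value = attrs.get(key)
            if not value:
                continue
            # Look for the same value in any attribute of any other object.
            for target_id, target_attrs in objects.items():
                if target_id != source_id and value in target_attrs.values():
                    links.add((source_id, target_id, value))
    return links
```

With a person object whose key attribute is the full name and an email whose sender carries the same name, the sketch links the two objects in both directions.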

Pattern Analyzer

• The Pattern Analyzer extracts the meaningful words according to significance

• It transforms a selected set of words into a data query; the result is a list of similar content objects

• The similarity is described by a weight between 0 and 1

• The Pattern Analyzer considers the context of used words in a text; it therefore reflects the different use of words in different contexts
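
The slides do not say how significance and similarity are computed. The Python sketch below swaps in a standard technique, TF-IDF weighting with cosine similarity, purely as an illustration of "significant words" and a weight between 0 and 1; it does not model the context sensitivity mentioned above, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Weight each word by term frequency times inverse document frequency."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n = len(docs)
    df = Counter(word for c in counts.values() for word in c)
    # Words that occur in every document get weight 0 (not significant).
    return {
        doc_id: {w: tf * math.log(n / df[w]) for w, tf in c.items()}
        for doc_id, c in counts.items()
    }


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return min(1.0, dot / (norm_a * norm_b))  # clamp rounding noise


def similar(doc_id: str, vectors: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Other documents with a non-zero similarity weight, best match first."""
    scored = [
        (other, cosine(vectors[doc_id], vec))
        for other, vec in vectors.items()
        if other != doc_id
    ]
    return sorted([s for s in scored if s[1] > 0.0], key=lambda s: s[1], reverse=True)
```

Because all TF-IDF weights are non-negative, the cosine value always falls between 0 and 1, matching the weight range on the slide.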

Semantic Analyzer

• Background: the meaning of words and sentences in a language is not defined abstractly but indirectly manifested in the daily use of language

• The Semantic Analyzer evaluates the tracked user actions

• If two content objects are selected, edited, or created in sequence, the Semantic Analyzer creates a link between these objects

• The weight of such a link grows if the same sequence of content objects occurs again

• The weights of content object links can shrink if a weight has a value larger than 1

• The topic map is self-optimizing considering the customers’ interests
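
The following Python sketch illustrates one possible reading of these rules. The fixed increment per repetition and the rescaling of all weights once one of them exceeds 1 are assumptions made for the example; the slides do not specify either.

```python
from collections import defaultdict
from typing import Optional


class SemanticAnalyzer:
    """Links content objects that users touch one after another."""

    def __init__(self, step: float = 0.4) -> None:
        self.step = step                                   # assumed increment per repetition
        self.weights: dict[tuple[str, str], float] = defaultdict(float)
        self._previous: Optional[str] = None

    def track(self, object_id: str) -> None:
        """Record a user action (select, edit, create) on a content object."""
        if self._previous is not None and self._previous != object_id:
            pair = tuple(sorted((self._previous, object_id)))  # undirected link
            self.weights[pair] += self.step
            self._rescale()
        self._previous = object_id

    def _rescale(self) -> None:
        # Assumed rule: once a weight exceeds 1, scale all weights so the
        # largest is exactly 1; links that are not reinforced shrink.
        top = max(self.weights.values())
        if top > 1.0:
            for pair in self.weights:
                self.weights[pair] /= top
```

In this reading, a link that users keep following stays at the top of the range while links that are no longer used lose weight over time, which is one way the topic map could adapt to user interests.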

Querying associated information

• Users can specify search queries aiming at a precise result by means of

• attributes

• semantic types

• relations (context search)

• All changes in the data pool and in the topic map can be used to trigger or control a process
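
A query that combines the three criteria could be sketched in Python as follows. The `query` function, its parameters, and the link table keyed by object-id pairs are hypothetical and only illustrate the idea of a context search.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ContentObject:
    object_id: str
    semantic_type: str
    attributes: dict[str, str] = field(default_factory=dict)


def query(
    objects: list[ContentObject],
    links: dict[tuple[str, str], float],
    semantic_type: Optional[str] = None,
    attributes: Optional[dict[str, str]] = None,
    related_to: Optional[str] = None,
    min_weight: float = 0.0,
) -> list[ContentObject]:
    """Filter by semantic type, attribute values, and relation to a context object."""
    results = []
    for obj in objects:
        if semantic_type is not None and obj.semantic_type != semantic_type:
            continue
        if attributes and any(obj.attributes.get(k) != v for k, v in attributes.items()):
            continue
        if related_to is not None:
            # Context search: keep only objects linked to the context object.
            weight = max(
                links.get((related_to, obj.object_id), 0.0),
                links.get((obj.object_id, related_to), 0.0),
            )
            if weight <= 0.0 or weight < min_weight:
                continue
        results.append(obj)
    return results
```

For example, `query(objects, links, semantic_type="project", related_to="customer:1", min_weight=0.5)` would return the projects that are strongly linked to one customer.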

Further research

• Developing more applications as concrete use cases based on the iQser Semantic Middleware

• Developing and evaluating additional analysis methods

• Implementing complex queries with multiple contexts

Thank you!

Dr. Jörg Wurzer, +49 172 [email protected]

http://www.iqser.com

Technical details

• Hardware: Pentium(R) Dual Core 3 GHz, 2 GB RAM

• Software: Windows XP 2002 SP3, JBoss 4.0.4 GA, Sun JDK 1.5_12

• JBoss JVM heap size configuration: -Xms128m -Xmx512m

• 3 GB of data (Word, Excel, PowerPoint, Plain Text, HTML) are indexed and analyzed in 14 hours

• More than 70% of CPU time was spent in I/O waits

• The process needed less than 400 MB of memory
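
For orientation, the indexing figures above translate into an average throughput as follows. This is simple arithmetic on the stated numbers, assuming 1 GB = 1024 MB; the variable names are illustrative.

```python
DATA_GB = 3    # indexed and analyzed data, from the slide
HOURS = 14     # total processing time, from the slide

data_mb = DATA_GB * 1024
mb_per_hour = data_mb / HOURS                     # roughly 219 MB per hour
kb_per_second = data_mb * 1024 / (HOURS * 3600)   # roughly 62 KB per second

print(f"{mb_per_hour:.0f} MB/h, {kb_per_second:.0f} KB/s")
```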
