Fourth International Conference on Topic Maps – Research and Applications (TMRA 2008)
Towards an Automatic Semantic Integration of Information
Dr. Jörg Wurzer, iQser AG
Prof. Dr. Stefan Smolnik, European Business School (EBS)
Leipzig, October 16, 2008
Agenda
• Status quo and motivation
• New paradigm: information access by context
• Proof of Concept at EADS
• Technical architecture
• Analysis process & queries
• Further research & questions
Motivation
• The quantity of digital information is still growing: 60% per year according to IDC (2008)
• Information is dispersed over documents and various applications/databases
• Growing need for creating knowledge based on available information
• Profound knowledge for management decisions, completing tasks and business processes, development of new products, sales and marketing campaigns
• Topic Maps can adopt the new results of research in semantic technologies
Today's solution I: full-text search
• Advantages: easy to use, generally accepted, users are highly experienced with it
• Disadvantages:
• Result quality depends on the keyword selection
• Results are presented as long document lists, which users have to assess manually
• The result set does not necessarily consider the user’s intention
• Each application has its own search functionality (no standards)
Today's solution II: directory hierarchy
• Advantages: content like documents can be organized considering their meaning, context, and applicability
• Disadvantages:
• A manually created hierarchy provides a static view of the content, but in practice users need different views, for example by customer, project, or product
• Documents are usually needed in several contexts and are then stored redundantly; problem: every copy has to be edited when the document changes
• Directory hierarchies often reflect the current state of knowledge; however, some documents cannot be placed appropriately in the hierarchy
New paradigm: access content in any context
• Automatically created topic maps of all content object types
• Multiple links between the content objects establish a semantic, non-hierarchical network; links are created semantically
• The user chooses his focus of interest; a topic map provides the related content; example: customers are linked to projects, contracts, products, employees, and service calls.
• Exploring the available data by navigating through a topic map
• The content could be located in heterogeneous sources and could be stored in different formats or data models; even external content could be included
Proof of Concept of iQser Middleware at EADS
• Division Defence and Communication Systems
• Requirements:
• Analysis of unstructured data of military information
• Automatically created network of content objects
• Automatically created network of main concepts
• All links between documents have to be justified
• Benchmark: a system with a manually created ontology
Application screenshot (modified data due to confidentiality)
• The created topic map provides transparent relations between documents
• The terms tree provides users with an overview of the document base’s content as well as of related fundamental facts
• In the PoC for EADS, the concept tree shows that “Biber” is a bridge tank and shows the location of the anti-missile defense
• The tree’s information quality as well as the topic map’s quality is high and can compete with that of a manually created ontology
Results
Uniform Information Layer (UIL)
• Single point of access for all content object types
• One connector for each type of structured and unstructured content from any source (document, database, application): it transforms data into a semantically typed generic content object and writes modified data back to the source
• No redundantly stored data
• Searching across heterogeneous sources including the web is possible
• Users can specify search queries by means of attributes
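The connector idea can be illustrated with a minimal sketch. The class and field names below (ContentObject, Connector, InMemoryCrmConnector, UniformInformationLayer) are hypothetical and chosen for illustration only; they are not the iQser API. The sketch assumes that a connector reads records from its source, wraps them in typed generic objects, and writes changes back, so that no second copy of the data is kept.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ContentObject:
    """Generic, semantically typed content object (field names are assumptions)."""
    object_id: str
    semantic_type: str  # e.g. "customer", "email", "document"
    attributes: dict = field(default_factory=dict)


class Connector(ABC):
    """One connector per source; reads from and writes back to that source."""

    @abstractmethod
    def read_all(self) -> list:
        ...

    @abstractmethod
    def write_back(self, obj: ContentObject) -> None:
        ...


class InMemoryCrmConnector(Connector):
    """Toy connector over a dict that stands in for a CRM database."""

    def __init__(self, rows: dict):
        self.rows = rows  # the source remains the only place where data is stored

    def read_all(self) -> list:
        return [ContentObject(key, "customer", dict(row)) for key, row in self.rows.items()]

    def write_back(self, obj: ContentObject) -> None:
        self.rows[obj.object_id] = dict(obj.attributes)


class UniformInformationLayer:
    """Single point of access across all registered connectors."""

    def __init__(self, connectors):
        self.connectors = list(connectors)

    def search(self, **attribute_filter) -> list:
        # Return every object whose attributes match all given key/value pairs.
        return [
            obj
            for connector in self.connectors
            for obj in connector.read_all()
            if all(obj.attributes.get(k) == v for k, v in attribute_filter.items())
        ]
```

A caller would register one connector per source and then query the layer by attribute, for example uil.search(city="Leipzig"), without knowing where the matching objects are stored.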
Architecture of iQser Semantic Middleware
• All content changes (and changes of the topic map) trigger an event
• All user actions are tracked
• Every change, or a specified number of user actions, triggers the analysis process
• Combination of three analysis methods: Syntax Analyzer, Pattern Analyzer, Semantic Analyzer
• More analyzers could be included according to customers' needs
• Each pair of content objects can have n relations, each with a calculated weight
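The trigger logic and the weighted relations can be sketched as follows. This is an illustrative model, not the middleware's actual implementation: the class name AnalysisTrigger, the default threshold of three user actions, and the analyzer signature (a callable that returns tuples of two object ids, a relation name, and a weight) are all assumptions.

```python
from collections import defaultdict


class AnalysisTrigger:
    """Runs the analysis on every content change or after N tracked user actions."""

    def __init__(self, analyzers, action_threshold=3):
        self.analyzers = analyzers                # callables: pool -> (a, b, relation, weight) tuples
        self.action_threshold = action_threshold  # assumed value, configurable
        self.tracked_actions = []
        self.relations = defaultdict(dict)        # (a, b) -> {relation name: weight}
        self.runs = 0

    def on_content_change(self, pool):
        # Every content change triggers the analysis immediately.
        self._analyze(pool)

    def on_user_action(self, action, pool):
        # User actions are tracked and trigger the analysis once enough have accumulated.
        self.tracked_actions.append(action)
        if len(self.tracked_actions) >= self.action_threshold:
            self._analyze(pool)
            self.tracked_actions.clear()

    def _analyze(self, pool):
        self.runs += 1
        for analyzer in self.analyzers:
            for a, b, relation, weight in analyzer(pool):
                pair = tuple(sorted((a, b)))      # store each pair once, independent of direction
                self.relations[pair][relation] = weight
```

Keeping a dictionary of relations per pair is one simple way to let two content objects carry several differently weighted relations at once.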
Analysis process
Syntax Analyzer
• Each content object can have multiple key attributes defined in the content provider
• Examples: full name of a person, sender and recipient of an email, project ID
• The Syntax Analyzer checks whether these key attributes are related to attributes of other content objects in the data pool
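A minimal sketch of such a key attribute check is shown below. It assumes that "related" means an exact value match, which is a simplification; the function name syntax_links and the data layout are hypothetical.

```python
def syntax_links(pool, key_attributes):
    """Link objects whose key attribute value equals an attribute value of another object.

    pool: dict of object id -> attribute dict
    key_attributes: dict of object id -> list of attribute names declared as keys
    Returns tuples (source id, key attribute, target id, matching attribute).
    """
    links = []
    for source_id, keys in key_attributes.items():
        for key in keys:
            value = pool[source_id].get(key)
            if value is None:
                continue
            for target_id, attrs in pool.items():
                if target_id == source_id:
                    continue
                for attr_name, attr_value in attrs.items():
                    if attr_value == value:
                        links.append((source_id, key, target_id, attr_name))
    return links
```

With a person object whose key attribute is the full name and an email whose sender carries the same name, the function returns one link from the person to the email, together with the attribute that justifies it.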
Pattern Analyzer
• The Pattern Analyzer extracts the meaningful words according to significance
• Transforms a selected set of words into a data query; the result is a list of similar content objects
• The similarity is described by a weight between 0 and 1
• The Pattern Analyzer considers the context of used words in a text; it therefore reflects the different use of words in different contexts
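The first three steps can be sketched as follows. The slides do not say how significance is measured, so this sketch substitutes plain TF-IDF as an assumption, and it uses the share of query words found in another object as the weight between 0 and 1. It does not model the context sensitivity mentioned in the last bullet. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def significant_words(doc_id, docs, top_k=3):
    """Rank the words of one document by TF-IDF and keep the top_k."""
    counts = Counter(tokenize(docs[doc_id]))
    n_docs = len(docs)

    def idf(word):
        df = sum(1 for text in docs.values() if word in tokenize(text))
        return math.log(n_docs / df)

    # Highest score first; ties are broken alphabetically for a stable result.
    ranked = sorted(counts, key=lambda w: (-counts[w] * idf(w), w))
    return ranked[:top_k]


def similar_objects(doc_id, docs, top_k=3):
    """Use the significant words as a query; weight = share of query words found."""
    query = set(significant_words(doc_id, docs, top_k))
    if not query:
        return {}
    result = {}
    for other_id, text in docs.items():
        if other_id == doc_id:
            continue
        weight = len(query & set(tokenize(text))) / len(query)
        if weight > 0:
            result[other_id] = weight
    return result
```

Words that occur in every document receive a score of zero under TF-IDF, so common function words drop out of the query without a separate stop word list.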
Semantic Analyzer
• Background: the meaning of words and sentences in a language is not defined abstractly but indirectly manifested in the daily use of language
• The Semantic Analyzer evaluates the tracked user actions
• If two content objects are selected, edited, or created in a sequence, the Semantic Analyzer creates a link between these objects
• The weight of such a link grows if the same sequence of content objects occurs again
• The weights of content object links can shrink if a weight exceeds 1
• The topic map is self-optimizing considering the customers’ interests
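The weighting scheme can be sketched as follows. The step size of 0.4 is an assumed value, and the rescaling rule is only one possible reading of the shrinking behaviour: when any weight exceeds 1, all weights are divided by the largest one. The class name SemanticAnalyzer is reused from the slide for readability; the code is not the product's implementation.

```python
from collections import defaultdict


class SemanticAnalyzer:
    """Derives weighted links from the order in which users touch content objects."""

    def __init__(self, increment=0.4):
        self.increment = increment         # assumed step size per observed sequence
        self.weights = defaultdict(float)  # (a, b) -> weight

    def observe(self, action_sequence):
        """action_sequence: object ids in the order they were selected, edited, or created."""
        for first, second in zip(action_sequence, action_sequence[1:]):
            if first != second:
                self.weights[tuple(sorted((first, second)))] += self.increment
        self._rescale()

    def _rescale(self):
        # Assumed reading of the shrinking rule: if any weight exceeds 1,
        # scale all weights so that the largest becomes exactly 1.
        largest = max(self.weights.values(), default=0.0)
        if largest > 1.0:
            for pair in self.weights:
                self.weights[pair] /= largest
```

Under this reading, a link that users keep reinforcing stays at the top, while links that are no longer used lose weight relative to it, which is one way a topic map could adapt to user interests over time.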
Querying associated information
• Users can specify search queries aiming at a precise result by means of
• attributes
• semantic types
• relations (context search)
• All changes in the data pool and in the topic map can be used to trigger or control a process
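A query that combines the three criteria can be sketched as follows. The function name query, its parameters, and the data layout are hypothetical; the sketch assumes that a context search returns the objects linked to a given focus object, optionally above a minimum weight.

```python
def query(objects, links, semantic_type=None, attributes=None, related_to=None, min_weight=0.0):
    """Filter content objects by semantic type, attributes, and relation to a focus object.

    objects: dict of id -> {"type": str, "attributes": dict}
    links: dict of (id, id) -> weight
    """
    attributes = attributes or {}
    hits = []
    for object_id, obj in objects.items():
        if semantic_type is not None and obj["type"] != semantic_type:
            continue
        if any(obj["attributes"].get(k) != v for k, v in attributes.items()):
            continue
        if related_to is not None:
            # Links are treated as undirected, so both key orders are checked.
            weight = max(
                links.get((object_id, related_to), 0.0),
                links.get((related_to, object_id), 0.0),
            )
            if weight <= 0 or weight < min_weight:
                continue
        hits.append(object_id)
    return sorted(hits)
```

For example, query(objects, links, semantic_type="project", related_to="cust1") returns all projects linked to the customer with the id cust1.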
Further research
• Developing more applications as concrete use cases based on the iQser Semantic Middleware
• Developing and evaluating additional analysis methods
• Implementing complex queries with multiple contexts
Thank you!
Dr. Jörg Wurzer
Technical details
• Hardware: Pentium(R) Dual Core 3 GHz, 2 GB RAM
• Software: Windows XP 2002 SP3, JBoss 4.0.4 GA, Sun JDK 1.5.0_12
• JBoss JVM heap size configuration: -Xms128m -Xmx512m
• 3 GB of data (Word, Excel, PowerPoint, Plain Text, HTML) are indexed and analyzed in 14 hours
• More than 70% of CPU time was spent in I/O waits
• The process needed less than 400 MB of memory
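For orientation, the indexing figures above can be converted into an average throughput. This is plain arithmetic on the numbers from the slide, assuming 1 GB = 1024 MB and an even load over the 14 hours.

```python
data_mb = 3 * 1024  # 3 GB of source data, assuming 1 GB = 1024 MB
hours = 14          # total time for indexing and analysis

mb_per_hour = data_mb / hours
kb_per_second = data_mb * 1024 / (hours * 3600)

print(f"{mb_per_hour:.1f} MB/h, {kb_per_second:.1f} KB/s")  # 219.4 MB/h, 62.4 KB/s
```

An average of roughly 62 KB/s is far below what a local disk can deliver, so the run was apparently not limited by raw read bandwidth alone.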