© 2003 ibm corporation 1 mukesh mohania presentation subtitle: 20pt arial regular, teal r045 | g182...
TRANSCRIPT
© 2003 IBM Corporation1
Mukesh Mohania June 7, 2007
Mukesh Mohania
IBM India research Lab
A Journey from Data Warehousing to
Active Information Integration
Mukesh Mohania2
Outline
What is Data Warehousing (DW)? What is Information Integration (II)? Existing Solutions From DW II Event based (Active) Information Integration Context-Oriented Information Integration
IBM India Research Lab
IBM Research © 2007 IBM Corporation
Structured and Unstructured Information
Information content in an enterprise can be structured or unstructuredStructured Content: payroll, sales orders, invoice, customer profiles, etc.
Unstructured Content:: emails, reports, web-pages, complaints, information on sales, customers, competitors, products, suppliers and people, etc.
According to recent estimates, structured content < 20%, unstructured content > 80%
Historically, the structured and unstructured data management technologies have evolved separately Artificial separation between these two “kinds” of information
Enterprises are realizing the need to bridge this separation, and are demanding integrated retrieval, management and analysis of both the structured and unstructured content
Mukesh Mohania4
From Data Warehouse to Information Integration
Data warehousing was first driven by a need for consistent business information from disparate systems.
Business needs•Support for decision making, based on
•Historical or point-in-time view of the business•Aligned across different departments
Technical limitations•OLTP systems and performance must be protected•Historical and summary data stores needed•Reconciliation of data in different systems is slow•Ad hoc query performance needs to be optimized
Mukesh Mohania5
Mid-1980s: Data Warehousing
Business data warehouse
Operational systems
Data marts
Metadata
Data warehouse• Reconciling disparate data• Single version of the truth• Historical record
Characteristics• Information not needed immediately• Structured data• Unidirectional data flow• Trusted sources
Mukesh Mohania6
The need for consistency extends to more immediate business information
Technical limitationsOLTP systems and performance must be protectedHistorical and summary data stores neededReconciliation of data in different systems is slowAd hoc query performance needs to be optimised
Business needsSupport for decision making, based on
Historical or point-in-time view of the business, as well as near real-time views
Aligned across different departments
Mukesh Mohania7
Mid-1990s: Operational Data Store
Business data warehouse
Operational systems
Data marts
Operational data store
Metadata
Operational data store•Near real-time•Reconciling a subset of data
Characteristics• Immediate and historical info needs• Structured data• Partial bi-directional data flow• Increasing technical metadata
Mukesh Mohania8
The vision: Comprehensive integration of information
Business data warehouse
Operational systems
(Operational) data marts
Information integration
Metadata
Untrusted & unstructured sources (e.g. Internet)
Integrated information • Real-time knowledge• Integrating all information
Characteristics• Immediate and historical information needs• Fully merged informational & operational needs• Structured and unstructured data • Bi-directional data flow• Caching reduces data flow• Complete business & technical metadata
VLDB 2006, SIGMOD 2007, PODS 2007
© 2003 IBM Corporation9
Mukesh Mohania June 7, 2007
Existing Solutions
IBM India Research Lab
IBM Research © 2007 IBM Corporation
Structured and Unstructured Information Integration:A Brief Background on Existing Solutions
Existing solutions can be classified in terms of the query paradigm used:
Keyword Query Based Solutions (DB2 ESE, DbXplorer/BANKS [ICDE02])Relational data exposed to search engine as virtual text documents
Query both structured and unstructured information using keywords
SQL Query Based Solutions (SQL LIKE predicate, DB2 NetSearch Extender)Text data exposed to relational engine as virtual tables with text columns
Query both structured and unstructured information using SQL
Provide SQL primitives to search text in table columns using a set of keywords
IBM India Research Lab
IBM Research © 2007 IBM Corporation
Keyword Query Based Solution: DB2 ESE
DB2Enterprise
SearchExtender
DB2Enterprise
SearchExtender
Keyword Query
IBM India Research Lab
IBM Research © 2007 IBM Corporation
Keyword Query Based Solution:DbXplorer/BANKS [ICDE02]
DbXplorer/BANKS
DbXplorer/BANKS
Search EngineSearch Engine
Keyword Query
Keyword Query
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
IBM India Research Lab
IBM Research © 2007 IBM Corporation
Keyword Query Based Solutions: Summary
Advantage: Simplicity!
DisadvantagesLess expressive (as compared to SQL)
How to ask for the information related to the five best performing stocks in the past week?
Need to specify a set of keywords that succinctly encodes the information need
Not always easy
IBM India Research Lab
IBM Research © 2007 IBM Corporation
SQL Query Based Solution:Standard SQL LIKE Predicate
DB2 UDB / DB2 Information
Integrator
DB2 UDB / DB2 Information
Integrator
SELECT stocks.price, docs.textFROM stocks, docsWHERE (stocks.name = ‘IBM’AND docs.text LIKE ‘% IBM %’)OR (stocks.name = ‘ORCL’AND docs.text LIKE ‘% ORCL %’)
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
IBM India Research Lab
IBM Research © 2007 IBM Corporation
SELECT stocks.price, docs.textFROM stocks, docsWHERE (stocks.name = ‘IBM’AND CONTAINS(docs.text, “IBM”))OR (stocks.name = ‘ORCL’AND CONTAINS(docs.text, “ORCL”))
SQL Query Based Solution:Net Search Extender
DB2 UDB / DB2 Information
Integrator
DB2 UDB / DB2 Information
Integrator
Net Search Extender
Net Search Extender
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
CONTAINS(…)
IBM India Research Lab
IBM Research © 2007 IBM Corporation
SQL Query Based Solutions: Summary
Advantages:More expressive – can specify more involved and sophisticated queries
Disadvantages:The unstructured data is still queried using keywords
Need to specify a set of keywords that succinctly encodes the information need
Not always easy
The SQL query and the embedded keyword query encode the same information need
Redundant effort
Association of documents with tuples (local context), not with the entire result (global context)
Same documents get attached to “IBM” when “IBM” is queried with “ORCL” as when “IBM” is queried with “DELL”
Mukesh Mohania17
Active Information Integration
Mukesh Mohania18
Data Stream
A data stream is a sequence of data items X1, X2, …, Xn, coming continuously from single or multiple sources where random access to data is not allowed.
Data Stream CharacteristicsStrongly regular: strongly periodic (inclusive zero time interval between two
data items), only one type of data, schema can be derived or conforms schema.
Weakly regular: weakly periodic (follows some time interval), mixed types of data but follows the order, schema can be derived.
Irregular: aperiodic, types of data unknown, no order, schema cannot be derived.
Mukesh Mohania19
Active functionalities over streaming data
Provides real-time functionalities that is needed in several advanced applications.– Alert a doctor when the blood pressure of a patient goes below X, heart
beats less than Y and ECG touches Z.– Sell all my INTC stocks at the higher trading price exchange if the price
difference at any time between two exchanges is more than 2%.– Cancel my tomorrow’s flight if there is a terrorist attack in the region of
flying.
Events can be defined on composition of data streams that can trigger some pre-defined actions (notification and alert, database change, etc.)
Context can be associated with the events – INTC was trading higher at NASDAQ at 9:32 AM since CEO of INTC
rang the opening bell.
Mukesh Mohania20
Active Rules
An active rule is composed of three components:
Event (E): Monitor - Detect - EvaluateCondition (C): Derive - Analyze - EvaluateAction (A): Collaborate - Integrate - Effect
Events: customer-event (1st purchase, new subscription, etc.) time-based event (birthday, retirement, etc.) product-based (launch of a new product, decline in sales, etc.) calendar-based (Christmas, Diwali, etc.)
Mukesh Mohania21
Rule Examples
• If the value of a transaction is less than $2 and this transaction is made by credit card, notify to the fraud detection system and send a notification to the customer service representative for calling the credit card holder immediately to check the validity of the transaction.
• If a customer has made at least 3 transactions or the total value of all transactions is more than $2000 during Christmas holidays, then offer 10% discount to the user between January 10-January 31.
• If the duration of a telephone call exceeds by more than 40 minutes, then send a notification to the fraud detection system.
• If less than 10% of the stock is sold in a retail store by the end of the week, offer a 20% discount on the non-luxury items for the next week.
Mukesh Mohania22
Architecture
Data SourcesDB Data Stream
WebMDB
Adaptor
Monitor
Adaptor Adaptor Adaptor
MonitorMonitorMonitorConnectors
Business Logic/Process
Business Logic/Process Feedback
Active Functionalities
Mukesh Mohania23
Rule Execution Engine
Event Composer and Detector
WorkflowExecution
Information Flow Engine
Rules DB
Information Integrator and Decision Analysis
Metadata
Integration Hub
© 2003 IBM Corporation24
Mukesh Mohania June 7, 2007
Context-Oriented Information Integration
Mukesh Mohania25
Motivation
CM SystemRDBMS
Relational Query Non-relational Query
DB2 Result Retrieved data and documents
Structured Data Management Unstructured Data (Content) Management
Current Scenario: Isolated Management of Structured and Unstructured Information
(20% of enterprise data) (80% of enterprise data)
Needed: Consolidated Management
Mukesh Mohania26
Motivation
CM SystemRDBMSBroker
Relational Query Content Query
Main result: Relational Result
Addition: Relevant Documents
Main result: Retrieved documents
Addition: Relevant database fragment
Consolidated Management
(100% of enterprise data)
A broker enables consolidation of information stored in Relational and Non-relational (CM) systems
Mukesh Mohania27
Problem
To enhance structured and unstructured data retrieval through symbiotic consolidation of related information. Specifically:
Enhance structured data retrieval by associating additional documents relevant to the user context with the query result
Enhance document contents by associating additional information derived from structured data
Structured data = relations, schema-based (XML) documentsUnstructured data = schema-less (free-flow) documents,
web-pages
IBM India Research Lab
IBM Research © 2004 IBM Corporation
Solution Overview
DB2 Information
Integrator
DB2 Information
Integrator
DB2Enterprise
SearchExtender
DB2Enterprise
SearchExtender
SELECT name, max(price) - min(price)FROM stocksGROUP BY nameORDER BY 2FETCH FIRST 3 ROWS ONLY
“IBM” “ORCL” “MSFT” “Database” “Software”
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
SCORESCORE
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
SELECT name, max(price) -min(price)FROM stocksGROUP BY nameORDER BY 2FETCH FIRST 3 ROWS ONLY
“Doctype:Patents”
“Doctype:Patents”
“Get the 3 companies with max price variation”
CIKM 2005 Best Paper Award
IBM India Research Lab
IBM Research © 2004 IBM Corporation
Solution Overview
DB2 Information
Integrator
DB2 Information
Integrator
DB2Enterprise
SearchExtender
DB2Enterprise
SearchExtender
SELECT name, max(price) - min(price)FROM stocksGROUP BY nameORDER BY 2FETCH FIRST 3 ROWS ONLY
“IBM” “ORCL” “MSFT” “Database” “Software”
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
SCORESCORE
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
C3C2C1
X
X
Y
X
X
Y
X
X
8
7
6
5
4
3
2
1
B
B
B
A
B
A
A
A
SELECT name, max(price) -min(price)FROM stocksGROUP BY nameORDER BY 2FETCH FIRST 3 ROWS ONLY
“Doctype:Patents”
“Doctype:Patents”
SQL Query Result
SQL Query Context
IBM India Research Lab
IBM Research © 2004 IBM Corporation
Main Idea
Specify information need in terms of SQL over the structured database
Additional information needs specified using “directives” (optional)
Automatically synthesize the “context” of the SQL query from its result and the known semantic dependencies in the structured data
Use this context and the directives to retrieve the unstructured data
IBM India Research Lab
IBM Research © 2004 IBM Corporation
Overall Architecture
Query + Directives
Enterprise Search
Query
Query Result + Context
Context + Directives
Relevant Documents
SC
OR
EUser Interface/Application
Metadata
Modified Query
Modified Query Result
Metadata
MetadataDB2 II
CM
Query Handler
Context Handler
Metadata mapping
Me
tad
ata
Re
po
sit
ory
(Cri
oll
o)
Query Result + Relevant
Documents
CMUnstruct Data Source
CMCM
Structured Data Source
Mukesh Mohania32
Applications
• Financial: Customer-centric investment account and risk assessment documents
• Health: Patient specific report and medical articles
• Telecommunications and Manufacturing: Defect statistics and engineering specifications
• Marketing: Customer transaction history and marketing documents
Mukesh Mohania33
Research problems in Active Information Integration
How to model the Active Rules? Is it just an ECA or something more? How to define events and rules along dimensional and fact tables in data
warehousing? Do we need a different data modeling schemes for Active/real-time data
warehousing? How to monitor the data sources for active integration? What data need to be materialized for computing the exact change at data
warehouse site? How to provide keyword search on data warehousing considering the data
is exposed as business objects to users? How to translate keyword query into more semantic based query for OLAP
analysis? How to handle uncertainty in keyword based queries? Single data indexing mechanism for both unstructured and structured
data? New architectures for Active/Real-time Data warehousing – handling large
number of queries in almost sub-seconds.
Mukesh Mohania34
Conclusions
• Integration Evolution1980 Federated Data Integration1985 Data warehouse1995 Operational data store1999 Client information integration2003 Information integration
• Active Information Integration Approaches• Rule based Information Integration • Co-relating structured and unstructured data