2010.03.16 pollock.edw2010.modern d ifor warehousing
DESCRIPTION
Presentation describes a modern alternative to conventional hub-based ETL and Replication for Data Warehousing
TRANSCRIPT
The following is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle’s
products remains at the sole discretion of Oracle.
Modern Data Integration for Data Warehousing
Oracle Fusion Middleware
Agenda
• Data Warehouse Problem Space (Data Intg. Focus)
• Ancient Pre-History of Data Warehouse
• “The Good Old Days” of Data Warehouse
• Revival Period for Data Warehouse
• Data Integration for Modern Data Warehousing
• Old Generation: Hub & Spoke with Invasive Capture
• New Generation: Agent-based with Non-invasive Capture
• Drive Business Value with Data Integration
• Why Replace? Isn’t my Old _____ Good Enough?
• The Oracle Solution for Data Integration
• Oracle GoldenGate
• Oracle Data Integrator
• Oracle Data Quality
Data Warehousing
P R O B L E M S P A C E
Data Warehouse Ancient History
• 1985 – 1995 “Controlled Chaos”
• Fragmented Strategy for Marts vs. Warehouse
• No practical notion of “Enterprise Data Warehouse”
• Data Integration:
• Hand-coded Scripts (External to DB)
• Not Optimized
• Procedural Transformations (PL/SQL etc)
• Few Data Integration Tools
• No Formal Methodology, Metrics or Governance
Data Warehouse Good Old Days
• 1995 – 2005 “Formal Methods and Discipline”
• Strategy Choices for Marts vs. Warehouse
• Top-down (Inmon) vs. Bottom-up (Kimball)
• Formal notion of “Enterprise Data Warehouse”
• Data Integration:
• Tool-based Data Integration Solutions
• Optimized, Parallel Server-based Transforms
• Formal Methodology, Metrics and Governance
• Reduced Reliance on Hand-coded Scripts and
Procedural Transformations (PL/SQL etc)
Data Warehouse Revival Period
• 2005 – 2015 “Specialized Warehouse Solutions”
• Technology-driven Choices for High-end DW’s
• Commodity H/W vs. Optimized Appliances
• Relational/Star vs. Columnar (vs. Cubes/OLAP)
• Database + BI vs. Distributed Analytic Apps (Hadoop etc)
• EDW as a “source of truth” vision morphs and
expands to MDM as a distinct problem domain
• Data Integration is still stuck in the “Good Old Days”
Good Old Days | Modern Alternative
Hub-based Runtime | Agent-based Runtime
Centralized ETL Server | Optimized E-LT (DW Appliance)
Mainly Batch | Mainly Real Time / Trickle Feed
Data Warehousing with
MODERN DATA INTEGRATION
Modern Data Integration Approach
Heterogeneous, Real-time, Non-Invasive, High Performance E-LT
Traditional ETL + CDC
• Invasive Capture on OLTP systems using complex Adapters
• Transformations in ETL engine on expensive middle tier servers
• Bulk load to the data warehouse with large nightly/daily batch
Modern E-LT + Real-time
• Continuous feeds from operational systems
• Non-invasive data capture
• Thin middle tier with transformations on the database platform (target)
• Mini-batches throughout the day or bulk processing nightly
[Diagram labels: Extract, Xform, Staging, Lookup Data, Load; Heterogeneous sources, Agent, Trickle, Bulk, Xform, Lookup Data]
Good Old Days of ETL Batch Integration
• Good Tools, but:
• Expensive Environments, Performance Bottlenecks, Too Many Data Hops, Proprietary Skills w/Vendor Lock-in, and Heavy Optimization in Complex Situations
• Won’t scale w/new Generation of DW’s
Callout: ETL engines require BIG H/W and heavy parallel tuning
[Diagram labels: Sources, ETL Engine(s), ETL Metadata, Stage, Lookup Data, Prod, Meta; steps: Extract, Transform, Load, Lookups/Calcs, Transform, Load; Development, QA, System (etc) Environments]
Modern Agent-based E-LT Processing
• Same Good Tools you Expect, plus:
• Reduce Data Center Costs, De-commission Servers
• Open Frameworks, Non-Proprietary SQL Skills
• Deploys Seamlessly Alone or within SOA Servers
• Scales Linearly with Modern DW Appliances
Callouts: Set-based SQL transforms typically faster; SQL Load inside DB is always faster
[Diagram labels: Sources, E-LT Agent, Data Movement, Data Transformation, Stage, Lookup Data, Prod, Meta; steps: Extract, Transform, Load, Lookups/Calcs, Transform, Load; Development, QA, System (etc) Environments]
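The E-LT idea on this slide (load raw data first, then transform with set-based SQL inside the target database) can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module as a stand-in for the warehouse; the table names (stage_orders, lookup_region, prod_sales) and the function name are hypothetical and are not taken from the presentation or from any Oracle product.

```python
import sqlite3


def load_and_transform(rows):
    """E-LT sketch: bulk-load raw rows into staging, then transform in-database."""
    conn = sqlite3.connect(":memory:")  # stand-in for the target warehouse
    conn.executescript("""
        CREATE TABLE stage_orders (order_id INTEGER, country TEXT, amount_cents INTEGER);
        CREATE TABLE lookup_region (country TEXT PRIMARY KEY, region TEXT);
        CREATE TABLE prod_sales (region TEXT PRIMARY KEY, total_amount REAL);
        INSERT INTO lookup_region VALUES ('US', 'AMER'), ('CA', 'AMER'), ('DE', 'EMEA');
    """)
    # Extract + Load: move the raw rows to the target platform untransformed.
    conn.executemany("INSERT INTO stage_orders VALUES (?, ?, ?)", rows)
    # Transform: one set-based statement does the lookup join, unit conversion,
    # and aggregation inside the database, with no separate engine tier.
    conn.execute("""
        INSERT INTO prod_sales (region, total_amount)
        SELECT l.region, SUM(s.amount_cents) / 100.0
        FROM stage_orders AS s
        JOIN lookup_region AS l ON l.country = s.country
        GROUP BY l.region
    """)
    conn.commit()
    result = conn.execute(
        "SELECT region, total_amount FROM prod_sales ORDER BY region"
    ).fetchall()
    conn.close()
    return result


result = load_and_transform([(1, "US", 1050), (2, "CA", 250), (3, "DE", 999)])
```

In a conventional ETL layout the join and aggregation would run row by row in a middle-tier engine; here the only work outside the database is moving rows into the staging table.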
Good Old Days of Real Time Replication
• Good Tools, but:
• Arcane capture process, sometimes invasive
• Okay for Data Integration Changed Data Capture, but:
• not used for Active-Active / ZDT Migrations
• not used for High Availability or Disaster Recovery
[Diagram labels: Sources, CDC Hub(s), ETL Engine(s), Transaction Apply, Mgmt Server, Stage, Lookup Data, Prod]
Agent-based Real Time Replication
• Same Good Tools you Expect, but:
• Not dependent on hardware for replication
• Capable of Heterogeneous, Active-Active Deployments
• Suitable for Zero Downtime Migrations
• Point-in-time Recovery
[Diagram labels: Sources, Capture Agent, Data Movement, Replicat Agent, Stage, Lookup Data, Prod]
Data Capture Architecture Options
• Next Generation Capabilities
• Non-invasive, heterogeneous, disk-based log access
• Suitable for CDC + High Availability & Active-Active
• Bi-directional and high performance
• Check-pointing and Simple Trail/Queue Management
[Diagram labels: capture options: On-Disk Logs, Log Tables, Triggers; operations: Updates, Inserts, Deletes; sources: Oracle, IBM DB2, MSFT SQL Server, Sybase, Teradata, Enscribe]
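For contrast with log-based capture, the trigger and log-table option named on this slide can be sketched as below. This is an illustrative sketch only, using SQLite through Python's sqlite3 module; the accounts and accounts_changes tables and the trigger names are hypothetical. It shows why the approach is usually called invasive: every source write also fires a trigger that writes a second row inside the same transaction.

```python
import sqlite3


def capture_with_triggers():
    """Trigger-based change capture: each DML statement also writes a log-table row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
        -- Log table that a downstream CDC process would poll.
        CREATE TABLE accounts_changes (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            op TEXT, id INTEGER, balance INTEGER);
        CREATE TRIGGER accounts_ins AFTER INSERT ON accounts BEGIN
            INSERT INTO accounts_changes (op, id, balance)
            VALUES ('I', NEW.id, NEW.balance);
        END;
        CREATE TRIGGER accounts_upd AFTER UPDATE ON accounts BEGIN
            INSERT INTO accounts_changes (op, id, balance)
            VALUES ('U', NEW.id, NEW.balance);
        END;
        CREATE TRIGGER accounts_del AFTER DELETE ON accounts BEGIN
            INSERT INTO accounts_changes (op, id, balance)
            VALUES ('D', OLD.id, OLD.balance);
        END;
    """)
    # Ordinary application writes; the triggers add the capture overhead.
    conn.execute("INSERT INTO accounts VALUES (1, 100)")
    conn.execute("UPDATE accounts SET balance = 150 WHERE id = 1")
    conn.execute("DELETE FROM accounts WHERE id = 1")
    conn.commit()
    changes = conn.execute(
        "SELECT op, id, balance FROM accounts_changes ORDER BY seq"
    ).fetchall()
    conn.close()
    return changes


changes = capture_with_triggers()
```

A log-based capture process would instead read the database's own on-disk transaction log, so the source schema needs no triggers or shadow tables.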
Good Old Days of Data Integration
• Monolithic & Expensive Environments
• Fragile, Hard to Manage
• Difficult to Tune or Optimize
Callout: ETL engines require BIG H/W and heavy parallel tuning
[Diagram labels: Sources, ETL Engine(s), ETL Metadata, CDC Hub(s), Transaction Apply, Mgmt Server, Stage, Lookup Data, Prod, Meta; steps: Extract, Transform, Load, Lookups/Calcs, Transform, Load; Development, QA, System (etc) Environments]
Modern Data Integration Architecture
• Lightweight, Inexpensive Environments – Agents
• Resilient, Easy to Manage – Non-Invasive
• Easy to Optimize and Tune – uses DBMS power
Callouts: Set-based SQL transforms typically faster; SQL Load inside DB is always faster
[Diagram labels: Sources, Capture Agent, Replicat Agent, E-LT Agent, Bulk Data Movement, Data Transformation, Stage, Lookup Data, Prod, Meta; steps: Extract, Transform, Load, Lookups/Calcs, Transform, Load; Development, QA, System (etc) Environments]
Data Integration Drives
B U S I N E S S V A L U E
Business Drivers for Data Integration
Add Value to the Core Business Lines
1. Do More with Less
• Design metadata-driven integration
• Leverage skills & dictate patterns
2. Compete Globally 24X7
• Ensure continuous uptime
• Access data in real time
3. Use Data for Competitive Advantage
• Ensure the quality of your data
• Actively govern most valuable asset
4. Automate and Adapt Business Processes
• Expose data services for reuse
• Orchestrate processes using SOA
Project Drivers for Data Integration
Essential Ingredient for Information Agility
Strategic Value of Data Integration
• Consistency for major enterprise initiatives like BI, DW, & MDM
• Common technical foundation platform across data silos
• Central point for data governance, availability and controls
Key Data Integration Use Cases
• BI, DW, and OLTP Data Integration & Replication
• SOA, Enterprise Integration & Modernization
• Migrations and Master Data Management
Modern Data Integration Alternatives:
W H Y R E P L A C E _______?
Why Replace _______?
• We often hear, “my company has already standardized
on __________, why should I replace it?”
Answer:
• Save Money on Data Center Costs
• Accelerate Project Delivery / TTM
• Supply Real Time Intelligence to the Business
• Reduce Batch Windows on Data Warehouse
• Unify Data Integration with SOA Plans
Save Money on Hardware/Data Center
E-LT runs on Small Commodity Servers as an Agent Process
Conventional ETL Architecture
[Diagram labels: Extract, Transform, Load]
Typical: Separate ETL Server
• Proprietary ETL Engine, Poor Performance
• High Costs for Separate Standalone Server
Next Generation Architecture
[Diagram labels: Extract, E-LT, Load, Transform]
E-LT: No New Servers
• Lower Cost: Leverage Compute Resources & Partition Workload efficiently
• Efficient: Exploits Database Optimizer
• Fast: Exploits Native Bulk Load & Other Database Interfaces
• Scalable: Scales as you add Processors to Source or Target
Benefits
• Optimal Performance & Scalability
• Better Hardware Leverage
• Easier to Manage & Lower Cost
Speed Project Delivery/Time to Market
E-LT uses Declarative SQL-style Design + Simple Runtime
• Development Productivity
• 40% Efficiency Gains
• Environment Setup (ex: BI Apps)
• 33-50% Less Complex
Number of Setup Steps 7
Number of Servers 1
Number of connections 3
Number of Setup Steps 10
Number of Servers 3
Number of connections 7
Supply Real Time Business Intelligence
Non-invasive Capture + E-LT Processing
[Diagram labels: Application; Real Time BI (using Data Copy); Analytic BI (Facts & Dims); E-LT (Mini-Batch + Transforms); Consistency Window]
Reduce Consistency Windows w/E-LT
Fewer Steps, Faster Xform, and Faster Loads vs. typical ETL
Main driver for batch window is data integrity & consistency; once lookup & calc functions begin, DW typically goes offline
Callouts: ETL engines require BIG H/W and heavy parallel tuning; Set-based SQL transforms typically faster; SQL Load inside DB is always faster
[Diagram labels: ETL Batch Window, E-LT Batch Window, DW is Online, Uptime Gains; steps: Extract, Transform, Load, Lookups/Calcs; components: Sources, ETL Engine(s), ETL Metadata, E-LT Agent, Data Movement, Stage, Lookup Data, Prod, Meta]
*What About “Pushdown Processing”?
• Pushdown Processing is what the ETL vendors do to compensate for bad performance: push the transformation processing to the Database
• Both Pushdown & E-LT have in common:
• use the power of your Data Warehouse for maximum performance
• can combine engine-based operations with DB-based transformations to accomplish any level of data transformation complexity
• can scale to any multi-TB level using parallel processing
• Only E-LT can claim:
• performance optimized for your Database, whichever DB you use
• operates without any new IT Hardware costs
• 100% Java-based
• easily embedded within your existing or planned SOA infrastructure
• is not a glorified scheduler that relies on PL/SQL or other custom-coded DB scripts to achieve maximal performance
• can entirely eliminate needless network hops for remote data joins
• can operate with no additional energy drain in your Datacenter
Unify E-LT Agent with SOA Runtime
Best of Breed Data Integration as a Shared SOA Service
Unified Management + Monitoring
• Common Runtime – 100% Java
• Common Monitoring
Example Use Cases
• Bulk Data Transformation (any2any)
• XML/EDI Large File Handling
• SOA-driven Business Intelligence
• Load DW from SOA
• Unified Data Steward Workflow (ETL Error Hospital w/BPEL PM)
• ERP Migration, Replication / Loading
• Query Offloading & Zero Downtime
E-LT Frameworks are optimal architectures for:
• Business Intelligence
• Performance Management
• Database & OLAP
• Embedded Applications
• Application Integration
• Middleware Servers
[Diagram labels: Any Data Source; High Performance ETL & Replication; Data Warehouse & OLAP]
Data Integration the:
O R A C L E S O L U T I O N
Oracle Data Integration Solution
Best-in-class Heterogeneous Platform for Data Integration
Comprehensive Data Integration Solution
• Consumers: MDM Applications, SOA Platforms, Oracle Applications, Business Intelligence, Activity Monitoring, Custom Applications
• SOA Abstraction Layer: Service Bus, Process Manager, Data Services
• Oracle GoldenGate: Log-based CDC, Bi-directional Replication, Real-time Data
• Oracle Data Integrator: ELT/ETL, Data Transformation, Bulk Data Movement
• Oracle Data Quality: Data Profiling, Data Parsing, Data Cleansing, Match and Merge
• Other capability labels: Data Federation, Data Verification, Data Lineage
[Diagram labels: OLTP System, Flat Files, Data Warehouse/Data Mart, OLAP Cube, Web 2.0, Web and Event Services, SOA, Storage]
Key Data Integration Products
• Comprehensive Integration
• ELT/ETL for Bulk Data
• Service Bus
• Process Orchestration
• Human Workflow
• Data Grid
• Heterogeneous E-LT & ETL
• High-speed Transformations
• OLAP Data Loading
• Data Warehouse Loading
• Real Time Data Replication
• Changed Data Capture
• DBMS High Availability
• Disaster Tolerance
• Business Data / Metadata
• Statistical Analysis
• Time Series Reporting
• Integrated Data Quality
• Cleansing & Parsing
• De-duplication
• High Performance
• Integrated w/ODI
• Data Service Modeling
• Query Federation
• Data Redaction
• Service Data Objects
Oracle Data Integrator Enterprise Edition
Optimized E-LT for improved Performance, Productivity and Lower TCO
• E-LT Transformation vs. E-T-L
• Declarative Set-based design
• Change Data Capture
• Hot-pluggable Architecture
• Pluggable Knowledge Modules
[Diagram labels: Legacy Sources, OLTP DB Sources, Application Sources, Any Data Warehouse, Any Planning System]
Oracle GoldenGate Overview
Enterprise-wide Solution for Real Time Data Needs
• Standardize on Single Technology for Multiple Needs
• Deploy for Continuous Availability and Real-time Data Access for Reporting / BI
• Highly Flexible
• Fast Deployments
• Lower TCO & Improved ROI
[Diagram labels: Heterogeneous Source Systems; Log Based, Real-Time Change Data Capture; OGG; Disaster Recovery, Data Protection; Zero Downtime Migration and Upgrades; Operational Reporting; Query Offloading; Data Distribution; Real-time BI; Standby (Open & Active); Reporting Database; ODS; EDW; ETL; Reporting]
How Oracle GoldenGate Works
Modular De-Coupled Architecture
Capture: committed transactions are captured (and can be filtered) as they occur by reading the transaction logs.
Trail: stages and queues data for routing.
Pump: distributes data for routing to target(s).
Route: data is compressed, encrypted for routing to target(s).
Delivery: applies data with transaction integrity, transforming the data as required.
[Diagram labels: Source Database(s), Capture, Trail, Pump, LAN/WAN/Internet over TCP/IP, Trail, Delivery, Target Database(s); Bi-directional]
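The Capture, Trail, Pump, Route, and Delivery stages described on this slide can be modeled with a toy pipeline like the one below. This is only a conceptual sketch in plain Python and is not GoldenGate's implementation or API: the log record layout, the function names (capture, pump, deliver), and the use of JSON plus zlib for the routed payload are all assumptions made for illustration, and encryption is omitted.

```python
import json
import zlib


def capture(log, table_filter=None):
    """Capture: read a transaction log and keep only committed transactions.

    Returns the source trail, a queue of committed transactions in commit order.
    """
    open_txns, trail = {}, []
    for rec in log:
        kind, txn = rec["kind"], rec["txn"]
        if kind == "change":
            # Optional filtering at capture time.
            if table_filter is None or rec["table"] in table_filter:
                open_txns.setdefault(txn, []).append(rec)
        elif kind == "commit":
            changes = open_txns.pop(txn, [])
            if changes:
                trail.append({"txn": txn, "changes": changes})
        elif kind == "rollback":
            open_txns.pop(txn, None)  # rolled-back work never reaches the trail
    return trail


def pump(trail):
    """Pump/Route: serialize and compress each transaction for the network hop."""
    return [zlib.compress(json.dumps(t).encode("utf-8")) for t in trail]


def deliver(packets, target):
    """Delivery: apply whole transactions, in commit order, to the target."""
    for packet in packets:
        txn = json.loads(zlib.decompress(packet).decode("utf-8"))
        for ch in txn["changes"]:
            rows = target.setdefault(ch["table"], {})
            if ch["op"] == "delete":
                rows.pop(ch["key"], None)
            else:  # insert or update
                rows[ch["key"]] = ch["row"]
    return target


log = [
    {"kind": "change", "txn": 1, "table": "orders", "op": "insert", "key": 10, "row": {"amt": 5}},
    {"kind": "change", "txn": 2, "table": "orders", "op": "insert", "key": 11, "row": {"amt": 7}},
    {"kind": "commit", "txn": 1},
    {"kind": "rollback", "txn": 2},
    {"kind": "change", "txn": 3, "table": "orders", "op": "update", "key": 10, "row": {"amt": 6}},
    {"kind": "commit", "txn": 3},
]
target = deliver(pump(capture(log)), {})
```

Because the stages only exchange trail data, each one can run on a different host and restart independently, which is the decoupling the slide title refers to.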
Govern Data Better with Data Quality
• Data Movement
– E-LT & ETL
– Data Transformation
– Change Data Capture
– Data Access
– Data Services
• Data Profiling
– Statistical Analysis
– Rule-based Validation
– Monitoring & Timeslice
– Fine-grained Auditing
• Data Cleansing
– Data Validation during ETL
– Data Standardization
– Address Matching & Dedup
– Error Hospital / Workflow
[Diagram labels: Data Movement; Data Cleansing; Data Quality and Profiling]
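The cleansing steps listed on this slide (standardization, validation during load, de-duplication, and routing rejects to an error hospital) might look like the following in miniature. This is a hedged sketch in plain Python, not Oracle Data Quality's behavior: the record fields, the validation rules, and the choice of a normalized email address as the match key are assumptions chosen for illustration.

```python
import re


def standardize(record):
    """Data Standardization: normalize whitespace, case, and phone formatting."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
        "phone": re.sub(r"\D", "", record["phone"]),  # keep digits only
    }


def validate(record):
    """Rule-based validation; returns a list of rule violations (empty if clean)."""
    errors = []
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record["email"]):
        errors.append("invalid email")
    if len(record["phone"]) != 10:
        errors.append("phone must have 10 digits")
    return errors


def cleanse(records):
    """Standardize, validate, and de-duplicate; rejects go to the error hospital."""
    clean, hospital, seen = [], [], set()
    for raw in records:
        rec = standardize(raw)
        errors = validate(rec)
        if errors:
            hospital.append((raw, errors))  # held for data-steward review
            continue
        if rec["email"] in seen:  # simple exact-match dedup on the normalized key
            continue
        seen.add(rec["email"])
        clean.append(rec)
    return clean, hospital


clean, hospital = cleanse([
    {"name": "  ada   lovelace", "email": "Ada@Example.com ", "phone": "(555) 123-4567"},
    {"name": "Ada Lovelace", "email": "ada@example.com", "phone": "555.123.4567"},
    {"name": "Bob", "email": "bob-at-example", "phone": "123"},
])
```

Real matching tools typically use fuzzy or rule-weighted comparison rather than an exact key, so the dedup step here is the simplest possible stand-in.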
Data Integration
C O N C L U S I O N
Modern Data Integration Approach
Heterogeneous, Real-time, Non-Invasive, High Performance E-LT
Traditional ETL + CDC
• Invasive Capture on OLTP systems using complex Adapters
• Transformations in ETL engine on expensive middle tier servers
• Bulk load to the data warehouse with large nightly/daily batch
Modern E-LT + Real-time
• Continuous feeds from operational systems
• Non-invasive data capture
• Thin middle tier with transformations on the database platform (target)
• Mini-batches throughout the day or bulk processing nightly
[Diagram labels: Extract, Xform, Staging, Lookup Data, Load; Heterogeneous sources, Agent, Trickle, Bulk, Xform, Lookup Data]
Questions
The preceding is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle’s
products remains at the sole discretion of Oracle.