TRANSCRIPT
NICO WOLF, 29 OCTOBER 2019, FRANKFURT
Commerzbank Data Revolution Tour 2019
Transforming into a digital enterprise – it begins with collecting & connecting data
"WE SEE CHALLENGES IN THE BANKING INDUSTRY AS AN OPPORTUNITY AND ARE TRANSFORMING INTO A DIGITAL ENTERPRISE." – from the Commerzbank 4.0 strategy
BDAA IS A KEY FACTOR FOR COBA 4.0 – MAIN FOCUS LAST YEAR TO SUPPORT GROWTH IN CC & PBC
Marketing/Growth – use cases: Next Best Offer, Churn/Winback, Campaign-/Marketing-Elasticity, Pricing
› Significant increase of cross-selling
› Prevention of customer churn – protecting the customer base
› More efficient use of the marketing budget
› Increased revenues through customer-individual price increases/decreases
Outcome: increase of customer base & share of wallet

Digitization/Efficiency – use cases: OCR (Optical Character Recognition), Decision Algorithms/Engines
› Higher efficiency
› In addition: sales support through increased speed & quality of processes
Outcome: automation of processes

Risk/Fraud – use cases: Single Loss Prevention, Fraud Detection, Early warning signals
› Reduction of fraud losses
› Optimized loan loss provisions
Outcome: optimized cost of risk/fraud
BDAA’S MISSION STATEMENT
Our mission: BDAA creates value from the Bank's data assets. We collect and connect internal and external data and implement state-of-the-art infrastructure. We use analytics to increase revenues and operational efficiency, prevent losses, and provide customer-centric products.
BDAA FOCUS THIS YEAR – CLOUD FIRST!

Supports the BDA² on-premises technology stack:
› Cheap storage for huge data amounts (e.g. for backups)
› Flexible provisioning of development environments
› On-demand processing and self-service in Big Data clusters

Access to cutting-edge technology:
› New risk assessment approaches
› Efficient processing and classification of transaction data

Supports the development of new business models:
› Pay-per-use credit (based on machine utilization)
› New ways of collaboration (market data, IoT, ..)
› Smart farming

A cloud platform to support "Beyond Banking".
DATA LAKE & LAB – DESIGNED TO ACHIEVE FAST DEPLOYMENT FROM PROTOTYPE TO PRODUCTION

[Diagram: the Commerzbank Data Lake as central data hub – operative systems, archives, and external data feed the Data Lake; it serves self-service BI/analytics & data distribution, data science and prototyping (Data Lab, Analytics Lab), IoT, and subsidiaries; external data also lands in the Cloud.]
Hybrid Cloud
› Fast and easy data distribution as well as data accessibility within Commerzbank
› Simplify our data lineage
› Simplify the IT landscape
› Technology spearhead of BDAA serving as a platform to evaluate new technologies
› Dedicated data analytics exploration field for Big Data & Advanced Analytics members building new services for Commerzbank and the Data Lake
› Hybrid cloud strategy to extend capacity, compute, and operational capabilities at short notice
› Access to cutting edge technology
DATA FLOW IN THE DATA LAKE
Data ingested into data lake
Data continuously updated and merged into historic data store
Data marts created to meet analytic requirements (see the sketch below)
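As a rough illustration of these three steps, the sketch below drives them with Hive QL through HiveServer2 via PyHive. The host, schemas, and table names are illustrative assumptions, and the historic store is assumed to be a Hive ACID table so that MERGE is available; this is not the platform's actual code.

```python
# A minimal sketch of the three data-flow steps; names are assumptions.
from pyhive import hive  # pip install pyhive

cursor = hive.connect(host="hive.example.com", port=10000,
                      username="svc_ingest").cursor()

# 1) Ingest: expose freshly landed raw files as an external landing table.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS landing.accounts_delta (
        account_id BIGINT, balance DECIMAL(18,2), updated_at TIMESTAMP
    ) STORED AS ORC LOCATION '/data/landing/accounts/2019-10-29'
""")

# 2) Update/merge: fold the delta into the historic data store
#    (assumed to be a transactional/ACID Hive table).
cursor.execute("""
    MERGE INTO history.accounts AS t
    USING landing.accounts_delta AS s
      ON t.account_id = s.account_id
    WHEN MATCHED THEN UPDATE SET balance = s.balance, updated_at = s.updated_at
    WHEN NOT MATCHED THEN INSERT VALUES (s.account_id, s.balance, s.updated_at)
""")

# 3) Data mart: a use-case-specific view over the consolidated history.
cursor.execute("""
    CREATE VIEW IF NOT EXISTS mart.account_balances AS
    SELECT account_id, balance FROM history.accounts
""")
```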
[Diagram: data flow – web, RDBMS, file, and mainframe sources (on-premises and cloud) are created, updated, enriched, standardized/merged, and formatted on their way from the raw landing through the persisting history (ODS/HDS, snapshot) into the data warehouse, where they are analyzed, reported on, and feed predict/learn workloads.]

Zones of the data lake:
› Landing Zone (data landing) – raw data ingest
› Annotation Zone (data provisioning) – technical transformation
› Enriched Zone – data consolidation in the Hive data layer
› Use Case Data – use-case-specific storage
COMMERZBANK BIG DATA PLATFORM

[Diagram: the application stack sits on a data operating system – on-premises scalable data storage and data managing layer with operation, orchestration, and governance; data ingest / flow & replication (Connect:Direct); a data query layer for ad-hoc use, BI reporting, remote access, and SQL clients (Hive QL, Hive HWC, Hive on Tez/LLAP); and the cloud platform and storages (cloud service, Cloud Storage, BigQuery) exchanging csv, orc, parquet, avro, etc.]

› Raw, (un-/semi-)structured data is ingested as files
› Structured data is stored in the managed Hive warehouse
› Specific storage provides fast, low-latency key-value access (sketch below)
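A rough sketch of these three access patterns in Python; hosts, paths, and table names are illustrative assumptions, and the key-value store is assumed to be HBase-style (the slide does not name it):

```python
# A sketch of the three storage/access patterns listed above; all names are
# assumptions, not the platform's actual objects.
from pyspark.sql import SparkSession
import happybase  # HBase client for the low-latency key-value store

spark = (SparkSession.builder.appName("platform-access")
         .enableHiveSupport().getOrCreate())

# 1) Raw (un-/semi-)structured data ingested as files (csv, orc, parquet, avro).
raw = spark.read.parquet("hdfs:///data/landing/payments/2019-10-29/")

# 2) Structured data in the managed Hive warehouse, queried with Hive QL
#    (on HDP 3, managed tables are reached through the Hive Warehouse
#    Connector; plain spark.sql is used here only to keep the sketch short).
txns = spark.sql(
    "SELECT account_id, amount FROM warehouse.transactions WHERE dt = '2019-10-29'"
)

# 3) Key-value storage for fast, low-latency access to single records.
profile = happybase.Connection("hbase.example.com").table("customer_profile")
row = profile.row(b"customer-42")
```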
BREAKDOWN INGEST

Flexible landing – by mixing different technologies we can easily ingest all kinds of data:
• Any source of structured, semi- and unstructured data
• Batch (end-of-day processing and historic data loads)
• Near to real time (messages via Kafka, other data via NiFi)
• CDC (Change Data Capture) for real-time changes with minimal impact on source systems
• Easy data landing with Attunity Replicate without interface programming
• Automatic switching from full load to near-time and handling of schema evolution with Attunity Compose

A rough sketch of the near-real-time Kafka path follows this list.

[Diagram: data ingest / flow & replication, Connect:Direct]
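A minimal sketch of the near-real-time landing path, using a hand-rolled consumer purely for illustration (the platform itself uses NiFi and Attunity rather than custom code); broker, topic, and target path are assumptions:

```python
# Consume CDC messages from Kafka and append them to a dated file in the
# (local stand-in for the) raw landing zone; a real pipeline writes to HDFS.
import datetime
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "core-kb.accounts.cdc",                      # assumed topic name
    bootstrap_servers="kafka.example.com:9092",  # assumed broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)

landing_file = f"landing/accounts-{datetime.date.today():%Y-%m-%d}.jsonl"
with open(landing_file, "a", encoding="utf-8") as out:
    for msg in consumer:
        out.write(json.dumps(msg.value) + "\n")
```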
BREAKDOWN CENTRAL STORAGE

Data Lake and Lab:
• Infrastructure as Code allows reliable scaling and rebuilding of multiple clusters
• Central data repository and data management using Apache Hive
• Central identity and access management via Apache Ranger, supporting LDAP integration

Central governance:
• Data governance by Apache Atlas
• Smart tagging, tag propagation, and profiling of data
• Data catalog and lineage
• Documentation of technical and business definitions with D-Quantum
• Overview of ownership and responsibility

A sketch of tagging data through the Atlas API follows this list.
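As an illustration of the tagging point, a sketch against the Apache Atlas v2 REST API; the host, credentials, qualified name, and the "PII" classification are assumptions:

```python
# Tag a Hive table in Apache Atlas; all identifiers here are assumptions.
import requests

ATLAS = "https://atlas.example.com:21000/api/atlas/v2"
AUTH = ("svc_governance", "secret")  # placeholder credentials

# Look up the table entity by its qualified name.
resp = requests.get(
    f"{ATLAS}/entity/uniqueAttribute/type/hive_table",
    params={"attr:qualifiedName": "warehouse.transactions@datalake"},
    auth=AUTH, verify=False,  # demo only; verify TLS in real use
)
guid = resp.json()["entity"]["guid"]

# Attach a classification (tag) to the entity.
requests.post(
    f"{ATLAS}/entity/guid/{guid}/classifications",
    json=[{"typeName": "PII"}],
    auth=AUTH, verify=False,
)
```

Atlas then propagates such classifications along lineage, which is the tag-propagation behavior named above.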
[Diagram: the data operating system layer – on-premises scalable data storage, the on-premises data managing layer, data query, and operation/orchestration/governance, connected to the cloud platform and storages (cloud service; csv, orc, parquet, avro, etc.; Hive QL, Hive HWC, Hive on Tez/LLAP).]
BREAKDOWN ACCESS

Provide the right solutions, with a focus on self-service:
• Easy dashboarding & BI reporting without moving data – this avoids network traffic and data redundancy and makes audits much easier
• A Data Science Analytics Workbench with packages for R, Python, and more, a web-based notebook, and an integrated Git repository for CI/CD integration
• Modern, scalable infrastructure including GPUs allows fast ML and AI development and provides scalability into cloud services
• Additional third-party tools for data wrangling and exploration
• Tools are browser-based; no client installation is necessary
• Easy, secured access to the kerberized Hadoop cluster via Apache Knox as gateway and central Ranger permission management (see the sketch after this list)

[Diagram: ad-hoc access, BI reporting, remote access, SQL client]
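A small sketch of what Knox-fronted access looks like from the user side: the gateway terminates HTTPS and checks LDAP credentials, so no cluster client or Kerberos ticket is needed. Host, topology name ("default"), and path are assumptions:

```python
# List a landing-zone directory via WebHDFS exposed through Apache Knox.
import requests

KNOX = "https://knox.example.com:8443/gateway/default"
AUTH = ("alice", "secret")  # LDAP credentials checked by Knox/Ranger

resp = requests.get(
    f"{KNOX}/webhdfs/v1/data/landing/payments",
    params={"op": "LISTSTATUS"},
    auth=AUTH, verify=False,  # demo only; verify TLS in real use
)
for f in resp.json()["FileStatuses"]["FileStatus"]:
    print(f["pathSuffix"], f["length"])
```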
CLUSTER ORCHESTRATION

[Diagram: Data Plane Services spanning the on-premises clusters:]
› Data Lab – HDP 2.6.4
› Data Lake – HDP 3.2.0, secondary lake (DR location): disaster recovery, self-service, reporting
› Data Lake – HDP 3.1.4, primary lake: first ingest, operational processing
› Analytics Lab: no production, prototypes, heavy Spark loads, ML & advanced analytics
› Hybrid Cloud: ephemeral and persistent clusters (sketch below)
› Dev/test environments: DEV, TU-C, TU-D
› HDP 3.x Data Proc dev instance
› CDP 7 (PaaS): DEV, TU-C, TU-D, Prd, Prd-DR
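The ephemeral-cluster idea can be sketched with the Google Cloud Dataproc client, assuming "Data Proc" above refers to Dataproc (consistent with Cloud Storage and BigQuery appearing earlier); project, region, and sizing are illustrative:

```python
# Spin up, use, and tear down an ephemeral cluster; names are assumptions.
from google.cloud import dataproc_v1  # pip install google-cloud-dataproc

region = "europe-west3"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "bdaa-sandbox",          # assumed project
    "cluster_name": "ephemeral-dev-001",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

operation = client.create_cluster(
    request={"project_id": "bdaa-sandbox", "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is up

# ... run the workload ...

# "Ephemeral" means the cluster is deleted again right after the workload.
client.delete_cluster(
    request={"project_id": "bdaa-sandbox", "region": region,
             "cluster_name": "ephemeral-dev-001"}
).result()
```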
ATTUNITY - LESSONS LEARNED
A world without Attunity:
› Business analyst work
› Service interface agreements
› Implementation: extract
› Connectivity setup
› Implementation: import
› Monitoring
› Documentation

New world:
› Business analyst work
› Even more BA work (understanding of the data source)
› Convincing DBA and source owner
› Implementation
› Implementation post-tasks
› Monitoring
› Scheduling in the enterprise scheduler
› Documentation
EXAMPLE OF DATA INGESTION SPEEDUP
Process data landing "OLD" (CORE KB source, end-of-day file transfer):
› Export of data to files: 90 min
› C:D file transfer: 15 min
› Transfer and ingest – transfer server to LZ: 30 min
› Transformation LZ to AZ: 30 min
› Hive: 15 min
› Total: 180 min

Process data landing "NEW" (Attunity Replicate loads from the table, Attunity Compose builds the historical data store):
› CORE KB "Attunity EoD full load": 30 min replicate + 30 min compose, total ~60 min
› CORE KB "Attunity CDC / near-time": <15 sec + <15 sec, total ~15-30 sec
METADATA MAINTENANCE
Automatic import of technical metadata and self-assessment of business metadata by the data owners reduce end-to-end complexity and costs and improve metadata quality.
"OLD" situation: a complex process with too many steps between the data owners and the metadata repository
• Data Landing was responsible for receiving and maintaining metadata belonging to the data owners
• Excel and a MySQL DB were used for temporary storage of metadata before maintenance and change management pushed it into the repository
• Superfluous landing effort
• No direct metadata control by the data owners

"NEW" setup:
• Technical metadata is ingested automatically from the source; business metadata is self-assessed by the data owners (sketch below)
• Straightforward process – no "in-betweens"
• Data is described by the data owners who actually know it
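A sketch of the automatic technical-metadata path: column names and types are read straight from Hive and handed to the repository. The repository call is a hypothetical stand-in (the real target is D-Quantum, whose API is not shown here); host and table are also assumptions:

```python
# Pull technical metadata from Hive and publish it; names are assumptions.
from pyhive import hive  # pip install pyhive


def publish_to_repository(records):
    """Hypothetical stand-in for the metadata-repository (D-Quantum) client."""
    for r in records:
        print(r)


cursor = hive.connect(host="hive.example.com", port=10000,
                      username="svc_meta").cursor()
cursor.execute("DESCRIBE warehouse.transactions")

technical_metadata = [
    {"column": name, "type": dtype, "comment": comment}
    for name, dtype, comment in cursor.fetchall()
]
publish_to_repository(technical_metadata)
```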
ARCHITECTURE ATTUNITY REPLICATION
Current databases that can be accessed:
• Oracle
• DB2 Mainframe
• Sybase
Replication process:
› Option 1: load the data as a snapshot in FULL mode
› Option 2: load the data near real-time in CDC mode

Landing:
› Land the data 1:1 into our Landing Zone in HDFS or into Kafka

Provision:
› Annotate the data
› Make it accessible via a Hive view
› Compose the data into our Annotated Zone, creating views for:
  • Full-load snapshots
  • ODS – online data store
  • HDS – historical data store in SCD2 (sketch below)
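What the composed HDS and ODS views amount to can be sketched in plain Hive QL (an illustration, not Attunity Compose's actual output; schema and column names are assumptions):

```python
# SCD2 historical data store plus an ODS view of the current records.
from pyhive import hive  # pip install pyhive

cursor = hive.connect(host="hive.example.com", port=10000,
                      username="svc_compose").cursor()

# HDS: every version of every record, bracketed by validity timestamps (SCD2).
cursor.execute("""
    CREATE TABLE IF NOT EXISTS annotated.hds_customer (
        customer_id BIGINT,
        name        STRING,
        segment     STRING,
        valid_from  TIMESTAMP,
        valid_to    TIMESTAMP,
        is_current  BOOLEAN
    ) STORED AS ORC
""")

# ODS: a view exposing only the currently valid version of each record.
cursor.execute("""
    CREATE VIEW IF NOT EXISTS annotated.ods_customer AS
    SELECT customer_id, name, segment
    FROM annotated.hds_customer
    WHERE is_current = TRUE
""")
```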