TRANSCRIPT
NICO WOLF, 29 OCTOBER 2019, FRANKFURT
Commerzbank Data Revolution Tour 2019
Transforming into a digital enterprise – it begins with collecting & connecting data
"WE SEE CHALLENGES IN THE BANKING INDUSTRY AS AN OPPORTUNITY AND ARE TRANSFORMING INTO A DIGITAL ENTERPRISE." – from the Commerzbank 4.0 strategy
BDAA IS A KEY FACTOR FOR COBA 4.0 – MAIN FOCUS LAST YEAR TO SUPPORT GROWTH IN CC & PBC
Marketing/Growth – use cases: Next Best Offer, Churn/Winback, Campaign-/Marketing-Elasticity, Pricing
› Significant increase of cross-selling
› Prevention of customer churn – protecting the customer base
› More efficient use of the marketing budget
› Increased revenues through customer-individual price increases/decreases
Outcome: increase of customer base & share of wallet

Digitization/Efficiency – use cases: OCR (Optical Character Recognition), Decision Algorithms/Engines
› Higher efficiency
› In addition: sales support through increased speed & quality of processes
Outcome: automation of processes

Risk/Fraud – use cases: Single Loss Prevention, Fraud Detection, Early warning signals
› Reduction of fraud losses
› Optimized loan loss provisions
Outcome: optimized cost of risk/fraud
BDAA’S MISSION STATEMENT
Our mission: BDAA creates value from the Bank's data assets. We collect and connect internal and external data and implement state-of-the-art infrastructure. We use analytics to increase revenues and operational efficiency, prevent losses, and provide customer-centric products.
BDAA FOCUS THIS YEAR – CLOUD FIRST!

Supports the BDA² on-premises technology stack:
› Cheap storage for huge data amounts (e.g. for backups)
› Flexible provisioning of development environments
› On-demand processing and self-service in Big Data clusters

Access to cutting-edge technology:
› New risk assessment approaches
› Efficient processing and classification of transaction data

Supports the development of new business models:
› Pay-per-use credit (based on machine utilization)
› New ways of collaboration (market data, IoT, ..)
› Smart farming

A cloud platform to support "Beyond Banking".
DATA LAKE & LAB – DESIGNED TO ACHIEVE FAST DEPLOYMENT FROM PROTOTYPE TO PRODUCTION

[Diagram: the Commerzbank Data Lake as central data hub – operative systems, archives, and external data feed the Data Lake; it serves self-service BI/analytics & data distribution, data science and prototyping (Data Lab, Analytics Lab), IoT, and subsidiaries; external data also lands in the Cloud.]
Hybrid Cloud
› Fast and easy data distribution as well as data accessibility within Commerzbank
› Simplify our data lineage
› Simplify the IT landscape
› Technology spearhead of BDAA serving as a platform to evaluate new technologies
› Dedicated data analytics exploration field for Big Data & Advanced Analytics members building new services for Commerzbank and the Data Lake
› Hybrid cloud strategy to extend capacity, compute, and operational capabilities at short notice
› Access to cutting edge technology
DATA FLOW IN THE DATA LAKE
Data ingested into data lake
Data continuously updated and merged into historic data store
Data marts created to meet analytic requirements (see the sketch below)
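As a rough illustration of these three steps, the sketch below drives them with Hive QL through HiveServer2 via PyHive. The host, schemas, and table names are illustrative assumptions, and the historic store is assumed to be a Hive ACID table so that MERGE is available; this is not the platform's actual code.

```python
# A minimal sketch of the three data-flow steps; names are assumptions.
from pyhive import hive  # pip install pyhive

cursor = hive.connect(host="hive.example.com", port=10000,
                      username="svc_ingest").cursor()

# 1) Ingest: expose freshly landed raw files as an external landing table.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS landing.accounts_delta (
        account_id BIGINT, balance DECIMAL(18,2), updated_at TIMESTAMP
    ) STORED AS ORC LOCATION '/data/landing/accounts/2019-10-29'
""")

# 2) Update/merge: fold the delta into the historic data store
#    (assumed to be a transactional/ACID Hive table).
cursor.execute("""
    MERGE INTO history.accounts AS t
    USING landing.accounts_delta AS s
      ON t.account_id = s.account_id
    WHEN MATCHED THEN UPDATE SET balance = s.balance, updated_at = s.updated_at
    WHEN NOT MATCHED THEN INSERT VALUES (s.account_id, s.balance, s.updated_at)
""")

# 3) Data mart: a use-case-specific view over the consolidated history.
cursor.execute("""
    CREATE VIEW IF NOT EXISTS mart.account_balances AS
    SELECT account_id, balance FROM history.accounts
""")
```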
[Diagram: data flow – web, RDBMS, file, and mainframe sources (on-premises and cloud) are created, updated, enriched, standardized/merged, and formatted on their way from the raw landing through the persisting history (ODS/HDS, snapshot) into the data warehouse, where they are analyzed, reported on, and feed predict/learn workloads.]

Zones of the data lake:
› Landing Zone (data landing) – raw data ingest
› Annotation Zone (data provisioning) – technical transformation
› Enriched Zone – data consolidation in the Hive data layer
› Use Case Data – use-case-specific storage
COMMERZBANK BIG DATA PLATFORM

[Diagram: the application stack sits on a data operating system – on-premises scalable data storage and data managing layer with operation, orchestration, and governance; data ingest / flow & replication (Connect:Direct); a data query layer for ad-hoc use, BI reporting, remote access, and SQL clients (Hive QL, Hive HWC, Hive on Tez/LLAP); and the cloud platform and storages (cloud service, Cloud Storage, BigQuery) exchanging csv, orc, parquet, avro, etc.]

› Raw, (un-/semi-)structured data is ingested as files
› Structured data is stored in the managed Hive warehouse
› Specific storage provides fast, low-latency key-value access (sketch below)
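A rough sketch of these three access patterns in Python; hosts, paths, and table names are illustrative assumptions, and the key-value store is assumed to be HBase-style (the slide does not name it):

```python
# A sketch of the three storage/access patterns listed above; all names are
# assumptions, not the platform's actual objects.
from pyspark.sql import SparkSession
import happybase  # HBase client for the low-latency key-value store

spark = (SparkSession.builder.appName("platform-access")
         .enableHiveSupport().getOrCreate())

# 1) Raw (un-/semi-)structured data ingested as files (csv, orc, parquet, avro).
raw = spark.read.parquet("hdfs:///data/landing/payments/2019-10-29/")

# 2) Structured data in the managed Hive warehouse, queried with Hive QL
#    (on HDP 3, managed tables are reached through the Hive Warehouse
#    Connector; plain spark.sql is used here only to keep the sketch short).
txns = spark.sql(
    "SELECT account_id, amount FROM warehouse.transactions WHERE dt = '2019-10-29'"
)

# 3) Key-value storage for fast, low-latency access to single records.
profile = happybase.Connection("hbase.example.com").table("customer_profile")
row = profile.row(b"customer-42")
```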
BREAKDOWN INGEST

Flexible landing – by mixing different technologies we can easily ingest all kinds of data:
• Any source of structured, semi- and unstructured data
• Batch (end-of-day processing and historic data loads)
• Near to real time (messages via Kafka, other data via NiFi)
• CDC (Change Data Capture) for real-time changes with minimal impact on source systems
• Easy data landing with Attunity Replicate without interface programming
• Automatic switching from full load to near-time and handling of schema evolution with Attunity Compose

A rough sketch of the near-real-time Kafka path follows this list.

[Diagram: data ingest / flow & replication, Connect:Direct]
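A minimal sketch of the near-real-time landing path, using a hand-rolled consumer purely for illustration (the platform itself uses NiFi and Attunity rather than custom code); broker, topic, and target path are assumptions:

```python
# Consume CDC messages from Kafka and append them to a dated file in the
# (local stand-in for the) raw landing zone; a real pipeline writes to HDFS.
import datetime
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "core-kb.accounts.cdc",                      # assumed topic name
    bootstrap_servers="kafka.example.com:9092",  # assumed broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)

landing_file = f"landing/accounts-{datetime.date.today():%Y-%m-%d}.jsonl"
with open(landing_file, "a", encoding="utf-8") as out:
    for msg in consumer:
        out.write(json.dumps(msg.value) + "\n")
```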
BREAKDOWN CENTRAL STORAGE

Data Lake and Lab:
• Infrastructure as Code allows reliable scaling and rebuilding of multiple clusters
• Central data repository and data management using Apache Hive
• Central identity and access management via Apache Ranger, supporting LDAP integration

Central governance:
• Data governance by Apache Atlas
• Smart tagging, tag propagation, and profiling of data
• Data catalog and lineage
• Documentation of technical and business definitions with D-Quantum
• Overview of ownership and responsibility

A sketch of tagging data through the Atlas API follows this list.
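As an illustration of the tagging point, a sketch against the Apache Atlas v2 REST API; the host, credentials, qualified name, and the "PII" classification are assumptions:

```python
# Tag a Hive table in Apache Atlas; all identifiers here are assumptions.
import requests

ATLAS = "https://atlas.example.com:21000/api/atlas/v2"
AUTH = ("svc_governance", "secret")  # placeholder credentials

# Look up the table entity by its qualified name.
resp = requests.get(
    f"{ATLAS}/entity/uniqueAttribute/type/hive_table",
    params={"attr:qualifiedName": "warehouse.transactions@datalake"},
    auth=AUTH, verify=False,  # demo only; verify TLS in real use
)
guid = resp.json()["entity"]["guid"]

# Attach a classification (tag) to the entity.
requests.post(
    f"{ATLAS}/entity/guid/{guid}/classifications",
    json=[{"typeName": "PII"}],
    auth=AUTH, verify=False,
)
```

Atlas then propagates such classifications along lineage, which is the tag-propagation behavior named above.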
[Diagram: the data operating system layer – on-premises scalable data storage, the on-premises data managing layer, data query, and operation/orchestration/governance, connected to the cloud platform and storages (cloud service; csv, orc, parquet, avro, etc.; Hive QL, Hive HWC, Hive on Tez/LLAP).]
BREAKDOWN ACCESS

Provide the right solutions, with a focus on self-service:
• Easy dashboarding & BI reporting without moving data – this avoids network traffic and data redundancy and makes audits much easier
• A Data Science Analytics Workbench with packages for R, Python, and more, a web-based notebook, and an integrated Git repository for CI/CD integration
• Modern, scalable infrastructure including GPUs allows fast ML and AI development and provides scalability into cloud services
• Additional third-party tools for data wrangling and exploration
• Tools are browser-based; no client installation is necessary
• Easy, secured access to the kerberized Hadoop cluster via Apache Knox as gateway and central Ranger permission management (see the sketch after this list)

[Diagram: ad-hoc access, BI reporting, remote access, SQL client]
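A small sketch of what Knox-fronted access looks like from the user side: the gateway terminates HTTPS and checks LDAP credentials, so no cluster client or Kerberos ticket is needed. Host, topology name ("default"), and path are assumptions:

```python
# List a landing-zone directory via WebHDFS exposed through Apache Knox.
import requests

KNOX = "https://knox.example.com:8443/gateway/default"
AUTH = ("alice", "secret")  # LDAP credentials checked by Knox/Ranger

resp = requests.get(
    f"{KNOX}/webhdfs/v1/data/landing/payments",
    params={"op": "LISTSTATUS"},
    auth=AUTH, verify=False,  # demo only; verify TLS in real use
)
for f in resp.json()["FileStatuses"]["FileStatus"]:
    print(f["pathSuffix"], f["length"])
```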
CLUSTER ORCHESTRATION

[Diagram: Data Plane Services spanning the on-premises clusters:]
› Data Lab – HDP 2.6.4
› Data Lake – HDP 3.2.0, secondary lake (DR location): disaster recovery, self-service, reporting
› Data Lake – HDP 3.1.4, primary lake: first ingest, operational processing
› Analytics Lab: no production, prototypes, heavy Spark loads, ML & advanced analytics
› Hybrid Cloud: ephemeral and persistent clusters (sketch below)
› Dev/test environments: DEV, TU-C, TU-D
› HDP 3.x Data Proc dev instance
› CDP 7 (PaaS): DEV, TU-C, TU-D, Prd, Prd-DR
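The ephemeral-cluster idea can be sketched with the Google Cloud Dataproc client, assuming "Data Proc" above refers to Dataproc (consistent with Cloud Storage and BigQuery appearing earlier); project, region, and sizing are illustrative:

```python
# Spin up, use, and tear down an ephemeral cluster; names are assumptions.
from google.cloud import dataproc_v1  # pip install google-cloud-dataproc

region = "europe-west3"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "bdaa-sandbox",          # assumed project
    "cluster_name": "ephemeral-dev-001",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

operation = client.create_cluster(
    request={"project_id": "bdaa-sandbox", "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is up

# ... run the workload ...

# "Ephemeral" means the cluster is deleted again right after the workload.
client.delete_cluster(
    request={"project_id": "bdaa-sandbox", "region": region,
             "cluster_name": "ephemeral-dev-001"}
).result()
```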
ATTUNITY - LESSONS LEARNED
A world without Attunity:
› Business analyst work
› Service interface agreements
› Implementation: extract
› Connectivity setup
› Implementation: import
› Monitoring
› Documentation

New world:
› Business analyst work
› Even more BA work (understanding of the data source)
› Convincing DBA and source owner
› Implementation
› Implementation post-tasks
› Monitoring
› Scheduling in the enterprise scheduler
› Documentation
EXAMPLE OF DATA INGESTION SPEEDUP
Process data landing "OLD" (CORE KB source, end-of-day file transfer):
› Export of data to files: 90 min
› C:D file transfer: 15 min
› Transfer and ingest – transfer server to LZ: 30 min
› Transformation LZ to AZ: 30 min
› Hive: 15 min
› Total: 180 min

Process data landing "NEW" (Attunity Replicate loads from the table, Attunity Compose builds the historical data store):
› CORE KB "Attunity EoD full load": 30 min replicate + 30 min compose, total ~60 min
› CORE KB "Attunity CDC / near-time": <15 sec + <15 sec, total ~15-30 sec
METADATA MAINTENANCE
Automatic import of technical metadata and self-assessment of business metadata by the data owners reduce end-to-end complexity and costs and improve metadata quality.
"OLD" situation: a complex process with too many steps between the data owners and the metadata repository
• Data Landing was responsible for receiving and maintaining metadata belonging to the data owners
• Excel and a MySQL DB were used for temporary storage of metadata before maintenance and change management pushed it into the repository
• Superfluous landing effort
• No direct metadata control by the data owners

"NEW" setup:
• Technical metadata is ingested automatically from the source; business metadata is self-assessed by the data owners (sketch below)
• Straightforward process – no "in-betweens"
• Data is described by the data owners who actually know it
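A sketch of the automatic technical-metadata path: column names and types are read straight from Hive and handed to the repository. The repository call is a hypothetical stand-in (the real target is D-Quantum, whose API is not shown here); host and table are also assumptions:

```python
# Pull technical metadata from Hive and publish it; names are assumptions.
from pyhive import hive  # pip install pyhive


def publish_to_repository(records):
    """Hypothetical stand-in for the metadata-repository (D-Quantum) client."""
    for r in records:
        print(r)


cursor = hive.connect(host="hive.example.com", port=10000,
                      username="svc_meta").cursor()
cursor.execute("DESCRIBE warehouse.transactions")

technical_metadata = [
    {"column": name, "type": dtype, "comment": comment}
    for name, dtype, comment in cursor.fetchall()
]
publish_to_repository(technical_metadata)
```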
ARCHITECTURE ATTUNITY REPLICATION
Current databases that can be accessed:
• Oracle
• DB2 Mainframe
• Sybase
Replication process:
› Option 1: load the data as a snapshot in FULL mode
› Option 2: load the data near real-time in CDC mode

Landing:
› Land the data 1:1 into our Landing Zone in HDFS or into Kafka

Provision:
› Annotate the data
› Make it accessible via a Hive view
› Compose the data into our Annotated Zone, creating views for:
  • Full-load snapshots
  • ODS – online data store
  • HDS – historical data store in SCD2 (sketch below)
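What the composed HDS and ODS views amount to can be sketched in plain Hive QL (an illustration, not Attunity Compose's actual output; schema and column names are assumptions):

```python
# SCD2 historical data store plus an ODS view of the current records.
from pyhive import hive  # pip install pyhive

cursor = hive.connect(host="hive.example.com", port=10000,
                      username="svc_compose").cursor()

# HDS: every version of every record, bracketed by validity timestamps (SCD2).
cursor.execute("""
    CREATE TABLE IF NOT EXISTS annotated.hds_customer (
        customer_id BIGINT,
        name        STRING,
        segment     STRING,
        valid_from  TIMESTAMP,
        valid_to    TIMESTAMP,
        is_current  BOOLEAN
    ) STORED AS ORC
""")

# ODS: a view exposing only the currently valid version of each record.
cursor.execute("""
    CREATE VIEW IF NOT EXISTS annotated.ods_customer AS
    SELECT customer_id, name, segment
    FROM annotated.hds_customer
    WHERE is_current = TRUE
""")
```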