Boost your Cloud Data & Analytics
TRANSCRIPT
Boost your Cloud Data & Analytics
Avoiding vendor lock-ins with Multi-Cloud Data Lakes
Big Data & AI in Finance, Banking & Insurance / 5th May 2021
Toma Buchinsky / Slavomir Krivak / Prashant Gangwar
Presenters
Slavomir Krivak, Big Data & Cloud Architect
Prashant Gangwar, IT Consultant
Toma Buchinsky, CEO Adastra Germany & Data Architect
01 Introduction
02 Getting started with Cloud Data & Analytics
03 Avoiding Cloud Vendor Lock-ins
04 Case Study 1: Multi-Cloud D&A with Azure DevOps
05 Case Study 2: Portable D&A Platform with Spark & Kubernetes
Agenda
Our Solution Portfolio
ARTIFICIAL INTELLIGENCE
Machine Learning
Deep Learning
Statistical Analysis
Text Mining
Exploratory Data Analysis
Visual Analytics
Feature Engineering
CLOUD SERVICES
Readiness Assessment
Cloud Provider Evaluation
Cloud Migration
Managed Services
Azure, AWS, GCP
DIGITAL BUSINESS
Digital Transformation
Robotic Process Automation (RPA)
Internet of Things (IoT)
Blockchain
Mobile Apps
DATA ENGINEERING
Data Strategy
Modern Data Warehousing
Data Lake
Data Integration
Data Visualization
Business Intelligence
GOVERNANCE
Data Governance
Data Quality
Master/Reference Data Management
Data Lineage
Metadata Management
Adastra Group worldwide
2000+ professionals
1000+ projects in 46 countries
20 offices in 10 countries
[World map: Frankfurt, Munich, Hannover, Magdeburg, Darmstadt, Wolfsburg (DE); Toronto, Vancouver (CA); Detroit, Stamford (US); Bratislava (SK); Prague (CZ); London (UK); Sofia, Varna, Plovdiv (BG); Moscow (RU); Bangkok (TH); Sydney (AU)]
The good news: German top managers want to transform their companies into data-driven businesses.
Traditional DWH (on-prem) → Data Lakes → the age of cloud-first programs
[Architecture diagram: Azure reference architecture with a structured analytics path and an advanced analytics path, from sources through ingest, raw, analyze and present stages. Sources (IoT data, files, SaaS, on-prem DBs and Active Directory) are ingested via Azure IoT Hub, Azure Stack Edge, managed file transfer, Azure Data Factory (with integration runtime and data gateway) and Azure Logic Apps, covering event handling, batch loads and IoT streaming. Data lands in Azure Data Lake Storage with stage and curated zones for structured (orc, parquet, csv), semi-structured (json, xml) and unstructured (image, wav, doc) data. Azure Synapse Analytics provides provisioned SQL (data warehouse, loaded via PolyBase), provisioned Spark (Python, R, Scala, .NET; transactional Delta Lake; workspace notebooks) and SQL/Spark on demand; advanced analytic models draw on Azure Cognitive Services, Azure ML Services, custom models, Databricks (data science) and Stream Analytics. Power BI (enterprise datasets in DAX; dashboards, reports, portal) presents results to analysts and downstream targets; Synapse Link connects transactional stores (Azure Cosmos DB, Azure SQL Database). Azure Active Directory, Synapse Studio (data development, exploration, access) and data governance (master data, reference data, data cleansing, data lineage) span the platform.]
What do large cloud providers sell?
…expecting euphoric clients.
Clients: reluctance and skepticism
⁄ Uncertainty regarding IT security and data protection
⁄ Concerns due to US Cloud Act
⁄ Lack of know-how
⁄ Complexity and costs of migration
⁄ Large investments in DWHs and data lakes in the recent past make it difficult to sell new, expensive initiatives to CFOs
⁄ Reluctance due to possible vendor lock-ins
Cloud adoption: Where to start?
Don’ts:
⁄ Big Bang Approach: all at once
⁄ Do nothing and ignore Cloud
⁄ Long & costly conceptual phase
Cloud adoption: Where to start?
Dos:
⁄ Quick identification of a suitable existing use case
⁄ Approval by IT security and data protection departments
⁄ Quick upload of relevant data (subsets) to the cloud storage (e.g. as CSV or JSON)
⁄ Data access for analysts and data scientists
⁄ Deployment of the first MVP after a maximum of 3 months
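As a sketch of the "quick upload of relevant data subsets" step: the snippet below extracts only the needed columns from a set of records and serializes them to CSV in memory, ready to be pushed to any cloud object storage. The actual upload call (e.g. via boto3 or azure-storage-blob) is omitted, and all record and field names are invented for illustration.

```python
import csv
import io

def subset_to_csv(records, fields):
    """Serialize a subset of fields from each record to CSV text,
    ready to be uploaded as an object to cloud storage."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

# Hypothetical example: keep only the columns the use case needs,
# dropping sensitive fields before anything leaves the premises.
records = [
    {"customer_id": 1, "segment": "retail", "iban": "DE89..."},
    {"customer_id": 2, "segment": "corporate", "iban": "DE12..."},
]
csv_text = subset_to_csv(records, ["customer_id", "segment"])
```

Uploading only a curated subset like this keeps the first MVP small and makes the IT security and data protection approval easier.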
[Diagram: interactive query services make data in cloud storage SQL-visible without their own data storage. A traditional DWH (persistent data storage) serves your BI tool; in the public cloud, data (csv, json, parquet) stays in cloud storage and a SQL engine exposes it to analysts (BI tools) and data scientists (data science notebooks).]
Interactive Query Services
Azure Synapse Analytics SQL on-demand
Advantages
⁄ Quick and easy setup
⁄ Data stays in the cheap cloud object storage
⁄ Data can be accessed through widely used SQL
⁄ Support of different data formats (csv, json, parquet etc.)
⁄ Simple cost structure: price is based on the volume of data processed
Disadvantages
⁄ Poor performance with complex queries and large amounts of data
⁄ Limits on concurrent queries
⁄ No user-defined functions
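To make the "SQL over cheap object storage" idea concrete: Synapse SQL on-demand reads files in the data lake directly via OPENROWSET. The helper below is a hypothetical sketch that only assembles such a query string; the storage account, container and path are placeholders, and actually submitting the query would require a Synapse workspace connection (e.g. via pyodbc), which is out of scope here.

```python
def openrowset_query(storage_url, file_format="PARQUET", top=10):
    """Build a Synapse serverless (SQL on-demand) query that reads
    files directly from Azure Data Lake Storage via OPENROWSET."""
    return (
        f"SELECT TOP {top} * FROM OPENROWSET(\n"
        f"    BULK '{storage_url}',\n"
        f"    FORMAT = '{file_format}'\n"
        f") AS rows"
    )

# Placeholder storage account and path, for illustration only
query = openrowset_query(
    "https://myaccount.dfs.core.windows.net/datalake/curated/*.parquet"
)
```

Because the data never leaves the lake, the only billed quantity is the data the query processes, which matches the cost structure described above.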
Next step: Cloud Data Processing & Analytical Data Stores
[Diagram: sources feed data (csv, json, parquet) in cloud storage; in the public cloud, complex data transformations or ML model training load an analytical data store, which serves analysts (traditional DWH / BI) and data scientists (data science notebooks).]
What to learn from the first Cloud MVPs?
Get a first impression of…
⁄ how fast new data & analytics applications can be implemented in the cloud
⁄ how cloud consumption costs develop and how to get a grip on them (hence an MVP, not just a PoC)
⁄ whether, if necessary, the legacy system can be brought into the cloud quickly, at manageable cost and ideally automatically
Cloud Data Lake & Modern DWH
[Diagram: sources (structured and semi-structured data) → data storage → data processing → analytical data store (or query engine) → analytics & reporting → users / downstream systems; cross-cutting: orchestration & monitoring, data governance, machine learning, DevOps.]
Do you have to use cloud-native technologies/tools only?
No!
Cloud Analytical Data Stores
Cloud-native: Amazon Redshift, MS Azure provisioned SQL (SQL DW)
Alternatives: [vendor logos]
Analytical Data Stores: avoiding lock-ins
⁄ Considering the analytical data store as an SQL engine box that executes analytical queries ...
→ with acceptable performance
→ at an acceptable price
→ with the necessary elasticity
⁄ Transform/aggregate data in advance (e.g. with Apache SPARK or ETL tools)
⁄ Avoid platform-native functionalities such as stored procedures etc.
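One way to follow the "SQL engine box" advice is to keep every transformation in portable ANSI SQL rather than in engine-specific stored procedures. The snippet below runs such a portable aggregation against SQLite purely as a stand-in engine; the same statement would run unchanged on Redshift or Synapse provisioned SQL. Table and column names are invented for illustration.

```python
import sqlite3

# Portable ANSI SQL aggregation: no stored procedures, no
# engine-specific functions, so the target engine stays swappable.
PORTABLE_SQL = """
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY region
"""

conn = sqlite3.connect(":memory:")  # stand-in for any SQL engine
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 100.0), ("EU", 50.0), ("US", 70.0)],
)
rows = conn.execute(PORTABLE_SQL).fetchall()
```

Being able to run the same statements against an embedded engine also gives cheap, fast unit tests for the transformation logic.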
Cloud Data Processing
Cloud-native:
⁄ Apache Spark jobs on AWS EMR or AWS Glue
⁄ Apache Spark in Azure Synapse Analytics or Azure Databricks
⁄ Apache Spark in Google Cloud Dataproc
Alternatives:
⁄ Data integration tools available in the cloud
⁄ Fully integrated data management platforms
⁄ Apache Spark / Hive / Apache NiFi jobs on a best-of-breed portable Data & Analytics platform as a Kubernetes cluster
Cloud Data Processing: avoiding lock-ins
⁄ Pay attention to portability already during the design phase
⁄ Definition of development guidelines that ensure the development of portable code
⁄ Avoid technology-specific libraries and dependencies (e.g., AWS extensions to vanilla Spark)
⁄ Use as many reusable components as possible, so that only these components have to be adapted during a migration
⁄ Test automation to ensure an easy validation after a migration
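The "reusable components" guideline can be sketched as a thin storage abstraction: job code talks to an interface, and only the small provider-specific implementations (S3, ADLS, on-prem object storage) would need swapping during a migration. Below is a minimal, hypothetical version with a local-filesystem implementation, which also supports the test-automation point, since jobs can be unit-tested without any cloud account.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path

class ObjectStore(ABC):
    """Interface the ETL jobs code against; one implementation per provider."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class LocalObjectStore(ObjectStore):
    """Filesystem-backed implementation for local runs and unit tests;
    S3/ADLS implementations would wrap boto3 / azure-storage-blob."""
    def __init__(self, root: str):
        self.root = Path(root)
    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

store = LocalObjectStore(tempfile.mkdtemp())
store.put("raw/orders.csv", b"id,amount\n1,10\n")
data = store.get("raw/orders.csv")
```

During a migration only the `ObjectStore` implementation changes; the jobs themselves stay untouched and the existing automated tests validate the new target.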
Case Study 1: Multi-Cloud D&A with Azure DevOps (Automotive)
CAP Data pipelines
Infrastructure as Code (Terraform) used for maintaining and provisioning all cloud infrastructure
S3 object storage leveraged for the best possible price/value ratio
ETL processing based on Spark ETLs (AWS Glue serverless service) implemented in Scala
Multi-Cloud CI/CD pipeline design for CAP
[Diagram: Azure DevOps build and release pipelines, integrated with GitHub, AWS and Azure. Build and validation steps: Spark ETL unit tests; Terraform config tested in an isolated sandbox; Terraform checkov tests; DB change validation; shared-library validation; automated code checks and risk analysis (security, license, Black Duck scan); SonarQube code quality checks. Release and deployment: to AWS (upload source code and metadata to S3, apply Terraform config, apply Redshift DB changes) and to Azure (upload metadata to Blob storage, apply Terraform config, deploy Databricks notebooks). Secret management and staging environments are shared across build and release pipelines.]
⁄ CI/CD pipeline implemented in Azure Cloud DevOps solution
⁄ Integrated with AWS, Azure and GitHub
⁄ Building and validation of Scala-based Glue ETL Spark jobs
⁄ Automated unit testing – unit tests are executed in Azure DevOps agents
⁄ Pull request validation pipeline supports code reviews
⁄ Infrastructure as Code (Terraform): all infrastructure changes are validated and released using the CI/CD pipeline
⁄ Pipelines configured with dev, test and prod staging environments
⁄ Dedicated database change management pipeline for validating and releasing Redshift/PostgreSQL changes
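A heavily simplified sketch of what such an Azure DevOps multi-stage pipeline could look like; stage names, commands and the artifact bucket are invented for illustration, and the real pipeline described above is considerably larger:

```yaml
# azure-pipelines.yml -- illustrative sketch only
trigger:
  branches:
    include: [main]

stages:
  - stage: Build
    jobs:
      - job: BuildAndTest
        steps:
          - script: sbt test assembly       # Scala Glue ETL unit tests + fat jar
          - script: terraform validate      # validate IaC before any release
  - stage: DeployDev
    dependsOn: Build
    jobs:
      - deployment: Dev
        environment: dev                    # staging environments: dev, test, prod
        strategy:
          runOnce:
            deploy:
              steps:
                - script: terraform apply -auto-approve
                - script: aws s3 cp target/etl.jar s3://cap-artifacts/  # placeholder bucket
```

Analogous `DeployTest` and `DeployProd` stages would follow, gated by the respective environments.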
Case Study 2: Portable D&A Platform with Spark & Kubernetes (Automotive)
Case Study: Portable Data & Analytics Platform for a Large German Corporation
First Use Case
⁄ Creation of a platform for Analytics on car configuration data, driven by new regulatory requirements
⁄ The data must be processed locally in a unified way in the covered regions (EU, US, China)
⁄ The platform should accommodate further Big Data & Analytics programs in the future
⁄ The platform should be able to be deployed on-prem as well as in the Cloud
After assessing several possible solutions (incl. Cloud, Legacy DWH solutions) a Cloud-agnostic best-of-breed approach based on Kubernetes, Apache SPARK and Object Storage was selected.
Portable D&A Platform with Spark & Kubernetes
⁄ Kubernetes: "Run K8s Anywhere - Kubernetes is open source giving you the freedom to take advantage of on-premises, hybrid, or public cloud infrastructure, letting you effortlessly move workloads to where it matters to you." (https://kubernetes.io/de/)
⁄ Apache SPARK: one of the most widely used platforms for parallel processing of Big Data, first for on-prem and now for cloud data lakes
⁄ Since 2018, Kubernetes has been available as an alternative Apache Spark cluster manager (alongside Mesos & YARN)
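Since the point above is that Kubernetes is just another Spark cluster manager, submitting a job differs from YARN mainly in the master URL and a container image setting. The helper below assembles the spark-submit arguments as a minimal sketch; the API server URL, image and jar path are placeholders, and an actual cluster would be needed to run the resulting command.

```python
def spark_on_k8s_submit_args(app_jar, main_class, k8s_master, image, executors=2):
    """Assemble spark-submit arguments for running Spark with Kubernetes
    as the cluster manager (illustrative helper; values are placeholders)."""
    return [
        "spark-submit",
        "--master", f"k8s://{k8s_master}",   # K8s API server acts as cluster manager
        "--deploy-mode", "cluster",           # driver itself runs as a pod
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]

args = spark_on_k8s_submit_args(
    "local:///opt/jobs/etl.jar", "com.example.Etl",
    "https://k8s.example.internal:6443", "registry.example/spark:3.1",
)
```

Because only the master URL and image change between environments, the same job runs on-prem or in any cloud that hosts a Kubernetes cluster.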
Solution Overview
[Diagram: data sources (databases, files, web services) flow through raw, prepared and presentation layers; data orchestration/ETL manages, schedules and monitors jobs; metadata and control logging plus a schema store support the platform; users access results via a web interface/UI, data scientists via notebooks.]
Agility with CI/CD Pipelines
1. Commit the code to SCM
2a. SCM polling through webhook, enabled for application code
4. Argo workflow triggers the job on the Kubernetes cluster
5. Pull the image and run the job
[Diagram: application code repo and ETL job DAGs repo (YAML) feed the pipeline; on Kubernetes, the Spark driver pod and executor pods run the job.]
a) The Argo workflow submits the job to Kubernetes
b) Kubernetes creates the driver pod
c) The driver requests Kubernetes to create executor pods
d) Kubernetes creates the executors
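Step a) above, the Argo workflow submitting the Spark job, could look roughly like the following Workflow manifest; the names, image and jar path are invented for illustration:

```yaml
# Illustrative Argo Workflow that launches spark-submit against the cluster
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: spark-etl-
spec:
  entrypoint: run-spark-etl
  templates:
    - name: run-spark-etl
      container:
        image: registry.example/spark:3.1         # placeholder image
        command: [spark-submit]
        args:
          - --master
          - k8s://https://kubernetes.default.svc   # in-cluster API server
          - --deploy-mode
          - cluster
          - --conf
          - spark.kubernetes.container.image=registry.example/spark:3.1
          - local:///opt/jobs/etl.jar              # placeholder job jar
```

Steps b) to d) then happen inside Kubernetes: spark-submit creates the driver pod, which in turn requests the executor pods.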
Portable from Day One
1. First prototype developed in the Adastra Germany Data Engineering Lab (AWS): AWS S3 object data storage + AWS EC2
2. Following sprints in the customer's AWS account: AWS S3 object data storage + AWS EC2
3. Final development on the customer's on-prem infrastructure: on-prem object data storage (Ceph) + on-prem VMs
One Framework, deployed everywhere
⁄ On-prem: object data storage (e.g. Ceph) + on-prem prod VMs
⁄ AWS: S3 object data storage + EC2 (dev, test, prod)
⁄ Azure: Data Lake Storage + Azure VM prod
⁄ Google Cloud: Cloud Storage + GCP Compute Engine prod
⁄ Telekom: Object Storage Services + Telekom ECS prod
Adastra Modular Cloud D&A Framework Overview
⁄ Templates, scripts and policies for networking, apps, storage, security and cloud infrastructure
⁄ CI/CD pipeline
⁄ Reusable modules: ingestion, historization and CDC, control and metadata capture, generic data transformation
⁄ Data Lake best practices
⁄ Coding best practices
⁄ Infrastructure as Code design and development
Thank you!
Adastra GmbH
Niedenau 36, 60325 Frankfurt am Main
+49 (0)69 719 779 790 / [email protected]
www.adastragrp.com