Boost your Cloud Data & Analytics
TRANSCRIPT
Boost your Cloud Data & Analytics
Avoiding vendor lock-ins with Multi-Cloud Data Lakes
Big Data & AI in Finance, Banking & Insurance / 5th May 2021
Toma Buchinsky / Slavomir Krivak / Prashant Gangwar
Presenters
Slavomir Krivak, Big Data & Cloud Architect
Prashant Gangwar, IT Consultant
Toma Buchinsky, CEO Adastra Germany & Data Architect
01 Introduction
02 Getting started with Cloud Data & Analytics
03 Avoiding Cloud Vendor Lock-ins
04 Case Study 1: Multi-Cloud D&A with Azure DevOps
05 Case Study 2: Portable D&A Platform with Spark & Kubernetes
Agenda
Our Solution Portfolio
ARTIFICIAL INTELLIGENCE
Machine Learning
Deep Learning
Statistical Analysis
Text Mining
Exploratory Data Analysis
Visual Analytics
Feature Engineering
CLOUD SERVICES
Readiness Assessment
Cloud Provider Evaluation
Cloud Migration
Managed Services
Azure, AWS, GCP
DIGITAL BUSINESS
Digital Transformation
Robotic Process Automation (RPA)
Internet of Things (IoT)
Blockchain
Mobile Apps
DATA ENGINEERING
Data Strategy
Modern Data Warehousing
Data Lake
Data Integration
Data Visualization
Business Intelligence
GOVERNANCE
Data Governance
Data Quality
Master/Reference Data Management
Data Lineage
Metadata Management
Adastra Group worldwide
2000+ professionals
1000+ projects in 46 countries
20 offices in 10 countries
[World map: Frankfurt, Munich, Hannover, Magdeburg, Darmstadt, Wolfsburg (DE); Toronto, Vancouver (CA); Detroit, Stamford (US); Bratislava (SK); Prague (CZ); London (UK); Sofia, Varna, Plovdiv (BG); Moscow (RU); Bangkok (TH); Sydney (AU)]
The good news: German top managers want to transform their companies into data-driven businesses.
Traditional DWH (on-prem) → Data Lakes → the age of cloud-first programs
[Architecture diagram: Azure reference architecture with a structured analytics path and an advanced analytics path, from sources through ingest, raw, analyze and present stages. Sources (IoT data, files, SaaS, on-prem DBs and Active Directory) are ingested via Azure IoT Hub, Azure Stack Edge, managed file transfer, Azure Data Factory (with integration runtime and data gateway) and Azure Logic Apps, covering event handling, batch loads and IoT streaming. Data lands in Azure Data Lake Storage with stage and curated zones for structured (orc, parquet, csv), semi-structured (json, xml) and unstructured (image, wav, doc) data. Azure Synapse Analytics provides provisioned SQL (data warehouse, loaded via PolyBase), provisioned Spark (Python, R, Scala, .NET; transactional Delta Lake; workspace notebooks) and SQL/Spark on demand; advanced analytic models draw on Azure Cognitive Services, Azure ML Services, custom models, Databricks (data science) and Stream Analytics. Power BI (enterprise datasets in DAX; dashboards, reports, portal) presents results to analysts and downstream targets; Synapse Link connects transactional stores (Azure Cosmos DB, Azure SQL Database). Azure Active Directory, Synapse Studio (data development, exploration, access) and data governance (master data, reference data, data cleansing, data lineage) span the platform.]
What do large cloud providers sell?
…expecting euphoric clients.
Clients: reluctance and skepticism
⁄ Uncertainty regarding IT security and data protection
⁄ Concerns due to US Cloud Act
⁄ Lack of know-how
⁄ Complexity and costs of migration
⁄ Large investments in DWHs and data lakes in the recent past make it difficult to sell new, expensive initiatives to CFOs
⁄ Reluctance due to possible vendor lock-ins
Cloud adoption: Where to start?
Don’ts:
⁄ Big Bang Approach: all at once
⁄ Do nothing and ignore Cloud
⁄ Long & costly conceptual phase
Cloud adoption: Where to start?
Dos:
⁄ Quick identification of a suitable existing use case
⁄ Approval by IT security and data protection departments
⁄ Quick upload of relevant data (subsets) to the cloud storage (e.g. as CSV or JSON)
⁄ Data access for analysts and data scientists
⁄ Deployment of the first MVP after a maximum of 3 months
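As a sketch of the "quick upload of relevant data subsets" step: the snippet below extracts only the needed columns from a set of records and serializes them to CSV in memory, ready to be pushed to any cloud object storage. The actual upload call (e.g. via boto3 or azure-storage-blob) is omitted, and all record and field names are invented for illustration.

```python
import csv
import io

def subset_to_csv(records, fields):
    """Serialize a subset of fields from each record to CSV text,
    ready to be uploaded as an object to cloud storage."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

# Hypothetical example: keep only the columns the use case needs,
# dropping sensitive fields before anything leaves the premises.
records = [
    {"customer_id": 1, "segment": "retail", "iban": "DE89..."},
    {"customer_id": 2, "segment": "corporate", "iban": "DE12..."},
]
csv_text = subset_to_csv(records, ["customer_id", "segment"])
```

Uploading only a curated subset like this keeps the first MVP small and makes the IT security and data protection approval easier.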
[Diagram: interactive query services make data in cloud storage SQL-visible without their own data storage. A traditional DWH (persistent data storage) serves your BI tool; in the public cloud, data (csv, json, parquet) stays in cloud storage and a SQL engine exposes it to analysts (BI tools) and data scientists (data science notebooks).]
Interactive Query Services
Azure Synapse Analytics SQL on-demand
Advantages
⁄ Quick and easy setup
⁄ Data stays in the cheap cloud object storage
⁄ Data can be accessed through widely used SQL
⁄ Support of different data formats (csv, json, parquet etc.)
⁄ Simple cost structure: price is based on the volume of data processed
Disadvantages
⁄ Poor performance with complex queries and large amounts of data
⁄ Limits on concurrent queries
⁄ No user-defined functions
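To make the "SQL over cheap object storage" idea concrete: Synapse SQL on-demand reads files in the data lake directly via OPENROWSET. The helper below is a hypothetical sketch that only assembles such a query string; the storage account, container and path are placeholders, and actually submitting the query would require a Synapse workspace connection (e.g. via pyodbc), which is out of scope here.

```python
def openrowset_query(storage_url, file_format="PARQUET", top=10):
    """Build a Synapse serverless (SQL on-demand) query that reads
    files directly from Azure Data Lake Storage via OPENROWSET."""
    return (
        f"SELECT TOP {top} * FROM OPENROWSET(\n"
        f"    BULK '{storage_url}',\n"
        f"    FORMAT = '{file_format}'\n"
        f") AS rows"
    )

# Placeholder storage account and path, for illustration only
query = openrowset_query(
    "https://myaccount.dfs.core.windows.net/datalake/curated/*.parquet"
)
```

Because the data never leaves the lake, the only billed quantity is the data the query processes, which matches the cost structure described above.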
Next step: Cloud Data Processing & Analytical Data Stores
[Diagram: sources feed data (csv, json, parquet) in cloud storage; in the public cloud, complex data transformations or ML model training load an analytical data store, which serves analysts (traditional DWH / BI) and data scientists (data science notebooks).]
What to learn from the first Cloud MVPs?
Get a first impression of…
⁄ how fast new data & analytics applications can be implemented in the cloud
⁄ how cloud consumption costs develop and how to get a grip on them (hence an MVP, not just a PoC)
⁄ whether, if necessary, the legacy system can be brought into the cloud quickly, at manageable cost and ideally automatically
Cloud Data Lake & Modern DWH
[Diagram: sources (structured and semi-structured data) → data storage → data processing → analytical data store (or query engine) → analytics & reporting → users / downstream systems; cross-cutting: orchestration & monitoring, data governance, machine learning, DevOps.]
Do you have to use cloud-native technologies/tools only?
No!
Cloud Analytical Data Stores
Cloud-native: Amazon Redshift, MS Azure provisioned SQL (SQL DW)
Alternatives: [vendor logos]
Analytical Data Stores: avoiding lock-ins
⁄ Considering the analytical data store as an SQL engine box that executes analytical queries ...
→ with acceptable performance
→ at an acceptable price
→ with the necessary elasticity
⁄ Transform/aggregate data in advance (e.g. with Apache SPARK or ETL tools)
⁄ Avoid platform-native functionalities such as stored procedures etc.
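One way to follow the "SQL engine box" advice is to keep every transformation in portable ANSI SQL rather than in engine-specific stored procedures. The snippet below runs such a portable aggregation against SQLite purely as a stand-in engine; the same statement would run unchanged on Redshift or Synapse provisioned SQL. Table and column names are invented for illustration.

```python
import sqlite3

# Portable ANSI SQL aggregation: no stored procedures, no
# engine-specific functions, so the target engine stays swappable.
PORTABLE_SQL = """
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY region
"""

conn = sqlite3.connect(":memory:")  # stand-in for any SQL engine
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 100.0), ("EU", 50.0), ("US", 70.0)],
)
rows = conn.execute(PORTABLE_SQL).fetchall()
```

Being able to run the same statements against an embedded engine also gives cheap, fast unit tests for the transformation logic.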
Cloud Data Processing
Cloud-native:
⁄ Apache Spark jobs on AWS EMR or AWS Glue
⁄ Apache Spark in Azure Synapse Analytics or Azure Databricks
⁄ Apache Spark in Google Cloud Dataproc
Alternatives:
⁄ Data integration tools available in the cloud
⁄ Fully integrated data management platforms
⁄ Apache Spark / Hive / Apache NiFi jobs on a best-of-breed portable Data & Analytics platform as a Kubernetes cluster
Cloud Data Processing: avoiding lock-ins
⁄ Pay attention to portability already during the design phase
⁄ Definition of development guidelines that ensure the development of portable code
⁄ Avoid technology-specific libraries and dependencies (e.g., AWS extensions to vanilla Spark)
⁄ Use as many reusable components as possible, so that only these components have to be adapted during a migration
⁄ Test automation to ensure an easy validation after a migration
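The "reusable components" guideline can be sketched as a thin storage abstraction: job code talks to an interface, and only the small provider-specific implementations (S3, ADLS, on-prem object storage) would need swapping during a migration. Below is a minimal, hypothetical version with a local-filesystem implementation, which also supports the test-automation point, since jobs can be unit-tested without any cloud account.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path

class ObjectStore(ABC):
    """Interface the ETL jobs code against; one implementation per provider."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class LocalObjectStore(ObjectStore):
    """Filesystem-backed implementation for local runs and unit tests;
    S3/ADLS implementations would wrap boto3 / azure-storage-blob."""
    def __init__(self, root: str):
        self.root = Path(root)
    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

store = LocalObjectStore(tempfile.mkdtemp())
store.put("raw/orders.csv", b"id,amount\n1,10\n")
data = store.get("raw/orders.csv")
```

During a migration only the `ObjectStore` implementation changes; the jobs themselves stay untouched and the existing automated tests validate the new target.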
Case Study 1: Multi-Cloud D&A with Azure DevOps (Automotive)
CAP Data pipelines
Infrastructure as Code (Terraform) used for maintaining and provisioning all cloud infrastructure
S3 object storage leveraged for the best possible price/value ratio
ETL processing based on Spark ETLs (AWS Glue serverless service) implemented in Scala
Multi-Cloud CI/CD pipeline design for CAP
[Diagram: Azure DevOps build and release pipelines, integrated with GitHub, AWS and Azure. Build and validation steps: Spark ETL unit tests; Terraform config tested in an isolated sandbox; Terraform checkov tests; DB change validation; shared-library validation; automated code checks and risk analysis (security, license, Black Duck scan); SonarQube code quality checks. Release and deployment: to AWS (upload source code and metadata to S3, apply Terraform config, apply Redshift DB changes) and to Azure (upload metadata to Blob storage, apply Terraform config, deploy Databricks notebooks). Secret management and staging environments are shared across build and release pipelines.]
⁄ CI/CD pipeline implemented in Azure Cloud DevOps solution
⁄ Integrated with AWS, Azure and GitHub
⁄ Building and validation of Scala-based Glue ETL Spark jobs
⁄ Automated unit testing – unit tests are executed in Azure DevOps agents
⁄ Pull request validation pipeline supports code reviews
⁄ Infrastructure as Code (Terraform): all infrastructure changes are validated and released using the CI/CD pipeline
⁄ Pipelines configured with dev, test and prod staging environments
⁄ Dedicated database change management pipeline for validating and releasing Redshift/PostgreSQL changes
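A heavily simplified sketch of what such an Azure DevOps multi-stage pipeline could look like; stage names, commands and the artifact bucket are invented for illustration, and the real pipeline described above is considerably larger:

```yaml
# azure-pipelines.yml -- illustrative sketch only
trigger:
  branches:
    include: [main]

stages:
  - stage: Build
    jobs:
      - job: BuildAndTest
        steps:
          - script: sbt test assembly       # Scala Glue ETL unit tests + fat jar
          - script: terraform validate      # validate IaC before any release
  - stage: DeployDev
    dependsOn: Build
    jobs:
      - deployment: Dev
        environment: dev                    # staging environments: dev, test, prod
        strategy:
          runOnce:
            deploy:
              steps:
                - script: terraform apply -auto-approve
                - script: aws s3 cp target/etl.jar s3://cap-artifacts/  # placeholder bucket
```

Analogous `DeployTest` and `DeployProd` stages would follow, gated by the respective environments.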
Case Study 2: Portable D&A Platform with Spark & Kubernetes (Automotive)
Case Study: Portable Data & Analytics Platform for a Large German Corporation
First Use Case
⁄ Creation of a platform for Analytics on car configuration data, driven by new regulatory requirements
⁄ The data must be processed locally in a unified way in the covered regions (EU, US, China)
⁄ The platform should accommodate further Big Data & Analytics programs in the future
⁄ The platform should be able to be deployed on-prem as well as in the Cloud
After assessing several possible solutions (incl. Cloud, Legacy DWH solutions) a Cloud-agnostic best-of-breed approach based on Kubernetes, Apache SPARK and Object Storage was selected.
Portable D&A Platform with Spark & Kubernetes
⁄ Kubernetes: "Run K8s Anywhere - Kubernetes is open source giving you the freedom to take advantage of on-premises, hybrid, or public cloud infrastructure, letting you effortlessly move workloads to where it matters to you." (https://kubernetes.io/de/)
⁄ Apache SPARK: one of the most widely used platforms for parallel processing of Big Data, first for on-prem and now for cloud data lakes
⁄ Since 2018, Kubernetes has been available as an alternative Apache Spark cluster manager (alongside Mesos & YARN)
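Since the point above is that Kubernetes is just another Spark cluster manager, submitting a job differs from YARN mainly in the master URL and a container image setting. The helper below assembles the spark-submit arguments as a minimal sketch; the API server URL, image and jar path are placeholders, and an actual cluster would be needed to run the resulting command.

```python
def spark_on_k8s_submit_args(app_jar, main_class, k8s_master, image, executors=2):
    """Assemble spark-submit arguments for running Spark with Kubernetes
    as the cluster manager (illustrative helper; values are placeholders)."""
    return [
        "spark-submit",
        "--master", f"k8s://{k8s_master}",   # K8s API server acts as cluster manager
        "--deploy-mode", "cluster",           # driver itself runs as a pod
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]

args = spark_on_k8s_submit_args(
    "local:///opt/jobs/etl.jar", "com.example.Etl",
    "https://k8s.example.internal:6443", "registry.example/spark:3.1",
)
```

Because only the master URL and image change between environments, the same job runs on-prem or in any cloud that hosts a Kubernetes cluster.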
Solution Overview
[Diagram: data sources (databases, files, web services) flow through raw, prepared and presentation layers; data orchestration/ETL manages, schedules and monitors jobs; metadata and control logging plus a schema store support the platform; users access results via a web interface/UI, data scientists via notebooks.]
Agility with CI/CD Pipelines
1. Commit the code to SCM
2a. SCM polling through webhook, enabled for application code
4. Argo workflow triggers the job on the Kubernetes cluster
5. Pull the image and run the job
[Diagram: application code repo and ETL job DAGs repo (YAML) feed the pipeline; on Kubernetes, the Spark driver pod and executor pods run the job.]
a) The Argo workflow submits the job to Kubernetes
b) Kubernetes creates the driver pod
c) The driver requests Kubernetes to create executor pods
d) Kubernetes creates the executors
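Step a) above, the Argo workflow submitting the Spark job, could look roughly like the following Workflow manifest; the names, image and jar path are invented for illustration:

```yaml
# Illustrative Argo Workflow that launches spark-submit against the cluster
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: spark-etl-
spec:
  entrypoint: run-spark-etl
  templates:
    - name: run-spark-etl
      container:
        image: registry.example/spark:3.1         # placeholder image
        command: [spark-submit]
        args:
          - --master
          - k8s://https://kubernetes.default.svc   # in-cluster API server
          - --deploy-mode
          - cluster
          - --conf
          - spark.kubernetes.container.image=registry.example/spark:3.1
          - local:///opt/jobs/etl.jar              # placeholder job jar
```

Steps b) to d) then happen inside Kubernetes: spark-submit creates the driver pod, which in turn requests the executor pods.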
Portable from Day One
1. First prototype developed in the Adastra Germany Data Engineering Lab (AWS): AWS S3 object data storage + AWS EC2
2. Following sprints in the customer's AWS account: AWS S3 object data storage + AWS EC2
3. Final development on the customer's on-prem infrastructure: on-prem object data storage (Ceph) + on-prem VMs
One Framework, deployed everywhere
⁄ On-prem: object data storage (e.g. Ceph) + on-prem prod VMs
⁄ AWS: S3 object data storage + EC2 (dev, test, prod)
⁄ Azure: Data Lake Storage + Azure VM prod
⁄ Google Cloud: Cloud Storage + GCP Compute Engine prod
⁄ Telekom: Object Storage Services + Telekom ECS prod
Adastra Modular Cloud D&A Framework Overview
⁄ Templates, scripts and policies for networking, apps, storage, security and cloud infrastructure
⁄ CI/CD pipeline
⁄ Reusable modules: ingestion, historization and CDC, control and metadata capture, generic data transformation
⁄ Data Lake best practices
⁄ Coding best practices
⁄ Infrastructure as Code design and development
Thank you!
Adastra GmbH
Niedenau 36, 60325 Frankfurt am Main
+49 (0)69 719 779 790 / [email protected]
www.adastragrp.com