
  • CDP DATA CENTER 7.1

    Laurent Edel: Solution Engineer
    Jacques Marchand: Solution Engineer
    Mael Ropars: Principal Solution Engineer

    June 30, 2020

  • © 2019 Cloudera, Inc. All rights reserved.

    SPEAKERS

  • AGENDA

    • CDP DATA CENTER OVERVIEW
    • DETAILS ABOUT MAJOR COMPONENTS
    • PATH TO CDP DC & SMART MIGRATION
    • Q/A

  • CLOUDERA DATA PLATFORM

  • TARGET ARCHITECTURE: ENTERPRISE DATA CLOUD

    CDP Public Cloud (platform-as-a-service)

    CDP On-Prem (installable software)

  • CDP DATA CENTER OVERVIEW

    NEW CDP Data Center features include:
    • High-performance SQL analytics
    • Real-time stream processing, analytics, and management
    • Fine-grained security, enterprise metadata, and scalable data lineage
    • Support for object storage (tech preview)
    • Single pane of glass for management, with multi-cluster support

    Enterprise analytics and data management platform, built for hybrid cloud, optimized for bare metal and ready for private cloud

    CDP Data Center (installable software)
    • Cloudera Runtime
    • Cloudera Manager

  • A NEW OPEN SOURCE DISTRIBUTION FOR BETTER CAPABILITY

    Cloudera Runtime: created from the best of CDH and HDP

    • Deprecate competitive technologies
    • Merge overlapping technologies
    • Keep complementary technologies
    • Upgrade shared technologies

  • COMPONENT LIST

    CDP Data Center 7.1 (May 2020)

    • Cloudera Manager 7.1
    • Hadoop 3.1
    • Spark 2.4 / Spark 3 (b2)
    • Hive 3.1
    • Impala 3.4
    • Oozie 5.1
    • Hue 4.5
    • Ranger 2.0
    • Atlas 2.0
    • HBase 2.2
    • Phoenix 5.0
    • Kudu 1.12
    • Sqoop 1.4.7
    • Parquet 1.10
    • Avro 1.8
    • ORC 1.5
    • ZooKeeper 3.5
    • Solr 8.4
    • Key HSM 7.1
    • Knox 1.3
    • Livy 0.7
    • Navigator Encrypt 7.1
    • Ranger KMS 7.1
    • Zeppelin
    • Hive Warehouse Connector 1.0
    • Kafka 2.4
    • Kafka Schema Registry 0.8
    • Streams Messaging Mgr 1.0
    • Streams Replication Mgr 2.1
    • Ozone (Beta) 0.6
    • Kafka Connect 2.4
    • Cruise Control 2.0
    • Tez 0.9
    • Key Trustee Server 7

    • RHEL/CentOS/OEL 7.7
    • Postgres 10
    • JDK 8
    • JDK 11 Runtime
    • MySQL 5.7
    • Oracle DB 12 (Fresh Install Only)
    • PostgreSQL 10
    • MariaDB 10.2

    • Upgrades from CDP DC 7.0
    • Upgrades from CDH 5.13-5.16
    • Upgrades from HDP 2.6.5
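
The upgrade sources listed on this slide (CDP DC 7.0, CDH 5.13-5.16, HDP 2.6.5) could be encoded as a small pre-flight check. The sketch below is only an illustration: the class name `UpgradeSourceCheck`, the method `isListedSource`, and the distribution labels are hypothetical, and it captures nothing beyond the three entries on the slide. Real upgrade prerequisites may be more detailed.

```java
final class UpgradeSourceCheck {

    // True when the given distribution and version match one of the upgrade
    // sources listed on the slide: CDP DC 7.0, CDH 5.13-5.16, HDP 2.6.5.
    // Expects dot-separated numeric versions such as "5.14" or "2.6.5";
    // missing components are treated as 0.
    static boolean isListedSource(String distribution, String version) {
        String[] parts = version.trim().split("\\.");
        int[] v = new int[3];
        for (int i = 0; i < v.length && i < parts.length; i++) {
            v[i] = Integer.parseInt(parts[i]);
        }
        switch (distribution) {
            case "CDP DC":
                return v[0] == 7 && v[1] == 0;
            case "CDH":
                return v[0] == 5 && v[1] >= 13 && v[1] <= 16;
            case "HDP":
                return v[0] == 2 && v[1] == 6 && v[2] == 5;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isListedSource("CDH", "5.14"));
        System.out.println(isListedSource("CDH", "6.3"));
    }
}
```

Anything the check rejects is simply not named on the slide; whether another path exists would have to be confirmed against the product documentation.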

  • CDP DATA CENTER OVERVIEW [What is the scope of CDP Data Center]

    01 Collect
    02 Enrich: Spark, Hive
    03 Report: Impala, Hive, Kudu
    04 Serve: HBase, Phoenix, Solr
    05 Predict: Spark, Zeppelin

    SECURITY | GOVERNANCE | LINEAGE | MANAGEMENT | AUTOMATION

  • KAFKA COMPUTE CLUSTERS WITH CLOUDERA MANAGER
    Kafka Clusters using Shared Security & Governance Data Lake with Atlas and Ranger

    • Kafka 2.4
    • Ranger & Atlas Integration
    • Support of Kafka Connect, Kafka Streams
    • Cruise Control for load balancing
    • Create multiple Kafka compute clusters using shared Security Data Lake with Ranger & Atlas

  • KAFKA MANAGEMENT SERVICES
    Kafka Services for Schema Management, Replication and Monitoring

    • Schema Registry: New Kafka Schema Governance
    • Streams Replication Manager (SRM): New Kafka Replication Engine powered by MirrorMaker2
    • Streams Messaging Manager (SMM): New Kafka Monitoring Service

  • SPARK

    • Spark 2.4
    • Integration with Ranger for Fine-Grained Authorizations
    • Coming soon: Spark 3!
      • Better performance
      • Enhanced support for Deep Learning
      • New modules
      • RDD-based MLlib API superseded by the DataFrame-based Spark ML API
      • Tech Preview available

  • SQL USER EXPERIENCE: HUE

  • DATA WAREHOUSE
    Hive 3

    Apache Hive 3
    • Comprehensive ANSI SQL 2016 coverage
    • GDPR: new ACID v2 as fast as regular tables, transactions, UPDATE/DELETE/MERGE
    • Cloud-ready: optimized for S3/WASB/GCP
    • Support for JDBC/Kafka/Druid out of the box
    • EDW offload:
      – “DBA” tooling: surrogate keys, materialized views, constraints
      – information schema
    • Performance:
      – workload management
      – query result cache

  • DATA WAREHOUSE
    Impala and Kudu

    Apache Impala
    • Leading MPP SQL Engine for DW, optimized for Parquet/Kudu
    • Ideal for Data Mart Implementations that require Interactive/Ad-hoc BI
    • 1000+ enterprise customers, many running on 10s of PBs and 100s of nodes
    • Certified with leading BI tools with broad SQL coverage
    • Latest release adds improvements for resiliency, concurrency, and metadata

    Apache Kudu
    • Leading columnar storage engine for fast analytics on fast data
    • Ideal for low-latency time series data ingest and analytics (with Impala SQL engine)
    • Combines fast single-row ingest, like HBase, with large scans, like HDFS
    • ACID (insert/update/delete) semantics with single rows

  • WORKLOAD MANAGER

    • Global view on analytic processing
    • Deep-dive query analysis

  • HBASE + PHOENIX

    HBASE
    Flexible, scale-out, NoSQL database

    • Maximally flexible & customizable
    • SQL only for data remediation
    • All advanced functionality available
    • New async client
    • JDK8/G1GC
    • Off-Heap read path
    • API clean-up, HBCK2

    // table is an org.apache.hadoop.hbase.client.Table; family and qualifier are byte[]
    Put put = new Put(Bytes.toBytes(rowKey));
    put.addColumn(COLUMN_FAMILY_NAME, COLUMN_NAME, Bytes.toBytes(GREETINGS));
    table.put(put);

    PHOENIX
    RDBMS-like, scale-out database

    • Programmatic ANSI SQL support
    • RDBMS-like data architecture
    • Auto-applies performance best practices
    • Can co-exist with HBase apps

    // stmt comes from conn.prepareStatement("UPSERT INTO TABLE_NAME VALUES (?, ?)") on a Phoenix JDBC connection
    stmt.setString(1, rowKey);
    stmt.setString(2, GREETINGS);
    stmt.executeUpdate();

  • CDP SEARCH
    Scalable and Robust Index Storage with Solr 8.4

    ● Scalable, cost-efficient index storage
    ● High availability, integrated security with Atlas/Ranger
    ● Shared data store with other processing tools (Spark, Impala, ...)
    ● Search AND process data in one platform

    [Diagram: Querying API, Indexing API; Solr Cloud (distributed processing coordinator); Solr (extraction, mapping); indexing engine (Lucene); shared data storage]

  • DATA SCIENCE AND ENGINEERING TOOLS

    • APACHE ZEPPELIN
    • CLOUDERA DATA SCIENCE WORKBENCH

  • SIMPLIFIED MANAGEMENT
    Cloudera Manager

    • Management of multiple clusters
    • Knox, Ranger, Atlas, Hive-on-Tez, DAS
    • Cluster-level configuration history
    • Improved global search
    • Resume after errors when enabling Kerberos
    • Scalability improvements
    • Improved alerts configuration
    • Upgrade Support
    • Support for Private Cloud (Beta)

  • CONSISTENT SECURITY AND GOVERNANCE
    Built for multi-functional analytics anywhere

    • Data Catalog: a comprehensive catalog of all data sets, spanning on-premises, cloud object stores, structured, unstructured, and semi-structured

    • Schema: automatic capture and storage of any and all schema and metadata definitions as they are used and created by platform workloads

    • Replication: deliver data, along with its data policies, wherever the enterprise needs to work, with complete consistency and security

    • Security: role-based access control applied consistently across the platform. Includes full stack encryption and key management

    • Governance: enterprise-grade auditing, lineage, and governance capabilities applied across the platform with rich extensibility for partner integrations

  • SECURITY AND GOVERNANCE

    • Identity & Perimeter: Validate users in enterprise directory
      Technical Concepts: Authentication, User/group mapping
    • Data Protection: Shielding data in the cluster from unauthorized visibility
      Technical Concepts: Encryption, Key Management
    • Visibility: Reporting on where data came from and how it’s being used
      Technical Concepts: Auditing, Lineage
    • Access: Defining what users and applications can do with data
      Technical Concepts: Permissions, Authorization

    Technologies: Kerberos, Apache Knox; SSL/TLS, HDFS TDE, Apache Ranger (KMS, Masking, Filtering); Apache Ranger; Apache Atlas

  • VISIBILITY, SECURITY AND AUDITING
    Apache Atlas

    • Lineage
      – What data do I consume?
      – What consumes my data?
    • Who uses my data?
      – Audit who accessed what
      – Track access events from Apache Ranger
      – Metadata audit and versioning from Apache Atlas
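
The two lineage questions on this slide amount to upstream and downstream traversals of a lineage graph. Here is a minimal Java sketch of that idea, assuming a hypothetical in-memory edge list. The `LineageSketch` class and the data set names are invented for illustration; Atlas serves lineage through its own UI and REST API, which this sketch does not use.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a toy lineage graph, not the Atlas data model.
final class LineageSketch {
    private final Map<String, List<String>> downstream = new HashMap<>();
    private final Map<String, List<String>> upstream = new HashMap<>();

    // Record that `target` is derived from `source`.
    void addEdge(String source, String target) {
        downstream.computeIfAbsent(source, k -> new ArrayList<>()).add(target);
        upstream.computeIfAbsent(target, k -> new ArrayList<>()).add(source);
    }

    // "What data do I consume?": everything reachable against edge direction.
    Set<String> inputsOf(String dataset) {
        return reach(dataset, upstream);
    }

    // "What consumes my data?": everything reachable along edge direction.
    Set<String> consumersOf(String dataset) {
        return reach(dataset, downstream);
    }

    // Breadth-first traversal; the start node itself is never reported.
    private static Set<String> reach(String start, Map<String, List<String>> edges) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            List<String> neighbours = edges.getOrDefault(node, Collections.<String>emptyList());
            for (String next : neighbours) {
                if (!next.equals(start) && seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        LineageSketch g = new LineageSketch();
        g.addEdge("kafka.orders", "hive.orders_raw");
        g.addEdge("hive.orders_raw", "hive.orders_clean");
        g.addEdge("hive.orders_clean", "impala.sales_report");
        System.out.println(g.inputsOf("hive.orders_clean"));
        System.out.println(g.consumersOf("hive.orders_raw"));
    }
}
```

Keeping both edge directions in separate maps makes each question a single traversal, at the cost of storing every edge twice.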

  • ACCESS CONTROL
    Apache Ranger

    • Maintain one set of data, control access centrally with fine grained policies down to the column and the row level.

    • Anonymize PII with Dynamic column masking

    • Customize views for users with Dynamic row filtering

    • Manage user access with Role-based Access Control

    • Unify policies across many data sets with Attribute-based Access Control
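
The effect of dynamic column masking and row filtering can be shown with a small, self-contained Java sketch. It does not use Ranger's API or policy format: the `Row` class, the role names, and the `applyPolicy` helper are hypothetical. In a real deployment such policies are defined in Ranger and enforced by the query engines. The sketch only illustrates the outcome: one stored data set, and a different view for each role.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: mimics the effect of dynamic column masking and
// row filtering. None of these names come from Apache Ranger.
final class AccessPolicySketch {

    static final class Row {
        final String customer;
        final String email;   // treated as PII in this example
        final String country;

        Row(String customer, String email, String country) {
            this.customer = customer;
            this.email = email;
            this.country = country;
        }

        @Override
        public String toString() {
            return customer + "," + email + "," + country;
        }
    }

    // Column masking: hide the local part of an e-mail address.
    static String maskEmail(String email) {
        int at = email.indexOf('@');
        return at < 0 ? "****" : "****" + email.substring(at);
    }

    // Hypothetical policy: "admin" sees every row unmasked; "analyst_fr"
    // sees only rows with country "FR" and a masked e-mail; any other
    // role sees nothing.
    static List<Row> applyPolicy(String role, List<Row> table) {
        List<Row> visible = new ArrayList<>();
        for (Row r : table) {
            if (role.equals("admin")) {
                visible.add(r);
            } else if (role.equals("analyst_fr") && r.country.equals("FR")) {
                visible.add(new Row(r.customer, maskEmail(r.email), r.country));
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Row> table = Arrays.asList(
            new Row("Alice", "alice@example.com", "FR"),
            new Row("Bob", "bob@example.com", "US"));
        System.out.println(applyPolicy("admin", table));
        System.out.println(applyPolicy("analyst_fr", table));
    }
}
```

The stored rows are never modified; masking and filtering happen when the data is read, which is what lets a single copy serve several audiences.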

  • OBJECT STORAGE
    Apache Ozone

    • Ozone is the next generation of HDFS
      – Based on HDFS architecture, but with some fundamental shifts
      – Preserves and reuses the good parts of HDFS
      – Addresses HDFS scale limits and the small-file problem
    • Uses an object store architecture to achieve scale
    • Provides a native Hadoop File System API as well as an S3-compatible API

  • PATHS TO CDP

  • THREE PATHS TO CDP

    • Migrate to Public Cloud: Copy data and metadata to a public cloud; implement new, or migrate existing, workloads on CDP Public Cloud.
    • Migrate to CDP DC: Build a new CDP Data Center cluster on-premises; copy data and metadata from the existing classic cluster; and migrate existing workloads.
    • Upgrade to CDP DC: Upgrade from the classic cluster to CDP Data Center in place on the same hardware infrastructure.

  • SMART UPGRADE TO CDP DC

    Phases: 1. PLAN, 2. LAUNCH, 3. PRODUCTION

    • PATH2CDP: 1+ weeks
    • CONSUME CDP: custom estimate
    • UPGRADE2CDP: 3+ weeks

    CDP Data Center

  • GENERAL AND CDP TRAINING

    Roles: ADMINISTRATOR, DATA ANALYST, DEVELOPER, DATA SCIENTIST

    • Spark (DC - Q1)
    • CDW
    • CML
    • CDP Public Cloud Admin (Q2)
    • Stream Processing (Q1) (CDF)
    • Spark Performance Workshop
    • Hive/Impala (DC - Q1)
    • Data Science Workshop (Q2)
    • CDP Security (Q2-Q3)
    • Kafka Operations (CDF)
    • CDP Data Governance (Q2)
    • DS/ML Modules
    • AWS Fundamentals for CDP Pub. Cloud
    • Flow Management with NiFi (CDF)

    CDH/HDP TO CDP DELTA COURSES

    https://www.cloudera.com/about/training/courses/advanced-spark-application-performance-tuning.html
    https://www.cloudera.com/about/training/courses/aws-undamentals-for-cdp.html
    https://www.cloudera.com/about/training/courses/dataflow-flow-managment-with-nifi.html#?classType=&country=US
    https://cloudera.com/training.html

  • WHAT IS COMING NEXT?

  • CDP PRIVATE CLOUD: BASED ON CDP DATA CENTER

    • New set of data analytics applications, featuring use-case optimized interfaces
    • Running on a container cloud: fast provisioning & scaling, efficient, simple
    • With access to a shared data lake that is secured and governed

    [Diagram: Management Console, Data Catalog, Workload Manager, Replication Manager; Experiences (Data Warehouse, Machine Learning, Data Engineering, DataFlow) on Kubernetes; SDX (Security, Metadata, Governance); Bare Metal Workloads on CDP Data Center; Bare Metal]

  • Questions & Answers
    Over to you...