
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

June 26, 2017

Designing an Open Data Lake for the Enterprise



Presenters

Abass Safouatou, Solutions Architect, Amazon Web Services

Seif Eddine Abbassi, Solution Engineer, Talend


Agenda

• AWS data management and data lake services

• Talend solutions for data governance in a data lake context

• The challenges Beachbody faced

• Beachbody's experience with AWS & Talend

• Q&A


Learning Objectives:

1. How to migrate a variety of structured and unstructured data sources to a data lake

2. How to shorten development and testing cycles

3. How to mitigate complex deployment challenges common to real-time data

4. How to take advantage of Spark and Hadoop by generating native code


The Data Lake and AWS

Drive business value with any type of data


Legacy Data Warehouses & RDBMS

• Complex to set up and manage

• Do not scale

• Take months to add new data sources

• Queries take too long

• Cost millions of dollars up front


Should I Build a Data Lake?

Starting by amassing "all your data" and dumping it into a large repository for the data gurus to start finding "insights" is like trying to win the lottery by buying all the tickets.


Building a Data Lake on AWS


Rethink How to Become a Data-driven Business

• Business outcomes - start with the insights and actions you want to drive, then work backwards to a streamlined design

• Experimentation - start small, test many ideas, keep the good ones and scale those up, paying only for what you consume

• Agile and timely - deploy data processing infrastructure in minutes, not months. Take advantage of a rich platform of services to respond quickly to changing business needs


Business Case Determines Platform Design

[Figure: data pipeline. Data flows through Ingest/Collect, Store, Process/Analyze, and Consume/Visualize to produce Answers & Insights. Callout: "START HERE WITH A BUSINESS CASE".]


Experiment and Scale Based on Your Business Needs

[Figure: data pipeline. Data flows through Ingest/Collect, Store, Process/Analyze, and Consume/Visualize to produce Answers & Insights. Callout: "MATCH AVAILABLE DATA". Data sources shown: Metrics and Monitoring, Workflow, Logs, ERP, Transactions.]


Business Outcomes on a Modern Data Architecture

Outcome 1: Modernize and consolidate

• Insights to enhance business applications and create new digital services

Outcome 2: Innovate for new revenues

• Personalization, demand forecasting, risk analysis

Outcome 3: Real-time engagement

• Interactive customer experience, event-driven automation, fraud detection

Outcome 4: Automate for expansive reach

• Automation of business processes and physical infrastructure


Use an Optimal Combination of Highly Interoperable Services


Why Amazon S3 for Modern Data Architecture?

Durable
▪ Designed for 11 9s of durability

Available
▪ Designed for 99.99% availability

High performance
▪ Multiple upload
▪ Range GET

Scalable
▪ Store as much as you need
▪ Scale storage and compute independently
▪ No minimum usage commitments

Integrated
▪ Amazon EMR
▪ Amazon Redshift
▪ Amazon DynamoDB
▪ Amazon Athena

Easy to use
▪ Simple REST API
▪ AWS SDKs
▪ Read-after-create consistency
▪ Event notification
▪ Lifecycle policies
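The "Range GET" and "Multiple upload" bullets both depend on splitting an object into byte ranges that can be transferred in parallel. The sketch below shows only that arithmetic, using the standard library. The function name `range_headers` and the part sizes are illustrative assumptions, not values from the slides, and no request is actually sent to Amazon S3.

```python
from typing import List


def range_headers(object_size: int, part_size: int) -> List[str]:
    """Return HTTP Range header values that cover an object of object_size bytes.

    Each value has the form "bytes=first-last" with inclusive offsets,
    which is the syntax a ranged GET request uses.
    """
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    headers = []
    for start in range(0, object_size, part_size):
        end = min(start + part_size, object_size) - 1  # inclusive last byte
        headers.append(f"bytes={start}-{end}")
    return headers


if __name__ == "__main__":
    # Hypothetical 20 MiB object fetched in 8 MiB parts.
    for header in range_headers(20 * 1024 * 1024, 8 * 1024 * 1024):
        print(header)
```

Each returned string could be passed as the `Range` header of a separate GET request, so that several workers fetch different parts of the same object at once.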


Decouple Storage and Compute

• Legacy design was large databases or data warehouses with integrated hardware

• Big Data architectures often benefit from decoupling storage and compute


Improving Data Agility with Talend


Stakes are Ever Higher with Big Data

Companies that plan on increasing spending on analytics and making data discovery a more significant part of the architecture

Revenue from Big data and analytics applications, tools and services

Big data projects that will fail to deliver against expectations

1/2, 73%, 50%


Data Lakes: The First Phase and the Future

First phase: Capture and store raw data of many different types at scale

Next phase: Augment enterprise data warehousing strategies


Why Data Lake Projects Fail

• No DevOps Practices for Scalability & Testing

• Lack of Expertise

• Siloed Operating Model

• Poor Data Governance

• Poor Architectural Design & Integration


Foundational Elements of a Data Lake

• Data Preparation

• Self-service Data Ingest

• Metadata Management

• Data Classification

• Data Governance

• Data Lineage

• Security

• Data Profiling


Streamline DevOps Process for Big Data

Custom Cluster Configuration

• Retrieve Hadoop configuration data from job server

• Upload configuration files to different clusters based on role: dev/test/prod

• Enforce uniform security standards

• Available for Spark and Spark Streaming jobs

Portable integration jobs across your environment
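The role-based configuration idea above can be illustrated with a short sketch. It is not Talend's API: `REQUIRED_SECURITY`, `ClusterConfig`, and `build_config` are hypothetical names, and the host names are placeholders. The two Hadoop property keys are real, but the choice of required values is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict

# Security settings that every environment must share.
# The required values are an assumption chosen for this sketch.
REQUIRED_SECURITY = {
    "hadoop.security.authentication": "kerberos",
    "hadoop.rpc.protection": "privacy",
}

ROLES = ("dev", "test", "prod")


@dataclass(frozen=True)
class ClusterConfig:
    role: str
    properties: Dict[str, str]


def build_config(role: str, overrides: Dict[str, str]) -> ClusterConfig:
    """Merge role-specific overrides with the shared security settings.

    An override that contradicts a required security setting is rejected,
    so dev, test, and prod cannot drift apart on security.
    """
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    for key, required in REQUIRED_SECURITY.items():
        if key in overrides and overrides[key] != required:
            raise ValueError(f"{key} must be {required!r} in every environment")
    return ClusterConfig(role=role, properties={**overrides, **REQUIRED_SECURITY})


if __name__ == "__main__":
    # Placeholder namenode address for a hypothetical dev cluster.
    dev = build_config("dev", {"fs.defaultFS": "hdfs://dev-namenode:8020"})
    for key in sorted(dev.properties):
        print(key, "=", dev.properties[key])
```

Keeping the shared security settings in one place and layering per-role overrides on top is one way to keep a job portable across environments while still enforcing a uniform standard.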


Big Data Matching and Machine Learning on Spark

• New Data Stewardship interface simplifies matching process

• Improved performance through continuous matching, speeding time to insight

Harmonize data at scale by learning from your key experts
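The slides do not describe the matching algorithm itself. As a stand-in illustration of what record matching means, the sketch below compares strings with `difflib.SequenceMatcher` from the Python standard library. This is a swapped-in technique, not Talend's method, and it does not run on Spark. The 0.85 threshold and the sample names are assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_matches(records: List[str], threshold: float = 0.85) -> List[Tuple[str, str]]:
    """Return every pair of records whose similarity reaches the threshold."""
    pairs = []
    for i, left in enumerate(records):
        for right in records[i + 1:]:
            if similarity(left, right) >= threshold:
                pairs.append((left, right))
    return pairs


if __name__ == "__main__":
    # Hypothetical customer names with one near-duplicate.
    for pair in find_matches(["Jon Smith", "John Smith", "Maria Garcia"]):
        print(pair)
```

Comparing every pair is quadratic in the number of records, so this form only suits small samples. Matching at scale usually groups candidate records first and compares only within each group.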


Faster and Better Real-time Insights with Spark

Enterprise-class robustness and intelligent integration

• New Spark Support

• Production-ready with Spark 2.2

• Toggle between Spark 1.X and 2.X

• Easily upgrade to Spark 2.X

• Natural Language Processing with Spark

• Data Preparation for Spark Streaming

• Talend Data Mapper runs with Spark Streaming

• Spark Streaming support for Kerberized Kafka 0.10


Big Data Governance

Complete End-to-end Data Lineage

• Understand more about your unstructured data with new cloud and big data metadata bridges

• Save time by automatically harvesting data structures to build a data lake inventory

• Manage change with version control and notifications

Metadata bridges: S3, Hadoop HDFS, Hive, MongoDB, Couchbase, Cassandra, Apache Atlas

File systems: Amazon S3, Hadoop HDFS, Unix, Windows, Linux

File formats: CSV, Excel, JSON, Avro, Parquet

Know Your Data for Increased Data Protection, Accessibility and Compliance
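To make "automatically harvesting data structures to build a data lake inventory" concrete, here is a minimal sketch that infers a simple schema from CSV and JSON text using only the standard library. It is an illustration under stated assumptions, not Talend's implementation: the function names are hypothetical, CSV types are guessed from the first data row only, and the other formats named on the slide (Excel, Avro, Parquet) are not covered.

```python
import csv
import io
import json
from typing import Dict


def infer_type(value) -> str:
    """Classify a CSV cell as int, float, or string."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except (ValueError, TypeError):
            continue
    return "string"


def harvest_csv(text: str) -> Dict[str, str]:
    """Map each CSV column to a type inferred from the first data row."""
    reader = csv.DictReader(io.StringIO(text))
    first_row = next(reader, None)
    if first_row is None:
        return {name: "unknown" for name in (reader.fieldnames or [])}
    return {name: infer_type(value) for name, value in first_row.items()}


def harvest_json(text: str) -> Dict[str, str]:
    """Map each top-level key of a JSON object to its Python type name."""
    document = json.loads(text)
    if not isinstance(document, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in document.items()}


def build_inventory(files: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Harvest a schema for every file, choosing the parser by extension."""
    inventory = {}
    for path, text in files.items():
        if path.endswith(".csv"):
            inventory[path] = harvest_csv(text)
        elif path.endswith(".json"):
            inventory[path] = harvest_json(text)
    return inventory


if __name__ == "__main__":
    # Hypothetical file contents keyed by path.
    sample = {
        "orders.csv": "id,name,score\n1,Ann,9.5\n",
        "event.json": '{"id": 1, "tags": ["a"], "ok": true}',
    }
    for path, schema in build_inventory(sample).items():
        print(path, schema)
```

Storing the harvested schema per path gives a basic inventory. Comparing a newly harvested schema with the stored one is one way to detect a structural change that should trigger a notification.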


Beachbody – Fitness goes Big Data

Driving innovation with Talend on AWS


About Beachbody

• A leading provider of fitness, nutrition, and weight-loss programs

• Creator of P90X® Series, INSANITY®, FOCUS T25®, 21 Day Fix®, Body Beast®, PiYo®, and Hip Hop Abs®

• Empowers over 23 million customers

• Supports 350K+ independent “Coach” distributors

• Operates with 800+ employees

• Sees 5 million+ monthly unique visits across digital platforms

• Reached $1 billion in gross sales in 2015


The Challenge - Do More Better, Faster, Cheaper


Project Vision – Open Enterprise Data Lake

• Build an OPEN Enterprise Data Platform

• Open Source Technology: Bring Your Own Tool

• Decentralized Data Ownership: Many teams can publish

• Centralized people, processes, and tools available

• Capture All Data as close to real time as possible

• Access to All raw + processed data by Authorized Users

• HIPAA/PII data encrypted or masked for compliance

• Shift Time from Collecting Data -> Analyzing Data
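The slides do not say how the HIPAA/PII fields are encrypted or masked. As one common approach, the sketch below pseudonymizes identifiers with a keyed hash (HMAC-SHA256) and partially masks email addresses, using only the standard library. The function names and the key are hypothetical, and the code is not a statement about what compliance requires.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Replace a value with a keyed SHA-256 digest in hexadecimal.

    The same input and key always give the same token, so joins on the
    token still work, while the original value is not directly readable.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email: str) -> str:
    """Keep the first character and the domain; star out the rest of the local part."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        # Not a usable address: mask everything.
        return "*" * len(email)
    return local[0] + "*" * (len(local) - 1) + "@" + domain


if __name__ == "__main__":
    # Placeholder key; a real key would come from a secrets manager.
    demo_key = b"example-key-not-for-production"
    print(pseudonymize("customer-12345", demo_key))
    print(mask_email("jane.doe@example.com"))
```

A keyed hash keeps tokens stable across datasets, which helps when authorized users need to join masked data. Whoever holds the key can recompute tokens for known values, so the key itself needs protection.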


The Technology


Data Lake Architecture

Amazon S3 Data Lake folder structure

AWS
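The folder-structure diagram from this slide did not survive in the transcript. The sketch below shows one plausible convention, not Beachbody's actual layout: a zone prefix (echoing the "raw + processed data" goal stated earlier), then source, dataset, and Hive-style date partitions. Every name in it is an assumption.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical zones for the top-level prefix.
ZONES = ("raw", "processed")


def object_key(zone: str, source: str, dataset: str, day: date, filename: str) -> str:
    """Build an S3 object key with Hive-style year/month/day partitions."""
    if zone not in ZONES:
        raise ValueError(f"zone must be one of {ZONES}")
    return str(
        PurePosixPath(
            zone,
            source,
            dataset,
            f"year={day:%Y}",
            f"month={day:%m}",
            f"day={day:%d}",
            filename,
        )
    )


if __name__ == "__main__":
    # Placeholder source, dataset, and file names.
    print(object_key("raw", "erp", "orders", date(2017, 6, 26), "part-0000.parquet"))
```

Keys laid out as `name=value` partitions can be pruned by date in engines that understand Hive-style partitioning, which keeps queries from scanning the whole prefix.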


Architectural Design


Data Lake Component – Storage


Data Lake Component – Data Pipeline


Data Lake Component – RDBMS


Data Lake Component – Compute


Data Lake Component – Analytics


Business Benefits

• Reduced Data Acquisition Time by 5x

• Improved Marketing Campaigns

• Reduced Site Tagging Costs

• Improved Employee Retention and Satisfaction

• Automated Customer Self-Service Order Status

• Identified Web Funnel Conversion Opportunities (testing now)


Next steps and further information

• Data Lake solution on AWS: https://aws.amazon.com/big-data/data-lake-on-aws/

• Take a Free 30-Day Trial of Talend Integration Cloud: https://iam.integrationcloud.talend.com/idp/federation/up/login

• Try AWS for free: https://aws.amazon.com/


Q & A


Thank you!