data governance and schema registry for software developers

26
1 Simplify Governance of Streaming Data Gwen Shapira Product Manager, Apache Kafka™ PMC

Upload: gwen-chen-shapira

Post on 28-Jan-2018

984 views

Category:

Software


5 download

TRANSCRIPT

Page 1: Data Governance and Schema Registry for Software Developers

1

Simplify Governance of Streaming Data

Gwen Shapira

Product Manager, Apache Kafka™ PMC

Page 2: Data Governance and Schema Registry for Software Developers

2

Attend the whole series!

Simplify Governance for Streaming Data in Apache Kafka

Date: Thursday, April 6, 2017

Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET

Speaker: Gwen Shapira, Product Manager, Confluent

Using Apache Kafka to Analyze Session Windows

Date: Thursday, March 30, 2017

Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET

Speaker: Michael Noll, Product Manager, Confluent

Monitoring and Alerting Apache Kafka with Confluent

Control Center

Date: Thursday, March 16, 2017

Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET

Speaker: Nick Dearden, Director, Engineering and Product

Data Pipelines Made Simple with Apache Kafka

Date: Thursday, March 23, 2017

Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET

Speaker: Ewen Cheslack-Postava, Engineer, Confluent

https://www.confluent.io/online-talk/online-talk-series-five-steps-to-production-with-apache-kafka/

What’s New in Apache Kafka 0.10.2 and Confluent 3.2

Date: Thursday, March 9, 2017

Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET

Speaker: Clarke Patterson, Senior Director, Product

Marketing

Page 3: Data Governance and Schema Registry for Software Developers

3

What is Data Governance?

Data Access

Master Data

Management

Data AgilityData Reliability

Data Quality

Page 4: Data Governance and Schema Registry for Software Developers

4

To Grow a Successful Data Platform, You need Some Guarantees

Page 5: Data Governance and Schema Registry for Software Developers

5

If you can’t trust your data,

what can you trust?

Page 6: Data Governance and Schema Registry for Software Developers

7

Schemas Are a Key Part of This

1. Schemas: Enforced Contracts Between your Components

2. Schemas need to be agile but compatible

3. You need tools & process that will let you:• Move fast and not break things

Page 7: Data Governance and Schema Registry for Software Developers

9

Cautionary tale: Uber

Producers

Service 1 Service 2

RedshiftS3

EMR

{created_at: “Dec 1,

2015”}

CREATE (

created_at: STRING )

parseDate(created_at, “%M %d,

%yyyy”)

Page 8: Data Governance and Schema Registry for Software Developers

10

Producers

Service 1 Service 2

RedshiftS3

EMR

{created_at:

1491368429}

CREATE (

created_at: STRING )

parseDate(created_at, “%M %d,

%yyyy”)

Cautionary tale: Uber

Page 9: Data Governance and Schema Registry for Software Developers

11

The right way to do things

MessageCustomer

RoomGifts

BookRoom

LoyalCustomers

Page 10: Data Governance and Schema Registry for Software Developers

12

The right way to do things

MessageCustomer

RoomGifts

BookRoom

LoyalCustomers

BeachPromo If (property.location == beach &&

customer.level == platinum)

sendMessage({topic: RoomGifts, customer_id: customer.id,

room: property.roomnum, gift: seaShell})

Page 11: Data Governance and Schema Registry for Software Developers

13

There is special value in standardizing parts of schema

{ "type": "record",

"name": "Event",

"fields": [{ "name": "headers", "type": {

“name” : “headers”,

“type”: ”record”

“fields”: [

{“name”: “source_app”, “type”: “string”},

{“name”: “host_app”, “type”: “string”},

{“name”: “destination_app”, “type”: [“null,“string”]},

{“name”: “timestamp”, “type”: “long”}

]},

{ "name": "body", "type": "bytes" }] }

Standard headers allowed

analyzing patterns of communication

S3

Page 12: Data Governance and Schema Registry for Software Developers

15

In Request Response World – APIs between services are key

In Async Stream Processing World – Message Schemas ARE the API

…except they stick around for a lot longer

Page 13: Data Governance and Schema Registry for Software Developers

16

Lets assume you see why you need

schema compatibility…

How do we make it work for us?

Page 14: Data Governance and Schema Registry for Software Developers

17

Schema Registry

Elastic

Cassandra

Streams App

Example Consumers

SerializerApp 1

SerializerApp 2

!

Kafka Topic!

Schema

Registry

Define the expected fields for each Kafka topic

Automatically handle schema changes (e.g. new fields)

Prevent incompatible changes

Supports multi-datacenter environments

Page 15: Data Governance and Schema Registry for Software Developers

18

Key Points:

• Schema Registry is efficient – due to caching

• Schema Registry is transparent

• Schema Registry is Highly Available

• Schema Registry prevents bad data at the source

Page 16: Data Governance and Schema Registry for Software Developers

19

Big Benefits

• Schema Management is part of your entire app lifecycle – from dev to prod

• Single source of truth for all apps and data stores

• No more “Bad Data” randomly breaking things

• Increased agility! add, remove and modify fields with confidence

• Teams can share and discover data

• Smaller messages on the wire and in storage

Page 17: Data Governance and Schema Registry for Software Developers

20

But the Best Part is…

RDBMHadoop /

Hive

Kafka w/

Schema

Registry

JDBC

Source

HDFS

Sink

SchemaSchema

Schema

SchemaSchema

Page 18: Data Governance and Schema Registry for Software Developers

21

Everyone Schema Registry

Page 19: Data Governance and Schema Registry for Software Developers

22

Compatibility tips!

• Do you want to upgrade producer without touching consumers?

• You want *forward* compatibility.

• You can add fields.

• You can delete fields with defaults

• Do you want to update schema in a DWH but still read old messages?

• You want *backward* compatibility

• You can delete fields

• You can add fields with defaults

• Both?

• You want *full* compatibility.

• Default all things!

• Except the primary key which you never touch.

Page 20: Data Governance and Schema Registry for Software Developers

23

App Lifecycle

DevelopementSource

RepositoryPull Request

CI Staging / Test

Production

“Nightly build”

Lets Try

Lets Do It!

OptionalGood

Idea

YES Almost

too late

Last

resort

Page 21: Data Governance and Schema Registry for Software Developers

24

Finding Schemas

1. Write Schemas -> Generate Code

2. Get Schemas from registry -> Generate code

3. Write code -> Generate Schemas

Page 22: Data Governance and Schema Registry for Software Developers

25

Summary!

1. Data Quality is critical

2. Schema are critical

3. Schema Registry is used to maintain quality from Dev to Prod

4. Schema Registry is in Confluent Open Source

Page 23: Data Governance and Schema Registry for Software Developers

26

Attend the whole series!

Simplify Governance for Streaming Data in Apache Kafka

Date: Thursday, April 6, 2017

Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET

Speaker: Gwen Shapira, Product Manager, Confluent

Using Apache Kafka to Analyze Session Windows

Date: Thursday, March 30, 2017

Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET

Speaker: Michael Noll, Product Manager, Confluent

Monitoring and Alerting Apache Kafka with Confluent

Control Center

Date: Thursday, March 16, 2017

Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET

Speaker: Nick Dearden, Director, Engineering and Product

Data Pipelines Made Simple with Apache Kafka

Date: Thursday, March 23, 2017

Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET

Speaker: Ewen Cheslack-Postava, Engineer, Confluent

https://www.confluent.io/online-talk/online-talk-series-five-steps-to-production-with-apache-kafka/

What’s New in Apache Kafka 0.10.2 and Confluent 3.2

Date: Thursday, March 9, 2017

Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET

Speaker: Clarke Patterson, Senior Director, Product

Marketing

Page 24: Data Governance and Schema Registry for Software Developers

27

Get Started with Apache Kafka Today!

https://www.confluent.io/downloads/

THE place to start with Apache Kafka!

Thoroughly tested and

quality assured

More extensible developer

experience

Easy upgrade path to

Confluent Enterprise

Page 25: Data Governance and Schema Registry for Software Developers

28

30% off conference feecode = kafpcf17

New York City | May 8, 2017

Presented by

Page 26: Data Governance and Schema Registry for Software Developers

29

Thank You!