MapR Seattle Streams Meetup, Oct 2016

When Your Stream is the System of Record. Seattle Kafka Meetup. Will Ochandarena, Sr Dir, Product. October 24 2016. © 2016 MapR Technologies, MapR Confidential.

Upload: nitin-kumar

Post on 13-Apr-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: Map r seattle streams meetup   oct 2016

When Your Stream is the System of Record
Seattle Kafka Meetup

Will Ochandarena
Sr Dir, Product
October 24 2016

© 2016 MapR Technologies, MapR Confidential

Page 2: Map r seattle streams meetup   oct 2016

Agenda

• Streaming System of Record - What?
• A Little About MapR Streams
• Versioning a Real-time Data Pipeline
  – Demo - MapR + StreamSets

Page 3: Map r seattle streams meetup   oct 2016

Streaming System of Record

System of Record (n): information storage system that is the authoritative data source for a given data element or piece of information.

Page 4: Map r seattle streams meetup   oct 2016

Who Does This Today?

[Diagram: Events, Processing, DB, More Processing, Long Term Storage]

Page 5: Map r seattle streams meetup   oct 2016

Reprocessing is Hard

[Diagram: Events, Processing, DB, More Processing, Long Term Storage, ?]

Medium Term Storage: 3d ago -> Now

1 Year ago -> ~an hour ago

Page 6: Map r seattle streams meetup   oct 2016

Easy Fix - Streaming System of Persistence

[Diagram: Events, Processing, DB, More Processing, Long Term Storage; plus a second pair of labels, Long Term Storage and Events]

Page 7: Map r seattle streams meetup   oct 2016

How Can a Stream Be a System of Record?

Imagine each event as a change to an entry in a database.

DMV_Updates

0: { WillO : {City : Mountain View}, ts : 7/5/2009 04:01:01, src : dmv201 }
1: { BradA : {City : Atlanta}, ts : 5/11/2010 05:11:31, src : dmv1341 }
2: { BradA : {Points : +2}, ts : 6/22/2011 03:31:10, src : officer1213 }
3: { WillO : {City : San Jose}, ts : 11/1/2012 04:01:01, src : dmv1661 }

DL_ID   City                           Points
WillO   Mountain View, then San Jose   0
BradA   Atlanta                        0, then 2
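The idea that each event is a change to a database entry can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dict layout (`offset`, `id`, `change`) and the function names `apply_event` and `materialize` are hypothetical, and a value such as `+2` is assumed to mean "increment by 2".

```python
# The four DMV_Updates events from the slide, restated as Python dicts.
# The field names "offset", "id" and "change" are illustrative, not a MapR format.
EVENTS = [
    {"offset": 0, "id": "WillO", "change": {"City": "Mountain View"},
     "ts": "7/5/2009 04:01:01", "src": "dmv201"},
    {"offset": 1, "id": "BradA", "change": {"City": "Atlanta"},
     "ts": "5/11/2010 05:11:31", "src": "dmv1341"},
    {"offset": 2, "id": "BradA", "change": {"Points": "+2"},
     "ts": "6/22/2011 03:31:10", "src": "officer1213"},
    {"offset": 3, "id": "WillO", "change": {"City": "San Jose"},
     "ts": "11/1/2012 04:01:01", "src": "dmv1661"},
]


def apply_event(table, event):
    """Apply one change event to the table, creating the row if needed."""
    row = table.setdefault(event["id"], {"City": None, "Points": 0})
    for field, value in event["change"].items():
        if isinstance(value, str) and value.startswith("+"):
            # Assumption: "+N" means "add N to the current value".
            row[field] = row.get(field, 0) + int(value[1:])
        else:
            row[field] = value
    return table


def materialize(events):
    """Rebuild the table by replaying the events in offset order."""
    table = {}
    for event in sorted(events, key=lambda e: e["offset"]):
        apply_event(table, event)
    return table


TABLE = materialize(EVENTS)
for dl_id, row in TABLE.items():
    print(dl_id, row["City"], row["Points"])
```

Replaying a prefix of the events, for example `materialize(EVENTS[:2])`, gives the table as it looked at that point in the stream.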

Page 8: Map r seattle streams meetup   oct 2016

Streams and Databases in Harmony

[Diagram: Inserts, Updates, ???; database types: Key-Val, Document, Graph, Wide Column, Time Series, Relational]

Page 9: Map r seattle streams meetup   oct 2016

Which Makes a Better System of Record?

Which of these can be used to reconstruct the other?

0: { WillO : {City : Mountain View}, ts : 7/5/2009 04:01:01, src : dmv201 }
1: { BradA : {City : Atlanta}, ts : 5/11/2010 05:11:31, src : dmv1341 }
2: { BradA : {Points : +2}, ts : 6/22/2011 03:31:10, src : officer1213 }
3: { WillO : {City : San Jose}, ts : 11/1/2012 04:01:01, src : dmv1661 }

DL_ID   City       Points
WillO   San Jose   0
BradA   Atlanta    2

Page 10: Map r seattle streams meetup   oct 2016

Other Benefits of Streaming System of Record

• Auditing - “how did BradA’s points get so high?”
• Lineage - “who added points to BradA’s license?”
• History - “where did WillO used to live?”
• Integrity - “can I trust this data hasn’t been tampered with?”
  • Yup - Streams are immutable

0: { WillO : {City : Mountain View}, ts : 7/5/2009 04:01:01, src : dmv201 }
1: { BradA : {City : Atlanta}, ts : 5/11/2010 05:11:31, src : dmv1341 }
2: { BradA : {Points : +2}, ts : 6/22/2011 03:31:10, src : officer1213 }
3: { WillO : {City : San Jose}, ts : 11/1/2012 04:01:01, src : dmv1661 }
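The auditing, lineage and history questions above can all be answered by filtering the event log, which a table holding only current values cannot do. The sketch below assumes a hypothetical `Event` tuple and helper names (`history`, `sources`, `previous_values`); it is a toy over the four slide events, not a MapR query API.

```python
from typing import NamedTuple


class Event(NamedTuple):
    offset: int
    dl_id: str
    field: str
    value: str
    ts: str
    src: str


# The slide's events, kept in an immutable tuple of immutable records.
LOG = (
    Event(0, "WillO", "City", "Mountain View", "7/5/2009 04:01:01", "dmv201"),
    Event(1, "BradA", "City", "Atlanta", "5/11/2010 05:11:31", "dmv1341"),
    Event(2, "BradA", "Points", "+2", "6/22/2011 03:31:10", "officer1213"),
    Event(3, "WillO", "City", "San Jose", "11/1/2012 04:01:01", "dmv1661"),
)


def history(log, dl_id, field):
    """Auditing: every change to one field of one license, in log order."""
    return [e for e in log if e.dl_id == dl_id and e.field == field]


def sources(log, dl_id, field):
    """Lineage: who wrote those changes."""
    return [e.src for e in history(log, dl_id, field)]


def previous_values(log, dl_id, field):
    """History: every value except the current (last) one."""
    return [e.value for e in history(log, dl_id, field)][:-1]


print(sources(LOG, "BradA", "Points"))        # who added points to BradA's license
print(previous_values(LOG, "WillO", "City"))  # where WillO used to live
```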

Page 11: Map r seattle streams meetup   oct 2016

What Do I Need For This to Work?

• Infinitely persisted events
• A way to query your persisted stream data
• An integrated security model across data services

• Applied Streaming System of Record @ Liaison Blog

Page 12: Map r seattle streams meetup   oct 2016

About MapR & MapR Streams

Page 13: Map r seattle streams meetup   oct 2016

MapR Streams: Global Pub-sub Event Streaming System for Big Data

Producers publish billions of events/sec to a topic in a stream.

Events persisted and immediately delivered to all consumers, guaranteed.

Tie together geo-dispersed clusters. Worldwide.

Standard real-time API (Kafka). Integrates with Spark Streaming, Storm, Apex, and Flink.

Direct data access (OJAI API) from analytics frameworks.

[Diagram: Producers, Stream (Topic, Topic), Consumers, Remote sites and consumers, Batch analytics]
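The publish and deliver model described above (producers append to a topic in a stream, events are persisted, every consumer reads them independently) can be illustrated with a small in-memory toy. Everything here is an assumption made for illustration: `ToyStream` and its methods are hypothetical and are not the MapR Streams or Kafka API.

```python
from collections import defaultdict


class ToyStream:
    """In-memory stand-in for a stream: named topics, each an append-only list."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> persisted events
        self._cursors = defaultdict(int)   # (consumer, topic) -> next offset to read

    def publish(self, topic, event):
        """Append an event to a topic and return its offset."""
        log = self._topics[topic]
        log.append(event)
        return len(log) - 1

    def poll(self, consumer, topic, max_events=100):
        """Return this consumer's unread events and advance only its own cursor."""
        start = self._cursors[(consumer, topic)]
        batch = self._topics[topic][start:start + max_events]
        self._cursors[(consumer, topic)] = start + len(batch)
        return batch

    def seek(self, consumer, topic, offset):
        """Move a consumer's cursor, for example to 0 to replay a topic."""
        self._cursors[(consumer, topic)] = offset


stream = ToyStream()
for click in ("home", "search", "checkout"):
    stream.publish("clicks", click)

first_read = stream.poll("analytics", "clicks")   # analytics sees all three events
other_read = stream.poll("billing", "clicks")     # billing sees the same three events
second_read = stream.poll("analytics", "clicks")  # nothing new for analytics

stream.seek("analytics", "clicks", 0)             # events are persisted, so replay works
replayed = stream.poll("analytics", "clicks")
```

Reading never removes an event, which is what lets a second consumer, or a replay, see the full history.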

Page 14: Map r seattle streams meetup   oct 2016

Streams Offers a Durable, Persistent System of Record

    [
      { "Topic1Part0Seq5001" : {
          "timestamp" : 1456246886,
          "topic" : "Topic1",
          "partition" : 0,
          "producer" : "wochanda",
          "offset" : 5001,
          "key" : "MsgKey",
          "data" : {...} } },
      { "Topic2Part0Seq5002" : { … } },
      …
    ]

● Reliable
● Secure
● Immutable
● Auditable
● Replayable
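The record above suggests that each message is addressable by topic, partition and offset. As a rough sketch, the helper below wraps a message in a document of the same shape. The function name `to_document` is hypothetical, and the ID pattern `<topic>Part<partition>Seq<offset>` is inferred only from the example IDs on the slide, so treat it as an assumption rather than a documented format.

```python
import json


def to_document(topic, partition, offset, timestamp, producer, key, data):
    """Wrap one message in a document keyed by topic, partition and offset."""
    # Assumed ID pattern, inferred from "Topic1Part0Seq5001" on the slide.
    doc_id = f"{topic}Part{partition}Seq{offset}"
    return {
        doc_id: {
            "timestamp": timestamp,
            "topic": topic,
            "partition": partition,
            "producer": producer,
            "offset": offset,
            "key": key,
            "data": data,
        }
    }


# The slide elides the payload, so an empty dict stands in for "data" here.
doc = to_document("Topic1", 0, 5001, 1456246886, "wochanda", "MsgKey", {})
print(json.dumps(doc, indent=2))
```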

Page 15: Map r seattle streams meetup   oct 2016

Streams Enables Global Applications and Analytics

Provides
● Arbitrary topology of thousands of clusters
● Automatic loop prevention
● DNS-based discovery
● Globally synchronized message offsets and consumer cursors

Enables
● Global applications & data collection
● Producer & consumer failover
● Analysis/filtering/aggregation at the edge
● “Occasional” connections

[Diagram: Producers, Consumers]

Page 16: Map r seattle streams meetup   oct 2016

Fun Facts

MapR Streams

Converged | Global | Scale | Secure & Multi-Tenant

• Single cluster for files, tables, and streams.
• Global, IoT-scale “fabrics” with failover.
• Tenant-owned streams, logical grouping of topics and messages.
• Authentication, authorization, encryption. Unified policy with all other platform services.
• Infinite “system of record” persistence.
• Metadata tracked internally, no dependencies on ZK. Consumers, topics scale into millions.

Page 17: Map r seattle streams meetup   oct 2016

MapR Converged Data Platform

[Architecture diagram, labels only]

Data Processing: Open Source Engines & Tools, Commercial Engines & Applications, Custom Apps, Cloud and Managed Services, Search and Others

APIs: HDFS API, POSIX, NFS, HBase API, JSON API, Kafka API

MapR Data Platform Services: Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)

Global Namespace | No Single Point of Failure | Data Protection | Multi-tenancy | Workload Management | Multi Temperature | Global Multi Datacenter | High Performance Low Latency | Security | Management & Monitoring

Unified Management and Monitoring

Commodity Hardware/Storage, Clouds, & Containers

Page 18: Map r seattle streams meetup   oct 2016

Versioning a Real-time Data Pipeline

Page 19: Map r seattle streams meetup   oct 2016

Challenges of a Streaming App Developer: Pre-Production

[Diagram: App Environment with Streaming System, Database, and Hadoop Cluster. Labels: events, logs, events2, logs2, v2, v2, /clicks, /clicks2, ...]

Page 20: Map r seattle streams meetup   oct 2016

Challenges with Versioning: Post-Production

Input Data + App Logic = Output Data

• Output Streams
• Database Tables
• Logs, Metrics

What if you deploy a new version of your application?

What happens to all of this?

Page 21: Map r seattle streams meetup   oct 2016

Example: Versioning in Production

Input_Stream: 45 40 60 30 37 39 72 79 60

App: Calculate_Mean_3 / Calculate_Median_3

Output_Stream: 45 35 70

Output_Table

Time       Value
00:00:00   70
00:00:05   35
00:00:10   45
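One reading of the numbers above is that the app was switched from Calculate_Mean_3 to Calculate_Median_3 after two windows, so a single output ends up holding results from both versions. The sketch below reproduces that under three assumptions that are not stated on the slide: the input is printed newest first, windows are non-overlapping groups of three, and the mean is rounded to the nearest integer. The helper names are hypothetical.

```python
from statistics import mean, median

# Input_Stream as printed on the slide. Assumption: the newest value is on the
# left, so the list is reversed to get arrival order.
INPUT_STREAM = [45, 40, 60, 30, 37, 39, 72, 79, 60]
ARRIVAL_ORDER = INPUT_STREAM[::-1]


def windows(values, size=3):
    """Split values into consecutive, non-overlapping windows of `size`."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, size)]


def calculate_mean_3(window):
    # Rounding is an assumption; it matches the slide's 70 and 35.
    return round(mean(window))


def calculate_median_3(window):
    return median(window)


def run_with_upgrade(values, old_app, new_app, upgrade_after):
    """Write every result to one shared output, switching apps mid-stream."""
    output = []
    for index, window in enumerate(windows(values)):
        app = old_app if index < upgrade_after else new_app
        output.append(app(window))
    return output


SHARED_OUTPUT = run_with_upgrade(
    ARRIVAL_ORDER, calculate_mean_3, calculate_median_3, upgrade_after=2
)
print(SHARED_OUTPUT)  # two means followed by one median, all in the same output
```

A reader of the shared output cannot tell which logic produced which row, which is the versioning problem the slide points at.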

Page 22: Map r seattle streams meetup   oct 2016

Versioning with Converged App Volumes

Input_Stream: 45 40 60 30 37 39 72 79 60

Calculate_Mean_3 Volume

App: Calculate_Mean_3

Output_Stream: 35 70

Output_Table

Time       Value
00:00:00   70
00:00:05   35
00:00:10

Calculate_Median_3 Volume

App: Calculate_Median_3

Output_Stream: 45 37 72

Output_Table

Time       Value
00:00:00   72
00:00:05   37
00:00:10   45
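The per-version layout above can be approximated by giving each app version its own output and filling the new version's output by replaying the input stream from the start. This is a hedged sketch: the `volumes` dict is a hypothetical stand-in for real volumes, the arrival order and rounding are assumptions, and the old version is assumed to have stopped after two windows because the slide leaves its third row blank.

```python
from statistics import mean, median

# Slide values in assumed arrival order (the slide is read right to left).
ARRIVED = [60, 79, 72, 39, 37, 30, 60, 40, 45]


def calculate_mean_3(window):
    return round(mean(window))  # rounding is an assumption that matches 70 and 35


def calculate_median_3(window):
    return median(window)


def replay(values, app, size=3):
    """Run one app version over the input from the beginning, window by window."""
    return [app(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Hypothetical stand-in for per-version volumes: one output list per app name.
volumes = {}

# Version 1 ran while only the first two windows had arrived.
volumes["Calculate_Mean_3"] = replay(ARRIVED[:6], calculate_mean_3)

# Version 2 gets its own volume and replays the full input, so version 1's
# output is left untouched and version 2's output is internally consistent.
volumes["Calculate_Median_3"] = replay(ARRIVED, calculate_median_3)

for name, output in volumes.items():
    print(name, output)
```

Because the input stream is persisted, the new version can be backfilled from offset zero instead of starting from whatever arrives after deployment.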

Page 23: Map r seattle streams meetup   oct 2016

Versioning & A/B Testing

A: 80%
B: 10%
C: 10%
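A traffic split like the one above is often implemented by hashing a stable key into 100 buckets and mapping bucket ranges to variants. The sketch below is one such approach, not necessarily what the demo used. It assumes the percentages pair with the labels in order (A 80%, B 10%, C 10%), and the names `variant_for_bucket` and `assign` are hypothetical.

```python
import hashlib

# Assumed mapping of the slide's labels: A gets 80%, B and C get 10% each.
SPLIT = [("A", 80), ("B", 10), ("C", 10)]


def variant_for_bucket(bucket, split=SPLIT):
    """Map a bucket in 0..99 to a variant using cumulative percentages."""
    upper = 0
    for variant, percent in split:
        upper += percent
        if bucket < upper:
            return variant
    raise ValueError("bucket must be in 0..99 and percentages must sum to 100")


def assign(key, split=SPLIT):
    """Deterministically assign a key (for example a user ID) to a variant."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return variant_for_bucket(bucket, split)


counts = {"A": 0, "B": 0, "C": 0}
for i in range(10_000):
    counts[assign(f"user-{i}")] += 1
print(counts)
```

Hashing the key instead of drawing a random number keeps each key on the same variant across restarts and replays, so a replayed stream is routed the same way each time.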

Page 24: Map r seattle streams meetup   oct 2016

DEMO - MapR & StreamSets
Versioning a Production Data Pipeline

Rupal Shah - StreamSets

Page 25: Map r seattle streams meetup   oct 2016

StreamSets Data Collector™

StreamSets Data Collector™ is open source software for building and deploying individual any-to-any ingest pipelines in the face of data drift.

Adaptable Pipelines -> Efficiency
❑ Intent-driven ingest (minimal schema specification).
❑ Data drift handling.

Pipeline KPIs -> Visibility
❑ Real-time stage, edge and bad data metrics.
❑ Alerts via profiling, sampling and threshold-based rules.

Containerized Architecture -> Agility
❑ Flexible deployment: edge, cluster, embedded, pipeline, pub/sub.
❑ Zero-downtime upgrades due to logical component isolation.

Page 26: Map r seattle streams meetup   oct 2016

StreamSets Dataflow Performance Manager™

StreamSets Dataflow Performance Manager (DPM™) provides a single pane of glass to map, measure and master big data in motion.

MAP: Dataflow Lineage, Live Data Architecture
MEASURE: Any Path, Any Time
MASTER: Availability & Accuracy, Proactive Remediation

Page 27: Map r seattle streams meetup   oct 2016

…helping you put data technology to work

● Find answers

● Ask technical questions

● Join on-demand training course discussions

● Follow release announcements

● Share and vote on product ideas

● Find Meetup and event listings

Connect with fellow Apache Hadoop and Spark professionals

community.mapr.com

Page 28: Map r seattle streams meetup   oct 2016

Backup

Page 29: Map r seattle streams meetup   oct 2016

Find my slides & other related materials to this talk here: bit.ly/tbd

or search: