MapR Seattle Streams Meetup, Oct 2016
© 2016 MapR Technologies. MapR Confidential.
When Your Stream is the System of Record
Seattle Kafka Meetup
Will Ochandarena
Sr Dir, Product
October 24, 2016
Agenda
• Streaming System of Record - What?
• A Little About MapR Streams
• Versioning a Real-time Data Pipeline
  – Demo - MapR + StreamSets
Streaming System of Record
System of Record (n): information storage system that is the authoritative data source for a given data element or piece of information.
Who Does This Today?
Events
Processing
DB
More Processing
Long Term Storage
Reprocessing is Hard
Events
Processing
DB
More Processing
Long Term Storage
?
Medium Term Storage
3d ago -> Now
1 Year ago -> ~an hour ago
Easy Fix - Streaming System of Persistence
Events
Processing
DB
More Processing
Long Term Storage
Long Term Storage
Events
How Can a Stream Be a System of Record?
Imagine each event as a change to an entry in a database.
DMV_Updates
0: { WillO : {City : Mountain View}, ts : 7/5/2009 04:01:01, src : dmv201 }
1: { BradA : {City : Atlanta}, ts : 5/11/2010 05:11:31, src : dmv1341 }
2: { BradA : {Points : +2}, ts : 6/22/2011 03:31:10, src : officer1213 }
3: { WillO : {City : San Jose}, ts : 11/1/2012 04:01:01, src : dmv1661 }
DL_ID  City                       Points
WillO  Mountain View -> San Jose  0
BradA  Atlanta                    0 -> 2
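The idea of treating each event as a change to a database entry can be sketched as a simple fold over the stream. This is only an illustration: the dictionary layout of the events (the "id" and "change" keys), the default of 0 points, and the function names are assumptions made for this sketch, not anything shown in the talk.

```python
# The four DMV_Updates events from the slide, in offset order.
# ASSUMPTION: the "id"/"change" layout is invented for this sketch.
EVENTS = [
    {"id": "WillO", "change": {"City": "Mountain View"}, "ts": "7/5/2009 04:01:01", "src": "dmv201"},
    {"id": "BradA", "change": {"City": "Atlanta"}, "ts": "5/11/2010 05:11:31", "src": "dmv1341"},
    {"id": "BradA", "change": {"Points": "+2"}, "ts": "6/22/2011 03:31:10", "src": "officer1213"},
    {"id": "WillO", "change": {"City": "San Jose"}, "ts": "11/1/2012 04:01:01", "src": "dmv1661"},
]


def apply_event(table, event):
    """Apply one change event to the table (a dict keyed by DL_ID)."""
    # ASSUMPTION: a new license starts with no city and 0 points.
    row = table.setdefault(event["id"], {"City": None, "Points": 0})
    for field, value in event["change"].items():
        if field == "Points":
            row["Points"] += int(value)  # "+2" is a delta, not a new value
        else:
            row[field] = value           # other fields are overwritten
    return table


def materialize(events):
    """Replay events in order and return the resulting table."""
    table = {}
    for event in events:
        apply_event(table, event)
    return table


if __name__ == "__main__":
    for dl_id, row in materialize(EVENTS).items():
        print(dl_id, row["City"], row["Points"])
```

Replaying only a prefix of the stream, for example materialize(EVENTS[:2]), gives the table as it stood at that point in time, which the table alone cannot do.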
Streams and Databases in Harmony
Key-Val, Document, Graph
Wide Column, Time Series, Relational
???
Inserts, Updates
Which Makes a Better System of Record?
Which of these can be used to reconstruct the other?
0: { WillO : {City : Mountain View}, ts : 7/5/2009 04:01:01, src : dmv201 }
1: { BradA : {City : Atlanta}, ts : 5/11/2010 05:11:31, src : dmv1341 }
2: { BradA : {Points : +2}, ts : 6/22/2011 03:31:10, src : officer1213 }
3: { WillO : {City : San Jose}, ts : 11/1/2012 04:01:01, src : dmv1661 }
DL_ID  City      Points
WillO  San Jose  0
BradA  Atlanta   2
Other Benefits of Streaming System of Record
• Auditing - “how did BradA’s points get so high?”
• Lineage - “who added points to BradA’s license?”
• History - “where did WillO used to live?”
• Integrity - “can I trust this data hasn’t been tampered with?”
  • Yup - Streams are immutable
0: { WillO : {City : Mountain View}, ts : 7/5/2009 04:01:01, src : dmv201 }
1: { BradA : {City : Atlanta}, ts : 5/11/2010 05:11:31, src : dmv1341 }
2: { BradA : {Points : +2}, ts : 6/22/2011 03:31:10, src : officer1213 }
3: { WillO : {City : San Jose}, ts : 11/1/2012 04:01:01, src : dmv1661 }
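The auditing, lineage, and history questions above can each be answered by scanning the event log rather than the table. The sketch below assumes the same invented event layout ("id" and "change" keys) and hypothetical helper names; it is not an API from MapR Streams.

```python
# The four DMV_Updates events from the slide, in offset order.
# ASSUMPTION: the "id"/"change" layout is invented for this sketch.
EVENTS = [
    {"id": "WillO", "change": {"City": "Mountain View"}, "ts": "7/5/2009 04:01:01", "src": "dmv201"},
    {"id": "BradA", "change": {"City": "Atlanta"}, "ts": "5/11/2010 05:11:31", "src": "dmv1341"},
    {"id": "BradA", "change": {"Points": "+2"}, "ts": "6/22/2011 03:31:10", "src": "officer1213"},
    {"id": "WillO", "change": {"City": "San Jose"}, "ts": "11/1/2012 04:01:01", "src": "dmv1661"},
]


def changes(events, dl_id, field):
    """Return (offset, event) pairs for every event that touched dl_id's field."""
    return [
        (offset, event)
        for offset, event in enumerate(events)
        if event["id"] == dl_id and field in event["change"]
    ]


def audit(events, dl_id, field):
    """Auditing: each change to the field, with the offset and time it happened."""
    return [(off, e["change"][field], e["ts"]) for off, e in changes(events, dl_id, field)]


def lineage(events, dl_id, field):
    """Lineage: the sources that produced those changes."""
    return [e["src"] for _, e in changes(events, dl_id, field)]


def previous_values(events, dl_id, field):
    """History: every value the field held before its current one."""
    values = [e["change"][field] for _, e in changes(events, dl_id, field)]
    return values[:-1]


if __name__ == "__main__":
    print(audit(EVENTS, "BradA", "Points"))          # how did BradA's points get so high?
    print(lineage(EVENTS, "BradA", "Points"))        # who added points to BradA's license?
    print(previous_values(EVENTS, "WillO", "City"))  # where did WillO used to live?
```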
What Do I Need For This to Work?
• Infinitely persisted events
• A way to query your persisted stream data
• An integrated security model across data services
• Applied Streaming System of Record @ Liaison Blog
About MapR & MapR Streams
MapR Streams: Global Pub-sub Event Streaming System for Big Data
Producers publish billions of events/sec to a topic in a stream.
Events persisted and immediately delivered to all consumers, guaranteed.
Tie together geo-dispersed clusters. Worldwide.
Standard real-time API (Kafka). Integrates with Spark Streaming, Storm, Apex, and Flink.
Direct data access (OJAI API) from analytics frameworks.
Topic
Stream
Topic
Producers
Consumers
Remote sites and consumers
Batch analytics
Streams Offers a Durable, Persistent System of Record
[
  { "Topic1Part0Seq5001": {
      "timestamp" : 1456246886,
      "topic" : "Topic1",
      "partition" : 0,
      "producer" : "wochanda",
      "offset" : 5001,
      "key" : "MsgKey",
      "data" : {...} } },
  { "Topic2Part0Seq5002": { … } },
  …
]
● Reliable
● Secure
● Immutable
● Auditable
● Replayable
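To make "Immutable" and "Replayable" concrete, here is a toy in-memory model of a single topic partition whose records carry the fields shown in the JSON above. It is a sketch under stated assumptions, not the MapR Streams or Kafka API: the Record and PartitionLog classes, the bytes payload, and offsets starting at 0 are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Record:
    timestamp: int
    topic: str
    partition: int
    producer: str
    offset: int
    key: str
    data: bytes


class PartitionLog:
    """Append-only log for one topic partition (hypothetical toy model)."""

    def __init__(self, topic: str, partition: int) -> None:
        self.topic = topic
        self.partition = partition
        self._records: List[Record] = []

    def append(self, producer: str, key: str, data: bytes, timestamp: int) -> Record:
        """Add a record at the end; the offset is its position in the log."""
        record = Record(timestamp, self.topic, self.partition,
                        producer, len(self._records), key, data)
        self._records.append(record)
        return record

    def read(self, from_offset: int = 0) -> Tuple[Record, ...]:
        """Replay every record from from_offset onward, in order."""
        return tuple(self._records[from_offset:])


log = PartitionLog("Topic1", 0)
first = log.append("wochanda", "MsgKey", b"payload-0", 1456246886)
second = log.append("wochanda", "MsgKey", b"payload-1", 1456246887)

if __name__ == "__main__":
    # Two independent reads from offset 0 see the same records in the same order.
    print(log.read(0) == log.read(0))
    for record in log.read(1):
        print(record.offset, record.key, record.data)
```

Because nothing is ever updated in place, a new consumer can start at offset 0 at any later time and rebuild the same state an earlier consumer saw.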
Streams Enables Global Applications and Analytics
Provides
● Arbitrary topology of thousands of clusters
● Automatic loop prevention
● DNS-based discovery
● Globally synchronized message offsets and consumer cursors
Enables
● Global applications & data collection
● Producer & consumer failover
● Analysis/filtering/aggregation at the edge
● “Occasional” connections
Producers
Consumers
Fun Facts
MapR Streams
Converged Global Scale
Secure & Multi-Tenant
Single cluster for files, tables, and streams.
Global, IoT-scale “fabrics” with failover.
Tenant-owned streams, logical grouping of topics and messages.
Authentication, authorization, encryption. Unified policy with all other platform services.
Infinite “system of record” persistence.
Metadata tracked internally, no dependencies on ZK. Consumers, topics scale into millions.
[Figure: MapR Converged Data Platform]
Data Processing: Open Source Engines & Tools | Commercial Engines & Applications | Custom Apps | Cloud and Managed Services | Search and Others
Unified Management and Monitoring
APIs: HDFS API | POSIX, NFS | HBase API | JSON API | Kafka API
MapR Data Platform Services: Web-Scale Storage (MapR-FS) | Database (MapR-DB) | Event Streaming (MapR Streams)
Global Namespace | No Single Point of Failure | Data Protection | Multi-tenancy | Workload Management | Multi Temperature | Global Multi Datacenter | High Performance Low Latency | Security | Management & Monitoring
Commodity Hardware/Storage, Clouds, & Containers
Versioning a Real-time Data Pipeline
Challenges of a Streaming App Developer: Pre-Production
Streaming System
Database Hadoop Cluster
App Environment
events
logs
events2
logs2
v2
v2 /clicks /clicks2
... ...
... ...
Challenges with Versioning: Post-Production
Input Data + App Logic = Output Data
Output Streams
Database Tables
Logs, Metrics
What if you deploy a new version of your application?
What happens to all of this?
Example: Versioning in Production
Input_Stream: 45 40 60 30 37 39 72 79 60
Calculate_Mean_3
Output_Stream: 45 35 70
Output_Table:
Time      Value
00:00:00  70
00:00:05  35
00:00:10  45
Calculate_Mean_3  Calculate_Median_3
Versioning with Converged App Volumes
Input_Stream: 45 40 60 30 37 39 72 79 60
Calculate_Mean_3  Calculate_Median_3
Calculate_Mean_3 Volume
Output_Stream: 35 70
Output_Table:
Time      Value
00:00:00  70
00:00:05  35
00:00:10
Calculate_Median_3 Volume
Output_Stream: 45 37 72
Output_Table:
Time      Value
00:00:00  72
00:00:05  37
00:00:10  45
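The numbers on the two versioning slides can be reproduced with a small sketch. Several things here are assumptions inferred from the slide values rather than stated in the talk: that the newest event is drawn on the left (so arrival order is the reverse of the printed order), that each app works on non-overlapping windows of three events, that means are rounded down, and that the switch from Calculate_Mean_3 to Calculate_Median_3 happens after the second window. The function names are hypothetical.

```python
from statistics import median

# Input_Stream as printed on the slide.
PRINTED = [45, 40, 60, 30, 37, 39, 72, 79, 60]
# ASSUMPTION: newest event is on the left, so reverse to get arrival order.
INPUT_STREAM = list(reversed(PRINTED))


def windows(values, size=3):
    """Split values into consecutive, non-overlapping windows of `size`."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, size)]


def calculate_mean_3(window):
    return sum(window) // len(window)  # ASSUMPTION: round down to an integer


def calculate_median_3(window):
    return int(median(window))


def run(app, stream, start=0, stop=None):
    """Apply app to windows start..stop (by window index) of the stream."""
    return [app(w) for w in windows(stream)[start:stop]]


# One shared output: v1 (mean) handled two windows, then v2 (median) took over.
# The result mixes the two versions and cannot be told apart afterwards.
shared_output = (run(calculate_mean_3, INPUT_STREAM, 0, 2)
                 + run(calculate_median_3, INPUT_STREAM, 2))

# One output per app version: v1 keeps what it wrote, and v2 replays the
# persisted input from the beginning into its own volume.
volumes = {
    "Calculate_Mean_3": run(calculate_mean_3, INPUT_STREAM, 0, 2),
    "Calculate_Median_3": run(calculate_median_3, INPUT_STREAM),
}

if __name__ == "__main__":
    print("shared:", shared_output)
    for name, output in volumes.items():
        print(name, output)
```

The per-version layout only works because the input stream is persisted: the new version needs the old events in order to rebuild its output from the start.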
Versioning & A/B Testing
80%
10%
10%
A
B
C
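One way to picture the 80% / 10% / 10% split is as a routing rule that sends each event key to version A, B, or C. The sketch below assumes the percentages are shares of traffic per app version and uses hash-based routing so a given key always lands on the same version; that technique, and every name in the code, is an illustration chosen here, not something described in the talk.

```python
import hashlib

# ASSUMPTION: percentages are the share of keys routed to each app version.
SPLIT = [("A", 80), ("B", 10), ("C", 10)]


def bucket(key: str) -> int:
    """Map a key to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(key: str, split=SPLIT) -> str:
    """Return the version that should process events with this key."""
    position = bucket(key)
    upper = 0
    for version, percent in split:
        upper += percent
        if position < upper:
            return version
    raise ValueError("split percentages must sum to 100")


if __name__ == "__main__":
    counts = {"A": 0, "B": 0, "C": 0}
    for i in range(10000):
        counts[route(f"user-{i}")] += 1
    print(counts)  # roughly an 80/10/10 split
```

Because every version reads from the same persisted input, each one can write to its own output and the results can be compared side by side.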
DEMO - MapR & StreamSets
Versioning a Production Data Pipeline
Rupal Shah - StreamSets
StreamSets Data Collector™
Adaptable Pipelines -> Efficiency
❑ Intent-driven ingest (minimal schema specification).
❑ Data drift handling.
Pipeline KPIs -> Visibility
❑ Real-time stage, edge and bad data metrics.
❑ Alerts via profiling, sampling and threshold-based rules.
Containerized Architecture -> Agility
❑ Flexible deployment: edge, cluster, embedded, pipeline, pub/sub
❑ Zero-downtime upgrades due to logical component isolation.
StreamSets Data Collector™ is open source software for building and deploying individual any-to-any ingest pipelines in the face of data drift.
StreamSets Dataflow Performance Manager™
StreamSets Dataflow Performance Manager (DPM™) provides a single pane of glass to map, measure and master big data in motion.
MASTER Availability & Accuracy Proactive Remediation
MEASURE Any Path Any Time
MAP Dataflow Lineage Live Data Architecture
…helping you put data technology to work
● Find answers
● Ask technical questions
● Join on-demand training course discussions
● Follow release announcements
● Share and vote on product ideas
● Find Meetup and event listings
Connect with fellow Apache Hadoop and Spark professionals
community.mapr.com
Backup
Find my slides & other related materials to this talk here:
bit.ly/tbd
or search: