Jen Tong, Developer Advocate
Coping with IoT Data on Google Cloud Platform
Confidential & ProprietaryGoogle Cloud Platform 2
Jen Tong @MimmingCodes
Felipe Hoffa @felipehoffa
Agenda
● IoT Data Challenges
● A use case
● A recipe
● Workshop
  ○ Setup stuff
  ○ Simulate the activity
  ○ Capture it with Pub/Sub
  ○ Wrangle it with Dataflow
  ○ Analyze it with BigQuery
About you
● Electrical engineers?
● Web developers?
● Data scientists?
● Mechanical engineers?
● Not engineers at all?
Photo credit: Matt Chan
Data
photo credit - taniwha on flickr
Photo credit: Matt Chan
Data
photo credit - wemake_cc on flickr
Data
Big data
Google Research Publications
Open Source Implementations
Bigtable
Flume
Dremel
Managed Cloud Versions
Bigtable → Bigtable
Flume → Dataflow
Dremel → BigQuery
Coping with big data
Big data
Really big data
Tuesday, Wednesday, Thursday
Infinite data
[Figure: timeline from 1:00 to 14:00]
Delayed data
[Figure: timeline from 8:00 to 14:00, with several events labeled 8:00]
Batch Patterns: Creating Structured Data
MapReduce
Batch Patterns: Repetitive Runs
MapReduce
Tuesday, Wednesday, Thursday
Batch Patterns: Time Based Windows
MapReduce
Tuesday [11:00 - 12:00)
[12:00 - 13:00)
[13:00 - 14:00)
[14:00 - 15:00)
[15:00 - 16:00)
[16:00 - 17:00)
[18:00 - 19:00)
[19:00 - 20:00)
[21:00 - 22:00)
[22:00 - 23:00)
[23:00 - 0:00)
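A minimal sketch of how an element could be assigned to one of these hourly windows. It uses plain Java rather than the Dataflow SDK, and the class and method names are hypothetical. Each window is half-open, [start, end), matching the notation above.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: maps a timestamp to its fixed window [start, end). */
final class FixedWindows {
    private FixedWindows() {}

    /** Returns the inclusive start of the window that contains the timestamp. */
    static Instant windowStart(Instant timestamp, Duration size) {
        long sizeMillis = size.toMillis();
        long millis = timestamp.toEpochMilli();
        // floorMod keeps timestamps before the epoch in the correct window.
        return Instant.ofEpochMilli(millis - Math.floorMod(millis, sizeMillis));
    }

    /** Returns the exclusive end of the window that contains the timestamp. */
    static Instant windowEnd(Instant timestamp, Duration size) {
        return windowStart(timestamp, size).plus(size);
    }
}
```

An element stamped 11:59:59 lands in [11:00 - 12:00), while one stamped exactly 12:00:00 starts the next window.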
Batch Patterns: Sessions
MapReduce
[Figure: per-user sessions for Jose, Lisa, Ingo, Asha, Cheryl, and Ari across Tuesday and Wednesday]
Streaming Patterns: Element-wise transformations
[Figure: Processing Time axis from 8:00 to 14:00]
Streaming Patterns: Aggregating Time Based Windows
[Figure: Processing Time axis from 8:00 to 14:00]
Streaming Patterns: Event Time Based Windows
[Figure: Input and Output panels; Event Time and Processing Time axes, each from 10:00 to 15:00]
Streaming Patterns: Session Windows
[Figure: Input and Output panels; Event Time and Processing Time axes, each from 10:00 to 15:00]
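A rough sketch of session windowing in plain Java, not the Dataflow SDK. It assumes a session closes once a gap of at least `gapMillis` passes with no events; the class name and the gap parameter are hypothetical. Events are sorted by event time first, so a late arrival still joins the session it belongs to.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: splits one user's event times into sessions. */
final class SessionWindows {
    private SessionWindows() {}

    static List<List<Long>> sessions(List<Long> eventTimesMillis, long gapMillis) {
        List<Long> sorted = new ArrayList<>(eventTimesMillis);
        // Order by event time so arrival order does not matter.
        Collections.sort(sorted);

        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        for (long t : sorted) {
            // Start a new session when the quiet period reaches the gap.
            if (current == null || t - current.get(current.size() - 1) >= gapMillis) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(t);
        }
        return result;
    }
}
```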
The use case
The game
Problem scope
● Intermittent connectivity
● Inconsistent data delivery timing
● Large, endless stream of data
● Multiple input and output streams
● Bursts of activity
● Integrate and synchronize multiple event streams
Solution requirements
● Keep up with the event streams
● Respond in real-time
● Scale up and down with demand
● Process data once
● Accommodate late-arriving data
● Detect anomalies
A recipe
The recipe
Data → Pub/Sub → Dataflow → BigQuery
Code time!
Setup stuff
1. Log into your codelab project
2. Install Java, Maven, and the Cloud SDK
3. Download the example code
4. Set up service account credentials
5. Create a Google Cloud Storage 'staging bucket' for Dataflow
6. Create a BigQuery Dataset
Detailed instructions: tinyurl.com/cope-iot-data
Cloud Pub/Sub
How Pub/Sub works
[Figure: Topics, Subscriptions, and Subscribers; subscriptions deliver by Push or Pull]
OSS alternative
Cloud Pub/Sub features
● Asynchronous messaging
● Many-to-many
● Push and pull
● At-least-once message delivery
● REST/JSON API
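At-least-once delivery means a subscriber may see the same message more than once. One common way to cope is to make the handler idempotent by remembering message IDs. The sketch below is plain Java with a hypothetical class name, and it does not call the Pub/Sub API. It also assumes the set of seen IDs fits in memory; a real subscriber would need to expire old IDs.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical subscriber-side filter that drops redelivered messages. */
final class DedupingSubscriber {
    private final Set<String> seenMessageIds = new HashSet<>();
    private int processedCount = 0;

    /** Returns true if the message was processed, false if it was a redelivery. */
    boolean handle(String messageId, String payload) {
        // Set.add returns false when the ID was already present.
        if (!seenMessageIds.add(messageId)) {
            return false;
        }
        // Placeholder for real work on the payload.
        processedCount++;
        return true;
    }

    int processedCount() {
        return processedCount;
    }
}
```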
Nonfunctional stuff
● Globally available
● Automatic scaling
● Replicated storage
● Encrypted on the wire and at rest
Code time!
Pub/Sub injector
Cloud Dataflow
Cloud Dataflow
Cloud Dataflow is a collection of SDKs for building batch or
streaming parallelized data processing pipelines.
Cloud Dataflow is a fully managed service for executing optimized
parallelized data processing pipelines.
How Dataflow works
● A directed acyclic graph of data processing transformations
● Dataflow Service optimizes and executes
● Supports alternate runners
● Multiple inputs and outputs
● Logical MapReduce operations
● PCollections flow through the pipeline
OSS alternative
Features
● Unified model for streaming/batch analysis
● Once-and-only-once input element processing
● Autoscaling
● Toolkit of complex transforms
● Support for event-time stream processing
  ○ Handles late data
● Session windowing
● Real-time analytics
● Real-time dashboard and alerts
What it's good at
• Filtering
• Transformation
• Movement
• Extract insights
• Batch
• Continuous
Analysis, ETL
Dataflow concepts
PCollections -- pipeline collections
● A collection of data in a pipeline
● Bounded or unbounded in size
● Created by:
  ○ Building from a java.util.Collection
  ○ Reading from a backing data store
  ○ Transforming an existing PCollection
{Seahawks, NFC, Champions, ...}
{..., “NFC Champions #Seahawks”, “Seahawks third #superbowl!”, ... “Je suis #12thMan”, “#GoHawks”, ...}
ParDo -- Parallel Do transformation
● Process each PCollection element independently using a user-provided DoFn
● Corresponds to both the map and reduce phases in Hadoop
{Seahawks, NFC, Champions, ...}
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
Key by initial letter
ParDo example
{Seahawks, NFC, Champions, ...}
Lowercase
PCollection<String> tweets = …;
// Lowercase every element; each element is processed independently.
tweets.apply(ParDo.of(
    new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(c.element().toLowerCase());
      }
    }));
{seahawks, nfc, champions, ...}
GroupByKey
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
GroupByKey
{KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champion, …}>}
● Gathers all PCollection elements with the same key
● Shuffle phase in Hadoop
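To make the grouping concrete, here is a small in-memory sketch in plain Java. It does not use the Dataflow SDK, and the class name is hypothetical. It keys each word by its initial letter, as in the ParDo slide, then gathers the values that share a key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for "key by initial letter" plus GroupByKey. */
final class GroupByInitial {
    private GroupByInitial() {}

    static Map<Character, List<String>> group(List<String> words) {
        Map<Character, List<String>> grouped = new TreeMap<>();
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // no initial letter to key on
            }
            // Element-wise step: derive the key from the element.
            char key = word.charAt(0);
            // Grouping step: collect all values that share the key.
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(word);
        }
        return grouped;
    }
}
```

On a real runner the grouping step is a distributed shuffle; the map here only mimics its result.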
GroupByKey & Combine
● Compute the most common value for each key with GroupByKey and a DoFn
● The DoFn needs to see all of the elements for a key
● A CombineFn is easier to optimize than GroupByKey plus a DoFn
GroupByKey
{KV<S, Seahawks>, KV<C,Champion>, KV<S, Seattle>, KV<N, NFC>, ...}
{KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>,
KV<C, {Champion, …}>}
Combine.groupedValues(TopFn)
{KV<S, Seahawks>, KV<N, NFC>,
KV<C, Champion>}
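The sketch below shows why a combiner can be cheaper than a DoFn that must see every value for a key: partial accumulators can be built independently and merged later. It is plain Java with a hypothetical class name. The method names loosely mirror the CombineFn lifecycle, and the tie-break rule is an assumption made only so the output is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical combiner that finds the most common value for one key. */
final class MostCommonCombiner {
    private MostCommonCombiner() {}

    static Map<String, Integer> createAccumulator() {
        return new HashMap<>();
    }

    /** Folds one value into an accumulator of per-value counts. */
    static Map<String, Integer> addInput(Map<String, Integer> accumulator, String value) {
        accumulator.merge(value, 1, Integer::sum);
        return accumulator;
    }

    /** Merges two partial accumulators, for example from different workers. */
    static Map<String, Integer> mergeAccumulators(Map<String, Integer> left, Map<String, Integer> right) {
        Map<String, Integer> merged = new HashMap<>(left);
        right.forEach((value, count) -> merged.merge(value, count, Integer::sum));
        return merged;
    }

    /** Returns the most frequent value, or null if the accumulator is empty. */
    static String extractOutput(Map<String, Integer> accumulator) {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : accumulator.entrySet()) {
            int count = e.getValue();
            // Assumed tie-break: prefer the lexicographically smaller value.
            if (count > bestCount || (count == bestCount && best != null && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
                bestCount = count;
            }
        }
        return best;
    }
}
```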
Windows
● Divide or group elements of a PCollection into windows
  ○ Fixed Windows: hourly, daily, …
  ○ Sliding Windows
  ○ Sessions
● Required for GroupByKey transforms on an unbounded PCollection
Nighttime Mid-Day Nighttime
Composite PTransforms
● Build new PTransforms from existing transforms
● Some utilities are included in the SDK:
  ○ Count, RemoveDuplicates, Join, Min, Max, Sum…
● Define your own
  ○ DoSomething, DoSomethingElse...
● Why bother?
  ○ Code reuse
  ○ Easy to monitor
Count: Pair With Ones → GroupByKey → Sum Values
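A sketch of Count as a composite of the three smaller steps named above. It is plain Java over in-memory lists rather than the Dataflow SDK, and the class and method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical composite: count = pairWithOnes, then groupByKey, then sumValues. */
final class CountComposite {
    private CountComposite() {}

    /** Turns each element into a (element, 1) pair. */
    static List<Map.Entry<String, Integer>> pairWithOnes(List<String> elements) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String element : elements) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(element, 1));
        }
        return pairs;
    }

    /** Gathers the values of all pairs that share a key. */
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Adds up the grouped values for each key. */
    static Map<String, Integer> sumValues(Map<String, List<Integer>> grouped) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            sums.put(entry.getKey(), sum);
        }
        return sums;
    }

    /** The composite: callers see one step, the pieces stay reusable. */
    static Map<String, Integer> count(List<String> elements) {
        return sumValues(groupByKey(pairWithOnes(elements)));
    }
}
```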
Code time!
Start a pipeline
Google BigQuery
OSS alternative
BigQuery
● Scales flat to petabytes
● SQL dialect
● User defined functions
● REST, Web UI, ODBC
● 1TB free each month
Demo: count stuff
SELECT count(word)
FROM publicdata:samples.shakespeare
Words in Shakespeare
SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_20150212_01]
Wikipedia hits over 1 hour
SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_201505]
Wikipedia hits over 1 month
Several years of Wikipedia data
SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_201105],
  [fh-bigquery:wikipedia.pagecounts_201106],
  [fh-bigquery:wikipedia.pagecounts_201107],
  ...
SELECT SUM(requests) AS total
FROM TABLE_QUERY(
  [fh-bigquery:wikipedia],
  'REGEXP_MATCH(table_id, r"pagecounts_2015[0-9]{2}$")')
Several years of Wikipedia data
How about a RegExp
SELECT SUM(requests) AS total
FROM TABLE_QUERY(
  [fh-bigquery:wikipedia],
  'REGEXP_MATCH(table_id, r"pagecounts_2015[0-9]{2}$")')
WHERE (REGEXP_MATCH(title, '.*[dD]inosaur.*'))
How BigQuery works
Qualities of a good RDBMS
● Inserts & locking
● Indexing
● Cache
● Query planning
Storing data
[Figure: a Table broken into Columns, which are stored on Disks]
Reading data: Life of a BigQuery
SELECT sum(requests) as sum
FROM (
  SELECT requests, title
  FROM [fh-bigquery:wikipedia.pagecounts_201501]
  WHERE (REGEXP_MATCH(title, '[Jj]en.+'))
)
Life of a BigQuery
[Figure, built up across several slides: a query tree with a Root Mixer at the top, Mixers below it, Leaf servers below those, and Storage at the bottom]
Labels added to the tree, one slide at a time:
SELECT requests, title (Storage, 5.4 Bil)
WHERE (REGEXP_MATCH(title, '[Jj]en.+')) (5.8 Mil)
SELECT sum(requests)
SELECT sum(requests)
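The tree in the figure can be sketched as a two-level aggregation. This is plain Java under stated assumptions, not BigQuery's implementation: each leaf filters its own shard of rows and returns a partial sum, and a mixer adds up the partial sums it receives. The class, the Row type, and the sharding are hypothetical, and the regular expression is assumed to match anywhere in the title.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical model of leaves computing partial sums that mixers combine. */
final class TreeAggregation {
    private TreeAggregation() {}

    /** One row of the (title, requests) columns that the query reads. */
    static final class Row {
        final String title;
        final long requests;

        Row(String title, long requests) {
            this.title = title;
            this.requests = requests;
        }
    }

    private static final Pattern TITLE_FILTER = Pattern.compile("[Jj]en.+");

    /** Leaf step: scan one shard, keep matching rows, return a partial sum. */
    static long leafPartialSum(List<Row> shard) {
        long sum = 0;
        for (Row row : shard) {
            if (TITLE_FILTER.matcher(row.title).find()) {
                sum += row.requests;
            }
        }
        return sum;
    }

    /** Mixer step: combine partial sums from leaves or from lower mixers. */
    static long mix(List<Long> partialSums) {
        long total = 0;
        for (long partial : partialSums) {
            total += partial;
        }
        return total;
    }
}
```

Because addition is associative, the same mix step works at every level of the tree, so only small partial results travel upward.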
Code time!
BigQuery
Wrap up
Big data
Photo credit: Matt Chan
Data
photo credit - wemake_cc on flickr
Photo credit: Matt Chan
Data
photo credit - taniwha on flickr
Thank you!
Jen Tong @MimmingCodes
Felipe Hoffa @felipehoffa
Slides: mimming.com/presos
Bonus stuff
Cloud Bigtable
Bigness
Google Internal Bigtable in Numbers
• Storage: 100s of PB
• Throughput: 1,000,000s of QPS
• Bandwidth: 100s of GB/sec
How much is that?
Several Datas worth
Photo credit: jdhancock
How much is that?
Millennia of DVD video
Photo credit: illinoislibrary
Engineering
Hundreds of engineer-years worth
Bigtable - The early years
• Jeff and Sanjay decided to build a database service that could scale linearly across thousands and thousands of commodity servers
• Systems will fail, retain performance at scale
• Abandon traditional relational model
• The first generation was about:
• Prototyping and building the service to do its first scaling
• Migrating initial applications to Bigtable
• Figuring out replication and the first multi-tenant version of Bigtable
Bigtable - Stabilized
• From batch only, to serving web traffic
• Low latency for 99th percentile of requests
• Polish the Bigtable service
• React better to abusive usage
• Mixed media clusters - mixture of SSD + HDD storage with configurable affinity
• Bring tablet server recovery time from 10s of seconds to 1 second or less
• Easier replication
Google Cloud Bigtable
• A fully-managed service
• Focus more on your business, less on infrastructure
• Straightforward pricing model
Data Model
How it works
HBase Architecture
[Figure: HBase Cluster diagram. Components: HBase Client, ZooKeeper, Master, Region Servers (each with Regions, Bloom filter, Memory Table, WAL, and Block Cache), HDFS]
Bigtable Architecture
[Figure: Bigtable Cell diagram. Components: HBase Client, Chubby, Master, Tablet servers (each with Tablets, Bloom filter, Memtable, Shared log, and Block Cache), Colossus]
Life of Bigtable data
When it's awesome
Management
Who in the audience has used HBase before?
Things you will not see in Cloud Bigtable:
Compactions
Pre-splitting
Lots of configuration settings
1 minute regionserver outages
Coprocessors (for now)
Throughput
[Figure: two charts, Write Throughput (MB/s) and Mixed Read/Write Throughput (MB/s)]
Latency
[Figure: latency (ms) at 99% for read and update]
Cloud Bigtable Use Cases
• Financial Services: Faster risk analysis, credit card fraud/abuse
• Marketing / Digital Media: User engagement, clickstream analysis, real-time adaptive content
• Internet of Things: Sensor data dashboards and anomaly detection
• Telecommunications: Sampled traffic patterns, metric collection and reporting
• Energy: Oil well sensors, anomaly detection, predictive modeling
• Biomedical: Genomics sequencing data analysis
When not to use it
• Relational joins, like for online transaction processing - Cloud SQL
• Interactive querying - BigQuery
• Blobs over 10MB - Cloud Storage
• ACID transactions - Datastore
• Automatic cross-zone replication - Datastore
• You don't have much data yet - Datastore, Firebase, or Cloud SQL
More bonus stuff
Build one of something with Firebase