Upload: basho-technologies

Post on 27-Jan-2017

Page 1: Data Modeling IoT and Time Series data in NoSQL

Data Modeling IoT and Time Series data in NoSQL

Matthew Brender, Drew Kerrigan

Page 2: Data Modeling IoT and Time Series data in NoSQL

Meet your presenters

{ “Matt” : ‘[email protected]’, ‘mjbrender’, ‘@mjbrender’, ‘ruby, javascript, go’ }

{ “Drew” : ‘[email protected]’, ‘drewkerrigan’, ‘@dr00_b’, ‘erlang, elixir, go’ }

Basho Technologies | 2

Page 3: Data Modeling IoT and Time Series data in NoSQL

Basho Snapshot

Distributed Systems Software for Big Data, IoT and Hybrid Cloud applications

Founded January 2008

2011: Creators of Riak
• Riak core: used by Goldman, Visa…
• Riak KV: Feature-rich Distributed NoSQL database
• Riak S2: Object and cloud storage software

2015: New Products
• Basho Data Platform: NoSQL, caching & analytics
• Riak TS: Distributed database designed for time series

120+ employees

Global Offices: Seattle (HQ), Washington DC, London, Tokyo

Page 4: Data Modeling IoT and Time Series data in NoSQL

Agenda

• Time Series Data
• Introducing Riak TS
• Data Modeling
• Coding with Riak TS

Page 5: Data Modeling IoT and Time Series data in NoSQL

What is Time Series?

Page 9: Data Modeling IoT and Time Series data in NoSQL

How Is Time Series Data Different?

• High performance reads and writes of time series data

Data location matters

Data needs to be easy to retrieve using range queries

    select * from devices
    where time >= '2015-08-06 01:00:00'
    and time <= '2015-08-06 01:10:00'
    and errorcode = 555123
    and device_type = 'mobile'

Higher write volumes

All while still being highly available!

With no data loss even with a huge number of sources

Eventually rolled up, compressed, with the details expired
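
The last point, rolling up detailed readings and expiring the raw data, can be sketched in a few lines of Python. This is an illustration only, not Riak TS functionality: the function names `rollup` and `expire`, the one-minute bucket width, and the sample readings are all assumptions made for the example.

```python
from collections import defaultdict

def rollup(readings, bucket_ms=60_000):
    """Average (timestamp_ms, value) readings per fixed-width time bucket."""
    buckets = defaultdict(list)
    for ts_ms, value in readings:
        # Subtracting the remainder snaps each timestamp to the start of its bucket.
        buckets[ts_ms - ts_ms % bucket_ms].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

def expire(readings, now_ms, max_age_ms):
    """Keep only the readings that are at most max_age_ms old."""
    return [(ts, v) for ts, v in readings if now_ms - ts <= max_age_ms]

# Hypothetical sample data: four readings, 30 seconds apart.
readings = [(0, 10.0), (30_000, 20.0), (60_000, 30.0), (90_000, 50.0)]
summary = rollup(readings)                                      # one average per minute
recent = expire(readings, now_ms=100_000, max_age_ms=45_000)    # raw detail kept only briefly
```

The rolled-up summary stays small while the raw rows can be dropped once they age out, which is the trade-off the slide describes.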

Page 10: Data Modeling IoT and Time Series data in NoSQL

Introducing Riak TS

[Diagram: Basho Data Platform architecture. Service instances: Spark, Redis (Caching), Solr, Elastic Search, Web Services, 3rd Party Web Services & Integrations. Storage instances: Riak KV (Key/Value), Riak S2 (Object Storage), Riak TS (Time Series), Document Store, Columnar, Graph. Core services: Replication & Synchronization, Message Routing, Cluster Management & Monitoring, Logging & Analytics, Internal Data Store.]

Page 11: Data Modeling IoT and Time Series data in NoSQL

Riak TS Feature Details

Feature Overview

• Feature: Data co-location by time and geohash or, more generally, series and data family
– Benefit: Easily analyze temporal and geocoded data
• Feature: Configure time series bucket-type that propagates across the cluster using a simple, SQL-like command
– Benefit: Simple setup for faster ROI
• Feature: Greater data locality
– Benefit: Faster data storage and retrieval
• Feature: Option to store structured and semi-structured data
– Benefit: Clean data written to the database, eliminating the need to cleanse data
• Feature: Write queries using a subset of SQL
– Benefit: Faster application development. Write applications to extract and analyze your data in a familiar language
• Feature: Near-linear scaling
– Benefit: Easy to grow database to meet data demands
• Feature: High Availability for ingest
– Benefit: No data loss even when data is streaming from a large number of sources

Page 12: Data Modeling IoT and Time Series data in NoSQL

Riak TS Feature Details

• Same distributed systems benefits of Riak KV
– Operational Simplicity
– Fault Tolerance
– Robust Client APIs
– Broad Client Libraries
– Massive Scalability
– CRDTs
– Active Anti-Entropy
– Masterless
– High Availability
– Low Latency
– Read Repair
– Riak Search

Page 13: Data Modeling IoT and Time Series data in NoSQL

Riak TS Optimization

Optimized Deployment
• Data Co-Location
• Composite Keys - time or geohash, data family
• Time quantization (quantum)

Simplified Data Modeling
• DDL – Table and field definitions support structured and semi-structured data

Fast Queries and Analysis
• Range Queries (SQL based)
• LevelDB filtering
• Spark Connector
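
Time quantization can be pictured as snapping every timestamp to the start of a fixed-width window, so that all rows in the same window share one value. The Python sketch below is a simplified model under stated assumptions: the helper names `quantum_start` and `to_ms`, the unit table, and the 15-minute width are illustrative and are not taken from the slides or from Riak TS itself.

```python
from datetime import datetime, timezone

# Assumed unit table for the example: seconds, minutes, hours, days (in milliseconds).
UNIT_MS = {"s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}

def quantum_start(ts_ms, n, unit):
    """Return the start (epoch ms) of the n-unit window containing ts_ms."""
    width = n * UNIT_MS[unit]
    return ts_ms - ts_ms % width

def to_ms(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return epoch milliseconds."""
    dt = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1000

# Two readings ten minutes apart fall in the same 15-minute quantum;
# a third reading twenty minutes in falls in the next one.
a = quantum_start(to_ms("2015-08-06 01:00:00"), 15, "m")
b = quantum_start(to_ms("2015-08-06 01:10:00"), 15, "m")
c = quantum_start(to_ms("2015-08-06 01:20:00"), 15, "m")
```

Rows that share a quantum value can be stored together, which is what makes a range query over a short time window cheap.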

Page 14: Data Modeling IoT and Time Series data in NoSQL

Riak TS Optimization

Riak has a masterless architecture in which every node in a cluster is capable of serving read and write requests.

Requests are routed to nodes using standard load balancing.

Page 15: Data Modeling IoT and Time Series data in NoSQL

Riak KV Hashing

Page 16: Data Modeling IoT and Time Series data in NoSQL

Riak KV Hashing: PUT

Page 17: Data Modeling IoT and Time Series data in NoSQL

Riak KV Hashing: 2i Query

Page 18: Data Modeling IoT and Time Series data in NoSQL

Riak TS Hashing: PUT

Page 19: Data Modeling IoT and Time Series data in NoSQL

Riak TS Hashing: TS Query
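
The hashing slides are diagrams, so the contrast they draw is sketched here in Python as a simplified model. It assumes a 64-partition ring and hashes a joined string with SHA-1; real Riak hashes its own internal key encoding, and the names `partition_for`, `kv_partition`, and `ts_partition` as well as the sample table and values are hypothetical.

```python
import hashlib

RING_SIZE = 2 ** 160       # size of the SHA-1 output space
NUM_PARTITIONS = 64        # assumed number of ring partitions for this example

def partition_for(*key_parts):
    """Hash the key parts with SHA-1 and map the digest onto a ring partition."""
    digest = hashlib.sha1("/".join(str(p) for p in key_parts).encode()).digest()
    position = int.from_bytes(digest, "big")
    return position * NUM_PARTITIONS // RING_SIZE

def kv_partition(bucket, key):
    """KV model: every distinct bucket/key pair is hashed on its own."""
    return partition_for(bucket, key)

def ts_partition(table, family, series, ts_ms, quantum_ms):
    """TS model: only the quantized time is hashed, so a whole quantum shares a partition."""
    return partition_for(table, family, series, ts_ms - ts_ms % quantum_ms)

QUANTUM_MS = 15 * 60 * 1000
# Ten readings one minute apart, starting at 2015-08-06 01:00:00 UTC.
timestamps = [1_438_822_800_000 + i * 60_000 for i in range(10)]

ts_parts = {ts_partition("devices", "mobile", "555123", t, QUANTUM_MS) for t in timestamps}
kv_parts = {kv_partition("devices", str(t)) for t in timestamps}
```

In this model `ts_parts` always holds a single partition because all ten rows quantize to the same value, while `kv_parts` will typically hold several, since each key is hashed independently.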

Page 20: Data Modeling IoT and Time Series data in NoSQL

Riak TS – Storing Structured Data

Buckets used to Define Tables

• Key format
– Objects have a composite key (partition key and local key)
• Tables
– Buckets can be defined as tables
– Tables can have a schema defined using DDL
– Columns in the table can be typed
• Data Validation
– Data is validated on input
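
To make "typed columns" and "validated on input" concrete, here is a minimal Python sketch of checking a row against a schema before it is written. It is not how Riak TS implements validation; the `SCHEMA` layout, the column types chosen for the `devices` table, and the `validate` function are assumptions for illustration.

```python
# Assumed schema for the example: (column name, Python type, nullable?)
SCHEMA = [
    ("time", int, False),
    ("device_type", str, False),
    ("errorcode", int, True),
]

def validate(row, schema=SCHEMA):
    """Return the row unchanged if it matches the schema, otherwise raise."""
    if len(row) != len(schema):
        raise ValueError(f"expected {len(schema)} columns, got {len(row)}")
    for value, (name, col_type, nullable) in zip(row, schema):
        if value is None:
            if not nullable:
                raise ValueError(f"column {name!r} must not be null")
            continue
        # bool is a subclass of int in Python; this schema has no boolean
        # columns, so booleans are rejected everywhere.
        if isinstance(value, bool) or not isinstance(value, col_type):
            raise TypeError(f"column {name!r} expects {col_type.__name__}")
    return row

good = validate((1_438_822_800_000, "mobile", 555123))
```

Rejecting a badly typed row at write time is what lets later queries assume clean data.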

Page 21: Data Modeling IoT and Time Series data in NoSQL

Riak TS – Range Queries

Query Like SQL

• Use Cases
– Range queries
• Implementation Details
– SQL based query language
– Filtering rows based on column expressions
– Filtering executed in backend
– Specific columns are extracted
– Simple select with WHERE clause
    • for numbers: >, >=, <, <=, =, !=
    • for other data types: =, !=
    • AND, OR (nesting operators are supported)

    select * from devices
    where time >= '2015-08-06 01:00:00'
    and time <= '2015-08-06 01:10:00'
    and errorcode = 555123
    and device_type = 'mobile'
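
The implementation details above (a time range, column filters, and column extraction) can be modeled in a few lines of Python. This is a sketch of the idea over an in-memory list, not the Riak TS query engine; `range_query`, the `OPS` table, and the sample rows are hypothetical, and only AND-combined filters are modeled.

```python
import operator
from bisect import bisect_left, bisect_right

# Comparison operators listed on the slide, mapped to Python functions.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "=": operator.eq, "!=": operator.ne}

def range_query(rows, start, end, filters=(), columns=None):
    """Select rows with start <= time <= end that pass every filter.

    rows must be a list of dicts sorted by their 'time' value.
    filters is a sequence of (column, operator, value) triples.
    """
    times = [row["time"] for row in rows]
    lo, hi = bisect_left(times, start), bisect_right(times, end)
    result = []
    for row in rows[lo:hi]:                      # only the time range is scanned
        if all(OPS[op](row[col], val) for col, op, val in filters):
            result.append({c: row[c] for c in (columns or row)})
    return result

# Hypothetical sample rows, already sorted by time.
rows = [
    {"time": 100, "errorcode": 555123, "device_type": "mobile"},
    {"time": 200, "errorcode": 1,      "device_type": "mobile"},
    {"time": 300, "errorcode": 555123, "device_type": "desktop"},
    {"time": 400, "errorcode": 555123, "device_type": "mobile"},
    {"time": 900, "errorcode": 555123, "device_type": "mobile"},
]
hits = range_query(rows, 100, 400,
                   [("errorcode", "=", 555123), ("device_type", "=", "mobile")],
                   columns=["time"])
```

Because the rows are ordered by time, the range is located first and the column filters only run over that slice, which mirrors why filtering in the backend is cheaper than filtering on the client.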

Page 22: Data Modeling IoT and Time Series data in NoSQL

Data Modeling

How does one approach time series data?

Page 23: Data Modeling IoT and Time Series data in NoSQL

The first rule…

Page 24: Data Modeling IoT and Time Series data in NoSQL

The real first rule of data modeling:

• Decide what questions you want to ask of the data
– Graphs?
– Granularity?
– Analysis?
– Monitoring?

Page 25: Data Modeling IoT and Time Series data in NoSQL

Graphs

Page 27: Data Modeling IoT and Time Series data in NoSQL

Sample Data Exercise

Hard drive test data
– https://www.backblaze.com/hard-drive-test-data.html
– https://en.wikipedia.org/wiki/S.M.A.R.T.

Page 29: Data Modeling IoT and Time Series data in NoSQL

Data Characteristics

[Date, Serial Number, Model, Capacity (bytes), Failure, …, smart_194_raw (Temp), …]

Sample Row:
• Date: “2013-04-10”
• Model: “Hitachi HDS5C3030ALA630”
• Failure: 0
• Temp: 26

Which columns are good candidates for indexing given the question we are asking of the data?

Page 30: Data Modeling IoT and Time Series data in NoSQL

Define the Conceptual Query

Effect of temperature on hard drive stability

Approach 1: “Find all failures in 2013”

    SELECT * FROM HardDrives
    WHERE date >= '2013-01-01'
    AND date <= '2013-12-31'
    AND failure = 'true'

• Pros:
– All data is colocated physically
• Cons:
– Requires client side processing for further analysis

Page 31: Data Modeling IoT and Time Series data in NoSQL

Create the Table

    riak-admin bucket-type create HardDrives '{"props":{"n_val":3, "table_def":"CREATE TABLE HardDrives (
        date TIMESTAMP NOT NULL, family VARCHAR NOT NULL, failure VARCHAR NOT NULL,
        serial VARCHAR, model VARCHAR, capacity FLOAT, temperature FLOAT,
        PRIMARY KEY ((quantum(date, 1, 'y'), family, failure), date, family, failure))"}}'
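
The command above nests SQL quotes inside JSON inside a shell-quoted argument, which is easy to get wrong when typed by hand. One option, sketched here in Python, is to build the JSON with `json.dumps` and let `shlex.quote` handle the shell quoting. The table definition is copied from the slide; generating the command this way is a suggestion, not something the presenters describe.

```python
import json
import shlex

# Table definition as shown on the slide, kept on one logical line.
table_def = (
    "CREATE TABLE HardDrives ("
    "date TIMESTAMP NOT NULL, family VARCHAR NOT NULL, failure VARCHAR NOT NULL, "
    "serial VARCHAR, model VARCHAR, capacity FLOAT, temperature FLOAT, "
    "PRIMARY KEY ((quantum(date, 1, 'y'), family, failure), date, family, failure))"
)

# json.dumps takes care of the JSON string quoting.
props = json.dumps({"props": {"n_val": 3, "table_def": table_def}})

# shlex.quote wraps the JSON so a POSIX shell passes it through as one argument,
# including the single quotes around the quantum unit.
command = "riak-admin bucket-type create HardDrives " + shlex.quote(props)
print(command)
```

The printed line can be pasted into a shell; the JSON arrives at `riak-admin` as a single argument with its inner quotes preserved.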

Page 32: Data Modeling IoT and Time Series data in NoSQL

Ingest the Data

    RawRow = [
        <<"2013-04-10">>,                 %% Date
        <<"MJ0351YNG9Z0XA">>,             %% Serial
        <<"Hitachi HDS5C3030ALA630">>,    %% Model
        <<"3000592982016">>,              %% Capacity
        <<"0">>,                          %% Failure
        …, <<"26">>, …],                  %% SMART Stats with Temperature

    ProcessedRow = [
        1365555661000,                    %% Date
        <<"all">>,                        %% Family
        <<"false">>,                      %% Failure
        <<"MJ0351YNG9Z0XA">>,             %% Serial
        <<"Hitachi HDS5C3030ALA630">>,    %% Model
        3000592982016.0,                  %% Capacity
        26.0],                            %% Temperature

Page 33: Data Modeling IoT and Time Series data in NoSQL

Ingest the Data

    %% convert/2 is a helper that is not shown on the slides
    ProcessedRow = [
        convert(lists:nth(1, RawRow), date),       % date
        <<"all">>,                                 % family
        convert(lists:nth(5, RawRow), boolean),    % failure
        lists:nth(2, RawRow),                      % serial
        lists:nth(3, RawRow),                      % model
        convert(lists:nth(4, RawRow), float),      % capacity
        convert(lists:nth(51, RawRow), float)      % temp
    ],

    riakc_ts:put(Pid, <<"HardDrives">>, [ProcessedRow]).
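
The slides do not show the `convert` helper, so the Python sketch below is a guess at what such a conversion step might do: dates become epoch milliseconds, the failure flag becomes the string 'true' or 'false' (the table stores it as VARCHAR), and numeric strings become floats. The shortened six-column row and the `temp_index` parameter are assumptions; the real CSV has the temperature much further along.

```python
from datetime import datetime, timezone

def date_to_epoch_ms(text):
    """Convert 'YYYY-MM-DD' to epoch milliseconds at midnight UTC (an assumption)."""
    dt = datetime.strptime(text, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1000

def convert(value, kind):
    """Hypothetical stand-in for the convert/2 helper used on the slide."""
    if kind == "date":
        return date_to_epoch_ms(value)
    if kind == "boolean":
        return "true" if value == "1" else "false"
    if kind == "float":
        return float(value)
    raise ValueError(f"unknown kind: {kind}")

def process(raw, temp_index):
    """Reorder and convert a raw CSV row into the HardDrives column order."""
    return [
        convert(raw[0], "date"),            # date
        "all",                              # family
        convert(raw[4], "boolean"),         # failure
        raw[1],                             # serial
        raw[2],                             # model
        convert(raw[3], "float"),           # capacity
        convert(raw[temp_index], "float"),  # temperature
    ]

# Shortened version of the slide's sample row: the SMART columns are omitted.
raw_row = ["2013-04-10", "MJ0351YNG9Z0XA", "Hitachi HDS5C3030ALA630",
           "3000592982016", "0", "26"]
processed = process(raw_row, temp_index=5)
```

Note that this sketch maps the date to midnight UTC, whereas the timestamp on the slide carries a time of day, so the two values differ slightly.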

Page 34: Data Modeling IoT and Time Series data in NoSQL

Query the Data

    Start = integer_to_list(date_to_epoch_ms(<<"2013-01-01">>)),
    End = integer_to_list(date_to_epoch_ms(<<"2013-12-31">>)),

    Query = "select * from HardDrives where date >= " ++ Start ++
            " and date <= " ++ End ++
            " and family = 'all' and failure = 'true'",

    {_Fields, Results} = riakc_ts:query(Pid, list_to_binary(Query)),
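
The query above names every field of the partition key (`date`, `family`, `failure`), which is why `family = 'all'` appears even though it does not narrow the result. A small Python sketch of a query builder with that guard is shown below. The `build_query` function and the `PARTITION_KEY` constant are hypothetical, and the client-side check is an added safety net for illustration, not part of the Riak client.

```python
# Partition key fields of the HardDrives table from the earlier slide.
PARTITION_KEY = ("date", "family", "failure")

def build_query(table, start_ms, end_ms, equals):
    """Build a range query; refuse to build it if a partition key field is missing."""
    missing = [col for col in PARTITION_KEY[1:] if col not in equals]
    if missing:
        raise ValueError(f"WHERE clause must also cover: {', '.join(missing)}")
    clauses = [f"date >= {int(start_ms)}", f"date <= {int(end_ms)}"]
    clauses += [f"{col} = '{val}'" for col, val in equals.items()]
    return f"select * from {table} where " + " and ".join(clauses)

# 2013-01-01 and 2013-12-31 at midnight UTC, in epoch milliseconds.
query = build_query("HardDrives", 1356998400000, 1388448000000,
                    {"family": "all", "failure": "true"})
```

Catching a missing key field before the query is sent gives a clearer error than a rejected query from the server would.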

Page 35: Data Modeling IoT and Time Series data in NoSQL

Process the Results

    Total Failures: 112
    Results:
    [{1365555661000, <<"all">>, <<"true">>, <<"9VS3FM1J">>, <<"ST31500341AS">>, 1500301910016.0, 31.0},
     {...}, {...}, ...]

Page 36: Data Modeling IoT and Time Series data in NoSQL

Results

    130> ts:approach1().
    Total Failures: 112
    "ST31500341AS": ...
    "ST3000DM001": ...
    "Hitachi HDS5C4040ALE630": ...
    "ST4000DM000": ...
    "ST31500541AS": 18.0=1 19.0=1 20.0=2 21.0=3 22.0=2 24.0=2 25.0=1 29.0=3 30.0=1
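
The per-model output above is the "client side processing" that Approach 1 requires: the query returns raw failure rows and the client counts them by model and temperature. The `ts:approach1` code is not shown on the slides, so the Python below is only a sketch of that counting step. The function names are hypothetical, and apart from the first row (taken from the slide) the sample rows are made up.

```python
from collections import Counter, defaultdict

def failures_by_model_and_temp(rows):
    """Count failure rows per model and temperature.

    Each row follows the HardDrives column order:
    (date, family, failure, serial, model, capacity, temperature)
    """
    hist = defaultdict(Counter)
    for _date, _family, failure, _serial, model, _capacity, temp in rows:
        if failure == "true":
            hist[model][temp] += 1
    return hist

def format_model(model, counts):
    """Render one model's histogram in the style of the slide output."""
    pairs = " ".join(f"{temp}={n}" for temp, n in sorted(counts.items()))
    return f'"{model}": {pairs}'

rows = [
    (1365555661000, "all", "true", "9VS3FM1J", "ST31500341AS", 1500301910016.0, 31.0),
    # The rows below are invented for the example.
    (1365555661000, "all", "true", "SERIAL-A", "ST31500541AS", 1500301910016.0, 18.0),
    (1365642061000, "all", "true", "SERIAL-B", "ST31500541AS", 1500301910016.0, 20.0),
    (1365728461000, "all", "true", "SERIAL-C", "ST31500541AS", 1500301910016.0, 20.0),
]
hist = failures_by_model_and_temp(rows)
line = format_model("ST31500541AS", hist["ST31500541AS"])
```

Every failure row has to travel to the client before it can be counted, which is the cost the "Cons" bullet for Approach 1 points at.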

Page 37: Data Modeling IoT and Time Series data in NoSQL

Refine the Query

New Query

    SELECT * FROM HardDrives
    WHERE date >= '2013-01-01'
    AND date <= '2013-12-31'
    AND model = 'ST31500541AS'
    AND failure = 'true'

New Primary Key

    PRIMARY KEY ((quantum(date, 1, 'y'), model, failure), date, model, failure))"}}'

Same (but more focused) Results

    "ST31500541AS": 18.0=1 19.0=1 20.0=2 21.0=3 22.0=2 24.0=2 25.0=1 29.0=3 30.0=1

Page 38: Data Modeling IoT and Time Series data in NoSQL

Think Outside the Box

New Approach: Multi-Model with Riak KV

Conceptual Query: “Find the number of failures for each Quantum, Model, and Temperature combination”

Read the single value of a bunch of counters!

• Pros:
– Each data point is pre-calculated, so very little client side processing
– Potentially faster, depending on a lot of variables
• Cons:
– Requires knowing which specific stats you want before writing the data
– Requires several counter writes for every row of raw data

Page 39: Data Modeling IoT and Time Series data in NoSQL

Create the Bucket Type

    riak-admin bucket-type create HardDriveCounters '{"props":{"datatype":"counter"}}'

Page 40: Data Modeling IoT and Time Series data in NoSQL

Ingest the Data

    Failure = lists:nth(5, RawRow),              % failure
    Year = extract_year(lists:nth(1, RawRow)),   % year
    Model = lists:nth(3, RawRow),                % model
    Temp = lists:nth(51, RawRow),                % temp

    Bucket = {<<"HardDriveCounters">>, Year},
    Key = list_to_binary(binary_to_list(Model) ++ binary_to_list(Temp)),

    %% We only care about failures
    case Failure of
        <<"1">> ->
            Counter = riakc_counter:new(),
            Counter1 = riakc_counter:increment(Counter),
            riakc_pb_socket:update_type(Pid, Bucket, Key, riakc_counter:to_op(Counter1));
        _ ->
            ok
    end.
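
The counter approach can be tried without a cluster by modeling the counters as an in-memory `collections.Counter`. The Python sketch below mirrors the Erlang above: one counter per year and model-plus-temperature key, incremented only for failure rows. The `ingest` and `counter_key` names, the shortened six-column rows, and the serial numbers are all assumptions for the example.

```python
from collections import Counter

def counter_key(model, temp):
    """Build the counter key by concatenating model and temperature, as on the slide."""
    return f"{model}{temp}"

def ingest(raw_rows, temp_index, counters=None):
    """Increment one counter per (year, model + temperature) for each failure row."""
    counters = Counter() if counters is None else counters
    for row in raw_rows:
        if row[4] != "1":          # we only care about failures
            continue
        year = row[0][:4]          # 'YYYY-MM-DD' -> 'YYYY'
        counters[(year, counter_key(row[2], row[temp_index]))] += 1
    return counters

# Made-up rows in the order: date, serial, model, capacity, failure, temperature.
raw_rows = [
    ["2013-04-10", "S1", "ST31500341AS", "1500301910016", "1", "26"],
    ["2013-04-11", "S2", "ST31500341AS", "1500301910016", "1", "26"],
    ["2013-04-11", "S3", "ST31500341AS", "1500301910016", "0", "27"],
    ["2014-01-02", "S4", "ST31500341AS", "1500301910016", "1", "26"],
]
counters = ingest(raw_rows, temp_index=5)
```

One design point worth noting: concatenating model and temperature without a separator can make two different combinations produce the same key when a model name ends in digits, so a delimiter between the two parts may be a safer choice.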

Page 41: Data Modeling IoT and Time Series data in NoSQL

Query the Data

    StartTemp = 16,
    EndTemp = 28,
    Results = range_get(<<"2013">>, <<"ST31500341AS">>, StartTemp, EndTemp, []).
    ...

    %% Pid is the client connection, assumed to be in scope.
    %% EndTemp itself is not fetched: recursion stops when CurrentTemp reaches it.
    range_get(_Year, _Model, EndTemp, EndTemp, Accum) ->
        lists:reverse(Accum);
    range_get(Year, Model, CurrentTemp, EndTemp, Accum) ->
        Bucket = {<<"HardDriveCounters">>, Year},
        Key = list_to_binary(binary_to_list(Model) ++ integer_to_list(CurrentTemp)),
        {ok, Counter} = riakc_pb_socket:fetch_type(Pid, Bucket, Key),
        NumFailures = riakc_counter:value(Counter),
        range_get(Year, Model, CurrentTemp + 1, EndTemp, [{CurrentTemp, NumFailures} | Accum]).
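
The same client-side "multi-get" can be sketched in Python against an in-memory dictionary standing in for the counter bucket. This is an illustration under assumptions: the `range_get` signature, the dictionary layout, and the sample counts are hypothetical. Unlike the Erlang version, which pattern-matches on `{ok, Counter}`, this sketch treats a missing counter as zero.

```python
def range_get(counters, year, model, start_temp, end_temp):
    """Fetch one counter per temperature in [start_temp, end_temp).

    counters maps (year, model + temperature) to a failure count.
    The upper bound is excluded, matching the base clause of the Erlang version.
    """
    results = []
    for temp in range(start_temp, end_temp):
        key = (year, f"{model}{temp}")
        results.append((temp, counters.get(key, 0)))   # missing counter -> 0
    return results

# Made-up counter values for the example.
counters = {
    ("2013", "ST31500341AS16"): 1,
    ("2013", "ST31500341AS18"): 3,
}
results = range_get(counters, "2013", "ST31500341AS", 16, 20)
```

Each temperature costs one key lookup, so a wide range means many small reads; that is the trade-off against the single range query used with Riak TS.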

Page 42: Data Modeling IoT and Time Series data in NoSQL

Data Modeling in Riak

Multi-Model with Riak KV

• Keys: Create your own using quantum + “dimension”

• Range Queries: Create your own client side multi-get to issue incremental key gets

• Compound Queries: Create more composite keys!

• Data Location: Sometimes inefficient because data is spread across many vnodes / partitions

Page 43: Data Modeling IoT and Time Series data in NoSQL

Data Modeling in Riak

Time Series Modeling in Riak TS

• Keys: Automatically managed based on your PRIMARY KEY definition as well as the values in those fields

• Range Queries: Use a well-known subset of SQL to specify a start and end in a WHERE clause, which performs a server-side multi-get

• Compound Queries: Possible with a wisely chosen composite PRIMARY KEY, although multiple tables may still be necessary

• Data Location: Very efficient data grouping by quanta, families, and series.

Page 44: Data Modeling IoT and Time Series data in NoSQL

Conclusion

Page 45: Data Modeling IoT and Time Series data in NoSQL

Part of the Basho Data Platform

[Diagram: Basho Data Platform architecture. Service instances: Spark, Redis (Caching), Solr, Elastic Search, Web Services, 3rd Party Web Services & Integrations. Storage instances: Riak KV (Key/Value), Riak S2 (Object Storage), Riak TS (Time Series), Document Store, Columnar, Graph. Core services: Replication & Synchronization, Message Routing, Cluster Management & Monitoring, Logging & Analytics, Internal Data Store.]

Page 47: Data Modeling IoT and Time Series data in NoSQL

QUESTIONS?

Page 48: Data Modeling IoT and Time Series data in NoSQL

Spend Time getting to know us

@basho @riconconf

OPEN SOURCE
• Basho Data Platform (code)
– Riak KV with parallel extract
• Basho Data Platform Add-ons (code)
– Spark + Spark Connector
• Download a build

ENTERPRISE
• Basho Data Platform, Enterprise
– Riak EE with multi-cluster replication
– Spark Leader Election Service
• Basho Data Platform Add-ons
– Redis + Cache Proxy
– Spark Workers + Spark Master
• Contact us to get started