Data Modeling IoT and Time Series data in NoSQL
Matthew Brender, Drew Kerrigan
Meet your presenters
{ “Matt” : ‘[email protected]’, ‘mjbrender’, ‘@mjbrender’, ‘ruby, javascript, go’ }
{ “Drew” : ‘[email protected]’, ‘drewkerrigan’, ‘@dr00_b’, ‘erlang, elixir, go’ }
Basho Technologies | 2
Basho Snapshot
Distributed Systems Software for Big Data, IoT and Hybrid Cloud applications
Founded January 2008
2011 Creators of Riak
• Riak core: used by Goldman, Visa…
• Riak KV: Feature-rich Distributed NoSQL database
• Riak S2: Object and cloud storage software
2015 New Products
• Basho Data Platform: NoSQL, caching & analytics
• Riak TS: Distributed database designed for time series
120+ employees
Global Offices Seattle (HQ), Washington DC, London, Tokyo
Agenda
• Time Series Data
• Introducing Riak TS
• Data Modeling
• Coding with Riak TS
What is Time Series?
How Is Time Series Data Different?
• High performance reads and writes of time series data
• Data location matters
• Data needs to be easy to retrieve using range queries

select * from devices
where time >= 2015-08-06 01:00:00
  and time <= 2015-08-06 01:10:00
  and errorcode = 555123
  and device_type = "mobile"

• Higher write volumes
• All while still being highly available!
• With no data loss even with a huge number of sources
• Eventually rolled up, compressed, with the details expired
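The last point, rolling data up and expiring the details, can be sketched in a few lines of Python. This is only an illustration of the lifecycle, not a Riak TS feature: the function names, the hourly bucket size, and the retention window are all assumptions made for the example.

```python
from collections import defaultdict

HOUR_MS = 60 * 60 * 1000  # assumed roll-up granularity: one hour


def roll_up_hourly(points):
    """Average (timestamp_ms, value) points into one value per hour."""
    buckets = defaultdict(list)
    for ts_ms, value in points:
        # Floor the timestamp to the start of its hour.
        buckets[ts_ms - ts_ms % HOUR_MS].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def expire_details(points, now_ms, retention_ms):
    """Keep only the raw points that are newer than the retention window."""
    return [(ts, v) for ts, v in points if now_ms - ts < retention_ms]
```

The roll-up keeps one averaged value per hour, so the raw points can be dropped once they fall outside the retention window.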
Introducing Riak TS
[Diagram: Basho Data Platform]
Service instances: Spark, Redis (Caching), Solr, Elastic Search, Web Services, 3rd Party Web Services & Integrations
Storage instances: Riak KV (Key/Value), Riak S2 (Object Storage), Riak TS (Time Series), Document Store, Columnar, Graph
Core services: Replication & Synchronization, Message Routing, Cluster Management & Monitoring, Logging & Analytics, Internal Data Store
Riak TS Feature Details
Feature Overview
• Feature: Data co-location by time and geohash, or more generally series and data family
  Benefit: Easily analyze temporal and geocoded data
• Feature: Configure time series bucket-type that propagates across the cluster using a simple, SQL-like command
  Benefit: Simple setup for faster ROI
• Feature: Greater data locality
  Benefit: Faster data storage and retrieval
• Feature: Option to store structured and semi-structured data
  Benefit: Clean data written to the database, eliminating the need to cleanse data
• Feature: Write queries using a subset of SQL
  Benefit: Faster application development. Write applications to extract and analyze your data in a familiar language
• Feature: Near-linear scaling
  Benefit: Easy to grow database to meet data demands
• Feature: High Availability for ingest
  Benefit: No data loss even when data is streaming from a large number of sources
Riak TS Feature Details
• Same distributed systems benefits of Riak KV
Operational Simplicity
Fault Tolerance
Robust Client APIs
Broad Client Libraries
Massive Scalability
CRDTs
Active Anti-Entropy
Masterless
High Availability
Low Latency
Read Repair
Riak Search
Riak TS Optimization
Optimized Deployment
• Data Co-Location
• Composite Keys - time or geohash, data family
• Time quantization (quantum)
Simplified Data Modeling
• DDL – Table and field definitions support structured and semi-structured data
Fast Queries and Analysis
• Range Queries (SQL based)
• LevelDB filtering
• Spark Connector
Riak has a masterless architecture in which every node in a cluster is capable of serving read and write requests.
Requests are routed to nodes using standard load balancing.
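The time quantization and composite keys listed under Optimized Deployment can be illustrated with a small Python sketch. It floors a timestamp to its quantum and hashes the (family, series, quantum) triple onto a ring of partitions, so rows from the same quantum land together. Riak does hash keys with SHA-1 onto a ring (64 partitions by default), but the key serialization, the 15-minute quantum, and the function names here are assumptions for illustration, not Riak TS internals.

```python
import hashlib

# Fixed-width quantum units in milliseconds.
UNIT_MS = {"s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def quantum(ts_ms, n, unit):
    """Floor a millisecond timestamp to the start of its n-unit quantum."""
    width = n * UNIT_MS[unit]
    return ts_ms - ts_ms % width


def partition_for(family, series, ts_ms, ring_size=64):
    """Map (family, series, quantum) to one of ring_size partitions."""
    # Hypothetical key serialization; only the quantized time is hashed.
    key = f"{family}/{series}/{quantum(ts_ms, 15, 'm')}".encode()
    digest = int.from_bytes(hashlib.sha1(key).digest(), "big")
    # SHA-1 yields a 160-bit integer; scale it down to a partition index.
    return (digest * ring_size) >> 160
```

Two writes for the same family and series that fall inside the same 15-minute window hash to the same partition, which is what makes a range query over that window cheap.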
Riak KV Hashing
PUT
Riak KV Hashing
2i Query
Riak TS Hashing
PUT
Riak TS Hashing
TS Query
RIAK TS – Storing Structured Data
• Key format
  – Objects have a composite key (partition key and local key)
• Tables
  – Buckets can be defined as tables
  – Tables can have a schema defined using DDL
  – Columns in the table can be typed
• Data Validation
  – Data is validated on input
Buckets used to Define Tables
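What "typed columns" and "validated on input" mean for a writer can be sketched in Python. The schema below reuses the columns of the `devices` query shown earlier (time, errorcode, device_type); the schema representation and the `validate_row` helper are hypothetical and only mimic the idea of rejecting a row before it is stored.

```python
# (column name, expected Python type, nullable) for a hypothetical devices table
SCHEMA = [
    ("time", int, False),         # epoch milliseconds
    ("errorcode", int, False),
    ("device_type", str, True),
]


def validate_row(row, schema=SCHEMA):
    """Return the row unchanged if it matches the schema, else raise ValueError."""
    if len(row) != len(schema):
        raise ValueError(f"expected {len(schema)} columns, got {len(row)}")
    for value, (name, expected, nullable) in zip(row, schema):
        if value is None:
            if not nullable:
                raise ValueError(f"{name} must not be null")
        elif type(value) is not expected:
            raise ValueError(f"{name} must be {expected.__name__}")
    return row
```

Rejecting bad rows at write time is what lets later queries assume clean data.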
RIAK TS – Range Queries
• Use Cases
  – Range queries
• Implementation Details
  – SQL based query language
  – Filtering rows based on column expressions
  – Filtering executed in backend
  – Specific columns are extracted
  – Simple select with WHERE clause
    • for numbers: >, >=, <, <=, =, !=
    • for other data types: =, !=
    • AND, OR (nesting operators are supported)
Query Like SQL
select * from devices
where time >= 2015-08-06 01:00:00
  and time <= 2015-08-06 01:10:00
  and errorcode = 555123
  and device_type = "mobile"
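The query above uses readable date literals, while the Erlang example later in the deck sends epoch milliseconds. A minimal Python sketch of that conversion and of assembling the WHERE clause follows. The helper names are hypothetical, timestamps are assumed to be UTC, and values are interpolated without escaping, so this is for illustration only.

```python
from datetime import datetime, timezone


def to_epoch_ms(text):
    """Convert 'YYYY-MM-DD HH:MM:SS' (assumed UTC) to epoch milliseconds."""
    dt = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1000


def build_range_query(table, start, end, filters):
    """Build a range query string with a time window plus equality filters."""
    clauses = [f"time >= {to_epoch_ms(start)}", f"time <= {to_epoch_ms(end)}"]
    for column, value in filters.items():
        # Strings are quoted, numbers are written as-is.
        literal = f"'{value}'" if isinstance(value, str) else str(value)
        clauses.append(f"{column} = {literal}")
    return f"select * from {table} where " + " and ".join(clauses)
```

For example, `build_range_query("devices", "2015-08-06 01:00:00", "2015-08-06 01:10:00", {"errorcode": 555123, "device_type": "mobile"})` produces the same query with the two bounds expressed as integers.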
Data Modeling
How does one approach time series data?
The first rule…
The real first rule of data modeling:
• Decide what questions you want to ask of the data
  – Graphs?
  – Granularity?
  – Analysis?
  – Monitoring?
Graphs
Sample Data Exercise
Hard drive test data
– https://www.backblaze.com/hard-drive-test-data.html
– https://en.wikipedia.org/wiki/S.M.A.R.T.
Data Characteristics
[Date, Serial Number, Model, Capacity (bytes), Failure, …, smart_194_raw (Temp), …]
Sample Row:
• Date: “2013-04-10”
• Model: “Hitachi HDS5C3030ALA630”
• Failure: 0
• Temp: 26
Which columns are good candidates for indexing given the question we are asking of the data?
Define the Conceptual Query
Effect of temperature on hard drive stability

Approach 1:

SELECT * FROM HardDrives
WHERE date >= 2013-01-01
  AND date <= 2013-12-31
  AND failure = 'true'

“Find all failures in 2013”
• Pros:
  – All data is colocated physically
• Cons:
  – Requires client side processing for further analysis
Create the Table
riak-admin bucket-type create HardDrives '{"props":{"n_val":3, "table_def":"CREATE TABLE HardDrives (
    date TIMESTAMP NOT NULL,
    family VARCHAR NOT NULL,
    failure VARCHAR NOT NULL,
    serial VARCHAR,
    model VARCHAR,
    capacity FLOAT,
    temperature FLOAT,
    PRIMARY KEY ((quantum(date, 1, 'y'), family, failure), date, family, failure))"}}'
Ingest the Data
RawRow = [
    <<"2013-04-10">>,               %% Date
    <<"MJ0351YNG9Z0XA">>,           %% Serial
    <<"Hitachi HDS5C3030ALA630">>,  %% Model
    <<"3000592982016">>,            %% Capacity
    <<"0">>,                        %% Failure
    …, <<"26">>, …],                %% SMART Stats with Temperature

ProcessedRow = [
    1365555661000,                  %% Date
    <<"all">>,                      %% Family
    <<"false">>,                    %% Failure
    <<"MJ0351YNG9Z0XA">>,           %% Serial
    <<"Hitachi HDS5C3030ALA630">>,  %% Model
    3000592982016.0,                %% Capacity
    26.0],                          %% Temperature
Ingest the Data
ProcessedRow = [
    convert(lists:nth(1, RawRow), date),     % date
    <<"all">>,                               % family
    convert(lists:nth(5, RawRow), boolean),  % failure
    lists:nth(2, RawRow),                    % serial
    lists:nth(3, RawRow),                    % model
    convert(lists:nth(4, RawRow), float),    % capacity
    convert(lists:nth(51, RawRow), float)    % temp
],

riakc_ts:put(Pid, <<"HardDrives">>, [ProcessedRow]).
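The slides do not show the `convert` helper, so the following Python sketch is one guess at what it does: dates become epoch milliseconds, the failure flag becomes the string "true" or "false", and numeric strings become floats. Dates are assumed to be midnight UTC, which gives 1365552000000 for 2013-04-10; the slide's 1365555661000 is 01:01:01 UTC on the same day. The temperature column index is passed in because the raw row is elided in the slides.

```python
from datetime import datetime, timezone


def convert(value, kind):
    """Convert one raw CSV string into the type the table column expects."""
    if kind == "date":
        dt = datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        return int(dt.timestamp()) * 1000  # epoch milliseconds, midnight UTC
    if kind == "boolean":
        return "true" if value == "1" else "false"
    if kind == "float":
        return float(value)
    raise ValueError(f"unknown kind: {kind}")


def process_row(raw, temp_index):
    """Reorder and convert a raw row into (date, family, failure, serial, model, capacity, temp)."""
    return [
        convert(raw[0], "date"),
        "all",                           # single family, as in the slides
        convert(raw[4], "boolean"),
        raw[1],                          # serial
        raw[2],                          # model
        convert(raw[3], "float"),        # capacity in bytes
        convert(raw[temp_index], "float"),
    ]
```

Note that Python lists are 0-indexed, whereas `lists:nth/2` in the Erlang code is 1-indexed.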
Query the Data
Start = integer_to_list(date_to_epoch_ms(<<"2013-01-01">>)),
End = integer_to_list(date_to_epoch_ms(<<"2013-12-31">>)),

Query = "select * from HardDrives where date >= " ++ Start ++
        " and date <= " ++ End ++
        " and family = 'all' and failure = 'true'",

{_Fields, Results} = riakc_ts:query(Pid, list_to_binary(Query)),
Process the Results
Total Failures: 112
Results:
[{1365555661000, <<"all">>, <<"true">>, <<"9VS3FM1J">>, <<"ST31500341AS">>, 1500301910016.0, 31.0},
 {...}, {...}, ...]
Results

130> ts:approach1().
Total Failures: 112
"ST31500341AS": ...
"ST3000DM001": ...
"Hitachi HDS5C4040ALE630": ...
"ST4000DM000": ...
"ST31500541AS": 18.0=1 19.0=1 20.0=2 21.0=3 22.0=2 24.0=2 25.0=1 29.0=3 30.0=1
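The client-side processing that turns the returned rows into the per-model temperature counts above is not shown in the slides. One way to do it, sketched in Python with hypothetical function names, is to count failures per (model, temperature) pair and print each model on one line.

```python
from collections import Counter, defaultdict


def failures_by_model_and_temp(rows):
    """Count failure rows per model and temperature.

    Each row is (date, family, failure, serial, model, capacity, temperature),
    matching the column order of the HardDrives table.
    """
    histogram = defaultdict(Counter)
    for _date, _family, _failure, _serial, model, _capacity, temp in rows:
        histogram[model][temp] += 1
    return histogram


def format_model(model, counts):
    """Render one model as '"MODEL": temp=count temp=count ...'."""
    pairs = " ".join(f"{temp}={n}" for temp, n in sorted(counts.items()))
    return f'"{model}": {pairs}'
```

This is the cost named under Approach 1: the database returns every failure row, and the aggregation happens in the client.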
Refine the Query
New Query
SELECT * FROM HardDrives
WHERE date >= 2013-01-01
  AND date <= 2013-12-31
  AND model = 'ST31500541AS'
  AND failure = 'true'

New Primary Key
PRIMARY KEY ((quantum(date, 1, 'y'), model, failure), date, model, failure))"}}'

Same (but more focused) Results
"ST31500541AS": 18.0=1 19.0=1 20.0=2 21.0=3 22.0=2 24.0=2 25.0=1 29.0=3 30.0=1
Think Outside the Box
New Approach: Multi-Model with Riak KV

Conceptual Query:
Read the single value of a bunch of counters!

“Find the number of failures for each Quantum, Model, and Temperature combination”
• Pros:
  – Each data point is pre-calculated, so very little client side processing
  – Potentially faster, depending on a lot of variables
• Cons:
  – Requires knowing which specific statistics you want before writing the data
  – Requires several counter writes for every row of raw data
Create the Bucket Type
riak-admin bucket-type create HardDriveCounters '{"props":{"datatype":"counter"}}'
Ingest the Data
Failure = lists:nth(5, RawRow),              % failure
Year = extract_year(lists:nth(1, RawRow)),   % year
Model = lists:nth(3, RawRow),                % model
Temp = lists:nth(51, RawRow),                % temperature

%% Bucket type is HardDriveCounters; the bucket within it is the year
Bucket = {<<"HardDriveCounters">>, Year},
Key = list_to_binary(binary_to_list(Model) ++ binary_to_list(Temp)),

%% We only care about failures
case Failure of
    <<"1">> ->
        Counter = riakc_counter:new(),
        Counter1 = riakc_counter:increment(Counter),
        riakc_pb_socket:update_type(Pid, Bucket, Key, riakc_counter:to_op(Counter1));
    _ ->
        ok
end.
Query the Data
StartTemp = 16,
EndTemp = 28,
Results = range_get(Pid, <<"2013">>, <<"ST31500341AS">>, StartTemp, EndTemp, []).
...

%% Stop once CurrentTemp reaches EndTemp (EndTemp itself is not fetched)
range_get(_Pid, _Year, _Model, EndTemp, EndTemp, Accum) ->
    lists:reverse(Accum);
range_get(Pid, Year, Model, CurrentTemp, EndTemp, Accum) ->
    Bucket = {<<"HardDriveCounters">>, Year},
    Key = list_to_binary(binary_to_list(Model) ++ integer_to_list(CurrentTemp)),
    {ok, Counter} = riakc_pb_socket:fetch_type(Pid, Bucket, Key),
    NumFailures = riakc_counter:value(Counter),
    range_get(Pid, Year, Model, CurrentTemp + 1, EndTemp,
              [{CurrentTemp, NumFailures} | Accum]).
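The counter approach can be tried without a cluster by standing in a Python dictionary for Riak KV. The sketch below mirrors the key scheme from the slides (bucket type plus year as the bucket, model concatenated with temperature as the key) and the incremental multi-get over a temperature range. The `store` dictionary and function names are assumptions. Unlike the Erlang version, which pattern-matches on `{ok, Counter}`, a missing counter is treated here as zero. As in the Erlang version, the end temperature is exclusive.

```python
def counter_location(year, model, temp):
    """Return the (bucket, key) pair used for one pre-computed counter."""
    bucket = ("HardDriveCounters", year)
    key = f"{model}{temp}"
    return bucket, key


def record_failure(store, year, model, temp):
    """Increment the failure counter for one (year, model, temperature)."""
    location = counter_location(year, model, temp)
    store[location] = store.get(location, 0) + 1


def range_get(store, year, model, start_temp, end_temp):
    """Fetch counters for start_temp <= temp < end_temp, one key at a time."""
    results = []
    for temp in range(start_temp, end_temp):
        results.append((temp, store.get(counter_location(year, model, temp), 0)))
    return results
```

Each temperature in the range costs one key lookup, which is the client-side multi-get trade-off of the Riak KV approach.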
Data Modeling in Riak
Multi-Model with Riak KV
• Keys: Create your own using quantum + “dimension”
• Range Queries: Create your own client side multi-get to issue incremental key gets
• Compound Queries: Create more composite keys!
• Data Location: Sometimes inefficient because data is spread across many vnodes / partitions
Data Modeling in Riak
Time Series Modeling in Riak TS
• Keys: Automatically managed based on your PRIMARY KEY definition as well as the values in those fields
• Range Queries: Use a well known subset of SQL to simply specify a start and end in a WHERE clause which performs a server side multi-get
• Compound Queries: Possible with a wisely chosen composite PRIMARY KEY, although multiple tables may still be necessary
• Data Location: Very efficient data grouping by quantums, families, and series.
Conclusion
Part of the Basho Data Platform
QUESTIONS?
Spend Time
@basho @riconconf
OPEN SOURCE ENTERPRISE
Basho Data Platform (code)
• Riak KV with parallel extract
Basho Data Platform, Enterprise
• Riak EE with multi-cluster replication
• Spark Leader Election Service
Basho Data Platform Add-ons (code)
• Spark + Spark Connector
Basho Data Platform Add-ons
• Redis + Cache Proxy
• Spark Workers + Spark Master
Download a build
Contact us to get started
getting to know us