Time Series Data Storage in MongoDB


Upload: skyjackson

Post on 12-May-2015



DESCRIPTION

Skyline Innovations, a renewable energy company in Washington DC, uses MongoDB to store its time series data from its solar installations. This talk tells how, and why. www.skylineinnovations.com Given at MongoDC2011

TRANSCRIPT

Page 1: Time Series Data Storage in MongoDB

+

Sunday, July 24, 2011

Page 2: Time Series Data Storage in MongoDB

ajackson@skylineinnovations.com


Page 3: Time Series Data Storage in MongoDB

a tale of rapid prototyping, data

warehousing, solar power, an architecture

designed for data analysis at “scale”...and arduinos!


So here’s what I’d like to talk about: who we are, how we got started, and most importantly, how we’ve been able to use MongoDB to help us. We’re not a traditional startup, and while I know this is a Mongo talk rather than a “startups” talk, I’d like to show how Mongo’s flexible nature really helped us as a business, and how Mongo specifically has been a good choice for us as we build some of our tools. Here are some themes:

Page 4: Time Series Data Storage in MongoDB

Scaling


Mongo has come to have a pretty strong association with the word “scaling.”

Scaling is a word we throw around a lot, and it almost always means “software performance, as inputs grow by orders of magnitude.”

But scaling also means performance as the variety of inputs increases. I’d argue that it’s scaling to go from 10 users to 10,000, and it’s also scaling to go from ten ‘kinds’ of input to a hundred.

There’s another word for this.

Page 5: Time Series Data Storage in MongoDB

Scaling

Flexibility


Particularly when you scale in the real world, you start to find that it’s complicated and messy and entropic in ways that software isn’t always equipped to handle. So for us, when we say “mongo helps us scale”, we don’t necessarily mean scaling to petabytes of data. We’ll come back to these themes as well.

Page 6: Time Series Data Storage in MongoDB

Business-first development


This generally means flexible, lightweight processes. Things that become fixed & unchangeable quickly become obsolete and sad :’(

Page 7: Time Series Data Storage in MongoDB

When Does “Context”

become “Yak Shaving”?


When I read new things or hear about new stuff, I’m always trying to put it in context. So, sometimes I put too much context in my talks :( To avoid it, I sometimes go a little too fast over the context that *is* important. So please stop me to ask questions! Also, the problem domain here is a little different from what we might be used to, so bear with me as we go into plumbing & construction.

Page 8: Time Series Data Storage in MongoDB

Preliminaries


Page 9: Time Series Data Storage in MongoDB

Est. 8/2009

Page 10: Time Series Data Storage in MongoDB

Project Development + Technology


Page 11: Time Series Data Storage in MongoDB

“Project Development”

Page 12: Time Series Data Storage in MongoDB

finance, develop, and operate renewable energy and efficiency

installations, for measurable, guaranteed savings.


Page 13: Time Series Data Storage in MongoDB

finance, develop, and operate renewable energy

and efficiency installations, for measurable, guaranteed savings.


We’ll pay to put stuff on your roof, and we’ll keep it maximally awesome.

Page 14: Time Series Data Storage in MongoDB

finance, develop, and operate renewable energy and

efficiency installations, for measurable, guaranteed savings.


Right now, this means solar thermal, more efficient lighting retrofits, and maybe HVAC.

Page 15: Time Series Data Storage in MongoDB

finance, develop, and operate renewable energy and efficiency installations, for measurable,

guaranteed savings.


So, here’s the interesting part. Since we put stuff on your roof for free, we need to get that money back. What we do is, we’ll charge you for the energy that it saved you, but, here’s the twist. Other companies have done similar things, where they say “we’ll pay for a system/retrofit/whatever, and you’ll agree to pay us an arbitrary number, and we say you’ll get savings, but you won’t actually be able to tell, really.” That always seemed sketchy to us. So, we actually measure the performance of this stuff, collect the data, and guarantee that you save money.

Page 16: Time Series Data Storage in MongoDB

(not webapps)


Page 17: Time Series Data Storage in MongoDB

Topics not covered:


Page 18: Time Series Data Storage in MongoDB

• Why solar thermal?

• Why hasn’t anyone else done this before?

• Pivots? Iterations?

• What’s the market size?

• Funding? Capital structures?

• Wait, how do you guys make money?


Oh, right, this isn’t a startup talk. But feel free to ask me these later!

Page 19: Time Series Data Storage in MongoDB

Solar Thermal in Five Minutes

( mongo next, i promise! )


Page 20: Time Series Data Storage in MongoDB

Municipal => Roof => Tank => Customer

Page 21: Time Series Data Storage in MongoDB

Relevant Data to Track


Page 22: Time Series Data Storage in MongoDB

Temperatures (about a dozen)


Page 23: Time Series Data Storage in MongoDB

Flow Rates (at least two)


Page 24: Time Series Data Storage in MongoDB

Parallel data streams (hopefully many)


e.g., weather data, insolation data. It’d be nice if we didn’t have to collect it all ourselves.

Page 25: Time Series Data Storage in MongoDB

how much data?

20 data points @ 4 bytes

1 minute intervals

at 1000 projects (I wish!)

for 10 years

80 * 60 * 24 * 365 * 10 * 1000 ≈ 420 GB

...not much, really, “in the raw”


Unfortunately, we can’t really store it with maximal efficiency, because of things like timestamps, metadata, etc., but still.
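As a quick check of the arithmetic on this slide, here is a minimal sketch that multiplies out the stated figures (20 data points of 4 bytes each, one sample per minute, 1000 projects, 10 years). The constant names are made up for illustration. It lands at roughly 420 GB, the same ballpark as the slide's estimate, and it ignores timestamps, metadata, and any database overhead.

```python
# Back-of-the-envelope storage estimate using the figures from the slide.
POINTS_PER_SAMPLE = 20
BYTES_PER_POINT = 4
SAMPLES_PER_DAY = 60 * 24          # one sample per minute
DAYS_PER_YEAR = 365
YEARS = 10
PROJECTS = 1000

bytes_per_sample = POINTS_PER_SAMPLE * BYTES_PER_POINT
total_bytes = bytes_per_sample * SAMPLES_PER_DAY * DAYS_PER_YEAR * YEARS * PROJECTS

print(f"{total_bytes:,} bytes")
print(f"{total_bytes / 1e9:.1f} GB (decimal)")
print(f"{total_bytes / 2**30:.1f} GiB (binary)")
```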

Page 26: Time Series Data Storage in MongoDB


I hope this provides enough context on the business problems we’re trying to solve. It looks like we’ll need a data pipeline, and we’ll need one fast.

We’ve got data that we’ll need to use to build, monitor, and monetize these energy technologies. Having worked at other smart grid companies before, I’ve seen some good data pipelines and some bad data pipelines. I’d like to build a good one. The less stuff I have to build, the better.

Page 27: Time Series Data Storage in MongoDB


As I do some research, I find that a lot of these data pipelines have a few well-defined areas of responsibility.

Page 28: Time Series Data Storage in MongoDB

Acquisition, Storage, Search,

Retrieval, Analytics.


These should be self-explanatory. What’s interesting is that most end users of the system are analysts, interested in analytics, yet most systems seem to be designed around the other four functions. More importantly, the stages aren’t very well decoupled: by the time the analysts get to start building tools, the early design decisions are inextricable from the systems that came before.

Page 29: Time Series Data Storage in MongoDB

Acquisition, Storage, Search, Retrieval } Designed for these

Analytics. <= Users are here


Page 30: Time Series Data Storage in MongoDB

Acquisition, Storage, Search,

Retrieval, Analytics.


It’s important to remember that, while you can’t get good analytics without the other stuff, the analytics is where almost all of the value is! Search & retrieval are approaching “solved”.

Page 31: Time Series Data Storage in MongoDB

Acquisition, Storage, Search, Retrieval } Designed for these

Analytics. <= Users are here. Business value is here!


Page 32: Time Series Data Storage in MongoDB


So, here’s how I started thinking about things. This is a design diagram from the early days of the company.

Page 33: Time Series Data Storage in MongoDB


Easy: Python, no problem. There are some interesting topics here, but they’re not MongoDB related. I was pretty sure I knew how to build this part, and I was pretty sure I knew what the data would look like.

Page 34: Time Series Data Storage in MongoDB


This part was also easy -- e-mail reports, csvs, maybe some fancy graphs, possibly some light webapps for internal use. These would be dictated by business goals first, but the technological questions were straightforward.

Page 35: Time Series Data Storage in MongoDB


Here was the real question.

What would some use cases look like where an analyst has a good experience? What would they expect the tools to do?

Page 36: Time Series Data Storage in MongoDB

Now we can think about what the data

looks like


So, let’s think about what this data looks like, how it’s structured and what it is. Then, after that, we can look at the best ways to organize it for future usefulness.

Page 37: Time Series Data Storage in MongoDB

Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count
Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458
Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468
Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471
Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477
Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581
Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614

Time series?


Page 38: Time Series Data Storage in MongoDB

TIME SERIES DATA


So what is time series data?

Page 39: Time Series Data Storage in MongoDB

Features, Over Time


multi-dimensional features. What’s fun in a business like this is that we’re not really sure what the features we study will be. -- Flexibility callout

Page 40: Time Series Data Storage in MongoDB

Features, Over Time

Axes: Time (t); Thing (feature vector, v)


Page 41: Time Series Data Storage in MongoDB

Features, Over Time

Axes: Time (t); Thing (feature vector, v)


Page 42: Time Series Data Storage in MongoDB


A couple of ideas:

• sampling rates. “regularity”. “completeness”
• analog vs. digital
• instantaneous vs. cumulative (tradeoffs)

Page 43: Time Series Data Storage in MongoDB

t_n, t_{n+1}


Finding known interesting ranges (definitely the most common)

Page 44: Time Series Data Storage in MongoDB

t_n, t_{n+1}


Page 45: Time Series Data Storage in MongoDB

t, t’, etc.

Using features to find interesting ranges.

These two ways to look for things should inform our design decisions.
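To make the two lookup styles concrete, here is a small sketch on an in-memory series: one function slices a known time range, the other uses a feature threshold to discover the ranges. The toy data, the function names, and the threshold value are all assumptions for illustration, not the talk's code.

```python
from bisect import bisect_left, bisect_right

# Hypothetical toy series: parallel lists of timestamps (minutes) and one feature.
times = [0, 1, 2, 3, 4, 5, 6, 7]
values = [14.7, 15.0, 15.1, 53.8, 54.1, 53.6, 15.4, 15.2]

def slice_by_time(times, values, t_start, t_end):
    """Known range: return samples with t_start <= t <= t_end (times sorted)."""
    lo = bisect_left(times, t_start)
    hi = bisect_right(times, t_end)
    return list(zip(times[lo:hi], values[lo:hi]))

def ranges_above(times, values, threshold):
    """Feature-driven: return (t, t') pairs where the value stays above threshold."""
    ranges, start, prev_t = [], None, None
    for t, v in zip(times, values):
        if v > threshold and start is None:
            start = t                      # a new interesting range opens
        elif v <= threshold and start is not None:
            ranges.append((start, prev_t))  # the range closed at the previous sample
            start = None
        prev_t = t
    if start is not None:                  # series ended while still above threshold
        ranges.append((start, prev_t))
    return ranges

known = slice_by_time(times, values, 2, 4)
hot = ranges_above(times, values, 50.0)
```

The first lookup only needs an ordering on time; the second has to scan (or index) the feature values themselves, which is why the two cases pull the storage design in different directions.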

Page 46: Time Series Data Storage in MongoDB

y

t, t’, etc.


Page 47: Time Series Data Storage in MongoDB

y

y’

Thresholds

t, t’, etc.


Page 48: Time Series Data Storage in MongoDB

y

y’

Thresholds

t, t’, etc.


Page 49: Time Series Data Storage in MongoDB

(more complicated stuff can be thought of as transformations...)


e.g., frequency analysis, wavelets, whatever.

Page 50: Time Series Data Storage in MongoDB


At this point, I go off and do a bunch of research on existing technologies. I really hate reinventing the wheel, and we really don’t have the manpower.

Page 51: Time Series Data Storage in MongoDB

Time series specific tools

Scientific tools & libraries

Traditional data-warehousing approaches


So, these were some of the options I looked at. I want to quickly point out why I eliminated the first two classes of tools.

Page 52: Time Series Data Storage in MongoDB

Time series specific tools

RRDtool -- Round Robin Database


There are surprisingly few of these. One of the best is RRDtool. It’s pretty sweet, and I highly recommend it. Unfortunately, it’s really designed for applications that are highly regular, and that are already pretty digital, for instance, sampling latencies, or temperatures in a datacenter. It’s not really good for unreliable sensors, nor is it really designed for long-term persistence. It also has really high lock-in, with legacy data formats, etc. Don’t get me wrong, it’s totally rad, but I didn’t think it was for us.

Page 53: Time Series Data Storage in MongoDB

Scientific tools & libraries

e.g., PyTables


Pretty cool, but not many of these were mature & ready for primetime. Some that were, like PyTables, didn’t really match our business use-case.

Page 54: Time Series Data Storage in MongoDB

Traditional data-warehousing approaches


That leaves us with the traditional approaches. This represents a pretty well established field, but very few of the tools are free, lightweight, and mature.

Page 55: Time Series Data Storage in MongoDB

Enterprise buzzwords (Just google for OLAP)


But the biggest idea I learned is that most data warehousing revolves around the idea of a “fact table”. They call it a “multidimensional OLAP cube”, but basically it exists as a totally denormalized SQL table.

Page 56: Time Series Data Storage in MongoDB

“Measures” and their

“Dimensions”


(or facts)

Page 57: Time Series Data Storage in MongoDB

pretty neat!

Page 58: Time Series Data Storage in MongoDB

“how elegant!”


Page 59: Time Series Data Storage in MongoDB

in practice...


Page 60: Time Series Data Storage in MongoDB


Page 62: Time Series Data Storage in MongoDB


ha! Yeah right.

Page 63: Time Series Data Storage in MongoDB

Time series are not relational!

Even extracted features are not inherently relational!

Also: you don’t know what you’re looking for, you don’t know when you’ll find it, and you won’t know when you’ll have to start looking for something different. Why would you lock yourself into a schema?

Page 64: Time Series Data Storage in MongoDB

We don’t know what we’ll want to know.


We won’t know what we want to know. Not only are we warehousing time-series of multidimensional feature vectors, we don’t even know the dimensions we’ll be interested in yet!

Page 65: Time Series Data Storage in MongoDB

natural fit for documents


This makes a schema-less database a natural fit for these sorts of things. Think about all the ALTER TABLE calls I’ve avoided...

Page 66: Time Series Data Storage in MongoDB

"_id" : {
  "install.name" : "agni-3501",
  "timestamp" : ISODate("2010-08-06T00:00:00Z"),
  "frequency" : "daily"
},
"measures" : {
  "total-delta" : -85.78773442284201,
  "Energy Sold" : 450087.1186574721,
  "Generation" : 57273.159890170136,
  "consumed-delta" : 12.569841951556597,
  "lbs-sold" : 18848.4,
  "Gallons Loop" : 740.5,
  "Coincident Usage" : 400,
  "Stored Energy" : 1306699.6439737699,
  "Gallons Sold" : 2260,
  "Energy Delivered" : 360069.6949259777,
  "Total Usage" : -1605086.7261496289,
  "Stratification" : -4.905050370111111,
  "gen-delta-roof" : 4.819865854785763,
  "lbs-loop" : 6520.1025
},
"day_of_year" : 218,
"day_of_week" : 4,
"month" : 8,
"week_of_year" : 31,
"install" : {
  "panels" : 32,
  "name" : "agni-3501",
  "num_files" : "3744",
  "heater_efficiency" : 0.8,
  "storage" : 1612,
  "install_completed" : ISODate("2010-08-06T00:00:00Z"),
  "logger_type" : "emerald",
  "_id" : ObjectId("4d2905536edfdb022f000212"),
  "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ],
  "last_seen" : "2011-01-08 05:26:35.352782"
},
"year" : 2010,
"day" : 6


isn’t this better?

Page 67: Time Series Data Storage in MongoDB

[Same fact document as on Page 66.]

“measures”

“dimensions”

...right?


measures & dimensions. This would be a nice, clean division, except that it isn’t. Frequently we’ll look for measures by other measures -- i.e., each measure serves as a dimension.

Page 68: Time Series Data Storage in MongoDB

...actually, not a good model.


The line gets pretty blurry, in practice. Multi-dimensional vectors mean every measure provides another dimension. Anyway!

Page 69: Time Series Data Storage in MongoDB

[Same fact document as on Page 66.]


How do we build these quickly & efficiently?

Page 70: Time Series Data Storage in MongoDB

the goal: good numbers!


Remember, the goal here is to make it easy for analysts to get comparable numbers, so when I ask for the delivered energy for one system, compared to the delivered energy from another, I can just get the time-series data, without having to worry about whether sensors changed, when the network was out, when a logger was replaced with another one, etc.

Page 71: Time Series Data Storage in MongoDB


So, the OLTP layer serving as our inputs essentially serves up timestamped data as CSV series. It doesn’t really provide a lot of intelligence, and is basically the raw numbers.

Page 72: Time Series Data Storage in MongoDB

from rows to columns


So, most of what our pipeline does is turn things from rows to columns, in a flexible, useful way. I’m gonna walk through that process, quickly.
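As a rough illustration of the rows-to-columns step, the sketch below parses a three-column excerpt of the logger CSV shown earlier into one list per column. The function name `rows_to_columns` is hypothetical, and the timestamp format string is inferred from the sample rows. One caveat: the real header repeats the name "array in/out", so a dictionary keyed by column name would need those two columns disambiguated first.

```python
import csv
import io
from datetime import datetime

# Three columns cut from the logger CSV; the real files have sixteen.
RAW = """Time,municipal water in T,solar heated water out T
Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383
Tue Mar 9 23:02:44 2010,14.958038343,53.764889193
Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192
"""

def rows_to_columns(text):
    """Turn row-major CSV text into {column name: list of values}."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, cell in row.items():
            if name == "Time":
                # Format inferred from the sample; assumes English day/month names.
                columns[name].append(datetime.strptime(cell, "%a %b %d %H:%M:%S %Y"))
            else:
                columns[name].append(float(cell))
    return columns

columns = rows_to_columns(RAW)
```

Once the data is column-major, each sensor becomes a series that a measure can be computed over independently.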

Page 73: Time Series Data Storage in MongoDB

[Same fact document as on Page 66.]

Let’s just look at one


Page 74: Time Series Data Storage in MongoDB

[Same raw CSV log as on Page 37.]

row-major data


Page 75: Time Series Data Storage in MongoDB

“Functional”

import functools

class Mass(BasicMeasure):
    def __init__(self, density, volume):
        ...
        # Bind density and the volume measure now; only `data` is supplied later.
        self._result_func = functools.partial(
            lambda data, density, volume: density * volume(data),
            density=density, volume=volume)

    def __call__(self, data):
        return self._result_func(data)


quasi-functional classes that describe how to calculate a value from data.

Page 76: Time Series Data Storage in MongoDB

"_id" : {
  "install.name" : "agni-3501",
  "timestamp" : ISODate("2010-08-06T00:00:00Z"),
  "frequency" : "daily"
},
"measures" : {
  "total-delta" : -85.78773442284201,
  "Energy Sold" : 450087.1186574721,
  "Generation" : 57273.159890170136,
  "consumed-delta" : 12.569841951556597,

# pseudocode
class LoopEnergy(BasicMeasure):
    def __init__(self, heat_cap, delta, mass):
        ...
        def result_func(data):
            return self.delta(data) * self.mass(data) * self.heat_cap
        self._result_func = result_func

    def __call__(self, data):
        return self._result_func(data)

A formula:

E = ∆t × F
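The two slide classes can be turned into something runnable under a few assumptions. In the sketch below, `BasicMeasure` is a hypothetical minimal base class, and `Column` and `Delta` are made-up helpers; none of them are the talk's real implementation. The density of 8.34 lb per gallon matches the ratio of "lbs-sold" to "Gallons Sold" in the sample document, while the heat capacity of 1.0 and the sample row are purely illustrative.

```python
import functools

class BasicMeasure:
    """Hypothetical stand-in for the talk's base class: a callable over data."""
    def __call__(self, data):
        raise NotImplementedError

class Column(BasicMeasure):
    """Hypothetical helper: read one named value out of a row of data."""
    def __init__(self, name):
        self.name = name
    def __call__(self, data):
        return data[self.name]

class Delta(BasicMeasure):
    """Hypothetical helper: difference between two other measures."""
    def __init__(self, hot, cold):
        self.hot, self.cold = hot, cold
    def __call__(self, data):
        return self.hot(data) - self.cold(data)

class Mass(BasicMeasure):
    def __init__(self, density, volume):
        # Bind the constants now; only `data` is supplied at call time.
        self._result_func = functools.partial(
            lambda data, density, volume: density * volume(data),
            density=density, volume=volume)
    def __call__(self, data):
        return self._result_func(data)

class LoopEnergy(BasicMeasure):
    def __init__(self, heat_cap, delta, mass):
        self.heat_cap, self.delta, self.mass = heat_cap, delta, mass
    def __call__(self, data):
        return self.delta(data) * self.mass(data) * self.heat_cap

# Illustrative numbers only.
mass = Mass(density=8.34, volume=Column("gallons"))
energy = LoopEnergy(heat_cap=1.0,
                    delta=Delta(Column("t_return"), Column("t_supply")),
                    mass=mass)
row = {"gallons": 10.0, "t_supply": 100.0, "t_return": 120.0}
result = energy(row)   # temperature delta * mass * heat capacity
```

Because every measure is just a callable over `data`, measures compose: `LoopEnergy` does not care whether `mass` is a raw column or another derived measure.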


Page 77: Time Series Data Storage in MongoDB

Creating a Cube

For each install, for each chunk of data:

• apply all known formulas to get values
• make some convenience keys (e.g., day_of_year)
• stuff it in mongo

Then, map/reduce to whatever dimensionalities you’re interested in: e.g., downsampling.


Here’s some pseudocode for how to make a cube of multidimensional data. So, what’s the payoff?
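The cube-building steps can be sketched in plain Python as follows. This is an in-memory illustration under assumptions: `FORMULAS`, `make_fact`, and `downsample_daily` are hypothetical names, the "minute" frequency label is made up, and the grouping function only stands in for a real map/reduce job. The convenience keys are computed the way the sample document suggests: for 2010-08-06 they come out as day_of_year 218, day_of_week 4, and week_of_year 31, matching the document shown earlier.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical formulas: measure name -> function of one raw row (a dict).
FORMULAS = {
    "Gallons Sold": lambda row: row["gallons"],
    "lbs-sold": lambda row: row["gallons"] * 8.34,
}

def make_fact(install_name, row):
    """Apply every known formula to one row and add convenience keys."""
    ts = row["timestamp"]
    return {
        "_id": {"install.name": install_name, "timestamp": ts, "frequency": "minute"},
        "measures": {name: f(row) for name, f in FORMULAS.items()},
        "year": ts.year, "month": ts.month, "day": ts.day,
        "day_of_year": ts.timetuple().tm_yday,
        "day_of_week": ts.weekday(),            # Monday == 0
        "week_of_year": ts.isocalendar()[1],
    }

def downsample_daily(facts):
    """Stand-in for the map/reduce step: sum each measure per install per day."""
    totals = defaultdict(lambda: defaultdict(float))
    for fact in facts:                           # "map": derive the grouping key
        ts = fact["_id"]["timestamp"]
        key = (fact["_id"]["install.name"], ts.date())
        for name, value in fact["measures"].items():   # "reduce": accumulate
            totals[key][name] += value
    return {key: dict(measures) for key, measures in totals.items()}

rows = [
    {"timestamp": datetime(2010, 8, 6, 0, 0), "gallons": 1.5},
    {"timestamp": datetime(2010, 8, 6, 0, 1), "gallons": 2.5},
    {"timestamp": datetime(2010, 8, 7, 0, 0), "gallons": 4.0},
]
facts = [make_fact("agni-3501", row) for row in rows]
# A real pipeline would insert `facts` into MongoDB at this point.
daily = downsample_daily(facts)
```

Summing is only right for cumulative measures such as gallons; an instantaneous measure such as a temperature would need an average or some other reducer.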

Page 78: Time Series Data Storage in MongoDB

How much water did [x] use, monthly?

> db.facts_monthly.find({"install.name": [foo]}, {"measures.Gallons Sold": 1}).sort({"_id": 1})


Complicated analytical queries can be boiled down to nearly single-line mongo queries. Here are some examples:

Page 79: Time Series Data Storage in MongoDB

What were our highest production days?

> db.facts_daily.find({}, {"measures.Energy Sold": 1}).sort({"measures.Energy Sold": -1})


Page 80: Time Series Data Storage in MongoDB

How does the distribution of [x] on the weekend compare to its distribution on the weekdays?

> weekends = db.facts_daily.find({"day_of_week": {$in: [5,6]}})
> weekdays = db.facts_daily.find({"day_of_week": {$nin: [5,6]}})
> do stuff


Page 81: Time Series Data Storage in MongoDB

What’s the production of installs north of a certain latitude, with a certain class of panel, on Tuesdays?

For hours where the average delivered temperature delta was above [x], what was our generation efficiency?

Normalize by number of panels? (map/reduce)

Normalize by distance from equinox? (map/reduce)

...etc.
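As one example from this list, normalizing by number of panels can be written as a map step and a reduce step. MongoDB's own map/reduce takes JavaScript functions, so the Python below only mirrors the shape of such a job on an in-memory list. The fact documents, install names, and function names are hypothetical; "panels" mirrors the field in the install sub-document shown earlier.

```python
from functools import reduce

# Hypothetical daily facts with only the fields this job needs.
facts = [
    {"install": {"name": "a", "panels": 32}, "measures": {"Generation": 640.0}},
    {"install": {"name": "a", "panels": 32}, "measures": {"Generation": 320.0}},
    {"install": {"name": "b", "panels": 8}, "measures": {"Generation": 240.0}},
]

def map_fact(fact):
    """Emit (install name, generation per panel) for one fact document."""
    per_panel = fact["measures"]["Generation"] / fact["install"]["panels"]
    return fact["install"]["name"], per_panel

def reduce_pairs(acc, pair):
    """Fold emitted pairs into a running total per install."""
    key, value = pair
    acc[key] = acc.get(key, 0.0) + value
    return acc

per_panel_totals = reduce(reduce_pairs, map(map_fact, facts), {})
```

After normalization the large and the small install become directly comparable, which is the kind of "comparable numbers" the analysts are after.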


Page 82: Time Series Data Storage in MongoDB

• Building a cube can be done in parallel

• Map/reduce is an easy way to think about transforms.

• Not maximally efficient, but parallelizes on commodity hardware.


Some advantages. Re #3: so what? It’s not a webapp.
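The first bullet works because each install's facts can be computed without looking at any other install. A minimal sketch of that fan-out, assuming a hypothetical per-install job `build_install_facts` and made-up install names apart from "agni-3501", might look like this. It uses a thread pool only to keep the example small; for CPU-bound formulas a process pool or separate machines would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor

def build_install_facts(install_name):
    """Hypothetical per-install job: load raw data, apply formulas, return facts."""
    return [{"install": install_name, "chunk": i} for i in range(3)]

installs = ["agni-3501", "install-b", "install-c"]

# Each install is independent, so the jobs can run side by side.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(build_install_facts, installs))

# Flatten the per-install lists into one batch of fact documents.
facts = [fact for install_facts in results for fact in install_facts]
```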

Page 83: Time Series Data Storage in MongoDB

mongoDB: The future of enterprise business intelligence. (they just don’t know it yet)


So, here’s my thesis: document databases are far superior to relational databases for business intelligence cases. Not only that, but mongoDB and some common sense let you replace multimillion dollar IBM-level enterprise solutions with open-source awesomeness. All this in a rapid, agile way.

Page 84: Time Series Data Storage in MongoDB

Lastly...


Page 85: Time Series Data Storage in MongoDB

Mongo expands in an organization.


It’s cool, don’t fight it. Once we started using it for our analytics, we realized there was a lot of other schema-loose data that we could use it for -- like the definitions of the measures themselves, or the details about an install, etc., etc.

Page 86: Time Series Data Storage in MongoDB

Final Thoughts


OK, I want to close up with a few jumping-off points.

Page 87: Time Series Data Storage in MongoDB

“Business Intelligence” no longer requires megabucks


Page 88: Time Series Data Storage in MongoDB

Flexible tools mean business responsiveness should be easy


Page 89: Time Series Data Storage in MongoDB

“Scaling” doesn’t just mean depth-first.


Businesses grow deep, in the sense of adding more users, but they also grow broad.

Page 90: Time Series Data Storage in MongoDB

Questions?


Page 91: Time Series Data Storage in MongoDB

Epilogue

Quest for Logging Hardware


Page 92: Time Series Data Storage in MongoDB

This’ll be easy!

This is such an obvious and well explored problem space, I’m sure we’ll be able to find a solution that matches our needs without breaking the bank!


Page 93: Time Series Data Storage in MongoDB

Shopping List!

• 16 temperature sensors
• 4 flow sensors
• maybe some miscellaneous ones
• internet backhaul
• no software/data lock in.


Page 94: Time Series Data Storage in MongoDB

Conventions FTW!

And since we’ve walked a couple of convention floors and browsed product catalogs from major industrial supply vendors, I’m sure it’s in here somewhere!


Page 95: Time Series Data Storage in MongoDB

derp derp “internet”?

I’m sure there’s a reason why all of these loggers have to connect via USB...

Pace Scientific XR5:

• 8 analog
• 3 pulse
• ONE MB
• no internet?
• $500?!?


Page 96: Time Series Data Storage in MongoDB

yay windows?

...and require proprietary (windows!) software or subscription plans that route my data through their servers (basically all of them!)


Page 97: Time Series Data Storage in MongoDB

Maybe the gov’t can help!

Perhaps there’s some kind of standard that the governments

require for solar thermal monitoring systems to be

eligible for incentives or tax credits.


Page 98: Time Series Data Storage in MongoDB

Vive la France!

An obscure standard by the Organisation Internationale de Métrologie Légale appears! Neat!


Page 99: Time Series Data Storage in MongoDB

A “Certified” Logger

• two temperature sensors
• one pulse
• no increase in accuracy
• no data backhaul -- at all

...what’s the price?


Page 100: Time Series Data Storage in MongoDB

$1,000


Page 101: Time Series Data Storage in MongoDB

$1,000


Page 102: Time Series Data Storage in MongoDB

Hmm...I can solder, and arduinos are

pretty cheap


Page 103: Time Series Data Storage in MongoDB

It’s on!


Page 104: Time Series Data Storage in MongoDB

arduino + netbook!

Page 105: Time Series Data Storage in MongoDB

TL; DR: Existing loggers

are terrible.


Also, existing industries aren’t really ready for rapid prototyping and its destructive effects.