Data Rods: High Speed, Time-Series Analysis of Massive Data Sets Using Pure Object Database Methods

David Gallaher(1), Qin Lv(2), Glenn Grant(1), Garrett Campbell(1)

1) National Snow and Ice Data Center, University of Colorado, Boulder, Colorado, 80309, USA

2) Department of Computer Science, University of Colorado, Boulder, Colorado, 80309, USA

The National Snow and Ice Data Center

Creates tools for data access

Manages and distributes scientific data

Performs scientific research

Educates the public about the cryosphere

Supports data users

Affiliations and Sponsorship

Cooperative Institute for Research in Environmental Sciences

University of Colorado at Boulder

World Data Center for Glaciology (since 1976)

Mission: to monitor climate data in Earth’s icy regions, and to analyze and distribute it worldwide, 24x7. The focus is mainly NASA satellite data.

Data Rods - Project Basis

The “Data Rods” project proposes to create a prototype of a high-speed, scalable database structure for rapid retrieval, filtering, and analysis of massive multi-modality data sets.

Objective: Remote Sensing Data Analysis

The Problem:

• Data sets are becoming too large to move over the internet

• Need for basic Boolean logic for time-series anomaly detection

• Data downloads for long time-series analysis are especially cumbersome

Analysis Challenges

• A wide variety of data formats

• Ever-increasing data set sizes

• Myriad analysis and visualization requirements

• There will be uses and analysis of the data that cannot be anticipated (data discovery is not enough)

• Lack of direct query access to the data values themselves (e.g., albedo > 15%)

• Our current directory trees impede data access (We really need to consider a database)

“Big Data” Considerations:

The era of “search, order and transmission” of data is ending.

• We must develop systems where the data stay fixed and analyses are rendered against them

• Rapid, scalable data access across time and space

• Direct query of the data, not just the metadata (we need more than what, where, when)

• Web-based spatio-temporal analysis and visualization

Database Choice

Fast and efficient storage, query and retrieval of entire data sets – not just the metadata

Ability to store colossal amounts of small files

Relational databases can't handle it. The tables grow too big. (Object-relational is no better)

Hadoop excels at unstructured data, but due to its batch-oriented nature it is inefficient for real-time analytics as well as intra-data analysis

A “pure-object” database was seen as the best choice

The Data Rods Project

The “Data Rods” project has created a high speed, scalable database structure for rapid retrieval, filtering, and analysis of massive data sets.

We’ll cover the following:

• Database design

• Status on development

• Basic analysis examples and performance

• Planned analysis and potential applications

Database design

Gridded data is key.

For consistency, NSIDC's Equal-Area Scalable Earth Grids (EASE-Grids) tool is used.

Common resolutions between data sets (1 km, 5 km, etc.) and point data

The nesting relationship of differing resolutions in EASE-Grid

Data Rods Concept

[Figure: data rods concept. Axes: X coordinate, Y coordinate, Time.]

Database Systems Development

[Figure: system diagram. Labels recovered from the slide are listed below.]

Panels: Object Database Design; Cryospheric Change Analysis

Data Input: Passive Microwave, Visual Infrared, Active Microwave, Radar, Other

EASE-Grid Processing; Object Database Loading; Data Rod Updating

User Input: User Interface

Basic Data Management (query & index); Pattern Search (input pattern or trend); Automated Pattern Discovery (anomaly detection, trend detection, cycle detection)

Object Interface; Pixel Grid Sampling; Data Rod Objects (axes: X coordinate, Y coordinate, Time)

Pure-Object Database

Object persistence/instantiation is directly to/from the database – no Java Spring or Hibernate needed

Not object-relational (examples of pure-object databases include Versant, ObjectDB, db4o, Objectivity)

Not as limited by size

Fast interactions across databases

Simple, efficient schema

Next: schema design

Object Database Schema

Each image pixel is an object

Data rods are time-series collections of pixels

Each data rod can be analyzed independently

Adjacency analysis by row/col or lat/lon
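
The schema bullets above can be pictured with a small in-memory sketch. This is only an illustration, not the project's actual object database classes: the names `Pixel`, `DataRod`, `RodStore`, and all field names are hypothetical, and a plain dictionary stands in for the database.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Pixel:
    """One observation of one grid cell at one time step (hypothetical fields)."""
    day: int          # time index, e.g. days since the start of the record
    albedo: float
    skin_temp: float
    cloudy: bool


@dataclass
class DataRod:
    """Time-ordered pixels for a single (row, col) grid cell."""
    row: int
    col: int
    pixels: List[Pixel] = field(default_factory=list)

    def append(self, pixel: Pixel) -> None:
        self.pixels.append(pixel)


class RodStore:
    """In-memory stand-in for the object database, keyed by grid position."""

    def __init__(self) -> None:
        self._rods: Dict[Tuple[int, int], DataRod] = {}

    def get_rod(self, row: int, col: int) -> DataRod:
        # Create the rod on first access so loading code can simply append.
        return self._rods.setdefault((row, col), DataRod(row, col))

    def neighbors(self, row: int, col: int) -> List[DataRod]:
        """Return the existing rods in the 8 cells adjacent to (row, col)."""
        found = []
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                key = (row + dr, col + dc)
                if (dr, dc) != (0, 0) and key in self._rods:
                    found.append(self._rods[key])
        return found


store = RodStore()
store.get_rod(10, 10).append(Pixel(day=0, albedo=0.81, skin_temp=250.0, cloudy=False))
store.get_rod(10, 11).append(Pixel(day=0, albedo=0.79, skin_temp=251.0, cloudy=True))
print(len(store.neighbors(10, 10)))  # one adjacent rod exists
```

Keying rods by (row, col) is what makes each rod independently retrievable and keeps row/col adjacency a simple index offset.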

Database Creation

[Figure: axes Longitude, Latitude, Time]

Gridded data sets

Standardized grid dimensions

Visualize as layers of imagery through time (days to decades)

Lends itself well to time-series analysis

Status – Database Administration

5 AVHRR databases, each with 5 years of imagery (<100 GB each, administratively easier)

Surface mask databases for northern hemisphere at 5 km and 25 km

SSM/I database, 25 years of daily 25 km data at all frequencies and polarizations

Selected MODIS database at 250 m resolution

~600 GB total

No upper limit to database except disk space

AVHRR Database Creation

Initial demonstration region is Greenland

25 years of daily multi-spectral AVHRR data at 5 km resolution

9000+ images

2 billion+ pixel objects total

Each pixel object is independently accessible for query

Database Flexibility

Data can be spread across many databases

Transparent queries across databases

Methods (routines) can be attached to the data rods to add functionality such as statistical analysis

Data fusion: analyses may span multiple data types, resolutions, time spans

Data Rods supports NetCDF output
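
One way to picture a transparent query across several databases is sketched below. It is a minimal illustration under stated assumptions: each database is modeled as a plain dictionary for one 5-year period, and `query_rod` and `mean_albedo` are hypothetical helper names, not calls from the object database product used in the project.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: one store per 5-year period, each mapping
# (row, col) -> list of (year, albedo) samples for that period.
Sample = Tuple[int, float]
Database = Dict[Tuple[int, int], List[Sample]]


def query_rod(databases: List[Database], row: int, col: int) -> List[Sample]:
    """Collect one cell's samples from every database, returned in time order."""
    samples: List[Sample] = []
    for db in databases:
        samples.extend(db.get((row, col), []))
    return sorted(samples)


def mean_albedo(samples: List[Sample]) -> float:
    """Example of a statistic that could be attached to a rod as a method."""
    return sum(albedo for _, albedo in samples) / len(samples)


db_1995_1999: Database = {(3, 4): [(1995, 0.80), (1997, 0.82)]}
db_2000_2004: Database = {(3, 4): [(2001, 0.78)]}

# The caller does not need to know which database holds which years.
rod = query_rod([db_2000_2004, db_1995_1999], 3, 4)
print([year for year, _ in rod])  # [1995, 1997, 2001]
print(round(mean_albedo(rod), 3))
```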

Simple AVHRR Object Database Time Test

• Built a database using AVHRR 5 km data from 1995-1999

• 2 visible channels, 3 IR channels, 3 references plus albedo, skin temperature and cloud mask

• Database includes location class, time stamp class and metadata

• 213,000 data rods covering 5 years over Greenland

• 1 Data rod contains 1825 pixels

• Pixels = 388,725,000 each with 11 variables/pixel

• Variables = 4.2 billion coded short integer values
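
The counts above follow from simple multiplication, which the short check below reproduces. The 2-byte size for a "short integer" is an assumption (a 16-bit value); the slide does not state the storage width. The product comes to about 4.28 billion values, close to the 4.2 billion quoted.

```python
rods = 213_000              # data rods over Greenland (from the slide)
days_per_rod = 5 * 365      # 5 years of daily pixels, ignoring leap days
variables_per_pixel = 11    # from the slide

pixels = rods * days_per_rod
values = pixels * variables_per_pixel

# Assumption: each coded short integer occupies 2 bytes.
raw_bytes = values * 2

print(days_per_rod)   # 1825
print(pixels)         # 388725000
print(values)         # 4275975000
print(raw_bytes)      # 8551950000
```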

Example Analysis Using Object Databases

• All queries run on a single processor, single thread

• Example #1: Queries and plots on single database

• Example #2: Queries and plots on multiple databases

• Example #3: Advanced Spatiotemporal Analysis

• We will move to multi-threaded, multiprocessor operation once we have the design finalized (this is a research project)

Using Single AVHRR Object Database Time Test

• Single processor under load

• 5-year plots returned in 2-10 seconds

• Cached data plots returned in ½ second

• Images in 10 seconds

Multi Data Rod Selection

• Seven locations selected across 5 years simultaneously

• Brightness temperature and albedo selected for output

• Again, cached queries are much faster

Example Analysis of Greenland & 5 databases

Using five 5-year rods and statistics (1 min, or 5 secs cached)

Image ref: Maurer, J. 2007. Atlas of the Cryosphere. Boulder, Colorado USA: National Snow and Ice Data Center. Digital media.

AVHRR albedo statistics, May average, 1981 – 2005

Camp Century: mean 0.801, std. dev. 0.077

Summit Station: mean 0.819, std. dev. 0.069

Swiss Camp: mean 0.817, std. dev. 0.070

GISP Ice Core Camp: mean 0.802, std. dev. 0.071

Temporal Analysis of Single Rods

Descriptive Statistical functions

Spatiotemporal data selection

Filtering by value

Anomaly detection

Also: image generation, inter-database data fusion
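
The single-rod operations listed above might look like the following sketch. The functions `describe`, `filter_by_value`, and `anomalies` are hypothetical names, and the anomaly rule (values more than two standard deviations from the rod's mean) is a common textbook choice assumed here for illustration; the slides do not say which detection method the project uses.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Series = List[Tuple[int, float]]  # (day index, value) pairs for one rod


def describe(series: Series) -> Dict[str, float]:
    """Basic descriptive statistics for one rod."""
    values = [v for _, v in series]
    return {"mean": mean(values), "std": pstdev(values),
            "min": min(values), "max": max(values)}


def filter_by_value(series: Series, threshold: float) -> Series:
    """Keep only the samples whose value exceeds the threshold."""
    return [(t, v) for t, v in series if v > threshold]


def anomalies(series: Series, z: float = 2.0) -> Series:
    """Flag samples more than z standard deviations from the rod mean."""
    values = [v for _, v in series]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [(t, v) for t, v in series if abs(v - mu) > z * sigma]


albedo = list(enumerate([0.80, 0.81, 0.79, 0.80, 0.82, 0.80, 0.35, 0.81]))
print(describe(albedo)["max"])            # 0.82
print(len(filter_by_value(albedo, 0.5)))  # 7
print(anomalies(albedo))                  # [(6, 0.35)]
```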

Broad Spatiotemporal Analysis (This took some time)

• Statistical analysis repeated at every grid cell.

• Intersection of surface mask database and AVHRR database: only pixels on the ice sheet were processed.

• Bad data filtered out.

• Multivariate: cloud mask used to exclude cloudy pixels from albedo averages.

• All 2 billion objects queried and analyzed
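
The steps in this list can be sketched as a loop over rods. This is an illustration under assumptions, not the project's code: rods and the surface mask are plain dictionaries, a bad value is represented as `None` or an albedo outside 0 to 1, and `clear_sky_mean_albedo` is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]
# Each sample is (albedo, cloudy); albedo is None when the value is flagged bad.
Sample = Tuple[Optional[float], bool]


def clear_sky_mean_albedo(rods: Dict[Cell, List[Sample]],
                          ice_mask: Dict[Cell, bool]) -> Dict[Cell, float]:
    """Mean cloud-free albedo for every grid cell that lies on the ice sheet."""
    result: Dict[Cell, float] = {}
    for cell, samples in rods.items():
        # Intersection with the surface mask: skip cells that are off the ice.
        if not ice_mask.get(cell, False):
            continue
        # Drop bad values and cloudy observations before averaging.
        good = [a for a, cloudy in samples
                if a is not None and 0.0 <= a <= 1.0 and not cloudy]
        if good:
            result[cell] = sum(good) / len(good)
    return result


rods = {
    (0, 0): [(0.80, False), (0.95, True), (None, False), (0.82, False)],
    (0, 1): [(0.10, False)],  # off the ice sheet, so it is skipped
}
ice_mask = {(0, 0): True, (0, 1): False}
result = clear_sky_mean_albedo(rods, ice_mask)
print(sorted(result))
```

Because each rod is processed independently, the same loop could later be split across threads or processors without changing the per-cell logic.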

Analysis Example: Sea Ice Temporal Query

We would like to remove clouds from the image. Clouds move faster than ice, so the minimum albedo over a short time window picks out the cloud-free view of each pixel (open water shows as low albedo).

Moving 8-day window through datarod

Minimum albedo in temporal window

[Figure: datarod time series of pixels, with the 8-day window bracketed from t1 to t8]

Pseudocode example query:

datarod = database.getDatarod(row, col)

albedo = datarod.getMinAlbedo(t, t + 7)
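
A runnable version of the moving-window query might look like the sketch below. It assumes a rod is simply a list of daily albedo values with `None` for missing days; `min_albedo` and `moving_min` are hypothetical stand-ins for the pseudocode's `getMinAlbedo`, with the same inclusive window from t to t + 7.

```python
from typing import List, Optional

Rod = List[Optional[float]]  # one albedo value per day, None if missing


def min_albedo(rod: Rod, start: int, end: int) -> Optional[float]:
    """Minimum albedo over days start..end inclusive, ignoring missing days."""
    window = [a for a in rod[start:end + 1] if a is not None]
    return min(window) if window else None


def moving_min(rod: Rod, width: int = 8) -> List[Optional[float]]:
    """Slide a window of `width` days through the rod, one day at a time."""
    return [min_albedo(rod, t, t + width - 1)
            for t in range(len(rod) - width + 1)]


# Day 1 is a dark (cloud-free, open water) observation; the rest are brighter.
rod = [0.70, 0.08, 0.65, 0.72, 0.68, 0.75, 0.71, 0.69, 0.66, 0.74]
print(moving_min(rod))  # [0.08, 0.08, 0.65]
```

Running the same function on every rod and writing the results back to the grid gives one composite image per window position.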

Analysis result: Sea Ice Detection

Technique for removing clouds from the image

Composite image created from Data Rods’ time series

Lowest AVHRR albedo over an 8-day period

Remaining objective: exclude lingering clouds

One of the Original images

Analysis Potential: Rapid Data Fusion

Loss of AMSR-E decreases sea ice detection capability

Data Rods AVHRR/SSM/I product fusion may fill the gap

Can be validated against AMSR-E sea ice record.

AVHRR 8-day: high resolution sea ice detection, still some clouds

SSM/I: cloud free with good sea ice detection but low resolution

Fused product (AVHRR 8-day + SSM/I): high-res sea ice extent, no clouds

Performing this lake detection analysis conventionally took 6 months (downloading, gridding, and image analysis)

With Data Rods, the analysis was done in 2 days (single thread, single processor)

What’s Next: Ongoing Efforts

Newest version of the ODB software has multi-threaded capability, to take advantage of multiprocessor machines and reduce query times

Investigating Data rod performance on the Janus supercomputer with Pan-Arctic extent

User Interface to Data Rod database

Creating 1000s of Databases for Use with Massively Parallel Machines

• Each database is small enough to be held in memory for each CPU (uses MPI calls)

• Each database covers 5° x 5° x 25 years of Data Rods

• Each database is capped (fixed for minimal changes)

• Changes are added to the present year database for each 5° x 5°
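
To make the tiling concrete, the sketch below maps a latitude/longitude to its 5° x 5° tile and picks a database name for it. Everything here is an assumption made for illustration: the south-west-corner keying, the naming scheme, and the "archive"/"current" split are hypothetical, not the project's actual layout. A global 5° grid has 36 x 72 = 2,592 tiles, which is consistent with "1000s of databases".

```python
import math
from typing import Tuple


def tile_key(lat: float, lon: float, size_deg: float = 5.0) -> Tuple[int, int]:
    """Return the south-west corner (lat, lon) of the tile containing a point."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude must be within [-90, 90]")
    lon = ((lon + 180.0) % 360.0) - 180.0  # normalize to [-180, 180)
    # Clamp so that the pole itself falls in the topmost tile.
    lat_sw = min(math.floor(lat / size_deg) * size_deg, 90.0 - size_deg)
    lon_sw = math.floor(lon / size_deg) * size_deg
    return int(lat_sw), int(lon_sw)


def database_name(lat: float, lon: float, year: int, current_year: int) -> str:
    """Hypothetical naming: capped archive per tile, plus a present-year database."""
    lat_sw, lon_sw = tile_key(lat, lon)
    suffix = "current" if year == current_year else "archive"
    return f"rods_{lat_sw:+03d}_{lon_sw:+04d}_{suffix}"


print(tile_key(72.6, -38.5))                    # (70, -40)
print(database_name(72.6, -38.5, 2012, 2012))   # rods_+70_-040_current
print(database_name(72.6, -38.5, 1995, 2012))   # rods_+70_-040_archive
print((180 // 5) * (360 // 5))                  # 2592 tiles globally
```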

Creating 1000s of Databases for Use with Massively Parallel Machines

• With this database it should be possible to perform analysis at Internet speeds

• Multi-sensor analysis is relatively simple

• We are starting the database loading now

• 100TB database testing will occur over the summer

Summary

We can now perform high-speed time-series analysis on the server-side without downloads

Scalable, massive remote sensing databases

Accelerated analysis compared to traditional “search, order and transmission” methods

Interactions across data sets – data fusion

Developing UI and additional analysis tools

Allow users interactive access to the data

NSIDC Data Rods Project

The Data Rods project is funded by the National Science Foundation through grant: ARC 0941442

Thank You

Interested in testing Data Rods? Please contact us at: [email protected]