TRANSCRIPT
Data Rods: High Speed, Time-Series Analysis of Massive Data Sets Using Pure Object Database Methods
David Gallaher(1), Qin Lv(2), Glenn Grant(1), Garrett Campbell(1)
1) National Snow and Ice Data Center, University of Colorado, Boulder, Colorado, 80309, USA
2) Department of Computer Science, University of Colorado, Boulder, Colorado, 80309, USA
The National Snow and Ice Data Center
• Creates tools for data access
• Manages and distributes scientific data
• Performs scientific research
• Educates the public about the cryosphere
• Supports data users
Affiliations and Sponsorship
Cooperative Institute for Research in Environmental Sciences
University of Colorado at Boulder
World Data Center for Glaciology (since 1976)
Mission: To Monitor the Climate Data in Earth’s Icy Regions, Analyze and Distribute it Worldwide 24x7. Focus is Mainly NASA Satellite Data
Data Rods - Project Basis
The “Data Rods” project proposes to create a prototype high-speed, scalable database structure for rapid retrieval, filtering, and analysis of massive multi-modality data sets.
Objective: Remote Sensing Data Analysis
The Problem:
• Data sets are becoming too large to move over the internet
• Need for basic Boolean logic for time-series anomaly detection
• Data downloads for long time-series analysis are especially cumbersome
Analysis Challenges
• A wide variety of data formats
• Ever-increasing data set sizes
• Myriad analysis and visualization requirements
• There will be uses and analysis of the data that cannot be anticipated (data discovery is not enough)
• Lack of direct access to the data (e.g., querying for albedo > 15%)
• Our current directory trees impede data access (We really need to consider a database)
“Big Data” Considerations:
The era of search, order, and transmission of data is ending.
• We must develop systems where the data stay fixed and analyses are rendered against them
• Rapid, scalable data access across time and space
• Direct query of the data, not just the metadata (we need more than what, where, when)
• Web-based spatio-temporal analysis and visualization
Database Choice
Fast and efficient storage, query and retrieval of entire data sets – not just the metadata
Ability to store colossal amounts of small files
Relational databases can't handle it. The tables grow too big. (Object-relational is no better)
Hadoop excels at unstructured data, but due to its batch-oriented nature it is inefficient for real-time analytics as well as intra-data analysis
A “pure-object” database is seen as the best choice
The Data Rods Project
The “Data Rods” project has created a high-speed, scalable database structure for rapid retrieval, filtering, and analysis of massive data sets.
We’ll cover the following:
• Database design
• Status on development
• Basic analysis examples and performance
• Planned analysis and potential applications
Database Design
Gridded data is key.
For consistency, NSIDC's Equal-Area Scalable Earth Grids (EASE-Grids) tool is used.
Common resolutions between data sets (1 km, 5 km, etc.) and point data
The nesting relationship of differing resolutions in EASE-Grid
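The nesting idea can be illustrated with a minimal sketch. It assumes idealized grids that share an origin and whose cell sizes differ by an integer factor (for example 25 km over 5 km). The function names are hypothetical, and this is not the actual EASE-Grid projection math.

```python
def parent_cell(row, col, factor):
    """Map a fine-grid cell to the coarse-grid cell that contains it."""
    return row // factor, col // factor


def child_cells(row, col, factor):
    """List the fine-grid cells nested inside one coarse-grid cell."""
    return [
        (row * factor + dr, col * factor + dc)
        for dr in range(factor)
        for dc in range(factor)
    ]


# Assumed example: a 25 km grid nested over a 5 km grid (factor 5).
FACTOR_25KM_TO_5KM = 25 // 5
coarse = parent_cell(12, 7, FACTOR_25KM_TO_5KM)
children = child_cells(*coarse, FACTOR_25KM_TO_5KM)
```

Under these assumptions, each coarse cell holds factor × factor fine cells, so moving between resolutions is integer arithmetic on row and column indices.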
Data Rods Concept
[Figure: data rods concept; axes: X coordinate, Y coordinate, Time]
Database Systems Development
[Figure: system architecture diagram.
Section labels: Data Input, Object Database Design, Cryospheric Change Analysis, User Input.
Input sources: Passive Microwave, Visual Infrared, Active Microwave, Radar, Other.
Processing boxes: EASE-Grid Processing, Object Database Loading, Data Rod Updating, Object Interface, Pixel Grid Sampling, Data Rod Objects (axes: X coordinate, Y coordinate, Time).
Analysis boxes: User Interface, Basic Data Management (query & index), Pattern Search (input pattern or trend), Automated Pattern Discovery (anomaly detection, trend detection, cycle detection).]
Pure-Object Database
Object persistence/instantiation is directly to/from the database – no Java Spring or Hibernate needed
Not object-relational (examples of pure-object databases include Versant, ObjectDB, db4o, Objectivity)
Not as limited by size
Fast interactions across databases
Simple, efficient schema
Next: schema design
Object Database Schema
Each image pixel is an object
Data rods are time-series collections of pixels
Each data rod can be analyzed independently
Adjacency analysis by row/col or lat/lon
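The schema described above can be sketched in a few lines. This is only an illustration with plain in-memory Python objects; the class, field, and function names (`Pixel`, `DataRod`, `neighbors`) are assumptions, not the project's actual object database classes.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Pixel:
    """One image pixel: a time stamp plus its measured variables."""
    day: date
    values: dict  # e.g. {"albedo": 0.81, "cloud_mask": 0}


@dataclass
class DataRod:
    """Time-series collection of pixels at one grid cell."""
    row: int
    col: int
    pixels: list = field(default_factory=list)

    def series(self, name):
        """Return (day, value) pairs for one variable, oldest first."""
        ordered = sorted(self.pixels, key=lambda p: p.day)
        return [(p.day, p.values[name]) for p in ordered if name in p.values]


def neighbors(rods, row, col):
    """Adjacency by row/col: the rods in the surrounding grid cells."""
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    cells = [(row + dr, col + dc) for dr, dc in offsets]
    return [rods[cell] for cell in cells if cell in rods]


# A 3 x 3 patch of empty rods, with two pixels loaded into the center rod.
rods = {(r, c): DataRod(r, c) for r in range(3) for c in range(3)}
rods[(1, 1)].pixels.append(Pixel(date(1995, 1, 2), {"albedo": 0.80}))
rods[(1, 1)].pixels.append(Pixel(date(1995, 1, 1), {"albedo": 0.82}))
```

Because each rod carries its own pixels, a query on one rod touches nothing else, which is what lets each rod be analyzed independently.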
Database Creation
[Figure: stack of gridded images; axes: Longitude, Latitude, Time]
Gridded data sets
Standardized grid dimensions
Visualize as layers of imagery through time (days to decades)
Lends itself well to time-series analysis
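Viewing the data as layers of imagery through time, a data rod is the column of values at one grid cell. A minimal sketch follows, with plain Python lists standing in for gridded images; `extract_rod` is a hypothetical helper, not part of the project's software.

```python
def extract_rod(stack, row, col):
    """Pull the time series at one grid cell out of a stack of images.

    `stack` is a list of 2-D grids (one per time step), all the same shape.
    """
    return [image[row][col] for image in stack]


# Three tiny 2 x 2 "images" standing in for daily gridded scenes.
stack = [
    [[0.80, 0.10], [0.75, 0.12]],  # day 0
    [[0.82, 0.11], [0.74, 0.13]],  # day 1
    [[0.79, 0.09], [0.76, 0.50]],  # day 2
]
rod = extract_rod(stack, 1, 1)
```

This only works cleanly because every layer shares the same standardized grid dimensions, so a given (row, col) refers to the same place in every image.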
Status – Database Administration
5 AVHRR databases, each with 5 years of imagery (<100 GB each, administratively easier)
Surface mask databases for northern hemisphere at 5 km and 25 km
SSM/I database, 25 years of daily 25 km data at all frequencies and polarizations
Selected MODIS database at 250 m resolution
~600 GB total
No upper limit to database except disk space
AVHRR Database Creation
Initial demonstration region is Greenland
25 years of daily multi-spectral AVHRR data at 5 km resolution
9000+ images
2 billion+ pixel objects total
Each pixel object is independently accessible for query
Database Flexibility
Data can be spread across many databases
Transparent queries across databases
Methods (routines) can be attached to the data rods to add functionality such as statistical analysis
Data fusion: analyses may span multiple data types, resolutions, time spans
Data Rods supports NetCDF output
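Two of these points, queries that span databases and methods attached to rods, can be sketched together. Here each "database" is just a Python dict keyed by (row, col); the `Rod` class and `get_rod` function are hypothetical stand-ins and do not reflect the actual object database API.

```python
from statistics import mean


class Rod:
    """A data rod with an analysis method attached to it."""

    def __init__(self, samples):
        self.samples = sorted(samples)  # (time, value) pairs, oldest first

    def mean_value(self):
        """Attached routine: average of the rod's values."""
        return mean(value for _, value in self.samples)


def get_rod(databases, row, col):
    """Gather one cell's samples from every database that holds it."""
    samples = []
    for db in databases:
        samples.extend(db.get((row, col), []))
    return Rod(samples)


# Two stand-in databases covering different time spans of the same cell.
db_1995_1999 = {(4, 2): [(1995, 0.80), (1996, 0.82)]}
db_2000_2004 = {(4, 2): [(2000, 0.78)]}
rod = get_rod([db_2000_2004, db_1995_1999], 4, 2)
```

The caller asks for a cell and gets one time-ordered rod back, without needing to know which database held which years.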
Simple AVHRR Object Database Time Test
• Built a database using AVHRR 5 km data from 1995-1999
• 2 visible channels, 3 IR channels, 3 references plus albedo, skin temperature and cloud mask
• Database includes location class, time stamp class and metadata
• 213,000 data rods covering 5 years over Greenland
• 1 Data rod contains 1825 pixels
• Pixels = 388,725,000 each with 11 variables/pixel
• Variables = 4.2 billion coded short integer values
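The counts above follow from simple multiplication, worked through below. The sketch assumes one pixel per day with leap days ignored, which is what 1825 pixels per rod implies. The product comes to about 4.28 billion values, slightly above the quoted 4.2 billion; the quoted rod count may be rounded.

```python
rods = 213_000                       # data rods over Greenland
days = 5 * 365                       # one pixel per day for 5 years (leap days ignored)
variables_per_pixel = 2 + 3 + 3 + 3  # visible, IR, references, derived (albedo, skin temp, cloud mask)

pixels = rods * days
values = pixels * variables_per_pixel
print(days, pixels, values)  # 1825 388725000 4275975000
```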
Example Analysis Using Object Databases
• All queries run on a single processor, single thread
• Example #1: Queries and plots on single database
• Example #2: Queries and plots on multiple databases
• Example #3: Advanced Spatiotemporal Analysis
• We will move to multi-threaded, multiprocessor operation once we have the design finalized (this is a research project)
Using Single AVHRR Object Database Time Test
• Single processor under load
• 5-year plots returned in 2-10 seconds
• Cached data plots returned in ½ second
• Images in 10 seconds
Multi Data Rod Selection
• Seven locations selected across 5 years simultaneously
• Selected brightness temperature and albedo output
• Again, cached queries are much faster
Example Analysis of Greenland & 5 databases
Using five 5-year rods and statistics (1 min, or 5 s cached)
Image ref: Maurer, J. 2007. Atlas of the Cryosphere. Boulder, Colorado USA: National Snow and Ice Data Center. Digital media.
AVHRR albedo statistics, May average, 1981 – 2005
Camp Century: Mean: 0.801, Std. dev.: 0.077
Summit Station: Mean: 0.819, Std. dev.: 0.069
Swiss Camp: Mean: 0.817, Std. dev.: 0.070
GISP Ice Core Camp: Mean: 0.802, Std. dev.: 0.071
Temporal Analysis of Single Rods
Descriptive Statistical functions
Spatiotemporal data selection
Filtering by value
Anomaly detection
Also: image generation, inter-database data fusion
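Several of these single-rod operations can be sketched on one time series. The example below filters by value and then flags anomalies with a simple standard-deviation test. The fill value, the 2-sigma threshold, and the function names are assumptions for illustration; the slides do not say which anomaly test the project uses.

```python
from statistics import mean, stdev

FILL_VALUE = -9999  # assumed marker for missing data


def valid(series):
    """Filter by value: drop fill values from a rod's time series."""
    return [v for v in series if v != FILL_VALUE]


def anomalies(series, threshold=2.0):
    """Return (index, value) pairs more than `threshold` std. devs. from the mean."""
    clean = valid(series)
    mu, sigma = mean(clean), stdev(clean)
    if sigma == 0:
        return []
    return [
        (i, v) for i, v in enumerate(series)
        if v != FILL_VALUE and abs(v - mu) > threshold * sigma
    ]


# One rod's albedo series: mostly bright ice, one gap, one dark outlier.
albedo = [0.80, 0.81, 0.79, FILL_VALUE, 0.80, 0.82, 0.78, 0.80, 0.81, 0.30]
flagged = anomalies(albedo)
```

Here only the final value (0.30) is flagged; the fill value is excluded from both the statistics and the results.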
Broad Spatiotemporal Analysis (This took some time)
• Statistical analysis repeated at every grid cell.
• Intersection of surface mask database and AVHRR database: only pixels on the ice sheet were processed.
• Bad data filtered out.
• Multivariate: cloud mask used to exclude cloudy pixels from albedo averages.
• All 2 billion objects queried and analyzed
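The steps listed above can be sketched as a per-cell loop: intersect with the surface mask, drop bad data, exclude cloudy time steps, then average. Dicts keyed by (row, col) stand in for the databases, and the names and the `None` bad-data marker are assumptions for illustration.

```python
from statistics import mean

BAD = None  # assumed marker for bad or missing albedo values


def clear_sky_mean(albedo, cloudy):
    """Average a rod's albedo over cloud-free, valid time steps."""
    kept = [a for a, c in zip(albedo, cloudy) if a is not BAD and not c]
    return mean(kept) if kept else None


def ice_sheet_means(albedo_rods, cloud_rods, ice_mask):
    """Repeat the statistic at every grid cell flagged as ice sheet."""
    return {
        cell: clear_sky_mean(albedo_rods[cell], cloud_rods[cell])
        for cell, on_ice in ice_mask.items()
        if on_ice and cell in albedo_rods
    }


# Two cells: one on the ice sheet, one off it (skipped entirely).
ice_mask = {(0, 0): True, (0, 1): False}
albedo_rods = {(0, 0): [0.80, 0.95, BAD, 0.82], (0, 1): [0.10, 0.12, 0.11, 0.10]}
cloud_rods = {(0, 0): [False, True, False, False], (0, 1): [False] * 4}
result = ice_sheet_means(albedo_rods, cloud_rods, ice_mask)
```

For the ice-sheet cell, the cloudy 0.95 and the bad value are both excluded, leaving the mean of 0.80 and 0.82.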
Analysis Example: Sea Ice Temporal Query
We would like to remove clouds from the image. Clouds move faster than ice, so the minimum albedo over a short time window reveals open water.
Moving 8-day window through datarod
Minimum albedo in temporal window
[Figure: datarod time series of pixels, with an 8-day window spanning t1 to t8]
Pseudocode example query:
datarod = database.getDatarod(row,col)
albedo = datarod.getMinAlbedo(t,t+7)
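A runnable sketch of the same idea follows, using a plain Python list in place of a datarod. `min_albedo` mirrors the pseudocode's `getMinAlbedo(t, t+7)` with an inclusive end index; both function names are assumptions here, not the project's API.

```python
def min_albedo(rod, start, end):
    """Minimum albedo over time steps start..end inclusive."""
    return min(rod[start:end + 1])


def moving_min(rod, window=8):
    """Slide a window through the rod, keeping each window's minimum."""
    return [min_albedo(rod, t, t + window - 1) for t in range(len(rod) - window + 1)]


# Ten days at one open-water cell: clouds raise the albedo on most days.
rod = [0.60, 0.55, 0.08, 0.70, 0.65, 0.07, 0.50, 0.62, 0.58, 0.45]
composite = moving_min(rod)
```

As long as at least one day in each 8-day window is cloud-free, the minimum picks out the dark open-water value rather than the bright cloud tops.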
Analysis result: Sea Ice Detection
Technique for removing clouds from the image
Composite image created from Data Rods’ time series
Lowest AVHRR albedo over an 8-day period
Remaining objective: exclude lingering clouds
One of the Original images
Analysis Potential: Rapid Data Fusion
Loss of AMSR-E decreases sea ice detection capability
Data Rods AVHRR/SSM/I product fusion may fill the gap
Can be validated against AMSR-E sea ice record.
[Figure: three image panels]
AVHRR 8-day: high-resolution sea ice detection, still some clouds
+ SSM/I: cloud free with good sea ice detection, but low resolution
= Fused product: high-res sea ice extent, no clouds
Performing this lake detection analysis conventionally took 6 months (downloading & gridding & image analysis)
With Data Rods, the analysis was done in 2 days (single thread, single processor)
What’s Next: Ongoing Efforts
Newest version of ODB software has multi-threaded capability, to take advantage of multiprocessor machines and reduce query times
Investigating data rod performance on the Janus supercomputer with pan-Arctic extent
User interface to Data Rod database
Creating 1000s of Databases for Use with Massive Parallel Machines
• Each database is small enough to be held in memory for each CPU (uses MPI calls)
• Each database covers 5° × 5° × 25 years of Data Rods
• Each database is capped (fixed for minimal changes)
• Changes are added to the present year database for each 5° × 5°
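Splitting the data into many small databases means each query must first find the database that holds a given location. A minimal sketch follows, assuming tiles of 5° by 5° in latitude and longitude; the naming scheme and function names are hypothetical, and the slides do not describe how the project actually keys its databases.

```python
import math

TILE_DEGREES = 5  # assumed tile size: 5 degrees in latitude and longitude


def tile_key(lat, lon):
    """Return the south-west corner (lat, lon) of the tile holding a point."""
    south = math.floor(lat / TILE_DEGREES) * TILE_DEGREES
    west = math.floor(lon / TILE_DEGREES) * TILE_DEGREES
    return south, west


def database_name(lat, lon):
    """Hypothetical naming scheme: one database per tile."""
    south, west = tile_key(lat, lon)
    return f"rods_lat{south:+03d}_lon{west:+04d}"


# Approximate coordinates of Summit Station, Greenland.
summit = database_name(72.58, -38.46)
```

With a scheme like this, each process on a parallel machine could load only the tiles it needs.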
Creating 1000s of Databases for Use with Massive Parallel Machines
• With this database it should be possible to perform analysis at Internet speeds
• Multi-sensor analysis is relatively simple
• We are starting the database loading now
• 100TB database testing will occur over the summer
Summary
We can now perform high-speed time-series analysis on the server-side without downloads
Scalable, massive remote sensing databases
Accelerated analysis compared to traditional “search, order and transmission” methods
Interactions across data sets – data fusion
Developing UI and additional analysis tools
Allow users interactive access to the data
NSIDC Data Rods Project
The Data Rods project is funded by the National Science Foundation through grant: ARC 0941442
Thank You
Interested in testing Data Rods? Please contact us at: [email protected]