Building a Scientific Data Warehouse Supporting Petascale Science and Data Mining

XLDB 2012
Clark Gaylord, Chief Information Officer, Virginia Tech Transportation Institute
[email protected]
11 September 2012



Background

VTTI was established in August 1988 by agreement between US DOT and the University Transportation Centers Program

• Largest university-level research center at Virginia Tech
  – Approximately 300 faculty, staff, and students working on over 150 projects
  – $80 million awarded
  – Approximately $30 million in annual expenditures
  – Largest supporter of both undergraduate and graduate students

11 September 2012 CKG – Building Data Warehouse 2

Unique Facilities

• Instrumented Vehicles
• The Virginia Smart Road


The Virginia Smart Road

• Advanced Control Room

• Weather capabilities

• Variable Lighting Systems

• Pavement Testing


VTTI Naturalistic Driving Research


Empirical Data Collection

• Proactive
• Provides important ordinal crash risk info
• Imprecise, relies on unproven safety surrogates
• Experimental situations modify driver behavior

Epidemiological Data Collection

• Reactive
• Very limited pre-crash information

Large-Scale Naturalistic Data Collection

• Precise knowledge about crash risk
• Information about important circumstances and scenarios that lead to crashes
• “Natural” driver behavior in full driving context
• Detailed pre-crash/crash info including driver performance/behavior, driver error, and vehicle kinematics
• Can utilize combination of crash, near-crash, and other safety surrogate data

Naturalistic Method

• Study participants use an instrumented vehicle for an extended period (e.g., several months to two years)

• Able to get detailed pre-crash/crash information along with routine driving behaviors

• Highly capable data acquisition

• Able to collect crash precursor data and driver performance/behavior data using sensors and video cameras


SHRP2 Naturalistic Driving Study

• Strategic Highway Research Program

• Funded through the Transportation Research Board of the National Academies

• Large-scale nationwide naturalistic driving study

– Six regional centers


SHRP2 Sites


Scale of SHRP2

100-Car

• 150 vehicle-years (100 vehicles, 18 months)
• 43,000 hours
• 2,000,000 miles
• 6 TB total storage – 94% video
• 700 GB sensor database
• More constrained by instrumentation

SHRP2

• 4,000 vehicle-years (3,000 distinct vehicles, 2,000 at a time for two years)
• 2,000,000 hours
• 60,000,000 miles
• 1.5 PB total storage – 85% video
• 250 TB sensor data
• ~400 a priori research questions
• 20-30 year life cycle for research, data mining
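The headline figures above cross-check with some quick arithmetic; a small sanity-check sketch (all inputs are the slide numbers above; decimal storage units are assumed):

```python
# Cross-check the SHRP2 scale figures quoted above.
PB = 1000 ** 5  # decimal (vendor-style) storage units assumed
TB = 1000 ** 4

hours = 2_000_000
miles = 60_000_000
vehicle_years = 4_000
total_storage = 1.5 * PB
video_fraction = 0.85

avg_speed = miles / hours                        # 30 mph average speed
hours_per_vehicle_year = hours / vehicle_years   # 500 driving hours per vehicle-year
non_video_tb = total_storage * (1 - video_fraction) / TB
# ~225 TB non-video, consistent with the ~250 TB sensor-data figure

print(avg_speed, hours_per_vehicle_year, round(non_video_tb))
```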


SHRP2 Data Gathering

• Real time health check

• Automatic crash reports

• Bulk data are harvested every few months

• 1-2 TB/day in total

• Sites send data from regional centers via Internet2


Future Naturalistic Studies

• Many more studies coming
  – Commercial vehicles
• None (yet) planned as large as SHRP2
• Some more “epoch-based”


Data visualization


Experiences analyzing data

• Data analysis for VTTI’s legacy naturalistic studies has focused on individual trip data files (both sensor and video)

• Identification of events from sensor algorithms, coupled with effort-intensive data reduction and annotation

• Difficult or expensive to scale this method to larger studies

• Analysis methods and infrastructure were not suitable for larger-scale data mining
  – Very useful for “case study” work (e.g. crash investigation, random samples)

• Some success extracting data to a database for mining


Application support

• Desktop and cluster: Matlab, R
  – SAS only on desktop, mostly due to licensing cost on cluster

• Cluster: python, shell for data ingestion and other utility tasks

• Custom Windows applications for visualization on desktop

• Legacy Windows computational/simulation software


Typical Analysis Workflow

• Researcher tries to pull all data into Matlab (or R)

• Researcher eventually learns some things can be expressed better in SQL

• Researcher finds out not everything performs well in SQL

• Researcher pulls all data into Matlab (or R)


Data types

• Sensor data:

– Time series data

– Not on same unified time mesh

• Compressed Video (h.264)

• Geospatial data (GIS/SQL)

• Other sources


Data structures

• Legacy studies were “big rectangle” synchronized measurements

• SHRP2 and other current studies follow more of an “AV-pair” pattern:
  – File_ID
  – Timestamp
  – VariableID
  – Value
  – Status

• A VariableID may have one observation per file, or several thousand observations per file over time
  – Commonly 10 Hz, 20 Hz, 1 Hz, or interrupt-driven
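The AV-pair pattern maps directly onto a single narrow table; a minimal sketch in SQLite (table and column names are illustrative, not the production schema):

```python
import sqlite3

# Narrow "AV-pair" layout: one row per (file, variable, timestamp) observation.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE collected_data (
        file_id     INTEGER NOT NULL,
        timestamp   INTEGER NOT NULL,   -- ms offset within the trip file
        variable_id INTEGER NOT NULL,
        value       REAL,
        status      INTEGER             -- QA/sanity flag
    )""")
# Index matching the common access path: one variable's series in one file.
con.execute(
    "CREATE INDEX ix_file_var ON collected_data (file_id, variable_id, timestamp)")

# A 10 Hz variable contributes one row per 100 ms tick.
rows = [(1, t * 100, -396, 0.09 + 0.001 * t, 0) for t in range(5)]
con.executemany("INSERT INTO collected_data VALUES (?, ?, ?, ?, ?)", rows)

series = con.execute(
    "SELECT timestamp, value FROM collected_data"
    " WHERE file_id = ? AND variable_id = ? ORDER BY timestamp",
    (1, -396),
).fetchall()
print(series[0])  # (0, 0.09)
```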


Why database?

• Performance and scalability
  – With 100Car, 200,000 files were collected; computations routinely took weeks to perform
  – With the 100Car data in a database, this could be done in under an hour

• Common interface (JDBC/ODBC) supports many tools
• Expressive semantics, accessibility of SQL
• Maturity of technology
• Good support for indexing and partitioning
• Natural metadata
• Typed data – not just strings and AV pairs
• Not so much referential integrity, etc.


File-oriented approach?

• File-oriented technologies, e.g. Hadoop, have promise but need further investigation and feasibility/proof-of-concept work

• Not optimized for computationally intensive environments, floating-point algorithms

• Less mature, accessible, and ubiquitous than SQL/databases

• Potentially ultimately more scalable or cost-effective
• Lower software licensing costs
  – Open-source databases, e.g. PostgreSQL, are also an option

• Perhaps a 3-5 year horizon?


Is Hadoop free?


Schema for Instrumentation Data

• Collected data have variables:
  – File ID, variable ID, timestamp, data value, sanity
  – Up to about twenty tables have this structure

• A separate table exists for each data value type:
  – Integers: short/int/long
  – Floats: real/double
  – String

• Each of these has different tables for:
  – “hot” vs “cold”
  – “low-frequency” – this reflects a specific DB2-ism

• Plus separate tables for “PII” data
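Splitting one logical stream across per-type, per-tier tables amounts to routing each observation by its value type; a hypothetical sketch (the table names and the hot/cold default are invented for illustration):

```python
# Route an observation to a type- and tier-specific table, mirroring the
# int/float/string and "hot"/"cold" split described above.
def target_table(value, tier="hot"):
    if isinstance(value, bool):  # bool subclasses int; reject it explicitly
        raise TypeError("unsupported value type")
    for pytype, name in ((int, "int"), (float, "float"), (str, "string")):
        if isinstance(value, pytype):
            return f"collected_data_{name}_{tier}"
    raise TypeError("unsupported value type")

print(target_table(42))              # collected_data_int_hot
print(target_table(0.0928, "cold"))  # collected_data_float_cold
```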


(Simplified) Schema

• Collected data
  – File_ID
  – Timestamp
  – VariableID
  – Data (Float / Int / String)
  – QA Status

• Many of these, by type, tier, index type

• Metadata
  – VariableID
  – Module name
  – Variable name
  – Units

• File_Info
  – File_ID
  – Datafile_ID
  – Filetype (Video/Audio)
  – Filename


Collected data


Various metadata

FILE_INFO

FILE_ID BIGINT

FILE_GROUP_ID BIGINT

FILE_TYPE_ID INTEGER

DATA_FILE_ID BIGINT

FILE_NAME VARCHAR(512)

FILE_PATH VARCHAR(512)

FILE_HASH_VALUE VARCHAR(512)

HASH_TYPE VARCHAR(512)

KEY_INITIALIZATION_VECTOR VARCHAR(512)

ENCRYPTION_KEY VARCHAR(512)

KEY_TYPE VARCHAR(512)

INSERTTIME TIMESTAMP

LASTUPDATETIME TIMESTAMP

FILE_SIZE BIGINT

DATA_FILE_EXTRA_INFORMATION

FILE_ID BIGINT

MINIMUM_TIME BIGINT

MAXIMUM_TIME BIGINT

ACQUISITION_BOARD_BOARD_ID DOUBLE

ACQUISITION_BOARD_STRING_ID VARCHAR(8000)

STORAGE_BOARD_BOARD_ID DOUBLE

STORAGE_BOARD_STRING_ID VARCHAR(8000)

INSERTTIME TIMESTAMP

LASTUPDATETIME TIMESTAMP

FILE_GROUP

FILE_GROUP_ID BIGINT

FILE_NAME_BASE VARCHAR(512)

HDD_SERIAL VARCHAR(512)

COPY_DATE_TIME TIMESTAMP

FILE_HEADERS

FILE_ID BIGINT

HEADER XML

HEADERSOURCEID SMALLINT

INSERTTIME TIMESTAMP

LASTUPDATETIME TIMESTAMP

METADATA

MODULENAME VARCHAR(128)

VARIABLENAME VARCHAR(128)

VARIABLEID INTEGER

TABLENAME VARCHAR(128)

COLUMNNAME VARCHAR(128)

COLLECTEDFREQUENCY DOUBLE

ISCOLLECTED SMALLINT

ISDEMUXED SMALLINT

ISCOMPUTED SMALLINT

ISSTANDARD SMALLINT

UNITS VARCHAR(128)

CLASS VARCHAR(8)

SOLTYPE VARCHAR(16)

FINAL_TABLE VARCHAR(128)

SUMMARY_INFO

FILE_ID BIGINT

VEHICLE_MANAGEMENT_ID INTEGER

PARTICIPANT_ID INTEGER

LOCATION_CODE VARCHAR(4)

COLLECTED_DATE_TIME TIMESTAMP

COLLECTION_MODE VARCHAR(25)

COLLECTION_PHASE VARCHAR(50)

VIDEO_FILE_EXTRA_INFORMATION

FILE_ID INTEGER

DEGREESROTATION SMALLINT

INSERTTIME TIMESTAMP

LASTUPDATETIME TIMESTAMP

OFFSET INTEGER

ALIGNMENT_VARIABLE VARCHAR(128)


Sample data

FILE_ID    STATUS  VARIABLEID  TIMESTAMP  DATA
---------  ------  ----------  ---------  ------
1,895,896  0       -396        561,198    0.0928
1,895,896  0       -396        561,299    0.0986
1,895,896  0       -396        561,398    0.1015
1,895,896  0       -396        561,499    0.1073
1,895,896  0       -396        561,598    0.1131
1,895,896  0       -396        561,699    0.1102
1,895,896  0       -396        561,798    0.1073
1,895,896  0       -396        561,899    0.1131
1,895,896  0       -396        561,998    0.116
1,895,896  0       -396        562,099    0.1247
1,895,896  0       -396        562,198    0.1305
1,895,896  0       -396        562,299    0.1276
1,895,896  0       -396        562,398    0.1247
1,895,896  0       -396        562,499    0.1247
1,895,896  0       -396        562,598    0.1276

[Entire file’s x_accel takes < 0.5 second to query.]


Sample summary query

FILE_ID  VARIABLEID  MODULENAME  VARIABLENAME  COUNT_DATA  AVERAGE_DATA
-------  ----------  ----------  ------------  ----------  ------------
810      -396        IMU         Accel_X       636         -0.0417
810      -397        IMU         Accel_Y       636         -0.0141
810      -398        IMU         Accel_Z       636         -0.9831
811      -396        IMU         Accel_X       2,903       -0.0338
811      -397        IMU         Accel_Y       2,903       -0.0091
811      -398        IMU         Accel_Z       2,903       -0.9857
822      -396        IMU         Accel_X       54          -0.0276
822      -397        IMU         Accel_Y       54          -0.0056
822      -398        IMU         Accel_Z       54          -0.9869
831      -396        IMU         Accel_X       81          -0.0265
831      -397        IMU         Accel_Y       81          -0.0051
831      -398        IMU         Accel_Z       81          -0.9871
838      -396        IMU         Accel_X       6,363       -0.0045
838      -397        IMU         Accel_Y       6,363       0.0018
838      -398        IMU         Accel_Z       6,363       -0.9903
857      -396        IMU         Accel_X       10,928      -0.0240
857      -397        IMU         Accel_Y       10,928      -0.0168
857      -398        IMU         Accel_Z       10,928      -0.9872
859      -396        IMU         Accel_X       403         0.0066
859      -397        IMU         Accel_Y       403         0.0127
859      -398        IMU         Accel_Z       403         -0.9870
862      -396        IMU         Accel_X       413         -0.0547
862      -397        IMU         Accel_Y       413         -0.0312
862      -398        IMU         Accel_Z       413         -0.9841

... [for over 6,000 file_ids] Query returns in less than 20 seconds
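A summary like the one above is a straightforward aggregate joined to the metadata table; sketched here in SQLite with a few made-up rows (names follow the simplified schema earlier, not the production DDL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE collected_data (
        file_id INTEGER, timestamp INTEGER,
        variable_id INTEGER, value REAL, status INTEGER);
    CREATE TABLE metadata (
        variable_id INTEGER, module_name TEXT, variable_name TEXT);
    INSERT INTO metadata VALUES (-396, 'IMU', 'Accel_X'), (-398, 'IMU', 'Accel_Z');
""")
rows = [(810, t, -396, -0.04, 0) for t in range(3)] + \
       [(810, t, -398, -0.98, 0) for t in range(3)]
con.executemany("INSERT INTO collected_data VALUES (?, ?, ?, ?, ?)", rows)

# Per-file, per-variable counts and means, as in the sample summary above.
summary = con.execute("""
    SELECT d.file_id, d.variable_id, m.module_name, m.variable_name,
           COUNT(d.value) AS count_data, AVG(d.value) AS average_data
    FROM collected_data d
    JOIN metadata m ON m.variable_id = d.variable_id
    GROUP BY d.file_id, d.variable_id
    ORDER BY d.file_id, d.variable_id
""").fetchall()
print(summary)
```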


Approach to Data Center Design

• Technical and performance specs
• Balance cost with performance and availability
• Focus on more mature technology
  – While still needing to push the state of the art

• Matlab/R/SAS researchers can add SQL to their skill set
  – Not so much Java/C++

• Other (non-programmer) analysts need visualization tools

• Systems programmers can use python, Java, C++


High Performance?

• What do we mean by “high performance”?

– Actually we do “high throughput”…

• Computational and communication resources that are beyond those normally achievable by individual desktop workstations or stand-alone servers in typical enterprise environments.


Infrastructure to support data-intensive science

• Large (parallel) file system

– Especially for unstructured data

• Hierarchical storage

• Compute cluster

• Distributed workflow

• Structured data warehouse

– Parallel database using PostgreSQL, DB2, …

28 Apr 2011 CKG – Scientific Data Management 29

VTTI Smart Data Center Infrastructure

[Architecture diagram; recoverable details:]

• VTTI storage array: 1 PB, GPFS, 10GigE
• VTTI compute cluster: 48 nodes (12 x 4) Dell C6100, 4 CPUs x 6 cores (24 cores) each; Dell R710 Linux head nodes; Platform PCM hybrid Linux/Windows; 10GigE intra-cluster fabric
• VTTI Scientific Data Warehouse: InfoSphere (DB2), ~400 TB, IBM/SGI
  – InfoSphere DW head nodes (active and standby): IBM x3650, 128 GB RAM, IBM DS3400 12 x 450 GB SAS
  – InfoSphere DW ETL node (active): IBM x3650, 128 GB RAM, IBM DS3400 12 x 450 GB SAS
  – DB2 DW workers (plus a standby): 8 partitions/worker, 20 TB/partition, IBM DS4800 SATA
  – Dell R710 SQL Server replication
• SGI DMF “Archive”: 5+ PB disk/tape
• Networking: VT data center 1G LAN; HPC 10G/1G LAN; researchers connect via the Internet

Data warehouse building block

[Diagram of one warehouse building block:]

• IBM x3650, 8 cores, 128 GB RAM, attached via Fibre Channel and SAS to:
  – DS3400 4 TB SAS (x2) with EXP3000 4 TB SAS expansions (x2)
  – DS3512 30 TB NLSAS (x3) with EXP3512 30 TB NLSAS expansion
  – DS3400 9 TB SATA with EXP3000 9 TB SATA expansion

Future analytics

• Facial recognition

– Driver identification

– Eye glance

• Machine vision and video-based analytics

• Cross-correlation (non-synchronous) time series

• Dimension reduction

• Outlier detection
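Because the sensor streams are not on a unified time mesh, cross-correlating two variables first requires resampling both onto a common grid; a stdlib-only sketch using linear interpolation (illustrative only, not the actual VTTI pipeline):

```python
import bisect
import math

def interp(ts, xs, t):
    """Linearly interpolate the series (ts, xs) at time t; ts must be sorted."""
    i = bisect.bisect_left(ts, t)
    if i == 0:
        return xs[0]
    if i == len(ts):
        return xs[-1]
    w = (t - ts[i - 1]) / (ts[i] - ts[i - 1])
    return xs[i - 1] * (1 - w) + xs[i] * w

def correlate(a_ts, a_xs, b_ts, b_xs, step=100):
    """Pearson correlation after resampling both series onto a common mesh."""
    lo, hi = max(a_ts[0], b_ts[0]), min(a_ts[-1], b_ts[-1])
    mesh = range(lo, hi + 1, step)   # e.g. a 10 Hz mesh if timestamps are in ms
    a = [interp(a_ts, a_xs, t) for t in mesh]
    b = [interp(b_ts, b_xs, t) for t in mesh]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Two irregularly sampled views of the same underlying ramp signal.
a_ts, a_xs = [0, 95, 210, 330, 400], [0.0, 0.95, 2.1, 3.3, 4.0]
b_ts, b_xs = [10, 150, 260, 390], [0.1, 1.5, 2.6, 3.9]
print(round(correlate(a_ts, a_xs, b_ts, b_xs), 3))  # 1.0
```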
