Indexing Scientific Data With FastBit


Upload: ilar

Post on 04-Jan-2016


Page 1: Indexing Scientific Data With FastBit

SDM Center
Indexing Scientific Data With FastBit

• Motivating Examples
  • Find the collision events with the most distinct signature of Quark Gluon Plasma
  • Find the ignition kernels in a combustion simulation
  • Track a layer of exploding supernova
• These are not typical database searches:
  • Large high-dimensional data sets (1000 time steps × 1000 × 1000 × 1000 cells × 100 variables)
  • Most data records never modified, i.e., append-only data
  • Multi-dimensional queries: 500 < Temp < 1000 && CH3 > 10^-4 && …
  • Large answers (hit thousands or millions of records)
  • Seek collective features, e.g., regions of interest, not average and sum operations
• New searching technology needed

Page 2: Indexing Scientific Data With FastBit

A Good Candidate: Bitmap Index

• First commercial version
  • Model 204, P. O’Neil, 1987
• Takes less time to build than B-trees
• Efficient for querying: only bitwise logical operations
  • A < 2: b0 OR b1
  • A > 2: b3 OR b4 OR b5
• Efficient for multi-dimensional queries
  • Use bitwise operations to combine the partial results
• Size may be large: one bit per distinct value per row
  • Definition: cardinality == number of distinct values
  • Compact for low-cardinality attributes, say, cardinality < 100
  • Worst case: cardinality = N, the number of rows; index size: N*N bits

  Data     =0  =1  =2  =3  =4  =5
  values   b0  b1  b2  b3  b4  b5
    0       1   0   0   0   0   0
    1       0   1   0   0   0   0
    5       0   0   0   0   0   1
    3       0   0   0   1   0   0
    1       0   1   0   0   0   0
    2       0   0   1   0   0   0
    0       1   0   0   0   0   0
    4       0   0   0   0   1   0
    1       0   1   0   0   0   0

  A < 2: b0, b1        2 < A: b3, b4, b5

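The bitmap index on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not FastBit code: Python integers stand in for uncompressed bitmaps, the helper names (`build_bitmap_index`, `query`, `rows`) are made up for this sketch, and the second attribute `B` is hypothetical data added to show a multi-dimensional query.

```python
def build_bitmap_index(values):
    """Return {distinct value: bitmap}; bit i of a bitmap is set when row i holds that value."""
    index = {}
    for row, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row)
    return index


def query(index, predicate):
    """OR together the bitmaps of every distinct value that satisfies the predicate."""
    result = 0
    for v, bitmap in index.items():
        if predicate(v):
            result |= bitmap
    return result


def rows(bitmap):
    """List the (0-based) row numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


A = [0, 1, 5, 3, 1, 2, 0, 4, 1]   # data values from the slide
B = [7, 7, 8, 9, 7, 8, 9, 9, 8]   # hypothetical second attribute (not in the slide)

index_a = build_bitmap_index(A)
index_b = build_bitmap_index(B)

a_lt_2 = query(index_a, lambda v: v < 2)   # b0 OR b1
a_gt_2 = query(index_a, lambda v: v > 2)   # b3 OR b4 OR b5
# Multi-dimensional query: combine the partial results with a bitwise AND.
both = a_lt_2 & query(index_b, lambda v: v == 7)

print(rows(a_lt_2))  # [0, 1, 4, 6, 8]
print(rows(a_gt_2))  # [2, 3, 7]
print(rows(both))    # [0, 1, 4]
```

The index holds one bitmap per distinct value, so `len(index_a)` is the cardinality (6 here), and the uncompressed size is cardinality × N bits.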

Page 3: Indexing Scientific Data With FastBit

Compression Makes It Better

Example: 2015 bits

10000000000000000000011100000000000000000000000000000……………….00000000000000000000000000000001111111111111111111111111

Main Idea: Use run-length encoding, but partition the bits into 31-bit groups (not 32-bit) on 32-bit machines

  31 bits | 31 bits | … (62 groups skipped) … | 31 bits

Merge neighboring groups with identical bits

  31 bits | count=63 (31-bit groups) | 31 bits

Encode each group using one 32-bit word

  0, 31 literal bits | 1 0, count=63 | 0, 31 literal bits

• Name: Word-Aligned Hybrid (WAH) code
• Key features:
  • Compressed indices typically 30% of raw data
  • 10X faster in answering queries than the most competitive bitmap index
  • Worst case index size 4N words, not N*N
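The encoding steps above can be sketched as follows. This is a simplified illustration, not FastBit's implementation: it assumes the input length is a whole number of 31-bit groups (a real encoder also handles a trailing partial group), and the names `wah_encode`, `wah_decode`, and `COUNT_MASK` are invented for the sketch. A literal word has a leading 0 followed by 31 literal bits; a fill word has a leading 1, then the fill bit, then a 30-bit count of merged groups.

```python
GROUP = 31                   # bits per group on a 32-bit machine
COUNT_MASK = (1 << 30) - 1   # low 30 bits of a fill word hold the group count


def wah_encode(bits):
    """Encode a string of '0'/'1' characters into a list of 32-bit WAH words."""
    if len(bits) % GROUP:
        raise ValueError("this sketch only handles whole 31-bit groups")
    words = []
    for start in range(0, len(bits), GROUP):
        group = bits[start:start + GROUP]
        if group == "0" * GROUP or group == "1" * GROUP:
            header = (1 << 31) | (int(group[0]) << 30)
            prev = words[-1] if words else 0
            # Merge with the previous word if it is a fill of the same bit value.
            if (prev >> 30) == (header >> 30) and (prev & COUNT_MASK) < COUNT_MASK:
                words[-1] = prev + 1
            else:
                words.append(header | 1)
        else:
            words.append(int(group, 2))  # literal word: leading bit stays 0
    return words


def wah_decode(words):
    """Expand WAH words back into the original bit string."""
    out = []
    for w in words:
        if w >> 31:  # fill word
            out.append(str((w >> 30) & 1) * (GROUP * (w & COUNT_MASK)))
        else:        # literal word
            out.append(format(w, "031b"))
    return "".join(out)


# A 2015-bit example shaped like the one on the slide:
# one literal group, 63 all-zero groups, one literal group.
first = "1" + "0" * 20 + "111" + "0" * 7
last = "0" * 6 + "1" * 25
bits = first + "0" * (63 * GROUP) + last

words = wah_encode(bits)
print(len(bits), len(words))            # 2015 3
print([format(w, "08x") for w in words])  # ['40000380', '8000003f', '01ffffff']
```

Because every word is exactly 32 bits, bitwise operations on two compressed bitmaps can proceed word by word without unpacking individual bits, which is the point of the word alignment.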

Page 4: Indexing Scientific Data With FastBit

Handling Collective Features: Regions of Interest

• FastBit has been used in
  • GridCollector for High-Energy Physics Experiment STAR
  • Dexterous Data Explorer (DEX) for query-driven visualization
  • Dynamic histogramming for network traffic analysis
• On the right is an illustration of our region-growing approach

[Figure: workflow diagram with the labels Data, Index (FastBit), Query, Region Growing, Region Tracking]

[Figure: 2-D connected regions identified with line segments (in green). Line segments come out of FastBit compressed bitmaps.]
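One way to picture region growing from line segments is the sketch below. It is an assumption-laden illustration, not the algorithm FastBit uses: the query result is taken to be a small 2-D 0/1 mask, each run of consecutive 1s in a row plays the role of a line segment, and segments in adjacent rows whose column ranges overlap are merged with a union-find structure. The names `row_segments` and `connected_regions` are hypothetical.

```python
def row_segments(mask):
    """Return (row, start, end) runs of consecutive 1s, with end exclusive."""
    segments = []
    for r, row in enumerate(mask):
        c = 0
        while c < len(row):
            if row[c]:
                start = c
                while c < len(row) and row[c]:
                    c += 1
                segments.append((r, start, c))
            else:
                c += 1
    return segments


def connected_regions(mask):
    """Group line segments into 4-connected regions using union-find."""
    segs = row_segments(mask)
    parent = list(range(len(segs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    by_row = {}
    for i, (r, _, _) in enumerate(segs):
        by_row.setdefault(r, []).append(i)

    for i, (r, s, e) in enumerate(segs):
        for j in by_row.get(r - 1, []):
            _, s2, e2 = segs[j]
            if s < e2 and s2 < e:  # column ranges overlap, so the segments touch
                parent[find(i)] = find(j)

    regions = {}
    for i, seg in enumerate(segs):
        regions.setdefault(find(i), []).append(seg)
    return list(regions.values())


# Hypothetical query result on a 4 x 5 regular mesh.
mask = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0],
]

regions = connected_regions(mask)
for region in regions:
    print(region)
```

Working on segments rather than individual cells keeps the cost proportional to the number of runs, which is why runs read straight from a run-length-compressed bitmap are a convenient starting point.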

Page 5: Indexing Scientific Data With FastBit

Future Plans

• Software development
  • Release FastBit under LGPL (John, March ’07)
  • FastBit integration with ROOT (John, Sept ’07)
  • FastBit integration with HDF5 for Particle Physics (Kurt)
• Finding Regions of Interest
  • Existing work only dealt with data on regular meshes
  • Working on extensions to AMR mesh (Kurt), GTC mesh (John), and tetrahedral mesh (Rishi)
• New features (research)
  • Parallel version
  • Table groups / partitions
  • Range join