SDM Center: Indexing Scientific Data With FastBit
• Motivating examples
  • Find the collision events with the most distinct signature of Quark Gluon Plasma
  • Find the ignition kernels in a combustion simulation
  • Track a layer of an exploding supernova
• These are not typical database searches:
  • Large high-dimensional data sets (1000 time steps × 1000 × 1000 × 1000 cells × 100 variables)
  • Most data records are never modified, i.e., append-only data
  • Multi-dimensional queries: 500 < Temp < 1000 && CH3 > 10^-4 && …
  • Large answers (hits on thousands or millions of records)
  • Seek collective features, e.g., regions of interest, not average and sum operations
• New searching technology needed
A Good Candidate: Bitmap Index
• First commercial version
  • Model 204, P. O’Neil, 1987
• Takes less time to build than B-trees
• Efficient for querying: only bitwise logical operations
  • A < 2 → b0 OR b1
  • A > 2 → b3 OR b4 OR b5
• Efficient for multi-dimensional queries
  • Use bitwise operations to combine the partial results
• Size may be large: one bit per distinct value per row
  • Definition: cardinality == number of distinct values
  • Compact for low-cardinality attributes, say, cardinality < 100
  • Worst case: cardinality = N, the number of rows; index size: N*N bits

Example bitmap index (one row per data value, one bitmap column per distinct value):

          |<- A < 2 ->|       |<--- 2 < A --->|
Data      =0    =1    =2    =3    =4    =5
values    b0    b1    b2    b3    b4    b5
  0        1     0     0     0     0     0
  1        0     1     0     0     0     0
  5        0     0     0     0     0     1
  3        0     0     0     1     0     0
  1        0     1     0     0     0     0
  2        0     0     1     0     0     0
  0        1     0     0     0     0     0
  4        0     0     0     0     1     0
  1        0     1     0     0     0     0
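The bitmap operations on this slide can be sketched in a few lines of Python. This is a minimal illustration, not FastBit's API: the function names `build_bitmap_index`, `select`, and `rows` are hypothetical, each bitmap is stored as a plain Python integer (bit i set means row i matches), and the second column `B` is invented data used only to show how partial results from two attributes are combined with AND.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)

def select(index, predicate):
    """OR together the bitmaps of every distinct value that satisfies predicate."""
    result = 0
    for value, bitmap in index.items():
        if predicate(value):
            result |= bitmap
    return result

def rows(bitmap):
    """List the row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Column A is the "data values" column from the slide.
A = [0, 1, 5, 3, 1, 2, 0, 4, 1]
index_a = build_bitmap_index(A)

a_lt_2 = select(index_a, lambda v: v < 2)   # b0 OR b1
a_gt_2 = select(index_a, lambda v: v > 2)   # b3 OR b4 OR b5

# Column B is hypothetical data, added only to show a two-attribute query.
B = [7, 7, 8, 9, 8, 7, 9, 9, 7]
index_b = build_bitmap_index(B)

# "A < 2 && B == 7": AND the per-attribute partial results.
both = a_lt_2 & select(index_b, lambda v: v == 7)

print(rows(a_lt_2), rows(a_gt_2), rows(both))
```

Each range condition costs one OR per qualifying distinct value, and each extra attribute costs one AND, which is why low-cardinality attributes suit this layout.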
Compression Makes It Better
Example: 2015 bits
10000000000000000000011100000000000000000000000000000……………….00000000000000000000000000000001111111111111111111111111
• Main idea: use run-length encoding, but partition the bits into 31-bit groups (not 32-bit) on 32-bit machines
  31 bits | 31 bits … (62 groups skipped) … | 31 bits
• Merge neighboring groups with identical bits
  31 literal bits | count = 63 (number of 31-bit groups) | 31 literal bits
• Encode each group using one 32-bit word
  0 + 31 literal bits | 1 0 + count = 63 | 0 + 31 literal bits
• Name: Word-Aligned Hybrid (WAH) code
• Key features:
  • Compressed indices are typically 30% of the raw data size
  • 10X faster in answering queries than the most competitive bitmap index
  • Worst-case index size is 4N words, not N*N bits
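The three steps above (partition, merge, encode) can be sketched as follows. This is an illustrative sketch, not FastBit's code: `wah_encode` and `wah_decode` are hypothetical names, bitmaps are passed as strings of '0' and '1' for readability, and the input length is assumed to be a multiple of 31 (real WAH keeps a separate word for leftover bits). The word layout follows the published WAH description as an assumption: a leading 0 marks a literal word, a leading 1 marks a fill word whose next bit is the fill value and whose remaining 30 bits count the merged groups. The run lengths in the test string are also assumed, chosen to resemble the slide's 2015-bit example.

```python
GROUP = 31                   # payload bits carried by each 32-bit word
MAX_COUNT = (1 << 30) - 1    # assumed 30-bit run counter in a fill word

def wah_encode(bits):
    """Encode a '0'/'1' string into a list of 32-bit WAH words (as ints)."""
    if len(bits) % GROUP:
        raise ValueError("this sketch assumes a length that is a multiple of 31")
    words = []
    for i in range(0, len(bits), GROUP):
        group = bits[i:i + GROUP]
        if group in ("0" * GROUP, "1" * GROUP):
            # Fill group: flag bit 1, then the fill value, then the run count.
            header = (1 << 31) | (int(group[0]) << 30)
            prev = words[-1] if words else 0
            if (prev >> 30) == (header >> 30) and (prev & MAX_COUNT) < MAX_COUNT:
                words[-1] += 1           # merge with the neighboring identical fill
            else:
                words.append(header | 1)
        else:
            # Literal group: the 31 bits are stored as-is, top bit stays 0.
            words.append(int(group, 2))
    return words

def wah_decode(words):
    """Expand WAH words back into the original '0'/'1' string."""
    out = []
    for w in words:
        if w >> 31:                      # fill word
            fill_bit = (w >> 30) & 1
            out.append(str(fill_bit) * (GROUP * (w & MAX_COUNT)))
        else:                            # literal word
            out.append(format(w, "031b"))
    return "".join(out)

# 65 groups of 31 bits: one literal, 63 all-zero groups, one literal.
bits = "1" + "0" * 20 + "111" + "0" * 7 + "0" * (31 * 63) + "0" * 6 + "1" * 25
words = wah_encode(bits)
print(len(bits), [format(w, "08X") for w in words])
```

Because every word holds exactly 31 bitmap bits or a whole number of 31-bit groups, bitwise operations can proceed word by word without realigning bits.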
Handling Collective Features: Regions of Interest
• FastBit has been used in
  • GridCollector for the High-Energy Physics experiment STAR
  • Dexterous Data Explorer (DEX) for query-driven visualization
  • Dynamic histogramming for network traffic analysis
• On the right is an illustration of our region-growing approach
[Figure: workflow diagram with the labels Query, Index, Data, FastBit, Region Growing, and Region Tracking, next to a 2-D grid of 1s marking the selected cells. Caption: 2-D connected regions identified with line segments (in green). Line segments come out of FastBit compressed bitmaps.]
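One way to picture region growing from line segments is sketched below. The slide does not describe the algorithm, so this is an assumption-laden illustration: runs of 1s in each grid row stand in for the line segments that come out of the compressed bitmaps, and a union-find structure (a technique chosen here, not confirmed by the source) joins segments in adjacent rows that overlap in column range. The names `row_segments` and `label_regions` and the sample grid are hypothetical.

```python
def row_segments(row):
    """Return half-open (start, end) column ranges for each run of 1s in a row."""
    segments, start = [], None
    for col, cell in enumerate(row):
        if cell and start is None:
            start = col
        elif not cell and start is not None:
            segments.append((start, col))
            start = None
    if start is not None:
        segments.append((start, len(row)))
    return segments

def label_regions(grid):
    """Group line segments into connected regions (cells touching edge to edge)."""
    segs = [(r, s, e) for r, row in enumerate(grid) for s, e in row_segments(row)]
    parent = list(range(len(segs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    by_row = {}
    for k, (r, _, _) in enumerate(segs):
        by_row.setdefault(r, []).append(k)

    # Join each segment with overlapping segments in the row above it.
    for k, (r, s, e) in enumerate(segs):
        for m in by_row.get(r - 1, []):
            _, s2, e2 = segs[m]
            if s < e2 and s2 < e:
                parent[find(k)] = find(m)

    regions = {}
    for k, seg in enumerate(segs):
        regions.setdefault(find(k), []).append(seg)
    return list(regions.values())

# Hypothetical query result: 1 marks a cell that satisfied the query.
grid = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
]
regions = label_regions(grid)
sizes = sorted(sum(e - s for _, s, e in region) for region in regions)
print(len(regions), sizes)
```

Working on segments rather than individual cells keeps the cost proportional to the number of runs, which matches the run-oriented form of a compressed bitmap.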
Future Plans
• Software development
  • Release FastBit under LGPL (John, March ’07)
  • FastBit integration with ROOT (John, Sept ’07)
  • FastBit integration with HDF5 for particle physics (Kurt)
• Finding regions of interest
  • Existing work only dealt with data on regular meshes
  • Working on extensions to AMR mesh (Kurt), GTC mesh (John), and tetrahedral mesh (Rishi)
• New features (research)
  • Parallel version
  • Table groups / partitions
  • Range join