The Globus Toolkit R-Tree for Partial Spatial Replica Selection
Yun Tian, Philip J. Rhodes
Department of Computer and Information Science, University of Mississippi, Oct. 28 2010
Thursday, October 28, 2010
Outline
Background and Introduction
Partial Spatial Replica Selection
Globus Toolkit and Replica Location Service
R-tree and distributed (parallel) R-tree
Globus Toolkit R-tree (GTR-tree)
Experimental Results
Future Work
Background and Introduction
Scientific datasets continue to grow.
Hard to store entire datasets locally.
Solutions:
- Move computation close to the datasets.
- Move data close to the computation.
- Replicate data to spread computational load.
Background and Introduction
If only part of the dataset is needed, it is possible to perform the computation away from the data. We should:
- Identify the necessary subset of the dataset.
- Provide fast remote access to the subset.
- Identify relevant replicas.
- Select the best replicas if many copies are available.
Partial Spatial Replica Selection
Spatial Data
- A spatial dataset associates data values with locations in an n-dimensional space.
- Often dense arrays of values.
- Spatial extent is often represented as a Minimum Bounding Rectangle (MBR). For example: {Xmin, Ymin, Xmax, Ymax}
[Figure: 2D MBR example. Axes x and y with origin (0,0); labeled points (2,2) and (8,6).]
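To make the {Xmin, Ymin, Xmax, Ymax} notation concrete, here is a minimal Python sketch. The helper name `mbr_of_points` and the sample points are hypothetical and are not taken from the talk; the sketch only shows how an MBR could be derived from a set of 2D locations.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
MBR = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbr_of_points(points: Iterable[Point]) -> MBR:
    """Return the smallest axis-aligned rectangle enclosing all points."""
    pts = list(points)
    if not pts:
        raise ValueError("cannot bound an empty point set")
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical sample points whose bounds run from (2,2) to (8,6).
sample = [(2, 5), (8, 2), (4, 6), (6, 3)]
print(mbr_of_points(sample))  # (2, 2, 8, 6)
```

The same layout extends to n dimensions by listing all minima first and then all maxima, which is the order the 3D query strings later in the talk appear to follow.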
Partial Replicas
Replication
• Increases data availability and performance.
Partial Replicas
• Partial copies of the original dataset.
- Represent a subset of the attributes or a spatial subset of the dataset domain.
• “Hot spots” can be replicated more.
- Can increase performance.
Partial Spatial Replicas
• Partial Spatial Replicas
- represent spatial subsets of the larger data volume.
- associated with metadata such as subset bounds (MBR), physical address, logical file name (LFN), storage organization, etc.
Partial Spatial Replicas
• Before a computation begins, there are two problems to solve:
(1) Identify the set of partial replicas that intersect the subvolume required by the computation.
(2) From that set, select the “best” replicas to minimize transfer time. This requires a metric based on factors such as disk and network bandwidth and latency, data storage organization, etc.
• This paper addresses problem 1.
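Problem 1 can be illustrated with a brute-force baseline: test every replica MBR against the query bounds. The sketch below is an assumption-laden illustration, not the authors' code. The replica names and bounds are made up, boxes are stored as (min_1..min_n, max_1..max_n), and boxes that merely touch are counted as intersecting.

```python
from typing import Dict, List, Sequence


def intersects(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if two n-D boxes, given as (mins..., maxs...), overlap or touch."""
    n = len(a) // 2
    if len(a) != len(b) or len(a) != 2 * n:
        raise ValueError("boxes must have the same even number of coordinates")
    # The boxes overlap only if their intervals overlap in every dimension.
    return all(a[i] <= b[i + n] and b[i] <= a[i + n] for i in range(n))


def intersecting_replicas(
    replicas: Dict[str, Sequence[float]], query: Sequence[float]
) -> List[str]:
    """Linear scan: return the names of all replicas whose MBR meets the query."""
    return sorted(name for name, mbr in replicas.items() if intersects(mbr, query))


# Hypothetical 3D replicas; the query bounds are {(20,20,20), (70,70,70)}.
replicas = {
    "r0": (0, 0, 0, 30, 30, 30),
    "r1": (60, 60, 60, 100, 100, 100),
    "r2": (80, 0, 0, 120, 40, 40),
}
print(intersecting_replicas(replicas, (20, 20, 20, 70, 70, 70)))  # ['r0', 'r1']
```

A scan like this touches every replica, which is the cost a spatial index is meant to avoid.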
An Overview of Our Work
[Figure 2: Example of intersecting replicas in 2D. (a) Spatial query; (b) intersecting replicas.]
An Overview of Our Work
Efficiently determine the set of replicas that intersect the spatial query bounds.
Combine the Globus Toolkit RLS component with an R-tree structure.
• All spatial metadata in a relational database is re-organized into an R-tree structure.
Implement on an unmodified Globus Toolkit to support spatial data.
An Overview of Our Work
[Diagram: a client node sends a spatial replica query, e.g. {(20,20,20), (70,70,70)}, through GRAM to four grid nodes (Node-1 to Node-4). Each node runs a GRAM server, an R-tree query routine that uses the Globus Toolkit RLS API, and an RLS backend database.]
Partial Spatial Replica Selection
• Granite Scientific Database System.
- Fast access to remote or local spatial data.
- Iteration-aware prefetching.
- Multi-dimensional caching.
• Magnolia component of the Granite system.
- Spatial replica selection.
Globus Toolkit and Replica Location Service
• Distributed Replica Location Service (RLS)
- Used to map a logical file name (LFN) to a physical file name (PFN).
- Uses a backend relational database.
- Supports user-defined attributes.
- Attribute types: date, float, int, or string.
Globus Toolkit and Replica Location Service
• Limitations of the existing RLS for spatial data
- No built-in spatial intersection test.
- No spatial data structures for pruning the search. All replicas in the RLS must be tested.
• Overcoming the limitations
- Implement the intersection test on top of RLS.
- Re-organize metadata in the relational database using an R-tree structure.
R-tree and Distributed R-tree
- B-tree-like data structure, but with rectangles (“R”).
- Many variants: R*-tree, Hilbert R-tree, Priority R-tree, etc.
[Figure 3: R-tree 2D example. (a) R-tree with fanout=3: root G2-0; internal nodes G1-0, G1-1, G1-2; leaves G0-0 through G0-8. (b) Replica layout and packed results for R-tree (a).]
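The fanout-3 tree in Figure 3 can be reproduced with a simple bottom-up packing pass. This is a minimal sketch under stated assumptions: leaves are ordered by a plain sort on their coordinates, which stands in for whatever ordering a real packer (for example Hilbert or STR packing) would use, and the function names are hypothetical. Node ids follow the G<level>-<index> pattern seen in the figure.

```python
from typing import Dict, List, Tuple

MBR = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Nodes = Dict[str, Tuple[MBR, List[str]]]  # id -> (MBR, child ids)


def union(mbrs: List[MBR]) -> MBR:
    """Smallest rectangle that covers every rectangle in the list."""
    return (
        min(m[0] for m in mbrs),
        min(m[1] for m in mbrs),
        max(m[2] for m in mbrs),
        max(m[3] for m in mbrs),
    )


def pack(leaves: List[MBR], fanout: int = 3) -> Tuple[Nodes, str]:
    """Pack leaf MBRs bottom-up; return (nodes, root id)."""
    if not leaves:
        raise ValueError("need at least one leaf")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    nodes: Nodes = {}
    current: List[str] = []
    # Level 0: one node per replica, in sorted order (an assumed ordering).
    for i, mbr in enumerate(sorted(leaves)):
        node_id = f"G0-{i}"
        nodes[node_id] = (mbr, [])
        current.append(node_id)
    level = 0
    # Group `fanout` consecutive nodes under a parent until one root remains.
    while len(current) > 1:
        level += 1
        parents: List[str] = []
        for start in range(0, len(current), fanout):
            group = current[start:start + fanout]
            node_id = f"G{level}-{len(parents)}"
            nodes[node_id] = (union([nodes[c][0] for c in group]), group)
            parents.append(node_id)
        current = parents
    return nodes, current[0]


# Nine unit squares on a 3x3 grid give a tree shaped like Figure 3(a).
grid = [(x, y, x + 1, y + 1) for x in range(3) for y in range(3)]
nodes, root = pack(grid, fanout=3)
print(root, nodes[root])  # G2-0 ((0, 0, 3, 3), ['G1-0', 'G1-1', 'G1-2'])
```

Packing quality depends heavily on the leaf ordering; a plain coordinate sort is used here only to keep the sketch short.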
Globus Toolkit R-tree (the GTR-tree)
Overview
- Distributed and parallel.
- Tree node described as {id, MBR, info}.
- String PFNs represent tree node ids.
- Two user-defined string attributes: Info and MBR.
- Stored in the t_str_attr table of the RLS database.
Globus Toolkit R-tree (the GTR-tree)
GTR-tree example

Id      Attribute  Value
G0-0    MBR        “0,0,1,1”
G0-0    Info       “<gtrtree> <pfn>G0-0</pfn> <numChild>0</numChild> <Address>gsiftp://dragon.cs.olemiss.edu/data/d0.bin </Address></gtrtree>”
......  ......     ......
G2-0    MBR        “0,0,4,6”
G2-0    Info       “<gtrtree> <pfn>G2-0</pfn> <numChild>3</numChild> <child0>G1-0</child0> <child1>G1-1</child1> <child2>G1-2</child2></gtrtree>”

[Figure 3: GTR-tree 2D example. (a) R-tree with fanout=3: root G2-0; internal nodes G1-0, G1-1, G1-2; leaves G0-0 through G0-8. (b) GTR-tree node representation for (a), shown in the table above. (c) Replica layout and packed results for R-tree (a).]
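The two string attributes in the example can be decoded with a few lines of Python. This is only a sketch: the `GTRNode` class and `parse_node` function are hypothetical names, and the code assumes every Info string follows the element layout shown on the slide (pfn, numChild, child0..childN, and an optional Address on leaves).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GTRNode:
    node_id: str
    mbr: Tuple[float, ...]
    children: List[str]
    address: Optional[str]  # only present on leaf nodes in the slide's example


def parse_node(mbr_value: str, info_value: str) -> GTRNode:
    """Decode the MBR and Info attribute strings of one tree node."""
    root = ET.fromstring(info_value)
    pfn = root.findtext("pfn")
    if pfn is None:
        raise ValueError("Info string has no <pfn> element")
    num_child = int(root.findtext("numChild", default="0"))
    children: List[str] = []
    for i in range(num_child):
        child = root.findtext(f"child{i}")
        if child is None:
            raise ValueError(f"missing <child{i}> element")
        children.append(child.strip())
    address = root.findtext("Address")
    return GTRNode(
        node_id=pfn.strip(),
        mbr=tuple(float(v) for v in mbr_value.split(",")),
        children=children,
        address=address.strip() if address else None,
    )


root_info = (
    "<gtrtree> <pfn>G2-0</pfn> <numChild>3</numChild> <child0>G1-0</child0> "
    "<child1>G1-1</child1> <child2>G1-2</child2></gtrtree>"
)
node = parse_node("0,0,4,6", root_info)
print(node.node_id, node.mbr, node.children)
```

Keeping the child ids inside the Info string means a node's children can be looked up by id without any schema change to the RLS database.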
Globus Toolkit R-tree (the GTR-tree)
Issues Involved
- Decluster data onto the grid nodes.
- Pack metadata in the database with an R-tree.
- Use GRAM to invoke the query routine on the remote grid node.
- Three options for query results: transfer back with GridFTP, transfer back with a Java socket, or keep them on the remote nodes.
Globus Toolkit R-tree (the GTR-tree)
[Figure 4: Distributed GTR-tree for our experiments. The Replica Query Client Layer on the client node wraps a spatial replica query, e.g. {(20,20,20), (70,70,70)}, in an XML-based JDD string and submits it through GRAM to Node-1 through Node-4. Each node runs a GRAM server, an R-tree query routine that uses the Globus Toolkit RLS API, and an RLS backend database.]
Globus Toolkit R-tree (the GTR-tree)
Implementation
- General bottom-up packing.
- JDD string to submit a query to GRAM.
- Stack-based depth-first search (DFS) of the R-tree.
- GridFTP or Java socket to transfer results back.
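The stack-based DFS can be sketched as follows. This is an illustration only: the tree lives in an in-memory dictionary rather than behind RLS attribute lookups, the node values are made up, and boxes are written as (mins..., maxs...). Only children whose MBR intersects the query are pushed, so whole subtrees are pruned.

```python
from typing import Dict, List, Sequence, Tuple

Node = Tuple[Sequence[float], List[str]]  # (MBR, child ids); leaves have no children


def intersects(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if two n-D boxes, given as (mins..., maxs...), overlap or touch."""
    n = len(a) // 2
    return all(a[i] <= b[i + n] and b[i] <= a[i + n] for i in range(n))


def search(nodes: Dict[str, Node], root_id: str, query: Sequence[float]) -> List[str]:
    """Return ids of leaf nodes whose MBR intersects the query box."""
    results: List[str] = []
    # An explicit stack replaces recursion; start only if the root can match.
    stack = [root_id] if intersects(nodes[root_id][0], query) else []
    while stack:
        node_id = stack.pop()
        _, children = nodes[node_id]
        if not children:
            results.append(node_id)  # every node on the stack already intersects
            continue
        for child in children:
            if intersects(nodes[child][0], query):
                stack.append(child)
    return results


# Hypothetical two-level tree over four 2D replicas.
tree: Dict[str, Node] = {
    "G0-0": ((0, 0, 1, 1), []),
    "G0-1": ((2, 0, 3, 1), []),
    "G0-2": ((0, 4, 1, 5), []),
    "G0-3": ((3, 4, 4, 6), []),
    "G1-0": ((0, 0, 3, 1), ["G0-0", "G0-1"]),
    "G1-1": ((0, 4, 4, 6), ["G0-2", "G0-3"]),
    "G2-0": ((0, 0, 4, 6), ["G1-0", "G1-1"]),
}
print(sorted(search(tree, "G2-0", (2.5, 0.5, 5, 5))))  # ['G0-1', 'G0-3']
```

In a database-backed version each `nodes[...]` access would become an attribute query, which is why reducing the number of nodes visited matters.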
Experimental Results
Test Environment

Node Name  Location                OS               Processor             Number of CPUs  RAM   Java Version
Node-1     Univ. of Mississippi    Mac OS X 10.5.8  Dual-Core Intel Xeon  2               2 GB  SE 1.5.0_22
Node-2     Univ. of Mississippi    Mac OS X 10.5.8  PowerPC G5            2               2 GB  SE 1.5.0_22
Node-3     Univ. of Mississippi    Mac OS X 10.5.8  Quad-Core Intel Xeon  2               8 GB  SE 1.5.0_22
Node-4     Univ. of New Hampshire  Linux 2.6.18     Intel Xeon 5150       4               4 GB  SE 1.6.0_18
Table 1 Experimental Data Grid Node Characteristics

Query  3D Query MBR (String)
Q1     “1400 1400 1400 1400 1400 1400”
Q2     “1500 1500 1600 1510 1520 1640”
Q3     “50 50 50 100 100 100”
Q4     “100 100 100 200 200 200”
Q5     “200 200 200 400 400 400”
Q6     “800 900 1000 1000 1100 1400”
Q7     “1024 1024 1024 1324 1324 1324”
Q8     “1600 1600 1600 2048 2048 2048”
Q9     “900 900 1000 1400 1100 1400”
Table 2 Representative Spatial Queries
Experimental Results
Test Environment
- Four-node grid.
- sqlite3odbc driver, Globus Java API.
- Metadata for one million 3D replicas on each node, fanout=10.
- Replica extents range from 100 to 512. Dataset volume size is 2048*2048*2048.
Experimental Results
Test Data
• On each grid node, we randomly generate replicas as points in 2n-dimensional space.
- We use a random walk algorithm to provide good locality.
- No declustering is needed.
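One way such test data could be produced is sketched below. The slides do not describe the authors' random walk, so everything about the walk itself is an assumption: the function name, the `max_step` bound, and the clamping at the volume edge are hypothetical. The extent range (100 to 512) and the 2048-wide volume come from the test environment slide. Each replica is a 2n-tuple of n minima followed by n maxima.

```python
import random
from typing import List, Optional, Tuple


def random_walk_replicas(
    count: int,
    dims: int = 3,
    volume: int = 2048,
    min_extent: int = 100,
    max_extent: int = 512,
    max_step: int = 64,  # assumed step bound; not given in the talk
    seed: Optional[int] = None,
) -> List[Tuple[int, ...]]:
    """Generate replica MBRs whose lower corners follow a bounded random walk."""
    rng = random.Random(seed)
    limit = volume - max_extent  # keeps every box inside the volume
    corner = [rng.randint(0, limit) for _ in range(dims)]
    replicas: List[Tuple[int, ...]] = []
    for _ in range(count):
        extent = [rng.randint(min_extent, max_extent) for _ in range(dims)]
        upper = [c + e for c, e in zip(corner, extent)]
        replicas.append(tuple(corner) + tuple(upper))
        # Take a small step in each dimension so consecutive boxes stay close.
        corner = [
            min(max(c + rng.randint(-max_step, max_step), 0), limit)
            for c in corner
        ]
    return replicas


for mbr in random_walk_replicas(3, seed=1):
    print(mbr)
```

Because consecutive replicas stay near each other, consecutive groups of `fanout` replicas already form compact parents when packed in generation order.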
Experimental Results
• Test Results

Node    Q1  Q2   Q3    Q4    Q5    Q6    Q7     Q8    Q9  Average Speedup
Node-1   7   4  178   645  2190  4700  6573  18194  8737  48.9
Node-2   0   3  135  1039  4272  3941  5434  35152  8104  34.4
Node-3   3  34  246  1350  4475  4007  7047  39411  9476  27.2
Node-4   1  24   24    88   615  2045  4091   4503  4954  37.3
Table 3 Number of Replicas Returned and Average Speedups
Experimental Results
• Test Results
[Figure 5: GTR-tree query performance compared with plain RLS performance on Node-2 (the slowest node) and Node-4 (the fastest node), with average speedups of 34 and 37 respectively. Shown in linear and logarithmic scale. X-axis: queries Q1 to Q9; y-axis: time (sec.). Series: GTR-tree query and non-tree query on each node.]
Experimental Results
• Test Results
[Figure 6: Distributed GTR-tree query performance compared with sequential performance, with an average speedup of 25. Shown in linear and logarithmic scale. X-axis: queries Q1 to Q9; y-axis: time (sec.). Series: distributed GTR-tree query; sequential query of four LRCs with the GTR-tree; sequential query of four LRCs without the GTR-tree.]
Experimental Results
Overhead
- Storage overhead: 11% for fanout 10.
- GRAM time overhead: 11.7% on average.
- Allows us to use an unmodified Globus Toolkit.
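The 11% storage figure is consistent with a simple count of internal nodes. The sketch below assumes, since the slides do not say how the overhead was measured, that it is the number of internal tree nodes relative to the number of leaf entries in a fully packed tree. The function name is hypothetical.

```python
def internal_node_count(leaves: int, fanout: int) -> int:
    """Count the non-leaf nodes of a fully packed tree over `leaves` entries."""
    if leaves < 1 or fanout < 2:
        raise ValueError("need at least one leaf and a fanout of 2 or more")
    total = 0
    level = leaves
    while level > 1:
        level = (level + fanout - 1) // fanout  # ceiling division
        total += level
    return total


leaves = 1_000_000  # replicas per node in the test environment
internal = internal_node_count(leaves, 10)
print(internal, f"{internal / leaves:.1%}")  # 111111 11.1%
```

Under that assumption the overhead approaches 1/(fanout - 1), so a larger fanout would lower it at the cost of more intersection tests per node.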
Test Observations
- Only push intersecting children onto the stack.
- Retrieve child attributes in bulk: the M children on the stack are processed in bulk.
Future Work
Alleviate effects of grid node failure.
Dynamic GTR-tree.
- Insertion and Deletion
Select the “best” replica that will yield best performance.
Acknowledgments
This work was supported by the National Science Foundation under grants CCF-0541239 and CRI-0855136.
Thanks to the anonymous reviewers.
Thank You for Questions and Comments!