data-centric transformations on non-integer iteration spaces

Data-Centric Transformations on Non-Integer Iteration Spaces. Swarup Kumar Sahoo, Gagan Agrawal. The Ohio State University, Department of Computer Science and Engineering.


Post on 20-Feb-2016



TRANSCRIPT

Page 1: Data-Centric Transformations on Non-Integer Iteration Spaces

Ohio State University Department of Computer Science and Engineering

Data-Centric Transformations on Non-Integer Iteration Spaces

Swarup Kumar Sahoo Gagan Agrawal

The Ohio State University

Page 2: Data-Centric Transformations on Non-Integer Iteration Spaces

Roadmap
• Motivation
• Background
• System Overview
• XQuery, Low- and High-Level Schemas, and Mapping Schema
• Compiler Analysis and Algorithm
• Parallelization
• Experiments
• Summary and Future Work

Page 3: Data-Centric Transformations on Non-Integer Iteration Spaces

Motivation
• Declarative and application-specific languages
  – Use high-level abstractions
  – Simplify the development of applications
• Use of restructuring transformations
  – Difficult because of these abstractions
• Goal: apply data-centric transformations
  – On integer and non-integer iteration spaces, while providing high-level abstractions (a virtual view) of the underlying datasets

Page 4: Data-Centric Transformations on Non-Integer Iteration Spaces

Background
• Data-centric transformation
  – Input data is brought into memory or cache in chunks (shackles); the program fragments or loop iterations that access this data are then executed.
  – Helps improve data locality.
• Integer-based iteration space
  – The loop variable takes integer values with a constant step size between a lower and an upper bound.
• Non-integer-based iteration space
  – The loop variable takes values from a sequence or set of real numbers, strings, or any other data type.
  – Easily expressible in declarative languages.

Page 5: Data-Centric Transformations on Non-Integer Iteration Spaces

Example: Data-Centric Transformation

for i := 1 to 3 {
    Count the number of occurrences of i in a list of digits
}

Page 6: Data-Centric Transformations on Non-Integer Iteration Spaces

Naïve Strategy
[Figure: Dataset, Counter, and Output. Requires 3 scans.]

Page 7: Data-Centric Transformations on Non-Integer Iteration Spaces

Data-Centric Strategy
[Figure: Dataset, Counter1, Counter2, Counter3, and Output. Requires just one scan.]
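The contrast between the naïve and data-centric strategies for the integer example can be sketched in Python. This is only an illustration under stated assumptions, not the authors' generated code: the function names and the sample list of digits are hypothetical, and a "scan" is modelled as one pass over an in-memory list.

```python
def count_naive(digits, values):
    """Naive strategy: one full scan of the dataset per loop iteration."""
    counts = {}
    scans = 0
    for i in values:
        scans += 1  # each value of i triggers a fresh pass over the data
        counts[i] = sum(1 for d in digits if d == i)
    return counts, scans


def count_data_centric(digits, values):
    """Data-centric strategy: one scan, one counter per loop iteration."""
    counts = {i: 0 for i in values}
    for d in digits:  # the single pass over the data
        if d in counts:
            counts[d] += 1
    return counts, 1


# Hypothetical sample data; the slide's actual dataset is not recoverable.
digits = [2, 1, 3, 2, 5, 3, 3]
print(count_naive(digits, range(1, 4)))
print(count_data_centric(digits, range(1, 4)))
```

Both functions produce the same counts; only the number of passes over the dataset differs, which is the point of the transformation when the dataset does not fit in memory.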

Page 8: Data-Centric Transformations on Non-Integer Iteration Spaces

Example: Data-Centric Transformation with Non-Integer Iteration Space

for each distinct color (green, blue, pink) {

     with that color

}

Page 9: Data-Centric Transformations on Non-Integer Iteration Spaces

Naïve Strategy
[Figure: Dataset, Counter, and Output. Requires 3 scans.]

Page 10: Data-Centric Transformations on Non-Integer Iteration Spaces

Data-Centric Strategy
[Figure: Datasets, Mapping, Counter1, Counter2, Counter3, and Output. Requires just one scan.]
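For the non-integer case, the role of the mapping can be sketched as follows. This is a hedged Python illustration, assuming the mapping simply assigns each color a counter slot; the function name and the sample items are hypothetical.

```python
def count_by_color(items, colors):
    """Single scan over the data; a mapping sends each color to its counter."""
    # Mapping from a non-integer loop value (a color) to an integer counter slot.
    slot_of = {color: n for n, color in enumerate(colors)}
    counters = [0] * len(colors)
    for item in items:  # one pass over the dataset
        slot = slot_of.get(item)
        if slot is not None:  # ignore colors outside the iteration space
            counters[slot] += 1
    return dict(zip(colors, counters))


# Hypothetical sample data.
items = ["green", "pink", "blue", "green", "pink", "green", "red"]
print(count_by_color(items, ["green", "blue", "pink"]))
```

Without the mapping, each data element would have to be compared against every loop value; with it, locating the right counter is a single dictionary lookup.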

Page 11: Data-Centric Transformations on Non-Integer Iteration Spaces

Previous Work and Contributions
• Related work
  – Data-centric multilevel blocking (Pingali et al., PLDI 1997)
  – Sparse matrix code synthesis from high-level specifications (Pingali et al., SC 2000)
  – Supporting XML-based high-level abstractions on flat-file datasets (LCPC 2003, XIME-P 2004)
• Contributions of this paper
  – An improved data-centric transformation algorithm that works on both integer and non-integer iteration spaces.
  – Handling of out-of-core computations involving multi-dimensional datasets, without limiting the organization of low-level datasets.
  – Automatic parallelization of the considered class of applications.

Page 12: Data-Centric Transformations on Non-Integer Iteration Spaces

System Overview
[Figure: system overview diagram. Components shown: XQuery Source Code, High-level XML Schema, Mapping Schema, Low-level XML Schema, Compiler, Mapping Service, Low-level Library, Dataset, Cluster with Disk.]

Page 13: Data-Centric Transformations on Non-Integer Iteration Spaces

XQuery and XML Schemas
• High-level declarative languages ease application development
  – XQuery is a high-level language for processing XML datasets
  – Derived from database, declarative, and functional languages
• High-level schema
  – XML is used to provide a virtual view of the dataset
• Low-level schema
  – Reflects the actual physical layout
• Mapping schema
  – Describes the mapping between each element of the high-level schema and the low-level schema

Page 14: Data-Centric Transformations on Non-Integer Iteration Spaces

Oil Reservoir Simulation
• Supports cost-effective oil production
• Simulations on a 3-D grid
• 17 variables and cell locations in the 3-D grid at each time step
• Computation of bypassed regions
  – An expression determines whether a cell is bypassed for a time step
  – Evaluated within a spatial region and a range of time steps
  – Reports grid cells that are bypassed for every time step in the range
[Figure: Oil reservoir management]

Page 15: Data-Centric Transformations on Non-Integer Iteration Spaces

High-Level Schema

<xs:element name="data" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence (unique x,y,z,t)>
      <xs:element name="x" type="xs:integer"/>
      <xs:element name="y" type="xs:integer"/>
      <xs:element name="z" type="xs:integer"/>
      <xs:element name="time" type="xs:integer"/>
      <xs:element name="velocity" type="xs:float"/>
      <xs:element name="mom" type="xs:float"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Page 16: Data-Centric Transformations on Non-Integer Iteration Spaces

High-Level XQuery Code for Oil Reservoir Management

unordered(
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
  for $k in ($z1 to $z2)
  let $p := document("OilRes.xml")/data
  where ($p/x = $i) and ($p/y = $j) and ($p/z = $k)
    and ($p/time >= $tmin) and ($p/time <= $tmax)
  return
    <info>
      <coord> {$i, $j, $k} </coord>
      <summary> { analyze($p) } </summary>
    </info>
)

Page 17: Data-Centric Transformations on Non-Integer Iteration Spaces

Low-Level Schema

<file name="info">
  <sequence>
    <group name="data">
      <attribute name="time">
        <datatype> integer </datatype>
        <dataspace>
          <rank> 1 </rank>
          <dimension> [1] </dimension>
        </dataspace>
      </attribute>
      <dataset name="velocity">
        <datatype> float </datatype>
        <dataspace>
          <rank> 1 </rank>
          <dimension> [x] </dimension>
        </dataspace>
      </dataset>
      ..............
    </group>
  </sequence>
</file>

Page 18: Data-Centric Transformations on Non-Integer Iteration Spaces

Mapping Schema

//high/data/velocity → //low/info/data/velocity
//high/data/time → //low/info/data/time
//high/data/mom → //low/info/data/mom [index(//low/info/data/velocity, 1)]
//high/data/x → //low/coord/x [index(//low/info/data/velocity, 1)]

Page 19: Data-Centric Transformations on Non-Integer Iteration Spaces

Modified Oil Reservoir Management with Non-Integer Iteration Space

let $src := document("Oil.xml")//data/x,y,z
let $coord := distinct-values($src)
unordered(
  for $C in $coord
  let $p := document("OilRes.xml")/data
  where ($p/x = $C/x) and ($p/y = $C/y) and ($p/z = $C/z)
    and ($p/time >= $tmin) and ($p/time <= $tmax)
  return
    <info>
      <coord> {$C/x, $C/y, $C/z} </coord>
      <summary> { analyze($p) } </summary>
    </info>
)

Page 20: Data-Centric Transformations on Non-Integer Iteration Spaces

Basic Steps in Our Data-Centric Transformation Algorithm
• Mapping function T: Iteration space → High-level data
• Mapping function C: High-level data → Low-level data
• Mapping function M = C · T: Iteration space → Low-level data
• Our goal is to compute M⁻¹ and use the following steps:
  – Iterate over each data element in actual storage.
  – Find the iterations of the original loop in which it is accessed, using M⁻¹.
  – Access the required elements of other datasets.
  – Execute the computation corresponding to those iterations.
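The steps built around the inverse mapping can be sketched in Python. This is an illustration under loudly stated assumptions rather than the authors' algorithm: the low-level layout (one flat value array per time step, plus a coordinate table indexed by offset), the names m_inverse and run_data_centric, and the sample values are all hypothetical.

```python
# Hypothetical low-level layout: a coordinate table shared by all time steps,
# and one flat array of values per time step.
coords = [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
storage = {1: [0.5, 0.7, 0.1], 2: [0.6, 0.2, 0.3]}  # time -> values by offset


def m_inverse(coords, time, offset):
    """Map a storage position back to the loop iteration that accesses it."""
    return coords[offset]


def run_data_centric(storage, coords, tmin, tmax, reduce_fn, init):
    """Visit data in storage order and update the output of the owning iteration."""
    output = {}
    for time, values in storage.items():  # iterate over stored elements
        if not (tmin <= time <= tmax):
            continue
        for offset, value in enumerate(values):
            iteration = m_inverse(coords, time, offset)  # which (i, j, k)?
            output[iteration] = reduce_fn(output.get(iteration, init), value)
    return output


print(run_data_centric(storage, coords, 1, 2, max, float("-inf")))
```

Here the reduction (max) stands in for the per-iteration computation; the data is read once, in the order in which it is stored, instead of once per loop iteration.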

Page 21: Data-Centric Transformations on Non-Integer Iteration Spaces

Handling Non-Integer-Based Iteration Space with Hash Table
• Abstract integer iteration space
  – Based on the unique sequence number of each element in the actual iteration space.
  – One-to-one correspondence between the actual and abstract iteration spaces.
    » A hash table can be used to create this mapping.
    » The sequence number in the hash table indicates the iteration instance in the abstract iteration space.

Page 22: Data-Centric Transformations on Non-Integer Iteration Spaces

Template for Generated Code Using Hash Table

Generated_Query {
    Go through the datasets and create a list of tuples, each denoting an iteration
    Foreach i in the list of tuples {
        apply hash function on i
        If i is not present in hash table,
            enter i into hash table and store its sequence number and the corresponding output element
    }
    For k = 1, …, NO_OF_CHUNKS {
        Read kth chunk of dataset S1 using HDF5 functions.
        Foreach of the other datasets S2, … , Sn
            access the required chunk of the dataset.
        Foreach data element in the chunks of data {
            compute the iteration instance i.
            apply the hash function and determine the corresponding output element.
            apply the reduction computation and update the output.
        }
    }
}
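The template can be approximated by a short Python sketch. It is a simplification under stated assumptions: chunks are plain in-memory lists standing in for HDF5 reads, only one dataset is considered, a Python dict plays the role of the hash table, and the names build_iteration_table, generated_query, key_of, and value_of are hypothetical.

```python
import operator


def build_iteration_table(tuples):
    """Assign each distinct iteration tuple a sequence number (first-seen order)."""
    table = {}
    for t in tuples:
        if t not in table:
            table[t] = len(table)  # sequence number = abstract integer iteration
    return table


def generated_query(chunks, key_of, value_of, reduce_fn, init):
    # First pass: collect the iteration tuples and number them.
    table = build_iteration_table(key_of(rec) for chunk in chunks for rec in chunk)
    output = [init] * len(table)  # one output element per iteration
    # Second pass: process the data chunk by chunk.
    for chunk in chunks:
        for rec in chunk:
            seq = table[key_of(rec)]  # iteration instance of this element
            output[seq] = reduce_fn(output[seq], value_of(rec))
    return table, output


# Hypothetical records: ((x, y), value), split into two chunks.
chunks = [[((1.5, 2.0), 10), ((0.5, 1.0), 20)],
          [((1.5, 2.0), 30), ((2.5, 0.0), 40)]]
table, output = generated_query(chunks, lambda r: r[0], lambda r: r[1],
                                operator.add, 0)
print(table, output)
```

The sequence numbers give the abstract integer iteration space described earlier, so the output can be held in an ordinary array even though the loop values are real-valued tuples.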

Page 23: Data-Centric Transformations on Non-Integer Iteration Spaces

Handling Non-Integer-Based Iteration Space without Hash Table
• Find the two choices required for constructing the actual iteration space.
• Determine the procedure to construct the actual iteration space.
• From the high-level schema, select the attributes forming a unique set of tuples (V).
• Let P be the set of attributes forming the iteration space.
• If P is not a subset of V, we use a hash table.
• Else if P = V, the transformation is done without a hash table.
• Else if P is a proper subset of V, the choice depends on the presence of duplicate tuples.
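The case analysis on P and V can be written down as a small Python function. This is only a sketch: the function name and the returned labels are hypothetical, and because the slide does not say which way the duplicate-tuple check decides, the third case is reported rather than resolved.

```python
def choose_strategy(P, V):
    """Classify the transformation strategy from attribute sets P and V.

    P: attributes forming the iteration space.
    V: attributes forming a unique set of tuples in the high-level schema.
    """
    P, V = set(P), set(V)
    if not P <= V:
        return "hash_table"        # P is not a subset of V
    if P == V:
        return "direct"            # no hash table needed
    return "check_duplicates"      # proper subset: depends on duplicate tuples


unique_attrs = {"x", "y", "z", "time"}
print(choose_strategy({"x", "y", "z", "time"}, unique_attrs))
print(choose_strategy({"x", "y", "z"}, unique_attrs))
print(choose_strategy({"velocity"}, unique_attrs))
```

The attribute names follow the high-level schema shown earlier, where the sequence is annotated as unique on x, y, z, and t.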

Page 24: Data-Centric Transformations on Non-Integer Iteration Spaces

Parallelization
• Two obvious ways to parallelize
  – Parallelize the for loop that goes through the different chunks
  – Parallelize the for loop that goes through the data in each chunk
• Choose the method depending on the number of chunks and the chunk size.
• A reduction operation is required to combine values from different processors.
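The first option, parallelizing over chunks and then combining partial results with a reduction, can be sketched in Python. This is a structural illustration only: it uses a thread pool from the standard library rather than the cluster setting of the experiments, the function names are hypothetical, and counting stands in for the per-element computation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    """Per-worker computation: partial counts for one chunk."""
    return Counter(chunk)


def parallel_over_chunks(chunks, workers=4):
    # Each chunk is handled by a worker, producing an independent partial result.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    # Reduction step: combine the partial results from all workers.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


print(parallel_over_chunks([["a", "b"], ["a"], ["c", "a"]]))
```

Because each worker updates only its own partial result, no locking is needed during the scan; all cross-worker communication is confined to the final reduction.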

Page 25: Data-Centric Transformations on Non-Integer Iteration Spaces

Experimental Test Bed
• HDF5 version 1.6.3 (uses MPI-I/O for parallel I/O)
• Sequential experiments: 700 MHz PIII machine, 1 GB memory, Linux version 2.4.18
• Parallel experiments: Itanium 2 cluster with dual 1.3 GHz Itanium 2 processor nodes, 4 GB RAM, 80 GB hard drive
• Four applications
  – Transaction database analysis
  – Original oil reservoir simulation
  – Modified oil reservoir simulation
  – Virtual microscope

Page 26: Data-Centric Transformations on Non-Integer Iteration Spaces

Experimental Results
Execution time (sec.) using different versions of the transformation algorithm

                              Virtual      Oil Reservoir   Modified Oil Reservoir   Transaction Database
                              Microscope   Simulation      Simulation               Analysis
With DCT without hash table   1.32         2.64            2.08                     -
With DCT using hash table     -            -               2.97                     7.57
Without DCT                   10.65        27.13           23.69                    96.11

Page 27: Data-Centric Transformations on Non-Integer Iteration Spaces

Experimental Results
[Figure: Parallel Performance of Virtual Microscope. x-axis: Number of Processors (1, 2, 4, 8); y-axis: Execution Time (sec), scale 0 to 100.]

Page 28: Data-Centric Transformations on Non-Integer Iteration Spaces

Experimental Results
[Figure: Parallel Performance of Oil Reservoir Simulation. x-axis: Number of Processors (1, 2, 4, 8); y-axis: Execution Time (sec), scale 0 to 70.]

Page 29: Data-Centric Transformations on Non-Integer Iteration Spaces

Experimental Results
[Figure: Parallel Performance of Modified Oil Reservoir Simulation. x-axis: Number of Processors (1, 2, 4, 8); y-axis: Execution Time (sec), scale 0 to 40.]

Page 30: Data-Centric Transformations on Non-Integer Iteration Spaces

Experimental Results
[Figure: Parallel Performance of Transaction Database Analysis. x-axis: Number of Processors (1, 2, 4, 8); y-axis: Execution Time (sec), scale 0 to 70.]

Page 31: Data-Centric Transformations on Non-Integer Iteration Spaces

Summary
• Compiler techniques
  – Perform data-centric transformations automatically on integer and non-integer iteration spaces.
  – A more efficient method, without a hash table, for data-centric transformation on non-integer iteration spaces.
  – Support high-level abstractions on complex low-level data formats.
  – Parallelization of the considered class of queries.
• Future work
  – Experimental results on more applications.
  – Compare performance with manual implementations.
  – Formalize the mapping schema.
  – Extend the applicability of the algorithm to a more general class of queries.