grid-based data stream processing in e-science

16
Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e- Science 1 Grid-based Data Stream Processing in e-Science Richard Kuntschke 1 , Tobias Scholl 1 , Sebastian Huber 1 , Alfons Kemper 1 , Angelika Reiser 1 , Hans-Martin Adorf 2 , Gerard Lemson 3 , and Wolfgang Voges 3 1 Lehrstuhl Informatik III: 1 Datenbanksysteme 1 Fakultät für Informatik 1 Technische Universität München 2 Max-Planck-Institut 2 für Astrophysik 3 Max-Planck-Institut 3 für extraterrestrische 3 Physik

Upload: marnie

Post on 07-Jan-2016

35 views

Category:

Documents


1 download

DESCRIPTION

Grid-based Data Stream Processing in e-Science. Richard Kuntschke 1 , Tobias Scholl 1 , Sebastian Huber 1 , Alfons Kemper 1 , Angelika Reiser 1 , Hans-Martin Adorf 2 , Gerard Lemson 3 , and Wolfgang Voges 3. 2 Max-Planck-Institut 2 für Astrophysik. 3 Max-Planck-Institut - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Grid-based Data Stream Processing in e-Science

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science 1

Grid-basedData Stream Processingin e-Science

Richard Kuntschke1, Tobias Scholl1, Sebastian Huber1,Alfons Kemper1, Angelika Reiser1,Hans-Martin Adorf2, Gerard Lemson3, and Wolfgang Voges3

1Lehrstuhl Informatik III:1Datenbanksysteme1Fakultät für Informatik1Technische Universität München

2Max-Planck-Institut2für Astrophysik

3Max-Planck-Institut3für extraterrestrische3Physik

Page 2: Grid-based Data Stream Processing in e-Science

2

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Important Challenges in e-Science

In general: Large and exponentially growing amounts of data Distributed data archives No unique identifiers Uncertainty

In astrophysics:Spectral Energy Distributions (SEDs)

Used to classify celestial objects (active galactic nuclei, brown dwarfs, neutron stars, ...)

Generation requires spatial (astrometric) matching

Page 3: Grid-based Data Stream Processing in e-Science

3

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Spatial (Astrometric) Matching

Current solutions … … load all data into main memory

Uses a lot of memory Infeasible if memory size is insufficient

… process all data at once and deliver the complete result at the end Inefficient No results until all processing has completed

Page 4: Grid-based Data Stream Processing in e-Science

4

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Our Contributions

StarGlobe Grid-based P2P Data

Stream Management System implemented on top of Globus

In-network processing Early filtering Parallelization Pipelining Load-balancing

Mobile user-defined operators

Astrophysical Example Workflow Astrometric matching Performance evaluation

Page 5: Grid-based Data Stream Processing in e-Science

5

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

The StarGlobe Architecture

Super-Peer BackboneQuery 1

Stream 0

Publish

Subscribe

filter

transform

Load mobile operators

Fct-Provider

filter

transform

Stream 1

Publish

Query 2

Subscribe

Page 6: Grid-based Data Stream Processing in e-Science

6

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Traditional Approach: Bring Data to Code

union

NN_10

T... ... ...... ... ...... ... ...

Data-Prov. BT

... ... ...

... ... ...

... ... ...

Data-Prov. CT

... ... ...

... ... ...

... ... ...

Data-Prov. DT

... ... ...

... ... ...

... ... ...

Data-Prov. A

Page 7: Grid-based Data Stream Processing in e-Science

7

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

New Approach: Bring Code to Data

T... ... ...... ... ...... ... ...

Data-Prov. A

scan

NN_10

T... ... ...... ... ...... ... ...

Data-Prov. B

scan

NN_10

T... ... ...... ... ...... ... ...

Data-Prov. C

scan

NN_10

T... ... ...... ... ...... ... ...

Data-Prov. D

scan

NN_10

union

NN_10

Fct-Provider

NN_10

Page 8: Grid-based Data Stream Processing in e-Science

8

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Mobile User-Defined Operators

Load user-defined operators from function provider servers in the network

Common interface for integrating external operators

Push-based iterator

Flexibility

Page 9: Grid-based Data Stream Processing in e-Science

9

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

StreamIterator Interface

open(Config, StreamWriter) Configuration parameters Writer for result stream

next(StreamIteratorEvent) Next element in input stream Writing output to result stream using

StreamWriter.write() close()

Page 10: Grid-based Data Stream Processing in e-Science

10

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Communication betweenStreamProcessor and StreamIterator

StreamIterator

StreamHandler 1

StreamHandler 2

StreamHandler n

StreamWriter

StreamProcessor

...

XML InputStream 1

XML InputStream 2

XML InputStream n

XML OutputStream

...

Item 1 Item 2 Item n Result Item

Page 11: Grid-based Data Stream Processing in e-Science

11

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Astrophysical Example Workflow

peer-10

peer-9 peer-8

peer-6 peer-4 peer-5peer-7

peer-2peer-1peer-0 peer-3

Input ListRASS-BSC

2MASS FIRST USNOB1

NVSS GSC-2

SED assembly

Page 12: Grid-based Data Stream Processing in e-Science

12

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Distributed Query Evaluation Planplan-10

at peer-10

plan-8at peer-8

plan-5at peer-5enrichσ-5

transform-5

stream-5

χ²filter-2

join-2

plan-4at peer-4enrichσ-4

transform-4

stream-4

plan-9at peer-9

χ²filter-3

join-3

plan-7at peer-7

plan-2at peer-2enrichσ-2

transform-2

stream-2

plan-3at peer-3enrichσ-3

transform-3

stream-3

χ²filter-1

join-1

plan-6at peer-6

χ²filter-0

join-0

plan-1at peer-1enrichσ-1

transform-1

stream-1

χ²filter-4

join-4

display

plan-0at peer-0enrichσ-0

transform-0

stream-0

Page 13: Grid-based Data Stream Processing in e-Science

13

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Distributed Query Evaluation Plan

plan-6at peer-6

χ²filter-0

join-0

plan-1at peer-1enrichσ-1

transform-1

stream-1

plan-0at peer-0enrichσ-0

transform-0

stream-0

Page 14: Grid-based Data Stream Processing in e-Science

14

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Distributed Query Evaluation Plan

plan-10at peer-10

χ²filter-4

join-4

display

Page 15: Grid-based Data Stream Processing in e-Science

15

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Evaluation of Early Filtering

Page 16: Grid-based Data Stream Processing in e-Science

16

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Conclusion

Synergies between research in computer science and other scientific disciplines, e.g., astrophysics

StarGlobe Handling large data volumes efficiently

Early filtering, parallelization, pipelining Returning first results early on

Pipelining Flexible support of domain-specific application logic

Mobile user-defined operators

Results also applicable to other domains