Runtime Data Management for Data-Intensive Scientific Applications


Page 1: Runtime Data Management for Data-Intensive Scientific Applications

DOE Network PI Meeting 2005

Runtime Data Management for Data-Intensive Scientific Applications

Xiaosong Ma, NC State University
Joint Faculty: Oak Ridge National Lab
ECPI: 2005 – 2008

Page 2: Runtime Data Management for Data-Intensive Scientific Applications

Data-Intensive Applications on NLCF

- Data-processing applications
  - Bio sequence DB queries, simulation data analysis, visualization
- Challenges
  - Rapid data growth (data avalanche)
    - Computation requirement, I/O requirement
    - Needs ultra-scale machines
  - Less studied than numerical simulations
    - Scalability on large machines
    - Complexity and heterogeneity
    - Case-by-case static optimization costly

Page 3: Runtime Data Management for Data-Intensive Scientific Applications

Run-Time Data Management

- Parallel execution plan optimization
  - Example: genome vs. database sequence comparison on 1000s of processors
  - Data placement crucial for performance/scalability
  - Issues: data partitioning/replication, load balancing (see the sketch after this list)
- Efficient parallel I/O w. scientific data formats
  - I/O subsystem performance lagging behind
  - Scientific data formats widely used (HDF, netCDF) further limit applications' I/O performance
  - Issues: library overhead, metadata management and accesses
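The partitioning/load-balancing issue above can be made concrete with a toy example. The sketch below (plain C, with made-up fragment counts and sizes, and not the project's actual algorithm) assigns database fragments to processors greedily, largest fragment first, always to the currently least-loaded processor:

```c
/*
 * Hypothetical sketch: greedy (largest-first) assignment of database
 * fragments to processors, one simple way to approach the load-balancing
 * issue named above.  Fragment IDs and sizes are invented for illustration.
 */
#include <stdio.h>
#include <stdlib.h>

#define NFRAGS 8
#define NPROCS 3

typedef struct { int id; long size; } frag_t;

static int by_size_desc(const void *a, const void *b)
{
    long sa = ((const frag_t *)a)->size, sb = ((const frag_t *)b)->size;
    return (sb > sa) - (sb < sa);          /* sort by decreasing size */
}

int main(void)
{
    frag_t frags[NFRAGS] = {
        {0, 120}, {1, 300}, {2, 80},  {3, 210},
        {4, 150}, {5, 90},  {6, 260}, {7, 40}
    };
    long load[NPROCS] = {0};

    /* Largest fragments first, each given to the least-loaded processor. */
    qsort(frags, NFRAGS, sizeof(frag_t), by_size_desc);
    for (int i = 0; i < NFRAGS; i++) {
        int p = 0;
        for (int q = 1; q < NPROCS; q++)
            if (load[q] < load[p]) p = q;
        load[p] += frags[i].size;
        printf("fragment %d (size %ld) -> processor %d\n",
               frags[i].id, frags[i].size, p);
    }
    for (int p = 0; p < NPROCS; p++)
        printf("processor %d total load: %ld\n", p, load[p]);
    return 0;
}
```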

Page 4: Runtime Data Management for Data-Intensive Scientific Applications

Proposed Approach

- Adaptive run-time optimization
  - For parallel execution plan optimization
    - Connect scientific data processing to relational databases
    - Runtime cost modeling and evaluation (a toy cost comparison follows this list)
  - For parallel I/O w. scientific data formats
    - Library-level memory management
- Hiding I/O costs
  - Caching, prefetching, buffering
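As an illustration of runtime cost modeling, the toy sketch below estimates each candidate execution plan's cost as compute time plus parallel I/O time and picks the cheaper plan. All constants, rates, and the two candidate plans are invented for illustration; this is not the project's cost model:

```c
/* Hypothetical cost-model sketch: choose between two candidate execution
 * plans at run time by estimating each one's wall-clock cost.  The terms
 * and constants are illustrative only. */
#include <stdio.h>

/* Estimated time = per-processor compute time + parallel I/O time, where
 * aggregate I/O bandwidth is assumed to stop scaling past `io_peak` procs. */
static double plan_cost(double work, double io_bytes, int nprocs,
                        double flops_rate, double io_bw, int io_peak)
{
    double compute   = work / (flops_rate * nprocs);
    int    eff_procs = nprocs < io_peak ? nprocs : io_peak;
    double io        = io_bytes / (io_bw * eff_procs);
    return compute + io;
}

int main(void)
{
    /* Compare replicating a 1GB database on 64 nodes (64GB of input I/O)
     * against partitioning it (1GB of input I/O in total). */
    double replicate = plan_cost(1e12, 64.0e9, 64, 1e9, 100e6, 16);
    double partition = plan_cost(1e12,  1.0e9, 64, 1e9, 100e6, 16);
    printf("replicate: %.1f s   partition: %.1f s\n", replicate, partition);
    printf("chosen plan: %s\n", partition < replicate ? "partition" : "replicate");
    return 0;
}
```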

Page 5: Runtime Data Management for Data-Intensive Scientific Applications

Prelim Result 1: Efficient Data Accesses for Parallel Sequence Searches

- BLAST
  - Widely used bio sequence search tool
  - NCBI BLAST Toolkit
- mpiBLAST
  - Developed at LANL
  - Open source parallelization of BLAST using database partitioning
  - Increasingly popular: more than 10,000 downloads since early 2003
  - Directly utilizes NCBI BLAST
  - Super-linear speedup with a small number of processors

Page 6: Runtime Data Management for Data-Intensive Scientific Applications

Data Handling in mpiBLAST Not Efficient

- Databases partitioned statically before search
  - Inflexible: re-partitioning required to use a different no. of procs
  - Management overhead: generates a large number of small files that are hard to manage, migrate, and share
- Results processing and output serialized by the master node
- Result: rapidly growing non-search overhead as the no. of procs grows and the output data size grows

[Figure: time (seconds) vs. number of processors (4, 8, 16, 32, 62), broken into search time and other time, for a search of 150k queries against the NR database]

Page 7: Runtime Data Management for Data-Intensive Scientific Applications

pioBLAST

- Efficient, highly scalable parallel BLAST implementation [IPDPS '05]
  - Improves mpiBLAST, with a focus on data handling
  - Up to an order-of-magnitude improvement in overall performance
  - Currently being merged with mpiBLAST
- Major contributions
  - Applying collective I/O techniques to bioinformatics, enabling dynamic database partitioning and parallel database input and result output (see the sketch below)
  - Efficient result data processing, removing the master bottleneck
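The collective result output can be sketched with standard MPI-IO; the fragment below is a minimal illustration of the idea, not the pioBLAST source. Each rank derives its file offset from the other ranks' result sizes via a prefix sum and then issues a single collective write; the output file name and result strings are made up:

```c
/* Minimal sketch of collective result output with MPI-IO: every rank
 * writes its own result block at an offset computed from the preceding
 * ranks' block sizes, in one collective call. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Pretend these are this rank's formatted BLAST results. */
    char results[128];
    int len = snprintf(results, sizeof(results),
                       "results from rank %d\n", rank);

    /* Exclusive prefix sum of result sizes gives each rank its offset. */
    long my_len = len, offset = 0;
    MPI_Exscan(&my_len, &offset, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) offset = 0;   /* Exscan leaves rank 0 undefined */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "blast_output.txt",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* One collective write; the MPI-IO layer can merge the requests. */
    MPI_File_write_at_all(fh, (MPI_Offset)offset, results, len,
                          MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```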

Page 8: Runtime Data Management for Data-Intensive Scientific Applications

pioBLAST Sample Performance Results

- Platform: SGI Altix at ORNL
  - 256 1.5GHz Itanium2 processors
  - 8GB memory per processor
- Database: NCBI nr (1GB)
- Node scalability tests (top figure)
  - Queries: 150k queries randomly sampled from nr
  - Varied no. of processors

[Figure: execution time (s) by program and no. of processes, broken into search time and other time]

Page 9: Runtime Data Management for Data-Intensive Scientific Applications

Prelim Result 2: Active Buffering

- Hides periodic I/O costs behind computation phases [IPDPS '02, ICS '02, IPDPS '03, IEEE TPDS (to appear)]
- Organizes idle memory resources into a buffer hierarchy
- Masks costs of scientific data formats
- Panda Parallel I/O Library
  - University of Illinois
  - Client-server architecture
- ROMIO Parallel I/O Library
  - Argonne National Lab
  - Popular MPI-IO implementation, included in MPICH
  - Server-less architecture: ABT (Active Buffering with Threads); see the sketch below
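The ABT code itself is not reproduced here; the following is a minimal sketch of the idea under POSIX threads: write calls copy data into an in-memory buffer and return immediately, while a background thread drains a swapped-out buffer to disk so the I/O overlaps the next computation phase. The buffer size, output file name, and double-buffering scheme are illustrative assumptions:

```c
/* Minimal active-buffering sketch: the compute thread copies each write
 * into a memory buffer and returns; a background thread swaps in the
 * filled buffer and writes it to disk while computation continues. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BUF_CAP (1 << 20)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static char   bufA[BUF_CAP], bufB[BUF_CAP];
static char  *active = bufA;   /* compute thread appends here        */
static size_t fill   = 0;      /* bytes currently buffered           */
static int    done   = 0;      /* set when no more writes will come  */

/* Background thread: swap out the filled buffer and flush it to disk. */
static void *flusher(void *arg)
{
    (void)arg;
    FILE *fp = fopen("output.dat", "wb");
    if (!fp) return NULL;
    pthread_mutex_lock(&lock);
    while (!done || fill > 0) {
        while (fill == 0 && !done)
            pthread_cond_wait(&cond, &lock);
        if (fill > 0) {
            char  *full = active;
            size_t n    = fill;
            active = (active == bufA) ? bufB : bufA;  /* double buffering */
            fill   = 0;
            pthread_cond_signal(&cond);   /* compute thread may continue  */
            pthread_mutex_unlock(&lock);
            fwrite(full, 1, n, fp);       /* slow I/O happens unlocked    */
            pthread_mutex_lock(&lock);
        }
    }
    pthread_mutex_unlock(&lock);
    fclose(fp);
    return NULL;
}

/* "Write" call used by the compute thread: copy into memory and return. */
static void buffered_write(const void *data, size_t n)
{
    pthread_mutex_lock(&lock);
    while (fill + n > BUF_CAP)            /* buffer overflow: must wait   */
        pthread_cond_wait(&cond, &lock);
    memcpy(active + fill, data, n);
    fill += n;
    pthread_cond_signal(&cond);           /* wake the flusher             */
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, flusher, NULL);

    char chunk[4096];
    for (int step = 0; step < 256; step++) {
        memset(chunk, step & 0xff, sizeof(chunk)); /* stand-in for a compute phase */
        buffered_write(chunk, sizeof(chunk));      /* does not block on the disk   */
    }

    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return 0;
}
```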

Page 10: Runtime Data Management for Data-Intensive Scientific Applications

Write Throughput w. Active Buffering

[Figure: write throughput per server (MB/s) vs. number of clients (2, 4, 8, 16, 32) for binary and HDF4 writes; without buffer overflow, AB is compared against local buffering and plain MPI; with buffer overflow, AB is compared against ideal and plain MPI]

Page 11: Runtime Data Management for Data-Intensive Scientific Applications

Prelim Result 3: Application-level Prefetching

- GODIVA Framework: hides periodic input costs behind computation phases [ICDE '04]
  - General Object Data Interface for Visualization Applications
  - In-memory database managing data buffer locations
  - Relational database-like interfaces
  - Developer-controllable prefetching and caching
  - Developer-supplied read functions (see the sketch after the figure)

[Figure: execution time (s) for simple, medium, and complex workloads under configurations (O), (G), and (TG), broken into computation time and visible I/O time]
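GODIVA's interfaces are not reproduced here; the sketch below only illustrates the combination of a developer-supplied read function with controllable prefetching: a helper thread reads timestep i+1 while the application computes on timestep i. All file and function names are hypothetical:

```c
/* Minimal application-level prefetching sketch: a developer-supplied read
 * function loads one timestep, and a helper thread fetches step i+1 while
 * the main thread works on step i. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NSTEPS 10
#define STEP_BYTES (1 << 20)

/* Developer-supplied read function: fill buf with data for one step. */
static void read_step(int step, char *buf)
{
    char path[64];
    snprintf(path, sizeof(path), "step%03d.dat", step);
    FILE *fp = fopen(path, "rb");
    if (fp) { size_t got = fread(buf, 1, STEP_BYTES, fp); (void)got; fclose(fp); }
}

struct prefetch_arg { int step; char *buf; };

static void *prefetch(void *p)
{
    struct prefetch_arg *a = p;
    read_step(a->step, a->buf);   /* runs while the main thread computes */
    return NULL;
}

static void compute_on(const char *buf) { (void)buf; /* rendering/analysis stand-in */ }

int main(void)
{
    char *cur  = malloc(STEP_BYTES);
    char *next = malloc(STEP_BYTES);

    read_step(0, cur);                                /* first step read synchronously */
    for (int i = 0; i < NSTEPS; i++) {
        pthread_t t;
        struct prefetch_arg a = { i + 1, next };
        int prefetching = (i + 1 < NSTEPS);
        if (prefetching)
            pthread_create(&t, NULL, prefetch, &a);   /* start fetching step i+1 */

        compute_on(cur);                              /* work on step i          */

        if (prefetching) {
            pthread_join(t, NULL);                    /* step i+1 is now buffered */
            char *tmp = cur; cur = next; next = tmp;  /* swap current/next buffers */
        }
    }
    free(cur);
    free(next);
    return 0;
}
```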

Page 12: Runtime Data Management for Data-Intensive Scientific Applications

On-going Research

- Parallel execution plan optimization
  - Explore the optimization space of bio sequence processing tools on large-scale machines
  - Develop algorithm-independent cost models
- Efficient parallel I/O w. scientific data formats
  - Investigate unified caching, prefetching, and buffering
- DOE Collaborators
  - Team led by Al Geist and Nagiza Samatova (ORNL)
  - Team led by Wu-Chun Feng (LANL)