gen: a database interface generator for hpc programs

GEN: A Database Interface Generator for HPC Programs

Quan Pham, Tanu Malik

Computation Institute

University of Chicago and Argonne National Laboratory

Monday, June 29, 15

Computational Science Challenges

• Science is iterative and data-intensive

• Simulate-Analyze-Simulate

• Supercomputing time is precious

• So reduce time for data analysis

• Challenges

• Slow IO and network bandwidth

• How to remove redundant and IO-intensive steps?


Computational Cosmology (current state of the art)

• Data sizes reduce as analysis proceeds

• Simulation size is a few petabytes

• Analysis data may range from TBs to 10s of GBs

[Figure: current workflow diagram. Labels: Simulate, Simulations, PVFS, FS, Scripts, Analyze (Halo/sub-halo Identification, Halo Merger Trees, Identifying Galaxies), Database, Visualization]

Madduri, Malik, Habib, et al. PDACS: A Portal for Data Analysis Services for Cosmological Simulations. In: ACM Computing In Science and Engineering, 2015


Computational Cosmology (envisioned state of the art)

• Database-friendly analysis within the DB

• Mechanisms for direct IO to databases

[Figure: envisioned workflow diagram. Labels: Simulate, Simulations, PVFS, Scripts, Analyze (Halo/sub-halo Identification, Halo Merger Trees, Identifying Galaxies), Database, Visualization]


Related Work

• The SciHadoop System

• Move data between file-systems (from PVFS to HDFS)

• Simple sub-selection and aggregation queries

• Querying using HPC libraries

• Transform sub-selection and aggregation SQL queries into parallel IO operations using parallel-NetCDF

• Requires user to perform parallel data management

• Database-as-a-service

• RESTful service approach is good only for small amounts of data


GEN: A Database Interface Generator

• Takes user-supplied C declarations and provides an interface to load data into, and access data from, common scientific array databases


GEN Overview

    MPI_Init ( &argc, &argv );
    MPI_Comm_rank ( MPI_COMM_WORLD, &id );
    MPI_Comm_size ( MPI_COMM_WORLD, &p );

    if ( id == 0 )
    {
      x_file = fopen ( "x_data.txt", "w" );
      for ( j = 0; j < n; j++ )
      {
        for ( i = 0; i < m; i++ )
          fprintf ( x_file, " %24.16g", table[i+j*m] );
        fprintf ( x_file, "\n" );
      }
      fclose ( x_file );
    }

Figure 2: A sample program I/O

    MPI_Init ( &argc, &argv );
    MPI_Comm_rank ( MPI_COMM_WORLD, &id );
    MPI_Comm_size ( MPI_COMM_WORLD, &p );

    if ( id == 0 )
    {
      CREATE TABLE "x_data.txt"
      for ( j = 0; j < n; j++ )
      {
        for ( i = 0; i < m; i++ )
          INSERT into "x_data.txt" VALUES (table[i+j*m])
      }
    }

Figure 3: Incorrect Database Interface of Sample Program

To generate the correct interface, GEN relies on (a) run-time wrapping of POSIX I/O calls, (b) run-time bookkeeping information for creation of database DDL statements, and (c) ingesting data directly from OS buffers and lazily interpreting them within the database using scientific file-format libraries. We explain this interface generation in detail:

3.1 Runtime Wrapping of POSIX I/O calls

Since we do not want to change the user's program, we would like every POSIX system call made by the C program to be directed to GEN's library, which then makes a corresponding database system call. In C programs, I/O system calls are handled through the system libraries, such as stdio or iostream, which are shared libraries. A shared library may be loaded during execution into the memory space of the program's process or may already be loaded into the memory space of another process. Further, a program process can be forced to load a shared library using the LD_PRELOAD environment variable, which instructs the loader (ld.so) to load the specified shared libraries. Symbol definitions in LD_PRELOADed libraries will be found before definitions in non-LD_PRELOADed libraries, so if an LD_PRELOADed library contains a definition of, say, fopen, then that code will be executed when the process references fopen. In particular, this means that we can package our wrapper functions into a shared library (.so file) and use LD_PRELOAD to cause them to be loaded.

The LD_PRELOAD directive helps to direct system calls to GEN's wrapper functions, but at run time GEN's wrapper functions still do not have a handle to the original system call functions. The wrappers need that handle, together with the arguments of the original call, to create the corresponding database DDL statements and then execute them.
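As an illustration of the wrapping described above, here is a minimal sketch of an interposed fopen, assuming Linux with glibc. It is not GEN's actual wrapper: the counter gen_fopen_calls and the log message are hypothetical, and the lookup of the original function uses dlsym with RTLD_NEXT. Where a real wrapper would issue a database statement, this sketch only logs the call.

```c
#define _GNU_SOURCE            /* exposes RTLD_NEXT in <dlfcn.h> on glibc */
#include <dlfcn.h>
#include <stdio.h>

#ifndef RTLD_NEXT              /* fallback if _GNU_SOURCE came too late (glibc value, an assumption) */
#define RTLD_NEXT ((void *) -1L)
#endif

/* Hypothetical bookkeeping: how many fopen calls the wrapper has seen. */
static int gen_fopen_calls = 0;

/* Same signature as the C library's fopen, so callers bind to this definition first. */
FILE *fopen(const char *path, const char *mode)
{
    /* Look up the next definition of fopen after this one (normally libc's). */
    static FILE *(*real_fopen)(const char *, const char *) = NULL;
    if (real_fopen == NULL)
        real_fopen = (FILE *(*)(const char *, const char *)) dlsym(RTLD_NEXT, "fopen");
    if (real_fopen == NULL)
        return NULL;           /* original not found; fail the call */

    gen_fopen_calls++;
    /* A real wrapper would issue a database DDL statement here; this sketch only logs. */
    fprintf(stderr, "[gen sketch] fopen(\"%s\", \"%s\") intercepted\n", path, mode);

    return real_fopen(path, mode);
}
```

Built as a shared library (for example with gcc -shared -fPIC, adding -ldl on older glibc versions) and named in LD_PRELOAD when launching an unmodified program, a wrapper of this shape would see every fopen the program makes.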

3.2 Creating Database DDL statements

To create database DDL statements we need a handle to the original system call function, and we use the parameters of the original call to create the appropriate database DDL statements. On Unix platforms, the function dlsym(void *handle, const char *symbol) returns the address of the symbol given as its second argument, searching the dynamic or shared library accessible through the handle given as its first argument. Thus, assuming the only definitions of fopen are in our shared library and in the system C library, calling dlsym(RTLD_NEXT, "fopen") will return the address of the standard fopen() function, which we may assign to a function pointer and later call. Here the special handle RTLD_NEXT makes dlsym search the libraries that follow the calling library in the load order, so the lookup skips GEN's own definition and finds the next one.

Given the handle to system call functions, GEN creates the appropriate database DDL statement, based on the information available so far. Thus when the fopen system call is intercepted at run time,

    fopen("x_data.txt", "w")

GEN redirects it to the following DDL statement

    Create <Database Object> x_data.txt.tmp <structure>

where <Database Object> refers to the keyword TABLE or ARRAY. Typically, in a database the above declaration is incomplete since it does not have the associated structure of the database object. In the case of a TABLE, that structure corresponds to the attributes of the TABLE and their types, and in the case of an ARRAY the structure corresponds to the dimensions and attributes of the ARRAY. Since the structure information is not known at the time of opening a file, we add dummy attributes and dimensions to make it complete. Thus for relational databases the structure adds an integer column, which is auto-incremented, and a column of binary data type. For scientific databases the structure adds a uni-dimensional array with a single attribute of type integer. While creation of these tables may seem unnecessary at this point, in the next subsection we describe how these tables can be used for lazy reorganization of the data within the database.

When the fwrite() system call is redirected to GEN, GEN uses the handle of fwrite obtained from dlsym to obtain the actual data structures that will be written out to the file. GEN obtains the declarations of these data structures from the header files to initialize the appropriate schema for the database objects that it created earlier. At this point, instead of updating the schema of the previous tables, it creates a companion table/array with the correct data structure.
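The placeholder declarations described above could be produced by a small helper like the following sketch. The function name gen_placeholder_ddl, the enum, and the column and attribute names (id, payload, val) are all hypothetical. The relational form assumes PostgreSQL types (SERIAL, BYTEA), and the array form follows the angle-bracket notation used elsewhere in this document, so its exact syntax in a given array database is an assumption.

```c
#include <stdio.h>

/* Kind of database object to create for an intercepted fopen (illustrative names). */
typedef enum { GEN_TABLE, GEN_ARRAY } gen_object_kind;

/*
 * Write a placeholder CREATE statement for the file `path` into `buf`.
 * Returns the statement length, or -1 if `buf` is too small.
 */
int gen_placeholder_ddl(gen_object_kind kind, const char *path,
                        char *buf, size_t buflen)
{
    int n;
    if (kind == GEN_TABLE) {
        /* Relational: auto-incremented integer column plus a binary column. */
        n = snprintf(buf, buflen,
                     "CREATE TABLE \"%s.tmp\" (id SERIAL, payload BYTEA)", path);
    } else {
        /* Array: one unbounded dimension with a single integer attribute. */
        n = snprintf(buf, buflen,
                     "CREATE ARRAY %s.tmp <val:int64> [i=0:*]", path);
    }
    return (n < 0 || (size_t) n >= buflen) ? -1 : n;
}
```

A wrapper for fopen could call this helper with the path argument it received and send the resulting string to the database before returning the real file handle.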



Issues

• IO patterns

• N-1 non-strided or N-1 strided or N-N

• Scattered IO calls

• No direct mapping between POSIX calls and DB statements

• Direct C POSIX calls to DB statements

• Constraints:

• No source code changes


GEN

• Assumptions

• Serial IO for now

• IO done contiguously within the context of a single function


GEN-1

• Map system calls to DDL statements without changing source code

• LD_PRELOAD: load a shared library

• Use LD_PRELOAD to force load a library that has GEN’s implementation of fopen(), fwrite(), fread()




GEN-1I

• Schema-later; no structure initially, reorganize later

    Create Table x_data.txt.tmp <i int; x double>;
    Create Array x_data.txt.tmp <x:double> [i=0:*]

• OS-DB impedance mismatch

• Not every fwrite() translates to INSERT statement

• fwrites are buffered before being sent to filesystem

• A single statement when data is being flushed
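The buffering point in the bullets above can be sketched as a staging buffer that collects many small writes and issues one statement per flush. This is a hypothetical illustration, not GEN's code: struct gen_buffer, gen_write, gen_flush, and the capacity GEN_BUF_CAP are invented names, and the flush only counts statements instead of contacting a database.

```c
#include <string.h>
#include <stddef.h>

#define GEN_BUF_CAP 4096   /* hypothetical buffer capacity in bytes */

/* Hypothetical staging buffer that batches many small writes into one statement. */
struct gen_buffer {
    unsigned char data[GEN_BUF_CAP];
    size_t used;          /* bytes currently staged */
    unsigned statements;  /* statements issued so far (one per non-empty flush) */
};

/* Stand-in for sending one INSERT/LOAD with the staged bytes to the database. */
static void gen_flush(struct gen_buffer *b)
{
    if (b->used == 0) return;
    b->statements++;      /* a real implementation would execute the statement here */
    b->used = 0;
}

/* Stage `len` bytes, flushing first if they would not fit.
 * Returns 0, or -1 if `len` exceeds the buffer capacity. */
int gen_write(struct gen_buffer *b, const void *src, size_t len)
{
    if (len > GEN_BUF_CAP) return -1;
    if (b->used + len > GEN_BUF_CAP) gen_flush(b);
    memcpy(b->data + b->used, src, len);
    b->used += len;
    return 0;
}
```

With this shape, a thousand 8-byte writes result in two statements (one when the buffer fills, one at the final flush) instead of a thousand.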


Knowing the Structure

• User-defined

• Header files

• Allow some system calls to create sample files


Experiments on Merger Trees

• Building merger-trees in DB system

• Built from halo particles, i.e. particles in a given halo

• Several time-steps

• Perform a particle merge-join between halos of timesteps t_{i-1} and t_i

• Lineage queries to answer ancestors and progenitors

• MPI program

• Streams data to a single node, which streams to GEN

• Postgres and SciDB interfaces
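The particle merge-join listed above runs inside the database in these experiments. Purely to illustrate the join logic, here is a sketch in C that counts the particles shared by one halo at t_{i-1} and one halo at t_i. The struct halo_particle and the function shared_particles are hypothetical, and the sketch assumes both inputs are sorted by particle id with each id appearing once per timestep.

```c
#include <stddef.h>

/* Hypothetical row: one particle and the halo it belongs to at a timestep. */
struct halo_particle { long particle_id; int halo_id; };

/*
 * Merge-join two timesteps on particle_id and count the particles shared by
 * halo `prev_halo` at t_{i-1} and halo `next_halo` at t_i.
 * Both inputs must be sorted by particle_id, with unique ids per timestep.
 */
size_t shared_particles(const struct halo_particle *prev, size_t nprev,
                        const struct halo_particle *next, size_t nnext,
                        int prev_halo, int next_halo)
{
    size_t i = 0, j = 0, shared = 0;
    while (i < nprev && j < nnext) {
        if (prev[i].particle_id < next[j].particle_id) {
            i++;                       /* particle only present at t_{i-1} */
        } else if (prev[i].particle_id > next[j].particle_id) {
            j++;                       /* particle only present at t_i */
        } else {
            /* Same particle in both timesteps: check the halo pair. */
            if (prev[i].halo_id == prev_halo && next[j].halo_id == next_halo)
                shared++;
            i++;
            j++;
        }
    }
    return shared;
}
```

A non-zero count for a halo pair indicates that the earlier halo contributes particles to the later one, which is the relationship a merger tree records.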


Experimental Setup

• Time for reorganizing the data

• Reorganizing is cheaper in the DB than outside

Thus given an fwrite call, such as

    fwrite(&data, sizeof(example), 1, fout);

GEN determines the data structure to which data belongs, i.e., example, and using its declaration specified in header files obtains the appropriate attributes and type information. Thus, all data structures are flattened out as attributes of a table in a relational database, and as attributes of a single-dimensional array in an array database. Nested data structures become two tables joined by a foreign key. Using the same logic, but with some more bookkeeping, we can create appropriate statements for fprintf and sprintf statements.
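To make the flattening step concrete, the sketch below turns a field list into a CREATE TABLE statement. Everything here is an assumption for illustration: the layout of struct example, the hand-written example_fields table (GEN would derive this from header files), the function gen_table_ddl, and the choice of PostgreSQL type names.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record type standing in for the `example` struct in the fwrite call. */
struct example { int id; double x; double y; };

/* One flattened attribute: C field name and the SQL type chosen for it. */
struct gen_field { const char *name; const char *sql_type; };

/* Hand-written description of `struct example`; GEN would derive this from headers. */
static const struct gen_field example_fields[] = {
    { "id", "INTEGER" },
    { "x",  "DOUBLE PRECISION" },
    { "y",  "DOUBLE PRECISION" },
};

/* Build a CREATE TABLE statement with one column per field.
 * Returns 0 on success, -1 if `buf` is too small. */
int gen_table_ddl(const char *table, const struct gen_field *fields,
                  size_t nfields, char *buf, size_t buflen)
{
    size_t used;
    int n = snprintf(buf, buflen, "CREATE TABLE \"%s\" (", table);
    if (n < 0 || (size_t) n >= buflen) return -1;
    used = (size_t) n;
    for (size_t i = 0; i < nfields; i++) {
        n = snprintf(buf + used, buflen - used, "%s%s %s",
                     i ? ", " : "", fields[i].name, fields[i].sql_type);
        if (n < 0 || (size_t) n >= buflen - used) return -1;
        used += (size_t) n;
    }
    if (used + 2 > buflen) return -1;   /* room for ")" and the terminator */
    strcpy(buf + used, ")");
    return 0;
}
```

A nested struct would, under the scheme described above, get its own field list and table, with an extra column acting as the foreign key to the parent table.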

3.3 Ingesting Data Directly From Program to Database

GEN still does not create the corresponding INSERT/LOAD statements for the table due to the operating system-database system impedance mismatch, wherein an fwrite() in a C program does not correspond to an INSERT/LOAD statement in a database. This is due to the use of system libraries in which writes are buffered before being sent to the OS kernel, and the OS kernel converts these into streaming writes to improve performance.

In GEN we continue to maintain this performance advantage and create an INSERT statement for the fwrite, but direct the data for the fwrite into the temporary table of the file to which fwrite intends to write. Thus for the previous fwrite statement the following INSERT statement is created:

    INSERT INTO x_data.txt.tmp \
    VALUES(id, <data pointed by &data>)

While the writes are being sent to the temporary table, GEN maintains two pieces of information: which data structure is being written out, and the number of times fwrite was redirected by dlsym. The latter information is important for us to lazily reinterpret the data structure using the scientific file format libraries. For array databases, the process of ingesting data is not similar. Data from fwrite must be interpreted before it can be inserted. Thus array databases have significantly more delay in inserting data than relational databases.
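The INSERT generation and the two pieces of bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not GEN's implementation: struct gen_bookkeeping and gen_insert_stmt are hypothetical names, the raw bytes are encoded as a PostgreSQL hex bytea literal, and the row id is simply the running count of redirected fwrite calls.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical bookkeeping kept while writes go to the temporary table. */
struct gen_bookkeeping {
    const char *struct_name;     /* which data structure is being written */
    unsigned long fwrite_calls;  /* how many fwrite calls were redirected */
};

/*
 * Format one intercepted fwrite as an INSERT into "<file>.tmp", encoding the
 * raw bytes as a PostgreSQL hex bytea literal (an assumption about the target).
 * Returns 0 on success, or -1 if `buf` is too small.
 */
int gen_insert_stmt(struct gen_bookkeeping *bk, const char *file,
                    const void *data, size_t len, char *buf, size_t buflen)
{
    const unsigned char *bytes = data;
    int n = snprintf(buf, buflen, "INSERT INTO \"%s.tmp\" VALUES (%lu, '\\x",
                     file, bk->fwrite_calls + 1);
    if (n < 0 || (size_t) n >= buflen) return -1;
    size_t used = (size_t) n;
    if (buflen - used < 2 * len + 3) return -1;   /* hex digits, "')" and terminator */
    for (size_t i = 0; i < len; i++)
        used += (size_t) sprintf(buf + used, "%02x", bytes[i]);
    sprintf(buf + used, "')");
    bk->fwrite_calls++;           /* count only writes that produced a statement */
    return 0;
}
```

Because the payload is stored uninterpreted, the counter and the structure name are what later allow the bytes to be split back into records of the right type.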

4. EXPERIMENTS

In GEN we incur two kinds of costs for interface generation: the cost of redirecting the POSIX system calls at run time, and the cost of inferring the data structure lazily. We measure these costs individually. We also measure the cost of not using GEN and loading data via files.

We have built a working prototype in which MPI programs, similar to the one in the example, stream data to a Postgres database and a SciDB database. Both Postgres and SciDB are installed in single-instance mode on a single node. The MPI program is run in parallel on 3 nodes with one node streaming data to the database using GEN. Experiments were conducted on Ubuntu machines with 4 GB RAM and a dual-core processor.

To measure the overhead of redirection, we experimented by increasing the number of dynamically loaded libraries associated with a program. Our experiments show that redirection increases the program run time by 0.04 ms with a variation of 0.02 ms. The redirection time is not influenced by increasing the number of elements to write in the program, since the libraries are already loaded in memory. The variation arises if we execute other programs, which can cause the library to be offloaded from memory so that it needs to be reloaded.

In the second experiment, we measured the space vs. time overhead of reorganizing the data. In GEN we have not saved any space. Without GEN the binary data is written out to files and then loaded into the database. With GEN the data is written out to temporary tables within the database and then reorganized within the database. However, as Table 1 shows, there are significant savings from reorganizing the data within the database versus reorganizing prior to loading the database.

Table 1 shows the time it takes to load increasingly larger datasets into a relational and an array database vs. the time it takes to load the same data from a temporary table into a formatted table plus the time of loading the initial binary data into a temporary table.

Table 1: Reorganization times

    Data Size    Postgres                     SciDB
    (in GB)      Load from    Reorganize      Load from    Reorganize
                 Files        within DB       Files        within DB
    1            2.7          1.2             8.4          3.2
    5            6.8          2.5             14.3         7.6
    10           10.56        4.7             22.6         8.9

5. RELATED WORK

Typically, simulation data is written to data files in write-optimized, self-describing formats to be analyzed later. However, with increasing volumes of data and the need to reduce the time to analysis, different approaches for analysis have emerged. In [4] and [5] data is not reformatted for analysis; instead, analysis is enabled over write-optimized data files. In [4], this is achieved by integrating the self-describing scientific file formats with the HDFS file system, and subsequently using map-reduce methods for analysis. In [5], analysis is enabled by translating common queries, written in a native language, into appropriate library calls of the scientific file formats. In both cases, only simple sub-selections and aggregations are enabled. Databases allow more complex analysis. However, ingesting data into a database and monitoring it can be onerous. In this paper, we have reduced this overhead by directly linking HPC programs with database statements, thus automating the process for simulation-based science that generates a lot of data but would like to automatically ingest data into databases instead of files.

The operating system provides a different and simplified model for writing data. Optimizations in the operating system have long been a topic of discussion in the systems research community. In this paper, we have kept most of the optimizations that the OS provides, and yet provided a database that can slowly be adapted on a need basis for analysis. While our approach is preliminary, we plan to extend it to different operating systems such as Windows and Mac OS X and to determine the appropriate system libraries.

6. CONCLUSIONS

In this paper, we described a method to transparently introduce and link databases within the natural workflow of

• Overhead of redirection

• Dynamically-linked library may get off-loaded

• 0.04 ms ± 0.02 ms


Conclusions and Current Work

• A database generator library for data-intensive scientific applications that requires no source code changes and loads data first, to be reorganized later.

• Relax the assumptions

• Loading is still an issue

• Source-code release


• Thank You!

• Acknowledgements

• Salman Habib, Katrin Heitmann, Steve Rangel, Hal Finkel, Ravi Madduri


A vision

[Figure: HPC Cluster, High-bandwidth Network, Database Servers]
