GEN: A Database Interface Generator for HPC Programs
Quan Pham, Tanu Malik
Computation Institute
University of Chicago and Argonne National Laboratory
Monday, June 29, 15
Computational Science Challenges
• Science is iterative and data-intensive
• Simulate-Analyze-Simulate
• Supercomputing time is precious
• So reduce time for data analysis
• Challenges
• Slow IO and network bandwidth
• How to remove redundant and IO-intensive steps?
Computational Cosmology (current state of the art)
• Data sizes reduce as analysis proceeds
• Simulation size is a few petabytes
• Analysis data may range from TBs to 10s of GBs
[Diagram labels: Simulations, Scripts, Visualization, Database, Analyze, Halo/sub-halo Identification, Halo Merger Trees, Identifying Galaxies, PVFS, FS, Simulate]
Madduri, Malik, Habib, et al. PDACS: A Portal for Data Analysis Services for Cosmological Simulations. In: ACM Computing In Science and Engineering, 2015
Computational Cosmology (envisioned state of the art)
• Database-friendly analysis within the DB
• Mechanisms for direct IO to databases
[Diagram labels: Simulations, Scripts, Visualization, Database, Analyze, Halo/sub-halo Identification, Halo Merger Trees, Identifying Galaxies, PVFS, Simulate]
Related Work
• The SciHadoop System
• Move data between file-systems (from PVFS to HDFS)
• Simple sub-selection and aggregation queries
• Querying using HPC libraries
• Transform sub-selection and aggregation SQL queries into parallel IO operations using parallel-NetCDF
• Requires user to perform parallel data management
• Database-as-a-service
• RESTful service approach is good for small amounts of data
GEN: A Database Interface Generator
• Takes user-supplied C declarations and provides an interface to load into and access data from common scientific array databases
GEN Overview
MPI_Init ( &argc, &argv );
MPI_Comm_rank ( MPI_COMM_WORLD, &id );
MPI_Comm_size ( MPI_COMM_WORLD, &p );
if ( id == 0 )
{
  /* Only rank 0 writes the m-by-n table, one row per line. */
  x_file = fopen ( "x_data.txt", "w" );
  for ( j = 0; j < n; j++ )
  {
    for ( i = 0; i < m; i++ )
      fprintf ( x_file, " %24.16g", table[i+j*m] );
    fprintf ( x_file, "\n" );
  }
  fclose ( x_file );
}
Figure 2: A sample program I/O
MPI_Init ( &argc, &argv );
MPI_Comm_rank ( MPI_COMM_WORLD, &id );
MPI_Comm_size ( MPI_COMM_WORLD, &p );
if ( id == 0 )
{
  CREATE TABLE "x_data.txt"
  for ( j = 0; j < n; j++ )
  {
    for ( i = 0; i < m; i++ )
      INSERT into "x_data.txt" VALUES (table[i+j*m])
    fprintf ( x_file, "\n" );
  }
  fclose ( x_file );
}
Figure 3: Incorrect Database Interface of Sample Program
To generate the correct interface, GEN relies on (a) run-time wrapping of POSIX I/O calls, (b) run-time bookkeeping information for the creation of database DDL statements, and (c) ingesting data directly from OS buffers and lazily interpreting it within the database using scientific file-format libraries. We explain this interface generation in detail:
3.1 Runtime Wrapping of POSIX I/O calls
Since we do not want to change the user’s program, we would like every POSIX I/O call made by the C program to be directed to GEN’s library, which makes the corresponding database call. In C programs, I/O calls are handled through system libraries, such as those implementing stdio or iostream, which are shared libraries. A shared library may be loaded into the memory space of the program’s process during execution, or may already be loaded into the memory space of another process. Further, a program process can be forced to load a shared library using the LD_PRELOAD environment variable, which instructs the loader (ld.so) to load the specified shared libraries. Symbol definitions in LD_PRELOADed libraries are found before definitions in non-LD_PRELOADed libraries, so if an LD_PRELOADed library contains a definition of, say, fopen, then that code is executed when the process references fopen. In particular, this means that we can package our wrapper functions into a shared library (.so file) and use LD_PRELOAD to cause them to be loaded.
The LD_PRELOAD directive helps to direct system calls to GEN’s wrapper functions, but at run time GEN’s wrapper functions still do not have a handle to the original system call functions. That handle, together with the arguments of the original call, is needed to create the corresponding database DDL statements in the wrapper function and then execute them.
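As an illustration of the wrapping described above, the following is a minimal sketch of an interposed fopen, not GEN’s actual implementation. It assumes a glibc-based Linux system; the counter gen_fopen_calls and the accessor gen_fopen_call_count are hypothetical names introduced only for this example. The wrapper reaches the C library’s fopen through dlsym with RTLD_NEXT, which is one common way to obtain the original function.

```c
#define _GNU_SOURCE          /* exposes RTLD_NEXT in <dlfcn.h> on glibc */
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical bookkeeping: how many fopen calls were intercepted. */
static int gen_fopen_calls = 0;

int gen_fopen_call_count(void)
{
    return gen_fopen_calls;
}

/* Same name and signature as the C library's fopen, so a preloaded
 * library containing this definition is found first by the loader. */
FILE *fopen(const char *path, const char *mode)
{
    static FILE *(*real_fopen)(const char *, const char *) = NULL;

    if (real_fopen == NULL) {
        /* Next definition of "fopen" after this object in search order. */
        real_fopen = (FILE *(*)(const char *, const char *))
                         dlsym(RTLD_NEXT, "fopen");
        assert(real_fopen != NULL);  /* sketch: abort if lookup fails */
    }

    gen_fopen_calls++;
    /* An interface generator would issue its database statement here. */
    return real_fopen(path, mode);
}
```

Assuming the file is saved as wrap.c (a hypothetical name), it could be built with `gcc -shared -fPIC wrap.c -o libwrap.so -ldl` and activated with `LD_PRELOAD=./libwrap.so ./program`, leaving the program’s source untouched.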
3.2 Creating Database DDL statements
To create database DDL statements, we need a handle to the original system call function, and we use the parameters of the original call to create the appropriate DDL statements. On Unix platforms, the dlsym function, dlsym(void* handle, const char* symbol), returns the address of the symbol given as its second argument, searching the dynamic or shared library accessible through the handle given as its first argument. Thus, assuming the only definitions of fopen are in our shared library and in the system C library, calling dlsym(RTLD_NEXT, "fopen") returns the address of the standard fopen() function, which we may assign to a function pointer and call later. Here, RTLD_NEXT tells dlsym to find the next occurrence of the symbol in the library search order after the calling library.
Given the handle to the system call functions, GEN creates the appropriate database DDL statement based on the information available so far. Thus, when the fopen system call is made at run time,
fopen("x_data.txt", "w")
GEN redirects it to the following DDL statement
Create <Database Object> x_data.txt.tmp <structure>
where <Database Object> refers to the keyword TABLE or ARRAY. In a database, the above declaration is incomplete because it lacks the structure of the database object. In the case of a TABLE, that structure corresponds to the attributes of the TABLE and their types; in the case of an ARRAY, it corresponds to the dimensions and attributes of the ARRAY. Since the structure information is not known at the time of opening a file, we add dummy attributes and dimensions to make the declaration complete. Thus, for relational databases the structure adds an integer column, which is auto-incremented, and a column of binary data type. For scientific databases the structure adds a uni-dimensional array with a single attribute of type integer. While the creation of these tables may seem unnecessary at this point, in the next subsection we describe how these tables can be used for lazy reorganization of the data within the database.
When the fwrite() system call is redirected to GEN, GEN uses the handle of fwrite obtained from dlsym to obtain the actual data structures that will be written out to the file. GEN obtains the declarations of these data structures from the header files to initialize the appropriate schema for the database objects that it created earlier. At this point, instead of updating the schema of the previous tables, it creates a companion table/array with the correct data structure.
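The placeholder declaration described above can be sketched as a small string builder. This is an assumption-laden illustration, not GEN’s code: the function name gen_placeholder_ddl, the enum, and the column and attribute names (id, payload, val) are hypothetical, the table form uses PostgreSQL’s SERIAL and BYTEA types, and the array form follows the array notation shown on the slides. Whether a given array database accepts a quoted name of this form is not confirmed by the source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Which kind of database object the placeholder is created as. */
enum gen_object_kind { GEN_TABLE, GEN_ARRAY };

/* Writes a placeholder DDL statement for `filename` into `out`.
 * Returns 0 on success, -1 if the name is unsafe to quote or the
 * statement does not fit. Column/attribute names are hypothetical. */
int gen_placeholder_ddl(enum gen_object_kind kind, const char *filename,
                        char *out, size_t outlen)
{
    int n;

    assert(filename != NULL && out != NULL);
    if (strchr(filename, '"') != NULL)
        return -1;  /* keep the quoted identifier well-formed */

    if (kind == GEN_TABLE) {
        /* Auto-incremented integer column plus a binary column. */
        n = snprintf(out, outlen,
                     "CREATE TABLE \"%s.tmp\" (id SERIAL, payload BYTEA)",
                     filename);
    } else {
        /* One-dimensional, unbounded array with one integer attribute. */
        n = snprintf(out, outlen,
                     "CREATE ARRAY \"%s.tmp\" <val:int64> [i=0:*]",
                     filename);
    }
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

A wrapper for fopen could call this with the path it received and send the resulting statement to the database before returning a file handle to the program.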
Issues
• IO patterns
• N-1 non-strided or N-1 strided or N-N
• Scattered IO calls
• No direct mapping between POSIX calls and DB statements
• Direct C POSIX calls to DB statements
• Constraints:
• No source code changes
GEN
• Assumptions
• Serial IO for now
• IO done contiguously within the context of a single function
GEN-I
• Map system calls to DDL statements without changing source code
• LD_PRELOAD: load a shared library
• Use LD_PRELOAD to force load a library that has GEN’s implementation of fopen(), fwrite(), fread()
fopen("x_data.txt", "w")
Create <Database Object> x_data.txt.tmp <structure>
GEN-II
• Schema-later; no structure initially, reorganize later
fopen("x_data.txt", "w")
Create Table x_data.txt.tmp <i int; x double>;
Create Array x_data.txt.tmp <x:double> [i=0:*]
• OS-DB impedance mismatch
• Not every fwrite() translates to an INSERT statement
• fwrites are buffered before being sent to the filesystem
• A single statement is issued when the data is flushed
Knowing the Structure
• User-defined
• Header files
• Allow some system calls to create sample files
Experiments on Merger Trees
• Building merger-trees in DB system
• Built from halo particles, i.e. particles in a given halo
• Several time-steps
• Perform a particle merge-join between halos of timesteps t_{i-1} and t_i
• Lineage queries to answer ancestors and progenitors
• MPI program
• Streams data to a single node, which streams to GEN
• Postgres and SciDB interfaces
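The particle merge-join mentioned above runs inside the database in the experiments; the C sketch below only illustrates the join logic under stated assumptions. The struct names, field names, and the function merge_join_particles are hypothetical, and the sketch assumes each input is sorted by ascending, unique particle ID.

```c
#include <assert.h>
#include <stddef.h>

/* A particle tagged with the halo that contains it at one timestep. */
struct halo_particle {
    long particle_id;
    int  halo_id;
};

/* A matched particle: its halo at t_{i-1} and its halo at t_i. */
struct halo_link {
    long particle_id;
    int  prev_halo;
    int  cur_halo;
};

/* Merge-joins two arrays sorted by ascending, unique particle_id.
 * Writes at most `max_out` links and returns the number written. */
size_t merge_join_particles(const struct halo_particle *prev, size_t nprev,
                            const struct halo_particle *cur, size_t ncur,
                            struct halo_link *out, size_t max_out)
{
    size_t i = 0, j = 0, n = 0;

    assert(prev != NULL && cur != NULL && out != NULL);
    while (i < nprev && j < ncur && n < max_out) {
        if (prev[i].particle_id < cur[j].particle_id) {
            i++;                       /* particle absent at t_i */
        } else if (prev[i].particle_id > cur[j].particle_id) {
            j++;                       /* particle absent at t_{i-1} */
        } else {
            out[n].particle_id = prev[i].particle_id;
            out[n].prev_halo = prev[i].halo_id;
            out[n].cur_halo = cur[j].halo_id;
            n++;
            i++;
            j++;
        }
    }
    return n;
}
```

Each emitted link records that a particle moved from one halo to another between consecutive timesteps, which is the raw material for the lineage queries listed above.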
Experimental Setup
• Time for reorganizing the data
• Reorganizing is cheaper in the DB than outside
Thus, given an fwrite call such as
fwrite(&data, sizeof(example), 1, fout);
GEN determines the data structure to which data belongs, i.e., example, and, using its declaration in the header files, obtains the appropriate attribute and type information. All data structures are flattened out as attributes of a table in a relational database, and as attributes of a single-dimensional array in an array database. Nested data structures thus become two tables joined by a foreign key. Using the same logic, with some more bookkeeping, we can create appropriate statements for fprintf and sprintf statements.
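The flattening of a C structure into table attributes can be sketched with a hand-written field descriptor table. This is only an illustration: the source names the type example but not its members, so the fields halo_id and mass, the descriptor type gen_field, and the function gen_schema_ddl are all hypothetical, and the SQL type names follow PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record type; the source names `example` but not its fields. */
struct example {
    int    halo_id;
    double mass;
};

/* One flattened field: its name, SQL type, and byte offset in the struct.
 * The offset is kept for later reinterpretation of the raw bytes. */
struct gen_field {
    const char *name;
    const char *sql_type;
    size_t      offset;
};

static const struct gen_field example_fields[] = {
    { "halo_id", "INTEGER",          offsetof(struct example, halo_id) },
    { "mass",    "DOUBLE PRECISION", offsetof(struct example, mass)    },
};

/* Builds "CREATE TABLE <table> (<name> <type>, ...)" into `out`.
 * Returns 0 on success, -1 if the statement does not fit. */
int gen_schema_ddl(const char *table, const struct gen_field *fields,
                   size_t nfields, char *out, size_t outlen)
{
    size_t used, i;
    int n;

    assert(table != NULL && fields != NULL && out != NULL);
    n = snprintf(out, outlen, "CREATE TABLE \"%s\" (", table);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    used = (size_t)n;

    for (i = 0; i < nfields; i++) {
        n = snprintf(out + used, outlen - used, "%s%s %s",
                     i ? ", " : "", fields[i].name, fields[i].sql_type);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;
        used += (size_t)n;
    }
    if (used + 2 > outlen)
        return -1;  /* no room for ")" and the terminator */
    strcpy(out + used, ")");
    return 0;
}
```

In this sketch the descriptor table is written by hand; a tool that reads header files, as the text describes, would generate it from the struct declaration instead.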
3.3 Ingesting Data Directly From Program to Database
GEN still does not create the corresponding INSERT/LOAD statements for the table because of the operating system-database system impedance mismatch, wherein an fwrite() in a C program does not correspond to an INSERT/LOAD statement in a database. This is due to the use of system libraries in which writes are buffered before being sent to the OS kernel, and the OS kernel converts these into streaming writes to improve performance.
In GEN we continue to maintain this performance advantage: GEN creates an INSERT statement for the fwrite, but directs the data for the fwrite into the temporary table of the file to which the fwrite intends to write. Thus, for the previous fwrite statement, the following INSERT statement is created:
INSERT INTO x_data.txt.tmp \
VALUES(id, <data pointed by &data>)
While the writes are being sent to the temporary table, GEN maintains two pieces of information: which data structure is being written out, and the number of times fwrite was redirected. The latter is needed to lazily reinterpret the data structure using the scientific file-format libraries. For array databases, the process of ingesting data differs: data from fwrite must be interpreted before it can be inserted. Thus, array databases have significantly more delay in inserting data than relational databases.
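The buffering and bookkeeping described above can be sketched as follows. This is an illustrative model, not GEN’s implementation: the type gen_stream, its functions, and the tiny GEN_BUF_CAPACITY are hypothetical, and the flush function only counts statements where a real system would send one INSERT/LOAD to the database.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GEN_BUF_CAPACITY 64  /* hypothetical; real buffers are far larger */

/* Bookkeeping for one redirected output file. */
struct gen_stream {
    unsigned char buf[GEN_BUF_CAPACITY];
    size_t used;        /* bytes currently buffered */
    size_t nwrites;     /* number of redirected fwrite calls seen */
    size_t elem_size;   /* size of the struct being written */
    size_t nflushes;    /* number of statements issued so far */
};

void gen_stream_init(struct gen_stream *s, size_t elem_size)
{
    memset(s, 0, sizeof *s);
    s->elem_size = elem_size;
}

/* Stands in for issuing one INSERT/LOAD carrying s->used bytes. */
void gen_stream_flush(struct gen_stream *s)
{
    if (s->used == 0)
        return;
    s->nflushes++;
    s->used = 0;
}

/* Mirrors fwrite(ptr, size, nmemb, stream): buffers the bytes and
 * flushes only when the buffer cannot hold the next write. */
size_t gen_stream_write(struct gen_stream *s, const void *ptr,
                        size_t size, size_t nmemb)
{
    size_t bytes = size * nmemb;

    assert(s != NULL && ptr != NULL);
    if (bytes > GEN_BUF_CAPACITY)
        return 0;  /* sketch: oversized writes are rejected */
    if (s->used + bytes > GEN_BUF_CAPACITY)
        gen_stream_flush(s);
    memcpy(s->buf + s->used, ptr, bytes);
    s->used += bytes;
    s->nwrites++;
    return nmemb;
}
```

The point of the design is that nwrites can grow much faster than nflushes: many fwrite calls collapse into one database statement, while the recorded element size and write count remain available for reinterpreting the bytes later.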
4. EXPERIMENTS
In GEN we incur two kinds of costs for interface generation: the cost of redirecting the POSIX system calls at run time, and the cost of inferring the data structure lazily. We measure these costs individually. We also measure the cost of not using GEN and loading data via files.
We have built a working prototype in which MPI programs, similar to the one in the example, stream data to a Postgres database and a SciDB database. Both Postgres and SciDB are installed in single-instance mode on a single node. The MPI program is run in parallel on 3 nodes, with one node streaming data to the database using GEN. Experiments were conducted on Ubuntu machines with 4 GB RAM and a dual-core processor.
To measure the overhead of redirection, we experimented by increasing the number of dynamically loaded libraries associated with a program. Our experiments show that redirection increases the program run time by 0.04 ms with a variation of 0.02 ms. The redirection time is not influenced by increasing the number of elements to write in the program, since the libraries are already loaded in memory. The variation arises when other programs execute and cause the library to be offloaded from memory, so that it needs to be reloaded.
In the second experiment, we measured the space vs. time overhead of reorganizing the data. GEN does not save any space. Without GEN, the binary data is written out to files and then loaded into the database. With GEN, the data is written out to temporary tables within the database and then reorganized within the database. However, as Table 1 shows, there are significant savings from reorganizing the data within the database versus reorganizing it prior to loading the database.
Table 1 shows the time it takes to load increasingly larger datasets into a relational and an array database from files, versus the time it takes to load the initial binary data into a temporary table plus the time to load the same data from the temporary table into a formatted table.
Table 1: Reorganization times

Data size   Postgres                     SciDB
(in GB)     Load from     Reorganize     Load from     Reorganize
            files         within DB      files         within DB
1           2.7           1.2            8.4           3.2
5           6.8           2.5            14.3          7.6
10          10.56         4.7            22.6          8.9
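To make the comparison in Table 1 concrete, the following sketch computes the ratio of load-from-files time to reorganize-within-DB time for each row. The numbers are copied from Table 1; the time unit is not stated in the source, which does not matter for a ratio. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* One row of Table 1; times are in the table's (unstated) unit. */
struct reorg_row {
    double size_gb;
    double pg_files, pg_in_db;        /* Postgres columns */
    double scidb_files, scidb_in_db;  /* SciDB columns */
};

static const struct reorg_row table1[] = {
    {  1.0,  2.7,  1.2,  8.4, 3.2 },
    {  5.0,  6.8,  2.5, 14.3, 7.6 },
    { 10.0, 10.56, 4.7, 22.6, 8.9 },
};

/* Ratio of load-from-files time to reorganize-within-DB time. */
double reorg_ratio(double from_files, double in_db)
{
    assert(in_db > 0.0);
    return from_files / in_db;
}

/* Prints one line per data size with both ratios. */
void print_reorg_ratios(void)
{
    size_t i;

    for (i = 0; i < sizeof table1 / sizeof table1[0]; i++) {
        printf("%5.1f GB  Postgres %.2fx  SciDB %.2fx\n",
               table1[i].size_gb,
               reorg_ratio(table1[i].pg_files, table1[i].pg_in_db),
               reorg_ratio(table1[i].scidb_files, table1[i].scidb_in_db));
    }
}
```

For every row in the table the ratio is above 1, i.e. the reported time for reorganizing within the database is smaller than the reported time for loading from files.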
5. RELATED WORK
Typically, simulation data is written to data files in write-optimized, self-describing formats to be analyzed later. However, with increasing volumes of data and the need to reduce the time to analysis, different analysis approaches have emerged. In [4] and [5], data is not reformatted for analysis; instead, analysis is enabled over write-optimized data files. In [4], this is achieved by integrating the self-describing scientific file formats with the HDFS file system, and subsequently using map-reduce methods for analysis. In [5], analysis is enabled by translating common queries, written in a native language, into appropriate library calls of the scientific file formats. In both cases, only simple sub-selections and aggregations are enabled. Databases allow more complex analysis. However, ingesting data into a database and monitoring it can be onerous. In this paper, we have reduced this overhead by directly linking HPC programs with database statements, thus automating the process for simulation-based science that generates a lot of data and would like to ingest it automatically into databases instead of files.
The operating system provides a different and simplified model for writing data. Optimizations in the operating system have long been a topic of discussion in the systems research community. In this paper, we have kept most of the optimizations that the OS provides, and yet provided a database that can slowly be adapted on a need basis for analysis. While our approach is preliminary, we plan to extend it to different operating systems, such as Windows and Mac OS X, and to determine the appropriate system libraries.
6. CONCLUSIONS
In this paper, we described a method to transparently introduce and link databases within the natural workflow of
• Overhead of redirection
• Dynamically-linked library may get off-loaded
• 0.04 ms ± 0.02 ms
Conclusions and Current Work
• A database interface generator library for data-intensive scientific applications that requires no source code changes and loads data to be reorganized later.
• Relax the assumptions
• Loading is still an issue
• Source-code release
• Thank You!
• Acknowledgements
• Salman Habib, Katrin Heitmann, Steve Rangel, Hal Finkel, Ravi Madduri
A vision
[Diagram labels: HPC Cluster, High-bandwidth Network, Database Servers]