

Design for a High Performance, Configurable caGrid Data Services Platform Peter Hussey LabKey Software, Inc, Seattle, WA USA

Contact: [email protected]

Abstract

Conclusion

caCORE SDK

Current Process for Building caGrid Data Services

The Introduce toolkit focuses on the runtime architecture and implementation that allow application providers to make services available on the internet and allow application users to discover and access those services in a structured, predictable way. It is based on a general model called Web Services, a complex, comprehensive framework for applications to interact with other applications over the internet. The primary goal of the Introduce toolkit is to manage the mechanisms for discovery, security, and invocation of application services so that a provider need not be an expert in Web Services technology to connect an application to the web in ways that others can use.

The caGrid Introduce toolkit distinguishes between fixed-function “Application Services” and more general “Data Services”. An Application Service publishes a fixed set of questions that can be asked of it, whereas a Data Service supports a well-defined query mechanism that allows an unlimited variety of questions to be asked of it. Introduce uses a caCORE-generated application as the basis for a Data Service. Most of the functionality in Introduce, however, works the same for Application Services and Data Services. It generates Java code and configuration files to create a web application that can

the query and the SQL database where it is resolved. Then the result data must travel back through those same process hops to reach the client. Process hops usually involve serialization to a format such as XML for transfer across processes or across the network, creating extensive processing work that adds no value to either the provider or the consumer of the data.
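The cost of repeated serialization can be made concrete with a minimal sketch. It is not caGrid code: the helper names (`to_xml`, `from_xml`, `send_through_hops`) and the row format are assumptions chosen for illustration, and the hops are simulated in a single process.

```python
import xml.etree.ElementTree as ET


def to_xml(rows):
    """Serialize a list of dict rows to an XML string (keys must be valid tag names)."""
    root = ET.Element("rows")
    for row in rows:
        item = ET.SubElement(root, "row")
        for key, value in row.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Parse the XML produced by to_xml back into a list of dict rows (all values are strings)."""
    root = ET.fromstring(text)
    return [{field.tag: field.text for field in item} for item in root]


def send_through_hops(rows, hops):
    """Simulate a result set crossing `hops` process boundaries.

    Every hop serializes the full result and parses it again, so the work
    grows linearly with the number of hops while the data stays unchanged.
    """
    bytes_moved = 0
    for _ in range(hops):
        payload = to_xml(rows)
        bytes_moved += len(payload.encode("utf-8"))
        rows = from_xml(payload)
    return rows, bytes_moved


# Hypothetical result set; values are strings so the round trip is lossless.
rows = [{"id": "1", "tissue": "lung"}, {"id": "2", "tissue": "breast"}]
after_three, moved_three = send_through_hops(rows, 3)
_, moved_one = send_through_hops(rows, 1)
```

In this sketch the rows that arrive after three hops are identical to the rows that left the database, yet three times as many bytes were serialized and parsed as for a single hop.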

Proposed Design for a Data Services Platform

caGrid-connected data services are built today with a code generation paradigm, as implemented in the caCORE SDK and the Introduce toolkit. The resulting applications are complex to administer, difficult to extend, and slow relative to common data access mechanisms used on the web today. These problems can be remedied by taking an end-to-end approach that combines all of the functionality needed to implement one or more caGrid data services into a single web server application. In this poster, LabKey Software outlines a design for such a development methodology, based on LabKey Server.

In our experience with database applications, process hops are a first-order determinant of performance. Another aspect of poor performance stems from the inability to combine multiple data models under a single Data Service. This means the federated query service will be required in many more use cases, and the matching of data from multiple domains will occur only after moving the data from the SQL database where it is stored to the federated query service. If both data domains could be combined under a single Data Service, the matching could be done at the SQL database, a job that SQL databases do very well.
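The difference between matching data after it leaves the database and matching it inside the database can be illustrated with a small, self-contained sketch. The table names, columns, and sample values are hypothetical, and an in-memory SQLite database stands in for the production SQL engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE specimen (specimen_id INTEGER PRIMARY KEY, tissue TEXT);
CREATE TABLE assay_result (result_id INTEGER PRIMARY KEY, specimen_id INTEGER, score REAL);
INSERT INTO specimen VALUES (1, 'lung'), (2, 'breast'), (3, 'lung');
INSERT INTO assay_result VALUES (10, 1, 0.91), (11, 3, 0.42), (12, 3, 0.77);
""")


def match_in_federation_layer(conn):
    """Pull both domains out in full, then match rows outside the database."""
    specimens = conn.execute("SELECT specimen_id, tissue FROM specimen").fetchall()
    results = conn.execute("SELECT specimen_id, score FROM assay_result").fetchall()
    rows_moved = len(specimens) + len(results)
    tissue_by_id = dict(specimens)
    matched = sorted(
        (tissue_by_id[sid], score)
        for sid, score in results
        if tissue_by_id.get(sid) == "lung" and score > 0.5
    )
    return matched, rows_moved


def match_in_database(conn):
    """Let the SQL engine join and filter; only matching rows leave the database."""
    rows = conn.execute("""
        SELECT s.tissue, r.score
        FROM specimen s
        JOIN assay_result r ON r.specimen_id = s.specimen_id
        WHERE s.tissue = 'lung' AND r.score > 0.5
        ORDER BY r.score
    """).fetchall()
    return rows, len(rows)


federated_rows, federated_moved = match_in_federation_layer(conn)
database_rows, database_moved = match_in_database(conn)
```

Both functions return the same matches, but the first moves every row of both tables out of the database before matching, while the second moves only the joined result.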

Given the degree of investment in caBIG technologies, we argue that the caBIG initiative needs to invest in a better way to create Data Services on caGrid. We propose a Data Services platform as an alternative to the current code generator approach. The starting point in our design, a UML class and data model, is the same as in the current methodology. The end product is also the same: one or more data services that respond consistently to CQL queries sent from caGrid-connected clients. The proposed implementation, however, relies on extending a single base platform via a plug-in module design. The proposed design minimizes the amount of code specific to a data source that needs to be created. The platform itself implements all of the generic functionality needed to respond to caGrid data service queries, including CQL query resolution and Web Services serialization. These generic platform services need only configuration data from the module.
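One way to picture the split between a generic platform and configuration-only modules is the sketch below. It is an illustration under stated assumptions, not LabKey Server's module API: the class names `DataServiceModule` and `DataServicesPlatform`, their fields, and the generated SQL are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataServiceModule:
    """Configuration a module contributes; it carries no query-handling code."""
    name: str             # data service name exposed on the grid (hypothetical)
    schema: str           # database schema that holds the module's tables
    class_to_table: dict  # model class name -> table name


class DataServicesPlatform:
    """Generic platform: query resolution lives here, once, for all modules."""

    def __init__(self):
        self._modules = {}

    def install(self, module):
        """Add a module to the running platform without regenerating anything."""
        if module.name in self._modules:
            raise ValueError(f"module already installed: {module.name}")
        self._modules[module.name] = module

    def resolve(self, service_name, target_class):
        """Turn a request for a model class into SQL using only module configuration."""
        module = self._modules[service_name]
        table = module.class_to_table[target_class]
        return f"SELECT * FROM {module.schema}.{table}"


platform = DataServicesPlatform()
platform.install(DataServiceModule("proteomics", "prot", {"Peptide": "peptides"}))
platform.install(DataServiceModule("specimens", "spec", {"Specimen": "specimen"}))
query = platform.resolve("specimens", "Specimen")
```

Two data services share one platform instance here, and adding a third would require only another configuration object.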

Figure 1. The caCORE SDK build process. (Diagram labels: UML Class Model, UML Data Model, SDK Build Process, Web Application, SQL Schema (Tables), Client APIs; client API options: CQL, Criteria, HQL, QBE, SOAP, XML, Java, Local, Remote, Write, Read only.)

Figure 2. The caGrid Introduce toolkit. (Diagram labels: Registered Class Models (caDSR, GME), SDK-generated files, caCORE applications, Introduce toolkit, Service Definitions, Deployed Grid Services, Index Services, Application Services, Data Services, Security Services.)

Building a Data Service on caGrid today requires a combination of two development toolkits: the caCORE Software Development Kit (SDK) and the caGrid Introduce toolkit. While these toolkits share some similarities, they are developed by two different teams and are designed to support many use cases other than the creation of Data Services on caGrid. The predictable result is that creating grid Data Services is a complex undertaking that produces sub-optimal results.

The caCORE SDK was designed to facilitate the creation of database-backed web applications for specific domains of cancer research and clinical practice. It is based on a software development paradigm that starts with an abstract model of the entities represented in a particular application. This data model is tightly described and published in a way that allows users of the applications it generates to find out the intended meaning and use of the data. The database definitions and the data models that drive them are the basis for caGrid Data Services. But as the primary application development tool for “caCORE-like” applications, the SDK has to serve a much wider set of users than just developers of Data Services.

There are three phases to developing an application in the caCORE paradigm, as depicted in Figure 1:

1. Describe objects and their relationships in Unified Modeling Language (UML), both as a UML Class model and as a UML Data model.

2. Validate the classes and attributes of the UML model objects with NCI’s Enterprise Vocabulary Services (EVS) and register the model in the Cancer Data Standards Repository (caDSR).

3. Run the SDK build process to generate the application, converting the model into three runtime entities:

• Database definition scripts, in the form of SQL CREATE TABLE commands that implement the Data model described in UML.

• A web application that implements the UML Class model and can translate requests for objects into SQL commands.

• A set of programming interface libraries that enable applications to query objects managed by the web application. The SDK supports several different communication channels and several equivalent ways to specify a query. It also offers the option of a “writeable” interface that supports updates, inserts, and deletions of the data.

At the core of the generated caCORE web application is Hibernate, an open source middleware layer for mapping Java programming objects into SQL table objects and vice versa. The caCORE SDK build process translates the UML model into configuration files that allow Hibernate to construct complex queries that traverse relationships between objects and turn them into appropriate SQL constructs.
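The first output of the build process, database definition scripts derived from the model, can be sketched as follows. This is not the caCORE SDK's generator, which works from a full UML model and Hibernate mappings; the `TYPE_MAP` table, the `create_table_sql` function, and the example class are assumptions made for illustration.

```python
# Hypothetical mapping from model attribute types to SQL column types.
TYPE_MAP = {"Integer": "INTEGER", "String": "VARCHAR(255)", "Date": "DATE"}


def create_table_sql(class_name, attributes, primary_key="id"):
    """Build a CREATE TABLE statement for one model class.

    `attributes` is a list of (attribute_name, model_type) pairs, a much
    simplified stand-in for the classes and attributes of a UML Data model.
    """
    columns = []
    for attr_name, attr_type in attributes:
        column = f"{attr_name} {TYPE_MAP[attr_type]}"
        if attr_name == primary_key:
            column += " PRIMARY KEY"
        columns.append(column)
    body = ",\n  ".join(columns)
    return f"CREATE TABLE {class_name.lower()} (\n  {body}\n);"


specimen_sql = create_table_sql(
    "Specimen",
    [("id", "Integer"), ("tissue", "String"), ("collected_on", "Date")],
)
```

A real generator would also emit foreign keys for associations between classes, which is the part Hibernate later relies on to traverse relationships.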

caGrid Introduce Toolkit

Figure 3. Data Services as they are structured today. (Diagram labels: Web Services Clients, caGrid Portal, Federated Query Service, caGrid Data Service, two caCORE applications each with its own SQL database; connections labeled CQL, Java, Remote, Read only; three process hops marked.)

• Register its identity and address with an Index Service

• Authenticate incoming requests and determine whether they are authorized to perform a certain function

• Validate the requests and respond in the manner described by the Web Services framework.

The core technology managed by the Introduce toolkit is the Globus framework for supporting Web Services.

Figure 3 depicts the result of combining the two toolkits to create a caGrid Data Service. Some uses of these Data Services only require data from a single caCORE-generated application instance. Other uses need to combine results from multiple caCORE application instances. For this use case caGrid offers a Federated Query Processor that can split a single query across multiple caCORE application instances and combine the results into a single set of objects. Tracing the data flow in Figure 3 illustrates some of the shortcomings of the current approach to building data services:

Complex: Many independent components must all be coordinated for a successful deployment. This complexity makes Data Services not only difficult to develop but also difficult to administer.

Inflexible: The application development paradigm is that a caCORE-generated application supports one and only one data model. Likewise, the caGrid Data Service front end supports only one back-end caCORE application server. The generator technology used in both toolkits makes it difficult to even begin to combine functionality between data models or between the caGrid and caCORE infrastructures.

Slow: Current data services queries on caGrid require that query requests go through at least three different process boundaries between the client program issuing

Problems with the current approach

We propose a new approach to creating Data Services on caGrid that addresses the problems with the current methodology. The design is based on the idea of a single web application instance that implements all of the necessary functionality between the Data Service client applications and the SQL database that stores the data they want to access. This single Data Services platform would be based on the following design principles:

Configuration instead of generation: The inputs to the current toolkits comprise the key elements that make one Data Service a discoverable, secure, usable entity on the grid. These inputs include the registered data model and associated artifacts created by the caCORE toolkit, and the services metadata collected and configured by Introduce. But instead of using these inputs to generate multiple standalone, single-purpose web applications, the proposed Data Services platform would package them into a plug-in module that a developer or administrator can install on an existing running server. Multiple such Data Services modules could be installed on the same server.

Make simple scenarios simple: Many Data Services scenarios involve simple, rectangular data sets of the kind that come from spreadsheets. It should not take an expert in build scripts and multiple toolkits to add support for storing and sharing such simple datasets.

Create one optimized pathway to the data: The caCORE toolkit supports many programming interfaces to access a caCORE application, but only a very specific subset of these is used in the Data Services scenario. There would be no reason to implement the other interfaces in the proposed Data Services platform. The proposed platform would offer CQL/D-CQL queries directly to web services clients, with read-only access through these mechanisms.

Push queries to the SQL database wherever possible: Federated queries are equivalent to the “distributed” query features that every major SQL database offers. Distributed query features are often used as sales differentiators, but in our experience they are rarely used in real-world applications. They suffer from both poor performance and high complexity. To avoid needing distributed queries, the Data Services platform would support multiple data services out of one SQL database. The federated query service would need some way to detect this case and send entire branches of a distributed query to a single Data Service endpoint that has access to the underlying SQL engine.
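The detection step for the federated query service could look roughly like the sketch below. The source does not specify a mechanism, so everything here is an assumption: the `plan_branches` function, the branch and service identifiers, and the idea that each data service advertises an identifier for its backing database.

```python
from collections import defaultdict


def plan_branches(branches, service_database):
    """Group the branches of a distributed query by backing SQL database.

    branches: list of (branch_id, service_name) pairs.
    service_database: service_name -> identifier of the SQL database behind it
        (hypothetical metadata a data service would have to advertise).

    Returns (pushed_down, federated): branches that share a database and can be
    sent together to one endpoint, and branches that still need federation.
    """
    by_database = defaultdict(list)
    for branch_id, service in branches:
        by_database[service_database[service]].append(branch_id)

    pushed_down = {db: ids for db, ids in by_database.items() if len(ids) > 1}
    federated = [ids[0] for ids in by_database.values() if len(ids) == 1]
    return pushed_down, federated


branches = [("b1", "specimens"), ("b2", "assays"), ("b3", "imaging")]
service_database = {"specimens": "db_a", "assays": "db_a", "imaging": "db_b"}
pushed_down, federated = plan_branches(branches, service_database)
```

With these inputs, branches `b1` and `b2` can be resolved by a single SQL engine, and only `b3` still has to be combined in the federation layer.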

Progress and Hurdles

LabKey Server would be a strong starting point for developing an optimized Data Services platform for caGrid. LabKey Server already provides the following:

• A plug-in module architecture that is already used to dynamically extend the server with new database tables and new query metadata. This includes a well-developed mechanism for running SQL scripts that create and change database tables and views at startup time.

• A query architecture that makes use of metadata derived from an ontology

• Initial use of Globus libraries in the pipeline module

• An implementation of Data Services for proteomics data

• A development team that thoroughly understands SQL and web technologies and is adept at delivering solutions.

Nevertheless, the design proposed here would require significant development investment to bring LabKey Server to the point of proof of concept. Both of the current toolkits used for creating Data Services are complex, and subsuming some of their functionality in LabKey would require a thorough understanding of their components.