the cancer translational research informatics platform

13
BioMed Central Page 1 of 13 (page number not for citation purposes) BMC Medical Informatics and Decision Making Open Access Software The cancer translational research informatics platform Patrick McConnell 1 , Rajesh C Dash 2 , Ram Chilukuri 1 , Ricardo Pietrobon 2 , Kimberly Johnson 2 , Robert Annechiarico 2 and A Jamie Cuticchia* 2,3 Address: 1 SemanticBits, Herndon, Virginia, USA, 2 Duke University Medical Center, Durham, North Carolina, USA and 3 Duke Bioinformatics Group, Duke Comprehensive Cancer Center, DUMC Box 2717, USA Email: Patrick McConnell - [email protected]; Rajesh C Dash - [email protected]; Ram Chilukuri - [email protected]; Ricardo Pietrobon - [email protected]; Kimberly Johnson - [email protected]; Robert Annechiarico - [email protected]; A Jamie Cuticchia* - [email protected] * Corresponding author Abstract Background: Despite the pressing need for the creation of applications that facilitate the aggregation of clinical and molecular data, most current applications are proprietary and lack the necessary compliance with standards that would allow for cross-institutional data exchange. In line with its mission of accelerating research discoveries and improving patient outcomes by linking networks of researchers, physicians, and patients focused on cancer research, caBIG (cancer Biomedical Informatics Grid™) has sponsored the creation of the caTRIP (Cancer Translational Research Informatics Platform) tool, with the purpose of aggregating clinical and molecular data in a repository that is user-friendly, easily accessible, as well as compliant with regulatory requirements of privacy and security. Results: caTRIP has been developed as an N-tier architecture, with three primary tiers: domain services, the distributed query engine, and the graphical user interface, primarily making use of the caGrid infrastructure to ensure compatibility with other tools currently developed by caBIG. The application interface was designed so that users can construct queries using either the Simple Interface via drop-down menus or the Advanced Interface for more sophisticated searching strategies to using drag-and-drop. Furthermore, the application addresses the security concerns of authentication, authorization, and delegation, as well as an automated honest broker service for deidentifying data. Conclusion: Currently being deployed at Duke University and a few other centers, we expect that caTRIP will make a significant contribution to further the development of translational research through the facilitation of its data exchange and storage processes. Background In order to have an impact in society, discoveries in cancer research need to be translated into knowledge that can be directly applied to treatment and prevention. These dis- coveries usually start within the basic sciences, from experiments developed at the molecular level, slowly pro- gressing to clinical research. Although this translational process is at the very basis of our ability to generate new biomedical knowledge, to date few tools have been devel- oped to successfully link the basic and clinical science Published: 24 December 2008 BMC Medical Informatics and Decision Making 2008, 8:60 doi:10.1186/1472-6947-8-60 Received: 7 March 2008 Accepted: 24 December 2008 This article is available from: http://www.biomedcentral.com/1472-6947/8/60 © 2008 McConnell et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Upload: patrick-mcconnell

Post on 01-Oct-2016

219 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: The cancer translational research informatics platform

BioMed Central

BMC Medical Informatics and Decision Making

ss

Open AcceSoftwareThe cancer translational research informatics platformPatrick McConnell1 Rajesh C Dash2 Ram Chilukuri1 Ricardo Pietrobon2 Kimberly Johnson2 Robert Annechiarico2 and A Jamie Cuticchia23

Address 1SemanticBits Herndon Virginia USA 2Duke University Medical Center Durham North Carolina USA and 3Duke Bioinformatics Group Duke Comprehensive Cancer Center DUMC Box 2717 USA

Email Patrick McConnell - patrickmcconnellsemanticbitscom Rajesh C Dash - rdashdukeedu Ram Chilukuri - ramchilukurisemanticbitscom Ricardo Pietrobon - rpietrodukeedu Kimberly Johnson - kimjohnsondukeedu Robert Annechiarico - bobannechiaricodukeedu A Jamie Cuticchia - jamiecuticchiadukeedu

Corresponding author

AbstractBackground Despite the pressing need for the creation of applications that facilitate theaggregation of clinical and molecular data most current applications are proprietary and lack thenecessary compliance with standards that would allow for cross-institutional data exchange In linewith its mission of accelerating research discoveries and improving patient outcomes by linkingnetworks of researchers physicians and patients focused on cancer research caBIG (cancerBiomedical Informatics Gridtrade) has sponsored the creation of the caTRIP (Cancer TranslationalResearch Informatics Platform) tool with the purpose of aggregating clinical and molecular data ina repository that is user-friendly easily accessible as well as compliant with regulatoryrequirements of privacy and security

Results caTRIP has been developed as an N-tier architecture with three primary tiers domainservices the distributed query engine and the graphical user interface primarily making use of thecaGrid infrastructure to ensure compatibility with other tools currently developed by caBIG Theapplication interface was designed so that users can construct queries using either the SimpleInterface via drop-down menus or the Advanced Interface for more sophisticated searchingstrategies to using drag-and-drop Furthermore the application addresses the security concerns ofauthentication authorization and delegation as well as an automated honest broker service fordeidentifying data

Conclusion Currently being deployed at Duke University and a few other centers we expect thatcaTRIP will make a significant contribution to further the development of translational researchthrough the facilitation of its data exchange and storage processes

BackgroundIn order to have an impact in society discoveries in cancerresearch need to be translated into knowledge that can bedirectly applied to treatment and prevention These dis-coveries usually start within the basic sciences from

experiments developed at the molecular level slowly pro-gressing to clinical research Although this translationalprocess is at the very basis of our ability to generate newbiomedical knowledge to date few tools have been devel-oped to successfully link the basic and clinical science

Published 24 December 2008

BMC Medical Informatics and Decision Making 2008 860 doi1011861472-6947-8-60

Received 7 March 2008Accepted 24 December 2008

This article is available from httpwwwbiomedcentralcom1472-6947860

copy 2008 McConnell et al licensee BioMed Central Ltd This is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby20) which permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Page 1 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

fields in a way that researchers from both arenas can easilymake connections More specifically cancer researchwould benefit from the development of applications thatcan aggregate clinical and molecular data in a repositorythat is user-friendly easily accessible as well as compliantwith regulatory requirements of privacy and security

In alignment with the requirements outlined above theDuke Comprehensive Cancer Center (DCCC) in collabo-ration with SemanticBits LLC has developed the CancerTranslational Research Informatics Platform (caTRIPhttpscabigncinihgovtoolscaTRIP this and all caBIGlinks last current as of July 2008) a translational researchsystem for the Cancer Biomedical Informatics Grid(caBIG) project [1] caTRIP allows users to query across anumber of caBIG data services join on common data ele-ments (CDEs) and view their results in a user-friendlyinterface Having outcomes analysis as its initial focuscaTRIP allows clinicians to query across data from existingpatients with similar characteristics to find treatments thatwere administered with success In doing so caTRIP canhelp inform treatment and improve patient care as well asenable searching for available tumor tissue locatingpatients for clinical trials and investigating the associa-tion between multiple predictors and their correspondingoutcomes (such as survival) Of importance caTRIP relieson the vast array of open source caBIG applicationsincluding (1) the Tumor Registry a clinical system that isused to collect endpoint data (2) the cancer Text Informa-tion Extraction System (caTIES httpscabigncinihgovtoolscaties) a locator of tissue resources that works viathe extraction of clinical information from free text surgi-cal pathology reports while using controlled terminolo-gies to populate caBIG-compliant data structures (3)caTissue CORE httpscabigncinihgovtoolscatissuecore a tissue bank repository tool for biospecimeninventory tracking and basic annotation (4) CancerAnnotation Engine (CAE httpscabigncinihgovtoolscae) a system for storing and searching pathology anno-tations (5) caIntegrator httpgforgencinihgovprojectscaintegrator) a tool for storing querying andanalyzing translational data including SNP data

The objective of this article is to describe the caTRIP sys-tem in relation to its objectives overall software architec-ture security implementation usability and futuredevelopment plans

Case Study Outcomes AnalysisA motivating use case for the development of caTRIP isone of outcomes analysis whereby a clinician can querydata from a cohort of preexisting patients to help guidetreatment of another patient An oncologist can ask Whatare the treatments and outcomes of other patients thathave similar characteristics to my patient caTRIP makes

this possible on multiple levels local institutionalregional national and beyond by leveraging grid infra-structure to scale beyond traditional query systems

Consider an example of a patient with a breast lumpapproaching an oncologist for treatment at a tertiary carefacility The patient has been told by her local physicianthat the tumor she has is not uncommon but difficult tomanage The oncologist reviews the medical documentsfrom the outside institution and recognizes an entity thatis well-equipped to handle a lobular carcinoma of thebreast What is slightly odd is that this tumor has an inter-mediate nuclear grade and has been further designated asa pleomorphic lobular carcinoma The oncologist isfamiliar with the medical literature on the topic and rec-ognizes that in certain circumstances such an entity mayexhibit biologic behavior akin to an invasive ductal tumor(rather than that of a typical lobular) Paralleling the oddhistologic finding is an additional finding that the tumoris negative for estrogen receptor (ER) expression negativefor progesterone receptor (PR) expression but stronglypositive for human epidermal growth factor receptor 2(Her-2neu) overexpression This profile is very uncharac-teristic for a lobular carcinoma of the breast Thinking thatthe tumor may have been misclassified the oncologist hasthe pathology material reviewed again and re-runs themolecular studies to confirm the receptor studies reportedfrom the outside institution To complicate matters thereviewing pathologist reclassifies the tumor as a typicallobular carcinoma (not pleomorphic) but the receptorstudy profiles (ER PR) and Her-2neu status remainunchanged Now there is no diagnostic correlate for thereceptor profile results Lobular tumors (particularly typi-cal ones) are nearly always positive for ER and PR receptor(and usually strongly so) and negative for Her-2neu over-expression

As an example at Duke the Breast Pathology service willtypically see one or two such cases per year at most Theexpected biologic behavior and sensitivity to various treat-ment protocols are unknown in such cases The body ofliterature on this particular scenario of findings is stillmaturing [2-5] The fact of the matter is that even a largetertiary care facility will rarely come across such tumorscaTRIP provides a mechanism to examine all thosepatients who have been seen at any institution wherecaTRIP has been deployed and identify the cohort ofpatients that matches the specified criteria It is a matter ofa few minutes at most to construct a query that narrowsthe cohort to patients that have lobular tumors of thebreast are ER negative PR negative and Her-2neu over-expressed (Figure 1) Furthermore the type of treatmentemployed and the time to recurrence or death are easilycaptured as part of the result set by drawing upon theTumor Registry as an additional data source This permits

Page 2 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

a level of decision in the clinic ndash heretofore impossiblewith prior technologies Running such a query at Dukereturns very few cases because of the relative rarity of suchclinical phenotypes However caTRIP is grid-enabledand if deployed at Cancer Centers across the country aspart of the NCIs caBIG initiative the statistical powercould easily increase by multiple orders of magnitudeRather than relying on single-institution or limited datasets published in the literature an oncologist now hasaccess to a rich live data set that can provide strong statis-tically significant data in mere minutes

While the preceding scenario focuses on facilitating clini-cal care caTRIP is adept at enabling research as well Con-sider a follow-up situation in which a researcher wouldlike to take that cohort of patients that have lobular carci-noma and Her-2neu overexpression and compare them

to a cohort that have ductal tumors that also overexpressHer-2neu as well as to a third group that have lobulartumors but not overexpress Her-2neu DNA arrays can beused to test for the presence of hundreds or even thou-sands of gene sequences in relatively short order if tissuesamples are available caTRIP can cross-query linkedbiospecimen repositories to identify samples that areavailable for the three cohorts of patients and retrievesamples that match specified criteria (eg frozen diseasedtissue with match adjacent normal tissue) With informat-ics solutions like caTRIP and newer molecular technolo-gies like DNA arrays the time and effort to develop newdiagnostic and prognostic markers of disease can be sig-nificantly reduced

Screenshot of the Simple Interface depicting a outcomes analysis translational queryFigure 1Screenshot of the Simple Interface depicting a outcomes analysis translational query

Page 3 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Design ObjectivesThe objectives and sample questions that caTRIP isdesigned to address include

1 Allow for the search of available tumor tissue forpatients with factors of interest A typical question to beanswered by caTRIP would thus be What are all the tissuespecimens from Her-2neu positive patients that have a primarytumor in the breast and are BRCA1 positive

2 Identify predictors contributing to increased ordecreased survival among a subset of cancer patientscaTRIP should answer questions such as What are all theER positive patients that have survived breast cancer after radi-ation treatment

3 Determine the distribution over time of a number offactors that may contribute to or reflect progression of adisease state Does a change in pathology biomarkers overtime contribute to recurrence or death

4 Determine the association between a multitude of clin-ical predictors and their respective post-operative out-comes such as Does a change in ER or PR status before andafter surgery correlate with other factors

5 Identify patients eligible for participation in clinical tri-als that meet a pre-defined set of inclusion and exclusioncriteria such as List all patients that are triple negative (ERPR and Her-2neu negative)

6 Identify pathology reports of interest such as Show meall pathology reports for Her-2neu positive patients with a lob-ular carcinoma

In addition it is important that the system provide secu-rity and confidentiality levels in accordance with HIPAA(Health Insurance Portability and Accountability Act) andpart 11 FDA regulations

ImplementationTechnical RequirementsOur high level scientific use cases were in part motivatedby the need to access a variety of data types stored in anumber of data systems at Duke The Medical Assistant onthe World Wide Web (MAW3) data from the GeneticModifiers of BRCA 12 study (GEMS) and Tumor Registrydata sources feed caTRIP pathology clinical tissue bank-ing single nucleotide polymorphism (SNP) and treat-mentendpoint data From this baseline we defined theoutcomes analysis use cases that feed into the system andapplication requirements that define the overall set of fea-tures that caTRIP implements (see Background for scien-tific use cases)

The system requirements primarily relate to developmentin the Cancer Biomedical Informatics Grid (caBIG) initia-tive [1] caBIG has outlined a rigorous set of requirementsrelated to interoperability that span the categories of leg-acy bronze silver and gold compatibility Silver compat-ibility places requirements on both the data elements thatare exposed toby caTRIP as well as the application pro-gramming interfaces (APIs) Data elements leveraged incaTRIP must be defined as common data elements(CDEs) registered in the Cancer Data Standards Reposi-tory (caDSR httpncicbncinihgovNCICBinfrastructurecacore_overviewcadsr) Each CDE must have one ormore semantic concepts registered in the EnterpriseVocabulary Service (EVS httpncicbncinihgovNCICBinfrastructurecacore_overviewvocabulary) Thiscombination of defining the structure of the data as wellas the underlying semantic concepts describing the mean-ing of the data provide caTRIP with semantically richinformation models to query upon However in order toquery across different data systems it is critical that theCDEs be harmonized This allows for data to be aggre-gated and presented to the user as well as provides com-mon CDEs or keys to join on One way to catalogue theseforeign CDEs is through a Master Person Index whichwould provide a common identifier to link patients acrosssystems [6] It is possible though not without challengesto identify and link patients within a single institutionusing an institutional medical record number (MRN)However queries that join on patients across institutionalboundaries can be quite challenging

Compatibility of the APIs in caTRIP falls more squarelyinto caBIG gold compatibility which has not fully beendefined by the caBIG program as of yet All data that isaccessible to caTRIP must be exposed via caGrid data serv-ices [7] This means that the data is queriable via a well-defined standards-based interface that accepts queries viathe caGrid Common Query Language (CQL) whichdescribes the query in object-oriented terms correspond-ing to the registered CDEs Finally there are also systemrequirements related to security which include authenti-cating users based on credentials and authorizing users toaccess data

Software architecturecaTRIP is developed as an N-tier architecture with threeprimary tiers domain services the distributed queryengine (DQE) and the graphical user interface (Figure 2)The domain services are grid services that expose informa-tion models that encapsulate the underlying data The dis-tributed query engine takes as input a distributedcommon query language (DCQL) query decomposes itinto individual common query language (CQL) queriesissues those queries to the domain services joinsaggre-gates the results and returns the final results to the caller

Page 4 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

The graphical user interface (GUI) provides a metadata-driven interface for building and running DCQL It alsoprovides an interface for saving and searching for queries

Another important goal for caTRIP is to leverage existingtechnologies wherever possible These include

bull Globus 41 for the underlying messaging stack

bull caGrid 10 for the underlying service infrastructure

bull Introduce for service creation

bull Index Service for discovering services on the grid

bull Authentication Service to authenticate users using theirinstitutional credentials

bull Dorian to obtain a grid user proxy

bull Grid Grouper to authorize users to access data based ongroups

bull Common Security Module (CSM) for plugging authoriza-tion into the data service

bull Cancer Data Standards Repository (caDSR) for storingcommon data elements (CDEs)

High-level view of the caTRIP architecture including Domain Services Distributed Query Engine and Graphical User InterfaceFigure 2High-level view of the caTRIP architecture including Domain Services Distributed Query Engine and Graph-ical User Interface

Page 5 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

bull Enterprise Vocabulary Services (EVS) for storing semanticconcepts

bull Hibernate for an object-relational mapping

bull Oracle and MySQL for backend relational databases

More information can be found in the caTRIP architecturedocument and UML models

bull httpgforgencinihgovpluginsscmcvscvswebphpcatripdocumentationtechnical_guidecaTRIP_Technical_Guidedoccvsroot=catrip

bull httpgforgencinihgovpluginsscmcvscvswebphca-tripdocumentationmodelscaTRIP_UMLdoccvsroot=catrip

Implementation at DukeWe deployed caTRIP at Duke to leverage data from theGenetic Modifiers of BRCA1 and BRCA2 study in theDuke Breast Specialized Program of Research Excellence(SPORE) The goal of the GEMS investigation is to identifyfactors that influence the incidence of breast cancer ingermline BRCA12 mutation carriers To this end wedeployed tissue banking and pathology data from MAW3into caTissue CORE CAE and caTIES treatment recur-rence and endpoint data from our Tumor Registry into agrid service and SNP data from flat files into caIntegratorThese services were secured and then exposed to thecaTRIP interface This deployment allowed us to answerall of the domain questions that are found in the DesignObjectives section of this article Users of this pilotdeployment included a set of pathologists and clinicianswho provided feedback that guided the developmentproviding important feedback such as the use of menu-drive interfaces that require little knowledge of the back-end data instead of drag-and-drop interfaces that dorequire such knowledge

Results and discussionGraphical User InterfaceThe Graphical User Interface (GUI) is a Java Swing-basedthick client that can be accessed via Java WebStart httpjavasuncomproductsjavawebstart It provides the userwith the ability to construct execute and share distrib-uted queries in a graphical fashion There are two primaryways to construct queries the Simple Interface is a morelimited yet powerful way to build queries using drop-down menus and the Advanced Interface is a flexible wayto build queries using drag-and-drop

Simple InterfaceThe Simple Interface was developed to target the non-technical users such as a clinician Queries are constructed

using drop-down menus to add filters ANDOR groupsand return attributes (Figures 1 and 3) Available servicesare selected via a configuration file as well as the CDEsthat make up the drop-down menu items for filters andreturn attributes Joins across these services are performedusing common identifiers Specifically in caTRIP we usethe (optionally) deidentified Medical Record Number(MRN) of patients to join and aggregate data The datamodels of the underlying services are composed of poten-tially complex classes attributes and associationsbetween the classes (see Figure 4) In order to constructqueries that span multiple classes and associations theSimple Interface leverages a configuration file thatdescribes the outbound path from each class to the MRNas well as the inbound path to each filterablereturnableCDE This enables the automated logic that would nor-mally be performed by a user manually selecting all of theassociations necessary to perform a query

Advanced InterfaceThe Advanced Interface was developed to target a datatechnician who may wish to perform more complex que-ries Queries are constructed by dragging classes onto acanvas drawing associations between them and addingfilters along the way (Figure 4) Because the logic of howto traverse the information space is not built-in like it iswith the Simple Interface users must manually createthe association paths to the classesattributes that theywould like to filter on as well as those they would like toperform cross-service foreign joins with This allows usersto perform queries not supported by the Simple Inter-face but adds an extra level of complexity

Distributed Query EngineThe Distributed Query Engine (DQE) provides the logic toperform a query across different caGrid data services ThecaTRIP team developed a distributed query engine whichwas later adopted into the caGrid 10 toolkit as the Feder-ated Query Processor (FQP) module Currently the DQEis available for use by the general caBIG community andis currently being leveraged by the Cancer Bench to Bed-side (caB2B) project httpgforgencinihgovprojectscab2b It takes as input a Distributed Common QueryLanguage (DCQL) query breaks it into separate CommonQuery Language (CQL) queries issues those queries to theappropriate underlying data services receives the resultsperforms joins aggregates data and returns the resultsThe DQE is built as a Java service and can be deployed asa grid service

Domain ServicesThe domain services themselves can be further dividedinto a number of tiers including a relational databasemanagement system (RDBMS) Hibernate CommonQuery Language processor httpgforgencinihgovplu

Page 6 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

ginsscmcvscvswebphpcagrid-1-0DocumentatiodocsDataServiceTechnicaldataServiceTechnicalGuidedoccvsroot=cagrid-1-0 and grid service Hiber-nate provides an object-oriented abstraction of the under-lying relational database exposing tables and columns asclasses and attributes in a flexible configurable mannerhttpwwwhibernateorg The CQL processor translatesa CQL query based upon common data elements andtheir associations into a Hibernate Query Language(HQL) query This is a relatively complex mappingbecause CQL and HQL have much different syntaxesFinally the grid service provides a standards-based entrypoint to issue a CQL query caGrid httpscabigncinihgovworkspacesArchitecturecaGrid pro-vides the underlying technology stack for the grid servicewhich itself is based upon Globus 41 httpwwwglobusorg

SecuritycaTRIP leverages the caGrid 10 infrastructure which pro-vides robust security for a distributed grid environment(Figure 5)

Security AuthenticationThe first step in the security workflow is authenticationwhich is the process by which a trusted organizationasserts a user is who they claim to be This is implementedusing the caGrid Authentication Service The goal is for theuser to obtain a grid user certificate to be used in authori-zation The user submits their credentials (user name andpassword) to the AuthenticationService which has aninstitutional identity provider (IdP) plug-in This compo-nent performs the necessary steps to determine that theusers credentials are valid In the case of the Duke IdPplugin the user name and password are validated against

Steps to perform a query in the caTRIP Simple InterfaceFigure 3Steps to perform a query in the caTRIP Simple Interface

Page 7 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

a Duke Domain Controller which uses NT Security Thismeans that a user can log in with the same Cancer Centercredentials they use to log in to their workstation Oncevalidated the AuthenticationService returns a SecurityAssertion Markup Language (SAML) document httpwwwoasis-openorgcommitteesdownloadphp14361sstc-saml-tech-overview-20-draft-08pdf to the userwhich asserts that the user is who they claim to be TheSAML Assertion can then be passed off to a Dorian Serviceto generate a grid user certificate All of these interactionsare handled for the user by the caTRIP application

Security AuthorizationThe second step in the security workflow is authorizationwhich is the process by which a system grants the useraccess to specific resources This is implemented using the

Common Security Module (CSM) plug-in to the grid dataservice The CSM Phillips 2006 is an open sourcetoolkit that provides a flexible way to provision users andmake authorization decisions When a data field isaccessed on behalf of the user an authorization structureis queried to determine whether the user has permissionto access it If not then a security exception is returned tothe client Authorization is made possible because trustexists between the service serving the data resource andthe AuthenticationService

Delegation and Foreign CDEsAn alternate flow in the security workflow is delegationwhich is the process by which a service acts on behalf ofthe user This impacts caTRIP because the distributedquery engine can be run as a service so it must query the

Example Advanced Interfacing depicting the potential complexity of distributed queriesFigure 4Example Advanced Interfacing depicting the potential complexity of distributed queries

Page 8 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

domain services using the users certificate The finalimplementation will likely leverage the DelegationServicefrom the Globus project The user will signal the Delega-tionService that it is permissible for a collocated service toact on the users behalf In the meantime the distributedquery engine is collocated with graphical user interface soit simply passes the users credentials to the data servicesidentified in the distributed query This bypasses some ofthe business rules defined by delegation such as thenumber of delegation hops allowed by the user but it issufficient to meet caTRIP use cases

Another change on the security workflow is the use offoreign CDEs for the purpose of joining data It is con-ceivable that a user does not have permission to view aCDE that is used for joining between two differentdomain services The classic example of this is to join

patients across services using the Medical Record NumberIn this case a special security workflow must be under-taken during authorization A level of trust is maintainedbetween the distributed query engine service and thedomain service such that foreign CDEs are returned to theDQE service with the understanding that the delegateduser is not authorized to view them but is authorized tojoin on them This workflow is only possible if the DQE isnot collocated with the client application

Automated Honest Broker ServiceThe primary goal of caTRIP is to provide users access todiverse datasets that cross domains and data sources Onechallenge that arises from this is deidentifying datasetsthat are managed by different groups which especiallyaffects deidentifying foreign CDEs (MRNs) that are usedto link across different datasets A typical scenario for

High-level overview of the caGrid security architecture httpcagridorgFigure 5High-level overview of the caGrid security architecture httpcagridorg

Page 9 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

cross-dataset deidentification is for an IRB appointed indi-vidual to collect each dataset and deidentify them all atonce tossing away the keys to the Protected Health Infor-mation (PHI) This approach is not scalable in a distrib-uted environment because it requires a middle-man toperform the deidentification Whenever a new dataset isadded to the system all previously existing data sets mustbe deidentified again To address this in caTRIP we devel-oped an Automated Honest Broker Service (AHBS) whichperforms the tasks once performed by a human appointedby the Institutional Review Board A data owner program-matically submits PHI (eg an MRN) to the AHBS using asecure connection The AHBS generates a random keyassociates it internally in a database to the PHI (which canbe encrypted) and then returns the key to the data ownerThe data owner then can deidentify his dataset by replac-ing the PHI with the random key When more than onedata ownerdataset is involved the AHBS will alwaysreturn the same random key for the same PHI In this waydiverse distributed datasets can be deidentified using thesame keys

Example from the Duke ImplementationThe following example illustrates how caTRIP would con-struct and execute a user query that joins disparate data-sets A user at Duke could ask the question What are allof the available tissue specimens from the breast obtainedfrom patients tested for positive expression of the estrogenreceptor The distributed query engine would first fetchall of the patient medical record numbers from the Clini-cal Annotation Engine (CAE) service that have estrogenreceptor expression annotated as positive It would thenissue a query to the caTissue CORE service for tissue spec-imens that are marked available were collected from thetissue site of breast and were obtained from patients thathave medical record numbers that match those returnedfrom the previous query The results would be a list of tis-sue specimens including their barcodes location sizeetc The single query issued to the distributed queryengine would look like this

1 ltDCQLQuery xmlns = httpcaGridcaBIG10govnihncicagriddcqlgt

2 ltTargetObject name = httpeduwustlcatissuecoredomainobjectimplTissueSpecimenImpl serv-iceURL = httplocalhost8080wsrfservicescagridCaTissueCoregt

3 ltGroup logicRelation = ANDgt

4 ltAttribute name = available predicate =EQUAL_TO value = truegt

5 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCharacteristicsImpl roleName = specimenCharacteristicsgt

6 ltAttribute name = tissueSite predicate =LIKE value = breastgt

7 ltAssociationgt

8 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCollectionGroupImpl roleName = specimenCollectionGroupgt

9 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplClinicalReportImplroleName = clinicalReportgt

10 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplParticipantMedicalIdentifierImpl roleName = participantMedicalIdentifiergt

11 ltForeignAssociationgt

12 ltJoinConditiongt

13 ltLeftJoingt

14ltObjectgteduwustlcatissuecoredomai

nobjectimplParticipantMedicalIdentifierImplltObjectgt

15ltPropertygtmedicalRecordNumberlt

Propertygt

16 ltLeftJoingt

17 ltRightJoingt

18ltObjectgtedudukecatripcaedomaing

eneralParticipantMedicalIdentifierltObjectgt

19ltPropertygtmedicalRecordNumberlt

Propertygt

20 ltRightJoingt

21 ltJoinConditiongt

22 ltForeignObject name =httpedudukecatripcaedomaingeneralParticipantMedicalI dentifier serviceURL = httplocalhost8080wsrfservicescagridCAEgt

Page 10 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

23 ltAssociation name = httpedudukecatripcaedomaingeneralParticipantroleName = participantgt

24 ltAssociation name =httpedupittcabigcaedomaingeneralAnnotationEventPa rameters roleName = annotationEventParameter-sCollectiongt

25 ltAssociation name =httpedupittcabigcaedomainbreastBreastCancerBiomark ers roleName = annotationSetCollectiongt

26 ltAttribute name = estrogenRe-ceptor predicate = LIKE value = POSITIVEgt

27 ltAssociationgt

28 ltAssociationgt

29 ltAssociationgt

30 ltForeignObjectgt

31 ltForeignAssociationgt

32 ltAssociationgt

33 ltAssociationgt

ltAssociationgt

ltGroupgt

ltTargetObjectgt

ltDCQLQuerygt

Note on line 1 that the query is rooted at caTissue COREwhere available specimens are filtered on line 4 and spec-imens from the tissue site of breast are filtered on line 6Lines 11ndash31 define the foreign association between caTis-sue CORE and CAE The medical record numbers (MRNs)of each service are joined on lines 12ndash21 by defining theleft join criteria to be the MRN from caTissue CORE andthe right join criteria to the be MRN from CAE Line 22selects the foreign service to be CAE from which patientswith estrogen receptor expression marked as positive arefiltered on line 26 The query itself is unfolded and exe-cuted inside-out by the distributed query engine CAEbeing the first service to be queried and the results used toquery caTissue CORE Note also that the association pathsof the objects are defined within the query on lines 5 8 910 23 24 and 25 the logic of which is hidden from theusers who are simply selecting data elements to query

upon and return The abbreviated results would look likethis

1 ltqueryResults targetClassname =httpeduwustlcatissuecoredomainobjectimplTissueSpeci menImplgt

2 ltns1ObjectResult xmlnsns1 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

3 ltns2TissueSpecimen id = 397 type = Fixed TissueBlock available = true positionDimensionOne = 20positionDimensionTwo = 22 barcode = ISBN-10AB00-0013XY comments = TISSUE IN FORMALIN 30activityStatus = Active quantityInGram = 07 availa-bleQuantityInGram = 0199999988079071 xmlnsns2= gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

4 ltns1ObjectResultgt

5 ltns2ObjectResult xmlnsns2 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

6 ltns3TissueSpecimen id = 406 type = Fixed TissueBlock available = true positionDimensionOne = 19positionDimensionTwo = 14 barcode = ISBN-10AB00-0043XY comments = STRONG FAMILY HISTORYFOR BRCA MUTATIONS activityStatus = Active quan-tityInGram = 675 availableQuantityInGram = 625xmlnsns3 = gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

7 ltns2ObjectResultgt

8 ltns3ObjectResult xmlnsns3 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

9 ltns4TissueSpecimen id = 411 type = Fixed TissueBlock available = true positionDimensionOne = 21positionDimensionTwo = 7 barcode = ISBN-10 AB00-0063XY comments = NO MORE WHOLE BLOODactivityStatus = Active quantityInGram = 4275 avail-ableQuantityInGram = 4225 xmlnsns4 =gmecaCOREcabig30eduwustlcatissuecoredomainob jectgt

10 ltns3ObjectResultgt

11 ltqueryResultsgt

Note that line 1 defines the type of object being returnedin the query results and that the results themselves found

Page 11 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

on lines 3 6 and 9 are each of the objects that match thequery criteria with data elements of interest embeddedwithin them as attributes

ConclusioncaTRIP is being implemented in a phased-developmentcycle meeting certain objectives at each point to provideutility as the program is extended It is currently in its sec-ond phase of development Phase one concluded the pilotdevelopment activities of caTRIP providing a baselinearchitecture and proof-of-concept user interface End usertraining was provided on a limited basis for the smallmultidisciplinary breast oncology group using live dem-onstrations and basic instructions provided through e-mail messages Feedback was collected through e-mail aswell The multidisciplinary group was given direct accessto the primary software developers and technical expertsduring this period In phase two we plan to deploycaTRIP-II along with enhancements determined throughend user interaction Furthermore we are also in the proc-ess of engaging different groups at Duke to adopt caTRIPto help meet their research goals

In phase one of caTRIP use cases focused on intra-institu-tional problems tearing down the silos of data-orientedsystems in a single cancer center In future iterations weplan to extend caTRIP to tackle cross-institutional usecases This involves extending the caTRIP interface anddistributed query engine to support aggregation of datafrom services that expose common data elements Forexample a motivating use case amongst prospectiveadopters is to aggregate treatment endpoint pathologyand tissue banking data from a number of different cancercenters This will allow researchers and clinicians toexpand their queries and answer questions that they pre-viously could not at their own institution Furthermorewe plan to integrate caTRIP with analytics specificallyfocusing on microarray analysis The caBIG gene expres-sion database caArray is a rich source of central or distrib-uted gene expression data which itself can be tied toclinical data such as pathology biomarkers This microar-ray data can be analyzed and visualized dynamically usingservices provided by other caBIG applications such asgeWorkbench Califano 2006 httpapiiiupmceduabstractsposterarchive2006watkinsonhtml GenePat-tern [8] and Bioconductor[9] This notion of buildingworkflows based upon queries designed using caTRIPssimple interface can provide a powerful tool for bothresearchers and clinicians Finally we also plan toenhance the user interface providing more sophisticatedreporting and data mining functionality All of these fea-tures will be driven by end-user requirements of the adop-ter sites which is a critical facet to the caBIG initiative

Availability and requirementsProject name Cancer Translational Research InformaticsPlatform (caTRIP)

Project home page httpsgforgencinihgovprojectscatrip

Operating system(s) platform independent

Programming language Java

Other requirements Tomcat 5028 a database (MySQL4110 or Oracle 10x) Java Version 15 or higher andApache Ant version 165

License NCI caBIG license (non-viral)

Any restrictions to use by non-academics no

Competing interestsThe authors declare that they have no competing interests

Authors contributionsPM was the Lead Architect and Project Manager of caTRIPRD provided subject matter expertise to guide the devel-opment of caTRIP Ram Chilukuri was the lead developerof caTRIP RP provided subject matter expertise KJ pro-vided subject matter expertise RA was the project directorAJC was the project PI

AcknowledgementsThis publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR) a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH Information on NCRR is available at httpwwwncrrnihgov Information on Re-engineer-ing the Clinical Research Enterprise can be obtained from httpnihroad mapnihgovclinicalresearchoverview-translationalasp

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant The authors would like also to acknowledge the contributions of the developers database administrators and system administrators at Duke including Mark Peedin Christopher Hubbard Wilma Stanley Farid Mohammad and William Banks We would also like to thank the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC as well as Bill Mason from at 5 AM Solutions Addition-ally we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements They include Drs Kelley Marcom Gretchen Kimmick Kimberly Blackwell and Lee Wilke Finally we would also like to thank our liaison to NCI Juli Klemm for her support and intellectual contributions

References1 Fenstermacher D Street C McSherry T Nayak V Overby C Feld-

man M The Cancer Biomedical Informatics Grid (caBIGtrade)Conf Proc IEEE Eng Med Biol Soc 2005 1743-6

Page 12 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Publish with BioMed Central and every scientist can read your work free of charge

BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime

Sir Paul Nurse Cancer Research UK

Your research papers will be

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours mdash you keep the copyright

Submit your manuscript herehttpwwwbiomedcentralcominfopublishing_advasp

BioMedcentral

2 Esposito NN Chivukula M Dabbs DJ The ductal phenotypicexpression of the E-cadherincatenin complex in tubulolobu-lar carcinoma of the breast an immunohistochemical andclinicopathologic study Mod Pathol 2007 20(1)130-8

3 Sahin FI Yilmaz Z Yagmurdur MC Atac FB Ozdemir BH KarakayaliH Demirhan B Haberal M Clinical findings and HER-2neu geneamplification status of breast carcinoma patients Pathol OncolRes 2006 12(4)211-5

4 Sughayer MA Al-Khawaja MM Massarweh S Al-Masri M Preva-lence of hormone receptors and HER2neu in breast cancercases in Jordan Pathol Oncol Res 2006 12(2)83-6

5 Radovic S Babic M Doric M Secic S Beslic S Balta E Kapetanovic EImmunohistochemical evaluation of the HER-2 protein inthe infiltrative lobular breast cancer Med Arh 200660(4)213-6

6 Mills ME Linkage of patient records to support continuity ofcare Issues and future directions Stud Health Technol Inform2006 122320-4

7 Saltz J Oster S Hastings S Langella S Kurc T Sanchez W Kher MManisundaram A Shanbhag K Covitz P caGrid design and imple-mentation of the core architecture of the cancer biomedicalinformatics grid Bioinformatics 2006 22(15)1910-6

8 Reich M Liefeld T Gould J Lerner J Tamayo P Mesirov JP GenePa-ttern 20 Nat Genet 2006 38(5)500-1

9 Gentleman RC Carey VJ Bates DM Bolstad B Dettling M Dudoit SEllis B Gautier L Ge Y Gentry J others Bioconductor open soft-ware development for computational biology and bioinfor-matics Genome Biol 2004 5(10)R80

Pre-publication historyThe pre-publication history for this paper can be accessedhere

httpwwwbiomedcentralcom1472-6947860prepub

Page 13 of 13(page number not for citation purposes)

  • Abstract
    • Background
    • Results
    • Conclusion
      • Background
        • Case Study Outcomes Analysis
        • Design Objectives
          • Implementation
            • Technical Requirements
            • Software architecture
            • Implementation at Duke
              • Results and discussion
                • Graphical User Interface
                • Simple Interface
                • Advanced Interface
                • Distributed Query Engine
                • Domain Services
                • Security
                • Security Authentication
                • Security Authorization
                • Delegation and Foreign CDEs
                • Automated Honest Broker Service
                • Example from the Duke Implementation
                  • Conclusion
                  • Availability and requirements
                  • Competing interests
                  • Authors contributions
                  • Acknowledgements
                  • References
                  • Pre-publication history
Page 2: The cancer translational research informatics platform

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

fields in a way that researchers from both arenas can easilymake connections More specifically cancer researchwould benefit from the development of applications thatcan aggregate clinical and molecular data in a repositorythat is user-friendly easily accessible as well as compliantwith regulatory requirements of privacy and security

In alignment with the requirements outlined above theDuke Comprehensive Cancer Center (DCCC) in collabo-ration with SemanticBits LLC has developed the CancerTranslational Research Informatics Platform (caTRIPhttpscabigncinihgovtoolscaTRIP this and all caBIGlinks last current as of July 2008) a translational researchsystem for the Cancer Biomedical Informatics Grid(caBIG) project [1] caTRIP allows users to query across anumber of caBIG data services join on common data ele-ments (CDEs) and view their results in a user-friendlyinterface Having outcomes analysis as its initial focuscaTRIP allows clinicians to query across data from existingpatients with similar characteristics to find treatments thatwere administered with success In doing so caTRIP canhelp inform treatment and improve patient care as well asenable searching for available tumor tissue locatingpatients for clinical trials and investigating the associa-tion between multiple predictors and their correspondingoutcomes (such as survival) Of importance caTRIP relieson the vast array of open source caBIG applicationsincluding (1) the Tumor Registry a clinical system that isused to collect endpoint data (2) the cancer Text Informa-tion Extraction System (caTIES httpscabigncinihgovtoolscaties) a locator of tissue resources that works viathe extraction of clinical information from free text surgi-cal pathology reports while using controlled terminolo-gies to populate caBIG-compliant data structures (3)caTissue CORE httpscabigncinihgovtoolscatissuecore a tissue bank repository tool for biospecimeninventory tracking and basic annotation (4) CancerAnnotation Engine (CAE httpscabigncinihgovtoolscae) a system for storing and searching pathology anno-tations (5) caIntegrator httpgforgencinihgovprojectscaintegrator) a tool for storing querying andanalyzing translational data including SNP data

The objective of this article is to describe the caTRIP sys-tem in relation to its objectives overall software architec-ture security implementation usability and futuredevelopment plans

Case Study Outcomes AnalysisA motivating use case for the development of caTRIP isone of outcomes analysis whereby a clinician can querydata from a cohort of preexisting patients to help guidetreatment of another patient An oncologist can ask Whatare the treatments and outcomes of other patients thathave similar characteristics to my patient caTRIP makes

this possible on multiple levels local institutionalregional national and beyond by leveraging grid infra-structure to scale beyond traditional query systems

Consider an example of a patient with a breast lumpapproaching an oncologist for treatment at a tertiary carefacility The patient has been told by her local physicianthat the tumor she has is not uncommon but difficult tomanage The oncologist reviews the medical documentsfrom the outside institution and recognizes an entity thatis well-equipped to handle a lobular carcinoma of thebreast What is slightly odd is that this tumor has an inter-mediate nuclear grade and has been further designated asa pleomorphic lobular carcinoma The oncologist isfamiliar with the medical literature on the topic and rec-ognizes that in certain circumstances such an entity mayexhibit biologic behavior akin to an invasive ductal tumor(rather than that of a typical lobular) Paralleling the oddhistologic finding is an additional finding that the tumoris negative for estrogen receptor (ER) expression negativefor progesterone receptor (PR) expression but stronglypositive for human epidermal growth factor receptor 2(Her-2neu) overexpression This profile is very uncharac-teristic for a lobular carcinoma of the breast Thinking thatthe tumor may have been misclassified the oncologist hasthe pathology material reviewed again and re-runs themolecular studies to confirm the receptor studies reportedfrom the outside institution To complicate matters thereviewing pathologist reclassifies the tumor as a typicallobular carcinoma (not pleomorphic) but the receptorstudy profiles (ER PR) and Her-2neu status remainunchanged Now there is no diagnostic correlate for thereceptor profile results Lobular tumors (particularly typi-cal ones) are nearly always positive for ER and PR receptor(and usually strongly so) and negative for Her-2neu over-expression

As an example at Duke the Breast Pathology service willtypically see one or two such cases per year at most Theexpected biologic behavior and sensitivity to various treat-ment protocols are unknown in such cases The body ofliterature on this particular scenario of findings is stillmaturing [2-5] The fact of the matter is that even a largetertiary care facility will rarely come across such tumorscaTRIP provides a mechanism to examine all thosepatients who have been seen at any institution wherecaTRIP has been deployed and identify the cohort ofpatients that matches the specified criteria It is a matter ofa few minutes at most to construct a query that narrowsthe cohort to patients that have lobular tumors of thebreast are ER negative PR negative and Her-2neu over-expressed (Figure 1) Furthermore the type of treatmentemployed and the time to recurrence or death are easilycaptured as part of the result set by drawing upon theTumor Registry as an additional data source This permits

Page 2 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

a level of decision in the clinic ndash heretofore impossiblewith prior technologies Running such a query at Dukereturns very few cases because of the relative rarity of suchclinical phenotypes However caTRIP is grid-enabledand if deployed at Cancer Centers across the country aspart of the NCIs caBIG initiative the statistical powercould easily increase by multiple orders of magnitudeRather than relying on single-institution or limited datasets published in the literature an oncologist now hasaccess to a rich live data set that can provide strong statis-tically significant data in mere minutes

While the preceding scenario focuses on facilitating clini-cal care caTRIP is adept at enabling research as well Con-sider a follow-up situation in which a researcher wouldlike to take that cohort of patients that have lobular carci-noma and Her-2neu overexpression and compare them

to a cohort that have ductal tumors that also overexpressHer-2neu as well as to a third group that have lobulartumors but not overexpress Her-2neu DNA arrays can beused to test for the presence of hundreds or even thou-sands of gene sequences in relatively short order if tissuesamples are available caTRIP can cross-query linkedbiospecimen repositories to identify samples that areavailable for the three cohorts of patients and retrievesamples that match specified criteria (eg frozen diseasedtissue with match adjacent normal tissue) With informat-ics solutions like caTRIP and newer molecular technolo-gies like DNA arrays the time and effort to develop newdiagnostic and prognostic markers of disease can be sig-nificantly reduced

Screenshot of the Simple Interface depicting a outcomes analysis translational queryFigure 1Screenshot of the Simple Interface depicting a outcomes analysis translational query

Page 3 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Design ObjectivesThe objectives and sample questions that caTRIP isdesigned to address include

1 Allow for the search of available tumor tissue forpatients with factors of interest A typical question to beanswered by caTRIP would thus be What are all the tissuespecimens from Her-2neu positive patients that have a primarytumor in the breast and are BRCA1 positive

2 Identify predictors contributing to increased ordecreased survival among a subset of cancer patientscaTRIP should answer questions such as What are all theER positive patients that have survived breast cancer after radi-ation treatment

3 Determine the distribution over time of a number offactors that may contribute to or reflect progression of adisease state Does a change in pathology biomarkers overtime contribute to recurrence or death

4 Determine the association between a multitude of clin-ical predictors and their respective post-operative out-comes such as Does a change in ER or PR status before andafter surgery correlate with other factors

5 Identify patients eligible for participation in clinical tri-als that meet a pre-defined set of inclusion and exclusioncriteria such as List all patients that are triple negative (ERPR and Her-2neu negative)

6 Identify pathology reports of interest such as Show meall pathology reports for Her-2neu positive patients with a lob-ular carcinoma

In addition it is important that the system provide secu-rity and confidentiality levels in accordance with HIPAA(Health Insurance Portability and Accountability Act) andpart 11 FDA regulations

ImplementationTechnical RequirementsOur high level scientific use cases were in part motivatedby the need to access a variety of data types stored in anumber of data systems at Duke The Medical Assistant onthe World Wide Web (MAW3) data from the GeneticModifiers of BRCA 12 study (GEMS) and Tumor Registrydata sources feed caTRIP pathology clinical tissue bank-ing single nucleotide polymorphism (SNP) and treat-mentendpoint data From this baseline we defined theoutcomes analysis use cases that feed into the system andapplication requirements that define the overall set of fea-tures that caTRIP implements (see Background for scien-tific use cases)

The system requirements primarily relate to developmentin the Cancer Biomedical Informatics Grid (caBIG) initia-tive [1] caBIG has outlined a rigorous set of requirementsrelated to interoperability that span the categories of leg-acy bronze silver and gold compatibility Silver compat-ibility places requirements on both the data elements thatare exposed toby caTRIP as well as the application pro-gramming interfaces (APIs) Data elements leveraged incaTRIP must be defined as common data elements(CDEs) registered in the Cancer Data Standards Reposi-tory (caDSR httpncicbncinihgovNCICBinfrastructurecacore_overviewcadsr) Each CDE must have one ormore semantic concepts registered in the EnterpriseVocabulary Service (EVS httpncicbncinihgovNCICBinfrastructurecacore_overviewvocabulary) Thiscombination of defining the structure of the data as wellas the underlying semantic concepts describing the mean-ing of the data provide caTRIP with semantically richinformation models to query upon However in order toquery across different data systems it is critical that theCDEs be harmonized This allows for data to be aggre-gated and presented to the user as well as provides com-mon CDEs or keys to join on One way to catalogue theseforeign CDEs is through a Master Person Index whichwould provide a common identifier to link patients acrosssystems [6] It is possible though not without challengesto identify and link patients within a single institutionusing an institutional medical record number (MRN)However queries that join on patients across institutionalboundaries can be quite challenging

Compatibility of the APIs in caTRIP falls more squarelyinto caBIG gold compatibility which has not fully beendefined by the caBIG program as of yet All data that isaccessible to caTRIP must be exposed via caGrid data serv-ices [7] This means that the data is queriable via a well-defined standards-based interface that accepts queries viathe caGrid Common Query Language (CQL) whichdescribes the query in object-oriented terms correspond-ing to the registered CDEs Finally there are also systemrequirements related to security which include authenti-cating users based on credentials and authorizing users toaccess data

Software architecturecaTRIP is developed as an N-tier architecture with threeprimary tiers domain services the distributed queryengine (DQE) and the graphical user interface (Figure 2)The domain services are grid services that expose informa-tion models that encapsulate the underlying data The dis-tributed query engine takes as input a distributedcommon query language (DCQL) query decomposes itinto individual common query language (CQL) queriesissues those queries to the domain services joinsaggre-gates the results and returns the final results to the caller

Page 4 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

The graphical user interface (GUI) provides a metadata-driven interface for building and running DCQL It alsoprovides an interface for saving and searching for queries

Another important goal for caTRIP is to leverage existingtechnologies wherever possible These include

bull Globus 41 for the underlying messaging stack

bull caGrid 10 for the underlying service infrastructure

bull Introduce for service creation

bull Index Service for discovering services on the grid

bull Authentication Service to authenticate users using theirinstitutional credentials

bull Dorian to obtain a grid user proxy

bull Grid Grouper to authorize users to access data based ongroups

bull Common Security Module (CSM) for plugging authoriza-tion into the data service

bull Cancer Data Standards Repository (caDSR) for storingcommon data elements (CDEs)

High-level view of the caTRIP architecture including Domain Services Distributed Query Engine and Graphical User InterfaceFigure 2High-level view of the caTRIP architecture including Domain Services Distributed Query Engine and Graph-ical User Interface

Page 5 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

bull Enterprise Vocabulary Services (EVS) for storing semanticconcepts

bull Hibernate for an object-relational mapping

bull Oracle and MySQL for backend relational databases

More information can be found in the caTRIP architecturedocument and UML models

bull httpgforgencinihgovpluginsscmcvscvswebphpcatripdocumentationtechnical_guidecaTRIP_Technical_Guidedoccvsroot=catrip

bull httpgforgencinihgovpluginsscmcvscvswebphca-tripdocumentationmodelscaTRIP_UMLdoccvsroot=catrip

Implementation at DukeWe deployed caTRIP at Duke to leverage data from theGenetic Modifiers of BRCA1 and BRCA2 study in theDuke Breast Specialized Program of Research Excellence(SPORE) The goal of the GEMS investigation is to identifyfactors that influence the incidence of breast cancer ingermline BRCA12 mutation carriers To this end wedeployed tissue banking and pathology data from MAW3into caTissue CORE CAE and caTIES treatment recur-rence and endpoint data from our Tumor Registry into agrid service and SNP data from flat files into caIntegratorThese services were secured and then exposed to thecaTRIP interface This deployment allowed us to answerall of the domain questions that are found in the DesignObjectives section of this article Users of this pilotdeployment included a set of pathologists and clinicianswho provided feedback that guided the developmentproviding important feedback such as the use of menu-drive interfaces that require little knowledge of the back-end data instead of drag-and-drop interfaces that dorequire such knowledge

Results and discussionGraphical User InterfaceThe Graphical User Interface (GUI) is a Java Swing-basedthick client that can be accessed via Java WebStart httpjavasuncomproductsjavawebstart It provides the userwith the ability to construct execute and share distrib-uted queries in a graphical fashion There are two primaryways to construct queries the Simple Interface is a morelimited yet powerful way to build queries using drop-down menus and the Advanced Interface is a flexible wayto build queries using drag-and-drop

Simple InterfaceThe Simple Interface was developed to target the non-technical users such as a clinician Queries are constructed

using drop-down menus to add filters ANDOR groupsand return attributes (Figures 1 and 3) Available servicesare selected via a configuration file as well as the CDEsthat make up the drop-down menu items for filters andreturn attributes Joins across these services are performedusing common identifiers Specifically in caTRIP we usethe (optionally) deidentified Medical Record Number(MRN) of patients to join and aggregate data The datamodels of the underlying services are composed of poten-tially complex classes attributes and associationsbetween the classes (see Figure 4) In order to constructqueries that span multiple classes and associations theSimple Interface leverages a configuration file thatdescribes the outbound path from each class to the MRNas well as the inbound path to each filterablereturnableCDE This enables the automated logic that would nor-mally be performed by a user manually selecting all of theassociations necessary to perform a query

Advanced InterfaceThe Advanced Interface was developed to target a datatechnician who may wish to perform more complex que-ries Queries are constructed by dragging classes onto acanvas drawing associations between them and addingfilters along the way (Figure 4) Because the logic of howto traverse the information space is not built-in like it iswith the Simple Interface users must manually createthe association paths to the classesattributes that theywould like to filter on as well as those they would like toperform cross-service foreign joins with This allows usersto perform queries not supported by the Simple Inter-face but adds an extra level of complexity

Distributed Query EngineThe Distributed Query Engine (DQE) provides the logic toperform a query across different caGrid data services ThecaTRIP team developed a distributed query engine whichwas later adopted into the caGrid 10 toolkit as the Feder-ated Query Processor (FQP) module Currently the DQEis available for use by the general caBIG community andis currently being leveraged by the Cancer Bench to Bed-side (caB2B) project httpgforgencinihgovprojectscab2b It takes as input a Distributed Common QueryLanguage (DCQL) query breaks it into separate CommonQuery Language (CQL) queries issues those queries to theappropriate underlying data services receives the resultsperforms joins aggregates data and returns the resultsThe DQE is built as a Java service and can be deployed asa grid service

Domain ServicesThe domain services themselves can be further dividedinto a number of tiers including a relational databasemanagement system (RDBMS) Hibernate CommonQuery Language processor httpgforgencinihgovplu

Page 6 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

ginsscmcvscvswebphpcagrid-1-0DocumentatiodocsDataServiceTechnicaldataServiceTechnicalGuidedoccvsroot=cagrid-1-0 and grid service Hiber-nate provides an object-oriented abstraction of the under-lying relational database exposing tables and columns asclasses and attributes in a flexible configurable mannerhttpwwwhibernateorg The CQL processor translatesa CQL query based upon common data elements andtheir associations into a Hibernate Query Language(HQL) query This is a relatively complex mappingbecause CQL and HQL have much different syntaxesFinally the grid service provides a standards-based entrypoint to issue a CQL query caGrid httpscabigncinihgovworkspacesArchitecturecaGrid pro-vides the underlying technology stack for the grid servicewhich itself is based upon Globus 41 httpwwwglobusorg

SecuritycaTRIP leverages the caGrid 10 infrastructure which pro-vides robust security for a distributed grid environment(Figure 5)

Security AuthenticationThe first step in the security workflow is authenticationwhich is the process by which a trusted organizationasserts a user is who they claim to be This is implementedusing the caGrid Authentication Service The goal is for theuser to obtain a grid user certificate to be used in authori-zation The user submits their credentials (user name andpassword) to the AuthenticationService which has aninstitutional identity provider (IdP) plug-in This compo-nent performs the necessary steps to determine that theusers credentials are valid In the case of the Duke IdPplugin the user name and password are validated against

Steps to perform a query in the caTRIP Simple InterfaceFigure 3Steps to perform a query in the caTRIP Simple Interface

Page 7 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

a Duke Domain Controller which uses NT Security Thismeans that a user can log in with the same Cancer Centercredentials they use to log in to their workstation Oncevalidated the AuthenticationService returns a SecurityAssertion Markup Language (SAML) document httpwwwoasis-openorgcommitteesdownloadphp14361sstc-saml-tech-overview-20-draft-08pdf to the userwhich asserts that the user is who they claim to be TheSAML Assertion can then be passed off to a Dorian Serviceto generate a grid user certificate All of these interactionsare handled for the user by the caTRIP application

Security AuthorizationThe second step in the security workflow is authorizationwhich is the process by which a system grants the useraccess to specific resources This is implemented using the

Common Security Module (CSM) plug-in to the grid dataservice The CSM Phillips 2006 is an open sourcetoolkit that provides a flexible way to provision users andmake authorization decisions When a data field isaccessed on behalf of the user an authorization structureis queried to determine whether the user has permissionto access it If not then a security exception is returned tothe client Authorization is made possible because trustexists between the service serving the data resource andthe AuthenticationService

Delegation and Foreign CDEsAn alternate flow in the security workflow is delegationwhich is the process by which a service acts on behalf ofthe user This impacts caTRIP because the distributedquery engine can be run as a service so it must query the

Example Advanced Interfacing depicting the potential complexity of distributed queriesFigure 4Example Advanced Interfacing depicting the potential complexity of distributed queries

Page 8 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

domain services using the users certificate The finalimplementation will likely leverage the DelegationServicefrom the Globus project The user will signal the Delega-tionService that it is permissible for a collocated service toact on the users behalf In the meantime the distributedquery engine is collocated with graphical user interface soit simply passes the users credentials to the data servicesidentified in the distributed query This bypasses some ofthe business rules defined by delegation such as thenumber of delegation hops allowed by the user but it issufficient to meet caTRIP use cases

Another change on the security workflow is the use offoreign CDEs for the purpose of joining data It is con-ceivable that a user does not have permission to view aCDE that is used for joining between two differentdomain services The classic example of this is to join

patients across services using the Medical Record NumberIn this case a special security workflow must be under-taken during authorization A level of trust is maintainedbetween the distributed query engine service and thedomain service such that foreign CDEs are returned to theDQE service with the understanding that the delegateduser is not authorized to view them but is authorized tojoin on them This workflow is only possible if the DQE isnot collocated with the client application

Automated Honest Broker ServiceThe primary goal of caTRIP is to provide users access todiverse datasets that cross domains and data sources Onechallenge that arises from this is deidentifying datasetsthat are managed by different groups which especiallyaffects deidentifying foreign CDEs (MRNs) that are usedto link across different datasets A typical scenario for

High-level overview of the caGrid security architecture httpcagridorgFigure 5High-level overview of the caGrid security architecture httpcagridorg

Page 9 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

cross-dataset deidentification is for an IRB appointed indi-vidual to collect each dataset and deidentify them all atonce tossing away the keys to the Protected Health Infor-mation (PHI) This approach is not scalable in a distrib-uted environment because it requires a middle-man toperform the deidentification Whenever a new dataset isadded to the system all previously existing data sets mustbe deidentified again To address this in caTRIP we devel-oped an Automated Honest Broker Service (AHBS) whichperforms the tasks once performed by a human appointedby the Institutional Review Board A data owner program-matically submits PHI (eg an MRN) to the AHBS using asecure connection The AHBS generates a random keyassociates it internally in a database to the PHI (which canbe encrypted) and then returns the key to the data ownerThe data owner then can deidentify his dataset by replac-ing the PHI with the random key When more than onedata ownerdataset is involved the AHBS will alwaysreturn the same random key for the same PHI In this waydiverse distributed datasets can be deidentified using thesame keys

Example from the Duke ImplementationThe following example illustrates how caTRIP would con-struct and execute a user query that joins disparate data-sets A user at Duke could ask the question What are allof the available tissue specimens from the breast obtainedfrom patients tested for positive expression of the estrogenreceptor The distributed query engine would first fetchall of the patient medical record numbers from the Clini-cal Annotation Engine (CAE) service that have estrogenreceptor expression annotated as positive It would thenissue a query to the caTissue CORE service for tissue spec-imens that are marked available were collected from thetissue site of breast and were obtained from patients thathave medical record numbers that match those returnedfrom the previous query The results would be a list of tis-sue specimens including their barcodes location sizeetc The single query issued to the distributed queryengine would look like this

1 ltDCQLQuery xmlns = httpcaGridcaBIG10govnihncicagriddcqlgt

2 ltTargetObject name = httpeduwustlcatissuecoredomainobjectimplTissueSpecimenImpl serv-iceURL = httplocalhost8080wsrfservicescagridCaTissueCoregt

3 ltGroup logicRelation = ANDgt

4 ltAttribute name = available predicate =EQUAL_TO value = truegt

5 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCharacteristicsImpl roleName = specimenCharacteristicsgt

6 ltAttribute name = tissueSite predicate =LIKE value = breastgt

7 ltAssociationgt

8 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCollectionGroupImpl roleName = specimenCollectionGroupgt

9 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplClinicalReportImplroleName = clinicalReportgt

10 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplParticipantMedicalIdentifierImpl roleName = participantMedicalIdentifiergt

11 ltForeignAssociationgt

12 ltJoinConditiongt

13 ltLeftJoingt

14ltObjectgteduwustlcatissuecoredomai

nobjectimplParticipantMedicalIdentifierImplltObjectgt

15ltPropertygtmedicalRecordNumberlt

Propertygt

16 ltLeftJoingt

17 ltRightJoingt

18ltObjectgtedudukecatripcaedomaing

eneralParticipantMedicalIdentifierltObjectgt

19ltPropertygtmedicalRecordNumberlt

Propertygt

20 ltRightJoingt

21 ltJoinConditiongt

22 ltForeignObject name =httpedudukecatripcaedomaingeneralParticipantMedicalI dentifier serviceURL = httplocalhost8080wsrfservicescagridCAEgt

Page 10 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

23 ltAssociation name = httpedudukecatripcaedomaingeneralParticipantroleName = participantgt

24 ltAssociation name =httpedupittcabigcaedomaingeneralAnnotationEventPa rameters roleName = annotationEventParameter-sCollectiongt

25 ltAssociation name =httpedupittcabigcaedomainbreastBreastCancerBiomark ers roleName = annotationSetCollectiongt

26 ltAttribute name = estrogenRe-ceptor predicate = LIKE value = POSITIVEgt

27 ltAssociationgt

28 ltAssociationgt

29 ltAssociationgt

30 ltForeignObjectgt

31 ltForeignAssociationgt

32 ltAssociationgt

33 ltAssociationgt

ltAssociationgt

ltGroupgt

ltTargetObjectgt

ltDCQLQuerygt

Note on line 1 that the query is rooted at caTissue COREwhere available specimens are filtered on line 4 and spec-imens from the tissue site of breast are filtered on line 6Lines 11ndash31 define the foreign association between caTis-sue CORE and CAE The medical record numbers (MRNs)of each service are joined on lines 12ndash21 by defining theleft join criteria to be the MRN from caTissue CORE andthe right join criteria to the be MRN from CAE Line 22selects the foreign service to be CAE from which patientswith estrogen receptor expression marked as positive arefiltered on line 26 The query itself is unfolded and exe-cuted inside-out by the distributed query engine CAEbeing the first service to be queried and the results used toquery caTissue CORE Note also that the association pathsof the objects are defined within the query on lines 5 8 910 23 24 and 25 the logic of which is hidden from theusers who are simply selecting data elements to query

upon and return The abbreviated results would look likethis

1 ltqueryResults targetClassname =httpeduwustlcatissuecoredomainobjectimplTissueSpeci menImplgt

2 ltns1ObjectResult xmlnsns1 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

3 ltns2TissueSpecimen id = 397 type = Fixed TissueBlock available = true positionDimensionOne = 20positionDimensionTwo = 22 barcode = ISBN-10AB00-0013XY comments = TISSUE IN FORMALIN 30activityStatus = Active quantityInGram = 07 availa-bleQuantityInGram = 0199999988079071 xmlnsns2= gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

4 ltns1ObjectResultgt

5 ltns2ObjectResult xmlnsns2 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

6 ltns3TissueSpecimen id = 406 type = Fixed TissueBlock available = true positionDimensionOne = 19positionDimensionTwo = 14 barcode = ISBN-10AB00-0043XY comments = STRONG FAMILY HISTORYFOR BRCA MUTATIONS activityStatus = Active quan-tityInGram = 675 availableQuantityInGram = 625xmlnsns3 = gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

7 ltns2ObjectResultgt

8 ltns3ObjectResult xmlnsns3 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

9 ltns4TissueSpecimen id = 411 type = Fixed TissueBlock available = true positionDimensionOne = 21positionDimensionTwo = 7 barcode = ISBN-10 AB00-0063XY comments = NO MORE WHOLE BLOODactivityStatus = Active quantityInGram = 4275 avail-ableQuantityInGram = 4225 xmlnsns4 =gmecaCOREcabig30eduwustlcatissuecoredomainob jectgt

10 ltns3ObjectResultgt

11 ltqueryResultsgt

Note that line 1 defines the type of object being returnedin the query results and that the results themselves found

Page 11 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

on lines 3 6 and 9 are each of the objects that match thequery criteria with data elements of interest embeddedwithin them as attributes

ConclusioncaTRIP is being implemented in a phased-developmentcycle meeting certain objectives at each point to provideutility as the program is extended It is currently in its sec-ond phase of development Phase one concluded the pilotdevelopment activities of caTRIP providing a baselinearchitecture and proof-of-concept user interface End usertraining was provided on a limited basis for the smallmultidisciplinary breast oncology group using live dem-onstrations and basic instructions provided through e-mail messages Feedback was collected through e-mail aswell The multidisciplinary group was given direct accessto the primary software developers and technical expertsduring this period In phase two we plan to deploycaTRIP-II along with enhancements determined throughend user interaction Furthermore we are also in the proc-ess of engaging different groups at Duke to adopt caTRIPto help meet their research goals

In phase one of caTRIP use cases focused on intra-institu-tional problems tearing down the silos of data-orientedsystems in a single cancer center In future iterations weplan to extend caTRIP to tackle cross-institutional usecases This involves extending the caTRIP interface anddistributed query engine to support aggregation of datafrom services that expose common data elements Forexample a motivating use case amongst prospectiveadopters is to aggregate treatment endpoint pathologyand tissue banking data from a number of different cancercenters This will allow researchers and clinicians toexpand their queries and answer questions that they pre-viously could not at their own institution Furthermorewe plan to integrate caTRIP with analytics specificallyfocusing on microarray analysis The caBIG gene expres-sion database caArray is a rich source of central or distrib-uted gene expression data which itself can be tied toclinical data such as pathology biomarkers This microar-ray data can be analyzed and visualized dynamically usingservices provided by other caBIG applications such asgeWorkbench Califano 2006 httpapiiiupmceduabstractsposterarchive2006watkinsonhtml GenePat-tern [8] and Bioconductor[9] This notion of buildingworkflows based upon queries designed using caTRIPssimple interface can provide a powerful tool for bothresearchers and clinicians Finally we also plan toenhance the user interface providing more sophisticatedreporting and data mining functionality All of these fea-tures will be driven by end-user requirements of the adop-ter sites which is a critical facet to the caBIG initiative

Availability and requirementsProject name Cancer Translational Research InformaticsPlatform (caTRIP)

Project home page httpsgforgencinihgovprojectscatrip

Operating system(s) platform independent

Programming language Java

Other requirements Tomcat 5028 a database (MySQL4110 or Oracle 10x) Java Version 15 or higher andApache Ant version 165

License NCI caBIG license (non-viral)

Any restrictions to use by non-academics no

Competing interestsThe authors declare that they have no competing interests

Authors contributionsPM was the Lead Architect and Project Manager of caTRIPRD provided subject matter expertise to guide the devel-opment of caTRIP Ram Chilukuri was the lead developerof caTRIP RP provided subject matter expertise KJ pro-vided subject matter expertise RA was the project directorAJC was the project PI

AcknowledgementsThis publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR) a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH Information on NCRR is available at httpwwwncrrnihgov Information on Re-engineer-ing the Clinical Research Enterprise can be obtained from httpnihroad mapnihgovclinicalresearchoverview-translationalasp

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant The authors would like also to acknowledge the contributions of the developers database administrators and system administrators at Duke including Mark Peedin Christopher Hubbard Wilma Stanley Farid Mohammad and William Banks We would also like to thank the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC as well as Bill Mason from at 5 AM Solutions Addition-ally we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements They include Drs Kelley Marcom Gretchen Kimmick Kimberly Blackwell and Lee Wilke Finally we would also like to thank our liaison to NCI Juli Klemm for her support and intellectual contributions

References1 Fenstermacher D Street C McSherry T Nayak V Overby C Feld-

man M The Cancer Biomedical Informatics Grid (caBIGtrade)Conf Proc IEEE Eng Med Biol Soc 2005 1743-6

Page 12 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Publish with BioMed Central and every scientist can read your work free of charge

BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime

Sir Paul Nurse Cancer Research UK

Your research papers will be

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours mdash you keep the copyright

Submit your manuscript herehttpwwwbiomedcentralcominfopublishing_advasp

BioMedcentral

2 Esposito NN Chivukula M Dabbs DJ The ductal phenotypicexpression of the E-cadherincatenin complex in tubulolobu-lar carcinoma of the breast an immunohistochemical andclinicopathologic study Mod Pathol 2007 20(1)130-8

3 Sahin FI Yilmaz Z Yagmurdur MC Atac FB Ozdemir BH KarakayaliH Demirhan B Haberal M Clinical findings and HER-2neu geneamplification status of breast carcinoma patients Pathol OncolRes 2006 12(4)211-5

4 Sughayer MA Al-Khawaja MM Massarweh S Al-Masri M Preva-lence of hormone receptors and HER2neu in breast cancercases in Jordan Pathol Oncol Res 2006 12(2)83-6

5 Radovic S Babic M Doric M Secic S Beslic S Balta E Kapetanovic EImmunohistochemical evaluation of the HER-2 protein inthe infiltrative lobular breast cancer Med Arh 200660(4)213-6

6 Mills ME Linkage of patient records to support continuity ofcare Issues and future directions Stud Health Technol Inform2006 122320-4

7 Saltz J Oster S Hastings S Langella S Kurc T Sanchez W Kher MManisundaram A Shanbhag K Covitz P caGrid design and imple-mentation of the core architecture of the cancer biomedicalinformatics grid Bioinformatics 2006 22(15)1910-6

8 Reich M Liefeld T Gould J Lerner J Tamayo P Mesirov JP GenePa-ttern 20 Nat Genet 2006 38(5)500-1

9 Gentleman RC Carey VJ Bates DM Bolstad B Dettling M Dudoit SEllis B Gautier L Ge Y Gentry J others Bioconductor open soft-ware development for computational biology and bioinfor-matics Genome Biol 2004 5(10)R80

Pre-publication historyThe pre-publication history for this paper can be accessedhere

httpwwwbiomedcentralcom1472-6947860prepub

Page 13 of 13(page number not for citation purposes)

  • Abstract
    • Background
    • Results
    • Conclusion
      • Background
        • Case Study Outcomes Analysis
        • Design Objectives
          • Implementation
            • Technical Requirements
            • Software architecture
            • Implementation at Duke
              • Results and discussion
                • Graphical User Interface
                • Simple Interface
                • Advanced Interface
                • Distributed Query Engine
                • Domain Services
                • Security
                • Security Authentication
                • Security Authorization
                • Delegation and Foreign CDEs
                • Automated Honest Broker Service
                • Example from the Duke Implementation
                  • Conclusion
                  • Availability and requirements
                  • Competing interests
                  • Authors contributions
                  • Acknowledgements
                  • References
                  • Pre-publication history
Page 3: The cancer translational research informatics platform

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

a level of decision in the clinic ndash heretofore impossiblewith prior technologies Running such a query at Dukereturns very few cases because of the relative rarity of suchclinical phenotypes However caTRIP is grid-enabledand if deployed at Cancer Centers across the country aspart of the NCIs caBIG initiative the statistical powercould easily increase by multiple orders of magnitudeRather than relying on single-institution or limited datasets published in the literature an oncologist now hasaccess to a rich live data set that can provide strong statis-tically significant data in mere minutes

While the preceding scenario focuses on facilitating clini-cal care caTRIP is adept at enabling research as well Con-sider a follow-up situation in which a researcher wouldlike to take that cohort of patients that have lobular carci-noma and Her-2neu overexpression and compare them

to a cohort that have ductal tumors that also overexpressHer-2neu as well as to a third group that have lobulartumors but not overexpress Her-2neu DNA arrays can beused to test for the presence of hundreds or even thou-sands of gene sequences in relatively short order if tissuesamples are available caTRIP can cross-query linkedbiospecimen repositories to identify samples that areavailable for the three cohorts of patients and retrievesamples that match specified criteria (eg frozen diseasedtissue with match adjacent normal tissue) With informat-ics solutions like caTRIP and newer molecular technolo-gies like DNA arrays the time and effort to develop newdiagnostic and prognostic markers of disease can be sig-nificantly reduced

Screenshot of the Simple Interface depicting a outcomes analysis translational queryFigure 1Screenshot of the Simple Interface depicting a outcomes analysis translational query

Page 3 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Design ObjectivesThe objectives and sample questions that caTRIP isdesigned to address include

1 Allow for the search of available tumor tissue forpatients with factors of interest A typical question to beanswered by caTRIP would thus be What are all the tissuespecimens from Her-2neu positive patients that have a primarytumor in the breast and are BRCA1 positive

2 Identify predictors contributing to increased ordecreased survival among a subset of cancer patientscaTRIP should answer questions such as What are all theER positive patients that have survived breast cancer after radi-ation treatment

3 Determine the distribution over time of a number offactors that may contribute to or reflect progression of adisease state Does a change in pathology biomarkers overtime contribute to recurrence or death

4 Determine the association between a multitude of clin-ical predictors and their respective post-operative out-comes such as Does a change in ER or PR status before andafter surgery correlate with other factors

5 Identify patients eligible for participation in clinical tri-als that meet a pre-defined set of inclusion and exclusioncriteria such as List all patients that are triple negative (ERPR and Her-2neu negative)

6 Identify pathology reports of interest such as Show meall pathology reports for Her-2neu positive patients with a lob-ular carcinoma

In addition it is important that the system provide secu-rity and confidentiality levels in accordance with HIPAA(Health Insurance Portability and Accountability Act) andpart 11 FDA regulations

ImplementationTechnical RequirementsOur high level scientific use cases were in part motivatedby the need to access a variety of data types stored in anumber of data systems at Duke The Medical Assistant onthe World Wide Web (MAW3) data from the GeneticModifiers of BRCA 12 study (GEMS) and Tumor Registrydata sources feed caTRIP pathology clinical tissue bank-ing single nucleotide polymorphism (SNP) and treat-mentendpoint data From this baseline we defined theoutcomes analysis use cases that feed into the system andapplication requirements that define the overall set of fea-tures that caTRIP implements (see Background for scien-tific use cases)

The system requirements primarily relate to developmentin the Cancer Biomedical Informatics Grid (caBIG) initia-tive [1] caBIG has outlined a rigorous set of requirementsrelated to interoperability that span the categories of leg-acy bronze silver and gold compatibility Silver compat-ibility places requirements on both the data elements thatare exposed toby caTRIP as well as the application pro-gramming interfaces (APIs) Data elements leveraged incaTRIP must be defined as common data elements(CDEs) registered in the Cancer Data Standards Reposi-tory (caDSR httpncicbncinihgovNCICBinfrastructurecacore_overviewcadsr) Each CDE must have one ormore semantic concepts registered in the EnterpriseVocabulary Service (EVS httpncicbncinihgovNCICBinfrastructurecacore_overviewvocabulary) Thiscombination of defining the structure of the data as wellas the underlying semantic concepts describing the mean-ing of the data provide caTRIP with semantically richinformation models to query upon However in order toquery across different data systems it is critical that theCDEs be harmonized This allows for data to be aggre-gated and presented to the user as well as provides com-mon CDEs or keys to join on One way to catalogue theseforeign CDEs is through a Master Person Index whichwould provide a common identifier to link patients acrosssystems [6] It is possible though not without challengesto identify and link patients within a single institutionusing an institutional medical record number (MRN)However queries that join on patients across institutionalboundaries can be quite challenging

Compatibility of the APIs in caTRIP falls more squarelyinto caBIG gold compatibility which has not fully beendefined by the caBIG program as of yet All data that isaccessible to caTRIP must be exposed via caGrid data serv-ices [7] This means that the data is queriable via a well-defined standards-based interface that accepts queries viathe caGrid Common Query Language (CQL) whichdescribes the query in object-oriented terms correspond-ing to the registered CDEs Finally there are also systemrequirements related to security which include authenti-cating users based on credentials and authorizing users toaccess data

Software architecturecaTRIP is developed as an N-tier architecture with threeprimary tiers domain services the distributed queryengine (DQE) and the graphical user interface (Figure 2)The domain services are grid services that expose informa-tion models that encapsulate the underlying data The dis-tributed query engine takes as input a distributedcommon query language (DCQL) query decomposes itinto individual common query language (CQL) queriesissues those queries to the domain services joinsaggre-gates the results and returns the final results to the caller

Page 4 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

The graphical user interface (GUI) provides a metadata-driven interface for building and running DCQL It alsoprovides an interface for saving and searching for queries

Another important goal for caTRIP is to leverage existingtechnologies wherever possible These include

bull Globus 41 for the underlying messaging stack

bull caGrid 10 for the underlying service infrastructure

bull Introduce for service creation

bull Index Service for discovering services on the grid

bull Authentication Service to authenticate users using theirinstitutional credentials

bull Dorian to obtain a grid user proxy

bull Grid Grouper to authorize users to access data based ongroups

bull Common Security Module (CSM) for plugging authoriza-tion into the data service

bull Cancer Data Standards Repository (caDSR) for storingcommon data elements (CDEs)

High-level view of the caTRIP architecture including Domain Services Distributed Query Engine and Graphical User InterfaceFigure 2High-level view of the caTRIP architecture including Domain Services Distributed Query Engine and Graph-ical User Interface

Page 5 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

bull Enterprise Vocabulary Services (EVS) for storing semanticconcepts

bull Hibernate for an object-relational mapping

bull Oracle and MySQL for backend relational databases

More information can be found in the caTRIP architecturedocument and UML models

bull httpgforgencinihgovpluginsscmcvscvswebphpcatripdocumentationtechnical_guidecaTRIP_Technical_Guidedoccvsroot=catrip

bull httpgforgencinihgovpluginsscmcvscvswebphca-tripdocumentationmodelscaTRIP_UMLdoccvsroot=catrip

Implementation at DukeWe deployed caTRIP at Duke to leverage data from theGenetic Modifiers of BRCA1 and BRCA2 study in theDuke Breast Specialized Program of Research Excellence(SPORE) The goal of the GEMS investigation is to identifyfactors that influence the incidence of breast cancer ingermline BRCA12 mutation carriers To this end wedeployed tissue banking and pathology data from MAW3into caTissue CORE CAE and caTIES treatment recur-rence and endpoint data from our Tumor Registry into agrid service and SNP data from flat files into caIntegratorThese services were secured and then exposed to thecaTRIP interface This deployment allowed us to answerall of the domain questions that are found in the DesignObjectives section of this article Users of this pilotdeployment included a set of pathologists and clinicianswho provided feedback that guided the developmentproviding important feedback such as the use of menu-drive interfaces that require little knowledge of the back-end data instead of drag-and-drop interfaces that dorequire such knowledge

Results and discussionGraphical User InterfaceThe Graphical User Interface (GUI) is a Java Swing-basedthick client that can be accessed via Java WebStart httpjavasuncomproductsjavawebstart It provides the userwith the ability to construct execute and share distrib-uted queries in a graphical fashion There are two primaryways to construct queries the Simple Interface is a morelimited yet powerful way to build queries using drop-down menus and the Advanced Interface is a flexible wayto build queries using drag-and-drop

Simple InterfaceThe Simple Interface was developed to target the non-technical users such as a clinician Queries are constructed

using drop-down menus to add filters ANDOR groupsand return attributes (Figures 1 and 3) Available servicesare selected via a configuration file as well as the CDEsthat make up the drop-down menu items for filters andreturn attributes Joins across these services are performedusing common identifiers Specifically in caTRIP we usethe (optionally) deidentified Medical Record Number(MRN) of patients to join and aggregate data The datamodels of the underlying services are composed of poten-tially complex classes attributes and associationsbetween the classes (see Figure 4) In order to constructqueries that span multiple classes and associations theSimple Interface leverages a configuration file thatdescribes the outbound path from each class to the MRNas well as the inbound path to each filterablereturnableCDE This enables the automated logic that would nor-mally be performed by a user manually selecting all of theassociations necessary to perform a query

Advanced InterfaceThe Advanced Interface was developed to target a datatechnician who may wish to perform more complex que-ries Queries are constructed by dragging classes onto acanvas drawing associations between them and addingfilters along the way (Figure 4) Because the logic of howto traverse the information space is not built-in like it iswith the Simple Interface users must manually createthe association paths to the classesattributes that theywould like to filter on as well as those they would like toperform cross-service foreign joins with This allows usersto perform queries not supported by the Simple Inter-face but adds an extra level of complexity

Distributed Query EngineThe Distributed Query Engine (DQE) provides the logic toperform a query across different caGrid data services ThecaTRIP team developed a distributed query engine whichwas later adopted into the caGrid 10 toolkit as the Feder-ated Query Processor (FQP) module Currently the DQEis available for use by the general caBIG community andis currently being leveraged by the Cancer Bench to Bed-side (caB2B) project httpgforgencinihgovprojectscab2b It takes as input a Distributed Common QueryLanguage (DCQL) query breaks it into separate CommonQuery Language (CQL) queries issues those queries to theappropriate underlying data services receives the resultsperforms joins aggregates data and returns the resultsThe DQE is built as a Java service and can be deployed asa grid service

Domain ServicesThe domain services themselves can be further dividedinto a number of tiers including a relational databasemanagement system (RDBMS) Hibernate CommonQuery Language processor httpgforgencinihgovplu

Page 6 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

ginsscmcvscvswebphpcagrid-1-0DocumentatiodocsDataServiceTechnicaldataServiceTechnicalGuidedoccvsroot=cagrid-1-0 and grid service Hiber-nate provides an object-oriented abstraction of the under-lying relational database exposing tables and columns asclasses and attributes in a flexible configurable mannerhttpwwwhibernateorg The CQL processor translatesa CQL query based upon common data elements andtheir associations into a Hibernate Query Language(HQL) query This is a relatively complex mappingbecause CQL and HQL have much different syntaxesFinally the grid service provides a standards-based entrypoint to issue a CQL query caGrid httpscabigncinihgovworkspacesArchitecturecaGrid pro-vides the underlying technology stack for the grid servicewhich itself is based upon Globus 41 httpwwwglobusorg

SecuritycaTRIP leverages the caGrid 10 infrastructure which pro-vides robust security for a distributed grid environment(Figure 5)

Security AuthenticationThe first step in the security workflow is authenticationwhich is the process by which a trusted organizationasserts a user is who they claim to be This is implementedusing the caGrid Authentication Service The goal is for theuser to obtain a grid user certificate to be used in authori-zation The user submits their credentials (user name andpassword) to the AuthenticationService which has aninstitutional identity provider (IdP) plug-in This compo-nent performs the necessary steps to determine that theusers credentials are valid In the case of the Duke IdPplugin the user name and password are validated against

Steps to perform a query in the caTRIP Simple InterfaceFigure 3Steps to perform a query in the caTRIP Simple Interface

Page 7 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

a Duke Domain Controller which uses NT Security Thismeans that a user can log in with the same Cancer Centercredentials they use to log in to their workstation Oncevalidated the AuthenticationService returns a SecurityAssertion Markup Language (SAML) document httpwwwoasis-openorgcommitteesdownloadphp14361sstc-saml-tech-overview-20-draft-08pdf to the userwhich asserts that the user is who they claim to be TheSAML Assertion can then be passed off to a Dorian Serviceto generate a grid user certificate All of these interactionsare handled for the user by the caTRIP application

Security AuthorizationThe second step in the security workflow is authorizationwhich is the process by which a system grants the useraccess to specific resources This is implemented using the

Common Security Module (CSM) plug-in to the grid dataservice The CSM Phillips 2006 is an open sourcetoolkit that provides a flexible way to provision users andmake authorization decisions When a data field isaccessed on behalf of the user an authorization structureis queried to determine whether the user has permissionto access it If not then a security exception is returned tothe client Authorization is made possible because trustexists between the service serving the data resource andthe AuthenticationService

Delegation and Foreign CDEsAn alternate flow in the security workflow is delegationwhich is the process by which a service acts on behalf ofthe user This impacts caTRIP because the distributedquery engine can be run as a service so it must query the

Example Advanced Interfacing depicting the potential complexity of distributed queriesFigure 4Example Advanced Interfacing depicting the potential complexity of distributed queries

Page 8 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

domain services using the users certificate The finalimplementation will likely leverage the DelegationServicefrom the Globus project The user will signal the Delega-tionService that it is permissible for a collocated service toact on the users behalf In the meantime the distributedquery engine is collocated with graphical user interface soit simply passes the users credentials to the data servicesidentified in the distributed query This bypasses some ofthe business rules defined by delegation such as thenumber of delegation hops allowed by the user but it issufficient to meet caTRIP use cases

Another change on the security workflow is the use offoreign CDEs for the purpose of joining data It is con-ceivable that a user does not have permission to view aCDE that is used for joining between two differentdomain services The classic example of this is to join

patients across services using the Medical Record NumberIn this case a special security workflow must be under-taken during authorization A level of trust is maintainedbetween the distributed query engine service and thedomain service such that foreign CDEs are returned to theDQE service with the understanding that the delegateduser is not authorized to view them but is authorized tojoin on them This workflow is only possible if the DQE isnot collocated with the client application

Automated Honest Broker ServiceThe primary goal of caTRIP is to provide users access todiverse datasets that cross domains and data sources Onechallenge that arises from this is deidentifying datasetsthat are managed by different groups which especiallyaffects deidentifying foreign CDEs (MRNs) that are usedto link across different datasets A typical scenario for

High-level overview of the caGrid security architecture httpcagridorgFigure 5High-level overview of the caGrid security architecture httpcagridorg

Page 9 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

cross-dataset deidentification is for an IRB appointed indi-vidual to collect each dataset and deidentify them all atonce tossing away the keys to the Protected Health Infor-mation (PHI) This approach is not scalable in a distrib-uted environment because it requires a middle-man toperform the deidentification Whenever a new dataset isadded to the system all previously existing data sets mustbe deidentified again To address this in caTRIP we devel-oped an Automated Honest Broker Service (AHBS) whichperforms the tasks once performed by a human appointedby the Institutional Review Board A data owner program-matically submits PHI (eg an MRN) to the AHBS using asecure connection The AHBS generates a random keyassociates it internally in a database to the PHI (which canbe encrypted) and then returns the key to the data ownerThe data owner then can deidentify his dataset by replac-ing the PHI with the random key When more than onedata ownerdataset is involved the AHBS will alwaysreturn the same random key for the same PHI In this waydiverse distributed datasets can be deidentified using thesame keys

Example from the Duke ImplementationThe following example illustrates how caTRIP would con-struct and execute a user query that joins disparate data-sets A user at Duke could ask the question What are allof the available tissue specimens from the breast obtainedfrom patients tested for positive expression of the estrogenreceptor The distributed query engine would first fetchall of the patient medical record numbers from the Clini-cal Annotation Engine (CAE) service that have estrogenreceptor expression annotated as positive It would thenissue a query to the caTissue CORE service for tissue spec-imens that are marked available were collected from thetissue site of breast and were obtained from patients thathave medical record numbers that match those returnedfrom the previous query The results would be a list of tis-sue specimens including their barcodes location sizeetc The single query issued to the distributed queryengine would look like this

1 ltDCQLQuery xmlns = httpcaGridcaBIG10govnihncicagriddcqlgt

2 ltTargetObject name = httpeduwustlcatissuecoredomainobjectimplTissueSpecimenImpl serv-iceURL = httplocalhost8080wsrfservicescagridCaTissueCoregt

3 ltGroup logicRelation = ANDgt

4 ltAttribute name = available predicate =EQUAL_TO value = truegt

5 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCharacteristicsImpl roleName = specimenCharacteristicsgt

6 ltAttribute name = tissueSite predicate =LIKE value = breastgt

7 ltAssociationgt

8 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCollectionGroupImpl roleName = specimenCollectionGroupgt

9 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplClinicalReportImplroleName = clinicalReportgt

10 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplParticipantMedicalIdentifierImpl roleName = participantMedicalIdentifiergt

11 ltForeignAssociationgt

12 ltJoinConditiongt

13 ltLeftJoingt

14ltObjectgteduwustlcatissuecoredomai

nobjectimplParticipantMedicalIdentifierImplltObjectgt

15ltPropertygtmedicalRecordNumberlt

Propertygt

16 ltLeftJoingt

17 ltRightJoingt

18ltObjectgtedudukecatripcaedomaing

eneralParticipantMedicalIdentifierltObjectgt

19ltPropertygtmedicalRecordNumberlt

Propertygt

20 ltRightJoingt

21 ltJoinConditiongt

22 ltForeignObject name =httpedudukecatripcaedomaingeneralParticipantMedicalI dentifier serviceURL = httplocalhost8080wsrfservicescagridCAEgt

Page 10 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

23 ltAssociation name = httpedudukecatripcaedomaingeneralParticipantroleName = participantgt

24 ltAssociation name =httpedupittcabigcaedomaingeneralAnnotationEventPa rameters roleName = annotationEventParameter-sCollectiongt

25 ltAssociation name =httpedupittcabigcaedomainbreastBreastCancerBiomark ers roleName = annotationSetCollectiongt

26 ltAttribute name = estrogenRe-ceptor predicate = LIKE value = POSITIVEgt

27 ltAssociationgt

28 ltAssociationgt

29 ltAssociationgt

30 ltForeignObjectgt

31 ltForeignAssociationgt

32 ltAssociationgt

33 ltAssociationgt

ltAssociationgt

ltGroupgt

ltTargetObjectgt

ltDCQLQuerygt

Note on line 1 that the query is rooted at caTissue COREwhere available specimens are filtered on line 4 and spec-imens from the tissue site of breast are filtered on line 6Lines 11ndash31 define the foreign association between caTis-sue CORE and CAE The medical record numbers (MRNs)of each service are joined on lines 12ndash21 by defining theleft join criteria to be the MRN from caTissue CORE andthe right join criteria to the be MRN from CAE Line 22selects the foreign service to be CAE from which patientswith estrogen receptor expression marked as positive arefiltered on line 26 The query itself is unfolded and exe-cuted inside-out by the distributed query engine CAEbeing the first service to be queried and the results used toquery caTissue CORE Note also that the association pathsof the objects are defined within the query on lines 5 8 910 23 24 and 25 the logic of which is hidden from theusers who are simply selecting data elements to query

upon and return The abbreviated results would look likethis

1 ltqueryResults targetClassname =httpeduwustlcatissuecoredomainobjectimplTissueSpeci menImplgt

2 ltns1ObjectResult xmlnsns1 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

3 ltns2TissueSpecimen id = 397 type = Fixed TissueBlock available = true positionDimensionOne = 20positionDimensionTwo = 22 barcode = ISBN-10AB00-0013XY comments = TISSUE IN FORMALIN 30activityStatus = Active quantityInGram = 07 availa-bleQuantityInGram = 0199999988079071 xmlnsns2= gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

4 ltns1ObjectResultgt

5 ltns2ObjectResult xmlnsns2 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

6 ltns3TissueSpecimen id = 406 type = Fixed TissueBlock available = true positionDimensionOne = 19positionDimensionTwo = 14 barcode = ISBN-10AB00-0043XY comments = STRONG FAMILY HISTORYFOR BRCA MUTATIONS activityStatus = Active quan-tityInGram = 675 availableQuantityInGram = 625xmlnsns3 = gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

7 ltns2ObjectResultgt

8 ltns3ObjectResult xmlnsns3 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

9 ltns4TissueSpecimen id = 411 type = Fixed TissueBlock available = true positionDimensionOne = 21positionDimensionTwo = 7 barcode = ISBN-10 AB00-0063XY comments = NO MORE WHOLE BLOODactivityStatus = Active quantityInGram = 4275 avail-ableQuantityInGram = 4225 xmlnsns4 =gmecaCOREcabig30eduwustlcatissuecoredomainob jectgt

10 ltns3ObjectResultgt

11 ltqueryResultsgt

Note that line 1 defines the type of object being returnedin the query results and that the results themselves found

Page 11 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

on lines 3 6 and 9 are each of the objects that match thequery criteria with data elements of interest embeddedwithin them as attributes

ConclusioncaTRIP is being implemented in a phased-developmentcycle meeting certain objectives at each point to provideutility as the program is extended It is currently in its sec-ond phase of development Phase one concluded the pilotdevelopment activities of caTRIP providing a baselinearchitecture and proof-of-concept user interface End usertraining was provided on a limited basis for the smallmultidisciplinary breast oncology group using live dem-onstrations and basic instructions provided through e-mail messages Feedback was collected through e-mail aswell The multidisciplinary group was given direct accessto the primary software developers and technical expertsduring this period In phase two we plan to deploycaTRIP-II along with enhancements determined throughend user interaction Furthermore we are also in the proc-ess of engaging different groups at Duke to adopt caTRIPto help meet their research goals

In phase one of caTRIP use cases focused on intra-institu-tional problems tearing down the silos of data-orientedsystems in a single cancer center In future iterations weplan to extend caTRIP to tackle cross-institutional usecases This involves extending the caTRIP interface anddistributed query engine to support aggregation of datafrom services that expose common data elements Forexample a motivating use case amongst prospectiveadopters is to aggregate treatment endpoint pathologyand tissue banking data from a number of different cancercenters This will allow researchers and clinicians toexpand their queries and answer questions that they pre-viously could not at their own institution Furthermorewe plan to integrate caTRIP with analytics specificallyfocusing on microarray analysis The caBIG gene expres-sion database caArray is a rich source of central or distrib-uted gene expression data which itself can be tied toclinical data such as pathology biomarkers This microar-ray data can be analyzed and visualized dynamically usingservices provided by other caBIG applications such asgeWorkbench Califano 2006 httpapiiiupmceduabstractsposterarchive2006watkinsonhtml GenePat-tern [8] and Bioconductor[9] This notion of buildingworkflows based upon queries designed using caTRIPssimple interface can provide a powerful tool for bothresearchers and clinicians Finally we also plan toenhance the user interface providing more sophisticatedreporting and data mining functionality All of these fea-tures will be driven by end-user requirements of the adop-ter sites which is a critical facet to the caBIG initiative

Availability and requirementsProject name Cancer Translational Research InformaticsPlatform (caTRIP)

Project home page httpsgforgencinihgovprojectscatrip

Operating system(s) platform independent

Programming language Java

Other requirements Tomcat 5028 a database (MySQL4110 or Oracle 10x) Java Version 15 or higher andApache Ant version 165

License NCI caBIG license (non-viral)

Any restrictions to use by non-academics no

Competing interestsThe authors declare that they have no competing interests

Authors contributionsPM was the Lead Architect and Project Manager of caTRIPRD provided subject matter expertise to guide the devel-opment of caTRIP Ram Chilukuri was the lead developerof caTRIP RP provided subject matter expertise KJ pro-vided subject matter expertise RA was the project directorAJC was the project PI

AcknowledgementsThis publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR) a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH Information on NCRR is available at httpwwwncrrnihgov Information on Re-engineer-ing the Clinical Research Enterprise can be obtained from httpnihroad mapnihgovclinicalresearchoverview-translationalasp

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant The authors would like also to acknowledge the contributions of the developers database administrators and system administrators at Duke including Mark Peedin Christopher Hubbard Wilma Stanley Farid Mohammad and William Banks We would also like to thank the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC as well as Bill Mason from at 5 AM Solutions Addition-ally we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements They include Drs Kelley Marcom Gretchen Kimmick Kimberly Blackwell and Lee Wilke Finally we would also like to thank our liaison to NCI Juli Klemm for her support and intellectual contributions

References1 Fenstermacher D Street C McSherry T Nayak V Overby C Feld-

man M The Cancer Biomedical Informatics Grid (caBIGtrade)Conf Proc IEEE Eng Med Biol Soc 2005 1743-6

Page 12 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Publish with BioMed Central and every scientist can read your work free of charge

BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime

Sir Paul Nurse Cancer Research UK

Your research papers will be

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours mdash you keep the copyright

Submit your manuscript herehttpwwwbiomedcentralcominfopublishing_advasp

BioMedcentral

2 Esposito NN Chivukula M Dabbs DJ The ductal phenotypicexpression of the E-cadherincatenin complex in tubulolobu-lar carcinoma of the breast an immunohistochemical andclinicopathologic study Mod Pathol 2007 20(1)130-8

3 Sahin FI Yilmaz Z Yagmurdur MC Atac FB Ozdemir BH KarakayaliH Demirhan B Haberal M Clinical findings and HER-2neu geneamplification status of breast carcinoma patients Pathol OncolRes 2006 12(4)211-5

4 Sughayer MA Al-Khawaja MM Massarweh S Al-Masri M Preva-lence of hormone receptors and HER2neu in breast cancercases in Jordan Pathol Oncol Res 2006 12(2)83-6

5 Radovic S Babic M Doric M Secic S Beslic S Balta E Kapetanovic EImmunohistochemical evaluation of the HER-2 protein inthe infiltrative lobular breast cancer Med Arh 200660(4)213-6

6 Mills ME Linkage of patient records to support continuity ofcare Issues and future directions Stud Health Technol Inform2006 122320-4

7 Saltz J Oster S Hastings S Langella S Kurc T Sanchez W Kher MManisundaram A Shanbhag K Covitz P caGrid design and imple-mentation of the core architecture of the cancer biomedicalinformatics grid Bioinformatics 2006 22(15)1910-6

8 Reich M Liefeld T Gould J Lerner J Tamayo P Mesirov JP GenePa-ttern 20 Nat Genet 2006 38(5)500-1

9 Gentleman RC Carey VJ Bates DM Bolstad B Dettling M Dudoit SEllis B Gautier L Ge Y Gentry J others Bioconductor open soft-ware development for computational biology and bioinfor-matics Genome Biol 2004 5(10)R80

Pre-publication historyThe pre-publication history for this paper can be accessedhere

httpwwwbiomedcentralcom1472-6947860prepub

Page 13 of 13(page number not for citation purposes)

  • Abstract
    • Background
    • Results
    • Conclusion
      • Background
        • Case Study Outcomes Analysis
        • Design Objectives
          • Implementation
            • Technical Requirements
            • Software architecture
            • Implementation at Duke
              • Results and discussion
                • Graphical User Interface
                • Simple Interface
                • Advanced Interface
                • Distributed Query Engine
                • Domain Services
                • Security
                • Security Authentication
                • Security Authorization
                • Delegation and Foreign CDEs
                • Automated Honest Broker Service
                • Example from the Duke Implementation
                  • Conclusion
                  • Availability and requirements
                  • Competing interests
                  • Authors contributions
                  • Acknowledgements
                  • References
                  • Pre-publication history
Page 4: The cancer translational research informatics platform

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Design ObjectivesThe objectives and sample questions that caTRIP isdesigned to address include

1 Allow for the search of available tumor tissue forpatients with factors of interest A typical question to beanswered by caTRIP would thus be What are all the tissuespecimens from Her-2neu positive patients that have a primarytumor in the breast and are BRCA1 positive

2 Identify predictors contributing to increased ordecreased survival among a subset of cancer patientscaTRIP should answer questions such as What are all theER positive patients that have survived breast cancer after radi-ation treatment

3 Determine the distribution over time of a number offactors that may contribute to or reflect progression of adisease state Does a change in pathology biomarkers overtime contribute to recurrence or death

4 Determine the association between a multitude of clin-ical predictors and their respective post-operative out-comes such as Does a change in ER or PR status before andafter surgery correlate with other factors

5 Identify patients eligible for participation in clinical tri-als that meet a pre-defined set of inclusion and exclusioncriteria such as List all patients that are triple negative (ERPR and Her-2neu negative)

6 Identify pathology reports of interest such as Show meall pathology reports for Her-2neu positive patients with a lob-ular carcinoma

In addition it is important that the system provide secu-rity and confidentiality levels in accordance with HIPAA(Health Insurance Portability and Accountability Act) andpart 11 FDA regulations

ImplementationTechnical RequirementsOur high level scientific use cases were in part motivatedby the need to access a variety of data types stored in anumber of data systems at Duke The Medical Assistant onthe World Wide Web (MAW3) data from the GeneticModifiers of BRCA 12 study (GEMS) and Tumor Registrydata sources feed caTRIP pathology clinical tissue bank-ing single nucleotide polymorphism (SNP) and treat-mentendpoint data From this baseline we defined theoutcomes analysis use cases that feed into the system andapplication requirements that define the overall set of fea-tures that caTRIP implements (see Background for scien-tific use cases)

The system requirements primarily relate to developmentin the Cancer Biomedical Informatics Grid (caBIG) initia-tive [1] caBIG has outlined a rigorous set of requirementsrelated to interoperability that span the categories of leg-acy bronze silver and gold compatibility Silver compat-ibility places requirements on both the data elements thatare exposed toby caTRIP as well as the application pro-gramming interfaces (APIs) Data elements leveraged incaTRIP must be defined as common data elements(CDEs) registered in the Cancer Data Standards Reposi-tory (caDSR httpncicbncinihgovNCICBinfrastructurecacore_overviewcadsr) Each CDE must have one ormore semantic concepts registered in the EnterpriseVocabulary Service (EVS httpncicbncinihgovNCICBinfrastructurecacore_overviewvocabulary) Thiscombination of defining the structure of the data as wellas the underlying semantic concepts describing the mean-ing of the data provide caTRIP with semantically richinformation models to query upon However in order toquery across different data systems it is critical that theCDEs be harmonized This allows for data to be aggre-gated and presented to the user as well as provides com-mon CDEs or keys to join on One way to catalogue theseforeign CDEs is through a Master Person Index whichwould provide a common identifier to link patients acrosssystems [6] It is possible though not without challengesto identify and link patients within a single institutionusing an institutional medical record number (MRN)However queries that join on patients across institutionalboundaries can be quite challenging

Compatibility of the APIs in caTRIP falls more squarelyinto caBIG gold compatibility which has not fully beendefined by the caBIG program as of yet All data that isaccessible to caTRIP must be exposed via caGrid data serv-ices [7] This means that the data is queriable via a well-defined standards-based interface that accepts queries viathe caGrid Common Query Language (CQL) whichdescribes the query in object-oriented terms correspond-ing to the registered CDEs Finally there are also systemrequirements related to security which include authenti-cating users based on credentials and authorizing users toaccess data

Software architecturecaTRIP is developed as an N-tier architecture with threeprimary tiers domain services the distributed queryengine (DQE) and the graphical user interface (Figure 2)The domain services are grid services that expose informa-tion models that encapsulate the underlying data The dis-tributed query engine takes as input a distributedcommon query language (DCQL) query decomposes itinto individual common query language (CQL) queriesissues those queries to the domain services joinsaggre-gates the results and returns the final results to the caller

Page 4 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

The graphical user interface (GUI) provides a metadata-driven interface for building and running DCQL It alsoprovides an interface for saving and searching for queries

Another important goal for caTRIP is to leverage existingtechnologies wherever possible These include

bull Globus 41 for the underlying messaging stack

bull caGrid 10 for the underlying service infrastructure

bull Introduce for service creation

bull Index Service for discovering services on the grid

bull Authentication Service to authenticate users using theirinstitutional credentials

bull Dorian to obtain a grid user proxy

bull Grid Grouper to authorize users to access data based ongroups

bull Common Security Module (CSM) for plugging authoriza-tion into the data service

bull Cancer Data Standards Repository (caDSR) for storingcommon data elements (CDEs)

High-level view of the caTRIP architecture including Domain Services Distributed Query Engine and Graphical User InterfaceFigure 2High-level view of the caTRIP architecture including Domain Services Distributed Query Engine and Graph-ical User Interface

Page 5 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

bull Enterprise Vocabulary Services (EVS) for storing semanticconcepts

bull Hibernate for an object-relational mapping

bull Oracle and MySQL for backend relational databases

More information can be found in the caTRIP architecturedocument and UML models

bull httpgforgencinihgovpluginsscmcvscvswebphpcatripdocumentationtechnical_guidecaTRIP_Technical_Guidedoccvsroot=catrip

bull httpgforgencinihgovpluginsscmcvscvswebphca-tripdocumentationmodelscaTRIP_UMLdoccvsroot=catrip

Implementation at DukeWe deployed caTRIP at Duke to leverage data from theGenetic Modifiers of BRCA1 and BRCA2 study in theDuke Breast Specialized Program of Research Excellence(SPORE) The goal of the GEMS investigation is to identifyfactors that influence the incidence of breast cancer ingermline BRCA12 mutation carriers To this end wedeployed tissue banking and pathology data from MAW3into caTissue CORE CAE and caTIES treatment recur-rence and endpoint data from our Tumor Registry into agrid service and SNP data from flat files into caIntegratorThese services were secured and then exposed to thecaTRIP interface This deployment allowed us to answerall of the domain questions that are found in the DesignObjectives section of this article Users of this pilotdeployment included a set of pathologists and clinicianswho provided feedback that guided the developmentproviding important feedback such as the use of menu-drive interfaces that require little knowledge of the back-end data instead of drag-and-drop interfaces that dorequire such knowledge

Results and discussionGraphical User InterfaceThe Graphical User Interface (GUI) is a Java Swing-basedthick client that can be accessed via Java WebStart httpjavasuncomproductsjavawebstart It provides the userwith the ability to construct execute and share distrib-uted queries in a graphical fashion There are two primaryways to construct queries the Simple Interface is a morelimited yet powerful way to build queries using drop-down menus and the Advanced Interface is a flexible wayto build queries using drag-and-drop

Simple InterfaceThe Simple Interface was developed to target the non-technical users such as a clinician Queries are constructed

using drop-down menus to add filters ANDOR groupsand return attributes (Figures 1 and 3) Available servicesare selected via a configuration file as well as the CDEsthat make up the drop-down menu items for filters andreturn attributes Joins across these services are performedusing common identifiers Specifically in caTRIP we usethe (optionally) deidentified Medical Record Number(MRN) of patients to join and aggregate data The datamodels of the underlying services are composed of poten-tially complex classes attributes and associationsbetween the classes (see Figure 4) In order to constructqueries that span multiple classes and associations theSimple Interface leverages a configuration file thatdescribes the outbound path from each class to the MRNas well as the inbound path to each filterablereturnableCDE This enables the automated logic that would nor-mally be performed by a user manually selecting all of theassociations necessary to perform a query

Advanced InterfaceThe Advanced Interface was developed to target a datatechnician who may wish to perform more complex que-ries Queries are constructed by dragging classes onto acanvas drawing associations between them and addingfilters along the way (Figure 4) Because the logic of howto traverse the information space is not built-in like it iswith the Simple Interface users must manually createthe association paths to the classesattributes that theywould like to filter on as well as those they would like toperform cross-service foreign joins with This allows usersto perform queries not supported by the Simple Inter-face but adds an extra level of complexity

Distributed Query EngineThe Distributed Query Engine (DQE) provides the logic toperform a query across different caGrid data services ThecaTRIP team developed a distributed query engine whichwas later adopted into the caGrid 10 toolkit as the Feder-ated Query Processor (FQP) module Currently the DQEis available for use by the general caBIG community andis currently being leveraged by the Cancer Bench to Bed-side (caB2B) project httpgforgencinihgovprojectscab2b It takes as input a Distributed Common QueryLanguage (DCQL) query breaks it into separate CommonQuery Language (CQL) queries issues those queries to theappropriate underlying data services receives the resultsperforms joins aggregates data and returns the resultsThe DQE is built as a Java service and can be deployed asa grid service

Domain ServicesThe domain services themselves can be further dividedinto a number of tiers including a relational databasemanagement system (RDBMS) Hibernate CommonQuery Language processor httpgforgencinihgovplu

Page 6 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

ginsscmcvscvswebphpcagrid-1-0DocumentatiodocsDataServiceTechnicaldataServiceTechnicalGuidedoccvsroot=cagrid-1-0 and grid service Hiber-nate provides an object-oriented abstraction of the under-lying relational database exposing tables and columns asclasses and attributes in a flexible configurable mannerhttpwwwhibernateorg The CQL processor translatesa CQL query based upon common data elements andtheir associations into a Hibernate Query Language(HQL) query This is a relatively complex mappingbecause CQL and HQL have much different syntaxesFinally the grid service provides a standards-based entrypoint to issue a CQL query caGrid httpscabigncinihgovworkspacesArchitecturecaGrid pro-vides the underlying technology stack for the grid servicewhich itself is based upon Globus 41 httpwwwglobusorg

SecuritycaTRIP leverages the caGrid 10 infrastructure which pro-vides robust security for a distributed grid environment(Figure 5)

Security AuthenticationThe first step in the security workflow is authenticationwhich is the process by which a trusted organizationasserts a user is who they claim to be This is implementedusing the caGrid Authentication Service The goal is for theuser to obtain a grid user certificate to be used in authori-zation The user submits their credentials (user name andpassword) to the AuthenticationService which has aninstitutional identity provider (IdP) plug-in This compo-nent performs the necessary steps to determine that theusers credentials are valid In the case of the Duke IdPplugin the user name and password are validated against

Steps to perform a query in the caTRIP Simple InterfaceFigure 3Steps to perform a query in the caTRIP Simple Interface

Page 7 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

a Duke Domain Controller which uses NT Security Thismeans that a user can log in with the same Cancer Centercredentials they use to log in to their workstation Oncevalidated the AuthenticationService returns a SecurityAssertion Markup Language (SAML) document httpwwwoasis-openorgcommitteesdownloadphp14361sstc-saml-tech-overview-20-draft-08pdf to the userwhich asserts that the user is who they claim to be TheSAML Assertion can then be passed off to a Dorian Serviceto generate a grid user certificate All of these interactionsare handled for the user by the caTRIP application

Security AuthorizationThe second step in the security workflow is authorizationwhich is the process by which a system grants the useraccess to specific resources This is implemented using the

Common Security Module (CSM) plug-in to the grid dataservice The CSM Phillips 2006 is an open sourcetoolkit that provides a flexible way to provision users andmake authorization decisions When a data field isaccessed on behalf of the user an authorization structureis queried to determine whether the user has permissionto access it If not then a security exception is returned tothe client Authorization is made possible because trustexists between the service serving the data resource andthe AuthenticationService

Delegation and Foreign CDEsAn alternate flow in the security workflow is delegationwhich is the process by which a service acts on behalf ofthe user This impacts caTRIP because the distributedquery engine can be run as a service so it must query the

Example Advanced Interfacing depicting the potential complexity of distributed queriesFigure 4Example Advanced Interfacing depicting the potential complexity of distributed queries

Page 8 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

domain services using the users certificate The finalimplementation will likely leverage the DelegationServicefrom the Globus project The user will signal the Delega-tionService that it is permissible for a collocated service toact on the users behalf In the meantime the distributedquery engine is collocated with graphical user interface soit simply passes the users credentials to the data servicesidentified in the distributed query This bypasses some ofthe business rules defined by delegation such as thenumber of delegation hops allowed by the user but it issufficient to meet caTRIP use cases

Another change on the security workflow is the use offoreign CDEs for the purpose of joining data It is con-ceivable that a user does not have permission to view aCDE that is used for joining between two differentdomain services The classic example of this is to join

patients across services using the Medical Record NumberIn this case a special security workflow must be under-taken during authorization A level of trust is maintainedbetween the distributed query engine service and thedomain service such that foreign CDEs are returned to theDQE service with the understanding that the delegateduser is not authorized to view them but is authorized tojoin on them This workflow is only possible if the DQE isnot collocated with the client application

Automated Honest Broker ServiceThe primary goal of caTRIP is to provide users access todiverse datasets that cross domains and data sources Onechallenge that arises from this is deidentifying datasetsthat are managed by different groups which especiallyaffects deidentifying foreign CDEs (MRNs) that are usedto link across different datasets A typical scenario for

High-level overview of the caGrid security architecture httpcagridorgFigure 5High-level overview of the caGrid security architecture httpcagridorg

Page 9 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

cross-dataset deidentification is for an IRB appointed indi-vidual to collect each dataset and deidentify them all atonce tossing away the keys to the Protected Health Infor-mation (PHI) This approach is not scalable in a distrib-uted environment because it requires a middle-man toperform the deidentification Whenever a new dataset isadded to the system all previously existing data sets mustbe deidentified again To address this in caTRIP we devel-oped an Automated Honest Broker Service (AHBS) whichperforms the tasks once performed by a human appointedby the Institutional Review Board A data owner program-matically submits PHI (eg an MRN) to the AHBS using asecure connection The AHBS generates a random keyassociates it internally in a database to the PHI (which canbe encrypted) and then returns the key to the data ownerThe data owner then can deidentify his dataset by replac-ing the PHI with the random key When more than onedata ownerdataset is involved the AHBS will alwaysreturn the same random key for the same PHI In this waydiverse distributed datasets can be deidentified using thesame keys

Example from the Duke ImplementationThe following example illustrates how caTRIP would con-struct and execute a user query that joins disparate data-sets A user at Duke could ask the question What are allof the available tissue specimens from the breast obtainedfrom patients tested for positive expression of the estrogenreceptor The distributed query engine would first fetchall of the patient medical record numbers from the Clini-cal Annotation Engine (CAE) service that have estrogenreceptor expression annotated as positive It would thenissue a query to the caTissue CORE service for tissue spec-imens that are marked available were collected from thetissue site of breast and were obtained from patients thathave medical record numbers that match those returnedfrom the previous query The results would be a list of tis-sue specimens including their barcodes location sizeetc The single query issued to the distributed queryengine would look like this

1 ltDCQLQuery xmlns = httpcaGridcaBIG10govnihncicagriddcqlgt

2 ltTargetObject name = httpeduwustlcatissuecoredomainobjectimplTissueSpecimenImpl serv-iceURL = httplocalhost8080wsrfservicescagridCaTissueCoregt

3 ltGroup logicRelation = ANDgt

4 ltAttribute name = available predicate =EQUAL_TO value = truegt

5 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCharacteristicsImpl roleName = specimenCharacteristicsgt

6 ltAttribute name = tissueSite predicate =LIKE value = breastgt

7 ltAssociationgt

8 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCollectionGroupImpl roleName = specimenCollectionGroupgt

9 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplClinicalReportImplroleName = clinicalReportgt

10 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplParticipantMedicalIdentifierImpl roleName = participantMedicalIdentifiergt

11 ltForeignAssociationgt

12 ltJoinConditiongt

13 ltLeftJoingt

14ltObjectgteduwustlcatissuecoredomai

nobjectimplParticipantMedicalIdentifierImplltObjectgt

15ltPropertygtmedicalRecordNumberlt

Propertygt

16 ltLeftJoingt

17 ltRightJoingt

18ltObjectgtedudukecatripcaedomaing

eneralParticipantMedicalIdentifierltObjectgt

19ltPropertygtmedicalRecordNumberlt

Propertygt

20 ltRightJoingt

21 ltJoinConditiongt

22 ltForeignObject name =httpedudukecatripcaedomaingeneralParticipantMedicalI dentifier serviceURL = httplocalhost8080wsrfservicescagridCAEgt

Page 10 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

23 ltAssociation name = httpedudukecatripcaedomaingeneralParticipantroleName = participantgt

24 ltAssociation name =httpedupittcabigcaedomaingeneralAnnotationEventPa rameters roleName = annotationEventParameter-sCollectiongt

25 ltAssociation name =httpedupittcabigcaedomainbreastBreastCancerBiomark ers roleName = annotationSetCollectiongt

26 ltAttribute name = estrogenRe-ceptor predicate = LIKE value = POSITIVEgt

27 ltAssociationgt

28 ltAssociationgt

29 ltAssociationgt

30 ltForeignObjectgt

31 ltForeignAssociationgt

32 ltAssociationgt

33 ltAssociationgt

ltAssociationgt

ltGroupgt

ltTargetObjectgt

ltDCQLQuerygt

Note on line 1 that the query is rooted at caTissue COREwhere available specimens are filtered on line 4 and spec-imens from the tissue site of breast are filtered on line 6Lines 11ndash31 define the foreign association between caTis-sue CORE and CAE The medical record numbers (MRNs)of each service are joined on lines 12ndash21 by defining theleft join criteria to be the MRN from caTissue CORE andthe right join criteria to the be MRN from CAE Line 22selects the foreign service to be CAE from which patientswith estrogen receptor expression marked as positive arefiltered on line 26 The query itself is unfolded and exe-cuted inside-out by the distributed query engine CAEbeing the first service to be queried and the results used toquery caTissue CORE Note also that the association pathsof the objects are defined within the query on lines 5 8 910 23 24 and 25 the logic of which is hidden from theusers who are simply selecting data elements to query

upon and return The abbreviated results would look likethis

1 ltqueryResults targetClassname =httpeduwustlcatissuecoredomainobjectimplTissueSpeci menImplgt

2 ltns1ObjectResult xmlnsns1 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

3 ltns2TissueSpecimen id = 397 type = Fixed TissueBlock available = true positionDimensionOne = 20positionDimensionTwo = 22 barcode = ISBN-10AB00-0013XY comments = TISSUE IN FORMALIN 30activityStatus = Active quantityInGram = 07 availa-bleQuantityInGram = 0199999988079071 xmlnsns2= gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

4 ltns1ObjectResultgt

5 ltns2ObjectResult xmlnsns2 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

6 ltns3TissueSpecimen id = 406 type = Fixed TissueBlock available = true positionDimensionOne = 19positionDimensionTwo = 14 barcode = ISBN-10AB00-0043XY comments = STRONG FAMILY HISTORYFOR BRCA MUTATIONS activityStatus = Active quan-tityInGram = 675 availableQuantityInGram = 625xmlnsns3 = gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

7 ltns2ObjectResultgt

8 ltns3ObjectResult xmlnsns3 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

9 ltns4TissueSpecimen id = 411 type = Fixed TissueBlock available = true positionDimensionOne = 21positionDimensionTwo = 7 barcode = ISBN-10 AB00-0063XY comments = NO MORE WHOLE BLOODactivityStatus = Active quantityInGram = 4275 avail-ableQuantityInGram = 4225 xmlnsns4 =gmecaCOREcabig30eduwustlcatissuecoredomainob jectgt

10 ltns3ObjectResultgt

11 ltqueryResultsgt

Note that line 1 defines the type of object being returnedin the query results and that the results themselves found

Page 11 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

on lines 3 6 and 9 are each of the objects that match thequery criteria with data elements of interest embeddedwithin them as attributes

ConclusioncaTRIP is being implemented in a phased-developmentcycle meeting certain objectives at each point to provideutility as the program is extended It is currently in its sec-ond phase of development Phase one concluded the pilotdevelopment activities of caTRIP providing a baselinearchitecture and proof-of-concept user interface End usertraining was provided on a limited basis for the smallmultidisciplinary breast oncology group using live dem-onstrations and basic instructions provided through e-mail messages Feedback was collected through e-mail aswell The multidisciplinary group was given direct accessto the primary software developers and technical expertsduring this period In phase two we plan to deploycaTRIP-II along with enhancements determined throughend user interaction Furthermore we are also in the proc-ess of engaging different groups at Duke to adopt caTRIPto help meet their research goals

In phase one of caTRIP use cases focused on intra-institu-tional problems tearing down the silos of data-orientedsystems in a single cancer center In future iterations weplan to extend caTRIP to tackle cross-institutional usecases This involves extending the caTRIP interface anddistributed query engine to support aggregation of datafrom services that expose common data elements Forexample a motivating use case amongst prospectiveadopters is to aggregate treatment endpoint pathologyand tissue banking data from a number of different cancercenters This will allow researchers and clinicians toexpand their queries and answer questions that they pre-viously could not at their own institution Furthermorewe plan to integrate caTRIP with analytics specificallyfocusing on microarray analysis The caBIG gene expres-sion database caArray is a rich source of central or distrib-uted gene expression data which itself can be tied toclinical data such as pathology biomarkers This microar-ray data can be analyzed and visualized dynamically usingservices provided by other caBIG applications such asgeWorkbench Califano 2006 httpapiiiupmceduabstractsposterarchive2006watkinsonhtml GenePat-tern [8] and Bioconductor[9] This notion of buildingworkflows based upon queries designed using caTRIPssimple interface can provide a powerful tool for bothresearchers and clinicians Finally we also plan toenhance the user interface providing more sophisticatedreporting and data mining functionality All of these fea-tures will be driven by end-user requirements of the adop-ter sites which is a critical facet to the caBIG initiative

Availability and requirementsProject name Cancer Translational Research InformaticsPlatform (caTRIP)

Project home page httpsgforgencinihgovprojectscatrip

Operating system(s) platform independent

Programming language Java

Other requirements Tomcat 5028 a database (MySQL4110 or Oracle 10x) Java Version 15 or higher andApache Ant version 165

License NCI caBIG license (non-viral)

Any restrictions to use by non-academics no

Competing interestsThe authors declare that they have no competing interests

Authors contributionsPM was the Lead Architect and Project Manager of caTRIPRD provided subject matter expertise to guide the devel-opment of caTRIP Ram Chilukuri was the lead developerof caTRIP RP provided subject matter expertise KJ pro-vided subject matter expertise RA was the project directorAJC was the project PI

AcknowledgementsThis publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR) a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH Information on NCRR is available at httpwwwncrrnihgov Information on Re-engineer-ing the Clinical Research Enterprise can be obtained from httpnihroad mapnihgovclinicalresearchoverview-translationalasp

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant The authors would like also to acknowledge the contributions of the developers database administrators and system administrators at Duke including Mark Peedin Christopher Hubbard Wilma Stanley Farid Mohammad and William Banks We would also like to thank the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC as well as Bill Mason from at 5 AM Solutions Addition-ally we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements They include Drs Kelley Marcom Gretchen Kimmick Kimberly Blackwell and Lee Wilke Finally we would also like to thank our liaison to NCI Juli Klemm for her support and intellectual contributions

References1 Fenstermacher D Street C McSherry T Nayak V Overby C Feld-

man M The Cancer Biomedical Informatics Grid (caBIGtrade)Conf Proc IEEE Eng Med Biol Soc 2005 1743-6

Page 12 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Publish with BioMed Central and every scientist can read your work free of charge

BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime

Sir Paul Nurse Cancer Research UK

Your research papers will be

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours mdash you keep the copyright

Submit your manuscript herehttpwwwbiomedcentralcominfopublishing_advasp

BioMedcentral

2 Esposito NN Chivukula M Dabbs DJ The ductal phenotypicexpression of the E-cadherincatenin complex in tubulolobu-lar carcinoma of the breast an immunohistochemical andclinicopathologic study Mod Pathol 2007 20(1)130-8

3 Sahin FI Yilmaz Z Yagmurdur MC Atac FB Ozdemir BH KarakayaliH Demirhan B Haberal M Clinical findings and HER-2neu geneamplification status of breast carcinoma patients Pathol OncolRes 2006 12(4)211-5

4 Sughayer MA Al-Khawaja MM Massarweh S Al-Masri M Preva-lence of hormone receptors and HER2neu in breast cancercases in Jordan Pathol Oncol Res 2006 12(2)83-6

5 Radovic S Babic M Doric M Secic S Beslic S Balta E Kapetanovic EImmunohistochemical evaluation of the HER-2 protein inthe infiltrative lobular breast cancer Med Arh 200660(4)213-6

6 Mills ME Linkage of patient records to support continuity ofcare Issues and future directions Stud Health Technol Inform2006 122320-4

7 Saltz J Oster S Hastings S Langella S Kurc T Sanchez W Kher MManisundaram A Shanbhag K Covitz P caGrid design and imple-mentation of the core architecture of the cancer biomedicalinformatics grid Bioinformatics 2006 22(15)1910-6

8 Reich M Liefeld T Gould J Lerner J Tamayo P Mesirov JP GenePa-ttern 20 Nat Genet 2006 38(5)500-1

9 Gentleman RC Carey VJ Bates DM Bolstad B Dettling M Dudoit SEllis B Gautier L Ge Y Gentry J others Bioconductor open soft-ware development for computational biology and bioinfor-matics Genome Biol 2004 5(10)R80

Pre-publication historyThe pre-publication history for this paper can be accessedhere

httpwwwbiomedcentralcom1472-6947860prepub

Page 13 of 13(page number not for citation purposes)

  • Abstract
    • Background
    • Results
    • Conclusion
      • Background
        • Case Study Outcomes Analysis
        • Design Objectives
          • Implementation
            • Technical Requirements
            • Software architecture
            • Implementation at Duke
              • Results and discussion
                • Graphical User Interface
                • Simple Interface
                • Advanced Interface
                • Distributed Query Engine
                • Domain Services
                • Security
                • Security Authentication
                • Security Authorization
                • Delegation and Foreign CDEs
                • Automated Honest Broker Service
                • Example from the Duke Implementation
                  • Conclusion
                  • Availability and requirements
                  • Competing interests
                  • Authors contributions
                  • Acknowledgements
                  • References
                  • Pre-publication history
Page 5: The cancer translational research informatics platform

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

The graphical user interface (GUI) provides a metadata-driven interface for building and running DCQL It alsoprovides an interface for saving and searching for queries

Another important goal for caTRIP is to leverage existingtechnologies wherever possible These include

bull Globus 41 for the underlying messaging stack

bull caGrid 10 for the underlying service infrastructure

bull Introduce for service creation

bull Index Service for discovering services on the grid

bull Authentication Service to authenticate users using theirinstitutional credentials

bull Dorian to obtain a grid user proxy

bull Grid Grouper to authorize users to access data based ongroups

bull Common Security Module (CSM) for plugging authoriza-tion into the data service

bull Cancer Data Standards Repository (caDSR) for storingcommon data elements (CDEs)

High-level view of the caTRIP architecture including Domain Services Distributed Query Engine and Graphical User InterfaceFigure 2High-level view of the caTRIP architecture including Domain Services Distributed Query Engine and Graph-ical User Interface

Page 5 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

bull Enterprise Vocabulary Services (EVS) for storing semanticconcepts

bull Hibernate for an object-relational mapping

bull Oracle and MySQL for backend relational databases

More information can be found in the caTRIP architecturedocument and UML models

bull httpgforgencinihgovpluginsscmcvscvswebphpcatripdocumentationtechnical_guidecaTRIP_Technical_Guidedoccvsroot=catrip

bull httpgforgencinihgovpluginsscmcvscvswebphca-tripdocumentationmodelscaTRIP_UMLdoccvsroot=catrip

Implementation at DukeWe deployed caTRIP at Duke to leverage data from theGenetic Modifiers of BRCA1 and BRCA2 study in theDuke Breast Specialized Program of Research Excellence(SPORE) The goal of the GEMS investigation is to identifyfactors that influence the incidence of breast cancer ingermline BRCA12 mutation carriers To this end wedeployed tissue banking and pathology data from MAW3into caTissue CORE CAE and caTIES treatment recur-rence and endpoint data from our Tumor Registry into agrid service and SNP data from flat files into caIntegratorThese services were secured and then exposed to thecaTRIP interface This deployment allowed us to answerall of the domain questions that are found in the DesignObjectives section of this article Users of this pilotdeployment included a set of pathologists and clinicianswho provided feedback that guided the developmentproviding important feedback such as the use of menu-drive interfaces that require little knowledge of the back-end data instead of drag-and-drop interfaces that dorequire such knowledge

Results and discussionGraphical User InterfaceThe Graphical User Interface (GUI) is a Java Swing-basedthick client that can be accessed via Java WebStart httpjavasuncomproductsjavawebstart It provides the userwith the ability to construct execute and share distrib-uted queries in a graphical fashion There are two primaryways to construct queries the Simple Interface is a morelimited yet powerful way to build queries using drop-down menus and the Advanced Interface is a flexible wayto build queries using drag-and-drop

Simple InterfaceThe Simple Interface was developed to target the non-technical users such as a clinician Queries are constructed

using drop-down menus to add filters ANDOR groupsand return attributes (Figures 1 and 3) Available servicesare selected via a configuration file as well as the CDEsthat make up the drop-down menu items for filters andreturn attributes Joins across these services are performedusing common identifiers Specifically in caTRIP we usethe (optionally) deidentified Medical Record Number(MRN) of patients to join and aggregate data The datamodels of the underlying services are composed of poten-tially complex classes attributes and associationsbetween the classes (see Figure 4) In order to constructqueries that span multiple classes and associations theSimple Interface leverages a configuration file thatdescribes the outbound path from each class to the MRNas well as the inbound path to each filterablereturnableCDE This enables the automated logic that would nor-mally be performed by a user manually selecting all of theassociations necessary to perform a query

Advanced InterfaceThe Advanced Interface was developed to target a datatechnician who may wish to perform more complex que-ries Queries are constructed by dragging classes onto acanvas drawing associations between them and addingfilters along the way (Figure 4) Because the logic of howto traverse the information space is not built-in like it iswith the Simple Interface users must manually createthe association paths to the classesattributes that theywould like to filter on as well as those they would like toperform cross-service foreign joins with This allows usersto perform queries not supported by the Simple Inter-face but adds an extra level of complexity

Distributed Query EngineThe Distributed Query Engine (DQE) provides the logic toperform a query across different caGrid data services ThecaTRIP team developed a distributed query engine whichwas later adopted into the caGrid 10 toolkit as the Feder-ated Query Processor (FQP) module Currently the DQEis available for use by the general caBIG community andis currently being leveraged by the Cancer Bench to Bed-side (caB2B) project httpgforgencinihgovprojectscab2b It takes as input a Distributed Common QueryLanguage (DCQL) query breaks it into separate CommonQuery Language (CQL) queries issues those queries to theappropriate underlying data services receives the resultsperforms joins aggregates data and returns the resultsThe DQE is built as a Java service and can be deployed asa grid service

Domain ServicesThe domain services themselves can be further dividedinto a number of tiers including a relational databasemanagement system (RDBMS) Hibernate CommonQuery Language processor httpgforgencinihgovplu

Page 6 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

ginsscmcvscvswebphpcagrid-1-0DocumentatiodocsDataServiceTechnicaldataServiceTechnicalGuidedoccvsroot=cagrid-1-0 and grid service Hiber-nate provides an object-oriented abstraction of the under-lying relational database exposing tables and columns asclasses and attributes in a flexible configurable mannerhttpwwwhibernateorg The CQL processor translatesa CQL query based upon common data elements andtheir associations into a Hibernate Query Language(HQL) query This is a relatively complex mappingbecause CQL and HQL have much different syntaxesFinally the grid service provides a standards-based entrypoint to issue a CQL query caGrid httpscabigncinihgovworkspacesArchitecturecaGrid pro-vides the underlying technology stack for the grid servicewhich itself is based upon Globus 41 httpwwwglobusorg

SecuritycaTRIP leverages the caGrid 10 infrastructure which pro-vides robust security for a distributed grid environment(Figure 5)

Security AuthenticationThe first step in the security workflow is authenticationwhich is the process by which a trusted organizationasserts a user is who they claim to be This is implementedusing the caGrid Authentication Service The goal is for theuser to obtain a grid user certificate to be used in authori-zation The user submits their credentials (user name andpassword) to the AuthenticationService which has aninstitutional identity provider (IdP) plug-in This compo-nent performs the necessary steps to determine that theusers credentials are valid In the case of the Duke IdPplugin the user name and password are validated against

Steps to perform a query in the caTRIP Simple InterfaceFigure 3Steps to perform a query in the caTRIP Simple Interface

Page 7 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

a Duke Domain Controller which uses NT Security Thismeans that a user can log in with the same Cancer Centercredentials they use to log in to their workstation Oncevalidated the AuthenticationService returns a SecurityAssertion Markup Language (SAML) document httpwwwoasis-openorgcommitteesdownloadphp14361sstc-saml-tech-overview-20-draft-08pdf to the userwhich asserts that the user is who they claim to be TheSAML Assertion can then be passed off to a Dorian Serviceto generate a grid user certificate All of these interactionsare handled for the user by the caTRIP application

Security AuthorizationThe second step in the security workflow is authorizationwhich is the process by which a system grants the useraccess to specific resources This is implemented using the

Common Security Module (CSM) plug-in to the grid dataservice The CSM Phillips 2006 is an open sourcetoolkit that provides a flexible way to provision users andmake authorization decisions When a data field isaccessed on behalf of the user an authorization structureis queried to determine whether the user has permissionto access it If not then a security exception is returned tothe client Authorization is made possible because trustexists between the service serving the data resource andthe AuthenticationService

Delegation and Foreign CDEsAn alternate flow in the security workflow is delegationwhich is the process by which a service acts on behalf ofthe user This impacts caTRIP because the distributedquery engine can be run as a service so it must query the

Example Advanced Interfacing depicting the potential complexity of distributed queriesFigure 4Example Advanced Interfacing depicting the potential complexity of distributed queries

Page 8 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

domain services using the users certificate The finalimplementation will likely leverage the DelegationServicefrom the Globus project The user will signal the Delega-tionService that it is permissible for a collocated service toact on the users behalf In the meantime the distributedquery engine is collocated with graphical user interface soit simply passes the users credentials to the data servicesidentified in the distributed query This bypasses some ofthe business rules defined by delegation such as thenumber of delegation hops allowed by the user but it issufficient to meet caTRIP use cases

Another change on the security workflow is the use offoreign CDEs for the purpose of joining data It is con-ceivable that a user does not have permission to view aCDE that is used for joining between two differentdomain services The classic example of this is to join

patients across services using the Medical Record NumberIn this case a special security workflow must be under-taken during authorization A level of trust is maintainedbetween the distributed query engine service and thedomain service such that foreign CDEs are returned to theDQE service with the understanding that the delegateduser is not authorized to view them but is authorized tojoin on them This workflow is only possible if the DQE isnot collocated with the client application

Automated Honest Broker ServiceThe primary goal of caTRIP is to provide users access todiverse datasets that cross domains and data sources Onechallenge that arises from this is deidentifying datasetsthat are managed by different groups which especiallyaffects deidentifying foreign CDEs (MRNs) that are usedto link across different datasets A typical scenario for

High-level overview of the caGrid security architecture httpcagridorgFigure 5High-level overview of the caGrid security architecture httpcagridorg

Page 9 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

cross-dataset deidentification is for an IRB appointed indi-vidual to collect each dataset and deidentify them all atonce tossing away the keys to the Protected Health Infor-mation (PHI) This approach is not scalable in a distrib-uted environment because it requires a middle-man toperform the deidentification Whenever a new dataset isadded to the system all previously existing data sets mustbe deidentified again To address this in caTRIP we devel-oped an Automated Honest Broker Service (AHBS) whichperforms the tasks once performed by a human appointedby the Institutional Review Board A data owner program-matically submits PHI (eg an MRN) to the AHBS using asecure connection The AHBS generates a random keyassociates it internally in a database to the PHI (which canbe encrypted) and then returns the key to the data ownerThe data owner then can deidentify his dataset by replac-ing the PHI with the random key When more than onedata ownerdataset is involved the AHBS will alwaysreturn the same random key for the same PHI In this waydiverse distributed datasets can be deidentified using thesame keys

Example from the Duke ImplementationThe following example illustrates how caTRIP would con-struct and execute a user query that joins disparate data-sets A user at Duke could ask the question What are allof the available tissue specimens from the breast obtainedfrom patients tested for positive expression of the estrogenreceptor The distributed query engine would first fetchall of the patient medical record numbers from the Clini-cal Annotation Engine (CAE) service that have estrogenreceptor expression annotated as positive It would thenissue a query to the caTissue CORE service for tissue spec-imens that are marked available were collected from thetissue site of breast and were obtained from patients thathave medical record numbers that match those returnedfrom the previous query The results would be a list of tis-sue specimens including their barcodes location sizeetc The single query issued to the distributed queryengine would look like this

1 ltDCQLQuery xmlns = httpcaGridcaBIG10govnihncicagriddcqlgt

2 ltTargetObject name = httpeduwustlcatissuecoredomainobjectimplTissueSpecimenImpl serv-iceURL = httplocalhost8080wsrfservicescagridCaTissueCoregt

3 ltGroup logicRelation = ANDgt

4 ltAttribute name = available predicate =EQUAL_TO value = truegt

5 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCharacteristicsImpl roleName = specimenCharacteristicsgt

6 ltAttribute name = tissueSite predicate =LIKE value = breastgt

7 ltAssociationgt

8 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCollectionGroupImpl roleName = specimenCollectionGroupgt

9 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplClinicalReportImplroleName = clinicalReportgt

10 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplParticipantMedicalIdentifierImpl roleName = participantMedicalIdentifiergt

11 ltForeignAssociationgt

12 ltJoinConditiongt

13 ltLeftJoingt

14ltObjectgteduwustlcatissuecoredomai

nobjectimplParticipantMedicalIdentifierImplltObjectgt

15ltPropertygtmedicalRecordNumberlt

Propertygt

16 ltLeftJoingt

17 ltRightJoingt

18ltObjectgtedudukecatripcaedomaing

eneralParticipantMedicalIdentifierltObjectgt

19ltPropertygtmedicalRecordNumberlt

Propertygt

20 ltRightJoingt

21 ltJoinConditiongt

22 ltForeignObject name =httpedudukecatripcaedomaingeneralParticipantMedicalI dentifier serviceURL = httplocalhost8080wsrfservicescagridCAEgt

Page 10 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

23 ltAssociation name = httpedudukecatripcaedomaingeneralParticipantroleName = participantgt

24 ltAssociation name =httpedupittcabigcaedomaingeneralAnnotationEventPa rameters roleName = annotationEventParameter-sCollectiongt

25 ltAssociation name =httpedupittcabigcaedomainbreastBreastCancerBiomark ers roleName = annotationSetCollectiongt

26 ltAttribute name = estrogenRe-ceptor predicate = LIKE value = POSITIVEgt

27 ltAssociationgt

28 ltAssociationgt

29 ltAssociationgt

30 ltForeignObjectgt

31 ltForeignAssociationgt

32 ltAssociationgt

33 ltAssociationgt

ltAssociationgt

ltGroupgt

ltTargetObjectgt

ltDCQLQuerygt

Note on line 1 that the query is rooted at caTissue COREwhere available specimens are filtered on line 4 and spec-imens from the tissue site of breast are filtered on line 6Lines 11ndash31 define the foreign association between caTis-sue CORE and CAE The medical record numbers (MRNs)of each service are joined on lines 12ndash21 by defining theleft join criteria to be the MRN from caTissue CORE andthe right join criteria to the be MRN from CAE Line 22selects the foreign service to be CAE from which patientswith estrogen receptor expression marked as positive arefiltered on line 26 The query itself is unfolded and exe-cuted inside-out by the distributed query engine CAEbeing the first service to be queried and the results used toquery caTissue CORE Note also that the association pathsof the objects are defined within the query on lines 5 8 910 23 24 and 25 the logic of which is hidden from theusers who are simply selecting data elements to query

upon and return The abbreviated results would look likethis

1 ltqueryResults targetClassname =httpeduwustlcatissuecoredomainobjectimplTissueSpeci menImplgt

2 ltns1ObjectResult xmlnsns1 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

3 ltns2TissueSpecimen id = 397 type = Fixed TissueBlock available = true positionDimensionOne = 20positionDimensionTwo = 22 barcode = ISBN-10AB00-0013XY comments = TISSUE IN FORMALIN 30activityStatus = Active quantityInGram = 07 availa-bleQuantityInGram = 0199999988079071 xmlnsns2= gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

4 ltns1ObjectResultgt

5 ltns2ObjectResult xmlnsns2 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

6 ltns3TissueSpecimen id = 406 type = Fixed TissueBlock available = true positionDimensionOne = 19positionDimensionTwo = 14 barcode = ISBN-10AB00-0043XY comments = STRONG FAMILY HISTORYFOR BRCA MUTATIONS activityStatus = Active quan-tityInGram = 675 availableQuantityInGram = 625xmlnsns3 = gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

7 ltns2ObjectResultgt

8 ltns3ObjectResult xmlnsns3 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

9 ltns4TissueSpecimen id = 411 type = Fixed TissueBlock available = true positionDimensionOne = 21positionDimensionTwo = 7 barcode = ISBN-10 AB00-0063XY comments = NO MORE WHOLE BLOODactivityStatus = Active quantityInGram = 4275 avail-ableQuantityInGram = 4225 xmlnsns4 =gmecaCOREcabig30eduwustlcatissuecoredomainob jectgt

10 ltns3ObjectResultgt

11 ltqueryResultsgt

Note that line 1 defines the type of object being returnedin the query results and that the results themselves found

Page 11 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

on lines 3 6 and 9 are each of the objects that match thequery criteria with data elements of interest embeddedwithin them as attributes

ConclusioncaTRIP is being implemented in a phased-developmentcycle meeting certain objectives at each point to provideutility as the program is extended It is currently in its sec-ond phase of development Phase one concluded the pilotdevelopment activities of caTRIP providing a baselinearchitecture and proof-of-concept user interface End usertraining was provided on a limited basis for the smallmultidisciplinary breast oncology group using live dem-onstrations and basic instructions provided through e-mail messages Feedback was collected through e-mail aswell The multidisciplinary group was given direct accessto the primary software developers and technical expertsduring this period In phase two we plan to deploycaTRIP-II along with enhancements determined throughend user interaction Furthermore we are also in the proc-ess of engaging different groups at Duke to adopt caTRIPto help meet their research goals

In phase one of caTRIP use cases focused on intra-institu-tional problems tearing down the silos of data-orientedsystems in a single cancer center In future iterations weplan to extend caTRIP to tackle cross-institutional usecases This involves extending the caTRIP interface anddistributed query engine to support aggregation of datafrom services that expose common data elements Forexample a motivating use case amongst prospectiveadopters is to aggregate treatment endpoint pathologyand tissue banking data from a number of different cancercenters This will allow researchers and clinicians toexpand their queries and answer questions that they pre-viously could not at their own institution Furthermorewe plan to integrate caTRIP with analytics specificallyfocusing on microarray analysis The caBIG gene expres-sion database caArray is a rich source of central or distrib-uted gene expression data which itself can be tied toclinical data such as pathology biomarkers This microar-ray data can be analyzed and visualized dynamically usingservices provided by other caBIG applications such asgeWorkbench Califano 2006 httpapiiiupmceduabstractsposterarchive2006watkinsonhtml GenePat-tern [8] and Bioconductor[9] This notion of buildingworkflows based upon queries designed using caTRIPssimple interface can provide a powerful tool for bothresearchers and clinicians Finally we also plan toenhance the user interface providing more sophisticatedreporting and data mining functionality All of these fea-tures will be driven by end-user requirements of the adop-ter sites which is a critical facet to the caBIG initiative

Availability and requirementsProject name Cancer Translational Research InformaticsPlatform (caTRIP)

Project home page httpsgforgencinihgovprojectscatrip

Operating system(s) platform independent

Programming language Java

Other requirements Tomcat 5028 a database (MySQL4110 or Oracle 10x) Java Version 15 or higher andApache Ant version 165

License NCI caBIG license (non-viral)

Any restrictions to use by non-academics no

Competing interestsThe authors declare that they have no competing interests

Authors contributionsPM was the Lead Architect and Project Manager of caTRIPRD provided subject matter expertise to guide the devel-opment of caTRIP Ram Chilukuri was the lead developerof caTRIP RP provided subject matter expertise KJ pro-vided subject matter expertise RA was the project directorAJC was the project PI

AcknowledgementsThis publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR) a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH Information on NCRR is available at httpwwwncrrnihgov Information on Re-engineer-ing the Clinical Research Enterprise can be obtained from httpnihroad mapnihgovclinicalresearchoverview-translationalasp

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant The authors would like also to acknowledge the contributions of the developers database administrators and system administrators at Duke including Mark Peedin Christopher Hubbard Wilma Stanley Farid Mohammad and William Banks We would also like to thank the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC as well as Bill Mason from at 5 AM Solutions Addition-ally we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements They include Drs Kelley Marcom Gretchen Kimmick Kimberly Blackwell and Lee Wilke Finally we would also like to thank our liaison to NCI Juli Klemm for her support and intellectual contributions

References1 Fenstermacher D Street C McSherry T Nayak V Overby C Feld-

man M The Cancer Biomedical Informatics Grid (caBIGtrade)Conf Proc IEEE Eng Med Biol Soc 2005 1743-6

Page 12 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Publish with BioMed Central and every scientist can read your work free of charge

BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime

Sir Paul Nurse Cancer Research UK

Your research papers will be

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours mdash you keep the copyright

Submit your manuscript herehttpwwwbiomedcentralcominfopublishing_advasp

BioMedcentral

2 Esposito NN Chivukula M Dabbs DJ The ductal phenotypicexpression of the E-cadherincatenin complex in tubulolobu-lar carcinoma of the breast an immunohistochemical andclinicopathologic study Mod Pathol 2007 20(1)130-8

3 Sahin FI Yilmaz Z Yagmurdur MC Atac FB Ozdemir BH KarakayaliH Demirhan B Haberal M Clinical findings and HER-2neu geneamplification status of breast carcinoma patients Pathol OncolRes 2006 12(4)211-5

4 Sughayer MA Al-Khawaja MM Massarweh S Al-Masri M Preva-lence of hormone receptors and HER2neu in breast cancercases in Jordan Pathol Oncol Res 2006 12(2)83-6

5 Radovic S Babic M Doric M Secic S Beslic S Balta E Kapetanovic EImmunohistochemical evaluation of the HER-2 protein inthe infiltrative lobular breast cancer Med Arh 200660(4)213-6

6 Mills ME Linkage of patient records to support continuity ofcare Issues and future directions Stud Health Technol Inform2006 122320-4

7 Saltz J Oster S Hastings S Langella S Kurc T Sanchez W Kher MManisundaram A Shanbhag K Covitz P caGrid design and imple-mentation of the core architecture of the cancer biomedicalinformatics grid Bioinformatics 2006 22(15)1910-6

8 Reich M Liefeld T Gould J Lerner J Tamayo P Mesirov JP GenePa-ttern 20 Nat Genet 2006 38(5)500-1

9 Gentleman RC Carey VJ Bates DM Bolstad B Dettling M Dudoit SEllis B Gautier L Ge Y Gentry J others Bioconductor open soft-ware development for computational biology and bioinfor-matics Genome Biol 2004 5(10)R80

Pre-publication historyThe pre-publication history for this paper can be accessedhere

httpwwwbiomedcentralcom1472-6947860prepub

Page 13 of 13(page number not for citation purposes)

  • Abstract
    • Background
    • Results
    • Conclusion
      • Background
        • Case Study Outcomes Analysis
        • Design Objectives
          • Implementation
            • Technical Requirements
            • Software architecture
            • Implementation at Duke
              • Results and discussion
                • Graphical User Interface
                • Simple Interface
                • Advanced Interface
                • Distributed Query Engine
                • Domain Services
                • Security
                • Security Authentication
                • Security Authorization
                • Delegation and Foreign CDEs
                • Automated Honest Broker Service
                • Example from the Duke Implementation
                  • Conclusion
                  • Availability and requirements
                  • Competing interests
                  • Authors contributions
                  • Acknowledgements
                  • References
                  • Pre-publication history
Page 6: The cancer translational research informatics platform

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

bull Enterprise Vocabulary Services (EVS) for storing semanticconcepts

bull Hibernate for an object-relational mapping

bull Oracle and MySQL for backend relational databases

More information can be found in the caTRIP architecturedocument and UML models

bull httpgforgencinihgovpluginsscmcvscvswebphpcatripdocumentationtechnical_guidecaTRIP_Technical_Guidedoccvsroot=catrip

bull httpgforgencinihgovpluginsscmcvscvswebphca-tripdocumentationmodelscaTRIP_UMLdoccvsroot=catrip

Implementation at DukeWe deployed caTRIP at Duke to leverage data from theGenetic Modifiers of BRCA1 and BRCA2 study in theDuke Breast Specialized Program of Research Excellence(SPORE) The goal of the GEMS investigation is to identifyfactors that influence the incidence of breast cancer ingermline BRCA12 mutation carriers To this end wedeployed tissue banking and pathology data from MAW3into caTissue CORE CAE and caTIES treatment recur-rence and endpoint data from our Tumor Registry into agrid service and SNP data from flat files into caIntegratorThese services were secured and then exposed to thecaTRIP interface This deployment allowed us to answerall of the domain questions that are found in the DesignObjectives section of this article Users of this pilotdeployment included a set of pathologists and clinicianswho provided feedback that guided the developmentproviding important feedback such as the use of menu-drive interfaces that require little knowledge of the back-end data instead of drag-and-drop interfaces that dorequire such knowledge

Results and discussionGraphical User InterfaceThe Graphical User Interface (GUI) is a Java Swing-basedthick client that can be accessed via Java WebStart httpjavasuncomproductsjavawebstart It provides the userwith the ability to construct execute and share distrib-uted queries in a graphical fashion There are two primaryways to construct queries the Simple Interface is a morelimited yet powerful way to build queries using drop-down menus and the Advanced Interface is a flexible wayto build queries using drag-and-drop

Simple InterfaceThe Simple Interface was developed to target the non-technical users such as a clinician Queries are constructed

using drop-down menus to add filters ANDOR groupsand return attributes (Figures 1 and 3) Available servicesare selected via a configuration file as well as the CDEsthat make up the drop-down menu items for filters andreturn attributes Joins across these services are performedusing common identifiers Specifically in caTRIP we usethe (optionally) deidentified Medical Record Number(MRN) of patients to join and aggregate data The datamodels of the underlying services are composed of poten-tially complex classes attributes and associationsbetween the classes (see Figure 4) In order to constructqueries that span multiple classes and associations theSimple Interface leverages a configuration file thatdescribes the outbound path from each class to the MRNas well as the inbound path to each filterablereturnableCDE This enables the automated logic that would nor-mally be performed by a user manually selecting all of theassociations necessary to perform a query

Advanced InterfaceThe Advanced Interface was developed to target a datatechnician who may wish to perform more complex que-ries Queries are constructed by dragging classes onto acanvas drawing associations between them and addingfilters along the way (Figure 4) Because the logic of howto traverse the information space is not built-in like it iswith the Simple Interface users must manually createthe association paths to the classesattributes that theywould like to filter on as well as those they would like toperform cross-service foreign joins with This allows usersto perform queries not supported by the Simple Inter-face but adds an extra level of complexity

Distributed Query EngineThe Distributed Query Engine (DQE) provides the logic toperform a query across different caGrid data services ThecaTRIP team developed a distributed query engine whichwas later adopted into the caGrid 10 toolkit as the Feder-ated Query Processor (FQP) module Currently the DQEis available for use by the general caBIG community andis currently being leveraged by the Cancer Bench to Bed-side (caB2B) project httpgforgencinihgovprojectscab2b It takes as input a Distributed Common QueryLanguage (DCQL) query breaks it into separate CommonQuery Language (CQL) queries issues those queries to theappropriate underlying data services receives the resultsperforms joins aggregates data and returns the resultsThe DQE is built as a Java service and can be deployed asa grid service

Domain ServicesThe domain services themselves can be further dividedinto a number of tiers including a relational databasemanagement system (RDBMS) Hibernate CommonQuery Language processor httpgforgencinihgovplu

Page 6 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

ginsscmcvscvswebphpcagrid-1-0DocumentatiodocsDataServiceTechnicaldataServiceTechnicalGuidedoccvsroot=cagrid-1-0 and grid service Hiber-nate provides an object-oriented abstraction of the under-lying relational database exposing tables and columns asclasses and attributes in a flexible configurable mannerhttpwwwhibernateorg The CQL processor translatesa CQL query based upon common data elements andtheir associations into a Hibernate Query Language(HQL) query This is a relatively complex mappingbecause CQL and HQL have much different syntaxesFinally the grid service provides a standards-based entrypoint to issue a CQL query caGrid httpscabigncinihgovworkspacesArchitecturecaGrid pro-vides the underlying technology stack for the grid servicewhich itself is based upon Globus 41 httpwwwglobusorg

SecuritycaTRIP leverages the caGrid 10 infrastructure which pro-vides robust security for a distributed grid environment(Figure 5)

Security AuthenticationThe first step in the security workflow is authenticationwhich is the process by which a trusted organizationasserts a user is who they claim to be This is implementedusing the caGrid Authentication Service The goal is for theuser to obtain a grid user certificate to be used in authori-zation The user submits their credentials (user name andpassword) to the AuthenticationService which has aninstitutional identity provider (IdP) plug-in This compo-nent performs the necessary steps to determine that theusers credentials are valid In the case of the Duke IdPplugin the user name and password are validated against

Steps to perform a query in the caTRIP Simple InterfaceFigure 3Steps to perform a query in the caTRIP Simple Interface

Page 7 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

a Duke Domain Controller which uses NT Security Thismeans that a user can log in with the same Cancer Centercredentials they use to log in to their workstation Oncevalidated the AuthenticationService returns a SecurityAssertion Markup Language (SAML) document httpwwwoasis-openorgcommitteesdownloadphp14361sstc-saml-tech-overview-20-draft-08pdf to the userwhich asserts that the user is who they claim to be TheSAML Assertion can then be passed off to a Dorian Serviceto generate a grid user certificate All of these interactionsare handled for the user by the caTRIP application

Security AuthorizationThe second step in the security workflow is authorizationwhich is the process by which a system grants the useraccess to specific resources This is implemented using the

Common Security Module (CSM) plug-in to the grid dataservice The CSM Phillips 2006 is an open sourcetoolkit that provides a flexible way to provision users andmake authorization decisions When a data field isaccessed on behalf of the user an authorization structureis queried to determine whether the user has permissionto access it If not then a security exception is returned tothe client Authorization is made possible because trustexists between the service serving the data resource andthe AuthenticationService

Delegation and Foreign CDEsAn alternate flow in the security workflow is delegationwhich is the process by which a service acts on behalf ofthe user This impacts caTRIP because the distributedquery engine can be run as a service so it must query the

Example Advanced Interfacing depicting the potential complexity of distributed queriesFigure 4Example Advanced Interfacing depicting the potential complexity of distributed queries

Page 8 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

domain services using the users certificate The finalimplementation will likely leverage the DelegationServicefrom the Globus project The user will signal the Delega-tionService that it is permissible for a collocated service toact on the users behalf In the meantime the distributedquery engine is collocated with graphical user interface soit simply passes the users credentials to the data servicesidentified in the distributed query This bypasses some ofthe business rules defined by delegation such as thenumber of delegation hops allowed by the user but it issufficient to meet caTRIP use cases

Another change on the security workflow is the use offoreign CDEs for the purpose of joining data It is con-ceivable that a user does not have permission to view aCDE that is used for joining between two differentdomain services The classic example of this is to join

patients across services using the Medical Record NumberIn this case a special security workflow must be under-taken during authorization A level of trust is maintainedbetween the distributed query engine service and thedomain service such that foreign CDEs are returned to theDQE service with the understanding that the delegateduser is not authorized to view them but is authorized tojoin on them This workflow is only possible if the DQE isnot collocated with the client application

Automated Honest Broker ServiceThe primary goal of caTRIP is to provide users access todiverse datasets that cross domains and data sources Onechallenge that arises from this is deidentifying datasetsthat are managed by different groups which especiallyaffects deidentifying foreign CDEs (MRNs) that are usedto link across different datasets A typical scenario for

High-level overview of the caGrid security architecture httpcagridorgFigure 5High-level overview of the caGrid security architecture httpcagridorg

Page 9 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

cross-dataset deidentification is for an IRB appointed indi-vidual to collect each dataset and deidentify them all atonce tossing away the keys to the Protected Health Infor-mation (PHI) This approach is not scalable in a distrib-uted environment because it requires a middle-man toperform the deidentification Whenever a new dataset isadded to the system all previously existing data sets mustbe deidentified again To address this in caTRIP we devel-oped an Automated Honest Broker Service (AHBS) whichperforms the tasks once performed by a human appointedby the Institutional Review Board A data owner program-matically submits PHI (eg an MRN) to the AHBS using asecure connection The AHBS generates a random keyassociates it internally in a database to the PHI (which canbe encrypted) and then returns the key to the data ownerThe data owner then can deidentify his dataset by replac-ing the PHI with the random key When more than onedata ownerdataset is involved the AHBS will alwaysreturn the same random key for the same PHI In this waydiverse distributed datasets can be deidentified using thesame keys

Example from the Duke ImplementationThe following example illustrates how caTRIP would con-struct and execute a user query that joins disparate data-sets A user at Duke could ask the question What are allof the available tissue specimens from the breast obtainedfrom patients tested for positive expression of the estrogenreceptor The distributed query engine would first fetchall of the patient medical record numbers from the Clini-cal Annotation Engine (CAE) service that have estrogenreceptor expression annotated as positive It would thenissue a query to the caTissue CORE service for tissue spec-imens that are marked available were collected from thetissue site of breast and were obtained from patients thathave medical record numbers that match those returnedfrom the previous query The results would be a list of tis-sue specimens including their barcodes location sizeetc The single query issued to the distributed queryengine would look like this

1 ltDCQLQuery xmlns = httpcaGridcaBIG10govnihncicagriddcqlgt

2 ltTargetObject name = httpeduwustlcatissuecoredomainobjectimplTissueSpecimenImpl serv-iceURL = httplocalhost8080wsrfservicescagridCaTissueCoregt

3 ltGroup logicRelation = ANDgt

4 ltAttribute name = available predicate =EQUAL_TO value = truegt

5 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCharacteristicsImpl roleName = specimenCharacteristicsgt

6 ltAttribute name = tissueSite predicate =LIKE value = breastgt

7 ltAssociationgt

8 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCollectionGroupImpl roleName = specimenCollectionGroupgt

9 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplClinicalReportImplroleName = clinicalReportgt

10 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplParticipantMedicalIdentifierImpl roleName = participantMedicalIdentifiergt

11 ltForeignAssociationgt

12 ltJoinConditiongt

13 ltLeftJoingt

14ltObjectgteduwustlcatissuecoredomai

nobjectimplParticipantMedicalIdentifierImplltObjectgt

15ltPropertygtmedicalRecordNumberlt

Propertygt

16 ltLeftJoingt

17 ltRightJoingt

18ltObjectgtedudukecatripcaedomaing

eneralParticipantMedicalIdentifierltObjectgt

19ltPropertygtmedicalRecordNumberlt

Propertygt

20 ltRightJoingt

21 ltJoinConditiongt

22 ltForeignObject name =httpedudukecatripcaedomaingeneralParticipantMedicalI dentifier serviceURL = httplocalhost8080wsrfservicescagridCAEgt

Page 10 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

23 ltAssociation name = httpedudukecatripcaedomaingeneralParticipantroleName = participantgt

24 ltAssociation name =httpedupittcabigcaedomaingeneralAnnotationEventPa rameters roleName = annotationEventParameter-sCollectiongt

25 ltAssociation name =httpedupittcabigcaedomainbreastBreastCancerBiomark ers roleName = annotationSetCollectiongt

26 ltAttribute name = estrogenRe-ceptor predicate = LIKE value = POSITIVEgt

27 ltAssociationgt

28 ltAssociationgt

29 ltAssociationgt

30 ltForeignObjectgt

31 ltForeignAssociationgt

32 ltAssociationgt

33 ltAssociationgt

ltAssociationgt

ltGroupgt

ltTargetObjectgt

ltDCQLQuerygt

Note on line 1 that the query is rooted at caTissue COREwhere available specimens are filtered on line 4 and spec-imens from the tissue site of breast are filtered on line 6Lines 11ndash31 define the foreign association between caTis-sue CORE and CAE The medical record numbers (MRNs)of each service are joined on lines 12ndash21 by defining theleft join criteria to be the MRN from caTissue CORE andthe right join criteria to the be MRN from CAE Line 22selects the foreign service to be CAE from which patientswith estrogen receptor expression marked as positive arefiltered on line 26 The query itself is unfolded and exe-cuted inside-out by the distributed query engine CAEbeing the first service to be queried and the results used toquery caTissue CORE Note also that the association pathsof the objects are defined within the query on lines 5 8 910 23 24 and 25 the logic of which is hidden from theusers who are simply selecting data elements to query

upon and return The abbreviated results would look likethis

1 ltqueryResults targetClassname =httpeduwustlcatissuecoredomainobjectimplTissueSpeci menImplgt

2 ltns1ObjectResult xmlnsns1 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

3 ltns2TissueSpecimen id = 397 type = Fixed TissueBlock available = true positionDimensionOne = 20positionDimensionTwo = 22 barcode = ISBN-10AB00-0013XY comments = TISSUE IN FORMALIN 30activityStatus = Active quantityInGram = 07 availa-bleQuantityInGram = 0199999988079071 xmlnsns2= gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

4 ltns1ObjectResultgt

5 ltns2ObjectResult xmlnsns2 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

6 ltns3TissueSpecimen id = 406 type = Fixed TissueBlock available = true positionDimensionOne = 19positionDimensionTwo = 14 barcode = ISBN-10AB00-0043XY comments = STRONG FAMILY HISTORYFOR BRCA MUTATIONS activityStatus = Active quan-tityInGram = 675 availableQuantityInGram = 625xmlnsns3 = gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

7 ltns2ObjectResultgt

8 ltns3ObjectResult xmlnsns3 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

9 ltns4TissueSpecimen id = 411 type = Fixed TissueBlock available = true positionDimensionOne = 21positionDimensionTwo = 7 barcode = ISBN-10 AB00-0063XY comments = NO MORE WHOLE BLOODactivityStatus = Active quantityInGram = 4275 avail-ableQuantityInGram = 4225 xmlnsns4 =gmecaCOREcabig30eduwustlcatissuecoredomainob jectgt

10 ltns3ObjectResultgt

11 ltqueryResultsgt

Note that line 1 defines the type of object being returnedin the query results and that the results themselves found

Page 11 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

on lines 3 6 and 9 are each of the objects that match thequery criteria with data elements of interest embeddedwithin them as attributes

ConclusioncaTRIP is being implemented in a phased-developmentcycle meeting certain objectives at each point to provideutility as the program is extended It is currently in its sec-ond phase of development Phase one concluded the pilotdevelopment activities of caTRIP providing a baselinearchitecture and proof-of-concept user interface End usertraining was provided on a limited basis for the smallmultidisciplinary breast oncology group using live dem-onstrations and basic instructions provided through e-mail messages Feedback was collected through e-mail aswell The multidisciplinary group was given direct accessto the primary software developers and technical expertsduring this period In phase two we plan to deploycaTRIP-II along with enhancements determined throughend user interaction Furthermore we are also in the proc-ess of engaging different groups at Duke to adopt caTRIPto help meet their research goals

In phase one of caTRIP use cases focused on intra-institu-tional problems tearing down the silos of data-orientedsystems in a single cancer center In future iterations weplan to extend caTRIP to tackle cross-institutional usecases This involves extending the caTRIP interface anddistributed query engine to support aggregation of datafrom services that expose common data elements Forexample a motivating use case amongst prospectiveadopters is to aggregate treatment endpoint pathologyand tissue banking data from a number of different cancercenters This will allow researchers and clinicians toexpand their queries and answer questions that they pre-viously could not at their own institution Furthermorewe plan to integrate caTRIP with analytics specificallyfocusing on microarray analysis The caBIG gene expres-sion database caArray is a rich source of central or distrib-uted gene expression data which itself can be tied toclinical data such as pathology biomarkers This microar-ray data can be analyzed and visualized dynamically usingservices provided by other caBIG applications such asgeWorkbench Califano 2006 httpapiiiupmceduabstractsposterarchive2006watkinsonhtml GenePat-tern [8] and Bioconductor[9] This notion of buildingworkflows based upon queries designed using caTRIPssimple interface can provide a powerful tool for bothresearchers and clinicians Finally we also plan toenhance the user interface providing more sophisticatedreporting and data mining functionality All of these fea-tures will be driven by end-user requirements of the adop-ter sites which is a critical facet to the caBIG initiative

Availability and requirementsProject name Cancer Translational Research InformaticsPlatform (caTRIP)

Project home page httpsgforgencinihgovprojectscatrip

Operating system(s) platform independent

Programming language Java

Other requirements Tomcat 5028 a database (MySQL4110 or Oracle 10x) Java Version 15 or higher andApache Ant version 165

License NCI caBIG license (non-viral)

Any restrictions to use by non-academics no

Competing interestsThe authors declare that they have no competing interests

Authors contributionsPM was the Lead Architect and Project Manager of caTRIPRD provided subject matter expertise to guide the devel-opment of caTRIP Ram Chilukuri was the lead developerof caTRIP RP provided subject matter expertise KJ pro-vided subject matter expertise RA was the project directorAJC was the project PI

AcknowledgementsThis publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR) a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH Information on NCRR is available at httpwwwncrrnihgov Information on Re-engineer-ing the Clinical Research Enterprise can be obtained from httpnihroad mapnihgovclinicalresearchoverview-translationalasp

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant The authors would like also to acknowledge the contributions of the developers database administrators and system administrators at Duke including Mark Peedin Christopher Hubbard Wilma Stanley Farid Mohammad and William Banks We would also like to thank the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC as well as Bill Mason from at 5 AM Solutions Addition-ally we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements They include Drs Kelley Marcom Gretchen Kimmick Kimberly Blackwell and Lee Wilke Finally we would also like to thank our liaison to NCI Juli Klemm for her support and intellectual contributions

References1 Fenstermacher D Street C McSherry T Nayak V Overby C Feld-

man M The Cancer Biomedical Informatics Grid (caBIGtrade)Conf Proc IEEE Eng Med Biol Soc 2005 1743-6

Page 12 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Publish with BioMed Central and every scientist can read your work free of charge

BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime

Sir Paul Nurse Cancer Research UK

Your research papers will be

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours mdash you keep the copyright

Submit your manuscript herehttpwwwbiomedcentralcominfopublishing_advasp

BioMedcentral

2 Esposito NN Chivukula M Dabbs DJ The ductal phenotypicexpression of the E-cadherincatenin complex in tubulolobu-lar carcinoma of the breast an immunohistochemical andclinicopathologic study Mod Pathol 2007 20(1)130-8

3 Sahin FI Yilmaz Z Yagmurdur MC Atac FB Ozdemir BH KarakayaliH Demirhan B Haberal M Clinical findings and HER-2neu geneamplification status of breast carcinoma patients Pathol OncolRes 2006 12(4)211-5

4 Sughayer MA Al-Khawaja MM Massarweh S Al-Masri M Preva-lence of hormone receptors and HER2neu in breast cancercases in Jordan Pathol Oncol Res 2006 12(2)83-6

5 Radovic S Babic M Doric M Secic S Beslic S Balta E Kapetanovic EImmunohistochemical evaluation of the HER-2 protein inthe infiltrative lobular breast cancer Med Arh 200660(4)213-6

6 Mills ME Linkage of patient records to support continuity ofcare Issues and future directions Stud Health Technol Inform2006 122320-4

7 Saltz J Oster S Hastings S Langella S Kurc T Sanchez W Kher MManisundaram A Shanbhag K Covitz P caGrid design and imple-mentation of the core architecture of the cancer biomedicalinformatics grid Bioinformatics 2006 22(15)1910-6

8 Reich M Liefeld T Gould J Lerner J Tamayo P Mesirov JP GenePa-ttern 20 Nat Genet 2006 38(5)500-1

9 Gentleman RC Carey VJ Bates DM Bolstad B Dettling M Dudoit SEllis B Gautier L Ge Y Gentry J others Bioconductor open soft-ware development for computational biology and bioinfor-matics Genome Biol 2004 5(10)R80

Pre-publication historyThe pre-publication history for this paper can be accessedhere

httpwwwbiomedcentralcom1472-6947860prepub

Page 13 of 13(page number not for citation purposes)

  • Abstract
    • Background
    • Results
    • Conclusion
      • Background
        • Case Study Outcomes Analysis
        • Design Objectives
          • Implementation
            • Technical Requirements
            • Software architecture
            • Implementation at Duke
              • Results and discussion
                • Graphical User Interface
                • Simple Interface
                • Advanced Interface
                • Distributed Query Engine
                • Domain Services
                • Security
                • Security Authentication
                • Security Authorization
                • Delegation and Foreign CDEs
                • Automated Honest Broker Service
                • Example from the Duke Implementation
                  • Conclusion
                  • Availability and requirements
                  • Competing interests
                  • Authors contributions
                  • Acknowledgements
                  • References
                  • Pre-publication history
Page 7: The cancer translational research informatics platform

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

ginsscmcvscvswebphpcagrid-1-0DocumentatiodocsDataServiceTechnicaldataServiceTechnicalGuidedoccvsroot=cagrid-1-0 and grid service Hiber-nate provides an object-oriented abstraction of the under-lying relational database exposing tables and columns asclasses and attributes in a flexible configurable mannerhttpwwwhibernateorg The CQL processor translatesa CQL query based upon common data elements andtheir associations into a Hibernate Query Language(HQL) query This is a relatively complex mappingbecause CQL and HQL have much different syntaxesFinally the grid service provides a standards-based entrypoint to issue a CQL query caGrid httpscabigncinihgovworkspacesArchitecturecaGrid pro-vides the underlying technology stack for the grid servicewhich itself is based upon Globus 41 httpwwwglobusorg

SecuritycaTRIP leverages the caGrid 10 infrastructure which pro-vides robust security for a distributed grid environment(Figure 5)

Security AuthenticationThe first step in the security workflow is authenticationwhich is the process by which a trusted organizationasserts a user is who they claim to be This is implementedusing the caGrid Authentication Service The goal is for theuser to obtain a grid user certificate to be used in authori-zation The user submits their credentials (user name andpassword) to the AuthenticationService which has aninstitutional identity provider (IdP) plug-in This compo-nent performs the necessary steps to determine that theusers credentials are valid In the case of the Duke IdPplugin the user name and password are validated against

Steps to perform a query in the caTRIP Simple InterfaceFigure 3Steps to perform a query in the caTRIP Simple Interface

Page 7 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

a Duke Domain Controller which uses NT Security Thismeans that a user can log in with the same Cancer Centercredentials they use to log in to their workstation Oncevalidated the AuthenticationService returns a SecurityAssertion Markup Language (SAML) document httpwwwoasis-openorgcommitteesdownloadphp14361sstc-saml-tech-overview-20-draft-08pdf to the userwhich asserts that the user is who they claim to be TheSAML Assertion can then be passed off to a Dorian Serviceto generate a grid user certificate All of these interactionsare handled for the user by the caTRIP application

Security AuthorizationThe second step in the security workflow is authorizationwhich is the process by which a system grants the useraccess to specific resources This is implemented using the

Common Security Module (CSM) plug-in to the grid dataservice The CSM Phillips 2006 is an open sourcetoolkit that provides a flexible way to provision users andmake authorization decisions When a data field isaccessed on behalf of the user an authorization structureis queried to determine whether the user has permissionto access it If not then a security exception is returned tothe client Authorization is made possible because trustexists between the service serving the data resource andthe AuthenticationService

Delegation and Foreign CDEsAn alternate flow in the security workflow is delegationwhich is the process by which a service acts on behalf ofthe user This impacts caTRIP because the distributedquery engine can be run as a service so it must query the

Example Advanced Interfacing depicting the potential complexity of distributed queriesFigure 4Example Advanced Interfacing depicting the potential complexity of distributed queries

Page 8 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

domain services using the users certificate The finalimplementation will likely leverage the DelegationServicefrom the Globus project The user will signal the Delega-tionService that it is permissible for a collocated service toact on the users behalf In the meantime the distributedquery engine is collocated with graphical user interface soit simply passes the users credentials to the data servicesidentified in the distributed query This bypasses some ofthe business rules defined by delegation such as thenumber of delegation hops allowed by the user but it issufficient to meet caTRIP use cases

Another change on the security workflow is the use offoreign CDEs for the purpose of joining data It is con-ceivable that a user does not have permission to view aCDE that is used for joining between two differentdomain services The classic example of this is to join

patients across services using the Medical Record NumberIn this case a special security workflow must be under-taken during authorization A level of trust is maintainedbetween the distributed query engine service and thedomain service such that foreign CDEs are returned to theDQE service with the understanding that the delegateduser is not authorized to view them but is authorized tojoin on them This workflow is only possible if the DQE isnot collocated with the client application

Automated Honest Broker ServiceThe primary goal of caTRIP is to provide users access todiverse datasets that cross domains and data sources Onechallenge that arises from this is deidentifying datasetsthat are managed by different groups which especiallyaffects deidentifying foreign CDEs (MRNs) that are usedto link across different datasets A typical scenario for

High-level overview of the caGrid security architecture httpcagridorgFigure 5High-level overview of the caGrid security architecture httpcagridorg

Page 9 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

cross-dataset deidentification is for an IRB appointed indi-vidual to collect each dataset and deidentify them all atonce tossing away the keys to the Protected Health Infor-mation (PHI) This approach is not scalable in a distrib-uted environment because it requires a middle-man toperform the deidentification Whenever a new dataset isadded to the system all previously existing data sets mustbe deidentified again To address this in caTRIP we devel-oped an Automated Honest Broker Service (AHBS) whichperforms the tasks once performed by a human appointedby the Institutional Review Board A data owner program-matically submits PHI (eg an MRN) to the AHBS using asecure connection The AHBS generates a random keyassociates it internally in a database to the PHI (which canbe encrypted) and then returns the key to the data ownerThe data owner then can deidentify his dataset by replac-ing the PHI with the random key When more than onedata ownerdataset is involved the AHBS will alwaysreturn the same random key for the same PHI In this waydiverse distributed datasets can be deidentified using thesame keys

Example from the Duke ImplementationThe following example illustrates how caTRIP would con-struct and execute a user query that joins disparate data-sets A user at Duke could ask the question What are allof the available tissue specimens from the breast obtainedfrom patients tested for positive expression of the estrogenreceptor The distributed query engine would first fetchall of the patient medical record numbers from the Clini-cal Annotation Engine (CAE) service that have estrogenreceptor expression annotated as positive It would thenissue a query to the caTissue CORE service for tissue spec-imens that are marked available were collected from thetissue site of breast and were obtained from patients thathave medical record numbers that match those returnedfrom the previous query The results would be a list of tis-sue specimens including their barcodes location sizeetc The single query issued to the distributed queryengine would look like this

1 ltDCQLQuery xmlns = httpcaGridcaBIG10govnihncicagriddcqlgt

2 ltTargetObject name = httpeduwustlcatissuecoredomainobjectimplTissueSpecimenImpl serv-iceURL = httplocalhost8080wsrfservicescagridCaTissueCoregt

3 ltGroup logicRelation = ANDgt

4 ltAttribute name = available predicate =EQUAL_TO value = truegt

5 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCharacteristicsImpl roleName = specimenCharacteristicsgt

6 ltAttribute name = tissueSite predicate =LIKE value = breastgt

7 ltAssociationgt

8 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCollectionGroupImpl roleName = specimenCollectionGroupgt

9 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplClinicalReportImplroleName = clinicalReportgt

10 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplParticipantMedicalIdentifierImpl roleName = participantMedicalIdentifiergt

11 ltForeignAssociationgt

12 ltJoinConditiongt

13 ltLeftJoingt

14ltObjectgteduwustlcatissuecoredomai

nobjectimplParticipantMedicalIdentifierImplltObjectgt

15ltPropertygtmedicalRecordNumberlt

Propertygt

16 ltLeftJoingt

17 ltRightJoingt

18ltObjectgtedudukecatripcaedomaing

eneralParticipantMedicalIdentifierltObjectgt

19ltPropertygtmedicalRecordNumberlt

Propertygt

20 ltRightJoingt

21 ltJoinConditiongt

22 ltForeignObject name =httpedudukecatripcaedomaingeneralParticipantMedicalI dentifier serviceURL = httplocalhost8080wsrfservicescagridCAEgt

Page 10 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

23 ltAssociation name = httpedudukecatripcaedomaingeneralParticipantroleName = participantgt

24 ltAssociation name =httpedupittcabigcaedomaingeneralAnnotationEventPa rameters roleName = annotationEventParameter-sCollectiongt

25 ltAssociation name =httpedupittcabigcaedomainbreastBreastCancerBiomark ers roleName = annotationSetCollectiongt

26 ltAttribute name = estrogenRe-ceptor predicate = LIKE value = POSITIVEgt

27 ltAssociationgt

28 ltAssociationgt

29 ltAssociationgt

30 ltForeignObjectgt

31 ltForeignAssociationgt

32 ltAssociationgt

33 ltAssociationgt

ltAssociationgt

ltGroupgt

ltTargetObjectgt

ltDCQLQuerygt

Note on line 1 that the query is rooted at caTissue COREwhere available specimens are filtered on line 4 and spec-imens from the tissue site of breast are filtered on line 6Lines 11ndash31 define the foreign association between caTis-sue CORE and CAE The medical record numbers (MRNs)of each service are joined on lines 12ndash21 by defining theleft join criteria to be the MRN from caTissue CORE andthe right join criteria to the be MRN from CAE Line 22selects the foreign service to be CAE from which patientswith estrogen receptor expression marked as positive arefiltered on line 26 The query itself is unfolded and exe-cuted inside-out by the distributed query engine CAEbeing the first service to be queried and the results used toquery caTissue CORE Note also that the association pathsof the objects are defined within the query on lines 5 8 910 23 24 and 25 the logic of which is hidden from theusers who are simply selecting data elements to query

upon and return The abbreviated results would look likethis

1 ltqueryResults targetClassname =httpeduwustlcatissuecoredomainobjectimplTissueSpeci menImplgt

2 ltns1ObjectResult xmlnsns1 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

3 ltns2TissueSpecimen id = 397 type = Fixed TissueBlock available = true positionDimensionOne = 20positionDimensionTwo = 22 barcode = ISBN-10AB00-0013XY comments = TISSUE IN FORMALIN 30activityStatus = Active quantityInGram = 07 availa-bleQuantityInGram = 0199999988079071 xmlnsns2= gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

4 ltns1ObjectResultgt

5 ltns2ObjectResult xmlnsns2 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

6 ltns3TissueSpecimen id = 406 type = Fixed TissueBlock available = true positionDimensionOne = 19positionDimensionTwo = 14 barcode = ISBN-10AB00-0043XY comments = STRONG FAMILY HISTORYFOR BRCA MUTATIONS activityStatus = Active quan-tityInGram = 675 availableQuantityInGram = 625xmlnsns3 = gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

7 ltns2ObjectResultgt

8 ltns3ObjectResult xmlnsns3 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

9 ltns4TissueSpecimen id = 411 type = Fixed TissueBlock available = true positionDimensionOne = 21positionDimensionTwo = 7 barcode = ISBN-10 AB00-0063XY comments = NO MORE WHOLE BLOODactivityStatus = Active quantityInGram = 4275 avail-ableQuantityInGram = 4225 xmlnsns4 =gmecaCOREcabig30eduwustlcatissuecoredomainob jectgt

10 ltns3ObjectResultgt

11 ltqueryResultsgt

Note that line 1 defines the type of object being returnedin the query results and that the results themselves found

Page 11 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

on lines 3 6 and 9 are each of the objects that match thequery criteria with data elements of interest embeddedwithin them as attributes

ConclusioncaTRIP is being implemented in a phased-developmentcycle meeting certain objectives at each point to provideutility as the program is extended It is currently in its sec-ond phase of development Phase one concluded the pilotdevelopment activities of caTRIP providing a baselinearchitecture and proof-of-concept user interface End usertraining was provided on a limited basis for the smallmultidisciplinary breast oncology group using live dem-onstrations and basic instructions provided through e-mail messages Feedback was collected through e-mail aswell The multidisciplinary group was given direct accessto the primary software developers and technical expertsduring this period In phase two we plan to deploycaTRIP-II along with enhancements determined throughend user interaction Furthermore we are also in the proc-ess of engaging different groups at Duke to adopt caTRIPto help meet their research goals

In phase one of caTRIP use cases focused on intra-institu-tional problems tearing down the silos of data-orientedsystems in a single cancer center In future iterations weplan to extend caTRIP to tackle cross-institutional usecases This involves extending the caTRIP interface anddistributed query engine to support aggregation of datafrom services that expose common data elements Forexample a motivating use case amongst prospectiveadopters is to aggregate treatment endpoint pathologyand tissue banking data from a number of different cancercenters This will allow researchers and clinicians toexpand their queries and answer questions that they pre-viously could not at their own institution Furthermorewe plan to integrate caTRIP with analytics specificallyfocusing on microarray analysis The caBIG gene expres-sion database caArray is a rich source of central or distrib-uted gene expression data which itself can be tied toclinical data such as pathology biomarkers This microar-ray data can be analyzed and visualized dynamically usingservices provided by other caBIG applications such asgeWorkbench Califano 2006 httpapiiiupmceduabstractsposterarchive2006watkinsonhtml GenePat-tern [8] and Bioconductor[9] This notion of buildingworkflows based upon queries designed using caTRIPssimple interface can provide a powerful tool for bothresearchers and clinicians Finally we also plan toenhance the user interface providing more sophisticatedreporting and data mining functionality All of these fea-tures will be driven by end-user requirements of the adop-ter sites which is a critical facet to the caBIG initiative

Availability and requirementsProject name Cancer Translational Research InformaticsPlatform (caTRIP)

Project home page httpsgforgencinihgovprojectscatrip

Operating system(s) platform independent

Programming language Java

Other requirements Tomcat 5028 a database (MySQL4110 or Oracle 10x) Java Version 15 or higher andApache Ant version 165

License NCI caBIG license (non-viral)

Any restrictions to use by non-academics no

Competing interestsThe authors declare that they have no competing interests

Authors contributionsPM was the Lead Architect and Project Manager of caTRIPRD provided subject matter expertise to guide the devel-opment of caTRIP Ram Chilukuri was the lead developerof caTRIP RP provided subject matter expertise KJ pro-vided subject matter expertise RA was the project directorAJC was the project PI

AcknowledgementsThis publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR) a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH Information on NCRR is available at httpwwwncrrnihgov Information on Re-engineer-ing the Clinical Research Enterprise can be obtained from httpnihroad mapnihgovclinicalresearchoverview-translationalasp

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant The authors would like also to acknowledge the contributions of the developers database administrators and system administrators at Duke including Mark Peedin Christopher Hubbard Wilma Stanley Farid Mohammad and William Banks We would also like to thank the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC as well as Bill Mason from at 5 AM Solutions Addition-ally we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements They include Drs Kelley Marcom Gretchen Kimmick Kimberly Blackwell and Lee Wilke Finally we would also like to thank our liaison to NCI Juli Klemm for her support and intellectual contributions

References1 Fenstermacher D Street C McSherry T Nayak V Overby C Feld-

man M The Cancer Biomedical Informatics Grid (caBIGtrade)Conf Proc IEEE Eng Med Biol Soc 2005 1743-6

Page 12 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Publish with BioMed Central and every scientist can read your work free of charge

BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime

Sir Paul Nurse Cancer Research UK

Your research papers will be

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours mdash you keep the copyright

Submit your manuscript herehttpwwwbiomedcentralcominfopublishing_advasp

BioMedcentral

2 Esposito NN Chivukula M Dabbs DJ The ductal phenotypicexpression of the E-cadherincatenin complex in tubulolobu-lar carcinoma of the breast an immunohistochemical andclinicopathologic study Mod Pathol 2007 20(1)130-8

3 Sahin FI Yilmaz Z Yagmurdur MC Atac FB Ozdemir BH KarakayaliH Demirhan B Haberal M Clinical findings and HER-2neu geneamplification status of breast carcinoma patients Pathol OncolRes 2006 12(4)211-5

4 Sughayer MA Al-Khawaja MM Massarweh S Al-Masri M Preva-lence of hormone receptors and HER2neu in breast cancercases in Jordan Pathol Oncol Res 2006 12(2)83-6

5 Radovic S Babic M Doric M Secic S Beslic S Balta E Kapetanovic EImmunohistochemical evaluation of the HER-2 protein inthe infiltrative lobular breast cancer Med Arh 200660(4)213-6

6 Mills ME Linkage of patient records to support continuity ofcare Issues and future directions Stud Health Technol Inform2006 122320-4

7 Saltz J Oster S Hastings S Langella S Kurc T Sanchez W Kher MManisundaram A Shanbhag K Covitz P caGrid design and imple-mentation of the core architecture of the cancer biomedicalinformatics grid Bioinformatics 2006 22(15)1910-6

8 Reich M Liefeld T Gould J Lerner J Tamayo P Mesirov JP GenePa-ttern 20 Nat Genet 2006 38(5)500-1

9 Gentleman RC Carey VJ Bates DM Bolstad B Dettling M Dudoit SEllis B Gautier L Ge Y Gentry J others Bioconductor open soft-ware development for computational biology and bioinfor-matics Genome Biol 2004 5(10)R80

Pre-publication historyThe pre-publication history for this paper can be accessedhere

httpwwwbiomedcentralcom1472-6947860prepub

Page 13 of 13(page number not for citation purposes)

  • Abstract
    • Background
    • Results
    • Conclusion
      • Background
        • Case Study Outcomes Analysis
        • Design Objectives
          • Implementation
            • Technical Requirements
            • Software architecture
            • Implementation at Duke
              • Results and discussion
                • Graphical User Interface
                • Simple Interface
                • Advanced Interface
                • Distributed Query Engine
                • Domain Services
                • Security
                • Security Authentication
                • Security Authorization
                • Delegation and Foreign CDEs
                • Automated Honest Broker Service
                • Example from the Duke Implementation
                  • Conclusion
                  • Availability and requirements
                  • Competing interests
                  • Authors contributions
                  • Acknowledgements
                  • References
                  • Pre-publication history
Page 8: The cancer translational research informatics platform

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

a Duke Domain Controller which uses NT Security Thismeans that a user can log in with the same Cancer Centercredentials they use to log in to their workstation Oncevalidated the AuthenticationService returns a SecurityAssertion Markup Language (SAML) document httpwwwoasis-openorgcommitteesdownloadphp14361sstc-saml-tech-overview-20-draft-08pdf to the userwhich asserts that the user is who they claim to be TheSAML Assertion can then be passed off to a Dorian Serviceto generate a grid user certificate All of these interactionsare handled for the user by the caTRIP application

Security AuthorizationThe second step in the security workflow is authorizationwhich is the process by which a system grants the useraccess to specific resources This is implemented using the

Common Security Module (CSM) plug-in to the grid dataservice The CSM Phillips 2006 is an open sourcetoolkit that provides a flexible way to provision users andmake authorization decisions When a data field isaccessed on behalf of the user an authorization structureis queried to determine whether the user has permissionto access it If not then a security exception is returned tothe client Authorization is made possible because trustexists between the service serving the data resource andthe AuthenticationService

Delegation and Foreign CDEsAn alternate flow in the security workflow is delegationwhich is the process by which a service acts on behalf ofthe user This impacts caTRIP because the distributedquery engine can be run as a service so it must query the

Example Advanced Interfacing depicting the potential complexity of distributed queriesFigure 4Example Advanced Interfacing depicting the potential complexity of distributed queries

Page 8 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

domain services using the users certificate The finalimplementation will likely leverage the DelegationServicefrom the Globus project The user will signal the Delega-tionService that it is permissible for a collocated service toact on the users behalf In the meantime the distributedquery engine is collocated with graphical user interface soit simply passes the users credentials to the data servicesidentified in the distributed query This bypasses some ofthe business rules defined by delegation such as thenumber of delegation hops allowed by the user but it issufficient to meet caTRIP use cases

Another change on the security workflow is the use offoreign CDEs for the purpose of joining data It is con-ceivable that a user does not have permission to view aCDE that is used for joining between two differentdomain services The classic example of this is to join

patients across services using the Medical Record NumberIn this case a special security workflow must be under-taken during authorization A level of trust is maintainedbetween the distributed query engine service and thedomain service such that foreign CDEs are returned to theDQE service with the understanding that the delegateduser is not authorized to view them but is authorized tojoin on them This workflow is only possible if the DQE isnot collocated with the client application

Automated Honest Broker ServiceThe primary goal of caTRIP is to provide users access todiverse datasets that cross domains and data sources Onechallenge that arises from this is deidentifying datasetsthat are managed by different groups which especiallyaffects deidentifying foreign CDEs (MRNs) that are usedto link across different datasets A typical scenario for

High-level overview of the caGrid security architecture httpcagridorgFigure 5High-level overview of the caGrid security architecture httpcagridorg

Page 9 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

cross-dataset deidentification is for an IRB appointed indi-vidual to collect each dataset and deidentify them all atonce tossing away the keys to the Protected Health Infor-mation (PHI) This approach is not scalable in a distrib-uted environment because it requires a middle-man toperform the deidentification Whenever a new dataset isadded to the system all previously existing data sets mustbe deidentified again To address this in caTRIP we devel-oped an Automated Honest Broker Service (AHBS) whichperforms the tasks once performed by a human appointedby the Institutional Review Board A data owner program-matically submits PHI (eg an MRN) to the AHBS using asecure connection The AHBS generates a random keyassociates it internally in a database to the PHI (which canbe encrypted) and then returns the key to the data ownerThe data owner then can deidentify his dataset by replac-ing the PHI with the random key When more than onedata ownerdataset is involved the AHBS will alwaysreturn the same random key for the same PHI In this waydiverse distributed datasets can be deidentified using thesame keys

Example from the Duke ImplementationThe following example illustrates how caTRIP would con-struct and execute a user query that joins disparate data-sets A user at Duke could ask the question What are allof the available tissue specimens from the breast obtainedfrom patients tested for positive expression of the estrogenreceptor The distributed query engine would first fetchall of the patient medical record numbers from the Clini-cal Annotation Engine (CAE) service that have estrogenreceptor expression annotated as positive It would thenissue a query to the caTissue CORE service for tissue spec-imens that are marked available were collected from thetissue site of breast and were obtained from patients thathave medical record numbers that match those returnedfrom the previous query The results would be a list of tis-sue specimens including their barcodes location sizeetc The single query issued to the distributed queryengine would look like this

1 ltDCQLQuery xmlns = httpcaGridcaBIG10govnihncicagriddcqlgt

2 ltTargetObject name = httpeduwustlcatissuecoredomainobjectimplTissueSpecimenImpl serv-iceURL = httplocalhost8080wsrfservicescagridCaTissueCoregt

3 ltGroup logicRelation = ANDgt

4 ltAttribute name = available predicate =EQUAL_TO value = truegt

5 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCharacteristicsImpl roleName = specimenCharacteristicsgt

6 ltAttribute name = tissueSite predicate =LIKE value = breastgt

7 ltAssociationgt

8 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCollectionGroupImpl roleName = specimenCollectionGroupgt

9 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplClinicalReportImplroleName = clinicalReportgt

10 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplParticipantMedicalIdentifierImpl roleName = participantMedicalIdentifiergt

11 ltForeignAssociationgt

12 ltJoinConditiongt

13 ltLeftJoingt

14ltObjectgteduwustlcatissuecoredomai

nobjectimplParticipantMedicalIdentifierImplltObjectgt

15ltPropertygtmedicalRecordNumberlt

Propertygt

16 ltLeftJoingt

17 ltRightJoingt

18ltObjectgtedudukecatripcaedomaing

eneralParticipantMedicalIdentifierltObjectgt

19ltPropertygtmedicalRecordNumberlt

Propertygt

20 ltRightJoingt

21 ltJoinConditiongt

22 ltForeignObject name =httpedudukecatripcaedomaingeneralParticipantMedicalI dentifier serviceURL = httplocalhost8080wsrfservicescagridCAEgt

Page 10 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

23 ltAssociation name = httpedudukecatripcaedomaingeneralParticipantroleName = participantgt

24 ltAssociation name =httpedupittcabigcaedomaingeneralAnnotationEventPa rameters roleName = annotationEventParameter-sCollectiongt

25 ltAssociation name =httpedupittcabigcaedomainbreastBreastCancerBiomark ers roleName = annotationSetCollectiongt

26 ltAttribute name = estrogenRe-ceptor predicate = LIKE value = POSITIVEgt

27 ltAssociationgt

28 ltAssociationgt

29 ltAssociationgt

30 ltForeignObjectgt

31 ltForeignAssociationgt

32 ltAssociationgt

33 ltAssociationgt

ltAssociationgt

ltGroupgt

ltTargetObjectgt

ltDCQLQuerygt

Note on line 1 that the query is rooted at caTissue COREwhere available specimens are filtered on line 4 and spec-imens from the tissue site of breast are filtered on line 6Lines 11ndash31 define the foreign association between caTis-sue CORE and CAE The medical record numbers (MRNs)of each service are joined on lines 12ndash21 by defining theleft join criteria to be the MRN from caTissue CORE andthe right join criteria to the be MRN from CAE Line 22selects the foreign service to be CAE from which patientswith estrogen receptor expression marked as positive arefiltered on line 26 The query itself is unfolded and exe-cuted inside-out by the distributed query engine CAEbeing the first service to be queried and the results used toquery caTissue CORE Note also that the association pathsof the objects are defined within the query on lines 5 8 910 23 24 and 25 the logic of which is hidden from theusers who are simply selecting data elements to query

upon and return The abbreviated results would look likethis

1 ltqueryResults targetClassname =httpeduwustlcatissuecoredomainobjectimplTissueSpeci menImplgt

2 ltns1ObjectResult xmlnsns1 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

3 ltns2TissueSpecimen id = 397 type = Fixed TissueBlock available = true positionDimensionOne = 20positionDimensionTwo = 22 barcode = ISBN-10AB00-0013XY comments = TISSUE IN FORMALIN 30activityStatus = Active quantityInGram = 07 availa-bleQuantityInGram = 0199999988079071 xmlnsns2= gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

4 ltns1ObjectResultgt

5 ltns2ObjectResult xmlnsns2 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

6 ltns3TissueSpecimen id = 406 type = Fixed TissueBlock available = true positionDimensionOne = 19positionDimensionTwo = 14 barcode = ISBN-10AB00-0043XY comments = STRONG FAMILY HISTORYFOR BRCA MUTATIONS activityStatus = Active quan-tityInGram = 675 availableQuantityInGram = 625xmlnsns3 = gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

7 ltns2ObjectResultgt

8 ltns3ObjectResult xmlnsns3 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

9 ltns4TissueSpecimen id = 411 type = Fixed TissueBlock available = true positionDimensionOne = 21positionDimensionTwo = 7 barcode = ISBN-10 AB00-0063XY comments = NO MORE WHOLE BLOODactivityStatus = Active quantityInGram = 4275 avail-ableQuantityInGram = 4225 xmlnsns4 =gmecaCOREcabig30eduwustlcatissuecoredomainob jectgt

10 ltns3ObjectResultgt

11 ltqueryResultsgt

Note that line 1 defines the type of object being returnedin the query results and that the results themselves found

Page 11 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

on lines 3 6 and 9 are each of the objects that match thequery criteria with data elements of interest embeddedwithin them as attributes

ConclusioncaTRIP is being implemented in a phased-developmentcycle meeting certain objectives at each point to provideutility as the program is extended It is currently in its sec-ond phase of development Phase one concluded the pilotdevelopment activities of caTRIP providing a baselinearchitecture and proof-of-concept user interface End usertraining was provided on a limited basis for the smallmultidisciplinary breast oncology group using live dem-onstrations and basic instructions provided through e-mail messages Feedback was collected through e-mail aswell The multidisciplinary group was given direct accessto the primary software developers and technical expertsduring this period In phase two we plan to deploycaTRIP-II along with enhancements determined throughend user interaction Furthermore we are also in the proc-ess of engaging different groups at Duke to adopt caTRIPto help meet their research goals

In phase one of caTRIP use cases focused on intra-institu-tional problems tearing down the silos of data-orientedsystems in a single cancer center In future iterations weplan to extend caTRIP to tackle cross-institutional usecases This involves extending the caTRIP interface anddistributed query engine to support aggregation of datafrom services that expose common data elements Forexample a motivating use case amongst prospectiveadopters is to aggregate treatment endpoint pathologyand tissue banking data from a number of different cancercenters This will allow researchers and clinicians toexpand their queries and answer questions that they pre-viously could not at their own institution Furthermorewe plan to integrate caTRIP with analytics specificallyfocusing on microarray analysis The caBIG gene expres-sion database caArray is a rich source of central or distrib-uted gene expression data which itself can be tied toclinical data such as pathology biomarkers This microar-ray data can be analyzed and visualized dynamically usingservices provided by other caBIG applications such asgeWorkbench Califano 2006 httpapiiiupmceduabstractsposterarchive2006watkinsonhtml GenePat-tern [8] and Bioconductor[9] This notion of buildingworkflows based upon queries designed using caTRIPssimple interface can provide a powerful tool for bothresearchers and clinicians Finally we also plan toenhance the user interface providing more sophisticatedreporting and data mining functionality All of these fea-tures will be driven by end-user requirements of the adop-ter sites which is a critical facet to the caBIG initiative

Availability and requirementsProject name Cancer Translational Research InformaticsPlatform (caTRIP)

Project home page httpsgforgencinihgovprojectscatrip

Operating system(s) platform independent

Programming language Java

Other requirements Tomcat 5028 a database (MySQL4110 or Oracle 10x) Java Version 15 or higher andApache Ant version 165

License NCI caBIG license (non-viral)

Any restrictions to use by non-academics no

Competing interestsThe authors declare that they have no competing interests

Authors contributionsPM was the Lead Architect and Project Manager of caTRIPRD provided subject matter expertise to guide the devel-opment of caTRIP Ram Chilukuri was the lead developerof caTRIP RP provided subject matter expertise KJ pro-vided subject matter expertise RA was the project directorAJC was the project PI

AcknowledgementsThis publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR) a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH Information on NCRR is available at httpwwwncrrnihgov Information on Re-engineer-ing the Clinical Research Enterprise can be obtained from httpnihroad mapnihgovclinicalresearchoverview-translationalasp

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant The authors would like also to acknowledge the contributions of the developers database administrators and system administrators at Duke including Mark Peedin Christopher Hubbard Wilma Stanley Farid Mohammad and William Banks We would also like to thank the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC as well as Bill Mason from at 5 AM Solutions Addition-ally we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements They include Drs Kelley Marcom Gretchen Kimmick Kimberly Blackwell and Lee Wilke Finally we would also like to thank our liaison to NCI Juli Klemm for her support and intellectual contributions

References1 Fenstermacher D Street C McSherry T Nayak V Overby C Feld-

man M The Cancer Biomedical Informatics Grid (caBIGtrade)Conf Proc IEEE Eng Med Biol Soc 2005 1743-6

Page 12 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Publish with BioMed Central and every scientist can read your work free of charge

BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime

Sir Paul Nurse Cancer Research UK

Your research papers will be

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours mdash you keep the copyright

Submit your manuscript herehttpwwwbiomedcentralcominfopublishing_advasp

BioMedcentral

2 Esposito NN Chivukula M Dabbs DJ The ductal phenotypicexpression of the E-cadherincatenin complex in tubulolobu-lar carcinoma of the breast an immunohistochemical andclinicopathologic study Mod Pathol 2007 20(1)130-8

3 Sahin FI Yilmaz Z Yagmurdur MC Atac FB Ozdemir BH KarakayaliH Demirhan B Haberal M Clinical findings and HER-2neu geneamplification status of breast carcinoma patients Pathol OncolRes 2006 12(4)211-5

4 Sughayer MA Al-Khawaja MM Massarweh S Al-Masri M Preva-lence of hormone receptors and HER2neu in breast cancercases in Jordan Pathol Oncol Res 2006 12(2)83-6

5 Radovic S Babic M Doric M Secic S Beslic S Balta E Kapetanovic EImmunohistochemical evaluation of the HER-2 protein inthe infiltrative lobular breast cancer Med Arh 200660(4)213-6

6 Mills ME Linkage of patient records to support continuity ofcare Issues and future directions Stud Health Technol Inform2006 122320-4

7 Saltz J Oster S Hastings S Langella S Kurc T Sanchez W Kher MManisundaram A Shanbhag K Covitz P caGrid design and imple-mentation of the core architecture of the cancer biomedicalinformatics grid Bioinformatics 2006 22(15)1910-6

8 Reich M Liefeld T Gould J Lerner J Tamayo P Mesirov JP GenePa-ttern 20 Nat Genet 2006 38(5)500-1

9 Gentleman RC Carey VJ Bates DM Bolstad B Dettling M Dudoit SEllis B Gautier L Ge Y Gentry J others Bioconductor open soft-ware development for computational biology and bioinfor-matics Genome Biol 2004 5(10)R80

Pre-publication historyThe pre-publication history for this paper can be accessedhere

httpwwwbiomedcentralcom1472-6947860prepub

Page 13 of 13(page number not for citation purposes)

  • Abstract
    • Background
    • Results
    • Conclusion
      • Background
        • Case Study Outcomes Analysis
        • Design Objectives
          • Implementation
            • Technical Requirements
            • Software architecture
            • Implementation at Duke
              • Results and discussion
                • Graphical User Interface
                • Simple Interface
                • Advanced Interface
                • Distributed Query Engine
                • Domain Services
                • Security
                • Security Authentication
                • Security Authorization
                • Delegation and Foreign CDEs
                • Automated Honest Broker Service
                • Example from the Duke Implementation
                  • Conclusion
                  • Availability and requirements
                  • Competing interests
                  • Authors contributions
                  • Acknowledgements
                  • References
                  • Pre-publication history
Page 9: The cancer translational research informatics platform

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

domain services using the users certificate The finalimplementation will likely leverage the DelegationServicefrom the Globus project The user will signal the Delega-tionService that it is permissible for a collocated service toact on the users behalf In the meantime the distributedquery engine is collocated with graphical user interface soit simply passes the users credentials to the data servicesidentified in the distributed query This bypasses some ofthe business rules defined by delegation such as thenumber of delegation hops allowed by the user but it issufficient to meet caTRIP use cases

Another change on the security workflow is the use offoreign CDEs for the purpose of joining data It is con-ceivable that a user does not have permission to view aCDE that is used for joining between two differentdomain services The classic example of this is to join

patients across services using the Medical Record NumberIn this case a special security workflow must be under-taken during authorization A level of trust is maintainedbetween the distributed query engine service and thedomain service such that foreign CDEs are returned to theDQE service with the understanding that the delegateduser is not authorized to view them but is authorized tojoin on them This workflow is only possible if the DQE isnot collocated with the client application

Automated Honest Broker ServiceThe primary goal of caTRIP is to provide users access todiverse datasets that cross domains and data sources Onechallenge that arises from this is deidentifying datasetsthat are managed by different groups which especiallyaffects deidentifying foreign CDEs (MRNs) that are usedto link across different datasets A typical scenario for

High-level overview of the caGrid security architecture httpcagridorgFigure 5High-level overview of the caGrid security architecture httpcagridorg

Page 9 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

cross-dataset deidentification is for an IRB appointed indi-vidual to collect each dataset and deidentify them all atonce tossing away the keys to the Protected Health Infor-mation (PHI) This approach is not scalable in a distrib-uted environment because it requires a middle-man toperform the deidentification Whenever a new dataset isadded to the system all previously existing data sets mustbe deidentified again To address this in caTRIP we devel-oped an Automated Honest Broker Service (AHBS) whichperforms the tasks once performed by a human appointedby the Institutional Review Board A data owner program-matically submits PHI (eg an MRN) to the AHBS using asecure connection The AHBS generates a random keyassociates it internally in a database to the PHI (which canbe encrypted) and then returns the key to the data ownerThe data owner then can deidentify his dataset by replac-ing the PHI with the random key When more than onedata ownerdataset is involved the AHBS will alwaysreturn the same random key for the same PHI In this waydiverse distributed datasets can be deidentified using thesame keys

Example from the Duke ImplementationThe following example illustrates how caTRIP would con-struct and execute a user query that joins disparate data-sets A user at Duke could ask the question What are allof the available tissue specimens from the breast obtainedfrom patients tested for positive expression of the estrogenreceptor The distributed query engine would first fetchall of the patient medical record numbers from the Clini-cal Annotation Engine (CAE) service that have estrogenreceptor expression annotated as positive It would thenissue a query to the caTissue CORE service for tissue spec-imens that are marked available were collected from thetissue site of breast and were obtained from patients thathave medical record numbers that match those returnedfrom the previous query The results would be a list of tis-sue specimens including their barcodes location sizeetc The single query issued to the distributed queryengine would look like this

1 ltDCQLQuery xmlns = httpcaGridcaBIG10govnihncicagriddcqlgt

2 ltTargetObject name = httpeduwustlcatissuecoredomainobjectimplTissueSpecimenImpl serv-iceURL = httplocalhost8080wsrfservicescagridCaTissueCoregt

3 ltGroup logicRelation = ANDgt

4 ltAttribute name = available predicate =EQUAL_TO value = truegt

5 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCharacteristicsImpl roleName = specimenCharacteristicsgt

6 ltAttribute name = tissueSite predicate =LIKE value = breastgt

7 ltAssociationgt

8 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCollectionGroupImpl roleName = specimenCollectionGroupgt

9 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplClinicalReportImplroleName = clinicalReportgt

10 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplParticipantMedicalIdentifierImpl roleName = participantMedicalIdentifiergt

11 ltForeignAssociationgt

12 ltJoinConditiongt

13 ltLeftJoingt

14ltObjectgteduwustlcatissuecoredomai

nobjectimplParticipantMedicalIdentifierImplltObjectgt

15ltPropertygtmedicalRecordNumberlt

Propertygt

16 ltLeftJoingt

17 ltRightJoingt

18ltObjectgtedudukecatripcaedomaing

eneralParticipantMedicalIdentifierltObjectgt

19ltPropertygtmedicalRecordNumberlt

Propertygt

20 ltRightJoingt

21 ltJoinConditiongt

22 ltForeignObject name =httpedudukecatripcaedomaingeneralParticipantMedicalI dentifier serviceURL = httplocalhost8080wsrfservicescagridCAEgt

Page 10 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

23 ltAssociation name = httpedudukecatripcaedomaingeneralParticipantroleName = participantgt

24 ltAssociation name =httpedupittcabigcaedomaingeneralAnnotationEventPa rameters roleName = annotationEventParameter-sCollectiongt

25 ltAssociation name =httpedupittcabigcaedomainbreastBreastCancerBiomark ers roleName = annotationSetCollectiongt

26 ltAttribute name = estrogenRe-ceptor predicate = LIKE value = POSITIVEgt

27 ltAssociationgt

28 ltAssociationgt

29 ltAssociationgt

30 ltForeignObjectgt

31 ltForeignAssociationgt

32 ltAssociationgt

33 ltAssociationgt

ltAssociationgt

ltGroupgt

ltTargetObjectgt

ltDCQLQuerygt

Note on line 1 that the query is rooted at caTissue COREwhere available specimens are filtered on line 4 and spec-imens from the tissue site of breast are filtered on line 6Lines 11ndash31 define the foreign association between caTis-sue CORE and CAE The medical record numbers (MRNs)of each service are joined on lines 12ndash21 by defining theleft join criteria to be the MRN from caTissue CORE andthe right join criteria to the be MRN from CAE Line 22selects the foreign service to be CAE from which patientswith estrogen receptor expression marked as positive arefiltered on line 26 The query itself is unfolded and exe-cuted inside-out by the distributed query engine CAEbeing the first service to be queried and the results used toquery caTissue CORE Note also that the association pathsof the objects are defined within the query on lines 5 8 910 23 24 and 25 the logic of which is hidden from theusers who are simply selecting data elements to query

upon and return The abbreviated results would look likethis

1 ltqueryResults targetClassname =httpeduwustlcatissuecoredomainobjectimplTissueSpeci menImplgt

2 ltns1ObjectResult xmlnsns1 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

3 ltns2TissueSpecimen id = 397 type = Fixed TissueBlock available = true positionDimensionOne = 20positionDimensionTwo = 22 barcode = ISBN-10AB00-0013XY comments = TISSUE IN FORMALIN 30activityStatus = Active quantityInGram = 07 availa-bleQuantityInGram = 0199999988079071 xmlnsns2= gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

4 ltns1ObjectResultgt

5 ltns2ObjectResult xmlnsns2 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

6 ltns3TissueSpecimen id = 406 type = Fixed TissueBlock available = true positionDimensionOne = 19positionDimensionTwo = 14 barcode = ISBN-10AB00-0043XY comments = STRONG FAMILY HISTORYFOR BRCA MUTATIONS activityStatus = Active quan-tityInGram = 675 availableQuantityInGram = 625xmlnsns3 = gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

7 ltns2ObjectResultgt

8 ltns3ObjectResult xmlnsns3 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

9 ltns4TissueSpecimen id = 411 type = Fixed TissueBlock available = true positionDimensionOne = 21positionDimensionTwo = 7 barcode = ISBN-10 AB00-0063XY comments = NO MORE WHOLE BLOODactivityStatus = Active quantityInGram = 4275 avail-ableQuantityInGram = 4225 xmlnsns4 =gmecaCOREcabig30eduwustlcatissuecoredomainob jectgt

10 ltns3ObjectResultgt

11 ltqueryResultsgt

Note that line 1 defines the type of object being returnedin the query results and that the results themselves found

Page 11 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

on lines 3 6 and 9 are each of the objects that match thequery criteria with data elements of interest embeddedwithin them as attributes

ConclusioncaTRIP is being implemented in a phased-developmentcycle meeting certain objectives at each point to provideutility as the program is extended It is currently in its sec-ond phase of development Phase one concluded the pilotdevelopment activities of caTRIP providing a baselinearchitecture and proof-of-concept user interface End usertraining was provided on a limited basis for the smallmultidisciplinary breast oncology group using live dem-onstrations and basic instructions provided through e-mail messages Feedback was collected through e-mail aswell The multidisciplinary group was given direct accessto the primary software developers and technical expertsduring this period In phase two we plan to deploycaTRIP-II along with enhancements determined throughend user interaction Furthermore we are also in the proc-ess of engaging different groups at Duke to adopt caTRIPto help meet their research goals

In phase one of caTRIP use cases focused on intra-institu-tional problems tearing down the silos of data-orientedsystems in a single cancer center In future iterations weplan to extend caTRIP to tackle cross-institutional usecases This involves extending the caTRIP interface anddistributed query engine to support aggregation of datafrom services that expose common data elements Forexample a motivating use case amongst prospectiveadopters is to aggregate treatment endpoint pathologyand tissue banking data from a number of different cancercenters This will allow researchers and clinicians toexpand their queries and answer questions that they pre-viously could not at their own institution Furthermorewe plan to integrate caTRIP with analytics specificallyfocusing on microarray analysis The caBIG gene expres-sion database caArray is a rich source of central or distrib-uted gene expression data which itself can be tied toclinical data such as pathology biomarkers This microar-ray data can be analyzed and visualized dynamically usingservices provided by other caBIG applications such asgeWorkbench Califano 2006 httpapiiiupmceduabstractsposterarchive2006watkinsonhtml GenePat-tern [8] and Bioconductor[9] This notion of buildingworkflows based upon queries designed using caTRIPssimple interface can provide a powerful tool for bothresearchers and clinicians Finally we also plan toenhance the user interface providing more sophisticatedreporting and data mining functionality All of these fea-tures will be driven by end-user requirements of the adop-ter sites which is a critical facet to the caBIG initiative

Availability and requirementsProject name Cancer Translational Research InformaticsPlatform (caTRIP)

Project home page httpsgforgencinihgovprojectscatrip

Operating system(s) platform independent

Programming language Java

Other requirements Tomcat 5028 a database (MySQL4110 or Oracle 10x) Java Version 15 or higher andApache Ant version 165

License NCI caBIG license (non-viral)

Any restrictions to use by non-academics no

Competing interestsThe authors declare that they have no competing interests

Authors contributionsPM was the Lead Architect and Project Manager of caTRIPRD provided subject matter expertise to guide the devel-opment of caTRIP Ram Chilukuri was the lead developerof caTRIP RP provided subject matter expertise KJ pro-vided subject matter expertise RA was the project directorAJC was the project PI

AcknowledgementsThis publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR) a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH Information on NCRR is available at httpwwwncrrnihgov Information on Re-engineer-ing the Clinical Research Enterprise can be obtained from httpnihroad mapnihgovclinicalresearchoverview-translationalasp

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant The authors would like also to acknowledge the contributions of the developers database administrators and system administrators at Duke including Mark Peedin Christopher Hubbard Wilma Stanley Farid Mohammad and William Banks We would also like to thank the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC as well as Bill Mason from at 5 AM Solutions Addition-ally we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements They include Drs Kelley Marcom Gretchen Kimmick Kimberly Blackwell and Lee Wilke Finally we would also like to thank our liaison to NCI Juli Klemm for her support and intellectual contributions

References1 Fenstermacher D Street C McSherry T Nayak V Overby C Feld-

man M The Cancer Biomedical Informatics Grid (caBIGtrade)Conf Proc IEEE Eng Med Biol Soc 2005 1743-6

Page 12 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Publish with BioMed Central and every scientist can read your work free of charge

BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime

Sir Paul Nurse Cancer Research UK

Your research papers will be

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours mdash you keep the copyright

Submit your manuscript herehttpwwwbiomedcentralcominfopublishing_advasp

BioMedcentral

2 Esposito NN Chivukula M Dabbs DJ The ductal phenotypicexpression of the E-cadherincatenin complex in tubulolobu-lar carcinoma of the breast an immunohistochemical andclinicopathologic study Mod Pathol 2007 20(1)130-8

3 Sahin FI Yilmaz Z Yagmurdur MC Atac FB Ozdemir BH KarakayaliH Demirhan B Haberal M Clinical findings and HER-2neu geneamplification status of breast carcinoma patients Pathol OncolRes 2006 12(4)211-5

4 Sughayer MA Al-Khawaja MM Massarweh S Al-Masri M Preva-lence of hormone receptors and HER2neu in breast cancercases in Jordan Pathol Oncol Res 2006 12(2)83-6

5 Radovic S Babic M Doric M Secic S Beslic S Balta E Kapetanovic EImmunohistochemical evaluation of the HER-2 protein inthe infiltrative lobular breast cancer Med Arh 200660(4)213-6

6 Mills ME Linkage of patient records to support continuity ofcare Issues and future directions Stud Health Technol Inform2006 122320-4

7 Saltz J Oster S Hastings S Langella S Kurc T Sanchez W Kher MManisundaram A Shanbhag K Covitz P caGrid design and imple-mentation of the core architecture of the cancer biomedicalinformatics grid Bioinformatics 2006 22(15)1910-6

8 Reich M Liefeld T Gould J Lerner J Tamayo P Mesirov JP GenePa-ttern 20 Nat Genet 2006 38(5)500-1

9 Gentleman RC Carey VJ Bates DM Bolstad B Dettling M Dudoit SEllis B Gautier L Ge Y Gentry J others Bioconductor open soft-ware development for computational biology and bioinfor-matics Genome Biol 2004 5(10)R80

Pre-publication historyThe pre-publication history for this paper can be accessedhere

httpwwwbiomedcentralcom1472-6947860prepub

Page 13 of 13(page number not for citation purposes)

  • Abstract
    • Background
    • Results
    • Conclusion
      • Background
        • Case Study Outcomes Analysis
        • Design Objectives
          • Implementation
            • Technical Requirements
            • Software architecture
            • Implementation at Duke
              • Results and discussion
                • Graphical User Interface
                • Simple Interface
                • Advanced Interface
                • Distributed Query Engine
                • Domain Services
                • Security
                • Security Authentication
                • Security Authorization
                • Delegation and Foreign CDEs
                • Automated Honest Broker Service
                • Example from the Duke Implementation
                  • Conclusion
                  • Availability and requirements
                  • Competing interests
                  • Authors contributions
                  • Acknowledgements
                  • References
                  • Pre-publication history
Page 10: The cancer translational research informatics platform

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

cross-dataset deidentification is for an IRB appointed indi-vidual to collect each dataset and deidentify them all atonce tossing away the keys to the Protected Health Infor-mation (PHI) This approach is not scalable in a distrib-uted environment because it requires a middle-man toperform the deidentification Whenever a new dataset isadded to the system all previously existing data sets mustbe deidentified again To address this in caTRIP we devel-oped an Automated Honest Broker Service (AHBS) whichperforms the tasks once performed by a human appointedby the Institutional Review Board A data owner program-matically submits PHI (eg an MRN) to the AHBS using asecure connection The AHBS generates a random keyassociates it internally in a database to the PHI (which canbe encrypted) and then returns the key to the data ownerThe data owner then can deidentify his dataset by replac-ing the PHI with the random key When more than onedata ownerdataset is involved the AHBS will alwaysreturn the same random key for the same PHI In this waydiverse distributed datasets can be deidentified using thesame keys

Example from the Duke ImplementationThe following example illustrates how caTRIP would con-struct and execute a user query that joins disparate data-sets A user at Duke could ask the question What are allof the available tissue specimens from the breast obtainedfrom patients tested for positive expression of the estrogenreceptor The distributed query engine would first fetchall of the patient medical record numbers from the Clini-cal Annotation Engine (CAE) service that have estrogenreceptor expression annotated as positive It would thenissue a query to the caTissue CORE service for tissue spec-imens that are marked available were collected from thetissue site of breast and were obtained from patients thathave medical record numbers that match those returnedfrom the previous query The results would be a list of tis-sue specimens including their barcodes location sizeetc The single query issued to the distributed queryengine would look like this

1 ltDCQLQuery xmlns = httpcaGridcaBIG10govnihncicagriddcqlgt

2 ltTargetObject name = httpeduwustlcatissuecoredomainobjectimplTissueSpecimenImpl serv-iceURL = httplocalhost8080wsrfservicescagridCaTissueCoregt

3 ltGroup logicRelation = ANDgt

4 ltAttribute name = available predicate =EQUAL_TO value = truegt

5 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCharacteristicsImpl roleName = specimenCharacteristicsgt

6 ltAttribute name = tissueSite predicate =LIKE value = breastgt

7 ltAssociationgt

8 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCollectionGroupImpl roleName = specimenCollectionGroupgt

9 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplClinicalReportImplroleName = clinicalReportgt

10 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplParticipantMedicalIdentifierImpl roleName = participantMedicalIdentifiergt

11 ltForeignAssociationgt

12 ltJoinConditiongt

13 ltLeftJoingt

14ltObjectgteduwustlcatissuecoredomai

nobjectimplParticipantMedicalIdentifierImplltObjectgt

15ltPropertygtmedicalRecordNumberlt

Propertygt

16 ltLeftJoingt

17 ltRightJoingt

18ltObjectgtedudukecatripcaedomaing

eneralParticipantMedicalIdentifierltObjectgt

19ltPropertygtmedicalRecordNumberlt

Propertygt

20 ltRightJoingt

21 ltJoinConditiongt

22 ltForeignObject name =httpedudukecatripcaedomaingeneralParticipantMedicalI dentifier serviceURL = httplocalhost8080wsrfservicescagridCAEgt

Page 10 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

23 ltAssociation name = httpedudukecatripcaedomaingeneralParticipantroleName = participantgt

24 ltAssociation name =httpedupittcabigcaedomaingeneralAnnotationEventPa rameters roleName = annotationEventParameter-sCollectiongt

25 ltAssociation name =httpedupittcabigcaedomainbreastBreastCancerBiomark ers roleName = annotationSetCollectiongt

26 ltAttribute name = estrogenRe-ceptor predicate = LIKE value = POSITIVEgt

27 ltAssociationgt

28 ltAssociationgt

29 ltAssociationgt

30 ltForeignObjectgt

31 ltForeignAssociationgt

32 ltAssociationgt

33 ltAssociationgt

ltAssociationgt

ltGroupgt

ltTargetObjectgt

ltDCQLQuerygt

Note on line 1 that the query is rooted at caTissue COREwhere available specimens are filtered on line 4 and spec-imens from the tissue site of breast are filtered on line 6Lines 11ndash31 define the foreign association between caTis-sue CORE and CAE The medical record numbers (MRNs)of each service are joined on lines 12ndash21 by defining theleft join criteria to be the MRN from caTissue CORE andthe right join criteria to the be MRN from CAE Line 22selects the foreign service to be CAE from which patientswith estrogen receptor expression marked as positive arefiltered on line 26 The query itself is unfolded and exe-cuted inside-out by the distributed query engine CAEbeing the first service to be queried and the results used toquery caTissue CORE Note also that the association pathsof the objects are defined within the query on lines 5 8 910 23 24 and 25 the logic of which is hidden from theusers who are simply selecting data elements to query

upon and return The abbreviated results would look likethis

1 ltqueryResults targetClassname =httpeduwustlcatissuecoredomainobjectimplTissueSpeci menImplgt

2 ltns1ObjectResult xmlnsns1 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

3 ltns2TissueSpecimen id = 397 type = Fixed TissueBlock available = true positionDimensionOne = 20positionDimensionTwo = 22 barcode = ISBN-10AB00-0013XY comments = TISSUE IN FORMALIN 30activityStatus = Active quantityInGram = 07 availa-bleQuantityInGram = 0199999988079071 xmlnsns2= gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

4 ltns1ObjectResultgt

5 ltns2ObjectResult xmlnsns2 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

6 ltns3TissueSpecimen id = 406 type = Fixed TissueBlock available = true positionDimensionOne = 19positionDimensionTwo = 14 barcode = ISBN-10AB00-0043XY comments = STRONG FAMILY HISTORYFOR BRCA MUTATIONS activityStatus = Active quan-tityInGram = 675 availableQuantityInGram = 625xmlnsns3 = gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

7 ltns2ObjectResultgt

8 ltns3ObjectResult xmlnsns3 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

9 ltns4TissueSpecimen id = 411 type = Fixed TissueBlock available = true positionDimensionOne = 21positionDimensionTwo = 7 barcode = ISBN-10 AB00-0063XY comments = NO MORE WHOLE BLOODactivityStatus = Active quantityInGram = 4275 avail-ableQuantityInGram = 4225 xmlnsns4 =gmecaCOREcabig30eduwustlcatissuecoredomainob jectgt

10 ltns3ObjectResultgt

11 ltqueryResultsgt

Note that line 1 defines the type of object being returnedin the query results and that the results themselves found

Page 11 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

on lines 3 6 and 9 are each of the objects that match thequery criteria with data elements of interest embeddedwithin them as attributes

ConclusioncaTRIP is being implemented in a phased-developmentcycle meeting certain objectives at each point to provideutility as the program is extended It is currently in its sec-ond phase of development Phase one concluded the pilotdevelopment activities of caTRIP providing a baselinearchitecture and proof-of-concept user interface End usertraining was provided on a limited basis for the smallmultidisciplinary breast oncology group using live dem-onstrations and basic instructions provided through e-mail messages Feedback was collected through e-mail aswell The multidisciplinary group was given direct accessto the primary software developers and technical expertsduring this period In phase two we plan to deploycaTRIP-II along with enhancements determined throughend user interaction Furthermore we are also in the proc-ess of engaging different groups at Duke to adopt caTRIPto help meet their research goals

In phase one of caTRIP use cases focused on intra-institu-tional problems tearing down the silos of data-orientedsystems in a single cancer center In future iterations weplan to extend caTRIP to tackle cross-institutional usecases This involves extending the caTRIP interface anddistributed query engine to support aggregation of datafrom services that expose common data elements Forexample a motivating use case amongst prospectiveadopters is to aggregate treatment endpoint pathologyand tissue banking data from a number of different cancercenters This will allow researchers and clinicians toexpand their queries and answer questions that they pre-viously could not at their own institution Furthermorewe plan to integrate caTRIP with analytics specificallyfocusing on microarray analysis The caBIG gene expres-sion database caArray is a rich source of central or distrib-uted gene expression data which itself can be tied toclinical data such as pathology biomarkers This microar-ray data can be analyzed and visualized dynamically usingservices provided by other caBIG applications such asgeWorkbench Califano 2006 httpapiiiupmceduabstractsposterarchive2006watkinsonhtml GenePat-tern [8] and Bioconductor[9] This notion of buildingworkflows based upon queries designed using caTRIPssimple interface can provide a powerful tool for bothresearchers and clinicians Finally we also plan toenhance the user interface providing more sophisticatedreporting and data mining functionality All of these fea-tures will be driven by end-user requirements of the adop-ter sites which is a critical facet to the caBIG initiative

Availability and requirementsProject name Cancer Translational Research InformaticsPlatform (caTRIP)

Project home page httpsgforgencinihgovprojectscatrip

Operating system(s) platform independent

Programming language Java

Other requirements Tomcat 5028 a database (MySQL4110 or Oracle 10x) Java Version 15 or higher andApache Ant version 165

License NCI caBIG license (non-viral)

Any restrictions to use by non-academics no

Competing interestsThe authors declare that they have no competing interests

Authors contributionsPM was the Lead Architect and Project Manager of caTRIPRD provided subject matter expertise to guide the devel-opment of caTRIP Ram Chilukuri was the lead developerof caTRIP RP provided subject matter expertise KJ pro-vided subject matter expertise RA was the project directorAJC was the project PI

AcknowledgementsThis publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR) a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH Information on NCRR is available at httpwwwncrrnihgov Information on Re-engineer-ing the Clinical Research Enterprise can be obtained from httpnihroad mapnihgovclinicalresearchoverview-translationalasp

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant The authors would like also to acknowledge the contributions of the developers database administrators and system administrators at Duke including Mark Peedin Christopher Hubbard Wilma Stanley Farid Mohammad and William Banks We would also like to thank the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC as well as Bill Mason from at 5 AM Solutions Addition-ally we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements They include Drs Kelley Marcom Gretchen Kimmick Kimberly Blackwell and Lee Wilke Finally we would also like to thank our liaison to NCI Juli Klemm for her support and intellectual contributions

References1 Fenstermacher D Street C McSherry T Nayak V Overby C Feld-

man M The Cancer Biomedical Informatics Grid (caBIGtrade)Conf Proc IEEE Eng Med Biol Soc 2005 1743-6

Page 12 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Publish with BioMed Central and every scientist can read your work free of charge

BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime

Sir Paul Nurse Cancer Research UK

Your research papers will be

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours mdash you keep the copyright

Submit your manuscript herehttpwwwbiomedcentralcominfopublishing_advasp

BioMedcentral

2 Esposito NN Chivukula M Dabbs DJ The ductal phenotypicexpression of the E-cadherincatenin complex in tubulolobu-lar carcinoma of the breast an immunohistochemical andclinicopathologic study Mod Pathol 2007 20(1)130-8

3 Sahin FI Yilmaz Z Yagmurdur MC Atac FB Ozdemir BH KarakayaliH Demirhan B Haberal M Clinical findings and HER-2neu geneamplification status of breast carcinoma patients Pathol OncolRes 2006 12(4)211-5

4 Sughayer MA Al-Khawaja MM Massarweh S Al-Masri M Preva-lence of hormone receptors and HER2neu in breast cancercases in Jordan Pathol Oncol Res 2006 12(2)83-6

5 Radovic S Babic M Doric M Secic S Beslic S Balta E Kapetanovic EImmunohistochemical evaluation of the HER-2 protein inthe infiltrative lobular breast cancer Med Arh 200660(4)213-6

6 Mills ME Linkage of patient records to support continuity ofcare Issues and future directions Stud Health Technol Inform2006 122320-4

7 Saltz J Oster S Hastings S Langella S Kurc T Sanchez W Kher MManisundaram A Shanbhag K Covitz P caGrid design and imple-mentation of the core architecture of the cancer biomedicalinformatics grid Bioinformatics 2006 22(15)1910-6

8 Reich M Liefeld T Gould J Lerner J Tamayo P Mesirov JP GenePa-ttern 20 Nat Genet 2006 38(5)500-1

9 Gentleman RC Carey VJ Bates DM Bolstad B Dettling M Dudoit SEllis B Gautier L Ge Y Gentry J others Bioconductor open soft-ware development for computational biology and bioinfor-matics Genome Biol 2004 5(10)R80

Pre-publication historyThe pre-publication history for this paper can be accessedhere

httpwwwbiomedcentralcom1472-6947860prepub

Page 13 of 13(page number not for citation purposes)

  • Abstract
    • Background
    • Results
    • Conclusion
      • Background
        • Case Study Outcomes Analysis
        • Design Objectives
          • Implementation
            • Technical Requirements
            • Software architecture
            • Implementation at Duke
              • Results and discussion
                • Graphical User Interface
                • Simple Interface
                • Advanced Interface
                • Distributed Query Engine
                • Domain Services
                • Security
                • Security Authentication
                • Security Authorization
                • Delegation and Foreign CDEs
                • Automated Honest Broker Service
                • Example from the Duke Implementation
                  • Conclusion
                  • Availability and requirements
                  • Competing interests
                  • Authors contributions
                  • Acknowledgements
                  • References
                  • Pre-publication history
Page 11: The cancer translational research informatics platform

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

23 ltAssociation name = httpedudukecatripcaedomaingeneralParticipantroleName = participantgt

24 ltAssociation name =httpedupittcabigcaedomaingeneralAnnotationEventPa rameters roleName = annotationEventParameter-sCollectiongt

25 ltAssociation name =httpedupittcabigcaedomainbreastBreastCancerBiomark ers roleName = annotationSetCollectiongt

26 ltAttribute name = estrogenRe-ceptor predicate = LIKE value = POSITIVEgt

27 ltAssociationgt

28 ltAssociationgt

29 ltAssociationgt

30 ltForeignObjectgt

31 ltForeignAssociationgt

32 ltAssociationgt

33 ltAssociationgt

ltAssociationgt

ltGroupgt

ltTargetObjectgt

ltDCQLQuerygt

Note on line 1 that the query is rooted at caTissue COREwhere available specimens are filtered on line 4 and spec-imens from the tissue site of breast are filtered on line 6Lines 11ndash31 define the foreign association between caTis-sue CORE and CAE The medical record numbers (MRNs)of each service are joined on lines 12ndash21 by defining theleft join criteria to be the MRN from caTissue CORE andthe right join criteria to the be MRN from CAE Line 22selects the foreign service to be CAE from which patientswith estrogen receptor expression marked as positive arefiltered on line 26 The query itself is unfolded and exe-cuted inside-out by the distributed query engine CAEbeing the first service to be queried and the results used toquery caTissue CORE Note also that the association pathsof the objects are defined within the query on lines 5 8 910 23 24 and 25 the logic of which is hidden from theusers who are simply selecting data elements to query

upon and return The abbreviated results would look likethis

1 ltqueryResults targetClassname =httpeduwustlcatissuecoredomainobjectimplTissueSpeci menImplgt

2 ltns1ObjectResult xmlnsns1 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

3 ltns2TissueSpecimen id = 397 type = Fixed TissueBlock available = true positionDimensionOne = 20positionDimensionTwo = 22 barcode = ISBN-10AB00-0013XY comments = TISSUE IN FORMALIN 30activityStatus = Active quantityInGram = 07 availa-bleQuantityInGram = 0199999988079071 xmlnsns2= gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

4 ltns1ObjectResultgt

5 ltns2ObjectResult xmlnsns2 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

6 ltns3TissueSpecimen id = 406 type = Fixed TissueBlock available = true positionDimensionOne = 19positionDimensionTwo = 14 barcode = ISBN-10AB00-0043XY comments = STRONG FAMILY HISTORYFOR BRCA MUTATIONS activityStatus = Active quan-tityInGram = 675 availableQuantityInGram = 625xmlnsns3 = gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt

7 ltns2ObjectResultgt

8 ltns3ObjectResult xmlnsns3 = httpCQLcaBIG1govnihncicagridCQLResultSetgt

9 ltns4TissueSpecimen id = 411 type = Fixed TissueBlock available = true positionDimensionOne = 21positionDimensionTwo = 7 barcode = ISBN-10 AB00-0063XY comments = NO MORE WHOLE BLOODactivityStatus = Active quantityInGram = 4275 avail-ableQuantityInGram = 4225 xmlnsns4 =gmecaCOREcabig30eduwustlcatissuecoredomainob jectgt

10 ltns3ObjectResultgt

11 ltqueryResultsgt

Note that line 1 defines the type of object being returnedin the query results and that the results themselves found

Page 11 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

on lines 3 6 and 9 are each of the objects that match thequery criteria with data elements of interest embeddedwithin them as attributes

ConclusioncaTRIP is being implemented in a phased-developmentcycle meeting certain objectives at each point to provideutility as the program is extended It is currently in its sec-ond phase of development Phase one concluded the pilotdevelopment activities of caTRIP providing a baselinearchitecture and proof-of-concept user interface End usertraining was provided on a limited basis for the smallmultidisciplinary breast oncology group using live dem-onstrations and basic instructions provided through e-mail messages Feedback was collected through e-mail aswell The multidisciplinary group was given direct accessto the primary software developers and technical expertsduring this period In phase two we plan to deploycaTRIP-II along with enhancements determined throughend user interaction Furthermore we are also in the proc-ess of engaging different groups at Duke to adopt caTRIPto help meet their research goals

In phase one of caTRIP use cases focused on intra-institu-tional problems tearing down the silos of data-orientedsystems in a single cancer center In future iterations weplan to extend caTRIP to tackle cross-institutional usecases This involves extending the caTRIP interface anddistributed query engine to support aggregation of datafrom services that expose common data elements Forexample a motivating use case amongst prospectiveadopters is to aggregate treatment endpoint pathologyand tissue banking data from a number of different cancercenters This will allow researchers and clinicians toexpand their queries and answer questions that they pre-viously could not at their own institution Furthermorewe plan to integrate caTRIP with analytics specificallyfocusing on microarray analysis The caBIG gene expres-sion database caArray is a rich source of central or distrib-uted gene expression data which itself can be tied toclinical data such as pathology biomarkers This microar-ray data can be analyzed and visualized dynamically usingservices provided by other caBIG applications such asgeWorkbench Califano 2006 httpapiiiupmceduabstractsposterarchive2006watkinsonhtml GenePat-tern [8] and Bioconductor[9] This notion of buildingworkflows based upon queries designed using caTRIPssimple interface can provide a powerful tool for bothresearchers and clinicians Finally we also plan toenhance the user interface providing more sophisticatedreporting and data mining functionality All of these fea-tures will be driven by end-user requirements of the adop-ter sites which is a critical facet to the caBIG initiative

Availability and requirementsProject name Cancer Translational Research InformaticsPlatform (caTRIP)

Project home page httpsgforgencinihgovprojectscatrip

Operating system(s) platform independent

Programming language Java

Other requirements Tomcat 5028 a database (MySQL4110 or Oracle 10x) Java Version 15 or higher andApache Ant version 165

License NCI caBIG license (non-viral)

Any restrictions to use by non-academics no

Competing interestsThe authors declare that they have no competing interests

Authors contributionsPM was the Lead Architect and Project Manager of caTRIPRD provided subject matter expertise to guide the devel-opment of caTRIP Ram Chilukuri was the lead developerof caTRIP RP provided subject matter expertise KJ pro-vided subject matter expertise RA was the project directorAJC was the project PI

AcknowledgementsThis publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR) a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH Information on NCRR is available at httpwwwncrrnihgov Information on Re-engineer-ing the Clinical Research Enterprise can be obtained from httpnihroad mapnihgovclinicalresearchoverview-translationalasp

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant The authors would like also to acknowledge the contributions of the developers database administrators and system administrators at Duke including Mark Peedin Christopher Hubbard Wilma Stanley Farid Mohammad and William Banks We would also like to thank the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC as well as Bill Mason from at 5 AM Solutions Addition-ally we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements They include Drs Kelley Marcom Gretchen Kimmick Kimberly Blackwell and Lee Wilke Finally we would also like to thank our liaison to NCI Juli Klemm for her support and intellectual contributions

References1 Fenstermacher D Street C McSherry T Nayak V Overby C Feld-

man M The Cancer Biomedical Informatics Grid (caBIGtrade)Conf Proc IEEE Eng Med Biol Soc 2005 1743-6

Page 12 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Publish with BioMed Central and every scientist can read your work free of charge

BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime

Sir Paul Nurse Cancer Research UK

Your research papers will be

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours mdash you keep the copyright

Submit your manuscript herehttpwwwbiomedcentralcominfopublishing_advasp

BioMedcentral

2 Esposito NN Chivukula M Dabbs DJ The ductal phenotypicexpression of the E-cadherincatenin complex in tubulolobu-lar carcinoma of the breast an immunohistochemical andclinicopathologic study Mod Pathol 2007 20(1)130-8

3 Sahin FI Yilmaz Z Yagmurdur MC Atac FB Ozdemir BH KarakayaliH Demirhan B Haberal M Clinical findings and HER-2neu geneamplification status of breast carcinoma patients Pathol OncolRes 2006 12(4)211-5

4 Sughayer MA Al-Khawaja MM Massarweh S Al-Masri M Preva-lence of hormone receptors and HER2neu in breast cancercases in Jordan Pathol Oncol Res 2006 12(2)83-6

5 Radovic S Babic M Doric M Secic S Beslic S Balta E Kapetanovic EImmunohistochemical evaluation of the HER-2 protein inthe infiltrative lobular breast cancer Med Arh 200660(4)213-6

6 Mills ME Linkage of patient records to support continuity ofcare Issues and future directions Stud Health Technol Inform2006 122320-4

7 Saltz J Oster S Hastings S Langella S Kurc T Sanchez W Kher MManisundaram A Shanbhag K Covitz P caGrid design and imple-mentation of the core architecture of the cancer biomedicalinformatics grid Bioinformatics 2006 22(15)1910-6

8 Reich M Liefeld T Gould J Lerner J Tamayo P Mesirov JP GenePa-ttern 20 Nat Genet 2006 38(5)500-1

9 Gentleman RC Carey VJ Bates DM Bolstad B Dettling M Dudoit SEllis B Gautier L Ge Y Gentry J others Bioconductor open soft-ware development for computational biology and bioinfor-matics Genome Biol 2004 5(10)R80

Pre-publication historyThe pre-publication history for this paper can be accessedhere

httpwwwbiomedcentralcom1472-6947860prepub

Page 13 of 13(page number not for citation purposes)

  • Abstract
    • Background
    • Results
    • Conclusion
      • Background
        • Case Study Outcomes Analysis
        • Design Objectives
          • Implementation
            • Technical Requirements
            • Software architecture
            • Implementation at Duke
              • Results and discussion
                • Graphical User Interface
                • Simple Interface
                • Advanced Interface
                • Distributed Query Engine
                • Domain Services
                • Security
                • Security Authentication
                • Security Authorization
                • Delegation and Foreign CDEs
                • Automated Honest Broker Service
                • Example from the Duke Implementation
                  • Conclusion
                  • Availability and requirements
                  • Competing interests
                  • Authors contributions
                  • Acknowledgements
                  • References
                  • Pre-publication history
Page 12: The cancer translational research informatics platform

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

on lines 3 6 and 9 are each of the objects that match thequery criteria with data elements of interest embeddedwithin them as attributes

ConclusioncaTRIP is being implemented in a phased-developmentcycle meeting certain objectives at each point to provideutility as the program is extended It is currently in its sec-ond phase of development Phase one concluded the pilotdevelopment activities of caTRIP providing a baselinearchitecture and proof-of-concept user interface End usertraining was provided on a limited basis for the smallmultidisciplinary breast oncology group using live dem-onstrations and basic instructions provided through e-mail messages Feedback was collected through e-mail aswell The multidisciplinary group was given direct accessto the primary software developers and technical expertsduring this period In phase two we plan to deploycaTRIP-II along with enhancements determined throughend user interaction Furthermore we are also in the proc-ess of engaging different groups at Duke to adopt caTRIPto help meet their research goals

In phase one of caTRIP use cases focused on intra-institu-tional problems tearing down the silos of data-orientedsystems in a single cancer center In future iterations weplan to extend caTRIP to tackle cross-institutional usecases This involves extending the caTRIP interface anddistributed query engine to support aggregation of datafrom services that expose common data elements Forexample a motivating use case amongst prospectiveadopters is to aggregate treatment endpoint pathologyand tissue banking data from a number of different cancercenters This will allow researchers and clinicians toexpand their queries and answer questions that they pre-viously could not at their own institution Furthermorewe plan to integrate caTRIP with analytics specificallyfocusing on microarray analysis The caBIG gene expres-sion database caArray is a rich source of central or distrib-uted gene expression data which itself can be tied toclinical data such as pathology biomarkers This microar-ray data can be analyzed and visualized dynamically usingservices provided by other caBIG applications such asgeWorkbench Califano 2006 httpapiiiupmceduabstractsposterarchive2006watkinsonhtml GenePat-tern [8] and Bioconductor[9] This notion of buildingworkflows based upon queries designed using caTRIPssimple interface can provide a powerful tool for bothresearchers and clinicians Finally we also plan toenhance the user interface providing more sophisticatedreporting and data mining functionality All of these fea-tures will be driven by end-user requirements of the adop-ter sites which is a critical facet to the caBIG initiative

Availability and requirementsProject name Cancer Translational Research InformaticsPlatform (caTRIP)

Project home page httpsgforgencinihgovprojectscatrip

Operating system(s) platform independent

Programming language Java

Other requirements Tomcat 5028 a database (MySQL4110 or Oracle 10x) Java Version 15 or higher andApache Ant version 165

License NCI caBIG license (non-viral)

Any restrictions to use by non-academics no

Competing interestsThe authors declare that they have no competing interests

Authors contributionsPM was the Lead Architect and Project Manager of caTRIPRD provided subject matter expertise to guide the devel-opment of caTRIP Ram Chilukuri was the lead developerof caTRIP RP provided subject matter expertise KJ pro-vided subject matter expertise RA was the project directorAJC was the project PI

AcknowledgementsThis publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR) a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH Information on NCRR is available at httpwwwncrrnihgov Information on Re-engineer-ing the Clinical Research Enterprise can be obtained from httpnihroad mapnihgovclinicalresearchoverview-translationalasp

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant The authors would like also to acknowledge the contributions of the developers database administrators and system administrators at Duke including Mark Peedin Christopher Hubbard Wilma Stanley Farid Mohammad and William Banks We would also like to thank the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC as well as Bill Mason from at 5 AM Solutions Addition-ally we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements They include Drs Kelley Marcom Gretchen Kimmick Kimberly Blackwell and Lee Wilke Finally we would also like to thank our liaison to NCI Juli Klemm for her support and intellectual contributions

References1 Fenstermacher D Street C McSherry T Nayak V Overby C Feld-

man M The Cancer Biomedical Informatics Grid (caBIGtrade)Conf Proc IEEE Eng Med Biol Soc 2005 1743-6

Page 12 of 13(page number not for citation purposes)

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Publish with BioMed Central and every scientist can read your work free of charge

BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime

Sir Paul Nurse Cancer Research UK

Your research papers will be

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours mdash you keep the copyright

Submit your manuscript herehttpwwwbiomedcentralcominfopublishing_advasp

BioMedcentral

2 Esposito NN Chivukula M Dabbs DJ The ductal phenotypicexpression of the E-cadherincatenin complex in tubulolobu-lar carcinoma of the breast an immunohistochemical andclinicopathologic study Mod Pathol 2007 20(1)130-8

3 Sahin FI Yilmaz Z Yagmurdur MC Atac FB Ozdemir BH KarakayaliH Demirhan B Haberal M Clinical findings and HER-2neu geneamplification status of breast carcinoma patients Pathol OncolRes 2006 12(4)211-5

4 Sughayer MA Al-Khawaja MM Massarweh S Al-Masri M Preva-lence of hormone receptors and HER2neu in breast cancercases in Jordan Pathol Oncol Res 2006 12(2)83-6

5 Radovic S Babic M Doric M Secic S Beslic S Balta E Kapetanovic EImmunohistochemical evaluation of the HER-2 protein inthe infiltrative lobular breast cancer Med Arh 200660(4)213-6

6 Mills ME Linkage of patient records to support continuity ofcare Issues and future directions Stud Health Technol Inform2006 122320-4

7 Saltz J Oster S Hastings S Langella S Kurc T Sanchez W Kher MManisundaram A Shanbhag K Covitz P caGrid design and imple-mentation of the core architecture of the cancer biomedicalinformatics grid Bioinformatics 2006 22(15)1910-6

8 Reich M Liefeld T Gould J Lerner J Tamayo P Mesirov JP GenePa-ttern 20 Nat Genet 2006 38(5)500-1

9 Gentleman RC Carey VJ Bates DM Bolstad B Dettling M Dudoit SEllis B Gautier L Ge Y Gentry J others Bioconductor open soft-ware development for computational biology and bioinfor-matics Genome Biol 2004 5(10)R80

Pre-publication historyThe pre-publication history for this paper can be accessedhere

httpwwwbiomedcentralcom1472-6947860prepub

Page 13 of 13(page number not for citation purposes)

  • Abstract
    • Background
    • Results
    • Conclusion
      • Background
        • Case Study Outcomes Analysis
        • Design Objectives
          • Implementation
            • Technical Requirements
            • Software architecture
            • Implementation at Duke
              • Results and discussion
                • Graphical User Interface
                • Simple Interface
                • Advanced Interface
                • Distributed Query Engine
                • Domain Services
                • Security
                • Security Authentication
                • Security Authorization
                • Delegation and Foreign CDEs
                • Automated Honest Broker Service
                • Example from the Duke Implementation
                  • Conclusion
                  • Availability and requirements
                  • Competing interests
                  • Authors contributions
                  • Acknowledgements
                  • References
                  • Pre-publication history
Page 13: The cancer translational research informatics platform

BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860

Publish with BioMed Central and every scientist can read your work free of charge

BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime

Sir Paul Nurse Cancer Research UK

Your research papers will be

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours mdash you keep the copyright

Submit your manuscript herehttpwwwbiomedcentralcominfopublishing_advasp

BioMedcentral

2 Esposito NN Chivukula M Dabbs DJ The ductal phenotypicexpression of the E-cadherincatenin complex in tubulolobu-lar carcinoma of the breast an immunohistochemical andclinicopathologic study Mod Pathol 2007 20(1)130-8

3 Sahin FI Yilmaz Z Yagmurdur MC Atac FB Ozdemir BH KarakayaliH Demirhan B Haberal M Clinical findings and HER-2neu geneamplification status of breast carcinoma patients Pathol OncolRes 2006 12(4)211-5

4 Sughayer MA Al-Khawaja MM Massarweh S Al-Masri M Preva-lence of hormone receptors and HER2neu in breast cancercases in Jordan Pathol Oncol Res 2006 12(2)83-6

5 Radovic S Babic M Doric M Secic S Beslic S Balta E Kapetanovic EImmunohistochemical evaluation of the HER-2 protein inthe infiltrative lobular breast cancer Med Arh 200660(4)213-6

6 Mills ME Linkage of patient records to support continuity ofcare Issues and future directions Stud Health Technol Inform2006 122320-4

7 Saltz J Oster S Hastings S Langella S Kurc T Sanchez W Kher MManisundaram A Shanbhag K Covitz P caGrid design and imple-mentation of the core architecture of the cancer biomedicalinformatics grid Bioinformatics 2006 22(15)1910-6

8 Reich M Liefeld T Gould J Lerner J Tamayo P Mesirov JP GenePa-ttern 20 Nat Genet 2006 38(5)500-1

9 Gentleman RC Carey VJ Bates DM Bolstad B Dettling M Dudoit SEllis B Gautier L Ge Y Gentry J others Bioconductor open soft-ware development for computational biology and bioinfor-matics Genome Biol 2004 5(10)R80

Pre-publication historyThe pre-publication history for this paper can be accessedhere

httpwwwbiomedcentralcom1472-6947860prepub

Page 13 of 13(page number not for citation purposes)

  • Abstract
    • Background
    • Results
    • Conclusion
      • Background
        • Case Study Outcomes Analysis
        • Design Objectives
          • Implementation
            • Technical Requirements
            • Software architecture
            • Implementation at Duke
              • Results and discussion
                • Graphical User Interface
                • Simple Interface
                • Advanced Interface
                • Distributed Query Engine
                • Domain Services
                • Security
                • Security Authentication
                • Security Authorization
                • Delegation and Foreign CDEs
                • Automated Honest Broker Service
                • Example from the Duke Implementation
                  • Conclusion
                  • Availability and requirements
                  • Competing interests
                  • Authors contributions
                  • Acknowledgements
                  • References
                  • Pre-publication history