The cancer translational research informatics platform
BioMed Central
BMC Medical Informatics and Decision Making
Open Access
Software
Patrick McConnell1, Rajesh C Dash2, Ram Chilukuri1, Ricardo Pietrobon2, Kimberly Johnson2, Robert Annechiarico2 and A Jamie Cuticchia2,3
Address: 1SemanticBits, Herndon, Virginia, USA, 2Duke University Medical Center, Durham, North Carolina, USA and 3Duke Bioinformatics Group, Duke Comprehensive Cancer Center, DUMC Box 2717, USA
Email Patrick McConnell - patrickmcconnellsemanticbitscom Rajesh C Dash - rdashdukeedu Ram Chilukuri - ramchilukurisemanticbitscom Ricardo Pietrobon - rpietrodukeedu Kimberly Johnson - kimjohnsondukeedu Robert Annechiarico - bobannechiaricodukeedu A Jamie Cuticchia - jamiecuticchiadukeedu
Corresponding author
Abstract
Background: Despite the pressing need for applications that facilitate the aggregation of clinical and molecular data, most current applications are proprietary and lack the necessary compliance with standards that would allow for cross-institutional data exchange. In line with its mission of accelerating research discoveries and improving patient outcomes by linking networks of researchers, physicians, and patients focused on cancer research, caBIG (cancer Biomedical Informatics Grid™) has sponsored the creation of the caTRIP (Cancer Translational Research Informatics Platform) tool, with the purpose of aggregating clinical and molecular data in a repository that is user-friendly, easily accessible, and compliant with regulatory requirements of privacy and security.
Results: caTRIP has been developed as an N-tier architecture with three primary tiers: domain services, the distributed query engine, and the graphical user interface. It primarily makes use of the caGrid infrastructure to ensure compatibility with other tools currently developed by caBIG. The application interface was designed so that users can construct queries using either the Simple Interface, via drop-down menus, or the Advanced Interface, which supports more sophisticated search strategies using drag-and-drop. Furthermore, the application addresses the security concerns of authentication, authorization, and delegation, and provides an automated honest broker service for deidentifying data.
Conclusion: caTRIP is currently being deployed at Duke University and a few other centers. We expect that it will make a significant contribution to the development of translational research by facilitating data exchange and storage.
Background
In order to have an impact in society, discoveries in cancer research need to be translated into knowledge that can be directly applied to treatment and prevention. These discoveries usually start within the basic sciences, from experiments developed at the molecular level, slowly progressing to clinical research. Although this translational process is at the very basis of our ability to generate new biomedical knowledge, to date few tools have been developed to successfully link the basic and clinical science
Published: 24 December 2008
BMC Medical Informatics and Decision Making 2008, 8:60 doi:10.1186/1472-6947-8-60
Received: 7 March 2008
Accepted: 24 December 2008
This article is available from: http://www.biomedcentral.com/1472-6947/8/60
© 2008 McConnell et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
fields in a way that researchers from both arenas can easily make connections. More specifically, cancer research would benefit from the development of applications that can aggregate clinical and molecular data in a repository that is user-friendly, easily accessible, and compliant with regulatory requirements of privacy and security.
In alignment with the requirements outlined above, the Duke Comprehensive Cancer Center (DCCC), in collaboration with SemanticBits, LLC, has developed the Cancer Translational Research Informatics Platform (caTRIP, https://cabig.nci.nih.gov/tools/caTRIP; this and all caBIG links last current as of July 2008), a translational research system for the Cancer Biomedical Informatics Grid (caBIG) project [1]. caTRIP allows users to query across a number of caBIG data services, join on common data elements (CDEs), and view their results in a user-friendly interface. With outcomes analysis as its initial focus, caTRIP allows clinicians to query across data from existing patients with similar characteristics to find treatments that were administered with success. In doing so, caTRIP can help inform treatment and improve patient care, as well as enable searching for available tumor tissue, locating patients for clinical trials, and investigating the association between multiple predictors and their corresponding outcomes (such as survival). Importantly, caTRIP relies on the vast array of open source caBIG applications, including: (1) the Tumor Registry, a clinical system that is used to collect endpoint data; (2) the cancer Text Information Extraction System (caTIES, https://cabig.nci.nih.gov/tools/caties), a locator of tissue resources that works by extracting clinical information from free-text surgical pathology reports, using controlled terminologies to populate caBIG-compliant data structures; (3) caTissue CORE (https://cabig.nci.nih.gov/tools/catissuecore), a tissue bank repository tool for biospecimen inventory, tracking, and basic annotation; (4) the Clinical Annotation Engine (CAE, https://cabig.nci.nih.gov/tools/cae), a system for storing and searching pathology annotations; and (5) caIntegrator (http://gforge.nci.nih.gov/projects/caintegrator), a tool for storing, querying, and analyzing translational data, including SNP data.
The objective of this article is to describe the caTRIP system in relation to its objectives, overall software architecture, security implementation, usability, and future development plans.
Case Study: Outcomes Analysis
A motivating use case for the development of caTRIP is one of outcomes analysis, whereby a clinician can query data from a cohort of preexisting patients to help guide treatment of another patient. An oncologist can ask, "What are the treatments and outcomes of other patients that have similar characteristics to my patient?" caTRIP makes this possible on multiple levels (local, institutional, regional, national, and beyond) by leveraging grid infrastructure to scale beyond traditional query systems.
Consider an example of a patient with a breast lump approaching an oncologist for treatment at a tertiary care facility. The patient has been told by her local physician that the tumor she has is not uncommon but difficult to manage. The oncologist reviews the medical documents from the outside institution and recognizes an entity that the facility is well-equipped to handle: a lobular carcinoma of the breast. What is slightly odd is that this tumor has an intermediate nuclear grade and has been further designated as a pleomorphic lobular carcinoma. The oncologist is familiar with the medical literature on the topic and recognizes that, in certain circumstances, such an entity may exhibit biologic behavior akin to an invasive ductal tumor (rather than that of a typical lobular). Paralleling the odd histologic finding is an additional finding that the tumor is negative for estrogen receptor (ER) expression, negative for progesterone receptor (PR) expression, but strongly positive for human epidermal growth factor receptor 2 (Her-2/neu) overexpression. This profile is very uncharacteristic for a lobular carcinoma of the breast. Thinking that the tumor may have been misclassified, the oncologist has the pathology material reviewed again and re-runs the molecular studies to confirm the receptor studies reported from the outside institution. To complicate matters, the reviewing pathologist reclassifies the tumor as a typical lobular carcinoma (not pleomorphic), but the receptor study profiles (ER, PR) and Her-2/neu status remain unchanged. Now there is no diagnostic correlate for the receptor profile results. Lobular tumors (particularly typical ones) are nearly always positive for ER and PR receptor (and usually strongly so) and negative for Her-2/neu overexpression.
As an example, at Duke the Breast Pathology service will typically see one or two such cases per year at most. The expected biologic behavior and sensitivity to various treatment protocols are unknown in such cases. The body of literature on this particular scenario of findings is still maturing [2-5]. The fact of the matter is that even a large tertiary care facility will rarely come across such tumors. caTRIP provides a mechanism to examine all those patients who have been seen at any institution where caTRIP has been deployed and identify the cohort of patients that matches the specified criteria. It is a matter of a few minutes at most to construct a query that narrows the cohort to patients that have lobular tumors of the breast and are ER negative, PR negative, and Her-2/neu overexpressed (Figure 1). Furthermore, the type of treatment employed and the time to recurrence or death are easily captured as part of the result set by drawing upon the Tumor Registry as an additional data source. This permits a level of decision-making in the clinic that was heretofore impossible with prior technologies. Running such a query at Duke returns very few cases because of the relative rarity of such clinical phenotypes. However, caTRIP is grid-enabled, and if deployed at Cancer Centers across the country as part of the NCI's caBIG initiative, the statistical power could increase by multiple orders of magnitude. Rather than relying on single-institution or limited data sets published in the literature, an oncologist now has access to a rich, live data set that can provide strong, statistically significant data in mere minutes.
While the preceding scenario focuses on facilitating clinical care, caTRIP is adept at enabling research as well. Consider a follow-up situation in which a researcher would like to take that cohort of patients that have lobular carcinoma and Her-2/neu overexpression and compare them to a cohort that have ductal tumors that also overexpress Her-2/neu, as well as to a third group that have lobular tumors but do not overexpress Her-2/neu. DNA arrays can be used to test for the presence of hundreds or even thousands of gene sequences in relatively short order, if tissue samples are available. caTRIP can cross-query linked biospecimen repositories to identify samples that are available for the three cohorts of patients and retrieve samples that match specified criteria (e.g., frozen diseased tissue with matched adjacent normal tissue). With informatics solutions like caTRIP and newer molecular technologies like DNA arrays, the time and effort to develop new diagnostic and prognostic markers of disease can be significantly reduced.
Figure 1: Screenshot of the Simple Interface depicting an outcomes analysis translational query.
Design Objectives
The objectives and sample questions that caTRIP is designed to address include:

1. Allow for the search of available tumor tissue for patients with factors of interest. A typical question to be answered by caTRIP would thus be: What are all the tissue specimens from Her-2/neu positive patients that have a primary tumor in the breast and are BRCA1 positive?

2. Identify predictors contributing to increased or decreased survival among a subset of cancer patients. caTRIP should answer questions such as: What are all the ER positive patients that have survived breast cancer after radiation treatment?

3. Determine the distribution over time of a number of factors that may contribute to or reflect progression of a disease state: Does a change in pathology biomarkers over time contribute to recurrence or death?

4. Determine the association between a multitude of clinical predictors and their respective post-operative outcomes, such as: Does a change in ER or PR status before and after surgery correlate with other factors?

5. Identify patients eligible for participation in clinical trials that meet a pre-defined set of inclusion and exclusion criteria, such as: List all patients that are triple negative (ER, PR, and Her-2/neu negative).

6. Identify pathology reports of interest, such as: Show me all pathology reports for Her-2/neu positive patients with a lobular carcinoma.

In addition, it is important that the system provide security and confidentiality levels in accordance with HIPAA (Health Insurance Portability and Accountability Act) and Part 11 FDA regulations.
Implementation
Technical Requirements
Our high-level scientific use cases were in part motivated by the need to access a variety of data types stored in a number of data systems at Duke. The Medical Assistant on the World Wide Web (MAW3), data from the Genetic Modifiers of BRCA1/2 study (GEMS), and Tumor Registry data sources feed caTRIP pathology, clinical, tissue banking, single nucleotide polymorphism (SNP), and treatment/endpoint data. From this baseline, we defined the outcomes analysis use cases that feed into the system and application requirements that define the overall set of features that caTRIP implements (see Background for scientific use cases).
The system requirements primarily relate to development in the Cancer Biomedical Informatics Grid (caBIG) initiative [1]. caBIG has outlined a rigorous set of requirements related to interoperability that span the categories of legacy, bronze, silver, and gold compatibility. Silver compatibility places requirements on both the data elements that are exposed to/by caTRIP and the application programming interfaces (APIs). Data elements leveraged in caTRIP must be defined as common data elements (CDEs) registered in the Cancer Data Standards Repository (caDSR, http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr). Each CDE must have one or more semantic concepts registered in the Enterprise Vocabulary Service (EVS, http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/vocabulary). This combination of defining the structure of the data as well as the underlying semantic concepts describing the meaning of the data provides caTRIP with semantically rich information models to query upon. However, in order to query across different data systems, it is critical that the CDEs be harmonized. This allows data to be aggregated and presented to the user, and provides common CDEs, or keys, to join on. One way to catalogue these foreign CDEs is through a Master Person Index, which would provide a common identifier to link patients across systems [6]. It is possible, though not without challenges, to identify and link patients within a single institution using an institutional medical record number (MRN). However, queries that join on patients across institutional boundaries can be quite challenging.
Compatibility of the APIs in caTRIP falls more squarely into caBIG gold compatibility, which has not yet been fully defined by the caBIG program. All data that is accessible to caTRIP must be exposed via caGrid data services [7]. This means that the data is queryable via a well-defined, standards-based interface that accepts queries via the caGrid Common Query Language (CQL), which describes the query in object-oriented terms corresponding to the registered CDEs. Finally, there are also system requirements related to security, which include authenticating users based on credentials and authorizing users to access data.
Software architecture
caTRIP is developed as an N-tier architecture with three primary tiers: domain services, the distributed query engine (DQE), and the graphical user interface (Figure 2). The domain services are grid services that expose information models that encapsulate the underlying data. The distributed query engine takes as input a distributed common query language (DCQL) query, decomposes it into individual common query language (CQL) queries, issues those queries to the domain services, joins/aggregates the results, and returns the final results to the caller. The graphical user interface (GUI) provides a metadata-driven interface for building and running DCQL. It also provides an interface for saving and searching for queries.
Another important goal for caTRIP is to leverage existing technologies wherever possible. These include:

• Globus 4.1 for the underlying messaging stack
• caGrid 1.0 for the underlying service infrastructure
• Introduce for service creation
• Index Service for discovering services on the grid
• Authentication Service to authenticate users using their institutional credentials
• Dorian to obtain a grid user proxy
• Grid Grouper to authorize users to access data based on groups
• Common Security Module (CSM) for plugging authorization into the data service
• Cancer Data Standards Repository (caDSR) for storing common data elements (CDEs)
• Enterprise Vocabulary Services (EVS) for storing semantic concepts
• Hibernate for object-relational mapping
• Oracle and MySQL for backend relational databases

More information can be found in the caTRIP architecture document and UML models:

• http://gforge.nci.nih.gov/plugins/scmcvs/cvsweb.php/catrip/documentation/technical_guide/caTRIP_Technical_Guide.doc?cvsroot=catrip
• http://gforge.nci.nih.gov/plugins/scmcvs/cvsweb.php/catrip/documentation/models/caTRIP_UML.doc?cvsroot=catrip

Figure 2: High-level view of the caTRIP architecture, including Domain Services, Distributed Query Engine, and Graphical User Interface.
Implementation at Duke
We deployed caTRIP at Duke to leverage data from the Genetic Modifiers of BRCA1 and BRCA2 study in the Duke Breast Specialized Program of Research Excellence (SPORE). The goal of the GEMS investigation is to identify factors that influence the incidence of breast cancer in germline BRCA1/2 mutation carriers. To this end, we deployed tissue banking and pathology data from MAW3 into caTissue CORE, CAE, and caTIES; treatment, recurrence, and endpoint data from our Tumor Registry into a grid service; and SNP data from flat files into caIntegrator. These services were secured and then exposed to the caTRIP interface. This deployment allowed us to answer all of the domain questions that are found in the Design Objectives section of this article. Users of this pilot deployment included a set of pathologists and clinicians whose feedback guided the development. For example, they favored menu-driven interfaces, which require little knowledge of the back-end data, over drag-and-drop interfaces, which do require such knowledge.
Results and discussion
Graphical User Interface
The Graphical User Interface (GUI) is a Java Swing-based thick client that can be accessed via Java WebStart (http://java.sun.com/products/javawebstart). It provides the user with the ability to construct, execute, and share distributed queries in a graphical fashion. There are two primary ways to construct queries: the Simple Interface is a more limited yet powerful way to build queries using drop-down menus, and the Advanced Interface is a flexible way to build queries using drag-and-drop.
Simple Interface
The Simple Interface was developed to target non-technical users, such as clinicians. Queries are constructed using drop-down menus to add filters, AND/OR groups, and return attributes (Figures 1 and 3). Available services are selected via a configuration file, as are the CDEs that make up the drop-down menu items for filters and return attributes. Joins across these services are performed using common identifiers. Specifically, in caTRIP we use the (optionally) deidentified Medical Record Number (MRN) of patients to join and aggregate data. The data models of the underlying services are composed of potentially complex classes, attributes, and associations between the classes (see Figure 4). In order to construct queries that span multiple classes and associations, the Simple Interface leverages a configuration file that describes the outbound path from each class to the MRN, as well as the inbound path to each filterable/returnable CDE. This automates the logic that a user would otherwise perform by manually selecting all of the associations necessary to perform a query.
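The path lookup described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory configuration (`PATHS`) keyed by service and attribute; the role names are taken from the DCQL example later in this article, but the configuration format, the service label, and the function name `build_filter` are illustrative assumptions, not the caTRIP configuration file format.

```python
# Hypothetical path configuration (not the real caTRIP file format):
# for each (service, attribute), the chain of association role names
# that leads from the target class to the class holding the attribute.
PATHS = {
    ("caTissueCore", "tissueSite"): ["specimenCharacteristics"],
    ("caTissueCore", "medicalRecordNumber"): [
        "specimenCollectionGroup",
        "clinicalReport",
        "participantMedicalIdentifier",
    ],
}


def build_filter(service, attribute, predicate, value):
    """Wrap an attribute filter in the nested associations needed to reach it."""
    node = {"attribute": attribute, "predicate": predicate, "value": value}
    # Walk the configured path backwards so the first role ends up outermost.
    for role in reversed(PATHS[(service, attribute)]):
        node = {"association": role, "child": node}
    return node


# The user only picks "tissueSite LIKE breast"; the path is filled in for them.
tissue_filter = build_filter("caTissueCore", "tissueSite", "LIKE", "breast")
mrn_filter = build_filter("caTissueCore", "medicalRecordNumber", "EQUAL_TO", "X")
```

In this sketch the user supplies only the attribute, predicate, and value, which mirrors how the drop-down menus hide the association traversal.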
Advanced Interface
The Advanced Interface was developed to target a data technician who may wish to perform more complex queries. Queries are constructed by dragging classes onto a canvas, drawing associations between them, and adding filters along the way (Figure 4). Because the logic of how to traverse the information space is not built in, as it is with the Simple Interface, users must manually create the association paths to the classes/attributes that they would like to filter on, as well as those they would like to perform cross-service foreign joins with. This allows users to perform queries not supported by the Simple Interface, but adds an extra level of complexity.
Distributed Query Engine
The Distributed Query Engine (DQE) provides the logic to perform a query across different caGrid data services. The caTRIP team developed a distributed query engine, which was later adopted into the caGrid 1.0 toolkit as the Federated Query Processor (FQP) module. The DQE is available for use by the general caBIG community and is currently being leveraged by the Cancer Bench to Bedside (caB2B) project (http://gforge.nci.nih.gov/projects/cab2b). It takes as input a Distributed Common Query Language (DCQL) query, breaks it into separate Common Query Language (CQL) queries, issues those queries to the appropriate underlying data services, receives the results, performs joins, aggregates data, and returns the results. The DQE is built as a Java service and can be deployed as a grid service.
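The decompose, query, and join sequence described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the DQE or FQP implementation: the two "services" are in-memory lists of made-up rows, the query descriptions are plain dictionaries rather than DCQL, and the function name `run_distributed_query` is hypothetical.

```python
def run_distributed_query(services, target, foreign):
    """Run a two-service join inside-out: foreign service first, then target."""
    # Step 1: query the foreign service and collect the join-key values.
    foreign_keys = {
        row[foreign["join_on"]]
        for row in services[foreign["service"]]
        if foreign["filter"](row)
    }
    # Step 2: query the target service, keeping rows that pass the local
    # filter and whose join key was returned by the foreign query.
    return [
        row
        for row in services[target["service"]]
        if target["filter"](row) and row[target["join_on"]] in foreign_keys
    ]


# Made-up in-memory stand-ins for two grid data services (illustrative data).
SERVICES = {
    "CAE": [
        {"mrn": "A1", "estrogenReceptor": "POSITIVE"},
        {"mrn": "B2", "estrogenReceptor": "NEGATIVE"},
    ],
    "caTissueCore": [
        {"mrn": "A1", "tissueSite": "breast", "available": True, "barcode": "X-001"},
        {"mrn": "A1", "tissueSite": "breast", "available": False, "barcode": "X-002"},
        {"mrn": "B2", "tissueSite": "breast", "available": True, "barcode": "X-003"},
    ],
}

specimens = run_distributed_query(
    SERVICES,
    target={
        "service": "caTissueCore",
        "join_on": "mrn",
        "filter": lambda r: r["available"] and r["tissueSite"] == "breast",
    },
    foreign={
        "service": "CAE",
        "join_on": "mrn",
        "filter": lambda r: r["estrogenReceptor"] == "POSITIVE",
    },
)
print([s["barcode"] for s in specimens])  # ['X-001']
```

Executing the foreign query first keeps the set of join keys small before the target service is filtered, which is the inside-out order the article describes for its Duke example.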
Domain Services
The domain services themselves can be further divided into a number of tiers, including a relational database management system (RDBMS), Hibernate, Common Query Language processor (httpgforgencinihgovpluginsscmcvscvswebphpcagrid-1-0DocumentatiodocsDataServiceTechnicaldataServiceTechnicalGuidedoccvsroot=cagrid-1-0), and grid service. Hibernate provides an object-oriented abstraction of the underlying relational database, exposing tables and columns as classes and attributes in a flexible, configurable manner (http://www.hibernate.org). The CQL processor translates a CQL query, based upon common data elements and their associations, into a Hibernate Query Language (HQL) query. This is a relatively complex mapping because CQL and HQL have very different syntaxes. Finally, the grid service provides a standards-based entry point to issue a CQL query. caGrid (https://cabig.nci.nih.gov/workspaces/Architecture/caGrid) provides the underlying technology stack for the grid service, which itself is based upon Globus 4.1 (http://www.globus.org).
Security
caTRIP leverages the caGrid 1.0 infrastructure, which provides robust security for a distributed grid environment (Figure 5).
Security: Authentication
The first step in the security workflow is authentication, which is the process by which a trusted organization asserts a user is who they claim to be. This is implemented using the caGrid Authentication Service. The goal is for the user to obtain a grid user certificate to be used in authorization. The user submits their credentials (user name and password) to the AuthenticationService, which has an institutional identity provider (IdP) plug-in. This component performs the necessary steps to determine that the user's credentials are valid. In the case of the Duke IdP plug-in, the user name and password are validated against a Duke Domain Controller, which uses NT Security. This means that a user can log in with the same Cancer Center credentials they use to log in to their workstation. Once validated, the AuthenticationService returns a Security Assertion Markup Language (SAML) document (http://www.oasis-open.org/committees/download.php/14361/sstc-saml-tech-overview-2.0-draft-08.pdf) to the user, which asserts that the user is who they claim to be. The SAML Assertion can then be passed off to a Dorian Service to generate a grid user certificate. All of these interactions are handled for the user by the caTRIP application.

Figure 3: Steps to perform a query in the caTRIP Simple Interface.
Security: Authorization
The second step in the security workflow is authorization, which is the process by which a system grants the user access to specific resources. This is implemented using the Common Security Module (CSM) plug-in to the grid data service. The CSM (Phillips, 2006) is an open source toolkit that provides a flexible way to provision users and make authorization decisions. When a data field is accessed on behalf of the user, an authorization structure is queried to determine whether the user has permission to access it. If not, then a security exception is returned to the client. Authorization is made possible because trust exists between the service serving the data resource and the AuthenticationService.
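The field-level check described above can be sketched in a few lines. This is an illustration of the general pattern only and does not use or reproduce the CSM API: the `PERMISSIONS` table, the user name, the `SecurityException` class, and `read_field` are all hypothetical names introduced for the example.

```python
class SecurityException(Exception):
    """Raised when a user attempts to read a field they are not authorized for."""


# Hypothetical authorization structure: user name -> fields the user may read.
PERMISSIONS = {
    "clinician1": {"tissueSite", "barcode"},
}


def read_field(user, record, field):
    """Return record[field] only if the authorization structure permits it."""
    if field not in PERMISSIONS.get(user, set()):
        raise SecurityException(f"{user} is not authorized to access {field}")
    return record[field]


record = {"tissueSite": "breast", "barcode": "X-001", "medicalRecordNumber": "12345"}
site = read_field("clinician1", record, "tissueSite")  # permitted field
```

An unauthorized request, such as `read_field("clinician1", record, "medicalRecordNumber")`, raises `SecurityException` in this sketch, corresponding to the security exception returned to the client.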
Figure 4: Example of the Advanced Interface depicting the potential complexity of distributed queries.

Delegation and Foreign CDEs
An alternate flow in the security workflow is delegation, which is the process by which a service acts on behalf of the user. This impacts caTRIP because the distributed query engine can be run as a service, so it must query the domain services using the user's certificate. The final implementation will likely leverage the DelegationService from the Globus project. The user will signal the DelegationService that it is permissible for a collocated service to act on the user's behalf. In the meantime, the distributed query engine is collocated with the graphical user interface, so it simply passes the user's credentials to the data services identified in the distributed query. This bypasses some of the business rules defined by delegation, such as the number of delegation hops allowed by the user, but it is sufficient to meet caTRIP use cases.
Another variation in the security workflow is the use of foreign CDEs for the purpose of joining data. It is conceivable that a user does not have permission to view a CDE that is used for joining between two different domain services. The classic example of this is to join patients across services using the Medical Record Number. In this case, a special security workflow must be undertaken during authorization. A level of trust is maintained between the distributed query engine service and the domain service, such that foreign CDEs are returned to the DQE service with the understanding that the delegated user is not authorized to view them but is authorized to join on them. This workflow is only possible if the DQE is not collocated with the client application.
Figure 5: High-level overview of the caGrid security architecture (http://cagrid.org).

Automated Honest Broker Service
The primary goal of caTRIP is to provide users access to diverse datasets that cross domains and data sources. One challenge that arises from this is deidentifying datasets that are managed by different groups, which especially affects deidentifying foreign CDEs (MRNs) that are used to link across different datasets. A typical scenario for cross-dataset deidentification is for an IRB-appointed individual to collect each dataset and deidentify them all at once, tossing away the keys to the Protected Health Information (PHI). This approach is not scalable in a distributed environment because it requires a middle-man to perform the deidentification. Whenever a new dataset is added to the system, all previously existing data sets must be deidentified again. To address this in caTRIP, we developed an Automated Honest Broker Service (AHBS), which performs the tasks once performed by a human appointed by the Institutional Review Board. A data owner programmatically submits PHI (e.g., an MRN) to the AHBS using a secure connection. The AHBS generates a random key, associates it internally in a database to the PHI (which can be encrypted), and then returns the key to the data owner. The data owner can then deidentify his dataset by replacing the PHI with the random key. When more than one data owner/dataset is involved, the AHBS will always return the same random key for the same PHI. In this way, diverse distributed datasets can be deidentified using the same keys.
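The key-assignment behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the AHBS implementation: the mapping is held in an unencrypted in-memory dictionary rather than a database, there is no secure connection, and the names `HonestBroker`, `key_for`, and `deidentify` are hypothetical.

```python
import secrets


class HonestBroker:
    """Toy stand-in for the AHBS: maps each PHI value to one stable random key."""

    def __init__(self):
        # In-memory mapping for illustration only; the article describes the
        # association being stored in a database, optionally encrypted.
        self._key_by_phi = {}

    def key_for(self, phi):
        """Return the random key for this PHI, generating it on first request."""
        if phi not in self._key_by_phi:
            self._key_by_phi[phi] = secrets.token_hex(16)
        return self._key_by_phi[phi]


def deidentify(rows, broker, phi_field="mrn"):
    """Return copies of rows with the PHI field replaced by the broker's key."""
    return [{**row, phi_field: broker.key_for(row[phi_field])} for row in rows]


# Two data owners deidentify their datasets independently (made-up rows).
broker = HonestBroker()
pathology = deidentify([{"mrn": "12345", "er": "POSITIVE"}], broker)
tissue = deidentify([{"mrn": "12345", "site": "breast"}], broker)
```

Because both owners receive the same key for the same MRN, the two deidentified datasets can still be joined on the `mrn` field, and adding a third dataset later does not require the first two to be processed again.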
Example from the Duke Implementation
The following example illustrates how caTRIP would construct and execute a user query that joins disparate datasets. A user at Duke could ask the question: What are all of the available tissue specimens from the breast obtained from patients tested for positive expression of the estrogen receptor? The distributed query engine would first fetch all of the patient medical record numbers from the Clinical Annotation Engine (CAE) service that have estrogen receptor expression annotated as positive. It would then issue a query to the caTissue CORE service for tissue specimens that are marked available, were collected from the tissue site of breast, and were obtained from patients that have medical record numbers that match those returned from the previous query. The results would be a list of tissue specimens, including their barcodes, location, size, etc. The single query issued to the distributed query engine would look like this:
1  <DCQLQuery xmlns="http://caGrid.caBIG/1.0/gov.nih.nci.cagrid.dcql">
2    <TargetObject name="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl" serviceURL="http://localhost:8080/wsrf/services/cagrid/CaTissueCore">
3      <Group logicRelation="AND">
4        <Attribute name="available" predicate="EQUAL_TO" value="true"/>
5        <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCharacteristicsImpl" roleName="specimenCharacteristics">
6          <Attribute name="tissueSite" predicate="LIKE" value="breast"/>
7        </Association>
8        <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCollectionGroupImpl" roleName="specimenCollectionGroup">
9          <Association name="edu.wustl.catissuecore.domainobject.impl.ClinicalReportImpl" roleName="clinicalReport">
10           <Association name="edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl" roleName="participantMedicalIdentifier">
11             <ForeignAssociation>
12               <JoinCondition>
13                 <LeftJoin>
14                   <Object>edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl</Object>
15                   <Property>medicalRecordNumber</Property>
16                 </LeftJoin>
17                 <RightJoin>
18                   <Object>edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier</Object>
19                   <Property>medicalRecordNumber</Property>
20                 </RightJoin>
21               </JoinCondition>
22               <ForeignObject name="edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier" serviceURL="http://localhost:8080/wsrf/services/cagrid/CAE">
23                 <Association name="edu.duke.catrip.cae.domain.general.Participant" roleName="participant">
24                   <Association name="edu.pitt.cabig.cae.domain.general.AnnotationEventParameters" roleName="annotationEventParametersCollection">
25                     <Association name="edu.pitt.cabig.cae.domain.breast.BreastCancerBiomarkers" roleName="annotationSetCollection">
26                       <Attribute name="estrogenReceptor" predicate="LIKE" value="POSITIVE"/>
27                     </Association>
28                   </Association>
29                 </Association>
30               </ForeignObject>
31             </ForeignAssociation>
32           </Association>
33         </Association>
         </Association>
       </Group>
     </TargetObject>
   </DCQLQuery>
Note on line 2 that the query is rooted at caTissue CORE, where available specimens are filtered on line 4 and specimens from the tissue site of breast are filtered on line 6. Lines 11-31 define the foreign association between caTissue CORE and CAE. The medical record numbers (MRNs) of each service are joined on lines 12-21 by defining the left join criteria to be the MRN from caTissue CORE and the right join criteria to be the MRN from CAE. Line 22 selects the foreign service to be CAE, from which patients with estrogen receptor expression marked as positive are filtered on line 26. The query itself is unfolded and executed inside-out by the distributed query engine, with CAE being the first service to be queried and its results used to query caTissue CORE. Note also that the association paths of the objects are defined within the query on lines 5, 8, 9, 10, 23, 24, and 25, the logic of which is hidden from the users, who are simply selecting data elements to query upon and return. The abbreviated results would look like this:
1 <queryResults targetClassname="http://edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl">
2   <ns1:ObjectResult xmlns:ns1="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
3     <ns2:TissueSpecimen id="397" type="Fixed Tissue Block" available="true" positionDimensionOne="20" positionDimensionTwo="22" barcode="ISBN-10AB00-0013XY" comments="TISSUE IN FORMALIN 30" activityStatus="Active" quantityInGram="0.7" availableQuantityInGram="0.199999988079071" xmlns:ns2="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
4   </ns1:ObjectResult>
5   <ns2:ObjectResult xmlns:ns2="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
6     <ns3:TissueSpecimen id="406" type="Fixed Tissue Block" available="true" positionDimensionOne="19" positionDimensionTwo="14" barcode="ISBN-10AB00-0043XY" comments="STRONG FAMILY HISTORY FOR BRCA MUTATIONS" activityStatus="Active" quantityInGram="675" availableQuantityInGram="625" xmlns:ns3="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
7   </ns2:ObjectResult>
8   <ns3:ObjectResult xmlns:ns3="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
9     <ns4:TissueSpecimen id="411" type="Fixed Tissue Block" available="true" positionDimensionOne="21" positionDimensionTwo="7" barcode="ISBN-10 AB00-0063XY" comments="NO MORE WHOLE BLOOD" activityStatus="Active" quantityInGram="4275" availableQuantityInGram="4225" xmlns:ns4="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
10   </ns3:ObjectResult>
11 </queryResults>
Note that line 1 defines the type of object being returned in the query results, and that the results themselves, found on lines 3, 6, and 9, are each of the objects that match the query criteria, with data elements of interest embedded within them as attributes.
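The inside-out execution described in this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the caTRIP implementation: the in-memory records, the field names, and the functions `query_cae` and `query_catissue` are hypothetical stand-ins for the two grid services.

```python
# Toy stand-ins for the two services; all records and field names are hypothetical.
CAE_ANNOTATIONS = [
    {"mrn": "MRN-1", "estrogen_receptor": "POSITIVE"},
    {"mrn": "MRN-2", "estrogen_receptor": "NEGATIVE"},
    {"mrn": "MRN-3", "estrogen_receptor": "POSITIVE"},
]

CATISSUE_SPECIMENS = [
    {"id": 1, "mrn": "MRN-1", "tissue_site": "breast", "available": True},
    {"id": 2, "mrn": "MRN-1", "tissue_site": "breast", "available": False},
    {"id": 3, "mrn": "MRN-2", "tissue_site": "breast", "available": True},
    {"id": 4, "mrn": "MRN-3", "tissue_site": "lung", "available": True},
    {"id": 5, "mrn": "MRN-3", "tissue_site": "breast", "available": True},
]


def query_cae(estrogen_receptor):
    """Inner (foreign) query: MRNs of patients with the given ER status."""
    return {r["mrn"] for r in CAE_ANNOTATIONS
            if r["estrogen_receptor"] == estrogen_receptor}


def query_catissue(tissue_site, mrns):
    """Outer (target) query: available specimens restricted to the joined MRNs."""
    return [s for s in CATISSUE_SPECIMENS
            if s["available"] and s["tissue_site"] == tissue_site and s["mrn"] in mrns]


def run_distributed_query():
    # The foreign service is queried first; its MRNs then filter the target query.
    er_positive_mrns = query_cae("POSITIVE")
    return query_catissue("breast", er_positive_mrns)


specimen_ids = [s["id"] for s in run_distributed_query()]
print(specimen_ids)
```

Running the foreign query first keeps the join key set small before the target service is contacted, which mirrors the order the article describes (CAE, then caTissue CORE).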
Conclusion
caTRIP is being implemented in a phased development cycle, meeting certain objectives at each point to provide utility as the program is extended. It is currently in its second phase of development. Phase one concluded the pilot development activities of caTRIP, providing a baseline architecture and a proof-of-concept user interface. End-user training was provided on a limited basis for the small multidisciplinary breast oncology group, using live demonstrations and basic instructions provided through e-mail messages. Feedback was collected through e-mail as well. The multidisciplinary group was given direct access to the primary software developers and technical experts during this period. In phase two, we plan to deploy caTRIP-II along with enhancements determined through end-user interaction. Furthermore, we are in the process of engaging different groups at Duke to adopt caTRIP to help meet their research goals.
In phase one of caTRIP, use cases focused on intra-institutional problems: tearing down the silos of data-oriented systems in a single cancer center. In future iterations, we plan to extend caTRIP to tackle cross-institutional use cases. This involves extending the caTRIP interface and distributed query engine to support aggregation of data from services that expose common data elements. For example, a motivating use case amongst prospective adopters is to aggregate treatment, endpoint, pathology, and tissue banking data from a number of different cancer centers. This will allow researchers and clinicians to expand their queries and answer questions that they previously could not at their own institution. Furthermore, we plan to integrate caTRIP with analytics, specifically focusing on microarray analysis. The caBIG gene expression database, caArray, is a rich source of central or distributed gene expression data, which itself can be tied to clinical data such as pathology biomarkers. This microarray data can be analyzed and visualized dynamically using services provided by other caBIG applications, such as geWorkbench (Califano 2006, http://apiii.upmc.edu/abstracts/posterarchive/2006/watkinson.html), GenePattern [8], and Bioconductor [9]. This notion of building workflows based upon queries designed using caTRIP's Simple Interface can provide a powerful tool for both researchers and clinicians. Finally, we also plan to enhance the user interface, providing more sophisticated reporting and data mining functionality. All of these features will be driven by the end-user requirements of the adopter sites, which is a critical facet of the caBIG initiative.
Availability and requirements
Project name: Cancer Translational Research Informatics Platform (caTRIP)
Project home page: https://gforge.nci.nih.gov/projects/catrip
Operating system(s): platform independent
Programming language: Java
Other requirements: Tomcat 5.0.28, a database (MySQL 4.1.10 or Oracle 10.x), Java version 1.5 or higher, and Apache Ant version 1.6.5
License: NCI caBIG license (non-viral)
Any restrictions to use by non-academics: no
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
PM was the Lead Architect and Project Manager of caTRIP. RD provided subject matter expertise to guide the development of caTRIP. Ram Chilukuri was the lead developer of caTRIP. RP provided subject matter expertise. KJ provided subject matter expertise. RA was the project director. AJC was the project PI.
Acknowledgements
This publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH. Information on NCRR is available at http://www.ncrr.nih.gov. Information on Re-engineering the Clinical Research Enterprise can be obtained from http://nihroadmap.nih.gov/clinicalresearch/overview-translational.asp.
The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant. The authors would also like to acknowledge the contributions of the developers, database administrators, and system administrators at Duke, including Mark Peedin, Christopher Hubbard, Wilma Stanley, Farid Mohammad, and William Banks. We would also like to thank Srini Akkala and Sanjeev Agarwal from SemanticBits, LLC, as well as Bill Mason from 5 AM Solutions, for their development efforts. Additionally, we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements. They include Drs. Kelley Marcom, Gretchen Kimmick, Kimberly Blackwell, and Lee Wilke. Finally, we would like to thank our liaison to NCI, Juli Klemm, for her support and intellectual contributions.
References
1. Fenstermacher D, Street C, McSherry T, Nayak V, Overby C, Feldman M: The Cancer Biomedical Informatics Grid (caBIG™). Conf Proc IEEE Eng Med Biol Soc 2005, 1:743-6.
2. Esposito NN, Chivukula M, Dabbs DJ: The ductal phenotypic expression of the E-cadherin/catenin complex in tubulolobular carcinoma of the breast: an immunohistochemical and clinicopathologic study. Mod Pathol 2007, 20(1):130-8.
3. Sahin FI, Yilmaz Z, Yagmurdur MC, Atac FB, Ozdemir BH, Karakayali H, Demirhan B, Haberal M: Clinical findings and HER-2/neu gene amplification status of breast carcinoma patients. Pathol Oncol Res 2006, 12(4):211-5.
4. Sughayer MA, Al-Khawaja MM, Massarweh S, Al-Masri M: Prevalence of hormone receptors and HER2/neu in breast cancer cases in Jordan. Pathol Oncol Res 2006, 12(2):83-6.
5. Radovic S, Babic M, Doric M, Secic S, Beslic S, Balta E, Kapetanovic E: Immunohistochemical evaluation of the HER-2 protein in the infiltrative lobular breast cancer. Med Arh 2006, 60(4):213-6.
6. Mills ME: Linkage of patient records to support continuity of care: Issues and future directions. Stud Health Technol Inform 2006, 122:320-4.
7. Saltz J, Oster S, Hastings S, Langella S, Kurc T, Sanchez W, Kher M, Manisundaram A, Shanbhag K, Covitz P: caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics 2006, 22(15):1910-6.
8. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006, 38(5):500-1.
9. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, others: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/8/60/prepub
- Abstract
  - Background
  - Results
  - Conclusion
- Background
  - Case Study: Outcomes Analysis
  - Design Objectives
- Implementation
  - Technical Requirements
  - Software architecture
  - Implementation at Duke
- Results and discussion
  - Graphical User Interface
  - Simple Interface
  - Advanced Interface
  - Distributed Query Engine
  - Domain Services
  - Security
  - Security Authentication
  - Security Authorization
  - Delegation and Foreign CDEs
  - Automated Honest Broker Service
  - Example from the Duke Implementation
- Conclusion
- Availability and requirements
- Competing interests
- Authors' contributions
- Acknowledgements
- References
- Pre-publication history
fields in a way that researchers from both arenas can easily make connections. More specifically, cancer research would benefit from the development of applications that can aggregate clinical and molecular data in a repository that is user-friendly, easily accessible, and compliant with regulatory requirements of privacy and security.
In alignment with the requirements outlined above, the Duke Comprehensive Cancer Center (DCCC), in collaboration with SemanticBits, LLC, has developed the Cancer Translational Research Informatics Platform (caTRIP, https://cabig.nci.nih.gov/tools/caTRIP; this and all caBIG links last current as of July 2008), a translational research system for the Cancer Biomedical Informatics Grid (caBIG) project [1]. caTRIP allows users to query across a number of caBIG data services, join on common data elements (CDEs), and view their results in a user-friendly interface. With outcomes analysis as its initial focus, caTRIP allows clinicians to query across data from existing patients with similar characteristics to find treatments that were administered with success. In doing so, caTRIP can help inform treatment and improve patient care, as well as enable searching for available tumor tissue, locating patients for clinical trials, and investigating the association between multiple predictors and their corresponding outcomes (such as survival). Of importance, caTRIP relies on a broad array of open source caBIG applications, including (1) the Tumor Registry, a clinical system that is used to collect endpoint data; (2) the cancer Text Information Extraction System (caTIES, https://cabig.nci.nih.gov/tools/caties), a locator of tissue resources that works via the extraction of clinical information from free-text surgical pathology reports while using controlled terminologies to populate caBIG-compliant data structures; (3) caTissue CORE (https://cabig.nci.nih.gov/tools/catissuecore), a tissue bank repository tool for biospecimen inventory, tracking, and basic annotation; (4) the Clinical Annotation Engine (CAE, https://cabig.nci.nih.gov/tools/cae), a system for storing and searching pathology annotations; and (5) caIntegrator (http://gforge.nci.nih.gov/projects/caintegrator), a tool for storing, querying, and analyzing translational data, including SNP data.
The objective of this article is to describe the caTRIP system in relation to its objectives, overall software architecture, security implementation, usability, and future development plans.
Case Study: Outcomes Analysis
A motivating use case for the development of caTRIP is one of outcomes analysis, whereby a clinician can query data from a cohort of preexisting patients to help guide treatment of another patient. An oncologist can ask, "What are the treatments and outcomes of other patients that have similar characteristics to my patient?" caTRIP makes this possible on multiple levels (local, institutional, regional, national, and beyond) by leveraging grid infrastructure to scale beyond traditional query systems.
Consider an example of a patient with a breast lump approaching an oncologist for treatment at a tertiary care facility. The patient has been told by her local physician that the tumor she has is not uncommon but difficult to manage. The oncologist reviews the medical documents from the outside institution and recognizes an entity that the oncologist is well equipped to handle: a lobular carcinoma of the breast. What is slightly odd is that this tumor has an intermediate nuclear grade and has been further designated as a pleomorphic lobular carcinoma. The oncologist is familiar with the medical literature on the topic and recognizes that, in certain circumstances, such an entity may exhibit biologic behavior akin to an invasive ductal tumor (rather than that of a typical lobular tumor). Paralleling the odd histologic finding is an additional finding that the tumor is negative for estrogen receptor (ER) expression, negative for progesterone receptor (PR) expression, but strongly positive for human epidermal growth factor receptor 2 (Her-2/neu) overexpression. This profile is very uncharacteristic for a lobular carcinoma of the breast. Thinking that the tumor may have been misclassified, the oncologist has the pathology material reviewed again and re-runs the molecular studies to confirm the receptor studies reported from the outside institution. To complicate matters, the reviewing pathologist reclassifies the tumor as a typical lobular carcinoma (not pleomorphic), but the receptor study profiles (ER, PR) and Her-2/neu status remain unchanged. Now there is no diagnostic correlate for the receptor profile results. Lobular tumors (particularly typical ones) are nearly always positive for ER and PR (and usually strongly so) and negative for Her-2/neu overexpression.
As an example, at Duke the Breast Pathology service will typically see one or two such cases per year at most. The expected biologic behavior and sensitivity to various treatment protocols are unknown in such cases. The body of literature on this particular scenario of findings is still maturing [2-5]. Even a large tertiary care facility will rarely come across such tumors. caTRIP provides a mechanism to examine all those patients who have been seen at any institution where caTRIP has been deployed and identify the cohort of patients that matches the specified criteria. It is a matter of a few minutes at most to construct a query that narrows the cohort to patients that have lobular tumors of the breast and are ER negative, PR negative, and Her-2/neu overexpressed (Figure 1). Furthermore, the type of treatment employed and the time to recurrence or death are easily captured as part of the result set by drawing upon the Tumor Registry as an additional data source. This permits a level of decision in the clinic that was heretofore impossible with prior technologies. Running such a query at Duke returns very few cases because of the relative rarity of such clinical phenotypes. However, caTRIP is grid-enabled, and if deployed at Cancer Centers across the country as part of the NCI's caBIG initiative, the statistical power could easily increase by multiple orders of magnitude. Rather than relying on single-institution or limited data sets published in the literature, an oncologist now has access to a rich, live data set that can provide strong, statistically significant data in mere minutes.
While the preceding scenario focuses on facilitating clinical care, caTRIP is adept at enabling research as well. Consider a follow-up situation in which a researcher would like to take that cohort of patients that have lobular carcinoma and Her-2/neu overexpression and compare them to a cohort that have ductal tumors that also overexpress Her-2/neu, as well as to a third group that have lobular tumors but do not overexpress Her-2/neu. DNA arrays can be used to test for the presence of hundreds or even thousands of gene sequences in relatively short order if tissue samples are available. caTRIP can cross-query linked biospecimen repositories to identify samples that are available for the three cohorts of patients and retrieve samples that match specified criteria (e.g., frozen diseased tissue with matched adjacent normal tissue). With informatics solutions like caTRIP and newer molecular technologies like DNA arrays, the time and effort to develop new diagnostic and prognostic markers of disease can be significantly reduced.
Figure 1. Screenshot of the Simple Interface depicting an outcomes analysis translational query.
Design Objectives
The objectives and sample questions that caTRIP is designed to address include:
1. Allow for the search of available tumor tissue for patients with factors of interest. A typical question to be answered by caTRIP would thus be: "What are all the tissue specimens from Her-2/neu positive patients that have a primary tumor in the breast and are BRCA1 positive?"
2. Identify predictors contributing to increased or decreased survival among a subset of cancer patients. caTRIP should answer questions such as: "What are all the ER positive patients that have survived breast cancer after radiation treatment?"
3. Determine the distribution over time of a number of factors that may contribute to or reflect progression of a disease state: "Does a change in pathology biomarkers over time contribute to recurrence or death?"
4. Determine the association between a multitude of clinical predictors and their respective post-operative outcomes, such as: "Does a change in ER or PR status before and after surgery correlate with other factors?"
5. Identify patients eligible for participation in clinical trials that meet a pre-defined set of inclusion and exclusion criteria, such as: "List all patients that are triple negative (ER, PR, and Her-2/neu negative)."
6. Identify pathology reports of interest, such as: "Show me all pathology reports for Her-2/neu positive patients with a lobular carcinoma."
In addition, it is important that the system provide security and confidentiality levels in accordance with HIPAA (Health Insurance Portability and Accountability Act) and FDA Part 11 regulations.
Implementation
Technical Requirements
Our high-level scientific use cases were in part motivated by the need to access a variety of data types stored in a number of data systems at Duke. The Medical Assistant on the World Wide Web (MAW3), data from the Genetic Modifiers of BRCA1/2 study (GEMS), and Tumor Registry data sources feed caTRIP pathology, clinical, tissue banking, single nucleotide polymorphism (SNP), and treatment/endpoint data. From this baseline, we defined the outcomes analysis use cases that feed into the system and the application requirements that define the overall set of features that caTRIP implements (see Background for scientific use cases).
The system requirements primarily relate to development in the Cancer Biomedical Informatics Grid (caBIG) initiative [1]. caBIG has outlined a rigorous set of requirements related to interoperability that span the categories of legacy, bronze, silver, and gold compatibility. Silver compatibility places requirements on both the data elements that are exposed to/by caTRIP and the application programming interfaces (APIs). Data elements leveraged in caTRIP must be defined as common data elements (CDEs) registered in the Cancer Data Standards Repository (caDSR, http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr). Each CDE must have one or more semantic concepts registered in the Enterprise Vocabulary Service (EVS, http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/vocabulary). This combination of defining the structure of the data as well as the underlying semantic concepts describing the meaning of the data provides caTRIP with semantically rich information models to query upon. However, in order to query across different data systems, it is critical that the CDEs be harmonized. This allows data to be aggregated and presented to the user, and it provides common CDEs, or keys, to join on. One way to catalogue these foreign CDEs is through a Master Person Index, which would provide a common identifier to link patients across systems [6]. It is possible, though not without challenges, to identify and link patients within a single institution using an institutional medical record number (MRN). However, queries that join on patients across institutional boundaries can be quite challenging.
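One way to picture a Master Person Index is as a lookup from (system, local identifier) pairs to a shared person identifier. The sketch below is an illustration under stated assumptions only: the system names, the identifiers, and the `link_records` helper are hypothetical and do not describe any caBIG component.

```python
from collections import defaultdict

# Hypothetical index: (source system, local patient id) -> common person id.
MASTER_PERSON_INDEX = {
    ("tumor_registry", "TR-001"): "P1",
    ("tissue_bank", "TB-9"): "P1",
    ("tumor_registry", "TR-002"): "P2",
    ("tissue_bank", "TB-4"): "P3",
}


def link_records(records):
    """Group records from different systems under their common person id.

    Records whose local identifier is missing from the index are returned
    separately, because they cannot be joined safely.
    """
    linked = defaultdict(list)
    unlinked = []
    for record in records:
        key = (record["system"], record["local_id"])
        person_id = MASTER_PERSON_INDEX.get(key)
        if person_id is None:
            unlinked.append(record)
        else:
            linked[person_id].append(record)
    return dict(linked), unlinked


records = [
    {"system": "tumor_registry", "local_id": "TR-001", "treatment": "radiation"},
    {"system": "tissue_bank", "local_id": "TB-9", "specimen": "frozen"},
    {"system": "tissue_bank", "local_id": "TB-77", "specimen": "fixed"},
]
linked, unlinked = link_records(records)
print(sorted(linked), len(unlinked))
```

Keeping unmatched records apart, rather than guessing a match, reflects the caution the text raises about joining patients across institutional boundaries.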
Compatibility of the APIs in caTRIP falls more squarely into caBIG gold compatibility, which has not yet been fully defined by the caBIG program. All data that is accessible to caTRIP must be exposed via caGrid data services [7]. This means that the data is queryable via a well-defined, standards-based interface that accepts queries in the caGrid Common Query Language (CQL), which describes the query in object-oriented terms corresponding to the registered CDEs. Finally, there are also system requirements related to security, which include authenticating users based on credentials and authorizing users to access data.
Software architecture
caTRIP is developed as an N-tier architecture with three primary tiers: domain services, the distributed query engine (DQE), and the graphical user interface (Figure 2). The domain services are grid services that expose information models that encapsulate the underlying data. The distributed query engine takes as input a distributed common query language (DCQL) query, decomposes it into individual common query language (CQL) queries, issues those queries to the domain services, joins/aggregates the results, and returns the final results to the caller. The graphical user interface (GUI) provides a metadata-driven interface for building and running DCQL. It also provides an interface for saving and searching for queries.
Another important goal for caTRIP is to leverage existing technologies wherever possible. These include:
• Globus 4.1 for the underlying messaging stack
• caGrid 1.0 for the underlying service infrastructure
• Introduce for service creation
• Index Service for discovering services on the grid
• Authentication Service to authenticate users using their institutional credentials
• Dorian to obtain a grid user proxy
• Grid Grouper to authorize users to access data based on groups
• Common Security Module (CSM) for plugging authorization into the data service
• Cancer Data Standards Repository (caDSR) for storing common data elements (CDEs)
• Enterprise Vocabulary Services (EVS) for storing semantic concepts
• Hibernate for object-relational mapping
• Oracle and MySQL for backend relational databases
Figure 2. High-level view of the caTRIP architecture, including Domain Services, Distributed Query Engine, and Graphical User Interface.
More information can be found in the caTRIP architecture document and UML models:
• http://gforge.nci.nih.gov/plugins/scmcvs/cvsweb.php/catrip/documentation/technical_guide/caTRIP_Technical_Guide.doc?cvsroot=catrip
• http://gforge.nci.nih.gov/plugins/scmcvs/cvsweb.php/catrip/documentation/models/caTRIP_UML.doc?cvsroot=catrip
Implementation at Duke
We deployed caTRIP at Duke to leverage data from the Genetic Modifiers of BRCA1 and BRCA2 study (GEMS) in the Duke Breast Specialized Program of Research Excellence (SPORE). The goal of the GEMS investigation is to identify factors that influence the incidence of breast cancer in germline BRCA1/2 mutation carriers. To this end, we deployed tissue banking and pathology data from MAW3 into caTissue CORE, CAE, and caTIES; treatment, recurrence, and endpoint data from our Tumor Registry into a grid service; and SNP data from flat files into caIntegrator. These services were secured and then exposed to the caTRIP interface. This deployment allowed us to answer all of the domain questions that are found in the Design Objectives section of this article. Users of this pilot deployment included a set of pathologists and clinicians whose feedback guided the development, for example a preference for menu-driven interfaces, which require little knowledge of the backend data, over drag-and-drop interfaces, which do require such knowledge.
Results and discussion
Graphical User Interface
The Graphical User Interface (GUI) is a Java Swing-based thick client that can be accessed via Java WebStart (http://java.sun.com/products/javawebstart). It provides the user with the ability to construct, execute, and share distributed queries in a graphical fashion. There are two primary ways to construct queries: the Simple Interface is a more limited, yet powerful, way to build queries using drop-down menus, and the Advanced Interface is a flexible way to build queries using drag-and-drop.
Simple Interface
The Simple Interface was developed to target non-technical users, such as clinicians. Queries are constructed using drop-down menus to add filters, AND/OR groups, and return attributes (Figures 1 and 3). The available services, as well as the CDEs that make up the drop-down menu items for filters and return attributes, are selected via a configuration file. Joins across these services are performed using common identifiers. Specifically, in caTRIP we use the (optionally) deidentified Medical Record Number (MRN) of patients to join and aggregate data. The data models of the underlying services are composed of potentially complex classes, attributes, and associations between the classes (see Figure 4). In order to construct queries that span multiple classes and associations, the Simple Interface leverages a configuration file that describes the outbound path from each class to the MRN, as well as the inbound path to each filterable/returnable CDE. This automates the logic that would otherwise be performed by a user manually selecting all of the associations necessary to perform a query.
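The path-configuration idea can be sketched as follows. This is a minimal illustration assuming a hypothetical configuration format (a Python dict mapping each filterable CDE to a list of association steps); the actual caTRIP configuration file format is not described in this article. The short class and role names are borrowed from the Duke DCQL example in this article, while the `PATHS` structure and the `build_filter` helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical path configuration: CDE name -> (association steps, attribute name).
# Each step is (class name, role name), listed from the target class outward.
PATHS = {
    "tissueSite": (
        [("SpecimenCharacteristicsImpl", "specimenCharacteristics")],
        "tissueSite",
    ),
    "medicalRecordNumber": (
        [
            ("SpecimenCollectionGroupImpl", "specimenCollectionGroup"),
            ("ClinicalReportImpl", "clinicalReport"),
            ("ParticipantMedicalIdentifierImpl", "participantMedicalIdentifier"),
        ],
        "medicalRecordNumber",
    ),
}


def build_filter(parent, cde, predicate, value):
    """Append the nested Association elements for `cde` under `parent`,
    ending in an Attribute filter, so the user never picks associations by hand."""
    steps, attribute = PATHS[cde]
    node = parent
    for class_name, role_name in steps:
        node = ET.SubElement(node, "Association", name=class_name, roleName=role_name)
    ET.SubElement(node, "Attribute", name=attribute, predicate=predicate, value=value)
    return parent


group = ET.Element("Group", logicRelation="AND")
build_filter(group, "tissueSite", "LIKE", "breast")
build_filter(group, "medicalRecordNumber", "EQUAL_TO", "12345")
xml_text = ET.tostring(group, encoding="unicode")
print(xml_text)
```

The user-facing choice is only the CDE name and a value; the nesting depth (one association for `tissueSite`, three for `medicalRecordNumber`) comes entirely from the configuration.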
Advanced Interface
The Advanced Interface was developed to target a data technician who may wish to perform more complex queries. Queries are constructed by dragging classes onto a canvas, drawing associations between them, and adding filters along the way (Figure 4). Because the logic of how to traverse the information space is not built in, as it is with the Simple Interface, users must manually create the association paths to the classes/attributes that they would like to filter on, as well as those they would like to perform cross-service foreign joins with. This allows users to perform queries not supported by the Simple Interface, but adds an extra level of complexity.
Distributed Query Engine
The Distributed Query Engine (DQE) provides the logic to perform a query across different caGrid data services. The caTRIP team developed a distributed query engine, which was later adopted into the caGrid 1.0 toolkit as the Federated Query Processor (FQP) module. The DQE is now available for use by the general caBIG community and is being leveraged by the Cancer Bench to Bedside (caB2B) project (http://gforge.nci.nih.gov/projects/cab2b). It takes as input a Distributed Common Query Language (DCQL) query, breaks it into separate Common Query Language (CQL) queries, issues those queries to the appropriate underlying data services, receives the results, performs joins, aggregates data, and returns the results. The DQE is built as a Java service and can be deployed as a grid service.
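The decomposition step can be sketched with Python's standard `xml.etree.ElementTree`. The sketch assumes a simplified DCQL document shaped like the Duke example in this article (without namespaces and with shortened class names). The function `split_foreign`, the `IN` predicate, and the `$RESULTS_OF_QUERY_n` placeholder are hypothetical; they only illustrate how a foreign association can be lifted out into its own single-service query, leaving a join placeholder behind.

```python
import xml.etree.ElementTree as ET

DCQL = """
<DCQLQuery>
  <TargetObject name="TissueSpecimenImpl" serviceURL="http://localhost:8080/wsrf/services/cagrid/CaTissueCore">
    <Group logicRelation="AND">
      <Attribute name="available" predicate="EQUAL_TO" value="true"/>
      <Association name="ParticipantMedicalIdentifierImpl" roleName="participantMedicalIdentifier">
        <ForeignAssociation>
          <JoinCondition>
            <LeftJoin><Property>medicalRecordNumber</Property></LeftJoin>
            <RightJoin><Property>medicalRecordNumber</Property></RightJoin>
          </JoinCondition>
          <ForeignObject name="ParticipantMedicalIdentifier" serviceURL="http://localhost:8080/wsrf/services/cagrid/CAE">
            <Attribute name="estrogenReceptor" predicate="LIKE" value="POSITIVE"/>
          </ForeignObject>
        </ForeignAssociation>
      </Association>
    </Group>
  </TargetObject>
</DCQLQuery>
"""


def split_foreign(dcql_text):
    """Return (inner_queries, outer_root).

    Each ForeignAssociation is removed from the tree and replaced by an
    Attribute placeholder on the left join property; the foreign object
    is described as a separate query to be run first.
    """
    root = ET.fromstring(dcql_text)
    inner_queries = []
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for foreign in root.findall(".//ForeignAssociation"):
        foreign_object = foreign.find("ForeignObject")
        left_property = foreign.findtext("JoinCondition/LeftJoin/Property")
        right_property = foreign.findtext("JoinCondition/RightJoin/Property")
        inner_queries.append({
            "serviceURL": foreign_object.get("serviceURL"),
            "target": foreign_object.get("name"),
            "return_property": right_property,
        })
        parent = parents[foreign]
        parent.remove(foreign)
        # Hypothetical placeholder, to be filled with the inner query's results.
        ET.SubElement(parent, "Attribute", name=left_property, predicate="IN",
                      value="$RESULTS_OF_QUERY_%d" % (len(inner_queries) - 1))
    return inner_queries, root


inner, outer = split_foreign(DCQL)
print(inner[0]["serviceURL"])
print(ET.tostring(outer, encoding="unicode"))
```

This sketch handles one level of foreign association only; nested foreign associations would need the same step applied recursively, innermost first.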
Domain Services
The domain services themselves can be further divided into a number of tiers, including a relational database management system (RDBMS), Hibernate, a Common Query Language processor (http://gforge.nci.nih.gov/plugins/scmcvs/cvsweb.php/cagrid-1-0/Documentatio/docs/DataService/Technical/dataServiceTechnicalGuide.doc?cvsroot=cagrid-1-0), and a grid service. Hibernate (http://www.hibernate.org) provides an object-oriented abstraction of the underlying relational database, exposing tables and columns as classes and attributes in a flexible, configurable manner. The CQL processor translates a CQL query, based upon common data elements and their associations, into a Hibernate Query Language (HQL) query. This is a relatively complex mapping because CQL and HQL have very different syntaxes. Finally, the grid service provides a standards-based entry point to issue a CQL query. caGrid (https://cabig.nci.nih.gov/workspaces/Architecture/caGrid) provides the underlying technology stack for the grid service, which itself is based upon Globus 4.1 (http://www.globus.org).
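To make the CQL-to-HQL mapping more concrete, the sketch below translates a small nested query into an HQL-style string. It is a simplified illustration, not the caGrid CQL processor: the dict-based query representation, the alias scheme, and the `to_hql` function are assumptions made for this example, and only the EQUAL_TO and LIKE predicates are handled.

```python
OPERATORS = {"EQUAL_TO": "=", "LIKE": "like"}


def to_hql(query):
    """Translate a nested query dict into (hql_string, parameters)."""
    joins, conditions, params = [], [], []
    counter = [0]  # mutable counter used to generate unique join aliases

    def visit(node, alias):
        # Attribute filters become conditions on the current alias.
        for attr in node.get("attributes", []):
            conditions.append("%s.%s %s ?" % (alias, attr["name"], OPERATORS[attr["predicate"]]))
            params.append(attr["value"])
        # Each association becomes a join with a fresh alias, visited recursively.
        for assoc in node.get("associations", []):
            counter[0] += 1
            child_alias = "a%d" % counter[0]
            joins.append("join %s.%s %s" % (alias, assoc["role"], child_alias))
            visit(assoc, child_alias)

    visit(query, "t")
    hql = "select t from %s t" % query["target"]
    if joins:
        hql += " " + " ".join(joins)
    if conditions:
        hql += " where " + " and ".join(conditions)
    return hql, params


query = {
    "target": "TissueSpecimenImpl",
    "attributes": [{"name": "available", "predicate": "EQUAL_TO", "value": True}],
    "associations": [{
        "role": "specimenCharacteristics",
        "attributes": [{"name": "tissueSite", "predicate": "LIKE", "value": "breast"}],
    }],
}
hql, params = to_hql(query)
print(hql)
print(params)
```

The nested, object-oriented query flattens into joins plus a single where clause, which gives a sense of why the two syntaxes do not map onto each other line for line.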
Security
caTRIP leverages the caGrid 1.0 infrastructure, which provides robust security for a distributed grid environment (Figure 5).
Figure 3. Steps to perform a query in the caTRIP Simple Interface.
Security Authentication
The first step in the security workflow is authentication, which is the process by which a trusted organization asserts that a user is who they claim to be. This is implemented using the caGrid Authentication Service. The goal is for the user to obtain a grid user certificate to be used in authorization. The user submits their credentials (user name and password) to the AuthenticationService, which has an institutional identity provider (IdP) plug-in. This component performs the necessary steps to determine that the user's credentials are valid. In the case of the Duke IdP plug-in, the user name and password are validated against a Duke Domain Controller, which uses NT Security. This means that a user can log in with the same Cancer Center credentials they use to log in to their workstation. Once the credentials are validated, the AuthenticationService returns a Security Assertion Markup Language (SAML) document (http://www.oasis-open.org/committees/download.php/14361/sstc-saml-tech-overview-2.0-draft-08.pdf) to the user, which asserts that the user is who they claim to be. The SAML assertion can then be passed to a Dorian service to generate a grid user certificate. All of these interactions are handled for the user by the caTRIP application.
Security: Authorization
The second step in the security workflow is authorization, which is the process by which a system grants the user access to specific resources. This is implemented using the Common Security Module (CSM) plug-in to the grid data service. The CSM (Phillips, 2006) is an open source toolkit that provides a flexible way to provision users and make authorization decisions. When a data field is accessed on behalf of the user, an authorization structure is queried to determine whether the user has permission to access it. If not, then a security exception is returned to the client. Authorization is made possible because trust exists between the service serving the data resource and the AuthenticationService.
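The field-level check described above can be sketched in a few lines of Python. This is an illustrative sketch only and does not use the CSM API: the PERMISSIONS table, the SecurityException class, and the check_access and fetch helpers are hypothetical names, and a real deployment would consult a provisioned authorization store rather than a dictionary.

```python
class SecurityException(Exception):
    """Raised when a user lacks permission for a requested data field."""


# Hypothetical authorization structure: user -> set of readable fields.
PERMISSIONS = {
    "alice": {"TissueSpecimen.barcode", "TissueSpecimen.available"},
    "bob": {"TissueSpecimen.barcode"},
}


def check_access(user, fields):
    """Raise SecurityException if any requested field is not permitted."""
    allowed = PERMISSIONS.get(user, set())
    denied = sorted(set(fields) - allowed)
    if denied:
        raise SecurityException(f"{user} may not access: {', '.join(denied)}")


def fetch(user, record, fields):
    """Return the requested fields only after the authorization check passes."""
    check_access(user, fields)
    return {f: record[f] for f in fields}


record = {"TissueSpecimen.barcode": "T-1", "TissueSpecimen.available": True}
print(fetch("alice", record, ["TissueSpecimen.barcode", "TissueSpecimen.available"]))
try:
    fetch("bob", record, ["TissueSpecimen.available"])
except SecurityException as exc:
    print("denied:", exc)
```

Checking before any data is read means a denied request returns nothing but the exception, which matches the behavior described for the grid data service.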
Delegation and Foreign CDEs
An alternate flow in the security workflow is delegation, which is the process by which a service acts on behalf of the user. This impacts caTRIP because the distributed query engine can be run as a service, so it must query the domain services using the user's certificate. The final implementation will likely leverage the DelegationService from the Globus project. The user will signal the DelegationService that it is permissible for a collocated service to act on the user's behalf. In the meantime, the distributed query engine is collocated with the graphical user interface, so it simply passes the user's credentials to the data services identified in the distributed query. This bypasses some of the business rules defined by delegation, such as the number of delegation hops allowed by the user, but it is sufficient to meet caTRIP use cases.

Figure 4. Example Advanced Interface depicting the potential complexity of distributed queries.
Another variation in the security workflow is the use of foreign CDEs for the purpose of joining data. It is conceivable that a user does not have permission to view a CDE that is used for joining between two different domain services. The classic example of this is joining patients across services using the Medical Record Number. In this case, a special security workflow must be undertaken during authorization. A level of trust is maintained between the distributed query engine service and the domain service, such that foreign CDEs are returned to the DQE service with the understanding that the delegated user is not authorized to view them but is authorized to join on them. This workflow is only possible if the DQE is not collocated with the client application.
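The "join on but not view" behavior can be sketched as follows. This is a minimal illustration under stated assumptions, not the caTRIP workflow itself: the join_without_disclosure function, the row layout, and the viewable set are hypothetical. The point is that the join key is used inside the trusted engine and removed before results reach the user.

```python
def join_without_disclosure(target_rows, foreign_rows, join_key, viewable):
    """Join on join_key, then drop every attribute the user may not view."""
    # The join key is read here, inside the trusted query engine.
    foreign_values = {row[join_key] for row in foreign_rows}
    joined = [row for row in target_rows if row[join_key] in foreign_values]
    # Only attributes the user is authorized to view leave the engine.
    return [{k: v for k, v in row.items() if k in viewable} for row in joined]


specimens = [
    {"barcode": "T-1", "mrn": "A1"},
    {"barcode": "T-2", "mrn": "B2"},
]
er_positive_patients = [{"mrn": "A1"}]

visible = join_without_disclosure(
    specimens, er_positive_patients, join_key="mrn", viewable={"barcode"}
)
print(visible)  # prints [{'barcode': 'T-1'}]
```

If the engine ran inside the client application, the user's own process would hold the MRNs, which is why the source notes that this workflow requires a DQE that is not collocated with the client.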
Automated Honest Broker Service
The primary goal of caTRIP is to provide users access to diverse datasets that cross domains and data sources. One challenge that arises from this is deidentifying datasets that are managed by different groups, which especially affects deidentifying foreign CDEs (such as MRNs) that are used to link across different datasets. A typical scenario for cross-dataset deidentification is for an IRB-appointed individual to collect each dataset and deidentify them all at once, tossing away the keys to the Protected Health Information (PHI). This approach is not scalable in a distributed environment because it requires a middleman to perform the deidentification. Whenever a new dataset is added to the system, all previously existing datasets must be deidentified again. To address this in caTRIP, we developed an Automated Honest Broker Service (AHBS), which performs the tasks once performed by a human appointed by the Institutional Review Board (IRB). A data owner programmatically submits PHI (e.g., an MRN) to the AHBS using a secure connection. The AHBS generates a random key, associates it internally in a database with the PHI (which can be encrypted), and then returns the key to the data owner. The data owner can then deidentify their dataset by replacing the PHI with the random key. When more than one data owner or dataset is involved, the AHBS will always return the same random key for the same PHI. In this way, diverse distributed datasets can be deidentified using the same keys.

Figure 5. High-level overview of the caGrid security architecture (http://cagrid.org).
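The key property of the AHBS, that the same PHI always maps to the same random key, can be shown with a short Python sketch. The HonestBroker class and its deidentify method are hypothetical names chosen for illustration; the sketch keeps the mapping in an in-memory dictionary and stores the PHI unencrypted, whereas the service described above uses a database, a secure connection, and optionally encrypted PHI.

```python
import secrets


class HonestBroker:
    """Toy honest broker: maps each PHI value to one stable random key."""

    def __init__(self):
        self._key_by_phi = {}  # stand-in for the broker's internal database

    def deidentify(self, phi):
        key = self._key_by_phi.get(phi)
        if key is None:
            # The key is random, so it reveals nothing about the PHI itself.
            key = secrets.token_hex(16)
            self._key_by_phi[phi] = key
        return key


broker = HonestBroker()

# Two data owners deidentify their own datasets independently.
pathology = [{"mrn": "12345", "er_status": "POSITIVE"}]
tissue_bank = [{"mrn": "12345", "barcode": "T-1"}, {"mrn": "67890", "barcode": "T-2"}]

pathology_deid = [{"key": broker.deidentify(r["mrn"]), "er_status": r["er_status"]} for r in pathology]
tissue_deid = [{"key": broker.deidentify(r["mrn"]), "barcode": r["barcode"]} for r in tissue_bank]

# The datasets can still be linked on the shared key without exposing MRNs.
positive_keys = {r["key"] for r in pathology_deid}
linked = [r["barcode"] for r in tissue_deid if r["key"] in positive_keys]
print(linked)  # prints ['T-1']
```

Because keys are issued on demand, adding a new dataset only requires deidentifying that dataset; existing datasets keep their keys.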
Example from the Duke Implementation
The following example illustrates how caTRIP would construct and execute a user query that joins disparate datasets. A user at Duke could ask the question, "What are all of the available tissue specimens from the breast obtained from patients tested for positive expression of the estrogen receptor?" The distributed query engine would first fetch all of the patient medical record numbers from the Clinical Annotation Engine (CAE) service that have estrogen receptor expression annotated as positive. It would then issue a query to the caTissue CORE service for tissue specimens that are marked available, were collected from the tissue site of breast, and were obtained from patients that have medical record numbers that match those returned from the previous query. The results would be a list of tissue specimens, including their barcodes, location, size, etc. The single query issued to the distributed query engine would look like this:
 1 <DCQLQuery xmlns="http://caGrid.caBIG/1.0/gov.nih.nci.cagrid.dcql">
 2   <TargetObject name="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl" serviceURL="http://localhost:8080/wsrf/services/cagrid/CaTissueCore">
 3     <Group logicRelation="AND">
 4       <Attribute name="available" predicate="EQUAL_TO" value="true"/>
 5       <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCharacteristicsImpl" roleName="specimenCharacteristics">
 6         <Attribute name="tissueSite" predicate="LIKE" value="breast"/>
 7       </Association>
 8       <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCollectionGroupImpl" roleName="specimenCollectionGroup">
 9         <Association name="edu.wustl.catissuecore.domainobject.impl.ClinicalReportImpl" roleName="clinicalReport">
10           <Association name="edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl" roleName="participantMedicalIdentifier">
11             <ForeignAssociation>
12               <JoinCondition>
13                 <LeftJoin>
14                   <Object>edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl</Object>
15                   <Property>medicalRecordNumber</Property>
16                 </LeftJoin>
17                 <RightJoin>
18                   <Object>edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier</Object>
19                   <Property>medicalRecordNumber</Property>
20                 </RightJoin>
21               </JoinCondition>
22               <ForeignObject name="edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier" serviceURL="http://localhost:8080/wsrf/services/cagrid/CAE">
23                 <Association name="edu.duke.catrip.cae.domain.general.Participant" roleName="participant">
24                   <Association name="edu.pitt.cabig.cae.domain.general.AnnotationEventParameters" roleName="annotationEventParametersCollection">
25                     <Association name="edu.pitt.cabig.cae.domain.breast.BreastCancerBiomarkers" roleName="annotationSetCollection">
26                       <Attribute name="estrogenReceptor" predicate="LIKE" value="POSITIVE"/>
27                     </Association>
28                   </Association>
29                 </Association>
30               </ForeignObject>
31             </ForeignAssociation>
32           </Association>
33         </Association>
34       </Association>
35     </Group>
36   </TargetObject>
37 </DCQLQuery>
Note on line 2 that the query is rooted at caTissue CORE, where available specimens are filtered on line 4 and specimens from the tissue site of breast are filtered on line 6. Lines 11-31 define the foreign association between caTissue CORE and CAE. The medical record numbers (MRNs) of each service are joined on lines 12-21 by defining the left join criteria to be the MRN from caTissue CORE and the right join criteria to be the MRN from CAE. Line 22 selects the foreign service to be CAE, from which patients with estrogen receptor expression marked as positive are filtered on line 26. The query itself is unfolded and executed inside-out by the distributed query engine, with CAE being the first service to be queried and its results used to query caTissue CORE. Note also that the association paths of the objects are defined within the query on lines 5, 8, 9, 10, 23, 24, and 25, the logic of which is hidden from the users, who are simply selecting data elements to query upon and return.

The abbreviated results would look like this:
 1 <queryResults targetClassname="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl">
 2   <ns1:ObjectResult xmlns:ns1="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
 3     <ns2:TissueSpecimen id="397" type="Fixed Tissue Block" available="true" positionDimensionOne="20" positionDimensionTwo="22" barcode="ISBN-10AB00-0013XY" comments="TISSUE IN FORMALIN 30" activityStatus="Active" quantityInGram="0.7" availableQuantityInGram="0.199999988079071" xmlns:ns2="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
 4   </ns1:ObjectResult>
 5   <ns2:ObjectResult xmlns:ns2="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
 6     <ns3:TissueSpecimen id="406" type="Fixed Tissue Block" available="true" positionDimensionOne="19" positionDimensionTwo="14" barcode="ISBN-10AB00-0043XY" comments="STRONG FAMILY HISTORY FOR BRCA MUTATIONS" activityStatus="Active" quantityInGram="675" availableQuantityInGram="625" xmlns:ns3="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
 7   </ns2:ObjectResult>
 8   <ns3:ObjectResult xmlns:ns3="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
 9     <ns4:TissueSpecimen id="411" type="Fixed Tissue Block" available="true" positionDimensionOne="21" positionDimensionTwo="7" barcode="ISBN-10 AB00-0063XY" comments="NO MORE WHOLE BLOOD" activityStatus="Active" quantityInGram="4275" availableQuantityInGram="4225" xmlns:ns4="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
10   </ns3:ObjectResult>
11 </queryResults>
Note that line 1 defines the type of object being returned in the query results, and that the results themselves, found on lines 3, 6, and 9, are each of the objects that match the query criteria, with data elements of interest embedded within them as attributes.
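Because the returned data elements are carried as XML attributes, a client can extract them with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a shortened, hand-written sample shaped like the results above. The sample document and the specimens helper are illustrative assumptions, not output captured from a caTRIP service.

```python
import xml.etree.ElementTree as ET

# Shortened sample shaped like the query results shown above.
RESULTS = """<queryResults targetClassname="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl">
  <ns1:ObjectResult xmlns:ns1="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
    <ns2:TissueSpecimen id="397" available="true" barcode="ISBN-10AB00-0013XY"
        xmlns:ns2="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
  </ns1:ObjectResult>
  <ns2:ObjectResult xmlns:ns2="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
    <ns3:TissueSpecimen id="406" available="true" barcode="ISBN-10AB00-0043XY"
        xmlns:ns3="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
  </ns2:ObjectResult>
</queryResults>"""


def specimens(xml_text):
    """Return one dict of attributes per object found in the result set."""
    root = ET.fromstring(xml_text)
    rows = []
    for result in root:      # each ObjectResult wrapper
        for obj in result:   # the returned domain object
            # Namespace declarations are not part of obj.attrib.
            rows.append(dict(obj.attrib))
    return rows


rows = specimens(RESULTS)
print(root_class := ET.fromstring(RESULTS).get("targetClassname"))
print([(r["id"], r["barcode"]) for r in rows])
```

Iterating over children rather than matching on tag names sidesteps the per-result namespace prefixes, which differ from one ObjectResult to the next.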
Conclusion
caTRIP is being implemented in a phased development cycle, meeting certain objectives at each point to provide utility as the program is extended. It is currently in its second phase of development. Phase one concluded the pilot development activities of caTRIP, providing a baseline architecture and proof-of-concept user interface. End-user training was provided on a limited basis for the small multidisciplinary breast oncology group, using live demonstrations and basic instructions provided through e-mail messages. Feedback was collected through e-mail as well. The multidisciplinary group was given direct access to the primary software developers and technical experts during this period. In phase two, we plan to deploy caTRIP-II along with enhancements determined through end-user interaction. Furthermore, we are also in the process of engaging different groups at Duke to adopt caTRIP to help meet their research goals.

In phase one of caTRIP, use cases focused on intra-institutional problems: tearing down the silos of data-oriented systems in a single cancer center. In future iterations, we plan to extend caTRIP to tackle cross-institutional use cases. This involves extending the caTRIP interface and distributed query engine to support aggregation of data from services that expose common data elements. For example, a motivating use case amongst prospective adopters is to aggregate treatment, endpoint, pathology, and tissue banking data from a number of different cancer centers. This will allow researchers and clinicians to expand their queries and answer questions that they previously could not at their own institution. Furthermore, we plan to integrate caTRIP with analytics, specifically focusing on microarray analysis. The caBIG gene expression database, caArray, is a rich source of central or distributed gene expression data, which itself can be tied to clinical data such as pathology biomarkers. This microarray data can be analyzed and visualized dynamically using services provided by other caBIG applications, such as geWorkbench (Califano, 2006; http://apiii.upmc.edu/abstracts/posterarchive/2006/watkinson.html), GenePattern [8], and Bioconductor [9]. This notion of building workflows based upon queries designed using caTRIP's Simple Interface can provide a powerful tool for both researchers and clinicians. Finally, we also plan to enhance the user interface, providing more sophisticated reporting and data mining functionality. All of these features will be driven by end-user requirements of the adopter sites, which is a critical facet of the caBIG initiative.
Availability and requirements
Project name: Cancer Translational Research Informatics Platform (caTRIP)
Project home page: https://gforge.nci.nih.gov/projects/catrip
Operating system(s): platform independent
Programming language: Java
Other requirements: Tomcat 5.0.28, a database (MySQL 4.1.10 or Oracle 10.x), Java version 1.5 or higher, and Apache Ant version 1.6.5
License: NCI caBIG license (non-viral)
Any restrictions to use by non-academics: no
Competing interests
The authors declare that they have no competing interests.

Authors' contributions
PM was the Lead Architect and Project Manager of caTRIP. RD provided subject matter expertise to guide the development of caTRIP. Ram Chilukuri was the lead developer of caTRIP. RP provided subject matter expertise. KJ provided subject matter expertise. RA was the project director. AJC was the project PI.

Acknowledgements
This publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH. Information on NCRR is available at http://www.ncrr.nih.gov. Information on Re-engineering the Clinical Research Enterprise can be obtained from http://nihroadmap.nih.gov/clinicalresearch/overview-translational.asp.

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant. The authors would also like to acknowledge the contributions of the developers, database administrators, and system administrators at Duke, including Mark Peedin, Christopher Hubbard, Wilma Stanley, Farid Mohammad, and William Banks. We would also like to thank Srini Akkala and Sanjeev Agarwal from SemanticBits, LLC, as well as Bill Mason from 5AM Solutions, for their development efforts. Additionally, we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements. They include Drs. Kelley Marcom, Gretchen Kimmick, Kimberly Blackwell, and Lee Wilke. Finally, we would also like to thank our liaison to NCI, Juli Klemm, for her support and intellectual contributions.
References
1. Fenstermacher D, Street C, McSherry T, Nayak V, Overby C, Feldman M: The Cancer Biomedical Informatics Grid (caBIG™). Conf Proc IEEE Eng Med Biol Soc 2005, 1:743-6.
2. Esposito NN, Chivukula M, Dabbs DJ: The ductal phenotypic expression of the E-cadherin/catenin complex in tubulolobular carcinoma of the breast: an immunohistochemical and clinicopathologic study. Mod Pathol 2007, 20(1):130-8.
3. Sahin FI, Yilmaz Z, Yagmurdur MC, Atac FB, Ozdemir BH, Karakayali H, Demirhan B, Haberal M: Clinical findings and HER-2/neu gene amplification status of breast carcinoma patients. Pathol Oncol Res 2006, 12(4):211-5.
4. Sughayer MA, Al-Khawaja MM, Massarweh S, Al-Masri M: Prevalence of hormone receptors and HER2/neu in breast cancer cases in Jordan. Pathol Oncol Res 2006, 12(2):83-6.
5. Radovic S, Babic M, Doric M, Secic S, Beslic S, Balta E, Kapetanovic E: Immunohistochemical evaluation of the HER-2 protein in the infiltrative lobular breast cancer. Med Arh 2006, 60(4):213-6.
6. Mills ME: Linkage of patient records to support continuity of care: Issues and future directions. Stud Health Technol Inform 2006, 122:320-4.
7. Saltz J, Oster S, Hastings S, Langella S, Kurc T, Sanchez W, Kher M, Manisundaram A, Shanbhag K, Covitz P: caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics 2006, 22(15):1910-6.
8. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006, 38(5):500-1.
9. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, others: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.
Pre-publication history
The pre-publication history for this paper can be accessed here:
http://www.biomedcentral.com/1472-6947/8/60/prepub
Distributed Query EngineThe Distributed Query Engine (DQE) provides the logic toperform a query across different caGrid data services ThecaTRIP team developed a distributed query engine whichwas later adopted into the caGrid 10 toolkit as the Feder-ated Query Processor (FQP) module Currently the DQEis available for use by the general caBIG community andis currently being leveraged by the Cancer Bench to Bed-side (caB2B) project httpgforgencinihgovprojectscab2b It takes as input a Distributed Common QueryLanguage (DCQL) query breaks it into separate CommonQuery Language (CQL) queries issues those queries to theappropriate underlying data services receives the resultsperforms joins aggregates data and returns the resultsThe DQE is built as a Java service and can be deployed asa grid service
Domain Services
The domain services themselves can be further divided into a number of tiers, including a relational database management system (RDBMS), Hibernate, a Common Query Language processor (http://gforge.nci.nih.gov/plugins/scmcvs/cvsweb.php/cagrid-1-0/Documentation/docs/DataService/Technical/dataServiceTechnicalGuide.doc?cvsroot=cagrid-1-0), and a grid service. Hibernate provides an object-oriented abstraction of the underlying relational database, exposing tables and columns as classes and attributes in a flexible, configurable manner (http://www.hibernate.org). The CQL processor translates a CQL query, based upon common data elements and their associations, into a Hibernate Query Language (HQL) query. This is a relatively complex mapping because CQL and HQL have very different syntaxes. Finally, the grid service provides a standards-based entry point to issue a CQL query. caGrid (https://cabig.nci.nih.gov/workspaces/Architecture/caGrid) provides the underlying technology stack for the grid service, which itself is based upon Globus 4.1 (http://www.globus.org).
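To give a feel for the CQL-to-HQL mapping, the following is a toy translator, not the caGrid CQL processor. It assumes a CQL query has already been parsed into nested dictionaries (a hypothetical layout), handles only AND-ed attribute filters and nested associations, and emits HQL with positional parameters.

```python
PREDICATES = {"EQUAL_TO": "=", "LIKE": "like"}


def cql_to_hql(target):
    """Translate a nested CQL-like dictionary into an HQL string and parameters."""
    joins, conditions, params = [], [], []
    counter = [0]  # mutable counter used to generate unique join aliases

    def visit(node, alias):
        for attr in node.get("attributes", []):
            conditions.append(f"{alias}.{attr['name']} {PREDICATES[attr['predicate']]} ?")
            params.append(attr["value"])
        for assoc in node.get("associations", []):
            counter[0] += 1
            child_alias = f"a{counter[0]}"
            joins.append(f"join {alias}.{assoc['roleName']} {child_alias}")
            visit(assoc, child_alias)

    visit(target, "t")
    hql = f"select t from {target['name']} t"
    if joins:
        hql += " " + " ".join(joins)
    if conditions:
        hql += " where " + " and ".join(conditions)
    return hql, params


cql = {
    "name": "TissueSpecimenImpl",
    "attributes": [{"name": "available", "predicate": "EQUAL_TO", "value": True}],
    "associations": [{
        "roleName": "specimenCharacteristics",
        "attributes": [{"name": "tissueSite", "predicate": "LIKE", "value": "breast"}],
    }],
}
hql, params = cql_to_hql(cql)
```

Each CQL association becomes an HQL join with a generated alias, and each attribute filter becomes a condition on the alias of the class that owns it.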
Security
caTRIP leverages the caGrid 1.0 infrastructure, which provides robust security for a distributed grid environment (Figure 5).
Security: Authentication
The first step in the security workflow is authentication, which is the process by which a trusted organization asserts that a user is who they claim to be. This is implemented using the caGrid Authentication Service. The goal is for the user to obtain a grid user certificate to be used in authorization. The user submits their credentials (user name and password) to the AuthenticationService, which has an institutional identity provider (IdP) plug-in. This component performs the necessary steps to determine that the user's credentials are valid. In the case of the Duke IdP plug-in, the user name and password are validated against a Duke Domain Controller, which uses NT Security. This means that a user can log in with the same Cancer Center credentials they use to log in to their workstation. Once validated, the AuthenticationService returns a Security Assertion Markup Language (SAML) document (http://www.oasis-open.org/committees/download.php/14361/sstc-saml-tech-overview-2.0-draft-08.pdf) to the user, which asserts that the user is who they claim to be. The SAML assertion can then be passed to a Dorian service to generate a grid user certificate. All of these interactions are handled for the user by the caTRIP application.

Figure 3. Steps to perform a query in the caTRIP Simple Interface.
Security: Authorization
The second step in the security workflow is authorization, which is the process by which a system grants the user access to specific resources. This is implemented using the Common Security Module (CSM) plug-in to the grid data service. The CSM (Phillips, 2006) is an open source toolkit that provides a flexible way to provision users and make authorization decisions. When a data field is accessed on behalf of the user, an authorization structure is queried to determine whether the user has permission to access it. If not, a security exception is returned to the client. Authorization is made possible because trust exists between the service serving the data resource and the AuthenticationService.
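The field-level check described above can be sketched as follows. This is an illustration only and does not use the CSM API: the permission table, the user names, and the `SecurityException` class are hypothetical stand-ins.

```python
class SecurityException(Exception):
    """Raised when a user requests a field they are not permitted to read."""


# Hypothetical authorization structure: user -> set of readable fields.
PERMISSIONS = {
    "clinician": {"barcode", "tissueSite"},
    "technician": {"barcode", "tissueSite", "medicalRecordNumber"},
}


def read_fields(user, record, fields):
    """Return the requested fields, or raise if any of them is not permitted."""
    allowed = PERMISSIONS.get(user, set())
    for field in fields:
        if field not in allowed:
            raise SecurityException(f"{user} may not read {field}")
    return {field: record[field] for field in fields}


record = {"barcode": "AB00-0013XY", "tissueSite": "breast",
          "medicalRecordNumber": "12345"}
visible = read_fields("clinician", record, ["barcode", "tissueSite"])
```

Checking every requested field before returning anything means a denied request leaks no partial data to the client.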
Delegation and Foreign CDEs
An alternate flow in the security workflow is delegation, which is the process by which a service acts on behalf of the user. This impacts caTRIP because the distributed query engine can be run as a service, so it must query the domain services using the user's certificate. The final implementation will likely leverage the DelegationService from the Globus project. The user will signal the DelegationService that it is permissible for a collocated service to act on the user's behalf. In the meantime, the distributed query engine is collocated with the graphical user interface, so it simply passes the user's credentials to the data services identified in the distributed query. This bypasses some of the business rules defined by delegation, such as the number of delegation hops allowed by the user, but it is sufficient to meet caTRIP use cases.

Figure 4. Example of the Advanced Interface, depicting the potential complexity of distributed queries.
Another variation on the security workflow is the use of foreign CDEs for the purpose of joining data. It is conceivable that a user does not have permission to view a CDE that is used for joining between two different domain services. The classic example of this is joining patients across services using the Medical Record Number. In this case, a special security workflow must be undertaken during authorization. A level of trust is maintained between the distributed query engine service and the domain service, such that foreign CDEs are returned to the DQE service with the understanding that the delegated user is not authorized to view them but is authorized to join on them. This workflow is only possible if the DQE is not collocated with the client application.
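A minimal sketch of the join-but-do-not-view idea follows. It assumes the query engine runs as a trusted service that sees the join key internally and strips it before results leave the service; the function name and record layout are hypothetical.

```python
def join_and_redact(target_rows, foreign_rows, join_key, viewable):
    """Join on a foreign CDE, then drop every field the user may not view."""
    # The trusted service sees the join keys; the end user never does.
    keys = {row[join_key] for row in foreign_rows}
    joined = [row for row in target_rows if row[join_key] in keys]
    return [{k: v for k, v in row.items() if k in viewable} for row in joined]


tissue_rows = [
    {"id": 397, "mrn": "A1", "tissueSite": "breast"},
    {"id": 400, "mrn": "B2", "tissueSite": "breast"},
]
annotation_rows = [{"mrn": "A1", "estrogenReceptor": "POSITIVE"}]

results = join_and_redact(tissue_rows, annotation_rows,
                          join_key="mrn", viewable={"id", "tissueSite"})
```

The MRN drives the join but never appears in `results`, which is why the step has to run somewhere other than the client.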
Automated Honest Broker Service
The primary goal of caTRIP is to provide users access to diverse datasets that cross domains and data sources. One challenge that arises from this is deidentifying datasets that are managed by different groups, which especially affects deidentifying the foreign CDEs (MRNs) that are used to link across different datasets. A typical scenario for cross-dataset deidentification is for an IRB-appointed individual to collect each dataset and deidentify them all at once, tossing away the keys to the Protected Health Information (PHI). This approach is not scalable in a distributed environment because it requires a middle-man to perform the deidentification. Whenever a new dataset is added to the system, all previously existing datasets must be deidentified again. To address this in caTRIP, we developed an Automated Honest Broker Service (AHBS), which performs the tasks once performed by a human appointed by the Institutional Review Board. A data owner programmatically submits PHI (e.g., an MRN) to the AHBS using a secure connection. The AHBS generates a random key, associates it internally in a database with the PHI (which can be encrypted), and then returns the key to the data owner. The data owner can then deidentify the dataset by replacing the PHI with the random key. When more than one data owner/dataset is involved, the AHBS will always return the same random key for the same PHI. In this way, diverse distributed datasets can be deidentified using the same keys.

Figure 5. High-level overview of the caGrid security architecture (http://cagrid.org).
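The key property of the AHBS, that the same PHI always maps to the same random key, can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the AHBS implementation: the class and function names are hypothetical, and a real service would persist the mapping in a database and communicate over a secure connection.

```python
import secrets


class HonestBroker:
    """Maps each PHI value to a stable random key (in-memory stand-in)."""

    def __init__(self):
        self._key_by_phi = {}

    def deidentify(self, phi):
        # Generate a key the first time a PHI value is seen; reuse it afterwards.
        if phi not in self._key_by_phi:
            self._key_by_phi[phi] = secrets.token_hex(16)
        return self._key_by_phi[phi]


def deidentify_dataset(broker, rows, phi_field):
    """Return a copy of the dataset with the PHI field replaced by broker keys."""
    return [{**row, phi_field: broker.deidentify(row[phi_field])} for row in rows]


broker = HonestBroker()
pathology = [{"mrn": "12345", "er": "POSITIVE"}]
tissue = [{"mrn": "12345", "site": "breast"}, {"mrn": "67890", "site": "lung"}]

deid_pathology = deidentify_dataset(broker, pathology, "mrn")
deid_tissue = deidentify_dataset(broker, tissue, "mrn")
```

Because both data owners go through the same broker, the two deidentified datasets can still be joined on the replaced field, and adding a third dataset later does not require reprocessing the first two.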
Example from the Duke Implementation
The following example illustrates how caTRIP would construct and execute a user query that joins disparate datasets. A user at Duke could ask the question, "What are all of the available tissue specimens from the breast obtained from patients tested for positive expression of the estrogen receptor?" The distributed query engine would first fetch from the Clinical Annotation Engine (CAE) service the medical record numbers of all patients that have estrogen receptor expression annotated as positive. It would then issue a query to the caTissue CORE service for tissue specimens that are marked available, were collected from the tissue site of breast, and were obtained from patients whose medical record numbers match those returned by the previous query. The results would be a list of tissue specimens, including their barcodes, location, size, etc. The single query issued to the distributed query engine would look like this:
1  <DCQLQuery xmlns="http://caGrid.caBIG/1.0/gov.nih.nci.cagrid.dcql">
2    <TargetObject name="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl" serviceURL="http://localhost:8080/wsrf/services/cagrid/CaTissueCore">
3      <Group logicRelation="AND">
4        <Attribute name="available" predicate="EQUAL_TO" value="true"/>
5        <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCharacteristicsImpl" roleName="specimenCharacteristics">
6          <Attribute name="tissueSite" predicate="LIKE" value="breast"/>
7        </Association>
8        <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCollectionGroupImpl" roleName="specimenCollectionGroup">
9          <Association name="edu.wustl.catissuecore.domainobject.impl.ClinicalReportImpl" roleName="clinicalReport">
10           <Association name="edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl" roleName="participantMedicalIdentifier">
11             <ForeignAssociation>
12               <JoinCondition>
13                 <LeftJoin>
14                   <Object>edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl</Object>
15                   <Property>medicalRecordNumber</Property>
16                 </LeftJoin>
17                 <RightJoin>
18                   <Object>edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier</Object>
19                   <Property>medicalRecordNumber</Property>
20                 </RightJoin>
21               </JoinCondition>
22               <ForeignObject name="edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier" serviceURL="http://localhost:8080/wsrf/services/cagrid/CAE">
23                 <Association name="edu.duke.catrip.cae.domain.general.Participant" roleName="participant">
24                   <Association name="edu.pitt.cabig.cae.domain.general.AnnotationEventParameters" roleName="annotationEventParametersCollection">
25                     <Association name="edu.pitt.cabig.cae.domain.breast.BreastCancerBiomarkers" roleName="annotationSetCollection">
26                       <Attribute name="estrogenReceptor" predicate="LIKE" value="POSITIVE"/>
27                     </Association>
28                   </Association>
29                 </Association>
30               </ForeignObject>
31             </ForeignAssociation>
32           </Association>
33         </Association>
         </Association>
       </Group>
     </TargetObject>
   </DCQLQuery>
Note on line 1 that the query is rooted at caTissue CORE, where available specimens are filtered on line 4 and specimens from the tissue site of breast are filtered on line 6. Lines 11-31 define the foreign association between caTissue CORE and CAE. The medical record numbers (MRNs) of each service are joined on lines 12-21 by defining the left join criterion to be the MRN from caTissue CORE and the right join criterion to be the MRN from CAE. Line 22 selects the foreign service to be CAE, from which patients with estrogen receptor expression marked as positive are filtered on line 26. The query itself is unfolded and executed inside-out by the distributed query engine, CAE being the first service to be queried and its results used to query caTissue CORE. Note also that the association paths of the objects are defined within the query on lines 5, 8, 9, 10, 23, 24, and 25, the logic of which is hidden from the users, who are simply selecting data elements to query upon and return. The abbreviated results would look like this:
1  <queryResults targetClassname="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl">
2    <ns1:ObjectResult xmlns:ns1="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
3      <ns2:TissueSpecimen id="397" type="Fixed Tissue Block" available="true" positionDimensionOne="20" positionDimensionTwo="22" barcode="ISBN-10AB00-0013XY" comments="TISSUE IN FORMALIN 30" activityStatus="Active" quantityInGram="07" availableQuantityInGram="0199999988079071" xmlns:ns2="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
4    </ns1:ObjectResult>
5    <ns2:ObjectResult xmlns:ns2="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
6      <ns3:TissueSpecimen id="406" type="Fixed Tissue Block" available="true" positionDimensionOne="19" positionDimensionTwo="14" barcode="ISBN-10AB00-0043XY" comments="STRONG FAMILY HISTORY FOR BRCA MUTATIONS" activityStatus="Active" quantityInGram="675" availableQuantityInGram="625" xmlns:ns3="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
7    </ns2:ObjectResult>
8    <ns3:ObjectResult xmlns:ns3="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
9      <ns4:TissueSpecimen id="411" type="Fixed Tissue Block" available="true" positionDimensionOne="21" positionDimensionTwo="7" barcode="ISBN-10AB00-0063XY" comments="NO MORE WHOLE BLOOD" activityStatus="Active" quantityInGram="4275" availableQuantityInGram="4225" xmlns:ns4="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
10   </ns3:ObjectResult>
11 </queryResults>
Note that line 1 defines the type of object being returned in the query results, and that the results themselves, found on lines 3, 6, and 9, are each of the objects that match the query criteria, with data elements of interest embedded within them as attributes.
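Because the data elements arrive as XML attributes, a client can recover them with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened result document; the namespace URIs and class name are written with the punctuation one would expect for such identifiers, which is an assumption, and the helper name `extract_specimens` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened result document modeled on the example above (one specimen,
# three attributes). Namespace URIs are assumed reconstructions.
RESULT_XML = """
<queryResults targetClassname="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl">
  <ns1:ObjectResult xmlns:ns1="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
    <ns2:TissueSpecimen id="397" type="Fixed Tissue Block" available="true"
        xmlns:ns2="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
  </ns1:ObjectResult>
</queryResults>
"""


def extract_specimens(xml_text):
    """Return the attribute dictionary of every TissueSpecimen element."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree expands prefixes to "{namespace-uri}LocalName" tags.
    return [dict(el.attrib) for el in root.iter()
            if el.tag.endswith("}TissueSpecimen")]


specimens = extract_specimens(RESULT_XML)
```

Matching on the local name rather than the prefix matters here, since the prefixes (`ns2`, `ns3`, `ns4`) change from one result to the next while the namespace stays the same.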
Conclusion
caTRIP is being implemented in a phased development cycle, meeting certain objectives at each point to provide utility as the program is extended. It is currently in its second phase of development. Phase one concluded the pilot development activities of caTRIP, providing a baseline architecture and proof-of-concept user interface. End-user training was provided on a limited basis for the small multidisciplinary breast oncology group, using live demonstrations and basic instructions provided through e-mail messages. Feedback was collected through e-mail as well. The multidisciplinary group was given direct access to the primary software developers and technical experts during this period. In phase two, we plan to deploy caTRIP-II along with enhancements determined through end-user interaction. Furthermore, we are in the process of engaging different groups at Duke to adopt caTRIP to help meet their research goals.

In phase one of caTRIP, use cases focused on intra-institutional problems: tearing down the silos of data-oriented systems in a single cancer center. In future iterations, we plan to extend caTRIP to tackle cross-institutional use cases. This involves extending the caTRIP interface and distributed query engine to support aggregation of data from services that expose common data elements. For example, a motivating use case amongst prospective adopters is to aggregate treatment, endpoint, pathology, and tissue banking data from a number of different cancer centers. This will allow researchers and clinicians to expand their queries and answer questions that they previously could not at their own institution. Furthermore, we plan to integrate caTRIP with analytics, specifically focusing on microarray analysis. The caBIG gene expression database, caArray, is a rich source of central or distributed gene expression data, which itself can be tied to clinical data such as pathology biomarkers. This microarray data can be analyzed and visualized dynamically using services provided by other caBIG applications, such as geWorkbench (Califano, 2006; http://apiii.upmc.edu/abstracts/posterarchive/2006/watkinson.html), GenePattern [8], and Bioconductor [9]. This notion of building workflows based upon queries designed using caTRIP's Simple Interface can provide a powerful tool for both researchers and clinicians. Finally, we also plan to enhance the user interface, providing more sophisticated reporting and data mining functionality. All of these features will be driven by the end-user requirements of the adopter sites, which is a critical facet of the caBIG initiative.
Availability and requirements
Project name: Cancer Translational Research Informatics Platform (caTRIP)

Project home page: https://gforge.nci.nih.gov/projects/catrip

Operating system(s): platform independent

Programming language: Java

Other requirements: Tomcat 5.0.28, a database (MySQL 4.1.10 or Oracle 10.x), Java version 1.5 or higher, and Apache Ant version 1.6.5

License: NCI caBIG license (non-viral)

Any restrictions to use by non-academics: no
Competing interests
The authors declare that they have no competing interests.

Authors' contributions
PM was the Lead Architect and Project Manager of caTRIP. RD provided subject matter expertise to guide the development of caTRIP. Ram Chilukuri was the lead developer of caTRIP. RP provided subject matter expertise. KJ provided subject matter expertise. RA was the project director. AJC was the project PI.
Acknowledgements
This publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH. Information on NCRR is available at http://www.ncrr.nih.gov. Information on Re-engineering the Clinical Research Enterprise can be obtained from http://nihroadmap.nih.gov/clinicalresearch/overview-translational.asp.

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant. The authors would also like to acknowledge the contributions of the developers, database administrators, and system administrators at Duke, including Mark Peedin, Christopher Hubbard, Wilma Stanley, Farid Mohammad, and William Banks. We would also like to thank Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC, as well as Bill Mason from 5 AM Solutions, for their development efforts. Additionally, we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements. They include Drs Kelley Marcom, Gretchen Kimmick, Kimberly Blackwell, and Lee Wilke. Finally, we would like to thank our liaison to NCI, Juli Klemm, for her support and intellectual contributions.
References
1. Fenstermacher D, Street C, McSherry T, Nayak V, Overby C, Feldman M: The Cancer Biomedical Informatics Grid (caBIG™). Conf Proc IEEE Eng Med Biol Soc 2005, 1:743-6.
2. Esposito NN, Chivukula M, Dabbs DJ: The ductal phenotypic expression of the E-cadherin/catenin complex in tubulolobular carcinoma of the breast: an immunohistochemical and clinicopathologic study. Mod Pathol 2007, 20(1):130-8.
3. Sahin FI, Yilmaz Z, Yagmurdur MC, Atac FB, Ozdemir BH, Karakayali H, Demirhan B, Haberal M: Clinical findings and HER-2/neu gene amplification status of breast carcinoma patients. Pathol Oncol Res 2006, 12(4):211-5.
4. Sughayer MA, Al-Khawaja MM, Massarweh S, Al-Masri M: Prevalence of hormone receptors and HER2/neu in breast cancer cases in Jordan. Pathol Oncol Res 2006, 12(2):83-6.
5. Radovic S, Babic M, Doric M, Secic S, Beslic S, Balta E, Kapetanovic E: Immunohistochemical evaluation of the HER-2 protein in the infiltrative lobular breast cancer. Med Arh 2006, 60(4):213-6.
6. Mills ME: Linkage of patient records to support continuity of care: issues and future directions. Stud Health Technol Inform 2006, 122:320-4.
7. Saltz J, Oster S, Hastings S, Langella S, Kurc T, Sanchez W, Kher M, Manisundaram A, Shanbhag K, Covitz P: caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics 2006, 22(15):1910-6.
8. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006, 38(5):500-1.
9. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, others: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/8/60/prepub
Design Objectives
The objectives and sample questions that caTRIP is designed to address include:

1. Allow for the search of available tumor tissue for patients with factors of interest. A typical question to be answered by caTRIP would thus be "What are all the tissue specimens from Her-2/neu positive patients that have a primary tumor in the breast and are BRCA1 positive?"

2. Identify predictors contributing to increased or decreased survival among a subset of cancer patients. caTRIP should answer questions such as "What are all the ER positive patients that have survived breast cancer after radiation treatment?"

3. Determine the distribution over time of a number of factors that may contribute to or reflect progression of a disease state: "Does a change in pathology biomarkers over time contribute to recurrence or death?"

4. Determine the association between a multitude of clinical predictors and their respective post-operative outcomes, such as "Does a change in ER or PR status before and after surgery correlate with other factors?"

5. Identify patients eligible for participation in clinical trials that meet a pre-defined set of inclusion and exclusion criteria, such as "List all patients that are triple negative (ER, PR, and Her-2/neu negative)."

6. Identify pathology reports of interest, such as "Show me all pathology reports for Her-2/neu positive patients with a lobular carcinoma."

In addition, it is important that the system provide security and confidentiality levels in accordance with HIPAA (Health Insurance Portability and Accountability Act) and FDA Part 11 regulations.
Implementation
Technical Requirements
Our high-level scientific use cases were in part motivated by the need to access a variety of data types stored in a number of data systems at Duke. The Medical Assistant on the World Wide Web (MAW3), data from the Genetic Modifiers of BRCA1/2 study (GEMS), and Tumor Registry data sources feed caTRIP pathology, clinical, tissue banking, single nucleotide polymorphism (SNP), and treatment/endpoint data. From this baseline, we defined the outcomes analysis use cases that feed into the system and the application requirements that define the overall set of features that caTRIP implements (see Background for scientific use cases).

The system requirements primarily relate to development in the Cancer Biomedical Informatics Grid (caBIG) initiative [1]. caBIG has outlined a rigorous set of requirements related to interoperability that span the categories of legacy, bronze, silver, and gold compatibility. Silver compatibility places requirements on both the data elements that are exposed to/by caTRIP and the application programming interfaces (APIs). Data elements leveraged in caTRIP must be defined as common data elements (CDEs), registered in the Cancer Data Standards Repository (caDSR, http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr). Each CDE must have one or more semantic concepts registered in the Enterprise Vocabulary Service (EVS, http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/vocabulary). This combination of defining the structure of the data as well as the underlying semantic concepts describing the meaning of the data provides caTRIP with semantically rich information models to query upon. However, in order to query across different data systems, it is critical that the CDEs be harmonized. This allows data to be aggregated and presented to the user, and provides common CDEs, or keys, to join on. One way to catalogue these foreign CDEs is through a Master Person Index, which would provide a common identifier to link patients across systems [6]. It is possible, though not without challenges, to identify and link patients within a single institution using an institutional medical record number (MRN). However, queries that join on patients across institutional boundaries can be quite challenging.

Compatibility of the APIs in caTRIP falls more squarely into caBIG gold compatibility, which has not yet been fully defined by the caBIG program. All data that is accessible to caTRIP must be exposed via caGrid data services [7]. This means that the data is queryable via a well-defined, standards-based interface that accepts queries via the caGrid Common Query Language (CQL), which describes the query in object-oriented terms corresponding to the registered CDEs. Finally, there are also system requirements related to security, which include authenticating users based on credentials and authorizing users to access data.
Software architecture
caTRIP is developed as an N-tier architecture with three primary tiers: domain services, the distributed query engine (DQE), and the graphical user interface (Figure 2). The domain services are grid services that expose information models that encapsulate the underlying data. The distributed query engine takes as input a distributed common query language (DCQL) query, decomposes it into individual common query language (CQL) queries, issues those queries to the domain services, joins/aggregates the results, and returns the final results to the caller. The graphical user interface (GUI) provides a metadata-driven interface for building and running DCQL. It also provides an interface for saving and searching for queries.
Another important goal for caTRIP is to leverage existing technologies wherever possible. These include:

• Globus 4.1 for the underlying messaging stack

• caGrid 1.0 for the underlying service infrastructure

• Introduce for service creation

• Index Service for discovering services on the grid

• Authentication Service to authenticate users using their institutional credentials

• Dorian to obtain a grid user proxy

• Grid Grouper to authorize users to access data based on groups

• Common Security Module (CSM) for plugging authorization into the data service

• Cancer Data Standards Repository (caDSR) for storing common data elements (CDEs)

• Enterprise Vocabulary Services (EVS) for storing semantic concepts

• Hibernate for object-relational mapping

• Oracle and MySQL for backend relational databases

Figure 2. High-level view of the caTRIP architecture, including Domain Services, Distributed Query Engine, and Graphical User Interface.

More information can be found in the caTRIP architecture document and UML models:

• http://gforge.nci.nih.gov/plugins/scmcvs/cvsweb.php/catrip/documentation/technical_guide/caTRIP_Technical_Guide.doc?cvsroot=catrip

• http://gforge.nci.nih.gov/plugins/scmcvs/cvsweb.php/catrip/documentation/models/caTRIP_UML.doc?cvsroot=catrip
Implementation at Duke
We deployed caTRIP at Duke to leverage data from the Genetic Modifiers of BRCA1 and BRCA2 study (GEMS) in the Duke Breast Specialized Program of Research Excellence (SPORE). The goal of the GEMS investigation is to identify factors that influence the incidence of breast cancer in germline BRCA1/2 mutation carriers. To this end, we deployed tissue banking and pathology data from MAW3 into caTissue CORE, CAE, and caTIES; treatment, recurrence, and endpoint data from our Tumor Registry into a grid service; and SNP data from flat files into caIntegrator. These services were secured and then exposed to the caTRIP interface. This deployment allowed us to answer all of the domain questions found in the Design Objectives section of this article. Users of this pilot deployment included a set of pathologists and clinicians whose feedback guided the development, for example by favoring menu-driven interfaces that require little knowledge of the backend data over drag-and-drop interfaces that do require such knowledge.
Results and discussion
Graphical User Interface
The Graphical User Interface (GUI) is a Java Swing-based thick client that can be accessed via Java WebStart (http://java.sun.com/products/javawebstart). It provides the user with the ability to construct, execute, and share distributed queries in a graphical fashion. There are two primary ways to construct queries: the Simple Interface is a more limited, yet powerful, way to build queries using drop-down menus, and the Advanced Interface is a flexible way to build queries using drag-and-drop.
Simple InterfaceThe Simple Interface was developed to target the non-technical users such as a clinician Queries are constructed
using drop-down menus to add filters ANDOR groupsand return attributes (Figures 1 and 3) Available servicesare selected via a configuration file as well as the CDEsthat make up the drop-down menu items for filters andreturn attributes Joins across these services are performedusing common identifiers Specifically in caTRIP we usethe (optionally) deidentified Medical Record Number(MRN) of patients to join and aggregate data The datamodels of the underlying services are composed of poten-tially complex classes attributes and associationsbetween the classes (see Figure 4) In order to constructqueries that span multiple classes and associations theSimple Interface leverages a configuration file thatdescribes the outbound path from each class to the MRNas well as the inbound path to each filterablereturnableCDE This enables the automated logic that would nor-mally be performed by a user manually selecting all of theassociations necessary to perform a query
Advanced InterfaceThe Advanced Interface was developed to target a datatechnician who may wish to perform more complex que-ries Queries are constructed by dragging classes onto acanvas drawing associations between them and addingfilters along the way (Figure 4) Because the logic of howto traverse the information space is not built-in like it iswith the Simple Interface users must manually createthe association paths to the classesattributes that theywould like to filter on as well as those they would like toperform cross-service foreign joins with This allows usersto perform queries not supported by the Simple Inter-face but adds an extra level of complexity
Distributed Query EngineThe Distributed Query Engine (DQE) provides the logic toperform a query across different caGrid data services ThecaTRIP team developed a distributed query engine whichwas later adopted into the caGrid 10 toolkit as the Feder-ated Query Processor (FQP) module Currently the DQEis available for use by the general caBIG community andis currently being leveraged by the Cancer Bench to Bed-side (caB2B) project httpgforgencinihgovprojectscab2b It takes as input a Distributed Common QueryLanguage (DCQL) query breaks it into separate CommonQuery Language (CQL) queries issues those queries to theappropriate underlying data services receives the resultsperforms joins aggregates data and returns the resultsThe DQE is built as a Java service and can be deployed asa grid service
Domain Services
The domain services themselves can be further divided into a number of tiers, including a relational database management system (RDBMS), Hibernate, a Common Query Language processor http://gforge.nci.nih.gov/plugins/scmcvs/cvsweb.php/cagrid-1-0/Documentatio/docs/DataService/Technical/dataServiceTechnicalGuide.doc?cvsroot=cagrid-1-0, and a grid service. Hibernate provides an object-oriented abstraction of the underlying relational database, exposing tables and columns as classes and attributes in a flexible, configurable manner http://www.hibernate.org. The CQL processor translates a CQL query, based upon common data elements and their associations, into a Hibernate Query Language (HQL) query. This is a relatively complex mapping because CQL and HQL have very different syntaxes. Finally, the grid service provides a standards-based entry point to issue a CQL query. caGrid https://cabig.nci.nih.gov/workspaces/Architecture/caGrid provides the underlying technology stack for the grid service, which itself is based upon Globus 4.1 http://www.globus.org.
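To give a feel for the CQL-to-HQL translation step, here is a deliberately small Python sketch. It is an assumption-laden toy, not the caGrid CQL processor: it handles only a flat list of attribute filters combined with AND, ignores associations entirely, and the function name and tuple layout are hypothetical.

```python
# Mapping from the two CQL predicates used in this article to HQL operators.
HQL_OPERATORS = {"EQUAL_TO": "=", "LIKE": "like"}


def cql_to_hql(target_class, attributes):
    """Build an HQL string with named parameters from flat attribute filters.

    `attributes` is a list of (name, predicate, value) tuples.
    """
    clauses = []
    params = {}
    for index, (name, predicate, value) in enumerate(attributes):
        param = f"p{index}"
        clauses.append(f"t.{name} {HQL_OPERATORS[predicate]} :{param}")
        params[param] = value
    hql = f"from {target_class} t"
    if clauses:
        hql += " where " + " and ".join(clauses)
    return hql, params


hql, params = cql_to_hql("TissueSpecimenImpl", [("available", "EQUAL_TO", True)])
```

Nested associations are where the real mapping becomes complex, since each level of CQL nesting has to become a join or path expression in HQL; the sketch leaves that part out.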
Security
caTRIP leverages the caGrid 1.0 infrastructure, which provides robust security for a distributed grid environment (Figure 5).
Figure 3: Steps to perform a query in the caTRIP Simple Interface.

Security: Authentication
The first step in the security workflow is authentication, which is the process by which a trusted organization asserts a user is who they claim to be. This is implemented using the caGrid Authentication Service. The goal is for the user to obtain a grid user certificate to be used in authorization. The user submits their credentials (user name and password) to the AuthenticationService, which has an institutional identity provider (IdP) plug-in. This component performs the necessary steps to determine that the user's credentials are valid. In the case of the Duke IdP plug-in, the user name and password are validated against a Duke Domain Controller, which uses NT Security. This means that a user can log in with the same Cancer Center credentials they use to log in to their workstation. Once validated, the AuthenticationService returns a Security Assertion Markup Language (SAML) document http://www.oasis-open.org/committees/download.php/14361/sstc-saml-tech-overview-2.0-draft-08.pdf to the user, which asserts that the user is who they claim to be. The SAML Assertion can then be passed off to a Dorian Service to generate a grid user certificate. All of these interactions are handled for the user by the caTRIP application.
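The three-step chain described above (credentials, then assertion, then grid certificate) can be sketched as follows. Everything in this Python sketch is a hypothetical stand-in: it does not use the caGrid, Dorian, or SAML APIs, and the plain-text password table exists only to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """Stand-in for a SAML assertion: who the user is and who vouches for it."""
    subject: str
    issuer: str


@dataclass(frozen=True)
class GridCertificate:
    """Stand-in for the grid user certificate issued from an assertion."""
    subject: str


class AuthenticationError(Exception):
    pass


def authenticate(username, password, idp_accounts, issuer="example-idp"):
    """Step 1: the identity provider validates the institutional credentials."""
    if username not in idp_accounts or idp_accounts[username] != password:
        raise AuthenticationError("invalid credentials")
    return Assertion(subject=username, issuer=issuer)


def issue_certificate(assertion, trusted_issuers):
    """Step 2: a certificate is issued only for assertions from trusted issuers."""
    if assertion.issuer not in trusted_issuers:
        raise AuthenticationError("untrusted identity provider")
    return GridCertificate(subject=assertion.subject)


accounts = {"jdoe": "s3cret"}  # invented account for illustration
certificate = issue_certificate(
    authenticate("jdoe", "s3cret", accounts), trusted_issuers={"example-idp"}
)
```

The sketch mirrors the trust relationship in the text: the certificate issuer never sees the password, only an assertion from an identity provider it trusts.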
Security: Authorization
The second step in the security workflow is authorization, which is the process by which a system grants the user access to specific resources. This is implemented using the Common Security Module (CSM) plug-in to the grid data service. The CSM [Phillips 2006] is an open source toolkit that provides a flexible way to provision users and make authorization decisions. When a data field is accessed on behalf of the user, an authorization structure is queried to determine whether the user has permission to access it. If not, then a security exception is returned to the client. Authorization is made possible because trust exists between the service serving the data resource and the AuthenticationService.
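The field-level check described above can be sketched as follows. This is a minimal Python illustration, not the CSM API: the permission table, the exception class, and the function name are all hypothetical.

```python
class SecurityException(Exception):
    """Raised when a user is not permitted to read a data field."""


def read_field(user, record, field, readable_fields):
    """Return record[field] only if `user` has been granted access to it."""
    if field not in readable_fields.get(user, set()):
        raise SecurityException(f"{user} may not read {field}")
    return record[field]


# Invented authorization structure: user -> fields they may read.
readable = {"clinician": {"barcode", "tissueSite"}}
specimen = {"barcode": "X-1", "tissueSite": "breast", "medicalRecordNumber": "123"}

barcode = read_field("clinician", specimen, "barcode", readable)
```

A request for "medicalRecordNumber" by the same user would raise SecurityException rather than return the value.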
Figure 4: Example of the Advanced Interface depicting the potential complexity of distributed queries.

Delegation and Foreign CDEs
An alternate flow in the security workflow is delegation, which is the process by which a service acts on behalf of the user. This impacts caTRIP because the distributed query engine can be run as a service, so it must query the domain services using the user's certificate. The final implementation will likely leverage the DelegationService from the Globus project. The user will signal the DelegationService that it is permissible for a collocated service to act on the user's behalf. In the meantime, the distributed query engine is collocated with the graphical user interface, so it simply passes the user's credentials to the data services identified in the distributed query. This bypasses some of the business rules defined by delegation, such as the number of delegation hops allowed by the user, but it is sufficient to meet caTRIP use cases.
Another change to the security workflow is the use of foreign CDEs for the purpose of joining data. It is conceivable that a user does not have permission to view a CDE that is used for joining between two different domain services. The classic example of this is joining patients across services using the Medical Record Number. In this case, a special security workflow must be undertaken during authorization. A level of trust is maintained between the distributed query engine service and the domain service, such that foreign CDEs are returned to the DQE service with the understanding that the delegated user is not authorized to view them but is authorized to join on them. This workflow is only possible if the DQE is not collocated with the client application.
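The distinction between being authorized to join on a CDE and being authorized to view it can be sketched as follows. This Python sketch is illustrative only: the function name, field names, and data are hypothetical, and it assumes the join runs inside a trusted query engine that is separate from the client.

```python
def join_then_redact(target_rows, foreign_rows, join_field, viewable_fields):
    """Join on `join_field` inside the engine, then strip non-viewable fields."""
    foreign_keys = {row[join_field] for row in foreign_rows}
    joined = [row for row in target_rows if row[join_field] in foreign_keys]
    return [
        {name: value for name, value in row.items() if name in viewable_fields}
        for row in joined
    ]


# Invented data: the user may see barcodes but not MRNs.
specimens = [
    {"barcode": "X-1", "mrn": "123"},
    {"barcode": "X-2", "mrn": "456"},
]
patients = [{"mrn": "123"}]

visible = join_then_redact(specimens, patients, "mrn", viewable_fields={"barcode"})
```

The MRN is used to decide which rows match, but it never appears in what is handed back to the client, which is why the text requires the engine not to be collocated with the client application.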
Figure 5: High-level overview of the caGrid security architecture http://cagrid.org.

Automated Honest Broker Service
The primary goal of caTRIP is to provide users access to diverse datasets that cross domains and data sources. One challenge that arises from this is deidentifying datasets that are managed by different groups, which especially affects deidentifying foreign CDEs (MRNs) that are used to link across different datasets. A typical scenario for cross-dataset deidentification is for an IRB-appointed individual to collect each dataset and deidentify them all at once, tossing away the keys to the Protected Health Information (PHI). This approach is not scalable in a distributed environment because it requires a middleman to perform the deidentification. Whenever a new dataset is added to the system, all previously existing datasets must be deidentified again. To address this in caTRIP, we developed an Automated Honest Broker Service (AHBS), which performs the tasks once performed by a human appointed by the Institutional Review Board. A data owner programmatically submits PHI (e.g., an MRN) to the AHBS using a secure connection. The AHBS generates a random key, associates it internally in a database with the PHI (which can be encrypted), and then returns the key to the data owner. The data owner can then deidentify their dataset by replacing the PHI with the random key. When more than one data owner/dataset is involved, the AHBS will always return the same random key for the same PHI. In this way, diverse distributed datasets can be deidentified using the same keys.
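The key property of the AHBS, that the same PHI always maps to the same random key, can be sketched as follows. This is a minimal in-memory Python illustration, not the AHBS itself: the class and function names are hypothetical, the mapping is held in a plain dictionary rather than an (optionally encrypted) database, and no secure transport is modeled.

```python
import secrets


class HonestBroker:
    """Hands out one stable random key per distinct PHI value."""

    def __init__(self):
        self._key_by_phi = {}

    def key_for(self, phi):
        # Generate a key on first sight; reuse it on every later request.
        if phi not in self._key_by_phi:
            self._key_by_phi[phi] = secrets.token_hex(16)
        return self._key_by_phi[phi]


def deidentify(rows, broker, field="mrn"):
    """Return copies of `rows` with the PHI field replaced by the broker's key."""
    return [{**row, field: broker.key_for(row[field])} for row in rows]


broker = HonestBroker()
# Two data owners deidentify independently, at different times.
pathology = deidentify([{"mrn": "MRN-001", "er": "POSITIVE"}], broker)
tissue_bank = deidentify([{"mrn": "MRN-001", "site": "breast"}], broker)
```

Because both owners receive the same key for "MRN-001", the two deidentified datasets can still be joined, and adding a third dataset later does not require reprocessing the first two.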
Example from the Duke Implementation
The following example illustrates how caTRIP would construct and execute a user query that joins disparate datasets. A user at Duke could ask the question: "What are all of the available tissue specimens from the breast obtained from patients tested for positive expression of the estrogen receptor?" The distributed query engine would first fetch all of the patient medical record numbers from the Clinical Annotation Engine (CAE) service that have estrogen receptor expression annotated as positive. It would then issue a query to the caTissue CORE service for tissue specimens that are marked available, were collected from the tissue site of breast, and were obtained from patients whose medical record numbers match those returned from the previous query. The results would be a list of tissue specimens, including their barcodes, location, size, etc. The single query issued to the distributed query engine would look like this:
1 <DCQLQuery xmlns="http://caGrid.caBIG/1.0/gov.nih.nci.cagrid.dcql">
2   <TargetObject name="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl" serviceURL="http://localhost:8080/wsrf/services/cagrid/CaTissueCore">
3     <Group logicRelation="AND">
4       <Attribute name="available" predicate="EQUAL_TO" value="true"/>
5       <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCharacteristicsImpl" roleName="specimenCharacteristics">
6         <Attribute name="tissueSite" predicate="LIKE" value="breast"/>
7       </Association>
8       <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCollectionGroupImpl" roleName="specimenCollectionGroup">
9         <Association name="edu.wustl.catissuecore.domainobject.impl.ClinicalReportImpl" roleName="clinicalReport">
10          <Association name="edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl" roleName="participantMedicalIdentifier">
11            <ForeignAssociation>
12              <JoinCondition>
13                <LeftJoin>
14                  <Object>edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl</Object>
15                  <Property>medicalRecordNumber</Property>
16                </LeftJoin>
17                <RightJoin>
18                  <Object>edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier</Object>
19                  <Property>medicalRecordNumber</Property>
20                </RightJoin>
21              </JoinCondition>
22              <ForeignObject name="edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier" serviceURL="http://localhost:8080/wsrf/services/cagrid/CAE">
23                <Association name="edu.duke.catrip.cae.domain.general.Participant" roleName="participant">
24                  <Association name="edu.pitt.cabig.cae.domain.general.AnnotationEventParameters" roleName="annotationEventParametersCollection">
25                    <Association name="edu.pitt.cabig.cae.domain.breast.BreastCancerBiomarkers" roleName="annotationSetCollection">
26                      <Attribute name="estrogenReceptor" predicate="LIKE" value="POSITIVE"/>
27                    </Association>
28                  </Association>
29                </Association>
30              </ForeignObject>
31            </ForeignAssociation>
32          </Association>
33        </Association>
34      </Association>
35    </Group>
36  </TargetObject>
37 </DCQLQuery>
Note on line 2 that the query is rooted at caTissue CORE, where available specimens are filtered on line 4 and specimens from the tissue site of breast are filtered on line 6. Lines 11–31 define the foreign association between caTissue CORE and CAE. The medical record numbers (MRNs) of each service are joined on lines 12–21 by defining the left join criterion to be the MRN from caTissue CORE and the right join criterion to be the MRN from CAE. Line 22 selects the foreign service to be CAE, from which patients with estrogen receptor expression marked as positive are filtered on line 26. The query itself is unfolded and executed inside-out by the distributed query engine, CAE being the first service to be queried and its results used to query caTissue CORE. Note also that the association paths of the objects are defined within the query on lines 5, 8, 9, 10, 23, 24, and 25, the logic of which is hidden from the users, who simply select data elements to query upon and return. The abbreviated results would look like this:
1 <queryResults targetClassname="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl">
2   <ns1:ObjectResult xmlns:ns1="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
3     <ns2:TissueSpecimen id="397" type="Fixed Tissue Block" available="true" positionDimensionOne="20" positionDimensionTwo="22" barcode="ISBN-10AB00-0013XY" comments="TISSUE IN FORMALIN 30" activityStatus="Active" quantityInGram="0.7" availableQuantityInGram="0.199999988079071" xmlns:ns2="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
4   </ns1:ObjectResult>
5   <ns2:ObjectResult xmlns:ns2="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
6     <ns3:TissueSpecimen id="406" type="Fixed Tissue Block" available="true" positionDimensionOne="19" positionDimensionTwo="14" barcode="ISBN-10AB00-0043XY" comments="STRONG FAMILY HISTORY FOR BRCA MUTATIONS" activityStatus="Active" quantityInGram="675" availableQuantityInGram="625" xmlns:ns3="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
7   </ns2:ObjectResult>
8   <ns3:ObjectResult xmlns:ns3="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
9     <ns4:TissueSpecimen id="411" type="Fixed Tissue Block" available="true" positionDimensionOne="21" positionDimensionTwo="7" barcode="ISBN-10AB00-0063XY" comments="NO MORE WHOLE BLOOD" activityStatus="Active" quantityInGram="4275" availableQuantityInGram="4225" xmlns:ns4="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
10  </ns3:ObjectResult>
11 </queryResults>
Note that line 1 defines the type of object being returned in the query results, and that the results themselves, found on lines 3, 6, and 9, are each of the objects that match the query criteria, with the data elements of interest embedded within them as attributes.
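Because the data elements arrive as XML attributes, a client can extract them with a standard XML parser. The following Python sketch uses the standard-library `xml.etree.ElementTree` module; the sample document is a shortened, hand-written stand-in for the listing above (the namespace URIs are our reading of that listing), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the result set above, reduced to three attributes.
SAMPLE = """\
<queryResults targetClassname="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl">
  <ns1:ObjectResult xmlns:ns1="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
    <ns2:TissueSpecimen id="397" type="Fixed Tissue Block" available="true"
        xmlns:ns2="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
  </ns1:ObjectResult>
  <ns2:ObjectResult xmlns:ns2="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
    <ns3:TissueSpecimen id="406" type="Fixed Tissue Block" available="true"
        xmlns:ns3="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
  </ns2:ObjectResult>
</queryResults>
"""


def specimen_attributes(xml_text):
    """Return one dict of attributes per object embedded in the result set."""
    root = ET.fromstring(xml_text)
    rows = []
    for object_result in root:      # each ObjectResult wrapper
        for obj in object_result:   # the domain object inside it
            rows.append(dict(obj.attrib))
    return rows


rows = specimen_attributes(SAMPLE)
```

Iterating by position rather than by tag name sidesteps the fact that the namespace prefixes (ns1, ns2, ...) change from one result to the next.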
Conclusion
caTRIP is being implemented in a phased development cycle, meeting certain objectives at each point to provide utility as the program is extended. It is currently in its second phase of development. Phase one concluded the pilot development activities of caTRIP, providing a baseline architecture and proof-of-concept user interface. End-user training was provided on a limited basis for the small multidisciplinary breast oncology group using live demonstrations and basic instructions provided through e-mail messages. Feedback was collected through e-mail as well. The multidisciplinary group was given direct access to the primary software developers and technical experts during this period. In phase two, we plan to deploy caTRIP-II along with enhancements determined through end-user interaction. Furthermore, we are also in the process of engaging different groups at Duke to adopt caTRIP to help meet their research goals.
In phase one of caTRIP, use cases focused on intra-institutional problems: tearing down the silos of data-oriented systems in a single cancer center. In future iterations, we plan to extend caTRIP to tackle cross-institutional use cases. This involves extending the caTRIP interface and distributed query engine to support aggregation of data from services that expose common data elements. For example, a motivating use case amongst prospective adopters is to aggregate treatment, endpoint, pathology, and tissue banking data from a number of different cancer centers. This will allow researchers and clinicians to expand their queries and answer questions that they previously could not at their own institution. Furthermore, we plan to integrate caTRIP with analytics, specifically focusing on microarray analysis. The caBIG gene expression database, caArray, is a rich source of central or distributed gene expression data, which itself can be tied to clinical data such as pathology biomarkers. This microarray data can be analyzed and visualized dynamically using services provided by other caBIG applications, such as geWorkbench [Califano 2006] http://apiii.upmc.edu/abstracts/posterarchive/2006/watkinson.html, GenePattern [8], and Bioconductor [9]. This notion of building workflows based upon queries designed using caTRIP's Simple Interface can provide a powerful tool for both researchers and clinicians. Finally, we also plan to enhance the user interface, providing more sophisticated reporting and data mining functionality. All of these features will be driven by end-user requirements of the adopter sites, which is a critical facet of the caBIG initiative.
Availability and requirements
Project name: Cancer Translational Research Informatics Platform (caTRIP)
Project home page: https://gforge.nci.nih.gov/projects/catrip
Operating system(s): platform independent
Programming language: Java
Other requirements: Tomcat 5.0.28, a database (MySQL 4.1.10 or Oracle 10.x), Java version 1.5 or higher, and Apache Ant version 1.6.5
License: NCI caBIG license (non-viral)
Any restrictions to use by non-academics: no
Competing interests
The authors declare that they have no competing interests.

Authors' contributions
PM was the Lead Architect and Project Manager of caTRIP. RD provided subject matter expertise to guide the development of caTRIP. Ram Chilukuri was the lead developer of caTRIP. RP provided subject matter expertise. KJ provided subject matter expertise. RA was the project director. AJC was the project PI.
Acknowledgements
This publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH. Information on NCRR is available at http://www.ncrr.nih.gov/. Information on Re-engineering the Clinical Research Enterprise can be obtained from http://nihroadmap.nih.gov/clinicalresearch/overview-translational.asp.

The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant. The authors would also like to acknowledge the contributions of the developers, database administrators, and system administrators at Duke, including Mark Peedin, Christopher Hubbard, Wilma Stanley, Farid Mohammad, and William Banks. We would also like to thank Srini Akkala and Sanjeev Agarwal from SemanticBits, LLC, as well as Bill Mason from 5AM Solutions, for their development efforts. Additionally, we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements. They include Drs. Kelley Marcom, Gretchen Kimmick, Kimberly Blackwell, and Lee Wilke. Finally, we would also like to thank our liaison to NCI, Juli Klemm, for her support and intellectual contributions.
References
1. Fenstermacher D, Street C, McSherry T, Nayak V, Overby C, Feldman M: The Cancer Biomedical Informatics Grid (caBIG™). Conf Proc IEEE Eng Med Biol Soc 2005, 1:743-6.
2. Esposito NN, Chivukula M, Dabbs DJ: The ductal phenotypic expression of the E-cadherin/catenin complex in tubulolobular carcinoma of the breast: an immunohistochemical and clinicopathologic study. Mod Pathol 2007, 20(1):130-8.
3. Sahin FI, Yilmaz Z, Yagmurdur MC, Atac FB, Ozdemir BH, Karakayali H, Demirhan B, Haberal M: Clinical findings and HER-2/neu gene amplification status of breast carcinoma patients. Pathol Oncol Res 2006, 12(4):211-5.
4. Sughayer MA, Al-Khawaja MM, Massarweh S, Al-Masri M: Prevalence of hormone receptors and HER2/neu in breast cancer cases in Jordan. Pathol Oncol Res 2006, 12(2):83-6.
5. Radovic S, Babic M, Doric M, Secic S, Beslic S, Balta E, Kapetanovic E: Immunohistochemical evaluation of the HER-2 protein in the infiltrative lobular breast cancer. Med Arh 2006, 60(4):213-6.
6. Mills ME: Linkage of patient records to support continuity of care: Issues and future directions. Stud Health Technol Inform 2006, 122:320-4.
7. Saltz J, Oster S, Hastings S, Langella S, Kurc T, Sanchez W, Kher M, Manisundaram A, Shanbhag K, Covitz P: caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics 2006, 22(15):1910-6.
8. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006, 38(5):500-1.
9. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.
Pre-publication history
The pre-publication history for this paper can be accessed here:
http://www.biomedcentral.com/1472-6947/8/60/prepub
The graphical user interface (GUI) provides a metadata-driven interface for building and running DCQL It alsoprovides an interface for saving and searching for queries
Another important goal for caTRIP is to leverage existingtechnologies wherever possible These include
bull Globus 41 for the underlying messaging stack
bull caGrid 10 for the underlying service infrastructure
bull Introduce for service creation
bull Index Service for discovering services on the grid
bull Authentication Service to authenticate users using theirinstitutional credentials
bull Dorian to obtain a grid user proxy
bull Grid Grouper to authorize users to access data based ongroups
bull Common Security Module (CSM) for plugging authoriza-tion into the data service
bull Cancer Data Standards Repository (caDSR) for storingcommon data elements (CDEs)
High-level view of the caTRIP architecture including Domain Services Distributed Query Engine and Graphical User InterfaceFigure 2High-level view of the caTRIP architecture including Domain Services Distributed Query Engine and Graph-ical User Interface
Page 5 of 13(page number not for citation purposes)
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
bull Enterprise Vocabulary Services (EVS) for storing semanticconcepts
bull Hibernate for an object-relational mapping
bull Oracle and MySQL for backend relational databases
More information can be found in the caTRIP architecturedocument and UML models
bull httpgforgencinihgovpluginsscmcvscvswebphpcatripdocumentationtechnical_guidecaTRIP_Technical_Guidedoccvsroot=catrip
bull httpgforgencinihgovpluginsscmcvscvswebphca-tripdocumentationmodelscaTRIP_UMLdoccvsroot=catrip
Implementation at DukeWe deployed caTRIP at Duke to leverage data from theGenetic Modifiers of BRCA1 and BRCA2 study in theDuke Breast Specialized Program of Research Excellence(SPORE) The goal of the GEMS investigation is to identifyfactors that influence the incidence of breast cancer ingermline BRCA12 mutation carriers To this end wedeployed tissue banking and pathology data from MAW3into caTissue CORE CAE and caTIES treatment recur-rence and endpoint data from our Tumor Registry into agrid service and SNP data from flat files into caIntegratorThese services were secured and then exposed to thecaTRIP interface This deployment allowed us to answerall of the domain questions that are found in the DesignObjectives section of this article Users of this pilotdeployment included a set of pathologists and clinicianswho provided feedback that guided the developmentproviding important feedback such as the use of menu-drive interfaces that require little knowledge of the back-end data instead of drag-and-drop interfaces that dorequire such knowledge
Results and discussionGraphical User InterfaceThe Graphical User Interface (GUI) is a Java Swing-basedthick client that can be accessed via Java WebStart httpjavasuncomproductsjavawebstart It provides the userwith the ability to construct execute and share distrib-uted queries in a graphical fashion There are two primaryways to construct queries the Simple Interface is a morelimited yet powerful way to build queries using drop-down menus and the Advanced Interface is a flexible wayto build queries using drag-and-drop
Simple InterfaceThe Simple Interface was developed to target the non-technical users such as a clinician Queries are constructed
using drop-down menus to add filters ANDOR groupsand return attributes (Figures 1 and 3) Available servicesare selected via a configuration file as well as the CDEsthat make up the drop-down menu items for filters andreturn attributes Joins across these services are performedusing common identifiers Specifically in caTRIP we usethe (optionally) deidentified Medical Record Number(MRN) of patients to join and aggregate data The datamodels of the underlying services are composed of poten-tially complex classes attributes and associationsbetween the classes (see Figure 4) In order to constructqueries that span multiple classes and associations theSimple Interface leverages a configuration file thatdescribes the outbound path from each class to the MRNas well as the inbound path to each filterablereturnableCDE This enables the automated logic that would nor-mally be performed by a user manually selecting all of theassociations necessary to perform a query
Advanced InterfaceThe Advanced Interface was developed to target a datatechnician who may wish to perform more complex que-ries Queries are constructed by dragging classes onto acanvas drawing associations between them and addingfilters along the way (Figure 4) Because the logic of howto traverse the information space is not built-in like it iswith the Simple Interface users must manually createthe association paths to the classesattributes that theywould like to filter on as well as those they would like toperform cross-service foreign joins with This allows usersto perform queries not supported by the Simple Inter-face but adds an extra level of complexity
Distributed Query EngineThe Distributed Query Engine (DQE) provides the logic toperform a query across different caGrid data services ThecaTRIP team developed a distributed query engine whichwas later adopted into the caGrid 10 toolkit as the Feder-ated Query Processor (FQP) module Currently the DQEis available for use by the general caBIG community andis currently being leveraged by the Cancer Bench to Bed-side (caB2B) project httpgforgencinihgovprojectscab2b It takes as input a Distributed Common QueryLanguage (DCQL) query breaks it into separate CommonQuery Language (CQL) queries issues those queries to theappropriate underlying data services receives the resultsperforms joins aggregates data and returns the resultsThe DQE is built as a Java service and can be deployed asa grid service
Domain ServicesThe domain services themselves can be further dividedinto a number of tiers including a relational databasemanagement system (RDBMS) Hibernate CommonQuery Language processor httpgforgencinihgovplu
Page 6 of 13(page number not for citation purposes)
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
ginsscmcvscvswebphpcagrid-1-0DocumentatiodocsDataServiceTechnicaldataServiceTechnicalGuidedoccvsroot=cagrid-1-0 and grid service Hiber-nate provides an object-oriented abstraction of the under-lying relational database exposing tables and columns asclasses and attributes in a flexible configurable mannerhttpwwwhibernateorg The CQL processor translatesa CQL query based upon common data elements andtheir associations into a Hibernate Query Language(HQL) query This is a relatively complex mappingbecause CQL and HQL have much different syntaxesFinally the grid service provides a standards-based entrypoint to issue a CQL query caGrid httpscabigncinihgovworkspacesArchitecturecaGrid pro-vides the underlying technology stack for the grid servicewhich itself is based upon Globus 41 httpwwwglobusorg
SecuritycaTRIP leverages the caGrid 10 infrastructure which pro-vides robust security for a distributed grid environment(Figure 5)
Security AuthenticationThe first step in the security workflow is authenticationwhich is the process by which a trusted organizationasserts a user is who they claim to be This is implementedusing the caGrid Authentication Service The goal is for theuser to obtain a grid user certificate to be used in authori-zation The user submits their credentials (user name andpassword) to the AuthenticationService which has aninstitutional identity provider (IdP) plug-in This compo-nent performs the necessary steps to determine that theusers credentials are valid In the case of the Duke IdPplugin the user name and password are validated against
Steps to perform a query in the caTRIP Simple InterfaceFigure 3Steps to perform a query in the caTRIP Simple Interface
Page 7 of 13(page number not for citation purposes)
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
a Duke Domain Controller which uses NT Security Thismeans that a user can log in with the same Cancer Centercredentials they use to log in to their workstation Oncevalidated the AuthenticationService returns a SecurityAssertion Markup Language (SAML) document httpwwwoasis-openorgcommitteesdownloadphp14361sstc-saml-tech-overview-20-draft-08pdf to the userwhich asserts that the user is who they claim to be TheSAML Assertion can then be passed off to a Dorian Serviceto generate a grid user certificate All of these interactionsare handled for the user by the caTRIP application
Security AuthorizationThe second step in the security workflow is authorizationwhich is the process by which a system grants the useraccess to specific resources This is implemented using the
Common Security Module (CSM) plug-in to the grid dataservice The CSM Phillips 2006 is an open sourcetoolkit that provides a flexible way to provision users andmake authorization decisions When a data field isaccessed on behalf of the user an authorization structureis queried to determine whether the user has permissionto access it If not then a security exception is returned tothe client Authorization is made possible because trustexists between the service serving the data resource andthe AuthenticationService
Delegation and Foreign CDEsAn alternate flow in the security workflow is delegationwhich is the process by which a service acts on behalf ofthe user This impacts caTRIP because the distributedquery engine can be run as a service so it must query the
Example Advanced Interfacing depicting the potential complexity of distributed queriesFigure 4Example Advanced Interfacing depicting the potential complexity of distributed queries
Page 8 of 13(page number not for citation purposes)
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
domain services using the users certificate The finalimplementation will likely leverage the DelegationServicefrom the Globus project The user will signal the Delega-tionService that it is permissible for a collocated service toact on the users behalf In the meantime the distributedquery engine is collocated with graphical user interface soit simply passes the users credentials to the data servicesidentified in the distributed query This bypasses some ofthe business rules defined by delegation such as thenumber of delegation hops allowed by the user but it issufficient to meet caTRIP use cases
Another change on the security workflow is the use offoreign CDEs for the purpose of joining data It is con-ceivable that a user does not have permission to view aCDE that is used for joining between two differentdomain services The classic example of this is to join
patients across services using the Medical Record NumberIn this case a special security workflow must be under-taken during authorization A level of trust is maintainedbetween the distributed query engine service and thedomain service such that foreign CDEs are returned to theDQE service with the understanding that the delegateduser is not authorized to view them but is authorized tojoin on them This workflow is only possible if the DQE isnot collocated with the client application
Figure 5: High-level overview of the caGrid security architecture (http://cagrid.org).
Automated Honest Broker Service
The primary goal of caTRIP is to provide users access to diverse datasets that cross domains and data sources. One challenge that arises is deidentifying datasets that are managed by different groups, which especially affects the foreign CDEs (MRNs) used to link across datasets. A typical scenario for cross-dataset deidentification is for an IRB-appointed individual to collect each dataset and deidentify them all at once, tossing away the keys to the Protected Health Information (PHI). This approach does not scale in a distributed environment because it requires a middleman to perform the deidentification: whenever a new dataset is added to the system, all previously existing datasets must be deidentified again. To address this in caTRIP, we developed an Automated Honest Broker Service (AHBS), which performs the tasks once performed by a human appointed by the Institutional Review Board (IRB). A data owner programmatically submits PHI (e.g., an MRN) to the AHBS using a secure connection. The AHBS generates a random key, associates it internally in a database with the PHI (which can be encrypted), and then returns the key to the data owner. The data owner can then deidentify the dataset by replacing the PHI with the random key. When more than one data owner or dataset is involved, the AHBS will always return the same random key for the same PHI. In this way, diverse distributed datasets can be deidentified using the same keys.
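The key assignment described above can be sketched in a few lines. This is a toy, in-memory illustration and not the AHBS code: the class `HonestBroker`, the function `deidentify`, the field name `mrn`, and the sample rows are hypothetical, and the secure connection, persistent database, and optional encryption of the stored PHI are left out.

```python
import secrets


class HonestBroker:
    """Toy in-memory stand-in for the AHBS; a real service would persist
    (and could encrypt) the PHI-to-key table and require a secure connection."""

    def __init__(self):
        self._key_for_phi = {}

    def key_for(self, phi):
        # Reuse the existing key so every data owner gets the same key
        # for the same PHI; otherwise mint a new random one.
        if phi not in self._key_for_phi:
            self._key_for_phi[phi] = secrets.token_hex(16)
        return self._key_for_phi[phi]


def deidentify(rows, broker, phi_field="mrn"):
    """Return copies of `rows` with the PHI field replaced by the broker key."""
    return [{**row, phi_field: broker.key_for(row[phi_field])} for row in rows]


broker = HonestBroker()
pathology = deidentify([{"mrn": "111", "er_status": "POSITIVE"}], broker)
tissue_bank = deidentify([{"mrn": "111", "barcode": "A-1"}], broker)
```

Because both datasets receive the same key for the same MRN, they can still be joined after deidentification, and adding a third dataset later does not require reprocessing the first two.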
Example from the Duke Implementation
The following example illustrates how caTRIP would construct and execute a user query that joins disparate datasets. A user at Duke could ask the question: "What are all of the available tissue specimens from the breast obtained from patients tested for positive expression of the estrogen receptor?" The distributed query engine would first fetch, from the Clinical Annotation Engine (CAE) service, the medical record numbers of all patients whose estrogen receptor expression is annotated as positive. It would then issue a query to the caTissue CORE service for tissue specimens that are marked available, were collected from the tissue site of breast, and were obtained from patients whose medical record numbers match those returned from the previous query. The results would be a list of tissue specimens, including their barcodes, location, size, etc. The single query issued to the distributed query engine would look like this:
1  <DCQLQuery xmlns="http://caGrid.caBIG/1.0/gov.nih.nci.cagrid.dcql">
2    <TargetObject name="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl" serviceURL="http://localhost:8080/wsrf/services/cagrid/CaTissueCore">
3      <Group logicRelation="AND">
4        <Attribute name="available" predicate="EQUAL_TO" value="true"/>
5        <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCharacteristicsImpl" roleName="specimenCharacteristics">
6          <Attribute name="tissueSite" predicate="LIKE" value="breast"/>
7        </Association>
8        <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCollectionGroupImpl" roleName="specimenCollectionGroup">
9          <Association name="edu.wustl.catissuecore.domainobject.impl.ClinicalReportImpl" roleName="clinicalReport">
10           <Association name="edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl" roleName="participantMedicalIdentifier">
11             <ForeignAssociation>
12               <JoinCondition>
13                 <LeftJoin>
14                   <Object>edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl</Object>
15                   <Property>medicalRecordNumber</Property>
16                 </LeftJoin>
17                 <RightJoin>
18                   <Object>edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier</Object>
19                   <Property>medicalRecordNumber</Property>
20                 </RightJoin>
21               </JoinCondition>
22               <ForeignObject name="edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier" serviceURL="http://localhost:8080/wsrf/services/cagrid/CAE">
23                 <Association name="edu.duke.catrip.cae.domain.general.Participant" roleName="participant">
24                   <Association name="edu.pitt.cabig.cae.domain.general.AnnotationEventParameters" roleName="annotationEventParametersCollection">
25                     <Association name="edu.pitt.cabig.cae.domain.breast.BreastCancerBiomarkers" roleName="annotationSetCollection">
26                       <Attribute name="estrogenReceptor" predicate="LIKE" value="POSITIVE"/>
27                     </Association>
28                   </Association>
29                 </Association>
30               </ForeignObject>
31             </ForeignAssociation>
32           </Association>
33         </Association>
         </Association>
       </Group>
     </TargetObject>
   </DCQLQuery>
Note that the query is rooted at caTissue CORE (line 2), where available specimens are filtered on line 4 and specimens from the tissue site of breast are filtered on line 6. Lines 11–31 define the foreign association between caTissue CORE and CAE. The medical record numbers (MRNs) of the two services are joined on lines 12–21 by defining the left join criterion to be the MRN from caTissue CORE and the right join criterion to be the MRN from CAE. Line 22 selects CAE as the foreign service, from which patients with estrogen receptor expression marked as positive are filtered on line 26. The query itself is unfolded and executed inside-out by the distributed query engine: CAE is the first service to be queried, and its results are used to query caTissue CORE. Note also that the association paths of the objects are defined within the query on lines 5, 8, 9, 10, 23, 24, and 25, the logic of which is hidden from the users, who simply select data elements to query upon and return. The abbreviated results would look like this:
1  <queryResults targetClassname="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl">
2    <ns1:ObjectResult xmlns:ns1="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
3      <ns2:TissueSpecimen id="397" type="Fixed Tissue Block" available="true" positionDimensionOne="20" positionDimensionTwo="22" barcode="ISBN-10AB00-0013XY" comments="TISSUE IN FORMALIN 30" activityStatus="Active" quantityInGram="0.7" availableQuantityInGram="0.199999988079071" xmlns:ns2="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
4    </ns1:ObjectResult>
5    <ns2:ObjectResult xmlns:ns2="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
6      <ns3:TissueSpecimen id="406" type="Fixed Tissue Block" available="true" positionDimensionOne="19" positionDimensionTwo="14" barcode="ISBN-10AB00-0043XY" comments="STRONG FAMILY HISTORY FOR BRCA MUTATIONS" activityStatus="Active" quantityInGram="675" availableQuantityInGram="625" xmlns:ns3="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
7    </ns2:ObjectResult>
8    <ns3:ObjectResult xmlns:ns3="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
9      <ns4:TissueSpecimen id="411" type="Fixed Tissue Block" available="true" positionDimensionOne="21" positionDimensionTwo="7" barcode="ISBN-10AB00-0063XY" comments="NO MORE WHOLE BLOOD" activityStatus="Active" quantityInGram="4275" availableQuantityInGram="4225" xmlns:ns4="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
10   </ns3:ObjectResult>
11 </queryResults>
Note that line 1 defines the type of object being returned in the query results, and that the results themselves, found on lines 3, 6, and 9, are the objects that match the query criteria, with the data elements of interest embedded within them as attributes.
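The inside-out evaluation order described in this example can be sketched with in-memory stand-ins for the two services. This is an illustration under simplifying assumptions, not the distributed query engine itself: the functions `query_cae`, `query_catissue`, and `run_distributed_query` are hypothetical, the data are made up, the LIKE predicates are reduced to equality, and the association paths are flattened into plain fields.

```python
def query_cae(annotations, er_status):
    """Inner (foreign) query: MRNs of patients with the given ER status."""
    return {a["mrn"] for a in annotations if a["estrogenReceptor"] == er_status}


def query_catissue(specimens, tissue_site, mrns):
    """Outer query: available specimens from `tissue_site` whose MRN is in `mrns`."""
    return [
        s for s in specimens
        if s["available"] and s["tissueSite"] == tissue_site and s["mrn"] in mrns
    ]


def run_distributed_query(annotations, specimens):
    # Inside-out: the foreign service is queried first, and its result
    # becomes a filter on the target service.
    positive_mrns = query_cae(annotations, "POSITIVE")
    return query_catissue(specimens, "breast", positive_mrns)


# Made-up data standing in for the two grid services.
cae_rows = [
    {"mrn": "111", "estrogenReceptor": "POSITIVE"},
    {"mrn": "222", "estrogenReceptor": "NEGATIVE"},
]
catissue_rows = [
    {"id": 1, "mrn": "111", "tissueSite": "breast", "available": True},
    {"id": 2, "mrn": "111", "tissueSite": "breast", "available": False},
    {"id": 3, "mrn": "222", "tissueSite": "breast", "available": True},
    {"id": 4, "mrn": "111", "tissueSite": "lung", "available": True},
]

matches = run_distributed_query(cae_rows, catissue_rows)
```

Only the first specimen satisfies all three conditions: it is available, it comes from the breast, and it belongs to a patient annotated as estrogen receptor positive.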
Conclusion
caTRIP is being implemented in a phased development cycle, meeting certain objectives at each point to provide utility as the program is extended. It is currently in its second phase of development. Phase one concluded the pilot development activities of caTRIP, providing a baseline architecture and a proof-of-concept user interface. End-user training was provided on a limited basis for the small multidisciplinary breast oncology group, using live demonstrations and basic instructions sent by e-mail. Feedback was collected through e-mail as well. The multidisciplinary group was given direct access to the primary software developers and technical experts during this period. In phase two, we plan to deploy caTRIP-II along with enhancements determined through end-user interaction. Furthermore, we are in the process of engaging different groups at Duke to adopt caTRIP to help meet their research goals.
In phase one of caTRIP, use cases focused on intra-institutional problems: tearing down the silos of data-oriented systems in a single cancer center. In future iterations, we plan to extend caTRIP to tackle cross-institutional use cases. This involves extending the caTRIP interface and distributed query engine to support aggregation of data from services that expose common data elements. For example, a motivating use case among prospective adopters is to aggregate treatment, endpoint, pathology, and tissue banking data from a number of different cancer centers. This will allow researchers and clinicians to expand their queries and answer questions that they previously could not answer at their own institution. Furthermore, we plan to integrate caTRIP with analytics, specifically focusing on microarray analysis. The caBIG gene expression database, caArray, is a rich source of central or distributed gene expression data, which itself can be tied to clinical data such as pathology biomarkers. This microarray data can be analyzed and visualized dynamically using services provided by other caBIG applications, such as geWorkbench (Califano 2006; http://apiii.upmc.edu/abstracts/posterarchive/2006/watkinson.html), GenePattern [8], and Bioconductor [9]. This notion of building workflows based upon queries designed using caTRIP's simple interface can provide a powerful tool for both researchers and clinicians. Finally, we plan to enhance the user interface, providing more sophisticated reporting and data mining functionality. All of these features will be driven by the end-user requirements of the adopter sites, which is a critical facet of the caBIG initiative.
Availability and requirements
Project name: Cancer Translational Research Informatics Platform (caTRIP)
Project home page: https://gforge.nci.nih.gov/projects/catrip
Operating system(s): platform independent
Programming language: Java
Other requirements: Tomcat 5.0.28, a database (MySQL 4.1.10 or Oracle 10.x), Java version 1.5 or higher, and Apache Ant version 1.6.5
License: NCI caBIG license (non-viral)
Any restrictions to use by non-academics: no
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
PM was the Lead Architect and Project Manager of caTRIP. RD provided subject matter expertise to guide the development of caTRIP. Ram Chilukuri was the lead developer of caTRIP. RP provided subject matter expertise. KJ provided subject matter expertise. RA was the project director. AJC was the project PI.
Acknowledgements
This publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH. Information on NCRR is available at http://www.ncrr.nih.gov. Information on Re-engineering the Clinical Research Enterprise can be obtained from http://nihroadmap.nih.gov/clinicalresearch/overview-translational.asp.
The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant. The authors would also like to acknowledge the contributions of the developers, database administrators, and system administrators at Duke, including Mark Peedin, Christopher Hubbard, Wilma Stanley, Farid Mohammad, and William Banks. We would also like to thank Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC, as well as Bill Mason from 5 AM Solutions, for their development efforts. Additionally, we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements. They include Drs. Kelley Marcom, Gretchen Kimmick, Kimberly Blackwell, and Lee Wilke. Finally, we would like to thank our liaison to NCI, Juli Klemm, for her support and intellectual contributions.
References
1. Fenstermacher D, Street C, McSherry T, Nayak V, Overby C, Feldman M: The Cancer Biomedical Informatics Grid (caBIG™). Conf Proc IEEE Eng Med Biol Soc 2005, 1:743-6.
2. Esposito NN, Chivukula M, Dabbs DJ: The ductal phenotypic expression of the E-cadherin/catenin complex in tubulolobular carcinoma of the breast: an immunohistochemical and clinicopathologic study. Mod Pathol 2007, 20(1):130-8.
3. Sahin FI, Yilmaz Z, Yagmurdur MC, Atac FB, Ozdemir BH, Karakayali H, Demirhan B, Haberal M: Clinical findings and HER-2/neu gene amplification status of breast carcinoma patients. Pathol Oncol Res 2006, 12(4):211-5.
4. Sughayer MA, Al-Khawaja MM, Massarweh S, Al-Masri M: Prevalence of hormone receptors and HER2/neu in breast cancer cases in Jordan. Pathol Oncol Res 2006, 12(2):83-6.
5. Radovic S, Babic M, Doric M, Secic S, Beslic S, Balta E, Kapetanovic E: Immunohistochemical evaluation of the HER-2 protein in the infiltrative lobular breast cancer. Med Arh 2006, 60(4):213-6.
6. Mills ME: Linkage of patient records to support continuity of care: Issues and future directions. Stud Health Technol Inform 2006, 122:320-4.
7. Saltz J, Oster S, Hastings S, Langella S, Kurc T, Sanchez W, Kher M, Manisundaram A, Shanbhag K, Covitz P: caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics 2006, 22(15):1910-6.
8. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006, 38(5):500-1.
9. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, others: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/8/60/prepub
- Abstract
  - Background
  - Results
  - Conclusion
- Background
  - Case Study Outcomes Analysis
  - Design Objectives
- Implementation
  - Technical Requirements
  - Software architecture
  - Implementation at Duke
- Results and discussion
  - Graphical User Interface
  - Simple Interface
  - Advanced Interface
  - Distributed Query Engine
  - Domain Services
  - Security
  - Security Authentication
  - Security Authorization
  - Delegation and Foreign CDEs
  - Automated Honest Broker Service
  - Example from the Duke Implementation
- Conclusion
- Availability and requirements
- Competing interests
- Authors' contributions
- Acknowledgements
- References
- Pre-publication history
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
bull Enterprise Vocabulary Services (EVS) for storing semanticconcepts
bull Hibernate for an object-relational mapping
bull Oracle and MySQL for backend relational databases
More information can be found in the caTRIP architecturedocument and UML models
bull httpgforgencinihgovpluginsscmcvscvswebphpcatripdocumentationtechnical_guidecaTRIP_Technical_Guidedoccvsroot=catrip
bull httpgforgencinihgovpluginsscmcvscvswebphca-tripdocumentationmodelscaTRIP_UMLdoccvsroot=catrip
Implementation at DukeWe deployed caTRIP at Duke to leverage data from theGenetic Modifiers of BRCA1 and BRCA2 study in theDuke Breast Specialized Program of Research Excellence(SPORE) The goal of the GEMS investigation is to identifyfactors that influence the incidence of breast cancer ingermline BRCA12 mutation carriers To this end wedeployed tissue banking and pathology data from MAW3into caTissue CORE CAE and caTIES treatment recur-rence and endpoint data from our Tumor Registry into agrid service and SNP data from flat files into caIntegratorThese services were secured and then exposed to thecaTRIP interface This deployment allowed us to answerall of the domain questions that are found in the DesignObjectives section of this article Users of this pilotdeployment included a set of pathologists and clinicianswho provided feedback that guided the developmentproviding important feedback such as the use of menu-drive interfaces that require little knowledge of the back-end data instead of drag-and-drop interfaces that dorequire such knowledge
Results and discussionGraphical User InterfaceThe Graphical User Interface (GUI) is a Java Swing-basedthick client that can be accessed via Java WebStart httpjavasuncomproductsjavawebstart It provides the userwith the ability to construct execute and share distrib-uted queries in a graphical fashion There are two primaryways to construct queries the Simple Interface is a morelimited yet powerful way to build queries using drop-down menus and the Advanced Interface is a flexible wayto build queries using drag-and-drop
Simple InterfaceThe Simple Interface was developed to target the non-technical users such as a clinician Queries are constructed
using drop-down menus to add filters ANDOR groupsand return attributes (Figures 1 and 3) Available servicesare selected via a configuration file as well as the CDEsthat make up the drop-down menu items for filters andreturn attributes Joins across these services are performedusing common identifiers Specifically in caTRIP we usethe (optionally) deidentified Medical Record Number(MRN) of patients to join and aggregate data The datamodels of the underlying services are composed of poten-tially complex classes attributes and associationsbetween the classes (see Figure 4) In order to constructqueries that span multiple classes and associations theSimple Interface leverages a configuration file thatdescribes the outbound path from each class to the MRNas well as the inbound path to each filterablereturnableCDE This enables the automated logic that would nor-mally be performed by a user manually selecting all of theassociations necessary to perform a query
Advanced InterfaceThe Advanced Interface was developed to target a datatechnician who may wish to perform more complex que-ries Queries are constructed by dragging classes onto acanvas drawing associations between them and addingfilters along the way (Figure 4) Because the logic of howto traverse the information space is not built-in like it iswith the Simple Interface users must manually createthe association paths to the classesattributes that theywould like to filter on as well as those they would like toperform cross-service foreign joins with This allows usersto perform queries not supported by the Simple Inter-face but adds an extra level of complexity
Distributed Query EngineThe Distributed Query Engine (DQE) provides the logic toperform a query across different caGrid data services ThecaTRIP team developed a distributed query engine whichwas later adopted into the caGrid 10 toolkit as the Feder-ated Query Processor (FQP) module Currently the DQEis available for use by the general caBIG community andis currently being leveraged by the Cancer Bench to Bed-side (caB2B) project httpgforgencinihgovprojectscab2b It takes as input a Distributed Common QueryLanguage (DCQL) query breaks it into separate CommonQuery Language (CQL) queries issues those queries to theappropriate underlying data services receives the resultsperforms joins aggregates data and returns the resultsThe DQE is built as a Java service and can be deployed asa grid service
Domain ServicesThe domain services themselves can be further dividedinto a number of tiers including a relational databasemanagement system (RDBMS) Hibernate CommonQuery Language processor httpgforgencinihgovplu
Page 6 of 13(page number not for citation purposes)
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
ginsscmcvscvswebphpcagrid-1-0DocumentatiodocsDataServiceTechnicaldataServiceTechnicalGuidedoccvsroot=cagrid-1-0 and grid service Hiber-nate provides an object-oriented abstraction of the under-lying relational database exposing tables and columns asclasses and attributes in a flexible configurable mannerhttpwwwhibernateorg The CQL processor translatesa CQL query based upon common data elements andtheir associations into a Hibernate Query Language(HQL) query This is a relatively complex mappingbecause CQL and HQL have much different syntaxesFinally the grid service provides a standards-based entrypoint to issue a CQL query caGrid httpscabigncinihgovworkspacesArchitecturecaGrid pro-vides the underlying technology stack for the grid servicewhich itself is based upon Globus 41 httpwwwglobusorg
SecuritycaTRIP leverages the caGrid 10 infrastructure which pro-vides robust security for a distributed grid environment(Figure 5)
Security AuthenticationThe first step in the security workflow is authenticationwhich is the process by which a trusted organizationasserts a user is who they claim to be This is implementedusing the caGrid Authentication Service The goal is for theuser to obtain a grid user certificate to be used in authori-zation The user submits their credentials (user name andpassword) to the AuthenticationService which has aninstitutional identity provider (IdP) plug-in This compo-nent performs the necessary steps to determine that theusers credentials are valid In the case of the Duke IdPplugin the user name and password are validated against
Steps to perform a query in the caTRIP Simple InterfaceFigure 3Steps to perform a query in the caTRIP Simple Interface
Page 7 of 13(page number not for citation purposes)
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
a Duke Domain Controller which uses NT Security Thismeans that a user can log in with the same Cancer Centercredentials they use to log in to their workstation Oncevalidated the AuthenticationService returns a SecurityAssertion Markup Language (SAML) document httpwwwoasis-openorgcommitteesdownloadphp14361sstc-saml-tech-overview-20-draft-08pdf to the userwhich asserts that the user is who they claim to be TheSAML Assertion can then be passed off to a Dorian Serviceto generate a grid user certificate All of these interactionsare handled for the user by the caTRIP application
Security AuthorizationThe second step in the security workflow is authorizationwhich is the process by which a system grants the useraccess to specific resources This is implemented using the
Common Security Module (CSM) plug-in to the grid dataservice The CSM Phillips 2006 is an open sourcetoolkit that provides a flexible way to provision users andmake authorization decisions When a data field isaccessed on behalf of the user an authorization structureis queried to determine whether the user has permissionto access it If not then a security exception is returned tothe client Authorization is made possible because trustexists between the service serving the data resource andthe AuthenticationService
Delegation and Foreign CDEsAn alternate flow in the security workflow is delegationwhich is the process by which a service acts on behalf ofthe user This impacts caTRIP because the distributedquery engine can be run as a service so it must query the
Example Advanced Interfacing depicting the potential complexity of distributed queriesFigure 4Example Advanced Interfacing depicting the potential complexity of distributed queries
Page 8 of 13(page number not for citation purposes)
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
domain services using the users certificate The finalimplementation will likely leverage the DelegationServicefrom the Globus project The user will signal the Delega-tionService that it is permissible for a collocated service toact on the users behalf In the meantime the distributedquery engine is collocated with graphical user interface soit simply passes the users credentials to the data servicesidentified in the distributed query This bypasses some ofthe business rules defined by delegation such as thenumber of delegation hops allowed by the user but it issufficient to meet caTRIP use cases
Another change on the security workflow is the use offoreign CDEs for the purpose of joining data It is con-ceivable that a user does not have permission to view aCDE that is used for joining between two differentdomain services The classic example of this is to join
patients across services using the Medical Record NumberIn this case a special security workflow must be under-taken during authorization A level of trust is maintainedbetween the distributed query engine service and thedomain service such that foreign CDEs are returned to theDQE service with the understanding that the delegateduser is not authorized to view them but is authorized tojoin on them This workflow is only possible if the DQE isnot collocated with the client application
Automated Honest Broker ServiceThe primary goal of caTRIP is to provide users access todiverse datasets that cross domains and data sources Onechallenge that arises from this is deidentifying datasetsthat are managed by different groups which especiallyaffects deidentifying foreign CDEs (MRNs) that are usedto link across different datasets A typical scenario for
High-level overview of the caGrid security architecture httpcagridorgFigure 5High-level overview of the caGrid security architecture httpcagridorg
Page 9 of 13(page number not for citation purposes)
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
cross-dataset deidentification is for an IRB appointed indi-vidual to collect each dataset and deidentify them all atonce tossing away the keys to the Protected Health Infor-mation (PHI) This approach is not scalable in a distrib-uted environment because it requires a middle-man toperform the deidentification Whenever a new dataset isadded to the system all previously existing data sets mustbe deidentified again To address this in caTRIP we devel-oped an Automated Honest Broker Service (AHBS) whichperforms the tasks once performed by a human appointedby the Institutional Review Board A data owner program-matically submits PHI (eg an MRN) to the AHBS using asecure connection The AHBS generates a random keyassociates it internally in a database to the PHI (which canbe encrypted) and then returns the key to the data ownerThe data owner then can deidentify his dataset by replac-ing the PHI with the random key When more than onedata ownerdataset is involved the AHBS will alwaysreturn the same random key for the same PHI In this waydiverse distributed datasets can be deidentified using thesame keys
Example from the Duke ImplementationThe following example illustrates how caTRIP would con-struct and execute a user query that joins disparate data-sets A user at Duke could ask the question What are allof the available tissue specimens from the breast obtainedfrom patients tested for positive expression of the estrogenreceptor The distributed query engine would first fetchall of the patient medical record numbers from the Clini-cal Annotation Engine (CAE) service that have estrogenreceptor expression annotated as positive It would thenissue a query to the caTissue CORE service for tissue spec-imens that are marked available were collected from thetissue site of breast and were obtained from patients thathave medical record numbers that match those returnedfrom the previous query The results would be a list of tis-sue specimens including their barcodes location sizeetc The single query issued to the distributed queryengine would look like this
1 ltDCQLQuery xmlns = httpcaGridcaBIG10govnihncicagriddcqlgt
2 ltTargetObject name = httpeduwustlcatissuecoredomainobjectimplTissueSpecimenImpl serv-iceURL = httplocalhost8080wsrfservicescagridCaTissueCoregt
3 ltGroup logicRelation = ANDgt
4 ltAttribute name = available predicate =EQUAL_TO value = truegt
5 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCharacteristicsImpl roleName = specimenCharacteristicsgt
6 ltAttribute name = tissueSite predicate =LIKE value = breastgt
7 ltAssociationgt
8 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCollectionGroupImpl roleName = specimenCollectionGroupgt
9 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplClinicalReportImplroleName = clinicalReportgt
10 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplParticipantMedicalIdentifierImpl roleName = participantMedicalIdentifiergt
11 ltForeignAssociationgt
12 ltJoinConditiongt
13 ltLeftJoingt
14ltObjectgteduwustlcatissuecoredomai
nobjectimplParticipantMedicalIdentifierImplltObjectgt
15ltPropertygtmedicalRecordNumberlt
Propertygt
16 ltLeftJoingt
17 ltRightJoingt
18ltObjectgtedudukecatripcaedomaing
eneralParticipantMedicalIdentifierltObjectgt
19ltPropertygtmedicalRecordNumberlt
Propertygt
20 ltRightJoingt
21 ltJoinConditiongt
22 ltForeignObject name =httpedudukecatripcaedomaingeneralParticipantMedicalI dentifier serviceURL = httplocalhost8080wsrfservicescagridCAEgt
Page 10 of 13(page number not for citation purposes)
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
23 ltAssociation name = httpedudukecatripcaedomaingeneralParticipantroleName = participantgt
24 ltAssociation name =httpedupittcabigcaedomaingeneralAnnotationEventPa rameters roleName = annotationEventParameter-sCollectiongt
25 ltAssociation name =httpedupittcabigcaedomainbreastBreastCancerBiomark ers roleName = annotationSetCollectiongt
26 ltAttribute name = estrogenRe-ceptor predicate = LIKE value = POSITIVEgt
27 ltAssociationgt
28 ltAssociationgt
29 ltAssociationgt
30 ltForeignObjectgt
31 ltForeignAssociationgt
32 ltAssociationgt
33 ltAssociationgt
ltAssociationgt
ltGroupgt
ltTargetObjectgt
ltDCQLQuerygt
Note on line 1 that the query is rooted at caTissue COREwhere available specimens are filtered on line 4 and spec-imens from the tissue site of breast are filtered on line 6Lines 11ndash31 define the foreign association between caTis-sue CORE and CAE The medical record numbers (MRNs)of each service are joined on lines 12ndash21 by defining theleft join criteria to be the MRN from caTissue CORE andthe right join criteria to the be MRN from CAE Line 22selects the foreign service to be CAE from which patientswith estrogen receptor expression marked as positive arefiltered on line 26 The query itself is unfolded and exe-cuted inside-out by the distributed query engine CAEbeing the first service to be queried and the results used toquery caTissue CORE Note also that the association pathsof the objects are defined within the query on lines 5 8 910 23 24 and 25 the logic of which is hidden from theusers who are simply selecting data elements to query
upon and return The abbreviated results would look likethis
1 <queryResults targetClassname="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl">
2 <ns1:ObjectResult xmlns:ns1="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
3 <ns2:TissueSpecimen id="397" type="Fixed Tissue Block" available="true" positionDimensionOne="20" positionDimensionTwo="22" barcode="ISBN-10AB00-0013XY" comments="TISSUE IN FORMALIN 30" activityStatus="Active" quantityInGram="0.7" availableQuantityInGram="0.199999988079071" xmlns:ns2="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
4 </ns1:ObjectResult>
5 <ns2:ObjectResult xmlns:ns2="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
6 <ns3:TissueSpecimen id="406" type="Fixed Tissue Block" available="true" positionDimensionOne="19" positionDimensionTwo="14" barcode="ISBN-10AB00-0043XY" comments="STRONG FAMILY HISTORY FOR BRCA MUTATIONS" activityStatus="Active" quantityInGram="675" availableQuantityInGram="625" xmlns:ns3="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
7 </ns2:ObjectResult>
8 <ns3:ObjectResult xmlns:ns3="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
9 <ns4:TissueSpecimen id="411" type="Fixed Tissue Block" available="true" positionDimensionOne="21" positionDimensionTwo="7" barcode="ISBN-10 AB00-0063XY" comments="NO MORE WHOLE BLOOD" activityStatus="Active" quantityInGram="4275" availableQuantityInGram="4225" xmlns:ns4="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
10 </ns3:ObjectResult>
11 </queryResults>
Note that line 1 of the results defines the type of object being returned, and that the results themselves, found on lines 3, 6, and 9, are the objects that match the query criteria, with the data elements of interest embedded within them as attributes.
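The inside-out execution order described for the example query can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not caTRIP's implementation: the two in-memory lists stand in for the CAE and caTissue CORE grid services, the function names, field names, and MRN values are hypothetical, and exact string equality replaces the LIKE predicates used in the DCQL.

```python
# Hypothetical in-memory stand-ins for the two grid services (illustrative data only).
CAE_PARTICIPANTS = [
    {"mrn": "MRN-001", "estrogen_receptor": "POSITIVE"},
    {"mrn": "MRN-002", "estrogen_receptor": "NEGATIVE"},
    {"mrn": "MRN-003", "estrogen_receptor": "POSITIVE"},
]

CATISSUE_SPECIMENS = [
    {"id": 1, "mrn": "MRN-001", "tissue_site": "breast", "available": True},
    {"id": 2, "mrn": "MRN-001", "tissue_site": "breast", "available": False},
    {"id": 3, "mrn": "MRN-002", "tissue_site": "breast", "available": True},
    {"id": 4, "mrn": "MRN-003", "tissue_site": "lung", "available": True},
    {"id": 5, "mrn": "MRN-003", "tissue_site": "breast", "available": True},
]


def query_cae(er_status):
    """Inner (foreign) query: MRNs of patients annotated with the given ER status."""
    return {p["mrn"] for p in CAE_PARTICIPANTS
            if p["estrogen_receptor"] == er_status}


def query_catissue(mrns, tissue_site):
    """Outer (target) query: available specimens from the site whose MRN is in mrns."""
    matches = []
    for s in CATISSUE_SPECIMENS:
        if s["available"] and s["tissue_site"] == tissue_site and s["mrn"] in mrns:
            # Return the specimen's own attributes; the MRN serves only as the join key.
            matches.append({k: v for k, v in s.items() if k != "mrn"})
    return matches


def run_distributed_query():
    """Execute inside-out: the foreign service first, then the target service."""
    mrns = query_cae("POSITIVE")
    return query_catissue(mrns, "breast")


if __name__ == "__main__":
    for specimen in run_distributed_query():
        print(specimen)
```

Collecting the MRNs into a set removes duplicates and keeps the membership test in the outer query cheap. Dropping the MRN from each returned record mirrors the abbreviated results above, in which the specimens carry their own attributes but no MRN.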
Conclusion
caTRIP is being implemented in a phased development cycle, meeting certain objectives at each point to provide utility as the program is extended. It is currently in its second phase of development. Phase one concluded the pilot development activities of caTRIP, providing a baseline architecture and a proof-of-concept user interface. End-user training was provided on a limited basis for the small multidisciplinary breast oncology group, using live demonstrations and basic instructions provided through e-mail messages. Feedback was collected through e-mail as well. The multidisciplinary group was given direct access to the primary software developers and technical experts during this period. In phase two, we plan to deploy caTRIP-II along with enhancements determined through end-user interaction. Furthermore, we are in the process of engaging different groups at Duke to adopt caTRIP to help meet their research goals.
In phase one of caTRIP, use cases focused on intra-institutional problems: tearing down the silos of data-oriented systems in a single cancer center. In future iterations, we plan to extend caTRIP to tackle cross-institutional use cases. This involves extending the caTRIP interface and distributed query engine to support aggregation of data from services that expose common data elements. For example, a motivating use case amongst prospective adopters is to aggregate treatment, endpoint, pathology, and tissue banking data from a number of different cancer centers. This will allow researchers and clinicians to expand their queries and answer questions that they previously could not answer at their own institution. Furthermore, we plan to integrate caTRIP with analytics, specifically focusing on microarray analysis. The caBIG gene expression database, caArray, is a rich source of central or distributed gene expression data, which itself can be tied to clinical data such as pathology biomarkers. This microarray data can be analyzed and visualized dynamically using services provided by other caBIG applications such as geWorkbench (Califano 2006, http://apiii.upmc.edu/abstracts/posterarchive/2006/watkinson.html), GenePattern [8], and Bioconductor [9]. This notion of building workflows based upon queries designed using caTRIP's Simple Interface can provide a powerful tool for both researchers and clinicians. Finally, we also plan to enhance the user interface, providing more sophisticated reporting and data mining functionality. All of these features will be driven by the end-user requirements of the adopter sites, which is a critical facet of the caBIG initiative.
Availability and requirements
Project name: Cancer Translational Research Informatics Platform (caTRIP)
Project home page: https://gforge.nci.nih.gov/projects/catrip
Operating system(s): platform independent
Programming language: Java
Other requirements: Tomcat 5.0.28, a database (MySQL 4.1.10 or Oracle 10.x), Java version 1.5 or higher, and Apache Ant version 1.6.5
License: NCI caBIG license (non-viral)
Any restrictions to use by non-academics: no
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
PM was the Lead Architect and Project Manager of caTRIP. RD provided subject matter expertise to guide the development of caTRIP. Ram Chilukuri was the lead developer of caTRIP. RP provided subject matter expertise. KJ provided subject matter expertise. RA was the project director. AJC was the project PI.
Acknowledgements
This publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH. Information on NCRR is available at http://www.ncrr.nih.gov. Information on Re-engineering the Clinical Research Enterprise can be obtained from http://nihroadmap.nih.gov/clinicalresearch/overview-translational.asp.
The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant. The authors would also like to acknowledge the contributions of the developers, database administrators, and system administrators at Duke, including Mark Peedin, Christopher Hubbard, Wilma Stanley, Farid Mohammad, and William Banks. We would also like to thank Srini Akkala and Sanjeev Agarwal from SemanticBits, LLC, as well as Bill Mason from 5AM Solutions, for their development efforts. Additionally, we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements. They include Drs. Kelley Marcom, Gretchen Kimmick, Kimberly Blackwell, and Lee Wilke. Finally, we would like to thank our liaison to NCI, Juli Klemm, for her support and intellectual contributions.
References
1. Fenstermacher D, Street C, McSherry T, Nayak V, Overby C, Feldman M: The Cancer Biomedical Informatics Grid (caBIG™). Conf Proc IEEE Eng Med Biol Soc 2005, 1:743-6.
2. Esposito NN, Chivukula M, Dabbs DJ: The ductal phenotypic expression of the E-cadherin/catenin complex in tubulolobular carcinoma of the breast: an immunohistochemical and clinicopathologic study. Mod Pathol 2007, 20(1):130-8.
3. Sahin FI, Yilmaz Z, Yagmurdur MC, Atac FB, Ozdemir BH, Karakayali H, Demirhan B, Haberal M: Clinical findings and HER-2/neu gene amplification status of breast carcinoma patients. Pathol Oncol Res 2006, 12(4):211-5.
4. Sughayer MA, Al-Khawaja MM, Massarweh S, Al-Masri M: Prevalence of hormone receptors and HER2/neu in breast cancer cases in Jordan. Pathol Oncol Res 2006, 12(2):83-6.
5. Radovic S, Babic M, Doric M, Secic S, Beslic S, Balta E, Kapetanovic E: Immunohistochemical evaluation of the HER-2 protein in the infiltrative lobular breast cancer. Med Arh 2006, 60(4):213-6.
6. Mills ME: Linkage of patient records to support continuity of care: Issues and future directions. Stud Health Technol Inform 2006, 122:320-4.
7. Saltz J, Oster S, Hastings S, Langella S, Kurc T, Sanchez W, Kher M, Manisundaram A, Shanbhag K, Covitz P: caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics 2006, 22(15):1910-6.
8. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006, 38(5):500-1.
9. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.
Pre-publication history
The pre-publication history for this paper can be accessed here:
http://www.biomedcentral.com/1472-6947/8/60/prepub
- Abstract
  - Background
  - Results
  - Conclusion
- Background
  - Case Study Outcomes Analysis
  - Design Objectives
- Implementation
  - Technical Requirements
  - Software architecture
  - Implementation at Duke
- Results and discussion
  - Graphical User Interface
  - Simple Interface
  - Advanced Interface
  - Distributed Query Engine
  - Domain Services
  - Security
  - Security Authentication
  - Security Authorization
  - Delegation and Foreign CDEs
  - Automated Honest Broker Service
  - Example from the Duke Implementation
- Conclusion
- Availability and requirements
- Competing interests
- Authors contributions
- Acknowledgements
- References
- Pre-publication history
a Duke Domain Controller which uses NT Security Thismeans that a user can log in with the same Cancer Centercredentials they use to log in to their workstation Oncevalidated the AuthenticationService returns a SecurityAssertion Markup Language (SAML) document httpwwwoasis-openorgcommitteesdownloadphp14361sstc-saml-tech-overview-20-draft-08pdf to the userwhich asserts that the user is who they claim to be TheSAML Assertion can then be passed off to a Dorian Serviceto generate a grid user certificate All of these interactionsare handled for the user by the caTRIP application
Security AuthorizationThe second step in the security workflow is authorizationwhich is the process by which a system grants the useraccess to specific resources This is implemented using the
Common Security Module (CSM) plug-in to the grid dataservice The CSM Phillips 2006 is an open sourcetoolkit that provides a flexible way to provision users andmake authorization decisions When a data field isaccessed on behalf of the user an authorization structureis queried to determine whether the user has permissionto access it If not then a security exception is returned tothe client Authorization is made possible because trustexists between the service serving the data resource andthe AuthenticationService
Figure 4. Example Advanced Interface depicting the potential complexity of distributed queries.
Delegation and Foreign CDEs
An alternate flow in the security workflow is delegation, which is the process by which a service acts on behalf of the user. This impacts caTRIP because the distributed query engine can be run as a service, so it must query the domain services using the user's certificate. The final implementation will likely leverage the DelegationService from the Globus project. The user will signal the DelegationService that it is permissible for a collocated service to act on the user's behalf. In the meantime, the distributed query engine is collocated with the graphical user interface, so it simply passes the user's credentials to the data services identified in the distributed query. This bypasses some of the business rules defined by delegation, such as the number of delegation hops allowed by the user, but it is sufficient to meet caTRIP use cases.
Another change to the security workflow is the use of foreign CDEs for the purpose of joining data. It is conceivable that a user does not have permission to view a CDE that is used for joining between two different domain services. The classic example of this is joining patients across services using the Medical Record Number. In this case, a special security workflow must be undertaken during authorization. A level of trust is maintained between the distributed query engine (DQE) service and the domain service, such that foreign CDEs are returned to the DQE service with the understanding that the delegated user is not authorized to view them but is authorized to join on them. This workflow is only possible if the DQE is not collocated with the client application.
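The "authorized to join on, but not to view" rule can be illustrated with a small sketch. It assumes the query engine holds both result sets in memory and strips the join field before anything is returned to the user; the function name, field names, and sample rows are hypothetical and are not taken from the caTRIP code base.

```python
def join_without_exposing(left_rows, right_rows, join_field, viewable_fields):
    """Inner-join two result sets on join_field, then drop non-viewable fields."""
    # Index the right-hand rows by the join value.
    right_by_key = {}
    for row in right_rows:
        right_by_key.setdefault(row[join_field], []).append(row)

    joined = []
    for left in left_rows:
        for right in right_by_key.get(left[join_field], []):
            merged = {**right, **left}
            # Only fields the user may view leave the query engine.
            joined.append({k: v for k, v in merged.items() if k in viewable_fields})
    return joined


# Hypothetical rows from two domain services that share an MRN.
specimens = [
    {"medicalRecordNumber": "MRN-001", "barcode": "DEMO-0001"},
    {"medicalRecordNumber": "MRN-002", "barcode": "DEMO-0002"},
]
annotations = [
    {"medicalRecordNumber": "MRN-001", "estrogenReceptor": "POSITIVE"},
]

visible = join_without_exposing(
    specimens, annotations, "medicalRecordNumber", {"barcode", "estrogenReceptor"}
)
```

Here `visible` contains one row carrying the barcode and the biomarker value, and no row contains the MRN that was used to match them.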
Figure 5. High-level overview of the caGrid security architecture (http://cagrid.org).
Automated Honest Broker Service
The primary goal of caTRIP is to provide users access to diverse datasets that cross domains and data sources. One challenge that arises from this is deidentifying datasets that are managed by different groups, which especially affects deidentifying foreign CDEs (MRNs) that are used to link across different datasets. A typical scenario for cross-dataset deidentification is for an IRB-appointed individual to collect each dataset and deidentify them all at once, tossing away the keys to the Protected Health Information (PHI). This approach is not scalable in a distributed environment because it requires a middle-man to perform the deidentification. Whenever a new dataset is added to the system, all previously existing datasets must be deidentified again. To address this in caTRIP, we developed an Automated Honest Broker Service (AHBS), which performs the tasks once performed by a human appointed by the Institutional Review Board. A data owner programmatically submits PHI (e.g., an MRN) to the AHBS using a secure connection. The AHBS generates a random key, associates it internally in a database to the PHI (which can be encrypted), and then returns the key to the data owner. The data owner can then deidentify the dataset by replacing the PHI with the random key. When more than one data owner or dataset is involved, the AHBS will always return the same random key for the same PHI. In this way, diverse distributed datasets can be deidentified using the same keys.
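The key-issuing behaviour described above can be sketched in a few lines. This is a minimal illustration, not the AHBS implementation: the class and function names are hypothetical, the mapping is kept in an in-memory dictionary rather than a database, and the PHI is stored unencrypted for brevity.

```python
import secrets


class HonestBroker:
    """Maps each PHI value to one stable random key."""

    def __init__(self):
        # Assumption: a real service would persist this mapping securely.
        self._key_by_phi = {}

    def deidentify(self, phi):
        """Return the random key for phi, creating it on first use."""
        key = self._key_by_phi.get(phi)
        if key is None:
            key = secrets.token_hex(16)  # random; not derived from the PHI
            self._key_by_phi[phi] = key
        return key


def deidentify_dataset(rows, phi_field, broker):
    """Return copies of rows with the PHI field replaced by the broker key."""
    return [{**row, phi_field: broker.deidentify(row[phi_field])} for row in rows]


broker = HonestBroker()
# Hypothetical datasets held by two different data owners.
pathology = deidentify_dataset([{"mrn": "MRN-001", "grade": 2}], "mrn", broker)
tissue = deidentify_dataset(
    [{"mrn": "MRN-001", "site": "breast"}, {"mrn": "MRN-002", "site": "breast"}],
    "mrn",
    broker,
)
```

Because both owners use the same broker, the rows for MRN-001 in `pathology` and `tissue` carry the same key and can still be joined, while the key itself reveals nothing about the original MRN. Adding a third dataset later does not require reprocessing the first two.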
Example from the Duke Implementation
The following example illustrates how caTRIP would construct and execute a user query that joins disparate datasets. A user at Duke could ask the question: "What are all of the available tissue specimens from the breast obtained from patients tested for positive expression of the estrogen receptor?" The distributed query engine would first fetch all of the patient medical record numbers from the Clinical Annotation Engine (CAE) service that have estrogen receptor expression annotated as positive. It would then issue a query to the caTissue CORE service for tissue specimens that are marked available, were collected from the tissue site of breast, and were obtained from patients that have medical record numbers that match those returned from the previous query. The results would be a list of tissue specimens, including their barcodes, location, size, etc. The single query issued to the distributed query engine would look like this:
1  <DCQLQuery xmlns="http://caGrid.caBIG/1.0/gov.nih.nci.cagrid.dcql">
2    <TargetObject name="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl" serviceURL="http://localhost:8080/wsrf/services/cagrid/CaTissueCore">
3      <Group logicRelation="AND">
4        <Attribute name="available" predicate="EQUAL_TO" value="true"/>
5        <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCharacteristicsImpl" roleName="specimenCharacteristics">
6          <Attribute name="tissueSite" predicate="LIKE" value="breast"/>
7        </Association>
8        <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCollectionGroupImpl" roleName="specimenCollectionGroup">
9          <Association name="edu.wustl.catissuecore.domainobject.impl.ClinicalReportImpl" roleName="clinicalReport">
10           <Association name="edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl" roleName="participantMedicalIdentifier">
11             <ForeignAssociation>
12               <JoinCondition>
13                 <LeftJoin>
14                   <Object>edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl</Object>
15                   <Property>medicalRecordNumber</Property>
16                 </LeftJoin>
17                 <RightJoin>
18                   <Object>edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier</Object>
19                   <Property>medicalRecordNumber</Property>
20                 </RightJoin>
21               </JoinCondition>
22               <ForeignObject name="edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier" serviceURL="http://localhost:8080/wsrf/services/cagrid/CAE">
23                 <Association name="edu.duke.catrip.cae.domain.general.Participant" roleName="participant">
24                   <Association name="edu.pitt.cabig.cae.domain.general.AnnotationEventParameters" roleName="annotationEventParametersCollection">
25                     <Association name="edu.pitt.cabig.cae.domain.breast.BreastCancerBiomarkers" roleName="annotationSetCollection">
26                       <Attribute name="estrogenReceptor" predicate="LIKE" value="POSITIVE"/>
27                     </Association>
28                   </Association>
29                 </Association>
30               </ForeignObject>
31             </ForeignAssociation>
32           </Association>
33         </Association>
34       </Association>
35     </Group>
36   </TargetObject>
37 </DCQLQuery>
Note on line 2 that the query is rooted at caTissue CORE, where available specimens are filtered on line 4 and specimens from the tissue site of breast are filtered on line 6. Lines 11–31 define the foreign association between caTissue CORE and CAE. The medical record numbers (MRNs) of each service are joined on lines 12–21 by defining the left join criteria to be the MRN from caTissue CORE and the right join criteria to be the MRN from CAE. Line 22 selects the foreign service to be CAE, from which patients with estrogen receptor expression marked as positive are filtered on line 26. The query itself is unfolded and executed inside-out by the distributed query engine, CAE being the first service to be queried and the results used to query caTissue CORE. Note also that the association paths of the objects are defined within the query on lines 5, 8, 9, 10, 23, 24, and 25, the logic of which is hidden from the users, who are simply selecting data elements to query upon and return. The abbreviated results would look like this:
1  <queryResults targetClassname="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl">
2    <ns1:ObjectResult xmlns:ns1="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
3      <ns2:TissueSpecimen id="397" type="Fixed Tissue Block" available="true" positionDimensionOne="20" positionDimensionTwo="22" barcode="ISBN-10AB00-0013XY" comments="TISSUE IN FORMALIN 30" activityStatus="Active" quantityInGram="07" availableQuantityInGram="0199999988079071" xmlns:ns2="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
4    </ns1:ObjectResult>
5    <ns2:ObjectResult xmlns:ns2="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
6      <ns3:TissueSpecimen id="406" type="Fixed Tissue Block" available="true" positionDimensionOne="19" positionDimensionTwo="14" barcode="ISBN-10AB00-0043XY" comments="STRONG FAMILY HISTORY FOR BRCA MUTATIONS" activityStatus="Active" quantityInGram="675" availableQuantityInGram="625" xmlns:ns3="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
7    </ns2:ObjectResult>
8    <ns3:ObjectResult xmlns:ns3="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
9      <ns4:TissueSpecimen id="411" type="Fixed Tissue Block" available="true" positionDimensionOne="21" positionDimensionTwo="7" barcode="ISBN-10 AB00-0063XY" comments="NO MORE WHOLE BLOOD" activityStatus="Active" quantityInGram="4275" availableQuantityInGram="4225" xmlns:ns4="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
10   </ns3:ObjectResult>
11 </queryResults>
Note that line 1 defines the type of object being returned in the query results, and that the results themselves, found on lines 3, 6, and 9, are each of the objects that match the query criteria, with data elements of interest embedded within them as attributes.
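The inside-out evaluation order described in this example can be sketched with two in-memory stand-ins for the services. This is an illustration only: the data, identifiers, and function names are hypothetical, and the LIKE predicates of the query are simplified to exact matches.

```python
# Hypothetical rows standing in for the CAE service.
CAE_ANNOTATIONS = [
    {"medicalRecordNumber": "MRN-001", "estrogenReceptor": "POSITIVE"},
    {"medicalRecordNumber": "MRN-002", "estrogenReceptor": "NEGATIVE"},
]

# Hypothetical rows standing in for the caTissue CORE service.
CATISSUE_SPECIMENS = [
    {"id": 1, "available": True, "tissueSite": "breast", "medicalRecordNumber": "MRN-001"},
    {"id": 2, "available": False, "tissueSite": "breast", "medicalRecordNumber": "MRN-001"},
    {"id": 3, "available": True, "tissueSite": "lung", "medicalRecordNumber": "MRN-001"},
    {"id": 4, "available": True, "tissueSite": "breast", "medicalRecordNumber": "MRN-002"},
]


def query_cae_positive_mrns(annotations):
    """Inner (foreign) query: MRNs of patients annotated ER positive."""
    return {
        row["medicalRecordNumber"]
        for row in annotations
        if row["estrogenReceptor"] == "POSITIVE"
    }


def query_catissue(specimens, mrns):
    """Outer (target) query: available breast specimens for the given MRNs."""
    return [
        row
        for row in specimens
        if row["available"]
        and row["tissueSite"] == "breast"
        and row["medicalRecordNumber"] in mrns
    ]


def run_distributed_query():
    """Evaluate the foreign object first, then feed its MRNs to the target."""
    mrns = query_cae_positive_mrns(CAE_ANNOTATIONS)
    return query_catissue(CATISSUE_SPECIMENS, mrns)
```

Only specimen 1 satisfies all three conditions in this sample data. The other rows are each excluded by exactly one criterion, which makes it easy to see which filter belongs to which service.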
Conclusion
caTRIP is being implemented in a phased development cycle, meeting certain objectives at each point to provide utility as the program is extended. It is currently in its second phase of development. Phase one concluded the pilot development activities of caTRIP, providing a baseline architecture and proof-of-concept user interface. End user training was provided on a limited basis for the small multidisciplinary breast oncology group using live demonstrations and basic instructions provided through e-mail messages. Feedback was collected through e-mail as well. The multidisciplinary group was given direct access to the primary software developers and technical experts during this period. In phase two, we plan to deploy caTRIP-II along with enhancements determined through end user interaction. Furthermore, we are also in the process of engaging different groups at Duke to adopt caTRIP to help meet their research goals.
In phase one of caTRIP, use cases focused on intra-institutional problems, tearing down the silos of data-oriented systems in a single cancer center. In future iterations, we plan to extend caTRIP to tackle cross-institutional use cases. This involves extending the caTRIP interface and distributed query engine to support aggregation of data from services that expose common data elements. For example, a motivating use case amongst prospective adopters is to aggregate treatment, endpoint, pathology, and tissue banking data from a number of different cancer centers. This will allow researchers and clinicians to expand their queries and answer questions that they previously could not at their own institution. Furthermore, we plan to integrate caTRIP with analytics, specifically focusing on microarray analysis. The caBIG gene expression database, caArray, is a rich source of central or distributed gene expression data, which itself can be tied to clinical data such as pathology biomarkers. This microarray data can be analyzed and visualized dynamically using services provided by other caBIG applications, such as geWorkbench (Califano 2006, http://apiii.upmc.edu/abstracts/posterarchive/2006/watkinson.html), GenePattern [8], and Bioconductor [9]. This notion of building workflows based upon queries designed using caTRIP's simple interface can provide a powerful tool for both researchers and clinicians. Finally, we also plan to enhance the user interface, providing more sophisticated reporting and data mining functionality. All of these features will be driven by end-user requirements of the adopter sites, which is a critical facet of the caBIG initiative.
Availability and requirements
Project name: Cancer Translational Research Informatics Platform (caTRIP)
Project home page: https://gforge.nci.nih.gov/projects/catrip
Operating system(s): platform independent
Programming language: Java
Other requirements: Tomcat 5.0.28, a database (MySQL 4.1.10 or Oracle 10.x), Java version 1.5 or higher, and Apache Ant version 1.6.5
License: NCI caBIG license (non-viral)
Any restrictions to use by non-academics: no
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
PM was the Lead Architect and Project Manager of caTRIP. RD provided subject matter expertise to guide the development of caTRIP. Ram Chilukuri was the lead developer of caTRIP. RP provided subject matter expertise. KJ provided subject matter expertise. RA was the project director. AJC was the project PI.
Acknowledgements
This publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH. Information on NCRR is available at http://www.ncrr.nih.gov/. Information on Re-engineering the Clinical Research Enterprise can be obtained from http://nihroadmap.nih.gov/clinicalresearch/overview-translational.asp.
The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant. The authors would also like to acknowledge the contributions of the developers, database administrators, and system administrators at Duke, including Mark Peedin, Christopher Hubbard, Wilma Stanley, Farid Mohammad, and William Banks. We would also like to acknowledge the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC, as well as Bill Mason from 5 AM Solutions. Additionally, we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements. They include Drs Kelley Marcom, Gretchen Kimmick, Kimberly Blackwell, and Lee Wilke. Finally, we would also like to thank our liaison to NCI, Juli Klemm, for her support and intellectual contributions.
References
1. Fenstermacher D, Street C, McSherry T, Nayak V, Overby C, Feldman M: The Cancer Biomedical Informatics Grid (caBIG™). Conf Proc IEEE Eng Med Biol Soc 2005, 1:743-6.
2. Esposito NN, Chivukula M, Dabbs DJ: The ductal phenotypic expression of the E-cadherin/catenin complex in tubulolobular carcinoma of the breast: an immunohistochemical and clinicopathologic study. Mod Pathol 2007, 20(1):130-8.
3. Sahin FI, Yilmaz Z, Yagmurdur MC, Atac FB, Ozdemir BH, Karakayali H, Demirhan B, Haberal M: Clinical findings and HER-2/neu gene amplification status of breast carcinoma patients. Pathol Oncol Res 2006, 12(4):211-5.
4. Sughayer MA, Al-Khawaja MM, Massarweh S, Al-Masri M: Prevalence of hormone receptors and HER2/neu in breast cancer cases in Jordan. Pathol Oncol Res 2006, 12(2):83-6.
5. Radovic S, Babic M, Doric M, Secic S, Beslic S, Balta E, Kapetanovic E: Immunohistochemical evaluation of the HER-2 protein in the infiltrative lobular breast cancer. Med Arh 2006, 60(4):213-6.
6. Mills ME: Linkage of patient records to support continuity of care: Issues and future directions. Stud Health Technol Inform 2006, 122:320-4.
7. Saltz J, Oster S, Hastings S, Langella S, Kurc T, Sanchez W, Kher M, Manisundaram A, Shanbhag K, Covitz P: caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics 2006, 22(15):1910-6.
8. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006, 38(5):500-1.
9. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, others: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.
Pre-publication history
The pre-publication history for this paper can be accessed here:
http://www.biomedcentral.com/1472-6947/8/60/prepub
Page 13 of 13(page number not for citation purposes)
- Abstract
-
- Background
- Results
- Conclusion
-
- Background
-
- Case Study Outcomes Analysis
- Design Objectives
-
- Implementation
-
- Technical Requirements
- Software architecture
- Implementation at Duke
-
- Results and discussion
-
- Graphical User Interface
- Simple Interface
- Advanced Interface
- Distributed Query Engine
- Domain Services
- Security
- Security Authentication
- Security Authorization
- Delegation and Foreign CDEs
- Automated Honest Broker Service
- Example from the Duke Implementation
-
- Conclusion
- Availability and requirements
- Competing interests
- Authors contributions
- Acknowledgements
- References
- Pre-publication history
-
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
domain services using the users certificate The finalimplementation will likely leverage the DelegationServicefrom the Globus project The user will signal the Delega-tionService that it is permissible for a collocated service toact on the users behalf In the meantime the distributedquery engine is collocated with graphical user interface soit simply passes the users credentials to the data servicesidentified in the distributed query This bypasses some ofthe business rules defined by delegation such as thenumber of delegation hops allowed by the user but it issufficient to meet caTRIP use cases
Another change on the security workflow is the use offoreign CDEs for the purpose of joining data It is con-ceivable that a user does not have permission to view aCDE that is used for joining between two differentdomain services The classic example of this is to join
patients across services using the Medical Record NumberIn this case a special security workflow must be under-taken during authorization A level of trust is maintainedbetween the distributed query engine service and thedomain service such that foreign CDEs are returned to theDQE service with the understanding that the delegateduser is not authorized to view them but is authorized tojoin on them This workflow is only possible if the DQE isnot collocated with the client application
Automated Honest Broker ServiceThe primary goal of caTRIP is to provide users access todiverse datasets that cross domains and data sources Onechallenge that arises from this is deidentifying datasetsthat are managed by different groups which especiallyaffects deidentifying foreign CDEs (MRNs) that are usedto link across different datasets A typical scenario for
High-level overview of the caGrid security architecture httpcagridorgFigure 5High-level overview of the caGrid security architecture httpcagridorg
Page 9 of 13(page number not for citation purposes)
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
cross-dataset deidentification is for an IRB appointed indi-vidual to collect each dataset and deidentify them all atonce tossing away the keys to the Protected Health Infor-mation (PHI) This approach is not scalable in a distrib-uted environment because it requires a middle-man toperform the deidentification Whenever a new dataset isadded to the system all previously existing data sets mustbe deidentified again To address this in caTRIP we devel-oped an Automated Honest Broker Service (AHBS) whichperforms the tasks once performed by a human appointedby the Institutional Review Board A data owner program-matically submits PHI (eg an MRN) to the AHBS using asecure connection The AHBS generates a random keyassociates it internally in a database to the PHI (which canbe encrypted) and then returns the key to the data ownerThe data owner then can deidentify his dataset by replac-ing the PHI with the random key When more than onedata ownerdataset is involved the AHBS will alwaysreturn the same random key for the same PHI In this waydiverse distributed datasets can be deidentified using thesame keys
Example from the Duke ImplementationThe following example illustrates how caTRIP would con-struct and execute a user query that joins disparate data-sets A user at Duke could ask the question What are allof the available tissue specimens from the breast obtainedfrom patients tested for positive expression of the estrogenreceptor The distributed query engine would first fetchall of the patient medical record numbers from the Clini-cal Annotation Engine (CAE) service that have estrogenreceptor expression annotated as positive It would thenissue a query to the caTissue CORE service for tissue spec-imens that are marked available were collected from thetissue site of breast and were obtained from patients thathave medical record numbers that match those returnedfrom the previous query The results would be a list of tis-sue specimens including their barcodes location sizeetc The single query issued to the distributed queryengine would look like this
1 ltDCQLQuery xmlns = httpcaGridcaBIG10govnihncicagriddcqlgt
2 ltTargetObject name = httpeduwustlcatissuecoredomainobjectimplTissueSpecimenImpl serv-iceURL = httplocalhost8080wsrfservicescagridCaTissueCoregt
3 ltGroup logicRelation = ANDgt
4 ltAttribute name = available predicate =EQUAL_TO value = truegt
5 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCharacteristicsImpl roleName = specimenCharacteristicsgt
6 ltAttribute name = tissueSite predicate =LIKE value = breastgt
7 ltAssociationgt
8 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplSpecimenCollectionGroupImpl roleName = specimenCollectionGroupgt
9 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplClinicalReportImplroleName = clinicalReportgt
10 ltAssociation name = httpeduwustlcatissuecoredomainobjectimplParticipantMedicalIdentifierImpl roleName = participantMedicalIdentifiergt
11 ltForeignAssociationgt
12 ltJoinConditiongt
13 ltLeftJoingt
14ltObjectgteduwustlcatissuecoredomai
nobjectimplParticipantMedicalIdentifierImplltObjectgt
15ltPropertygtmedicalRecordNumberlt
Propertygt
16 ltLeftJoingt
17 ltRightJoingt
18ltObjectgtedudukecatripcaedomaing
eneralParticipantMedicalIdentifierltObjectgt
19ltPropertygtmedicalRecordNumberlt
Propertygt
20 ltRightJoingt
21 ltJoinConditiongt
22 ltForeignObject name =httpedudukecatripcaedomaingeneralParticipantMedicalI dentifier serviceURL = httplocalhost8080wsrfservicescagridCAEgt
Page 10 of 13(page number not for citation purposes)
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
23 ltAssociation name = httpedudukecatripcaedomaingeneralParticipantroleName = participantgt
24 ltAssociation name =httpedupittcabigcaedomaingeneralAnnotationEventPa rameters roleName = annotationEventParameter-sCollectiongt
25 ltAssociation name =httpedupittcabigcaedomainbreastBreastCancerBiomark ers roleName = annotationSetCollectiongt
26 ltAttribute name = estrogenRe-ceptor predicate = LIKE value = POSITIVEgt
27 ltAssociationgt
28 ltAssociationgt
29 ltAssociationgt
30 ltForeignObjectgt
31 ltForeignAssociationgt
32 ltAssociationgt
33 ltAssociationgt
ltAssociationgt
ltGroupgt
ltTargetObjectgt
ltDCQLQuerygt
Note on line 1 that the query is rooted at caTissue COREwhere available specimens are filtered on line 4 and spec-imens from the tissue site of breast are filtered on line 6Lines 11ndash31 define the foreign association between caTis-sue CORE and CAE The medical record numbers (MRNs)of each service are joined on lines 12ndash21 by defining theleft join criteria to be the MRN from caTissue CORE andthe right join criteria to the be MRN from CAE Line 22selects the foreign service to be CAE from which patientswith estrogen receptor expression marked as positive arefiltered on line 26 The query itself is unfolded and exe-cuted inside-out by the distributed query engine CAEbeing the first service to be queried and the results used toquery caTissue CORE Note also that the association pathsof the objects are defined within the query on lines 5 8 910 23 24 and 25 the logic of which is hidden from theusers who are simply selecting data elements to query
upon and return The abbreviated results would look likethis
1 ltqueryResults targetClassname =httpeduwustlcatissuecoredomainobjectimplTissueSpeci menImplgt
2 ltns1ObjectResult xmlnsns1 = httpCQLcaBIG1govnihncicagridCQLResultSetgt
3 ltns2TissueSpecimen id = 397 type = Fixed TissueBlock available = true positionDimensionOne = 20positionDimensionTwo = 22 barcode = ISBN-10AB00-0013XY comments = TISSUE IN FORMALIN 30activityStatus = Active quantityInGram = 07 availa-bleQuantityInGram = 0199999988079071 xmlnsns2= gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt
4 ltns1ObjectResultgt
5 ltns2ObjectResult xmlnsns2 = httpCQLcaBIG1govnihncicagridCQLResultSetgt
6 ltns3TissueSpecimen id = 406 type = Fixed TissueBlock available = true positionDimensionOne = 19positionDimensionTwo = 14 barcode = ISBN-10AB00-0043XY comments = STRONG FAMILY HISTORYFOR BRCA MUTATIONS activityStatus = Active quan-tityInGram = 675 availableQuantityInGram = 625xmlnsns3 = gmecaCOREcabig30eduwustlcatissuecoredomainobjectgt
7 ltns2ObjectResultgt
8 ltns3ObjectResult xmlnsns3 = httpCQLcaBIG1govnihncicagridCQLResultSetgt
9 ltns4TissueSpecimen id = 411 type = Fixed TissueBlock available = true positionDimensionOne = 21positionDimensionTwo = 7 barcode = ISBN-10 AB00-0063XY comments = NO MORE WHOLE BLOODactivityStatus = Active quantityInGram = 4275 avail-ableQuantityInGram = 4225 xmlnsns4 =gmecaCOREcabig30eduwustlcatissuecoredomainob jectgt
10 ltns3ObjectResultgt
11 ltqueryResultsgt
Note that line 1 defines the type of object being returnedin the query results and that the results themselves found
Page 11 of 13(page number not for citation purposes)
BMC Medical Informatics and Decision Making 2008 860 httpwwwbiomedcentralcom1472-6947860
on lines 3 6 and 9 are each of the objects that match thequery criteria with data elements of interest embeddedwithin them as attributes
ConclusioncaTRIP is being implemented in a phased-developmentcycle meeting certain objectives at each point to provideutility as the program is extended It is currently in its sec-ond phase of development Phase one concluded the pilotdevelopment activities of caTRIP providing a baselinearchitecture and proof-of-concept user interface End usertraining was provided on a limited basis for the smallmultidisciplinary breast oncology group using live dem-onstrations and basic instructions provided through e-mail messages Feedback was collected through e-mail aswell The multidisciplinary group was given direct accessto the primary software developers and technical expertsduring this period In phase two we plan to deploycaTRIP-II along with enhancements determined throughend user interaction Furthermore we are also in the proc-ess of engaging different groups at Duke to adopt caTRIPto help meet their research goals
In phase one of caTRIP use cases focused on intra-institu-tional problems tearing down the silos of data-orientedsystems in a single cancer center In future iterations weplan to extend caTRIP to tackle cross-institutional usecases This involves extending the caTRIP interface anddistributed query engine to support aggregation of datafrom services that expose common data elements Forexample a motivating use case amongst prospectiveadopters is to aggregate treatment endpoint pathologyand tissue banking data from a number of different cancercenters This will allow researchers and clinicians toexpand their queries and answer questions that they pre-viously could not at their own institution Furthermorewe plan to integrate caTRIP with analytics specificallyfocusing on microarray analysis The caBIG gene expres-sion database caArray is a rich source of central or distrib-uted gene expression data which itself can be tied toclinical data such as pathology biomarkers This microar-ray data can be analyzed and visualized dynamically usingservices provided by other caBIG applications such asgeWorkbench Califano 2006 httpapiiiupmceduabstractsposterarchive2006watkinsonhtml GenePat-tern [8] and Bioconductor[9] This notion of buildingworkflows based upon queries designed using caTRIPssimple interface can provide a powerful tool for bothresearchers and clinicians Finally we also plan toenhance the user interface providing more sophisticatedreporting and data mining functionality All of these fea-tures will be driven by end-user requirements of the adop-ter sites which is a critical facet to the caBIG initiative
Availability and requirementsProject name Cancer Translational Research InformaticsPlatform (caTRIP)
Project home page httpsgforgencinihgovprojectscatrip
Operating system(s) platform independent
Programming language Java
Other requirements Tomcat 5028 a database (MySQL4110 or Oracle 10x) Java Version 15 or higher andApache Ant version 165
License NCI caBIG license (non-viral)
Any restrictions to use by non-academics no
Competing interestsThe authors declare that they have no competing interests
Authors contributionsPM was the Lead Architect and Project Manager of caTRIPRD provided subject matter expertise to guide the devel-opment of caTRIP Ram Chilukuri was the lead developerof caTRIP RP provided subject matter expertise KJ pro-vided subject matter expertise RA was the project directorAJC was the project PI
AcknowledgementsThis publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR) a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH Information on NCRR is available at httpwwwncrrnihgov Information on Re-engineer-ing the Clinical Research Enterprise can be obtained from httpnihroad mapnihgovclinicalresearchoverview-translationalasp
The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant The authors would like also to acknowledge the contributions of the developers database administrators and system administrators at Duke including Mark Peedin Christopher Hubbard Wilma Stanley Farid Mohammad and William Banks We would also like to thank the development efforts of Srini Akkala and Sanjeev Agarwal from Semantic Bits LLC as well as Bill Mason from at 5 AM Solutions Addition-ally we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements They include Drs Kelley Marcom Gretchen Kimmick Kimberly Blackwell and Lee Wilke Finally we would also like to thank our liaison to NCI Juli Klemm for her support and intellectual contributions
References
1. Fenstermacher D, Street C, McSherry T, Nayak V, Overby C, Feldman M: The Cancer Biomedical Informatics Grid (caBIG™). Conf Proc IEEE Eng Med Biol Soc 2005, 1:743-6.
2. Esposito NN, Chivukula M, Dabbs DJ: The ductal phenotypic expression of the E-cadherin/catenin complex in tubulolobular carcinoma of the breast: an immunohistochemical and clinicopathologic study. Mod Pathol 2007, 20(1):130-8.
3. Sahin FI, Yilmaz Z, Yagmurdur MC, Atac FB, Ozdemir BH, Karakayali H, Demirhan B, Haberal M: Clinical findings and HER-2/neu gene amplification status of breast carcinoma patients. Pathol Oncol Res 2006, 12(4):211-5.
4. Sughayer MA, Al-Khawaja MM, Massarweh S, Al-Masri M: Prevalence of hormone receptors and HER2/neu in breast cancer cases in Jordan. Pathol Oncol Res 2006, 12(2):83-6.
5. Radovic S, Babic M, Doric M, Secic S, Beslic S, Balta E, Kapetanovic E: Immunohistochemical evaluation of the HER-2 protein in the infiltrative lobular breast cancer. Med Arh 2006, 60(4):213-6.
6. Mills ME: Linkage of patient records to support continuity of care: Issues and future directions. Stud Health Technol Inform 2006, 122:320-4.
7. Saltz J, Oster S, Hastings S, Langella S, Kurc T, Sanchez W, Kher M, Manisundaram A, Shanbhag K, Covitz P: caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics 2006, 22(15):1910-6.
8. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006, 38(5):500-1.
9. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, others: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.
Pre-publication history
The pre-publication history for this paper can be accessed here:
http://www.biomedcentral.com/1472-6947/8/60/prepub
- Abstract
  - Background
  - Results
  - Conclusion
- Background
  - Case Study Outcomes Analysis
  - Design Objectives
- Implementation
  - Technical Requirements
  - Software architecture
  - Implementation at Duke
- Results and discussion
  - Graphical User Interface
  - Simple Interface
  - Advanced Interface
  - Distributed Query Engine
  - Domain Services
  - Security
  - Security Authentication
  - Security Authorization
  - Delegation and Foreign CDEs
  - Automated Honest Broker Service
  - Example from the Duke Implementation
- Conclusion
- Availability and requirements
- Competing interests
- Authors' contributions
- Acknowledgements
- References
- Pre-publication history
BMC Medical Informatics and Decision Making 2008, 8:60 http://www.biomedcentral.com/1472-6947/8/60
cross-dataset deidentification is for an IRB-appointed individual to collect each dataset and deidentify them all at once, tossing away the keys to the Protected Health Information (PHI). This approach is not scalable in a distributed environment because it requires a middle-man to perform the deidentification. Whenever a new dataset is added to the system, all previously existing datasets must be deidentified again. To address this in caTRIP, we developed an Automated Honest Broker Service (AHBS), which performs the tasks once performed by a human appointed by the Institutional Review Board (IRB). A data owner programmatically submits PHI (e.g., an MRN) to the AHBS over a secure connection. The AHBS generates a random key, associates it internally in a database with the PHI (which can be encrypted), and then returns the key to the data owner. The data owner can then deidentify the dataset by replacing the PHI with the random key. When more than one data owner or dataset is involved, the AHBS will always return the same random key for the same PHI. In this way, diverse distributed datasets can be deidentified using the same keys.
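The key-assignment behavior described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the caTRIP implementation: the class name HonestBrokerSketch and its methods are hypothetical, the mapping is held in memory rather than in a database, the secure connection and PHI encryption are omitted, and a random UUID stands in for the random key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for an automated honest broker. */
class HonestBrokerSketch {
    // PHI value (e.g. an MRN) -> random key. A real service would persist this
    // mapping in a database and could encrypt the PHI; both are omitted here.
    private final Map<String, String> keyByPhi = new ConcurrentHashMap<>();

    /** Returns the random key for a PHI value, creating it on first request. */
    String keyFor(String phi) {
        return keyByPhi.computeIfAbsent(phi, unused -> UUID.randomUUID().toString());
    }

    /** What a data owner would do: replace each PHI value with its key. */
    List<String> deidentify(List<String> phiValues) {
        List<String> keys = new ArrayList<>();
        for (String phi : phiValues) {
            keys.add(keyFor(phi));
        }
        return keys;
    }

    public static void main(String[] args) {
        HonestBrokerSketch broker = new HonestBrokerSketch();
        // Two data owners submit overlapping, made-up MRNs.
        List<String> tissueBankKeys = broker.deidentify(Arrays.asList("MRN-001", "MRN-002"));
        List<String> pathologyKeys = broker.deidentify(Arrays.asList("MRN-002", "MRN-003"));
        // The shared patient (MRN-002) carries the same key in both datasets,
        // so the deidentified datasets can still be joined on that key.
        System.out.println(tissueBankKeys.get(1).equals(pathologyKeys.get(0))); // prints true
    }
}
```

Because the broker, not the data owner, holds the PHI-to-key mapping, a new dataset can be deidentified when it is added without reprocessing the datasets already in the system.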
Example from the Duke Implementation
The following example illustrates how caTRIP would construct and execute a user query that joins disparate datasets. A user at Duke could ask the question, "What are all of the available tissue specimens from the breast obtained from patients who tested positive for expression of the estrogen receptor?" The distributed query engine would first fetch, from the Clinical Annotation Engine (CAE) service, the medical record numbers of all patients whose estrogen receptor expression is annotated as positive. It would then issue a query to the caTissue CORE service for tissue specimens that are marked available, were collected from the tissue site of breast, and were obtained from patients whose medical record numbers match those returned by the previous query. The result would be a list of tissue specimens, including their barcodes, location, size, etc. The single query issued to the distributed query engine would look like this:
1 <DCQLQuery xmlns="http://caGrid.caBIG/1.0/gov.nih.nci.cagrid.dcql">
2   <TargetObject name="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl" serviceURL="http://localhost:8080/wsrf/services/cagrid/CaTissueCore">
3     <Group logicRelation="AND">
4       <Attribute name="available" predicate="EQUAL_TO" value="true"/>
5       <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCharacteristicsImpl" roleName="specimenCharacteristics">
6         <Attribute name="tissueSite" predicate="LIKE" value="breast"/>
7       </Association>
8       <Association name="edu.wustl.catissuecore.domainobject.impl.SpecimenCollectionGroupImpl" roleName="specimenCollectionGroup">
9         <Association name="edu.wustl.catissuecore.domainobject.impl.ClinicalReportImpl" roleName="clinicalReport">
10           <Association name="edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl" roleName="participantMedicalIdentifier">
11             <ForeignAssociation>
12               <JoinCondition>
13                 <LeftJoin>
14                   <Object>edu.wustl.catissuecore.domainobject.impl.ParticipantMedicalIdentifierImpl</Object>
15                   <Property>medicalRecordNumber</Property>
16                 </LeftJoin>
17                 <RightJoin>
18                   <Object>edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier</Object>
19                   <Property>medicalRecordNumber</Property>
20                 </RightJoin>
21               </JoinCondition>
22               <ForeignObject name="edu.duke.catrip.cae.domain.general.ParticipantMedicalIdentifier" serviceURL="http://localhost:8080/wsrf/services/cagrid/CAE">
23                 <Association name="edu.duke.catrip.cae.domain.general.Participant" roleName="participant">
24                   <Association name="edu.pitt.cabig.cae.domain.general.AnnotationEventParameters" roleName="annotationEventParametersCollection">
25                     <Association name="edu.pitt.cabig.cae.domain.breast.BreastCancerBiomarkers" roleName="annotationSetCollection">
26                       <Attribute name="estrogenReceptor" predicate="LIKE" value="POSITIVE"/>
27                     </Association>
28                   </Association>
29                 </Association>
30               </ForeignObject>
31             </ForeignAssociation>
32           </Association>
33         </Association>
      </Association>
    </Group>
  </TargetObject>
</DCQLQuery>
Note on line 2 that the query is rooted at caTissue CORE, where available specimens are filtered on line 4 and specimens from the tissue site of breast are filtered on line 6. Lines 11-31 define the foreign association between caTissue CORE and CAE. The medical record numbers (MRNs) of each service are joined on lines 12-21 by defining the left join criterion to be the MRN from caTissue CORE and the right join criterion to be the MRN from CAE. Line 22 selects the foreign service to be CAE, from which patients with estrogen receptor expression marked as positive are filtered on line 26. The query itself is unfolded and executed inside-out by the distributed query engine, with CAE being the first service to be queried and its results used to query caTissue CORE. Note also that the association paths of the objects are defined within the query on lines 5, 8, 9, 10, 23, 24, and 25, the logic of which is hidden from users, who simply select data elements to query upon and return. The abbreviated results would look like this:
1 <queryResults targetClassname="edu.wustl.catissuecore.domainobject.impl.TissueSpecimenImpl">
2   <ns1:ObjectResult xmlns:ns1="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
3     <ns2:TissueSpecimen id="397" type="Fixed Tissue Block" available="true" positionDimensionOne="20" positionDimensionTwo="22" barcode="ISBN-10AB00-0013XY" comments="TISSUE IN FORMALIN 30" activityStatus="Active" quantityInGram="0.7" availableQuantityInGram="0.199999988079071" xmlns:ns2="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
4   </ns1:ObjectResult>
5   <ns2:ObjectResult xmlns:ns2="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
6     <ns3:TissueSpecimen id="406" type="Fixed Tissue Block" available="true" positionDimensionOne="19" positionDimensionTwo="14" barcode="ISBN-10AB00-0043XY" comments="STRONG FAMILY HISTORY FOR BRCA MUTATIONS" activityStatus="Active" quantityInGram="675" availableQuantityInGram="625" xmlns:ns3="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
7   </ns2:ObjectResult>
8   <ns3:ObjectResult xmlns:ns3="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLResultSet">
9     <ns4:TissueSpecimen id="411" type="Fixed Tissue Block" available="true" positionDimensionOne="21" positionDimensionTwo="7" barcode="ISBN-10 AB00-0063XY" comments="NO MORE WHOLE BLOOD" activityStatus="Active" quantityInGram="4275" availableQuantityInGram="4225" xmlns:ns4="gme://caCORE.cabig/3.0/edu.wustl.catissuecore.domainobject"/>
10   </ns3:ObjectResult>
11 </queryResults>
Note that line 1 defines the type of object being returned in the query results, and that the results themselves, found on lines 3, 6, and 9, are the objects that match the query criteria, with the data elements of interest embedded within them as attributes.
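The inside-out execution order described for the example query can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the class InsideOutJoinSketch, the record shapes, and the sample data are hypothetical, plain Java lists stand in for the CAE and caTissue CORE grid services, and case-insensitive equality stands in for the LIKE predicate. It is not the caTRIP query engine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory sketch of inside-out execution of a two-service join. */
class InsideOutJoinSketch {

    /** Stand-in for a CAE annotation record (made-up shape). */
    static final class Annotation {
        final String mrn;
        final String estrogenReceptor;

        Annotation(String mrn, String estrogenReceptor) {
            this.mrn = mrn;
            this.estrogenReceptor = estrogenReceptor;
        }
    }

    /** Stand-in for a caTissue CORE specimen record (made-up shape). */
    static final class Specimen {
        final int id;
        final String mrn;
        final String tissueSite;
        final boolean available;

        Specimen(int id, String mrn, String tissueSite, boolean available) {
            this.id = id;
            this.mrn = mrn;
            this.tissueSite = tissueSite;
            this.available = available;
        }
    }

    /** Step 1: the foreign (innermost) query runs first and yields the join keys. */
    static Set<String> mrnsWithPositiveEstrogenReceptor(List<Annotation> cae) {
        Set<String> mrns = new HashSet<>();
        for (Annotation a : cae) {
            if ("POSITIVE".equalsIgnoreCase(a.estrogenReceptor)) {
                mrns.add(a.mrn);
            }
        }
        return mrns;
    }

    /** Step 2: the outer query applies its own filters plus the MRNs from step 1. */
    static List<Specimen> availableBreastSpecimens(List<Specimen> caTissue, Set<String> mrns) {
        List<Specimen> matches = new ArrayList<>();
        for (Specimen s : caTissue) {
            if (s.available && "breast".equalsIgnoreCase(s.tissueSite) && mrns.contains(s.mrn)) {
                matches.add(s);
            }
        }
        return matches;
    }

    /** Runs both steps in inside-out order. */
    static List<Specimen> run(List<Annotation> cae, List<Specimen> caTissue) {
        return availableBreastSpecimens(caTissue, mrnsWithPositiveEstrogenReceptor(cae));
    }

    public static void main(String[] args) {
        List<Annotation> cae = Arrays.asList(
                new Annotation("MRN-A", "POSITIVE"),
                new Annotation("MRN-B", "NEGATIVE"),
                new Annotation("MRN-C", "POSITIVE"));
        List<Specimen> caTissue = Arrays.asList(
                new Specimen(1, "MRN-A", "breast", true),
                new Specimen(2, "MRN-A", "breast", false),
                new Specimen(3, "MRN-B", "breast", true),
                new Specimen(4, "MRN-C", "lung", true),
                new Specimen(5, "MRN-C", "breast", true));
        for (Specimen s : run(cae, caTissue)) {
            System.out.println("specimen id " + s.id); // prints ids 1 and 5
        }
    }
}
```

Running the inner query first means only the join keys (here, MRNs) cross between services, and the outer service returns only the specimens that satisfy both sets of criteria.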
Conclusion
caTRIP is being implemented in a phased development cycle, meeting certain objectives at each point to provide utility as the program is extended. It is currently in its second phase of development. Phase one concluded the pilot development activities of caTRIP, providing a baseline architecture and proof-of-concept user interface. End-user training was provided on a limited basis for the small multidisciplinary breast oncology group using live demonstrations and basic instructions provided through e-mail messages. Feedback was collected through e-mail as well. The multidisciplinary group was given direct access to the primary software developers and technical experts during this period. In phase two, we plan to deploy caTRIP-II along with enhancements determined through end-user interaction. Furthermore, we are also in the process of engaging different groups at Duke to adopt caTRIP to help meet their research goals.
In phase one of caTRIP, use cases focused on intra-institutional problems: tearing down the silos of data-oriented systems in a single cancer center. In future iterations, we plan to extend caTRIP to tackle cross-institutional use cases. This involves extending the caTRIP interface and distributed query engine to support aggregation of data from services that expose common data elements. For example, a motivating use case amongst prospective adopters is to aggregate treatment, endpoint, pathology, and tissue banking data from a number of different cancer centers. This will allow researchers and clinicians to expand their queries and answer questions that they previously could not at their own institution. Furthermore, we plan to integrate caTRIP with analytics, specifically focusing on microarray analysis. The caBIG gene expression database, caArray, is a rich source of central or distributed gene expression data, which itself can be tied to clinical data such as pathology biomarkers. This microarray data can be analyzed and visualized dynamically using services provided by other caBIG applications, such as geWorkbench (Califano 2006, http://apiii.upmc.edu/abstracts/posterarchive/2006/watkinson.html), GenePattern [8], and Bioconductor [9]. This notion of building workflows based upon queries designed using caTRIP's simple interface can provide a powerful tool for both researchers and clinicians. Finally, we also plan to enhance the user interface, providing more sophisticated reporting and data mining functionality. All of these features will be driven by the end-user requirements of the adopter sites, which is a critical facet of the caBIG initiative.
Availability and requirements
Project name: Cancer Translational Research Informatics Platform (caTRIP)
Project home page: https://gforge.nci.nih.gov/projects/catrip/
Operating system(s): platform independent
Programming language: Java
Other requirements: Tomcat 5.0.28; a database (MySQL 4.1.10 or Oracle 10.x); Java version 1.5 or higher; and Apache Ant version 1.6.5
License: NCI caBIG license (non-viral)
Any restrictions to use by non-academics: no
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
PM was the Lead Architect and Project Manager of caTRIP. RD provided subject matter expertise to guide the development of caTRIP. Ram Chilukuri was the lead developer of caTRIP. RP provided subject matter expertise. KJ provided subject matter expertise. RA was the project director. AJC was the project PI.
Acknowledgements
This publication was made possible by Grant Number 1 UL1 RR024128-01 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH. Information on NCRR is available at http://www.ncrr.nih.gov/. Information on Re-engineering the Clinical Research Enterprise can be obtained from http://nihroadmap.nih.gov/clinicalresearch/overview-translational.asp.
The authors would additionally like to acknowledge NCI for funding the project under the 1435-04-04-CT-73980 grant and Breast SPORE under the 5P50 CA068438-09 grant. The authors would also like to acknowledge the contributions of the developers, database administrators, and system administrators at Duke, including Mark Peedin, Christopher Hubbard, Wilma Stanley, Farid Mohammad, and William Banks. We would also like to acknowledge the development efforts of Srini Akkala and Sanjeev Agarwal from SemanticBits, LLC, as well as Bill Mason from 5AM Solutions. Additionally, we would like to thank the subject matter experts at Duke for their contributions in driving the use cases and requirements. They include Drs Kelley Marcom, Gretchen Kimmick, Kimberly Blackwell, and Lee Wilke. Finally, we would like to thank our liaison to NCI, Juli Klemm, for her support and intellectual contributions.
References
1. Fenstermacher D, Street C, McSherry T, Nayak V, Overby C, Feldman M: The Cancer Biomedical Informatics Grid (caBIG™). Conf Proc IEEE Eng Med Biol Soc 2005, 1:743-6.
BMC Medical Informatics and Decision Making 2008, 8:60 http://www.biomedcentral.com/1472-6947/8/60
2. Esposito NN, Chivukula M, Dabbs DJ: The ductal phenotypic expression of the E-cadherin/catenin complex in tubulolobular carcinoma of the breast: an immunohistochemical and clinicopathologic study. Mod Pathol 2007, 20(1):130-8.
3. Sahin FI, Yilmaz Z, Yagmurdur MC, Atac FB, Ozdemir BH, Karakayali H, Demirhan B, Haberal M: Clinical findings and HER-2/neu gene amplification status of breast carcinoma patients. Pathol Oncol Res 2006, 12(4):211-5.
4. Sughayer MA, Al-Khawaja MM, Massarweh S, Al-Masri M: Prevalence of hormone receptors and HER2/neu in breast cancer cases in Jordan. Pathol Oncol Res 2006, 12(2):83-6.
5. Radovic S, Babic M, Doric M, Secic S, Beslic S, Balta E, Kapetanovic E: Immunohistochemical evaluation of the HER-2 protein in the infiltrative lobular breast cancer. Med Arh 2006, 60(4):213-6.
6. Mills ME: Linkage of patient records to support continuity of care: Issues and future directions. Stud Health Technol Inform 2006, 122:320-4.
7. Saltz J, Oster S, Hastings S, Langella S, Kurc T, Sanchez W, Kher M, Manisundaram A, Shanbhag K, Covitz P: caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics 2006, 22(15):1910-6.
8. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006, 38(5):500-1.
9. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.
Pre-publication history
The pre-publication history for this paper can be accessed here:
http://www.biomedcentral.com/1472-6947/8/60/prepub
- Abstract
  - Background
  - Results
  - Conclusion
- Background
  - Case Study: Outcomes Analysis
  - Design Objectives
- Implementation
  - Technical Requirements
  - Software architecture
  - Implementation at Duke
- Results and discussion
  - Graphical User Interface
  - Simple Interface
  - Advanced Interface
  - Distributed Query Engine
  - Domain Services
  - Security
  - Security: Authentication
  - Security: Authorization
  - Delegation and Foreign CDEs
  - Automated Honest Broker Service
  - Example from the Duke Implementation
- Conclusion
- Availability and requirements
- Competing interests
- Authors' contributions
- Acknowledgements
- References
- Pre-publication history