enabling simultaneous analysis of multiple cohort studies: a brisskit use case
DESCRIPTION
A description of BRISSKit, an open source tool that may be used to combine datasets held in different locations and analyse them for the purpose of research. Talk give by Jonathan Tedds of Leicester Uni. for the Data Management in Practice workshop, which took place on Nov 14th 2013 at the London School of Hygiene and Tropical MedicineTRANSCRIPT
Enabling simultaneous analysis of multiple cohort studies without accessing the full
dataset: a BRISSKit use case
Dr Jonathan Tedds [email protected] @jtedds
Senior Research Fellow, Health Informatics & Interdisciplinary Research Group,Department of Health Sciences (University of Leicester)
PI #BRISSKit http://www.brisskit.le.ac.uk
http://www.astrogrid.org(April 2008 1st public release)
Data Reuse: asking new questions
Hubble Space Telescope• Papers based upon reuse of archived observations now exceed those based on the use
described in the original proposal.– http://archive.stsci.edu/hst/bibliography/pubstat.html
• See also work by Piwowar & Vision re life sciences: “Data reuse and the open data citation advantage”– http://peerj.com/preprints/1/
Science as an Open Enterprise ReportWhy open?• As a first step towards this intelligent openness,
data that underpin a journal article should be made concurrently available in an accessible database
• We are now on the brink of an achievable aim: for all science literature to be online, for all of the data to be online and for the two to be interoperable. [p.7]
• Royal Society June 2012, Science as an Open Enterprise, http://royalsociety.org/policy/projects/science-public-enterprise/report/
• Issues linking data to the scientific record:– Data persistence– Data and metadata quality– Attribution and credit for data producers
• Geoffrey Boulton (Edinburgh), Lead author:– “Science has been sleepwalking into crisis of
replicability...and of the credibility of science”– “Publishing articles without making the data
available is scientific malpractice”
BRISSKit context: The I4Health goal of applying knowledge engineering to close the
‘ICT gap’ between research and healthcare (Beck, T. et al 2012)
Biomedical Research Infrastructure Software Service Kit
A vision for cloud-based open source research applications #BRISSKit
http://www.brisskit.le.ac.uk
http://www.brisskit.le.ac.uk
BRISSKit USPs Integrated support for core research processes
Well-established mature open source applications as protoyped in Cardiovascular, Respiratory, Cancer Theme Biobank: UK customised
A platform for seamless management and integration between applications
An API allows integration with existing clinical systems
Easy set up, use and administration through browser (including on mobile devices)
Capability of being hosted in any compliant cloud provider including UHL (NHS information governance)
www.brisskit.le.ac.uk Email: [email protected]
BRISSKit Community & Hack Event, Oct 2012
http://www.brisskit.le.ac.uk/node/35
BRISSKit Information Governance& Security Management Work Stream
- Dr Andrew Burnham leading
1. Information Governance Toolkit - analysis of Department of Health (DoH/NHS) IGT requirements vs. BRISSKit organisation/project and services/toolsa) Hosted Secondary Use Team/project (Hosted IGT)b) Acute Trust (Acute Trust IGT)
2. IG Training Tool (NHS – University is registered)
3. Pseudonymisation requirements
4. Data Management Plan
5. IT Security & standards – Penetration Testing & Security Testing
6. Other NHS Standards/Requirements:- Care Records Guarantee- NHS Constitution- NHS Records Management- Patient Safety DSCN 14/2009, 18/2009
The semantic bridge
?
OBiBa Onyx
Records participantconsent, questionnairedata and primaryspecimen IDs
i2b2
Cohort selection and data querying
Bio-ontology!
• Deploy solutions in international bio banking initiatives
• Investment through Prof Paul Burton (Health Sciences at Leicester/Bristol) & international collaborations
• Building on strong informatics expertise at University of Leicester in partnership with the University Hospitals Leicester Trust• Cardiovascular, Respiratory & Lifestyle
BRUs• Cancer Theme Biobank• Genomics etc
BRISSKit and Bio Banking
Contemporary biobanking: meeting the “data” challenge
Large data sets, why bother?
• Sample size• Depth of phenotyping• Quality of measurementAll critical
How big is BIG?• The direct effect of a gene• 2,000 cases minimum, 10,000 cases better
• Environmental and life-style factors• Highly context specific: from hundreds to tens of
thousands of cases• Gene-lifestyle and gene-gene “interactions”• Absolute minimum 10,000, usually need at least
20,000, a comprehensive platform needs at least 50,000• Scientifically fundamental
The bottom line Effective data access is crucial Effective joint analysis is essential too (integration) Fundamental challenges• Scientific harmonization• Restriction on access to individual level data• Streamlined access to multiple data sets
Central to the integrative aims of P3G, PHOEBE, BioSHaRE-eu etc
Also fundamental to the aims of potential BRISSKit users
18
Horizontally partitioned data
Data
PREVENDPREVEND
1958BC1958BC
Data
KORAGENKORAGEN
Joint centralanalysis
Data
FINRISKFINRISK
Data
How can we undertake a full joint analysis using multiple data sources if the data cannot physically be pooled? Ethico-legal constraints Physical size of the data objects Intellectual property issues
DataSHIELD: a novel solutionTake analysis to data not data to analysis
One step analyses: simple
Iterative analyses: parallel processes linked together by entirely non-identifying summary statistics
Typically produces mathematically identical results to fitting a single model to all the data held in one pooled data set
Take analysis to data not data to analysis
One step analyses: simple
Iterative analyses: parallel processes linked together by entirely non-identifying summary statistics
Typically produces mathematically identical results to fitting a single model to all the data held in one pooled data set
R
R
R RAnalysis
Computer
Web services
Web servicesWeb services
Data computer OpalFinrisk
OpalFinrisk
OpalPrevend
OpalPrevend
Opal1958BC
Opal1958BC
Data computer Data computer
BioSHaREweb site
BioSHaREweb site
Web services
Horizontal DataSHIELD
R
R
R RAnalysis
Computer
Web services
Web servicesWeb services
Data computer OpalFinrisk
OpalFinrisk
OpalPrevend
OpalPrevend
Opal1958BC
Opal1958BC
Data computer Data computer
BioSHaREweb site
BioSHaREweb site
Web services
Horizontal DataSHIELD
Opal includes• DataSHIELD• DataSHaPER• Researcher ID
R
R
R RAnalysis
Computer
Web services
Web servicesWeb services
Data computer OpalFinrisk
OpalFinrisk
OpalPrevend
OpalPrevend
Opal1958BC
Opal1958BC
Data computer Data computer
Horizontal DataSHIELD
BioSHaREweb site
BioSHaREweb site
Web services
Work in progress:• Embed Opal in BRISSKit • ALSPAC• MRC e-HIRCS• +more…
Opal1958BC
Opal1958BC
Opal1958BC
Opal1958BC
BRISSKit gains• DataSHIELD• DataSHaPER• Researcher ID
Opal1958BC
Opal1958BC
Opal gains• Direct interface with more tools• I2B2 functionality• Potential for
enhanced user interface
BRISSKit gains• DataSHIELD• DataSHaPER• Researcher ID
Opal1958BC
Opal1958BC
BRISSKit gains• DataSHIELD• DataSHaPER• Researcher ID
Opal gains• Direct interface with more tools• I2B2 functionality• Potential for
enhanced user interface
Everybody gains• Enhanced combined functionality - better science• Bigger user group - greater portability• Greater potential to become a sustainable standard
Opal1958BC
Opal1958BC
BRISSKit gains• DataSHIELD• DataSHaPER• Researcher ID
Everybody gains• Enhanced combined functionality - better science• Bigger user group - greater portability• Greater potential to become a sustainable standard
Opal gains• Direct interface with more tools• I2B2 functionality• Potential for
enhanced user interface
Enhanced joint analysis with• Ethico-legal constraints e.g.US/Europe biobanks• Intellectual property issues e.g. H3AFRICA
The bottom line Effective data access is crucial Effective joint analysis is essential too (integration) Fundamental challenges• Scientific harmonization• Restriction on access to individual level data• Streamlined access to multiple data sets
Central to the integrative aims of P3G, PHOEBE, BioSHaRE-eu etc
Also fundamental to the aims of potential BRISSKit users
29