workshop - finding and accessing data - cambridge august 22 2016
TRANSCRIPT
We are always looking for data
Finding and accessing human genomic data for research
Cambridge, 22nd August 2016
Slides will be made available online
Tweets welcome #CamFindData
Outline of the day
- Data sources and data access (Charlotte)
- Case study: University of Cambridge
- Coffee break
- Introduction to Repositive (Fiona)
- Hands-on session: searching for data
- Round up and closure
On-line tools used during the workshop
To ask questions during the presentation and answer questions:
go to slido.com
enter event code: 1641
To leave feedback on the workshop:
http://tinyurl.com/feedback220816
We are on twitter: @glyn_dk @repositiveio @DNAdigest @CamOpenData
1. What data are you looking for?
Join at slido.com with the event code #1641
This workshop will focus on finding and accessing human genomic data.
… why would you be looking for genomic data for your research?
How much data do you need to publish a paper?
2001: 1 human genome
2012: 1000 Genomes (1092 genomes, since increased to ~2500)
2015: UK10K & deCODE (>100k individuals); Cancer Genome Atlas ~11,000 genomes; ExAC consortium 65,000 exomes
?
Case studies
Raquel, PhD Student, London, UK.
Researching genes associated with rare eye disorders.
Problems:
- Doesn’t know where to look for data.
- Doesn’t know if data even exists.
“I gave up on finding the data - it was very time consuming and not proving fruitful – so I started focusing more on generating my own data.”
Mahantesh, Academic Researcher, Taipei, Taiwan.
Studying pharmacogenomics in cardiovascular epidemiology.
Problems:
- Needs lots of data.
- Knows it exists but struggles with getting access to it.
“Often it’s very hard to get the required number of cases and controls to carry out research in public health and epidemiology.”
Jana, Company Biocurator, Zurich, Switzerland.
Biocurating microarray and RNA-Seq data.
Problems:
- Needs lots of data.
- Lots of data out there but hard to filter down to ‘useful/relevant’ data.
“Many repositories don’t list the metadata details I need to know if a dataset is useful to me, I can waste a lot of time searching.”
What can I do?
PRO TIPS:
Involve a statistician early on in your study design!
Include more reference data in your analysis
Search for collaborators who have the data you need
Tell your colleagues and peers what type of data you have in your lab
Use external sources of data….
Large amounts of data, but not accessible
Chart: WGS data available in public repos, 2006 to 2015, exponential growth rate
≈.5 PB sequence available
80+ PB sequenced every year
Under-utilised data has huge potential for medical research
2. Data resources from around the world
Public repositories
• some you apply for access, especially if data contains clinical info or whole genome PID
• some are open access: GEO, SRA, PGP, OpenSNP, GigaDB, …
• some are consented for general research use, some have specific consent
How many data sources?
How many sources of human genomics data do you know about?
Hundreds of data sources… but they aren’t easy to find!
First 30 data sources listed here: http://dx.doi.org/10.1371/journal.pbio.1002418
Chart: number of data sources over time (axis 0 to 300)
Jan-15: 10, Mar-15: 25, Jun-15: 33, Sep-15: 35, Dec-15: 102, Mar-16: 174, Jun-16: 239
DATA is fragmented
Data sources across the globe
Geolocation of 278 data sources analysed.
Found by tracking IP address of the source.
These include:
Public Repositories
Universities
Companies
BioBanks
Research consortiums
It may be confusing
Data source content
Assay Types
Dedicated to…
More information about data sources
… in our recent paper:
http://tinyurl.com/plos-biology-repositive
3. Getting access to Restricted data
Benefits:
• Strict governance
• Individuals are protected
• Review of consent
• Applicant signs for full responsibility for governance
Disadvantages:
• No control of data once access is given
• High barrier for access – too high?
Data accessibility
Can download the data straight away or after logging in.
Need to apply for access to the data.
Has both Open and Restricted access data within one repository.
Access type of 225 sampled data sources.
Often a long process
Bottlenecks:
• Finding relevant and usable data
• Getting authorisation to access data
• Formatting data
• Storing and moving data
We studied the problem with qualitative interviews followed by a survey of researchers in human genetics
T. A. van Schaik et al., The need to redefine genomic data sharing: a focus on data accessibility, Applied & Translational Genomics, 2014, doi:10.1016/j.atg.2014.09.013
Often a long process
Researchers spend months trying to find and access genomic data, and often choose not to access data at all
Flowchart (decision points, Yes/No): NIH / eRA Commons login? Organisation registered with eRA? Organisation has DUNS number?
Flowchart (steps): Write research proposal, Submit proposal, Access granted, Find/Download/Decrypt data, Science…
Flowchart (waiting times, in the order shown): + 2-3 days, + 1-2 weeks, + 1 week, + 1-2 days, + 1-4 weeks, + 1-2 days
PRO Tip: If you use human genomic data, apply for the GRU datasets in dbGaP, one application – access to all the GRU datasets.
dbGaP application process
Blog Post: http://blog.repositive.io/how-to-successfully-apply-for-access-to-dbgap/
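Flowchart (decision point, Yes/No): Sanger eDAM Account?
Flowchart (steps): Write research proposal, Submit proposal, Access granted, Find/Download/Decrypt data, Science…
Flowchart (waiting times, in the order shown): + 1 hour, + 1-2 days, + 2-7 days, + 1-2 days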
EGA application process
Blog Post: http://blog.repositive.io/how-to-successfully-apply-for-access-to-ega/
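To compare the two application processes, the waiting times listed on the dbGaP and EGA flowcharts above can be added up. This is only a rough sketch, not part of the original slides. It assumes that every waiting time on each chart applies (for example, an applicant with no existing login and an unregistered organisation), that 1 week is 7 days and that 1 hour is 1/24 of a day. The names DBGAP_WAITS, EGA_WAITS and total_wait are made up for this illustration.

```python
from fractions import Fraction

# Waiting times in days, as (shortest, longest) pairs, copied from the
# two flowcharts. Assumption: every wait on the chart applies to the
# applicant. 1 week = 7 days, 1 hour = 1/24 day.
DBGAP_WAITS = [(2, 3), (7, 14), (7, 7), (1, 2), (7, 28), (1, 2)]
EGA_WAITS = [(Fraction(1, 24), Fraction(1, 24)), (1, 2), (2, 7), (1, 2)]


def total_wait(waits):
    """Return (shortest, longest) total waiting time in days."""
    shortest = sum(lo for lo, _ in waits)
    longest = sum(hi for _, hi in waits)
    return shortest, longest


dbgap_total = total_wait(DBGAP_WAITS)
ega_total = total_wait(EGA_WAITS)

if __name__ == "__main__":
    for name, (lo, hi) in [("dbGaP", dbgap_total), ("EGA", ega_total)]:
        print(f"{name}: {float(lo):.1f} to {float(hi):.1f} days")
```

Under these assumptions the dbGaP route adds up to roughly 25 to 56 days and the EGA route to roughly 4 to 11 days. Applicants who already have a login or a registered organisation would skip some of the dbGaP waits.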
• Postdoctoral researcher at University of Cambridge Medical School
• Working on genetic inheritance and Cancer
• Using NGS data and bioinformatics
• After searching for data online she decided to apply for:
  • 2 dbGaP datasets
  • 3 EGA datasets
Cambridge specific Case Study
Blog Post: Pending… will be on http://blog.repositive.io/
The Research Operations Office - will help you with the contracts (DTAs) and signatures.
• Has a designated individual who processes all dbGaP applications, as they all abide by NIH legal restrictions and regulations about how to handle the data once granted access.
• For EGA applications, each DTA must get processed separately because there is no consensus for the ‘contracts’ between each dataset.
Cambridge specific Case Study
The nominated IT director - will be specific to your department.
• They will need to confirm you can support the requirements of the DTA.
• If the head of your departmental IT is not happy to sign – the head of IT for the University will be able to sign it off.
Cambridge specific Case Study
Top Tips: Be prepared…
• Think about your storage space!
• Think about what sort of analysis and processing you are going to do with the data once you do have it. After such a long process, the approval could be too quick!!
• Designate time!
• Understand what you need before you start the application process!
• You only have 1 year!
Cambridge specific Case Study
4. Not all data is restricted
Applying for access to restricted data is a hard and time-consuming process.
Think about using open access data!
Make the (research) world a better place by sharing in return
Best practices: Share in return!
• If you expect data to be available to you – you have to make your data available too!
• Encourage collaborations: power by numbers
1. Get credit – publish and make your data available
2. Give credit – cite data sources
3. Understand consent – for all uses of clinical data
Best practices
• Use all available tools to make your life easier:
  • Data publications: visibility and citations for your data, e.g. GigaScience and Scientific Data
  • Figshare, Zenodo, Dryad for sharing open access data
  • PhenomeCentral, Matchmaker Exchange for rare disease research
  • Repositive for finding data across repositories and making your own data discoverable
Best practices: use the tools
• Digital consent: towards automatic processing of applications
• Dynamic consent and power to the patient, e.g. Patients Know Best
• Privacy-preserving access to datasets: preserving control and governance with data custodian, lower barrier for access
What the future holds
Workshop: Finding and accessing human genomic data for research
Fiona Nielsen – August 22nd 2016
We are always looking for data
Genetics, Cancer, Rare disease research
We need access to the right data at the right time
DNA interpretation requires lots of data
Data is not easy to find and access
FRAGMENTED: Poor visibility of available genomic data
ADMIN BURDEN: Huge overhead to manage data access
BAD CULTURE: Lack of data sharing habits in research culture
We are enabling best practices
MAKE DATA DISCOVERABLE
SIMPLIFY WORKFLOWS
CONTRIBUTE TO COMMUNITY
DNAdigest and Repositive – Connecting the world of genomic data
http://www.tinyurl.com/plos-biology-repositive
Connecting the world of genomic data
Live demo http://discover.repositive.io
Team 2 minute presentation
1. Introduction
What data did you try to find and why?
Have you tried to search for this data before?
2. Methods
The 5 main steps you took on Repositive to try and find this data.
3. Results
Did you find the data on Repositive?
What challenges did you encounter?
4. Conclusion
Sum up your experience in 1 sentence.
Tell us your thoughts: @repositiveio
@glyn_dk
And read more on http://repositive.io
Bugs and feedback to: Charlotte at Repositive.io
Thank you!