Why should researchers care about data curation?

Presentation I gave as part of the selection process for my current position as Data Curation Editor at the Nature journal, Scientific Data.


<ul><li> 1. Why should researchers care about data curation? Varsha Khodiyar</li></ul> <p> 2. WHY SHARE DATA
3. Expenditure on data generation: 16.8% of NIH grant applications funded*. Hours spent writing grants? Hours spent reviewing grants? Resources are finite/expensive: modified animals; specialized reagents; time and effort to generate good, valid data. * For fiscal year 2013 (http://report.nih.gov/success_rates/Success_ByIC.cfm)
4. Reproducibility is a cornerstone of science. [W]e evaluated the replication of data analyses in 18 articles on microarray-based gene expression profiling published in Nature Genetics in 2005–2006... We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced. The main reason for failure to reproduce was data unavailability. Ioannidis JPA. et al. Repeatability of published microarray gene expression analyses. Nature Genetics 41, 149–55 (2009)
5. HOW TO SHARE DATA
6. Data needs to be: Discoverable (need to know it's there); Accessible (must be able to get to the data); Usable (require sufficient information about how the data was generated); Persistent (historical data access as part of the scientific record, as well as for new research); Reliable (data provenance informs data reuse decisions)
7. Traditional publishing: Data in a PDF is discoverable and accessible by readers of the paper, but is not usable - can't manipulate data in a PDF table
8. I'll send my data when someone asks for it: We examined the availability of data from 516 studies between 2 and 22 years old. The odds of a data set being reported as extant fell by 17% per year. Broken e-mails and obsolete storage devices were the main obstacles to data sharing. Vines TH. et al. The availability of research data declines rapidly with article age. Curr Biol 24, 94–7 (2014)
9. I'll make my data available in a repository: Data is discoverable, accessible and persistent. But data may not be usable, as there is limited space for data-specific description in an unstructured repository
10. I'll write a data paper: Materials and Methods; Animal surgery; Behavioural testing; Data collection and cell-type classification; Data description; Data file organization; Metadata organization. Data is discoverable, accessible and persistent. Sufficient space for methodological detail
11. BUT ARE WE MISSING SOMETHING?
12. Human vs. machine: Is your data truly discoverable by researchers outside your own domain? Too many papers to read in each person's own field. Could increasing the machine readability of your data result in increased use of your data? Is making an entire dataset machine readable feasible?
13. Metadata: Fully describe the experiments that generated the data; Takes time to ensure full metadata capture; Structure the metadata to ensure machine readability; Structure needs to be decided prospectively; Metadata can be discovered in an automated way; Requires relevant infrastructure
14. Curation is a specialised task: Researchers are not data management professionals. Learning how to curate data takes time. Article publication is carried out by specialists (journals). It follows that data publication should also be carried out by specialists.
15. Benefits of curated metadata. Users of data: Data is findable; Data provenance is clear; Increased data usability; Reduce unnecessary duplication of data. Data generators: Data more likely to be used, so data citation rates will increase; Contribute to novel research that data generators would not have carried out
16. Metadata as an integral part of a data paper
17. FUTURE POSSIBILITIES
18. Machine readable research metadata could lead to... Linked Data: a way to publish data so that data from different sources can be connected and queried. Infrastructure for linked research data is being developed. "Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/"
19. The beginnings of linked research data: An open-access database of publicly available antibodies against human protein targets, with user and provider data on antibody efficacy in a range of assays. We show that Antibodypedia may be used to track the development of available and validated antibodies to the individual chromosomes, and thus the database is an attractive tool to identify proteins with no or few antibodies yet generated.
20. Summary: Reusing previously generated data is economical. Data reuse is dependent on discoverable, accessible and usable shared datasets. Descriptive metadata enhances (re)usability of data. Capture of structured metadata is a specialist skill. The future: machine readable metadata will be important
21. Thanks for listening... </p>