A Tale of Two Data Catalogs


Upload: readkev

Post on 11-Aug-2014


DESCRIPTION

This presentation describes two studies undertaken to build two separate data catalogs: the first for NIH-funded datasets and the second for institutional datasets created within an academic medical center. To inform the creation of an NIH data catalog, the first study aimed to a) develop a set of minimal metadata elements for describing datasets, and b) carry out an analysis to identify datasets in NIH-funded research articles that give no indication that their data has been shared in a data repository. This study served as the foundation for developing an index of all NIH-funded datasets and showed which repositories researchers use most often to share their data. The second study was spurred on by the first and involved interviewing institutional faculty members and researchers to learn how they collect data, what challenges they face when collecting data, whether they have thought about sharing data, and what they would find most useful in an institutional data catalog. The results of this study informed the workflows, metadata creation, and requirements for building a data catalog within the medical center. Interview responses were also used to inform the data services provided by the health sciences library, including education, research consultations, and clinical quality improvement initiatives. Both studies provide examples of how a librarian working in the health sciences can contribute to, and participate in, data-related services within their institution.

TRANSCRIPT

Page 1: A Tale of Two Data Catalogs

DATA CATALOGS

Table of Contents

I. NIH Data Discovery Index

• Methodology
• Findings
• Questions raised

II. Institutional Data Interviews

• Methodology
• Findings

III. Outcomes

• Benefits to the library

By: Charles Dickens KEVIN READ

Page 2: A Tale of Two Data Catalogs


It was the best of times…

Page 3: A Tale of Two Data Catalogs

NIH Big Data to Knowledge (BD2K)

Facilitating Broad Use of Biomedical Big Data

Page 4: A Tale of Two Data Catalogs

NIH Data Discovery Index

Datasets are CITABLE

Datasets are DISCOVERABLE

Datasets are LINKED TO THE LITERATURE

Datasets are PART OF THE RESEARCH ECOSYSTEM

Page 5: A Tale of Two Data Catalogs

NIH Data Sharing Repositories

http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html

Page 6: A Tale of Two Data Catalogs

Searching for NIH-funded unidentified datasets in PubMed and PMC
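A search of this kind could be expressed against NCBI's E-utilities ESearch endpoint. The sketch below only builds the request URL and sends nothing. The slides do not give the actual search strategy, so the filter choices (NIH research-support publication types, a publication-year limit, and exclusion of the "Molecular Sequence Data" MeSH heading) and the function names are assumptions for illustration.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(year: int) -> str:
    """Return a PubMed search string for NIH-supported articles in `year`.

    Assumption: these filters are illustrative, not the presenter's query.
    """
    nih_support = (
        '("Research Support, N.I.H., Extramural"[pt] '
        'OR "Research Support, N.I.H., Intramural"[pt])'
    )
    # Drop articles already indexed with the Molecular Sequence Data heading.
    no_sequence_heading = 'NOT "Molecular Sequence Data"[mh]'
    return f"{nih_support} AND {year}[dp] {no_sequence_heading}"


def build_esearch_url(term: str, retmax: int = 0) -> str:
    """Return an ESearch URL; retmax=0 asks for the hit count only."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_esearch_url(build_pubmed_query(2011)))
```

Building the URL separately from sending it keeps the query text easy to review and adjust before any request is made.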


Page 7: A Tale of Two Data Catalogs

[Flow diagram: NIH-funded articles for 2011 are narrowed through successive exclusions (Non-PMC Articles, Non-research Articles, Molecular Sequence Data MH, SI Field, XML, PMC Acknowledgements) to the remaining articles with unidentified datasets. Counts shown on the slide: 113,089; 75,441; 88,592; 78,901; 71,913; 71,680; 69,857.]
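The arithmetic behind an exclusion funnel like this one can be sketched in a few lines. The counts are the seven figures on the slide. Treating them as successive stages in descending order is an assumption, and no count is tied to a specific filter label because the transcript does not preserve that pairing.

```python
# The seven counts as they appear on the slide.
SLIDE_COUNTS = [113_089, 75_441, 88_592, 78_901, 71_913, 71_680, 69_857]

# Assumption: each count is one stage of a single funnel, largest first.
STAGE_COUNTS = sorted(SLIDE_COUNTS, reverse=True)


def step_drops(counts):
    """Return how many articles are removed between consecutive stages."""
    return [before - after for before, after in zip(counts, counts[1:])]


def retained_share(counts):
    """Return the fraction of the starting set left after the last stage."""
    return counts[-1] / counts[0]


drops = step_drops(STAGE_COUNTS)
print("Removed at each step:", drops)
print("Total removed:", sum(drops))
print(f"Share retained: {retained_share(STAGE_COUNTS):.1%}")
```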

Page 8: A Tale of Two Data Catalogs

SI Field

[Bar chart: Excluded Articles (axis 0 to 1,600). Repositories shown: ClinicalTrials.gov, PDB, GEO, GenBank, PubChem, RefSeq, ISRCTN, OMIM.]

Page 9: A Tale of Two Data Catalogs

PMC Acknowledgements

[Bar chart: Excluded keywords (axis 0 to 800). Keywords shown: PDB, ClinicalTrials.gov, GenBank, GEO, IRD, MGI, DIP, FlyBase, dbGaP, SRA, WormBase, MPD, NURSA, RGD, ICPSR, VectorBase.]

Page 10: A Tale of Two Data Catalogs

XML Keyword

[Bar chart: Excluded keywords (axis 0 to 600). Keywords shown: GenBank, PDB, GEO, dbSNP, ClinicalTrials.gov, RGD, FlyBase, SRA, DIP, dbGaP, WormBase, MGI, BioGRID, VectorBase, Multiple Keywords.]

FlyBase:GeneNetwork:Mouse Genome Informatics:Neuroscience Information Framework:Rat Genome Database:WormBase:Zebrafish Model Organism Database

GenBank:PDB

Page 11: A Tale of Two Data Catalogs

NIH-sponsored data repositories now added to PubMed and PMC search indexes


Page 12: A Tale of Two Data Catalogs

383

What category of dataset was used for the research described in the article?

Were live human or animal subjects used in the collection of the data?

What were the subject(s) of study (from which or whom the data was collected)?

If new dataset(s) were created, what type(s) of data were collected?

What existing dataset(s), if any, were used?

How many datasets are there in each article?
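The questions on this slide map naturally onto a small annotation record. The sketch below is one possible shape for such a record, with a helper that averages dataset counts across articles. All class and field names are hypothetical, and the slides do not describe the form the annotators actually used.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetAnnotation:
    """One dataset found in an article (field names are hypothetical)."""
    category: str             # category of dataset used for the research
    uses_live_subjects: bool  # were live human or animal subjects used?
    subjects_of_study: str    # from which or whom the data was collected
    is_new: bool              # True if created for the article, False if reused
    data_types: List[str] = field(default_factory=list)  # types collected, if new
    existing_source: str = ""  # name of the existing dataset, if reused


@dataclass
class ArticleAnnotation:
    """All datasets an annotator identified in one article."""
    article_id: str
    datasets: List[DatasetAnnotation] = field(default_factory=list)


def average_datasets_per_article(articles: List[ArticleAnnotation]) -> float:
    """Return the mean number of datasets per article (0.0 if none given)."""
    if not articles:
        return 0.0
    return sum(len(a.datasets) for a in articles) / len(articles)
```

Keeping one record per dataset, rather than one per article, lets the per-article count fall out of the data instead of being entered by hand.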

Page 13: A Tale of Two Data Catalogs


Measuring blood pressure in mice

Measuring left hemisphere of brain for growth factor

Staining and imaging

Analysis of images using software

Page 14: A Tale of Two Data Catalogs

Results


Page 15: A Tale of Two Data Catalogs

Average number of datasets per article:

2.92


Page 16: A Tale of Two Data Catalogs

% of datasets that use live subjects: 54%

H: 51%

Animal: 49%

Page 17: A Tale of Two Data Catalogs

% of new data: 87%

% of data created using pre-existing datasets: 13%

Page 18: A Tale of Two Data Catalogs


It was the worst of times…

Page 19: A Tale of Two Data Catalogs

Data Types

Image

Genetic or Genomic

Chemical

Biochemical

Electrical (Electrophysiological)

Optical – non-image

Behavioral

Computational Simulation or model

Magnetic Resonance – non-image

Structural

Physiological

Questionnaire/Survey

Clinical Measures

Geospatial

INSUFFICIENT

Page 20: A Tale of Two Data Catalogs

Inter-rater Reliability: 43%

[Bar chart: Total number of datasets found per 25 articles (axis 0 to 800). Series: Total # of datasets (High), Total # of datasets (Low).]
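The slide does not say how the inter-rater figure was calculated. As a rough illustration only, simple percent agreement between two annotators' dataset counts for the same articles could be computed as below. The function name and the sample counts are hypothetical and are not taken from the study.

```python
def percent_agreement(rater_a, rater_b):
    """Return the share of articles on which two raters gave the same count."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same articles")
    if not rater_a:
        raise ValueError("at least one article is required")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


if __name__ == "__main__":
    # Hypothetical dataset counts for five articles, one list per annotator.
    counts_a = [3, 2, 4, 1, 5]
    counts_b = [3, 3, 4, 2, 5]
    print(f"{percent_agreement(counts_a, counts_b):.0%}")
```

Percent agreement ignores agreement that would occur by chance, so a chance-corrected statistic may be preferable when the number of possible values is small.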

Page 21: A Tale of Two Data Catalogs

How do we define a data set?


Dataset

Page 22: A Tale of Two Data Catalogs

How do we define a data set?


Datasets

Page 23: A Tale of Two Data Catalogs

How do we define a data set?


Datasets

Page 24: A Tale of Two Data Catalogs

Where in the collection/processing pipeline should data be described?

Page 25: A Tale of Two Data Catalogs

Book of the Second

Understanding institutional data challenges via interviews

Page 26: A Tale of Two Data Catalogs


Institutional Data Catalog

• Organize and describe institutional research data

• Promote collaboration within the institution

• Promote a culture of sharing and transparency

Page 27: A Tale of Two Data Catalogs

Methodology

• Literature review
• ID researchers/PIs using active grant system
• Analyzed datasets in researcher papers before interviews
  – Used NIH Data Discovery Index method

Page 28: A Tale of Two Data Catalogs

Understand your researchers

BASIC SCIENCE RESEARCHERS

CLINICAL RESEARCHERS

Page 29: A Tale of Two Data Catalogs

Data Interviews

Challenges Organizing Data – Basic Science Researchers

[Bar chart (axis 0 to 7). Categories: Postdoc or student leaves with data; Lack of standards/procedures; Size of data; Messiness/Disconnect between datasets; Too challenging.]

Page 30: A Tale of Two Data Catalogs

Data Interviews

Challenges Preserving Data – Basic Science Researchers

[Bar chart (axis 0 to 6). Categories: Storage expense; Changes in software; Lack of IT resources; Lack of preservation procedures (readme, plans, postdoc etc.); Data in multiple storage locations; Storage space.]

Page 31: A Tale of Two Data Catalogs

Data Interviews

Challenges Organizing Data – Clinical Researchers

[Bar chart (axis 0 to 6). Categories: Data quality; Messiness/Disconnect between datasets; Poor data output formats; Can't search data; Data loss; Team miscommunication on who's using data.]

Page 32: A Tale of Two Data Catalogs

Data Interviews

Experience with Data Sharing

[Bar chart (axis 0 to 9). Series: Basic Science, Clinical. Categories: Collaboration only; unknown parties; data repository; general public; primary results only; Do not share.]

Page 33: A Tale of Two Data Catalogs

Only the best of times…

How the library benefited from this exercise

Page 34: A Tale of Two Data Catalogs


Identified group to pilot institutional data catalog – Population Health

Page 35: A Tale of Two Data Catalogs


Acquired new opportunities for teaching data management

Page 36: A Tale of Two Data Catalogs


Developing a lab tool for basic scientists to manage metadata

Page 37: A Tale of Two Data Catalogs


Developed a better understanding of researcher needs and challenges

Page 38: A Tale of Two Data Catalogs

Acknowledgements

BD2K Project

• Lou Knecht, Jim Mork, Kathel Dunn, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Dr. Donald Lindberg

Annotators

• Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa Yorks, Shari Mohary, Olga Printseva, Janice Ward, Oleg Rodionov, Sally Davidson, Jennie Larkin, Peter Lyster, Matt McAuliffe, Greg Farber, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Lou Knecht, Suzy Roy, Swapna Abhyankar, Olivier Bodenreider, Karen Gutzman, Dina Demner Fusman, Laritza Rodriguez, Sonya Shooshan, Samantha Tate, Matthew Simpson, Tracy Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn Sinnott


Page 40: A Tale of Two Data Catalogs

Images

Ponderings for All Things Blog. 2010. Available from: http://ponderingsofallthings.blogspot.com/2010/05/tale-of-two-cities-charles-dickens.html
Reading Charles Dickens Blog. Manette in Bastille. 2012. Available from: http://readingcharlesdickens.com/wp-content/uploads/2012/07/Manette-in-Bastille-253x300.jpg
Grandma’s Graphics. Old Scrooge sat busy in his counting-house. 2000. Available from: http://www.grandmasgraphics.com/graphics/childrens/childrens379_2000.jpg
Sungardas Blog. Apple to Orange. 2010. Available from: http://blog.sungardas.com/wp-content/uploads/Apple-to-Orange.jpg
Patel R. Questions?. Flickr. 2007. Available from: https://www.flickr.com/photos/23679420@N00/545653437/
Biomedical Engineering Laboratory. Wikimedia. 2012. Available from: http://upload.wikimedia.org/wikipedia/commons/a/a3/Biomedical_Engineering_Laboratory.jpg

Page 41: A Tale of Two Data Catalogs


[email protected] Questions?