thesis proposal, as presented for dissertation proposal defense
DESCRIPTION
The slides I presented for my PhD proposal defense for my project, "Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data." Dept of Biomedical Informatics, University of Pittsburgh.
TRANSCRIPT
Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data
Heather Piwowar
Department of Biomedical Informatics
University of Pittsburgh
Sharing research data
http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; htp://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
PAST MEDICAL HISTORY:
Past medical history showed she had
superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for
four years.
She had been hypothyroid for three years.
HISTORY OF PRESENT ILLNESS:
The patient is a 58-year-old female, …
Shared data benefits science
Verify, Understand, Extend, Explore, Combine, Synergize, Train, Reduce
But... costly for authors
Find, Organize, Document, Deidentify, Format, Decide, Ask, Submit
Answer questions
Worry about mistakes being found
Worry about data being misinterpreted
Worry about being scooped
Forgo money and IP and prestige???
As a result, policy makers have spent lots of time and money....
http://www.flickr.com/photos/tonivc/2283676770/
http://www.flickr.com/photos/johnnyvulkan/381941233/
...on initiatives, requests, requirements, and tools
NIH data sharing plan requirement
Journal requirements
Databases
Data sharing grids like BIRN and caBIG
Standards
Editorials, letters to the editor, discussion....
http://www.flickr.com/photos/mesh/14102209/
lots of data sharing!
http://www.genome.jp/en/db_growth.html
but how much isn’t shared?
what isn’t shared?
who isn’t sharing it? why not?
what can we do about it?
how much does it matter?
you cannot manage what you do not measure
http://www.flickr.com/photos/archeon/2941655917/
Long-term motivation:
I believe that analysis of the impact, prevalence, and patterns with which investigators share and withhold gene expression microarray research data can uncover rewards, best practices, and opportunities for increased adoption of data sharing.
Aim 1: Does sharing have benefit for those who share?
Aim 2: Can sharing and withholding be systematically measured?
Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?
Related research
Data usually collected via surveys and/or manual audits
http://www.flickr.com/photos/jima/606588905/
Noor et al. PLoS Biology 2006. Ochsner et al. Nature Methods 2008.
Piwowar et al. PLoS ONE 2007. Editorial. Nature Biotech 2007.
[Chart: Prevalence of data sharing via manual audit, for DNA sequences, gene expression microarrays, and proteomics spectra; axis 0% to 100%]
[Chart: Prevalence of data withholding via surveys; axis 0% to 40%. Categories: self-reported denying a request in last 3 years; trainees self-reported denying a request; been denied access to data, materials, code; authors “not able to retrieve raw data”; not willing to release data]
Campbell et al. JAMA. 2002. Kyzas et al. J Natl Cancer Inst. 2005.
Vogeli et al. Acad Med. 2006. Reidpath et al. Bioethics 2001.
Campbell et al. JAMA 2002.
[Chart: Self-reported reasons for data withholding; axis 0% to 80%. Reasons: sharing is too much effort; want student or jr faculty to publish more; they themselves want to publish more; cost; industrial sponsor; confidentiality; commercial value of results]
Blumenthal et al. Acad Med. 2006
[Chart: Correlates with self-reported data withholding; axis 0 to 3. Correlates: industry involvement; perceived competitiveness of field; male; sharing discouraged in training; human participants; academic productivity]
Models of data and knowledge sharing
Andriessen. Conditions for the willingness to share knowledge, 2006.
Harder. SMG WP 6/2008.
Cabrera and Cabrera. Int J of HR Mgmt. 2005.
Kuo. JASIST. 2008.
Limitations of the related research
• manual audits: small sample sizes
• surveys: few variables + self-reporting bias
• not much focus on measuring demonstrated behavior
• not much focus on impact or policy
• not much focus on biomedical data other than DNA sequences
Needed:
a study of data sharing behavior and impact that includes
• a measurement of demonstrated behavior
• policy variables
• estimate of rewards
• a broad and deep selection of data creation instances
• a focus on biomedical data other than DNA sequences
Scope of current study
• type of data: gene expression microarrays
• sharing mechanism: centralized databases
• studies: English full text available in a centralized portal
• covariates: extracted from Medline and database sources
http://en.wikipedia.org/wiki/DNA_microarray
http://en.wikipedia.org/wiki/Image:Heatmap.png
Preliminary research
http://farm3.static.flickr.com/2146/2389590651_9bbcc9d07e.jpg
Aim 1
Aim 1: Does sharing have benefit for those who share?
http://www.flickr.com/photos/sunrise/35819369/
Note the logarithmic scale
Aim 1: Associated citation increase
http://www.flickr.com/photos/sunrise/35819369/
Next:
What factors predict sharing?
http://www.flickr.com/photos/ryanr/142455033/
Can I use the same methods of Aim 1 to choose studies and determine data sharing status?
No, those methods don’t scale to identify or classify enough datapoints
Aim 2
Need automated methods to:
Identify studies that generate datasets that could potentially be shared (Aim 2a)
Determine which of these have in fact been shared (Aim 2b)
Aim 2a: Identify studies that create gene expression microarray data
http://www.flickr.com/photos/lofaesofa/248546821/
Easy, via MeSH indexing terms?
gene expression profiling and/or
microarray analysis
Unfortunately, these have neither high recall nor precision.
Look for wet lab methods in full text:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
BUT this requires developing and maintaining a full-text archive!
What about using PubMed Central?
Can reach ~85% of articles with full-text links via U of Pittsburgh library subscriptions, when combined with two other full-text query portals:
Goal: derive a full-text query with sufficiently high recall (>1250 studies) and precision (>70%).
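As a rough illustration of how a candidate query could be scored against a reference standard, here is a minimal sketch. The slide states the recall target as a count of studies; the sketch below computes recall in the conventional way, as a proportion of the reference standard. The function name and the PubMed IDs are hypothetical and used only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved IDs scored against a reference standard."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    true_positives = len(retrieved & relevant)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical PubMed IDs, for illustration only.
reference_standard = {"101", "102", "103", "104"}  # studies known to create microarray data
query_results = {"101", "102", "103", "201", "202"}  # studies returned by a candidate query

precision, recall = precision_recall(query_results, reference_standard)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy example three of the five retrieved studies are in the reference standard, and three of the four reference studies are retrieved.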
Reference standard?
Ochsner et al.
• 2007
• 20 journals
• broad query for microarray studies
• identified 400 studies that created gene expression microarray data
Development corpus?
PubMed Central Open Access subset + TREC Genomics IR subset
= about 5000 relevant articles with about 50% true positive rate
Development approach?
• Pattern building via manual inspection
• Classification decision trees with n-grams
• Borrow approaches from
  • AutoSlog-TS
  • automated regular expression building
  • semi-supervised learning
  • retrieval query aspects
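The n-gram features mentioned in the approach above could be extracted along these lines. This is only a sketch of the feature-extraction step, not the planned classifier; the function name and the example sentence are hypothetical.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams in lowercased text (tokens are runs of letters and digits)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Note: n-grams are built over the whole token stream, so they can span sentence boundaries.
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


# Hypothetical methods-section text, for illustration only.
sample = "Total RNA was hybridized to Affymetrix arrays. RNA was hybridized overnight."
bigrams = word_ngrams(sample, 2)
print(bigrams.most_common(2))
```

Counts like these could then serve as input features to a decision tree or other classifier.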
Aim 2b
Aim 2b: Identify studies that share their expression microarray data
http://www.flickr.com/photos/dcassaa/422261773/
pmc_gds[filter]
Unfortunately, the submission citation is often left blank when data is submitted prior to publication.
To achieve 70% recall, I may have to supplement with a query of the full text, such as:
(geo OR omnibus) AND microarray AND "gene expression" AND accession NOT (databases OR user OR users OR (public AND accessed) OR (downloaded AND published))
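To make the boolean logic of the query above concrete, here is a minimal sketch that applies the same terms to a block of article text in memory. It assumes exact, case-insensitive token matching, which may differ from how a real full-text portal stems or indexes terms. The function names, example sentences, and the accession "GSE0000" are hypothetical.

```python
import re


def _words(text):
    """Lowercased alphanumeric tokens, in document order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_sharing_query(text):
    """Apply the example full-text query as a boolean predicate on one document."""
    words = _words(text)
    vocab = set(words)
    # Phrase match for "gene expression": pad with spaces so only whole tokens match.
    has_phrase = " gene expression " in " " + " ".join(words) + " "
    include = (
        ("geo" in vocab or "omnibus" in vocab)
        and "microarray" in vocab
        and has_phrase
        and "accession" in vocab
    )
    exclude = (
        "databases" in vocab
        or "user" in vocab
        or "users" in vocab
        or ("public" in vocab and "accessed" in vocab)
        or ("downloaded" in vocab and "published" in vocab)
    )
    return include and not exclude


# Hypothetical sentences, for illustration only.
deposited = "Microarray gene expression data were deposited in GEO under accession GSE0000."
reused = "We downloaded published microarray gene expression data from GEO (accession GSE0000)."
print(matches_sharing_query(deposited), matches_sharing_query(reused))
```

The exclusion terms are what separate articles that deposit data from articles that only reuse or describe it.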
Reference standard:
Aim 3
Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?
http://www.flickr.com/photos/ryanr/142455033/
Aim 3a: Prevalence of data sharing
PubMed ID   Portal   Created data?
234         PMC      Yes
345         HighPr   Yes
456         Scirus   Yes
567         PMC      Yes
678         PMC      Yes
789         HighPr   No
890         PMC      No
901         -        ?
PubMed ID   Portal   Created data?   Shared data?
234         PMC      Yes             Yes
345         HighPr   Yes             Yes
456         Scirus   Yes             Yes
567         PMC      Yes             NO
678         PMC      Yes             NO
Prevalence = (Number with Shared data) / (Number with Created data)
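The prevalence ratio can be worked through on the illustrative rows from the slides. The sketch below is a minimal example; the field names are assumptions, and the PubMed IDs are the placeholder values shown in the table, not real studies.

```python
def prevalence(studies):
    """Share of data-creating studies that also shared their data."""
    created = [s for s in studies if s["created"] is True]
    if not created:
        raise ValueError("no studies with created data")
    shared = [s for s in created if s["shared"] is True]
    return len(shared) / len(created)


# Placeholder rows from the slide table; None marks an unknown or not-applicable value.
studies = [
    {"pmid": 234, "portal": "PMC", "created": True, "shared": True},
    {"pmid": 345, "portal": "HighPr", "created": True, "shared": True},
    {"pmid": 456, "portal": "Scirus", "created": True, "shared": True},
    {"pmid": 567, "portal": "PMC", "created": True, "shared": False},
    {"pmid": 678, "portal": "PMC", "created": True, "shared": False},
    {"pmid": 789, "portal": "HighPr", "created": False, "shared": None},
    {"pmid": 890, "portal": "PMC", "created": False, "shared": None},
    {"pmid": 901, "portal": None, "created": None, "shared": None},
]

print(f"prevalence = {prevalence(studies):.2f}")
```

Only studies known to have created data enter the denominator, so the rows marked "No" or "?" do not affect the result.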
Aim 3b: Correlates with data sharing
Features to include:
• Does the journal have a data sharing policy?
• Is the study funded by the NIH?
• Number of authors
• Research orientation of the primary institution
• Journal impact factor
• Are the samples from humans?
• Disease of study
• Year of publication
• …
PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
234         PMC      Yes             Yes            strong           yes          2
345         HighPr   Yes             Yes            weak             yes          5
456         Scirus   Yes             Yes            weak             no           6
567         PMC      Yes             NO             strong           yes          5
678         PMC      Yes             NO             strong           no           2
Covariates: Journal policy, NIH funds?, # authors, ...
[Diagram: "Shared data?" shown against the covariates "Journal policy?", "NIH funded?", "# authors", ...]
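As a first descriptive look at correlates, the sharing rate can be tabulated per level of a covariate. The sketch below uses the five placeholder rows from the slide table; it is a simple cross-tabulation under assumed field names, not the multivariate analysis the aim describes.

```python
from collections import defaultdict

# Placeholder rows from the slide table (all five studies created data).
rows = [
    {"pmid": 234, "shared": True, "journal_policy": "strong", "nih_funded": True, "n_authors": 2},
    {"pmid": 345, "shared": True, "journal_policy": "weak", "nih_funded": True, "n_authors": 5},
    {"pmid": 456, "shared": True, "journal_policy": "weak", "nih_funded": False, "n_authors": 6},
    {"pmid": 567, "shared": False, "journal_policy": "strong", "nih_funded": True, "n_authors": 5},
    {"pmid": 678, "shared": False, "journal_policy": "strong", "nih_funded": False, "n_authors": 2},
]


def sharing_rate_by(rows, covariate):
    """Return {covariate level: fraction of studies at that level that shared data}."""
    counts = defaultdict(lambda: [0, 0])  # level -> [number shared, total]
    for row in rows:
        tally = counts[row[covariate]]
        tally[0] += int(row["shared"])
        tally[1] += 1
    return {level: shared / total for level, (shared, total) in counts.items()}


print(sharing_rate_by(rows, "journal_policy"))
print(sharing_rate_by(rows, "nih_funded"))
```

With only five made-up rows the rates carry no meaning; the point is the shape of the table that a larger cohort would fill in.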
Aim 3c: Model of data sharing
[Diagram: "Shared data?" related to the factors Mandates, Amount of Collaboration, ...; labels: Strong, Weak]
http://www.flickr.com/photos/rachynymph/2930626195/
Assumptions
That the following limitations are randomly distributed:
• Ambiguous author names
• The method of describing data generation
• Studies with data in GEO but no submission links
• Studies that don’t mention sharing in the full-text article
The first and last authors are usually primary decision-makers about whether to share data
Citations are a valued, though imperfect, measure of research impact
Limitations
Association does not imply causation
Only one datatype: microarray data.
Only considering sharing in the primary centralized databases.
Many variables are USA-centric.
Results will only be generalizable to research studies made available in full-text portals.
Risks and contingency plans
NLP performance may be inadequate
  supplement with manual annotating via Mechanical Turk
Author ambiguity may introduce extreme outliers
  use Author-ity software on extreme outliers
Unable to derive a robust exploratory factor model
  try other clustering techniques
Several variables may be unexpectedly difficult to extract
  if not essential, defer the analysis of that variable to future work
Contributions
• an assessment of the observed and measured rewards, prevalence, and patterns of gene expression microarray dataset sharing
• a publicly available dataset associating microarray study publications with data sharing status
• a generalizable approach for developing practical, real-world information retrieval using centralized full-text query portals
• preliminary models of data sharing behavior
Publication plan
http://www.flickr.com/photos/linkwize/926334421/
Publication plan: Aim 1
Do studies with publicly shared datasets receive more citations?
Published in PLoS ONE in February 2007
Publication plan: Aim 2a
How can we identify studies that generate certain data, given full-text query access through centralized portals?
Targeted journal: Journal of Medical Internet Research? BMC Bioinformatics? other?
Publication plan: Aim 2b, 3a, 3b
What factors are associated with demonstrated data sharing behavior?
Targeted journal: BMC Bioinformatics? BMC Biology? PLoS Biology? a research policy journal? other?
Publication plan: Aim 3c
Derive (and validate?) a preliminary model of demonstrated research data sharing behavior
Targeted journal: JASIST? (Journal of the American Society for Information Science and Technology) Information Research? Journal of Documentation? Science Communication? Data Science Journal? other?
Future work
1. Identify and model data reuse
2. Citation analysis of the large cohort
3. Supplement with survey responses
4. Generalize the method for creating queries for full-text portals
http://www.flickr.com/photos/cogdog/123072/
Data sharing plan
I plan to share my code, data, and process openly during the research via blogs and repositories.
http://www.flickr.com/photos/myklroventine/892446624/
Thanks to
the Dept of Biomedical Informatics at the U of Pittsburgh,
the NLM for funding through training grant 5 T15 LM007059-22,
those with photos on Flickr under a Creative Commons license,
Wendy for her support and feedback, and my committee for anticipated feedback....
Questions and Suggestions?
Audience
• Funders, policy makers and thought leaders.
• Database, software, and data standard developers.
• Biomedical informatics community.
• Information science and digital library community.
• Open Science community.
• Primary Investigators.
Recent related grants
NIH: Haga, S. Exploring Attitudes About Data Disclosure and Data-Sharing in Genomics Research.
NSF: Hedstrom, M. Incentives for Data Producers to Create Archive-Ready Data Sets.
National Inst of Nursing Research: Pienta, A. Barriers and Opportunities for Sharing Research Data.
+others