create, curate, re-use: the expanding life course of digital research data
DESCRIPTION
Presentation to Educause Australasia 2007
TRANSCRIPT
a centre of expertise in data curation and preservation
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License, excluding content that is the property of others. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
Create, curate, re-use: the expanding life course of digital research data
Chris Rusbridge
EDUCAUSE Australasia May 2007
EDUCAUSE Australasia 2007
Contents
• Science and digital curation
• Why are data important?
• What kinds of data?
• What to do with your data: frontiers of practice
• Repository frontiers
• Changing practice
Digital Curation Centre Mission
“The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”
Science and curation
• Creating and managing data suitable for re-use
• Good curation supports good science (managing your data properly)
  • Poor curation allows sloppy science?
• Data curation should save money
  • Murray-Rust/Frey on interesting but fruitless experiments!
• Some science impossible without curation…
  • QCD strong coupling constant prediction (Bethke)
  • Viscosity of earth mantle from Shang Dynasty eclipse records (Pang et al)
  • Science depending on past baselines (eg environmental, social sciences)
Records of science
• Data increasingly important as evidence
  • Key part of the scholarly record (public good)
  • Unrepeatable observations & experiments
• Experimental verifiability (the basis of science)
  • Would Chang retractions have been reduced if his first data were available?
• Allows additional interpretations
• Legal and compliance
  • See APSR/AERES report for good examples
What kinds of data?
• Observations
  • eg UARS (Upper Atmosphere) Level 0: telemetry
  • UARS Level 1: measured physical parameters (post calibration?)
• Derived data
  • UARS Level 2: calculated geophysical? profiles
  • UARS Level 3: gridded, interpolated?
• Combined data
• Crafted data
  • Eg annotated gene/protein databases
• Descriptive (meta)data
Retaining research data means…
• Data secure against loss (within group)
• Communal repository (secure bit dump)
• Re-usable, sharable information
• As above, plus active curation (eg bio-informatics)
• Long term preservation of information
• Be clear what you are trying to do!
… or the data trajectory is…
• Hard drive → lost (crash)
• Hard drive → DVD → Cardboard box → Loft → Skip/dumpster → lost
• Sometimes this is a very bad thing
• Sometimes these are the right options!
Long term bit storage…
• A solved problem? Just requires well-understood good data management practices?
• Wrong! For very large datasets over a very long time, there are significant problems…
BAKER, M., SHAH, M., ROSENTHAL, D. S. H., ROUSSOPOULOS, M., MANIATIS, P., GIULI, T. J. & BUNGALE, P. (2006) A Fresh Look at the Reliability of Long-term Digital Storage. EuroSys '06. Leuven, Belgium, ACM.
How Well Must We Preserve?
Keep a petabyte for a century
– With 50% chance of remaining completely undamaged
Consider each bit decaying independently
– Analogy with radioactive decay
That's a bit half life of 10**18 years
– One hundred million times the age of the universe
That's a very demanding requirement
– Hard to measure
– Even very unlikely faults will matter a lot
•Slide from David Rosenthal, LOCKSS
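As a rough check on the figures in this slide, the sketch below works through the arithmetic under the slide's own model: each bit decays independently, by analogy with radioactive decay. The function name, the decimal definition of a petabyte and the age-of-universe value (about 1.38e10 years) are assumptions added here, not taken from the slide.

```python
import math


def required_bit_half_life(n_bits, years, survival_probability=0.5):
    """Half-life each bit needs so that ALL n_bits survive `years`.

    Model: a single bit survives t years with probability 2 ** (-t / T).
    All bits survive with probability 2 ** (-n_bits * t / T), so solving
    for T gives T = n_bits * t * ln(2) / -ln(survival_probability).
    """
    return n_bits * years * math.log(2) / -math.log(survival_probability)


PETABYTE_BITS = 8 * 10**15        # assumption: decimal petabyte (10**15 bytes)
AGE_OF_UNIVERSE_YEARS = 1.38e10   # assumption: approximate current estimate

half_life = required_bit_half_life(PETABYTE_BITS, years=100)
ratio = half_life / AGE_OF_UNIVERSE_YEARS

print(f"required bit half-life: {half_life:.1e} years")
print(f"multiples of the age of the universe: {ratio:.1e}")
```

Under these assumptions the required half-life comes out at roughly 8 × 10**17 years, which the slide rounds to 10**18. The ratio to the age of the universe is in the tens of millions, the same order of magnitude as the slide's "one hundred million".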
What to do about curation
• Build curation/reusability into your workflow
  • Curation begins before creation
  • What’s easy at first becomes (impossibly) hard later
• Describe your data (metadata schemas, “representation info”, etc)
• Keep experimental parameters (technical, who, what, when, where)
• Keep ability to process
• Keep data!
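One way to read "describe your data" and "keep experimental parameters" is to write a small metadata record at the moment the data are created. The sketch below is only an illustration of that idea: the field names, the sample file contents and the sensor identifier are all hypothetical, and a real project would follow whatever metadata schema its discipline has agreed.

```python
import hashlib
import json
from datetime import datetime, timezone


def describe_dataset(data, *, who, what, where, parameters, data_format, when=None):
    """Build a metadata record to store alongside a data file.

    The field names are illustrative, not a standard schema.
    """
    when = when or datetime.now(timezone.utc)
    return {
        "who": who,
        "what": what,
        "when": when.isoformat(),
        "where": where,
        "parameters": parameters,      # technical/experimental settings
        "format": data_format,         # needed later to keep the ability to process
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),  # lets later users check integrity
    }


sample = b"depth_m,temp_c\n10,4.1\n"   # hypothetical observation file
record = describe_dataset(
    sample,
    who="A. Researcher",
    what="temperature profile",
    where="Station 7",
    parameters={"sensor": "hypothetical-sensor-01", "calibrated": True},
    data_format="text/csv",
    when=datetime(2007, 5, 1, tzinfo=timezone.utc),
)

# Serialise as a sidecar document that travels with the data file.
sidecar = json.dumps(record, indent=2, sort_keys=True)
print(sidecar)
```

Capturing the record in the same step that writes the data is the point: it is cheap at creation time and very hard to reconstruct later.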
What to do about curation - 2
• Use standard/agreed formats for data
• Make ownership & restrictions clear, & explain how to cite your data
• Offer for deposit in institutional or discipline repository
  • Appraisal and selection essential
  • Possible time-limited embargos
• “Publish” data in support of articles
Internet Archaeology: publication with data
Database as book…
• Buneman (early pilot) work on IUPHAR database
• MySQL to XML database
  • Historic to logical schema
• XML via XSLT to LaTeX
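The last step of this pipeline, turning XML records into LaTeX source, can be sketched as follows. This is not the pilot's actual stylesheet: the slide names XSLT, whereas this sketch substitutes Python's standard-library ElementTree parser as a stand-in, and the element and attribute names (`database`, `entry`, `name`, `note`) are hypothetical rather than the IUPHAR schema.

```python
import xml.etree.ElementTree as ET

# Characters that LaTeX treats specially, with their escaped forms.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_",
    "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def latex_escape(text):
    """Escape LaTeX special characters one character at a time."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)


def entries_to_latex(xml_text):
    """Turn each <entry> into a LaTeX section with one line per <note>."""
    root = ET.fromstring(xml_text)
    lines = []
    for entry in root.findall("entry"):
        lines.append("\\section{" + latex_escape(entry.get("name", "")) + "}")
        for note in entry.findall("note"):
            lines.append(latex_escape(note.text or ""))
    return "\n".join(lines)


# Hypothetical record, standing in for one exported database entry.
SAMPLE_XML = """
<database>
  <entry name="Receptor_A">
    <note>Binds ligand X &amp; ligand Y</note>
  </entry>
</database>
"""

print(entries_to_latex(SAMPLE_XML))
```

Escaping matters here because database text routinely contains characters such as `&`, `%` and `_` that would otherwise break the typeset "book".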
The StORe vision
• Seamless transport from research data to research publications and vice versa
• Bi-directional links proven in social science e-research but capable of export to other disciplines
[Diagram labels: Source, Output, Middleware]
• Slide from Graham Pryor
• http://jiscstore.jot.com/WikiHome/
What are the reusability issues?
• Data not neutral to hypothesis
• Hard to know the risks & pitfalls of a particular dataset
• Data not self-describing: hard to find appropriate data (but see Murray-Rust on Googling InChI etc)
• Hard to “understand” data once found
  • Really need information, not data!
• Hard to use data once understood
Context
• Data meaningless without context
  • Metadata of many kinds
  • Representation information… from data to information
  • Linkage and connection between datasets
  • Use your workflow!
• Provenance
  • Authenticity/integrity
  • Computational lineage
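The provenance points can be made concrete with a minimal sketch that records, for each processing step, a fingerprint of its inputs and its output. The record layout, function names and sample byte strings are assumptions for illustration only; they are not a provenance standard or any particular tool's format.

```python
import hashlib


def fingerprint(data):
    """SHA-256 digest used as an integrity check on a data object."""
    return hashlib.sha256(data).hexdigest()


def lineage_step(process, inputs, output):
    """Record which inputs a named process turned into which output."""
    return {
        "process": process,
        "inputs": [fingerprint(item) for item in inputs],
        "output": fingerprint(output),
    }


def matches(step, inputs, output):
    """Check that the data in hand are the data the step recorded."""
    return step == lineage_step(step["process"], inputs, output)


# Hypothetical stand-ins for a raw observation and a derived product.
raw = b"level 0 telemetry"
calibrated = b"level 1 parameters"

step = lineage_step("calibrate", [raw], calibrated)
print(matches(step, [raw], calibrated))          # True: data unchanged
print(matches(step, [raw], calibrated + b"!"))   # False: output was altered
```

Chaining such records, with one step's output fingerprint appearing among the next step's inputs, gives a simple form of computational lineage.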
[Figure: data lineage diagram. Data and processing labels: Csat, E0, SST, 8-day composite and subscene, PAR subscene, HRPT, Pbopt calc, Ctot calc, Zeu calc, PPeu calc. Parties: NASA, University research group1, University research group2, research group3, local decision-making body.]
Slide from Rajendra Bose
Access and re-use
• Ethics and rights control access
  • Weak in expressing this long-term
• Collaboration tools
  • Annotation, discussion, review (see DART…)
  • Re-use leading to change and development
• “Publication”
  • Not just in “print”
  • Underlying data should be “published”, too
Who does curation?
• Individuals
• Departments or groups
• Institutions, maybe through libraries
• Communities
• Disciplines
• Publishers
• National services
• Other 3rd parties…
Curation: Individual
• “Small science 2-3 times more data than Big science”, but much more at risk
• PhD student? RA? PI? Administrator? IT support?
• Data potentially on local hard drives, or at best shared network drives
  • May be inadequately protected
  • Liable for policy-led deletion on resignation
• Individual “knows” too much (tacit knowledge)
  • Documentation/metadata unlikely to be adequate
• Future: gone!
Curation: Individual
•© Marita Bushell
Department: eCrystals
• Partnership with Institutional Repository
• Specialist department archive (& national service)
• Workflow recording of lab parameters (R4L)
• Public & private elements
• Trying to build eCrystals federation (eBank 3)
• Future: likely to continue
Data in institutional repositories
Institution: Cambridge Chemistry
• 175,000 small molecule structures in CML
• Alongside Archaeology, Manuscripts, Learning Materials, etc
• No library curation skills; dependent on research group enthusiast
• Collection isolated from other Chemistry
• (Only 5 UK institutional repositories claim to hold data)
• Future: assured…
Community: LOCKSS?
• Self-selected group of collectors: closest to genuine open activity (despite Alliance)?
• Traditionally libraries collecting eJournals
• Model respects IPR
• No domain expertise; rely on origins
• Data limitations…
• Future: potentially very persistent (low cost, high reliability, attack resistance, distributed)
Discipline: Atmospheric Science
• Strong believer in need for domain scientists as curators
• Significant participant in “community proxy” agenda-setting activities
• Internationally fragmented resources
• Future: mostly dependent on grant funding (but strong commitment)
Bio-informatics: Nature article 23 June 05
• Databases in Peril
  • 51 out of 89 biological databases contacted reported they were struggling financially
  • 7 have closed
  • Several being updated in owner’s spare time
  • (Notes that not all deserve long term support)
• [Nucleic Acids Research reports 968 databases in 2007!]
• Major issue: money
Publisher: Crystallography
• Publisher and Scientific Union
• Created key domain crystallographic standard (CIF)
• Strong motivator for deposit of structure data
• Consistent quality checks
• DOIs used for structure data
• Future: publishing business model
• Slide from IUCr
National bodies: British Library
• Serious and robust approach
• Legal deposit powers & responsibilities as driver
• Oriented primarily towards “cultural heritage” (broadly interpreted)
• Little data, no science domain experience
• Future: strong future commitment
National bodies: TNA/NDAD
• Specialist archive for government datasets
• Understand government regulations, dynamics & requirements
• Subject generalists; disconnected from associated science
• Technology specialists (understand databases)
• Future: likely to pass eventually to The National Archives
3rd parties: Portico
• Specific area: eJournals
• Depends on publisher agreements
• No data or domain science expertise
• Future: commitment from Mellon + publishers + subscriptions, good funding mix
3rd Parties: Iron Mountain?
• Records management IS a curation problem
• Organisations like this very likely to branch out
• No domain science expertise
• Future: business case, viability, stock market…
3rd parties: Web 2.0 style, Swivel.com??
Institutions & the network
• Institutions have fundamental sustainability
• Disciplines have domain knowledge advantage but sustainability is an issue
• Can we get the best of both?
• Needs serious work to examine!
Inst’n 1
Inst’n 2
Inst’n 3
Discipline 1 X X
Discipline 2 X X
Discipline 3 X X
etc
Who are the curation players?
Cultural change
• If we build it, will they come? NO!!
• Outreach important: communication with scientists and researchers is hard graft
• Cultural change to new approach requires more:
  • Incentives, rewards and mandates
  • Successful exemplars (well publicised)
  • Discipline-oriented approach (one size does not fit all)
Australian context?
• In the emerging context of the Research Quality Framework, and the expected National Collaborative Research Infrastructure Strategy, curation can only increase in importance!
Thank you
•(Citations in paper in proceedings)