create, curate, re-use: the expanding life course of digital research data


DESCRIPTION

Presentation to Educause Australasia 2007

TRANSCRIPT

Page 1: Create, curate, re-use: the expanding life course of digital research data

a centre of expertise in data curation and preservation

Funded by:

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License, excluding content property of others. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ ; or, (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Create, curate, re-use: the expanding life course of digital research data

Chris Rusbridge

EDUCAUSE Australasia May 2007

Page 2: Create, curate, re-use: the expanding life course of digital research data

EDUCAUSE Australasia 2007

Contents
• Science and digital curation
• Why are data important?
• What kinds of data?
• What to do with your data: frontiers of practice
• Repository frontiers
• Changing practice

Page 3: Create, curate, re-use: the expanding life course of digital research data

Digital Curation Centre Mission
“The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”

Page 4: Create, curate, re-use: the expanding life course of digital research data

Page 5: Create, curate, re-use: the expanding life course of digital research data

Science and curation
• Creating and managing data suitable for re-use
• Good curation supports good science (managing your data properly)
• Poor curation allows sloppy science?
• Data curation should save money
• Murray-Rust/Frey on interesting but fruitless experiments!
• Some science impossible without curation…
  • QCD strong coupling constant prediction (Bethke)
  • Viscosity of earth mantle from Shang Dynasty eclipse records (Pang et al)
  • Science depending on past baselines (eg environmental, social sciences)

Page 6: Create, curate, re-use: the expanding life course of digital research data

Records of science
• Data increasingly important as evidence
• Key part of the scholarly record (public good)
• Unrepeatable observations & experiments
• Experimental verifiability (the basis of science)
  • Would Chang retractions have been reduced if his first data were available?
• Allows additional interpretations
• Legal and compliance
• See APSR/AERES report for good examples

Page 7: Create, curate, re-use: the expanding life course of digital research data

What kinds of data?
• Observations
  • eg UARS (Upper Atmosphere) Level 0: telemetry
  • UARS Level 1: measured physical parameters (post calibration?)
• Derived data
  • UARS Level 2: calculated geophysical? profiles
  • UARS Level 3: gridded, interpolated?
• Combined data
• Crafted data
  • eg annotated gene/protein databases
• Descriptive (meta)data

Page 8: Create, curate, re-use: the expanding life course of digital research data

Retaining research data means…
• Data secure against loss (within group)
• Communal repository (secure bit dump)
• Re-usable, sharable information
• As above, plus active curation (eg bio-informatics)
• Long term preservation of information
• Be clear what you are trying to do!

Page 9: Create, curate, re-use: the expanding life course of digital research data

… or the data trajectory is…
• Hard drive → lost (crash)
• Hard drive → DVD → Cardboard box → Loft → Skip/dumpster → lost
• Sometimes this is a very bad thing
• Sometimes these are the right options!

Page 10: Create, curate, re-use: the expanding life course of digital research data

Long term bit storage…
• A solved problem? Just requires well-understood good data management practices?
• Wrong! For very large datasets over a very long time, there are significant problems…

BAKER, M., SHAH, M., ROSENTHAL, D. S. H., ROUSSOPOULOS, M., MANIATIS, P., GIULI, T. J. & BUNGALE, P. (2006) A Fresh Look at the Reliability of Long-term Digital Storage. EuroSys '06. Leuven, Belgium, ACM.

Page 11: Create, curate, re-use: the expanding life course of digital research data

How Well Must We Preserve?

Keep a petabyte for a century

– With 50% chance of remaining completely undamaged

Consider each bit decaying independently

– Analogy with radioactive decay

That's a bit half life of 10**18 years

– One hundred million times the age of the universe

That's a very demanding requirement

– Hard to measure

– Even very unlikely faults will matter a lot

• Slide from David Rosenthal, LOCKSS
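
The figures on this slide can be checked with a short calculation. The sketch below assumes a decimal petabyte (8 × 10^15 bits), independent exponential decay of each bit, and an age of the universe of roughly 1.38 × 10^10 years. None of these constants appear on the slide, and the function name is illustrative only.

```python
import math

BITS_PER_PETABYTE = 8 * 10**15      # assumption: decimal petabyte (10**15 bytes)
AGE_OF_UNIVERSE_YEARS = 1.38e10     # assumption: approximate value, not from the slide


def required_bit_half_life(n_bits, years, p_undamaged=0.5):
    """Half-life (in years) each bit needs so that all n_bits survive
    `years` with probability p_undamaged, assuming independent decay."""
    # One bit survives with probability 2 ** (-years / half_life).
    # All bits survive with probability 2 ** (-n_bits * years / half_life).
    # Setting that equal to p_undamaged and solving for half_life gives:
    return n_bits * years / -math.log2(p_undamaged)


half_life = required_bit_half_life(BITS_PER_PETABYTE, 100)
ratio = half_life / AGE_OF_UNIVERSE_YEARS
print(f"required bit half-life: {half_life:.1e} years")
print(f"multiples of the age of the universe: {ratio:.1e}")
```

Under these assumptions the required half-life comes out at about 8 × 10^17 years, which the slide rounds to 10**18, and the ratio to the age of the universe is of the order of one hundred million.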

Page 12: Create, curate, re-use: the expanding life course of digital research data

What to do about curation
• Build curation/reusability into your workflow
• Curation begins before creation
• What’s easy at first becomes (impossibly) hard later
• Describe your data (metadata schemas, “representation info”, etc)
• Keep experimental parameters (technical, who, what, when, where)
• Keep ability to process
• Keep data!
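
One way to act on "describe your data" and "keep experimental parameters" is to write a small metadata record next to each data file at the moment it is created. The following is a minimal sketch, not a DCC recommendation: the field names, the `.meta.json` suffix, the sample values and the `write_sidecar` helper are all hypothetical, and a real project would follow its own discipline's metadata schema.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_sidecar(data_path, who, what, where, technical):
    """Write a JSON metadata record next to a data file (hypothetical layout)."""
    data_path = Path(data_path)
    record = {
        "file": data_path.name,
        "who": who,
        "what": what,
        "when": datetime.now(timezone.utc).isoformat(),
        "where": where,
        "technical": technical,  # instrument settings, software versions, etc.
    }
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar


# Usage example in a temporary directory; all values are made up.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "run_001.csv"
    data_file.write_text("t,value\n0,1.5\n", encoding="utf-8")
    sidecar = write_sidecar(
        data_file,
        who="A. Researcher",
        what="calibration run",
        where="Lab 2",
        technical={"sample_rate_hz": 10},
    )
    loaded = json.loads(sidecar.read_text(encoding="utf-8"))

print(loaded["file"], loaded["who"], loaded["technical"])
```

Recording the parameters at creation time reflects the slide's point that what is easy at first becomes hard later.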

Page 13: Create, curate, re-use: the expanding life course of digital research data

What to do about curation - 2
• Use standard/agreed formats for data
• Make ownership & restrictions clear, & explain how to cite your data
• Offer for deposit in institutional or discipline repository
• Appraisal and selection essential
• Possible time-limited embargos
• “Publish” data in support of articles

Page 14: Create, curate, re-use: the expanding life course of digital research data

Internet Archaeology: publication with data

Page 15: Create, curate, re-use: the expanding life course of digital research data

Database as book…
• Buneman (early pilot) work on IUPHAR database
• MySQL to XML database
• Historic to logical schema
• XML via XSLT to LaTeX
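
The last step of that pipeline, turning an XML export into LaTeX source, can be illustrated in a few lines. This is only a sketch under stated assumptions: the slide names XSLT, whereas the code below walks the tree with Python's standard `xml.etree.ElementTree` as a stand-in, and the element names (`database`, `receptor`, `name`, `ligand`) and sample values are invented for illustration and are not taken from the IUPHAR schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML export; element names and values are invented.
SAMPLE_XML = """
<database>
  <receptor>
    <name>Receptor A</name>
    <ligand>Ligand 1</ligand>
    <ligand>Ligand 2</ligand>
  </receptor>
</database>
""".strip()

LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "_": r"\_", "#": r"\#", "$": r"\$"}


def escape_latex(text):
    """Escape a few characters that LaTeX treats specially."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def to_latex(xml_text):
    """Render each receptor as a LaTeX section with an itemised ligand list."""
    root = ET.fromstring(xml_text)
    lines = []
    for receptor in root.findall("receptor"):
        lines.append(r"\section{%s}" % escape_latex(receptor.findtext("name", "")))
        ligands = receptor.findall("ligand")
        if ligands:
            lines.append(r"\begin{itemize}")
            for ligand in ligands:
                lines.append(r"  \item %s" % escape_latex(ligand.text or ""))
            lines.append(r"\end{itemize}")
    return "\n".join(lines)


latex = to_latex(SAMPLE_XML)
print(latex)
```

A declarative stylesheet, as on the slide, keeps the mapping separate from program code; the procedural version here just makes the shape of the transformation visible.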

Page 16: Create, curate, re-use: the expanding life course of digital research data

The StORe vision

• Seamless transport from research data to research publications and vice versa

• Bi-directional links proven in social science e-research but capable of export to other disciplines

[Diagram labels: Source, Middleware, Output]

• Slide from Graham Pryor
• http://jiscstore.jot.com/WikiHome/

Page 17: Create, curate, re-use: the expanding life course of digital research data

What are the reusability issues?
• Data not neutral to hypothesis
• Hard to know the risks & pitfalls of a particular dataset
• Data not self-describing: hard to find appropriate data (but see Murray-Rust on Googling InChI etc)
• Hard to “understand” data once found
  • Really need information, not data!
• Hard to use data once understood

Page 18: Create, curate, re-use: the expanding life course of digital research data

Context
• Data meaningless without context
• Metadata of many kinds
• Representation information… from data to information
• Linkage and connection between datasets
• Use your workflow!
• Provenance
• Authenticity/integrity
• Computational lineage
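
Authenticity and integrity are commonly supported by recording a checksum when data are deposited and recomputing it later. The sketch below uses SHA-256 from Python's standard `hashlib`. The function names, the sample file and the idea of keeping the digest in a separate record are illustrative assumptions rather than anything prescribed in the talk.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """True if the file still matches the digest recorded at deposit."""
    return sha256_of(path) == recorded_digest


# Usage example with a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "observations.dat"
    dataset.write_bytes(b"level-0 telemetry")
    recorded = sha256_of(dataset)                # stored at deposit time
    intact = is_unchanged(dataset, recorded)     # later fixity check
    dataset.write_bytes(b"level-0 telemetrx")    # simulate silent corruption
    damaged = not is_unchanged(dataset, recorded)

print(intact, damaged)
```

A checksum only shows that the bits have not changed; provenance and computational lineage still have to be recorded separately.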

Page 19: Create, curate, re-use: the expanding life course of digital research data

[Figure. Recoverable labels: HRPT; SST 8-day composite and subscene; Csat 8-day composite and subscene; PAR subscene; E0; Pbopt calc; Ctot calc; Zeu calc; PPeu calc. Actors: NASA; University research group 1; University research group 2; research group 3; local decision-making body.]

• Slide from Rajendra Bose

Page 20: Create, curate, re-use: the expanding life course of digital research data

Access and re-use
• Ethics and rights control access
  • Weak in expressing this long-term
• Collaboration tools
  • Annotation, discussion, review (see DART…)
  • Re-use leading to change and development
• “Publication”
  • Not just in “print”
  • Underlying data should be “published”, too

Page 21: Create, curate, re-use: the expanding life course of digital research data

Who does curation?
• Individuals
• Departments or groups
• Institutions, maybe through libraries
• Communities
• Disciplines
• Publishers
• National services
• Other 3rd parties…

Page 22: Create, curate, re-use: the expanding life course of digital research data

Curation: Individual
• “Small science 2-3 times more data than Big science”, but much more at risk
• PhD student? RA? PI? Administrator? IT support?
• Data potentially on local hard drives, or at best shared network drives
  • May be inadequately protected
  • Liable for policy-led deletion on resignation
• Individual “knows” too much (tacit knowledge)
  • Documentation/metadata unlikely to be adequate
• Future: gone!

Page 23: Create, curate, re-use: the expanding life course of digital research data

Curation: Individual

• © Marita Bushell

Page 24: Create, curate, re-use: the expanding life course of digital research data

Department: eCrystals
• Partnership with Institutional Repository
• Specialist department archive (& national service)
• Workflow recording of lab parameters (R4L)
• Public & private elements
• Trying to build eCrystals federation (eBank 3)
• Future: likely to continue

Page 25: Create, curate, re-use: the expanding life course of digital research data

Data in institutional repositories

Page 26: Create, curate, re-use: the expanding life course of digital research data

Institution: Cambridge Chemistry
• 175,000 small molecule structures in CML
• Alongside Archaeology, Manuscripts, Learning Materials, etc

• No library curation skills; dependent on research group enthusiast

• Collection isolated from other Chemistry

• (Only 5 UK institutional repositories claim to hold data)

• Future: assured…

Page 27: Create, curate, re-use: the expanding life course of digital research data

Community: LOCKSS?
• Self-selected group of collectors: closest to genuine open activity (despite Alliance)?
• Traditionally libraries collecting eJournals
• Model respects IPR
• No domain expertise; rely on origins
• Data limitations…
• Future: potentially very persistent (low cost, high reliability, attack resistance, distributed)

Page 28: Create, curate, re-use: the expanding life course of digital research data

Discipline: Atmospheric Science
• Strong believer in need for domain scientists as curators

• Significant participant in “community proxy” agenda-setting activities

• Internationally fragmented resources

• Future: mostly dependent on grant funding (but strong commitment)

Page 29: Create, curate, re-use: the expanding life course of digital research data

Bio-informatics: Nature article 23 June 05

• Databases in Peril
  • 51 out of 89 biological databases contacted reported they were struggling financially
  • 7 have closed
  • Several being updated in owner’s spare time
  • (Notes that not all deserve long term support)

• [Nucleic Acids Research reports 968 databases in 2007!]

• Major issue: money

Page 30: Create, curate, re-use: the expanding life course of digital research data

Publisher: Crystallography

• Publisher and Scientific Union

• Created key domain crystallographic standard (CIF)

• Strong motivator for deposit of structure data

• Consistent quality checks
• DOIs used for structure data
• Future: publishing business model

• Slide from IUCr

Page 31: Create, curate, re-use: the expanding life course of digital research data

National bodies: British Library
• Serious and robust approach
• Legal deposit powers & responsibilities as driver
• Oriented primarily towards “cultural heritage” (broadly interpreted)

• Little data, no science domain experience

• Future: strong future commitment

Page 32: Create, curate, re-use: the expanding life course of digital research data

National bodies: TNA/NDAD
• Specialist archive for government datasets
• Understand government regulations, dynamics & requirements

• Subject generalists; disconnected from associated science

• Technology specialists (understand databases)

• Future: likely to pass eventually to The National Archives

Page 33: Create, curate, re-use: the expanding life course of digital research data

3rd parties: Portico
• Specific area: eJournals
• Depends on publisher agreements
• No data or domain science expertise
• Future: commitment from Mellon + publishers + subscriptions, good funding mix

Page 34: Create, curate, re-use: the expanding life course of digital research data

3rd Parties: Iron Mountain?
• Records management IS a curation problem
• Organisations like this very likely to branch out
• No domain science expertise
• Future: business case, viability, stock market…

Page 35: Create, curate, re-use: the expanding life course of digital research data

3rd parties: Web 2.0 style, Swivel.com??

Page 36: Create, curate, re-use: the expanding life course of digital research data

Institutions & the network
• Institutions have fundamental sustainability

• Disciplines have domain knowledge advantage but sustainability is an issue

• Can we get the best of both?

• Needs serious work to examine!

[Grid of disciplines against institutions: columns Inst’n 1, Inst’n 2, Inst’n 3; rows Discipline 1, Discipline 2, Discipline 3, etc; two cells marked X in each row]

Page 37: Create, curate, re-use: the expanding life course of digital research data

Who are the curation players?

Page 38: Create, curate, re-use: the expanding life course of digital research data

Cultural change
• If we build it, will they come? NO!!
• Outreach important: communication with scientists and researchers is hard graft
• Cultural change to new approach requires more:
  • Incentives, rewards and mandates
  • Successful exemplars (well publicised)
  • Discipline-oriented approach (one size does not fit all)

Page 39: Create, curate, re-use: the expanding life course of digital research data

Australian context?
• In the emerging context of the Research Quality Framework, and the expected National Collaborative Research Infrastructure Strategy, curation can only increase in importance!

Page 40: Create, curate, re-use: the expanding life course of digital research data

Thank you

• (Citations in paper in proceedings)