The future of the DCC

A centre of expertise in data curation and preservation
Funded by:
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ ; or, (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
The future of the DCC
Chris Rusbridge
E-Science Workshop, April 2009

Upload: chris-rusbridge

Post on 30-Nov-2014

DESCRIPTION

Presentation to the National e-Science Centre workshop, 2009

TRANSCRIPT

Page 1: The future of the DCC

a centre of expertise in data curation and preservation

Funded by:

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ ; or, (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

The future of the DCC

Chris Rusbridge

E-Science Workshop April 2009

Page 2: The future of the DCC

Contents
• Curation & integrated science
• Poetry & Philosophy of D H Rumsfeld
• Designated Community & Knowledge Base
• DCC services
• Future of the DCC

Page 3: The future of the DCC

Curation
• Wikipedia
  • Curator: a content specialist responsible for an institution's collections and, together with a publications specialist, their associated collections catalogs.
  • Digital Curation: the curation, preservation, maintenance, collection and archiving of digital assets
  • Sheer curation: an approach to digital curation where curation activities are quietly integrated into the normal work flow of those creating and managing data and other digital assets.
• DCC: Digital curation is maintaining and adding value to a trusted body of digital information for current and future use.

Page 4: The future of the DCC

Integrated Science
• The application of multiple scientific disciplines to one or more core scientific challenges
• Examples of integrated sciences?
  • Archaeology
  • Environmental sciences

Page 5: The future of the DCC

Integrated Science implications
• Scientists will be using unfamiliar data, therefore
• Data curators and managers must make their data available for unfamiliar users!
• And now for something unfamiliar?

Page 6: The future of the DCC

Poetry & Philosophy of D H Rumsfeld

Hart Seely, April 2, 2003, SLATE http://www.slate.com/id/2081042/

Page 7: The future of the DCC

A Confession
‘Once in a while,
I'm standing here, doing something.
And I think,
"What in the world am I doing here?"
It's a big surprise.’
- May 16, 2001, interview with the New York Times

Page 8: The future of the DCC

The Unknown
‘As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.’
- Feb. 12, 2002, Department of Defense news briefing

Page 9: The future of the DCC

The 4th Rumsfeld?
• 3 epistemological classes (???)
  • Known knowns
  • Known unknowns
  • Unknown unknowns
• 4th class?
  • Unknown knowns?
  • Critical issue for cross-disciplinary sciences

Page 10: The future of the DCC

Some OAIS Concepts?
• Knowledge Base: allows a consumer to understand something
• Designated Community: the set of consumers for whom the archive curates something
• Representation Information: helps you interpret a data object yielding an information object
• The amount and nature of RepInfo required is dependent on the Knowledge Base of the Designated Community
  • If you curate for project colleagues in the short term, little if any RepInfo required
  • If you curate for those unfamiliar with the data, more RepInfo is needed
• (All broadly interpreted!)
• CCSDS (2002). Reference Model for an Open Archival Information System (OAIS). Retrieved from http://public.ccsds.org/publications/archive/650x0b1.pdf.

Page 11: The future of the DCC

Time
• Knowledge Base (KB) is f1(DC, t)
• Designated Community (DC) is f2(t)
• RepInfo needed is f3(f1(DC, t), f2(t))
• (but none of these concepts can be precisely defined!)
• If DC is small and t is short (months to a year or so), then both may be ignored, and RepInfo can be assumed to be part of the KB
• If DC is extensive (eg cross-discipline) and t is long (5 years to 25 plus), then RepInfo must be articulated
• If t is very long, most bets are off (post-hoc reconstruction likely to be needed)
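The rule of thumb on this slide can be sketched in code. This is an illustration only, not a DCC or OAIS method: the function name, the boolean `community_is_broad`, the numeric cut-offs and the returned labels are all assumptions, because the slide gives rough ranges and stresses that none of these concepts can be precisely defined. Cases the slide does not cover (for example a small community over a long period) are assumed here to fall on the cautious side.

```python
# Illustrative cut-offs only; the slide gives rough ranges, not thresholds.
SHORT_TERM_YEARS = 1.0   # assumption: "months to a year or so"
VERY_LONG_YEARS = 50.0   # assumption: the slide does not say where "very long" begins


def repinfo_guidance(community_is_broad: bool, years: float) -> str:
    """Return a rough label for how much RepInfo needs to be written down."""
    if years < 0:
        raise ValueError("years must be non-negative")
    if years > VERY_LONG_YEARS:
        # "Most bets are off": expect post-hoc reconstruction.
        return "reconstruct"
    if not community_is_broad and years <= SHORT_TERM_YEARS:
        # Small DC, short t: RepInfo is assumed to live in the KB already.
        return "implicit"
    # Assumption: every other combination is treated as needing
    # explicitly articulated RepInfo.
    return "articulate"
```

For example, `repinfo_guidance(False, 0.5)` gives "implicit", while `repinfo_guidance(True, 10)` gives "articulate".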

Page 12: The future of the DCC

What might RepInfo include?
• Structure information: file format definitions, etc
• Semantic information: data dictionaries, code books etc
• Robust methods (working code?)
• Not to mention many kinds of metadata, provenance, documentation of hidden assumptions, etc
• Cross-domain schemas one approach to articulating RepInfo?
• (Never perfect, of course)
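One way to picture the categories above is as a simple record attached to a dataset. The sketch below is hypothetical: the class name `RepInfo`, its four field names and the `missing` helper are invented for illustration and do not come from OAIS or any DCC tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RepInfo:
    """Hypothetical bundle of Representation Information for one dataset."""
    structure: List[str] = field(default_factory=list)  # file format definitions
    semantics: List[str] = field(default_factory=list)  # data dictionaries, code books
    methods: List[str] = field(default_factory=list)    # robust methods, working code
    context: List[str] = field(default_factory=list)    # metadata, provenance, hidden assumptions

    def missing(self) -> List[str]:
        """List the categories for which nothing has been recorded yet."""
        return [name for name, items in vars(self).items() if not items]
```

A curator could fill in what exists and use `missing()` as a prompt for what still needs documenting, bearing in mind the slide's caveat that the result is never perfect.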

Page 13: The future of the DCC

What about Rumsfeld 4?
• Biggest concern with unfamiliar users is clashing concepts, eg different baselines, units, geographies, granularity
  • Especially where terms are ambiguous or differently interpreted
  • The KBs of two DCs conflict, potentially silently
  • Happens all the time, of course
• The unspoken: tacit knowledge, unknown knowns!
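A silent clash of this kind can only be caught if the tacit assumptions are written down alongside the data. The sketch below assumes two hypothetical data series that each declare a unit and a baseline; the `Series` class, its fields and the `clashes` function are invented for illustration, and the values are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Series:
    """A data series with its otherwise tacit assumptions made explicit."""
    name: str
    values: Tuple[float, ...]
    unit: str      # e.g. "degC"
    baseline: str  # e.g. the reference period for an anomaly


def clashes(a: Series, b: Series) -> List[str]:
    """Return the declared attributes on which the two series disagree."""
    return [attr for attr in ("unit", "baseline")
            if getattr(a, attr) != getattr(b, attr)]


# Made-up example: same unit, different baselines.
site_a = Series("site A anomaly", (0.3, 0.5), "degC", "1961-1990")
site_b = Series("site B anomaly", (0.4, 0.6), "degC", "1981-2010")
conflicts = clashes(site_a, site_b)  # ["baseline"]
```

Without the declared `baseline` field, the two series would look directly comparable and the conflict would pass unnoticed.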

Page 14: The future of the DCC

Timing
• Curation starts before creation
  • Before project proposal!
• Data acquisition should not happen at the end
  • Continuous acquisition much better?
• Enforcement… or credit for data?

Page 15: The future of the DCC

Other curation issues of concern
• Sustainability (work on your survival)
• Succession (what happens to your data if you don’t)
• Data audit (know what you’ve got)
• Data risk assessment (assess your chances of loss)
• Repository external audit???
• Provenance & computational lineage
• Archiving database changes
• Community proxy roles: help your communities develop data standards & data practices
• DCC has tools & support for some of these…

Page 16: The future of the DCC

… and Research Outputs?
• Need more semantically aware texts to support cross-community understanding
• Coded up (cf microformats, RDFa)
  • People
  • Citations & references
  • Science features (eg chemicals, reactions)
  • Graphs, spectra, tables linking to
    • Supplementary data
• PDF is pretty bad at this
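To illustrate what "coded up" text makes possible, the sketch below pulls labelled items out of a sentence whose spans carry an RDFa-style `property` attribute. It is a simplification under stated assumptions: the sample sentence and the property names `author` and `chemical` are invented, no real vocabulary is declared, and a real RDFa processor does far more than this.

```python
from html.parser import HTMLParser

# Invented sample: a sentence with two annotated spans.
SAMPLE = ('<p><span property="author">A. Author</span> measured '
          '<span property="chemical">ethanol</span>.</p>')


class PropertyExtractor(HTMLParser):
    """Collect (property, text) pairs from elements with a property attribute."""

    def __init__(self):
        super().__init__()
        self.found = []        # list of (property, text) pairs
        self._current = None   # property of the element just opened, if any

    def handle_starttag(self, tag, attrs):
        # Remember the property name so the next text node can be labelled.
        self._current = dict(attrs).get("property")

    def handle_data(self, data):
        if self._current:
            self.found.append((self._current, data.strip()))
            self._current = None

    def handle_endtag(self, tag):
        # Text after a closing tag is not annotated.
        self._current = None


extractor = PropertyExtractor()
extractor.feed(SAMPLE)
# extractor.found == [("author", "A. Author"), ("chemical", "ethanol")]
```

The same sentence rendered as PDF would carry no such labels, which is the slide's complaint about PDF.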

Page 17: The future of the DCC

DCC Phase 3
• Post January 2010?
• Smaller (2/3 budget if we’re lucky)
• Joint planning with JISC
• More tightly managed (hub and spoke)
• No development (says JISC)
• Core services plus optional additional services
• 1st draft seen by JSR
• Evaluation reported to JISC
• Feedback session next week

Page 18: The future of the DCC

Proposed core services
• Reference Resources and Exemplars
• Training and Staff Development
• Expertise, Advice, Consultancy and Hands-on Support
• Community-building and Information-sharing activities
• Data Management and Sharing Plans
• Policy and Strategic Development
• Providing Access to Tools and Toolkits

Page 19: The future of the DCC

Possible additional services
• Development of Tools, Toolkits, Wizards and Templates
• Infrastructure Services
• Model licences for data
• Data citation guidelines

Page 20: The future of the DCC

Relationship to UKRDS?
• Overlap of territory
• Aiming for complementarity rather than conflict
• DCC becomes core part of UKRDS
• Some issues about the vision, though

Page 21: The future of the DCC

What do you want from the DCC?