http://www.dans.knaw.nl Dirk Roorda, coordinator infrastructure

Post on 19-Dec-2015

Overview

Part 1: The rising role of data

Part 2: The free use of data

Part 3: The care for data

Part 4: The re-use of data

Part 1: The rising role of data

http://en.wikipedia.org/wiki/Exabyte

Internet size (May 2009): 500 EB

500,000 PB

500 million TB

500 million fat USB disks

500 billion memory cards of 1 GB

70 memory cards per person
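
The conversions above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal (SI) units and a 2009 world population of roughly 6.8 billion; the population figure is an assumption and does not appear on the slide.

```python
# Decimal (SI) storage units, expressed in bytes.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18

# Estimate quoted on the slide (May 2009).
internet_bytes = 500 * EB

# Assumption, not from the slide: roughly 6.8 billion people in 2009.
WORLD_POPULATION_2009 = 6_800_000_000

petabytes = internet_bytes // PB
terabytes = internet_bytes // TB
gigabyte_cards = internet_bytes // GB
cards_per_person = gigabyte_cards / WORLD_POPULATION_2009

print(f"{petabytes:,} PB")
print(f"{terabytes:,} TB")
print(f"{gigabyte_cards:,} memory cards of 1 GB")
print(f"about {cards_per_person:.0f} memory cards per person")
```

Under these assumptions the result is a little over 70 cards per person, in line with the rounded figure on the slide.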

Data deluge

http://www.datadeluge.com/

http://en.wikipedia.org/wiki/File:Tree_of_life_SVG.svg

http://tolweb.org/tree/

Where does it come from?

• Instruments
  • satellites, sensors, DNA sequencing
• Records
  • administrations, censuses, surveys
• Digitisation
  • the analog legacy
• Hobby
  • pictures, movies, genealogy
• Integration
  • better interoperability of existing data

The driving force

Information and Communication Technology

Babbage Analytical Engine, 1870

A datacenter

Genealogy

2.5 PB

5328 servers

1.12 MW

http://blog.familytreemagazine.com/insider/Inside+Ancestrycoms+TopSecret+Data+Center.aspx

http://www.ancestry.com/

A closer look

• Linguistics
  • text corpora, automatic translation
• Philology
  • how to read a million books?
• History
  • historical census data
• Archaeology
  • archive law, commercial research

Linguistics and Philology

• A chronometric approach to Indian alchemical literature
• Assessing frequency changes in multistage diachronic corpora
• Evaluating methods for computer-assisted stemmatology using artificial benchmark data sets
• A Corpus Study of the Rigveda
• Dictionary generation for less-frequent language pairs using WordNet
• An exercise in non-ideal authorship attribution: the mysterious Maria Ward

http://llc.oxfordjournals.org/

History

http://www.volkstellingen.nl/nl/

http://www.volkstellingen.nl/en/

Archaeology

http://edna.itor.org/nl/intern/upload_directory/a00002/downloads/IMG0013.tif

Archaeology (2)

http://edna.itor.org/nl/oai/oai_addi/oai_addi/OAI:EVALMA:a00002.xml/

Part 2: The free use of data

Open Access

Data is information

Information is knowledge

Knowledge is power

Why share it?

Open Access

Shared knowledge is double knowledge

Without free sharing of knowledge, scientific progress will halt

Tensions between sharing and not sharing remain, though

A good example

http://www.ploscompbiol.org/home.action

Work to do

• organise your data
  • let your data work together with those of others (colleagues, future scientists, the public)
• ask new questions of the data
  • because there is so much of it
• create new (virtual) data collections

Part 3: The care for data

Research Data Recycling

• existing data
  • collecting by experiments, surveys
• primary research data
  • verifying results by others
  • preserving unique data from experiments
• compilation, aggregation, annotation
  • databanks
• data mining, analysis, visualisation
  • new data as research input

Challenge: Software

Operating system (DOS, Windows 95, ...)

Programming Languages (Basic, Pascal)

File formats (Word Perfect, dBase)

Applications (Addressbook, Websites)

Old data may be locked up in old software.

Meeting the challenge

To prevent the problem in the future

Backward compatibility

Open Standards

Open Source Applications

Modular software engineering

keep data separated from interface and business logic

To remedy the problems of the past

Emulation

Migration
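
Migration can be as simple as re-encoding an old text file. The sketch below is only an illustration, assuming a DOS-era file in code page 437; the function name `migrate_text`, the file names, and the sample content are hypothetical and do not describe any DANS procedure.

```python
import tempfile
from pathlib import Path


def migrate_text(source: Path, target: Path, legacy_encoding: str = "cp437") -> None:
    """Re-encode a legacy text file as UTF-8, leaving the original untouched."""
    # Decode the bytes using the encoding the old software wrote them in.
    text = source.read_bytes().decode(legacy_encoding)
    # Normalise DOS line endings (CR LF) to a single LF.
    text = text.replace("\r\n", "\n")
    # Write the result in an open, widely supported encoding.
    target.write_bytes(text.encode("utf-8"))


# Demonstration with an invented two-line file, as a DOS program might have saved it.
with tempfile.TemporaryDirectory() as workdir:
    old_file = Path(workdir) / "notes_dos.txt"
    new_file = Path(workdir) / "notes_utf8.txt"
    old_file.write_bytes(b"caf\x82\r\nna\x8bve\r\n")
    migrate_text(old_file, new_file)
    migrated = new_file.read_bytes().decode("utf-8")

print(migrated)
```

The original file is kept, so a faulty migration can always be repeated from the source.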

Challenge: Human organisation

Forgotten jargon

Forgotten knowledge

No metadata

Websites with broken links

Jargon

• II.17. Posterior berry aneurysm with subarachnoid bleed.

• II.18. Subarachnoid bleed with extension into the ventricles.

• II.19. Ruptured berry aneurysm at the end of the internal carotid artery, with obstructive hydrocephalus. Morgagni found the rupture.

• II.22. Subarachnoid hemorrhage.

http://www.pathguy.com/morgagni.htm

Meeting the challenge

Persistent Identifiers

Enough Metadata

Codification of knowledge and practices

Wikipedia

Data management early on
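
One way to picture how persistent identifiers guard against broken links: citations point to a stable identifier, and a resolver maps that identifier to wherever the data currently lives. The following is a minimal sketch of that indirection only; the identifier, the URLs, and the function names are invented for illustration and do not correspond to any real resolver service.

```python
# Hypothetical resolver table: persistent identifier -> current location.
# The identifier and URLs are invented for illustration.
RESOLVER = {
    "pid:example-0001": "https://old-server.example.org/data/census.csv",
}


def resolve(pid: str) -> str:
    """Return the current location registered for a persistent identifier."""
    if pid not in RESOLVER:
        raise LookupError(f"unknown persistent identifier: {pid}")
    return RESOLVER[pid]


def relocate(pid: str, new_url: str) -> None:
    """Record that the data moved; the identifier itself never changes."""
    if pid not in RESOLVER:
        raise LookupError(f"unknown persistent identifier: {pid}")
    RESOLVER[pid] = new_url


citation = "pid:example-0001"  # what a publication would cite
before_move = resolve(citation)

# The archive reorganises its servers; only the resolver entry is updated.
relocate(citation, "https://archive.example.org/datasets/0001/census.csv")
after_move = resolve(citation)

print(before_move)
print(after_move)
```

The citation stays valid after the move, because only the resolver entry changed.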

Part 4: The re-use of data

Data management

Use common infrastructure rather than private means

Use open formats rather than proprietary formats

Use open source software rather than closed software

Use standard ways of documenting data

taxonomies, ontologies, metadata schemes
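
As an illustration of a standard way of documenting data, the sketch below writes a small metadata record using Dublin Core element names. Dublin Core is chosen here as one well-known metadata scheme; the slides do not name a specific one, and the field values and the `record` wrapper element are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def describe(fields: dict) -> str:
    """Serialise a flat mapping of Dublin Core element names to values as XML."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(record, encoding="unicode")


# Hypothetical values, for illustration only.
xml_text = describe({
    "title": "Example census table",
    "creator": "Example Research Group",
    "date": "2009",
    "format": "text/csv",
    "rights": "Open Access",
})

print(xml_text)
```

Because the element names come from a shared scheme, another archive or harvester can read the record without knowing anything about the project that produced it.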

Common Infrastructure

Local file shares

University repository

DANS

European Infrastructures

DANS

http://easy.dans.knaw.nl/dms

EASY

Dataset

Datafiles

Metadata

Linguists make their technology accessible: resources, algorithms, techniques

Humanities and social sciences: they are the target users

Geleerdenbrieven = Circulation of Knowledge

Archiving = circulation of information

Keep imagining