Beyond Storage: Rethinking the role of repositories in scholarly communication DELOS Workshop Digital Repositories: Interoperability and Common Services May 11, 2005 Sandy Payette Cornell University


Page 1: Beyond Storage: Rethinking the role of repositories  in scholarly communication

Beyond Storage:

Rethinking the role of repositories in scholarly communication

DELOS Workshop
Digital Repositories: Interoperability and Common Services

May 11, 2005

Sandy Payette
Cornell University

Page 2

First… is there a problem?

Page 3

Existing scholarly communication system

• Does not mirror the reality of the scholarly process

• Published information artifacts do not resemble the rich information that is produced along the process

• Not evolved enough to enable easy and effective integration and dissemination of new, rich forms of digital information

Page 4

The Future: Rich Scholarly Information Networks

[Figure: today's disconnected networks (a formal publication network and a social network of actors) contrasted with a hybrid network of documents (formal and informal), data, services, and actors. Legend: actor, formal document, informal document, data sets, web service.]

Page 5

Roles of digital repositories today

• Early Dissemination:
  – Enhance upstream scholarly communication
  – Improvement over traditional pre-print (paper) sharing among scholars

• Open Access:
  – Harnad’s “subversive proposal”
  – Possibility of bypassing or eliminating traditional publisher model

• Document Discovery:
  – Searching for documents in a repository
  – Federation or metadata harvest for search over multiple repositories

• Storage and Archiving:
  – E-print archives: author self-archiving gives scholars control over their intellectual output
  – Institutional repositories: institutions commit to preservation

Page 6

Evolutionary, but not revolutionary

• In many ways repositories represent an evolution of the traditional publishing paradigm
  – Submit documents
  – Gain access to documents…
  – Share results earlier in the scholarly process, and electronically

• Still locked into document-centric paradigm
  – Store documents to promote access
  – Store documents to promote archiving
  – Index documents to promote search and discovery
  – Citation analysis to understand relationships of documents

Page 7

Signs of Change – Scholars exercising the network

• Grid computing in sciences
  – Share computing resources
  – Share services and distributed virtual file systems
  – Examples
    • Enabling Grids for E-Science (http://public.eu-egee.org/)
    • National Virtual Observatory (http://www.us-vo.org/)

• Humanities computing
  – Hyperlinked historical documentary editions
  – New Forms of Digital Scholarship
    • Rossetti archive (http://www.rossettiarchive.org/)
    • Perseus (www.perseus.tufts)
    • Pompeii Forum (http://pompeii.virginia.edu)
  – Tibetan and Himalayan Digital Library (thdl.org)

Page 8

Vision for more revolutionary approach

Page 9

The revolutionary opportunity…

• Looming on the horizon is the potential of a future scholarly communication system that is
  – Highly collaborative
  – Network-based
  – Data-intensive
  – Process-oriented

• We can change the way research and education are conducted by exposing rich knowledge-oriented information assets

• Digital repositories must be rationalized within this broader vision.

Page 10

New Functionality

• Content aggregation:
  – combining information entities in novel ways

• Knowledge integration:
  – capturing semantic and factual relationships among information entities

• Information reuse:
  – allowing secondary, tertiary products

• Information transformation:
  – combining information entities with computational services

• Collaboration and contribution:
  – blurring the line between authors, publishers, users, experts…

Page 11

A New Scholarly Information System

3 Basic Requirements

1. Redefine the “information unit” of scholarly communication

2. Create a scholarly communication system that better supports the process of research and learning

3. Record the “crumb trails” of the scholarly process

Page 12

(1) The new “information unit”

• Documents
• Text
• Data
• Simulations
• Images
• Video
• Computations
• Automated Analyses

[Figure labels: Data; Aggregations]

Page 13

(2) Process-oriented Scholarly Communication System

• Decompose the traditional process (Roosendaal & Geurts)
  – Registration (establish intellectual priority of result)
  – Certification (certify quality and validity of result)
  – Awareness (ensure accessibility)
  – Archiving (ensure availability for future use)
  – Rewarding (means to support tenure, promotion, compensation)

• But, they missed some things…

Page 14

(2) Process-oriented Scholarly Communication System

• Add new services to the mix
  – Workflow
  – Collaborative functions (e.g., annotation, re-use)
  – Data mining and analysis
  – Preservation monitoring and migration

• Expose all as network-accessible atomic services
  – Service discovery
  – Service invocation
  – Service aggregation, orchestration, choreography
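The three operations on atomic services named on this slide (discovery, invocation, aggregation) can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: the `ServiceRegistry` class, the tag scheme, and the two toy services are hypothetical, and a real system would expose these operations over the network rather than as local function calls.

```python
from typing import Callable, Dict, List

# A service is modeled as a function from one object description to another.
Service = Callable[[dict], dict]


class ServiceRegistry:
    """Hypothetical registry of atomic services (illustration only)."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}
        self._tags: Dict[str, List[str]] = {}

    def register(self, name: str, service: Service, tags: List[str]) -> None:
        self._services[name] = service
        self._tags[name] = list(tags)

    def discover(self, tag: str) -> List[str]:
        """Service discovery: names of services advertising the tag."""
        return sorted(n for n, t in self._tags.items() if tag in t)

    def invoke(self, name: str, payload: dict) -> dict:
        """Service invocation: call a single atomic service."""
        return self._services[name](payload)

    def orchestrate(self, names: List[str], payload: dict) -> dict:
        """Service aggregation: pass each service's output to the next."""
        for name in names:
            payload = self.invoke(name, payload)
        return payload


def annotate(obj: dict) -> dict:
    # Toy collaborative function: attach a fixed annotation.
    return {**obj, "annotations": obj.get("annotations", []) + ["reviewed"]}


def check_format(obj: dict) -> dict:
    # Toy preservation check: flag whether the MIME type is PDF.
    return {**obj, "format_ok": obj.get("mime") == "application/pdf"}


registry = ServiceRegistry()
registry.register("annotate", annotate, tags=["collaboration"])
registry.register("check_format", check_format, tags=["preservation"])

result = registry.orchestrate(
    ["check_format", "annotate"],
    {"id": "obj:1", "mime": "application/pdf"},
)
```

Keeping each service a small function with a uniform input and output is what makes aggregation a simple loop; richer orchestration or choreography would need explicit control flow and error handling.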

Page 15

Process-orientation - workflows

[Figure: two example workflows applied to a digital object, drawing on a “World of Services”.
Ingest-oriented process (labels shown): SIP, Validate bytestreams, Ingest to Repo, Link to Simulation Service, Assign Access Policy, Index and Register.
Preservation-oriented process (labels shown): Visit The Doctor, Format Migration, Object Versioning In Repo, Make Copies, Ingest To Archive.]
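One way to read the ingest-oriented process is as an ordered list of small steps applied to a digital object. The sketch below assumes the step order implied by the slide labels; the `DigitalObject` class, the `make_step` helper, and the step bodies are hypothetical placeholders, not the behavior of any particular repository system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DigitalObject:
    pid: str
    content: bytes
    history: List[str] = field(default_factory=list)


Step = Callable[[DigitalObject], DigitalObject]


def validate_bytestreams(obj: DigitalObject) -> DigitalObject:
    # Reject empty content; a real validator would inspect the file format.
    if not obj.content:
        raise ValueError(f"{obj.pid}: empty bytestream")
    obj.history.append("validated")
    return obj


def make_step(label: str) -> Step:
    """Build a placeholder step that only records that it ran."""
    def step(obj: DigitalObject) -> DigitalObject:
        obj.history.append(label)
        return obj
    return step


# Step order is an assumption based on the labels in the slide's figure.
INGEST_WORKFLOW: List[Step] = [
    validate_bytestreams,
    make_step("ingested"),
    make_step("linked to simulation service"),
    make_step("access policy assigned"),
    make_step("indexed and registered"),
]


def run_workflow(obj: DigitalObject, steps: List[Step]) -> DigitalObject:
    # Apply each step in order; any exception aborts the workflow.
    for step in steps:
        obj = step(obj)
    return obj


done = run_workflow(DigitalObject("demo:1", b"%PDF-1.4"), INGEST_WORKFLOW)
```

A preservation-oriented process could reuse `run_workflow` with a different step list, which is the point of treating each box in the figure as an independent service.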

Page 16

(3) Record the “crumb trails”

• Events
  – Critical state transitions of information assets
  – Preservation-noteworthy events

• Provenance
  – When we enable re-use and re-combination of assets, we must be able to show where they came from

• Relationships
  – Among information assets
  – Versions of an asset
  – Between agents and assets
  – Between services and assets
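A minimal sketch of such a crumb trail is an append-only event log in which each event names the asset, the agent, and (for derived assets) the source asset. The `Event` and `CrumbTrail` names, the event kinds, and the identifiers are all hypothetical, chosen only to show how provenance can be answered by walking recorded events backwards.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    asset_id: str
    kind: str                  # e.g. "created", "derived", "migrated"
    agent: str                 # who or what caused the event
    source_id: Optional[str]   # asset this one was derived from, if any
    timestamp: datetime


class CrumbTrail:
    """Append-only log of events about information assets."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def record(self, asset_id: str, kind: str, agent: str,
               source_id: Optional[str] = None) -> Event:
        event = Event(asset_id, kind, agent, source_id,
                      datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def provenance(self, asset_id: str) -> List[str]:
        """Follow source links from an asset back to its origin."""
        chain = [asset_id]
        seen = {asset_id}
        current = asset_id
        while True:
            parent = next(
                (e.source_id for e in self._events
                 if e.asset_id == current and e.source_id),
                None,
            )
            if parent is None or parent in seen:  # origin reached or cycle
                break
            chain.append(parent)
            seen.add(parent)
            current = parent
        return chain


trail = CrumbTrail()
trail.record("data:1", "created", agent="alice")
trail.record("plot:1", "derived", agent="bob", source_id="data:1")
trail.record("paper:1", "derived", agent="carol", source_id="plot:1")
lineage = trail.provenance("paper:1")
```

The same log also captures the agent-to-asset relationship on each event; relationships between services and assets could be recorded the same way by treating a service as the agent.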

Page 17

How are current repository technologies poised?

Page 18

Selected repositories with notable features re: the vision

• Open-source repository software
  – Fedora
  – DSpace

• Installed Systems
  – aDORe (Los Alamos National Laboratory)
  – arXiv

• Grid projects
  – Storage Resource Broker (SRB)
  – Chimera

Page 19

Fedora vs. the vision

• Flexible digital object model
• Services associated with digital objects
• Relationships among digital objects
  – Relationship ontology
  – RDF-based metadata
  – Search the repository “as a graph”

• Upcoming – new security architecture
  – Policy enforcement (XACML)
    • Repository policy
    • Object policies (fine-grained control)

Page 20

Fedora Repository – Web Services

[Figure: Fedora Repository modules (Manage, AuthN, AuthZ, Access, Validation, Resource Index, Storage, Dissemination, Registry) and their web services exposure (Manage, Access, Registry, Search, RDF Index, OAI Provider) over REST and SOAP to clients: client app, batch program, other service, web browser.]

Page 21

Fedora Objects – RDF Graph view

[Figure: RDF graph of a collection object (info:fedora/collection:1) and two member objects (info:fedora/image:11, info:fedora/image:12).
Predicates shown: hasMember, hasRep, lastModDate, dc:creator.
Representations shown: info:fedora/collection:1/bdef:1/MEMBERS, info:fedora/image:11/BLDG, info:fedora/image:11/bdef:2/getRelatedLetter, info:fedora/image:12/BLDG, info:fedora/image:12/bdef:2/getHIGH.
Literal values shown: "2005-01-10:11:02", "2005-02-01:12:05", "2005-01-01:10:00", "Elly Cramer", "Chris Wilper", "Eddie Shin".]
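Searching a repository “as a graph” can be sketched with plain subject-predicate-object triples. The identifiers below are the ones visible in the figure, but their pairing into triples is an assumption inferred from the URI structure; the date and creator literals are left out because the extracted text does not show which object each belongs to. The `match` helper is a hypothetical stand-in for a real RDF query engine.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Assumed triples, reconstructed from the identifiers in the figure.
TRIPLES: List[Triple] = [
    ("info:fedora/collection:1", "hasMember", "info:fedora/image:11"),
    ("info:fedora/collection:1", "hasMember", "info:fedora/image:12"),
    ("info:fedora/collection:1", "hasRep",
     "info:fedora/collection:1/bdef:1/MEMBERS"),
    ("info:fedora/image:11", "hasRep", "info:fedora/image:11/BLDG"),
    ("info:fedora/image:11", "hasRep",
     "info:fedora/image:11/bdef:2/getRelatedLetter"),
    ("info:fedora/image:12", "hasRep", "info:fedora/image:12/BLDG"),
    ("info:fedora/image:12", "hasRep",
     "info:fedora/image:12/bdef:2/getHIGH"),
]


def match(s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def member_representations(collection: str) -> List[str]:
    """Two-hop query: representations of every member of a collection."""
    reps: List[str] = []
    for _, _, member in match(s=collection, p="hasMember"):
        reps.extend(obj for _, _, obj in match(s=member, p="hasRep"))
    return reps


reps = member_representations("info:fedora/collection:1")
```

The two-hop query is the kind of question that is awkward against per-object metadata records but natural once relationships are stored as graph edges.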

Page 22

DSpace vs. the vision

• The related Simile project is most interesting
  – Significance: semantic web technologies brought to the task of search and discovery across different repository systems
  – RDF-based search across heterogeneous metadata formats
  – Ontology-based

• DSpace History system
  – Event recording
  – RDF-based

• Opportunity in DSpace 2
  – Web service exposure?
  – Service-based dissemination architecture?

Page 23

LANL’s aDORe vs. the vision

• Standards-based repository architecture
  – OAI-PMH
  – MPEG21-DIDL
  – OpenURL

• Very good example of the use of simple protocols to enable modular service-based architecture

• Services dynamically associated with objects
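To show why OAI-PMH counts as a simple protocol, the sketch below builds a harvesting request as an ordinary HTTP GET URL. The `verb` and `metadataPrefix` parameters and the `oai_dc` format come from the OAI-PMH specification; the base URL is a hypothetical placeholder, and the helper function is only an illustration, not part of aDORe.

```python
from urllib.parse import urlencode


def oai_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL from a verb and its arguments."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"

# Ask for all records in unqualified Dublin Core.
list_records = oai_request_url(BASE_URL, "ListRecords",
                               metadataPrefix="oai_dc")

# Ask the repository to describe itself.
identify = oai_request_url(BASE_URL, "Identify")
```

Because the whole request is a URL with a handful of parameters, any component that can issue an HTTP GET can act as a harvester, which is what makes a modular architecture built from such protocols practical.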

Page 24

aDORe architecture

[Figure: aDORe architecture at LANL. Recoverable labels: Pre-Ingest, Ingest, publisher, FTXT, A&I, TechReport, Repo Index, Identifier Resolver (CNRI handle, Java, C), MPEG-21 DIP Engine, Registry of transformations, Profile/Behavior Registry, DID, DID with DIM, Application, connections labeled OAI-PMH and OpenURL, and numbered steps 1 to 7.]

Slide courtesy of Herbert Van de Sompel

Page 25

arXiv vs. the vision

• Progress in decomposition and distribution of traditional steps in scholarly publishing value chain

Page 26

arXiv – service pathways (decomposed and distributed)

Page 27

Selected Grid vs. the vision

• SRB
  – Distributed, virtualized file system
  – Support for very large amounts of data
  – Data grid compatible with computational grid
  – Possible as backend persistent store for other repository systems (e.g., DSpace, Fedora)

• Chimera
  – Derived data as first class information entities
  – Information model (Virtual Data System)
  – Process model (Virtual Data Language)

Page 28

New Technical Architecture

Page 29

The architecture challenge

• Current situation
  – Heterogeneous repository systems
  – Heterogeneous object models (or no object model)
  – Multiple protocols and service APIs
  – Services lacking formal interface definitions

• Can these resources ever play nicely together?
• Need common abstractions…

Page 30

Solution: Information Network Overlay

[Figure: layered architecture. Labels shown:
• Source Layer: Data Stores, Document Repositories, Databases, Web Resources, Publisher Repositories
• Network Representation Layer
• Information Network API
• Client Layer]

Page 31

Translate to Technical Requirements

• Rich information objects
  – Integration of local and remote sources
  – Mixed genre

• Dynamic information objects
  – Integration with local and distributed services

• Graph-based information model to enable overlay
  – Nodes are information objects
  – Edges are relationships among those objects

• Service-oriented process model:
  – Coordination of information entities and services
  – Workflow; multi-step executions; transformations
  – Interoperable access and management API for objects

• Fine granularity access control
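The graph-based overlay requirement can be sketched as a common node-and-edge abstraction that heterogeneous sources map themselves into. All names here (`Node`, `Source`, `ListSource`, `Overlay`, the `urn:example:` identifiers, and the `usesData` relationship) are hypothetical; the sketch shows only the idea of one graph spanning several repositories and leaves out services and access control.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Protocol, Tuple


@dataclass(frozen=True)
class Node:
    uri: str        # identifier of the information object
    type_uri: str   # identifier of the object's type


Edge = Tuple[str, str, str]  # (subject URI, relationship, object URI)


class Source(Protocol):
    """What a repository must provide to take part in the overlay."""
    def nodes(self) -> Iterable[Node]: ...
    def edges(self) -> Iterable[Edge]: ...


class ListSource:
    """Toy adapter wrapping in-memory lists instead of a real repository."""

    def __init__(self, nodes: List[Node], edges: List[Edge]) -> None:
        self._nodes = nodes
        self._edges = edges

    def nodes(self) -> Iterable[Node]:
        return iter(self._nodes)

    def edges(self) -> Iterable[Edge]:
        return iter(self._edges)


class Overlay:
    """Single graph assembled from any number of sources."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: List[Edge] = []

    def add_source(self, source: Source) -> None:
        for node in source.nodes():
            self.nodes[node.uri] = node
        self.edges.extend(source.edges())

    def related(self, uri: str,
                relationship: Optional[str] = None) -> List[Node]:
        """Nodes reachable from uri in one hop, optionally filtered."""
        return [
            self.nodes[o]
            for s, p, o in self.edges
            if s == uri
            and (relationship is None or p == relationship)
            and o in self.nodes
        ]


documents = ListSource(
    [Node("urn:example:doc-1", "urn:example:type/document")],
    [("urn:example:doc-1", "usesData", "urn:example:dataset-7")],
)
datasets = ListSource(
    [Node("urn:example:dataset-7", "urn:example:type/dataset")],
    [],
)

overlay = Overlay()
overlay.add_source(documents)
overlay.add_source(datasets)
linked = overlay.related("urn:example:doc-1", "usesData")
```

The edge from the document to the dataset crosses repository boundaries, which is the case the overlay is meant to handle: neither source alone knows about both endpoints.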

Page 32

Pathways Project

• National Science Foundation Funding 2004-2007 (http://www.infosci.cornell.edu/pathways)

• Van de Sompel, Payette, Erickson, Lagoze, Warner. Rethinking Scholarly Communication: Building the System that Scholars Deserve. D-Lib Magazine September 2004.

Page 33

Vision: “Graphite” Information Model

[Figure: a Graphite overlay fragment. Nodes identified by URIs (URI-1 to URI-10) and typed by type URIs (TypeURI-1 to TypeURI-8) represent resources such as an image object, a web resource, a document, a grid dataset, and a service (Service-B), drawn from sources including the LANL repository, arXiv, and Fedora.]

Cornell/LANL Pathways Project

Most things can be represented as a graph of nodes and arcs.

Page 34

Service-oriented process model

• Key challenge is to integrate a distributed service model within the information network overlay.

• Technologies to watch
  – OWL-S (W3C)
    • Ontology-based service descriptions
    • Service modeled within semantic web
  – Netkernel (1060research)
    • Enables a graph-like overlay for URI-identified resources
    • Information entities and services can be accommodated
  – Grid technologies (Open Grid Services Infrastructure)
    • Enables creation of ‘virtual organizations’ that can share distributed computational resources and services
    • Web-services and WSDL in latest incarnation

Page 35

The W3C’s Take on Things…

• People and communities have data stores and programs to share

• Vision: Expanding Web of machine accessible resources

• Key Web technologies:
  • Web Services: Web of programs*
    – Standards for interactions between programs on the Web
    – Easier to expose and use services
  • Semantic Web: Web of data*
    – Standards for data, relationships, descriptions on the Web
    – Easier to Search for, Share, Aggregate, Extend information

* abstractions :-)

Source: http://www.w3.org/2004/Talks/0923-sb-whoiw3c/slide12-0.html

Page 36

Conclusions: Implications for digital repositories

Page 37

Beyond Storage

Must understand new scholarly activities and new technical developments…

so we can frame repositories within a broader service-oriented architecture.

Page 38

What basic changes can occur now?

• Expose repositories as web services
• Support compound digital objects
  – Local and remote content
  – Any media type
  – Provide a way to associate services with objects (dynamic views)

• Provide ability to assert relationships among objects
• Move toward ontology-based metadata
• Enable easy integration of repository with other services

Page 39

Example: Fedora Service Framework (2005-2007)

[Figure: the Fedora Repository Service surrounded by services and applications.
Services: Preservation Integrity Service, Preservation Monitoring Service, Basic Workflow Service, OAI Provider Service, Directory Ingest Service, Federation PID Resolution Service, Event Notification Service, Fedora Search Service, Dynamic Disseminator Service, Other Service.
Apps: Fedora-Web-IR, Administrator, Policy Builder, Web-based submission and basic workflow.
Other labels: External Workflow, JHOVE, GDFR.]

Page 40

Research Challenges

• Enable low barrier to entry
  – Simple protocols (e.g., like OAI)
  – Light-weight (REST vs. SOAP?)
  – Simple tools to create overlays
  – Note complexity in setting up Grid-based services

• Integration of information and service models

• Security and Trust
  – Authentication and trust among repositories and services
  – Interoperability of authorization policy

• Preservation
  – Distributed and dynamic resources

Page 41

Beyond Storage

Questions and Discussion!