Overview – NSF Site Visit 8 February 2010 1 (v8) DataSpace

Upload: nicholas-daniels

Post on 04-Jan-2016


Page 1

Overview – NSF Site Visit, 8 February 2010

(v8)

DataSpace

Page 2

Vision

• “To bring the dramatic benefits of the Web to scientists – comparable to the benefits the Web has had to commerce and other areas”
• . . . Not just in the impact to science, but also a similar distributed federated ecosystem for:
  – Technology Infrastructure
  – Organizational Responsibilities
    • View: Research data-generating institutions and their libraries should play an active role in curating their researchers’ data
  – Financial and Technical Sustainability
  – Openness: 3rd party extension and Open Source development
• Support research across all domains, but initially:
  – Neuroscience
  – Biological Oceanography

Page 3

DataSpace High-Level Architecture

[Diagram; labels recovered from the slide]

Nodes on the Global Network (Web):
• MIT Node
• Other USA Nodes
• International Nodes
• . . .

Repositories on the Local Network:
• Metadata Repository for Scientific Data
• Multiple Scientific Data Repositories (DataSpace Native Architecture)
• Interface to Legacy Scientific Data Repositories
• . . .

DataSpace Services:
• Distributed Data Management Services: Security, Replication, Administration, Policy Management, Workflow Services
• Data Curation Services: Process, Catalog, Annotate, Preserve
• Basic Data User Services: Discovery, Quality, Conversion, Integration
• Additional Data User Services: Data Analytics, Data Visualization
• 3rd Party Specialized Data Services

Basic Workflow:
• Scientist: Provides data, preliminary metadata
• Curator: Processes and ingests data, completes metadata and policies (e.g. retention)
• User: Searches (meta)data, accesses/integrates data, analyzes/visualizes data (via DataSpace data services or 3rd party data services)
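The basic workflow on this slide (scientist provides, curator ingests, user searches and accesses) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Dataset` and `Node` classes, their method names, and the `retention_years` policy key are hypothetical and do not come from the DataSpace proposal or from any DSpace or Fedora API.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A scientist's submission: data plus preliminary metadata."""
    name: str
    payload: bytes
    metadata: dict = field(default_factory=dict)


class Node:
    """Hypothetical stand-in for one DataSpace node (not a real API)."""

    def __init__(self, label: str):
        self.label = label
        self.metadata_repo = {}  # dataset name -> metadata and policies
        self.data_repo = {}      # dataset name -> raw data

    def ingest(self, ds: Dataset, extra_metadata: dict, policies: dict) -> None:
        """Curator step: complete the metadata, attach policies, store both."""
        record = {**ds.metadata, **extra_metadata, "policies": dict(policies)}
        self.metadata_repo[ds.name] = record
        self.data_repo[ds.name] = ds.payload

    def search(self, **criteria) -> list:
        """User step: match on metadata only and return dataset names."""
        return sorted(
            name
            for name, record in self.metadata_repo.items()
            if all(record.get(k) == v for k, v in criteria.items())
        )

    def fetch(self, name: str) -> bytes:
        """User step: access the data behind a search hit."""
        return self.data_repo[name]


# Scientist provides data and preliminary metadata.
scan = Dataset("scan-001", b"\x00\x01", {"domain": "neuroscience"})

# Curator processes and ingests it, completing metadata and policies.
mit = Node("MIT")
mit.ingest(scan, {"modality": "fMRI"}, {"retention_years": 10})

# User searches the metadata, then accesses the matching data.
hits = mit.search(domain="neuroscience", modality="fMRI")
data = [mit.fetch(name) for name in hits]
```

Keeping the metadata store separate from the data store mirrors the slide's split between a metadata repository and multiple data repositories: discovery touches only metadata, and data is fetched after a hit.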

Page 4

Initial Scientific Domains Chosen

• Neuroscience and Biological Oceanography
  – Sciences with complex interdisciplinary sub-domains
  – Different and diverse types of scientific data
    • Though some aspects of overlap (genetic data)
  – Faced with challenges related to
    • Data expression, encoding, sharing, integration, visualization, and preservation
    • Difficult to perform research that crosses sub-domains or requires multi-source data
  – Can build on existing collections and collaborations
    • But must also address technical, social, and legal issues
• Will bring in additional domains over time

Page 5

DataSpace Organizational Structure (preliminary)

[Organization chart; labels recovered from the slide]

• DataSpace PI
• DataSpace Project Director
• Management Board (PIs & Senior Personnel)
• DataSpace Advisory Board (external, international, 10-15 members)
• DataSpace Business Development Management Team (DBDMT)
• Sloan School of Management Finance & Administration

Working groups:
• Research & Prototyping
  – Data Protection, Security
  – Distributed Policy Management
  – Data Analysis, Visualization
  – Data Analytics
  – Data Sharing Policy and Legal Advice
  – Data Quality
  – Data Semantics, Discovery
  – Data Interoperability, Integration
  – Data Storage Architecture
  – Data Curation Workflows
• Development & Operations
  – Cyberinfrastructure Architecture
  – Software Design & Development
  – Infrastructure Planning
  – Data Curation Operations
  – Technology Operations
  – Service Modeling
  – Business Modeling
• Marketing & Outreach
  – Communication
  – Coordination
  – User Needs Assessment
  – Usability and Feedback
  – Public outreach (‘citizen science’)
  – Educational outreach
  – Scholarly Publishing outreach

Outside Partners:
• DataNet Partners
• Other Cyberinfrastructure Partners
• Other Sectors (e.g. finance, pharma, health care, insurance)

Page 6

Some Key Goals for First Year

• Complete hiring and staffing
• Design and development of DataSpace v1 (Interim architecture)
  – Build on existing software base (DSpace, Fedora)
  – Addition of initial DataSpace middleware
• Ingest of initial Neuroscience and Biological Oceanography data
  – Selection/development of ontologies
  – Recording of metadata (including preservation policies, etc.)
• Establish operational DataSpace v1
  – Service models defined with partner nodes
• Design of DataSpace long-term architecture
  – Initial results from research groups for v2
• Initial results of Business Development Management Team
• Educational and Outreach efforts (i-schools, OCW)

Page 7

Sustainability Approach

• Core to Financial Sustainability
  – Provide maximum value to science
  – Minimize cost to any one organization by broad distribution
    • Can actually reduce costs by eliminating duplication and inefficiencies
  – Build on the long-standing role and sustainability of libraries
  – Follows Web/Internet value (to both large and small orgs)
    • Worldwide infrastructure, costs widely shared
• Technological Sustainability
  – Open Source software, multiple implementations possible, and encourage 3rd party augmentation
  – Participation of commercial technology company partners
• Some Resources: DataSpace Federation, Partner experiences, Business Development Management Team (working with MIT Entrepreneurship Center, E&I students, etc.)

Page 8

Some Key Features of the DataSpace Proposal

• Distributed federated infrastructure for accessibility & long-term preservation
  – Address privacy, property and data rights, etc. with legal and policy framework
• Builds on successful DSpace/Fedora platform
• Proposes new top-level internet domain (".arc")
• Addresses need for “temporal semantics” and other advanced metadata
• Risk mitigation:
  – Research risk: Personnel with extensive experience.
  – Operational risk and sustainability: Distributed design and federated approach.
• Public/Private Partnership: Corporate partners help build more sustainable ecosystem and ensure sustainability, MIT Entrepreneurship Center, etc.
• Expert Advisory Board: Diverse fields (i.e. science, law, business, technology, libraries, and digital preservation) advise and promote the project
• Advances scholarly communications through data/publication integration
• Advances educational technology through data/courseware integration
• Outreach to minority and pre-college students, underserved small and medium research groups.

DataSpace will be a truly transformational project

Page 9

Multi-disciplinary team of Principal Investigators

• Hal Abelson, MIT Computer Science & Artificial Intelligence Laboratory (CSAIL)
• Ed DeLong, MIT Departments of Civil and Environmental Engineering and Biological Engineering
• John Gabrieli, MIT Department of Brain and Cognitive Sciences
• Stuart Madnick, MIT Sloan School of Management & School of Engineering
• MacKenzie Smith, MIT Libraries
• Marilyn T. Smith, Director, MIT Information Systems & Technology (IS&T) [replaces Jerry Grochow]

Page 10

Diverse and Experienced Senior Personnel

• Timothy Berners-Lee (W3C, WSRI)
• Alon Halevy (Google)
• Geneva Henry (Rice University)
• Mei Hsu (HP)
• David Karger (MIT)
• Michele Kimpton (DSpace Foundation)
• Thomas Malone (MIT)
• Dejan Milojicic (HP) [replaces John Erickson]
• Joe Pato (HP)
• Terry Reese (Oregon State University)
• Michael Siegel (MIT)
• Stephen Todd (EMC)
• Tyler Walters (Georgia Tech)
• Danny Weitzner (W3C, WSRI)
• Steve White (Microsoft) [addition to team]
• John Wilbanks (Science Commons)
• Wei Lee Woon (MIST, Abu Dhabi)

Page 11

Advisory Board

• Christine L. Borgman, Department of Information Studies, Graduate School of Education and Information Science, UCLA
• Randy Buckner, Psychology, Harvard University
• Scott Doney, Marine Chemistry & Geochemistry, Woods Hole Oceanographic Institution
• Keith Jeffery, European Research Consortium of Informatics and Mathematics (ERCIM) and UK Rutherford Appleton Laboratory
• Liz Lyon, UKOLN and UK Digital Curation Centre
• Ed Roberts, Management of Technology, MIT Sloan School of Management and MIT Entrepreneurship Center
• Pam Samuelson, School of Information and School of Law, UC Berkeley
• Dan Schutzer, Financial Services Technical Consortium (FSTC)
• Andrew Treloar, ARCHER Project, Australian National Data Service, Monash University, Australia
• Wanda Orlikowski, Information Technologies and Organization Studies, MIT Sloan School of Management

Page 12

DATASPACE AGENDA - NSF SITE VISIT - Final - As of 7 Feb 2010 (v21)

(Start time, topic, presenters, minutes per presenter, sub-total per segment)

8:00   NSF Panel leaves hotel & gathers at MIT

8:30   1. INTRODUCTION & OVERVIEW (Vision & Rationale) (sub-total: 20 min)
         a. Stuart Madnick & MacKenzie Smith: 20 min

8:50   2. SCIENTIFIC DOMAINS (Vision & Rationale) (sub-total: 42 min)
       Biological Oceanography
         a. Ed DeLong: 12 min
         b. Terry Reese (Oregon State U): 5 min
       Neuroscience / Neuroimaging
         c. John Gabrieli: 12 min
         d. Steve White (Microsoft): 5 min
         e. Tyler Walters (Georgia Tech): 5 min
         f. Susan Hockfield (President, MIT) - arrives around 10am (approx): 3 min

10:15  Break
10:30  NSF Panel Closed Session #1
11:00  Additional Q&A

11:20  3. COMMUNITY BUILDING & PARTNERSHIPS (Activities, Organizational Structure) (sub-total: 39 min)
       Introduction
         a. MacKenzie Smith: 4 min
         b. John Wilbanks (Science Commons): 5 min
       Library community
         c. Michele Kimpton (DuraSpace): 5 min
         d. Geneva Henry (Rice): 5 min
         e. Terry Reese (Oregon State U): present
         f. Tyler Walters (Georgia Tech): present
       Broader community
         g. Stuart Madnick (for Google: Alon Halevy & MIST: Wei Lee Woon): 4 min
         h. Joe Pato (for HP: Mei Hsu & Dejan Milojicic): 6 min
         i. Stephen Todd (EMC): 5 min
         j. Steve White (Microsoft): present
       Citizen Science
         k. Tom Malone (Citizen Science): 5 min

12:30  Break
12:45  NSF Panel Closed Session #2 (Working lunch)
1:15   Additional Q&A

1:35   4. RESEARCH, DEVELOPMENT & OPERATIONS (Activities) (sub-total: 41 min)
       Development & Operations
         a. MacKenzie Smith (Development, for Libraries, IS&T): 12 min
         b. Marilyn Smith (Operations: IS&T): 4 min
       Research agenda
         c. Stuart Madnick (Research: data semantics, integration, quality, etc.): 12 min
         d. Hal Abelson (for DIG: Tim Berners-Lee, Danny Weitzner): 8 min
         e. David Karger (Visualization): 5 min
       Others present

3:00   Break
3:15   NSF Panel Closed Session #3
3:45   Additional Q&A

3:55   5. SUSTAINABILITY & PROJECT MANAGEMENT (Organizational Structure, DataNet Partner Leadership & Management) (sub-total: 33 min)
         a. MacKenzie Smith & Stuart Madnick: 10 min
         b. Michael Siegel: 15 min
         c. Ann Wolpert (Director, Libraries) & Claude Canizares (VP, Research): 8 min
       Others present

5:30   6. WRAP-UP (sub-total: 5 min)
         a. Stuart Madnick & MacKenzie Smith: 5 min

6:00   END (total: 180 min)

RE NSF Q&A TIME: It is assumed that there will be some brief speaker-specific Q&A after each speaker, then general Q&A for the rest of each segment.

Page 13

Backup Slides

Page 14

1. New types of science enabled

• Enhance scientific interdisciplinarity and innovation via standards-based data architecture and broad adoption
• Disciplines: Neuroscience and Biological Oceanography

(a) Science and education goals help
  – Library and Computer Science goals: minimize duplication of effort, maximize access to prior work, improve interoperation and quality
  – Education goals: disseminated through multiple means (OCW) to enable semantic tagging of data and reuse of data

(b) Metrics of Success
  – Usage: number of groups contributing and using, amount and diversity of data shared and used, etc.
  – Impact: Publications, discoveries

Page 15

Neuroscience Domain

• Address questions, such as “Variation of cognitive and emotional traits due to age?”
• Future requires access to large datasets, but
  – Broadly distributed across many organizations
  – Diverse types: DTI, fMRI, structural MRI, VBM
  – Difficult to aggregate and annotate
• Initial organizations include
  – Martinos Imaging Center (at MIT)
  – Center for Advanced Brain Imaging (Georgia Tech)
  – Collaboration with Microsoft

Page 16

Biological Oceanography Domain

• Address questions such as “How does change in ocean current cause proliferation of microbial groups that, in turn, influence flux of carbon into and out of the sea?”
• Need to interrelate diverse datasets
  – Scale: from genome to biomes
  – Types: 4D physical and biological oceanographic, satellite, genomic, metagenomic, taxonomic, nutrient analysis, bio-optical
  – From diverse sources
• DataSpace will enable research not possible today

Page 17

2. Value to Previous Investments

• For selected domains: Resources to organize, annotate, archive, and publish existing data
  – Curated by partnership with library data curators
  – Improve collaborations, e.g., C-MORE (interrelate difficult)
  – Address complex legal, political, and social realities
  – Sustainability by providing significant new value to scientists (e.g. ease of search, data integration, reuse)

(a) What data contributors gain from DataSpace
  – More efficiently archive and reutilize their own data
  – Able to utilize vast amounts of data from other sources
  – Over time, will be respected academic achievement (citations)

(b) Investment utilized and enhanced
  – Significant prior R&D by team members, e.g., DSpace, temporal semantics, data quality and provenance, policy and legal, etc.

Page 18

3. Barriers to Implementation and Adoption

• In the past, scientists often didn’t participate because:
  – Insufficient time and expertise (which we address via better functionality and assistance from curators)
  – Insufficient value back (which we address through re-use, etc.)
• Some points:
  – Demonstrable ease-of-use and value
    • Especially sciences that are struggling with these problems
    • Examples from Neuroscience and Biological Oceanography
  – Dedicated data curators
  – DataSpace Federation to represent collective needs
  – Openness: encourages scientific innovation and evolution
  – Support for policy and legal issues
  – Team has experience evolving systems (W3C, DSpace, etc.)

Page 19

4. Cyberinfrastructure, Technical Sustainability

• Much of DataSpace cyberinfrastructure builds on prior work (e.g. DSpace) and adds: (a) archive, (b) annotate to enable discovery and re-use, (c) interoperate with Ed Tech, “citizen science,” etc.
• Technical sustainability: Software free and open source
  – Establish architecture and standards
  – Project will provide at least one reference implementation
  – Enable multiple implementations (including commercial)
• Will develop cost and service models as exemplars
  – Institutions already expend large amounts
  – DataSpace will streamline, rationalize, distribute costs
  – Libraries have stood the test of time
  – Additional business models
• 1st Year Goal: Initial system and ingest of data, test interop

Page 20

5. Manage Program, Providers, International

• Experience with highly distributed projects (DSpace)
• Management – see organization chart
  – Multiple levels and multiple sub-groups
  – Public/private partnership to ensure industrial adoption and relevance to other sectors
  – Added management and data expertise from Advisory Board
• Data providers and assured participation
  – Data initially from partners (Georgia Tech, MIT, OSU, Rice)
    • Already communicating with scientists
  – Then extend more broadly, initially to the DSpace community
• International Counterparts: (1) direct collaboration (DuraSpace), (2) International partner (MIST), (3) International corporations (EMC, Google, HP, Microsoft), (4) Advisory Board, (5) indirect collaborations (C-MORE)