School of Information Studies Syracuse University
Developing Data Services to Support eScience/eResearch
2012 Priscilla M. Mayden Lecture eScience and the Evolution of Library Services
Jian Qin
School of Information Studies Syracuse University
http://eslib.ischool.syr.edu/
February 22, 2012
The morning ahead
Priscilla M. Mayden Lecture 2012, Utah 2
• An environmental scan
  – E-Science, cyberinfrastructure, and data
  – What do all these have to do with me?
Case study: The gravitational wave research data management
Group work: Role play in developing data management initiatives
Overview of E-Science and Data
• An environmental scan
  – E-Science, cyberinfrastructure, and data
  – What do all these have to do with me?
• Characteristics of e-science
• Data sets, data collections, and data repositories
• Why does it matter to libraries?
E-Science
“In the future, e-Science will refer to the large scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet.”
National e-Science Center. (2008). Defining e-Science. http://www.nesc.ac.uk/nesc/define.html
Characteristics of e-science
• Digital data driven
• Distributed
• Collaborative
• Trans-disciplinary
• Fuses pillars of science
  – Experiment
  – Theory
  – Model/simulation
  – Observation/correlation
Greer, Chris. (2008). E-Science: Trends, Transformations & Responses. In: Reinventing Science Librarianship: Models for the Future, October 2008. http://www.arl.org/bm~doc/ff08greer.pps
Shift in Science Paradigms
• Thousand years ago: Science was empirical, describing natural phenomena
• A few hundred years ago: Theoretical branch, using models, generalizations
• A few decades ago: A computational approach, simulating complex phenomena
• Today: Data exploration (eScience), unify theory, experiment, and simulation
  – Data captured by instruments or generated by simulator
  – Processed by software
  – Information/Knowledge stored in computer
  – Scientist analyzes database/files using data management and statistics
Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
X-Informatics
• The evolution of X-Informatics and Computational-X for each discipline X
• How to codify and represent our knowledge

The Generic Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it
• How to reorganize it
• How to share with others
• Query and Visualization tools
• Building and executing models
• Integrating data and Literature
• Documenting experiments
• Curation and long-term preservation

[Figure: Experiments & Instruments, Simulations, Literature, and Other Archives supply facts; questions go in and answers come out]
Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
Useful resources
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
Part 2: Health and Wellbeing
• The healthcare singularity and the age of semantic medicine
• Healthcare delivery in developing countries: challenges and potential solutions
• Discovering the wiring diagram of the brain
• Toward a computational microscope for neurobiology
• A unified modeling approach to data-intensive healthcare
• Visualization in process algebra models of biological systems
FUNDAMENTALS OF DATA
What are data? What are some of the major data formats? Why data formats?
What are data? (1)
An artist’s conception (above) depicts fundamental NEON observatory instrumentation and systems as well as potential spatial organization of the environmental measurements made by these instruments and systems. http://www.nsf.gov/pubs/2007/nsf0728/nsf0728_4.pdf
What are data? (2)
Medical and health data
http://www.weforum.org/issues/charter-health-data
Standardization Compliance Security
The multi-dimensions of data
[Figure: dimensions of data: data formats, data types, research orientation, levels of processing]
Scientific data formats
• Common data format
• Image formats
• Matrix formats
• Microarray file formats
• Communication protocols
Scientific & medical data formats
• Medical and Physiological Data Formats
  – BDF: BioSemi data format (.bdf)
  – EDF: European data format (.edf)
• Molecular Biology Data Formats
  – PDB: Protein Data Bank format (.pdb)
  – MMCIF: MMCIF 3D molecular model format (.cif)
• Medical Imaging
  – DICOM: DICOM annotated medical images (.dcm, .dic)
• Chemical Formats
  – XYZ: XYZ molecule geometry file (.xyz)
  – MOL: MDL MOL format (.mol)
  – MOL2: Tripos MOL2 format (.mol2)
  – SDF: MDL SDF format (.sdf)
  – SMILES: SMILES chemical format (.smi)
• Bioinformatics Formats
  – GenBank: NCBI GenBank sequence format (.gb, .gbk)
  – FASTA: bioinformatics sequence format (.fasta, .fa, .fsa, .mpfa)
  – NEXUS: NEXUS phylogenetic data format (.nex, .ndk)
Why data formats?
• Archiving – preservation for posterity
• Storage – availability for “arbitrary” access
• Transmission – delivery across hardware, software, and administrative system boundaries
• Analysis – availability for processing
Summary
• Scientific data formats are closely tied to scientific computing
  – Data structure, model, and attributes
  – Self-descriptive with header/metadata
  – API for manipulating the data
  – Interoperability: conversion between different formats
• No one-format-fits-all standard
• Each standard has one or more tools for creating, editing, and annotating datasets
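To make "self-descriptive with header" and "conversion between different formats" concrete, here is a minimal sketch using FASTA, one of the formats listed earlier. It assumes the common convention that a record starts with a ">" header line followed by sequence lines. The function names and the sample records are hypothetical and for illustration only; real projects would normally rely on an established bioinformatics library.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_tsv(text: str) -> str:
    """Convert FASTA text to tab-separated rows: id, description, sequence."""
    rows = []
    for header, seq in parse_fasta(text):
        seq_id, _, description = header.partition(" ")
        rows.append("\t".join([seq_id, description, seq]))
    return "\n".join(rows)


# Hypothetical sample data, not taken from any real archive.
SAMPLE = """>seq1 hypothetical example record
ACGTAC
GGTA
>seq2
TTGACA
"""

if __name__ == "__main__":
    print(fasta_to_tsv(SAMPLE))
```

The header line carries the descriptive information, the small parser plays the role of an API over the format, and the TSV output is a simple case of format conversion.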
DATASETS, METADATA, AND DATA MANAGEMENT
What is a dataset? What are some of the metadata standards for describing datasets? What is data management?
Dataset classification
• Volume: large-volume, small-volume
Ecological data example: Instantaneous streamflow by watershed http://www.hubbardbrook.org/data/dataset.php?id=1
Diabetes data and trends—Country level estimates: http://apps.nccd.cdc.gov/DDT_STRS2/NationalDiabetesPrevalenceEstimates.aspx?mode=PHY ; Diabetes Data & Trends home page: http://apps.nccd.cdc.gov/ddtstrs/default.aspx
Clinical trials data management: http://www.clinicaltrials.gov/ct2/show/NCT00006286?term=TADS+NIMH&rank=1
Common in the examples
• Attributes of a dataset tell users/managers:
  – What the dataset is about
  – How the data were collected
  – To which project the data are related
  – Who was responsible for data collection
  – Whom you may contact to obtain the data
  – What publications the data have generated
  – ??
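The attributes above can be sketched as a simple record type. This is only an illustration, not a metadata standard: the class name, field names, and all example values except the dataset title are hypothetical.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class DatasetDescription:
    """One field per attribute named on the slide (field names are assumptions)."""
    about: str                      # what the dataset is about
    collection_method: str          # how the data were collected
    project: str                    # project the data relate to
    responsible_parties: List[str]  # who collected the data
    contact: str                    # whom to contact to obtain the data
    publications: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of attributes that were left empty."""
        return [name for name, value in asdict(self).items() if not value]


# Hypothetical values, loosely modeled on the streamflow example.
example = DatasetDescription(
    about="Instantaneous streamflow by watershed",
    collection_method="Continuous gauge readings (assumed)",
    project="Example watershed study",
    responsible_parties=["Example field team"],
    contact="",
)

if __name__ == "__main__":
    print(example.missing_fields())
```

A check like `missing_fields()` is one way a data service could flag incomplete descriptions before a dataset is deposited.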
Metadata standards in medical & health sciences
[Figure: standards arranged by domain (Healthcare, Medical images, Bioinformatics) and by role (Structure, Semantics)]
• HL7, DICOM
• NCBI Taxonomy, NCBO Bioportal
• UMLS, MeSH (Medical Subject Headings), SNOMED CT (Systematized Nomenclature of Medicine--Clinical Terms)
• GenBank
Research data collections
• Size: from smaller, team-based to larger, discipline-based
• Metadata standards: from none or random to multiple, comprehensive
• Management: from heroic individual inside the team to organized, institutionalized
Research collections
• Limited processing or long-term management
• Do not conform to any data standards
• Varying sizes and formats of data files
• Low level of processing, lack of plan for data products
• Low awareness of metadata standards and data management issues
Resource collections
• Authored by a community of investigators, within a domain of science or engineering
• Developed with community-level standards
• Lifetime is between mid- and long-term
Reference collection
• Example: Global Biodiversity Information Facility
  – Created by large segments of science community
  – Conform to robust, well-established and comprehensive standards, e.g.
    • ABCD (Access to Biological Collection Data)
    • Darwin Core
    • DiGIR (Distributed Generic Information Retrieval)
    • Dublin Core Metadata standard
    • GGF (Global Grid Forum)
    • Invasive Alien Species Profile
    • LSID (Life Sciences Identifier)
    • OGC (Open Geospatial Consortium)
Datasets, data collections, and data repositories
• Data collections are built for larger segments of science and engineering
• Datasets
  – typically centered around an event or a study
  – contain a single file or multiple files in various formats
  – coupled with documentation about the background of data collection and processing
• Data repository: system for storing, managing, preserving, and providing access to datasets
• A repository may contain one or more data collections
• A data collection may contain one or more datasets
• A dataset may contain one or more data files
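The containment relationships (repository, collection, dataset, file) can be sketched as nested record types. This is a minimal illustration under the assumption of a strict tree; all class, field, and example names are hypothetical and do not describe any particular repository system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFile:
    name: str
    file_format: str


@dataclass
class Dataset:
    title: str
    files: List[DataFile] = field(default_factory=list)


@dataclass
class DataCollection:
    name: str
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class Repository:
    name: str
    collections: List[DataCollection] = field(default_factory=list)

    def file_count(self) -> int:
        """Count data files across every collection and dataset."""
        return sum(
            len(dataset.files)
            for collection in self.collections
            for dataset in collection.datasets
        )


# Hypothetical example: one repository, one collection, two datasets.
repo = Repository(
    name="Example institutional data repository",
    collections=[
        DataCollection(
            name="Example ecology collection",
            datasets=[
                Dataset("Streamflow study", [DataFile("flow.csv", "CSV"),
                                             DataFile("readme.txt", "text")]),
                Dataset("Precipitation study", [DataFile("rain.csv", "CSV")]),
            ],
        )
    ],
)

if __name__ == "__main__":
    print(repo.file_count())
```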
Data management for science research
• Definition from Wikipedia: http://en.wikipedia.org/wiki/Data_management
• Key concepts in data management:
  – Data ownership
  – Data collection
  – Data storage
  – Data protection
  – Data retention
  – Data analysis
  – Data sharing
  – Data reporting
How do they relate to responsible conduct of research? http://ori.hhs.gov/images/ddblock/data.pdf
An attempt to define DM
• In the context of libraries:
  – Data management is a process in which librarians plan, design, and implement data services to support eScience/eResearch.
  – Data services that libraries may provide:
    • Institutional or community data repositories
    • Data management plan for pre- and post-award of grants
    • Metadata creation, linking, and discovery
    • Data archiving, preservation, and curation
    • Consultation for research group’s data management projects
    • Data management and data literacy training for graduate students and faculty
Initiatives in research libraries
• Pressure points:
  – Lack of resources
  – Difficulty acquiring the appropriate staff and expertise to provide eScience and data management or curation services
  – Lack of a unifying direction on campus
• Data support and services in institutions: 45%
• Libraries involved in supporting eScience: 73%
Source: Soehner, C., Steeves, C. & Ward, J. (2010). E-Science and data support services: A study of ARL member institutions. http://www.arl.org/bm~doc/escience_report2010.pdf
Data preservation challenges
• Data formats
  – Vary in data types, e.g. vector and raster data types
  – Format conversions, e.g. from an old version to a newer one
• Data relations
  – e.g. there are data models, annotations, classification schemes, and symbolization files for a digital map
• Semantic issues
  – Naming datasets and attributes
Data access challenges
• Reliability
• Authenticity
• Leverage technology to make data access easier and more effective
  – Cross-database search
  – Integration applications
  – “Science-ready” datasets
Supporting digital research data
• Lifecycle of research data
  – Create: data creation/capture/gathering from laboratory experiments, field work, surveys, devices, media, simulation output…
  – Edit: organize, annotate, clean, filter…
  – Use/reuse: analyze, mine, model, derive additional data, visualize, input to instruments/computers
  – Publish: disseminate, create portals and databases, associate with literature
  – Preserve/destroy: store/preserve, store/replicate/preserve, store/ignore, destroy…
Supporting data management
The data deluge
• Numerical, image, video
• Models, simulations, bit streams
• XML, CSV, DB, HTML
Researchers need:
• Specialized search engines to discover the data they need
• Powerful data mining tools to use and analyze the data
Research data management
• Institution: financial and policy support
• Community: user requirements
• Science domain: data content idiosyncrasies
• Evolving and interconnecting repositories: institutional, community, national, international
• eScience librarian
Implications for the scholarly communication process
• Publishing: data publishing; new scholarly publishing models (open access, institutional and community repositories, self-publishing, library publishing, ...)
• Curation: maintaining, preserving and adding value to digital research data throughout its lifecycle
• Archiving: the long-term storage, retrieval, and use of scientific data and methods
Summary
• E-Science development has raised expectations for research libraries
  – Working knowledge and skills in e-Science
  – Focus on process (data and team science) rather than product (reference services)
  – Proactive, collaborative, integrative, and interdisciplinary
Case Study: Learning Data Management Needs from Scientists
Gravitational Wave (GW) Research
What is the problem?
• Tracking data output and workflows is difficult due to lack of provenance data
• Search of datasets is limited due to lack of specific options
• Within the LIGO community, data sharing and reuse is difficult without provenance metadata
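As a rough illustration of what provenance metadata makes possible, the sketch below records which inputs and which process produced each output, then traces an output back to its sources. It is not the LIGO workflow or any real provenance schema: the record fields, file names, and process names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProvenanceRecord:
    """Minimal provenance: what was produced, by which step, from what."""
    output: str
    process: str
    inputs: List[str] = field(default_factory=list)


def lineage(target: str, records: Dict[str, ProvenanceRecord]) -> List[str]:
    """Return every upstream item the target derives from, nearest first."""
    seen: List[str] = []
    queue = list(records[target].inputs) if target in records else []
    while queue:
        item = queue.pop(0)
        if item in seen:
            continue  # already visited via another path
        seen.append(item)
        if item in records:
            queue.extend(records[item].inputs)
    return seen


# Hypothetical three-step workflow.
RECORDS = {
    rec.output: rec
    for rec in [
        ProvenanceRecord("filtered.dat", "noise_filter", ["raw_strain.dat"]),
        ProvenanceRecord("candidates.csv", "event_search",
                         ["filtered.dat", "search_config.ini"]),
    ]
}

if __name__ == "__main__":
    print(lineage("candidates.csv", RECORDS))
```

With records like these, "which raw data and settings produced this result?" becomes a lookup rather than a question only the original analyst can answer.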
Understand the research workflow
• Interview the scientist
  – Listening (good listening skills)
  – Asking questions (don’t be afraid of asking questions)
  – Use your librarian brain to ingest the conversation:
    • How does the research flow from one point to the next?
    • What constitutes the research input and output, in terms of data, at each stage of research?
Mapping out the knowledge v0.1
Mapping out the knowledge v0.2
Mapping out the knowledge v1.0
Lessons learned
• Science is learnable even if you don’t have a subject background
  – Learn enough to understand the research process and workflow
• Scientists are eager to get help
• Librarians need to be technical-minded
  – Data, metadata, database
  – Structures, models, workflows
• Librarians need to be good listeners while staying good conversation leaders
  – Know when and how to lead the conversation to get what you need for data management planning and implementation
  – Do your homework on the subject so that you can be an intelligent listener
Case Discussions
Case Study #1: To build or not to build a data repository?
A university library has developed an institutional repository for preserving and providing access to the scholarly output of the institution's researchers. A new challenge now arises from e-science research: funding agencies require data management plans, and authors and users want publications linked to data. You already know that some faculty submit their datasets to disciplinary data repositories (e.g., GenBank for microbiology research data). The problem you face now is whether an institutional data repository should be built for those who do “small science” and have neither the funding nor the expertise to manage their data.
Questions to be addressed:
• What strategies will you use to approach the problem?
• What are the possible solutions to the problem?
• What are some of the tradeoffs of the solutions you will adopt?
Case study #2: Developing a data taxonomy
The concept of research data management is unfamiliar to many faculty as well as to your library staff. What is data? What is a data set? These seemingly simple terms can be very confusing and are interpreted differently in different contexts and disciplines. As part of the data management strategy, you decide to develop an authoritative data taxonomy for the campus research community. This data taxonomy will benefit the creation and use of institutional data policies, the data repository or repositories, and the data management plans required by funding agencies.
Questions to be addressed:
• What should the data taxonomy include?
• What form should it take, a database-driven website or a static HTML page?
• Who should be the constituencies in this process?
• Who will be the maintainer once the taxonomy is released?
Case study #3: Developing a data policy
Data policies play an important role in governing how data will be managed, shared, and accessed. They are also an instrument for fending off potential legal problems. There are several types of data policy: data access and use, data publishing, and data management. Your university’s Office of Sponsored Research has some existing policy on data, but it is neither systematic nor complete. Many of the terms were defined years ago and do not cover new areas such as the embargo period of data. As the university has decided to build a data repository for managing and preserving datasets, a data policy has become one of the top priorities for both the institution and the data repository.
Questions to be addressed:
• What should the data policy include?
• Who should be the constituencies in this process?
• Who will be the interpretation authority for the data policy?
Case study #4: Cataloging datasets
Describing datasets is the process of creating metadata for datasets. In scientific disciplines, several metadata standards have been developed, e.g., the Content Standard for Digital Geospatial Metadata (CSDGM), Darwin Core, and Ecological Metadata Language (EML). Each of these metadata standards contains hundreds of elements and requires training in both metadata and the subject area in order to use it. Besides, creating one record using any of these standards requires a tremendous time investment. But your library has neither such specialized personnel nor the funds to hire new staff for the job. The existing staff has some general metadata skills such as Dublin Core.
In deciding the metadata schema for your data repository, you need to address these questions:
• Should I adopt a scientific metadata standard or develop one tailored to our needs?
• How can I learn what metadata elements are critical to dataset submitters and searchers?
• What are some of the benefits and disadvantages of adopting a standard or developing a local schema?
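For discussion, a general-purpose description of the kind existing staff could produce might look like the sketch below, which builds a small Dublin Core record with Python's standard library. The wrapper element name `metadata`, the helper function, and all field values are assumptions for illustration; only the Dublin Core element namespace and element names (title, creator, subject, format) come from the standard.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Namespace of the 15-element Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: Dict[str, List[str]]) -> str:
    """Serialize element-name -> values into a simple Dublin Core XML record."""
    root = ET.Element("metadata")  # wrapper name is an assumption
    for name, values in fields.items():
        for value in values:
            element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            element.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical dataset description.
record_xml = dublin_core_record({
    "title": ["Example streamflow dataset"],
    "creator": ["Example Research Group"],
    "subject": ["hydrology", "watersheds"],
    "format": ["text/csv"],
})

if __name__ == "__main__":
    print(record_xml)
```

A record like this is quick to create but says little about methods, units, or variables, which is the tradeoff against a richer domain standard that the questions above ask you to weigh.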
Case study #5: Evaluating data repository tools
Research data, as a driving force for e-science, is inherently a tool-intensive field. Tools related to data management can be divided into two broad categories: those for creating metadata records and those for data repository management. An academic institution has decided to build its own data repository as part of its support services, so that researchers can meet the data management plan requirements of funding agencies. The data repository development task was handed down to the library. You, the library director, have to decide whether to develop an in-house system or use an off-the-shelf software system. As usual, you put together a taskforce to find a solution to this challenge.
The questions to be addressed by the taskforce include:
• What are the options available to us?
• What evaluation criteria are the most important to our goal?
• What are the limitations for us in adopting one option or the other?
• How will this option interoperate with the existing institutional repository system? Or, can the existing repository system be used for data repository purposes?