rdm slides march 2014

CISER Data Archive & Introduction to RDM Stuart Macdonald CISER Data Services Librarian [email protected] Research Design CRP-7201, Stone Laboratory, Cornell Univ. 19 March 2014

Post on 18-Oct-2014


DESCRIPTION

Slides for Research Design Graduate Course - http://courses.cornell.edu/preview_course_nopop.php?catoid=14&coid=190464#

TRANSCRIPT

Page 1: Rdm slides march 2014

CISER Data Archive & Introduction to RDM

Stuart Macdonald

CISER Data Services Librarian

[email protected]

Research Design CRP-7201, Stone Laboratory, Cornell Univ. 19 March 2014

Page 2: Rdm slides march 2014

• CISER Data Archive

• What is Research Data Management (RDM)
• Research Data Defined
• Data Management Planning
• Organising Data
• File Formats & Transformations
• Documentation & Metadata
• Storage & Security
• Data protection & Rights
• Preservation & Sharing

• Research Data MANTRA

Page 3: Rdm slides march 2014

CISER Data Archive: Collection and Services

Established over 30 years ago

Collection of numeric datasets to support quantitative research

c. 27,000 online files in addition to thousands of studies on CD/DVD

Emphasis on demography (state/federal censuses), economics, health, labor, election studies, attitudinal and behavioral studies, family life etc.

Page 4: Rdm slides march 2014

• Consulting services to match user needs with appropriate data and statistical analysis software

•finding, accessing and using data

• Current Cornell researchers can download archive files from the online catalog (search & browse) in formats compatible with statistical software

• Data files are identified by a ‘traffic light’ icon that indicates usage level:

• Green – downloadable by anyone

• Yellow – downloadable from links in the catalog with CUWebAuth authentication (for use within the CISER research computing environment - CISERRSCH) – Cornell researchers can apply for a computing account

• Red – data to be used under restriction (via CRADC or conditions imposed by data provider)

Page 5: Rdm slides march 2014

CISER Data Catalog:

Page 6: Rdm slides march 2014


CISER Data Archive maintains links to a range of social science data resources including:

•Data Distributors and Producers: U.S. Government e.g. Dept. Agriculture, Dept. Commerce, Dept. Energy, Dept. Justice, Dept. Labor, Federal Agencies

•Data Distributors and Producers: Other U.S. Sources

•Data Distributors and Producers: International e.g. Eurostat, FAOSTAT, ILO, OECD, UN Statistics Division, World Bank

•Data Libraries and Archives e.g. Harvard-MIT Data Center, UKDA, DANS, CESSDA

•Social Science Research Institutes e.g. Odum Institute, Survey Research Institute

•Online Reference Tools e.g. Boundary files, geocoding tools, SIC codes, data citation tools

•State and Local Government data and statistical sources e.g. NY State Depts. Education, Health, Labor, State Data Center

See URL: http://ciser.cornell.edu/ASPs/datasource.asp

Page 7: Rdm slides march 2014

• Provides Cornell social science researchers with a repository for sharing and providing long-term preservation of their numeric/statistical research data

• Participates in Cornell’s Research Data Management Service Group

• Assists Cornell social science researchers with Research Data Management (RDM) plans

• Provides Cornell social science researchers with support and expertise in obtaining and using restricted data

Page 8: Rdm slides march 2014

Other social science research data resources:

• Inter-University Consortium for Political and Social Research (ICPSR)
• National Archive of Criminal Justice Data
• Minority Data Resource Center
• National Archive of Computerized Data on Aging

• Roper Center for Public Opinion Research

• International Data Archives
• CESSDA, UKDA, Eurostat
• CESSDA catalog (DDI) provides a multi-lingual interface to datasets from member social science data archives across Europe

• Non-Governmental Organizations

• National / Governmental Statistical Agencies

Page 9: Rdm slides march 2014

URLs:

• CISER Data Archive Catalog: http://ciser.cornell.edu/ASPs/search.asp

• ICPSR: www.icpsr.umich.edu/

• Roper Center for Public Opinion Research: http://www.ropercenter.uconn.edu/

• CESSDA: http://www.cessda.org/

• Eurostat: http://www.epp.eurostat.ec.europa.eu/

Page 10: Rdm slides march 2014

Location & hours:

CISER Data Archive is located at 391 Pine Tree Road, Ithaca

CISER is open 8.30am – 4.30pm (Mon-Fri). Walk-in assistance is not always available, so appointments are recommended.

Contacts:

Tel.: (607) 255 4801

Email: [email protected]

Page 11: Rdm slides march 2014

Introduction to Research Data Management (RDM)

Page 12: Rdm slides march 2014

Why Manage Research Data?

Current research data management initiatives are based on three trends:

The data deluge – exponential growth in volume of digital research artifacts created within academia (often created by publicly funded research)

Data management is required by multiple disciplines

Increasing perception of the value of data (data as commodity)

Page 13: Rdm slides march 2014

What is Research Data Management?

•RDM is an umbrella term that describes all aspects of planning, organising, documenting, storing and sharing research data.

•It also takes into account issues such as documentation, data protection and confidentiality.

•It provides a framework that supports researchers and their data throughout the course of their research and beyond.

•It is one of the essential areas of responsible conduct of research

Page 14: Rdm slides march 2014

Research Data Lifecycle

Pink Colored Umbrellas Are Pretty Darned Rainproof

Page 15: Rdm slides march 2014

Research Data Defined

The US Office of Management and Budget, in its grants management circular A-110, defines research data as “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.”

The KRDS2 study (Beagrie et al, 2009) defines research data as ‘collections of structured digital data from any disciplines or sources which can be used by academic researchers to undertake their research or provides an evidential record of their research.’

RIN Classification*
• Observational – real-time, unique, usually irreplaceable
• Experimental – from lab equipment, expensive, often reproducible
• Simulation – generated from models – model & metadata are as important as output data
• Derived – resulting from processing or combining “raw” data; reproducible but expensive
• Reference – a (static or organic) collection of smaller (peer-reviewed) datasets, probably published and curated

* Stewardship of digital research data: a framework of principles and guidelines, Research Information Network, 2008. URL: http://tinyurl.com/l56gftx

Page 16: Rdm slides march 2014

Research Data Defined

• Research data, unlike other information types, is collected, observed, or created, for purposes of analysis to produce original research results.

• Research data can be generated for different purposes and through different processes in a multitude of digital formats.

Page 17: Rdm slides march 2014

Research data comes in many varied formats:

Text - Flat text files, Word, Portable Document Format (PDF), Rich Text Format (RTF), Extensible Markup Language (XML).

Numerical - SPSS, Stata, Excel.

Multimedia - jpeg, tiff, dicom, mpeg, quicktime.

Models - 3D, statistical.

Software - Java, C.

Discipline specific - Flexible Image Transport System (FITS) in astronomy, Crystallographic Information File (CIF) in chemistry.

Instrument specific - Olympus Confocal Microscope Data Format, Carl Zeiss Digital Microscopic Image Format (ZVI).

Page 18: Rdm slides march 2014

Research data may include the following:
• Documents (text, MS Word), spreadsheets
• Lab books, field notes, diaries
• Questionnaires, transcripts, codebooks
• Audiotapes, videotapes, photographs, images
• Slides, artefacts, specimens, samples
• Collection of digital objects acquired & generated during the research process
• Database contents (video, audio, text, images)
• Models, algorithms, scripts
• Contents of an application (input, output, logfiles for analysis software, schemas)
• Methodologies, workflows
• SOPs, protocols

Page 19: Rdm slides march 2014

By managing your data you will:
• ensure scientific integrity of research and aid replication
• ensure research data and records are accurate, complete, authentic and reliable
• increase your research efficiency
• save time, effort and resources in the long run
• enhance data security and minimise the risk of data loss
• prevent duplication of effort by enabling others to use your data

• meet funding grant requirements

Note:

It may also be important to manage research records (both digital & hardcopy) during and beyond the life of the project such as:

• correspondence (emails)
• grant applications
• technical reports
• research reports
• consent forms
• ethics applications

Page 20: Rdm slides march 2014

What Do Funders Want?

• timely release of data - once patents are filed or on (acceptance for) publication.

• data shared openly - minimal or no restrictions if possible.

• preservation of data - typically 5-10+ years if of long-term value.

• data management plans

See:
NIH Data Sharing Policy: https://grants.nih.gov/grants/policy/data_sharing/
NSF Data Sharing Policy: http://www.nsf.gov/bfa/dias/policy/dmp.jsp

Page 21: Rdm slides march 2014

Data Management Plan. What is it?

Funding bodies require researchers to supply detailed, cost-effective plans for managing research data. These are called Data Management Plans (DMPs).

A DMP is a document which describes:
• What research data will be created.
• What policies (funding, institutional, legal) apply to the data.
• What data management practices (backups, storage, access control, archiving) will be used.
• What facilities and equipment are required (hard-disk space, backup server, repository).
• Who will own the copyright and have access to the data.
• How long-term preservation will be ensured after the original research is completed.

The data management plan must be continuously maintained and kept up-to-date throughout the course of research.

Page 22: Rdm slides march 2014

Why do we need one?

It improves your research both now and later...

•Data is often valuable for a long time!
•Results of your research may outlast your project.
•Will you use your data throughout your career?
•Prevents loss of digital data and records.
•Prevents loss of usefulness through media and software obsolescence.
•Forgetting stuff!

Good practice → Better research

Page 23: Rdm slides march 2014

Why do we need one?

•Ensure research integrity (and repeatability) through keeping better records.

•People can trace your outcomes from data collection, through research methodology, through to results.

•Maximises usefulness of data to fellow researchers.

•Highlights how data was collected, quality controls, how people can and should use it (access and licensing).

•Facilitates data use within collaboration.

•Can help lead to subsequent research papers.

Page 24: Rdm slides march 2014

Getting started with a DMP

• Gain an understanding of terminology & issues.
• Gain an understanding of your project/community
– Supervisor and colleagues
– People in your School, i.e. IT Officers, Research Coordinator/Administrator
• Talk to your supervisor about data authorship, IP, licensing, policies.
• Keep it practical and simple; don't spend too much time.
• Where you don't know something, leave gaps, investigate and fill in later.
• Remember it is never finished! Review it regularly through the course of your research.

CDL’s DMP Tool: https://dmp.cdlib.org/

Cornell University RDM Services Group - Writing a DMP: https://confluence.cornell.edu/display/rdmsgweb/data-management-planning-overview

Page 25: Rdm slides march 2014

Questions?

Page 26: Rdm slides march 2014

Benefits of organising your data

Research data files and folders need to be labelled and organised in a systematic way so that:

•Data files are not accidentally overwritten or deleted

•Data files are distinguishable from each other within their containing folder

•Data file naming prevents confusion when multiple people are working on shared files

•Data files are easier to locate and browse

•Data files can be retrieved by both creator and by other users

•Data files can be sorted in logical sequence

•Different versions of data files can be identified

•If data files are moved to other storage platforms their names will retain useful context
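One way to meet several of the goals above (distinguishable names, logical sort order, identifiable versions) is a fixed naming pattern. The sketch below is only an illustration: the pattern `project_description_date_vNN.ext` and the function name `build_filename` are assumptions, not a convention prescribed by these slides.

```python
import re
from datetime import date


def build_filename(project, description, when, version, ext):
    """Return a name such as 'housing-survey_interview-codes_2014-03-19_v02.csv'."""

    def slug(text):
        # Lower-case the text and collapse anything that is not a letter
        # or digit into a single hyphen, so names stay portable.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # ISO dates (YYYY-MM-DD) and zero-padded versions sort correctly as text.
    return (
        f"{slug(project)}_{slug(description)}_"
        f"{when.isoformat()}_v{version:02d}.{ext.lstrip('.')}"
    )


# Hypothetical example files for one project.
names = [
    build_filename("Housing Survey", "interview codes", date(2014, 3, 19), 2, "csv"),
    build_filename("Housing Survey", "interview codes", date(2014, 3, 19), 10, "csv"),
    build_filename("Housing Survey", "interview codes", date(2014, 1, 5), 1, ".csv"),
]
for name in sorted(names):
    print(name)
```

Because the date and version are zero-padded, a plain alphabetical sort lists the files in chronological and version order.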

Page 27: Rdm slides march 2014

File Formats & Transformation

• Files are based on either text or binary encoding. The former is both machine- and human-readable and the latter only readable by means of appropriate software.

• Thus text files are less likely to become obsolete. Examples of file name extensions for these files are .txt, .csv and .por. 

• Be aware of the file formats your data exists in
– Does this format require a specific type of software?
– Can others access the data in this format?
– Can alternative formats be used?

• Using widely available or open formats maximises the chances of your data being stable and usable

Page 28: Rdm slides march 2014

File Formats & Transformation

•When compressing your data files for storage or transportation you encode the information using fewer bits than the original representation. Commonly used programs are Zip and Tar (Tar bundles files and is usually combined with a compressor such as gzip).

•You may use the process of data normalisation. This means to convert data from one format (e.g. proprietary) into another for use or preservation (e.g. ASCII).

•If you convert or migrate your data files from one format to another, be aware of potential risk of data loss or corruption and take appropriate steps to avoid/minimise it.

•Watch out for backwards compatibility if software is upgraded
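The risk of loss during conversion can be checked mechanically. The following is a minimal sketch, assuming the data can be represented as rows of named values: it writes the rows to CSV text, reads them back, and compares. The function names are hypothetical. CSV stores everything as text, so values are compared as strings.

```python
import csv
import io


def to_csv_text(rows, fieldnames):
    """Serialise a list of dicts to CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv_text(text):
    """Parse CSV text back into a list of dicts (all values are strings)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text, newline=""))]


def round_trips(rows, fieldnames):
    """Return True if converting to CSV and back preserves every value."""
    restored = from_csv_text(to_csv_text(rows, fieldnames))
    expected = [{key: str(row[key]) for key in fieldnames} for row in rows]
    return restored == expected


# Hypothetical records: commas, quotes and line breaks survive the conversion.
safe_rows = [
    {"id": 1, "note": 'said "yes", twice'},
    {"id": 2, "note": "line one\nline two"},
]
print(round_trips(safe_rows, ["id", "note"]))

# A missing value (None) is written as an empty field, so the check flags it.
risky_rows = [{"id": 3, "note": None}]
print(round_trips(risky_rows, ["id", "note"]))
```

A failed check does not say which format is at fault; it only signals that the converted copy should not replace the original until the difference is understood.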

Page 29: Rdm slides march 2014

Exercise 1. Formatting your data

Page 30: Rdm slides march 2014

Documenting Data

There are many reasons why you need to document your data:

•To help you remember the details later
•To help others understand your research
•Verify your findings
•Review your submitted publication
•Replicate your results
•Archive your data for access and re-use

Some examples of data documentation are:

•Laboratory notebooks
•Field notes
•Questionnaires

Page 31: Rdm slides march 2014

Documenting Data

Research data need to be documented at various levels:
•Project level
•File or database level
•Variable or item level

The term metadata (‘data about data’) is often used.

The importance of metadata lies in the potential for machine-to-machine interoperability to assist location and access to data through search interfaces.
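As an illustration of documentation at the three levels, the sketch below stores a small metadata record as JSON, which both people and programs can read. Every field name and value here is a made-up example, not a recognised metadata standard; a real project would normally follow a community schema such as DDI.

```python
import json

# Hypothetical metadata record with project, file and variable levels.
record = {
    "project": {
        "title": "Example study",
        "creator": "Example researcher",
        "collected": "2014-03",
    },
    "file": {"name": "example-study_2014-03-19_v01.csv", "format": "CSV"},
    "variables": [
        {"name": "resp_id", "label": "Respondent identifier", "type": "integer"},
        {
            "name": "age",
            "label": "Age at interview",
            "type": "integer",
            "units": "years",
            "missing_code": -9,
        },
    ],
}

# Plain-text JSON can be kept alongside the data file.
metadata_text = json.dumps(record, indent=2, sort_keys=True)


def variable_labels(text):
    """Read the metadata back and return a name -> label lookup."""
    parsed = json.loads(text)
    return {var["name"]: var["label"] for var in parsed["variables"]}


print(variable_labels(metadata_text))
```

Because the record is structured, a search interface or script can pull out variable labels without anyone opening the data file itself.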

Page 32: Rdm slides march 2014

Secure data storage:

For the purposes of integrity and efficiency it is important that research data is stored securely & backed up regularly via:

• Networked drives

• Fileservers managed by department / school / IT Dept.

• Stored in single, secure, accessible place – regular back-ups.

• Personal computers / laptops

• Convenient, temporary storage - should not be used for storing master copies.

• Local drives may fail & laptops may get lost/stolen.

Page 33: Rdm slides march 2014

• External storage devices

• Hard drives, USB sticks, CDs, DVDs – low cost & portable BUT not recommended for long term storage.

• Longevity not guaranteed – degradation over time.
• Easily damaged or misplaced.
• Not big enough for all research data – might need to use multiple discs/drives.
• May pose a security threat.

If USB sticks, DVDs, CDs are used for working data or extra back-up then:
• Choose high quality products from reputable manufacturers.
• Conduct regular checks to ensure media is not failing.
• Periodically refresh data (i.e. copy to a new disc or drive).
• Ensure confidential data is password protected / encrypted
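A regular media check can be as simple as comparing file checksums against a stored list. The sketch below is one possible approach, not a prescribed procedure; the function names are hypothetical and a temporary folder stands in for a USB stick or disc.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's relative path to its checksum."""
    folder = Path(folder)
    return {
        str(path.relative_to(folder)): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(folder, manifest):
    """List files that are missing or whose content no longer matches."""
    current = build_manifest(folder)
    return sorted(name for name in manifest if current.get(name) != manifest[name])


# Demonstration with a temporary folder standing in for removable media.
with tempfile.TemporaryDirectory() as media:
    data_file = Path(media) / "survey.csv"
    data_file.write_text("id,age\n1,34\n")
    manifest = build_manifest(media)          # record checksums once
    unchanged = changed_files(media, manifest)
    data_file.write_text("id,age\n1,35\n")    # simulate silent corruption
    damaged = changed_files(media, manifest)

print(unchanged, damaged)
```

The manifest itself should be kept somewhere other than the media being checked, so that a failing disc cannot damage both the data and its record.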

Page 34: Rdm slides march 2014

• Remote or online back-up services – services that provide an online system for storing and backing-up computer files e.g. Dropbox, Mozy, Humyo, A-Drive

• Allow users to store and sync data files online and between computers.

• Employ cloud computing storage facilities (e.g. Amazon S3).

• Business model – first few GBs free, pay for more space.

Page 35: Rdm slides march 2014

Backing-up

Considerations for back-up policy:

• Whether all data (full back-up), or only changed data will be backed-up (incremental back-up)?
• How often full and incremental back-ups will be made?
• How much hard-drive space or DVDs will be required to maintain this schedule?
• If working with sensitive data, how will it be secured (and destroyed)?
• What back-up services are available that meet these needs?
• Who will be responsible for ensuring back-ups are available?

Recommendation:

Keep at least 3 copies of your data (e.g. original, external/local, and external/remote) and put in place a regular back-up procedure
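The difference between a full and an incremental back-up can be sketched in a few lines. This is a simplified illustration that selects files by modification time only; the function name and the example timestamps are assumptions, and real back-up tools also handle deletions, renames and verification.

```python
def files_to_back_up(modified_times, last_backup_time=None):
    """Choose which files a back-up run should copy.

    modified_times maps file names to last-modified times (any comparable
    number, e.g. seconds). With no previous back-up, every file is copied
    (full back-up); otherwise only files changed since then are copied
    (incremental back-up).
    """
    if last_backup_time is None:
        return sorted(modified_times)
    return sorted(
        name
        for name, modified in modified_times.items()
        if modified > last_backup_time
    )


# Hypothetical project files and modification times.
modified_times = {"survey.csv": 100, "codebook.txt": 250, "analysis.do": 400}

full = files_to_back_up(modified_times)
incremental = files_to_back_up(modified_times, last_backup_time=200)
print(full)
print(incremental)
```

Incremental runs are smaller and faster, but restoring needs the last full back-up plus every incremental run since, which is one reason to schedule full back-ups at regular intervals.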

Page 36: Rdm slides march 2014

Data Security

The means of ensuring that data is kept safe from corruption and that access to it is suitably controlled. It is important to consider data security to prevent:

• Accidental or malicious damage / modification to data.

• Theft of valuable or irreplaceable data.

• Breach of confidentiality agreements and privacy laws.

• Release of data before it has been checked for accuracy and authenticity.

Page 37: Rdm slides march 2014

Exercise 2. Data storage and Security

Page 38: Rdm slides march 2014

Data Protection (also called data privacy)

• In the US, there is no single, comprehensive federal (national) law regulating the collection and use of personal data. Instead, the US has a patchwork system of federal and state laws, and regulations that overlap, dovetail and may contradict one another.

• The combination of an increase in cross-border data flow, together with the increased enactment of data protection statutes heightens the risk of privacy violations and creates a significant challenge for a data owner/distributor.

Data protection is the relationship between:

•collection and dissemination of data
•technology
•the public expectation of privacy and the legal and political issues surrounding them

Page 39: Rdm slides march 2014

Rights and access

• Intellectual property rights (IPR) can be defined as rights acquired over any work created or invented with the intellectual effort of an individual.

• Facts are not copyrightable but the structure of a database could be.

• As a researcher, you should clarify ownership of and rights relating to research data before a project starts. This includes the right of access and the right to make copies.

• Data licences determine the terms and conditions of use by another, and may accompany a purchase or subscription.

• Open data licences attempt to “set data free” by minimising and standardising the terms and conditions of re-use. Conditions may include attribution, non-commercial use, no derivative works, or ‘share alike’.

Page 40: Rdm slides march 2014

Open Data Commons (ODC) have prepared a set of licences each with an accompanying statement which can be placed with your data on a webpage that points to your data.

Open Data Commons: http://opendatacommons.org/

Page 41: Rdm slides march 2014

Benefits of Sharing Data

• Scientific integrity – publishing & citing data in published research papers can allow others to replicate, validate, or correct results, thus improving the scientific record.

• Publicly funded research - there is a growing movement for making publicly funded research available to the public.

• Funding mandates - US Funding Agencies are increasingly mandating data sharing so as to avoid duplication of effort and save costs.

• Preserve research data for researchers’ own future use.

Page 42: Rdm slides march 2014

Research Data MANTRA

Page 43: Rdm slides march 2014

Research Data MANTRA

Partnership between:

EDINA & Data Library, University of Edinburgh

Institute for Academic Development

Funded by JISC Managing Research Data Programme (Sept. 2010 – Aug. 2011)

Aim was to develop online interactive open learning resources for PhD students and early career researchers that will:

Raise awareness of the key issues related to research data management & contribute to culture change.

Provide guidelines for good practice.

Page 44: Rdm slides march 2014

Online Learning Module

Eight units with activities, scenarios and videos:

• Research data explained
• Data management plans
• Organising data
• File formats and transformation
• Documentation and metadata
• Storage and security
• Data protection, rights and access
• Preservation, sharing and licensing

Four data handling practicals: SPSS, NVivo, R, ArcGIS

Video stories from researchers in a variety of settings

Page 45: Rdm slides march 2014

Online Learning Module

• Delivered online – self-paced, available ‘anytime, anyplace’
• Emphasis on practical experience and active engagement via online activities
• One hour per unit
• Read and work through scenarios & activities (incl. videos etc)
• CC licence to allow manipulation of content for re-use with attribution
• Portable content in open standard formats (e.g. SCORM)

• Research data MANTRA course: http://datalib.edina.ac.uk/mantra

Page 46: Rdm slides march 2014

Questions?