Post on 27-Jan-2021


  • Data Management for Public Health Research

    Developed by the University of Minnesota Liberal Arts Technologies & Innovation Services (LATIS) and University Libraries

  • After this tutorial, you will be able to:

    - Name files and folders using a proper naming convention
    - Document and describe your research project
    - Prepare data for public sharing
    - Recognize long-term archiving solutions

  • Have you ever…?

    ● Downloaded a research article
    ● Kept the filename of said research article as is (e.g., “AHBP-S-15-00036.pdf”)
    ● Gone to look for the article at a later date and had no idea where it was
    ● Saved a document with the word “final” in the filename
    ● Ended up with more than one “final” file (“final_v2”, “final_FINAL”)

  • Does your desktop look like this?

  • If yes...

    Data management is for you!

    Nobody’s perfect at it, but you can get better over time.

  • What is Data Management?

    Practices across the lifecycle of a project that:

    ● ensure integrity of the files and data
    ● facilitate replication
    ● protect the security of data
    ● enhance efficiency and reliability of the research

  • No, but really, what is it?

    It is the process of thinking about, and planning for:

    ● File names, folder structures, and their management
    ● Documentation & metadata
    ● Storage, backups, and security
    ● Sharing and/or preserving data

  • Why should you care?

    Imagine: Three years after completing a study, a researcher contacts you wanting to use some of the data or materials to replicate your work.

    ● Can you locate these files or materials?
    ● Are they stored someplace you can still access?
    ● Are any of the files corrupt? Or in a format that you/others no longer have the software to read?
    ● Do you have documentation about how you created, analyzed, and made use of the data/materials that still makes sense to you and others?

  • Data management at the start | File organization | Documentation | Data Sharing | Project wind-down | Get help

  • What is considered “data”?

    Types of data

    ● Transcripts (interviews / focus groups)
    ● Field notes
    ● Audio/videos
    ● Texts
    ● Code
    ● Meeting minutes
    ● Brochures / posters / fliers
    ● Physical collections
    ● Spreadsheets / statistical files

  • Map Your Data Workflow

    ● Write down the steps you need to follow from the start of your research project to the end.

    ● At each step, write down all of your data considerations.

    Take a look at the example on the next slide.

  • Map Your Data Workflow

  • Take Inventory of your Files

    Add the following marks next to each file as applicable:

    “c” You will be collaborating with others on these files

    “v” You will need to keep multiple versions of these files

    “*” These files will contain sensitive or restricted information

    “ps” These files require unique, proprietary software to use

    “xl” These files will be extra large and will require special resources

    “a” These files need to be archived to understand your project later
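    The inventory marks above can also be kept in a small machine-readable form. The following is a minimal Python sketch, not part of the original tutorial: the file names and the `files_with_mark` helper are hypothetical, and the mark descriptions paraphrase the slide.

```python
# Marks from the inventory slide; descriptions paraphrase the slide text.
MARKS = {
    "c": "collaborating with others",
    "v": "multiple versions needed",
    "*": "sensitive or restricted information",
    "ps": "unique, proprietary software required",
    "xl": "extra large, needs special resources",
    "a": "must be archived to understand the project later",
}


def files_with_mark(inventory, mark):
    """Return the sorted names of files tagged with the given mark."""
    if mark not in MARKS:
        raise ValueError(f"unknown mark: {mark!r}")
    return sorted(name for name, marks in inventory.items() if mark in marks)


# Hypothetical inventory: file (or file group) name -> set of marks.
inventory = {
    "interview_transcripts": {"c", "*", "a"},
    "survey_responses.csv": {"v", "a"},
    "site_videos": {"xl", "*"},
}

sensitive = files_with_mark(inventory, "*")
to_archive = files_with_mark(inventory, "a")
```

    Listing everything marked “*” early makes it easier to decide which files need secure storage before data collection starts.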

  • Protect your data

    To avoid losing data, use the 3-2-1 rule:

    3 copies of your work (1 working copy, 2 backups)

    On 2 different kinds of storage

    At least 1 copy off site

    Flooding destroys data in North Quad, University of Michigan. https://www.si.umich.edu/news/
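    As a rough illustration, the 3-2-1 rule can be written as a small check. This Python sketch is an assumption-laden example rather than an official tool: the `StoredCopy` class, its fields, and the example locations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StoredCopy:
    location: str   # free-text label, e.g. "office desktop"
    medium: str     # kind of storage, e.g. "internal disk"
    offsite: bool   # True if the copy is kept away from the primary site


def meets_3_2_1(copies):
    """Check the 3-2-1 rule: 3+ copies, 2+ kinds of storage, 1+ copy off site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical backup plan: one working copy plus two backups.
copies = [
    StoredCopy("office desktop", "internal disk", offsite=False),
    StoredCopy("external drive in the lab", "external disk", offsite=False),
    StoredCopy("managed university server", "network server", offsite=True),
]

plan_ok = meets_3_2_1(copies)
```

    Dropping the off-site copy, or keeping all three copies on the same kind of storage, makes the check fail.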

  • Protect your data

    Minard Hall structural damage at North Dakota State University. https://www.mprnews.org/story/2009/12/27/building-collapse

  • Store Data Securely

    For protected health information and/or sensitive identifiable data:

    Accepted storage:
    ● Managed servers
    ● Box.com secure storage
    ● Encrypted drive/container

    Not accepted storage:
    ● Shared/personal drive
    ● Google Drive
    ● Dropbox.com
    ● Amazon.com (personal)
    ● Email

  • Store Data Securely

    For de-identified data (with direct identifiers removed):

    Accepted storage:
    ● Managed servers
    ● Box.com secure storage
    ● Shared/personal drive

    Not accepted storage:
    ● Dropbox.com
    ● Amazon.com (personal)
    ● Email

  • Data management at the start | File organization | Documentation | Data Sharing | Project wind-down | Get help

  • Make a plan to organize your files

  • Folder structure

    Possible organizational strategies:

    ● By data type: databases, text, images, models, etc.
    ● By research activities: interviews, surveys, experiments, etc.
    ● By materials: data, documentation, publications, etc.
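    Whichever strategy is chosen, the folder skeleton can be created in one step so every project starts the same way. The sketch below uses Python's standard `pathlib` and the “by materials” strategy; the function name `create_project_folders` and the project name `my_study` are hypothetical.

```python
import tempfile
from pathlib import Path


def create_project_folders(root, subfolders=("data", "documentation", "publications")):
    """Create the project root and one subfolder per material type."""
    root = Path(root)
    created = []
    for name in subfolders:
        folder = root / name
        # parents=True creates the root too; exist_ok=True makes re-runs safe.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demonstrated in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    folders = create_project_folders(Path(tmp) / "my_study")
    folder_names = sorted(f.name for f in folders)
    all_exist = all(f.is_dir() for f in folders)
```

    Passing a different `subfolders` tuple (for example, interviews, surveys, experiments) gives the “by research activities” layout instead.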

  • File naming

    ● Be descriptive

    interview.txt is not helpful.

    Instead: 20150814_interview_site01_respondent04.txt (up to 255 characters)

    ● Use consistent structure

    Create a useful order (for sorting) and agree on shared terminology.

    ● Use numerical dates

    YYYYMMDD rather than Dec09 or December 9

    ● Don’t embed information in folder structures

    2015/august/minneapolis/interviews/reactionmemo.txt is not helpful: the file name alone says nothing once the file leaves its folders.
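    A naming convention is easier to follow when a small helper builds the names. This Python sketch reproduces the slide's example name; the function `build_filename` and its parameters are hypothetical, and the element order is only one possible convention.

```python
from datetime import date

MAX_NAME_LENGTH = 255  # limit quoted on the slide


def build_filename(day, activity, site, respondent, ext="txt"):
    """Build a descriptive name: YYYYMMDD_activity_siteNN_respondentNN.ext"""
    name = (
        f"{day:%Y%m%d}_{activity}"
        f"_site{site:02d}_respondent{respondent:02d}.{ext}"
    )
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError("file name exceeds 255 characters")
    return name


example_name = build_filename(date(2015, 8, 14), "interview", site=1, respondent=4)
```

    Because the date, site, and respondent are part of the name itself, the file stays identifiable even if it is moved out of its folder.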

  • Here’s why you should use numerical dates

    Sort, with numerical dates:
    Code_descriptions_20150214.docx
    Code_descriptions_20150801.docx
    Code_descriptions_20151208.docx

    Sort, without numerical dates:
    Code_descriptions_12-8-15.docx
    Code_descriptions_2-14-2015.docx
    Code_descriptions_8-1-2015.docx

  • Here’s why you should use numerical dates

    For dates, be intentional and consider sorting.

    For example, we recommend YYYYMMDD because this will allow for the most effective sorting by date whether it’s newest to oldest or oldest to newest.
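    The effect is easy to reproduce. This short Python sketch sorts the file names from the comparison slide as plain text, which is an assumption about how a typical file browser orders names.

```python
with_numeric_dates = [
    "Code_descriptions_20151208.docx",
    "Code_descriptions_20150214.docx",
    "Code_descriptions_20150801.docx",
]
without_numeric_dates = [
    "Code_descriptions_2-14-2015.docx",
    "Code_descriptions_8-1-2015.docx",
    "Code_descriptions_12-8-15.docx",
]

# Plain text sorting compares names character by character.
sorted_numeric = sorted(with_numeric_dates)       # chronological order
sorted_informal = sorted(without_numeric_dates)   # December lands first
newest_first = sorted(with_numeric_dates, reverse=True)
```

    With YYYYMMDD, text order and date order coincide, so the same names work for oldest-to-newest and newest-to-oldest listings.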

  • Here’s why you should use a leading zero

    [Figure: file list sorted with leading zeros vs. without leading zeros]

  • Here’s why you should use a leading zero

    If it makes sense to number documents by version (for example, 1 to 10), use a leading zero.

    01 sorts better than 1 because of how Excel and Google Docs sort names. Without the leading zero, 1 and 10 would be sorted next to each other.
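    A quick Python sketch shows the difference, assuming names are sorted as plain text. The stem `data` and the helper `padded_name` are hypothetical.

```python
def padded_name(stem, number, width=2, ext="csv"):
    """Zero-pad the number so text order matches numeric order."""
    return f"{stem}_{number:0{width}d}.{ext}"


numbers = (1, 2, 10)

unpadded_sorted = sorted(f"data_{n}.csv" for n in numbers)
padded_sorted = sorted(padded_name("data", n) for n in numbers)
```

    Without padding, `data_10.csv` sorts between `data_1.csv` and `data_2.csv`; with padding, the files appear as 01, 02, 10.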

  • Version control

    ● List versions alphanumerically: v01, v02, v03
    ● Name files based on the anticipated number of versions (...01.csv, ...001.csv)
    ● Decide how many versions of a file to keep (and when, and by whom, versions will be deleted)
    ● Create master versions and identify milestone versions to keep; store them in a single location
    ● Assign responsibility for master files to one team member
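    One way to keep version numbers consistent is to derive the next name from the files that already exist. The Python sketch below assumes a hypothetical `stem_vNN.ext` pattern; the function name and example files are illustrative only.

```python
import re

# Matches names such as "codebook_v02.docx" (assumed convention).
VERSION_PATTERN = re.compile(r"^(?P<stem>.+)_v(?P<num>\d+)\.(?P<ext>[^.]+)$")


def next_version_name(existing_names, stem, ext, width=2):
    """Return the name for the next version of stem.ext, zero-padded."""
    highest = 0
    for name in existing_names:
        match = VERSION_PATTERN.match(name)
        if match and match["stem"] == stem and match["ext"] == ext:
            highest = max(highest, int(match["num"]))
    return f"{stem}_v{highest + 1:0{width}d}.{ext}"


existing = ["codebook_v01.docx", "codebook_v02.docx", "notes_v07.docx"]
next_codebook = next_version_name(existing, "codebook", "docx")
first_memo = next_version_name(existing, "memo", "docx")
```

    Choosing `width=3` up front follows the advice to size the number for the anticipated count of versions.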

  • Version control: Google Docs

  • Version control: MS Word

  • Data management at the start | File organization | Documentation | Data Sharing | Project wind-down | Get help

  • "Your primary collaborator is yourself 6 months from now… and your past self doesn’t answer emails."

    https://dynamicecology.wordpress.com/2015/02/18/the-biggest-benefit-of-my-shift-to-r-reproducibility/

  • Documentation

    What should you document?

    ● Sources of data
    ○ When, where, & how data was collected
    ● Study decisions (protocols, coding, etc.)
    ● Reactions/reflections on fieldwork sites
    ● Statistical analyses
    ● Software used and version
    ● Where data/documentation are stored
    ● Future research ideas and plans

  • How should you document it?

    ● Memos / Fieldnotes
    ● Syntax / User logs
    ● Dataset metadata
    ● Codebooks

  • Example of Qualitative Codebook

    If you document carefully while using a qualitative coding tool, like NVivo, you can automatically generate a codebook with the tool.

  • Data management at the start | File organization | Documentation | Data Sharing | Project wind-down | Get help

  • Data sharing for public use

    Why researchers are making their datasets available to the public:

    ● Increasing requirements for data sharing
    ○ Granting agencies/journals
    ● Reuse/dissemination of data
    ● Prevent data loss
    ● Replication of results
    ● Increased transparency
    ● Data collection is expensive & time consuming

  • Things to think about for sharing

    1. Data privacy issues

    2. Metadata standards / schemas

    3. Human protections considerations

    4. Copyright / intellectual property

    5. Where & how to share

  • Data privacy issues

    Consider direct and indirect identifiers

    ● Avoid collection of Personally Identifiable Information (PII) when possible
    ● Remove direct identifiers and replace with pseudonyms, replacement terms, vague descriptors, or codes
    ● Remove/replace any text that may identify a participant (e.g., “Luisa lives next to the post office”)
    ● Aggregate data
    ● Generalize detailed text
    ● Note replacements in text [...]

    Remember: Manner of speech can identify participants!

  • Data Privacy Issues in Qualitative Research

    Create an anonymization log (stored separately from the anonymized data files) of all replacements, aggregations, or removals.

  • Data Privacy Issues in Qualitative Research

    Anonymizing transcripts from interviews can be a difficult and tedious process, but it is an important step. Use this example log to help you determine what information in a transcript needs to be changed or redacted to protect the human subjects in your research study.
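    The replace-and-log idea can be sketched in a few lines of Python. This is an illustrative example only, not a complete de-identification tool: the `anonymize` function, the replacement terms, and the sample sentence are hypothetical, and exact string matching will miss indirect identifiers such as manner of speech.

```python
import re


def anonymize(text, replacements):
    """Replace each identifier with a bracketed term and log every change.

    Returns (anonymized_text, log); the log should be stored separately
    from the anonymized files.
    """
    log = []
    for original, replacement in replacements.items():
        pattern = re.compile(rf"\b{re.escape(original)}\b")
        # A function replacement avoids backslash handling in the new text.
        text, count = pattern.subn(lambda _m, r=replacement: f"[{r}]", text)
        if count:
            log.append(
                {"original": original, "replacement": replacement, "occurrences": count}
            )
    return text, log


transcript = "Luisa lives next to the post office. Luisa works nights."
clean_text, anonymization_log = anonymize(
    transcript,
    {"Luisa": "Participant 04", "the post office": "a local landmark"},
)
```

    Brackets mark each replacement in the text, and the log records what was changed so the research team can trace decisions later.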

  • Data privacy issues

    When de-identification disintegrates the potential for re-use:

    ● Restrict access
    ○ Embargo
    ○ Contact researcher to get data
    ● Apply user license or user agreement
    ○ Create clear guidelines for how the data may be used

    Sometimes when we anonymize data to protect human subjects, the resulting dataset loses its functionality or usefulness, especially with qualitative research. If you want to share your data publicly, there are other steps you can take instead of anonymizing the data.

  • Metadata standards

    ● Metadata is highly structured documentation

    ● Who, what, why, where, when, how

    ● Template: http://z.umn.edu/readme
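    As a rough sketch of structured documentation, the Python example below renders a plain-text README from the six questions on the slide. It does not reproduce the linked template: the layout, the `render_readme` function, and the study details are assumptions made for illustration.

```python
QUESTIONS = ("who", "what", "why", "where", "when", "how")


def render_readme(title, answers):
    """Render a plain-text README, refusing to skip any of the six questions."""
    missing = [q for q in QUESTIONS if not answers.get(q, "").strip()]
    if missing:
        raise ValueError("missing answers for: " + ", ".join(missing))
    lines = [title, "=" * len(title), ""]
    lines += [f"{q.capitalize()}: {answers[q].strip()}" for q in QUESTIONS]
    return "\n".join(lines) + "\n"


# Hypothetical study details, for illustration only.
readme_text = render_readme(
    "Clinic interview study",
    {
        "who": "Example research team",
        "what": "De-identified interview transcripts",
        "why": "To study barriers to clinic access",
        "where": "Two clinic sites",
        "when": "August 2015",
        "how": "Semi-structured interviews, transcribed and coded",
    },
)
```

    Requiring an answer to every question is the point of a structured format: gaps are caught when the README is written, not three years later.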

  • Metadata standards for qualitative research

    Metadata unique to qualitative research:

    ● Active Citation
    ○ Active Citation: A precondition for replicable qualitative research by Andrew Moravcsik (https://www.princeton.edu/~amoravcs/library/ps.pdf)
    ● Sensitivity Issues
    ● Data Preparation (e.g., anonymization)

  • IRB and human protections considerations

    Instead of promising to eventually destroy the data...

    ● Destroy direct identifiers, linking information, identifying audio files
    ● Retain & preserve de-identified transcripts

    Instead of claiming responses will only be seen by the research team...

    ● Identifying data will be kept confidential
    ● Only de-identified data will be shared

    Instead of stating data will only be shared in aggregate forms...

    ● Individual responses will only be shared in ways that will not identify the participant

  • IRB and human protections considerations

    Drafting the Participant Consent Form (Informed Consent)

    ● Ask for explicit permission to share

    ● IRB likely to approve consent if you demonstrate it will be done in a safe and careful manner

    ● Consider using graduating consent language

    ○ Can I share…

    ○ Can I share…

  • IRB and human protections considerations

    Hand out a data sharing information sheet that answers questions such as:

    ● What is an archive?
    ● Why put information in an archive?
    ● How do I know my data will be used ethically?
    ● What does anonymizing mean?
    ● How might data be used?
    ● Who owns the data and what is copyright?
    ● How do archives store my data safely?

  • Copyright / Intellectual Property

    ● Copyright is an intellectual property right that is automatically assigned to the original author or creator of many kinds of research data, datasets, databases, data sources, and data outputs
    ● Take into careful consideration any user agreements with the entity from which you secured the data
    ● Sharing copyrighted data in an unauthorized fashion is illegal
    ● Check with your institution regarding legalities around sharing your data

  • Where & how to share data

    Standards for trusted data repositories:

    ● Data Seal of Approval

    ● Open Archival Information System (OAIS) Standard

    ● Information and Documentation - Criteria for Trustworthy Digital Archives

    Be sure to look into each repository's funding structure: will there be a charge to share your data with them?

  • Where & how to share

    Disciplinary Examples

    ● ICPSR (social science)
    ● QDR (qualitative research)

    Institutional Examples

    ● Check with your university!

    Funder Examples

    ● National Institutes of Health

    Search in a repository of repositories

    ● re3data

    https://www.re3data.org/

  • Data management at the start | File organization | Documentation | Data Sharing | Project wind-down | Get help

  • Data storage & archiving

    How will you ensure the data will be around beyond the life of the project?

    ● How long?

    ○ “forever” is not a realistic plan

    ● How will you do it?

    ○ “keep it indefinitely on hard drive/server” is not enough

  • Data storage & archiving

    You can keep the file, but will you be able to open it, find it, or know what it is?

    ● Hard drives die. Protected servers have redundancy built into their systems, but your own computer or server does not have that protection.

    ● Saving all your files seems like a good idea, but the sheer volume makes them that much harder to use later. Going through the process of selecting milestone versions will help you later!

    ● Create hold and delete folders - then clean out your delete folder on a regular basis

  • Storage vs. archiving

    During project
    ● Storage: Back-ups of active data
    ● Actions: Documentation

    After project
    ● Archival storage: Final versions with offline copies
    ● Actions: Preservation

  • Storage vs. archiving

    People often have different concepts of what "archiving" means

    People get tripped up on the "archiving" part, possibly because of the connotation that they must do a lot of mysterious work at the end in order to "archive" their data.

    Think of storage vs archiving more in terms of "working storage" and "long-term storage."

  • Storage vs. archiving

    Sometimes moving between working and long-term storage is obvious (submitting to a repository). At other times it requires an intentional action: copying the final files to a new directory or backup location, or going through the working storage location and cleaning out drafts and old working files.

    Archival storage, such as Amazon Glacier or saving to tape, is long lasting and offline, with no accessing or changing of files.

    Some archives put data away where it is hard to get to; others have a sharing function.

  • File formats for archiving data

    Ideally there should be:

    ● Interoperability of data for different programs
    ● Long term viability of data
    ● Non-proprietary formats

    Recommended file formats:

    Textual data: .rtf; .txt
    Tabular data: .csv; .tab
    Images: .tif
    Audio: .flac; .wav
    Video: .mp4; .jp2
    Documentation: .rtf; .pdf; .odt
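    The recommended formats for this slide can be turned into a simple pre-archiving check. In this Python sketch the extension table is copied from the slide, while the function `needs_conversion` and the example file names are hypothetical.

```python
from pathlib import Path

# Recommended archival formats, as listed on the slide.
RECOMMENDED_FORMATS = {
    "Textual data": {".rtf", ".txt"},
    "Tabular data": {".csv", ".tab"},
    "Images": {".tif"},
    "Audio": {".flac", ".wav"},
    "Video": {".mp4", ".jp2"},
    "Documentation": {".rtf", ".pdf", ".odt"},
}


def needs_conversion(filenames):
    """Return the files whose extension is not in any recommended list."""
    recommended = set().union(*RECOMMENDED_FORMATS.values())
    return [
        name for name in filenames
        if Path(name).suffix.lower() not in recommended
    ]


# Hypothetical project files.
project_files = ["survey.csv", "budget.xlsx", "site_photo.TIF", "codebook.docx"]
to_convert = needs_conversion(project_files)
```

    Files flagged by the check are candidates for export to a non-proprietary format before they move to long-term storage.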

  • Data management at the start | File organization | Documentation | Data Sharing | Project wind-down | Get help

  • Resources

    • UK Data Archive’s Guide to Managing and Sharing Data: http://www.data-archive.ac.uk/media/2894/managingsharing.pdf

    • Institute of Museum and Library Services Video/Audio preservation guide: http://ohda.matrix.msu.edu/2012/06/digital-video-preservation-and-oral-history/

    • ICPSR’s Guide to Data Preparation and Archiving: http://www.icpsr.umich.edu/icpsrweb/content/deposit/guide/index.html

    • Library of Congress Digital Preservation Formats: http://www.digitalpreservation.gov/formats/fdd/descriptions.shtml

    • DMP templates: Dataverse template (http://thedata.org/book/data-management-plan-template), DMPTool.org (https://dmptool.org/), ICPSR (http://www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/elements.html)

    • The Framework Method: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3848812/

    • Qualitative Data Repository: https://qdr.syr.edu/

  • Review - you are now able to:

    - Name files and folders using a proper naming convention
    - Document and describe your research project
    - Prepare data for public sharing
    - Recognize long-term archiving solutions

  • Thank you!

    Attribution-NonCommercial CC BY-NC

    https://creativecommons.org/share-your-work/licensing-types-examples/licensing-examples/#nc