TRANSCRIPT
-
Data Management for Public Health Research
Developed by the University of Minnesota Liberal Arts Technologies & Innovation Services (LATIS) and University Libraries
-
After this tutorial, you will be able to:
- Name files and folders using a proper naming convention
- Document and describe your research project
- Prepare data for public sharing
- Recognize long-term archiving solutions
-
Have you ever…?
● Downloaded a research article
● Kept the filename of said research article as is (e.g., “AHBP-S-15-00036.pdf”)
● Went to look for the article at a later date and had no idea where it was
● Saved a document with the word “final” in the filename
● Ended up with more than one “final” file (“final_v2”, “final_FINAL”)
-
Does your desktop look like this?
-
If yes...
Data management is for you!
Nobody’s perfect at it, but you can get better over time.
-
What is Data Management?
Practices across the lifecycle of a project that:
● ensure integrity of the files and data
● facilitate replication
● protect the security of data
● enhance efficiency and reliability of the research
-
No, but really, what is it?
It is the process of thinking about, and planning for:
● File names, folder structures, and their management
● Documentation & metadata
● Storage, backups, and security
● Sharing and/or preserving data
-
Why should you care?
Imagine: Three years after completing a study, a researcher contacts you wanting to use some of the data or materials to replicate your work.
● Can you locate these files or materials?
● Are they stored someplace you can still access?
● Are any of the files corrupt? Or in a format that you/others no longer have the software to read?
● Do you have documentation about how you created, analyzed, and made use of the data/materials that still makes sense to you and others?
-
Data management at the start | File organization | Documentation | Data Sharing | Project wind-down | Get help
-
What is considered “data”?
Types of data
● Transcripts (Interviews / focus groups)
● Field notes
● Audio/videos
● Texts
● Code
● Meeting minutes
● Brochures / posters / fliers
● Physical collections
● Spreadsheets / statistical files
-
Map Your Data Workflow
● Write down the steps you need to follow from the start of your research project to the end.
● At each step, write down all of your data considerations.
Take a look at the example on the next slide.
-
Map Your Data Workflow
-
Take Inventory of your Files
Add the following marks next to each file as applicable:
“c” You will be collaborating with others on these files
“v” You will need to keep multiple versions of these files
“*” These files will contain sensitive or restricted information
“ps” These files require unique, proprietary software to use
“xl” These files will be extra large and will require special resources
“a” These files need to be archived to understand your project later
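The marks above can also be kept in a machine-readable inventory. The following is a minimal sketch only: the file names and the dictionary layout are hypothetical and do not come from the tutorial.

```python
# Meaning of each inventory mark, as listed on the slide.
MARKS = {
    "c": "collaboration with others",
    "v": "multiple versions kept",
    "*": "sensitive or restricted information",
    "ps": "proprietary software required",
    "xl": "extra large, needs special resources",
    "a": "must be archived",
}

# Hypothetical inventory: file name -> set of marks that apply to it.
inventory = {
    "20150814_interview_site01_respondent04.txt": {"*", "a"},
    "survey_responses_v01.csv": {"c", "v", "a"},
    "site01_focus_group.wav": {"*", "xl"},
}


def files_with(mark, inv=inventory):
    """Return the sorted file names that carry the given mark."""
    if mark not in MARKS:
        raise ValueError(f"unknown mark: {mark!r}")
    return sorted(name for name, marks in inv.items() if mark in marks)


# List every file that holds sensitive or restricted information.
print(files_with("*"))
```

Listing the "*" files first is one way to decide which files need secure storage before the project starts.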
-
Protect your data
To avoid losing data, use the 3-2-1 rule:
3 copies of your work (1 working copy, 2 backups)
On 2 different kinds of storage
At least 1 copy off site
Flooding destroys data in North Quad, University of Michigan. https://www.si.umich.edu/news/
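The 3-2-1 rule can be checked against a written backup plan. This is a rough sketch under stated assumptions: the `Copy` class, the storage names, and the example plan are all hypothetical illustrations, not part of the tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    medium: str    # kind of storage, e.g. "laptop disk" or "managed server"
    offsite: bool  # True if the copy is kept away from the working location


def meets_3_2_1(copies):
    """True if there are 3+ copies, on 2+ kinds of storage, with 1+ off site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical plan: one working copy and two backups.
plan = [
    Copy("laptop disk", offsite=False),     # working copy
    Copy("external drive", offsite=False),  # backup 1
    Copy("managed server", offsite=True),   # backup 2
]
print(meets_3_2_1(plan))
```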
-
Protect your data
Minard Hall structural damage at North Dakota State University. https://www.mprnews.org/story/2009/12/27/building-collapse
-
Store Data Securely
For protected health information and/or sensitive identifiable data:
Accepted storage
● Managed servers
● Box.com secure storage
● Encrypted drive/container
Not accepted storage
● Shared/personal drive
● Google Drive
● Dropbox.com
● Amazon.com (personal)
● Email
-
Store Data Securely
For de-identified data (with direct identifiers removed):
Accepted storage
● Managed servers
● Box.com secure storage
● Shared / Personal drive
Not accepted storage
● Dropbox.com
● Amazon.com (personal)
● Email
-
Data management at the start | File organization | Documentation | Data Sharing | Project wind-down | Get help
-
Make a plan to organize your files
-
Folder structure
Possible organizational strategies:
● By data type: databases, text, images, models, etc.
● By research activities: interviews, surveys, experiments, etc.
● By materials: data, documentation, publications, etc.
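Whichever strategy you pick, the folders can be created in one step so that every project starts the same way. A minimal sketch, assuming the "by materials" strategy; the project folder name is hypothetical, and a temporary directory is used so the example does not touch real files.

```python
import tempfile
from pathlib import Path

# Folder names taken from the "by materials" strategy.
FOLDERS = ["data", "documentation", "publications"]


def create_structure(root, folders=FOLDERS):
    """Create the folders under root and return the names that now exist."""
    root = Path(root)
    for name in folders:
        (root / name).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Demonstrate inside a temporary directory ("project" is a hypothetical name).
with tempfile.TemporaryDirectory() as tmp:
    created = create_structure(Path(tmp) / "project")
    print(created)
```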
-
File naming
● Be descriptive
interview.txt is not helpful.
Instead: 20150814_interview_site01_respondent04.txt (up to 255 characters)
● Use consistent structure
create a useful order (for sorting) and decide on shared terminology
● Use numerical dates
YYYYMMDD rather than Dec09 or December 9
● Don’t embed information only in folder structures; put it in the file name
Avoid: 2015/august/minneapolis/interviews/reactionmemo.txt
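File names that follow these rules can be generated rather than typed by hand, which keeps the structure consistent. A minimal sketch: the function name and its parameters are hypothetical, and the pattern simply mirrors the example name on this slide.

```python
from datetime import date


def build_filename(day, activity, site, respondent, ext="txt"):
    """Build a name like 20150814_interview_site01_respondent04.txt."""
    name = (
        f"{day:%Y%m%d}_{activity}"
        f"_site{site:02d}_respondent{respondent:02d}.{ext}"
    )
    # The slide notes a limit of 255 characters for a file name.
    if len(name) > 255:
        raise ValueError("file name exceeds 255 characters")
    return name


print(build_filename(date(2015, 8, 14), "interview", 1, 4))
```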
-
Here’s why you should use numerical dates
Sort, with numerical dates
Code_descriptions_20150214.docx
Code_descriptions_20150801.docx
Code_descriptions_20151208.docx
Sort, without numerical dates
Code_descriptions_12-8-15.docx
Code_descriptions_2-14-2015.docx
Code_descriptions_8-1-2015.docx
-
Here’s why you should use numerical dates
For dates, be intentional and consider sorting.
For example, we recommend YYYYMMDD because this will allow for the most effective sorting by date whether it’s newest to oldest or oldest to newest.
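You can see the effect by sorting the example file names from the previous slide as plain text, which is how most file browsers order them. This is a small illustration only; the variable names are hypothetical.

```python
# File names with YYYYMMDD dates, deliberately listed out of order.
numeric = [
    "Code_descriptions_20151208.docx",
    "Code_descriptions_20150214.docx",
    "Code_descriptions_20150801.docx",
]

# The same three dates written without a fixed numerical format.
other = [
    "Code_descriptions_8-1-2015.docx",
    "Code_descriptions_12-8-15.docx",
    "Code_descriptions_2-14-2015.docx",
]

# A plain text sort puts YYYYMMDD names in chronological order...
print(sorted(numeric))
# ...and reversing it gives newest to oldest.
print(sorted(numeric, reverse=True))
# Without numerical dates, December sorts ahead of February and August.
print(sorted(other))
```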
-
Here’s why you should use leading zeros
Sort, with leading zeros
Sort, without leading zeros
-
Here’s why you should use leading zeros
If it makes sense to name documents numerically by version (for example, 1 through 10), consider that 01 sorts better than 1 because of how Excel and Google Docs sort. Otherwise 1 and 10 would be sorted next to each other.
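The same text sort shows the problem directly. In this sketch the "draft_v" file names are hypothetical examples.

```python
# Ten versions named without and with a leading zero.
plain = [f"draft_v{n}.docx" for n in range(1, 11)]
padded = [f"draft_v{n:02d}.docx" for n in range(1, 11)]

# Without padding, version 10 lands between versions 1 and 2.
print(sorted(plain)[:3])   # ['draft_v1.docx', 'draft_v10.docx', 'draft_v2.docx']

# With padding, the text order matches the numerical order.
print(sorted(padded)[:3])  # ['draft_v01.docx', 'draft_v02.docx', 'draft_v03.docx']
```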
-
Version control
● List versions alphanumerically: v01, v02, v03
● Name files based on anticipated number of versions (...01.csv, ...001.csv)
● Decide how many versions of a file to keep (and when and by whom versions will be deleted)
● Create master versions and identify milestone versions to keep; store them in a single location
● Assign responsibility for master files to one team member
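If versions are numbered this way, the next file name can be derived from the current one while keeping the zero padding. A minimal sketch, assuming names that end in "_v" plus digits before the extension; that pattern and the function name are assumptions, not a convention the tutorial prescribes.

```python
import re

# Hypothetical pattern: <anything>_v<digits>.<extension>
VERSION = re.compile(r"^(?P<stem>.+_v)(?P<num>\d+)(?P<ext>\.[^.]+)$")


def next_version(filename):
    """Return the file name for the next version, preserving zero padding."""
    match = VERSION.match(filename)
    if match is None:
        raise ValueError(f"no version number found in {filename!r}")
    width = len(match["num"])
    number = int(match["num"]) + 1
    return f"{match['stem']}{number:0{width}d}{match['ext']}"


print(next_version("codebook_v01.csv"))
print(next_version("codebook_v09.csv"))
```

Note that a two-digit scheme runs out at v99, which is why the slide suggests choosing the width based on how many versions you anticipate.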
-
Version control: Google Docs
-
Version control: MS Word
-
Data management at the start | File organization | Documentation | Data Sharing | Project wind-down | Get help
-
"Your primary collaborator is yourself 6 months from now… and your past self doesn’t
answer emails."
https://dynamicecology.wordpress.com/2015/02/18/the-biggest-benefit-of-my-shift-to-r-reproducibility/
-
Documentation
What should you document?
● Sources of data
○ When, where, & how data was collected
● Study decisions (protocols, coding, etc.)
● Reactions/reflections on fieldwork sites
● Statistical analyses
● Software used and version
● Where data/documentation are stored
● Future research ideas and plans
-
How should you document it?
Memos / Fieldnotes
Syntax / User logs
Dataset metadata
Codebooks
-
Example of Qualitative Codebook
If you document carefully while using a qualitative coding tool, like NVivo, you can automatically generate a codebook with the tool.
-
Data management at the start | File organization | Documentation | Data Sharing | Project wind-down | Get help
-
Data sharing for public use
Why researchers are making their datasets available to the public:
● Increasing requirements for data sharing
○ Granting agencies/journals
● Reuse/dissemination of data
● Prevent data loss
● Replication of results
● Increased transparency
● Data collection is expensive & time consuming
-
Things to think about for sharing
1. Data privacy issues
2. Metadata standards / schemas
3. Human protections considerations
4. Copyright/ intellectual property
5. Where & how to share
-
Data privacy issues
Consider direct and indirect identifiers
● Avoid collection of Personally Identifiable Information (PII) when possible
● Remove direct identifiers and replace with pseudonyms, replacement terms, vague descriptors, or codes
● Remove/replace any text that may identify a participant (e.g., “Luisa lives next to the post office”)
● Aggregate data
● Generalize detailed text
● Note replacements in text [...]
Remember: Manner of speech can identify participants!
-
Data Privacy Issues in Qualitative Research
Create an anonymization log (stored separately from the anonymized data files) of all replacements, aggregations, or removals.
-
Data Privacy Issues in Qualitative Research
Anonymizing transcripts from interviews can be a difficult and tedious process, but an important step. Use this example log to help you determine what information in a transcript needs to be changed or redacted to protect the human subjects in your research study.
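An anonymization log can be built while the replacements are made. The sketch below is illustrative only: the transcript excerpt, the replacement terms, and the log columns are hypothetical, and plain text matching will miss spelling variants and manner of speech, so manual review is still needed.

```python
import csv
import io


def anonymize(text, replacements):
    """Swap each identifier for a bracketed substitute and log what changed."""
    log = []
    for original, substitute in replacements.items():
        occurrences = text.count(original)
        if occurrences:
            text = text.replace(original, f"[{substitute}]")
            log.append({
                "original": original,
                "replacement": substitute,
                "occurrences": occurrences,
            })
    return text, log


# Hypothetical transcript excerpt and replacement terms.
transcript = "Luisa lives next to the post office. Luisa works in Minneapolis."
replacements = {
    "Luisa": "Participant 04",
    "the post office": "a public building",
    "Minneapolis": "a Midwestern city",
}
clean, log = anonymize(transcript, replacements)

# Write the log as CSV text; in practice, save it apart from the anonymized files.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["original", "replacement", "occurrences"])
writer.writeheader()
writer.writerows(log)

print(clean)
print(buffer.getvalue())
```

Because the log maps substitutes back to the original identifiers, it is as sensitive as the raw transcript and should be stored accordingly.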
-
Data privacy issues
When de-identification undermines the potential for re-use:
● Restrict access
○ Embargo
○ Contact researcher to get data
● Apply user license or user agreement
○ Create clear guidelines for how the data may be used
Sometimes when we anonymize data to protect human subjects, the resulting dataset loses its functionality or usefulness, especially with qualitative research. If you want to share your data publicly, there are other steps you can take instead of anonymizing the data.
-
Metadata standards
● Metadata is highly structured documentation
● Who, what, why, where, when, how
● Template: http://z.umn.edu/readme
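One way to make sure the who, what, why, where, when, and how are all answered is to generate the readme from a checklist. This is a rough sketch, not the layout of the linked template: the field names, the output format, and the example answers are all assumptions.

```python
# The six questions named on this slide.
QUESTIONS = ("who", "what", "why", "where", "when", "how")


def render_readme(answers):
    """Return readme text, refusing to continue if any question is unanswered."""
    missing = [q for q in QUESTIONS if not answers.get(q, "").strip()]
    if missing:
        raise ValueError("unanswered: " + ", ".join(missing))
    lines = [f"{q.upper()}: {answers[q].strip()}" for q in QUESTIONS]
    return "\n".join(lines) + "\n"


# Hypothetical answers for an interview study.
answers = {
    "who": "Research team, site 01",
    "what": "De-identified interview transcripts",
    "why": "Study of clinic access",
    "where": "Minneapolis",
    "when": "August 2015",
    "how": "Semi-structured interviews, transcribed and coded",
}
print(render_readme(answers))
```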
-
Metadata standards for qualitative research
Metadata unique to qualitative research:
● Active Citation
○ “Active Citation: A precondition for replicable qualitative research” by Andrew Moravcsik
https://www.princeton.edu/~amoravcs/library/ps.pdf
● Sensitivity Issues
● Data Preparation (e.g., anonymization)
-
IRB and human protections considerations
Instead of promising to eventually destroy the data...
● Destroy direct identifiers, linking information, identifying audio files
● Retain & preserve de-identified transcripts
Instead of claiming responses will only be seen by the research team...
● Identifying data will be kept confidential
● Only de-identified data will be shared
Instead of stating data will only be shared in aggregate forms...
● Individual responses will only be shared in ways that will not identify the participant
-
IRB and human protections considerations
Drafting the Participant Consent Form (Informed Consent)
● Ask for explicit permission to share
● IRB likely to approve consent if you demonstrate it will be done in a safe and careful manner
● Consider using graduating consent language
○ Can I share…
○ Can I share…
-
IRB and human protections considerations
Hand out a data sharing information sheet asking:
● What is an archive?
● Why put information in an archive?
● How do I know my data will be used ethically?
● What does anonymizing mean?
● How might data be used?
● Who owns the data and what is copyright?
● How do archives store my data safely?
-
Copyright / Intellectual Property
● Copyright is an intellectual property right that is automatically assigned to the original author or creator of many kinds of research data, datasets, databases, data sources, and data outputs
● Carefully consider any user agreements with the entity from which you secured the data
● Sharing copyrighted data in an unauthorized fashion is illegal
● Check with your institution regarding legalities around sharing your data
-
Where & how to share data
Standards for trusted data repositories:
● Data Seal of Approval
● Open Archival Information System (OAIS) Standard
● Information and Documentation - Criteria for Trustworthy Digital Archives
Be sure to look into their funding structure: will there be a charge to share your data with them?
-
Where & how to share
Disciplinary Examples
● ICPSR (social science)
● QDR (qualitative research)
Institutional Examples
● Check with your university!
Funder Examples
● National Institutes of Health
Search in a repository of repositories
● re3data
https://www.re3data.org/
-
Data management at the start | File organization | Documentation | Data Sharing | Project wind-down | Get help
-
Data storage & archiving
How will you ensure the data will be around beyond the life of the project?
● How long?
○ “forever” is not a realistic plan
● How will you do it?
○ “keep it indefinitely on hard drive/server” is not enough
-
Data storage & archiving
You can keep the file, but will you be able to open it, find it, or know what it is?
● Hard drives die. Protected servers have redundancy built into their systems, but your own computer or server does not have that protection
● Saving all your files seems like a good idea, but the sheer volume makes them much harder to use later. Selecting milestone versions now will help you later!
● Create hold and delete folders, then clean out your delete folder on a regular basis
-
Storage vs. archiving
During project
Storage: Back-ups of active data
Actions: Documentation
After project
Archival Storage: Final versions with offline copies
Actions: Preservation
-
Storage vs. archiving
People often have different concepts of what "archiving" means
People get tripped up on the "archiving" part, possibly because it suggests having to do a lot of mysterious work at the end of a project in order to "archive" their data
Think of storage vs archiving more in terms of "working storage" and "long-term storage."
-
Storage vs. archiving
Sometimes moving between working and long-term storage is obvious (submitting to a repository). Other times it requires an intentional action: copying the final files to a new directory or backup location, or going through the working storage location and cleaning out drafts and old working files.
Archival storage, such as Amazon Glacier, is like saving things on tape: long lasting, offline, with no accessing or changing.
Some archives put data away where it is hard to get to; others have a sharing function.
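The intentional copy from working storage to long-term storage can be scripted so that each copy is also verified. This is a sketch under stated assumptions: the checksum comparison is an added safeguard that the tutorial does not prescribe, the file and folder names are hypothetical, and a temporary directory stands in for real storage locations.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path):
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def archive_final(src, archive_dir):
    """Copy a final file into archive_dir and confirm the copy matches."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    dest = archive_dir / Path(src).name
    shutil.copy2(src, dest)  # copy2 also preserves file timestamps
    if sha256(src) != sha256(dest):
        raise OSError(f"checksum mismatch for {dest}")
    return dest


# Demonstrate with a hypothetical milestone file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    working = Path(tmp) / "working"
    working.mkdir()
    final = working / "20151208_codebook_v03.csv"
    final.write_text("code,description\n")
    copied = archive_final(final, Path(tmp) / "archive")
    print(copied.name)
```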
-
File formats for archiving data
Ideally there should be:
● Interoperability of data for different programs
● Long term viability of data
● Non-proprietary formats
Recommended file formats:
Textual data: .rtf; .txt
Tabular data: .csv; .tab
Images: .tif
Audio: .flac; .wav
Video: .mp4; .jp2
Documentation: .rtf; .pdf; .odt
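The recommended formats on this slide can be turned into a quick check of which files still need converting before archiving. A minimal sketch: the extensions come from the slide, while the function name and the example file names are hypothetical.

```python
from pathlib import Path

# Recommended extensions per data type, as listed on the slide.
RECOMMENDED = {
    "textual data": {".rtf", ".txt"},
    "tabular data": {".csv", ".tab"},
    "images": {".tif"},
    "audio": {".flac", ".wav"},
    "video": {".mp4", ".jp2"},
    "documentation": {".rtf", ".pdf", ".odt"},
}
ALL_RECOMMENDED = set().union(*RECOMMENDED.values())


def needs_conversion(filenames):
    """Return the files whose extension is not a recommended format."""
    return [
        name for name in filenames
        if Path(name).suffix.lower() not in ALL_RECOMMENDED
    ]


# Hypothetical project files.
files = ["notes.docx", "survey.csv", "photo.TIF", "interview.mp3"]
print(needs_conversion(files))
```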
-
Data management at the start | File organization | Documentation | Data Sharing | Project wind-down | Get help
-
Resources
• UK Data Archive’s Guide to Managing and Sharing Data
http://www.data-archive.ac.uk/media/2894/managingsharing.pdf
• Institute of Museum and Library Services Video/Audio preservation guide
http://ohda.matrix.msu.edu/2012/06/digital-video-preservation-and-oral-history/
• ICPSR’s Guide to Data Preparation and Archiving
http://www.icpsr.umich.edu/icpsrweb/content/deposit/guide/index.html
• Library of Congress Digital Preservation Formats
http://www.digitalpreservation.gov/formats/fdd/descriptions.shtml
• DMP templates: Dataverse template, DMPTool.org, ICPSR
http://thedata.org/book/data-management-plan-template
https://dmptool.org/
http://www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/elements.html
• The Framework Method
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3848812/
• Qualitative Data Repository
https://qdr.syr.edu/
-
Review - you are now able to:
- Name files and folders using a proper naming convention
- Document and describe your research project
- Prepare data for public sharing
- Recognize long-term archiving solutions
-
Thank you!
Attribution-NonCommercial CC BY-NC
https://creativecommons.org/share-your-work/licensing-types-examples/licensing-examples/#nc