UC Santa Cruz: Data Management for Scientists
28 Feb 2012
Data Management for Scientists
Carly Strasser, PhD California Digital Library, UC Office of the President
[email protected] www.carlystrasser.net
Reduce your workload Reuse your ideas Recycle your data
From Flickr by Mark McLaughlin
UC Santa Cruz February 2012
Roadmap
1. Background
2. Data management landscape
3. How to improve
4. Toolbox
NSF funded DataNet Project Office of Cyberinfrastructure
[Figure: A, B, C; Pre DataONE vs. DataONE]
Community Engagement & Outreach
Courtesy of DataONE
Cyberinfrastructure
From Flickr by wetwebwork
Is data management being taught?
Do attitudes about sharing differ among disciplines?
What role can libraries play in data education?
How can we promote storing data in repositories?
What barriers to sharing can we eliminate?
Why don’t people share data?
Digital data
From Flickr by Flickmor
From Flickr by US Army Environmental Command
From Flickr by DW0825
C. Strasser
Courtesy of WHOI
www.woodrow.org
From Flickr by deltaMike
Digital data +
Complex analyses
Data
Maximum Likelihood estimation
Matrix Models
Models
Images Tables Paper
UGLY TRUTH
Many Earth | Environmental | Ecological scientists…
• are not taught data management
• don’t know what metadata are
• can’t name data centers or repositories
• don’t share data publicly or store it in an archive
• aren’t convinced they should share data
5shortessays.blogspot.com
Data Hangover
From Flickr by SteveMcN
What happened?
Where data end up
Data
Metadata
Recreated from Klump et al. 2006
blog.order2disorder.com
From Flickr by csessums
From Flickr by diylibrarian
Who cares?
www.rba.gov.au
From Flickr by Redden-McAllister
From Flickr by AJC1
From Flickr by torkildr
From Flickr by diylibrarian
Data Management
Data Reuse
Data Sharing
Trends in Data Archiving
Journal publishers: Joint Data Archiving Agreement
Data Papers etc.: Ecological Archives, Beyond the PDF
Funders: Data management requirements
Roadmap
1. Background
2. Data management landscape
3. Best practices
4. Toolbox
Best Practices for Data Management
1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
7. Planning
C:\Documents and Settings\hampton\My Documents\NCEAS Distributed Graduate Seminars\[Wash Cres Lake Dec 15 Dont_Use.xls]Sheet1
Stable Isotope Data Sheet
Sampling Site / Identifier: Wash Cresc Lake
Sample Type: Algal Washed Rocks
Date: Dec. 16
Tray ID and Sequence: Tray 004
Reference statistics: SD for delta 13C = 0.07; SD for delta 15N = 0.15
Peter's lab
Don't use - old data
Position  SampleID       Weight (mg)  %C     delta 13C  delta 13C_ca  %N    delta 15N  delta 15N_ca  Spec. No.
A1        ref            0.98         38.27  -25.05     -24.59        1.96  4.12       3.47          25354
A2        ref            0.98         39.78  -25.00     -24.54        2.03  4.01       3.36          25356
A3        ref            0.98         40.37  -24.99     -24.53        2.04  4.09       3.44          25358
A4        ref            1.01         42.23  -25.06     -24.60        2.17  4.20       3.55          25360  Shore  Avg  Con
A5        ALG01          3.05         1.88   -24.34     -23.88        0.17  -1.65      -2.30         25362  c  -1.26  -27.22
A6        Lk Outlet Alg  3.06         31.55  -30.17     -29.71        0.92  0.87       0.22          25364  1.26  0.32
A7        ALG03          2.91         6.85   -21.11     -20.65        0.48  -0.97      -1.62         25366  c
A8        ALG05          2.91         35.56  -28.05     -27.59        2.30  0.59       -0.06         25368
A9        ALG07          3.04         33.49  -29.56     -29.10        1.68  0.79       0.14          25370
A10       ALG06          2.95         41.17  -27.32     -26.86        1.97  2.71       2.06          25372
B1        ALG04          3.01         43.74  -27.50     -27.04        1.36  0.99       0.34          25374  c
B2        ALG02          3            4.51   -22.68     -22.22        0.34  4.31       3.66          25376
B3        ALG01          2.99         1.59   -24.58     -24.12        0.15  -1.69      -2.34         25378  c
B4        ALG03          2.92         4.37   -21.06     -20.60        0.34  -1.52      -2.17         25380  c
B5        ALG07          2.9          33.58  -29.44     -28.98        1.74  0.62       -0.03         25382
B6        ref            1.01         44.94  -25.00     -24.54        2.59  3.96       3.31          25384
B7        ref            0.99         42.28  -24.87     -24.41        2.37  4.33       3.68          25386
B8        Lk Outlet Alg  3.04         31.43  -29.69     -29.23        1.07  0.95       0.30          25388
B9        ALG06          3.09         35.57  -27.26     -26.80        1.96  2.79       2.14          25390
B10       ALG02          3.05         5.52   -22.31     -21.85        0.45  4.72       4.07          25392
C1        ALG04          2.98         37.90  -27.42     -26.96        1.36  1.21       0.56          25394  c
C2        ALG05          3.04         31.74  -27.93     -27.47        2.40  0.73       0.08          25396
C3        ref            0.99         38.46  -25.09     -24.63        2.40  4.37       3.72          25398
23.78  1.17
From Stephanie Hampton (2010) ESA Workshop on Best Practices
2 tables Random notes
Wash Cres Lake Dec 15 Dont_Use.xls
[Same data sheet, with regression output pasted beside and below the data rows]
SUMMARY OUTPUT
Regression Statistics
Multiple R         0.283158
R Square           0.080178
Adjusted R Square  -0.022024
Standard Error     1.906378
Observations       11
ANOVA
            df  SS        MS        F         Significance F
Regression  1   2.851116  2.851116  0.784507  0.398813
Residual    9   32.7085   3.634278
Total       10  35.55962
              Coefficients  Standard Error  t Stat     P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept     -4.297428     4.671099        -0.920003  0.381568  -14.8642   6.269341   -14.8642     6.269341
X Variable 1  -0.158022     0.17841         -0.885724  0.398813  -0.561612  0.245569   -0.561612    0.245569
Random stats output
SampleID      ALG03   ALG05   ALG07   ALG06   ALG04   ALG02   ALG01   ALG03   ALG07
Weight (mg)   2.91    2.91    3.04    2.95    3.01    3       2.99    2.92    2.9
%C            6.85    35.56   33.49   41.17   43.74   4.51    1.59    4.37    33.58
delta 13C     -21.11  -28.05  -29.56  -27.32  -27.50  -22.68  -24.58  -21.06  -29.44
delta 13C_ca  -20.65  -27.59  -29.10  -26.86  -27.04  -22.22  -24.12  -20.60  -28.98
%N            0.48    2.30    1.68    1.97    1.36    0.34    0.15    0.34    1.74
delta 15N     -0.97   0.59    0.79    2.71    0.99    4.31    -1.69   -1.52   0.62
delta 15N_ca  -1.62   -0.06   0.14    2.06    0.34    3.66    -2.34   -2.17   -0.03
[Chart: Series1; axis ticks -35.00 to 0.00 and -3.00 to 4.00]
Create unique identifiers
• Decide on naming scheme early
• Create a key
• Different for each sample
2. Data collection & organization
From Flickr by sjbresnahan From Flickr by zebbie
Standardize
• Consistent within columns – only numbers, dates, or text
• Consistent names, codes, formats
Modified from K. Vanderbilt From Pink Floyd, The Wall themurkyfringe.com
2. Data collection & organization
Google Docs Forms
Standardize
• Reduce possibility of manual error by constraining entry choices
Modified from K. Vanderbilt
2. Data collection & organization
Excel lists
Data validation
Identify missing data
• Numeric fields: distinct value (e.g. 9999)
• Text fields: NULL or NA
• Use data flags in a separate column to qualify empty cells
M1 = missing; no sample collected
E1 = estimated from grab sample
2. Data collection & organization
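The missing-data conventions on this slide can be sketched in a few lines of Python. This is only an illustration: the table, the `temp_c` column, and the sample IDs are made up; only the 9999 sentinel and the M1/E1 flag codes come from the slide.

```python
import csv
import io

# Hypothetical raw table: 9999 marks a missing numeric value,
# and a separate "flag" column explains why (codes from the slide).
RAW = """sample_id,temp_c,flag
S01,12.4,
S02,9999,M1
S03,11.8,E1
"""

FLAG_MEANINGS = {
    "M1": "missing; no sample collected",
    "E1": "estimated from grab sample",
}
MISSING_SENTINEL = "9999"


def read_samples(text):
    """Parse rows, turning the sentinel into None and decoding flags."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        is_missing = row["temp_c"] == MISSING_SENTINEL
        rows.append({
            "sample_id": row["sample_id"],
            "temp_c": None if is_missing else float(row["temp_c"]),
            "flag": row["flag"],
            "flag_meaning": FLAG_MEANINGS.get(row["flag"], ""),
        })
    return rows


samples = read_samples(RAW)
# Only real measurements go into the summary; the sentinel never does.
valid = [r["temp_c"] for r in samples if r["temp_c"] is not None]
mean_temp = sum(valid) / len(valid)
print(f"{len(valid)} usable values, mean {mean_temp:.2f}")
```

Converting the sentinel to `None` at read time keeps 9999 from leaking into averages, while the flag column preserves the reason a cell is empty or estimated.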
Create parameter table
Create a site table
From doi:10.3334/ORNLDAAC/777
From doi:10.3334/ORNLDAAC/777
From R Cook, ESA Best Practices Workshop 2010
Quick on the draw: Clickety-click and you’re ready to fire
Always there in time: Everyone has Excel
Smarter than he lets on: Stats, Pivot tables, VB scripts
Cleans up real pretty: Graphics, fonts, colors, borders
From Mark Schildhauer
2. Data collection & organization
SPREADSHEETS: THE GOOD
From Mark Schildhauer
2. Data collection & organization
Shoot first ask later: Click&fire Click&fire Click&fire
No scruples: Delete row, click&fire, ctrl-x/ctrl-c, click&fire, re-sort, save
Talks a good story but not much education: Stats
SPREADSHEETS: THE BAD
Ill-mannered: Takes data prisoner; conflates raw and summary data
Gaudy: Use of visual cues as metadata: color, font, border
Shifty: Cross-linking worksheets sets up “invisible” dependencies
Shiftless: No provenance
The more complicated your spreadsheet, the uglier it gets for use with other software
From Mark Schildhauer
2. Data collection & organization
SPREADSHEETS: THE UGLY
2. Data collection & organization
All of the things that make Excel great for data are bad for archiving!
1. Create archive-ready raw data
2. Put it somewhere special
3. Have your fun with fancy Excel techniques
4. Keep archiving in mind
A relational database is
• A set of tables
• Relationships among the tables
• A language to specify & query the tables
2. Data collection & organization
From Mark Schildhauer
What about databases?
Sample sites: *siteID, site_name, latitude, longitude, description
Species: *speciesID, species_name, common_name, family, order
Samples: *sampleID, siteID, sample_date, speciesID, height, flowering, flag, comments
* Denotes the primary key
2. Data collection & organization
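The three tables above could be declared roughly as follows, using Python's built-in `sqlite3` module. This is a sketch under assumptions: the column types, the `NOT NULL` choices, and all inserted values are made up for illustration; only the table and column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE sites (
    siteID      INTEGER PRIMARY KEY,
    site_name   TEXT NOT NULL,
    latitude    REAL,
    longitude   REAL,
    description TEXT
);
CREATE TABLE species (
    speciesID    INTEGER PRIMARY KEY,
    species_name TEXT NOT NULL,
    common_name  TEXT,
    family       TEXT,
    "order"      TEXT            -- quoted because ORDER is an SQL keyword
);
CREATE TABLE samples (
    sampleID    INTEGER PRIMARY KEY,
    siteID      INTEGER NOT NULL REFERENCES sites(siteID),
    sample_date TEXT,
    speciesID   INTEGER NOT NULL REFERENCES species(speciesID),
    height      REAL,
    flowering   INTEGER,
    flag        TEXT,
    comments    TEXT
);
""")

# Made-up example rows.
conn.execute("INSERT INTO sites VALUES (1, 'Example Lake', 47.0, -122.0, 'made-up site')")
conn.execute("INSERT INTO species VALUES (1, 'Examplea fictiva', 'example plant', NULL, NULL)")
conn.execute("INSERT INTO samples VALUES (1, 1, '2010-06-01', 1, 12.5, 1, '', '')")

# A sample that points at a site that does not exist is rejected.
try:
    conn.execute("INSERT INTO samples VALUES (2, 99, '2010-06-02', 1, 10.0, 0, '', '')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The relationships let one query pull site and species names per sample.
rows = conn.execute("""
    SELECT s.sampleID, st.site_name, sp.species_name, s.height
    FROM samples AS s
    JOIN sites AS st ON st.siteID = s.siteID
    JOIN species AS sp ON sp.speciesID = s.speciesID
""").fetchall()
print(rejected, rows)
```

Site and species details are typed once and referenced by ID from each sample, which is where the reduced redundancy and fewer data entry errors come from.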
From Mark Schildhauer
Databases often enforce good practice
Must define
• Tables
• Attributes
• Relationships (constraints)
Databases provide:
• Scalability: millions+ records
• Features for sub-setting, querying, sorting
• Scripted language: SQL
• Reduced redundancy & potential data entry errors
2. Data collection & organization
From Mark Schildhauer
A B C
1 2 3
4 5 6
7 8 9
D E
10 11
12 13
14 15
16 17
Spreadsheets
• Good for simple, self-contained charts, graphs, calculations
• Handy for collecting raw data
• Flexible cell content type
But…
• Hard to subset or sort
• Lack “record” integrity: can sort a column independently of all others
• Harder to maintain as complexity and size of data grow
Databases
• Work well with lots of data
• Easy to query and subset data
• Data fields are constrained
• Columns cannot be sorted independently of each other
• Normalization reduces data entry and potential for error
But…
• More to learn
• Harder to use
2. Data collection & organization
From Mark Schildhauer
You should invest time in learning databases if
• your data sets are large or complex
Consider investing time in learning databases if
• your data are small and humble
• you ever intend to share your data
• you are < 30 years old
2. Data collection & organization
From Mark Schildhauer
Use descriptive file names
PhDcomics.com
2. Data collection & organization
Use descriptive file names
• Unique
• Reflect contents
From R Cook, ESA Best Practices Workshop 2010
Bad: Mydata.xls, 2001_data.csv, best version.txt
Better: Eaffinis_nanaimo_2010_counts.xls
Eaffinis = Study organism; nanaimo = Site name; 2010 = Year; counts = What was measured
2. Data collection & organization
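One way to keep names consistent is to build them from their parts rather than typing them by hand. The helper below is a hypothetical sketch, not part of any tool mentioned in this talk; it simply reproduces the organism_site_year_measurement pattern of the "Better" example.

```python
import re


def descriptive_name(organism, site, year, measured, ext="csv"):
    """Join the four descriptive parts with underscores.

    Characters other than letters and digits are stripped from each part
    so that the underscore stays an unambiguous separator.
    """
    parts = [organism, site, str(year), measured]
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "", part) for part in parts]
    if not all(cleaned):
        raise ValueError("every part of the file name must be non-empty")
    return "_".join(cleaned) + "." + ext


name = descriptive_name("Eaffinis", "nanaimo", 2010, "counts", ext="xls")
print(name)

# Because the pattern is fixed, the parts can be recovered later.
stem = name.rsplit(".", 1)[0]
organism, site, year, measured = stem.split("_")
```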
*Not for everyone
*
Organize files logically
Biodiversity
Lake
Experiments
Field work
Grassland
Biodiv_H20_heatExp_2005to2008.csv Biodiv_H20_predatorExp_2001to2003.csv … Biodiv_H20_PlanktonCount_2001toActive.csv Biodiv_H20_ChlAprofiles_2003.csv …
From S. Hampton
2. Data collection & organization
Preserve information
• Keep raw data raw
• Use scripts to process data & save them with data
Raw data as .csv
R script for processing & analysis
2. Data collection & organization
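The slide pairs a raw .csv with an R script; the same idea is sketched here in Python, as an assumption-laden example rather than the presenter's own workflow. The folder layout (`raw/`, `processed/`), the `count` column, and the data values are hypothetical. The script reads the raw file, writes a cleaned copy elsewhere, and checks that the raw file is byte-for-byte unchanged.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def sha256(path):
    """Fingerprint a file so any change to it can be detected."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def process(raw_path, out_path):
    """Copy rows with a usable count to a new file; never edit the raw file."""
    kept = 0
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["count"] not in ("", "NA"):
                writer.writerow(row)
                kept += 1
    return kept


# Hypothetical project folder with made-up data.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw" / "Eaffinis_nanaimo_2010_counts.csv"
raw.parent.mkdir()
raw.write_text("sample_id,count\nS01,14\nS02,NA\nS03,9\n")

before = sha256(raw)
processed = workdir / "processed" / raw.name
processed.parent.mkdir()
kept = process(raw, processed)
after = sha256(raw)
print(kept, before == after)
```

Saving a script like this next to the data means the processed file can always be regenerated from the untouched raw file.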
Before data collection
• Define & enforce standards
• Assign responsibility for data quality
3. Quality control and quality assurance
From Flickr by StacieBee
During data collection/entry
• Minimize manual entry
• Use double entry
• Use text-to-speech program to read data back
• Use a database
• Document changes
3. Quality control and quality assurance
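Double entry means two people (or one person twice) type the same sheet, and the two versions are compared cell by cell. A minimal sketch, assuming both entries are already loaded as lists of dictionaries; the function name and the mistyped value in the second entry are hypothetical.

```python
def compare_entries(first, second):
    """Return (row index, column, first value, second value) for each mismatch."""
    if len(first) != len(second):
        raise ValueError("the two entries have different numbers of rows")
    mismatches = []
    for index, (a, b) in enumerate(zip(first, second)):
        for column in a:
            if a[column] != b.get(column):
                mismatches.append((index, column, a[column], b.get(column)))
    return mismatches


# Same two samples typed twice; the second pass transposes two digits.
entry_1 = [
    {"sample_id": "ALG01", "weight_mg": "3.05"},
    {"sample_id": "ALG03", "weight_mg": "2.91"},
]
entry_2 = [
    {"sample_id": "ALG01", "weight_mg": "3.05"},
    {"sample_id": "ALG03", "weight_mg": "2.19"},
]

differences = compare_entries(entry_1, entry_2)
print(differences)
```

Each reported mismatch is then checked against the original field sheet to decide which entry is correct.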
From Flickr by schock
After data entry
• Check for missing, impossible, anomalous values
• Perform statistical summaries
• Look for outliers
• Normal probability plots
• Regression
• Scatter plots
• Maps
3. Quality control and quality assurance
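The first check on this slide (missing, impossible, anomalous values) can be automated before any plotting. The sketch below is one possible approach, not the method used in the talk: the plausible range, the cutoff `k`, and the data are assumptions, and "anomalous" is defined here as lying more than `k` median absolute deviations from the median.

```python
from statistics import median


def qc_report(values, lo, hi, k=5.0):
    """Return the positions of missing, impossible, and anomalous values.

    missing    : value is None
    impossible : value lies outside the plausible range [lo, hi]
    anomalous  : plausible, but more than k median absolute deviations
                 away from the median of the plausible values
    """
    missing = [i for i, v in enumerate(values) if v is None]
    impossible = [i for i, v in enumerate(values)
                  if v is not None and not lo <= v <= hi]
    plausible = [(i, v) for i, v in enumerate(values)
                 if v is not None and lo <= v <= hi]
    centre = median(v for _, v in plausible)
    mad = median(abs(v - centre) for _, v in plausible)
    anomalous = [i for i, v in plausible if abs(v - centre) > k * mad]
    return {"missing": missing, "impossible": impossible, "anomalous": anomalous}


# Made-up water temperatures in degrees C; plausible range assumed to be -2 to 40.
temperatures = [12.1, 11.8, None, 12.4, 12.0, 55.0, 11.9, 19.5]
report = qc_report(temperatures, lo=-2.0, hi=40.0)
print(report)
```

Flagged positions are candidates for review, not automatic deletions; a scatter plot or map of the same values is the natural next step.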
4. Metadata basics
What is metadata?
4. Metadata basics
Metadata = Data reporting
WHO created the data?
WHAT is the content of the data set?
WHEN was it created?
WHERE was it collected?
HOW was it developed?
WHY was it developed?
• Digital context
• Name of the data set
• The name(s) of the data file(s) in the data set
• Date the data set was last modified
• Example data file records for each data type file
• Pertinent companion files
• List of related or ancillary data sets
• Software (including version number) used to prepare/read the data set
• Data processing that was performed
• Personnel & stakeholders
• Who collected
• Who to contact with questions
• Funders
• Scientific context
• Scientific reason why the data were collected
• What data were collected
• What instruments (including model & serial number) were used
• Environmental conditions during collection
• Where collected & spatial resolution
• When collected & temporal resolution
• Standards or calibrations used
• Information about parameters
• How each was measured or produced
• Units of measure
• Format used in the data set
• Precision & accuracy if known
• Information about data
• Definitions of codes used
• Quality assurance & control measures
• Known problems that limit data use (e.g. uncertainty, sampling problems)
• How to cite the data set
4. Metadata basics
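Even before choosing a formal standard, the who/what/when/where/how/why questions can be captured in a small machine-readable file stored next to the data. The sketch below uses plain JSON; it is not EML, FGDC, or any other standard, and every field value is a placeholder.

```python
import json

# The six questions from the "Metadata = Data reporting" slide.
REQUIRED = ("who", "what", "when", "where", "how", "why")


def missing_fields(record):
    """List the required questions that the record leaves unanswered."""
    return [key for key in REQUIRED if not record.get(key)]


# Placeholder values for a hypothetical data set.
metadata = {
    "dataset_name": "Eaffinis_nanaimo_2010_counts",
    "who": "name and contact details of the data collector",
    "what": "counts of the study organism per sample",
    "when": "2010",
    "where": "site name, coordinates, and spatial resolution",
    "how": "instrument, method, and units for each parameter",
    "why": "scientific reason the data were collected",
    "codes": {"M1": "missing; no sample collected"},
}

unanswered = missing_fields(metadata)
as_text = json.dumps(metadata, indent=2)
print(unanswered)
```

A record like this is a starting point that can later be mapped onto a formal standard with a dedicated metadata tool.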
• Provides structure to describe data
Common terms | definitions | language | structure
4. Metadata basics
• Lots of different standards: EML, FGDC, ISO19115, DarwinCore,…
• Tools for creating metadata files: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
What is metadata?
Select the appropriate metadata standard
What is a metadata standard?
What does metadata look like?
4. Metadata basics
Temperature data
Salinity data
Data import into R
Analysis: mean, SD
Graph production
Quality control & data cleaning
“Clean” T & S data
Summary statistics
Data in R format
5. Workflows
Workflow: how you get from the raw data to the final products of your research
Simple workflows: flow charts
• R, SAS, MATLAB
• Well-documented code is…
Easier to review
Easier to share
Easier to repeat analysis
5. Workflows
Workflow: how you get from the raw data to the final products of your research
Simple workflows: commented scripts
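The temperature and salinity flow chart can be written as a short commented script. The slides use R; this sketch uses Python with made-up station data, and it leaves out the graph production step.

```python
import csv
import io
from statistics import mean, stdev

# Made-up raw data; "NA" marks a value that was not recorded.
RAW = """station,temperature_c,salinity_psu
A,12.0,33.0
B,13.0,34.0
C,NA,35.0
D,14.0,35.0
"""


def load(text):
    """Step 1: import the raw temperature and salinity data."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Step 2: quality control; drop rows with a non-numeric T or S value."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({
                "station": row["station"],
                "temperature_c": float(row["temperature_c"]),
                "salinity_psu": float(row["salinity_psu"]),
            })
        except ValueError:
            continue  # skipped rows should be noted in the processing log
    return cleaned


def summarize(rows, column):
    """Step 3: analysis; mean and standard deviation of one column."""
    values = [row[column] for row in rows]
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


clean_rows = clean(load(RAW))
temp_summary = summarize(clean_rows, "temperature_c")
sal_summary = summarize(clean_rows, "salinity_psu")
print(temp_summary, sal_summary)
```

Each function corresponds to one box in the flow chart, so the script doubles as documentation of the workflow.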
Fancy Schmancy workflows: Kepler Resulting output
5. Workflows
https://kepler-project.org
Workflows enable
Reproducibility: can someone independently validate findings?
Transparency: others can understand how you arrived at your results
Executability: others can re-run or re-use your analysis
5. Workflows
From Flickr by merlinprincesse
Minimally: document your analysis
commented code; simple flow-chart
Emerging workflow applications will…
− Link software for executable end-to-end analysis
− Provide detailed info about data & analysis
− Facilitate re-use & refinement of complex, multi-step analyses
− Enable efficient swapping of alternative models & algorithms
− Help automate tedious tasks
5. Workflows
www.littlebytesoflife.com
The 20-Year Rule
The metadata accompanying a data set should be written for a user 20 years into the future
6. Data stewardship & reuse
(National Research Council 1991)
From Flickr by greensambaman
Use stable formats: csv, txt, tiff
Create back-up copies: original, near, far
Periodically test ability to restore information
6. Data stewardship & reuse
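Testing the ability to restore can start with a checksum comparison between the original and each back-up copy. The sketch below is an illustration only: the `near` and `far` folders stand in for, say, a local drive and an off-site server, and all paths and file contents are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksum(path):
    """SHA-256 fingerprint of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def back_up(original, destinations):
    """Copy the original file into each destination folder."""
    copies = []
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copies.append(Path(shutil.copy2(original, folder)))
    return copies


def verify(original, copies):
    """Report, per copy, whether it still matches the original."""
    expected = checksum(original)
    return {str(copy): checksum(copy) == expected for copy in copies}


# Hypothetical layout: original, near, far.
workdir = Path(tempfile.mkdtemp())
original = workdir / "original" / "counts.csv"
original.parent.mkdir()
original.write_text("sample_id,count\nS01,14\n")

copies = back_up(original, [workdir / "near", workdir / "far"])
results = verify(original, copies)
print(results)
```

Running the verification on a schedule catches silent corruption while a good copy still exists somewhere.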
Modified from R. Cook
Store your data in a repository
Institutional archive
Discipline/specialty archive
DataCite list of repositories: www.datacite.org/repolist
6. Data stewardship & reuse
From Flickr by torkildr
Allows readers to find data products
Get credit for data and publications
Promotes reproducibility
Better measure of research impact
Modified from R. Cook
6. Data stewardship & reuse
Data Citation
Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
Learn more at www.datacite.org
1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
7. Planning & data management plans in particular
Best Practices for Data Management
A document that describes what you will do with your data during your research and after you complete your research
What is a data management plan?
1. Planning
Data Hangover
Saves time
Increases efficiency
Easier to use data
Others can understand & use data
Credit for data products
Funders require it
1. Planning
Why should I prepare a DMP?
DMP supplement may include:
1. the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
2. the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)
3. policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
4. policies and provisions for re-use, re-distribution, and the production of derivatives
5. plans for archiving data, samples, and other research products, and for preservation of access to them
NSF DMP Requirements
From Grant Proposal Guidelines:
• Types of data produced
• Relationship to existing data
• How/when/where will the data be captured or created?
• How will the data be processed?
• Quality assurance & quality control measures
• Security: version control, backing up
• Who will be responsible for data management during/after project?
1. Types of data & other information
biology.kenyon.edu
C. Strasser
From Flickr by Lazurite
Wired.com
• What metadata are needed to make the data meaningful?
• How will you create or capture these metadata?
• Why have you chosen particular standards and approaches for metadata?
2. Data & metadata standards
• Are you under any obligation to share data?
• How, when, & where will you make the data available?
• What is the process for gaining access to the data?
• Who owns the copyright and/or intellectual property?
• Will you retain rights before opening data to wider use? How long?
• Are permission restrictions necessary?
• Embargo periods for political/commercial/patent reasons?
• Ethical and privacy issues?
• Who are the foreseeable data users?
• How should your data be cited?
3. Policies for access & sharing
4. Policies for re-use & re-distribution
• What data will be preserved for the long term? For how long?
• Where will data be preserved?
• What data transformations need to occur before preservation?
5. Plans for archiving & preservation
From Flickr by theManWhoSurfedTooMuch
• What metadata will be submitted alongside the datasets?
• Who will be responsible for preparing data for preservation? Who will be the main contact person for the archived data?
Don’t forget: Budget
• Costs of data preparation & documentation
– Hardware, software
– Personnel
– Archive fees
• How costs will be paid
– Request funding!
dorrvs.com
NSF’s Vision*
DMPs and their evaluation will grow & change over time (similar to broader impacts)
Peer review will determine next steps
Community-driven guidelines
– Different disciplines have different definitions of acceptable data sharing
– Flexibility at the directorate and division levels
– Tailor implementation of DMP requirement
Evaluation will vary with directorate, division, & program officer
*Unofficially
Help from Jennifer Schopf, NSF
E-notebooks & online science
• NoteBook
• ORNL eNote
• Evernote
• Google Docs
• Blogs
• wikis
• TheLabNotebook.com
• NoteBookMaker
TheLabNotebook.com
Step-‐by-‐step wizard for generating DMP
Create | edit | re-‐use | share | save | generate
Open to community
Links to institutional resources
Directorate information & updates
DMPTool: dmp.cdlib.org
CDL Services for UC Community
www.cdlib.org/services/uc3
Where should I put my data?
Data Repository Deposit | Manage | Share | Preserve
• Precise identification of a dataset
• Credit to data producers and data publishers
• A link from the traditional literature to the data
• Research metrics for datasets
Create & manage persistent identifiers
Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
• Open source add-‐in
• Facilitate data management, sharing, archiving for scientists
• Focus on atmospheric, ecological, hydrological, and oceanographic data
• Collecting requirements for add-‐in from scientists, data centers, libraries
Funders: Gordon and Betty Moore Foundation, Microsoft Research
Why are you promoting Excel?
Everyone uses it
Stopgap measure
• Data Education Tutorials
• Database of best practices & software tools
• Links to DMPTool
• Primer on data management
www.dataone.org
dcxl.cdlib.org
Data Management 101
www.carlystrasser.net
Resources
Slideshare link: this presentation
Best Practices for Preparing Environmental Data Sets to Share and Archive. September 2010. Hook, Santhana Vannan, Beaty, Cook, & Wilson http://daac.ornl.gov/PI/BestPractices-2010.pdf
Some Simple Guidelines for Effective Data Management. Borer, Seabloom, Jones, & Schildhauer. Bull Ecol Soc Amer, April 2009: 205-214.
Handy References
1. Take stock
2. Take a time machine
3. Break it down
4. Get smart
Where to begin?
Getting down & dirty with your data
www.catfishingtipstoday.com
• What data do you have?
• What data are you still generating?
• What does your workflow look like?
• Are you backing up?
• How’s your filing system?
• Etc…
1. Take stock
From Flickr by charlie llewellin
Knowing what you know now, how would you plan for this project?
– File structures
– Metadata generation
– Naming conventions
Consider writing up a formal data management plan
2. Take a time machine
From Flickr by F1RSTBORN
You now have a vision. Break into manageable chunks
– Set a final deadline
– Set intermediate deadlines
– Break down tasks to meet those deadlines
– Be reasonable
3. Break it down
From www.gonomad.com
From www.collegehumor.com
Learn from mistakes
Plan better next time
Remember: good data management takes
Time
Thoughtfulness
Planning
Resources
4. Get smart
static.tvtropes.org
dcxl.cdlib.org @dcxlCDL www.facebook.com/DCXLatCDL
www.carlystrasser.net [email protected]
@carlystrasser