uc santa cruz: data management for scientists

Data Management for Scientists Carly Strasser, PhD California Digital Library, UC Office of the President [email protected] www.carlystrasser.net Reduce your workload Reuse your ideas Recycle your data From Flickr by Mark McLaughlin UC Santa Cruz February 2012

Upload: carly-strasser

Post on 17-Jan-2015

DESCRIPTION

28 Feb 2012

TRANSCRIPT

Page 1: UC Santa Cruz: Data Management for Scientists

Data  Management  for  Scientists    

Carly  Strasser,  PhD  California  Digital  Library,  UC  Office  of  the  President  

[email protected]  www.carlystrasser.net  

Reduce  your  workload  Reuse  your  ideas  Recycle  your  data    

From  Flickr  by  Mark  McLaughlin    

UC  Santa  Cruz  February  2012  

Page 2: UC Santa Cruz: Data Management for Scientists

Roadmap

1. Background

2. Data management landscape

3. How to improve

4. Toolbox

Page 3: UC Santa Cruz: Data Management for Scientists

NSF  funded  DataNet  Project  Office  of  Cyberinfrastructure  

Page 4: UC Santa Cruz: Data Management for Scientists

[Figure: nodes A, B, and C, shown pre-DataONE and with DataONE]

Page 5: UC Santa Cruz: Data Management for Scientists

NSF  funded  DataNet  Project  Office  of  Cyberinfrastructure  

Community  Engagement  &  

Outreach  

Courtesy  of  DataONE  

Cyberinfrastructure  

From  Flickr  by  wetwebwork  

Page 6: UC Santa Cruz: Data Management for Scientists

Is  data  management  being  taught?  Do  attitudes  about  

sharing  differ  among  disciplines?  

What  role  can  libraries  play  in  data  education?  

How  can  we  promote  storing  data  in  repositories?  

What  barriers  to  sharing  can  we  eliminate?  

Why  don’t  people  share  data?  

Page 7: UC Santa Cruz: Data Management for Scientists
Page 8: UC Santa Cruz: Data Management for Scientists

Roadmap

1. Background

2. Data management landscape

3. How to improve

4. Toolbox

Page 9: UC Santa Cruz: Data Management for Scientists

Digital data

From Flickr by Flickmor

From Flickr by US Army Environmental Command

From Flickr by DW0825

C. Strasser

Courtesy of WHOI

www.woodrow.org

From Flickr by deltaMike

Page 10: UC Santa Cruz: Data Management for Scientists

Digital  data  +    

Complex  analyses  

Page 11: UC Santa Cruz: Data Management for Scientists

Data  

Maximum  Likelihood  estimation  

Matrix  Models  

Models  

Images   Tables   Paper  

Page 12: UC Santa Cruz: Data Management for Scientists

Data  

Maximum  Likelihood  estimation  

Matrix  Models  

Images   Tables   Paper  

Models  

Page 13: UC Santa Cruz: Data Management for Scientists

UGLY TRUTH

Many Earth | Environmental | Ecological scientists…

• are not taught data management
• don’t know what metadata are
• can’t name data centers or repositories
• don’t share data publicly or store it in an archive
• aren’t convinced they should share data

5shortessays.blogspot.com

Page 14: UC Santa Cruz: Data Management for Scientists

Data  Hangover    

From  Flickr  by  SteveMcN  

What  happened?  

Page 15: UC Santa Cruz: Data Management for Scientists

Where  data  end  up  

Data  

Metadata  

Recreated  from  Klump  et  al.  2006  

blog.order2disorder.com  

From Flickr by csessums

From Flickr by diylibrarian

Page 16: UC Santa Cruz: Data Management for Scientists

Who  cares?    

www.rba.gov.au  

From Flickr by Redden-McAllister

From  Flickr  by  AJC1  

Page 17: UC Santa Cruz: Data Management for Scientists

Data  

Metadata  

Recreated  from  Klump  et  al.  2006  


Where  data  end  up  

From  Flickr  by  torkildr  

From  Flickr  by  diylibrarian  


Page 18: UC Santa Cruz: Data Management for Scientists

Data  Management  

Data  Reuse  

Data  Sharing  

Page 19: UC Santa Cruz: Data Management for Scientists

Trends  in  Data  Archiving  

Journal  publishers  Joint  Data  Archiving  Agreement  

Page 20: UC Santa Cruz: Data Management for Scientists

Trends  in  Data  Archiving  

Journal publishers: Joint Data Archiving Agreement

Data papers etc.: Ecological Archives, Beyond the PDF

Funders: Data management requirements

Page 21: UC Santa Cruz: Data Management for Scientists

Roadmap

1. Background

2. Data management landscape

3. Best practices

4. Toolbox

Page 22: UC Santa Cruz: Data Management for Scientists

Best  Practices  for  Data  Management  

1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse

Page 23: UC Santa Cruz: Data Management for Scientists

Best Practices for Data Management

1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
7. Planning

Page 24: UC Santa Cruz: Data Management for Scientists

C:\Documents and Settings\hampton\My Documents\NCEAS Distributed Graduate Seminars\[Wash Cres Lake Dec 15 Dont_Use.xls]Sheet1

Stable Isotope Data Sheet

Wash Cresc Lake Peter's lab Don't use - old data Algal Washed Rocks Dec. 16 Tray 004

SD for delta 13C = 0.07 SD for delta 15N = 0.15

Position SampleID Weight (mg) %C delta 13C delta 13C_ca %N delta 15N delta 15N_ca Spec. No.
A1 ref 0.98 38.27 -25.05 -24.59 1.96 4.12 3.47 25354
A2 ref 0.98 39.78 -25.00 -24.54 2.03 4.01 3.36 25356
A3 ref 0.98 40.37 -24.99 -24.53 2.04 4.09 3.44 25358
A4 ref 1.01 42.23 -25.06 -24.60 2.17 4.20 3.55 25360 Shore Avg Con
A5 ALG01 3.05 1.88 -24.34 -23.88 0.17 -1.65 -2.30 25362 c -1.26 -27.22
A6 Lk Outlet Alg 3.06 31.55 -30.17 -29.71 0.92 0.87 0.22 25364 1.26 0.32
A7 ALG03 2.91 6.85 -21.11 -20.65 0.48 -0.97 -1.62 25366 c
A8 ALG05 2.91 35.56 -28.05 -27.59 2.30 0.59 -0.06 25368
A9 ALG07 3.04 33.49 -29.56 -29.10 1.68 0.79 0.14 25370
A10 ALG06 2.95 41.17 -27.32 -26.86 1.97 2.71 2.06 25372
B1 ALG04 3.01 43.74 -27.50 -27.04 1.36 0.99 0.34 25374 c
B2 ALG02 3 4.51 -22.68 -22.22 0.34 4.31 3.66 25376
B3 ALG01 2.99 1.59 -24.58 -24.12 0.15 -1.69 -2.34 25378 c
B4 ALG03 2.92 4.37 -21.06 -20.60 0.34 -1.52 -2.17 25380 c
B5 ALG07 2.9 33.58 -29.44 -28.98 1.74 0.62 -0.03 25382
B6 ref 1.01 44.94 -25.00 -24.54 2.59 3.96 3.31 25384
B7 ref 0.99 42.28 -24.87 -24.41 2.37 4.33 3.68 25386
B8 Lk Outlet Alg 3.04 31.43 -29.69 -29.23 1.07 0.95 0.30 25388
B9 ALG06 3.09 35.57 -27.26 -26.80 1.96 2.79 2.14 25390
B10 ALG02 3.05 5.52 -22.31 -21.85 0.45 4.72 4.07 25392
C1 ALG04 2.98 37.90 -27.42 -26.96 1.36 1.21 0.56 25394 c
C2 ALG05 3.04 31.74 -27.93 -27.47 2.40 0.73 0.08 25396
C3 ref 0.99 38.46 -25.09 -24.63 2.40 4.37 3.72 25398

23.78 1.17

Reference statistics:

Sampling Site / Identifier:Sample Type:

Date:Tray ID and Sequence:

From  Stephanie  Hampton  (2010)      ESA  Workshop  on  Best  Practices  

2  tables   Random  notes  

Page 25: UC Santa Cruz: Data Management for Scientists

[Same Stable Isotope Data Sheet spreadsheet as on the previous slide]

From  Stephanie  Hampton  (2010)      ESA  Workshop  on  Best  Practices  

Wash  Cres  Lake  Dec  15  Dont_Use.xls  

Page 26: UC Santa Cruz: Data Management for Scientists

[Same Stable Isotope Data Sheet spreadsheet as on the previous slides, now with Excel regression output pasted beside the data rows:]

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.283158
R Square 0.080178
Adjusted R Square -0.022024
Standard Error 1.906378
Observations 11

ANOVA
df SS MS F Significance F
Regression 1 2.851116 2.851116 0.784507 0.398813
Residual 9 32.7085 3.634278
Total 10 35.55962

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept -4.297428 4.671099 -0.920003 0.381568 -14.8642 6.269341 -14.8642 6.269341
X Variable 1 -0.158022 0.17841 -0.885724 0.398813 -0.561612 0.245569 -0.561612 0.245569

Random  stats  output  

Page 27: UC Santa Cruz: Data Management for Scientists


[Same spreadsheet and regression output as on the previous slide, with a transposed copy of part of the data and a chart added:]

SampleID ALG03 ALG05 ALG07 ALG06 ALG04 ALG02 ALG01 ALG03 ALG07
Weight (mg) 2.91 2.91 3.04 2.95 3.01 3 2.99 2.92 2.9
%C 6.85 35.56 33.49 41.17 43.74 4.51 1.59 4.37 33.58
delta 13C -21.11 -28.05 -29.56 -27.32 -27.50 -22.68 -24.58 -21.06 -29.44
delta 13C_ca -20.65 -27.59 -29.10 -26.86 -27.04 -22.22 -24.12 -20.60 -28.98
%N 0.48 2.30 1.68 1.97 1.36 0.34 0.15 0.34 1.74
delta 15N -0.97 0.59 0.79 2.71 0.99 4.31 -1.69 -1.52 0.62
delta 15N_ca -1.62 -0.06 0.14 2.06 0.34 3.66 -2.34 -2.17 -0.03

[Chart: Series1; axis ticks from -35.00 to 0.00 and from -3.00 to 4.00]

Page 28: UC Santa Cruz: Data Management for Scientists

2. Data collection & organization

Create unique identifiers
• Decide on naming scheme early
• Create a key
• Different for each sample

From Flickr by sjbresnahan

From Flickr by zebbie
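A minimal sketch of these points in Python. The `SITE-YEAR-NNN` scheme and both function names are illustrative assumptions, not something the talk prescribes. The idea is to generate identifiers from one fixed scheme and to check an existing ID column for repeats.

```python
from collections import Counter


def make_sample_ids(site_code, year, n_samples):
    """Build IDs like 'NAN-2010-001' (hypothetical naming scheme)."""
    return [f"{site_code}-{year}-{i:03d}" for i in range(1, n_samples + 1)]


def find_duplicates(ids):
    """Return the sorted list of identifiers that occur more than once."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)
```

Running `find_duplicates` on the SampleID column of the stable isotope spreadsheet shown earlier would report ALG01 and ALG03, among others, because those labels are reused across rows.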

Page 29: UC Santa Cruz: Data Management for Scientists

2. Data collection & organization

Standardize
• Consistent within columns: only numbers, dates, or text
• Consistent names, codes, formats

Modified from K. Vanderbilt

From Pink Floyd, The Wall

themurkyfringe.com
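One way to check "consistent within columns" is sketched below in Python. It assumes every cell arrives as a string and that dates use the ISO `YYYY-MM-DD` form; both assumptions, and the function names, are illustrative rather than taken from the talk.

```python
from datetime import datetime


def kind_of(value):
    """Classify a cell as 'number', 'date' (ISO YYYY-MM-DD), or 'text'."""
    try:
        float(value)
        return "number"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"


def mixed_columns(rows):
    """Map each column that mixes kinds to the sorted kinds found in it."""
    kinds = {}
    for row in rows:
        for column, value in row.items():
            kinds.setdefault(column, set()).add(kind_of(value))
    return {c: sorted(k) for c, k in kinds.items() if len(k) > 1}
```

A column that holds "12.5" in one row and "tall" in another would be reported as mixing "number" and "text".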

Page 30: UC Santa Cruz: Data Management for Scientists

Google  Docs  Forms  

Standardize  •  Reduce  possibility  of  manual  error  by  constraining  entry  choices  

Modified  from  K.  Vanderbilt    

2.  Data  collection  &  organization  

Excel lists

Data validation

Page 31: UC Santa Cruz: Data Management for Scientists

2. Data collection & organization

Identify missing data
• Numeric fields: distinct value (e.g. 9999)
• Text fields: NULL or NA
• Use data flags in a separate column to qualify empty cells

M1 = missing; no sample collected
E1 = estimated from grab sample
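The conventions above could be handled at load time roughly as follows. This is a sketch only: the column names and example rows are invented, and only the 9999 sentinel and the M1/E1 flag codes come from the slide.

```python
import csv
import io

MISSING_NUMERIC = 9999.0  # sentinel for a missing numeric value
FLAGS = {
    "M1": "missing; no sample collected",
    "E1": "estimated from grab sample",
}

# Invented example rows; the flag lives in its own column.
EXAMPLE_CSV = """site,temp_c,flag
A,12.4,
B,9999,M1
C,11.8,E1
"""


def load_records(text):
    """Parse the CSV, turning the sentinel into None and validating flags."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        value = float(row["temp_c"])
        if value == MISSING_NUMERIC:
            value = None
        flag = row["flag"] or None
        if flag is not None and flag not in FLAGS:
            raise ValueError(f"unknown flag: {flag}")
        records.append({"site": row["site"], "temp_c": value, "flag": flag})
    return records


def mean_of_present(records):
    """Average temp_c over the records that actually have a value."""
    values = [r["temp_c"] for r in records if r["temp_c"] is not None]
    return sum(values) / len(values)
```

Converting the sentinel before any arithmetic keeps 9999 from silently inflating a mean, which is the main risk of using a numeric placeholder.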

Page 32: UC Santa Cruz: Data Management for Scientists

2.  Data  collection  &  organization  

   

Create  parameter  table  Create  a  site  table  

From  doi:10.3334/ORNLDAAC/777  

From  doi:10.3334/ORNLDAAC/777  

From  R  Cook,  ESA  Best  Practices  Workshop  2010  

Page 33: UC Santa Cruz: Data Management for Scientists

Quick on the draw: Clickety-click and you’re ready to fire

Always there in time: Everyone has Excel

Smarter than he lets on: Stats, Pivot tables, VB scripts

Cleans up real pretty: Graphics, fonts, colors, borders

From  Mark  Schildhauer  

2.  Data  collection  &  organization  

SPREADSHEETS: THE GOOD

Page 34: UC Santa Cruz: Data Management for Scientists

From  Mark  Schildhauer  

2.  Data  collection  &  organization  

Shoot first ask later: Click&fire Click&fire Click&fire

No scruples: Delete row, click&fire, ctrl-x/ctrl-c, click&fire, re-sort, save

Talks a good story but not much education: Stats

SPREADSHEETS: THE BAD

Page 35: UC Santa Cruz: Data Management for Scientists

Ill-mannered: Takes data prisoner; conflates raw and summary data

Gaudy: Uses visual cues as metadata (color, font, border)

Shifty: Cross-linking worksheets sets up “invisible” dependencies

Shiftless: No provenance

The more complicated your spreadsheet, the uglier it gets for use with other software

From Mark Schildhauer

2.  Data  collection  &  organization  

SPREADSHEETS: THE UGLY

Page 36: UC Santa Cruz: Data Management for Scientists

2. Data collection & organization

All of the things that make Excel great for data are bad for archiving!

1. Create archive-ready raw data
2. Put it somewhere special
3. Have your fun with fancy Excel techniques
4. Keep archiving in mind

Page 37: UC Santa Cruz: Data Management for Scientists

What about databases?

A relational database is
• A set of tables
• Relationships among the tables
• A language to specify & query the tables

2. Data collection & organization

From Mark Schildhauer

Page 38: UC Santa Cruz: Data Management for Scientists

Sample sites: *siteID, site_name, latitude, longitude, description

Species: *speciesID, species_name, common_name, family, order

Samples: *sampleID, siteID, sample_date, speciesID, height, flowering, flag, comments

* Denotes the primary key

2.  Data  collection  &  organization  

From  Mark  Schildhauer  
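The three tables on this slide can be sketched with Python's built-in `sqlite3` module. The column names follow the slide; the column types, the inserted rows, and the helper names are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sites (
    siteID      INTEGER PRIMARY KEY,
    site_name   TEXT NOT NULL,
    latitude    REAL,
    longitude   REAL,
    description TEXT
);
CREATE TABLE species (
    speciesID    INTEGER PRIMARY KEY,
    species_name TEXT NOT NULL,
    common_name  TEXT,
    family       TEXT,
    "order"      TEXT  -- quoted because ORDER is an SQL keyword
);
CREATE TABLE samples (
    sampleID    INTEGER PRIMARY KEY,
    siteID      INTEGER NOT NULL REFERENCES sites(siteID),
    sample_date TEXT,
    speciesID   INTEGER NOT NULL REFERENCES species(speciesID),
    height      REAL,
    flowering   INTEGER,
    flag        TEXT,
    comments    TEXT
);
"""

JOIN_QUERY = """
SELECT s.sampleID, st.site_name, sp.species_name, s.height
FROM samples AS s
JOIN sites AS st ON st.siteID = s.siteID
JOIN species AS sp ON sp.speciesID = s.speciesID
ORDER BY s.sampleID
"""


def build_demo_db():
    """Create an in-memory database with one invented row per table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO sites VALUES (1, 'Lake outlet', 48.0, -123.0, 'invented example site')"
    )
    conn.execute(
        "INSERT INTO species VALUES "
        "(1, 'Achillea millefolium', 'common yarrow', 'Asteraceae', 'Asterales')"
    )
    conn.execute(
        "INSERT INTO samples VALUES (1, 1, '2010-06-01', 1, 12.5, 1, NULL, NULL)"
    )
    return conn
```

With foreign keys switched on, inserting a sample that points at a site that does not exist raises `sqlite3.IntegrityError`, which is the kind of enforced good practice a spreadsheet cannot offer.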

Page 39: UC Santa Cruz: Data Management for Scientists

Databases often enforce good practice

Must define
• Tables
• Attributes
• Relationships (constraints)

Databases provide:
• Scalability: millions+ records
• Features for sub-setting, querying, sorting
• Scripted language: SQL
• Reduced redundancy & potential data entry errors

2.  Data  collection  &  organization  

From  Mark  Schildhauer  

[Figure: two example tables, one with columns A, B, C (values 1 to 9) and one with columns D, E (values 10 to 17)]

Page 40: UC Santa Cruz: Data Management for Scientists

Spreadsheets
• Good for simple, self-contained charts, graphs, calculations
• Handy for collecting raw data
• Flexible cell content type
But…
• Hard to subset or sort
• Lack “record” integrity: can sort a column independently of all others
• Harder to maintain as complexity and size of data grows

Databases
• Work well with lots of data
• Easy to query and subset data
• Data fields are constrained
• Columns cannot be sorted independently of each other
• Normalization reduces data entry and potential for error
But…
• More to learn
• Harder to use

2.  Data  collection  &  organization  

From  Mark  Schildhauer  

Page 41: UC Santa Cruz: Data Management for Scientists

You should invest time in learning databases if
• your data sets are large or complex

Consider investing time in learning databases if
• your data are small and humble
• you ever intend to share your data
• you are < 30 years old

2.  Data  collection  &  organization  

From  Mark  Schildhauer  

Page 42: UC Santa Cruz: Data Management for Scientists

Use  descriptive  file  names  

PhDcomics.com  

2.  Data  collection  &  organization  

Page 43: UC Santa Cruz: Data Management for Scientists

Use descriptive file names
• Unique
• Reflect contents

From R Cook, ESA Best Practices Workshop 2010

Bad: Mydata.xls, 2001_data.csv, best version.txt

Better: Eaffinis_nanaimo_2010_counts.xls

(Study organism, site name, year, what was measured)

2.  Data  collection  &  organization  

*Not  for  everyone  

*  
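A small Python sketch of the naming pattern from the "Better" example (study organism, site name, year, what was measured). The function name and the rule that each part may contain only letters, digits, or hyphens are assumptions added for illustration.

```python
import re

_PART = re.compile(r"[A-Za-z0-9-]+")


def build_filename(organism, site, year, measurement, extension):
    """Join the four descriptive parts with underscores and add the extension."""
    parts = [organism, site, str(year), measurement]
    for part in parts:
        if not _PART.fullmatch(part):
            raise ValueError(f"use only letters, digits, or hyphens: {part!r}")
    return "_".join(parts) + "." + extension
```

`build_filename("Eaffinis", "nanaimo", 2010, "counts", "xls")` gives the slide's example name, while a part such as "best version" is rejected because of the space.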

Page 44: UC Santa Cruz: Data Management for Scientists

Organize  files    logically  

Biodiversity  

Lake  

Experiments  

Field  work  

Grassland  

Biodiv_H20_heatExp_2005to2008.csv
Biodiv_H20_predatorExp_2001to2003.csv
…
Biodiv_H20_PlanktonCount_2001toActive.csv
Biodiv_H20_ChlAprofiles_2003.csv
…

From  S.  Hampton  

2.  Data  collection  &  organization  

Page 45: UC Santa Cruz: Data Management for Scientists

 Preserve  information  •  Keep  raw  data  raw  

•  Use  scripts  to  process  data      &  save  them  with  data  

Raw  data  as  .csv  

R  script  for  processing  &  analysis  

2.  Data  collection  &  organization  
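The slide pairs a raw .csv with an R script; the same idea is sketched here in Python. The `count` column, the file names, and the cleaning rule (drop rows with an empty count) are invented for illustration. The point is that the script writes a new file and refuses to overwrite the raw one.

```python
import csv
import tempfile
from pathlib import Path


def process_raw(raw_path, out_path):
    """Copy rows with a non-empty 'count' to out_path; never modify raw_path."""
    raw_path, out_path = Path(raw_path), Path(out_path)
    if raw_path.resolve() == out_path.resolve():
        raise ValueError("refusing to overwrite the raw file")
    kept = 0
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["count"].strip() == "":
                continue  # skip rows with no measurement
            writer.writerow(row)
            kept += 1
    return kept


def demo():
    """Run process_raw on an invented file in a temporary directory."""
    raw_text = "site,count\nA,3\nB,\nC,5\n"
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw_counts.csv"
        out = Path(tmp) / "processed_counts.csv"
        raw.write_text(raw_text)
        kept = process_raw(raw, out)
        raw_unchanged = raw.read_text() == raw_text
        return kept, raw_unchanged, out.read_text().splitlines()
```

Saving a script like this next to the data records exactly how the processed file was derived from the raw one.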

Page 46: UC Santa Cruz: Data Management for Scientists

Best Practices for Data Management

1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
7. Planning

Page 47: UC Santa Cruz: Data Management for Scientists

Before  data  collection  •  Define  &  enforce  standards  •  Assign  responsibility  for  data  quality  

3.  Quality  control  and  quality  assurance  

From Flickr by StacieBee

Page 48: UC Santa Cruz: Data Management for Scientists

During data collection/entry
• Minimize manual entry
• Use double entry
• Use text-to-speech program to read data back
• Use a database
• Document changes

3.  Quality  control  and  quality  assurance  

From Flickr by schock
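Double entry means keying the same data twice and comparing the two versions. A minimal Python sketch follows, assuming each entry pass is a list of row dictionaries in the same order; the function name and data layout are illustrative.

```python
def double_entry_mismatches(entry_a, entry_b):
    """Return (row_index, column, value_a, value_b) for every cell that differs."""
    if len(entry_a) != len(entry_b):
        raise ValueError("the two entry passes have different numbers of rows")
    mismatches = []
    for index, (row_a, row_b) in enumerate(zip(entry_a, entry_b)):
        for column in sorted(set(row_a) | set(row_b)):
            if row_a.get(column) != row_b.get(column):
                mismatches.append((index, column, row_a.get(column), row_b.get(column)))
    return mismatches
```

Each reported mismatch points at a cell to re-check against the original field sheet; an empty list means the two passes agree.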

Page 49: UC Santa Cruz: Data Management for Scientists

After data entry
• Check for missing, impossible, anomalous values
• Perform statistical summaries
• Look for outliers
  • Normal probability plots
  • Regression
  • Scatter plots
  • Maps

 

3.  Quality  control  and  quality  assurance  

[Figure: plot with x-axis ticks from 0 to 40 and y-axis ticks from 0 to 60]
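The checks listed on this slide can be sketched with Python's standard `statistics` module. The 1.5 × IQR fence used for outliers is a common rule of thumb chosen here as an assumption; the slide itself suggests plots and regression rather than a specific rule.

```python
import statistics


def qc_summary(values, valid_min, valid_max):
    """Summarize a numeric column in which None marks a missing value."""
    present = [v for v in values if v is not None]
    # "Impossible" values fall outside the physically valid range.
    impossible = [v for v in present if not (valid_min <= v <= valid_max)]
    # Outliers by the 1.5 * IQR rule of thumb.
    q1, _median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]
    return {
        "n_missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),
        "impossible": impossible,
        "outliers": outliers,
    }
```

Flagged values are candidates for inspection, not automatic deletion: an outlier may be a real observation.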

Page 50: UC Santa Cruz: Data Management for Scientists

Best Practices for Data Management

1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
7. Planning

Page 51: UC Santa Cruz: Data Management for Scientists

4. Metadata basics

Why are you promoting Excel?

What is metadata?

Page 52: UC Santa Cruz: Data Management for Scientists

4.  Metadata  basics  

   Metadata  =  Data  reporting    

WHO  created  the  data?  

WHAT  is  the  content  of  the  data  set?  

WHEN  was  it  created?  

WHERE  was  it  collected?  

HOW  was  it  developed?  

WHY  was  it  developed?  

Page 53: UC Santa Cruz: Data Management for Scientists

•  Digital  context  

•  Name  of  the  data  set  

•  The  name(s)  of  the  data  file(s)  in  the  data  set  

•  Date  the  data  set  was  last  modified  

•  Example  data  file  records  for  each  data  type  file  

•  Pertinent  companion  files  

•  List  of  related  or  ancillary  data  sets  

•  Software  (including  version  number)  used  to  prepare/read    the  data  set  

•  Data  processing  that  was  performed  

•  Personnel  &  stakeholders  

•  Who  collected    

•  Who  to  contact  with  questions  

•  Funders  

•  Scientific  context  

•  Scientific  reason  why  the  data  were  collected  

•  What  data  were  collected  

•  What  instruments  (including  model  &  serial  number)  were  used  

•  Environmental  conditions  during  collection  

•  Where collected & spatial resolution

•  When collected & temporal resolution

•  Standards  or  calibrations  used  

•  Information  about  parameters  

•  How  each  was  measured  or  produced  

•  Units  of  measure  

•  Format  used  in  the  data  set  

•  Precision  &  accuracy  if  known  

•  Information  about  data  

•  Definitions  of  codes  used  

•  Quality  assurance  &  control  measures  

•  Known  problems  that  limit  data  use  (e.g.  uncertainty,  sampling  problems)    

•  How  to  cite  the  data  set  

4.  Metadata  basics  
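As a rough illustration of the who/what/when/where/how/why questions, the sketch below stores a metadata record as a plain Python dictionary and writes it as JSON. This is not EML, FGDC, or any other metadata standard; the field names and every example value are invented placeholders.

```python
import json

REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def to_json(record):
    """Serialize a complete record; refuse if a required field is missing."""
    gaps = missing_fields(record)
    if gaps:
        raise ValueError(f"metadata incomplete, missing: {gaps}")
    return json.dumps(record, indent=2, sort_keys=True)


# Every value below is an invented placeholder.
example_record = {
    "who": "A. Researcher (collector and contact)",
    "what": "Plankton counts, one row per sample",
    "when": "2010-06-01 to 2010-08-31, weekly",
    "where": "Example Lake, three shoreline sites",
    "how": "Net tows counted under a dissecting microscope",
    "why": "Track seasonal change in community composition",
    "units": {"count": "individuals per litre"},
}
```

Checking for empty fields before saving is a cheap way to notice, at collection time, which of the six questions has not been answered yet.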

Page 54: UC Santa Cruz: Data Management for Scientists

•  Provides  structure  to  describe  data  

Common  terms    |    definitions    |    language    |    structure  

4.  Metadata  basics  

•  Lots of different standards: EML, FGDC, ISO19115, DarwinCore, …

•  Tools for creating metadata files: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)

   

What  is  metadata?  

Select  the  appropriate  metadata  standard  

Page 55: UC Santa Cruz: Data Management for Scientists

What  is  a  metadata  standard?  

What  does  metadata  look  like?  

4.  Metadata  basics  

Page 56: UC Santa Cruz: Data Management for Scientists

Best Practices for Data Management

1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
7. Planning

Page 57: UC Santa Cruz: Data Management for Scientists

Temperature  data  

Salinity                data  

Data  import  into  R  

Analysis:  mean,  SD  

Graph  production  

Quality control & data cleaning

“Clean” T & S data

Summary  statistics  

Data  in  R  format  

5.  Workflows  

Workflow:  how  you  get  from  the  raw  data  to  the  final  products  of  your  research  

 

Simple  workflows:  flow  charts  

Page 58: UC Santa Cruz: Data Management for Scientists

• R, SAS, MATLAB
• Well-documented code is…
  Easier to review
  Easier to share
  Easier to repeat analysis

5.  Workflows  

Workflow:  how  you  get  from  the  raw  data  to  the  final  products  of  your  research  

 

Simple  workflows:  commented  scripts  
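A commented script in this spirit might look like the following Python sketch. It mirrors the temperature and salinity flow chart (import, quality control, mean and SD); the readings and function names are invented, and the talk itself names R, SAS, and MATLAB rather than Python.

```python
import statistics

# Invented readings; None marks a reading rejected during data entry.
RAW_TEMPERATURE = [12.0, 12.5, None, 13.0]
RAW_SALINITY = [30.0, 31.0, 32.0]


def clean(values):
    """Quality control step: drop missing readings."""
    return [v for v in values if v is not None]


def summarize(values):
    """Analysis step: return (mean, sample standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)


def run_workflow(temperature, salinity):
    """Raw data -> clean data -> summary statistics, one step per line."""
    clean_t = clean(temperature)
    clean_s = clean(salinity)
    return {
        "temperature": summarize(clean_t),
        "salinity": summarize(clean_s),
    }
```

Because each step is a named function with a comment, a reviewer can follow the path from raw data to summary statistics without running anything.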


Page 59: UC Santa Cruz: Data Management for Scientists

Fancy Schmancy workflows: Kepler

Resulting output

5.  Workflows  

https://kepler-project.org

Page 60: UC Santa Cruz: Data Management for Scientists

Workflows  enable    

Reproducibility: can someone independently validate findings?

Transparency: others can understand how you arrived at your results

Executability: others can re-run or re-use your analysis

 

5.  Workflows  

From  Flickr  by  merlinprincesse  

Page 61: UC Santa Cruz: Data Management for Scientists

Minimally: document your analysis with commented code or a simple flow-chart

Emerging workflow applications will…
− Link software for executable end-to-end analysis
− Provide detailed info about data & analysis
− Facilitate re-use & refinement of complex, multi-step analyses
− Enable efficient swapping of alternative models & algorithms
− Help automate tedious tasks

5.  Workflows  

www.littlebytesoflife.com  

Page 62: UC Santa Cruz: Data Management for Scientists

Best Practices for Data Management

1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
7. Planning

Page 63: UC Santa Cruz: Data Management for Scientists

The 20-Year Rule

The metadata accompanying a data set should be written for a user 20 years into the future

   

6.  Data  stewardship  &  reuse  

(National  Research  Council  1991)  

From  Flickr  by  greensambaman  


Page 64: UC Santa Cruz: Data Management for Scientists

Use stable formats: csv, txt, tiff

Create back-up copies: original, near, far

Periodically  test  ability  to  restore  information  

6.  Data  stewardship  &  reuse  

Modified from R. Cook  
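One way to test that back-up copies can be restored is to compare checksums of each copy against the original. The Python sketch below uses SHA-256 from the standard library; the file names and the simulated corruption in `demo` are invented for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, copies):
    """Map each copy's path to True if its checksum equals the original's."""
    expected = sha256_of(original)
    return {str(copy): sha256_of(copy) == expected for copy in copies}


def demo():
    """Make a 'near' and a 'far' copy, corrupt the far one, and verify both."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "data.csv"
        near = Path(tmp) / "near_copy.csv"
        far = Path(tmp) / "far_copy.csv"
        original.write_text("site,count\nA,3\n")
        shutil.copy(original, near)
        shutil.copy(original, far)
        far.write_text("site,count\nA,4\n")  # simulate a corrupted back-up
        results = copies_match(original, [near, far])
        return [results[str(near)], results[str(far)]]
```

Running a check like this on a schedule turns "periodically test ability to restore" into something that can be automated.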

Page 65: UC Santa Cruz: Data Management for Scientists

Store  your  data  in  a  repository  

Institutional  archive  

Discipline/specialty  archive  

DataCite list of repositories: www.datacite.org/repolist

   

 

6.  Data  stewardship  &  reuse  

From  Flickr  by  torkildr  

Page 66: UC Santa Cruz: Data Management for Scientists

Allows  readers  to  find  data  products  

Get  credit  for  data  and  publications  

Promotes  reproducibility  

Better  measure  of  research  impact  

Modified from R. Cook  

6.  Data  stewardship  &  reuse  

Data  Citation  

Example:  Sidlauskas,  B.  2007.  Data  from:  Testing  for  unequal  rates  of  morphological  diversification  in  the  absence  of  a  detailed  phylogeny:  a  case  study  from  characiform  fishes.  Dryad  Digital  Repository.  doi:10.5061/dryad.20    

Learn  more  at  www.datacite.org  

Page 67: UC Santa Cruz: Data Management for Scientists

Best Practices for Data Management

1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
7. Planning & data management plans in particular

Page 68: UC Santa Cruz: Data Management for Scientists

A  document  that  describes  what  you  will  do  with  your  data  during  your  research  and  after  you  complete  your  research  

What  is  a  data  management  plan?  

1.  Planning  

Data  Hangover  

 

Page 69: UC Santa Cruz: Data Management for Scientists

1. Planning

Why should I prepare a DMP?

• Saves time
• Increases efficiency
• Easier to use data
• Others can understand & use data
• Credit for data products
• Funders require it

Page 70: UC Santa Cruz: Data Management for Scientists

NSF DMP Requirements

From Grant Proposal Guidelines:

DMP supplement may include:
1. the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
2. the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)
3. policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
4. policies and provisions for re-use, re-distribution, and the production of derivatives
5. plans for archiving data, samples, and other research products, and for preservation of access to them

Page 71: UC Santa Cruz: Data Management for Scientists

•  Types  of  data  produced  

•  Relationship  to  existing  data  

•  How/when/where  will  the  data  be  captured  or  created?  

•  How  will  the  data  be  processed?  

•  Quality  assurance  &  quality  control  measures  

•  Security:  version  control,  backing  up  

•  Who  will  be  responsible  for  data  management  during/after  project?  

1.  Types  of  data  &  other  information  

biology.kenyon.edu  

C.  Strasser  

From  Flickr  by  Lazurite  

Page 72: UC Santa Cruz: Data Management for Scientists

Wired.com  

2. Data & metadata standards

• What metadata are needed to make the data meaningful?
• How will you create or capture these metadata?
• Why have you chosen particular standards and approaches for metadata?

Page 73: UC Santa Cruz: Data Management for Scientists

•  Are  you  under  any  obligation  to  share  data?    

•  How,  when,  &  where  will  you  make  the  data  available?    

•  What  is  the  process  for  gaining  access  to  the  data?    

•  Who  owns  the  copyright  and/or  intellectual  property?  

• Will you retain rights before opening data to wider use? How long?

• Are permission restrictions necessary?

• Embargo periods for political/commercial/patent reasons?

• Ethical and privacy issues?

• Who are the foreseeable data users?

• How should your data be cited?

3. Policies for access & sharing

4. Policies for re-use & re-distribution

Page 74: UC Santa Cruz: Data Management for Scientists

•  What  data  will  be  preserved  for  the  long  term?  For  how  long?      

•  Where  will  data  be  preserved?  

•  What  data  transformations  need  to  occur  before  preservation?  

5.  Plans  for  archiving  &  preservation  

From  Flickr  by  theManWhoSurfedTooMuch  

•  What  metadata  will  be  submitted  alongside  the  datasets?  

•  Who  will  be  responsible  for  preparing  data  for  preservation?  Who  will  be  the  main  contact  person  for  the  archived  data?  

Page 75: UC Santa Cruz: Data Management for Scientists

Don’t  forget:  Budget  

• Costs of data preparation & documentation

– Hardware, software

– Personnel

– Archive fees

• How costs will be paid

– Request funding!

dorrvs.com  

Page 76: UC Santa Cruz: Data Management for Scientists

NSF’s  Vision*  

DMPs  and  their  evaluation  will  grow  &  change  over  time  (similar  to  broader  impacts)  

Peer  review  will  determine  next  steps  

Community-driven guidelines

– Different disciplines have different definitions of acceptable data sharing

– Flexibility at the directorate and division levels

– Tailor implementation of DMP requirement

Evaluation  will  vary  with  directorate,  division,  &  program  officer  

*Unofficially

Help from Jennifer Schopf, NSF

Page 77: UC Santa Cruz: Data Management for Scientists

Roadmap  

1. Background

2. Data management landscape

3. Best practices

4. Toolbox

Page 78: UC Santa Cruz: Data Management for Scientists

E-notebooks & online science

• NoteBook

• ORNL eNote

• Evernote

• Google Docs

• Blogs

• wikis

• TheLabNotebook.com

• NoteBookMaker

TheLabNotebook.com

Page 79: UC Santa Cruz: Data Management for Scientists

Step-by-step wizard for generating DMP

Create | edit | re-use | share | save | generate

Open  to  community    

Links  to  institutional  resources  

Directorate  information  &  updates  

DMPTool:          dmp.cdlib.org  

Page 80: UC Santa Cruz: Data Management for Scientists

CDL  Services  for  UC  Community  

www.cdlib.org/services/uc3  

Where  should  I  put  my  data?  

Data  Repository  Deposit    |    Manage    |    Share    |    Preserve  

Page 81: UC Santa Cruz: Data Management for Scientists

• Precise identification of a dataset

• Credit to data producers and data publishers

• A link from the traditional literature to the data

• Research metrics for datasets

CDL  Services  for  UC  Community  

www.cdlib.org/services/uc3  

Create  &  manage  persistent  identifiers  

Example:  Sidlauskas,  B.  2007.  Data  from:  Testing  for  unequal  rates  of  morphological  diversification  in  the  absence  of  a  detailed  phylogeny:  a  case  study  from  characiform  fishes.  Dryad  Digital  Repository.  doi:10.5061/dryad.20    
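The citation above follows a simple pattern: author, year, title, repository, identifier. As a rough sketch, the same pattern can be assembled from its parts in Python. The function names `format_data_citation` and `doi_to_url` are hypothetical; the values come from the example on this slide, and the link form relies on the public doi.org resolver.

```python
def format_data_citation(author, year, title, repository, doi):
    """Assemble a dataset citation in the pattern used on the slide."""
    return f"{author} {year}. {title}. {repository}. doi:{doi}"


def doi_to_url(doi):
    """Turn a bare DOI into a resolvable link via the doi.org resolver."""
    return "https://doi.org/" + doi


citation = format_data_citation(
    author="Sidlauskas, B.",
    year=2007,
    title=(
        "Data from: Testing for unequal rates of morphological "
        "diversification in the absence of a detailed phylogeny: "
        "a case study from characiform fishes"
    ),
    repository="Dryad Digital Repository",
    doi="10.5061/dryad.20",
)

print(citation)
print(doi_to_url("10.5061/dryad.20"))
```

Because the identifier is persistent, the link keeps working even if the repository reorganizes its web pages.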

Page 82: UC Santa Cruz: Data Management for Scientists

• Open source add-in

• Facilitate data management, sharing, archiving for scientists

• Focus on atmospheric, ecological, hydrological, and oceanographic data

• Collecting requirements for add-in from scientists, data centers, libraries

Funders:  Gordon  and  Betty  Moore  Foundation,  Microsoft  Research  

Why  are  you  promoting  Excel?  

Page 83: UC Santa Cruz: Data Management for Scientists

Everyone  uses  it  

Stopgap  measure    

 

Why  are  you  promoting  Excel?  

Funders:  Gordon  and  Betty  Moore  Foundation,  Microsoft  Research  

Page 84: UC Santa Cruz: Data Management for Scientists

• Data Education Tutorials

• Database of best practices & software tools

• Links to DMPTool

• Primer on data management

www.dataone.org  

Page 85: UC Santa Cruz: Data Management for Scientists

dcxl.cdlib.org  

Data Management 101

Page 86: UC Santa Cruz: Data Management for Scientists

www.carlystrasser.net  

Resources

Slideshare link: this presentation

Page 87: UC Santa Cruz: Data Management for Scientists

Best Practices for Preparing Environmental Data Sets to Share and Archive. September 2010. Hook, Santhana Vannan, Beaty, Cook, & Wilson. http://daac.ornl.gov/PI/BestPractices-2010.pdf

Some Simple Guidelines for Effective Data Management. Borer, Seabloom, Jones, & Schildhauer. Bull Ecol Soc Amer, April 2009: 205-214.

Handy  References  

Page 88: UC Santa Cruz: Data Management for Scientists

Roadmap  

1. Background

2. Data management landscape

3. Best practices

4. Toolbox

Page 89: UC Santa Cruz: Data Management for Scientists

1. Take stock

2. Take a time machine

3. Break it down

4. Get smart

Where  to  begin?  

Getting  down  &  dirty  with  your  data  

www.catfishingtipstoday.com

Page 90: UC Santa Cruz: Data Management for Scientists

•  What  data  do  you  have?  

•  What  data  are  you  still  generating?  

•  What  does  your  workflow  look  like?  

•  Are  you  backing  up?  

•  How’s  your  filing  system?  

•  Etc…  

1.  Take  stock  

From  Flickr  by  charlie  llewellin  
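Answering "What data do you have?" can start with a plain file inventory. The sketch below is one possible approach, not a prescribed tool: the function name `take_stock` and the record fields are hypothetical, and the demo runs on a temporary folder holding a single made-up file so that it is safe to try.

```python
import os
import tempfile
from datetime import datetime, timezone


def take_stock(root):
    """List every file under root with its size and last-modified date."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            records.append({
                "path": os.path.relpath(full_path, root),
                "bytes": info.st_size,
                "modified": modified.date().isoformat(),
            })
    # Sort by path so the inventory reads the same way every time.
    return sorted(records, key=lambda record: record["path"])


# Demo on a temporary folder containing one hypothetical data file.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "counts.csv"), "wb") as handle:
        handle.write(b"x,y\n")
    inventory = take_stock(folder)

for record in inventory:
    print(record["path"], record["bytes"], record["modified"])
```

Pointing `take_stock` at a real project folder gives a quick picture of what exists, how large it is, and what has not been touched in years.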

Page 91: UC Santa Cruz: Data Management for Scientists

Knowing  what  you  know  now,  how  would  you  plan  for  this  project?    

–  File  structures  

–  Metadata  generation  

–  Naming  conventions  

Consider  writing  up  a  formal  data  management  plan  

2.  Take  a  time  machine  

From  Flickr  by  F1RSTBORN  
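A naming convention is easier to keep when the names are generated rather than typed by hand. This is a minimal sketch under assumed rules: the pattern project_site_date_version, the function name `make_filename`, and the example values are all hypothetical, and any consistent pattern would serve equally well.

```python
import re
from datetime import date


def make_filename(project, site, collected, version, extension="csv"):
    """Build a name like project_site_YYYYMMDD_vNN.ext from its parts."""

    def clean(text):
        # Lowercase and replace each run of other characters with a hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    return (
        f"{clean(project)}_{clean(site)}_"
        f"{collected:%Y%m%d}_v{version:02d}.{extension}"
    )


example_name = make_filename("Kelp Survey", "Santa Cruz", date(2012, 2, 28), 3)
print(example_name)  # kelp-survey_santa-cruz_20120228_v03.csv
```

Putting the date in year-month-day order and zero-padding the version number means an alphabetical sort is also a chronological one.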

Page 92: UC Santa Cruz: Data Management for Scientists

You  now  have  a  vision.  Break  into  manageable  chunks  

–  Set  a  final  deadline  

–  Set  intermediate  deadlines  

–  Break  down  tasks  to  meet  those  deadlines  

–  Be  reasonable  

3.  Break  it  down  

From www.gonomad.com

From www.collegehumor.com

Page 93: UC Santa Cruz: Data Management for Scientists

Learn  from  mistakes  

Plan  better  next  time  

Remember:  good  data  management  takes    

Time  

Thoughtfulness  

Planning  

Resources  

4.  Get  smart  

static.tvtropes.org  

Page 94: UC Santa Cruz: Data Management for Scientists

dcxl.cdlib.org  @dcxlCDL  www.facebook.com/DCXLatCDL  

www.carlystrasser.net  [email protected]  

@carlystrasser