Page 1: Collaborating reproducibly - Biostatistics and Medical ...kbroman/presentations/rrcollab_aaas… · Collaborating reproducibly Author: Karl Broman Created Date: 2/18/2019 2:34:37

Collaborating reproducibly

Karl Broman

Biostatistics & Medical Informatics
Univ. Wisconsin–Madison

kbroman.org
github.com/kbroman

@kwbroman
Slides: bit.ly/rrcollab


Karl -- this is very interesting, however you used an old version of the data (n=143 rather than n=226).

I'm really sorry you did all that work on the incomplete dataset.

Bruce


The results in Table 1 don’t seem to correspond to those in Figure 2.


In what order do I run these scripts?


Where did we get this data file?


Why did I omit those samples?


How did I make that figure?


Which image goes with which experiment?


“Your script is now giving an error.”


“The attached is similar to the code we used.”


Reproducible vs. Replicable


Reproducible vs. Correct


kbroman.org/steps2rr

1. Organize your data & code

2. Everything with a script

3. Automate the process (GNU Make)

4. Turn scripts into reproducible reports

5. Turn repeated code into functions

6. Create a package/module

7. Use version control (git/GitHub)

8. License your software


Organize your project

File organization and naming are powerful weapons against chaos.

– Jenny Bryan


Organize your project

Your closest collaborator is you six months ago, but you don’t reply to emails.

(paraphrasing Mark Holder)


Organize your project

Have sympathy for your future self.


Organize your project

RawData/        Notes/
DerivedData/    Refs/

Python/         ReadMe.txt
R/              ToDo.txt
Ruby/           Makefile
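As a rough illustration, a layout like this can be set up by a short script so that every project starts from the same skeleton. The directory and file names below come from the slide; the function name `create_project_skeleton` and the choice to create empty placeholder files are assumptions made for this sketch.

```python
from pathlib import Path

# Names taken from the slide. Everything else here is an illustrative assumption.
DIRECTORIES = ["RawData", "DerivedData", "Python", "R", "Ruby", "Notes", "Refs"]
FILES = ["ReadMe.txt", "ToDo.txt", "Makefile"]


def create_project_skeleton(root):
    """Create the project layout under `root` without overwriting existing content."""
    root = Path(root)
    for name in DIRECTORIES:
        # parents=True also creates `root` itself if it does not exist yet.
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() creates an empty file and leaves an existing file's content alone.
        (root / name).touch(exist_ok=True)
    return sorted(p.name for p in root.iterdir())
```

Running the function a second time on the same directory is harmless, so it can also be used to bring an older project up to the shared layout.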


Organize your project

0_vcf2db.R
1_prep_geno.R
2_prep_pheno_clin.R
2_prep_pheno_otu.R
3_prep_covar.R
4_prep_analysis_pheno_clin.R
4_prep_analysis_pheno_otu.R
5_scans.R
6_grab_peaks.R
7_find_nearby_peaks.R
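Numeric prefixes like these let the file names themselves answer the question "In what order do I run these scripts?". A minimal sketch of recovering that order, assuming the prefix is an integer followed by an underscore; the helper name `run_order` is hypothetical, and the sketch assumes that scripts sharing a prefix can run in either order.

```python
import re

# File names copied from the slide.
SCRIPTS = [
    "0_vcf2db.R",
    "1_prep_geno.R",
    "2_prep_pheno_clin.R",
    "2_prep_pheno_otu.R",
    "3_prep_covar.R",
    "4_prep_analysis_pheno_clin.R",
    "4_prep_analysis_pheno_otu.R",
    "5_scans.R",
    "6_grab_peaks.R",
    "7_find_nearby_peaks.R",
]


def run_order(filenames):
    """Sort scripts by their leading integer prefix, breaking ties by name."""
    def key(name):
        match = re.match(r"(\d+)_", name)
        if match is None:
            raise ValueError(f"no numeric prefix: {name}")
        # Compare prefixes as integers so that 10_ sorts after 2_.
        return (int(match.group(1)), name)

    return sorted(filenames, key=key)
```

Sorting on the integer value rather than the raw string matters once a project grows past nine steps, since a plain alphabetical sort would place `10_` before `2_`.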


Chaos

AimeeNullSims/     Deuterium/             Ping/
AimeeResults/      ExtractData4Gary/      Ping2/
AnnotationFiles/   FromAimee/             Ping3/
Brian/             GoldStandard/          Ping4/
Chr6_extrageno/    HumanGWAS/             Play/
Chr6_segdis/       Insulin/               Prdm9/
ChrisPlaisier/     Int2_for_Mark/         RBM_PlasmaUrine_2012-03-08/
Code4Aimee/        Islet_2011-05/         Slco1a6/
CompAnnot/         MappingProbes/         StudyLineupMethods/
CondScans/         MultiProbes/           kidney_chr6.R
D2O_2012-02-14/    NewMap/                pck2_sucla2.R
D2O_cellcycle/     Notes/                 penalties.txt
D2Ocorr/           NullSims/              transeQTL4Lude/
Data4Aimee/        NullSims_2009-09-10/
Data4Tram/         PepIns_2012-02-09/


No “final” in file names


Reproducible reports



Collaboration

▶ Do more, by working in parallel
▶ Do more, through diversity of ideas and skills
▶ Reproducible pipelines have immediate advantages
▶ Tests of reproducibility
▶ Code review


Challenges in collaborations

▶ Shared vision?
▶ Compromise
▶ Coordination
▶ Communication
▶ Sharing code and data
▶ Synchronization
▶ Weakest link?


Genetics of metabolic disease in mice

Alan Attie, UW-Madison, Biochemistry

Karl Broman, UW-Madison, Biostat & Med Info

Gary Churchill, Jackson Lab

Josh Coon, UW-Madison, Chemistry

Federico Rey, UW-Madison, Microbiology

Brian Yandell, UW-Madison, Statistics


Diversity outbred mice

[Figure: generations G0 (founders A, B, C, D, E, F, G, H), G1, G2, G10, G15, G20]


Data

▶ 500 DO mice
  – generations 17–23
  – high fat, high sugar diet
▶ GigaMUGA SNP arrays
  – 140k SNPs
▶ Clinical traits
  – Weekly body weight
  – Glucose tolerance test
  – Longitudinal serum samples
  – ex vivo islet insulin secretion
▶ Islet gene expression by RNA-seq
▶ Proteins by mass spec
▶ Lipids by mass spec
▶ Gut microbiome
  – 16S RNA
  – metagenomic data


Genome scans

[Figure panel: "Weight change (week 11 vs 1)"; x-axis: Chromosome (1–19); y-axis: LOD score (0–10)]

[Figure panel: "Inferred QTL"; x-axis: Chromosome (1–19); y-axis: LOD score (6–10); legend: bile acid, body weight, clinical, insulin secretion]

Challenges

(totally hypothetical)


“Could we meet to talk about the data file structure?”

“No.”


“What the heck is ‘FAD_NAD SI 8.3_3.3G’?”


“Wait, these results seem to be based on the older SNP map.”


“Could you write the methods section?”

“But I didn’t do the work, and we don’t have the code that was used.”


“My data analyst has taken a job at Google.”


“Could you do these analyses? X said they would, but they’re not responding to my emails.”


Shared vision

▶ Publication
▶ Code & data sharing
▶ Who will do what
▶ Timeline
▶ Ongoing sharing of methods, results


Shared workspace

▶ Project structure
▶ Data and metadata formats
▶ Software environment
▶ Automated sync (or it won’t happen)
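One way a shared workspace can guard against collaborators silently working from different versions of the data is to keep a checksum manifest alongside the files. The sketch below is an illustration only, not something described in the talk: the function names `file_sha256`, `build_manifest`, and `changed_files` are hypothetical, and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its checksum."""
    data_dir = Path(data_dir)
    return {
        path.relative_to(data_dir).as_posix(): file_sha256(path)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """List files that were added, removed, or modified between two manifests."""
    names = set(old_manifest) | set(new_manifest)
    return sorted(n for n in names if old_manifest.get(n) != new_manifest.get(n))
```

Comparing the manifest an analysis was run against with the current one shows whether results were based on an older copy of a file.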


Technology for sharing

▶ Data
  – figshare
  – dropbox / box / google drive
▶ Code
  – github / bitbucket
▶ Pipeline / workflow
  – make / drake / snakemake / rake
▶ Full environment
  – docker containers
  – mybinder.org / wholetale.org
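The pipeline tools listed above share one core idea: an output is rebuilt when it is missing or older than any of its inputs. The following is a minimal sketch of that timestamp rule in Python, not a replacement for make or snakemake; the function name `is_stale` is hypothetical.

```python
import os


def is_stale(target, dependencies):
    """Return True if `target` is missing or older than any dependency.

    This mirrors the timestamp comparison that make applies to a rule's
    target and prerequisites.
    """
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)
```

A driver script could call `is_stale("DerivedData/scans.rds", ["RawData/geno.csv", "R/5_scans.R"])` before rerunning a step; those paths are made up for illustration. Real workflow tools add what this sketch leaves out, such as dependency graphs and parallel execution.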


The most important tool is the mindset, when starting, that the end product will be reproducible.

– Keith Baggerly


The second-most important tool is training.

– me


Slides: bit.ly/rrcollab

kbroman.org

github.com/kbroman

@kwbroman
