Open Science, Open Data, Open Source Projects for Undergraduate Research Experiences

BioQUEST/HHMI/CaseNet Summer Workshop

June 13, 2015

Kam D. Dahlquist, Ph.D.

Department of Biology

Loyola Marymount University

Outline

• An open science ecosystem enhances student learning

• Quick example: XMLPipeDB project in a Biological Databases course

• Longer example: GRNmap project in Biomathematical Modeling course

• Potential research projects for BioQUEST participants

• Challenges are also opportunities
– Computer literacy
– Data literacy
– Information literacy

Open Science Ecosystem

[Diagram components: Open Science (open process); Citizen Science; Open Source Code; Open Access (Creative Commons); Reproducible Research; Research Integrity; Open Data; Open Pedagogy]

With thanks to John Jungck

Open Science Pedagogy Adds Open Source Values and Tools to Problem Spaces

• Students solve an authentic research problem.

• They investigate large, publicly available datasets.

• They return the products of their research to the scholarly community.

Image: http://www.bioquest.org/bedrock/problem_spaces/

Official Open Source Definition (http://opensource.org)

Free redistribution

Source code

Derived works

Integrity of the author’s source code

No discrimination against persons or groups

No discrimination against fields of endeavor

Distribution of license

License must not be specific to a product

License must not restrict other software

License must be technology-neutral

Open Source Values Mirror STEM Curricular Reform

Open Source Values | Active Learning Pedagogy | Open Source Practices & Tools
Source code is available, modifiable, and long-lived | Authentic problem to solve with realistic complexity | Central code repository; version control; provenance of code
Accountability to a developer and user community | Participatory and collaborative work; peer review | Task and bug trackers; continuous integration; test-driven workflows
Responsibilities accompany rights | Responsibility and ownership of the learning process | Documentation: in-line, user manual, web site, wiki

Pedagogy Implemented on Course Wikis

• Team-taught and cross-listed
− BIOL/CMSI 367: Biological Databases: https://xmlpipedb.cs.lmu.edu/biodb/fall2013/index.php/Main_Page
− BIOL/MATH 388: Biomathematical Modeling: http://www.openwetware.org/wiki/BIOL398-04/S15

• Single instructor
− BIOL 368: Bioinformatics Laboratory: http://www.openwetware.org/wiki/BIOL368/F14
− BIOL 478: Molecular Biology of the Genome (wet lab, mostly offline); data analysis: http://www.openwetware.org/wiki/BIOL478/S15:Microarray_Data_Analysis

• Weekly assignments leading up to final research project

• All projects involve exploration of DNA microarray data

Biological Databases Team Final Project: create a gene database for a bacterial species

[Diagram labels: PostgreSQL intermediate database; GenMAPP-compatible Gene Database; microarray data; visualize data]

http://xmlpipedb.cs.lmu.edu/

Each Student on the Team is Assigned a Specific Role

Coder

Quality Control

Data Analysis

Project Manager

Student Products Are Shared with the Scientific Community

http://sourceforge.net/projects/xmlpipedb/

Systems Biology Workflow

DNA microarray data: wet lab-generated or published

Statistical analysis, clustering, Gene Ontology, term enrichment

Generate gene regulatory network

Modeling dynamics of the network

Visualizing the results

New experimental questions

Central Dogma of Molecular Biology (simplified)

DNA → (transcription) → mRNA → (translation) → Protein

Freeman (2003)

And Now in the “omics” Era…

Genome → (transcription) → Transcriptome → (translation) → Proteome

Freeman (2002)

Budding Yeast, Saccharomyces cerevisiae, is an Ideal Model Organism for Systems Biology

Alberts et al. (2004)

• Small genome of ~6000 genes

• Extensive genome-wide datasets readily accessible

• Molecular genetic tools available

Environmental Changes and Stresses

• All organisms must respond to changes in the environment
– pH
– oxygen availability
– pressure
– osmotic stress
– temperature (heat and cold)

• Some changes in the environment cause cellular damage and trigger a “stress response”
– damage from reactive oxygen species
– damage from UV radiation
– sudden and/or large change in temperature (increase or decrease)

Cold Shock Is an Environmental Stress that Is Not Well-Studied

• Increases in temperature (heat shock)
– response very well-characterized
– proteins denature due to heat
– induction of heat shock proteins (chaperonins) that assist in protein folding
– conserved in all organisms (prokaryotes, eukaryotes)

• Decreases in temperature (cold shock)
– response less well-characterized
– decrease fluidity of membranes
– stabilize DNA and RNA secondary structures
– impair ribosome function and protein synthesis
– decrease enzymatic activities
– no equivalent set of cold shock proteins that are conserved in all organisms

Yeast Respond to Cold Shock by Changing Gene Expression

• Cold shock temperature range for yeast is 10-18°C

• Previous studies indicate that the cold shock response can be divided into:

• Late response genes – 12 to 60 hours
– General environmental stress response (ESR) genes are induced
– Regulated by the Msn2/Msn4 transcription factors

• Early response genes – 15 minutes to 2 hours
– Genes unique to cold shock are induced, such as genes involved in ribosome biogenesis and membrane fluidity
– Which transcription factors regulate this response is unknown

Transcription Factors Control Gene Expression by Binding to Regulatory DNA Sequences

• Activators increase gene expression

• Repressors decrease gene expression

• Transcription factors are themselves proteins that are encoded by genes

Experimental Design and Methods

Yeast Cells Were Harvested for Microarrays Before, During, and After a Cold Shock and During Recovery

• 4 replicates of each experiment with dye swaps

• wt and transcription factor deletion strains

DNA Microarray

[Figure labels: mixture of labeled cDNA from two samples; one spot = one gene; green = decreased relative to control; red = increased; yellow = no change in gene expression]

Freeman (2002)

Gene Expression Changes Due to Cold Shock Return to Pre-shock Levels During Recovery

[Figure panels: t30/t0 cold shock; t60/t0 cold shock; t90/t0 recovery; t120/t0 recovery]

• Four sets of biological replicates were performed

• Dye orientation was swapped for two sets of replicates

Steps Used to Analyze DNA Microarray Data

1. Quantitate the fluorescence signal in each spot
2. Calculate the ratio of red/green fluorescence
3. Log2 transform the ratios
4. Normalize the ratios on each microarray slide
5. Normalize the ratios for a set of slides in an experiment
6. Perform statistical analysis on the ratios
7. Compare individual genes with known data
8. Pattern finding algorithms/clustering
9. Modeling the dynamics of the gene regulatory network
10. Visualizing the results
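Steps 2 through 4 can be sketched in a few lines of Python. This is a minimal illustration, not the workflow used in the course: the intensity values are made up, and subtracting the slide median is one common within-slide normalization that is assumed here because the slides do not say which method was applied.

```python
import math
from statistics import median

def log2_ratios(red, green):
    """Steps 2-3: per-spot red/green ratio, then log2 transform."""
    return [math.log2(r / g) for r, g in zip(red, green)]

def center_on_median(values):
    """Step 4 (assumed method): subtract the slide median from every spot."""
    m = median(values)
    return [v - m for v in values]

# Hypothetical background-subtracted intensities for five spots on one slide.
red = [1200.0, 600.0, 1600.0, 3200.0, 800.0]
green = [600.0, 600.0, 800.0, 400.0, 400.0]

raw = log2_ratios(red, green)        # log2 fold changes before normalization
normalized = center_on_median(raw)   # the typical spot now sits at 0

for spot, (before, after) in enumerate(zip(raw, normalized), start=1):
    print(f"spot {spot}: log2 ratio {before:+.2f}, normalized {after:+.2f}")
```

After the log2 transform, a doubling and a halving are symmetric around zero (+1 and -1), which is why the ratios are transformed before any statistics are run.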

Systems Biology Workflow: statistical analysis, clustering, Gene Ontology, term enrichment (tools: Excel, stem)

And so on…

Within-strain ANOVA Reveals How Many Genes Had Significant Changes in Expression at Any Timepoint

ANOVA | wt | Δgln3
p < 0.05 | 2378/6189 (38.42%) | 1864/6189 (30.11%)
p < 0.01 | 1527/6189 (24.67%) | 1008/6189 (16.29%)
p < 0.001 | 860/6189 (13.90%) | 404/6189 (6.53%)
p < 0.0001 | 460/6189 (7.43%) | 126/6189 (2.04%)
B-H p < 0.05 | 1656/6189 (26.76%) | 913/6189 (14.75%)
Bonferroni p < 0.05 | 228/6189 (3.68%) | 26/6189 (0.42%)
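The B-H (Benjamini-Hochberg) and Bonferroni rows are multiple-testing corrections applied to the per-gene ANOVA p values. The sketch below shows the standard form of both adjustments on five invented p values; it is meant only to illustrate why the Bonferroni count is the smallest and the B-H count falls between it and the unadjusted count, and it is not the spreadsheet procedure used in the course.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p values for five genes.
p = [0.001, 0.008, 0.02, 0.03, 0.60]

n_raw = sum(q < 0.05 for q in p)
n_bh = sum(q < 0.05 for q in benjamini_hochberg(p))
n_bonf = sum(q < 0.05 for q in bonferroni(p))
print(n_raw, n_bh, n_bonf)
```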

A Modified T Test Was Used to Determine Significant Changes in Gene Expression at Each Timepoint

Number of Genes whose Expression Changes (wild type)

 | Cold Shock t15 | Cold Shock t30 | Cold Shock t60 | Recovery t90 | Recovery t120
Increased, p < 0.05 | 439 (7%) | 668 (11%) | 609 (10%) | 398 (6%) | 191 (3%)
Decreased, p < 0.05 | 331 (5%) | 517 (8%) | 411 (7%) | 249 (4%) | 59 (1%)
Total, p < 0.05 | 770 (12%) | 1185 (19%) | 1020 (17%) | 647 (10%) | 250 (4%)

Short Time Series Expression Miner (stem) Software Clusters Genes with Similar Profiles

[Figure: expression profiles of gene clusters; x-axis: time (minutes); y-axis: expression (log2 fold change)]

Gene Ontology categories assigned to clusters:

• Ribosome biogenesis
• Zinc ion homeostasis
• Hexose transport

• Endomembrane system
• Protein and vesicle transport
• Negative regulation of nitrogen compound process

The Transcription Factor Gln3 Regulates Genes Involved in Nitrogen Metabolism

• Yeast differentiate between preferred and non-preferred nitrogen sources.

• When the nitrogen source is poor, Gln3 localizes to the nucleus and activates genes required to utilize the poor nitrogen source.

• The Δgln3 strain is impaired for growth at cold temperatures:

− Doubling time at 13°C of 15 hours vs. 8.3 hours for wild type.

• A microarray experiment was performed on the Δgln3 strain.

Gln3 Target Genes Were Extracted from the YEASTRACT Database

37 out of 164 (23%) have significantly different expression profiles in the wild type versus the Δgln3 strain

Systems Biology Workflow: generate gene regulatory network (tools: YEASTRACT, Excel)

Genome-wide Location Analysis has Determined the Relationships between Transcription Factors and their Target Genes in Yeast

• Does not show whether activation or repression occurs

• Shows topology, but not the behavior of the network over time

• Data found in YEASTRACT database

Lee et al. (2002)

A Transcriptional Network Controlling the Cold Shock Response

Assumptions made in our model:

• Each node represents one gene encoding a transcription factor.

• When a gene is transcribed it is immediately translated into protein; a node represents both the gene and the protein it encodes.

• An edge drawn between two nodes represents a regulation relationship, either activation or repression, depending on the sign of the weight.
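One way to picture these assumptions is as a set of directed, signed edges between transcription factor genes. The sketch below is illustrative only: the gene names are yeast transcription factors, but the weight values are invented, and the dictionary layout is an assumption rather than the data format of any tool mentioned in the talk.

```python
# Hypothetical signed weights keyed by (regulator, target); the numbers are made up.
weights = {
    ("CIN5", "PHD1"): 0.8,
    ("SWI4", "PHD1"): -0.5,
    ("PHD1", "PHD1"): 0.3,  # a node may regulate itself
}

def edge_type(w):
    """The sign of the weight gives the kind of regulation."""
    if w > 0:
        return "activation"
    if w < 0:
        return "repression"
    return "no regulation"

def regulators_of(target, weights):
    """Map each regulator of `target` to the kind of regulation it exerts."""
    return {reg: edge_type(w) for (reg, tgt), w in weights.items() if tgt == target}

phd1_inputs = regulators_of("PHD1", weights)
print(phd1_inputs)
```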

Systems Biology Workflow: modeling dynamics of the network (tool: GRNmap, Windows-only)

GRNmap: Gene Regulatory Network Modeling and Parameter Estimation

• Parameters are estimated from DNA microarray data from wild type and transcription factor deletion strains subjected to cold shock conditions.

• Weight parameter, w, gives the direction (activation or repression) and magnitude of regulatory relationship.

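[Figure: two sigmoidal curves, one labeled Activation and one labeled Repression; vertical axis ticks at 0, 0.5, and 1; 1/w marked on each plot]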

$$\frac{dx_i(t)}{dt} = \frac{P_i}{1 + \exp\left(-\left(\sum_j w_{ij}\,x_j(t) - b_i\right)\right)} - d_i\,x_i(t)$$

The “Worst” Rate Equation is:

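$$\frac{d\,\mathrm{PHD1}}{dt} = \frac{P_{\mathrm{PHD1}}}{1 + \exp\left(-\left(w_{\mathrm{PHD1},\mathrm{CIN5}}\,\mathrm{CIN5} + w_{\mathrm{PHD1},\mathrm{FHL1}}\,\mathrm{FHL1} + w_{\mathrm{PHD1},\mathrm{PHD1}}\,\mathrm{PHD1} + w_{\mathrm{PHD1},\mathrm{SKN7}}\,\mathrm{SKN7} + w_{\mathrm{PHD1},\mathrm{SKO1}}\,\mathrm{SKO1} + w_{\mathrm{PHD1},\mathrm{SWI4}}\,\mathrm{SWI4} + w_{\mathrm{PHD1},\mathrm{SWI6}}\,\mathrm{SWI6} - b_{\mathrm{PHD1}}\right)\right)} - d_{\mathrm{PHD1}}\,\mathrm{PHD1}$$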
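A rate equation of this kind, with sigmoidal production controlled by weighted regulator inputs minus a linear degradation term, can be simulated forward in time with very little code. The sketch below is not GRNmap (which runs in MATLAB); it is a plain-Python illustration that assumes a degradation rate d for each gene, uses a fixed-step Euler integrator for simplicity, and runs on an invented two-gene network.

```python
import math

def production_rate(P, w_row, x, b):
    """Sigmoidal production: P / (1 + exp(-(sum_j w_j * x_j - b)))."""
    s = sum(w * xj for w, xj in zip(w_row, x))
    return P / (1.0 + math.exp(-(s - b)))

def rhs(x, P, W, b, d):
    """dx_i/dt = production - degradation for every gene i."""
    return [production_rate(P[i], W[i], x, b[i]) - d[i] * x[i] for i in range(len(x))]

def simulate(x0, P, W, b, d, t_end, dt=0.01):
    """Fixed-step forward Euler integration from t = 0 to t_end."""
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        dx = rhs(x, P, W, b, d)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x

# Hypothetical two-gene network: W[i][j] is the weight of regulator j on target i.
# Gene 1 represses gene 0 (negative weight); gene 0 activates gene 1 (positive weight).
P = [1.0, 1.0]
W = [[0.0, -1.0],
     [2.0, 0.0]]
b = [0.0, 0.0]
d = [0.5, 0.5]

x_final = simulate([1.0, 1.0], P, W, b, d, t_end=10.0)
print(x_final)
```

With all weights and thresholds set to zero, the production term is P/2 and each gene relaxes to P/(2d), which is a quick sanity check on any implementation.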

Optimization of the 92 Parameters Requires the Use of a Regularization (Penalty) Term

• Plotting the least squares error function showed that not all the graphs had clear minima.

• We added a penalty term so that MATLAB’s optimization algorithm would be able to minimize the function.

• θ is the combined production rate, weight, and threshold parameters.

• The penalty weight, α, is determined empirically from the “elbow” of the L-curve.

$$E = \frac{1}{Q}\sum_{r=1}^{Q}\left[z^{d}(t_r) - z^{c}(t_r)\right]^2 + \alpha\,\lVert\theta\rVert^2$$

[Figure: L-curve; x-axis: parameter penalty magnitude; y-axis: least squares residual]
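The idea of a penalized least squares error can be written out directly. The sketch below assumes the usual form, a mean squared difference between data and model values plus a penalty weight times the sum of squared parameters; the name alpha for the penalty weight and all numbers are assumptions for illustration, and this is not the MATLAB code used in the project.

```python
def penalized_error(z_data, z_model, theta, alpha):
    """Mean squared residual plus alpha times the squared size of the parameters."""
    q = len(z_data)
    residual = sum((zd - zc) ** 2 for zd, zc in zip(z_data, z_model)) / q
    penalty = alpha * sum(p * p for p in theta)
    return residual + penalty

# Hypothetical log2 expression values (data vs. model) and parameter vector.
z_data = [0.0, 1.2, 0.8, 0.1]
z_model = [0.1, 1.0, 0.9, 0.0]
theta = [1.0, -2.0, 0.5]

# Larger alpha pulls the optimizer toward small parameters at the cost of fit;
# scanning alpha and plotting residual against penalty size gives an L-curve.
for alpha in (0.0, 0.01, 0.1):
    print(alpha, penalized_error(z_data, z_model, theta, alpha))
```

Without the penalty, parameters that the data barely constrain can drift without changing the error, which is consistent with the observation that not all of the error plots had clear minima.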

Forward Simulation of the Model Fits the Microarray Data

Systems Biology Workflow: visualizing the results (tool: GRNsight)

GRNsight Rapidly Generates GRN graphs Using Our Customizations to the Open Source D3 Library

GRNsight: 10 milliseconds to generate, 5 minutes to arrange

Adobe Illustrator: several hours to create

GRNsight: colored edges for weights reveal patterns in data

The First Round of Modeling Has Suggested Future Experiments

http://www.openwetware.org/wiki/Dahlquist:BioQUEST_Summer_Workshop_2015

95% of Bioinformatics is Getting Your Data into the Correct File Format

• Exposes deficiencies in computer literacy skills in so-called “digital natives”

• When you leave your comfort zone, it is, by definition, uncomfortable

• Emphasis on research process
− Teamwork
− Electronic lab notebook
− Keeping track of files and code
− Trouble-shooting problems that arise in the research process: bugs, data issues, etc.

Summary

• An open science ecosystem enhances student learning

• Quick example: XMLPipeDB project in a Biological Databases course

• Longer example: GRNmap project in Biomathematical Modeling course

• Potential research projects for BioQUEST participants

• Challenges are also opportunities
– Computer literacy
– Data literacy
– Information literacy

Acknowledgments

Ben G. Fitzpatrick, LMU Math

John David N. Dionisio, LMU Computer Science

Juan Carrillo, Natalie Williams, K. Grace Johnson, Kevin Wyllie, Kevin McGee

Monica Hong, Nicole Anguiano, Anindita Varshneya, Trixie Roque, (Tessa Morris)

Special thanks to John Jungck & Sam Donovan
