Overview of I519 & Introduction to Bioinformatics

Upload: randell-douglas

Post on 13-Jan-2016


Page 1: Overview of I519 & Introduction to Bioinformatics

Overview of I519 & Introduction to Bioinformatics

Page 2: Overview of I519 & Introduction to Bioinformatics

Structure of I519

Two classes and one lab each week

Python, C (and a little bit of R)

Textbook: Understanding Bioinformatics

Homework assignments (~5 in total)

Grading:

– midterm exam (25%) + final exam (25%) + assignments (30%) + class project (15%) + attendance (5%)

Course webpage: http://darwin.informatics.indiana.edu/col/courses/I519-12/

Page 3: Overview of I519 & Introduction to Bioinformatics

What’s Bioinformatics

"Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information." (NCBI)

"I do not think all biological computing is bioinformatics, e.g. mathematical modelling is not bioinformatics, even when connected with biology-related problems. In my opinion, bioinformatics has to do with management and the subsequent use of biological information, particular genetic information." (Durbin)

What’s bioinformatics

Page 4: Overview of I519 & Introduction to Bioinformatics

Bioinformatics vs Computational Biology

Almost interchangeable

Computational biology may be broader

– Computational biology is an interdisciplinary field that applies the techniques of computer science, applied mathematics and statistics to address biological problems (Wikipedia)

– Includes bioinformatics


Page 5: Overview of I519 & Introduction to Bioinformatics

Impacts of Bioinformatics

On biological sciences (and medical sciences)

– Large-scale experimental techniques

– Information growth

On computational sciences

– Biology has become a large source of new algorithmic and statistical problems!


Page 6: Overview of I519 & Introduction to Bioinformatics

Related Fields

Proteomics/genomics (metagenomics)/comparative genomics/structural genomics

Chemical informatics

Health informatics/Biomedical informatics

Complex systems

Systems biology

Biophysics

Mathematical biology

– tackles biological problems using methods that need not be numerical and need not be implemented in software or hardware


Page 7: Overview of I519 & Introduction to Bioinformatics

Bioinformatics Problems/Applications

Figure from “Bioinformatics For Dummies”


Page 8: Overview of I519 & Introduction to Bioinformatics

Biology Primer

Figure 1-1 Molecular Biology of the Cell

Multicellular organisms

Eggs

Cell divisions

Underlying the diversity of life is a striking unity: DNA is the universal genetic language, and cells are the basic units of structure and function

Biology primer

Page 9: Overview of I519 & Introduction to Bioinformatics

Cells are the Basic Unit of Life

Cell Theory

– All organisms are made up of cells

– The cell is the basic living unit of organization for all organisms

– All cells come from pre-existing cells by division

– Cells contain hereditary information, which is passed from cell to cell during cell division

– All cells are basically the same in chemical composition

– All energy flow (metabolism & biochemistry) of life occurs within cells

Organisms can consist of a single cell or of multiple cells (multicellular organisms)

– Most living organisms are single cells (e.g., E. coli, yeast)

– Multicellular organisms: a human, for example, has more than 10^13 cells. To put this number in perspective, the world population as of July 2008 was 6.684 billion (1 billion = 10^9)
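To make the comparison concrete, the two figures quoted on this slide can simply be divided. The snippet below is a minimal sketch that uses only those two numbers; the variable names are illustrative.

```python
# Figures quoted on the slide
human_cells = 10**13               # lower bound on the cells in one human body
world_population = 6.684 * 10**9   # people, as of July 2008

# How many 2008 world populations would match the cell count of one person?
ratio = human_cells / world_population
print(f"One human has roughly {ratio:,.0f} times as many cells "
      f"as the world had people in 2008")
```

In other words, a single human body holds on the order of 1,500 times as many cells as there were people on Earth.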


Page 10: Overview of I519 & Introduction to Bioinformatics

Cell Structures

Animal cell structure: http://hyperphysics.phy-astr.gsu.edu/hbase/biology/imgbio/cellhlabel.gif

Prokaryotic cell structure: http://micro.magnet.fsu.edu/cells/procaryotes/images/procaryote.jpg


Page 11: Overview of I519 & Introduction to Bioinformatics

Scale Down to the Atomic Level

Figures 9-1 and 9-2, Molecular Biology of the Cell (figure label: Cell)


Page 12: Overview of I519 & Introduction to Bioinformatics

The Central Dogma

DNA → RNA (transcription) → Protein (translation)

Diagram labels: RNA virus, retrovirus

The flow of genetic information in cells is from DNA to RNA to protein. All cells, from bacteria to humans, express their genetic information in this way—a principle so fundamental that it is termed the central dogma of molecular biology.
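The DNA → RNA → protein flow can be mimicked on strings. The sketch below is a simplification and not part of the original slides: it starts from the coding strand (so transcription reduces to replacing T with U, whereas a real polymerase reads the template strand), it translates from the first base rather than searching for a start codon, and it ignores splicing. The function names are illustrative; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, built from the conventional T-C-A-G codon ordering.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA from its first base until a stop codon or the end."""
    dna_like = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna_like) - 2, 3):
        aa = CODON_TABLE[dna_like[i:i + 3]]
        if aa == "*":          # stop codon: release the chain
            break
        protein.append(aa)
    return "".join(protein)


mrna = transcribe("ATGGCCTGGTAA")
print(mrna)             # AUGGCCUGGUAA
print(translate(mrna))  # MAW
```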


Page 13: Overview of I519 & Introduction to Bioinformatics

DNA and Replication

Figure 1-2 Molecular Biology of the Cell, Fifth Edition


Page 14: Overview of I519 & Introduction to Bioinformatics

From DNA (to RNA) to Protein


Page 15: Overview of I519 & Introduction to Bioinformatics

The Genetic Code


Page 16: Overview of I519 & Introduction to Bioinformatics

Genome

Definition

– The genome of an organism is its whole hereditary information, encoded in the form of DNA (or, for some viruses, RNA)

– Chromosome: a structure composed of a long DNA molecule and associated proteins; humans have 46 chromosomes

DNA sequences can be determined by various sequencing techniques

“Sequence first. Ask questions later.” Cell. 2002 Oct 4;111(1):13-6


Page 17: Overview of I519 & Introduction to Bioinformatics

Three (Super)Kingdoms

Characteristic                      Archaea    Bacteria   Eukaryotes
Predominantly multicellular         No         No         Yes
DNA structure                       circular   circular   linear
Cytoplasm is compartmentalized      No         No         Yes
Introns are present in most genes   No         No         Yes
Photosynthesis with chlorophyll     No         Yes        Yes
Histone proteins present in cell    Yes        No         Yes


Page 18: Overview of I519 & Introduction to Bioinformatics

Organisms at Pivotal Positions in the Tree of Life

E. coli: 1997

Worm: 1998

Fly: 2000

Source: Cell. 2002 Oct 4;111(1):13-6


Page 19: Overview of I519 & Introduction to Bioinformatics

Model Organisms

A model organism is a species that is extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the workings of other organisms.

Genetic models (with short generation times, such as the fruit fly and nematode worm), experimental models, and genomic models, with a pivotal position in the tree of life


Page 20: Overview of I519 & Introduction to Bioinformatics

Escherichia coli (E. coli)

E. coli, a common gut bacterium, is the most widely used organism in molecular genetics

Some strains of E. coli are capable of causing disease under certain conditions

Different strains of E. coli have been extensively studied

The whole genomes of several E. coli strains have been sequenced (e.g., K-12, O157:H7, HS)


Page 21: Overview of I519 & Introduction to Bioinformatics

The Genome of E. coli K-12

Figure 1-29 Molecular Biology of the Cell, Fifth Edition (© Garland Science 2008)

Circular DNA: a single, closed loop

Protein-coding genes

RNA genes

The whole genome was sequenced in 1997

Total: 4,639,221 bp


Page 22: Overview of I519 & Introduction to Bioinformatics

Caenorhabditis elegans

C. elegans is a eukaryote (a nematode, or roundworm)

It has a small genome (~97 megabases); the whole genome was sequenced in 1998

C. elegans is easy to maintain in the laboratory (in petri dishes) and has a fast and convenient life cycle

– The life span is 2-3 weeks

– It is a tiny (1 mm in length), transparent organism, and the developmental pattern of all 959 of its somatic cells has been traced

• somatic cell: any cell of a plant or animal other than cells of the germ line (from Greek soma, body)


Page 23: Overview of I519 & Introduction to Bioinformatics

Caenorhabditis elegans (Cont.)

Discovery of the mechanism of RNA interference in C. elegans (1998)

– Andrew Fire and Craig C. Mello shared the Nobel Prize in Physiology or Medicine in 2006

– Silencing was triggered efficiently by injected dsRNA, but weakly or not at all by sense or antisense single-stranded RNAs


Page 24: Overview of I519 & Introduction to Bioinformatics

Drosophila melanogaster (fruit fly)

It has been used as a model organism for over 100 years and is widely used to study genetics and developmental biology

– Small, with a simple diet

– Short life cycle: about two weeks

– Has large polytene chromosomes, whose barcode patterns of light and dark bands allow genes to be mapped accurately

It was chosen in 1990 as one of the model organisms to be studied under the auspices of the federally funded Human Genome Project

Whole genome sequenced in 2000

More than 10 Drosophila genomes have been sequenced

FlyBase: http://flybase.org/


Page 25: Overview of I519 & Introduction to Bioinformatics

Species Classification

Classification is the arrangement of organisms into orderly groups based on their similarities

Also known as taxonomy

Provides an accurate and uniform naming system


Page 26: Overview of I519 & Introduction to Bioinformatics

Linnaean System of Classification

Carolus Linnaeus (the “father of taxonomy”) introduced the first widely accepted hierarchical scheme, which today consists of 7 categories (kingdom, phylum, class, order, family, genus, and species), not including domain

Species is the most basic unit of biological classification (the word means “kind” in Latin)

– Each species is different, and reproduces itself faithfully

– Heredity is a central part of the definition of life

The Linnaean system uses two Latin name categories, genus and species, to designate each type of organism

– Salmonella saintpaul (which caused the latest food-borne disease outbreak)

– Capitalize the genus, but not the species; italicized in print


Page 27: Overview of I519 & Introduction to Bioinformatics

Homo sapiens

Domain: Eukaryotes

Kingdom: Metazoa (many-celled animals)

Phylum: Chordata (characterized by a notochord, nerve cord, and gill slits)

(subphylum: Vertebrata)

Class: Mammalia (warm-blooded vertebrates)

Order: Primates

Family: Hominidae

Genus: Homo

Species: sapiens

http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy

Mnemonic: King Philip Came Over For Gooseberry Soup


Page 28: Overview of I519 & Introduction to Bioinformatics

Gene/Protein Family

A protein/gene family is a group of evolutionarily related proteins/genes

Genes/proteins of the same family typically have similar functions (and, for proteins, similar structures) and share sequence similarity

There are far more genes/proteins than there are families, which shows the advantage of grouping genes/proteins into families


Page 29: Overview of I519 & Introduction to Bioinformatics

Evolution of Genes

New genes are generated from preexisting genes

– Intragenic mutation: a gene is modified by changes in its DNA sequence (errors that occur during DNA replication)

– Gene duplication: the two copies of a gene may then diverge in the course of evolution

– Segment shuffling

– Horizontal transfer


Page 30: Overview of I519 & Introduction to Bioinformatics

Analysis of Gene/Protein Families – Key Problems in Bioinformatics

Homolog detection

Alignment (the residue-level mapping among homologous genes/proteins)

Application of the alignments

– Detect the conserved residues (functional sites)

– Prediction of protein structures

– Motif finding (cis-elements)

Phylogeny

Function annotation

None of these problems have been solved!
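To illustrate what a residue-level mapping looks like, here is a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch approach). It is an illustration rather than the course's own code: the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions, and real tools use substitution matrices and more refined gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = global_align("ACGT", "AGT")
print(best)    # 2
print(top)     # ACGT
print(bottom)  # A-GT
```

Each column of the two output strings pairs one residue with another residue or with a gap, which is the mapping the slide refers to.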

More on what’s bioinformatics

Page 31: Overview of I519 & Introduction to Bioinformatics

Is Protein A Related/Similar to Protein B?

Sequence similarity (alignment!)

Structure similarity (structural comparison)

Co-expression (microarray data analysis)

Any type of correlation (operon structure, etc.)

You will see this question again and again!


Page 32: Overview of I519 & Introduction to Bioinformatics

Guilt by Association

Page 33: Overview of I519 & Introduction to Bioinformatics

Computational Abstractions: Biological Sequences as Strings

DNA RNA Protein Phylotype

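DNA: a string in a four-letter alphabet (A, C, G, T)

RNA: a string in a four-letter alphabet (A, C, G, U)

Protein: a string in a 20-letter alphabet (the amino acids)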
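Once sequences are treated as strings over a fixed alphabet, many basic operations become ordinary string processing. The following is a minimal sketch; the function names are illustrative, and the alphabets cover only the standard unambiguous letters (no N, no IUPAC ambiguity codes).

```python
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_valid(sequence, alphabet):
    """True if every character of the sequence belongs to the alphabet."""
    return set(sequence.upper()) <= alphabet


def reverse_complement(dna):
    """Return the opposite strand of a DNA string, read 5' to 3'."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def gc_content(dna):
    """Fraction of G and C letters in a DNA string (0.0 for an empty string)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


print(is_valid("GATTACA", DNA_ALPHABET))  # True
print(is_valid("GAUUACA", DNA_ALPHABET))  # False, U is not a DNA letter
print(reverse_complement("GATTACA"))      # TGTAATC
```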


Page 34: Overview of I519 & Introduction to Bioinformatics

Computational Abstractions: Networks (and Others) as Graphs

Protein-protein interaction network

Protein structures presented as graphs

Gene functions presented as graphs (Gene Ontology)

Metabolic pathways as graphs (directed)
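A graph of this kind can be held in a plain adjacency structure. The sketch below shows one possible representation, with undirected edges for an interaction network and directed edges for a pathway. The protein names P1 to P5 are placeholders, the helper names are illustrative, and the pathway fragment is the first two steps of glycolysis.

```python
from collections import defaultdict, deque


def build_graph(edges, directed=False):
    """Build an adjacency-set graph from (source, target) pairs."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
        graph[target]  # touch the key so that sink nodes are registered too
        if not directed:
            graph[target].add(source)
    return graph


def reachable(graph, start):
    """Return every node reachable from start, using breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Undirected protein-protein interaction network (placeholder protein names).
ppi = build_graph([("P1", "P2"), ("P2", "P3"), ("P4", "P5")])
print(sorted(ppi["P2"]))  # ['P1', 'P3']

# Directed pathway: edge direction follows the reaction order.
pathway = build_graph(
    [("glucose", "glucose-6-phosphate"),
     ("glucose-6-phosphate", "fructose-6-phosphate")],
    directed=True,
)
print(sorted(reachable(pathway, "glucose-6-phosphate")))
# ['fructose-6-phosphate', 'glucose-6-phosphate']
```

In the directed case, glucose is not reachable from glucose-6-phosphate, which is the property that distinguishes a pathway from an undirected interaction network.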


Page 35: Overview of I519 & Introduction to Bioinformatics

Large Scale Data Analysis

Genome scale

– Genome, proteome, transcriptome

Metagenome scale

– Metagenome, metaproteome, metatranscriptome


Page 36: Overview of I519 & Introduction to Bioinformatics

More than Implementation

Find old/new biological problems

– Remember that biology has become a large source of new algorithmic and statistical problems

Formulate as a computational problem

– Define inputs and outputs

– (though many papers work on already well-defined bioinformatics problems)

Apply existing algorithms and/or tools to solve your problem

Develop new ones if necessary

Implement your algorithms with appropriate programming language(s)
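As a small example of the steps above, take the biological question "where does a known motif occur in a sequence?". One way to formulate it, chosen here purely for illustration, is exact pattern matching: the input is a text string and a pattern string, and the output is the list of 0-based start positions of the pattern, overlaps included. The function name is hypothetical.

```python
def find_occurrences(text, pattern):
    """Input: two strings. Output: every 0-based start index of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one character only, so overlapping matches are kept.
        start = text.find(pattern, start + 1)
    return positions


print(find_occurrences("GATATATGCATATACTT", "ATAT"))  # [1, 3, 9]
```

Stating the input and output this precisely is what makes it possible to reuse an existing algorithm (here, the built-in string search) instead of writing a new one.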


Page 37: Overview of I519 & Introduction to Bioinformatics

Where Can I Get the Biological Data?

Sequences

– NCBI GenBank

– Swiss-Prot

Structures

– PDB

Genomes

– NCBI, IMG, GOLD

– Specialized genome resources

• Ensembl: selected eukaryotic genomes

Others

– KEGG, SEED (biological pathways)


Page 38: Overview of I519 & Introduction to Bioinformatics

Dealing with Databases

Databases are the backbone of bioinformatics research

Flat files were the first type of database and are still used today

Relational databases are good for searching purposes

Databases can contain data and annotations of data

– Primary and derived (secondary) data
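The contrast between a flat file and a relational database can be seen with a few lines of standard-library Python. The sketch below parses a tiny FASTA-formatted flat file held in memory and loads it into an in-memory SQLite table so that it can be searched with a query. The two records, the table layout, and the helper names are all made up for illustration.

```python
import sqlite3

# A tiny FASTA-formatted flat file held in memory; the records are made up.
FASTA_TEXT = """>seq1 hypothetical example record
ATGGCCTGG
TAA
>seq2 another made-up record
GATTACA
"""


def _split_header(header):
    """Split 'id free text' into (id, free text)."""
    identifier, _, description = header.partition(" ")
    return identifier, description


def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield (*_split_header(header), "".join(chunks))
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence lines of one record are concatenated
    if header is not None:
        yield (*_split_header(header), "".join(chunks))


records = list(parse_fasta(FASTA_TEXT))

# Load the same records into an in-memory relational table and query it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences (id TEXT PRIMARY KEY, description TEXT, seq TEXT)"
)
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", records)
long_ids = [row[0] for row in conn.execute(
    "SELECT id FROM sequences WHERE length(seq) > ? ORDER BY id", (8,))]
print(long_ids)  # ['seq1']
```

The flat file has to be read from top to bottom to answer any question, whereas the table can be filtered declaratively, which is the searching advantage noted above.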

Page 39: Overview of I519 & Introduction to Bioinformatics

Readings

Biology primer (available at the course website)

Anything about Python and/or C (if you have no programming experience at all)

What’s in the textbook?

– Chapter 1 (The Nucleic Acid World)

– Chapter 2 (Protein Structure)

– Chapter 3 (Dealing With Databases)