
© Van Belle Werner

25/11/2009 - Pg. 1 v1

15. Repetition / Overview

Dr. Werner Van Belle

1. Correlation
  Mathematical definition of average, variance, standard deviation
  Mathematical definition of the linear Pearson (L.P.) correlation
  Graphical interpretation of correlation
  Implementing a correlation routine given the mathematical definition
  What is numerical stability? Give an example using the variance
  How to deal with missing numbers in a correlation calculation?
  What does the significance of a correlation value mean?


Correlation Graphical Interpretation

Given a vector of n numbers

What is the average of this vector?


Step 1. Translation


Step 2. Variance normalization

Step 3. Covariance calculation

r=0.936
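The three steps above can be sketched in Python as follows. This is a minimal illustration, not the routine from the course: the name `correlate` follows the declaration used later in these slides, and the population (divide-by-n) normalization is an assumption.

```python
import math

def correlate(x, y):
    """Linear Pearson correlation, written as the three steps of the slides."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must have the same non-zero length")
    # Step 1: translation (subtract the average of each vector)
    avg_x = sum(x) / n
    avg_y = sum(y) / n
    dx = [v - avg_x for v in x]
    dy = [v - avg_y for v in y]
    # Step 2: variance normalization (divide by the standard deviation)
    sd_x = math.sqrt(sum(v * v for v in dx) / n)
    sd_y = math.sqrt(sum(v * v for v in dy) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    zx = [v / sd_x for v in dx]
    zy = [v / sd_y for v in dy]
    # Step 3: covariance of the two normalized vectors
    return sum(a * b for a, b in zip(zx, zy)) / n
```

Two perfectly proportional vectors give a value of 1 (up to rounding), and reversing one of them gives -1.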

1. Write a function that takes two inputs X and Y and returns the linear Pearson correlation between these two vectors.

2. Write a routine which reads a binary file containing a consecutive sequence of double-precision floats.

3. Modify your program to take two filenames as arguments and let it report the correlation value.

4. Write a program that generates two random vectors of size N and reports the correlation between those two random vectors.
   Let this program repeat this action 100 times and report the average absolute correlation.
   Investigate the effect of the group size N on the average reported correlation.

5. We have 10 vectors stored in 10 different files. Write a program to read these files and report a correlation matrix.
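Exercise 4 can be explored with a sketch like the one below. The function names, the uniform random numbers and the fixed seed are assumptions made for illustration. The point is only that small groups produce sizeable correlations by accident.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation of two equally long lists."""
    n = len(x)
    avg_x, avg_y = sum(x) / n, sum(y) / n
    cov = sum((a - avg_x) * (b - avg_y) for a, b in zip(x, y))
    var_x = sum((a - avg_x) ** 2 for a in x)
    var_y = sum((b - avg_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_abs_correlation(n, repeats=100, seed=0):
    """Average |r| between pairs of unrelated random vectors of size n."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    total = 0.0
    for _ in range(repeats):
        x = [rng.random() for _ in range(n)]
        y = [rng.random() for _ in range(n)]
        total += abs(pearson(x, y))
    return total / repeats

if __name__ == "__main__":
    for size in (5, 10, 100, 1000):
        print(size, average_abs_correlation(size))
```

The reported average shrinks as N grows, which is why a correlation value is hard to interpret without knowing the group size.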

What should we do with missing numbers?
What is the significance of a correlation?

Numerical Stability
  Variance calculations are notorious for numerical instability

A Simple Algorithm

    for (int i = 0; i < n; i++)
        var += (x[i] - avg) * (x[i] - avg);
    var /= n;

Example inputs:
  100'000 100'000 100'000 100'000 100'000
  1 1 1 1 1 1 1 1 1 1 1 1 1
  10**10 20**10 30*10 40**10 50*10 50**10+1 ? ...

B. P. Welford, "Note on a Method for Calculating Corrected Sums of Squares and Products", Technometrics, Vol. 4, No. 3 (Aug. 1962), pp. 419-420

Incremental online algorithm
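A minimal sketch of the incremental (Welford-style) update is given below. For contrast it is shown next to the textbook "mean of squares minus square of the mean" formula, which is an illustration chosen here and not the simple algorithm from the previous slide. Both function names are hypothetical.

```python
def sum_of_squares_variance(xs):
    """Textbook formula E[x^2] - E[x]^2; loses precision for large offsets."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(x * x for x in xs) / n - mean * mean

def welford_variance(xs):
    """Population variance computed in one pass with running updates."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n

if __name__ == "__main__":
    data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]  # true variance: 22.5
    print(sum_of_squares_variance(data), welford_variance(data))
```

The online version never forms the huge squared sums, so the small spread of the data is not swamped by the large offset.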

2. Multi Dimensional Correlations
  What is 2D gel electrophoresis?
  What is a function/method declaration?
  Exercise on converting a single-dimensional routine to multiple dimensions
  Correlation is not causation
  Correlation is not linear regression
  Correlations can be accidental
  Both the absence of correlation and high/low correlations can be indicative

2D Gel Electrophoresis: First dimension

[Figure: gel strip with a pH gradient from the acid side (pH 5) through neutral (pH 7) to the base side (pH 9); the protein mixture is applied to the strip.]

2D Gel Electrophoresis: Iso Electric Focusing

[Figure: the protein mixture on the strip, acid side (pH 5) to neutral (pH 7).]

Voltage program: 40' at 200 V, 30' at 450 V, 30' at 750 V, 60' at 2000 V

2D Gel Electrophoresis: Transfer onto 2nd gel

[Figure: the pH-separated protein mixture is transferred onto the second gel.]

2D Gel Electrophoresis: Transfer onto 2nd gel

[Figure: the pH-separated protein mixture undergoes a time-based mass separation.]

2D Gel Electrophoresis: Washing/Drying/Staining

[Figure: the pH/mass-separated protein mixture in 'staining' fluid.]


2D Gel Electrophoresis Capturing

2D Gels of multiple patients

Courtesy Gry Sjøholt, Nina Ånensen & Bjørn Tore Gjertsen

Patient #1, Liver Size: 57

Patient #2, Liver Size: 46

Given a stack of images: which areas correlate with our patients' tumor growth, life expectancy, etcetera?

Reading The Image

    int img_sx
    int img_sy
    byte[,] read_image(String filename)

  Reads the image from the provided filename.
  When img_sx is not set, sets img_sx and img_sy to the size of the image being loaded.
  When img_sx is set, the image must have the same size; otherwise null is returned.
  If the image does not exist, null is returned as well.
  The byte array is ordered as image[x, y].

Exercise
  Import your correlation routine from last time. It should have the following declaration:

    float correlate(float[] X, float[] Y, int n)
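Converting the one-dimensional routine to images amounts to running it once per pixel position across the stack. The sketch below assumes images are nested lists indexed as `image[x][y]`, all of the same size, with one outcome value (for example liver size) per image. `correlation_map` and `pearson` are hypothetical names, and returning 0.0 for constant pixels is a choice made here.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    avg_x, avg_y = sum(x) / n, sum(y) / n
    cov = sum((a - avg_x) * (b - avg_y) for a, b in zip(x, y))
    var_x = sum((a - avg_x) ** 2 for a in x)
    var_y = sum((b - avg_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlation_map(images, outcome):
    """Correlate every pixel position across the image stack with the outcome."""
    size_x = len(images[0])
    size_y = len(images[0][0])
    result = [[0.0] * size_y for _ in range(size_x)]
    for x in range(size_x):
        for y in range(size_y):
            pixel_values = [image[x][y] for image in images]
            result[x][y] = pearson(pixel_values, outcome)
    return result
```

The result has the same shape as one image, so it can be displayed or masked like any other image in the stack.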


P53 Biosignature vs Liver size


Masking


Significance


Significance Mask


Variance


Variance Mask


Overall Mask


Overall Mask


P53 Biosignature vs Liver size

3. Nucleotides to Amino Acids
  Various biological terms briefly explained: Prokaryotes / Eukaryotes / Chromosomes / Chromatid / Chromatin / Karyotyping / Diploid / Haploid / Gametes
  Where is the genetic material stored?
  Nucleotides / Amino acids
  Complement, reverse sequence
  Proteins
  Translation
  Reading frames, open reading frames

Cells
  Prokaryotic - no nucleus (bacteria)
  Eukaryotic - with nucleus (plants/animals)

The nucleus
  Contains the genetic material
  Genetic material can be in two states
    Heterochromatin / Euchromatin (a diffuse state which makes the DNA accessible)
    Chromosomes


Chromosomes

Karyotyping

Humans
  23 chromosome pairs: diploid
  22 pairs of autosomes and 1 pair of sex chromosomes

Other organisms can have different layouts
  Tetraploid
  Hexaploid
  Octoploid

Chromosome
  Various chromosome layouts
  Somatic cells - diploid
    One set from mother
    One set from father
  Gametes - haploid
    Mother -or- Father
    Gametes do not have the same genetic code
  Autosomes in diploid cells are not strictly identical, although 99% is the same

Chromatids
  1 - chromatid
  2 - centromere
  3 - p-arm (short)
  4 - q-arm (long)
  5 - telomeres

  The double-chromatid state exists only after DNA replication, until the chromatids separate during cell division


Chromosome ↔ DNA

Chromosome ↔ DNA


DNA

Nucleotides: A, C, T & G
Paired nucleotides: base pairs A-T, C-G (complementary bases)
Standard read direction: from the 5' end to the 3' end
Forward / reverse strands

Genes
  Gene identification is problematic
    Position identifiers are not unique
    Sequences are not completely unique
    A biologists' agreement on terminology: 'somewhere around this area', having 'largely' this sequence
    Similar sequences across species

Genes
  Specific areas in the genome (loci) have meaning and translate to proteins afterward
  Number of bases in the genome?
  Number of genes in the genome?

Proteins
  3D structures / molecular machines with specific possibilities

Amino Acids
  Proteins consist of a sequence of 20+ amino acids
    Alanine (Ala, A), Cysteine (Cys, C), Aspartic Acid (Asp, D), Glutamic Acid (Glu, E), Phenylalanine (Phe, F), Glycine (Gly, G), Histidine (His, H), Isoleucine (Ile, I), Lysine (Lys, K), Leucine (Leu, L), Methionine (Met, M), Asparagine (Asn, N), Proline (Pro, P), Glutamine (Gln, Q), Arginine (Arg, R), Serine (Ser, S), Threonine (Thr, T), Valine (Val, V), Tryptophan (Trp, W), Tyrosine (Tyr, Y)
    Selenocysteine, pyrrolysine (rare)

Essential Amino Acids
  Proteins consist of a sequence of 20 amino acids
    Alanine (Ala, A), Cysteine (Cys, C), Aspartic Acid (Asp, D), Glutamic Acid (Glu, E), Phenylalanine (Phe, F), Glycine (Gly, G), Histidine (His, H), Isoleucine (Ile, I), Lysine (Lys, K), Leucine (Leu, L), Methionine (Met, M), Asparagine (Asn, N), Proline (Pro, P), Glutamine (Gln, Q), Arginine (Arg, R), Serine (Ser, S), Threonine (Thr, T), Valine (Val, V), Tryptophan (Trp, W), Tyrosine (Tyr, Y)
  Essential for humans (must come from the diet): Phenylalanine, Histidine, Isoleucine, Lysine, Leucine, Methionine, Threonine, Valine, Tryptophan

DNA → Protein
  Codons (3-nucleotide sequences) translate to amino acids
  DNA is copied to RNA, which is moved out of the nucleus (T → U)
  Polymerases copy the DNA sequence into RNA, which is then translated into proteins
    Several polymerases exist. The most common one for protein-coding genes is RNA polymerase II


Translation Table

Reading frames

    UCU AAA AUG GGU GAC
    .CU AAA AUG GGU GAC  read as  CUA AAA UGG GUG AC
    ..U AAA AUG GGU GAC  read as  UAA AAU GGG UGA C

An open reading frame (ORF) is a reading frame that contains
  a start codon
  a subsequent region which usually has a length which is a multiple of 3 nucleotides
  a stop codon at its end.
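The three forward reading frames of a sequence can be produced by starting the codon split at offset 0, 1 or 2. The helper below is a small sketch; `reading_frames` is a name introduced here for illustration.

```python
def reading_frames(sequence):
    """Split a nucleotide sequence into codons for each of the 3 forward frames."""
    frames = []
    for offset in range(3):
        codons = [sequence[i:i + 3] for i in range(offset, len(sequence), 3)]
        frames.append(codons)
    return frames

if __name__ == "__main__":
    for codons in reading_frames("UCUAAAAUGGGUGAC"):
        print(" ".join(codons))
```

The last element of a frame may be shorter than three nucleotides; such an incomplete codon is simply not translated.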

Exercise
  Create a routine that calculates the complement of a DNA sequence
  Create a routine that calculates the reverse of a DNA sequence
  Create a routine that translates a DNA sequence into an amino acid sequence
  Let the program try each reading frame and report the sequence with the longest distance to the first stop codon
  Now also include the complement, reverse and reverse complement sequences.
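One possible shape of these routines is sketched below. It assumes upper-case DNA over A, C, G, T and the standard genetic code, with `*` standing for a stop codon. All function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def complement(dna):
    """Replace every base by its pairing partner (A<->T, C<->G)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))

def reverse(dna):
    """Return the sequence read backwards."""
    return dna[::-1]

def translate(dna, frame=0):
    """Translate complete codons starting at the given frame offset (0, 1 or 2)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3))

def longest_before_stop(dna):
    """Try all strands and frames; return (strand, frame, peptide up to first stop)."""
    candidates = {
        "forward": dna,
        "reverse": reverse(dna),
        "complement": complement(dna),
        "reverse complement": reverse(complement(dna)),
    }
    best = None
    for strand, sequence in candidates.items():
        for frame in range(3):
            peptide = translate(sequence, frame).split("*")[0]
            if best is None or len(peptide) > len(best[2]):
                best = (strand, frame, peptide)
    return best
```

On ties the first candidate found is kept, so the forward strand in frame 0 wins when nothing is strictly longer.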

4. Exons, Introns, Splice Variants
  Translation mechanisms in eukaryotes
  Splice variants; exons / introns
  Ensembl
  Browsing of this type of information
  Designing probes to detect specific splice variants

Transcription / Translation
  Prokaryotes
    Transcription: polymerase copies the gene into an RNA strand (mRNA)
    Translation: the mRNA is then used to generate peptide chains
    These peptide chains then fold into proteins
  Problem for eukaryotes
    DNA stays in the nucleus
    Proteins are mainly in the cytoplasm

Eukaryotic Translation
  Transcribe DNA to pre-mRNA
  Process pre-mRNA to mRNA
    Adding caps (5' cap, polyadenylation)
    Splicing (select certain parts of the pre-mRNA)
    Editing (nucleotide modifications)
  Transport mRNA to cytoplasm
  Translate mRNA to proteins


The Process

Splicing
  Converts the freshly transcribed strand (pre-mRNA) to a new strand (mRNA)
  Removes certain areas (introns) and joins the others (exons) together


m-RNA

One Gene, One Protein ?
  Non-coding genes
  Alternative splicing: one gene can have multiple splice variants leading to different proteins
  Monocistronic mRNA: the mRNA codes for only one protein
  Polycistronic mRNA: the mRNA codes for multiple proteins (operon)

TP53
  In human: on which chromosome is it located?
  Does this area in the genome also overlap with other genes?
  How many splice variants does the TP53 gene have?
  How many bases is the gene long?
  How many exons does the gene have?
  Is there a transcript which includes all exons?
  What is the sequence of the shortest transcript? [TP53-205]

Some other Genes
  Genes: 5HTT, BRCA2, Wingless
  Alternative names?
  Chromosome location?
  Overlapping genes at this position?
  Splice variants?
  Do all exons transcribe in the same direction?

Exercise
  We want to design a probe that will uniquely detect a specific splice variant (target)
  We have a large table of all existing splice variants in human, together with their gene name and variant number (ENSTxxx)
  Given a length L, we want to find the first subsequence of the target that does not match any of the other existing splice variants
    The subsequence is of length L
    The splice variant table will not contain the target itself
  We also want the shortest sufficient probe to detect the target
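A brute-force sketch of this exercise follows. It treats sequences as plain strings and "does not match" as "is not a substring of any other variant", which is an assumption; both function names are hypothetical, and a real variant table would need an index rather than repeated substring searches.

```python
def first_unique_probe(target, others, length):
    """First subsequence of the target, of the given length, absent from all others."""
    for start in range(len(target) - length + 1):
        candidate = target[start:start + length]
        if not any(candidate in other for other in others):
            return start, candidate
    return None

def shortest_probe(target, others):
    """Grow the probe length until some subsequence identifies the target."""
    for length in range(1, len(target) + 1):
        hit = first_unique_probe(target, others, length)
        if hit is not None:
            return hit
    return None

if __name__ == "__main__":
    print(shortest_probe("ACGTAC", ["ACGT", "GTA"]))
```

In the small example every subsequence of length 1 or 2 also occurs in another variant, so the shortest usable probe has length 3.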

5. Ensembl SQL Part 1
  What is a database, schema, table, row, column, attribute, value
  Ensembl stable IDs
  Ensembl genes, mapping to stable ids
  Transcripts, translations, exons
  One-many relationships across tables

Relational Databases
  Database server
  Database, aka [database] schema
  Tables
  Columns with specific types
  Rows
  Values can be NULL or real values

Joining tables

    SELECT *
    FROM TABLE1 JOIN TABLE2 USING (Y)

    SELECT *
    FROM TABLE1 t1
    JOIN TABLE2 t2
    WHERE t1.Y = t2.Y
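Both join forms select the same row combinations. A quick way to try this without access to a MySQL server is Python's built-in `sqlite3` module. The two toy tables and their contents below are invented for illustration, and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (x TEXT, y INTEGER);
    CREATE TABLE table2 (y INTEGER, z TEXT);
    INSERT INTO table1 VALUES ('a', 1), ('b', 2), ('c', 3);
    INSERT INTO table2 VALUES (1, 'one'), (2, 'two'), (4, 'four');
""")

# Form 1: the shared column is named once in USING
using_rows = conn.execute(
    "SELECT x, y, z FROM table1 JOIN table2 USING (y) ORDER BY y"
).fetchall()

# Form 2: all combinations, filtered by an explicit WHERE condition
where_rows = conn.execute(
    "SELECT t1.x, t1.y, t2.z FROM table1 t1 JOIN table2 t2 "
    "WHERE t1.y = t2.y ORDER BY t1.y"
).fetchall()

print(using_rows)
print(where_rows)
```

Rows without a partner in the other table (y = 3 and y = 4 here) disappear from the result in both forms.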

Ensembl
  Provides very structured biological information
  Integrates many different data sources
  Cares about metadata
    Keeps track of different versions, releases
    Keeps track of where the data came from
    Keeps track of how a specific analysis was performed
  Access using Mysql-query-browser
    Host: ensembldb.ensembl.org
    Port: 3306 (up to #47) or 5306 (#48 onwards)
    Login: anonymous
    Password: <none>

Schemata
  Each organism has its own collection of databases
    homo_sapiens_core_47_36i
    homo_sapiens_cdna_<x>_<y>
    homo_sapiens_core_expression_est_<x>_<y>
    homo_sapiens_core_expression_gnf_<x>_<y>
    homo_sapiens_disease_<x>_<y>
    homo_sapiens_est_<x>_<y>
    homo_sapiens_estgene_<x>_<y>
    homo_sapiens_funcgen_<x>_<y>
    homo_sapiens_haplotype_<x>_<y>
    homo_sapiens_lite_<x>_<y>
    homo_sapiens_otherfeatures_<x>_<y>
    homo_sapiens_variation_<x>_<y>
    homo_sapiens_vega_<x>_<y>

Stable Gene Identifiers
  Table gene_stable_id
    gene_id - the current gene identification, acts as (part of the) primary key in many tables; a number
    stable_id - the publicly visible ENSG<xxx> identifier
    creation_date - when was this gene introduced?
    version - what is the current version of the gene?
    modified_date - when was the last change?

Genes
  Table gene
    gene_id
    biotype: protein coding or not
    seq_region_id
    seq_region_start
    seq_region_end
    seq_region_strand
    display_xref_id
    source: where did it come from
    status: KNOWN/NOVEL
    description: human readable

Gene to Stable id mapping
  Write a query that will map a gene to its stable id.
  The output should contain gene_id, biotype, seq_region_id, seq_region_start, seq_region_end, seq_region_strand, display_xref_id, source, status, description and of course the stable_id

Transcript
  Table transcript
    transcript_id - these ids can coincide with gene_ids. Do not mix them!
    gene_id
    seq_region_id / seq_region_start / seq_region_end / seq_region_strand
    display_xref_id
    biotype
    status
    description
  Table transcript_stable_id
    stable_id - something like ENST000...
    transcript_id

Translation
  Table translation
    translation_id
    transcript_id
    seq_start
    start_exon_id
    seq_end
    end_exon_id
  Table translation_stable_id
    stable_id - something like ENSP0000...
    translation_id

Exon
  Table exon
    exon_id
    seq_region_id / seq_region_start / seq_region_end / seq_region_strand
    phase
    end_phase
  Table exon_transcript: maps the transcript to something ?
    exon_id
    transcript_id
    rank - exon number (1 to 10, 13, 170)


One to 0,1,+,* ?

Genes ↔ Transcript ↔ Exon
  Obtain a table with 3 columns
    gene_id
    transcript_id
    exon_id
  Start out with a table that lists all
    gene_id
    transcript_id
  Then extend the table with the exons belonging to that transcript
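The two-stage approach can be tried on toy data. The sketch below uses Python's `sqlite3` with two tiny tables that borrow the column names of the Ensembl `transcript` and `exon_transcript` tables; the rows are invented and only the columns needed for the join are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transcript (transcript_id INTEGER, gene_id INTEGER);
    CREATE TABLE exon_transcript (exon_id INTEGER, transcript_id INTEGER);
    INSERT INTO transcript VALUES (10, 1), (11, 1), (12, 2);
    INSERT INTO exon_transcript VALUES (100, 10), (101, 10), (100, 11), (102, 12);
""")

# Stage 1: every gene with each of its transcripts
pairs = conn.execute(
    "SELECT gene_id, transcript_id FROM transcript ORDER BY transcript_id"
).fetchall()

# Stage 2: extend each row with the exons of that transcript
triples = conn.execute(
    "SELECT gene_id, transcript_id, exon_id "
    "FROM transcript JOIN exon_transcript USING (transcript_id) "
    "ORDER BY transcript_id, exon_id"
).fetchall()

print(pairs)
print(triples)
```

Exon 100 shows up under two transcripts of gene 1, which is the one-to-many situation that makes exon/transcript combinations outnumber unique exons.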


Genes ↔ Transcript ↔ Exon

6. Ensembl SQL Part 2
  Various mappings: genes to proteins
  Various grouping operations: averages, maxima, minima, counts, etcetera
  Ensembl regions and chromosome information

Genes ↔ Protein Mapping
  Write a query to map genes to potential proteins.
  The output table should contain
    A stable gene identifier
    A stable protein identifier

    SELECT G.stable_id AS gen, T.stable_id AS protein
    FROM gene
    JOIN transcript USING (gene_id)
    JOIN translation USING (transcript_id)
    JOIN translation_stable_id T USING (translation_id)
    JOIN gene_stable_id G USING (gene_id)
    LIMIT 10


Genes ↔ Protein Mapping

Averages
  What is the average number of transcripts per gene?
  What is the average number of exons per transcript?

Based on release #47 of the database:
  33761 unique genes
  57365 unique transcripts
  503655 unique exon/transcript combinations
  288309 unique exons

Which gives 1.7 transcripts per gene
and 8.78 exons per transcript,
but only 8.5 unique exons per gene

Largest Gene
  Which gene has the most transcripts?

    SELECT COUNT(DISTINCT t.transcript_id) AS tcount, si.stable_id, g.gene_id
    FROM gene g
    JOIN transcript t USING (gene_id)
    JOIN gene_stable_id si USING (gene_id)
    GROUP BY gene_id
    ORDER BY tcount DESC
    LIMIT 10

  ENSG00000154556 with 44 transcripts

Transcript with the most exons
  Which transcript has the most exons?

    SELECT MAX(rank) ecount, si.stable_id, t.transcript_id
    FROM transcript t
    JOIN exon_transcript et USING (transcript_id)
    JOIN transcript_stable_id si USING (transcript_id)
    GROUP BY transcript_id
    ORDER BY ecount DESC
    LIMIT 10

  ENST00000356127 with 313 exons

Regions
  The region codes in Ensembl can be a variety of things.
  Table seq_region
    seq_region_id
    name - can be a chromosome name
    coord_system_id
    length

Largest area covered
  Which gene covers the largest area in the genome? On which chromosome?

    SELECT seq_region_end - seq_region_start AS L, stable_id, gene_id, name
    FROM gene
    JOIN gene_stable_id USING (gene_id)
    JOIN seq_region USING (seq_region_id)
    ORDER BY L DESC
    LIMIT 10

  Answer: ENSG00000174469 with 2'304'637 bases


Largest area

Create a table of TSS
  Retrieve a list of potential transcription start sites and the gene to which they belong

    SELECT t.seq_region_start, t.seq_region_strand, stable_id
    FROM transcript t
    JOIN gene USING (gene_id)
    JOIN gene_stable_id USING (gene_id)

7. Ensembl Identifiers
  How are external identifiers represented in Ensembl?
    object_xref - links Ensembl objects to external names
    xref - remembers the external object name
    external_db - keeps track of a variety of different databases
  How could we add a new nomenclature?
  Map one nomenclature to Ensembl identifiers
  Mapping exercises from one nomenclature to another
  Some nomenclatures have names for genes as well as translations
  Mapping Uniprot to HGNC

External Databases
  Table external_db
    external_db_id - primary key
    db_name - database name
    db_release - version
    status - predicted, known, cross referenced, orthologue mapped, etcetera
    type - misc, array, etcetera
    db_display_name - how to print this database name

External Databases

    SELECT * FROM external_db
    WHERE db_name = 'HGNC'

  external_db_id: 1100
  db_name: HGNC
  db_display_name: HGNC symbol
  type: primary_db_synonym

External Names
  Table xref
    xref_id - the cross reference primary key
    external_db_id - the external database key
    dbprimary_acc - the 'primary key' in the external database
    display_label - how to print this gene identifier
    description - a description according to the external database

External Names to Stable Ids
  General purpose table object_xref
    ensembl_id: the Ensembl internal id (gene_id for instance)
    ensembl_object_type: translation, transcript, gene
    xref_id: id from the xref table
    linkage_annotation

Adding new nomenclature
  Create an external_db entry
  For each gene <A> in the nomenclature, allocate the name in the xref table
    Gene A → xref_id 764
    Gene B → xref_id 987
    ...
  For each external gene A, B, ... in the nomenclature, map it to the database
    A → ENSG78646 → gene_id=76689
    B → ENSG98768 → gene_id=7577
  Link A, B to ENSG... through an object_xref of type Gene
    764, 76689, Gene
    987, 7577, Gene

External Names to Stable Ids

    SELECT * FROM xref x
    JOIN object_xref o
    WHERE external_db_id = 1100
      AND x.xref_id = o.xref_id
    LIMIT 0,1000

  xref_id: 1793295
  external_db_id: 1100
  dbprimary_acc: 21076
  display_label: TMEM14A
  description: transmembrane protein 14A
  info_type: Dependent
  info_text: Generated via NP_054770
  ensembl_id: 18971
  ensembl_object_type: Gene
  xref_id: 1793295
  linkage_annotation: NULL

HGNC to Ensembl

    SELECT *
    FROM xref x
    JOIN object_xref o
    JOIN gene_stable_id g
    WHERE external_db_id = 1100
      AND x.xref_id = o.xref_id
      AND ensembl_id = g.gene_id
      [AND display_label = "CXYorf1"]
    LIMIT 0,1000

  display_label: CXYorf1, ensembl_id: 7888, ensembl_object_type: Transcript
  display_label: CXYorf1, ensembl_id: 4373, ensembl_object_type: Gene

  We obtain the wrong results! Be aware that ensembl_id cannot always be mapped to a gene_id.

HGNC to Ensembl

    SELECT *
    FROM xref x
    JOIN object_xref o
    JOIN gene_stable_id g
    WHERE external_db_id = 1100
      AND x.xref_id = o.xref_id
      AND ensembl_id = g.gene_id
      AND ensembl_object_type = 'Gene'

  Results in 18524 identifiers, with only 18107 unique Ensembl identifiers

Ensembl to HGNC
  Find all Ensembl genes that have no existing HGNC name
  First: map all the stable_ids through the object_xref table to the xref identities

    SELECT * FROM gene_stable_id g
    JOIN object_xref xr
    JOIN xref x USING (xref_id)
    WHERE xr.ensembl_object_type = 'Gene'
      AND xr.ensembl_id = g.gene_id
    LIMIT 100

Ensembl to HGNC
  Second: take only those that belong to the HGNC nomenclature (1100)

    SELECT * FROM gene_stable_id g
    JOIN object_xref xr
    JOIN xref x USING (xref_id)
    WHERE xr.ensembl_object_type = 'Gene'
      AND xr.ensembl_id = g.gene_id
      AND external_db_id = 1100
    LIMIT 100

Ensembl to HGNC
  Third: modify the query to only list the existing identifiers

    SELECT DISTINCT g.stable_id
    FROM gene_stable_id g
    JOIN object_xref xr
    JOIN xref x USING (xref_id)
    WHERE xr.ensembl_object_type = 'Gene'
      AND xr.ensembl_id = g.gene_id
      AND external_db_id = 1100

Ensembl to HGNC
  Fourth: get rid of all the existing identifiers from the full stable_id list.

    SELECT stable_id FROM gene_stable_id
    WHERE stable_id NOT IN (
        SELECT DISTINCT g.stable_id
        FROM gene_stable_id g
        JOIN object_xref xr
        JOIN xref x USING (xref_id)
        WHERE xr.ensembl_object_type = 'Gene'
          AND xr.ensembl_id = g.gene_id
          AND external_db_id = 1100)

  15654 genes have no HGNC identifier

HGNC to Uniprot Mapping
  Write a query that will map each known HGNC identifier to a Uniprot identifier
  Problem
    HGNC deals with genes
    Uniprot deals with proteins ('Translation')

HGNC → Uniprot
  Map each HGNC identifier that is a gene to an Ensembl translation identifier
  Map each HGNC identifier that is a transcript to an Ensembl translation identifier
  Map each of those identifiers to a Uniprot identifier

8. Simulating Realtime PCR
  What is PCR?
  Wrote a small simulation with a limit on the material copied
  How to calculate the volume after x cycles when the initial amount was I?
  CT/CP values - how to go back from a CT value to the initial amount
  Simulated the effect of less than 100% efficiency

PCR

Enzyme + Reagents + DNA → 2 DNA + somewhat less reagents + enzyme

  Polymerase chain reaction
    Denature DNA → single DNA strands
    Anneal primer - attaches only to complementary DNA
    Synthesize the rest of the strand - polymerase
      Consumes dNTPs (deoxynucleoside triphosphates)

RT-PCR / qPCR
  Beware
    Reverse Transcription PCR
    Realtime PCR (= qPCR)
  qPCR
    Repetitive cycles (20-40 cycles)
    Includes oligonucleotides that emit light when bound

Simulating a RT-PCR reaction
  We start off with a specific volume of DNA material: amount
  With each cycle we increment amount by the DNA we copied (copied_dna)


Simulating a RT-PCR reaction

Simulating a RT-PCR reaction
  The cell volume is not infinite. We must observe how much reagent is left for the reaction



Simulating a RT-PCR reaction
  The usable reagents are only part of the remaining volume
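The code on the original slides is not preserved in this text, so the following is only a sketch of such a simulation under stated assumptions: the names `amount` and `copied_dna` come from the slides, while `volume`, `usable_fraction` and their default values are hypothetical.

```python
def simulate(initial_amount, cycles=40, volume=1000.0, usable_fraction=0.5):
    """Return the DNA amount after each cycle, starting with the initial amount."""
    amount = initial_amount
    history = [amount]
    for _ in range(cycles):
        remaining = volume - amount           # what is left in the finite cell volume
        usable = remaining * usable_fraction  # only part of it is usable reagent
        copied_dna = min(amount, usable)      # ideally every strand is copied once
        amount += copied_dna
        history.append(amount)
    return history

if __name__ == "__main__":
    for cycle, value in enumerate(simulate(0.001)):
        print(cycle, value)
```

While reagents are plentiful the amount doubles every cycle; once the usable reagents become the limit, growth tapers off and the curve flattens below the total volume.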



Effect of initial amount

Each multiplication by 10 leads to a shift of 3.32 cycles


Why ? Exponential growth

Why ?
  Given a target amount T and an initial amount a0, how many cycles will it take to reach T?

Why ?
  Suppose now that the initial amount a0 is multiplied by a factor 10. What effect does this have on the cycle count?
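The equations from the original slides are not preserved here. Assuming perfect doubling in every cycle, the argument can be reconstructed as follows, with $a_c$ the amount after $c$ cycles and $c'$ the cycle count for the tenfold larger initial amount (both symbols introduced here):

```latex
\begin{align}
a_c &= a_0 \cdot 2^{c} \\
T = a_0 \cdot 2^{c} \quad &\Rightarrow \quad c = \log_2 \frac{T}{a_0} \\
c' &= \log_2 \frac{T}{10\, a_0} = c - \log_2 10 \approx c - 3.32
\end{align}
```

A tenfold larger initial amount therefore reaches the same target about 3.32 cycles earlier, independent of T and of a0 itself.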

CT / CP values
  Based on the 'cycles (c) to a certain threshold (T)' one can estimate the initial amount.
  Problem 1: we only measure after each cycle. There exists no such thing as 3.3 cycles.
    Solution: fit an exponential curve to the points we did measure.
  Problem 2: at a certain point the exponential growth tapers off.
    Solution: find the best point
      Still within 'exponential growth'
      Easily recognizable
      Useful

Problem 1 - Points between cycles
  The log value of the amount is a linear curve
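Because the logarithm of the amount grows linearly during the exponential phase, a fractional cycle count can be estimated by interpolating in log space between the two measured cycles that bracket the threshold. The function below is a sketch of that idea; `fractional_ct` is a hypothetical name, and `history` is assumed to hold one strictly positive amount per whole cycle.

```python
import math

def fractional_ct(history, threshold):
    """Fractional cycle at which the threshold is crossed, or None if never."""
    for cycle in range(1, len(history)):
        before, after = history[cycle - 1], history[cycle]
        if before < threshold <= after:
            low, high = math.log2(before), math.log2(after)
            # interpolate on the logarithms, not on the raw amounts
            return (cycle - 1) + (math.log2(threshold) - low) / (high - low)
    return None

if __name__ == "__main__":
    doubling = [2.0 ** c for c in range(12)]
    print(fractional_ct(doubling, 1000.0))
```

For perfectly doubling data this recovers the exact value log2(1000), about 9.97, whereas interpolating linearly on the raw amounts would give a slightly different answer.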

Problem 2 - CT / CP Values
  Possibility 1: a required intensity

CT / CP Values
  Possibility 2: a required slope = required growth

CT / CP Values
  Possibility 3: maximum slope


Accuracy of the CT value

Cycle Variances
  Assume that with each cycle not everything is copied, but only something between 99% and 100% of the available amount. What effect will this have on our CT values?
  To understand this
    Our algorithm needs to report its own CT value.
    We must modify the step size: instead of calculating cycle by cycle we will do it for every 1/1000th of a cycle; this brings the error due to CT positioning down to 0.07% (= 0.00069)

Multi-step

Beware of linear interpolation

Exercise
  1. Modify the simulate routine to return the cycle value before it reaches a volume of 500
     What is your reported CT value?
  2. Modify the routine such that it will decrease the efficiency of the copy process at random
     Before adding the dna_to_copy, multiply it by a random number between 0.99 and 1
  3. Modify your routine to generate 1000 simulations and calculate the average reported CT value
     What is your result?

Results: 99% - 100% efficiency
  Initial amount: 0.001
    Without decreasing the efficiency: 18.907
    With decreasing the efficiency: 18.9988
    Difference: 0.0918; effect on initial amount estimation: 6% underestimated
  Initial amount: 0.00001
    Without decreased efficiency: 25.49
    With decreased efficiency: 25.588
    Difference: 0.098; effect on initial amount estimation: 7% underestimated

Results: 95% - 100% efficiency
  Initial amount: 0.001
    Without decreasing the efficiency: 18.907
    With decreasing the efficiency: 19.2524
    Difference: 0.3454; effect on initial amount estimation: 21% underestimated
  Initial amount: 0.00001
    Without decrease: 25.49
    With decrease: 26.0581
    Difference: 0.5681; effect on initial amount estimation: 33% underestimated
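The percentages above follow from the CT differences if a shift of d cycles is read as a factor 2^(-d) on the estimated initial amount. The short sketch below reproduces them under that assumption of perfect doubling; `underestimation` is a hypothetical helper name.

```python
def underestimation(ct_shift):
    """Fraction by which the initial amount is underestimated for a CT shift."""
    return 1.0 - 2.0 ** (-ct_shift)

if __name__ == "__main__":
    for shift in (0.0918, 0.098, 0.3454, 0.5681):
        print(shift, round(100 * underestimation(shift)), "%")
    # relative error when the CT is positioned with a step of 1/1000th of a cycle
    print(2.0 ** 0.001 - 1.0)
```

The last line gives the 0.00069 quoted earlier for the error due to CT positioning.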

9. Data Grouping
  Understanding questions
  Grouping data chunks together
  Across or foreach gene/plate etcetera?
  Layout of a PCR experiment and examples

Unstructured Questions
  Calculate the up or down regulation between cell types
    For all or for each gene?
    Including the different replicas?
  Calculate the average expression in each cell line
    Averaged per gene after resolving replicates (each gene will have the same weight afterward) -or- directly across replicas?
  Is there an effect between the cell line and the cell type?

Such unstructured questions can be understood and implemented differently and produce highly different results

qPCR
  A plate: 96 wells
  Different probes/genes: ALFA, BETA, IOTA
  A cell type: WT, TG
  Different dilutions: 1:2, 1:5, 1:20, 1:50
  Technical replicas: R1, R2, R3
  A cell line: HeLa, SK-N-DZ
  Biological replicas: B1, B2


qPCR: A Common Layout

Why think about groups ?
  Group information is often implicit. If it is implicit: assume foreach.
  Groups can help to resolve missing data points
  Groups determine the control flow in an analysis
    Calculate everything on technical replicates, then average things out over the biological replicates -or-
    Pool all technical and biological replicates together before continuing with the analysis
  Not all potential groups make sense
    Calculating the average of all dilutions is only possible if we have the same number of elements in each replica → dangerous to do
  Groups can be artificial but structure experiments
    E.g.: we have three replicas of each probe on each plate and another technical replica on a second plate → the plate distinction can be irrelevant and just introduces an extra technical replica

Why think about groups ?
  A group of data tends to be smaller than the full dataset (we do not need to load other groups)
    Can make streaming possible
    Requires less RAM
    E.g.: calculate an exon overlap map for each chromosome
  Can allow parallel execution
  Dependencies between data groups
    Recalculate only necessary groups


Language Issues
Foreach, Per, (Forall)
  Denotes a separation between groups.
  Foreach gene means that each group will only deal with one gene at a time.
Forall, Across, Ignoring, (Pooled, Grouped), Aggregate ...
  Denotes an aggregation of data independent of this particular variable
  Forall genes means that ALFA, BETA, THETA etc. can all be included in each individual group.
Pairs, Couples, Combinations, Multiples, Between
  Denotes subgroups within larger groups
  E.g.: for each combination of dilutions → means that in whatever group we are working with, we want to create subgroups that are unique with respect to their dilution and compare these.


Starting with the marked element, mark all other elements that belong to this group


We can take all replicas: R1, R2, R3. However, the dilution is not specified → assume we will stay within the same dilution


We can take all replicas: R1, R2, R3, and compare one subgroup WT against the other subgroup TG1


Dealing with combinations
If variable X is listed as a 'combination'
  A celltype combination, or a dilution combination
First create the parent group that assumes X is a group variable.
  Celltype is treated as a group
  Dilution is treated as a group
From this parent group one can select a subgroup identified by a value of X.
  A subgroup where Celltype = WT
  A subgroup where Dilution = 1:2
  This subgroup is then the first group of the combination
One can also select any other group that has a different value for X.
  A subgroup where Celltype = TG1
  A subgroup where Dilution = 1:50
  This subgroup is then another element of the combination.
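Enumerating the combinations themselves can be sketched as follows. This is only an illustration of the pairing step, assuming the distinct values of the combination variable (here the dilutions) have already been collected from the parent group; the names are hypothetical:

```csharp
using System;
using System.Collections.Generic;

static class DilutionCombinations
{
    // Returns every unordered pair of different values of the combination variable.
    public static List<Tuple<string, string>> Pairs(IList<string> values)
    {
        var pairs = new List<Tuple<string, string>>();
        for (int i = 0; i < values.Count; i++)
            for (int j = i + 1; j < values.Count; j++)
                pairs.Add(Tuple.Create(values[i], values[j]));
        return pairs;
    }

    static void Main()
    {
        // The four dilutions used in the example plate layout.
        var dilutions = new List<string> { "1:2", "1:5", "1:20", "1:50" };
        foreach (var pair in Pairs(dilutions))
            Console.WriteLine(pair.Item1 + " versus " + pair.Item2);
    }
}
```

Each pair then identifies two subgroups of the parent group (one per dilution value) that are compared against each other.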


Efficiency Estimation

Each multiplication with 10 leads to a shift of 3.32 cycles. This shift depends on the efficiency


Efficiency Estimation
How do we want to estimate the PCR efficiency ?
  For each plate
  For each probe/gene
  For each celltype (wildtype, modified)
  For each dilution combination
  For each replica
Exercise
  Extend the group provided to you to include all elements of that group
  Color a second group belonging to the 'dilution combination'
  [Write down an object hierarchy to access the data quickly]
  [Write pseudocode to access the data]


For each plate, gene, dilution combination, celltype, biological replica, technical replica

This approach is somewhat flawed. Assume that R2/1:2 failed for IOTA but R1/1:2 did not


For all plates and technical replicas; for each gene, dilution combination, celltype, biological replica

Solves a missing data problem


10. Accessing Data Groups
An educational API to explore data groups
Standard object hierarchies are difficult
Data grouping enables concurrent access
Helps with optimisation (don't calculate what didn't change)
Cleaned up the output of a qPCR experiment


Object Hierarchies
Calculate the median across replicas and across plates, remove bad measurements first
  Gene → Dilution → Cell Type → Plate* → average replicas
  Gene → Dilution → Cell Type → average replicas
  Gene → Replica → Dilution → Cell Type
Calculate the copyratio per gene based on all dilution pairs
  Gene → Dilution → [Cell Type] → CP
  Gene → Cell Type → Dilution → CP
Calculate the up down regulation between two genes
  Gene → Dilution → Cell Type → Concentration → Gene → same Dilution → same Cell Type → Concentration


Data Grouping
Hard coding a data representation/object hierarchy often interferes with different data views / rotations of the data.
XML/SQL/OODB
  SQL can group data for you
  Unsuitable to perform an analysis
  Loading group by group from a database can be highly time consuming (latency)
  Group identification can be problematic.
Requires a high level API that can be used to deal with data: data slices.


A Table Interface
A Table represents a table with attributes, records and values
Retrieve all unique keys given an attribute list
Retrieve all records associated with a specific key
  Differently stated: retrieve the group associated with (or identified by) key
Retrieve a value from a record
Retrieve all values for an attribute in a table (a column)
Iterate over all records in a table
Iterate over all groups in a table
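To make the idea of keys and groups concrete, here is a small stand-in built on standard .NET collections. It is not the course's Table class: MiniRecord, MiniTable, Keys and Group are hypothetical names, and the sketch only assumes that a record maps attribute names to IComparable values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a record: attribute name → value.
class MiniRecord : Dictionary<string, IComparable> { }

// Hypothetical stand-in for a table: a list of (shared) records.
class MiniTable : List<MiniRecord>
{
    // All unique value combinations for the given attributes.
    public List<MiniRecord> Keys(params string[] attributes)
    {
        var seen = new HashSet<string>();
        var keys = new List<MiniRecord>();
        foreach (var record in this)
        {
            var key = new MiniRecord();
            foreach (var a in attributes) key[a] = record[a];
            // A joined string serves as a simple uniqueness check in this sketch.
            if (seen.Add(string.Join("\t", attributes.Select(a => key[a].ToString()))))
                keys.Add(key);
        }
        return keys;
    }

    // The group identified by a key: every record matching all key attributes.
    public MiniTable Group(MiniRecord key)
    {
        var group = new MiniTable();
        group.AddRange(this.Where(r => key.All(kv => r[kv.Key].CompareTo(kv.Value) == 0)));
        return group;
    }
}

static class MiniTableDemo
{
    public static MiniRecord Row(string cellType, string gene, double cp)
    {
        return new MiniRecord { { "CellType", cellType }, { "Gene", gene }, { "CP", cp } };
    }

    static void Main()
    {
        // Made-up measurements, only to show the grouping.
        var table = new MiniTable { Row("WT", "ALFA", 21.3), Row("WT", "ALFA", 21.5), Row("TG", "ALFA", 24.0) };
        foreach (var key in table.Keys("CellType", "Gene"))
            Console.WriteLine(key["CellType"] + "/" + key["Gene"] + ": " + table.Group(key).Count + " records");
    }
}
```

As in the interface described on the slide, the records inside a group are the same objects as in the parent table, so they should be treated as read-only.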


Table – loading in a tsv
Table table=new Table("qpcr.tsv");
  Will load the tab separated value file qpcr.tsv in memory
Console.Out.Write(table);
  Will print out the table content, record by record


Records
Each record maps an attribute (String) to an IComparable object (String, Double, …)
Record r=new Record();
r["Averaged CP"]=average;
Records can be shared between tables !
Records should be treated as read-only once they have been created and filled with data.
Records can be copied; the content of the record (the values, that is) is not copied: r.copy();


Adding a record to a table
Table table=new Table();
Record record=new Record();
record["test"]=60000;
table.add(record);
Console.Out.Write(table);
Records should be added only after their creation and initialization.


Retrieving a set of keys
Table table=new Table("qpcr.txt");
table.keys("CellType");
  Returns a list of unique Records, with only the attribute 'CellType'
table.keys("CellType","Gene");
  Returns a list of unique records with two attributes: 'CellType' and 'Gene'
The return value is a List<Record>


Retrieving values belonging to a key
table.group(key)
  Key is a record (CellType: WT; Gene: ALFA)
  The returned value is again a new Table.
  Remember the records are shared between the returned subtable and the parent table. Do not modify records after they have been created and initialized.
  If a record is added to the parent table it will not automatically appear in potential subtables.


Retrieving values from an attribute
record["CellType"]
  returns the IComparable content in this record
table["CellType"]
  returns an ArrayList of values linked to that attribute.


Iterating over the records in a table
foreach(Record tr in table)
{
  String str=(String)tr["CP"];
  …
}


Iterating over all groups in a table
List<Table> groups=table.groups("A","B",...);
foreach(Table group in groups)
{
  // group.key → the current group identification (contains A, B, ...)
  …
}


Advantages
No need for an object structure representing the data
Flexible with respect to new data fields
Flexible with regard to different regroupings (data rotations)
Foreach group
  The inner loop can theoretically be executed in parallel
If necessary an SQL backend can be placed in the Table interface


Exercise: Data Cleanup
Document which attributes exist in qpcr.tsv
Write a routine that will run through the dataset qpcr.tsv and clean up the data.
All technical replicas should be averaged (for each of the other attributes)
Replicas with useless data should be omitted
  cycle numbers >40
  marked as bad [34.56]
  without a value
  [outlier replicas (median)]


11. Normalizing Efficiencies
How to estimate the efficiency of a qPCR experiment using dilution series
How to normalize the CT values based on an estimated efficiency
Created a normalized table


Creating combinations
Write a routine that will, for each plate, cell line, cell type and probe, report all dilution combinations of the averaged cp-values of the technical replicas
1. Reuse the table you created in the previous exercise
2. Foreach (plate,cellline,celltype,probe) obtain the associated group. Assume we want to have all dilutions included
3. Print out each combination of different dilutions


Copyrate Estimation

Each multiplication with 10 leads to a shift of 3.32 cycles. This shift depends on the copyrate


Normalization through dilution series
If amplification were 100% efficient then diluting the initial concentration by a factor 10 would shift the measurement 3.32 cycles to the right (halving it would shift the measurement by exactly one cycle).
In reality it isn't and we will see shifts larger/smaller than 3.32 cycles.
By creating a dilution series, one can estimate the copyrate / efficiency


Copyrate Calculation
If we have a dilution of factor 10 (becomes stronger)
And we have a shift (to the left) of x cycles, then the copy ratio (r) is
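The equation on this slide did not survive extraction. A reconstruction that agrees with the examples and test cases elsewhere in this deck, assuming a constant copy ratio r per cycle and a concentration step of exactly a factor 10:

```latex
r^{x} = 10 \quad\Longrightarrow\quad r = 10^{1/x}
% Check: x = 3.32 gives r = 10^{1/3.32} \approx 2.00 (perfect doubling).
% For a general concentration factor d (a symbol introduced here): r = d^{1/x}.
```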


Examples
After diluting a sample by a factor 10 we observe a shift of
  3.33 → ratio = 2
  3.36 → ratio = 1.98
  3.5 → ratio = 1.93
  3.6 → ratio = 1.89
  3.7 → ratio = 1.86


Routine to Calc. Eff.
Based on the formula given before, write a routine
Double estimate_r(Double x, Double y, Double ratio_x2y);
to estimate the average multiplication factor for each cycle. To test your routine use the following inputs
  X=10; Y=13.32; ratio_x2y=0.1 → copyratio 2
  X=10; Y=6.68; ratio_x2y=10 → copyratio 2
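One possible implementation of the routine, offered as a sketch. It assumes that ratio_x2y is the factor by which the concentration changes when going from sample x to sample y; this interpretation is not stated on the slide but reproduces both test cases. The wrapper class name is hypothetical.

```csharp
using System;

static class Efficiency
{
    // x, y: cycle values (CP/CT) of two samples.
    // ratio_x2y: assumed to be concentration(y) / concentration(x).
    // Both samples reach the same threshold, so c_x * r^x = c_y * r^y,
    // which gives ratio_x2y = r^(x - y) and r = ratio_x2y^(1 / (x - y)).
    public static double estimate_r(double x, double y, double ratio_x2y)
    {
        return Math.Pow(ratio_x2y, 1.0 / (x - y));
    }

    static void Main()
    {
        Console.WriteLine(estimate_r(10, 13.32, 0.1)); // close to 2
        Console.WriteLine(estimate_r(10, 6.68, 10));   // close to 2
    }
}
```

The routine is undefined for x equal to y (division by zero), so identical cycle values should be skipped by the caller.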


Estimating the copyratio
Plug your routine into Ex. 2 such that the average copy-ratio is calculated for each probe, cell line, celltype and plate. All combinations should be taken into account.
Use the function Double estimate_r(Double x, Double y, Double ratio_x2y) to return the copyratio.


Remarks on Efficiency
One could also look at the material copied with each cycle. It should double. Based on that we have a direct measurement of the efficiency.
  In low areas our sensitivity is not sufficient to estimate
  In high areas before the last cycle we have such a position
  Only one measurement
  Depends somewhat on the software provided with the machines


Remarks on Efficiency
Often the efficiency is expressed as the relative amount of input material that was used to create new DNA: efficiency = r-1.0 (between 80% and 140%)
Double estimate_efficiency(Double x, Double y, Double ratio_x2y)
{
  return 100.0*(estimate_r(x,y,ratio_x2y)-1.0);
}


Normalizing CT Values
Assume the reached volume was V, after x cycles at a copyrate of r
What would be the number of cycles if the copyrate were 2 ?
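The normalization equation itself is missing from the slide text. One reconstruction, assuming an initial amount c_0 (a symbol introduced here) that is multiplied by the copy ratio every cycle until it reaches V:

```latex
V = c_0 \, r^{x} = c_0 \, 2^{x'}
\quad\Longrightarrow\quad
x' = x \, \log_2 r = x \, \frac{\ln r}{\ln 2}
% x' is the normalized cycle count; for r = 2 it reduces to x' = x.
```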


Normalizing CT Values
It could also be possible to go back to the initial concentration instead of relying on normalizing the CT values


Normalize the CT Value
Use your estimated copyratio to normalize the CT value
Implement the normalization equation
Place the normalized CT values in a new table


Normalization ?
CellType might affect the baseline gene expression in the cell.
  A WT cell might be less active than a TG cell
  Or vice versa
To account for this problem one can compare a gene expression against a 'household' gene
  The household gene is supposed to be unrelated to the measured gene
  As we know from microarrays, using one gene to normalize various expressions is highly error-prone


Normalizing against a known gene
If CP_A is the CP value for the gene of interest A
And CP_H is the CP value for the household gene H
The concentrations [A] and [H] are given by


Normalizing against a known gene
The ratio of the concentrations is then given by
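The concentration formulas were images and are missing from the text. A reconstruction that reproduces the worked examples on the following slides, assuming a copy ratio of exactly 2 for both genes and a common threshold amount V (an assumed symbol):

```latex
\begin{align}
[A] &= V \cdot 2^{-CP_A}, & [H] &= V \cdot 2^{-CP_H} \\
\frac{[A]}{[H]} &= 2^{CP_H - CP_A} = 2^{DCT}, & DCT &= CP_H - CP_A
\end{align}
% Check: DCT = 18.875 - 15.786 = 3.089 gives 2^{3.089} \approx 8.51.
```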


Example
A has a CT value of 15.786
H has a CT value of 18.875
DCT = 18.875-15.786 = 3.089
The concentration ratio is 8.51
This is an upregulation of a factor 8.51 (against the household gene concentration)


Example
A has a CT value of 19.6
H has a CT value of 12.4
DCT = 12.4-19.6 = -7.2
The concentration ratio is 0.0068011
Which is a down regulation of a factor 147.03


Calculating DCT
Using GAPDH as a household gene, calculate for any other gene the DCT value and report it in a new table


Calculating up/down regulations
Up/down regulations are typically calculated between celltypes.
E.g.: the relative expression of gene A in WT condition against the relative expression of gene A in TG condition.


Example
WT:
  A has a CT value of 15.786
  H has a CT value of 18.875
  DCT = 18.875-15.786 = 3.089
TG:
  A has a CT value of 19.6
  H has a CT value of 12.4
  DCT = 12.4-19.6 = -7.2
DDCT = -7.2-3.089 = -10.289
TG/WT = 0.000799286, or a down regulation of a factor 1251.12 from WT to TG


14. Reporting Regulations
Reporting regulations as
  Log values
  Ratios
  Ratios larger than 1


Reporting up-down regulations
Can be reported as a ratio
  x10
  x0.01
  x0
Can be reported as a log value
  x10 → log_10 value of 1
  x0.1 → log_10 value of -1
  x1 → log_10 value of 0
  x0 → has no log_10 value


Reporting up/down regulations
Can be reported as a ratio and a direction
  x10 → 10 times upregulated
  x0.1 → 10 times downregulated
Exercise: Based on your DDCT table, create a report for the up / down regulations of each measured gene from the WT to the TG
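The conversion from a DDCT value to "ratio larger than 1 plus direction" might look like the sketch below. It assumes a copy ratio of 2 per cycle and the sign convention of the earlier examples (ratio = 2^DDCT); the class and method names are illustrative only.

```csharp
using System;
using System.Globalization;

static class RegulationReport
{
    // Converts a DDCT value into a ratio, assuming perfect doubling per cycle.
    public static double Ratio(double ddct)
    {
        return Math.Pow(2.0, ddct);
    }

    // Reports a ratio as a factor of at least 1 together with a direction.
    public static string Describe(double ratio)
    {
        if (ratio <= 0.0) return "no expression (x0, no log value)";
        if (ratio == 1.0) return "unchanged";
        if (ratio > 1.0)
            return ratio.ToString("0.##", CultureInfo.InvariantCulture) + " times upregulated";
        return (1.0 / ratio).ToString("0.##", CultureInfo.InvariantCulture) + " times downregulated";
    }

    static void Main()
    {
        Console.WriteLine(Describe(10.0));           // 10 times upregulated
        Console.WriteLine(Describe(0.1));            // 10 times downregulated
        Console.WriteLine(Math.Log10(0.1));          // the same change as a log_10 value: -1
        Console.WriteLine(Describe(Ratio(-10.289))); // the WT → TG example from the DDCT slide
    }
}
```

Reporting the factor and direction separately avoids ratios between 0 and 1, which are easy to misread next to ratios larger than 1.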


Exercise
Report Table
  Gene CellLine Dilution Ratio Direction
Ddct Table
  Gene CellLine Dilution Ddct
Neither table contains CellType (siRNA versus Normal) (nor Plate)
Write a routine that will generate a new Report that contains the average up/down ratios (averaged across dilutions (and plates))
