Post on 21-Dec-2015

1

2

Talk overview

• Overall project scenario

• PriFi motivation

• PriFi algorithm description

• Web version

• Demo

3

Overall project aim

Development of

• general molecular markers for legume genetics

• primers for the markers

4

CATS – comparative anchor tagged sequences

Alignment of ESTs from multiple legume species

Align to genomic region

Intron

Identification of evolutionarily conserved regions

Design of primers for PCR amplification of intron in mapping parents (hope to find polymorphism).

5

Copyright ©2004 by the National Academy of Sciences

Choi, Hong-Kyu et al. (2004) Proc. Natl. Acad. Sci. USA 101, 15289-15294, Doyle & Luckow 2003

Legume Taxonomy

18,000 species. General legume markers would be very useful!

[Figure: legume taxonomy tree; labels: Genome, Genome, Arachis]

6

Looking for conserved regions

[Figure: a Lotus genome region with alternating exons and introns, aligned with a Glycine EST and a Medicago EST. The conserved exon segments TAC..CC and CAC..AT flank the marked intron. Question on the slide: Phaseolus?]

7

Primer design

Introns replaced by X'es to help Clustal

Good marker region

Usual method: visual inspection of the alignment, then "manual" design of primers.

Idea: automate primer design through a computer program.

Primer consensus sequences:

Fw: TGCYTCAAAGGAGGAAATTTCAARAG

Rv: CTGTCAAYACCAGTATTTGCCCKKG
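
The consensus sequences above contain IUPAC ambiguity codes (Y, R, K), which stand for columns where the aligned species differ. As a minimal sketch of how such a consensus could be derived from alignment columns, assuming gap characters are ignored (the function names and the two-row example are hypothetical, not taken from PriFi):

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases seen in a column.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(rows):
    """Collapse equal-length aligned rows into one IUPAC consensus string."""
    out = []
    for column in zip(*rows):
        # Assumption: anything that is not A, C, G or T (e.g. a gap) is ignored.
        bases = frozenset(b for b in column if b in "ACGT")
        out.append(IUPAC[bases] if bases else "-")
    return "".join(out)


def count_ambiguities(primer):
    """Number of positions that are not a plain A, C, G or T."""
    return sum(1 for b in primer if b not in "ACGT")


# Hypothetical two-species fragment that differs in one column (C vs. T).
example_rows = ["TGCCTCAAAG", "TGCTTCAAAG"]
print(consensus(example_rows))
print(count_ambiguities("TGCYTCAAAGGAGGAAATTTCAARAG"))
```

Counting ambiguity positions this way is what a "maximum number of ambiguities" criterion would operate on.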

8

Primer finder program goals

Given alignment, program should find and rank primer pairs which:

• are placed in conserved regions,

• span an intron,

• have similar Tm,

• fulfill numerous criteria regarding AT content, primer length, ambiguity positions, product length, ..

I.e. formalize intuition and experience of skilled lab researchers.

9

Lab practice

Work method: go through numerous examples with lab people while they explain what they do and why.

The "why" turned out to be difficult:

• Hard rules hard to formulate
– "So Tm must always be above 55°."
– "Yes. Unless.. "

• Rules often contradictory
– "But then the primer violates the AT content rule??"
– "Oh, well, then the rule should be rephrased to .."

• Scoring primer pairs
– "Why is this primer pair better than this one?"
– "It just is!"

10

Primer finder program PriFi

Works with alignment (or Fasta file which it aligns itself using Clustal).

1. Identifies conserved regions and locates introns

2. Identifies individual primer candidates
– Checks most criteria

3. Considers pairs of primer candidates
– Checks remaining criteria

4. Ranks all pairs

5. Suggests four pairs and explains their scores
– Lets user make informed choice (discussions showed primer design is not exact science!).

11

Check all possibilities?

• To the algorithm, a primer pair is simply four indices (fwstart, fwend, rvstart, rvend).

• For an alignment of length 1000, there are about 1,000,000,000,000 (10^12) ways to pick four indices.

• Checking all possible four-tuples for all criteria is too slow.

• Algorithm applies three filters to reduce workload.
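
The size of the search space can be checked with a few lines of arithmetic. The sketch below reproduces the slide's figure as 1000^4 and, under the added assumption that a valid tuple must satisfy fwstart < fwend < rvstart < rvend, shows that the count is still in the tens of billions:

```python
import math

alignment_length = 1000

# Every index chosen independently: the figure quoted on the slide.
independent_choices = alignment_length ** 4

# Assumption (not stated on the slide): indices must be strictly increasing,
# so each valid tuple corresponds to one 4-element subset of the columns.
increasing_tuples = math.comb(alignment_length, 4)

print(f"{independent_choices:.2e}")
print(f"{increasing_tuples:.2e}")
```

Either way, exhaustive checking of every tuple against every criterion is impractical, which motivates the filters.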

12

First filter

• Operates on the complete alignment.

• We only want primers in conserved regions: disregard less conserved regions.

• Delimit primer regions by masking out other columns.– Mask single-nucleotide columns.

– Mask intron columns.

– Mask "safety zone" around introns (to ensure unique identification of PCR product).

– Mask certain mismatch columns.

16

First filter

1. Mask single-nucleotide columns.

2. Mask intron columns.

3. Mask "safety zone" around introns (to ensure unique identification of PCR product).

4. Mask certain mismatch columns.
– Using two primer criteria: minimum length (l) and maximum number of ambiguities (a).
– For each mismatch column: check if a window of length l can be placed around it with at most a ambiguities. If not: mask column.
– For l = 18, a = 4: for each mismatch column find a window of length 18 containing at most 4 mismatch columns, otherwise mask.

    **!******!**!*******!******!!!*****!*!*******!*******!***!!**!*!!*!**!*
  CAGCATGCTGACGAAGCCTTGGACCGCCAXXXCAGGAATCAACCGTAGTGGAATCCAGCTAAGGCACACGGAT
------ATGCTGACGATGCCTTGGGCCGCCA---CAGGACTGAACCGTAATGGAATCTAGCTAAGGCTTACGGAT
----GCGTGCTGATGAAGCCTTGGACCGCCA---CAGGAATCAACCGTAGTGGAATCCAGCCGAGCCACATGGCTAC

5. Keep regions of length at least l with no masked columns.
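
Steps 4 and 5 can be sketched as follows. This is only an illustration under stated assumptions: columns are described by two hypothetical boolean lists (`mismatch` and `blocked`, the latter covering columns already masked in steps 1 to 3), windows are not allowed to include blocked columns, and mismatch counts are taken from the original alignment rather than updated as columns get masked. None of these details are confirmed by the slides.

```python
def mask_mismatch_columns(mismatch, blocked, l, a):
    """Return a per-column mask after the window test of step 4.

    mismatch[i]: column i contains differing nucleotides.
    blocked[i]:  column i was already masked (intron, safety zone, ...).
    """
    n = len(mismatch)
    # Prefix sums give the number of mismatch/blocked columns in any window.
    mm = [0]
    bl = [0]
    for i in range(n):
        mm.append(mm[-1] + int(mismatch[i]))
        bl.append(bl[-1] + int(blocked[i]))

    masked = list(blocked)
    for i in range(n):
        if not mismatch[i] or blocked[i]:
            continue
        # Try every window [s, s + l) that lies in the alignment and covers i.
        fits = any(
            bl[s + l] - bl[s] == 0 and mm[s + l] - mm[s] <= a
            for s in range(max(0, i - l + 1), min(i, n - l) + 1)
        )
        if not fits:
            masked[i] = True
    return masked


def primer_regions(masked, l):
    """Step 5: maximal unmasked runs of length >= l, as half-open (start, end)."""
    regions = []
    start = None
    for i, is_masked in enumerate(list(masked) + [True]):  # sentinel closes last run
        if not is_masked and start is None:
            start = i
        elif is_masked and start is not None:
            if i - start >= l:
                regions.append((start, i))
            start = None
    return regions


# Toy example: 12 columns, three adjacent mismatch columns, l = 4, a = 1.
toy_mismatch = [i in (5, 6, 7) for i in range(12)]
toy_blocked = [False] * 12
toy_mask = mask_mismatch_columns(toy_mismatch, toy_blocked, l=4, a=1)
print(primer_regions(toy_mask, l=4))
```

In the toy example only the middle mismatch column cannot be covered by any window with at most one mismatch, so it alone is masked and the alignment splits into two primer regions.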

17

Workload reduction

• Alignment length 43, min primer length 18, max primer length 35: 9 primer candidates of length 35, 10 of length 34, 11 of length 33, etc.; a total of 315.

• If middle column is masked we get two primer regions of length 21: one primer candidate of length 21, two of length 20, etc., 10 in each region. A total of 20.
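
The counts above follow from the fact that a region of length n holds n - k + 1 windows of length k. A short check (the function name is illustrative):

```python
def count_candidates(region_length, min_len, max_len):
    """Number of contiguous windows with a length between min_len and max_len."""
    return sum(
        region_length - k + 1
        for k in range(min_len, max_len + 1)
        if k <= region_length
    )


whole_alignment = count_candidates(43, 18, 35)
split_in_two = 2 * count_candidates(21, 18, 35)
print(whole_alignment, split_in_two)  # 315 20
```

Masking a single column therefore cuts the number of candidates from 315 to 20 in this example.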

18

Second filter

• Operates on single primer candidates.

• Can't check all criteria: primers don't have an orientation yet.

• Checks and scores primers according to relevant criteria like:
– end in ambiguities?
– Tm,
– have too many ambiguities?

• Prunes the set of remaining primers:
– From any group of essentially identical, greatly overlapping primers, keep only the superior "representatives".
– I.e. if two primers A and B overlap by more than 10 nt, and A scores better than B in all criteria, keep only A. Otherwise keep both.
– This step is a major algorithm speed-up!
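
The pruning rule can be written down as a dominance check. The sketch below is one possible reading of it: the `Candidate` type, the half-open coordinates and the convention that higher scores are better are assumptions made for illustration, and "better in all criteria" is taken to mean strictly better in each one.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    start: int                  # first alignment column (inclusive)
    end: int                    # end column (exclusive)
    scores: Tuple[float, ...]   # one score per criterion, higher is better


def overlap(a: Candidate, b: Candidate) -> int:
    """Number of alignment columns shared by two candidates."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a scores strictly better than b in every criterion."""
    return all(x > y for x, y in zip(a.scores, b.scores))


def prune(cands: List[Candidate], min_overlap: int = 10) -> List[Candidate]:
    """Drop every candidate that a greatly overlapping candidate dominates."""
    return [
        b for b in cands
        if not any(
            a is not b and overlap(a, b) > min_overlap and dominates(a, b)
            for a in cands
        )
    ]


A = Candidate(0, 20, (5.0, 5.0))
B = Candidate(5, 25, (4.0, 4.0))    # overlaps A by 15 nt and is worse everywhere
C = Candidate(8, 28, (6.0, 3.0))    # overlaps A but wins on one criterion
D = Candidate(30, 50, (1.0, 1.0))   # worse than A but does not overlap it
print(prune([A, B, C, D]) == [A, C, D])
```

Because a candidate is removed only when it loses on every criterion, any primer that could still be the best choice under some weighting of the criteria survives.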

19

Example

– End in ambiguities?
– Tm
– Have too many ambiguities?
– Pruning

!!**!*!!***!**!*******!************************!***************!*******!**!
CAGCATGTTGACGAAGCCTTGGACCGCCAGCCCAGGAATCAACCGTAGTGGAATCCAGCTAAGGCACACGGATAG
ATGCATACTGACGATGCCTTGGGCCGCCAGCCCAGGAATCAACCGTAATGGAATCCAGCTAAGGCACACGGATAG
CAGCGTGCTGATGAAGCCTTGGACCGCCAGCCCAGGAATCAACCGTAGTGGAATCCAGCTAAGCCACACGGCTAC

[Figure: 13 primer candidates drawn as bars beneath the alignment.]

(13 1)

20

Third filter

• Operates on primer pairs.

• Considers all combinations of two primer candidates (low number of candidates essential!).

• Checks remaining criteria, such as:
– AT content and degeneracy in 3'-tail,
– distance to closest intron (to ensure identification of PCR product),
– PCR product length,
– similar Tm's.

• Discards invalid pairs, scores and ranks the rest.

• Suggests four pairs.
– Best scoring, not-too-overlapping pairs. Ensures some variation.
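
The final selection step can be illustrated with a greedy pass over the ranked pairs. The slides do not define "not-too-overlapping", so the sketch below assumes that two pairs are too similar when both their forward and their reverse primers share more than `max_shared` columns; the tuple layout and the threshold are hypothetical.

```python
def span_overlap(a, b):
    """Shared columns of two half-open (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def too_similar(p, q, max_shared):
    """Assumed rule: both primers of p largely coincide with those of q."""
    return (span_overlap(p[1], q[1]) > max_shared
            and span_overlap(p[2], q[2]) > max_shared)


def select_pairs(pairs, max_pairs=4, max_shared=10):
    """Greedily pick the best-scoring pairs that differ enough from each other.

    pairs: list of (score, fw_span, rv_span) tuples.
    """
    chosen = []
    for pair in sorted(pairs, key=lambda p: p[0], reverse=True):
        if all(not too_similar(pair, c, max_shared) for c in chosen):
            chosen.append(pair)
        if len(chosen) == max_pairs:
            break
    return chosen


ranked = [
    (100.0, (0, 20), (200, 220)),
    (99.0, (1, 21), (201, 221)),    # nearly identical to the best pair
    (90.0, (50, 70), (200, 220)),   # different forward primer
    (80.0, (0, 20), (300, 320)),    # different reverse primer
]
print([score for score, _, _ in select_pairs(ranked)])
```

Skipping the near-duplicate of the top pair is what gives the user genuinely different alternatives to choose from.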

21

Report

Fw 5'-ATCCGATTTCGAGAAATGCAAACCCTGGTTGATCC
Rv 5'-CCCTTCACAGTGGTGATACACTTTCGCTTGTTACG

Tm = 66.4 / 66.9
Primer lengths: 35 / 35
Avg. #sequences in primer alignments: 3.0 / 2.0
Estimated product length: 1785
Primer/intron distances: 36 / 88
A/T's among last 8 bp of 3'-end: 4 / 5
Ambiguities: 0 / 0

93.2: High-Tm bonus
6.0: Fw primer length
6.0: Rv primer length
24.7: bonus for #sequences in primer alignments
3.0: Fw has G/C terminal in 3'-end
3.0: Rv has G/C terminal in 3'-end
60.0: Good product length
-5.0: Rv in unconserved region or based mostly on 2 seqs
-11.3: Primer/intron distance(s) outside 70-150 bp
-3.0: Too high AT content in 3'-ends

Score: 176
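
The reported score appears to be the sum of the listed bonuses and penalties. Adding them up gives 176.6, which matches the reported 176 if the fractional part is dropped; how PriFi actually rounds the total is not stated on the slide, so that last step is an assumption.

```python
# Score components copied from the report above.
components = [
    ("High-Tm bonus", 93.2),
    ("Fw primer length", 6.0),
    ("Rv primer length", 6.0),
    ("Bonus for #sequences in primer alignments", 24.7),
    ("Fw has G/C terminal in 3'-end", 3.0),
    ("Rv has G/C terminal in 3'-end", 3.0),
    ("Good product length", 60.0),
    ("Rv in unconserved region or based mostly on 2 seqs", -5.0),
    ("Primer/intron distance(s) outside 70-150 bp", -11.3),
    ("Too high AT content in 3'-ends", -3.0),
]

total = sum(value for _, value in components)
print(round(total, 1))
print(int(total))  # assumption: the report drops the fractional part
```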

22

PriFi on the web

23

24

Configuration

Critical melting temperature
If both primer melting temperatures are below this value, penalize the pair.

Minimum melting temperature with ambiguity positions
If a primer melting temperature is below this value, the primer can have no ambiguity positions.

Optimal PCR product length interval
[Figure: points as a function of PCR product length (PCR prod len). Thresholds p1, p2, p3, p4 divide the axis into the zones Penalty, Ok, Optimal, Ok, Penalty.]

Critical ambiguity position distance from 3'-end
Penalize ambiguity positions closer than this distance in nucleotides to the 3'-end.

Introns in sequences
If set to 'no', primer pairs do not have to span an intron (and introns are not marked by X'es).

Somewhat heuristic parameters and rules..
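
The "Optimal PCR product length interval" setting describes a zoned scoring rule driven by four thresholds. A minimal sketch of such a rule follows, assuming a simple step function; the actual shape of the curve, the point values and the threshold values used here are all hypothetical and not taken from PriFi.

```python
def product_length_points(length, p1, p2, p3, p4,
                          optimal=60.0, ok=30.0, penalty=-30.0):
    """Score a PCR product length against four thresholds p1 <= p2 <= p3 <= p4.

    Zones: below p1 penalty, p1..p2 ok, p2..p3 optimal, p3..p4 ok, above p4 penalty.
    The point values are placeholders chosen for illustration.
    """
    if p2 <= length <= p3:
        return optimal
    if p1 <= length <= p4:
        return ok
    return penalty


# Hypothetical thresholds in base pairs.
thresholds = (400, 600, 1000, 1500)
for length in (100, 500, 800, 1200, 2000):
    print(length, product_length_points(length, *thresholds))
```

Exposing the thresholds as parameters mirrors the configuration page, where the user can adapt the zones to the lab's own PCR protocol.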

25

Status

• Genomic data from Medicago and Lotus, ESTs from Medicago, Lotus, Glycine, Arachis, Phaseolus.

• PriFi found primer pairs for 203 alignments.

• 36 primer pairs tested:
– 24 correct products in Phaseolus.
– 19 in Arachis.
– Rest not polymorphic or yielded no product.

26

User statistics

27

28

Thanks for your attention.

Email: [email protected]

Website: http://cgi-daimi.au.dk/cgi-chili/PriFi/main

People involved in developing PriFi: Leif Schauser (BiRC), Lene H. Madsen, Niels Sandal (Dept. of Mol. Biology). Grant holder: Jens Stougaard.