Post on 21-Dec-2015
2
Talk overview
• Overall project scenario
• PriFi motivation
• PriFi algorithm description
• Web version
• Demo
3
Overall project aim
Development of
• general molecular markers for legume genetics
• primers for the markers
4
CATS – comparative anchor tagged sequences
Alignment of ESTs from multiple legume species
Align to genomic region
Intron
Identification of evolutionarily conserved regions
Design of primers for PCR amplification of intron in mapping parents (hope to find polymorphism).
5
Legume Taxonomy
Copyright ©2004 by the National Academy of Sciences
Choi, Hong-Kyu et al. (2004) Proc. Natl. Acad. Sci. USA 101, 15289-15294; Doyle & Luckow 2003
18,000 species
General legume markers would be very useful!
[Figure labels: Genome, Genome, Arachis]
6
Looking for conserved regions
exon intron exon exon exon intron intron
Lotus genome
AGC..AT CGAT..GGAC AGT..TG TAC..CC CAC..ATGGAGGAGGAC..TAAGAGACCTAAAC..TCTCTAG
TAC..CC CAC..AT
AGC..AT GGG..AA TAC..CC CAC..AT
TAC..CC CAC..AT<----intron----->
Glycine EST
Medicago EST
Phaseolus?
7
Primer design
Introns replaced by X'es to help Clustal
Good marker region
Usual method: visual inspection of the alignment, followed by "manual" design of primers.
Idea: automate primer design with a computer program.
Primer consensus sequences:
Fw: TGCYTCAAAGGAGGAAATTTCAARAG
Rv: CTGTCAAYACCAGTATTTGCCCKKG
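The letters Y, R and K in these consensus primers are standard IUPAC ambiguity codes (Y = C/T, R = A/G, K = G/T). As a minimal sketch, not PriFi's own code, the following shows how a consensus with ambiguity codes can be derived from gap-free aligned sequences, and how many concrete oligos a degenerate primer stands for. The function names `consensus` and `degeneracy` are hypothetical.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases each one stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}
# Reverse lookup: code letter -> set of bases.
EXPANSION = {code: bases for bases, code in IUPAC.items()}


def consensus(aligned_seqs):
    """Collapse equal-length, gap-free aligned sequences into one IUPAC string."""
    return "".join(IUPAC[frozenset(column)] for column in zip(*aligned_seqs))


def degeneracy(primer):
    """Number of distinct unambiguous oligos a degenerate primer represents."""
    count = 1
    for base in primer:
        count *= len(EXPANSION[base])
    return count


fw = "TGCYTCAAAGGAGGAAATTTCAARAG"
rv = "CTGTCAAYACCAGTATTTGCCCKKG"
print(degeneracy(fw), degeneracy(rv))
```

Each ambiguity position multiplies the number of oligos in the primer mix, which is one reason a cap on ambiguity positions appears among the design criteria.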
8
Primer finder program goals
Given alignment, program should find and rank primer pairs which:
• are placed in conserved regions,
• span an intron,
• have similar Tm,
• fulfill numerous criteria regarding AT content, primer length, ambiguity positions, product length, etc.
I.e. formalize intuition and experience of skilled lab researchers.
9
Lab practice
Work method: go through numerous examples with lab people while they explain what they do and why.
The "why" turned out to be difficult:
• Hard rules hard to formulate
– "So Tm must always be above 55°."
– "Yes. Unless.. "
• Rules often contradictory
– "But then the primer violates the AT content rule??"
– "Oh, well, then the rule should be rephrased to .."
• Scoring primer pairs
– "Why is this primer pair better than this one?"
– "It just is!"
10
Primer finder program PriFi
Works with an alignment (or a FASTA file, which it aligns itself using Clustal).
1. Identifies conserved regions and locates introns
2. Identifies individual primer candidates
– Checks most criteria
3. Considers pairs of primer candidates
– Checks remaining criteria
4. Ranks all pairs
5. Suggests four pairs and explains their scores
– Lets the user make an informed choice (discussions showed primer design is not an exact science!).
11
Check all possibilities?
• To the algorithm, a primer pair is simply four indices (fwstart, fwend, rvstart, rvend).
• For an alignment of length 1000, there are about 1,000,000,000,000 (10^12) ways to pick four indices.
• Checking all possible four-tuples for all criteria is too slow.
• Algorithm applies three filters to reduce workload.
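The size of the search space can be checked with a few lines of arithmetic. The sketch below reproduces the 10^12 figure as 1000^4 (four indices picked independently). The second number, which requires the four indices to be strictly increasing, is an added observation and does not come from the slides.

```python
import math

n = 1000  # alignment length from the slide

# Four indices chosen independently: the slide's "about 10^12".
unconstrained = n ** 4

# Assumption (not from the slides): if fwstart < fwend < rvstart < rvend
# must hold, only strictly increasing four-tuples remain.
ordered = math.comb(n, 4)

print(f"{unconstrained:.1e} unconstrained, {ordered:.1e} ordered")
```

Even the ordered count is in the tens of billions, so checking every tuple against every criterion remains impractical.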
12
First filter
• Operates on the complete alignment.
• We only want primers in conserved regions: disregard less conserved regions.
• Delimit primer regions by masking out other columns.
– Mask single-nucleotide columns.
– Mask intron columns.
– Mask "safety zone" around introns (to ensure unique identification of PCR product).
– Mask certain mismatch columns.
13
First filter
1. Mask single-nucleotide columns.
2. Mask intron columns.
3. Mask "safety zone" around introns (to ensure unique identification of PCR product).
4. Mask certain mismatch columns.
– Using two primer criteria: minimum length (l) and maximum number of ambiguities (a)
– For each mismatch column: check if window of length l can be placed around it with at most a ambiguities. If not: mask column.
– For l = 18, a = 4: For each mismatch column find window of length 18 containing at most 4 mismatch columns, otherwise mask.
  **!******!**!*******!******!!!*****!*!*******!*******!***!!**!*!!*!**!*
CAGCATGCTGACGAAGCCTTGGACCGCCAXXXCAGGAATCAACCGTAGTGGAATCCAGCTAAGGCACACGGAT--
----ATGCTGACGATGCCTTGGGCCGCCA---CAGGACTGAACCGTAATGGAATCTAGCTAAGGCTTACGGAT--
--GCGTGCTGATGAAGCCTTGGACCGCCA---CAGGAATCAACCGTAGTGGAATCCAGCCGAGCCACATGGCTAC
5. Keep regions of length at least l with no masked columns.
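Steps 4 and 5 can be sketched as follows. This is an illustrative reading of the slide, not PriFi's actual code. It assumes each column has already been classified as conserved (`*`), mismatch (`!`) or masked by steps 1 to 3 (`X`), and it assumes a usable window may not contain any column masked in those earlier steps. The names `mask_columns` and `primer_regions` are hypothetical.

```python
def mask_columns(marks, l, a):
    """Return a per-column mask. marks: '*' conserved, '!' mismatch, 'X' pre-masked."""
    n = len(marks)
    hard = [m == "X" for m in marks]
    mismatch = [m == "!" for m in marks]
    masked = hard[:]
    for col in range(n):
        if not mismatch[col]:
            continue
        # All window starts s such that [s, s + l) contains col and fits the alignment.
        first = max(0, col - l + 1)
        last = min(col, n - l)
        ok = any(
            not any(hard[s:s + l]) and sum(mismatch[s:s + l]) <= a
            for s in range(first, last + 1)
        )
        if not ok:
            masked[col] = True
    return masked


def primer_regions(masked, l):
    """Maximal unmasked runs of length >= l, as (start, end) with end exclusive."""
    regions, start = [], None
    for i, m in enumerate(list(masked) + [True]):  # sentinel closes the last run
        if not m and start is None:
            start = i
        elif m and start is not None:
            if i - start >= l:
                regions.append((start, i))
            start = None
    return regions


marks = "**!**!!!**X***!*"  # toy example, not the alignment from the slide
masked = mask_columns(marks, l=5, a=1)
print(primer_regions(masked, l=5))
```

In the toy input, the isolated mismatches survive because a window of length 5 with at most one mismatch can be placed around them, while the cluster of three mismatches is masked.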
17
Workload reduction
• Alignment length 43, min primer length 18, max primer length 35: 9 primer candidates of length 35, 10 of length 34, 11 of length 33, etc.; a total of 315.
• If middle column is masked we get two primer regions of length 21: one primer candidate of length 21, two of length 20, etc., 10 in each region. A total of 20.
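The counts on this slide follow from a simple sum: a region of length n holds n − k + 1 candidates of length k. A short sketch (the function name `count_candidates` is hypothetical) reproduces both totals:

```python
def count_candidates(region_len, min_len, max_len):
    """Number of contiguous primer candidates that fit in one unmasked region."""
    longest = min(max_len, region_len)
    return sum(region_len - k + 1 for k in range(min_len, longest + 1))


whole = count_candidates(43, 18, 35)      # unmasked alignment of length 43
split = 2 * count_candidates(21, 18, 35)  # middle column masked: two regions of 21
print(whole, split)
```

Masking a single column cuts the candidate set from 315 to 20, which is why the first filter pays off before any per-primer criteria are evaluated.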
18
Second filter
• Operates on single primer candidates.
• Can't check all criteria: primers don't have an orientation yet.
• Checks and scores primers according to relevant criteria like:
– end in ambiguities?
– Tm,
– have too many ambiguities?
• Prunes the set of remaining primers:
– From any group of essentially identical, greatly overlapping primers, keep only the superior "representatives".
– I.e. if two primers A and B overlap by more than 10 nt, and A scores better than B in all criteria, keep only A. Otherwise keep both.
– This step is a major algorithm speed-up!
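The pruning rule amounts to a dominance check between overlapping candidates. The sketch below assumes each candidate carries a tuple of per-criterion scores where higher is better. That representation, and the names `Candidate`, `shared_columns` and `prune`, are illustrative assumptions rather than PriFi's data model.

```python
from typing import NamedTuple, Tuple


class Candidate(NamedTuple):
    start: int                 # first alignment column (inclusive)
    end: int                   # last alignment column + 1 (exclusive)
    scores: Tuple[float, ...]  # one score per criterion, higher is better


def shared_columns(a, b):
    """Number of alignment columns covered by both candidates."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def dominates(a, b):
    """True if a scores strictly better than b in every criterion."""
    return all(x > y for x, y in zip(a.scores, b.scores))


def prune(candidates, min_overlap=10):
    """Drop every candidate dominated by another one overlapping it by > min_overlap nt."""
    return [
        b for b in candidates
        if not any(
            a is not b and shared_columns(a, b) > min_overlap and dominates(a, b)
            for a in candidates
        )
    ]


A = Candidate(0, 20, (5.0, 5.0))
B = Candidate(5, 25, (4.0, 3.0))   # overlaps A by 15 nt, worse everywhere: dropped
C = Candidate(8, 28, (6.0, 1.0))   # overlaps A by 12 nt, better in one criterion: kept
D = Candidate(40, 60, (1.0, 1.0))  # no overlap with anything: kept
print(prune([A, B, C, D]))
```

Because a candidate is removed only when another beats it in all criteria, no primer that could win under some weighting of the criteria is lost.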
19
Example
– End in ambiguities?
– Tm
– Have too many ambiguities?
– Pruning
!!**!*!!***!**!*******!************************!***************!*******!**!
CAGCATGTTGACGAAGCCTTGGACCGCCAGCCCAGGAATCAACCGTAGTGGAATCCAGCTAAGGCACACGGATAG
ATGCATACTGACGATGCCTTGGGCCGCCAGCCCAGGAATCAACCGTAATGGAATCCAGCTAAGGCACACGGATAG
CAGCGTGCTGATGAAGCCTTGGACCGCCAGCCCAGGAATCAACCGTAGTGGAATCCAGCTAAGCCACACGGCTAC
[Figure: 13 primer candidates drawn as bars beneath the alignment]
(13 1)
20
Third filter
• Operates on primer pairs.
• Considers all combinations of two primer candidates (low number of candidates essential!).
• Checks remaining criteria, such as:
– AT content and degeneracy in 3'-tail,
– distance to closest intron (to ensure identification of PCR product),
– PCR product length,
– similar Tm's.
• Discards invalid pairs, scores and ranks the rest.
• Suggests four pairs.
– Best scoring, not-too-overlapping pairs. Ensures some variation.
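One way to pick "best scoring, not-too-overlapping" pairs is a greedy pass over the ranked list. The slides do not say how PriFi decides that two pairs overlap too much, so the rule below (both the forward and the reverse primers share more than `max_shared` columns) and its default threshold are assumptions made only for illustration.

```python
def overlap_len(a_start, a_end, b_start, b_end):
    """Number of columns shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def too_similar(p, q, max_shared):
    """Hypothetical rule: both primers of p largely coincide with those of q."""
    fw = overlap_len(p[1], p[2], q[1], q[2])
    rv = overlap_len(p[3], p[4], q[3], q[4])
    return fw > max_shared and rv > max_shared


def suggest(pairs, how_many=4, max_shared=10):
    """Greedily take the best-scoring pairs, skipping near-duplicates of chosen ones."""
    chosen = []
    for pair in sorted(pairs, key=lambda p: p[0], reverse=True):
        if not any(too_similar(pair, q, max_shared) for q in chosen):
            chosen.append(pair)
            if len(chosen) == how_many:
                break
    return chosen


# Each pair: (score, fwstart, fwend, rvstart, rvend); values are made up.
pairs = [
    (170, 2, 22, 101, 121),
    (176, 0, 20, 100, 120),
    (160, 0, 20, 200, 220),
    (150, 50, 70, 100, 120),
    (140, 50, 70, 200, 220),
    (130, 300, 320, 400, 420),
]
print([p[0] for p in suggest(pairs)])
```

The pair scoring 170 is skipped because it nearly coincides with the top pair, so the four suggestions cover distinct primer positions.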
21
Report
Fw 5'-ATCCGATTTCGAGAAATGCAAACCCTGGTTGATCC
Rv 5'-CCCTTCACAGTGGTGATACACTTTCGCTTGTTACG
Tm = 66.4 / 66.9
Primer lengths: 35 / 35
Avg. #sequences in primer alignments: 3.0 / 2.0
Estimated product length: 1785
Primer/intron distances: 36 / 88
A/T's among last 8 bp of 3'-end: 4 / 5
Ambiguities: 0 / 0
93.2: High-Tm bonus
6.0: Fw primer length
6.0: Rv primer length
24.7: bonus for #sequences in primer alignments
3.0: Fw has G/C terminal in 3'-end
3.0: Rv has G/C terminal in 3'-end
60.0: Good product length
-5.0: Rv in unconserved region or based mostly on 2 seqs
-11.3: Primer/intron distance(s) outside 70-150 bp
-3.0: Too high AT content in 3'-ends
Score: 176
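Several report lines can be recomputed directly from the two primer sequences and the itemized terms. The sketch below does only that. It assumes the final score is the plain sum of the listed terms, which gives 176.6 and matches the reported 176 if the value is truncated. The truncation is a guess.

```python
fw = "ATCCGATTTCGAGAAATGCAAACCCTGGTTGATCC"
rv = "CCCTTCACAGTGGTGATACACTTTCGCTTGTTACG"


def at_in_tail(primer, tail=8):
    """Count A/T bases among the last `tail` bases at the 3'-end."""
    return sum(base in "AT" for base in primer[-tail:])


def has_gc_terminal(primer):
    """True if the 3'-terminal base is G or C."""
    return primer[-1] in "GC"


# Itemized bonuses and penalties, copied from the report.
terms = [93.2, 6.0, 6.0, 24.7, 3.0, 3.0, 60.0, -5.0, -11.3, -3.0]
total = sum(terms)

print(len(fw), len(rv))                          # primer lengths
print(at_in_tail(fw), at_in_tail(rv))            # A/T's among last 8 bp
print(has_gc_terminal(fw), has_gc_terminal(rv))  # G/C terminal in 3'-end
print(round(total, 1))                           # sum of the itemized terms
```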
24
Configuration
Critical melting temperature
If both primer melting temperatures are below this value, penalize the pair.
Minimum melting temperature with ambiguity positions
If a primer melting temperature is below this value, the primer can have no ambiguity positions.
Optimal PCR product length interval
[Figure: points as a function of PCR product length; breakpoints p1, p2, p3, p4 separate the zones Penalty, Ok, Optimal, Ok, Penalty]
Critical ambiguity position distance from 3'-end
Penalize ambiguity positions closer than this distance in nucleotides to the 3'-end.
Introns in sequences
If set to 'no', primer pairs do not have to span an intron (and introns are not marked by X'es).
Somewhat heuristic parameters and rules..
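Two of these settings translate into very small checks. The sketch below classifies a product length into the zones named on the slide and applies the critical-Tm rule. The breakpoint values, the critical temperature, the handling of values exactly on a boundary, and the function names are all hypothetical, and the number of points each zone earns is left out because the slides do not state it.

```python
def product_length_zone(length, p1, p2, p3, p4):
    """Map a PCR product length to 'penalty', 'ok' or 'optimal' (p1 <= p2 <= p3 <= p4)."""
    if length < p1 or length > p4:
        return "penalty"
    if p2 <= length <= p3:
        return "optimal"
    return "ok"


def below_critical_tm(tm_fw, tm_rv, critical_tm):
    """The pair is penalized only if BOTH melting temperatures are below the threshold."""
    return tm_fw < critical_tm and tm_rv < critical_tm


# Hypothetical configuration values, chosen only to exercise the functions.
P1, P2, P3, P4 = 200, 400, 1000, 2000
for length in (100, 300, 500, 1500, 2500):
    print(length, product_length_zone(length, P1, P2, P3, P4))

print(below_critical_tm(66.4, 66.9, critical_tm=58.0))
```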
25
Status
• Genomic data from Medicago and Lotus, ESTs from Medicago, Lotus, Glycine, Arachis, Phaseolus.
• PriFi found primer pairs for 203 alignments.
• 36 primer pairs tested:
– 24 correct products in Phaseolus.
– 19 in Arachis.
– Rest not polymorphic or yielded no product.
28
Thanks for your attention.
Email: [email protected]
Website: http://cgi-daimi.au.dk/cgi-chili/PriFi/main
People involved in developing PriFi: Leif Schauser (BiRC), Lene H. Madsen, Niels Sandal (Dept. of Mol. Biology). Grant holder: Jens Stougaard.