Consolidating Software Tools for DNA Microarray
Design and Manufacturing
Mourad Atlas
Nisar Hundewale
Ludmila Perelygina
Alex Zelikovsky
Agenda
• Introduction
• DNA Array Flow (DAF)
• Benchmarks: Herpes B virus
• Experiments and Results
• Conclusion and Future Work
Motivation
Microarrays provide a tool for answering a wide variety of questions about the dynamics of cells:
• In which cells is each gene active?
• Under what environmental conditions is each gene active?
• How does the activity level of a gene change under different conditions (stage of the cell cycle, environmental conditions, diseases)?
• What genes seem to be regulated together?
DNA Array Flow
1. Download the genome sequence and extract ORFs in FASTA format.
2. For each gene G, find probes that hybridize to G at a given melting temperature (TM) but do not hybridize to any other gene at that TM.
3. Probe placement: determine, for each probe, the site on the 2-D array surface where it is placed or synthesized. Probe embedding: embed each probe into the deposition sequence.
4. The array is synthesized by a photolithographic process that uses a sequence of masks.
5. Each probe binds to its target according to the complementarity rules.
6. The hybridization intensity can be measured by a laser scanner and converted to a quantitative value that can be read.
Genome ID → Reading genomic data → Probe selection → Physical design → Mask and array manufacturing → Hybridization experiment → Analysis of hybridization intensities
Reading Genomic Data
Input: genome ID
• Download the genome sequence from GenBank (BioPerl)
• ORF extraction from the genome: GeneMark (Borodovsky, Georgia Tech), or ORF Finder. Extracting extra ORFs: ( )
• ORF Parser: ORFs in FASTA format
Output to: probe selection
ORF Parser
What is an ORF?
An open reading frame (ORF) is a subsequence of DNA that could potentially be transcribed into messenger RNA (mRNA) and translated into protein.
Because of the differences between prokaryotic and eukaryotic transcription systems, there are two types of ORF:
1. Prokaryotes: start and stop codon
2. Eukaryotes: stop codon
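The prokaryotic case (start codon to stop codon) can be sketched in a few lines. This is only a naive forward-strand scan for illustration; it is not how GeneMark or ORF Finder work, and the function names `find_orfs` and `to_fasta`, the `min_codons` parameter, and the FASTA header layout are assumptions made here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, orf) for ATG...stop stretches on the forward strand."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None                    # close it at the stop codon
    return sorted(orfs)

def to_fasta(orfs):
    """Format ORFs as FASTA records (hypothetical header layout)."""
    lines = []
    for k, (start, end, orf) in enumerate(orfs, 1):
        lines.append(f">ORF{k} start={start} end={end}")
        lines.append(orf)
    return "\n".join(lines)

genome = "CCATGAAATAGGGATGCCCTGA"
orfs = find_orfs(genome)
print(to_fasta(orfs))
```

A real ORF extractor would also scan the reverse complement and use a statistical gene model; the sketch only shows the shape of the data handed to the ORF Parser.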
Probe Selection
Reading genomic data
ORF preprocessing
Choosing best melting temperature
ocand: find all candidates for a given temperature
Promide
Pools of probes
Physical design
Probe Selection Requirements
• Homogeneity: ensure that the probes can bind to their targets at the temperature of the experiment
• Sensitivity: avoid self-hybridization; ensure that the probes will not form a secondary structure (such a structure would prevent a probe from binding to its target)
• Specificity: avoid cross-hybridization
  – The probes stay unique even after a few bases are changed
  – Each probe must hybridize to one particular gene. For each gene G, find probes that:
    1. hybridize to G at a given temperature
    2. do not hybridize to any other gene at that temperature
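The homogeneity and specificity requirements can be illustrated with a toy filter. This sketch makes strong simplifying assumptions that are not from the slides: melting temperature is estimated with the Wallace rule (2 °C per A/T, 4 °C per G/C), "does not hybridize" is approximated by a minimum number of mismatches against every window of every other gene, and the secondary-structure check is omitted. The names `wallace_tm`, `min_mismatches`, and `select_probes` are hypothetical; Promide itself uses a thermodynamic alignment model.

```python
def wallace_tm(oligo):
    """Wallace-rule melting temperature estimate in degrees Celsius."""
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc

def min_mismatches(probe, target):
    """Fewest mismatches between probe and any same-length window of target."""
    k = len(probe)
    windows = (target[i:i + k] for i in range(len(target) - k + 1))
    return min((sum(a != b for a, b in zip(probe, w)) for w in windows), default=k)

def select_probes(genes, length, tm_range, min_mm):
    """For each gene, keep substrings in the Tm range that stay at least
    min_mm mismatches away from every other gene (crude specificity proxy)."""
    selected = {}
    for name, seq in genes.items():
        keep = []
        for i in range(len(seq) - length + 1):
            probe = seq[i:i + length]
            if not tm_range[0] <= wallace_tm(probe) <= tm_range[1]:
                continue  # homogeneity: outside the experiment's temperature window
            others = (s for n, s in genes.items() if n != name)
            if all(min_mismatches(probe, s) >= min_mm for s in others):
                keep.append(probe)
        selected[name] = keep
    return selected

genes = {"g1": "AAAACCCC", "g2": "AAAAGGGG"}
pools = select_probes(genes, length=4, tm_range=(8, 16), min_mm=2)
print(pools)
```

Requiring `min_mm` mismatches mirrors the requirement that probes stay unique even after a few bases are changed.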
Why Promide?
Possible solutions:
• Li and Stormo 2001
• Kaderali and Schliep 2002
• Rahmann (Promide) 2003
They use the same data structure: the suffix array.
Promide handles truly large-scale datasets in a reasonable amount of time:
• Human GeneNest clusters: in 50 hours
• Neurospora crassa: Promide, a few hours; Li and Stormo, 1 week
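For readers unfamiliar with the data structure, here is a minimal suffix array sketch. It uses the naive construction (sorting all suffixes), which is far too slow for genome-scale input and is not what any of the cited tools do; the function names are illustrative only.

```python
def build_suffix_array(text):
    """Naive construction: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary search the suffix array for all start positions of pattern."""
    k = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                      # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

text = "GATTACA"
sa = build_suffix_array(text)
print(sa, find_occurrences(text, sa, "A"))
```

Because all suffixes sharing a prefix are contiguous in the array, checking whether a candidate oligo occurs elsewhere reduces to two binary searches.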
ORF preprocessing
Classes of sequences:
• A Master sequence is a sequence we wish to design oligos for.
• A Background sequence is a sequence against which specificity is checked.
• Every Master is also a Background.
For each candidate oligo (substring) of a Master, do:
– Check side constraints
– Compute specificity: optimal TM-alignment with every Background collection
Compute matching statistics: mims
Oligo candidate selection: ocand
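As a rough illustration of what matching statistics are, the sketch below computes, for every position of a Master, the length of the longest substring starting there that also occurs in a Background. This is the textbook definition evaluated naively; it is an assumption that this matches what the mims step reports, and a real tool would use an index such as a suffix array rather than repeated substring searches.

```python
def matching_statistics(master, background):
    """ms[i] = length of the longest prefix of master[i:] found in background."""
    ms = []
    for i in range(len(master)):
        length = 0
        # Extend the match one base at a time while it still occurs.
        while i + length < len(master) and master[i:i + length + 1] in background:
            length += 1
        ms.append(length)
    return ms

ms = matching_statistics("ACGT", "CGA")
print(ms)
```

A candidate oligo that shares a long exact stretch with a Background is a likely source of cross-hybridization, so large values flag positions to avoid.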
Mask and Array Manufacturing
• Arrays are synthesized on a wafer
• Selectively expose array sites to light
• Flush the chip's surface with a solution of protected A, C, G, or T
• Repeat the last two steps until the desired probes are synthesized
[Figure: a 3×3 array with probes CG, AC, G / AC, ACG, AG / CG, AG, C and nucleotide deposition sequence ACG. Mask 1 exposes the sites that receive A, Mask 2 the sites that receive C, Mask 3 the sites that receive G.]
A nucleotide deposition sequence defines the order in which nucleotides are deposited.
A probe embedding specifies which steps of the deposition sequence are used to synthesize the probe.
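The relationship between probes, deposition sequence, and masks can be sketched for the 3×3 example with deposition sequence ACG. The sketch assumes each probe is embedded as early as possible (leftmost); the function names are illustrative.

```python
def leftmost_embedding(probe, deposition):
    """Steps of the deposition sequence used by probe, taken as early as possible."""
    steps, t = [], 0
    for base in probe:
        t = deposition.index(base, t)   # raises ValueError if the probe does not fit
        steps.append(t)
        t += 1
    return steps

def build_masks(array, deposition):
    """One boolean grid per deposition step: True where the site is exposed."""
    embeddings = [[leftmost_embedding(p, deposition) for p in row] for row in array]
    return [[[t in steps for steps in row] for row in embeddings]
            for t in range(len(deposition))]

array = [["CG", "AC", "G"],
         ["AC", "ACG", "AG"],
         ["CG", "AG", "C"]]
masks = build_masks(array, "ACG")
for base, mask in zip("ACG", masks):
    print("Mask for", base)
    for row in mask:
        print(" ".join(base if on else "." for on in row))
```

Each mask is simply the set of sites whose probe uses that deposition step, which is why the choice of embedding changes the masks.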
Array manufacturing
[Figure: Mask 1 (nucleotide A) for the 3×3 array with probes CG, AC, G / AC, ACG, AG / CG, AG, C; border = 8.]
Border Minimization Challenges
[Figure: lamp, mask, and array, showing intentionally exposed sites, unwanted illumination, and the border between exposed and unexposed sites.]
Problem: diffraction, internal reflection, scattering, and internal illumination.
Unwanted illumination occurs at sites next to intentionally exposed sites.
Reducing the border increases the chip's yield and reduces cost.
Design objective: minimize the border.
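The border of a mask can be counted as the number of adjacent site pairs in which exactly one site is exposed. A small sketch, using the A mask of the 3×3 example (the grid literal below is transcribed from that example; the function name is an assumption):

```python
def mask_border(mask):
    """Count horizontally or vertically adjacent pairs with different exposure."""
    rows, cols = len(mask), len(mask[0])
    border = 0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols and mask[i][j] != mask[i][j + 1]:
                border += 1
            if i + 1 < rows and mask[i][j] != mask[i + 1][j]:
                border += 1
    return border

# Sites that receive A in the 3x3 example (1 = exposed).
mask_a = [[0, 1, 0],
          [1, 1, 1],
          [0, 1, 0]]
print(mask_border(mask_a))
```

Summing this count over all masks gives the total border cost that the physical design step tries to minimize.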
Physical Design
[Diagram: probe selection; deposition sequence design; test control; 2D probe placement; 3D probe embedding; mask and array manufacturing.]
• Probe placement
  – Similar probes should be placed close together
  – Constructive placement
  – Placement improvement operators
• Probe embedding
  – Degrees of freedom (DOF) in probe embedding
  – DOF exploitation for border conflict reduction
Border Reduction with Probe Placement
Probe placement: similar probes should be placed close together.
[Figure: the same probes placed in two different ways under one deposition sequence; optimizing the placement reduces the border from 8 to 4.]
Border Reduction in Probe Embedding
Probe embedding:
• Synchronous embedding: deposit one nucleotide in each group of "ACGT"
• Asynchronous embedding: no restriction
[Figure: re-embedding the probes in the deposition sequence reduces the border from 4 to 2.]
Physical Design Problem
Given: n² probes
Find: a placement of the probes in n × n sites and an embedding of the probes
Minimize: total border cost
Problem formulation for placement
2-dim (synchronous) Array Design Problem: minimize the placement cost of the Hamming graph H (vertices = probes, distance = Hamming) on the 2-dim grid graph G2 (N × N array, edges between neighbors).
Hamming distance(P1, P2) = number of positions at which the nucleotides of P1 and P2 differ = border (synchronous embedding)
[Figure: each probe (vertex of H) is assigned to a site (vertex of G2).]
Placement Objective: Minimize Border
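Under this formulation the cost of a placement is the sum of Hamming distances over all pairs of neighboring sites. A minimal sketch, assuming equal-length probes and a rectangular grid given as a list of rows (the function names are illustrative):

```python
def hamming(p1, p2):
    """Number of positions at which two equal-length probes differ."""
    return sum(a != b for a, b in zip(p1, p2))

def placement_cost(grid):
    """Sum of Hamming distances over horizontally and vertically adjacent sites."""
    rows, cols = len(grid), len(grid[0])
    cost = 0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:
                cost += hamming(grid[i][j], grid[i][j + 1])
            if i + 1 < rows:
                cost += hamming(grid[i][j], grid[i + 1][j])
    return cost

grid = [["AC", "AG"],
        ["TC", "TG"]]
print(placement_cost(grid))
```

Every placement heuristic discussed next can be compared by evaluating this one function on its output.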
Sort the probes in lexicographical order
[Figure: probes 1, 2, 3, ..., 25 listed in lexicographical order.]
Sorting the probes reduces discrepancies between adjacent probes.
Problem: how to place the 1-D ordering of probes onto the 2-D chip?
TSP + 1-Threading Placement
• Hubbell (1990s): find a TSP tour/path over the given probes with Hamming distance, then place the probes in the grid following the TSP order, so that adjacent probes are similar.
• Hannenhalli, Hubbell, Lipshutz, Pevzner (2002): place the probes according to 1-threading, which further decreases the total border by 20%.
Placement By Threading
[Figure: the sorted probes (Probe 1, Probe 2, ..., Probe 25) are threaded onto the chip in order 1, 2, 3, 4, 5, ...]
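One simple way to map a 1-D ordering onto the 2-D chip is a row-by-row snake, so that consecutive probes in the ordering always land on adjacent sites. The sketch below uses that plain snake pattern as a stand-in; it is not the 1-threading pattern of Hannenhalli et al., and the function name is an assumption.

```python
def sort_and_thread(probes, cols):
    """Sort probes lexicographically, then fill the grid in a snake pattern."""
    ordered = sorted(probes)
    grid = []
    for r in range(0, len(ordered), cols):
        row = ordered[r:r + cols]
        if (r // cols) % 2 == 1:
            row.reverse()               # odd rows run right to left
        grid.append(row)
    return grid

grid = sort_and_thread(["TT", "AA", "CA", "AC"], cols=2)
for row in grid:
    print(" ".join(row))
```

Reversing every other row keeps the end of one row next to the start of the following one, which a plain row-major fill would not.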
Row-Epitaxial Placement Improvement
For each site position (i, j):
• Find the best probe, that is, the one that minimizes the border at (i, j)
• Move (switch) the best probe to (i, j) and lock it in this position
Row placement = sort + thread + row-epitaxial
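A compact sketch of the row-epitaxial idea follows. It assumes equal-length probes, measures the border at a site against its already locked left and upper neighbors, and searches only a few rows ahead for a better probe; the `lookahead` parameter and all function names are assumptions made for illustration.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def border(grid):
    """Total Hamming distance over adjacent sites."""
    rows, cols = len(grid), len(grid[0])
    return sum(hamming(grid[i][j], grid[i][j + 1])
               for i in range(rows) for j in range(cols - 1)) + \
           sum(hamming(grid[i][j], grid[i + 1][j])
               for i in range(rows - 1) for j in range(cols))

def row_epitaxial(grid, lookahead=2):
    """Visit sites in row-major order; swap in the unlocked probe that best
    matches the locked left/upper neighbours, then lock the site."""
    g = [list(row) for row in grid]
    rows, cols = len(g), len(g[0])
    for i in range(rows):
        for j in range(cols):
            locked = ([g[i][j - 1]] if j > 0 else []) + ([g[i - 1][j]] if i > 0 else [])
            if not locked:
                continue
            best = (i, j)
            best_cost = sum(hamming(g[i][j], n) for n in locked)
            for a in range(i, min(rows, i + lookahead + 1)):
                for b in range(cols):
                    if (a, b) <= (i, j):
                        continue        # already locked or the site itself
                    cost = sum(hamming(g[a][b], n) for n in locked)
                    if cost < best_cost:
                        best, best_cost = (a, b), cost
            a, b = best
            g[i][j], g[a][b] = g[a][b], g[i][j]
    return g

before = [["AA", "CC"], ["AC", "AA"]]
after = row_epitaxial(before)
print(border(before), border(after))
```

The greedy swap only looks at locked neighbors, so it does not guarantee a lower total border on every input; in this small example the border drops.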
Probe Embedding
[Figure: the hypothetical probe CTG embedded into a periodic ACGT deposition sequence (three groups of ACGT) in three ways: a synchronous embedding, an asynchronous embedding, and another embedding.]
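The difference between the embeddings can be made concrete for the probe CTG. The sketch assumes a periodic deposition sequence made of repeated ACGT groups and represents an embedding as the list of deposition steps (0-based) that the probe uses; the function names are illustrative.

```python
def synchronous_embedding(probe, period="ACGT"):
    """The k-th nucleotide of the probe is deposited in the k-th ACGT group."""
    return [k * len(period) + period.index(base) for k, base in enumerate(probe)]

def asap_embedding(probe, deposition):
    """Asynchronous embedding that takes every nucleotide as early as possible."""
    steps, t = [], 0
    for base in probe:
        t = deposition.index(base, t)   # ValueError if the probe does not fit
        steps.append(t)
        t += 1
    return steps

deposition = "ACGT" * 3
print(synchronous_embedding("CTG"))        # one step per group
print(asap_embedding("CTG", deposition))   # earliest possible steps
```

Any increasing list of steps that spells the probe is a valid asynchronous embedding, which is the degree of freedom later exploited for border reduction.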
Embedding Determines Border Conflicts
[Figure: neighboring probes (ACTG, AGT, GTG) embedded into the deposition sequence, shown under synchronous embedding and under ASAP embedding.]
Problem formulation
• 2-dim (synchronous) Array Design Problem: minimize the placement cost of the Hamming graph H (vertices = probes, distance = Hamming) on the 2-dim grid graph G2 (N × N array, edges between neighbors).
• 3-dim (asynchronous) Array Design Problem: minimize the cost of placement and embedding of the Hamming graph H' (vertices = probes, distance = Hamming between embedded probes) on the 2-dim grid graph G2 (N × N array, edges between neighbors).
Post-placement Optimization Methods
Asynchronous re-embedding after 2-dim placement:
• Greedy algorithm: while there exist probes to re-embed with gain, optimally re-embed the probe with the largest gain
• Batched greedy: speed-up by avoiding recalculations
• Chessboard algorithm: while there is gain, re-embed the probes in red sites, then re-embed the probes in green sites
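A minimal sketch of the chessboard idea is given below. It assumes embeddings are 0/1 vectors over the deposition steps, that the conflict between two neighbors is the Hamming distance between their vectors, and that a single probe is optimally re-embedded against its fixed neighbors by dynamic programming. Sites of one color are never adjacent, so they can be re-embedded independently. All names are illustrative, and the starting embedding (leftmost) is an assumption.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def leftmost_embedding(probe, dep):
    flags, j = [], 0
    for base in dep:
        use = j < len(probe) and probe[j] == base
        flags.append(int(use))
        j += use
    if j != len(probe):
        raise ValueError("probe does not fit the deposition sequence")
    return tuple(flags)

def best_embedding(probe, dep, neighbours):
    """DP over (step, bases placed): minimise conflicts with fixed neighbours."""
    T, m, INF = len(dep), len(probe), float("inf")
    on = [sum(nb[t] for nb in neighbours) for t in range(T)]
    cost = [[INF] * (m + 1) for _ in range(T + 1)]
    cost[T][m] = 0
    for t in range(T - 1, -1, -1):
        off = len(neighbours) - on[t]
        for j in range(m + 1):
            cost[t][j] = cost[t + 1][j] + on[t]           # skip step t
            if j < m and probe[j] == dep[t]:              # or deposit at step t
                cost[t][j] = min(cost[t][j], cost[t + 1][j + 1] + off)
    if cost[0][0] == INF:
        raise ValueError("probe does not fit the deposition sequence")
    flags, j = [], 0
    for t in range(T):                                    # trace back one optimum
        off = len(neighbours) - on[t]
        if j < m and probe[j] == dep[t] and cost[t][j] == cost[t + 1][j + 1] + off:
            flags.append(1)
            j += 1
        else:
            flags.append(0)
    return tuple(flags)

def total_conflicts(emb):
    rows, cols = len(emb), len(emb[0])
    return sum(hamming(emb[i][j], emb[i][j + 1])
               for i in range(rows) for j in range(cols - 1)) + \
           sum(hamming(emb[i][j], emb[i + 1][j])
               for i in range(rows - 1) for j in range(cols))

def chessboard(probes, dep):
    emb = [[leftmost_embedding(p, dep) for p in row] for row in probes]
    rows, cols = len(emb), len(emb[0])
    start = best = total_conflicts(emb)
    while True:
        for colour in (0, 1):                             # "red" sites, then "green"
            for i in range(rows):
                for j in range(cols):
                    if (i + j) % 2 == colour:
                        nbs = [emb[a][b] for a, b in
                               ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                               if 0 <= a < rows and 0 <= b < cols]
                        emb[i][j] = best_embedding(probes[i][j], dep, nbs)
        now = total_conflicts(emb)
        if now >= best:                                   # no gain: stop
            return emb, start, now
        best = now

probes = [["G", "TG"], ["TG", "G"]]
emb, start, final = chessboard(probes, "ACGTACGT")
print(start, final)
```

In this toy case the single-base probes move from the first G step to the second one, where they share a deposition step with their neighbors.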
Experimental Study
In our experiments we considered the following parameters and measured the results for different values of each:
• Melting temperature: we chose 60 °C and 65 °C as the best melting temperatures for our DNA probe array.
• Number of candidates: we experimented with K = 1 and K = 2 candidates for each pool of probes.
• Chip size: we ran our experiments with two chip sizes, 50x50 and 60x60.
We give the number of conflicts and the runtime of each algorithm for the Herpes B virus and for simulated data.
Experiments Outline
Genome ID
→ BioPerl: sequence in FASTA format
→ GeneMark: ORF extraction
→ ORF Parser: ORFs in FASTA format
→ Promide: select probes (pool of probes)
→ Probe Parser: pools of probes in chip format
→ Read pool / Genpool
→ Placements: sorting, TSP, row placement
→ Embedding: chessboard
→ Chip
Number of Conflicts and CPU Time for All Algorithms

TM=65, Size=50x50, K=2
              Herpes B Virus                 Simulated Data
Algorithm     # Conflicts   CPU Time (sec)   # Conflicts   CPU Time (sec)
Initial       43459         -                183532        -
Tsort         39192         0.09             163402        0.04
Tsp           38143         0.11             159194        0.045
Lalign        34434         0.12             132698        0.9
Reptx 2       25938         7.75             109248        3.61
Chessboard    25504         25.66            106344        9.4

TM=65, Size=50x50, K=1
              Herpes B Virus                 Simulated Data
Algorithm     # Conflicts   CPU Time (sec)   # Conflicts   CPU Time (sec)
Initial       83096         -                183782        -
Tsort         74367         0.15             162926        0.05
Tsp           72141         0.2              159186        0.065
Lalign        60664         0.25             132358        0.08
Reptx 2       48582         4.25             115188        0.9
Chessboard    47652         18.64            112148        6.13
[Figure: number of conflicts for each algorithm, TM=65, size=50x50. Series: HB-k=2, Simulated-k=2, HB-k=1, Simulated-k=1.]
[Figure: CPU time in seconds for each algorithm, TM=65, size=50x50. Series: HB-k=2, Simulated-k=2, HB-k=1, Simulated-k=1.]
TM=65, Size=60x60, K=1
              Herpes B Virus                 Simulated Data
Algorithm     # Conflicts   CPU Time (sec)   # Conflicts   CPU Time (sec)
Initial       107577        -                265992        -
Tsort         98830         0.17             231526        0.08
Tsp           95640         0.22             227960        0.09
Lalign        79254         0.25             189272        0.1
Reptx 2       64830         4.45             154766        1.58
Chessboard    63594         15.58            150812        7.1

TM=65, Size=60x60, K=2
              Herpes B Virus                 Simulated Data
Algorithm     # Conflicts   CPU Time (sec)   # Conflicts   CPU Time (sec)
Initial       54205         -                265328        -
Tsort         49746         0.3              232954        0.14
Tsp           48541         0.34             227762        0.15
Lalign        42858         0.42             182972        0.16
Reptx 2       32098         7.84             149332        3.16
Chessboard    31498         20.93            146708        10.89
[Figure: number of conflicts for each algorithm, TM=65, size=60x60. Series: HB-k=2, Simulated-k=2, HB-k=1, Simulated-k=1.]
[Figure: CPU time in seconds for each algorithm, TM=65, size=60x60. Series: HB-k=2, Simulated-k=2, HB-k=1, Simulated-k=1.]
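The reported numbers can be summarized with a little arithmetic. The snippet below copies the Initial and Chessboard conflict counts from the TM=65 tables and computes the relative reduction for each configuration; the dictionary layout and the `reduction` helper are assumptions made for this illustration.

```python
# (dataset, chip size, K) -> (initial conflicts, conflicts after Chessboard)
results = {
    ("HB", 50, 1): (83096, 47652),
    ("HB", 50, 2): (43459, 25504),
    ("Simulated", 50, 1): (183782, 112148),
    ("Simulated", 50, 2): (183532, 106344),
    ("HB", 60, 1): (107577, 63594),
    ("HB", 60, 2): (54205, 31498),
    ("Simulated", 60, 1): (265992, 150812),
    ("Simulated", 60, 2): (265328, 146708),
}

def reduction(initial, final):
    """Fraction of border conflicts removed by the full placement/embedding flow."""
    return (initial - final) / initial

for (dataset, size, k), (initial, final) in sorted(results.items()):
    print(f"{dataset:9s} {size}x{size} K={k}: {initial} -> {final} "
          f"({100 * reduction(initial, final):.1f}% fewer conflicts)")
```

The same dictionary also makes it easy to compare Herpes B virus against simulated data at a fixed chip size and K.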
Conclusion and Future Work
Conclusion: our experiments show:
• The genomic data follow the pattern predicted by simulated data
• For the Herpes B virus, as for simulated data, increasing the number of candidates per probe (k) decreases the number of border conflicts during the probe placement algorithms
• The number of border conflicts is several times smaller than for simulated data
• There is a trade-off between the number of border conflicts and the CPU time taken by the various physical design algorithms
We give a consolidated software solution for the entire DNA array flow.
We explore all steps in a single automated suite of software tools.
Future work:
• The entire software suite will be made available through web services
• Users will enter the name or ID of an organism and, with the option of setting the required parameters, the suite will produce the DNA probe microarray chip layout
Thank you