Consolidating Software Tools for DNA Microarray Design and Manufacturing

Mourad Atlas

Nisar Hundewale

Ludmila Perelygina

Alex Zelikovsky

Agenda

• Introduction
• DNA Array Flow (DAF)
• Benchmarks: Herpes B virus
• Experiments and Results
• Conclusion and Future Work

Motivation

Microarrays provide a tool for answering a wide variety of questions about the dynamics of cells:
• In which cells is each gene active?
• Under what environmental conditions is each gene active?
• How does the activity level of a gene change under different conditions (stage of the cell cycle, environmental conditions, diseases)?
• Which genes seem to be regulated together?

DNA Array Flow

1. Reading genomic data: download the genome sequence and extract ORFs in FASTA format.

2. Probe selection: for each gene G, find probes that hybridize to G at a given melting temperature (TM) but do not hybridize to any other gene at that TM.

3. Physical design: probe placement determines, for each probe, the site on the 2-D array surface where it is placed or synthesized; probe embedding embeds each probe into the deposition sequence.

4. Mask and array manufacturing: a photolithographic process with a sequence of masks is used to synthesize the probes.

5. Hybridization experiment: each probe binds to its target according to the complementarity rules.

6. Analysis of hybridization intensities: the hybridization can be measured by a laser scanner and converted to a quantitative value that can be read.

Genome ID → Reading genomic data → Probe selection → Physical design → Mask and array manufacturing → Hybridization experiment → Analysis of hybridization intensities






Reading Genomic Data

Input: the genome ID

• Download the genome sequence from GenBank: Bioperl
• ORF extraction from the genome: GeneMark (Borodovsky, Georgia Tech), or ORF Finder
• ORF Parser: ORFs in FASTA format
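The last step produces ORFs in FASTA format. As a minimal sketch of what reading such a file involves, the Python function below parses FASTA text into a dictionary. It is an illustration only and not the ORF Parser or Bioperl code used in the flow; the function name `parse_fasta` and the example records are hypothetical.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped; collect them and join later.
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record input, made up for illustration.
example = """>orf1 hypothetical
ATGAAA
TGA
>orf2 hypothetical
ATGCCCTAA
"""
orfs = parse_fasta(example)
```

Each ORF then becomes one named sequence that the probe selection step can treat as a master sequence.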


ORF Parser

What is an ORF?

An open reading frame (ORF) is a subsequence of DNA that could potentially be transcribed into messenger RNA (mRNA).

Because of the differences between prokaryotic and eukaryotic transcription systems, there are two types of ORF:

1. Prokaryotic: start and stop codon

2. Eukaryotic: stop codon
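To make the "start and stop codon" definition concrete, here is a small Python sketch that scans the three forward reading frames for stretches running from ATG to the first in-frame stop codon. It is a simplified illustration, not the GeneMark or ORF Finder method; the function name `find_orfs` and the `min_codons` parameter are assumptions introduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs, end exclusive, for ATG...stop
    stretches in the three forward reading frames (reverse strand ignored)."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


found = find_orfs("CCATGAAATAGGG")
```

For the toy sequence above, the only ORF is `ATGAAATAG`, found in the third reading frame.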





Probe Selection

• ORF preprocessing
• Choosing the best melting temperature
• Ocand: find all candidates for the given temperature
• Promide
• Pools of probes
• Physical design

Probe Selection Requirements

• Homogeneity: ensure that each probe can bind to its target at the temperature of the experiment.
• Sensitivity: avoid self-hybridization, i.e. ensure that the probes will not form a secondary structure (such a structure would prevent a probe from binding to its target).
• Specificity: avoid cross-hybridization.
  – The probes stay unique even after a few bases are changed.
  – Each probe must hybridize to one particular gene: for each gene G, find probes that (1) hybridize to G at a given temperature and (2) do not hybridize to any other gene at that temperature.
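The homogeneity and sensitivity requirements can be illustrated with two crude checks in Python. The sketch below estimates the melting temperature with the Wallace rule (2 °C per A/T and 4 °C per G/C, a rough approximation for short oligos) and flags probes containing a short stretch whose reverse complement also occurs in the probe. Both checks are stand-ins chosen for illustration and are not the thermodynamic model used by Promide; the function names and the `stem` and `tolerance` parameters are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def wallace_tm(probe):
    """Wallace-rule estimate of the melting temperature in degrees Celsius."""
    p = probe.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def reverse_complement(probe):
    return "".join(COMPLEMENT[b] for b in reversed(probe.upper()))


def may_self_hybridize(probe, stem=4):
    """True if some length-`stem` substring and its reverse complement both
    occur in the probe (a crude hint at a possible secondary structure)."""
    p = probe.upper()
    kmers = {p[i:i + stem] for i in range(len(p) - stem + 1)}
    return any(reverse_complement(k) in kmers for k in kmers)


def homogeneous(probes, target_tm, tolerance):
    """Keep probes whose estimated Tm lies within `tolerance` of `target_tm`."""
    return [p for p in probes if abs(wallace_tm(p) - target_tm) <= tolerance]


kept = homogeneous(["ACGTACGT", "AAAAAAAA", "GCGCGCGC"], target_tm=24, tolerance=2)
```

With these toy inputs only `ACGTACGT` has an estimated Tm of 24 °C, so it is the only probe kept.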

Why Promide?

Possible solutions:
• Li and Stormo 2001
• Kaderali and Schliep 2002
• Rahmann (Promide) 2003
They use the same data structure: the suffix array.

Promide handles truly large-scale datasets in a reasonable amount of time:
• Human GeneNest clusters: in 50 hours
• Neurospora crassa: Promide, a few hours; Li and Stormo, 1 week

ORF preprocessing

Classes of sequences:
• A Master sequence is a sequence we wish to design oligos for.
• A Background sequence is a sequence against which specificity is checked.
• Every Master is also a Background.

For each candidate oligo (substring) of a Master:
– Check side constraints
– Compute specificity: optimal TM-alignment with every Background collection

Tools:
• Compute matching statistics: mims
• Oligo candidate selection: ocand
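The master/background loop can be sketched in Python as follows. As a loudly stated simplification, specificity is tested here by exact substring matching against the other sequences, whereas the flow above uses a TM-alignment; the function names and the toy sequences are hypothetical.

```python
def candidate_oligos(master, length):
    """All substrings of `master` with the given length."""
    return {master[i:i + length] for i in range(len(master) - length + 1)}


def specific_oligos(master_name, sequences, length):
    """Candidate oligos of one master that occur in no other sequence.

    Simplification: exact matching only; near matches that could still
    cross-hybridize are not detected.
    """
    background = [s for name, s in sequences.items() if name != master_name]
    candidates = candidate_oligos(sequences[master_name], length)
    return sorted(o for o in candidates if not any(o in b for b in background))


# Every sequence acts as a master for itself and as background for the others.
genes = {"g1": "ACGTAC", "g2": "GTACGG"}
unique_for_g1 = specific_oligos("g1", genes, length=3)
```

In this toy input the oligos ACG, GTA and TAC of `g1` also occur in `g2`, so only CGT remains as a specific candidate.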





Mask and Array Manufacturing

• Arrays are synthesized on a wafer
• Selectively expose array sites to light
• Flush the chip's surface with a solution of protected A, C, G, T
• Repeat the last two steps until the desired probes are synthesized

Example: a 3×3 array synthesized with the nucleotide deposition sequence ACG

Array probes:
CG   AC   G
AC   ACG  AG
CG   AG   C

Sites exposed by each mask (x = exposed, . = masked):
Mask 1 (A)    Mask 2 (C)    Mask 3 (G)
. x .         x x .         x . x
x x x         x x .         . x x
. x .         x . x         x x .

A nucleotide deposition sequence defines the order in which nucleotides are deposited.

A probe embedding specifies which steps of the deposition sequence are used to synthesize the probe.
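One simple way to obtain a probe embedding is to use each deposition step as early as possible (an ASAP embedding). The Python sketch below returns the steps a probe uses; the function name `asap_embedding` and the 0-based step numbering are choices made for this illustration.

```python
def asap_embedding(probe, deposition):
    """Return the 0-based deposition steps used by `probe`, taking each
    nucleotide at the earliest step that is still available."""
    steps = []
    pos = 0
    for base in probe:
        # Skip deposition steps that do not match the next probe base.
        while pos < len(deposition) and deposition[pos] != base:
            pos += 1
        if pos == len(deposition):
            raise ValueError("probe does not fit into the deposition sequence")
        steps.append(pos)
        pos += 1
    return steps


# Probes CG and AG from a 3x3 example array, deposition sequence ACG.
steps_cg = asap_embedding("CG", "ACG")
steps_ag = asap_embedding("AG", "ACG")
# A hypothetical probe CTG in three periods of ACGT.
steps_ctg = asap_embedding("CTG", "ACGT" * 3)
```

The mask for step t then exposes exactly the sites whose embedding contains t.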

Array manufacturing

For the 3×3 array with probes (CG, AC, G / AC, ACG, AG / CG, AG, C) and deposition sequence ACG, Mask 1 (A) exposes five sites, and the border between exposed and masked sites has length 8:

. A .
A A A
. A .

Border = 8
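The border count for this example can be reproduced with a short Python sketch: build the mask for nucleotide A and count the pairs of neighboring sites where one is exposed and the other is not. The helper names are hypothetical, and `mask_for` relies on the assumption (true in this example) that each nucleotide occurs at most once in the deposition sequence.

```python
ARRAY = [["CG", "AC", "G"],
         ["AC", "ACG", "AG"],
         ["CG", "AG", "C"]]


def mask_for(array, base):
    """True where the site is exposed when `base` is deposited.
    Assumes `base` appears only once in the deposition sequence."""
    return [[base in probe for probe in row] for row in array]


def border_length(mask):
    """Number of horizontally or vertically adjacent site pairs in which
    exactly one site is exposed."""
    rows, cols = len(mask), len(mask[0])
    border = 0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols and mask[i][j] != mask[i][j + 1]:
                border += 1
            if i + 1 < rows and mask[i][j] != mask[i + 1][j]:
                border += 1
    return border


border_a = border_length(mask_for(ARRAY, "A"))
```

The exposed sites form a plus shape, and each of the four masked corners touches two exposed sites, which gives the border of 8.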

Border Minimization Challenges

Border reduction: unwanted illumination affects the chip's yield.

[Figure: lamp, mask, and array, showing intentionally exposed sites, unwanted illumination, and the border]

• Problem: diffraction, internal reflection, scattering, internal illumination
• Occurs at sites near intentionally exposed sites
• Reduce border → increase yield → reduce cost
• Design objective: minimize the border





Physical Design

Input: probe selection

Steps:
• Deposition sequence design
• Test control
• 2D probe placement
• 3D probe embedding

• Probe placement
  – Similar probes should be placed close together
  – Constructive placement
  – Placement improvement operators
• Probe embedding
  – Degrees of freedom (DOF) in probe embedding
  – DOF exploitation for border conflict reduction

Border Reduction with Probe Placement

Probe placement: similar probes should be placed close together.

[Figure: example probes over the deposition sequence ACGT. Initial placement: Border = 8. After optimizing the placement: Border = 4.]

Border Reduction in Probe Embedding

Probe embedding:
• Synchronous embedding: deposit one nucleotide in each group of "ACGT"
• Asynchronous embedding: no restriction

[Figure: example probes over the deposition sequence ACGT; changing the embedding reduces the border from 4 to 2]

Physical Design Problem

Given: n² probes

Find:
• Placement of the probes in n × n sites
• Embedding of the probes

Minimize: total border cost

Problem formulation for placement

2-dim (synchronous) Array Design Problem: minimize the placement cost of the Hamming graph H (vertices = probes, distance = Hamming) on the 2-dim grid graph G2 (N × N array, edges between neighbors).

Hamming distance (P1, P2) = number of positions at which the nucleotides of P1 and P2 differ = border (synchronous embedding)

[Figure: each probe (vertex of H) is mapped to a site (vertex of G2)]
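The placement cost in this formulation is the sum of Hamming distances over all pairs of neighboring sites. A minimal Python sketch, with hypothetical function names and toy probes, might look like this:

```python
def hamming(p, q):
    """Number of positions at which two equal-length probes differ."""
    if len(p) != len(q):
        raise ValueError("probes must have the same length")
    return sum(a != b for a, b in zip(p, q))


def placement_cost(grid):
    """Sum of Hamming distances over horizontally and vertically
    adjacent sites of a rectangular grid of probes."""
    cost = 0
    for i, row in enumerate(grid):
        for j, probe in enumerate(row):
            if j + 1 < len(row):
                cost += hamming(probe, row[j + 1])
            if i + 1 < len(grid):
                cost += hamming(probe, grid[i + 1][j])
    return cost


# Two placements of the same four toy probes.
good = placement_cost([["AC", "AG"], ["TC", "TG"]])
bad = placement_cost([["AC", "TG"], ["AG", "TC"]])
```

Placing similar probes next to each other gives the lower cost (4 versus 6 here), which is the intuition behind the placement heuristics.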

Placement Objective: Minimize Border

Sort the probes in lexicographical order.

[Figure: example probes 1, 2, 3, ..., 25 before and after sorting]

Sorting the probes reduces discrepancies between adjacent probes.

Problem: how to place the 1-D ordering of probes onto the 2-D chip?

TSP + 1-Threading Placement

Hubbell, 1990s:
• Find a TSP tour/path over the given probes with Hamming distance
• Place the probes in the grid following the TSP order
• Adjacent probes are similar

Hannenhalli, Hubbell, Lipshutz, Pevzner '02:
• Place the probes according to 1-threading
• Further decreases the total border by 20%

Placement by Threading

[Figure: the sorted probes 1, 2, 3, ..., 25 are threaded onto the chip in the order 1, 2, 3, 4, 5, ...]
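As an illustration of mapping a sorted 1-D order onto the 2-D chip, the Python sketch below sorts the probes lexicographically and fills the grid row by row, reversing every second row so that consecutive probes stay adjacent. This plain snake-like fill is an assumption made for simplicity; it is not claimed to be the 1-threading of Hannenhalli et al. The function name `snake_placement` is hypothetical.

```python
def snake_placement(probes, n):
    """Sort probes lexicographically and lay them out on an n x n grid,
    left to right on even rows and right to left on odd rows."""
    ordered = sorted(probes)
    if len(ordered) != n * n:
        raise ValueError("need exactly n * n probes")
    grid = []
    for r in range(n):
        row = ordered[r * n:(r + 1) * n]
        # Reverse odd rows so neighbors in the 1-D order stay adjacent.
        grid.append(row if r % 2 == 0 else row[::-1])
    return grid


layout = snake_placement(["TT", "AA", "CA", "AC"], n=2)
```

Here the sorted order AA, AC, CA, TT runs along the first row and back along the second.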

Row-Epitaxial Placement Improvement

For each site position (i, j):
• Find the best probe, i.e. the one that minimizes the border at (i, j)
• Switch: move the best probe to (i, j) and lock it in this position

Row placement = sort + thread + row epitaxial
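A much simplified sketch of this improvement step in Python: sites are visited in row-major order, and for each site the not-yet-locked probe with the fewest conflicts against the already locked left and upper neighbors is swapped in. The cost model (Hamming distance to locked neighbors) and the unlimited search over all remaining probes are assumptions of this sketch, not details taken from the slides.

```python
def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))


def row_epitaxial(grid):
    """Greedy re-placement: lock sites in row-major order, each time
    switching in the remaining probe that best matches locked neighbors."""
    n_rows, n_cols = len(grid), len(grid[0])
    flat = [p for row in grid for p in row]
    for pos in range(1, len(flat)):  # the first site stays as it is
        i, j = divmod(pos, n_cols)

        def cost(probe):
            c = 0
            if j > 0:
                c += hamming(probe, flat[pos - 1])       # left neighbor
            if i > 0:
                c += hamming(probe, flat[pos - n_cols])  # upper neighbor
            return c

        # Candidates are the probes at this site or later (not yet locked).
        best = min(range(pos, len(flat)), key=lambda k: cost(flat[k]))
        flat[pos], flat[best] = flat[best], flat[pos]
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


improved = row_epitaxial([["AA", "TT"], ["AT", "TA"]])
```

In this toy case the sum of Hamming distances between neighboring sites drops from 6 to 4 after the switches.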

Probe Embedding

[Figure: a hypothetical probe CTG embedded in a deposition sequence of three ACGT groups in three ways: a synchronous embedding, an asynchronous embedding, and another embedding]

Embedding Determines Border Conflicts

[Figure: example probes embedded in a deposition sequence, comparing a synchronous embedding with an ASAP embedding]

Problem formulation

2-dim (synchronous) Array Design Problem: minimize the placement cost of the Hamming graph H (vertices = probes, distance = Hamming) on the 2-dim grid graph G2 (N × N array, edges between neighbors).

3-dim (asynchronous) Array Design Problem: minimize the cost of placement and embedding of the Hamming graph H' (vertices = probes, distance = Hamming between embedded probes) on the 2-dim grid graph G2 (N × N array, edges between neighbors).

Post-placement Optimization Methods

Asynchronous re-embedding after 2-dim placement:

• Greedy algorithm
  – While there exist probes that can be re-embedded with gain, optimally re-embed the probe with the largest gain
  – Batched greedy: speed-up by avoiding recalculations
• Chessboard algorithm
  – While there is gain: re-embed the probes in red sites, then re-embed the probes in green sites






Experimental Study

In our experiments we considered the following parameters and measured the results for different values of each:

• Melting temperature: we chose 60 °C and 65 °C as the best melting temperatures for our DNA probe array.
• Number of candidates: we experimented with two values of K (the number of candidates in each pool of probes): 1 and 2.
• Chip size: we ran our experiments with two chip sizes, 50x50 and 60x60.

We give the number of conflicts and the runtime of each algorithm for the Herpes B virus and for simulated data.

Experiments Outline

• Genome ID
• Bioperl: sequence in FASTA format
• ORF extraction: GeneMark
• ORF Parser: ORFs in FASTA format
• Probe Parser: pools of probes in chip format
• Promide: select probes (pool of probes)
• Read Pool / Genpool
• Placements: sorting
• Placements: TSP
• Placements: row placement
• Embedding: chessboard
• Chip

Number of Conflicts and CPU Time for All Algorithms

TM=65, Size=50x50

K=2           Herpes B Virus                    Simulated Data
              # Conflicts   CPU Time (sec)      # Conflicts   CPU Time (sec)
Initial       43459                             183532
Tsort         39192         0.09                163402        0.04
Tsp           38143         0.11                159194        0.045
Lalign        34434         0.12                132698        0.9
Reptx 2       25938         7.75                109248        3.61
Chessboard    25504         25.66               106344        9.4

K=1           Herpes B Virus                    Simulated Data
              # Conflicts   CPU Time (sec)      # Conflicts   CPU Time (sec)
Initial       83096                             183782
Tsort         74367         0.15                162926        0.05
Tsp           72141         0.2                 159186        0.065
Lalign        60664         0.25                132358        0.08
Reptx 2       48582         4.25                115188        0.9
Chessboard    47652         18.64               112148        6.13
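To read the table as relative improvements, the Python sketch below computes the percentage reduction in conflicts after each stage, using the Herpes B virus column of the TM=65, 50x50, K=2 table. The helper name `reduction_percent` is hypothetical; the numbers are copied from the table.

```python
# Conflicts for Herpes B virus, TM=65, 50x50, K=2 (values from the table).
conflicts = {
    "Initial": 43459,
    "Tsort": 39192,
    "Tsp": 38143,
    "Lalign": 34434,
    "Reptx 2": 25938,
    "Chessboard": 25504,
}


def reduction_percent(initial, final):
    """Relative reduction of `final` with respect to `initial`, in percent."""
    return 100.0 * (initial - final) / initial


reductions = {
    name: round(reduction_percent(conflicts["Initial"], value), 1)
    for name, value in conflicts.items()
}
```

On these numbers the full sequence of algorithms removes roughly 41% of the initial conflicts, while most of the CPU time is spent in the last two stages.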

[Figure: Number of Conflicts for each Algorithm, TM=65, Size=50x50. x-axis: Algorithms; y-axis: Number of Conflicts; series: HB-k=2, Simulated-k=2, HB-k=1, Simulated-k=1]

[Figure: CPU Time for each Algorithm, TM=65, Size=50x50. x-axis: Algorithms; y-axis: CPU Time in s; series: HB-k=2, Simulated-k=2, HB-k=1, Simulated-k=1]



TM=65, Size=60x60

              Herpes B Virus                    Simulated Data
              # Conflicts   CPU Time (sec)      # Conflicts   CPU Time (sec)
Initial       107577                            265992
Tsort         98830         0.17                231526        0.08
Tsp           95640         0.22                227960        0.09
Lalign        79254         0.25                189272        0.1
Reptx 2       64830         4.45                154766        1.58
Chessboard    63594         15.58               150812        7.1

              Herpes B Virus                    Simulated Data
              # Conflicts   CPU Time (sec)      # Conflicts   CPU Time (sec)
Initial       54205                             265328
Tsort         49746         0.3                 232954        0.14
Tsp           48541         0.34                227762        0.15
Lalign        42858         0.42                182972        0.16
Reptx 2       32098         7.84                149332        3.16
Chessboard    31498         20.93               146708        10.89

[Figure: Number of Conflicts for each Algorithm, TM=65, Size=60x60. x-axis: Algorithms; y-axis: Number of Conflicts; series: HB-k=2, Simulated-k=2, HB-k=1, Simulated-k=1]

[Figure: CPU Time for each Algorithm, TM=65, Size=60x60. x-axis: Algorithms; y-axis: CPU Time in s; series: HB-k=2, Simulated-k=2, HB-k=1, Simulated-k=1]

Conclusion and Future Work

Conclusion: our experiments show that
• The genomic data follow the pattern predicted by the simulated data
• For the Herpes B virus, as for simulated data, increasing the number of candidates per probe (k) decreases the number of border conflicts during the probe placement algorithms
• The number of border conflicts is several times smaller than for simulated data
• There is a trade-off between the number of border conflicts and the CPU time taken by the various physical design algorithms

We give a consolidated software solution for the entire DNA array flow.
We explore all steps in a single automated software suite of tools.

Future work:
• The entire software suite will be made available through web services
• Users can enter the name or ID of an organism and, optionally, set the required parameters; the suite will then produce the DNA probe microarray chip layout

Thank you
