Consolidating Software Tools for DNA Microarray Design and Manufacturing Mourad Atlas Nisar Hundewale Ludmila Perelygina Alex Zelikovsky

Post on 01-Feb-2016

Page 1: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Consolidating Software Tools for DNA Microarray Design and Manufacturing

Mourad Atlas

Nisar Hundewale

Ludmila Perelygina

Alex Zelikovsky

Page 2: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Agenda

Introduction

DNA Array Flow (DAF)

Benchmarks: Herpes B virus

Experiments and Results

Conclusion and Future Work

Page 3: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Motivation

Microarrays provide a tool for answering a wide variety of questions about the dynamics of cells:

In which cells is each gene active?

Under what environmental conditions is each gene active?

How does the activity level of a gene change under different conditions? Stage of a cell cycle? Environmental conditions? Diseases?

Which genes seem to be regulated together?

Page 4: Consolidating Software Tools for DNA Microarray Design and Manufacturing

DNA Array Flow

1. Download the genome sequence and extract ORFs in FASTA format.

2. For each gene G, find probes that hybridize to G at a given TM but do not hybridize to any other gene at that TM.

3. Probe placement: determine, for each probe, the site on the 2-D array surface where it is placed or synthesized. Probe embedding: embed each probe into the deposition sequence.

4. A photolithographic process with a sequence of masks is used to manufacture the array.

5. Each probe binds to its target according to the complementarity rules.

6. Hybridization intensities are measured by a laser scanner and converted to quantitative values that can be read.

Flow: Genome ID → Reading genomic data → Probe selection → Physical design → Mask and array manufacturing → Hybridization experiment → Analysis of hybridization intensities

Page 5: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Reading genomic data

Page 6: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Reading Genomic Data

Input: the genome ID

Download the genome sequence from GenBank (Bioperl)

ORF extraction from the genome: GeneMark (Borodovsky, GaTech), or ORF Finder. Extracting extra ORFs: ( )

ORF Parser: ORFs in FASTA format

Flow: Genome ID → download → ORF extraction → ORF Parser → Probe selection

Page 7: Consolidating Software Tools for DNA Microarray Design and Manufacturing

ORF Extraction

ORF extraction from the genome downloaded from GenBank (Bioperl): GeneMark (Borodovsky, GaTech), or ORF Finder. Extracting extra ORFs: ( )

The extracted ORFs go to the ORF Parser, which outputs ORFs in FASTA format for probe selection.

Page 8: Consolidating Software Tools for DNA Microarray Design and Manufacturing

What is an ORF?

An open reading frame (ORF) is a subsequence of DNA that could potentially be transcribed into messenger RNA (mRNA).

Because of the differences between prokaryotic and eukaryotic transcription systems, there are two types of ORF:

1. Prokaryotic: start and stop codon

2. Eukaryotic: stop codon
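To make the start-and-stop-codon definition concrete, here is a minimal sketch of an ORF scan. It is not the method used by GeneMark or ORF Finder; it assumes the standard start codon ATG and stop codons TAA, TAG, TGA, scans only the forward strand, and the function name and `min_codons` parameter are hypothetical.

```python
# Assumption: standard genetic code start/stop codons, forward strand only.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop stretch in the three forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open start codon, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Close the ORF at the first in-frame stop codon (stop included).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    for start, end, seq in find_orfs("CCATGAAATAGGG"):
        print(start, end, seq)
```

A real pipeline would also scan the reverse complement and apply a much larger minimum length.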

Page 9: Consolidating Software Tools for DNA Microarray Design and Manufacturing

ORF Parser

The ORF Parser takes the ORFs extracted by GeneMark or ORF Finder and writes them in FASTA format for probe selection.

Page 10: Consolidating Software Tools for DNA Microarray Design and Manufacturing

DNA Array Flow

Page 11: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Probe Selection

Reading genomic data

ORF preprocessing

Choosing the best melting temperature

Ocand: find all candidates for the given temperature

Promide

Pools of probes

Physical design

Page 12: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Probe Selection Requirements

Homogeneity: ensure that the probes can bind to their targets at the temperature of the experiment.

Sensitivity: avoid self-hybridization. Ensure that the probes will not form a secondary structure, since such a structure would prevent a probe from binding to its target.

Specificity:
– The probes stay unique even after a few bases are changed
– Each probe must hybridize to one particular gene. For each gene G, find probes that:
   1. hybridize to G at a given temperature
   2. do not hybridize to any other gene at that temperature
– Avoid cross-hybridization
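The homogeneity and specificity requirements can be illustrated with a small brute-force sketch. This is not how Promide works: it assumes the Wallace rule (2 degrees per A/T, 4 per G/C) as a rough melting-temperature estimate for short oligos, and it approximates "does not hybridize" by a minimum number of mismatches. All function names and the `min_mismatch` threshold are hypothetical.

```python
def wallace_tm(probe):
    """Rough melting temperature in degrees C (Wallace rule, an assumption for short oligos)."""
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    return 2 * at + 4 * gc


def min_mismatches(probe, sequence):
    """Fewest mismatches between the probe and any same-length window of the sequence."""
    k = len(probe)
    best = k  # if the sequence is shorter than the probe, nothing can match
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        best = min(best, sum(a != b for a, b in zip(probe, window)))
    return best


def is_specific(probe, target, genes, min_mismatch=2):
    """True if the probe occurs exactly in its target gene and stays at least
    min_mismatch changes away from every other gene (genes: name -> sequence)."""
    if min_mismatches(probe, genes[target]) != 0:
        return False
    return all(min_mismatches(probe, seq) >= min_mismatch
               for name, seq in genes.items() if name != target)


if __name__ == "__main__":
    genes = {"g1": "ACGTACGTTT", "g2": "ACGAACCCCC"}
    for probe in ("ACGT", "GTTT"):
        print(probe, wallace_tm(probe), is_specific(probe, "g1", genes))
```

In this toy input, ACGT is rejected because g2 contains ACGA, one change away, while GTTT is accepted.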

Page 13: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Why Promide?

Possible solutions:
Li and Stormo 2001
Kaderali and Schliep 2002
Rahmann (Promide) 2003
They use the same data structure: suffix array

Promide handles truly large-scale datasets in a reasonable amount of time:
Human GeneNest clusters: in 50 hours
Neurospora crassa: Promide: a few hours; Li and Stormo: 1 week

Page 14: Consolidating Software Tools for DNA Microarray Design and Manufacturing

ORF preprocessing

Classes of Sequences:
• A Master sequence is a sequence we wish to design oligos for.
• A Background sequence is a sequence against which specificity is checked.
• Every Master is also a Background.

Page 15: Consolidating Software Tools for DNA Microarray Design and Manufacturing

For each candidate oligo (substring) of a Master, do:
– Check side constraints
– Compute specificity: optimal TM-alignment with every Background collection

Compute matching statistics: mims

Oligo candidate selection: ocand

Choosing best melting temperature
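As an illustration of the matching-statistics step, the sketch below computes, for each position of a Master, the length of the longest substring starting there that also occurs in a Background. The slides do not describe how mims works internally; this is a naive quadratic version written only to show the quantity, whereas the tools named above rely on suffix arrays. The function name is hypothetical.

```python
def matching_statistics(master, background):
    """ms[i] = length of the longest prefix of master[i:] that occurs somewhere in background."""
    ms = []
    for i in range(len(master)):
        length = 0
        # Extend the match one character at a time while it still occurs in the background.
        while i + length < len(master) and master[i:i + length + 1] in background:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    print(matching_statistics("ACGT", "CGA"))
```

Positions with short matches are natural starting points for candidate oligos, because a window that shares only short substrings with the Background is less likely to cross-hybridize.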

Page 16: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Mask and array manufacturing

Page 17: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Mask and Array Manufacturing

Arrays are synthesized on a wafer:

1. Selectively expose array sites to light

2. Flush the chip’s surface with a solution of protected A, C, G, or T

3. Repeat the last two steps until the desired probes are synthesized

Page 18: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Mask and Array Manufacturing

Array probes (a 3×3 array):

CG   AC   G
AC   ACG  AG
CG   AG   C

Nucleotide deposition sequence: ACG

Mask 1 (A): A is deposited at five sites.

Page 19: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Array Manufacturing

[Figure: the 3×3 array of probes (CG AC G / AC ACG AG / CG AG C) with nucleotide deposition sequence ACG. Mask 2 (C): C is deposited at six sites, after the five A's of Mask 1.]

Page 20: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Array Manufacturing

[Figure: the 3×3 array of probes (CG AC G / AC ACG AG / CG AG C) with nucleotide deposition sequence ACG. Mask 3 (G): G is deposited at six sites, after the A's of Mask 1 and the C's of Mask 2.]

A Nucleotide Deposition Sequence defines the order of nucleotide deposition.

A Probe Embedding specifies the steps of the nucleotide deposition sequence in which the probe gets synthesized.
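The relation between the deposition sequence, the embeddings, and the masks can be sketched as follows. The code assumes each probe is embedded as early as possible in the deposition sequence; the function names are hypothetical and the array is the 3×3 example from the slides.

```python
def asap_steps(probe, deposition):
    """Deposition steps used by the probe when every base is placed as early as possible."""
    steps, pos = set(), 0
    for base in probe:
        pos = deposition.index(base, pos)  # ValueError if the probe cannot be embedded
        steps.add(pos)
        pos += 1
    return steps


def build_masks(array, deposition):
    """masks[t][r][c] is True when site (r, c) is exposed at deposition step t."""
    embedded = [[asap_steps(probe, deposition) for probe in row] for row in array]
    return [[[t in steps for steps in row] for row in embedded]
            for t in range(len(deposition))]


if __name__ == "__main__":
    array = [["CG", "AC", "G"],
             ["AC", "ACG", "AG"],
             ["CG", "AG", "C"]]
    for base, mask in zip("ACG", build_masks(array, "ACG")):
        exposed = sum(sum(row) for row in mask)
        print(f"Mask for {base}: {exposed} exposed sites")
```

For this example the three masks expose five, six, and six sites, in line with the A, C, and G counts in the mask figures.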

Page 21: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Border Minimization Challenges

[Figure: Mask 1 (A) for the 3×3 array of probes (CG AC G / AC ACG AG / CG AG C) with nucleotide deposition sequence ACG. Border = 8.]

Border reduction limits unwanted illumination and improves the chip’s yield.

Page 22: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Border Minimization Challenges

[Figure: lamp, mask, and array, showing the intentionally exposed sites, the unwanted illumination, and the border.]

Problem: diffraction, internal reflection, scattering, internal illumination

Occurs at sites near intentionally exposed sites

Reduce border → increase yield → reduce cost

Design objective: minimize the border
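A minimal sketch of how the border of a single mask can be counted: every pair of horizontally or vertically adjacent sites where one is exposed and the other is not contributes one unit. The function name is hypothetical; the example mask is the A mask of the 3×3 array shown earlier in the slides.

```python
def border_length(mask):
    """Count adjacent site pairs (horizontal and vertical) with different exposure."""
    rows, cols = len(mask), len(mask[0])
    border = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and mask[r][c] != mask[r][c + 1]:
                border += 1
            if r + 1 < rows and mask[r][c] != mask[r + 1][c]:
                border += 1
    return border


if __name__ == "__main__":
    # True = site exposed when A is deposited (probes AC, AC, ACG, AG, AG).
    mask_a = [[False, True, False],
              [True, True, True],
              [False, True, False]]
    print(border_length(mask_a))
```

For this mask the count is 8, the border value reported on the slide for Mask 1.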

Page 23: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Physical design

Page 24: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Physical Design

Deposition sequence design

Mask and array manufacturing

Probe Selection

Test control

2D-probe placement

3D-probe embedding

Page 25: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Physical Design

• Probe Placement
   • Similar probes should be placed close together
   • Constructive placement
   • Placement improvement operators

• Probe Embedding
   • Degrees of freedom (DOF) in probe embedding
   • DOF exploitation for border conflict reduction

Page 26: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Border Reduction with Probe Placement

Probe placement: similar probes should be placed close together.

[Figure: example placement of short probes (such as CT and TA) against the deposition sequence. Before optimization, Border = 8; after optimization, Border = 4.]

Page 27: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Border Reduction in Probe Embedding

Probe embedding:

Synchronous embedding: deposit one nucleotide in each group of “ACGT”

Asynchronous embedding: no restriction

[Figure: probes CT and TA embedded into the deposition sequence. With synchronous embedding, Border = 4; with asynchronous embedding, Border = 2.]
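The difference between the two kinds of embedding can be sketched for the probes CT and TA. The code assumes a periodic deposition sequence ACGTACGT and uses the as-soon-as-possible (ASAP) embedding as one example of an asynchronous embedding; conflicts between two neighboring probes are counted as the steps in which exactly one of them is exposed. Function names are hypothetical.

```python
PERIOD = "ACGT"


def synchronous_embedding(probe):
    """The i-th base is deposited in the i-th ACGT group of a periodic deposition sequence."""
    return {4 * i + PERIOD.index(base) for i, base in enumerate(probe)}


def asap_embedding(probe, deposition):
    """Each base is deposited at the earliest step that still follows the previous base."""
    steps, pos = set(), 0
    for base in probe:
        pos = deposition.index(base, pos)  # ValueError if the probe does not fit
        steps.add(pos)
        pos += 1
    return steps


def conflicts(embedding_a, embedding_b):
    """Steps in which exactly one of two neighboring sites is exposed."""
    return len(embedding_a ^ embedding_b)


if __name__ == "__main__":
    deposition = PERIOD * 2  # assumption: two ACGT groups are enough for 2-mers
    sync = conflicts(synchronous_embedding("CT"), synchronous_embedding("TA"))
    asap = conflicts(asap_embedding("CT", deposition), asap_embedding("TA", deposition))
    print("synchronous:", sync, "asynchronous (ASAP):", asap)
```

Under these assumptions the pair has 4 conflicts when embedded synchronously and 2 when embedded as early as possible, the same values as the two border figures on the slide.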

Page 28: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Physical Design Problem

Given: n² probes

Find: placement of the probes in n × n sites; embedding of the probes

Minimize: total border cost

Page 29: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Problem formulation for placement

2-dim (synchronous) Array Design Problem: minimize the placement cost of the Hamming graph H (vertices = probes, distance = Hamming) on the 2-dim grid graph G2 (N x N array, edges b/w neighbors). Each probe (a vertex of H) is assigned to a site (a vertex of G2).

Hamming distance (P1, P2) = number of positions at which the nucleotides of P1 and P2 differ = border (synchronous embedding)
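A minimal sketch of the placement cost in the synchronous formulation: sum the Hamming distance over every pair of probes placed on neighboring grid sites. It assumes equal-length probes and a grid given as a list of rows; the function names are hypothetical.

```python
def hamming(p, q):
    """Number of positions at which two equal-length probes differ."""
    if len(p) != len(q):
        raise ValueError("probes must have the same length")
    return sum(a != b for a, b in zip(p, q))


def placement_cost(grid):
    """Total Hamming distance over all horizontally and vertically adjacent probes."""
    cost = 0
    for r, row in enumerate(grid):
        for c, probe in enumerate(row):
            if c + 1 < len(row):
                cost += hamming(probe, row[c + 1])
            if r + 1 < len(grid):
                cost += hamming(probe, grid[r + 1][c])
    return cost


if __name__ == "__main__":
    scattered = [["AAAA", "TTTT"], ["TTTT", "AAAA"]]
    grouped = [["AAAA", "AAAA"], ["TTTT", "TTTT"]]
    print(placement_cost(scattered), placement_cost(grouped))
```

Grouping identical probes in the same row halves the cost in this toy case, which is the effect that placing similar probes close together is meant to achieve.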

Page 30: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Placement Objective: Minimize Border

Sort the probes in lexicographical order.

[Figure: probes 1, 2, 3, ..., 25 listed in sorted order.]

Sorting the probes reduces discrepancies between adjacent probes.

Problem: how to place the 1-D ordering of probes onto the 2-D chip?

Page 31: Consolidating Software Tools for DNA Microarray Design and Manufacturing

TSP+1-Threading Placement

Hubbell, 1990s:
Find a TSP tour/path over the given probes with Hamming distance
Place the probes in the grid following the TSP order
Adjacent probes are similar

Hannenhalli, Hubbell, Lipshutz, Pevzner ’02:
Place the probes according to 1-threading
This further decreases the total border by 20%
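The idea of ordering probes along a short TSP path can be sketched with a nearest-neighbor heuristic. This is a swapped-in, simple heuristic chosen for brevity, not the TSP method used in the cited work; the function names are hypothetical.

```python
def hamming(p, q):
    """Number of positions at which two equal-length probes differ."""
    return sum(a != b for a, b in zip(p, q))


def nearest_neighbor_path(probes):
    """Greedy path: start at the first probe, then always move to the closest unvisited probe."""
    remaining = list(probes)
    path = [remaining.pop(0)]
    while remaining:
        last = path[-1]
        # Ties are broken in favor of the probe that appears first in the input.
        nearest = min(range(len(remaining)), key=lambda i: hamming(last, remaining[i]))
        path.append(remaining.pop(nearest))
    return path


if __name__ == "__main__":
    print(nearest_neighbor_path(["AAAA", "TTTT", "AAAT", "TTTA"]))
```

Consecutive probes in the resulting path tend to be similar, so threading the path onto the grid keeps similar probes on neighboring sites.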

Page 32: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Placement By Threading

[Figure: the 1-D order of probes 1, 2, 3, ..., 25 is laid out along a thread on the chip.]
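A minimal sketch of placing a 1-D order of probes onto the 2-D chip. It uses a plain serpentine (row by row, alternating direction) thread, which is simpler than the 1-threading named on the previous slide, and it assumes exactly n × n probes. Function names are hypothetical.

```python
def thread_serpentine(ordered, n):
    """Lay a 1-D order onto an n x n grid, reversing every second row."""
    if len(ordered) != n * n:
        raise ValueError("need exactly n * n probes")
    grid = []
    for r in range(n):
        row = list(ordered[r * n:(r + 1) * n])
        if r % 2 == 1:
            row.reverse()  # keeps consecutive probes on neighboring sites at row ends
        grid.append(row)
    return grid


def sort_and_thread(probes, n):
    """Sort the probes lexicographically, then thread them onto the grid."""
    return thread_serpentine(sorted(probes), n)


if __name__ == "__main__":
    for row in sort_and_thread(["TA", "CT", "AC", "CG"], 2):
        print(row)
```

Reversing alternate rows means that probes adjacent in the 1-D order are always adjacent on the chip, including at the turn from one row to the next.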

Page 33: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Row-Epitaxial Placement Improvement

For each site position (i, j):

Find the best probe, that is, the probe that minimizes the border at (i, j)

Move (switch) the best probe to (i, j) and lock it in this position

Row placement = sort + thread + row epitaxial
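The row-epitaxial loop can be sketched as follows. This is a simplified reading of the slide, not the authors' implementation: the cost of a candidate is its Hamming distance to the already locked left and upper neighbors only, and candidates are taken from unlocked sites in the current row and a hypothetical `lookahead` number of following rows.

```python
def hamming(p, q):
    """Number of positions at which two equal-length probes differ."""
    return sum(a != b for a, b in zip(p, q))


def row_epitaxial(grid, lookahead=2):
    """Visit sites in row-major order; swap in the unlocked probe that best fits
    the locked left/upper neighbors, then treat the site as locked."""
    grid = [row[:] for row in grid]  # work on a copy
    n = len(grid)
    for i in range(n):
        for j in range(len(grid[i])):
            neighbors = []
            if i > 0:
                neighbors.append(grid[i - 1][j])
            if j > 0:
                neighbors.append(grid[i][j - 1])
            if not neighbors:
                continue  # the first site has no locked neighbors
            best = (i, j)
            best_cost = sum(hamming(grid[i][j], q) for q in neighbors)
            for r in range(i, min(n, i + lookahead + 1)):
                for c in range(len(grid[r])):
                    if (r, c) <= (i, j):
                        continue  # earlier sites are already locked
                    cost = sum(hamming(grid[r][c], q) for q in neighbors)
                    if cost < best_cost:
                        best, best_cost = (r, c), cost
            r, c = best
            grid[i][j], grid[r][c] = grid[r][c], grid[i][j]
    return grid


if __name__ == "__main__":
    for row in row_epitaxial([["AA", "TT"], ["AT", "AA"]]):
        print(row)
```

The procedure is greedy: each swap looks only at locked neighbors, so it trades optimality for speed.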

Page 34: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Probe Embedding

[Figure: three embeddings of the hypothetical probe CTG into the deposition sequence, which is divided into groups: a synchronous embedding, an asynchronous embedding, and another embedding.]

Page 35: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Embedding Determines Border Conflicts

[Figure: probes such as AGT and GTG embedded into the deposition sequence, once with synchronous embedding and once with ASAP embedding; the two embeddings produce different border conflicts.]

Page 36: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Problem formulation

2-dim (synchronous) Array Design Problem: Minimize placement cost of Hamming graph H

(vertices=probes, distance = Hamming) on 2-dim grid graph G2 (N x N array, edges b/w neighbors)

3-dim (asynchronous) Array Design Problem: Minimize cost of placement and embedding of Hamming graph H’

(vertices=probes, distance = Hamming b/w embedded probes) on 2-dim grid graph G2 (N x N array, edges b/w neighbors)

Page 37: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Post-placement Optimization Methods

Asynchronous re-embedding after 2-dim placement

Greedy algorithm:
While there exist probes to re-embed with gain, optimally re-embed the probe with the largest gain

Batched greedy: speed-up by avoiding recalculations

Chessboard algorithm:
While there is gain:
Re-embed probes in red sites
Re-embed probes in green sites
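The step "optimally re-embed the probe" can be sketched as a small dynamic program. Given the embeddings of the neighboring probes (held fixed), it finds the embedding of one probe that minimizes the number of conflicts with them. This is an illustrative formulation under the assumption that a conflict is a deposition step in which the probe and a neighbor differ in exposure; the function name is hypothetical.

```python
def optimal_reembed(probe, deposition, neighbors):
    """Return (min_conflicts, steps) for embedding probe against fixed neighbor embeddings.

    neighbors: list of sets of deposition steps used by the adjacent probes.
    """
    T, m, k = len(deposition), len(probe), len(neighbors)
    exposed = [sum(t in nb for nb in neighbors) for t in range(T)]
    inf = float("inf")
    # dp[t][j]: fewest conflicts in the first t steps with j bases already deposited.
    dp = [[inf] * (m + 1) for _ in range(T + 1)]
    dp[0][0] = 0
    for t in range(T):
        for j in range(m + 1):
            if dp[t][j] == inf:
                continue
            # Skip step t: conflicts with every neighbor exposed at t.
            dp[t + 1][j] = min(dp[t + 1][j], dp[t][j] + exposed[t])
            # Use step t: conflicts with every neighbor not exposed at t.
            if j < m and deposition[t] == probe[j]:
                dp[t + 1][j + 1] = min(dp[t + 1][j + 1], dp[t][j] + k - exposed[t])
    if dp[T][m] == inf:
        raise ValueError("probe cannot be embedded in the deposition sequence")
    steps, j = [], m
    for t in range(T, 0, -1):  # trace back one optimal embedding
        if (j > 0 and deposition[t - 1] == probe[j - 1]
                and dp[t][j] == dp[t - 1][j - 1] + k - exposed[t - 1]):
            steps.append(t - 1)
            j -= 1
    return dp[T][m], sorted(steps)


if __name__ == "__main__":
    print(optimal_reembed("CT", "ACGTACGT", [{3, 4}]))
```

A chessboard pass would call such a routine for all probes on sites of one color while the other color stays fixed, then swap colors, since sites of the same color are never adjacent.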

Page 38: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Analysis of hybridization intensities

Page 39: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Experimental Study

In our experiments we considered the following parameters and measured the results for different values of each:

Melting temperature: we chose 60°C and 65°C as the best melting temperatures for our DNA probe array.

Number of candidates: we experimented with different values of K (number of candidates) for each pool of probes: 1 and 2.

Chip size: we ran our experiments with two chip sizes, 50x50 and 60x60.

We give the number of conflicts and the runtime of each algorithm for the Herpes B virus and for simulated data.

Page 40: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Experiments Outline

Genome ID

Sequence in FASTA format (Bioperl)

ORF extraction (GeneMark)

ORFs in FASTA format (ORF Parser)

Select probes: pool of probes (Promide)

Pools of probes in chip format (Probe Parser)

Read pool / Genpool

Placements: sorting

Placements: TSP

Placements: row placement

Embedding: chessboard

Chip

Reported for all algorithms: number of conflicts and CPU time

Page 41: Consolidating Software Tools for DNA Microarray Design and Manufacturing

TM=65, Size=50x50, K=2

              Herpes B Virus                  Simulated Data
Algorithm     # Conflicts   CPU Time (sec)    # Conflicts   CPU Time (sec)
Initial       43459         -                 183532        -
Tsort         39192         0.09              163402        0.04
Tsp           38143         0.11              159194        0.045
Lalign        34434         0.12              132698        0.9
Reptx 2       25938         7.75              109248        3.61
Chessboard    25504         25.66             106344        9.4

TM=65, Size=50x50, K=1

              Herpes B Virus                  Simulated Data
Algorithm     # Conflicts   CPU Time (sec)    # Conflicts   CPU Time (sec)
Initial       83096         -                 183782        -
Tsort         74367         0.15              162926        0.05
Tsp           72141         0.2               159186        0.065
Lalign        60664         0.25              132358        0.08
Reptx 2       48582         4.25              115188        0.9
Chessboard    47652         18.64             112148        6.13
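To read the tables as relative improvements, the conflict counts can be converted to percentage reductions with respect to the initial placement. The sketch below does this for the Herpes B virus column at TM=65, 50x50, K=2, using only numbers from the table; the function name is hypothetical.

```python
# Conflict counts copied from the TM=65, 50x50, K=2 table (Herpes B virus column).
HB_K2_50X50 = {
    "Initial": 43459,
    "Tsort": 39192,
    "Tsp": 38143,
    "Lalign": 34434,
    "Reptx 2": 25938,
    "Chessboard": 25504,
}


def reduction_percent(initial, final):
    """Percentage of conflicts removed relative to the initial count."""
    return 100.0 * (initial - final) / initial


if __name__ == "__main__":
    initial = HB_K2_50X50["Initial"]
    for name, conflicts in HB_K2_50X50.items():
        print(f"{name:<11}{conflicts:>7}{reduction_percent(initial, conflicts):>8.1f}%")
```

For this column the full chain through Chessboard removes roughly 41% of the initial conflicts, with the largest single step coming from the row-epitaxial stage.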

Page 42: Consolidating Software Tools for DNA Microarray Design and Manufacturing

[Chart: Number of conflicts for each algorithm, Tm=65, Size=50x50. X axis: algorithms; Y axis: number of conflicts (0 to 300000). Series: HB-k=2, Simulated-k=2, HB-k=1, Simulated-k=1.]

[Chart: CPU time for each algorithm, Tm=65, Size=50x50. X axis: algorithms; Y axis: CPU time in s (0 to 25). Series: HB-k=2, Simulated-k=2, HB-k=1, Simulated-k=1.]

Page 43: Consolidating Software Tools for DNA Microarray Design and Manufacturing

TM=65, Size=60x60, K=1

              Herpes B Virus                  Simulated Data
Algorithm     # Conflicts   CPU Time (sec)    # Conflicts   CPU Time (sec)
Initial       107577        -                 265992        -
Tsort         98830         0.17              231526        0.08
Tsp           95640         0.22              227960        0.09
Lalign        79254         0.25              189272        0.1
Reptx 2       64830         4.45              154766        1.58
Chessboard    63594         15.58             150812        7.1

TM=65, Size=60x60, K=2

              Herpes B Virus                  Simulated Data
Algorithm     # Conflicts   CPU Time (sec)    # Conflicts   CPU Time (sec)
Initial       54205         -                 265328        -
Tsort         49746         0.3               232954        0.14
Tsp           48541         0.34              227762        0.15
Lalign        42858         0.42              182972        0.16
Reptx 2       32098         7.84              149332        3.16
Chessboard    31498         20.93             146708        10.89

Page 44: Consolidating Software Tools for DNA Microarray Design and Manufacturing

[Chart: Number of conflicts for each algorithm, Tm=65, Size=60x60. X axis: algorithms; Y axis: number of conflicts (0 to 200000). Series: HB-k=2, Simulated-k=2, HB-k=1, Simulated-k=1.]

[Chart: CPU time for each algorithm, Tm=65, Size=60x60. X axis: algorithms; Y axis: CPU time in s (0 to 20). Series: HB-k=2, Simulated-k=2, HB-k=1, Simulated-k=1.]

Page 45: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Conclusion and Future work

Conclusion: our experiments show:

The genomic data follow the pattern predicted by simulated data

For the Herpes B virus, as for simulated data, increasing the number of candidates per probe (k) decreases the number of border conflicts during the probe placement algorithms

The number of border conflicts is several times smaller than for simulated data

There is a trade-off between the number of border conflicts and the CPU time taken by the various physical design algorithms

We give a consolidated software solution for the entire DNA array flow

We explore all steps in a single automated software suite of tools

Future work:

The entire software suite will be made available through web services

Users will enter the name or ID of an organism and, with the option of setting the required parameters, the suite will produce the DNA probe microarray chip layout

Page 46: Consolidating Software Tools for DNA Microarray Design and Manufacturing

Thank you