CELLULAR AUTOMATA

Master's degree (Laurea Magistrale) in Industrial Biotechnology
Academic year 2011-2012

Applications of simulators to the study of biological sequences and infectious diseases

Danilo Comino
Giorgia Drera
Mariana Dumitriu
Mattia Salvatore

1 Cellular automata

Many systems in the physical, socio-economic, urban and biological world can be described as complex systems: the motion of fluids, the economic transformation of a region, the growth of an urban centre, the life of an organism, the spread of diseases and many others. The study of complex systems has been entrusted to tools that are alternatives to traditional mathematics, among them cellular automata. These sidestep two problems raised by the infinitesimal calculus of Newton and Leibniz: the complexity of the equations, and the numerical approximation, which can strongly influence the evolution of the system.

The concept of a cellular automaton was introduced in 1947 by von Neumann during his studies of biological phenomena, which he described as modes of mutual interaction between elementary entities called automata. According to von Neumann, given a set of many automata able to interact in a suitable way, the system as a whole shows complex and varied behaviours, as if they were directed towards a global goal. A complex system can therefore be defined as a set of simple entities, the automata, which interact with one another; taken together, this mutual interaction generates the global behaviour of the complex system. The idea behind cellular automata is thus to describe a complex system by means of simple rules for the interactions between the components into which the system has been divided.

In mathematics and logic an automaton is a formalism for describing the behaviour of a machine: the automaton is a kind of closed box that receives information from outside (input), performs actions, and returns other information (output). The actions are based on rules that define the relations between input, internal state and output, that is, they associate inputs with outputs. The internal states represent the situation of the system at a given instant and constitute the memory of the system. Several automata can be connected so that the output of one automaton is the input of another, forming a network of automata. A cellular automaton is characterised by a lattice (or grid) and a neighbourhood.

1.1 Lattice

In the space $\mathbb{R}^d$ (where d is the dimension of the space in which the cellular automaton lives, usually d ≤ 3, that is, at most three-dimensional space) a set of cells is considered, generally arranged on a lattice. If d = 1 the cells of the automaton are usually drawn as a row of cells.


If d = 2 the cells can be of various types.

If d = 3 the cells are usually represented as cubes or parallelepipeds. Note that each cell can be uniquely identified by assigning it d integers $i_1, \ldots, i_d \in \mathbb{Z}$ that give its position in the grid. In the one-dimensional case, for example, it is enough to decide which cell to label with 0 and in which direction the positive numbers run.

In the two-dimensional case it is enough to decide which cell to label (0,0) and which are the two positive directions (as in a system of Cartesian axes). The position of a cell is then given by two indices (i, j), so that the cell to its left is (i, j − 1), the one to its right is (i, j + 1), and so on. If the lattice is $\mathbb{Z}^d$, it has infinitely many points and no borders. For a finite lattice, which does have borders, one of the following choices must be made:

Identification of the borders: in the one-dimensional case the cells are arranged on a cylinder, while in the two-dimensional case they are arranged on a torus (see figures).

Figure 1: Cylinder representation of a one-dimensional automaton

Figure 2: Torus representation of a two-dimensional automaton

Reflection of the border cells: consider, for example, the one-dimensional case on a finite lattice. With reflecting borders we distinguish between interior cells and border cells: in a row of three cells numbered 0, 1, 2, cell 1 is interior while cells 0 and 2 are border cells. The cells to the left of cell 0 always have the same state as border cell 0, and the cells to the right of cell 2 always have the same state as border cell 2.

Constant border: the state of the border is fixed, that is, it has no time evolution.
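The three border choices can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function name `step`, the `boundary` keyword values and the demo rule `copy_left` are hypothetical and do not come from the report, and the row is a plain Python list of states.

```python
def step(states, rule, boundary="periodic"):
    """One synchronous update of a finite 1D row of cells.

    `rule(left, centre, right)` returns the next state of the centre cell.
    All names here are illustrative assumptions, not the report's notation.
    """
    n = len(states)
    if boundary == "constant":
        # Border cells never evolve; only interior cells are updated.
        if n < 2:
            return list(states)
        inner = [rule(states[i - 1], states[i], states[i + 1])
                 for i in range(1, n - 1)]
        return [states[0]] + inner + [states[-1]]

    def at(j):
        if boundary == "periodic":
            # Borders identified: the row closes on itself.
            return states[j % n]
        if boundary == "reflecting":
            # Cells beyond a border copy the state of that border cell.
            return states[min(max(j, 0), n - 1)]
        raise ValueError(f"unknown boundary: {boundary!r}")

    return [rule(at(i - 1), at(i), at(i + 1)) for i in range(n)]


def copy_left(left, centre, right):
    """Hypothetical demo rule: each cell takes the state of its left neighbour."""
    return left


row = [0, 0, 1]
periodic = step(row, copy_left, "periodic")      # cell 0 sees cell 2 on its left
reflecting = step(row, copy_left, "reflecting")  # cell 0 sees a copy of itself
constant = step(row, copy_left, "constant")      # cells 0 and 2 are frozen
```

With the same starting row and the same rule, the three choices give three different rows, which is why the border treatment has to be fixed before a simulation is run.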

1.2 Neighbourhood

It is assumed that any cell i interacts only with a certain set U(i) of other cells (for example the immediately adjacent ones). In the one-dimensional case $U(i) = \{i - r, \ldots, i, \ldots, i + r\}$, where the constant r is the radius of the neighbourhood; the number of cells it contains, $k = |U(i)| = 2r + 1$, is the order of the neighbourhood. In the one-dimensional case (lattice in $\mathbb{Z}$), taking $U(i) = \{i - 1, i, i + 1\}$, the neighbourhood of cell 2 is $U(2) = \{1, 2, 3\}$. Note that U(i) always contains the cell itself, that is, $i \in U(i)$. The most widely used neighbourhoods are those of von Neumann and of Moore, usually drawn in the two-dimensional case for two values of the radius (r = 1 and r = 2).

If the lattice is in $\mathbb{Z}^d$, each cell is represented by d integers. The general definition of the von Neumann neighbourhood is

$U(i_1, \ldots, i_d) = \{(j_1, \ldots, j_d) : |j_1 - i_1| + \ldots + |j_d - i_d| \le r\}$

while that of the Moore neighbourhood is

$U(i_1, \ldots, i_d) = \{(j_1, \ldots, j_d) : |j_1 - i_1| \le r \text{ and } \ldots \text{ and } |j_d - i_d| \le r\}$

In the one-dimensional case d = 1 the two therefore coincide.

Each cell is assigned a state that expresses one of its properties. The states are by hypothesis finite in number: $\sigma(i) = \sigma_i \in S$, where S is the set of states (also called the state space) and $\sigma_i$ is the state of cell i.
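The two definitions translate almost literally into code. The following is a minimal sketch, assuming cells are represented as tuples of d integers; the function names `von_neumann` and `moore` are chosen here for illustration.

```python
from itertools import product


def von_neumann(cell, r=1):
    """Cells j with |j_1 - i_1| + ... + |j_d - i_d| <= r (cell itself included)."""
    offsets = product(range(-r, r + 1), repeat=len(cell))
    return {tuple(c + o for c, o in zip(cell, off))
            for off in offsets
            if sum(abs(o) for o in off) <= r}


def moore(cell, r=1):
    """Cells j with |j_k - i_k| <= r for every coordinate k (cell itself included)."""
    offsets = product(range(-r, r + 1), repeat=len(cell))
    return {tuple(c + o for c, o in zip(cell, off)) for off in offsets}


# In two dimensions the two neighbourhoods differ ...
vn_2d = von_neumann((0, 0), r=1)   # the cell plus its 4 edge neighbours
mo_2d = moore((0, 0), r=1)         # the cell plus its 8 surrounding cells

# ... while in one dimension they coincide, as stated in the text.
vn_1d = von_neumann((2,), r=1)
mo_1d = moore((2,), r=1)
```

Counting the cell itself, the sizes are 5 and 9 for r = 1, and 13 and 25 for r = 2; the commonly quoted "4 cells" and "8 cells" exclude the central cell.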


A local configuration is a possible distribution of the states of the cells in a neighbourhood, that is, a function from the cells of the neighbourhood to the set of states. Given a set of states S containing s states, so that |S| = s, the number |CL| of local configurations for a neighbourhood of order |U(i)| = k is $|CL| = |S|^{|U(i)|} = s^k$. For a one-dimensional automaton with a neighbourhood of radius 1, the order k of the neighbourhood is 3 and the neighbourhood is $U(i) = \{i - 1, i, i + 1\}$. For the set of states S = {0, 1}, hence s = 2, CL consists of $s^k = 2^3 = 8$ combinations: 000, 001, 010, 011, 100, 101, 110, 111.
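The count of local configurations can be checked by enumerating them. This is a small sketch; the helper name `local_configurations` is an assumption made for the example.

```python
from itertools import product


def local_configurations(states, k):
    """All assignments of a state to each of the k cells of a neighbourhood."""
    return list(product(states, repeat=k))


# Two states and a neighbourhood of order 3 (radius 1 in one dimension).
configs = local_configurations((0, 1), 3)
as_strings = ["".join(str(s) for s in c) for c in configs]
```

The list has exactly s**k entries, so the number of configurations grows quickly with both the number of states and the order of the neighbourhood.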

There are evolution rules, that is, rules that describe how the states of the cells evolve in time, in other words the dynamics of the model. The evolution rules describe the passage from $\sigma_i(t)$, the state of cell i at time t, to $\sigma_i(t+1)$, its state at time t+1. Note, therefore, that time in a cellular automaton (CA) is by hypothesis discrete: t = 0, 1, 2, 3, ... Finally, the evolution rule is assumed to depend only on the states $\sigma_j(t)$ for $j \in U(i)$, that is, only on the states of the cells j near i, in the sense of belonging to the prescribed neighbourhood U(i) of i. Concretely, if the neighbourhoods all consist of n cells, the evolution rules of a CA are given by a function

$\varphi : S^n \to S$

which maps the states of the n cells in the neighbourhood of a cell to that cell's next state.

1.3 Characteristics of a cellular automaton

The fundamental characteristics of a cellular automaton are the following:

Parallelism: the cells update simultaneously (in parallel), each one processing the information it receives and moving to the resulting state.

Locality: the new state reached by a cell at time t+1 depends only on its own state and on that of the cells in its neighbourhood at time t.

Homogeneity: every cell is updated according to the same rules.

A local law is a function from the set of all possible local configurations to the set of states. A cell that finds itself in a given configuration moves to the state given by f, with $f : CL \to S$. Let $s(t) \in S$ denote the state of a given cell c at time t. If at time t the cell c is in state 0, s(t) = 0, then at time t+1 the new state of the cell is s(t+1) = f(x), where x is the local configuration around c at time t.

If we take several automata and give them a spatial location, connections can be established on the basis of distance. In other words, nearby automata are connected in some way and distant automata are not. This brings in the computation of distance and hence the definition of a metric on the space. The space can have various dimensions: zero, one, two or three. Cellular automata are particularly effective for describing complex phenomena that take place in space.

A cellular automaton can be seen as a matrix of square cells that evolve in time; at each instant every cell is in a state that belongs to a finite set of possible states. At time t+1 the change of state of a cell depends on the state of the neighbouring cells at the previous time t, and the content of a cell is updated according to a fixed rule that depends on the content of the cell itself and on the content of the cells with which it can communicate (what was defined above as the neighbourhood of the cell). At each step (cycle) the content of all cells is updated simultaneously, in parallel.

The simplest way to understand the space-time dynamics is to use a one-dimensional space. The automata are located along this space, and the situation is represented, as said above, by a row of cells. Each cell is an automaton and therefore has an output, that is, a state associated with it. Suppose this state can take the values zero or one, where one corresponds to a built-up cell (red) and zero to a free cell (light blue). To change its state, each cell needs an input that comes from the neighbouring cells, in this case the cell on its right and the one on its left. These two cells form the neighbourhood of the cell. Their state at time t is used as input by the central cell to compute its state at time t+1.

The way the state of the automaton at the next time step is usually computed is explained in the following example. Representing it by its states, the automaton of the previous figure looks as follows:

123456789
000101000

The rule for computing the state of the automaton at time t+1 is the following: for each cell at position x, consider the states of the two cells adjacent to it, at positions x − 1 and x + 1. If at least one of the two neighbouring cells has value 1, the cell takes value 1; if both neighbouring cells have value zero, the cell takes value zero. Leaving aside the two border cells (1 and 9), we start from cell 2. Both its neighbours have state zero, so at time t+1 it takes the value zero. This value is stored but does not affect the computation of the state of the other cells, because these react to the state of the cells at time t. We then move to cell 3, whose right-hand neighbour (cell 4) is equal to 1, so at time t+1 it takes the value 1. By the same method, cell 4 has both neighbours in state zero, so at time t+1 it takes the value zero. We continue in this way up to cell 8. The border cells are assumed to keep the same state at all times (constant borders). The resulting distribution of states is:

123456789
000101000   t
001010100   t+1
010101010   t+2

Using the states at time t+1, the states at time t+2 are computed by the same method. Because this method updates the states of all cells at the same instant, it is called synchronous. If, on the contrary, each newly computed state had been written back immediately, we would have had an asynchronous method. In this second case the final result clearly depends on the order in which the cells are chosen. A random order is generally used.

In two-dimensional space an automaton is, as we have seen, represented by square cells arranged on a grid. Here too each cell has a state that depends on the state of the surrounding cells. This neighbourhood (von Neumann or Moore) can be defined in various ways: 4 cells, 8 cells and more. Assuming there is no difference between cells above or below, to the right or to the left, the states of the cells in the neighbourhood are often summed. This sum becomes a single input for the central cell which, on the basis of it and possibly also of its own state at time t, determines its state at time t+1. To determine the state of the cell, the sum is compared with a fixed threshold: if the sum is greater, the cell takes state 1, otherwise 0. Automata of this kind are called totalistic. The relationship between the sum and the state of the cell can be summarised in a step graph, with the sum of the states of the eight-cell neighbourhood at time t on the abscissa and the state of the central cell at time t+1 on the ordinate.

Figure 7: Relationship between the value of the sum of the neighbourhood states (abscissa) and the state of the cell (ordinate)
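The worked example and the difference between synchronous and asynchronous updating can be reproduced with a few lines of code. This is a sketch, not the authors' simulator: the names `or_rule`, `step_sync` and `step_async` are assumptions, cells are indexed from 0 (so cell 1 of the text is index 0), and the two border cells are kept constant as in the example.

```python
def or_rule(left, right):
    """Next state is 1 if at least one of the two neighbours is 1, else 0."""
    return 1 if (left == 1 or right == 1) else 0


def step_sync(states):
    """Synchronous update: every new state is computed from the states at time t."""
    new = list(states)
    for i in range(1, len(states) - 1):        # border cells stay constant
        new[i] = or_rule(states[i - 1], states[i + 1])
    return new


def step_async(states, order):
    """Asynchronous update: each new state is written back immediately,
    so later cells in `order` already see the updated neighbours."""
    new = list(states)
    for i in order:
        new[i] = or_rule(new[i - 1], new[i + 1])
    return new


t0 = [0, 0, 0, 1, 0, 1, 0, 0, 0]
t1 = step_sync(t0)
t2 = step_sync(t1)

# Same starting row, two different visiting orders of the interior cells.
left_to_right = step_async(t0, range(1, 8))
right_to_left = step_async(t0, range(7, 0, -1))
```

The synchronous rows match the table in the text, while the two asynchronous sweeps differ from it and from each other, which illustrates why the visiting order matters. A random order could be obtained by shuffling the list of interior indices before each sweep.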

A very interesting case is that of models based on the spread of infections. In this case, if a cell is sick, it can infect any neighbouring cell with which it is in contact. Each healthy cell has a certain probability of being infected. If it is not infected for one or more periods, it can be assumed that it will no longer be infected.
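One possible reading of this infection model is sketched below as a probabilistic automaton on a 2D grid. Several details are assumptions made only for the example and are not specified in the text: the three state codes, the von Neumann (4-cell) contact neighbourhood, the infection probability `p`, the number of survived exposures `grace` after which a healthy cell is treated as immune, and the fact that cells outside the grid are ignored.

```python
import random

HEALTHY, INFECTED, IMMUNE = 0, 1, 2   # hypothetical state codes


def infection_step(grid, escapes, p, grace, rng):
    """One synchronous step of a sketch infection model.

    grid[r][c]    -- current state of each cell
    escapes[r][c] -- how many exposures a healthy cell has survived so far
    p             -- probability that an exposed healthy cell is infected
    grace         -- survived exposures after which a cell becomes immune
    rng           -- a random.Random instance (passed in for reproducibility)
    """
    rows, cols = len(grid), len(grid[0])
    new_grid = [row[:] for row in grid]
    new_escapes = [row[:] for row in escapes]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != HEALTHY:
                continue
            contacts = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            exposed = any(0 <= a < rows and 0 <= b < cols
                          and grid[a][b] == INFECTED
                          for a, b in contacts)
            if not exposed:
                continue
            if rng.random() < p:
                new_grid[r][c] = INFECTED
            else:
                new_escapes[r][c] += 1
                if new_escapes[r][c] >= grace:
                    new_grid[r][c] = IMMUNE
    return new_grid, new_escapes


start = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
zeros = [[0] * 3 for _ in range(3)]
certain, _ = infection_step(start, zeros, p=1.0, grace=1, rng=random.Random(0))
never, _ = infection_step(start, zeros, p=0.0, grace=1, rng=random.Random(0))
```

The two extreme probabilities make the behaviour deterministic: with p = 1 every cell in contact with the sick cell is infected, with p = 0 every exposed cell escapes and, under the assumed `grace = 1`, becomes immune.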


2 Cellular automata for the study of biological sequences and DNA

2.1 Using cellular automata to generate image representation for biological sequences

2.1.1 Summary

A new approach to the visualisation of biological sequences has been developed on the basis of cellular automata (Wolfram, S. Nature 1984, 311, 419-424). Cellular automata (CA) are dynamical systems in which space and time are discrete. By transforming the sequence of symbolic codes into digital codes, and using suitably chosen rules for the space-time evolution of the cellular automaton, a biological sequence can be represented as a unique image, the so-called cellular automata image. Many important features that are originally hidden in a long and complicated biological sequence can be clearly revealed through its cellular automata image. With the growing amount of information in databases in the post-genomic era, the cellular automata image is expected to become a very useful tool for analysing the main features, identifying the function, and revealing the "fingerprint" of the biological sequences under study. It is also expected that, by using the concept of pseudo amino acid composition (Chou, K.C. Proteins: Structure, Function, and Genetics, 2001, 43, 246-255), the cellular automata image can be used to improve the prediction of protein attributes such as structural class and subcellular localisation.

2.1.2 Introduction

The success of the Human Genome Project has generated a large amount of sequence information. Sequence databases such as GenBank and EMBL have grown at an exponential rate (Venter et al., 1996; Chou, 2002; Chou, 2004). In general, genetic sequences are stored in computer databases as long character strings. It is practically impossible for humans to read these sequences, and it is very difficult to extract their main features by reading them directly. However, if they can be converted into diagrams and charts (see, for example, Chou and Zhang, 1992; Zhang and Chou, 1994), some important features of the sequences become readily visible. How to visualise gene sequences is a topic of current interest (Hu et al., 2003; Kashuk et al., 2002; Liu et al., 2002; Mayor et al., 2000; Nandy, 1996; Randic et al., 2000).

Efforts in visualising biological sequences have focused on the representation of a single sequence. About 20 years ago, the first 3D H curve was proposed to represent a DNA sequence (Hamori, 1985; Hamori and Ruskin, 1983). Subsequently, a graphical representation of DNA sequences was proposed using Barnsley's iterated function (Jeffrey, 1990). Later, another method was proposed using a different iterated function system (Roman-Roldan et al., 1994; Tino, 1999). Extending the work of Hamori and Jeffrey, a different iterative method called the W-curve was presented (Wu et al., 1993). Gates (1985) proposed a 2D graphical representation that is simpler than the H curve; however, Gates's representation has a high probability of error. Guo went a step further and proposed a new 2D graphical representation of DNA sequences with a low probability of error (Guo et al., 2001). In 2003, Yau presented a representation with no possibility of error (Yau et al., 2003).

In parallel with these developments, various representations have been proposed for protein sequences. Williams et al. (1995) used five vertical spaces to represent each amino acid position, with the spaces filled according to the chemical properties of the residues. This leads to sequences resembling Morse code, with some structural features highlighted by the resulting pattern of dots. The properties of a protein's amino acids can also be visualised as a line graph; for example, the protein rhodopsin has been shown using the hydropathic scale (Alston et al., 2003). Chou et al. (1997) introduced the "wenxiang diagram" to highlight the typical sequence feature of the amphiphilic helices in proteins.

The methods above for gene representation share a common feature: the point of the curve corresponding to a given nucleic acid is linked only with the base before it, while the effects of all the bases after it are totally ignored. This is inconsistent with the fact that all the bases in a gene are coupled with each other as a single entity in nature. In view of this, a completely new and different method for imaging gene sequences is introduced here. The new method is based on cellular automata, as illustrated below.

2.1.3 Methods

Cellular automata are discrete dynamical systems whose behaviour is completely specified in terms of a local relation.
A cellular automaton can be thought of as a stylised universe consisting of a regular grid of cells, each of which can be in one of a finite number k of possible states, updated synchronously in discrete time steps according to a local, identical interaction rule (Wolfram, 1986). Cellular automata provide a way to model complex dynamical phenomena by reformulating the macroscopic behaviour into microscopic and mesoscopic rules that are discrete in space and time. A set of rules specifies the time and space evolution of the system, which is discrete in both variables. These systems have attracted a great deal of interest in recent years, because even with very simple rules cellular automata can show very complex evolution patterns. The repeated application of simple rules can lead to extremely complex behaviour able to emulate physical, social and biological systems.

A one-dimensional cellular automaton consists of a collection of time-dependent variables $S_i^t$, the local states, arrayed on a lattice of N sites (or cells), i = 0, 1, 2, ..., N − 1. Each of these is taken to be a Boolean variable: $S_i^t \in \{0, 1\}$. Since the visualisation is considered for a two-state automaton, each cell can be either black or white. The collection of all local states is called the configuration, $S^t = S_0^t S_1^t \cdots S_{N-1}^t$, where $S^0$ denotes the initial configuration. The rule F of the cellular automaton can be expressed as a lookup table that lists, for each local neighbourhood, the state taken by the neighbourhood's central cell at the next step. A neighbourhood comprises a cell and its r neighbours on either side, where r is called the radius of the cellular automaton. The evolution of the state can be written as $S_i^{t+1} = F(S_{i-r}^t, \ldots, S_i^t, \ldots, S_{i+r}^t)$. If r = 1, each cell can be either black or white, which allows $2^3 = 8$ possible colour combinations for the three cells above. Since each of these combinations produces a cell that can be either black or white, and there are eight possible upper combinations, there are $2^8 = 256$ possible rules in total. In general, if there are K states and each cell has N neighbours (including itself), there are $K^N$ neighbourhood configurations and hence $K^{K^N}$ rules. For the two-state case with r = 1, a binary byte can be used to encode the rules as decimal numbers between 0 and 255. For example, rule number 184 corresponds to Fig. 1. The global map takes a configuration at one time step to the next, $S^{t+1} = F(S^t)$, with the local rule applied simultaneously to all lattice sites.
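The byte encoding of a rule can be unpacked into its lookup table as follows. This sketch assumes the standard Wolfram numbering, in which the neighbourhood (left, centre, right) is read as a 3-bit number and selects the corresponding bit of the rule number; the text only says that a byte is used, so this convention and the function name `rule_table` are assumptions.

```python
from itertools import product


def rule_table(number):
    """Lookup table of a two-state, radius-1 rule from its number (0..255).

    Assumes the Wolfram convention: bit (4*left + 2*centre + right)
    of `number` is the next state of the centre cell.
    """
    if not 0 <= number <= 255:
        raise ValueError("an elementary rule number must be in 0..255")
    table = {}
    for left, centre, right in product((0, 1), repeat=3):
        index = 4 * left + 2 * centre + right
        table[(left, centre, right)] = (number >> index) & 1
    return table


rule_184 = rule_table(184)   # 184 = 10111000 in binary
```

Under this convention, rule 184 sends 111, 101, 100 and 011 to 1 and the remaining four neighbourhoods to 0. Since each of the 8 table entries can independently be 0 or 1, there are 2**8 = 256 distinct tables, one per rule number.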

Digital coding for amino acids and ribonucleic acid

Molecular biologists seek to determine the genes in the cells of organisms, the function of the proteins that these genes encode, and how these proteins are related evolutionarily across organisms. Genes, composed of RNA, are represented by sequences of nucleotides, also called bases. The four bases are adenine (A), cytosine (C), guanine (G) and uracil (U). On a computer, a nucleotide sequence is coded as follows:

A = 00, C = 01, G = 10, U = 11    (1)

Proteins are represented by sequences of amino acids, also called residues. There are 20 native amino acids. By means of the similarity rule, the complementarity rule, molecular recognition theory and information theory, a set of digital codes is formulated to represent the amino acids, as shown in Table 1. This representation better reflects the physico-chemical properties of the amino acids, as well as their structure and degeneracy (Xiao et al., 2004).
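The nucleotide coding of equation (1) is easy to apply programmatically. The sketch below covers only RNA, because the amino acid codes of Table 1 are not reproduced in the text; the names `BASE_CODE` and `encode_rna` are illustrative.

```python
# Two-bit code for each base, as given in equation (1).
BASE_CODE = {"A": "00", "C": "01", "G": "10", "U": "11"}


def encode_rna(sequence):
    """Turn an RNA string into a flat list of binary digits (two per base)."""
    return [int(bit) for base in sequence.upper() for bit in BASE_CODE[base]]


bits = encode_rna("ACGU")
```

A sequence of N bases becomes a row of 2N binary digits; a character outside A, C, G, U raises a `KeyError`, which is a deliberate choice here so that malformed input is not silently skipped.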

Williams et al. (1995) used five vertical spaces to repre-

sent each amino acid position, with the spaces filled

according to the chemical properties of the residues. This

leads to sequences resembling Morse code, with some

structural features highlighted by the resulting pattern

of dots. The properties of a proteins amino acids may

also be visualized in the form of a line graph, for

example, protein rhodopsin is showed using the hydro-

pathic scale (Alston et al., 2003). Chou et al. (1997) first

introduced the elegant wenxiang diagram to highlight

the typical sequence feature of the amphiphilic helices in

proteins.

There is a common characteristic in the aforementioned

visual methods for the gene representation, i.e., the point

of the special curve corresponding to a certain nucleic

acid is colligated only with the base prior to it, while

the effects of all the bases behind it are totally ignored.

This is inconsistent with the fact that all the bases in a

gene are coupled with each other as an entity in nature. In

view of this, here a completely new and different method

will be introduced to image the gene sequences. The novel

method is based on Cellular Automata, as will be illus-

trated below.

II Methods

Cellular automata

Cellular automata are discrete dynamical systems whose behavior iscompletely specified in terms of a local relation. A cellular automaton

can be thought of as a stylised universe consisting of a regular grid of

cells, each of which can be in one of a finite number of k possible

states, updated synchronously in discrete time steps according to alocal, identical interaction rule (Wolfram, 1986). Cellular automata

provide us an access to model complex dynamical phenomena by

reformulating the macroscopic behavior into microscopic and meso-

scopic rules that are discrete in space and time. A set of rules specifiesthe time and space evolution of the system, which is discrete in both

variables. These systems have attracted a great deal of interest in recent

years because even with very simple rules cellular automata can showvery complex evolution patterns. It is recognized that repeated applica-

tions of simple rules can lead to extremely complex behavior that can

emulate physical, social and biological systems.

A one-dimensional cellular automata consists a collection of time-dependent variables Sit, namely the local states, arrayed on a lattice of N

sites (or cells), i 0; 1; 2; . . . ;N " 1. We take each of these to be aBoolean variable: Sit f0; 1g. As visualization is considered in a two-state automaton, each of the cells can be either black or white. Thecollection of all local states is called the configuration: St S0t S

1t # # # SN"1t , where S0 denotes an initial configuration. The rule F of

cellular automata can be expressed as a lookup table that lists, for eachlocal neighborhood, the state that is taken on by the neighborhoods central

cell at the next step. A neighborhood comprises a cell and its r neighbors

on either side, where r is called the cellular automata radius. The course of

state evolving can be represented as: Sit1 FSi"rt # # # Sit # # # Sirt . If the ris 1, each cell can be either black or white, then this will allows 23 8possible color combinations along the top three cells. Because each of

these combinations will cause a cell to be either black or white and there

are eight possible upper color combinations then there will be 28 256possibilities in total. In general, if there are K states and if each cell is

taken to have N neighbors (including itself), then there are KN rules. Wecan easily utilize a binary byte to encode these rule sets into decimal

numbers between the numbers 0 and 255. For example, rule number 184

would correspond to Fig. 1. The global equation of motion ! maps aconfiguration at one time step to the next; i.e., St1 FSt, where thelocal function ! is applied simultaneously to all lattice sites.

Digital coding for amino acid and ribonucleic acid

Molecular biologists seek to determine the genes in the cells of organ-isms, the function of the proteins that these genes encode, and how

these proteins are related evolutionarily across organisms. Genes, com-

posed of RNA, is represented by sequences of nucleic acids, also called

bases. The 4 nucleic acids are adenine(A), cytosine(C), guanine(G),uracil(U). To deal with it in a computer, a nucleotide sequence is coded

as follows:

A 00; C 01; G 10; U 11 1Proteins are represented by sequences of amino acids, also called

residues. There are 20 native amino acids. By means of the similarity

rule, complementarity rule, molecular recognition theory and information

theory, a set of digital codes are formulated to represent amino acids, asshown in Table 1. The representation can better reflect the chemical

physical properties of amino acids, as well as their structure and degen-

eracy (Xiao et al., 2004).

Space-time evolution of gene sequence

A gene sequence is always a 1D string, regardless of whether it is denoted by bases or by binary digits. It is very difficult to find its characteristic vector, particularly when the sequence is very long. To cope with this situation, we resort to the images derived from the 1D sequence through the space-time evolution of cellular automata. The cellular automaton we adopt here is a simple two-state, one-dimensional cellular automaton, consisting of a line of cells with the value 0 or 1. The rule is simply that the nearest cells around the cell in focus decide its next state. Because many genes are circular, we adopt the circular (periodic) boundary condition, with the iterative formula given by:

\[ D(i,j) = F\big(D(i-1,j-1),\ D(i-1,j),\ D(i-1,j+1)\big), \qquad 1 \le i \dots \]


Space-time evolution of the gene sequence

A gene sequence is always a 1D string, whether it is written with bases or with binary digits. It is very difficult to find its characteristic vector when the sequence is very long. To cope with this, we resort to images derived from the 1D sequence through the space-time evolution of cellular automata. The cellular automata adopted here are simple two-state, one-dimensional automata, made of a line of cells with value 0 or 1. The rule is applied to the cells nearest to the cell in focus, and this decides its next state. Because many genes are circular, the circular boundary condition is adopted, with the iterative formula given by:

\[ D(i,j) = F\big(D(i-1,j-1),\ D(i-1,j),\ D(i-1,j+1)\big) \]

where D(i,j) is an element of the 2D matrix that represents the image of the gene sequence, F is the iteration rule, n is the number of iterations (the evolution time), and N is the length of the gene sequence. If the sequence is made of RNA then M = 2; if it is made of amino acids then M = 5. For example, Rule 84 can be illustrated as in Fig. 2.
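The construction of the matrix D can be sketched as follows. This is a hedged example rather than the authors' code: the name `evolve` is hypothetical, indices are 0-based instead of 1-based, the Wolfram bit ordering of the rule number is assumed, and placing the encoded sequence itself in row 0 is also an assumption.

```python
def evolve(first_row, rule_number, n_steps):
    """Build the 2D matrix D row by row from a 1D row of bits.

    Row 0 is the encoded sequence (N*M bits); row i is obtained from
    row i-1 with the three-cell rule and circular boundary conditions.
    """
    # Lookup table F: (left, centre, right) -> next state (Wolfram ordering).
    table = {
        ((n >> 2) & 1, (n >> 1) & 1, n & 1): (rule_number >> n) & 1
        for n in range(8)
    }
    width = len(first_row)
    rows = [list(first_row)]
    for _ in range(n_steps):
        prev = rows[-1]
        rows.append([
            table[(prev[(j - 1) % width], prev[j], prev[(j + 1) % width])]
            for j in range(width)
        ])
    return rows
```

The modulo on the column index is what implements the circular boundary: the leftmost cell sees the rightmost cell as its left neighbor, and vice versa.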

Image generation

When the 2D array (matrix) is transformed into a binary image with visualization techniques, the basic bitmap format is chosen because its structure is easy to handle. If a matrix element is zero, the corresponding pixel is black; otherwise it is white.

Image compression

The total size obtained for some very long sequences is sometimes too large, so the image must be compressed in a way that highlights its characteristic features, using the following mathematical mapping:

\[ \begin{pmatrix} x_1 \\ y_1 \end{pmatrix} = \begin{pmatrix} f_x & 0 \\ 0 & f_y \end{pmatrix} \begin{pmatrix} x_0 \\ y_0 \end{pmatrix} \]


where (x0, y0) are the coordinates of a pixel in the original image, (x1, y1) are the corresponding coordinates in the transformed image, fx is the scale along the horizontal axis, and fy is the scale along the vertical axis. The inverse transformation is given by:

\[ x_0 = x_1 / f_x, \qquad y_0 = y_1 / f_y \]

2.1.4 Results and discussion

Images of real and simulated gene data are used as examples to show how cellular automata images can provide useful information. The gene sequences mentioned were all downloaded from GenBank: http://www.ncbi.nlm.nih.gov. For the same sequence, different evolution rules give different images; that is, 256 different images can be created for the same sequence using cellular automata. These images fall into four classes. In the first class, called balanced, the cell states quickly settle into trivial configurations, for example all 0 or all 1. The second class is periodic. The third class is chaotic. The fourth class is not disordered, but complex and sometimes long-lived. The evolution rule chosen to build the image must generate features that can easily be used to decide whether the genes in question are homologous to each other. To this end, the bases of a gene or the residues of a protein must be coupled with each other as a single entity. While the gene image is being produced, the state of the cell corresponding to a given nucleic acid is linked both to the base before it and to the base after it. Thanks to these characteristics, the gene image can reveal implicit features of the sequence that are difficult to detect with other gene visualization systems. It was found that, among the 256 possible evolution rules, some are better than others for building the image of a given gene. For example, Rule 184 is the most suitable for coronaviruses, while Rule 84 is the best for building images of amino acid sequences. If the rule and the evolution time are fixed, the gene sequence and the resulting image are in one-to-one correspondence.

Because the digital coding of amino acids and nucleotides is degenerate, the images will differ in their cells at least in the first row. Figure 3 shows the comparative image between the mouse TGFA gene (P01134) and its recombinant gene. The recombinant gene differs from P01134 only at amino acid 61, phenylalanine to lysine. The comparative image is generated by comparing the corresponding bits of the two previously generated images: if the color is the same, the corresponding pixel of the comparative image keeps the original color; otherwise it is drawn as a red point.
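The comparison just described can be sketched in a few lines. This is an illustrative example under stated assumptions: the two inputs are bit matrices of identical shape, 0 is rendered black and any other value white (as in the image generation step), and the names `comparative_image`, `BLACK`, `WHITE` and `RED` are hypothetical.

```python
BLACK = (0, 0, 0)
WHITE = (255, 255, 255)
RED = (255, 0, 0)


def comparative_image(image_a, image_b):
    """Return an RGB image: original color where the bits agree, red where they differ."""
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("the two images must have the same shape")
    result = []
    for row_a, row_b in zip(image_a, image_b):
        result.append([
            (WHITE if a else BLACK) if a == b else RED
            for a, b in zip(row_a, row_b)
        ])
    return result
```

Because each row depends on the whole neighborhood of the previous row, a single changed residue typically produces a widening region of red points in the rows below it.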

Image generation

When transforming the 2D array (matrix) into a binary image with visualization techniques, the basic bitmap format is chosen because it is easily handled. If the matrix element is zero, the color of the corresponding pixel is black; otherwise, white.

Image compression

The total size thus obtained is too large for some long sequences, so the image must be compressed in a way that highlights its characteristic features, using the following mathematical mapping:

\[ \begin{pmatrix} x_1 \\ y_1 \end{pmatrix} = \begin{pmatrix} f_x & 0 \\ 0 & f_y \end{pmatrix} \begin{pmatrix} x_0 \\ y_0 \end{pmatrix} \tag{5} \]

where $(x_0, y_0)$ denote the coordinates of the pixel in the original image, $(x_1, y_1)$ the corresponding coordinates in the transformed image, $f_x$ is the scaling along the horizontal axis, and $f_y$ the scaling along the vertical axis. The inverse transformation is given by:

\[ \begin{pmatrix} x_0 \\ y_0 \end{pmatrix} = \begin{pmatrix} 1/f_x & 0 \\ 0 & 1/f_y \end{pmatrix} \begin{pmatrix} x_1 \\ y_1 \end{pmatrix} \tag{6} \]

i.e.,

\[ x_0 = x_1 / f_x, \qquad y_0 = y_1 / f_y \tag{7} \]
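The two steps above, pixel coloring and scaling, can be sketched as follows. The text does not say how pixels are resampled, so nearest-neighbor sampling through the inverse mapping of Eq. (7) is an assumption here, as are the function names `to_pixels` and `compress` and the use of 0 and 255 as gray levels for black and white.

```python
def to_pixels(matrix):
    """Map matrix elements to gray levels: 0 -> black (0), anything else -> white (255)."""
    return [[0 if value == 0 else 255 for value in row] for row in matrix]


def compress(image, fx, fy):
    """Scale an image by fx (horizontal) and fy (vertical).

    Each target pixel (x1, y1) takes the value of the source pixel
    (x0, y0) = (x1 / fx, y1 / fy), truncated to integers
    (nearest-neighbor sampling, an assumption).
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("scale factors must be positive")
    height0, width0 = len(image), len(image[0])
    width1 = max(1, int(width0 * fx))
    height1 = max(1, int(height0 * fy))
    return [
        [
            image[min(height0 - 1, int(y1 / fy))][min(width0 - 1, int(x1 / fx))]
            for x1 in range(width1)
        ]
        for y1 in range(height1)
    ]
```

With fx = fy = 0.5, for example, every second pixel in each direction is kept, so a 4 x 4 image becomes 2 x 2.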

III Results and discussion

The images of real and simulated gene data will be presented as examples to show how these cellular automata images provide useful information. The aforementioned gene sequences were all downloaded from GenBank: http://www.ncbi.nlm.nih.gov. For the same sequence, if the evolving rules are different, the images are different. That is to say, 256 different images can be created for the same sequence based on cellular automata. These images fall into four classes. The first class is named balanced: the states of the cells quickly resolve into boring configurations, e.g., all 0 or all 1. The second class is periodic. The third class is chaotic. The fourth class is not disordered, but complex and sometimes long-lived. The evolution rule that we need for forming the image must generate features that can be easily used to distinguish whether the genes concerned are homologous to each

Fig. 2. Illustration of a one-dimensional, binary-state, nearest-neighbor (r = 1) cellular automaton with N = 10. Both the lattice and the rule table F for updating the lattice are illustrated. The lattice configuration is shown at two successive time steps. The cellular automaton has spatially periodic boundary conditions: the lattice is viewed as a circle, with the leftmost cell being the right neighbor of the rightmost cell, and vice versa

Table 1. Binary notation of amino acid coding language

Amino acid   Binary notation   Codons
P            00001             ccu ccc cca ccg
L            00011             cuu cuc cua cug uua uug
Q            00100             caa cag
H            00101             cau cac
R            00110             cgu cgc cga cgg aga agg
S            01001             ucu ucc uca ucg agu agc
Y            01100             uau uac
F            01011             uuu uuc
W            01110             ugg
C            01111             ugu ugc
T            10000             acu acc aca acg
I            10010             auu auc aua
M            10011             aug
K            10100             aaa aag
N            10101             aau aac
A            11001             gcu gcc gca gcg
V            11010             guu guc gua gug
D            11100             gau gac
E            11101             gaa gag
G            11110             ggu ggc gga ggg
end          11111             uaa uag uga

Cellular automata images for biological sequences 31


Different rules were applied to analyze the 90 coronaviruses, but only with Rule 184 were the images of SARS-CoVs most distinctly different from those of other coronaviruses (Wang et al., 2005). The images obtained directly from the procedures above are generally too large for analysis. After the images are reduced with a compression ratio of 14:2, as shown in Fig. 4, the images of SARS-CoVs mainly show a pattern of V-shaped crossing lines, while the images of non-SARS viruses are characterized by parallel lines. Analyzing the whole image of the RNA sequences revealed a remarkable fingerprint of SARS-CoV. In some regions of the SARS-CoV sequence near the 5'-terminal (Chou et al., 1996; Zhang and Chou, 1996), the frequencies of the repeated character 'A' (that is, 'AA', 'AAA' and 'AAAA') are clearly greater than those of the repeated character 'U' (that is, 'UU', 'UUU' and 'UUUU'). For all other coronaviruses, however, the frequencies of 'AA', 'AAA' and 'AAAA' are clearly lower than those of 'UU', 'UUU' and 'UUUU'. This unique feature of SARS-CoV can therefore be defined as its fingerprint. In fact, according to the statistics, the number of individual 'A' in the V-shaped regions of some SARS sequences is approximately equal to the number of individual 'U'. These segments run from 3232 to 5624 nt, 5703 to 7195 nt, 12128 to 14470 nt, 16444 to 19231 nt, and 17928 to 21803 nt of the SARS-CoV sequence near the 5'-terminal. No similar feature exists in non-SARS coronaviruses, as will be elaborated elsewhere.
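The kind of count behind this fingerprint can be sketched as follows. The text does not state how the repeats were counted, so counting overlapping occurrences is an assumption, as are the 1-based inclusive segment coordinates and the names `segment`, `count_overlapping` and `repeat_counts`.

```python
def segment(sequence, start, end):
    """Return positions start..end of the sequence (1-based, inclusive; assumed convention)."""
    return sequence[start - 1:end]


def count_overlapping(sequence, pattern):
    """Count occurrences of pattern in sequence, allowing overlaps."""
    width = len(pattern)
    return sum(
        1
        for i in range(len(sequence) - width + 1)
        if sequence[i:i + width] == pattern
    )


def repeat_counts(sequence, base, max_run=4):
    """Count runs such as 'AA', 'AAA', 'AAAA' of a single base."""
    return {
        base * length: count_overlapping(sequence, base * length)
        for length in range(2, max_run + 1)
    }
```

Comparing `repeat_counts(region, "A")` with `repeat_counts(region, "U")` for a region such as `segment(genome, 3232, 5624)` would give the two sets of frequencies that the text contrasts; `genome` stands for a sequence string that the reader has to supply.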

other. In this way, the bases in a gene or the residues in a protein must be coupled with each other as an entity. During the process of producing the gene image, the state of the cell corresponding to a certain nucleic acid is linked with both the base prior to it and the bases behind it. Because of the above-mentioned characteristics, the gene image can reveal some implicit sequence features that are difficult to display with other gene visualizations. We have found that among the 256 evolving rules some are better than others in building the gene image for a given gene. For example, Rule 184 is most suitable for coronaviruses, while Rule 84 is the best for building the image of amino acid sequences.

If the rule and the time for the evolution are both fixed, the gene sequence and the image thus produced will be in one-to-one correspondence. Because the digital coding for amino acids and nucleotides is degenerate, the images will appear in different cells for the first row at least. Figure 3 shows the comparative image between the mouse TGFA gene (P01134) and its recombinant gene. The recombinant gene has only one difference from P01134, at the 61st amino acid: phenylalanine to lysine. The comparative image is generated by comparing the corresponding bits of the two previously generated images: if the color is the same, the corresponding pixel of the comparative image is drawn in the original color; otherwise, it is drawn as a red point.

Different rules have been applied to analyze the 90 coronaviruses, but only when Rule 184 is used are the images of SARS-CoVs most distinctively different from those of other coronaviruses (Wang et al., 2005). The images obtained directly by the aforementioned procedures are generally too large for analysis. After the images are zoomed out with the compression ratio 14:2, as shown in Fig. 4, the images of SARS-CoVs mainly show a V-shaped cross-lines pattern, whereas those of non-SARS virus RNA sequences mainly show a parallel slash-lines pattern. By analyzing the different parts of the images of the full-length RNA sequences, a remarkable fingerprint for SARS-CoV has been found. In some regions of the SARS-CoV sequences near the 5'-terminal (Chou et al., 1996; Zhang and Chou, 1996), the occurrence frequencies of the repeated character 'A' (i.e., 'AA', 'AAA', and 'AAAA') are obviously greater than those of the repeated character 'U' (i.e., 'UU', 'UUU', and 'UUUU'), respectively. However, for all other coronaviruses, the situation is just the opposite in the same region; i.e., the occurrence frequencies of 'AA', 'AAA', and 'AAAA' are obviously less than those of 'UU', 'UUU', and 'UUUU'. Therefore, this unique feature of SARS-CoV can be defined as its fingerprint. Actually, it was found that the number of individual 'A' in the V-shape region of some SARS gene sequences is approximately equal to the number of individual 'U' according to the statistical result. These segments are from 3232 to 5624 nt, 5703 to 7195 nt, 12128 to 14470 nt, 16444 to 19231 nt, and 17928 to 21803 nt in the SARS-CoV sequence near the 5'-terminal. There is no such feature in non-SARS coronaviruses, as will be elaborated elsewhere.

Besides, the gene cellular automata image also has the following features. Shown in Fig. 5 is the cellular automata image for the C gene of Hepatitis B virus (HBV) built by Rule 84. From the figure we can see that the image of the HBV C gene has its particular pattern and character. Because the circular boundary condition was used, the image can be a circle when the right

Fig. 3. Comparative image between the mouse TGFA gene (P01134) and its recombinant gene. The recombinant gene has only one difference from P01134, at the 61st amino acid: phenylalanine to lysine. Rule 84 was used for the evolution

Fig. 4. Sample images obtained by applying Rule 184 to a SARS coronavirus and a non-SARS coronavirus: (a) BJ01 (AY278488), and (b) AF208066_Murine. The time of evolving was 2400 and the compression ratio is 14:2. The SARS image shows a V-shaped cross-lines pattern, a token for SARS coronaviruses; the non-SARS coronavirus image shows a parallel slash-lines pattern, a remarkable distinction from the SARS coronavirus

32 X. Xiao et al.


Moreover, the cellular automata image of a gene also has the following features. Fig. 5 shows the cellular automata image of the C gene of hepatitis B virus (HBV), built with Rule 84. The figure shows that the image of the HBV C gene has its own particular pattern. Because the circular boundary condition was used, the image can be closed into a circle when the right and left edges are joined. There are two large and three small triangular areas in the images of the figure. Many small triangles are nested within the larger triangle, and these triangles are all inverted.

Therefore, the present method provides a feature that is much more intuitive and easier to identify than the complicated original gene sequence. Furthermore, analysis of Rule 84 gives:

\[ D(i,j) = \begin{cases} 0, & D(i-1,j-1)\,D(i-1,j) = 00 \\ \bar{x}, & D(i-1,j-1)\,D(i-1,j) \neq 00,\ D(i-1,j+1) = x \end{cases} \]

where $x \in \{0, 1\}$ and $\bar{x}$ is the inversion of x. Thus, according to Rule 84, we can derive the image of the WIAD gene (Fig. 6).

Different types of gene sequences from the same organism were used to test the method. The TGFA and beta-globin major genes differ in their functions.

and left edges are connected with each other. There are two big triangular areas and three small triangular areas in the images of the figure. Many small triangles are nested within the big triangles, and these triangles are all inverted. Therefore, the current method provides a much more intuitive and more easily identified feature for a complicated gene sequence than the original symbolic sequential expression.

Furthermore, it follows by analyzing Rule 84 that

\[ D(i,j) = \begin{cases} 0, & D(i-1,j-1)\,D(i-1,j) = 00 \\ \bar{x}, & D(i-1,j-1)\,D(i-1,j) \neq 00,\ D(i-1,j+1) = x \end{cases} \tag{8} \]

where $x \in \{0, 1\}$, and $\bar{x}$ is the inversion of x. Thus, according to Rule 84 we can derive the image for the WIAD gene (Fig. 6).
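The piecewise form of Rule 84 given above can be checked against the rule number itself. The sketch below assumes the usual Wolfram bit ordering (left cell most significant); the function names are hypothetical.

```python
def rule84_piecewise(left, centre, right):
    """Piecewise form: 0 if left and centre are both 0, else the inverse of right."""
    if left == 0 and centre == 0:
        return 0
    return 1 - right


def rule84_lookup(left, centre, right):
    """Read the next state directly from the bits of the number 84."""
    index = 4 * left + 2 * centre + right
    return (84 >> index) & 1


def forms_agree():
    """Compare the two forms on all 2**3 = 8 neighborhoods."""
    return all(
        rule84_piecewise(l, c, r) == rule84_lookup(l, c, r)
        for l in (0, 1)
        for c in (0, 1)
        for r in (0, 1)
    )
```

Under the stated bit ordering, the two forms agree on every neighborhood, so the piecewise expression is simply a compact way of writing the lookup table of Rule 84.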

Different types of gene sequences from the same organism were used to test the method. The TGFA and beta-globin major genes are different in their functions. Figures 7 and 9 show the two mouse genes, respectively. Comparing the two images shows that they are quite different, with no significant similarity at all. In molecular biology, homologous sequences have many similarities in their functions and appearances. The sequences of Transforming Growth Factor-Alpha (TGFA) genes are examined. They include Homo sapiens (AAA61157, AAH05308, AAH05309, CAA49806), Capreolus (AAF73229), Danio rerio (CAE30382), sheep (P98135), rhesus monkey (P55244), Mus musculus (AAB50554), rabbit (P98138), chicken (NP_001001614), Norway rat (NP_036803), and Canis familiaris (AAR21186). As shown in Figs. 7 and 8, the images of human and mouse are very similar although they are from different organisms. In other words, these two sequences do have some common features, which are hard to identify from their

Fig. 5. The cellular automata images of the Hepatitis B virus C gene generated by cellular automata Rule 84: the time of evolving is 300, and the sequence was obtained from NCBI GenBank (ab059661). (a) The original image, and (b) the compressed image from (a). The compression ratio is 2:2

Fig. 6. The cellular automata image of the WIAD gene with some periodic sections: the time of evolving is 300, and the evolving rule is Rule 84. The compression ratio is 2:2

Fig. 7. Compressed image of the mouse TGFA gene. The sequence was obtained from NCBI GenBank (P01134), its length is 159 amino acids, the compression ratio is 2:2, and the time of evolving is 300

Fig. 8. Compressed image of the human TGFA gene. The sequence was obtained from NCBI GenBank (AAH05308), its length is 159 amino acids, the compression ratio is 2:2, and the time of evolving is 300

Fig. 9. Compressed image of the mouse beta-globin major gene. The sequence was obtained from NCBI GenBank (J00413), the compression ratio is 2:2, and the time of evolving is 300










Le figure 7 e 9 mostrano i due differenti geni di un topo.

Si pu notare confrontando le due immagini che non c' una significativa somiglianza. In biologia molecolare, ci sono molte similarit nelle funzioni tra 2 sequenze omologhe. Sono state esaminate le sequenze del gene TGF-alfa (TGFA). Essi comprendono homo sapiens (AAA61157, AAH05308, AAH05309, CAA49806), Capreolus (AAF73229), Danio rerio (CAE30382), Pecora (P98135), scimmia Rhesus (P55244), Mus musculus (AAB50554), coniglio (P98138), pollo (NP_001001614 ), ratto della Norvegia (NP_036803), e Canis familiaris (AAR21186). Come mostrato nelle Fig. 7, 8, le immagini del gene umano e del topo sono molto simili anche se derivano da tre diversi tipi di organismi. In altre parole, essi hanno alcune caratteristiche comuni difficili da identificare rispetto alle loro sequenze geniche. Questi risultati indicano che l'attuale approccio con gli automi cellulari davvero molto utile per distinguere una sequenza di particolari geni fornendo un'immagine induttiva.

Finally, it is clear that, with the concept of pseudo amino acid composition introduced by Chou (Chou, 2001), the present cellular automata image approach can also be used to improve the prediction of protein structural class [see, e.g., (Chou and Zhang, 1993; Chou, 1993; Chou, 1995; Chou, 2000; Chou and Cai, 2004a; Chou and Maggiora, 1998; Chou and Zhang, 1994; Chou, 1989; Luo et al., 2002; Nakashima et al., 1986; Zhou, 1998)], protein subcellular location [see, e.g., (Chou and Cai, 2002; Chou and Cai, 2004b; Chou and Elrod, 1999b; Pan et al., 2003; Zhou and Doctor, 2003)], and membrane protein type [see, e.g., (Cai et al., 2003; Chou and Elrod, 1999a; Wang et al., 2004a, b)], as demonstrated elsewhere (Xiao et al., 2004).

2.1.5 Conclusions

This study shows that the new method based on cellular automata is very useful for studying complicated biological sequences.









2.2 Cellular automaton model for the study of DNA sequence evolution

2.2.1 Summary

Cellular automata are introduced as a model for studying the structure, function, and evolution of DNA. DNA is modeled as a one-dimensional cellular automaton with four states per cell. These states are the four DNA bases A, C, T, and G, each represented by a digit of the quaternary number system. Linear evolution rules, represented by square matrices, are considered. A DNA evolution simulator was developed on the basis of this model, and the simulation results are presented in the following pages. The simulator has a simple input interface and can be used to study DNA evolution.

2.2.2 Introduction

Biologists, computer scientists, and engineers have recently joined forces, giving rise to bioinformatics. Bioinformatics can be defined as a discipline that produces software tools, databases, hardware, algorithms, and methods to support genomic and post-genomic research. It is used to study DNA structure, function, and evolution; gene and protein expression; protein production, structure, and function; genetic regulatory systems; and clinical applications. Methods used successfully in computer science and engineering have recently been applied to build models for simulating the structure, function, and evolution of DNA. Because of the large amount of information stored in the structure of DNA, new models, algorithms, and processors should be developed as soon as possible with the
