souravsaha008_ciitsep2013



  • 5/22/2018 SouravSaha008_CiiTSep2013

    1/7

    A Cellular Automata Based DNA Pattern Classifier

    Tamal Chakrabarti, Sourav Saha, and Devadatta Sinha

    Manuscript received October 9, 2013. Tamal Chakrabarti is with the Computer Science and Engineering Department, Institute of Engineering and Management, Y-12, Block-EP, Sector-V, Salt Lake Electronics Complex, Kolkata-700091, West Bengal, India. Sourav Saha is with the Computer Science and Engineering Department, Institute of Engineering and Management, Y-12, Block-EP, Sector-V, Salt Lake Electronics Complex, Kolkata-700091, West Bengal, India. Devadatta Sinha is with the Computer Science and Engineering Department, Calcutta University, 92 Acharya Prafulla Chandra Road, Kolkata-700009, West Bengal, India.

    Abstract: Contemporary researchers of Bio-informatics have witnessed an exponential growth in the amount of biological information over the years. The increasing volume of DNA sequences has of late created interest among many scientists in computational approaches to DNA sequence analysis. A lot of computer analysis of DNA sequences is directed toward meaningful interpretation of biologically significant patterns. Pattern classification forms one of the most important foundations for extraction of knowledge from the enormous DNA sequence databases. This paper reports a low-cost and efficient DNA pattern classifier based on the sparse network of Cellular Automata.

    Keywords: Bio-informatics, Cellular Automata, DNA, Pattern Classification, Sequence Analysis

    I. INTRODUCTION

    Cellular Automata (CA) is a simple model of a spatially extended decentralized system made up of a number of individual components (cells) [1]. The communication between constituent cells is limited to local interaction. The state of each individual cell changes over time depending on the states of its neighbors [2], [3]. The overall structure can be viewed as a parallel processing computer [4].

    Cellular Automata is used by Bio-informatics researchers for automated recognition, description, classification and grouping of patterns [5], [6]. For example, CA based models have been reported to recognize genetic disorder in cells responsible for the development of cancer [7], [8]. A CA based pattern classification model generally comprises two basic operations: exploration of CA for supervised classification, and prediction of class for unknown patterns.

    DNA (deoxyribonucleic acid) consists of two long strands, each strand being made of units called phosphates, deoxyribose sugars and nucleotides (adenine [A], guanine [G], cytosine [C], and thymine [T]) linked in series. For ease of understanding, biologists commonly represent DNA molecules simply by their different nucleotides using the symbols {A, G, C, T}. The DNA in each cell provides the full genetic blueprint for that cell.

    The identification of genes in a given DNA sequence is an emerging field of study in Bio-informatics. One popular approach is to develop a predictive computer model from a database of known gene sequences and use the resulting model to predict where genes are likely to be in newly generated sequence information. Discovery of coding regions in DNA sequences can therefore be viewed as a pattern recognition problem. The explosive growth in biological data demands that the most advanced and powerful ideas in machine learning, such as cellular automata, should be brought to bear on such problems [9], [10].

    A pattern classification/recognition algorithm using CA usually has two phases [11], the learning or training phase and the testing phase. In the training phase, the machine/network/algorithm is trained with some benchmark patterns [12]. In the testing phase, new patterns are tested against the trained model [13] built in the previous step.

    This paper proposes a unique DNA classification scheme using Cellular Automata (CA), which is evolved through the Simulated Annealing heuristic technique.

    II. RELATED WORKS

    Plenty of research works deal with the problem of classification of DNA patterns from genomic sequences. One of the vital tasks in the study of genomes is DNA sequence identification [14]. Recently, researchers have attempted various soft-computing techniques for DNA sequence identification. Peterson et al. used the proper orthogonal decomposition (POD) technique to recognize various cancerous patterns in DNA sequences [15]. Since the performance of a classification strategy heavily depends on the selection of a similarity or distance measure, there has been a demand for exploration of various similarity metrics for DNA classification. Priness et al. compared various unsupervised classification techniques of DNA sequences with respect to the Euclidean distance and the Pearson correlation [16]. Kulp et al. proposed a generalized Hidden Markov Model (GHMM) based framework to recognize human genes in DNA [17]. Kumar et al. integrated pattern mining and neural network-based approaches to classify DNA sequences with reduced dimensions using Multi-linear Principal Component Analysis (MPCA) [18]. However, the VLSI-implementation-friendly sparse structure of CA has not yet been extensively utilized as a DNA pattern classifier. The proposed work evolves a CA based classification framework for DNA pattern prediction in linear time. In order to derive a desirable CA for DNA pattern classification, the proposed scheme employs the simulated annealing heuristic with the Fuzzy Levenshtein distance as the similarity-cost measure among DNA patterns. The simple structure of CA renders the proposed model easily implementable on a VLSI chip suitable for embedded applications, which demand high speed.

    III. CELLULAR AUTOMATA AS DNA PATTERN CLASSIFIER

    A Cellular Automaton (CA) consists of a number of cells organized in the form of a lattice [19]. It evolves in discrete space and time, and can be viewed as an autonomous finite state machine (FSM) [20]. Each cell $i$ stores a discrete variable $q_i^t$ at time $t$ that refers to the current state of the cell. The next state of the cell, $q_i^{t+1}$ at time $(t + 1)$, is affected by its current state and the states of its neighbors at time $t$. For example, in a 3-neighborhood CA the state transition depends on the cell itself and on its left and right neighbors, such that:

    $$q_i^{t+1} = f_i\left(q_{i-1}^t,\; q_i^t,\; q_{i+1}^t\right) \qquad (1)$$

    where $q_{i-1}^t$ and $q_{i+1}^t$ are the current states of the left and right neighbors of the $i$th CA cell at time $t$, and $f_i$ is the $i$th state transition function.

    Every CA gives rise to a state transition graph consisting of a number of cyclic and acyclic states [21]. The state transition graph of an arbitrary CA is shown in Fig. 1. The non-cyclic states of the CA depicted in Fig. 1 form inverted trees rooted at the cyclic states. The cyclic states are referred to as attractors [22]. The states of a tree rooted at the cyclic state $\alpha$ form the $\alpha$-basin [23].

    Fig. 1 State Transition Diagram

    A CA with multiple basins may be viewed as a natural classifier [24], [25]. It tends to classify a given set of patterns into multiple disjoint state transition graphs (Fig. 1), with each disjoint graph representing a class whose patterns fall in the corresponding attractor basin.
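The basin structure described above can be made concrete with a short sketch. The Python below enumerates every state of a small CA, follows it until a cyclic state repeats, and groups the states by the attractor they reach. The names `attractor_basins` and `toy_rule` are illustrative assumptions, and `toy_rule` is a hypothetical 3-cell rule chosen only because it has several basins; it is not the CA of Fig. 1.

```python
def attractor_basins(next_state, n_bits):
    """Group all n-bit states by the attractor (cycle) they eventually reach.

    The key of each basin is the smallest state on its attractor cycle.
    """
    basins = {}
    for start in range(2 ** n_bits):
        path = []
        state = start
        # Walk the state transition graph until a state repeats.
        while state not in path:
            path.append(state)
            state = next_state(state)
        cycle = path[path.index(state):]  # the cyclic (attractor) states
        basins.setdefault(min(cycle), set()).add(start)
    return basins


def toy_rule(state):
    """Hypothetical 3-cell rule: a' = a, b' = c, c' = c (a is the leftmost cell)."""
    a = (state >> 2) & 1
    c = state & 1
    return (a << 2) | (c << 1) | c


if __name__ == "__main__":
    for attractor, members in sorted(attractor_basins(toy_rule, 3).items()):
        print(format(attractor, "03b"), sorted(format(m, "03b") for m in members))
```

Each printed line is one basin, so a CA with several attractors partitions the pattern space into that many classes.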

    As an example, let us consider the DNA sequences listed in Table I. We encode the four nucleotides as A = 00, T = 01, G = 10, C = 11. This binary encoding scheme gives rise to the following binary codes for the DNA sequences under consideration.

    TABLE I
    BINARY ENCODING OF DNA SEQUENCES

    Serial Nr.   DNA Sequence   Binary Code (b9 b8 ... b1 b0)
    1            AATTC          0000010111
    2            ATTTC          0001010111
    3            ATTGA          0001011000
    4            CATTC          1100010111
    5            AATTA          0000010100
    6            GCGCT          1011101101
    7            GTGCT          1001101101
    8            GTGCC          1001101111
    9            TTGCT          0101101101
    10           GCGCC          1011101111

    To classify the given set of DNA sequences into two classes, we need to design a CA based classifier for two pattern sets $P_1$ and $P_2$, such that any two patterns $p_1 \in P_1$ and $p_2 \in P_2$ fall into different attractor basins.
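The encoding of Table I can be sketched in a few lines of Python. The mapping A = 00, T = 01, G = 10, C = 11 is taken from the text; the names `CODE`, `encode` and `decode` are illustrative assumptions.

```python
# Two-bit code for each nucleotide, as stated in the text.
CODE = {"A": "00", "T": "01", "G": "10", "C": "11"}
BASE = {bits: base for base, bits in CODE.items()}


def encode(sequence):
    """Return the binary code of a DNA sequence, leftmost base first."""
    return "".join(CODE[base] for base in sequence.upper())


def decode(bits):
    """Inverse of encode: read the bit string two bits at a time."""
    return "".join(BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


if __name__ == "__main__":
    for seq in ["AATTC", "ATTTC", "ATTGA", "CATTC", "AATTA",
                "GCGCT", "GTGCT", "GTGCC", "TTGCT", "GCGCC"]:
        print(seq, encode(seq))
```

A sequence of five bases therefore becomes a 10-bit pattern b9 ... b0, which is the pattern size used in the example.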

    Let us use the rules of state transition as depicted in Table II. Here ⊕ means the bitwise XOR operation.

    TABLE II
    STATE TRANSITION RULES

    Bit Position   Rule
    0              b1 ⊕ b0
    1              b2 ⊕ b1
    2              b3 ⊕ b2 ⊕ b1
    3              b4 ⊕ b3 ⊕ b2
    4              b4 ⊕ b3
    5              b5
    6              b7 ⊕ b5
    7              b7
    8              b9 ⊕ b8 ⊕ b7
    9              b9 ⊕ b8

    Using the given CA rule set, we observe that the sequences given in Table I can be classified into two classes, the 0-attractor basin (an attractor with all zeros) and the non-zero attractor basins, as depicted in Fig. 2.

    Fig. 2 Classifying the DNA Patterns into Attractor Basins
    [0-Basin: 0000010100, 0001010111, 0001011000, 1100010111, 0000010111. Non 0-Basin: 1011101101, 1001101101, 1001101111, 0101101101, 1011101111.]

    Using the above CA, the given DNA sequences can be categorized into two sets as shown in Fig. 3.

    Fig. 3 Categorization of the DNA Patterns into two Classes

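    Any CA rule that uses only the XOR operation can be emulated by a coefficient-matrix multiplication scheme, as illustrated below. The next state $B'$ of a binary pattern $B = b_{n-1} b_{n-2} \ldots b_1 b_0$ can be derived by multiplying it with the corresponding CA-Coefficient matrix, with all additions taken modulo 2:

    $$B' = \begin{bmatrix} c_{0,0} & c_{0,1} & \cdots & c_{0,n-1} \\ c_{1,0} & c_{1,1} & \cdots & c_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n-1,0} & c_{n-1,1} & \cdots & c_{n-1,n-1} \end{bmatrix} \times \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{bmatrix}$$

    In order to build the CA-Coefficient matrix for a CA rule, the following equation is used:

    $$c_{i,j} = \begin{cases} 1, & \text{if the next state of bit } i \text{ depends on bit } j \\ 0, & \text{otherwise} \end{cases} \qquad i, j \in \{0, 1, \ldots, n-1\}$$

    From the equation it is clear that the (i, j)th position of the coefficient matrix holds one if and only if the CA rule at the ith bit depends on the jth bit for the XOR operation; otherwise it holds zero. For example, the CA-Coefficient matrix (C) for the CA rule set shown in Table II can be derived as follows.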

    C =
             b9 b8 b7 b6 b5 b4 b3 b2 b1 b0
        b9    1  1  0  0  0  0  0  0  0  0
        b8    1  1  1  0  0  0  0  0  0  0
        b7    0  0  1  0  0  0  0  0  0  0
        b6    0  0  1  0  1  0  0  0  0  0
        b5    0  0  0  0  1  0  0  0  0  0
        b4    0  0  0  0  0  1  1  0  0  0
        b3    0  0  0  0  0  1  1  1  0  0
        b2    0  0  0  0  0  0  1  1  1  0
        b1    0  0  0  0  0  0  0  1  1  0
        b0    0  0  0  0  0  0  0  0  1  1
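A minimal sketch of the coefficient-matrix scheme follows, assuming the dependencies of Table II and the row and column order b9 ... b0 of the matrix printed above. The names `DEPENDS`, `coefficient_matrix` and `next_state` are illustrative assumptions, and the code covers only a single application of the rule set.

```python
N = 10

# DEPENDS[i] lists the bits XOR-ed together to form the next state of bit i
# (transcribed from Table II).
DEPENDS = {
    9: (9, 8), 8: (9, 8, 7), 7: (7,), 6: (7, 5), 5: (5,),
    4: (4, 3), 3: (4, 3, 2), 2: (3, 2, 1), 1: (2, 1), 0: (1, 0),
}


def coefficient_matrix(depends, n):
    """Row r stands for bit n-1-r and column c for bit n-1-c (b9 first)."""
    return [[1 if (n - 1 - c) in depends[n - 1 - r] else 0 for c in range(n)]
            for r in range(n)]


def next_state(matrix, bits):
    """Multiply the matrix by the pattern (a string b9..b0) modulo 2."""
    vector = [int(ch) for ch in bits]
    return "".join(str(sum(m * v for m, v in zip(row, vector)) % 2)
                   for row in matrix)


if __name__ == "__main__":
    C = coefficient_matrix(DEPENDS, N)
    for row in C:
        print(" ".join(map(str, row)))
    print(next_state(C, "0000010100"))  # one step applied to AATTA
```

Because every rule is an XOR of selected bits, one application of the whole rule set is a single matrix-vector product over GF(2), which is why the next state can be computed in time linear in the number of non-zero coefficients.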

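    In the case of an XOR-CA rule, the following theorem relates a pattern to its basin.

    Theorem 1: If two arbitrary patterns $B_1$ and $B_2$ reach the same $\alpha$-basin on consecutive applications of an XOR-CA rule, then the pattern $B = B_1 \oplus B_2$ will reach the zero-basin on consecutive applications of the same XOR-CA rule.

    Proof: Let $B_1^0$ and $B_2^0$ be two arbitrary patterns falling in the $\alpha$-basin, and let the pattern $B_1^0$ reach the attractor $\alpha$ after $k$ consecutive applications of the XOR-CA rule. Also, let $B_1^i$ denote the pattern derived after the $i$th consecutive application of the XOR-CA rule on $B_1^0$. With $C$ denoting the CA-Coefficient matrix of the rule, the above assumption implies the following equations:

    $$C \times B_1^0 = B_1^1, \quad C \times B_1^1 = B_1^2, \quad \ldots, \quad C \times B_1^{k-1} = B_1^k = \alpha$$

    $$\therefore \; C^k \times B_1^0 = \alpha$$

    Similarly, if the pattern $B_2^0$ reaches the attractor $\alpha$ after $k'$ consecutive applications of the XOR-CA rule and $k \ge k'$, then we can state the following equation, since the attractor pattern $\alpha$ does not change under further application of the XOR-CA rule:

    $$C^k \times B_2^0 = \alpha$$

    Now, $(C^k \times B_1^0) \oplus (C^k \times B_2^0) = \alpha \oplus \alpha = 0$ leads to $C^k \times (B_1^0 \oplus B_2^0) = 0$, because multiplication by $C$ distributes over XOR. The above equation implies that the pattern $B = B_1^0 \oplus B_2^0$ also reaches the zero-basin.

    Hence the proof.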

    Example 1: In Fig. 1, two patterns $B_1$ = 01000 and $B_2$ = 10000 are in the zero-basin, with $B = B_1 \oplus B_2$ = 11000 also falling in the same zero-basin.

    Theorem 1 confirms that the Hamming distance between a pair of patterns falling in the same basin gets reflected in the zero-basin patterns. This result leads to the fact that XOR-CA rules whose zero-basin patterns are close to each other with respect to their Hamming distance can act as effective pattern classifiers. The state-transition characteristics of such CA are desirable for DNA pattern classification, wherein similar patterns will tend to fall in the zero-basin.
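The linearity that Theorem 1 relies on can be checked numerically. The sketch below assumes the rule set of Table II, stores a 10-bit pattern in an integer (bit i holds b_i), and verifies that one rule application distributes over XOR for every pair of patterns. The names `DEPENDS` and `step` are illustrative assumptions, and the pair of patterns at the end is a hypothetical example built by flipping bit b6, on which no rule in Table II depends.

```python
# Bits XOR-ed together to form the next state of each bit (from Table II).
DEPENDS = {
    9: (9, 8), 8: (9, 8, 7), 7: (7,), 6: (7, 5), 5: (5,),
    4: (4, 3), 3: (4, 3, 2), 2: (3, 2, 1), 1: (2, 1), 0: (1, 0),
}


def step(pattern):
    """Apply the XOR rules once to a 10-bit pattern held in an int."""
    nxt = 0
    for i, deps in DEPENDS.items():
        bit = 0
        for j in deps:
            bit ^= (pattern >> j) & 1
        nxt |= bit << i
    return nxt


# Precompute one step for all 1024 patterns, then test superposition.
TABLE = [step(x) for x in range(1 << 10)]
is_linear = all(TABLE[x ^ y] == TABLE[x] ^ TABLE[y]
                for x in range(1 << 10) for y in range(1 << 10))

b1 = int("1001101101", 2)   # GTGCT from Table I
b2 = b1 ^ (1 << 6)          # same pattern with b6 flipped

if __name__ == "__main__":
    print("step distributes over XOR:", is_linear)
    print("b1 and b2 coincide after one step:", step(b1) == step(b2))
    print("b1 XOR b2 maps to zero:", step(b1 ^ b2) == 0)
```

The last two lines mirror the argument of the proof: two patterns that meet after some number of steps have an XOR that is mapped to the all-zero pattern after the same number of steps.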

    Design of Multi-stage Hierarchical Classifier: A two-class XOR-CA classifier is favourable for implementation due to its simplicity, but it has several limitations due to its linear characteristics. The limitations of a single-stage XOR-CA classifier can be avoided to a certain extent by designing a multi-stage hierarchical classifier. In a multi-stage classification scheme, the single-stage classifier is repeatedly employed at every stage, leading to a hierarchical tree-like structure with each node corresponding to a single-stage CA classifier (Fig. 4).
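The tree structure of the multi-stage scheme can be sketched as follows. This is only a structural illustration: the `Node` class, its field names and the `classify` function are assumptions, and the two-class decisions used in the example are trivial stand-in predicates on bit strings, not evolved CA classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Node:
    """One single-stage two-class classifier in the hierarchy."""
    in_zero_basin: Callable[[str], bool]  # stand-in for a single-stage CA decision
    zero: Union["Node", str]              # subtree or class label if in the zero-basin
    nonzero: Union["Node", str]           # subtree or class label otherwise


def classify(node, pattern):
    """Descend the tree, applying one two-class decision per stage."""
    while isinstance(node, Node):
        node = node.zero if node.in_zero_basin(pattern) else node.nonzero
    return node


# Hypothetical four-class tree shaped like {S1..S4} -> {S1,S2},{S3,S4} -> leaves.
root = Node(
    in_zero_basin=lambda p: p[0] == "0",
    zero=Node(lambda p: p[1] == "0", "S1", "S2"),
    nonzero=Node(lambda p: p[1] == "0", "S3", "S4"),
)

if __name__ == "__main__":
    for pattern in ["00", "01", "10", "11"]:
        print(pattern, classify(root, pattern))
```

With k classes the tree needs k - 1 single-stage classifiers, and a pattern passes through one classifier per level on its way to a leaf.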

    [Fig. 3 contents: Class A = {AATTC, ATTTC, ATTGA, CATTC, AATTA}; Class B = {GCGCT, GTGCT, GTGCC, TTGCT, GCGCC}.]

    IV. EVOLUTION OF THE CA BY SIMULATED ANNEALING

    Simulated annealing is a generalization of a Monte Carlo method for examining the equations of state and frozen states of n-body systems [26]. We employ and appropriately tune Simulated Annealing to arrive at the desired CA, whose zero-basin patterns are close to each other.

    In Simulated Annealing an initial temperature (Ti) is set. The temperature decreases exponentially during the process [27]. At each temperature point (Tp) some action is taken based on the value of the cost function. The entire process continues till the temperature becomes zero. To evaluate the CA rules as a DNA pattern classifier, we design a heuristic cost function as described below. Let us assume that we are given two distinct classes of DNA sequences, represented by class A = {AATTC, ATTTC, ATTGA, CATTC, AATTA} and class B = {GCGCT, GTGCT, GTGCC, TTGCT, GCGCC}. We initially create a randomly generated CA rule, represented by the coefficient matrix C. For training the classifier we arbitrarily select $N_A$ DNA sequences from class A and $N_B$ DNA sequences from class B. Let us assume that out of the $(N_A + N_B)$ sequences, $N_{AB}$ patterns fall in the zero-basin on applying the CA rules and the rest of the sequences fall in the non-zero basins. Next we emit the consensus sequence CSeq for these $N_{AB}$ DNA sequences using HMMER [28], which is an online DNA sequence analysis tool based on Hidden Markov Models. Let L be the average Levenshtein distance of these $N_{AB}$ DNA sequences as determined with respect to the consensus sequence CSeq. The Levenshtein distance Lev(x, y) between two DNA sequences x and y of lengths m and n respectively is given by

    $$\mathrm{Lev}_{x,y}(m, n) = \begin{cases} \max(m, n), & \text{if } \min(m, n) = 0 \\ \min \begin{cases} \mathrm{Lev}_{x,y}(m-1, n) + 1 \\ \mathrm{Lev}_{x,y}(m, n-1) + 1 \\ \mathrm{Lev}_{x,y}(m-1, n-1) + [x_m \neq y_n] \end{cases} & \text{otherwise} \end{cases}$$

    where $[x_m \neq y_n]$ equals 1 when the mth symbol of x differs from the nth symbol of y, and 0 otherwise. The Levenshtein distance is an integer, which gives a measure of similarity between two DNA sequences. To compute the Fuzzy Levenshtein distance [29], the percentage similarity between two DNA sequences is computed. To transform the Levenshtein distance into a percentage, the number of edits required is divided by the length of the longer string and the result is subtracted from 1.0. The Fuzzy Levenshtein distance is obtained by multiplying the resulting value by 100. The Fuzzy Levenshtein distances of a set of three DNA sequences in the same basin from their consensus sequence are illustrated in the table below:

    TABLE III
    COMPUTATION OF FUZZY LEVENSHTEIN DISTANCE

    Sequence   Consensus Sequence   Fuzzy Levenshtein distance
    CAGAT      CAGTT                0.8
    AGGTT      CAGTT                0.2
    CAATT      CAGTT                0.6
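The recurrence for the Levenshtein distance and its normalisation can be sketched in Python as below. The function names `levenshtein` and `fuzzy_similarity` are illustrative assumptions, and the normalisation is one reading of the description in the text (one minus the edit count divided by the length of the longer sequence, expressed as a fraction rather than a percentage). Under this reading CAGAT against the consensus CAGTT scores 0.8; the remaining values printed in Table III may rest on a different alignment or edit-cost convention, so the sketch illustrates the formula only.

```python
def levenshtein(x, y):
    """Edit distance between x and y by dynamic programming, row by row."""
    previous = list(range(len(y) + 1))      # distances from "" to prefixes of y
    for i, cx in enumerate(x, start=1):
        current = [i]                       # distance from x[:i] to ""
        for j, cy in enumerate(y, start=1):
            current.append(min(
                previous[j] + 1,            # deletion
                current[j - 1] + 1,         # insertion
                previous[j - 1] + (cx != cy),  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_similarity(x, y):
    """1 - edits / length of the longer sequence, in the range [0, 1]."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(x, y) / longest


if __name__ == "__main__":
    print(levenshtein("CAGAT", "CAGTT"))
    print(fuzzy_similarity("CAGAT", "CAGTT"))
```

Keeping only two rows of the dynamic-programming table gives O(mn) time and O(n) memory, which matters when the cost function is evaluated once per candidate CA.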

    The fitness cost of a CA as a solution is then calculated as the average of the Fuzzy Levenshtein distances of each sequence in the alignment to the consensus alignment. For example, the fitness cost of the solution in the previous example is 0.53. The lower the value of L, the better the fitness cost of the CA rule as a classifier.

    There are two types of solutions based on cost value: Best Solution (BS) and Current Solution (CS). A New Solution (NS) at the next Tp compares its cost value with CS. If NS has a better cost value than CS, then NS becomes CS. The new solution (NS) is also compared with BS, and if NS is better, then NS becomes BS. Even if NS is not as good as CS, NS is accepted with a certain probability. This step is typically done to avoid getting trapped in local minima. The complete algorithm is presented below:

    Algorithm SA_EvolveCA
    // Input: Pattern Size (n), Pattern Set (S), Initial Temp. (Ti)
    // Output: CA Rule.
    1   Tp = Ti
    2   CS = BS = NULL
    3   while (Tp > 0) {
    4       if (Tp > 0.5 * Ti) {
    5           Randomly generate a CA as guess solution
    6       }
    7       else {
    8           Generate a new solution from CS
    9       }
    10      Generate state transition table and rule table
    11      NS = CA-Rule
    12      Δcost = cost-value(NS) - cost-value(CS)
    13      if (Δcost < 0) {
    14          CS = NS
    15          if (cost-value(NS) < cost-value(BS)) {
    16              BS = NS
    17          }
    18      }
    19      else
    20          accept CS = NS with probability e^(-Δcost / Tp)
    21      Reduce Tp exponentially
    22  }

    The above simulated annealing algorithm continues to explore the CA search space with a heuristic approach for obtaining the desired CA rule as long as the temperature remains positive. The temperature (Tp) in simulated annealing is initialized with a large value (line 1), and at every attempt it is reduced (line 21) gradually to get to the termination phase. A CA rule as a candidate solution is randomly generated (line 5) through random synthesis of the CA-Coefficient matrix. In order to obtain a neighbor candidate solution to CS (i.e. the current solution derived so far), a few bits of the CA-Coefficient matrix corresponding to CS are altered (line 8). The probability of accepting a new candidate solution as the current solution depends on the fitness-cost value. It is evident from the algorithm that every new candidate solution has the possibility of becoming the current solution irrespective of its fitness cost. However, as the temperature approaches zero, i.e. as the algorithm approaches the termination phase, the probability of accepting a less-fit CA rule also diminishes (line 21). During the exploration, the algorithm records the best explored CA rule as BS (line 16).

    Fig. 4 Multi-stage hierarchical classifier
    [Tree: the root node {S1, S2, S3, S4} splits into {S1, S2} and {S3, S4}, which in turn split into the leaves S1, S2, S3 and S4.]
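The control flow of SA_EvolveCA can be sketched in runnable form as below. Several parts are assumptions made only for illustration: the cooling factor `alpha`, the stopping threshold `t_min` (an exponentially decaying temperature never reaches exactly zero), the number of flipped bits in `neighbour`, the acceptance probability exp(-Δcost / Tp), and above all `toy_cost`, which merely counts the ones in the matrix and stands in for the Fuzzy Levenshtein fitness cost that the paper obtains with a consensus sequence.

```python
import math
import random


def random_matrix(n, rng):
    """Random n x n binary coefficient matrix (line 5 of the algorithm)."""
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]


def neighbour(matrix, rng, flips=2):
    """Copy of the matrix with a few randomly chosen bits altered (line 8)."""
    new = [row[:] for row in matrix]
    n = len(new)
    for _ in range(flips):
        i, j = rng.randrange(n), rng.randrange(n)
        new[i][j] ^= 1
    return new


def sa_evolve_ca(n, cost, t_init=100.0, t_min=1e-3, alpha=0.95, seed=0):
    """Return (best matrix, best cost) found by simulated annealing."""
    rng = random.Random(seed)
    tp = t_init
    cs = bs = None
    cs_cost = bs_cost = math.inf            # NULL solutions cost infinity
    while tp > t_min:
        if tp > 0.5 * t_init:
            ns = random_matrix(n, rng)      # exploration phase
        else:
            ns = neighbour(cs, rng)         # refinement around CS
        ns_cost = cost(ns)
        delta = ns_cost - cs_cost
        if delta < 0:
            cs, cs_cost = ns, ns_cost
            if ns_cost < bs_cost:
                bs, bs_cost = ns, ns_cost
        elif rng.random() < math.exp(-delta / tp):
            cs, cs_cost = ns, ns_cost       # occasionally accept a worse solution
        tp *= alpha                         # exponential cooling (line 21)
    return bs, bs_cost


def toy_cost(matrix):
    """Stand-in cost: number of ones in the matrix (lower is better)."""
    return sum(sum(row) for row in matrix)


if __name__ == "__main__":
    best, best_cost = sa_evolve_ca(4, toy_cost)
    print(best_cost)
    for row in best:
        print(row)
```

To use the sketch with the scheme of this paper, `toy_cost` would be replaced by a function that applies the candidate matrix to the training patterns, collects those that fall in the zero-basin, and returns their average Fuzzy Levenshtein cost with respect to the consensus sequence.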

    V. RESULT

    This section reports experimental observations during evaluation of our proposed CA based DNA classification scheme. To analyse the performance of the proposed CA based DNA pattern classifier, the experiment has been performed on synthetic datasets. All the experiments have been conducted under the following setup.

    Hardware
      o Processor: Intel Core i7-3610QM CPU @ 2.30GHz × 8
      o RAM: 8 GB
      o Disk: 1000 GB

    Software
      o Operating system: Open SUSE, Kernel version 3.1.0-1.2-desktop
      o OS type: 32-bit
      o Compiler used: javac version 4.6.2 (SUSE Linux)

    During the experimentation, emphasis has been put on the behavior of our model in response to varying DNA sequence length as well as the number of DNA trainee patterns (i.e. trainee size). The given set of DNA sequences has been randomly divided into a trainee set and a testing set. The most desirable CA is evolved through the simulated annealing heuristic algorithm, and the CA is assumed to be the best explored solution which can classify the trainee DNA patterns efficiently with respect to their Fuzzy Levenshtein distances, as discussed in the previous section. The testing set is used to measure the class-prediction accuracy of the proposed model, built with randomly chosen trainee patterns, in comparison with the actual class membership. The overall performance of the proposed scheme is represented in the form of the following graphs, plotted with variations of DNA sequence size and number of trainee sequences. Each figure presents the classification accuracy of the proposed model for DNA patterns of a given sequence length against varying trainee pattern size. Fig. 5 displays the classification accuracy of the proposed model with DNA sequence length 20, whereas Fig. 6 reports the classification accuracy with DNA sequence length 40, against various trainee pattern sizes. The observation reveals several interesting facts about the behavior of the model. In both cases, the accuracy level has been observed to range from 60 percent to 90 percent or above, showing linear improvement with the increase in the number of trainee DNA patterns.

    Fig. 5 Classification Accuracy vs. Number of Trainee Patterns for a sequence length of 20

    Fig. 6 Classification Accuracy vs. Number of Trainee Patterns for a sequence length of 40

    [Plot data for Figs. 5 and 6; panel titles: Sequence Length = 20, Sequence Length = 40. Axes: Number of Trainee Patterns (20, 40, 60, 80, 100) against Classification Accuracy. Series in order of extraction: 67.19, 74.45, 80.13, 89.03, 93.77; and 62.44, 66.33, 69.06, 78.11, 83.04.]

    Fig. 7 Classification Accuracy vs. Number of Trainee Patterns for a sequence length of 60

    Fig. 8 Classification Accuracy vs. Number of Trainee Patterns for a sequence length of 80

    The behavior of the proposed scheme does not vary drastically with respect to the other sequence lengths, as is evident in Fig. 7 (DNA sequence length 60), Fig. 8 (DNA sequence length 80), and Fig. 9 (DNA sequence length 100). It is also evident from each graph that as the number of trainee patterns increases, the accuracy level increases almost linearly. One of the interesting observations is that as long as the trainee pattern size remains below 60 percent, the performance of the model does not vary much with the variation in DNA sequence length. However, when the number of trainee patterns exceeds 60 percent of the given set, the classification accuracy of the model falls with the increase in DNA sequence length. The outcome also indicates that as the sequence length increases, the average performance of the scheme slides down a bit, but the accuracy rate rises sharply with the increase in the number of trainee patterns. It is evident from our observation that the proposed classification scheme has the potential to classify DNA sequences with a reasonable accuracy rate.

    Fig. 9 Classification Accuracy vs. Number of Trainee Patterns for a sequence length of 100

    VI. CONCLUSION

    The identification and classification of genes in new DNA sequence information is not a trivial problem. The researcher is quite often faced with hundreds of gigabytes of data to be analyzed. This difficulty is compounded by the many competing choices for the parameters: in choosing the algorithm, in choosing the similarity metric, in selecting the classification model and finally in selecting a terminating criterion. This paper has presented the idea of a cellular automata based DNA pattern classifier, which is low-cost, high-speed and works with high accuracy. The proposed technique of DNA pattern classification would open up a wide scope of investigative studies with a goal to explore further improvements in this area.

    REFERENCES

    [1] Stefania Bandini. Guest Editorial - Cellular Automata. Future Generation Computer Systems, 18:v-vi, August 2002.

    [2] A. Albicki, S. K. Yap, M. Khare, and S. Pamper. Prospects on Cellular Automata Application to Test Generation. Technical Report EL-88-0…, Dept. of Electrical Engg., Univ. of Rochester, 1988.

    [3] H. Baltzer, W. P. Braun, and W. Kohler. Cellular Automata Model for Vegetable Dynamics. Ecological Modelling, 107:113-125, 1998.

    [4] S. Wolfram. Theory and application of Cellular Automata. World Scientific, 1986.

    [5] P. H. Bardell. Analysis of Cellular Automata used as Pseudo-Random Pattern Generators. In International Test Conference, pages 762-76…, 1990.

    [6] C. Burks and D. Farmer. Towards Modeling DNA Sequences as Automata. Physica D, 10:157-167, 1984.

    [7] J. H. Moore and L. W. Hahn. A Cellular Automata-based Pattern Recognition Approach for Identifying Gene-Gene and Gene-Environment Interactions. American Journal of Human Genetics, 67(52), 2000.

    [Plot data for Figs. 7 to 9; panel titles: Sequence Length = 60, Sequence Length = 80, Sequence Length = 100. Axes: Number of Trainee Patterns (20, 40, 60, 80, 100) against Classification Accuracy. Series in order of extraction: 61.78, 65.31, 70.07, 79.99, 87.09; 60.03, 68.65, 71.01, 73.76, 84.88; and 54.97, 57.06, 61.21, 69.33, 73.07.]

    [8] J. H. Moore and L. W. Hahn. Multilocus Pattern Recognition usingCellular Automata and Parallel Genetic Algorithms. In Proc. of the

    Genetic and Evolutionary Computation Conference (GECCO-2001),page 1452, 7-11 July 2001.

    [9] A. Albicki and M. Khare. Cellular Automata used for Test PatternGeneration. In Proc. ICCD, pages 5659, 1987.

    [10] A. Albicki and S. K. Yap. Covering a Set of Test Patterns by a Cellular Automata. Research Review, Dept. of Comp. Sc. and Engg., Univ. of Rochester, 1987.

    [11] E. R. Banks. Information Processing and Transmission in Cellular Automata. PhD thesis, M. I. T., 1971.

    [12] S. C. Benjamin and N. F. Johnson. A Possible Nanometer-scale Computing Device based on an Adding Cellular Automaton. Applied Physics Letters, 1997.

    [13] A. M. Barbe. A Cellular Automata Ruled by an Eccentric Conservation Law. Physica D, 45:49-62, 1990.

    [14] Jianbo Gao, Yan Qi, Yinhe Cao, and Wen-wen Tung, "Protein Coding Sequence Identification by Simultaneously Characterizing the Periodic and Random Features of DNA Sequences", Journal of Biomedicine and Biotechnology, Vol. 2, pp. 139-146, 2005.

    [15] Peterson, D.; Lee, C.H., "A DNA-based pattern recognition technique for cancer detection," Engineering in Medicine and Biology Society, 2004. IEMBS '04. 26th Annual International Conference of the IEEE, vol. 2, pp. 2956-2959, 1-5 Sept. 2004. doi: 10.1109/IEMBS.2004.1403839

    [16] Ido Priness, Oded Maimon and Irad Ben-Gal, "Evaluation of gene-expression clustering via mutual information distance measure", BMC Bioinformatics 2007, 8:111. doi:10.1186/1471-2105-8-111

    [17] David Kulp, David Haussler, Martin G. Reese, Frank H. Eeckman, "A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA", ISMB-96 Proceedings, 1996.

    [18] Sathish Kumar S, N. Duraipandian, "An Effective Identification of Species from DNA Sequence: A Classification Technique by Integrating DM and ANN", International Journal of Advanced Computer Science and Applications, Vol. 3, No. 8, pp. 104-114, 2012.

    [19] A. W. Burks. Essays on Cellular Automata. Technical Report, Univ. of Illinois, Urbana, 1970.

    [20] S. Bhattacharjee, J. Bhattacharya, and P. Pal Chaudhuri. An Efficient Data Compression based on Cellular Automata. In Data Compression Conference (DCC95), 1995.

    [21] Stephen A Billings and Yingxu Yang. Identification of Probabilistic Cellular Automata. IEEE Transactions on Systems, Man and Cybernetics, Part B, pages 1-12, 2002.

    [22] M. S. Capcarrere. Cellular Automata and Other Cellular System: Design and Evolution. PhD thesis, Swiss Federal Institute of Technology, Lausanne, 2002.

    [23] S. Chakraborty, D. Roy Chowdhury, and P. Pal Chaudhuri. Theory and Application of Non-Group Cellular Automata for Synthesis of Easily Testable Finite State Machines. IEEE Trans. on Computers, 45(7):769-781, July 1996.

    [24] S. Chattopadhyay, S. Adhikari, S. Sengupta, and M. Pal. Highly Regular, Modular, and Cascadable Design of Cellular Automata-based Pattern Classifier. IEEE Transactions on VLSI Systems, 8(6):724-735, December 2000.

    [25] N. Ganguly, P. Maji, S. Dhar, B. K. Sikdar, and P. Pal Chaudhuri. Evolving Cellular Automata as Pattern Classifier. In Proc. of Fifth International Conference on Cellular Automata for Research and Industry, ACRI 2002, Switzerland, pages 56-68, October 2002.

    [26] E. H. L. Aarts and J. Korst. Simulated Annealing and Boltzmann Machines. John Wiley & Sons, Essex, U.K., 1989.

    [27] De Vicente, Juan; Lanchares, Juan; Hermida, Román (2003). "Placement by thermodynamic simulated annealing". Physics Letters A 317 (5-6): 415-423.

    [28] HMMER 3.1 (February 2013); http://hmmer.org/

    [29] Hjelmqvist, Sten (March 2012), Fast, memory efficient Levenshtein algorithm (http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm)

    Prof. Tamal Chakrabarti is currently Assistant Professor, Department of Computer Science and Engineering, Institute of Engineering and Management. He started his career with Wipro Technologies, India as a Software Engineer. Then he joined Flextronics Software Systems, India as a Technical Leader. After that he was associated with IBM India Pvt. Limited, where he led a software development team. Subsequently, he worked with Infosys Technologies, India, as a Project Manager. Since 2009, he has been teaching at the Institute of Engineering and Management. He graduated with a B.Sc. (Hons.) in Physics from Calcutta University in 1997 and a B.Tech. in Computer Science and Engineering from Calcutta University in 2000. In 2006 he received his master's degree from BITS Pilani, India. He has been presented with numerous awards from professional bodies and academia, including the Feather in My Cap Award (twice) by Wipro Technologies, the Spot Award by Lucent Technologies, the Bravo Award by IBM India Pvt. Ltd. and the Award of Excellence for contribution in the International Conference on innovative techno-management solution for social sector, in 2012. He has participated in various projects in India, Belgium and Ireland. IBM India Pvt. Ltd. honored him with the Mentor Award for guiding a project in The Great Mind Challenge 2011. Prof. Chakrabarti is a member of the Computer Society of India (CSI). He has authored numerous papers in journals and conferences. His research interests include Bio-informatics, Programming Languages, and Design and Analysis of Algorithms.

    Prof. Sourav Saha is currently Assistant Professor, Department of Computer Science and Engineering, Institute of Engineering and Management. He started his career working in the R&D sector at various companies. Since 2011, he has been teaching at the Institute of Engineering and Management. He graduated with a B.Tech in Computer Science & Engineering from Kalyani University in 2000, and obtained his Master of Engineering (M.E.) degree in Computer Science and Engineering from Bengal Engineering and Science University in 2002. He was awarded the university medal for securing the highest mark in M.E. and also received an award from the Indian National Academy of Engineering for best innovative bachelor level project in 2000. He has numerous international and national publications in reputed journals and conferences to his credit. His research interests include Cellular Automata, Pattern Recognition, Bio-Medical Engineering, Bio-Informatics etc.

    Prof. (Dr.) Devadatta Sinha is currently Professor, Department of Computer Science and Engineering, University of Calcutta, India. He joined the department as a Reader in 1989. Prior to this, he worked as Assistant Professor, Department of Computer Engineering, B.I.T. Mesra Ranchi and as Lecturer and Senior Lecturer (Computer Science) at the Department of Mathematics, Jadavpur University. He obtained his Ph.D. from Jadavpur University in the 1980s and his area of research was Program Testing. He has published more than 50 papers and articles in different national and international journals, proceedings, periodicals and monographs. His areas of interest include Software Engineering, Parallel and Distributed Computing, Bioinformatics and Cryptography. He has guided a number of doctoral and master's theses in Computer Science. He worked as Head of the Department of Computer Science and Engineering, University of Calcutta for two terms of two years each. He worked as Chairman, undergraduate studies in Computer Science, University of Calcutta and is currently the Convener, Ph.D. Committee in Computer Science and Engineering, University of Calcutta. He is associated with a number of academic institutions as a member of their academic bodies. He is involved in a number of national and international conferences in the capacity of Chairman of PC/OC. He served as Chairman, Computer Society of India, Kolkata Chapter and is a Patron of the chapter. He was Sectional President, Section of Computer Science, Indian Science Congress Association in 1993-94.
