a robust elicitation algorithm for discovering dna motifs using fuzzy self-organizing maps
Post on 18-Dec-2016
Embed Size (px)
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 10, OCTOBER 2013 1677
A Robust Elicitation Algorithm for DiscoveringDNA Motifs Using Fuzzy Self-Organizing Maps
Dianhui Wang, Senior Member, IEEE, and Sarwar Tapan
Abstract It is important to identify DNA motifs in pro-moter regions to understand the mechanism of gene regulation.Computational approaches for finding DNA motifs are wellrecognized as useful tools to biologists, which greatly help insaving experimental time and cost in wet laboratories. Self-organizing maps (SOMs), as a powerful clustering tool, havedemonstrated good potential for problem solving. However, thecurrent SOM-based motif discovery algorithms unfairly treatdata samples lying around the cluster boundaries by assigningthem to one of the nodes, which may result in unreliable systemperformance. This paper aims to develop a robust framework fordiscovering DNA motifs, where fuzzy SOMs, with an integrationof fuzzy c-means membership functions and a standard batch-learning scheme, are employed to extract putative motifs withvarying length in a recursive manner. Experimental results oneight real datasets show that our proposed algorithm outperformsthe other searching tools such as SOMBRERO, SOMEA, MEME,AlignACE, and WEEDER in terms of the F-measure andalgorithm reliability. It is observed that a remarkable 24.6%improvement can be achieved compared to the state-of-the-art SOMBRERO. Furthermore, our algorithm can produce a20% and 6.6% improvement over SOMBRERO and SOMEA,respectively, in finding multiple motifs on five artificial datasets.
Index Terms Computational motif discovery, DNA sequences,fuzzy self-organizing maps, robust elicitation algorithm.
TRANSCRIPTION factor binding sites (TFBSs) are smallDNA segments (usually 30 bp in length) that interactwith the transcription proteins known as the transcriptionfactors (TFs) to initiate gene transcription, which is the firststep of gene expression . A collection of binding sites thatbind to a specific TF from a set of co-expressed genes istermed as a motif and characterizes the binding preferencesof that TF. Discovering novel motifs in co-expressed genes orfinding new binding sites associated with known TFs is crucialin understanding gene regulatory mechanisms .
Experimental approaches for finding DNA motifs, e.g.,ChIP-chip , ChIP-seq, and micro-array technology ,are still laborious, time consuming, and expensive. Hence,computational approaches have received considerable attentionin the last two decades . Over these years, many motifsearch algorithms and Web-based tools have been developedbased on computational intelligence systems and data mining
Manuscript received September 8, 2012; revised April 24, 2013; acceptedJuly 26, 2013. Date of publication August 15, 2013; date of current versionSeptember 27, 2013.
The authors are with the Department of Computer Science and Com-puter Engineering, La Trobe University, Melbourne 3086, Australia (e-mail:firstname.lastname@example.org; email@example.com).
Color versions of one or more of the figures in this paper are availableonline at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2013.2275733
techniques, e.g., . However, according to some sur-veys , the existing tools are still unable to achievesatisfactory performance in terms of accuracy, reliability, andscalability. Thus, effective motif discovery still remains chal-lenging despite the enormous number of attempts over thepast years, which necessitates the exploration of possibilitiesfor improved developments.
Being one of the powerful data mining tools, clusteringtechniques can be used to group unlabeled data for subtlepattern recognition in biological sequences . Thebasic idea behind clustering-based DNA motif discovery algo-rithms is to extract a set of clusters, which are composed of asimilar type of same length (k-length) subsequences (known ask-mers), followed by merging and optimizing some potentialclusters with a considerable degree of presence of the func-tional motif properties . The clustering process iterativelyupdates the centroids with preferably random initialization tobecome useful representations of the potential clusters withmotif properties. Clustering algorithms are capable of findingmotifs embedded in DNA sequences because of some inher-ent properties of motifs, such as the positional reservation,rareness in respect to the background, and overrepresentationnature . Also, the use of domain-specific scoring functionsfor data assignment plays a key role in clustering-basedmotif extraction exercises . Note that a generic clusteringapproach for DNA motif discovery may not be effective unlessit is equipped with specific mechanisms for forming clustersof similar k-mers from the input sequences, ensuring somedegree of presence of the functional motif characteristics inthe clusters. The appropriateness of the similarity metrics andthe type of centroids are the two key factors in clustering-basedmotif discovery.
Self-organizing maps (SOMs)  can be employed formotif discovery through clustering similar k-mers extractedfrom DNA sequences , . Its competitive learningwith neighborhood-based cooperation between nodes intro-duces a useful generality during the initial learning phase,which distinguishes SOMs from a typical winner-takes-alltype k-means clustering algorithm in terms of significantlyimproved DNA motif discovery results . Furthermore, theuse of SOMs enables a useful refinement of putative clusters(nodes) with the contribution from the grid-based immediateneighboring clusters that often preserve considerably similaror partially similar motif patterns. Although the conditionsfor keeping the topology preservation in SOMs are rarelyexamined in practice , the applicability of SOM-basedclustering techniques for DNA motif discovery is undoubted.
SOMs perform hard partitions in the k-mer dataset, andeach node in post-training SOMs encodes a nonoverlapping
2162-237X 2013 IEEE
1678 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 10, OCTOBER 2013
area in the input distribution. Despite its proven usefulnessin a wide range of applications, it has limited suitabilityfor the DNA motif discovery task, since the hard partitionspush the data samples lying around the cluster boundariesto belong only to the best matching cluster even thoughthe matching is insignificant. This can have a considerablenegative impact on the DNA motif discovery performance,especially in finding weak motifs. Consequently, it is inevitableand usually unavoidable to have some true but weak bindingsites distributed among the neighboring clusters of a true motif,because of the fact that the functional binding sites can oftenbe considerably degenerated in functional motifs. This impliesthat a motif extracted from a crisp partitioning approachoften fails to attach weak binding sites, which eventuallydegrades the performance and reliability of the SOM-basedmotif finding tools.
To overcome the problem mentioned above, this paperpresents a fuzzy SOMs (FSOMs) based approach for findingmotifs, where the FSOM performs a fuzzy c-means (FCMs)type soft clustering . Here, each cluster in the FSOMshas a global relationship to all k-mers in the dataset, whichincreases the chances of attaching degenerated instances tothe functional motifs. Motivated by an incremental learningalgorithm of fuzzy SOMs proposed in , a batch learningalgorithm for training FSOM networks is developed in thispaper. It should be pointed out that such an extension isnecessary and meaningful for motif discovery. Indeed, thenecessity comes from the nature of modeling motifs, i.e.,each node as a whole stores a motif candidate. The incre-mental learning schemes, in this case, fail to comply withthe requirements of modeling motifs since a single k-meris incapable of characterizing the properties of functionalmotifs. Also, such type of learning schemes slow downthe processing of finding motifs. In this paper, a robustframework, named robust elicitation algorithms for discover-ing (READ) DNA motifs, is proposed by utilizing multipleFSOMs and some innovative heuristics for motif candidateoptimization.
An important issue related to SOM-based motif discoverytools is about the robustness of performance with respect tothe map size. An SOM network with an improper numberof nodes may produce inappropriate partitions of the datasetand degrade the motif-like clusters and eventually degradethe motif finding performance. Being data-specific in nature,setting a proper map size in SOMs is mostly an empiricaltask that needs considerable human intervention and carefulattempts. To minimize the impact of an improper map sizesetting on the motif mining performance, two complementarystrategies are adopted in the READ framework. First, FSOMsare employed, which naturally have a better tolerance in han-dling an improper map size setting due to its soft-partitioningmechanism. Second, the cluster quality degradation due to aninappropriate map size setting is compensated by an improvedpostprocessing scheme that aims to regain the lost informationon the degraded motif-like clusters. As observed in our experi-ments, the proposed READ framework performs favorably androbustly compared to the FCM-based and crisp SOM-basedalgorithms.
Fig. 1. Collection of binding sites (6 bp length) of MyoD TF  representedas a motif PFM and visualized using an entropy-based logo.
The remainder of this paper is organized as follows:Section II gives some preliminaries including motif modelingand evaluation, which will be used in the sequential sectionfor model selection and rankin. Section III describes theproposed batch learning algorithm for the FSOMs. Sectio