research article cloud based metalearning system for predictive...

11
Research Article Cloud Based Metalearning System for Predictive Modeling of Biomedical Data Milan VukiTeviT, Sandro RadovanoviT, Miloš MilovanoviT, and Miroslav MinoviT Faculty of Organizational Sciences, University of Belgrade, Jove Ili´ ca 154, 11000 Belgrade, Serbia Correspondence should be addressed to Miroslav Minovi´ c; [email protected] Received 20 December 2013; Accepted 21 January 2014; Published 14 April 2014 Academic Editors: R. Colomo-Palacios and V. Stantchev Copyright © 2014 Milan Vuki´ cevi´ c et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Rapid growth and storage of biomedical data enabled many opportunities for predictive modeling and improvement of healthcare processes. On the other side analysis of such large amounts of data is a difficult and computationally intensive task for most existing data mining algorithms. is problem is addressed by proposing a cloud based system that integrates metalearning framework for ranking and selection of best predictive algorithms for data at hand and open source big data technologies for analysis of biomedical data. 1. Introduction Data mining can be defined as the process of finding previously unknown patterns and trends in databases and using that information to build predictive models [1]. Due to increasing amount of data generated in healthcare systems (medical records, gene expression data, medical image data, etc.), analysis became too complex and voluminous for traditional methods and this is why data mining is becoming increasingly important [2]. In the last decade data mining techniques (like clustering, classification, or association) were successfully applied on different medical and biomedical problems like prediction of heart attacks [3], diagnostics based on gene expression microarray data [4], classification of Parkinson’s disease [5], identification of liver cancer signature [6], and so forth. Special area of medical data mining is biomedical data mining that seeks to connect phenotypic data to biomarker profiles and therapeutic treatments, with the goal of creating predictive models of disease detection, progression, and therapeutic response. is area includes mining genomic data (and data from other high-throughput technologies such as DNA sequencing and RNA expression), text mining of the biological literature, medical records, and so forth, and image mining across a number of modalities, including X-rays, functional MRI, and new types of scanning microscopes [7]. Even though many algorithms were specially designed for application in this area [8], the exponential increase of genomic data brought by the advent of the third generation sequencing (NGS) technologies and the dramatic drop in sequencing cost have posed many challenges in terms of data transfer, storage, computation, and analysis of big biomedical data [2, 7, 9, 10]. ese authors emphasize the lack of computing power and storage space, as a major hurdle in achieving research goals. ey propose cloud computing as a service model sharing a pool of configurable resources, which is a suitable workbench to address these challenges (Figure 1). One of the “soſt” approaches for reducing the need for computer power for data analysis is introduction of met- alearning systems for selection and ranking of the best suited algorithms for different problems (datasets). ese systems store historical experimental records (descriptions of datasets and algorithm performances) and, based on these records, evolve models for prediction of algorithm performances on a new dataset. By using these systems, analyst does not have to evaluate large number of algorithms on a big data (only ones with the best predicted performance) and this way saves computational and time resources. Even though specialized metalearning systems are developed for many application areas like electricity load forecasting [11], gold market forecasting [12], choosing metaheuristic optimization algorithm for traveling salesman problem [13], and so forth, Hindawi Publishing Corporation e Scientific World Journal Volume 2014, Article ID 859279, 10 pages http://dx.doi.org/10.1155/2014/859279

Upload: others

Post on 17-Aug-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Research Article Cloud Based Metalearning System for Predictive …downloads.hindawi.com/journals/tswj/2014/859279.pdf · 2019. 7. 31. · paradigm in order to satisfy global demands

Research ArticleCloud Based Metalearning System for Predictive Modeling ofBiomedical Data

Milan VukiTeviT Sandro RadovanoviT Miloš MilovanoviT and Miroslav MinoviT

Faculty of Organizational Sciences University of Belgrade Jove Ilica 154 11000 Belgrade Serbia

Correspondence should be addressed to Miroslav Minovic miroslavminovicfonbgacrs

Received 20 December 2013 Accepted 21 January 2014 Published 14 April 2014

Academic Editors R Colomo-Palacios and V Stantchev

Copyright copy 2014 Milan Vukicevic et alThis is an open access article distributed under theCreativeCommonsAttribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Rapid growth and storage of biomedical data enabled many opportunities for predictive modeling and improvement of healthcareprocesses On the other side analysis of such large amounts of data is a difficult and computationally intensive task for most existingdata mining algorithms This problem is addressed by proposing a cloud based system that integrates metalearning framework forranking and selection of best predictive algorithms for data at hand and open source big data technologies for analysis of biomedicaldata

1 Introduction

Data mining can be defined as the process of findingpreviously unknown patterns and trends in databases andusing that information to build predictive models [1] Dueto increasing amount of data generated in healthcare systems(medical records gene expression data medical image dataetc) analysis became too complex and voluminous fortraditional methods and this is why data mining is becomingincreasingly important [2]

In the last decade datamining techniques (like clusteringclassification or association) were successfully applied ondifferent medical and biomedical problems like predictionof heart attacks [3] diagnostics based on gene expressionmicroarray data [4] classification of Parkinsonrsquos disease [5]identification of liver cancer signature [6] and so forth

Special area of medical data mining is biomedical datamining that seeks to connect phenotypic data to biomarkerprofiles and therapeutic treatments with the goal of creatingpredictive models of disease detection progression andtherapeutic responseThis area includesmining genomic data(and data from other high-throughput technologies such asDNA sequencing and RNA expression) text mining of thebiological literature medical records and so forth and imagemining across a number of modalities including X-raysfunctional MRI and new types of scanning microscopes [7]

Even though many algorithms were specially designedfor application in this area [8] the exponential increase ofgenomic data brought by the advent of the third generationsequencing (NGS) technologies and the dramatic drop insequencing cost have posed many challenges in terms of datatransfer storage computation and analysis of big biomedicaldata [2 7 9 10] These authors emphasize the lack ofcomputing power and storage space as a major hurdle inachieving research goals They propose cloud computing as aservicemodel sharing a pool of configurable resources whichis a suitable workbench to address these challenges (Figure 1)

One of the ldquosoftrdquo approaches for reducing the need forcomputer power for data analysis is introduction of met-alearning systems for selection and ranking of the best suitedalgorithms for different problems (datasets) These systemsstore historical experimental records (descriptions of datasetsand algorithm performances) and based on these recordsevolve models for prediction of algorithm performances ona new dataset By using these systems analyst does nothave to evaluate large number of algorithms on a big data(only ones with the best predicted performance) and thisway saves computational and time resources Even thoughspecialized metalearning systems are developed for manyapplication areas like electricity load forecasting [11] goldmarket forecasting [12] choosingmetaheuristic optimizationalgorithm for traveling salesman problem [13] and so forth

Hindawi Publishing Corporatione Scientific World JournalVolume 2014 Article ID 859279 10 pageshttpdxdoiorg1011552014859279

2 The Scientific World Journal

0

1

2

3

4

5

6

7

8

92010

2011

2012

2013

2014

2015

2016

2017

2018

2019

2020

2021

2022

2023

2024

2025

2026

2027

2028

2029

2030

billions of base pairsLog10

Figure 1 Projected growth ofDNAsequence data in the 21st century[7]

there are not many researches that utilize metalearningin medicine [14] and in this paper we will propose suchapproach Similar efforts have been made in the field ofcontinuous improvement of business performance with bigdata [15] which lets users analyze business performance indistributed environments with a short response time whichcan be analogous with biomedical systems

Researchers in this area suggest that the main problemof exponential data growth is to provide adequate com-puting infrastructure that has the possibility to assemblemanage and mine the enormous and rapidly growing data[2 7 9 10] They emphasize that intersection point betweengenome technology cloud computing and biological datamining provides a launch pad for developing a globallyapplicable cloud computing platform capable of supportinga new paradigm of data intensive cloud-enabled predictivemedicine

In this paper we propose an extension of cloud basedsystems [16 17] with data and model driven services basedonmetalearning approach Additionally this system includesopen source data mining environments as a platform servicefor users System is based on open source technologiesand this is very important since they enable collaborativecollection of data and fast development of new algorithms[18]

2 State of the Art

In this section a brief overview of cloud based healthcaresystems and metalearning systems which are correlated withproposed system is discussed Cloud systems emerged as atechnology breakthrough in the last decade and impacteda wide range of business like SME [19] education [20] e-government [21] data mining [22] and so forth

Ahuja et al [23] exhaustively reviewed usage and con-sideration points in implementing cloud healthcare systemThey identified that themost important points are infrastruc-ture and number of facilities Infrastructure has great influ-ence sincemost of the healthcare facilities and office locationswere built years ago and cannot use cloud systems Numberof facilities is important on operation of health organizationand whether their IT infrastructure is distributed betweenfacilities or is in a single datacenter Moving to the cloudwould help communication application and collaboration

between health organizations Cloud computing reducesoperating costs because the need for IT staff in each facilityis lower and overall IT budget is reduced

The advantages of cloud computing and big data tech-nologies like Hadoop and related software increased theirpopularity in medicine and bioinformatics Dai et al [16]identified four bioinformatics cloud services Those are DaaS(data as a service) SaaS (software as a service) PaaS (platformas a service) and IaaS (infrastructure as a service) Bioinfor-matics generates huge amount of raw data and they shouldbe available for data analysis through DaaS Additionally alarge diversity of software tools is necessary for data analysisand SaaS is provided as an option in regarding this problemPlatform as a service provides programmable platform fordevelopment testing and deploying solutions online IaaSoffers a complete computer infrastructure for bioinformaticsanalysis

Lack of support for complex and large scale healthcareapplication of electronic medical records was identified by Liet al [24]Therefore XML was used as a model for managingmedical data while Hadoop infrastructure and MapReduceframework were used for data analysis Their system calledXBase is doing various datamining tasks like classification ofheart valvular disease detecting association rules diagnosisassistance and treatment recommendation

As Schatz et al [25] stated sequencing of DNA chainis improving at a rate of about 5-fold per year whilecomputer performance is doubling only every 18 or 24months Therefore addressing the issue of designing dataanalysis arises as a question A practical solution for solvingthis problem is to concentrate on developing methods thatmake better use of multiple computers and processorswhere cloud computing emerges with promising resultsTheystated that HadoopMapReduce technology is particularlywell suited from genomic point of view for analysis ofDNA sequenceTheCrossbow genotyping program leveragesHadoopMapReduce to launch many copies of the short readin parallel leveraging of HadoopMapReduce and Crossbowfor greater results In their benchmark test on the Amazoncloud Crossbow HadoopMapReduce analyzed 27 billiondata points in about 4 hours which included the timerequired for uploading the raw data for a total cost of $85USD Beside this they described obstacles which can posesignificant barrier in analysis of DNA sequence

An interesting approach to design of biomedical cloudsystem was described by Taverna [26] It is a workflowmanagement system that allows uploading data to the cloudfrom web application creating data flow and run analysisfrom multiple computing units While analysis is runninguser can monitor the progress This application also allowssharing of data flow enabling many researchers from thesame or other projects to influence and give contribution toresearch process

Frameworks for cloud based genome data analysis suchas Galaxy [27] offer generalized tools and libraries as com-ponents in workflow editor Galaxy enables users to definepipelines that through specifically developed visualizationshow progression of the workflow It is extensible and there-fore a community built around it contributes in developing

The Scientific World Journal 3

various tools for genome analysis It is important to noticethat Galaxy is integrated with biomedical databases such asUCSC table data BioMart Central andmodENCODE server

Hadoop and MapReduce distributed computingparadigm has been implemented in CloudBurst [28] Itis used for mapping short reads to reference genomes in aparallel fashion in cloud environment Essentially it providesa parallel read-mapping algorithm optimized for mappingsequence data to the human genome and other referencegenomes intended for use in a biological analysis includingSNP discovery genotyping and personal genomics

Another large biological extensibleworkbench is SeqWire[29] Users are allowed to write and share pipelinemodules Itprovides massive parallel processing using sequencing tech-nologies such as ABI SOLID and Illumina web applicationpipeline for processing and annotating sequenced data queryengine and a MetaDB

Web based cloud system such as FX [30] provides highusability for users that are not familiar with programmingtechniques It is developed for various biomedical data anal-yses such as estimating gene expression level and genomicvariant calling from the RNA sequence using transcriptome-based references User uploads data and configures dataanalysis settings on Amazon Web Service (AWS) Since thisapplication is domain specific it does not require manualarrangement of pipelines

Critical cloud services in biomedicine are infrastructureservices Therefore CloVR [31] offers a virtual operatingsystem with preinstalled packages and libraries required forbiomedical data analysis such as large-scale BLAST searcheswhole-genome assembly gene finding and RNA sequenceanalysis It is implemented as an online application but doesnot provide GUI Instead command-line based automatedanalysis pipelines with preconfigured software packages forcomposing workflows are implemented

Chae et al [9] focus on two emerging problems inbioinformatics data analysis Those are computation powerand big data analysis for the biomedical data Biomedicalanalysis requires very big computing power with huge storagespaceThey proposed BioVLab as an affordable infrastructureon the cloud with a graphical workflow creator which pro-vides an efficient way to deal with these problems BioVLabconsists of three layers The first layer is a graphical workflowengine called XBaya which enables the composition andmanagement of scientific workflows on a desktopThe secondlayer gateway is a web-based analysis tool for the integratedanalysis of microRNA and mRNA expression data Analysisis done on Amazon S3 Interface which presents thirdlayer of architecture Data and commands from gateway aretransferred to cloud which analyze data and return results touser on desktopThey emphasized that analysis of bigmedicaldata requires use of appropriate tools and databases from avast number of tools and databases therefore using cloudwould not solve problems of computational power and bigdata analysis

Cloud healthcare application could have great impacton society but security privacy and government regulationissues limit its usage Zhang and Liu [32] defined securitymodel for cloud healthcare system First part of the model is

secure collection of data Data collection module is createdand maintained independently by care delivery organiza-tions Second part is secure storage Storage system must beencrypted and it can allow only authorized access Third partis secure usage which consists of medicine staff signatureand verification sections Their model deals with problemof information ownership authenticity and authenticationnonrepudiation patient consent and authorization integrityand confidentiality of data and availability of system

Since security is identified as a major challenge in cloudsystemsWooten et al [33] designed and implemented securehealthcare cloud system This was achieved with a trust-aware role-based access control and a tag system System wasimplemented on top of Amazon Web Services (AWS) ElasticCompute Cloud (EC2) and Linux Apache MySQL and PHP(LAMP) solution stack

Liu and Park [34] focused on challenges and adaptationof e-healthcare cloud systems This system extends the cloudparadigm in order to satisfy global demands in digitalhealthcare applications Therefore technology healthcareprocess and service are identified as the main characteristicsof healthcare cloud systems Similarly new challenges aroseby the unique requirements of the e-healthcare industry forusing cloud services for regulation security issues accessintercloud connectivity and resource distribution

IBM Watson is also used in healthcare as cloud serviceGiles and Wilcox [35] used Watson ability to use naturallanguage processing and combine it with content analysisin order to help medical staff in diagnostic analysis Thisapplication of Watson identifies diseases symptoms rightmedications and modifiers directly from medical recordsfrom different medical facilities stored on cloud

Knowledge cloud based systems in medicine as Lai et al[17] stated are one of the major governmentrsquos strategic plansto drive the healthcare services which are identified as publicconcerns in China They highlighted some successful criteriafor establishment of a private knowledge network for businessnetwork collaboration and the knowledge cloud system forradiotherapy dynamic treatment service in China such asinnovation outsourcing marketing opportunity economy ofscales leverage existing resources and service on demandThree parties are identified in the KaaS service model Thefirst party is the knowledge user (patients hospitals anddoctors) the one who pays for the knowledge service ondemand The second party is the knowledge expert (externalconsultants) the one who provides the knowledge service ondemand The third party is the knowledge agent who linkstogether the knowledge user and the knowledge expert ondemand

Great potential of cloud services in the area ofbiomedicine is identified by Grossman and White [7]who made a vision of biomedical cloud in the future Sinceamount of data which hospitals and medical institutionsare dealing with is growing rapidly big data technologieswill have indispensable role in data analysis Consequentlymanaging and processing data will fundamentally changeand new data mining and machine learning algorithmswill be developed to deal with these changes Explosion ofdata is expected to be in genomic proteomic and other

4 The Scientific World Journal

ldquoomicrdquo data and molecular and system biology Probesthat collect data from tissue are more complex and allowsimultaneous tracking and collect more data than few yearsago Authors have also considered several issues such assecurity scalability of storage scalability of analysis peerwith other private clouds and peer with public clouds

Metalearning presents powerful methodology whichenables learning on its past knowledge of solving differenttasks This methodology is used in fraud detection [36] timeseries forecasting [37] load forecasting [11] and others

3 Metalearning Framework forClustering Biomedical Data

Exponential growth of the data and rapid development oflarge number of complex and computationally intensive datamining algorithms led to one of the major problems inmodern data mining selection of the best algorithm for dataat hand [38]Namely analyst often does not have enough timeor resources for creating models and evaluating themwith allavailable algorithms

One of the most promising approaches for dealingwith this problem is metalearning [39 40] Metalearning ismethodology which solves different data mining tasks basedon past knowledgeThemain idea is to store history of exper-imental results with descriptions (meta-attributes) of datasets(eg dataset characteristics algorithm and classificationaccuracies for classification problems) and based on thisto create metamodel (classification or regression) that willpredict the performances of each algorithm on new datasetCreating of suchmetamodel (with good performance) wouldreduce the need for brute force evaluation (evaluation ofevery algorithm)

Metalearning system is built on set of algorithm orcombination of algorithms (ensembles) Therefore everyalgorithm or combination of algorithms (ensembles) is sim-plerTheoreticallymetalearning system can be infinitely largeby puttingmetalearning as component of other metalearningsystems An advantage of its using is that it can address newtypes of tasks that have not been seen but are similar toalready defined problems

Metalearning by Smith-Miles [39 40] is definedwith thefollowing aspects

(i) the problem space 119875 which represents set ofinstances (datasets) of a given problem class

(ii) the meta-attribute space119872 which contains charac-teristics that describe existing problems (eg numberof attributes entropy normality etc)

(iii) the algorithm space 119860 which represents the set ofcandidate algorithms which can be used to solve theproblems defined in problem space 119875

(iv) a performance metric 119884 which represents measuresof performance of an algorithm on a problem (egclassification accuracy (for classification problems) orroot mean square error (for regression problems))

General procedure for metalearning is done in severalsteps first datasets from problem space are evaluated by

algorithms fromalgorithm space Furthermetafeatures of thedatasets are related to algorithm performance forming thedatabase of metaexamples Then regression or classificationmodels are created (and evaluated by performance metric)Finally when new problem (dataset) arrives meta-attributesare extracted and performance prediction is made In thisway analyst does not have to evaluate each algorithm on eachdataset but only ones with the best predicted performanceof the problem and algorithm spaces While technologies fordata collection enabled cheap and fast accumulation of dataand extension of problem space development and collectionof data mining algorithms is a more difficult problem sincedevelopment of new algorithms demands a lot of time andeffort and also different algorithms are implemented ondifferent platforms

The most important issue for good performance of met-alearning systems is the size of problem and algorithm spacebecause the accuracy of metamodels is directly dependant onthese spaces [39 40]Thismeans that cloud based system andservice oriented architecture should be natural environmentfor this kind of systems because it would enable communitybased extension models to be created and evaluated byperformance metric in order to capture relations betweenmeta-attributes and algorithm performance

With metalearning approach in solving problems timefor choosing appropriate algorithm for problem is greatlyreduced but requires time for creating and updating meta-models especially if data and algorithms are gathered fromcommunity This is one of the main motivations for integra-tion of such a system in cloud based environment and inte-gration with big data technologies (like Hadoop Hive andMahout) for storing and aggregation of data and predictivemodeling

31 Component Based Metalearning System for BiomedicalData One of the promising approaches for tackling theseproblems (existence of large algorithm space and existenceof efficient procedure for selection of the best algorithm)is component based data mining algorithm design [41ndash43]This approach divides algorithms with similar structure (inthis case representative based algorithms) into parts withthe same functionality called subproblems Every subproblemhas standardized IO structure and can be solved with oneor more reusable components (RCs) presented in Table 1This approach combination of RCs which originates fromdifferent algorithms can be used to design large number(thousands) of new ldquohybridrdquo algorithms This approach gavevery promising results in the area of clustering biomedical(gene expression) data [8 44 45]

Combining RCs is used for reproducing or creationof cluster algorithms For example K-means algorithmscan be reconstructed as RANDOM-EUCLIDEAN-MEAN-COMPACT However a new hybrid algorithm can be con-structed using DIANA-CORREL-MEDIAN-CONN whereDIANA is used to initialize representatives CORREL tomeasure distance MEDIAN to update representatives andCONN to evaluate clusters

The Scientific World Journal 5

Table 1 Sub-problems and RCs for generic clustering algorithmdesign

Sub-problem Reusable components

Initialize representatives DIANA RANDOM XMEANSGMEANS PCA KMEANS++ SPSS

Measure distance EUCLIDEAN CITY CORRELCOSINE

Update representatives MEAN MEDIAN ONLINE

Evaluate clusters AIC BIC SILHOU COMPACT XBCONN

Extendedmetalearning system shown in Figure 2 is usedin this research Problem space is presented in upper leftcloud Every problem (dataset) from problem space 119875 has itstask (clustering classification regression etc) denounced 119909Based on problem function 119891 extracts meta-attributes Forselected problem based onmeta-attributes function 119878 selectsalgorithm from algorithm space 119860 Every algorithm is con-structed from reusable components (RCs) from which addi-tional meta-attributes were derived (algorithm descriptions)Also as a result of clustering algorithm on specific datasetinternal evaluationmeasures (additional meta-attributes) arecalculated and saved as meta-attributes Central cloud is themost important part of metalearning system It is responsiblefor ranking and selection of algorithms Inputs in this cloudare task 119909 algorithm 119886 and performance metric 120587 Meta-attributes created for each task 119909 are input for ranking andselection as they are a basis for learning on metalevel Formost tasks performance ofmeta-learning system is calculatedearlier as output label and it is available on metalevel

32 Initial Evaluation of Component Based MetalearningSystem for Clustering Biomedical Data In this section we willdescribe the data and the procedure for initial evaluation ofthe proposed system 30 datasets gathered from original met-alearning system [46] were used (httpbioinformaticsrut-gerseduStaticSupplementsCompCancerdatasetshtm)

For the construction of themetaexamples a set of 13meta-attributes for dataset description proposed by Nascimientoet al [46] are used (detailed description of datasets andmetafeatures can be found in Nascimiento et al [46])

Additionally meta-attribute space is extended withdescriptions of algorithms (four components of algorithmand normalization type described in Table 1) and internalcluster evaluation measures including compactness globalsilhouette indexAIC BICXB-index and connectivityThesethree types of meta-attributes (dataset descriptions reusablecomponents and internal evaluation measures) form thespace of 24 meta-attributes

Component based clustering algorithms were used todefine algorithm space For that purpose 504 RC-based clus-ter algorithms were designed for experimental evaluationThese algorithms were built by combining already describedRCs (Table 1) with 4 different normalization techniqueswhich lead to total of 2016 clustering experiments

Table 2 Meta-algorithm performance

AlgorithmError RMSE MAERBFN 0143 0109 (plusmn0092)LR 0111 0086 (plusmn0070)LMSR 0265 0094 (plusmn0248)NN 0101 0064 (plusmn0078)SVM 0050 0034 (plusmn0036)

For validation of clusteringmodels AMI (adjustedmutualinformation) index was used since it is recently recom-mended as a ldquogeneral purposerdquo measure for clustering valida-tion comparison and algorithm design [47] after exhaustivecomparison between a number of information theoreticand pair counting measures Even more this measure isthoroughly evaluated on gene expression microarray data

After validation of component based clustering algo-rithms on 30 datasets 55326 valid results were gatheredwhich represent metaexample repository Next step wasidentification of the best algorithm for ranking and selectionof algorithms for clustering gene expression microarray data

A procedure for ranking and selection of the best clus-tering algorithms is based on regression algorithms calledmeta-algorithms which predict (regression task) AMI valuesbased on a dataset metafeatures algorithm components andinternal evaluation measures

In this research five meta-algorithms were used Thoseare radial basis function network (RBFN) linear regression(LR) leastmedian square regression (LMSR) neural network(NN) and support vector machine (SVM)

Estimate of quality of regression algorithms is done usingmean absolute error (MAE) and root mean squared error(RMSE) Validation of results is done using 70 of datasetfor training the model and the remaining 30 for testing

Performance of each algorithm in terms of MAE andRMSE and best values are shown in bold (Table 2)

Although all five algorithms showed good results SVMas in Table 2 gave the best performance and this modelshould be used for prediction of algorithm performances fornew datasets RMSE of 005 and MAE of 0034 with thesmallest variance (numbers in brackets) indicate that thismetamodel is applicable to the new problems since AMImeasure takes values from 0 to 1 where 1 is the best Notethat with an extension of algorithm space and problem spacethese results could be changed and so continuous evaluationof available meta-algorithms should be done Because of thisit is important to have adequate computing infrastructurefor processing big data and updating the models Process forcreating and updating the models is presented in Figure 3Creation of model contains several steps First microarraymetaexamples are loaded from which only important vari-ables are selected After that data preparation phase wasconducted where only those attributes that are importantfor model building were selected label attribute was set onAMI attribute missing values were replaced with averagevalue and nominal values were transformed to numericalvalues using dummy coding Modeling phase is conducted

6 The Scientific World Journal

Gene expressionclustring

Cell typeclassifcation

Ranking and selection

Performance 120587

AlgorithmA

Reusablecomponents

(RCs)

PCADIANA

Initializerepresentatives

CITYCORREL

RMSEAbsolute error Accuracy

Precision

AICBIC

XB

MEANMEDIAN

ONLINE

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

LgE

BCPMV

NRE

K-NN

Algorithm space (A)

Evaluate cluster Updaterepresentatives

Measure distance

Algorithm

fe PCA-CITY-MEAN-XBltgt a 120587(a x) gt 120587( x)

a = combination of RC

Performance space (Y)

Task x

Feature extraction (f)

Meta-attribute space (M)Problem space (P)

selection (S)

A(x) = S(f(x))

a998400 a998400

Figure 2 Extended metalearning system for clustering biomedical data

Figure 3 Main process for finding the best model for prediction ofAMI

Figure 4 Stream for application of metalearning system on newcases

using 10-fold cross validation where the above-mentionedfive algorithms were used Every trained model is saved onhard disk which allows its reusability

Automatic application of the selected (in this case SVM)model is presented in Figure 4 After every update of themodel or after inserting new dataset this process needs to beupdated Savedmodel in this case SVM is loaded and appliedon new dataset Results gathered are sorted and exportedDetailed information can be found in Radovanovic et al [48]

4 Extended Cloud Based Model forBig Data Analysis

Dai et al [16] addressed the problems of storing and analysisof biomedical data by proposing a cloud based model foranalysis of biomedical data and it is composed of four servicecategories

(i) data as a service (DaaS)(ii) software as a service (SaaS)(iii) platform as a service (PaaS)(iv) infrastructure as a service (IaaS)

DaaS is group of cloud services which enables on demanddata access and provides up-to-date data that are accessibleby a wide range of devices that are connected over theweb In case of bioinformatics AmazonWeb Services (AWS)provides repository of public archives (data sets) includingGenBank Unigene Ensembl and Influenza Virus which canbe accessed from cloud based applications

Since bioinformatics requires a large diversity of softwaretools for data analyses the task of SaaS in bioinformatics is todeliver and enable remote access to software services onlineThus installation of software tools on desktop computer isno longer required Another one advantage of using SaaS isenablingmuch easier collaboration betweendispersed groupsof users

Platform as a service (PaaS) should offer a programmableenvironment for users in order to develop test and deploycloud applications Computer resources scale automatically

The Scientific World Journal 7

Daas

SaaS

PaaS

IaaS

KaaS

∙ Public datasets∙

∙ Tools∙ Pipelines

∙ Analysis platforms∙ Programming environments

∙ Virtual machines∙ Virtualized resources

∙ Domain experts∙ Knowledge networks

Biological datasets

Figure 5 Cloud based system [16] integrated with KaaS extension[17] for analysis of biomedical data

without user interference In most PaaS services besidesprogramming environment database and web server areavailable

Since most medical institutions do not have computingresources such as CPU IaaS offers a full computer infras-tructure delivering virtualized resources Those virtualizedresources can be operating systems RAM CPU or othercomputer resource

Lai et al [17] introduced a new service model in cloudcomputingmdashthe knowledge as a service (KaaS) that facilitatesthe interoperations amongmembers in a knowledge networkExtended model is depicted on Figure 5

This new service model relies on data created in acollaboration process of domain experts It was recognizedas a new form of cloud service and categorized as KaaS Ina nutshell such approach is data driven and lacks higherorder structure in order to provide knowledge as a serviceThis implies that KaaS in this form is provided by humandomain experts We extend this class of services with dataand model approach By combining data driven models andexpert knowledge that is stored in unstructured data likedocuments notes and collaborations knowledge is createdand offered to cloud users Specifically we extend models of[16 17] by including big data technologies platforms for datamining andmetamodels for ranking and selection of the bestalgorithms for biomedical data mining System componentsare displayed on a diagram (Figure 6) and classified accordingto service type directed towards end user Big data engineprovides data storage and access and it is a basis of each classof services (diagram center)

Key component of KaaS is metalearning algorithms andselected algorithms for clustering biomedical data Bothare represented by algorithm space component and bothare accompanied by their describing metamodels Thesemetamodels are used in runtime for ranking and selection

of best algorithms for the new problem (dataset) Experi-mental results provide knowledge on performances of eachalgorithm with meta-attributes on each dataset Data flowsare kept using big data engine such as Hadoop Segmentof experimental results component lies on DaaS since thesemeta-attributes can be provided as a data service DaaSprovides biomedical data It is divided into protected andpublic segments

Cloud approach provides data accumulation and higheravailability of data to interested parties (with rights of accessimplied) Interested parties can be found among not onlymedical employees and researchers but also community usingdifferent software tools to access data using DaaS

Central part of the circle (Figure 6) contains componentsof a big data engine (HDFS Hive and Mahout) where allthe data are centralized (medical data metadata and algo-rithms performances) Additionally big data engine providesinterfaces for data manipulation and analysis Apache Hive[49] is data warehouse software built on top of Hadoopused for querying and managing large datasets residing indistributed storage Apache Mahout [50] is also built ontop of Hadoop and is used as a scalable machine learninglibrary for classification clustering recommender systemsand dimension reduction

Our solution provides software for data analysis in aservice form (SaaS) with respect of SOA dependability [51]Additionally third party software can easily become anintegrated part of SaaS (RapidMiner [52] R [53] or others)

These software solutions are recommended because theyhave a direct interface for access and analysis of big data(eg Radoop [54] allows using visual RapidMiner interfaceand has operators that run distributed algorithms based onHadoop Hive [49] and Mahout [50] without writing anycode)

Most medical institutions require not only computingresources such as CPUs but also communication infras-tructure these components are offered as a part of IaaSin a form of virtualized resources Higher level of servicerests on using platform such as variety of operating systemsBut platform can also provide tools for development ofalgorithms and applications (Eclipse Netbeans RapidMinerR ) The circle is closed by development of new algorithmsand also for algorithm deployment and execution For thisreason algorithms space component is partly situated in PaaSspace

5 Conclusion and Future Research

In this paper we proposed a cloud based architecture forstoring analysis and predictive modeling of biomedical bigdata Existing service based cloud architecture is extended byincluding metalearning system as a data and model drivenknowledge service As a part of the proposed architecturewe provided a support for community based gathering ofdata and algorithms that is an important precondition forquality of metalearning Advancement of this research areaand adding new value are enabled through platform for devel-opment and execution of distributed data mining processes

8 The Scientific World Journal

Big data engine

Meta modelsmodels

Experimentalresults

Algorit

hms s

pace M

edical data

Publicdata

Protected

data

Virtual machines

Virtual network

Ope

ratin

g sy

stem

s(li

nux

Mac

OS Reporting

sofware(JasperReports Eclipse

PaaS

IaaS

KaaS

Daas

SaaS

BIRT OpenOffice

Win)

)

Figure 6 Cloud based system for predictive modeling of biomedical data

and algorithms Finally we provided data and model drivendecision support on selecting best algorithms for workingwith biomedical data

Retrospectively proposed solution focuses on a specifictype of biomedical data while other types still remain tobe included and evaluated Data security and privacy stillremains a concern to be taken into amore serious account Inorder to provide even further impact in research communityadditional work is necessary on providing interoperabilityamong potential open source components

System was tested on microarray gene expression datawith specific meta-attributes for this data type (eg chiptype) Further efforts will be made to include other typesof biomedical data This will be done by identifying spe-cific meta-attributes that fit newly included data typesAdditionally as a further work integration with OpenML[55] platform used for storing and gathering datasets and

clustering algorithm runs is plannedThis platform providesa base for a community to share experiments algorithms anddata Significant clustering algorithm meta-attributes can beextracted and used for updating our metalearning system

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research is partially funded by Grants from the SerbianMinistry of Science and Technological Development Con-tract nos III 41008 and TR 32013

The Scientific World Journal 9

References

[1] K Kincade ldquoData mining digging for healthcare goldrdquo Insur-ance and Technology vol 23 no 2 pp IM2ndashIM7 1998

[2] H C Koh andG Tan ldquoDatamining applications in healthcarerdquoJournal ofHealthcare InformationManagement vol 19 no 2 pp64ndash72 2011

[3] K Srinivas B K Rani and A Govrdhan ldquoApplications ofdata mining techniques in healthcare and prediction of heartattacksrdquo International Journal on Computer Science and Engi-neering vol 2 no 2 pp 250ndash255 2010

[4] P Maji ldquoMutual information-based supervised attribute clus-tering for microarray sample classificationrdquo IEEE Transactionson Knowledge and Data Engineering vol 24 no 1 pp 127ndash1402012

[5] R Das ldquoA comparison of multiple classification methods fordiagnosis of Parkinson diseaserdquo Expert Systems with Applica-tions vol 37 no 2 pp 1568ndash1572 2010

[6] A Thomas N H Patterson M M Marcinkiewicz A LazarisP Metrakos and P Chaurand ldquoHistology-driven data miningof lipid signatures from multiple imaging mass spectrometryanalyses application to human colorectal cancer liver metasta-sis biopsiesrdquoAnalytical Chemistry vol 85 no 5 pp 2860ndash28662013

[7] R L Grossman and K P White ldquoA vision for a biomedicalcloudrdquo Journal of Internal Medicine vol 271 no 2 pp 122ndash1302012

[8] M Vukicevic B Delibasic Z Obradovic M Jovanovic andMSuknovic ldquoA method for design of data-tailored partitioningalgorithms for optimizing the number of clusters in microarrayanalysisrdquo in Proceedings of the IEEE Symposium on Computa-tional Intelligence in Bioinformatics and Computational Biology(CIBCB rsquo12) pp 252ndash259 IEEE San Diego Calif USA 2012

[9] H Chae I Jung H Lee S Marru S W Lee and S Kim ldquoBioand health informatics meets cloud BioVLab as an examplerdquoHealth Information Science and Systems vol 1 no 1 article 62013

[10] J Sun and C K Reddy ldquoBig data analytics for healthcarerdquo inProceedings of the 19th ACM SIGKDD international conferenceon Knowledge discovery and data mining pp 1525ndash1525 ACMAugust 2013

[11] M Matijas J A Suykens and S Krajcar ldquoLoad forecastingusing a multivariate meta-learning systemrdquo Expert SystemsWith Applications vol 40 no 11 pp 4427ndash4437 2013

[12] S Zhou K K Lai and J Yen ldquoA dynamic meta-learning rate-based model for gold market forecastingrdquo Expert Systems withApplications vol 39 no 6 pp 6168ndash6173 2012

[13] J Kanda C Soares E Hruschka and A de Carvalho ldquoA meta-learning approach to select meta-heuristics for the travelingsalesman problem using MLP-based label rankingrdquo in NeuralInformation Processing pp 488ndash495 Springer Berlin Germany2012

[14] A Attig and P Perner ldquoMeta-learning for image processingbased on case-based reasoningrdquo in Computational Intelligencein Healthcare vol 4 pp 229ndash264 Springer Berlin Germany2010

[15] AVera-Baquero R Colomo-Palacios andOMolloy ldquoBusinessprocess analytics using a big data approachrdquo IT Professional vol15 no 6 pp 29ndash35 2013

[16] L Dai X Gao Y Guo J Xiao and Z Zhang ldquoBioinformaticsclouds for big data manipulationrdquo Biology Direct vol 7 no 1article 43 2012

[17] I K Lai S K Tam and M F Chan ldquoKnowledge cloud systemfor network collaboration a case study in medical serviceindustry in Chinardquo Expert Systems with Applications vol 39 no15 pp 12205ndash12212 2012

[18] S Sonnenburg M L Braun S O Cheng et al ldquoThe need foropen source software in machine learningrdquo Journal of MachineLearning Research vol 8 pp 2443ndash2466 2007

[19] S Marston Z Li S Bandyopadhyay J Zhang and A GhalsasildquoCloud computing the business perspectiverdquo Decision SupportSystems vol 51 no 1 pp 176ndash189 2011

[20] N Sultan ldquoCloud computing for education a new dawnrdquoInternational Journal of Information Management vol 30 no2 pp 109ndash116 2010

[21] W Zhang and Q Chen ldquoFrom E-government to C-governmentvia cloud computingrdquo in Proceedings of the 1st InternationalConference on E-Business and E-Government (ICEE rsquo10) pp679ndash682 IEEE May 2010

[22] A Fernandez S del Rıo F Herrera and J M Benıtez ldquoAnoverview on the structure and applications for business intel-ligence and data mining in cloud computingrdquo in Proceedingsof the 7th International Conference on Knowledge Managementin Organizations Service and Cloud Computing pp 559ndash570Springer Berlin Germany January 2013

[23] S P Ahuja S Mani and J Zambrano ldquoA survey of the state ofcloud computing in healthcarerdquo Network and CommunicationTechnologies vol 1 no 2 pp 12ndash19 2012

[24] W Li J Yan Y Yan and J Zhang ldquoXbase cloud-enabledinformation appliance for healthcarerdquo in Proceedings of the 13thInternational Conference on Extending Database TechnologyAdvances in Database Technology (EDBT rsquo10) pp 675ndash680March 2010

[25] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[26] T Oinn M Addis J Ferris et al ldquoTaverna a tool forthe composition and enactment of bioinformatics workflowsrdquoBioinformatics vol 20 no 17 pp 3045ndash3054 2004

[27] BGiardine C Riemer R CHardison et al ldquoGalaxy a platformfor interactive large-scale genome analysisrdquo Genome Researchvol 15 no 10 pp 1451ndash1455 2005

[28] M C Schatz ldquoCloudBurst highly sensitive read mapping withMapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009

[29] B D OrsquoConnor B Merriman and S F Nelson ldquoSeqWare queryengine storing and searching sequence data in the cloudrdquo BMCBioinformatics vol 11 supplement 12 article S2 2010

[30] D Hong A Rhie S Park et al ldquoFX an RNA-seq analysis toolon the cloudrdquo Bioinformatics vol 28 no 5 pp 721ndash723 2012

[31] S V Angiuoli M Matalka A Gussman et al ldquoCloVR a virtualmachine for automated and portable sequence analysis from thedesktop using cloud computingrdquo BMC Bioinformatics vol 12no 1 article 356 2011

[32] R Zhang and L Liu ldquoSecurity models and requirements forhealthcare application cloudsrdquo in Proceedings of the 3rd IEEEInternational Conference on Cloud Computing (CLOUD rsquo10) pp268ndash275 IEEE July 2010

[33] R Wooten R Klink F Sinek Y Bai and M Sharma ldquoDesignand implementation of a secure healthcare social cloud systemrdquoin Proceedings of the 12th IEEEACM International Symposiumon Cluster Cloud and Grid Computing (CCGrid rsquo12) pp 805ndash810 IEEE May 2012

10 The Scientific World Journal

[34] W Liu and E K Park ldquoE-healthcare cloud computing appli-cation solutions cloud-enabling characteristics challenges andadaptationsrdquo in Proceedings of the International Conference onComputing Networking and Communications (ICNC rsquo13) pp437ndash443 IEEE January 2013

[35] T Giles and R Wilcox ldquoIBMWatson and medical records textanalyticsrdquo in Proceedings of the 2011 Healthcare Information andManagement Systems Society Meeting 2011

[36] A Abbasi C Albrecht A Vance and J Hansen ldquoMetafrauda meta-learning framework for detecting financial fraudrdquo MISQuarterly vol 36 no 4 pp 1293ndash1327 2012

[37] C Lemke and B Gabrys ldquoMeta-learning for time series fore-casting and forecast combinationrdquo Neurocomputing vol 73 no10-12 pp 2006ndash2016 2010

[38] N Iam-On T Boongoen and S Garrett ldquoLCE a link-basedcluster ensemble method for improved gene expression dataanalysisrdquo Bioinformatics vol 26 no 12 pp 1513ndash1519 2010

[39] K A Smith-Miles ldquoTowards insightful algorithm selection foroptimisation using meta-learning conceptsrdquo in Proceedings ofthe (IEEE World Congress on Computational Intelligence) IEEEInternational Joint Conference on Neural Networks (IJCNN rsquo08)pp 4118ndash4124 IEEE Hong Kong June 2008

[40] K A Smith-Miles ldquoCross-disciplinary perspectives on meta-learning for algorithm selectionrdquo ACMComputing Surveys vol41 no 1 article 6 2008

[41] B Delibasic K Kirchner J Ruhland M Jovanovic and MVukicevic ldquoReusable components for partitioning clusteringalgorithmsrdquo Artificial Intelligence Review vol 32 no 1ndash4 pp59ndash75 2009

[42] B Delibasic M Vukicevic M Jovanovic K Kirchner J Ruh-land and M Suknovic ldquoAn architecture for component-baseddesign of representative-based clustering algorithmsrdquoData andKnowledge Engineering vol 75 pp 78ndash98 2012

[43] M Suknovic B Delibasic M Jovanovic M Vukicevic DBecejski-Vujaklija and Z Obradovic ldquoReusable components indecision tree induction algorithmsrdquo Computational Statisticsvol 27 no 1 pp 127ndash148 2012

[44] M Vukicevic B Delibasic M Jovanovic M Suknovic and ZObradovic ldquoInternal evaluation measures as proxies for exter-nal indices in clustering gene expression datardquo in Proceedingsof the IEEE International Conference on Bioinformatics andBiomedicine (BIBM rsquo11) pp 574ndash577 IEEE Computer Society2011

[45] M Vukicevic K Kirchner B Delibasic M Jovanovic J Ruh-land and M Suknovic ldquoFinding best algorithmic componentsfor clustering microarray datardquo Knowledge and InformationSystems vol 35 no 1 pp 111ndash130 2013

[46] A Nascimiento R Prudencio M de Souto and I CostaldquoMining rules for the automatic selection process of clusteringmethods applied to cancer gene expression datardquo in Proceedingsof the 19th International Conference on Artificial Neural Net-works Part II pp 20ndash29 Springer Limassol Cyprus 2009

[47] N X Vinh J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison variants properties nor-malization and correction for chancerdquo Journal of MachineLearning Research vol 11 pp 2837ndash2854 2010

[48] S Radovanovic M Vukicevic M Jovanovic B Delibasic andM Suknovic ldquoMeta-learning system for clustering gene expres-sion microarray datardquo in Proceedings of the 4th RapidMinerCommunity Meeting and Conference (RCOMM rsquo13) pp 97ndash111Shaker Porto Portugal 2013

[49] A Thusoo J S Sarma N Jain Z Shao P Chakka and SAnthony ldquoHive a warehousing solution over a map-reduceframeworkrdquo Proceedings of the VLDB Endowment vol 2 no 2pp 1626ndash1629 2009

[50] R Anil T Dunning and E Friedman Mahout in ActionManning 2011

[51] V Stantchev andMMalek ldquoAddressing dependability through-out the SOA life cyclerdquo IEEE Transactions on Services Comput-ing vol 4 no 2 pp 85ndash95 2011

[52] I Mierswa M Wurst R Klinkenberg M Scholz and T EulerldquoYALE rapid prototyping for complex data mining tasksrdquo inProceedings of the 12th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo06) pp 935ndash940 ACM August 2006

[53] Development Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2005

[54] Z Prekopcsak G Makrai T Henk and C Gaspar-PapanekldquoRadoop analyzing big data with rapidminer and hadooprdquo inProceedings of the 2nd RapidMiner Community Meeting andConference (RCOMM rsquo11) June 2011

[55] J N van Rijn B Bischl L Torgo B Gao V Umaashankarand S Fischer ldquoOpenML a collaborative science platformrdquo inMachine Learning and Knowledge Discovery in Databases pp645ndash649 Springer Berlin Germany 2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 2: Research Article Cloud Based Metalearning System for Predictive …downloads.hindawi.com/journals/tswj/2014/859279.pdf · 2019. 7. 31. · paradigm in order to satisfy global demands

2 The Scientific World Journal

0

1

2

3

4

5

6

7

8

92010

2011

2012

2013

2014

2015

2016

2017

2018

2019

2020

2021

2022

2023

2024

2025

2026

2027

2028

2029

2030

billions of base pairsLog10

Figure 1 Projected growth ofDNAsequence data in the 21st century[7]

there are not many researches that utilize metalearningin medicine [14] and in this paper we will propose suchapproach Similar efforts have been made in the field ofcontinuous improvement of business performance with bigdata [15] which lets users analyze business performance indistributed environments with a short response time whichcan be analogous with biomedical systems

Researchers in this area suggest that the main problemof exponential data growth is to provide adequate com-puting infrastructure that has the possibility to assemblemanage and mine the enormous and rapidly growing data[2 7 9 10] They emphasize that intersection point betweengenome technology cloud computing and biological datamining provides a launch pad for developing a globallyapplicable cloud computing platform capable of supportinga new paradigm of data intensive cloud-enabled predictivemedicine

In this paper we propose an extension of cloud basedsystems [16 17] with data and model driven services basedonmetalearning approach Additionally this system includesopen source data mining environments as a platform servicefor users System is based on open source technologiesand this is very important since they enable collaborativecollection of data and fast development of new algorithms[18]

2 State of the Art

In this section a brief overview of cloud based healthcaresystems and metalearning systems which are correlated withproposed system is discussed Cloud systems emerged as atechnology breakthrough in the last decade and impacteda wide range of business like SME [19] education [20] e-government [21] data mining [22] and so forth

Ahuja et al [23] exhaustively reviewed usage and con-sideration points in implementing cloud healthcare systemThey identified that themost important points are infrastruc-ture and number of facilities Infrastructure has great influ-ence sincemost of the healthcare facilities and office locationswere built years ago and cannot use cloud systems Numberof facilities is important on operation of health organizationand whether their IT infrastructure is distributed betweenfacilities or is in a single datacenter Moving to the cloudwould help communication application and collaboration

between health organizations Cloud computing reducesoperating costs because the need for IT staff in each facilityis lower and overall IT budget is reduced

The advantages of cloud computing and big data tech-nologies like Hadoop and related software increased theirpopularity in medicine and bioinformatics Dai et al [16]identified four bioinformatics cloud services Those are DaaS(data as a service) SaaS (software as a service) PaaS (platformas a service) and IaaS (infrastructure as a service) Bioinfor-matics generates huge amount of raw data and they shouldbe available for data analysis through DaaS Additionally alarge diversity of software tools is necessary for data analysisand SaaS is provided as an option in regarding this problemPlatform as a service provides programmable platform fordevelopment testing and deploying solutions online IaaSoffers a complete computer infrastructure for bioinformaticsanalysis

Lack of support for complex and large scale healthcareapplication of electronic medical records was identified by Liet al [24]Therefore XML was used as a model for managingmedical data while Hadoop infrastructure and MapReduceframework were used for data analysis Their system calledXBase is doing various datamining tasks like classification ofheart valvular disease detecting association rules diagnosisassistance and treatment recommendation

As Schatz et al [25] stated sequencing of DNA chainis improving at a rate of about 5-fold per year whilecomputer performance is doubling only every 18 or 24months Therefore addressing the issue of designing dataanalysis arises as a question A practical solution for solvingthis problem is to concentrate on developing methods thatmake better use of multiple computers and processorswhere cloud computing emerges with promising resultsTheystated that HadoopMapReduce technology is particularlywell suited from genomic point of view for analysis ofDNA sequenceTheCrossbow genotyping program leveragesHadoopMapReduce to launch many copies of the short readin parallel leveraging of HadoopMapReduce and Crossbowfor greater results In their benchmark test on the Amazoncloud Crossbow HadoopMapReduce analyzed 27 billiondata points in about 4 hours which included the timerequired for uploading the raw data for a total cost of $85USD Beside this they described obstacles which can posesignificant barrier in analysis of DNA sequence

An interesting approach to design of biomedical cloudsystem was described by Taverna [26] It is a workflowmanagement system that allows uploading data to the cloudfrom web application creating data flow and run analysisfrom multiple computing units While analysis is runninguser can monitor the progress This application also allowssharing of data flow enabling many researchers from thesame or other projects to influence and give contribution toresearch process

Frameworks for cloud based genome data analysis suchas Galaxy [27] offer generalized tools and libraries as com-ponents in workflow editor Galaxy enables users to definepipelines that through specifically developed visualizationshow progression of the workflow It is extensible and there-fore a community built around it contributes in developing

The Scientific World Journal 3

various tools for genome analysis It is important to noticethat Galaxy is integrated with biomedical databases such asUCSC table data BioMart Central andmodENCODE server

Hadoop and MapReduce distributed computingparadigm has been implemented in CloudBurst [28] Itis used for mapping short reads to reference genomes in aparallel fashion in cloud environment Essentially it providesa parallel read-mapping algorithm optimized for mappingsequence data to the human genome and other referencegenomes intended for use in a biological analysis includingSNP discovery genotyping and personal genomics

Another large biological extensibleworkbench is SeqWire[29] Users are allowed to write and share pipelinemodules Itprovides massive parallel processing using sequencing tech-nologies such as ABI SOLID and Illumina web applicationpipeline for processing and annotating sequenced data queryengine and a MetaDB

Web based cloud system such as FX [30] provides highusability for users that are not familiar with programmingtechniques It is developed for various biomedical data anal-yses such as estimating gene expression level and genomicvariant calling from the RNA sequence using transcriptome-based references User uploads data and configures dataanalysis settings on Amazon Web Service (AWS) Since thisapplication is domain specific it does not require manualarrangement of pipelines

Critical cloud services in biomedicine are infrastructureservices Therefore CloVR [31] offers a virtual operatingsystem with preinstalled packages and libraries required forbiomedical data analysis such as large-scale BLAST searcheswhole-genome assembly gene finding and RNA sequenceanalysis It is implemented as an online application but doesnot provide GUI Instead command-line based automatedanalysis pipelines with preconfigured software packages forcomposing workflows are implemented

Chae et al [9] focus on two emerging problems inbioinformatics data analysis Those are computation powerand big data analysis for the biomedical data Biomedicalanalysis requires very big computing power with huge storagespaceThey proposed BioVLab as an affordable infrastructureon the cloud with a graphical workflow creator which pro-vides an efficient way to deal with these problems BioVLabconsists of three layers The first layer is a graphical workflowengine called XBaya which enables the composition andmanagement of scientific workflows on a desktopThe secondlayer gateway is a web-based analysis tool for the integratedanalysis of microRNA and mRNA expression data Analysisis done on Amazon S3 Interface which presents thirdlayer of architecture Data and commands from gateway aretransferred to cloud which analyze data and return results touser on desktopThey emphasized that analysis of bigmedicaldata requires use of appropriate tools and databases from avast number of tools and databases therefore using cloudwould not solve problems of computational power and bigdata analysis

Cloud healthcare application could have great impacton society but security privacy and government regulationissues limit its usage Zhang and Liu [32] defined securitymodel for cloud healthcare system First part of the model is

secure collection of data Data collection module is createdand maintained independently by care delivery organiza-tions Second part is secure storage Storage system must beencrypted and it can allow only authorized access Third partis secure usage which consists of medicine staff signatureand verification sections Their model deals with problemof information ownership authenticity and authenticationnonrepudiation patient consent and authorization integrityand confidentiality of data and availability of system

Since security is identified as a major challenge in cloudsystemsWooten et al [33] designed and implemented securehealthcare cloud system This was achieved with a trust-aware role-based access control and a tag system System wasimplemented on top of Amazon Web Services (AWS) ElasticCompute Cloud (EC2) and Linux Apache MySQL and PHP(LAMP) solution stack

Liu and Park [34] focused on challenges and adaptationof e-healthcare cloud systems This system extends the cloudparadigm in order to satisfy global demands in digitalhealthcare applications Therefore technology healthcareprocess and service are identified as the main characteristicsof healthcare cloud systems Similarly new challenges aroseby the unique requirements of the e-healthcare industry forusing cloud services for regulation security issues accessintercloud connectivity and resource distribution

IBM Watson is also used in healthcare as cloud serviceGiles and Wilcox [35] used Watson ability to use naturallanguage processing and combine it with content analysisin order to help medical staff in diagnostic analysis Thisapplication of Watson identifies diseases symptoms rightmedications and modifiers directly from medical recordsfrom different medical facilities stored on cloud

Knowledge cloud based systems in medicine as Lai et al[17] stated are one of the major governmentrsquos strategic plansto drive the healthcare services which are identified as publicconcerns in China They highlighted some successful criteriafor establishment of a private knowledge network for businessnetwork collaboration and the knowledge cloud system forradiotherapy dynamic treatment service in China such asinnovation outsourcing marketing opportunity economy ofscales leverage existing resources and service on demandThree parties are identified in the KaaS service model Thefirst party is the knowledge user (patients hospitals anddoctors) the one who pays for the knowledge service ondemand The second party is the knowledge expert (externalconsultants) the one who provides the knowledge service ondemand The third party is the knowledge agent who linkstogether the knowledge user and the knowledge expert ondemand

Great potential of cloud services in the area ofbiomedicine is identified by Grossman and White [7]who made a vision of biomedical cloud in the future Sinceamount of data which hospitals and medical institutionsare dealing with is growing rapidly big data technologieswill have indispensable role in data analysis Consequentlymanaging and processing data will fundamentally changeand new data mining and machine learning algorithmswill be developed to deal with these changes Explosion ofdata is expected to be in genomic proteomic and other

4 The Scientific World Journal

ldquoomicrdquo data and molecular and system biology Probesthat collect data from tissue are more complex and allowsimultaneous tracking and collect more data than few yearsago Authors have also considered several issues such assecurity scalability of storage scalability of analysis peerwith other private clouds and peer with public clouds

Metalearning presents powerful methodology whichenables learning on its past knowledge of solving differenttasks This methodology is used in fraud detection [36] timeseries forecasting [37] load forecasting [11] and others

3 Metalearning Framework forClustering Biomedical Data

Exponential growth of the data and rapid development oflarge number of complex and computationally intensive datamining algorithms led to one of the major problems inmodern data mining selection of the best algorithm for dataat hand [38]Namely analyst often does not have enough timeor resources for creating models and evaluating themwith allavailable algorithms

One of the most promising approaches for dealingwith this problem is metalearning [39 40] Metalearning ismethodology which solves different data mining tasks basedon past knowledgeThemain idea is to store history of exper-imental results with descriptions (meta-attributes) of datasets(eg dataset characteristics algorithm and classificationaccuracies for classification problems) and based on thisto create metamodel (classification or regression) that willpredict the performances of each algorithm on new datasetCreating of suchmetamodel (with good performance) wouldreduce the need for brute force evaluation (evaluation ofevery algorithm)

Metalearning system is built on set of algorithm orcombination of algorithms (ensembles) Therefore everyalgorithm or combination of algorithms (ensembles) is sim-plerTheoreticallymetalearning system can be infinitely largeby puttingmetalearning as component of other metalearningsystems An advantage of its using is that it can address newtypes of tasks that have not been seen but are similar toalready defined problems

Metalearning by Smith-Miles [39 40] is definedwith thefollowing aspects

(i) the problem space 119875 which represents set ofinstances (datasets) of a given problem class

(ii) the meta-attribute space119872 which contains charac-teristics that describe existing problems (eg numberof attributes entropy normality etc)

(iii) the algorithm space 119860 which represents the set ofcandidate algorithms which can be used to solve theproblems defined in problem space 119875

(iv) a performance metric 119884 which represents measuresof performance of an algorithm on a problem (egclassification accuracy (for classification problems) orroot mean square error (for regression problems))

General procedure for metalearning is done in severalsteps first datasets from problem space are evaluated by

algorithms fromalgorithm space Furthermetafeatures of thedatasets are related to algorithm performance forming thedatabase of metaexamples Then regression or classificationmodels are created (and evaluated by performance metric)Finally when new problem (dataset) arrives meta-attributesare extracted and performance prediction is made In thisway analyst does not have to evaluate each algorithm on eachdataset but only ones with the best predicted performanceof the problem and algorithm spaces While technologies fordata collection enabled cheap and fast accumulation of dataand extension of problem space development and collectionof data mining algorithms is a more difficult problem sincedevelopment of new algorithms demands a lot of time andeffort and also different algorithms are implemented ondifferent platforms

The most important issue for good performance of met-alearning systems is the size of problem and algorithm spacebecause the accuracy of metamodels is directly dependant onthese spaces [39 40]Thismeans that cloud based system andservice oriented architecture should be natural environmentfor this kind of systems because it would enable communitybased extension models to be created and evaluated byperformance metric in order to capture relations betweenmeta-attributes and algorithm performance

With metalearning approach in solving problems timefor choosing appropriate algorithm for problem is greatlyreduced but requires time for creating and updating meta-models especially if data and algorithms are gathered fromcommunity This is one of the main motivations for integra-tion of such a system in cloud based environment and inte-gration with big data technologies (like Hadoop Hive andMahout) for storing and aggregation of data and predictivemodeling

31 Component Based Metalearning System for BiomedicalData One of the promising approaches for tackling theseproblems (existence of large algorithm space and existenceof efficient procedure for selection of the best algorithm)is component based data mining algorithm design [41ndash43]This approach divides algorithms with similar structure (inthis case representative based algorithms) into parts withthe same functionality called subproblems Every subproblemhas standardized IO structure and can be solved with oneor more reusable components (RCs) presented in Table 1This approach combination of RCs which originates fromdifferent algorithms can be used to design large number(thousands) of new ldquohybridrdquo algorithms This approach gavevery promising results in the area of clustering biomedical(gene expression) data [8 44 45]

Combining RCs is used for reproducing or creationof cluster algorithms For example K-means algorithmscan be reconstructed as RANDOM-EUCLIDEAN-MEAN-COMPACT However a new hybrid algorithm can be con-structed using DIANA-CORREL-MEDIAN-CONN whereDIANA is used to initialize representatives CORREL tomeasure distance MEDIAN to update representatives andCONN to evaluate clusters

The Scientific World Journal 5

Table 1 Sub-problems and RCs for generic clustering algorithmdesign

Sub-problem Reusable components

Initialize representatives DIANA RANDOM XMEANSGMEANS PCA KMEANS++ SPSS

Measure distance EUCLIDEAN CITY CORRELCOSINE

Update representatives MEAN MEDIAN ONLINE

Evaluate clusters AIC BIC SILHOU COMPACT XBCONN

Extendedmetalearning system shown in Figure 2 is usedin this research Problem space is presented in upper leftcloud Every problem (dataset) from problem space 119875 has itstask (clustering classification regression etc) denounced 119909Based on problem function 119891 extracts meta-attributes Forselected problem based onmeta-attributes function 119878 selectsalgorithm from algorithm space 119860 Every algorithm is con-structed from reusable components (RCs) from which addi-tional meta-attributes were derived (algorithm descriptions)Also as a result of clustering algorithm on specific datasetinternal evaluationmeasures (additional meta-attributes) arecalculated and saved as meta-attributes Central cloud is themost important part of metalearning system It is responsiblefor ranking and selection of algorithms Inputs in this cloudare task 119909 algorithm 119886 and performance metric 120587 Meta-attributes created for each task 119909 are input for ranking andselection as they are a basis for learning on metalevel Formost tasks performance ofmeta-learning system is calculatedearlier as output label and it is available on metalevel

32 Initial Evaluation of Component Based MetalearningSystem for Clustering Biomedical Data In this section we willdescribe the data and the procedure for initial evaluation ofthe proposed system 30 datasets gathered from original met-alearning system [46] were used (httpbioinformaticsrut-gerseduStaticSupplementsCompCancerdatasetshtm)

For the construction of themetaexamples a set of 13meta-attributes for dataset description proposed by Nascimientoet al [46] are used (detailed description of datasets andmetafeatures can be found in Nascimiento et al [46])

Additionally meta-attribute space is extended withdescriptions of algorithms (four components of algorithmand normalization type described in Table 1) and internalcluster evaluation measures including compactness globalsilhouette indexAIC BICXB-index and connectivityThesethree types of meta-attributes (dataset descriptions reusablecomponents and internal evaluation measures) form thespace of 24 meta-attributes

Component based clustering algorithms were used todefine algorithm space For that purpose 504 RC-based clus-ter algorithms were designed for experimental evaluationThese algorithms were built by combining already describedRCs (Table 1) with 4 different normalization techniqueswhich lead to total of 2016 clustering experiments

Table 2 Meta-algorithm performance

AlgorithmError RMSE MAERBFN 0143 0109 (plusmn0092)LR 0111 0086 (plusmn0070)LMSR 0265 0094 (plusmn0248)NN 0101 0064 (plusmn0078)SVM 0050 0034 (plusmn0036)

For validation of clusteringmodels AMI (adjustedmutualinformation) index was used since it is recently recom-mended as a ldquogeneral purposerdquo measure for clustering valida-tion comparison and algorithm design [47] after exhaustivecomparison between a number of information theoreticand pair counting measures Even more this measure isthoroughly evaluated on gene expression microarray data

After validation of component based clustering algo-rithms on 30 datasets 55326 valid results were gatheredwhich represent metaexample repository Next step wasidentification of the best algorithm for ranking and selectionof algorithms for clustering gene expression microarray data

A procedure for ranking and selection of the best clus-tering algorithms is based on regression algorithms calledmeta-algorithms which predict (regression task) AMI valuesbased on a dataset metafeatures algorithm components andinternal evaluation measures

In this research five meta-algorithms were used Thoseare radial basis function network (RBFN) linear regression(LR) leastmedian square regression (LMSR) neural network(NN) and support vector machine (SVM)

Estimate of quality of regression algorithms is done usingmean absolute error (MAE) and root mean squared error(RMSE) Validation of results is done using 70 of datasetfor training the model and the remaining 30 for testing

Performance of each algorithm in terms of MAE andRMSE and best values are shown in bold (Table 2)

Although all five algorithms showed good results SVMas in Table 2 gave the best performance and this modelshould be used for prediction of algorithm performances fornew datasets RMSE of 005 and MAE of 0034 with thesmallest variance (numbers in brackets) indicate that thismetamodel is applicable to the new problems since AMImeasure takes values from 0 to 1 where 1 is the best Notethat with an extension of algorithm space and problem spacethese results could be changed and so continuous evaluationof available meta-algorithms should be done Because of thisit is important to have adequate computing infrastructurefor processing big data and updating the models Process forcreating and updating the models is presented in Figure 3Creation of model contains several steps First microarraymetaexamples are loaded from which only important vari-ables are selected After that data preparation phase wasconducted where only those attributes that are importantfor model building were selected label attribute was set onAMI attribute missing values were replaced with averagevalue and nominal values were transformed to numericalvalues using dummy coding Modeling phase is conducted

6 The Scientific World Journal

Gene expressionclustring

Cell typeclassifcation

Ranking and selection

Performance 120587

AlgorithmA

Reusablecomponents

(RCs)

PCADIANA

Initializerepresentatives

CITYCORREL

RMSEAbsolute error Accuracy

Precision

AICBIC

XB

MEANMEDIAN

ONLINE

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

LgE

BCPMV

NRE

K-NN

Algorithm space (A)

Evaluate cluster Updaterepresentatives

Measure distance

Algorithm

fe PCA-CITY-MEAN-XBltgt a 120587(a x) gt 120587( x)

a = combination of RC

Performance space (Y)

Task x

Feature extraction (f)

Meta-attribute space (M)Problem space (P)

selection (S)

A(x) = S(f(x))

a998400 a998400

Figure 2 Extended metalearning system for clustering biomedical data

Figure 3 Main process for finding the best model for prediction ofAMI

Figure 4 Stream for application of metalearning system on newcases

using 10-fold cross validation where the above-mentionedfive algorithms were used Every trained model is saved onhard disk which allows its reusability

Automatic application of the selected (in this case SVM)model is presented in Figure 4 After every update of themodel or after inserting new dataset this process needs to beupdated Savedmodel in this case SVM is loaded and appliedon new dataset Results gathered are sorted and exportedDetailed information can be found in Radovanovic et al [48]

4 Extended Cloud Based Model forBig Data Analysis

Dai et al [16] addressed the problems of storing and analysisof biomedical data by proposing a cloud based model foranalysis of biomedical data and it is composed of four servicecategories

(i) data as a service (DaaS)(ii) software as a service (SaaS)(iii) platform as a service (PaaS)(iv) infrastructure as a service (IaaS)

DaaS is group of cloud services which enables on demanddata access and provides up-to-date data that are accessibleby a wide range of devices that are connected over theweb In case of bioinformatics AmazonWeb Services (AWS)provides repository of public archives (data sets) includingGenBank Unigene Ensembl and Influenza Virus which canbe accessed from cloud based applications

Since bioinformatics requires a large diversity of softwaretools for data analyses the task of SaaS in bioinformatics is todeliver and enable remote access to software services onlineThus installation of software tools on desktop computer isno longer required Another one advantage of using SaaS isenablingmuch easier collaboration betweendispersed groupsof users

Platform as a service (PaaS) should offer a programmableenvironment for users in order to develop test and deploycloud applications Computer resources scale automatically

The Scientific World Journal 7

Daas

SaaS

PaaS

IaaS

KaaS

∙ Public datasets∙

∙ Tools∙ Pipelines

∙ Analysis platforms∙ Programming environments

∙ Virtual machines∙ Virtualized resources

∙ Domain experts∙ Knowledge networks

Biological datasets

Figure 5 Cloud based system [16] integrated with KaaS extension[17] for analysis of biomedical data

without user interference In most PaaS services besidesprogramming environment database and web server areavailable

Since most medical institutions do not have computingresources such as CPU IaaS offers a full computer infras-tructure delivering virtualized resources Those virtualizedresources can be operating systems RAM CPU or othercomputer resource

Lai et al [17] introduced a new service model in cloudcomputingmdashthe knowledge as a service (KaaS) that facilitatesthe interoperations amongmembers in a knowledge networkExtended model is depicted on Figure 5

This new service model relies on data created in acollaboration process of domain experts It was recognizedas a new form of cloud service and categorized as KaaS Ina nutshell such approach is data driven and lacks higherorder structure in order to provide knowledge as a serviceThis implies that KaaS in this form is provided by humandomain experts We extend this class of services with dataand model approach By combining data driven models andexpert knowledge that is stored in unstructured data likedocuments notes and collaborations knowledge is createdand offered to cloud users Specifically we extend models of[16 17] by including big data technologies platforms for datamining andmetamodels for ranking and selection of the bestalgorithms for biomedical data mining System componentsare displayed on a diagram (Figure 6) and classified accordingto service type directed towards end user Big data engineprovides data storage and access and it is a basis of each classof services (diagram center)

Key component of KaaS is metalearning algorithms andselected algorithms for clustering biomedical data Bothare represented by algorithm space component and bothare accompanied by their describing metamodels Thesemetamodels are used in runtime for ranking and selection

of best algorithms for the new problem (dataset) Experi-mental results provide knowledge on performances of eachalgorithm with meta-attributes on each dataset Data flowsare kept using big data engine such as Hadoop Segmentof experimental results component lies on DaaS since thesemeta-attributes can be provided as a data service DaaSprovides biomedical data It is divided into protected andpublic segments

Cloud approach provides data accumulation and higheravailability of data to interested parties (with rights of accessimplied) Interested parties can be found among not onlymedical employees and researchers but also community usingdifferent software tools to access data using DaaS

Central part of the circle (Figure 6) contains componentsof a big data engine (HDFS Hive and Mahout) where allthe data are centralized (medical data metadata and algo-rithms performances) Additionally big data engine providesinterfaces for data manipulation and analysis Apache Hive[49] is data warehouse software built on top of Hadoopused for querying and managing large datasets residing indistributed storage Apache Mahout [50] is also built ontop of Hadoop and is used as a scalable machine learninglibrary for classification clustering recommender systemsand dimension reduction

Our solution provides software for data analysis in aservice form (SaaS) with respect of SOA dependability [51]Additionally third party software can easily become anintegrated part of SaaS (RapidMiner [52] R [53] or others)

These software solutions are recommended because theyhave a direct interface for access and analysis of big data(eg Radoop [54] allows using visual RapidMiner interfaceand has operators that run distributed algorithms based onHadoop Hive [49] and Mahout [50] without writing anycode)

Most medical institutions require not only computingresources such as CPUs but also communication infras-tructure these components are offered as a part of IaaSin a form of virtualized resources Higher level of servicerests on using platform such as variety of operating systemsBut platform can also provide tools for development ofalgorithms and applications (Eclipse Netbeans RapidMinerR ) The circle is closed by development of new algorithmsand also for algorithm deployment and execution For thisreason algorithms space component is partly situated in PaaSspace

5 Conclusion and Future Research

In this paper we proposed a cloud based architecture forstoring analysis and predictive modeling of biomedical bigdata Existing service based cloud architecture is extended byincluding metalearning system as a data and model drivenknowledge service As a part of the proposed architecturewe provided a support for community based gathering ofdata and algorithms that is an important precondition forquality of metalearning Advancement of this research areaand adding new value are enabled through platform for devel-opment and execution of distributed data mining processes

8 The Scientific World Journal

Big data engine

Meta modelsmodels

Experimentalresults

Algorit

hms s

pace M

edical data

Publicdata

Protected

data

Virtual machines

Virtual network

Ope

ratin

g sy

stem

s(li

nux

Mac

OS Reporting

sofware(JasperReports Eclipse

PaaS

IaaS

KaaS

Daas

SaaS

BIRT OpenOffice

Win)

)

Figure 6 Cloud based system for predictive modeling of biomedical data

and algorithms Finally we provided data and model drivendecision support on selecting best algorithms for workingwith biomedical data

Retrospectively proposed solution focuses on a specifictype of biomedical data while other types still remain tobe included and evaluated Data security and privacy stillremains a concern to be taken into amore serious account Inorder to provide even further impact in research communityadditional work is necessary on providing interoperabilityamong potential open source components

System was tested on microarray gene expression datawith specific meta-attributes for this data type (eg chiptype) Further efforts will be made to include other typesof biomedical data This will be done by identifying spe-cific meta-attributes that fit newly included data typesAdditionally as a further work integration with OpenML[55] platform used for storing and gathering datasets and

clustering algorithm runs is plannedThis platform providesa base for a community to share experiments algorithms anddata Significant clustering algorithm meta-attributes can beextracted and used for updating our metalearning system

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research is partially funded by Grants from the SerbianMinistry of Science and Technological Development Con-tract nos III 41008 and TR 32013

The Scientific World Journal 9

References

[1] K Kincade ldquoData mining digging for healthcare goldrdquo Insur-ance and Technology vol 23 no 2 pp IM2ndashIM7 1998

[2] H C Koh andG Tan ldquoDatamining applications in healthcarerdquoJournal ofHealthcare InformationManagement vol 19 no 2 pp64ndash72 2011

[3] K Srinivas B K Rani and A Govrdhan ldquoApplications ofdata mining techniques in healthcare and prediction of heartattacksrdquo International Journal on Computer Science and Engi-neering vol 2 no 2 pp 250ndash255 2010

[4] P Maji ldquoMutual information-based supervised attribute clus-tering for microarray sample classificationrdquo IEEE Transactionson Knowledge and Data Engineering vol 24 no 1 pp 127ndash1402012

[5] R Das ldquoA comparison of multiple classification methods fordiagnosis of Parkinson diseaserdquo Expert Systems with Applica-tions vol 37 no 2 pp 1568ndash1572 2010

[6] A Thomas N H Patterson M M Marcinkiewicz A LazarisP Metrakos and P Chaurand ldquoHistology-driven data miningof lipid signatures from multiple imaging mass spectrometryanalyses application to human colorectal cancer liver metasta-sis biopsiesrdquoAnalytical Chemistry vol 85 no 5 pp 2860ndash28662013

[7] R L Grossman and K P White ldquoA vision for a biomedicalcloudrdquo Journal of Internal Medicine vol 271 no 2 pp 122ndash1302012

[8] M Vukicevic B Delibasic Z Obradovic M Jovanovic andMSuknovic ldquoA method for design of data-tailored partitioningalgorithms for optimizing the number of clusters in microarrayanalysisrdquo in Proceedings of the IEEE Symposium on Computa-tional Intelligence in Bioinformatics and Computational Biology(CIBCB rsquo12) pp 252ndash259 IEEE San Diego Calif USA 2012

[9] H Chae I Jung H Lee S Marru S W Lee and S Kim ldquoBioand health informatics meets cloud BioVLab as an examplerdquoHealth Information Science and Systems vol 1 no 1 article 62013

[10] J Sun and C K Reddy ldquoBig data analytics for healthcarerdquo inProceedings of the 19th ACM SIGKDD international conferenceon Knowledge discovery and data mining pp 1525ndash1525 ACMAugust 2013

[11] M Matijas J A Suykens and S Krajcar ldquoLoad forecastingusing a multivariate meta-learning systemrdquo Expert SystemsWith Applications vol 40 no 11 pp 4427ndash4437 2013

[12] S Zhou K K Lai and J Yen ldquoA dynamic meta-learning rate-based model for gold market forecastingrdquo Expert Systems withApplications vol 39 no 6 pp 6168ndash6173 2012

[13] J Kanda C Soares E Hruschka and A de Carvalho ldquoA meta-learning approach to select meta-heuristics for the travelingsalesman problem using MLP-based label rankingrdquo in NeuralInformation Processing pp 488ndash495 Springer Berlin Germany2012

[14] A Attig and P Perner ldquoMeta-learning for image processingbased on case-based reasoningrdquo in Computational Intelligencein Healthcare vol 4 pp 229ndash264 Springer Berlin Germany2010

[15] AVera-Baquero R Colomo-Palacios andOMolloy ldquoBusinessprocess analytics using a big data approachrdquo IT Professional vol15 no 6 pp 29ndash35 2013

[16] L Dai X Gao Y Guo J Xiao and Z Zhang ldquoBioinformaticsclouds for big data manipulationrdquo Biology Direct vol 7 no 1article 43 2012

[17] I K Lai S K Tam and M F Chan ldquoKnowledge cloud systemfor network collaboration a case study in medical serviceindustry in Chinardquo Expert Systems with Applications vol 39 no15 pp 12205ndash12212 2012

[18] S Sonnenburg M L Braun S O Cheng et al ldquoThe need foropen source software in machine learningrdquo Journal of MachineLearning Research vol 8 pp 2443ndash2466 2007

[19] S Marston Z Li S Bandyopadhyay J Zhang and A GhalsasildquoCloud computing the business perspectiverdquo Decision SupportSystems vol 51 no 1 pp 176ndash189 2011

[20] N Sultan ldquoCloud computing for education a new dawnrdquoInternational Journal of Information Management vol 30 no2 pp 109ndash116 2010

[21] W Zhang and Q Chen ldquoFrom E-government to C-governmentvia cloud computingrdquo in Proceedings of the 1st InternationalConference on E-Business and E-Government (ICEE rsquo10) pp679ndash682 IEEE May 2010

[22] A Fernandez S del Rıo F Herrera and J M Benıtez ldquoAnoverview on the structure and applications for business intel-ligence and data mining in cloud computingrdquo in Proceedingsof the 7th International Conference on Knowledge Managementin Organizations Service and Cloud Computing pp 559ndash570Springer Berlin Germany January 2013

[23] S P Ahuja S Mani and J Zambrano ldquoA survey of the state ofcloud computing in healthcarerdquo Network and CommunicationTechnologies vol 1 no 2 pp 12ndash19 2012

[24] W Li J Yan Y Yan and J Zhang ldquoXbase cloud-enabledinformation appliance for healthcarerdquo in Proceedings of the 13thInternational Conference on Extending Database TechnologyAdvances in Database Technology (EDBT rsquo10) pp 675ndash680March 2010

[25] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[26] T Oinn M Addis J Ferris et al ldquoTaverna a tool forthe composition and enactment of bioinformatics workflowsrdquoBioinformatics vol 20 no 17 pp 3045ndash3054 2004

[27] BGiardine C Riemer R CHardison et al ldquoGalaxy a platformfor interactive large-scale genome analysisrdquo Genome Researchvol 15 no 10 pp 1451ndash1455 2005

[28] M C Schatz ldquoCloudBurst highly sensitive read mapping withMapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009

[29] B D OrsquoConnor B Merriman and S F Nelson ldquoSeqWare queryengine storing and searching sequence data in the cloudrdquo BMCBioinformatics vol 11 supplement 12 article S2 2010

[30] D Hong A Rhie S Park et al ldquoFX an RNA-seq analysis toolon the cloudrdquo Bioinformatics vol 28 no 5 pp 721ndash723 2012

[31] S V Angiuoli M Matalka A Gussman et al ldquoCloVR a virtualmachine for automated and portable sequence analysis from thedesktop using cloud computingrdquo BMC Bioinformatics vol 12no 1 article 356 2011

[32] R Zhang and L Liu ldquoSecurity models and requirements forhealthcare application cloudsrdquo in Proceedings of the 3rd IEEEInternational Conference on Cloud Computing (CLOUD rsquo10) pp268ndash275 IEEE July 2010

[33] R Wooten R Klink F Sinek Y Bai and M Sharma ldquoDesignand implementation of a secure healthcare social cloud systemrdquoin Proceedings of the 12th IEEEACM International Symposiumon Cluster Cloud and Grid Computing (CCGrid rsquo12) pp 805ndash810 IEEE May 2012

10 The Scientific World Journal

[34] W Liu and E K Park ldquoE-healthcare cloud computing appli-cation solutions cloud-enabling characteristics challenges andadaptationsrdquo in Proceedings of the International Conference onComputing Networking and Communications (ICNC rsquo13) pp437ndash443 IEEE January 2013

[35] T Giles and R Wilcox ldquoIBMWatson and medical records textanalyticsrdquo in Proceedings of the 2011 Healthcare Information andManagement Systems Society Meeting 2011

[36] A Abbasi C Albrecht A Vance and J Hansen ldquoMetafrauda meta-learning framework for detecting financial fraudrdquo MISQuarterly vol 36 no 4 pp 1293ndash1327 2012

[37] C Lemke and B Gabrys ldquoMeta-learning for time series fore-casting and forecast combinationrdquo Neurocomputing vol 73 no10-12 pp 2006ndash2016 2010

[38] N Iam-On T Boongoen and S Garrett ldquoLCE a link-basedcluster ensemble method for improved gene expression dataanalysisrdquo Bioinformatics vol 26 no 12 pp 1513ndash1519 2010

[39] K A Smith-Miles ldquoTowards insightful algorithm selection foroptimisation using meta-learning conceptsrdquo in Proceedings ofthe (IEEE World Congress on Computational Intelligence) IEEEInternational Joint Conference on Neural Networks (IJCNN rsquo08)pp 4118ndash4124 IEEE Hong Kong June 2008

[40] K A Smith-Miles ldquoCross-disciplinary perspectives on meta-learning for algorithm selectionrdquo ACMComputing Surveys vol41 no 1 article 6 2008

[41] B Delibasic K Kirchner J Ruhland M Jovanovic and MVukicevic ldquoReusable components for partitioning clusteringalgorithmsrdquo Artificial Intelligence Review vol 32 no 1ndash4 pp59ndash75 2009

[42] B Delibasic M Vukicevic M Jovanovic K Kirchner J Ruh-land and M Suknovic ldquoAn architecture for component-baseddesign of representative-based clustering algorithmsrdquoData andKnowledge Engineering vol 75 pp 78ndash98 2012

[43] M Suknovic B Delibasic M Jovanovic M Vukicevic DBecejski-Vujaklija and Z Obradovic ldquoReusable components indecision tree induction algorithmsrdquo Computational Statisticsvol 27 no 1 pp 127ndash148 2012

[44] M Vukicevic B Delibasic M Jovanovic M Suknovic and ZObradovic ldquoInternal evaluation measures as proxies for exter-nal indices in clustering gene expression datardquo in Proceedingsof the IEEE International Conference on Bioinformatics andBiomedicine (BIBM rsquo11) pp 574ndash577 IEEE Computer Society2011

[45] M Vukicevic K Kirchner B Delibasic M Jovanovic J Ruh-land and M Suknovic ldquoFinding best algorithmic componentsfor clustering microarray datardquo Knowledge and InformationSystems vol 35 no 1 pp 111ndash130 2013

[46] A Nascimiento R Prudencio M de Souto and I CostaldquoMining rules for the automatic selection process of clusteringmethods applied to cancer gene expression datardquo in Proceedingsof the 19th International Conference on Artificial Neural Net-works Part II pp 20ndash29 Springer Limassol Cyprus 2009

[47] N X Vinh J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison variants properties nor-malization and correction for chancerdquo Journal of MachineLearning Research vol 11 pp 2837ndash2854 2010

[48] S Radovanovic M Vukicevic M Jovanovic B Delibasic andM Suknovic ldquoMeta-learning system for clustering gene expres-sion microarray datardquo in Proceedings of the 4th RapidMinerCommunity Meeting and Conference (RCOMM rsquo13) pp 97ndash111Shaker Porto Portugal 2013

[49] A Thusoo J S Sarma N Jain Z Shao P Chakka and SAnthony ldquoHive a warehousing solution over a map-reduceframeworkrdquo Proceedings of the VLDB Endowment vol 2 no 2pp 1626ndash1629 2009

[50] R Anil T Dunning and E Friedman Mahout in ActionManning 2011

[51] V Stantchev andMMalek ldquoAddressing dependability through-out the SOA life cyclerdquo IEEE Transactions on Services Comput-ing vol 4 no 2 pp 85ndash95 2011

[52] I Mierswa M Wurst R Klinkenberg M Scholz and T EulerldquoYALE rapid prototyping for complex data mining tasksrdquo inProceedings of the 12th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo06) pp 935ndash940 ACM August 2006

[53] Development Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2005

[54] Z Prekopcsak G Makrai T Henk and C Gaspar-PapanekldquoRadoop analyzing big data with rapidminer and hadooprdquo inProceedings of the 2nd RapidMiner Community Meeting andConference (RCOMM rsquo11) June 2011

[55] J N van Rijn B Bischl L Torgo B Gao V Umaashankarand S Fischer ldquoOpenML a collaborative science platformrdquo inMachine Learning and Knowledge Discovery in Databases pp645ndash649 Springer Berlin Germany 2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 3: Research Article Cloud Based Metalearning System for Predictive …downloads.hindawi.com/journals/tswj/2014/859279.pdf · 2019. 7. 31. · paradigm in order to satisfy global demands

The Scientific World Journal 3

various tools for genome analysis It is important to noticethat Galaxy is integrated with biomedical databases such asUCSC table data BioMart Central andmodENCODE server

Hadoop and MapReduce distributed computingparadigm has been implemented in CloudBurst [28] Itis used for mapping short reads to reference genomes in aparallel fashion in cloud environment Essentially it providesa parallel read-mapping algorithm optimized for mappingsequence data to the human genome and other referencegenomes intended for use in a biological analysis includingSNP discovery genotyping and personal genomics

Another large biological extensibleworkbench is SeqWire[29] Users are allowed to write and share pipelinemodules Itprovides massive parallel processing using sequencing tech-nologies such as ABI SOLID and Illumina web applicationpipeline for processing and annotating sequenced data queryengine and a MetaDB

Web based cloud system such as FX [30] provides highusability for users that are not familiar with programmingtechniques It is developed for various biomedical data anal-yses such as estimating gene expression level and genomicvariant calling from the RNA sequence using transcriptome-based references User uploads data and configures dataanalysis settings on Amazon Web Service (AWS) Since thisapplication is domain specific it does not require manualarrangement of pipelines

Critical cloud services in biomedicine are infrastructureservices Therefore CloVR [31] offers a virtual operatingsystem with preinstalled packages and libraries required forbiomedical data analysis such as large-scale BLAST searcheswhole-genome assembly gene finding and RNA sequenceanalysis It is implemented as an online application but doesnot provide GUI Instead command-line based automatedanalysis pipelines with preconfigured software packages forcomposing workflows are implemented

Chae et al [9] focus on two emerging problems inbioinformatics data analysis Those are computation powerand big data analysis for the biomedical data Biomedicalanalysis requires very big computing power with huge storagespaceThey proposed BioVLab as an affordable infrastructureon the cloud with a graphical workflow creator which pro-vides an efficient way to deal with these problems BioVLabconsists of three layers The first layer is a graphical workflowengine called XBaya which enables the composition andmanagement of scientific workflows on a desktopThe secondlayer gateway is a web-based analysis tool for the integratedanalysis of microRNA and mRNA expression data Analysisis done on Amazon S3 Interface which presents thirdlayer of architecture Data and commands from gateway aretransferred to cloud which analyze data and return results touser on desktopThey emphasized that analysis of bigmedicaldata requires use of appropriate tools and databases from avast number of tools and databases therefore using cloudwould not solve problems of computational power and bigdata analysis

Cloud healthcare application could have great impacton society but security privacy and government regulationissues limit its usage Zhang and Liu [32] defined securitymodel for cloud healthcare system First part of the model is

secure collection of data Data collection module is createdand maintained independently by care delivery organiza-tions Second part is secure storage Storage system must beencrypted and it can allow only authorized access Third partis secure usage which consists of medicine staff signatureand verification sections Their model deals with problemof information ownership authenticity and authenticationnonrepudiation patient consent and authorization integrityand confidentiality of data and availability of system

Since security is identified as a major challenge in cloudsystemsWooten et al [33] designed and implemented securehealthcare cloud system This was achieved with a trust-aware role-based access control and a tag system System wasimplemented on top of Amazon Web Services (AWS) ElasticCompute Cloud (EC2) and Linux Apache MySQL and PHP(LAMP) solution stack

Liu and Park [34] focused on challenges and adaptationof e-healthcare cloud systems This system extends the cloudparadigm in order to satisfy global demands in digitalhealthcare applications Therefore technology healthcareprocess and service are identified as the main characteristicsof healthcare cloud systems Similarly new challenges aroseby the unique requirements of the e-healthcare industry forusing cloud services for regulation security issues accessintercloud connectivity and resource distribution

IBM Watson is also used in healthcare as cloud serviceGiles and Wilcox [35] used Watson ability to use naturallanguage processing and combine it with content analysisin order to help medical staff in diagnostic analysis Thisapplication of Watson identifies diseases symptoms rightmedications and modifiers directly from medical recordsfrom different medical facilities stored on cloud

Knowledge cloud based systems in medicine as Lai et al[17] stated are one of the major governmentrsquos strategic plansto drive the healthcare services which are identified as publicconcerns in China They highlighted some successful criteriafor establishment of a private knowledge network for businessnetwork collaboration and the knowledge cloud system forradiotherapy dynamic treatment service in China such asinnovation outsourcing marketing opportunity economy ofscales leverage existing resources and service on demandThree parties are identified in the KaaS service model Thefirst party is the knowledge user (patients hospitals anddoctors) the one who pays for the knowledge service ondemand The second party is the knowledge expert (externalconsultants) the one who provides the knowledge service ondemand The third party is the knowledge agent who linkstogether the knowledge user and the knowledge expert ondemand

Great potential of cloud services in the area ofbiomedicine is identified by Grossman and White [7]who made a vision of biomedical cloud in the future Sinceamount of data which hospitals and medical institutionsare dealing with is growing rapidly big data technologieswill have indispensable role in data analysis Consequentlymanaging and processing data will fundamentally changeand new data mining and machine learning algorithmswill be developed to deal with these changes Explosion ofdata is expected to be in genomic proteomic and other

4 The Scientific World Journal

ldquoomicrdquo data and molecular and system biology Probesthat collect data from tissue are more complex and allowsimultaneous tracking and collect more data than few yearsago Authors have also considered several issues such assecurity scalability of storage scalability of analysis peerwith other private clouds and peer with public clouds

Metalearning presents powerful methodology whichenables learning on its past knowledge of solving differenttasks This methodology is used in fraud detection [36] timeseries forecasting [37] load forecasting [11] and others

3 Metalearning Framework forClustering Biomedical Data

Exponential growth of the data and rapid development oflarge number of complex and computationally intensive datamining algorithms led to one of the major problems inmodern data mining selection of the best algorithm for dataat hand [38]Namely analyst often does not have enough timeor resources for creating models and evaluating themwith allavailable algorithms

One of the most promising approaches for dealingwith this problem is metalearning [39 40] Metalearning ismethodology which solves different data mining tasks basedon past knowledgeThemain idea is to store history of exper-imental results with descriptions (meta-attributes) of datasets(eg dataset characteristics algorithm and classificationaccuracies for classification problems) and based on thisto create metamodel (classification or regression) that willpredict the performances of each algorithm on new datasetCreating of suchmetamodel (with good performance) wouldreduce the need for brute force evaluation (evaluation ofevery algorithm)

Metalearning system is built on set of algorithm orcombination of algorithms (ensembles) Therefore everyalgorithm or combination of algorithms (ensembles) is sim-plerTheoreticallymetalearning system can be infinitely largeby puttingmetalearning as component of other metalearningsystems An advantage of its using is that it can address newtypes of tasks that have not been seen but are similar toalready defined problems

Metalearning by Smith-Miles [39 40] is definedwith thefollowing aspects

(i) the problem space 119875 which represents set ofinstances (datasets) of a given problem class

(ii) the meta-attribute space119872 which contains charac-teristics that describe existing problems (eg numberof attributes entropy normality etc)

(iii) the algorithm space 119860 which represents the set ofcandidate algorithms which can be used to solve theproblems defined in problem space 119875

(iv) a performance metric 119884 which represents measuresof performance of an algorithm on a problem (egclassification accuracy (for classification problems) orroot mean square error (for regression problems))

General procedure for metalearning is done in severalsteps first datasets from problem space are evaluated by

algorithms fromalgorithm space Furthermetafeatures of thedatasets are related to algorithm performance forming thedatabase of metaexamples Then regression or classificationmodels are created (and evaluated by performance metric)Finally when new problem (dataset) arrives meta-attributesare extracted and performance prediction is made In thisway analyst does not have to evaluate each algorithm on eachdataset but only ones with the best predicted performanceof the problem and algorithm spaces While technologies fordata collection enabled cheap and fast accumulation of dataand extension of problem space development and collectionof data mining algorithms is a more difficult problem sincedevelopment of new algorithms demands a lot of time andeffort and also different algorithms are implemented ondifferent platforms

The most important issue for good performance of met-alearning systems is the size of problem and algorithm spacebecause the accuracy of metamodels is directly dependant onthese spaces [39 40]Thismeans that cloud based system andservice oriented architecture should be natural environmentfor this kind of systems because it would enable communitybased extension models to be created and evaluated byperformance metric in order to capture relations betweenmeta-attributes and algorithm performance

With metalearning approach in solving problems timefor choosing appropriate algorithm for problem is greatlyreduced but requires time for creating and updating meta-models especially if data and algorithms are gathered fromcommunity This is one of the main motivations for integra-tion of such a system in cloud based environment and inte-gration with big data technologies (like Hadoop Hive andMahout) for storing and aggregation of data and predictivemodeling

31 Component Based Metalearning System for BiomedicalData One of the promising approaches for tackling theseproblems (existence of large algorithm space and existenceof efficient procedure for selection of the best algorithm)is component based data mining algorithm design [41ndash43]This approach divides algorithms with similar structure (inthis case representative based algorithms) into parts withthe same functionality called subproblems Every subproblemhas standardized IO structure and can be solved with oneor more reusable components (RCs) presented in Table 1This approach combination of RCs which originates fromdifferent algorithms can be used to design large number(thousands) of new ldquohybridrdquo algorithms This approach gavevery promising results in the area of clustering biomedical(gene expression) data [8 44 45]

Combining RCs is used for reproducing or creationof cluster algorithms For example K-means algorithmscan be reconstructed as RANDOM-EUCLIDEAN-MEAN-COMPACT However a new hybrid algorithm can be con-structed using DIANA-CORREL-MEDIAN-CONN whereDIANA is used to initialize representatives CORREL tomeasure distance MEDIAN to update representatives andCONN to evaluate clusters

The Scientific World Journal 5

Table 1 Sub-problems and RCs for generic clustering algorithmdesign

Sub-problem Reusable components

Initialize representatives DIANA RANDOM XMEANSGMEANS PCA KMEANS++ SPSS

Measure distance EUCLIDEAN CITY CORRELCOSINE

Update representatives MEAN MEDIAN ONLINE

Evaluate clusters AIC BIC SILHOU COMPACT XBCONN

Extendedmetalearning system shown in Figure 2 is usedin this research Problem space is presented in upper leftcloud Every problem (dataset) from problem space 119875 has itstask (clustering classification regression etc) denounced 119909Based on problem function 119891 extracts meta-attributes Forselected problem based onmeta-attributes function 119878 selectsalgorithm from algorithm space 119860 Every algorithm is con-structed from reusable components (RCs) from which addi-tional meta-attributes were derived (algorithm descriptions)Also as a result of clustering algorithm on specific datasetinternal evaluationmeasures (additional meta-attributes) arecalculated and saved as meta-attributes Central cloud is themost important part of metalearning system It is responsiblefor ranking and selection of algorithms Inputs in this cloudare task 119909 algorithm 119886 and performance metric 120587 Meta-attributes created for each task 119909 are input for ranking andselection as they are a basis for learning on metalevel Formost tasks performance ofmeta-learning system is calculatedearlier as output label and it is available on metalevel

32 Initial Evaluation of Component Based MetalearningSystem for Clustering Biomedical Data In this section we willdescribe the data and the procedure for initial evaluation ofthe proposed system 30 datasets gathered from original met-alearning system [46] were used (httpbioinformaticsrut-gerseduStaticSupplementsCompCancerdatasetshtm)

For the construction of themetaexamples a set of 13meta-attributes for dataset description proposed by Nascimientoet al [46] are used (detailed description of datasets andmetafeatures can be found in Nascimiento et al [46])

Additionally meta-attribute space is extended withdescriptions of algorithms (four components of algorithmand normalization type described in Table 1) and internalcluster evaluation measures including compactness globalsilhouette indexAIC BICXB-index and connectivityThesethree types of meta-attributes (dataset descriptions reusablecomponents and internal evaluation measures) form thespace of 24 meta-attributes

Component based clustering algorithms were used todefine algorithm space For that purpose 504 RC-based clus-ter algorithms were designed for experimental evaluationThese algorithms were built by combining already describedRCs (Table 1) with 4 different normalization techniqueswhich lead to total of 2016 clustering experiments

Table 2 Meta-algorithm performance

AlgorithmError RMSE MAERBFN 0143 0109 (plusmn0092)LR 0111 0086 (plusmn0070)LMSR 0265 0094 (plusmn0248)NN 0101 0064 (plusmn0078)SVM 0050 0034 (plusmn0036)

For validation of clusteringmodels AMI (adjustedmutualinformation) index was used since it is recently recom-mended as a ldquogeneral purposerdquo measure for clustering valida-tion comparison and algorithm design [47] after exhaustivecomparison between a number of information theoreticand pair counting measures Even more this measure isthoroughly evaluated on gene expression microarray data

After validation of component based clustering algo-rithms on 30 datasets 55326 valid results were gatheredwhich represent metaexample repository Next step wasidentification of the best algorithm for ranking and selectionof algorithms for clustering gene expression microarray data

A procedure for ranking and selection of the best clus-tering algorithms is based on regression algorithms calledmeta-algorithms which predict (regression task) AMI valuesbased on a dataset metafeatures algorithm components andinternal evaluation measures

In this research five meta-algorithms were used Thoseare radial basis function network (RBFN) linear regression(LR) leastmedian square regression (LMSR) neural network(NN) and support vector machine (SVM)

Estimate of quality of regression algorithms is done usingmean absolute error (MAE) and root mean squared error(RMSE) Validation of results is done using 70 of datasetfor training the model and the remaining 30 for testing

Performance of each algorithm in terms of MAE andRMSE and best values are shown in bold (Table 2)

Although all five algorithms showed good results SVMas in Table 2 gave the best performance and this modelshould be used for prediction of algorithm performances fornew datasets RMSE of 005 and MAE of 0034 with thesmallest variance (numbers in brackets) indicate that thismetamodel is applicable to the new problems since AMImeasure takes values from 0 to 1 where 1 is the best Notethat with an extension of algorithm space and problem spacethese results could be changed and so continuous evaluationof available meta-algorithms should be done Because of thisit is important to have adequate computing infrastructurefor processing big data and updating the models Process forcreating and updating the models is presented in Figure 3Creation of model contains several steps First microarraymetaexamples are loaded from which only important vari-ables are selected After that data preparation phase wasconducted where only those attributes that are importantfor model building were selected label attribute was set onAMI attribute missing values were replaced with averagevalue and nominal values were transformed to numericalvalues using dummy coding Modeling phase is conducted

6 The Scientific World Journal

Gene expressionclustring

Cell typeclassifcation

Ranking and selection

Performance 120587

AlgorithmA

Reusablecomponents

(RCs)

PCADIANA

Initializerepresentatives

CITYCORREL

RMSEAbsolute error Accuracy

Precision

AICBIC

XB

MEANMEDIAN

ONLINE

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

LgE

BCPMV

NRE

K-NN

Algorithm space (A)

Evaluate cluster Updaterepresentatives

Measure distance

Algorithm

fe PCA-CITY-MEAN-XBltgt a 120587(a x) gt 120587( x)

a = combination of RC

Performance space (Y)

Task x

Feature extraction (f)

Meta-attribute space (M)Problem space (P)

selection (S)

A(x) = S(f(x))

a998400 a998400

Figure 2 Extended metalearning system for clustering biomedical data

Figure 3 Main process for finding the best model for prediction ofAMI

Figure 4 Stream for application of metalearning system on newcases

using 10-fold cross validation where the above-mentionedfive algorithms were used Every trained model is saved onhard disk which allows its reusability

Automatic application of the selected (in this case SVM)model is presented in Figure 4 After every update of themodel or after inserting new dataset this process needs to beupdated Savedmodel in this case SVM is loaded and appliedon new dataset Results gathered are sorted and exportedDetailed information can be found in Radovanovic et al [48]

4 Extended Cloud Based Model forBig Data Analysis

Dai et al [16] addressed the problems of storing and analysisof biomedical data by proposing a cloud based model foranalysis of biomedical data and it is composed of four servicecategories

(i) data as a service (DaaS)(ii) software as a service (SaaS)(iii) platform as a service (PaaS)(iv) infrastructure as a service (IaaS)

DaaS is group of cloud services which enables on demanddata access and provides up-to-date data that are accessibleby a wide range of devices that are connected over theweb In case of bioinformatics AmazonWeb Services (AWS)provides repository of public archives (data sets) includingGenBank Unigene Ensembl and Influenza Virus which canbe accessed from cloud based applications

Since bioinformatics requires a large diversity of softwaretools for data analyses the task of SaaS in bioinformatics is todeliver and enable remote access to software services onlineThus installation of software tools on desktop computer isno longer required Another one advantage of using SaaS isenablingmuch easier collaboration betweendispersed groupsof users

Platform as a service (PaaS) should offer a programmableenvironment for users in order to develop test and deploycloud applications Computer resources scale automatically

The Scientific World Journal 7

Daas

SaaS

PaaS

IaaS

KaaS

∙ Public datasets∙

∙ Tools∙ Pipelines

∙ Analysis platforms∙ Programming environments

∙ Virtual machines∙ Virtualized resources

∙ Domain experts∙ Knowledge networks

Biological datasets

Figure 5 Cloud based system [16] integrated with KaaS extension[17] for analysis of biomedical data

without user interference In most PaaS services besidesprogramming environment database and web server areavailable

Since most medical institutions do not have computingresources such as CPU IaaS offers a full computer infras-tructure delivering virtualized resources Those virtualizedresources can be operating systems RAM CPU or othercomputer resource

Lai et al [17] introduced a new service model in cloudcomputingmdashthe knowledge as a service (KaaS) that facilitatesthe interoperations amongmembers in a knowledge networkExtended model is depicted on Figure 5

This new service model relies on data created in acollaboration process of domain experts It was recognizedas a new form of cloud service and categorized as KaaS Ina nutshell such approach is data driven and lacks higherorder structure in order to provide knowledge as a serviceThis implies that KaaS in this form is provided by humandomain experts We extend this class of services with dataand model approach By combining data driven models andexpert knowledge that is stored in unstructured data likedocuments notes and collaborations knowledge is createdand offered to cloud users Specifically we extend models of[16 17] by including big data technologies platforms for datamining andmetamodels for ranking and selection of the bestalgorithms for biomedical data mining System componentsare displayed on a diagram (Figure 6) and classified accordingto service type directed towards end user Big data engineprovides data storage and access and it is a basis of each classof services (diagram center)

Key component of KaaS is metalearning algorithms andselected algorithms for clustering biomedical data Bothare represented by algorithm space component and bothare accompanied by their describing metamodels Thesemetamodels are used in runtime for ranking and selection

of best algorithms for the new problem (dataset) Experi-mental results provide knowledge on performances of eachalgorithm with meta-attributes on each dataset Data flowsare kept using big data engine such as Hadoop Segmentof experimental results component lies on DaaS since thesemeta-attributes can be provided as a data service DaaSprovides biomedical data It is divided into protected andpublic segments

Cloud approach provides data accumulation and higheravailability of data to interested parties (with rights of accessimplied) Interested parties can be found among not onlymedical employees and researchers but also community usingdifferent software tools to access data using DaaS

Central part of the circle (Figure 6) contains componentsof a big data engine (HDFS Hive and Mahout) where allthe data are centralized (medical data metadata and algo-rithms performances) Additionally big data engine providesinterfaces for data manipulation and analysis Apache Hive[49] is data warehouse software built on top of Hadoopused for querying and managing large datasets residing indistributed storage Apache Mahout [50] is also built ontop of Hadoop and is used as a scalable machine learninglibrary for classification clustering recommender systemsand dimension reduction

Our solution provides software for data analysis in aservice form (SaaS) with respect of SOA dependability [51]Additionally third party software can easily become anintegrated part of SaaS (RapidMiner [52] R [53] or others)

These software solutions are recommended because theyhave a direct interface for access and analysis of big data(eg Radoop [54] allows using visual RapidMiner interfaceand has operators that run distributed algorithms based onHadoop Hive [49] and Mahout [50] without writing anycode)

Most medical institutions require not only computingresources such as CPUs but also communication infras-tructure these components are offered as a part of IaaSin a form of virtualized resources Higher level of servicerests on using platform such as variety of operating systemsBut platform can also provide tools for development ofalgorithms and applications (Eclipse Netbeans RapidMinerR ) The circle is closed by development of new algorithmsand also for algorithm deployment and execution For thisreason algorithms space component is partly situated in PaaSspace

5 Conclusion and Future Research

In this paper we proposed a cloud based architecture forstoring analysis and predictive modeling of biomedical bigdata Existing service based cloud architecture is extended byincluding metalearning system as a data and model drivenknowledge service As a part of the proposed architecturewe provided a support for community based gathering ofdata and algorithms that is an important precondition forquality of metalearning Advancement of this research areaand adding new value are enabled through platform for devel-opment and execution of distributed data mining processes

8 The Scientific World Journal

Big data engine

Meta modelsmodels

Experimentalresults

Algorit

hms s

pace M

edical data

Publicdata

Protected

data

Virtual machines

Virtual network

Ope

ratin

g sy

stem

s(li

nux

Mac

OS Reporting

sofware(JasperReports Eclipse

PaaS

IaaS

KaaS

Daas

SaaS

BIRT OpenOffice

Win)

)

Figure 6 Cloud based system for predictive modeling of biomedical data

and algorithms Finally we provided data and model drivendecision support on selecting best algorithms for workingwith biomedical data

Retrospectively proposed solution focuses on a specifictype of biomedical data while other types still remain tobe included and evaluated Data security and privacy stillremains a concern to be taken into amore serious account Inorder to provide even further impact in research communityadditional work is necessary on providing interoperabilityamong potential open source components

System was tested on microarray gene expression datawith specific meta-attributes for this data type (eg chiptype) Further efforts will be made to include other typesof biomedical data This will be done by identifying spe-cific meta-attributes that fit newly included data typesAdditionally as a further work integration with OpenML[55] platform used for storing and gathering datasets and

clustering algorithm runs is plannedThis platform providesa base for a community to share experiments algorithms anddata Significant clustering algorithm meta-attributes can beextracted and used for updating our metalearning system

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research is partially funded by Grants from the SerbianMinistry of Science and Technological Development Con-tract nos III 41008 and TR 32013

The Scientific World Journal 9

References

[1] K Kincade ldquoData mining digging for healthcare goldrdquo Insur-ance and Technology vol 23 no 2 pp IM2ndashIM7 1998

[2] H C Koh andG Tan ldquoDatamining applications in healthcarerdquoJournal ofHealthcare InformationManagement vol 19 no 2 pp64ndash72 2011

[3] K Srinivas B K Rani and A Govrdhan ldquoApplications ofdata mining techniques in healthcare and prediction of heartattacksrdquo International Journal on Computer Science and Engi-neering vol 2 no 2 pp 250ndash255 2010

[4] P Maji ldquoMutual information-based supervised attribute clus-tering for microarray sample classificationrdquo IEEE Transactionson Knowledge and Data Engineering vol 24 no 1 pp 127ndash1402012

[5] R Das ldquoA comparison of multiple classification methods fordiagnosis of Parkinson diseaserdquo Expert Systems with Applica-tions vol 37 no 2 pp 1568ndash1572 2010

[6] A Thomas N H Patterson M M Marcinkiewicz A LazarisP Metrakos and P Chaurand ldquoHistology-driven data miningof lipid signatures from multiple imaging mass spectrometryanalyses application to human colorectal cancer liver metasta-sis biopsiesrdquoAnalytical Chemistry vol 85 no 5 pp 2860ndash28662013

[7] R L Grossman and K P White ldquoA vision for a biomedicalcloudrdquo Journal of Internal Medicine vol 271 no 2 pp 122ndash1302012

[8] M Vukicevic B Delibasic Z Obradovic M Jovanovic andMSuknovic ldquoA method for design of data-tailored partitioningalgorithms for optimizing the number of clusters in microarrayanalysisrdquo in Proceedings of the IEEE Symposium on Computa-tional Intelligence in Bioinformatics and Computational Biology(CIBCB rsquo12) pp 252ndash259 IEEE San Diego Calif USA 2012

[9] H Chae I Jung H Lee S Marru S W Lee and S Kim ldquoBioand health informatics meets cloud BioVLab as an examplerdquoHealth Information Science and Systems vol 1 no 1 article 62013

[10] J Sun and C K Reddy ldquoBig data analytics for healthcarerdquo inProceedings of the 19th ACM SIGKDD international conferenceon Knowledge discovery and data mining pp 1525ndash1525 ACMAugust 2013

[11] M Matijas J A Suykens and S Krajcar ldquoLoad forecastingusing a multivariate meta-learning systemrdquo Expert SystemsWith Applications vol 40 no 11 pp 4427ndash4437 2013

[12] S Zhou K K Lai and J Yen ldquoA dynamic meta-learning rate-based model for gold market forecastingrdquo Expert Systems withApplications vol 39 no 6 pp 6168ndash6173 2012

[13] J Kanda C Soares E Hruschka and A de Carvalho ldquoA meta-learning approach to select meta-heuristics for the travelingsalesman problem using MLP-based label rankingrdquo in NeuralInformation Processing pp 488ndash495 Springer Berlin Germany2012

[14] A Attig and P Perner ldquoMeta-learning for image processingbased on case-based reasoningrdquo in Computational Intelligencein Healthcare vol 4 pp 229ndash264 Springer Berlin Germany2010

[15] AVera-Baquero R Colomo-Palacios andOMolloy ldquoBusinessprocess analytics using a big data approachrdquo IT Professional vol15 no 6 pp 29ndash35 2013

[16] L Dai X Gao Y Guo J Xiao and Z Zhang ldquoBioinformaticsclouds for big data manipulationrdquo Biology Direct vol 7 no 1article 43 2012

[17] I K Lai S K Tam and M F Chan ldquoKnowledge cloud systemfor network collaboration a case study in medical serviceindustry in Chinardquo Expert Systems with Applications vol 39 no15 pp 12205ndash12212 2012

[18] S Sonnenburg M L Braun S O Cheng et al ldquoThe need foropen source software in machine learningrdquo Journal of MachineLearning Research vol 8 pp 2443ndash2466 2007

[19] S Marston Z Li S Bandyopadhyay J Zhang and A GhalsasildquoCloud computing the business perspectiverdquo Decision SupportSystems vol 51 no 1 pp 176ndash189 2011

[20] N Sultan ldquoCloud computing for education a new dawnrdquoInternational Journal of Information Management vol 30 no2 pp 109ndash116 2010

[21] W Zhang and Q Chen ldquoFrom E-government to C-governmentvia cloud computingrdquo in Proceedings of the 1st InternationalConference on E-Business and E-Government (ICEE rsquo10) pp679ndash682 IEEE May 2010

[22] A Fernandez S del Rıo F Herrera and J M Benıtez ldquoAnoverview on the structure and applications for business intel-ligence and data mining in cloud computingrdquo in Proceedingsof the 7th International Conference on Knowledge Managementin Organizations Service and Cloud Computing pp 559ndash570Springer Berlin Germany January 2013

[23] S P Ahuja S Mani and J Zambrano ldquoA survey of the state ofcloud computing in healthcarerdquo Network and CommunicationTechnologies vol 1 no 2 pp 12ndash19 2012

[24] W Li J Yan Y Yan and J Zhang ldquoXbase cloud-enabledinformation appliance for healthcarerdquo in Proceedings of the 13thInternational Conference on Extending Database TechnologyAdvances in Database Technology (EDBT rsquo10) pp 675ndash680March 2010

[25] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[26] T Oinn M Addis J Ferris et al ldquoTaverna a tool forthe composition and enactment of bioinformatics workflowsrdquoBioinformatics vol 20 no 17 pp 3045ndash3054 2004

[27] BGiardine C Riemer R CHardison et al ldquoGalaxy a platformfor interactive large-scale genome analysisrdquo Genome Researchvol 15 no 10 pp 1451ndash1455 2005

[28] M C Schatz ldquoCloudBurst highly sensitive read mapping withMapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009

[29] B D OrsquoConnor B Merriman and S F Nelson ldquoSeqWare queryengine storing and searching sequence data in the cloudrdquo BMCBioinformatics vol 11 supplement 12 article S2 2010

[30] D Hong A Rhie S Park et al ldquoFX an RNA-seq analysis toolon the cloudrdquo Bioinformatics vol 28 no 5 pp 721ndash723 2012

[31] S V Angiuoli M Matalka A Gussman et al ldquoCloVR a virtualmachine for automated and portable sequence analysis from thedesktop using cloud computingrdquo BMC Bioinformatics vol 12no 1 article 356 2011

[32] R Zhang and L Liu ldquoSecurity models and requirements forhealthcare application cloudsrdquo in Proceedings of the 3rd IEEEInternational Conference on Cloud Computing (CLOUD rsquo10) pp268ndash275 IEEE July 2010

[33] R Wooten R Klink F Sinek Y Bai and M Sharma ldquoDesignand implementation of a secure healthcare social cloud systemrdquoin Proceedings of the 12th IEEEACM International Symposiumon Cluster Cloud and Grid Computing (CCGrid rsquo12) pp 805ndash810 IEEE May 2012

10 The Scientific World Journal

[34] W Liu and E K Park ldquoE-healthcare cloud computing appli-cation solutions cloud-enabling characteristics challenges andadaptationsrdquo in Proceedings of the International Conference onComputing Networking and Communications (ICNC rsquo13) pp437ndash443 IEEE January 2013

[35] T Giles and R Wilcox ldquoIBMWatson and medical records textanalyticsrdquo in Proceedings of the 2011 Healthcare Information andManagement Systems Society Meeting 2011

[36] A Abbasi C Albrecht A Vance and J Hansen ldquoMetafrauda meta-learning framework for detecting financial fraudrdquo MISQuarterly vol 36 no 4 pp 1293ndash1327 2012

[37] C Lemke and B Gabrys ldquoMeta-learning for time series fore-casting and forecast combinationrdquo Neurocomputing vol 73 no10-12 pp 2006ndash2016 2010

[38] N Iam-On T Boongoen and S Garrett ldquoLCE a link-basedcluster ensemble method for improved gene expression dataanalysisrdquo Bioinformatics vol 26 no 12 pp 1513ndash1519 2010

[39] K A Smith-Miles ldquoTowards insightful algorithm selection foroptimisation using meta-learning conceptsrdquo in Proceedings ofthe (IEEE World Congress on Computational Intelligence) IEEEInternational Joint Conference on Neural Networks (IJCNN rsquo08)pp 4118ndash4124 IEEE Hong Kong June 2008

[40] K A Smith-Miles ldquoCross-disciplinary perspectives on meta-learning for algorithm selectionrdquo ACMComputing Surveys vol41 no 1 article 6 2008

[41] B Delibasic K Kirchner J Ruhland M Jovanovic and MVukicevic ldquoReusable components for partitioning clusteringalgorithmsrdquo Artificial Intelligence Review vol 32 no 1ndash4 pp59ndash75 2009

[42] B Delibasic M Vukicevic M Jovanovic K Kirchner J Ruh-land and M Suknovic ldquoAn architecture for component-baseddesign of representative-based clustering algorithmsrdquoData andKnowledge Engineering vol 75 pp 78ndash98 2012

[43] M Suknovic B Delibasic M Jovanovic M Vukicevic DBecejski-Vujaklija and Z Obradovic ldquoReusable components indecision tree induction algorithmsrdquo Computational Statisticsvol 27 no 1 pp 127ndash148 2012

[44] M Vukicevic B Delibasic M Jovanovic M Suknovic and ZObradovic ldquoInternal evaluation measures as proxies for exter-nal indices in clustering gene expression datardquo in Proceedingsof the IEEE International Conference on Bioinformatics andBiomedicine (BIBM rsquo11) pp 574ndash577 IEEE Computer Society2011

[45] M Vukicevic K Kirchner B Delibasic M Jovanovic J Ruh-land and M Suknovic ldquoFinding best algorithmic componentsfor clustering microarray datardquo Knowledge and InformationSystems vol 35 no 1 pp 111ndash130 2013

[46] A Nascimiento R Prudencio M de Souto and I CostaldquoMining rules for the automatic selection process of clusteringmethods applied to cancer gene expression datardquo in Proceedingsof the 19th International Conference on Artificial Neural Net-works Part II pp 20ndash29 Springer Limassol Cyprus 2009

[47] N X Vinh J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison variants properties nor-malization and correction for chancerdquo Journal of MachineLearning Research vol 11 pp 2837ndash2854 2010

[48] S Radovanovic M Vukicevic M Jovanovic B Delibasic andM Suknovic ldquoMeta-learning system for clustering gene expres-sion microarray datardquo in Proceedings of the 4th RapidMinerCommunity Meeting and Conference (RCOMM rsquo13) pp 97ndash111Shaker Porto Portugal 2013

[49] A Thusoo J S Sarma N Jain Z Shao P Chakka and SAnthony ldquoHive a warehousing solution over a map-reduceframeworkrdquo Proceedings of the VLDB Endowment vol 2 no 2pp 1626ndash1629 2009

[50] R Anil T Dunning and E Friedman Mahout in ActionManning 2011

[51] V Stantchev andMMalek ldquoAddressing dependability through-out the SOA life cyclerdquo IEEE Transactions on Services Comput-ing vol 4 no 2 pp 85ndash95 2011

[52] I Mierswa M Wurst R Klinkenberg M Scholz and T EulerldquoYALE rapid prototyping for complex data mining tasksrdquo inProceedings of the 12th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo06) pp 935ndash940 ACM August 2006

[53] Development Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2005

[54] Z Prekopcsak G Makrai T Henk and C Gaspar-PapanekldquoRadoop analyzing big data with rapidminer and hadooprdquo inProceedings of the 2nd RapidMiner Community Meeting andConference (RCOMM rsquo11) June 2011

[55] J N van Rijn B Bischl L Torgo B Gao V Umaashankarand S Fischer ldquoOpenML a collaborative science platformrdquo inMachine Learning and Knowledge Discovery in Databases pp645ndash649 Springer Berlin Germany 2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 4: Research Article Cloud Based Metalearning System for Predictive …downloads.hindawi.com/journals/tswj/2014/859279.pdf · 2019. 7. 31. · paradigm in order to satisfy global demands

4 The Scientific World Journal

ldquoomicrdquo data and molecular and system biology Probesthat collect data from tissue are more complex and allowsimultaneous tracking and collect more data than few yearsago Authors have also considered several issues such assecurity scalability of storage scalability of analysis peerwith other private clouds and peer with public clouds

Metalearning presents powerful methodology whichenables learning on its past knowledge of solving differenttasks This methodology is used in fraud detection [36] timeseries forecasting [37] load forecasting [11] and others

3 Metalearning Framework forClustering Biomedical Data

Exponential growth of the data and rapid development oflarge number of complex and computationally intensive datamining algorithms led to one of the major problems inmodern data mining selection of the best algorithm for dataat hand [38]Namely analyst often does not have enough timeor resources for creating models and evaluating themwith allavailable algorithms

One of the most promising approaches for dealingwith this problem is metalearning [39 40] Metalearning ismethodology which solves different data mining tasks basedon past knowledgeThemain idea is to store history of exper-imental results with descriptions (meta-attributes) of datasets(eg dataset characteristics algorithm and classificationaccuracies for classification problems) and based on thisto create metamodel (classification or regression) that willpredict the performances of each algorithm on new datasetCreating of suchmetamodel (with good performance) wouldreduce the need for brute force evaluation (evaluation ofevery algorithm)

Metalearning system is built on set of algorithm orcombination of algorithms (ensembles) Therefore everyalgorithm or combination of algorithms (ensembles) is sim-plerTheoreticallymetalearning system can be infinitely largeby puttingmetalearning as component of other metalearningsystems An advantage of its using is that it can address newtypes of tasks that have not been seen but are similar toalready defined problems

Metalearning by Smith-Miles [39 40] is definedwith thefollowing aspects

(i) the problem space 119875 which represents set ofinstances (datasets) of a given problem class

(ii) the meta-attribute space119872 which contains charac-teristics that describe existing problems (eg numberof attributes entropy normality etc)

(iii) the algorithm space 119860 which represents the set ofcandidate algorithms which can be used to solve theproblems defined in problem space 119875

(iv) a performance metric 119884 which represents measuresof performance of an algorithm on a problem (egclassification accuracy (for classification problems) orroot mean square error (for regression problems))

General procedure for metalearning is done in severalsteps first datasets from problem space are evaluated by

algorithms fromalgorithm space Furthermetafeatures of thedatasets are related to algorithm performance forming thedatabase of metaexamples Then regression or classificationmodels are created (and evaluated by performance metric)Finally when new problem (dataset) arrives meta-attributesare extracted and performance prediction is made In thisway analyst does not have to evaluate each algorithm on eachdataset but only ones with the best predicted performanceof the problem and algorithm spaces While technologies fordata collection enabled cheap and fast accumulation of dataand extension of problem space development and collectionof data mining algorithms is a more difficult problem sincedevelopment of new algorithms demands a lot of time andeffort and also different algorithms are implemented ondifferent platforms

The most important issue for good performance of met-alearning systems is the size of problem and algorithm spacebecause the accuracy of metamodels is directly dependant onthese spaces [39 40]Thismeans that cloud based system andservice oriented architecture should be natural environmentfor this kind of systems because it would enable communitybased extension models to be created and evaluated byperformance metric in order to capture relations betweenmeta-attributes and algorithm performance

With metalearning approach in solving problems timefor choosing appropriate algorithm for problem is greatlyreduced but requires time for creating and updating meta-models especially if data and algorithms are gathered fromcommunity This is one of the main motivations for integra-tion of such a system in cloud based environment and inte-gration with big data technologies (like Hadoop Hive andMahout) for storing and aggregation of data and predictivemodeling

31 Component Based Metalearning System for BiomedicalData One of the promising approaches for tackling theseproblems (existence of large algorithm space and existenceof efficient procedure for selection of the best algorithm)is component based data mining algorithm design [41ndash43]This approach divides algorithms with similar structure (inthis case representative based algorithms) into parts withthe same functionality called subproblems Every subproblemhas standardized IO structure and can be solved with oneor more reusable components (RCs) presented in Table 1This approach combination of RCs which originates fromdifferent algorithms can be used to design large number(thousands) of new ldquohybridrdquo algorithms This approach gavevery promising results in the area of clustering biomedical(gene expression) data [8 44 45]

Combining RCs is used for reproducing or creationof cluster algorithms For example K-means algorithmscan be reconstructed as RANDOM-EUCLIDEAN-MEAN-COMPACT However a new hybrid algorithm can be con-structed using DIANA-CORREL-MEDIAN-CONN whereDIANA is used to initialize representatives CORREL tomeasure distance MEDIAN to update representatives andCONN to evaluate clusters

The Scientific World Journal 5

Table 1 Sub-problems and RCs for generic clustering algorithmdesign

Sub-problem Reusable components

Initialize representatives DIANA RANDOM XMEANSGMEANS PCA KMEANS++ SPSS

Measure distance EUCLIDEAN CITY CORRELCOSINE

Update representatives MEAN MEDIAN ONLINE

Evaluate clusters AIC BIC SILHOU COMPACT XBCONN

Extendedmetalearning system shown in Figure 2 is usedin this research Problem space is presented in upper leftcloud Every problem (dataset) from problem space 119875 has itstask (clustering classification regression etc) denounced 119909Based on problem function 119891 extracts meta-attributes Forselected problem based onmeta-attributes function 119878 selectsalgorithm from algorithm space 119860 Every algorithm is con-structed from reusable components (RCs) from which addi-tional meta-attributes were derived (algorithm descriptions)Also as a result of clustering algorithm on specific datasetinternal evaluationmeasures (additional meta-attributes) arecalculated and saved as meta-attributes Central cloud is themost important part of metalearning system It is responsiblefor ranking and selection of algorithms Inputs in this cloudare task 119909 algorithm 119886 and performance metric 120587 Meta-attributes created for each task 119909 are input for ranking andselection as they are a basis for learning on metalevel Formost tasks performance ofmeta-learning system is calculatedearlier as output label and it is available on metalevel

32 Initial Evaluation of Component Based MetalearningSystem for Clustering Biomedical Data In this section we willdescribe the data and the procedure for initial evaluation ofthe proposed system 30 datasets gathered from original met-alearning system [46] were used (httpbioinformaticsrut-gerseduStaticSupplementsCompCancerdatasetshtm)

For the construction of themetaexamples a set of 13meta-attributes for dataset description proposed by Nascimientoet al [46] are used (detailed description of datasets andmetafeatures can be found in Nascimiento et al [46])

Additionally meta-attribute space is extended withdescriptions of algorithms (four components of algorithmand normalization type described in Table 1) and internalcluster evaluation measures including compactness globalsilhouette indexAIC BICXB-index and connectivityThesethree types of meta-attributes (dataset descriptions reusablecomponents and internal evaluation measures) form thespace of 24 meta-attributes

Component based clustering algorithms were used todefine algorithm space For that purpose 504 RC-based clus-ter algorithms were designed for experimental evaluationThese algorithms were built by combining already describedRCs (Table 1) with 4 different normalization techniqueswhich lead to total of 2016 clustering experiments

Table 2 Meta-algorithm performance

AlgorithmError RMSE MAERBFN 0143 0109 (plusmn0092)LR 0111 0086 (plusmn0070)LMSR 0265 0094 (plusmn0248)NN 0101 0064 (plusmn0078)SVM 0050 0034 (plusmn0036)

For validation of clusteringmodels AMI (adjustedmutualinformation) index was used since it is recently recom-mended as a ldquogeneral purposerdquo measure for clustering valida-tion comparison and algorithm design [47] after exhaustivecomparison between a number of information theoreticand pair counting measures Even more this measure isthoroughly evaluated on gene expression microarray data

After validation of component based clustering algo-rithms on 30 datasets 55326 valid results were gatheredwhich represent metaexample repository Next step wasidentification of the best algorithm for ranking and selectionof algorithms for clustering gene expression microarray data

A procedure for ranking and selection of the best clus-tering algorithms is based on regression algorithms calledmeta-algorithms which predict (regression task) AMI valuesbased on a dataset metafeatures algorithm components andinternal evaluation measures

In this research five meta-algorithms were used Thoseare radial basis function network (RBFN) linear regression(LR) leastmedian square regression (LMSR) neural network(NN) and support vector machine (SVM)

Estimate of quality of regression algorithms is done usingmean absolute error (MAE) and root mean squared error(RMSE) Validation of results is done using 70 of datasetfor training the model and the remaining 30 for testing

Performance of each algorithm in terms of MAE andRMSE and best values are shown in bold (Table 2)

Although all five algorithms showed good results SVMas in Table 2 gave the best performance and this modelshould be used for prediction of algorithm performances fornew datasets RMSE of 005 and MAE of 0034 with thesmallest variance (numbers in brackets) indicate that thismetamodel is applicable to the new problems since AMImeasure takes values from 0 to 1 where 1 is the best Notethat with an extension of algorithm space and problem spacethese results could be changed and so continuous evaluationof available meta-algorithms should be done Because of thisit is important to have adequate computing infrastructurefor processing big data and updating the models Process forcreating and updating the models is presented in Figure 3Creation of model contains several steps First microarraymetaexamples are loaded from which only important vari-ables are selected After that data preparation phase wasconducted where only those attributes that are importantfor model building were selected label attribute was set onAMI attribute missing values were replaced with averagevalue and nominal values were transformed to numericalvalues using dummy coding Modeling phase is conducted

6 The Scientific World Journal

Gene expressionclustring

Cell typeclassifcation

Ranking and selection

Performance 120587

AlgorithmA

Reusablecomponents

(RCs)

PCADIANA

Initializerepresentatives

CITYCORREL

RMSEAbsolute error Accuracy

Precision

AICBIC

XB

MEANMEDIAN

ONLINE

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

LgE

BCPMV

NRE

K-NN

Algorithm space (A)

Evaluate cluster Updaterepresentatives

Measure distance

Algorithm

fe PCA-CITY-MEAN-XBltgt a 120587(a x) gt 120587( x)

a = combination of RC

Performance space (Y)

Task x

Feature extraction (f)

Meta-attribute space (M)Problem space (P)

selection (S)

A(x) = S(f(x))

a998400 a998400

Figure 2 Extended metalearning system for clustering biomedical data

Figure 3 Main process for finding the best model for prediction ofAMI

Figure 4 Stream for application of metalearning system on newcases

using 10-fold cross validation where the above-mentionedfive algorithms were used Every trained model is saved onhard disk which allows its reusability

Automatic application of the selected (in this case SVM)model is presented in Figure 4 After every update of themodel or after inserting new dataset this process needs to beupdated Savedmodel in this case SVM is loaded and appliedon new dataset Results gathered are sorted and exportedDetailed information can be found in Radovanovic et al [48]

4 Extended Cloud Based Model forBig Data Analysis

Dai et al [16] addressed the problems of storing and analysisof biomedical data by proposing a cloud based model foranalysis of biomedical data and it is composed of four servicecategories

(i) data as a service (DaaS)(ii) software as a service (SaaS)(iii) platform as a service (PaaS)(iv) infrastructure as a service (IaaS)

DaaS is group of cloud services which enables on demanddata access and provides up-to-date data that are accessibleby a wide range of devices that are connected over theweb In case of bioinformatics AmazonWeb Services (AWS)provides repository of public archives (data sets) includingGenBank Unigene Ensembl and Influenza Virus which canbe accessed from cloud based applications

Since bioinformatics requires a large diversity of softwaretools for data analyses the task of SaaS in bioinformatics is todeliver and enable remote access to software services onlineThus installation of software tools on desktop computer isno longer required Another one advantage of using SaaS isenablingmuch easier collaboration betweendispersed groupsof users

Platform as a service (PaaS) should offer a programmableenvironment for users in order to develop test and deploycloud applications Computer resources scale automatically

The Scientific World Journal 7

Daas

SaaS

PaaS

IaaS

KaaS

∙ Public datasets∙

∙ Tools∙ Pipelines

∙ Analysis platforms∙ Programming environments

∙ Virtual machines∙ Virtualized resources

∙ Domain experts∙ Knowledge networks

Biological datasets

Figure 5 Cloud based system [16] integrated with KaaS extension[17] for analysis of biomedical data

without user interference In most PaaS services besidesprogramming environment database and web server areavailable

Since most medical institutions do not have computingresources such as CPU IaaS offers a full computer infras-tructure delivering virtualized resources Those virtualizedresources can be operating systems RAM CPU or othercomputer resource

Lai et al [17] introduced a new service model in cloudcomputingmdashthe knowledge as a service (KaaS) that facilitatesthe interoperations amongmembers in a knowledge networkExtended model is depicted on Figure 5

This new service model relies on data created in acollaboration process of domain experts It was recognizedas a new form of cloud service and categorized as KaaS Ina nutshell such approach is data driven and lacks higherorder structure in order to provide knowledge as a serviceThis implies that KaaS in this form is provided by humandomain experts We extend this class of services with dataand model approach By combining data driven models andexpert knowledge that is stored in unstructured data likedocuments notes and collaborations knowledge is createdand offered to cloud users Specifically we extend models of[16 17] by including big data technologies platforms for datamining andmetamodels for ranking and selection of the bestalgorithms for biomedical data mining System componentsare displayed on a diagram (Figure 6) and classified accordingto service type directed towards end user Big data engineprovides data storage and access and it is a basis of each classof services (diagram center)

Key component of KaaS is metalearning algorithms andselected algorithms for clustering biomedical data Bothare represented by algorithm space component and bothare accompanied by their describing metamodels Thesemetamodels are used in runtime for ranking and selection

of best algorithms for the new problem (dataset) Experi-mental results provide knowledge on performances of eachalgorithm with meta-attributes on each dataset Data flowsare kept using big data engine such as Hadoop Segmentof experimental results component lies on DaaS since thesemeta-attributes can be provided as a data service DaaSprovides biomedical data It is divided into protected andpublic segments

Cloud approach provides data accumulation and higheravailability of data to interested parties (with rights of accessimplied) Interested parties can be found among not onlymedical employees and researchers but also community usingdifferent software tools to access data using DaaS

Central part of the circle (Figure 6) contains componentsof a big data engine (HDFS Hive and Mahout) where allthe data are centralized (medical data metadata and algo-rithms performances) Additionally big data engine providesinterfaces for data manipulation and analysis Apache Hive[49] is data warehouse software built on top of Hadoopused for querying and managing large datasets residing indistributed storage Apache Mahout [50] is also built ontop of Hadoop and is used as a scalable machine learninglibrary for classification clustering recommender systemsand dimension reduction

Our solution provides software for data analysis in aservice form (SaaS) with respect of SOA dependability [51]Additionally third party software can easily become anintegrated part of SaaS (RapidMiner [52] R [53] or others)

These software solutions are recommended because theyhave a direct interface for access and analysis of big data(eg Radoop [54] allows using visual RapidMiner interfaceand has operators that run distributed algorithms based onHadoop Hive [49] and Mahout [50] without writing anycode)

Most medical institutions require not only computingresources such as CPUs but also communication infras-tructure these components are offered as a part of IaaSin a form of virtualized resources Higher level of servicerests on using platform such as variety of operating systemsBut platform can also provide tools for development ofalgorithms and applications (Eclipse Netbeans RapidMinerR ) The circle is closed by development of new algorithmsand also for algorithm deployment and execution For thisreason algorithms space component is partly situated in PaaSspace

5 Conclusion and Future Research

In this paper we proposed a cloud based architecture forstoring analysis and predictive modeling of biomedical bigdata Existing service based cloud architecture is extended byincluding metalearning system as a data and model drivenknowledge service As a part of the proposed architecturewe provided a support for community based gathering ofdata and algorithms that is an important precondition forquality of metalearning Advancement of this research areaand adding new value are enabled through platform for devel-opment and execution of distributed data mining processes

8 The Scientific World Journal

Big data engine

Meta modelsmodels

Experimentalresults

Algorit

hms s

pace M

edical data

Publicdata

Protected

data

Virtual machines

Virtual network

Ope

ratin

g sy

stem

s(li

nux

Mac

OS Reporting

sofware(JasperReports Eclipse

PaaS

IaaS

KaaS

Daas

SaaS

BIRT OpenOffice

Win)

)

Figure 6 Cloud based system for predictive modeling of biomedical data

and algorithms Finally we provided data and model drivendecision support on selecting best algorithms for workingwith biomedical data

Retrospectively proposed solution focuses on a specifictype of biomedical data while other types still remain tobe included and evaluated Data security and privacy stillremains a concern to be taken into amore serious account Inorder to provide even further impact in research communityadditional work is necessary on providing interoperabilityamong potential open source components

System was tested on microarray gene expression datawith specific meta-attributes for this data type (eg chiptype) Further efforts will be made to include other typesof biomedical data This will be done by identifying spe-cific meta-attributes that fit newly included data typesAdditionally as a further work integration with OpenML[55] platform used for storing and gathering datasets and

clustering algorithm runs is plannedThis platform providesa base for a community to share experiments algorithms anddata Significant clustering algorithm meta-attributes can beextracted and used for updating our metalearning system

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research is partially funded by Grants from the SerbianMinistry of Science and Technological Development Con-tract nos III 41008 and TR 32013

The Scientific World Journal 9

References

[1] K Kincade ldquoData mining digging for healthcare goldrdquo Insur-ance and Technology vol 23 no 2 pp IM2ndashIM7 1998

[2] H C Koh andG Tan ldquoDatamining applications in healthcarerdquoJournal ofHealthcare InformationManagement vol 19 no 2 pp64ndash72 2011

[3] K Srinivas B K Rani and A Govrdhan ldquoApplications ofdata mining techniques in healthcare and prediction of heartattacksrdquo International Journal on Computer Science and Engi-neering vol 2 no 2 pp 250ndash255 2010

[4] P Maji ldquoMutual information-based supervised attribute clus-tering for microarray sample classificationrdquo IEEE Transactionson Knowledge and Data Engineering vol 24 no 1 pp 127ndash1402012

[5] R Das ldquoA comparison of multiple classification methods fordiagnosis of Parkinson diseaserdquo Expert Systems with Applica-tions vol 37 no 2 pp 1568ndash1572 2010

[6] A Thomas N H Patterson M M Marcinkiewicz A LazarisP Metrakos and P Chaurand ldquoHistology-driven data miningof lipid signatures from multiple imaging mass spectrometryanalyses application to human colorectal cancer liver metasta-sis biopsiesrdquoAnalytical Chemistry vol 85 no 5 pp 2860ndash28662013

[7] R L Grossman and K P White ldquoA vision for a biomedicalcloudrdquo Journal of Internal Medicine vol 271 no 2 pp 122ndash1302012

[8] M Vukicevic B Delibasic Z Obradovic M Jovanovic andMSuknovic ldquoA method for design of data-tailored partitioningalgorithms for optimizing the number of clusters in microarrayanalysisrdquo in Proceedings of the IEEE Symposium on Computa-tional Intelligence in Bioinformatics and Computational Biology(CIBCB rsquo12) pp 252ndash259 IEEE San Diego Calif USA 2012

[9] H Chae I Jung H Lee S Marru S W Lee and S Kim ldquoBioand health informatics meets cloud BioVLab as an examplerdquoHealth Information Science and Systems vol 1 no 1 article 62013

[10] J Sun and C K Reddy ldquoBig data analytics for healthcarerdquo inProceedings of the 19th ACM SIGKDD international conferenceon Knowledge discovery and data mining pp 1525ndash1525 ACMAugust 2013

[11] M Matijas J A Suykens and S Krajcar ldquoLoad forecastingusing a multivariate meta-learning systemrdquo Expert SystemsWith Applications vol 40 no 11 pp 4427ndash4437 2013

[12] S Zhou K K Lai and J Yen ldquoA dynamic meta-learning rate-based model for gold market forecastingrdquo Expert Systems withApplications vol 39 no 6 pp 6168ndash6173 2012

[13] J Kanda C Soares E Hruschka and A de Carvalho ldquoA meta-learning approach to select meta-heuristics for the travelingsalesman problem using MLP-based label rankingrdquo in NeuralInformation Processing pp 488ndash495 Springer Berlin Germany2012

[14] A Attig and P Perner ldquoMeta-learning for image processingbased on case-based reasoningrdquo in Computational Intelligencein Healthcare vol 4 pp 229ndash264 Springer Berlin Germany2010

[15] AVera-Baquero R Colomo-Palacios andOMolloy ldquoBusinessprocess analytics using a big data approachrdquo IT Professional vol15 no 6 pp 29ndash35 2013

[16] L Dai X Gao Y Guo J Xiao and Z Zhang ldquoBioinformaticsclouds for big data manipulationrdquo Biology Direct vol 7 no 1article 43 2012

[17] I K Lai S K Tam and M F Chan ldquoKnowledge cloud systemfor network collaboration a case study in medical serviceindustry in Chinardquo Expert Systems with Applications vol 39 no15 pp 12205ndash12212 2012

[18] S Sonnenburg M L Braun S O Cheng et al ldquoThe need foropen source software in machine learningrdquo Journal of MachineLearning Research vol 8 pp 2443ndash2466 2007

[19] S Marston Z Li S Bandyopadhyay J Zhang and A GhalsasildquoCloud computing the business perspectiverdquo Decision SupportSystems vol 51 no 1 pp 176ndash189 2011

[20] N Sultan ldquoCloud computing for education a new dawnrdquoInternational Journal of Information Management vol 30 no2 pp 109ndash116 2010

[21] W Zhang and Q Chen ldquoFrom E-government to C-governmentvia cloud computingrdquo in Proceedings of the 1st InternationalConference on E-Business and E-Government (ICEE rsquo10) pp679ndash682 IEEE May 2010

[22] A Fernandez S del Rıo F Herrera and J M Benıtez ldquoAnoverview on the structure and applications for business intel-ligence and data mining in cloud computingrdquo in Proceedingsof the 7th International Conference on Knowledge Managementin Organizations Service and Cloud Computing pp 559ndash570Springer Berlin Germany January 2013

[23] S P Ahuja S Mani and J Zambrano ldquoA survey of the state ofcloud computing in healthcarerdquo Network and CommunicationTechnologies vol 1 no 2 pp 12ndash19 2012

[24] W Li J Yan Y Yan and J Zhang ldquoXbase cloud-enabledinformation appliance for healthcarerdquo in Proceedings of the 13thInternational Conference on Extending Database TechnologyAdvances in Database Technology (EDBT rsquo10) pp 675ndash680March 2010

[25] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[26] T Oinn M Addis J Ferris et al ldquoTaverna a tool forthe composition and enactment of bioinformatics workflowsrdquoBioinformatics vol 20 no 17 pp 3045ndash3054 2004

[27] BGiardine C Riemer R CHardison et al ldquoGalaxy a platformfor interactive large-scale genome analysisrdquo Genome Researchvol 15 no 10 pp 1451ndash1455 2005

[28] M C Schatz ldquoCloudBurst highly sensitive read mapping withMapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009

[29] B D OrsquoConnor B Merriman and S F Nelson ldquoSeqWare queryengine storing and searching sequence data in the cloudrdquo BMCBioinformatics vol 11 supplement 12 article S2 2010

[30] D Hong A Rhie S Park et al ldquoFX an RNA-seq analysis toolon the cloudrdquo Bioinformatics vol 28 no 5 pp 721ndash723 2012

[31] S V Angiuoli M Matalka A Gussman et al ldquoCloVR a virtualmachine for automated and portable sequence analysis from thedesktop using cloud computingrdquo BMC Bioinformatics vol 12no 1 article 356 2011

[32] R Zhang and L Liu ldquoSecurity models and requirements forhealthcare application cloudsrdquo in Proceedings of the 3rd IEEEInternational Conference on Cloud Computing (CLOUD rsquo10) pp268ndash275 IEEE July 2010

[33] R Wooten R Klink F Sinek Y Bai and M Sharma ldquoDesignand implementation of a secure healthcare social cloud systemrdquoin Proceedings of the 12th IEEEACM International Symposiumon Cluster Cloud and Grid Computing (CCGrid rsquo12) pp 805ndash810 IEEE May 2012

10 The Scientific World Journal

[34] W Liu and E K Park ldquoE-healthcare cloud computing appli-cation solutions cloud-enabling characteristics challenges andadaptationsrdquo in Proceedings of the International Conference onComputing Networking and Communications (ICNC rsquo13) pp437ndash443 IEEE January 2013

[35] T Giles and R Wilcox ldquoIBMWatson and medical records textanalyticsrdquo in Proceedings of the 2011 Healthcare Information andManagement Systems Society Meeting 2011

[36] A Abbasi C Albrecht A Vance and J Hansen ldquoMetafrauda meta-learning framework for detecting financial fraudrdquo MISQuarterly vol 36 no 4 pp 1293ndash1327 2012

[37] C Lemke and B Gabrys ldquoMeta-learning for time series fore-casting and forecast combinationrdquo Neurocomputing vol 73 no10-12 pp 2006ndash2016 2010

[38] N Iam-On T Boongoen and S Garrett ldquoLCE a link-basedcluster ensemble method for improved gene expression dataanalysisrdquo Bioinformatics vol 26 no 12 pp 1513ndash1519 2010

[39] K A Smith-Miles ldquoTowards insightful algorithm selection foroptimisation using meta-learning conceptsrdquo in Proceedings ofthe (IEEE World Congress on Computational Intelligence) IEEEInternational Joint Conference on Neural Networks (IJCNN rsquo08)pp 4118ndash4124 IEEE Hong Kong June 2008

[40] K A Smith-Miles ldquoCross-disciplinary perspectives on meta-learning for algorithm selectionrdquo ACMComputing Surveys vol41 no 1 article 6 2008

[41] B Delibasic K Kirchner J Ruhland M Jovanovic and MVukicevic ldquoReusable components for partitioning clusteringalgorithmsrdquo Artificial Intelligence Review vol 32 no 1ndash4 pp59ndash75 2009

[42] B Delibasic M Vukicevic M Jovanovic K Kirchner J Ruh-land and M Suknovic ldquoAn architecture for component-baseddesign of representative-based clustering algorithmsrdquoData andKnowledge Engineering vol 75 pp 78ndash98 2012

[43] M Suknovic B Delibasic M Jovanovic M Vukicevic DBecejski-Vujaklija and Z Obradovic ldquoReusable components indecision tree induction algorithmsrdquo Computational Statisticsvol 27 no 1 pp 127ndash148 2012

[44] M Vukicevic B Delibasic M Jovanovic M Suknovic and ZObradovic ldquoInternal evaluation measures as proxies for exter-nal indices in clustering gene expression datardquo in Proceedingsof the IEEE International Conference on Bioinformatics andBiomedicine (BIBM rsquo11) pp 574ndash577 IEEE Computer Society2011

[45] M Vukicevic K Kirchner B Delibasic M Jovanovic J Ruh-land and M Suknovic ldquoFinding best algorithmic componentsfor clustering microarray datardquo Knowledge and InformationSystems vol 35 no 1 pp 111ndash130 2013

[46] A Nascimiento R Prudencio M de Souto and I CostaldquoMining rules for the automatic selection process of clusteringmethods applied to cancer gene expression datardquo in Proceedingsof the 19th International Conference on Artificial Neural Net-works Part II pp 20ndash29 Springer Limassol Cyprus 2009

[47] N X Vinh J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison variants properties nor-malization and correction for chancerdquo Journal of MachineLearning Research vol 11 pp 2837ndash2854 2010

[48] S Radovanovic M Vukicevic M Jovanovic B Delibasic andM Suknovic ldquoMeta-learning system for clustering gene expres-sion microarray datardquo in Proceedings of the 4th RapidMinerCommunity Meeting and Conference (RCOMM rsquo13) pp 97ndash111Shaker Porto Portugal 2013

[49] A Thusoo J S Sarma N Jain Z Shao P Chakka and SAnthony ldquoHive a warehousing solution over a map-reduceframeworkrdquo Proceedings of the VLDB Endowment vol 2 no 2pp 1626ndash1629 2009

[50] R Anil T Dunning and E Friedman Mahout in ActionManning 2011

[51] V Stantchev andMMalek ldquoAddressing dependability through-out the SOA life cyclerdquo IEEE Transactions on Services Comput-ing vol 4 no 2 pp 85ndash95 2011

[52] I Mierswa M Wurst R Klinkenberg M Scholz and T EulerldquoYALE rapid prototyping for complex data mining tasksrdquo inProceedings of the 12th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo06) pp 935ndash940 ACM August 2006

[53] Development Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2005

[54] Z Prekopcsak G Makrai T Henk and C Gaspar-PapanekldquoRadoop analyzing big data with rapidminer and hadooprdquo inProceedings of the 2nd RapidMiner Community Meeting andConference (RCOMM rsquo11) June 2011

[55] J N van Rijn B Bischl L Torgo B Gao V Umaashankarand S Fischer ldquoOpenML a collaborative science platformrdquo inMachine Learning and Knowledge Discovery in Databases pp645ndash649 Springer Berlin Germany 2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 5: Research Article Cloud Based Metalearning System for Predictive …downloads.hindawi.com/journals/tswj/2014/859279.pdf · 2019. 7. 31. · paradigm in order to satisfy global demands

The Scientific World Journal 5

Table 1 Sub-problems and RCs for generic clustering algorithmdesign

Sub-problem Reusable components

Initialize representatives DIANA RANDOM XMEANSGMEANS PCA KMEANS++ SPSS

Measure distance EUCLIDEAN CITY CORRELCOSINE

Update representatives MEAN MEDIAN ONLINE

Evaluate clusters AIC BIC SILHOU COMPACT XBCONN

Extendedmetalearning system shown in Figure 2 is usedin this research Problem space is presented in upper leftcloud Every problem (dataset) from problem space 119875 has itstask (clustering classification regression etc) denounced 119909Based on problem function 119891 extracts meta-attributes Forselected problem based onmeta-attributes function 119878 selectsalgorithm from algorithm space 119860 Every algorithm is con-structed from reusable components (RCs) from which addi-tional meta-attributes were derived (algorithm descriptions)Also as a result of clustering algorithm on specific datasetinternal evaluationmeasures (additional meta-attributes) arecalculated and saved as meta-attributes Central cloud is themost important part of metalearning system It is responsiblefor ranking and selection of algorithms Inputs in this cloudare task 119909 algorithm 119886 and performance metric 120587 Meta-attributes created for each task 119909 are input for ranking andselection as they are a basis for learning on metalevel Formost tasks performance ofmeta-learning system is calculatedearlier as output label and it is available on metalevel

32 Initial Evaluation of Component Based MetalearningSystem for Clustering Biomedical Data In this section we willdescribe the data and the procedure for initial evaluation ofthe proposed system 30 datasets gathered from original met-alearning system [46] were used (httpbioinformaticsrut-gerseduStaticSupplementsCompCancerdatasetshtm)

For the construction of themetaexamples a set of 13meta-attributes for dataset description proposed by Nascimientoet al [46] are used (detailed description of datasets andmetafeatures can be found in Nascimiento et al [46])

Additionally meta-attribute space is extended withdescriptions of algorithms (four components of algorithmand normalization type described in Table 1) and internalcluster evaluation measures including compactness globalsilhouette indexAIC BICXB-index and connectivityThesethree types of meta-attributes (dataset descriptions reusablecomponents and internal evaluation measures) form thespace of 24 meta-attributes

Component based clustering algorithms were used todefine algorithm space For that purpose 504 RC-based clus-ter algorithms were designed for experimental evaluationThese algorithms were built by combining already describedRCs (Table 1) with 4 different normalization techniqueswhich lead to total of 2016 clustering experiments

Table 2 Meta-algorithm performance

AlgorithmError RMSE MAERBFN 0143 0109 (plusmn0092)LR 0111 0086 (plusmn0070)LMSR 0265 0094 (plusmn0248)NN 0101 0064 (plusmn0078)SVM 0050 0034 (plusmn0036)

For validation of clusteringmodels AMI (adjustedmutualinformation) index was used since it is recently recom-mended as a ldquogeneral purposerdquo measure for clustering valida-tion comparison and algorithm design [47] after exhaustivecomparison between a number of information theoreticand pair counting measures Even more this measure isthoroughly evaluated on gene expression microarray data

After validation of component based clustering algo-rithms on 30 datasets 55326 valid results were gatheredwhich represent metaexample repository Next step wasidentification of the best algorithm for ranking and selectionof algorithms for clustering gene expression microarray data

A procedure for ranking and selection of the best clus-tering algorithms is based on regression algorithms calledmeta-algorithms which predict (regression task) AMI valuesbased on a dataset metafeatures algorithm components andinternal evaluation measures

In this research five meta-algorithms were used Thoseare radial basis function network (RBFN) linear regression(LR) leastmedian square regression (LMSR) neural network(NN) and support vector machine (SVM)

Estimate of quality of regression algorithms is done usingmean absolute error (MAE) and root mean squared error(RMSE) Validation of results is done using 70 of datasetfor training the model and the remaining 30 for testing

Performance of each algorithm in terms of MAE andRMSE and best values are shown in bold (Table 2)

Although all five algorithms showed good results SVMas in Table 2 gave the best performance and this modelshould be used for prediction of algorithm performances fornew datasets RMSE of 005 and MAE of 0034 with thesmallest variance (numbers in brackets) indicate that thismetamodel is applicable to the new problems since AMImeasure takes values from 0 to 1 where 1 is the best Notethat with an extension of algorithm space and problem spacethese results could be changed and so continuous evaluationof available meta-algorithms should be done Because of thisit is important to have adequate computing infrastructurefor processing big data and updating the models Process forcreating and updating the models is presented in Figure 3Creation of model contains several steps First microarraymetaexamples are loaded from which only important vari-ables are selected After that data preparation phase wasconducted where only those attributes that are importantfor model building were selected label attribute was set onAMI attribute missing values were replaced with averagevalue and nominal values were transformed to numericalvalues using dummy coding Modeling phase is conducted

6 The Scientific World Journal

Gene expressionclustring

Cell typeclassifcation

Ranking and selection

Performance 120587

AlgorithmA

Reusablecomponents

(RCs)

PCADIANA

Initializerepresentatives

CITYCORREL

RMSEAbsolute error Accuracy

Precision

AICBIC

XB

MEANMEDIAN

ONLINE

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

LgE

BCPMV

NRE

K-NN

Algorithm space (A)

Evaluate cluster Updaterepresentatives

Measure distance

Algorithm

fe PCA-CITY-MEAN-XBltgt a 120587(a x) gt 120587( x)

a = combination of RC

Performance space (Y)

Task x

Feature extraction (f)

Meta-attribute space (M)Problem space (P)

selection (S)

A(x) = S(f(x))

a998400 a998400

Figure 2 Extended metalearning system for clustering biomedical data

Figure 3 Main process for finding the best model for prediction ofAMI

Figure 4 Stream for application of metalearning system on newcases

using 10-fold cross validation where the above-mentionedfive algorithms were used Every trained model is saved onhard disk which allows its reusability

Automatic application of the selected (in this case SVM)model is presented in Figure 4 After every update of themodel or after inserting new dataset this process needs to beupdated Savedmodel in this case SVM is loaded and appliedon new dataset Results gathered are sorted and exportedDetailed information can be found in Radovanovic et al [48]

4 Extended Cloud Based Model forBig Data Analysis

Dai et al [16] addressed the problems of storing and analysisof biomedical data by proposing a cloud based model foranalysis of biomedical data and it is composed of four servicecategories

(i) data as a service (DaaS)(ii) software as a service (SaaS)(iii) platform as a service (PaaS)(iv) infrastructure as a service (IaaS)

DaaS is group of cloud services which enables on demanddata access and provides up-to-date data that are accessibleby a wide range of devices that are connected over theweb In case of bioinformatics AmazonWeb Services (AWS)provides repository of public archives (data sets) includingGenBank Unigene Ensembl and Influenza Virus which canbe accessed from cloud based applications

Since bioinformatics requires a large diversity of softwaretools for data analyses the task of SaaS in bioinformatics is todeliver and enable remote access to software services onlineThus installation of software tools on desktop computer isno longer required Another one advantage of using SaaS isenablingmuch easier collaboration betweendispersed groupsof users

Platform as a service (PaaS) should offer a programmableenvironment for users in order to develop test and deploycloud applications Computer resources scale automatically

The Scientific World Journal 7

Daas

SaaS

PaaS

IaaS

KaaS

∙ Public datasets∙

∙ Tools∙ Pipelines

∙ Analysis platforms∙ Programming environments

∙ Virtual machines∙ Virtualized resources

∙ Domain experts∙ Knowledge networks

Biological datasets

Figure 5 Cloud based system [16] integrated with KaaS extension[17] for analysis of biomedical data

without user interference In most PaaS services besidesprogramming environment database and web server areavailable

Since most medical institutions do not have computingresources such as CPU IaaS offers a full computer infras-tructure delivering virtualized resources Those virtualizedresources can be operating systems RAM CPU or othercomputer resource

Lai et al [17] introduced a new service model in cloudcomputingmdashthe knowledge as a service (KaaS) that facilitatesthe interoperations amongmembers in a knowledge networkExtended model is depicted on Figure 5

This new service model relies on data created in acollaboration process of domain experts It was recognizedas a new form of cloud service and categorized as KaaS Ina nutshell such approach is data driven and lacks higherorder structure in order to provide knowledge as a serviceThis implies that KaaS in this form is provided by humandomain experts We extend this class of services with dataand model approach By combining data driven models andexpert knowledge that is stored in unstructured data likedocuments notes and collaborations knowledge is createdand offered to cloud users Specifically we extend models of[16 17] by including big data technologies platforms for datamining andmetamodels for ranking and selection of the bestalgorithms for biomedical data mining System componentsare displayed on a diagram (Figure 6) and classified accordingto service type directed towards end user Big data engineprovides data storage and access and it is a basis of each classof services (diagram center)

Key component of KaaS is metalearning algorithms andselected algorithms for clustering biomedical data Bothare represented by algorithm space component and bothare accompanied by their describing metamodels Thesemetamodels are used in runtime for ranking and selection

of best algorithms for the new problem (dataset) Experi-mental results provide knowledge on performances of eachalgorithm with meta-attributes on each dataset Data flowsare kept using big data engine such as Hadoop Segmentof experimental results component lies on DaaS since thesemeta-attributes can be provided as a data service DaaSprovides biomedical data It is divided into protected andpublic segments

Cloud approach provides data accumulation and higheravailability of data to interested parties (with rights of accessimplied) Interested parties can be found among not onlymedical employees and researchers but also community usingdifferent software tools to access data using DaaS

Central part of the circle (Figure 6) contains componentsof a big data engine (HDFS Hive and Mahout) where allthe data are centralized (medical data metadata and algo-rithms performances) Additionally big data engine providesinterfaces for data manipulation and analysis Apache Hive[49] is data warehouse software built on top of Hadoopused for querying and managing large datasets residing indistributed storage Apache Mahout [50] is also built ontop of Hadoop and is used as a scalable machine learninglibrary for classification clustering recommender systemsand dimension reduction

Our solution provides software for data analysis in aservice form (SaaS) with respect of SOA dependability [51]Additionally third party software can easily become anintegrated part of SaaS (RapidMiner [52] R [53] or others)

These software solutions are recommended because theyhave a direct interface for access and analysis of big data(eg Radoop [54] allows using visual RapidMiner interfaceand has operators that run distributed algorithms based onHadoop Hive [49] and Mahout [50] without writing anycode)

Most medical institutions require not only computingresources such as CPUs but also communication infras-tructure these components are offered as a part of IaaSin a form of virtualized resources Higher level of servicerests on using platform such as variety of operating systemsBut platform can also provide tools for development ofalgorithms and applications (Eclipse Netbeans RapidMinerR ) The circle is closed by development of new algorithmsand also for algorithm deployment and execution For thisreason algorithms space component is partly situated in PaaSspace

5 Conclusion and Future Research

In this paper we proposed a cloud based architecture forstoring analysis and predictive modeling of biomedical bigdata Existing service based cloud architecture is extended byincluding metalearning system as a data and model drivenknowledge service As a part of the proposed architecturewe provided a support for community based gathering ofdata and algorithms that is an important precondition forquality of metalearning Advancement of this research areaand adding new value are enabled through platform for devel-opment and execution of distributed data mining processes

8 The Scientific World Journal

Big data engine

Meta modelsmodels

Experimentalresults

Algorit

hms s

pace M

edical data

Publicdata

Protected

data

Virtual machines

Virtual network

Ope

ratin

g sy

stem

s(li

nux

Mac

OS Reporting

sofware(JasperReports Eclipse

PaaS

IaaS

KaaS

Daas

SaaS

BIRT OpenOffice

Win)

)

Figure 6 Cloud based system for predictive modeling of biomedical data

and algorithms Finally we provided data and model drivendecision support on selecting best algorithms for workingwith biomedical data

Retrospectively proposed solution focuses on a specifictype of biomedical data while other types still remain tobe included and evaluated Data security and privacy stillremains a concern to be taken into amore serious account Inorder to provide even further impact in research communityadditional work is necessary on providing interoperabilityamong potential open source components

System was tested on microarray gene expression datawith specific meta-attributes for this data type (eg chiptype) Further efforts will be made to include other typesof biomedical data This will be done by identifying spe-cific meta-attributes that fit newly included data typesAdditionally as a further work integration with OpenML[55] platform used for storing and gathering datasets and

clustering algorithm runs is plannedThis platform providesa base for a community to share experiments algorithms anddata Significant clustering algorithm meta-attributes can beextracted and used for updating our metalearning system

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research is partially funded by Grants from the SerbianMinistry of Science and Technological Development Con-tract nos III 41008 and TR 32013

The Scientific World Journal 9

References

[1] K Kincade ldquoData mining digging for healthcare goldrdquo Insur-ance and Technology vol 23 no 2 pp IM2ndashIM7 1998

[2] H C Koh andG Tan ldquoDatamining applications in healthcarerdquoJournal ofHealthcare InformationManagement vol 19 no 2 pp64ndash72 2011

[3] K Srinivas B K Rani and A Govrdhan ldquoApplications ofdata mining techniques in healthcare and prediction of heartattacksrdquo International Journal on Computer Science and Engi-neering vol 2 no 2 pp 250ndash255 2010

[4] P Maji ldquoMutual information-based supervised attribute clus-tering for microarray sample classificationrdquo IEEE Transactionson Knowledge and Data Engineering vol 24 no 1 pp 127ndash1402012

[5] R Das ldquoA comparison of multiple classification methods fordiagnosis of Parkinson diseaserdquo Expert Systems with Applica-tions vol 37 no 2 pp 1568ndash1572 2010

[6] A Thomas N H Patterson M M Marcinkiewicz A LazarisP Metrakos and P Chaurand ldquoHistology-driven data miningof lipid signatures from multiple imaging mass spectrometryanalyses application to human colorectal cancer liver metasta-sis biopsiesrdquoAnalytical Chemistry vol 85 no 5 pp 2860ndash28662013

[7] R L Grossman and K P White ldquoA vision for a biomedicalcloudrdquo Journal of Internal Medicine vol 271 no 2 pp 122ndash1302012

[8] M Vukicevic B Delibasic Z Obradovic M Jovanovic andMSuknovic ldquoA method for design of data-tailored partitioningalgorithms for optimizing the number of clusters in microarrayanalysisrdquo in Proceedings of the IEEE Symposium on Computa-tional Intelligence in Bioinformatics and Computational Biology(CIBCB rsquo12) pp 252ndash259 IEEE San Diego Calif USA 2012

[9] H Chae I Jung H Lee S Marru S W Lee and S Kim ldquoBioand health informatics meets cloud BioVLab as an examplerdquoHealth Information Science and Systems vol 1 no 1 article 62013

[10] J Sun and C K Reddy ldquoBig data analytics for healthcarerdquo inProceedings of the 19th ACM SIGKDD international conferenceon Knowledge discovery and data mining pp 1525ndash1525 ACMAugust 2013

[11] M Matijas J A Suykens and S Krajcar ldquoLoad forecastingusing a multivariate meta-learning systemrdquo Expert SystemsWith Applications vol 40 no 11 pp 4427ndash4437 2013

[12] S Zhou K K Lai and J Yen ldquoA dynamic meta-learning rate-based model for gold market forecastingrdquo Expert Systems withApplications vol 39 no 6 pp 6168ndash6173 2012

[13] J Kanda C Soares E Hruschka and A de Carvalho ldquoA meta-learning approach to select meta-heuristics for the travelingsalesman problem using MLP-based label rankingrdquo in NeuralInformation Processing pp 488ndash495 Springer Berlin Germany2012

[14] A Attig and P Perner ldquoMeta-learning for image processingbased on case-based reasoningrdquo in Computational Intelligencein Healthcare vol 4 pp 229ndash264 Springer Berlin Germany2010

[15] AVera-Baquero R Colomo-Palacios andOMolloy ldquoBusinessprocess analytics using a big data approachrdquo IT Professional vol15 no 6 pp 29ndash35 2013

[16] L Dai X Gao Y Guo J Xiao and Z Zhang ldquoBioinformaticsclouds for big data manipulationrdquo Biology Direct vol 7 no 1article 43 2012

[17] I K Lai S K Tam and M F Chan ldquoKnowledge cloud systemfor network collaboration a case study in medical serviceindustry in Chinardquo Expert Systems with Applications vol 39 no15 pp 12205ndash12212 2012

[18] S Sonnenburg M L Braun S O Cheng et al ldquoThe need foropen source software in machine learningrdquo Journal of MachineLearning Research vol 8 pp 2443ndash2466 2007

[19] S Marston Z Li S Bandyopadhyay J Zhang and A GhalsasildquoCloud computing the business perspectiverdquo Decision SupportSystems vol 51 no 1 pp 176ndash189 2011

[20] N Sultan ldquoCloud computing for education a new dawnrdquoInternational Journal of Information Management vol 30 no2 pp 109ndash116 2010

[21] W Zhang and Q Chen ldquoFrom E-government to C-governmentvia cloud computingrdquo in Proceedings of the 1st InternationalConference on E-Business and E-Government (ICEE rsquo10) pp679ndash682 IEEE May 2010

[22] A Fernandez S del Rıo F Herrera and J M Benıtez ldquoAnoverview on the structure and applications for business intel-ligence and data mining in cloud computingrdquo in Proceedingsof the 7th International Conference on Knowledge Managementin Organizations Service and Cloud Computing pp 559ndash570Springer Berlin Germany January 2013

[23] S P Ahuja S Mani and J Zambrano ldquoA survey of the state ofcloud computing in healthcarerdquo Network and CommunicationTechnologies vol 1 no 2 pp 12ndash19 2012

[24] W Li J Yan Y Yan and J Zhang ldquoXbase cloud-enabledinformation appliance for healthcarerdquo in Proceedings of the 13thInternational Conference on Extending Database TechnologyAdvances in Database Technology (EDBT rsquo10) pp 675ndash680March 2010

[25] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[26] T Oinn M Addis J Ferris et al ldquoTaverna a tool forthe composition and enactment of bioinformatics workflowsrdquoBioinformatics vol 20 no 17 pp 3045ndash3054 2004

[27] BGiardine C Riemer R CHardison et al ldquoGalaxy a platformfor interactive large-scale genome analysisrdquo Genome Researchvol 15 no 10 pp 1451ndash1455 2005

[28] M C Schatz ldquoCloudBurst highly sensitive read mapping withMapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009

[29] B D OrsquoConnor B Merriman and S F Nelson ldquoSeqWare queryengine storing and searching sequence data in the cloudrdquo BMCBioinformatics vol 11 supplement 12 article S2 2010

[30] D Hong A Rhie S Park et al ldquoFX an RNA-seq analysis toolon the cloudrdquo Bioinformatics vol 28 no 5 pp 721ndash723 2012

[31] S V Angiuoli M Matalka A Gussman et al ldquoCloVR a virtualmachine for automated and portable sequence analysis from thedesktop using cloud computingrdquo BMC Bioinformatics vol 12no 1 article 356 2011

[32] R Zhang and L Liu ldquoSecurity models and requirements forhealthcare application cloudsrdquo in Proceedings of the 3rd IEEEInternational Conference on Cloud Computing (CLOUD rsquo10) pp268ndash275 IEEE July 2010

[33] R Wooten R Klink F Sinek Y Bai and M Sharma ldquoDesignand implementation of a secure healthcare social cloud systemrdquoin Proceedings of the 12th IEEEACM International Symposiumon Cluster Cloud and Grid Computing (CCGrid rsquo12) pp 805ndash810 IEEE May 2012

10 The Scientific World Journal

[34] W Liu and E K Park ldquoE-healthcare cloud computing appli-cation solutions cloud-enabling characteristics challenges andadaptationsrdquo in Proceedings of the International Conference onComputing Networking and Communications (ICNC rsquo13) pp437ndash443 IEEE January 2013

[35] T Giles and R Wilcox ldquoIBMWatson and medical records textanalyticsrdquo in Proceedings of the 2011 Healthcare Information andManagement Systems Society Meeting 2011

[36] A Abbasi C Albrecht A Vance and J Hansen ldquoMetafrauda meta-learning framework for detecting financial fraudrdquo MISQuarterly vol 36 no 4 pp 1293ndash1327 2012

[37] C Lemke and B Gabrys ldquoMeta-learning for time series fore-casting and forecast combinationrdquo Neurocomputing vol 73 no10-12 pp 2006ndash2016 2010

[38] N Iam-On T Boongoen and S Garrett ldquoLCE a link-basedcluster ensemble method for improved gene expression dataanalysisrdquo Bioinformatics vol 26 no 12 pp 1513ndash1519 2010

[39] K A Smith-Miles ldquoTowards insightful algorithm selection foroptimisation using meta-learning conceptsrdquo in Proceedings ofthe (IEEE World Congress on Computational Intelligence) IEEEInternational Joint Conference on Neural Networks (IJCNN rsquo08)pp 4118ndash4124 IEEE Hong Kong June 2008

[40] K A Smith-Miles ldquoCross-disciplinary perspectives on meta-learning for algorithm selectionrdquo ACMComputing Surveys vol41 no 1 article 6 2008

[41] B Delibasic K Kirchner J Ruhland M Jovanovic and MVukicevic ldquoReusable components for partitioning clusteringalgorithmsrdquo Artificial Intelligence Review vol 32 no 1ndash4 pp59ndash75 2009

[42] B Delibasic M Vukicevic M Jovanovic K Kirchner J Ruh-land and M Suknovic ldquoAn architecture for component-baseddesign of representative-based clustering algorithmsrdquoData andKnowledge Engineering vol 75 pp 78ndash98 2012

[43] M Suknovic B Delibasic M Jovanovic M Vukicevic DBecejski-Vujaklija and Z Obradovic ldquoReusable components indecision tree induction algorithmsrdquo Computational Statisticsvol 27 no 1 pp 127ndash148 2012

[44] M Vukicevic B Delibasic M Jovanovic M Suknovic and ZObradovic ldquoInternal evaluation measures as proxies for exter-nal indices in clustering gene expression datardquo in Proceedingsof the IEEE International Conference on Bioinformatics andBiomedicine (BIBM rsquo11) pp 574ndash577 IEEE Computer Society2011

[45] M Vukicevic K Kirchner B Delibasic M Jovanovic J Ruh-land and M Suknovic ldquoFinding best algorithmic componentsfor clustering microarray datardquo Knowledge and InformationSystems vol 35 no 1 pp 111ndash130 2013

[46] A Nascimiento R Prudencio M de Souto and I CostaldquoMining rules for the automatic selection process of clusteringmethods applied to cancer gene expression datardquo in Proceedingsof the 19th International Conference on Artificial Neural Net-works Part II pp 20ndash29 Springer Limassol Cyprus 2009

[47] N X Vinh J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison variants properties nor-malization and correction for chancerdquo Journal of MachineLearning Research vol 11 pp 2837ndash2854 2010

[48] S Radovanovic M Vukicevic M Jovanovic B Delibasic andM Suknovic ldquoMeta-learning system for clustering gene expres-sion microarray datardquo in Proceedings of the 4th RapidMinerCommunity Meeting and Conference (RCOMM rsquo13) pp 97ndash111Shaker Porto Portugal 2013

[49] A Thusoo J S Sarma N Jain Z Shao P Chakka and SAnthony ldquoHive a warehousing solution over a map-reduceframeworkrdquo Proceedings of the VLDB Endowment vol 2 no 2pp 1626ndash1629 2009

[50] R Anil T Dunning and E Friedman Mahout in ActionManning 2011

[51] V Stantchev andMMalek ldquoAddressing dependability through-out the SOA life cyclerdquo IEEE Transactions on Services Comput-ing vol 4 no 2 pp 85ndash95 2011

[52] I Mierswa M Wurst R Klinkenberg M Scholz and T EulerldquoYALE rapid prototyping for complex data mining tasksrdquo inProceedings of the 12th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo06) pp 935ndash940 ACM August 2006

[53] Development Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2005

[54] Z Prekopcsak G Makrai T Henk and C Gaspar-PapanekldquoRadoop analyzing big data with rapidminer and hadooprdquo inProceedings of the 2nd RapidMiner Community Meeting andConference (RCOMM rsquo11) June 2011

[55] J N van Rijn B Bischl L Torgo B Gao V Umaashankarand S Fischer ldquoOpenML a collaborative science platformrdquo inMachine Learning and Knowledge Discovery in Databases pp645ndash649 Springer Berlin Germany 2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 6: Research Article Cloud Based Metalearning System for Predictive …downloads.hindawi.com/journals/tswj/2014/859279.pdf · 2019. 7. 31. · paradigm in order to satisfy global demands

6 The Scientific World Journal

Gene expressionclustring

Cell typeclassifcation

Ranking and selection

Performance 120587

AlgorithmA

Reusablecomponents

(RCs)

PCADIANA

Initializerepresentatives

CITYCORREL

RMSEAbsolute error Accuracy

Precision

AICBIC

XB

MEANMEDIAN

ONLINE

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

LgE

BCPMV

NRE

K-NN

Algorithm space (A)

Evaluate cluster Updaterepresentatives

Measure distance

Algorithm

fe PCA-CITY-MEAN-XBltgt a 120587(a x) gt 120587( x)

a = combination of RC

Performance space (Y)

Task x

Feature extraction (f)

Meta-attribute space (M)Problem space (P)

selection (S)

A(x) = S(f(x))

a998400 a998400

Figure 2 Extended metalearning system for clustering biomedical data

Figure 3 Main process for finding the best model for prediction ofAMI

Figure 4 Stream for application of metalearning system on newcases

using 10-fold cross validation where the above-mentionedfive algorithms were used Every trained model is saved onhard disk which allows its reusability

Automatic application of the selected (in this case SVM)model is presented in Figure 4 After every update of themodel or after inserting new dataset this process needs to beupdated Savedmodel in this case SVM is loaded and appliedon new dataset Results gathered are sorted and exportedDetailed information can be found in Radovanovic et al [48]

4 Extended Cloud Based Model forBig Data Analysis

Dai et al [16] addressed the problems of storing and analysisof biomedical data by proposing a cloud based model foranalysis of biomedical data and it is composed of four servicecategories

(i) data as a service (DaaS)(ii) software as a service (SaaS)(iii) platform as a service (PaaS)(iv) infrastructure as a service (IaaS)

DaaS is group of cloud services which enables on demanddata access and provides up-to-date data that are accessibleby a wide range of devices that are connected over theweb In case of bioinformatics AmazonWeb Services (AWS)provides repository of public archives (data sets) includingGenBank Unigene Ensembl and Influenza Virus which canbe accessed from cloud based applications

Since bioinformatics requires a large diversity of softwaretools for data analyses the task of SaaS in bioinformatics is todeliver and enable remote access to software services onlineThus installation of software tools on desktop computer isno longer required Another one advantage of using SaaS isenablingmuch easier collaboration betweendispersed groupsof users

Platform as a service (PaaS) should offer a programmableenvironment for users in order to develop test and deploycloud applications Computer resources scale automatically

The Scientific World Journal 7

Daas

SaaS

PaaS

IaaS

KaaS

∙ Public datasets∙

∙ Tools∙ Pipelines

∙ Analysis platforms∙ Programming environments

∙ Virtual machines∙ Virtualized resources

∙ Domain experts∙ Knowledge networks

Biological datasets

Figure 5 Cloud based system [16] integrated with KaaS extension[17] for analysis of biomedical data

without user interference In most PaaS services besidesprogramming environment database and web server areavailable

Since most medical institutions do not have computingresources such as CPU IaaS offers a full computer infras-tructure delivering virtualized resources Those virtualizedresources can be operating systems RAM CPU or othercomputer resource

Lai et al [17] introduced a new service model in cloudcomputingmdashthe knowledge as a service (KaaS) that facilitatesthe interoperations amongmembers in a knowledge networkExtended model is depicted on Figure 5

This new service model relies on data created in acollaboration process of domain experts It was recognizedas a new form of cloud service and categorized as KaaS Ina nutshell such approach is data driven and lacks higherorder structure in order to provide knowledge as a serviceThis implies that KaaS in this form is provided by humandomain experts We extend this class of services with dataand model approach By combining data driven models andexpert knowledge that is stored in unstructured data likedocuments notes and collaborations knowledge is createdand offered to cloud users Specifically we extend models of[16 17] by including big data technologies platforms for datamining andmetamodels for ranking and selection of the bestalgorithms for biomedical data mining System componentsare displayed on a diagram (Figure 6) and classified accordingto service type directed towards end user Big data engineprovides data storage and access and it is a basis of each classof services (diagram center)

Key component of KaaS is metalearning algorithms andselected algorithms for clustering biomedical data Bothare represented by algorithm space component and bothare accompanied by their describing metamodels Thesemetamodels are used in runtime for ranking and selection

of best algorithms for the new problem (dataset) Experi-mental results provide knowledge on performances of eachalgorithm with meta-attributes on each dataset Data flowsare kept using big data engine such as Hadoop Segmentof experimental results component lies on DaaS since thesemeta-attributes can be provided as a data service DaaSprovides biomedical data It is divided into protected andpublic segments

Cloud approach provides data accumulation and higheravailability of data to interested parties (with rights of accessimplied) Interested parties can be found among not onlymedical employees and researchers but also community usingdifferent software tools to access data using DaaS

Central part of the circle (Figure 6) contains componentsof a big data engine (HDFS Hive and Mahout) where allthe data are centralized (medical data metadata and algo-rithms performances) Additionally big data engine providesinterfaces for data manipulation and analysis Apache Hive[49] is data warehouse software built on top of Hadoopused for querying and managing large datasets residing indistributed storage Apache Mahout [50] is also built ontop of Hadoop and is used as a scalable machine learninglibrary for classification clustering recommender systemsand dimension reduction

Our solution provides software for data analysis in aservice form (SaaS) with respect of SOA dependability [51]Additionally third party software can easily become anintegrated part of SaaS (RapidMiner [52] R [53] or others)

These software solutions are recommended because theyhave a direct interface for access and analysis of big data(eg Radoop [54] allows using visual RapidMiner interfaceand has operators that run distributed algorithms based onHadoop Hive [49] and Mahout [50] without writing anycode)

Most medical institutions require not only computingresources such as CPUs but also communication infras-tructure these components are offered as a part of IaaSin a form of virtualized resources Higher level of servicerests on using platform such as variety of operating systemsBut platform can also provide tools for development ofalgorithms and applications (Eclipse Netbeans RapidMinerR ) The circle is closed by development of new algorithmsand also for algorithm deployment and execution For thisreason algorithms space component is partly situated in PaaSspace

5 Conclusion and Future Research

In this paper we proposed a cloud based architecture forstoring analysis and predictive modeling of biomedical bigdata Existing service based cloud architecture is extended byincluding metalearning system as a data and model drivenknowledge service As a part of the proposed architecturewe provided a support for community based gathering ofdata and algorithms that is an important precondition forquality of metalearning Advancement of this research areaand adding new value are enabled through platform for devel-opment and execution of distributed data mining processes

8 The Scientific World Journal

Big data engine

Meta modelsmodels

Experimentalresults

Algorit

hms s

pace M

edical data

Publicdata

Protected

data

Virtual machines

Virtual network

Ope

ratin

g sy

stem

s(li

nux

Mac

OS Reporting

sofware(JasperReports Eclipse

PaaS

IaaS

KaaS

Daas

SaaS

BIRT OpenOffice

Win)

)

Figure 6 Cloud based system for predictive modeling of biomedical data

and algorithms Finally we provided data and model drivendecision support on selecting best algorithms for workingwith biomedical data

Retrospectively proposed solution focuses on a specifictype of biomedical data while other types still remain tobe included and evaluated Data security and privacy stillremains a concern to be taken into amore serious account Inorder to provide even further impact in research communityadditional work is necessary on providing interoperabilityamong potential open source components

System was tested on microarray gene expression datawith specific meta-attributes for this data type (eg chiptype) Further efforts will be made to include other typesof biomedical data This will be done by identifying spe-cific meta-attributes that fit newly included data typesAdditionally as a further work integration with OpenML[55] platform used for storing and gathering datasets and

clustering algorithm runs is plannedThis platform providesa base for a community to share experiments algorithms anddata Significant clustering algorithm meta-attributes can beextracted and used for updating our metalearning system

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research is partially funded by Grants from the SerbianMinistry of Science and Technological Development Con-tract nos III 41008 and TR 32013

The Scientific World Journal 9

References

[1] K Kincade ldquoData mining digging for healthcare goldrdquo Insur-ance and Technology vol 23 no 2 pp IM2ndashIM7 1998

[2] H C Koh andG Tan ldquoDatamining applications in healthcarerdquoJournal ofHealthcare InformationManagement vol 19 no 2 pp64ndash72 2011

[3] K Srinivas B K Rani and A Govrdhan ldquoApplications ofdata mining techniques in healthcare and prediction of heartattacksrdquo International Journal on Computer Science and Engi-neering vol 2 no 2 pp 250ndash255 2010

[4] P Maji ldquoMutual information-based supervised attribute clus-tering for microarray sample classificationrdquo IEEE Transactionson Knowledge and Data Engineering vol 24 no 1 pp 127ndash1402012

[5] R Das ldquoA comparison of multiple classification methods fordiagnosis of Parkinson diseaserdquo Expert Systems with Applica-tions vol 37 no 2 pp 1568ndash1572 2010

[6] A Thomas N H Patterson M M Marcinkiewicz A LazarisP Metrakos and P Chaurand ldquoHistology-driven data miningof lipid signatures from multiple imaging mass spectrometryanalyses application to human colorectal cancer liver metasta-sis biopsiesrdquoAnalytical Chemistry vol 85 no 5 pp 2860ndash28662013

[7] R L Grossman and K P White ldquoA vision for a biomedicalcloudrdquo Journal of Internal Medicine vol 271 no 2 pp 122ndash1302012

[8] M Vukicevic B Delibasic Z Obradovic M Jovanovic andMSuknovic ldquoA method for design of data-tailored partitioningalgorithms for optimizing the number of clusters in microarrayanalysisrdquo in Proceedings of the IEEE Symposium on Computa-tional Intelligence in Bioinformatics and Computational Biology(CIBCB rsquo12) pp 252ndash259 IEEE San Diego Calif USA 2012

[9] H Chae I Jung H Lee S Marru S W Lee and S Kim ldquoBioand health informatics meets cloud BioVLab as an examplerdquoHealth Information Science and Systems vol 1 no 1 article 62013

[10] J Sun and C K Reddy ldquoBig data analytics for healthcarerdquo inProceedings of the 19th ACM SIGKDD international conferenceon Knowledge discovery and data mining pp 1525ndash1525 ACMAugust 2013

[11] M Matijas J A Suykens and S Krajcar ldquoLoad forecastingusing a multivariate meta-learning systemrdquo Expert SystemsWith Applications vol 40 no 11 pp 4427ndash4437 2013

[12] S Zhou K K Lai and J Yen ldquoA dynamic meta-learning rate-based model for gold market forecastingrdquo Expert Systems withApplications vol 39 no 6 pp 6168ndash6173 2012

[13] J Kanda C Soares E Hruschka and A de Carvalho ldquoA meta-learning approach to select meta-heuristics for the travelingsalesman problem using MLP-based label rankingrdquo in NeuralInformation Processing pp 488ndash495 Springer Berlin Germany2012

[14] A Attig and P Perner ldquoMeta-learning for image processingbased on case-based reasoningrdquo in Computational Intelligencein Healthcare vol 4 pp 229ndash264 Springer Berlin Germany2010

[15] AVera-Baquero R Colomo-Palacios andOMolloy ldquoBusinessprocess analytics using a big data approachrdquo IT Professional vol15 no 6 pp 29ndash35 2013

[16] L Dai X Gao Y Guo J Xiao and Z Zhang ldquoBioinformaticsclouds for big data manipulationrdquo Biology Direct vol 7 no 1article 43 2012

[17] I K Lai S K Tam and M F Chan ldquoKnowledge cloud systemfor network collaboration a case study in medical serviceindustry in Chinardquo Expert Systems with Applications vol 39 no15 pp 12205ndash12212 2012

[18] S Sonnenburg M L Braun S O Cheng et al ldquoThe need foropen source software in machine learningrdquo Journal of MachineLearning Research vol 8 pp 2443ndash2466 2007

[19] S Marston Z Li S Bandyopadhyay J Zhang and A GhalsasildquoCloud computing the business perspectiverdquo Decision SupportSystems vol 51 no 1 pp 176ndash189 2011

[20] N Sultan ldquoCloud computing for education a new dawnrdquoInternational Journal of Information Management vol 30 no2 pp 109ndash116 2010

[21] W Zhang and Q Chen ldquoFrom E-government to C-governmentvia cloud computingrdquo in Proceedings of the 1st InternationalConference on E-Business and E-Government (ICEE rsquo10) pp679ndash682 IEEE May 2010

[22] A Fernandez S del Rıo F Herrera and J M Benıtez ldquoAnoverview on the structure and applications for business intel-ligence and data mining in cloud computingrdquo in Proceedingsof the 7th International Conference on Knowledge Managementin Organizations Service and Cloud Computing pp 559ndash570Springer Berlin Germany January 2013

[23] S P Ahuja S Mani and J Zambrano ldquoA survey of the state ofcloud computing in healthcarerdquo Network and CommunicationTechnologies vol 1 no 2 pp 12ndash19 2012

[24] W Li J Yan Y Yan and J Zhang ldquoXbase cloud-enabledinformation appliance for healthcarerdquo in Proceedings of the 13thInternational Conference on Extending Database TechnologyAdvances in Database Technology (EDBT rsquo10) pp 675ndash680March 2010

[25] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[26] T Oinn M Addis J Ferris et al ldquoTaverna a tool forthe composition and enactment of bioinformatics workflowsrdquoBioinformatics vol 20 no 17 pp 3045ndash3054 2004

[27] BGiardine C Riemer R CHardison et al ldquoGalaxy a platformfor interactive large-scale genome analysisrdquo Genome Researchvol 15 no 10 pp 1451ndash1455 2005

[28] M C Schatz ldquoCloudBurst highly sensitive read mapping withMapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009

[29] B D OrsquoConnor B Merriman and S F Nelson ldquoSeqWare queryengine storing and searching sequence data in the cloudrdquo BMCBioinformatics vol 11 supplement 12 article S2 2010

[30] D Hong A Rhie S Park et al ldquoFX an RNA-seq analysis toolon the cloudrdquo Bioinformatics vol 28 no 5 pp 721ndash723 2012

[31] S V Angiuoli M Matalka A Gussman et al ldquoCloVR a virtualmachine for automated and portable sequence analysis from thedesktop using cloud computingrdquo BMC Bioinformatics vol 12no 1 article 356 2011

[32] R Zhang and L Liu ldquoSecurity models and requirements forhealthcare application cloudsrdquo in Proceedings of the 3rd IEEEInternational Conference on Cloud Computing (CLOUD rsquo10) pp268ndash275 IEEE July 2010

[33] R Wooten R Klink F Sinek Y Bai and M Sharma ldquoDesignand implementation of a secure healthcare social cloud systemrdquoin Proceedings of the 12th IEEEACM International Symposiumon Cluster Cloud and Grid Computing (CCGrid rsquo12) pp 805ndash810 IEEE May 2012

10 The Scientific World Journal

[34] W Liu and E K Park ldquoE-healthcare cloud computing appli-cation solutions cloud-enabling characteristics challenges andadaptationsrdquo in Proceedings of the International Conference onComputing Networking and Communications (ICNC rsquo13) pp437ndash443 IEEE January 2013

[35] T Giles and R Wilcox ldquoIBMWatson and medical records textanalyticsrdquo in Proceedings of the 2011 Healthcare Information andManagement Systems Society Meeting 2011

[36] A Abbasi C Albrecht A Vance and J Hansen ldquoMetafrauda meta-learning framework for detecting financial fraudrdquo MISQuarterly vol 36 no 4 pp 1293ndash1327 2012

[37] C Lemke and B Gabrys ldquoMeta-learning for time series fore-casting and forecast combinationrdquo Neurocomputing vol 73 no10-12 pp 2006ndash2016 2010

[38] N Iam-On T Boongoen and S Garrett ldquoLCE a link-basedcluster ensemble method for improved gene expression dataanalysisrdquo Bioinformatics vol 26 no 12 pp 1513ndash1519 2010

[39] K A Smith-Miles ldquoTowards insightful algorithm selection foroptimisation using meta-learning conceptsrdquo in Proceedings ofthe (IEEE World Congress on Computational Intelligence) IEEEInternational Joint Conference on Neural Networks (IJCNN rsquo08)pp 4118ndash4124 IEEE Hong Kong June 2008

[40] K A Smith-Miles ldquoCross-disciplinary perspectives on meta-learning for algorithm selectionrdquo ACMComputing Surveys vol41 no 1 article 6 2008

[41] B Delibasic K Kirchner J Ruhland M Jovanovic and MVukicevic ldquoReusable components for partitioning clusteringalgorithmsrdquo Artificial Intelligence Review vol 32 no 1ndash4 pp59ndash75 2009

[42] B Delibasic M Vukicevic M Jovanovic K Kirchner J Ruh-land and M Suknovic ldquoAn architecture for component-baseddesign of representative-based clustering algorithmsrdquoData andKnowledge Engineering vol 75 pp 78ndash98 2012

[43] M Suknovic B Delibasic M Jovanovic M Vukicevic DBecejski-Vujaklija and Z Obradovic ldquoReusable components indecision tree induction algorithmsrdquo Computational Statisticsvol 27 no 1 pp 127ndash148 2012

[44] M Vukicevic B Delibasic M Jovanovic M Suknovic and ZObradovic ldquoInternal evaluation measures as proxies for exter-nal indices in clustering gene expression datardquo in Proceedingsof the IEEE International Conference on Bioinformatics andBiomedicine (BIBM rsquo11) pp 574ndash577 IEEE Computer Society2011

[45] M Vukicevic K Kirchner B Delibasic M Jovanovic J Ruh-land and M Suknovic ldquoFinding best algorithmic componentsfor clustering microarray datardquo Knowledge and InformationSystems vol 35 no 1 pp 111ndash130 2013

[46] A Nascimiento R Prudencio M de Souto and I CostaldquoMining rules for the automatic selection process of clusteringmethods applied to cancer gene expression datardquo in Proceedingsof the 19th International Conference on Artificial Neural Net-works Part II pp 20ndash29 Springer Limassol Cyprus 2009

[47] N X Vinh J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison variants properties nor-malization and correction for chancerdquo Journal of MachineLearning Research vol 11 pp 2837ndash2854 2010

[48] S Radovanovic M Vukicevic M Jovanovic B Delibasic andM Suknovic ldquoMeta-learning system for clustering gene expres-sion microarray datardquo in Proceedings of the 4th RapidMinerCommunity Meeting and Conference (RCOMM rsquo13) pp 97ndash111Shaker Porto Portugal 2013

[49] A Thusoo J S Sarma N Jain Z Shao P Chakka and SAnthony ldquoHive a warehousing solution over a map-reduceframeworkrdquo Proceedings of the VLDB Endowment vol 2 no 2pp 1626ndash1629 2009

[50] R Anil T Dunning and E Friedman Mahout in ActionManning 2011

[51] V Stantchev andMMalek ldquoAddressing dependability through-out the SOA life cyclerdquo IEEE Transactions on Services Comput-ing vol 4 no 2 pp 85ndash95 2011

[52] I Mierswa M Wurst R Klinkenberg M Scholz and T EulerldquoYALE rapid prototyping for complex data mining tasksrdquo inProceedings of the 12th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo06) pp 935ndash940 ACM August 2006

[53] Development Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2005

[54] Z Prekopcsak G Makrai T Henk and C Gaspar-PapanekldquoRadoop analyzing big data with rapidminer and hadooprdquo inProceedings of the 2nd RapidMiner Community Meeting andConference (RCOMM rsquo11) June 2011

[55] J N van Rijn B Bischl L Torgo B Gao V Umaashankarand S Fischer ldquoOpenML a collaborative science platformrdquo inMachine Learning and Knowledge Discovery in Databases pp645ndash649 Springer Berlin Germany 2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 7: Research Article Cloud Based Metalearning System for Predictive …downloads.hindawi.com/journals/tswj/2014/859279.pdf · 2019. 7. 31. · paradigm in order to satisfy global demands

The Scientific World Journal 7

Daas

SaaS

PaaS

IaaS

KaaS

∙ Public datasets∙

∙ Tools∙ Pipelines

∙ Analysis platforms∙ Programming environments

∙ Virtual machines∙ Virtualized resources

∙ Domain experts∙ Knowledge networks

Biological datasets

Figure 5 Cloud based system [16] integrated with KaaS extension[17] for analysis of biomedical data

without user interference In most PaaS services besidesprogramming environment database and web server areavailable

Since most medical institutions do not have computingresources such as CPU IaaS offers a full computer infras-tructure delivering virtualized resources Those virtualizedresources can be operating systems RAM CPU or othercomputer resource

Lai et al [17] introduced a new service model in cloudcomputingmdashthe knowledge as a service (KaaS) that facilitatesthe interoperations amongmembers in a knowledge networkExtended model is depicted on Figure 5

This new service model relies on data created in acollaboration process of domain experts It was recognizedas a new form of cloud service and categorized as KaaS Ina nutshell such approach is data driven and lacks higherorder structure in order to provide knowledge as a serviceThis implies that KaaS in this form is provided by humandomain experts We extend this class of services with dataand model approach By combining data driven models andexpert knowledge that is stored in unstructured data likedocuments notes and collaborations knowledge is createdand offered to cloud users Specifically we extend models of[16 17] by including big data technologies platforms for datamining andmetamodels for ranking and selection of the bestalgorithms for biomedical data mining System componentsare displayed on a diagram (Figure 6) and classified accordingto service type directed towards end user Big data engineprovides data storage and access and it is a basis of each classof services (diagram center)

Key component of KaaS is metalearning algorithms andselected algorithms for clustering biomedical data Bothare represented by algorithm space component and bothare accompanied by their describing metamodels Thesemetamodels are used in runtime for ranking and selection

of best algorithms for the new problem (dataset) Experi-mental results provide knowledge on performances of eachalgorithm with meta-attributes on each dataset Data flowsare kept using big data engine such as Hadoop Segmentof experimental results component lies on DaaS since thesemeta-attributes can be provided as a data service DaaSprovides biomedical data It is divided into protected andpublic segments

Cloud approach provides data accumulation and higheravailability of data to interested parties (with rights of accessimplied) Interested parties can be found among not onlymedical employees and researchers but also community usingdifferent software tools to access data using DaaS

Central part of the circle (Figure 6) contains componentsof a big data engine (HDFS Hive and Mahout) where allthe data are centralized (medical data metadata and algo-rithms performances) Additionally big data engine providesinterfaces for data manipulation and analysis Apache Hive[49] is data warehouse software built on top of Hadoopused for querying and managing large datasets residing indistributed storage Apache Mahout [50] is also built ontop of Hadoop and is used as a scalable machine learninglibrary for classification clustering recommender systemsand dimension reduction

Our solution provides software for data analysis in aservice form (SaaS) with respect of SOA dependability [51]Additionally third party software can easily become anintegrated part of SaaS (RapidMiner [52] R [53] or others)

These software solutions are recommended because theyhave a direct interface for access and analysis of big data(eg Radoop [54] allows using visual RapidMiner interfaceand has operators that run distributed algorithms based onHadoop Hive [49] and Mahout [50] without writing anycode)

Most medical institutions require not only computingresources such as CPUs but also communication infras-tructure these components are offered as a part of IaaSin a form of virtualized resources Higher level of servicerests on using platform such as variety of operating systemsBut platform can also provide tools for development ofalgorithms and applications (Eclipse Netbeans RapidMinerR ) The circle is closed by development of new algorithmsand also for algorithm deployment and execution For thisreason algorithms space component is partly situated in PaaSspace

5 Conclusion and Future Research

In this paper we proposed a cloud based architecture forstoring analysis and predictive modeling of biomedical bigdata Existing service based cloud architecture is extended byincluding metalearning system as a data and model drivenknowledge service As a part of the proposed architecturewe provided a support for community based gathering ofdata and algorithms that is an important precondition forquality of metalearning Advancement of this research areaand adding new value are enabled through platform for devel-opment and execution of distributed data mining processes

8 The Scientific World Journal

Big data engine

Meta modelsmodels

Experimentalresults

Algorit

hms s

pace M

edical data

Publicdata

Protected

data

Virtual machines

Virtual network

Ope

ratin

g sy

stem

s(li

nux

Mac

OS Reporting

sofware(JasperReports Eclipse

PaaS

IaaS

KaaS

Daas

SaaS

BIRT OpenOffice

Win)

)

Figure 6 Cloud based system for predictive modeling of biomedical data

and algorithms Finally we provided data and model drivendecision support on selecting best algorithms for workingwith biomedical data

Retrospectively proposed solution focuses on a specifictype of biomedical data while other types still remain tobe included and evaluated Data security and privacy stillremains a concern to be taken into amore serious account Inorder to provide even further impact in research communityadditional work is necessary on providing interoperabilityamong potential open source components

System was tested on microarray gene expression datawith specific meta-attributes for this data type (eg chiptype) Further efforts will be made to include other typesof biomedical data This will be done by identifying spe-cific meta-attributes that fit newly included data typesAdditionally as a further work integration with OpenML[55] platform used for storing and gathering datasets and

clustering algorithm runs is plannedThis platform providesa base for a community to share experiments algorithms anddata Significant clustering algorithm meta-attributes can beextracted and used for updating our metalearning system

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research is partially funded by Grants from the SerbianMinistry of Science and Technological Development Con-tract nos III 41008 and TR 32013

The Scientific World Journal 9

References

[1] K Kincade ldquoData mining digging for healthcare goldrdquo Insur-ance and Technology vol 23 no 2 pp IM2ndashIM7 1998

[2] H C Koh andG Tan ldquoDatamining applications in healthcarerdquoJournal ofHealthcare InformationManagement vol 19 no 2 pp64ndash72 2011

[3] K Srinivas B K Rani and A Govrdhan ldquoApplications ofdata mining techniques in healthcare and prediction of heartattacksrdquo International Journal on Computer Science and Engi-neering vol 2 no 2 pp 250ndash255 2010

[4] P Maji ldquoMutual information-based supervised attribute clus-tering for microarray sample classificationrdquo IEEE Transactionson Knowledge and Data Engineering vol 24 no 1 pp 127ndash1402012

[5] R Das ldquoA comparison of multiple classification methods fordiagnosis of Parkinson diseaserdquo Expert Systems with Applica-tions vol 37 no 2 pp 1568ndash1572 2010

[6] A Thomas N H Patterson M M Marcinkiewicz A LazarisP Metrakos and P Chaurand ldquoHistology-driven data miningof lipid signatures from multiple imaging mass spectrometryanalyses application to human colorectal cancer liver metasta-sis biopsiesrdquoAnalytical Chemistry vol 85 no 5 pp 2860ndash28662013

[7] R L Grossman and K P White ldquoA vision for a biomedicalcloudrdquo Journal of Internal Medicine vol 271 no 2 pp 122ndash1302012

[8] M Vukicevic B Delibasic Z Obradovic M Jovanovic andMSuknovic ldquoA method for design of data-tailored partitioningalgorithms for optimizing the number of clusters in microarrayanalysisrdquo in Proceedings of the IEEE Symposium on Computa-tional Intelligence in Bioinformatics and Computational Biology(CIBCB rsquo12) pp 252ndash259 IEEE San Diego Calif USA 2012

[9] H Chae I Jung H Lee S Marru S W Lee and S Kim ldquoBioand health informatics meets cloud BioVLab as an examplerdquoHealth Information Science and Systems vol 1 no 1 article 62013

[10] J Sun and C K Reddy ldquoBig data analytics for healthcarerdquo inProceedings of the 19th ACM SIGKDD international conferenceon Knowledge discovery and data mining pp 1525ndash1525 ACMAugust 2013

[11] M Matijas J A Suykens and S Krajcar ldquoLoad forecastingusing a multivariate meta-learning systemrdquo Expert SystemsWith Applications vol 40 no 11 pp 4427ndash4437 2013

[12] S Zhou K K Lai and J Yen ldquoA dynamic meta-learning rate-based model for gold market forecastingrdquo Expert Systems withApplications vol 39 no 6 pp 6168ndash6173 2012

[13] J Kanda C Soares E Hruschka and A de Carvalho ldquoA meta-learning approach to select meta-heuristics for the travelingsalesman problem using MLP-based label rankingrdquo in NeuralInformation Processing pp 488ndash495 Springer Berlin Germany2012

[14] A Attig and P Perner ldquoMeta-learning for image processingbased on case-based reasoningrdquo in Computational Intelligencein Healthcare vol 4 pp 229ndash264 Springer Berlin Germany2010

[15] AVera-Baquero R Colomo-Palacios andOMolloy ldquoBusinessprocess analytics using a big data approachrdquo IT Professional vol15 no 6 pp 29ndash35 2013

[16] L Dai X Gao Y Guo J Xiao and Z Zhang ldquoBioinformaticsclouds for big data manipulationrdquo Biology Direct vol 7 no 1article 43 2012

[17] I K Lai S K Tam and M F Chan ldquoKnowledge cloud systemfor network collaboration a case study in medical serviceindustry in Chinardquo Expert Systems with Applications vol 39 no15 pp 12205ndash12212 2012

[18] S Sonnenburg M L Braun S O Cheng et al ldquoThe need foropen source software in machine learningrdquo Journal of MachineLearning Research vol 8 pp 2443ndash2466 2007

[19] S Marston Z Li S Bandyopadhyay J Zhang and A GhalsasildquoCloud computing the business perspectiverdquo Decision SupportSystems vol 51 no 1 pp 176ndash189 2011

[20] N Sultan ldquoCloud computing for education a new dawnrdquoInternational Journal of Information Management vol 30 no2 pp 109ndash116 2010

[21] W Zhang and Q Chen ldquoFrom E-government to C-governmentvia cloud computingrdquo in Proceedings of the 1st InternationalConference on E-Business and E-Government (ICEE rsquo10) pp679ndash682 IEEE May 2010

[22] A Fernandez S del Rıo F Herrera and J M Benıtez ldquoAnoverview on the structure and applications for business intel-ligence and data mining in cloud computingrdquo in Proceedingsof the 7th International Conference on Knowledge Managementin Organizations Service and Cloud Computing pp 559ndash570Springer Berlin Germany January 2013

[23] S P Ahuja S Mani and J Zambrano ldquoA survey of the state ofcloud computing in healthcarerdquo Network and CommunicationTechnologies vol 1 no 2 pp 12ndash19 2012

[24] W Li J Yan Y Yan and J Zhang ldquoXbase cloud-enabledinformation appliance for healthcarerdquo in Proceedings of the 13thInternational Conference on Extending Database TechnologyAdvances in Database Technology (EDBT rsquo10) pp 675ndash680March 2010

[25] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[26] T Oinn M Addis J Ferris et al ldquoTaverna a tool forthe composition and enactment of bioinformatics workflowsrdquoBioinformatics vol 20 no 17 pp 3045ndash3054 2004

[27] BGiardine C Riemer R CHardison et al ldquoGalaxy a platformfor interactive large-scale genome analysisrdquo Genome Researchvol 15 no 10 pp 1451ndash1455 2005

[28] M C Schatz ldquoCloudBurst highly sensitive read mapping withMapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009

[29] B D OrsquoConnor B Merriman and S F Nelson ldquoSeqWare queryengine storing and searching sequence data in the cloudrdquo BMCBioinformatics vol 11 supplement 12 article S2 2010

[30] D Hong A Rhie S Park et al ldquoFX an RNA-seq analysis toolon the cloudrdquo Bioinformatics vol 28 no 5 pp 721ndash723 2012

[31] S V Angiuoli M Matalka A Gussman et al ldquoCloVR a virtualmachine for automated and portable sequence analysis from thedesktop using cloud computingrdquo BMC Bioinformatics vol 12no 1 article 356 2011

[32] R Zhang and L Liu ldquoSecurity models and requirements forhealthcare application cloudsrdquo in Proceedings of the 3rd IEEEInternational Conference on Cloud Computing (CLOUD rsquo10) pp268ndash275 IEEE July 2010

[33] R Wooten R Klink F Sinek Y Bai and M Sharma ldquoDesignand implementation of a secure healthcare social cloud systemrdquoin Proceedings of the 12th IEEEACM International Symposiumon Cluster Cloud and Grid Computing (CCGrid rsquo12) pp 805ndash810 IEEE May 2012

10 The Scientific World Journal

[34] W Liu and E K Park ldquoE-healthcare cloud computing appli-cation solutions cloud-enabling characteristics challenges andadaptationsrdquo in Proceedings of the International Conference onComputing Networking and Communications (ICNC rsquo13) pp437ndash443 IEEE January 2013

[35] T Giles and R Wilcox ldquoIBMWatson and medical records textanalyticsrdquo in Proceedings of the 2011 Healthcare Information andManagement Systems Society Meeting 2011

[36] A Abbasi C Albrecht A Vance and J Hansen ldquoMetafrauda meta-learning framework for detecting financial fraudrdquo MISQuarterly vol 36 no 4 pp 1293ndash1327 2012

[37] C Lemke and B Gabrys ldquoMeta-learning for time series fore-casting and forecast combinationrdquo Neurocomputing vol 73 no10-12 pp 2006ndash2016 2010

[38] N Iam-On T Boongoen and S Garrett ldquoLCE a link-basedcluster ensemble method for improved gene expression dataanalysisrdquo Bioinformatics vol 26 no 12 pp 1513ndash1519 2010

[39] K A Smith-Miles ldquoTowards insightful algorithm selection foroptimisation using meta-learning conceptsrdquo in Proceedings ofthe (IEEE World Congress on Computational Intelligence) IEEEInternational Joint Conference on Neural Networks (IJCNN rsquo08)pp 4118ndash4124 IEEE Hong Kong June 2008

[40] K A Smith-Miles ldquoCross-disciplinary perspectives on meta-learning for algorithm selectionrdquo ACMComputing Surveys vol41 no 1 article 6 2008

[41] B Delibasic K Kirchner J Ruhland M Jovanovic and MVukicevic ldquoReusable components for partitioning clusteringalgorithmsrdquo Artificial Intelligence Review vol 32 no 1ndash4 pp59ndash75 2009

[42] B Delibasic M Vukicevic M Jovanovic K Kirchner J Ruh-land and M Suknovic ldquoAn architecture for component-baseddesign of representative-based clustering algorithmsrdquoData andKnowledge Engineering vol 75 pp 78ndash98 2012

[43] M Suknovic B Delibasic M Jovanovic M Vukicevic DBecejski-Vujaklija and Z Obradovic ldquoReusable components indecision tree induction algorithmsrdquo Computational Statisticsvol 27 no 1 pp 127ndash148 2012

[44] M Vukicevic B Delibasic M Jovanovic M Suknovic and ZObradovic ldquoInternal evaluation measures as proxies for exter-nal indices in clustering gene expression datardquo in Proceedingsof the IEEE International Conference on Bioinformatics andBiomedicine (BIBM rsquo11) pp 574ndash577 IEEE Computer Society2011

[45] M Vukicevic K Kirchner B Delibasic M Jovanovic J Ruh-land and M Suknovic ldquoFinding best algorithmic componentsfor clustering microarray datardquo Knowledge and InformationSystems vol 35 no 1 pp 111ndash130 2013

[46] A Nascimiento R Prudencio M de Souto and I CostaldquoMining rules for the automatic selection process of clusteringmethods applied to cancer gene expression datardquo in Proceedingsof the 19th International Conference on Artificial Neural Net-works Part II pp 20ndash29 Springer Limassol Cyprus 2009

[47] N X Vinh J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison variants properties nor-malization and correction for chancerdquo Journal of MachineLearning Research vol 11 pp 2837ndash2854 2010

[48] S Radovanovic M Vukicevic M Jovanovic B Delibasic andM Suknovic ldquoMeta-learning system for clustering gene expres-sion microarray datardquo in Proceedings of the 4th RapidMinerCommunity Meeting and Conference (RCOMM rsquo13) pp 97ndash111Shaker Porto Portugal 2013

[49] A Thusoo J S Sarma N Jain Z Shao P Chakka and SAnthony ldquoHive a warehousing solution over a map-reduceframeworkrdquo Proceedings of the VLDB Endowment vol 2 no 2pp 1626ndash1629 2009

[50] R Anil T Dunning and E Friedman Mahout in ActionManning 2011

[51] V Stantchev andMMalek ldquoAddressing dependability through-out the SOA life cyclerdquo IEEE Transactions on Services Comput-ing vol 4 no 2 pp 85ndash95 2011

[52] I Mierswa M Wurst R Klinkenberg M Scholz and T EulerldquoYALE rapid prototyping for complex data mining tasksrdquo inProceedings of the 12th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo06) pp 935ndash940 ACM August 2006

[53] Development Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2005

[54] Z Prekopcsak G Makrai T Henk and C Gaspar-PapanekldquoRadoop analyzing big data with rapidminer and hadooprdquo inProceedings of the 2nd RapidMiner Community Meeting andConference (RCOMM rsquo11) June 2011

[55] J N van Rijn B Bischl L Torgo B Gao V Umaashankarand S Fischer ldquoOpenML a collaborative science platformrdquo inMachine Learning and Knowledge Discovery in Databases pp645ndash649 Springer Berlin Germany 2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 8: Research Article Cloud Based Metalearning System for Predictive …downloads.hindawi.com/journals/tswj/2014/859279.pdf · 2019. 7. 31. · paradigm in order to satisfy global demands

8 The Scientific World Journal

Big data engine

Meta modelsmodels

Experimentalresults

Algorit

hms s

pace M

edical data

Publicdata

Protected

data

Virtual machines

Virtual network

Ope

ratin

g sy

stem

s(li

nux

Mac

OS Reporting

sofware(JasperReports Eclipse

PaaS

IaaS

KaaS

Daas

SaaS

BIRT OpenOffice

Win)

)

Figure 6 Cloud based system for predictive modeling of biomedical data

and algorithms Finally we provided data and model drivendecision support on selecting best algorithms for workingwith biomedical data

Retrospectively proposed solution focuses on a specifictype of biomedical data while other types still remain tobe included and evaluated Data security and privacy stillremains a concern to be taken into amore serious account Inorder to provide even further impact in research communityadditional work is necessary on providing interoperabilityamong potential open source components

System was tested on microarray gene expression datawith specific meta-attributes for this data type (eg chiptype) Further efforts will be made to include other typesof biomedical data This will be done by identifying spe-cific meta-attributes that fit newly included data typesAdditionally as a further work integration with OpenML[55] platform used for storing and gathering datasets and

clustering algorithm runs is plannedThis platform providesa base for a community to share experiments algorithms anddata Significant clustering algorithm meta-attributes can beextracted and used for updating our metalearning system

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research is partially funded by Grants from the SerbianMinistry of Science and Technological Development Con-tract nos III 41008 and TR 32013

The Scientific World Journal 9

References

[1] K Kincade ldquoData mining digging for healthcare goldrdquo Insur-ance and Technology vol 23 no 2 pp IM2ndashIM7 1998

[2] H C Koh andG Tan ldquoDatamining applications in healthcarerdquoJournal ofHealthcare InformationManagement vol 19 no 2 pp64ndash72 2011

[3] K Srinivas B K Rani and A Govrdhan ldquoApplications ofdata mining techniques in healthcare and prediction of heartattacksrdquo International Journal on Computer Science and Engi-neering vol 2 no 2 pp 250ndash255 2010

[4] P Maji ldquoMutual information-based supervised attribute clus-tering for microarray sample classificationrdquo IEEE Transactionson Knowledge and Data Engineering vol 24 no 1 pp 127ndash1402012

[5] R Das ldquoA comparison of multiple classification methods fordiagnosis of Parkinson diseaserdquo Expert Systems with Applica-tions vol 37 no 2 pp 1568ndash1572 2010

[6] A Thomas N H Patterson M M Marcinkiewicz A LazarisP Metrakos and P Chaurand ldquoHistology-driven data miningof lipid signatures from multiple imaging mass spectrometryanalyses application to human colorectal cancer liver metasta-sis biopsiesrdquoAnalytical Chemistry vol 85 no 5 pp 2860ndash28662013

[7] R L Grossman and K P White ldquoA vision for a biomedicalcloudrdquo Journal of Internal Medicine vol 271 no 2 pp 122ndash1302012

[8] M Vukicevic B Delibasic Z Obradovic M Jovanovic andMSuknovic ldquoA method for design of data-tailored partitioningalgorithms for optimizing the number of clusters in microarrayanalysisrdquo in Proceedings of the IEEE Symposium on Computa-tional Intelligence in Bioinformatics and Computational Biology(CIBCB rsquo12) pp 252ndash259 IEEE San Diego Calif USA 2012

[9] H Chae I Jung H Lee S Marru S W Lee and S Kim ldquoBioand health informatics meets cloud BioVLab as an examplerdquoHealth Information Science and Systems vol 1 no 1 article 62013

[10] J Sun and C K Reddy ldquoBig data analytics for healthcarerdquo inProceedings of the 19th ACM SIGKDD international conferenceon Knowledge discovery and data mining pp 1525ndash1525 ACMAugust 2013

[11] M Matijas J A Suykens and S Krajcar ldquoLoad forecastingusing a multivariate meta-learning systemrdquo Expert SystemsWith Applications vol 40 no 11 pp 4427ndash4437 2013

[12] S Zhou K K Lai and J Yen ldquoA dynamic meta-learning rate-based model for gold market forecastingrdquo Expert Systems withApplications vol 39 no 6 pp 6168ndash6173 2012

[13] J Kanda C Soares E Hruschka and A de Carvalho ldquoA meta-learning approach to select meta-heuristics for the travelingsalesman problem using MLP-based label rankingrdquo in NeuralInformation Processing pp 488ndash495 Springer Berlin Germany2012

[14] A Attig and P Perner ldquoMeta-learning for image processingbased on case-based reasoningrdquo in Computational Intelligencein Healthcare vol 4 pp 229ndash264 Springer Berlin Germany2010

[15] AVera-Baquero R Colomo-Palacios andOMolloy ldquoBusinessprocess analytics using a big data approachrdquo IT Professional vol15 no 6 pp 29ndash35 2013

[16] L Dai X Gao Y Guo J Xiao and Z Zhang ldquoBioinformaticsclouds for big data manipulationrdquo Biology Direct vol 7 no 1article 43 2012

[17] I K Lai S K Tam and M F Chan ldquoKnowledge cloud systemfor network collaboration a case study in medical serviceindustry in Chinardquo Expert Systems with Applications vol 39 no15 pp 12205ndash12212 2012

[18] S Sonnenburg M L Braun S O Cheng et al ldquoThe need foropen source software in machine learningrdquo Journal of MachineLearning Research vol 8 pp 2443ndash2466 2007

[19] S Marston Z Li S Bandyopadhyay J Zhang and A GhalsasildquoCloud computing the business perspectiverdquo Decision SupportSystems vol 51 no 1 pp 176ndash189 2011

[20] N Sultan ldquoCloud computing for education a new dawnrdquoInternational Journal of Information Management vol 30 no2 pp 109ndash116 2010

[21] W Zhang and Q Chen ldquoFrom E-government to C-governmentvia cloud computingrdquo in Proceedings of the 1st InternationalConference on E-Business and E-Government (ICEE rsquo10) pp679ndash682 IEEE May 2010

[22] A Fernandez S del Rıo F Herrera and J M Benıtez ldquoAnoverview on the structure and applications for business intel-ligence and data mining in cloud computingrdquo in Proceedingsof the 7th International Conference on Knowledge Managementin Organizations Service and Cloud Computing pp 559ndash570Springer Berlin Germany January 2013

[23] S P Ahuja S Mani and J Zambrano ldquoA survey of the state ofcloud computing in healthcarerdquo Network and CommunicationTechnologies vol 1 no 2 pp 12ndash19 2012

[24] W Li J Yan Y Yan and J Zhang ldquoXbase cloud-enabledinformation appliance for healthcarerdquo in Proceedings of the 13thInternational Conference on Extending Database TechnologyAdvances in Database Technology (EDBT rsquo10) pp 675ndash680March 2010

[25] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[26] T Oinn M Addis J Ferris et al ldquoTaverna a tool forthe composition and enactment of bioinformatics workflowsrdquoBioinformatics vol 20 no 17 pp 3045ndash3054 2004

[27] BGiardine C Riemer R CHardison et al ldquoGalaxy a platformfor interactive large-scale genome analysisrdquo Genome Researchvol 15 no 10 pp 1451ndash1455 2005

[28] M C Schatz ldquoCloudBurst highly sensitive read mapping withMapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009

[29] B D OrsquoConnor B Merriman and S F Nelson ldquoSeqWare queryengine storing and searching sequence data in the cloudrdquo BMCBioinformatics vol 11 supplement 12 article S2 2010

[30] D Hong A Rhie S Park et al ldquoFX an RNA-seq analysis toolon the cloudrdquo Bioinformatics vol 28 no 5 pp 721ndash723 2012

[31] S V Angiuoli M Matalka A Gussman et al ldquoCloVR a virtualmachine for automated and portable sequence analysis from thedesktop using cloud computingrdquo BMC Bioinformatics vol 12no 1 article 356 2011

[32] R Zhang and L Liu ldquoSecurity models and requirements forhealthcare application cloudsrdquo in Proceedings of the 3rd IEEEInternational Conference on Cloud Computing (CLOUD rsquo10) pp268ndash275 IEEE July 2010

[33] R Wooten R Klink F Sinek Y Bai and M Sharma ldquoDesignand implementation of a secure healthcare social cloud systemrdquoin Proceedings of the 12th IEEEACM International Symposiumon Cluster Cloud and Grid Computing (CCGrid rsquo12) pp 805ndash810 IEEE May 2012

10 The Scientific World Journal

[34] W Liu and E K Park ldquoE-healthcare cloud computing appli-cation solutions cloud-enabling characteristics challenges andadaptationsrdquo in Proceedings of the International Conference onComputing Networking and Communications (ICNC rsquo13) pp437ndash443 IEEE January 2013

[35] T Giles and R Wilcox ldquoIBMWatson and medical records textanalyticsrdquo in Proceedings of the 2011 Healthcare Information andManagement Systems Society Meeting 2011

[36] A Abbasi C Albrecht A Vance and J Hansen ldquoMetafrauda meta-learning framework for detecting financial fraudrdquo MISQuarterly vol 36 no 4 pp 1293ndash1327 2012

[37] C Lemke and B Gabrys ldquoMeta-learning for time series fore-casting and forecast combinationrdquo Neurocomputing vol 73 no10-12 pp 2006ndash2016 2010

[38] N Iam-On T Boongoen and S Garrett ldquoLCE a link-basedcluster ensemble method for improved gene expression dataanalysisrdquo Bioinformatics vol 26 no 12 pp 1513ndash1519 2010

[39] K A Smith-Miles ldquoTowards insightful algorithm selection foroptimisation using meta-learning conceptsrdquo in Proceedings ofthe (IEEE World Congress on Computational Intelligence) IEEEInternational Joint Conference on Neural Networks (IJCNN rsquo08)pp 4118ndash4124 IEEE Hong Kong June 2008

[40] K A Smith-Miles ldquoCross-disciplinary perspectives on meta-learning for algorithm selectionrdquo ACMComputing Surveys vol41 no 1 article 6 2008

[41] B Delibasic K Kirchner J Ruhland M Jovanovic and MVukicevic ldquoReusable components for partitioning clusteringalgorithmsrdquo Artificial Intelligence Review vol 32 no 1ndash4 pp59ndash75 2009

[42] B Delibasic M Vukicevic M Jovanovic K Kirchner J Ruh-land and M Suknovic ldquoAn architecture for component-baseddesign of representative-based clustering algorithmsrdquoData andKnowledge Engineering vol 75 pp 78ndash98 2012

[43] M Suknovic B Delibasic M Jovanovic M Vukicevic DBecejski-Vujaklija and Z Obradovic ldquoReusable components indecision tree induction algorithmsrdquo Computational Statisticsvol 27 no 1 pp 127ndash148 2012

[44] M Vukicevic B Delibasic M Jovanovic M Suknovic and ZObradovic ldquoInternal evaluation measures as proxies for exter-nal indices in clustering gene expression datardquo in Proceedingsof the IEEE International Conference on Bioinformatics andBiomedicine (BIBM rsquo11) pp 574ndash577 IEEE Computer Society2011

[45] M Vukicevic K Kirchner B Delibasic M Jovanovic J Ruh-land and M Suknovic ldquoFinding best algorithmic componentsfor clustering microarray datardquo Knowledge and InformationSystems vol 35 no 1 pp 111ndash130 2013

[46] A Nascimiento R Prudencio M de Souto and I CostaldquoMining rules for the automatic selection process of clusteringmethods applied to cancer gene expression datardquo in Proceedingsof the 19th International Conference on Artificial Neural Net-works Part II pp 20ndash29 Springer Limassol Cyprus 2009

[47] N X Vinh J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison variants properties nor-malization and correction for chancerdquo Journal of MachineLearning Research vol 11 pp 2837ndash2854 2010

[48] S Radovanovic M Vukicevic M Jovanovic B Delibasic andM Suknovic ldquoMeta-learning system for clustering gene expres-sion microarray datardquo in Proceedings of the 4th RapidMinerCommunity Meeting and Conference (RCOMM rsquo13) pp 97ndash111Shaker Porto Portugal 2013

[49] A Thusoo J S Sarma N Jain Z Shao P Chakka and SAnthony ldquoHive a warehousing solution over a map-reduceframeworkrdquo Proceedings of the VLDB Endowment vol 2 no 2pp 1626ndash1629 2009

[50] R Anil T Dunning and E Friedman Mahout in ActionManning 2011

[51] V Stantchev andMMalek ldquoAddressing dependability through-out the SOA life cyclerdquo IEEE Transactions on Services Comput-ing vol 4 no 2 pp 85ndash95 2011

[52] I Mierswa M Wurst R Klinkenberg M Scholz and T EulerldquoYALE rapid prototyping for complex data mining tasksrdquo inProceedings of the 12th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo06) pp 935ndash940 ACM August 2006

[53] Development Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2005

[54] Z Prekopcsak G Makrai T Henk and C Gaspar-PapanekldquoRadoop analyzing big data with rapidminer and hadooprdquo inProceedings of the 2nd RapidMiner Community Meeting andConference (RCOMM rsquo11) June 2011

[55] J N van Rijn B Bischl L Torgo B Gao V Umaashankarand S Fischer ldquoOpenML a collaborative science platformrdquo inMachine Learning and Knowledge Discovery in Databases pp645ndash649 Springer Berlin Germany 2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 9: Research Article Cloud Based Metalearning System for Predictive …downloads.hindawi.com/journals/tswj/2014/859279.pdf · 2019. 7. 31. · paradigm in order to satisfy global demands

The Scientific World Journal 9

References

[1] K Kincade ldquoData mining digging for healthcare goldrdquo Insur-ance and Technology vol 23 no 2 pp IM2ndashIM7 1998

[2] H C Koh andG Tan ldquoDatamining applications in healthcarerdquoJournal ofHealthcare InformationManagement vol 19 no 2 pp64ndash72 2011

[3] K Srinivas B K Rani and A Govrdhan ldquoApplications ofdata mining techniques in healthcare and prediction of heartattacksrdquo International Journal on Computer Science and Engi-neering vol 2 no 2 pp 250ndash255 2010

[4] P Maji ldquoMutual information-based supervised attribute clus-tering for microarray sample classificationrdquo IEEE Transactionson Knowledge and Data Engineering vol 24 no 1 pp 127ndash1402012

[5] R Das ldquoA comparison of multiple classification methods fordiagnosis of Parkinson diseaserdquo Expert Systems with Applica-tions vol 37 no 2 pp 1568ndash1572 2010

[6] A Thomas N H Patterson M M Marcinkiewicz A LazarisP Metrakos and P Chaurand ldquoHistology-driven data miningof lipid signatures from multiple imaging mass spectrometryanalyses application to human colorectal cancer liver metasta-sis biopsiesrdquoAnalytical Chemistry vol 85 no 5 pp 2860ndash28662013

[7] R L Grossman and K P White ldquoA vision for a biomedicalcloudrdquo Journal of Internal Medicine vol 271 no 2 pp 122ndash1302012

[8] M Vukicevic B Delibasic Z Obradovic M Jovanovic andMSuknovic ldquoA method for design of data-tailored partitioningalgorithms for optimizing the number of clusters in microarrayanalysisrdquo in Proceedings of the IEEE Symposium on Computa-tional Intelligence in Bioinformatics and Computational Biology(CIBCB rsquo12) pp 252ndash259 IEEE San Diego Calif USA 2012

[9] H Chae I Jung H Lee S Marru S W Lee and S Kim ldquoBioand health informatics meets cloud BioVLab as an examplerdquoHealth Information Science and Systems vol 1 no 1 article 62013

[10] J Sun and C K Reddy ldquoBig data analytics for healthcarerdquo inProceedings of the 19th ACM SIGKDD international conferenceon Knowledge discovery and data mining pp 1525ndash1525 ACMAugust 2013

[11] M Matijas J A Suykens and S Krajcar ldquoLoad forecastingusing a multivariate meta-learning systemrdquo Expert SystemsWith Applications vol 40 no 11 pp 4427ndash4437 2013

[12] S Zhou K K Lai and J Yen ldquoA dynamic meta-learning rate-based model for gold market forecastingrdquo Expert Systems withApplications vol 39 no 6 pp 6168ndash6173 2012

[13] J Kanda C Soares E Hruschka and A de Carvalho ldquoA meta-learning approach to select meta-heuristics for the travelingsalesman problem using MLP-based label rankingrdquo in NeuralInformation Processing pp 488ndash495 Springer Berlin Germany2012

[14] A Attig and P Perner ldquoMeta-learning for image processingbased on case-based reasoningrdquo in Computational Intelligencein Healthcare vol 4 pp 229ndash264 Springer Berlin Germany2010

[15] AVera-Baquero R Colomo-Palacios andOMolloy ldquoBusinessprocess analytics using a big data approachrdquo IT Professional vol15 no 6 pp 29ndash35 2013

[16] L Dai X Gao Y Guo J Xiao and Z Zhang ldquoBioinformaticsclouds for big data manipulationrdquo Biology Direct vol 7 no 1article 43 2012

[17] I K Lai S K Tam and M F Chan ldquoKnowledge cloud systemfor network collaboration a case study in medical serviceindustry in Chinardquo Expert Systems with Applications vol 39 no15 pp 12205ndash12212 2012

[18] S Sonnenburg M L Braun S O Cheng et al ldquoThe need foropen source software in machine learningrdquo Journal of MachineLearning Research vol 8 pp 2443ndash2466 2007

[19] S Marston Z Li S Bandyopadhyay J Zhang and A GhalsasildquoCloud computing the business perspectiverdquo Decision SupportSystems vol 51 no 1 pp 176ndash189 2011

[20] N Sultan ldquoCloud computing for education a new dawnrdquoInternational Journal of Information Management vol 30 no2 pp 109ndash116 2010

[21] W Zhang and Q Chen ldquoFrom E-government to C-governmentvia cloud computingrdquo in Proceedings of the 1st InternationalConference on E-Business and E-Government (ICEE rsquo10) pp679ndash682 IEEE May 2010

[22] A Fernandez S del Rıo F Herrera and J M Benıtez ldquoAnoverview on the structure and applications for business intel-ligence and data mining in cloud computingrdquo in Proceedingsof the 7th International Conference on Knowledge Managementin Organizations Service and Cloud Computing pp 559ndash570Springer Berlin Germany January 2013

[23] S P Ahuja S Mani and J Zambrano ldquoA survey of the state ofcloud computing in healthcarerdquo Network and CommunicationTechnologies vol 1 no 2 pp 12ndash19 2012

[24] W Li J Yan Y Yan and J Zhang ldquoXbase cloud-enabledinformation appliance for healthcarerdquo in Proceedings of the 13thInternational Conference on Extending Database TechnologyAdvances in Database Technology (EDBT rsquo10) pp 675ndash680March 2010

[25] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[26] T Oinn M Addis J Ferris et al ldquoTaverna a tool forthe composition and enactment of bioinformatics workflowsrdquoBioinformatics vol 20 no 17 pp 3045ndash3054 2004

[27] BGiardine C Riemer R CHardison et al ldquoGalaxy a platformfor interactive large-scale genome analysisrdquo Genome Researchvol 15 no 10 pp 1451ndash1455 2005

[28] M C Schatz ldquoCloudBurst highly sensitive read mapping withMapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009

[29] B D OrsquoConnor B Merriman and S F Nelson ldquoSeqWare queryengine storing and searching sequence data in the cloudrdquo BMCBioinformatics vol 11 supplement 12 article S2 2010

[30] D Hong A Rhie S Park et al ldquoFX an RNA-seq analysis toolon the cloudrdquo Bioinformatics vol 28 no 5 pp 721ndash723 2012

[31] S V Angiuoli M Matalka A Gussman et al ldquoCloVR a virtualmachine for automated and portable sequence analysis from thedesktop using cloud computingrdquo BMC Bioinformatics vol 12no 1 article 356 2011

[32] R Zhang and L Liu ldquoSecurity models and requirements forhealthcare application cloudsrdquo in Proceedings of the 3rd IEEEInternational Conference on Cloud Computing (CLOUD rsquo10) pp268ndash275 IEEE July 2010

[33] R Wooten R Klink F Sinek Y Bai and M Sharma ldquoDesignand implementation of a secure healthcare social cloud systemrdquoin Proceedings of the 12th IEEEACM International Symposiumon Cluster Cloud and Grid Computing (CCGrid rsquo12) pp 805ndash810 IEEE May 2012

10 The Scientific World Journal

[34] W Liu and E K Park ldquoE-healthcare cloud computing appli-cation solutions cloud-enabling characteristics challenges andadaptationsrdquo in Proceedings of the International Conference onComputing Networking and Communications (ICNC rsquo13) pp437ndash443 IEEE January 2013

[35] T Giles and R Wilcox ldquoIBMWatson and medical records textanalyticsrdquo in Proceedings of the 2011 Healthcare Information andManagement Systems Society Meeting 2011

[36] A Abbasi C Albrecht A Vance and J Hansen ldquoMetafrauda meta-learning framework for detecting financial fraudrdquo MISQuarterly vol 36 no 4 pp 1293ndash1327 2012

[37] C Lemke and B Gabrys ldquoMeta-learning for time series fore-casting and forecast combinationrdquo Neurocomputing vol 73 no10-12 pp 2006ndash2016 2010

[38] N Iam-On T Boongoen and S Garrett ldquoLCE a link-basedcluster ensemble method for improved gene expression dataanalysisrdquo Bioinformatics vol 26 no 12 pp 1513ndash1519 2010

[39] K A Smith-Miles ldquoTowards insightful algorithm selection foroptimisation using meta-learning conceptsrdquo in Proceedings ofthe (IEEE World Congress on Computational Intelligence) IEEEInternational Joint Conference on Neural Networks (IJCNN rsquo08)pp 4118ndash4124 IEEE Hong Kong June 2008

[40] K A Smith-Miles ldquoCross-disciplinary perspectives on meta-learning for algorithm selectionrdquo ACMComputing Surveys vol41 no 1 article 6 2008

[41] B Delibasic K Kirchner J Ruhland M Jovanovic and MVukicevic ldquoReusable components for partitioning clusteringalgorithmsrdquo Artificial Intelligence Review vol 32 no 1ndash4 pp59ndash75 2009

[42] B Delibasic M Vukicevic M Jovanovic K Kirchner J Ruh-land and M Suknovic ldquoAn architecture for component-baseddesign of representative-based clustering algorithmsrdquoData andKnowledge Engineering vol 75 pp 78ndash98 2012

[43] M Suknovic B Delibasic M Jovanovic M Vukicevic DBecejski-Vujaklija and Z Obradovic ldquoReusable components indecision tree induction algorithmsrdquo Computational Statisticsvol 27 no 1 pp 127ndash148 2012

[44] M Vukicevic B Delibasic M Jovanovic M Suknovic and ZObradovic ldquoInternal evaluation measures as proxies for exter-nal indices in clustering gene expression datardquo in Proceedingsof the IEEE International Conference on Bioinformatics andBiomedicine (BIBM rsquo11) pp 574ndash577 IEEE Computer Society2011

[45] M Vukicevic K Kirchner B Delibasic M Jovanovic J Ruh-land and M Suknovic ldquoFinding best algorithmic componentsfor clustering microarray datardquo Knowledge and InformationSystems vol 35 no 1 pp 111ndash130 2013

[46] A Nascimiento R Prudencio M de Souto and I CostaldquoMining rules for the automatic selection process of clusteringmethods applied to cancer gene expression datardquo in Proceedingsof the 19th International Conference on Artificial Neural Net-works Part II pp 20ndash29 Springer Limassol Cyprus 2009

[47] N X Vinh J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison variants properties nor-malization and correction for chancerdquo Journal of MachineLearning Research vol 11 pp 2837ndash2854 2010

[48] S Radovanovic M Vukicevic M Jovanovic B Delibasic andM Suknovic ldquoMeta-learning system for clustering gene expres-sion microarray datardquo in Proceedings of the 4th RapidMinerCommunity Meeting and Conference (RCOMM rsquo13) pp 97ndash111Shaker Porto Portugal 2013

[49] A Thusoo J S Sarma N Jain Z Shao P Chakka and SAnthony ldquoHive a warehousing solution over a map-reduceframeworkrdquo Proceedings of the VLDB Endowment vol 2 no 2pp 1626ndash1629 2009

[50] R Anil T Dunning and E Friedman Mahout in ActionManning 2011

[51] V Stantchev andMMalek ldquoAddressing dependability through-out the SOA life cyclerdquo IEEE Transactions on Services Comput-ing vol 4 no 2 pp 85ndash95 2011

[52] I Mierswa M Wurst R Klinkenberg M Scholz and T EulerldquoYALE rapid prototyping for complex data mining tasksrdquo inProceedings of the 12th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo06) pp 935ndash940 ACM August 2006

[53] Development Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2005

[54] Z Prekopcsak G Makrai T Henk and C Gaspar-PapanekldquoRadoop analyzing big data with rapidminer and hadooprdquo inProceedings of the 2nd RapidMiner Community Meeting andConference (RCOMM rsquo11) June 2011

[55] J N van Rijn B Bischl L Torgo B Gao V Umaashankarand S Fischer ldquoOpenML a collaborative science platformrdquo inMachine Learning and Knowledge Discovery in Databases pp645ndash649 Springer Berlin Germany 2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 10: Research Article Cloud Based Metalearning System for Predictive …downloads.hindawi.com/journals/tswj/2014/859279.pdf · 2019. 7. 31. · paradigm in order to satisfy global demands

10 The Scientific World Journal

[34] W Liu and E K Park ldquoE-healthcare cloud computing appli-cation solutions cloud-enabling characteristics challenges andadaptationsrdquo in Proceedings of the International Conference onComputing Networking and Communications (ICNC rsquo13) pp437ndash443 IEEE January 2013

[35] T Giles and R Wilcox ldquoIBMWatson and medical records textanalyticsrdquo in Proceedings of the 2011 Healthcare Information andManagement Systems Society Meeting 2011

[36] A Abbasi C Albrecht A Vance and J Hansen ldquoMetafrauda meta-learning framework for detecting financial fraudrdquo MISQuarterly vol 36 no 4 pp 1293ndash1327 2012

[37] C Lemke and B Gabrys ldquoMeta-learning for time series fore-casting and forecast combinationrdquo Neurocomputing vol 73 no10-12 pp 2006ndash2016 2010

[38] N Iam-On T Boongoen and S Garrett ldquoLCE a link-basedcluster ensemble method for improved gene expression dataanalysisrdquo Bioinformatics vol 26 no 12 pp 1513ndash1519 2010

[39] K A Smith-Miles ldquoTowards insightful algorithm selection foroptimisation using meta-learning conceptsrdquo in Proceedings ofthe (IEEE World Congress on Computational Intelligence) IEEEInternational Joint Conference on Neural Networks (IJCNN rsquo08)pp 4118ndash4124 IEEE Hong Kong June 2008

[40] K A Smith-Miles ldquoCross-disciplinary perspectives on meta-learning for algorithm selectionrdquo ACMComputing Surveys vol41 no 1 article 6 2008

[41] B Delibasic K Kirchner J Ruhland M Jovanovic and MVukicevic ldquoReusable components for partitioning clusteringalgorithmsrdquo Artificial Intelligence Review vol 32 no 1ndash4 pp59ndash75 2009

[42] B Delibasic M Vukicevic M Jovanovic K Kirchner J Ruh-land and M Suknovic ldquoAn architecture for component-baseddesign of representative-based clustering algorithmsrdquoData andKnowledge Engineering vol 75 pp 78ndash98 2012

[43] M Suknovic B Delibasic M Jovanovic M Vukicevic DBecejski-Vujaklija and Z Obradovic ldquoReusable components indecision tree induction algorithmsrdquo Computational Statisticsvol 27 no 1 pp 127ndash148 2012

[44] M Vukicevic B Delibasic M Jovanovic M Suknovic and ZObradovic ldquoInternal evaluation measures as proxies for exter-nal indices in clustering gene expression datardquo in Proceedingsof the IEEE International Conference on Bioinformatics andBiomedicine (BIBM rsquo11) pp 574ndash577 IEEE Computer Society2011

[45] M Vukicevic K Kirchner B Delibasic M Jovanovic J Ruh-land and M Suknovic ldquoFinding best algorithmic componentsfor clustering microarray datardquo Knowledge and InformationSystems vol 35 no 1 pp 111ndash130 2013

[46] A Nascimiento R Prudencio M de Souto and I CostaldquoMining rules for the automatic selection process of clusteringmethods applied to cancer gene expression datardquo in Proceedingsof the 19th International Conference on Artificial Neural Net-works Part II pp 20ndash29 Springer Limassol Cyprus 2009

[47] N X Vinh J Epps and J Bailey ldquoInformation theoreticmeasures for clusterings comparison variants properties nor-malization and correction for chancerdquo Journal of MachineLearning Research vol 11 pp 2837ndash2854 2010

[48] S Radovanovic M Vukicevic M Jovanovic B Delibasic andM Suknovic ldquoMeta-learning system for clustering gene expres-sion microarray datardquo in Proceedings of the 4th RapidMinerCommunity Meeting and Conference (RCOMM rsquo13) pp 97ndash111Shaker Porto Portugal 2013

[49] A Thusoo J S Sarma N Jain Z Shao P Chakka and SAnthony ldquoHive a warehousing solution over a map-reduceframeworkrdquo Proceedings of the VLDB Endowment vol 2 no 2pp 1626ndash1629 2009

[50] R Anil T Dunning and E Friedman Mahout in ActionManning 2011

[51] V Stantchev andMMalek ldquoAddressing dependability through-out the SOA life cyclerdquo IEEE Transactions on Services Comput-ing vol 4 no 2 pp 85ndash95 2011

[52] I Mierswa M Wurst R Klinkenberg M Scholz and T EulerldquoYALE rapid prototyping for complex data mining tasksrdquo inProceedings of the 12th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD rsquo06) pp 935ndash940 ACM August 2006

[53] Development Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2005

[54] Z Prekopcsak G Makrai T Henk and C Gaspar-PapanekldquoRadoop analyzing big data with rapidminer and hadooprdquo inProceedings of the 2nd RapidMiner Community Meeting andConference (RCOMM rsquo11) June 2011

[55] J N van Rijn B Bischl L Torgo B Gao V Umaashankarand S Fischer ldquoOpenML a collaborative science platformrdquo inMachine Learning and Knowledge Discovery in Databases pp645ndash649 Springer Berlin Germany 2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 11: Research Article Cloud Based Metalearning System for Predictive …downloads.hindawi.com/journals/tswj/2014/859279.pdf · 2019. 7. 31. · paradigm in order to satisfy global demands

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014