Automatika: Journal for Control, Measurement, Electronics, Computing and Communications

ISSN: 0005-1144 (Print) 1848-3380 (Online) Journal homepage: https://www.tandfonline.com/loi/taut20

To cite this article: R. Sethuraman & T Sasiprabha (2019) Semantic web service-based messaging framework for prediction of fitness data using Hadoop distributed file system, Automatika, 60:3, 349-359, DOI: 10.1080/00051144.2019.1637175

To link to this article: https://doi.org/10.1080/00051144.2019.1637175

© 2019 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group

Published online: 03 Jul 2019.



AUTOMATIKA 2019, VOL. 60, NO. 3, 349–359 https://doi.org/10.1080/00051144.2019.1637175

SPECIAL ISSUE: COMPUTATIONAL INTELLIGENCE AND CAPSULE NETWORKS

Semantic web service-based messaging framework for prediction of fitness data using Hadoop distributed file system

R. Sethuraman and T Sasiprabha

School of Computing, Sathyabama Institute of Science and Technology, Chennai, India

ABSTRACT
Big data has become a household term in this era because a huge volume of data is generated every second from multiple sources such as logs, web sources, sensors, and electrical and electronic devices. The manipulation performed on the ingested data is termed the Data Processing Segment. In this paper the data are obtained from wearable devices, with attributes such as calories, weight, fat, step count, sleep, BMI and so on. The obtained data are stored persistently in HDFS. The Kafka component acts as a queue for the real-time data and regulates the data before they are stored in HDFS. Apache Spark then streams the data. The data are cleaned and split into training data and testing data, and a machine learning algorithm (KNN classifier) is applied to obtain the model. The predicted result is sent to the web service telephony ontology, which in turn communicates through the OWL API with the ontology service repository consisting of the cloud telephony services ontology and the fitness activity ontology. The classified and predicted value is analysed and communicated to the users through visualization graphs, SMS, IVR and e-mail.

ARTICLE HISTORY
Received 31 January 2019
Accepted 29 April 2019

KEYWORDS
Wearable device; big data; HDFS; spark; prediction; semantic web service ontology

1. Introduction

Advancements and innovations in electronic and wireless communications have resulted in the use of wearable devices. These devices are mostly compact and easy to carry. They can be easily mounted on the human body and are mostly worn on the wrist. These wearables [1] track human fitness activities and measures such as walking, jogging, step count, weight, sleep time, calories, BMI and so on. The wearables, along with their Application Programming Interface (API), track the data of these activities, which are then stored in HDFS. This helps people to monitor their fitness. Based on the final outcome of this data, the fitness activities of the individuals are viewed and improvements are made accordingly. The recorded data include distance covered, number of floors climbed, elevation, calories, sedentary minutes, lightly active minutes, fairly active minutes, very active minutes, sleep time, weight, BMI and fat [2]. These data, received continuously from the wearables, are crawled and stored in HDFS.

Big data refers to a huge volume of data that grows continuously in real-time systems. These data are commonly characterised by the so-called 10 V's of Big Data, which include volume, velocity, variety, value, veracity, validity, variability, viscosity and visualization. The activities performed on big data depend on the needs of the application. The major classification is Data Collection, Data Ingestion, Data Processing, Data Storage and Data Visualization. Various frameworks and components such as Flume, NiFi, Kafka, Sqoop, Flink, Spark and Storm are used for performing the corresponding activities. After data ingestion, the processing of these data plays a vital role: it determines the final prediction and provides the classification for the ingested data.

Apache Spark is used for processing and cleaning the data since it is designed for computational speed. Spark is a more advanced cluster computing engine than Hadoop MapReduce: MapReduce processes structured and unstructured data available in clusters in batch mode only, whereas Spark supports multiple modes such as interactive, iterative, batch and streaming. It is also easy to manage and supports real-time analysis. Spark processes real-time data using Spark Streaming. It is cross-platform and is developed in the Scala language. Since Spark is an analytics tool, most data scientists prefer it for analytics.

The data processing layer is the core activity in big data analytics. The cleaning of data takes place in this layer. The accuracy of the model is directly proportional to the accuracy of the data cleaning. After data cleaning, machine learning algorithms and models are applied to the data to obtain the predictive value. Machine learning algorithms are classified into three major categories: supervised learning, unsupervised learning and reinforcement learning. Each of these categories has its own functionality. The most preferred category is supervised learning. In this category a function is created that accepts the input data and is processed until the desired output is obtained. The outcome of this category falls into two types, regression and classification. The most popular regression algorithms are Decision Tree, Random Forest and KNN. A suitable algorithm is applied and the model is inferred to arrive at the predictive results. A few parameters should also be considered before settling on the algorithm. Table 1 suggests the selection of an algorithm for supervised learning based on parameters such as size, accuracy and speed. Unsupervised learning can be classified similarly by its parameters; this paper elaborates only on supervised learning algorithms. There are no restrictions on the selection of parameters. The entire process is carried out with the training and testing data, thus obtaining the predicted value. After data processing, control flows to the on-demand cloud telephony web services.

CONTACT R. Sethuraman [email protected]

© 2019 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The web service registry is an intermediate repository that connects the cloud telephony web services and the ontology service repository through SOAP/XML and the OWL API respectively. An ontology is defined as a set of concepts, descriptions, properties and relations. Many tools are available to create an ontology. This paper proposes two ontologies, namely the cloud telephony services ontology and the fitness activity ontology [3]. The cloud telephony ontology decides the mode of communication for the predicted results, such as SMS, IVR, e-mail, mobile app and visualization. The fitness activity ontology provides the data on the predicted value, which is communicated to the end user through the HTTP request and response API.

The key contributions in this paper are as follows:

• Data obtained from various sources are ingested into HDFS available in the cloud environment, which is enhanced by the component Apache Kafka shown in Table 1.

• Data analytics is done in the data processing layer by the fast analytical tool Apache Spark along with a supervised machine learning algorithm.

• A KNN predictive model is applied, which yields the maximum accuracy compared with other models, and the values are predicted.

• Ontology knowledge representation of the cloud telephony ontology and the fitness activity ontology for effective selection of modes of communication.

• On-demand cloud telephony web services to represent and implement the decisions of the ontology knowledge representation to the end users.

Table 1. Model suggestion for supervised category based on parameters.

| Supervised type | Parameter | Suggested model |
| Classification | Accuracy | SVM, Neural Network, Gradient Boosting Tree, Random Forest |
| Classification | Speed and Elaborated | Decision Tree, Logistic Regression |
| Classification | Speed, Non-Elaborated and Large data | Naïve Bayes |
| Classification | Speed, Non-Elaborated and Moderate data | Linear SVM, Naïve Bayes |
| Regression | Accuracy | Neural Network, Gradient Boosting Tree, Random Forest |
| Regression | Speed | Decision Tree, Linear Regression |
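As an illustration only, Table 1 can be read as a lookup from a (supervised type, parameter) pair to the suggested models. The sketch below encodes the table rows as they are given; the names `TABLE_1` and `suggest_models` are hypothetical and are not part of the proposed framework.

```python
# Table 1 encoded as a lookup: (supervised type, parameter) -> suggested models.
# The key strings are copied from Table 1; the helper name is hypothetical.
TABLE_1 = {
    ("Classification", "Accuracy"): [
        "SVM", "Neural Network", "Gradient Boosting Tree", "Random Forest"],
    ("Classification", "Speed and Elaborated"): [
        "Decision Tree", "Logistic Regression"],
    ("Classification", "Speed, Non-Elaborated and Large data"): ["Naïve Bayes"],
    ("Classification", "Speed, Non-Elaborated and Moderate data"): [
        "Linear SVM", "Naïve Bayes"],
    ("Regression", "Accuracy"): [
        "Neural Network", "Gradient Boosting Tree", "Random Forest"],
    ("Regression", "Speed"): ["Decision Tree", "Linear Regression"],
}


def suggest_models(supervised_type, parameter):
    """Return the models Table 1 suggests, or an empty list for an unknown pair."""
    return TABLE_1.get((supervised_type, parameter), [])


print(suggest_models("Regression", "Speed"))
```

A pair that does not appear in Table 1 simply returns an empty list, so the caller can fall back to its own choice.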

2. Fitbit data on big data ecosystem in cloud

Input data are obtained from wearable devices that customers wear on their body, most likely as a watch, to track their daily fitness activities. These data are ingested into the Hadoop Distributed File System available in the Hadoop ecosystem, which resides in the cloud environment. The data are processed in the data processing layer, where they are split into training data and testing data. They are then computed with multiple machine learning algorithms, and the predicted value is obtained with the model inferred from the algorithm. Figure 1 shows the entire flow of data from the wearable devices to the visualization and communication platforms. The novelty of this paper is explained in the sections below. The paper provides analysis, prediction and appropriate communication channel selection for the real-time data obtained from the wearable devices. The benchmark data set is available on the web at https://zenodo.org/record/14996#.WfxTJFuCzZ4

Figure 1. Overall architecture.

2.1. Data ingestion

In big data, the sources of data are multiple and the data to be handled are in surplus volume. Generally the data are obtained from sensor devices, production logs, events, and web sources such as Facebook, Twitter, LinkedIn and so on. The selection of the data source depends on the type of analysis and prediction to be carried out. This paper performs analysis and prediction for the data obtained from wearable devices such as Fitbit. The available data are in structured, unstructured or semi-structured formats such as HTML, PDF, image and text. These types of data are crawled using a crawler service such as a web crawler built using the framework named Akka. This framework supports the HTTP and HTTPS protocols. It also supports a proxy server for performing this activity. Here the data are obtained from the wearable devices and stored in the Hadoop Distributed File System (HDFS). HDFS is a component of the Hadoop ecosystem, which resides in the form of a cluster in the cloud environment.

The Fitbit data are stored more effectively in HDFS with the help of a lightweight library known as Kafka. This library is capable of integrating with any application and is meant for efficient storage of incoming data in HDFS. Kafka is a queuing system that maintains all the input data in a pipeline and sends the data in a queue for storage in HDFS. The stored data are sent to Apache Spark Streaming, which performs the data cleaning. Spark Streaming receives real-time events from Kafka. It is used in the data processing layer to fetch the data, which are then split into training and test data. In Spark Streaming, live data streams are received from Kafka and are split into batches. These batches are processed by the Spark engine, and the final processed data are also obtained in batches. The working flow of Spark Streaming is depicted in Figure 2.

Spark does not have its own file system. It works along with Hadoop, making use of HDFS in processing the streaming data into batches of input data. This processing is made faster and more efficient by various features of Spark such as speed, multi-language support, interactive querying and analytics. The Spark engine accepts the input batches and applies Resilient Distributed Dataset (RDD) transformations to them. RDDs store memory state as objects across jobs, and these objects are shared between the jobs. The intermediate results are stored in distributed memory, thereby making the process faster and more efficient (Figure 2).
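The idea of splitting a live stream into batches can be sketched in plain Python. This is only a conceptual illustration and does not use the Kafka or Spark APIs: Spark Streaming forms batches by time interval, whereas the hypothetical `micro_batches` helper below uses a fixed record count for simplicity, and the readings are invented.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Group an incoming record stream into fixed-size batches, in arrival order."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


# Hypothetical wearable readings arriving in queue (first-in, first-out) order.
stream = [
    {"steps": s, "calories": c}
    for s, c in [(120, 9), (300, 21), (80, 6), (410, 30), (50, 4)]
]
batches = list(micro_batches(stream, batch_size=2))
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Each batch would then be handed to the processing engine as one unit; the last batch may be smaller than the others.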

Figure 2. Spark streaming working flow.

2.2. Data processing

Batches of processed data received from Spark Streaming are fed into the data processing layer. Here the data are split into two data sets, namely a training data set and a testing data set. The ratio of training data to testing data is 80:20. Machine learning algorithms are then applied to these data and the required information is extracted. Supervised learning techniques are used for running the models in these scenarios. The selection of an appropriate machine learning algorithm is the heart of data analytics [4]. Selecting an improper algorithm for a scenario leads to drastic variations in the results of the application. The selection of a supervised machine learning algorithm is made accurately by considering factors such as the size of the data, whether the data for processing are continuous or not, whether the expected output is a classification or a regression, whether the predefined variables in the data set are labelled or unlabelled, whether the goal is to predict the output or to rank it, how easy or hard the result is to interpret, and so on. Most practical machine learning implementations use supervised learning algorithms. The most commonly used machine learning algorithms are Regression, Logistic Regression, Decision Tree, Random Forest and KNN.
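A minimal sketch of the 80:20 split is given here, assuming the cleaned records are shuffled before they are divided. The shuffling, the fixed seed and the function name `train_test_split` are assumptions made for illustration; the paper states only the ratio.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and split it into (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 cleaned records
train, test = train_test_split(rows)
print(len(train), len(test))
```

Every record lands in exactly one of the two sets, so the testing data stay unseen during training.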

2.3. Predictive modelling using KNN

In this paper, the K-nearest neighbours (KNN) machine learning algorithm is used as the classification technique to predict the output for the considered input parameters. KNN handles both classification and regression problems. However, in industry this model is mostly used for classification problems. The main reasons to prefer KNN are its high predictive power and the very high accuracy it achieves. It also takes less time for computation [5]. In addition, it is very flexible with the data when interpreting the output. In this model the key control is the value of “K” selected to run the model. Despite its simplicity it provides highly accurate results. Rather than performing an exact match with the stored data, it provides a way to recognise patterns in the data. Similar cases are close to each other and are known as neighbours; the greater the distance between cases, the more dissimilar they are. The number of nearest neighbours is controlled by the value of “K”. The main challenge in predictive modelling is training on the data. The training is performed based on the following models.
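The neighbour-voting idea described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the readings, the class labels "low" and "high" and the function name `knn_predict` are hypothetical, and the features are used unscaled for brevity.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored case.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical (steps, fat) readings labelled with a calorie-burn class.
points = [(2000, 24.0), (2500, 23.5), (3000, 23.0),
          (9000, 18.0), (10000, 17.5), (11000, 17.0)]
labels = ["low", "low", "low", "high", "high", "high"]
print(knn_predict(points, labels, query=(9500, 18.2), k=3))
```

The query is never matched exactly against a stored case; it is assigned the class that dominates among its K closest neighbours.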

2.3.1. Distance metric

This metric is used to measure the similarity of the cases and their nearest neighbours. The methods available to calculate this distance metric are Euclidean distance and city block distance. In Cartesian coordinates, if $a = (a_1, a_2, a_3, \ldots, a_n)$ and $b = (b_1, b_2, b_3, \ldots, b_n)$ are two points, then the distance (dis) between a and b (or b and a) is calculated by the Pythagorean formula as in Eq. 1, and the generalized formula is given in Eq. 2.

$$\mathrm{dis}(a, b) = \mathrm{dis}(b, a) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2} \tag{1}$$

$$\mathrm{dis}(a, b) = \mathrm{dis}(b, a) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \tag{2}$$

City block distance measures the distance between cases as the sum, over all dimensions, of the weighted absolute differences between the values for the cases.
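Both distance measures can be written directly from their definitions. The sketch below follows Eq. 2 for the Euclidean distance; for the city block distance the weights are not specified in the text, so unit weights are assumed when none are passed. The function names are illustrative.

```python
import math


def euclidean(a, b):
    """Eq. 2: square root of the summed squared coordinate differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def city_block(a, b, weights=None):
    """Sum over all dimensions of the weighted absolute differences."""
    if weights is None:
        weights = [1.0] * len(a)  # assumption: unit weights when none are given
    return sum(w * abs(ai - bi) for w, ai, bi in zip(weights, a, b))


a, b = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(euclidean(a, b))   # 5.0
print(city_block(a, b))  # 7.0
```

Both functions are symmetric in their arguments, matching dis(a, b) = dis(b, a).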

2.3.2. Cross validation for selection of K

Cross validation is an important data modelling technique in which only 80 percent of the data are used for training and the remaining 20 percent are not trained on but reserved for testing. The model is tested with this 20 percent of the data before it is finalized. Cross validation (CV) helps in improving the performance of the model. In cross validation, for each value of k, where $k \in [k_{min}, k_{max}]$, the average error rate or sum of squared errors is calculated for each fold v and denoted by $e_v$. The equation for cross validation is given in Eq. 3. The number of folds, V, is the number of parts into which the data are split for training and testing. In 10-fold cross validation, training is done with 9 parts and testing with 1 part; this is repeated for all combinations of train and test splits. The accuracy of the model is evaluated as the ratio of the number of correctly predicted instances to the total number of instances in the data set. This is used while building and evaluating each model. The CV method is valuable because it directly measures the quantity of interest, namely the predictive performance on unseen data. For more complex models it is used to select the tuning parameters. In a data set with n observations, n-fold CV is known as leave-one-out cross validation (LOOCV).

$$CV_k = \sum_{v=1}^{V} \frac{e_v}{V} \tag{3}$$
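Eq. 3 can be sketched as a loop over V folds that averages the per-fold error rates, repeated for each candidate K. The example below is an assumption-laden illustration: the fold assignment (every V-th case), the toy one-dimensional data with one deliberately mislabelled case, and all function names are hypothetical.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """train is a list of (point, label); majority vote among the k nearest points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_error(data, k, folds):
    """Eq. 3: average the per-fold error rates e_v over V = folds folds."""
    fold_errors = []
    for v in range(folds):
        test = data[v::folds]  # every folds-th case forms fold v
        train = [item for i, item in enumerate(data) if i % folds != v]
        wrong = sum(knn_predict(train, point, k) != label for point, label in test)
        fold_errors.append(wrong / len(test))
    return sum(fold_errors) / folds


def select_k(data, k_min, k_max, folds):
    """Return the K in [k_min, k_max] with the lowest cross-validated error."""
    return min(range(k_min, k_max + 1), key=lambda k: cv_error(data, k, folds))


# Two well-separated groups of hypothetical readings.
data = [((float(x), 0.0), "low") for x in range(10)]
data += [((float(x + 100), 0.0), "high") for x in range(10)]
data[5] = (data[5][0], "high")  # one deliberately mislabelled case

for k in range(1, 6):
    print(k, cv_error(data, k, folds=5))
print("selected K:", select_k(data, 1, 5, folds=5))
```

With very small K the single mislabelled case misleads its neighbours, which is the kind of effect cross validation is meant to expose when choosing K.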

2.3.3. Feature selection

Feature selection is derived from feature engineering, in which more information is extracted from the existing data. A good hypothesis gives good results. Feature selection is the process of finding the best subset of attributes, the one that best explains the relation of the independent variables with the target variable. Feature selection here is based on the forward selection approach, in which features are entered into the model one at a time in a sequential manner. At each step, the selected feature is the one that reduces the error rate or sum of squared errors. Feature selection makes the ML algorithms train faster by reducing the complexity of the model and makes the model easier to interpret. The accuracy of the model depends on the appropriate selection of the subset. Pearson's correlation is a measure for quantifying the linear relationship between two continuous variables A and B. Its range is between −1 and +1. The Pearson equation is given in Eq. 4 as $\rho_{A,B}$. Pearson's correlation coefficient is the ratio of the covariance of the two variables A and B to the product of their standard deviations, where A and B are random variables. The coefficient is represented by the letter ρ.

$$\rho_{A,B} = \frac{\mathrm{cov}(A, B)}{\sigma_A \sigma_B} \tag{4}$$
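Eq. 4 translates into a few lines of code. The sketch below computes the covariance and both standard deviations from paired samples and divides them as in the equation; the sample values are hypothetical and were chosen to be exactly linear so that the coefficient is at the top of its range.

```python
import math


def pearson(a, b):
    """Eq. 4: cov(A, B) divided by the product of the standard deviations."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return cov / (sd_a * sd_b)


steps = [1000, 2000, 3000, 4000, 5000]     # hypothetical readings
calories = [2100, 2300, 2500, 2700, 2900]  # exactly linear in steps
print(round(pearson(steps, calories), 6))
```

Reversing one of the two series flips the sign of the coefficient, which is how a perfectly inverse linear relation shows up.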

The three classes of feature selection are filter methods, wrapper methods and embedded methods. Examples of filter methods are the chi-squared test and information gain. Wrapper methods are used in algorithms such as hill climbing and forward-backward pass. The embedded method performs the selection within the optimization of an algorithm such as regression. Feature selection (FS) is the method of selecting the most appropriate parameters available in the data set for predicting the target variables and building a better model. FS trains the model faster and reduces the model complexity, making it easier to interpret. It also provides wide possibilities for selecting the subsets. When the appropriate model is selected, high accuracy is attained. The flow starts with the set of all features, proceeds to the selection of the best subset, and then builds the best model to attain the accuracy.

3. On-demand cloud telephony web services

After the results are predicted in the data processing layer using the ML algorithms and models, they are sent to the telephony web services, which reside in the cloud environment. When the results are requested by the customers, they are delivered through these on-demand telephony services. Cloud telephony (CT) is a communication technology in which the hosting takes place at the service provider's premises. The application does not need any software or hardware for the activation and use of this service. Two networks are available to support and carry these services, namely IP and PSTN. Services over IP enable the delivery of telephony services via the Internet and are provided by VoIP-enabled devices. Services without the Internet are available over PSTN. Cloud telephony services provide scalable solutions and act as a complete solution for tracking the customers. Cloud telephony has the following characteristics for better services: on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service. The on-demand cloud telephony services include missed call, SMS and interactive voice response (IVR). These services communicate with the web service registry through SOAP/XML [6]. All the web services are available in this repository. The registry communicates with the ontology services and the desired output is obtained. The cloud telephony system communicates with the ontology services using the OWL API. The business logic written in the web services has multiple methods with common insights. These insights are delivered as services to the customers based on their demand requests. The parameter common to all services is User Insights. The other parameter varies with the service: mobile number for IVR/SMS, e-mail ID for mail communication and application ID for the mobile app.
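The parameter scheme described above (a common User Insights value plus one service-specific address) can be sketched as a small request builder. This is a hypothetical illustration only: it does not model SOAP/XML or any real telephony provider, and the class name, field names and mode strings are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Notification:
    mode: str           # "SMS", "IVR", "EMAIL" or "APP" (hypothetical mode names)
    user_insights: str  # the parameter common to every service
    address: str        # mobile number, e-mail ID or application ID


# Which service-specific parameter each mode needs, as described in the text.
ADDRESS_FIELD = {
    "SMS": "mobile_number",
    "IVR": "mobile_number",
    "EMAIL": "email_id",
    "APP": "application_id",
}


def build_request(note):
    """Assemble a request payload for the chosen service; field names are hypothetical."""
    if note.mode not in ADDRESS_FIELD:
        raise ValueError(f"unsupported mode: {note.mode}")
    return {
        "service": note.mode,
        "user_insights": note.user_insights,
        ADDRESS_FIELD[note.mode]: note.address,
    }


print(build_request(Notification("SMS", "calories burnt: 2807", "+10000000000")))
```

Keeping the mode-to-parameter mapping in one table means a new communication mode needs only one new entry.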

4. Ontology knowledge representation

The knowledge representation is the next stage of activity after the cloud telephony web services. The representation of the obtained knowledge is done by the ontology service repository. Two ontologies are available in this repository, namely the cloud telephony services ontology and the fitness activity ontology. The fitness activity ontology maintains the predictive result about calories burnt based on various attributes of comparison classified in its ontology structure. The modes of communication, such as SMS, e-mail, IVR, mobile app and visualization, are stored in the telephony services ontology. Based on the user request, the appropriate mode of communication is selected and communicated to the user [7]. This communication takes place via the OWL API to the cloud telephony web services and then via HTTP request and response to the end user. The details or analytics on the value of calories burnt against each of the attributes used in the data set are provided over the selected communication mode. Semantic service communication is established along with the ontology. Based on the selection of web services, the corresponding ontology is chosen. This helps in making the right decision when selecting appropriate semantic-based services.

Table 2. Metric values for the selected feature variables.

| Variable | Min | Max | Mean | Correlation | Covariance | Standard deviation | Skewness |
| Step | 0 | 34954 | 4012.656 | 0.644 | 998413.594 | 4300.212 | 1.334 |
| Calories | 985 | 6092 | 2807.43 | 1 | 130026.142 | 360.591 | 2.679 |
| Weight | 80.309 | 97.523 | 89.267 | 0.157 | 252.444 | 4.462 | −0.35 |
| BMI | 22.13 | 26.873 | 24.598 | 0.157 | 69.562 | 1.229 | −0.35 |
| Fat | 14.892 | 24.443 | 20.93 | 0.158 | 138.964 | 2.441 | −0.53 |

5. Related work

Big data, and the analysis or prediction made from the available data, is a trend followed in all existing verticals. The predicted results help them to improve and get the maximum out of their business and progress. In recent days it has also been applied in animal research [8]. There, data collected from multiple sources have been maintained and processed with machine learning models to identify new species, behaviour and movements. It also provides improved classification. The handling of data plays a vital role. The cloud environment helps in handling and storing these multiple data in the cloud. This storage is supported, and its functionality enhanced, by NoSQL [9] databases. Better processing is supported here, since the more conveniently the data are stored, the better they can be retrieved and processed for efficient usage. The cloud helps in providing good quality of service for handling the data. This involves the generation of a data set for processing. The crucial point is the selection of the model; the evaluation is performed based on the model. The other trend is the Electronic Health Record (EHR) [10]. For the generation of an EHR, the data are obtained from wearable devices and then stored for processing. The processed data are made available to the customers as a report, which they can access from their mobile devices. These reports are mostly in statistical form, which the user can view comfortably. They use the Kafka framework for the ingestion of data from multiple devices as input. The data available from wearables and other health domains are streaming data. These streaming data are processed by Apache Spark. These reports help doctors in monitoring the patient and guide them in suggesting fitness activities, food habits and disease management for the patients. Irrespective of the data that are available, processed and applied with machine learning models, security plays a vital role for the data [11]. Relevant parameters include the power consumption of the wearable devices, the transmission rate of the data and other security issues. To attain security, the devices are controlled by the corresponding mobile devices [12]. Upon authentication from the devices, the data are stored continuously and then processed. Similarly, multiple analyses are performed, and the predictions are made and used in an efficient manner.

6. Results and discussions

6.1. Data exploration

The process that defines the quality and accuracy of the outcome is data exploration. It is the process of cleaning the data and making them ready for analysis using the model. The process starts with the identification of the input and target variables. In the proposed model the input variables include calories, steps, weight, BMI and fat. The target variable is the calories burnt. The data received from the Fitbit device are continuous variables, so the outliers are far fewer than with categorical variables. The strength of a relationship depends on the correlation value, which lies between −1 and +1. Metrics such as correlation, covariance, standard deviation and skewness are calculated and given in Table 2. The metric values are statistical values of the core data set obtained from the wearable Fitbit device. From the available parameters, Step, Calories, Weight, BMI and Fat are considered as feature variables, and statistics such as min, max, mean, correlation, covariance, SD and skewness are calculated for them. Missing values are treated by removing incomplete data and by substituting the mean value of the particular variable. Outliers, if any, are also removed. Since the data came from a real-time data set, many inconsistent and missing values were found. Using statistical analysis and real-time data exploration with open source tools, the outliers and incorrect data were rectified.
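As a worked check, the Correlation column of Table 2 can be recomputed from the Covariance and Standard deviation columns using Eq. 4. The sketch assumes that the covariance and correlation in each row are taken with respect to Calories; this reading is an inference (the Calories row has correlation 1 and a covariance close to the square of its standard deviation) and is not stated explicitly in the text.

```python
# (covariance, standard deviation, reported correlation), copied from Table 2.
# Assumption: covariance and correlation are measured against Calories.
TABLE_2 = {
    "Step": (998413.594, 4300.212, 0.644),
    "Weight": (252.444, 4.462, 0.157),
    "BMI": (69.562, 1.229, 0.157),
    "Fat": (138.964, 2.441, 0.158),
}
SD_CALORIES = 360.591  # standard deviation of Calories in Table 2


def correlation_from_table(covariance, sd):
    """Eq. 4 applied to one row: covariance over the product of the two SDs."""
    return covariance / (sd * SD_CALORIES)


for name, (cov, sd, reported) in TABLE_2.items():
    computed = correlation_from_table(cov, sd)
    print(f"{name}: computed {computed:.3f}, reported {reported}")
```

Under that assumption the recomputed values agree with the reported correlations to three decimal places, for example 998413.594 / (4300.212 × 360.591) ≈ 0.644 for Step.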

6.2. Visualization charts

The input data from the Fitbit wearable device are measured with multiple parameters such as calories, steps, weight, BMI, fat and so on. The data obtained are continuous because data are tracked every second and stored in HDFS through the cloud API in the cloud environment [13]. The outliers are eliminated in the process of data extraction and data ingestion using the Kafka library in our application. The quality of the data set is improved by eliminating the outliers and extremes. Since the measurement of data is continuous, the probability of missing input is reduced. The amount of calories burnt against the recorded BMI value is depicted in Figure 3.

Figure 3. Visualization of calories against BMI.

Figure 4. Visualization of input calories against the variable fat.

Figure 5. Predicted calories burnt against the variable fat using KNN.

Figure 6. Predicted calories burnt against the variable steps using KNN.

For computation using any machine learning algorithm, the data need to be available in the form of training data and testing data. The algorithm is capable of taking all the features into account for the prediction. Doing so for a large data set exhausts the computational power and memory of the system, so selecting the most appropriate features is mandatory. In this paper the most appropriate features for the prediction of calories burnt are BMI, Fat and Steps. The visualization of the selected feature fat against the calories is shown in Figure 4.

The most crucial step in data processing is to suggest the next set of actions to be carried out after building an accurate and efficient model, which in this proposal is KNN. These actions are useful for enhancing the ratio of the goals. In this context, the most important quantity predicted by the model is calories burnt. The predictions are based on the activities taken and analysed using the data set. Figure 5 plots the calories burnt values predicted by the KNN model against the fat value [14]. This is based on the predicted value computed from the training and testing data. Based on the hypothesis concluded, reports and graphics showcase the impact of these predictions. After the predicted value is obtained, several input features are used to obtain new predictions of the same parameter. In this paper, in all cases the amount of calories burnt is predicted based on the input features Fat, Steps and Weight. Further comparison of predicted calories burnt against the other features leads to stronger discussion and more practical conclusions.

This comparison identifies the model with the best accuracy. These scenarios are tested, the calories burnt are predicted, and the results are visualized in Figures 6 and 7 as the feature variable Steps vs calories burnt and the feature variable Weight vs calories burnt respectively. In Figures 5–7, the Y-axis of the visualized graphs is labelled $N-calories. This notation stands for the predicted value obtained from the KNN model.

Figure 7. Predicted calories burnt against the variable weight using KNN.

Figure 8. Three dimensional representation of prediction against calories, fat and steps.

Figure 9. Three dimensional representation of prediction against calories, steps and weight.

Figure 10. 3 × 3 matrix representation of prediction against calories, steps and fat.

In general, the bar graph is the most commonly used graph, irrespective of whether dependent or independent variables are represented. Figure 8 represents the steps, predicted calories and fat. From these plots, the mappings are dragged and inferred if necessary, and conclusions are made from these inferences as per the requirements. Other available categories are the scatter graph, the line graph and the multiple-line graph. Similarly, Figure 9 visualizes the plotting of steps, weight and calories burnt.

To find the linear correlation between multiple variables, scatterplot matrices are used [15]. Figures 10 and 11 involve multiple feature variables such as predicted calories, steps, weight and Fat. Here, each variable is plotted against every other variable.


Figure 11. 3 × 3 matrix representation of prediction against calories, steps and weight.

Since multiple feature variables are involved in the scatterplot matrix, each individual reference is defined in terms of the variables Xi and Yi for the vertical axis and the horizontal axis, respectively.

From this scatterplot matrix, the existence of pairwise relationships among the variables is concluded. Outliers in the data are easily identified. The scatterplot matrix is also known as a SPLOM. The histograms in the above figures plot each variable against itself. For better insights from the data set, various visualization methods and techniques are used for multiple combinations of the available features. Combinations of feature variables such as input calories vs Fat, calories burnt vs step count and calories burnt vs Weight are shown in charts.
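The pairwise linear relationships that a scatterplot matrix shows visually can also be summarized numerically. The following sketch computes a Pearson correlation matrix over the same kind of variables; it is offered as a numeric companion to the SPLOM, not as the method used in the paper, and the sample values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Correlate every column with every other column, mirroring the SPLOM layout."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values for illustration only.
columns = {
    "steps": [2500, 4000, 7500, 9000],
    "fat": [30.0, 25.0, 20.0, 18.0],
    "calories": [1800, 2000, 2300, 2400],
}
matrix = correlation_matrix(columns)
```

Each cell of `matrix` corresponds to one panel of the scatterplot matrix: a value near +1 or -1 suggests a strong linear relationship, while the diagonal is always 1.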

7. Conclusion and future work

The novelty of this paper is explained in terms of data ingestion and data processing. Data from wearable devices such as Fitbit is stored on HDFS using the Kafka library in the application. Apache Spark handles the streaming data efficiently, and the data is divided into training and test data. These data are then processed with the supervised machine learning algorithm K-Nearest Neighbour (KNN). Based on the value of "K", neighbours are formed and the model is built. The predicted value is thus obtained from the KNN classifier and the model. In this proposal, the key inputs include step count, fat, weight and calories.
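The ingestion flow summarized above (messages arriving on a topic, grouped into small batches and stored before training) can be illustrated without any cluster. The sketch below uses only the Python standard library: an in-memory queue stands in for the Kafka topic and a list of batches stands in for files on HDFS. It does not use the Kafka, Spark or HDFS client APIs, and the message fields and batch size are assumptions.

```python
import json
from collections import deque


def micro_batches(messages, batch_size):
    """Parse a stream of JSON messages and group them into fixed-size batches."""
    batch = []
    for raw in messages:
        batch.append(json.loads(raw))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


# In-memory stand-in for a Kafka topic; the readings are made up for illustration.
topic = deque(
    json.dumps({"steps": s, "fat": f, "weight": w, "calories": c})
    for s, f, w, c in [
        (9000, 18.0, 65.0, 2400),
        (4000, 25.0, 80.0, 2000),
        (2500, 30.0, 92.0, 1800),
        (7500, 20.0, 70.0, 2300),
        (6000, 22.0, 74.0, 2150),
    ]
)

# In-memory stand-in for files written to HDFS: one list of records per batch.
stored_batches = list(micro_batches(topic, batch_size=2))
```

In a real deployment each batch would be written to distributed storage and later read back for the training and test division; here the batches simply remain in memory.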

Various other parameters are available in the data set, but not all of them are considered for the prediction by the model. The predicted value here is the amount of calories burnt. This prediction is carried out against all the key inputs mentioned above. The value is sent to the on-demand cloud telephony ontology, which communicates the results to the end user through email, SMS, IVR, mobile app and visualization techniques. The choice of notification channel is decided by the ontology service repository, where two ontologies are maintained. The interface between the telephony web services and the ontology is handled by SOAP/XML and the OWL API. All these services are maintained in the web service registry. The end result reaches the user or requestor through the HTTP request and response API.
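As an illustration of how a predicted value might travel over SOAP/XML, the sketch below builds a SOAP 1.1 envelope with Python's standard `xml.etree.ElementTree` module. The application namespace, the `NotifyUser` operation and its element names are hypothetical; the paper does not publish its service contract.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:fitness-notify"  # hypothetical application namespace


def build_notification(user_id, channel, calories_burnt):
    """Build a SOAP envelope carrying one predicted value for one user."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # Hypothetical operation and element names, chosen only for this example.
    notify = ET.SubElement(body, f"{{{APP_NS}}}NotifyUser")
    ET.SubElement(notify, f"{{{APP_NS}}}UserId").text = user_id
    ET.SubElement(notify, f"{{{APP_NS}}}Channel").text = channel
    ET.SubElement(notify, f"{{{APP_NS}}}PredictedCalories").text = f"{calories_burnt:.1f}"
    return ET.tostring(envelope, encoding="unicode")


message = build_notification("user-001", "SMS", 2350.0)
```

The `Channel` element reflects the idea that the notification route (email, SMS, IVR or mobile app) is chosen separately from the prediction itself.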

Future work lies in the creation of the ontology by the machine learning model itself. Based on the type of data and the analysis performed, the semantic ontology can be created. The Resource Description Framework (RDF) can be used to create the semantic ontology. In the machine learning model, the memory size needs to be scalable so that all the variables can be considered when computing the prediction parameter.
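To give a sense of what RDF output from a model could look like, the sketch below serializes one prediction as N-Triples statements using plain string formatting. The `http://example.org/fitness#` namespace, the `Prediction` class and the property names are hypothetical placeholders; only the RDF `type` and XSD `double` IRIs are standard.

```python
BASE = "http://example.org/fitness#"  # hypothetical namespace for this example
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def prediction_triples(record_id, features, predicted_calories):
    """Return N-Triples lines describing one prediction and its input features."""
    subject = f"<{BASE}{record_id}>"
    lines = [f"{subject} <{RDF_TYPE}> <{BASE}Prediction> ."]
    for name, value in features.items():
        lines.append(f'{subject} <{BASE}{name}> "{float(value)}"^^<{XSD_DOUBLE}> .')
    lines.append(
        f'{subject} <{BASE}predictedCalories> "{float(predicted_calories)}"^^<{XSD_DOUBLE}> .'
    )
    return lines


triples = prediction_triples(
    "record1", {"fat": 19.0, "steps": 8000, "weight": 68.0}, 2350.0
)
```

Each line is a subject, predicate and object terminated by a full stop, so the output can be loaded by any RDF store that accepts N-Triples.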

Acknowledgements

We wish to acknowledge the Department of Science and Technology, India and School of Computing, Sathyabama University, Chennai for providing the facilities to do the research under the DST-FIST Grant Project No. SR/FST/ETI-364/2014.

Disclosure statement

No potential conflict of interest was reported by the authors.

References

[1] Takacs J, Pollock CL, Guenther JR, et al. Validation of the fitbit One activity monitor device during treadmill walking. J Sci Med Sp. 2014; doi:10.1016/j.jsams.2013.10.241.

[2] Diaz KM, Krupka DJ, Chang MJ, et al. An accurate and reliable device for wireless physical activity tracking. Int J Cardiol. 2015; doi:10.1016/j.ijcard.2015.03.038.

[3] Cadmus-Bertram LA, Marcus BH, Patterson RE, et al. Randomized trial of a fitbit-based physical activity intervention for women. Am J Prev Med. 2015; doi:10.1016/j.amepre.2015.01.020.

[4] Sasaki JE, Hickey A, Mavilia M, et al. Validation of the fitbit wireless activity tracker for prediction of energy expenditure. J Phys Activity Health. 2015; doi:10.1123/jpah.2012-0495.

[5] Slootmaker SM, Schuit AJ, Chinapaw MJ, et al. Disagreement in physical activity assessed by accelerometer and self-report in subgroups of age, gender, education and weight status. Int J Behav Nutr Phys Act. 2009; doi:10.1186/1479-5868-6-17.

[6] Gusmer RJ, Bosch TA, Watkins AN, et al. Comparison of FitBit® ultra to ActiGraph™ GT1M for assessment of physical activity in young adults during treadmill walking. Open Sp Med J. 2014; doi:10.2174/1874387001408010011.

[7] Shameer K, Badgeley MA, Miotto R, et al. (2016). Translational bioinformatics in the era of real-time biomedical, health care and wellness data streams. doi:10.1093/bib/bbv118.

[8] Valletta JJ, Torney C, Kings M, et al. Applications of machine learning in animal behaviour studies. Anim Behav. 2017; doi:10.1016/j.anbehav.2016.12.005.

[9] Farias VAE, Sousa FRC, Maia JGR, et al. Regression based performance modeling and provisioning for NoSQL cloud databases. Future Gener Comput Syst - Int J eSci. 2018; doi:10.1016/j.future.2017.08.061.

[10] Ilapakurti A, Kedari S, Vuppalapati C. (2016). The role of big data in creating sense EHR, an integrated approach to create next generation mobile sensor and wearable data driven electronic health record (EHR). IEEE Second International Conference on Big Data Computing Service and Applications. doi:https://doi.org/10.1016/j.future.2017.08.061.

[11] Manogaran G, Thota C, Lopez D, et al. Big data knowledge system in healthcare. In: Bhatt C, Dey N, Ashour A, editor. Internet of things and big data technologies for next generation healthcare (pp. 133–157, vol. 23). Cham: Springer; 2017.

[12] Shamim Hossain M, Muhammad G. Cloud-based collaborative media service framework for healthcare. Int J Distrib Sens Netw. 2014; doi:10.1155/2014/858712.

[13] Vigneshwari S, Aramudhan M. Personalized cross ontological framework for secured document retrieval in the cloud. Natl Acad Sci Lett. 2015; doi:10.1007/s40009-015-0391-3.

[14] Reichherzer T, Timm M, Earley N, et al. Using machine learning techniques to track individuals & their fitness activities. In: A Bossard, G Lee, L Miller, editor. Proceedings of 32nd international conference on computers and their applications, March 20-22, 2017, Honolulu, Hawaii, USA. Winona, MN: ISCA; 2017. p. 119–124.

[15] Montgomery DC, Peck EA, Vining GG. Introduction to linear regression analysis (Vol. 821). London: John Wiley & Sons; 2012.