Experimental Elearning Application for Distributed Data Mining Systems

Pupezescu Valentin (1), Dragomir Marilena-Cătălina (2)

(1) Electronics, Telecommunications and Information Technology Faculty, Polytechnic University of Bucharest, Bd. Iuliu Maniu, Bucharest, ROMANIA
E-mail: vpupezescu[at]yahoo.com

(2) Electronics, Telecommunications and Information Technology Faculty, Polytechnic University of Bucharest, Bd. Iuliu Maniu, Bucharest, ROMANIA
E-mail: catalina.dragomir[at]protonmail.com

Abstract The development of machine learning algorithms, computing and communication in recent years has produced a world that depends on information. Most of this information is stored as raw data in real distributed database management systems. Although many scientific discoveries have been made in research fields such as Distributed Databases, Data Preparation, Machine Learning, Distributed Data Mining and Elearning, there is a lack of experimental applications that help students deepen their knowledge of all these fields together. This paper presents an experimental Elearning application that allows students to assimilate knowledge from the aforementioned research domains through experiments. The application has a module that imports, prepares and transforms data so that it can be processed by the data mining task. The data is stored in the MySQL database management system in a distributed manner, achieved through replication in a master-slave topology. Students can set the replication type: Statement, Row or Mixed Based Replication. The Data Mining task (classification) is performed in a distributed manner using Distributed Committee Machines with a modified version of the multilayer perceptron proposed in our previous research (the auto resetting multilayer perceptron). Users can choose from three standard data sets: iris1, wine1 and conc1. Through the web interface, users on the master system send the configuration parameters for the neural network and the addresses of the distributed slave systems. In the application, students can visualize the classification results of the distributed experiments and choose the highest-scoring classifier.

Keywords: Elearning, Distributed Data Mining, Distributed Database Management Systems, Machine Learning

1 Introduction

Data Mining (DM) algorithms have become very popular because of their potential to extract useful knowledge from large datasets. As the amount of information stored in database management systems (DBMS) grew, so did the need to obtain useful information from distributed database management systems (DDBMS). Most studies in this research field did not take into account the implementation aspects of real DBMS. Another problem is that students and many researchers do not have at their disposal elearning applications or tools that integrate real implementations of distributed databases and machine learning algorithms.

University of Bucharest and “1 December 1918” University of Alba Iulia

In this paper we propose an experimental elearning application that allows its users to learn about and work with a classical neural network in a distributed manner, using Distributed Committee Machines (DCM) on a real distributed database management system. The proposed application is called “Experimental DCM” and its current implementation is at version 1.0.

2 Application Architecture

Figure 1 represents the backend functionality and the backbone architecture of the proposed elearning application. This architecture was also used in our previous work (Pupezescu V., 2015), but in this paper we integrate it into an elearning application.

In this application we used the MySQL database management system installed on multiple computing systems. All MySQL servers are arranged in a distributed manner in a master-slave topology (Schwartz, B., et al, 2008). All database operations are performed on the master system and propagated to the slave systems through the replication process (Schwartz, B., et al, 2008).

The developed application allows an experimental study of the interaction between the data mining classification task and distributed databases. From our past research (Pupezescu V., 2015) we concluded that distributed committee machines are the structures best suited to mining distributed data.

Figure 1. The implementation of DCM architecture (Pupezescu, V., Rădescu, R., 2016)

Figure 2. Distributed Committee Machine (Pupezescu, V., Rădescu, R., 2016)

The DCM architecture (Figure 2) contains more than one neural structure. These structures work in a distributed manner in order to achieve better classification results (Tahir, M.A., 2007). In our case we worked with a classical multilayer perceptron (MLP) that has a modification in the backpropagation training algorithm: after a certain number of training and testing epochs, the MLP resets itself if it remains stuck in a local minimum. We call this structure the Auto Resetting Multilayer Perceptron (AMLP) (Pupezescu, V., 2017).
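The auto-reset idea can be illustrated with a small stall detector. This is only a minimal sketch, not the AMLP implementation from the cited paper: the class name AutoResetMonitor, the patience and minDelta parameters, and the rule "reset after a fixed number of epochs without improvement" are all assumptions made for illustration.

```java
/** Hypothetical stall detector for an auto-resetting training loop (not the authors' code). */
final class AutoResetMonitor {
    private final int patience;     // assumed: epochs without improvement tolerated before a reset
    private final double minDelta;  // assumed: smallest error decrease that counts as improvement
    private double bestError = Double.POSITIVE_INFINITY;
    private int stalledEpochs = 0;

    AutoResetMonitor(int patience, double minDelta) {
        this.patience = patience;
        this.minDelta = minDelta;
    }

    /** Records the error of one epoch; returns true when the network should be re-initialised. */
    boolean recordEpoch(double error) {
        if (error < bestError - minDelta) {
            // The error improved, so the network is not stuck.
            bestError = error;
            stalledEpochs = 0;
            return false;
        }
        stalledEpochs++;
        if (stalledEpochs >= patience) {
            // Stuck for too long: forget the history so training can start over.
            bestError = Double.POSITIVE_INFINITY;
            stalledEpochs = 0;
            return true;
        }
        return false;
    }
}
```

In a training loop, the caller would re-randomise the weights whenever recordEpoch returns true and then continue training as usual.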

The developed elearning application works with a “winner takes all” policy applied to the distributed neural networks.

The slave systems host distributed AMLP structures that run autonomously. After the last system finishes its classification task, the user on the master system uses the combiner module to extract the results from all distributed systems.

As Figure 1 shows, all the training and testing data sets are replicated (TR stands for training, TS stands for testing). The combiner extracts the classification results (Y1,..., YN) and presents an overview so that the user can interpret them.
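A "winner takes all" selection over the collected results can be sketched as follows. The SlaveResult class, its fields and the choice of the lowest misclassification rate as the winning criterion are assumptions for illustration; the paper does not list the combiner's code.

```java
import java.util.List;

/** Hypothetical result reported by one slave system. */
final class SlaveResult {
    final String slaveAddress;
    final double misclassificationRate;

    SlaveResult(String slaveAddress, double misclassificationRate) {
        this.slaveAddress = slaveAddress;
        this.misclassificationRate = misclassificationRate;
    }
}

final class Combiner {
    /** Returns the result with the lowest misclassification rate; the first one wins ties. */
    static SlaveResult selectWinner(List<SlaveResult> results) {
        if (results.isEmpty()) {
            throw new IllegalArgumentException("no results to combine");
        }
        SlaveResult best = results.get(0);
        for (SlaveResult candidate : results) {
            if (candidate.misclassificationRate < best.misclassificationRate) {
                best = candidate;
            }
        }
        return best;
    }
}
```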

The 13th International Conference on Virtual Learning ICVL 2018

3 Data Preparation Module

The problems that students/users can analyze with the Experimental DCM application are the iris1, wine1 and conc1 data sets (http://mlr.cs.umass.edu/ml/datasets/Iris, http://mlr.cs.umass.edu/ml/datasets/Wine).

For each data set we developed a Java module in the application that makes it easier for users to import data sets stored in csv files. For instance, data sets stored in Matlab can be exported to a simple csv file and then imported into the MySQL DBMS. Importing data is a common pain point for regular users because the industry does not yet have a standard format for storing data sets. The data sets are arranged and stored in the DBMS in the following format:

Table 1. iris1, wine1 and conc1 data sets (Pupezescu, V., 2016)

Data set  Property  trr  tsr  trs  tss
iris1     Lines     100   50  100   50
iris1     Columns     3    3    4    4
wine1     Lines      90   88   90   88
wine1     Columns     3    3   13   13
conc1     Lines     200  100  200  100
conc1     Columns     1    1    2    2
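The csv import step can be sketched as below. This is a minimal illustration rather than the application's actual module: the class CsvDataSet, the generated column names c1..cN and the use of a parameterised INSERT are assumptions. The expected column count would come from Table 1 (for example 4 for the iris1 trs set).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical csv helper for numeric data sets (names are illustrative only). */
final class CsvDataSet {
    /** Parses comma-separated numeric lines and checks that each row has the expected width. */
    static double[][] parse(List<String> lines, int expectedColumns) {
        List<double[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split(",");
            if (cells.length != expectedColumns) {
                throw new IllegalArgumentException(
                    "expected " + expectedColumns + " columns but found " + cells.length + ": " + line);
            }
            double[] row = new double[cells.length];
            for (int i = 0; i < cells.length; i++) {
                row[i] = Double.parseDouble(cells[i].trim());
            }
            rows.add(row);
        }
        return rows.toArray(new double[0][]);
    }

    /** Builds a parameterised INSERT for a table whose columns are assumed to be named c1..cN. */
    static String insertSql(String table, int columns) {
        StringBuilder names = new StringBuilder();
        StringBuilder marks = new StringBuilder();
        for (int i = 1; i <= columns; i++) {
            if (i > 1) {
                names.append(", ");
                marks.append(", ");
            }
            names.append("c").append(i);
            marks.append("?");
        }
        return "INSERT INTO " + table + " (" + names + ") VALUES (" + marks + ")";
    }
}
```

Each parsed row would then be bound to the statement through JDBC (for example with PreparedStatement.setDouble) and executed on the master, so that replication carries the rows to the slaves.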

4 Experimental Distributed Committee Machine v1.0

The application was written entirely in the Java programming language. The implementations of the neural networks are completely original: we did not use any neural network library to perform the DM classification task.

The neural networks are built as follows. First, we wrote a class for the neuron model (we use the classical neuron model).

We then wrote a class Layer that contains an array of neurons among its private members. The final structure (MLP) contains a given number of layers: an input layer, hidden layers and an output layer. The modified version of backpropagation (AMLP) is implemented in a separate class.
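The class structure described above could look roughly like the following forward-pass sketch. It is not the authors' source code: the sigmoid activation, the class name Mlp and the constructor shapes are assumptions, the input layer is represented implicitly by the input array, and training is omitted.

```java
/** Classical neuron: weighted sum plus bias, passed through a sigmoid (activation is assumed). */
final class Neuron {
    private final double[] weights;
    private final double bias;

    Neuron(double[] weights, double bias) {
        this.weights = weights.clone();
        this.bias = bias;
    }

    double activate(double[] inputs) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return 1.0 / (1.0 + Math.exp(-sum));
    }
}

/** A layer keeps its neurons in a private array, as described in the text. */
final class Layer {
    private final Neuron[] neurons;

    Layer(Neuron... neurons) {
        this.neurons = neurons.clone();
    }

    double[] forward(double[] inputs) {
        double[] outputs = new double[neurons.length];
        for (int i = 0; i < neurons.length; i++) {
            outputs[i] = neurons[i].activate(inputs);
        }
        return outputs;
    }
}

/** The MLP feeds the signal through its layers in order (hidden layers, then output layer). */
final class Mlp {
    private final Layer[] layers;

    Mlp(Layer... layers) {
        this.layers = layers.clone();
    }

    double[] predict(double[] inputs) {
        double[] signal = inputs;
        for (Layer layer : layers) {
            signal = layer.forward(signal);
        }
        return signal;
    }
}
```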

To start the elearning application, students must first install the Eclipse platform on their system. Secondly, they must import the training and testing data sets into their MySQL servers. Thirdly, they must configure replication across the entire distributed system and start the client servers on each slave system (these were developed using TCP sockets in Java). Lastly, the user on the master server (the combiner) can run the index.jsp page to start the entire distributed data mining task.
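A slave-side server built on TCP sockets might be sketched as follows. The one-line "RUN <hiddenNeurons>" request and the "ACCEPTED"/"ERROR" replies are a hypothetical protocol invented for this example; the paper does not describe the actual message format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical slave-side server; the text protocol is an assumption, not the authors' design. */
final class SlaveServer {
    /** Reads one request line and writes one reply line. */
    static void handle(BufferedReader in, PrintWriter out) {
        try {
            String line = in.readLine();
            if (line != null && line.startsWith("RUN ")) {
                int hidden = Integer.parseInt(line.substring(4).trim());
                // A real slave would train its AMLP here; this sketch only acknowledges the request.
                out.println("ACCEPTED hidden=" + hidden);
            } else {
                out.println("ERROR unknown command");
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Accepts a single connection from the master on the given server socket and handles it. */
    static void serveOnce(ServerSocket server) throws IOException {
        try (Socket master = server.accept();
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(master.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                 new OutputStreamWriter(master.getOutputStream(), StandardCharsets.UTF_8), true)) {
            handle(in, out);
        }
    }
}
```

Keeping the request handling separate from the socket plumbing lets the protocol be exercised with in-memory readers and writers, without opening a network port.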

All the web pages were developed using Java Server Pages technology.

The frontend of the web app has a responsive design developed with Bootstrap 4, an HTML, CSS and JavaScript framework that offers a large set of components and streamlines the creation of mobile-first designs.

The web app consists of 6 pages that follow the same general structure: a header section containing the name of the app, which links back to the homepage; a left menu with 2 sections; and a section that contains the main content of the page (Figure 4). The first menu section, Execution architecture, lets students select the execution architecture they want to experiment with. The second gives easy access to the documentation on the MLP and on configuring the replication process.

The page layout is built using the flexbox grid from Bootstrap 4 and predefined CSS classes that customize the font size, spacing, table, form and menu styling.

On the index.jsp page (Figure 3), students choose the mode in which they want to run the experiment. In the “Executie distribuita” mode (Distributed execution), users can start the distributed experiment. After the experiment finishes, it can optionally be reconstructed in a sequential manner in the “Reconstructie secventiala” mode (Sequential reconstruction). The application offers two more modes that allow the reconstruction of an experiment with an optimal DCM architecture proposed in our previous work (Pupezescu, V., 2017).

In the distributed run mode, the user can set the configuration parameters for the neural structure (Figures 4 and 5). Every optimum neural structure (from all the slave systems) is stored in the database as a BLOB object.
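One plausible way to turn a trained network into BLOB content is standard Java serialization, sketched below. The paper does not say how the structures are encoded, so the NetworkBlob class and the use of ObjectOutputStream are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical helper that converts a network object to and from BLOB bytes. */
final class NetworkBlob {
    /** Serialises the object graph into a byte array. */
    static byte[] toBytes(Serializable network) {
        try (ByteArrayOutputStream buffer = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(network);
            out.flush(); // make sure everything reaches the buffer before copying it
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Restores the object that was written by toBytes. */
    static Object fromBytes(byte[] blob) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The resulting byte array can be written to a BLOB column through JDBC with PreparedStatement.setBytes and read back with ResultSet.getBytes. The network classes would need to implement Serializable for this to work.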

Users can also configure the type of replication for the experiment: Statement Based Replication, Row Based Replication or Mixed Based Replication.
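In MySQL the three replication types correspond to the values STATEMENT, ROW and MIXED of the binlog_format system variable. The sketch below maps a user's choice to the matching SQL statement; the ReplicationType enum and its methods are hypothetical, and how the application actually applies the setting is not described in the paper.

```java
import java.util.Locale;

/** Hypothetical mapping from the replication type chosen in the UI to MySQL's binlog_format. */
enum ReplicationType {
    STATEMENT, ROW, MIXED;

    /** SQL that changes the binary log format on the master (requires sufficient privileges). */
    String toSql() {
        return "SET GLOBAL binlog_format = '" + name() + "'";
    }

    /** Parses user input such as "row" or " Mixed "; any other value is rejected. */
    static ReplicationType parse(String text) {
        return valueOf(text.trim().toUpperCase(Locale.ROOT));
    }
}
```

Parsing into an enum before building the statement keeps arbitrary user text out of the SQL string.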

Figure 3. The index.jsp page

Figure 4. The start.jsp page and the setup parameters for the AMLP structures

Figure 5. The distributed run of the DCM

After setting the configuration parameters for all the neural networks, the user on the master system must wait until the distributed runs finish their tasks. The user then sets the name of the csv file that will store the experimental findings for later interpretation.

Figure 6. The final results (misclassification rates and execution performance) for the DCM

Figure 7 shows the reconstruction module. If some experiments must be re-run with the same neural structures from past executions, the user can choose the experiments of interest.

Figure 7. The reconstruction module for the DCM architecture

Conclusions

In this paper we proposed an experimental elearning application for students who are interested in researching real implementations of, and interactions between, distributed neural architectures and a commercial DBMS (for this version we used MySQL). This approach is unique in the elearning field because students can directly experiment with and interpret data mining solutions that are close to industry practice. The application is implemented using modern technologies that allow work from desktops or mobile devices.

In the near future the application will be further enhanced with a login module and with other neural structures. One of our future goals is to implement in this application a General Committee Machine that can analyze big data sets in a distributed manner with multiple types of neural networks. This work is useful in research fields such as Elearning, Machine Learning, Data Mining and Knowledge Discovery in Distributed Databases.

References

Pupezescu, V. (2015), The Influence of Database Engines in Distributed Committee Machine Architectures, Proceedings of the 10th International Conference on Virtual Learning (ICVL-2015), Timişoara, pp. 240-246, October 31, ISSN 1844-8933, 2015.

Schwartz, B., Zaitsev, P., Tkachenko, V., Zawodny, J., Lentz, A., Balling, D. (2008), High Performance MySQL, Second Edition, O’Reilly Media, ISBN: 978-0-596-10171-8, United States of America, 2008.

Pupezescu, V., Rădescu, R. (2016), The Influence of Data Replication in the Knowledge Discovery in Distributed Databases Process, ECAI 2016, International Conference, 8th Edition, 30 June - 02 July, Ploieşti, ROMÂNIA, 2016.

Tahir, M.A. (2007), Java Implementation of Neural Networks, ISBN 1-4196-6535-9, 2007.

Pupezescu, V. (2017), Auto Resetting Multilayer Perceptron in an Adaptive Elearning Architecture, Proceedings of the 12th International Conference on Virtual Learning (ICVL-2017), pp. 311-317, October 28, Sibiu, ISSN: 1844-8933, 2017.

Pupezescu, V. (2016), Distributed neural structures in adaptive eLearning systems, Proceedings of the 11th International Conference on Virtual Learning (ICVL-2016), 2016.

http://mlr.cs.umass.edu/ml/datasets/Iris

http://mlr.cs.umass.edu/ml/datasets/Wine

Pupezescu, V. (2015), Advances in Knowledge Discovery in Distributed Databases, Proceedings of the 11th International Scientific Conference eLearning and Software for Education (eLSE-2015), Bucharest, April 23-24, pp. 311-319, ISSN 2066-026X, 2015.