Experimental Elearning Application for
Distributed Data Mining Systems
Pupezescu Valentin (1), Dragomir Marilena-Cătălina (2)
(1) Electronics, Telecommunications and Information Technology Faculty, Polytechnic
University of Bucharest, Bd. Iuliu Maniu, Bucharest, ROMANIA
E-mail: vpupezescu[at]yahoo.com
(2) Electronics, Telecommunications and Information Technology Faculty, Polytechnic
University of Bucharest, Bd. Iuliu Maniu, Bucharest, ROMANIA
E-mail: catalina.dragomir[at]protonmail.com
Abstract: The development of machine learning algorithms, computing and communication in
recent years has produced a world that depends on information. Nowadays, most
information is stored as raw data in real distributed database management systems. Although
many scientific discoveries have been made in research fields such as Distributed Databases, Data
Preparation, Machine Learning, Distributed Data Mining and Elearning, there is a lack of
experimental applications that help deepen knowledge of all these research fields
blended together. This paper presents an experimental Elearning application that allows
students to assimilate knowledge through experiments from the aforementioned research
domains. The application has a module that imports, prepares and transforms data so that it can
be processed by the data mining task. The data is stored in the MySQL database
management system in a distributed manner, achieved through the replication process in a
master-slave topology. Students can set the replication type: Statement, Row or Mixed Based
Replication. The Data Mining task (classification) is performed in a distributed manner using
Distributed Committee Machines with a modified version of a multilayer perceptron proposed
in our previous research (the auto resetting multilayer perceptron). Users can choose from
three standard data sets: iris1, wine1 and conc1. Through the web interface, the user on
the master system sends the configuration parameters for the neural network and the
addresses of the distributed slave systems. In the application, students can visualize the
classification results derived from the distributed experiments and choose the highest-scoring
classifier.
Keywords: Elearning, Distributed Data Mining, Distributed Database Management
Systems, Machine Learning
1 Introduction
Data Mining (DM) algorithms have become very popular because of their potential
to extract useful knowledge from large datasets. As more and more information is
stored in database management systems (DBMS), it has become necessary to obtain useful
information from distributed database management systems (DDBMS). Most studies
in this research field did not take into account the implementation aspects of real DBMS.
Another problem is that students and many researchers do not have at their disposal
elearning applications or tools that integrate real implementations of distributed databases and
machine learning algorithms.
University of Bucharest and “1 December 1918” University of Alba Iulia
In this paper we propose an experimental elearning application that allows its users to learn
and work with a classical neural network in a distributed manner using Distributed Committee
Machines (DCM) on a real distributed database management system. The proposed application is
called “Experimental DCM” and its current implementation is at version 1.0.
2 Application Architecture
Figure 1 below represents the backend functionality and
the backbone architecture of the proposed elearning application. This architecture was also used in
our previous work (Pupezescu V., 2015), but in this paper we integrate it into an elearning
application.
In this application we used the MySQL database management system installed on multiple
computing systems. All MySQL servers are arranged in a distributed manner in a master-slave
topology (Schwartz, B., et al, 2008). All database operations are performed on the master system
and propagated to the slave systems through the replication process (Schwartz, B., et al, 2008).
The developed application allows an experimental study of the interaction between the data
mining classification task and distributed databases. From our past research (Pupezescu V., 2015)
we concluded that the structures best suited for mining distributed data are
distributed committee machines.
Figure 1. The implementation of the DCM architecture (Pupezescu, V., Rădescu, R., 2016)
Figure 2. Distributed Committee Machine (Pupezescu, V., Rădescu, R., 2016)
The DCM architecture (Figure 2) contains more than one neural structure; these work in a
distributed manner in order to achieve better classification results (Tahir, M.A., 2007). In our case
we worked with a classical multilayer perceptron (MLP) that has a modification in the
backpropagation training algorithm: after a certain number of training and testing epochs, the MLP
resets itself if it remains stuck in a local minimum. We call this the Auto Resetting Multilayer
Perceptron (AMLP) (Pupezescu, V., 2017).
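The reset decision described above can be illustrated with a small helper. This is only a sketch under stated assumptions: the text does not give the exact criterion the AMLP uses, so the class name AutoResetMonitor, the "patience" (number of epochs tolerated without improvement) and the "minDelta" threshold are all hypothetical.

```java
/** Decides when a stalled network should be re-initialised (hypothetical helper). */
class AutoResetMonitor {
    private final int patience;     // epochs tolerated without improvement (assumed criterion)
    private final double minDelta;  // smallest error decrease that counts as progress (assumed)
    private double bestError = Double.POSITIVE_INFINITY;
    private int stalledEpochs = 0;

    AutoResetMonitor(int patience, double minDelta) {
        this.patience = patience;
        this.minDelta = minDelta;
    }

    /** Records one epoch's test error; returns true when the weights should be reset. */
    boolean shouldReset(double epochError) {
        if (epochError < bestError - minDelta) {
            // The error improved, so the run is not stuck.
            bestError = epochError;
            stalledEpochs = 0;
            return false;
        }
        stalledEpochs++;
        if (stalledEpochs >= patience) {
            // Forget the stalled run so that the next one is judged afresh.
            bestError = Double.POSITIVE_INFINITY;
            stalledEpochs = 0;
            return true;
        }
        return false;
    }
}

class AutoResetDemo {
    public static void main(String[] args) {
        AutoResetMonitor monitor = new AutoResetMonitor(3, 1e-6);
        double[] errors = {0.50, 0.40, 0.40, 0.40, 0.40};
        for (double e : errors) {
            System.out.println("error=" + e + " reset=" + monitor.shouldReset(e));
        }
    }
}
```

In a training loop, a true return value would trigger a fresh random initialisation of the weights before the next epoch.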
The developed elearning application works with a “winner takes all” policy applied to the
distributed neural networks.
On the slave systems, distributed AMLP structures run autonomously.
After the last system finishes its classification task, the user on the master system uses the
combiner module to extract the results from all distributed systems.
As Figure 1 shows, all the training and testing data sets are replicated (TR
stands for training, TS stands for testing). The combiner extracts the classification results (Y1,...,
YN) and presents an overview that the user can interpret.
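A “winner takes all” selection over the collected results could look like the sketch below. It assumes that each slave reports a single misclassification rate and that the lowest rate wins; the class names SlaveResult and Combiner, the field names and the addresses are hypothetical and are not taken from the application.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** One slave's outcome: its address and misclassification rate (hypothetical fields). */
class SlaveResult {
    final String address;
    final double misclassificationRate;

    SlaveResult(String address, double misclassificationRate) {
        this.address = address;
        this.misclassificationRate = misclassificationRate;
    }
}

class Combiner {
    /** Winner takes all: returns the result with the lowest misclassification rate, if any. */
    static Optional<SlaveResult> selectWinner(List<SlaveResult> results) {
        return results.stream()
                .min(Comparator.comparingDouble((SlaveResult r) -> r.misclassificationRate));
    }
}

class CombinerDemo {
    public static void main(String[] args) {
        // Illustrative addresses and rates only, not measured results.
        List<SlaveResult> results = Arrays.asList(
                new SlaveResult("192.168.0.11", 0.06),
                new SlaveResult("192.168.0.12", 0.02));
        Combiner.selectWinner(results)
                .ifPresent(w -> System.out.println(w.address + " " + w.misclassificationRate));
    }
}
```

Returning an Optional keeps the case where no slave has reported yet explicit instead of failing on an empty list.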
The 13th International Conference on Virtual Learning ICVL 2018
3 Data Preparation Module
The problems that students/users can analyze with the Experimental DCM
application are the iris1, wine1 and conc1 data sets (http://mlr.cs.umass.edu/ml/datasets/Iris,
http://mlr.cs.umass.edu/ml/datasets/Wine).
For each data set we developed a Java module in the application that makes it easier for
users to import data sets stored in csv files. For instance, they can export data sets
stored in Matlab to a simple csv file and import them into the MySQL DBMS. Importing data is a
recurring difficulty for regular users, because the industry does not yet have a standard for
storing data sets. The data sets are arranged and stored in the DBMS in the
following format:
Table 1. iris1, wine1 and conc1 data sets (Pupezescu, V., 2016)
Data set            trr    tsr    trs    tss
iris1   Lines       100     50    100     50
iris1   Columns       3      3      4      4
wine1   Lines        90     88     90     88
wine1   Columns       3      3     13     13
conc1   Lines       200    100    200    100
conc1   Columns       1      1      2      2
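The csv import step described before Table 1 might be sketched as follows. This is a minimal illustration, not the application's import module: it assumes purely numeric, comma-separated lines without a header row, and the class name CsvImport and the table name passed to insertSql are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CsvImport {
    /** Parses comma-separated numeric lines into rows, skipping blank lines. */
    static List<double[]> parse(List<String> lines) {
        List<double[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] cells = line.split(",");
            double[] row = new double[cells.length];
            for (int i = 0; i < cells.length; i++) {
                row[i] = Double.parseDouble(cells[i].trim());
            }
            rows.add(row);
        }
        return rows;
    }

    /** Builds a parameterised INSERT for a table with the given number of columns. */
    static String insertSql(String table, int columns) {
        StringBuilder sql = new StringBuilder("INSERT INTO " + table + " VALUES (");
        for (int i = 0; i < columns; i++) {
            sql.append(i == 0 ? "?" : ", ?");
        }
        return sql.append(")").toString();
    }
}

class CsvImportDemo {
    public static void main(String[] args) {
        List<double[]> rows = CsvImport.parse(Arrays.asList("5.1,3.5,1.4,0.2", "4.9,3.0,1.4,0.2"));
        // "trs" is used here only as a placeholder table name.
        System.out.println(CsvImport.insertSql("trs", rows.get(0).length));
    }
}
```

With JDBC, each parsed row would then be bound to the statement with PreparedStatement.setDouble and executed against the master server, so that replication carries the rows to the slaves.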
4 Experimental Distributed Committee Machine v1.0
The application was written entirely in the Java programming language. The implementations of
the neural networks are completely original: we did not use any neural network library to
perform the DM classification task.
The neural networks are built as follows: one class implements the neuron
model (we use the classical neuron model).
A second class, Layer, contains among its private members an array
of neurons. The final structure (MLP) contains a given number of layers: the input layer, hidden layers
and the output layer. The modified version of backpropagation (AMLP) is implemented in a
separate class.
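The class layout described above can be sketched as follows. This is not the application's code: it assumes a sigmoid activation for the classical neuron, treats the input layer as the raw input vector, and shows only the forward pass. Apart from Layer, all names (Neuron, Mlp, activate, forward, predict) are hypothetical.

```java
/** Classical neuron: weighted sum plus bias, passed through a sigmoid (assumed activation). */
class Neuron {
    private final double[] weights;
    private final double bias;

    Neuron(double[] weights, double bias) {
        this.weights = weights.clone();
        this.bias = bias;
    }

    double activate(double[] inputs) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return 1.0 / (1.0 + Math.exp(-sum));
    }
}

/** A layer keeps its neurons in a private array, as described in the text. */
class Layer {
    private final Neuron[] neurons;

    Layer(Neuron[] neurons) {
        this.neurons = neurons.clone();
    }

    double[] forward(double[] inputs) {
        double[] outputs = new double[neurons.length];
        for (int i = 0; i < neurons.length; i++) {
            outputs[i] = neurons[i].activate(inputs);
        }
        return outputs;
    }
}

/** Forward pass through the hidden and output layers; the input layer is the input vector itself. */
class Mlp {
    private final Layer[] layers;

    Mlp(Layer[] layers) {
        this.layers = layers.clone();
    }

    double[] predict(double[] inputs) {
        double[] signal = inputs;
        for (Layer layer : layers) {
            signal = layer.forward(signal);
        }
        return signal;
    }
}

class MlpDemo {
    public static void main(String[] args) {
        Neuron n = new Neuron(new double[] {1.0, 1.0}, -2.0);
        Mlp mlp = new Mlp(new Layer[] {new Layer(new Neuron[] {n})});
        System.out.println(mlp.predict(new double[] {1.0, 1.0})[0]);
    }
}
```

Keeping training in a separate class, as the text describes for the AMLP, leaves these three classes responsible only for the network structure and the forward computation.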
To start the elearning application, students must first install the
Eclipse platform on their system. Secondly, they must import the training and testing data sets into
their MySQL servers. Thirdly, they must configure replication on the entire distributed system
and start the client servers on each slave system (these were developed using TCP sockets in Java).
Lastly, the user on the master server (combiner) can run the index.jsp page to start the
entire distributed data mining task.
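A slave-side server built on TCP sockets could be sketched as below. The text does not describe the wire protocol, so the one-line "RUN key=value ..." request and the "ACCEPTED"/"ERROR" replies are invented for this illustration, as are the class name SlaveServer and its methods.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class SlaveServer {
    /** Handles one request line; the "RUN ..." format is an assumption of this sketch. */
    static String handle(String request) {
        if (request == null || !request.startsWith("RUN ")) {
            return "ERROR unknown command";
        }
        // A real slave would start training its AMLP here; the sketch only acknowledges.
        return "ACCEPTED " + request.substring(4).trim();
    }

    /** Accepts connections one at a time and answers a single line per connection. */
    static void serve(int port) throws IOException {
        try (ServerSocket server = new ServerSocket(port)) {
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8));
                     PrintWriter out = new PrintWriter(
                             new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8), true)) {
                    out.println(handle(in.readLine()));
                }
            }
        }
    }
}
```

Each slave would call SlaveServer.serve with its chosen port, and the master would open a Socket to every slave address entered in the web interface and send the configuration line. Separating handle from serve lets the request logic be checked without opening a socket.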
All the web pages were developed using Java Server Pages technology.
The frontend of the web app has a responsive design, developed using Bootstrap 4, an
HTML, CSS and JavaScript framework that offers a large set of components which streamline the
process of creating mobile-first designs.
The web app consists of 6 pages that follow the same general structure: a header section containing the name of the app, which links back to the homepage; a left menu with 2 sections, one called Execution architecture, which lets students select the execution architecture they want to experiment with, and one giving easy access to the documentation on the MLP and on configuring the replication process; and a section that contains the main content of the page (Figure 4).
Their layout is built using the flexbox grid from Bootstrap 4 and predefined CSS classes that customize the font size, spacing, and the styling of tables, forms and menus.
In index.jsp (Figure 3) students choose the mode in which they want to run the experiment. In the “Executie distribuita” mode (Distributed execution), users can start the distributed experiment. After the experiment finishes, the entire experiment can optionally be reconstructed sequentially in the “Reconstructie secventiala” mode (Sequential reconstruction). The application offers two more modes that allow the reconstruction of an experiment with an optimum DCM architecture proposed in our previous work (Pupezescu, V., 2017).
In the distributed run mode, the user can set the configuration parameters for the neural structure (Figures 4 and 5). Every optimum neural structure (from all the slave systems) is stored in the database as a BLOB object.
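One way to turn a trained structure into bytes suitable for a BLOB column is standard Java serialization, sketched below. The text does not say how the application encodes its networks, so this is an assumption; NetworkSnapshot, its fields and BlobCodec are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Minimal stand-in for a trained network: layer sizes and flattened weights (hypothetical). */
class NetworkSnapshot implements Serializable {
    private static final long serialVersionUID = 1L;
    final int[] layerSizes;
    final double[] weights;

    NetworkSnapshot(int[] layerSizes, double[] weights) {
        this.layerSizes = layerSizes.clone();
        this.weights = weights.clone();
    }
}

class BlobCodec {
    /** Serializes an object into the byte array that would be written to the BLOB column. */
    static byte[] toBytes(Serializable value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Restores an object from bytes read back from the BLOB column. */
    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Snapshot class not on the classpath", e);
        }
    }
}

class BlobCodecDemo {
    public static void main(String[] args) {
        NetworkSnapshot snapshot = new NetworkSnapshot(new int[] {4, 5, 3}, new double[] {0.1, -0.2});
        byte[] blob = BlobCodec.toBytes(snapshot);
        NetworkSnapshot restored = (NetworkSnapshot) BlobCodec.fromBytes(blob);
        System.out.println(restored.layerSizes.length + " layers restored from " + blob.length + " bytes");
    }
}
```

With JDBC, the byte array could be written with PreparedStatement.setBytes and read back with ResultSet.getBytes, which is what would make a later sequential reconstruction of the same structure possible.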
Users can also configure the type of replication for the experiment: Statement Based Replication, Row Based Replication or Mixed Based Replication.
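In MySQL, the three replication types correspond to the values STATEMENT, ROW and MIXED of the binlog_format system variable. The sketch below maps a user's choice to the matching SQL statement; the enum name ReplicationType, its methods and the form values it parses are hypothetical, and how the application itself applies the setting is not described in the text.

```java
import java.util.Locale;

/** Replication modes named in the text, mapped to MySQL's binlog_format values. */
enum ReplicationType {
    STATEMENT, ROW, MIXED;

    /** Parses a user choice such as "row" or " Mixed " (hypothetical form values). */
    static ReplicationType parse(String choice) {
        return valueOf(choice.trim().toUpperCase(Locale.ROOT));
    }

    /** SQL to run on the master; a global change applies to sessions opened afterwards. */
    String toSetStatement() {
        return "SET GLOBAL binlog_format = '" + name() + "'";
    }
}

class ReplicationDemo {
    public static void main(String[] args) {
        System.out.println(ReplicationType.parse("row").toSetStatement());
    }
}
```

The resulting string could be executed over a JDBC connection to the master by an account with sufficient privileges. Using an enum means that an unsupported choice fails with an IllegalArgumentException instead of reaching the server.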
Figure 3. The index.jsp page
Figure 4. The start.jsp page and the setup parameters for the AMLP structures
Figure 5. The distributed run of the DCM
After setting the configuration parameters for all the neural networks, the user on the master
system must wait until the distributed runs finish their tasks. The user must then set the
name of the csv file that will store the experimental findings for future interpretation.
Figure 6. The final results (misclassification rates and execution performance) for the DCM
Figure 7 shows the reconstruction module that was developed. If some experiments
must be re-run with the same neural structures from past executions, we can
choose the experiments we are interested in.
Figure 7. The reconstruction module for the DCM architecture
Conclusions
In this paper we proposed an experimental elearning application for students who are
interested in researching real implementations of, and interactions between, distributed neural
architectures and a commercial DBMS (for this version we used MySQL). This approach is unique
in the elearning field because students can directly experiment with and interpret data
mining solutions that are closer to industry practice. The application is implemented using
modern technologies that allow work from desktops or mobile devices.
In the near future the application will be further enhanced with a login module and with other
neural structures. One of our future goals is to implement in this application a General Committee
Machine that is able to analyze big data sets in a distributed manner with multiple types of neural
networks. This work is useful in research fields such as Elearning, Machine Learning, Data
Mining and Knowledge Discovery in Distributed Databases.
References
Pupezescu, V. (2015), The Influence of Database Engines in Distributed Committee Machine Architectures,
Proceedings of the 10th International Conference on Virtual Learning (ICVL-2015), Timişoara, pp. 240-246,
October 31, ISSN 1844-8933, 2015.
Schwartz, B., Zaitsev, P., Tkachenko, V., Zawodny, J., Lentz, A., Balling, D. (2008), High Performance
MySQL, Second Edition, O’Reilly Media, ISBN: 978-0-596-10171-8, United States of America, 2008.
Pupezescu, V., Rădescu, R. (2016), The Influence of Data Replication in the Knowledge Discovery in
Distributed Databases Process, ECAI 2016 – International Conference – 8th Edition, 30 June – 02 July,
Ploieşti, ROMÂNIA, 2016.
Tahir, M.A. (2007), Java Implementation of Neural Networks, ISBN 1-4196-6535-9, 2007.
Pupezescu, V. (2017), Auto Resetting Multilayer Perceptron in an Adaptive Elearning Architecture,
Proceedings of the 12th International Conference on Virtual Learning (ICVL-2017), pp. 311-317,
October 28, Sibiu, ISSN: 1844-8933, 2017.
Pupezescu, V. (2016), Distributed neural structures in adaptive eLearning systems, Proceedings of the 11th
International Conference on Virtual Learning (ICVL-2016), 2016.
http://mlr.cs.umass.edu/ml/datasets/Iris
http://mlr.cs.umass.edu/ml/datasets/Wine
Pupezescu, V. (2015), Advances in Knowledge Discovery in Distributed Databases, Proceedings of the 11th
International Scientific Conference eLearning and Software for Education (eLSE-2015), Bucharest, April
23-24, pp. 311-319, ISSN 2066-026X, 2015.