
EXPERIMENT-1

AIM: Study of ETL process and its tools.

What is ETL?

ETL is an abbreviation of Extract, Transform and Load. In this process, an ETL tool extracts the data from different RDBMS source systems, then transforms the data by applying calculations, concatenations, etc., and then loads the data into the Data Warehouse system.

It's tempting to think that creating a Data warehouse is simply a matter of extracting data from multiple sources and loading it into the database of a Data warehouse. This is far from the truth: it requires a complex ETL process. The ETL process requires active inputs from various stakeholders, including developers, analysts, testers and top executives, and is technically challenging.

In order to maintain its value as a tool for decision-makers, a Data warehouse system needs to change with business changes. ETL is a recurring activity (daily, weekly, monthly) of a Data warehouse system and needs to be agile, automated, and well documented.

Why do you need ETL?

There are many reasons for adopting ETL in the organization:

● It helps companies to analyze their business data for taking critical business decisions.

● ETL provides a method of moving the data from various sources into a data

warehouse.

● As data sources change, the Data Warehouse will automatically update.

● It allows verification of data transformation, aggregation and calculation rules.

● The ETL process allows sample data comparison between the source and the target system.

ETL Process in Data Warehouses


Step 1) Extraction

In this step, data is extracted from the source system into the staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data is copied directly from the source into the Data warehouse database, rollback will be a challenge. The staging area gives an opportunity to validate extracted data before it moves into the Data warehouse.

Hence one needs a logical data map before data is extracted and loaded physically. This data

map describes the relationship between sources and target data.

Three Data Extraction methods:

1. Full Extraction

2. Partial Extraction- without update notification.

3. Partial Extraction- with update notification

Irrespective of the method used, extraction should not affect the performance and response time of the source systems. These source systems are live production databases. Any slowdown or locking could affect the company's bottom line.
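As an illustration of partial extraction without update notification, the following JDBC sketch (connection URLs, table and column names are all hypothetical) copies only the rows changed since the last run into a staging table, using a last-modified timestamp:

import java.sql.*;

// Hypothetical sketch: pull rows changed since the last ETL run from the
// source system into a staging table. All URLs and table/column names are
// made up; a real job would read "lastRun" from its own ETL metadata store.
public class IncrementalExtract {
    public static void main(String[] args) throws SQLException {
        Timestamp lastRun = Timestamp.valueOf("2020-01-01 00:00:00");

        try (Connection src = DriverManager.getConnection("jdbc:mysql://source-host/sales", "etl", "secret");
             Connection stg = DriverManager.getConnection("jdbc:mysql://staging-host/stage", "etl", "secret");
             PreparedStatement read = src.prepareStatement(
                     "SELECT customer_id, first_name, last_name, modified_at FROM customers WHERE modified_at > ?");
             PreparedStatement write = stg.prepareStatement(
                     "INSERT INTO stg_customers (customer_id, first_name, last_name, modified_at) VALUES (?, ?, ?, ?)")) {

            read.setTimestamp(1, lastRun);
            try (ResultSet rs = read.executeQuery()) {
                while (rs.next()) {                       // copy each changed row into staging
                    write.setLong(1, rs.getLong(1));
                    write.setString(2, rs.getString(2));
                    write.setString(3, rs.getString(3));
                    write.setTimestamp(4, rs.getTimestamp(4));
                    write.addBatch();
                }
            }
            write.executeBatch();                         // validations then run in the staging area
        }
    }
}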

Some validations are done during Extraction:

● Reconcile records with the source data

● Make sure that no spam/unwanted data is loaded

● Data type check

● Remove all types of duplicate/fragmented data

● Check whether all the keys are in place or not

Step 2) Transformation

Data extracted from the source server is raw and not usable in its original form. Therefore it needs to be cleansed, mapped and transformed. In fact, this is the key step where the ETL process adds value and changes data so that insightful BI reports can be generated.


In this step, you apply a set of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data.

In the transformation step, you can perform customized operations on the data. For instance, the user may want a sum-of-sales revenue figure that is not in the database, or the first name and the last name in a table may be in different columns; it is possible to concatenate them before loading.
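As a small, self-contained illustration of such transformations, the Java sketch below (with made-up field names) concatenates a split name, maps NULL to 0 and converts gender values to single-letter codes:

// Minimal sketch of common ETL transformations mentioned above.
// The record layout is hypothetical; a real job would apply these
// functions to rows read from the staging area.
public class TransformExample {
    static String fullName(String first, String last) {
        return (first + " " + last).trim();              // concatenate split name columns
    }

    static double nullToZero(Double value) {
        return value == null ? 0.0 : value;              // map NULL to 0 for numeric measures
    }

    static String genderCode(String gender) {
        if (gender == null) return "U";                  // unknown
        switch (gender.trim().toLowerCase()) {
            case "male":   return "M";
            case "female": return "F";
            default:       return "U";
        }
    }

    public static void main(String[] args) {
        System.out.println(fullName("Jane", "Doe"));     // -> "Jane Doe"
        System.out.println(nullToZero(null));            // -> 0.0
        System.out.println(genderCode("Female"));        // -> "F"
    }
}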

Following are Data Integrity Problems:

1. Different spelling of the same person like Jon, John, etc.

2. There are multiple ways to denote company name like Google, Google Inc.

3. Use of different names like Cleaveland, Cleveland.

4. There may be a case that different account numbers are generated by various

applications for the same customer.

5. In some data, required fields remain blank.

6. Invalid product data collected at the POS, as manual entry can lead to mistakes.

Validations are done during this stage

● Filtering – select only certain columns to load.

● Data threshold validation check. For example, age cannot be more than two digits.

● Required fields should not be left blank.

● Cleaning (for example, mapping NULL to 0, or Gender "Male" to "M" and "Female" to "F", etc.).

● Using any complex data validation (e.g., if the first two columns in a row are empty, then reject the row from processing).

Step 3) Loading

Loading data into the target data warehouse database is the last step of the ETL process. In a typical Data warehouse, a huge volume of data needs to be loaded in a relatively short period (overnight). Hence, the load process should be optimized for performance.
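A minimal sketch of a performance-conscious, restartable load step is shown below (the JDBC URL, table and column names are hypothetical): rows are inserted in batches inside a single transaction, so a failed load can be rolled back and rerun from a clean state.

import java.sql.*;

// Hypothetical sketch of an optimized, restartable load into a fact table.
// Batched inserts keep the load fast; the surrounding transaction ensures
// that a failure leaves no half-loaded data behind.
public class LoadFactTable {
    public static void load(double[][] rows) throws SQLException {
        // each row: { productKey, dateKey, revenue }
        try (Connection dw = DriverManager.getConnection("jdbc:postgresql://dw-host/warehouse", "etl", "secret")) {
            dw.setAutoCommit(false);                               // one transaction per load run
            try (PreparedStatement ins = dw.prepareStatement(
                    "INSERT INTO fact_sales (product_key, date_key, revenue) VALUES (?, ?, ?)")) {
                int count = 0;
                for (double[] r : rows) {
                    ins.setLong(1, (long) r[0]);
                    ins.setLong(2, (long) r[1]);
                    ins.setDouble(3, r[2]);
                    ins.addBatch();
                    if (++count % 1000 == 0) ins.executeBatch();   // flush in chunks for performance
                }
                ins.executeBatch();
                dw.commit();
            } catch (SQLException e) {
                dw.rollback();                                     // restartable: nothing half-loaded remains
                throw e;
            }
        }
    }
}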


In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity. Data Warehouse admins need to monitor, resume or cancel loads according to prevailing server performance.

Types of Loading:

● Initial Load – populating all the Data Warehouse tables.

● Incremental Load – applying ongoing changes periodically, as and when needed.

● Full Refresh – erasing the contents of one or more tables and reloading them with fresh data.

Load verification

● Ensure that the key field data is neither missing nor null.

● Test modeling views based on the target tables.

● Check combined values and calculated measures.

● Data checks in the dimension table as well as the history table.

● Check the BI reports on the loaded fact and dimension tables.

ETL tools

There are many Data Warehousing tools available in the market. Here are some of the most prominent ones:

1. Sybase

Sybase is a strong player in the data integration market. The Sybase ETL tool is developed for loading data from different data sources, transforming it into data sets, and finally loading this data into the data warehouse.

Sybase ETL uses sub-components such as Sybase ETL Server and Sybase ETL Development.

Key Features:

● Sybase ETL provides automation for data integration.

● Simple GUI to create data integration jobs.

● Easy to understand; no separate training is required.

● The Sybase ETL dashboard provides a quick view of where exactly the processes stand.

● Real-time reporting and a better decision-making process.

● It supports only the Windows platform.

● It minimizes the cost, time and human effort of the data integration and extraction process.

2. Oracle - Warehouse Builder

Oracle has introduced an ETL tool known as Oracle Warehouse Builder (OWB). It is a

graphical environment which is used to build and manage the data integration process.

OWB uses various data sources in the data warehouse for integration purposes. The core capabilities of OWB are data profiling, data cleansing, fully integrated data modeling and data auditing. OWB uses an Oracle database to transform data from various sources and can connect to various other third-party databases.


Key Features:

● OWB is a comprehensive and flexible tool for data integration strategy.

● It allows a user to design and build the ETL processes.

● It supports 40 metadata files from various vendors.

● OWB supports Flat files, Sybase, SQL Server, Informix and Oracle Database as a target database.

● OWB supports data types such as numeric, text, date, etc.

3. CloverETL

CloverETL is developed by a company named Javlin, which was launched in 2002 and has offices across the globe, including the USA, Germany, and the UK. Javlin mainly provides services such as data processing and data integration.

CloverETL is a high-performance data transformation and robust data integration platform.

CloverETL processes a huge volume of data and transfers the data to various destinations.

CloverETL consists of three packages: CloverETL Engine, CloverETL Designer, and CloverETL Server.

Key Features:

● CloverETL is commercial ETL software.

● CloverETL has a Java-based framework.

● Easy to install, with a simple user interface.

● Combines business data in a single format from various sources.

● It supports Windows, Linux, Solaris, AIX and OSX platforms.

● It is used for data transformation, data migration, data warehousing and data cleansing.

● It helps to create various reports using data from the source.

4. Jaspersoft

Jaspersoft is a leader in data integration, launched in 1991 with its headquarters in California, United States. It extracts, transforms and loads data from various sources into the data warehouse.

Jaspersoft ETL is part of the Jaspersoft Business Intelligence suite and is a data integration platform with high-performing ETL capabilities.

Key Features:

● Jaspersoft ETL is an open source ETL tool.

● It has an activity monitoring dashboard which helps to monitor job execution and its performance.

● It has connectivity to applications like SugarCRM, SAP, Salesforce.com, etc.

● It also has connectivity to Big Data environments such as Hadoop, MongoDB, etc.

● It provides a graphical editor to view and edit the ETL processes.

● Using the GUI, it allows the user to design, schedule and execute data movement, transformation, etc.

● Real-time, end-to-end process and ETL statistics tracking.


EXPERIMENT-2

AIM: Overview of these terms: DBMS, Data Warehouse, Data Science, Data Mining, History and Origin of Data Mining, ML, AI, ETL.

What is DBMS?

Data is nothing but facts and statistics, stored or flowing freely over a network; generally it is raw and unprocessed. For example, when you visit any website, it might store your IP address (that is data), and in return it might add a cookie to your browser marking that you visited the website (that is also data). Your name is data; your age is data.

Data becomes information when it is processed and turned into something meaningful. For example, if a website can analyse the cookie data saved in users' browsers and find that men aged 20-25 visit it most, that is information, derived from the data collected.

A database is a collection of related data organised in a way that the data can be easily accessed, managed and updated. A database can be software based or hardware based, with one sole purpose: storing data.

A DBMS is software that allows the creation, definition and manipulation of databases, allowing users to store, process and analyse data easily. A DBMS provides us with an interface or tool to perform various operations such as creating a database, storing data in it, updating data, creating tables in the database and a lot more.

DBMS also provides protection and security to the databases. It also maintains data consistency

in case of multiple users.

Here are some examples of popular DBMS used these days:

MySQL

Oracle

SQL Server

IBM DB2

PostgreSQL

Amazon SimpleDB (cloud based) etc.

What is Data Warehouse

Data warehousing is a technique for collecting and managing data from varied sources to provide meaningful business insights. It is a blend of technologies and components which allows the strategic use of data.


It is electronic storage of a large amount of information by a business which is designed for

query and analysis instead of transaction processing. It is a process of transforming data into

information and making it available to users in a timely manner to make a difference.

A data warehouse is a database, which is kept separate from the organization's

operational database.

There is no frequent updating done in a data warehouse.

Data warehouse systems help in the integration of a diversity of application systems.

A data warehouse system helps in consolidated historical data analysis.

What is Data Science?

Data Science is a blend of various tools, algorithms, and machine learning principles with the

goal to discover hidden patterns from the raw data.

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and

systems to extract knowledge and insights from data in various forms, both structured and

unstructured, similar to data mining.

Data science is a "concept to unify statistics, data analysis, machine learning and their related

methods" in order to "understand and analyze actual phenomena" with data. It employs

techniques and theories drawn from many fields within the context of mathematics, statistics,

information science, and computer science.


So, Data Science is primarily used to make decisions and predictions making use of predictive

causal analytics, prescriptive analytics (predictive plus decision science) and machine learning.

What is Data Mining?

Data Mining is defined as extracting information from huge sets of data. In other words, we can

say that data mining is the procedure of mining knowledge from data. The information or knowledge extracted in this way can be used for any of the following applications −

Market Analysis

Fraud Detection

Customer Retention

Production Control

Science Exploration

Data Mining Applications

Data mining is highly useful in the following domains −

Market Analysis and Management

Corporate Analysis & Risk Management

Fraud Detection


History of Data Mining

1763 Thomas Bayes’ paper on a theorem relating current probability to prior probability, now known as Bayes’ theorem, is published posthumously. It is fundamental to data mining and probability, since it allows complex realities to be understood on the basis of estimated probabilities.

1805 Adrien-Marie Legendre and Carl Friedrich Gauss apply regression to determine the orbits

of bodies about the Sun (comets and planets). The goal of regression analysis is to estimate the

relationships among variables, and the specific method they used in this case is the method of

least squares. Regression is one of the key tools in data mining.

1936 This is the dawn of the computer age, which makes possible the collection and processing of large amounts of data. In a 1936 paper, On Computable Numbers, Alan Turing introduced the

idea of a Universal Machine capable of performing computations like our modern day

computers. The modern day computer is built on the concepts pioneered by Turing.

1943 Warren McCulloch and Walter Pitts were the first to create a conceptual model of a neural

network. In a paper entitled A logical calculus of the ideas immanent in nervous activity, they

describe the idea of a neuron in a network. Each of these neurons can do 3 things: receive

inputs, process inputs and generate output.


1965 Lawrence J. Fogel formed a new company called Decision Science, Inc. for applications

of evolutionary programming. It was the first company specifically applying evolutionary

computation to solve real-world problems.

1970s With sophisticated database management systems, it’s possible to store and query

terabytes and petabytes of data. In addition, data warehouses allow users to move from a

transaction-oriented way of thinking to a more analytical way of viewing the data. However, the ability to extract sophisticated insights from these data warehouses and multidimensional models is still very limited.

1975 John Henry Holland wrote Adaptation in Natural and Artificial Systems, the groundbreaking book on genetic algorithms. It is the book that initiated this field of study, presenting

the theoretical foundations and exploring applications.

1980s HNC trademarks the phrase “database mining.” The trademark was meant to protect a

product called DataBase Mining Workstation. It was a general purpose tool for building neural

network models and is now no longer available. It’s also during this period that sophisticated

algorithms can “learn” relationships from data that allow subject matter experts to reason about

what the relationships mean.

1989 The term “Knowledge Discovery in Databases” (KDD) is coined by Gregory Piatetsky-

Shapiro. It is also at this time that he co-founds the first workshop, also named KDD.

1990s The term “data mining” appeared in the database community. Retail companies and the

financial community are using data mining to analyze data and recognize trends to increase their customer base and predict fluctuations in interest rates, stock prices and customer demand.

1992 Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested an improvement

on the original support vector machine which allows for the creation of nonlinear classifiers.

Support vector machines are a supervised learning approach that analyzes data and recognizes

patterns used for classification and regression analysis.

1993 Gregory Piatetsky-Shapiro starts the newsletter Knowledge Discovery Nuggets

(KDnuggets). It was originally meant to connect researchers who attended the KDD workshop.

However, KDnuggets.com seems to have a much wider audience now.

2001 Although the term data science has existed since the 1960s, it wasn’t until 2001 that William S. Cleveland introduced it as an independent discipline. As per Building Data Science Teams, DJ

Patil and Jeff Hammerbacher then used the term to describe their roles at LinkedIn and

Facebook.


2003 Moneyball, by Michael Lewis, is published and changed the way many major league front

offices do business. The Oakland Athletics used a statistical, data-driven approach to select for

qualities in players that were undervalued and cheaper to obtain. In this manner, they

successfully assembled a team that brought them to the 2002 and 2003 playoffs with 1/3 the

payroll.

2015 In February 2015, DJ Patil became the first Chief Data Scientist at the White House.

Today, data mining is widespread in business, science, engineering and medicine just to name a

few. Mining of credit card transactions, stock market movements, national security, genome

sequencing and clinical trials are just the tip of the iceberg for data mining applications. Terms like Big Data are now commonplace, with the collection of data becoming cheaper and the

proliferation of devices capable of collecting data.

Present (2017) Finally, one of the most active techniques being explored today is Deep

Learning. Capable of capturing dependencies and complex patterns far beyond other techniques,

it is reigniting interest in some of the biggest challenges in the world of data mining, data science and

artificial intelligence.

What is Machine Learning?

In simple words, we can say that machine learning is the ability of software to perform a single task or a series of tasks intelligently, without being explicitly programmed for those activities. It is a part of Artificial Intelligence. Normally, software behaves the way the programmer programmed it, while machine learning goes one step further by making the software capable of accomplishing the intended tasks using statistical analysis and predictive analytics techniques.

You may have noticed that whenever we like or comment on a friend’s pictures or videos on a social media site, related images and videos keep appearing in our feed afterwards. The same happens with the ‘people you may know’ suggestions: the system suggests other users’ profiles, somehow related to our existing friends list, for us to add as friends. How does the system know that? That is called machine learning. The software uses statistical analysis to identify the patterns in your behaviour as a user and, using predictive analytics, it populates the related news feed on your social media site.

Types of Machine Learning:

i. Supervised learning

ii. Unsupervised Learning

iii. Reinforcement Learning


What is Artificial Intelligence?

According to the father of Artificial Intelligence, John McCarthy, it is “The science and

engineering of making intelligent machines, especially intelligent computer programs”.

Artificial Intelligence is a way of making a computer, a computer-controlled robot, or a piece of software think intelligently, in a manner similar to the way intelligent humans think.

AI is accomplished by studying how human brain thinks, and how humans learn, decide, and

work while trying to solve a problem, and then using the outcomes of this study as a basis of

developing intelligent software and systems.

Philosophy of AI

While exploiting the power of computer systems, human curiosity led us to wonder, “Can a machine think and behave like humans do?”

Thus, the development of AI started with the intention of creating in machines the kind of intelligence that we find, and regard highly, in humans.

Goals of AI

To Create Expert Systems − systems which exhibit intelligent behavior, learn, demonstrate, explain, and advise their users.

To Implement Human Intelligence in Machines − Creating systems that understand,

think, learn, and behave like humans.


What is ETL?

ETL is an abbreviation of Extract, Transform and Load. In this process, an ETL tool extracts the data from different RDBMS source systems, then transforms the data by applying calculations, concatenations, etc., and then loads the data into the Data Warehouse system.

It's tempting to think that creating a Data warehouse is simply a matter of extracting data from multiple sources and loading it into the database of a Data warehouse. This is far from the truth: it requires a complex ETL process. The ETL process requires active inputs from various stakeholders, including developers, analysts, testers and top executives, and is technically challenging.

In order to maintain its value as a tool for decision-makers, a Data warehouse system needs to change with business changes. ETL is a recurring activity (daily, weekly, monthly) of a Data warehouse system and needs to be agile, automated, and well documented.


EXPERIMENT-3

AIM: Introduction to WEKA tool.

WEKA (Waikato Environment for Knowledge Analysis) is data mining software that uses a

collection of machine learning algorithms. Named after a flightless New Zealand bird, Weka is a

set of machine learning algorithms that can be applied to a data set directly, or called from your

own Java code.

Weka is a collection of tools for:

• Regression

• Clustering

• Association

• Data pre-processing

• Classification

• Visualisation

Weka is an open source application that is freely available under the GNU General Public Licence. Originally written in C, the WEKA application has been completely rewritten in Java and is compatible with almost every computing platform. The WEKA application gives novice users a tool to identify hidden information in databases and file systems, with simple-to-use options and visual interfaces.

Figure 1: Weka’s features


Weka application interfaces

There are five application interfaces available for Weka. When we open Weka, it will start the Weka GUI Chooser screen, from where we can open the desired Weka application interface.

Figure 2: Weka’s application interfaces

Weka data formats

Weka uses the Attribute Relation File Format (ARFF) for data analysis by default, but listed below are some of the formats that Weka supports, from which data can be imported:

• CSV

• ARFF

• Database using ODBC

Attribute Relation File Format (ARFF): This has two parts:

1) The header section defines the relation (data set) name, attribute name and the type.

2) The data section lists the data instances.

An ARFF file requires the declaration of the relation, attribute and data. Figure 3 is an example

of an ARFF file.

· @relation: This is the first line in any ARFF file, written in the header section, followed by the

relation/data set name. The relation name must be a string and if it contains spaces, then it should

be enclosed between quotes.

· @attribute: These are declared with their names and the type or range in the header section.

Weka supports the following data types for attributes:

• Numeric

• <nominal-specification>

• String

• Date

· @data: Defined in the data section, followed by the list of all data instances.

Figure 3: Example of an ARFF file
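Since Figure 3 is not reproduced here, a minimal ARFF file of the kind it illustrates might look like the following (a made-up weather relation):

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes, no}

@data
sunny, 85, 85, no
overcast, 83, 86, yes
rainy, 70, 96, yes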

Weka Explorer

The Weka Explorer is illustrated in Figure 4 and contains a total of six tabs.

The tabs are as follows.

1) Preprocess: This allows us to choose the data file.

2) Classify: This allows us to apply and experiment with different algorithms on preprocessed

data files.

3) Cluster: This allows us to apply different clustering tools, which identify clusters within the

data file.

4) Association: This allows us to apply association rules, which identify the association within

the data.

5) Select attributes: This allows us to see the effect of including or excluding attributes from the experiment.

6) Visualize: This allows us to see the possible visualisation produced on the data set in a 2D

format, in scatter plot and bar graph output.

The user cannot move between the different tabs until the initial preprocessing of the data set has

been completed.


Figure 4: Weka Explorer

Preprocessing: Data preprocessing is a must. There are three ways to inject the data for

preprocessing:

• Open File – enables the user to select the file from the local machine

• Open URL – enables the user to select the data file from different locations

• Open Database – enables users to retrieve a data file from a database source

A screen for selecting a file from the local machine to be preprocessed is shown in Figure 5.

After loading the data in Explorer, we can refine the data by selecting different options. We can

also select or remove the attributes as per our need and even apply filters on data to refine the

result.

Figure 5: Preprocessing – Open Data Set


Classification: To predict nominal or numeric quantities, we have classifiers in Weka. Available

learning schemes are decision-trees and lists, support vector machines, instance-based classifiers,

logistic regression and Bayes’ nets. Once the data has been loaded, all the tabs are enabled.

Based on the requirements and by trial and error, we can find out the most suitable algorithm to

produce an easily understandable representation of data.

Before running any classification algorithm, we need to set test options. Available test options

are listed below.

Use training set: Evaluation is based on how well it can predict the class of the instances it was

trained on.

Supplied test set: Evaluation is based on how well it can predict the class of a set of instances loaded from a file.

Cross-validation: Evaluation is based on cross-validation by using the number of folds entered in

the ‘Folds’ text field.

Percentage split: Evaluation is based on how well it can predict a certain percentage of the data, which is held out for testing, using the value entered in the ‘%’ field.

To classify the data set based on the characteristics of attributes, Weka uses classifiers.

Clustering: The Cluster tab enables the user to identify similarities or groups of occurrences within the data set. Clustering provides data groupings for the user to analyse. The training set, percentage split, supplied test set and classes-to-clusters modes can be used for evaluating the clustering, and the user can ignore some attributes from the data set, based on the requirements. Available clustering schemes in Weka are k-Means, EM, Cobweb, X-means and FarthestFirst.

Association: The only available scheme for association in Weka is the Apriori algorithm. It

identifies statistical dependencies between clusters of attributes, and only works with discrete

data. The Apriori algorithm computes all the rules having minimum support and exceeding a

given confidence level.

Visualisation: The user can see the final piece of the puzzle, derived throughout the process. It

allows users to visualise a 2D representation of data, and is used to determine the difficulty of the

learning problem. We can visualise single attributes (1D) and pairs of attributes (2D), and rotate

3D visualisations in Weka. It has the Jitter option to deal with nominal attributes and to detect

‘hidden’ data points.


EXPERIMENT-4

AIM: Implementation of classification techniques on ARFF files using WEKA.

Weka makes learning applied machine learning easy, efficient, and fun. It is a GUI tool that allows you to load datasets, run algorithms and design and run experiments with results statistically robust enough to publish.

1. Start Weka

Start Weka. This may involve finding it in the program launcher or double-clicking on the weka.jar file. This will start the Weka GUI Chooser, which lets you choose among the Explorer, Experimenter, KnowledgeFlow and the Simple CLI (command line interface).

Fig.1. Weka GUI Chooser

Click the “Explorer” button to launch the Weka Explorer.

2. Open the data/iris.arff Dataset

Click the “Open file…” button to open a data set and double click on the “data” directory. Weka provides a number of small common machine learning datasets that you can use to practice on.

Select the “iris.arff” file to load the Iris dataset.


Fig.2. Weka Explorer Interface with the Iris dataset loaded

3. Select and Run various Algorithms

Now that you have loaded a dataset, it’s time to choose a machine learning algorithm to model the problem and make predictions.

Click the “Classify” tab. This is the area for running algorithms against a loaded dataset in Weka. You will note that the “ZeroR” algorithm is selected by default.

Click the “Choose” button in the “Classifier” section and select an algorithm:

1. Rules -> ZeroR

2. Trees -> J48

3. Meta -> AdaBoostM1

4. Lazy -> KStar

5. Bayes -> NaiveBayes

Click the “Start” button to run the algorithm and compare the result of each algorithm.
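These steps can also be scripted against the Weka Java API rather than the Explorer. The following is a minimal sketch, assuming weka.jar (Weka 3.x) is on the classpath and the bundled data/iris.arff file is available; it uses 10-fold cross-validation, so the exact percentages may differ slightly from the percentage-split figures reported in the next section.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.KStar;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff");     // path relative to the Weka folder
        data.setClassIndex(data.numAttributes() - 1);           // class = last attribute (species)

        Classifier[] models = { new ZeroR(), new J48(), new AdaBoostM1(), new KStar(), new NaiveBayes() };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            // 10-fold cross-validation of each classifier on the Iris data
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%-12s correct: %.2f%%  incorrect: %.2f%%%n",
                    model.getClass().getSimpleName(), eval.pctCorrect(), eval.pctIncorrect());
        }
    }
}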


4. Comparing Results

Algorithm          Correctly Classified   Incorrectly Classified   Total       Correctly        Incorrectly
                   Instances              Instances                Instances   Classified (%)   Classified (%)

1. ZeroR                  14                      31                  45           31.11%          68.89%
2. J48                    43                       2                  45           95.56%           4.44%
3. AdaBoostM1             43                       2                  45           95.56%           4.44%
4. KStar                  42                       3                  45           93.33%           6.67%
5. NaiveBayes             43                       2                  45           95.56%           4.44%

Summary

Loaded dataset, implemented various classification algorithms and compared their results successfully in Weka.


EXPERIMENT-5

AIM: Implementation of clustering techniques on ARFF files using WEKA.

Weka makes learning applied machine learning easy, efficient, and fun. It is a GUI tool that allows you to load datasets, run algorithms and design and run experiments with results statistically robust enough to publish.

1. Start Weka

Start Weka. This may involve finding it in the program launcher or double-clicking on the weka.jar file. This will start the Weka GUI Chooser, which lets you choose among the Explorer, Experimenter, KnowledgeFlow and the Simple CLI (command line interface).

Fig.1. Weka GUI Chooser

Click the “Explorer” button to launch the Weka Explorer.

2. Open the data/iris.arff Dataset

Click the “Open file…” button to open a data set and double click on the “data” directory. Weka provides a number of small common machine learning datasets that you can use to practice on.

Select the “iris.arff” file to load the Iris dataset.


Fig.2. Weka Explorer Interface with the Iris dataset loaded

3. Open cluster mode

Now that you have loaded a dataset, select the Simple K-Means clustering algorithm in the Cluster tab. Then click on Start and an output window will pop up.

Simple K-Means Clustering Algorithm


K-Means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of greatest possible distinction. The best number of clusters k, leading to the greatest separation (distance), is not known a priori and must be computed from the data. The objective of K-Means clustering is to minimize the total intra-cluster variance, or the squared error function:
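In standard notation (reconstructed here, since the original presents the formula only as a figure), with \mu_j denoting the centroid (mean) of cluster C_j:

J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^{2}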

Algorithm:

1. Cluster the data into k groups, where k is predefined.

2. Select k points at random as cluster centers.

3. Assign objects to their closest cluster center according to the Euclidean distance function.

4. Calculate the centroid or mean of all objects in each cluster.

5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.

A minimal programmatic sketch of this procedure is given below.
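The sketch assumes weka.jar (Weka 3.x) is on the classpath and uses the bundled data/iris.arff file; SimpleKMeans is Weka's implementation of this algorithm, and the class attribute is removed before clustering, as in the "Classes to clusters" evaluation described in the next section.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class IrisKMeans {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff");

        // Ignore the class attribute (the last one) before clustering
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances input = Filter.useFilter(data, remove);

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(3);                // k is predefined, as in step 1 above
        kMeans.setSeed(10);
        kMeans.buildClusterer(input);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kMeans);
        eval.evaluateClusterer(input);
        System.out.println(eval.clusterResultsToString());   // cluster sizes and centroids
    }
}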

4. Review Results

The way Weka evaluates the clustering depends on the cluster mode.

1. Use training set (default). After generating the clustering Weka classifies the training instances into clusters according to clusters representation and computes the percentage of instances falling in each cluster.

2. In Supplied test set or Percentage split mode, Weka can evaluate the clustering on separate test data if the cluster representation is probabilistic.

3. Classes to clusters evaluation. In this mode Weka first ignores the class attribute and generates the clustering. Then during the test phase it assigns classes to the clusters, based on the majority value of the class attribute within each cluster. Then it computes the classification error.


Visualisation of the clustering output: plots of Instance_Number (X) against SepalLength, SepalWidth, PetalLength and PetalWidth (Y).

Summary

Loaded dataset and implemented Simple K-means clustering algorithm in Weka.


EXPERIMENT-6

AIM: Implementation of Association rule technique on ARFF files using WEKA.

Association

Association is a data mining function that discovers the probability of the co-occurrence of items in a collection. The relationships between co-occurring items are expressed as association rules. Association rules are often used to analyze sales transactions.

Apriori Algorithm

Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.

Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits). Each transaction is seen as a set of items (an itemset). Given a threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.

Algorithm

1. Scan the transaction database to get the support S of each 1-itemset, compare S with min_sup, and get the set of frequent 1-itemsets, L1.

2. Join L(k-1) with L(k-1) to generate a set of candidate k-itemsets, and use the Apriori property to prune the infrequent k-itemsets from this set.

3. Scan the transaction database to get the support S of each candidate k-itemset in the resulting set, compare S with min_sup, and get the set of frequent k-itemsets, Lk.

4. If the candidate set is not empty, go to step 2.

5. For each frequent itemset l, generate all non-empty subsets of l.

6. For every non-empty subset s of l, output the rule "s => (l - s)" if the confidence of the rule, support(l) / support(s), is at least min_conf (a small numeric illustration is given below).

7. End.
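As a small, made-up numeric illustration of step 6: suppose the frequent itemset l = {bread, butter} has support 40 (it appears in 40 transactions) and its subset s = {bread} has support 100. Then the confidence of the rule bread => butter is support(l) / support(s) = 40 / 100 = 0.4, so the rule is output only if min_conf is 0.4 or lower.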

Limitations

• Candidate generation produces large numbers of subsets (the algorithm attempts to load the candidate set with as many itemsets as possible before each scan).

• Bottom-up subset exploration (essentially a breadth-first traversal of the subset lattice) finds any maximal subset S only after all 2^|S| - 1 of its proper subsets.


Implementation: Step 1: Open the Explorer in the Weka tool, and open the file named supermarket.arff.

Step 2: Clicking on the "Associate" tab will bring up the interface for the association rule algorithms. The Apriori algorithm, which we will use, is selected by default.


Step 3: WEKA allows the resulting rules to be sorted according to different metrics such as confidence, leverage and lift. In this example, we have selected confidence as the criterion. Furthermore, we have entered 0.7 as the minimum value for confidence.


Step 4: Once the parameters have been set, we now click on Start to run the program. This results in a set of rules, as depicted in the following figure, which shows the association rules generated by the algorithm.
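The same association run can also be scripted against the Weka Java API. The sketch below assumes weka.jar (Weka 3.x) is on the classpath and uses the bundled data/supermarket.arff file; setMinMetric sets the minimum confidence, corresponding to the value chosen in Step 3.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SupermarketApriori {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/supermarket.arff");  // dataset bundled with Weka

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);          // report the 10 best rules
        apriori.setMinMetric(0.7);        // minimum confidence, as in Step 3
        apriori.buildAssociations(data);

        System.out.println(apriori);      // prints the generated association rules
    }
}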


EXPERIMENT-7

AIM: Study of the DBMiner tool.

Introduction

DBMiner is a data mining system developed for the interactive mining of multiple-level knowledge in large relational databases. It is based on years of research into data mining techniques and on experience in the development of an earlier system prototype, DBLearn. The system implements a wide spectrum of data mining functions, including generalization, characterization, discrimination, association, classification, and prediction. By incorporating several interesting data mining techniques, including attribute-oriented induction, statistical analysis, progressive deepening for mining multiple-level rules, and meta-rule guided knowledge mining, the system provides a user-friendly, interactive data mining environment with good performance.

General architecture of DBMiner


Discovery Modules

DBMiner user interfaces

• UNIX-based

• Windows/NT-based

• WWW/Netscape-based

DBMiner Wizard