vipul divyanshu mahout_documentation


Upload: vipul-divyanshu

Post on 15-Jan-2015


Page 1: Vipul divyanshu mahout_documentation

Data Analytics

Project Documentation

Vipul Divyanshu
IIL/2012/14
Summer Internship
Mentor: Saish Kamat
India Innovation Labs

Tasks at hand:

*Data Analytics on a Medium Size Database

*Building a Recommender Engine for products

Tools and topics explored:

*Mahout
*ROOT
*Hadoop
*DataRush
*Rush Analyser (with KNIME)
*Google Analytics engine

Analysis of the tools and what was explored:

Mahout: Mahout is an open source machine learning library from Apache. The algorithms it implements fall under the broad umbrella of machine learning or collective intelligence.

Mahout currently has:

*Collaborative Filtering
*User and Item based recommenders
*K-Means, Fuzzy K-Means clustering
*Mean Shift clustering
*Dirichlet process clustering
*Latent Dirichlet Allocation
*Singular value decomposition
*Parallel Frequent Pattern mining
*Complementary Naive Bayes classifier
*Random forest decision tree based classifier
*High performance Java collections (previously Colt collections)

Because Mahout offers this many features, sub-tools and libraries to work with, it was found to be the best-suited tool for self-designed data analytics programs. Mahout's core libraries are also highly optimized, so they give good performance for non-distributed algorithms as well.

NOTE: For a good understanding of Mahout, the book ‘Mahout in Action’ is suggested.

ROOT: It is an object-oriented framework aimed at solving the data analysis challenges of high-energy physics. Below, you can find a quick overview of the ROOT framework:

Save data. You can save your data (and any C++ object) in a compressed binary form in a ROOT file. The object format is also saved in the same file. ROOT provides a data structure that is extremely powerful for fast access of huge amounts of data - orders of magnitude faster than any database.

Access data. Data saved into one or several ROOT files can be accessed from your PC, from the web and from large-scale file delivery systems used e.g. in the GRID. ROOT trees spread over several files can be chained and accessed as a unique object, allowing for loops over huge amounts of data.

Process data. Powerful mathematical and statistical tools are provided to operate on your data. The full power of a C++ application and of parallel processing is available for any kind of data manipulation. Data can also be generated following any statistical distribution, making it possible to simulate complex systems.

Show results. Results are best shown with histograms, scatter plots, fitting functions, etc. ROOT graphics may be adjusted in real time with a few mouse clicks. High-quality plots can be saved in PDF or other formats.

Interactive or built application. You can use the CINT C++ interpreter or Python for your interactive sessions and to write macros, or compile your program to run at full speed. In both cases, you can also create a GUI.

Link to know more about ROOT: http://root.cern.ch/drupal/

Link for ROOT user’s guide: http://root.cern.ch/download/doc/ROOTUsersGuide.pdf

Constraints of ROOT:

It was found that ROOT concentrates more on displaying and graphically presenting the collected data, and on representing the computed (processed) results in the form of canvases, histograms and TGraphs. It can be used at a later point to present the processed data in a well-defined and interactive manner.


Screenshot:

HADOOP:

Hadoop is an open source framework for writing and running distributed applications that process large amounts of data on different networks.

Key distinctions of Hadoop are:

*Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).
*Robust: Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
*Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
*Simple: Hadoop allows users to quickly write efficient parallel code.

Link to explore more in Hadoop: http://hadoop.apache.org/

NOTE: For a good understanding of Hadoop, the book ‘Hadoop in Action’ is suggested.


Setting up the Mahout development environment in Eclipse:

NOTE: The following explanation is for Ubuntu (Linux) OS. It can also be implemented on any other OS, such as Windows.

PREREQUISITES:

1. Java SDK 6u23 x64
2. Maven 3.0.2
3. Any updated Mahout library
4. IDE (I used Eclipse)
5. Cygwin (in case of Windows OS)

Running your first sample code:

Once all the above requirements are met, we are ready to execute our first sample code.

Step 1:

First, start Eclipse and create a workspace. We take it as “Users\Vipul\workspace” for the present.

Extract the source of Mahout below the workspace. It is “Users\Vipul\workspace\mahout-distribution-0.4” for the present.

Convert the Maven project of Mahout into an Eclipse project with the commands below.

$ cd Users\Vipul\workspace\mahout-distribution-0.4
$ mvn eclipse:eclipse

Now set the classpath variable M2_REPO of Eclipse to the Maven 2 local repository.

$ mvn -Declipse.workspace= eclipse:add-maven-repo

But “Maven – Guide to using Eclipse with Maven 2.x” says “Issue: The command does not work”. So set it in Eclipse directly. Open Window > Preferences > Java > Build Path > Classpath Variables from Eclipse’s menu. Press “New” and add Name as “M2_REPO” and Path as the Maven 2 repository path (its default is .m2/repository in your user directory).

Finally, import the converted Eclipse project of Mahout. Open File > Import > General > Existing Projects into Workspace from the Eclipse menu. Select the project directory Users\Vipul\workspace\mahout-distribution-0.6 and all projects.

NOTE: You now need to have your first code ready to be implemented. If so, proceed to Step 2.

Step 2:

First, generate a Maven project for the sample code in the Eclipse workspace directory.

$ cd Users/Vipul/workspace
$ mvn archetype:create -DgroupId=mia.recommender -DartifactId=recommender

Then do the following.

Delete the generated skeleton code src/main/App.java and copy the code into src/main/java/mia/recommender of the ‘recommender’ project.

Convert the Maven project into an Eclipse project.

$ cd Users/Vipul/workspace/recommender
$ mvn eclipse:eclipse

Import the project into Eclipse. Open File > Import > General > Existing Projects into Workspace from the Eclipse menu and select the ‘recommender’ project.

The ‘recommender’ project is then available in the Eclipse workspace, but all classes have errors because there is no reference to the Mahout library.


Right click the ‘recommender’ project, select Properties > Java Build Path > Projects from the pop-up menu, click ‘Add’ and select the Mahout projects below.

*mahout-core
*mahout-examples
*mahout-taste-webapp
*mahout-math
*mahout-utils

Then only 4 errors remain.


These errors are conflicts with updated APIs, so correcting them requires modifying the code. For example, open mia.recommender.ch03.IREvaluatorBooleanPrefIntro2 and press Ctrl+1 at the error line in it.

This error says that the code does not catch or declare the TasteException which NearestNUserNeighborhood’s constructor throws. You can choose whichever solution you like from the pop-up menu. The other errors can be fixed in the same way.

The classes which have a main() function can be executed in Eclipse. For example, select mia.recommender.ch02.RecommenderIntro and click Run > Run in Eclipse’s menu (or press Ctrl+F11 instead). It then throws the exception ‘Exception in thread “main” java.io.FileNotFoundException: intro.csv’.


To make it read the sample data file ‘intro.csv’ in src/mia/recommender/ch02, click Run > Run Configurations in Eclipse’s menu and select the configuration of RecommenderIntro which was created by the above execution. Then set mia/recommender/ch02 as the Working directory in the Arguments tab (see the figure below). Click the “Workspace…” button and select the directory.
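The reason the working directory matters can be checked with a few lines of plain Java. This is only a small sketch, not part of the book's sample code: the class name WorkingDirDemo is made up for illustration. It shows that a relative file name such as intro.csv is resolved against the JVM's working directory (the "user.dir" system property), which is what the Run Configuration setting changes.

```java
import java.io.File;

// Hypothetical helper class, for illustration only.
class WorkingDirDemo {
    // Returns the absolute path that a relative file name resolves to.
    // Relative names are resolved against the JVM working directory.
    static String resolve(String relativeName) {
        return new File(relativeName).getAbsolutePath();
    }

    public static void main(String[] args) {
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("intro.csv resolves to: " + resolve("intro.csv"));
        System.out.println("File exists there: " + new File("intro.csv").exists());
    }
}
```

If "File exists there" prints false, a program opening intro.csv by its relative name would be expected to fail with the FileNotFoundException described above.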


Then it outputs a result like “RecommendedItem[item:104, value:4.257081]”.

If you want to make a new project, repeat the steps from the Maven project creation onwards.

RECOMMENDATION ENGINE:

Recommendation is all about predicting patterns of taste, and using them to discover new and desirable things you didn’t already know about.

We have many types of recommender, like:

*GenericUserBasedRecommender
*GenericItemBasedRecommender
*SlopeOneRecommender
*SVDRecommender
*KnnItemBasedRecommender

I implemented the code for the first three; with more time in hand, the other two and some more can be implemented.

NOTE: Every recommender needs an input data file, normally of type .csv. Don’t forget to place it in the same folder as the pom file of the project currently being built.

THE USER BASED RECOMMENDATION ENGINE:

All the required details of the user based recommender engine are given in the book mentioned before. The output of my recommender is shown below:

The output of the above code can be observed in the ellipse.
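The idea behind a user based recommender can be sketched in plain Java without Mahout. The sketch below is not the code from the book and not Mahout's implementation: the class name UserBasedSketch, the toy ratings and the simplifications (Pearson correlation over co-rated items, only positively correlated users counted as neighbours) are all assumptions made for illustration. It reads lines in a "userID,itemID,value" layout and estimates a preference as a similarity-weighted average of the neighbours' ratings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class, for illustration only; the data below is made up.
class UserBasedSketch {
    static final String[] TOY = {
        "1,101,5.0", "1,102,3.0", "1,103,2.5",
        "2,101,4.0", "2,102,2.0", "2,103,1.5", "2,104,4.5",
        "3,101,1.0", "3,102,3.0", "3,103,3.5", "3,104,2.0"
    };

    // Parses "userID,itemID,value" lines into user -> (item -> rating).
    static Map<Long, Map<Long, Double>> parse(String... lines) {
        Map<Long, Map<Long, Double>> data = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            data.computeIfAbsent(Long.parseLong(f[0]), k -> new TreeMap<>())
                .put(Long.parseLong(f[1]), Double.parseDouble(f[2]));
        }
        return data;
    }

    // Pearson correlation over the items both users have rated.
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        List<Long> common = new ArrayList<>();
        for (Long item : a.keySet()) {
            if (b.containsKey(item)) common.add(item);
        }
        int n = common.size();
        if (n < 2) return 0.0;
        double meanA = 0, meanB = 0;
        for (Long item : common) { meanA += a.get(item); meanB += b.get(item); }
        meanA /= n;
        meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (Long item : common) {
            double da = a.get(item) - meanA, db = b.get(item) - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        return (varA == 0 || varB == 0) ? 0.0 : cov / Math.sqrt(varA * varB);
    }

    // Similarity-weighted average of the ratings given to `item`
    // by users who are positively correlated with `user`.
    static double estimate(Map<Long, Map<Long, Double>> data, long user, long item) {
        Map<Long, Double> target = data.get(user);
        double weighted = 0, total = 0;
        for (Map.Entry<Long, Map<Long, Double>> e : data.entrySet()) {
            if (e.getKey().longValue() == user) continue;
            Double rating = e.getValue().get(item);
            if (rating == null) continue;
            double sim = pearson(target, e.getValue());
            if (sim <= 0) continue;
            weighted += sim * rating;
            total += sim;
        }
        return total == 0 ? Double.NaN : weighted / total;
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> data = parse(TOY);
        System.out.println("similarity(1,2) = " + pearson(data.get(1L), data.get(2L)));
        System.out.println("similarity(1,3) = " + pearson(data.get(1L), data.get(3L)));
        System.out.println("estimate(user 1, item 104) = " + estimate(data, 1L, 104L));
    }
}
```

In this toy data user 2 rates exactly like user 1 minus one point, so the two are perfectly correlated and user 2's rating of item 104 drives the estimate, while user 3 rates in the opposite direction and is ignored.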


THE ITEM BASED RECOMMENDATION ENGINE:

It is similar to the user based recommendation engine; the only difference is that it finds the similarity between items instead of between users.

Note: For this reason it is better suited to cases where there is a fast-growing list of users and a slower-growing list of products or items.

The output of the item based recommender code is:
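The item-to-item idea described in this section can also be sketched independently of Mahout. The following is not the project's code and not its output: the class name ItemBasedSketch, the toy ratings and the choice of cosine similarity over co-rating users are assumptions made for illustration. The ratings are turned into one vector per item, items are compared with each other, and a user's preference for an unseen item is estimated from that user's own ratings of similar items.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class, for illustration only; the data below is made up.
class ItemBasedSketch {
    // Each row is {userID, itemID, rating}.
    static final double[][] TOY = {
        {1, 101, 4.0}, {1, 102, 4.0},
        {2, 101, 2.0}, {2, 102, 2.0}, {2, 103, 5.0},
        {3, 101, 5.0}, {3, 102, 5.0}, {3, 103, 1.0}
    };

    // Builds item -> (user -> rating), i.e. one rating vector per item.
    static Map<Long, Map<Long, Double>> byItem(double[][] triples) {
        Map<Long, Map<Long, Double>> items = new TreeMap<>();
        for (double[] t : triples) {
            items.computeIfAbsent((long) t[1], k -> new TreeMap<>()).put((long) t[0], t[2]);
        }
        return items;
    }

    // Cosine similarity of two item vectors over the users who rated both.
    static double cosine(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) continue;
            dot += e.getValue() * other;
            normA += e.getValue() * e.getValue();
            normB += other * other;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Weighted average of the user's own ratings, weighted by how similar
    // each rated item is to the target item.
    static double estimate(Map<Long, Map<Long, Double>> items, long user, long target) {
        Map<Long, Double> targetVector = items.get(target);
        if (targetVector == null) return Double.NaN;
        double weighted = 0, total = 0;
        for (Map.Entry<Long, Map<Long, Double>> e : items.entrySet()) {
            if (e.getKey().longValue() == target) continue;
            Double rating = e.getValue().get(user);
            if (rating == null) continue;
            double sim = cosine(targetVector, e.getValue());
            if (sim <= 0) continue;
            weighted += sim * rating;
            total += sim;
        }
        return total == 0 ? Double.NaN : weighted / total;
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> items = byItem(TOY);
        System.out.println("similarity(101,102) = " + cosine(items.get(101L), items.get(102L)));
        System.out.println("estimate(user 1, item 103) = " + estimate(items, 1L, 103L));
    }
}
```

Because the similarities are computed between items, they only need to be recomputed when the item list or its ratings change much, which is the reason given above for preferring this approach when users grow faster than items.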


THE SLOPE-ONE RECOMMENDATION ENGINE:

It is similar to the item based recommendation engine but has a pre-processing stage, and its output is based on the relation between the different items (the average difference in preference values between pairs of items). The output of my code is:
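The pre-processing stage and the prediction step of Slope One can be sketched in plain Java. This is a minimal version of the weighted Slope One scheme under stated assumptions, not Mahout's SlopeOneRecommender and not the project's code: the class name SlopeOneSketch and the toy ratings are made up for illustration. The constructor precomputes, for every pair of items, the summed rating difference and the number of users who rated both; a prediction then adds the average difference to each of the user's known ratings and weights it by that count.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class, for illustration only; the data below is made up.
class SlopeOneSketch {
    static final Map<Long, Map<Long, Double>> TOY = Map.of(
        1L, Map.of(101L, 5.0, 102L, 3.0, 103L, 2.0),
        2L, Map.of(101L, 3.0, 102L, 4.0),
        3L, Map.of(102L, 2.0, 103L, 5.0));

    private final Map<Long, Map<Long, Double>> ratings;
    // diffSum.get(a).get(b): sum over users of (rating of a - rating of b).
    private final Map<Long, Map<Long, Double>> diffSum = new HashMap<>();
    // count.get(a).get(b): number of users who rated both a and b.
    private final Map<Long, Map<Long, Integer>> count = new HashMap<>();

    // Pre-processing stage: accumulate the pairwise item differences.
    SlopeOneSketch(Map<Long, Map<Long, Double>> ratings) {
        this.ratings = ratings;
        for (Map<Long, Double> prefs : ratings.values()) {
            for (Map.Entry<Long, Double> a : prefs.entrySet()) {
                for (Map.Entry<Long, Double> b : prefs.entrySet()) {
                    if (a.getKey().equals(b.getKey())) continue;
                    diffSum.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                           .merge(b.getKey(), a.getValue() - b.getValue(), Double::sum);
                    count.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                         .merge(b.getKey(), 1, Integer::sum);
                }
            }
        }
    }

    // Weighted Slope One estimate of the user's preference for `item`.
    double predict(long user, long item) {
        Map<Long, Double> prefs = ratings.get(user);
        Map<Long, Double> sums = diffSum.get(item);
        if (prefs == null || sums == null) return Double.NaN;
        double numerator = 0;
        int denominator = 0;
        for (Map.Entry<Long, Double> p : prefs.entrySet()) {
            Integer c = count.get(item).get(p.getKey());
            if (c == null) continue;
            double averageDiff = sums.get(p.getKey()) / c;
            numerator += (averageDiff + p.getValue()) * c;
            denominator += c;
        }
        return denominator == 0 ? Double.NaN : numerator / denominator;
    }

    public static void main(String[] args) {
        SlopeOneSketch slopeOne = new SlopeOneSketch(TOY);
        System.out.println("predict(user 3, item 101) = " + slopeOne.predict(3L, 101L));
    }
}
```

For user 3 and item 101 the two known ratings contribute (0.5 + 2.0) with weight 2 and (3.0 + 5.0) with weight 1, so the estimate is 13/3, roughly 4.33.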


THE EVALUATOR FOR RECOMMENDATION ENGINE:

There are many possible ways to evaluate the performance of a recommender engine. I have explored the following:

*RecommenderIRStatsEvaluator
*AverageAbsoluteDifferenceRecommenderEvaluator
*RMSRecommenderEvaluator

I implemented the first two of them.

AVERAGE ABSOLUTE DIFFERENCE RECOMMENDER EVALUATOR:

It takes a part of the data as test data and the rest as training data. It estimates preferences for the test data, and these estimates are later compared with the real values of the test data. The output of my code is:
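The scoring step of this kind of evaluation is simple enough to sketch in plain Java. The sketch is not Mahout's evaluator and not the project's code: the class name EvaluatorSketch, the deterministic hold-out of the last few ratings and the trivial "mean of the training ratings" baseline are all assumptions made for illustration. It computes the average absolute difference between actual and estimated values, and also the root-mean-square variant for comparison.

```java
import java.util.Arrays;

// Hypothetical class, for illustration only.
class EvaluatorSketch {
    // Mean of |actual - estimated| over the held-out ratings; lower is better.
    static double averageAbsoluteDifference(double[] actual, double[] estimated) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - estimated[i]);
        }
        return sum / actual.length;
    }

    // Root-mean-square difference; penalises large errors more heavily.
    static double rootMeanSquare(double[] actual, double[] estimated) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            double d = actual[i] - estimated[i];
            sum += d * d;
        }
        return Math.sqrt(sum / actual.length);
    }

    // Holds out the last `testCount` ratings, "trains" a trivial baseline
    // (the mean of the remaining ratings) and scores it on the held-out part.
    static double evaluateMeanBaseline(double[] ratings, int testCount) {
        int trainCount = ratings.length - testCount;
        double mean = Arrays.stream(ratings, 0, trainCount).average().orElse(Double.NaN);
        double[] actual = Arrays.copyOfRange(ratings, trainCount, ratings.length);
        double[] estimated = new double[actual.length];
        Arrays.fill(estimated, mean);
        return averageAbsoluteDifference(actual, estimated);
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 3.0, 5.0};
        double[] estimated = {3.5, 3.0, 3.0};
        System.out.println("average absolute difference = " + averageAbsoluteDifference(actual, estimated));
        System.out.println("root mean square = " + rootMeanSquare(actual, estimated));
        System.out.println("mean baseline score = " + evaluateMeanBaseline(new double[] {4, 2, 3, 5, 1}, 2));
    }
}
```

A score of 0.0 would mean the estimates match the held-out values exactly; a real evaluation would hold out a random part of the data rather than the last few entries.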


RecommenderIRStatsEvaluator:

This evaluator computes the precision and recall of the recommender and gives their values as the output. The output of the evaluator code is:
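Precision and recall themselves can be illustrated with a few lines of plain Java. This sketch does not use Mahout and is not the project's evaluator code: the class name IRStatsSketch and the item IDs are made up for illustration. Precision is the fraction of recommended items that are relevant, and recall is the fraction of relevant items that were recommended.

```java
import java.util.List;
import java.util.Set;

// Hypothetical class, for illustration only.
class IRStatsSketch {
    // Number of recommended items that are also in the relevant set.
    static int hits(List<Long> recommended, Set<Long> relevant) {
        int hits = 0;
        for (Long item : recommended) {
            if (relevant.contains(item)) hits++;
        }
        return hits;
    }

    // Fraction of the recommendations that are relevant.
    static double precision(List<Long> recommended, Set<Long> relevant) {
        return recommended.isEmpty() ? 0.0 : (double) hits(recommended, relevant) / recommended.size();
    }

    // Fraction of the relevant items that appear in the recommendations.
    static double recall(List<Long> recommended, Set<Long> relevant) {
        return relevant.isEmpty() ? 0.0 : (double) hits(recommended, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        List<Long> recommended = List.of(104L, 106L, 105L);   // top-3 list, made up
        Set<Long> relevant = Set.of(104L, 105L, 107L, 108L);  // items the user really liked, made up
        System.out.println("precision = " + precision(recommended, relevant));
        System.out.println("recall = " + recall(recommended, relevant));
    }
}
```

Here two of the three recommendations are relevant (precision 2/3) and two of the four relevant items were found (recall 0.5).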


Note: To test the above code on a larger scale, the input files can be downloaded from: http://www.grouplens.org/node/12

Mahout is still in the development stage, and many fields can still be explored, such as clustering, network pattern learning and classification. Hadoop could be used with Mahout to implement a cluster and MapReduce to receive data.

Rush Analyser (with KNIME):

This tool is also in Java, and Eclipse is needed. It was downloaded from the link: http://bigdata.pervasive.com/Products/Download-Center.aspx

It is the graphical version of DataRush and is very handy in terms of data analytics and visualisation.

Here is a snapshot of my work, where I have loaded the 10K movie rating data downloaded from the test data download link given.


In the image, the different nodes used to perform different operations on the data set can be seen.

This is the parallel plot of the data set.


This is the scatter plot generated for the same 10K data values scattered on the 2-D plane.

The data was analysed using the clustering blocks in Rush Analyser. A few of the blocks I explored are:

*Regression
*Classifiers
*Recommender
*Clustering
*Filters

Data from different databases can be directly imported using the Data Base reader block.

These are a few of the topics explored in Rush Analyser (an interactive DataRush tool).


This is only the tip of the iceberg, as Rush Analyser has a lot more in store to be explored. For more information on DataRush, refer to the link: http://bigdata.pervasive.com/Products/Analytic-Engine-Pervasive-DataRush.aspx

The potential of DataRush is still to be explored for the project.

Thank You IIL:

Vipul Divyanshu

IIL/2012/14