Tools and Technologies for Large Scale Data Mining

Jaganadh G, Project Lead NLP R&D, 365Media Pvt. Ltd., [email protected]. DRDO Sponsored National Level Seminar on Challenging Issues on Data Mining Semantic Web, Sri Krishna College of Engineering and Technology, Coimbatore, 27th Jan 2012.

Page 1: Tools and Technologies for Large Scale Data Mining

Tools and Technologies for Large Scale Data Mining

Jaganadh G

Project Lead NLP R&D

365Media Pvt. Ltd.

[email protected]

DRDO Sponsored National Level Seminar on Challenging Issues on Data Mining Semantic Web,

Sri Krishna College of Engineering and Technology, Coimbatore

27th Jan 2012

Jaganadh G Tools and Technologies for Large Scale Data Mining

Page 2: Tools and Technologies for Large Scale Data Mining

About me

Software engineer specializing in text analytics research and development

When free, teaches Python, speaks about FOSS, and blogs at http://jaganadhg.in

Working as Project Lead (NLP) at 365Media Pvt. Ltd., Coimbatore

I am a computational linguist, linguist, Indologist, and book reviewer

Master's degree holder in Sanskrit from the University of Kerala

Page 3: Tools and Technologies for Large Scale Data Mining

Machine Learning

Machine learning is a subfield of artificial intelligence (AI) concerned with algorithms that allow computers to learn.

This talk is not intended as an introduction to machine learning

Don't expect any mathy equations here

Page 7: Tools and Technologies for Large Scale Data Mining

Machine Learning and Our Life

Does machine learning have any impact on our lives?

Yes

In day-to-day life we use many machine-learning-powered tools

E-mail spam filtering, product recommendations, etc.

Fraud detection

Page 12: Tools and Technologies for Large Scale Data Mining

Examples

Page 15: Tools and Technologies for Large Scale Data Mining

Tools for building machine-learning-powered products and services

Apache Mahout

Apache Mahout is a scalable machine learning library that supports large data sets. The project's goal is to build scalable machine learning libraries.

Commercially friendly licence

Well documented

Healthy community

Targeted at developers

Page 16: Tools and Technologies for Large Scale Data Mining

Algorithms in Apache Mahout

Collaborative Filtering

User and Item based recommenders

K-Means, Fuzzy K-Means clustering

Mean Shift clustering

Dirichlet process clustering

Latent Dirichlet Allocation

Singular value decomposition

Parallel Frequent Pattern mining

Complementary Naive Bayes classifier

Random forest (decision-tree-based) classifier
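To make the first two items in the list concrete, here is a minimal sketch of a user-based collaborative filtering recommender. It is plain Python, not Mahout's API and not the code from the talk. The ratings, the user and item names, and the helper names `cosine_similarity` and `recommend` are all hypothetical, chosen only for illustration. Cosine similarity is one possible choice of user similarity; other measures would slot in the same way.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {item: rating}); not data from the talk.
RATINGS = {
    "alice": {"book_a": 5.0, "book_b": 3.0, "book_c": 4.0},
    "bob": {"book_a": 4.0, "book_b": 3.0, "book_c": 5.0, "book_d": 4.0},
    "carol": {"book_a": 1.0, "book_b": 5.0, "book_d": 2.0},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by the similarity-weighted
    average rating given to them by the other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap or no similarity
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only score items the user has not seen
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = [(item, scores[item] / weights[item]) for item in scores]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # "book_d" is the only item alice has not rated, so it is the only candidate.
    print(recommend("alice", RATINGS))
```

An item-based recommender follows the same pattern with the roles swapped: similarities are computed between items rather than between users.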

Page 27: Tools and Technologies for Large Scale Data Mining

Demo

Building recommendation engines with Mahout

Document classification with Mahout

Some Python stuff on machine learning
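The demo code itself is not part of this transcript. As a rough idea of what document classification involves, the sketch below implements a plain multinomial naive Bayes classifier with add-one smoothing in standard-library Python. This is a simpler relative of the Complementary Naive Bayes classifier listed for Mahout, not that algorithm and not Mahout's API. The training sentences, the labels, and the names `tokenize` and `NaiveBayesClassifier` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy training set of (text, label) pairs; not the demo data.
TRAINING = [
    ("the team won the hockey game", "sport"),
    ("a great goal in the final match", "sport"),
    ("the new graphics card renders images fast", "computers"),
    ("install the driver for the video card", "computers"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, documents):
        """Accumulate word and document counts from (text, label) pairs."""
        for text, label in documents:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        """Return the label with the highest log posterior for the text."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            label_total = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    classifier = NaiveBayesClassifier()
    classifier.train(TRAINING)
    print(classifier.classify("hockey match tonight"))
```

Working in log space avoids numeric underflow on long documents, and the add-one smoothing keeps a word unseen for a label from forcing that label's probability to zero.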

Page 29: Tools and Technologies for Large Scale Data Mining

References

Mahout in Action, by Sean Owen and Robin Anil, published by Manning Publications.

Taming Text, by Grant Ingersoll and Tom Morton, published by Manning Publications.

Introducing Apache Mahout, by Grant Ingersoll. An introduction to Apache Mahout focused on clustering, classification and collaborative filtering. https://www.ibm.com/developerworks/java/library/j-mahout/index.html

Programming Collective Intelligence: Building Smart Web 2.0 Applications. http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325

Page 30: Tools and Technologies for Large Scale Data Mining

Useful Resources

Apache Mahout site: http://mahout.apache.org/

Apache Mahout mailing list: [email protected]

The code I used for the Mahout demo is available at http://bitbucket.org/jaganadhg/blog/src/tip/bck9/java/

Twenty Newsgroups data set: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Page 31: Tools and Technologies for Large Scale Data Mining

Questions?

Page 32: Tools and Technologies for Large Scale Data Mining

Acknowledgments

Thanks to:

Manning Publications for the review copy of the book "Mahout in Action"

Apache Mahout mailing list members

Ted Dunning and Robin Anil for suggestions

Sreejith S and Biju B for Java help

@chelakkandupoda for review and criticism

Mukundhanchari, R&D Director, 365Media Pvt. Ltd., for support and encouragement

Page 33: Tools and Technologies for Large Scale Data Mining

Finally