Introduction to mining massive datasets


Upload: viet-trung-tran

Post on 16-Aug-2015


TRANSCRIPT

Page 1: Introduction to mining massive datasets

Mining massive datasets (based on Stanford CS246)

Viet-Trung TRAN

Viet-Trung Tran 1

Page 2: Introduction to mining massive datasets

Credits

•  Jure Leskovec, Anand Rajaraman, Jeff Ullman - Stanford University

•  http://web.stanford.edu/class/cs246/
•  http://mmds.org/


Page 3: Introduction to mining massive datasets

What is data mining?

•  Knowledge discovery from data


Page 4: Introduction to mining massive datasets


Page 5: Introduction to mining massive datasets

Data contains value and knowledge


Page 6: Introduction to mining massive datasets

Data mining

•  Store
•  Manage
•  Analyze

Data mining ~ Big Data ~ Predictive Analysis ~ Data science


Page 7: Introduction to mining massive datasets

Demand for data mining (US)


Page 8: Introduction to mining massive datasets

What is data mining?

•  Given lots of data
•  Discover patterns and make predictions that are
    – Valid
    – Useful
    – Unexpected
    – Understandable


Page 9: Introduction to mining massive datasets

Data mining tasks

•  Descriptive methods
    – Find human-interpretable patterns that describe the data
    – Example: clustering
•  Predictive methods
    – Use some variables to predict the unknown or future values of other variables
    – Example: recommender systems
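A minimal sketch of the two task types, not taken from the slides. The descriptive half groups 1-D points with a tiny k-means; the predictive half guesses a missing rating from other users' ratings. The toy data, the function names (`kmeans_1d`, `predict_rating`), and the plain-average prediction rule are all illustrative assumptions.

```python
from statistics import mean

def kmeans_1d(points, centers, iterations=10):
    """Tiny 1-D k-means (descriptive): group points around the nearest center."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its closest center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

def predict_rating(ratings, user, item):
    """Naive recommender (predictive): average the item's ratings from other users."""
    others = [r[item] for u, r in ratings.items() if u != user and item in r]
    return mean(others) if others else None

# Hypothetical toy data.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])

ratings = {
    "ann": {"film_a": 5, "film_b": 3},
    "bob": {"film_a": 4, "film_b": 1},
    "cat": {"film_a": 4},
}
guess = predict_rating(ratings, "cat", "film_b")  # average of ann's and bob's ratings
```

The clustering output describes structure already present in the data (two groups), while `predict_rating` fills in a value that was never observed. Real recommender systems weight similar users or items rather than averaging everyone.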


Page 10: Introduction to mining massive datasets

Meaningfulness of analytic answers

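•  The risk of "data mining" is that the discovered patterns are meaningless
•  Bonferroni's principle
    – If you look in more places for interesting patterns than your data can support, an algorithm or method that seems useful for finding a particular set of data may actually return more false positives than true discoveries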
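A rough illustration of why this happens, as a sketch under stated assumptions: every "pattern" tested here is pure noise, so any hit is a false positive. The numbers (`num_tests`, `alpha`) and function names are hypothetical. The last step applies the standard Bonferroni correction from statistics, which is related to, but more formal than, the principle on the slide.

```python
import random

def expected_false_positives(num_tests, alpha):
    """Expected number of chance 'discoveries' when every tested pattern is noise."""
    return num_tests * alpha

def simulate_noise_hits(num_tests, alpha, seed=0):
    """Count how many noise-only tests pass by chance at level alpha."""
    rng = random.Random(seed)
    # Under the null hypothesis a p-value is uniform on [0, 1).
    return sum(rng.random() < alpha for _ in range(num_tests))

num_tests = 10_000   # hypothetical number of patterns examined
alpha = 0.05         # per-test significance level

print(expected_false_positives(num_tests, alpha))  # roughly 500 expected by chance
print(simulate_noise_hits(num_tests, alpha))       # a count in the same range

# Bonferroni correction: test each pattern at alpha / num_tests instead,
# so the chance of even one false positive is at most alpha.
corrected = alpha / num_tests
print(simulate_noise_hits(num_tests, corrected))
```

If the number of real instances you hope to find is much smaller than the expected chance count, most of what the method returns will be bogus.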


Page 11: Introduction to mining massive datasets

Dealing with data?


Page 12: Introduction to mining massive datasets

Data mining cultures

•  Overlaps with
    – Databases: large-scale data, simple queries
    – Machine learning: small data, complex models
    – CS theory: (randomized) algorithms
•  Different cultures
    – To DB people: an extreme form of analytic processing
    – To ML people: inference of models (inference: a conclusion reached on the basis of evidence and reasoning)


Page 13: Introduction to mining massive datasets

What we will learn

•  Mine different types of data
    – High dimensional
    – Graph
    – Infinite/never-ending
    – Labeled
•  Use different models of computation
    – Batch processing
    – Stream
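To make the two models of computation concrete, here is a minimal sketch (an illustration, not from the slides) that computes the same statistic both ways. The names `batch_mean` and `StreamingMean` and the sample readings are assumptions made for the example.

```python
def batch_mean(values):
    """Batch model: the whole dataset is stored, so it can be scanned at once."""
    return sum(values) / len(values)

class StreamingMean:
    """Stream model: items arrive one at a time; keep only O(1) state."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental update: new_mean = old_mean + (x - old_mean) / n
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean

data = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]  # hypothetical readings

stream = StreamingMean()
for x in data:  # with a never-ending stream this loop would not terminate
    stream.update(x)

print(batch_mean(data), stream.mean)  # both approaches agree up to rounding
```

The streaming version never stores past items, which is what makes it usable on infinite/never-ending data; the batch version needs all of the data available before it can answer.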


Page 14: Introduction to mining massive datasets

To solve real-world problems
