![Page 1: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/1.jpg)
Big Data Mining
Zhifei Zhang
![Page 2: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/2.jpg)
Outline
• Big data vs. Small data
• Software for big data
• An example
![Page 3: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/3.jpg)
Small Data vs. Big Data
![Page 4: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/4.jpg)
Big Data
Attributes
• Volume
• Variety
• Velocity
• Variability
• Veracity
Types of Datasets
• High dimensional
• Sparse
• Graph
• Infinite/streaming
• Labeled
Compute Models
• MapReduce
• Streams / Online
• Single machine in-memory
![Page 5: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/5.jpg)
Big Data
• Acquire (data): Where is the data coming from?
• Extract and Clean: How to deal with missing/incomplete/noisy data?
• Aggregate and Integrate: How to bring datasets together?
• Represent: How to facilitate and host datasets for analysis?
• Analyze and Model: How to query, mine and model data?
• Interpret: How to ensure that end users understand the data?
![Page 6: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/6.jpg)
Algorithms for Big Data
• Classification: Bayes, clustering.
• Regression: polynomial, SVM.
• Dim. Reduction: PCA, SVD.
![Page 7: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/7.jpg)
Example 1: Naïve Bayes
$$P(y \mid x) \propto P(x \mid y)\, P(y)$$
$$P(x \mid y) = \frac{C(X = x, Y = y)}{C(Y = y)}$$
$$P(y) = \frac{C(Y = y)}{C(Y = \text{any})}$$
Given a data point (x, y):
• C(Y = any)++
• C(Y = y)++
• C(Y = y, X = x)++
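The counting scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the class name `CountingNaiveBayes` and its method names are hypothetical, `x` is treated as a single discrete feature value as in the formulas, and no smoothing is applied (so unseen pairs score zero).

```python
from collections import defaultdict


class CountingNaiveBayes:
    """Keeps the three counters from the slide and scores classes from them."""

    def __init__(self):
        self.total = 0                       # C(Y = any)
        self.class_count = defaultdict(int)  # C(Y = y)
        self.pair_count = defaultdict(int)   # C(Y = y, X = x)

    def update(self, x, y):
        """Process one data point (x, y) by incrementing each counter once."""
        self.total += 1
        self.class_count[y] += 1
        self.pair_count[(y, x)] += 1

    def score(self, x, y):
        """Return P(x | y) * P(y), which is proportional to P(y | x)."""
        n_y = self.class_count.get(y, 0)
        if n_y == 0 or self.total == 0:
            return 0.0
        p_x_given_y = self.pair_count.get((y, x), 0) / n_y
        p_y = n_y / self.total
        return p_x_given_y * p_y

    def predict(self, x):
        """Pick the class seen so far with the highest score for x."""
        return max(self.class_count, key=lambda y: self.score(x, y))
```

Because each data point only increments counters, the model can be updated one point at a time, which suits streaming input; for multi-feature vectors one would presumably keep one pair counter per feature.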
![Page 8: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/8.jpg)
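Example 2: SVM
$$\min_{\alpha} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) - \sum_{i} \alpha_i$$
Parallel-Hierarchical SVM
Given a partial data set:
• Implement SVM
• Save the local model
• Combine local models
[Figure: hierarchy of local SVMs combined layer by layer (5, 3, 2, 1) into a single SVM]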
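One possible reading of the three steps (train on a partial data set, save the local model, combine local models) is a cascade in which each local model is represented by the training points on or inside its margin. The sketch below makes several assumptions that are not on the slide: it uses a linear SVM trained with Pegasos-style subgradient descent instead of the kernel dual, merges neighbouring models pairwise, and always retains the lowest-margin point of each class so both classes survive a merge. All function names are hypothetical.

```python
import random


def train_linear_svm(data, lam=0.01, epochs=50, seed=0):
    """Pegasos-style subgradient descent; the bias is a constant extra feature."""
    rng = random.Random(seed)
    w = [0.0] * (len(data[0][0]) + 1)
    t = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x, y in order:
            t += 1
            eta = 1.0 / (lam * t)
            xb = list(x) + [1.0]
            margin = y * sum(wi * xi for wi, xi in zip(w, xb))
            w = [(1.0 - eta * lam) * wi for wi in w]       # regularization shrink
            if margin < 1.0:                               # hinge loss is active
                w = [wi + eta * y * xi for wi, xi in zip(w, xb)]
    return w


def decision(w, x):
    """Signed score of x; its sign is the predicted label."""
    return sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))


def local_model(data):
    """Train on one subset and keep only the points on or inside the margin."""
    w = train_linear_svm(data)
    kept = [(x, y) for x, y in data if y * decision(w, x) <= 1.0 + 1e-9]
    # Simplification: always retain the lowest-margin point of each class.
    for label in (1, -1):
        same = [(x, y) for x, y in data if y == label]
        if same:
            closest = min(same, key=lambda p: p[1] * decision(w, p[0]))
            if closest not in kept:
                kept.append(closest)
    return kept


def cascade_svm(partitions):
    """Combine local models pairwise, layer by layer, until one remains."""
    layer = [local_model(part) for part in partitions]
    while len(layer) > 1:
        merged = [layer[i] + (layer[i + 1] if i + 1 < len(layer) else [])
                  for i in range(0, len(layer), 2)]
        layer = [local_model(points) for points in merged]
    return train_linear_svm(layer[0])
```

With five partitions the layers shrink from 5 to 3 to 2 to 1 models. Each `local_model` call within a layer is independent of the others, so in a real deployment those calls could run on separate workers.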
![Page 9: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/9.jpg)
Algorithms for Big Data
• On-line version
• High efficiency
• Fewer iterations
• Approximation
![Page 10: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/10.jpg)
Software for Big Data
![Page 11: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/11.jpg)
Software for Big Data
• Common Software
• Machine Learning Lib
• Faster Framework
![Page 12: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/12.jpg)
Map & Reduce
![Page 13: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/13.jpg)
Tracking Drinking Behavior from Twitter Data
![Page 14: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/14.jpg)
Twitter Data
3/27/2014 to 5/1/2014
60,000,000 tweets
Drinking: 11,255,207 tweets
![Page 15: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/15.jpg)
LDA + Hadoop Drinking Words
• Input: Tweets (60 million)
• Mapper: English word list (235,886), meaningless-word list
• Reducer: Corpus (17,752), filter settings
• Output: for LDA
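The vocabulary-building job on this slide might look roughly like the following in-memory sketch. The details are assumptions: the slide does not say how tweets are tokenized or what the filter settings are, so the regular expression and the `min_count` threshold are hypothetical, and the grouping dictionary merely stands in for Hadoop's shuffle step.

```python
import re
from collections import defaultdict


def mapper(tweet, english_words, stop_words):
    """Emit (word, 1) for each English word that is not on the stop list."""
    for token in re.findall(r"[a-z']+", tweet.lower()):
        if token in english_words and token not in stop_words:
            yield token, 1


def reducer(word, counts):
    """Sum the counts emitted for one word."""
    return word, sum(counts)


def build_vocabulary(tweets, english_words, stop_words, min_count=2):
    """Run map, group by key, reduce, then apply a frequency filter."""
    grouped = defaultdict(list)          # stands in for the shuffle step
    for tweet in tweets:
        for word, one in mapper(tweet, english_words, stop_words):
            grouped[word].append(one)
    totals = dict(reducer(word, counts) for word, counts in grouped.items())
    # Hypothetical filter setting: drop words seen fewer than min_count times.
    return {word: n for word, n in totals.items() if n >= min_count}
```

The resulting dictionary of word counts would then serve as the vocabulary handed to LDA.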
![Page 16: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/16.jpg)
Hadoop Drinking Tweets
• Input: Tweets (60 million)
• Mapper: drinking words from LDA (𝛽): drunk, wine, hangover, bar, …
• Sentiment (positive/negative)
• Reducer: <date, cnt>, <time-zone, cnt>, <source, cnt>, <date, sentiment>
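A rough sketch of this counting job follows, simulated in memory instead of on Hadoop. Several things are assumptions: the tweet record fields (`text`, `date`, `time_zone`, `source`) are a hypothetical schema, only the four drinking words visible on the slide are used, and the sentiment output is left out because the slide does not say how sentiment is computed.

```python
import re
from collections import Counter

# Only the four drinking words shown on the slide; the full list is elided there.
DRINKING_WORDS = {"drunk", "wine", "hangover", "bar"}


def mapper(tweet):
    """Emit one count per grouping key if the tweet mentions a drinking word."""
    tokens = set(re.findall(r"[a-z']+", tweet["text"].lower()))
    if tokens & DRINKING_WORDS:
        yield ("date", tweet["date"]), 1
        yield ("time_zone", tweet["time_zone"]), 1
        yield ("source", tweet["source"]), 1


def reducer(pairs):
    """Sum the counts for every (kind, value) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def run_job(tweets):
    """Chain the map and reduce phases over an iterable of tweet records."""
    return reducer(pair for tweet in tweets for pair in mapper(tweet))
```

Calling `run_job` on a list of tweet dictionaries returns a `Counter` keyed by pairs such as `("time_zone", "London")`, which corresponds to the per-date, per-time-zone and per-source counts in the reducer output.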
![Page 17: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/17.jpg)
4/23/2014 (Wed.)
![Page 18: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/18.jpg)
What Happened around Apr. 23rd, 2014?
• Sinking of the South Korean ferry, killing 304 people
• Car bomb in Kenya, killing 4 people
• Unrest in Egypt
• Unrest in Ukraine
• Magnitude 6.6 earthquake in Canada
![Page 19: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/19.jpg)
The Truth is …
There is no data on 4/23
![Page 20: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/20.jpg)
![Page 21: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/21.jpg)
Drinkers Love iPhone More?
![Page 22: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/22.jpg)
Drinkers Love iPhone More?
2,985 results for “wine” in the App Store
248 results for “wine” in Google Play
![Page 23: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/23.jpg)
Location of Drinking Related Tweets
46 zones have over 10,000 drinking-related tweets

| Time zone | Tweets |
| --- | --- |
| Eastern Time (US & Canada) | 1349324 |
| Central Time (US & Canada) | 1141305 |
| Pacific Time (US & Canada) | 783803 |
| London | 583376 |
| Atlantic Time (Canada) | 497300 |
| Quito | 490727 |
| Amsterdam | 325605 |
| Arizona | 291043 |
| Mountain Time (US & Canada) | 234895 |
| Hawaii | 198108 |
| Casablanca | 187121 |
| Alaska | 129952 |
| Athens | 102358 |
| Brasilia | 92126 |
| Beijing | 74283 |
| Greenland | 52823 |
| Chennai | 51933 |
| Bangkok | 49231 |
| Edinburgh | 44594 |
| Buenos Aires | 43854 |
| Dublin | 38960 |
| Singapore | 38276 |
| Santiago | 37012 |
| Sydney | 31784 |
| Madrid | 31394 |
| Kuala Lumpur | 29112 |
| Paris | 27678 |
| Pretoria | 26745 |
| Mexico City | 26023 |
| Caracas | 25682 |
![Page 24: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/24.jpg)
Zones with over 10,000 Drinking Related Tweets
![Page 25: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/25.jpg)
Americans Drink More?
[Figure panels: Twitter user distribution; drinking tweet distribution]
![Page 26: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/26.jpg)
Pros & Cons
Pros:
• Through big data mining, we may find patterns that are hard to recognize in daily life.
• Identify the effect of certain events on the public.
Cons:
• How to interpret the result? (It can be misleading.)
• Only reflects public behavior, not that of individuals.
![Page 27: PowerPoint Presentationweb.eecs.utk.edu/~zzhang61/docs/reports/2015.04... · Title: PowerPoint Presentation Author: Zhifei Zhang Created Date: 5/6/2015 4:48:32 AM](https://reader033.vdocuments.net/reader033/viewer/2022053023/60544c4a50d5072bfc0ad7c9/html5/thumbnails/27.jpg)