twitterpedia
![Page 1: Twitterpedia](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ae0550346895da32fcc/html5/thumbnails/1.jpg)
Twitterpedia
Visualization Lab
By: Thomas Kraft
![Page 2: Twitterpedia](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ae0550346895da32fcc/html5/thumbnails/2.jpg)
Problem
What is being talked about and where?
Twitter has massive amounts of data
Tweets are unstructured
Goal: Quickly identify current events / topics on a large scale
Overview, Current State, Future
![Page 3: Twitterpedia](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ae0550346895da32fcc/html5/thumbnails/3.jpg)
What Needs To Be Done
Data Collection
◦ Database
◦ Web Crawler
Analyze Data
◦ Topic Modeling
Get trends and topics!
![Page 4: Twitterpedia](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ae0550346895da32fcc/html5/thumbnails/4.jpg)
Hadoop
Processes large datasets
◦ Splits data into chunks
◦ Data processed on multiple machines
Very scalable
◦ Add/remove computers easily
◦ As the dataset grows, so can the number of machines
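The split-and-process idea can be sketched in plain Python. This is only an illustration of the map and reduce steps on a single machine, not Hadoop's API and not the project's code; the function names and the toy tweets are made up for the example.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def map_chunk(tweets):
    """Map step: count words in one chunk. Each chunk could run on its own machine."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: sum the per-chunk counts into one global count."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Toy input (hypothetical tweets, not project data).
tweets = [
    "hadoop splits data",
    "data runs on many machines",
    "add machines as data grows",
]
chunks = split_into_chunks(tweets, 2)
word_counts = reduce_counts(map_chunk(c) for c in chunks)
```

Because each chunk is counted independently, adding machines only changes how many chunks are processed at once; the reduce step stays the same.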
![Page 5: Twitterpedia](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ae0550346895da32fcc/html5/thumbnails/5.jpg)
Computer Cluster
![Page 6: Twitterpedia](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ae0550346895da32fcc/html5/thumbnails/6.jpg)
Topic Modeling
Latent Dirichlet Allocation (LDA)
◦ Correlations between words in topics
Topics are composed of keyword groups
A tweet's topic can be inferred effectively
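To make the LDA idea concrete, here is a minimal sketch of a collapsed Gibbs sampler in plain Python. It is a toy written for illustration, not the implementation used in the project; the function name `lda_gibbs`, the hyperparameter values, and the example tweets are all assumptions.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab),
    where doc_topic[d][k] and topic_word[k][w] are raw assignment counts.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Weight of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab


# Hypothetical tweets, tokenized by whitespace.
tweets = [
    "awards show performance tonight",
    "love the awards performance",
    "storm warning rain tonight",
    "rain and storm flooding",
]
docs = [t.split() for t in tweets]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)

# Infer each tweet's topic as the one holding most of its tokens.
tweet_topics = [max(range(2), key=lambda k: row[k]) for row in doc_topic]
```

The inner loop visits every token on every iteration, which is why the cost grows quickly with the size of the dataset.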
![Page 7: Twitterpedia](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ae0550346895da32fcc/html5/thumbnails/7.jpg)
“Can Rick Ross Please put his clothes on?”
“Bruno & alicia! I love it!”
June 26, 2011
![Page 8: Twitterpedia](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ae0550346895da32fcc/html5/thumbnails/8.jpg)
Challenge
Topic modeling is resource intensive
◦ Iterates over the data
A single computer can’t handle a large dataset
Solution: Parallelize the process
![Page 9: Twitterpedia](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ae0550346895da32fcc/html5/thumbnails/9.jpg)
Parallel LDA (PLDA)
Write an algorithm to split up tweets and join the output
Improves scalability for LDA
◦ Shows near-linear improvements
PLDA will take Twitterpedia to the next level
◦ Larger datasets with quicker processing
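The slides do not spell out the split-and-join algorithm. One plausible shape, sketched below, deals tweets into shards and then merges each worker's topic-word counts using the update rule from approximate distributed LDA (global plus the sum of each worker's changes). Treat the choice of rule, the function names, and the toy numbers as assumptions rather than the project's design.

```python
def split_tweets(tweets, num_workers):
    """Deal tweets round-robin into num_workers shards."""
    return [tweets[i::num_workers] for i in range(num_workers)]


def merge_counts(global_counts, worker_counts):
    """Fold each worker's changes back into the global topic-word counts.

    Every worker starts an iteration from a copy of global_counts and updates
    it locally; the merged value is global + sum over workers of (local - global).
    """
    merged = []
    for k, row in enumerate(global_counts):
        merged_row = []
        for w, base in enumerate(row):
            delta = sum(local[k][w] - base for local in worker_counts)
            merged_row.append(base + delta)
        merged.append(merged_row)
    return merged


# Toy example: 2 topics x 3 words, two workers.
global_counts = [[4, 1, 0], [0, 2, 3]]
worker_a = [[4, 2, 0], [0, 1, 3]]  # moved one token of word 1 to topic 0
worker_b = [[4, 1, 1], [0, 2, 2]]  # moved one token of word 2 to topic 0
merged = merge_counts(global_counts, [worker_a, worker_b])

shards = split_tweets(["t0", "t1", "t2", "t3", "t4"], 2)
```

Workers only exchange count tables, not tweets, so the join step stays small even as the number of tweets per shard grows.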
![Page 10: Twitterpedia](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ae0550346895da32fcc/html5/thumbnails/10.jpg)
Future
Write an algorithm to parallelize tweet distribution and aggregation
Create a website implementing topics
![Page 11: Twitterpedia](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ae0550346895da32fcc/html5/thumbnails/11.jpg)
Conclusion
Working on this project has been a great learning experience
◦ Designed and managed a large database
◦ Efficiency was a high priority
◦ Learned cool tricks along the way…
![Page 12: Twitterpedia](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ae0550346895da32fcc/html5/thumbnails/12.jpg)
Thanks
Special thanks to my advisor Xiaoyu Wang, to Wenwen Dou, and to the Visualization Center
Thomas Kraft : [email protected]
![Page 13: Twitterpedia](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ae0550346895da32fcc/html5/thumbnails/13.jpg)
Questions?