http://poloclub.gatech.edu/cse6242
CSE6242/CX4242: Data & Visual Analytics
Data Collection
Duen Horng (Polo) Chau
Associate Professor, College of Computing
Associate Director, MS Analytics
Machine Learning Area Leader, College of Computing
Georgia Tech
Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
How to Collect Data?

| Method | Effort |
| --- | --- |
| Download | Low |
| API (application programming interface) | Medium |
| Scrape/Crawl | High |
Data you can just download
• NYC Taxi data: Trip (11 GB), Fare (7.7 GB)
• StackOverflow (xml)
• Wikipedia (data dump)
• Atlanta crime data (csv)
• Soccer statistics
• Data.gov
• …
Data you can just download
If you have leads, let us know on Piazza!
More datasets on course website:
Collect Data via APIs
• Google Data API (e.g., Google Maps Directions API): https://developers.google.com/gdata/docs/directory
• Twitter (small subset): https://dev.twitter.com/streaming/overview
• Last.fm (Pandora has an unofficial API)
• Flickr
• data.nasa.gov
• data.gov
• Facebook (your friends only)
Data that needs scraping
• Amazon (reviews, product info)
• ESPN
• eBay
• Google Play
• Google Scholar
• …
How to Scrape? Google Play example
Goal: collect the network of similar apps
How to Scrape?
Goal: Write a program/algorithm to scrape Google Play to collect a million-node network of similar apps
• Each node is an app
• An edge connects two similar apps
Hint: start with some apps (e.g., Shazam), and go from there.
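
One way to read the hint is as a breadth-first crawl: keep a queue of apps still to visit, start it with a few seed apps, and add every newly seen similar app to the queue. The sketch below is only an illustration under assumptions, not the course's reference solution. The function `crawl_similar_apps`, the callback `get_similar`, and the toy data in `TOY_SIMILAR` (including the `com.example.*` IDs) are all hypothetical; a real `get_similar` would fetch and parse an app's Google Play page.

```python
from collections import deque


def crawl_similar_apps(seed_ids, get_similar, max_nodes=1_000_000):
    """Breadth-first crawl of a similar-apps network.

    seed_ids:    app IDs to start from (e.g., Shazam's ID).
    get_similar: hypothetical callback, app ID -> iterable of similar app IDs.
    max_nodes:   stop discovering new apps once this many are known.
    Returns (nodes, edges), where each edge is a sorted pair of app IDs.
    """
    queue = deque(dict.fromkeys(seed_ids))  # de-duplicate seeds, keep order
    visited = set(queue)
    edges = set()

    while queue:
        app = queue.popleft()
        for other in get_similar(app):
            if other == app:
                continue  # ignore self-loops
            if other not in visited:
                if len(visited) >= max_nodes:
                    continue  # node budget used up: skip unseen apps
                visited.add(other)
                queue.append(other)
            # Store each undirected edge once, regardless of direction.
            edges.add(tuple(sorted((app, other))))
    return visited, edges


# Toy stand-in for Google Play, invented for this example only.
TOY_SIMILAR = {
    "com.shazam.android": ["com.spotify.music", "com.example.radio"],
    "com.spotify.music": ["com.shazam.android", "com.example.podcasts"],
    "com.example.radio": ["com.spotify.music"],
    "com.example.podcasts": [],
}


def toy_get_similar(app_id):
    return TOY_SIMILAR.get(app_id, [])


nodes, edges = crawl_similar_apps(["com.shazam.android"], toy_get_similar)
print(len(nodes), "apps,", len(edges), "edges")
```

The `visited` set is what keeps the crawl from requesting the same app twice, which matters at the million-node scale the goal mentions. A real crawler would also need to pause between requests and save progress to disk.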
How to Scrape? Google Play example
Goal: collect the network of similar apps
https://play.google.com/store/apps/details?id=com.shazam.android
https://play.google.com/store/apps/details?id=com.spotify.music
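
Both URLs share one pattern: the app is identified by the `id` query parameter. The sketch below, using only the Python standard library, shows how a scraper might build such URLs, read the ID back out, and collect app IDs from the links on a page. The helper names (`app_url`, `app_id_from_url`, `AppLinkParser`) and the `SAMPLE_HTML` snippet are made up for illustration; the real Google Play markup is not shown in the slides and will differ.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlencode, urlparse

DETAILS_URL = "https://play.google.com/store/apps/details"


def app_url(app_id):
    """Build the details-page URL for an app ID."""
    return f"{DETAILS_URL}?{urlencode({'id': app_id})}"


def app_id_from_url(url):
    """Return the app ID in a details-page URL, or None if it is not one."""
    parts = urlparse(url)
    if not parts.path.endswith("/store/apps/details"):
        return None
    ids = parse_qs(parts.query).get("id")
    return ids[0] if ids else None


class AppLinkParser(HTMLParser):
    """Collect app IDs from <a href="..."> links, in order, without repeats."""

    def __init__(self):
        super().__init__()
        self.app_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        app_id = app_id_from_url(href)
        if app_id and app_id not in self.app_ids:
            self.app_ids.append(app_id)


# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = (
    '<div>'
    '<a href="/store/apps/details?id=com.spotify.music">Spotify</a>'
    '<a href="/store/apps/details?id=com.example.radio">Radio</a>'
    '<a href="/about">About</a>'
    '</div>'
)

parser = AppLinkParser()
parser.feed(SAMPLE_HTML)
print(app_url("com.shazam.android"))
print(parser.app_ids)
```

The list of IDs found on one app's page is exactly what a crawler needs in order to add edges and decide which pages to visit next.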
Popular Scraping Libraries
• Selenium. Supports multiple languages. http://www.seleniumhq.org
• Beautiful Soup. Python. https://www.crummy.com/software/BeautifulSoup
• Scrapy. Python. https://scrapy.org
• JSoup. Java. https://jsoup.org
Important considerations:
Different web content may show up depending on the web browser used
• Scraper may need a different “web driver” (e.g., in Selenium), or browser “user agent”
Data may show up only after certain user interaction (e.g., clicking a button)
• Scraper may need to simulate the actions.
• Selenium supports more actions than Beautiful Soup: http://www.discoversdk.com/blog/web-scraping-with-selenium
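
As a small illustration of the “user agent” point, the sketch below prepares HTTP requests that identify themselves as different browsers, using the standard library's `urllib.request.Request`. The user-agent strings and the `build_request` helper are hypothetical placeholders, and nothing is sent over the network here; passing the request to `urllib.request.urlopen` would send it. Simulating clicks is a separate problem that needs a browser driver such as Selenium and is not shown.

```python
from urllib.request import Request

# Placeholder user-agent strings, invented for this example; a real scraper
# would copy the string of the browser it wants to imitate.
HYPOTHETICAL_USER_AGENTS = {
    "desktop": "Mozilla/5.0 (X11; Linux x86_64) ExampleDesktopBrowser/1.0",
    "mobile": "Mozilla/5.0 (Linux; Android 13) ExampleMobileBrowser/1.0",
}


def build_request(url, profile="desktop"):
    """Prepare (but do not send) a request with the chosen user agent."""
    headers = {"User-Agent": HYPOTHETICAL_USER_AGENTS[profile]}
    return Request(url, headers=headers)


url = "https://play.google.com/store/apps/details?id=com.shazam.android"
for profile in HYPOTHETICAL_USER_AGENTS:
    request = build_request(url, profile)
    # urllib normalizes header names, so the stored key is "User-agent".
    print(profile, "->", request.get_header("User-agent"))
```

Fetching the same page under both profiles and comparing the responses is a quick way to check whether a site serves different content to different browsers.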