Data Science for Tackling the Challenges of Big Data. Dr. Brand Niemann, Director and Senior Data Scientist/Data Journalist, Semantic Community. http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup November 14, 2014


Page 1

Data Science for Tackling the Challenges of Big Data

Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist

Semantic Community
http://semanticommunity.info/

http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

November 14, 2014

Page 2

Overview
• Six-Week MIT Online Course:
  – Started November 4th and completed November 12th.
• Mined this MIT Online Course for Data Sets and Ideas:
  – Found a subset of the slides that contained data sets and ideas and were interesting and useful visualizations in themselves.
• Professor Karger's Lecture Slides on Visualization User Interfaces Were All About My Heroes:
  – Tukey, Tufte, Shneiderman, and Spotfire. (In fact, it was everything leading up to Spotfire, but not Spotfire itself!)
• Preserve My Work & Present Tutorial to the Federal Big Data Working Group Meetup:
  – MindTouch Knowledge Base, Excel Spreadsheet Index, and Spotfire Interactive Visualizations.

Page 3

MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Course Assessment

Web Site (private)

Page 4

MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Course Progress

https://mitprofessionalx.edx.org/courses/MITProfessionalX/6.BDX/2T2014/progress

Page 5

MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Big Data Storage

Web Site (private)

Page 7

Courseware: Big Data Storage
• I was especially interested in the following since both Professors Stonebraker and Madden presented to our Federal Big Data Working Group Meetup:
  – This module begins with an overview of a number of these technologies by renowned database professor Mike Stonebraker. In his unique and ardent fashion, Mike expresses his skepticism about many new technologies, particularly Hadoop/MapReduce and NoSQL, and voices support for many new relational technologies, including column stores and main memory databases.
  – After that, Professors Matei Zaharia and Samuel Madden provide a more nuanced view of the tradeoffs between the various approaches, discussing Hadoop and its derivatives, as well as NoSQL and its tradeoffs, in more detail.
  – Professor Stonebraker expresses a number of strong opinions in this module. Which of them do you agree with? Which do you disagree with? Why?

3.0 Introduction to Big Data Storage and Discussion 3

Page 11

Google Search: Singapore Taxi Data

Page 12

Think Business: Why can’t I find a taxi when I really need one?

http://thinkbusiness.nus.edu/smart-finance/item/131-why-can%E2%80%99t-i-find-a-taxi-when-i-really-need-one?

Based on: Labor Supply Decisions of Singaporean Cab Drivers, May 8, 2013
Newer Paper: Labor Supply Decisions of Singaporean Cab Drivers, September 2014

Page 13

Labor Supply Decisions of Singaporean Cab Drivers: Table 1: Summary Statistics by Days

http://www.ushakrisna.com/Cabdrivers.pdf

Page 14

MIT Big Data Knowledge Base: Table 1 Spreadsheet

Spreadsheet

My Note: The source was an image PDF, so I had to build the spreadsheet by hand!

Page 15

Singapore Land Transport Authority: Traffic Info Service Providers

http://www.lta.gov.sg/content/ltaweb/en/industry-matters/traffic-info-service-providers.html

Page 16

Singapore Land Transport Authority: MyTransport.sg

http://www.mytransport.sg/content/mytransport/home/dataMall.html#All_Datasets

Screen Scrape
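The screen-scrape step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the method actually used for the slide: it pulls link titles and URLs out of a page's HTML and writes them as CSV rows ready for a spreadsheet. The names `LinkCollector`, `scrape_links`, `rows_to_csv`, and the `SAMPLE_HTML` markup are all hypothetical; the real MyTransport.sg page layout is not known here.

```python
import csv
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace left over from the page layout.
            title = " ".join("".join(self._text).split())
            if title:  # skip anchors with no visible text
                self.rows.append((title, self._href))
            self._href = None


def scrape_links(html):
    """Return a list of (title, url) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize the scraped rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["title", "url"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample markup; NOT the real data mall page.
SAMPLE_HTML = """
<ul>
  <li><a href="/data/example-a.zip">Example Dataset A</a></li>
  <li><a href="/data/example-b.zip">Example
      Dataset B</a></li>
  <li><a href="#"></a></li>
</ul>
"""

if __name__ == "__main__":
    print(rows_to_csv(scrape_links(SAMPLE_HTML)))
```

The resulting CSV text can be saved to a file and opened directly in Excel or imported into Spotfire.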

Page 17

Singapore Land Transport Authority: All Datasets Spreadsheet

Spreadsheet

Page 18

MIT Big Data Knowledge Base: MindTouch

Data Science for Tackling the Challenges of Big Data

Labor Supply Decisions of Singaporean Cab Drivers, September 2014, as a Data Science Data Publication

Page 19

MIT Big Data: Knowledge Base Spreadsheet

Spreadsheet

Page 20

MIT Big Data: Course Participant Spreadsheet

Spreadsheet

My Note: This was mapped in Spotfire after data curation (cleaning of the country names). Spotfire has built-in data curation functions.
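For readers without Spotfire, the country-name cleaning step can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the `ALIASES` table, the function names `clean_country` and `count_by_country`, and the sample inputs are hypothetical, and it does not reproduce Spotfire's own curation functions or the actual participant data.

```python
from collections import Counter

# Hypothetical alias table; a real one would be built by inspecting
# the distinct values that actually occur in the spreadsheet.
ALIASES = {
    "usa": "United States",
    "us": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
}


def clean_country(raw):
    """Normalize one free-text country entry to a canonical name."""
    # Drop periods, collapse whitespace, and ignore case.
    key = " ".join(raw.replace(".", "").split()).lower()
    if not key:
        return "Unknown"
    # Fall back to simple title-casing when no alias is known.
    return ALIASES.get(key, key.title())


def count_by_country(names):
    """Count participants per cleaned country name, ready for mapping."""
    return Counter(clean_country(name) for name in names)


if __name__ == "__main__":
    sample = ["USA", "U.S.", "india", "India ", ""]
    for country, n in count_by_country(sample).items():
        print(country, n)
```

The title-casing fallback is deliberately naive and will mis-capitalize some names (for example, ones containing apostrophes), so unknown values are best reviewed by hand before mapping.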

Page 24

New York City Open Data: Socrata

https://nycopendata.socrata.com/

Page 25

New York City Open Data: Search Results

Web Site

My Note: Could Only Find Taxi Drivers Data.

Page 27

Visualizing NYC’s Open Data: Socrata Beta

https://nycopendata.socrata.com/viz

Page 28

MIT Big Data Assessment: Questions and Answers

• Big Data Collection:
  – 2) Data science requires:
    • Knowledge of statistics
    • Knowledge of data management
    • Knowledge of curation
    • All of the above - correct
• Big Data Systems:
  – 13) For which of the following tasks is interactive visualization most useful? (choose all that apply)
    • Developing a hypothesis about data - correct
    • Formally confirming a hypothesis
    • Communicating a conclusion about data - correct
    • All of the above
• Big Data Analytics:
  – 13) Big Data means that there's no shortage of useful data.
    • True
    • False - correct