Transcript
Page 1: Data Science for Tackling the Challenges of Big Data Dr. Brand Niemann Director and Senior Data Scientist/Data Journalist Semantic Community

Data Science for Tackling the Challenges of Big Data

Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist

Semantic Community
http://semanticommunity.info/

http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

November 14, 2014

Page 2

Overview

• Six Week MIT Online Course:
– Started November 4th and Completed November 12th.
• Mined this MIT Online Course for Data Sets and Ideas:
– Found subset of the slides that contained data sets and ideas and were interesting and useful visualizations in themselves.
• Professor Karger's Lecture Slides on Visualization User Interfaces Were All About My Heroes:
– Tukey, Tufte, Shneiderman, and Spotfire. (In fact it was everything leading up to Spotfire, but not Spotfire itself!)
• Preserve My Work & Present Tutorial to the Federal Big Data Working Group Meetup:
– MindTouch Knowledge Base, Excel Spreadsheet Index, and Spotfire Interactive Visualizations.

Page 3

MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Course Assessment

Web Site (private)

Page 4

MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Course Progress

https://mitprofessionalx.edx.org/courses/MITProfessionalX/6.BDX/2T2014/progress

Page 5

MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Big Data Storage

Web Site (private)

Page 7

Courseware: Big Data Storage

• I was especially interested in the following since both Professors Stonebraker and Madden presented to our Federal Big Data Working Group Meetup:

– This module begins with an overview of a number of these technologies by renowned database professor Mike Stonebraker. In his unique and ardent fashion, Mike expresses his skepticism about many new technologies, particularly Hadoop/MapReduce and NoSQL, and voices support for many new relational technologies, including column stores and main memory databases.

– After that, Professors Matei Zaharia and Samuel Madden provide a more nuanced view of the tradeoffs between the various approaches, discussing Hadoop and its derivatives, as well as NoSQL and its tradeoffs, in more detail.

– Professor Stonebraker expresses a number of strong opinions in this module. Which of them do you agree with? Which do you disagree with? Why?

3.0 Introduction to Big Data Storage and Discussion 3

Page 11

Google Search: Singapore Taxi Data

Page 12

Think Business: Why can’t I find a taxi when I really need one?

http://thinkbusiness.nus.edu/smart-finance/item/131-why-can%E2%80%99t-i-find-a-taxi-when-i-really-need-one?

Based on: Labor Supply Decisions of Singaporean Cab Drivers, May 8, 2013
Newer Paper: Labor Supply Decisions of Singaporean Cab Drivers, September 2014

Page 13

Labor Supply Decisions of Singaporean Cab Drivers: Table 1: Summary Statistics by Days

http://www.ushakrisna.com/Cabdrivers.pdf

Page 14

MIT Big Data Knowledge Base: Table 1 Spreadsheet

Spreadsheet

My Note: The PDF is image-only, so the table had to be built by hand!

Page 15

Singapore Land Transport Authority: Traffic Info Service Providers

http://www.lta.gov.sg/content/ltaweb/en/industry-matters/traffic-info-service-providers.html

Page 16

Singapore Land Transport Authority: MyTransport.sg

http://www.mytransport.sg/content/mytransport/home/dataMall.html#All_Datasets

Screen Scrape
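
A screen scrape of a dataset listing into spreadsheet rows could look roughly like the sketch below. It uses only Python's standard library. The sample HTML, the column names, and the `TableScraper` class are hypothetical illustrations; they do not reflect the actual markup of the MyTransport.sg page or the method used for these slides.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize rows as CSV text that a spreadsheet can open."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical stand-in for a saved copy of the dataset listing page.
SAMPLE_HTML = """
<table>
  <tr><th>Dataset</th><th>Format</th></tr>
  <tr><td>Example Dataset A</td><td>CSV</td></tr>
  <tr><td>Example  Dataset B</td><td>XML</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
```

The resulting `csv_text` can be saved to a file and opened in Excel or loaded into a visualization tool.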

Page 17

Singapore Land Transport Authority: All Datasets Spreadsheet

Spreadsheet

Page 18

MIT Big Data Knowledge Base: MindTouch

Data Science for Tackling the Challenges of Big Data

Labor Supply Decisions of Singaporean Cab Drivers, September 2014, as a Data Science Data Publication

Page 19

MIT Big Data: Knowledge Base Spreadsheet

Spreadsheet

Page 20

MIT Big Data: Course Participant Spreadsheet

Spreadsheet

My Note: This was mapped in Spotfire after data curation (cleaning of the country names). Spotfire has built-in data curation functions.
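
The kind of country-name cleaning mentioned in the note can be sketched outside Spotfire as well. The Python below is only an illustration under stated assumptions: the `ALIASES` table, the function names, and the sample values are hypothetical, and this is not how Spotfire's curation functions work internally.

```python
from collections import Counter

# Hypothetical alias table: maps variant spellings to one canonical name.
ALIASES = {
    "usa": "United States",
    "u.s.a.": "United States",
    "us": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
    "great britain": "United Kingdom",
}


def clean_country(raw):
    """Normalize whitespace and case, then apply the alias table."""
    key = " ".join(raw.split()).lower()
    if key in ALIASES:
        return ALIASES[key]
    # Fallback: simple title-casing. This is crude and will mis-capitalize
    # some names, so a real alias table would need to cover those cases.
    return key.title()


def count_by_country(raw_names):
    """Count participants per cleaned country name, skipping blank entries."""
    return Counter(clean_country(n) for n in raw_names if n.strip())


# Hypothetical sample of free-text country entries.
sample = ["USA", " united states of america", "india", "India ", "UK", ""]
counts = count_by_country(sample)
```

Once the variants collapse to one name per country, the counts can be joined to a map layer without duplicate or unmatched regions.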

Page 24

New York City Open Data: Socrata

https://nycopendata.socrata.com/

Page 25

New York City Open Data: Search Results

Web Site

My Note: Could Only Find Taxi Drivers Data.

Page 27

Visualizing NYC’s Open Data: Socrata Beta

https://nycopendata.socrata.com/viz

Page 28

MIT Big Data Assessment: Questions and Answers

• Big Data Collection
  – 2) Data science requires:
    • Knowledge of statistics
    • Knowledge of data management
    • Knowledge of curation
    • All of the above - correct
• Big Data Systems
  – 13) For which of the following tasks is interactive visualization most useful? (choose all that apply)
    • Developing a hypothesis about data - correct
    • Formally confirming a hypothesis
    • Communicating a conclusion about data - correct
    • All of the above
• Big Data Analytics:
  – 13) Big Data means that there's no shortage of useful data.
    • True
    • False - correct

