

Page 1: Data Science for Agency Initiatives 2015 Dr. Brand Niemann Director and Senior Data Scientist/Data Journalist Semantic Community Data Science Data Science


Data Science for Agency Initiatives 2015

Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist
Semantic Community

August 3, 2015


Activities

• 906 Members on July 28th and 13 New Members on July 22nd – a Daily Record!
• Member Mary Galvin had John Patrick Junior this month.
• Dr. Tom Rindflesh, NIH/NLM Semantic Medline on August 17th on Glucan.
• Data Science for EPA Hydraulic Fracturing Webinar, September 1st.
• OSTP/NSF Data Science Meetup of Meetups, November 6th, Ballston, VA.
• Steve Hanmer, Mission Source, co-planning Data Science for Data Act Datathon Meetup. He attended the Data Act Datathon and Forum this week and will report.
• Jonathan Hines, ORNL science writer, doing a story on Semantic Medline and the ORNL CADES – Compute and Data Environment for Science.
• Dr. David Booth, Yosemite Project (Semantic Interoperability of EHRs), Cambridge Semantic Web Meetup Founder, Accepted to Speak with Date TBD.
• Attended Algorithms for Geospatial Data Analysis and Data Owls Meetups.


Algorithms for Geospatial Data Analysis and Data Owls Meetups

• I am not able to help with a blog for the Wednesday Meetup because there is not enough information to write one. My slide 3 (which I posted to your Meetup) shows the information I need for a blog and collect beforehand for my own Meetup blogs. In this case, my research since the Meetup shows that both authors could have accessed and used the actual data from the EIA.
• An example of what I am saying is my data science blog for our Monday August 3rd Meetup.
• Listen to CFPB Data Manager, get Consumer Complaint Database, and see Data Science on that data set!


Data Mining - Data Science - Data Publication Process

• Data Mining Process:
  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

• Data Science Process:
  • Data Preparation
  • Data Ecosystem
  • Data Story

• Data Science Questions:
  • How was the data collected?
  • Where is the data stored?
  • What are the data results?
  • Why should we believe the data results?

• Data Science Data Publication:
  • Knowledge Base
  • Spreadsheet Index
  • Web & PDF Tables to Spreadsheet
  • Data Browser
  • Dynamically Linked Adjacent Visualizations


Data Science Data Publication: Data Browser


Data Science Data Publication: Dynamically Linked Adjacent Visualizations


USGS geochem.csv Data Problem 1

• Sophia, In Brand Niemann's presentation to the Big Data group, he mentioned trouble with geographic coordinates in the file geochem.csv located at http://mrdata.usgs.gov/geochem/geochem.csv. I've examined this file in Microsoft Excel 2010, plotting the latitude against longitude, and I don't see any anomalies. If there is any other information that might help to clarify the problem Brand had, I'd be happy to investigate further, but with the available evidence it looks like a software problem with the tools he was using. Peter
• My Note: I also did a scatter plot in Spotfire when the Map Tool did not work.
• Peter (cc Brand), Thanks for following up on this. I have included Brand so that he can reply with a more thorough response. I was also very interested to know why there was a discrepancy with the geographic coordinates. It would be helpful to know the source of the issue. Thanks, Sophia
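The kind of check Peter describes (looking at latitude against longitude for anomalies) can also be done without a plot. The following is a minimal sketch only: the sample rows are made up for illustration, and the column names `rec_id`, `latitude`, and `longitude` are assumptions rather than the actual headers of geochem.csv.

```python
import csv
import io

# Made-up sample rows; the column names are assumptions and are not
# taken from the real geochem.csv.
SAMPLE = """rec_id,latitude,longitude
1,38.95,-77.37
2,,
3,64.20,-149.50
4,138.95,-77.37
"""

def check_coordinates(text):
    """Sort record ids into valid, missing, and suspect coordinate groups."""
    valid, missing, suspect = [], [], []
    for row in csv.DictReader(io.StringIO(text)):
        lat, lon = row["latitude"].strip(), row["longitude"].strip()
        if not lat or not lon:
            missing.append(row["rec_id"])  # no coordinates recorded
            continue
        try:
            lat_f, lon_f = float(lat), float(lon)
        except ValueError:
            suspect.append(row["rec_id"])  # text that is not a number
            continue
        if -90.0 <= lat_f <= 90.0 and -180.0 <= lon_f <= 180.0:
            valid.append(row["rec_id"])
        else:
            suspect.append(row["rec_id"])  # outside the possible range
    return valid, missing, suspect

valid, missing, suspect = check_coordinates(SAMPLE)
print("valid:", valid)
print("missing:", missing)
print("suspect:", suspect)
```

Separating "missing" from "suspect" matters here, because a row with no coordinates is not necessarily an error in the data.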


USGS geochem.csv Data Problem 1

• Peter, The problem is that the geochem.csv treats Latitude and Longitude as Categorical Data and not Numerical Data as does say the MRDS.csv, etc. A sophisticated program like Spotfire is sensitive to that important difference. Brand

• Brand, Your statement makes no sense. CSV files are plain text, with the rows specified as lines, and the columns delimited by commas. There is no type information, no category information, nothing at all to which a program reading these data can be "sensitive" other than the actual values in the field. Instead, it is the obligation of the person operating the software to understand the information in the data file, and apply that understanding in the use of software. That includes substantive knowledge of the meaning of the fields as well as the simple technical observations that one can make by examining the values contained in each field. That's why we have documentation. So first of all, when you had trouble, you should have investigated further with other software (Excel, for example), then you should have contacted me if you continued to have trouble using the data. It was irresponsible for you to claim that the problem you encountered is in the data. Peter
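Peter's point that a CSV file carries no type information can be illustrated with Python's standard `csv` module. This is only a sketch with made-up rows and assumed column names; it says nothing about how Spotfire or Excel read the file.

```python
import csv
import io

# Made-up rows; "site", "latitude" and "longitude" are assumed names.
SAMPLE = "site,latitude,longitude\nA,38.95,-77.37\nB,64.20,-149.50\n"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Every field comes back as text; the file itself declares no types.
field_types = {type(value).__name__ for row in rows for value in row.values()}
print(field_types)

# Treating a column as numeric is a decision made by whoever reads the file.
latitudes = [float(row["latitude"]) for row in rows]
print(latitudes)
```

Any "numerical" or "categorical" label therefore comes from the reading program or its operator, not from the file.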


USGS geochem.csv Data Problem 2

• Peter, Please download a free trial of Spotfire and import the two csv files: geochem.csv and MRDS.csv and you will see what I am talking about. I can come to the USGS and show you this if you would like. This is data science. Brand

• Brand, You have to understand the data, and you have to use the data responsibly. It is not up to the software to do that work for you. My suspicion is that your program treated the coordinates differently than numbers because some of the rows have no coordinates--they're the geochemical analyses of materials standards used to ensure that the sample measurements are correct, and are used by knowledgeable specialists to assess the accuracy and precision of the data values. But you didn't look at the data, otherwise you would have seen this. That's not science of any sort. A scientist examines the evidence with which he or she works, and tries to understand what the evidence is, where it came from, and what it means. Peter

• Peter, I did look at multiple USGS data sets with the premier data science tool (IMHO) and reported what I found. I am telling you how you could verify my results and learn something about data science. The choice is up to you. Brand
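Peter's suspicion, that rows without coordinates caused the column to be read as something other than numbers, can be sketched with a deliberately strict type-inference rule. The rule below is hypothetical and is not a description of Spotfire's actual behavior; the values are made up.

```python
def infer_kind(values):
    """Hypothetical rule: 'numeric' only if every value parses as a float."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return "categorical"
    return "numeric"

# Made-up latitude column in which one row has no coordinate.
latitudes = ["38.95", "", "64.20"]
kind_with_blanks = infer_kind(latitudes)
print(kind_with_blanks)

# Dropping the blank rows first lets the same rule see a numeric column.
present = [value for value in latitudes if value.strip()]
kind_without_blanks = infer_kind(present)
print(kind_without_blanks)
```

Under a rule like this, a single blank field is enough to change how the whole column is classified, which would be consistent with both accounts in the exchange: the values are sound, yet a tool may still decline to treat the column as numeric.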


Data Science Data Curation for Sustainable Data Science Meetups of Meetups

• I just finished four data science ecosystems:
  • RDA Climate Data Challenge (July 15):
    • http://semanticommunity.info/Data_Science/Data_Science_for_RDA_Climate_Change_Data_Challenge
  • RDA Information Week 2016 (Ebola Response and Nepal Earthquake) (July 17):
    • http://semanticommunity.info/Data_Science/Data_Science_for_Global_Ebola_Response_Data
  • USDA Microsoft Innovation Challenge (July 27):
    • http://semanticommunity.info/Data_Science/Big_Data_Science_for_Precision_Farming_Business#Story
  • US Data Act (July 28):
    • http://semanticommunity.info/Data_Science/Data_Science_for_the_DataAct_Datathon


Collaboration for Data Science Win-Wins

• USDA Open Government Data Training, Innovation Competition, and Online Course in Data-Driven Farming:
  • http://semanticommunity.info/Data_Science/Big_Data_Science_for_Precision_Farming_Business#Story
• Many Curated Government Data Sets and Data Science Products:
  • http://semanticommunity.info
• Pick an Agency and/or a Data Set and Look for a Meetup on That:
  • http://www.meetup.com/Federal-Big-Data-Working-Group/
• Mentor Startups Partnership with Eastern Foundry:
  • http://www.meetup.com/Federal-Big-Data-Working-Group/events/223140032/


USDA Collaboration Chronology

• March 16th: USDA CIO and ACDO on Open Data Plan and Roundtable Meetup
• March 25th: Government Technology & Innovation Incubator for Big Data Analytics II Meetup at Eastern Foundry
• May 18th: USDA Data Science MOOC Meetup
• May 21st: USDA Open Data Quarterly Submission to OMB on USDA Data Usage provided (USDA Data Science MOOC)
• July 21st: Data-Driven Farming Online Course Announced by HeatSpring and Semantic Community
• July 27th: USDA Microsoft Innovation Challenge Submission on Farm Data Dashboards
• July 29th: Partnerships Sought for Data-Driven Farming Online Course
• September 17th: Big Data Science for Precision Farming Business Online Course Meetup and Commercial Examples: Farmers Business Network, FarmLogs, etc.
• October 26-December 18th: Data-Driven Farming Online Course with Partners


https://www.farmersbusinessnetwork.com/


Agenda

• 6:30 p.m. Welcome and Introduction (New Tutorial and Mentoring) Slides Data Science for Agency Initiatives 2015
• 7:15 p.m. Brief Member Introductions
• 7:30 p.m. Chad Tompkins, Section Chief, Data Section, Office of Consumer Response (suggested by Linda F. Powell, Chief Data Officer, Consumer Financial Protection Bureau) Consumer Complaint Database Slides (not cleared for public release)
• 8:15 p.m. Open Discussion
• 8:45 p.m. Networking
• 9:00 p.m. Depart

Listen to CFPB Data Manager, get Consumer Complaint Database, and see Data Science on that data set!