1

Federal Big Data Working Group Meetup

Dr. Brand Niemann
Director and Senior Data Scientist

Semantic Community
http://semanticommunity.info/

http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

May 20, 2014

2

Mission Statement

• Federal: Supports the Federal Big Data Initiative, but is not endorsed by the Federal Government or its Agencies;
• Big Data: Supports the Federal Digital Government Strategy, which is "treating all content as data", so big data = all your content;
• Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and
• Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like the MOOCs (Massive Open Online Courses) being considered by the White House.

Co-organizers: Brand Niemann and Kate Goodier

3

May 6th Meetup: EPA/NASA Climate-Environmental Data Analytics & A Redesigned, Open Data.gov

• How was the Meetup?
– Thanks for continually providing a forum facilitating discussion and bringing in speakers with diverse experience. On my drive home NPR was fittingly enough talking about big data.
– Just lots of good info on big data; I also am a big fan of data.gov, so it's exciting that so much is happening with government open data. Perhaps we'll see even more APIs?
– Jeanne Holm: You can find more of the APIs at https://www.data.gov/developers/apis and http://catalog.data.gov/dataset?res_format=api. There are about 450 between the two.
– Amazing growth in membership: Our 200th member!
• Welcome: Inge, Consultant working in the federal/health space.

http://www.meetup.com/Federal-Big-Data-Working-Group/events/174975182/
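As a rough illustration of Jeanne Holm's pointer to the catalog, the sketch below builds a query for API-format resources. It assumes the catalog exposes the standard CKAN Action API (`package_search`) at the path shown and that `res_format` can be used as a filter field there; both are assumptions, not something stated in the slides. The function names are hypothetical, and the live count will not necessarily match the "about 450" quoted above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the standard CKAN Action API path on the Data.gov catalog.
CATALOG_API = "https://catalog.data.gov/api/3/action/package_search"


def build_api_search_url(res_format="api", rows=0):
    """Build a package_search URL filtered on resource format.

    rows=0 asks only for the match count, not the dataset records.
    """
    params = {"fq": f"res_format:{res_format}", "rows": rows}
    return f"{CATALOG_API}?{urlencode(params)}"


def count_matching_datasets(url):
    """Fetch the URL and return the number of matching datasets.

    Requires network access; CKAN wraps results as {"result": {"count": ...}}.
    """
    with urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return payload["result"]["count"]


search_url = build_api_search_url()
```

Calling `count_matching_datasets(search_url)` would perform the actual request; it is left out of the sketch so the example runs offline.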



4

EPA & NASA Climate/Environmental Data Analytics, Dr. Joan Aron, Global Environmental/Climate Change Scientist

• Data Analytics Needs Scenario Water Quality:
– End User of Big Data
• Perspective of Risk Analysis:
– CODATA Integrated Research on Disaster Risk
• Continuity of Data:
– US EPA Air Data
• Linkages of Data:
– Conservation International
• Linkages of Climate and Water Quality:
– US Interagency Chesapeake Bay Program
• Answer Three Questions (with sample analytics by Brand Niemann):
– How was the data collected?
– Where is the data stored?
– What are the data results?

http://semanticommunity.info/@api/deki/files/29022/JoanAron05062014.pptx



5

Federating Big Data for Big Innovation and A Redesigned, Open Source Data.gov, Dr. Jeanne Holm, Data.gov Evangelist

• Background:
– Usability Tests Put Brakes on Data.gov Redesign
– LinkedIn Discussion
• Main Points:
– Releasing and using open data is about empowering people to make better decisions
– Open data is an ecosystem
– Building a federated catalog of national data
– Keeping the conversation fresh: Multiple rounds of usability testing found that a redesign was needed; monthly builds are now being done
– A Global Movement has begun to provide transparency and democratization of data
• My Note:
– See my Tutorial Slides 12-19

http://semanticommunity.info/@api/deki/files/29263/JeanneHolm05062014.pptx

http://www.fiercegovernmentit.com/story/usability-tests-put-brakes-datagov-redesign/2014-04-30

https://www.linkedin.com/groupAnswers?viewQuestionAndAnswers=&discussionID=5829192811223728132&gid=1800648&commentID=5866632158088564736&trk=view_disc&fromEmail=&ut=0-PQzK_SLrNmc1

http://semanticommunity.info/@api/deki/files/29023/BrandNiemann05062014.pptx

6

Activities

• White Paper for DARPA, NASA, NIH, NIST and NITRD: "Making Big Data Small" using Data Science and Semantics:
– See Framework and Questions and Answers
– Dan Kaufman, DARPA Director of Innovation, and Paul Cohen, DARPA Big Mechanism Project Director
– Drs. Farnam Jahanian (NSF Big Data Publications), Phil Bourne (Data Culture at NIH), and John Holdren (Climate Change Impacts)
• Health Datapalooza V, June 1-3:
– See next slides
• CODATA International Society for Digital Earth (ISDE) Workshop on Big Data for International Scientific Programmes: Challenges and Opportunities, June 8-9:
– See next slides
• Big Data for Government, June 16-17:
– Keynote from Dr. George Strawn and Presentation by Dr. Tom Rindflesch and Semantic Medline/YarcData Team
• Earth Cube All-Hands Meeting, June 24-26:
– Report at July Meetup

7

Framework for White Paper

• Organize a Community of Data Scientists and Related Fields to focus on treating all of your content as "Big Data"
– Example: Federal Big Data Working Group Meetup
• Follow the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000) consisting of Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment
– Example: Semantic Community Data Science Knowledge Base (Big Data Science for CODATA)
• Mine prominent scientific journals for data policy, databases, and data results that can be reused
– Example: CODATA Data Science Journal (509 publications by 9 attributes)
• Provide data stories and presentation materials for public education and conferences
– Example: CODATA International Workshop on Big Data for International Scientific Programmes, June 8-9, in Beijing
• Obtain NSF funding for sustained data science for data publications work over a period of years
– Example: Critical Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA)
• Provide a Data Fairport with "Data Publication in Data Browsers"
– Example: Semantic Community Spotfire Cloud Library
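The six CRISP-DM phases named in the framework can be written down as an ordered checklist. This is only a minimal sketch: the class and helper names are hypothetical, and it treats the phases as a simple sequence even though CRISP-DM in practice loops back between phases.

```python
from enum import IntEnum
from typing import Iterable, List, Optional


class CrispDmPhase(IntEnum):
    """The six CRISP-DM phases, numbered in their nominal order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def next_phase(phase: CrispDmPhase) -> Optional[CrispDmPhase]:
    """Return the phase that nominally follows, or None after Deployment."""
    if phase is CrispDmPhase.DEPLOYMENT:
        return None
    return CrispDmPhase(phase + 1)


def remaining_phases(completed: Iterable[CrispDmPhase]) -> List[CrispDmPhase]:
    """List the phases not yet completed, in nominal order."""
    done = set(completed)
    return [phase for phase in CrispDmPhase if phase not in done]


# Hypothetical project that has finished the two "understanding" phases.
todo = remaining_phases(
    [CrispDmPhase.BUSINESS_UNDERSTANDING, CrispDmPhase.DATA_UNDERSTANDING]
)
```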

8

Framework Questions & Answers

• Is this the Barend Mons Nanopub approach to the data publication of cardinal assertions? No, please see the examples in these slides.
• What are the goals of the White Paper and NSF Grant Proposal?
– The White Paper documents the Framework for general public relations and marketing purposes. The NSF Grant Proposal is to obtain long-term funding to sustain this Framework and Mission Statement activity.
– In essence, we know that NSF wants a community that follows standards to produce data science publications that reside in a knowledge base repository, and workforce training that supports STEM and data scientists.
• What type of Meetup presentations do we want?
– Content that supports the Framework, Mission Statement, and White Paper. Not every presentation does, because we leave that to each presenter. All we ask is that they at least answer three fundamental questions in their presentation:
  • How was the data collected?
  • Where is the data stored?
  • What are the data results?
– So the presentations are not marketing, vendor, or organization promotions.

9

Kaufman and Cohen: A Data Science Big Mechanism for DARPA

http://semanticommunity.info/Data_Science/A_Data_Science_Big_Mechanism_for_DARPA

My Note: Invited to June 2nd Meetup on Reading & Reasoning with Semantic Insights for the DARPA Big Mechanism



10

Farnam Jahanian: NSF Big Data Publications

http://semanticommunity.info/Data_Science/NSF_Big_Data_Publications#Story

Answer: This is how the data was collected.



11

STM Innovations Seminar U.S. 2014

• International Association of Scientific, Technical & Medical Publishers: The Voice of Academic and Professional Publishing
– STM is at the leading edge of the latest technology trends within publishing. This annual US event brings together the industry's most established thinkers and bright up-and-coming future stars to give attendees an insight into the hottest innovations and vital technological trends and developments which will define STM publishing for years to come.
• Annual US Event: Bright Research, Smart Articles and the new Author Ego-System
– Opening Keynotes: Analytics and Metrics
  • David Smith (Baseball) and Kevin Boyack (Mapping & Analytics of Science Publishing)
– Plenary: The Smart Article
  • Increasingly the research article becomes computable, adding research data, algorithms and smart searching. How intelligent will the article become? Can it find you so you no longer need to search for it? Can it test assertions? Generate new hypotheses? Can articles generate new articles without human interference? Will human analysis be eliminated and, if so, up to what point? Where are the new opportunities for publishers? Come and listen to two experts in data mining and actionable articles, both well known from FORCE11 (Larry Hunter and Anita de Waard).

http://www.stm-assoc.org/events/stm-innovations-seminar-u-s-2014/



12

Mined STM 2014 Tweets

• Tech trend 1: the machine is the new reader. Highlights from the Future Lab team
• Tech trend 2: the return to the author
• Tech trend 3: new players changing the game. see http://ow.ly/3jPdvY
• Kevin Boyack of SciTech shares data that shows books are 2 to 4x more cited than journal articles in sciences
• L Hunter: "With enough data you don't need semantic search. You can just use statistics."
• L Hunter: Knowledge Representation (publishers) look at Alzforum collaborative knowledge sharing
• A baseball metrics talk to open. With perfect timing, the latest submission to the @writelatex gallery is an article on baseball!: https://www.writelatex.com/articles/professional-baseball-pitchers-performance-and-its-effect-on-salary/
• Anita de Waard: "Looking for Data: Finding New Science": http://t.co/eok3ma37vO

http://semanticommunity.info/Data_Science/NSF_Big_Data_Publications#Story





13

Analytics and Metrics: Baseball Salaries

Answer: This is where the data is stored

http://semanticommunity.info/@api/deki/files/29262/BaseballSalaries.xlsx?origin=mt-web

My Note: All data sets integrated into one spreadsheet.



14

Data Science for Baseball Salaries: Spotfire Data Publication

Answer: This is where the data is stored and the results.

Web Player

https://spotfire.cloud.tibco.com/public/ViewAnalysis.aspx?file=/users/bniemann/Public/BaseballSalaries-Spotfire&waid=b47050f23cacec31b02e1-15004527bfaec0

15

Philip Bourne: Changing the Data Culture at NIH

http://semanticommunity.info/Data_Science/Data_Culture_at_the_NIH#Story

Answer: This is how the data was collected.



16

Earlier Interactive Visualization of H1N1 Data in Spotfire

Web Player


https://spotfire.cloud.tibco.com/public/ViewAnalysis.aspx?file=/users/bniemann/Public/H1N1Spread-Spotfire&waid=b473d1f56d4d1fb1118a5-14013627bfda31

17

NIH Data Publication 1: Spotfire

Web Player


https://spotfire.cloud.tibco.com/public/ViewAnalysis.aspx?file=/users/bniemann/Public/NIHDataPublication1&waid=f57a8191749c81d9b14a7-15004527bfaec0

18

Data Science for Health Datapalooza V

• Started by Todd Park, US CTO, in 2010.
• I have participated in all of them as a Government Data Scientist (2010), Private Data Scientist (Contest in 2011: Medicare Zombie Hunter), and Data Journalist (2012-2014).
• Like Sessions (4; one is Semantic Medline), Activities (Demos and Code-a-Palooza), and Data Lab (Damon Davis)
• Used Centers for Medicare & Medicaid Services (CMS) Claims Data without Coding!
– The 1.7 GB uncompressed file, with 27 columns and more than 9 million records, was easily downloaded, uncompressed and imported into Spotfire, resulting in a 381 MB file!
– The only problem was that the Web Player display timed out for the two scatterplots of the data relationships.
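The slide's point is that this was done in Spotfire without coding. For readers who do want to check the column and record counts of a file that size before importing it, a streaming pass is enough. The sketch below is an illustration only: the tab delimiter, the function name, and the three sample column names are assumptions, not details taken from the CMS file.

```python
import csv
import io


def profile_delimited_file(lines, delimiter="\t"):
    """Count columns and data records while reading one row at a time.

    `lines` is any iterable of text lines (an open file works), so a
    multi-gigabyte file is never held in memory all at once.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)                    # first row holds the column names
    record_count = sum(1 for _ in reader)    # remaining rows are data records
    return {"columns": len(header), "records": record_count}


# Tiny in-memory stand-in for the real download (hypothetical column names).
sample = io.StringIO(
    "npi\tprovider_type\tline_srvc_cnt\n"
    "1\tCardiology\t12\n"
    "2\tDermatology\t7\n"
)
profile = profile_delimited_file(sample)
```

With a real file, the same function can be called as `profile_delimited_file(open(path, newline=""))`, where `path` points at the uncompressed download.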

19

Data Science for Health Datapalooza V: MindTouch Knowledge Base

http://semanticommunity.info/Data_Science/Data_Science_for_Health_Datapalooza

Answer: Data was collected by Methodology.



20

Data Science for Health Datapalooza V: Spotfire Data Storage and Results

Web Player

Answer: Data Stored All In-Memory

https://spotfire.cloud.tibco.com/public/ViewAnalysis.aspx?file=/users/bniemann/Public/Medicare-Physician-and-Other-Supplier-PUF-CY2012-Spotfire&waid=adb5a63a17657fc180d47-15004527bfaec0

21

Data Science for Health Datapalooza V: Data Storage and Results

http://semanticommunity.info/Data_Science/Data_Science_for_Health_Datapalooza#Story

Answer: Data Results are in the Story.

Answer: Data Dictionary is in Spreadsheet.



22

CODATA International Workshop on Big Data for International Scientific Programmes

• Summary of Data Publications in Data Browsers Products:
– Presentation and Tutorial: Big Earth Sciences Data - From Descriptive to Prescriptive Analytics
  • Meteorite Data Set
– Data Science Journal
  • 509 publications by 9 attributes Data Set
– International Journal of Digital Earth
  • 350 publications by 10 attributes Data Set
– Workshops on Extremely Large Databases
  • Collaboration invited by Michael Stonebraker
• Some Highlights in Tutorial for June 2nd Meetup

http://semanticommunity.info/Data_Science/Big_Data_Science_for_CODATA#Story



23

Data Science for Climate Change: MindTouch Data Publication

http://semanticommunity.info/Data_Science/Data_Science_for_Climate_Change#Story

Answer: How was the data collected.



24

Data Science for Climate Change: Excel Data Publication

http://semanticommunity.info/@api/deki/files/29340/ClimateChangeImpacts.xlsx

Answer: Where the data is stored.



25

Data Science for Climate Change: Spotfire Data Publication

Web Player (in progress)

26

Agenda

• 6:30 p.m. Brand Niemann, Introduction and Continue Data Science Tutorials (Refreshments)
• 7:00 p.m. Introductions and Announcements (10 seconds per individual depending on the size of the group)
• 7:10 p.m. Big Data: Forward - Backward, Charles Randall Howard, Adjunct Professor in the Applied IT Department and Sr. Data Scientist at Novetta Solutions
• 7:45 p.m. Stories that Persuade, Anita de Waard, VP Research Data Collaborations at Elsevier Research Data Services/University of Utrecht. Also see Looking for Data: Finding New Science and Ten Habits of Highly Effective Data
• 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work)
• 9:00 p.m. Continue Your Conversations Elsewhere (We need to clear out of the space)

http://www.crhphdconsulting.net/

http://researchdata.elsevier.com/anita




http://www.slideshare.net/anitawaard/ten-habits-of-highly-effective-data

27

Next Meetups

• June 2nd: In Planning: Ontology Summit 2014 Postmortem and Reading & Reasoning with Semantic Insights for the DARPA Big Mechanism
– 6:30 p.m. Welcome and Introduction Slides
– 6:35 p.m. Continue Data Science Tutorial: Practical Data Science for Data Scientists: Data Science Students and Careers and Sarah Soliman, Rand, and IV MOOC Student Project (invited)
– 7:00 p.m. Brief Member Introductions
– 7:10 p.m. Ontology Summit 2014 Postmortem: Big Data with Semantic Web and Applied Ontology, Brand Niemann. See Ontology for Big Data
– 7:30 p.m. Two SIRA-based products: Research Assistant™ and Research Librarian™, Chuck Rehberg, Semantic Insights and Kate Goodier, Xcelerate Solutions (limited beta test in process). See A Data Science Big Mechanism for DARPA
– 8:30 p.m. Open Discussion
– 8:45 p.m. Networking
– 9:00 p.m. Depart
• June 30th: MIT Big Data Initiative: bigdata@CSAIL and the new Intel Science and Technology Center for Big Data, Sam Madden; and Why the current "elephants" are good at nothing, Data Tamer, and data integration issues, Michael Stonebraker
• July and August: Once a month, to be announced
– Silver Line Spring Hill Metro Station Opens in July?

http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists

http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists#3.2F18_Data_Science_Students_and_Careers

http://semanticommunity.info/Data_Science/Data_Science_for_VIVO

http://semanticommunity.info/Data_Science/Ontology_for_Big_Data#Story

http://www.semanticinsights.com/

http://www.xceleratesolutions.com/

http://semanticommunity.info/Data_Science/A_Data_Science_Big_Mechanism_for_DARPA#Story

http://bigdata.csail.mit.edu/

http://istc-bigdata.org/

http://en.wikipedia.org/wiki/Samuel_Madden_(computer_scientist)

http://www.tamr.com/

http://en.wikipedia.org/wiki/Michael_Stonebraker

28

May 20th Meetup: Continue Data Science Tutorial

• Practical Data Science for Data Scientists:
– Reading Assignments:
  • Chapter 11: Causality
    – This chapter will explore the topic of causality, and we have two experts in this area as guest contributors, Ori Stitelman and David Madigan. In these cases your mentality or goal is not to optimize for predictive accuracy, but rather to be able to isolate causes.
  • Chapter 12: Epidemiology
    – The contributor for this chapter is David Madigan, professor and chair of statistics at Columbia. Madigan has over 100 publications in such areas as Bayesian statistics, text mining, Monte Carlo methods, pharmacovigilance, and probabilistic graphical models.
– Resources: See 2/25 Specific Data Science Tools and Applications 3
• Team Homework Exercise:
– See my work with the KDD Cup data sets where I have updated this to include 2011-2013.
– See my Research Notes for Project TYCHO Data for Health.
– Form Teams (Same or New), Ask Me Questions, and Prepare to Present One of These Next Week.

http://semanticommunity.info/Data_Science/KDD_Cup

http://semanticommunity.info/Modus_Operandi#Research_Notes

http://www.tycho.pitt.edu/

29

Practical Data Science for Data Scientists


Class 6

Providing On-Line Class With Private Tutoring


30

KDD Cups Data Set Inventory and Metadata

Year | Title | Data Set | Comment
1997 | Direct marketing for lift curve optimization | Yes | Finish Data Dictionary
1998 | Direct marketing for profit optimization | Yes | Same as 1997
1999 | Computer network intrusion detection | Yes |
2000 | Online retailer website clickstream analysis | Yes | Cannot Read
2001 | Molecular bioactivity; plus protein locale prediction | Yes | DATA
2002 | BioMed document; plus gene role classification | Yes | MedLine
2003 | Network mining and usage log analysis | Yes | TAR GZ
2004 | Particle physics; plus protein homology prediction | Yes | TAR GZ
2005 | Internet user search query categorization | No | Not Found
2006 | Pulmonary embolisms detection from image data | Yes | TAR GZ
2007 | Consumer recommendations | No | No Longer Available
2008 | Breast cancer | Yes |
2009 | Customer relationship prediction | Test | DAT
2010 | Student performance evaluation | Yes |
2011 | The Yahoo! Music Dataset | No | Use Current?
2012 | Predict the click-through rate of ads given the query and user information | Yes |
2012 | Predict which users (or information sources) one user might follow in Tencent Weibo | Yes |
2013 | Determine whether an author has written a given paper | Yes |

This KDD Cup task challenges participants to determine which papers in an author profile were truly written by a given author.

http://semanticommunity.info/@api/deki/files/27392/DoingDataScience.xlsx
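For the team homework, the inventory above can be tallied to see which years still need a data set. The sketch below copies only the Year and Data Set columns from the slide. It assumes the 2009 entry reads as Data Set "Test" with comment "DAT", which is one reading of the flattened table, and the variable names are hypothetical.

```python
from collections import Counter

# (year, data set status) pairs copied from the inventory slide.
# 2012 appears twice because that year's KDD Cup had two tasks.
# Assumption: the 2009 row is read as status "Test".
KDD_INVENTORY = [
    (1997, "Yes"), (1998, "Yes"), (1999, "Yes"), (2000, "Yes"),
    (2001, "Yes"), (2002, "Yes"), (2003, "Yes"), (2004, "Yes"),
    (2005, "No"), (2006, "Yes"), (2007, "No"), (2008, "Yes"),
    (2009, "Test"), (2010, "Yes"), (2011, "No"), (2012, "Yes"),
    (2012, "Yes"), (2013, "Yes"),
]

# How many tasks fall under each status.
availability = Counter(status for _, status in KDD_INVENTORY)

# Years whose data set is marked as not obtained.
missing_years = [year for year, status in KDD_INVENTORY if status == "No"]

print(dict(availability))
print(missing_years)
```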


