Federal Big Data Working Group Meetup. Dr. Brand Niemann, Director and Senior Data Scientist, Semantic Community. http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup May 20, 2014.
1
Federal Big Data Working Group Meetup
Dr. Brand Niemann
Director and Senior Data Scientist
Semantic Community
http://semanticommunity.info/
http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup
May 20, 2014
2
Mission Statement
• Federal: Supports the Federal Big Data Initiative, but not endorsed by the Federal Government or its Agencies;
• Big Data: Supports the Federal Digital Government Strategy which is "treating all content as data", so big data = all your content;
• Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and
• Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize like MOOCs (Massive Open Online Courses) being considered by the White House.
Co-organizers: Brand Niemann and Kate Goodier
3
May 6th Meetup: EPA/NASA Climate-Environmental Data Analytics & A Redesigned, Open Data.gov
• How was the Meetup?
  – Thanks for continually providing a forum facilitating discussion and bringing in speakers with diverse experience. On my drive home NPR was fittingly enough talking about big data.
  – Just lots of good info on big data; I also am a big fan of data.gov, so it's exciting that so much is happening with government open data. Perhaps we'll see even more APIs?
  – Jeanne Holm: You can find more of the APIs at https://www.data.gov/developers/apis and http://catalog.data.gov/dataset?res_format=api There are about 450 between the two.
  – Amazing growth in membership: Our 200th member!
    • Welcome: Inge, Consultant working in the federal/health space.
http://www.meetup.com/Federal-Big-Data-Working-Group/events/174975182/
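Sketch (not from the slides): one way the API-format datasets Jeanne Holm points to might be counted programmatically. It assumes catalog.data.gov exposes the standard CKAN Action API (`package_search`) and that the `res_format` facet in the catalog URL above can be passed as a filter query. The endpoint path, the filter value and its casing, and both helper names are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumption: the catalog runs CKAN and serves its Action API at this path.
CKAN_SEARCH = "https://catalog.data.gov/api/3/action/package_search"


def build_api_dataset_query(rows=10, start=0):
    """Build a package_search URL filtered to datasets with API-format resources."""
    params = {
        "fq": "res_format:api",  # mirrors ?res_format=api on the catalog page (assumed)
        "rows": rows,            # page size
        "start": start,          # offset of the first result
    }
    return f"{CKAN_SEARCH}?{urlencode(params)}"


def count_from_response(payload):
    """Return the total match count from a decoded package_search response."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return payload["result"]["count"]
```

The URL can be fetched with any HTTP client and the decoded JSON passed to `count_from_response`; the network call is left out here so the sketch stays self-contained.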
4
EPA & NASA Climate/Environmental Data Analytics, Dr. Joan Aron, Global Environmental/Climate Change Scientist
• Data Analytics Needs Scenario Water Quality:
  – End User of Big Data:
• Perspective of Risk Analysis:
  – CODATA Integrated Research on Disaster Risk
• Continuity of Data:
  – US EPA Air Data
• Linkages of Data:
  – Conservation International
• Linkages of Climate and Water Quality:
  – US Interagency Chesapeake Bay Program
• Answer Three Questions (with sample analytics by Brand Niemann):
  – How was the data collected?
  – Where is the data stored?
  – What are the data results?
http://semanticommunity.info/@api/deki/files/29022/JoanAron05062014.pptx
5
Federating Big Data for Big Innovation and A Redesigned, Open Source Data.gov, Dr. Jeanne Holm, Data.gov Evangelist
• Background:
  – Usability Tests Put Brakes on Data.gov Redesign
  – LinkedIn Discussion
• Main Points:
  – Releasing and using open data is about empowering people to make better decisions
  – Open data is an ecosystem
  – Building a federated catalog of national data
  – Keeping the conversation fresh: Multiple rounds of usability testing found that redesign was needed and now doing monthly builds
  – A Global Movement has begun to provide transparency and democratization of data
• My Note:
  – See my Tutorial Slides 12-19
http://semanticommunity.info/@api/deki/files/29263/JeanneHolm05062014.pptx
6
Activities
• White Paper for DARPA, NASA, NIH, NIST and NITRD: "Making Big Data Small" using Data Science and Semantics:
  – See Framework and Questions and Answers
  – Dan Kaufman, DARPA Director of Innovation, and Paul Cohen, DARPA Big Mechanism Project Director
  – Drs. Farnam Jahanian (NSF Big Data Publications), Phil Bourne (Data Culture at NIH), and John Holdren (Climate Change Impacts)
• Health Datapalooza V, June 1-3:
  • See next slides
• CODATA International Society for Digital Earth (ISDE) Workshop on Big Data for International Scientific Programmes: Challenges and Opportunities, June 8-9:
  • See next slides
• Big Data for Government, June 16-17:
  • Keynote from Dr. George Strawn and Presentation by Dr. Tom Rindflesch and Semantic Medline/YarcData Team
• Earth Cube All-Hands Meeting, June 24-26:
  • Report at July Meetup
7
Framework for White Paper
• Organize a Community of Data Scientists and Related Fields to focus on treating all of your content as "Big Data"
  – Example: Federal Big Data Working Group Meetup
• Follow the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000) consisting of Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment
  – Example: Semantic Community Data Science Knowledge Base (Big Data Science for CODATA)
• Mine prominent scientific journals for data policy, databases, and data results that can be reused.
  – Example: CODATA Data Science Journal (509 publications by 9 attributes)
• Provide data stories and presentation materials for public education and conferences
  – Example: CODATA International Workshop on Big Data for International Scientific Programmes, June 8-9, in Beijing
• Obtain NSF funding for sustained data science for data publications work over a period of years
  – Example: Critical Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA)
• Provide a Data Fairport with "Data Publication in Data Browsers"
  – Example: Semantic Community Spotfire Cloud Library
8
Framework Questions & Answers
• Is this the Barend Mons Nanopub approach to the data publication of cardinal assertions? No, please see the examples in these slides.
• What are the goals of the White Paper and NSF Grant Proposal?
  – White Paper documents the Framework for general public relations and marketing purposes. The NSF Grant Proposal is to obtain long-term funding to sustain this Framework and Mission Statement activity.
    • In essence we know that NSF wants a community that follows standards to produce data science publications that reside in a knowledge base repository and workforce training that supports STEM and data scientists.
• What type of Meetup presentations do we want?
  – Content that supports the Framework, Mission Statement, and White Paper. But not every presentation does because we leave that to each presenter. All we ask is that they at least answer three fundamental questions in their presentation:
    • How was the data collected?
    • Where is the data stored? and
    • What are the data results?
    • So the presentations are not marketing, vendor, or organization promotions.
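Sketch (not from the slides): the three fundamental questions can be treated as a small record that each presentation fills in, which makes it easy to see which question a talk has not yet answered. The class and field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class PresentationChecklist:
    """Hypothetical record of the three questions asked of each presenter."""
    how_collected: str = ""   # How was the data collected?
    where_stored: str = ""    # Where is the data stored?
    what_results: str = ""    # What are the data results?

    def unanswered(self):
        """Return the field names of questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```

For example, a talk that names its source and its spreadsheet but shows no results yet would report `["what_results"]` from `unanswered()`.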
9
Kaufman and Cohen: A Data Science Big Mechanism for DARPA
http://semanticommunity.info/Data_Science/A_Data_Science_Big_Mechanism_for_DARPA
My Note: Invited to June 2nd Meetup on Reading & Reasoning with Semantic Insights for the DARPA Big Mechanism
10
Farnam Jahanian: NSF Big Data Publications
http://semanticommunity.info/Data_Science/NSF_Big_Data_Publications#Story
Answer: This is how the data was collected.
11
STM Innovations Seminar U.S. 2014
• International Association of Scientific, Technical & Medical Publishers: The Voice of Academic and Professional Publishing
  – STM is at the leading edge of the latest technology trends within publishing. This annual US event brings together the industry's most established thinkers and bright up-and-coming future stars to give attendees an insight into the hottest innovations and vital technological trends and developments which will define STM publishing for years to come.
• Annual US Event: Bright Research, Smart Articles and the new Author Ego-System
  – Opening Keynotes: Analytics and Metrics
    • David Smith (Baseball) and Kevin Boyack (Mapping & Analytics of Science Publishing)
  – Plenary: The Smart Article
    • Increasingly the research article becomes computable, adding research data, algorithms and smart searching. How intelligent will the article become? Can it find you so you no longer need to search for it? Can it test assertions? Generate new hypotheses? Can articles generate new articles without human interference? Will human analysis be eliminated and, if so, up to what point? Where are the new opportunities for publishers? Come and listen to two experts in data mining and actionable articles, both well known from FORCE11. (Larry Hunter and Anita de Waard)
http://www.stm-assoc.org/events/stm-innovations-seminar-u-s-2014/
12
Mined STM 2014 Tweets
• Tech trend 1: the machine is the new reader. Highlights from the Future Lab team
• Tech trend 2: the return to the author
• Tech trend 3: new players changing the game. see http://ow.ly/3jPdvY
• Kevin Boyack of SciTech shares data that shows books are 2 to 4x more cited than journal articles in sciences
• L Hunter: "With enough data you don't need semantic search. You can just use statistics."
• L Hunter: Knowledge Representation (publishers) look at Alzforum collaborative knowledge sharing
• A baseball metrics talk to open. With perfect timing, the latest submission to the @writelatex gallery is an article on baseball!: https://www.writelatex.com/articles/professional-baseball-pitchers-performance-and-its-effect-on-salary/
• Anita de Waard: "Looking for Data: Finding New Science": http://t.co/eok3ma37vO
http://semanticommunity.info/Data_Science/NSF_Big_Data_Publications#Story
13
Analytics and Metrics: Baseball Salaries
Answer: This is where the data is stored
http://semanticommunity.info/@api/deki/files/29262/BaseballSalaries.xlsx?origin=mt-web
My Note: All data sets integrated into one spreadsheet.
14
Data Science for Baseball Salaries: Spotfire Data Publication
Answer: This is where the data is stored and the results.
Web Player
15
Philip Bourne: Changing the Data Culture at NIH
http://semanticommunity.info/Data_Science/Data_Culture_at_the_NIH#Story
Answer: This is how the data was collected.
16
Earlier Interactive Visualization of H1N1 Data in Spotfire
Web Player
Answer: This is where the data is stored and the results.
17
NIH Data Publication 1: Spotfire
Web Player
Answer: This is where the data is stored and the results.
18
Data Science for Health Datapalooza V
• Started by Todd Park, US CTO, in 2010.
• I have participated in all of them as a Government Data Scientist (2010) and Private Data Scientist (Contest in 2011: Medicare Zombie Hunter) and Data Journalist (2012-2014).
• Like Sessions (4; one is Semantic Medline), Activities (Demos and Code-a-Palooza), and Data Lab (Damon Davis)
• Used Centers for Medicare & Medicaid Services (CMS) Claims Data without Coding!
  – The 1.7 GB uncompressed file, with 27 columns and more than 9 million records, was easily downloaded, uncompressed and imported into Spotfire, resulting in a 381 MB file!
  – The only problem was that the Web Player display timed out for the two scatterplots of the data relationships.
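Sketch (not from the slides): the CMS file was loaded into Spotfire without coding, but a reader who wants to confirm the column and record counts of a file this size before importing it could stream it with Python's standard `csv` module. The function name is hypothetical, and the sketch assumes a delimited text file with a single header row.

```python
import csv


def profile_csv(stream, delimiter=","):
    """Count columns and data rows by streaming, without loading the file into memory."""
    reader = csv.reader(stream, delimiter=delimiter)
    header = next(reader, None)      # first row is assumed to be the header
    if header is None:               # empty input
        return {"columns": 0, "rows": 0}
    rows = sum(1 for _ in reader)    # consume the remaining rows one at a time
    return {"columns": len(header), "rows": rows}
```

Typical use would be `with open(path, newline="") as f: profile_csv(f)`, where `path` is wherever the uncompressed claims file was saved. Streaming keeps memory use flat regardless of how many records the file holds.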
19
Data Science for Health Datapalooza V: MindTouch Knowledge Base
http://semanticommunity.info/Data_Science/Data_Science_for_Health_Datapalooza
Answer: Data was collected by Methodology.
20
Data Science for Health Datapalooza V: Spotfire Data Storage and Results
Web Player
Answer: Data Stored All In-Memory
21
Data Science for Health Datapalooza V: Data Storage and Results
http://semanticommunity.info/Data_Science/Data_Science_for_Health_Datapalooza#Story
Answer: Data Results are in the Story.
Answer: Data Dictionary is in Spreadsheet.
22
CODATA International Workshop on Big Data for International Scientific Programmes
• Summary of Data Publications in Data Browsers Products:
  – Presentation and Tutorial: Big Earth Sciences Data - From Descriptive to Prescriptive Analytics
    • Meteorite Data Set
  – Data Science Journal
    • 509 publications by 9 attributes Data Set
  – International Journal of Digital Earth
    • 350 publications by 10 attributes Data Set
  – Workshops on Extremely Large Databases
    • Collaboration invited by Michael Stonebraker
• Some Highlights in Tutorial for June 2nd Meetup
http://semanticommunity.info/Data_Science/Big_Data_Science_for_CODATA#Story
23
Data Science for Climate Change: MindTouch Data Publication
http://semanticommunity.info/Data_Science/Data_Science_for_Climate_Change#Story
Answer: How was the data collected.
24
Data Science for Climate Change: Excel Data Publication
http://semanticommunity.info/@api/deki/files/29340/ClimateChangeImpacts.xlsx
Answer: Where the data is stored.
25
Data Science for Climate Change: Spotfire Data Publication
Web Player (in progress)
26
Agenda
• 6:30 p.m. Brand Niemann, Introduction and Continue Data Science Tutorials (Refreshments)
• 7:00 p.m. Introductions and Announcements (10 seconds per individual depending on the size of the group)
• 7:10 p.m. Big Data: Forward - Backward, Charles Randall Howard, Adjunct Professor in the Applied IT Department and Sr. Data Scientist at Novetta Solutions
• 7:45 p.m. Stories that Persuade, Anita de Waard, VP Research Data Collaborations at Elsevier Research Data Services/University of Utrecht. Also see Looking for Data: Finding New Science and Ten Habits of Highly Effective Data
• 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work)
• 9:00 p.m. Continue Your Conversations Elsewhere (We need to clear out of the space)
27
Next Meetups
• June 2nd: In Planning: Ontology Summit 2014 Postmortem and Reading & Reasoning with Semantic Insights for the DARPA Big Mechanism
  – 6:30 p.m. Welcome and Introduction Slides
  – 6:35 p.m. Continue Data Science Tutorial: Practical Data Science for Data Scientists: Data Science Students and Careers and Sarah Soliman, Rand, and IV MOOC Student Project (invited)
  – 7:00 p.m. Brief Member Introductions
  – 7:10 p.m. Ontology Summit 2014 Postmortem: Big Data with Semantic Web and Applied Ontology, Brand Niemann. See Ontology for Big Data
  – 7:30 p.m. Two SIRA-based products: Research Assistant™ and Research Librarian™, Chuck Rehberg, Semantic Insights and Kate Goodier, Xcelerate Solutions (limited beta test in process). See A Data Science Big Mechanism for DARPA
  – 8:30 p.m. Open Discussion
  – 8:45 p.m. Networking
  – 9:00 p.m. Depart
• June 30th: MIT Big Data Initiative: bigdata@CSAIL and the new Intel Science and Technology Center for Big Data, Sam Madden and Why the current "elephants" are good at nothing, Data Tamer, and data integration issues, Michael Stonebraker
• July and August: Once a month to be announced
  – Silver Line Spring Hill Metro Station Opens in July?
28
May 20th Meetup: Continue Data Science Tutorial
• Practical Data Science for Data Scientists:
  – Reading Assignments:
    • Chapter 11: Causality
      – This chapter will explore the topic of causality, and we have two experts in this area as guest contributors, Ori Stitelman and David Madigan. In these cases your mentality or goal is not to optimize for predictive accuracy, but rather to be able to isolate causes.
    • Chapter 12: Epidemiology
      – The contributor for this chapter is David Madigan, professor and chair of statistics at Columbia. Madigan has over 100 publications in such areas as Bayesian statistics, text mining, Monte Carlo methods, pharmacovigilance, and probabilistic graphical models.
  – Resources: See 2/25 Specific Data Science Tools and Applications 3
• Team Homework Exercise:
  – See my work with the KDD Cup data sets where I have updated this to include 2011-2013.
  – See my Research Notes for Project TYCHO Data for Health.
  – Form Teams (Same or New), Ask Me Questions, and Prepare to Present One of These Next Week.
29
Practical Data Science for Data Scientists
http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists
Class 6
Providing On-Line Class With Private Tutoring
30
KDD Cups Data Set Inventory and Metadata
Year | Title | Data Set | Comment
1997 | Direct marketing for lift curve optimization | Yes | Finish Data Dictionary
1998 | Direct marketing for profit optimization | Yes | Same as 1997
1999 | Computer network intrusion detection | Yes |
2000 | Online retailer website clickstream analysis | Yes | Cannot Read
2001 | Molecular bioactivity; plus protein locale prediction | Yes | DATA
2002 | BioMed document; plus gene role classification | Yes | MedLine
2003 | Network mining and usage log analysis | Yes | TAR GZ
2004 | Particle physics; plus protein homology prediction | Yes | TAR GZ
2005 | Internet user search query categorization | No | Not Found
2006 | Pulmonary embolisms detection from image data | Yes | TAR GZ
2007 | Consumer recommendations | No | No Longer Available
2008 | Breast cancer | Yes |
2009 | Customer relationship prediction | Test | DAT
2010 | Student performance evaluation | Yes |
2011 | The Yahoo! Music Dataset | No | Use Current?
2012 | Predict the click-through rate of ads given the query and user information | Yes |
2012 | Predict which users (or information sources) one user might follow in Tencent Weibo | Yes |
2013 | Determine whether an author has written a given paper | Yes |
This KDD Cup task challenges participants to determine which papers in an author profile were truly written by a given author.
http://semanticommunity.info/@api/deki/files/27392/DoingDataScience.xlsx
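Sketch (not from the slides): a quick tally of the inventory above, showing how many KDD Cup data sets were obtained and which years are still missing. The status values are transcribed from the Data Set column. Reading the 2009 row as status "Test" with comment "DAT" is an assumption about how the flattened row splits, and the function name is hypothetical.

```python
from collections import Counter

# (year, Data Set status) pairs transcribed from the inventory slide.
# Assumption: the 2009 row reads as status "Test", comment "DAT".
KDD_CUP_STATUS = [
    (1997, "Yes"), (1998, "Yes"), (1999, "Yes"), (2000, "Yes"),
    (2001, "Yes"), (2002, "Yes"), (2003, "Yes"), (2004, "Yes"),
    (2005, "No"), (2006, "Yes"), (2007, "No"), (2008, "Yes"),
    (2009, "Test"), (2010, "Yes"), (2011, "No"),
    (2012, "Yes"), (2012, "Yes"), (2013, "Yes"),
]


def summarize(status_rows):
    """Return (counts per status, sorted years whose data set is marked 'No')."""
    counts = Counter(status for _, status in status_rows)
    missing_years = sorted({year for year, status in status_rows if status == "No"})
    return counts, missing_years
```

Calling `summarize(KDD_CUP_STATUS)` gives the per-status counts and the list of years to chase for the homework exercise.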