Data Science for Business: Semantic Verses
Dr. Brand Niemann, Director and Senior Data Scientist, Semantic Community
http://semanticommunity.info
http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup
February 14, 2014




Data Science for Business
• Book Review Summary:
– If you are a data scientist, take this as our challenge: think deeply about exactly why your work is relevant to helping the business and be able to present it as such.
– Remember: “If you can’t explain it simply, you don’t understand it well enough.” (Albert Einstein)
• Semantic Verses Magnet:
– “Magnet is the only engine that treats topics as semantic objects, which gives it a competitive edge since the identification of “key topics” is generally considered to be the main feature of any semantic engine.”
– “Semantic is used here to refer to understanding what a piece of text is about. We do not claim we are doing NLP/NLU for question/answering purposes.”
• Source: Walid S. Saba, PhD, AI/NLP Scientist, February 2014.


Magnet Text Analysis Engine: Understands What the Text is About

http://semanticverses.com/Default.aspx


Data Science for Business Knowledge Base

http://semanticommunity.info/Data_Science/Data_Science_for_Business

My Note: A Knowledge Base* with:
• Data Story
• Slides
• Data Sets
• Spotfire Dashboard
• Book Web Pages
*Structured Mashup with everything treated as an object with a well-defined URL for the Glossary (taxonomy) and Table of Contents (thesaurus), integrated together in an Information Model!


MindTouch
• MindTouch:
– Treats topics as semantic objects (they can be searched for links to content).
– MindTouch headings identify “key topics” (see the Table of Contents for the book in this page).
– Allows one to construct a natural language front-end for enterprise data (and big data) integration across multiple sources (Google Chrome and Spotfire can Find words and data in their mashup Knowledge Bases).
– Can be combined with Be Informed, YARCData, and big data analytics (Spotfire) and could pilot including Semantic Verses.
– An example of expert subject matter that serves to provide a metamodel of topics as an interface to the integration of content (text and data) that can be both personalized by the user and integrated with similar metamodels.
• Semantic Community:
– Doing Natural Language Processing (NLP)/Natural Language Understanding (NLU) by hand in MindTouch, and I see why it is so difficult to automate for massive information on the Internet without Subject Matter Expertise and Structure.


Specific Example: TFIDF - Term Frequency (TF) and Inverse Document Frequency (IDF)

• Using Google Find for TFIDF (12 hits), where the first is “Combining Them: TFIDF”, which says: See “Example: Attribute Selection with Information Gain” on page 56.

• Which says: For a dataset with instances described by attributes and a target variable, we can determine which attribute is the most informative with respect to estimating the value of the target variable. We also can rank a set of attributes by their informativeness, in particular by their information gain. This can be used simply to understand the data better. It can be used to help predict the target. Or it can be used to reduce the size of the data to be analyzed, by selecting a subset of attributes in cases where we can not or do not want to process the entire dataset.

• See this UC Irvine Machine Learning Repository page for the data set used to illustrate information gain.
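The attribute ranking described above can be sketched in a few lines of Python. This is a minimal illustration, assuming information gain is computed as the entropy of the target minus the weighted entropy after splitting on an attribute. The rows and attribute names below are hypothetical toy data, not the UC Irvine data set.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children


# Hypothetical toy rows for illustration only.
rows = [
    {"odor": "none", "cap": "flat", "edible": "yes"},
    {"odor": "none", "cap": "bell", "edible": "yes"},
    {"odor": "foul", "cap": "flat", "edible": "no"},
    {"odor": "foul", "cap": "bell", "edible": "no"},
]

# Rank attributes from most to least informative about the target.
ranking = sorted(
    ["odor", "cap"],
    key=lambda a: information_gain(rows, a, "edible"),
    reverse=True,
)
print(ranking)
```

In this toy case the first attribute separates the target perfectly and the second tells nothing about it, so keeping only the top-ranked attribute would reduce the data without losing predictive information.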


Using Google Find for TFIDF 1
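For readers who want to see what the searched-for term computes, here is a minimal TFIDF sketch in Python. It assumes one common variant: term frequency normalized by document length, and IDF(t) = 1 + log(number of documents / number of documents containing t). Other weightings exist; the `tfidf` function name and the sample documents are illustrative assumptions.

```python
from collections import Counter
from math import log


def tfidf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * (1 + log(n_docs / doc_freq[term]))
            for term, count in counts.items()
        })
    return scores


# Hypothetical sample documents for illustration only.
docs = [
    "data science for business",
    "data mining process",
    "semantic text analysis",
]
scores = tfidf(docs)
print(sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True))
```

A term that appears in fewer documents gets a higher score than an equally frequent term that appears in many, which is the behavior the TF and IDF parts are combined to produce.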


Using Google Find for TFIDF 10


The Data Mining Process 1

• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment


The Data Mining Process 2
• Business Understanding:
– Use real Subject Matter Expertise content instead of general Web content.
• Data Understanding:
– Make all content data so unstructured, semi-structured, and structured information are integrated data.
• Data Preparation:
– Create an index of content topics and objects that is both a relational and graph database.
• Modeling:
– A searchable Information Model with Analytics (Ontology) linked to the Thesaurus (Taxonomy) linked to the Glossary (Vocabulary).
• Evaluation:
– Finding more needles in the needle haystack and discovering things of interest that you did not know how to look for.
• Deployment:
– Publicly available on the Web using the Google Chrome Browser.


Data Preparation

• Topics
• Knowledge Base URL
• Function
• Within Topic URLs
• Figure and Tables URLs
• Within Footnote URL

Relational and Graph (Subject, Object, & Predicate) Databases
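One way to picture an index that is both relational and graph-shaped is a pair of tables: one row per topic with its URL, and one row per subject, predicate, object triple. The sketch below uses Python's built-in `sqlite3` module. The table names, column names, predicates, and `example.org` URLs are all hypothetical; this is not the schema of the actual Knowledge Base.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topics (topic TEXT PRIMARY KEY, url TEXT, function TEXT)")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")

# Hypothetical topics and links for illustration only.
conn.executemany(
    "INSERT INTO topics VALUES (?, ?, ?)",
    [
        ("TFIDF", "http://example.org/kb#TFIDF", "glossary term"),
        ("Information Gain", "http://example.org/kb#Information_Gain", "glossary term"),
    ],
)
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("TFIDF", "seeAlso", "Information Gain"),
        ("Information Gain", "illustratedBy", "Figure A"),
    ],
)


def neighbors(conn, subject):
    """Graph view: outgoing (predicate, object) edges of a subject."""
    cursor = conn.execute(
        "SELECT predicate, object FROM triples WHERE subject = ? ORDER BY predicate",
        (subject,),
    )
    return cursor.fetchall()


def reachable(conn, start):
    """Follow edges transitively and return every object reachable from start."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for _, obj in neighbors(conn, node):
            if obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return seen


# Relational view: look up a topic's URL by name.
url = conn.execute("SELECT url FROM topics WHERE topic = ?", ("TFIDF",)).fetchone()[0]
print(url)
print(reachable(conn, "TFIDF"))
```

The same rows answer both kinds of question: a lookup by key (relational) and a walk along links between topics, figures, and footnotes (graph).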


Modeling

A searchable Information Model with Analytics (Ontology) linked to the Thesaurus (Taxonomy) linked to the Glossary (Vocabulary)


Evaluation
• Find:
– The find tool is a fast way to find contents in your data, navigate in the analysis, and to perform actions found in the menus of Spotfire. It consists of a text field where you enter a search string and a list of results for the search.
– To reach the Find dialog: Press Ctrl+F, or select Tools > Find....
• Searching in TIBCO Spotfire:
– There are many places in TIBCO Spotfire where you can search for different items. For example, you can search for filters, analyses in the library or elements used to build information links in the Information Designer. All of the available search fields use the same basic search syntax, which is presented below. For more information regarding search of a specific item, see the links at the bottom of this page.
– Tip: If you cannot find what you are looking for, try adding more wildcards. For example, to locate a filter called "Sales ($)" , enter the search expression "Sales ($*", to avoid interpreting the text within the parenthesis as a Boolean expression.
http://semanticommunity.info/Data_Science/TIBCO_Spotfire_6_for_Data_Science#Find
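The effect of a trailing wildcard can be illustrated with Python's standard `fnmatch` module. This is only a generic sketch of shell-style wildcard matching, not Spotfire's search syntax or implementation. The `find` helper and the filter names other than "Sales ($)" are hypothetical.

```python
from fnmatch import fnmatchcase


def find(names, expression):
    """Case-insensitive match; a bare word is treated as a substring search."""
    pattern = expression.lower()
    if "*" not in pattern:
        pattern = f"*{pattern}*"
    return [name for name in names if fnmatchcase(name.lower(), pattern)]


# Hypothetical filter names for illustration only.
filters = ["Sales ($)", "Sales (units)", "Region"]

print(find(filters, "Sales ($*"))  # wildcard after the literal "($"
print(find(filters, "sales"))      # bare word matches anywhere in the name
```

Here the parenthesis and dollar sign are matched literally and only `*` acts as a wildcard, so the unbalanced expression still finds the intended filter.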