reasoning over big data

Reasoning Over Big Data Stores Eric Little, PhD VP Data Science Polytechnic School of Engineering - NYU [email protected]

Upload: osthus

Post on 14-Apr-2017


TRANSCRIPT

Page 1: Reasoning over big data

Reasoning Over Big Data Stores
Eric Little, PhD
VP Data Science
Polytechnic School of Engineering - NYU
[email protected]

Page 3: Reasoning over big data

Slide 3

Semantic Technologies – Smart Data Piece

Semantic technologies provide several important features for emerging new technologies:

• Controlled vocabularies
• Taxonomies
• Metadata structures
• Ontology models
• Logical inference

Data today continues to evolve and grow in both size and complexity. We need hybrid solutions that can provide real insights

Analytics is growing into a new kind of field – Data Science

Is data science about interacting with machines or humans?

Must be able to strike a balance between complexity of the data and simplicity of the presentation to the user

Page 4: Reasoning over big data

Slide 4

Metadata, Reference Data & Master Data

• While often lumped together, these are distinct kinds of data

• Semantic Technologies can help with the organization of these kinds of data – but should not be used in isolation

• Scalability is achieved using complementary approaches

[Figure: axes labeled “Increased conceptual complexity” and “Increased Scalability Issues”]

Page 5: Reasoning over big data

Slide 5

Where Semantics Can Fall Short

Graphs are good for information, but not so good for high-bandwidth applications where speed and scalability are the primary drivers.

Scaling graphs can require highly specialized hardware, software techniques, or engineers.

Semantics should be confined to the metadata aspects of the problem; use other tech for the rest.

Page 6: Reasoning over big data

Slide 6

The Big Data Problem

Big Data is a real challenge, but it is starting to become a buzzword

Many “Big Data Problems” can be reduced to smaller data problems

But some applications do require complex inferencing over very large data sets

A current client has lab readings from 40,000+ devices

How to do this effectively?

Page 7: Reasoning over big data

Slide 7

Why Not Just Build the Data Lake?

Data lakes are fine when you are gathering and storing the data

What happens later on when a lot of data is in there?

The benefits are that data can stay in its original form – no real ETL

But running analytics across disparate stores is very challenging

“Without metadata, every subsequent use of data means analysts start from scratch.” (Gartner 2014)
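To make the metadata point concrete, here is a minimal Python sketch of a metadata catalog sitting over a data lake. The dataset names, paths, and vocabulary terms are hypothetical, invented purely for illustration. The raw files stay where they are (no ETL), and only descriptions are registered so that an analyst can find data by concept rather than starting from scratch.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata describing one raw dataset; the data itself stays in the lake."""
    name: str
    location: str   # hypothetical path, for illustration only
    fmt: str        # storage format, e.g. "csv" or "json"
    concepts: set = field(default_factory=set)  # controlled-vocabulary terms


class Catalog:
    """A tiny in-memory registry of dataset descriptions."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def find_by_concept(self, concept):
        """Return the names of datasets tagged with the given vocabulary term."""
        return sorted(
            e.name for e in self._entries.values() if concept in e.concepts
        )


catalog = Catalog()
catalog.register(DatasetEntry("lab_readings", "lake/raw/lab/", "csv",
                              {"device", "measurement"}))
catalog.register(DatasetEntry("device_registry", "lake/raw/devices/", "json",
                              {"device"}))
catalog.register(DatasetEntry("sample_log", "lake/raw/samples/", "csv",
                              {"sample"}))

device_sets = catalog.find_by_concept("device")
print(device_sets)
```

A real deployment would likely keep these descriptions in an ontology or metadata store rather than a Python dictionary; the sketch only shows the division of labor between raw data and the metadata that describes it.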

Page 8: Reasoning over big data

Slide 8

Reasoning Over Big Data Is A Growing Topic

There has been an inordinate amount of time and energy spent on just queries.

This is not reasoning though – it is just retrieval

What is Reasoning?

• More than just automated query sets run in sequence or parallel
• Reasoning is about inferring new information that isn’t in the raw data.
• It is a heuristic – where one discovers or learns something new for themselves
• Deductive, Inductive, Abductive
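The difference between retrieval and reasoning can be sketched in a few lines of Python. The device IDs and class names below are hypothetical, and only the deductive case is shown. A query returns what was asserted, while a simple taxonomy walk derives class memberships that appear nowhere in the raw data.

```python
# Raw data: asserted facts only (hypothetical devices and instrument types).
instance_of = {"dev-001": "HPLC", "dev-002": "MassSpectrometer"}

# Taxonomy: each class points to its direct parent class.
subclass_of = {
    "HPLC": "ChromatographyInstrument",
    "ChromatographyInstrument": "LabInstrument",
    "MassSpectrometer": "LabInstrument",
}


def retrieve(cls):
    """Retrieval: return only devices explicitly asserted to be of class cls."""
    return sorted(d for d, c in instance_of.items() if c == cls)


def ancestors(cls):
    """Walk up the taxonomy and collect every superclass of cls."""
    found = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        found.append(cls)
    return found


def infer(cls):
    """Deduction: a device is a cls if its asserted class is cls or a subclass of cls."""
    return sorted(
        d for d, c in instance_of.items() if c == cls or cls in ancestors(c)
    )


asserted = retrieve("LabInstrument")   # nothing was asserted at this level
inferred = infer("LabInstrument")      # both devices follow from the taxonomy
print(asserted)
print(inferred)
```

No record says that `dev-001` is a `LabInstrument`; that fact exists only after the taxonomy is applied, which is the sense in which reasoning produces information that is not in the raw data.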

Page 10: Reasoning over big data

Slide 10

Reasoning Evolution

Page 11: Reasoning over big data

Slide 11

Types of Semantic Inference (Forward and Backward Chaining)

Backward chaining

Uses Modus Ponens

Starts from a true (T) consequent, the goal, and works back to affirm the related antecedent (verifies connection)

Forward chaining

Uses Modus Ponens

Finds a true (T) antecedent and affirms the related consequent (new knowledge)
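The two chaining styles can be illustrated with a minimal Python sketch over simple if-then rules. The rule and fact names are hypothetical, and this is a toy propositional engine rather than any particular reasoner's implementation.

```python
# Each rule is (set of antecedents, consequent). Names are hypothetical.
RULES = [
    ({"reading_out_of_range"}, "device_suspect"),
    ({"device_suspect", "calibration_overdue"}, "needs_service"),
]


def forward_chain(facts, rules):
    """Data-driven: fire rules whose antecedents all hold until nothing new appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if antecedents <= known and consequent not in known:
                known.add(consequent)
                changed = True
    return known


def backward_chain(goal, facts, rules, seen=frozenset()):
    """Goal-driven: find a rule that concludes the goal, then prove its antecedents."""
    if goal in facts:
        return True
    if goal in seen:  # guard against circular rules
        return False
    for antecedents, consequent in rules:
        if consequent == goal and all(
            backward_chain(a, facts, rules, seen | {goal}) for a in antecedents
        ):
            return True
    return False


facts = {"reading_out_of_range", "calibration_overdue"}
derived = forward_chain(facts, RULES) - facts        # new knowledge
provable = backward_chain("needs_service", facts, RULES)  # verified goal
print(sorted(derived), provable)
```

Forward chaining computes everything that follows from the data, which can be expensive on large stores. Backward chaining only explores the rules relevant to one question, which is why the choice between them matters for scale.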

Page 12: Reasoning over big data

Slide 12

Ontology Layering Is Important for Scale

[Figure: ontology layer stack]

Layers: Data Source Models; Multi- & Single-Source Data Integration Models; Domain Models (Objects, Attributes, Process & Relations); System-Level Models (Rules); Upper-Level Models

Groupings: Metadata Levels (Human Concepts); Data-centric Levels (Machine Language)

Side labels: Data Traceability (Provenance); User-Driven Ontologies

Metaphysics – not just data models

Data sources connected directly to higher classifications

Federation allows for improved scale

Page 13: Reasoning over big data

Slide 13

Combining Semantics and NoSQL

Get your semantics experts and your big data scientists on the same page

• Utilize tables where possible – avoid multi-node graph hops
• Use graphs for metadata – leave instance data in place when possible
• Large graphs should be avoided
• Lots of columns and rows are fine – joins across tables are not
• Break graph information into other formats wherever possible

Pre-compute phases are important

• Pre-compute multi-table joins based on SME input, known semantic patterns, business rules/logics, etc.
• Use statistical methods to cluster data (e.g., normalcy calcs)

Use the tech that is right for the job
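The pre-compute advice on this slide can be sketched in Python as follows. The tables and values are hypothetical. The join is done once, ahead of query time, into a single wide table, and a z-score check stands in for the "normalcy calcs" (that interpretation is an assumption; the slide does not say which statistic is meant).

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical source tables: device metadata and raw readings.
devices = {"d1": {"type": "pH_meter"}, "d2": {"type": "pH_meter"}}
readings = [("d1", 7.0), ("d2", 7.1), ("d1", 6.9),
            ("d2", 7.0), ("d1", 7.0), ("d2", 9.0)]


def precompute_wide_table(readings, devices):
    """Join once, ahead of query time, so later scans need no cross-table join."""
    return [
        {"device": dev, "type": devices[dev]["type"], "value": value}
        for dev, value in readings
    ]


def flag_abnormal(rows, threshold=2.0):
    """Flag rows more than `threshold` standard deviations from their type's mean."""
    by_type = defaultdict(list)
    for row in rows:
        by_type[row["type"]].append(row["value"])
    stats = {t: (mean(v), pstdev(v)) for t, v in by_type.items()}
    flagged = []
    for row in rows:
        mu, sigma = stats[row["type"]]
        if sigma > 0 and abs(row["value"] - mu) / sigma > threshold:
            flagged.append(row)
    return flagged


wide = precompute_wide_table(readings, devices)
flagged = flag_abnormal(wide)
print(flagged)
```

Here the device type, which would come from the semantic model, is copied into every row, so the statistical pass is a plain column scan. Which joins are worth materializing is exactly what the SME input and business rules would decide.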

Page 14: Reasoning over big data

Slide 14

One Example of Using RDF in Cloud-scalable Applications

Example of a current approach being used – there are others

Can scale across multiple cloud nodes (where triple stores have issues)

Triples are indexed items
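As one possible reading of "triples are indexed items", the Python sketch below stores each RDF-style triple under three key orderings so that any pattern with a bound term becomes a key lookup. This is an illustrative assumption, not the specific approach shown on the slide, and the identifiers are hypothetical.

```python
from collections import defaultdict


class TripleIndex:
    """Keeps three permutations of each triple, each keyed by a single term."""

    def __init__(self):
        self.spo = defaultdict(set)  # subject   -> {(predicate, object)}
        self.pos = defaultdict(set)  # predicate -> {(object, subject)}
        self.osp = defaultdict(set)  # object    -> {(subject, predicate)}

    def add(self, s, p, o):
        self.spo[s].add((p, o))
        self.pos[p].add((o, s))
        self.osp[o].add((s, p))

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        if s is not None:
            candidates = {(s, p2, o2) for p2, o2 in self.spo.get(s, ())}
        elif p is not None:
            candidates = {(s2, p, o2) for o2, s2 in self.pos.get(p, ())}
        elif o is not None:
            candidates = {(s2, p2, o) for s2, p2 in self.osp.get(o, ())}
        else:
            candidates = {
                (s2, p2, o2) for s2, po in self.spo.items() for p2, o2 in po
            }
        # Filter on any remaining bound terms not used for the key lookup.
        return sorted(
            t for t in candidates
            if (p is None or t[1] == p) and (o is None or t[2] == o)
        )


idx = TripleIndex()
idx.add("dev-001", "rdf:type", "HPLC")
idx.add("dev-001", "locatedIn", "Lab-A")
idx.add("dev-002", "locatedIn", "Lab-A")

in_lab_a = idx.match(p="locatedIn", o="Lab-A")
print(in_lab_a)
```

Because every index entry is keyed by one term, entries could in principle be partitioned by key across nodes of a key-value store, which is one way such a layout might scale where a single-node triple store struggles.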

Page 15: Reasoning over big data

THANK YOU – QUESTIONS?

Eric Little, PhD
VP Data Science
OSTHUS, [email protected]
(M) 321-480-4818
www.linkedin.com/pub/eric-little