ibm - managing uncertain data at scale

• Click to add text

© 2013 IBM Corporation

Managing Uncertain Data at ScaleNikolay Marin

© 2013 3IBM Corporation

Managing Uncertain Data at Scale

2


Trend: Most of the world’s analyzed data will be uncertain

By 2015, 80% of the world’s data will be uncertain

Uncertain data management requires new techniques

These techniques are necessary for real-world Big Data Analytics

Opportunity: Business leadership using Big Data Analytics

Robust, business-aware uncertain data management

Use analytics over uncertain web, sensor, and human-generated data

Enable good business decisions by understanding analysis confidence

Challenge: Taking Big Data Analytics into an uncertain world

Analysis of text is highly nuanced; sensor-based data is imprecise

Timely business decisions require efficient large-scale analytics

It is more difficult to obtain insight about an individual than a group, especially if the source data is uncertain



3

* Truthfulness, accuracy or precision, correctness

The fourth dimension of Big Data: Veracity – handling data in doubt

Volume Velocity Veracity*Variety

Data at Rest

Terabytes to exabytes of existing

data to process

Data in Motion

Streaming data, milliseconds to

seconds to respond

Data in Many Forms

Structured, unstructured, text,

multimedia

Data in Doubt

Uncertainty due to data inconsistency& incompleteness,

ambiguities, latency, deception, model approximations



4

Forecasting a hurricane(www.noaa.gov)

Fitting a curve to data

Model UncertaintyAll modeling is approximate

Process UncertaintyProcesses contain

“randomness”

Uncertainty arises from many sources

Uncertain travel times

Semiconductor yield

Intended Spelling Text Entry

Actual Spelling

GPS Uncertainty

??

?

RumorsContaminated?

{John Smith, Dallas}{John Smith, Kansas}

Data UncertaintyData input is uncertain

Ambiguity

{Paris Airport}Testimony

Conflicting Data

??

?



5

Glo

bal

Dat

a V

olu

me

in E

xab

ytes

Sens

ors

(Inte

rnet

of T

hing

s)

Multiple sources: IDC,Cisco

100

90

80

70

60

50

40

30

20

10

Agg

rega

te U

ncer

tain

ty %

VoIP

9000

8000

7000

6000

5000

4000

3000

2000

1000

0

2005 2010 2015

By 2015, 80% of all available data will be uncertain

Enterprise Data

Data quality solutions exist for enterprise data like customer, product, and address data, but

this is only a fraction of the total enterprise data.

By 2015 the number of networked devices will be double the entire global population. All

sensor data has uncertainty.

Social Media

(video, audio and text)

The total number of social media accounts exceeds the entire global

population. This data is highly uncertain in both its expression and content.



6

Requires specific business process and industry context

How to reduce uncertainty in processes, models, and data

Constructing context for better understanding Extract as much information as feasible from each source

Combine (condense) data from multiple sources

More data from more sources is better– Gathers more evidence for statistical methods

Using statistical methods scaled for Big Data Stochastic techniques efficiently reason about uncertainty

Monte Carlo techniques explore many possible scenarios in order to gain insight



7

Attributes

Tro

uble

tic

kets

Help agent findsimilar tickets

Improve suggestions for similar problems using corroborating data and better mathematical techniques

Analyze all the data – do not subset

Use related techniques to automate Level 1 support, finding problem clusters, etc.

Use stochastic search to find trouble ticketsthat are similar

Trouble ticket attributes

Some attributes such as server type are precise

Other attributes such as words in trouble tickets may be imprecise indicators of the problem

Model approximation

Treat N attributes as N dimensions in space

Model similarity as closeness in the N dimensional space

Prediction

Improve predictability by getting

agent feedback

Statistical techniques reduce uncertainty in analytical models

© 2013 3IBM Corporation 8


Analytics is broadly defined as the use of data and computation to make smart decisions

Data

Historical

Simulated

Text Video, Images Audio

Data instances

Reports and queries on data aggregates

Predictive models

Answers and confidence

Feedback and learning

Decision point Possible outcomes

Option 1

Option 2

Option 3



Future of Analytics

Explosion of unstructured data

Creates new analytics opportunities

Addresses new enterprise needs

Consistent, extensible, and consumable analytics platform

Reduces cost-to-value for enterprises

Increases analytics solution coverage with limited supply of skills

Optimizing across the stack to deploy analytics at scale

Analytics becomes a dominant IT workload and drives HW design

Opportunity to seamlessly scale from terascale to exascale



Analytics toolkits will be expanded to support ingestion and interpretation of unstructured data, and enable adaptation and learning

Extended from: Competing on Analytics, Davenport and Harris, 2007

Standard Reporting

Ad hoc Reporting

Query/Drill Down

Alerts

Forecasting

Simulation

Predictive Modeling

In memory data, fuzzy search, geo spatial

Causality, probabilistic, confidence levels

High fidelity, games, data farming

Larger data sets, nonlinear regression

Rules/triggers, context sensitive, complex events

Query by example, user defined reports

Real time, visualizations, user interaction

Report

Decide and Act

Understand and Predict

Collect and Ingest/Interpret

Learn

Optimization

Optimization under Uncertainty

Decision complexity, solution speed

Quantifying or mitigating risk

Adaptive Analysis

Continual Analysis Responding to local change/feedback

Responding to context

Entity Resolution

Annotation and Tokenization

Relationship, Feature Extraction

People, roles, locations, things

Rules, semantic inferencing, matching

Automated, crowd sourcedDecide what to count;enable accurate counting

In the context of the decision process

Tradi-tional

New Methods

NewData



11

Finally...what about a longer term view.... say the next 10-50 years?

1. Artificial Intelligence

2. Nano –“everything”

3. Cognitive Computing

4. Deep (Exascale) Computing

5. Automic & Quantum Computing

6. Human / Computer Interaction

7. Machine to Machine Interaction

8. BioTech / Human Augmentation

9. Robots & Robotics

10. Advanced / Predictive Analytics

11. Security & Privacy

12. 3-D Printing

13. Video-enabled Business Processes

14. Personalized Web/Assistants

15. Ubiquitous Computing

16. Gaming

17. Simulation

18. Virtual Computing (including virtual worlds, tele-presence, etc.)

19. Augmented Reality

IBM Academy of Technology and Global Technology Outlook can help you find some answers

ibm - managing uncertain data at scale

Documents

sensor data

source data

worlds data

motionstreaming data

available data

testimonyconflicting

address data

scale2managing uncertain