big data and analytics challenges and issues

Big Data And Analytics Challenges and Issues Stephen H. Kaisler, D.Sc. Frank J. Armour, Ph.D. J. Alberto Espinosa, Ph.D. William H. Money, Ph.D. Presented at HICSS-49 January 5, 2016 Grand Hyatt, Poipu, Kauai, Hawaii


DESCRIPTION

Copyright (except where referenced) 2014-2016 Who We Are Stephen H. Kaisler, D.Sc. Senior Associate PCI Strategic Management Columbia, MD [email protected] J. Alberto Espinosa, Ph.D. Professor and Chair Kogod School of Business American University Washington, DC [email protected] William H. Money, Ph.D. Associate Professor School of Business Administration The Citadel Charleston, SC [email protected] Frank J. Armour, Ph.D. Kogod School of Business American University Washington, DC [email protected] 4/26/2017 4/26/2017 4/26/2017 Copyright (except where referenced) 2014-2016 Stephen H. Kaisler, Frank Armour, Alberto Espinosa and William H. Money BDA-2

TRANSCRIPT

Page 1: Big Data And Analytics Challenges and Issues

Big Data And Analytics
Challenges and Issues

Stephen H. Kaisler, D.Sc.
Frank J. Armour, Ph.D.
J. Alberto Espinosa, Ph.D.
William H. Money, Ph.D.

Presented at HICSS-49
January 5, 2016
Grand Hyatt, Poipu, Kauai, Hawaii

Page 2: Big Data And Analytics Challenges and Issues


Who We Are

Stephen H. Kaisler, D.Sc.
Senior Associate
PCI Strategic Management
Columbia, MD
[email protected]

Frank J. Armour, Ph.D.
Kogod School of Business
American University
Washington, DC
[email protected]

J. Alberto Espinosa, Ph.D.
Professor and Chair
Kogod School of Business
American University
Washington, DC
[email protected]

William H. Money, Ph.D.
Associate Professor
School of Business Administration
The Citadel
Charleston, SC
[email protected]

Copyright (except where referenced) 2014-2016 Stephen H. Kaisler, Frank Armour, Alberto Espinosa and William H. Money

Page 3: Big Data And Analytics Challenges and Issues


Outline

Topic Schedule

Analytics: An Introduction 1300-1415

Break 1415-1445

Analytics: 1445-1600

Page 4: Big Data And Analytics Challenges and Issues

Tutorial Purpose

• Big Data and Analytics is thought to be about:
  – Business Intelligence and Analytics
  – Computational Science

• But it is much more than that!
  – Demographic Analysis
  – Geointelligence: Spatial Analysis
  – The Grand Challenges in Science
  – Medicine: Processing 3-D hyperspectral high-resolution images for diagnostics, genomic research, proteomics, etc.
  – Media Analysis: Processing text, audio, video, imagery
  – And much, much more ...

• So, we want to introduce you to the issues and challenges at the frontiers of Advanced Analytics!


Page 5: Big Data And Analytics Challenges and Issues

The Business Analytics Landscape

Strategies: Social, Email, Blogs, Video, Mobile; Marketing, Sales - Product Listing, Promotions

Applications: ERP, CRM, Databases, Internal Applications, Customer/Consumer-facing applications

Context: Web, Customers, Products, Business Systems, Processes and Services

Support Systems: CRM, Recommendation Systems; Data Warehouses, Business Intelligence

Ref: S. Radhakrishnan, Advanced analytics: The Next Wave of Business Intelligence, Business Intelligence Conference, 2011


Page 6: Big Data And Analytics Challenges and Issues

The Emerging Analytics Landscape

• Extending diagnostic analytics to different domains
• Developing new predictive and prescriptive analytics based on advanced analytic techniques
  – Prediction based on scenario development rather than just probabilities
  – Prescription based on advanced simulation and visualization capabilities
• Development of Analytic Scientist curricula and degree programs
• Expansion beyond the traditional business intelligence applications and scientific applications based on descriptive analytics.

Page 7: Big Data And Analytics Challenges and Issues

What is Advanced Analytics?

• Advanced analytics:
  – the application of multiple analytic methods that address the diversity of big data, structured or unstructured,
  – to provide descriptive results, and
  – to yield actionable predictive and prescriptive results that facilitate decision-making.
• Beyond data mining and statistical processing methods to encompass logic-based methods, qualitative analytics, and non-statistical quantitative methods.
• A diverse set of techniques that require new software architectures and application frameworks to solve complex problems.
• New metrics that focus on the contributions of the value of the analysis as a holistic result are required to assess and evaluate the outcomes of advanced analytics.

Page 8: Big Data And Analytics Challenges and Issues

Setting the Stage:
A Few Words About Big Data

Page 9: Big Data And Analytics Challenges and Issues

Big Data Definition

• Big Data is the amount of data just beyond technology’s capability to store, manage and process efficiently.

“Ah, but a man’s reach should exceed his grasp, Or what’s a heaven for?” – Robert Browning

Page 10: Big Data And Analytics Challenges and Issues


Big Data - Quick Recap

• Organizations have access to a wealth of information:
  – They can’t get value out of it because most of it is sitting in its most raw form or in a semistructured or unstructured form
  – They don’t even know whether it’s worth keeping (or whether they are even able to keep it, for that matter).

• Attributes:
  – Gartner: Volume, Velocity, Variety
  – Kaisler, Armour, Espinosa, Money: Value, Veracity
  – Others have also been defined

• Value relies on Big Data processing being fast and agile

Page 11: Big Data And Analytics Challenges and Issues

Hmmm!

Page 12: Big Data And Analytics Challenges and Issues

The Data Scientist

Hal Varian, McKinsey Quarterly, January 2009:
“The sexy job in the next ten years will be statisticians… The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill.”

Ref: http://www.mckinseyquarterly.com/Hal_Varian_on_how_the_Web_challenges_managers_2286

“The critical job in the next 20 years will be the analytic scientist … the individual with the ability to understand a problem domain, to understand and know what data to collect about it, to identify analytics to process that data/information, to discover its meaning, and to extract knowledge from it – that’s going to be a very critical skill.”
- Kaisler, Armour, Espinosa, Money (2014) Amended

For both roles:
Analytic scientists require advanced training in specific domains, data science tools, multiple analytics, and visualization to perform predictive and prescriptive analytics. They may hold Ph.D.s, but pragmatic experience in a domain will be equally important.

Page 13: Big Data And Analytics Challenges and Issues

Some Big Data Issues Affecting Analytics

• Volume:
  – How much data is really relevant to the problem solution? Cost of processing?
  – So, can you really afford to store and process all that data?

• Velocity:
  – Much data coming in at high speed
  – Need for streaming versus block approach to data analysis
  – So, how to analyze data in-flight and combine with data at-rest

• Variety:
  – A small fraction is in structured formats: relational, XML, etc.
  – A fair amount is semi-structured, such as web logs, etc.
  – The rest of the data is unstructured text, photographs, etc.
  – So, no single data model can currently handle the diversity

• Veracity: cover term for …
  – Accuracy, Precision, Reliability, Integrity
  – So, what is it that you don’t know you don’t know about the data?

• Value:
  – How much value is created for each unit of data (whatever it is)?
  – So, what is the contribution of subsets of the data to the problem solution?
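The Velocity question (streaming versus block analysis, and combining in-flight data with data at rest) can be made concrete with a small sketch. It is not taken from the tutorial: it uses Welford's online algorithm for a running mean and variance, plus the standard pairwise merge rule, and the names `RunningStats` and `summarize_stream` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """One-pass, constant-memory summary of a numeric stream (Welford)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Fold one new observation into the summary without storing it.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 before any data has arrived.
        return self.m2 / self.count if self.count else 0.0

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Combine two summaries, e.g. in-flight data with data at rest.
        if other.count == 0:
            return RunningStats(self.count, self.mean, self.m2)
        if self.count == 0:
            return RunningStats(other.count, other.mean, other.m2)
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return RunningStats(n, mean, m2)


def summarize_stream(values) -> RunningStats:
    """Consume any iterable (including a generator) exactly once."""
    stats = RunningStats()
    for v in values:
        stats.update(v)
    return stats
```

A summary built from stored data and one built from a live feed can be merged without revisiting either data set, which is one possible answer to the "in-flight plus at-rest" question for simple statistics. More complex analytics generally do not merge this cleanly.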

Page 14: Big Data And Analytics Challenges and Issues

Types of Analytics

• Descriptive: A set of techniques for reviewing and examining the data set(s) to understand the data and analyze business performance.
• Diagnostic: A set of techniques for determining what has happened and why
• Predictive: A set of techniques that analyze current and historical data to determine what is most likely to (not) happen
• Prescriptive: A set of techniques for computationally developing and analyzing alternatives that can become courses of action – either tactical or strategic – that may discover the unexpected
• Decisive: A set of techniques for visualizing information and recommending courses of action to facilitate human decision-making when presented with a set of alternatives.

            Passive       Active
Deductive   Descriptive   Diagnostic
Inductive   Predictive    Prescriptive

Page 15: Big Data And Analytics Challenges and Issues

Descriptive Analytics

• Process:
  – Identify the attributes, then assess/evaluate the attributes
  – Estimate the magnitude to correlate the relative contribution of each attribute to the final solution
  – Accumulate more instances of data from the data sources
  – If possible, perform the steps of evaluation, classification and categorization quickly
  – Yield a measure of adaptability within the OODA loop

• At some threshold, crossover into diagnostic and predictive analytics

http://v1shal.com/content/25-cartoons-give-current-big-data-hype-perspective/

Page 16: Big Data And Analytics Challenges and Issues


Diagnostic Analytics

• Process:
  – Begin with descriptive analytics
  – Extract patterns from large data quantities via data mining
  – Correlate data types for explanation of near-term behavior – past and present
  – Estimate linear/non-linear behavior not easily identifiable through other approaches.
• Example: by classifying past insurance claims, estimate the number of future claims to flag for investigation with a high probability of being fraudulent.


Page 17: Big Data And Analytics Challenges and Issues

Predictive Analytics

• Process:
  – Begin with descriptive AND diagnostic analytics
  – Choose the right data based on domain knowledge and relationships among variables
  – Choose the right techniques to yield insight into possible outcomes
  – Determine the likelihood of possible outcomes given initial boundary conditions
  – Remember! Data-driven analytics is non-linear; do NOT treat it like an engineering project

Page 18: Big Data And Analytics Challenges and Issues


Prescriptive Analytics

• Process:
  – Begin w/ predictive analytics
  – Determine what should occur and how to make it so
  – Determine the mitigating factors that lead to desirable/undesirable outcomes
  – “What-if” analysis w/ local or global optimization
  – Ex: Find the best set of prices and advertising frequency to maximize revenue
  – Ex: And, the right set of business moves to make to achieve that goal

“Make it so”
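The pricing example above can be sketched as a small "what-if" search. Everything numeric here is an assumption made for illustration: the linear demand curve, the logarithmic advertising lift, the ad cost, and the function names `projected_revenue` and `best_plan` are hypothetical and not from the tutorial. A real prescriptive model would be fitted from predictive analytics first.

```python
import itertools
import math


def projected_revenue(price, ads_per_week,
                      base_units=1000.0, units_lost_per_dollar=50.0,
                      ad_lift=0.2, ad_cost=200.0):
    """Hypothetical what-if model: net weekly revenue for one plan."""
    # Assumed linear demand: fewer units sold as price rises.
    units = max(0.0, base_units - units_lost_per_dollar * price)
    # Assumed diminishing returns from advertising.
    units *= 1.0 + ad_lift * math.log1p(ads_per_week)
    return price * units - ad_cost * ads_per_week


def best_plan(prices, ad_frequencies, model=projected_revenue):
    """Global optimization by exhaustive search over a small grid."""
    return max(itertools.product(prices, ad_frequencies),
               key=lambda plan: model(*plan))


# Usage: evaluate every (price, ads) combination and keep the best one.
plan = best_plan(prices=[8, 9, 10, 11, 12], ad_frequencies=range(0, 11))
```

Exhaustive search is only practical for small grids. With more decision variables, a local or global optimizer would replace the grid, which is the "local or global optimization" trade-off named in the list above.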

Page 19: Big Data And Analytics Challenges and Issues

Decisive Analytics

• Process:
  – Given a set of decision alternatives, choose the one course of action to take from possibly many
  – But it may not be the optimal one.
  – Visualize alternatives – whole or partial subset
  – Perform exploratory analysis – what-if and why

• How do I get to there from here?

• How did I get here from there?

Page 20: Big Data And Analytics Challenges and Issues

Advanced Analytics:
Application of Analytics To Critical Problems

Page 21: Big Data And Analytics Challenges and Issues

The Role of Analytics

• “Tools and techniques that gear the analyst’s mind to apply higher levels of critical thinking can substantially improve analysis… structuring information, challenging assumptions, and exploring alternative interpretations.”
  Richards Heuer, Jr., “The Psychology of Intelligence Analysis”

• Beware Frege’s Caution:
  – Converse Problems:
    • If you magnify the details, you lose the overview
    • If you focus on the overview, you don’t see the details
  – Problem with Data Mining:
    • Applying statistics to understand the trends causes a loss of grounding in the data

Page 22: Big Data And Analytics Challenges and Issues

The Analytics Continuum

• Analytics problems span a continuum:
  – Short-term analysis leads to quick fixes and quick results, which may be unsustainable
  – What are the disruptive innovations in the middle-term that provide near-term domain leadership?
  – Long-term leads to strategic changes and innovations that provide sustainable domain dominance.

Finding a Needle in a Haystack: Top-Down Analysis (Deductive)

A little bit of this; a little bit of that: Middle-Out Analysis (Abductive?)

Spinning Straw into Gold: Bottom-Up Analysis (Inductive)

push / pull

Page 23: Big Data And Analytics Challenges and Issues

Analytics Classes

Indications & Warning: Generally, indications of warfare and potential conflict and other crises, based on quantitative information found in open source datasets

Dynamical Systems: Differential or difference equations of low dimensionality representing competing actors (incl. system dynamics)

(Hidden) Markov Models: Time-phased data aggregated at fixed intervals with scaled values. Separate from underlying events input to set of discrete states w/ associated probabilities.

Event Data Analysis: Analysis of abstracted and coded streams of short-term interactions among competing or cooperating actors

Econometric Models: Large-scale aggregate models of social actors, states or organizations in economic and social systems – regional, national, international.

Probabilistic Models: Regression and statistical models estimating the probability of how variables will affect a specified outcome.

Principal Components Analysis: Techniques for the reduction of high-dimensionality models to a few critical dimensions to facilitate prediction and visualization.

Game Theory Models: Application of 2-person and N-person game theory to competitive and collaborative situations involving strategic interdependence.

Logic Systems: Use of logical formulae and systems to represent and solve qualitative problems, including deductive, abductive, and inductive techniques.

Ref: Kaisler and Cioffi-Revilla 2007; Kaisler, Armour, Espinosa, and Money 2014

Page 24: Big Data And Analytics Challenges and Issues

Analytics Classes


Page 25: Big Data And Analytics Challenges and Issues

Analytics Is About Discovery

Novelty Discovery
  – Finding new, rare, one-in-a-[million / billion / trillion / etc.] objects and events

Class Discovery
  – Finding new classes of objects and behaviors
  – Learning the rules that constrain class boundaries

Association Discovery
  – Finding unusual (improbable) co-occurring associations

Correlation Discovery
  – Finding patterns and dependencies, which reveal new natural laws or new scientific principles

Ref: Kirk Borne, Dynamic Events in Massive Data Streams, GMU


Page 26: Big Data And Analytics Challenges and Issues

The Goal of Analytics

From Sensors (data collection, measurement, observation, …)
to Monitoring and Alerting
to Sensemaking (Data and Analytics Science)
to Cents-Making (Getting to ROI!!)


Adapted from: Kirk Borne, Dynamic Events in Massive Data Streams, GMU


http://timoelliott.com/blog/wp-content/uploads/2013/04/Whats-the-ROI-of-knowing-what-you-dont-know.jpg

Page 27: Big Data And Analytics Challenges and Issues


Sensemaking

• In the end, what analytics is really about is sensemaking:
  – What does that event really mean to me/him/her/my friends/etc.?
  – What is a plausible explanation?

• Sensemaking:
  – Fits data into a frame or mental model
  – Can be physical or social
  – Requires situational awareness that helps us to adapt and respond to known and unexpected or unknown situations
  – Interpreting: something is there that is waiting to be discovered or approximated
  – Comparison to previous experience, retrospectively
  – Requires a higher level of intellectual engagement, not a passive translation

• Klein (2006) theorizes that sensemaking processes are initiated when individuals or organizations recognize a lack of understanding of events

Page 28: Big Data And Analytics Challenges and Issues

Sensemaking Challenges

• Lillian Wu (IBM) has noted that everything is becoming:
  – Instrumented: We now have the ability to measure, sense and see the exact condition of practically everything.
  – Interconnected: People, systems and objects can communicate and interact with each other in entirely new ways
  – Intelligent: People, systems and objects can respond to changes quickly and accurately, and get better results by predicting and optimizing for future events.

• How to deal with ambiguity?
• How to deal with too much data?
• Have we let algorithms and large centralized data centres not only control the remembering but also the meaning and interpretation of the data? (Giulia Forsythe, http://gforsythe.ca/big-data-sensemaking/)
• We know how to do massive data collection and have the ability to index, curate, search and share.
  – But what seems to be missing is the ability to review, reflect, recall and ponder

Ref: Technology, Data. Analytics, PSM workshop -- October 14, 2011

Page 29: Big Data And Analytics Challenges and Issues

Sensemaking: Properties (Weick)

• Grounded in identity construction
  – Know thyself & Know your enemies/friends, …
• Retrospective:
  – Look back to look forward; Look forward to look back
• Social
  – Who socialized you, how, and who will see the results
  – Beliefs: I believe what you believe what I believe, etc.
• Ongoing and dynamic:
  – Who or what changes over time and space
• Cues: what are the initiators?
• Plausibility rather than accuracy:
  – Understanding only needs to be sufficient, not comprehensive

Page 30: Big Data And Analytics Challenges and Issues

Sensemaking: An Application

• Mobile Crowdsensing:
  – Individuals with sensing and computing devices collectively share information to measure and map phenomena of common interest within communities of people
    • Participatory sensing: individuals are actively involved in contributing sensor data
    • Opportunistic sensing: sensing is autonomous and user involvement is minimal

• Localized analytics at the device and near field
• Aggregate analytics at the central repository
• Privacy Issues: Potentially collecting sensitive sensor data pertaining to individuals

Ref: Ganti, R.K., F. Ye, and H. Lei. ??. Mobile Crowdsensing: Current State and Future Challenges, IBM T.J. Watson Research Center, Hawthorne, NY

Page 31: Big Data And Analytics Challenges and Issues

Knowledge-Centric Systems

• User-centric Systems — Systems That Know

– Knowledge-driven solutions that connect open information, share open decision-making rules, deliver open composite services, access and navigate information in context of use, and provide virtual assistants that manage cases and complete tasks.

• Adaptive Systems — Systems That Learn

– Knowledge-driven solutions that feature modeling, collaboration, and advanced analytics to detect patterns, make sense, simulate, predict, learn, take action, and improve performance with use and scale.

• Smart Operations — Systems That Reason

– Knowledge-driven solutions that reason like experts, advise as avatars, adapt, are autonomic, perform autonomously.

These are the types of systems using advanced analytics

Page 32: Big Data And Analytics Challenges and Issues


The New Analytic Paradigm

#1: You will be expected to do something with information
#2: There really is more to know
#3: You will have to know more about knowing
#4: Brain science and decision science are converging
#5: The environment is changing our brain
#6: Information management is the essence of leadership
#7: A more connected world means much more data is available (and accessible)
#8: Math matters (but so do logic and rules)
#9: There are significant downsides to not knowing
#10: Knowing can change the world

Source: Thornton May, The New Know: Innovation Powered by Analytics, 2009

Page 33: Big Data And Analytics Challenges and Issues

Analytics Challenges


Page 34: Big Data And Analytics Challenges and Issues


Finding A Needle in a Haystack

• With (all) the data available, find the/a key pattern that indicates a situational change

  – A single event
  – Perhaps, a sequence of events
  – (Not the signal in the noise problem!!)

• Have we seen this pattern before?
  – Determine its characteristics, not just that it exists

• Predict what event occurs next because this/these event(s) occurred in the pattern

• How to identify relevant fragments of data easily from a multitude of data sources?

• Difficult to determine what the right answer is in advance

• Problem: The needle hasn’t grown as fast as the haystack!!

• Problem: We need new analytics methods to deal with larger, more complex data and problems!!

Page 35: Big Data And Analytics Challenges and Issues


Finding A Needle in a Haystack

• What if the “needle” happens to be a complex data structure?
  – Brute force search and computation are unlikely to succeed due to inefficiency
  – Complexity increases with streaming data as opposed to a static data set

• Absence of evidence (so far) is not evidence of absence! (Borne 2013)

• What preprocessing do we need to do before searching?
  – Quality vs. Quantity: What data are required to satisfy the given value proposition?
  – At what precision, accuracy, and reliability?

• What if the needle must be derived rather than found?
  – How do we track the provenance of the derived data/information?
  – Is the process repeatable as we change algorithms and data structures?

Challenge: Consider finding the few packets in the millions (er, tens of billions) flowing through a network that carry a virus or malware.
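To make the scale of this challenge concrete, here is the simplest possible baseline: a single-pass scan of packet payloads for known byte signatures. It is a sketch only. The function name `flag_suspicious` and the sample signatures are hypothetical, and the approach deliberately illustrates the brute-force method that the slide warns is unlikely to scale.

```python
def flag_suspicious(packets, signatures):
    """Yield (index, signature) for each payload containing a known signature.

    `packets` may be any iterable of bytes objects, including a generator
    reading from a live capture, so nothing is held in memory beyond the
    current payload.
    """
    for index, payload in enumerate(packets):
        for sig in signatures:
            if sig in payload:
                yield index, sig
                break  # one hit is enough to flag this packet


# Usage with made-up payloads and signatures (illustrative values only).
sample_packets = [b"GET /index.html", b"\x90\x90\x90evil", b"hello"]
sample_signatures = [b"\x90\x90\x90", b"cmd.exe"]
hits = list(flag_suspicious(sample_packets, sample_signatures))
```

The cost grows with packets times signatures, the scan misses any signature split across two packets, and it finds only needles that are already known. Those three limits map directly onto the questions raised above about complex needles, streaming data, and derived rather than found evidence.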

Page 36: Big Data And Analytics Challenges and Issues

Needle(s) in a Haystack

Ref: Crawford, L. Access and Analytics to the UK Archive, British Library, 2010

Uhmm! What are we supposed to glean from this picture?


Page 37: Big Data And Analytics Challenges and Issues

Network Forensics

• Networks have become exponentially faster. They carry more traffic and more types of data than ever before. Yet as they get faster, they become more difficult to monitor and analyze.
  – 40G Networks
  – Richer Data: VoIP as the telephony standard
  – Malicious security threats are more subtle

• Problems:
  – Finding proof of a security attack
  – Troubleshooting intermittent performance issues
  – Identifying the source of data leaks
  – Troubleshooting VoIP and Video over VoIP

• Network forensics must be:
  – Precise: capture high-speed packets without dropping any
  – Scalable: extend to new network technologies and speeds
  – Flexible: adapt to heterogeneous network segments
  – VoIP-Smart: reconstruct & replay VoIP calls; present Call Detail Records (CDR) for each call
  – Continuously available: run 24/7 with adequate storage; support real-time analysis

Page 38: Big Data And Analytics Challenges and Issues


Finding the Knees

• The knee of an algorithm or analytic is the scale value at which performance begins to degrade as larger data volumes are processed.
  – Every analytic method and algorithm can have one (or more?)
  – Where positive slope increases begin to flatten out
  – Where positive or flat slopes transition to negative slopes

• Factors affecting the knee:
  – data structure, volume, and variety
  – algorithm complexity and implementation, and
  – infrastructure implementation.

• What are the corollaries for non-algorithmic analytics?
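One way to locate a knee from benchmark measurements is sketched below. It is a simple heuristic chosen for illustration, not a method from the tutorial: normalize both axes, draw the straight line from the first to the last measurement, and report the scale at which the curve is furthest from that line. The function name `find_knee` and the sample numbers are assumptions.

```python
def find_knee(sizes, throughputs):
    """Return the size where the curve bends furthest from its end-to-end chord.

    `sizes` must be sorted ascending; `throughputs` holds the measured
    performance at each size. Only a single knee is reported.
    """
    if len(sizes) != len(throughputs) or len(sizes) < 3:
        raise ValueError("need at least three matching measurements")
    x_span = sizes[-1] - sizes[0]
    y_lo, y_hi = min(throughputs), max(throughputs)
    y_span = y_hi - y_lo
    if x_span == 0 or y_span == 0:
        raise ValueError("measurements must vary in both size and throughput")
    # Normalize both axes to [0, 1] so units do not dominate the distance.
    xs = [(x - sizes[0]) / x_span for x in sizes]
    ys = [(y - y_lo) / y_span for y in throughputs]
    # Chord through the first and last normalized points (x runs 0 to 1).
    slope = ys[-1] - ys[0]
    gaps = [abs(y - (ys[0] + slope * x)) for x, y in zip(xs, ys)]
    return sizes[gaps.index(max(gaps))]


# Usage with made-up benchmark numbers: throughput rises, then flattens.
knee = find_knee([1, 2, 4, 8, 16, 32, 64],
                 [100, 195, 380, 700, 760, 780, 785])
```

The heuristic assumes a single bend and reasonably smooth measurements. Noisy benchmarks, or analytics with more than one knee, would need smoothing or a segment-by-segment search.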

Page 39: Big Data And Analytics Challenges and Issues

Finding the Tipping Point

• A tipping point is one in which change in a system becomes potentially irreversible and maybe even unstoppable.
  – May be associated with negative or positive effects
  – In social systems, a buildup to a critical mass at which point a seminal change occurs.
  – Ex: MySpace was a formidable competitor of Facebook, but once the Facebook membership reached its “tipping point” people started abandoning MySpace and signing up for Facebook.

• Small events can create ripple effects – may be linear or non-linear, chaotic or perturbative
• Concept of emerging trends in the commercial marketplace
• The explosion of a viral infection into an epidemic

Ref: Choucri, N., et al. (2006) Understanding and Modeling State Stability: Exploiting System Dynamics. MIT Sloan Research Papers, No. 4574-06, Jan. 2006.

Page 40: Big Data And Analytics Challenges and Issues


Spinning Straw Into Gold

• Patterns may be unknown or ambiguously defined.

• Patterns may be morphing over time.

• The problem is sensemaking: the dual process of trying to fit data to a frame or model and of fitting a frame around the data.
  – Neither data nor frame comes first!
  – Must evolve concurrently!

• With (all) the data available, describe a situation in a generalized form such that predictions for future events and prescriptions for courses of actions can be made.

• Objective: Identify one or more patterns that characterize the behavior of the system.

• Remember: All data has value to someone, but not all data has value to everyone.


Page 41: Big Data And Analytics Challenges and Issues

Dealing with Ambiguity

• Ambiguity arises from entity resolution:
  – Not enough data to explicitly resolve two or more entities or objects
  – 1st Degree ER: “who is who”? (within domain)
  – 2nd Degree ER: “who knows whom” (a graph/network analysis problem)
  – 3rd Degree ER: cross-domain linkage (ontological resolution)

• Example: Text Processing and Understanding:
  – Resolving ambiguity in human languages (much data is unstructured text)
    • Ex: The word “strike” has over 30 meanings in English
  – Entity resolution is a multi-level process (Talburt 2009-2011)
    • Ex: There are more than 45,000 people named “John Smith” in the U.S.
    • Computational complexity increases with knowledge level.
  – Tradeoff is end-to-end processing time versus number of entities to be resolved

• Scaling may be problematic.

Page 42: Big Data And Analytics Challenges and Issues

Talburt’s Hierarchy for Entity Resolution

Method: Description

Deterministic Matching: Link/merge two entity mentions based on the degree of similarity between the values of corresponding entity attributes.
Ex: Direct match of names, addresses, other attributes

Probabilistic Matching: Link/merge two entities based on corresponding attributes, even if some have different values (but within the expected range).
Ex: “John, Doe, 1989-08-13” and “Jon, Doe, 1989-08-13”

Transitive Matching: Link/merge two entities based on corresponding attributes: A matches B and B matches C, so A matches C, even if not all attributes match. But this may lead to false positives.

Associative Matching: Link/merge two entities based on semantics and domain knowledge.
Ex: If (Mary Smith, 123 Oak St), (Mary Smith, 456 Elm St), (John Smith, 123 Oak St), and (John Smith, 456 Elm St) are entity mentions, none of the six possible pairings of the four records agree on both name and address. But we may infer that these are the same John and Mary Smith at both addresses.

Assertive Matching: Link/merge two entities based on prior and derived knowledge to reason about possible relationships between them. Different models may be used to deduce/infer relationships.
Ex: “The Mary Smith who lives with her brother and resided at 123 Oak St. now resides at 456 Elm St.”

Ref: Talburt, J. (2009-2011) Reference Linking Methods, Identity Resolution Daily, Retrieved November 3, 2013 from http://identityresolutiondaily.com/
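The first three levels of the hierarchy can be sketched in a few lines. This is an illustration under stated assumptions, not Talburt's implementation: records are plain tuples of strings, similarity comes from Python's standard `difflib.SequenceMatcher`, the 0.85 threshold is arbitrary, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def deterministic_match(a, b):
    """Deterministic matching: every attribute must agree exactly."""
    return a == b


def probabilistic_match(a, b, threshold=0.85):
    """Probabilistic matching: tolerate small differences in some attributes.

    Scores each pair of corresponding fields from 0 to 1 and compares the
    average to an assumed threshold.
    """
    if len(a) != len(b):
        return False
    scores = [SequenceMatcher(None, x.lower(), y.lower()).ratio()
              for x, y in zip(a, b)]
    return sum(scores) / len(scores) >= threshold


def transitive_clusters(records, match):
    """Transitive matching: if A~B and B~C, put A, B and C in one cluster.

    Uses a small union-find structure. As the table notes, chaining
    matches this way can produce false positives.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if match(records[i], records[j]):
                parent[find(j)] = find(i)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return list(clusters.values())
```

The pairwise loop is quadratic in the number of records, which is one concrete form of the scaling concern raised for entity resolution. Associative and assertive matching need domain knowledge and reasoning that a string-similarity sketch like this cannot supply.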


Page 43: Big Data And Analytics Challenges and Issues

Advanced Analytics

BDA-43


Page 44: Big Data And Analytics Challenges and Issues

BDA-44

Crowdsourcing: A New Analytic?

• Tasks normally performed by employees are outsourced via an open call to a large, self-selected community
• Some examples
  – Netflix prize
  – InnoCentive: solve R&D challenges
  – DARPA Network Challenge
• Follows an AI blackboard model
• Distributed co-creation has become a mainstream technology
• Issue: What metrics do we use to assess the results?
• Issue: How robust are the results?
• And, more issues to be addressed ….


Page 45: Big Data And Analytics Challenges and Issues

BDA-45

Crowdsourcing: Wikipedia: An Example

Ref: Mary Meeker. KPCB, presentation at All Things D.

The Death of the Encyclopedia Business Model


Page 46: Big Data And Analytics Challenges and Issues

Moving Analytics To the Edge

• Traditional analysis: data is stored and then analyzed
  – Usually at some central location or a few distributed locations
  – The cost and time to move large amounts of data may render it obsolete or of little worth
• Moving analytics to the frontier of the domain:
  – (Near) real-time analysis and decisions are required
  – Streaming massive amounts of data is expensive, fraught with error: microsecond latency, millions of events
  – We just can’t store it all
  – Perishability is a key factor
  – May only really need the synthesized, aggregate information, not the raw data
  – True data-driven analysis

• Example: Pushing analytics into cameras for images, full motion video analysis, motion correction, 3D perception, …

BDA-46


Page 47: Big Data And Analytics Challenges and Issues

BDA-47

Edge Device Analytics

• Filtering and compression processes are often tied to the downstream analytical requirements
  – Means the filtering and compression algorithms must be dynamic
• How close to the edge can we push the filtering and compression algorithms?
  – What (additional) data do these algorithms need to be effective?
  – How do we measure the efficiency of these algorithms?
  – Does the in situ hardware have the computational capacity to support such algorithms?
  – How much data correction can we do at the edges?
• Challenges:
  – How can we summarize streaming data?
  – How fast can we determine changes in the incoming data?
  – How fast can we adapt to changes in the data stream?
  – How fast can we affect the environment based on what we see?

NOTE: Not the same as IBM Edge Analytics!!
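One possible way to summarize a stream on a constrained edge device, and to flag changes in the incoming data, is to keep a running mean and variance in constant memory. The sketch below uses Welford's online algorithm; the class name `RunningStats` and the 3-standard-deviation change rule are assumptions made for illustration, not something the slides prescribe.

```python
import math


class RunningStats:
    """Constant-memory summary of a numeric stream (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary; the raw value is not kept."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    def is_unusual(self, x, k=3.0):
        """Assumed rule: flag values more than k standard deviations from the mean."""
        std = math.sqrt(self.variance)
        return self.n > 1 and std > 0 and abs(x - self.mean) > k * std


stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(reading)
```

Only three numbers are stored regardless of stream length, so just the aggregate needs to leave the device. A fixed threshold like this adapts slowly if the stream itself drifts, which is one of the open questions listed above.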

Page 48: Big Data And Analytics Challenges and Issues

Streaming Analytics

• Streaming data: analytics must (often) occur in real time, as the data passes through the sensing/collecting device:
  – Allows you to identify and examine patterns of interest as the data is being created.
  – Can yield instant insight and immediate (re)action.
• “Real-time is for robots”
  – Joe Hellerstein, Professor of Computer Science at UC Berkeley.
  – “If you have people in the loop, it’s not real time. Most people take a second or two to react, and that’s plenty of time for a traditional transactional system to handle input and output.”

BDA-48

Page 49: Big Data And Analytics Challenges and Issues

BDA-49

Location Analytics

• What is it?
  – Augmenting mission-critical, enterprise business systems with complementary content, mapping, and geographic capabilities
  – Mapping & Visualization: use maps as the media to visualize data
  – Spatial analytics: merging GIS w/ other types of analytics
  – Find spatio-temporal patterns indicative of physical activities or social behavior
  – Data/information enrichment: add maps, imagery, demographics, consumer and lifestyle data, environment and weather, social media, etc.
• Ubiquity of GPS on cellphones, cars, wristwatches, laptops, tablets, etc.

Ref: Kerr & Nelson, ESRI International User Conference, July 2012

Ref: http://www.esri.com/software/location-analytics

Page 50: Big Data And Analytics Challenges and Issues

Location Analytics: After a Regime Change …

• Immediate transition from “normal operations in an urban environment” to “government in place”
  – Need Civil Affairs support for managing city/region relief efforts
• No existing analysis, planning, and management tool to assist civil Government and Relief Officials
• A lack of coherent view will lead to failure and, possibly, continuing deterioration of government, law and order, civil systems, etc…
• Why?
  – (nearly) complete destruction of governing regime
  – Disintegration of social/governmental institutions: water, education, health, law enforcement, financial, food
  – Society reorganized overnight to adapt to the new power structure, which may/may not include former government personnel
    • Ex: Sadr handed out food; created immediate constituency and had people talking about him

• Cannot win “hearts and minds” if you cannot “feed and house them”

BDA-50

Page 51: Big Data And Analytics Challenges and Issues

BDA-51

Web Analytics

• What is it?
  – Now: The study of the behavior of web users
  – Future: The study of one mechanism for how society makes decisions
  – Example: Behavior of Web Users
    • How many people clicked on Ebola (or related terms) in the past 2 months?
    • Their location, their dwell time, the number of sites they examined, the difficulty or complexity of the material on the web site
    • What can this tell us about popular concern about Ebola?
    • Can it help decision makers to better present information and decisions?
  – Commercially, it is the collection and analysis of data from a web site to determine which aspects of the website achieve the business objectives

Page 52: Big Data And Analytics Challenges and Issues

BDA-52

Web Analytics

• Challenge: When people start getting more of their information from the Web than from radio/tv/papers:
  – How do we design web pages to influence the opinions of people?
    • The web is not a neutral medium
  – How do we measure the influence of a web page’s contents on user opinions?
    • Does # visitors translate into a viable influence metric?
    • Does dwell time translate into influence?
    • Are opinions more or less influenced as the visitor clicks through a sequence of pages?
    • How many different web sites (each of one or more pages) does a user searching a particular topic click through?
  – How do we measure the difficulty and/or complexity of the material presented on a web page(s)?
    • What is the analogue of the Flesch-Kincaid Score for web pages?
  – How do we design “good” web pages to increase their Google PageRank score?

Page 53: Big Data And Analytics Challenges and Issues

BDA-53

Visual Analytics

• Visual analytics: the science of analytical reasoning facilitated by interactive visual interfaces

Where are the walkers going?

Remember: The eye can be easily fooled and, thereby, fool the brain.

Page 54: Big Data And Analytics Challenges and Issues

BDA-54

Visual Analytics

• Visual analytics: an evolving discipline which is driving new ways of presenting data and information to the user.

• Visual analytics: “the science of analytical reasoning facilitated by interactive visual interfaces” (Thomas and Cook 2005)

• Visual analytics:
  – the formation of visual metaphors in combination with a human information discourse (interaction)
  – that enables detection of the expected and discovery of the unexpected within massive, dynamically changing information spaces. (Wong and Thomas 2004)

• Visual analytics provides the “last 12 inches” between the masses of information and the human mind that enables us to make decisions.

Page 55: Big Data And Analytics Challenges and Issues

BDA-55

Visual Analytics

• Why is it hard?
  – You can only see 2D because your screen is 2D
• To visualize k-dimensional data:
  – Divide the screen into multiple 2D regions and show pair-wise correlations across selected dimensions
  – Project k-dimensional data into 2D
    • Projection Methods (Dimension Reduction): PCA, MDS, LDA, LLE, many others …
    • Usually, try to preserve distances in 2D as they exist in k-D
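As a rough illustration of projecting k-dimensional data into 2D, here is a dependency-free PCA sketch that finds the top two principal components by power iteration. The function names, the iteration count, and the fixed random seed are assumptions for this example, and it assumes k is at least 2; a production system would more likely use a numerical library.

```python
import math
import random


def _deflate_and_normalize(v, basis):
    """Remove the parts of v that lie along earlier components, then scale to unit length."""
    for b in basis:
        dot = sum(x * y for x, y in zip(v, b))
        v = [x - dot * y for x, y in zip(v, b)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 1e-12 else None


def pca_2d(points, iterations=200, seed=0):
    """Project k-dimensional points (k >= 2) onto their top two principal components."""
    n, k = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(k)]
    centered = [[p[d] - means[d] for d in range(k)] for p in points]
    # Sample covariance matrix (k x k).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(k)]
           for i in range(k)]

    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = _deflate_and_normalize([rng.gauss(0, 1) for _ in range(k)], components)
        for _ in range(iterations):
            # Power iteration step: multiply by the covariance matrix.
            w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
            nxt = _deflate_and_normalize(w, components)
            if nxt is None:  # no variance left in the remaining directions
                break
            v = nxt
        components.append(v)

    # Coordinates of each centered point along the two components.
    return [[sum(r[d] * c[d] for d in range(k)) for c in components]
            for r in centered]


# 3-D points that actually lie in a 2-D plane (third coordinate is constant).
pts = [(float(i), float(i % 3 - 1), 5.0) for i in range(30)]
proj = pca_2d(pts)
```

When the data really occupy a 2D plane, as in this toy input, pairwise distances survive the projection. For genuinely high-dimensional data some distance information is lost, which is why the choice of projection method matters.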

Page 56: Big Data And Analytics Challenges and Issues

Visualization Analytics: A Periodic Table


Page 57: Big Data And Analytics Challenges and Issues

BDA-57

Visual Analytics: Challenges

• Projection methods:
  – How to choose which is best for a given problem
  – Scaling
  – Harder to understand what is being conveyed
• How to visualize non-numeric data, e.g. text, icons, or images?
  – Interactive multiple displays
  – Change in one display begets a change in another to allow exploration
• As k grows really large and the data types in the problem space are mixed, what do we do?
  – Remember, Miller 1956: 7 +/- 2

Page 58: Big Data And Analytics Challenges and Issues

BDA-58

Visual Analytics: Challenges

• What if the data cannot fit on your computer?
  – Truncate (sample, filter)
    • Easy to implement; efficient; scalable
    • Sampling is often data- or task-dependent
  – Resolution reduction (“blurring”, image zooming)
    • Fine details can be lost (get the big picture)
    • Can zoom in on specific features (but lose forest for trees)
  – Streaming:
    • Inspect data in blocks (with or without overlapping windows)
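The "truncate (sample)" option can be sketched with reservoir sampling (Algorithm R), which draws a uniform random sample of fixed size from a stream without ever holding the whole data set in memory. The function name and parameters below are illustrative assumptions, not part of the original slides.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


subset = reservoir_sample(range(1_000_000), k=5, seed=1)
```

Uniform sampling is the simplest choice; as the slide notes, a good sample is often data- or task-dependent, so rare but important records may need stratified or weighted sampling instead.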

Page 59: Big Data And Analytics Challenges and Issues

BDA-59

Visual Analytics: Affecting Factors

• Spatial Ability
• Cognitive Workload/Mental Demand
• Personality
• Experience (novice vs. expert)
• Emotional State
• Perceptual Speed
• … and more

Conclusions:
- Computer must be more aware of the user
- Computer must develop a model of the user’s behavior
- Develop a symbiotic environment for data exploration

Page 60: Big Data And Analytics Challenges and Issues

BDA-60

10 Exascale VA Challenges - I

In-Situ Analysis: Beyond PBytes, storing data, then retrieving it later for visualization may not be feasible. Develop new algorithms for in-situ VA to greatly reduce I/O. Perform VA concurrently with data analysis.

User-driven Data Reduction: While data volumes grow rapidly, human cognitive capabilities remain unchanged. Provide flexible, interactive user-control mechanisms for dynamically filtering data for VA.

Multilevel Hierarchy: Hierarchy depth and complexity grow with data volume. New algorithms are required for transformation and traversal of multilevel hierarchies.

Representing Evidence and Uncertainty: Evidence synthesis and uncertainty quantification are usually united through visualization. How best to present evidence and uncertainty without introducing significant bias.

Heterogeneous Data Fusion: Extreme scale problems are often heterogeneous with complex structures. New algorithms are required for fusion of heterogeneous data objects.

Ref: Wong, Shen, Johnson, Chen and Ross (2012)

Page 61: Big Data And Analytics Challenges and Issues

BDA-61

10 Exascale VA Challenges - II

Data Summarization and Triage for Interactive Query: Analyzing entire exascale data sets is likely impractical. New tools for interactive filtering of data for analysis and display of selected, relevant data are required.

Temporally Evolved Features: Develop tractable algorithms for VA of temporal streams.

Mitigate the Human Bottleneck: Find ways to compensate for human cognitive limitations. Need to understand how the brain processes visual images.

New VA Frameworks for HPC: Design and develop new frameworks with open APIs for interaction and UI that do not constrain HPC systems.

Replace Conventional Wisdom: New ideas are required for VA methodologies.

Ref: Wong, Shen, Johnson, Chen and Ross (2012)

Page 62: Big Data And Analytics Challenges and Issues

Context-Aware Computing

• Arose from ubiquitous computing in early 90s: computing everywhere and “invisible”

• Ex: Active Badges
  – Problem: locating researchers
  – Solution: badge tied to identity, tracked as researcher moves in building
  – Xerox PARC pioneered this idea in 90s under Mark Weiser
  – Now used in museums and other venues to “customize” the user experience
  – Can be based on RFIDs embedded in badges

Page 63: Big Data And Analytics Challenges and Issues

What Is Context?

• By example
  – Location, time, identities of nearby users …

• By synonym
  – Situation, environment, circumstance

• By dictionary [WordNet]
  – the set of facts or circumstances that surround a situation or event

• “Context is any information that can be used to characterize the situation of an entity. An entity is a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and the application themselves.” [Dey and Abowd, 2000]

Page 64: Big Data And Analytics Challenges and Issues

BDA-64

Personal AI: Context-Aware Computing

• No consensus on what it is!
• “A system is context-aware if it uses context to provide relevant information and/or services to the user, where relevancy depends on the user’s task.”
• For our purposes, it is symbiotic computing:
  – A partnership between human and machine where each is aware of the other – capabilities and limitations
  – Each is a reasoning autonomous adaptive entity
  – Each may initiate an activity and propose/dispose of ideas
• Contradicts Weiser’s first view of Ubiquitous Computing
  – Computing recedes into the background
• Need to define what Context is:
  – Location, time, situation, data/information


Page 65: Big Data And Analytics Challenges and Issues

BDA-65

Context-Aware Requirements

• John McCarthy defined the key ideas (for a robot), but we adapt them to a general situation:
  – The system observes its physical environment, recognizes the status of its effectors, notices the relation of itself to the environment and notices the values of important internal variables, e.g. the state of its power supply and of its communication channels.
  – Observes that it does/does not know the value of a certain term
    • E.g., observing whether it knows the telephone number of a certain person.
  – Observing that it does know the number or that it can get it by some procedure is likely to be straightforward.
  – Keeping a journal of physical and intellectual events so it can refer to its past beliefs, observations and actions.
  – Observing its goal structure and forming sentences about it.
    • Notice that merely having a stack of subgoals doesn't achieve this unless the stack is observable and not merely obeyable.

Page 66: Big Data And Analytics Challenges and Issues

BDA-66

Context-Aware Requirements

• John McCarthy continued:
  – The entity may intend to perform a certain action.
    • It may later infer that certain possibilities are irrelevant in view of its intentions.
    • This requires the ability to reflect on its intentions.
  – Observing how it arrived at its current beliefs.
    • Most of the important beliefs of the system will have been obtained by nonmonotonic reasoning, and therefore are usually uncertain.
    • It will need to maintain a critical view of these beliefs, i.e., believe meta-sentences about them that will aid in revising them when new information warrants doing so.
    • It will presumably be useful to maintain a pedigree for each belief of the system so that it can be revised if its logical ancestors are revised.
    • Reason maintenance systems maintain the pedigrees but not in the form of sentences that can be used in reasoning.
    • Neither do they have introspective subroutines that can observe the pedigrees and generate sentences about them.

Page 67: Big Data And Analytics Challenges and Issues

BDA-67

Context-Aware Requirements

• John McCarthy continued:
  – A system should be able to answer the questions: “Why do I believe p?” or alternatively “Why don't I believe p?”.
  – Contexts need to be modeled as objects that represent mental states of events/things/people in the world around it.
  – The ability to transcend one's present context and think about it as an object is an important form of introspection.
  – Knowing what goals it can currently achieve and what its choices are for action.
    • He claims that the ability to understand one's own choices constitutes free will.

Page 68: Big Data And Analytics Challenges and Issues

BDA-68

Context-Aware Requirements

• On the computer side of the partnership:
  – System must be dynamically extensible:
    • One can incorporate a new module with new functionality, incrementally, without the need for recompilation
    • Indeed, the system environment should be (nearly) wholly self-contained.
  – Self-Modifying:
    • Ability to generate new or revised modules that change the functionality of the system
    • Modules may affect any representation or control mechanisms of the system
  – Self-Adapting:
    • System must be able to integrate new modules into its computational repertoire without ceasing operations, except for snapshotting and restarting critical processes.
    • Safety and health of the system must be ensured.

• Note: Humans already do this, but not very well at times.

Page 69: Big Data And Analytics Challenges and Issues

Context-Aware Computing

• Challenges:
  – New situations don’t fit examples
  – How to use in practice?
    • Presentation to user
    • Types of analytics
  – How to model context in a computational environment?
    • Spatial
    • Temporal
    • Dynamically changing: velocity, volume, ….
• Is it a Big Data problem?

Page 70: Big Data And Analytics Challenges and Issues

BDA-70

Advanced Analytics: Critical Challenges

Page 71: Big Data And Analytics Challenges and Issues

BDA-71

What Are Grand Challenges?

• Definition: A specific scientific or technological innovation that would remove a critical barrier to solving an important domain problem with a high likelihood of global impact and feasibility.
  – Provide scope for engineering ambition to build something that has never been seen before.
  – Generally comprehensible, and capture the imagination of the general public, as well as the esteem of scientists in other disciplines.
  – Go beyond what is initially possible, and require development of understanding, techniques and tools unknown at the start of the project.
  – Since these first appeared in the 80s, they abound in every science and discipline!
• Not just a restatement of the many “big problems” facing the world today
  – Are they equivalent to Wicked Problems? Or are Grand Challenges Wicked Problems?
• A tool for focusing investigators working towards overcoming one or more bottlenecks in a foreseeable path toward a solution to significant domain problems.

Page 72: Big Data And Analytics Challenges and Issues

Some Grand Challenges

• Modeling Our Planet’s Systems:
  – Assessing global warming and determining mitigating actions

• Confronting Existential Risk:
  – What is the impact of a dangerous genetically modified pathogen?

• Exploring Transhumanism:
  – What is the impact of embedded nanotechnology, genetic therapy, and “smart” prosthetics?

• The Singularity?
  – What happens when systems approach the level of human intelligence? Emotional intelligence?

• Dealing Effectively with Globalism:
  – Modeling the interconnectedness of human societies/organizations

Ref: Martin, J. “The Meaning of the 21st Century: A Vital Blueprint for Ensuring Our Future”, Jan 2007

Page 73: Big Data And Analytics Challenges and Issues

BDA-73

Wicked Problems

• In 1973, Horst Rittel and Melvin Webber formally described the concept of wicked problems.

• Conventional problem solving methods, rooted in 18th century physics, economics and engineering, focused on efficiency.

• Societal problems are fundamentally different from the types of problems that scientists and engineers deal with.

• Societal problems are wicked problems.

Page 74: Big Data And Analytics Challenges and Issues

BDA-74

Tenets of Wicked Problems

• No general agreement on what the problem is.

• You don’t understand the problem until you develop a solution.

• Wicked problems have no stopping rule.

• Solutions to wicked problems are not right or wrong.

• Every wicked problem is essentially novel and unique.

• Every solution to a wicked problem is a ‘one shot operation’.

• Wicked problems have no given alternative solutions.

• Causes and Effects are Elusive

• Sensitive to Initial and Boundary Conditions (History)

Figure 3. Wicked Problems (Conklin, 2005)

Page 75: Big Data And Analytics Challenges and Issues

BDA-75

Some Wicked Problems

• Infrastructure Resilience
• Climate Change
• “Peaking” Oil or Coal: When does it run out?
• The “Long War”: Is there an end to terrorism?
• Sustainable Cities and Ecosystems
• Sustainable Development in the Third World
• Affordable Health Maintenance for an Aging Society
• Transitioning to Democracy and Beyond
  – Predicting the next “Arab Spring”
• Biological and Genetic Threats and Opportunities
• Reducing the U.S. Debt
• Discover drugs that minimize disease-resistant micro-organisms

Page 76: Big Data And Analytics Challenges and Issues

BDA-76

Analytics Challenges

• The Google Property ?s:
  – Can analyses improve with more data to process?
  – Can analyses improve with more detailed analytics that we use?
• Kaisler, Armour, Espinosa, Money:
  – Can analyses improve with better system and environment models?
  – How do we measure value of an analytic?
  – What is the limit for value as we add more data?
  – Can good algorithms, models, heuristics overcome data quality problems?
  – With more data to analyze, can Big Data improve decision-making? And, by how much (e.g., how do we measure it?)

Page 77: Big Data And Analytics Challenges and Issues

BDA-77

Challenge: Population Imbalance

• Events of interest occur relatively infrequently in very large datasets.
• However, non-interesting events occur more frequently:
  – May require {additional, extensive} computational effort.
• Reasons for Imbalance:
  – Underrepresented data/severe class distribution skew:
    • Large regions of the problem space may be covered sporadically or not at all by the observer or observation instruments.
    • Impacts precision and performance of data mining/machine learning algorithms (He and Garcia 2009).
  – Data collection may be (usually is) imperfect.
  – Data are often beset with noise.
  – Data may be missing in longitudinal or temporal sequences.
• Enough relevant data of good quality may not be available to permit robust analysis.
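A small worked example shows why imbalance distorts the usual performance measures. The data below are made up for illustration (10 events of interest in 1,000 records) and the function name `evaluate` is hypothetical. A classifier that never flags anything looks 99% accurate yet finds none of the events.

```python
def evaluate(y_true, y_pred, positive=1):
    """Accuracy, precision and recall for one 'positive' (event of interest) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Conventions assumed here: report 0.0 when the denominator is empty.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Synthetic, illustrative data: 10 events of interest among 1,000 records.
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000  # a "classifier" that never raises an alert
scores = evaluate(y_true, always_negative)
```

Here `scores["accuracy"]` is 0.99 while `scores["recall"]` is 0.0, so accuracy alone says little about rare events; precision and recall on the minority class are more informative.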

Page 78: Big Data And Analytics Challenges and Issues

BDA-78

Challenge: Data Analysis

• Feature Selection: Information is distributed in a complex way across many features.

• Mitigating False Alarms: Target patterns are ambiguous/unknown; “squelch” settings are brittle; cannot prevent false positives

• Domain Drift: Target patterns change/morph over time and across operational modes (processing methods become “stale”)

Page 79: Big Data And Analytics Challenges and Issues

BDA-79

Challenge: Ethical Problems

• Should we use data without the permission of individual owners, such as copying publicly available data?
  – What is tacit permission and approval, anyway?
• Should we be required to inform individuals when we use their data?
  – Do they really own it? Or, how much of it?
• Should we (be required to) check the accuracy of what is posted on a publicly available web site before using it?
• What rules and regulations should exist about combining data about individuals into a central repository?
  – Does aggregation exceed permissible need to know about an individual?

Page 80: Big Data And Analytics Challenges and Issues

BDA-80

Challenges: Data Annotation

• The semantic web is largely unrealized:
  – Much metadata is lousy
  – Standardization of models and labels is a major issue
  – Integrating ontologies and vocabularies is a critical problem
• No standards for social tagging, science tagging, etc.
• Tagging only works if many are tagging
• What is the Quality of the Result if the Quality of the Data/Metadata is poor?
• Mark Greaves: “If we don’t have semantic convergence, then semantics isn’t a differentiator”
• The “information provenance” problem

Page 81: Big Data And Analytics Challenges and Issues

BDA-81

Challenges: End-to-End Systems

• Wicked Problems will require an end-to-end system approach:
  – Multiple analytics: cascade or mesh or other topology for the analytic architecture
    • Why? Different types of data (may) require different approaches
  – Need a robust, reliable computing/infrastructure environment
    • Pipelines are a relatively simple model; may not be adequate for complexity of problem
• Must consider end-to-end behavior:
  – Bottlenecks
  – Data conversion, transfer, & loading overheads
  – Storage costs & other parts of the data life-cycle
  – Resource management challenges
  – Total Cost of Ownership (TCO)

Page 82: Big Data And Analytics Challenges and Issues

BDA-82

Challenges: Data Sources

• Tolerant Analysis – you are typically doing open-world reasoning
  – Things go away
  – Contradiction is present
  – Data is incomplete and may be erroneous/noisy
  – Surveying the KB may not/is not possible (it is too large!!)
  – Need substantial domain knowledge to reason effectively
    • Both deep and shallow knowledge
• Multiple linked ontologies & data sources:
  – Single ontologies are feasible only at the organizational level
  – How to reconcile multiple authors and overlapping data sources
  – Contain both private and public knowledge w/ security, privacy & separation issues
    • Is this equivalent to the multilevel security problem?
• Heterogeneity and Incompleteness:
  – Humans are very tolerant of heterogeneous data; computers are not
  – Need to perform dynamic cleansing, curation, and transformation
  – How to build programs that accept heterogeneous data

Page 83: Big Data And Analytics Challenges and Issues

Challenge: Metrics


Page 84: Big Data And Analytics Challenges and Issues

BDA-84

Where is the ROI?

• ROI (Return on Investment) is not always immediately obvious
• Results of analytics may be available only after years of following the prescription
• Requires long-term effort(s) to develop a sustainable capability
• Examples:
  – Health: moving from predictive to preventative health care
  – Health: enabling personalized medicine for shortened time to value
  – Health: recognizing and predicting the spread of infectious diseases (Ebola)
  – Crime: aggressively recognizing and combating syndicated, multiparty fraud online
  – Crime: predicting potential crime locales and time to preventatively deploy police
  – Environment: predicting weather, floods, earthquakes, volcanic eruptions earlier
  – Computer Security: surveying systems to predict potential for attacks

Page 85: Big Data And Analytics Challenges and Issues

BDA-85

Advanced Analytics: Key Challenges Today for Tomorrow

• Analytic Scientists or lack thereof
  – How are we going to train them?
  – How many do we need?

• Potential to end up like “data mining” (shudder)
  – The Big Data mantra: “80% of the effort is in extracting, moving, cleaning, and preparing the data, not actually analyzing it.”

• Don’t disregard Traditional Analytics:
  – Big Data Analytics and Advanced Analytics will be side by side for years
  – We need an analytics capability beyond what is offered by traditional business analytics

Page 86: Big Data And Analytics Challenges and Issues

BDA-86

Analytics: Transformative Science

• Making Analytics a Transformative Science:
  – Computationally tractable reasoning algorithms
  – Explicit models for prediction, prescription and decision versus implicit or embedded models in technology
  – Enabling technologies and infrastructures for modeling and analysis, including interoperability interfaces and standards: multiple suites of analytic tools
  – Model validation and verification
  – Ensuring availability of appropriate, quality data
  – Uncertainty quantification and predictability of outcomes
  – Community-developed versus custom-made software
  – Peer review for analytics and modeling research – open repositories
  – Analytic Method Capture and Reuse

Page 87: Big Data And Analytics Challenges and Issues

BDA-87

Is this ….?

Oooops! We mean the Exabyte Age!

Page 88: Big Data And Analytics Challenges and Issues

BDA-88

Questions

Page 89: Big Data And Analytics Challenges and Issues

BDA-89

Who We Are

Stephen H. Kaisler, D.Sc.
Senior Associate/PCI Strategic Management & Principal/SHK & Associates
Columbia MD/Laurel, MD
[email protected]
Dr. Stephen Kaisler is currently a Senior Associate at PCI Strategic Management and a Principal in SHK & Associates. He has previously worked for DARPA, the U.S. Senate and a number of small businesses. Dr. Kaisler has worked with big data, MapReduce technology, and advanced analytics in support of the ODNI CATALYST program. He has been an Adjunct Professor of Engineering since 2002 in the Department of Computer Science at George Washington University. Recently, he has also taught enterprise architecture and information security in the GWU Business School. He earned a D.Sc. (Computer Science) from George Washington University, and an M.S. (Computer Science) and B.S. (Physics) from the University of Maryland at College Park. He has written or co-authored seven books and published over 38 technical papers.

William H. Money, Ph.D.
School of Business Administration
The Citadel
[email protected]
Dr. Money joined the Citadel as Associate Professor of Business Administration in 2014. Previously, he was with the George Washington University School of Business faculty, which he joined in September 1992 after acquiring over 12 years of management experience in the design, development, installation, and support of management information systems (1980-92). His publications and recent research interests focus on information system development tools and agile software engineering methodologies, collaborative solutions to complex business problems, program management, business process engineering, and individual learning. He developed teaching and facilitation techniques that prepare students to use collaboration tools in complex organizations and dynamic work environments experiencing significant change. Dr. Money has a Ph.D., Organizational Behavior, 1977, Northwestern University, Graduate School of Management; an M.B.A., Management, 1969, Indiana University; and a B.A., Political Science, 1968, University of Richmond.

Page 90: Big Data And Analytics Challenges and Issues

BDA-90

Who We Are

Frank Armour, Ph.D.Kogod School of Business/American [email protected]. Armour is an independent senior IT consultant and Research Fellow at the Center for Information Technology in

the Global Environment (CITGE), Kogod Business School, American University. Dr. Armour has extensive experience applying advanced information technology. His work and research includes business and requirements analysis, enterprise architectures, System Development Cycle Development (SDLC), and object-oriented development. Dr. Armour has consulted for both government and private organizations on the effective application of enterprise architecture, IT Governance and system requirements approaches. In a previous position, at a major IT consulting firm, he had a joint appointment as the lead Object Methodologist and as the Assistant Director of the Object Technology Lab. In this position he provided guidance and in-depth mentoring to object projects on object concepts, architecture, project management, methods and tools.

J. Alberto Espinosa, Ph.D.
Kogod School of Business/American University
[email protected]

Dr. Espinosa is currently Professor and Chair of Information Technology at the Kogod School of Business, American University. He holds Ph.D. and Master of Science degrees in Information Systems from Carnegie Mellon University, Graduate School of Industrial Administration; a Master's degree in Business Administration from Texas Tech University; and a Mechanical Engineering degree from Universidad Catolica, Peru. His research focuses on coordination and performance in technical projects that span global boundaries, particularly distance and time separation (e.g., time zones). His work has been published in leading scholarly journals, including: Management Science; Organization Science; Information Systems Research; the Journal of Management Information Systems; Communications of the ACM; Information, Technology and People; and Software Process: Improvement and Practice. He is also a frequent presenter at leading academic conferences.

Page 91: Big Data And Analytics Challenges and Issues

BDA-91

Thank You!!

Page 92: Big Data And Analytics Challenges and Issues

BDA-92

References

• Argenta, C., J. Benson, N. Bos et al. 2014. “Sensemaking in Big Data Environments”, 1st Workshop on Human-Centered Big Data Research, Raleigh, NC

• Borne, K. 2013. Statistical Truisms in the Age of Big Data, retrieved December 2013 from http://www.statisticsviews.com/details/feature/4911381/Statistical-Truisms-in-the-Age-of-Big-Data.html

• Clarkson, A. 1981. Towards Effective Strategic Analysis, Westview Press, Boulder, CO

• Davenport, T. H. and J. G. Harris. 2007. Competing on Analytics: The New Science of Winning, Harvard Business School Press

• Felten, E. 2010. Needle in a Haystack Problems, retrieved November 1, 2013 from https://freedom-to-tinker.com/blog/felten/needle-haystack-problems/

• Gladwell, M. 2000. The Tipping Point: How Little Things Can Make a Big Difference, Little Brown, Boston

• He, H. and E. A. Garcia. 2009. “Learning from Imbalanced Data”, IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284

• Heuer, R. J., Jr. 1999. Psychology of Intelligence Analysis, Center for the Study of Intelligence, Central Intelligence Agency, Washington, D.C.

• Kaisler, S. 1990. Strategic Automated Discovery System (STRADS), with C. Oresky, A. Clarkson, and D. B. Lenat, published in Knowledge Based Simulation: Methodology and Application, ed. by P. Fishwick and D. Modjeski, Springer-Verlag, December 1990


Page 93: Big Data And Analytics Challenges and Issues

BDA-93

References

• Kaisler, S. 2005. Software Paradigms, New York, NY: John Wiley & Sons

• Kaisler, S. and C. Cioffi-Revilla. 2007. Quantitative and Computational Social Sciences Tutorial, 40th Hawaii International Conference on System Sciences, Waikoloa, HI

• Kaisler, S. 2012. Advanced Analytics, Technical Report prepared under AFRL contract FA8750-11-C-0045

• Kaisler, S., F. Armour, A. Espinosa, and W. Money. 2013. “Big Data: Issues and Challenges Moving Forward”, 46th Hawaii International Conference on System Sciences, Grand Wailea, Maui, HI

• Kaisler, S., F. Armour, A. Espinosa, and W. Money. 2014. “Advanced Analytics: Issues and Challenges”, 47th Hawaii International Conference on System Sciences, Hilton Waikoloa, Big Island, HI

• Kaisler, S., F. Armour, W. Money, and A. Espinosa. 2014. “Big Data: Issues and Challenges”, Encyclopedia of Science and Technology, 3rd Edition, IGI Global

• Kaisler, S., F. Armour, A. Espinosa, and W. Money. 2014. “Advanced Analytics: Issues and Challenges”, Encyclopedia of Science and Technology, 3rd Edition, IGI Global

Page 94: Big Data And Analytics Challenges and Issues

BDA-94

References

• Ritchey, T. 2005. Wicked Problems: Structuring Social Messes with Morphological Analysis, Swedish Morphological Society. Retrieved October 30, 2013 from http://www.swemorph.com/wp.html

• Rittel, H. and M. Webber. 1973. “Dilemmas in a General Theory of Planning”, in Policy Sciences, Vol. 4 (pp. 155-169). Amsterdam, the Netherlands: Elsevier Scientific.

• Schwartz, P. M. 2010. Data Protection Law and The Ethical Use of Analytics, The Centre for Information Policy Leadership, Hunton & Williams, LLP.

• Singh, L., E.J. Bienenstock and J. Mann. 2010. What are we missing? Perspectives on social network analysis for observational scientific data, Handbook of Social Networks: Technologies and Applications. Ed. B. Furht., Springer

• Stanton, J. 2013. Version 3: An Introduction to Data Science, http://jsresearch.net/

• Suchman, L. 1987. Plans and Situated Actions: The Problem of Human-Machine Communication, Cambridge University Press, Cambridge, England

Page 95: Big Data And Analytics Challenges and Issues

BDA-95

References

• Talburt, J. 2009-2011. Reference Linking Methods, Identity Resolution Daily, retrieved November 3, 2013 from http://identityresolutiondaily.com/

• Tufte, E. R. 1997. Visual & Statistical Thinking: Displays of Evidence for Decision Making. Graphics Press.

• Thomas, J. J. and K. A. Cook, Eds. 2005. Illuminating the Path – the Research and Development Agenda for Visual Analytics, IEEE Computer Society

• Varian, H. 2009. McKinsey Quarterly, retrieved October 30, 2013 from http://www.mckinseyquarterly.com/Hal_Varian_on_how_the_Web_challenges_managers_2286

• Weick, K. E. 1995. Sensemaking in Organizations. Thousand Oaks, CA: Sage Publications.

• Weiser, M. 1991. “The Computer for the Twenty-First Century”, Scientific American, 265(3):94-104

• Wong, P. C. and J. Thomas. 2004. “Visual Analytics”, IEEE Computer Graphics and Applications, 24(5):20-21

• Wong, P. C., H-W. Shen, C. R. Johnson, C. Chen, and R. B. Ross. 2012. “The Top 10 Challenges in Extreme-Scale Visual Analytics”, IEEE Computer Graphics and Applications, 32(4):63-67