data science innovations : democratisation of data and data science

Innovations in Data Science: Systems of Insight linkedin.com/in/sureshsood @soody

Upload: suresh-sood

Post on 14-Feb-2017


Page 1: Data Science Innovations : Democratisation of Data and Data Science

Innovations in Data Science: Systems of Insight

linkedin.com/in/sureshsood

@soody

Page 2: Data Science Innovations : Democratisation of Data and Data Science

Areas for Conversation

Data Science

Data Science Innovation

Democratisation of big data

Gartner & Forrester Trends

Systems of Insight

Page 3: Data Science Innovations : Democratisation of Data and Data Science

BEST USE OF CLEVER DATA: The Works with Suresh Sood, Advanced Analytics Institute, UTS & Professor James Pennebaker from the University of Texas - The Deceit Algorithm

BEST INSIGHT: The Works with Suresh Sood, Advanced Analytics Institute, UTS & Professor James Pennebaker from the University of Texas - The Deceit Algorithm

www.datafication.com.au

Woodside & Sood, Vignettes in the two-step arrival of the internet of things and its reshaping of marketing management’s service-dominant logic, Journal of Marketing Management, Volume 33, 2017, Issue 1-2: The Internet of Things (IoT) and Marketing: The State of Play, Future Trends and the Implications for Marketing

Page 4: Data Science Innovations : Democratisation of Data and Data Science

Statistics, Data Mining or Data Science?

• Statistics – precise deterministic causal analysis over precisely collected data

• Data Mining – deterministic causal analysis over re-purposed data, carefully sampled

• Data Science – trending/correlation analysis over existing data using the bulk of the population, i.e. big data

– Extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and hypothesis testing.

Adapted from: NIST Big Data taxonomy draft report (see http://bigdatawg.nist.gov/show_InputDoc.php)
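The "discovery, hypothesis, and hypothesis testing" step can be illustrated with a small sketch. This is a toy under stated assumptions, not a method taken from the NIST report: it runs an exact two-sided permutation test on the difference of means, and both the function name `permutation_p_value` and the two samples are made up for illustration.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def mean_diff(indices_a):
        # Mean of the values assigned to group A minus mean of the rest.
        sum_a = sum(pooled[i] for i in indices_a)
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(range(n_a)))
    # Enumerate every way of relabelling the pooled values into two groups.
    splits = list(combinations(range(len(pooled)), n_a))
    extreme = sum(1 for s in splits if abs(mean_diff(s)) >= observed - 1e-12)
    return extreme / len(splits)


# Hypothetical measurements, for illustration only.
with_feature = [12, 15, 14, 16]
without_feature = [8, 9, 11, 10]
p_value = permutation_p_value(with_feature, without_feature)
print(f"two-sided p-value: {p_value:.4f}")
```

Exhaustive enumeration only works for small samples; for larger ones a random sample of relabellings is the usual substitute.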

Page 5: Data Science Innovations : Democratisation of Data and Data Science

Useful References

Big Data

• NIST Big Data Interoperability Framework (NBDIF) V1.0 Final Version (September 2015)

Big Data Definitions: http://dx.doi.org/10.6028/NIST.SP.1500-1

Big Data Taxonomies: http://dx.doi.org/10.6028/NIST.SP.1500-2

Big Data Use Cases and Requirements: http://dx.doi.org/10.6028/NIST.SP.1500-3

Big Data Security and Privacy: http://dx.doi.org/10.6028/NIST.SP.1500-4

Big Data Architecture White Paper Survey: http://dx.doi.org/10.6028/NIST.SP.1500-5

Big Data Reference Architecture: http://dx.doi.org/10.6028/NIST.SP.1500-6

Big Data Standards Roadmap: http://dx.doi.org/10.6028/NIST.SP.1500-7

• Apache Spark 2.1.0 Documentation

Machine Learning Library (MLlib) Guide http://spark.apache.org/docs/latest/ml-guide.html

GraphX Programming Guide http://spark.apache.org/docs/latest/graphx-programming-guide.html

SparkR (R on Spark) http://spark.apache.org/docs/latest/sparkr.html#sparkdataframe

Spark SQL, DataFrames and Datasets Guide http://spark.apache.org/docs/latest/sql-programming-guide.html

Page 6: Data Science Innovations : Democratisation of Data and Data Science

Data Science Innovation

Data science innovation is something an organization has not done before or even something nobody anywhere has done before. A data science innovation focuses on discovering and using new or untraditional data sources to solve new problems.

Adapted from: Franks, B. (2012), Taming the Big Data Tidal Wave, p. 255, John Wiley & Sons

Page 7: Data Science Innovations : Democratisation of Data and Data Science

Variety of Data Types & Big Data Challenge

1. Astronomical
2. Documents
3. Earthquake
4. Email
5. Environmental sensors
6. Fingerprints
7. Health (personal) Images
8. Graph data (social network)
9. Location
10. Marine
11. Particle accelerator
12. Satellite
13. Scanned survey data
14. Sound
15. Text
16. Transactions
17. Video

Big Data consists of extensive datasets primarily in the characteristics of volume, variety, velocity, and/or variability that require a scalable architecture for efficient storage, manipulation, and analysis.

Computational portability is the movement of the computation to the location of the data.

Page 8: Data Science Innovations : Democratisation of Data and Data Science

Internet of Things “trillion sensors”

Source: www.tsensorssummit.org

Page 9: Data Science Innovations : Democratisation of Data and Data Science

• The data collected in a single day would take nearly two million years to play back on an MP3 player
• Generates enough raw data to fill 15 million 64GB iPods every day
• The central computer has the processing power of about one hundred million PCs
• Uses enough optical fiber linking up all the radio telescopes to wrap twice around the Earth
• The dishes, when fully operational, will produce 10 times the global internet traffic as of 2013
• The supercomputer will perform 10^18 operations per second (equivalent to the number of stars in three million Milky Way galaxies) in order to process all the data produced
• Sensitivity to detect an airport radar on a planet 50 light years away
• Thousands of antennas with a combined collecting area of 1,000,000 square meters (1 sq km)
• Previous mapping of the Centaurus A galaxy took a team 12,000 hours of observations and several years; SKA ETA: 5 minutes!

To the scientists involved, however, the SKA is no testbed, it’s a transformative instrument which, according to Luijten, will lead to “fundamental discoveries of how life and planets and matter all came into existence. As a scientist, this is a once in a lifetime opportunity.”

Sources: http://bit.ly/amazin-facts & http://bit.ly/astro-ska

Galileo

Square Kilometer Array Construction (SKA1 - 2018-23; SKA2 - 2023-30)

Centaurus A
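The headline SKA figures are easier to compare once converted to common units. The short calculation below is only a back-of-envelope sketch using the numbers quoted on this slide; the decimal unit definitions (1 GB = 10^9 bytes, 1 PB = 10^15 bytes) are an assumption, and the variable names are illustrative.

```python
# Figures quoted on the slide.
IPODS_PER_DAY = 15_000_000
IPOD_CAPACITY_GB = 64
OPS_PER_SECOND = 10**18
MILKY_WAY_GALAXIES = 3_000_000

# Assumption: decimal (SI) units.
BYTES_PER_GB = 10**9
BYTES_PER_TB = 10**12
BYTES_PER_PB = 10**15
SECONDS_PER_DAY = 24 * 60 * 60

# Raw data volume per day and the sustained rate that implies.
raw_bytes_per_day = IPODS_PER_DAY * IPOD_CAPACITY_GB * BYTES_PER_GB
raw_pb_per_day = raw_bytes_per_day / BYTES_PER_PB
raw_tb_per_second = raw_bytes_per_day / SECONDS_PER_DAY / BYTES_PER_TB

# Stars per galaxy implied by the "three million Milky Ways" comparison.
implied_stars_per_galaxy = OPS_PER_SECOND / MILKY_WAY_GALAXIES

print(f"Raw data: {raw_pb_per_day:,.0f} PB per day")
print(f"Sustained rate: {raw_tb_per_second:.1f} TB per second")
print(f"Implied stars per galaxy: {implied_stars_per_galaxy:.2e}")
```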

Page 10: Data Science Innovations : Democratisation of Data and Data Science

New Sources of Information (Big data) : Social Media + Internet of Things Innovations

Gridded Data Sources

Page 11: Data Science Innovations : Democratisation of Data and Data Science

The following BigQuery query retrieves GDELT Global Knowledge Graph records tagged with Islamic State themes together with terror or suicide-attack themes (note that the wildcard on "TAX_WEAPONS_SUICIDE_" catches suicide vests, suicide bombers, suicide bombings, suicide jackets, and so on):

SELECT DATE, DocumentIdentifier, SourceCommonName, V2Themes, V2Locations, V2Tone, SharingImage, TranslationInfo
FROM [gdeltv2.gkg]
WHERE (V2Themes LIKE '%TAX_TERROR_GROUP_ISLAMIC_STATE%'
    OR V2Themes LIKE '%TAX_TERROR_GROUP_ISIL%'
    OR V2Themes LIKE '%TAX_TERROR_GROUP_ISIS%'
    OR V2Themes LIKE '%TAX_TERROR_GROUP_DAASH%')
  AND (V2Themes LIKE '%TERROR%TERROR%'
    OR V2Themes LIKE '%SUICIDE_ATTACK%'
    OR V2Themes LIKE '%TAX_WEAPONS_SUICIDE_%')

The GDELT Project pushes the boundaries of “big data,” weighing in at over a quarter-billion rows with 59 fields for each record, spanning the geography of the entire planet, and covering a time horizon of more than 35 years. The GDELT Project is the largest open-access database on human society in existence. Its archives contain nearly 400M latitude/longitude geographic coordinates spanning over 12,900 days, making it one of the largest open-access spatio-temporal datasets as well.

GDELT + BigQuery = Query The Planet
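To see what the LIKE wildcards in the query above actually match, the filter can be mimicked locally. The sketch below is an illustration only: it assumes standard SQL LIKE semantics ("%" matches any run of characters, "_" matches exactly one character), the helper names are hypothetical, and the sample theme strings are invented rather than real GDELT records.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into an anchored regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$", re.DOTALL)


# Patterns copied from the two OR-groups of the WHERE clause.
GROUP_PATTERNS = [like_to_regex(p) for p in (
    "%TAX_TERROR_GROUP_ISLAMIC_STATE%",
    "%TAX_TERROR_GROUP_ISIL%",
    "%TAX_TERROR_GROUP_ISIS%",
    "%TAX_TERROR_GROUP_DAASH%",
)]
ATTACK_PATTERNS = [like_to_regex(p) for p in (
    "%TERROR%TERROR%",
    "%SUICIDE_ATTACK%",
    "%TAX_WEAPONS_SUICIDE_%",
)]


def matches_query(v2themes):
    """True when a theme string satisfies both OR-groups of the WHERE clause."""
    has_group = any(rx.match(v2themes) for rx in GROUP_PATTERNS)
    has_attack = any(rx.match(v2themes) for rx in ATTACK_PATTERNS)
    return has_group and has_attack


# Invented theme strings, for illustration only.
samples = [
    "TAX_TERROR_GROUP_ISIS,12;TAX_WEAPONS_SUICIDE_VEST,88",
    "TAX_TERROR_GROUP_ISIS,12;ECON_STOCKMARKET,5",
    "TERROR,3;SUICIDE_ATTACK,9",
]
for themes in samples:
    print(matches_query(themes), themes)
```

The second sample fails because '%TERROR%TERROR%' needs "TERROR" to appear at least twice, and the third fails because no group theme is present.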

Page 12: Data Science Innovations : Democratisation of Data and Data Science

Oil reserves shipment monitoring

Ras Tanura Najmah compound, Saudi Arabia

Source: http://www.skyboximaging.com/blog/monitoring-oil-reserves-from-space

Page 13: Data Science Innovations : Democratisation of Data and Data Science

https://nodexl.codeplex.com/

Page 14: Data Science Innovations : Democratisation of Data and Data Science

Key Network Measures

• Degree Centrality
• Betweenness Centrality
• Closeness Centrality
• Eigenvector Centrality

krackkite.##h (modified labels)

Connector (hub)

Diana’s Clique

Broker

Boundary spanners

Contractor ? Vendor
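The four network measures listed on this slide can be computed on a small graph with nothing but the standard library. The sketch below is a minimal illustration, not the NodeXL implementation: the edge list is the classic ten-node Krackhardt kite as commonly published (an assumption, since the slide's own file and its modified labels are not reproduced here), nodes are numbered 0 to 9, and the graph is assumed to be connected and undirected.

```python
from collections import deque

# Assumption: classic Krackhardt kite adjacency list (10 nodes, 18 edges).
KITE = {
    0: [1, 2, 3, 5], 1: [0, 3, 4, 6], 2: [0, 3, 5],
    3: [0, 1, 2, 4, 5, 6], 4: [1, 3, 6], 5: [0, 2, 3, 6, 7],
    6: [1, 3, 4, 5, 7], 7: [5, 6, 8], 8: [7, 9], 9: [8],
}


def degree_centrality(adj):
    """Share of the other nodes each node is directly connected to."""
    return {v: len(nbrs) / (len(adj) - 1) for v, nbrs in adj.items()}


def closeness_centrality(adj):
    """(n - 1) divided by the sum of shortest-path distances from each node."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        result[source] = (len(adj) - 1) / sum(dist.values())
    return result


def betweenness_centrality(adj):
    """Unnormalised shortest-path betweenness (Brandes' accumulation)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair is counted from both endpoints, so halve.
    return {v: b / 2 for v, b in score.items()}


def eigenvector_centrality(adj, iterations=200):
    """Power iteration on the adjacency matrix, normalised to unit length."""
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        new = {v: sum(score[w] for w in adj[v]) for v in adj}
        norm = sum(x * x for x in new.values()) ** 0.5
        score = {v: x / norm for v, x in new.items()}
    return score


degree = degree_centrality(KITE)
closeness = closeness_centrality(KITE)
betweenness = betweenness_centrality(KITE)
eigenvector = eigenvector_centrality(KITE)
for name, measure in [("degree", degree), ("closeness", closeness),
                      ("betweenness", betweenness), ("eigenvector", eigenvector)]:
    print(name, "top node:", max(measure, key=measure.get))
```

On this graph the measures disagree about who matters most, which is the point of the kite: the hub has the most ties, while the node bridging to the tail sits on the most shortest paths.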

Page 15: Data Science Innovations : Democratisation of Data and Data Science
Page 16: Data Science Innovations : Democratisation of Data and Data Science

Sherman and Young (2016), When Financial Reporting Still Falls Short, Harvard Business Review, July-August

Sood (2015), Truth, Lies and Brand Trust: The Deceit Algorithm, http://datafication.com.au/

New Analytical Tools Can Help

Page 17: Data Science Innovations : Democratisation of Data and Data Science

Page 18: Data Science Innovations : Democratisation of Data and Data Science

The Newman Model of Deception (Pennebaker et al.)

Key word categories for deception mapping:

(1) Self words, e.g. “I” and “me”, decrease when someone distances themselves from the content

(2) Exclusive words, e.g. “but” and “or”, decrease with fabricated content owing to the complexity of maintaining the deception

(3) Negative emotion words, e.g. “hate”, increase owing to feelings of shame or guilt

(4) Motion verbs, e.g. “go” or “move”, increase as exclusive words go down, to keep the story on track
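The four word categories above can be turned into simple per-text rates. The sketch below is a toy illustration and not the Deceit Algorithm itself: the word lists are tiny stand-ins seeded with the examples on this slide plus a few assumed additions, and the function name `category_rates` is hypothetical. A real analysis would use full dictionaries and compare rates against a baseline of truthful text.

```python
import re

# Assumption: tiny illustrative word lists, not a complete dictionary.
CATEGORIES = {
    "self": {"i", "me", "my", "myself"},
    "exclusive": {"but", "or", "except", "without"},
    "negative_emotion": {"hate", "worthless", "enemy"},
    "motion": {"go", "move", "walk"},
}


def category_rates(text):
    """Return the share of words in `text` that fall in each category."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid dividing by zero on empty text
    return {
        name: sum(1 for w in words if w in vocab) / total
        for name, vocab in CATEGORIES.items()
    }


# Made-up sentence, for illustration only.
rates = category_rates("I hate to go but I must move")
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

Under the model, lower self and exclusive rates combined with higher negative-emotion and motion rates point towards deception; the direction of each shift matters more than any single value.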

Page 19: Data Science Innovations : Democratisation of Data and Data Science

Page 20: Data Science Innovations : Democratisation of Data and Data Science

Page 21: Data Science Innovations : Democratisation of Data and Data Science

Language on Twitter Tracks Rates of Coronary Heart Disease, Psychological Science, January 2015

The findings show that expressions of negative emotions such as anger, stress, and fatigue in the tweets from people in a given county were associated with higher heart disease risk in that county. On the other hand, expressions of positive emotions like excitement and optimism were associated with lower risk.

The results suggest that using Twitter as a window into a community’s collective mental state may provide a useful tool in epidemiology…So predictions from Twitter can actually be more accurate than using a set of traditional variables.

Page 22: Data Science Innovations : Democratisation of Data and Data Science

Twitter and Marketing Predictions

• Tweets are “found data”, gathered without asking questions

• More meaning than a typical search engine query

• Large numbers of passive participants in natural settings

• Twitter can predict the stock market (Lisa Grossman, Wired, Oct 19 2010)

• Predict movie success in first few weekends of release

• “…it also raises an interesting new question for advertisers and marketing executives. Can they change the demand for their film, product or service by directly influencing the rate at which people tweet about it? In other words, can they change the future that tweeters predict?”

Tech Review, http://www.technologyreview.com/blog/arxiv/25000/

Page 23: Data Science Innovations : Democratisation of Data and Data Science

http://www.analyzewords.com

Page 24: Data Science Innovations : Democratisation of Data and Data Science

By 2020-22: 100 million consumers shop in augmented reality

30% of web browsing sessions without a screen

Algorithms positively alter behavior of over 1B

Blockchain-based business worth $10B

IoT will save consumers/businesses $1T a year

40% of employees cut healthcare costs via fitness tracker

Strategic Predictions for 2017 and Beyond, research note, 14 October, http://www.gartner.com/document/3471568

2016 Hype Cycle for Business Intelligence and Analytics, 29 July, http://www.gartner.com/document/3388326

Gartner (2016)

Page 25: Data Science Innovations : Democratisation of Data and Data Science

“With the addition of NLG [Natural Language Generation], smart data discovery platforms automatically present a written or spoken context-based narrative of findings in the data that, alongside the visualization, inform the user about what is most important for them to act on in the data.”

Gartner, 29 June 2015

Smart Data Discovery Will Enable New Class of Citizen Data Scientist

Page 26: Data Science Innovations : Democratisation of Data and Data Science

Insights-driven businesses will generate $1.2 trillion in 2020

Forrester Research, 2016

Page 27: Data Science Innovations : Democratisation of Data and Data Science

© 2016 Forrester Research, Inc. Reproduction Prohibited

Insights-driven businesses are faster than large companies

[Chart: Revenue (billions), 2015 to 2020, axis from $0 to $1,250. Series: Public (27% CAGR) and Startup (40% CAGR).]

Global GDP will grow only 3.5% annually.

Source: Forrester, Morningstar, PitchBook, and The Economist Intelligence Unit
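The growth rates on this slide (27% CAGR, 40% CAGR and 3.5% annual GDP growth) compound very differently over the charted period. The sketch below works through that arithmetic; it assumes five compounding periods between 2015 and 2020, and the function names are illustrative rather than taken from Forrester.

```python
def growth_multiple(cagr, years):
    """How many times larger a value becomes after compounding at `cagr`."""
    return (1 + cagr) ** years


def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


YEARS = 2020 - 2015  # assumption: five compounding periods

for label, rate in [("27% CAGR", 0.27), ("40% CAGR", 0.40), ("3.5% GDP growth", 0.035)]:
    print(f"{label}: x{growth_multiple(rate, YEARS):.2f} over {YEARS} years")
```

For example, `implied_cagr(100, 200, 5)` gives the rate needed to double in five years, which is a little under 15% a year.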

Page 28: Data Science Innovations : Democratisation of Data and Data Science

Traditional Analytics: Slow & Expensive (80% of time sifting through data)

Led by Data Analyst or Scientist

[Diagram labels: Data Aggregation; Reports & Analysis; Visualisation & Interpretation; Write Data/Business “Story” Insights; Operationalise]

System of Insight (SoI): Fast & Cost Effective (80% of time in decision making with client)

SME owner, Machine Learning and Natural Language Generation

Fusion of data science, business knowledge & creativity for maximum ROI

[Diagram labels: IoT Data Aggregation or Data Set; Detect & Extract Patterns and Relationships; Generate Insights & Story; Process / Application]

Page 29: Data Science Innovations : Democratisation of Data and Data Science

Actionable Insights

1. What now?

2. So what?

3. Now what?

Page 30: Data Science Innovations : Democratisation of Data and Data Science

Companies are reimagining Business Processes with Algorithms and there is “evidence of significant, even exponential, business gains in customer’s customer engagement, cost & revenue performance”

Wilson, H., Alter, A. and Shukla, P. (2016), Companies Are Reimagining Business Processes with Algorithms, Harvard Business Review, February, https://hbr.org/2016/02/companies-are-reimagining-business-processes-with-algorithms

Page 31: Data Science Innovations : Democratisation of Data and Data Science

Better customer experiences . . . and half the inventory-carrying costs of other online fashion retailers.

Forrester, 2016

Page 32: Data Science Innovations : Democratisation of Data and Data Science

Systems of Insight

Automated pattern extraction

Outlier detection

Correlation

Time series

Analytics integration with process, app or IoT

https://ubereats.com/melbourne/

Page 33: Data Science Innovations : Democratisation of Data and Data Science

Outlier-detection techniques “allow detecting a significant fraction of fraudulent cases…different in nature from historical fraud…resulting in a novel fraud pattern”

Baesens, B., Van Vlasselaer, V. and Verbeke, W. (2015), Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection, Wiley
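A minimal outlier-detection sketch shows the idea of flagging cases that look unlike the rest without needing labelled historical fraud. This is not the method from Baesens et al.; it swaps in a simple modified z-score based on the median absolute deviation (MAD), with the commonly used 0.6745 scaling constant and 3.5 cut-off. The function name and the transaction amounts are made up for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # More than half the values are identical; the score is undefined.
        return []
    return [v for v in values if abs(0.6745 * (v - center) / mad) > threshold]


# Hypothetical transaction amounts, for illustration only.
amounts = [10, 11, 9, 10, 12, 10, 95]
flagged = mad_outliers(amounts)
print("flagged:", flagged)
```

Using the median rather than the mean keeps a single extreme value from masking itself by inflating the spread.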

Page 34: Data Science Innovations : Democratisation of Data and Data Science

The ANZ Heavy Traffic Index comprises flows of vehicles weighing more than 3.5 tonnes (primarily trucks) on 11 selected roads around NZ. It is contemporaneous with GDP growth.

The ANZ Light Traffic Index is made up of light or total traffic flows (primarily cars and vans) on 10 selected roads around the country. It gives a six month lead on GDP growth in normal circumstances (but cannot predict sudden adverse events such as the Global Financial Crisis).

ANZ TRUCKOMETER: http://www.anz.co.nz/about-us/economic-markets-research/truckometer/
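Whether one series leads another can be checked by correlating the two at different lags. The sketch below is a generic illustration of that idea, not ANZ's methodology: the two series are synthetic (the second is simply the first delayed by two periods), the periods are abstract rather than months, and the function names are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(leading, target, lag):
    """Correlate leading[t] with target[t + lag]."""
    if lag == 0:
        return pearson(leading, target)
    return pearson(leading[:-lag], target[lag:])


def best_lead(leading, target, max_lag):
    """Lag (in periods) at which `leading` correlates best with `target`."""
    return max(range(max_lag + 1),
               key=lambda lag: lagged_correlation(leading, target, lag))


# Synthetic series: `activity` repeats `traffic` two periods later.
traffic = [1, 3, 2, 5, 4, 6, 8, 7, 9, 12]
activity = [0, 0] + traffic[:-2]

lead = best_lead(traffic, activity, max_lag=3)
print("best lead (periods):", lead)
```

With real data the peak is rarely this clean, and a lead found on historical data can break down during sudden adverse events, as the slide notes.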

Page 35: Data Science Innovations : Democratisation of Data and Data Science

Systems of Insight

• Helps move away from “crisis levels” in talent

• Traditional 5-step analytics process reduced to 2 steps, from data to action

• Reimagine business processes through “machine engineering”

• Minimise messy data issues and data preparation time

Page 36: Data Science Innovations : Democratisation of Data and Data Science

Next Step

Start using Systems of Insight and innovative data sources

Page 37: Data Science Innovations : Democratisation of Data and Data Science

Data Science Resources

Page 38: Data Science Innovations : Democratisation of Data and Data Science

The future is impossible to predict.

However, one thing is certain:

The company that can excite its customers’ dreams is out ahead in the race to business success

Selling Dreams, Gian Luigi Longinotti