data discovery and visualization

A Data Scientist's Look at Data Discovery

Dr. Neil Brittliff PhD

“By far the most predictive method is Linear Regression”
Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber

University of Canberra - 2015

A little about myself…

• Currently working for Datacom Systems Pty Ltd
• Awarded a PhD at the University of Canberra in March this last year for my work in the Big Data space
• Have been employed by 4 law enforcement agencies
• Developed Cryptographic Software to support Medicare
• Worked in the IT industry since 1982
• Received my degree at UC in 1982

University of Canberra - 2016

Talk Structure

• What is Data Discovery
• Data Source Joining
• Ontologies and Taxonomies
• Rules of Data Discovery
• Single Source of Truth
• Data Visualisation - Infographics
• Summing Up


What is Data Discovery

Check out Jeff Jonas from IBM


Data Discovery and Data Understanding

19.2 C (temp) + $17.5 (cost) = 36.7
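The sum above is valid arithmetic but has no meaning, because the two numbers measure different things. A minimal Python sketch of how carrying the unit as metadata lets a program refuse such a sum. The `Quantity` class, its field names and the "AUD" currency label are illustrative assumptions, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A number that carries its unit as metadata (hypothetical helper)."""
    value: float
    unit: str

    def __add__(self, other: "Quantity") -> "Quantity":
        # Only quantities measured in the same unit can be added.
        if self.unit != other.unit:
            raise TypeError(f"cannot add {self.unit} to {other.unit}")
        return Quantity(self.value + other.value, self.unit)


temp = Quantity(19.2, "C")
cost = Quantity(17.5, "AUD")  # assumption: the slide only says "$"

try:
    temp + cost
    outcome = "added"
except TypeError as err:
    outcome = f"rejected: {err}"

print(outcome)
```

Without the unit, nothing stops 19.2 and 17.5 from being added; with it, the mismatch is caught before a meaningless total enters the data.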

Id    First Name  Last Name  IQ compared with Brian’s IQ
1001  Joe         Bloggs     M
1002  Fred        Nurk       L

Code  Description
M     More Intelligent
L     Less Intelligent

Id    First Name  Last Name  Description
1001  Joe         Bloggs     More Intelligent
1002  Fred        Nurk       Less Intelligent
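The step from the coded table to the readable one is a join against the code table. A small Python sketch of that lookup, using the values from the slide; the field names (`iq_code`, `description`) and the `decode` helper are assumptions made for the example.

```python
# Source records: the last column holds a code, not a meaning.
people = [
    {"id": 1001, "first_name": "Joe", "last_name": "Bloggs", "iq_code": "M"},
    {"id": 1002, "first_name": "Fred", "last_name": "Nurk", "iq_code": "L"},
]

# Reference (code) table that gives each code its description.
code_table = {"M": "More Intelligent", "L": "Less Intelligent"}


def decode(records, codes):
    """Return new records with the code replaced by its description.

    The input records are left untouched; unknown codes are kept visible
    rather than silently dropped.
    """
    decoded = []
    for rec in records:
        row = {k: v for k, v in rec.items() if k != "iq_code"}
        row["description"] = codes.get(rec["iq_code"], f"UNKNOWN({rec['iq_code']})")
        decoded.append(row)
    return decoded


for row in decode(people, code_table):
    print(row["id"], row["first_name"], row["last_name"], row["description"])
```

The code only becomes understandable once the code table is available, which is why the code table has to travel with the data.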


Taxonomies vs Ontologies

Taxonomies
• Usually a single hierarchical classification within a subject
• Primarily focused on “is a” relationships between classes
• Limited in inferencing due to lack of relational expressiveness

Ontologies
• Subsume taxonomies
• Include attributes, cardinality and restricted variables
• Define relationships between entities
• Support comprehensive inferencing

This is Meta-Data !!!!
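One way to picture the difference is a rough Python sketch in which the taxonomy is a child-to-parent map and the ontology is a set of (subject, predicate, object) triples. The vehicle terms, the predicate names and both helper functions are invented for illustration, and the walk assumes the hierarchy has no cycles and each term has a single parent.

```python
# Taxonomy: a single "is a" hierarchy, child -> parent.
taxonomy = {
    "sedan": "car",
    "car": "vehicle",
    "truck": "vehicle",
}


def is_a(term, ancestor, tree):
    """Walk up the hierarchy to test an "is a" relationship."""
    while term in tree:
        term = tree[term]
        if term == ancestor:
            return True
    return False


# Ontology: typed relationships between entities, stored as
# (subject, predicate, object) triples. It subsumes the taxonomy
# because "is_a" is just one of the predicates.
triples = {(child, "is_a", parent) for child, parent in taxonomy.items()}
triples |= {
    ("person", "owns", "vehicle"),
    ("vehicle", "registered_in", "jurisdiction"),
}


def related(subject, predicate, facts):
    """Objects linked to the subject directly or through its "is a" ancestors."""
    found = set()
    current = subject
    while current is not None:
        found |= {o for s, p, o in facts if s == current and p == predicate}
        parents = [o for s, p, o in facts if s == current and p == "is_a"]
        current = parents[0] if parents else None
    return found


print(is_a("sedan", "vehicle", taxonomy))
print(related("sedan", "registered_in", triples))
```

The taxonomy can only answer "is a" questions. The triples can also answer what a sedan is registered in, by inheriting the relationship from its ancestor class, which is a small instance of the richer inferencing an ontology supports.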


Taxonomies and Ontologies – a little more

Unambiguous Terminology Definitions


Data is Messy

A major problem for the data scientist is to flatten the bumps that result from the heterogeneity of data.
Jimmy Lin and Dmitriy Ryaboy. Scaling big data mining infrastructure: The Twitter experience. SIGKDD Explor. Newsl., 14(2):6–19, April 2013. ISSN 1931-0145. doi: 10.1145/2481244.2481247. URL http://doi.acm.org/10.1145/2481244.2481247.

Analysts regularly wrangle data into a form suitable for computational tools through a tedious process that delays more substantive analysis. While interactive tools can assist data transformation, analysts must still conceptualize the desired output state, formulate a transformation strategy, and specify complex transformations.
Philip J Guo, Sean Kandel, Joseph M Hellerstein, and Jeffrey Heer. Proactive wrangling: Mixed-initiative end-user programming of data transformation scripts. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, UIST ’11, pages 65–74, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0716-1.


Data Cleansing …

Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data.

“Data cleansing is the process of analysing the quality of data in a data source, manually approving/rejecting the suggestions by the system, and thereby making changes to the data. Data cleansing in Data Quality Services (DQS) includes a computer-assisted process that analyses how data conforms to the knowledge in a knowledge base, and an interactive process that enables the data steward to review and modify computer-assisted process results to ensure that the data cleansing is exactly as they want to be done.”
Microsoft, 2012. URL http://technet.microsoft.com/en-us/library/gg524800.aspx.

“Data will talk to you if you’re willing to listen” – Jim Bergeson
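The two-stage process in the Microsoft definition above (a computer-assisted pass that proposes changes, then a review by the data steward) can be sketched in plain Python. This is not the DQS API; the knowledge base entries, the function names and the `approve` callback are all assumptions made for the example.

```python
# Knowledge base: known-good values and the variants that map to them.
# The entries are made-up examples.
KNOWLEDGE_BASE = {
    "act": "ACT",
    "a.c.t.": "ACT",
    "australian capital territory": "ACT",
    "nsw": "NSW",
    "new south wales": "NSW",
}


def suggest(values, knowledge_base):
    """Computer-assisted step: propose corrections, change nothing."""
    suggestions = []
    for index, value in enumerate(values):
        proposed = knowledge_base.get(value.strip().lower())
        if proposed is None:
            suggestions.append((index, value, None, "unknown value"))
        elif proposed != value:
            suggestions.append((index, value, proposed, "normalise"))
    return suggestions


def apply_approved(values, suggestions, approve):
    """Interactive step: apply only the suggestions the steward approves.

    Returns a cleansed copy; the original list is not modified.
    """
    cleansed = list(values)
    for index, old, new, _reason in suggestions:
        if new is not None and approve(index, old, new):
            cleansed[index] = new
    return cleansed


raw = ["ACT", " act", "New South Wales", "Canbera"]
proposals = suggest(raw, KNOWLEDGE_BASE)
# Stand-in for a human reviewer: approve everything that has a proposal.
clean = apply_approved(raw, proposals, approve=lambda i, old, new: True)
print(proposals)
print(clean)
```

Values the knowledge base does not recognise (here the misspelt "Canbera") are flagged for the steward rather than guessed at, and the raw list is kept unchanged alongside the cleansed copy.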




Data Cleansing – Some thoughts

“Data cleansing can be time-consuming and tedious, but robust estimators are not a substitute for careful examination of the data for clerical errors and other problems.”
David Ruppert. Inconsistency of resampling algorithms for high-breakdown regression estimators and a new algorithm. Journal of the American Statistical Association, 97: 148–149, 2002.

“Formal data cleansing can easily overwhelm any human or perhaps the computing capacity of an organization.”
N. Brierley, T. Tippetts, and P. Cawley. Data fusion for automated non-destructive inspection. Proceedings of the RSPA, 2014. URL http://rspa.royalsocietypublishing.org/content/470/2167/20140167.abstract.

“that the data volume may overwhelm the Extract Transform Load process and that data cleansing may introduce unintentional errors.”
Vincent McBurney, 17 mistakes that ETL designers make with very large data, 2007. URL http://it.toolbox.com/blogs/infosphere/17-mistakes-that-etl-designers-make-withvery-large-data-19264.


Single Source of Truth

[Diagram: Data Source A, Data Source B and two boxes labelled Data Source C combine into a Single Source of Truth]

A single source of truth is constructed as a composite view of a record that spans multiple data sources.
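A minimal Python sketch of such a composite view, assuming the sources already share a record key. The source names, fields and values are invented, and `composite_view` is a hypothetical helper rather than the author's implementation.

```python
from collections import defaultdict

# Three sources describing the same person under a shared key.
# All names and values are made-up examples.
sources = {
    "A": [{"id": 1001, "first_name": "Joe", "last_name": "Bloggs"}],
    "B": [{"id": 1001, "last_name": "Bloggs", "suburb": "Bruce"}],
    "C": [{"id": 1001, "last_name": "Blogs", "phone": "555-0100"}],
}


def composite_view(sources, key="id"):
    """Build one composite record per key without discarding any source value.

    Every attribute maps to a list of (value, source) pairs, so conflicting
    values stay visible and each value keeps its provenance.
    """
    view = defaultdict(lambda: defaultdict(list))
    for source_name, records in sources.items():
        for record in records:
            for field, value in record.items():
                if field != key:
                    view[record[key]][field].append((value, source_name))
    return {k: dict(fields) for k, fields in view.items()}


truth = composite_view(sources)
for field, values in truth[1001].items():
    print(field, values)
```

Keeping every value together with its source, instead of picking a winner at load time, leaves the conflict between "Bloggs" and "Blogs" available for later resolution and preserves the original data.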


Rules of Data Discovery

1. Do not throw away the original data !!!
2. Do not throw away the original data !!!
3. Keep track of the ETL process; take special care in regards to the impact on the data’s provenance and lineage.
4. Do not throw away the original data !!!
5. Record (as meta data) any identity resolution algorithms used to join the data sources together.
   1. Phonetic joins
   2. Temporal Joins
   3. Geo-spatial Joins
   4. N-Gram
   5. Sliding Ribbon
6. Do not throw away the original data !!!
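As an illustration of rule 5, here is a rough Python sketch of an N-Gram join that returns, alongside the matches, a metadata record of how the join was done. The use of character bigrams with Jaccard similarity, the 0.5 threshold and all record values are assumptions chosen for the example; the talk does not specify them.

```python
def ngrams(text, n=2):
    """Set of character n-grams of a normalised string."""
    text = text.strip().lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a, b, n=2):
    """Jaccard similarity of the two n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def ngram_join(left, right, field, n=2, threshold=0.5):
    """Pair records whose field values are similar enough.

    Returns the matches plus a metadata record describing how the join
    was done, so the identity resolution step can be audited later.
    """
    matches = [
        (lrec, rrec, round(similarity(lrec[field], rrec[field], n), 3))
        for lrec in left
        for rrec in right
        if similarity(lrec[field], rrec[field], n) >= threshold
    ]
    metadata = {"algorithm": "n-gram Jaccard", "n": n,
                "threshold": threshold, "field": field}
    return matches, metadata


source_a = [{"id": 1001, "name": "Joe Bloggs"}, {"id": 1002, "name": "Fred Nurk"}]
source_b = [{"ref": "x7", "name": "Jo Bloggs"}, {"ref": "x9", "name": "Mary Smith"}]

matches, metadata = ngram_join(source_a, source_b, "name")
print(matches)
print(metadata)
```

The join only pairs records and never rewrites them, so both original sources survive, and the stored metadata records which algorithm and threshold produced each link.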


Finally - Data Visualization – To Illuminate

Minard’s Chart - 1869

"may well be the best statistical graphic ever drawn" - Edward Tufte


Some more of Minard’s Infographics


Data View – Multidimensional & Mapping Time

Hans Rosling


The Birth of the Data Journalist


Finally – Data Presentation Bias


Further Reading and Contacts

Strategic Thinking in Criminal Intelligence. Jerry H Ratcliffe. The Federation Press – 2009. ISBN 978 186287 734-4

Intelligence-Led Policing. Jerry Ratcliffe. Routledge – 2008. ISBN 978-1-843292-339-8

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Peter Christen. Springer – 2012. ISBN 978-3-642-31163-5

Foundations of Semantic Web Technologies. Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph. CRC Press – 2010. ISBN 978-1-4200-9050-5

Big Data – A revolution that will transform how we live, work, and think. Viktor Mayer-Schönberger and Kenneth Cukier. HMH – 2013. ISBN 978-0-544-00269-2

The Schema Last Approach to Data Fusion. Neil Brittliff and Dharmendra Sharma. AusDM 2014

A Triple Store Implementation to support Tabular Data. Neil Brittliff and Dharmendra Sharma. AusDM 2014

University of Canberra http://www.canberra.edu.au

Datacom Systems Pty Ltd http://www.datacom.com.au

That’s all Folks! All the best for the Festive Season
