data discovery and visualization
A Data Scientist's Look at Data Discovery
Dr. Neil Brittliff PhD
“By far the most predictive method is Linear Regression”
Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber
University of Canberra - 2015
A little about myself…
• Currently working for Datacom Systems Pty Ltd
• Awarded a PhD at the University of Canberra in March of this past year for my work in the Big Data space
• Have been employed by 4 law enforcement agencies
• Developed Cryptographic Software to support Medicare
• Worked in the IT industry since 1982
• Received my degree at UC in 1982
University of Canberra - 2016
Talk Structure
• What is Data Discovery
• Data Source Joining
• Ontologies and Taxonomies
• Rules of Data Discovery
• Single Source of Truth
• Data Visualisation - Infographics
• Summing Up
What is Data Discovery
Check out Jeff Jonas from IBM
Data Discovery and Data Understanding
19.2 °C (temp) + $17.5 (cost) = 36.7
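The sum above is arithmetically fine and semantically meaningless, because the two numbers measure different things. One way to make that visible is to carry the unit with the value. This is only a minimal sketch: the `Quantity` class and the unit labels `degC` and `dollars` are hypothetical names invented for illustration, not something from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A number tagged with the unit it was measured in (hypothetical helper)."""
    value: float
    unit: str

    def __add__(self, other: "Quantity") -> "Quantity":
        # Refuse to add values whose units differ: the result would mean nothing.
        if self.unit != other.unit:
            raise ValueError(f"cannot add {self.unit} to {other.unit}")
        return Quantity(self.value + other.value, self.unit)


temp = Quantity(19.2, "degC")
cost = Quantity(17.5, "dollars")

try:
    total = temp + cost
    outcome = "added"
except ValueError as err:
    # The mismatch is reported instead of silently producing 36.7.
    outcome = str(err)

print(outcome)
```

Values that share a unit still add as expected, so the check only gets in the way when the data has not been understood.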
Id    First Name  Last Name  IQ compared with Brian’s IQ
1001  Joe         Bloggs     M
1002  Fred        Nurk       L

Code  Description
M     More Intelligent
L     Less Intelligent

Id    First Name  Last Name  Description
1001  Joe         Bloggs     More Intelligent
1002  Fred        Nurk       Less Intelligent
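The three tables above amount to a lookup join: the coded column only becomes understandable once it is joined to its code table. A minimal sketch in plain Python follows; the field names (`iq_vs_brian`, `description`) and the `decode` helper are assumptions made for illustration.

```python
# Records as they arrive: the last field is an opaque code.
people = [
    {"id": 1001, "first_name": "Joe", "last_name": "Bloggs", "iq_vs_brian": "M"},
    {"id": 1002, "first_name": "Fred", "last_name": "Nurk", "iq_vs_brian": "L"},
]

# The code table is the metadata that gives the codes their meaning.
code_table = {"M": "More Intelligent", "L": "Less Intelligent"}


def decode(records, codes, field="iq_vs_brian"):
    """Return new records with the coded field replaced by its description.

    The input records are left untouched; unknown codes are flagged
    rather than silently dropped.
    """
    decoded = []
    for rec in records:
        out = {k: v for k, v in rec.items() if k != field}
        code = rec[field]
        out["description"] = codes.get(code, f"UNKNOWN CODE {code!r}")
        decoded.append(out)
    return decoded


readable = decode(people, code_table)
for row in readable:
    print(row["id"], row["first_name"], row["last_name"], row["description"])
```

Building a new list instead of overwriting `people` keeps the original coded data available, which matters if the code table later turns out to be wrong.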
Taxonomies vs Ontologies
Taxonomies
• Usually are single hierarchical classification within a subject
• Primarily focused on “is a” relationships between classes
• Limited in inferencing due to lack of relational expressiveness

Ontologies
• Subsume taxonomies
• Includes attributes and cardinality and restricted variables
• Defines relationships between entities
• Supports comprehensive inferencing
This is Meta-Data !!!!
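The difference in inferencing can be sketched with a handful of subject-predicate-object triples. This is a toy illustration only: the class names, the `works_for` predicate and both helper functions are invented here, and a real ontology language offers far more than this.

```python
# Each fact is a (subject, predicate, object) triple. All names are illustrative.
triples = {
    ("Detective", "is_a", "PoliceOfficer"),
    ("PoliceOfficer", "is_a", "Person"),
    ("PoliceOfficer", "works_for", "Agency"),  # a non-hierarchical relationship
}


def ancestors(cls, facts):
    """Follow "is_a" links upward: the kind of inference a taxonomy supports."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in facts:
            if s == current and p == "is_a" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def related(cls, predicate, facts):
    """Ontology-style query: a class inherits relationships from its ancestors."""
    classes = {cls} | ancestors(cls, facts)
    return {o for s, p, o in facts if s in classes and p == predicate}


print(ancestors("Detective", triples))
print(related("Detective", "works_for", triples))
```

With only the `is_a` triples, `ancestors` is all that can be asked. Adding other predicates is what allows a question such as who a Detective works for to be answered by inference.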
Taxonomies and Ontologies – a little more
Unambiguous Terminology Definitions
Data is Messy
A major problem for the data scientist is to flatten the bumps that result from the heterogeneity of data.
Jimmy Lin and Dmitriy Ryaboy. Scaling big data mining infrastructure: The Twitter experience. SIGKDD Explor. Newsl., 14(2):6–19, April 2013. ISSN 1931-0145. doi: 10.1145/2481244.2481247. URL http://doi.acm.org/10.1145/2481244.2481247.

Analysts regularly wrangle data into a form suitable for computational tools through a tedious process that delays more substantive analysis. While interactive tools can assist data transformation, analysts must still conceptualize the desired output state, formulate a transformation strategy, and specify complex transformations.
Philip J Guo, Sean Kandel, Joseph M Hellerstein, and Jeffrey Heer. Proactive wrangling: Mixed-initiative end-user programming of data transformation scripts. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, UIST ’11, pages 65–74, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0716-1.
Data Cleansing …
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data.
“Data cleansing is the process of analysing the quality of data in a data source, manually approving/rejecting the suggestions by the system, and thereby making changes to the data. Data cleansing in Data Quality Services (DQS) includes a computer-assisted process that analyses how data conforms to the knowledge in a knowledge base, and an interactive process that enables the data steward to review and modify computer-assisted process results to ensure that the data cleansing is exactly as they want to be done.” Microsoft, 2012. URL http://technet.microsoft.com/en-us/library/gg524800.aspx.
“Data will talk to you if you’re willing to listen” – Jim Bergeson
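The suggest-then-approve workflow in the Microsoft description above can be sketched in a few lines. This is not the DQS API: the records, the two checks and the `approve` callback are all assumptions chosen to show the shape of the process, with the steward's decision stubbed out.

```python
import re

# Hypothetical raw records with typical inconsistencies.
raw = [
    {"id": 1, "name": "  joe BLOGGS ", "postcode": "2601"},
    {"id": 2, "name": "Fred Nurk", "postcode": "26O1"},  # letter O instead of zero
]


def suggest_fixes(record):
    """Return (field, old, new, reason) suggestions; never mutate the record."""
    suggestions = []
    tidy = " ".join(record["name"].split()).title()
    if tidy != record["name"]:
        suggestions.append(("name", record["name"], tidy, "normalise whitespace and case"))
    if not re.fullmatch(r"\d{4}", record["postcode"]):
        # Detected, but no safe automatic fix: new value is None.
        suggestions.append(("postcode", record["postcode"], None, "not a 4-digit postcode"))
    return suggestions


def apply_approved(record, suggestions, approve):
    """Apply only the suggestions the reviewer approves; return a cleaned copy."""
    cleaned = dict(record)
    for field, old, new, reason in suggestions:
        if new is not None and approve(field, old, new, reason):
            cleaned[field] = new
    return cleaned


def approve_all(field, old, new, reason):
    """Stand-in for the data steward: approve every concrete suggestion."""
    return True


cleaned = [apply_approved(r, suggest_fixes(r), approve_all) for r in raw]
print(cleaned)
```

The suspicious postcode is reported but left alone, because guessing a replacement is exactly how cleansing introduces new errors. The `raw` list is never modified.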
Data Cleansing – Some thoughts
“Data cleansing can be time-consuming and tedious, but robust estimators are not a substitute for careful examination of the data for clerical errors and other problems.” David Ruppert. Inconsistency of resampling algorithms for high-breakdown regression estimators and a new algorithm. Journal of the American Statistical Association, 97: 148–149, 2002.
“Formal data cleansing can easily overwhelm any human or perhaps the computing capacity of an organization.” N. Brierley, T. Tippetts, and P. Cawley. Data fusion for automated non-destructive inspection. Proceedings of the RSPA, 2014. URL http://rspa.royalsocietypublishing.org/content/470/2167/20140167.abstract.
“that the data volume may overwhelm the Extract Transform Load process and that data cleansing may introduce unintentional errors.” Vincent McBurney, 17 mistakes that ETL designers make with very large data, 2007. URL http://it.toolbox.com/blogs/infosphere/17-mistakes-that-etl-designers-make-withvery-large-data-19264.
Single Source of Truth
Data Source A Data Source B
Data Source C Data Source C
Single Source of Truth
A single source of truth is constructed as a composite view of a record that spans multiple data sources.
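A composite view can be sketched as a merge that keeps every value together with the source that reported it. The records, field names and `composite_view` function below are hypothetical, and the sketch assumes the records have already been matched to the same entity.

```python
# Hypothetical extracts of the same person (id 1001) held by three sources.
sources = {
    "A": {"id": 1001, "first_name": "Joe", "last_name": "Bloggs"},
    "B": {"id": 1001, "last_name": "Bloggs", "phone": "555-0100"},
    "C": {"id": 1001, "first_name": "Joseph", "address": "1 Example St"},
}


def composite_view(records_by_source):
    """Merge per-source records into one view, keeping every value and its origin."""
    view = {}
    for source, record in records_by_source.items():
        for field, value in record.items():
            # Map each field to {value: [sources that reported it]}.
            view.setdefault(field, {}).setdefault(value, []).append(source)
    return view


truth = composite_view(sources)

# Fields where the sources disagree are surfaced, not resolved by guesswork.
conflicts = {field: values for field, values in truth.items() if len(values) > 1}

print(truth["last_name"])
print(conflicts)
```

Nothing is discarded: where sources disagree (here on the first name), both values remain visible with their origin, so the choice of which to trust can be made and revisited later.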
Rules of Data Discovery
1. Do not throw away the original data !!!
2. Do not throw away the original data !!!
3. Keep track of the ETL process; take special care in regard to the impact on the data’s provenance and lineage.
4. Do not throw away the original data !!!
5. Record (as metadata) any identity resolution algorithms used to join the data sources together.
   1. Phonetic joins
   2. Temporal joins
   3. Geo-spatial joins
   4. N-Gram
   5. Sliding Ribbon
6. Do not throw away the original data !!!
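Rule 5 can be illustrated with one of the listed techniques, an n-gram join. The sketch below uses Jaccard similarity over character bigrams, which is only one of several ways to score n-gram overlap and is an assumption here; the function names, the threshold of 0.5 and the sample records are likewise invented for illustration.

```python
def ngrams(text, n=2):
    """Return the set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=2):
    """Jaccard similarity of the two n-gram sets (1.0 means identical sets)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0


def ngram_join(left, right, key, n=2, threshold=0.5):
    """Pair records whose key fields are similar and record how each link was made."""
    links = []
    for left_rec in left:
        for right_rec in right:
            score = similarity(left_rec[key], right_rec[key], n)
            if score >= threshold:
                links.append({
                    "left": left_rec,    # original record, unmodified
                    "right": right_rec,  # original record, unmodified
                    "meta": {"algorithm": "n-gram Jaccard", "n": n,
                             "threshold": threshold, "score": round(score, 3)},
                })
    return links


source_a = [{"id": "A1", "name": "Joe Bloggs"}, {"id": "A2", "name": "Fred Nurk"}]
source_b = [{"id": "B7", "name": "Jo Bloggs"}, {"id": "B9", "name": "Frank Nurse"}]

links = ngram_join(source_a, source_b, key="name")
for link in links:
    print(link["left"]["id"], link["right"]["id"], link["meta"])
```

Each link carries the algorithm name and its parameters as metadata, and both source records are kept as they were, so a join made with the wrong threshold can be found and redone.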
Finally - Data Visualization – To Illuminate
Minard’s Chart - 1869
"may well be the best statistical graphic ever drawn" - Edward Tufte
Some more Minard’s Infographics
Data View – Multidimensional & Mapping Time
Hans Rosling
The Birth of the Data Journalist
Finally – Data Presentation Bias
Further Reading and Contacts
Strategic Thinking in Criminal Intelligence. Jerry H Ratcliffe. The Federation Press – 2009. ISBN 978 186287 734-4
Intelligence-Led Policing. Jerry Ratcliffe. Routledge – 2008. ISBN 978-1-843292-339-8
Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Peter Christen. Springer – 2012. ISBN 978-3-642-31163-5
Foundations of Semantic Web Technologies. Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph. CRC Press – 2010. ISBN 978-1-4200-9050-5
Big Data – A revolution that will transform how we live, work, and think. Viktor Mayer-Schönberger and Kenneth Cukier. HMH – 2013. ISBN 978-0-544-00269-2
The Schema Last Approach to Data Fusion. Neil Brittliff and Dharmendra Sharma. AusDM 2014
A Triple Store Implementation to support Tabular Data. Neil Brittliff and Dharmendra Sharma. AusDM 2014
University of Canberra: http://www.canberra.edu.au
Datacom Systems Pty Ltd: http://www.datacom.com.au
That’s all Folks! All the best for the Festive Season