Making Good Use of Data at Hand: Open Source Tools Mark C. Cooke, Ph.D. Tax Management Associates, Inc.


Post on 13-Jan-2016

Page 1: Making Good Use of Data at Hand: Open Source Tools

Making Good Use of Data at Hand: Open Source Tools

Mark C. Cooke, Ph.D., Tax Management Associates, Inc.

Page 2: Making Good Use of Data at Hand: Open Source Tools

Overview

• Open Data concept – Data is produced for various purposes but can be used to derive novel insights; i.e. “Business Intelligence (BI)”

• Open Source tools exist for making good use of existing data sets
  – ETL (“Extract, Transform, Load”) + Analytics

• Knime and the R language are two of the most powerful resources for leveraging data

2013 Open Source Tools for Data Analysis

Mark C Cooke - Tax Management Associates, Inc.

Page 3: Making Good Use of Data at Hand: Open Source Tools

Open Data

• Open Data concept – governments collect, through existing management systems, enormous quantities of data that can be leveraged in alternative and novel ways to find solutions.

• The goal is often to leverage the broader community to develop solutions that governments may not have previously conceived.

• Open Data and Business Intelligence should be used by internal consumers as well.


Page 4: Making Good Use of Data at Hand: Open Source Tools

Open Data


Page 5: Making Good Use of Data at Hand: Open Source Tools

“Data Scientist”


Page 6: Making Good Use of Data at Hand: Open Source Tools

Doing Data the Old Way

• Data is locked inside systems :-(
  – Software systems are designed to wrap a Graphical User Interface (GUI) around data.
  – The GUI functionality, historically, has to be programmed to produce reports, views, and analysis.

• The GUI is driven by the sole purpose of the software. But the data has many purposes…


Page 7: Making Good Use of Data at Hand: Open Source Tools

Open Data – Way Forward

• Making data talk across platforms: AS400, SQL, XML, Excel, PDFs, text files, image files (.png, .jpeg, etc.), shapefiles (ESRI), email archives, web-scraping, APIs from social media, etc.

• Connecting data across multiple platforms

• Using data for novel insight

• Tools now exist for importing, cleaning, standardizing, and analyzing data using complex algorithms built into accessible packages
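To make "connecting data across multiple platforms" concrete, here is a minimal sketch that joins a CSV export with a SQL table on a shared key. It is written in Python purely for illustration (the presentation itself works in Knime and R), and the table name, column names, and sample records are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export, e.g. saved from a spreadsheet.
CSV_EXPORT = """account_id,business_name
1001,Acme Manufacturing
1002,Corner Bakery
1003,Main Street Hardware
"""


def build_demo_database():
    """Hypothetical source 2: a SQL table, held in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE valuations (account_id INTEGER, reported_value REAL)"
    )
    conn.executemany(
        "INSERT INTO valuations VALUES (?, ?)",
        [(1001, 250000.0), (1003, 42000.0), (1004, 9000.0)],
    )
    return conn


def load_names_from_csv(text):
    """Extract: read the CSV text into a dict keyed by account id."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["account_id"]): row["business_name"] for row in reader}


def load_values_from_sql(conn):
    """Extract: read the SQL table into a dict keyed by account id."""
    rows = conn.execute("SELECT account_id, reported_value FROM valuations")
    return {account_id: value for account_id, value in rows}


def connect_sources(names, values):
    """Transform: join the two sources, keeping accounts present in both."""
    shared_ids = sorted(names.keys() & values.keys())
    return [
        {
            "account_id": account_id,
            "business_name": names[account_id],
            "reported_value": values[account_id],
        }
        for account_id in shared_ids
    ]


if __name__ == "__main__":
    merged = connect_sources(
        load_names_from_csv(CSV_EXPORT),
        load_values_from_sql(build_demo_database()),
    )
    for record in merged:
        print(record)
```

Accounts that appear in only one source (1002 and 1004 here) drop out of the join; in practice those mismatches are often worth reviewing in their own right.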


Page 8: Making Good Use of Data at Hand: Open Source Tools

Open Data

• These systems are known as “Data Agnostic”:

Database Agnostic - Database-agnostic is a term describing the capacity of software to function with any vendor’s database management system (DBMS). In information technology (IT), agnostic refers to the ability of something – such as software or hardware – to work with various systems, rather than being customized for a single system.

– http://searchdatamanagement.techtarget.com/definition/database-agnostic


Page 9: Making Good Use of Data at Hand: Open Source Tools

Data Science

• What is the breadth of the tool base?
  – Reading in data from various resources
  – Transforming data to merge various resources, translate data into a usable format or to add new data elements
  – Analyzing data from basic logical and statistical functions to higher level machine learning tools and algorithms

“Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.”

http://en.wikipedia.org/wiki/Machine_learning


Page 10: Making Good Use of Data at Hand: Open Source Tools

Data Science

• What is the output?
  – “Business Intelligence” or actionable information that drives business decisions through insight
  – Creating new insights from existing data
  – Visualizations - representation of that BI in ways to make it consumable to a non-specialist audience

According to Friedman (2008), the “main goal of data visualization is to communicate information clearly and effectively through graphical means.”

http://en.wikipedia.org/wiki/Data_visualization

Page 11: Making Good Use of Data at Hand: Open Source Tools


Page 12: Making Good Use of Data at Hand: Open Source Tools

• Knime is a GUI-based data agnostic tool for ETL, analytics, and visualization.

• Knime is an open source platform for the desktop with commercial enterprise server layers including collaboration tools and web-services (web-portal).

• Knime supports other analytics languages, including the R language for statistical computing

www.Knime.org

Page 13: Making Good Use of Data at Hand: Open Source Tools

• The advantages of Knime:
  – Rapid development environment
  – Powerful processing that handles large datasets on commodity hardware
    • Allows for 100% data samples, up to millions of rows
  – Workflows can be saved, shared, and duplicated
  – Nodes are stepwise, allowing for quick revisions
  – Nodes provide access to complex algorithms


Page 14: Making Good Use of Data at Hand: Open Source Tools

What is Knime?


Page 15: Making Good Use of Data at Hand: Open Source Tools

The Knime Workbench


Page 16: Making Good Use of Data at Hand: Open Source Tools

Knime Nodes


• Nodes are the workers inside a workflow

• Every node serves at least one function

• Nodes can also be built as Meta-Nodes, which are a collection of nodes performing common functions

• A collection of nodes is called a “workflow”

• You can develop your own nodes in Java using Knime's node development support
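The node and workflow idea can be sketched in a few lines of code. The example below is a conceptual illustration in Python, not Knime's actual API: every name in it (`column_filter`, `row_filter`, `meta_node`, `run_workflow`) is hypothetical, and each "node" is simply a function that takes a table and returns a table.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]
Node = Callable[[Table], Table]


def column_filter(keep: List[str]) -> Node:
    """Build a node that keeps only the named columns."""
    def node(table: Table) -> Table:
        return [{name: row[name] for name in keep} for row in table]
    return node


def row_filter(predicate: Callable[[Row], bool]) -> Node:
    """Build a node that keeps only the rows matching the predicate."""
    def node(table: Table) -> Table:
        return [row for row in table if predicate(row)]
    return node


def run_workflow(table: Table, nodes: List[Node]) -> Table:
    """Run the nodes in order, feeding each node's output to the next."""
    for step in nodes:
        table = step(table)
    return table


def meta_node(*nodes: Node) -> Node:
    """Bundle several nodes into one reusable node."""
    def node(table: Table) -> Table:
        return run_workflow(table, list(nodes))
    return node


if __name__ == "__main__":
    # Hypothetical input table.
    accounts = [
        {"name": "A", "industry": "retail", "value": 500},
        {"name": "B", "industry": "manufacturing", "value": 2500},
    ]
    workflow = [
        row_filter(lambda row: row["value"] > 1000),
        column_filter(["name", "value"]),
    ]
    print(run_workflow(accounts, workflow))
```

Because every step has the same table-in, table-out shape, steps can be reordered, swapped, or bundled without touching the others, which is the property that makes stepwise revision quick.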

Page 17: Making Good Use of Data at Hand: Open Source Tools

Knime Nodes


• For example, the file reader node is an intelligent file reader that can determine the type of file

• However, it also allows the end user to adjust parameters
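A rough analogue of "guess the format, but let the user override it" is shown below. This is a Python sketch using the standard library's `csv.Sniffer`, not the Knime File Reader itself; the `read_table` function and the sample text are hypothetical.

```python
import csv
import io


def read_table(text, delimiter=None):
    """Parse delimited text into a list of dicts.

    If no delimiter is supplied, guess one from the text itself;
    an explicit delimiter always wins over the guess.
    """
    if delimiter is None:
        # Restrict the guess to a few common delimiter characters.
        dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
        delimiter = dialect.delimiter
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


if __name__ == "__main__":
    # Hypothetical semicolon-delimited file contents.
    sample = (
        "county;accounts;value\n"
        "Alachua;120;5.5\n"
        "Bay;75;3.25\n"
        "Leon;210;9.0\n"
    )
    print(read_table(sample))                  # delimiter guessed from the text
    print(read_table("a|b\n1|2\n", delimiter="|"))  # delimiter set by the user
```

Automatic detection is a heuristic and can guess wrong on short or irregular files, which is why an override parameter matters.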

Page 18: Making Good Use of Data at Hand: Open Source Tools

Knime Nodes


• The Column Filter node allows users to filter columns from a table (conveniently named…)

Page 19: Making Good Use of Data at Hand: Open Source Tools

Knime Nodes (sample)


Page 20: Making Good Use of Data at Hand: Open Source Tools

Knime Integrates with R


• R integration is key to expanding the data analysis and visualization capabilities of Knime

• R supports data ingestion of complex files (including ESRI)

• R supports complex data manipulation and statistical analysis

• R supports a wide variety of highly customizable visualizations

So, what is R, exactly?

Page 21: Making Good Use of Data at Hand: Open Source Tools

R Project for Statistical Computing


www.r-project.org

• R is an open source scripting language which can be run inside Knime, but also within a command line environment independently

• Several GUIs for R exist, such as RStudio, from a group that provides software for using R as well as training and extension packages (www.rstudio.com)

• Community contributions make up the bulk of R packages, which now total more than 4,700

Page 22: Making Good Use of Data at Hand: Open Source Tools

R Project for Statistical Computing


www.r-project.org

• The R base package (standard software) provides methods for reading data, ETL, analysis and visualizations

• The community provided packages take this base and build on it depending on the interest of the producer

• Packages stretch across all imaginable data uses, including advanced statistical analyses, machine learning and data mining, and advanced graphical visualizations (including sophisticated mapping)

Page 23: Making Good Use of Data at Hand: Open Source Tools

Popular R Packages


A (very) brief overview of popular packages:

• plyr – for advanced data manipulation

• maps – for mapping datasets onto georeferenced outputs

• ggplot2 – for advanced data visualizations

• RCurl – for reading data from webpages and repositories

• tm – for text mining applications

• sna – for social network analysis

Page 24: Making Good Use of Data at Hand: Open Source Tools

R Inside Knime


Basic Data Manipulation:

Page 25: Making Good Use of Data at Hand: Open Source Tools

R Inside Knime


Basic Visual using Maps:

Page 26: Making Good Use of Data at Hand: Open Source Tools

Knime + R + TPP


Case examples for working with TPP:

• Look at distribution of TPP accounts across a county, state, or region

• Map entities or create a heatmap (choropleth) of the distribution of personal property values

• Compare personal property reporting across schedules across industry sectors (m&e across manufacturing types)

• Compare like-kind entity reporting (franchises, big-box) for consistency in values

• Compare personal property accounts with other data resources (real property accounts, permits, etc.)
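One of these cases, comparing like-kind entities for consistency in values, can be sketched as a simple group-and-compare step. The Python below is an illustrative assumption, not the method used in the presentation: it flags accounts whose reported value differs from their group's median by more than a chosen fraction. The field names, the `tolerance` threshold, and the sample accounts are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def flag_inconsistent(accounts, tolerance=0.5):
    """Return ids of accounts far from the median value of their group.

    `tolerance` is the allowed deviation as a fraction of the group
    median (0.5 means more than 50% above or below gets flagged).
    """
    values_by_group = defaultdict(list)
    for account in accounts:
        values_by_group[account["group"]].append(account["value"])

    group_median = {g: median(v) for g, v in values_by_group.items()}

    flagged = []
    for account in accounts:
        mid = group_median[account["group"]]
        # Skip groups with a zero median to avoid dividing by zero.
        if mid and abs(account["value"] - mid) / mid > tolerance:
            flagged.append(account["account_id"])
    return flagged


if __name__ == "__main__":
    # Hypothetical accounts for two like-kind groups.
    sample = [
        {"account_id": 1, "group": "franchise_x", "value": 100000},
        {"account_id": 2, "group": "franchise_x", "value": 110000},
        {"account_id": 3, "group": "franchise_x", "value": 95000},
        {"account_id": 4, "group": "franchise_x", "value": 30000},
        {"account_id": 5, "group": "big_box", "value": 500000},
        {"account_id": 6, "group": "big_box", "value": 520000},
    ]
    print(flag_inconsistent(sample))
```

A flag here is only a prompt for review; store size, age of equipment, and similar factors can legitimately explain a gap.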

Page 27: Making Good Use of Data at Hand: Open Source Tools

Brief Demonstration


Data: Florida

67 Counties

More than 1.24 million personal property accounts

Goals:

1. Group all data by industry to illustrate the taxable value and exempted value by type

2. Subset the data to include only a particular industry

3. Map the state-wide exempt value in a choropleth
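The three goals map onto three small data operations. The sketch below shows them in Python for illustration only (the demonstration itself used Knime and R); the record layout and the sample figures are hypothetical, and the final step stops at the per-county totals a choropleth would be colored by rather than drawing the map.

```python
from collections import defaultdict


def totals_by(records, key):
    """Goal 1: sum taxable and exempt value for each distinct value of `key`."""
    totals = defaultdict(lambda: {"taxable": 0.0, "exempt": 0.0})
    for record in records:
        totals[record[key]]["taxable"] += record["taxable"]
        totals[record[key]]["exempt"] += record["exempt"]
    return dict(totals)


def subset(records, industry):
    """Goal 2: keep only the records for one industry."""
    return [r for r in records if r["industry"] == industry]


def exempt_by_county(records):
    """Goal 3 (data step only): exempt value per county, ready for mapping."""
    county_totals = totals_by(records, "county")
    return {county: t["exempt"] for county, t in county_totals.items()}


if __name__ == "__main__":
    # Hypothetical records; the values are made up for illustration.
    records = [
        {"county": "Alachua", "industry": "Retail", "taxable": 100.0, "exempt": 25.0},
        {"county": "Alachua", "industry": "Manufacturing", "taxable": 400.0, "exempt": 50.0},
        {"county": "Bay", "industry": "Retail", "taxable": 200.0, "exempt": 25.0},
        {"county": "Leon", "industry": "Manufacturing", "taxable": 300.0, "exempt": 0.0},
    ]
    print(totals_by(records, "industry"))
    print(subset(records, "Retail"))
    print(exempt_by_county(records))
```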

Page 28: Making Good Use of Data at Hand: Open Source Tools

Questions?


Thank you for your time and attention. I am always happy to discuss data, so please feel free to contact me using any of the information below.

Mark C [email protected] (office)
704.953.6349 (cell)
www.linkedin.com/in/markccooke