Tools to Optimize Operational Test Analysis Frank Thomason

Post on 08-Sep-2020

Tools to Optimize Operational Test Analysis

Frank Thomason


» Data, data everywhere

Managing data affects the success of your test

Accuracy is critical

Don’t throw away useful data

Complexity

» Timeliness

Customers want results quickly

Be prepared for the 3pm phone call

» Happiness

Data Scientists spend 50-80% of their time doing “data janitorial” work (NY Times)

Let the analysts do analysis

Why do we need tools?


» Automate as much as possible

» Be flexible

» Don’t reinvent the wheel

Use existing tools when possible

For custom software, use pre-built modules

» Don’t try to find one all-encompassing tool.

Break it into smaller chunks.

Keep it simple and focused.

Strategy


Data Analysis Lifecycle

Retrieve → Authenticate → Analyze → Report


» Move data from source to your environment

» Convert it to your preferred format

CSV, XML, etc…

» Load data into your repository

Database, shared drive, big data repository, etc…

» Run data quality checks to flag records for investigation and cleaning

» Alert when errors occur

Steps to Retrieve Data
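
The load, quality-check, and flag steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the sample rows, the `events` staging table, and the two checks are hypothetical, and an in-memory SQLite database stands in for whatever repository you use. An ETL tool would normally do this work.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real job would read a delivered file.
SOURCE_CSV = """event_id,timestamp,latitude,longitude
1,2020-09-08T14:00:00,38.89,-77.09
2,,38.90,-77.10
3,2020-09-08T14:05:00,95.00,-77.11
"""

def load_events(conn, csv_text):
    """Load rows unchanged into a staging table (all columns kept as text)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(event_id TEXT, timestamp TEXT, latitude TEXT, longitude TEXT, flag TEXT)"
    )
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.executemany(
        "INSERT INTO events (event_id, timestamp, latitude, longitude) "
        "VALUES (:event_id, :timestamp, :latitude, :longitude)",
        rows,
    )
    return len(rows)

def flag_bad_records(conn):
    """Mark records that need investigation; nothing is deleted or corrected here."""
    conn.execute("UPDATE events SET flag = 'missing timestamp' WHERE timestamp = ''")
    conn.execute(
        "UPDATE events SET flag = 'latitude out of range' "
        "WHERE flag IS NULL AND ABS(CAST(latitude AS REAL)) > 90"
    )
    return conn.execute(
        "SELECT event_id, flag FROM events WHERE flag IS NOT NULL ORDER BY event_id"
    ).fetchall()

conn = sqlite3.connect(":memory:")
loaded = load_events(conn, SOURCE_CSV)
flagged = flag_bad_records(conn)
print(f"loaded {loaded} rows, flagged {len(flagged)}")
for event_id, reason in flagged:
    print(event_id, reason)  # an alerting step could send these by email instead
```

The checks only set a flag. Cleaning is left to a later step so the loaded data stays a faithful copy of the source.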


» Data Warehousing has spawned many tools for retrieving data.

ETL – Extract, Transform, Load

» Open-Source Tools

Talend Data Integrator

Pentaho Kettle

Hadoop (big data)

» Commercial Tools

IBM DataStage

SQL Server Integration Services

Informatica

Oracle Data Integrator

» Custom Tools

Not recommended

Tools to Retrieve Data


» Hundreds of pre-built components

» Process-diagram interface makes it easy to understand and debug

» Lots of configuration options

» Create custom components

» Can run on a schedule or manually

Features in ETL Tools


ETL Tool Screenshot

Typical Extract-Load Job (from Talend Data Integrator)


ETL Map Fields


ETL Standard Components

Database Components

• AS400 • Access • Amazon RDS • DB Generic • DB JDBC • DB2 • Firebird • Greenplum • HSQL DB • Hive • Informix • Ingres • Interbase • JavaDB • LDAP • MS SQL Server • MaxDB • MySQL • Netezza • Ole DB • Oracle • ParAccel • PostgreSQL • Redshift • SQLite • SAS • Sybase • Teradata • Vertica • eXist

File Formats

• Apache Log • ARFF • Delimited • EBCDIC • Excel • JSON • MS Delimited • MS Positional • MS XML • Email • Positional • Properties • RegEx • XML

Protocols

• HTTP Request • FTP • FileFetch • POP • Kerberos • Keystore • Proxy • Socket • Web Service • XMLRPC • SOAP • RSS • Named Pipe • REST • FTP • MOM • JMS • Socket


» Lots of variability with retrieving source data

Proprietary formats, software, media, etc…

» Don’t modify the source data

Keep it as an accurate representation of the original data.

You can transform it later.

Keep the structure the same

» Flexibility is more important than speed

» Use an existing ETL tool

» Consider a Big Data Tool if you have a lot of log data

Tips to Retrieve Data
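
One way to honor the "don't modify the source data" tip is to record a checksum when the data arrives and transform only a copy. The sketch below assumes a SHA-256 digest is an acceptable fingerprint. The sample bytes and the line-ending transformation are hypothetical.

```python
import hashlib

def fingerprint(raw_bytes):
    """Return a SHA-256 digest identifying the source data exactly as received."""
    return hashlib.sha256(raw_bytes).hexdigest()

def transform(raw_bytes):
    """Example transformation applied to a copy: normalize line endings."""
    return raw_bytes.replace(b"\r\n", b"\n")

raw = b"id,value\r\n1,10\r\n2,20\r\n"   # hypothetical source file contents
digest_at_receipt = fingerprint(raw)

working_copy = transform(raw)            # bytes are immutable, so `raw` is untouched

# Later, before analysis, confirm the archived original has not changed.
assert fingerprint(raw) == digest_at_receipt
print(working_copy != raw)
```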


» Investigate flagged records

» Clean data wherever possible

» Present data to Data Authority for authentication

Summary of data

Sampling of records if dataset is large

Separate flagged records for closer review

» Update records with results of authentication

Steps to Authenticate Data
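
A minimal sketch of preparing a review package for the Data Authority: a summary, a reproducible sample of unflagged records, and the flagged records kept separate. The record layout, the `flag` field, and the function name `build_review_package` are assumptions made for illustration.

```python
import random
from collections import Counter

def build_review_package(records, sample_size, seed=0):
    """Split records into flagged and clean, summarize, and sample the clean ones."""
    flagged = [r for r in records if r.get("flag")]
    clean = [r for r in records if not r.get("flag")]
    summary = {
        "total": len(records),
        "flagged": len(flagged),
        "flag_reasons": dict(Counter(r["flag"] for r in flagged)),
    }
    rng = random.Random(seed)  # fixed seed so the same sample can be reproduced
    sample = rng.sample(clean, min(sample_size, len(clean)))
    return summary, sample, flagged

# Hypothetical dataset: 100 records, three of them flagged by earlier quality checks.
records = [{"id": i, "flag": None} for i in range(1, 101)]
records[4]["flag"] = "missing timestamp"
records[9]["flag"] = "missing timestamp"
records[49]["flag"] = "duplicate"

summary, sample, flagged = build_review_package(records, sample_size=10)
print(summary)
print([r["id"] for r in flagged])
```

Fixing the random seed is a design choice: the Data Authority and the analyst can regenerate the same sample if a question comes up later.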


» Not many off-the-shelf options

» Custom software may be the best choice

Use pre-existing modules for rapid development

ASP.NET, Java, and Python all have pre-built interfaces for viewing, sorting, and editing data.

Avoid bells and whistles that will make the code harder to reuse for other tests

Tools to Authenticate Data


» Record revision history automatically

Show which records have been deleted, modified, and added

Record the name of the user who made the change

» Don’t try to compete with Excel

Allow data to be exported to Excel

» Create summary reports

» Avoid adding a lot of bells and whistles

A simpler tool is easier to reuse for other tests

Tips to Authenticate Data
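
Automatic revision history can be sketched as a thin wrapper that logs every change together with the user who made it. The class below is an in-memory illustration only. The names `RevisionedTable` and `speed_kts` are hypothetical, and a real tool would write the history to the central repository.

```python
from datetime import datetime, timezone

class RevisionedTable:
    """In-memory record store that logs every add, modify, and delete."""

    def __init__(self):
        self.records = {}
        self.history = []

    def _log(self, action, record_id, user, before, after):
        self.history.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "record_id": record_id,
            "before": before,
            "after": after,
        })

    def add(self, record_id, values, user):
        self.records[record_id] = dict(values)
        self._log("added", record_id, user, None, dict(values))

    def modify(self, record_id, changes, user):
        before = dict(self.records[record_id])
        self.records[record_id].update(changes)
        self._log("modified", record_id, user, before, dict(self.records[record_id]))

    def delete(self, record_id, user):
        before = self.records.pop(record_id)
        self._log("deleted", record_id, user, before, None)

table = RevisionedTable()
table.add(1, {"speed_kts": 250}, user="analyst_a")
table.modify(1, {"speed_kts": 205}, user="analyst_b")
table.delete(1, user="analyst_b")
for entry in table.history:
    print(entry["user"], entry["action"], entry["before"], "->", entry["after"])
```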


» Data Authentication, Reporting, and Analysis tool

» Review and update data

» Import/Export Excel spreadsheets

» Keeps revision history

» Shows the changes made to each record

» Uses central repository for data

» Developed in ASP.NET and Java to fit most environments

» Generates summary reports

» Code is designed for flexibility so that new tests can be added quickly.

DARA Tool


DARA Tool


Revision History


Merged Records


Sample Sizes


Measures


Data Flow


» Prep the data

Change the data structure

Merge data

Flatten data

Dimensional models

Transform

Change time zones

Perform calculations

Geocodes

Categorize

Optimize database

Aggregate data

» Integrate your favorite analysis tool

Steps to Analyze
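
Three of the prep steps above (change time zones, categorize, aggregate) can be sketched as follows. The fixed UTC-4 offset, the range bins, and the field names are assumptions chosen for illustration. Real thresholds and time zones would come from the test plan.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=-4))  # assumption: fixed UTC-4 offset for the test site

def categorize(range_km):
    """Hypothetical bins; real thresholds would come from the test plan."""
    if range_km < 10:
        return "short"
    if range_km < 50:
        return "medium"
    return "long"

def prepare(rows):
    """Convert UTC timestamps to local time and attach a range category."""
    prepared = []
    for row in rows:
        utc_time = datetime.fromisoformat(row["time_utc"]).replace(tzinfo=timezone.utc)
        prepared.append({
            "local_time": utc_time.astimezone(LOCAL_TZ),
            "range_km": row["range_km"],
            "category": categorize(row["range_km"]),
        })
    return prepared

def mean_range_by_category(prepared):
    """Aggregate: average range per category."""
    groups = defaultdict(list)
    for row in prepared:
        groups[row["category"]].append(row["range_km"])
    return {name: sum(vals) / len(vals) for name, vals in groups.items()}

rows = [
    {"time_utc": "2020-09-08T18:00:00", "range_km": 4.0},
    {"time_utc": "2020-09-08T18:05:00", "range_km": 8.0},
    {"time_utc": "2020-09-08T18:10:00", "range_km": 30.0},
]
prepared = prepare(rows)
print(prepared[0]["local_time"].isoformat())
print(mean_range_by_category(prepared))
```

The prepared rows are new objects. The input rows are left as loaded, so the transformation can be rerun if the bins or the offset change.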


» Open Source

R

Saiku (Mondrian)

Weka, RapidMiner, KNIME

» Commercial

SAS, SPSS

MATLAB

STATISTICA

Tableau

Rattle

Excel

New automated analysis tools

» Custom

SQL

Python, Perl

Tools to Analyze


» Let analysts use their favorite tool

» Optimized structures like Dimensional Models are nice but take time

Extra time to design and develop

Requires a lot of testing to ensure data hasn’t been distorted

May not save time

May not handle changes to data easily

» OLAP is fun but probably only useful for very large datasets

Tips to Analyze


R

Output from R comparing two data sets


R

R Data Graphs from ClearPeaks.com


OLAP

Saiku OLAP Tool (using Mondrian)


» Determine audience

» Determine target media

» Determine level of detail

» Create reports

Canned reports

Ad-hoc reports

Custom reports

» Create visualizations

» Distribute

Steps to Report


» Open Source

JasperReports

Pentaho

» Commercial

Business Objects

Tableau

Yellowfin

Excel

» Custom Development

Tools to Report


» Canned reports are useful but don’t go nuts.

Reports change frequently.

» Ad-hoc reporting is useful for non-technical users, but analysts may prefer something else.

It is typically an add-on that costs extra.

» If most reports will be used only once, developing them may not be worth the effort.

Tips to Report


JasperReports

Reports created in JasperReports


3000 WILSON BLVD SUITE 250

ARLINGTON, VA 22201

www.definitivelogic.com

TEL: 703.955.4186

FAX: 877.349.4031

Frank Thomason

[email protected]

703-472-8138