
What 57% of Data Scientists Hate About Their Job

Understanding and Optimizing Data Quality Assurance

By Clint Eagar


Introduction

According to a recent study by CrowdFlower, data scientists spend a hefty 60% of their time cleaning and organizing data, and 57% of them regard this task as the least enjoyable part of their work. By the time these scientists actually get to analyzing data for patterns, they spend only 9% of their time doing so. The need to clean data points to a data quality problem, one that affects anyone working with data or data collection technologies: developers, analysts, digital marketers and decision-makers. So what do you do?


Table of Contents

Dirty Data

The Symptoms

The Data Quality Assurance Journey

Business Requirements/Documentation

Technology Deployment

Quality Assurance

Monitoring & Maintenance

The Data Quality Assurance Cycle

What Is Automated Data Governance?

Be Sure of Your Data Quality


About the Author

Clint proclaims to “know a little bit about a lot of things.” With more than 20 years’ experience in customer service and technical consulting and implementation, Clint has become an expert at building, optimizing and marketing websites. Passionate about data quality, Clint was one of the earliest employees of ObservePoint, joining the young company to help support, test, develop and manage the OP product.

Clint is the guy everyone counts on to get things done. If you yell for Clint, he always comes running and always delivers.


Dirty Data

It’s easy to want to treat symptoms instead of diagnosing the cause. Because it’s usually an analyst or data scientist who discovers and deals with bad data, it may be tempting to try to solve the issue at the analysis stage. But if you’re asking “How can I clean up data faster?” then you’re asking the wrong question. Things only need to be cleaned when they’re dirty, and data isn’t inherently dirty when it’s collected correctly.

The question you need to be asking is, “How did my data get dirty?”


The Symptoms

What exactly do I mean by dirty data? As explained in another eBook, Why You Should Audit Your Web Tags, dirty data is the result of broken tags and broken business processes. The product of these broken assets is data inflation, data loss and data leakage.

Data Leakage

The unauthorized transmission of data to third parties or former team members, caused by unsanctioned tags or outdated security measures, can put your data and company at risk.

Data Loss

When you fail to thoroughly deploy tags across your site or mobile app, you’re missing out on data, giving you an incomplete view of your users and giving your users a fragmented experience.

Data Inflation

Data inflation is caused by data being collected more than once for the same measurement. This doesn’t just mean duplicate entries in your CRM; it means your analytics data can be inflated, giving you false perceptions of performance.

These are serious threats to the integrity of your data, as well as to the decisions and operations that run on that data.
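Data inflation in particular is mechanically detectable before it ever reaches an analyst. As a rough sketch in Python (the hit schema of `visitor_id`, `event` and `ts` fields is invented for illustration, not any specific vendor's format), duplicate hits can be collapsed and the degree of inflation measured:

```python
def deduplicate_hits(hits):
    """Collapse analytics hits that share visitor, event and timestamp.

    Each hit is a dict with (at least) 'visitor_id', 'event' and 'ts' keys;
    only the first occurrence of each (visitor_id, event, ts) triple is kept.
    """
    seen = set()
    unique = []
    for hit in hits:
        key = (hit["visitor_id"], hit["event"], hit["ts"])
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique


def inflation_rate(hits):
    """Fraction of raw hits that were duplicates (0.0 means clean data)."""
    if not hits:
        return 0.0
    return 1 - len(deduplicate_hits(hits)) / len(hits)
```

A persistently non-zero inflation rate is a signal to fix the tag that fires twice, not to keep deduplicating downstream.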


The Data Quality Assurance Journey

Bad data is usually indicative of a lack of transparency and accountability between data governance leaders and data stakeholders, and it can be aggravated by pressure on developers and QA professionals to release new updates at an ever-increasing rate.

It’s important to identify at what point data gets lost, duplicated or leaked. Or rather, at what point data governance personnel fail to validate data collection processes. Here are the underlying business processes (the data quality assurance journey) that a data governance steward should be monitoring to ensure data is collected correctly:

1. Business Requirements

Product owners and project stakeholders provide requirements for how analytics and digital marketing technologies are to be formatted in a digital asset. This includes variable and tag documentation (often known as a solution design reference, or SDR). Measurement of a new feature is often an afterthought, but it should be part of the initial conversation.

2. Technology Deployment

Developers deploy a TMS container code, and then TMS controllers deploy the required technologies into the digital asset using the TMS solution.

3. Quality Assurance

QA professionals validate the implementation against the original business requirements, ensuring back-end and front-end technologies are functioning as expected.

4. Monitoring & Maintenance

Stakeholders continually ensure the digital asset functions as intended, fixing bugs and errors in the code base as users expose unique use cases.


Where do data end users like analysts and data scientists fit into the data quality assurance journey?


Data end users often fulfill the role of TMS controller, responsible for deploying code to meet their own analytical needs. They can also help close the loop on data collection, providing feedback to business requirements personnel about potential events to track. Ideally, these end users should simply be able to trust the data, shedding the role of data janitor to become data gurus: analyzing, mining and modeling data.

In practice, however, they still spend much of their time cleaning up the data before they can perform their primary business function. Going about data quality assurance this way is inefficient, tedious and frustrating for everyone involved. There is a better way.


Business Requirements

Role: Provide requirements for digital marketing and analytics technologies to be included in the digital asset, including variable and tag documentation.

To achieve your digital marketing and analytics objectives, you need a blueprint to understand how to architect your technology stack. This begins with identifying business requirements, which answer the following questions:

What do we want to know about visitors to our website and mobile app?

How can we better help our visitors get value from our products?

What events do we want to measure?

How will we measure/capture this?

These business requirements are then translated into tag and variable documentation. The purpose of tag and variable documentation—often known as a variable strategy or solution design reference (SDR)—is to help maintain consistency across digital assets, as well as provide a baseline to perform quality assurance against. An SDR is typically a spreadsheet with the following information about each technology:

- Tag Status: active or inactive
- Use Case: what is being captured
- Variable: a prop, eVar or event
- Variable Description: a quick explanation of the variable
- Example: what you expect the value to look like
- When to Set: page load, click or other
- Where to Set: which pages
- Additional Notes: any additional notes

The SDR is meant to be living and accessible: as digital assets are added and changed, the SDR should be updated to match. But SDR stakeholders often struggle to keep up with the rate at which technologies change, and thus lose the baseline to validate their technologies against when there is an error. The result is data governance personnel becoming blind to what stakeholders were hoping to accomplish with their analytics implementations.

Main Takeaway: Tag documentation needs to be fluid, but also needs to be a firm baseline to validate your technologies against. SDR maintenance is often neglected as technologies are changed and deployed.
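Because an SDR is tabular, it can double as a machine-readable contract rather than a static spreadsheet. A minimal sketch in Python, with fields following the columns above (the `evar12` name and the regex are invented for the example, not a real implementation):

```python
import re
from dataclasses import dataclass


@dataclass
class SDRRow:
    """One row of a solution design reference (SDR)."""
    status: str            # "active" or "inactive"
    use_case: str          # what is being captured
    variable: str          # e.g. a prop, eVar or event name
    description: str       # quick explanation of the variable
    example_pattern: str   # regex the collected value should match
    when_to_set: str       # "page load", "click", ...
    where_to_set: str      # which pages


def validate(row: SDRRow, observed_value: str) -> bool:
    """An active row passes only if the observed value matches its pattern."""
    if row.status != "active":
        return True  # inactive rows are not enforced
    return re.fullmatch(row.example_pattern, observed_value) is not None
```

Under this model, “losing the baseline” becomes concrete: a stale SDR row means validation runs against the wrong pattern, and nobody notices until the data is already dirty.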


Technology Deployment

Main Takeaway: TMSs are extremely helpful, but still prone to human error, and do not provide feedback on data quality issues such as broken tags or failed compliance rules.

Role: Deploy TMS container code, and then deploy the required technologies into the digital asset using the TMS solution.

With the advent of tag management systems, the responsibility of deploying tags has passed from developers to data end users such as analysts, marketers and data scientists. Because TMSs have powerful rules engines and syntax validation, these users have been able to quickly deploy the tags most relevant to their business objectives. Consequently, organizations that use tag management see a significant boost in data quality.

But while tag management systems have streamlined and simplified tag deployment, gaps for human error still remain. Tag management is not a substitute for tag validation. Misconfigured tags, conflicting JavaScript code, or tags deployed outside of a TMS can all cause data quality issues that a TMS cannot detect or report. Tag deployment via a TMS should still go through a rigorous quality assurance testing process.


Quality Assurance

Main Takeaway: Quality assurance teams are expected to deliver better products in less time, yet often have to perform quality assurance through inefficient, manual spot-checking methods.

Role: Validate the implementation against the original business requirements, ensuring back-end and front-end technologies are functioning as expected.

Unless an issue with a technology emerges during migration from staging to production, or somehow crops up unexpectedly in the production environment, it falls under QA’s purview to discover and resolve it beforehand, ensuring technologies work as needed. But even though it is best practice for quality assurance teams to resolve broken tags before production, some broken tags still proceed into production unaddressed. And it may not be QA’s fault. Here’s why: DevOps and QA are under increasing pressure to continuously release new features, technologies, web pages, app versions and so on. The “release early, release often” mantra is shrinking software release cycles, and QA professionals are feeling the pressure of finding and resolving bugs in less time.

Unfortunately, these same QA professionals are often equipped to respond to rapid release cycles only with slow, manual spot-checking methods: one survey showed that while 67% of teams deploy new builds weekly, only 26% of teams have more automated testing than manual. Consequently, bugs can slip through the system into production, undetected until they present a problem, either for overwhelmed monitoring & maintenance personnel or for exhausted, jaded data scientists and analysts.
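The manual spot-check is straightforward to automate in principle. A minimal sketch in Python (the tag names and patterns are illustrative assumptions, not a real vendor list): scan a page's source for the tags that should be present and report what is missing.

```python
import re


def missing_tags(page_html, expected_tags):
    """Return the names of expected tags not found in the page source.

    expected_tags maps a human-readable tag name to a regex that should
    appear in the page HTML when the tag is correctly deployed.
    """
    return [name for name, pattern in expected_tags.items()
            if not re.search(pattern, page_html)]
```

Run against every page of a build in staging, a check like this turns spot-checking into a pass/fail gate that keeps up with weekly releases.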


Monitoring & Maintenance

Main Takeaway: Monitoring & maintenance personnel have an ever-increasing load of digital assets to monitor, and scaling up their manual efforts is a challenge.

Role: Continually ensure the digital asset functions as intended, fixing bugs and errors in the code base as users expose unique use cases.

The role of monitoring & maintenance personnel is to continuously verify that the tracking technologies (tags) and critical user paths on your website or mobile app are functioning as expected. To do this manually, these post-production QA professionals have to replicate user activity on the front-end production layer to look for bugs, as well as monitor back-end mechanisms to ensure functionality when new components are added to the mix. So they should be the ones to catch those final mistakes, right? Well, yes. But that can be easier said than done.

As a company grows, and its website or mobile app grows with it, the number of digital assets post-production QA engineers have to monitor keeps adding up. QA professionals often lack scalable monitoring processes to catch errors, relying instead on manual tag validation methods. To put it simply, they’re in over their heads, forced to limit themselves to what they feel are the most “important” digital assets and leave the rest hung out to dry.


The Data Quality Assurance Cycle

Up until now, we’ve been treating the data quality assurance journey as a linear process. But at its best, the data quality assurance journey is a cycle:

Requirements/Documentation → Technology Deployment → Quality Assurance → Monitoring/Maintenance → back to Requirements/Documentation

Optimal, efficient data governance is forward-looking: when analysts and data scientists don’t have to struggle to clean data, they can focus on closing the loop with business requirements stakeholders, expanding the analytical capability of their tag implementations.

But how do you get there when your company is facing serious challenges with data governance? In the face of a mountain of inefficient data governance processes, what you need is automated data governance.


What Is Automated Data Governance?

Automated data governance solutions crawl your website or mobile app, firing code and validating tag presence and performance against predefined business compliance rules. This can save data governance personnel mountains of time, increasing efficiency and building confidence in data quality.

1. Automated Production Monitoring

Enterprise-grade data governance tools can be configured to regularly audit large portions of your website or app, ensuring all technologies are working as intended. These solutions can also replicate traffic along critical user paths through your website or mobile app, validating the functionality of the tags that should be present along the journey.

2. Automated Quality Assurance

In addition to keeping track of tag performance in a production environment, data governance solutions can also audit digital assets in a staging environment, assisting QA pros as they prepare to push software into production. QA engineers can also create compliance rules to validate their implementation against, ensuring the digital asset holds up to business requirements.

3. Technology Deployment Assistance

In addition to compliance-based audits and critical path monitoring, some data governance solutions provide tag debuggers that allow developers and TMS users to spot-check the performance of tags on a web page without having to read through any code.

4. Enhanced SDR Documentation

Data governance solutions dramatically simplify the process of generating and maintaining tag and variable documentation by scanning your website or mobile app and reporting on all tags and variables found on the digital asset. These reports can also be used to augment your library of compliance rules to reflect your company’s business requirements.

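The audit loop such tools automate can be sketched in a few lines. In this hedged example (the rule names, hosts and page snippets are all invented), each crawled page is checked for required tags, and any script source outside an approved list is flagged as the leakage symptom described earlier:

```python
import re


def audit_pages(pages, required, approved):
    """Audit crawled pages against simple compliance rules.

    pages:    dict mapping URL -> page HTML (as gathered by a crawler)
    required: dict mapping tag name -> regex that must appear on every page
    approved: list of regexes for sanctioned tag hosts; any other
              <script src> is flagged as potential data leakage
    Returns a dict mapping URL -> list of violation strings (empty = clean).
    """
    violations = {}
    for url, html in pages.items():
        problems = [f"missing tag: {name}"
                    for name, pattern in required.items()
                    if not re.search(pattern, html)]
        # Flag any script loaded from a host not on the approved list.
        for src in re.findall(r"<script[^>]+src=['\"]([^'\"]+)", html):
            if not any(re.search(p, src) for p in approved):
                problems.append(f"unapproved tag: {src}")
        violations[url] = problems
    return violations
```

Scheduled against regular crawls of production and staging, a report like this replaces manual spot-checking with a repeatable, scalable pass/fail audit.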


Be Sure of Your Data Quality

ObservePoint’s Data Quality Assurance™ is the premier data governance software for automating quality assurance, production monitoring and compliance validation. The ObservePoint platform relieves the pressure on data governance stakeholders by automating processes that, when performed manually, are highly monotonous, inefficient and prone to human error, in order to monitor and validate your analytics and marketing tags with efficiency, confidence and accuracy.

Request a sample audit to see what ObservePoint can do for you.