
dfPower Studio 7.1.2 User’s Guide


dfPower Studio - Contact and Legal Information

Technical Support
Phone: 1-919-531-9000
Email: [email protected]
Web: www.dataflux.com/techsupport

Contact DataFlux
Corporate Headquarters
DataFlux Corporation
940 NW Cary Parkway, Suite 201
Cary, NC 27513-2792
Toll Free Phone: 1-877-846-FLUX (3589)
Toll Free Fax: 1-877-769-FLUX (3589)
Local Telephone: 1-919-447-3000
Local Fax: 1-919-447-3100
Web: www.dataflux.com

European Headquarters
DataFlux UK Limited
59-60 Thames Street
Windsor
Berkshire SL4 1TX
United Kingdom
UK (EMEA): +44 (0) 1753 272 020

Legal Information

PCRE Copyright Disclosure

A modified version of the open source software PCRE library package, written by Philip Hazel and copyrighted by the University of Cambridge, England, has been used by DataFlux for regular expression support.

Copyright (c) 1997-2005 University of Cambridge. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

• Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

• Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

• Neither the name of the University of Cambridge nor the name of Google Inc. nor the names of their contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

gSOAP Copyright Disclosure

Part of the software embedded in this product is gSOAP software.

Portions created by gSOAP are Copyright (C) 2001-2004 Robert A. van Engelen, Genivia inc. All Rights Reserved.

THE SOFTWARE IN THIS PRODUCT WAS IN PART PROVIDED BY GENIVIA INC AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Expat Copyright Disclosure

Part of the software embedded in this product is Expat software.

Copyright (c) 1998, 1999, 2000 Thai Open Source Software Center Ltd.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Apache/Xerces Copyright Disclosure

The Apache Software License, Version 1.1

Copyright (c) 1999-2003 The Apache Software Foundation. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

• Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

• Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

• The end-user documentation included with the redistribution, if any, must include the following acknowledgment: "This product includes software developed by the Apache Software Foundation (http://www.apache.org/)." Alternately, this acknowledgment may appear in the software itself, if and wherever such third-party acknowledgments normally appear.

• The names "Xerces" and "Apache Software Foundation" must not be used to endorse or promote products derived from this software without prior written permission. For written permission, please contact [email protected].

• Products derived from this software may not be called "Apache", nor may "Apache" appear in their name, without prior written permission of the Apache Software Foundation.

THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


This software consists of voluntary contributions made by many individuals on behalf of the Apache Software Foundation and was originally based on software copyright (c) 1999, International Business Machines, Inc., http://www.ibm.com. For more information on the Apache Software Foundation, please see <http://www.apache.org/>.

SQLite Copyright Disclosure

The original author of SQLite has dedicated the code to the public domain. Anyone is free to copy, modify, publish, use, compile, sell, or distribute the original SQLite code, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means.

USPS Copyright Disclosure

National ZIP®, ZIP+4®, Delivery Point Barcode Information, DPV, RDI. © United States Postal Service 2005. ZIP Code® and ZIP+4® are registered trademarks of the U.S. Postal Service.

DataFlux holds a non-exclusive license from the United States Postal Service to publish and sell USPS CASS, DPV, and RDI information. This information is confidential and proprietary to the United States Postal Service. The price of these products is neither established, controlled, nor approved by the United States Postal Service.

DataFlux and all other DataFlux Corporation LLC product or service names are registered trademarks or trademarks of, or licensed to, DataFlux Corporation LLC in the USA and other countries. ® indicates USA registration. Copyright © 2005 DataFlux Corporation LLC, Cary, NC, USA. All Rights Reserved.


Table of Contents

dfPower Studio - Contact and Legal Information
    Technical Support
    Contact DataFlux
    Legal Information

Chapter 1: Introducing dfPower Studio
    dfPower Studio Overview
    dfPower Studio – Main Screen – Main Menu and Toolbar
    dfPower Studio – Base
    dfPower Profile
    dfPower Quality
    dfPower Integration
    dfPower Enrichment
    dfPower Design
    dfPower Favorites

Chapter 2: Getting Started
    Before You Upgrade dfPower Studio and/or the Quality Knowledge Base
    System Requirements
    dfPower Studio Installation Wizard
    dfPower Studio Installation Command-Line Switches
    Downloading and Installing dfPower Verify Databases
    Launching dfPower Studio
    Licensing dfPower Studio
    Moving Around in dfPower Studio
    Accessing dfPower Studio Online Help

Chapter 3: dfPower Studio in Action
    Sample Data Management Scenario
    Data Management Scenario Background
    Profiling
    Quality
    Integration
    Enrichment
    Monitoring
    Summary

Chapter 4: Performance Guide
    dfPower Studio Settings
    Database and Connectivity Issues
    Quality Knowledge Base Issues

Exercises
    Profile
    Profile
    Architect


dfPower Studio® User’s Guide

Chapter 1: Introducing dfPower Studio

This chapter describes dfPower Studio and its component applications and application bundles.

Topics in this chapter include:

• dfPower Studio Overview

• dfPower Studio Main Menu and Toolbar descriptions

• dfPower Studio node descriptions for Base, Profile, Quality, Integration, Enrichment, and Design.

dfPower Studio Overview

dfPower Studio® is a powerful, easy-to-use suite of data cleansing and data integration software applications. dfPower Studio connects to virtually any data source, and any dfPower Studio user in any department can use Studio to profile, cleanse, integrate, enrich, monitor, and otherwise improve data quality throughout the enterprise. An innovative job flow builder allows users to build complex data management workflows quickly and logically.

This capability allows frontline staff—not IT or development resources—to discover and address data problems, merge customer and prospect databases, verify and complete address information, transform and standardize product codes, and perform just about any other data management process required by your organization.

You can access dfPower Studio’s functionality through a graphical user interface (GUI), from the command line, or in batch operation mode, providing you great flexibility in addressing your data-quality issues. The dfPower Studio main GUI screen is a centralized location from which to launch the dfPower Studio applications. The toolbar on the left of the screen lists all the dfPower Studio applications.

dfPower Studio consists of these nodes and application bundles:

• dfPower Studio – Base

• dfPower Studio – Profile

• dfPower Studio – Quality

• dfPower Studio – Integration

• dfPower Studio – Enrichment

These applications/application bundles are described in this chapter, along with the Navigator window.

Note: Some applications will not be available for launching if you have not licensed them.

The dfPower Studio Navigator Window

When you launch dfPower Studio, the Navigator window appears.


The dfPower Studio Navigator window.

The Navigator is a revolutionary concept in the data quality and data integration industry, and is an essential part of the dfPower solution. The Navigator was designed to help you collect, store, examine, and otherwise manage the various data quality and integration logic and business rules that are generated and created during dfPower Studio use.

Quality Knowledge Base, Management Resources, and References are all containers for metadata that provide you a complete view of your data quality assets and help you build complicated data management processes with ease.

Use the Navigator to incorporate data quality business rules that are used across an entire enterprise within many different aspects of business intelligence and overall data hygiene. These business rules are stored as objects that are independent of the underlying data sources used to create them. Thus, there is no limit to the application of these business rules to other sources of data throughout the enterprise, including internal datasets, external datasets, data entry, internal applications, the web, and so on. From the smallest spreadsheets buried in the corners of the enterprise, all the way up to corporate operational systems on mainframes, you can use the same business rules to achieve consistent data integrity and usability across the entire enterprise.

More specifically, the Navigator:

• Stores all business rules generated during any of the various dfPower Studio processes

• Allows metadata to be maintained about each of these data quality objects

• Allows reuse of these data quality jobs and objects across the various applications

• Facilitates quick launches of various stored data management jobs with a few clicks of the mouse

• Maintains all reports that are generated during data quality processing

• Manages various configurations of the various data quality processes routinely implemented by the organization

• Maintains information about batch jobs and the various schedules of these jobs



Studio Supports the AIC Process and Building Blocks of Data Management

The dfPower Studio bundles have been designed to directly support the building blocks of data management. You will find this design particularly helpful if you are using the approach DataFlux takes to data management initiatives using an Analyze, Improve, and Control (AIC) process. AIC is a method of finding data problems, building reusable transformations, and strategically applying those transformations to improve the usability and reliability of an organization’s data.

The five building blocks of AIC data management are:

• data profiling

• data quality

• data integration

• data enrichment

• data monitoring

As you will see, these building blocks closely mirror the structure of dfPower Studio’s main menu.

dfPower Studio – Main Screen – Main Menu and Toolbar

Use the dfPower Studio toolbar to launch the various dfPower Studio applications. The node menu and toolbar contain several categories of applications and functions that mirror the data management methodology—Profile, Quality, Integration, Enrichment, and Monitoring. You can customize the contents of the seventh category, Favorites.

Figure 1. The Main Screen menu and toolbar reflect the Data Management Methodology process.

You can also display a toolbar palette by selecting Studio > Show Palette. To hide the palette, select Studio > Hide Palette. The palette offers quick access to various functions, which are identified by balloon text.

Next, we describe the various nodes available from the main window.

dfPower Studio – Base

At the heart of dfPower Studio is the bundle of applications and functions collectively known as dfPower Studio – Base. dfPower Studio – Base is the infrastructure that ties all the dfPower Studio applications together, allowing them to interact, process data simultaneously at scheduled times, and share information where necessary.

dfPower Studio – Base consists of the following components:

• Architect—Design a workflow for processing your data

• Batch—Schedule jobs to be run in batch mode


• DB Viewer—View and retrieve records from your data sources

• ODBC Admin—Configure dfPower Studio to connect to your various data sources

• Options—Configure dfPower Studio options

• Rule Manager—Create and manage business rules

• Monitor Viewer—View task data in a repository

• Repository Administrator—Create and manage repositories

Note: Navigator and Extend appeared in dfPower Studio – Base in earlier releases. Navigator now opens automatically with Studio and appears as a menu option at the top of Studio’s main menu. The Extend function is now accessible from dfPower Studio – Enrichment.

Architect—Design a Workflow for Processing Your Data

dfPower Base - Architect brings much of the functionality of the other dfPower Studio applications (such as dfPower Base - Batch and dfPower Quality - Analysis), as well as some unique functionality, into a single, intuitive user interface. In Architect, you define a sequence of operations—for example, selecting source data, parsing that data, verifying address data, and outputting that data into a new table—and then run the operations all at once. This functionality not only saves you the time and trouble of using multiple dfPower Studio applications, but also helps to ensure consistency in your data-processing work.

To use Architect, you specify operations by selecting job flow steps and then configuring those steps, via straightforward settings, to meet your specific data needs. The steps you choose are displayed on dfPower Architect's main screen as node icons, together forming a visual job flow.

Architect can read data from virtually any source and can output processed data to text files, HTML reports, database tables, and a host of other formats.

Note: Understanding the other dfPower Studio applications is very helpful when working with dfPower Architect.

Batch—Schedule Jobs to Run in Batch Mode

dfPower Base - Batch supports dfPower Studio’s built-in batch facilities, allowing you to schedule and run data quality procedures on databases without actually using the dfPower Studio applications’ graphical interfaces. These procedures include logging on and off the appropriate database, as well as the actual processing.

Using Batch, you can instruct dfPower Studio to analyze multiple tables and databases at once and to run multiple batch processing jobs. You can also use Batch to schedule jobs to run at a time when database activity is non-existent or at a minimum, even when you cannot be physically present.

DB Viewer—View and Retrieve Records

dfPower Base – DB Viewer is a record viewer that permits you to view and retrieve records from your various data sources.

ODBC Admin—Configure dfPower Studio to Connect to Various Data Sources

Configure dfPower Studio and dfPower Studio applications to connect to databases by using the Microsoft Windows ODBC (Open Database Connectivity) Data Source Administrator screen. You can access this screen from a variety of locations within the dfPower Studio suite.


To process a database in dfPower Studio, an ODBC driver for the specified DBMS must be installed, and the database must be configured as an ODBC data source. If you have successfully completed these steps, the database name will display in the database lists found on the dfPower Studio main screen, dfPower Integration - Match main screen, and dfPower Enrichment - Verify main screen.

Figure 2. dfPower Studio - Base - ODBC Admin screen
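Tip: The same ODBC Data Source Administrator can usually also be opened directly from a command prompt. This is a standard Windows shortcut rather than a documented dfPower Studio feature, so treat it as an assumption about your Windows setup:

rem Standard Windows tool; opens the ODBC Data Source Administrator directly.
odbcad32.exe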

Note: Extend appeared in dfPower Studio – Base in earlier releases. The Extend function is now accessible from dfPower Studio – Enrichment – Extend. Extend permits you to connect directly to a data source using ODBC and creates fixed-width text files that can be sent to Donnelley Marketing’s InfoConnect web service for address hygiene, privacy protection, and data enhancement services. See the online Help for additional details.

Options—Configure dfPower Studio Options

After installing and licensing dfPower Studio, use the Options screen to set how dfPower Studio - Base and certain other dfPower Studio applications operate.


Figure 3. dfPower Studio - Base - Options

Rule Manager—Create Custom Business Rules and Metrics

Use the Rule Manager to create and manage business rules and custom metrics, which you can add to a job to monitor your data. You can use business rules and custom metrics to analyze data and identify problems. The Rule Manager provides tools for building expressions for custom metrics and rules. You can create custom metrics to supplement the standard metrics available in dfPower Profile or to use in rules and tasks with dfPower Architect. Use the Rule Manager to create tasks that implement one or more rules. You can then execute these tasks at various points in a dfPower Architect job flow and trigger events when the conditions of a rule are met.


Figure 4. dfPower Studio - Base – Rule Manager

Monitor Viewer—View Task Data in a Repository

When you run a job in dfPower Architect, the job executes tasks and rules that you created using the Business Rule Manager. Information about these executed tasks and their associated rules is stored in the default repository. The Monitor Viewer lets you view the information stored in the repository about the executed tasks and rules.

Figure 5. dfPower Studio - Base – Monitor Viewer

Repository Administrator—Create and Manage Repositories

The Repository Administrator enables you to create and register new repositories, upgrade existing repositories and dfPower Profile reports, and import objects from one repository to another. A DataFlux unified repository is a set of tables in a database or file that stores profile reports, custom metrics, business rules, and results from data monitoring jobs.


Figure 6. dfPower Studio - Base – Repository Administrator

dfPower Profile

Start any data quality initiative with a complete assessment of your organization’s information assets. Use dfPower Profile to examine the structure, completeness, suitability, and relationships of your data. Through built-in metrics and analysis functionality, you can find out what problems you have and determine the best way to address them.

Using dfPower Profile, you can:

• Select and connect to multiple databases of your choice without worrying about your sources being local, over a network, on a different platform, or at a remote location.

• Create virtual tables using business rules from your data sources in order to scrutinize and filter your data.

• Run multiple data metrics operations on different data sources at the same time.

• Run primary/foreign key and redundant data analyses to maintain the referential integrity of your data.

• Monitor the structure of your data as you change and update your content.

The following tools are available within Profile.

dfPower Profile – Configurator—Set Up Jobs to Profile Data Sources

The Configurator screen appears when you first start Profile. Use the Configurator to set up jobs to profile your data sources.


dfPower Profile – Viewer—View Profile Results

Use the Viewer to view the results of a dfPower Profile job. By default, this screen appears automatically. You can also access it from the Configurator main screen by choosing Tools > dfPower Profile (Viewer).

dfPower Profile – Compare—Compare Tables from Multiple Data Sources

dfPower Profile – Compare helps you compare content from multiple data sources. With this tool, you can:

• Compare table copies for identical content; determine whether files have been deleted, whether extra calculated fields have been added, whether a table is a subset, and so on.

• Produce one spreadsheet, text file, or HTML page showing all variables on all databases.

• Build an output file with all the values of all the variables on any number of databases.

For more information on the Compare function, see the online Help, which links to the DBMC manual and related web content.

dfPower Profile – Scheme Builder—Build Standardization Schemes

Use the Scheme Builder to build schemes from an analysis report, edit existing schemes, or create schemes from scratch.

dfPower Quality

dfPower Quality uses DataFlux matching technology, transformation routines, and identification logic to help correct the most common data problems, including duplicate records, non-standard data representations, and indeterminate data types. dfPower Quality consists of two components:

• dfPower Quality – Analysis

• dfPower Quality – Standardize


Note: Match appeared in dfPower Studio – Quality in earlier releases. The Match function is now accessible from dfPower Studio – Integration.

dfPower Quality – Analysis—Analyze Data Sources

dfPower Quality – Analysis helps you discover the data-quality issues that exist in your data sources, then build business logic that can be used by other dfPower Studio applications and DataFlux technology to resolve data quality issues throughout the enterprise.

Primarily, dfPower Quality - Analysis addresses the subset of data quality known as data inconsistency. Data representation inconsistency can have damaging effects on any system, whether it is a data warehouse, a customer relationship management system, a marketing database, a human resource system, a sales force automation system, or even a plain operational data store. It can cause inaccurate results, especially when reporting, querying, or performing any type of quantitative analysis on a data set.

Importantly, dfPower Quality - Analysis uses your enterprise's own data to generate the business logic that will be used throughout dfPower Studio. This allows you not only to apply generic data quality standards to achieve better business intelligence results, but also to resolve issues that are unique to your organization. dfPower Quality – Analysis will actually write transformation rules for you by inferring standard values from your own data.

dfPower Quality – Standardize—Standardize Your Data

Use dfPower Quality – Standardize to scrub, clean, purify, standardize, and otherwise ensure a consistent representation of any data that exists in your organization’s databases. This function makes the data significantly more usable, enabling more accurate results when issuing queries and generating reports, while also increasing the economic value hidden within the data.

dfPower Integration

Similar data often exists in multiple databases. For example, the same product line might have different product names in European and U.S. markets. Integration helps establish the commonality among the various records. Several methods can be used for integration, or data matching. DataFlux uses a modified deterministic matching method. This means that the rules to determine matches are known before matching begins. This approach differs from probabilistic matching, which looks at data elements in relation to other data elements in the data set to determine matches.

For more detailed information on integration strategies, refer to the DataFlux white paper, “The Building Blocks of Data Management,” which you can find via the Customer Portal at www.dataflux.com.

dfPower Integration – Match—Generate Match Codes to Identify Duplicate Records

dfPower Integration – Match automates the otherwise tedious, time-consuming task of identifying duplicate and near-duplicate records that can threaten the integrity of an organization's data. Its flexible match sensitivities and detailed reports enable dfPower Integration – Match to identify and help eliminate a wide array of duplicate and near-duplicate records.

dfPower Integration – Match has complete functionality not only for producing detailed match reports, but also for manual and automatic best-surviving-record determination (also known as a duplicate elimination process).


You can also use dfPower Integration – Match to identify households: groups of related customers that share a physical and/or other relationship. Households provide a way to use data about your customers’ relationships with each other to gain insight into their spending patterns, job mobility, family moves and additions, and more.

dfPower Integration – Merge—Combine Data from Multiple Sources

Use dfPower Integration – Merge to combine data from multiple sources. Typically, you will have performed various data transformations before you attempt to merge files.

Likely, you will generate match codes via the Integration – Match function (or directly from Architect) and then use the virtually created match code fields as comparison data to match records.

For more detailed information on data merging strategies, refer to the DataFlux white paper, “The Building Blocks of Data Management,” which you can find via the Customer Portal at www.dataflux.com.

dfPower Enrichment

Enrichment permits you to incorporate external data that is not directly related to the base data; this extra data can yield useful insights when your data set is incomplete. Missing data can prevent you from adequately recognizing customer needs, or it may be difficult to tie these record types to other information that already exists in your system.

These problems really have two aspects. First is the issue of data completeness. A typical example of incomplete data would be a missing ZIP code in an address field; the incomplete data precludes any mailing effort or ZIP code-based demographic mapping.

Second is the issue of data keying. You may have a complete view of your data for existing needs but desire geographical data like longitude and latitude to further some recently instituted goal. Finally, DataFlux considers any combination of supplemental data and custom or targeted algorithms designed to tackle industry-specific problems to be data enrichment.

dfPower Enrichment – Verify—Verify Addresses Against Reference Databases

Address standardization and verification is the key to most data quality projects involving customer contact information. Depending on your dfPower Studio license, dfPower Verify can process US, Canadian, and international addresses—standardizing address elements, adding valuable address-related and geocode data, and correcting spelling errors, invalid addresses, invalid cities and states, invalid postal codes, and invalid organization names. For US addresses, dfPower Verify can also geocode addresses to the ZIP+4 centroid level, providing longitude, latitude, state and county FIPS codes, and census tract and block group numbers.

You can also use dfPower Verify to report on telephone numbers with invalid area codes or determine area codes from postal codes.

dfPower Verify is CASS (Coding Accuracy Support System) certified by the United States Postal Service and SERP (Software Evaluation and Recognition Program) certified by Canada Post. Companies that use CASS-certified software applications are eligible to receive significant postal discounts for bulk mailings.


dfPower Enrichment – Extend—Use the Online Data Enhancement Tool

dfPower Extend is an application that connects directly to a data source using ODBC and creates fixed-width text files that can be sent to Donnelley Marketing's InfoConnect web service, which performs any number of address hygiene, privacy protection, and data enhancement services on your data.

Note: You must have an account with Donnelley Marketing in order to use this feature. Extend appeared in dfPower Studio – Base in earlier releases. The Extend function is now accessible from dfPower Studio – Base – ODBC Admin or from dfPower Studio – Enrichment – Extend.

dfPower Design

Designed for advanced users, dfPower Design is a development and testing bundle of applications for creating and modifying data-management algorithms (such as algorithms for matching and parsing) that are surfaced in other DataFlux products. dfPower Design provides several GUI components to facilitate this process, as well as extensive testing and reporting features to help you fine-tune your data quality routines and transformations.

Some of the dfPower Design tasks you can perform include:

• Creating data types and processing definitions based directly on your own data.

• Modifying DataFlux-designed processing definitions to meet the needs of your data.

• Creating and editing regular expression (Regex Library) files to format and otherwise clean inconsistent data.

• Creating and editing Phonetics Library files, which provide a means for better data matching.

• Creating and modifying extensive look-up tables (Vocabularies) and parsing rule libraries (Grammars) used for complex parsing routines.

• Creating and modifying character-level Chop Tables used to split apart strings of text according to the structure inherent in the data.

• Creating and modifying transformation tables (Schemes) that you can apply on a phrase or element-level basis.

• Testing parsing, standardization, and matching rules and definitions in an interactive or batch mode before you use them to process live data.

• Generating extensive character-level and token-level reporting information that allows you to see exactly how a string is parsed, transformed, cleaned, matched, or reformatted.

• Creating data types and processing definitions specific to your locale.

dfPower Design – Customize—Modify or create data quality algorithms.

dfPower Design – Vocab Editor—Create and modify collections of words.

dfPower Design – Regex Editor—Create and modify regular expressions.


dfPower Design – Phonetics Editor—Create rules to identify similar-sounding data strings.

dfPower Design – Chop Table Editor—Create ordered word lists from input data.

For more detailed information on available Customize functions, refer to the online Help.

dfPower Favorites

The Favorites menu provides a convenient place to store quick links to your favorite dfPower Studio applications.


Chapter 2: Getting Started

This chapter describes how to get started using dfPower Studio.

Topics in this chapter include:

• Before You Upgrade dfPower Studio and/or the Quality Knowledge Base

• System Requirements

• dfPower Studio Installation Wizard

• dfPower Studio Installation Command-Line Switches

• Downloading and Installing dfPower Verify Databases

• Launching dfPower Studio

• Licensing dfPower Studio

• Moving Around in dfPower Studio

• Accessing dfPower Studio Online Help

Before You Upgrade dfPower Studio and/or the Quality Knowledge Base

You do not need to uninstall a previous version of dfPower Studio before you install a new version. By default, new versions install into a separate directory indicated by the version number. We recommend you install the new version into a new directory, allowing the older and newer versions to operate side by side. Note that uninstalling dfPower Studio first might delete ODBC drivers that you are currently using but that have been discontinued in the upgraded software installation.

Note: To install dfPower Studio on Terminal Server, use the Windows Control Panel > Add or Remove Programs option. Failing to do so may cause dfPower Studio to be installed incorrectly.


Step 1: Create backup copies of your personal resource files, such as jobs and reports that are displayed in the dfPower Studio Navigator.

Note: These files are stored by default in the \Program Files\DataFlux\dfPower Studio\<version number>\mgmtrsrc directory.

Step 2: Make a backup copy of your Quality Knowledge Base (QKB) prior to installing a new QKB.

Note: By default, the QKB location is \Program Files\DataFlux\QltyKB. If you choose to install the latest QKB, you can install it into a new directory and change your Options > QKB Directory setting in dfPower Studio to point to the new location. You can also install it into a new directory and use dfPower Customize and the import features of dfPower Studio to selectively import portions of the updated QKB. The final option is to install the new QKB directly into an existing QKB location. The installation has merge functionality that will incorporate the updates from DataFlux into your existing QKB while keeping any changes you might have made.
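The backups in steps 1 and 2 can also be scripted with standard Windows commands. This is a minimal sketch, assuming a default 7.1 installation and an example destination of C:\Backup (both paths are assumptions; adjust them for your system):

rem Hypothetical backup sketch; the source and destination paths are examples.
rem /E copies all subdirectories (including empty ones); /I treats the target as a directory.
xcopy "C:\Program Files\DataFlux\dfPower Studio\7.1\mgmtrsrc" "C:\Backup\mgmtrsrc" /E /I
xcopy "C:\Program Files\DataFlux\QltyKB" "C:\Backup\QltyKB" /E /I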

For version 7, the QKB format for storing metadata has changed. This change could cause problems during your installation unless a few steps are taken to accommodate the change. Here are a few scenarios:

1. If you install dfPower Studio version 7 on a machine that has version 6.2 or earlier installed and QKB version 2004D or earlier already installed, the installation process will convert the version 6.2 metadata file to be compatible with version 7.

2. If you install dfPower Studio version 7 on a new machine and then install QKB version 2004D or earlier, you will need to run a conversion program that will convert QKB version 2004D to use the new metadata file format. To do this, use the “vault_merge.exe” application in the root directory of your QKB. Launch it from the command line and use the following syntax:

vault_merge --convert <new file> <old file>

An example:

vault_merge --convert "C:\Program Files\DataFlux\QltyKB\2004D\qltykb.db" "C:\Program Files\DataFlux\QltyKB\2004D\qltykb.xml"

3. You can also install QKB version 2005A or later. This version will already have a metadata file in the correct format for dfPower Studio 7.

A QKB is not delivered in the same installation as dfPower Studio. You can use an existing Quality Knowledge Base or you can install the most recent version of the Quality Knowledge Base, available at www.dataflux.com/qkb/.

Note: If you do choose to download the latest QKB version, install it after you install dfPower Studio.


System Requirements

              Minimum                         Recommended
Platform      Windows NT 4.0/2000/2003/XP     Windows 2000 or Windows XP Professional
Processor     Pentium 4 – 1.2 GHz or higher   Pentium 4 – 2.2 GHz or higher
Memory (RAM)  512 MB                          2+ GB
Disk Space    5 GB                            10+ GB

dfPower Studio Installation Wizard

Use the following procedure to install your copy of dfPower Studio using the installation wizard.

Note: This procedure is designed for Windows users. Non-Windows users should refer to dfPower Studio Installation Command-Line Switches.

1. Insert the dfPower Studio CD into your computer’s CD drive. A screen appears that offers several options. Select Install dfPower Studio 7.1.

2. From the Windows taskbar, choose Start > Run. The Run dialog box appears.

3. In the Open text box, type <drive>:\dfPowerStudio.exe, replacing <drive> with the letter for your CD drive. For example, if your CD drive is e, type e:\dfPowerStudio.exe.

4. Press Enter. The setup wizard begins, as shown below. This screen offers a general welcome. Click Next to continue.

Note: You can also start the dfPower Studio setup by navigating to and double-clicking the dfPowerStudio.exe file on the dfPower Studio CD.


5. The Choose Destination Location screen appears, as shown in the following illustration. Use this screen to specify the directory where you want to install dfPower Studio. We suggest you use the default directory. Do not install a new version of dfPower Studio into an older version’s directory; for example, do not install version 7.1 into an existing 7.0 directory. However, if you need to, you can re-install the same version of dfPower Studio into the same directory. Click Next to continue.


6. The Select Additional Components screen appears, as shown in the following illustration. Use this screen to specify the dfPower Studio applications you want to install. By default, all applications are installed; however, only the applications you have licensed will be operational. It will not harm your computer or dfPower Studio installation to install applications for which you have no license, and doing so makes later licensing of additional applications easier. If you do not install all the applications now, you can install them later by re-running the installation. Click Next to continue.


7. The Copy Resources screen appears, as shown in the following illustration. If you have a previously installed version of dfPower Studio on your computer, you can use this screen to make all your existing reports and jobs available to the new version of dfPower Studio. Click Next to continue.


8. The dfPower Studio License Agreement screen appears, as shown in the following illustration. Use this screen to review and accept the software license agreement. Click Accept to continue.


9. The Select Program Manager Group screen appears, as shown in the following illustration. Use this screen to specify the name for the Program Manager group that is added to the Windows Start menu. By default, the group will be named DataFlux dfPower Studio 7.1. Click Next to continue.


10. The Start Installation screen appears, as shown in the following illustration. Use this screen to confirm your installation wizard selections so far. Click Next to continue.

11. The dfPower Batch Scheduler screen appears, as shown in the following illustration. dfPower Studio includes a batch scheduler as part of its Base – Batch application. Click Yes to install this scheduler as a Windows service, or click No to install it as a standard executable file.

12. When the installation wizard completes, check for any available dfPower Studio patches or updates, available at www.dataflux.com/products/update.asp.


dfPower Studio Installation Command-Line Switches

Two command-line switches allow you to modify the way dfPower Studio installs:

• /S - Install dfPower Studio in "silent mode." Use this switch in conjunction with the /M switch described below.

• /M=<filename> - Use installation variables from an external text file. The available variables are:

o MAINDIR=<path> - Specify the dfPower Studio installation location.

o COMPONENTS=[A] [B] [C] [D] - Specify the dfPower Studio application components to install: A = dfPower Verify, B = dfPower Match, C = dfPower Customize, D = dfPower Profile.

o RUNMDAC=[YES/NO] - Specify if you want to install MDAC (Microsoft Data Access Components).

o RUNCONVWIZ=[YES/NO] - Specify if you want to run the Conversion Wizard Utility. The conversion wizard changes dfPower Studio version 4.x-style jobs into version 5.x/6.x-style jobs. Note: Specifying YES will bring up a related dialog box during installation.

o RUNSASODBC=[YES/NO] - Specify if you want to install the SAS ODBC driver.

o RUNODBC=[YES/NO] - Specify if you want to install DataFlux ODBC Drivers. Note: This variable will also specify if you want to install the sample DataFlux database. If you do not install this database, none of the sample jobs set up in dfPower Studio will work correctly.

Sample installation command line:

dfPowerStudio.exe /S /M=dfinst.txt

where dfinst.txt contains the following text:

MAINDIR=C:\Program Files\DataFlux\dfPower Studio\7.1
COMPONENTS=ABCD
RUNMDAC=NO
RUNCONVWIZ=NO
RUNSASODBC=NO
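The whole silent install can be scripted as well. The following batch sketch simply writes the variable file above and then runs the installer; the E: CD-drive letter is an assumption, and everything else mirrors the sample:

rem Hypothetical sketch: generate the variable file, then run a silent install.
> dfinst.txt echo MAINDIR=C:\Program Files\DataFlux\dfPower Studio\7.1
>> dfinst.txt echo COMPONENTS=ABCD
>> dfinst.txt echo RUNMDAC=NO
>> dfinst.txt echo RUNCONVWIZ=NO
>> dfinst.txt echo RUNSASODBC=NO
rem E: is assumed to be the CD drive that holds dfPowerStudio.exe; adjust as needed.
E:\dfPowerStudio.exe /S /M=dfinst.txt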

Downloading and Installing dfPower Verify Databases

If your dfPower Studio installation included dfPower Verify, you need to use one or both of the following procedures to install the proper USPS, SERP, and/or Geocode databases to make dfPower Verify work correctly. Until you install the proper database(s), dfPower Verify will operate in "trial mode" using a sample North Carolina database. If you are licensed to use dfPower Verify – QAS, you must acquire the postal reference databases directly from QAS for the countries they support. For more information, contact your DataFlux representative.

A few things to consider when installing QAS data:

1. Install dfPower Studio. This will install the QAS libraries and required configuration files.

2. Start the installation wizard for your QAS data, as described later in “Installing dfPower Verify Databases.”


3. The installation wizard will ask you to select the application you want to update. Specify the path to the “bin” directory where you installed dfPower Studio. The default location is “\Program Files\DataFlux\dfPower Studio\7.1\bin.”

4. Continue with the installation. When the installation wizard prompts you for a license key, enter your key for the locale you are installing.

If QAS data is already installed on your computer and you want to use it with dfPower Studio, perform the following steps:

1. In the “bin” directory where you installed dfPower Studio—the default location is: “\Program Files\DataFlux\dfPower Studio\7.1\bin”—use a text editor such as Windows Notepad to open the qalicn.ini file.

2. Enter your license keys, each on a separate line. (For additional information, see the instructions in the qalicn.ini file.)

3. Copy your existing qawserve.ini file to the “bin” directory.

If you have dfPower Verify – World address databases, follow these steps to install the data and configure dfPower Studio appropriately:

1. Copy the country database (.MD file) and the address configuration file (addressformat.cfg) into the same folder of your choice. For example:

C:\World_Database\AD40NZL.MD (database file)
C:\World_Database\addressformat.cfg (configuration file)

2. Use a text editor to edit your dfStudio7.ini file (typically found in your “Windows” system directory) to add the entries for WorldDir and WorldLic. Make sure the WorldDir points to the directory where you installed the database and configuration file, and then copy in the license key next to the WorldLic entry.

WorldDir=C:\World_Database

WorldLic=license key goes here

3. In dfPower Studio, in the Options dialog, under the References section, set the World Address Database Directory to be the same directory you used in the dfStudio7.ini file. In this example, you would type C:\World_Database in that text box.

Downloading dfPower Verify Databases

• If you have received a CD from DataFlux containing the dfPower Verify database installation file, skip to the next procedure, “Installing dfPower Verify Databases.”

• If you do NOT have a CD from DataFlux containing the dfPower Verify database installation file, use the following procedure to download this file from the DataFlux FTP server.

Before You Begin

• This procedure is designed for Windows users and requires an active Internet connection.


• If you prefer, you can use your favorite FTP client or the FTP command instead of Internet Explorer to download this file; a command-line sketch follows this list.

• This procedure requires a DataFlux FTP username and password. If you do not know your username and password, contact DataFlux Technical Support at 919-531-9000.
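If you choose the command line, here is a minimal batch sketch using the standard Windows ftp client. The fred/ABC123 credentials are the same placeholders used in the procedure below, and VerifyDataSetup.exe is the US installation file named in step 6; substitute your own values:

rem Hypothetical sketch: scripted download with the standard Windows ftp client.
echo user fred ABC123> ftpcmds.txt
echo binary>> ftpcmds.txt
echo cd usps/all/win>> ftpcmds.txt
echo get VerifyDataSetup.exe>> ftpcmds.txt
echo bye>> ftpcmds.txt
ftp -n -s:ftpcmds.txt ftp.dataflux.com
del ftpcmds.txt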

1. Start Microsoft’s Internet Explorer.

2. In Internet Explorer, choose Tools > Internet Options. The Internet Options screen appears.

3. On the Advanced tab, check Enable Folder View for FTP Sites, then click OK.

4. In the Address box near the top of the Internet Explorer screen, type ftp://<username>:<password>@ftp.dataflux.com/usps/all/win, where <username> is your DataFlux FTP username and <password> is your DataFlux FTP password. For example, if your username is fred and your password is ABC123, type ftp://fred:ABC123@ftp.dataflux.com/usps/all/win. Do not include any spaces.

5. Press Enter. The DataFlux FTP directory that contains the dfPower Verify database installation file appears.

6. Right-click on the appropriate installation file, then click Copy to Folder. The Browse for Folder screen appears. For dfPower Verify – US users, this file is VerifyDataSetup.exe. For dfPower Verify – Canada users, this file is VerifyCanDataSetup.exe. For dfPower Verify – Geocode US, dfPower Verify – Geocode Canada, and dfPower Verify PhonePlus users, this file is VerifyGeoPhoneDataSetup.exe.

7. Use the Browse for Folder screen to select a location for the installation file, then click OK. Internet Explorer downloads the file.


Note: Be sure to select a location to which you have write access and which has at least 430 MB of free space.

Installing dfPower Verify Databases

• If you have received a CD from DataFlux containing the dfPower Verify database installation file, insert that CD in your CD drive.

• If you have downloaded the installation file from the DataFlux FTP server, make sure you know the file’s exact location.

Note: This procedure is designed for Windows users.

1. Close all other Windows applications.

2. Browse to and double-click the dfPower Verify database installation file on the CD or in your download location. The setup wizard begins.

3. Follow the setup wizard instructions.

Launching dfPower Studio

After you install dfPower Studio and any required databases, use the following one-step procedure to launch dfPower Studio.

1. From the Windows taskbar, choose Start > Programs > DataFlux dfPower Studio 7.1 > dfPower Studio. The dfPower Studio main screen appears.


Tip:

• If a small dfPower Studio Trial Version screen appears, click Start dfPower to display the dfPower Studio main screen.

• To exit dfPower Studio, from the dfPower Studio main screen, choose Studio > Exit.

Licensing dfPower Studio

Use the following procedure to license dfPower Studio and gain access to all dfPower Studio applications that you have licensed.

1. If necessary, launch dfPower Studio.

2. From the dfPower Studio main screen, choose Help > Licensing. The dfPower Studio - Licensing screen appears.

3. Note your five-digit Machine Code and License Directory.

4. Exit dfPower Studio.


5. Use your Web browser to navigate to www.dataflux.com/customers/unlock.asp. A DataFlux Product Registration form appears.

6. Complete the form. Be sure to enter your Machine Code correctly and to complete the form fully.

7. Click Submit to submit the form to DataFlux. Within 24 hours, DataFlux will email you a license file, studio.lic.

8. Copy studio.lic to the License Directory you noted in Step 3. (By default, this directory is Program Files\DataFlux\dfPower Studio\7.1\license.) A one-line example follows this procedure.

9. Re-launch dfPower Studio. The dfPower Studio title bar no longer includes the words “Trial Version,” and the dfPower Studio main screen displays the appropriate dfPower Studio applications based on your license agreement.
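For example, run the following from the folder where you saved the emailed file; the path shown is the default License Directory, so use the directory you noted in step 3 if yours differs:

rem Copy the emailed license file into the license directory from step 3.
copy studio.lic "C:\Program Files\DataFlux\dfPower Studio\7.1\license"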

Note: Rather than using the Product Registration form in steps 5–7, you can email the following information to [email protected]:

• Your first and last name

• Your company’s name

• Your email address

• Your work telephone number

• Your work address

• Your Machine Code

• The DataFlux product you want to license (dfPower Studio)

• The dfPower Studio release (7.1)

• Your computer’s operating system (for example, Windows 2000 Professional)

Moving Around in dfPower Studio

Moving around in dfPower Studio is similar to moving around in other Windows applications. The only significant differences in navigation are:

• The main toolbar appears at the left side of the dfPower Studio main screen, while many Windows applications instead have toolbars along the top of the screen.

• Some actions in dfPower Studio will open a separate application screen for the task you want to perform. For example, if you click Architect in the toolbar, dfPower Studio opens a separate dfPower Architect screen.

Accessing dfPower Studio Online Help

In addition to this User’s Guide, dfPower Studio provides an online Help system that describes the finer details of using dfPower Studio. Categories of information in the online Help include:

• Help Me Understand, which contains explanations of key dfPower Studio concepts.

• How Do I, which contains step-by-step instructions for performing common dfPower Studio tasks.


• Screen Descriptions, which contains detailed descriptions of dfPower Studio screens and the controls and fields on each of those screens.

• Reference, which contains details on data types and layouts, codes, metrics, the Expression Engine Language, and the like.

There are two main ways to access the online Help:

• On any dfPower Studio screen that has a menu bar, choose Help > Help Topics. The dfPower Studio Help opening screen appears.

• If available, click on a screen’s Help button. The dfPower Studio Help screen appears, usually displaying a detailed description of the current screen, as well as the controls and fields on that screen.

Regardless of how you access the online Help, you can use the navigation pane at the left side of the Help screen to display other Help topics.

Tip:

• If the navigation pane does not appear automatically, click Show Navigation Pane near the top of the Help screen.

• Help categories display book icons. Help topics display page icons.

• To list the topics available in a category (book), click the category’s name.

• To view a topic, click the topic’s name.

• Use the navigation pane’s Search tab to find Help topics. For more information on this and other online Help features, see the Tips for Using this Online Help System topic in the online Help.


Chapter 3: dfPower Studio in Action

This chapter walks through a sample data management scenario to provide an overview of using dfPower Studio, including walkthroughs of a Profile job and an Architect job.

Sample Data Management Scenario

The sample data management scenario highlights the typical procedures in a data-management project, following the DataFlux five-building-block methodology shown in the following illustration:

[Figure 6. The Building Blocks of the AIC Process]

Topics in this chapter include:

• Data Management Scenario Background

• Profiling

• Quality

• Integration

• Enrichment

• Monitoring

Other Sources

• This chapter highlights some typical procedures in a data-improvement project. For step-by-step details of procedures, see the How Do I... topics in the online Help.

• This chapter covers only part of the functionality available from dfPower Studio. For more information on what you can accomplish with dfPower Studio, see the online Help.

• The first scenario in this chapter follows a straight-line path through the DataFlux methodology. In practice, however, this methodology is highly iterative; as you find problems, you will dig deeper to find more problems, and as you fix problems, you will find more problems to fix.

• Throughout the data management scenario, we mention the dfPower Studio applications we use to accomplish each task. In most cases, a task can be performed by more than one application, although the available options and specific steps to accomplish that task might differ. For example, you can profile data with both dfPower Profile and dfPower Architect's Basic Statistics and Frequency Distribution job steps.

Data Management Scenario Background

State Bank services thousands of personal, business, and public-sector accounts, with offices throughout the state. State Bank has just acquired County Bank and needs to integrate the customer records from both banks.

Our job is to:

• ensure that all records follow the same data standards,

• join the records into one database,


• identify and merge duplicate records, and

• prepare the records for an upcoming mailing to all customers

Profiling

Profiling is a proactive approach to understanding your data. Also called data discovery or data auditing, data profiling helps you discover the major features of your data landscape: what data is available in your organization and the characteristics of those data.

In preparing for an upcoming mailing, we know that invalid and non-standard addresses will cause a high rate of returned mail. By eventually standardizing and validating the data, we will lower our risks of not reaching customers and incurring unnecessary mailing costs. Also, by understanding the data structure of both banks’ customer records, we will be better able to join, merge, and de-duplicate those records.

Where are your organization’s data? How do data in one database or table relate to data in another? Are your data consistent within and across databases and tables? Are your data complete and up to date? Do you have any multiple records for the same customer, vendor, or product? Good data profiling serves as the foundation for successful data-management projects by answering these questions up front. After all, if you do not know the condition of your data, how can you effectively address your data problems?

Profiling helps you determine what data and types of data you need to change to make your data usable.

Data Profiling Discovery Techniques

The many techniques and processes used for data profiling fall into three major categories:

• Structure Discovery

• Data Discovery

• Relationship Discovery

The next three sections address each of these major categories in our State Bank/County Bank scenario.

Caution! As you profile your own data, resist the urge to correct data on the spot. Not only can this become a labor-intensive project, but you might be changing valid data that only appears invalid. Only after you profile your data will you be fully equipped to start correcting data efficiently.

Structure Discovery

The first step in profiling the State and County Bank data is to examine the structure of that data.

In structure discovery, we will determine if:

• our data match their corresponding metadata,

• data patterns match expected patterns, and

• data adhere to appropriate uniqueness and null-value rules

Discovering structure issues early in the process can save much work later on, and establishes a solid foundation for all other data-management tasks.


To get started, we launch the Configurator component of the dfPower Profile application from the dfPower Studio main screen, connect to the State Bank database and customer table, select all fields in the table, select all available metrics, and run the resulting Profile job. When the job is complete, the results appear in dfPower Profile’s Viewer component.

Here is some of the information we can see about each data field:

• Column profiling: For each field, a set of metrics such as data type, minimum and maximum values, null and blank counts, and data length.

• Frequency distribution: How often the same value occurs in that field, presented both as text and in a chart.

• Pattern distribution: How often the same pattern occurs in that field, presented both as text and in a chart.

The information we can glean from each metric depends on the field. For example, we look at some column profiling metrics for the State Bank field that specifies the year customers opened their first account with the bank, 1STACCTOPND:

State Bank 1STACCTOPND Field

Metric Name      Metric Value
Data Type        VARCHAR
Unique Count     976
Pattern Count    2
Minimum Value    01
Maximum Value    2208
Maximum Length   4
Null Count       24
Blank Count      35
Actual Type      integer
Data Length      20

These metrics highlight several issues:

• The official type set for the field is VARCHAR, but the actual values are all integers. While this is not a major data problem, it can slow processing.

• The pattern count is 2, indicating that the years are not recorded in a consistent pattern. Perhaps a year such as 1980 is sometimes stored as just “80.”

• The maximum length of the actual data values is 4, but the data length reserved for the field is 20. That is 16 characters of wasted space for each record.

• Both the Null Count and Blank Count are greater than zero. Is it the bank's policy to have a value in this field for every customer?

Note: We will look at the Unique Count, Minimum Value, and Maximum Value metrics later in Data Discovery.
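To make these column profiling metrics concrete, here is a minimal sketch in Python of how such metrics could be computed for one field. This is illustrative only, not DataFlux code; the profile_field helper and the sample values are hypothetical, and the string-based min/max mirrors how a VARCHAR field can report "01" as its minimum and "2208" as its maximum.

    import re

    def profile_field(values):
        """Compute a few column-profiling metrics for one field.
        None represents a null; an empty/whitespace string is a blank."""
        non_null = [v for v in values if v is not None]
        filled = [v for v in non_null if v.strip() != ""]
        patterns = {re.sub(r"[0-9]", "9", v) for v in filled}  # digits only here
        return {
            "Unique Count": len(set(filled)),
            "Pattern Count": len(patterns),
            "Minimum Value": min(filled, default=None),  # string comparison
            "Maximum Value": max(filled, default=None),
            "Maximum Length": max((len(v) for v in filled), default=0),
            "Null Count": len(values) - len(non_null),
            "Blank Count": len(non_null) - len(filled),
            "Actual Type": "integer" if filled and all(v.isdigit() for v in filled) else "varchar",
        }

    print(profile_field(["1936", "01", None, " ", "2208"]))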


Next, we look at the pattern frequency distribution for this field:

State Bank 1STACCTOPND Field

Pattern   Alternate   Count   Percentage
99        9(2)        229     0.0458
9999      9(4)        4771    0.9542

The table above shows that 1STACCTOPND years are expressed with both two and four digits. Fortunately, the 1STACCTOPND field does appear to contain year data, as we expected.

Note: dfPower Studio expresses patterns using “9” for a digit, “A” for an uppercase letter, “a” for a lowercase letter, and spaces and punctuation marks for spaces and punctuation marks. In addition, dfPower Studio also expresses alternate shorthand patterns, where each “9,” “A,” and “a” is followed by a number in parentheses that provides a count of each character type at that location in the value, unless that count is 1. The following table illustrates dfPower Studio patterns and alternates.

Value                Pattern              Alternate
1                    9                    9
1997                 9999                 9(4)
1-919-555-1212       9-999-999-9999       9-9(3)-9(3)-9(4)
(919) 555-1212       (999) 999-9999       (9(3)) 9(3)-9(4)
Mary                 Aaaa                 Aa(3)
Mary Smith-Johnson   Aaaa Aaaaa-Aaaaaaa   Aa(3) Aa(4)-Aa(6)
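Because the note above spells out the pattern rules exactly, they are easy to mimic. The following sketch is our own illustration, not the dfPower implementation, but it reproduces the pattern and alternate forms shown in the table:

    import re
    from itertools import groupby

    def pattern(value):
        """Map digits to 9, uppercase to A, lowercase to a; keep all else."""
        value = re.sub(r"[0-9]", "9", value)
        value = re.sub(r"[A-Z]", "A", value)
        return re.sub(r"[a-z]", "a", value)

    def alternate(value):
        """Collapse each run of 9/A/a longer than one character into the
        character followed by the run length in parentheses."""
        out = []
        for ch, run in groupby(pattern(value)):
            n = len(list(run))
            out.append(f"{ch}({n})" if ch in "9Aa" and n > 1 else ch * n)
        return "".join(out)

    assert pattern("(919) 555-1212") == "(999) 999-9999"
    assert alternate("(919) 555-1212") == "(9(3)) 9(3)-9(4)"
    assert alternate("Mary Smith-Johnson") == "Aa(3) Aa(4)-Aa(6)"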

We continue to review these metrics for every field in this table, making notes on each data problem we find. We also review metrics for the County Bank’s fields, where we find additional data problems. For example, County Bank’s Phone field uses three different patterns.

Data Discovery

Our second data profiling step, data discovery, will help us determine whether our data values are complete, accurate, and unambiguous. If they are not, this can prevent us from adequately recognizing the needs of the banks' customers, and can make it difficult to tie these records to other data.

We take a second look at the set of metrics we first saw in structure discovery for the 1STACCTOPND field:


State Bank 1STACCTOPND Field

Metric Name      Metric Value
Data Type        VARCHAR
Unique Count     976
Pattern Count    2
Minimum Value    01
Maximum Value    2208
Maximum Length   4
Null Count       24
Blank Count      35
Actual Type      integer
Data Length      20

This time, focusing on the Unique Count, Minimum Value, and Maximum Value metrics, we discover new problems:

• Given that State Bank was founded in 1936, the likely range of years is 1936 through the current year, or about 70 years. However, the field’s unique count is 976, implying there are many more values than just 1936, 1937, 1938, and so on.

• The Minimum Value is 01. This appears to be part of the problem we saw earlier with years being expressed in both two and four digits. Does "01" mean 2001?

• The Maximum Value is 2208. Assuming no State Bank employees or customers are time travelers, this must be incorrect.

We do this analysis for every field for both banks.


We next look at part of the frequency distribution for each field. The following table shows frequency distribution for State Bank’s 1STACCTOPND field.

State Bank 1STACCTOPND Field

Value   Count   Percentage
…
54      14      0.0028
55      9       0.0018
56      6       0.0012
1936    34      0.0068
1937    41      0.0082
1938    43      0.0086
2001    165     0.0330
2002    189     0.0378
2003    203     0.0406

We already determined from the pattern frequency distribution in structure discovery that there are two patterns for 1STACCTOPND and that there are 229 1STACCTOPND fields we need to change to always express years with four digits. Now, however, we can see the actual values, and the distribution of the four-digit year seems right, with each year showing progressively more customers who remain with the bank.

Next, we look at these same count metrics for both banks across several fields:

State Bank
               Tax ID   Gender   Street Address   City   State   ZIP    Phone
Count          4985     3873     4998             4650   4650    4747   4184
Null Count     15       1127     2                350    114     253    816
Unique Count   4879     4        3962             27     4       46     3922

County Bank
               Tax ID   Gender   Street Address   City   State   ZIP    Phone
Count          997      836      985              926    966     881    766
Null Count     3        164      15               74     34      119    234
Unique Count   916      3        815              6      2       86     692

The two preceding tables clearly show that a large number of records from both banks are incomplete. In particular, null counts greater than 0 for both banks show that data is missing for Gender, City, State, ZIP, and Phone. Also, County Bank has some missing data for Street Address.

Looking at the Count and Unique Count values, we are glad to discover that both values are high for both banks’ Tax ID field. This means we should be able to use this field to identify most if not all customers uniquely, greatly helping us find multiple records for the same customer within and across the banks.


The Unique Count values also reveal some problems:

• State Bank’s Gender field contains four unique values, even though there are only three valid Gender values: F (female), M (male), and U (unknown).

• County Bank’s City field contains six unique values while the ZIP field contains 86 unique values. Something is wrong here.

To find other issues, we review data length and type metrics for both banks across several fields:

State Bank
                 Tax ID    Name      Street Address   City      State     ZIP
Data Length      9         30        75               20        25        10
Maximum Length   9         27        39               10        20        5
Data Type        INTEGER   VARCHAR   VARCHAR          VARCHAR   VARCHAR   INTEGER

County Bank
                 Tax ID    First Name   Last Name   Street Address   City      State     ZIP
Data Length      11        20           20          50               25        15        10
Maximum Length   9         17           18          47               23        3         5
Data Type        VARCHAR   VARCHAR      VARCHAR     VARCHAR          VARCHAR   VARCHAR   VARCHAR

These metrics highlight some more problems:

• State Bank stores customers’ full names in a single Name field, while County Bank uses separate First Name and Last Name fields.

• While many of the other fields in both tables have the same name, the fields are of different lengths. For example, the City Data Length is 20 for State Bank and 25 for County Bank. Also, the City Maximum Length is 23 for County Bank, which is larger than State Bank's City Data Length of 20. This difference would cause truncated City data if we merged the two tables without first making some changes.

• Similar fields also have different data types between the two tables. For example, State Bank’s Tax ID field is set to INTEGER, while County Bank’s Tax ID is set to VARCHAR.

We next look at the actual records. To do this from the report in dfPower Profile’s Viewer component, we select a field with questionable metrics, and then double-click on that metric. For example, to view records with “80” as the 1STACCTOPND value, in the Viewer we select the 1STACCTOPND field, show the frequency distribution for that field, then double-click on the “80” value. The following two tables show selected field values for a random five of each bank’s customer records:


State Bank

Name          Tax ID      Gender   Street Address         City     State            ZIP     Phone
Bill Jones    152387623   M        1324 New Road                   North Carolina   27712   716-479-4990
Sam Smith     587354917            253 Forest RD          Cary     NC
Julie Swift   784632845   F        159 Merle St                                     27712
Mary Wise     785213678   O        115 Dublin Woods DR                              27513
Jim           458795477   U        100 Main Street East   Durham   NC               32934   919-662-5301

Bill Jones’ record is missing a City value; Sam Smith’s record is missing Gender, ZIP, and Phone values; Julie Swift’s record is missing City, State, and Phone values; and Mary Wise’s record is missing City, State, and Phone values. Viewing the actual record data, we also discover something metrics did not show:

• The area code for Bill Jones’ telephone number is invalid because 716 is a code for western New York.

• Jim’s Name field contains a value, but there is no last name.

• Jim’s ZIP code is invalid because 32934 is a code for central Florida.

• The street designations in Street Address are valid but inconsistent. For example, Bill Jones’ record uses “Road” while Sam Smith’s uses “RD.”

• The state designations in State are valid but inconsistent. For example, Bill Jones’ record uses “North Carolina” while Sam Smith’s uses “NC.”

We can also see why State Bank’s Gender field has four unique values: Mary Wise’s Gender value is an invalid “O.” To see how extensive this problem is, we can later look at the frequency distribution for the Gender field to see how often “O” appears in the field.

County Bank

First Name   Last Name   Tax ID        Gender   Street Address     City        State            ZIP     Phone
William      Jones       458-795-689   M        1324 Milton Road   Cary        North Carolina           (919) 303-9516
Joe          Mead        487-562-389            159 Poof St        Melbourne                    32934
Brad         Martin      455-687-747   F                           Durham                       27712   919-479-4992
James        Smith       467-888-898            100 Main Street    Carrboro    NC               27514
Jim          Smith       467-885-898   M        100 Main E St                                   27514   919 485 1963

County Bank’s records are also missing data, specifically Gender, Street Address, State, ZIP, and Phone. As with State Bank, when we view the actual County Bank record data, we discover some problems metrics did not show:


• Brad Martin’s Gender field indicates a female, but Brad is an unusual name for a woman.

• The bottom two records appear as if they might be for the same customer. The names and addresses are similar, but the Tax IDs differ by one digit. Is this a father and son living in the same household with two similar Tax IDs, or one man with a typo in his Tax ID?

By examining both banks' tables, we also discover that the Tax ID patterns are different. State Bank's Tax ID uses a 999999999 pattern, while County Bank's uses a 999-999-999 pattern.

Relationship Discovery

Our final data profiling step, relationship discovery, can help us answer these questions: Does the data adhere to specified required key relationships across columns and tables? Are there inferred relationships across columns, tables, or databases? Are there redundant data?

Here are some other problems that can result from the wrong relationships:

• A product ID exists in your invoice register, but no corresponding product is available in your product database. According to your systems, you have sold a product that does not exist.

• A customer ID exists on a sales order, but no corresponding customer is in your customer database. In effect, you have sold something to a customer with no possibility of delivering the product or billing the customer.

• You run out of a product in your warehouse with a particular UPC number. Your purchasing database has no corresponding UPC number. You have no way to restock the product.

For the purposes of demonstrating relationship discovery with our State Bank scenario, let us say we know from earlier profiling that while both banks use Standard Industrial Classification (SIC) codes for business customers, some of County Bank’s SIC codes are invalid because County did not use a SIC code lookup table as State Bank did. To help determine how much work might be involved in identifying and fixing the invalid codes for County Bank’s records, we will run two analyses:

• A redundant data analysis will determine how many values are common between the County Bank customer records and the State Bank’s SIC code lookup table.

• A primary key/foreign key analysis will show us which specific values in the customer records do not match, and thus are likely to be the invalid codes.

To run our redundant analysis, we start with dfPower Profile’s Configurator component. On the Configurator main screen, we right-click on the Code field in State Bank’s SIC Lookup table and choose Redundant Data Analysis. In the screen that appears, we select the SIC field in County Bank’s customer records table. Then, we run the job, and results appear in dfPower Profile’s Viewer component. In the Viewer, we select the SIC codes records table and Code field and display the Redundant Data Analysis results:

Field Name   Table Name   Primary Count   Common Count   Secondary Count
Code         SIC Lookup   1516            873            127

(The overlap of values between the Code and SIC fields is also shown graphically on a Venn diagram.)

The results tell us that in the customer table, 873 records have SIC code values that are also in the SIC Lookup table’s Code field, the SIC Lookup table contains 1,516 codes that are not used by the customer table, and the customer table has 127 occurrences of invalid SIC codes. This means we need to update 127 records in County Bank’s customer table.
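At its core, redundant data analysis is set arithmetic over the two fields' values. Here is a minimal sketch under that assumption, with hypothetical code values; it glosses over whether dfPower counts distinct values or record occurrences.

    def redundant_data_analysis(primary_values, secondary_values):
        """Compare a primary field (the lookup Code) against a secondary
        field (the customer SIC codes) as sets of values."""
        primary, secondary = set(primary_values), set(secondary_values)
        return {
            "Primary Count": len(primary - secondary),    # lookup codes never used
            "Common Count": len(primary & secondary),     # values found in both fields
            "Secondary Count": len(secondary - primary),  # customer codes not in the lookup
        }

    print(redundant_data_analysis(["0111", "0112", "0113"], ["0111", "9999"]))
    # {'Primary Count': 2, 'Common Count': 1, 'Secondary Count': 1}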


Now let us look at the same tables using primary key/foreign key analysis. To do this, we again start with dfPower Profile’s Configurator component. On the Configurator main screen, we right-click on the Code field in State Bank’s SIC Lookup table and choose Primary Key/Foreign Key Analysis. In the screen that appears, we select the SIC field in County Bank’s customer records table. Then, we run the job, and results appear in dfPower Profile’s Viewer component. In the Viewer, we select the SIC lookup table’s Code field and display the Primary Key/Foreign Key Analysis results:

Field Name   Table Name   Match Percentage
Code         SIC Lookup   87.3

Outliers for Field: SIC

Value   Count   Percentage
1007    47      0.047
2382    13      0.013
2995    3       0.003
4217    16      0.016
5068    9       0.009
7223    32      0.032
9327    7       0.007

The Match Percentage confirms that of all the SIC codes in County Bank’s customer records, 87.3% are also in State Bank’s SIC code lookup table. Much more valuable for our purposes, though, is the list of outliers for the SIC field. We can now see that although 12.7% of the SIC codes in County Bank’s customer records are invalid, only seven codes—1007, 2382, 2995, 4217, 5068, 7223, and 9327—are invalid. This could be pretty good news if the invalid codes are consistent; if so, we only need to figure out how the seven invalid codes map to valid ones.
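The primary key/foreign key view of the same comparison can be sketched just as briefly; here the outliers are customer-side values with no match in the lookup field. Again this is an illustration with made-up codes, not the dfPower implementation.

    from collections import Counter

    def fk_outliers(lookup_codes, customer_codes):
        """Return the match percentage plus a frequency list of the
        customer-side values that never appear in the lookup field."""
        valid = set(lookup_codes)
        outliers = Counter(c for c in customer_codes if c not in valid)
        unmatched = sum(outliers.values())
        match_pct = 100.0 * (len(customer_codes) - unmatched) / len(customer_codes)
        return match_pct, sorted(outliers.items())

    pct, bad = fk_outliers(["0111", "0112"], ["0111", "0111", "1007", "0112", "1007"])
    print(f"{pct:.1f}% matched; outliers: {bad}")  # 60.0% matched; outliers: [('1007', 2)]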

With all that we have learned about our data in data profiling—through structure discovery, data discovery, and relationship discovery—we are now ready to move on to the next data-management building block, Quality.

For more information on Profiling, start with the following topics in the dfPower Studio online Help:

• dfPower Profile – Introduction

• dfPower Profile – Getting Started

• dfPower Profile – Main Screen

• dfPower Profile – Metrics

• dfPower – Custom Metrics

• dfPower Profile – Viewer – Main Screen

• dfPower Profile – Viewer – DataFlux DB Record Viewer Screen

Quality

Using the next data management building block—Quality—we start to correct the problems we found through profiling. When we correct problems, we can choose to correct the source data, write the corrected data to a different field or table, or even write the proposed corrections to a report or audit file for team review. For example, we might leave the County Bank customer table untouched, writing all the corrected data to an entirely new table.


Typical operations to improve data quality include:

• parsing,

• standardization, and

• general data sanitization

The goal of these operations is to improve overall consistency, validity, and usability of the data. Quality can also include creating match codes in preparation for joining and merging in the Integration building block.

Parsing

Parsing refers to breaking a data string into its constituent parts based on the data's type. Parsing is usually done in preparation for later data-management work, such as verifying addresses, determining gender, and integrating data from multiple sources. Parsing can also improve the speed of database searches.

For our data, parsing State Bank’s Name field into separate First Name and Last Name fields will help us in three ways:

• It will make integrating the State Bank and County Bank name data easier.

• It will make it possible to verify existing and fill in missing State Bank gender identification.

• For an upcoming customer mailing, it will allow us to address customers in a more personal way. For example, we can open a letter with “Dear Edward” or “Dear Mr. Smith” instead of “Dear Edward Smith.”

Parsing Techniques

Using the dfPower Architect application, we can both parse names and identify gender for State Bank customers. To parse, we launch dfPower Architect from the dfPower Studio main screen, use the Data Source job step to add the State Bank customer table to the dfPower Architect job flow, add a Parsing job step, indicate that dfPower Architect should look in the Name field for both first and last names, and then output the parsed values into separate First Name and Last Name fields.

The following two tables show values before and after parsing:

State Bank – Before Parsing

Name          Gender
Bill Jones    M
Sam Smith
Julie Swift   F
Mary Wise     O
Jim           U


State Bank – After Parsing

Name          Gender   First Name   Last Name
Bill Jones    M        Bill         Jones
Sam Smith              Sam          Smith
Julie Swift   F        Julie        Swift
Mary Wise     O        Mary         Wise
Jim           U        Jim

Notice that we did not remove the Name values or change any of the Gender values. We just added fields and values for First Name and Last Name. Also, note that dfPower Architect assumed “Jim” was a first name and left the Last Name field blank.

With First Name in its own field, we can now use dfPower Architect for gender analysis. To do this, we add a Gender Analysis (Parsed) step and indicate that dfPower Architect should look in the new First Name field to determine gender and output updated gender values back to the existing Gender field. Our data now looks like this:

State Bank – After Gender Analysis

Name          Gender   First Name   Last Name
Bill Jones    M        Bill         Jones
Sam Smith     U        Sam          Smith
Julie Swift   F        Julie        Swift
Mary Wise     F        Mary         Wise
Jim           M        Jim

This time, the Gender field was updated. Note that Sam Smith is marked as a U because “Sam” is a common nickname for both men (Samuel) and women (Samantha). Also note that the Gender value for Jim was changed from “U” (unknown) to “M” (male).

We could have used the dfPower Architect Gender Analysis step both to parse the Name field and to identify gender in the same step. However, we would not then have been able to use the separate First Name and Last Name fields, which we know we will need later for our mailing. We also could have used the Gender Analysis step if the Name field contained first, middle, and last names; this approach would allow Architect to identify gender even better by using both first and middle names.
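The parse-then-look-up flow can be pictured with a small sketch. The tiny name table and both helpers below are hypothetical; dfPower's actual parse and gender analysis definitions are far richer than a whitespace split and a dictionary.

    # Hypothetical first-name lookup; real gender definitions are far larger.
    GENDER_BY_FIRST_NAME = {"bill": "M", "jim": "M", "julie": "F", "mary": "F"}

    def parse_name(name):
        """Split a full name into (first, last); a lone token is treated as a
        first name with a blank last name, as Architect did with "Jim"."""
        parts = name.split()
        return (parts[0], " ".join(parts[1:])) if parts else ("", "")

    def analyze_gender(first_name):
        """Look up gender by first name; ambiguous or unknown names stay U."""
        return GENDER_BY_FIRST_NAME.get(first_name.lower(), "U")

    for name in ["Bill Jones", "Sam Smith", "Jim"]:
        first, last = parse_name(name)
        print(first, "|", last, "|", analyze_gender(first))  # Sam stays U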

Standardization

While profiling our data, we discovered several inconsistency issues, including:

• Different ways of expressing street designations, such as “Road” and “RD.”

• Different ways of expressing state designations, such as North Carolina and NC.

• Two different ways of expressing years: with two digits and four digits.


Let us look first at the street and state designations:

State Bank

First Name   Last Name   Street Address         State
Bill         Jones       1324 New Road          North Carolina
Sam          Smith       253 Forest RD          NC
Julie        Swift       159 Merle St
Mary         Wise        115 Dublin Woods DR
Jim                      100 Main Street East   NC

To standardize these designations, we again use the dfPower Architect application. Continuing the job flow we created for parsing and gender identification, we add a Standardization job step. For this job step, we instruct dfPower Architect to use the DataFlux-designed Address standardization definition for the Street Address field and the State (Two Letter) standardization definition for the State field. Both standardization definitions will change the addresses to meet US Postal Service standards.

After standardizing these two fields, here is how our records look:

State Bank

First Name   Last Name   Street Address        State
Bill         Jones       1324 New Rd           NC
Sam          Smith       253 Forest Rd         NC
Julie        Swift       159 Merle St
Mary         Wise        115 Dublin Woods Dr
Jim                      100 Main St E         NC

Notice that two of the State fields are still blank. Standardization conforms existing values to a standard, but does not fill in missing values.

Standardizing the two- and four-digit years requires more work. Here is what a sampling of the data looks like currently:

State Bank

First Name   Last Name   1STACCTOPND
Bill         Jones       1936
Sam          Smith       01
Julie        Swift       54
Mary         Wise        1937
Karen        Hyder       41
William      Travis      16
Fred         Jones       1962
Craig        Spencer     57
Mary         Kossowski   71


Because there is no DataFlux-designed standardization definition for this task, we will need to create a standardization scheme. To do this, we launch the dfPower Quality – Analysis application from the dfPower Studio main screen, select State Bank’s customer database and table, select the 1STACCTOPND field, specify a Phrase Analysis, and run the Analysis job. The results appear in the Analysis Editor, which lists each permutation of a year and how often that year occurs in the customer records. For each two-digit permutation, we set a standard to transform the two-digit value to the appropriate four-digit value. For the values 00–05, we will add “20” to the beginning of the value. For example, we will set each occurrence of “01” to transform to “2001.” For the values 36–99, we will add “19” to the beginning. For example, we will set each occurrence of “45” to transform to “1945.” Because State Bank opened in 1936, we know that the values 06–35 are invalid, so we will leave those values untouched for now; these values will later require some manual editing or the use of another table with valid year data.
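The scheme itself reduces to a simple value-to-value mapping. Here is a minimal sketch of the rules just described; the standardize_year helper is our own illustration, not a dfPower definition.

    def standardize_year(value):
        """Two-digit-year scheme: 00-05 become 20xx, 36-99 become 19xx.
        Four-digit years and the invalid range 06-35 pass through untouched."""
        if len(value) == 2 and value.isdigit():
            if "00" <= value <= "05":
                return "20" + value
            if "36" <= value <= "99":
                return "19" + value
        return value  # left as-is, including invalid values for manual review

    assert standardize_year("01") == "2001"
    assert standardize_year("54") == "1954"
    assert standardize_year("16") == "16"      # invalid: untouched
    assert standardize_year("1936") == "1936"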

After saving our new standardization scheme, we can now standardize years in a similar manner to how we standardize street and state designations. In dfPower Architect, we use a Standardization job step to instruct dfPower Architect to use our new standardization scheme (before, we used DataFlux-designed standardization definitions) to update two-digit values in the 1STACCTOPND field. Our data now looks like this:

State Bank

First Name   Last Name   1STACCTOPND
Bill         Jones       1936
Sam          Smith       2001
Julie        Swift       1954
Mary         Wise        1937
Karen        Hyder       1941
William      Travis      16
Fred         Jones       1962
Craig        Spencer     1957
Mary         Kossowski   1971

Notice that William Travis’s 1STACCTOPND remains unchanged because it is an invalid value we chose not to transform.

One last thing before we leave Standardization: As you might recall, in Data Discovery, we saw that State Bank’s Tax ID field uses a 999999999 pattern, while County Bank’s Tax ID field uses a 999-999-999 pattern. We will skip the details here, but to transform one of the patterns to another, we use dfPower Customize to create a new standardization definition, and then use that new definition in a dfPower Architect Standardization job step.

Note: Assuming we had all the required standardization schemes and definitions at the outset, the most efficient way to standardize all four fields—Street Address, State, 1STACCTOPND, and Tax ID—would be to use one Standardization job step in dfPower Architect, assigning the appropriate definition or scheme to each field.

Matching

Matching seeks to uniquely identify records within and across data sources. Record uniqueness is a precursor to Integration activities.

While profiling, we discovered some potentially duplicate records. As an example, we have identified three State Bank records that might be all for the same person, as shown below:


                 Record 1      Record 2      Record 3
First Name       Robert        Bob           Rob
Last Name        Smith         Smith         Smith
Street Address   100 Main St   100 Main      100 Main St
Tax ID           265-67-7890   265-67-7890   242-63-9990

To help determine if these records are indeed for the same customer, we launch the dfPower Integration – Match application, select the State Bank database and table, assign the appropriate DataFlux-designed and custom-created match definitions to the fields we want considered during the match process, set match sensitivities and conditions, and run an Append Match Codes job. The result is a new field containing match codes, as shown below.

                 Record 1      Record 2      Record 3
First Name       Robert        Bob           Rob
Last Name        Smith         Smith         Smith
Street Address   100 Main St   100 Main      100 Main St
Tax ID           265-67-7890   265-67-7890   242-63-9990
Match Code       GHWS$$EWT$    GHWS$$EWT$    GHWS$$WWI$

Based on our match criteria, dfPower Integration – Match considers Record 1 and Record 2 to be for the same person. Record 3 is very similar—note how the first six characters of the match code (GHWS$$) are the same as for Records 1 and 2—but the Tax ID is different.

Of course, we could have figured this out manually, but it gets much more difficult when you need to make matches across thousands of records. With this in mind, we create match codes for both banks.
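DataFlux match codes come from proprietary match definitions, so the real algorithm is not shown here. Purely as an illustration of the idea—normalize the parts that vary, keep the parts that identify—here is a toy code built with a hypothetical nickname table:

    NICKNAMES = {"BOB": "ROBERT", "ROB": "ROBERT"}  # hypothetical nickname table

    def toy_match_code(first, last, address):
        """Build a crude fuzzy key: canonical first name, last name, and the
        two leading address tokens (dropping St/Rd suffix noise). This is NOT
        the DataFlux algorithm, only a sketch of why Records 1 and 2 collide."""
        first = NICKNAMES.get(first.upper(), first.upper())
        addr = " ".join(address.upper().split()[:2])
        return f"{first}|{last.upper()}|{addr}"

    # All three records collide on this toy key; a real match condition would
    # also weigh the differing Tax ID before declaring Record 3 a duplicate.
    for f in ("Robert", "Bob", "Rob"):
        print(toy_match_code(f, "Smith", "100 Main St"))  # ROBERT|SMITH|100 MAIN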

Now that we have finished parsing, standardizing, and creating match codes for both banks' tables, we are ready to move on to the Integration building block.

For more information on the Quality building block, start with the following topics in the dfPower Studio online Help:

• dfPower Architect – Introduction

• dfPower Quality – Analysis – Introduction

• dfPower Quality – Standardize – Introduction

• dfPower Customize – Introduction

Integration

Data integration encompasses a variety of techniques for joining, merging/de-duplicating, and householding data from a variety of data sources. These techniques together can help identify potentially duplicate records, merge dissimilar data sets, purge redundant data, and link data sets across the enterprise.

Joining

Joining is the process of bringing data from multiple tables into one. For our data, we have three options for joining:

• Combine the County Bank table into the State Bank table

• Combine the State Bank table into the County Bank table

• Combine both tables into a new table


We decide to combine both tables into a new table. As we start this process, we recall that the data types and lengths for similar fields differ between the two banks’ tables, and that some data types and lengths should be adjusted for increased efficiency. For example, currently:

• The City Data Length is 20 for State Bank and 25 for County Bank. Also, the City Maximum Length is 23 for County Bank, which is larger than the City Data Length of 20 for State Bank.

• The 1STACCTOPND data length for State Bank is 20, but the maximum length of data is 4, thus wasting space.

• The 1STACCTOPND data type for State Bank is VARCHAR, but the actual values are all integers.

• The Tax ID data type for State Bank is set to INTEGER, but VARCHAR for County Bank.

We could address all these issues by opening each bank's table using the appropriate database tool (for example, dBase or Oracle) and changing field names, lengths, and types as needed, making sure the data lengths for any given field are at least as large as the actual maximum length of that field in both tables. For example, for the City field, we would set the data length to at least 23 in both tables. However, we can change data types and lengths and join the two tables into a new table using just dfPower Architect.

To get started, we start a new job flow in dfPower Architect and use a Data Source job step to add the State Bank table to the flow. On the Output Fields tab for that job step, we click the Advanced button. This opens the Override Defaults screen, which lists each field and its length. To set a new length for a field, we specify that length in the Override column; to set a new type, we specify that type instead. To set a new length and type, we specify the length followed by the type in parentheses. For example, to change the 1STACCTOPND field from a data length of 20 and a VARCHAR type to a data length of 4 and an INTEGER type, we specify "4(INTEGER)." We take similar steps for the County Bank table, adding a second Data Source job step to the flow and using the Override Defaults screen to change data lengths and types as necessary.

Now that both bank tables are at the top of our job flow and similar fields have the same data lengths and types, we add a Data Union job step and use the job step's Add button to create one-to-one mappings between similar fields in each table and the combined field in the new table. The Data Union job step will not create records with combined data; rather, it will add records from one table as new records to the other without making any changes to those records. For example, we map the State Bank Tax ID and County Bank Tax ID fields to a combined TID field, and the State Bank First Name field and County Bank First Name field to a combined Customer First Name field.

Finally, we use a Data Target (Insert) job step to specify a database and new table name, then run the job flow to create a new table that contains all the customer records for both banks.

Note: If we had a common field in both tables (ideally primary key/foreign key fields), we could have used a Data Joining job step in dfPower Architect to combine matching records as we combined the two tables. For example, if we had one table that contained unique customer IDs and customer name information and another table that had unique customer IDs and customer phone numbers, we could use a Data Joining job step to create a new table with customer IDs, names, and phone numbers.


To illustrate this approach, we will start with these two sets of records:

Customer ID   First Name   Last Name
100034        Robert       Smith
100035        Jane         Kennedy
100036        Karen        Hyder
100037        Moses        Jones

Customer ID   Phone
100034        (919) 484-0423
100035        (919) 479-4871
100036        (919) 585-2391
100037        (919) 452-8472

Using the Data Joining job step, we could identify Customer ID as the primary key/foreign key fields and create the following records:

Customer ID   First Name   Last Name   Phone
100034        Robert       Smith       (919) 484-0423
100035        Jane         Kennedy     (919) 479-4871
100036        Karen        Hyder       (919) 585-2391
100037        Moses        Jones       (919) 452-8472
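In code terms, the Data Joining step behaves like an inner join on the shared Customer ID key. A minimal sketch using the example records above:

    # The two inputs, keyed by the shared Customer ID field.
    names = {"100034": ("Robert", "Smith"), "100035": ("Jane", "Kennedy"),
             "100036": ("Karen", "Hyder"), "100037": ("Moses", "Jones")}
    phones = {"100034": "(919) 484-0423", "100035": "(919) 479-4871",
              "100036": "(919) 585-2391", "100037": "(919) 452-8472"}

    # Inner join: keep only Customer IDs present in both inputs.
    joined = [(cid, first, last, phones[cid])
              for cid, (first, last) in sorted(names.items()) if cid in phones]
    for row in joined:
        print(row)  # ('100034', 'Robert', 'Smith', '(919) 484-0423'), ...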

Merging/De-duplicating

Now that we have all the customer records in one table, we can look for records that appear to be for the same customer and merge the data from all records for a single customer into a single surviving record.

Let us look at selected fields in three records that might be for the same customer:

                 Record 1     Record 2     Record 3
First Name       Robert       Bob          Rob
Last Name        Smith        Smith        Smith
Street Address                100 Main     100 Main St
City             Carrboro                  Carrboro
State            NC
ZIP              27510        27510
Match Code       GHWS$$EWT$   GHWS$$EWT$   GHWS$$WWI$

Notice that each record contains a match code, which we generated in the Matching section of the Quality building block. We can use these codes as keys to find and merge duplicate records. To do this, we start by launching the dfPower Integration – Match application from the dfPower Studio main screen, select our combined customer records table, choose Outputs > Eliminate Duplicates, and then set a match definition of Exact for the Match Code field.


Note: Because we already generated and appended match codes to our records in the Quality building block, we only need to match on the Match Code field. If we did not already have match codes, however, we could specify match definitions and sensitivities for selected fields in Integration – Match and the application would use match codes behind the scenes to find multiple records for the same customer. We could even set up OR conditions such as “If First Name, Last Name, and Tax ID are similar OR if First Name, Last Name, Street Address, and ZIP are similar.”

We then choose Settings… from the Eliminate Duplicates output mode and specify several options for our duplicate elimination job, including Manually Review Duplicate Records, Physically Delete Remaining Duplicates, and that surviving records should be written back to the Current Source Table. When the options are all set, we run the job. After processing, the Duplicate Elimination File Editor screen appears, showing the first surviving record and its cluster of records that might be for the same customer. The screen looks something like this:

Surviving record:

First Name   Last Name   Street Address   City       State   ZIP     Match Code
Robert       Smith                        Carrboro   NC      27510   GHWS$$EWT$

Cluster records:

First Name   Last Name   Street Address   City       State   ZIP     Match Code
Robert       Smith                        Carrboro   NC      27510   GHWS$$EWT$
Bob          Smith       100 Main                            27510   GHWS$$EWT$
Rob          Smith       100 Main St      Carrboro                   GHWS$$WWI$

Notice that the surviving record and first record in the cluster are both checked and contain the same data. The Duplicate Elimination File Editor has selected the Robert Smith record as the surviving record, probably because it is the most complete record. However, the Street Address for that record is still blank. To fix this, we double-click on the Street Address value for the Rob Smith record to copy “100 Main St” from the Rob Smith record to the Robert Smith record. The screen now looks something like this:

Surviving record:

First Name   Last Name   Street Address   City       State   ZIP     Match Code
Robert       Smith       100 Main St      Carrboro   NC      27510   GHWS$$EWT$

Cluster records:

First Name   Last Name   Street Address   City       State   ZIP     Match Code
Robert       Smith                        Carrboro   NC      27510   GHWS$$EWT$
Bob          Smith       100 Main                            27510   GHWS$$EWT$
Rob          Smith       100 Main St      Carrboro                   GHWS$$WWI$

The Street Address is highlighted in red in the surviving record to indicate that we manually added the value.

To review and edit the surviving record and records cluster for the next customer, we click Next and start the process again, repeating as necessary until duplicate records have been merged and the excess records deleted.

Note: If you find patterns of missing information in surviving records, especially if you have thousands of potentially duplicate records, consider returning to dfPower Integration – Match and setting field rules and/or record rules to help the Duplicate Elimination File Editor make better choices about surviving records.


Householding

Another use for match codes is householding. Householding entails using a special match code, called a household ID or HID, to link customer records that share a physical and/or other relationship. Householding provides a straightforward way to use data about your customers' relationships with each other to gain insight into their spending patterns, job mobility, family moves and additions, and more.

Consider the following pair of records:

First Name   Last Name   Street Address   Phone
Joe          Mead        159 Milton St    (410) 569-7893
Julie        Swift       159 Milton St    (410) 569-7893

Given that Joe and Julie have the same address and phone number, it is highly likely that they are married or at least share some living expenses. Because our bank prefers to send marketing pieces to entire households rather than just individuals, and because our bank is considering offering special account packages to households that have combined deposits of more than $50,000, we need to know which customers share a household.

Note: Although a household is typically thought of in the residential context, similar concepts apply to organizations. For example, householding can be used to group together members of a marketing department, even if those members work at different locations.

To create household IDs, we use dfPower Integration – Match to select our customer records table, and then set up the following match criteria as OR conditions:

• Match Code 1 (MC1): Last Name AND Address

• Match Code 2 (MC2): Address AND Phone

• Match Code 3 (MC3): Last Name AND Phone


Behind the scenes, dfPower Integration – Match generates three match codes, as shown below. If any one of the codes matches across records, the application assigns the same persistent HID to all those records.

First Name   Last Name   Street Address         Phone            MC1   MC2   MC3   HID
Joe          Mead        159 Milton St          (410) 569-7893   $MN   #L1   %RQ   1
Julie        Swift       159 Milton St          (410) 569-7893   $RN   #L1   %LQ   1
Michael      Becker      1530 Hidden Cove Dr    (919) 688-2856   $BH   #H6   %B6   2
Jason        Green       1530 Hidden Cove Dr    (919) 688-2856   $GH   #H6   %G6   2
Becker       Ruth        1530 Hidden Cove Dr    (919) 688-2856   $RH   #H6   %R6   2
Courtney     Benson      841B Millwood Ln       (919) 231-2611   $BM   #M2   %B2   3
Courtney     Myers       841 B Millwood Lane    (919) 231-2611   $MM   #M2   %M2   3
David        Jordan      4460 Hampton Ridge     (919) 278-8848   $JH   #H2   %J2   5
Carol        Jordan      4460 Hampton Ridge     (919) 806-9920   $JH   #H2   %J8   5
Robin        Klein       5574 Waterside Drive   (919) 562-7448   $KW   #W5   %K5   7
Sharon       Romano      5574 Waterside Drive   (919) 562-7448   $RW   #W5   %R5   7
Carol        Romano      5574 Waterside Drive   (919) 239-7436   $RW   #W2   %R2   7
Melissa      Vegas       PO Box 873             (919) 239-2600   $VB   #B2   %V2   14
Melissa      Vegas       12808 Flanders Ln      (919) 239-2600   $VF   #F2   %V2   14
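The HID assignment is effectively a transitive grouping: if any one match code links two records, and another code links one of them to a third, all three end up in the same household. Here is a minimal union-find sketch under that assumption; it is our own illustration (persistent ID handling is ignored), with sample codes taken from the table above.

    from itertools import count

    def assign_hids(records):
        """Give records the same household ID when ANY one of their match
        codes is shared; records is a list of (mc1, mc2, mc3) tuples."""
        parent = {}

        def find(x):
            while parent.setdefault(x, x) != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for i, codes in enumerate(records):
            for c in codes:  # union the record with each of its codes
                parent[find(("rec", i))] = find(("code", c))

        hids, next_hid = {}, count(1)
        return [hids.setdefault(find(("rec", i)), next(next_hid))
                for i in range(len(records))]

    # Joe and Julie share MC2 (#L1), so they receive the same HID.
    print(assign_hids([("$MN", "#L1", "%RQ"), ("$RN", "#L1", "%LQ"),
                       ("$BH", "#H6", "%B6")]))  # -> [1, 1, 2]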

Now that our data is integrated, including joining, merging/de-duplicating, and householding, we move on to the next building block: Enrichment.

For more information on the Integration building block, start with the following topics in the dfPower Studio online Help:

• dfPower Integration – Match – Introduction

• dfPower Integration – Match – Creating Duplicate Elimination Jobs

• dfPower Architect – Introduction

Enrichment

In the Enrichment building block, we add to our existing data by verifying and completing incomplete fields, and adding new data such as geocodes.

For our data, we are going to perform three types of enrichment: address verification, phone validation, and geocoding.


Address Verification

Address verification is the process of verifying and completing address data based on existing address data. As examples:

• You can use an existing ZIP code to determine the city and state.

• You can use an existing street address, city, and state to determine the ZIP or ZIP+4 code.

• You can use an existing ZIP code to determine that the street address actually exists in that ZIP code.

Currently, a sampling of our data looks like this:

First Name   Last Name   Street Address        City   State   ZIP     Phone
Bill         Jones       1324 New Rd                  NC      27712   716-479-4990
Sam          Smith       253 Forest Rd         Cary   NC              919-452-8253
Julie        Swift       159 Merle St                 NC      14127   716-662-5301
Mary         Wise        115 Dublin Woods Dr                  43953   614-484-0555

To verify and help complete some of these records, we first launch dfPower Architect from the dfPower Studio main screen, use a Data Source job step to specify our combined customer records table, and then add an Address Verification job step. In the Address Verification job step, we assign the following address types to each field:

• Street Address: Address Line 1

• City: City

• State: State

• ZIP: ZIP/Postal Code

This tells dfPower Architect what type of data to expect in each field. (It is only coincidence that the City and State fields and their address types use the same terms. We could just as easily have had a CtyTwn field to which we would assign a City address type.)

For the output fields, we specify the following:

• Output Type: Address Line 1, Output Name: Street Address

• Output Type: City, Output Name: City

• Output Type: State, Output Name: State

• Output Type: ZIP/Postal Code, Output Name: ZIP

• Output Type: US County Name, Output Name: County

Notice that except for County, all the fields will get updated with new data. For County, Architect will create a new field.

Additional Output Fields:

• First Name

• Last Name

• Phone

These three fields will not be changed, just carried through the Address Verification job step.


After running the job flow, our data now looks like this:

First Name   Last Name   Street Address          City           State   ZIP          County     Phone
Bill         Jones       1324 E New Rd           Durham         NC      27712-1525   Durham     716-479-4990
Sam          Smith       253 Forest Rd           Cary           NC      27513-1627   Wake       919-452-8253
Julie        Swift       159 Merle St            Orchard Park   NY      14127-5283   Erie       716-662-5301
Mary         Wise        115 Dublin Woods Dr S   Cadiz          OH      43953-1524   Harrison   614-484-0555

Notice that the Street Address, City, State, and ZIP fields are now completely populated, a couple of Street Addresses have new directional designations (for example, 1324 E New Rd), all the ZIP codes are now ZIP+4 codes, and the County field has been created and populated. Also note that Julie Swift's State was changed from NC to NY because 14127 is a ZIP code for western New York.

Note: To change these values, dfPower Architect used data from the US Postal Service. If you have licensed dfPower Verify, you should have access to this same data, as well as data for phone validation and geocoding.

Phone Validation

Phone validation is the process of checking that the phone numbers in your data are valid, working numbers. dfPower Studio accomplishes this by comparing your phone data to its own reference database of valid area code and exchange information and returning several pieces of information about that phone data, including a verified area code, phone type, and MSA/PSA codes. In addition, you can add one of the following result codes to each record (a toy sketch of this classification follows the list):

• FOUND FULL – The full telephone number appears to be valid.

• FOUND AREA CODE – The area code appears to be valid, but the full phone number does not.

• NOT FOUND – Neither the area code nor the full phone number appears to be valid.
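As promised above, here is a toy sketch of that classification. The reference pairs are hypothetical stand-ins for dfPower's licensed area code/exchange database, and the sample numbers are deliberately artificial.

    # Hypothetical valid (area code, exchange) pairs; the real reference
    # database shipped with dfPower Verify is far larger.
    VALID_PAIRS = {("716", "479"), ("740", "484")}
    VALID_AREA_CODES = {area for area, _ in VALID_PAIRS}

    def validate_phone(phone):
        """Classify an NNN-NNN-NNNN number with the three result codes."""
        area, exchange, _ = phone.split("-")
        if (area, exchange) in VALID_PAIRS:
            return "FOUND FULL"
        if area in VALID_AREA_CODES:
            return "FOUND AREA CODE"
        return "NOT FOUND"

    print(validate_phone("716-479-4990"))  # FOUND FULL
    print(validate_phone("716-555-0100"))  # FOUND AREA CODE
    print(validate_phone("000-555-0199"))  # NOT FOUND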

Ignoring the address fields, our data currently looks like this:

First Name   Last Name   Phone
Bill         Jones       716-479-4990
Sam          Smith       919-452-8253
Julie        Swift       716-662-5301
Mary         Wise        614-484-0555

To validate the phone numbers, we continue our dfPower Architect job flow from address verification, and add a Phone job step. In the Phone job step, we identify Phone as the field that contains phone numbers, select Phone Type, Area Code, and Result as outputs, and add First Name and Last Name as additional output fields.

The data now looks like this:

First Name   Last Name   Phone          Area Code   Phone Type   Result
Bill         Jones       716-479-4990   919         Standard     FOUND FULL
Sam          Smith       919-452-8253                            NOT FOUND
Julie        Swift       716-662-5301   716         Standard     FOUND AREA CODE
Mary         Wise        614-484-0555   740         Cell         FOUND FULL

Geocoding

Geocoding is the process of using ZIP code data to add geographical information to your records. This information will be useful because it allows us to determine where customers live in relation to each other and to our bank branches. We might use this information to plan new branches or inform new customers of their closest branch. Geocode data can also indicate certain demographic features of our customers, such as average household income.

To add geocode data to our records, we add a Geocoding job step to our dfPower Architect job flow, identify our ZIP field, and indicate what kind of geocode data we want. Options include latitude, longitude, census tract, FIPS (the Federal Information Processing Standard code assigned to a given county or parish within a state), and census block.

Note: Geocode data generally needs to be used with a lookup table that maps the codes to more meaningful data.
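Conceptually, geocoding is a keyed lookup from ZIP code to geographic attributes. Here is a sketch with a hand-built table; real jobs use the licensed dfPower Verify reference data, and the coordinates and FIPS codes below are approximate illustrations only.

    # Hypothetical ZIP-to-geocode rows; values are approximate illustrations.
    GEOCODES = {
        "27712": {"lat": 36.09, "lon": -78.92, "fips": "37063"},  # Durham Co., NC
        "27513": {"lat": 35.80, "lon": -78.80, "fips": "37183"},  # Wake Co., NC
    }

    def geocode(zip_code):
        """Return latitude/longitude and county FIPS for a ZIP, if known.
        ZIP+4 values are reduced to their five-digit prefix first."""
        return GEOCODES.get(zip_code[:5])

    print(geocode("27712-1525"))  # {'lat': 36.09, 'lon': -78.92, 'fips': '37063'}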

For more information on the Enrichment building block, start with the following topics in the dfPower Studio online Help:

• dfPower Architect – Introduction

• dfPower Verify – Introduction

Our data is now ready for our customer mailing. However, one data-management building block remains: Monitoring.

Monitoring

Because data is a fluid, dynamic, ever-evolving resource, building quality data is not a one-time activity. The integrity of data degrades over time as incorrect, nonstandard, and invalid data is introduced from various sources. For some data, such as customer records, existing data becomes incorrect as people move and change jobs.

In the fifth and final building block, Monitoring, we set up techniques and processes to help us understand when data gets out of limits and to identify ways to correct data over time. Monitoring helps ensure that once data is consistent, accurate, and reliable, we have the information we need to keep the data that way. For our data, we will use two types of monitoring techniques and processes:

• Auditing

• Alerts

Auditing

Auditing involves periodic reviews of your data to help ensure you can identify and correct bad data as quickly as possible. Auditing requires you to set a baseline for the acceptable characteristics of your data. For example, it might be that certain fields should:

• Never contain null values

• Contain only unique values

• Contain only values within a specified range


For our customer records, we know there have been problems with the ZIP field being blank, so we plan to regularly audit this field to see if the number of blank values is increasing or decreasing over time. We will generate a trend chart that shows these changes graphically.

To prepare for auditing, we create a Profile job just as we did in the Profiling building block: We launch dfPower Profile's Configurator component from the dfPower Studio main screen, connect to our customer records database and table, select the ZIP field, use the Job > Options menu command to select the Blank Count metric, and run the resulting Profile job. This creates a profile report that appears in dfPower Profile's Viewer component and establishes our auditing baseline.

Note: For your own data, you might want to select all fields and all metrics. This will give you the greatest flexibility in auditing data on the fly.

Now, each day, we open and run that same job from the Configurator, making sure to keep Append to Report (If It Already Exists) checked on the Run Job screen.

Note: If you plan to do a lot of data auditing, consider using dfPower Studio’s Base – Batch component to schedule and automatically run your Profile job or jobs.

When the dfPower Profile - Viewer main screen appears, we choose Tools > Data Monitoring > Historical Analysis. In the Metric History screen that appears, we select ZIP as the field name and Blank Count as the metric. If we want to audit data for a certain time period, we can also select start and end dates and times; a date and time will be available for each time we ran the Profile job.

The Metric History screen displays a chart like this:

[Trend chart — DSN: Bank Records, Table: Customer Records, Field: ZIP. The Blank Count metric value (0 to 600) is plotted against date/time for the runs from 2/1/05 12:00 AM through 2/5/05 12:00 AM.]

Viewing this chart, we can quickly see that the number of blank ZIP fields generally declined over the week until the last day, when the number increased a bit.
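The same trend check is easy to express over the appended report history. Here is a sketch with made-up daily Blank Count values shaped like the chart above; the history structure is our own illustration, not a dfPower report format.

    from datetime import date

    # One (run date, Blank Count) pair per appended profile run; values invented.
    history = [(date(2005, 2, 1), 540), (date(2005, 2, 2), 430),
               (date(2005, 2, 3), 310), (date(2005, 2, 4), 180),
               (date(2005, 2, 5), 230)]

    # Flag any day-over-day increase in blank ZIP fields.
    for (d1, v1), (d2, v2) in zip(history, history[1:]):
        if v2 > v1:
            print(f"Blank Count rose from {v1} to {v2} between {d1} and {d2}")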


Alerts

Alerts are messages generated by dfPower Studio when data does not meet criteria you set. For our data, we know that the Last Name field is sometimes left blank. To be alerted whenever this happens, we will set up an alert.

To do this, we launch dfPower Profile’s Configurator component from the dfPower Studio main screen, choose Tools > Data Monitoring > Alert Configuration, and select our customer records database and table. Then, we select the Last Name field and set the following:

• Metric: Blank Count

• Comparison: Metric is Greater Than

• Value: 0

This automatically generates a description of “Blank Count is greater than 0.”

We also indicate we want to receive alerts by e-mail and specify the email address of a group of people responsible for addressing these alerts. Now, whenever dfPower Profile finds a blank Last Name field, it will send out an e-mail message.
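The rule we just configured boils down to a metric, a comparison, and a threshold. Here is a sketch of how such a rule could be evaluated; the rule structure and helper are our own illustration, not the dfPower alert engine.

    # The alert rule configured above: metric, comparison, threshold.
    alert_rule = {"field": "Last Name", "metric": "Blank Count",
                  "comparison": "greater than", "value": 0}

    def check_alert(metric_value, rule):
        """Return an alert description when the metric crosses the threshold."""
        if rule["comparison"] == "greater than" and metric_value > rule["value"]:
            return f'{rule["metric"]} is greater than {rule["value"]} ({rule["field"]})'
        return None

    print(check_alert(3, alert_rule))  # would trigger the notification e-mail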

We can also review alerts through dfPower Profile’s Viewer component. To do this, in the Viewer, we choose Tools > Data Monitoring > Alerts. The Alert screen appears, listing the current alerts, similar to what is shown in the following table:

Data Source Name   Table Name         Field Name   Description
Bank Records       Customer Records   ZIP          Blank Count is greater than 0

For more information on the Monitoring building block, see the following topics in the online Help:

• dfPower Profile – Introduction

• dfPower Batch – Using Studio in Batch Mode

Summary

We hope this scenario has provided you with a solid foundation for starting to use dfPower Studio through all five data-management building blocks: Profiling, Quality, Integration, Enrichment, and Monitoring.

As we mentioned at the beginning of this chapter, this scenario covers a sampling of dfPower Studio’s functionality. For more information on what you can do with dfPower Studio, see the online Help.


Chapter 4: Performance Guide

Many settings, environment variables, and associated database and system configurations can significantly affect the performance of dfPower Studio. Aside from normal data-processing issues that stem from user hardware and operating-system capabilities, dfPower Studio's performance can benefit either from directly modifying settings within dfPower Studio or from modifying the data source on which dfPower Studio is working.

These modifications can range from the amount of memory used to sort records, to the complexity of the parse definition selected for data processing, to the number and size of columns in a database table. This chapter describes some of the more common ways that you can optimize dfPower Studio for performance.

Topics in this chapter include:

• dfPower Studio Settings

• Database and Connectivity Issues

• Quality Knowledge Base Issues

dfPower Studio Settings

Several dfPower Studio environment settings are accessible through the various dfPower Studio applications you use to process your data. Currently, a few of the settings can only be changed by directly editing the dfStudio7.ini file, which is stored in your computer's main Windows directory. This file contains many of the user-defined options and system path relationships used by dfPower Studio.

Caution! Edit the dfStudio7.ini file with care. Unwarranted changes can corrupt your dfPower Studio installation, causing it to perform unexpectedly or fail to initialize.

dfPower Studio (All Applications)

• Working Directory Settings: Most dfPower Studio applications use a working directory to create temporary files and save miscellaneous log files. Some processes, such as multi-condition matching, can create very large temporary files depending on the number of records and conditions being processed. To speed processing, the working directory should be a local drive with plenty of physical space to handle large temporary files. Normal hard drive performance issues apply, so a faster hard drive is advantageous, as is keeping the drive defragmented. You can change the working directory from the dfPower Studio main screen: Choose Studio > Options, click Directories, and change the path in the Working Directory box.

dfPower Base – Architect

• Memory Allocated for Sorting and Joining: Architect has an option—Amount of Memory to Use During Sorting Operations—for allocating the amount of memory used for sorting. Access this option from the Architect main screen by choosing Tools > Options. The number for this option indicates the amount of memory, in bytes, that Architect uses for sorting and joining operations; the default is 64MB. Increasing this number allows more memory to be used for these types of procedures and can improve performance substantially if your computer has the RAM available. As a rule, if you have only one join or sort step in an Architect job flow, you should not set this number greater than half the total available memory. If you have multiple sorts and joins, divide that limit by the number of sorts or joins in the job flow. For example, with 2GB of available memory and two sorts, allocate at most 512MB (half of 2GB, divided by two) to each.


• Caching the USPS Reference Database: Architect has an option—Percentage of Cached USPS Data Indexes—for how much United States Postal Service data your computer should store in memory. You can access this option from the Architect main screen by choosing Tools > Options. The number for this option indicates an approximate percentage of how much of the USPS reference data set will be cached in memory prior to an address verification procedure. The default setting is 20. The range is from 0, which indicates that only some index information will be cached into memory, to 100, which directs that most of the indices and normalization information (approximately 200MB) be cached in memory.

• Sort by ZIP Codes: For up to about 100,000 records, sorting records containing address data by postal code prior to the address verification process will enhance performance. This has to do with the way address information is searched for in the reference database. If you use Architect to verify addresses, this step can be done inside the application; otherwise, it must be done in the database’s system. For over 100,000 records, it might be faster to use the SQL Query step in Architect, write a SQL statement that sorts records by ZIP, and let the database do the work.
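For example, a minimal sketch of such a pre-sort query, assuming a hypothetical Customer_Records table with a ZIP column (substitute your own table and column names):

SELECT *
FROM Customer_Records
ORDER BY ZIP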

• New Table Creation vs. Table Updates: When using Architect, it will be faster to create a new table as an output step rather than attempting to update existing tables. If you choose to update an existing table, set your commit interval on the output step to an explicit value (for example, 500) instead of choosing Commit Every Row.

• Clustering Performance in Version 6.1.1 and Higher: To take advantage of performance enhancements to the clustering engine that provides the functionality for the Clustering job step in Architect:

o Try to have as much RAM as possible on your computer, preferably 4GB. If more than 3GB of RAM is installed, the Windows boot.ini system file needs a special switch to instruct Windows to allow a user process to take up to 3GB. Otherwise, Windows will automatically reserve 2GB of RAM for itself. To add this special switch, add /3GB to the Windows boot.ini file.

Note: This is not supported on all Windows versions. For more information on making this change and for supported versions, see www.microsoft.com/whdc/system/platform/server/PAE/PAEmem.mspx.
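As an illustration, the switch is appended to the operating system entry in boot.ini. The entry below is only an example; the exact entry varies by system, and only the trailing /3GB is the addition:

[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /fastdetect /3GB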

o Terminate all non-essential processes to free up memory resources.

o Set the Windows Sort Bytes memory allocation parameter close to 75–80% of total physical RAM. If the data set is rather small, this might not be as important, but when clustering millions of records, using more memory dramatically improves performance. Do not exceed 80–85% because higher memory allocation might result in memory thrashing, which dramatically decreases clustering performance.

o Defragment the hard drive used for temporary cluster engine files. This is the dfPower Studio working directory described earlier.

o Use the fastest hard drive for cluster engine temporary files. If possible, set the dfPower Studio working directory to be on the fastest physical drive that is not the drive for the operating system or page file.

o Defragment the page file disk. Do this by setting the Windows Virtual Memory size to 0, then rebooting the system and defragmenting that drive.


o Manually set both the minimum and maximum values of the Windows Virtual Memory file size to the same large value, preferably 2GB or more, depending on available disk space. This prevents the operating system from running out of virtual memory and from needing to resize the file dynamically.

o Disable Fast User Switching in Windows XP. This will free up space in the page file.

• Multiple Match Code Generation: When defining matching criteria, do not set up rules to generate match codes on the same fields at different sensitivities. For example, do not choose to create match codes for Name and Address fields at both 55 and 85 percent sensitivities. When you simply choose the Match Code job step in Architect, several calls to parsing algorithms are made to generate the match codes. A better approach is to use the Parsing job step first on the Name and Address fields, and then use the Match Codes (Parsed) job step to generate the match codes at the different sensitivities. This accesses the parse engine only once for each field and record instead of several more times for each sensitivity. Performance gains can cut processing time by more than half, depending on the match criteria.

• Limit Data Passed Through Each Step: While it might be convenient to use the Architect setting that passes every output field by default to the next step, this creates a large amount of overhead when you are only processing a few fields. Passing only 10 fields through each step might not be that memory intensive, but when it is 110 fields, the performance can really suffer. After you create your job flow with the Output Fields setting on All, go through each job step and delete the additional output fields you do not need.

dfPower Verify

• Caching the USPS Reference Database: dfPower Verify has an option—USPS Cache—for how much USPS data your computer should store in memory. You can access this option from the dfPower Verify main screen by choosing Tools > Options and displaying the Runtime tab. The number for this option indicates an approximate percentage of how much of the USPS reference data set will be cached in memory prior to an address verification procedure. The default setting is 20. The range is from 0, which indicates that only some index information will be cached into memory, to 100, which directs that most of the indices and normalization information (approximately 200MB) be cached in memory.

• Create a Verified Table and Join to Source: As an option for quicker database updates, it is possible to write verified address data to a new table, and then use Architect to join the records in the source table with the information in the verified table using primary keys for the join in Architect.

dfPower Integration - Match

• Set Sort Memory Size: If you want to create match jobs that take advantage of multiple OR steps in your match criteria, you might want to increase the memory used for match code sorting. To do this from the dfPower Integration - Match main screen, choose Options > Set Sort Memory Size. The range is from 0, which indicates that only minimal memory should be used for the sorting and clustering process, to 100, which indicates that almost all available memory will be used for this process. The default value is 15. We recommend you not set this number much higher than 40 unless you have a very large amount of available RAM.


• Multiple Condition Matching: The number of match conditions, in addition to the number of fields you choose to use as match criteria, will impact performance significantly. A match job of 100,000 records that uses four OR conditions with four fields for each condition might take hours to complete, while a match job of 100,000 records with a single condition and six fields might complete in minutes.

• Generate Cluster Data Mode: See “Clustering Performance in Version 6.1.1 and Higher” in the “dfPower Base – Architect” section.

dfPower Quality - Standardize

• Log Updates: Choosing to log database updates to the statistics file can slow down performance to a small degree. On very large job runs, the difference might be more evident. You can turn off the logging feature from the dfPower Quality main screen by choosing Control > Log Updates.

dfPower Profile - Configurator

• Metric Calculation Options: From the dfPower Profile - Configurator main screen, choose Job > Options to access settings that can directly affect the performance of a Profile job process. Two of these settings are Count all Rows for Frequency Distribution and Count All Rows for Pattern Distribution. By default, all values are examined to determine metrics that are garnered from frequency and pattern distributions. You can decrease processing time by setting a limit on the number of accumulated values. The tradeoff is that you might not get a complete assessment of the true nature of your data. Another setting, Maximum Number of Values per Field to Store in Memory for Processing, is set to 10000 by default. If your computer has more memory available, you can increase this number to improve performance. The number is per field, so if you profile five fields in a table, 50000 values will be stored in memory between memory flushes. To optimize performance, you can choose fewer fields to profile or increase the number, but you run the risk of running out of memory to complete the process.

• Subset the Data: You can explicitly specify which metrics to run for each field in a table. If certain metrics do not apply to certain kinds of data, or if you only want to examine select fields of large tables, be sure to manually change the list of metrics to be run for each field. Incidentally, the biggest performance loss is generating a frequency distribution on unique or almost-unique columns. Thus, if you know a field to be mostly unique, you might choose not to perform metrics that use frequency distribution. For numeric fields, these metrics are Percentile and Median. For all fields, these metrics are Primary Key Candidate, Mode, Unique Count, and Unique Percentage. You can also subset the data by creating a business rule or SQL query that pares the data down to only the elements you want to profile, as sketched below.
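For instance, a sketch of such a subsetting query, using hypothetical table and field names:

SELECT Name, Address, ZIP
FROM Customer_Records
WHERE State = 'CA'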

• Sample the Data: dfPower Profile allows you to specify a sample interval on tables that you want to profile. This feature improves performance at the expense of accuracy, but for very large tables with mostly similar data elements, this might be a good option.

• Memory Allocated for Frequency Distribution: dfPower Profile allows you to configure the amount of memory allocated to the Frequency Distribution Engine (FRED). This option is set in the dfstudio.ini file and must be within the [Profile] section. If your .ini file does not contain the [Profile] heading, you can add it manually.

For example:

[Profile]

pertablebytes = 128000


By default, FRED allocates 256 KB per column being profiled, on top of the amount configured by the user in the Job > Options menu of dfPower Profile Configurator. If you have a large number of columns to profile (in the hundreds), this can cause your machine to run low on (or out of) memory, and you may need to reduce the per-table bytes value. For performance reasons, this value should always be a power of 2. Setting this value to 1 MB (note: 1 MB = 1024 * 1024 bytes, not 1000 * 1000 bytes) yields optimal performance. Setting it to a value larger than 1 MB (again, always a power of 2) may help slightly with processing very large data sets (tens of millions of rows), but might actually reduce performance for data sets with a few million rows or fewer.
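For example, to use the optimal 1 MB value, the entry shown above would become (1048576 = 1024 * 1024):

[Profile]
pertablebytes = 1048576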

Database and Connectivity Issues

There are certain issues related to the way databases are physically described and built that can impact the overall performance of dfPower Studio applications. These issues are generally common to all relational database systems. With just a few exceptions, database changes to enhance performance take place external to dfPower Studio applications.

dfPower Studio

• Commit Interval: From the dfPower Studio main screen, choose Studio > Options and click Transactions to access some transaction processing options. For example, you can choose a number of ways to commit changes to the database in write procedures. The default value is Auto Commit, which commits every change to the database table one record at a time. You can instead select Every N Transactions and set this number to somewhere between 100 and 500. This should increase performance, but on the remote chance that there is a problem during processing, all the data changes made by dfPower Studio ultimately might not be saved to the source table.

dfPower Base – Architect

• Commit Interval: Architect commit settings are controlled on a per-job basis. The Data Target (Update) and Data Target (Insert) job steps both have commit options. The default value is Commit Every Row, which commits every change to the database table one record at a time. You can instead select Every N Rows and set this number to somewhere between 100 and 500. This should increase performance, but on the remote chance that there is a problem during processing, all the data changes made by Architect ultimately might not be saved to the source table.

Database

• ODBC Logging: Use the Microsoft Windows ODBC Administrator to turn off ODBC Tracing. ODBC Tracing can dramatically decrease performance.

• Using Primary Keys: For many dfPower Studio processes, using primary keys will enhance performance only if constructed correctly. Primary keys should be numeric, unique, and indexed. You cannot currently use composite keys to enhance performance.

• Database Reads and Writes: All dfPower Studio processes read data from databases, but only some of them write data back. Processes that only read from a database, such as match reports, run quickly. However, if you choose to flag duplicate records in the same table using the same criteria as the match report, the processing time will greatly increase, as you might expect. Using correctly constructed primary keys will help. In addition, as a general rule, smaller field sizes will enhance performance. A table where all fields are set to a length of 255 by default will be processed more slowly than a table that has more reasonable lengths for each field, such as 20 for a name field and 40 for an address field, as sketched below.
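To illustrate the field-size point, here is a sketch of a table definition with such reasonable lengths; the table and column names are placeholders, and exact column types vary by database:

CREATE TABLE Customer_Records (
    ID      INTEGER PRIMARY KEY,
    Name    VARCHAR(20),
    Address VARCHAR(40),
    ZIP     VARCHAR(10)
)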


• Turn Off Virus Protection Software: Some virus-scanning applications such as McAfee NetShield 4.5 cause processing times to increase substantially. While you might not want to turn off your virus-scanning software completely, you might be able to change some settings to ensure that the software is not harming performance by doing things such as scanning for local database changes.

Quality Knowledge Base Issues

The Quality Knowledge Base is the set of files and file relationships that dictates how all DataFlux applications parse, standardize, match, and otherwise process data. The Quality Knowledge Base uses several file types and libraries that work with each other to provide the expected outputs. A few of the file types are used in ways that are not simply direct data look-ups, and it is these types of files that require a certain degree of optimization to ensure that DataFlux applications are processing data as efficiently as they can.

• Complex Parsing: Processes using definition types that incorporate parsing functionality—parse definitions obviously do, but match, standardization, and gender definitions can use parse functionality as well—are directly affected by the way parse definitions are constructed. A parse definition built to process e-mail addresses will be much less complex than a parse definition that processes address information. This has to do with the specifics of the data type itself. Data types that have more inherent variability will most likely have parse definitions that are more complex, and processes using these more complex definitions will perform more slowly. When creating custom parse definitions, it is very easy to accidentally create an algorithm that is far from optimized for performance. Training materials are available from DataFlux that teach the proper way to design a parse definition.

• Complex Regular Expression Libraries: Many definitions use regular expression files to do some of the required processing work. Sometimes regular expression libraries are used to normalize data; other times they are used to categorize data. Incorrectly constructed regular expressions are notorious for being resource intensive. This means you might design a perfectly valid regular expression that takes an extremely long time to accomplish a seemingly simple task. DataFlux-designed definitions have been optimized for performance, but if you create custom definitions, be sure to learn how to create efficient regular expressions. Training materials are available from DataFlux that teach Quality Knowledge Base customization.
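As a generic illustration (not taken from any DataFlux-supplied library), nested quantifiers are a classic culprit. The first pattern below can backtrack excessively on long input that almost matches, while the second matches essentially the same space-separated word lists without that risk:

^(\w+\s?)*$      (prone to heavy backtracking on near-matches)
^\w+(\s\w+)*$    (a more efficient alternative)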


Exercises

Profile

Exercise 1: Create a Profile Report

Overview: This is the "discovery" phase of a data management project. We will profile a Contacts table.

Assignment: Create a simple Profile report from the Configurator. Among the tasks to be performed are:

• Select a data source
• Select job options
• Select job metrics
• Profile a selected field
• Access drill-through options such as Visualize and Filter
• Save the job
• Output a report

Step-by-Step: Creating a Profile Report

1. Select Profile > Configurator (or Tools > Profile > Configurator) to open the dfPower Profile Configurator. The Configurator appears with data sources displayed on the left.

2. Double-click the DataFlux Sample database to begin a profile. The data source listing will expand to show available tables for that data source.

The checkbox icon next to each item indicates its selection state:

• Unchecked: The data source or table is not selected to be part of the profile report.

• Checked: All items within the data source or table are selected to be part of the profile report.

• Partially checked: Only some of the items within the data source or table are selected to be part of the profile report.

3. Select Contacts to view the fields for that table. A drop-down menu shows report options, which display on the right.


Report options are available by field to run full or filtered profile jobs.

Before proceeding with the profile job, we will define options and metrics for the profile.

4. Select Job > Options. Select the General tab. Ensure that both Count all rows for frequency distribution and Count all rows for pattern distribution are selected. The profile engine will look at every value for these records.

5. Click OK.


Note: These processes can be constructed on a remote server, but require advanced options. Most users will run profiles from a desktop computer.

6. Select Job > Select Metrics to open the Metrics dialog box.


Column Profiling shows a list of the metrics that you can analyze.

7. To select all the metrics, click the Select/unselect all box.

8. Click OK. The Configurator displays structural information with preliminary data about these selections, which helps users see the structure of the jobs selected for profiling.

Note: You can override features from the Configurator window by selecting the checkbox and then returning to options.

9. Select the desired field name(s) for profiling. For this exercise, select the Contacts table from the left side of the dfPower Profile – Configurator main screen to select all fields. You can also select individual fields from within the Contacts Table display.

10. To save the job, select File > Save As, specify a job name, type a report description, and click Save. (Here we named the job "Profile 1.")

11. Process the job. Click the Run Job icon on the toolbar to launch the job.

The Run Job dialog box appears, displaying the Profile 1 name just saved. Choose either Standard or Repository outputs; for this exercise, select Standard Output, type the Report Description “Contacts Profile 1,” and click OK.

Run Job icon


Append to Report is a useful option for historical data analysis.

The dfPower Profile (Viewer) appears, showing the Profile 1 job.

12. Select a Data Source (Contacts for this exercise) and then a Field Name (State) to view detailed profile data; note that several tabs appear in the lower half of the viewer that permit users to see full details of the data profile. The Column Profiling tab is selected in the figure that follows.


These tabs enable you to view various report details.

13. Select the Frequency Distribution tab from the lower half of the viewer. The viewer now displays information about the data value, count, and percentage. Note the various values used to denote “California.”

Several values exist for California, so some are non-standard.

14. Right-click anywhere in the panel to access Visualize and Filter drill-through options.

15. Double-click a specific value to get to the drill-through feature.


The Visualize function accessed from Frequency Distribution for State generates this chart.

16. Select the Pattern Frequency Distribution tab for another view of the data.

17. Double-click the first pattern "AA" to view the Pattern Frequency Distribution Drill-through screen, shown in the next image. Note that users can scroll to the right to view more columns.


You can export or print this report from this window. The export function permits you to save the report as a .txt or .csv file that can be viewed by users without access to dfPower Studio. Next, we will examine filtering capabilities more fully by applying a business rule to a table.


Profile

Exercise 2: Applying a Business Rule to Create a Filtered Profile Report

Overview: Here, we will create a filtered profile in order to perform a standardizing function on the State field in the Contacts database.

Assignment: Create a Profile report to compare filtered and full tables. Among the tasks to be performed are:

• Select a data source
• Insert a Business Rule
• Create the Rule Expression
• Save and Run the Job

Note: You can view filtered data in a couple of ways—by applying a filter to the Pattern and/or Frequency Distribution tabs, or by applying a Business Rule. It is important to note that the first method filters only the displayed data, while applying a business rule filters data out of the report altogether.

Step-by-Step: Creating a Filtered Profile Report

1. Return to dfPower Studio Profile (Configurator). If the Viewer is still open, close it.

2. Select the Contacts table from DataFlux Sample.

3. Select the State field name from the Contacts table data. We now want to customize the profile by writing a new business rule that permits us to examine contacts for California only.

4. Select Insert > Business Rule. A dialog box appears.

User Tips

• Select Insert > Business Rule to create a profile report based on a business rule. A business rule allows you to filter data to ensure data is adhering to standards. A business rule can be based on a table or a SQL query.

• Select Insert > SQL Query to create a profile report based on an SQL query. Query Builder can be used to help construct the SQL statement.

• Highlight Text Files and select Insert > Text File to create a profile report based on a text file instead of an ODBC data source.

5. Type a Business rule name that is easily identifiable. For this example, we will use Contacts_California.

6. Click OK. A Business Rule on Table dialog box appears, as shown next.


7. Select STATE from the Field Name browser.

8. Select Equal to from the Operation browser.

9. Type "CA" to indicate California as a Single value in the Value section.

10. Select Add Condition to create a Rule Expression, which appears in the corresponding display within this window.

11. Save the Profile job as "Profile 2."

12. Run the Profile job. dfPower Studio opens a Profile 2 job viewer to display details of the job.

You now have a full table and a filtered table to study for Contacts. You can view the data in several ways using the Visualize function, as in Exercise 1.
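For reference, the filter this rule applies corresponds to a simple SQL condition. A sketch of an equivalent query that could be supplied through Insert > SQL Query instead:

SELECT *
FROM Contacts
WHERE STATE = 'CA'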

Reporting

You can run multiple jobs with the same setup. For example, you may want to run the same job weekly. These reports can be set up as real-time or batch reports. You can even set up a report database as an advanced option.

To view reports, open them from the Profile Viewer or select the Reports folder in the Management Resources directory: click Navigator > DFStudio > Profile > Reports.


Such reports can be quite valuable for data monitoring, which can show trends that permit you to fine-tune your data management strategy.


Architect

Exercise 3: Create an Architect Job from a Profile Report

Overview: After creating the Profile reports, you can use Architect for several functions (Data Validation, Pattern Analysis, Basic Statistics, Frequency Distribution, and more). In this exercise, we will create an Architect job from a Profile report that is based on a SQL query.

Assignment: Create an Architect Profile report based on a SQL query to standardize data. Among the tasks to be performed:

• Create a Profile report
• Create the Architect job
• Build the SQL query
• Output Architect data to HTML
• Standardize State data
• Save and Run the Architect job

To prepare, we will create a Profile report and then begin the Architect procedure.

Step-by-Step: Creating an Architect Job from a Profile Report

1. Select Configurator from the Studio Profile node. An untitled Profile appears in the Configurator window.

2. Create a Profile report based on a SQL query.

   a. Select the Client Info table from the data sources. All Client Info field names are automatically selected.

   b. Select Insert > SQL Query from the Configurator menu.


   c. Type a SQL query name for the job; for this exercise, type Client_Info for State. A SQL Query dialog opens, with Query Builder available.

   d. Type SELECT "State" FROM Client_Info.

The untitled profile information is updated in the Configurator.

3. Select File > Save As.

4. Name the job Sample Architect Job for State. We can now build the job with our specifications and open Architect to display it.

5. To run the Profile job, click Run Job.

6. Specify Standard Output.

A Profile Viewer display shows the data source and how the data is parsed. We now need to create an output for the data. Architect performs data transformations as a job flow and contains nodes and steps for 50 different data quality transformations.

7. From the open Client_Info report, highlight the field name State.


8. With the State field highlighted, select Tools > Add Task > Standardization. The Standardization dialog box appears.

9. From the Standardization definition drop-down menu, select State (Two Letter).

10. Leave Standardization scheme at the default setting of (None).

11. Click OK. The Profile Viewer Task List window will display the status of the task being performed.

12. Select Tools > Create Job to build the Architect job from the task list.

13. Save the job.


14. Architect is launched, displaying the job. You can continue data transformations from this point. By default, Architect jobs are saved in the Management Resources directory (Navigator).

15. Select from Data Outputs to view the available output nodes.

16. Double-click HTML Report to add it to the Architect job. The Report Properties screen appears.


17. Enter the following Name and Report Title: HTML Report 1 for Client_Info.

18. Select OK.

19. Select File > Save to save the Architect job.

20. Select Job > Run to run the Architect job. The Run Job window displays.

21. Close the Run Job status window. The HTML Report now displays, as pictured.


22. Select File > Close to close the HTML report.

23. Select File > Exit to close the Architect job.

24. Select File > Exit to close the Profile report.
