data profiling tips

Upload: ratika-miglani-malhotra

Post on 04-Jun-2018


  • 8/14/2019 DATA PROFILING TIPS


    The Necessity of Data Profiling: A How-to Guide to Getting Started and Driving Value

    Allocating sufficient time and resources to conduct a thorough data profiling assessment will help architects design a better solution and reduce project risk by quickly identifying and addressing potential data issues.

    February 3, 2010

    By Matt Austin

    Data profiling is a critical input task to any database initiative that incorporates source data from external systems. Whether it is a completely new database build or simply an enhancement to an existing system, data profiling is a key analysis step in the overall design. Allocating sufficient time and resources to conduct a thorough data profiling assessment will help architects design a better solution and reduce project risk by quickly identifying and addressing potential data issues.

    Best Practices

    How should you approach a new data profiling engagement and what can you expect in terms of value-added results?

    Data profiling is best scheduled prior to system design, typically occurring during the discovery or analysis phase. The first step, and also a critical dependency, is to clearly identify the appropriate person to provide the source data and also serve as the "go to" resource for follow-up questions. Once you receive source data extracts, you're ready to prepare the data for profiling. As a tip, loading data extracts into a database structure will allow you to freely write SQL to query the data while also having the flexibility to use a profiling tool if needed.
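    The loading tip above can be sketched as follows. This is a minimal illustration, assuming a comma-delimited extract and an in-memory SQLite database; the table name customer_extract, the column names, and the sample rows are all hypothetical and do not come from the article.

```python
import csv
import io
import sqlite3

# Hypothetical extract; in practice this would be a file from the source data owner.
SAMPLE_EXTRACT = """customer_id,email,state,postal_code,balance,signup_date
1,ann@example.com,NY,10001,120.50,2009-03-14
2,bob@example.com,CA,94105,0,2009-07-02
3,,NY,10001,35.00,2008-11-30
4,dee-at-example.com,TX,,0,
5,eve@example.com,,73301,18.25,2010-01-09
"""


def load_extract(conn, table, text):
    """Load delimited text into a new table with one TEXT column per header field."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    rows = [row for row in reader if row]  # skip any empty trailing lines
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_extract(conn, "customer_extract", SAMPLE_EXTRACT)
row_count = conn.execute("SELECT COUNT(*) FROM customer_extract").fetchone()[0]
print(loaded, row_count)
```

    Every column is loaded as TEXT here so that the raw values (including blanks and badly formatted entries) survive the load and can be profiled before any target data types are chosen.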

    When creating or updating a data profile, start with basic column-level analysis such as:

    • Distinct count and percent: Analyzing the number of distinct values within each column will help identify possible unique keys within the source data (which I'll refer to as natural keys). Identification of natural keys is a fundamental requirement for database and ETL architecture, especially when processing inserts and updates. In some cases, this information is obvious based on the source column name or through discussion with source data owners. However, when you do not have this luxury, distinct percent analysis is a simple yet critical tool to identify natural keys.

    • Zero, blank, and NULL percent: Analyzing each column for missing or unknown data helps you identify potential data issues. This information will help database and ETL architects set up appropriate default values or allow NULLs on the target database columns where an unknown or untouched (i.e., NULL) data element is an acceptable business case. This analysis may also spawn exception or maintenance reports for data stewards to address as part of day-to-day system maintenance.

    • Minimum, maximum, and average string length: Analyzing string lengths of the source data is a valuable step in selecting the most appropriate data types and sizes in the target database. This is especially true in large and highly accessed tables where performance is a top consideration. Reducing the column widths to be just large enough to meet current and future requirements will improve query performance by minimizing table scan time. If the respective field is part of an index, keeping the data types in check will also minimize index size, overhead, and scan times.

    • Numerical and date range analysis: Gathering information on minimum and maximum numerical and date values is helpful for database architects to identify appropriate data types to balance storage and performance requirements. If your profile shows a numerical field does not require decimal precision, consider using an integer data type because of its relatively small size. Another issue which can easily be identified is converting Oracle dates to SQL Server. Until SQL Server 200?3 which often caused issues in conversions with Oracle systems.
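    The basic column-level checks above can all be expressed as raw SQL. The sketch below is one possible way to do it, assuming SQLite and a small hypothetical table (customer_extract and its columns are invented for illustration). It also assumes dates are stored as ISO-8601 text, so that MIN and MAX on the text give the chronological range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_extract "
    "(customer_id INTEGER, state TEXT, balance REAL, signup_date TEXT)"
)
conn.executemany(
    "INSERT INTO customer_extract VALUES (?, ?, ?, ?)",
    [  # hypothetical sample rows
        (1, "NY", 120.50, "2009-03-14"),
        (2, "CA", 0, "2009-07-02"),
        (3, "NY", 35.00, "2008-11-30"),
        (4, "", 0, None),
        (5, None, 18.25, "2010-01-09"),
    ],
)


def profile_column(conn, table, column):
    """Return distinct, zero/blank/NULL, length, and range statistics for one column."""
    col = f'"{column}"'
    row = conn.execute(f"""
        SELECT
            COUNT(*),
            COUNT(DISTINCT {col}),
            SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END),
            SUM(CASE WHEN TRIM({col}) = '' THEN 1 ELSE 0 END),
            SUM(CASE WHEN {col} = 0 THEN 1 ELSE 0 END),
            MIN(LENGTH({col})),
            MAX(LENGTH({col})),
            AVG(LENGTH({col})),
            MIN(NULLIF({col}, '')),
            MAX(NULLIF({col}, ''))
        FROM "{table}"
    """).fetchone()
    total = row[0]

    def pct(n):
        # Percentage of all rows; assumes the table is not empty.
        return 100.0 * n / total

    return {
        "row_count": total,
        "distinct_count": row[1],
        "distinct_pct": pct(row[1]),  # 100 suggests a candidate natural key
        "null_pct": pct(row[2]),
        "blank_pct": pct(row[3]),
        "zero_pct": pct(row[4]),
        "min_len": row[5],
        "max_len": row[6],
        "avg_len": row[7],
        "min_value": row[8],  # blanks and NULLs are ignored for the range
        "max_value": row[9],
    }


for name in ("customer_id", "state", "balance", "signup_date"):
    print(name, profile_column(conn, "customer_extract", name))
```

    In this sample, customer_id comes out 100 percent distinct, which marks it as a candidate natural key, while state shows both a NULL and a blank row that a data steward might need to review.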


    With the basic data profile under your belt, you can conduct more advanced analysis such as:

    • Key integrity: After your natural keys have been identified, check the overall integrity by applying the zero, blank, and NULL percent analysis to the data set. In addition, checking the related data sets for any orphan keys is extremely important to reduce downstream issues. For example, all customer keys from related transactions (e.g., orders) should exist in the customer base data set; otherwise you risk understating aggregations grouped by customer-level attributes.

    • Cardinality: Identification of the cardinality (e.g., one-to-one, one-to-many, many-to-many, etc.) between the related data sets is important for database modeling and business intelligence (BI) tool set-up. BI tools especially need this information to issue the proper inner- or outer-join clause to the database. Cardinality considerations are especially apparent for fact and dimension relationships.

    • Pattern, frequency distributions, and domain analysis: Examination of patterns is useful to check if data fields are formatted correctly. As an example, you might validate e-mail address syntax to ensure it conforms to user@domain. This type of analysis can be applied to most columns but is especially practical for fields that are used for outbound communication channels (e.g., phone numbers and address elements). Frequency distributions are typically simple validations such as "customers by state" or "total of sales by product" and help to authenticate the source data before designing the database. Domain analysis is validation of the distribution of values for a given data element. Basic examples of this include validating customer attributes such as gender or birth date, or address attributes such as valid states or provinces within a specified region. Although these steps may not play as critical a role in designing the system, they are very useful for uncovering new and old business rules.
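    The advanced checks above can be sketched in the same way. The example below assumes SQLite and two hypothetical tables, customers and orders, joined on customer_id; the table names, sample rows, and the deliberately loose user@domain pattern are illustrative assumptions rather than anything prescribed by the article.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, email TEXT, state TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES
        (1, 'ann@example.com', 'NY'),
        (2, 'bob@example.com', 'CA'),
        (3, 'dee-at-example.com', 'NY');
    INSERT INTO orders VALUES
        (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0), (13, 9, 60.0);
""")

# Key integrity: order keys with no matching customer row are orphans.
orphans = [r[0] for r in conn.execute("""
    SELECT DISTINCT o.customer_id
    FROM orders AS o
    LEFT JOIN customers AS c ON c.customer_id = o.customer_id
    WHERE c.customer_id IS NULL
    ORDER BY o.customer_id
""")]


def max_rows_per_key(conn, table, key):
    """Largest number of rows sharing a single key value in the table."""
    return conn.execute(
        f'SELECT MAX(n) FROM (SELECT COUNT(*) AS n FROM "{table}" GROUP BY "{key}")'
    ).fetchone()[0]


def cardinality(conn, left, right, key):
    """Classify the relationship by whether the key repeats on each side."""
    def side(n):
        return "one" if n == 1 else "many"
    return f"{side(max_rows_per_key(conn, left, key))}-to-{side(max_rows_per_key(conn, right, key))}"


# Pattern analysis: a loose user@domain shape check, not a full e-mail validator.
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+$")
bad_emails = [
    email
    for (email,) in conn.execute("SELECT email FROM customers ORDER BY customer_id")
    if not EMAIL_SHAPE.match(email or "")
]

# Frequency distribution: customers by state.
customers_by_state = conn.execute("""
    SELECT state, COUNT(*) AS n
    FROM customers
    GROUP BY state
    ORDER BY n DESC, state
""").fetchall()

print(orphans)
print(cardinality(conn, "customers", "orders", "customer_id"))
print(bad_emails)
print(customers_by_state)
```

    The orphan query uses a LEFT JOIN with a NULL filter, which is the same outer-join behavior a BI tool would need in order to keep unmatched orders from silently dropping out of customer-level aggregations.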

    Picking the right techniques depends on the project objectives. If you're building a new database from scratch, take the time to execute and review outcomes of each of the above bullet points. If you're simply integrating a new data set into an existing database, select the most applicable tasks that apply to your source data.

    All of these steps may be conducted by writing raw SQL. The basic profiling steps can usually be accomplished using a tool designed specifically for data profiling. Many third-party data profiling tools have been introduced into the marketplace over the last several years to help streamline the process. These tools typically allow the user to point to a data source and select the appropriate profile technique(s) to apply. The outputs of these tools vary, but usually a data source summary is produced with the field level profile statistics.