04 metadata and metadata management

Upload: christopher-williams

Post on 25-Feb-2018


  • 7/25/2019 04 Metadata and Metadata Management

    1/23

    DataStage Fundamentals (version 8 parallel jobs) Staffordshire, December 2010

    Module 4 Metadata and Metadata Management

    Metadata is a term that means "information about data". DataStage relies heavily upon metadata to describe the data that are to be processed, the format of the data, the processing that is required, and so on. Additional metadata can be used to answer end users' questions such as "where did this value come from?"

    Information Server has a unified metadata layer through which metadata can be shared among many products, including DataStage. In addition, each DataStage project has its own local repository for metadata. All of this metadata is available to DataStage users, and therefore must be managed rigorously.

    Objectives

    Having completed this module you will be able:

    to list three classes of metadata

    to import DataStage components from a given DataStage export file

    to inspect metadata in the Repository using Designer

    to use Quick Find and Advanced Find in the Repository

    to define "nullable"

    to export DataStage components from the Repository


    Metadata

    The word metadata comes from a Greek prefix "meta", meaning above, and "data". "Data" is the plural past participle of the Latin verb "dare", meaning "to give". The singular past participle is "datum", something (that has been) given. So "data" literally means things (that have been) given.

    In information technology (IT), the word "metadata" is usually taken to mean "information that describes data", and everyone claims to understand what "data" means in IT.

    Metadata allow questions about the data to be answered. For example, an end user may be looking at a pie chart in which one sector contains 26% of the overall total. The user may be interested to know how up to date the data are, whether they are complete, what relationship they bear to the operational systems' data, and what processing they underwent between there and the pie chart.

    There are several classes of metadata. Authorities differ on how many. For a DataStage developer the three most important are listed here.

    Business metadata incorporates all knowledge about the data that the business has (or ought to have). This might include business rules (for example "a customer number has the following format", "metric measures of distance are converted to US measures in the DW", "order date must be no later than current date during data entry", and so on), and ownership and/or responsibility (for example "the product price table is owned by, and kept up to date by, the sales management group"). Quite often, business metadata are produced by people with titles such as business analyst or metadata steward.

    Technical metadata are those that describe the technical aspects of data, such as the format (particularly of text files), the rows and columns, SQL data types, and so on. Technical metadata also describe the processing that occurs to the data, not only during ETL but also during original data entry and any reformatting that BI tools might perform. These often become specifications with which programmers/developers work.

    Process metadata are no less important. These record what processing actually took place, and whether all records were processed or some were rejected.


    Business Metadata in DataStage

    Business metadata are typically maintained outside of DataStage, perhaps by a business analyst using the Business Glossary product (another product in the Information Server suite). DataStage does not directly use business metadata, but having it available can assist developers in, for example, assigning correct validation logic. The usual place where business metadata is to be found in the DataStage repository is in Data Elements. Business metadata can also be found in annotations in job designs and in description fields on jobs, stages and links.

    Figure 4-1 Example of Data Element

    Figure 4-1 shows an example of a Data Element. The SQL tab allows the most likely data type for this element to be recorded, though it is not enforced. The other two tabs define the data element's relationships with DataStage Transforms, which are available only in server jobs, not in parallel jobs (this class is about parallel jobs, so Transforms will not be discussed).

    Data elements can be added to any table definition, to highlight that a particular field has business metadata to be carried with it. Usage analyses can be performed on data elements, for example to answer developer questions such as "which jobs process revenue?" (assuming there is a data element called Revenue or something similar).


    Technical Metadata in DataStage

    There are four main areas into which technical metadata for parallel jobs may be grouped. These are configurations, table definitions, source code (primarily of routines) and ETL job designs themselves.

    Configurations cover a wide range of things, most of which we discuss in other modules or in the Administrator class.

    The most obvious are the parallel execution configuration files. Each of these provides a list of hosts and resources (nodes) on which parallel execution can take place. Different configuration files can be used for different tasks; for example a one-way configuration is best suited for processing a single row, a one-way or two-way configuration is suited to a small volume of data, whereas a 36-way configuration could process a very large volume of data indeed.


    Routines are of three kinds. Parallel routines, which can be called from the Transformer stage in parallel jobs, are created outside of DataStage in the C++ language and compiled and linked. An entry is placed into the repository to record the name, arguments and location of the library or object containing the routine.

    Server routines are created within DataStage, using DataStage BASIC as the programming language. They are stored directly in the repository and are of two kinds.

    Before/after subroutines can be used in parallel jobs, in server jobs and in active stages in server jobs.

    Transform functions can be used in the server Transformer stage, in the BASIC Transformer stage in parallel jobs, and in Routine activities in job sequences. (There is a full course called Programming with DataStage BASIC available. We do not have time to cover creation of routines in this class.)

    Job designs are created using DataStage Designer and are stored directly in the repository.

    Process Metadata in DataStage

    Each time a job runs, it keeps a log of its activity and periodically updates status information such as CPU usage and row counts. This information is stored in the Repository, and may be viewed using the Director client and reported on using the reporting console of Information Server.

    Environment variable options allow the collection of extra information about processing; most of these are in the Reporting folder.

    Figure 4-2 Environment Variables That Control Reporting


    Metadata Repository

    Metadata need to be stored somewhere so that they can be used. For DataStage, metadata are stored in the "metadata repository".

    In fact there are two metadata repositories. Each DataStage project has a local repository and there is a central, unified metadata repository for all Information Server products. "Unified" in this context means that the metadata are stored in a format such that they are accessible, via the metadata delivery and analysis services, by any Information Server product.

    When using DataStage, you are not aware, in general, in which of the two repositories your particular metadata resides. Metadata may be in one, the other, or both. You access the metadata repository through the Repository toolbar in the Designer client.

    Figure 4-3 Repository Toolbar

    Figure 4-3 shows the Repository toolbar in a project called (as indicated by the tab) Demonstrations. Chances are that your Repository will have a different set of folders, since the structure is completely customizable. Some of the folders in Figure 4-3 relate to QualityStage jobs, some relate to mainframe jobs. If you do not have these capabilities installed on your DataStage server, then you may not see these folders.


    In the Designer client, the repository is organized as a tree, in which you create as many branches as needed. Be careful, though, not to create so complex a structure that it becomes impossible to maintain.

    Figure 4-4 Repository with One Branch Expanded

    In Figure 4-4 the Routines branch in the Repository has been expanded. ... "read only".

    To make a copy, select the object's name, right click and choose "Create a copy" from the menu. For example, if the object is called XYZ, then the new object will be called Copy


    Rename option available, or you can rename most objects within their editing dialog.

    Creating New Categories

    To create a new folder anywhere in the repository, right click on the folder which will be the parent of the new folder. (There is a "project" folder at the very top of the tree that can serve as the parent of new top-level folders.)

    Choose New from the popup menu, then Folder from the subsequently displayed menu. Figure 4-5 illustrates this process, creating a new subfolder in the ParameterSets branch of the Repository.

    Figure 4-5 Creating a New Folder


    This will open a revised Repository toolbar with the newly-created folder, named NewFolder, selected (highlighted) and waiting for its name to be changed.

    This is shown in Figure 4-6. You should, of course, rename the new folder immediately to something more meaningful.

    Figure 4-6 Newly Created Folder

    Deleting a Folder

    When you select a folder in the repository and press the Del button on your keyboard, or right-click the folder and choose Delete from the menu, you might be deleting not just the folder but also its entire contents.

    To help guard against the possibility of accidental deletion, a confirmation dialog appears asking you to confirm deletion of the selected items.

    In the case of deleting a single folder this dialog will have only the named folder in its list. However, deletion can be initiated from the result of a search of the Repository, so that the confirmation dialog allows you to limit the items to be deleted to just those which you select in the dialog itself.

    Searching the Repository: Quick Find

    There are two tools for searching the repository: Quick Find and Advanced Find. Let's look at Quick Find first.


    Whenever you are working in the Repository there is almost always a link to "


    Figure 4-8 Quick Find Dialog, Types to find List (Partial)

    The "Include description" check box allows the search to include searching for the indicated string or wildcard pattern in the description fields of DataStage objects.

    The initial result of Quick Find is an expanded repository tree with the first object in which the search was successful highlighted. Next and Prev buttons allow this tree view to be navigated.

    Figure 4-9 Quick Find Initial Result

    In Figure 4-9 the results of an unconstrained search for "Month" are shown. 16 "hits" were obtained, the first being in the DateGenericToTimestamp routine in the \Routines\sdk\Date folder.


    Clicking on the "16 matches" link or on the Adv button would open the Advanced Find capability. Alternatively you can right-click on any of the selected objects (or, indeed, any of the objects) and perform other activities such as rename, export, or "where used" or "dependencies" analyses.

    Searching the Repository: Advanced Find

    The Advanced Find dialog offers the same search capabilities as Quick Find, but with a greater range of filters available.

    Figure 4-10 Advanced Find

    Figure 4-10 shows the same search as was illustrated for Quick Find, namely for the word "Month" occurring in the object name or description. In Advanced Find, however, you can specify different words in the description while still filtering on the object name.


    the case. This cue remains visible even though that particular part of the filter has been minimized.

    Figure 4-11 Advanced Find, Created Filter Dialog

    Where used allows you to set up a list of repository objects so that the search finds only objects that use the objects in your list.

    Dependencies of allows you to set up a list of repository objects so that the search finds only objects that are dependencies of any of the objects in your list. For example, a job can be a dependency of a job sequence, a routine can be a dependency of a job or even of another routine.

    Type specific allows you to set a table definition that will be used to find those table definitions in the repository that are related via the same shared Table. In this context, a "shared Table" is a table definition in the common, unified metadata Repository for Information Server.

    There are four Options. Search can be case sensitive or not, can be within the last result set only (or not), can include nested results for dependency searches, and can search for a match in object name or description or both.
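    The difference between a "where used" search, a "dependencies of" search and the nested-results option can be pictured as two directions of travel over the same graph. The following is only a sketch in Python, not how DataStage implements its search; the object names and the DEPENDS_ON mapping are invented for illustration.

```python
from collections import defaultdict

# Hypothetical repository objects and their direct dependencies.
# The names are illustrative only; they do not come from a real project.
DEPENDS_ON = {
    "SeqLoadSales": ["JobLoadSales", "JobLoadCustomers"],
    "JobLoadSales": ["RoutineCleanDate", "SalesTableDef"],
    "JobLoadCustomers": ["RoutineCleanDate"],
    "RoutineCleanDate": ["RoutineParseText"],
}


def _walk(start, edges, nested):
    """Collect neighbours of `start`; follow them further only if nested."""
    found = set()
    stack = list(edges.get(start, []))
    while stack:
        current = stack.pop()
        if current in found:
            continue
        found.add(current)
        if nested:
            stack.extend(edges.get(current, []))
    return found


def dependencies_of(name, nested=False):
    """Objects that `name` depends on (direct only unless nested=True)."""
    return _walk(name, DEPENDS_ON, nested)


def where_used(name, nested=False):
    """Objects that use `name`: the same graph walked in reverse."""
    used_by = defaultdict(list)
    for parent, children in DEPENDS_ON.items():
        for child in children:
            used_by[child].append(parent)
    return _walk(name, used_by, nested)
```

    In this sketch a job sequence depends on jobs, jobs depend on routines, and one routine depends on another, mirroring the examples in the text. Switching nested on follows those chains to the end instead of stopping at the first level.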


    Table Definitions

    DataStage uses the term "table definition" to mean any form of record layout definition. The term has its origin in database terminology but has been extended, for DataStage use, to mean record layout metadata from any source.

    So, for example, DataStage records the format of a sequential file as its "table definition". DataStage records the format of a C


    Figure 4-12 Table Definition Layout Tab

    Exporting DataStage Components

    DataStage components (that is, any object in the repository) can be exported into a text file. Two formats are available.

    A DSX (DataStage export) file is the original format used by DataStage. It is the more compact of the two formats, a factor that might be considered if, for example, contemplating emailing the export file.

    The other format uses XML (extensible markup language) which identifies each component with its own pair of tags as well as using tags and a style sheet to represent the relationships between components.

    Exporting DataStage components is accomplished via the Export menu in Designer, or by choosing Export from the results of a Quick Find or an Advanced Find. The Repository Export dialog allows you to specify what to export, and where.


    Figure 4-13 DataStage Repository Export Dialog

    The Items to Export pane contains the list of items to be exported. The Add link re-invokes Quick Find to locate more items. Eventually you have a list of items in this field, some or all of which you have selected to be exported. The status bar at the bottom of the window reports how many objects have been selected and how many of these will be ignored (not exported). For example, if any read-only items have been selected and "Exclude read-only items" is set, then these read-only items will be ignored.

    The export file is always on the client machine. The "Type of export" field governs the format of the export file and also its filename suffix; DSX files have ".dsx" as their suffix, while XML export files have ".xml" as their suffix.

    If "append to existing file" is not selected and the export file already exists, an "


    Importing DataStage Components

    Using DataStage Designer's Import menu you can import DataStage components that have been exported from any DataStage project.

    Under the Import menu the first two options are "DataStage components" for importing from a DSX-format file, and "DataStage components (XML)" for importing from an XML-format file. The DataStage Repository Import dialog is relatively simple.

    Figure 4-14 DataStage Repository Import Dialog


    Importing Table Definitions

    As noted earlier, a "table definition" in DataStage describes the record layout in any data source; it does not have to be a database table. Table definitions can be imported into the repository from a number of sources as illustrated in Figure 4-15.

    Figure 4-15 DataStage Table Definition Import Menu

    In later modules we will investigate a couple of these in somewhat more detail. As a general principle, however, each opens a wizard that takes you through identifying the metadata source, retrieving the definitions from that source and storing them in a particular category in the repository.

    The Connector import wizard allows table definitions to be imported into the DataStage repository from the unified Information Server repository. Sequential File definitions are unusual in that you also have to specify format information, as well as importing/defining column definitions.

    Customarily table definitions are stored in the Table Definitions branch of the repository with two levels of category, data source type and data source name. For example a table definition imported from an


    category \Table Definitions\ ... "nullable", meaning that they may contain NULL.

    In the context of database tables, NULL indicates that there is no known value for this field in the current row. How NULL is stored is different in different databases, and immaterial.

    Because NULL is unknown, there are very few operations that can be performed with it. For example, adding 3 to an unknown value yields a still-unknown value.
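    The way an unknown value stays unknown through arithmetic can be sketched in a few lines. This is an illustration in Python, with None standing in for NULL; the function names are invented here and it is not DataStage code.

```python
from typing import Optional


def add_nullable(a: Optional[int], b: Optional[int]) -> Optional[int]:
    """Add two values, treating None as NULL: unknown in, unknown out."""
    if a is None or b is None:
        return None
    return a + b


def is_null(value: Optional[int]) -> bool:
    """Asking whether a value is NULL is always a safe operation."""
    return value is None
```

    Calling add_nullable(None, 3) returns None rather than raising an error or inventing a number, which is the behaviour the paragraph above describes.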

    Functions in parallel Transformer stages are particularly intolerant of NULL; you need to handle NULL specifically.

    Two tests may be performed with nullable fields: you can ask whether the value IS NULL or whether the value IS NOT NULL. ... "outer" source will return NULL if there is no match on the join condition.

    Within a DataStage table definition any field can be marked as Nullable or not. However, if there is any possibility that this field may contain NULL then it must be marked Nullable.

    Text files have no data types, and therefore no implicit concept of NULL. With sequential files, therefore, it is necessary to specify some text string which, if encountered, will be understood to represent NULL. This is covered in more detail in the module on Sequential Files.
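    The idea of nominating a text string to stand for NULL can be sketched as follows. This is a Python illustration only; the token "NULL", the comma delimiter and the function name are assumptions made for the example, not DataStage defaults.

```python
from typing import List, Optional

# Assumed text token that stands for NULL in this hypothetical file.
NULL_TOKEN = "NULL"


def parse_line(
    line: str, delimiter: str = ",", null_token: str = NULL_TOKEN
) -> List[Optional[str]]:
    """Split one delimited line, mapping the null token to None."""
    fields = line.rstrip("\n").split(delimiter)
    return [None if field == null_token else field for field in fields]
```

    The choice of token matters: it has to be a string that can never occur as genuine data in that column, otherwise real values would be read as NULL.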

    DataStage's internal representation of NULL is usually a single byte whose binary value is 10000000. However, in environments where this byte is used to represent the Euro currency symbol, a different byte value can be configured for DataStage to use. DataStage's internal NULL is referred to as an "out-of-band null".

    This value can be represented as an int8 field whose value is -128.
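    The relationship between the bit pattern and its integer value can be checked directly. The short Python sketch below is only an illustration; the use of the Windows-1252 code page as an example of an encoding where this byte is the Euro symbol is an assumption added here, not something the course text names.

```python
# The bit pattern 10000000 held in a single byte.
raw = bytes([0b10000000])

# The same byte read as an unsigned and as a signed 8-bit integer.
as_unsigned = int.from_bytes(raw, byteorder="big", signed=False)
as_int8 = int.from_bytes(raw, byteorder="big", signed=True)

# In the Windows-1252 code page this byte is the Euro currency symbol.
as_cp1252 = raw.decode("cp1252")
```

    Read unsigned the byte is 128; read as a signed int8 (two's complement) it is -128, the lowest value that type can hold.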


    An "in-band null" is a special value, legal for the data type, that is used to represent NULL even though it is not. For example, an in-band null for DateHired (data type Date) might be 1800-01-01, a legal date but impossible in data as a date hired. Therefore any representation of NULL in a Sequential File stage is, effectively, an "in-band null".
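    Switching between the two kinds of null amounts to a pair of substitutions. The sketch below uses Python, with None playing the part of the out-of-band null and the 1800-01-01 date from the example as the in-band value; the function names are invented for illustration and are not DataStage functions.

```python
from datetime import date
from typing import Optional

# In-band null for DateHired, taken from the example in the text.
IN_BAND_NULL = date(1800, 1, 1)


def to_in_band(value: Optional[date], in_band: date = IN_BAND_NULL) -> date:
    """Replace a true null (None) with the agreed in-band value."""
    return in_band if value is None else value


def to_out_of_band(value: date, in_band: date = IN_BAND_NULL) -> Optional[date]:
    """Turn the in-band value back into a true null (None)."""
    return None if value == in_band else value
```

    A genuine hire date passes through both functions unchanged, which is why the in-band value must be one that cannot occur in real data.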

    Conversion functions exist for switching between out-of-band and in-band null, and for generating null. These are different in the Transformer stage and the Modify stage.

    Table 4-1 Null Handling Functions

    Description                   Modify Stage     Transformer Stage
    Test for null                 null()           IsNull()
    Test for not null             notnull()        IsNotNull()
    Convert null to value         handle_null()    NullToValue(), NullToEmpty(), NullToZero()
    Convert to in-band null       handle_null()    NullToValue(), NullToEmpty(), NullToZero()
    Convert to out-of-band null   make_null()      SetNull()


    Review

    The term "metadata" is usually understood to mean information about data or the processing of the data. Business metadata incorporates knowledge that the business has about the data, such as business rules, ownership and responsibility. Technical metadata includes things like table definitions, routine code and the like. Process metadata describes what happened to the data, when, and with what result/success. DataStage stores metadata in both the central Information Server repository and in its own local repository.

    The Repository toolbar in DataStage (Figure 4-3) does not reveal in which location any particular item of metadata is stored. It is organized into folders, over which you have complete control. But it is wise to follow some systematic way of storing metadata. The terms "category" and "pathname" are both used to describe the location of a particular folder, or component in a folder, in the Repository.

    DataStage has two search utilities, Quick Find and Advanced Find. The latter has more filters, and allows a greater range of things to be done with the results of the search.

    ... "table definition", a term that encapsulates any collection of column definitions. These can be imported using a number of different tools. It is also possible to export any combination of components from the Repository into a file that can be subsequently used to import some or all of these components into another DataStage project.

    NULL is a concept, that of a data item whose value is unknown. Every database has its own way of representing NULL internally, as does the DataStage server. Functions exist to test whether a data item is null (or is not null), to substitute a value where this is true, and to generate out-of-band null where needed. Some activities, such as outer joins, can also return NULL.

    Further Reading

    Parallel Job Developer's Guide, Chapter 2

    Designer Client Guide, Chapters 2 and 13
