Recipe 14 - Build a Staging Area for an Oracle Data Warehouse (2)


Designing, building, loading and managing a Staging Area for an Oracle Data Warehouse.

In 20 minutes.

(part 2)

Recipes of Data Warehouse and Business Intelligence

• In recent years, I have tried to explain and share a vision of the Data Warehouse that I have defined as the Micro ETL Foundation (MEF).

• MEF was born as a set of ideas, tips and methods for successfully implementing a project of this type.

• The main guideline was simplicity. This was the most difficult point to maintain, because a Data Warehouse and Business Intelligence project is extraordinarily complex.

• I have shown in various articles, which you can read on Slideshare or on my blog [14], that even small details such as the descriptions of the codes [10] or the management of null values [5] can weigh heavily on the final outcome of the project.

• In fact, in my opinion, the real complexity lies in being able to simplify the complexity.

Introduction

• It is now time to collect everything I have written and put it together organically, in a form usable in real life, giving consistent life to the Micro ETL Foundation and transforming it into an ecosystem that everyone can use in a simple and free way.

• The implementation was carried out using the internal language of the database (PL/SQL) and is addressed to an Oracle Data Warehouse built in a Windows (or Linux) environment, but the underlying philosophy can easily be adapted to any RDBMS and any operating system.

• The use of a well-defined Naming Convention [6] has been decisive for the creation of models useful for the automatic generation of all the objects that constitute the system. We can therefore define MEF as a "Naming Convention-driven" ecosystem.

• The focus will be on the Staging Area because, for me, it is one of the main components of a Data Warehouse: the basis (according to the Kimball approach) on which we can build the next components, i.e. the Dimensions and the Fact tables.

• The example described is the loading of a simple file in csv format. Other features will be described in subsequent articles.

The birth of MEF

• Try to think of the simplest thing we can find in the computer world. The answer is equally simple: a text file. A text file can be a file with a ".txt" extension or, more manageable with MS Excel, a file with a ".csv" extension. The csv file is in effect a text file, easily opened with any text editor (notepad, vi, etc.).

• Now try to think of the most common (and simple) logic that we can find in the computer world, but also outside it. The answer is equally simple: installation, configuration, elaboration, control.

• Any computer application, however simple or as complex as an operating system or an RDBMS, always uses the same logic:

• Installation.
• Configuration of the system, obtained by entering data into the configuration files or into the configuration tables.
• Activation or execution of the application.
• Monitoring of the activities gathered by the system in one or more log tables.
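As a minimal sketch of this configure-then-log pattern in Oracle SQL (the table names here are hypothetical, not the actual MEF structures described later):

  -- configuration: one row per process to activate
  CREATE TABLE demo_job_cft (
    job_cod  VARCHAR2(30) PRIMARY KEY,   -- process code
    ena_flg  NUMBER(1) DEFAULT 1         -- 1 = enabled, 0 = disabled
  );

  -- monitoring: one row per execution
  CREATE TABLE demo_job_lot (
    job_cod    VARCHAR2(30),
    start_dts  DATE,
    end_dts    DATE,
    status_txt VARCHAR2(10)              -- e.g. OK / KO
  );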

• Another important point related to simplicity is to keep under a single database user not only all the typical structures of the Staging Area (and of the entire Data Warehouse) but also all the configuration and control tables.

• This makes backups simple, since we have to act on a single schema (user) and not on several different schemas.

The importance of simplicity

• Often, in commercial ETL tools, there are several users. For example, we can find a user that owns the structures, a design user, a development user, a run-time user and so on. This method has the purpose of partitioning and combining roles and structures in a clear way, but it makes the environment increasingly complex and difficult to manage; just think of the crossing of all the grants on the tables.

• Simplicity is also achieved by minimizing the physical structures (tables) in favor of logical structures (views and external tables); see the sketch after this list.

• Simplicity also reduces costs for companies that have a small budget. It means not having to use commercial integration tools that, in addition to the cost of the license, often require extensive installation and setup procedures, beyond the need for a training period.

• This does not mean never using them: they can easily coexist with the MEF ecosystem for all those activities that are not present in MEF, for example for complex schedules or for loading data files that are not txt or csv (multimedia, xml, etc.).
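As an illustration of the "logical structures" point above, an Oracle external table lets a csv file be queried in place, with no physical load. A minimal sketch (the directory object, table name and columns are hypothetical):

  CREATE OR REPLACE DIRECTORY dat_dir AS 'd:\projects\dwh\dat';

  CREATE TABLE demo_mic_fxt (
    mic_cod     VARCHAR2(10),
    country_txt VARCHAR2(100)
  )
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY dat_dir
    ACCESS PARAMETERS (
      RECORDS DELIMITED BY NEWLINE
      SKIP 1                             -- skip the header row
      FIELDS TERMINATED BY ';'
      MISSING FIELD VALUES ARE NULL
    )
    LOCATION ('ISO10383_MIC_20160128.csv')
  )
  REJECT LIMIT UNLIMITED;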

The system (The recipe)

• Now let's see how, in about 20 minutes, using only MEF, a text editor and the internal language of the RDBMS (Oracle in this case, with its SQL and PL/SQL), we will be able to build a complete Data Staging system.

• We will follow the same steps listed in Part 1 of this article, giving a more detailed description.

• The MEF features are numerous, so I will give a detailed description only for the specific case of loading a data file in csv format.

• Further examples of features will follow shortly.

• Obviously MEF does not claim to be error-free and optimally written: it just wants to show you how to get great results with minimal effort and minimal use of resources.

• MEF has been tested on Win7+Ora11, Win2012+Ora12 and Oracle Linux 7+Ora12.

The instruments used (the ingredients)

• The data file to be loaded. As an example, we use the data file that describes the world's financial markets, downloaded from the site iso15022.org.

As an example of a data file, we use the MIC (Market Identifier Code) list, downloadable at the page http://www.iso15022.org/MIC/homepageMIC.htm. This data file has already been downloaded and placed in the \dwh\dat folder, which must contain all the sample data files. But if you want to use the latest updated version, you have to go to the site and download it in Excel format. Open the first sheet, MICs List by Country, and save it in csv format (my setting is ";" as the field separator) in \dwh\dat. To avoid problems, before saving I advise eliminating the last column (comments), which contains dirty characters that cause problems when the file is read as an Oracle external table.
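Just to give the idea, the first lines of the saved csv will look something like this (columns abbreviated and values illustrative; check the actual sheet for the exact layout):

  COUNTRY;ISO COUNTRY CODE (ISO 3166);MIC;...;CITY;...;STATUS DATE;CREATION DATE
  ITALY;IT;XMIL;...;MILANO;...;...;...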

• A document describing the file structure. The information on its structure is described in the FAQ on the site.

Obviously, any data file must be associated with a document that describes its internal structure: at a minimum, the fields that comprise it, their length and their meaning. This information on the structure is provided as a PDF in the website FAQ (http://www.iso15022.org/MIC/FAQ_ISO_10383.pdf).

The instruments used (the ingredients)

• The Micro ETL Foundation (MEF), that is, the ecosystem of facilities and processing modules that will connect all the ingredients to obtain the final dish.

MEF is the ecosystem that, after its installation, will create all the objects using solely the configuration information.

• A text editor

Notepad, vi or any free editor on the market is fine. The configuration operations are very simple, so it is not necessary to use a particularly evolved editor.

• A command prompt to access the Oracle Database through SQL*Plus

In Windows you have to launch cmd.exe; in Linux, a Terminal.

Setup (Preparation)

• Decide on a three-character code for the project (for example dwh), open a DOS prompt and create a folder with the same name as the project (for example d:\projects\dwh).

These settings are part of the MEF methodology. The project code must be three characters. The folder where to place MEF depends on your server; for this example the project home will be d:\projects\dwh (<prj_home>). We will configure this path in the next steps.

• Download MEF (from https://drive.google.com/open?id=0B2dQ0EtjqAOTQzZSaUlyUmxpT1k) and unpack it in the folder \dwh

The directory structure is fixed and should not be changed.

• Go under \dwh\mef and open mef_def.sql. In this configuration file you must set some environment variables needed by the Data Warehouse that will be created; the ones you see below are just an example for Windows. For the moment, change only the variables shown in red.

The environment variables defined in this file are used in the various sql scripts and have the following meaning:

prj - The project code. It must be 3 characters long. For example, I have used the dwh code, but if you want to use another code, for example edw, in step 1 you have to create the folder d:\projects\edw.

Setup (Preparation)

usr and pwd - The name and the password of the user that will own all the Data Warehouse structures, including those of MEF. All MEF structures are easily recognizable because they belong to the "mef" area (the concept of area follows the Naming Convention described in [6,8]); these structures will all be named <prj>_MEF_*.

frm and ndy - The MEF installation process automatically creates two tables that every Data Warehouse must have: the Date Dimension and the Time Dimension (refer to the Kimball books for a detailed description). These dimensions are simple, as they must contain every day of the year and every second of the day. As they are preloaded during installation, we cannot load every day of the year from the beginning of time; we have to set a limit. The frm variable sets the initial day: it depends on the historicity of the data that you must load into the DWH. For our example, I decided to start from 20120101, but you can change it according to your needs. The second variable says, starting from the initial day, how many days to load. For example, I set 3660 days, that is, about 10 years. If you want, you can change it.

dla - Indicates the default language for the descriptions. For more details on the management of the descriptions I refer you to another article of mine [11]. MEF assumes, for simplicity, that only two descriptions are active: one recognized universally, English, and the local one, in this case Italian. For the purposes of loading the Staging Area, this setting is not decisive.

Setup (Preparation)

wcm - To list the files of a directory, java must open a console at the operating system level. Indicate here the full path of the executable. For example, for a 64-bit Windows Server it is 'C:\\Windows\\syswow64\\cmd.exe'. Pay attention to the double backslash in the path.

fss - The folder separator in the file system (Unix or Windows). Obviously, the creation of the directories requires this information.

dev, den and det - They indicate the default values for alphanumeric data, numbers and dates. As stated in a previous article [5], they are used to eliminate the null values from the Data Warehouse. For dates I usually use 99991231 or 00010101.

srv - This variable identifies the SMTP server to be used (by its name or IP address). It could be, for example, the Exchange server in Windows environments. However, if you do not have this information, it does not matter: the sending of the emails will fail, but it will not interrupt the process.

Setup (Preparation)

prt - Physical path of the folder under which the project root was created. For example, if you have created the directory of the dwh project under E:\projects, you have to set 'E:\projects'.
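Putting it all together, a hypothetical mef_def.sql for this Windows example could look like the following (I am assuming the classic SQL*Plus DEFINE syntax; the names match the variables above, the values are illustrative):

  -- project code (3 chars), owner user and its password
  DEFINE prj = dwh
  DEFINE usr = etl
  DEFINE pwd = etl
  -- Date Dimension: first day and number of days (about 10 years)
  DEFINE frm = 20120101
  DEFINE ndy = 3660
  -- default description language
  DEFINE dla = ENG
  -- console opened by java for the file list (note the double backslash)
  DEFINE wcm = 'C:\\Windows\\syswow64\\cmd.exe'
  -- folder separator of the file system
  DEFINE fss = \
  -- default values for alphanumeric, numeric and date data
  DEFINE dev = '-'
  DEFINE den = -1
  DEFINE det = 99991231
  -- SMTP server (name or IP)
  DEFINE srv = smtp.mycompany.com
  -- physical path under which the project root was created
  DEFINE prt = 'd:\projects'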

• Open a DOS prompt, connect to Oracle as sysdba and run the script mef_install.sql

This installs the MEF ecosystem and creates the etl user. Be warned that the installation will first try to delete it, so verify that you set a user that is not already present. In addition to the user access rights, it will also create all the Oracle directories pointing to the project folders. Make sure there are no errors in the log file that is generated. If something goes wrong, it is almost always caused by an incorrect setting of the environment variables.
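For example, on Windows the installation session could be along these lines (assuming a local database and \dwh\mef as the current directory):

  cd d:\projects\dwh\mef
  sqlplus / as sysdba
  SQL> @mef_install.sql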

• Open the mef_config.sql script with an editor and replace Massimo Cenci's email addresses with yours

The table mef_email_cft contains the references for sending the feedback email at the end of the loading job. Put your references there.

Setup (Preparation)

• Connect to the database with the etl user and install the core environment by running the script mef_core_install.sql

This script will create all the basic structures of the Micro ETL Foundation. I will give you a short logical/physical description. The script will create a project configuration table (DWH_MEF_PRJ_CFT) with all the details of a general nature. As MEF is a data loading ecosystem, each loading process must be handled by a job that requires, after being configured (DWH_MEF_JOB_CFT) and having performed its work, a log table (DWH_MEF_JOB_LOT). It can also be helpful to send email messages (DWH_MEF_EMAIL_CFT), as said before, and we would always like to know the success or failure of the email sending (DWH_MEF_EMAIL_LOT). As explained in my other article [9], the job is a logical entity consisting of processing modules to be configured (DWH_MEF_UNIT_CFT) and of which we must also have a detailed log of execution (DWH_MEF_UNIT_LOT). If a job is particularly complicated, that is, made up of more jobs, MEF will be responsible for inserting into a table the list of all the modules that make up the final job (DWH_MEF_ULIST_LOT) and for running their execution. As the loading process manages data files, the list of the files is inserted (via java) into a working table (DWH_MEF_DIRLIST_LOT). The job, however, will always process only one data file at a time.

Setup (Preparation)

Within each module, it is extremely important to register the most significant steps of the process, generating log messages (DWH_MEF_MSG_LOT); see [7]. The two dimensions of day and time (DWH_COM_CDI_DAY_DIT and DWH_COM_CDI_SS_DIT) will also be created and loaded. Once the script has finished, we can take a look at the configuration tables after installation. To this end, we can use, for example, SQLDeveloper (which is free). This script and the next ones will already use DWH_MEF_MSG_LOT to log the details.
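If you prefer SQL*Plus to SQLDeveloper, a few quick queries give the same picture:

  SQL> CONNECT etl
  SQL> SELECT * FROM dwh_mef_prj_cft;              -- project configuration
  SQL> SELECT * FROM dwh_mef_job_cft;              -- configured jobs
  SQL> SELECT COUNT(*) FROM dwh_com_cdi_day_dit;   -- preloaded Date Dimension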

• Install the staging environment through the script mef_sta_install.sql

This installation will create the following structures. A table that contains the configuration information of the data file (DWH_MEF_IO_CFT) and its composition (DWH_MEF_STA_CFT), obtained through an external table (DWH_MEF_STA_FXT) that "sees" the configuration file mic.csv (under the folder \dwh\cft). A configuration table of standard controls (DWH_MEF_CHK_CFT) and a log table in which to insert the results of the controls (DWH_MEF_CHK_LOT). A configuration table of the loading days (DWH_MEF_IODAY_CFT). The process will produce a detailed log table (DWH_MEF_STA_LOT), plus information relating to the loading days (DWH_MEF_YIO_LOT) through a yearly view. Also important are some metadata tables relating to the domains (DWH_MEF_DOM_CFT), loaded by means of text files viewed by an external table (DWH_MEF_DOM_FXT).

Setup (Preparation)

We can also set additional notes (DWH_MEF_NOTE_CFT). As an example of a domain, the PAEISO2.csv file under the folder \dwh\dom contains the codes and descriptions of all the nations.

• Run the configuration script for the data file, which we will code as MIC. The download of the data file and the information about its structure have already been set up.

The moment you decide to load a data file, you first need to associate a unique name (io_code) to it. For this example I used MIC. Then you have to create a file that provides some general details about the data file. Its name must be io_<io_code>.txt and it must be placed under the folder \dwh\cft. We can see the settings in io_mic.txt. The settings must be in the format <name of the DWH_MEF_IO_CFT table column>: <value>

IO_COD: MIC (data file code)
IO_DEB: ISO 10383 - Market Identifier Codes (description of the data file)
TYPE_COD: FIN (type of data file entry. FIN = Input File)
SEC_COD: ISO (logical section that represents the feeding system)
FRQ_COD: D (frequency rate. D = Daily)
FILE_LIKE_TXT: ISO10383_MIC%.csv (generic name of the data file: it consists of a fixed part and a "%" variable part, if present)
FILE_EXT_TXT: ISO10383_MIC_20160128.csv (exact name of the test data file)
HOST_NC: ., (configuration of the decimal and thousands separators in the data file)

Setup (Preparation)

HEAD_CNT: 1 (number of header rows)
FOO_CNT: 0 (number of footer rows)
SEP_TXT: ; (separator, if the file is csv)
SEP_CNT: 12 (number of expected separators, not used for the moment)
START_NUM: 14 (if the data reference date is in the file name, the position of its first digit; ISO10383_MIC_ is 13 characters)
SIZE_NUM: 8 (length of the reference date)
MASK_TXT: YYYYMMDD (date format)
TRUNC_COD: 1 (indicates whether to truncate the Staging Area table before loading a new data file)

Using the naming convention, it was possible to create templates (see the files model_sta.sql, model_sta_apk.sql, model_sta_ppk.sql) that allow, using the mef_sta_build procedure (see the sketch after this list), the automatic creation, based on the configuration, of all the physical structures and all the modules needed to load the data file. After the execution of this configuration script, if you access the sql directory, you will find:
• the creation script of the structures (build_sta_mic.sql)
• the data file load script (dwh_sta_iso_mic.sql)

Setup (Preparation)

• a preprocessing script for the data file (temp_dwh_sta_iso_mic_apk.sql), in case it is necessary to make some calculations before the actual loading. In this case you must change this package and rename the file by removing the "temp_" prefix from the name.

• a post-processing script (temp_dwh_sta_iso_mic_ppk.sql), if it is necessary (much more likely) to make some calculations after the actual loading. In this case you must change this package and rename the file by removing the "temp_" prefix from the name.
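As a sketch of the build step itself, the configuration script ends up invoking the mef_sta_build procedure; assuming it accepts the io code as its parameter (the real signature may differ), the call would be something like:

  SQL> EXEC mef_sta_build('MIC')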

Let's now look at the MEF tables after this configuration.

The data file configuration table (DWH_MEF_IO_CFT):

Setup (Preparation)

The configuration table (DWH_MEF_STA_CFT) of the structure of the data file:

The data file structure has been filled in on the basis of the information in the site's FAQ. Let's see some more details on the meaning of the main fields that have been set.

HCOL_NUM: It is just a sequential number.
HOST_COLUMN_COD: Name of the field in the source database, possibly with the name of the table, as <tab>.<col>, if we know it. Of course this information is not always available, but it can be very useful in case of data problems.

Setup (Preparation)

FILE_COLUMN_COD: Field name as specified in the documentation of the data file.
COLUMN_COD: Name of the field in all the staging area (and Data Warehouse) tables.
HOST_TYPE_TXT: Type of the data in the source system. It is not used operationally, but it is a useful metadata.
HOST_LENGTH_NUM: Size of the field.
HOST_FORM_TXT: Format of the field, in the case of date fields.
STA_OFF_NUM: Offset with respect to HOST_LENGTH_NUM, to be used to size the column in the staging area table.
CHK_FLG: Field on which to activate the control path. It should not be a field that undergoes transformation rules.
RULE_TXT: Simple transformation rule to apply to the field.
HOST_COLUMN_DEB: Field name according to the host system.
BI_COLUMN_DEB: Name of the field to use in the Business Intelligence system. If set, it will be placed as the column comment in the database.
DOM_COD: Domain code associated with this field.
UI_FLG: Flag indicating whether the field is part of the unique key of the staging table.

Setup (Preparation)

I would like to focus on four fields: HOST_COLUMN_COD, FILE_COLUMN_COD, COLUMN_COD and BI_COLUMN_DEB. The correct evaluation of these four fields represents, for me, the true concept of a simple and immediate "data lineage". That is, starting from a piece of information present in a Business Intelligence report, we can know which source data file it comes from, its column name in the Data Warehouse, what it is called in the data file and its name in the source database. Having all this information in a single place speeds up troubleshooting.

Note: the two columns STATUS DATE and CREATION DATE have been defined as text fields (*_TXT) and not as date fields (*_YMD). Although they seem to contain dates in the format "MONTH YYYY", some lines actually contain values like "BEFORE JUNE 2005". This, of course, would produce an error at date conversion time.
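A quick test shows the problem (illustrative; the exact ORA error may vary with your NLS settings):

  SQL> SELECT TO_DATE('JUNE 2005', 'MONTH YYYY') FROM dual;
  -- works (with an English date language)
  SQL> SELECT TO_DATE('BEFORE JUNE 2005', 'MONTH YYYY') FROM dual;
  -- fails with something like ORA-01843: not a valid month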

Setup (Preparation)

The execution plan of the job (DWH_MEF_UNIT_CFT):

Execution (Cooking)

• Run the job to load the data file

The loading job of a staging area table is executed by running the procedure mef_job.p_run and passing the job code (sta_iso_mic) as a parameter. The MEF execution module will run the procedures of the execution plan contained in the table DWH_MEF_UNIT_CFT that we have seen before.
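From SQL*Plus, connected as the etl user, the call is along these lines:

  SQL> EXEC mef_job.p_run('sta_iso_mic')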

Elaboration logs (Tasting)

• Verify that the Staging Area table is loaded. We can use, for example, SQLDeveloper

Now we can see that the staging table (DWH_STA_ISO_MIC_STT) is loaded. The table belongs to the Data Warehouse project (DWH), is located in the Staging Area (STA), in the data section received from the external system Iso10383 (ISO), and is a staging table (STT).
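A couple of quick queries are enough for the check:

  SQL> SELECT COUNT(*) FROM dwh_sta_iso_mic_stt;              -- rows loaded
  SQL> SELECT * FROM dwh_sta_iso_mic_stt WHERE ROWNUM <= 10;  -- a sample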

• Check that the loading ended successfully and that you have also received the result via email.

We can look at the *_LOT tables to verify the status of the loading. In future articles we will analyze in detail all the features of the Micro ETL Foundation. We will show practical examples and explain how to properly configure data files of other types.

Elaboration logs (Tasting)

• Check that the loading ended successfully and that you have also received the result via email.

A quick view of the log tables:
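For example (these *_LOT tables were described in the setup steps):

  SQL> SELECT * FROM dwh_mef_job_lot;    -- one row per job execution
  SQL> SELECT * FROM dwh_mef_unit_lot;   -- one row per executed module
  SQL> SELECT * FROM dwh_mef_msg_lot;    -- detailed step-by-step messages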

Elaboration logs (Tasting)

If you want to try another data file, you must put it under the …\rcv folder.

References

[01] Recipe 1 - Load a Data Source File (with header, footer and fixed length columns) into a Staging Area table with a click - http://www.slideshare.net/jackbim/recipes-of-data-warehouse-1-load-staging-area
[02] Recipe 2 - Load a Data Source File (.csv with header, rows counter in a separate file) into a Staging Area table with a click - http://www.slideshare.net/jackbim/recipes-2-of-data-warehouse-load-staging-area
[03] Recipe 3 - How to check the staging area loading - http://www.slideshare.net/jackbim/recipes-3-of-data-warehouse-how-to-check-the-staging-area-loading
[04] Recipe 4 - Staging area - how to verify the reference day - http://www.slideshare.net/jackbim/recipes-4-of-data-warehouse-staging-area-how-to-verify-the-reference-day
[05] Recipe 5 - The null values management in the ETL process - http://www.slideshare.net/jackbim/recipes-5-of-data-warehouse-the-null-values-management-in-the-etl-process
[06] Recipe 6 - Naming convention techniques - http://www.slideshare.net/jackbim/recipes-6-of-data-warehouse-naming-convention-techniques
[07] Recipe 7 - A messaging system for Oracle Data Warehouse (part 2) - http://www.slideshare.net/jackbim/recipe-7-of-data-warehouse-a-messaging-system-for-oracle-dwh-2
[08] Recipe 7 - A messaging system for Oracle Data Warehouse (part 1) - http://www.slideshare.net/jackbim/recipe-7-of-data-warehouse-a-messaging-system-for-oracle-dwh-1
[09] Recipe 8 - Naming convention techniques (part 2) - http://www.slideshare.net/jackbim/recipes-8-the-naming-convention-part-2
[10] Recipe 9 - Techniques to control the processing units in the ETL process - http://www.slideshare.net/jackbim/recipe-9-techniques-to-control-the-processing-units-in-the-etl-process
[11] Recipe 10 - The descriptions management - http://www.slideshare.net/jackbim/recipe-10-the-descriptions-management
[12] Recipe 11 - How to think agile - http://www.slideshare.net/jackbim/recipe-11-agile-data-warehouse-and-business-intelligence
[13] Recipe 12 - How to identify and control the reference day of a data file - http://www.slideshare.net/jackbim/recipes-12-how-to-identify-and-control-the-reference-day-of-a-data-file

[14] Massimo Cenci's blog - http://massimocenci.blogspot.it/

Email: [email protected]