isqs 3358, business intelligence extraction, transformation, and loading zhangxi lin texas tech...

31
ISQS 3358, Business Intelligence ISQS 3358, Business Intelligence Extraction, Transformation, Extraction, Transformation, and Loading and Loading Zhangxi Lin Texas Tech University 1

Upload: diana-sutton

Post on 24-Dec-2015

218 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

ISQS 3358, Business IntelligenceISQS 3358, Business Intelligence

Extraction, Transformation, and Extraction, Transformation, and LoadingLoadingZhangxi Lin

Texas Tech University

1

Page 2: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

OutlineOutlineIntegration ServicesLearn by doingPackage development tools

2

Page 3: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

ETL TopicsETL Topics Dimension Processing

◦ Extract changed row from the operational database◦ Handling slowly changing dimensions◦ De-duplication and fuzzy transforms

Fact Processing◦ Extract fact data from the operational database◦ Extract fact updates and deletes◦ Cleaning fact data◦ Checking data quality and halting package execution◦ Transform fact data◦ Surrogate key pipeline◦ Loading fact data◦ Analysis services processing

Integrating all tasks

3

Page 4: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Automating your routine Automating your routine information processing tasksinformation processing tasks Your routine information processing tasks

◦ Read online news at 8:00a and collect a few most important pieces

◦ Retrieve data from database to draft a short daily report at 10a◦ View and reply emails and take some notes that are saved in a

database◦ View 10 companies’ webpage to see the updates. Input the

summaries into a database◦ Browse three popular magazines twice a week. Input the

summaries into a database◦ Generate a few one-way frequency and two-way frequency

tables and put them on the web◦ Merge datasets collected by other people into a main database.◦ Prepare a weekly report using the database and at 4p every

Monday, and publish it to the internal portal site.◦ Prepare a monthly report at 11a on the first day of a month,

which must be converted into a pdf file and uploaded to the website.

Seems there are many things are on going. How to handle them properly in the right time? ◦ Organizer – yes◦ How about regular data processing tasks?

4

Page 5: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

SQL Server Integration Services SQL Server Integration Services (SSIS)(SSIS)The data in data warehouses and data marts

is usually updated frequently, and the data loads are typically very large.

Integration Services includes a task that bulk loads data directly from a flat file into SQL Server tables and views, and a destination component that bulk loads data into a SQL Server database as the last step in a data transformation process.

An SSIS package can be configured to be restartable. This means you can rerun the package from a predetermined checkpoint, either a task or container in the package. The ability to restart a package can save a lot of time, especially if the package processes data from a large number of sources.

5

Page 6: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

What can you do with What can you do with SSIS?SSIS? To load the dimension and fact tables in the database. If

the source data for a dimension table is stored in multiple data sources, the package can merge the data into one dataset and load the dimension table in a single process, instead of using a separate process for each data source.

To update data in data warehouses and data marts. The Slowly Changing Dimension Wizard automates support for slowly changing dimensions by dynamically creating the SQL statements that insert and update records, update related records, and add new columns to tables.

To process Analysis Services cubes and dimensions. When the package updates tables in the database that a cube is built on, you can use Integration Services tasks and transformations to automatically process the cube and to process dimensions as well.

To compute functions before the data is loaded into its destination. If your data warehouses and data marts store aggregated information, the SSIS package can compute functions such as SUM, AVERAGE, and COUNT. An SSIS transformation can also pivot relational data and transform it into a less-normalized format that is more compatible with the table structure in the data warehouse.

6

Page 7: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

SSIS Architecture

7

Page 8: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Control Flow Control Flow Bulk Insert task: Perform a fast load of data fro flat

files into a target table. Good for loading clean data.

Execute SQL task: Perform database operations, creating views, tables, or even databases. Query data or metadata

File Transfer Protocol and File System tasks: transfer files or sets of files.

Execute Package, Execute DTS2000 Package, and Execute Process tasks: Break a complex workflow into smaller ones, and define a parent or master package to execute them.

Send Mail task: sends an email message.

8

Page 9: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Control Flow (cont’d)Control Flow (cont’d)Script and ActiveX Script tasks: Perform an

endless array of operations that are beyond the scope of the standard tasks.

Data Mining and Analysis Service Processing tasks: Launch processing on SSAS dimensions and databases. Use SSAS DDL task to create new Analysis Services partitions, or perform any data definition language operation.

XML and Web Services tasksMessage Queue, WMI Data Reader, and WMI

Event Watcher tasks: Help to build an automatic ELT system.

ForEach Loop, For Loop, and Sequence containers: Execute a set of tasks multiple times

Data Flow tasks9

Page 10: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Data Flow TaskData Flow TaskData Flow task is a pipeline in which

data is picked up, processed and written to a destination.

Avoids I/O, which provided excellent performance

Concepts◦Data sources◦Data destinations◦Data transformations◦Error flows

10

Page 11: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Some Data transformation Some Data transformation StepsStepsSort and Aggregate transformsConditional Split and Multicast transformsUnion All, Merge Join, and Lookup transformsSlowly Changing Dimension transformOLE DB Command transformRow Count and Audit transformsPivot and Unpivot transformsData mining Model Training and data Mining

Query transformsTerm extraction and Term Lookup

transformsFile Extractor and File Injector transforms

11

Page 12: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Dynamic PackagingDynamic PackagingModifying the actions that a package takes

when it’s executing.SSIS implements a rich expression language

that is used in control flow and also in data flow transform.

Concepts◦ Expressions. Uses an expression language,

simple.◦ Variables. Can be defined within a package.

Can be scoped to any object: package-wide, within a container, a single task, etc.

◦ Configurations. Can overwrite most of the settings for SSIS objects by supplying a configuration file at runtime.

12

Page 13: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Using SSIS Import and Export Using SSIS Import and Export WizardWizardImporting or exporting a flat file into

SQL Server 2005◦ Choose Start | Run. In the Run dialog

box’s Open field, type DTSWizard; click OK button to open SQL Server Import and Export Wizard’s Welcome page.

No need to have SQL Server to run the SSIS Import and Export Wizard.

13

Page 14: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

SQL Server Integration SQL Server Integration ServicesServicesThe hierarchy of SSIS

◦ Project -> Package -> Control flow -> Data flowPackage structure

◦ Control flow◦ Data flow◦ Event handler◦ Package explorer◦ Connection tray

Features◦ Event driven◦ Layered◦ Drag-and-drop programming◦ Data I/O definitions are done using Connection

Managers

14

Page 15: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Three Layer Structure of Three Layer Structure of Integration ServicesIntegration Services

15

Control Flow

Data Flow

Event Handler

Page 16: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Snowflake Schema of the Data Snowflake Schema of the Data MartMart

16

ManufacturingFact

DimProduct

DimProductSubType

DimProductType

DimBatch

DimMachine

DimMachineType

DimMaterial

DimPlant

DimCountry

1

2

3

6

7

8

4

5

910

Aggregate SQL Coding

Page 17: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Exercise 4: Populate Maximum Exercise 4: Populate Maximum Miniatures Manufacturing Data Mart Miniatures Manufacturing Data Mart DimensionsDimensions Preparation: Data sources and destination definition Loading dimensions

◦ ProductType◦ ProductSubType◦ Product◦ Country◦ Plant (using SQL Command)◦ Material (using SQL Command, Aggregate item) ◦ MachineType (copied from the Material loading task)◦ Machine (copied from the MachineType loading task)

Note: DimBatch and the fact table will be loaded in the next exercise.

Debugging◦ Step by step◦ Understand the error messages◦ Watch database loading status

See more detailed Guidelines of this exercise

17

Page 18: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Codes for data flows Codes for data flows The following codes are used to selectively retrieve data

from the source for the destination database

Code for DimPlant loading

SELECT LocationCode, LocationName, CountryCodeFrom LocationsWHERE LocationType = 'Plant Site'

Code for DimMaterial loading

SELECT AssetCode, AssetName, AssetClass, LocationCode,Manufacturer, DateOfPurchase, RawMaterialFROM CapitalAssetsWHERE AssetType = 'Molding Machine'

18

Page 19: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Package ItemsPackage Items Data flow Task – main task Control Flow Items

◦ For Loop Container, Foreach Loop Container, Sequence Container Data Preparation Tasks

◦ File System Task, FTP Task, Web Service Task, XML Task Work Flow Tasks

◦ Execute Package Task, Execute DTS 2000 Package Task, Execute Process Task, Message Queue Task, Send Mail Task, WMI Data Reader Task, WMI Event Watcher Task

SQL Server Tasks◦ Bulk Insert Task, Execute SQL task

Scripting Tasks◦ ActiveX Script Task, Script Task

Analysis Services Tasks◦ Analysis Services Processing Task, Analysis Services Execute DDL Task, Data Mining

Query Task

Transfer Tasks◦ Transfer Database Task, Transfer Error Messages Task, Transfer Logins Task Transfer

Objects Task, Transfer Stored Procedures Task Maintenance Tasks Custom Tasks

19

Page 20: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Connection managersConnection managers Excel Connection Manger File Connection Manger Flat File Connection Manager FTP Connection Manager HTTP Connection Manager ODBC Connection Manager OLE DB Connection Manager ADO Connection Manager – for

legacy applications using earlier versions of programming languages, such as VB 6.0

ADO.NET Connection Manager – Access to Microsoft SQL Server and data sources exposed through OLE DB and XML by using a .NET provider

Microsoft .NET Data Provider for mySQL Business Suite – access to SAP server and enables to execute RFC/NAPI commands and select queries against SAP tables

Design-time data source objects can be created in SSIS, SSAS and SSRS projects

20

Page 21: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Container ManagersContainer ManagersForeach Loop ContainerFor Loop ContainerSequence Container

21

Page 22: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Data Flow ComponentsData Flow ComponentsData flow sourcesData flow destinationsData transformations

22

Page 23: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Different Types of ETL Control Different Types of ETL Control FlowsFlowsWith data flows, e.g.

◦ Import data◦ Database updates◦ Loading SCD ◦ Database cleansing◦ Aggregating data

Without data flows, e.g.◦ Downloading zipped files◦ Archiving downloaded files◦ Reading application log◦ Mailing opportunities ◦ Consolidating workflow package

23

Page 24: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Data Flow for Updating Data Flow for Updating DatabaseDatabase

24

Page 25: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Data Flow for Loading Slowly Data Flow for Loading Slowly Changing DimensionChanging Dimension

25

Page 26: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Control Flow for Importing Control Flow for Importing Expanded FilesExpanded Files

26

Page 27: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

De-DuplicationDe-DuplicationTwo common situations: person, and

organizationSSIS provides two general-purpose

transforms helping address data quality and de-duplication◦Fuzzy Lookup◦Fuzzy Grouping

27

Page 28: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

ETL System DebuggingETL System DebuggingMost frequently encountered errors

◦Data format error: The database table’s data type does not match the input data’s format Reason 1: Flat Text file uses varchar(50), or

string[DT_STR] format; Excel file uses nvarchar format

Reason 2: You defined the database using different formats, which could be caused by the imported data set.

Solution: A Data Conversion data transformation node can be used for changing the format

◦SQL Server system error: Even though you did things correctly you could not get through. Solution: the easiest way to solve this problem is

to redo the ETL flow.28

Page 29: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

ETL How-to ProblemsETL How-to ProblemsHow to use Merge function of Data

Transformation to join datasets from two tables into one.

How to split a dataset to two tables How to remove duplicated rows in a table. How to detect the changes of the rows in the

data sources and extract the updated rows into a table in the data warehouse.

How to load multiple datasets with similar structure into a table

Reference: SQL Server 2005 Integration Services, McGraw Hill Osborne, 2007

29

Page 30: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Exploring Features of SQL Server Exploring Features of SQL Server ETL SystemETL SystemData Source and Data

destination◦Flat File

Data flow transformation◦Aggregate◦Derived Column◦Data Conversion◦Sort

30

Page 31: ISQS 3358, Business Intelligence Extraction, Transformation, and Loading Zhangxi Lin Texas Tech University 1

Exercise 5: Exploring Features of Exercise 5: Exploring Features of SQL Server ETL SystemSQL Server ETL SystemData Set:

◦ Source: Commrex_2008, D5.txt (in the shared directory under \OtherDatasets)

◦ Destination: Flat file, Excel file, OLE DB fileData flow transformation

◦ Aggregate (Use D5.txt, and aggregate the data with regard to UserID)

◦ Derived Column (Use Commrex_2008, and create a new column “NewID”)

◦ Data Conversion (Use Commrex_2008, and convert data type of some columns, such as UserID, Prop_ID)

◦ Sort (use D5.txt, sort ascending with ID, Date, Time)

31