ISQS 3358, Business Intelligence
Extraction, Transformation, and Loading
Zhangxi Lin
Texas Tech University
Outline
Integration Services
Learn by doing
Package development tools
ETL Topics
Dimension Processing
◦ Extract changed rows from the operational database
◦ Handling slowly changing dimensions
◦ De-duplication and fuzzy transforms
Fact Processing
◦ Extract fact data from the operational database
◦ Extract fact updates and deletes
◦ Cleaning fact data
◦ Checking data quality and halting package execution
◦ Transform fact data
◦ Surrogate key pipeline
◦ Loading fact data
◦ Analysis Services processing
Integrating all tasks
Automating your routine information processing tasks
Your routine information processing tasks
◦ Read online news at 8:00 a.m. and collect a few of the most important pieces
◦ Retrieve data from a database to draft a short daily report at 10 a.m.
◦ View and reply to emails and take some notes that are saved in a database
◦ View 10 companies’ webpages to see the updates. Input the summaries into a database
◦ Browse three popular magazines twice a week. Input the summaries into a database
◦ Generate a few one-way and two-way frequency tables and put them on the web
◦ Merge datasets collected by other people into a main database
◦ Prepare a weekly report using the database at 4 p.m. every Monday, and publish it to the internal portal site
◦ Prepare a monthly report at 11 a.m. on the first day of each month, which must be converted into a PDF file and uploaded to the website
It seems that many things are going on. How can they be handled properly at the right time?
◦ Organizer – yes
◦ How about regular data processing tasks?
SQL Server Integration Services (SSIS)
The data in data warehouses and data marts is usually updated frequently, and the data loads are typically very large.
Integration Services includes a task that bulk loads data directly from a flat file into SQL Server tables and views, and a destination component that bulk loads data into a SQL Server database as the last step in a data transformation process.
An SSIS package can be configured to be restartable. This means you can rerun the package from a predetermined checkpoint, either a task or container in the package. The ability to restart a package can save a lot of time, especially if the package processes data from a large number of sources.
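The restart idea can be illustrated outside SSIS. The following is a minimal Python sketch of rerunning a workflow from a checkpoint; it is not how SSIS implements or stores its checkpoints. The `run_package` helper, the task names, and the JSON checkpoint file are all hypothetical and exist only for this illustration.

```python
import json
import os
import tempfile

def run_package(tasks, checkpoint_path):
    """Run (name, callable) tasks in order, skipping tasks recorded as done.

    Hypothetical helper: it only mimics the idea of restarting from a
    checkpoint; it is not SSIS's own checkpoint format or API.
    """
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    executed = []
    for name, task in tasks:
        if name in done:
            continue  # completed in an earlier run, so skip it
        task()  # an exception here leaves the checkpoint file in place
        done.append(name)
        executed.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
    os.remove(checkpoint_path)  # clean finish: the next run starts from scratch
    return executed

# Demo: the second task fails on the first run and succeeds on the rerun.
attempts = {"load": 0}

def extract():
    pass

def load():
    attempts["load"] += 1
    if attempts["load"] == 1:
        raise RuntimeError("simulated load failure")

checkpoint = os.path.join(tempfile.mkdtemp(), "package.chk")
tasks = [("extract", extract), ("load", load)]

try:
    run_package(tasks, checkpoint)
except RuntimeError:
    pass  # first run stops at "load"; "extract" is already checkpointed

rerun = run_package(tasks, checkpoint)
print(rerun)
```

On the rerun only the failed task is executed again, which is where the time saving comes from when the earlier steps are expensive.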
What can you do with SSIS?
To load the dimension and fact tables in the database. If the source data for a dimension table is stored in multiple data sources, the package can merge the data into one dataset and load the dimension table in a single process, instead of using a separate process for each data source.
To update data in data warehouses and data marts. The Slowly Changing Dimension Wizard automates support for slowly changing dimensions by dynamically creating the SQL statements that insert and update records, update related records, and add new columns to tables.
To process Analysis Services cubes and dimensions. When the package updates tables in the database that a cube is built on, you can use Integration Services tasks and transformations to automatically process the cube and to process dimensions as well.
To compute functions before the data is loaded into its destination. If your data warehouses and data marts store aggregated information, the SSIS package can compute functions such as SUM, AVERAGE, and COUNT. An SSIS transformation can also pivot relational data and transform it into a less-normalized format that is more compatible with the table structure in the data warehouse.
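To make the last point concrete, here is a small sketch of aggregating and pivoting data before it is loaded. It uses Python's built-in sqlite3 module rather than SSIS, and the `Sales` table with its columns and rows is invented for the example. Note that the SQL function for an average is spelled `AVG`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical source table and rows, used only for illustration.
conn.execute("CREATE TABLE Sales (Region TEXT, Quarter TEXT, Amount REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [("East", "Q1", 100.0), ("East", "Q2", 300.0),
     ("West", "Q1", 50.0), ("West", "Q2", 150.0), ("West", "Q2", 100.0)],
)

# Aggregate before loading: one summary row per region.
summary = conn.execute(
    "SELECT Region, SUM(Amount), AVG(Amount), COUNT(*) "
    "FROM Sales GROUP BY Region ORDER BY Region"
).fetchall()

# Pivot: quarters become columns, giving a less-normalized shape.
pivot = conn.execute(
    "SELECT Region, "
    "SUM(CASE WHEN Quarter = 'Q1' THEN Amount ELSE 0 END) AS Q1, "
    "SUM(CASE WHEN Quarter = 'Q2' THEN Amount ELSE 0 END) AS Q2 "
    "FROM Sales GROUP BY Region ORDER BY Region"
).fetchall()

print(summary)
print(pivot)
```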
SSIS Architecture
Control Flow
Bulk Insert task: Performs a fast load of data from flat files into a target table. Good for loading clean data.
Execute SQL task: Performs database operations such as creating views, tables, or even databases. Can also query data or metadata.
File Transfer Protocol and File System tasks: Transfer files or sets of files.
Execute Package, Execute DTS2000 Package, and Execute Process tasks: Break a complex workflow into smaller ones, and define a parent or master package to execute them.
Send Mail task: Sends an email message.
Control Flow (cont’d)
Script and ActiveX Script tasks: Perform an endless array of operations that are beyond the scope of the standard tasks.
Data Mining and Analysis Services Processing tasks: Launch processing on SSAS dimensions and databases. Use the SSAS Execute DDL task to create new Analysis Services partitions, or to perform any data definition language operation.
XML and Web Services tasks
Message Queue, WMI Data Reader, and WMI Event Watcher tasks: Help to build an automated ETL system.
ForEach Loop, For Loop, and Sequence containers: Execute a set of tasks multiple times
Data Flow tasks
Data Flow Task
The Data Flow task is a pipeline in which data is picked up, processed, and written to a destination.
Avoids intermediate I/O, which provides excellent performance
Concepts
◦ Data sources
◦ Data destinations
◦ Data transformations
◦ Error flows
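The four concepts can be pictured as a chain of stages through which rows stream without being staged in between. Below is a minimal Python sketch using generators. It is an analogy, not SSIS code, and the function names and sample rows are assumptions made for the illustration.

```python
def source(rows):
    """Data source: yield raw rows one at a time."""
    for row in rows:
        yield row

def transform(rows, errors):
    """Data transformation: convert amount to float; redirect bad rows to the error flow."""
    for row in rows:
        try:
            yield {"id": row["id"], "amount": float(row["amount"])}
        except (KeyError, ValueError) as exc:
            errors.append({"row": row, "reason": str(exc)})

def destination(rows):
    """Data destination: collect the rows (a stand-in for a table insert)."""
    return list(rows)

# Hypothetical input rows.
raw = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": "n/a"},  # cannot be converted, goes to the error flow
    {"id": 3, "amount": "7"},
]
error_rows = []
loaded = destination(transform(source(raw), error_rows))

print(loaded)
print(len(error_rows))
```

Because each stage pulls rows from the previous one, a row moves from source to destination in memory, and a failing row is diverted to the error list instead of stopping the whole load.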
Some Data Transformation Steps
Sort and Aggregate transforms
Conditional Split and Multicast transforms
Union All, Merge Join, and Lookup transforms
Slowly Changing Dimension transform
OLE DB Command transform
Row Count and Audit transforms
Pivot and Unpivot transforms
Data Mining Model Training and Data Mining Query transforms
Term Extraction and Term Lookup transforms
File Extractor and File Injector transforms
Dynamic Packaging
Modifying the actions that a package takes when it’s executing.
SSIS implements a rich expression language that is used in control flow and also in data flow transforms.
Concepts
◦ Expressions. Written in a simple expression language.
◦ Variables. Can be defined within a package. Can be scoped to any object: package-wide, within a container, a single task, etc.
◦ Configurations. Can overwrite most of the settings for SSIS objects by supplying a configuration file at runtime.
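The interplay of variables, configurations, and expressions can be sketched in a few lines of Python. This is only an analogy: SSIS has its own configuration formats and expression syntax, and the JSON text, setting names, and folder path below are assumptions made for the example.

```python
import json

# Design-time defaults, playing the role of package variables.
package_defaults = {
    "SourceFolder": "C:/etl/incoming",  # hypothetical path
    "BatchSize": 1000,
    "TargetTable": "DimProduct",
}

def apply_configuration(defaults, config_text):
    """Return settings in which runtime configuration values override the defaults.

    Unknown keys are rejected so a typo in the configuration is not silently ignored.
    """
    overrides = json.loads(config_text)
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    return {**defaults, **overrides}

# A configuration supplied at run time (inline here; normally read from a file).
runtime_config = '{"BatchSize": 5000, "TargetTable": "DimProduct_Test"}'
settings = apply_configuration(package_defaults, runtime_config)

# An "expression" evaluated from the variables at run time.
file_pattern = f"{settings['SourceFolder']}/{settings['TargetTable']}_*.txt"

print(settings["BatchSize"], file_pattern)
```

The defaults stay untouched, so the same package definition can run against test and production targets by swapping only the configuration.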
Using SSIS Import and Export Wizard
Importing or exporting a flat file into SQL Server 2005
◦ Choose Start | Run. In the Run dialog box’s Open field, type DTSWizard; click the OK button to open the SQL Server Import and Export Wizard’s Welcome page.
No need to have SQL Server to run the SSIS Import and Export Wizard.
SQL Server Integration Services
The hierarchy of SSIS
◦ Project -> Package -> Control flow -> Data flow
Package structure
◦ Control flow
◦ Data flow
◦ Event handler
◦ Package explorer
◦ Connection tray
Features
◦ Event driven
◦ Layered
◦ Drag-and-drop programming
◦ Data I/O definitions are done using Connection Managers
Three Layer Structure of Integration Services
Control Flow
Data Flow
Event Handler
Snowflake Schema of the Data Mart
[Figure: snowflake schema diagram. Fact table: ManufacturingFact. Dimension tables: DimProduct, DimProductSubType, DimProductType, DimBatch, DimMachine, DimMachineType, DimMaterial, DimPlant, DimCountry. Diagram labels: the numbers 1 through 10, "Aggregate", and "SQL Coding".]
Exercise 4: Populate Maximum Miniatures Manufacturing Data Mart Dimensions
Preparation: Data sources and destination definition
Loading dimensions
◦ ProductType
◦ ProductSubType
◦ Product
◦ Country
◦ Plant (using SQL Command)
◦ Material (using SQL Command, Aggregate item)
◦ MachineType (copied from the Material loading task)
◦ Machine (copied from the MachineType loading task)
Note: DimBatch and the fact table will be loaded in the next exercise.
Debugging
◦ Step by step
◦ Understand the error messages
◦ Watch database loading status
See the more detailed guidelines for this exercise
Codes for data flows
The following code is used to selectively retrieve data from the source for the destination database.
Code for DimPlant loading
SELECT LocationCode, LocationName, CountryCode
FROM Locations
WHERE LocationType = 'Plant Site'
Code for DimMaterial loading
SELECT AssetCode, AssetName, AssetClass, LocationCode, Manufacturer, DateOfPurchase, RawMaterial
FROM CapitalAssets
WHERE AssetType = 'Molding Machine'
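To see what such a source query does, the DimPlant query can be tried against a tiny in-memory table. The sketch below uses Python's sqlite3 module instead of SQL Server and SSIS. The sample rows and the `DimPlant` column names (`PlantNumber`, `PlantName`) are assumptions for the illustration and are not taken from the exercise database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Locations ("
    "LocationCode INTEGER, LocationName TEXT, CountryCode TEXT, LocationType TEXT)"
)
# Hypothetical sample rows, only to exercise the filter.
conn.executemany(
    "INSERT INTO Locations VALUES (?, ?, ?, ?)",
    [
        (1, "Plant A", "US", "Plant Site"),
        (2, "Warehouse B", "US", "Warehouse"),
        (3, "Plant C", "CA", "Plant Site"),
    ],
)
# Hypothetical destination layout.
conn.execute(
    "CREATE TABLE DimPlant (PlantNumber INTEGER, PlantName TEXT, CountryCode TEXT)"
)

# The DimPlant source query feeds the destination table.
conn.execute(
    "INSERT INTO DimPlant (PlantNumber, PlantName, CountryCode) "
    "SELECT LocationCode, LocationName, CountryCode "
    "FROM Locations "
    "WHERE LocationType = 'Plant Site'"
)
plants = conn.execute("SELECT * FROM DimPlant ORDER BY PlantNumber").fetchall()

print(plants)
```

Only the rows whose LocationType is 'Plant Site' reach the dimension table; the warehouse row is filtered out at the source.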
Package Items
Data Flow Task – main task
Control Flow Items
◦ For Loop Container, Foreach Loop Container, Sequence Container
Data Preparation Tasks
◦ File System Task, FTP Task, Web Service Task, XML Task
Work Flow Tasks
◦ Execute Package Task, Execute DTS 2000 Package Task, Execute Process Task, Message Queue Task, Send Mail Task, WMI Data Reader Task, WMI Event Watcher Task
SQL Server Tasks
◦ Bulk Insert Task, Execute SQL Task
Scripting Tasks
◦ ActiveX Script Task, Script Task
Analysis Services Tasks
◦ Analysis Services Processing Task, Analysis Services Execute DDL Task, Data Mining Query Task
Transfer Tasks
◦ Transfer Database Task, Transfer Error Messages Task, Transfer Logins Task, Transfer Objects Task, Transfer Stored Procedures Task
Maintenance Tasks
Custom Tasks
Connection managers
Excel Connection Manager
File Connection Manager
Flat File Connection Manager
FTP Connection Manager
HTTP Connection Manager
ODBC Connection Manager
OLE DB Connection Manager
ADO Connection Manager – for legacy applications using earlier versions of programming languages, such as VB 6.0
ADO.NET Connection Manager – access to Microsoft SQL Server and data sources exposed through OLE DB and XML by using a .NET provider
Microsoft .NET Data Provider for mySAP Business Suite – access to an SAP server; enables executing RFC/BAPI commands and select queries against SAP tables
Design-time data source objects can be created in SSIS, SSAS and SSRS projects
Container Managers
Foreach Loop Container
For Loop Container
Sequence Container
Data Flow Components
Data flow sources
Data flow destinations
Data transformations
Different Types of ETL Control Flows
With data flows, e.g.
◦ Import data
◦ Database updates
◦ Loading SCD
◦ Database cleansing
◦ Aggregating data
Without data flows, e.g.
◦ Downloading zipped files
◦ Archiving downloaded files
◦ Reading application log
◦ Mailing opportunities
◦ Consolidating workflow package
Data Flow for Updating Database
Data Flow for Loading Slowly Changing Dimension
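As a rough illustration of what a slowly changing dimension load has to do, the sketch below keeps history by expiring the current row and inserting a new one when an attribute changes (commonly called a Type 2 change). It is written with Python's sqlite3 module and is not the output of the SSIS Slowly Changing Dimension Wizard. The `DimCustomer` table, its columns, and the `load_scd2` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DimCustomer ("
    "CustomerKey INTEGER PRIMARY KEY AUTOINCREMENT, "  # surrogate key
    "CustomerID TEXT, City TEXT, StartDate TEXT, EndDate TEXT)"
)

def load_scd2(conn, customer_id, city, load_date):
    """Type 2 load: expire the current row if City changed, then insert a new row."""
    current = conn.execute(
        "SELECT CustomerKey, City FROM DimCustomer "
        "WHERE CustomerID = ? AND EndDate IS NULL",
        (customer_id,),
    ).fetchone()
    if current is not None:
        if current[1] == city:
            return  # nothing changed, keep the current row
        conn.execute(
            "UPDATE DimCustomer SET EndDate = ? WHERE CustomerKey = ?",
            (load_date, current[0]),
        )
    conn.execute(
        "INSERT INTO DimCustomer (CustomerID, City, StartDate, EndDate) "
        "VALUES (?, ?, ?, NULL)",
        (customer_id, city, load_date),
    )

load_scd2(conn, "C1", "Lubbock", "2008-01-01")
load_scd2(conn, "C1", "Lubbock", "2008-02-01")  # unchanged, no new row
load_scd2(conn, "C1", "Austin", "2008-03-01")   # changed, history preserved

history = conn.execute(
    "SELECT City, StartDate, EndDate FROM DimCustomer ORDER BY CustomerKey"
).fetchall()

print(history)
```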
Control Flow for Importing Expanded Files
De-Duplication
Two common situations: person and organization
SSIS provides two general-purpose transforms that help address data quality and de-duplication
◦ Fuzzy Lookup
◦ Fuzzy Grouping
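The idea behind fuzzy grouping can be shown with a short Python sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure and a simple greedy grouping rule. This is not the algorithm used by the SSIS Fuzzy Grouping transform, and the threshold of 0.85 and the sample names are assumptions.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity in [0, 1] between two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def fuzzy_group(names, threshold=0.85):
    """Greedy grouping: each name joins the first group whose representative
    (its first member) is similar enough; otherwise it starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Hypothetical organization names with spelling variants.
names = [
    "Texas Tech University",
    "Texas Tech Univeristy",
    "texas tech university ",
    "Maximum Miniatures",
]
groups = fuzzy_group(names)

print(groups)
```

Each group can then be collapsed to one canonical record, which is the de-duplication step.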
ETL System Debugging
Most frequently encountered errors
◦ Data format error: The database table’s data type does not match the input data’s format
Reason 1: A flat text file uses varchar(50), or string [DT_STR], format; an Excel file uses nvarchar format
Reason 2: You defined the database using different formats, which could be caused by the imported data set.
Solution: A Data Conversion transformation node can be used to change the format
◦ SQL Server system error: Even though you did things correctly, you could not get through.
Solution: The easiest way to solve this problem is to redo the ETL flow.
ETL How-to Problems
How to use the Merge function of Data Transformation to join datasets from two tables into one.
How to split a dataset into two tables.
How to remove duplicated rows in a table.
How to detect the changes of the rows in the data sources and extract the updated rows into a table in the data warehouse.
How to load multiple datasets with similar structure into a table.
Reference: SQL Server 2005 Integration Services, McGraw Hill Osborne, 2007
Exploring Features of SQL Server ETL System
Data Source and Data destination
◦ Flat File
Data flow transformation
◦ Aggregate
◦ Derived Column
◦ Data Conversion
◦ Sort
Exercise 5: Exploring Features of SQL Server ETL System
Data Set:
◦ Source: Commrex_2008, D5.txt (in the shared directory under \OtherDatasets)
◦ Destination: Flat file, Excel file, OLE DB file
Data flow transformation
◦ Aggregate (Use D5.txt, and aggregate the data with regard to UserID)
◦ Derived Column (Use Commrex_2008, and create a new column “NewID”)
◦ Data Conversion (Use Commrex_2008, and convert data type of some columns, such as UserID, Prop_ID)
◦ Sort (Use D5.txt, sort ascending by ID, Date, Time)
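Before building the four transformations in SSIS, it may help to see what each one does to a handful of rows. The following Python sketch works on invented data: the real layouts of D5.txt and Commrex_2008 are not shown in the slides, so the columns, the values, and the formula used for NewID are all assumptions.

```python
from collections import Counter
from operator import itemgetter

# Hypothetical rows; they do not come from D5.txt or Commrex_2008.
rows = [
    {"ID": "2", "UserID": "u7", "Date": "2008-01-03", "Time": "09:15"},
    {"ID": "1", "UserID": "u7", "Date": "2008-01-02", "Time": "14:00"},
    {"ID": "1", "UserID": "u3", "Date": "2008-01-02", "Time": "08:30"},
]

# Aggregate: count rows per UserID.
rows_per_user = Counter(row["UserID"] for row in rows)

# Data Conversion: turn the text ID into an integer.
converted = [{**row, "ID": int(row["ID"])} for row in rows]

# Derived Column: build NewID from existing columns (the formula is an assumption).
derived = [{**row, "NewID": f"{row['UserID']}-{row['ID']}"} for row in converted]

# Sort: ascending by ID, then Date, then Time.
ordered = sorted(derived, key=itemgetter("ID", "Date", "Time"))

print(rows_per_user)
print([row["NewID"] for row in ordered])
```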