DATA MANAGEMENT FOR BUSINESS INTELLIGENCE
Data Access: Files
BI Architecture
Business Intelligence
Two issues
- Where are my files?
  - Local file systems
  - Distributed file systems
  - Network protocols
- Which format is file data in?
  - Text
    - CSV, JSON
Business Intelligence Lab
Local file system
Path of a resource
- Windows: C:\Program Files\Office\sample.doc
- Linux: /usr/home/r/ruggieri/sample.txt
Local file system
- A logical abstraction of persistent mass memory
  - hierarchical view (tree of directories and files)
  - types of resources (file, directory, pipe, link, special)
  - resource attributes (owner, rights, hard links)
  - services (indexing, journaling)
- Sample file systems:
  - Windows: NTFS, FAT32
  - Linux: EXT2, EXT3, JFS, XFS, REISERFS, FAT32
Distributed file system
[Figure labels: PC-you, PC-smithj]
Distributed file system
- Acts as a client for a remote file access protocol
  - logical abstraction of remote persistent mass memory
- Sample file systems:
  - Samba (SMB) or Common Internet File System (CIFS)
  - Network File System (NFS)
  - Hadoop Distributed File System (HDFS)
- Mount/unmount
Network protocols
- Files accessed through explicit request/reply
- A local copy has to be made before accessing data
- Resource naming:
  - Uniform Resource Locator (URL)
    - scheme://user:password@host:port/path
    - http://bob:[email protected]:80/home/idx.html
    - scheme = protocol name (http, https, ftp, file, jdbc, …)
    - port = TCP/IP port number
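The URL structure above can be sketched with Python's standard `urllib.parse.urlsplit`, which splits a URL into scheme, user, password, host, port and path. The URL used here is a hypothetical placeholder chosen for illustration, not one taken from the slides.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only to illustrate scheme://user:password@host:port/path
url = "http://bob:secret@example.org:80/home/idx.html"
parts = urlsplit(url)

print(parts.scheme)    # protocol name
print(parts.username)  # user
print(parts.password)  # password
print(parts.hostname)  # host
print(parts.port)      # TCP/IP port number, as an int
print(parts.path)      # path of the resource on the host
```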
HTTP Protocol
- HyperText Transfer Protocol
  - URL: http://user:[email protected]
  - Stateless connections
  - Encrypted variant: HTTP Secure (HTTPS)
- Windows clients
  - Any browser
  - > wget
    - GNU http://www.gnu.org/software/wget/
    - W3C http://www.w3.org/Library
- Linux clients
  - Any browser
  - > wget
SCP Protocol
- Secure Copy
  - > scp data.zip [email protected]:datacopy.zip
  - File copy from/to a remote account
  - File paths must be known in advance
- Client
  - command line:
    - > scp/pscp
    - > scp2
  - Windows GUI
    - WinSCP http://winscp.sourceforge.net
    - SSH Secure Shell
  - Linux GUI
    - SCP: default
Two issues
- Where are my files?
  - Local file systems
  - Distributed file systems
  - Network protocols
- Which format is file data in?
  - Text
    - CSV, ARFF, JSON
What is a file?
- File = sequence of bytes

67 73 83 65 79 10 10 …
How are bytes mapped to chars?
- Character set = alphabet of characters
- Bytes are decoded into characters by means of a character set
  - ASCII, EBCDIC (1 byte per char)
  - Unicode (1 to 4 bytes per char, depending on the encoding: UTF-8, UTF-16, UTF-32)
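As a small sketch of the byte-to-character mapping, the byte values shown on the "What is a file?" slide can be decoded with the ASCII character set in Python. The second part, which compares encodings of one accented character, is an added illustration and not part of the slides.

```python
# The byte values shown on the "What is a file?" slide
raw = bytes([67, 73, 83, 65, 79, 10, 10])

# Decoding with a character set turns the bytes into characters
text = raw.decode("ascii")
print(repr(text))

# The same character may need a different number of bytes per encoding
accented = "\u00e8"  # the character "e with grave accent"
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(accented.encode(enc)))
```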
American Standard Code for Information Interchange (ASCII)
Text file = file+character set
- Text file = sequence of characters

C I S A O \n \n …
Viewing text files
- By a text editor
  - Emacs, Notepad++, TextPad, GEdit, Vi, etc.
- "Carriage return" (end-of-line) character
  - Starts a new line
  - Coding
    - Unix: 1 char ASCII(0A) ('\n' in Java)
    - Windows: 2 chars ASCII(0D 0A) ("\r\n" in Java)
    - Mac (classic Mac OS): 1 char ASCII(0D) ('\r' in Java)
  - Conversions
    - > dos2unix
    - > unix2dos
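The three end-of-line codings can be handled uniformly in code. The sketch below, on a made-up string, shows that Python's `str.splitlines()` accepts all three conventions, and how a text can be normalized to Unix-style endings.

```python
# Hypothetical content mixing the three line-ending conventions
mixed = "unix\nwindows\r\nclassic mac\rlast"

# str.splitlines() recognizes \n, \r\n and \r as line boundaries
lines = mixed.splitlines()
print(lines)

# Normalize everything to Unix-style endings; \r\n must be handled first
normalized = mixed.replace("\r\n", "\n").replace("\r", "\n")
print(repr(normalized))
```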
Text file = file+character set
- Text file = sequence of lines

[Figure: the characters of the file arranged into lines]
Tabular data format
Mario Bianchi 23 Student
Luigi Rossi 30 Workman
Anna Verdi 50 Teacher
Rosa Neri 20 Student

[Figure labels: Row, Column]
Representing tabular data in text files
- Comma Separated Values (CSV)
  - A row per line
  - Column values in a line separated by a special character
  - Delimiters: comma, tab, space

Mario,Bianchi,23,Student
Luigi,Rossi,30,Workman
Anna,Verdi,50,Teacher
Rosa,Neri,20,Student
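A minimal sketch of reading the CSV layout above with Python's standard `csv` module. The data is the sample table from the slide; reading it from an in-memory string instead of a file is a simplification made for this example.

```python
import csv
import io

# The sample table from the slide: one row per line, comma-delimited
data = """Mario,Bianchi,23,Student
Luigi,Rossi,30,Workman
Anna,Verdi,50,Teacher
Rosa,Neri,20,Student
"""

# csv.reader yields each line as a list of column values (all strings)
rows = list(csv.reader(io.StringIO(data), delimiter=","))
for name, surname, age, occupation in rows:
    print(name, surname, int(age), occupation)
```

Changing `delimiter` to `"\t"` or `" "` covers the other delimiters listed on the slide.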
Representing tabular data in text files
- Fixed Length Values (FLV)
  - A row per line
  - Column values occupy a fixed number of chars
    - Allow for random access to elements
    - Higher disk space requirements

Mario Bianchi 23 Student
Luigi Rossi   30 Workman
Anna  Verdi   50 Teacher
Rosa  Neri    20 Student
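The random-access property of fixed-length values can be sketched as follows. The column widths (6, 8, 3, 7) and the helper name `read_row` are assumptions made for this example; the slides do not state the widths.

```python
# Column widths are assumptions chosen for this sketch: 6 + 8 + 3 + 7 = 24 chars
WIDTHS = [6, 8, 3, 7]
RECORD_LEN = sum(WIDTHS) + 1  # +1 for the "\n" line terminator

records = [("Mario", "Bianchi", "23", "Student"),
           ("Luigi", "Rossi", "30", "Workman"),
           ("Anna", "Verdi", "50", "Teacher"),
           ("Rosa", "Neri", "20", "Student")]

# Pad every value to its column width to build the fixed-length content
content = "".join(
    "".join(v.ljust(w) for v, w in zip(rec, WIDTHS)) + "\n" for rec in records
)

def read_row(text, i):
    """Random access: row i starts at offset i * RECORD_LEN, no scan needed."""
    line = text[i * RECORD_LEN:(i + 1) * RECORD_LEN - 1]
    fields, pos = [], 0
    for w in WIDTHS:
        fields.append(line[pos:pos + w].rstrip())
        pos += w
    return fields

print(read_row(content, 2))
```

The padding is also what causes the higher disk space requirement: every row takes `RECORD_LEN` characters regardless of how short its values are.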
Quoting
- What happens in CSV if a delimiter is part of a value?
  - Format error
- Solution: quoting
  - Special delimiters for start and end of a value (ex. " … ")

Without quoting (space as delimiter, full name as a single value):
Mario Bianchi 23 Student
Luigi Rossi 30 Workman
Anna Verdi 50 Teacher
Rosa Neri 20 Student

With quoting:
"Mario Bianchi" 23 Student
"Luigi Rossi" 30 Workman
"Anna Verdi" 50 Teacher
"Rosa Neri" 20 Student
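A short sketch of why quoting matters, using Python's standard `csv` module on two of the space-delimited rows above. The naive split is shown only for contrast.

```python
import csv
import io

# Space-delimited rows where the first value itself contains a space
data = '"Mario Bianchi" 23 Student\n"Luigi Rossi" 30 Workman\n'

# Without quote handling the name is split into two columns
naive = [line.split(" ") for line in data.splitlines()]

# With quotechar set, the quoted value is kept as a single column
quoted = list(csv.reader(io.StringIO(data), delimiter=" ", quotechar='"'))

print(naive[0])
print(quoted[0])
```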
Missing values
- How to represent missing values in CSV or FLV?
  - A reserved string: "?", "null", ""

"Mario Bianchi" 23 Student
"Luigi Rossi" 30 ?
"Anna Verdi" 50 Teacher
"Rosa Neri" ? Student
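When reading such a file, the reserved strings are usually mapped to a single in-memory marker. A minimal sketch, assuming the three reserved strings named on the slide and using Python's `None` as the marker:

```python
# Reserved strings treated as "missing" in this sketch (an assumption; choose per file)
MISSING = {"?", "null", ""}

rows = [["Mario Bianchi", "23", "Student"],
        ["Luigi Rossi", "30", "?"],
        ["Rosa Neri", "?", "Student"]]

# Replace every reserved string with None so later steps can detect it
cleaned = [[None if v in MISSING else v for v in row] for row in rows]
print(cleaned)
```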
Meta-data
- Describe properties of data
  - Table name, column name, column type, …

name surname age occupation
string string int string
Mario Bianchi 23 Student
Luigi Rossi 30 Workman
Anna Verdi 50 Teacher
Rosa Neri 20 Student
How to represent meta-data in text files?
- One or two rows: names and types

name surname age occupation
string string int string

name,surname,age,occupation
string,string,int,string
Meta-data and data in text files
- In the same file
  - Meta-data first (header), then data

name surname age occupation
string string int string
Mario Bianchi 23 Student
Luigi Rossi 30 Workman
Anna Verdi 50 Teacher
Rosa Neri 20 Student

name,surname,age,occupation
string,string,int,string
Mario,Bianchi,23,Student
Luigi,Rossi,30,Workman
Anna,Verdi,50,Teacher
Rosa,Neri,20,Student
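A file with a two-row header (names, then types) can be read by consuming the meta-data rows first and using them to interpret the data rows. The sketch below assumes the type names `string` and `int` from the slide; the `CONVERTERS` mapping is an illustrative choice, not a standard.

```python
import csv
import io

data = """name,surname,age,occupation
string,string,int,string
Mario,Bianchi,23,Student
Luigi,Rossi,30,Workman
"""

# Mapping from the type names in the header to Python converters (an assumption)
CONVERTERS = {"string": str, "int": int}

reader = csv.reader(io.StringIO(data))
names = next(reader)   # first meta-data row: column names
types = next(reader)   # second meta-data row: column types

# Remaining rows are data; cast each value according to its declared type
table = [
    {n: CONVERTERS[t](v) for n, t, v in zip(names, types, row)}
    for row in reader
]
print(table[0])
```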
Data interchange issue
- Problem: data interchange between applications
  - Proprietary data formats do not allow for easy interchange
    - CSV with different delimiters, or column orders
    - Similar limitations of FLV, ARFF, binary data, etc.
- Solution:
  - definition of an interchange format…
  - … marking data elements with their meaning …
  - … so that any other party can easily interpret them.
DATA MANAGEMENT FOR BUSINESS INTELLIGENCE
Data Access: Relational Databases
Computer Science Department, University of Pisa
BI Architecture
Connecting to a RDBMS
- Connection protocol
  - locate the RDBMS server
  - open a connection
  - user authentication
- Querying
  - SQL query
    - SELECT
    - UPDATE/INSERT/CREATE
  - stored procedures
  - prepared SQL query
- Scan result set
  - scan row by row
  - access result meta-data
[Figure: client/server exchange. The client sends a ConnectionString and the server replies OK; the client sends an SQL query and the server returns a result set.]
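The connect, query and scan steps can be sketched with Python's built-in `sqlite3` module. SQLite is an embedded engine used here only as a stand-in: there is no server to locate and no user authentication, and `":memory:"` plays the role of the connection string. The table `person` and its rows are made up for the example.

```python
import sqlite3

# Open a connection; ":memory:" plays the role of the connection string here
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Plain SQL statements: CREATE and INSERT
cur.execute("CREATE TABLE person (name TEXT, age INTEGER)")
cur.executemany("INSERT INTO person VALUES (?, ?)",
                [("Mario", 23), ("Luigi", 30), ("Anna", 50)])

# Parameterized query, the DB-API counterpart of a prepared query
cur.execute("SELECT name, age FROM person WHERE age > ? ORDER BY age", (25,))

# Result meta-data: column names of the result set
columns = [d[0] for d in cur.description]

# Scan the result set row by row
result = []
for row in cur:
    result.append(row)

conn.close()
print(columns, result)
```

With a client/server RDBMS the same sequence applies through a driver for one of the connection standards, with host, port and credentials supplied in the connection string.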
Connection Standards
- ODBC - Open DataBase Connectivity
  - Windows: odbc; Linux: unixodbc, iodbc
  - Tabular data
- JDBC
  - Java APIs for tabular data
- OLE DB (Microsoft)
  - Tabular data, XML, multi-dimensional data
- ADO (Microsoft)
  - Object-oriented API on top of OLE DB
- ADO.NET
  - Evolution of ADO in the .NET framework
ODBC Open DataBase Connectivity
DATA MANAGEMENT FOR BUSINESS INTELLIGENCE
ETL – Extract, Transform and Load
Computer Science Department, University of Pisa
Master in Big Data Analytics and Social Mining
BI Architecture
Extract, Transform and Load
ETL (extract, transform and load) is the process of extracting data from heterogeneous sources, transforming it, and loading it into a database or data warehouse.
- Typically supported by (visual) tools.
ETL tasks
- Extract: access data sources
  - Local, distributed, file format, connectivity standards
- Transform: data manipulation for quality improvement
  - Selecting data
    - remove unnecessary, duplicated, corrupted, out-of-limits (ex., age=999) rows and columns; sampling; dimensionality reduction
  - Missing data
    - fill with default or average, or filter out
  - Coding and normalizing
    - to resolve format (ex., CSV, ARFF), measurement units (ex., meters vs inches), codes (ex., person id), times and dates, min-max normalization, …
  - Attribute splitting/merging
    - of attributes (ex., address vs street+city+country)
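A few of the transform tasks above can be sketched on a tiny in-memory table. The rows, the accepted age range (0 to 120) and the column layout are assumptions made for this example; an ETL tool would express the same steps as pipeline components.

```python
# Hypothetical input rows: (person_id, age, height_in_inches)
rows = [(1, 23, 70.0), (2, 999, 65.0), (3, None, 72.0), (1, 23, 70.0), (4, 50, 60.0)]

# Selecting data: drop duplicates and out-of-limits ages (ex., age=999)
seen, selected = set(), []
for r in rows:
    if r in seen or (r[1] is not None and not 0 <= r[1] <= 120):
        continue
    seen.add(r)
    selected.append(r)

# Missing data: fill a missing age with the average of the known ages
known = [r[1] for r in selected if r[1] is not None]
avg_age = sum(known) / len(known)
filled = [(i, a if a is not None else avg_age, h) for i, a, h in selected]

# Coding and normalizing: inches -> meters, then min-max normalization of age
lo, hi = min(a for _, a, _ in filled), max(a for _, a, _ in filled)
result = [(i, (a - lo) / (hi - lo), round(h * 0.0254, 3)) for i, a, h in filled]
print(result)
```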
ETL tasks
- Managing surrogate keys & slowly changing dimensions
  - generation and lookup
- Aggregating data
  - At a different granularity. Ex., grain "orders" (id, qty, price) vs grain "customer" (id, no. orders, amount), discretization into bins, …
- Deriving calculated attributes
  - Ex., margin = sales - costs
- Resolving inconsistencies (record linkage)
  - Ex., Dip. Informatica Via Buonarroti 2 is (?) Dip. Informatica Largo B. Pontecorvo 3
- Data merging-purging
  - from two or more sources (ex., sales database, stock database)
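Aggregation to a coarser grain and a derived attribute can be sketched as follows. The order rows, the assumption that amount is the sum of qty * price, and the sales and cost figures are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows at grain "orders": (customer_id, order_id, qty, price)
orders = [(1, "o1", 2, 10.0), (1, "o2", 1, 5.0), (2, "o3", 3, 4.0)]

# Aggregate to grain "customer": (id, no. orders, amount)
agg = defaultdict(lambda: [0, 0.0])
for cust, _order, qty, price in orders:
    agg[cust][0] += 1
    agg[cust][1] += qty * price
customers = [(cust, n, amount) for cust, (n, amount) in sorted(agg.items())]

# Derived attribute: margin = sales - costs (the figures are made up)
sales, costs = 25.0, 18.0
margin = sales - costs
print(customers, margin)
```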
ETL tasks
- Load
  - Data staging area
    - Area containing intermediate, temporary, partially processed data
  - Types of loading:
    - Initial load (of the data warehouse)
    - Incremental load
      - Types of updates: append, destructive merge, constructive merge
    - Full refresh
ETL process for DW
[Figure: control flow of the ETL process for a data warehouse made of a fact table and dimensions Dim1, Dim2, Dim3, Dim4. Tasks shown: Prepare, Update Dim1, Update Dim2, Update Dim3, Update Dim4, Update fact.]
BUSINESS INTELLIGENCE
SSIS - SQL Server Integration Services
Background
- SSIS is a tool for ETL
  - It can be used independently from SQL Server
  - Formerly called Data Transformation Services (in SQL Server 2000)
- Docs and samples
  - Tutorial from Books Online
    - http://msdn.microsoft.com/en-us/library/ms141026.aspx
  - CodePlex samples
    - http://www.codeplex.com/SqlServerSamples#ssis
  - On-line community
    - http://sqlis.com
Developing SSIS projects
- Developer framework
  - Integrated within SSDT/BIDS
    - Solution = collection of projects
    - Project = developer project (C++, C#, IS, …)
- Demo
  - File → New Project → Integration Services
  - Panels: solution explorer, server explorer, others
  - SSIS packages (.dtsx extension)
    - Panels: control flow, data flow
Control flow / Jobs
- Tasks, Containers & Precedence
  - Tasks
    - ETL tasks (list in the Toolbox panel)
  - Container
    - Iteration
  - Precedence
    - Arrows connecting tasks specify precedence type
Data flow / Transformations
- Special tasks
- Define pipelines of data flows from sources to destination
  - Data flow sources
  - Data flow transformation
  - Data destination
  - Toolbox panel for list
SSIS projects structure
SSIS data types
- SSIS defines a set of reference data types
  - As seen for connectivity standards (ODBC, JDBC, OLE DB)
  - http://msdn.microsoft.com/en-us/library/ms141036.aspx
- Data types from sources are mapped into SSIS types
- SSIS transformations work on SSIS types
- SSIS types are mapped to destination data types
Debug, deployment, scheduling
- Debug
  - Data viewers
- Deployment
  - Save project on file
  - Save project on remote SSIS server
    - Project->Deploy
  - Load project from remote SSIS server
    - File->Add new project->Integration Services Import Project Wizard
- Launch
  - Local run
    - From Visual Studio
    - From command line: dtexec
    - From explorer: double click on .dtsx files
  - Remote run on SSIS servers
    - On demand / scheduled
Change data capture
BUSINESS INTELLIGENCE LABORATORY
ETL Demo: Pipeline, Sampling and Surrogate Keys
Pipeline
- Consider the Foodmart sales database
- Design an ETL project for writing to a CSV file the list of products ordered by descending avg gain
  - Gain of a single sale is defined as (store_sales - store_cost)*unit_sales
  - Avg gain of a product is the sum of gains of sales of the product divided by the total unit_sales sold
- Do not use views or queries! Do all work in ETL.
SQL SOLUTION
SELECT product_id
FROM Sales
GROUP BY product_id
ORDER BY SUM(store_sales-store_cost)/SUM(unit_sales) DESC

… and what about adding Product_name?
BASIC IDEA OF SSIS SOLUTION
SALES → Group by product and compute E → ORDER by E → PROJECT product_id
E = SUM(store_sales-store_cost)/SUM(unit_sales)
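The group, order and project steps of the pipeline can be mimicked in plain Python to check the logic before building the package. This is only a sketch under stated assumptions: the sales rows are made up, and E is computed exactly as in the formula on the slide.

```python
from collections import defaultdict

# Hypothetical sales rows: (product_id, store_sales, store_cost, unit_sales)
sales = [(1, 10.0, 4.0, 2), (2, 9.0, 5.0, 1), (1, 5.0, 2.0, 1), (3, 8.0, 7.0, 2)]

# Grouping by product: accumulate SUM(store_sales-store_cost) and SUM(unit_sales)
totals = defaultdict(lambda: [0.0, 0])
for product_id, store_sales, store_cost, unit_sales in sales:
    totals[product_id][0] += store_sales - store_cost
    totals[product_id][1] += unit_sales

# Compute E per product, ORDER by E descending, then PROJECT product_id
e = {p: gain / units for p, (gain, units) in totals.items()}
ranked = sorted(e, key=e.get, reverse=True)
print(ranked)
```

Each step corresponds to one stage of the data flow: an aggregation, a sort and a column selection, followed by a CSV destination that is omitted here.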
Stratified subsampling
- Consider the census table in the MasterBigData db
- Design an ETL project for writing to a CSV a random sampling of 30% stratified by sex
  - 30% of males plus 30% of females
- Do not use views or queries! Do all work in ETL.
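Stratified sampling splits the rows by the stratification attribute, samples each group separately and merges the results. A minimal sketch, assuming a made-up census table and a hypothetical helper `stratified_sample`; the fixed seed is there only to make the run repeatable.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample `fraction` of the rows independently within each stratum."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical census rows: (person_id, sex), 200 males and 100 females
census = [(i, "M" if i % 3 else "F") for i in range(300)]
picked = stratified_sample(census, key=lambda r: r[1], fraction=0.30)
print(len(picked))
```

In a data flow the same idea maps to a split on sex, one sampling step per branch, and a union of the branches before the CSV destination.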
BUSINESS INTELLIGENCE LABORATORY
Lab exercise on ETL: SCD
Business Informatics Degree
SCD: background
- Slowly Changing Dimensions
  - Updates to the members of data warehouse dimensions
  - Three types:
    - Type 1: overwrite previous value
    - Type 2: keep all previous values
    - Type 3: keep last N previous values (N ~ 1, 2, 3)
  - Each attribute of the dimension can have its own type
    - Type 1: name, surname, …
    - Type 2: address, …
SCD: input and output tables
- Database FoodMart in SQL Server
- Input
  - table customer
- Output in Lbi database
  - create a table customer_dim
    - columns
      - surrogate_key (PK), customer_id, customer_name, address, date_start, date_end
    - with
      - surrogate_key being a surrogate key, customer_name including name and surname, address made of address1-city-zip-province-country, date_start and date_end being dates
Preliminary step
- Develop an SSIS package that adds to customer_dim the customers in customer that are not already in it
SCD: type 1 updates
- Overwrite previous value
- Changes on the input table customer
  - On 10/3/2007
    - 231, Mario Rosi, Via XXV Aprile Pisa
  - On 12/3/2007
    - 231, Mario Rossi, Via XXV Aprile Pisa
  - Surname has been corrected
SCD: type 1 updates
- The DW customer_dim table looks as follows:
  - On 10/3/2007, and up to 12/3/2007

surrogate_key, customer_id, name, address, date_start, date_end
874, 231, Mario Rosi, Via XXV Aprile Pisa, 10/3/2007, NULL

  - On 12/3/2007

surrogate_key, customer_id, name, address, date_start, date_end
874, 231, Mario Rossi, Via XXV Aprile Pisa, 10/3/2007, NULL
SCD: type 2 updates
- Keep all previous values
- Changes on the input table customer
  - On 12/3/2007
    - 231, Mario Rossi, Via XXV Aprile Pisa
  - On 25/9/2008
    - 231, Mario Rossi, Via Risorgimento Pisa
  - Customer has changed his address
SCD: type 2 updates
- The DW customer_dim table looks as follows:
  - On 12/3/2007, and up to 25/9/2008

surrogate_key, customer_id, name, address, date_start, date_end
874, 231, Mario Rossi, Via XXV Aprile Pisa, 10/3/2007, NULL

  - On 25/9/2008

surrogate_key, customer_id, name, address, date_start, date_end
874, 231, Mario Rossi, Via XXV Aprile Pisa, 10/3/2007, 25/9/2008
987, 231, Mario Rossi, Via Risorgimento Pisa, 25/9/2008, NULL
Lab exercise
- Design an SSIS project to update customer_dim starting from customer as follows:
  - Customers in customer that are not in customer_dim are added to it
  - Updates of customer_name are of Type 1
  - Updates of address are of Type 2
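The update rules of the exercise can be prototyped in a few lines before designing the package. The sketch below is not the SSIS solution: `apply_scd` is a hypothetical helper, the dimension is a list of dicts, surrogate keys are passed in by the caller instead of being generated, and a Type 1 change is applied to every stored version of the customer, which is one possible reading of "overwrite". The calls replay the Mario Rossi example from the slides.

```python
def apply_scd(dim, customer_id, name, address, today, next_key):
    """Apply one source row to the dimension rows (list of dicts)."""
    current = [r for r in dim
               if r["customer_id"] == customer_id and r["date_end"] is None]
    if not current:
        # New customer: add a row with a fresh surrogate key
        dim.append(dict(surrogate_key=next_key, customer_id=customer_id,
                        customer_name=name, address=address,
                        date_start=today, date_end=None))
        return
    row = current[0]
    if row["customer_name"] != name:
        # Type 1: overwrite the name in every version of this customer
        for r in dim:
            if r["customer_id"] == customer_id:
                r["customer_name"] = name
    if row["address"] != address:
        # Type 2: close the current version and open a new one
        row["date_end"] = today
        dim.append(dict(row, surrogate_key=next_key, address=address,
                        date_start=today, date_end=None))

dim = []
apply_scd(dim, 231, "Mario Rosi", "Via XXV Aprile Pisa", "10/3/2007", 874)
apply_scd(dim, 231, "Mario Rossi", "Via XXV Aprile Pisa", "12/3/2007", 875)
apply_scd(dim, 231, "Mario Rossi", "Via Risorgimento Pisa", "25/9/2008", 987)
for r in dim:
    print(r)
```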
Sales during travels
- A sale in sales_fact was done during a travel if the store of the sale was not in the city of residence of the customer. Develop an SSIS package which produces a CSV file with a row for every customer with:
  - the customer full name
  - the total sales to the customer
  - the ratio of sales done during travels
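The logic of this exercise can be checked on a toy example first. The tables below are made up and heavily simplified, and the ratio is computed on sales amounts, which is an assumption: the text could also be read as a ratio of sale counts.

```python
from collections import defaultdict

# Hypothetical, simplified tables
customer_city = {1: "Pisa", 2: "Lucca"}
customer_name = {1: "Mario Rossi", 2: "Anna Verdi"}
store_city = {10: "Pisa", 20: "Firenze"}
# sales_fact rows: (customer_id, store_id, store_sales)
sales_fact = [(1, 10, 5.0), (1, 20, 15.0), (2, 20, 8.0)]

total = defaultdict(float)
travel = defaultdict(float)
for cust, store, amount in sales_fact:
    total[cust] += amount
    # A sale is "during a travel" if the store city differs from the residence
    if store_city[store] != customer_city[cust]:
        travel[cust] += amount

# One output row per customer: full name, total sales, ratio of travel sales
report = [(customer_name[c], total[c], travel[c] / total[c]) for c in sorted(total)]
print(report)
```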
Sales in weekends of previous month
- For a given customer and month, the frequency of purchases in weekends (FPW) is the number of distinct weekend days (Saturdays or Sundays) of the previous month in which the customer made a purchase. Develop an SSIS package which produces a CSV file with a row for every customer and month with:
  - the customer full name
  - the month and year
  - the customer FPW
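The FPW definition can be made concrete with a small function. This is a sketch for one customer only, with made-up purchase dates; `fpw` and `previous_month` are hypothetical helper names.

```python
from datetime import date

def previous_month(year, month):
    """Return (year, month) of the month before the given one."""
    return (year - 1, 12) if month == 1 else (year, month - 1)

def fpw(purchase_dates, year, month):
    """Distinct Saturdays/Sundays of the previous month with a purchase."""
    py, pm = previous_month(year, month)
    days = {d for d in purchase_dates
            if (d.year, d.month) == (py, pm) and d.weekday() >= 5}
    return len(days)

# Hypothetical purchases of one customer
purchases = [date(2018, 2, 3), date(2018, 2, 3),   # Saturday, bought twice
             date(2018, 2, 4),                     # Sunday
             date(2018, 2, 5),                     # Monday
             date(2018, 1, 6)]                     # Saturday, but in January
print(fpw(purchases, 2018, 3))
```

Using a set makes two purchases on the same day count once, which is what "distinct weekend days" requires.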