WORKSHOP: Pentaho Data Integration: Extrayendo, Integrando, Normalizando y Preparando mis datos
Big Data and Business Intelligence Program Projects
Alex Rayón, [email protected]
November 2015


Page 1: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

WORKSHOP: Pentaho Data Integration: Extrayendo, Integrando, Normalizando y Preparando mis datos

Big Data and Business Intelligence Program Projects

Alex Rayón, [email protected]

November 2015

Page 2: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Before starting…

Who has used a relational database?

Source: http://www.agiledata.org/essays/databaseTesting.html

2

Page 3: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Before starting… (II)

Who has written scripts or Java code to move data from one source and load it to another?

Source: http://www.theguardian.com/teacher-network/2012/jan/10/how-to-teach-code

3

Page 4: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Before starting… (III)

What did you use?

1. Scripts

2. Custom Java Code

3. ETL

Page 5: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

5

Page 6: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

6

Page 7: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Pentaho at a glance

Business Intelligence

7

Page 8: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Pentaho at a glance (II)

8

Page 9: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Pentaho at a glance (III): Business Intelligence & Analytics

Open Core

GPL v2

Apache 2.0

Enterprise and OEM licenses

Java-based

Web front-ends

Page 10: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Pentaho at a glance (IV): The Pentaho Stack

Data Integration / ETL

Big Data / NoSQL

Data Modeling

Reporting

OLAP / Analysis

Data Visualization

Dashboarding

Data Mining / Predictive Analysis

Scheduling

Source: http://helicaltech.com/blogs/hire-pentaho-consultants-hire-pentaho-developers/

10

Page 11: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Pentaho at a glance (V): Modules

Pentaho Data Integration

Kettle

Pentaho Analysis

Mondrian

Pentaho Reporting

Pentaho Dashboards

Pentaho Data Mining

WEKA

11

Page 12: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Pentaho at a glance (VI)

Figures

10,000+ deployments

185+ countries

1,200+ customers

In the Gartner Magic Quadrant for BI Platforms since 2012

1 download every 30 seconds

12

Page 13: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Pentaho at a glance (VII)

Open Source Leader

13

Page 14: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Pentaho at a glance (VIII): Single Platform

14

Page 15: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

15

Page 16: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Academic field

16

Page 17: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Academic field (II)

17

Page 18: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Academic field (III)

18

Page 19: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Academic field (IV)

19

Page 20: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Academic field (V)

20

Page 21: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Academic field (VI)

21

Page 22: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

22

Page 23: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Definition and characteristics

An ETL tool is a tool that:

Extracts data from various data sources (usually legacy data)

Transforms data

from → optimized for transactions

to → optimized for reporting and analysis

synchronizes the data coming from different databases

cleanses the data to remove errors

Loads data into a data warehouse
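The extract → transform → load flow above can be sketched in a few lines of plain Python; this is a hand-coded stand-in for what an ETL tool automates, and the records, table and field names are invented for illustration:

```python
import sqlite3

# Extract: records as they might arrive from a transactional source
# (hypothetical data; note the inconsistent formatting and the bad amount).
rows = [
    {"customer": " Alice ", "amount": "100.50", "date": "2015-11-03"},
    {"customer": "Bob",     "amount": "n/a",    "date": "2015-11-04"},
]

def transform(row):
    """Cleanse one record; return None to drop rows with errors."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None                      # data cleansing: remove errors
    return (row["customer"].strip(), amount, row["date"])

# Load: an in-memory SQLite table stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (customer TEXT, amount REAL, day TEXT)")
clean = [t for t in (transform(r) for r in rows) if t is not None]
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", clean)
print(conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0])  # 1
```

The point of a tool like Kettle is that each of these stages becomes a configurable step instead of code that must be rewritten whenever a source changes.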

Page 24: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Why do I need it?

ETL tools save time and money when developing a data warehouse by removing the need for hand-coding

It is very difficult for database administrators to move data between different brands of databases without using an external tool

In the event that databases are altered or new databases need to be integrated, a lot of hand-coded work needs to be completely redone

24

Page 25: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Business Intelligence

ETL is the heart and soul of business intelligence (BI)

ETL processes bring together and combine data from multiple source systems into a data warehouse

Source: http://datawarehouseujap.blogspot.com.es/2010/08/data-warehouse.html

25

Page 26: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Business Intelligence (II)

According to most practitioners, ETL design and development work consumes 60 to 80 percent of an entire BI project.

Source: http://www.dwuser.com/news/tag/optimization/

Source: The Data Warehousing Institute. www.dw-institute.com

26

Page 27: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Processing framework

Source: The Data Warehousing Institute. www.dw-institute.com

27

Page 28: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Tools

Source: http://www.slideshare.net/jade_22/kettleetltool-090522005630phpapp01

28

Page 29: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Open Source tools

CloverETL

KETL

Kettle

Talend

29

Page 30: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: CloverETL

Creates a basic archive of functions for mapping and transformations, allowing companies to move large amounts of data as quickly and efficiently as possible

Uses building blocks called components to create a transformation graph, which is a visual depiction of the intended data processing

30

Page 31: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: CloverETL (II)

The graphical presentation simplifies even complex data transformations, allowing for drag-and-drop functionality

Limited to approximately 40 different components to simplify graph creation

Yet you may configure each component to meet specific needs

It also features extensive debugging capabilities to ensure all transformation graphs work precisely as intended

31

Page 32: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: KETL

Contains a scalable, platform-independent engine capable of supporting multiple computers and 64-bit servers

The program also offers performance monitoring, extensive data source support, XML compatibility and a scheduling engine for time-based and event-driven job execution

32

Page 33: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Kettle

The Pentaho company produced Kettle as an open-source alternative to commercial ETL software

No relation to Kinetic Networks' KETL

Kettle features a drag-and-drop graphical environment with progress feedback for all data transactions, including automatic documentation of executed jobs

An XML Input Stream step handles huge XML files without a loss in performance or a spike in memory usage

Users can also upgrade the free Kettle version for optional pay features and dedicated technical support.

33

Page 34: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Talend

Provides a graphical environment for data integration, migration and synchronization

Drag-and-drop graphical components generate the Java code required to execute the desired task, saving time and effort

Pre-built connectors to enable compatibility with a wide range of business systems and databases

Users gain real-time access to corporate data, allowing for the monitoring and debugging of transactions to ensure smooth data integration

34

Page 35: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Comparison

The set of criteria used for the ETL tools comparison was divided into the following categories:

TCO

Risk

Ease of use

Support

Deployment

Speed

Data Quality

Monitoring

Connectivity

35

Page 36: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Comparison (II)

36

Page 37: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Comparison (III)

Total Cost of Ownership

The overall cost for a certain product.

This can include initial purchase, licensing, servicing, support, training, consulting, and any other additional payments that must be made before the product is in full use

Commercial Open Source products are typically free to use; companies pay for support, training and consulting

37

Page 38: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Comparison (IV)

Risk

There are always risks with projects, especially big projects.

The risks for projects failing are:

Going over budget

Going over schedule

Not completing the requirements or expectations of the customers

Open Source products have much lower risk than Commercial ones, since they do not tie users to pricey licenses

38

Page 39: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Comparison (V)

Ease of use

All of the ETL tools, apart from Inaport, have a GUI to simplify the development process

Having a good GUI also reduces the time to train and use the tools

Pentaho Kettle has the easiest-to-use GUI of all the tools

Training can also be found online or within the community

39

Page 40: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Comparison (VI)

Support

All of the ETL tool providers offer support

Pentaho Kettle – Offers support from the US and UK, and has a partner consultant in Hong Kong

Deployment

Pentaho Kettle is a stand-alone Java engine that can run on any machine that can run Java. It needs an external scheduler to run automatically.

It can be deployed on many different machines and used as “slave servers” to help with transformation processing.

Recommended: one 1 GHz CPU and 512 MB of RAM

40

Page 41: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Comparison (VII)

Speed

The speed of ETL tools depends largely on the data that needs to be transferred over the network and the processing power involved in transforming the data.

Pentaho Kettle is faster than Talend, but the Java connector slows it down somewhat. It also requires manual tweaking, like Talend. It can be clustered across many machines to reduce network traffic.

41

Page 42: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Comparison (VIII)

Data Quality

Data Quality is fast becoming the most important feature in any data integration tool.

Pentaho – has data quality features in its GUI and allows customized SQL statements, JavaScript and regular expressions. Additional modules are available with a subscription.
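The normalize-then-validate idea behind those regular-expression data-quality steps can be sketched in Python; the phone-number rule and sample values below are invented for the example:

```python
import re

# Hypothetical raw records with inconsistent formats and stray whitespace.
raw = ["  +34 600-123-456 ", "600123456", "not-a-phone"]

# Validation rule: optional leading +, then 9 to 12 digits.
PHONE = re.compile(r"^\+?\d{9,12}$")

def cleanse(value):
    """Normalize the value, then validate it; return None if unrecoverable."""
    candidate = re.sub(r"[\s\-]", "", value)   # strip spaces and dashes
    return candidate if PHONE.match(candidate) else None

clean = [c for c in (cleanse(v) for v in raw) if c is not None]
print(clean)  # ['+34600123456', '600123456']
```

In Kettle the same two stages would typically be a string-manipulation step followed by a validation step driven by a regular expression.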

Monitoring

Pentaho Kettle – has practical monitoring tools and logging

42

Page 43: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

ETL: Comparison (IX)

Connectivity

In most cases, ETL tools transfer data from legacy systems

Their connectivity is very important to the usefulness of the ETL tools.

Kettle can connect to a very wide variety of databases, flat files, XML files, Excel files and web services.

43

Page 44: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

44

Page 45: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Introduction

Project Kettle

Powerful Extraction, Transformation and Loading (ETL) capabilities using an innovative, metadata-driven approach

45

Page 46: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Introduction (II)

What is Kettle?

Batch data integration and processing tool written in Java

Exists to retrieve, process and load data

PDI is a synonymous term

Source: http://www.dreamstime.com/stock-photo-very-old-kettle-isolated-image16622230

46

Page 47: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Introduction (III)

It uses an innovative metadata-driven approach

It has a very easy-to-use GUI

Strong community of 13,500 registered users

It uses a stand-alone Java engine that processes the tasks for moving data between many different databases and files

47

Page 48: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Introduction (IV)

48

Page 49: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Data Integration Platform

Source: http://download.101com.com/tdwi/research_report/2003ETLReport.pdf

49

Page 50: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Architecture

Source: Pentaho Corporation

50

Page 51: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Most common uses

Data warehouse and data mart loads

Data Integration

Data cleansing

Data migration

Data export

etc.

Page 52: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Data Integration

Changing input to desired output

Jobs

Synchronous workflow of job entries (tasks)

Transformations

Stepwise parallel & asynchronous processing of a recordstream

Distributed

52
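The job/transformation distinction above can be made concrete: a job runs task entries one after another, while a transformation streams records through its steps. A minimal sketch of that record-stream idea using Python generators (an analogy only; Kettle's actual engine runs each step in its own thread):

```python
def read_rows():
    """Input step: emit records one at a time (a record stream)."""
    for i in range(5):
        yield {"id": i, "value": i * 10}

def filter_rows(stream):
    """Intermediate step: pass through only rows matching a condition."""
    for row in stream:
        if row["value"] >= 20:
            yield row

def add_field(stream):
    """Calculator step: derive a new field for each row."""
    for row in stream:
        row["double"] = row["value"] * 2
        yield row

# Steps are chained and rows flow through one at a time, never
# materializing the whole dataset -- unlike a job's sequential task list.
pipeline = add_field(filter_rows(read_rows()))
result = list(pipeline)
print(len(result))  # 3
```

Because no step waits for the previous one to finish the whole dataset, memory use stays bounded even for very large inputs.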

Page 53: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Data Integration challenges

Data is everywhere

Data is inconsistent

Records are different in each system

Performance issues

Running queries that summarize data over long periods ties up the operational system

Brings the OS to maximum load

Data is never all in Data Warehouse

Excel sheet, acquisition, new application

53

Page 54: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Transformations

String and Date Manipulation

Data Validation / Business Rules

Lookup / Join

Calculation, Statistics

Cryptography

Decisions, Flow control

Scripting

etc.

54
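Several of the step types listed above (validation, string and date manipulation, lookup/join, calculation) can be illustrated on a single row; the lookup table, exchange rate and field names below are invented for the example:

```python
from datetime import datetime

# Hypothetical reference table, as a "Stream lookup" step would use.
country_lookup = {"ES": "Spain", "FR": "France"}

def transform(row):
    # Data validation / business rule: reject negative amounts.
    if row["amount"] < 0:
        raise ValueError("amount must be non-negative")
    # String and date manipulation.
    row["name"] = row["name"].title()
    row["year"] = datetime.strptime(row["date"], "%Y-%m-%d").year
    # Lookup / join against the reference table.
    row["country"] = country_lookup.get(row["country_code"], "Unknown")
    # Calculation (the 0.92 rate is invented for illustration).
    row["amount_eur"] = round(row["amount"] * 0.92, 2)
    return row

row = transform({"name": "alex rayón", "date": "2015-11-20",
                 "country_code": "ES", "amount": 100.0})
print(row["country"], row["year"])  # Spain 2015
```

In Kettle each of these lines would be a separate, configurable step wired together on the canvas rather than code.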

Page 55: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: What is it good for?

Mirroring data from master to slave

Syncing two data sources

Processing data retrieved from multiple sources and pushed to multiple destinations

Loading data to RDBMS

Datamart / Datawarehouse

Dimension lookup/update step

Graphical manipulation of data

55

Page 56: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Alternatives

56

Code

Custom java

Spring batch

Scripts

Perl, Python, shell, etc.

Possibly combined with a DB loader tool and cron

Commercial ETL tools

Datastage

Informatica

Oracle Warehouse Builder

SQL Server Integration Services

Page 57: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Extraction

57

Page 58: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Extraction (II)

Source: http://download.101com.com/tdwi/research_report/2003ETLReport.pdf

58

Page 59: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Extraction (III)

RDBMS (SQL Server, DB2, Oracle, MySQL, PostgreSQL, Sybase IQ, etc.)

NoSQL Data: HBase, Cassandra, MongoDB

OLAP (Mondrian, Palo, XML/A)

Web (REST, SOAP, XML, JSON)

Files (CSV, Fixed, Excel, etc.)

ERP (SAP, Salesforce, OpenERP)

Hadoop Data: HDFS, Hive

Web Data: Twitter, Facebook, Log Files, Web Logs

Others: LDAP/Active Directory, Google Analytics, etc.

59

Page 60: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Transportation

60

Page 61: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Transformation

61

Page 62: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Loading

62

Page 63: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Environment

63

Page 64: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Kettle: Comparison of Data Integration tools

64

Page 65: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

65

Page 66: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Big Data: Business Intelligence

Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)

A brief BI history…

66

Page 67: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Big Data: WEKA

Project Weka

A comprehensive set of tools for Machine Learning and Data Mining

Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)

67

Page 68: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Big Data: Among Pentaho’s products

Mondrian

OLAP server written in Java

Kettle

ETL tool

Weka

Machine learning and Data Mining tool

Page 69: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Big Data: WEKA platform

WEKA (Waikato Environment for Knowledge Analysis)

Funded by the New Zealand Government (for more than 10 years)

Develop an open-source state-of-the-art workbench of data mining tools

Explore fielded applications

Develop new fundamental methods

Became part of Pentaho platform in 2006 (PDM - Pentaho Data Mining)

69

Page 70: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Big Data: Data Mining with WEKA

(One-of-the-many) Definition: Extraction of implicit, previously unknown, and potentially useful information from data

Goal: improve marketing, sales, and customer support operations, risk assessment etc.

Who is likely to remain a loyal customer?

What products should be marketed to which prospects?

What determines whether a person will respond to a certain offer?

How can I detect potential fraud?

70

Page 71: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Big Data: Data Mining with WEKA (II)

Central idea: historical data contains information that will be useful in the future (patterns → generalizations)

Data Mining employs a set of algorithms that automatically detect patterns and regularities in data
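Weka itself is a Java toolkit; purely to illustrate the "patterns → generalizations" idea, here is a OneR-style majority rule (one of the simplest learners, and one Weka ships) learned from invented historical data in Python:

```python
from collections import Counter, defaultdict

# Toy historical data: (customer tenure bucket, churned?) pairs.
# The attribute and outcomes are invented for illustration.
history = [("short", True), ("short", True), ("short", False),
           ("long", False), ("long", False), ("long", True),
           ("long", False)]

# Learn a one-attribute rule: for each bucket, predict the
# majority outcome observed in the past (the "generalization").
counts = defaultdict(Counter)
for bucket, churned in history:
    counts[bucket][churned] += 1
rule = {b: c.most_common(1)[0][0] for b, c in counts.items()}

print(rule)  # {'short': True, 'long': False}
```

Real data-mining algorithms differ in how expressive their patterns are, but the principle is the same: compress historical regularities into a rule that can be applied to future records.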

Page 72: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Big Data: Data Mining with WEKA (III)

A bank’s case as an example

Problem: prediction (probability score) of corporate customer delinquency (or default) in the next year

Customer historical data used include:

Customer footings behavior (assets & liabilities)

Customer delinquencies (rates and time data)

Business Sector behavioral data

72

Page 73: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Big Data: Data Mining with WEKA (IV)

Variable selection using the Information Value (IV) criterion

Automatic binning of continuous variables was used (Chi-merge). Manual corrections were made to address particularities in the data distribution of some variables (again using IV)
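As a sketch of the Information Value criterion used above: over the bins of a variable, IV = Σ (good%ᵢ − bad%ᵢ) · ln(good%ᵢ / bad%ᵢ), where the percentages are each bin's share of all good and all bad customers. The bin counts below are invented for illustration:

```python
import math

# Toy binned variable: (bin name, good customers, bad customers).
bins = [("low", 400, 20), ("mid", 300, 60), ("high", 100, 120)]

total_good = sum(g for _, g, _ in bins)   # 800
total_bad = sum(b for _, _, b in bins)    # 200

iv = 0.0
for name, good, bad in bins:
    pg, pb = good / total_good, bad / total_bad
    woe = math.log(pg / pb)               # weight of evidence for this bin
    iv += (pg - pb) * woe                 # bin's contribution to the IV

print(round(iv, 3))
```

Variables with a higher IV separate good from bad customers more strongly, which is why it serves as a selection criterion.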

73

Page 74: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Big Data: Data Mining with WEKA (V)

74

Page 75: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Big Data: Data Mining with WEKA (VI)

75

Page 76: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Big Data: Data Mining with WEKA (VII)

Limitations

Traditional algorithms need to have all data in (main) memory

big datasets are an issue

Solution

Incremental schemes

Stream algorithms

MOA (Massive Online Analysis)

http://moa.cs.waikato.ac.nz/
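The incremental idea can be sketched with Welford's streaming mean/variance, which updates statistics one record at a time in constant memory (illustrative Python, not MOA's actual API):

```python
# Incremental (streaming) statistics: each record updates the running
# state and is then discarded, so memory stays constant no matter how
# large the dataset -- the principle behind stream algorithms like MOA's.
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0        # sum of squared deviations (Welford's method)

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)          # one pass, O(1) memory per record
print(stats.mean)
```

A batch algorithm would need the whole list in memory; the streaming version only ever holds three numbers of state.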

Page 77: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Big Data: Be careful with Data Mining

77

Page 78: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

78

Page 79: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Predictive analytics: Unified solution for Big Data Analytics

79

Page 80: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Predictive analytics: Unified solution for Big Data Analytics (II)

Current release: Pentaho Business Analytics Suite 4.8

Instant and interactive data discovery for iPad

● Full analytical power on the go – unique to Pentaho

● Mobile-optimized user interface

80

Page 81: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Predictive analytics: Unified solution for Big Data Analytics (III)

Current release: Pentaho Business Analytics Suite 4.8

Instant and interactive data discovery and development for big data

● Broadens big data access to data analysts

● Removes the need for separate big data visualization tools

● Further improves productivity for big data developers

81

Page 82: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Predictive analytics: Unified solution for Big Data Analytics (IV)

Pentaho Instaview

● Instaview is simple

○ Created for data analysts

○ Dramatically simplifies ways to access Hadoop and NoSQL data stores

● Instaview is instant & interactive

○ Time accelerator – 3 quick steps from data to analytics

○ Interact with big data sources – group, sort, aggregate & visualize

● Instaview is big data analytics

○ Marketing analysis for weblog data in Hadoop

○ Application log analysis for data in MongoDB

82

Page 83: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Predictive analytics: Comparison

Source: http://cdn.oreillystatic.com/en/assets/1/event/100/Using%20R%20and%20Hadoop%20for%20Statistical%20Computation%20at%20Scale%20Presentation.htm#/2

83

Page 84: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

References

http://cdn.oreillystatic.com/en/assets/1/event/100/Big%20Data%20Architectural%20Patterns%20Presentation.pdf

http://blog.pentaho.com/tag/strata/

http://www.slideshare.net/mattcasters/pentaho-data-integration-introduction?from_search=2

http://www.slideshare.net/infoaxon/open-source-bi-7640848

http://download.101com.com/tdwi/research_report/2003ETLReport.pdf

http://www.slideshare.net/jade_22/kettleetltool-090522005630phpapp01

http://www.pentaho.com/Blend-of-the-Week?mkt_tok=3RkMMJWWfF9wsRonuKvNce%2FhmjTEU5z17%2BQoXaO2hokz2EFye%2BLIHETpodcMTcdgPbjYDBceEJhqyQJxPr3DJNAN1dt%2BRhDhCA%3D%3D#Analytics

84

Page 85: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

Copyright (c) 2015 University of Deusto

This work (except the quoted images, whose rights are reserved to their owners*) is licensed under the Creative Commons “Attribution-ShareAlike” License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/

Alex Rayón, November 2015

Page 86: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos

WORKSHOP: Pentaho Data Integration: Extrayendo, Integrando, Normalizando y Preparando mis datos

Big Data and Business Intelligence Program Projects

Alex Rayón, [email protected]

November 2015