kettle etl tool

Sreenivas K, 04/17/14: A Pentaho Data Integration tool

Upload: sreenivas1986

Post on 24-Nov-2015

DESCRIPTION

Kettle ETL Tool

TRANSCRIPT

  • Sreenivas K, 04/17/14: A Pentaho Data Integration tool

  • Introduction
    ETL Process
    Pentaho's Kettle
    Data Integration Challenges
    Prerequisites and Recent Releases
    Pentaho DI Components
    JDBC
    Spoon
    Transformations
    Jobs

    04/17/14*MaxQDPro: Kettle- ETL Tool

  • 4 major components:
    Extracting
        Gathering raw data from source systems and storing it in the ETL staging environment
        Data profiling
        Identifying data that changed since the last load
    Transforming (cleaning and conforming)
        Processing data to improve its quality, format it, merge it from multiple sources, and enforce conformed dimensions
        Data cleansing
        Recording error events
        Audit dimensions
        Creating and maintaining conformed dimensions and facts
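
"Identifying data that changed since the last load" can be approached in several ways. The sketch below shows one of them, comparing a per-row fingerprint with the fingerprint kept from the previous load. It is not something the slides prescribe: the `id` key, the function names and the in-memory bookkeeping are all assumptions made for illustration.

```python
import hashlib

def row_fingerprint(row):
    """Hash the row's fields in a stable (sorted-key) order."""
    payload = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def detect_changes(previous_fingerprints, current_rows, key="id"):
    """Split current rows into new and changed ones relative to the last load.

    previous_fingerprints maps a business key to the fingerprint stored
    after the previous load (hypothetical staging-area bookkeeping).
    """
    new_rows, changed_rows = [], []
    for row in current_rows:
        old = previous_fingerprints.get(row[key])
        if old is None:
            new_rows.append(row)          # key never seen before
        elif old != row_fingerprint(row):
            changed_rows.append(row)      # key seen, content differs
    return new_rows, changed_rows

last_load = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
stored = {row["id"]: row_fingerprint(row) for row in last_load}
current = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Robert"},
    {"id": 3, "name": "Cy"},
]
new_rows, changed_rows = detect_changes(stored, current)
```

Unchanged rows are skipped entirely, so only new and modified records would move on to the transformation stage.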

  • Data filtering: is not null, greater than, less than, includes
    Field manipulation: trimming, padding, upper and lowercase conversion
    Data calculations: + - x /, average, absolute value, arctangent, natural logarithm
    Date manipulation: first day of month, last day of month, add months, week of year, day of year
    Data type conversion: string to number, number to string, date to number
    Merging fields and splitting fields
    Looking up data: look up in a database, in a text file, an Excel sheet
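
The operations listed above are what an ETL tool applies row by row. As a rough illustration in plain Python (not Kettle code), the sketch below applies a handful of them to one record; the field names `customer`, `code`, `amount` and `order_date` are hypothetical.

```python
import calendar
import math
from datetime import date

def first_day_of_month(d):
    return d.replace(day=1)

def last_day_of_month(d):
    # monthrange returns (weekday of first day, number of days in month)
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])

def transform(row):
    """Apply a few of the listed operations to one row (a dict)."""
    order_date = date.fromisoformat(row["order_date"])  # string to date
    amount = float(row["amount"])                       # string to number
    return {
        "customer": row["customer"].strip().upper(),    # trim + uppercase
        "code": row["code"].rjust(6, "0"),              # left-pad with zeros
        "amount_abs": abs(amount),                      # absolute value
        "amount_log": math.log(amount) if amount > 0 else None,  # natural log
        "month_start": first_day_of_month(order_date),
        "month_end": last_day_of_month(order_date),
        "day_of_year": order_date.timetuple().tm_yday,
        "week_of_year": order_date.isocalendar()[1],    # ISO week number
    }

rows = [
    {"customer": "  acme ", "code": "42", "amount": "-12.5", "order_date": "2014-04-17"},
    {"customer": None, "code": "7", "amount": "3", "order_date": "2014-02-10"},
]
# Data filtering: keep only rows whose customer field is not null.
clean = [transform(r) for r in rows if r["customer"] is not None]
```

In a graphical tool each of these lines would typically be a configured step rather than code, which is the "zero code" point made later in the deck.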

  • Loading
    Loading data into data warehouse tables
    Managing hierarchies in dimensions
    Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions
    Fact table loading
    Building and maintaining bridge dimension tables
    Handling late arriving data
    Management of conformed dimensions
    Administration of fact tables
    Building aggregations
    Building OLAP cubes
    Transferring DW data to other environments for specific purposes

  • Complexity and significant operational problems
    Source data that exceeds the designer's expectations
    Data profiling of a source
    Data warehouses typically grow asynchronously
    Establishing the scalability of an ETL system across its lifetime

  • Many off-the-shelf tools exist
    High-end tools may not justify their value for smaller warehouses
    Proprietary ETL
        High upfront cost
        Long-term maintenance
    Custom code
        Low upfront cost
        Support grows as business requirements change

  • Tool                                Vendor
    Oracle Warehouse Builder (OWB)      Oracle
    Data Integrator (BODI)              Business Objects
    IBM Information Server (Ascential)  IBM
    SAS Data Integration Studio         SAS Institute
    PowerCenter                         Informatica
    Oracle Data Integrator (Sunopsis)   Oracle
    Data Migrator                       Information Builders
    Integration Services                Microsoft
    Talend Open Studio                  Talend
    DataFlow                            Group 1 Software (Sagent)
    Data Integrator                     Pervasive
    Transformation Server               DataMirror
    Transformation Manager              ETL Solutions Ltd.
    Data Manager                        Cognos
    DT/Studio                           Embarcadero Technologies
    ETL4ALL                             IKAN
    DB2 Warehouse Edition               IBM
    Jitterbit                           Jitterbit
    Pentaho Data Integration            Pentaho

  • Kettle
    Kettle: Extraction, Transformation, Transportation & Loading tool
    The data integration tool in Pentaho's open source business intelligence suite
    Pentaho was founded in 2004
    Products of Pentaho
        Mondrian: OLAP server written in Java
        Kettle: ETL tool
        Weka: machine learning and data mining tool

  • Data is everywhere
    Data is inconsistent
        Records are different in each system
    Performance issues
        Queries that summarize data over a long period tie up the operating system
        They bring the OS to maximum load
    Data is never all in the data warehouse
        Excel sheets, acquisitions, new applications

  • Metadata, model-driven approach
        What to do? And how to do it?
    Complex transformations with zero code
    Graphically design data transformations and jobs
    100% Java with cross-platform support
    Extensible architecture
    Repository-based
    Full-featured ETL
    Integration with the Pentaho Open BI Platform

  • Prerequisites
    Java Runtime Environment 1.5 or above
    Compatible with almost any platform
    Compatible with a wide range of database technologies

  • Recent Releases
    4/25 Data Integration 3.0.3 GA
    4/18 Data Integration 3.1 Milestone
    2/8 Data Integration 3.0.2 GA
    12/12 Data Integration 3.0.1 GA
    11/15 Data Integration 3.0 GA
    10/31 Data Integration 3.0 RC2
    10/24 Data Integration 2.5.2 GA
    10/08 Data Integration 3.0 RC1
    08/24 Data Integration 2.5.1 GA

  • Pan
    A program that executes transformations designed in Spoon, stored either as XML or in a database repository
    Transformations are usually scheduled in batch mode to run automatically at regular intervals

  • Carte
    A simple web server that executes transformations and jobs remotely
    Accepts XML (through a small servlet) containing the transformation to execute and the execution configuration
    Allows transformations and jobs to be monitored, started and stopped remotely
    A server running Carte is called a slave server

  • Spoon
    A GUI for designing transformations and jobs that can be run with the Kettle tools Pan and Kitchen
    Transformations and jobs can be described in an XML file or stored in a Kettle database repository
    Spoon ships as both a shell script and a batch file so it can be used in heterogeneous environments
    The latest version of Spoon is the 3.2 beta

  • Kitchen
    Executes jobs designed in Spoon, stored either as XML or in a database repository
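
Because Pan and Kitchen are command-line programs, a scheduler usually just needs the right argument list. The sketch below only builds that list and does not run anything. The script names (`pan.sh`, `kitchen.sh`) and the `-file` and `-level` options are assumptions recalled from the PDI command-line documentation and should be checked against the installed release; the file paths are hypothetical.

```python
import shlex

def build_command(tool, path, log_level="Basic"):
    """Build an argv list for Pan (transformations) or Kitchen (jobs).

    The script names and the -file / -level options are assumptions;
    verify them against the documentation of the release in use.
    """
    scripts = {"pan": "pan.sh", "kitchen": "kitchen.sh"}
    if tool not in scripts:
        raise ValueError(f"unknown tool: {tool}")
    return [f"./{scripts[tool]}", f"-file={path}", f"-level={log_level}"]

# Hypothetical transformation (.ktr) and job (.kjb) files.
pan_cmd = build_command("pan", "/etl/load_customers.ktr")
kitchen_cmd = build_command("kitchen", "/etl/nightly.kjb")

# Print a shell-safe version, e.g. for pasting into a crontab entry.
print(shlex.join(pan_cmd))
print(shlex.join(kitchen_cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems if it is later passed to `subprocess.run`.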

  • Installing
    Ensure JRE 1.5 is installed
    Unzip the binary distribution into any folder

  • Launching
    spoon.bat on Windows (optionally create a shortcut with spoon.ico pointing to the .bat file)
    spoon.sh on Unix-like platforms

  • Supported platforms (works on most operating systems)
    Microsoft Windows, including Vista
    Linux GTK: on i386 and x86_64 processors
    Apple's OSX: works on both PowerPC and Intel machines
    Solaris: using a Motif interface
    AIX, HP-UX, FreeBSD

  • JDBC
    JDBC is the Java database connectivity API; the latest version is JDBC 3.0
    Drivers come in four types
        Type 1: JDBC-ODBC bridge
        Type 2: native-API, partly Java driver
        Type 3: middleware Java driver
        Type 4: direct-to-database Java driver
    Microsoft-based databases such as MS Access rely on Type 1 drivers
    Oracle and MySQL can be connected with the other types, but the Type 4 driver is the one traditionally used
    JDBC can also operate in a distributed environment

  • Key Improvements
    Execution Results pane for logs, metrics and performance graph
    Improved Database Connection dialog
    Snap to grid (graphical workspace)
    Zoom (graphical workspace)
    Easier-to-use left panel for the objects palette
    Over 30 new or improved transformation steps
    13 new or improved job entries
    Support for four new database types: MonetDB, KingbaseES, Vertica, and HP NeoView
    Improved translations

  • Repository connection establishment
    Auto login
        Set the KETTLE_REPOSITORY, KETTLE_USER and KETTLE_PASSWORD environment variables manually
    Login
        By default PDI provides admin as both the login username and password
        It is strongly advised to change the default password to avoid a security vulnerability
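
A small pre-flight check can catch both points on this slide before a scheduled run: missing auto-login variables and the unchanged default credentials. The sketch below is an illustration, not part of PDI; only the three variable names and the admin/admin default come from the slide, and the function name is hypothetical.

```python
import os

REQUIRED = ("KETTLE_REPOSITORY", "KETTLE_USER", "KETTLE_PASSWORD")

def check_auto_login(env):
    """Return a list of problems that would prevent or weaken auto login."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    # Flag the default credentials mentioned on the slide.
    if env.get("KETTLE_USER") == "admin" and env.get("KETTLE_PASSWORD") == "admin":
        problems.append("default admin/admin credentials are still in use")
    return problems

# Check the real environment; prints nothing when everything is in order.
for problem in check_auto_login(os.environ):
    print("warning:", problem)
```

Passing the environment in as a mapping keeps the check easy to exercise with a plain dictionary.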

  • Transformation
    An engine capable of performing a multitude of functions such as reading, manipulating and writing data to and from various data sources
    Value: values are part of a row and can contain any type of data
    Row: a row consists of 0 or more values
    Output stream: a stack of rows that leaves a step
    Input stream: a stack of rows that enters a step
    Hop: a graphical representation of one or more data streams between 2 steps
    Note: a piece of information that can be added to a transformation
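
The vocabulary above (values, rows, streams, steps, hops) maps naturally onto chained generators. The sketch below is a conceptual model in plain Python, not Kettle's API: every function name is hypothetical, and the steps run one after another here, with no claim about how the Kettle engine schedules them.

```python
def input_step(records):
    """Input step: emits rows (tuples of values) into its output stream."""
    for record in records:
        yield record

def filter_step(stream, predicate):
    """Step that passes on only the rows matching the predicate."""
    for row in stream:
        if predicate(row):
            yield row

def calc_step(stream):
    """Step that appends a computed value (quantity * price) to every row."""
    for name, qty, price in stream:
        yield (name, qty, price, qty * price)

def output_step(stream):
    """Output step: collects the rows that reach it."""
    return list(stream)

# Each nested call plays the role of a hop: the output stream of one
# step becomes the input stream of the next.
source = [("bolt", 10, 0.25), ("nut", 0, 0.10), ("gear", 2, 4.00)]
result = output_step(
    calc_step(filter_step(input_step(source), lambda row: row[1] > 0))
)
```

Rows are pulled through the chain one at a time, so no step needs the whole data set in memory.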

  • Jobs
    A way of calling transformations and controlling the sequence of their execution. Jobs are usually scheduled in batch mode to run automatically at regular intervals.
    Job entry: one part of a job, which performs a certain task
    Hop: a graphical link between 2 job entries that defines the order of execution
    Note: a piece of information that can be added to a job

  • Input Steps
    Output Steps
    Lookup Steps
    Transformation Steps
    Join Steps
    DW Steps
    Mapping Steps
    Job Steps

  • Table Output Step

  • Insert / Update Output Step
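
The slide for this step survives only as a title. As a rough illustration of the insert-or-update pattern the name refers to (look a row up by key, insert it if absent, update it if it differs), here is a sketch using Python's built-in `sqlite3`. The `customer` table, its columns and the function name are hypothetical, and this is not how the step itself is configured in Spoon.

```python
import sqlite3

def insert_update(conn, row):
    """Look the row up by key; insert it if absent, update it if it differs."""
    found = conn.execute(
        "SELECT name, city FROM customer WHERE id = ?", (row["id"],)
    ).fetchone()
    if found is None:
        conn.execute(
            "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
            (row["id"], row["name"], row["city"]),
        )
        return "inserted"
    if found != (row["name"], row["city"]):
        conn.execute(
            "UPDATE customer SET name = ?, city = ? WHERE id = ?",
            (row["name"], row["city"], row["id"]),
        )
        return "updated"
    return "unchanged"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
actions = [
    insert_update(conn, {"id": 1, "name": "Ann", "city": "Pune"}),
    insert_update(conn, {"id": 1, "name": "Ann", "city": "Delhi"}),
    insert_update(conn, {"id": 1, "name": "Ann", "city": "Delhi"}),
]
final_city = conn.execute("SELECT city FROM customer WHERE id = 1").fetchone()[0]
```

By contrast, a plain table output only ever inserts, so it suits append-only loads while the lookup-first pattern suits tables that are refreshed repeatedly.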

  • Besides the execution order, a job hop specifies the condition under which the next job entry runs

    Unconditional - the next job entry is executed regardless of the result of the originating job entry.

    Follow when result is true - the next job entry is executed only when the result of the originating job entry is true.

    Follow when result is false - the next job entry is executed only when the result of the originating job entry is false.
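
The three hop conditions can be modelled in a few lines. The sketch below is a simplified model, not Kettle's scheduler: the entry names, the data structures and the assumption that the job graph has no cycles are all illustrative.

```python
UNCONDITIONAL, ON_TRUE, ON_FALSE = "unconditional", "true", "false"

def run_job(entries, hops, start):
    """Run job entries starting at `start`, following hops by result.

    entries maps a name to a zero-argument callable returning True/False.
    hops maps a name to a list of (condition, next_name) pairs.
    Returns the names of the entries in the order they were executed.
    Assumes the hop graph is acyclic.
    """
    executed = []
    pending = [start]
    while pending:
        name = pending.pop(0)
        result = entries[name]()
        executed.append(name)
        for condition, target in hops.get(name, []):
            if (condition == UNCONDITIONAL
                    or (condition == ON_TRUE and result)
                    or (condition == ON_FALSE and not result)):
                pending.append(target)
    return executed

entries = {
    "load": lambda: False,        # pretend the load failed
    "archive": lambda: True,
    "mail_error": lambda: True,
    "log": lambda: True,
}
hops = {
    "load": [
        (ON_TRUE, "archive"),
        (ON_FALSE, "mail_error"),
        (UNCONDITIONAL, "log"),
    ],
}
order = run_job(entries, hops, "load")
```

Because the load entry reports failure, the "true" hop to `archive` is skipped while the "false" and unconditional hops are followed.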

  • Brief introduction to the ETL process
    JDBC
    Repository connection
    Pentaho Data Integration tool
    Components
        Pan
        Carte
        Kitchen
        Spoon
    Transformations with different input data sources
    Jobs

  • kettle.pentaho.org
        Kettle project homepage
    kettle.javaforge.com
        Kettle community website: forum, source, documentation, tech tips, samples
    www.pentaho.org/download/
        All Pentaho modules, pre-configured with sample data
        Developer forums, documentation
        Ventana Research Open Source BI Survey
    www.mysql.com
        White paper - http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
        Kettle Webinar - http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php
    Roland Bouman blog on Pentaho Data Integration and MySQL
        http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html