ETL with Talend (Big Data)
DESCRIPTION
ETL operations using Talend Open Studio for Big Data, HBase, and Hive
TRANSCRIPT
ETL with Talend
POOJA B. MISHRA
What is ETL?
Extract is the process of reading data from a database.
Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables, or by combining the data with other data.
Load is the process of writing the data into the target database.
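The three phases above can be sketched end to end in a few lines of Python, using in-memory SQLite databases on both sides (the table and column names are illustrative, chosen to match the customers table used later in this tutorial):

```python
import sqlite3

# Extract: read rows from a source database (an in-memory SQLite DB here,
# standing in for any source system).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE raw_customers (id INTEGER, name TEXT, level TEXT)")
src.executemany("INSERT INTO raw_customers VALUES (?, ?, ?)",
                [(1, "alice", "gold"), (2, "bob", "silver")])
rows = src.execute("SELECT id, name, level FROM raw_customers").fetchall()

# Transform: apply simple rules -- title-case the names, upper-case the levels.
transformed = [(i, name.title(), level.upper()) for i, name, level in rows]

# Load: write the transformed rows into the target database.
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE customers (id INTEGER, name TEXT, level TEXT)")
dst.executemany("INSERT INTO customers VALUES (?, ?, ?)", transformed)
print(dst.execute("SELECT * FROM customers").fetchall())
# [(1, 'Alice', 'GOLD'), (2, 'Bob', 'SILVER')]
```

Talend generates Java jobs that do exactly this kind of work, but with visual components in place of hand-written code.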
Process flow
Terms closely related to, and managed by, ETL processes:
• data migration
• data management
• data cleansing
• data synchronization
• data consolidation
Different ETL tools
• Informatica PowerCenter
• Oracle ETL
• Ab Initio
• Pentaho Data Integration - Kettle Project (open source ETL)
• SAS ETL Studio
• Cognos DecisionStream
• Business Objects Data Integrator (BODI)
• Microsoft SQL Server Integration Services (SSIS)
• Talend
Prerequisites
Talend Open Studio for Data Integration
◦ http://www.talend.com/download
VirtualBox
◦ https://www.virtualbox.org/wiki/Downloads
Hortonworks Sandbox VM
◦ http://hortonworks.com/products/hortonworks-sandbox/#install
How to set up – Step 1
Step 2
Step 3
Step 4
Step 5
Talend interface
The main window is divided into four areas:
• Workspace
• Repository tree
• Palette
• Component configuration
Supported data input and output formats
SQL databases:
• MySQL
• PostgreSQL
• Sybase
• Teradata
• MSSQL
• Netezza
• Greenplum
• Access
• DB2
Big data and NoSQL:
• Hive
• Pig
• HBase
• Sqoop
• MongoDB
• Riak
• Many more
What kinds of datasets can be loaded?
Talend Studio offers nearly comprehensive connectivity to:
• Packaged applications (ERP, CRM, etc.), databases, mainframes, files, Web Services, and so on, to address the growing disparity of sources.
• Data warehouses, data marts, and OLAP applications - for analysis, reporting, dashboarding, scorecarding, and so on.
• Built-in advanced components for ETL, including string manipulations, Slowly Changing Dimensions, automatic lookup handling, bulk load support, etc.
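One of those built-ins, Slowly Changing Dimensions, is worth a concrete illustration. In a Type 2 SCD, a changed attribute does not overwrite the existing row: the current row is closed and a new current row is appended, preserving history. A minimal hand-rolled sketch (field names and dates are illustrative, not Talend's API):

```python
from datetime import date

# A dimension table with one current row per business key, plus history.
dimension = [
    {"id": 1, "name": "alice", "level": "silver",
     "start_date": date(2014, 1, 1), "end_date": None, "current": True},
]

def scd2_update(dim, key, field, new_value, today):
    """Type 2 SCD: close the current row, append a new current version."""
    for row in dim:
        if row["id"] == key and row["current"] and row[field] != new_value:
            row["end_date"] = today          # close the old version
            row["current"] = False
            new_row = dict(row, **{field: new_value,
                                   "start_date": today,
                                   "end_date": None,
                                   "current": True})
            dim.append(new_row)              # open the new version
            break

scd2_update(dimension, key=1, field="level", new_value="gold",
            today=date(2015, 6, 1))
print([(r["level"], r["current"]) for r in dimension])
# [('silver', False), ('gold', True)]
```

Talend's SCD components implement this bookkeeping (and Types 1 and 3) without hand-written code.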
Tutorial overview
We will do the following tasks in this assignment:
1. Load data from a database on your local machine to HDFS.
2. Write a Hive query to do analysis.
3. Push the result of the Hive query to an HBase output component.
Step 1
Use the tRowGenerator component to simulate the rows in a database, creating a table with 3 columns: ID, name, and level.
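What tRowGenerator does can be approximated in Python. The names, levels, and separator below are illustrative stand-ins, not what the original job generated:

```python
import random

# A stand-in for Talend's tRowGenerator: produce synthetic rows with the
# three columns used in this tutorial (ID, name, level).
NAMES = ["alice", "bob", "carol", "dave"]
LEVELS = ["bronze", "silver", "gold"]

def generate_rows(n, seed=42):
    """Generate n (id, name, level) rows; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    return [(i, rng.choice(NAMES), rng.choice(LEVELS)) for i in range(1, n + 1)]

for row in generate_rows(5):
    # tHDFSOutput writes delimited text; ';' is Talend's default field separator.
    print(";".join(str(field) for field in row))
```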
Step 2
• Drag and drop the tHDFSOutput component onto the design surface and connect the main output of the row generator to it.
• Double-click the HDFS component in the design area and specify the NameNode address and the folder on your machine that will hold the file.
Step 3
After loading the data to HDFS, we can create an external Hive table named customers by logging in to the Hive shell and executing the following command. The field delimiter was lost in this transcript; it must match the field separator configured in the tHDFSOutput component (';' is Talend's default):

CREATE EXTERNAL TABLE customers (id INT, name STRING, level STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' STORED AS TEXTFILE LOCATION '/usr/talend';
Step 4
Create one flow to read the data in Hive: pick the Hive version and the Thrift server IP/port, then write a Hive query in the component settings.
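The original slide showed the query only in a screenshot, so here is a plausible example analysis over the customers table, counting customers per level. It is sketched with SQLite because a GROUP BY of this shape is valid in both SQLite and HiveQL; the sample rows are assumptions:

```python
import sqlite3

# Stand-in for the Hive table created in Step 3, with hypothetical rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT, level TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?, ?)",
               [(1, "alice", "gold"), (2, "bob", "silver"), (3, "carol", "gold")])

# The same statement could be pasted into tHiveRow as-is.
query = "SELECT level, COUNT(*) FROM customers GROUP BY level ORDER BY level"
print(db.execute(query).fetchall())  # [('gold', 2), ('silver', 1)]
```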
Step 5
Click the Edit schema button and add one column with type Object; we will then parse the result and map it to our schema.
Click the Advanced tab and enable "parse query results", using the Object-typed column we just created.
Drag the tParseRecordSet component onto the surface, connect the main output of tHiveRow to it, then click Edit schema to do the necessary mapping and match up the values.
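What tParseRecordSet does here, mapping each record held in the single Object-typed column back onto a typed schema, can be sketched in Python. The ';' separator and the sample rows are assumptions for illustration:

```python
# Target schema: column name paired with the cast applied to its raw value.
SCHEMA = [("id", int), ("name", str), ("level", str)]

def parse_record(record, sep=";"):
    """Split one delimited record and cast each field per the schema."""
    fields = record.split(sep)
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}

raw_rows = ["1;alice;gold", "2;bob;silver"]   # hypothetical query results
parsed = [parse_record(r) for r in raw_rows]
print(parsed)
# [{'id': 1, 'name': 'alice', 'level': 'gold'},
#  {'id': 2, 'name': 'bob', 'level': 'silver'}]
```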
Results
Step 6
◦ Click to run this job; the console tells you whether it has connected to the Hive server successfully.
◦ Go to the Hive server and it will show that it has received one query and will execute it.
◦ You can see the results in the Talend run console.
Step 7
Drag the tHBaseOutput component from the right palette and configure the ZooKeeper connection info.
Step 8
Running the job produces the final output.
You can log in to the HBase shell and check that the data was inserted into HBase, and that the table itself was created by Talend as well!
Thank You!!