hadoop abi insight

Dell - Internal Use - Confidential Hadoop @ ABI Insight Into The Ecosystem Tuesday, October 07, 2014

Upload: ramagurubaran-venkat

Post on 18-Jul-2016


Page 1: Hadoop ABI Insight

Dell - Internal Use - Confidential

Hadoop @ ABI

Insight Into The Ecosystem

Tuesday, October 07, 2014

Page 2: Hadoop ABI Insight

2 Enterprise Business Intelligence (EBI)Analytics and BI (ABI) | Dell IT


TEAM - POC

• Deepak Gattala
– Hadoop Administrator
– DW Architect, ABI/EBI

• Spike White
– Linux System Administrator
– Kerberos Specialist

• Will O’Brian
– Active Directory and Identity
– Security Analyst

• Note: Special thanks to Bart Crider, Attila Finta, Mike Porreca, Feargal Tobin, and Alisha Worsham for supporting the effort.

Page 3: Hadoop ABI Insight


Agenda [ Next 1 Hour ]

• Deepak Gattala [15 minutes]
– Get familiar with Hadoop (Cloudera). [5 minutes]
– HDFS and MapReduce tour. [5 minutes]
– Hadoop family and ecosystem. [5 minutes]

• Spike White and Will O’Brian [15 minutes]
– Integration and security. [5 minutes]
– Kerberos. [5 minutes]
– AD forest and OU. [5 minutes]

• Deepak Gattala [15 minutes]
– Understanding Hive and Impala. [5 minutes]
– Cloudera Manager. [5 minutes]
– Dell IT/Services use cases and interest. [5 minutes]

• Deepak Gattala, Spike White and Will O’Brian [15 minutes]
– Product demo. [5 minutes]
– Questions and answers. [10 minutes]

Page 4: Hadoop ABI Insight


Deepak Gattala- Architect

Page 5: Hadoop ABI Insight


Photography

Page 6: Hadoop ABI Insight


What is Hadoop?

• Hadoop is an open source software framework.

• It is an Apache top-level project, but the underlying technology originated in a Google white paper on how to index all the rich textual and structural information.

• Architected to run on a large number of machines that don’t share any memory or disks.

• Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and it scales out by adding more servers.

• Designed to solve problems with large data while running analytics that are deep and computationally intensive.

Page 7: Hadoop ABI Insight


Prerequisites

• The Hadoop framework mainly consists of two important components:
– HDFS (Hadoop Distributed File System)
– MapReduce paradigm

• HDFS is a file system written in Java and used for storage, much as ext3 or ext4 is used on Linux.

• MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

• MapReduce is the paradigm used to process data on HDFS; the processing is moved to the location of the data.

• Basic Linux commands, e.g. ls, cat, etc.

Page 8: Hadoop ABI Insight


HDFS

• Every piece of data is split into blocks and distributed across the cluster.

• Blocks are typically 64 MB or 128 MB; the default replication factor is 3.

[Figure: HDFS block distribution. Labels: Client, Input file (blocks 1 to 5), Name Node (Metadata), TCP/IP Networking, Data Nodes DN1 to DN5.]

Data Node    File Blocks
DataNode1    1, 4, 5
DataNode2    2, 3, 4
DataNode3    1, 2, 5
DataNode4    3, 4, 5
DataNode5    1, 2, 3
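The block table on this slide can be checked with a few lines of Java. This is a minimal sketch, not the real HDFS placement policy: the class and method names are made up for illustration, and the 600 MB file size is a hypothetical input. It computes how many blocks a file needs and counts how many DataNodes hold each block, with and without one failed node.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the slide's block table; not the real HDFS placement policy.
final class HdfsBlockSketch {
    // Block placement copied from the slide: DataNode -> block ids it stores.
    static final Map<String, List<Integer>> PLACEMENT = new LinkedHashMap<>();
    static {
        PLACEMENT.put("DataNode1", List.of(1, 4, 5));
        PLACEMENT.put("DataNode2", List.of(2, 3, 4));
        PLACEMENT.put("DataNode3", List.of(1, 2, 5));
        PLACEMENT.put("DataNode4", List.of(3, 4, 5));
        PLACEMENT.put("DataNode5", List.of(1, 2, 3));
    }

    // Number of blocks needed for a file: ceiling of fileBytes / blockBytes.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // How many DataNodes hold a copy of the block, ignoring one failed node (may be null).
    static int replicas(int blockId, String failedNode) {
        int count = 0;
        for (Map.Entry<String, List<Integer>> e : PLACEMENT.entrySet()) {
            if (!e.getKey().equals(failedNode) && e.getValue().contains(blockId)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Hypothetical 600 MB input file with 128 MB blocks.
        System.out.println("blocks: " + blockCount(600 * mb, 128 * mb));
        for (int block = 1; block <= 5; block++) {
            System.out.println("block " + block + ": " + replicas(block, null)
                    + " replicas, " + replicas(block, "DataNode1") + " if DataNode1 is down");
        }
    }
}
```

Every block in the table has three copies, so losing any single DataNode still leaves at least two copies of each block.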

Page 9: Hadoop ABI Insight


MapReduce - Example

• MapReduce has 5 different stages.
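The figure for this slide is not part of the transcript, so the stage names used here (input, split, map, shuffle and sort, reduce) are an assumption based on the common word-count walkthrough rather than the presenter's own labels. The sketch imitates those stages in memory with plain Java collections; it uses no Hadoop classes, and the sample text is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of a word-count job; no Hadoop classes are used.
final class MapReduceStagesSketch {
    static Map<String, Integer> wordCount(String input) {
        // 1. Input and 2. Split: each line stands in for one input split.
        String[] splits = input.split("\n");

        // 3. Map: emit a (word, 1) pair for every word in every split.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String split : splits) {
            for (String word : split.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    mapped.add(Map.entry(word, 1));
                }
            }
        }

        // 4. Shuffle and sort: group the emitted values by key, keys in sorted order.
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // 5. Reduce: sum the values for each key to produce the final output.
        Map<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffled.entrySet()) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            reduced.put(group.getKey(), sum);
        }
        return reduced;
    }

    public static void main(String[] args) {
        // Prints {bear=2, car=3, deer=2, river=2}
        System.out.println(wordCount("deer bear river\ncar car river\ndeer car bear"));
    }
}
```

On a real cluster the map and reduce steps run on different machines and the shuffle moves data across the network; here everything happens in one process purely to show the data flow.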

Page 10: Hadoop ABI Insight


Hadoop Distributions

• Even though Hadoop is an open source project, some vendors package compatible versions of its components together, add operations tools, and provide great flexibility.

• The top 3 vendors are:
– Cloudera
– Hortonworks
– MapR

• The underlying code remains bare-bones Apache open source; however, some vendors attach commercial products and services to their specific distributions.

Page 11: Hadoop ABI Insight


Eco-System

Page 12: Hadoop ABI Insight


Hadoop Configuration

• Daemons of the Hadoop ecosystem:
– Namenode (Master): block information
– Secondary Namenode (Master): checkpoint of the Namenode
– Data Node (Slave): where the data resides
– Task Tracker (Slave): workers
– Job Tracker (Master): checks and keeps the job status

• By default, Hadoop replicates each block of data three times for redundancy and failover.

Page 13: Hadoop ABI Insight


Spike White- System Sr. Engineer

Page 14: Hadoop ABI Insight


Hardware Configuration

• There are 3 types of node configuration that are very important in the architecture for optimal performance.

• For a small to medium-sized cluster (fewer than 1000 nodes):
– Master Nodes (generally 2 or 3 in a cluster)
– Slave Nodes (can scale from 1 to N nodes)
– Edge Nodes (normally 2, for load balancing)

• Each category of node has a specific configuration with respect to both the hardware and the Hadoop software.

• The Dell reference architecture is available at:
– http://files.cloudera.com/pdf/Dell_Cloudera_Solution_for_Apache_Hadoop_Reference_Architecture.pdf

Page 15: Hadoop ABI Insight


Solution Center Rack Diagram

• Location of the rack: RR8, EBI Lab.

• CM 5.1 and CDH 5.1.2.
– 2 Name Nodes (R720s)
– 6 Data Nodes (R720xds)
– 2 Edge Nodes (R720s)
– 1G network cards (due for upgrade)
– Force10 S60 switch, 1G (due for upgrade)

• If a system crashes, take it offline and fix it; there is no impact on the cluster.

• If a hard drive crashes, replace it and recreate the mount.

Page 16: Hadoop ABI Insight


QUEST VINTELA Authentication

• Vintela Authentication Services (VAS) implements Kerberos and LDAP functionality on UNIX and Linux systems and fully integrates with AD.

• The benefits of using VAS include the following:
– UNIX and Linux users and computers are managed through the Active Directory Users and Computers Microsoft Management Console (MMC) snap-in.
– Kerberos is the protocol used to secure LDAP traffic.
– Performance is tuned to work effectively with Active Directory.

Page 17: Hadoop ABI Insight


Kerberos

• The Kerberos protocol is a standard designed to provide strong authentication within a client/server network environment.

• Kerberos network messages are encrypted with algorithms that make them very difficult to decode into their original form without the key.

• Kerberos defines a number of terms:
– Principal: All entities within Kerberos, including users, computers, and services, are known as principals. Principal names are unique.
– Realm: Every principal is a member of a realm.
– Ticket: A ticket is the fundamental unit of Kerberos authentication. It is a carefully constructed message containing the authentication information that is passed between computers.
– Key Distribution Center (KDC): The KDC is made up of three components:
   – Database of principals containing users, computers, and services;
   – Authentication server that issues Ticket Granting Tickets (TGT);
   – Ticket Granting Service (TGS) that issues service tickets granting clients access to specific services.
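The three KDC components can be pictured with a toy model. The sketch below is an illustration only, under loud assumptions: it leaves out everything that makes Kerberos secure (encryption, session keys, timestamps, ticket lifetimes), and the class name, realm, and principal names are hypothetical. It shows only the order of the exchange: the authentication server hands out a TGT, and the TGS trades that TGT for a ticket to one service.

```java
import java.util.Set;

// Toy model of the KDC roles; real Kerberos adds encryption, session keys, and ticket lifetimes.
final class ToyKdc {
    // A ticket names the client principal and the service principal it is valid for.
    static final class Ticket {
        final String client;
        final String service;

        Ticket(String client, String service) {
            this.client = client;
            this.service = service;
        }
    }

    private final String realm;
    private final Set<String> principals; // the KDC's database of principals

    ToyKdc(String realm, Set<String> principals) {
        this.realm = realm;
        this.principals = principals;
    }

    // Name of the ticket-granting service principal for this realm.
    String tgsPrincipal() {
        return "krbtgt/" + realm + "@" + realm;
    }

    // Authentication server: issues a Ticket Granting Ticket to a known principal.
    Ticket authenticate(String client) {
        if (!principals.contains(client)) {
            throw new IllegalArgumentException("unknown principal: " + client);
        }
        return new Ticket(client, tgsPrincipal());
    }

    // Ticket Granting Service: exchanges a TGT for a ticket to one specific service.
    Ticket grantServiceTicket(Ticket tgt, String service) {
        if (!tgt.service.equals(tgsPrincipal())) {
            throw new IllegalArgumentException("not a TGT for realm " + realm);
        }
        if (!principals.contains(service)) {
            throw new IllegalArgumentException("unknown service: " + service);
        }
        return new Ticket(tgt.client, service);
    }

    public static void main(String[] args) {
        // Hypothetical realm and principal names, for illustration only.
        ToyKdc kdc = new ToyKdc("EXAMPLE.COM",
                Set.of("alice@EXAMPLE.COM", "hive/node1.example.com@EXAMPLE.COM"));
        Ticket tgt = kdc.authenticate("alice@EXAMPLE.COM");
        Ticket st = kdc.grantServiceTicket(tgt, "hive/node1.example.com@EXAMPLE.COM");
        System.out.println(tgt.client + " holds a TGT for " + tgt.service);
        System.out.println(st.client + " holds a service ticket for " + st.service);
    }
}
```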

Page 18: Hadoop ABI Insight


Kerberos

Page 19: Hadoop ABI Insight


Will O’Brian- Active Directory Services

Page 20: Hadoop ABI Insight


Active Directory

• The US-POCLAB.DELLPOC.COM Active Directory (AD) Domain was utilized for the Cloudera setup.

• A Hadoop Organizational Unit (OU) was manually created under us-poclab.dellpoc.com/Unix/Servers.

• A “parent” Service Account (Servicegtminf) was manually created under the us-poclab.dellpoc.com/Service Accounts OU.
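The OU locations above are written as AD canonical paths (domain first, then containers from the top down). LDAP tools usually expect a distinguished name instead. The following is a small sketch of that conversion; the helper name is made up, it assumes every path component after the domain is an OU rather than a CN container, and it does not escape special characters in names.

```java
import java.util.ArrayList;
import java.util.List;

// Converts an AD canonical path such as "example.com/A/B" to an LDAP distinguished name.
final class AdPathSketch {
    // Assumption: every path component after the domain is an OU (not a CN container).
    static String toDistinguishedName(String canonicalPath) {
        String[] parts = canonicalPath.split("/");
        List<String> rdns = new ArrayList<>();
        // OUs are listed innermost first in a DN, so walk the path backwards.
        for (int i = parts.length - 1; i >= 1; i--) {
            rdns.add("OU=" + parts[i]);
        }
        // The DNS domain becomes one DC component per label.
        for (String label : parts[0].split("\\.")) {
            rdns.add("DC=" + label);
        }
        return String.join(",", rdns);
    }

    public static void main(String[] args) {
        // Prints OU=Servers,OU=Unix,DC=us-poclab,DC=dellpoc,DC=com
        System.out.println(toDistinguishedName("us-poclab.dellpoc.com/Unix/Servers"));
        // Prints OU=Service Accounts,DC=us-poclab,DC=dellpoc,DC=com
        System.out.println(toDistinguishedName("us-poclab.dellpoc.com/Service Accounts"));
    }
}
```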

Page 21: Hadoop ABI Insight


Active Directory

• CM uses the service account “servicegtminf”.

• The servicegtminf account was given rights to create and delete accounts within the us-poclab.dellpoc.com/Hadoop OU, as well as Full Control rights to any descendant objects (accounts).

• Service accounts are created by CM by changing the user principal name.

• Account “serviceARFSqfwFob” is configured for the “sentry” service running on ausgtmhadoop07.

Page 22: Hadoop ABI Insight


Active Directory

Page 23: Hadoop ABI Insight


Deepak Gattala- Architect

Page 24: Hadoop ABI Insight


Word Count Quiz: Which would you choose?

Using MapReduce

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in each input line.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

        // Reducer (also used as the combiner): sums the counts for each word.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        // Driver: configures the job and submits it to the cluster.
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            // args[0] is the input path, args[1] the output path.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

Using Hive

    SELECT word, COUNT(*) FROM file_table GROUP BY word;

Using Pig

    a = load '/user/hue/word_count_text.txt';
    b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
    c = group b by word;
    d = foreach c generate COUNT(b), group;
    store d into '/user/hue/pig_wordcount';

Page 25: Hadoop ABI Insight


Hive

• Facebook, which uses Hadoop extensively, was looking for a way to give non-Java programmers access to the data in its Hadoop clusters.
– Data analysts, statisticians, data scientists, etc.

• Hive is a SQL SELECT statement => MapReduce translator.
– Takes Hive queries, turns them into MapReduce jobs, and submits them to the cluster.
– Displays the results back to the user. Note: not all SQL works!

• Hive is much easier to learn than Java-based MapReduce.
– Writing HiveQL queries is much faster than writing the equivalent Java code.
– Many people already know SQL and can rapidly start using Hive to query and manipulate data in the cluster.

Page 26: Hadoop ABI Insight


Hive - Authorization

• CREATE ROLE <roleName>;

• DROP ROLE <roleName>;

• GRANT ROLE <roleName> [, <roleName>] TO GROUP <groupName> [, GROUP <groupName>];

• REVOKE ROLE <roleName> [, <roleName>] FROM GROUP <groupName> [, GROUP <groupName>];

• GRANT <PRIVILEGE> [, <PRIVILEGE>] ON <OBJECT> <object_name> TO ROLE <roleName> [, ROLE <roleName>];

• REVOKE <PRIVILEGE> [, <PRIVILEGE>] ON <OBJECT> <object_name> FROM ROLE <roleName> [, ROLE <roleName>];

• The POC uses these groups:
– gtm_hdp_inf_dev: Hive group used for the POC
– gtm_hdp_inf_adm: Cloudera Manager admin group
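To make the statement templates concrete, the sketch below fills them in for one role, one group, and one privilege, and prints the resulting statements. The group name comes from the slide; the role name "poc_reader" and the "default" database are hypothetical, as is the helper itself. It only builds strings: it does not connect to Hive and does not validate or escape the names.

```java
import java.util.List;

// Builds the role statements from the slide for one role, one group, and one privilege.
final class HiveGrantSketch {
    static List<String> statements(String role, String group,
                                   String privilege, String objectType, String objectName) {
        return List.of(
                "CREATE ROLE " + role + ";",
                "GRANT ROLE " + role + " TO GROUP " + group + ";",
                "GRANT " + privilege + " ON " + objectType + " " + objectName
                        + " TO ROLE " + role + ";");
    }

    public static void main(String[] args) {
        // "poc_reader" and the "default" database are hypothetical; the group is from the slide.
        for (String sql : statements("poc_reader", "gtm_hdp_inf_dev", "SELECT", "DATABASE", "default")) {
            System.out.println(sql);
        }
    }
}
```

The order matters when the statements are run: the role has to exist before it can be granted to a group or given privileges.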

Page 27: Hadoop ABI Insight


Hive - Authorization

• The Object Hierarchy where you can apply security can be as granular as below:-

Page 28: Hadoop ABI Insight


DELL ABI Use Cases

• SAIE (Support Assist Intelligence Engine) [Design & Architecture]
– Teradata Appliance (PROD) and home-grown (Hortonworks 2.1) for DEV/SIT

• DCCMT/NGMT Hadoop Reporting [POC]
– Cloudera CDH 5.1.2 [upgrade to CDH 5.2 due soon]

• Server log analysis on Hadoop [POC]
– Cloudera CDH 5.1.2 [upgrade to CDH 5.2 due soon]

• Big Data Edition ETL use case [POC]
– Informatica 9.6.1 and Cloudera CDH 5.1.2 [upgrade to CDH 5.2 due soon]

• MAW (Marketing Analytics Workbench) [Beta Production]
– Teradata Appliance HDP 1.3.2 [upgrade to HDP 2.1 due]

• Rainstor, Archival Strategy [In Production]
– Cloudera CDH 4.2

Page 29: Hadoop ABI Insight


Cloudera Manager

• Cloudera Manager provides a web interface for cluster management.

Page 30: Hadoop ABI Insight


Questions???

Page 31: Hadoop ABI Insight


Cloudera Hadoop Demo