
Dell - Internal Use - Confidential

Hadoop @ ABI

Insight Into The Ecosystem

Tuesday, October 07, 2014


TEAM - POC

• Deepak Gattala – Hadoop Administrator, DW Architect – ABI/EBI

• Spike White – Linux System Administrator, Kerberos Specialist

• Will O’Brian – Active Directory and Identity, Security Analyst

• Note: Special thanks to Bart Crider, Attila Finta, Mike Porreca, Feargal Tobin, and Alisha Worsham for supporting the effort.


Agenda [ Next 1 Hour ]

• Deepak Gattala [15 minutes]
– Get familiar with Hadoop (Cloudera). [5 minutes]
– HDFS & MapReduce tour. [5 minutes]
– Hadoop family and ecosystem. [5 minutes]

• Spike White and Will O’Brian [15 minutes]
– Integration and security. [5 minutes]
– Kerberos. [5 minutes]
– AD forest and OU. [5 minutes]

• Deepak Gattala [15 minutes]
– Understanding Hive and Impala. [5 minutes]
– Cloudera Manager. [5 minutes]
– Dell IT/Services use case and interest. [5 minutes]

• Deepak Gattala, Spike White and Will O’Brian [15 minutes]
– Product demo. [5 minutes]
– Questions & answers. [10 minutes]


Deepak Gattala – Architect


Photography


What is Hadoop?

• Hadoop is an open-source software framework.

• It is an Apache top-level project, but the underlying technology came from Google white papers on indexing rich textual and structured information.

• Architected to run on a large number of machines that don’t share any memory or disks.

• Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and it scales out simply by adding more servers.

• Designed to solve problems with large data while running analytics that are deep and computationally extensive.


Prerequisites

• The Hadoop framework mainly consists of two important components:

– HDFS (Hadoop Distributed File System)
– the MapReduce paradigm

• HDFS is a file system written in Java, used for storage in much the same way as ext3 or ext4 in Linux.

• MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

• MapReduce is the paradigm used to process data on HDFS; the processing is moved to where the data resides.

• Basic Linux commands, e.g. ls and cat (HDFS offers close equivalents, as sketched below).
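For orientation only (this sketch is not part of the original deck), the snippet below uses Hadoop's standard org.apache.hadoop.fs.FileSystem client API to perform the HDFS equivalents of ls and cat; the /user/demo path is hypothetical, and fs.defaultFS is assumed to point at the cluster's NameNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsLsCat {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS in core-site.xml points at the cluster NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Equivalent of "ls /user/demo" (hypothetical directory).
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }

    // Equivalent of "cat /user/demo/sample.txt" (hypothetical file).
    try (FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}

The same operations are available from the command line as hadoop fs -ls and hadoop fs -cat.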


HDFS

• Every piece of data is split into blocks and distributed across the cluster.

• Typically blocks are 64 MB or 128 MB, and the default replication factor is 3.

[Figure: a client writes an input file split into blocks 1–5; the NameNode keeps the block metadata, and the blocks flow over TCP/IP networking to DataNodes DN1–DN5.]

Data Node    File Blocks
DataNode1    1, 4, 5
DataNode2    2, 3, 4
DataNode3    1, 2, 5
DataNode4    3, 4, 5
DataNode5    1, 2, 3

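As a hedged illustration that is not from the original deck, the same FileSystem API can report which DataNodes hold each block of an existing file, mirroring the placement table above; the file path below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockReport {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file already stored in the cluster.
    Path file = new Path("/user/demo/word_count_text.txt");
    FileStatus status = fs.getFileStatus(file);

    System.out.println("Block size: " + status.getBlockSize()
        + " bytes, replication: " + status.getReplication());

    // One BlockLocation per block; getHosts() lists the DataNodes holding a replica,
    // which corresponds to the "Data Node / File Blocks" table above.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.println("Block " + (i + 1) + " -> " + String.join(", ", blocks[i].getHosts()));
    }
  }
}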


MapReduce - Example

• MapReduce has five different stages (typically described as split, map, shuffle, sort, and reduce).


Hadoop Distributions

• Even though Hadoop is an open-source project, several vendors package compatible versions of the components together, add operations tooling, and provide great flexibility.

• The top three vendors are:

– Cloudera
– Hortonworks
– MapR

• The underlying code remains bare-bones Apache open source; however, some vendors attach commercial products and services to their specific distributions.


Eco-System


Hadoop Configuration

• Daemons of the Hadoop ecosystem:

– NameNode (Master): holds block information
– Secondary NameNode (Master): checkpoints the NameNode
– DataNode (Slave): where the data resides
– TaskTracker (Slave): workers that run the tasks
– JobTracker (Master): checks and keeps job status

• Hadoop by default replicates each block of data three times for redundancy and failover.


Spike White – Sr. System Engineer


Hardware Configuration

• There are three different node types in the architecture, and configuring each correctly is important for optimal performance.

• For a small to medium-sized cluster (fewer than 1,000 nodes):

– Master nodes (generally 2 or 3 in a cluster)
– Slave nodes (scale from 1 to N nodes)
– Edge nodes (normally 2, for load balancing)

• Each category of node has a specific configuration with respect to both the hardware and the Hadoop software.

• The Dell reference architecture can be found here:
– http://files.cloudera.com/pdf/Dell_Cloudera_Solution_for_Apache_Hadoop_Reference_Architecture.pdf


Solution Center Rack Diagram

• Rack location: RR8, EBI Lab.

• CM 5.1 and CDH 5.1.2
– 2 Name Nodes (R720s)
– 6 Data Nodes (R720XDs)
– 2 Edge Nodes (R720s)
– 1G network cards (upgrade due)
– Force10 S60 switch, 1G (upgrade due)

• If a system crashes, bring it offline and fix it – no impact.

• If a hard drive crashes, replace it and recreate the mount.


QUEST VINTELA Authentication

• Vintela Authentication Services (VAS) implements Kerberos and LDAP functionality on UNIX and Linux systems, and fully integrates with AD.

• The benefits of using VAS include the following:

– UNIX and Linux users and computers are managed through the Active Directory Users and Computers Microsoft Management Console (MMC) snap-in.

– Kerberos is the protocol used to secure LDAP traffic.

– Performance is tuned to work effectively with Active Directory.


Kerberos

• The Kerberos protocol is a standard designed to provide strong authentication within a client/server network environment.

• Kerberos network messages are encrypted and decrypted using algorithms that make them very difficult to decode back into their original form.

• Kerberos uses a number of terms:

– Principal: all entities within Kerberos, including users, computers, and services, are known as principals. Principal names are unique.

– Realm: a principal is a member of a realm.

– Ticket: the fundamental unit of Kerberos authentication. It is a carefully constructed message containing the authentication information, which is passed between computers.

– Key Distribution Center: the Key Distribution Center (KDC) is made up of three components:
  – a database of principals containing users, computers, and services;
  – an Authentication Server that issues Ticket Granting Tickets (TGTs);
  – a Ticket Granting Service (TGS) that issues service tickets granting clients access to specific services.
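To tie Kerberos back to Hadoop in code (an illustrative sketch, not from the original deck): a client on a Kerberized cluster typically logs in through Hadoop's UserGroupInformation API with a principal and keytab. The keytab path below is invented for illustration, and the principal merely echoes the POC service account and realm mentioned later in the deck.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberizedHdfsClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell the Hadoop client that the cluster expects Kerberos authentication.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Log in with a principal and keytab (illustrative values only);
    // under the hood this obtains a TGT from the KDC described above.
    UserGroupInformation.loginUserFromKeytab(
        "servicegtminf@US-POCLAB.DELLPOC.COM",
        "/etc/security/keytabs/servicegtminf.keytab");

    System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());

    // Subsequent HDFS calls use service tickets issued by the TGS.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Home directory: " + fs.getHomeDirectory());
  }
}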


Kerberos


Will O’Brian – Active Directory Services


Active Directory

• The US-POCLAB.DELLPOC.COM Active Directory (AD) Domain was utilized for the Cloudera setup.

• A Hadoop Organizational Unit (OU) was manually created under us-poclab.dellpoc.com/Unix/Servers.

• A “parent” Service Account (Servicegtminf) was manually created under the us-poclab.dellpoc.com/Service Accounts OU.


Active Directory

• CM uses the service account “servicegtminf”.

• The servicegtminf account was given rights to create/delete accounts within the us-poclab.dellpoc.com/Hadoop OU, as well as Full Control rights over any descendant objects (accounts).

• Service accounts are created by CM by changing the user principal name.

• Account “serviceARFSqfwFob” is configured to be utilized for the “sentry” service running on ausgtmhadoop07.


Active Directory


Deepak Gattala – Architect


Word Count Quiz: Which one would you choose?

Using MapReduce:

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts for each word.
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // Job configuration and submission (classic org.apache.hadoop.mapred API).
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Using Hive:

Select word, count(*) from file_table group by word;

Using Pig:

a = load '/user/hue/word_count_text.txt';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/hue/pig_wordcount';


Hive

• Facebook uses Hadoop extensively and was looking for a way to give non-Java programmers access to the data in its Hadoop clusters:
– data analysts, statisticians, data scientists, etc.

• In Hive, a SQL SELECT statement goes through a MapReduce translator:
– it takes Hive queries, turns them into Java MapReduce code, and submits that code to the cluster;
– it displays the results back to the user. Note: not all SQL works!

• Hive is much easier to learn than Java-based MapReduce:
– writing HiveQL queries is much faster than writing the equivalent Java code;
– many people already know SQL and can rapidly start using Hive to query and manipulate data in the cluster.
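As a hedged sketch (not part of the original deck) of how an application hands HiveQL to the cluster, the standard HiveServer2 JDBC driver can run the word-count query from the quiz slide; the hostname, database, and credentials below are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWordCountQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver from the Hive client libraries.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Hypothetical host and database; on a Kerberized cluster the URL would
    // also carry a principal parameter.
    String url = "jdbc:hive2://hiveserver2.example.com:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "demo_user", "");
         Statement stmt = conn.createStatement();
         // Same word-count query shown on the quiz slide; Hive compiles it to MapReduce.
         ResultSet rs = stmt.executeQuery(
             "SELECT word, count(*) FROM file_table GROUP BY word")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}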


Hive - Authorization

• CREATE ROLE <role_name>;

• DROP ROLE <role_name>;

• GRANT ROLE role_name [, role_name] TO GROUP <groupName> [,GROUP <groupName>];

• REVOKE ROLE role_name [, role_name] FROM GROUP <groupName> [,GROUP <groupName>];

• GRANT <PRIVILEGE> [, <PRIVILEGE> ] ON <OBJECT> <object_name> TO ROLE <roleName> [,ROLE <roleName>];

• REVOKE <PRIVILEGE> [, <PRIVILEGE> ] ON <OBJECT> <object_name> FROM ROLE <roleName> [,ROLE <roleName>];

• The POC uses the following groups:
– gtm_hdp_inf_dev – Hive group used for the POC
– gtm_hdp_inf_adm – Cloudera Manager admin group


Hive - Authorization

• The object hierarchy where security can be applied is quite granular (down to individual databases, tables, and views).


DELL ABI Use Cases

• SAIE (Support Assist Intelligence Engine) [Design & Architecture]
– Teradata Appliance (PROD) & home-grown (Hortonworks 2.1) – DEV/SIT

• DCCMT/NGMT Hadoop Reporting [POC]
– Cloudera CDH 5.1.2 [upgrade to CDH 5.2 due soon]

• Server log analysis POC on Hadoop [POC]
– Cloudera CDH 5.1.2 [upgrade to CDH 5.2 due soon]

• Big Data Edition ETL use case [POC]
– Informatica 9.6.1 & Cloudera CDH 5.1.2 [upgrade to CDH 5.2 due soon]

• MAW (Marketing Analytics Workbench) [Beta Production]
– Teradata Appliance HDP 1.3.2 [upgrade to HDP 2.1 due]

• RainStor – archival strategy [In Production]
– Cloudera CDH 4.2


Cloudera Manager

• Cloudera Manager provides a web interface for cluster management.


Questions???


Cloudera Hadoop Demo
