Introduction To Hadoop Ecosystem
InSemble Inc. http://www.insemble.com
Agenda
1. What is Big Data?
2. Relevance to your Enterprise
3. Hadoop Ecosystem
4. Use Cases & Java Developer Fit
5. Demo
Big Data Definitions
- Wikipedia defines it as "data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process the data within a tolerable elapsed time."
- Gartner defines it as data with the following characteristics:
  - High velocity
  - High variety
  - High volume
- Another definition: "Big Data is large-volume, unstructured data which cannot be handled by traditional database management systems."
Why a Game Changer
- Schema on read
  - The data is interpreted at processing time
  - Keys and values are not intrinsic properties of the data; they are chosen by the person analyzing it
- Move code to data
  - In traditional systems we bring the data to the code, and I/O becomes a bottleneck
  - With hand-built distributed systems we have to deal with our own checkpointing and recovery
- More data beats better algorithms
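The schema-on-read idea can be sketched in a few lines: the raw bytes are stored untouched, and each consumer applies its own interpretation at processing time. The log line and parsers below are hypothetical, purely to illustrate two readers choosing different keys and values from the same data.

```python
# Schema-on-read sketch: one raw record, two interpretations chosen at read time.
raw = "2015-03-01T10:00:00 GET /index.html 200"

def as_request(line):
    # One reader treats the record as an HTTP request (method + path).
    ts, method, path, status = line.split()
    return {"method": method, "path": path}

def as_status(line):
    # Another reader, over the same bytes, only cares about the status code.
    return int(line.split()[-1])

request = as_request(raw)
status = as_status(raw)
```

Neither schema is stored with the data; a third analysis could pick entirely different keys tomorrow without re-ingesting anything.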
Enterprise Relevance
- Missed opportunities
  - Channels
  - Data that is analyzed
- The constraint was high cost
  - Storage
  - Processing
- Future-proof your business
  - Schema on read
  - The access pattern is not as relevant
  - Not just future-proofing your architecture
Motivation and History
- Disk access speeds have not caught up with storage capacities
- We need a high-speed parallel processing platform to process large datasets on a distributed file system
- Google published its MapReduce architecture in 2004
- The MapReduce framework:
  - Split the query, distribute it and process it in parallel (Map step)
  - Gather the results and deliver them (Reduce step)
- An Apache open source project called Hadoop implemented the MapReduce framework
  - "Software library that gives users the ability to process large datasets across clusters of commodity hardware in a reliable, fault-tolerant manner using a simple programming model"
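The Map and Reduce steps described above can be simulated in-process; word count is the canonical example. This is a toy single-machine sketch of the model, not the Hadoop API — in real Hadoop the shuffle happens across the cluster between the two phases.

```python
from collections import defaultdict

def map_step(document):
    # Map: emit (key, value) pairs -- here (word, 1) for each word.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # Reduce: combine all values for one key into the final result.
    return key, sum(values)

docs = ["big data beats better algorithms", "big data big clusters"]
pairs = [p for d in docs for p in map_step(d)]
counts = dict(reduce_step(k, v) for k, v in shuffle(pairs).items())
# counts["big"] == 3, counts["data"] == 2
```

Because each map call sees one document and each reduce call sees one key, both phases parallelize trivially across machines — which is the whole point of the restricted model.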
Hadoop Ecosystem
Source: Apache Hadoop Documentation
HDFS Architecture
Source: Hadoop Definitive Guide by Tom White
MapReduce framework
Hadoop 2 with YARN
Source: Hadoop In Practice by Alex Holmes
MapReduce
- Restrictive programming model
  - Keys and values
  - Map and reduce functions, with the only coordination being the passing of keys and values
- But still considered a general data-processing tool
  - Google used it for production search indexes
  - Image analysis
  - Machine learning algorithms
Pig
- High-level scripting language
- Data flow language
  - Good for describing data analysis problems as data flows
  - Can plug in UDFs written in other languages such as Java, Scala and JRuby
  - Other languages can execute Pig scripts
  - Predominant use cases:
    - Production ETL jobs
    - Data exploration by analysts
- A higher-level abstraction over:
  - MapReduce
  - Tez
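To make "data flow language" concrete, a Pig script is a pipeline of relational steps (LOAD → FILTER → GROUP → aggregate). Below is a rough Python analogue of such a flow over hypothetical web-server log data; the Pig equivalent is sketched in the comments for comparison.

```python
# Pig-style data flow, stage by stage (sample data is made up):
#   logs    = LOAD 'logs' AS (user, status);
#   errors  = FILTER logs BY status >= 500;
#   by_user = GROUP errors BY user;
#   counts  = FOREACH by_user GENERATE group, COUNT(errors);
from collections import Counter

logs = [("alice", 200), ("bob", 500), ("alice", 503), ("bob", 502)]

# FILTER: keep only server errors.
errors = [(user, status) for user, status in logs if status >= 500]

# GROUP + COUNT: errors per user.
counts = Counter(user for user, _ in errors)
# counts == {"bob": 2, "alice": 1}
```

Pig compiles exactly this kind of pipeline down to MapReduce (or Tez) jobs, which is what makes it useful for production ETL without hand-writing mappers and reducers.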
Hive
- Framework for a data warehouse on top of Hadoop
  - SQL access to data in HDFS
  - Queries for analysis
- Batch oriented; faster engines address this:
  - Impala
  - Tez
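Hive's value is that analysts write ordinary SQL over files in HDFS. To illustrate the kind of aggregation it serves, here is the same query executed against an in-memory SQLite table — SQLite is only a stand-in, and the table and data are hypothetical; Hive would run this as HiveQL over HDFS files.

```python
import sqlite3

# Stand-in warehouse table: page view counts (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("home", 10), ("about", 3), ("home", 7)])

# The HiveQL would look identical; Hive compiles it to batch jobs on HDFS.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY 2 DESC"
).fetchall()
# rows == [("home", 17), ("about", 3)]
```

The batch-oriented caveat matters here: in classic Hive even this small GROUP BY becomes a MapReduce job, which is why Impala and Tez were built for interactive latencies.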
HBase
- NoSQL database on Hadoop
  - Based on Google's BigTable
  - Column-oriented database on HDFS
- Regular interactive/update use cases
  - Real-time read/write random access
  - Row updates are atomic
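HBase's data model addresses cells by row key and "column family:qualifier", with mutations atomic per row. The toy in-memory sketch below mimics that model (the table, row keys and column names are made up); it is only the addressing scheme, not HBase's storage, versioning or distribution.

```python
# Toy sketch of the HBase data model: {row_key: {"family:qualifier": value}}.
table = {}

def put(row, updates):
    # All cells in one put land together, mimicking HBase's
    # atomic per-row mutation.
    table.setdefault(row, {}).update(updates)

def get(row, column):
    # Random-access read of a single cell.
    return table.get(row, {}).get(column)

put("user#42", {"info:name": "Ada", "info:city": "Chicago"})
put("user#42", {"info:city": "Boston"})   # real-time update of one cell
```

Designing the row key well (here a hypothetical `user#<id>` scheme) is the main modeling decision in HBase, since rows are the unit of both atomicity and locality.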
Sqoop
- Imports/exports data between relational databases and Hadoop
  - HDFS, Hive, HBase
  - Couchbase
  - Uses the JDBC driver to get the data types of the columns
  - Handles serialization/deserialization
- The actual load is done internally by MapReduce jobs
Apache Flume
Source: Apache Flume Documentation
Real-time Streaming with Kafka & Storm
- Kafka
  - Pub/sub messaging using topics
  - Kafka producers publish to topics
- Storm
  - Real-time computation engine
  - Consumes data from spouts and passes data to bolts
  - Can run on top of YARN
  - Uses ZooKeeper; implemented in Clojure
  - You define workflows as directed acyclic graphs (topologies)
  - True stream processing engine, so used for low-latency ingestion
  - Can support at-most-once, at-least-once and exactly-once semantics
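The spout-and-bolt vocabulary is easiest to see in miniature. Below, a spout emits a stream of tuples and two chained bolts process them — a single-process analogue of a Storm topology, with a hypothetical word-count pipeline; real Storm runs each component in parallel across a cluster.

```python
from collections import Counter

def sentence_spout():
    # Spout: the source of the stream (here, a fixed sample feed).
    yield "storm processes streams"
    yield "storm runs on yarn"

def split_bolt(sentences):
    # Bolt 1: tokenize each sentence into word tuples.
    for s in sentences:
        for word in s.split():
            yield word

def count_bolt(words):
    # Bolt 2: aggregate the word stream.
    return Counter(words)

# Wiring spout -> bolt -> bolt forms the directed acyclic graph (topology).
counts = count_bolt(split_bolt(sentence_spout()))
```

The delivery guarantees in the last bullet come from how a topology acknowledges tuples: tracked acks give at-least-once, no tracking gives at-most-once, and Trident layers exactly-once on top.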
Apache Spark
- High-speed general-purpose engine for large-scale data processing
- Does not need Hadoop; it just needs a shared file system such as S3, NFS or HDFS
- Spark can run on YARN
- Spark is implemented in Scala
- Has a streaming API, but is really a batch processing engine that micro-batches
- Supports exactly-once semantics, but under some failure conditions degrades to at-least-once
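"Micro-batching" means Spark Streaming chops the incoming stream into small batches and runs the same batch computation on each interval. The toy illustration below uses a hypothetical stream of numbers and a fixed batch size; it shows only the chopping, not Spark's scheduling or fault tolerance.

```python
# Hypothetical event stream and batch interval.
stream = [3, 1, 4, 1, 5, 9, 2, 6]
batch_size = 3

def micro_batches(events, size):
    # Chop the stream into fixed-size micro-batches, as Spark Streaming
    # chops by time interval.
    for i in range(0, len(events), size):
        yield events[i:i + size]

# The same batch computation (here, a sum) runs on every micro-batch.
batch_sums = [sum(batch) for batch in micro_batches(stream, batch_size)]
# batch_sums == [8, 15, 8]
```

This is why Storm is described above as the "true" streaming engine: Spark's per-batch latency is the batch interval, whereas Storm processes each tuple as it arrives.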
Common Use Cases
- Queries over detail-record data
- Queries over longer-duration data
- Diagnostic/metrics/web log data analysis
- 360-degree view incorporating clickstream data
- Reports that cannot be generated within the needed timeframe
- Capturing and analyzing sensor data
- Analyzing large volumes of image data
- Building user profiles from large volumes of data
- Sentiment analysis
- Recommendation engines
- Risk analysis
Securing Hadoop Data
Source: http://www.voltage.com
Closing
- The technology is in a hyper-growth phase
- Complex
- Tools, productivity and monitoring products are still evolving
- Start with a pilot project
- Treat it as an incremental journey
Demo — Start HDP Cluster in AWS
- 6 EC2 machines in total, type t2.medium
- RHEL 6.5, 3.75 GB memory, 10 GB hard drive
- 1 Ambari server + a 5-node cluster
- 1 NameNode + 1 secondary NameNode + 3 DataNodes
- Public dataset from https://data.cityofchicago.org
Managing Hadoop Cluster using Ambari
- In Indian languages, "Ambari" is the name for a seat carried on top of an elephant
- Ambari is an Apache open source project used to:
  - Provision a Hadoop cluster
  - Manage a Hadoop cluster
  - Monitor a Hadoop cluster
- Agent-based deployment model
Demo — Hue
- Hue provides a web interface for analyzing data in Hadoop
- Use HCatalog to create a table
- Demo a Hive script
- Demo a Pig script
Demo — Advanced Hive
- Use a built-in UDF to extract latitude and longitude info
- Use a custom UDF (Scala) to calculate the distance between two locations
- Join the library and school tables and find libraries within 1 mile of each school
- Use Tableau to connect to Hive through the ODBC driver to plot socioeconomic data
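The demo's distance UDF is not shown in the slides; the standard way to compute distance between two (latitude, longitude) points is the haversine great-circle formula, sketched below in Python as a guess at what the Scala UDF would implement.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points given in decimal degrees.
    r = 3959.0  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

zero = haversine_miles(41.88, -87.63, 41.88, -87.63)   # same point -> 0
one_degree = haversine_miles(0.0, 0.0, 1.0, 0.0)        # ~69 miles per degree of latitude
```

Registered as a Hive UDF, this function would sit directly in the join's WHERE clause to filter library-school pairs to those within 1 mile.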