Introduction To Hadoop Ecosystem
InSemble Inc. http://www.insemble.com
Agenda
1. What is Big Data?
2. Relevance to your Enterprise
3. Hadoop Ecosystem
4. Use Cases & Java Developer Fit
5. Demo
Big Data Definitions
- Wikipedia defines it as "data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process the data within a tolerable elapsed time."
- Gartner defines it as data with the following characteristics:
  - High velocity
  - High variety
  - High volume
- Another definition: "Big Data is large-volume, unstructured data which cannot be handled by traditional database management systems."
Why a Game Changer
- Schema on read
  - The data is interpreted at processing time
  - Keys and values are not intrinsic properties of the data; they are chosen by the person analyzing it
- Move code to data
  - In traditional systems we bring the data to the code, and I/O becomes a bottleneck
  - With hand-built distributed systems we have to deal with our own checkpointing and recovery
- More data beats better algorithms
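The schema-on-read idea can be sketched in a few lines: the raw bytes are stored untouched, and each consumer applies its own interpretation at processing time. The log line and parsers below are hypothetical, purely to illustrate two readers choosing different keys and values from the same data.

```python
# Schema-on-read sketch: one raw record, two interpretations chosen at read time.
raw = "2015-03-01T10:00:00 GET /index.html 200"

def as_request(line):
    # One reader treats the record as an HTTP request (method + path).
    ts, method, path, status = line.split()
    return {"method": method, "path": path}

def as_status(line):
    # Another reader, over the same bytes, only cares about the status code.
    return int(line.split()[-1])

request = as_request(raw)
status = as_status(raw)
```

Neither schema is stored with the data; a third analysis could pick entirely different keys tomorrow without re-ingesting anything.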
Enterprise Relevance
- Missed opportunities
  - Channels
  - Data that is analyzed
- The constraint was high cost
  - Storage
  - Processing
- Future-proof your business
  - Schema on read
  - The access pattern is not as relevant
  - Not just future-proofing your architecture
Motivation and History
- Disk access speeds have not caught up with storage capacities
- We need a high-speed parallel processing platform to process large datasets on a distributed file system
- Google published its MapReduce architecture in 2004
- The MapReduce framework:
  - Split the query, distribute it and process it in parallel (Map step)
  - Gather the results and deliver them (Reduce step)
- An Apache open source project called Hadoop implemented the MapReduce framework
  - "Software library that gives users the ability to process large datasets across clusters of commodity hardware in a reliable, fault-tolerant manner using a simple programming model"
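The Map and Reduce steps described above can be simulated in-process; word count is the canonical example. This is a toy single-machine sketch of the model, not the Hadoop API — in real Hadoop the shuffle happens across the cluster between the two phases.

```python
from collections import defaultdict

def map_step(document):
    # Map: emit (key, value) pairs -- here (word, 1) for each word.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # Reduce: combine all values for one key into the final result.
    return key, sum(values)

docs = ["big data beats better algorithms", "big data big clusters"]
pairs = [p for d in docs for p in map_step(d)]
counts = dict(reduce_step(k, v) for k, v in shuffle(pairs).items())
# counts["big"] == 3, counts["data"] == 2
```

Because each map call sees one document and each reduce call sees one key, both phases parallelize trivially across machines — which is the whole point of the restricted model.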
Hadoop Ecosystem
Source: Apache Hadoop Documentation
HDFS Architecture
Source: Hadoop Definitive Guide by Tom White
MapReduce framework
Hadoop 2 with YARN
Source: Hadoop In Practice by Alex Holmes
MapReduce
- Restrictive programming model
  - Keys and values
  - Map and reduce functions, with the only coordination being the passing of keys and values
- But still considered a general data-processing tool
  - Google used it for production search indexes
  - Image analysis
  - Machine learning algorithms
Pig
- High-level scripting language
- Data flow language
  - Good for describing data analysis problems as data flows
  - Can plug in UDFs written in other languages such as Java, Scala and JRuby
  - Other languages can execute Pig scripts
  - Predominant use cases:
    - Production ETL jobs
    - Data exploration by analysts
- A higher-level abstraction over:
  - MapReduce
  - Tez
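To make "data flow language" concrete, a Pig script is a pipeline of relational steps (LOAD → FILTER → GROUP → aggregate). Below is a rough Python analogue of such a flow over hypothetical web-server log data; the Pig equivalent is sketched in the comments for comparison.

```python
# Pig-style data flow, stage by stage (sample data is made up):
#   logs    = LOAD 'logs' AS (user, status);
#   errors  = FILTER logs BY status >= 500;
#   by_user = GROUP errors BY user;
#   counts  = FOREACH by_user GENERATE group, COUNT(errors);
from collections import Counter

logs = [("alice", 200), ("bob", 500), ("alice", 503), ("bob", 502)]

# FILTER: keep only server errors.
errors = [(user, status) for user, status in logs if status >= 500]

# GROUP + COUNT: errors per user.
counts = Counter(user for user, _ in errors)
# counts == {"bob": 2, "alice": 1}
```

Pig compiles exactly this kind of pipeline down to MapReduce (or Tez) jobs, which is what makes it useful for production ETL without hand-writing mappers and reducers.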
Hive
- Framework for a data warehouse on top of Hadoop
  - SQL access to data in HDFS
  - Queries for analysis
- Batch oriented; faster engines address this:
  - Impala
  - Tez
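Hive's value is that analysts write ordinary SQL over files in HDFS. To illustrate the kind of aggregation it serves, here is the same query executed against an in-memory SQLite table — SQLite is only a stand-in, and the table and data are hypothetical; Hive would run this as HiveQL over HDFS files.

```python
import sqlite3

# Stand-in warehouse table: page view counts (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("home", 10), ("about", 3), ("home", 7)])

# The HiveQL would look identical; Hive compiles it to batch jobs on HDFS.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY 2 DESC"
).fetchall()
# rows == [("home", 17), ("about", 3)]
```

The batch-oriented caveat matters here: in classic Hive even this small GROUP BY becomes a MapReduce job, which is why Impala and Tez were built for interactive latencies.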
HBase
- NoSQL database on Hadoop
  - Based on Google's BigTable
  - Column-oriented database on HDFS
- Regular interactive/update use cases
  - Real-time read/write random access
  - Row updates are atomic
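HBase's data model addresses cells by row key and "column family:qualifier", with mutations atomic per row. The toy in-memory sketch below mimics that model (the table, row keys and column names are made up); it is only the addressing scheme, not HBase's storage, versioning or distribution.

```python
# Toy sketch of the HBase data model: {row_key: {"family:qualifier": value}}.
table = {}

def put(row, updates):
    # All cells in one put land together, mimicking HBase's
    # atomic per-row mutation.
    table.setdefault(row, {}).update(updates)

def get(row, column):
    # Random-access read of a single cell.
    return table.get(row, {}).get(column)

put("user#42", {"info:name": "Ada", "info:city": "Chicago"})
put("user#42", {"info:city": "Boston"})   # real-time update of one cell
```

Designing the row key well (here a hypothetical `user#<id>` scheme) is the main modeling decision in HBase, since rows are the unit of both atomicity and locality.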
Sqoop
- Imports/exports data between relational databases and Hadoop
  - HDFS, Hive, HBase
  - Couchbase
  - Uses the JDBC driver to get the data types of the columns
  - Handles serialization/deserialization
- The actual load is done internally by MapReduce jobs
Apache Flume
Source: Apache Flume Documentation
Real-time Streaming with Kafka & Storm
- Kafka
  - Pub/sub messaging using topics
  - Kafka producers publish to topics
- Storm
  - Real-time computation engine
  - Consumes data from spouts and passes data to bolts
  - Can run on top of YARN
  - Uses ZooKeeper; implemented in Clojure
  - You define workflows as directed acyclic graphs (topologies)
  - True stream processing engine, so used for low-latency ingestion
  - Can support at-most-once, at-least-once and exactly-once semantics
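The spout-and-bolt vocabulary is easiest to see in miniature. Below, a spout emits a stream of tuples and two chained bolts process them — a single-process analogue of a Storm topology, with a hypothetical word-count pipeline; real Storm runs each component in parallel across a cluster.

```python
from collections import Counter

def sentence_spout():
    # Spout: the source of the stream (here, a fixed sample feed).
    yield "storm processes streams"
    yield "storm runs on yarn"

def split_bolt(sentences):
    # Bolt 1: tokenize each sentence into word tuples.
    for s in sentences:
        for word in s.split():
            yield word

def count_bolt(words):
    # Bolt 2: aggregate the word stream.
    return Counter(words)

# Wiring spout -> bolt -> bolt forms the directed acyclic graph (topology).
counts = count_bolt(split_bolt(sentence_spout()))
```

The delivery guarantees in the last bullet come from how a topology acknowledges tuples: tracked acks give at-least-once, no tracking gives at-most-once, and Trident layers exactly-once on top.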
Apache Spark
- High-speed general-purpose engine for large-scale data processing
- Does not need Hadoop; it just needs a shared file system such as S3, NFS or HDFS
- Spark can run on YARN
- Spark is implemented in Scala
- Has a streaming API, but is really a batch processing engine that micro-batches
- Supports exactly-once semantics, but under some failure conditions degrades to at-least-once
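"Micro-batching" means Spark Streaming chops the incoming stream into small batches and runs the same batch computation on each interval. The toy illustration below uses a hypothetical stream of numbers and a fixed batch size; it shows only the chopping, not Spark's scheduling or fault tolerance.

```python
# Hypothetical event stream and batch interval.
stream = [3, 1, 4, 1, 5, 9, 2, 6]
batch_size = 3

def micro_batches(events, size):
    # Chop the stream into fixed-size micro-batches, as Spark Streaming
    # chops by time interval.
    for i in range(0, len(events), size):
        yield events[i:i + size]

# The same batch computation (here, a sum) runs on every micro-batch.
batch_sums = [sum(batch) for batch in micro_batches(stream, batch_size)]
# batch_sums == [8, 15, 8]
```

This is why Storm is described above as the "true" streaming engine: Spark's per-batch latency is the batch interval, whereas Storm processes each tuple as it arrives.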
Common Use Cases
- Queries over detail-record data
- Queries over longer-duration data
- Diagnostic/metrics/web log data analysis
- 360-degree view incorporating clickstream data
- Reports that cannot be generated within the needed timeframe
- Capturing and analyzing sensor data
- Analyzing large volumes of image data
- Building user profiles from large volumes of data
- Sentiment analysis
- Recommendation engines
- Risk analysis
Securing Hadoop Data
Source: http://www.voltage.com
Closing
- The technology is in a hyper-growth phase
- Complex
- Tools, productivity and monitoring products are still evolving
- Start with a pilot project
- Treat it as an incremental journey
Demo — Start HDP Cluster in AWS
- 6 EC2 machines in total, type t2.medium
- RHEL 6.5, 3.75 GB memory, 10 GB hard drive
- 1 Ambari server + a 5-node cluster
- 1 NameNode + 1 secondary NameNode + 3 DataNodes
- Public dataset from https://data.cityofchicago.org
Managing Hadoop Cluster using Ambari
- In Indian languages, "Ambari" is the name for a seat carried on top of an elephant
- Ambari is an Apache open source project used to:
  - Provision a Hadoop cluster
  - Manage a Hadoop cluster
  - Monitor a Hadoop cluster
- Agent-based deployment model
Demo — Hue
- Hue provides a web interface for analyzing data in Hadoop
- Use HCatalog to create a table
- Demo a Hive script
- Demo a Pig script
Demo — Advanced Hive
- Use a built-in UDF to extract latitude and longitude info
- Use a custom UDF (Scala) to calculate the distance between two locations
- Join the library and school tables and find libraries within 1 mile of each school
- Use Tableau to connect to Hive through the ODBC driver to plot socioeconomic data
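The demo's distance UDF is not shown in the slides; the standard way to compute distance between two (latitude, longitude) points is the haversine great-circle formula, sketched below in Python as a guess at what the Scala UDF would implement.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points given in decimal degrees.
    r = 3959.0  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

zero = haversine_miles(41.88, -87.63, 41.88, -87.63)   # same point -> 0
one_degree = haversine_miles(0.0, 0.0, 1.0, 0.0)        # ~69 miles per degree of latitude
```

Registered as a Hive UDF, this function would sit directly in the join's WHERE clause to filter library-school pairs to those within 1 mile.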