Big Data Technologies - Hadoop
TRANSCRIPT
A new way to store and analyze data
Sandesh Deshmane
• What is Hadoop?
• Why, Where, When?
• Benefits of Hadoop
• How Hadoop Works?
• Hadoop Architecture
• HDFS
• Hadoop MapReduce
• Installation & Execution
• Demo
Topics Covered
In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but more systems of computers.
—Grace Hopper
History
• The size of the digital universe was estimated at 0.18 zettabytes in 2006 and had grown to 3 zettabytes by 2012
1 zettabyte = 10^21 bytes = 1,000 exabytes = 1 million petabytes = 1 billion terabytes
• The New York Stock Exchange generates 1 TB of data per day
• Facebook stores around 10 billion photos, roughly 1 petabyte
• The Internet Archive stores about 1 petabyte of data, and it is growing by around 20 TB per month
Background
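As a sanity check on the unit conversions on this slide, here is a tiny plain-Java snippet (not part of the original deck; the class name is my own) verifying the zettabyte arithmetic:

```java
public class UnitConversions {
    // 1 zettabyte expressed in bytes: 10^21
    static final double ZETTABYTE = 1e21;

    public static void main(String[] args) {
        System.out.println(ZETTABYTE / 1e18); // exabytes in a zettabyte: 1,000
        System.out.println(ZETTABYTE / 1e15); // petabytes in a zettabyte: 1 million
        System.out.println(ZETTABYTE / 1e12); // terabytes in a zettabyte: 1 billion
    }
}
```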
• Created by Doug Cutting
• Open-source project of the Apache Software Foundation
• Consists of two key services:
  a. Reliable data storage using the Hadoop Distributed File System (HDFS)
  b. High-performance parallel data processing using a technique called MapReduce
• Hadoop runs large-scale, high-performance processing jobs reliably, even in the face of system changes or failures
What is Hadoop?
• Need to process 100 TB datasets
• On 1 node: scanning @ 50 MB/s takes ~23 days
• On a 1000-node cluster: scanning @ 50 MB/s takes ~33 minutes
• Need an efficient, reliable, and usable framework
Hadoop, Why?
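The 23-day and 33-minute figures follow from simple arithmetic. A small illustrative Java sketch (the class and method names are my own, assuming a sustained 50 MB/s scan rate per node):

```java
public class ScanTime {
    // Seconds needed to scan a dataset spread evenly across `nodes` machines,
    // each scanning at 50 MB/s.
    static double scanSeconds(double datasetTerabytes, int nodes) {
        double bytes = datasetTerabytes * 1e12;  // TB -> bytes
        double bytesPerSecond = 50e6 * nodes;    // 50 MB/s per node
        return bytes / bytesPerSecond;
    }

    public static void main(String[] args) {
        // 100 TB on 1 node: ~2,000,000 s, i.e. about 23 days
        System.out.printf("1 node: %.1f days%n", scanSeconds(100, 1) / 86400);
        // 100 TB on 1000 nodes: ~2,000 s, i.e. about 33 minutes
        System.out.printf("1000 nodes: %.1f minutes%n", scanSeconds(100, 1000) / 60);
    }
}
```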
Where
• Batch data processing, not real-time / user-facing (e.g., document analysis and indexing, web graphs and crawling)
• Highly parallel, data-intensive distributed applications
• Very large production deployments
When
• Processing lots of unstructured data
• When your processing can easily be made parallel
• When running batch jobs is acceptable
• When you have access to lots of cheap hardware
Where and When Hadoop?
• Runs on cheap commodity hardware
• Automatically handles data replication and node failure
• It does the hard work, so you can focus on processing data
• Cost-saving, efficient, and reliable data processing
Benefits of Hadoop
• Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
• In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
• Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
How Hadoop Works?
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing
Hadoop Consists of:
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• HDFS: A distributed file system that provides high throughput access to application data.
• MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Hadoop Architecture
[Diagram: data flows from Web Servers through Scribe Servers and Network Storage into the Hadoop Cluster, with results loaded into Oracle DB and MySQL]
Hadoop Architecture
• Java
• Python
• Ruby
• C++ (Hadoop Pipes)
Supported Languages
• Known as the Hadoop Distributed File System
• The primary storage system for Hadoop applications
• Multiple replicas of data blocks are distributed across compute nodes for reliability
• Files are stored on multiple machines for durability and high availability
HDFS
• Distributed File System (DFS): holds a large amount of data and provides access to it for many clients distributed across a network, e.g., NFS
• HDFS stores much larger amounts of information than a conventional DFS
• HDFS stores data reliably
• HDFS provides fast, scalable access to this information for a large number of clients in the cluster
DFS vs. HDFS
• Optimized for long sequential reads
• Data is written once and read multiple times; no appends possible
• Large files and sequential reads mean no local caching of data
• Data replication
HDFS
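The data-replication bullet has a direct storage cost. A back-of-the-envelope sketch (the helper below is illustrative only; HDFS's replication factor is configurable via dfs.replication, with 3 as the usual default):

```java
public class ReplicationOverhead {
    // Raw disk bytes consumed when every block is replicated
    // `replicationFactor` times across the cluster.
    static long rawBytesNeeded(long logicalBytes, int replicationFactor) {
        return logicalBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long onePetabyte = 1_000_000_000_000_000L;
        // With 3x replication, 1 PB of data occupies 3 PB of raw disk
        System.out.println(rawBytesNeeded(onePetabyte, 3));
    }
}
```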
HDFS Architecture
• Block-structured file system
• Files are divided into blocks and stored
• Each individual machine in the cluster is a DataNode
• The default block size is 64 MB
• Information about blocks is stored in metadata
• All this metadata is stored on a machine called the NameNode
HDFS Architecture
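Given the 64 MB default block size, the number of blocks a file occupies is a simple ceiling division. A hypothetical helper to illustrate (not a Hadoop API):

```java
public class BlockCount {
    // Number of HDFS blocks a file occupies for a given block size
    // (64 MB was the default in the Hadoop versions this deck covers).
    static long blocksFor(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB
        // A 1 GB file spans 16 full 64 MB blocks
        System.out.println(blocksFor(1024L * 1024 * 1024, blockSize));
        // A 100 MB file spans 2 blocks: one full, one partial
        System.out.println(blocksFor(100L * 1024 * 1024, blockSize));
    }
}
```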
DataNode and NameNode
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://your.server.name.com:9000</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/username/hdfs/data</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/username/hdfs/name</value>
</property>
</configuration>
HDFS Config File
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSHelloWorld {

  public static final String theFilename = "hello.txt";
  public static final String message = "Hello, world!\n";

  public static void main(String[] args) throws IOException {

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path filenamePath = new Path(theFilename);

    try {
      if (fs.exists(filenamePath)) {
        // remove the file first
        fs.delete(filenamePath);
      }

      FSDataOutputStream out = fs.create(filenamePath);
      out.writeUTF(message);
      out.close();

      FSDataInputStream in = fs.open(filenamePath);
      String messageIn = in.readUTF();
      System.out.print(messageIn);
      in.close();
    } catch (IOException ioe) {
      System.err.println("IOException during operation: " + ioe.toString());
      System.exit(1);
    }
  }
}
Sample Java Code to Read/Write from HDFS
Map Reduce
Cluster Look
Map
Reduce
• HDFS handles the distributed file system layer
• MapReduce is how we process the data
• MapReduce daemons
  - JobTracker
  - TaskTracker
• Goals
  - Distribute the reading and processing of data
  - Localize the processing when possible
  - Share as little data as possible while processing
MapReduce
MapReduce
• One per cluster (the “master node”)
• Takes jobs from clients
• Splits work into “tasks”
• Distributes “tasks” to TaskTrackers
• Monitors progress, deals with failures
Job Tracker
• Many per cluster (the “slave nodes”)
• Does the actual work, executes the code for the job
• Talks regularly with JobTracker
• Launches child process when given a task
• Reports progress of running “task” back to JobTracker
Task Tracker
• Client submits a job: “I want to count the occurrences of each word.” We assume the data to process is already in HDFS.
• The JobTracker receives the job
  - It queries the NameNode for the number of blocks in the file
  - The job is split into tasks: one map task per block, and as many reduce tasks as specified in the job
• Each TaskTracker checks in regularly with the JobTracker: “Is there any work for me?”
  - If the JobTracker has a map task for which the TaskTracker holds a local block of the file being processed, the TaskTracker is given that task
Anatomy of Map Reduce Job
Map Reduce Job – Big Picture
Client Submits to JobTracker
JobTracker Queries Name Node for Block Info
Job tracker Defines Job as Collection of Tasks
Task Trackers Checking in are Assigned tasks
Task Trackers Checking in are Assigned tasks
• Read text files and count how often words occur
  - The input is text files
  - The output is a text file; each line: word, tab, count
• Map: produce a (word, 1) pair for each word
• Reduce: for each word, sum up the counts
Example of MapReduce - Word Count
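Before looking at the Hadoop classes on the following slides, the same word-count logic can be sketched in plain Java with no Hadoop dependency (class and method names here are illustrative only):

```java
import java.util.*;

public class WordCountSketch {
    // "Map" phase: emit a (word, 1) pair for every token in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // "Reduce" phase: sum the counts for each distinct word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs)
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("the quick brown fox", "the lazy dog");
        // prints {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
        System.out.println(reduce(map(input)));
    }
}
```

In real Hadoop the framework shuffles and groups the (word, 1) pairs between the two phases; here the grouping is done by the `merge` call inside `reduce`.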
public static class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map (LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
{
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens())
{
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
Map Class
public static class ReduceClass extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
{
int sum = 0;
while (values.hasNext())
{
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Reduce Class
public void run(String inputPath, String outputPath) throws Exception
{
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  // the keys are words (strings)
  conf.setOutputKeyClass(Text.class);
  // the values are counts (ints)
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(ReduceClass.class);

  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));

  JobClient.runJob(conf);
}
Driver Class
import static org.mockito.Mockito.*;

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.OutputCollector;
import org.junit.*;

public class WordCountMapperTest {

  @Test
  public void processesValidRecord() throws IOException {
    MapClass mapper = new MapClass();
    Text value = new Text("test test");
    OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);
    mapper.map(null, value, output, null);
    // the mapper emits ("test", 1) once per occurrence, so twice here
    verify(output, times(2)).collect(new Text("test"), new IntWritable(1));
  }
}
Junit For Mapper
Junit for Reducer
import static org.mockito.Mockito.*;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.OutputCollector;
import org.junit.*;

public class WordCountReducerTest {

  @Test
  public void sumsCountsInValues() throws IOException {
    ReduceClass reducer = new ReduceClass();
    Text key = new Text("test");
    Iterator<IntWritable> values = Arrays.asList(
        new IntWritable(1), new IntWritable(1)).iterator();
    OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);
    reducer.reduce(key, values, output, null);
    verify(output).collect(key, new IntWritable(2));
  }
}
Installation:
• Requirements: Linux, Java 1.6, sshd
• Configure SSH for password-free authentication
• Unpack the Hadoop distribution
• Edit a few configuration files
• Format the DFS on the name node
• Start all the daemon processes

Execution:
• Compile your job into a JAR file
• Copy input data into HDFS
• Execute bin/hadoop jar with relevant args
• Monitor tasks via the web interface (optional)
• Examine the output when the job is complete
Let’s Go…
Demo
Hadoop Users
• Adobe
• Alibaba
• Amazon
• AOL
• Facebook
• Google
• IBM
Major Contributors
• Apache
• Cloudera
• Yahoo
Hadoop Community
• Apache Hadoop (http://hadoop.apache.org)
• Hadoop on Wikipedia (http://en.wikipedia.org/wiki/Hadoop)
• Free Search by Doug Cutting (http://cutting.wordpress.com)
• Hadoop and Distributed Computing at Yahoo! (http://developer.yahoo.com/hadoop)
• Cloudera - Apache Hadoop for the Enterprise (http://www.cloudera.com)
References