glusterfs and hadoop

1. GlusterFS and Hadoop
Shubhendu Tripathi, PSE, Red Hat

2. Agenda (06/22/15)
- What is BigData
- Hadoop and its Evolution
- Hadoop Architecture and Components
- Hadoop and GlusterFS (glusterfs-hadoop plugin)
- Advantages of using GlusterFS with Hadoop
- References

3. What is BigData
- Software solutions mostly capture, maintain and manage data
  - Storing data
  - Processing data
- Data size keeps growing in the current world
- Big data generators
  - Sensors
  - CCTV cameras
  - Social networks
  - Online shopping portals
  - Airlines
  - Hospitality


5. What is BigData
- 90% of the data that exists today was generated in the last 2 years
- 1990: HDD 1-20 GB, RAM 14-128 MB, speed 10 kbps
- 2014: HDD 0.5-1 TB, RAM 1-16 GB, speed 100 Mbps
- 3 factors which define BigData
  - Volume
  - Velocity
  - Variety (unstructured and semi-structured data)

6. What is BigData
- One option: a SAN (Storage Area Network). Store the data in data centers, fetch it when needed, and run the computation on it locally
- Computation is processor bound, so a single machine puts a limit on it
- As the data grows, more and more computation is needed, and it is no longer possible to perform it on a local machine
- Solution: sending the computation to the storage node and getting back the processed data is the better option (the computation is small compared with the data)
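A rough arithmetic sketch of why shipping the computation beats shipping the data. The 100 Mbps link is the 2014 figure from the earlier slide; the 1 TB data set and the 1 MB job size are assumptions chosen only for illustration.

```python
# Rough transfer-time arithmetic; the sizes below are illustrative assumptions.
LINK_MBPS = 100          # link speed quoted for 2014 in the earlier slide
DATA_BYTES = 10**12      # assumed data set size: 1 TB
JOB_BYTES = 10**6        # assumed size of the shipped computation: 1 MB


def transfer_seconds(num_bytes, link_mbps):
    """Time to push num_bytes over a link of link_mbps megabits per second."""
    return num_bytes * 8 / (link_mbps * 10**6)


data_hours = transfer_seconds(DATA_BYTES, LINK_MBPS) / 3600
job_seconds = transfer_seconds(JOB_BYTES, LINK_MBPS)
print(f"move data to compute: {data_hours:.1f} hours")     # 22.2 hours
print(f"move compute to data: {job_seconds:.2f} seconds")  # 0.08 seconds
```

Under these assumptions the data transfer alone takes most of a day, while the job arrives at the storage node almost instantly.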

7. Hadoop Evolution
- Started with Google white papers
  - GFS (Google File System), 2003: storage
  - MapReduce, 2004: computation
- Yahoo
  - HDFS (Hadoop Distributed File System), 2006-07
  - MapReduce (computation mechanism), 2007-08
- Doug Cutting and Michael Cafarella from Yahoo
- Logo: elephant
- Apache foundation (2005 Yahoo donated)

8. Hadoop Architecture / Components
- A framework of tools, not an application in its entirety
- Used to support running applications on BigData
- Open-source set of tools distributed under the Apache license
- Traditional approach for handling huge data
  - A powerful computer with big storage and computation capacity
  - Limited by the processing power of the computer as the data grows
- Hadoop approach
  - Breaks up the data into smaller pieces and distributes them to multiple computers
  - Breaks the computation into smaller pieces as well and distributes them
  - Combined results are returned
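The Hadoop approach (split the data, run a small computation on each piece, combine the results) can be sketched in a few lines. This is a single-process toy, not Hadoop code: the function names are hypothetical and the "workers" are just loop iterations.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(lines, num_chunks):
    """Break the input into roughly equal pieces, one per worker."""
    return [lines[i::num_chunks] for i in range(num_chunks)]


def map_chunk(chunk):
    """Per-worker computation: count the words in one piece of the data."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def combine(partials):
    """Merge the partial results into the final answer."""
    return reduce(lambda a, b: a + b, partials, Counter())


lines = ["gluster stores data", "hadoop processes data", "data keeps growing"]
partials = [map_chunk(chunk) for chunk in split_into_chunks(lines, 2)]
total = combine(partials)
print(total["data"])  # 3
```

In a real cluster each call to `map_chunk` would run on the machine that already holds that piece of the data.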

9. Hadoop Architecture / Components
- MapReduce: Job Tracker, Task Tracker
- HDFS: Name Node, Data Node
- Applications contact the master node; a job is formed and submitted to the Job Tracker
- The Job Tracker maintains a queue of tasks and gets them processed by the Task Trackers and Data Nodes
- It consolidates the results and sends them back to the application

10. Hadoop Architecture / Components
- Hadoop works on a distributed model
- Numerous low-cost computers (commodity hardware)
- Hadoop components
  - Slaves
    - Task Tracker: processes the smaller piece of the task assigned to it
    - Data Node: manages the piece of data distributed to this node
  - Master
    - Job Tracker: tracks the overall task
    - Name Node: maintains the index of the data blocks stored on the different nodes
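A toy model of the two master roles described above: a Name Node style index of which node holds which block, and a Job Tracker style step that places one task per block on a node that already stores it. All class, file and node names are hypothetical; none of this is a Hadoop API.

```python
# Toy model only: class, file and node names are hypothetical, not Hadoop APIs.
class NameNodeIndex:
    """Maps each file to its blocks and each block to the nodes holding it."""

    def __init__(self):
        self.blocks_of = {}   # file name -> list of block ids
        self.nodes_of = {}    # block id -> list of data node names

    def add_block(self, filename, block_id, nodes):
        self.blocks_of.setdefault(filename, []).append(block_id)
        self.nodes_of[block_id] = list(nodes)


def assign_tasks(index, filename):
    """Job-tracker-like step: one task per block, on a node that stores it."""
    load = {}
    plan = []
    for block_id in index.blocks_of.get(filename, []):
        # Prefer the least-loaded node among those holding the block.
        node = min(index.nodes_of[block_id], key=lambda n: (load.get(n, 0), n))
        load[node] = load.get(node, 0) + 1
        plan.append((block_id, node))
    return plan


index = NameNodeIndex()
index.add_block("logs.txt", "blk_1", ["node-a", "node-b"])
index.add_block("logs.txt", "blk_2", ["node-a", "node-c"])
index.add_block("logs.txt", "blk_3", ["node-b", "node-c"])
plan = assign_tasks(index, "logs.txt")
print(plan)
```

Every task lands on a node that holds its block, so no block has to travel over the network before it is processed.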

11. Hadoop Architecture / Components
[Diagram: applications submit work to a queue on the master (Job Tracker, Name Node), which distributes it to five slaves, each running a Task Tracker and a Data Node]


14. Hadoop and GlusterFS
- GlusterFS is a general-purpose, scale-out distributed file system supporting thousands of clients
- Aggregates storage exports over the network interconnect to provide a single unified namespace
- File system completely in userspace; runs on commodity hardware
- Layered on disk file systems that support extended attributes

15. Hadoop and GlusterFS
- Hadoop contains a set of daemons running in the system
  - Name Node: centralized metadata node
  - Job Tracker: overall task distribution across data nodes
  - Task Tracker: runs on data nodes to maintain tasks
  - Data Node: stores data
- Hadoop = MapReduce framework + HDFS
- GlusterFS can be a replacement for HDFS
- glusterfs-hadoop plugin
  - Java module which implements the Hadoop file system interface
  - Simply a JAR file which can be kept in the Hadoop libraries
  - Replaces HDFS with glusterfs
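The idea of swapping the storage layer behind a fixed interface can be illustrated with a small sketch. The real plugin is a Java module; the Python classes below are hypothetical stand-ins that only show the pattern: the framework codes against an abstract file system, and one implementation serves every call from a locally mounted directory. A temporary directory plays the role of the mounted volume.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical stand-in for the storage interface a framework codes against."""

    @abstractmethod
    def read_text(self, path): ...

    @abstractmethod
    def write_text(self, path, data): ...

    @abstractmethod
    def list_dir(self, path): ...


class MountBackedFileSystem(FileSystem):
    """Serves every call from a locally mounted directory (e.g. a FUSE mount)."""

    def __init__(self, mount_root):
        self.mount_root = mount_root

    def _local(self, path):
        # Map a framework path such as "/in/a.txt" onto the mount point.
        return os.path.join(self.mount_root, path.lstrip("/"))

    def read_text(self, path):
        with open(self._local(path)) as handle:
            return handle.read()

    def write_text(self, path, data):
        local = self._local(path)
        os.makedirs(os.path.dirname(local), exist_ok=True)
        with open(local, "w") as handle:
            handle.write(data)

    def list_dir(self, path):
        return sorted(os.listdir(self._local(path)))


# A temporary directory plays the role of the mounted volume here.
mount = tempfile.mkdtemp()
fs = MountBackedFileSystem(mount)
fs.write_text("/in/a.txt", "hello")
print(fs.read_text("/in/a.txt"), fs.list_dir("/in"))
```

Code written against `FileSystem` does not change when the implementation behind it does, which is what lets a JAR dropped into the Hadoop libraries take the place of HDFS.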

16. Hadoop and GlusterFS
- In Hadoop, data locality is ensured by the Job Tracker
- The glusterfs-hadoop plugin ensures data locality by having the gluster volumes mounted as FUSE mounts
- Effectively no name node is involved: only clients, where the map-reduce jobs run, and data nodes, which store the data
- The plugin talks to glusterfs through the FUSE mounts
- In the absence of a name node, the plugin uses the xattr (extended attribute) mechanism to get file details from the volume and consolidates the data using the same
- Reads the data directly from the bricks, bypassing the volume, for improved performance
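A hedged sketch of an xattr-based location lookup. GlusterFS exposes a virtual extended attribute, `trusted.glusterfs.pathinfo`, on files in a mounted volume; whether the plugin queries exactly this attribute, and the exact layout of its value, are assumptions here. The sample string and the host and brick names are made up, and the helper names are hypothetical.

```python
import os
import re

PATHINFO_XATTR = "trusted.glusterfs.pathinfo"
# Matches entries of the assumed form <POSIX(brick_root):host:path_on_brick>
BRICK_ENTRY = re.compile(r"<POSIX\(([^)]*)\):([^:>]+):([^>]+)>")


def parse_pathinfo(pathinfo):
    """Return (host, path_on_brick) pairs found in a pathinfo string."""
    return [(host, path) for _root, host, path in BRICK_ENTRY.findall(pathinfo)]


def brick_locations(path_on_mount):
    """Ask the mounted volume where a file lives (Linux only; needs a real mount)."""
    raw = os.getxattr(path_on_mount, PATHINFO_XATTR)
    return parse_pathinfo(raw.decode())


# Illustrative sample value; host and brick names are made up.
sample = ("(<REPLICATE:vol-replicate-0> "
          "<POSIX(/bricks/b1):server1:/bricks/b1/data/part-0> "
          "<POSIX(/bricks/b2):server2:/bricks/b2/data/part-0>)")
print(parse_pathinfo(sample))
```

With the hosts in hand, a scheduler can place the task on one of the machines that stores the file, which is the role the name node's block index plays in HDFS. Reading `trusted.*` attributes normally requires elevated privileges, so `brick_locations` is only meaningful on a node with the volume mounted.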

17. Hadoop and GlusterFS
- As simple as starting the map-reduce daemons and then submitting the Hadoop task with glusterfs as the storage
- Analytics on HDFS requires moving files around between nodes, whereas with glusterfs the volume only needs to be FUSE-mounted and no files are moved

18. Advantages
- Elimination of the centralized metadata server (name node)
- Compatibility with MapReduce and Hadoop-based applications
- Elimination of code rewrites for Hadoop enablement of glusterfs
- Fault-tolerant file system
- Allows co-location of compute and data nodes, and the ability to run Hadoop jobs across multiple namespaces using multiple glusterfs volumes
- Data access through several different mechanisms / protocols (FUSE, NFS, SMB, Swift and, of course, Hadoop)

19. References
- https://github.com/gluster/glusterfs-hadoop
- https://forge.gluster.org/hadoop/pages/Home
- shubhendu @ #gluster on freenode

20. Deployment Scenario

21. THANK YOU!
