HADOOP & DISTRIBUTED CLOUD COMPUTING
DATA PROCESSING IN CLOUD
Presentation by: Rajan Kumar Upadhyay || [email protected]
CLOUD COMPUTING?
Cloud computing is a virtualized setup that includes the following:
- Delivery of computing as a service rather than as a product
- Shared resources (software, utilities, hardware) provided over a network (typically the Internet)
DISTRIBUTED CLOUD COMPUTING
As the name suggests: distributed computing in the cloud. Examples:
• Distributed computing is nothing more than utilizing many networked computers to partition a question or problem into many smaller pieces and allow the network to solve the issue piecemeal.
• Software like Hadoop. Written in Java, Hadoop is a scalable, efficient, distributed software platform designed to process enormous amounts of data. Hadoop can scale to thousands of computers across many clusters.
• Another instance of distributed computing, for storage instead of processing power, is BitTorrent. A torrent is a file that is split into many pieces and stored on many computers around the Internet. When a local machine wants to access that file, the small pieces are retrieved and rebuilt.
• P2P networks, which split communication/data packets into multiple pieces sent across multiple network routes, then reassemble them at the receiver's end.
Distributed computing on the cloud is the next-generation framework for extracting maximum value from resources over a distributed architecture.
WHAT IS HADOOP?
A flexible infrastructure for large-scale computation and data processing on a network of commodity hardware.
Why Hadoop? A common infrastructure pattern extracted from building distributed systems:
• Scale
• Incremental growth
• Cost
• Flexibility
• Distributed file system
• Distributed processing framework
• An Apache.org open-source project
• Used by Yahoo!, Facebook, Google, Fox, Amazon, IBM, and the NY Times for their core infrastructure
• Widely adopted: a valuable and reusable skill set, taught at major universities, easier to hire for, easier to train on, and portable across projects and groups
HOW IT WORKS
HDFS: Hadoop Distributed File System, a distributed file system for large data
• Your data in triplicate (one local and two remote copies)
• Built-in redundancy and resiliency to large-scale failures (automated restart and re-allocation)
• Intelligent distribution, striping across racks
• Accommodates very large data sizes on commodity hardware
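The triplicate storage idea can be sketched as a toy model in Python. This is a loose illustration with made-up rack/node names and a simple round-robin placement rule, not HDFS's actual placement policy:

```python
# A simplified, illustrative model of HDFS-style block storage: files are
# split into fixed-size blocks, and each block gets three replicas, with
# two of them placed on a different rack for resiliency.
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS's default block size is 128 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of blocks a file of file_size bytes occupies."""
    return (file_size + block_size - 1) // block_size

def place_replicas(block_id, racks):
    """Pick 3 replica nodes: one on a 'local' rack, two on another rack.

    `racks` maps rack name -> list of node names (hypothetical topology).
    """
    rack_names = list(racks)
    local_rack = rack_names[block_id % len(rack_names)]
    remote_rack = rack_names[(block_id + 1) % len(rack_names)]
    return [racks[local_rack][0], racks[remote_rack][0], racks[remote_rack][1]]

racks = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
n_blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file -> 3 blocks
placements = [place_replicas(b, racks) for b in range(n_blocks)]
```

Losing any single node (or even a whole rack) still leaves at least one live copy of every block, which is what makes the automated restart and re-allocation above possible.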
PROGRAMMING MODEL
There are various programming models for Hadoop development. I personally like, and have experience with, Map/Reduce.
Why Map/Reduce:
• Simple programming technique:
  • Map(anything) -> (key, value)
  • Sort and partition on key
  • Reduce(key, values) -> (key, value)
• No parallel-processing or message-passing semantics to manage
• Programmable in Java or any other language
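The three phases above can be sketched in plain Python as a local, in-memory stand-in for what Hadoop distributes across a cluster. The word-count task and the record data are illustrative:

```python
# A minimal in-memory sketch of the Map/Reduce model: map emits (key, value)
# pairs, the pairs are sorted/partitioned on key, and reduce folds each
# key's values into a single result. Hadoop runs these same phases in
# parallel across many machines; here they run locally for clarity.
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    # Map(anything) -> (key, value) pairs
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Reduce(key, values) -> (key, value)
    return (key, sum(values))

records = ["the quick brown fox", "the lazy dog", "the fox"]

# Map
pairs = [kv for r in records for kv in map_phase(r)]
# Sort/partition on key
pairs.sort(key=itemgetter(0))
# Reduce
counts = dict(reduce_phase(k, [v for _, v in g])
              for k, g in groupby(pairs, key=itemgetter(0)))
# counts maps each word to its total, e.g. "the" -> 3
```

Note that the mapper and reducer never communicate with each other directly; the framework's sort/partition step is the only coordination, which is exactly why no message-passing semantics leak into user code.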
Continued …
PROGRAMMING MODEL
1. Create/allocate a cluster
2. Put data into the file system: data is split into blocks, stored in triplicate across your cluster
3. Program execution: your Map code is copied to the allocated nodes, preferring nodes that contain copies of your data (move computation to the data)
4. Gather the output of map; sort or partition on key
5. Run the reduce task
6. Results of the job are stored on HDFS
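The "move computation to data" step can be illustrated with a toy scheduler in Python. The block and node names are hypothetical, and Hadoop's real scheduler is far more involved; this only shows the core preference for nodes that already hold a replica:

```python
# Illustrative data-locality scheduling: assign each block's map task to a
# node that already holds a replica of that block, picking the least-loaded
# candidate so work spreads across the cluster.
def schedule_map_tasks(block_replicas, node_load):
    """block_replicas: dict block_id -> list of nodes holding that block.
    node_load: dict node -> number of tasks already assigned (mutated)."""
    assignment = {}
    for block, replicas in block_replicas.items():
        node = min(replicas, key=lambda n: node_load[n])
        assignment[block] = node
        node_load[node] += 1
    return assignment

replicas = {
    "blk1": ["node1", "node3", "node4"],
    "blk2": ["node2", "node3", "node4"],
    "blk3": ["node1", "node2", "node4"],
}
load = {"node1": 0, "node2": 0, "node3": 0, "node4": 0}
assignment = schedule_map_tasks(replicas, load)
```

Because every map task lands on a node that already stores its input block, the blocks themselves never cross the network during the map phase; only the (usually much smaller) map output does.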
PRACTICES
• Put the large data source into HDFS
• Perform aggregations, transformations, and normalizations on the data
• Load the results into an RDBMS
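This practice can be sketched end to end in Python, with SQLite from the standard library standing in for the target RDBMS. The event data, table name, and column names are illustrative, and the aggregation step stands in for what a Map/Reduce job over HDFS would produce:

```python
# A hedged sketch of the aggregate-then-load practice: raw records are
# aggregated (here, page-view counts), and the much smaller aggregated
# result is loaded into a relational database for querying.
import sqlite3
from collections import Counter

# Raw "large data source" -- in practice this would live in HDFS.
raw_events = [("page_a", 1), ("page_b", 1), ("page_a", 1), ("page_c", 1),
              ("page_a", 1)]

# Aggregation step (what a Map/Reduce job would compute at scale).
page_views = Counter()
for page, n in raw_events:
    page_views[page] += n

# Load step: insert the aggregated results into an RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT PRIMARY KEY, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", page_views.items())
conn.commit()

rows = dict(conn.execute("SELECT page, views FROM page_views"))
```

The key design point is that only the aggregated output reaches the RDBMS; the raw data stays in HDFS, where storage is cheap and the heavy processing already happened.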
THANK YOU
Thank you for reading this; I hope you find it useful. Please contact me at [email protected] if you have any queries or feedback. My name is Rajan Kumar Upadhyay, and I have more than 10 years of collective IT experience as a techie.
If you have anything to share, or are looking for consulting, please feel free to contact me.