
Kovid Academy: Catalyst for Digital Evolution

Visit Us: www.kovidacademy.com


How Pig and Hadoop Fit in Data Processing Architecture


Hadoop and its ecosystem have evolved from a narrow MapReduce-centric architecture into a universal data platform set to dominate the data processing landscape in the future. Importantly, the push to simplify Hadoop deployments with managed cloud services, known as Hadoop-as-a-Service, is increasing Hadoop's appeal to new data projects and architectures. Naturally, this development is permeating the Hadoop ecosystem in the shape of, for example, Pig as a Service offerings.

Pig, developed at Yahoo! Research in 2006, enables programmers to write data transformation programs for Hadoop quickly and easily, without the cost and complexity of hand-written MapReduce programs.


Consequently, ETL (Extract, Transform, Load), the core workload of DWH (data warehouse) solutions, is often realized with Pig in the Hadoop environment. The business case for Hadoop and Pig as a Service is very compelling from both financial and technical perspectives.
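For illustration, a minimal Pig Latin sketch of such an ETL job might look like the following; the input path, schema, and field names are hypothetical and would differ in a real deployment.

    -- Extract: read tab-separated raw order records (hypothetical schema).
    raw = LOAD '/data/raw/orders.tsv' USING PigStorage('\t')
          AS (order_id:chararray, customer:chararray, amount:double);

    -- Transform: drop malformed rows, then aggregate revenue per customer.
    valid   = FILTER raw BY order_id IS NOT NULL AND amount > 0;
    grouped = GROUP valid BY customer;
    revenue = FOREACH grouped GENERATE group AS customer,
                                       SUM(valid.amount) AS total;

    -- Load: write the result where downstream systems can pick it up.
    STORE revenue INTO '/data/out/revenue' USING PigStorage('\t');

Each named relation is one logical step; Pig compiles the whole script into MapReduce jobs, so the author never writes mapper or reducer code.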

Hadoop is becoming data’s Swiss Army knife

The news on Hadoop in the last year has been dominated by SQL (Structured Query Language) on Hadoop, with Hive, Presto, Impala, Drill, and countless other flavours competing to make big data accessible to business users. Most of these solutions are supported directly by Hadoop distributors, e.g. Hortonworks, MapR, and Cloudera, and by cloud service providers, e.g. Amazon and Qubole.

The push for development in this area is driven by the vision for Hadoop to become the data platform of the future.


YARN, introduced with Hadoop 2, turned the core of Hadoop's processing architecture from a MapReduce-centric solution into a generic cluster resource management tool able to run any kind of algorithm and application. Hadoop solution providers are now racing to capture the market for multipurpose, any-size data processing. SQL on Hadoop is only one of the stepping stones towards this goal.

Three Ways Hadoop is Gaining on Data Warehouses

The target and incentive are clear: Hadoop is a comparatively inexpensive technology for storing and processing large data sets. One established market is particularly lucrative and tempting to enter. Data warehouse (DWH) solutions can easily cost many millions of dollars, and Hadoop, with its economical distributed computing architecture and growing ecosystem, promises to achieve much of their feature set for a fraction of the cost.

Three exciting, active developments are eating away at established DWH solutions' lead over Hadoop, and at the reasons for spending 10 or 100 times more:


• SQL on Hadoop is making data accessible to data and business analysts, and to existing tools for visualisation and analytics, via SQL interfaces. Presto and other new SQL engines highlight that real-time querying on big data (petabytes and beyond) can be done with Hadoop at a dramatically lower cost than what DWH solutions offered.

• Cloud computing based platforms and software services for Hadoop remove the complexity, risk, and technical barriers to getting started with Hadoop. Importantly, they enable incremental, iterative development of Hadoop data projects, which also makes them attractive for medium-sized data projects. Today, all major Hadoop projects are offered in one shape or another as-a-Service, with cloud service providers working on making increasingly complete Hadoop ecosystems available as-a-Service, billed by the hour, and fully scalable in minutes.


Pig, the Silent Hero

SQL on Hadoop has been covered extensively in the media over the last year. Pig, being a well-established technology, has been largely overlooked, though Pig as a Service was a noteworthy development. Viewing Hadoop as a data platform, however, requires Pig, and an understanding of why and how it is important.

Data users are generally trained in using SQL, a declarative language, to query data for reporting, analytics, and ad-hoc explorations. SQL does not describe how the data is processed; this declarative nature appeals to many data users. ETL processes, which are developed by data programmers, benefit from, and sometimes even require, the ability to spell out the data transformation steps. ETL programmers therefore often prefer a procedural language over a declarative one. Pig's programming language, Pig Latin, is procedural and gives programmers control over every step of the processing.
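The contrast is easiest to see side by side. In the sketch below, where the table and field names are made up for illustration, the SQL query states only the desired result, while the Pig Latin version makes each processing step an explicit, named relation.

    -- Declarative SQL states what is wanted, not how:
    --   SELECT page, COUNT(*) FROM clicks WHERE status = 200 GROUP BY page;
    -- The procedural Pig Latin equivalent spells out every step:
    clicks  = LOAD 'clicks' AS (page:chararray, status:int);
    ok      = FILTER clicks BY status == 200;  -- step 1: keep successful hits
    by_page = GROUP ok BY page;                -- step 2: group by page
    counts  = FOREACH by_page GENERATE group AS page, COUNT(ok) AS hits;
    DUMP counts;                               -- step 3: inspect the result

Because every intermediate relation has a name, a programmer can reorder, inspect, or optimise individual steps, which is exactly the control that ETL work tends to need.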


Business users and programmers work on the same data set yet usually focus on different stages. The programmers commonly work on the whole ETL pipeline, i.e. they are responsible for cleaning and extracting the raw data, transforming it, and loading it into third-party systems. Business users either access data on those third-party systems or access the extracted and transformed data for analysis and aggregation. Diverse tooling is therefore important, because the interaction patterns with the same data set are diverse.

Importantly, complex ETL workflows need management, extensibility, and testability to ensure stable and reliable data processing. Pig provides strong support on all three aspects. Pig jobs can be scheduled and managed with workflow tools like Oozie to build and orchestrate large-scale, graph-like data pipelines.

Pig achieves extensibility with UDFs (User Defined Functions), which let programmers add functions written in one of many programming languages.


The benefit of this model is that any kind of special functionality can be injected, and that Pig and Hadoop manage the distribution and parallel execution of the function on potentially huge data sets in an efficient manner. This allows programmers to focus on solving specific domain problems, such as rectifying data set anomalies or converting data formats, without worrying about the complexity of distributed computing.
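As a sketch of how this looks in practice, the script below registers a small Python (Jython) UDF and applies it per record; the file name, function, and schema are invented for illustration.

    -- Assume a tiny date-normalising UDF saved as udfs.py:
    --   from pig_util import outputSchema
    --   @outputSchema('d:chararray')
    --   def normalize_date(s):
    --       return s.replace('/', '-') if s else None
    REGISTER 'udfs.py' USING jython AS udfs;

    raw   = LOAD '/data/raw/events' AS (id:chararray, day:chararray);
    clean = FOREACH raw GENERATE id, udfs.normalize_date(day) AS day;
    -- Pig ships the function to the cluster and runs it in parallel;
    -- the author never touches the distributed execution machinery.
    STORE clean INTO '/data/clean/events';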

Reliable data pipelines require testing before deployment to production to ensure the correctness of the numerous data transformation and combination steps. Pig has features supporting easy, testable development of data pipelines: it supports unit tests, an interactive shell, and a local mode that executes programs without requiring a Hadoop cluster. Programmers can use these to test their Pig programs in detail with test data sets before they ever enter production, and to try out ideas quickly and inexpensively, which is essential for fast development cycles.
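For example, a pipeline script can be exercised end to end on a laptop in local mode; the test file below is hypothetical.

    -- Run without any cluster:  pig -x local smoke_test.pig
    -- (running 'pig -x local' with no script opens the interactive Grunt shell)
    raw    = LOAD 'test_data/sample.tsv' USING PigStorage('\t')
             AS (id:chararray, amount:double);
    totals = GROUP raw ALL;
    check  = FOREACH totals GENERATE COUNT(raw) AS rows, SUM(raw.amount) AS total;
    DUMP check;  -- quick sanity check of row count and sum on the test data

The same script runs unchanged against a real cluster once it behaves correctly on the sample data.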


None of these features are particularly glamorous, yet they are important when evaluating Hadoop and data processing with it. The choice of leveraging Pig for a big data project can easily make the difference between success and failure.

Pig as a Service

Pig by itself is the important glue that turns raw data from (No)SQL and object stores into structured data. Yet Pig requires a Hadoop environment to execute its programs. The as-a-Service offerings focus on providing the necessary cluster environment ready to run, so that data projects can focus on the ETL aspects.
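As a sketch of that glue role, the script below reads raw log lines from an object store and emits structured, tab-separated records; the bucket name and field layout are hypothetical.

    -- Read raw lines from an object store path (bucket name invented).
    raw  = LOAD 's3://example-bucket/logs/' USING TextLoader() AS (line:chararray);

    -- Split each line into typed fields to produce structured records.
    rows = FOREACH raw GENERATE FLATTEN(STRSPLIT(line, '\t'))
           AS (ts:chararray, user:chararray, action:chararray);

    STORE rows INTO 's3://example-bucket/structured/events' USING PigStorage('\t');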

The business case for Pig as a Service is simple and convincing. Hadoop is a complex data platform, and the continued growth of the Hadoop market means that experts are costly and hard to find. At the same time, as mentioned before, Hadoop has the potential to shatter data processing costs per byte.


The second argument for the service route is that while Hadoop may beat alternative processing solutions per byte, it is hard to achieve the economies of scale required for most businesses and projects. And even if these may be achievable in the future, not many businesses are willing to invest significant capital into a large-scale Hadoop infrastructure without a proven track record of what the eventual savings will be.

The service solution addresses all these problems by effectively outsourcing the expertise and scale challenges. Cloud computing enables providers like Amazon, Mortar Data, or Qubole to offer scalable services around Hadoop billed on a usage basis. Their business models and the services provided vary from provider to provider, though they all effectively offer Pig as a Service, removing technical barriers and capital investments while adding expertise and enriching the offerings in various aspects.


This approach is significantly different from the more traditional solution offered by Hadoop distributors. They provide complete Hadoop ecosystems and support for running them. However, the customer still has to hire experts and operate the Hadoop cluster, and has to pay a significant support fee per cluster node per year. Cloud computing has also advanced into this area and allows customers to install Hadoop distributions on virtual machines, removing capital investments. However, the operational burden of maintaining a cluster, even a supported one, remains a cost and an additional effort stuck with the customer. As-a-Service solutions remove these problems.

Today, anyone considering data projects involving ETL workloads should seriously consider how Pig and Hadoop might fit into their data processing architecture. Pig as a Service offers a low-cost, low-barrier, testable, manageable option for businesses to enter big data platforms and potentially save time and money over traditional data warehouse options.


It also opens up the opportunity to increasingly leverage the Hadoop ecosystem, e.g. SQL on Hadoop for scalable querying of big data by business users, or distributed computing for advanced data processing projects.

While business globalisation and market competitiveness have created a large demand for Big Data Hadoop processing, organisations have started looking for Big Data Scientists who can derive the most probable and fastest outcomes from the available data. In order to fill these market gaps, Kovid Academy, with its vision of reinforcing the skills required for Big Data Developers and Administrators, is offering instructor-led, interactive online and classroom Big Data Hadoop training sessions to carve you into a master of Big Data Analytics.


Contact Us: [email protected] | USA: 609-436-9548, IND: +91 9700022933

Website: https://kovidacademy.com
FB: https://www.facebook.com/kovidacademy/
Twitter: https://twitter.com/KovidAcademy
LinkedIn: https://www.linkedin.com/company/kovid-academy
YouTube: https://www.youtube.com/channel/UCbmkCnMoOUDsrS7O4bVpLjA