How to Use Hadoop


  • 8/12/2019 How to Use Hadoop


    CONCLUSIONS PAPER

    Featuring:

    Brian Garrett, Product and Systems Architect, SAS

    Scott Chastain, Product and Systems Manager, SAS

    Bob Messier, Senior Director, Product Management, SAS

    Insights from a webinar in the Applying Business Analytics Webinar Series

    How to Use Hadoop as a Piece of the Big Data Puzzle


    SAS Conclusions Paper

Table of Contents

What Hadoop Can Do for Big Data

Why Hadoop Is Not a Big Data Strategy

Closing Thoughts

For More Information


    What Hadoop Can Do for Big Data

Imagine you have a jar of multicolored candies, and you need to learn something from them, perhaps the count of blue candies relative to red and yellow ones. You could empty the jar onto a plate, sift through them and tally up your answer. If the jar held only a few hundred candies, this process would take only a few minutes.

Now imagine you have four plates and four helpers. You pour out about one-fourth of the candies onto each plate. Everybody sifts through their set and arrives at an answer that they share with the others to arrive at a total. Much faster, no?

That is what Hadoop does for data. Hadoop is an open-source software framework for running applications on large clusters of commodity hardware. Hadoop delivers enormous processing power and the ability to handle virtually limitless concurrent tasks and jobs, making it a remarkably low-cost complement to a traditional enterprise data infrastructure.
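The four-plates analogy can be sketched in a few lines of Python. This is illustrative only: real Hadoop distributes the work across machines and disks, not list slices in one process.

```python
from collections import Counter

def count_colors(candies, workers=4):
    # "Pour" roughly a quarter of the jar onto each plate.
    plates = [candies[i::workers] for i in range(workers)]
    # Each helper tallies their own plate independently.
    partials = [Counter(plate) for plate in plates]
    # Share the partial answers and combine them into a total.
    total = Counter()
    for partial in partials:
        total += partial
    return total

jar = ["blue"] * 5 + ["red"] * 3 + ["yellow"] * 2
print(count_colors(jar))
```

Because each plate is counted independently, the helpers can work in parallel, and the merge step only handles four small tallies rather than the whole jar.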

    Organizations are embracing Hadoop for several notable merits:

Hadoop is distributed. Bringing a high-tech twist to the adage "Many hands make light work," data is stored on the local disks of a distributed cluster of servers.

Hadoop runs on commodity hardware. Based on the average cost per terabyte of compute capacity of a prepackaged system, Hadoop is easily 10 times cheaper for comparable computing capacity than higher-cost specialized hardware.

Hadoop is fault-tolerant. Hardware failure is expected and is mitigated by data replication and speculative processing. If capacity is available, Hadoop runs multiple copies of the same task, accepting the results from the task that finishes first.

Hadoop does not require a predefined data schema. A key benefit of Hadoop is the ability to just upload any unstructured files without having to schematize them first. You can dump any type of data into Hadoop and allow the consuming programs to determine and apply structure when necessary.

Hadoop scales to handle big data. Hadoop clusters can scale to between 6,000 and 10,000 nodes and handle more than 100,000 concurrent tasks and 10,000 concurrent jobs. Yahoo! runs thousands of clusters and more than 42,000 Hadoop nodes storing more than 200 petabytes of data.

Hadoop is fast. In a performance test, a 1,400-node cluster sorted a terabyte of data in 62 seconds; a 3,400-node cluster sorted 100 terabytes in 173 minutes. To put it in context, one terabyte contains 2,000 hours of CD-quality music; 10 terabytes could store the entire US Library of Congress print collection.
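The schema-on-read idea in the list above can be illustrated with a small Python sketch. The record formats and field names here are hypothetical, not Hadoop APIs: the point is that raw data lands in storage as-is, and structure is applied only when a consuming program reads it.

```python
import json

# Two records dumped into storage as-is, in different formats,
# with no schema declared up front.
raw_store = [
    '{"color": "blue", "count": 5}',   # one producer wrote JSON
    'red,3',                           # another wrote comma-separated text
]

def color_counts(raw_lines):
    """The consuming program applies structure at read time."""
    for line in raw_lines:
        if line.lstrip().startswith('{'):
            record = json.loads(line)
            yield record["color"], record["count"]
        else:
            color, n = line.split(',')
            yield color, int(n)

print(dict(color_counts(raw_store)))
```

A relational database would have rejected the second record at load time; here, nothing is rejected, and it is the reader's job to make sense of each line.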

You get the idea. Hadoop handles big data. It does it fast. It redefines the possible when it comes to analyzing large volumes of data, particularly semi-structured and unstructured data (text).

"Hadoop automatically replicates the data onto separate nodes to mitigate the effects of a hardware failure. The framework is not only server-aware, it's rack-aware, so if a hardware component becomes unavailable, the data still exists somewhere else."

Brian Garrett
Product and Systems Architect, SAS


    Why Hadoop Is Not a Big Data Strategy

"For all its agility in handling big data, Hadoop by itself is not a big data strategy," says Scott Chastain, Product and Systems Manager at SAS. "The data storage capabilities, the ability to divide and conquer, to replicate the data for redundancy: these capabilities don't necessarily solve any business questions. For that you need the ability to efficiently do query and reporting on big data sets." Hadoop by itself has limited capabilities for generating insight from data, and if you have a lot of users asking questions of the data, Hadoop adds some unwelcome overhead.

Suppose you need to answer a new question about your collection of candies. Perhaps you need to rank the colors by prevalence. The candies aren't kept on the plate (in memory); they were poured back into the jar after the first analysis. So you would pour the candies back out onto the plates again, have your helpers sift through them all, and come up with the answer to the new question.

Wouldn't it be nice if you could have preserved the orderliness of the candies after you answered the first question? Your analysis would be so much more efficient if you didn't have to pour the candies back in the jar and dump them back out again later as a jumble of colors.

But that's what Hadoop does; it keeps the data in the jar, not on the plates. "The typical paradigm of Hadoop processing is a computational approach called MapReduce, which breaks out the individual pieces for distributed processing, and then gets that answer across all of the individual nodes," said Chastain. "Every time you fire up a new MapReduce job, you have to go to that data [in storage], pull it in [to memory], ingest it and process it."
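Chastain's description can be sketched as the classic map-shuffle-reduce flow. This toy Python version runs in a single process, unlike a real Hadoop job, which distributes the map and reduce steps across a cluster and reads its input from HDFS:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Map: each record is processed independently into (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle: group all values emitted under the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: combine each key's values into one final answer.
    return {key: reducer(values) for key, values in groups.items()}

# Every new question rereads the records from "storage":
stored_candies = ["red", "blue", "blue", "yellow", "blue"]
counts = map_reduce(stored_candies,
                    mapper=lambda color: [(color, 1)],
                    reducer=sum)
print(counts)
```

Note that a second question about the same candies would call `map_reduce` again from scratch; nothing from the first pass is kept in memory, which is exactly the overhead described above.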

So Hadoop adds two elements of overhead: in creating multiple sets of the distributed data for redundancy, and in moving data between storage and memory every time somebody comes to ask a question. "Hadoop is extremely fast and efficient at answering questions on big data sets, but there's still overhead associated with getting that answer," said Chastain. "This paradigm has been very successful for data scientists and others doing ad hoc problem solving, but what if you have many business intelligence users who want to do query and reporting based on big data sets?"

"That's why we talk about Hadoop as a piece of this big data puzzle, but not a strategy in and of itself. Additional capabilities are needed to serve more people working with big data. What if we can keep the data set in memory across this distributed computing environment, and serve multiple pieces and multiple users with different questions around that set of big data?"

Hadoop is both a data storage mechanism using the Hadoop Distributed File System (HDFS) and a parallel and distributed programming model based on MapReduce.

Hadoop has emerged as a popular way to handle massive amounts of structured and unstructured data, thanks to its ability to process quickly and cost-effectively across clusters of commodity hardware.


Let's start all over with the candies. This time, you pour the candies out of the jar and organize them in an efficient manner, lining them up neatly in rows by color. Now if someone asks a question such as, "Which colors are most and least represented?" the candies are already organized for a speedy response to that query.

"Leaving the data in memory in a structured fashion, it's very efficient for us to enable multiple users to answer various questions from the same set of data," said Chastain.
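The "load once, keep it organized in memory" idea can be sketched in Python. This is a stand-in for an in-memory analytics server, not actual SAS software:

```python
from collections import Counter

# Load once and keep the organized tallies in memory.
jar = ["red"] * 4 + ["blue"] * 7 + ["yellow"] * 2
in_memory = Counter(jar)

# Different users ask different questions of the same structure,
# with no need to re-pour and re-count the jar each time.
ranked = in_memory.most_common()   # colors ranked by prevalence
print(ranked[0])                   # most represented color
print(ranked[-1])                  # least represented color
print(in_memory["red"])            # an ad hoc count query
```

Each new question costs only a lookup against the structure already in memory, instead of a full pass over the raw data.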

However, today's business problems require more than simple answers. Business users need to understand forecasting, fraud and risk, propensity to respond, root cause analysis, optimization and so on: questions that entail a lot of variables and analytical complexity. The Hadoop framework doesn't provide the high-performance analytics to answer those business problems. Even if the needed tools do exist, Hadoop often has to wait for the slowest node to finish before it can deliver the answer. So we use SAS for complex problems.

Suppose we are presented with a candy optimization problem. We can substitute two orange candies for every red, but we must take a two-green penalty for every red candy removed until greens are exhausted; or we can substitute two greens for every orange, but must remove three blues for every orange remaining. Which strategy of substitutions will yield the highest candy inventory?

The answer wouldn't be quickly arrived at by intuition or counting, so you'd pass this question over to analytics. Pour the candies into a SAS mug and imagine that SAS Analytics works its magic and delivers an optimized answer.

It can be very effective to pull that data and put it into a SAS process to answer complex problems, such as using optimization or data mining to predict customer behavior or detect potentially fraudulent activity.

But what if you've got a great big vat of candies?

"Now you've really got a big data problem," said Chastain. "Up until now in our example, we've used Hadoop as a distributed data platform and brought the data to the complex mathematics that SAS provides. That paradigm has to change, because we can't have data floating around in the enterprise with little ability to provide security control and governance. With big data, taking the data to SAS is not our best way forward. We have a solution for that."

Pour the vat of candies into as many plates as needed, and put a magic SAS mug with each plate. Now you can analyze the candies right in place, without having to move them around and potentially spill them.


"Now we can take SAS to where the data lives," said Chastain. "We have the data in memory, and SAS can go inside this Hadoop infrastructure and answer those complex business problems right inside the commodity hardware."
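Taking the computation to the data, rather than the data to the computation, can be sketched as follows. This hypothetical Python version only captures the shape of the idea: each partition is summarized where it lives, and only the small summaries travel to be merged.

```python
from collections import Counter

# Each partition stays "where it lives"; only tiny summaries move.
partitions = [
    ["blue", "red", "blue"],
    ["yellow", "blue"],
    ["red", "red", "blue"],
]

def local_summary(partition):
    """Runs next to its partition and returns a small result."""
    return Counter(partition)

# Ship the per-node summaries, not the raw data, and merge them.
summaries = [local_summary(p) for p in partitions]
total = sum(summaries, Counter())
print(total)
```

The payoff is in what moves over the network: three small tallies instead of every record in every partition, which is what makes in-place analysis attractive when the "vat" is terabytes rather than a jar.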

    Closing Thoughts

In the past, organizations were constrained in how much data could be stored and what type of analytics could be applied against that data. Analysts were often limited to analyzing just a sample subset of the data in an attempt to simulate a larger data population, even when using all the data would have yielded a more accurate result.

Hadoop can overcome the bandwidth and coordination issues associated with processing billions of records that previously might not have been saved. The SAS approach brings world-class analytics to the Hadoop framework. There are a lot of technical details involving the various Apache subprojects and Hadoop-based capabilities, but SAS support for Hadoop can be boiled down to three simple statements:

SAS can use Hadoop data as just another data source.

SAS Data Integration supports Hadoop alongside other data storage and processing technologies. Graphical tools enable users to access, process and manage Hadoop data and processes from within the familiar SAS environment. This is critical, given the skills shortage and the complexity involved with Hadoop.

The power of SAS Analytics has been extended to Hadoop.

SAS augments Hadoop with world-class analytics, along with metadata, security and lineage capabilities, which helps ensure that Hadoop will be ready for enterprise expectations.

SAS brings much-needed governance to Hadoop.

SAS provides a robust information management life cycle approach to Hadoop that includes support for data management and analytics management. This is a huge advantage over other products that focus primarily on moving data in and out of Hadoop.

"That's why we said that Hadoop is a piece of the big data puzzle, but it's not everything," said Chastain. You're going to need other assets to drive a complete strategy, such as the abilities to:

Keep data in memory on the distributed architecture.

Support multiple users for querying against this data.

Put complex mathematics inside the Hadoop environment to solve difficult business problems.

To find out more, download the SAS white paper Bringing the Power of SAS to Hadoop: sas.com/reg/wp/corp/46633

SAS support for Hadoop is part of a broader big data strategy that includes information management for big data and high-performance analytics, including grid, in-database and in-memory computing.

"We can drive cost savings in our existing storage capacity by using Hadoop on commodity hardware, and we can then embed SAS into the Hadoop infrastructure to help solve the more complicated and sophisticated business problems."

Bob Messier
Senior Director for Product Management, SAS


    For More Information

To view the on-demand recording of this webinar: sas.com/reg/web/corp/2069845

Other events in the Applying Business Analytics Webinar Series: sas.com/ABAWS

For a go-to resource for premium content and collaboration with experts and peers: AllAnalytics.com

Download the SAS white paper Bringing the Power of SAS to Hadoop: sas.com/reg/wp/corp/46633

Download the TDWI Best Practices Report High-Performance Data Warehousing: sas.com/reg/gen/corp/1909689

Follow us on Twitter: @sasanalytics

Like us on Facebook: SAS Analytics
