How to Use Hadoop


  • 8/12/2019 How to Use Hadoop


    CONCLUSIONS PAPER

    Featuring:

    Brian Garrett, Product and Systems Architect, SAS

    Scott Chastain, Product and Systems Manager, SAS

    Bob Messier, Senior Director, Product Management, SAS

    Insights from a webinar in the Applying Business Analytics Webinar Series

    How to Use Hadoop as a Piece of the Big Data Puzzle


    SAS Conclusions Paper

Table of Contents

What Hadoop Can Do for Big Data

Why Hadoop Is Not a Big Data Strategy

Closing Thoughts

For More Information


    What Hadoop Can Do for Big Data

Imagine you have a jar of multicolored candies, and you need to learn something from them, perhaps the count of blue candies relative to red and yellow ones. You could empty the jar onto a plate, sift through them and tally up your answer. If the jar held only a few hundred candies, this process would take only a few minutes.

Now imagine you have four plates and four helpers. You pour out about one-fourth of the candies onto each plate. Everybody sifts through their set and arrives at an answer that they share with the others to arrive at a total. Much faster, no?

That is what Hadoop does for data. Hadoop is an open-source software framework for running applications on large clusters of commodity hardware. Hadoop delivers enormous processing power and the ability to handle virtually limitless concurrent tasks and jobs, making it a remarkably low-cost complement to a traditional enterprise data infrastructure.
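The four-plates analogy can be sketched in a few lines of Python. This is illustrative only: real Hadoop distributes the work across machines and disks, not list slices in one process.

```python
from collections import Counter

def count_colors(candies, workers=4):
    # "Pour" roughly a quarter of the jar onto each plate.
    plates = [candies[i::workers] for i in range(workers)]
    # Each helper tallies their own plate independently.
    partials = [Counter(plate) for plate in plates]
    # Share the partial answers and combine them into a total.
    total = Counter()
    for partial in partials:
        total += partial
    return total

jar = ["blue"] * 5 + ["red"] * 3 + ["yellow"] * 2
print(count_colors(jar))
```

Because each plate is counted independently, the helpers can work in parallel, and the merge step only handles four small tallies rather than the whole jar.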

    Organizations are embracing Hadoop for several notable merits:

Hadoop is distributed. Bringing a high-tech twist to the adage "Many hands make light work," data is stored on the local disks of a distributed cluster of servers.

Hadoop runs on commodity hardware. Based on the average cost per terabyte of compute capacity of a prepackaged system, Hadoop is easily 10 times cheaper for comparable computing capacity than higher-cost specialized hardware.

Hadoop is fault-tolerant. Hardware failure is expected and is mitigated by data replication and speculative processing. If capacity is available, Hadoop runs multiple copies of the same task, accepting the results from the task that finishes first.

Hadoop does not require a predefined data schema. A key benefit of Hadoop is the ability to just upload any unstructured files without having to schematize them first. You can dump any type of data into Hadoop and allow the consuming programs to determine and apply structure when necessary.

Hadoop scales to handle big data. Hadoop clusters can scale to between 6,000 and 10,000 nodes and handle more than 100,000 concurrent tasks and 10,000 concurrent jobs. Yahoo! runs thousands of clusters and more than 42,000 Hadoop nodes storing more than 200 petabytes of data.

Hadoop is fast. In a performance test, a 1,400-node cluster sorted a terabyte of data in 62 seconds; a 3,400-node cluster sorted 100 terabytes in 173 minutes. To put it in context, one terabyte contains 2,000 hours of CD-quality music; 10 terabytes could store the entire US Library of Congress print collection.
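The schema-on-read idea in the list above can be illustrated with a small Python sketch. The record formats and field names here are hypothetical, not Hadoop APIs: the point is that raw data lands in storage as-is, and structure is applied only when a consuming program reads it.

```python
import json

# Two records dumped into storage as-is, in different formats,
# with no schema declared up front.
raw_store = [
    '{"color": "blue", "count": 5}',   # one producer wrote JSON
    'red,3',                           # another wrote comma-separated text
]

def color_counts(raw_lines):
    """The consuming program applies structure at read time."""
    for line in raw_lines:
        if line.lstrip().startswith('{'):
            record = json.loads(line)
            yield record["color"], record["count"]
        else:
            color, n = line.split(',')
            yield color, int(n)

print(dict(color_counts(raw_store)))
```

A relational database would have rejected the second record at load time; here, nothing is rejected, and it is the reader's job to make sense of each line.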

You get the idea. Hadoop handles big data. It does it fast. It redefines the possible when it comes to analyzing large volumes of data, particularly semi-structured and unstructured data (text).

"Hadoop automatically replicates the data onto separate nodes to mitigate the effects of a hardware failure. The framework is not only server-aware, it's rack-aware, so if a hardware component becomes unavailable, the data still exists somewhere else."

Brian Garrett
Product and Systems Architect, SAS


    Why Hadoop Is Not a Big Data Strategy

"For all its agility in handling big data, Hadoop by itself is not a big data strategy," says Scott Chastain, Product and Systems Manager at SAS. "The data storage capabilities, the ability to divide and conquer, to replicate the data for redundancy: these capabilities don't necessarily solve any business questions. For that you need the ability to efficiently do query and reporting on big data sets." Hadoop by itself has limited capabilities for generating insight from data, and if you have a lot of users asking questions of the data, Hadoop adds some unwelcome overhead.

Suppose you need to answer a new question about your collection of candies. Perhaps you need to rank the colors by prevalence. The candies aren't kept on the plate (in memory); they were poured back into the jar after the first analysis. So you would pour the candies back out onto the plates again, have your helpers sift through them all, and come up with the answer to the new question.

Wouldn't it be nice if you could have preserved the orderliness of the candies after you answered the first question? Your analysis would be so much more efficient if you didn't have to pour the candies back in the jar and dump them back out again later as a jumble of colors.

But that's what Hadoop does; it keeps the data in the jar, not on the plates. "The typical paradigm of Hadoop processing is a computational approach called MapReduce, which breaks out the individual pieces for distributed processing, and then gets that answer across all of the individual nodes," said Chastain. "Every time you fire up a new MapReduce job, you have to go to that data [in storage], pull it in [to memory], ingest it and process it."
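Chastain's description can be sketched as the classic map-shuffle-reduce flow. This toy Python version runs in a single process, unlike a real Hadoop job, which distributes the map and reduce steps across a cluster and reads its input from HDFS:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Map: each record is processed independently into (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle: group all values emitted under the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: combine each key's values into one final answer.
    return {key: reducer(values) for key, values in groups.items()}

# Every new question rereads the records from "storage":
stored_candies = ["red", "blue", "blue", "yellow", "blue"]
counts = map_reduce(stored_candies,
                    mapper=lambda color: [(color, 1)],
                    reducer=sum)
print(counts)
```

Note that a second question about the same candies would call `map_reduce` again from scratch; nothing from the first pass is kept in memory, which is exactly the overhead described above.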

So Hadoop adds two elements of overhead: in creating multiple sets of the distributed data for redundancy, and in moving data between storage and memory every time somebody comes to ask a question. "Hadoop is extremely fast and efficient at answering questions on big data sets, but there's still overhead associated with getting that answer," said Chastain. "This paradigm has been very successful for data scientists and others doing ad hoc problem solving, but what if you have many business intelligence users who want to do query and reporting based on big data sets?"

"That's why we talk about Hadoop as a piece of this big data puzzle, but not a strategy in and of itself. Additional capabilities are needed to serve more people working with big data. What if we can keep the data set in memory across this distributed computing environment, and serve multiple pieces and multiple users with different questions around that set of big data?"

Hadoop is both a data storage mechanism using the Hadoop Distributed File System (HDFS) and a parallel and distributed programming model based on MapReduce.

Hadoop has emerged as a popular way to handle massive amounts of structured and unstructured data, thanks to its ability to process quickly and cost-effectively across clusters of commodity hardware.


Let's start all over with the candies. This time, you pour the candies out of the jar and organize them in an efficient manner, lining them up neatly in rows by color. Now if someone asks a question such as, "Which colors are most and least represented?" the candies are already organized for a speedy response to that query.

"Leaving the data in memory in a structured fashion, it's very efficient for us to enable multiple users to answer various questions from the same set of data," said Chastain.
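The "load once, keep it organized in memory" idea can be sketched in Python. This is a stand-in for an in-memory analytics server, not actual SAS software:

```python
from collections import Counter

# Load once and keep the organized tallies in memory.
jar = ["red"] * 4 + ["blue"] * 7 + ["yellow"] * 2
in_memory = Counter(jar)

# Different users ask different questions of the same structure,
# with no need to re-pour and re-count the jar each time.
ranked = in_memory.most_common()   # colors ranked by prevalence
print(ranked[0])                   # most represented color
print(ranked[-1])                  # least represented color
print(in_memory["red"])            # an ad hoc count query
```

Each new question costs only a lookup against the structure already in memory, instead of a full pass over the raw data.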

However, today's business problems require more than simple answers. Business users need to understand forecasting, fraud and risk, propensity to respond, root cause analysis, optimization and so on: questions that entail a lot of variables and analytical complexity. The Hadoop framework doesn't provide the high-performance analytics to answer those business problems. Even if the needed tools do exist, Hadoop often has to wait for the slowest node to finish before it can deliver the answer. So we use SAS for complex problems.

Suppose we are presented with a candy optimization problem. We can substitute two orange candies for every red, but we must take a two-green penalty for every red candy removed until greens are exhausted; or we can substitute two greens for every orange, but must remove three blues for every orange remaining. Which strategy of substitutions will yield the highest candy inventory?

The answer wouldn't be quickly arrived at by intuition or counting, so you'd pass this question over to analytics. Pour the candies into a SAS mug and imagine that SAS Analytics works its magic and delivers an optimized answer.

It can be very effective to pull that data and put it into a SAS process to answer complex problems, such as using optimization or data mining to predict customer behavior or detect potentially fraudulent activity.

But what if you've got a great big vat of candies?

"Now you've really got a big data problem," said Chastain. "Up until now in our example, we've used Hadoop as a distributed data platform and brought the data to the complex mathematics that SAS provides. That paradigm has to change, because we can't have data floating around in the enterprise with little ability to provide security control and governance. With big data, taking the data to SAS is not our best way forward. We have a solution for that."

Pour the vat of candies into as many plates as needed, and put a magic SAS mug with each plate. Now you can analyze the candies right in place, without having to move them around and potentially spill them.


"Now we can take SAS to where the data lives," said Chastain. "We have the data in memory, and SAS can go inside this Hadoop infrastructure and answer those complex business problems right inside the commodity hardware."
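Taking the computation to the data, rather than the data to the computation, can be sketched as follows. This hypothetical Python version only captures the shape of the idea: each partition is summarized where it lives, and only the small summaries travel to be merged.

```python
from collections import Counter

# Each partition stays "where it lives"; only tiny summaries move.
partitions = [
    ["blue", "red", "blue"],
    ["yellow", "blue"],
    ["red", "red", "blue"],
]

def local_summary(partition):
    """Runs next to its partition and returns a small result."""
    return Counter(partition)

# Ship the per-node summaries, not the raw data, and merge them.
summaries = [local_summary(p) for p in partitions]
total = sum(summaries, Counter())
print(total)
```

The payoff is in what moves over the network: three small tallies instead of every record in every partition, which is what makes in-place analysis attractive when the "vat" is terabytes rather than a jar.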

    Closing Thoughts

In the past, organizations were constrained in how much data could be stored and what type of analytics could be applied against that data. Analysts were often limited to analyzing just a sample subset of the data in an attempt to simulate a larger data population, even when using all the data would have yielded a more accurate result.

Hadoop can overcome the bandwidth and coordination issues associated with processing billions of records that previously might not have been saved. The SAS approach brings world-class analytics to the Hadoop framework. There are a lot of technical details involving the various Apache subprojects and Hadoop-based capabilities, but SAS support for Hadoop can be boiled down to three simple statements:

SAS can use Hadoop data as just another data source.

SAS Data Integration supports Hadoop alongside other data storage and processing technologies. Graphical tools enable users to access, process and manage Hadoop data and processes from within the familiar SAS environment. This is critical, given the skills shortage and the complexity involved with Hadoop.

The power of SAS Analytics has been extended to Hadoop.

SAS augments Hadoop with world-class analytics, along with metadata, security and lineage capabilities, which helps ensure that Hadoop will be ready for enterprise expectations.

SAS brings much-needed governance to Hadoop.

SAS provides a robust information management life cycle approach to Hadoop that includes support for data management and analytics management. This is a huge advantage over other products that focus primarily on moving data in and out of Hadoop.

"That's why we said that Hadoop is a piece of the big data puzzle, but it's not everything," said Chastain. You're going to need other assets to drive a complete strategy, such as the abilities to:

Keep data in memory on the distributed architecture.

Support multiple users for querying against this data.

Put complex mathematics inside the Hadoop environment to solve difficult business problems.

To find out more, download the SAS white paper Bringing the Power of SAS to Hadoop: sas.com/reg/wp/corp/46633

SAS support for Hadoop is part of a broader big data strategy that includes information management for big data and high-performance analytics, including grid, in-database and in-memory computing.

"We can drive cost savings in our existing storage capacity by using Hadoop on commodity hardware, and we can then embed SAS into the Hadoop infrastructure to help solve the more complicated and sophisticated business problems."

Bob Messier
Senior Director for Product Management, SAS


    For More Information

To view the on-demand recording of this webinar: sas.com/reg/web/corp/2069845

Other events in the Applying Business Analytics Webinar Series: sas.com/ABAWS

For a go-to resource for premium content and collaboration with experts and peers: AllAnalytics.com

Download the SAS white paper Bringing the Power of SAS to Hadoop: sas.com/reg/wp/corp/46633

Download the TDWI Best Practices Report High-Performance Data Warehousing: sas.com/reg/gen/corp/1909689

Follow us on Twitter: @sasanalytics

Like us on Facebook: SAS Analytics
