Post on 15-May-2018
E-guide
Hadoop Big Data Platforms Buyer’s Guide – part 2 Your expert guide to Hadoop big data platforms
Page 1 of 18
In this e-guide
What to consider when
evaluating Hadoop vendors
Four factors for comparing the
top Hadoop distributions
E-guide
What to consider when evaluating Hadoop vendors
David Loshin, Knowledge Integrity Inc.
Before you evaluate specific Hadoop software or subscriptions,
examine what features the vendor distributions provide and how
they match your big data management needs.
Apache Hadoop is at the heart of many big data environments, supporting
large-scale, data-intensive applications. Its variety of open source software
components and related tools for capturing, processing, managing and
analyzing data, and the low overall cost of Hadoop clusters, are alluring to lots
of organizations. But, as this series has examined, the open source Hadoop
framework only offers so much, and companies that need more robust
performance and functionality capabilities as well as maintenance and support
are turning to commercial Hadoop vendors.
Because Hadoop is a technology that's managed via The Apache Software
Foundation's open source process, the sales model of Hadoop vendors differs
from that of proprietary software development companies. The Hadoop source
code is open, meaning that it's available to anyone who wants to access it, so
product offerings have to be differentiated by what the vendors provide beyond
the openly accessible functionality.
Once you've determined that your organization could benefit from a commercial
Hadoop big data distribution, the next step is to explore some value-added
supplements to the code base and key features offered by Hadoop vendors and
determine how these offerings match your needs.
What are the Hadoop distribution vendors really selling?
IT teams can download Hadoop from the Apache website and deploy it on a
hardware cluster themselves, without any vendor involvement. But Hadoop
vendors are aware that the self-starter approach isn't for everyone, so they
provide prebuilt Hadoop distributions that can be downloaded from their
websites -- typically in both a free community edition and an enterprise edition
that adds more features and requires the purchase of a license. But if these
vendors are providing users with a product, what are they really selling? In other
words, what do you actually get when you engage and pay a Hadoop software
vendor?
Vendors offering commercial versions of open source technologies, such as
those providing big data management systems based on Hadoop, follow an
alternative system and services model in which customers effectively subscribe
to the enterprise edition of the product. Benefits of subscribing to an enterprise
edition include:
Access to enterprise features. The subscription relationship enables
customers to access versions of Hadoop that have features and
optimizations that haven't been openly released to the open source
community.
Release from restrictions. In some situations, the freely downloadable
Hadoop distributions have been built with restrictions, such as a limit to the
number of nodes on which the system can be run or the amount of data
that can be managed. Buying an enterprise subscription lifts these
restrictions.
Responsive technical support. Enterprise subscriptions provide dedicated
support resources, typically with 24/7 telephone access and response times
that can be guaranteed under service-level agreements, depending on the
level of support purchased.
Advanced training. While all website visitors may have access to some
training materials and videos, enterprise subscribers typically are entitled to
more advanced and extensive training sessions.
Access to deployment experts. Hadoop vendors have professional
services teams that are experienced in big data management deployments
and can help jump-start a customer's implementation.
Key considerations for comparing Hadoop distribution vendors
The enterprise editions of vendor Hadoop distributions all provide the core
components of the Hadoop ecosystem stack, which include the Hadoop
Distributed File System (HDFS), the MapReduce programming and execution
environment for batch processing, and the YARN job scheduler and cluster
resource manager. They also commonly incorporate various other open source
technologies, such as the Spark data processing engine and HBase database.
But different vendors may support different releases of all those technologies,
and newer or more specialized tools may not be universally supported. If your
organization is looking to use a particular technology as part of a Hadoop
deployment, you should ensure that the distributions you're considering support
it and, if so, which release they're currently on.
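The release check described above can be sketched in a few lines of Python. This is only an illustration: the distribution names, component lists and version numbers below are hypothetical placeholders, not the actual contents of any vendor's product, and the helper names `find_gaps` and `shortlist` are invented for this example.

```python
# Hypothetical data: names and version numbers are illustrative only and do
# not describe any real vendor distribution.
REQUIRED = {"spark": (1, 6), "hbase": (1, 1)}  # minimum (major, minor) needed

DISTRIBUTIONS = {
    "distro_a": {"spark": (1, 6), "hbase": (1, 1), "hive": (1, 2)},
    "distro_b": {"spark": (1, 5), "hbase": (1, 2)},
    "distro_c": {"hbase": (1, 1)},
}


def find_gaps(required, supported):
    """Return {component: reason} for each requirement a distribution misses."""
    gaps = {}
    for component, minimum in required.items():
        shipped = supported.get(component)
        if shipped is None:
            gaps[component] = "not included"
        elif shipped < minimum:  # tuples compare major first, then minor
            gaps[component] = (
                f"ships {shipped[0]}.{shipped[1]}, "
                f"need {minimum[0]}.{minimum[1]}+"
            )
    return gaps


def shortlist(required, distributions):
    """Return the names of distributions that meet every requirement."""
    return sorted(
        name
        for name, supported in distributions.items()
        if not find_gaps(required, supported)
    )


if __name__ == "__main__":
    for name, supported in DISTRIBUTIONS.items():
        print(name, find_gaps(REQUIRED, supported) or "meets requirements")
    print("shortlist:", shortlist(REQUIRED, DISTRIBUTIONS))
```

In practice, the supported releases would come from each vendor's current documentation rather than a hard-coded table.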
Beyond these typical components, you should also compare and contrast how
each vendor provides the following:
Access to enterprise-class features. Some Hadoop vendors offer additional
tools that aren't part of the open source distribution for system configuration,
system performance, ongoing monitoring and administration. While these may
add value to the enterprise distribution, recognize that integration with
proprietary components may lock the customer into that vendor's product.
Infrastructure deployment alternatives. Your organization may choose to
adopt different underlying infrastructure options, such as running on-premises,
in the cloud or in virtualized environments. Consider how the Hadoop
distribution alternatives are adaptable to these infrastructure choices.
Interoperability with other data management systems. In most cases, an
organization will have existing data warehousing, business intelligence and
analytics systems in place. Hadoop typically doesn't fully replace these systems,
but rather augments and complements them. So it's critical that the adopted
Hadoop environment enable access and data exchange with existing data
management platforms such as DB2, Oracle, SQL Server, Teradata and others.
Integration with end-user tools. End users will want to continue using their
favorite tools for business intelligence, reporting, visualization and analytics.
Assess how well the Hadoop big data management vendor's distribution
supports integration with the tools used in your organization.
Security and data protection. The Apache Hadoop ecosystem is still maturing,
which means that not all of its components may meet enterprise expectations
for data security and protection. Many Hadoop vendors provide security
features as add-ons.
Support options. Consider what your support requirements are in terms of
availability and response times. Vendors offer different plans for support
availability as well as response windows.
Indemnification from litigation from use of open source technology. This
increasingly important concept ensures that vendors of open source
technologies protect their users from potential liabilities related to the use of the
product.
Optimized performance. Enterprise distributions may be augmented with
performance optimizations that enhance scalability and extensibility.
One additional consideration when comparing Hadoop distribution vendor
offerings relates to the approach that vendors are taking toward compatibility
within the open source community and interoperability between product
offerings from different companies. Ideally, this means ensuring that Hadoop
distributions will remain compatible with the open source versions of Hadoop
and other Apache technologies, even as vendors make code changes and
develop proprietary add-ons. That could help prevent vendor and version lock-
in, in which an organization becomes bound to a particular distribution of
Hadoop.
However, there's a lack of unanimity among Hadoop vendors on how best to
enable interoperability. Several have formed a group called the Open Data
Platform Initiative, set up within the Linux Foundation open source consortium,
to develop a common set of interoperability standards for Hadoop. But other
vendors have declined to join the group, saying that compatibility and
interoperability issues are already being sufficiently addressed within Apache.
Assuring alignment with the open source distribution as a standard is certainly
desirable in that it allows Hadoop users to maintain some flexibility in their
choice of vendors.
Prior to engaging vendors, it's also important to assess what types of
applications your company plans to develop and run using the Hadoop
ecosystem, and the required capabilities. Then determine which of these are
provided by the community open source versions of Hadoop and other
technologies and which require additional functions only provided by a specific
Hadoop software vendor.
Weighing all of these factors will help prepare your organization to move
forward and evaluate the available options. In our next article, we will assess
the similarities and differences between the leading Hadoop distributions.
Four factors for comparing the top Hadoop distributions
David Loshin, Knowledge Integrity Inc.
By examining the key characteristics presented here -- along with
the top Hadoop distributions -- you can determine which subscription
is right for your organization.
Although the software components that constitute the Hadoop ecosystem stack
are open source technologies, there are numerous benefits to paying a vendor
for a subscription to use its commercial Hadoop platform. For example, a
subscription provides technical support and training, as well as access to
enterprise features not available to the open source community. While the
enterprise editions of vendor Hadoop distributions all provide the core
components of the Hadoop ecosystem stack, the key differentiators are what
these vendors offer beyond the openly accessible functionality.
Recent changes in the market have thinned the ranks of Hadoop vendors. Just
this month, for example, Pivotal Software pulled the plug on its own Hadoop
distribution and said it would start reselling Hortonworks' instead. But there's still
a diverse group of suppliers to consider, including independent Hadoop
specialists, cloud providers and two of the largest IT vendors.
To help you determine which Hadoop provider is right for your organization, this
article distinguishes the top Hadoop distributions based on several key
characteristics; these include deployment models, enterprise-class features,
security and data protection features, and support services.
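One common way to combine these four characteristics into a single comparison is a weighted scoring matrix. The sketch below is a hedged illustration, not a method prescribed by this guide: the weights, the vendor labels and the 1-to-5 scores are all hypothetical and should be replaced with your organization's own priorities and assessments.

```python
# Hypothetical weights and scores, for illustration only. Weights must sum to 1.
WEIGHTS = {
    "deployment": 0.2,
    "enterprise_features": 0.3,
    "security": 0.3,
    "support": 0.2,
}

# Scores on an assumed 1 (poor fit) to 5 (strong fit) scale.
SCORES = {
    "vendor_a": {"deployment": 4, "enterprise_features": 3, "security": 5, "support": 2},
    "vendor_b": {"deployment": 3, "enterprise_features": 5, "security": 3, "support": 4},
}


def weighted_score(weights, scores):
    """Return the weighted sum of one vendor's criterion scores."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank(weights, all_scores):
    """Return vendor names ordered from highest to lowest weighted score."""
    return sorted(
        all_scores,
        key=lambda vendor: weighted_score(weights, all_scores[vendor]),
        reverse=True,
    )


if __name__ == "__main__":
    for vendor in rank(WEIGHTS, SCORES):
        print(vendor, round(weighted_score(WEIGHTS, SCORES[vendor]), 2))
```

The ranking is only as good as the weights, so it's worth agreeing on them with stakeholders before scoring any vendor.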
Note that while the Hadoop big data management ecosystem is engineered to
support scalable data storage and high-performance distributed computing, your
actual performance may vary for several reasons, including the software
implementation. But many performance issues are dependent on the planned
applications themselves. To address this, we'll further examine how the Hadoop
product distributions are targeted to meet the business needs of user
organizations.
1. Hadoop deployment models
Most of the Hadoop vendors support a mix of deployment methods, but Hadoop
offerings from Microsoft and Amazon Web Services are deployed solely in cloud
environments. Microsoft leverages its Azure cloud infrastructure for HDInsight, a
managed service based on the Hortonworks Data Platform (HDP) -- the same
Hadoop distribution that Pivotal is now reselling. AWS uses its Amazon Elastic
Compute Cloud (EC2) platform and S3 data store to underpin Amazon Elastic
MapReduce (EMR), which bundles its Hadoop distribution with various other
tools and technologies. In addition, Amazon EMR provides the option of using
MapR's Hadoop distribution instead of the Amazon one.
The cloud deployment model provides a rapid yet low-effort means of
provisioning a Hadoop cluster, and both Microsoft and AWS enable users to
resize their environments on demand to handle dynamic computing and storage
capacity needs. This elasticity is desirable for organizations with computational
and storage needs that may vary over time.
While the other major Hadoop vendors -- Cloudera, Hortonworks, IBM and
MapR -- all offer cloud-based deployments, they aren't limited to that model.
They allow users to download distributions that can be deployed on-premises or
in private clouds on a variety of servers, including Linux and Windows systems.
In addition, Cloudera and MapR also provide sandbox versions that can be run
in a virtual environment such as VMware.
The bottom line: Consider whether your organization prefers to manage its big
data environment in-house or use a hosted service. In-house management
implies oversight and maintenance of the software environment and continuous
monitoring of the system, whether that environment is a physical platform on
premises or housed using a cloud-based service. The on-premises option may
be preferable if you have experienced staff and know the proper system sizing
characteristics, or if security concerns warrant managing the system behind a
trusted firewall.
The alternative is to use a vendor with a hosted services platform that will help
configure, launch, manage and monitor your operations. This may be preferable
if you aren't sure what size system you will need or expect that the system size
will grow based on increasing demand. The benefit of working with a cloud or
hosted service is that it will provide the necessary elasticity for both storage and
processing resources.
2. Enterprise-class features of the top Hadoop distributions
There are some notable differences in the development approaches of the three
independent Hadoop vendors. Cloudera often augments the Hadoop core with
internally developed add-on technologies -- for example, its Impala SQL-on-
Hadoop query engine; Cloudera Manager administration tools; and Kudu, an
alternative data store to the Hadoop Distributed File System (HDFS) for use in
real-time analytics applications. Typically, the company now open sources such
technologies after doing the initial development work itself. Hortonworks, on the
other hand, says that it's "innovating 100% of its software in the Apache
Hadoop community, and there are no proprietary extensions." Add-on
technologies that it's the driving force behind, such as the Ambari provisioning
and management software, are launched as open source projects from the
outset. In addition, Hortonworks has banded together with IBM and other
companies to form the Open Data Platform Initiative (ODPi), an organization
devoted to creating a common set of core technical specifications for Hadoop
platforms. ODPi members claim that will improve interoperability and minimize
vendor lock-in.
MapR has taken a third path by developing its own file system, MapR-FS,
instead of using HDFS, as well as its own NoSQL database, MapR-DB, and
other foundational technologies in an effort to support deployments of large
clusters with enterprise-class performance needs. MapR also is increasingly
focusing on real-time and stream processing applications. In late 2015, the
company rebranded its product as the MapR Converged Data Platform, which
combines Hadoop and the MapR file system and database with the Apache
Spark processing engine and a new event streaming technology called MapR
Streams in order to handle both batch and real-time jobs.
From a features standpoint, the enterprise version of the Cloudera CDH
distribution provides tools for operational management and reporting and for
supporting business continuity. This includes such items as configuration history
and rollbacks, rolling updates and service restarts, and automated disaster
recovery. MapR's enterprise offering provides tools to better manage and
ensure the resiliency and reliability of data in Hadoop clusters, as well as multi-
tenancy and high availability capabilities. Hortonworks provides proactive
monitoring and maintenance with its HDP support subscriptions.
IBM, meanwhile, has adopted an analytics-oriented strategy on its BigInsights
for Apache Hadoop distribution, in keeping with its broader focus on selling
business intelligence and advanced analytics tools. IBM offers different value-
add modules with enterprise-grade features as part of BigInsights, including
separate Analyst and Data Scientist modules. Its Analyst module provides Big
SQL for federated SQL access to Hadoop and other data sources. BigSheets,
which is part of the Analyst module, allows users to explore, transform and
perform visualizations on large data sets stored in Hadoop, using an intuitive
spreadsheet-like interface. The BigInsights Data Scientist Module includes a
version of the R language, text analytics and a machine learning library called
SystemML that has been contributed to the open source community.
While the cloud platform is AWS' primary calling card for Amazon EMR, the
company also offers tools for monitoring and managing clusters and enabling
application and cluster interoperability as part of the Hadoop service.
Amazon EMR collects metrics that are used to track progress and measure the
health of a cluster. Cluster health metrics can be accessed through the
command line interface, software developer kits or APIs and can be viewed
through the EMR management console. Additionally, Amazon's CloudWatch
monitoring service can be used along with its implementation of the Apache
Ganglia performance monitoring component to check the cluster and set alarms
on events triggered by these metrics.
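As a rough sketch of the SDK route mentioned above, the Python snippet below builds the arguments for a CloudWatch query on an EMR cluster's `IsIdle` metric. The helper name `build_idle_metric_query` and the cluster ID `j-EXAMPLE` are hypothetical; the actual call to the boto3 SDK is shown only as a comment because it requires the SDK to be installed and AWS credentials to be configured.

```python
from datetime import datetime, timedelta, timezone


def build_idle_metric_query(cluster_id, end_time, hours=1):
    """Build keyword arguments for CloudWatch's get_metric_statistics call.

    The query asks for the average of the EMR "IsIdle" metric, in
    five-minute periods, over the `hours` leading up to `end_time`.
    """
    return {
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": 300,  # seconds
        "Statistics": ["Average"],
    }


# Usage (not executed here; needs boto3 and AWS credentials):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   query = build_idle_metric_query("j-EXAMPLE", datetime.now(timezone.utc))
#   datapoints = cloudwatch.get_metric_statistics(**query)["Datapoints"]
```

Keeping the query construction separate from the network call makes it easy to check the parameters without touching a live account.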
The bottom line: Choosing a vendor that provides value-add components as
part of its enterprise subscription may mean committing to a long-term
relationship -- especially if these components are tightly integrated with its
standard stack distribution. If you're concerned about vendor lock-in, consider
those vendors that are participating in the ODPi.
3. Security and protection offerings from the Hadoop vendors
Despite the expanding use of open source software for enterprise-class
applications, there remain suspicions about its suitability for production use from
a security and protection perspective. Several Hadoop vendors have taken
steps to alleviate some of this anxiety.
For example, Hortonworks has teamed up with other vendors and customers to
launch a Data Governance Initiative for Hadoop, with an initial focus on a new
Apache project called Atlas for managing shared metadata, data classification,
auditing, and security and policy management for data protection. It's also
working to integrate Atlas with Ranger, an open source security tool for
enforcing data access policies. Cloudera provides tools that enable users to
manage data security and governance for the CDH platform, supporting an
organization's need to meet compliance and regulatory requirements.
In addition, Hortonworks, Cloudera, MapR and IBM all provide data encryption.
Both Hortonworks and Cloudera support encryption of data at rest. MapR
provides encryption of data transmitted to, from and within a cluster. IBM offers
InfoSphere Guardium, which enforces data privacy and provides encryption and
masking of confidential data.
The bottom line: The Hadoop vendors provide different approaches to
authentication, role-based access control, security policy management and data
encryption. Carefully specify your security and protection requirements and
review how each vendor addresses those needs.
4. Support subscriptions for the top Hadoop distributions
The fundamental value proposition for the open source software model is the
bundling and simplification of system deployment with support and services.
One alternative for deploying Hadoop involves downloading the source code for
each component from the open source repository and then building and
integrating all the parts together. This takes both skill and effort, and is likely to
be an iterative process. Open source vendors have already done the heavy
lifting, providing preconfigured distributions and maintaining an up-to-date
integrated stack.
What differentiates the vendors to a large degree is their support models.
Hortonworks provides several models, ranging from its Jumpstart edition with
Web-based support during business hours and one-day response time to its
Enterprise edition with 24/7 support and much shorter response times
depending on the severity of the issue. Cloudera offers a support subscription
with one-hour and 24/7 support options for enterprise license holders. It also
offers premium support for organizations with the Flex or Data Hub edition
licenses that include a 15-minute response time for critical issues.
All AWS accounts include basic support, which provides 24/7 customer service,
access to community forums and documentation, as well as access to the AWS
Trusted Advisor application. Developer support includes one-hour response for
severe issues -- with 12- or 24-hour response times for most issues. Business-
level support provides 24/7 email access to cloud support engineers as well as
shortened response times based on severity. Enterprise-level support adds less
than 15-minute response time for critical issues as well as a dedicated technical
account manager, plus additional launch and operation support benefits.
MapR offers a Premium support service that adds Web and email support,
custom portal, training, urgent bug fixes, follow-the-sun support and 24/7 phone
support for priority issues. The company's Premium+ Support adds priority
queuing of tickets and single point of contact support, and offers options for
onsite or remote dedicated support. IBM provides support for organizations that
purchase the licensed components -- also referred to as their value-add
modules -- that extend their Open Platform with Apache Hadoop.
The bottom line: If support services are the source of added value from the
vendor, the costs for the different support subscriptions should be aligned with
customer expectations. Subscriptions providing one-hour or even 15-minute
response times on a 24/7 basis with dedicated support staff will cost a lot more
than 24-hour response time from a Web-based interface during business hours.
Hadoop has transformed the business intelligence and analytics industry during
the past 10 years. But, as we've examined, the open source Hadoop framework
offers only so much, and companies that need more robust performance and
functionality capabilities as well as maintenance and support are turning to
commercial Hadoop software distributions. Hopefully, this information will help
you make a more informed choice when purchasing a Hadoop distribution.
About the author
David Loshin, managing director at Decisionworx, is a recognized thought
leader, speaker and expert consultant. He has written numerous books,
including Big Data Analytics: From Strategic Planning to Enterprise Integration
with Tools, Techniques, NoSQL and Graph. He can be reached through his
website, at www.decisionworx.com.
Email us at editor@searchbusinessanalytics.com and follow us on Twitter:
@BizAnalyticsTT.