
IBM Software September 2014

Big data integration and Hadoop
Best practices for minimizing risks and maximizing ROI for Hadoop initiatives


Introduction

Apache Hadoop technology is transforming the economics and dynamics of big data initiatives by supporting new processes and architectures that can help cut costs, increase revenue and create competitive advantage. An open source software project that enables the distributed processing and storage of large data sets across clusters of commodity servers, Hadoop can scale from a single server to thousands, as demands change. Primary Hadoop components include the Hadoop Distributed File System for storing large files and the Hadoop distributed parallel processing framework (known as MapReduce).

However, by itself, Hadoop infrastructure does not present a complete big data integration solution, and there are both challenges and opportunities to address before you can reap its benefits and maximize return on investment (ROI).

The importance of big data integration for Hadoop initiatives

The rapid emergence of Hadoop is driving a paradigm shift in how organizations ingest, manage, transform, store and analyze big data. Deeper analytics, greater insights, new products and services, and higher service levels are all possible through this technology, enabling you to reduce costs significantly and generate new revenues.

Big data and Hadoop projects depend on collecting, moving, transforming, cleansing, integrating, governing, exploring and analyzing massive volumes of different types of data from many different sources. Accomplishing all this requires a resilient, end-to-end information integration solution that is massively scalable and provides the infrastructure, capabilities, processes and discipline required to support Hadoop projects.

“By most accounts, 80 percent of the development effort in a big data project goes into data integration and only 20 percent goes toward data analysis.”

—Intel Corporation, “Extract, Transform, and Load Big Data with Apache Hadoop”1

An effective big data integration solution delivers simplicity, speed, scalability, functionality and governance to produce consumable data from the Hadoop swamp. Without effective integration, you get “garbage in, garbage out”—not a good recipe for trusted data, much less accurate and complete insights or transformative results.


As the Hadoop market has evolved, leading technology analysts agree that Hadoop infrastructure by itself is not a complete or effective big data integration solution (read this report that discusses how Hadoop is not a data integration platform). To further complicate the situation, some Hadoop software vendors have saturated the market with hype, myths and misleading or contradictory information.

To cut through this misinformation and develop an adoption plan for your Hadoop big data project, you must follow a best practices approach that takes into account emerging technologies, scalability requirements, and current resources and skill levels. The challenge: create an optimized big data integration approach and architecture while avoiding implementation pitfalls.

Massive data scalability: The overarching requirement

If your big data integration solution cannot support massive data scalability, it may fall short of expectations. To realize the full business value of big data initiatives, massive data scalability is essential for big data integration on most Hadoop projects. Massive data scalability means there are no limitations on data volumes processed, processing throughput, or the number of processors and processing nodes used. You can process more data and achieve higher processing throughput simply by adding more hardware. The same application will then run without modification and with increased performance as you add hardware resources (see Figure 1).
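As a rough illustration of this idea, the sketch below (not taken from the paper, and deliberately simplified, with hypothetical data and transform logic) runs one unchanged transformation against every data partition and scales only by changing the worker count:

```python
# Illustrative sketch: the same application logic runs against every data
# partition; throughput grows by adding workers, not by changing the job.
from multiprocessing import Pool

def transform(record):
    """Application logic: identical for every partition (hypothetical)."""
    return record.strip().upper()

def process_partition(partition):
    """Run the unchanged transform over one data partition."""
    return [transform(r) for r in partition]

def run(partitions, workers):
    """Scale out by raising `workers`; the job itself is not modified."""
    with Pool(processes=workers) as pool:
        return pool.map(process_partition, partitions)

if __name__ == "__main__":
    data = [" alpha ", "bravo", " charlie", "delta ", "echo", "foxtrot"]
    parts = [data[i::3] for i in range(3)]      # partition the source data
    print(run(parts, workers=2))                # same job, two workers
    print(run(parts, workers=3))                # same job, three workers
```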

Critical success factor: Avoid hype and distinguish fact from fiction

During these emerging stages of the Hadoop market, carefully consider everything you hear about Hadoop’s prowess. A significant gap exists between the myths and the realities of exploiting Hadoop, particularly when it comes to big data integration. There is a lot of industry hype claiming that any non-scalable extract, transform and load (ETL) tool plus Hadoop equals a high-performance, highly scalable data integration platform.

In reality, MapReduce was not designed for the high-performance processing of massive data volumes, but for finely grained fault tolerance. That discrepancy can lower overall performance and efficiency by an order of magnitude or more.

Hadoop Yet Another Resource Negotiator (YARN) takes the resource management capabilities that were in MapReduce and packages them so they can be used by other applications that need to execute dynamically across the Hadoop cluster. As a result, this approach makes it possible to implement massively scalable data integration engines as native Hadoop applications without having to incur the performance limitations of MapReduce. All enterprise technologies seeking to be scalable and efficient on Hadoop will need to adopt YARN as part of their product road map.

Before you start your integration journey, be sure you understand the performance limitations of MapReduce and how different data integration vendors address them. Learn more in the “Themis: An I/O-Efficient MapReduce” paper, which discusses this subject at length: http://bit.ly/1v2UXAT


Critical success factor: Big data integration platforms must support all three dimensions of scalability

• Linear data scalability: A hardware and software system delivers linear increases in processing throughput with linear increases in hardware resources. For example, an application delivers linear data scalability if it can process 200 GB of data in four hours running on 50 processors, 400 GB of data in four hours running on 100 processors and so on (a simple check of this property appears in the sketch after this list).
• Application scale-up: A measurement of how effectively the software achieves linear data scalability across processors within one symmetric multiprocessor (SMP) system.
• Application scale-out: A determination of how well the software achieves linear data scalability across SMP nodes in a shared-nothing architecture.
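The following few lines are an informal illustration, using the numbers from the example above: linear data scalability holds when throughput per processor stays roughly constant as data volume and processor count grow together.

```python
# Informal check of linear data scalability: GB per hour per processor should
# remain constant as both data volume and processor count scale up.
def throughput_per_processor(gb_processed, hours, processors):
    return gb_processed / hours / processors

runs = [
    (200, 4, 50),    # 200 GB in four hours on 50 processors
    (400, 4, 100),   # 400 GB in four hours on 100 processors
]
rates = [throughput_per_processor(*r) for r in runs]
print(rates)                                          # [1.0, 1.0]
print(all(abs(r - rates[0]) < 1e-9 for r in rates))   # True -> linear scaling
```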

Requirements for supporting massive data scalability are not only linked to the emergence of the Hadoop infrastructure. Leading data warehouse vendors such as IBM and Teradata, and leading data integration platforms such as IBM® InfoSphere® Information Server have provided shared-nothing, massively parallel software platforms supporting massive data scalability for years—for nearly two decades in some cases.

Over time, these vendors have converged on four common software architecture characteristics that support massive data scalability, as shown in Figure 2.

Figure 1. Massive data scalability is a mandatory requirement for big data integration. In the big data era, organizations must be able to support an MPP clustered system to scale.

[Figure 1 shows source data flowing through transform, cleanse and enrich stages into the EDW, progressing from a sequential uniprocessor (single CPU, memory and disk), to a 4-way parallel SMP system with shared memory, to a 64-way parallel MPP clustered system or grid.]


Figure 2. The four characteristics of massive data scalability.

• A shared-nothing architecture: Software is designed from the ground up to exploit a shared-nothing, massively parallel architecture by partitioning data sets across computing nodes and executing a single application with the same application logic running against each data partition.
• Implemented using software dataflow: Software dataflow enables full exploitation of a shared-nothing architecture by making it easy to implement and execute data pipelining and data partitioning within a node and across nodes. Software dataflow also hides the complexities of building and tuning parallel applications from users.
• Leveraging data partitioning for linear data scalability: Large data sets are partitioned across separate nodes, and a single job executes the same application logic against all partitioned data.
• Resulting in a design isolation environment: Design a data processing job once, and use it in any hardware configuration without needing to redesign and re-tune the job.
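To make the dataflow and partitioning characteristics above more concrete, here is a deliberately simplified sketch (an assumption about how such an engine behaves, not a description of any product): pipelined stages run record by record within each partition, and the degree of parallelism is pure configuration, so the job design itself never changes.

```python
# Simplified dataflow sketch: chained (pipelined) stages run over each data
# partition; only the configured degree of parallelism changes at scale.
from multiprocessing import Pool

def transform(records):
    for r in records:
        yield r.strip()

def cleanse(records):
    for r in records:
        if r:                       # drop empty records
            yield r

def enrich(records):
    for r in records:
        yield {"value": r, "length": len(r)}

def run_pipeline(partition):
    """One dataflow job: the same stage logic for every partition."""
    return list(enrich(cleanse(transform(partition))))

def run_job(data, degree_of_parallelism):
    """Partition the data, then execute the identical pipeline per partition."""
    parts = [data[i::degree_of_parallelism] for i in range(degree_of_parallelism)]
    with Pool(degree_of_parallelism) as pool:
        return pool.map(run_pipeline, parts)

if __name__ == "__main__":
    rows = [" customer_a ", "", "customer_b", " customer_c "]
    print(run_job(rows, degree_of_parallelism=2))   # design unchanged at 2, 4, 8...
```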

Most commercial data integration software platforms were never designed to support massive data scalability, meaning they were not built from the ground up to exploit a shared-nothing, massively parallel architecture. They rely on shared-memory multithreading instead of software dataflow.

Furthermore, some vendors do not support partitioning large data sets across nodes and running a single data integration job in parallel against separate data partitions, or the ability to design a job once and use it in any hardware configuration without needing to redesign and retune the job. These capabilities are critical to reducing costs by realizing efficiency gains. Without them, the platform won’t be able to work with big data volumes.

The InfoSphere Information Server data integration portfolio supports the four massive data scalability architectural characteristics. Learn more in the Forrester report, “Measuring The Total Economic Impact Of IBM InfoSphere Information Server” at http://ibm.co/UX1RqB


Optimizing big data integration workloads: A balanced approach

Because nearly all Hadoop big data use cases and scenarios first require big data integration, organizations must determine how to optimize these workloads across the enterprise. One of the leading use cases for Hadoop and big data integration is offloading big ETL workloads from the enterprise data warehouse (EDW) to reduce costs and improve query service-level agreements (SLAs). That use case raises the following questions:

• Should organizations offload all ETL workloads from the EDW?
• Should all big data integration workloads be pushed into Hadoop?
• What is the ongoing role for big data integration workloads in an ETL grid without a parallel relational database management system (RDBMS) and without Hadoop?

The right answer to these questions depends on an enterprise’s unique big data requirements. Organizations can choose among a parallel RDBMS, Hadoop and a scalable ETL grid for running big data integration workloads. But no matter which method they select, the information infrastructure must meet one common requirement: full support for massively scalable processing.

Some data integration operations run more efficiently inside or outside of the RDBMS engine. Likewise, not all data integration operations are well suited for the Hadoop environment. A well-designed architecture must be flexible enough to leverage the strengths of each environment in the system (see Figure 3).

Figure 3. Big data integration requires a balanced approach that can leverage the strength of any environment.

Run in the ETL grid

Advantages: Exploit the ETL MPP engine; exploit commodity hardware and storage; exploit the grid to consolidate SMP servers; perform complex transforms (data cleansing) that can’t be pushed into the RDBMS; free up capacity on the RDBMS server; process heterogeneous data sources (not stored in the database); the ETL server is faster for some processes.

Disadvantages: The ETL server is slower for some processes (data already stored in relational tables); may require extra (low-cost) hardware.

Run in the database

Advantages: Exploit the database MPP engine; minimize data movement; leverage the database for joins/aggregations; works best when data is already clean; free up cycles on the ETL server; use excess capacity on the RDBMS server; the database is faster for some processes.

Disadvantages: Expensive hardware and storage; degradation of query SLAs; not all ETL logic can be pushed into the RDBMS (with an ETL tool or hand coding); can’t exploit commodity hardware; usually requires hand coding; limitations on complex transformations; limited data cleansing; the database is slower for some processes.

Run in Hadoop

Advantages: Exploit the MapReduce MPP engine; exploit commodity hardware and storage; free up capacity on the database server; support processing of unstructured data; exploit Hadoop capabilities for persisting data (such as updating and indexing); enables low-cost archiving of history data.

Disadvantages: Can require complex programming; MapReduce will usually be much slower than a parallel database or scalable ETL tool; risk: Hadoop is still a young technology.


Here are three important guidelines to follow when optimizing big data integration workloads:

1. Push big data integration processing to the data instead of pushing the data to the processing: Identify the processes that are best executed in the RDBMS, in Hadoop or in the ETL grid.

2. Avoid hand coding: Hand coding is expensive and does not effectively support rapid or frequent changes. It also doesn’t support the automated collection of design and operational metadata that is critical for data governance.

3. Do not maintain separate silos of integration development for the RDBMS, Hadoop and the ETL grid: This serves no practical purpose and becomes tremendously expensive to support. You should be able to build a job once and run it in any of the three environments (a sketch of this idea follows).
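As a purely hypothetical sketch of the “build once, run anywhere” guideline (the job definition, step names and runner functions below are illustrative assumptions, not a real product API), a single declarative job description can be handed to whichever environment is appropriate at run time:

```python
# Hypothetical sketch of "design once, run in any environment": one declarative
# job definition, with the execution environment chosen at run time.
JOB = {
    "name": "load_customer_dim",                       # illustrative job name
    "source": "landing/customers",
    "steps": ["standardize_names", "deduplicate", "apply_survivorship"],
    "target": "warehouse.customer_dim",
}

def run_in_rdbms(job):
    return f"pushing {job['name']} into the parallel RDBMS as generated SQL"

def run_in_hadoop(job):
    return f"running {job['name']} on the Hadoop cluster, next to the data"

def run_in_etl_grid(job):
    return f"running {job['name']} on the scalable ETL grid"

RUNNERS = {"rdbms": run_in_rdbms, "hadoop": run_in_hadoop, "etl_grid": run_in_etl_grid}

def execute(job, environment):
    """Same job definition; only the target environment changes."""
    return RUNNERS[environment](job)

print(execute(JOB, "hadoop"))
print(execute(JOB, "rdbms"))      # no redesign or re-tuning of the job
```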

Processes best suited to Hadoop

Hadoop platforms comprise two primary components: a distributed, fault-tolerant file system called the Hadoop Distributed File System (HDFS), and a parallel processing framework called MapReduce.

The HDFS platform is very good at processing large sequential operations, where a “slice” of data read is often 64 MB or 128 MB. Generally, HDFS files are not partitioned or ordered unless the application loading the data manages this. Even if the application can partition and order the resulting data slices, there is no way to guarantee where a slice will be placed in the HDFS system. This means there is no good way to manage data collocation in this environment. Data collocation is critical because it ensures data with the same join keys winds up on the same nodes, and therefore the process is both high-performing and accurate.

While there are ways to accommodate the lack of support for data collocation, they tend to be costly—typically requiring extra processing and/or restructuring of the application. HDFS files are also immutable (read only), and processing an HDFS file is similar to running a full table scan in that most often all the data is processed. This should immediately raise a red flag for operations such as joining two very large tables, since the data will likely not be collocated on the same Hadoop node.
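To illustrate why collocation matters, the sketch below (simplified and hypothetical, with made-up rows) hash-partitions both data sets on the join key so that matching rows land in the same partition and each partition can be joined locally, without moving data between nodes:

```python
# Illustrative sketch: hash-partitioning both data sets on the join key
# "collocates" matching rows in the same partition, so each partition can be
# joined locally without shuffling data between nodes.
from collections import defaultdict

def partition_by_key(rows, key, n_partitions):
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        parts[hash(row[key]) % n_partitions].append(row)
    return parts

def local_join(left_part, right_part, key):
    index = defaultdict(list)
    for row in right_part:
        index[row[key]].append(row)
    return [{**l, **r} for l in left_part for r in index[l[key]]]

orders = [{"cust_id": 1, "amount": 40}, {"cust_id": 2, "amount": 75}]
customers = [{"cust_id": 1, "name": "Acme"}, {"cust_id": 2, "name": "Globex"}]

N = 4  # same partitioning scheme on both sides => matching keys land together
joined = [row
          for l, r in zip(partition_by_key(orders, "cust_id", N),
                          partition_by_key(customers, "cust_id", N))
          for row in local_join(l, r, "cust_id")]
print(joined)
```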

MapReduce Version 1 is a parallel processing framework that was not specifically designed for processing large ETL workloads with high performance. By default, data can be repartitioned or re-collocated between the map and the reduce phase of processing. To facilitate recovery, the data is landed on the node running the map operation before being shuffled and sent to the reduce operation.

MapReduce also contains facilities to move smaller reference data structures to each map node for some validation and enhancement operations. Because the entire reference file is copied to each map node, this approach is only appropriate for smaller reference data structures. If you are hand coding, you must account for these processing flows, so it is best to adopt tools that generate code to push data integration logic down into MapReduce (also known as ETL pushdown).
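The following Hadoop Streaming–style mapper is a rough sketch of that map-side pattern (the reference file name, delimiter and record layout are assumptions for illustration): the small reference table is loaded once per map task, and each input record is enriched in place with no shuffle required.

```python
# Hadoop Streaming-style sketch (assumed file name and record layout): the
# small reference file is shipped to every map task and loaded into memory,
# so each input record can be validated/enriched map-side without a shuffle.
import sys

def load_reference(path="country_codes.tsv"):
    """Small lookup table, loaded once per map task (hypothetical file)."""
    ref = {}
    with open(path) as f:
        for line in f:
            code, name = line.rstrip("\n").split("\t")
            ref[code] = name
    return ref

def mapper(lines, reference):
    for line in lines:
        customer_id, country_code = line.rstrip("\n").split("\t")
        country = reference.get(country_code, "UNKNOWN")  # map-side enrichment
        yield f"{customer_id}\t{country}"

if __name__ == "__main__":
    reference = load_reference()
    for out in mapper(sys.stdin, reference):
        print(out)
```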


Using ETL pushdown processing in Hadoop (regardless of the tool doing the pushing) can create a situation where a nontrivial portion of the data integration processing must continue to run in the ETL engine and not on MapReduce. This is true for several reasons:

• More complex logic cannot be pushed into MapReduce
• MapReduce has significant performance limitations
• Data is typically stored in HDFS in a random sequential manner

All of these factors suggest that big data integration in a Hadoop environment requires three components for high-performance workload processing:

1) A Hadoop distribution
2) A shared-nothing, massively scalable ETL platform (such as the one offered by IBM InfoSphere Information Server)
3) ETL pushdown capability into MapReduce

All three components are required because a large percentage of data integration logic cannot be pushed into MapReduce without hand coding, and because MapReduce has known performance limitations.

Critical success factor: Consider data integration workload processing speeds

The InfoSphere Information Server shared-nothing, massively parallel architecture is optimized for processing large data integration workloads efficiently with high performance. IBM InfoSphere DataStage®—a part of InfoSphere Information Server that integrates data across multiple systems using a high-performance parallel framework—can process typical data integration workloads 10 to 15 times faster than MapReduce.2

InfoSphere DataStage also offers balanced optimization for the Hadoop environment. Balanced optimization generates Jaql code to run natively in the MapReduce environment. Jaql comes with an optimizer that will analyze the generated code and optimize it into a map component and a reduce component. This automates a traditionally complex development task and frees the developer from worrying about the MapReduce architecture.

InfoSphere DataStage can run directly on the Hadoop nodes rather than on a separate node in the configuration, which some vendor implementations require. This capability helps reduce network traffic when coupled with IBM General Parallel File System (GPFS™)-FPO, which provides a POSIX-compliant storage subsystem in the Hadoop environment. A POSIX file system allows ETL jobs to directly access data stored in Hadoop rather than requiring use of the HDFS interface. This environment supports moving the ETL workload into the hardware environment that Hadoop is running on—helping to move the processing to where the data is stored and leveraging the hardware for both Hadoop and ETL processing.

Resource management systems such as IBM Platform™ Symphony can also be used to manage data integration workloads both inside and outside of the Hadoop environment.

This means that although InfoSphere DataStage may not run on the exact same node as the data, it runs on the same high-speed backplane, eliminating the need to move the data out of the Hadoop environment and across slower network connections.


ETL scalability requirements for supporting Hadoop

Many Hadoop software vendors evangelize the idea that any non-scalable ETL tool with pushdown into MapReduce will provide excellent performance and application scale-out for big data integration—but this is simply not true.

Without a shared-nothing, massively scalable ETL engine such as InfoSphere DataStage, organizations will experience functional and performance limitations. More and more organizations are realizing that competing non-scalable ETL tools with pushdown into MapReduce are not capable of providing required levels of performance in Hadoop. They are working with IBM to address this issue because the IBM big data integration solution uniquely supports massive data scalability for big data integration.

Here are some of the cumulative negative effects from overreliance on ETL pushdown:

• ETL comprises a large percentage of the EDW workload. Because of the associated costs, the EDW is a very expensive platform for running ETL workloads.
• ETL workloads cause degradation in query SLAs, and eventually require you to invest in additional, expensive EDW capacity.
• Data is not cleansed prior to being dumped into the EDW and is never cleansed once in the EDW environment, promoting poor data quality.
• The organization continues to rely heavily on manual coding of SQL scripts for data transformations.
• Adding new data sources or modifying existing ETL scripts is expensive and takes a long time, limiting the ability to respond quickly to new requirements.
• Data transformations are relatively simple because more complex logic cannot be pushed into the RDBMS using an ETL tool.
• Data quality suffers.
• Critical tasks such as data profiling are not automated—and in many cases are not performed at all.
• No meaningful data governance (data stewardship, data lineage, impact analysis) is implemented, making it more difficult and expensive to respond to regulatory requirements and have confidence in critical business data.

In contrast, organizations adopting massively scalable data integration platforms that optimize big data integration workloads minimize potential negative effects, leaving them in a better position to transform their business with big data.

Best practices for big data integration

Once you’ve decided to adopt Hadoop for your big data initiatives, how do you implement big data integration projects while protecting yourself against Hadoop variability?


Working with numerous early adopters of Hadoop technology, IBM has identified five fundamental big data integration best practices. These five principles represent best-of-breed approaches for successful big data integration initiatives:

1. Avoid hand coding anywhere for any purpose
2. One data integration and governance platform for the enterprise
3. Massively scalable data integration available wherever it needs to run
4. World-class data governance across the enterprise
5. Robust administration and operations control across the enterprise

Best practice #1: Avoid hand coding anywhere for any purpose

Over the past two decades, large organizations have recognized the many advantages of replacing hand coding with commercial data integration tools. The debate between hand coding and data integration tooling has been settled, and many technology analysts have summarized the significant ROI advantages3 to be realized from adoption of world-class data integration software.

“When in doubt, use higher-level tools whenever possible.”

—“Large-Scale ETL With Hadoop,” Strata+Hadoop World 2012 presentation given by Eric Sammer, Principal Solution Architect, Cloudera4

The first best practice is to avoid hand coding anywhere, for any aspect of big data integration. Instead, take advantage of graphical user interfaces available with commercial data integration software to support activities such as:

• Data access and movement across the enterprise
• Data integration logic
• Assembling data integration jobs from logic objects
• Assembling larger workflows
• Data governance
• Operational and administrative management

By adopting this best practice, organizations can exploit the proven productivity, cost, time to value, and robust operational and administrative control advantages of commercial data integration software while avoiding the negative impact of hand coding (see Figure 4).

Page 11: Big data integration and Hadoop - IBM

11IBM Software

Figure 4. Data integration software provides multiple GUIs to support various activities. These GUIs replace complex hand coding and save organizations significant amounts of development costs.

[Figure 4 contrasts development using hand coding with development using data integration tools for a set of HDFS tasks: reading from an HDFS file in parallel, joining two HDFS files, transforming/restructuring the data, and creating a new, fully parallelized HDFS file. Hand coding took about 30 man-days and produced almost 2,000 lines of code (roughly 71,000 characters) with no documentation and limited reuse and maintainability. The tooling-based approach took 2 days, was graphical, self-documenting, reusable and more maintainable, and delivered improved performance—an approximately 87 percent savings in development costs. The figure also notes that a pre-built data integration solution can help map and manage data governance requirements across the enterprise and streamline the creation of data integration jobs from logic objects. Source for the hand coding and tooling results: IBM pharmaceutical customer example.]


Best practice #2: One data integration and governance platform for the enterprise

Overreliance on pushing ETL into the RDBMS (due to a lack of scalable data integration software tooling) has prevented many organizations from replacing SQL script hand coding and establishing meaningful data governance across the enterprise. Nevertheless, they recognize there are huge cost savings to be had by moving large ETL workloads from the RDBMS to Hadoop. However, moving from a silo of hand-coded ETL in the RDBMS to a new silo of hand-coded ETL in Hadoop only doubles down on high costs and long lead times.

Deploying a single data integration platform creates an opportunity for organizational transformation through the ability to:

• Build a job once and run it anywhere on any platform in the enterprise without modification
• Access, move and load data between a variety of sources and targets across the enterprise
• Support a variety of data integration paradigms, including batch processing; federation; change data capture; SOA enablement of data integration tasks; real-time integration with transactional integrity; and/or self-service data integration for business users

It also provides an opportunity to establish world-class data governance, including data stewardship, data lineage and cross-tool impact analysis.

Best practice #3: Massively scalable data integration available wherever it needs to run

Hadoop offers significant potential for the large-scale, distributed processing of data integration workloads at extremely low cost. However, clients need a massively scalable data integration solution to realize the potential advantages that Hadoop can deliver.

Figure 5. Scalable big data integration must be available for any environment.

[Figure 5 illustrates “design the job once, run and scale it anywhere.” Outside the Hadoop environment: Case 1, the InfoSphere Information Server parallel engine running against any traditional data source; Case 2, pushing processing into a parallel database; Case 3, moving and processing data in parallel between environments. Within the Hadoop environment: Case 4, pushing processing into MapReduce; Case 5, the InfoSphere Information Server parallel engine running against HDFS without MapReduce.]


Scenarios for running the data integration workload may include:

• The parallel RDBMS
• The grid, without the RDBMS or Hadoop
• In Hadoop, with or without pushdown into MapReduce
• Between the Hadoop environment and the outside environment, extracting data volumes on one side, processing and transforming the records in flight, and loading the records on the other side

To achieve success and sustainability—and to keep costs low—an effective big data integration solution must flexibly support each of these scenarios. Based on IBM experience with big data customers, InfoSphere Information Server currently is the only commercial data integration software platform that supports all of these scenarios, including pushdown of data integration logic into MapReduce.

There are many myths circulating within the industry about running ETL tools in Hadoop for big data integration. The popular wisdom seems to be that combining any non-scalable ETL tool and Hadoop provides all required massively scalable data integration processing. In reality, MapReduce suffers several limitations for processing large-scale data integration workloads:

• Not all data integration logic can be pushed into MapReduce using the ETL tool. Based on experiences with its clients, IBM estimates that about 50 percent of data integration logic cannot be pushed into MapReduce.
• Users have to engage in complex hand coding to run more complex data integration logic in Hadoop, or restrict the process to running relatively simple transformations in MapReduce.
• MapReduce has known performance limitations for processing large data integration workloads, as it was designed to support finely grained fault tolerance at the expense of high-performance processing.

Best practice #4: World-class data governance across the enterprise

Most large organizations have found it difficult, if not impossible, to establish data governance across the enterprise. There are several reasons for this. For example, business users manage data using business terminology that is familiar to them. Until recently, there has been no mechanism for defining, controlling and managing this business terminology and linking it to IT assets.

Also, neither business users nor IT staff have a high degree of confidence in their data, and may be uncertain of its origins and/or history. The technology for creating and managing data governance through capabilities such as data lineage and cross-tool impact analysis did not exist, and manual methods involve overwhelming complexity. Industry regulatory requirements only add to the complexity of managing governance. Finally, overreliance on hand coding for data integration makes it difficult to implement data governance throughout an organization.


It is essential to establish world-class data governance—with a fully governed data lifecycle for all key data assets—that includes the Hadoop environment but is not limited to it. Here are suggested steps for a comprehensive data lifecycle:

• Find: Leverage terms, labels and collections to find governed, curated data sources
• Curate: Add labels, terms and custom properties to relevant assets
• Collect: Use collections to capture assets for a specific analysis or governance effort
• Collaborate: Share collections for additional curation and governance
• Govern: Create and reference information governance policies and rules; apply data quality, masking, archiving and cleansing to data
• Offload: Copy data in one click to HDFS for analysis and warehouse augmentation
• Analyze: Analyze offloaded data
• Reuse and trust: Understand how data is being used, with lineage for analysis and reports

With a comprehensive data governance initiative in place, you can build an environment that helps ensure all Hadoop data is of high quality, secure and fit for purpose. It enables business users to answer questions such as:

• Do I understand the content and meaning of this data?
• Can I measure the quality of this information?
• Where does the data in my report come from?
• What is being done to the data inside of Hadoop?
• Where was it before reaching our Hadoop data lake?

Best practice #5: Robust administration and operations control across the enterprise

Organizations adopting Hadoop for big data integration must expect robust, mainframe-class administration and operations management, including:

• An operations console interface that provides quick answers for anyone operating the data integration applications, developers and other stakeholders as they monitor the runtime environment
• Workload management to allocate resource priority to certain projects in a shared-services environment and queue workloads on a busy system
• Performance analysis for insight into resource consumption to identify bottlenecks and determine when systems may need more resources
• Building workflows that include Hadoop-based activities defined through Oozie directly in the job sequence, as well as other data integration activities

Administration management for big data integration must include:

• An integrated web-based installer for all capabilities
• High-availability configurations for meeting 24/7 requirements
• Flexible deployment options to deploy new instances or expand existing instances on expert, optimized hardware systems
• Centralized authentication, authorization and session management
• Audit logging of security-related events to promote Sarbanes-Oxley Act compliance
• Lab certification for various Hadoop distributions


Best practices for big data integration set a foundation for success

Organizations are looking to big data initiatives to help them cut costs, increase revenue and gain first-mover advantages. Hadoop technology supports new processes and architectures that enable business transformation, but certain big data challenges and opportunities must be addressed before this can happen.

IBM recommends building a big data integration architecture that is flexible enough to leverage the strengths of the RDBMS, the ETL grid and Hadoop environments. Users should be able to construct an integration workflow once and then run it in any of these three environments.

The five big data integration best practices outlined in this paper represent best-of-breed approaches that set your projects up for success. Following these guidelines can help your organization minimize risks and costs and maximize ROI for your Hadoop projects.

For more information

To learn more about the big data integration best practices and IBM integration solutions, please contact your IBM representative or IBM Business Partner, or visit: ibm.com/software/data/integration

Additionally, IBM Global Financing can help you acquire the software capabilities that your business needs in the most cost-effective and strategic way possible. We’ll partner with credit-qualified clients to customize a financing solution to suit your business and development goals, enable effective cash management, and improve your total cost of ownership. Fund your critical IT investment and propel your business forward with IBM Global Financing. For more information, visit: ibm.com/financing


© Copyright IBM Corporation 2014

IBM Corporation Software Group Route 100 Somers, NY 10589

Produced in the United States of America September 2014

IBM, the IBM logo, ibm.com, DataStage, GPFS, InfoSphere, Platform, and PureData are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

It is the user’s responsibility to evaluate and verify the operation of any other products or programs with IBM products and programs.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

The client is responsible for ensuring compliance with laws and regulations applicable to it. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the client is in compliance with any law or regulation. Actual available storage capacity may be reported for both uncompressed and compressed data and will vary and may be less than stated.

1 Intel Corporation. “Extract, Transform, and Load Big Data With Apache Hadoop.” July 2013. http://intel.ly/UX1Umk

2 Measurements produced by IBM while working on-site with a customer deployment.

3 International Technology Group. “Business Case for Enterprise Data Integration Strategy: Comparing IBM InfoSphere Information Server and Open Source Tools.” February 2013. ibm.com/common/ssi/cgi-bin/ssialias?infotype=PM&subtype=XB&htmlfid=IME14019USEN

4 “Large-Scale ETL With Hadoop,” Strata+Hadoop World 2012 presentation given by Eric Sammer, Principal Solution Architect, Cloudera. www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-large-scale-etl-with-hadoop.html


IMW14791-USEN-00