ibm cscc deploying big data analytics applications to the cloud roadmap for success

Upload: yasirzaidi1

Post on 07-Jul-2018

212 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    1/21

    Deploying Big Data Analytics Applications to theCloud: Roadmap for Success

    May, 2014

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    2/21

    Copyright © 2014 Cloud Standards Customer Council Page 2

    Contents Acknowledgements ....................................................................................................................................... 3

    Workgroup Leaders ................................................................................................................................... 3

    Key Contributors ....................................................................................................................................... 3

    Reviewers .................................................................................................................................................. 3

    Executive Overview ....................................................................................................................................... 4

    Roadmap ....................................................................................................................................................... 4

    Step 1: Build the Business Case ................................................................................................................. 5

    Step 2: Assess your Application Workloads .............................................................................................. 8

    Step 3: Develop a Technical Approach .................................................................................................... 10

    Step 4: Address Governance, Security, Privacy, Risk, and Accountability Requirements ...................... 13

    Step 5: Deploy, Integrate, and Operationalize the Infrastructure .......................................................... 15Works Cited ................................................................................................................................................. 16

    Additional References ................................................................................................................................. 16

    Appendix A: Use Cases and Applications of Big Data in the Cloud ............................................................. 17

    Appendix B: Technical Requirements for Big Data in the Cloud ................................................................. 18

    Appendix C: Risk-Mitigation Considerations for Big Data in the Cloud ...................................................... 20

    Appendix D: Platform-choice Considerations for Big Data in the cloud ..................................................... 20

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    3/21

    Copyright © 2014 Cloud Standards Customer Council Page 3

    © 2014 Cloud Standards Customer Council.

    All rights reserved. You may download, store, display on your computer, view, print, and link to theDeploying Big Data Analytics Applications to the Cloud: Roadmap for Success white paper at the CloudStandards Customer Council Web site subject to the following: (a) the document may be used solely for

    your personal, informational, non-commercial use; (b) the document may not be modified or altered inany way; (c) the document may not be redistributed; and (d) the trademark, copyright or other noticesmay not be removed. You may quote portions of the document as permitted by the Fair Use provisionsof the United States Copyright Act, provided that you attribute the portions to the Cloud StandardsCustomer Council Deploying Big Data Analytics Applications to the Cloud: Roadmap for Success (2014).

    Acknowledgements

    Workgroup LeadersJames Kobielus, IBM; Bob Marcus, ET Strategies

    Key ContributorsJames Kobielus, IBM

    ReviewersBob Marcus, ET Strategies; David Saul, State Street; Judith Hurwitz, Hurwitz Group; Thomas McCullough,LMCO; Jay Greenberg, Boeing; Michael Edwards, IBM

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    4/21

    Copyright © 2014 Cloud Standards Customer Council Page 4

    Executive Overview Big Data focuses on achieving deep business value from deployment of advanced analytics andtrustworthy data at Internet scales.

    Big Data is at the heart of many cloud services deployments. As private and public cloud deployments

    become more prevalent, it will be critical for end-user organizations to have a clear understanding of BigData application requirements, tool capabilities, and best practices for implementation.

    Big Data and cloud computing are complementary technological paradigms with a core focus onscalability, agility, and on-demand availability. Big Data is an approach for maximizing the linearscalability, deployment and execution flexibility, and cost-effectiveness of analytic data platforms. Itrelies on such underlying approaches as massively parallel processing, in-database execution, storageoptimization, data virtualization, and mixed-workload management. Cloud computing complements BigData by enabling ubiquitous, convenient, on-demand network access to a shared pool of configurablecomputing resources that can be rapidly provisioned and released with minimal management effort or

    service provider interaction.

    This paper's primary focus is to provide guidance to cloud service customer organizations in allindustries on how best to deploy new and existing Big Data analytics applications to the cloud. Thepaper:

    • Defines Big Data, analytic applications, and cloud deployment alternatives.• Provides a description of the typical current Big Data analytics environment (i.e., what is the

    current progression to Big Data and/or cloud computing for the majority of enterprises andwhat are their primary issues).

    • Provides a high-level description and decision criteria for the types of Big Data analyticapplications that are appropriate to deploy to the cloud vs. the type of Big Data analyticsapplications which are not.

    • Describes the principal use cases for Big Data in the cloud, including Big Data exploration, datawarehouse augmentation, enhanced 360 degree view of the customer, operations analytics,security and intelligence analytics.

    • Highlights the potential benefits of developing new and/or deploying existing Big Data analyticsapplications to the cloud.

    • Articulates the challenges associated with deploying new and/or existing Big Data analyticsapplications to the cloud (challenges will vary depending on the type of Big Data analyticsapplication which is being considered for deployment to the cloud).

    RoadmapThis paper provides a prescriptive roadmap for justifying, deploying, and managing Big Data in the cloud.It highlights key considerations that users should follow in order to ensure successful deployment. Keysteps discussed below are as follows:

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    5/21

    Copyright © 2014 Cloud Standards Customer Council Page 5

    • Step 1: Build the business case for Big Data in the cloud

    • Step 2: Assess your Big Data application workloads for deployment into the cloud

    • Step 3: Develop a technical approach for deploying and managing Big Data in the cloud

    • Step 4: Address governance, security, privacy, risk, and accountability requirements

    • Step 5: Deploy, integrate, and operationalize your cloud-based Big Data infrastructure

    Step 1: Build the Business Case Building a business case for Big Data in the cloud involves the following key considerations:

    Articulate strategic business drivers for Big Data in the cloudBig Data focuses on achieving deep business value from deployment of advanced analytics andtrustworthy data at Internet scales. The key business drivers for Big Data in business are the following:

    • Extract greater value from enterprise data resources• Enhance information worker and developer productivity• Realize continued improvements in customer acquisition, loyalty, retention, upsell, and

    satisfaction• Ensure more timely, agile response to business opportunities, threats, and challenges• Manage a single view of diverse data resources

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    6/21

    Copyright © 2014 Cloud Standards Customer Council Page 6

    • Secure, protect, and govern data throughout its lifecycle• Improve data/analytics platform scale, efficiency, and agility• Optimize back-office operations

    Refer to Appendix A for a thorough list of use cases and applications of Big Data in the cloud.

    Identify key stakeholders and champions for Big Data in the cloudEvery C-level business executive has strategic applications of Big Data in the cloud. The principalstakeholder/champions and their strategic imperatives are presented in Table 1:

    Chief Marketing Officers CMOs have been the prime movers on many Big Data initiatives. Theirprimary applications consist of marketing campaign optimization,customer churn and loyalty, upsell and cross-sell analysis, targetedoffers, behavioral targeting, social media monitoring, sentimentanalysis, brand monitoring, influencer analysis, customer experienceoptimization, content optimization, and placement optimization

    Chief Information Officers CIOs use Big Data platforms for data discovery, data integration,business analytics, advanced analytics, exploratory data science.

    Chief Operations Officers COOs rely on Big Data for supply chain optimization, defect tracking,sensor monitoring, and smart grid, among other applications.

    Chief Information SecurityOfficers

    CISOs run security incident and event management, anti-frauddetection, and other sensitive applications on Big Data.

    Chief Technology Officers CTOs do IT log analysis, event analytics, network analytics, and othersystems monitoring, troubleshooting, and optimization applications onBig Data

    Chief Financial Officers CFOs run complex financial risk analysis and mitigation modelingexercises on Big Data platforms.

    Table 1: Chief stakeholders of Big Data in the cloud

    Gaining support for your Big Data cloud business case requires that you align it with stakeholderrequirements. If key business and IT stakeholders haven’t clearly specified their requirements orexpectations for your Big Data cloud initiative, it’s not going to succeed. Make sure that you build a casethat articulates business value (i.e. what are the likely or potential benefits) in business terms, such as:

    • Enabling more agile business response to competitive opportunities and challenges• Improvements in customer loyalty, engagement, satisfaction, and lifetime value• Enhancements in operational efficiency, speed, and flexibility• Tightened security and compliance

    To gain business stakeholder support for your Big Data cloud initiative, your business case should:

    • Define a business justification and return on investment for your Big Data cloud initiative• Identify quantifiable success metrics for the project• Measure and report project success metrics

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    7/21

    Copyright © 2014 Cloud Standards Customer Council Page 7

    Emphasize the advantages of cloud-based deployment for Big DataIn building your business case for Big Data in the cloud and gaining the support of key stakeholders, youmust first identify Big Data business applications where cloud approaches have an advantage that otherapproaches (such as software on commodity hardware or pre-integrated appliances) lack. The practicaloverlap between the Big Data and cloud paradigms is so extensive that you could validly claim you'redoing cloud-based Big Data with an existing on-premise Hadoop, NoSQL, or enterprise data warehousingenvironment. Keep in mind that cloud computing includes "private cloud" deployments in addition to orin lieu of public cloud environments.

    Cloud services have an important role to play in Big Data, especially for short-term jobs or applicationswhere the bulk of data is already held in a cloud environment. Building your business case requiresupfront analysis of the key value points in your business where cloud-based approaches are the rightmodel for deploying Big Data analytics. At heart, the core principles defining what constitutes Big Data inthe cloud are presented in Table 2:

    On-demand self-service This enables a Big Data cloud service customer to self-provision cloudservices--both physical and virtual--automatically and with minimalinteraction involving the cloud service provider.

    Broad network access This enables Big Data cloud resources to be made available over thenetwork and accessible through standard mechanisms by diverse clientplatforms.

    Multi-tenancy This enables Big Data cloud resources to be allocated so that multipletenants and their computations and data can be guaranteed isolationfrom one another.

    Resource pooling This aggregates Big Data cloud resources in a location-independentfashion to serve multiple multi-tenant customers, enabling resources tobe dynamically assigned and reassigned on demand and to be accessed

    through a simple abstraction.Rapid elasticity andscalability

    This enables Big Data cloud resources to be rapidly, elastically, andautomatically scaled out up, out, and down on demand.

    Measured service This enables Big Data cloud resources to be monitored, controlled,reported, and billed transparently.

    Table 2: Core architectural principles of cloud-based Big Data deployments

    The real questions for the architecture of Big Data and cloud computing are:a) What is the data involved, how big is it and where is it located? One of the challenges of Big

    Data is that the data may be disparate and start out in a variety of locations. These locations

    may or may not be within a cloud service.b) What sort of processing is going to be required on the data? Continuous or infrequent "burst"

    mode? How much can it be parallelized?c) Is it practical to move the data to an environment where the processing can be performed

    efficiently, or is it better to think about moving the processing to the data? With the sorts ofdata volume involved – these are the key questions.

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    8/21

    Copyright © 2014 Cloud Standards Customer Council Page 8

    To gain IT stakeholder support for your Big Data cloud business case, you must address the followingconcerns:

    • Technical and supportability requirements of Big Data in the cloud (refer to Appendix B fordetails)

    • Deployment and service models of Big Data in the cloud (refer to relevant CSCC whitepapers [1],[2] for details)

    • Risk-mitigation considerations for Big Data in the cloud (refer to Appendix C for details)

    Step 2: Assess your Application WorkloadsYour business case must be predicated on your ability, under the proposed Big Data clouddeployment/service model and the proposed use case, to handle the expected application workloads. 1

    The principal functionality that any Big Data analytic application must support falls into the functionalcategories described in Table 3.

    1 For example, it’s clear that Hadoop has already proven its core role in the emerging Big Data cloud ecosystem asa petabyte-scalable staging, transformation, pre-processing and refinery platform for unstructured content andembedded execution of advanced analytics.

    http://hadoop.apache.org/http://hadoop.apache.org/http://hadoop.apache.org/http://hadoop.apache.org/

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    9/21

    Copyright © 2014 Cloud Standards Customer Council Page 9

    Big-data storage A Big Data cloud must incorporate a highly scalable, efficient, low-cost datastorage platform. Chief uses of the Big Data cloud storage platform are forarchiving, governance and replication, as well as for discovering, acquiring,aggregating and governing multi-structured content. Typically, it wouldsupport these uses through integration with a high capacity storage areanetwork architecture.

    Big-data processing A Big Data cloud should support massively parallel execution of advanceddata processing, manipulation, analysis and access functions. It shouldsupport the full range of advanced analytics, as well as some functionstraditionally associated with enterprise data warehouses, businessintelligence, and online analytical processing. It should have all themetadata, models and other services needed to handle such core analyticsfunctions as query, calculation, data loading and data integration. And itshould handle a subset of these functions and interface through connectorsto other data/analytic platforms.

    Big-data development A Big Data cloud should support Big Data application development whichinvolves modeling, mining, exploration and analysis of deep data sets. Thecloud should provide a scalable “sandbox” with tools that allow datascientists, predictive modelers and business analysts to interactively andcollaboratively explore rich information sets. It should incorporate a high-performance analytic runtime platform where these teams can aggregateand prepare data sets, tweak segmentations and decision trees, and iteratethrough statistical models as they look for deep statistical patterns. It shouldfurnish data scientists with massively parallel CPU, memory, storage and I/Ocapacity for tackling analytics workloads of growing complexity. And it shouldenable elastic scaling of sandboxes from traditional statistical analysis, datamining and predictive modeling, into new frontiers of Hadoop/MapReduce,R, geospatial, matrix manipulation, natural language processing, sentiment

    analysis and other resource-intensive types of Big Data processing.

    Table 3: Principal Big Data workloads

    Assess which application workloads are suitable for public vs. private cloudsA Big Data cloud should not be a stand-alone environment, but, instead, a standards-basedinteroperable environment that, when deployed in larger cloud configurations, can be rapidly optimizedto new workloads as they come online. Many Big Data clouds will increasingly be configured to supportmixes of two or all three of the functional requirements described above within specific cloud nodes orspecific clusters. Some will handle low latency and batch jobs with equal agility. Others will be entirelyspecialized to a particular function that they perform with lightning speed and elastic scalability. Thebest Big Data analytic clouds will facilitate flexible re-optimization by streamlining the myriaddeployment, configuration tuning tasks across larger, more complex deployments.

    You may not be able to forecast with fine-grained precision the mix of workloads you’ll need to run onyour Big Data cloud in the future. But investing in the right family of Big Data cloud platforms, tools, andapplications should give you confidence that, when the day comes, you’ll have the foundation in placeto provision resources rapidly and efficiently.

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    10/21

    Copyright © 2014 Cloud Standards Customer Council Page 10

    If you are considering public cloud services, you will need to identify which Big Data applicationworkloads are best suited for public cloud deployment vs. your own private cloud deployment. Wherepublic cloud deployment is concerned, some of the principal applications are presented in Table 4:

    Enterprise applications alreadyhosted in the cloud

    If, like many organizations -- especially small and midmarketbusinesses -- you use cloud-based applications from an externalservice provider, much of your source transactional data is alreadyin a public cloud. If you have deep historical data in that cloudservice, it might already have accumulated in Big Data magnitudes.To the extent the cloud service provider or one of its partners offersa value-added analytics service -- such as churn analysis, marketingoptimization, or off-site backup and archiving of customer data -- itmight make sense to leverage that rather than host it all in-house.

    High-volume external datasources that requireconsiderable preprocessing

    If, for example, you're doing customer sentiment monitoring onaggregated feeds of social media data, you probably don't have theserver, storage, or bandwidth capacity in-house to do it justice.That's a clear example of an application where you'd want toleverage the social media filtering service provided by a public-cloud-based, Big Data-powered service.

    Tactical applications beyondyour on-premises, Big Datacapabilities

    If you already have an on-premises Big Data platform dedicated forone application (such as a dedicated Hadoop cluster for high-volume extract-transform-load (ETL) on unstructured data sources),it might make sense to use a public cloud to address newapplications (say, multichannel marketing, social media analytics,geospatial analytics, query-able archiving, elastic data-sciencesandboxing) for which the current platform is unsuited or for whichan as-needed, on-demand service is more robust or cost effective.In fact, a public cloud offering might be the only feasible option if

    you need petabyte-scale, streaming, multi-structured, Big Datacapability.

    Elastic provisioning of very largebut short-lived analyticsandboxes

    If you have a short-turnaround, short-term data science project thatrequires an exploratory data mart (aka sandbox) that's an order-of-magnitude larger than the norm, the cloud may be your onlyfeasible or affordable option. You can quickly spin up cloud-basedstorage and processing power for the duration of the project, andthen rapidly de-provision it all when the project is over. One mightcall this the "bubble mart" deployment model, and it's tailor-madefor the cloud.

    Table 4: Principal applications of Big Data in public clouds

    Step 3: Develop a Technical ApproachYou need to perform a comprehensive technical assessment of existing and planned Big Data analyticapplications and workloads to determine which can and should be deployed to the chosen cloud service.In conducting that assessment, you should identify all the Big Data, analytics, and cloud functionalservice layers, components, nodes, and platforms needed to support the planned

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    11/21

    Copyright © 2014 Cloud Standards Customer Council Page 11

    application/workloads. Just as important, you need to perform a thorough assessment of the variouscloud deployment/service models to determine which are best suited to your workloads.

    Migrating to the target ("to-be") Big Data cloud architecture--functional service layers and components,deployment roles, and topologies---requires that you incorporate the following tasks in your planning:

    • Define the "as-is" and "to-be" Big Data analytics cloud architecture needed to support theapplication

    • Assess the feasibility of modifying existing IT, data, and analytics investments for the to-bearchitecture, and/or reusing and migrating them to the planned application

    • Determine the likely impact of internally hosted Big Data analytics applications on the ITinfrastructure

    • Calculate the additional costs that would be incurred to meet these demands• Define the optimal phasing--staged vs. “all-at-once”--of deployment of Big Data analytics

    applications to the target cloud platform

    When developing the technical approach and plan for your Big Data cloud initiative, pay close attentionto the following critical areas:

    • Topologies• Data platforms

    TopologiesIn defining your Big Data cloud "to-be" architecture, you will need to define the deployment topology ofyour environment. Big Data clouds are increasingly spanning federated private and public deployments,

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    12/21

    Copyright © 2014 Cloud Standards Customer Council Page 12

    encompassing at-rest and in-motion data, incorporating a growing footprint of in-memory and flashstorage, and requiring on demand availability from all applications.

    Consequently, your technical approach should address the distributed, multi-tier, heterogeneous, andagile nature of your Big Data environment. Optimization of your Big Data cloud topology should involve

    continual balancing of the topology to ensure that it has the optimal number, configuration, and size ofclouds, tiers, clusters, and nodes. You will need to rebalance the topology by configuring it from end-to-end to support all anticipated workloads and meet all anticipated service level requirements. The sheercomplexity, rate of change, variety of workloads, and welter of requirements on Big Data architecturescan easily unbalance any fixed configuration in no time.

    Virtualization is an important consideration in heterogeneous Big Data cloud topologies. It provides aunified interface to disparate resources that allows you to change, scale, and evolve the back endwithout disrupting interoperability with tools and applications. The degree to which you needvirtualization depends in part on the complexity of your Big Data cloud requirements and technicalenvironment, including the legacy investments with which it must all interoperate. Virtualizationenables you to preserve and even expand, where necessary, the distributed, multi-tier, heterogeneous,and agile nature of your Big Data cloud environment. You can implement virtualization at the storagelayer, the middleware layer, the query/access layer, the data layer, the management layer, or all ofthem.

    One of the key enablers of Big Data cloud virtualization is the semantic abstraction layer, which enablessimplified access to the disparate schemas of the RDBMS, Hadoop, NoSQL, columnar, and other datamanagement platforms that constitute a logically unified data/analytic resource. As an integrationarchitecture, virtualization ensures logically unified access, modeling, deployment, optimization, andmanagement of Big Data as a heterogeneous resource.

    Data platformsIn building your Big Data cloud, you will need to determine the optimal data storage, management, andruntime platform(s) to support each deployment role, functional service layer, component, andworkload, considering your target topology and service level requirements. You may require a Big Datacloud that includes different data platforms (e.g., RDBMS/row-based, Hadoop/HDFS, in-memory/columnar, etc.) fitted to different purposes but integrated into a unified architecture.

    Beware of those who assert that there is only one type of cloud-oriented data platform that you shouldbe considering for all of your requirements. Weigh your options by analyzing the different data

    platforms on the market, many of which have been optimized by vendors and solution providers fordeployment in private, public, and other types of clouds. Rather than get mired in a "holy war" of oneplatform/approach vs. another, it's best to frame your platform options in terms of the principalconsiderations described in the corresponding table in Appendix D.

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    13/21

    Copyright © 2014 Cloud Standards Customer Council Page 13

    Step 4: Address Governance, Security, Privacy, Risk, and AccountabilityRequirementsA Big Data cloud can be a complex, tricky thing to control as a unified business resource. The morecomplex and heterogeneous your Big Data cloud environment, the more difficult it will be to crack atight whip of oversight and enforcement. Life-cycle controls, driven by explicit policies, can keep yourBig Data environment from degenerating into an unmanageable mess and becoming a security andcompliance vulnerability to boot.

    But all things considered, Big Data cloud platforms may not be any more or less secure than established,smaller-scale databases. Where security is concerned, the sheer volume of your cloud-based Big Data isnot the key factor to lose sleep over. Instead, the chief security vulnerabilities in your Big Data cloudstrategy may have more to do with the unfamiliarity of the platforms and the need to harmonizedisparate legacy data-security systems.

    Platform heterogeneity is potentially a Big Data cloud security vulnerability. If you are simply scaling outan existing DBMS into a private cloud, your current security tools and practices will probably continue towork well. However, many Big Data deployments involve deploying a new platform–such as Hadoop,NoSQL, and in-memory databases in private or public clouds–that you have never used before and forwhich your existing security tools and practices are either useless or ill-suited. If the Big Data cloudoffering is new to the market, you may have difficulty finding a sufficient range of commercial securitytools, or, for that matter, anybody with experience using them in a demanding production setting.

    In the process of amassing your Big Data repository from heterogeneous precursors, you will need tofocus on the thorny issue of harmonizing disparate legacy security tools. These tools may support a widerange of security functions, including authentication, access control, encryption, intrusion detection andresponse, event logging and monitoring, and perimeter and application access control.

    At the same time as you are consolidating into a Big Data platform, you will need to harmonize thedisparate security policies and practices associated with the legacy platforms. Your Big Dataconsolidation plans must be overseen and vetted by security professionals at every step. And,ultimately, your consolidated Big Data platform must conform to all controlling enterprise, industry, andgovernment mandates.

    And if you want to approach Big Data security holistically, your strategy should also include tools andprocedures for unified (hopefully real-time) monitoring of security events on disparate data sets. Youshould consider implementing a logical Big Data mart for security incident and event monitoring (SIEM)to identify threats across your disparate Big Data platforms, consolidated and otherwise.

    You can control a Big Data cloud in a coherent manner, just as you oversee data and analyticsinvestments at smaller scales. Doing so demands that you thoroughly assess your big-datainfrastructure, tooling, practices, staffing, and skillsets in the following key areas:

    Security controls These enforce policies of authentication, entitlement, encryption, time-stamping, non-repudiation, privacy, monitoring, auditing, and protection on allaccess, usage, manipulation, and other information management functions.

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    14/21

    Copyright © 2014 Cloud Standards Customer Council Page 14

    Governance controls These enforce policies of discovery, profiling, definition, versioning,manipulation, cleansing, augmentation, promotion, usage, and compliance ofofficial system-of-record data, metadata, and analytic models.

    Information lifecyclemanagement (ILM)controls

    These enforce policies on the creation, updating, deleting, storage, retention,classification, tagging, distribution, discovery, utilization, preservation, purging,and archiving of information from cradle to grave.

    If you are an established data professional, this is probably all motherhood and apple pie to you. Whatyou want to know is whether Big Data in the cloud has substantially altered best practices in any ofthese areas. In fact, it has. Enterprise-level controls – implemented through tools, infrastructure, andpractices – are useless if they don’t keep pace with the scale, complexity, and usage patterns of theresource being managed.

    Where security, governance, and integrated lifecycle management (ILM) controls are concerned, BigData innovations in several areas are raising the stakes and changing best practices in security,governance, and other key controls. You should consider the following trends as you establish your ownpolicies, procedures, and safeguards for Big Data in the cloud:

    • Adapt your controls to fit your new platforms for Big Data in the cloud: You may have maturesecurity, governance, and ILM in your established enterprise data warehouse (DW), onlinetransaction processing (OLTP), and other databases. But Big Data in the cloud has probablybrought a menagerie of new platforms into your computing environment, including Hadoop,NoSQL, in-memory, and graph databases. The chance that your existing tools work out of thebox with all of these new platforms is slim, and, to the extent that you are doing Big Data in apublic cloud, you may be required to use whatever security, governance, and ILM features –strong, weak, or middling – that are native to that environment. You might be in luck if your BigData platform vendor is also your incumbent DW/OLTP DBMS provider and if, either by itself orin conjunction with its independent software vendor partners, it has upgraded your security and

    other tools to work seamlessly with both old and new platforms. But don’t count yourproverbial chickens on that matter. In the process of amassing your end-to-end Big Dataenvironment, you will also need to sweat over the nitty-gritty of integrating the disparate legacysecurity tools, policies, and practices associated with each new platform you adopt.

    • Implement controls that address the new data domains you are managing in the cloud: Manyof your traditional RDBMS-based data platforms are where you manage official system-of-record data on customers, finances, transactions, and other domains that demand stringentsecurity, governance, and ILM. But these system-of-record data domains may have very littlepresence on your newer cloud-based Big Data platforms, many of which focus instead onhandling unstructured data from social, event, sensor, clickstream, geospatial, and other newsources. The key difference from “small data” is that many of the newer data domains are

    disposable to varying degrees and are not linked directly to any “single version of the truth”records. Instead, your data scientists and machine-learning algorithms typically distill theunstructured feeds for patterns and subsequently discard the acquired source data (whichquickly become too voluminous to retain cost-effectively anyway). Consequently, you probablywon’t need to apply much if any security, governance, and ILM to many new unstructured BigData domains.

    • Implement controls that operate at the new scales associated with Big Data in the cloud:Where life-cycle controls are concerned, Big Data’s “bigness” is not necessarily a vulnerability.

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    15/21

    Copyright © 2014 Cloud Standards Customer Council Page 15

    There is no inherent trade-off between bigness – the volume, velocity, or variety of the data set – and your ability to monitor, manage, and control it. To the extent you have scaled out your BigData initiative on a cloud-based platform with massively parallel processing (MPP), you canleverage the security, governance, and ILM tooling for that platform as you zoom into “3 Vs”territory. You might even find Big Data in the cloud to be a boon, facilitating some control-relevant workloads – such as real-time encryption, cleansing, and classification – that provedless feasible on “small-data” on-premises platforms. And in fact, the extreme scalability andagility of Big Data in the cloud may prove essential for enforcing tight security, governance, andILM on new unstructured data types, as these needs emerge.

    • Ensure that your controls respond to new mandates relevant to Big Data in the cloud: Big Datahas become a culturally sensitive topic on many fronts, most notably among privacy watchdogswho point to its potential for use in intrusive target marketing and other abuses. As storageprices continue to plummet and more data is archived in Big Data clouds, the more likely thosehuge data stores are to be accessed for compliance, surveillance, e-discovery, intrusiondetection, anti-fraud, and other applications dictated under new legal and regulatory mandatesin many countries. Besides, even if no new Big Data-specific mandates hit the books, litigatorswill seek to subpoena these massive archives first when they’re looking for evidence ofcorporate malfeasance. Consequently, in most industries--regulated and otherwise--Big Dataarchives will increasingly be managed under stringent mandates for security, governance, andILM. This will ensure that the historical record is preserved in perpetuity, free from tamperingand unauthorized access.

    Step 5: Deploy, Integrate, and Operationalize the InfrastructureDeploying, integrating, and operationalizing your Big Data cloud environment are essential to realize thefull value of this business resource over its useful life.

    The criteria of Big Data cloud production-readiness must conform to what stakeholders require, and thatdepends greatly on the use cases and applications they have in mind for the Big Data cloud. Service-levelagreements (SLAs) vary widely for Big Data deployed as an enterprise data warehouse, as opposed to anexploratory data-science sandbox, an unstructured information transformation tier, a queryable archive,or some other use. SLAs for performance, availability, security, governance, compliance, monitoring,auditing and so forth will depend on the particulars of each Big Data application, and on how yourenterprise prioritizes them by criticality.

    You should re-engineer your data management and analytics IT processes for seamless support in yourBig Data cloud initiatives. If you can’t provide trouble response, user training and other supportfunctions in an efficient, reliable fashion that is consistent with existing operations, your Big Dataplatform is not production-ready. Key considerations in this respect:

    • Implement continuous deployment of Big Data cloud applications, because traditionalapplication development cycles of 6 months or more are no longer adequate;

    • Facilitate improved communication, collaboration, and governance of disparate Big Dataanalytics development and operations organizations around your cloud initiative;

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    16/21

    Copyright © 2014 Cloud Standards Customer Council Page 16

    • Integrate siloed data, analytic, management, process, and operational support systeminfrastructures, roles, and workflows to enable deployment of a converged Big Data cloudresource;

    • Provide Big Data cloud users with a “single throat to choke” for support, service andmaintenance;

    • Offer consulting support to users for planning, deployment, integration, optimization,customization and management of their specific Big Data cloud initiatives;

    • Deliver 24x7 support with quick-turnaround on-site response on issues;• Manage your end-to-end Big Data cloud environment with a unified system and solution

    management consoles; and• Automate Big Data support functions to the maximum extent feasible.

    To the extent that your enterprise already has a mature enterprise data warehousing program inproduction, you should use that as the template for “productionizing” your Big Data cloud deployment.There is absolutely no need to redefine “productionizing” for the sake of Big Data in the cloud.

    Works Cited[1] Cloud Standards Customer Council. Practical Guide to Cloud Computing Version 2.0, April, 2014.http://www.cloudstandardscustomercouncil.org/webPG-download.htm

    [2] Cloud Standards Customer Council. Convergence of Social, Mobile and Cloud: 7 Steps to EnsureSuccess. June 2013 .http://www.cloudstandardscustomercouncil.org/Convergence_of_Cloud_Social%20_Mobile_Final.pdf

    [3] NIST Big Data Working Grouphttp://bigdatawg.nist.gov/home.php

    Additional ReferencesReport McKinsey Global Institute. Big data: The next frontier for innovation, competition, and

    productivity .

    http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

    IBM Institute for Business Value report. Analytics: The real-world use of big data

    https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=csuite-NA&S_PKG=Q412IBVBigData

    Report to the President. Big Data and Privacy: A Technological Perspective

    http://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_- _may_2014.pdf

    http://www.cloudstandardscustomercouncil.org/webPG-download.htmhttp://www.cloudstandardscustomercouncil.org/webPG-download.htmhttp://www.cloudstandardscustomercouncil.org/Convergence_of_Cloud_Social%20_Mobile_Final.pdfhttp://www.cloudstandardscustomercouncil.org/Convergence_of_Cloud_Social%20_Mobile_Final.pdfhttp://bigdatawg.nist.gov/home.phphttp://bigdatawg.nist.gov/home.phphttp://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovationhttp://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovationhttps://www14.software.ibm.com/webapp/iwm/web/signup.do?source=csuite-NA&S_PKG=Q412IBVBigDatahttps://www14.software.ibm.com/webapp/iwm/web/signup.do?source=csuite-NA&S_PKG=Q412IBVBigDatahttps://www14.software.ibm.com/webapp/iwm/web/signup.do?source=csuite-NA&S_PKG=Q412IBVBigDatahttp://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdfhttp://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdfhttp://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdfhttp://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdfhttp://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdfhttps://www14.software.ibm.com/webapp/iwm/web/signup.do?source=csuite-NA&S_PKG=Q412IBVBigDatahttps://www14.software.ibm.com/webapp/iwm/web/signup.do?source=csuite-NA&S_PKG=Q412IBVBigDatahttp://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovationhttp://bigdatawg.nist.gov/home.phphttp://www.cloudstandardscustomercouncil.org/Convergence_of_Cloud_Social%20_Mobile_Final.pdfhttp://www.cloudstandardscustomercouncil.org/webPG-download.htm

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    17/21

    Copyright © 2014 Cloud Standards Customer Council Page 17

    Appendix A: Use Cases and Applications of Big Data in the CloudAddressing the strategic imperatives of one or more business stakeholder/champions, your businesscase should specify the business case(s) for your proposed deployment of big data in the cloud. Some ofthe principal use cases for big data in the cloud include:

    The following are the principal use cases and applications of Big Data in the cloud.

    Big Data exploration Any analytic application that requires interactive statistical and visualexploration of very large, longitudinal, complex data sets.

    Customer experience

    optimization Any analytic application that requires acquisition, aggregation, correlation,processing, analysis, modeling, and visualization of historical and/or real-timecustomer experience data from portals, mobile devices, and other devices,applications, channels, and other touch points

    Data warehouse augmentation Any analytic application that requires a massively scalable tier to supportdiscovery, acquisition, transformation, staging, cleansing, enhancement, andpreparation of multi-structured data and subsequent loading into a downstreamdata warehouse, hub, or mart.

    Enhanced 360 degree view ofthe customer

    Any analytic application that requires a high-volume, scalable repository ofcurrent customer profile, sentiment, attitudinal, social, transactional, survey,and other data to supplement existing customer system of reference data.

    Financial risk analysis Any analytic application that requires acquisition, aggregation, correlation,processing, analysis, modeling, and visualization of historical and/or real-timedata from internal and/or external sources on accounting, general ledger,budgets, and other financial sources

    Geospatial analytics Any analytic application that requires acquisition, aggregation, correlation,processing, analysis, modeling, and visualization of historical and/or real-timegeospatial information.

    Influencer analysis Any analytic application that requires acquisition, aggregation, correlation,processing, analysis, modeling, and visualization of historical and/or real-timedata from social media and other sources on influencers, net promoters, andbrand ambassadors among customers and prospects

    Machine data analytics Any analytic application that requires acquisition, aggregation, correlation,processing, analysis, modeling, and visualization of historical and/or real-timedata sourced from equipment, systems, devices, sensors, actuators, and otherphysical infrastructural components

    Marketing campaignoptimization

    Any analytic application that requires acquisition, aggregation, correlation,processing, analysis, modeling, and visualization of historical and/or real-timedata on the success of marketing campaigns across one or more channels ortouch points

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    18/21

    Copyright © 2014 Cloud Standards Customer Council Page 18

    Operations analytics Any analytic application that requires deep historical and/or real-time analysisof network, system, application, device, process, sensor, and other data sourcedand logged from operational infrastructure.

    Security and intelligenceanalytics

    Any analytic application that requires deep historical and/or real-time analysisof security, fraud, and other event data intelligence sourced and logged frominternal and/or external systems.

    Sensor analytics Any analytic application that requires automated sensors to take measurementsand feed them back to centralized points at which the data are aggregated,correlated and analyzed.

    Sentiment analysis Any analytic application that requires acquisition, aggregation, correlation,processing, analysis, modeling, and visualization of historical and/or real-timedata on customer sentiment from one or more social media or other source

    Social media analytics Any analytic application that requires acquisition, aggregation, correlation,

    processing, analysis, modeling, and visualization of historical and/or real-timedata from one or more social media

    Supply chain optimization Any analytic application that requires acquisition, aggregation, correlation,processing, analysis, modeling, and visualization of historical and/or real-timedata on production, equipment, materials, inventories, logistics, transport,transactions, and other supply-chain entities

    Targeted next best offers Any analytic application that requires acquisition, aggregation, correlation,processing, analysis, modeling and visualization of historical and/or real-timedata required to power analytic models, business rules, and other artifacts thattailor offers made to customer on inbound and/or outbound marketing,customer service, portal, and other touch points

    Unstructured analytics Any analytic application that analyzes a deep store of data sourced fromenterprise content management systems, social media, text, blogs, log data,sensor data, event data, RFID data, imaging, video, speech, geospatial and more.

    Appendix B: Technical Requirements for Big Data in the CloudOne key useful approach for assessing the appropriateness of cloud-based deployment for your big-dataapplication is to describe the extent to which it is uniquely suited to handling the requisite technical andsupportability requirements in these areas:

    Scalability and acceleration Will the Big Data cloud environment support scale-out, shared-nothingmassively parallel processing, storage optimization, dynamic query optimization,and mixed workload management better than alternative deployment models(e.g., on-premises appliances, software on commodity hardware)?

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    19/21

    Copyright © 2014 Cloud Standards Customer Council Page 19

    Agility and elasticity Will the Big Data cloud environment better support on-demand provisioning,scaling, optimization, and execution of diverse data and analytic resources thanalternative deployment models?

    Affordability andmanageability

    Will the Big Data cloud environment enable lower total cost of ownership andbetter IT staff productivity than alternative deployment models?

    Availability, reliability, andconsistency

    Will the Big Data cloud enable better availability, reliability, and consistencyacross distributed data/analytic resources that alternative deployment models?

    Service levels Will the Big Data cloud service support measurable metrics and thresholds ofquality of service as agreed between the cloud service user and provider?

    Security Will it support physical, application, and data security, including suchcapabilities as authentication, authorization, availability, confidentiality, identitymanagement, integrity, audit, security monitoring, incident response, andsecurity policy management?

    Privacy Will it protect the assured, proper, and consistent collection, processing,communication, use and disposition of personally identifiable information in therelation to cloud services?

    Availability Will it meet the guaranteed uptime requirements specified in a service levelagreement?

    Reliability Will it support an acceptable level of service in the face of faults (unintentional,intentional, or naturally caused) affecting normal operation?

    Performance Will it meet the latency, response time, concurrency, and throughput metrics

    and thresholds specified in a service level agreement?

    Auditability Will it support logging and auditing of all quality of service metrics as specified ina service level agreement

    Compliance Will it support continued, measurable, and verifiable compliance with all therelevant government, industry, and company mandates relevant to any and allaspects of operations?

    Interoperability Will it support user interaction with the service and exchange of informationaccording to a prescribed method to obtain predictable results?

    Portability Will it allow users to move their data or their applications between multiplecloud service providers at low cost and with minimal disruption?

    Reversibility Will it allow users to retrieve their data and their application artifacts and tohave strong assurance that the cloud service provider will delete all copies andnot retain any materials belonging to the cloud service customer after an agreedperiod?

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    20/21

    Copyright © 2014 Cloud Standards Customer Council Page 20

    Appendix C: Risk-Mitigation Considerations for Big Data in the CloudThe business case for any deployment or service model for cloud-based big data often hinges on a riskanalysis that considers the following factors:

    Scaling risks How likely is it that a particular approach will allow the solution to scale elastically in

    volume, velocity, and variety as Big Data analytics requirements evolve?

    Performance risks How likely is it that a particular approach will enable the solution to achieve ourperformance objectives for all workloads, current and anticipated, as requirementsevolve?

    Total cost of ownershiprisks

    How likely is it that a particular approach will reduce the cost of deploying andmanaging Big Data analytics and maximize the productivity and efficiency of IToperations over the deployment's expected useful life?

    Availability and

    reliability risks How likely is it that this particular approach will enable us to meet our requirementsfor 24x7 high availability and reliability in the provisioned Big Data analytics service?

    Security, compliance,and governance

    How likely is it that a particular approach will meet requirements for security,compliance, and governance in Big Data analytics?

    Manageability How likely is it that a particular approach will meet requirements for administration,monitoring, and optimization of the Big Data platform/service over its expected usefullife?

    Appendix D: Platform-choice Considerations for Big Data in the cloud

    These are the principal platform-choice considerations for Big Data in the Cloud:

    Data architecture In this era of constant database innovation, the range of established and new dataarchitectures and platforms continues to expand. The chief architectural categoriesinclude architectures that address varying degrees of "data at rest" and "data in motion."The principal market categories (which have diverse commercial and/or open-sourceofferings) include relational/row-based, columnar, file system, document, in-memory,graph, and key-value. A growing range of commercial and open-source offerings"hybridize" these approaches to varying degrees, and increasing numbers of them arebeing implemented within various public, private, and other clouds. Note the term"Hadoop" does not refer to a specific data storage architecture but instead to a range of

    architectures in different categories that can, to varying degrees, execute ApacheMapReduce models. The same applies to the sprawling menagerie of data platformslumped under the terms "NoSQL" and "NewSQL," which (depending on who you talk to)may include innovative databases in any of these categories, or in a narrow group ofthem.

    https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.htmlhttps://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.htmlhttps://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.htmlhttps://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.htmlhttps://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.htmlhttps://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

  • 8/18/2019 IBM CSCC Deploying Big Data Analytics Applications to the Cloud Roadmap for Success

    21/21

    Copyright © 2014 Cloud Standards Customer Council Page 21

    Application

    functionality

    The functionality of commercial and open-source data platforms and associated toolingvaries widely. Every data platform has its own specific capabilities, strengths, andlimitations in such key areas as query languages, data types supported, schemas andmodels, scalability, performance, and so forth. Frequently, you'll identify several cloud-based or cloud-enabling commercial and open-source platforms in different segments that

    overlap significantly in many of these features.

    Deployment roles The optimal data platform will vary depending on the cloud-based deployment role thatyou envision for a specific database. The range of potential cloud deployment databaseroles is wide, including, at the very least, specialized nodes for data acquisition, collection,preparation, cleansing, transformation, staging, warehousing, access, and other functions.For example, if you're implementing a node for acquisition, collection, and preprocessingof unstructured data at rest, Hadoop on HDFS is well-suited, or, depending on the natureof unstructured data and what you plan to do with it, various other file, document, graph,key-value, object, XML, tuple store, triple store, or multi-value databases may be moresuitable. Or if you need strong transactionality and consistency on structured data in agovernance hub, relational/row-based databases or so-called "NewSQL" databases maybe worth looking into. Or, if real-time data-in-motion requirements predominate for dataintegration and delivery, you should investigate the stream-computing, in-memory and

    other platforms optimized for low latency.

    Integrationrequirements

    The optimal data platform(s) must integrate well with your chosen cloud-computingexecution and development platforms, whether they are open source (e.g., OpenStack,Cloud Foundry) or commercial. To the extent that your data platform(s) also conform toopen industry cloud-computing reference frameworks developed by the U.S. NationalInstitute for Standards and Technologies and other groups, all the better. Ideally, thechosen Big Data and cloud-computing platforms should integrate well with your legacy orenvisioned investments in analytics development and modeling tools; data discovery,acquisition, transport, integration, and so forth.