
TDWI CHECKLIST REPORT

Eight Tips for Modernizing a Data Warehouse

By Philip Russom

Co-sponsored by Cloudera, Impetus Technologies, MapR Technologies, and Teradata

TDWI RESEARCH
tdwi.org

TABLE OF CONTENTS

FOREWORD
NUMBER ONE: Modernize your data warehouse environment to leverage new data and big data
NUMBER TWO: Support the data needs of new analytics with a modern warehouse and other integrated data platforms
NUMBER THREE: Re-architect the data warehouse and its environment as you modernize
NUMBER FOUR: Consider Hadoop an extension of the modern warehouse
NUMBER FIVE: Modernize ETL, not just the core warehouse
NUMBER SIX: Accelerate the business closer to real-time operations as you modernize the data warehouse and related systems
NUMBER SEVEN: Comply with external regulations and internal policies as you handle data during modernization
NUMBER EIGHT: Apply modern economic criteria to selecting and using data platforms
ABOUT OUR SPONSORS
ABOUT THE AUTHOR
ABOUT TDWI RESEARCH
ABOUT TDWI CHECKLIST REPORTS

MAY 2015

© 2015 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to [email protected]. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

555 S Renton Village Place, Ste. 700, Renton, WA 98057-3295
T 425.277.9126 | F 425.687.2842 | E [email protected]
tdwi.org

FOREWORD

No matter the vintage or sophistication of your organization's data warehouse (DW) and the environment around it, it probably needs to be modernized. DW modernization takes many forms. Common scenarios range from software and hardware server upgrades to the periodic addition of new data subjects, sources, tables, and dimensions. As data types and data velocities continue to diversify, many users are likewise diversifying their software portfolios to include tools and data platforms built for new and big data. A few organizations are even decommissioning current DW platforms to replace them with modern ones optimized for today's requirements in big data, analytics, real time, and cost control. No matter what modernization strategy is in play, all require significant adjustments to the logical and systems architectures of the extended data warehouse environment.

    Most of the trends driving the need for data warehouse modernization boil down to four broad issues:

1. Organizations demand business value from big data. In other words, users are not content merely to manage big data and other valuable data from new sources, such as Web applications, machines, devices, social media, and the Internet of Things; they expect to put that data to productive business use. Because big data and new data tend to be exotic in structure and massive in volume, users need new platforms that scale with all data types if they are to achieve business value.

2. The age of analytics is here. Many firms are aggressively adopting a wide variety of analytic methods so they can compete on analytics and understand evolving customers, markets, and business processes. There is a movement from analyst intuition and statistics to empirical data-science-driven insights. Furthermore, today's consensus says that the primary path to big data's business value is through so-called "advanced" forms of analytics, based on technologies for mining, predictions, statistics, and natural language processing (NLP). Each analytic technology has unique data requirements, and DWs must modernize to satisfy all of them.

3. New challenges for real-time data. Technologies and practices for real-time data have existed and been successfully used for years. Yet many organizations are behind in this area, so it's a priority for their data warehouse modernization efforts. Even organizations that have succeeded with real-time data warehousing and similar techniques will now need to refresh their solutions so that real-time operations scale to exponential data volumes, streams, and greater numbers of concurrent users and applications. Furthermore, real-time technologies must adapt to a wider range of data types, including schema-free and evolving ones.

    4. Open source software (OSS) is now ensconced in data warehousing. Ten years ago, Linux was the only OSS product commonly found in the technology stack for DWs, BI, analytics, and data management. Today, TDWI regularly encounters OSS products for reporting, analytics, data integration, and big data management. This is because OSS has reached a new level of functional maturity while still being economically desirable. A growing number of user organizations are eager to leverage both characteristics.

    To help user organizations prepare, this TDWI Checklist Report canvasses eight of the leading DW modernization scenarios, discussing many of the new product types, functionality, and user best practices (as well as the business case and technology strengths) of each.


NUMBER ONE: MODERNIZE YOUR DATA WAREHOUSE ENVIRONMENT TO LEVERAGE NEW DATA AND BIG DATA

    A founding principle of data warehousing is that user organizations should repurpose data from the enterprise and other sources to gain additional insights and guide decisions. In that spirit, organizations are grappling with new data types and sources and how to capture and manage these information assets, plus how to leverage them for business advantage. For example:

Web logs. A common starting point for leveraging big data is to assemble logs from Web servers and other Internet applications, then sessionize and analyze the clickstream and shopping cart data they contain to understand website visitor behavior and product affinities in an e-commerce context. (A small sessionization sketch follows this list.)

    Industry-specific big data. Valuable data sets and analytics can be assembled from call detail records (CDRs) in telecommunications; RFID in retail, manufacturing, and other product-oriented industries; and sensor data from robots in manufacturing and vehicles in logistics.

    Human language and other text as big data. Tools based on natural language processing, search, and text analytics provide visibility into text-laden business processes, as in the claims process in insurance, medical records in healthcare, and call center or help desk applications in any industry. The killer app of human language data is sentiment analytics, which has become common in customer-oriented businesses, using both enterprise and social media big data.

    Multi-structured data. Partnering firms that work together through a supply chain often exchange information via XML and JSON documents, which include a mixture of structured data, hierarchies, text, and other elements. When processed and analyzed properly, these help quantify profitable partners, supply quality, and supply chain efficiencies.
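To make the Web log example concrete, here is a minimal sessionization sketch in Python. It is an illustration only: the tuple layout, field names, and the 30-minute inactivity timeout are assumptions for the example, not anything this report prescribes.

    from datetime import datetime, timedelta

    SESSION_TIMEOUT = timedelta(minutes=30)  # common convention; adjust to taste

    def sessionize(events):
        """Group clickstream events into sessions per visitor.

        `events` is an iterable of (visitor_id, timestamp, url) tuples,
        assumed sorted by timestamp. A new session starts whenever a
        visitor is idle longer than SESSION_TIMEOUT.
        """
        sessions = {}   # visitor_id -> list of sessions (each a list of events)
        last_seen = {}  # visitor_id -> timestamp of that visitor's previous event
        for visitor, ts, url in events:
            if visitor not in sessions or ts - last_seen[visitor] > SESSION_TIMEOUT:
                sessions.setdefault(visitor, []).append([])  # open a new session
            sessions[visitor][-1].append((ts, url))
            last_seen[visitor] = ts
        return sessions

    # Hypothetical log records already parsed into tuples
    log = [
        ("v1", datetime(2015, 5, 1, 9, 0), "/home"),
        ("v1", datetime(2015, 5, 1, 9, 10), "/cart"),
        ("v1", datetime(2015, 5, 1, 11, 0), "/home"),  # >30 min idle: new session
    ]
    for visitor, sess in sessionize(log).items():
        print(visitor, "had", len(sess), "sessions")

Grouping events this way is the prerequisite for the visitor-behavior and shopping-cart analyses described above.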

Managing and leveraging these new data types and sources is worthwhile because of their business value. However, users are challenged by the newness of the data, the massive volume of many new data sets, the wide range of data structures, and the streaming nature of some sources. The problem is further compounded because most vendor platforms and user designs for traditional data warehouses were originally designed for structured data alone or just for relational data. Because many manifestations of new big data are not relational (or even structured in any way), many users are asking: How do we modernize our DW so that we both preserve our traditional investment and embrace new types and sources of data?

Many users choose to reserve their core DW for the relational data that goes into standard reports, dashboards, performance management, and OLAP. For new big data, users are deploying specialized platforms built for new data types, and they are integrating the new platforms with the core DW and related systems. Specialized platforms include those based on column stores and appliances, plus open source Hadoop and NoSQL databases. Some data warehouse platform vendors have incorporated native support for semi-structured data types, such as XML and JSON, with the relational environment to enable tight integration between semi-structured and structured data types.

Given the real-world limitations of modernizing a DW that's tightly wedded to the relational paradigm, complementing the relational DW with other data platforms is a viable strategy for DW modernization. Even so, some organizations prefer to replace the old DW platform with a different platform that's more broadly suited to the extreme diversity of data we're witnessing today, even though rip-and-replace is time-consuming and disruptive for the business.

To quantify users' efforts with data warehouse modernization, a recent TDWI survey asked: Which of the following best describes your organization's strategy for evolving [or modernizing] your DW environment and its architecture, relative to big data? (See Figure 1.) Most survey respondents plan to extend an existing DW (41%); the assumption is that the DW platform in place is capable of handling a broad range of data types and their workloads.

However, a quarter will deploy new data platforms (25%); they assume these specialized platforms complement the core DW without replacing it. Finally, 29% of respondents have no strategy for DW modernization or for addressing big data, which is not a good idea given the upsurge in new big data and other modernization requirements.


Figure 1. Strategies for data warehouse modernization.1
- Extend existing data warehouse to accommodate big data and other new requirements: 41%
- Deploy new data management systems, specifically for big data, analytics, real time, etc.: 25%
- No strategy, though we need one: 23%
- No strategy because we don't need one: 6%
- Other: 5%

1 Figure 1 in this report is based on Figure 11 in the 2014 TDWI Best Practices Report Evolving Data Warehouse Architectures in the Age of Big Data, available for download at tdwi.org.

NUMBER TWO: SUPPORT THE DATA NEEDS OF NEW ANALYTICS WITH A MODERN WAREHOUSE AND OTHER INTEGRATED DATA PLATFORMS

We say "analytics" as if it were a single practice or technology. In reality, there are many approaches to analytics and many enabling technologies, including mining, clustering, statistics, predictive algorithms, SQL, hierarchies, dimensions, visualization, and a wide array of natural language processing (NLP) techniques.

A ramification of the diversity of analytics is that the requirements for data to be analyzed vary tremendously. Some analytic methods demand relational data; others need some other structure. This, in turn, complicates the modernization of a data warehouse that must supply data for multiple analytic approaches. Again, given the diversity of analytic data, many users choose to deploy multiple purpose-built platforms instead of expecting a relational warehouse to supply all data types. Here's a rundown of data structures required by various analytic methods:

Data exploration and discovery. Many analytic methods begin with a data analyst exploring data as a prelude to analysis, reporting, and visualization. Although it's possible to explore data residing on many platforms, a few organizations have relocated data to be explored into data lakes, data vaults, and enterprise data hubs, typically on Hadoop, a large configuration of an MPP database, or a hybrid environment that supports elements of both.

    Large data samples. Some analytic methods (for mining, statistics, and clustering) work best with data samples of many terabytes or petabytes. Many users house these on large MPP configurations, but the trend is toward Hadoop integrated with a relational MPP database.

Relational data. TDWI surveys show that after OLAP, the most common form of analytics is so-called "complex SQL" or "extreme SQL." This involves hundreds of lines of SQL because data access, data models, data transformations, and other elements are expressed in SQL code instead of being handled elsewhere. For this form of analytics, relational DBMSs are the obvious choice today, although the progress of SQL on Hadoop may change this.

    Dimensional models. A true data warehouse will include dimensional models, typically to support online analytic processing (OLAP). Hence, the relational DW continues to be the first choice for dimensional analytics, followed by relational appliances and columnar databases.

    Hierarchies. Hierarchical business structures are all around us, in a bill of materials, a chart of accounts, and XML or JSON documents. Furthermore, some tools for data mining, text analytics, and visualization produce hierarchies. Vendor brands of relational DBMSs vary in their abilities to successfully manage hierarchies. The trend is toward Hadoop.

File-based data. Significant new big data is captured in log files, such as those generated by Web servers, enterprise applications, and machines (sensors, robots, and devices), as well as captured data streams. Hadoop was designed for logs and other file-based data, so it's a natural choice.

    Multimedia data. Some organizations need to store, manage, and analyze audio and video files, preferably in an active archive, which Hadoop can enable.

    Textual documents. For the analytic methods of sentiment analysis, entity extraction, text mining, and other forms of NLP, the human language and other forms of text they operate on are often file-based. For these applications, Hadoop is coming on strong as the preferred storage and analytic processing platform.

Set-based and algorithmic approaches to analysis. Set-based analytics usually entails relational techniques, namely SQL, tables, keys, dimensions, etc.; optimizing and parallelizing operations with these is easily done in a relational database environment. Algorithmic analytics (sometimes called procedural analytics) varies considerably, but a common example is the row-over-row comparisons made in graph or time-series analyses. All forms of algorithmic analysis optimize well in Hadoop. (A sketch contrasting the two styles follows this rundown.)
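The set-based versus algorithmic distinction is easier to see in code. The sketch below contrasts a declarative, set-based aggregate (SQL, run here through Python's built-in sqlite3 module) with a procedural row-over-row comparison typical of time-series analysis; the table, columns, and values are invented for illustration.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, ts INTEGER, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)",
                     [("s1", 1, 10.0), ("s1", 2, 10.5), ("s1", 3, 13.0)])

    # Set-based: one declarative statement evaluated over the whole table
    avg = conn.execute(
        "SELECT sensor, AVG(value) FROM readings GROUP BY sensor").fetchall()
    print("set-based averages:", avg)

    # Algorithmic (procedural): row-over-row comparison, as in time-series analysis
    rows = conn.execute(
        "SELECT value FROM readings WHERE sensor='s1' ORDER BY ts").fetchall()
    deltas = [b[0] - a[0] for a, b in zip(rows, rows[1:])]
    print("row-over-row deltas:", deltas)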


NUMBER THREE: RE-ARCHITECT THE DATA WAREHOUSE AND ITS ENVIRONMENT AS YOU MODERNIZE

Data warehouse modernization faces a perfect storm of requirements: supporting new data, expanding analytics, coming closer to true real-time operations, containing costs, and planning capacity, among others. One way to satisfy diverse requirements is to diversify the software and hardware portfolio of the DW by adding more tools and platforms to it. That's exactly what roughly half of organizations are doing.

    Many user organizations are evolving their mature enterprise data warehouses (EDWs) into multi-platform data warehouse environments (DWEs). To put it in historical perspective, the technology stack for BI and DW has always had multiple tools and platforms, including tools for reporting, analytics, and integration, as well as database management systems (DBMSs) for the DW, data marts, cubes, and operational data stores (ODSs).

We say "the warehouse" or "the EDW" as if it's one monolithic entity, although for many organizations it's long been a collection of more-or-less integrated tools, data platforms, and data sets. Rearranging the EDW acronym to DWE acknowledges the extreme degree of platform diversity the DW and BI technology stack has achieved in recent years, and it's not just the DWE: data management in other areas of the enterprise has attained a similar extreme of platform diversity.

    The current extreme of the multi-platform DWE has architectural ramifications:

New data platforms enable new practices that complement the core DW without replacing it. That's because the DW is still the best platform for the aggregated, standardized, and documented data that goes into standard reports, dashboards, performance management, operational analytics, and OLAP. Instead of replacing the warehouse, the new platforms complement it because they are optimized for workloads that manage, process, and analyze data that's new, big, unstructured, exotic, or real time. Also, new data platforms are better suited to the "early ingestion, later processing" practice many users need to apply during data exploration and analytics. Admittedly, the additional platforms complicate the architecture, but BI/DW professionals have dealt with a complex technology stack for decades, so they are well-equipped for multi-platform DWEs. In addition, a number of data platform vendors are extending their tools to simplify the orchestration, ingestion, and consumption of data, regardless of where the data is persisted.

    A DWE enables a workload-centric architecture that gives users more options. For example, a DWE assumes that some workloads and their data are best offloaded from the core DW and taken to a platform more suited to them. This includes workloads and data for algorithmic analytics, extreme SQL-based analytics, multi-structured data, massive big data, and real time. This modernization strategy frees up capacity on the core DW so it can be reallocated to expanding DW-specific data and workloads.

Note that the leading benefit of the workload-centric DWE is that it gives users options: they can match a given data set or workload with a platform that's the best technical fit or the most cost-effective. In that context, modern organizations develop metrics for total cost, ROI, functionality, performance, ownership, and other data platform characteristics so that decisions about data platform usage are enlightened by the full range of platform characteristics, not just technical capabilities. (A toy scoring sketch follows.)
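As a toy illustration of such platform metrics, here is a hedged weighted-scoring sketch in Python. The criteria, weights, and scores are invented for the example; a real evaluation would substitute metrics and measurements developed by your own organization.

    # Hypothetical weighted decision matrix for matching a workload to a platform.
    weights = {"functionality": 0.3, "performance": 0.25, "tco": 0.3, "skills_fit": 0.15}

    # Scores on a 1-5 scale, invented purely for illustration.
    platforms = {
        "mature RDBMS": {"functionality": 5, "performance": 4, "tco": 2, "skills_fit": 5},
        "DW appliance": {"functionality": 3, "performance": 5, "tco": 3, "skills_fit": 4},
        "Hadoop":       {"functionality": 3, "performance": 4, "tco": 5, "skills_fit": 2},
    }

    for name, scores in platforms.items():
        total = sum(weights[c] * scores[c] for c in weights)
        print(f"{name}: {total:.2f}")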

    To quantify the trend toward multi-platform data warehouse environments (DWEs), a recent TDWI survey asked: Which of the following best describes your extended data warehouse environment today? (See Figure 2.) Pure, central, monolithic EDWs are relatively rare (15%, far left). Conversely, environments without a DW are equally rare (15%, far right). The majority of DWs coexist successfully with other platforms in a mixed environment (68%, middle three segments of the chart). Even so, the degree of diversity varies from a few additional platforms to many.


Figure 2. Evolving from the EDW to the modern DWE.2
- Central monolithic EDW with no other data platforms: 15%
- Central EDW with a few additional data platforms: 37%
- Central EDW with many additional data platforms: 16%
- Many workload-specific data platforms; EDW is present but not the center: 15%
- No true EDW; many workload-specific data platforms instead: 15%
- Other: 2%
(The segments run along a spectrum from pure EDW, at left, to fully distributed DWE, at right.)

2 Figure 2 in this report is based on Figure 10 in the 2014 TDWI Best Practices Report Evolving Data Warehouse Architectures in the Age of Big Data, available for download at tdwi.org.

NUMBER FOUR: CONSIDER HADOOP AN EXTENSION OF THE MODERN WAREHOUSE

    As we just saw, the multi-platform data warehouse environment (DWE) is both a trend and a strategy for data warehouse modernization. Among the new platforms proliferating in DWEs, Hadoop is coming on strong for several reasons:3

    Open source software (OSS) has recently achieved a higher level of functional maturity, in general, across all types of OSS. This makes Hadoop and other OSS products more attractive for demanding enterprise uses.

    A compelling balance of cost and performance is struck by Hadoop. Vendor distributions of Hadoop add enterprise functions required for enterprise use (security, administration, maintenance, high availability, disaster recovery, query, etc.) but are more affordable than comparable licenses for enterprise software. Furthermore, Hadoop is proven to perform and scale linearly, even when deployed on the cheapest commodity hardware.

    Data-type diversity leads many users to Hadoop. Theoretically, any data you can put in a file can be handled by the Hadoop Distributed File System. This empowers user organizations to finally get full business value from unstructured and semi-structured data.

Computational power for advanced analytics is the true value proposition for Hadoop. Hadoop's renowned talent for storing massive volumes of highly diverse data is merely a foundation for computational analytics. This also makes Hadoop a complement to the set-based analytics performed elsewhere in the DWE with OLAP, SQL, and relational techniques.

    Hadoop complements and extends other platforms without replacing them. This adds years of productive use, new functionality, and greater scale to traditional investments in data warehouses, reporting tools, analytic tools, and data integration tools.

    Early adopters and others have been using Hadoop integrated with a DW for a few years now. From their successful experiences, we see that there are a number of low-risk but high-value use cases that are appropriate to users wishing to introduce Hadoop into their DWEs:

Operational data stores (ODSs). TDWI has found users who have migrated ODSs from relational DBMSs to Hadoop, typically for use with Hive and HBase, sometimes MapReduce and Pig. They report that the straightforward record or relational data structures of an ODS migrate easily and perform well with little tweaking once in Hadoop. In a similar trend, some users are working toward an enterprise data hub (EDH), which extends the capabilities of operational data stores, to bring more analytic workloads to larger volumes of diverse data.

Data staging. Hadoop was designed for "early ingestion, later processing" data management best practices. Hence, it adapts well to data landing, data staging, and the transformational processing of data that usually accompanies such practices.

Source data archiving. It's impossible to foresee all the ways that source data will need to be repurposed for new analytic applications in the future. The current practice is to retain raw, extracted data with all its original details. Much of the expensive storage capacity of EDWs is burned up by large archives of source data; Hadoop can store and process this data just as well, but at a fraction of the cost. Unlike old-fashioned archives that depend on offline media such as magnetic tapes and optical disks, a Hadoop-based archive is online, queryable, and searchable, so users get daily business value from it without time-consuming data-restore processes.

    Computational analytics. Valuable computational analytics performed by Hadoop users today includes website behavior analysis, sentiment analysis, clustering for customer base segments, and many applications of statistical or mining techniques with large volumes of diverse data.

ETL/ELT offload. Just as users offload data and analytic workloads from the core DW to Hadoop, they also offload jobs for extract, transform, and load (ETL). The catch is that some ETL or ELT jobs are inherently relational or set-based because they involve complex table joins or depend on advanced SQL functions; such jobs are best controlled by a data integration tool and pushed down into a relational DBMS. However, other ETL jobs count entity occurrences or perform algorithmic processing but on a massive scale, which is at the core of Hadoop's design. (A minimal mapper/reducer sketch follows this list.)
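Counting entity occurrences is the canonical fit for Hadoop's processing model. The sketch below shows a minimal mapper and reducer in Python written against the Hadoop Streaming contract (records on stdin, tab-separated key-value pairs on stdout, reducer input sorted by key); the field layout is an assumption made for illustration.

    # usage: cat data.tsv | python count.py map | sort | python count.py reduce
    import sys

    def mapper():
        # Emit one (entity, 1) pair per input record; assumes the entity
        # (e.g., a customer ID) is the first tab-separated field.
        for line in sys.stdin:
            entity = line.rstrip("\n").split("\t")[0]
            print(f"{entity}\t1")

    def reducer():
        # Streaming guarantees reducer input is sorted by key, so a simple
        # running total per key suffices.
        current, count = None, 0
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t")
            if key != current:
                if current is not None:
                    print(f"{current}\t{count}")
                current, count = key, 0
            count += int(value)
        if current is not None:
            print(f"{current}\t{count}")

    if __name__ == "__main__":
        {"map": mapper, "reduce": reducer}[sys.argv[1]]()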


    3 Readers unfamiliar with Hadoop may wish to read the TDWI Best Practices Reports Integrating Hadoop into Business Intelligence and Data Warehousing and Hadoop for the Enterprise, available for download at tdwi.org.

NUMBER FIVE: MODERNIZE ETL, NOT JUST THE CORE WAREHOUSE

Data warehouse modernization is not limited to the warehouse per se. A modernization strategy may be needed for the many tools and platforms that interface with the DW and other data platforms in the DWE. That's potentially a long list of tools, so let's focus on those for data integration (DI) and extract, transform, and load (ETL).

Shuffling data in a modern DWE. As users add more types of data platforms to their DW environments, they almost always need to move data around to relocate it on the new platforms best suited to given data sets. Hence, early in DW modernization initiatives, users must plan for a number of data migrations, consolidations, collocations, and workload-balancing efforts. These are typically done with a variety of DI tools, including those for ETL or replication.

    Data integration infrastructure for the modern DWE. Users have always needed a solid data integration architecture to cope with complex data flows and multiple tools in the BI/DW technology stack. A modern DWE takes that situation to a new extreme, and a DWE assumes many complex multi-platform data flows. Hence, data integration infrastructure is a critical success factor for daily operations in a DWE.

Adapting to new ETL practices. Traditional data warehousing practices use ETL to improve data before loading a DW. Users with ample capacity on their DWs may push down some processing into the DW, which is known as ELT. A new variation on these practices ingests extracted data into the target data platform as early as possible, then processes the data for specific purposes as late as possible. Called "early ingestion, late processing," this has become a standard practice with new big data, especially when Hadoop is the target. (A combined sketch of this practice and runtime metadata deduction follows the next item.)

    Modernizing metadata management. This is especially challenging with schema-free new data. Instead of developing metadata a priori (as is the case with most DW practices today), modern tools for Hadoop can deduce metadata at runtime from a wide range of data structures, empowering a user to develop metadata quickly as data is explored, discovered, and analyzed. The same tools can also detect evolving data structures, track data lineage, enable search, and update statistics and heuristics about specified data.
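The last two practices lend themselves to a small combined sketch: land raw records untouched at ingest time, then deduce a crude schema and apply structure only when a specific analysis needs it. This is plain Python for illustration; the file name and fields are assumptions, and production systems would use the Hadoop-based tooling described above.

    import json
    from collections import Counter

    # Early ingestion: land raw records exactly as they arrive, one JSON
    # document per line, with no upfront modeling.
    raw = [
        '{"user": "u1", "event": "click", "ms": 120}',
        '{"user": "u2", "event": "buy", "amount": 9.99}',
    ]
    with open("landing_zone.jsonl", "w") as f:
        f.write("\n".join(raw))

    # Late processing: deduce a crude schema at read time by observing which
    # fields and types actually occur, then extract only what this analysis needs.
    schema = Counter()
    events = []
    with open("landing_zone.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            for field, value in rec.items():
                schema[(field, type(value).__name__)] += 1
            events.append(rec.get("event"))

    print("observed schema:", dict(schema))
    print("events:", events)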

NUMBER SIX: ACCELERATE THE BUSINESS CLOSER TO REAL-TIME OPERATIONS AS YOU MODERNIZE THE DATA WAREHOUSE AND RELATED SYSTEMS

For decades, BI professionals have pushed the refreshing and delivery of reports and analyses closer to real time. Today, a number of common BI practices handle data in near real time (minutes or hours), including operational BI, dashboarding, and metrics-driven performance management. These practices enable managers to make tactical and operational decisions based on very fresh information.

However, for some fast-paced, time-sensitive business processes, near real time (also known as "near time") isn't fast enough. They need true real time, where data is handled within seconds, preferably microseconds. Examples include applications for financial trading systems, business activity monitoring, utility-grid monitoring, e-commerce product recommendations, and facility surveillance.

For user organizations needing to modernize the DWE to handle data in near time or real time, many technologies are available today and therefore should be considered. The list includes data federation and virtualization, data replication and synchronization, intraday micro batches, columnar DBMSs, DW appliances, MPP computing architectures, elastic clouds, in-database analytics, in-memory functions, and solid-state drives.4 Note that the bar has been raised on these: they must operate in various short time frames (sometimes called "right times"), and they must also operate on a wider range of data structures in unprecedented volumes.

Complex event processing (CEP) for streaming data. One form of new big data is streaming data. Data streams into an organization more or less continuously as a series of data records, each describing a business event. For example, streams come online when users add sensors to their machines, products, vehicles, and mobile devices, plus turn on logging in Web or enterprise applications. Streaming data is captured, triaged, and processed to determine a reaction; then an automated response is executed by software or a user is alerted, all within seconds or milliseconds. Standalone CEP tools have arisen to handle streams, and users are adding CEP tools to their DWEs as they modernize for true real-time operations.

Hadoop for streaming data. Early versions of Hadoop lacked near-time and real-time capabilities. This situation has improved considerably with the introduction of open source projects for capturing and analyzing streaming data (such as Samza, Spark, and Storm). These promise to handle both the speed of real time and the massive data volumes we expect in Hadoop. TDWI anticipates that Hadoop will become a preferred real-time platform because of its low cost (as compared to commercial CEP platforms) and its massive storage capabilities. After all, streaming data adds up to large volumes in a hurry.

4 For an in-depth examination of real-time operations, see the 2014 TDWI Best Practices Report Real-Time Data, BI, and Analytics, available for download at tdwi.org.

Interactive SQL on Hadoop. The many users working with HiveQL via Hive and HBase attest to the value of these tools. Yet data management professionals are calling for better support of standard SQL on Hadoop so they can leverage their SQL skills and SQL-based tools. Likewise, data analysts need near-real-time query responses in support of analytic practices such as data exploration and ad hoc queries. The open source projects Drill and Impala provide these and other functions. In addition, some vendor distributions of Hadoop support file-system enhancements for fast ingestion of data streams, so newly arrived data is available immediately for both analytic and operational workloads.

Streaming ETL on Hadoop. Hadoop's capabilities for handling and analyzing streaming data can also be used for streaming ETL, which can aggregate, transform, and otherwise process data as it arrives. Streaming ETL avoids the overhead and latency of applying structure before load time, and by accelerating the ETL process, it greatly accelerates downstream decision making and other business processes. (A minimal windowed-aggregation sketch follows.)
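Here is a minimal windowed-aggregation sketch of streaming ETL, in plain Python for illustration; a production deployment would use Storm, Spark, or similar, and the event layout and window size are assumptions made for the example.

    from collections import defaultdict

    WINDOW_SECONDS = 10  # illustrative micro-window

    def streaming_etl(events):
        """Consume (epoch_seconds, product_id, amount) events and yield
        per-window revenue aggregates as each window closes."""
        window_start, totals = None, defaultdict(float)
        for ts, product, amount in events:
            if window_start is None:
                window_start = ts
            if ts - window_start >= WINDOW_SECONDS:
                yield (window_start, dict(totals))  # emit the closed window
                window_start, totals = ts, defaultdict(float)
            totals[product] += amount
        if totals:
            yield (window_start, dict(totals))  # flush the final partial window

    stream = [(0, "A", 5.0), (3, "B", 2.0), (12, "A", 1.0)]
    for start, agg in streaming_etl(stream):
        print(f"window starting {start}: {agg}")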

NUMBER SEVEN: COMPLY WITH EXTERNAL REGULATIONS AND INTERNAL POLICIES AS YOU HANDLE DATA DURING MODERNIZATION

Data warehouse modernization is an opportunity to create or improve data governance best practices, plus related practices for data standards and security.

New big data needs data governance (DG), as would any data set. DW modernization usually involves new data, and each new source should be certified per established compliance and governance policies prior to use. Because the policies and standards created by most data governance committees are designed for structured data and traditional platforms, data types and sources that are new to your DWE may need new policies and standards or adjustments to older ones (especially for exotic data from social media, geospatial, or surveillance systems).

New big data needs improvement, as do most data sets. Data governance is more than policies for compliance. A mature program also establishes and enforces standards for data's quality, models, architectures, semantics, and development methods. All data sets have problems and opportunities that merit attention, whether old or new, from the enterprise or beyond. Data standards help leverage data's opportunities and remediate its problems. Don't just move data during a data warehouse modernization; improve it as well.

    Data exploration is a compliance accident waiting to happen. A common goal for data warehouse modernization is to collect highly detailed source data for data exploration and discovery, usually in conjunction with analytics. Exploration is increasingly performed with modern data sets, such as data lakes, data vaults, and enterprise data hubs, whether on Hadoop or large MPP DBMS installations. To avoid compliance and privacy violations, all these scenarios need governance policies and the appropriate level of security, as explained below.

Hadoop must be secure, just like other IT systems. Security in purely open source Hadoop is limited to authorizations based on Kerberos. This is useful, but it's only one approach to security, whereas mature enterprise IT teams tend to prefer multiple approaches. For example, many IT organizations have standardized on role-based and directory-based approaches. Eventually, users will also demand single sign-on, encryption, and data masking. (A masking sketch follows this discussion.)

    Fortunately, additional security measures (and other enterprise-grade functions) are available for Hadoop, typically from vendors that offer Hadoop distributions. These functions make the distributions more appealing to mature enterprises than does purely open source Hadoop. Additional functionality is also available from software vendors in the extended Hadoop ecosystem.
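As one concrete governance measure, the sketch below masks direct identifiers with salted hashes before data lands in an exploration zone, preserving joinability (equal inputs hash to equal tokens) without exposing raw values. The salt handling, field list, and policy are illustrative assumptions; defer to your governance committee's actual standards.

    import hashlib

    SALT = b"replace-with-a-secret-salt"  # illustrative; manage via a secrets store
    PII_FIELDS = {"ssn", "email"}         # fields your policy treats as direct identifiers

    def mask(record):
        """Return a copy of `record` with PII fields replaced by salted hashes."""
        out = {}
        for field, value in record.items():
            if field in PII_FIELDS:
                digest = hashlib.sha256(SALT + str(value).encode()).hexdigest()
                out[field] = digest[:16]  # truncated token, still join-stable
            else:
                out[field] = value
        return out

    print(mask({"ssn": "123-45-6789", "email": "[email protected]", "state": "WA"}))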


NUMBER EIGHT: APPLY MODERN ECONOMIC CRITERIA TO SELECTING AND USING DATA PLATFORMS

The primary benefit of a modern multi-platform DW environment is the ability to proactively manage a data set on the data platform that is the best technology fit for that data set and its associated workloads. When possible, however, users should also manage data on a platform that realizes a low total cost of ownership (TCO), a high return on investment (ROI), or both. The calculus of TCO and ROI is complicated and fraught with exceptions but worth working through if you need to rethink how you control costs in a modern data warehouse environment. Note that TCO goes beyond acquisition costs: an enterprise must consider costs in other areas, such as development, maintenance, support, and usage. Note also that ROI may be expressed either in hard dollars or in soft benefits.

    Here are a few considerations of TCO and ROI for prominent platform types in a modern data warehouse environment:

You get what you pay for. Mature brands of relational database management systems (RDBMSs) are premium products and therefore command premium price tags. However, the expense is worth it to get an RDBMS's rich variety of fully baked features for query optimization, SQL standards, indexing, workload management, in-memory processing, data compression, metadata management, large concurrent user bases, view technologies (materialized, federated, virtual, dimensional), and a variety of other system management and end-user productivity needs. These features are required for demanding data-driven practices, such as data warehousing, reporting, business performance management, operational BI, and OLAP. The data managed for these practices is high value (and hence merits financial investment) because it's used by employees who make strategic and operational decisions that deeply influence the success of the enterprise. For these reasons, the vast majority of DWs today are built on mature RDBMSs, and, due to the value returned, these organizations have little trouble justifying the cost.

You can pay now or pay later. Hadoop is based on open source software that runs well on commodity-priced hardware. Hence, Hadoop's acquisition costs are quite low compared to other data platforms in the modern DW environment, giving Hadoop a low cost per terabyte. However, the total cost of owning Hadoop mounts over time to fund skilled personnel, system administration, and environmental costs (such as power, space, and cooling). In particular, Hadoop requires more advanced programming skills than peer systems do. For example, experienced Hadoop users have spoken at TDWI conferences about the high payroll costs of data scientists, programmers, and other highly technical staff. Yet the same speakers point out that Hadoop's average TCO is still lower than a comparable MPP RDBMS configuration; they would know, because they have both, and the two complement each other, as described earlier in this report. (The TCO arithmetic is sketched after this list.)

Purpose-built systems have their place, and their price. Regardless of what their vendor creators intended, TDWI most often finds DW appliances and columnar RDBMSs used for SQL-based analytics performed by a relatively small user base but with very large data volumes. Secondarily, TDWI finds these platforms supporting multi-terabyte data marts, which are usually the foundation for specific analytic applications. DW appliances and columnar RDBMSs have a lower price point than mature, multi-purpose RDBMSs, which is appropriate given their limited use cases, small user constituencies, and scaled-back relational functionality. DW appliances and columnar RDBMSs fulfill an important role within a multi-platform DW environment as effective but affordable platforms for analytic sandboxes and departmental analytic applications.
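To make the TCO arithmetic concrete, here is a small comparison sketch in Python. Every dollar figure is invented purely for illustration and must be replaced with your own quotes, payroll, and facilities data; the point is the shape of the calculation, not the numbers.

    def tco_per_tb(acquisition, annual_staff, annual_facilities, usable_tb, years=3):
        """Total cost of ownership per usable terabyte over `years`.
        Inputs are whole-platform dollar figures, not per-node."""
        total = acquisition + years * (annual_staff + annual_facilities)
        return total / usable_tb

    # Hypothetical inputs for a 3-year horizon (illustrative only).
    platforms = {
        "mature RDBMS EDW": tco_per_tb(2_000_000, 400_000, 100_000, usable_tb=100),
        "DW appliance":     tco_per_tb(800_000, 250_000, 80_000, usable_tb=50),
        "Hadoop cluster":   tco_per_tb(300_000, 600_000, 150_000, usable_tb=500),
    }
    for name, cost in platforms.items():
        print(f"{name}: ${cost:,.0f} per usable TB over 3 years")

Under these invented inputs, Hadoop's per-terabyte cost is lowest despite its higher staffing line, which is the "pay now or pay later" trade-off described above.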

    If we pull together the three prominent platform types discussed above, they provide a diverse range of options for both functionality and cost. This report discussed the established pairing of RDBMS-based warehouses with Hadoop. TDWI also sees appliances and columnar databases tightly integrated with the relational warehouse and Hadoop, as illustrated in Figure 3.


Figure 3. Range of platform options within a data warehouse environment. The range runs from mature, feature-rich relational DBMSs (premium functionality at a premium price), through appliances and columnar DBs built for DW and analytics, to open source Hadoop built for big data (emerging functionality at a low entry price).


Cost and functionality are major drivers for data migration. Again, the point of the multi-platform DW environment is to manage a data set on a platform that is the best fit for it and its workloads. That's a technology consideration; yet, many users are under pressure to control costs, so they look at both cost and functionality considerations when they choose their platform and the physical placement of data. The balance of cost and functionality is driving certain kinds of data migrations, usually in the context of data warehouse modernization.

For example, many users complement their core data warehouse with standalone implementations of columnar and appliance-based relational databases. This migration frees up capacity on the DW and provides a workload-specific platform optimized for complex, SQL-based analytics. More recently, data migrations to Hadoop have increased, as early adopters offload the core warehouse and take advantage of Hadoop's inexpensive storage, scalability, and analytic processing power. In other words, in the model shown in Figure 3, there's a trend to migrate data from left to right. Despite the migration of a minority of warehouse data sets, the relational DW is as relevant as ever, and it has a new and more practical focus on data sets that truly belong on it.

Data migrations aside, users also balance cost and functionality in greenfield situations, as when they select platforms for new data (such as that from machines and social media). A similar balance is struck in common data warehouse modernization tasks, especially the consolidation of proliferated data marts and ODSs.

Everyone's different. Data warehouse modernization is a golden opportunity for rethinking both TCO and ROI at both the single-platform and total-environment levels. Each organization has its own unique mix of business, technology, and budgetary requirements. Each organization will need to develop its own metrics for quantifying platform TCO and ROI, the value of specific data sets, and the value of certain user constituencies. These financial metrics can complement technical considerations, such as the size and usage patterns of data, so that platform acquisitions and usage decisions are fully enlightened and innovative.

    This report has discussed the leading options for data warehouse modernization today, as well as future directions for modernization. Most modernization efforts should consider all those options but give priority to what the business needs from data, while leaving room for innovation based on new data, new technologies, new architectures, and new opportunities for managing costs in the modern data warehouse environment.


ABOUT OUR SPONSORS

    www.cloudera.com

Cloudera is revolutionizing enterprise data management by offering the first unified Platform for Big Data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, access, process, secure, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Cloudera's open source Big Data platform is the most widely adopted in the world, and Cloudera is the most prolific contributor to the open source Hadoop ecosystem. As the leading educator of Hadoop professionals, Cloudera has trained over 30,000 individuals worldwide. Over 1,450 partners and a seasoned professional services team help deliver greater time to value. Finally, only Cloudera provides proactive and predictive support to run an enterprise data hub with confidence. Leading organizations in every industry plus top public sector organizations globally run Cloudera in production. For more information, visit us at www.cloudera.com.

    www.impetus.com

    Impetus Technologies provides innovative big data solutions and services that empower large enterprises to unlock the full value of their big data opportunities. Our proven methodologies and solutions span the full life cycle of architecture advisory, proof of value, data science, application development, and implementation services. We have launched solutions for data warehouse modernization and real-time streaming data analytics. The data warehouse modernization solution incorporates a proven productive methodology with an automation toolset that substantially reduces the time and cost of migrating ETL and analytics functions to Hadoop. We have also introduced StreamAnalytix, an application development platform for rapid development of real-time streaming analytics applications. Both solutions leverage our deep experience as an early adopter of big data technologies and offer enterprise-class support and ease of use combined with the benefits of open source. By leveraging the open source community, we are able to incorporate a vibrant source of innovation while reducing cost for our enterprise customers. More information regarding our big data services can be found at bigdata.impetus.com and www.streamanalytix.com, or by writing to us at [email protected].


    www.mapr.com

    MapR Technologies delivers on the promise of Hadoop with a proven enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. MapR brings unprecedented dependability, ease of use, and world-record speed to Hadoop, NoSQL data stores, and streaming applications in one unified distribution for Hadoop. MapR is used by more than 500 customers across financial services, government, healthcare, manufacturing, media, retail, and telecommunications sectors as well as by leading Global 2000 and Web 2.0 companies.

    MapR provides engineering contributions to several open source Apache Hadoop projects including Apache Drill. Drill delivers interactive ANSI SQL queries on Hadoop and NoSQL databases, without requiring the building of centralized schemas. Drill is the first on-the-fly schema-discovery SQL engine that brings instant insight from any data source from simple files to complex hierarchical JSON data structures and schema-less databases. You can get started with Drill in minutes by downloading the MapR Sandbox for Drill.

    www.teradata.com

    The Teradata Unified Data Architecture (UDA) enables companies to get more value from their data by connecting the dots across the business for breakthrough insights and providing the agility to answer new business questionsall while reducing overall costs and complexity. The UDA is a proven, reliable, and cost-effective framework for integrating analytics across Hadoop and the data warehouse.

As the market leader in data warehousing, Teradata has deep engineering relationships with Hortonworks, Cloudera, and MapR that provide customers with the choice to implement the best distribution for their needs. Hadoop and the integrated data warehouse are orchestrated with products such as QueryGrid, which, through a single query, pushes analytics down to where the data resides across the ecosystem, thereby reducing data movement and redundancy.

ABOUT THE AUTHOR

Philip Russom is director of TDWI Research for data management and oversees many of TDWI's research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 500 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at [email protected], @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.

ABOUT TDWI RESEARCH

TDWI Research provides research and advice for data professionals worldwide. TDWI Research focuses exclusively on business intelligence, data warehousing, and analytics issues and teams up with industry thought leaders and practitioners to deliver both broad and deep understanding of the business and technical challenges surrounding the deployment and use of business intelligence, data warehousing, and analytics solutions. TDWI Research offers in-depth research reports, commentary, inquiry services, and topical conferences as well as strategic planning services to user and vendor organizations.

ABOUT TDWI CHECKLIST REPORTS

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.
