
TDWI CHECKLIST REPORT

Data Lake Principles and Economics
Evaluating Data Economics, Business Value, and Real-Time Delivery Options

By Stephen Swoyer

Sponsored by MapR and Teradata

TABLE OF CONTENTS

FOREWORD
DATA LAKE BASICS
CHARACTERISTICS OF A DATA LAKE
NUMBER ONE Automated and reliable data ingestion
NUMBER TWO Preservation of original source data
NUMBER THREE Capturing and managing all metadata
NUMBER FOUR Governance and security
NUMBER FIVE Search, access, and consume data with ease
NUMBER SIX Cleansing, aggregation, and integration matched to each use
NUMBER SEVEN Prepare data for analysis—and sometimes do the analysis
NUMBER EIGHT Offload cold data to boost performance and reduce costs
NUMBER NINE Real-time ingest and egress requirements
NUMBER TEN Cost containment via multi-tenancy
ABOUT OUR SPONSORS
ABOUT THE AUTHOR
ABOUT TDWI RESEARCH
ABOUT THE TDWI CHECKLIST REPORT SERIES

© 2015 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to [email protected]. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

OCTOBER 2015


555 S Renton Village Place, Ste. 700 Renton, WA 98057-3295

T 425.277.9126 F 425.687.2842 E [email protected]

tdwi.org


FOREWORD

Think of a data lake as a new-age data repository of “raw” data that offers developers and business users an extended set of features from basic reporting to machine learning to non-relational graph analytics. It’s a data management innovation that promises to reduce data processing and storage costs and enable new forms of business-analytic agility. The value proposition of the data lake is that it provides a cost-effective context in which to support data exploration use cases as well as to host and process the new kinds of analytics many organizations need to do with big data today.

Data lakes are often (but not always) implemented in Hadoop because this file-based platform can accommodate data of different shapes and sizes. This includes everything from event messages or logs generated by applications, devices, or sensors to semi-structured data (such as text) to multi-structured data—a category that includes objects of any conceivable kind, such as JSON and XML files or voice, video, and image files.

This Checklist Report discusses what your enterprise should consider before diving into a data lake project, whether it’s your first, second, or even third major data lake effort. In time, adherence to these principles should become second nature to the data lake team, which may well improve upon them.

DATA LAKE BASICS

Data lakes provide a practical way to reduce the cost and complexity of processing and storing data while promoting flexibility, self-service, and agility. In practice, they tend to wear many hats. They are used as both an architectural centerpiece (i.e., as a central hub in which to persist and retrieve data of all kinds) and as a means to complement or extend existing data management processes.

Conceptually, the data lake can be thought of as a scalable, cost-effective repository for persisting, managing, and processing data in its original (“raw”) form. That data can originate from internal sources (such as OLTP applications, operational data stores, and the event messages transmitted via enterprise service buses) or external sources (including social media services, Web and cloud applications, sensors, subscription services, and open data sets). A data warehouse’s schema-based approach takes in data that has had a heavy dose of cleansing, profiling, and data integration through ETL processing. The data lake’s early-ingest model is much less heavy duty, consisting of—at most—data consistency checks and light transformations. Like a data warehouse, the data lake can be optimized for accepting both batch and real-time (i.e., streaming) data, too.


The data lake combines a cost-effective, general-purpose parallel processing layer with an equally cost-effective distributed storage layer. Some adopters opt to build a third architectural layer to create the equivalent of virtual views over the lake. These views give the data lake some degree of structure in cases where specific applications need it. Ideally, however, the lake’s storage should be as raw and unstructured as possible. This makes it more easily repurpose-able as new analytic requirements emerge. It is possible to scale up (and, if necessary, to scale down) a data lake environment as needed from a comparatively small footprint—several hundred gigabytes to a few terabytes—to petabyte scale and beyond.

Moreover, because the data lake isn’t a passive repository, but rather a scalable parallel processing platform, it is likewise possible to realize additional cost efficiencies by shifting or offloading workloads from mature platforms—such as a relational data warehouse—in cases in which they do not need or benefit from advanced features such as accelerated performance for query processing and complex joins, workload management features that are required to meet strict SLAs, and high concurrency.

In the same way, the data lake provides a compelling alternative to storing and managing cold data in the data warehouse, where the per-terabyte cost of storage and processing is considerably higher. Finally, the data lake must be a multi-tenant environment in that all users, applications, jobs, or services (i.e., “tenants”) share access to the same storage and compute resources. If its resources are managed effectively, the data lake environment can flexibly host a large number of simultaneous workloads.

Fully realized, the data lake enables the equivalent of a big-data-as-a-service model. It can consolidate ETL processing and other kinds of data preparation workloads, and it can feed downstream analytic practices such as data warehouses and data marts. It can function as a scalable, self-serviceable resource for data exploration and new breeds of analytics. Finally, the data lake can host non-relational types of analyses (e.g., graph or network analysis) that the typical data warehouse cannot.

CHARACTERISTICS OF A DATA LAKE

Raw data storage. In its most basic state, the data lake is a primary repository for ingesting, persisting, and managing raw source data. Unlike the data warehouse, data doesn’t have to be modeled and transformed prior to being loaded into the data lake; in this way, the data lake environment provides a cost-effective solution for landing and storing strictly structured, semi-structured, and multi-structured information, the value of which is unknown or under-appreciated. Because the data lake makes it possible to cost-effectively store original data, it can feed downstream business intelligence (BI) and analytic needs, from populating or repopulating a data warehouse to supporting a visual discovery practice.

Consolidated data processing. Thanks to its Hadoop-based underpinnings, the data lake is an ideal platform for processing or “refining” data, be it in the form of lightweight data preparation, complex data quality routines, or CPU intensive ETL processing, along with the multi-stage, highly iterative kinds of data prep that typically presage data scientific analysis. Research analyst Richard Winter describes this use case as “data refining”; industry expert Wayne Eckerson uses the metaphor of a “data refinement system.” In both analogies, the role of the data lake is to prepare data for consumption by downstream systems, applications, services, and users.

Consolidated analytical processing. Because it’s a multi-tenant environment, the data lake is an ideal context in which to consolidate and host multiple, simultaneous analytical workloads. These include basic BI reporting and ad hoc query; computationally intensive machine learning and data mining workloads; and non-relational analytic workloads such as certain kinds of time series and graph analyses. Even though the data lake isn’t designed to support high levels of user concurrency, it is suitable for access by a limited subset of highly technical users, such as data scientists.

Flexibility and cost-efficiency. The data lake is likewise an attractive solution for consolidating redundant data sources. Enterprises can consolidate and eliminate ODSs, analytic sandboxes, and other (comparatively costly) systems in the data lake environment. In the same way, cold data can be offloaded from the data warehouse into data lake storage. In cases where query-processing performance, workload management, and other capabilities are no longer needed, data offloading is considerably cheaper. Data virtualization technology can make this transparent to BI tools. Finally, the data lake is a polyglot resource that is of particular interest to internal application developers. Instead of coding their apps to pull and reconstitute data from the rows and columns of a traditional RDBMS, applications can read directly from and write directly to the file blocks of the data lake itself.

AUTOMATED AND RELIABLE DATA INGESTION

NUMBER ONE

Ideally, there should be reusable scripts that standardize automation and error recovery processes in the data lake. After all, there’s a world of difference between merely loading data into the Hadoop environment and loading data as part of a consistent, reliable, and repeatable process.

Source systems and mappings change; when they do, script-driven automation routines break down. Script-driven automation is feasible at low volumes but impracticable at high volumes. In practice, scripts beget scripts beget more scripts, and, as the data lake environment matures, the collection of scripts required to manage the lake may balloon from dozens to hundreds to perhaps even thousands. This is to say nothing of the cost of the human labor required to keep the whole, increasingly tottering edifice running. Scripting is unavoidable in building and managing the data lake. It is critical, then, that you enforce good coding and naming standards and that you make use of tools to manage your ever-expanding collection of scripts. As Hadoop loading tools mature, use them to replace the scripts you’ve cobbled together to bootstrap your data lake.

It is also important to perform “reasonableness” checks and other kinds of basic tests on the data that you’re ingesting. If you want to do this today, you’re going to have to hand-code these checks yourself.

Yes, the ability to derive or impute schema at the time of access is one of Hadoop’s signature selling points, but schema-on-read is the wrong time to discover that you have significant—and easily avoidable—problems with your data. Don’t confuse the checks you traditionally perform as a prelude to heavy-duty ETL processing with the use of similar checks to ensure reliable data acquisition into the data lake. Such checking is not ETL or anything like it. Schedule checks on samples of your data after you’ve ingested it.
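By way of illustration, a hand-coded check of this kind can be quite simple. The following Python sketch samples a newly landed delimited file and flags feeds that arrive short, malformed, or unusually sparse; the file path, expected column count, and thresholds are all hypothetical, and it assumes the lake is reachable at a local mount point.

```python
import csv

# Hypothetical landing path and expectations; tune these per feed.
SAMPLE_PATH = "/mnt/lake/raw/orders/2015-10-01.csv"  # lake mounted locally, e.g., via NFS
EXPECTED_COLUMNS = 12
MIN_EXPECTED_ROWS = 100000
MAX_NULL_RATE = 0.05  # flag feeds where more than 5% of order_id values are empty

def reasonableness_check(path):
    rows = nulls = bad_width = 0
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            rows += 1
            if len(row) != EXPECTED_COLUMNS:
                bad_width += 1
            elif row[0].strip() == "":  # order_id assumed to be the first field
                nulls += 1
    problems = []
    if rows < MIN_EXPECTED_ROWS:
        problems.append("only %d rows ingested" % rows)
    if bad_width:
        problems.append("%d rows with an unexpected column count" % bad_width)
    if rows and float(nulls) / rows > MAX_NULL_RATE:
        problems.append("order_id null rate of %.1f%%" % (100.0 * nulls / rows))
    return problems

if __name__ == "__main__":
    issues = reasonableness_check(SAMPLE_PATH)
    if issues:
        raise SystemExit("Ingest check failed: " + "; ".join(issues))
```

A scheduler (cron, Oozie, or whatever your operations team already uses) can run such a check after each ingest window and halt downstream jobs when it fails.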

Hadoop ingest processes are often set up with little or no appreciation for the sheer number of files that will be ingested over time or for the necessity of ingesting data in a consistent and complete manner. All data must be ingested in its entirety every day. When something goes wrong—e.g., the wrong data is fed in or a step is not completed—that error must be detected and corrected immediately before bad results can be propagated. If this process is not completely reliable, the resulting data will not be trusted. Many years of experience have shown that the fastest way to kill a project is to provide unreliable data to business users. Therefore, operations tools such as job schedulers and backup routines must be adapted for use in the data lake, too.


PRESERVATION OF ORIGINAL SOURCE DATA

NUMBER TWO

Raw data is special. You can easily recreate derived files from raw data but you can never recreate raw data from derived files. Therefore, be careful about persisting derived data into the data lake. Remember, the primary role of a data lake is as a central, cost-effective repository in which to store and manage raw data. (Part of that cost effectiveness lies in avoiding the expense and overhead of transforming data.) In this scheme, the lake functions as a system of record. This permits it to support a wide range of downstream analytic use cases.

For example, raw source data is used to derive the information that populates data warehouse systems, data marts, analytic sandboxes, and the like. It is likewise used to derive the normalized data that (along with data from the warehouse) is required by many machine-learning models. In the data lake paradigm, raw data is also made available for analytic practices, such as visual discovery, which blend raw data with derived data from the warehouse as well as data from other sources. These are all cases in which raw data is used to derive data that in turn feeds an organization’s analytics pipeline. Although the second use follows from the first, it doesn’t have to feed back into it.

In most cases, then, derived data can and should be recreated from the raw source. As always, there are exceptions to this rule—such as, for example, auditing requirements or time-critical use cases in which data must be processed quickly. In such cases, it does make sense to persist derived files into the data lake. You could take this a step further by also saving the processing logic that you use to produce your derived files. In this way, you can recreate them six months, one year, or even several years later. As processing logic changes, keep the older versions, too, just in case they are needed.

It is important to be careful about the data you put into the lake for several reasons.

First, you don’t want to dilute or compromise the raw content of your lake by mixing in data that has been altered by downstream users or applications. If you do so, make sure that this data has been sandboxed or appropriately isolated. Second, there’s an understandable temptation to treat a data lake as an inexhaustible repository, such that one uses it to store all information, irrespective of provenance or value. Because the data lake comprises a cluster of physical server and storage resources, it isn’t in any sense inexhaustible. These resources must be powered and cooled; in the same way, server and storage racks also consume valuable data center floor space. In some locales, the cost of the electricity required to power a data lake, or, alternately, of the physical space required to house it, can far exceed the cost of its hardware. At the very least, then, you should establish and maintain expiration policies on the derived files that you persist into the lake environment. For that matter, it is a good idea to apply expiration policies to raw data, too.
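What an expiration policy looks like in practice will depend on your distribution and tooling; many shops start with nothing more elaborate than a scheduled sweep. The sketch below assumes the lake’s derived-file area is reachable at a hypothetical local mount point (for example, via an NFS gateway) and simply removes derived files older than a retention window.

```python
import os
import time

# Hypothetical mount point and retention window; in practice you might instead use
# your distribution's lifecycle tooling or 'hdfs dfs' commands from a scheduled job.
DERIVED_ROOT = "/mnt/lake/derived"
RETENTION_DAYS = 90

cutoff = time.time() - RETENTION_DAYS * 24 * 3600

for dirpath, _dirnames, filenames in os.walk(DERIVED_ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        if os.path.getmtime(path) < cutoff:  # unmodified since the cutoff; expire it
            print("expiring", path)
            os.remove(path)
```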

A data lake is a cost-effective option relative to a traditional RDBMS or a data warehouse, but it isn’t a panacea. An assortment of large, derived files replicated three or more times across a Hadoop cluster will squander hardware resources and increase the second- and third-order costs of managing and scaling the data lake. These include increased data center heating and cooling costs, as well as more complicated licensing and administrative costs. Total cost of ownership varies widely from one Hadoop distribution to another because of their different architectural approaches, so compare costs carefully across vendors.

Instead of persisting derived files, encourage analysts to address their immediate needs by deriving new files from the same raw source data. If users or programmers are permitted to create and persist derived files in the lake, they must be strongly encouraged to reuse these files as much as possible—or face capacity limits imposed by the data center operations manager. Replicating a 500 gigabyte or 1 TB derived file across a data lake cluster is bad enough; replicating a dozen or more copies of the same file will quickly squander space and significantly increase the cost of operating the lake.


CAPTURING AND MANAGING ALL METADATA

NUMBER THREE

The biggest challenge in putting data into the data lake is getting the same data back out again.

Hadoop itself provides tools to build some but not all of the data lake; several critical components, such as metadata management and data lineage-tracking tools, must be acquired, purchased, or built. A metadata management tool, in particular, is probably the most important component to acquire. Imagine you’ve built a thriving data lake with several million files in it. How would an analyst find one specific file among all of those millions of files if she didn’t have its exact 30-character name? What if the file in question is seven derivatives removed from the original raw file? Will she grab the sixth derivative in error? If you don’t know what’s in your data lake, and if you aren’t able to track how the content of your lake has changed over time, you’re setting consumers up for failure.

This is why it is critical to have an efficient, reliable mechanism for capturing metadata and tracking data lineage as information is ingested into and, subsequently, changed within the data lake environment. To cite just one example, it’s critical to be able to map the original raw data file to the subsystem that produced it. A suitable metadata management tool must also be capable of sampling data to produce a profile of its constitutive record structures. With CSV, JSON, and structured data, this should be easy enough; however, more complicated data may not be so easily profiled in the metadata repository. In these and other cases, the Hive metastore, which is limited to SQL table and column names, isn’t sufficient. A dedicated metadata management tool is necessary.
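Until a dedicated tool is in place, even a rudimentary, hand-rolled catalog is better than nothing. The following Python sketch is one possible shape for that stopgap: paths and field names are hypothetical, and a flat JSON log stands in for a real metadata repository. It records the source system, ingest time, checksum, and lineage parent of each file as it lands.

```python
import hashlib
import json
import os
import time

# Hypothetical catalog file; a real deployment would use a metadata management
# tool (or a store such as HBase) rather than a flat JSON log.
CATALOG_PATH = "/mnt/lake/catalog/ingest_log.jsonl"

def record_ingest(path, source_system, derived_from=None):
    """Append one catalog entry describing a newly ingested or derived file."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    entry = {
        "file": path,
        "source_system": source_system,
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "size_bytes": os.path.getsize(path),
        "md5": digest.hexdigest(),
        "derived_from": derived_from,  # lineage pointer; None for raw source files
    }
    with open(CATALOG_PATH, "a") as catalog:
        catalog.write(json.dumps(entry) + "\n")

record_ingest("/mnt/lake/raw/crm/accounts_2015-10-01.csv", source_system="CRM")
```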

The cost of bad metadata will compound quickly. This isn’t just a function of first-order costs, as when incomplete or incorrect data is used as a basis for decision making, but of second-order costs, too. These second-order costs have to do with the complexity of managing and scaling a bloated data lake environment. Remember, data that has been “lost” in the lake isn’t actually lost; it’s persisted in the physical file blocks of the file system DataNodes to which it has been replicated. It will reside there until it is identified and cataloged—or, alternately, purged.

In the meantime, “lost” data that cannot be reliably located must be re-created or re-derived. This effectively means re-ingesting it from source systems and/or deriving it yet again into the data lake environment. Without an efficient and reliable means of managing metadata and data lineage, some proportion of a data lake’s storage and compute resources must and will be squandered. Costs will likewise increase as nodes are added (or as new clusters are spun up) to address expanding resource requirements. The upshot, then, is that inefficiency and lost programmer time will reign.

Make it a priority to acquire third-party metadata management and data lineage tools for Hadoop. Today, several start-up vendors market Hadoop-based metadata and lineage tools; in addition, the Apache Foundation sponsors a project (Apache Atlas) that aspires to tackle this problem. Bear in mind, however, that these tools are immature and highly differentiated. In most cases, it’s necessary to use two or more at the same time to capture and track lineage. Ultimately, product maturation and feature-function expansion will probably result in one or, at most, two reliable metadata tools.


GOVERNANCE AND SECURITY

NUMBER FOUR

Don’t give short shrift to governance and security in your data lake environment. Governance and security are not add-ons to be retrofitted during development or testing. Neither is optional. If security is lax, sensitive data is vulnerable to catastrophic loss, including data breaches and cyberattacks.

Like any other repository of critical or sensitive information, the data lake is an invaluable resource for enterprise consumers. If its contents are compromised, however, the data lake—as with any other critical data source—becomes a danger to the enterprise. This is why it is critical to balance the needs of common-sense security and governance over and against the information access needs of consumers. If security safeguards are too strict, experimentation and innovation will wither as data scientists and analysts are stifled. As a result, the data lake will fail in its mission to promote innovation.

Similarly, with too little governance, it can become difficult, even onerous, to manage the data lifecycle. There’s little clarity as to who is empowered to manage or to make decisions about the data in the lake. There’s the potential for a lack of standards—as with, for example, data quality thresholds. As a result, predictable data uses or production dataflows cannot be instantiated and sustained. The data lake fails in its mission to support repeatable business processes.

On the other hand, with too much governance it takes too long to make decisions, to add new data, or to experiment with new processes. A fine balance is needed. In securing and governing the data lake, it is critical to develop a framework of accountability that addresses the following use cases:

• Creation, storage, use, archival, and deletion of data

• Reliable processes for data files

• Roles, standards, and metrics for the data lake

The standard objections—that Hadoop is still a relatively young environment or that governance and security will come with time—are dangerously irresponsible. The same rules that apply in traditional analytics application development likewise apply in developing for and maintaining the data lake. Security must be built in from the beginning both by the vendor and by the programming staff.1 In the same way, data governance policies must be established and enforced from the very beginning, too. Both priorities are related: if you build in or otherwise design for security, you can more easily enforce governance.

The trick is to do so while simultaneously promoting agility and, to the degree desirable, programmer autonomy. As with the ideal data warehouse environment, the ideal data lake environment must be hardened as much as is practicable (or required) without exasperating potential users.

Apache Hadoop has taken its lumps for giving short shrift to security, but the Hadoop environment offers a creditable security feature set, including a built-in authentication mechanism and authorization controls. This last category runs the gamut from file system-based access controls to projects such as Apache Sentry and, more recently, Apache Ranger. Sentry, now in beta at Apache, is a mechanism for providing role-based access to data and metadata in a Hadoop environment.

A more recent project, Apache Knox, combines both authentication and authorization. In addition, some commercial distributions of Hadoop offer features that encrypt data (at both the file-block and disk levels) and safeguard its availability—for example, via consistent snapshots and read-write mirrors. Security capabilities are highly differentiated on a distribution-by-distribution basis. Some distributions offer comparatively granular enforcement mechanisms, such as the ability to enforce user- or role-based access control on a per-column or per-table basis.

Even so, Hadoop users who are serious about enterprise-grade security are in most cases looking to third-party commercial offerings, not open source projects. These products tend to be more feature-rich than their open source kith; in addition, they’re bundled with maintenance and support.

Even though there’s no shortage of options for securing Hadoop, remember that product-specific features are no substitute for enterprise security best practices such as vulnerability management, detection, and response. Product and enterprise security go hand in hand, and both are important for preventing and detecting the purposeful or accidental tampering of data.

Securing the data lake isn’t the same thing as governing it. Without the involvement of domain experts to help plan and manage its growth, however, the data lake will become inefficient and polluted. One solution is to establish a data lake competency center (DLCC)—an entity that consists of business users, architects, programmers, and representatives from quality assurance and operations management. The DLCC meets regularly to discuss planning and changes to the data lake. Initially this is a huge benefit to all involved, inasmuch as it solves business problems and makes the data lake more reliable. Be mindful, however, that the DLCC may become increasingly bureaucratic over time. The DLCC and other governance bodies must not detract significantly from the data lake’s core mission to promote agility and flexibility.

1 In comparison to standard Apache Hadoop, commercial Hadoop distributions offer an improved security feature set—usually in the form of add-on administrative tools (monitoring and auditing capabilities) and/or distribution-specific enforcement mechanisms (access control on a per-column or per-table basis). The robustness of these features and mechanisms will differ (sometimes significantly) between and among different Hadoop distributions, however, so make sure to do your homework.

SEARCH, ACCESS, AND CONSUME DATA WITH EASE

NUMBER FIVE

The data lake environment is designed to serve information consumers of all kinds, be they quants and analysts, users of embedded BI and analytics, or people not interacting with BI at all.

This last category includes users of applications that consume information from disparate data sources, such as OLTP applications, operational data stores, and, of course, the data warehouse.

Ideally, these applications will consume data from the data lake, too. In other words: market the data lake to critical internal constituencies, especially software developers.

Don’t just expose the data lake as a resource for software developers and users, however. Actively promote its advantages and uses—but don’t stop there. Accommodate developers on their own turf by offering alternatives to SQL access for non-BI apps. One such alternative is Apache Drill, the open source implementation of Google’s Dremel. (Dremel is the distributed ad hoc query system that underpins Google’s BigQuery infrastructure service, designed to provide flexibility without compromising on interactive performance.) Drill permits a user to explore raw files and multi-structured data—in this case, the term multi-structured means both relational and non-relational data—in the absence of a predefined schema.

Drill is ideal for querying against the text, JSON, Parquet, Avro, and other kinds of file objects that are stored in the data lake’s distributed file system layer. Instead of creating a schema overlay for these file objects in Hive, use Drill to access them as they are, in situ. Along with raw data in files, Drill can query Hive and HBase tables. It has a modular storage plug-in architecture which permits it to query sources outside of the Hadoop environment, too. Drill supports ANSI SQL and connects to BI tools using JDBC/ODBC drivers.
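As a concrete illustration, Drill can be driven over its REST interface as easily as over JDBC/ODBC. The Python sketch below (the host name, file path, and column names are hypothetical) runs an ANSI SQL query directly against raw JSON files in the lake, with no Hive schema defined.

```python
import requests

# Hypothetical Drillbit host and lake path; Drill's REST API accepts ANSI SQL over HTTP.
query = """
    SELECT user_id, COUNT(*) AS hits
    FROM dfs.`/lake/raw/clickstream/2015/10/01`
    GROUP BY user_id
    ORDER BY hits DESC
    LIMIT 10
"""

resp = requests.post("http://drillbit-host:8047/query.json",
                     json={"queryType": "SQL", "query": query})
resp.raise_for_status()

# Each returned row is keyed by column name.
for row in resp.json().get("rows", []):
    print(row["user_id"], row["hits"])
```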

Another, newer alternative is Apache Spark, an extensible cluster computing framework. Each of Spark’s four libraries promises to enable valuable analytics functionality for data lakes and other big data collections. Extant Spark libraries support a subset of ANSI SQL, machine learning, streaming data, and graph analytics. However, Spark is still relatively new, so only time will tell whether its promise—particularly with respect to simplified access and accelerated SQL query performance—is ever realized. One thing’s for certain: Spark enjoys wide backing in and among the vendor community.
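For example, Spark’s SQL library can impose a schema on raw lake files at read time and query them in place. A minimal PySpark sketch (the path and field names are hypothetical) might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-exploration").getOrCreate()

# Hypothetical path: raw JSON event files landed in the lake; schema is inferred at read time.
events = spark.read.json("hdfs:///lake/raw/events/2015/10/")
events.createOrReplaceTempView("events")

spark.sql("""
    SELECT user_id, COUNT(*) AS n_events
    FROM events
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""").show()
```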

Elsewhere, use natural language processing (NLP) search technology to support certain users and applications. For users who are less technically inclined, search is a much better solution than SQL query. What’s more, search tools are often the best or the only way to cope with highly unstructured data. Thanks to Apache OpenNLP, it is possible to extend open source software (OSS) offerings such as Elasticsearch and Solr with NLP search capabilities.

Remember, too, that you can store data in the Hadoop environment without predefining schemas. If there’s no schema, there’s no SQL, which means applications can read and write data directly to and from Hadoop’s file system layer without reconstituting it from (or shredding it back into) database rows and columns via object-relational mapping (ORM) techniques. ORM—its uses and abuses, its disadvantages and oft-disputed advantages—is about as polarizing a subject as there is among developers. The point is to make it as easy as possible for people to get at the unstructured clickstreams and server logs in the data lake. This means catering to the preferences of all potential consumers—including developers—not (just) to those of data management practitioners.
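In practice, “reading and writing directly” can be as simple as appending newline-delimited JSON to a file in the lake. The sketch below assumes the lake’s file system is reachable at a hypothetical local mount point (many distributions expose a POSIX or NFS interface); an HDFS client library would serve the same purpose.

```python
import json
import time

# Hypothetical path; assumes the lake's file system is mounted locally
# (for example, via an NFS gateway) or reachable through an HDFS client.
EVENT_FILE = "/mnt/lake/raw/app_events/2015/10/01/events.jsonl"

def log_event(event_type, payload):
    """Append one application event to the lake as raw, schema-free JSON."""
    record = {"ts": time.time(), "type": event_type, "payload": payload}
    with open(EVENT_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")

log_event("page_view", {"user_id": 1234, "url": "/pricing"})
```

No object-relational mapping, no rows and columns: the application writes what it has, and downstream consumers impose whatever structure they need at read time.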

CLEANSING, AGGREGATION, AND INTEGRATION MATCHED TO EACH USE

NUMBER SIX

One of the earliest data management-oriented use cases for Hadoop was as an inexpensive platform for parallel data integration processing.

This was the so-called “ETL on steroids” use case touted by early ETL-on-Hadoop proponents. In the same way, the first data lakes were designed and built for ETL processing on a massive scale. Many still are.

This use case is as viable as it ever was. Use your data lake to feed the data warehouse and other downstream analytic systems; in the same way, make the data lake a primary site for the preparation—cleansing, aggregation, and integration—of data. You shouldn’t necessarily expect to eliminate existing ETL tools and processes. Instead, use the data lake to augment them when and where it makes sense to do so.

Be wary of performing too much data integration. Not all applications or use cases require completely consistent or top-quality data. The data warehouse has rigorous data preparation requirements; other applications, however, may have less stringent requirements.

More to the point, a draconian emphasis on cleansing, standardizing, and vetting data can be an enemy to agility. In the same way, the rigorous enforcement of data governance policies can likewise frustrate or exasperate people who don’t require top-quality cleansed data and are desperately trying to address time-critical business requirements. In such cases, use the data lake as an agile, pragmatic alternative. For example, if the data integration pipeline takes too long or if the requirements of governance prove to be too onerous across multiple technologies, then prepare the required data in the data lake and make it available to the appropriate constituencies.
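A “good enough” preparation job of this kind often amounts to a few lines of Spark. The PySpark sketch below (paths and column names are hypothetical) drops obviously bad records, standardizes one field, and aggregates the result for a single consumer, without attempting warehouse-grade cleansing.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-prep").getOrCreate()

# Hypothetical raw feed; column names are illustrative.
orders = spark.read.json("hdfs:///lake/raw/orders/2015/10/")

# Light-touch preparation for one consumer: drop rows with no order id,
# standardize the country field, and aggregate to daily revenue.
daily = (orders
         .filter(F.col("order_id").isNotNull())
         .withColumn("country", F.upper(F.col("country")))
         .groupBy("order_date", "country")
         .agg(F.sum("amount").alias("revenue"),
              F.count("order_id").alias("orders")))

daily.write.mode("overwrite").parquet("hdfs:///lake/derived/daily_revenue/")
```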

Inevitably, there are use cases in which having even poor quality data (e.g., data that is only 40 percent accurate) is better than having no data at all. Data lake programmers must both educate business users about data quality levels and make sure they understand that it is incumbent upon them to use this data responsibly. In the same way, the data lake team should look for opportunities to educate executives about the state of quality levels as well as about the need to make investments to maintain or improve quality in the data lake environment.


PREPARE DATA FOR ANALYSIS—AND SOMETIMES PERFORM THE ANALYSIS

NUMBER SEVEN

The data lake isn’t a passive repository for storing and managing enterprise information. It can and should host data processing workloads, too. These include not only ETL and data preparation workloads, as discussed in NUMBER SIX, but a wide variety of analytics as well.

From basic reporting and ad hoc analytics to machine learning, use the data lake as a resource to augment your data warehouse or data research and development (R&D) efforts.

There’s no “right” place to host any data processing workload, regardless of context. There are situations in which it might make sense to host iterative, multi-pass SQL routines on large data sets in the data lake environment. In other situations, the same workload will run faster, more efficiently, and at much lower cost in a massive parallel processing (MPP) database. (Cost in this example is a function of the human programming and administrative resources required to code and manage these workloads.) The key, now as ever, is to determine what’s best for your needs.

For workloads that involve the parsing, preparation, and processing of semi-structured or multi-structured data, the data lake is an excellent alternative to the data warehouse. These workloads are sometimes cumbersome or impractical in the data warehouse.

The data lake likewise gives you a practical option to address reporting and/or ad hoc analytics requirements that may be impractical in a data mart or data warehouse.

One such example is basic reporting and ad hoc analysis on operational data as it is ingested into the data lake environment. This data hasn’t yet been cleansed and prepared for loading into the warehouse. In the past, this requirement was addressed—at some cost—by an ODS. The data lake gives you a single, managed, multi-tenant repository in which to consolidate multiple ODSs.

Still another compelling analytic use case involves analytics on non-traditional data types (e.g., log or telemetry data from machines, e-mail, social media text, or other sources) that are mostly tabular or which contain critical tabular fields. In the data lake model, it’s comparatively easy to parse logs or event messages into tabular data format and to make this data available for reporting to a group of users. In the same way, machine or text data is often used in graph/network analysis and other types of non-relational analyses for which the data warehouse is less than ideal.
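Parsing such data into tabular form is straightforward in the lake. The PySpark sketch below (the path and the assumed log format are hypothetical) pulls common web-log fields into columns and exposes them to SQL for basic reporting.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("log-to-table").getOrCreate()

# Hypothetical web server logs landed in the lake as raw text.
logs = spark.read.text("hdfs:///lake/raw/weblogs/2015/10/01/")

# Common Log Format fields pulled into columns with a regular expression.
pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\S+)'
parsed = logs.select(
    F.regexp_extract("value", pattern, 1).alias("client_ip"),
    F.regexp_extract("value", pattern, 2).alias("timestamp"),
    F.regexp_extract("value", pattern, 3).alias("method"),
    F.regexp_extract("value", pattern, 4).alias("path"),
    F.regexp_extract("value", pattern, 5).cast("int").alias("status"),
)

parsed.createOrReplaceTempView("weblogs")
spark.sql("SELECT status, COUNT(*) AS hits FROM weblogs GROUP BY status").show()
```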

Depending on what you’re doing, analytics development in the data lake might be considerably more expensive than in the data warehouse. This is as much a function of the immaturity of extant SQL-on-Hadoop technologies as of the complexity of hand coding in Hadoop, which requires data processing-specific knowledge and skills along with the requisite proficiency in Java, Pig Latin, Python, Scala, Clojure, and other languages.


OFFLOAD COLD DATA TO BOOST PERFORMANCE AND REDUCE COSTS

NUMBER EIGHT

As indicated in NUMBER TWO, be careful about what you put into your data lake. As a general rule, derived data should not be persisted into the data lake environment. It can be re-derived from raw data, at little cost, when and if it’s needed. One exception involves data that is accessed only infrequently and which is costly to maintain but has value nonetheless.

This category includes cold or historical information that is stored in the data warehouse, data marts, operational data stores, and similar analytic systems.

In this context, the temperature of cold data isn’t determined by its age (i.e., by its time or date stamp) but instead by the frequency at which it is accessed and used by business consumers.

To the extent practicable, think about shifting cold data from online storage in source systems and into the data lake environment. The purpose is not to archive data for retrieval at some later point, if or when it should be needed, nor is it to develop an online queryable archive to supplement source analytic systems. Cold data does not equal an archive. The purpose is, rather, to eliminate redundant, unnecessary, and costly storage in ODSs, certain kinds of analytic sandboxes, databases, and the like. Look for opportunities to shift some cold data into the data lake and to use commodity data federation (also called data virtualization) technology to join it with data from source analytic systems.

Federation technology permits this to be done transparently, eliminating the need to point front-end tools to a new data source, to build a new BI presentation layer, etc. From the perspective of the consuming BI tool, a data source that’s been shifted to the data lake and exposed via federation looks and acts the same way it always did—nor is federation a black-box art. Most BI tools implement a federation layer of some kind, and a few RDBMS vendors bundle federation technology as part of their core database offerings; federation technology is likewise available from several data integration vendors. True, federation got something of a bad name in the early 2000s, but the technology itself, as realized in modern data virtualization offerings, has improved vastly since then.
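The offload step itself is unglamorous: select the cold slice from the warehouse and land it in the lake in a columnar format. A PySpark sketch might look like the following; the connection details, table, and cutoff date are hypothetical, and it assumes the appropriate JDBC driver is available to Spark.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cold-offload").getOrCreate()

# Hypothetical warehouse connection; the pushed-down subquery selects only the cold slice.
cold = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://warehouse-host:5432/dw")
        .option("dbtable",
                "(SELECT * FROM sales WHERE sale_date < DATE '2013-01-01') AS cold_sales")
        .option("user", "offload_user")
        .option("password", "change-me")
        .load())

# Land the cold slice in the lake as Parquet, partitioned by year, for later federated access.
(cold.withColumn("sale_year", F.year("sale_date"))
     .write.mode("overwrite")
     .partitionBy("sale_year")
     .parquet("hdfs:///lake/cold/sales/"))
```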

REAL-TIME INGEST AND EGRESS REQUIREMENTS

NUMBER NINE

The data lake environment provides a logical context in which to ingest, parse, process, and persist real-time streaming data. Real-time or fast-arriving data can be ingested into the data lake’s file system much more quickly than it can be loaded into a database. In the same way, some of the tools that are bundled with Hadoop also permit very high-speed data delivery to applications. Finally, and most important, there’s no shortage of options for ingesting and processing real-time or streaming data into the data lake environment.

This has as much to do with the lake’s de facto purpose (a cost-effective context in which to persist and manage data of all shapes and sizes) as with the diversity and economic friendliness of the OSS ecosystem. With respect to real-time and streaming applications, some of your OSS options include:

• Apache Apex: A YARN-native platform for stream and batch processing

• Apache Kafka: An event messaging system used for streaming ingest

• Apache Spark Streaming: The streaming component of the Apache Spark distributed cluster computing framework; Spark Streaming supports micro-batch loading

• Apache Storm: A streaming engine that supports micro-batch via its Trident API

• Apache HBase: A non-relational data store that’s a core component of Apache Hadoop

Some combination of these and other projects can address many common single-stream processing requirements, as well as applications that involve multiple, simultaneous streams carrying hundreds of thousands or millions of events per second. For example, Kafka is an ideal option for ingesting event data as well as for performing light, in-flight data processing.

Spark Streaming and Storm are compelling options for stream processing and analysis. Both platforms support microbatch processing (i.e., the batch loading and processing of messages at intervals of between 100 milliseconds and several minutes). Spark Streaming performs microbatching by default; Storm supports it via its Trident API. Microbatching permits higher throughput, albeit at the cost of greater latency. By default, Storm processes messages on a one-at-a-time basis. This is useful for applications that require extremely low (or near-real-time) latencies. A new OSS option is Apache Flink, which aims to be a unified platform for stream and batch processing, handling everything from ingest to processing to egress.
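To make the microbatch pattern concrete, the sketch below uses Spark Streaming’s Kafka direct stream (available in the Spark 1.x/2.x DStream API) to pull messages in five-second batches and append them to the lake as raw text. Broker, topic, and path names are hypothetical.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="lake-stream-ingest")
ssc = StreamingContext(sc, batchDuration=5)  # five-second micro-batches

# Hypothetical broker and topic names.
stream = KafkaUtils.createDirectStream(
    ssc, ["clickstream"], {"metadata.broker.list": "broker-1:9092"})

# Each micro-batch of raw Kafka messages is appended to the lake as text files.
stream.map(lambda kv: kv[1]) \
      .saveAsTextFiles("hdfs:///lake/raw/clickstream/batch")

ssc.start()
ssc.awaitTermination()
```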

Finally, HBase, which is part of the core Hadoop framework, is an excellent option for persisting streaming data at scale as well as for supporting sub-second query response times. At what scale? Published use cases cite throughput rates of several hundred thousand messages per second. Volume at this scale is common with telemetry data, such as that generated by oil rigs and jet aircraft, to name just two sources. Most commercial distributions claim to do even better than this. Some vendors have published case studies that claim to achieve buffered ingest rates of millions or tens of millions of data points per second. (At least one vendor says it can support ingest rates of up to 100 million data points per second – on a four-node cluster.) If you have a need for extreme-scale performance of this kind, critically interrogate the commercial Hadoop vendors as to the capabilities of their platforms.

In this scheme, HBase effectively functions as a kind of high-throughput OLTP data store, persisting streaming data in the context of a table abstraction. You can use the data lake to process stream data via an in-loop engine such as Storm and persist it in HBase, preserving history, too. Streaming data stored in HBase can be made available for use in the data warehouse (for historical analysis), for machine learning or predictive analysis, or—via Elasticsearch, Solr, or other offerings—in contextualized views or search results.
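Writing to and reading from HBase in this scheme is straightforward from application code. The Python sketch below uses the happybase client over HBase’s Thrift gateway and assumes a hypothetical “telemetry” table, with an “m” column family, already exists.

```python
import happybase  # thin HBase client that talks to the Thrift gateway

connection = happybase.Connection("hbase-thrift-host")  # hypothetical gateway host
table = connection.table("telemetry")  # assumes this table and its 'm' family exist

# Row key combines device id and timestamp so a device's recent readings sort together.
row_key = b"rig-0042|2015-10-01T12:00:00Z"
table.put(row_key, {b"m:pressure": b"1013.2", b"m:temp_c": b"21.7"})

# Fast point reads by row key are the payoff of the table abstraction.
reading = table.row(row_key)
print(reading[b"m:pressure"])
```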

COST CONTAINMENT VIA MULTI-TENANCY

NUMBER TEN

By functioning as a source for raw data and feeding downstream analytic practices, the data lake enables the equivalent of a big-data-as-a-service model.

This is why it is critical that you maintain fine-grained control over resource and workload management in your data lake cluster. Resource management describes how resources are allocated and prioritized in the Hadoop cluster; workload management addresses how resources are allocated and prioritized on a per-user basis. The more effective use you’re able to make of your data lake’s storage and compute resources, the more analytic customers you’ll be able to serve; the more robust your workload management feature set, the more effectively you’ll be able to serve these analytic customers. When you exercise fine-grained control over both, you can more efficiently provision data lake resources and realize proportionately high levels of multi-tenancy.

Multi-tenancy means that you can support multiple users and applications concurrently. It also means that data lake resources are efficiently utilized at all times, regardless of workload levels. This permits the data lake to host more users and at the same time offer better response times for jobs. If you don’t have fine-grained control over resources and workloads, you must either add unnecessary compute and storage capacity in the form of extra nodes or—what amounts to the same thing—spin up additional data lake clusters.

Unfortunately, data lakes are often underutilized. From a hardware standpoint, underutilization costs much more than it should and results in second-order power, cooling, maintenance, and support expenses. In point of fact, your data lake shouldn’t be costing you anything, at least on a net basis. Use it to realize cost savings by eliminating costly intermediate systems—such as the ODSs, (R)DBMSs, and object/document stores that are typically used to ingest and persist data—and relocating them to the data lake.

You can likewise shift data preparation workloads to the data lake environment, either in whole or in part, permitting you to reevaluate your investments in third-party ETL or data integration tools. In this regard, cost reduction isn’t just a function of reducing or eliminating spending on hardware and software. The data-lake model confers additional cost efficiencies in the form of consolidated (i.e., standardized, rationalized) administration and application development. Instead of working with a confusing diversity of different systems and administrative tools, administrators, developers, data scientists, and analysts can work with a single environment using a simplified array of tools.


This is only true, however, if you have fine-grained control over the resources in your data lake. Hadoop 2.0’s Yet Another Resource Negotiator (YARN) represents a significant improvement over prior versions of Hadoop, in which resource management was handled by separate JobTracker and TaskTracker daemons. Commercial distributions of Hadoop typically bundle enhanced resource management features via proprietary management consoles, dedicated (and usually proprietary) workload management tools, and, in some cases, open source alternatives to YARN itself.

One such alternative is an open source project called Myriad, which permits YARN to run in the context of—and to be managed by—the Mesos cluster management framework. Mesos offers fine-grained control over Hadoop, the operating system substrate on which it runs—usually Linux—and other systems or resources in a data center. Resource management in YARN, by contrast, is limited strictly to the Hadoop environment itself.

Multi-tenancy is the ideal state for the data lake. It ensures maximum use of the resources, results in better response times, and helps drive down the total cost of ownership.

ABOUT OUR SPONSORS

mapr.com

MapR delivers on the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. MapR brings unprecedented dependability, ease-of-use, and world-record speed to Hadoop, NoSQL, database and streaming applications in one unified Big Data platform. MapR is used by more than 700 customers across financial services, retail, media, healthcare, manufacturing, telecommunications and government organizations as well as by leading Fortune 100 and Web 2.0 companies. Amazon, Cisco, Google and HP are part of the broad MapR partner ecosystem. Investors include Google Capital, Lightspeed Venture Partners, Mayfield Fund, NEA, Qualcomm Ventures and Redpoint Ventures.

Connect with MapR on Facebook, LinkedIn and Twitter.


Teradata.com

Teradata helps companies get more value from data than any other company. Our big data analytic solutions, integrated marketing applications, and team of experts can help your company gain a sustainable competitive advantage with data. Teradata helps organizations leverage all their data so they can know more about their customers and business and do more of what’s really important. www.teradata.com


ABOUT THE AUTHOR

Stephen Swoyer has been a TDWI Contributor for almost 15 years. His work focuses on information consumption and analysis, data engineering, and open source software. He’s particularly interested in the all-too-human people and process problems that tend to complicate the work of would-be information consumers and analysts, data engineers, and open source contributors alike. Swoyer has co-written a book (forthcoming) on data warehouse automation. He lives in Portland, OR.

ABOUT TDWI RESEARCH

TDWI Research provides research and advice for business intelligence and data warehousing professionals worldwide. TDWI Research focuses exclusively on BI/DW issues and teams up with industry thought leaders and practitioners to deliver both broad and deep understanding of the business and technical challenges surrounding the deployment and use of business intelligence and data warehousing solutions. TDWI Research offers in-depth research reports, commentary, inquiry services, and topical conferences as well as strategic planning services to user and vendor organizations.

ABOUT THE TDWI CHECKLIST REPORT SERIES

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.