Re-think Data Integration: Delivering Agile BI Systems With Data Virtualization

A Technical Whitepaper

Rick F. van der Lans
Independent Business Intelligence Analyst
R20/Consultancy

March 2014

Sponsored by Red Hat, Inc.

Copyright © 2014 R20/Consultancy. All rights reserved. Red Hat, Inc., Red Hat, Red Hat Enterprise Linux, the Shadowman logo, and JBoss are trademarks of Red Hat, Inc., registered in the U.S. and other countries. Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries. Trademarks of companies referenced in this document are the sole property of their respective owners.


Table of Contents

1 Management Summary
2 The New Challenges for BI Systems
3 Current BI Systems and the New Challenges
4 On-Demand Integration with Data Virtualization
5 Under the Hood of Data Virtualization Servers
6 BI Application Areas for Data Virtualization
7 Data Virtualization and the New BI Challenges
8 Data Virtualization Simplifies Sharing of Integration Specifications
9 Overview of Red Hat's JBoss Data Virtualization Server
About the Author Rick F. van der Lans
About Red Hat, Inc.

1 Management Summary

It didn't happen overnight, but a new era for business intelligence (BI) has arrived. Gone are the days when BI systems presented yesterday's data, when internal data was sufficient to cover all the organization's BI needs, and when development of new reports took a few weeks or even months. Today, organizations rely on BI systems more than ever before. Having access to the right data, at the right time, and in the right form is crucial for decision-making processes in today's fast-moving world of business. BI has become a key instrument for organizations to distinguish themselves and to stay competitive. In this new era, BI systems have to change, because they are confronted with new technological developments and new business requirements. These are some of the key challenges that BI systems are facing:

- Productivity improvement: Because the speed of business continues to increase, the productivity of BI developers has to improve as well. BI development must keep up with the speed of business. Taking a few weeks to develop a report is no longer acceptable.

- Self-service BI: BI systems have to support self-service BI tools that allow users to develop and maintain their own reports.

- Operational intelligence: Users want to analyze operational data, not yesterday's data. This form of analytics is called operational intelligence or real-time analytics.

- Big data, Hadoop, and NoSQL: Undoubtedly, one of the biggest changes in the BI industry is driven by big data. BI systems must embrace big data together with the accompanying Hadoop and NoSQL data storage technologies. The challenge is to let users exploit big data for reporting and analytics as easily as the data stored in classic systems.

- Systems in the cloud: Organizations are migrating BI system components to the much-discussed cloud. BI systems must be able to embrace cloud technology and cloud solutions in a transparent way.

- Data in the cloud: Analytical capabilities can be extended by enriching internal data with external data. On the internet, countless sources of valuable external data are available, such as social media data and the numerous open data sources. The challenge for BI systems is to integrate all this valuable data in the cloud with internal enterprise data.

For many current BI systems it will be difficult to meet all these challenges. The main reason is that their internal architectures are made up of a chain of databases, consisting of staging areas, data warehouses, and data marts. Data is made available to users by copying it from one database to another, and with each copy process the shape and form of the data gets closer to what the users require.

This data supply chain has served many organizations well for many years, but is now becoming an obstacle. It was designed with a "built to last" mentality, whereas organizations are now asking for solutions that are built with a "designed for change" approach.

This whitepaper describes a lean form of on-demand data integration technology called data virtualization. Deploying data virtualization results in BI systems with simpler and more agile architectures that can meet the new challenges much more easily. All the key concepts of data virtualization are described, including logical tables, importing data sources, data security, caching, and query optimization. Examples are given of application areas of data virtualization for BI, such as virtual data marts, big data analytics, the extended data warehouse, and offloading cold data. The whitepaper ends with an overview of the first open source data virtualization server: Red Hat's JBoss Data Virtualization.

2 The New Challenges for BI Systems

BI systems are faced with new technological developments and new business requirements. The consequence is that BI systems have to change. This section lists some of the key challenges that BI systems face today.

Productivity Improvement – If IT wants to help organizations stay competitive and cost-effective, development of BI systems must follow the ever increasing speed of business. This was clearly shown in a study by the Aberdeen Group [1]: 43% of enterprises find that making timely decisions is becoming more difficult. Managers increasingly find they have less time to make decisions after certain business events occur. The consequence is that it must be possible to modify existing reports faster and to develop new ones more quickly. Unfortunately, whether it's due to the quality of the tools, the developers themselves, the continuously changing needs of users, or the inflexible architecture of most BI systems, many IT departments struggle with their BI productivity. BI backlogs are growing.

Self-Service BI – The approach used by many IT departments to develop BI reports is an iterative one. It usually starts with a user requesting a report. Next, a representative from the IT department analyzes the user's needs. This involves interviews and the study of data structures, various reports, and documents. In most cases, it also involves a detailed analysis process by the IT specialist, primarily to understand what the user is requesting and what all the terms mean. This process of "understanding" can be very time-consuming. Eventually, the IT specialist comes up with an initial version of the report, which is shown to the user for review. If it's not what the user wants, the specialist starts to work on a second version and presents that to the user. This process may involve a number of iterations, depending on how good the user is at specifying his needs and how good the analyst is at understanding the user and his needs.

[1] Aberdeen Group, Agile BI: Three Steps to Analytic Heaven, April 2011; see https://www.tableausoftware.com/sites/default/files/whitepapers/agile_bi.pdf

Finally, all this work leads to an implementation. It's obvious that this iterative process can be time-consuming. Understandably, many users have started to look for an alternative solution, and they have found self-service BI tools. These tools, with their intuitive interfaces, have been designed for users to develop their own reports. Users already understand their own needs and they know what they want. This means that many of the steps described above can be skipped, which improves productivity dramatically. But self-development can lead to chaos. Users are not professional developers, they have not been trained in developing re-usable solutions or in formal testing techniques, and they don't aim to develop shared metadata and integration specifications. Their only goal is to develop their report as quickly as possible. The challenge for BI systems is to manage this self-service development in such a way that the reports return correct and consistent results and that the wheel is not reinvented over and over again.

Operational Intelligence – There was a time when users of data warehouses were satisfied with reports containing one-week-old data. Today, users no longer accept such a data latency; they want a data latency of one minute, or maybe even a few seconds or less. Especially user groups such as operational management and external parties want insight into the most current situation; yesterday's data is worthless to them. This form of BI, in which users need a very low data latency, is called operational intelligence (sometimes called real-time analytics). If new data is entered in production systems and the reporting is done on data marts, the key technical challenge is to copy data from the production systems very rapidly, via the staging area and data warehouse, to the data marts. It must be clear to everyone that the longer the chain, the higher the data latency. And in many BI systems the chain is long.

Big Data, Hadoop, and NoSQL – Undoubtedly, one of the biggest trends in the BI industry is big data. Gartner [2] predicts that big data will drive $232 billion in spending through 2016, and Wikibon [3] claims that by 2017 big data revenue will have grown to $50.1 billion. Many organizations have adopted big data. There are those who are studying what big data could mean for them, many are in the process of developing big data systems, and some are already relying on these systems. And they all do it to enrich their analytical capabilities. Whether big data is structured, unstructured, multi-structured, or semi-structured, it's always a massive amount of data. To handle such large databases, many organizations have decided not to deploy familiar SQL systems but Hadoop systems or NoSQL systems, such as MongoDB and Cassandra. Hadoop and NoSQL systems are designed for big data workloads and are powerful and scalable, but they differ from SQL systems. First of all, they don't always support the popular SQL database language or the familiar relational concepts, such as tables, columns, and records. Second, many of them support their own API, database language, and set of database concepts. So, expertise in one product can't always easily be reused in another.

[2] Gartner, October 2012; see http://techcrunch.com/2012/10/17/big-data-to-drive-232-billion-in-it-spending-through-2016/
[3] Wikibon, Big Data Vendor Revenue and Market Forecast 2013-2017, February 12, 2014; see http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017

The challenge for BI will be to integrate all the big data stored in Hadoop and NoSQL systems with the data warehouse environment, so that users can use big data for reporting and analytics as easily as they can "small" data.

Systems in the Cloud – All the software components of a BI system, including the production systems, the staging area, the data warehouse, and the data marts, used to run on-premises. Nowadays, components may reside in the cloud. Moving a component to the cloud can have an impact on its performance, its technical interface, security aspects, and so on. Such changes can require redevelopment. The challenge is to adopt cloud solutions in a transparent way. For example, when a data mart is moved to the cloud, or when a cloud-based ERP system is introduced, this should be as transparent as possible.

Data in the Cloud – The data sources of a data warehouse used to be limited to internal production systems. Reporting and analytics on internal, enterprise data can lead to useful business insights, but the cloud contains massive amounts of valuable external data that enriches analytical capabilities and leads to more unexpected insights. For example, by integrating internal customer data with social media data, a more detailed and complete picture can be developed of what customers think about the products and the company. Nowadays, loads of external data are available in the cloud, of which social media data is the best known. But it's not only social media data. Thousands and thousands of open data sources have become available to the public. Examples of open data sources are weather data, demographic data, energy consumption data, hospital performance data, public transport data, and the list goes on. Almost all of these open data sources are available in the cloud through some API. The challenge for BI systems is to integrate all this valuable cloud-based data with internal enterprise data. Copying all this data may be too expensive, so smarter solutions must be developed.

3 Current BI Systems and the New Challenges

The challenges described in the previous section may be hard to meet for existing BI systems, due to their architecture. This section describes the classic BI architecture and summarizes why these challenges cannot easily be met.

Classic BI Systems – The architectures of most BI systems resemble a long chain of databases; see Figure 1. In such systems, data is entered using production applications and stored in production databases. Data is then copied via a staging area and an operational data store to a central data warehouse. Next, it's copied to data marts, in which the majority of all the reporting and analytics takes place.

[Figure 1 shows the chain: production applications → production databases → ETL → staging area → ETL → operational data store → ETL → data warehouse → ETL → data marts → analytics & reporting.]

Figure 1 Most BI systems consist of a chain of databases in which new data is entered in production systems and from there copied from one database to another.

ETL jobs are commonly used to copy data from one database to another. They are responsible for transforming, integrating, and cleansing the data. ETL jobs are scheduled to run periodically; in other words, data integration and transformation is executed as a batch process. The chain of databases and the ETL jobs that link them together form a factory that transforms raw data (in the production databases) into data for reporting purposes. This is very much like a real assembly line in which raw materials are processed, step by step, into end products. This chain is a data supply chain.

Classic BI Systems and the New Challenges – This data supply chain, with its batch-oriented style of copying, has served numerous organizations well for many years, but is now becoming an obstacle. It was designed with a "built to last" mentality. The consequence is that apparently simple report changes can lead to an eruption of changes to databases and ETL jobs, and thus consume a lot of development time. There are many steps where things can go wrong. Due to its ever-growing nature, the chain has been stretched to the limit and has become brittle. Today, organizations demand solutions that are built with a "designed for change" approach. But most worrisome is that it's difficult for these systems to meet the new challenges:

- Productivity Improvement: A key characteristic of ETL is that integration results can only be used after they have been stored in a database. Such a database has to be installed, designed, managed, kept up to date, and so on. All this costs manpower.

- Self-Service BI: Self-service BI or not, reports should return correct and consistent results and productivity should be high; otherwise, nothing is gained. This requires that specifications entered by self-service BI users be shareable. The need to reinvent the wheel should be minimal. Unfortunately, current BI systems don't have a "module" that makes it easy for users to share specifications.

- Operational Intelligence: As indicated in the previous section, new data entered in production systems is copied several times before it becomes available for reporting. This is far from ideal for users interested in analyzing zero-latency data. Somehow, the chain must be shortened; there should be fewer databases and fewer ETL jobs.

- Big Data, Hadoop, and NoSQL: Big data is sometimes too big to copy. A copying process may take too long, or the storage of duplicated big data may be too expensive. With respect to data transformation, data integration, and data cleansing, big data must be processed differently. It should not be pushed through the chain.

- Systems in the Cloud: When systems are moved to the cloud, it may be necessary to change the way in which data is extracted. For example, copying data can take longer when running in the cloud. It may be necessary to encrypt the data when it's transmitted over the Internet, which may not have been relevant before. The architecture of BI systems should be flexible enough that moving components into, out of, or within the cloud is transparent.

- Data in the Cloud: In principle, external data in the cloud can be processed in the same way as internal data: it can be extracted, integrated, transformed, and then copied through the chain of databases. However, it may be more convenient to run reports directly on these external data sources. Such a solution would be hard to fit into the existing architecture. For example, when a report needs data from a data mart and an external data source, how and where is that data integrated, transformed, and cleansed?

Summary – The classic BI architecture, consisting of a chain of databases and ETL jobs, has served us well for many years, but it's not evident how BI systems that stick to this architecture can meet the new challenges.

4 On-Demand Integration with Data Virtualization

This section describes a newer technology for data integration, called data virtualization, and how it offers a lean form of on-demand data integration. The following sections describe, respectively, how these products work, their application areas, and how they meet the requirements listed in Section 2.

Data Virtualization in a Nutshell – Data virtualization is a technology for integrating, transforming, and manipulating data coming from all kinds of data sources and presenting all that data as one unified view to all kinds of applications. Data virtualization provides an abstraction and encapsulation layer that, for applications, hides most of the technical aspects of how and where data is stored; see Figure 2. Because of that layer, applications don't need to know where all the data is physically stored, how the data should be integrated, where the database servers run, how to insert and update the data, what the required APIs are, which database language to use, and so on. When data virtualization is deployed, it feels to every application as if it is accessing one large database. For completeness' sake, here is the definition of data virtualization [4]:

Data virtualization is a technology that offers data consumers a unified, abstracted, and encapsulated view for querying and manipulating data stored in a heterogeneous set of data stores.

[4] Rick F. van der Lans, Data Virtualization for Business Intelligence Systems, Morgan Kaufmann, 2012.
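
To make this concrete, here is a minimal sketch, in generic SQL with hypothetical table and column names, of what on-demand integration looks like from the application's point of view: the application issues a single query against what appears to be one database, and the data virtualization server fetches and joins the underlying data, which may live in, say, a CRM database and a separate order system.

    -- One query against what the application perceives as a single database.
    -- customer and sales_order are logical tables; behind them the data may
    -- come from two different source systems (all names are hypothetical).
    SELECT   c.customer_name,
             SUM(o.order_amount) AS total_revenue
    FROM     customer    AS c
    JOIN     sales_order AS o
               ON o.customer_id = c.customer_id
    WHERE    o.order_date >= DATE '2013-01-01'
    GROUP BY c.customer_name;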

[Figure 2 shows a heterogeneous set of data sources (production databases, streaming databases, social media data, big data stores, unstructured data, external data, private data, and the data warehouse and data marts) connected to a Data Virtualization Server, which is accessed by applications such as a production application, a website, an ESB, a mobile app, an internal portal, a dashboard, and analytics & reporting tools.]

Figure 2 Data virtualization servers make a heterogeneous set of data sources look like one logical database to the applications.

Data Virtualization Offers On-Demand Integration – When an application asks a data virtualization server to integrate data from multiple sources, the integration is executed on demand. This is very different from ETL-based integration, where integration takes place before the application asks for it. In a typical ETL environment the retrieved data may have been integrated a week before. Not so with data virtualization, where integration is done live. Only when an application asks for data will the data virtualization server retrieve the required data from the source databases and integrate, transform, and cleanse it. Compare this to buying sandwiches. When a customer orders a sandwich in a restaurant, all the ingredients, such as the bread, the ham, the cheese, and the lettuce, are "integrated" in the kitchen right there and then. That's data virtualization! ETL is like buying a pre-packaged sandwich in a shop, where the "integration" of all the ingredients was done early in the morning or maybe even the evening before. Data virtualization is really on-demand data integration.

Transformations and Translations – Because data virtualization technology offers access to many different data source technologies, many different APIs and languages are supported. It must be possible to handle requests specified in, for example, SQL, XQuery, XPath, REST, SOAP, and JMS. Technically, this means that when an application prefers to use a SOAP/XML interface to access data while the data source supports JMS, the data virtualization server must be able to translate SOAP/XML into JMS.

Lean Data Integration – Integrating two data sources using ETL may require a lot of work. The integration logic has to be designed and implemented, a database to store the result of the integration process has to be set up, this database has to be tuned and optimized, it has to be managed during its entire operational life, the integration process must be scheduled, it has to be checked, and so on.

The key advantage of on-demand integration via data virtualization is lean data integration. With data virtualization, only the integration logic has to be designed and implemented. When that is done, applications can access the integrated data. There is no need to develop and manage extra databases to hold integration results. The benefits of this lean form of integration are speed of delivery and the ease with which an existing integration solution can be changed.

Not Limited to Read-Only – When a derived database, such as a data mart, is created to hold the result of ETL jobs, the data in that database is read-only. Technically, its content can be changed, but it wouldn't make sense, because the application would be updating derived data, not the source itself. With data virtualization the source is accessed directly, so when data is changed using a data virtualization server, it's the source data that's changed. With data virtualization, new data can be inserted, and existing data can be updated and deleted. Note that the source itself may not allow a data virtualization server to change data.

The Logical Data Warehouse – Users accessing data via a data virtualization server see one database consisting of many tables. The fact that they're accessing multiple data sources is completely transparent. They will regard the database they query as their data warehouse. But that data warehouse is no longer one physical database. It has become a logical concept and is therefore referred to as a logical data warehouse.

Data Virtualization Does Not Replace ETL – Sometimes data virtualization is unjustly considered a threat to ETL. Data virtualization does not replace ETL. Admittedly, some ETL work will be replaced by on-demand integration, but not everywhere. For example, in many (and probably most) organizations a physical data warehouse will still be necessary, for example to keep track of historical data, or because the production systems can't be accessed due to potential performance or stability problems. If that physical data warehouse is still needed, ETL can be the right integration technology to load it periodically. Data virtualization and ETL are complementary integration solutions, each with its own strengths and weaknesses. It's up to the architect to determine what the best solution is for a specific integration problem.

5 Under the Hood of Data Virtualization Servers

Most data virtualization servers support comparable concepts. This section describes these key concepts.

Importing Data Sources – Before data in source systems can be accessed via a data virtualization server, their specifications must be imported. This doesn't mean that the data is loaded into the data virtualization server, but that a full description of, for example, a physical SQL table is stored by the data virtualization server in its own repository. This description includes the structure (columns) of the physical table, the data types and lengths of each column, the physical location of the table, security details, and so on.

Some data virtualization servers even gather quantitative information on the tables, such as the number of records or the size in bytes. The result of an import is called a physical table; see Figure 3. A physical table can be seen as a wrapper on the data source.


Figure 3 Physical tables are used to wrap data sources.

Developing a physical table for a table in a SQL database involves a few simple clicks. When non-SQL sources are accessed, it may be a bit more difficult. For example, if data is retrieved from an Excel spreadsheet, it may be necessary to define the column names; when the source is a web service, a mandatory parameter may have to be specified; and when the source offers its data in a hierarchical structure, for example in XML or JSON format, the logic to "flatten" the data is required. But usually, the importing process is relatively straightforward.

The Logical Table – Applications can use physical tables right after they have been defined. What they see is of course the original data, the data as it is stored within the source system, including all the incorrect data, misspelled data, and so on. In addition, the data has not yet been integrated with other sources. In this case, the applications themselves are responsible for integration; see Figure 3. By defining logical tables (sometimes called virtual tables or logical data objects), transformation and integration specifications can be defined; see Figure 4. The same transformation and integration logic that normally ends up in ETL jobs ends up in the definitions of logical tables. To applications, logical tables look like ordinary tables; they have columns and records. The difference is that the contents are virtual. In that respect a logical table is like a view in a SQL database. Its virtual content is derived when the physical tables (the data sources) are accessed. The definition of a logical table consists of a structure and a content. The content is defined using a SQL query. Together, the structure and the query form the mapping of the logical table.
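
As a simple illustration, the sketch below shows what an imported (physical) table and a first logical table on top of it could look like when expressed in DDL. The CREATE FOREIGN TABLE form follows the Teiid-style convention of describing a wrapped source table by its metadata only; in many products this import is performed through a graphical wizard rather than typed in, and all names used here are hypothetical.

    -- Physical table: a wrapper that only describes a table living in the
    -- source system; no data is copied into the data virtualization server.
    CREATE FOREIGN TABLE crm_customer (
        customer_id   INTEGER,
        full_name     VARCHAR(100),
        country_code  CHAR(2)
    );

    -- Logical table: a view whose (virtual) content is derived on demand
    -- from the physical table when it is queried.
    CREATE VIEW customer AS
    SELECT customer_id,
           full_name    AS customer_name,
           country_code
    FROM   crm_customer;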



Figure 4 Logical tables are used to integrate, transform, and cleanse data.

The mapping defines how data from the physical tables has to be transformed and integrated. Developers have the full power of SQL at their disposal to define the virtual content. Every operation that can be specified in a SQL query can be used, including the following:

- Filters can be specified to select a subset of all the rows from the source table.
- Data from multiple physical tables can be joined together (integration).
- Columns of physical tables can be removed (projection).
- Values can be transformed by applying a long list of string manipulation functions.
- Columns of physical tables can be concatenated.
- The names of the columns in the source table, and the name of the table itself, can be changed.
- New virtual and derivable columns can be added.
- Group-by operations can be specified to aggregate data.
- Statistical functions can be applied.
- Incorrect data values can be cleansed.
- Rows can be sorted.
- Rank numbers can be assigned to rows.
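
The sketch below, in generic SQL with hypothetical source names, combines several of the operations listed above in one mapping: a join of two physical tables, a filter, renamed columns, a cleansing rule, a derived column, and an aggregation.

    CREATE VIEW customer_order_summary AS
    SELECT c.customer_id,
           TRIM(c.full_name)                       AS customer_name,     -- rename and cleanse
           CASE WHEN c.country_code IN ('UK', 'GB')
                THEN 'GB' ELSE c.country_code END  AS country_code,      -- correct inconsistent values
           COUNT(*)                                AS number_of_orders,  -- aggregation
           SUM(o.order_amount)                     AS total_amount,
           SUM(o.order_amount) / COUNT(*)          AS average_amount     -- derived column
    FROM   crm_customer AS c
    JOIN   erp_order    AS o                                             -- integration of two sources
             ON o.customer_id = c.customer_id
    WHERE  o.order_status <> 'CANCELLED'                                 -- filter
    GROUP BY c.customer_id,
             TRIM(c.full_name),
             CASE WHEN c.country_code IN ('UK', 'GB')
                  THEN 'GB' ELSE c.country_code END;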

Nesting of Logical Tables – Just as views in a SQL database can be nested (or stacked), so can logical tables. In other words, logical tables can be defined on top of other logical tables; see Figure 5. A logical table defined this way is sometimes referred to as a nested logical table. Logical tables can be nested indefinitely. The biggest benefit of being able to nest logical tables is that it allows common specifications to be shared. For example, in Figure 5, two nested logical tables, LT1 and LT2, are defined on a third, LT3. The advantage of this layered approach is that all the specifications inside the mapping of LT3 are shared by the other two. If LT3's common specifications are changed, they automatically apply to LT1 and LT2 as well. This can relate to cleansing, transformation, and integration specifications.

So, when an integration solution for two data sources has been defined, all other logical tables and applications can reuse it.

[Figure 5 shows nested logical tables LT1 and LT2 defined on logical table LT3, which in turn is defined on the physical tables.]

Figure 5 Logical tables can be nested so that common specifications can be shared.
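
A minimal sketch of the nesting in Figure 5, again in generic SQL with hypothetical names: LT3 holds the shared cleansing and integration logic, and LT1 and LT2 are defined on top of it, so a change to LT3 automatically applies to both.

    -- LT3: the shared integration and cleansing logic, defined once.
    CREATE VIEW lt3_customer_orders AS
    SELECT c.customer_id,
           TRIM(c.full_name) AS customer_name,
           c.country_code,
           o.order_id,
           o.order_amount,
           o.order_date
    FROM   crm_customer AS c
    JOIN   erp_order    AS o ON o.customer_id = c.customer_id
    WHERE  o.order_amount IS NOT NULL;

    -- LT1: a report-oriented logical table defined on LT3.
    CREATE VIEW lt1_revenue_per_country AS
    SELECT country_code, SUM(order_amount) AS revenue
    FROM   lt3_customer_orders
    GROUP BY country_code;

    -- LT2: another logical table that reuses the same shared specifications.
    CREATE VIEW lt2_recent_orders AS
    SELECT customer_name, order_id, order_amount
    FROM   lt3_customer_orders
    WHERE  order_date >= DATE '2014-01-01';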

Publishing Logical Tables – When logical tables have been defined, they need to be published. Publishing means that the logical tables are made available to applications through one or more languages and APIs. For example, one application wants to access a logical table using the language SQL through the API JDBC, whereas another prefers to access the same logical table as a web service using SOAP over HTTP. Most data virtualization servers support a wide range of interfaces and languages. Here is a list of some of the more popular ones:

- SQL with ODBC
- SQL with JDBC
- SQL with ADO.NET
- SOAP/XML via HTTP
- REST (Representational State Transfer) with JSON (JavaScript Object Notation)
- REST with XML
- REST with HTML

Note that when multiple technical interfaces are defined on one logical table, the mapping definitions are reused. So, if a mapping is changed, all the applications, regardless of the technical interface they use, will see the difference.

Data Security Via Logical Tables – Some source systems have their own data security system in place. They support their own features to protect against incorrect or illegal use of the data. When a data virtualization server accesses such sources, these security rules still apply, because the data virtualization server is treated as a user of that data source. But not all data sources have a data security layer in place. In that case, data security can be implemented in the data virtualization server. For each table, privileges can be granted in much the same way as access to tables in a SQL database is granted. Privileges such as select, insert, update, or delete can be granted. Note that data security can also be implemented when the data source supports its own data security mechanism. Some data virtualization servers allow access privileges to be granted at the table level, at the individual column level, at the record level, and even at the individual value level. The last option means that two users can have access to one particular record, where one user sees all the values and the other sees a value masked.
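
A sketch of how such privileges could be expressed, using generic SQL-style GRANT statements and views; whether this is done in SQL or in an administration console, and the exact syntax, differs per data virtualization server. The role, table, and column names are hypothetical, and the customer logical table is assumed here to also expose a credit_limit column.

    -- Table-level privilege: the reporting role may only read the logical table.
    GRANT SELECT ON customer_order_summary TO reporting;

    -- Column-level restriction: publish a view that leaves out a sensitive
    -- column and grant access to that view instead.
    CREATE VIEW customer_public AS
    SELECT customer_id, customer_name, country_code   -- credit_limit is left out
    FROM   customer;
    GRANT SELECT ON customer_public TO call_center;

    -- Value-level masking: two users can read the same record,
    -- but only one of them sees the sensitive value.
    CREATE VIEW customer_masked AS
    SELECT customer_id,
           customer_name,
           CASE WHEN CURRENT_USER = 'credit_officer'
                THEN credit_limit
                ELSE NULL
           END AS credit_limit
    FROM   customer;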

Caching of Logical Tables – As indicated, data virtualization servers support on-demand data integration. Because doing the integration live is not always preferred, they also support caching. For each logical table a cache can be defined. The effect is that the virtual content is materialized: the content is determined by running the query, and the result is stored in a cache. From then on, when an application accesses a cached logical table, the data source is not accessed; instead, the answer is determined by retrieving data from the cache. The reasons for using caching are diverse:

- Query performance
- Load optimization
- Consistent reporting
- Source availability
- Complex transformations

Regardless of whether caches are kept in memory, in a file, or in a database, they are managed by the data virtualization server itself. For each cached logical table a refresh schedule must be defined.

Query Optimization – When accessing data sources, performance is crucial. Therefore, it's important that data virtualization servers know how to access the sources as efficiently as possible; they must support an intelligent query optimizer. One of the most important query optimization features is called pushdown. With pushdown, the data virtualization server tries to push as much of the query processing as possible to the data sources themselves. So, when a query is received, the server analyzes it and determines whether it can push the entire query to the data source or whether only parts can be pushed down. In the former case, the result coming back from the source needs no extra processing by the data virtualization server and can be passed straight on to the application. In the latter case, the data virtualization server must do some extra processing before the result received from the source can be returned to the application. Pushdown minimizes the amount of data transmitted back to the data virtualization server, lets the database server do as little I/O as possible, and leaves as little processing as possible for the data virtualization server itself. All this improves query performance.
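
The following sketch illustrates pushdown with a generic example: the query received from the application against a logical table, and the statement the data virtualization server could send to the source database so that filtering and aggregation happen where the data lives. The actual rewrite depends on the optimizer and on what the source system supports; the table names are the hypothetical ones used earlier.

    -- Query received from the application (against the customer logical table):
    SELECT country_code, COUNT(*) AS number_of_customers
    FROM   customer
    WHERE  country_code = 'NL'
    GROUP BY country_code;

    -- Query the data virtualization server could push down to the source
    -- database behind that logical table, so that only one aggregated row
    -- is transmitted back:
    SELECT country_code, COUNT(*) AS number_of_customers
    FROM   crm_customer
    WHERE  country_code = 'NL'
    GROUP BY country_code;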

6 BI Application Areas for Data Virtualization

Currently, data virtualization is used in many different ways in BI systems. This section describes some of the more popular application areas.

Virtual Data Mart – A data mart can be developed for many different reasons. One is to organize the tables and columns in such a way that it becomes easy for reporting tools and users to understand and query the data. The data marts are thus designed for a specific set of reports and users. In classic BI systems, data marts are physical databases. The drawback of developing a physical data mart is, first, that the data mart database has to be designed, developed, optimized, and managed, and second, that ETL processes have to be designed, developed, optimized, managed, and scheduled to load the data mart. With data virtualization, a data mart can be simulated using logical tables. The difference is that the tables the users see are logical tables; they are not physically stored. Their content is derived on demand when the logical tables are queried. Hence the name virtual data mart. Users won't see the difference. The big advantage of virtual data marts is agility: virtual data marts can be developed and adapted more quickly.

Extended Data Warehouse – Not all data needed for analysis is available in the data warehouse. Non-traditional data sources, such as external data sources, call center log files, weblog files, voice transcripts from customer calls, and personal spreadsheets, are often not included. This is unfortunate, because including them can definitely enhance the analytical and reporting capabilities. Enhancing a data warehouse with some of these data sources can be a laborious and time-consuming effort. For some data sources, it can take months before the data is incorporated in the chain of databases. In the meantime, business users can in no way get an integrated view of all that data, let alone invoke advanced forms of analysis on it. With data virtualization servers, these sources together with the data warehouse look like one integrated database. This concept is called an extended data warehouse. In a way, it feels as if data virtualization was designed for this purpose. In the literature, this concept is comparable to the logical data warehouse and the data delivery platform [7].

Big Data Analytics – More and more organizations have big data stored in Hadoop and NoSQL systems. Unfortunately, most reporting and analytical tools aren't able to access those database servers, because most of them require a SQL or comparable interface. There are two ways to solve this problem. First, relevant big data can be copied to a SQL database. However, in situations in which a Hadoop or NoSQL solution is selected, the amount of data is probably massive. Copying all that data can be time-consuming, and storing all that data twice can be costly. The second solution is to put a data virtualization server on top of the Hadoop or NoSQL system, wrap it as a physical table, and publish it with a SQL interface. The responsibility of the data virtualization server is to translate the incoming SQL statements into the API or language of the big data system.

Because the interfaces of the NoSQL systems are proprietary, data virtualization servers must support dedicated wrapper technology for each of them.

Operational Data Warehouse – An operational data warehouse is normally described as a data warehouse that holds not only historical data, but also operational data. It allows users to run reports on data that was entered a few seconds ago. Implementing an operational data warehouse by copying new production data to the data warehouse quickly can be a technological challenge. By deploying a data virtualization server, an operational data warehouse can be simulated without the need to copy the data. Data virtualization servers can be connected to all types of databases, including production databases. So, if reports need access to operational data, logical tables can be defined that point to tables in a production database. This allows an application to elegantly join operational data (in the production database) with historical data (in the data warehouse); a sketch is shown at the end of this section. This makes it possible to offer an operational data warehouse without having to actually build a data warehouse that stores operational data itself. Normally, to minimize interference on production systems, ETL jobs are scheduled during so-called batch windows. However, many production systems operate 24x7, which removes the batch window. The workload generated by data virtualization servers is more in line with this 24x7 constraint, because the query workload is spread out over the day. In addition, they support various features, such as caching and pushdown optimization, to access data sources as efficiently as possible.

Offloading Cold Data – Data stored in a data warehouse can be classified as cold, warm, or hot. Hot data is used almost every day, and cold data only occasionally. Keeping cold data in a data warehouse slows down the majority of the queries and is expensive, because all the data is stored in an expensive data storage system. If the data warehouse is implemented with a SQL database, it may be useful to store cold data outside that database, for example in a Hadoop system. This solution saves storage costs, but more importantly, it can speed up queries on the hot and warm data in the warehouse (less data), and it can handle larger data volumes. A data virtualization server is also very useful when data from the SQL part of the data warehouse must be moved to the Hadoop system: it makes copying the data straightforward, because it comes down to a simple copy of the contents of one SQL table to another. And, with a data virtualization server on top of the Hadoop files, reports can still access the cold data easily.

Cloud Transparency – As indicated in Section 2, components of a data warehouse architecture are being moved to the cloud. Such a migration can lead to changes in how data is accessed. A data virtualization server can hide this migration. By making sure that all the data is always accessed via a data virtualization server, the latter can hide where the data is physically stored. If a database is moved to the cloud, moved back on-premises, or migrated from one cloud to another, the data virtualization server can hide the technical differences, thus making such a migration painless and transparent for the reports and users. In this way, a data virtualization server implements cloud transparency.
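
The sketch referred to above, in generic SQL with hypothetical names: one logical table combines the historical rows already loaded into the physical data warehouse with today's rows read live from the production database, so a report that queries it sees both without any extra copying. It assumes the warehouse already contains everything up to yesterday.

    CREATE VIEW all_orders AS
    -- Historical orders, loaded into the data warehouse by the existing ETL jobs.
    SELECT order_id, customer_id, order_amount, order_date
    FROM   dwh_order_history
    UNION ALL
    -- Operational orders, read on demand from the production database.
    SELECT order_id, customer_id, order_amount, order_date
    FROM   prod_order
    WHERE  order_date >= CURRENT_DATE;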

Summary – Data virtualization technology can be used for a long list of application areas. This section describes a few, but more can be identified, such as sandboxing for data scientists, data services, and prototyping of integration solutions.

7 Data Virtualization and the New BI Challenges

This section summarizes how data virtualization can meet the new BI challenges listed in Section 2:

- Productivity Improvement: Because data virtualization supports on-demand data integration, there is less need to develop derived databases, such as data marts. This clearly shortens the chain of databases and results in systems that are quicker to develop and easier to maintain.

- Self-Service BI: When virtual data marts have been developed, change requests are easy for the IT department to implement. It's just a matter of changing the definitions of the logical tables. No physical data marts have to be unloaded and reloaded, and no ETL jobs have to be changed. More on this in Section 8.

- Operational Intelligence: As the previous section describes, the operational data warehouse is an application area for data virtualization. Users can be given access to operational data (through logical tables) without the need to copy that data. In addition, they can integrate the operational data with historical data stored in a physical data warehouse.

- Big Data, Hadoop, and NoSQL: Most data virtualization servers support direct access to Hadoop and NoSQL systems. This implies that data can stay where it is and can still be analyzed with reporting and analytical tools using SQL. Big data does not have to be pushed through the chain.

- Systems in the Cloud: Data virtualization can hide cloud technology. It can hide the location where systems are running. Even migrating systems from one cloud to another can be done transparently. Data virtualization makes the cloud transparent.

- Data in the Cloud: If an external data source has a well-defined API, data virtualization servers can access it. With a data virtualization server, reports can run directly on these external data sources, and external data can be integrated, transformed, and cleansed in the same way as internal enterprise data.

8 Data Virtualization Simplifies Sharing of Integration Specifications

The Dangers of Not Sharing Specifications – Rarely do all the users of an organization use the same reporting tool. Usually a wide range of tools is in use. Unfortunately, tools don't share integration, transformation, or cleansing specifications; see Figure 6. A solution developed for one tool cannot be reused by another tool. (Note that in numerous BI systems, even when users are using the same tool, specifications are not shared either.) As a consequence, the same solution has to be implemented in many different tools, leading to a replication of all integration, transformation, and cleansing specifications.

[Figure 6 shows BI tool 1, BI tool 2, and BI tool 3, each with its own integration solution and its own repository, each accessing sources 1 through 6 separately.]

Figure 6 BI tools don't share integration, transformation, and cleansing specifications; each has its own repository.

For example, a user can define the concept of "a good customer" based on the total number of orders a customer has placed, the average value of those orders, the number of products returned, the age of the orders, and so on. He can enter filters and formulas to distinguish the good customers from the bad ones. If another user needs a similar concept but uses another tool, he has to define it himself in his own tool. The consequence is that the wheel is reinvented over and over again. It must be clear that not sharing specifications lowers the agility of a BI system, reduces the productivity of BI developers, and lowers the correctness and consistency of report results.

Self-Service BI Tools Have No Central Repository – A drawback of many self-service BI tools is that there is no real central repository where specifications are stored and can be shared by users and reports. For example, if two users want to integrate the same two data sources, they each have to define their own solution. So, despite the fact that these users are working with the same tool, there is only limited sharing of integration and transformation specifications. They are reinventing the wheel over and over again. In addition, integrating data sources is not always easy. Some data sources have highly complex structures that require in-depth knowledge of how they organize data. Some of that logic may even be hidden deep in the data structures and the data itself. So the question is whether, in such a situation, the correct form of integration is implemented. Integrating systems is not always as easy as drag and drop. The complexity of integration should never be trivialized.

Data Virtualization to the Rescue – By deploying data virtualization, many specifications can be implemented centrally and can be shared. They can even be shared by different tools from different vendors; see Figure 7. The definition of what a good customer is has to be entered only once in the data virtualization server and can then be shared by all users. It can even be shared across all BI tools in use. It must be clear that this sharing of specifications raises the agility of a BI system. If the definition of a good customer changes, it only has to be changed in one spot.

It also increases the productivity of BI developers and improves the correctness and consistency of report results. In addition, if the integration of some data sources is complex, it can be implemented by IT specialists (using logical tables) for all BI users. So, no reinvention of the wheel, but sharing and reusing of specifications.

[Figure 7 shows BI tool 1, BI tool 2, and BI tool 3 all accessing sources 1 through 6 through one data virtualization layer with shared specifications stored in a central repository.]

Figure 7 With data virtualization, specifications are stored in a central repository and can be shared.
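
As a sketch of such a shared specification, the "good customer" concept from the example above could be captured once as a logical table in the data virtualization server; the thresholds, table names, and the returned-products flag below are hypothetical. Every report and every BI tool that queries good_customer then applies exactly the same definition.

    CREATE VIEW good_customer AS
    SELECT c.customer_id,
           c.customer_name
    FROM   customer  AS c
    JOIN   erp_order AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.customer_name
    HAVING COUNT(*)            >= 10      -- total number of orders placed
       AND AVG(o.order_amount) >= 500     -- average order value
       AND SUM(CASE WHEN o.returned = 'Y' THEN 1 ELSE 0 END) <= 2;   -- products returned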

With respect to self-service BI tools, if users prefer to define integration specifications themselves, they can still do so by accessing the low-level logical tables. Instead of defining concepts, such as good customer, in their own tool, they can create reusable specifications in the data virtualization server. They will experience the same level of ease of use and flexibility they are used to from their self-service tools. Changing specifications in the data virtualization server is as easy as changing comparable specifications in a self-service BI tool.

9 Overview of Red Hat's JBoss Data Virtualization Server

History of Red Hat JBoss Data Virtualization – Red Hat's data virtualization product, JBoss Data Virtualization (JDV), is not brand new, but a mature product. It started its life as a closed source product called MetaMatrix. The vendor was founded in 1998 as Quadrian and was later renamed MetaMatrix. It released its first data virtualization product in 1999. Before the product received much attention in the market and before data virtualization became popular, the company was acquired by Red Hat in June 2007. Red Hat took a few years to transform the closed source product into an open source one. Initially, Red Hat's product was released under the name JBoss Enterprise Data Services Platform. This name was changed at the end of 2013. Currently, two versions of the product are available: Teiid [5] is the community edition, and JBoss Data Virtualization is the enterprise edition. Noteworthy is that Red Hat's product is the only open source data virtualization server currently available. This section focuses on JDV only.

Collaborative and Rapid Development – Each data virtualization product offers on-demand data viewing: when a logical table is created, users can study its (virtual) contents right away.

[5] Teiid is a type of lizard. In addition, the name contains the acronym EII, which stands for Enterprise Information Integration. The term EII can be seen as a forerunner for data virtualization.

JDV also supports on-demand data visualization through dashboarding. With this feature, the virtual contents of logical tables can be visualized as bar charts, pie charts, and so on, right after the table has been created; see Figure 8.

Figure 8 A screenshot of JBoss Data Virtualization showing data visualization through dashboards.

This on-demand data visualization feature allows for collaborative development. Analysts and business users can sit together and work on the definitions of logical tables. The analysts work on the definitions and the models (which may be too technical for some users), and the users see an intuitive visualization of the data—their data. Because of this collaborative development, less costly development time is lost on incorrect implementations. It leads to rapid development.

Design Environment – Logical tables can be defined using a graphical and easy-to-use design environment. Figure 9 contains a screenshot showing five logical tables and their relationships. The familiar JBoss Developer Studio is available as an add-on and can be used as the design and development environment.

Lineage and Impact Analysis – JDV stores all the definitions of concepts, such as data sources, logical tables, and physical tables, in one central repository. This makes it easy for JDV to show all the dependencies between these objects. This helps, for example, to determine what the effect will be if the structure of a source table or logical table changes: which other logical tables must be changed as well? In other words, JDV offers lineage and impact analysis.

Accessing Data Sources – JDV can access a long list of source systems, including the most well-known SQL database servers (Oracle, DB2, SQL Server, MySQL, and PostgreSQL), enterprise data warehouse platforms (Teradata, Netezza, and EMC/Greenplum), office tools (Excel, Access, and Google Spreadsheets), applications (SAP and Salesforce.com), flat files, XML files, SOAP and REST web services, and OData services.

Figure 9 A screenshot of JBoss Data Virtualization showing the relationships between tables.

Accessing Big Data, Hadoop, and NoSQL – For accessing big data stored in Hadoop or NoSQL systems, JDV comes with interfaces for HDFS and MongoDB. With respect to Hadoop, JDV's implementation works via Hive: JDV sends the SQL query coming from the application to Hive, and Hive translates that query into MapReduce code, which is then executed in parallel on HDFS files. In the case of MongoDB, hierarchical data is flattened into a relational table.

User-Defined Functions – For logic too complex for SQL, developers can build their own functions. Examples are complex statistical functions or functions that turn a complex value into a set of simple values. These user-defined functions can be developed in Java and can be invoked from SQL statements. Invoking a UDF is comparable to invoking a built-in SQL function.

Two Languages for Developing Logical Tables – Logical tables can be defined using SQL queries or stored procedures. The SQL dialect supported is rich enough to specify most of the required transformations, aggregations, and integrations. Stored procedures may be needed, for example, when non-SQL source systems are accessed that require complex transformations to turn non-relational data into more relational structures.

Query Optimizer – To improve query performance, JDV's query optimizer supports various techniques to push down most or all of the query processing to the data sources. It supports several join processing strategies, such as merge joins and distributed joins. The processing strategy (or query plan) selected by the optimizer can be studied by the developers; the optimizer is not a black box. This openness of the optimizer is very useful for tuning and optimizing queries.

Caching – JDV supports two forms of caches: internal materialized caches and external materialized caches. With internal materialized caches, the cache is kept in memory. The advantage is fast access to the data; the disadvantage is that memory is limited, so not all data can be cached. With external materialized caches, the data is stored in a SQL database. In this case, there is no restriction on the size of the cached data, but access will be somewhat slower than with the in-memory alternative.

Publishing Logical Tables – JDV supports a long list of APIs through which logical tables can be accessed, including JDBC, ODBC, SOAP, REST, and OData.

Data Security – JDV supports the four forms of data access security described in Section 5. Privileges such as select, insert, update, and delete can be granted at the table level, the individual column level, the record level, and even the individual value level. This makes it possible to let JDV operate as a data security firewall. In addition, when logical tables are published using particular APIs, security aspects can be defined as well. For example, developers can publish a logical table using SOAP with WS-Security extensions.

Embeddable – All the functionality of JDV can be invoked through an open API. This means that applications can invoke the functionality of JDV, which makes JDV an embeddable data virtualization server. Vendors and organizations can use this API to develop embeddable solutions.

Summary – JBoss Data Virtualization is a mature data virtualization server that allows organizations to develop BI systems with more agile architectures. Its on-demand data integration capabilities make it ready for many application areas:

- Virtual data marts
- Extended data warehouse
- Big data analytics
- Operational data warehouse
- Offloading cold data
- Cloud transparency

JDV allows the development of agile BI systems that are ready for the new challenges BI systems face.

About the Author Rick F. van der Lans

Rick F. van der Lans is an independent analyst, consultant, author, and lecturer specializing in data warehousing, business intelligence, database technology, and data virtualization. He works for R20/Consultancy (www.r20.nl), a consultancy company he founded in 1987. Rick is chairman of the annual European Enterprise Data and Business Intelligence Conference (organized in London). He writes for the eminent B-eye-Network.com [6] and other websites. In 2009, in a number of articles [7] published at BeyeNetwork.com, he introduced the business intelligence architecture called the Data Delivery Platform, which is based on data virtualization. He has written several books. His latest book [8], Data Virtualization for Business Intelligence Systems, was published in 2012. Published in 1987, his popular Introduction to SQL [9] was the first English book on the market devoted entirely to SQL. After more than twenty-five years, this book is still being sold, and it has been translated into several languages, including Chinese, German, Italian, and Dutch. For more information please visit www.r20.nl, or email to [email protected]. You can also get in touch with him via LinkedIn and via Twitter @Rick_vanderlans.

About Red Hat, Inc.

Red Hat is the world's leading provider of open source solutions, using a community-powered approach to provide reliable and high-performing cloud, virtualization, storage, Linux, and middleware technologies. Red Hat also offers award-winning support, training, and consulting services. Red Hat is an S&P 500 company with more than 70 offices spanning the globe, empowering its customers' businesses.

[6] See http://www.b-eye-network.com/channels/5087/articles/
[7] See http://www.b-eye-network.com/channels/5087/view/12495
[8] R.F. van der Lans, Data Virtualization for Business Intelligence Systems, Morgan Kaufmann Publishers, 2012.
[9] R.F. van der Lans, Introduction to SQL: Mastering the Relational Database Language, fourth edition, Addison-Wesley, 2007.