BrightHouse: An Analytic Data Warehouse

© Infobright Inc. 2007. All Rights Reserved.

White Paper, September 17, 2007

MySQL: www.mysql.com
Infobright: http://www.infobright.com


1.0 Introduction

Infobright BrightHouse is a solution for Analytic Data Warehousing that delivers high performance for complex analytic queries across vast amounts of data. BrightHouse delivers the following key features:

• High query performance for analysis across terabytes of data.

• Average data compression of 10:1 (10TB of raw data can be stored in 1TB).

• Low administration requirements.

• Runs on low-cost, commodity hardware.

• Compatibility with all major BI tools, including Cognos, Business Objects, Pentaho, and others.

This paper introduces the BrightHouse solution, discusses typical use cases, and gives a technical overview of the key features, including compression and query performance.

2.0 Use Cases

BrightHouse was designed to solve two specific problems associated with traditional data warehouses. The first is poor query performance for analytic queries, particularly in a mixed-workload environment. The second is rapidly expanding data within the data warehouse and the corresponding increase in costs for hardware, software, and administration overhead.

2.1 Use Case 1 - Complementing an Existing Data Warehouse

In this case there is an existing large data warehouse (generally over a terabyte in size) being used in a mixed-workload environment, where the workload consists of a large number of canned reports, a large number of users doing simple queries, and a smaller number of users doing more complex or ad-hoc analysis. These more complex queries run slowly and drag down the overall performance of the system, forcing administrators either to do additional work to tune the database or to run the queries in a batch window (making users wait for the results).

In this use case, BrightHouse is implemented alongside the existing warehouse. A snapshot of the data is loaded into BrightHouse, and business analysts doing more complex queries now query BrightHouse instead of the original data warehouse.



With BrightHouse, users doing complex analysis have freer access to data and can get results much faster.

Also, since the overhead of these queries has been removed from the original data warehouse, all users see a performance boost, and DBAs no longer have the headache of tuning for changing or occasionally run analytic queries.

2.2 Use Case 2 - A Standalone Data Warehouse for Analytics

This case generally applies to on-line businesses that collect a very large amount of customer interaction or click-stream data and want to do analysis across that data. These businesses need high performance, but the extremely complex nature of the queries and the massive size of the data make that difficult with traditional database systems.

In this use case there is often no pre-existing data warehouse, and BrightHouse is implemented as a standalone Analytic Data Warehouse. Data is batch loaded directly from the production systems into BrightHouse.

BrightHouse allows on-line businesses to do very complex analysis on terabytes of data with a high level of performance. This allows for faster, more informed decision making.


2.3 Use Case 3 - An Analytic Data Warehouse and Archive Combined 

In this use case there is data that needs to be archived off of the data warehouse but still needs to be occasionally queried. Traditional data archive solutions require that the data be reloaded into the data warehouse, which is time consuming, expensive, and can be technically difficult if there have been schema changes since the data was archived. The implementation of BrightHouse in this case is similar to Use Case 1, where a snapshot of the data is moved into BrightHouse, but here BrightHouse replaces a traditional data archive. When the data needs to be queried, it is queried directly from BrightHouse rather than having to be reloaded into the warehouse.

The advantages of using BrightHouse over a traditional data archive are that the data is easily accessible without the time and costs associated with re-loading it into the data warehouse, and analysts can now use this formerly archived data for other business purposes.

3.0 BrightHouse Architecture

BrightHouse at its core is a highly compressed, column-oriented datastore that incorporates MySQL technology. The following sections provide an overview of the major internal structures and how they interact.

3.1 How BrightHouse integrates with MySQL

BrightHouse leverages MySQL's pluggable storage engine architecture and bundles MySQL Version 5.1. The MySQL connectors (C, JDBC, ODBC, .NET, Perl, etc.) are used in BrightHouse. The MySQL management services and utilities are used as the technology around connection pooling. As with other MySQL storage engines, MyISAM is used to store catalogue information such as table definitions, views, users, permissions, etc.


BrightHouse uses its own load and unload utilities rather than MySQL's, mainly because compression and decompression are done on load and unload, and the BrightHouse Knowledge Grid is created on load (see next section). BrightHouse also uses its own optimizer instead of the MySQL optimizer. The BrightHouse optimizer knows how to use the information stored in the Knowledge Grid to optimize the execution of queries against BrightHouse (see section 4.2).

3.2 BrightHouse: key internal structures

Not including the MySQL-provided technology, BrightHouse consists of four key layers:

• BrightHouse is a column-oriented data store. This means that instead of the data being stored row by row, it is stored column by column. There are many advantages to column orientation, not least the ability to do more efficient data compression: because each column stores a single data type (as opposed to rows, which typically contain several data types), compression can be optimized for each particular data type. The data within each column is stored in groupings of 65,536 items. We refer to each of these groupings as a Data Pack. The use of Data Packs improves data compression and is also critical to how Infobright resolves complex queries.


• Data Pack Nodes (DPNs) contain a set of statistics about the data that is stored and compressed in each of the Data Packs. There is always a one-to-one relationship between Data Packs and DPNs.

• Knowledge Nodes are a further set of metadata related to Data Packs, columns, or table combinations. The set of these Knowledge Nodes taken together is called the Knowledge Grid.

• The BrightHouse Optimizer uses the Knowledge Grid to determine the minimum set of Data Packs which need to be decompressed in order to satisfy a given query. In some cases, the information contained in the Knowledge Grid is sufficient to resolve the query, in which case nothing is decompressed.
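To make the layering concrete, the following is a minimal sketch (illustrative only, not Infobright's actual implementation) of splitting a column into 65,536-value Data Packs and computing one DPN per Pack:

```python
# Hypothetical sketch: group a column into Data Packs of up to 65,536
# values and compute the per-pack statistics (min/max/sum/non-null count)
# that a Data Pack Node holds, as described in the text.
PACK_SIZE = 65_536

def make_packs(column):
    """Split a column's values into Data Packs of up to PACK_SIZE items."""
    return [column[i:i + PACK_SIZE] for i in range(0, len(column), PACK_SIZE)]

def make_dpn(pack):
    """Compute the statistics stored in the pack's Data Pack Node."""
    non_null = [v for v in pack if v is not None]
    return {
        "min": min(non_null),
        "max": max(non_null),
        "sum": sum(non_null),
        "non_null_count": len(non_null),
    }

# A 300,000-row column yields 5 packs, matching the example in section 4.2.1.
column_a = list(range(300_000))
packs = make_packs(column_a)
print(len(packs))                 # 5
print(make_dpn(packs[0])["max"])  # 65535
```

Note that the one-to-one pairing of Packs and DPNs falls out naturally: each call to `make_dpn` produces exactly one node for one pack.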

4.0 Key Features

BrightHouse has several unique features that make it particularly well suited to Analytic Data Warehousing. First, BrightHouse delivers market-leading data compression, which drastically reduces I/O and thus improves query performance. Second, the Knowledge Grid delivers a unique approach to the resolution of analytic queries.

4.1 Compression

Managing large volumes of data continues to be a problem. Even though the cost of data storage has declined, data continues to grow at an exponential rate. At the same time, disk transfer rates have not increased significantly, and more disks create additional problems, such as expanding back-up times, more management overhead, and slower access due to increasing search times.

The reality is that data compression can significantly reduce the problems of rapidly expanding data. By compressing the data and bringing it closer together, the data volume is reduced, lowering disk access time.

BrightHouse has industry-leading compression. Unlike traditional row-based data warehouses, BrightHouse stores data by column, allowing compression algorithms to be finely tuned to the column data type. Moreover, for each column, the data is split into Packs, each storing up to 65,536 values of a given column. BrightHouse then applies a set of patent-pending compression algorithms that are optimized by automatically self-adjusting various parameters of the algorithm for each Data Pack.

4.1.1 Compression Results 

An average compression ratio of 10:1 is achieved in BrightHouse. For example, 10TB of raw data can be stored in about 1TB of space, on average (including the overhead associated with Data Pack Nodes and the Knowledge Grid). Other database systems usually increase the data size, in some cases by a factor of 2 or more, because of the additional overhead required to create indexes and other special structures. Even database systems with compression rarely exceed a net compression ratio of 1:1 once that overhead is included.

Within BrightHouse, the compression ratio may differ depending on data types and content. Additionally, some data may turn out to be more repetitive than other data and hence easier to compress. Our experience with real-world data shows our compression ratios can be as high as 30:1.


4.2 Resolving Complex Analytic Queries without Indexes

A key feature of BrightHouse is its ability to resolve complex analytic queries without the need for traditional indexes. The following sections describe the methods used.

4.2.1 Data Packs and Data Pack Nodes

Data Packs consist of groupings of 65,536 items within a given column. For example, for a table T with columns A and B and 300,000 records, we would have the following Packs:

Pack A1: values of A for rows 1-65,536          Pack B1: values of B for rows 1-65,536
Pack A2: values of A for rows 65,537-131,072    Pack B2: values of B for rows 65,537-131,072
Pack A3: values of A for rows 131,073-196,608   Pack B3: values of B for rows 131,073-196,608
Pack A4: values of A for rows 196,609-262,144   Pack B4: values of B for rows 196,609-262,144
Pack A5: values of A for rows 262,145-300,000   Pack B5: values of B for rows 262,145-300,000

Analytical information about each Data Pack is collected and stored in what is referred to as a Data Pack Node (DPN). For example, for numeric data types, the minimum value, maximum value, sum, and number of non-null elements are stored. Each Data Pack has a corresponding DPN.

For example, for the above table T, assume that both A and B store numeric values. For simplicity, assume there are no null values in T; hence we can omit information about the number of non-nulls in the DPNs. The following table should be read as follows: for the first 65,536 rows in T, the minimum value of A is 0, the maximum is 5, and the sum of the values of A over those rows is 10,000 (and there are no null values).

Pack Numbers      |   DPNs of A          |   DPNs of B
(Columns A & B)   |  Min   Max      Sum  |  Min   Max    Sum
------------------+----------------------+-------------------
Packs A1 & B1     |    0     5   10,000  |    0     5   1,000
Packs A2 & B2     |    0     2    2,055  |    0     2     100
Packs A3 & B3     |    7     8  500,000  |    0     1     100
Packs A4 & B4     |    0     5   30,000  |    0     5     100
Packs A5 & B5     |   -4    10       12  |  -15     0     -40


DPNs are accessible without the need to decompress the corresponding Data Packs. Whenever we consider using data stored in a given Data Pack, we first examine its DPN to determine whether we really need to decompress the Pack's contents. In many cases, the information contained in a DPN is enough to optimize and execute a query.

The mechanism of creating and storing Data Packs and their DPNs works as follows.

During load, 65,536 values of the given column are treated as a sequence with zero or more null values occurring anywhere in the sequence. Information about the null positions is stored separately (within the Null Mask). Then the remaining stream of non-null values is compressed, taking full advantage of regularities inside the data.

The loading algorithm is multi-threaded, allowing each column to be loaded in parallel and maintaining a good load speed, even with the overhead of compression.
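The load-time path just described can be sketched as follows. This is a hypothetical illustration: `zlib` stands in for Infobright's patent-pending, per-pack-tuned compression algorithms, which are not public.

```python
# Hypothetical sketch: record null positions in a separate Null Mask,
# then compress only the stream of non-null values.
import pickle
import zlib

def compress_pack(values):
    """Split a pack into (null mask, compressed non-null stream)."""
    null_mask = [v is None for v in values]
    non_nulls = [v for v in values if v is not None]
    return null_mask, zlib.compress(pickle.dumps(non_nulls))

def decompress_pack(null_mask, blob):
    """Rebuild the original sequence from the mask and the compressed stream."""
    non_nulls = iter(pickle.loads(zlib.decompress(blob)))
    return [None if is_null else next(non_nulls) for is_null in null_mask]

pack = [7, None, 7, 7, None, 8]
mask, blob = compress_pack(pack)
assert decompress_pack(mask, blob) == pack  # lossless round trip
```

Separating the nulls first means the compressor sees an uninterrupted stream of same-typed values, which is where the regularities it exploits actually live.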

4.2.2 The Knowledge Grid 

DPNs alone can help minimize the need to decompress data when resolving queries; however, Infobright's technology goes beyond DPNs to also include what we call the Knowledge Grid.

The Knowledge Grid was developed to efficiently handle complex, multiple-table queries (joins, sub-queries, etc.). It stores more advanced information about the data interdependencies found in the database, involving multiple tables, multiple columns, and single columns, in the form of Knowledge Nodes (KNs). This enables precise identification of the Data Packs involved and minimizes the need to decompress data.


To process multi-table join queries and sub-queries, the Knowledge Grid uses multi-table Pack-To-Pack KNs that indicate which pairs of Data Packs from different tables should actually be considered when joining the tables. The query optimization module is designed so that Pack-To-Pack Nodes can be applied together with other KNs.

Simple statistics such as the min and max within Data Packs can be extended to include more detailed information about the occurrence of values within particular value ranges. In these cases a histogram is created that maps occurrences of particular values in particular Data Packs, making it possible to determine quickly and precisely the chance of occurrence of a given value without decompressing the Data Pack itself. In the same way, basic information about alpha-numeric data is extended by storing information about the occurrence of particular characters in particular positions in the Data Pack. Additional Knowledge Node types are planned for future releases.

KNs can be compared to the indexes used by traditional databases; however, KNs work on Packs instead of rows. Hence, roughly speaking, KNs are 65,536 times smaller than indexes (or even 65,536 times 65,536 smaller for the Pack-To-Pack Nodes, because of the size decrease for each of the two tables involved). In general the overhead is around 1% of the data, compared to classic indexes, which can be 20-50% of the size of the data.

Knowledge Nodes are created on data load and are automatically maintained by the Knowledge Grid Manager based on the column type and definition.

4.2.3 Using the Knowledge Grid and Data Pack Nodes for Query Execution 

How do Data Packs, DPNs and KNs work together to achieve high query performance?

Decompressing individual Data Packs is incomparably faster than decompressing larger portions of data. By having a good mechanism for identifying which Packs must be decompressed, we achieve high query performance over the partially compressed data. For every query coming into the optimizer, BrightHouse applies DPNs and KNs to split Data Packs among the following three categories:

• Relevant Packs – each element (the record's value for the given column) is identified, based on DPNs and KNs, as applicable to the given query.

• Irrelevant Packs – based on DPNs and KNs, the Pack holds no relevant values.

• Suspect Packs – some elements may be relevant, but there is no way to claim that the Pack is either fully relevant or fully irrelevant, based on DPNs and KNs.

While querying, we do not need to decompress either Relevant or Irrelevant Data Packs. Irrelevant Packs are simply not taken into account at all. In the case of Relevant Packs, we know that all elements are relevant, and the required answer is obtainable from the analytical information inside the DPN.
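The three-way split can be sketched with only a pack's DPN min and max, for a one-sided predicate such as A > 6. The function name and data layout below are illustrative, not Infobright's API:

```python
# Hypothetical sketch: classify a Data Pack against "value > threshold"
# using only its DPN, without decompressing anything.
def classify_pack(dpn_min, dpn_max, threshold):
    """Return 'relevant', 'irrelevant', or 'suspect'."""
    if dpn_min > threshold:
        return "relevant"    # every value in the pack satisfies the predicate
    if dpn_max <= threshold:
        return "irrelevant"  # no value in the pack can satisfy it
    return "suspect"         # mixed: the pack must be decompressed to decide

# DPN (min, max) pairs of column A from the example table T, predicate A > 6:
dpns_a = [(0, 5), (0, 2), (7, 8), (0, 5), (-4, 10)]
print([classify_pack(lo, hi, 6) for lo, hi in dpns_a])
# ['irrelevant', 'irrelevant', 'relevant', 'irrelevant', 'suspect']
```

The same classification drives the worked examples in the next section.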


For example, using the previously discussed DPNs for table T (repeated below), consider the following SQL query:

Query 1: SELECT SUM(B) FROM T WHERE A > 6;

Pack Numbers      |   DPNs of A          |   DPNs of B
(Columns A & B)   |  Min   Max      Sum  |  Min   Max    Sum
------------------+----------------------+-------------------
Packs A1 & B1     |    0     5   10,000  |    0     5   1,000
Packs A2 & B2     |    0     2    2,055  |    0     2     100
Packs A3 & B3     |    7     8  500,000  |    0     1     100
Packs A4 & B4     |    0     5   30,000  |    0     5     100
Packs A5 & B5     |   -4    10       12  |  -15     0     -40

We can see that:

• Packs A1, A2, and A4 are Irrelevant – none of their rows can satisfy A > 6, because all of these Packs have maximum values below 6. Consequently, Packs B1, B2, and B4 will not be analyzed while calculating SUM(B) – they are Irrelevant too.

• Pack A3 is Relevant – all the rows numbered 131,073-196,608 satisfy A > 6, which means Pack B3 is Relevant too. The sum of the values of B within Pack B3 is one of the components of the final answer. Based on B3's DPN, we know that sum equals 100, and this is everything we need to know about this portion of the data.

• Pack A5 is Suspect – some rows satisfy A > 6, but we do not know which ones. As a consequence, Pack B5 is Suspect too. We will need to decompress both A5 and B5 to find which of the rows 262,145-300,000 satisfy A > 6 and to sum the values of B for exactly those rows. The result is added to the value of 100 previously obtained for Pack B3 to form the final answer to the query.
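The pack-by-pack resolution of Query 1 can be sketched end to end. This is a hypothetical illustration: `decompress` stands in for real Data Pack decompression, and the decompressed values are invented but consistent with Pack 5's DPN bounds.

```python
# Hypothetical sketch of SELECT SUM(B) FROM T WHERE A > 6, resolved per pack:
# Relevant packs contribute their DPN sum directly, Suspect packs are
# decompressed and scanned, Irrelevant packs are skipped entirely.
def sum_b_where_a_gt(threshold, dpns_a, dpns_b, decompress):
    total = 0
    for k, ((a_min, a_max, _), (_, _, b_sum)) in enumerate(zip(dpns_a, dpns_b)):
        if a_max <= threshold:
            continue                      # Irrelevant: skip entirely
        if a_min > threshold:
            total += b_sum                # Relevant: answer comes from the DPN
        else:                             # Suspect: decompress and scan
            a_vals, b_vals = decompress(k)
            total += sum(b for a, b in zip(a_vals, b_vals) if a > threshold)
    return total

# DPN (min, max, sum) triples from the example table T:
dpns_a = [(0, 5, 10_000), (0, 2, 2_055), (7, 8, 500_000), (0, 5, 30_000), (-4, 10, 12)]
dpns_b = [(0, 5, 1_000), (0, 2, 100), (0, 1, 100), (0, 5, 100), (-15, 0, -40)]

def decompress(k):
    # Only the Suspect pack (index 4) is ever decompressed; values are
    # invented for illustration but lie within the pack's DPN bounds.
    assert k == 4
    return [10, -4, 6], [-15, 0, -25]

print(sum_b_where_a_gt(6, dpns_a, dpns_b, decompress))  # 100 + (-15) = 85
```

Note that only one of the five packs was decompressed; the other four were resolved from metadata alone.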

Suspect Packs can change their status during query execution and become Irrelevant or Relevant based on intermediate results obtained from other Data Packs.

For example, consider the following, just slightly modified SQL query:

Query 2: SELECT MAX(B) FROM T WHERE A > 6;

Compared to Query 1, the split among Relevant, Irrelevant, and Suspect Data Packs does not change. From the DPN for Pack B3 we know that the maximum value of B over the rows 131,073-196,608 equals 1, so we are already sure that at least one row satisfying A > 6 has a value of 1 on B. From the DPN for Pack B5 we know that the maximum value of B over the rows 262,145-300,000 equals 0, so although some of those rows satisfy A > 6, none of them can exceed the previously found value of 1. We therefore know that the answer to Query 2 is 1 without decompressing any Packs for T.
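The DPN-only reasoning for Query 2 can be sketched as follows (a hypothetical illustration with made-up function names): a Suspect pack becomes Irrelevant when its DPN max cannot beat the best value already guaranteed by a Relevant pack.

```python
# Hypothetical sketch of SELECT MAX(B) FROM T WHERE A > 6 using DPNs only:
# Relevant packs establish a guaranteed lower bound on the maximum; any
# Suspect pack whose DPN max cannot beat that bound needs no decompression.
def max_b_where_a_gt(threshold, dpns_a, dpns_b):
    best = max(b_max for (a_min, _), (_, b_max) in zip(dpns_a, dpns_b)
               if a_min > threshold)          # bound from Relevant packs
    still_suspect = [k for k, ((a_min, a_max), (_, b_max)) in
                     enumerate(zip(dpns_a, dpns_b))
                     if a_min <= threshold < a_max and b_max > best]
    return best, still_suspect               # [] means no decompression needed

# DPN (min, max) pairs from the example table T:
dpns_a = [(0, 5), (0, 2), (7, 8), (0, 5), (-4, 10)]
dpns_b = [(0, 5), (0, 2), (0, 1), (0, 5), (-15, 0)]
print(max_b_where_a_gt(6, dpns_a, dpns_b))  # (1, []) – answer is 1, no packs to open
```

Pack 5 starts out Suspect, but its DPN max of 0 cannot exceed the bound of 1 from Pack 3, so it is reclassified as Irrelevant and the query finishes on metadata alone.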


To summarize, the above Data Pack split can be modified during query execution, with the percentage of Suspect Packs decreasing; the actual workload is still limited to the suspect data. This method of optimization and execution is entirely different from other databases, as we iteratively work with portions of compressed data.

As another example, consider the usage of Pack-To-Pack Nodes, one of the most interesting components of the Knowledge Grid.

Query 3: SELECT MAX(X.D) FROM T JOIN X ON T.B = X.C WHERE T.A > 6;

Table X consists of two columns, C and D. The above query means that we want to find the maximum value of column D in table X, but only for those rows in X for which there are rows in table T with the same value on B as the value on C in X, and with the value on A greater than 6. Table X consists of 150,000 rows, with 3 Data Packs for each column. Its DPNs are presented below, in the same way as for table T. (Again, no null values are assumed.)

Pack Numbers      |  DPNs of C        |  DPNs of D
(Columns C & D)   |  Min   Max   Sum  |  Min   Max   Sum
------------------+-------------------+------------------
Packs C1 & D1     |   -1     5   100  |    0     5   100
Packs C2 & D2     |   -2     2   100  |    0    33   100
Packs C3 & D3     |   -7     8     1  |    0     8   100

As in the previous queries, the Data Packs to be analyzed from the perspective of table T are A3, A5, B3, and B5. More precisely, the join of tables X and T with the additional condition T.A > 6 may involve only the rows 131,073-196,608 and 262,145-300,000 on table T's side.

Assume that the following Pack-To-Pack Knowledge Node is available for the values of B in table T and C in table X. "T.Bk" refers to Pack Bk in T, where k is the Pack's number. Hence, for example, we can see that there are no pairs of equal values from T.B1 and X.C1 (represented by 0), and there is at least one such pair for T.B1 and X.C2 (represented by 1).

       T.B1  T.B2  T.B3  T.B4  T.B5
X.C1    0     1     0     1     0
X.C2    1     0     0     0     1
X.C3    1     1     0     1     0

Now we can see that while joining tables T and X on the condition T.B = X.C, we no longer need to consider Pack B3 in T, because its elements do not match any of the rows in table X. The only Pack left on T's side is B5. From the above we see that B5 can match only rows from Pack C2 in table X. Hence, we need to analyze


only the values of column D over the rows 65,537-131,072 in table X. We are not sure exactly which of those rows satisfy T.B = X.C with the additional filter T.A > 6 on T's side; we only know, based on the DPN, that the final result will be at most 33. To calculate the exact answer, we first decompress Packs B5 and C2, as well as A5, to get precise information about the rows that satisfy the conditions. Then we decompress Pack D2 in table X to find D's maximum value over those previously extracted row positions.
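The Pack-To-Pack pruning step for Query 3 can be sketched as follows (a hypothetical illustration; the matrix mirrors the example above, and the packs of T have already been narrowed to B3 and B5 by the T.A > 6 filter):

```python
# Hypothetical sketch: a Pack-To-Pack Knowledge Node as a 0/1 lookup,
# where 1 means at least one matching value pair may exist between the packs.
pack_to_pack = {
    ("C1", "B1"): 0, ("C1", "B2"): 1, ("C1", "B3"): 0, ("C1", "B4"): 1, ("C1", "B5"): 0,
    ("C2", "B1"): 1, ("C2", "B2"): 0, ("C2", "B3"): 0, ("C2", "B4"): 0, ("C2", "B5"): 1,
    ("C3", "B1"): 1, ("C3", "B2"): 1, ("C3", "B3"): 0, ("C3", "B4"): 1, ("C3", "B5"): 0,
}

def prune(t_packs, x_packs):
    """Keep only the (T pack, X pack) pairs the Knowledge Node says can join."""
    return [(t, x) for t in t_packs for x in x_packs
            if pack_to_pack[(x, t)] == 1]

# T-side packs surviving the A > 6 filter, against all packs of X.C:
print(prune(["B3", "B5"], ["C1", "C2", "C3"]))  # [('B5', 'C2')]
```

Of the six candidate pack pairs, exactly one survives, which is why only B5, C2 (plus A5 for the filter) and the matching D pack ever need to be decompressed.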

5.0 Conclusion and Next Steps

Together, Infobright and MySQL have developed a joint solution that meets the needs of customers looking for a high-performance Analytic Data Warehouse. Using Infobright's unique technology, BrightHouse gives corporations a new way to make vast amounts of data available to business analysts and executives so they can make better business decisions.

As mentioned earlier in this paper, compression and performance depend on the data itself, the types of queries being run, and the overall environment (including hardware and numbers of users). The best way to get started with BrightHouse is to contact us for a Proof of Concept. Using your data and your queries, we can show you first hand how BrightHouse performs in your environment.

For more information on Infobright's products and services, or to schedule a Proof of Concept, please contact MySQL at [email protected].
