[lecture notes in electrical engineering] computer science and its applications volume 330 || a...

© Springer-Verlag Berlin Heidelberg 2015. James J. (Jong Hyuk) Park et al. (eds.), Computer Science and Its Applications, Lecture Notes in Electrical Engineering 330, DOI: 10.1007/978-3-662-45402-2_108

A Flexible Page Storage Model for Mixed OLAP and OLTP Workloads

Kyounghyun Park, Hai Thanh Mai, Miyoung Lee, and Hee Sun Won

Big Data Software Platform Research Department, Electronics and Telecommunications Research Institute, Daejeon, Korea

{hareton,mhthanh,mylee,hswon}@etri.re.kr

Abstract. Previous page storage models do not perform well under mixed OLAP and OLTP workloads because they were originally designed for only one of these workload types. Therefore, we propose a new page storage model for mixed OLAP and OLTP workloads that is based on column groups. We also present a dynamic page layout method that generates correlated column groups by analyzing the current workload and then dynamically reorganizes the data pages. The proposed page model shows a considerable performance improvement in the mixed workload environment.

Keywords: Databases, mixed workload, page storage model, page layout.

1 Introduction

Traditional database management systems have been developed to work with one of two workload types, namely Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP) [1]. OLAP concentrates on analytical operations over a large amount of data, while OLTP focuses on read, write, and update operations over a small fraction of the data. Nevertheless, the distinction between OLAP and OLTP is becoming less meaningful as enterprise markets move toward the management and analysis of real-time big data, where both workload types need to be processed efficiently at the same time [2]. Among the many components of traditional database systems that need to be redesigned for this new requirement, the page storage model and page layout management are particularly important [3].

Row stores and column stores are the two most widely used page storage models. The former stores full data records sequentially and is thus suitable for OLTP workloads, which usually access only a few records. The latter stores multiple values of the same attribute contiguously and is thus suitable for OLAP workloads, which typically access a few columns over a large amount of data. However, neither of these storage models performs well with mixed OLAP and OLTP workloads. Therefore, in this paper, we propose a new Column Group-based page storage Model (CGM) to solve the problem. We also present a new page layout management method to get the highest performance improvement from our page storage model.

2 Background

The N-ary storage model (NSM) [4] is the most popular page storage model in relational database systems. NSM stores data records sequentially and is suitable for OLTP workloads. However, NSM performs poorly with OLAP workloads, where only a fraction of each record is required. Meanwhile, the decomposition storage model (DSM) [5] is a page storage model that partitions data vertically into sub-relations. Each sub-relation contains the values of one attribute of a relation. DSM performs well when a query requires a single-attribute scan. However, DSM is not suitable for multi-attribute queries because the database system has to join the sub-relations to answer them. Partition Attributes Across (PAX) [6] is a page storage model that partitions the records within each page, storing the values of each attribute together in minipages. Because all attributes of a record stay on the same page, PAX does not increase the record reconstruction cost significantly. Overall, these page storage models are static: the page layout is fixed. Therefore, they are not efficient for dynamic workloads.
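The difference between the three layouts can be made concrete with a small sketch. The records, the page capacity `RECORDS_PER_PAGE`, and the function names below are hypothetical and chosen only for illustration. The record id stored next to each DSM value is an assumption made here so that records can be joined back together.

```python
# Illustrative only: tiny in-memory stand-ins for the NSM, DSM, and PAX layouts.
records = [
    ("Alien", 1979, 117),
    ("Brazil", 1985, 132),
    ("Clue", 1985, 94),
    ("Dune", 1984, 137),
]
attributes = ("title", "year", "length")
RECORDS_PER_PAGE = 2  # hypothetical page capacity


def chunk(rows, size):
    """Split rows into consecutive pages of at most `size` records."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def nsm_layout(rows, size):
    """NSM: each page stores whole records one after another."""
    return [list(page) for page in chunk(rows, size)]


def dsm_layout(rows, names):
    """DSM: one sub-relation per attribute, holding (record id, value) pairs."""
    return {name: [(rid, row[i]) for rid, row in enumerate(rows)]
            for i, name in enumerate(names)}


def pax_layout(rows, names, size):
    """PAX: records stay on their page, but each page is split into
    one minipage per attribute."""
    return [{name: [row[i] for row in page] for i, name in enumerate(names)}
            for page in chunk(rows, size)]


nsm = nsm_layout(records, RECORDS_PER_PAGE)
dsm = dsm_layout(records, attributes)
pax = pax_layout(records, attributes, RECORDS_PER_PAGE)
```

In this sketch a scan over `year` reads one list in DSM and one minipage per page in PAX, whereas in NSM it has to step over every full record.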

3 The Proposed Page Model

Fig. 1 illustrates our column group-based page model (CGM). A CGM page consists of subpages that store the values of correlated attributes. For example, consider a relation Movies (title, year, length, genre, studio) and the following sample query.

SELECT title, length FROM Movies WHERE studio = 'Fox' AND genre = 'comedy'

To get the query result, the database system accesses the studio and genre attributes of Movies and then accesses the title and length attributes if the studio and genre values satisfy the query condition. The query does not access the year attribute of the Movies relation. Therefore, we can group the correlated attributes of the Movies relation by query access pattern, for example Movies ({title, length} {genre, studio} {year}).
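One simple way to derive such groups is sketched below: attributes that appear in exactly the same access steps of the same queries end up in one group. This is an illustrative heuristic, not the selection algorithm of Section 4, and the function name `group_by_access_pattern` and the representation of a query as a list of attribute sets are assumptions.

```python
from collections import defaultdict


def group_by_access_pattern(attributes, queries):
    """Group attributes that are always accessed together.

    `queries` is a list of queries; each query is a list of attribute sets,
    one set per access step (here: predicate columns, then output columns).
    Attributes that share the same set of (query, step) positions form a group.
    """
    signature = defaultdict(set)
    for q_idx, steps in enumerate(queries):
        for s_idx, attrs in enumerate(steps):
            for attr in attrs:
                signature[attr].add((q_idx, s_idx))

    groups = defaultdict(list)
    for attr in attributes:  # iterate in schema order to keep output stable
        groups[frozenset(signature[attr])].append(attr)
    return list(groups.values())


movies = ["title", "year", "length", "genre", "studio"]
# The sample query: first the WHERE columns, then the SELECT columns.
sample_query = [{"studio", "genre"}, {"title", "length"}]
column_groups = group_by_access_pattern(movies, [sample_query])
print(column_groups)
```

For the sample query this yields the three groups given above: title with length, genre with studio, and year on its own because no query touches it.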

3.1 Page Structure

In the CGM page model, each page consists of a page header and a set of subpages. The page header provides general information about the page, including (a) the page ID, (b) the number of subpages, (c) the offset array of the subpages, (d) the number of attributes and the number of records, and (e) the attribute-subpage mapping information.

Each subpage has a subpage header, an array that stores Item objects, and the tuple data. A subpage header contains four location values, namely start, end, lower, and upper. The start and end fields hold the start and end offsets of the subpage, measured from the start of the page. The lower field points to the end of the Item array and the upper field points to the start position of the last tuple. Each Item object consists of the offset and the length of the corresponding tuple.

Each tuple consists of a tuple header and a set of attribute values. The tuple header holds the number of attributes and the tuple length. In addition, the tuple header has an object array that holds the value length and value offset of each attribute.
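The structure described above can be written down as plain data classes. This is a minimal sketch: the field names follow the text, while the header sizes, the even split of the page among subpages, the page-relative interpretation of lower and upper, and the helper `build_empty_page` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

PAGE_SIZE = 8192          # 8 KB, the page size used in the evaluation
PAGE_HEADER_SIZE = 64     # hypothetical
SUBPAGE_HEADER_SIZE = 16  # hypothetical


@dataclass
class Item:
    offset: int  # where the tuple starts
    length: int  # length of the tuple in bytes


@dataclass
class TupleHeader:
    num_attributes: int
    tuple_length: int
    # One (value length, value offset) pair per attribute.
    values: List[Tuple[int, int]] = field(default_factory=list)


@dataclass
class SubpageHeader:
    start: int  # subpage start, measured from the start of the page
    end: int    # subpage end, measured from the start of the page
    lower: int  # end of the Item array
    upper: int  # start position of the last tuple


@dataclass
class Subpage:
    header: SubpageHeader
    items: List[Item] = field(default_factory=list)


@dataclass
class PageHeader:
    page_id: int
    subpage_offsets: List[int]
    num_attributes: int
    num_records: int
    attr_to_subpage: Dict[str, int]

    @property
    def num_subpages(self) -> int:
        return len(self.subpage_offsets)


@dataclass
class Page:
    header: PageHeader
    subpages: List[Subpage]

    def subpage_for(self, attribute: str) -> Subpage:
        """Resolve an attribute to its subpage through the page header."""
        return self.subpages[self.header.attr_to_subpage[attribute]]


def build_empty_page(page_id, column_groups):
    """Create an empty page with one subpage per column group (even split)."""
    share = (PAGE_SIZE - PAGE_HEADER_SIZE) // len(column_groups)
    offsets = [PAGE_HEADER_SIZE + i * share for i in range(len(column_groups))]
    mapping = {a: i for i, group in enumerate(column_groups) for a in group}
    subpages = [
        Subpage(SubpageHeader(start=o, end=o + share,
                              lower=o + SUBPAGE_HEADER_SIZE, upper=o + share))
        for o in offsets
    ]
    return Page(PageHeader(page_id, offsets, len(mapping), 0, mapping), subpages)


page = build_empty_page(1, [["title", "length"], ["genre", "studio"], ["year"]])
```

With this layout, looking up studio goes through the attribute-subpage mapping in the page header and lands in the second subpage.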

Fig. 1. A CGM page model structure

3.2 Page Operations

A new data record is inserted if there is space available in the page. If there is no more space, a new page is created and the record is inserted into the new page. To search for a record or for a subset of the attributes of a record, the system finds the subpage to access by interpreting the page header. Then, it reads the items and gets the data using their offset and length values. For an update operation, if the size of the new data is equal to or smaller than the size of the existing data in the tuple, the operation is done in place immediately. Otherwise, the system must reorganize the page first and then update the data. To delete a record, the system removes the item entry from the item array. This causes fragmentation in pages. Therefore, the system periodically reorganizes pages by running page compaction.
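The operations above can be sketched for a single subpage as follows. This is a simplified illustration under several assumptions: tuples are opaque byte strings, the Item array is a Python list whose space is accounted for with a hypothetical `ITEM_SIZE`, deleted entries are marked `None` so that slot numbers stay stable, and `update` expects a live slot.

```python
ITEM_SIZE = 8  # hypothetical bytes per Item entry (offset + length)


class Subpage:
    def __init__(self, size):
        self.data = bytearray(size)
        self.items = []    # (offset, length) per slot; None marks a deleted slot
        self.lower = 0     # end of the Item array
        self.upper = size  # start of the most recently stored tuple

    def free_space(self):
        return self.upper - self.lower

    def insert(self, payload):
        """Return the slot number, or None if the caller must use a new page."""
        if len(payload) + ITEM_SIZE > self.free_space():
            return None
        self.upper -= len(payload)
        self.data[self.upper:self.upper + len(payload)] = payload
        self.items.append((self.upper, len(payload)))
        self.lower += ITEM_SIZE
        return len(self.items) - 1

    def read(self, slot):
        """Follow the item's offset and length to the tuple bytes."""
        item = self.items[slot]
        if item is None:
            return None
        offset, length = item
        return bytes(self.data[offset:offset + length])

    def update(self, slot, payload):
        offset, length = self.items[slot]
        if len(payload) <= length:  # fits: overwrite in place
            self.data[offset:offset + len(payload)] = payload
            self.items[slot] = (offset, len(payload))
            return True
        # Larger value: check that it fits once the page is reorganized.
        others = sum(it[1] for i, it in enumerate(self.items)
                     if it is not None and i != slot)
        if len(self.data) - self.lower - others < len(payload):
            return False
        self.items[slot] = None
        self.compact()  # reorganize first, then write the new data
        self.upper -= len(payload)
        self.data[self.upper:self.upper + len(payload)] = payload
        self.items[slot] = (self.upper, len(payload))
        return True

    def delete(self, slot):
        """Drop the item entry; the tuple bytes remain as a hole."""
        self.items[slot] = None

    def compact(self):
        """Repack live tuples toward the end of the subpage to remove holes."""
        packed = bytearray(len(self.data))
        upper = len(self.data)
        for slot, item in enumerate(self.items):
            if item is None:
                continue
            offset, length = item
            upper -= length
            packed[upper:upper + length] = self.data[offset:offset + length]
            self.items[slot] = (upper, length)
        self.data, self.upper = packed, upper


sp = Subpage(64)
fox = sp.insert(b"Fox")
genre = sp.insert(b"comedy")
sp.update(genre, b"drama")  # smaller value: done in place
sp.delete(fox)              # leaves a 3-byte hole until compaction
sp.compact()
```

Marking deleted slots instead of shifting the array is a design choice of this sketch; it keeps existing slot numbers valid at the cost of not reclaiming Item entries.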

4 Dynamic Page Layout Management

For our CGM model, we develop a dynamic page layout management system, illustrated in Fig. 2. The page monitor collects the names and selectivity of the attributes accessed by the query processor. In addition, it collects the IDs of the pages accessed by the system when evaluating the queries. In short, the page monitor collects and manages the information that is necessary to create the column groups of each page. Next, the page layout manager periodically extracts efficient column groups from the current workload and provides the storage manager with the column group information. To select the best column groups, the page layout manager requires a cost model and a column group selection algorithm. The cost model is based on cache misses, so the page layout manager computes the number of cache misses incurred when processing the workload. The process for extracting column groups is as follows. First, the page layout manager receives from the page monitor the list of pages accessed during query evaluation. Second, the page layout manager computes the cost of every possible attribute combination and extracts the best column groups.
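The exhaustive selection step can be sketched as follows. The cost function is a deliberately simple stand-in for a cache-miss model and is an assumption, not the model used in the paper: a sequential scan over a group costs the number of cache lines the scanned bytes span, and a random access costs the cache lines of one group entry per row. The `Query` class, the attribute widths, the cache line size, and the toy workload are likewise hypothetical.

```python
from dataclasses import dataclass

CACHE_LINE = 64  # bytes; hypothetical


@dataclass(frozen=True)
class Query:
    attrs: frozenset   # attributes the query accesses
    rows: int          # number of records it touches
    sequential: bool   # True for scans, False for random lookups
    weight: int = 1    # how often the query occurs in the workload


def partitions(items):
    """Yield every way to split `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing group in turn ...
        for i, group in enumerate(smaller):
            yield smaller[:i] + [[first] + group] + smaller[i + 1:]
        # ... or let it start a group of its own.
        yield [[first]] + smaller


def ceil_div(a, b):
    return -(-a // b)


def cache_misses(groups, widths, query):
    """Estimated cache misses of one query under a given grouping."""
    misses = 0
    for group in groups:
        if not query.attrs.intersection(group):
            continue  # the query never touches this subpage
        width = sum(widths[a] for a in group)
        if query.sequential:
            misses += ceil_div(query.rows * width, CACHE_LINE)
        else:
            misses += query.rows * ceil_div(width, CACHE_LINE)
    return misses * query.weight


def select_column_groups(attributes, widths, workload):
    """Try every attribute combination and keep the cheapest grouping."""
    def total(groups):
        return sum(cache_misses(groups, widths, q) for q in workload)
    return min(partitions(list(attributes)), key=total)


widths = {"title": 32, "year": 4, "length": 4, "genre": 16, "studio": 16}
attributes = list(widths)
workload = [
    # OLAP-style scan over two columns of 10,000 records.
    Query(frozenset({"genre", "studio"}), rows=10_000, sequential=True),
    # OLTP-style lookups of one full record, issued 5,000 times.
    Query(frozenset(attributes), rows=1, sequential=False, weight=5_000),
]
best = select_column_groups(attributes, widths, workload)
```

Under this toy workload the search settles on two groups, genre with studio and title with year and length, which is cheaper than both a single row-style group and fully separate columns. Because the number of partitions grows very quickly with the number of attributes, an exhaustive search like this is only practical for narrow relations.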

Fig. 2. Dynamic Page Layout Management

The dynamic page layout reorganizer periodically gets the new column groups of a page from the layout manager and reorganizes the page. Note that the page layout reorganizer does not reorganize all pages. Instead, it reorganizes only hot pages. A page is considered to be "hot" if its number of accesses is greater than a threshold. The dynamic page layout reorganizer consists of a candidate page manager and a candidate page reorganizer. The candidate page manager decides whether a page is hot or not. The selected hot pages are sent to the candidate page queue. The role of the candidate page reorganizer is to reorganize old pages. To reorganize pages, it first reads a candidate page from the candidate page queue. Second, it reads the page header and checks whether the current column group information is equal to the new column group information. If so, the page is skipped and the reorganizer reads the next page from the candidate page queue. Otherwise, the candidate page reorganizer starts reorganizing the page according to the new column group information.
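The reorganization loop described above can be sketched as follows. The class and function names are hypothetical, a page is reduced to the column groups recorded in its header, and resetting a page's access counter once it has been processed is an assumption of this sketch.

```python
from collections import Counter, deque


class CandidatePageManager:
    """Counts page accesses and queues pages that become hot."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.access_counts = Counter()
        self.queue = deque()   # the candidate page queue
        self.queued = set()    # avoids queueing the same page twice

    def record_access(self, page_id):
        self.access_counts[page_id] += 1
        is_hot = self.access_counts[page_id] > self.threshold
        if is_hot and page_id not in self.queued:
            self.queue.append(page_id)
            self.queued.add(page_id)


def normalize(groups):
    """Make column groups comparable regardless of ordering."""
    return frozenset(frozenset(group) for group in groups)


def reorganize_candidates(manager, page_groups, new_groups):
    """Reorganize queued hot pages whose layout differs from `new_groups`.

    `page_groups` maps page id -> column groups stored in the page header.
    Returns the ids of the pages that were actually reorganized.
    """
    reorganized = []
    while manager.queue:
        page_id = manager.queue.popleft()
        manager.queued.discard(page_id)
        manager.access_counts[page_id] = 0  # assumption: restart counting
        if normalize(page_groups[page_id]) == normalize(new_groups):
            continue  # layout is already up to date, skip this page
        # A real system would move tuple data here; the sketch only
        # records the new layout.
        page_groups[page_id] = [list(group) for group in new_groups]
        reorganized.append(page_id)
    return reorganized


row_layout = [["title", "year", "length", "genre", "studio"]]
new_groups = [["title", "length"], ["genre", "studio"], ["year"]]
page_groups = {1: row_layout, 2: row_layout, 3: new_groups}

manager = CandidatePageManager(threshold=2)
for page_id in (1, 1, 1, 2, 3, 3, 3):
    manager.record_access(page_id)
changed = reorganize_candidates(manager, page_groups, new_groups)
```

Here pages 1 and 3 become hot, page 3 is skipped because its header already lists the new column groups, and the cold page 2 keeps its old layout.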

5 Performance Evaluation

We compared the performance of the proposed CGM and the existing NSM and PAX models. All experiments were performed on a PC with 2.83 GHz Intel(R) Core(TM) 2 Quad Processor and 4 GB of main memory. The operating system was Ubuntu 12.04 64-bit with kernel version 3.5.0. The page size of all pages is 8 KB.

Fig. 3 shows the performance of all storage models with the OLAP workload. PAX and CGM are superior to NSM. This is because NSM reads all attributes even when searching for only one attribute and thus causes more cache misses than PAX and CGM. As the number of attributes increases, the improvement ratios of PAX and CGM over NSM also rise. CGM's evaluation time was 36% faster than that of NSM. As the dataset grows, PAX and CGM also perform 30% better than NSM.

Fig. 3. Response time comparison with OLAP workload

Fig. 4 illustrates the OLTP performance. Because PAX has to reconstruct records at a significantly high cost when searching for data, its performance is the worst among all models. When the number of attributes is varied (tested with 200,000 records), CGM is 16% faster than PAX. When the number of records is varied, CGM performs 6% better than PAX.

Fig. 4. Response time comparison with OLTP workload


Fig. 5 shows the performance comparison with the mixed OLAP and OLTP workload. CGM is superior to the other models because it consists of subpages that are optimized for the mixed workload and thus causes fewer cache misses. In summary, for the mixed workload, the column group-based page model shows the best performance, while for a pure OLAP or OLTP workload the performance of our model is acceptable.

Fig. 5. Response time comparison with mixed workload

6 Conclusions

In this paper, we have proposed an efficient page storage model for processing mixed OLAP and OLTP workloads. To overcome the drawbacks of previous models, which are good for only OLAP or OLTP, we organize tuple data into subpages dynamically. Different subpages contain data of different column groups, constructed by analyzing the data access pattern. We also monitor page access, re-compute column groups, and reorganize the layout of hot pages periodically to improve the system’s efficiency.

References

1. Stonebraker, M., Bear, C., Cetintemel, U., Cherniack, M., Ge, T., Hachem, N., Harizopoulos, S., Lifter, J., Rogers, J., Zdonik, S.B.: One Size Fits All? Part 2: Benchmarking Studies. In: CIDR 2007, pp. 173–184 (2007)

2. Plattner, H.: A common database approach for OLTP and OLAP using an in-memory column database. In: SIGMOD Conference (2009)

3. Grund, M., Kruger, J., Plattner, H., Zeier, A., Cudre-Mauroux, P., Madden, S.: HYRISE - A Main Memory Hybrid Storage Engine. PVLDB 4(2), 105–116 (2010)

4. Ramakrishnan, R., Gehrke, J.: Database Management Systems, 2nd edn. WCB/McGraw-Hill (2000)

5. Copeland, G.P., Khoshafian, S.: A Decomposition Storage Model. In: SIGMOD Conference, pp. 268–279 (1985)

6. Ailamaki, A., DeWitt, D.J., Hill, M.D., Skounakis, M.: Weaving Relations for Cache Performance. In: VLDB 2001, pp. 169–180 (2001)
