
Online Data Partitioning in Distributed Database Systems

Kaiji Chen
University of Southern Denmark
[email protected]

Yongluan Zhou
University of Southern Denmark
[email protected]

Yu Cao
EMC Labs China, EMC Corporation
[email protected]

ABSTRACT

Most previous studies on automatic database partitioning focus on deriving a (near-)optimal (re)partition scheme according to a specific pair of database and query workload, and overlook the problem of how to efficiently deploy the derived partition scheme into the underlying database system. In fact, (re)partition scheme deployment is often non-trivial and challenging, especially in a distributed OLTP system where the repartitioning is expected to take place online without interrupting or disrupting the processing of normal transactions. In this paper, we propose SOAP, a system framework for scheduling online database repartitioning for OLTP workloads. SOAP aims to minimize the time frame of executing the repartition operations while guaranteeing the correctness and performance of the concurrent processing of normal transactions. SOAP packages the repartition operations into repartition transactions, and then mixes them with the normal transactions for holistic scheduling optimization. SOAP utilizes a cost-based approach to rank the repartition transactions' scheduling priorities, and leverages a feedback model from control theory to determine in which order and at which frequency the repartition transactions should be scheduled for execution. When the system is under heavy workload or resource shortage, SOAP takes a further step by allowing repartition operations to piggyback onto the normal transactions so as to mitigate the resource contention. We have built a prototype on top of PostgreSQL and conducted a comprehensive experimental study on Amazon EC2 to validate SOAP's significant performance advantages.

Categories and Subject Descriptors

H.2.4 [Database Management]: Systems - Distributed Databases and Transaction Processing

General Terms

Algorithms, Design, Performance, Experimentation

© 2015, Copyright is with the authors. Published in Proc. 18th International Conference on Extending Database Technology (EDBT), March 23-27, 2015, Brussels, Belgium: ISBN 978-3-89318-067-7, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

Keywords

Distributed Database Systems, Online Data Partitioning, Transaction Scheduling

1. INTRODUCTION

The difficulty of scaling front-end applications is well known for DBMSs executing highly concurrent workloads. One approach to this problem, employed by many Web-based companies, is to partition the data and workload across a large number of commodity, shared-nothing servers using a cost-effective, distributed DBMS. The scalability of online transaction processing (OLTP) applications on these DBMSs depends on the existence of an optimal database design, which defines how an application's data and workload are partitioned across the nodes in a cluster, and how queries and transactions are routed to the nodes. This in turn determines the number of transactions that access data stored on each node and how skewed the load is across the cluster. Optimizing these two factors is critical to scaling complex systems: a growing fraction of distributed transactions and load skew can degrade performance by a factor of over ten. Hence, without a proper design, a DBMS will perform no better than a single-node system due to the overhead caused by blocking, inter-node communication, and load balancing issues.

Automatic database partitioning has been extensively researched in the past. As a consequence, most DBMSs nowadays offer database partitioning design advisory tools. These tools analyze the workload at a given time and suggest a (near-)optimal repartition scheme in a cost-based or policy-based manner, with the expectation that the system performance can thereby be maintained at a consistently high level. It is then the DBA's responsibility to deploy the derived repartition scheme into the underlying database system, which, however, often poses great challenges to the DBA. On the one hand, the repartition operations should be executed fast enough so that the new partition scheme can start to take effect as soon as possible. However, granting high execution priorities to the repartitioning operations will inevitably slow down or even stall the normal transaction processing on the database system. On the other hand, the repartitioning procedure should be as transparent to the users as possible. In other words, the normal user transactions' correctness must not be violated and the processing performance should not be significantly influenced. Obviously, even skilled DBAs may not be able to easily figure out the best ways of deploying repartition schemes, especially when the workload changes over time and has bursts and peaks. As a result, automatic partition scheme deployment satisfying the above requirements is highly desirable. Surprisingly, few previous studies have been devoted to this important research problem.

In this paper, we focus on the problem of how to optimally execute a database repartition plan consisting of a set of repartition operations in a distributed OLTP system, where the repartitioning is expected to take place online without interrupting or disrupting the normal transactions' processing. We propose SOAP, a system framework for scheduling online database repartitioning for OLTP workloads. SOAP aims to minimize the time frame of executing the repartition operations while guaranteeing the correctness and performance of the concurrent normal transaction processing.

SOAP models and groups the repartition operations into repartition transactions, and then mixes them with the normal transactions for holistic scheduling optimization. There are two basic strategies for SOAP to schedule the repartition transactions, which are similar to the techniques used in the online repartitioning solutions of state-of-the-art database systems. The first strategy is to maximize the speed of applying the repartition plan and submit all the repartition transactions to the waiting queue with a priority higher than the normal transactions. The second strategy schedules repartition transactions only when the system is idle. Both basic strategies lack the flexibility to find a good trade-off between the two conflicting objectives: maximizing the speed of executing repartition transactions and minimizing the interference with the processing of normal transactions. As a result, SOAP interleaves the repartition transactions with normal transactions, and leverages feedback models from control theory to determine in which order and at which frequency the repartition transactions should be scheduled for execution.

In the feedback-based method, the repartition transactions have the same priority as the normal transactions; hence they will contend with normal transactions for the locks of database objects and significantly increase the system's workload, especially when the system is under heavy load or resource shortage. To mitigate this issue, SOAP utilizes a piggyback-based approach, which injects repartition operations into the normal transactions. The overhead of acquiring and releasing locks, as well as performing the distributed commit protocols, incurred by a repartition transaction can be saved if the normal transaction that it piggybacks on accesses the same set of database objects.

While the piggyback-based approach consumes fewer resources, it fails to make use of the available system resources to speed up the repartitioning process. This may leave some resources unused when the system workload is low and there are few transactions to piggyback on. Therefore, SOAP adopts a hybrid approach composed of a piggyback module and a feedback module. When the system workload does not use up all the system's resources, we can make use of the available resources to repartition the data before the actual arrival of the transactions that will access them; meanwhile, the piggyback-based approach will attempt to let the repartition transactions piggyback on the normal transactions when they arrive.

To summarize, we make the following contributions with this work:

• To the best of our knowledge, we are among the first to specifically study the problem of deploying database partition schemes online into OLTP systems.

• We propose a feedback model that realizes dynamic scheduling of the repartition operations.

• We also propose a piggyback approach to execute selected repartition operations within normal transactions to further mitigate the repartitioning overhead.

• We have built a SOAP prototype on top of PostgreSQL, and conducted a comprehensive experimental study on Amazon EC2 that validates SOAP's significant performance advantages.

The rest of this paper is organized as follows. In Section 2, we describe the generic SOAP system architecture, as well as how SOAP realizes online repartitioning for OLTP workloads. In Section 3, we elaborate SOAP's feedback-based, piggyback-based and hybrid approaches to online scheduling of repartition operations. Section 4 presents the experimental set-up and experimental results of a SOAP prototype on an Amazon EC2 cluster. We discuss the related work in Section 5 and then conclude in Section 6.

2. SOAP SYSTEM OVERVIEW

In this section, we describe the generic SOAP system architecture, as well as how SOAP realizes online repartitioning for OLTP workloads.

2.1 SOAP System Architecture

Figure 1 shows a SOAP-enabled distributed database architecture providing OLTP services. The clients submit user transactions through a transaction manager (TM), which can be either centralized or distributed. Each submitted transaction will be given a globally unique ID by the TM. The TM takes care of the processing life-cycle of transactions and guarantees their ACID properties with certain distributed commit protocols and concurrency control protocols. The query router maintains the mappings between data partitions and their resident nodes, based on which it routes the incoming transaction queries to the correct nodes for execution. All the submitted transactions will be associated with a scheduling priority and then put into a processing queue, where higher-priority transactions will be executed first, while the FIFO policy will be applied to break ties. The rules for setting priorities are customizable.
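To make the scheduling policy concrete, the sketch below shows one way such a processing queue could be implemented. It is a minimal illustration under assumed names (QueuedTransaction, ProcessingQueue); the paper does not show SOAP's actual queue code.

    import java.util.PriorityQueue;
    import java.util.concurrent.atomic.AtomicLong;

    // Illustrative sketch: transactions ordered by descending priority,
    // with arrival order (FIFO) breaking ties among equal priorities.
    class QueuedTransaction {
        final long id;        // global unique ID assigned by the TM
        final int priority;   // higher value = scheduled earlier
        final long seqNo;     // arrival sequence number, breaks ties

        QueuedTransaction(long id, int priority, long seqNo) {
            this.id = id;
            this.priority = priority;
            this.seqNo = seqNo;
        }
    }

    class ProcessingQueue {
        private final AtomicLong arrivals = new AtomicLong();
        private final PriorityQueue<QueuedTransaction> queue =
            new PriorityQueue<>((a, b) -> {
                // Higher priority first; FIFO among equal priorities.
                if (a.priority != b.priority)
                    return Integer.compare(b.priority, a.priority);
                return Long.compare(a.seqNo, b.seqNo);
            });

        synchronized void submit(long txnId, int priority) {
            queue.add(new QueuedTransaction(
                txnId, priority, arrivals.getAndIncrement()));
        }

        synchronized QueuedTransaction next() {
            return queue.poll();
        }
    }

With this ordering, the Apply-All and After-All baselines of Section 3.2 amount to submitting repartition transactions with a priority above or below that of the normal transactions.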

The transaction manager, query router and processing queue are common components in most OLTP systems, while SOAP introduces a new component, the repartitioner, to coordinate its online database repartitioning for OLTP workloads. In the following subsection, we describe how the repartitioner works.

2.2 Online Database Repartitioning

In this paper, we consider scenarios where the types and frequencies of transactions in the OLTP workload could change over time, so that periodic database repartitioning is required in order to maintain the system performance.

The repartitioner determines when and how the OLTP database should be repartitioned. Its optimizer component periodically extracts from the workload history the frequencies of transactions and the data partitions they visit, and then estimates the system throughput and latency in the near future based on that history. If the estimated system performance is below a predefined threshold, the optimizer will derive a repartition plan in a cost-based manner. The repartition plan can be at the granularity of moving individual tuples, or tuples within certain ranges or with certain hash keys on their attributes. We assume each tuple contains enough information to be positioned by the query router. We assume tuple replicas are made only for the purpose of high availability, and we make no assumptions about the replication strategy utilized. Tuple replicas will be distributed over distinct data partitions, and the query router will determine, for a transaction, which replica of a specific tuple should be visited.

[Figure 1: The generic SOAP system architecture. Clients submit transactions through the Transaction Manager and Query Router to the data partitions (Partition-1 to Partition-6); the Repartitioner, with its Optimizer and Scheduler components, interacts with the Log, the Processing Queue and the Query Router.]

The optimizer will generate three types of repartition operations, i.e., new replica creation, replica deletion and objects migration, each accompanied by the normal transactions that access the database objects it repartitions.

• New Replica Creation: insert new replicas of database objects originally stored in one data partition into another partition that contains no other replicas of the same objects.

• Replica Deletion: for database objects with multiple replicas, delete a specific replica within one partition.

• Objects Migration: relocate some database objects between two partitions; the procedure is realized by first inserting new replicas of them into the destination partition and then deleting the original ones from the source partition (see the sketch after this list).
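As a concrete illustration of the third operation type, the following schematic composes an objects migration from the two replica primitives. The Partition and Tuple interfaces and the method names are our own assumptions for illustration, not SOAP's actual API.

    import java.util.List;

    interface Tuple { long key(); }

    interface Partition {
        void insertReplica(Tuple t);   // new replica creation
        void deleteReplica(long key);  // replica deletion
    }

    // Schematic outline (assumed interfaces, not SOAP's actual code):
    // objects migration composed from the two replica primitives.
    class Migration {
        static void migrate(Partition source, Partition dest,
                            List<Tuple> objects) {
            // Step 1: create new replicas at the destination partition.
            for (Tuple t : objects) {
                dest.insertReplica(t);
            }
            // Step 2: delete the original replicas from the source
            // partition. Both steps should run within one transaction
            // so that a tuple is never left without any replica.
            for (Tuple t : objects) {
                source.deleteReplica(t.key());
            }
        }
    }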

To execute the repartition operations, the scheduler packages them into repartition transactions using the information provided by the optimizer. The repartition transactions will be scheduled by the repartitioner and submitted to the system at chosen times. The scheduler utilizes a cost-based approach to determine the repartition transactions' execution order, and leverages a feedback model from control theory to determine at which frequency the repartition transactions should be scheduled for execution. As such, the processing of repartition transactions and normal transactions may be interleaved. In other words, during the database repartitioning, the processing of normal transactions will keep going on, and an online scheduling algorithm, which will be elaborated in Section 3, attempts to schedule the repartition transactions so that the time frame of executing the repartition operations is minimized while guaranteeing the correctness and performance of the concurrent processing of normal transactions.

The repartitioner accesses the system logs, manipulates the processing queue, and updates the mapping information and routing rules in the query router during and after the database repartitioning. With the piggyback-based execution method, the repartitioner may need to modify the normal transactions by inserting additional repartition operations into some of them, and the transaction manager will coordinate the processing of the modified normal transactions.

3. ONLINE REPARTITION SCHEDULING

The scheduling of repartition operations has to be done in an online fashion. Besides the incoming workload being hard to predict, there are many system factors that will cause the system performance to fluctuate over time, such as variations of network speed/bandwidth, transaction failures, and interference from other programs running on the same server. Therefore, we study how to implement an online scheduler that can continuously adapt to the system's actual workload and capacity.

3.1 Generating and Ranking Repartition Transactions

In all the subsequent scheduling algorithms, we have to first decide the execution order of the repartition transactions and schedule the more "beneficial" ones before the less "beneficial" ones. To achieve this, we need to estimate the cost and benefit of executing such transactions. To estimate the cost of a transaction under different partitioning plans, we follow the approach in [4]. Suppose the cost of running transaction Ti with a repartition plan where all the tuples accessed by Ti are collocated in a single partition is Ci; then the cost of Ti with a plan where Ti has to access more than one partition is 2Ci.

To estimate the benefit of a repartition transaction, we use the cost models of the data partitioning algorithms, such as [4, 13, 15]. Suppose the cost of an arbitrary normal transaction Ti under partition plan P is Ci(P), and let O denote the old (current) plan. Then the benefit of a repartition transaction Tj, denoted as Bj, can be defined as

    B_j = \sum_{\forall T_i} f_i \left( C_i(O) - C_i(P) \right)

where fi is the frequency of Ti. Finally, we can define the benefit density of Tj as Bj/Cj, where Cj is the cost of executing Tj, and then we can schedule the repartition transactions in descending order of their benefit densities.
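For concreteness, the fragment below computes a benefit and ranks repartition transactions by benefit density. The types and names are our own (the paper does not show SOAP's code); it simply mirrors the formulas above.

    import java.util.Comparator;
    import java.util.List;

    // Hedged sketch of the ranking step. A repartition transaction Tj
    // has cost Cj and benefit Bj = sum over affected Ti of
    // fi * (Ci(O) - Ci(P)); it is ranked by benefit density Bj / Cj.
    class RepartitionTxn {
        double cost;     // Cj: estimated execution cost of this txn
        double benefit;  // Bj: accumulated benefit over affected Ti

        double benefitDensity() { return benefit / cost; }
    }

    class BenefitRanking {
        // Bj for one repartition transaction: freq[i] is fi,
        // costOld[i] is Ci(O) under the old plan, costNew[i] is Ci(P)
        // under the new plan (Ci if Ti's tuples are collocated,
        // 2*Ci otherwise, following [4]).
        static double benefit(double[] freq, double[] costOld,
                              double[] costNew) {
            double b = 0;
            for (int i = 0; i < freq.length; i++) {
                b += freq[i] * (costOld[i] - costNew[i]);
            }
            return b;
        }

        // Scheduling order: descending benefit density.
        static void rank(List<RepartitionTxn> txns) {
            txns.sort(Comparator
                .comparingDouble(RepartitionTxn::benefitDensity)
                .reversed());
        }
    }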

To package the repartition operations into transactions, there are two simple options: (1) putting all operations into one transaction, and (2) creating one transaction for each operation. The first option will create a very large transaction, especially when there is a lot of data to be repartitioned. Such a large transaction will run for a very long time and will significantly increase resource contention. For example, with a 2PL policy, the repartition transaction has to hold the locks of all the data objects involved in the repartition plan until it is committed. This will substantially increase the degree of lock contention with the normal transactions. On the other hand, the second option will not suffer from this problem, but it will create a lot of transactions, each incurring overhead to the transaction manager. It is desirable to find a good trade-off between these two extremes. In principle, we would like to create small transactions whose overhead can be paid off by the benefit that they will bring to the system. As accurately predicting and quantifying the overhead and the degree of lock contention that will be introduced by a repartition transaction is difficult, we adopt a simple heuristic here. Roughly speaking, we put the repartition operations that repartition all the objects accessed by a normal transaction into one repartition transaction. In this way, we can ensure that there is at least one normal transaction that will benefit from executing the repartition transaction. Provided that the achieved benefit is greater than the overhead of introducing the repartition transaction, we can ensure that the overhead will be paid off when the repartition transaction is executed. Furthermore, even with this heuristic, there are still many possible ways to combine the repartition operations into transactions. We prefer generating transactions that will have higher benefit densities and, as mentioned earlier, schedule them in descending order of their benefit densities.

Algorithm 1 shows the whole process for the scheduler to generate a ranked list of repartition transactions. Given a new partition plan P generated by a cost-based repartition optimizer, the scheduler obtains a list of repartition operations OPrep together with the list of normal transactions that will access the data objects modified by each operation. In lines 1-8, we construct a map Top that maps the ID of a normal transaction ti with frequency fi to the group of repartition operations that will modify objects accessed by ti. In other words, the performance of ti will be affected by this list of repartition operations.

We then calculate the benefit of executing each repartition operation in lines 9-14. After that, we calculate the total benefit of each group of repartition operations in Top and store it in another map Tbenefit, sorted in descending order of value. Finally, we transform each group of repartition operations into a repartition transaction, making sure that each repartition operation belongs to only one repartition transaction. These repartition transactions are returned by the algorithm as its output. Furthermore, we calculate the benefit density of each repartition transaction and sort them in descending order. Given a repartition transaction ri and a normal transaction ti whose performance will be affected by ri, TRep maps the ID of ri to the ID of ti and the benefit density of ri. Such auxiliary information will be used in our subsequent scheduling algorithms.

Algorithm 1: Generating and Ranking Repartition Transactions

Data: a list of repartition operations OPrep generated by the optimizer; the new partition plan P
Result: a list of repartition transactions LRep; a map TRep mapping each repartition transaction ID to an affected normal transaction ID and the benefit density of the repartition transaction

 1  Create HashMap Top, a mapping from each normal transaction to the repartition operations that edit the objects visited by it
 2  for opk ∈ OPrep do
 3      for normal transaction ti accessing the objects modified by opk do
 4          if Ci(O) − Ci(P) > 0 then
 5              Insert opk into Top.get(ti)
 6  for ti ∈ Top.keylist do
 7      benefit ← fi (Ci(O) − Ci(P)) / Top.get(ti).size()
 8      for opk ∈ Top.get(ti) do
 9          opk.benefit += benefit
10  Create HashMap Tbenefit, a mapping from each repartition operation group ID to the total benefit for the system if all the operations within this group are executed
11  for (ti, Lop) ∈ Top.entrySet do
12      benefit ← 0; for opi ∈ Lop do
13          benefit += opi.benefit
14      Insert (ti, benefit) into Tbenefit
15  Sort Tbenefit in descending order of value
16  for (ti, benefit) ∈ Tbenefit do
17      ops ← Top.get(ti)
18      for opi ∈ ops do
19          if opi ∉ OPrep then
20              Remove opi from ops; Tbenefit.get(ti) ← Tbenefit.get(ti) − opi.benefit
21      Remove ops from OPrep
22      Create ri with ops
23      ci ← Cost(ri, O)
24      cpri ← Tbenefit.get(ti) / ci
25      Insert ((ri, ti), cpri) into TRep
26      Insert ri into LRep
27  Sort TRep in descending order of value

3.2 Basic Solution

In general, there is a tension between the two objectives in our scheduling: (1) executing the repartitioning queries as soon as possible to improve the current partitioning plan, and (2) avoiding interference with the normal transactions and making the repartition process transparent to the end users. In this subsection, we propose two baseline solutions, each favoring one of the objectives.

Apply-All. This strategy is to maximize the speed of applying the repartitioning plan and submits all the repartition transactions to the waiting queue with a priority higher than the normal transactions. As mentioned earlier, the system will schedule the transactions in descending order of their priorities, and hence this strategy is equivalent to pausing the processing of normal transactions and performing the repartitioning queries immediately. Depending on the number of repartition transactions, the normal transactions may need to wait for a rather long time, which is usually unacceptable.

After-All. To minimize the interference with normal transactions, we can use a lazy strategy where repartition transactions are scheduled only when the system is idle. We can achieve this by giving all the repartition transactions a priority lower than the normal ones. By doing so, the normal transactions will almost not be affected by repartition transactions, and the repartitioning can be done transparently. Due to this advantage, a state-of-the-art approach for online repartitioning adopted this strategy [15]. However, there is a downside to this approach: the repartitioning may be performed too slowly, especially when the system workload is high and there is very little idle time. Under this situation, the high workload could actually be alleviated by adopting the new and better partitioning plan, but this strategy fails to take advantage of that.

[Figure 2: A sample PID controller block diagram]

3.3 Feedback-based Approach

As discussed earlier, the aforementioned basic solutions lack the flexibility to find a good trade-off between the two conflicting objectives. To achieve such a trade-off, one can schedule some additional repartition transactions on top of those scheduled by the After-All strategy. These additional transactions are assigned the same priority as the normal transactions so that they have the chance to be executed faster. We call such transactions high-priority repartition transactions to distinguish them from the low-priority ones scheduled by the After-All strategy. To limit the impact on the normal transactions, we can limit the number of high-priority repartition transactions.

However, such a seemingly simple idea is rather challenging to realize in practice. Note that the number of high-priority repartition transactions that we can execute without significantly disturbing the normal transactions heavily depends on the system's current workload and capacity. In reality, the system's workload may have temporal skewness and fluctuations even if it appears to be uniformly distributed over a long period [13]. Furthermore, the system's capacity is also subject to variations caused by external factors, such as external workload imposed on the same server, or other virtual servers running on the same physical machine or cluster rack in a cloud computing environment. A desirable solution should be able to detect such short-term variations of system workload and capacity and promptly adapt the scheduling strategy accordingly. To achieve this goal, we model our system as an automatic control system and make use of the feedback control concept from control theory to design an adaptive scheduling strategy.

Control theory deals with the behavior of complex dynamic systems with inputs and outputs. A controller is engineered to generate proper corrective actions so that the system error, i.e., the difference between the desired output value, called the setpoint (SP), and the actual measured output value, called the process variable (PV), is minimized.

A commonly used controller is the Proportional-Integral-Derivative (PID) controller. Figure 2 depicts a graphical representation of a PID controller. Let u(t) be the output of the controller; then the PID controller can be defined as follows:

    u(t) = K_p e(t) + K_i \int_0^t e(\tau) \, d\tau + K_d \frac{d}{dt} e(t)        (1)

where Kp, Ki and Kd are the proportional, integral and derivative gains respectively, and e(t) is the system error at time t. The system error can be minimized by tuning the three gains so that the controller generates proper outputs. Simply put, Kp, Ki and Kd determine how the present error e(t), the accumulation of past errors \int_0^t e(\tau) d\tau, and the predicted future error \frac{d}{dt} e(t) affect the controller's output.

The system for scheduling repartition transactions can be modeled as a PID controller as follows. We use the ratio of the total cost of the high-priority repartition transactions to that of the normal transactions as the SP for the PID controller. By stabilizing this ratio, we can constrain the total workload imposed by the high-priority repartition transactions at a desirable level, so that they have limited impact on the latency of the normal transactions while, in the meantime, the speed of applying the repartitioning plan is maximized.

To capture the fluctuations of the system's workload and capacity, we divide the time into small intervals and measure the aforementioned ratio for every interval. The actual ratio that is measured is the PV of the PID controller, and hence the error can be computed as SP − PV. The output of the controller is the ratio used to calculate the number of high-priority repartition transactions that we should schedule in the coming interval.

To tune the parameters of the PID controller, we adopt an online heuristic-based tuning method known as the Ziegler-Nichols method [19].

Finally, we enforce a limit on the maximum number of high-priority repartition transactions scheduled in each time interval, to avoid significant impacts caused by sudden changes of system workload and capacity, after which the PID controller takes some time to stabilize its outputs. Imposing such a limit is essentially a conservative approach to avoid too much interference during the period in which the PID controller is stabilizing its behavior.
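A minimal discrete-time version of this controller, written from the description above, is sketched below. All class and parameter names are our own (the paper does not show SOAP's controller code), and the translation from the output ratio to a transaction count assumes the costs are measured in comparable units.

    // Hedged sketch of a discrete PID controller in the style of
    // Equation (1). SP is the target ratio of high-priority repartition
    // cost to normal-transaction cost; PV is the ratio measured over
    // the last interval.
    class PidScheduler {
        private final double kp, ki, kd;   // P, I and D gains
        private final int maxHighPriority; // per-interval cap (see text)
        private double integral = 0;
        private double prevError = 0;

        PidScheduler(double kp, double ki, double kd, int maxHighPriority) {
            this.kp = kp; this.ki = ki; this.kd = kd;
            this.maxHighPriority = maxHighPriority;
        }

        // How many high-priority repartition transactions to schedule
        // in the coming interval of length dt seconds.
        int nextCount(double sp, double pv, double normalCost,
                      double avgRepTxnCost, double dt) {
            double error = sp - pv;
            integral += error * dt;                    // accumulated error
            double derivative = (error - prevError) / dt;
            prevError = error;
            // Equation (1): u = Kp*e + Ki*integral(e) + Kd*de/dt,
            // interpreted here as the cost ratio to allot to
            // high-priority repartition work (our assumption).
            double ratio = kp * error + ki * integral + kd * derivative;
            int count = (int) Math.max(0,
                Math.floor(ratio * normalCost / avgRepTxnCost));
            return Math.min(count, maxHighPriority);   // conservative cap
        }
    }

In the experiments of Section 4, Kp = 1 and Ki = Kd = 0, which reduces Equation (1) to a purely proportional controller.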

3.4 Piggyback-based Approach

In the feedback-based method, the repartition transactions have the same priority as the normal transactions; hence they will contend with normal transactions for the locks of database objects and significantly increase the system's workload.

In this section, we propose a piggyback-based approach, which injects repartition operations into the normal transactions that access the same objects. As these transactions would acquire the locks of these objects anyway, we can save the overhead of acquiring and releasing locks as well as performing the distributed commit protocols. Moreover, we can reduce the degree of lock contention by reducing the number of transactions.

The algorithm of this approach is illustrated in Algorithm 2. The algorithm makes use of the auxiliary information produced by Algorithm 1. It examines the transaction ID ti of each incoming normal transaction and checks whether there exists a repartition transaction rj in TRep from which ti can benefit but which has not yet been executed. If so, rj will piggyback onto ti, i.e., the repartition operations in rj are injected into ti. These repartition operations will share the locks of objects with the normal transactions. This essentially leads to a repartition-on-demand strategy where data will be repartitioned only when they are accessed. After the piggybacked transaction is successfully committed, we remove the corresponding repartition transaction from TRep.

Algorithm 2: Piggyback Algorithm for incoming normal transactions Ti(k)

Data: a list of repartition transactions LRep; a map TRep from each repartition transaction ID to an affected normal transaction and the benefit density of the repartition transaction; the incoming normal transactions Ti(k) in interval k; the list of all repartition operations OPRep
Result: the piggybacked normal transactions executed in interval k

 1  Create a map P(k) of all the normal transactions that piggyback some repartition operations in interval k
 2  for ti ∈ Ti(k) do
 3      if (rj, ti) ∈ TRep.keylist then
 4          ops ← LRep.get(rj)
 5          Insert ops into ti
 6          Insert (ti, (rj, ti)) into P(k)
 7  Submit Ti(k)
 8  Get Result(k) for any finished transaction
 9  for ti ∈ P(k) do
10      if ti ∈ Result(k) then
11          if ti.success then
12              Remove P(k).get(ti) from TRep
13          else
14              Remove LRep.get(P(k).get(ti).getKey()) from ti
15              Resubmit ti

Table 1: SP values for the experiments

                          HighLoad                      LowLoad
Algorithm   Workload    α=100%  α=60%  α=20%      α=100%  α=60%  α=20%
Feedback    Zipf        1.05    1.05   1.1        1.05    1.03   1.015
            Uniform     1.25    1.25   1.25       1.02    1.03   1.02
Hybrid      Zipf        1.05    1.05   1.05       1.05    1.03   1.05
            Uniform     1.05    1.05   1.05       1.03    1.05   1.05

The piggyback method will increase the transaction failure rate, as the execution times of normal transactions are increased. If too many repartition operations piggyback onto a normal transaction, the system throughput will decrease due to unnecessary aborts caused by the failure of the piggybacked repartition operations. Therefore, we need to limit the maximum number of repartition operations that can piggyback onto each normal transaction. This parameter should be tuned at runtime to adapt to the scenarios of different systems.

3.5 Hybrid Approach

While the piggyback-based approach consumes fewer resources, it fails to make use of the available system resources to speed up the repartitioning process. This may not work well when the system workload is low and there are few transactions to piggyback on. A more desirable approach is to make use of the available resources, when the system workload is low, to repartition the data before the actual arrival of the transactions that will access them, and, when the system is somewhat congested, to take advantage of the opportunities to piggyback the repartition operations on the incoming transactions.

In this section, we present a hybrid approach composed of a piggyback module and a feedback module. The piggyback module piggybacks the repartition operations on the incoming normal transactions. Then, for each interval, the feedback module measures the actual PV value by counting both the repartition transactions and the repartition operations piggybacked on the normal transactions. In this way, the feedback module adapts the number of repartition transactions according to the actual workload of the system. In other words, when there are more incoming normal transactions for repartition operations to piggyback on, we reduce the number of stand-alone repartition transactions, and vice versa.
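The following fragment sketches how the hybrid module's PV measurement could look under this scheme; the method and parameter names are hypothetical, not taken from SOAP's code.

    class HybridFeedback {
        // PV counts both the stand-alone repartition transactions and
        // the repartition operations piggybacked onto normal
        // transactions in the interval, relative to the cost of the
        // normal transactions themselves.
        static double measurePv(double highPriorityRepCost,
                                double piggybackedOpsCost,
                                double normalTxnCost) {
            // When more operations piggyback, the measured PV rises and
            // the feedback module schedules fewer stand-alone txns.
            return (highPriorityRepCost + piggybackedOpsCost) / normalTxnCost;
        }
    }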

4. EVALUATION

In this section, we first provide some details of our system implementation and experimental configuration in Section 4.1. The experimental results under different workload conditions are presented and discussed in Sections 4.2 and 4.3.

4.1 Experimental Configuration

System Implementation and Configuration. We have used PostgreSQL 9.2.4 as the local DBMS at each data node and the JavaSE-1.6 platform for developing and testing our algorithms. We have developed a query router that uses a lookup table to route each query to its target database objects. We have also implemented a query parser that reads a query and extracts the partition attributes of the target objects, which are used for query routing and for applying our online repartition strategies. For transaction management, we use Bitronix [17], an implementation of the Java Transaction API 1.1, which adopts the XAResource interface to interact with the DBMS resource managers running on the individual data nodes and uses the 2-Phase Commit protocol for distributed transaction commits.

Our evaluation platform is deployed on an Amazon EC2 cluster consisting of 5 data nodes corresponding to 5 data partitions respectively. Each data node runs an instance of a PostgreSQL 9.2.4 server, which is configured to use the read committed isolation level and has a limit of 100 simultaneous connections. Note that a higher isolation level would decrease the system concurrency and hence lower the system's capacity, but it would not affect the performance of our algorithms. The node configuration consists of 1 vCPU using an Intel Xeon E5-2670 processor and 3.75 GB of memory with on-demand SSD local storage, running 64-bit Ubuntu 13.04. The query router runs on another EC2 instance with the same settings.

[Figure 3: Experiment Results for Transaction Failure Rate. Four panels plot failure rate over time intervals for (a) Zipf/High Workload, (b) Uniform/High Workload, (c) Zipf/Low Workload and (d) Uniform/Low Workload, comparing ApplyAll, AfterAll, Feedback, Piggyback and Hybrid.]

Workloads and Datasets. We create a table containing 500,000 tuples; each tuple has a globally unique key field and an integer content field. The size of each tuple is 8 bytes. We generate two types of workload distribution to simulate different scenarios of transaction popularity: a uniform distribution with 30,000 distinct transactions and a Zipf distribution with 23,457 distinct transactions. We generate the workload with a Zipf distribution using the parameter s = 1.16, so that the workload follows the 80-20 rule. Each normal transaction contains 5 queries. Each query accesses one unique tuple and is either a read-only or a write query with equal probability.

We use a Poisson distribution to determine how many normal transactions are submitted to the system during each interval, which is set to 20 seconds. Each run of the experiment lasts for 45 minutes, and the normal transactions are submitted to the system at the beginning of each time interval. We generate a high and a low workload as follows. LowLoad has an average system utilization of 65% before the repartitioning, measured as the percentage of time that the system spends on processing the normal transactions. HighLoad simulates an overloaded system, where the incoming workload is 30% higher than the system capacity. Under this situation, it is more urgent to adopt the repartitioning plan to reduce the effective incoming load. Furthermore, for each situation, we vary the percentage α of tuples that need to be repartitioned, from 100% down to 20%. After the repartitioning, α percent of the normal transactions are transformed from distributed transactions into non-distributed transactions.
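For concreteness, the sketch below reconstructs such a generator: a Zipf sampler with exponent s = 1.16 over the distinct transaction types, and a Poisson sampler for per-interval arrival counts. It is our own reconstruction, not the authors' load generator.

    import java.util.Random;

    // Hedged reconstruction of the workload generator described above.
    class WorkloadGen {
        private final Random rnd = new Random();
        private final double[] zipfCdf;

        WorkloadGen(int distinctTxns, double s) {
            // Precompute the Zipf CDF over transaction ranks 1..n.
            zipfCdf = new double[distinctTxns];
            double norm = 0;
            for (int k = 1; k <= distinctTxns; k++) norm += 1.0 / Math.pow(k, s);
            double acc = 0;
            for (int k = 1; k <= distinctTxns; k++) {
                acc += (1.0 / Math.pow(k, s)) / norm;
                zipfCdf[k - 1] = acc;
            }
        }

        // Rank of the next transaction type, following the Zipf law.
        int nextZipf() {
            double u = rnd.nextDouble();
            int lo = 0, hi = zipfCdf.length - 1;
            while (lo < hi) {               // binary search in the CDF
                int mid = (lo + hi) / 2;
                if (zipfCdf[mid] < u) lo = mid + 1; else hi = mid;
            }
            return lo;
        }

        // Number of transactions to submit in one interval.
        int nextPoisson(double mean) {
            if (mean > 30) {
                // Normal approximation, safe for the large means here.
                return (int) Math.max(0,
                    Math.round(mean + Math.sqrt(mean) * rnd.nextGaussian()));
            }
            // Knuth's method for small means.
            double l = Math.exp(-mean), p = 1.0;
            int k = 0;
            do { k++; p *= rnd.nextDouble(); } while (p > l);
            return k - 1;
        }
    }

With the paper's settings, new WorkloadGen(23457, 1.16) would drive the Zipf runs, with nextPoisson supplying the per-interval arrival counts.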

Algorithm Settings. We compare all the algorithms discussed in the previous sections. We use two performance metrics for comparison: the system throughput, which is counted as the maximum number of normal transactions that the system can process per unit of time, and the processing latency, which is the time between when a transaction is submitted and when its processing finishes. In order to examine the lock contention incurred by the various scheduling algorithms, we also collect the failure rate of transactions, which is defined as the ratio of the number of aborted transactions to the total number of transactions submitted to the transaction manager.

In line with the workload generation, we divide the time into 20-second intervals and run the system for 10 intervals to warm it up before we start the repartitioning. Furthermore, the feedback-based approach uses 20 seconds as the monitoring interval. The SP values used for the feedback model in each of the experiments are listed in Table 1. All the experiments use the same controller parameters: Kp = 1, Ki = 0 and Kd = 0.

4.2 Performance Under High Load

Recall that under the high workload setting, we have set the initial workload to be higher than the system's capacity, but it should become lower than the system's capacity after applying the repartition plan, as the normal transactions would consume fewer resources with the new partition plan. Therefore, it is necessary for the system to be able to process the repartition transactions soon.

Zipf workload. The experimental results are presented in Figure 4. As we discussed earlier, ApplyAll stalls all the normal transactions and executes all the repartition transactions before normal processing resumes. This should result in the fastest deployment of the new partition plan, which is verified by Figures 4a, 4b and 4c. However, as one can see from Figures 4d, 4e and 4f and Figures 4g, 4h and 4i, with this approach the system will experience a period of very low throughput and very high processing latency caused by the stalling of the normal transactions.

As shown in Figure 4i, the impact on processing latency can actually last much longer than the time needed to perform the repartitioning. This is because a long waiting queue builds up during the repartitioning process, and hence it takes a longer time to clear the queue. (Note that we have set the initial ratio of workload to system capacity to be roughly equal for all α values, so we actually submit more normal transactions in the cases with lower α values. Hence the waiting queue will be longer in those cases.)

[Figure 4: Experiment Results for Zipf High Workload. Nine panels plot repartition rate (a-c), throughput in txn/min (d-f) and latency in ms (g-i) over time intervals for α = 100%, 60% and 20%, comparing ApplyAll, AfterAll, Feedback, Piggyback and Hybrid.]

On the contrary, AfterAll basically cannot execute any repartition transactions due to the lack of system idle time; hence it cannot take advantage of the new data partition plan. The Feedback approach enforces the scheduling of some repartition transactions, and hence can make some progress in deploying the new partition plan (Figures 4a, 4b and 4c). Accordingly, the system's throughput and processing latency improve gradually. (Again, we submit more normal transactions in the cases with a lower α value, so it takes a longer time to deploy the repartition plan.)

As we analyzed in the earlier sections, the Piggyback approach can effectively reduce the cost of executing repartition operations. This is especially important when the system is under high workload and has little extra resources for repartitioning the data. Furthermore, the high arrival rate of normal transactions provides abundant opportunities for the repartition operations to piggyback. The results in Figure 4 verify our analysis. In comparison to ApplyAll, both Piggyback and Hybrid do not incur any sudden drop in system performance while still being able to quickly execute the repartition plan. They even outperform ApplyAll at almost all time intervals in terms of both throughput and latency with a lower α value, i.e., fewer tuples to be repartitioned.

Uniform workload. We also perform the experiments with a workload under a uniform distribution. The results are presented in Figure 5. The difference from the workload with a Zipf distribution is that we will not gain a lot of improvement by executing a small portion of the repartition transactions.

Similar to the previous experiments, since the workload is more than the system can handle, AfterAll could barely execute any repartition transactions to improve the system's performance, while ApplyAll finishes the repartition process in 20, 12 and 4 intervals respectively, proportional to the number of repartition transactions that need to be executed.

For the Feedback method, we set a higher SP value under the uniform workload to examine its performance when more repartition transactions are forced to be submitted to the system. Under α = 100%, we cannot apply enough repartition transactions to stop the queue size from increasing. So even though the repartition rate increases a bit in Figure 5a, the throughput and latency do not improve. But in the cases with α = 60% and α = 20%, since the number of repartition transactions we need to execute is smaller, the system finishes the repartitioning in time, after which it is able to process all the incoming normal transactions without queuing. Compared with the results in the previous experiments, a higher SP here is actually beneficial when the number of repartition transactions is relatively small and Feedback has the chance to finish them in good time.


[Figure 5: Experiment Results for Uniform High Workload. Nine panels plot repartition rate (a-c), throughput in txn/min (d-f) and latency in ms (g-i) over time intervals for α = 100%, 60% and 20%, comparing ApplyAll, AfterAll, Feedback, Piggyback and Hybrid.]

Similar to the Zipf workload, both Piggyback and Hybrid performed the best in all cases. They can achieve a speed that is almost identical to that of the ApplyAll approach.

Transaction failure rate. To further investigate the effect of the algorithms, we also collect the failure rates of transactions. Here we only report the results with α = 100%, since we experience the highest lock contention in these scenarios. The results are shown in Figure 3. We can see from Figure 3a that AfterAll has a very high failure rate during the whole period, simply because the system's workload is very high and it fails to apply the repartition plan to quickly improve the system's performance. Furthermore, both the piggyback and hybrid methods have a very low failure rate throughout the whole period, which clearly shows the piggyback-based method's advantage of lowering lock contention. On the other hand, the feedback-based method experiences some failures caused by the extra transactions it schedules.

Figure 3b shows the results with the uniform workload. The general trend is similar. But it is interesting to see that the extra failure rate caused by the piggyback strategy lasts longer than in the case with the Zipf workload. This is because there are no very hot transactions in this scenario, so we cannot deploy the repartition plan as quickly as in the case with the Zipf workload. The mechanism of executing more high-priority repartition transactions with Hybrid causes a higher initial failure rate, but it drops more quickly than with the piggyback approach. This shows that Hybrid has an edge when there are fewer transactions to piggyback on.

4.3 Performance Under Low Load

In the low load experiments, we expect the system to have more idle time, so the repartition process can be carried out more aggressively to make use of those available resources. For the Zipf workload, since there exist some transactions that have very high frequencies, the resource contention under the same load level will be higher than with the Uniform workload. The results are shown in Figures 6 and 7.

Zipf workload. ApplyAll performs similarly to how it does under the high load situation. But since there are fewer normal transactions, there are also fewer transactions queued up during the repartitioning period, and it takes a shorter time for the system to reach its maximum performance after the repartitioning period.

As the system has enough idle time now, AfterAll can submit quite a few repartition transactions. In Figure 6a, we can see that it takes some time for the system to have an idle period. It happens that there are three intervals with a very high workload generated by our Poisson load generator. When the tuples accessed by the high-frequency normal transactions are not yet repartitioned, their average execution time will be much higher than normal because of resource contention. We can see that AfterAll does not have this problem in Figures 6b and 6c. AfterAll has the minimum interference with the normal transactions when the system is able to handle the normal transaction load, as in the situations in Figures 6h and 6i, and hence it can be the algorithm that achieves the lowest average latency.

[Figure 6: Experiment Results for Zipf Low Workload. Nine panels plot repartition rate (a-c), throughput in txn/min (d-f) and latency in ms (g-i) over time intervals for α = 100%, 60% and 20%, comparing ApplyAll, AfterAll, Feedback, Piggyback and Hybrid.]

The Feedback method adds more repartition transactions to the system besides filling the idle periods. So we can see in Figure 6g that it repartitions the tuples accessed by some high-frequency transactions, making the system load decrease more quickly than with AfterAll. Adding the extra repartition transactions increases the processing latency of the normal transactions; this extra latency is a trade-off against the repartitioning speed. In Figure 6d, we can see that Feedback has a higher throughput than AfterAll, but with a higher latency before it finishes all the repartition transactions.

The overhead of the piggyback method is proportional to the number of piggybacked transactions. When the workload is low, as in the conditions of Figure 6f, it may not be able to repartition the database as fast as the other methods. But the overhead on latency is also lower when the workload is low, which is similar to AfterAll.

Hybrid performs very well even under low workload. It always finishes the repartitioning work at a speed that is only slower than ApplyAll, and in the meantime keeps the latency overhead lower than the Feedback method. Since the repartitioning speed of Hybrid is faster than that of Feedback, the period during which normal transactions experience some extra latency is much shorter.

With regard to the failure rate, the trend shown in Figure 3c is very similar to the case with high workload, except that AfterAll has a much lower failure rate in this case. This is because the system's workload is not that high, and AfterAll has the opportunity to apply the repartition plan to further improve the system's performance.

Uniform workload. The results are reported in Figure 7. In this case, the degree of resource contention among normal transactions is lower than with the Zipf workload. AfterAll can finish the repartitioning more quickly than with the Zipf load. Since the frequency of each normal transaction is low, the effect of repartitioning may take some time to kick in. An interesting phenomenon is that Piggyback works worse in this case than in the previous cases. In the uniform and low load situation, there are relatively few transactions to piggyback on, and hence it takes much longer for the piggyback approach to finish the repartitioning.

Furthermore, Figure 3d shows that, before the repartitioning finishes, the piggyback approach incurs a higher failure rate. Even though Piggyback does not issue additional transactions, the piggybacked transactions run longer than normal ones, which may still increase lock contention to a certain degree. Given the longer repartitioning period in this case, its impact on the failure rate is more prominent here. By contrast, Hybrid is able to use the available system resources to speed up the repartitioning and hence does not suffer from this problem.

5. RELATED WORK

Data partitioning for distributed database systems is about designing a data placement strategy that minimizes transaction overheads and balances the workload. Besides basic algorithms that partition the data with static functions, such as range-based or hash-based ones, researchers have recently been focusing on workload-aware partitioning algorithms.


[Figure 7: Experiment Results for Uniform Low Workload. Panels (a)-(c) plot Rep Rate, panels (d)-(f) throughput (txn/min), and panels (g)-(i) latency (ms, log scale) against the time interval, for α = 100%, 60%, and 20% respectively; the compared methods are ApplyAll, AfterAll, Feedback, Piggyback, and Hybrid.]

These algorithms take transaction statistics as input [4, 13]. Such solutions utilize various techniques to model the historical workload information and search for an optimal partition scheme according to a specific objective function. Schism [4] is an automatic partitioning tool that tries to minimize the number of distributed transactions. Horticulture [13] further improves this approach by considering temporary skewness in the workload and using a local search algorithm to optimize partition schemes; it provides a cost model for processing a transaction that considers both the number of partitions the transaction needs to access and the overall skewness of data access.

Jindal and Dittrich [8] present an online partitioning method that partitions the data at checkpoint time intervals. They generate partition schemes based on historical query execution logs and automatically update the partition plan when the workload changes. Besides static partitioning methods, there are also studies on incremental partitioning. For example, Sword [15] adopts a model similar to Schism [4] and introduces an incremental repartitioning algorithm that calculates the contribution of each repartition operation and the cost of executing it. The authors also propose a threshold on the number of repartition queries generated for each repartition step, and a constrained number of repartition queries are executed when the system is in a lean period; this approach is similar to our baseline solution AfterAll. Some commercial database systems [12, 10] support online partitioning, but all of them require that the partitioned data not be touched by normal transactions during the partitioning period, which is similar to our ApplyAll solution.

In our earlier short paper [1], we briefly presented the basic ideas of the feedback and piggyback algorithms. In this paper, we provide a more thorough analysis of the problem, consider the drawbacks of the piggyback solution, and present more experimental results to illustrate its trade-offs. We also propose the hybrid approach, which combines the feedback and piggyback algorithms to benefit from the advantages of both while avoiding the high failure rate of the piggyback solution and the high interference with normal transactions of the feedback solution.

On the other hand, some research on the quality of service (QoS) of OLTP systems, such as [11], has considered the influence of different resources on transaction performance, attempting to find the bottleneck resource for OLTP transactions and demonstrating striking performance improvements. [6] uses machine learning methods to predict multiple performance metrics of analytical queries; the solution relies on SQL text extracted from the query execution plans. In [18], the authors address the prediction of query execution time using the query optimizer's cost model and the generated query plan, which can be used to estimate transaction execution time in OLTP systems. For distributed systems, the extra cost of network communication becomes the new bottleneck that limits OLTP transaction performance [3]; it is an important part of the execution cost if we want to optimize transaction performance in a distributed environment.

Live migration of databases [5] focuses on migrating a whole database from one system to another while providing non-stop service. It does not consider the scenario of migrating data from multiple nodes to multiple destinations, which is the common case in our online data partitioning problem and may encounter distributed transactions that update data at both the source and destination nodes simultaneously. The authors of [5] provide a solution that combines on-demand pull and asynchronous push to migrate a tenant with minimal service interruption. Their solution is somewhat similar to our piggyback approach, where data that needs to be moved is migrated when incoming transactions visit it.

Transaction scheduling is an important topic studied in various areas such as Web services and database systems, and several works, such as [7, 2], try to find an optimal schedule by considering query execution time, transaction deadlines, and system workload. Given the transactions' execution times and hard execution deadlines, most of these scheduling problems are NP-complete [16]. The most common solutions are cost-based algorithms [9]; the quality of a schedule highly depends on the cost estimation and on how the execution cost of each transaction is modeled. [14] presents a scheduling and admission control method using a priority token bank in computer networks: jobs are classified into N classes, and jobs within each class are treated equally (using FIFO). This approach is much simpler than cost-based scheduling (CBS).

6. CONCLUSION

In this paper, we studied the problem of online repartitioning of a distributed OLTP database. We identified that the two basic solutions are very rigid and miss opportunities to find good trade-offs between the speed of repartitioning and the impact on normal transactions. We then proposed to use control theory to design an adaptive method which can dynamically change the frequency at which repartition transactions are submitted to the system. As putting the repartition queries into extra transactions may further increase the system's resource contention, especially when the system is under a high workload, we also proposed a piggyback-based method to mitigate the repartitioning overhead, which however does not perform well when the system has a low workload and there are few transactions to piggyback on. Our hybrid approach intelligently integrates the two approaches and is able to combine their strengths while avoiding their problems. Based on experiments running our prototype on Amazon EC2, we conclude that Hybrid is the overall best approach and achieves a great performance improvement compared to the two basic solutions used in most existing systems.

7. REFERENCES

[1] Kaiji Chen, Yongluan Zhou, and Yu Cao. Scheduling online repartitioning in OLTP systems. In Proceedings of the Middleware Industry Track, Industry papers, pages 4:1-4:6, 2014.
[2] Yun Chi, Hakan Hacıgumus, Wang-Pin Hsiung, and Jeffrey F. Naughton. Distribution-based query scheduling. Proceedings of the VLDB Endowment, 6(9):673-684, 2013.
[3] Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, and Ion Stoica. Managing data transfers in computer clusters with Orchestra. 41:98-109, 2011.
[4] Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. Schism: a workload-driven approach to database replication and partitioning. Proceedings of the VLDB Endowment, 3(1-2):48-57, 2010.
[5] Aaron J. Elmore, Sudipto Das, Divyakant Agrawal, and Amr El Abbadi. Zephyr: Live migration in shared nothing databases for elastic cloud platforms. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 301-312, 2011.
[6] Archana Ganapathi, Harumi Kuno, Umeshwar Dayal, Janet L. Wiener, Armando Fox, Michael I. Jordan, and David Patterson. Predicting multiple metrics for queries: Better decisions enabled by machine learning. pages 592-603, 2009.
[7] Shenoda Guirguis, Mohamed A. Sharaf, Panos K. Chrysanthis, Alexandros Labrinidis, and Kirk Pruhs. Adaptive scheduling of web transactions. pages 357-368, 2009.
[8] Alekh Jindal and Jens Dittrich. Relax and let the database do the partitioning online. In BIRTE, pages 65-80, 2011.
[9] Joseph Y. T. Leung. Handbook of Scheduling: Algorithms, Models, and Performance Analysis. CRC Press, 2004.
[10] MSDN Library. Partitioned tables and indexes.
[11] David T. McWherter, Bianca Schroeder, Anastassia Ailamaki, and Mor Harchol-Balter. Priority mechanisms for OLTP and transactional web applications. pages 535-546, 2004.
[12] Oracle. Using partitioning in an online transaction processing environment.
[13] Andrew Pavlo, Carlo Curino, and Stanley B. Zdonik. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In SIGMOD Conference, pages 61-72. ACM, 2012.
[14] Jon M. Peha. Scheduling and admission control for integrated-services networks: the priority token bank. Computer Networks, 31(23):2559-2576, 1999.
[15] Abdul Quamar, K. Ashwin Kumar, and Amol Deshpande. Sword: Scalable workload-aware data placement for transactional workloads. In Proceedings of the 16th International Conference on Extending Database Technology, EDBT '13, pages 430-441, 2013.
[16] Jeffrey D. Ullman. NP-complete scheduling problems. Journal of Computer and System Sciences, 10(3):384-393, 1975.
[17] Brett Wooldridge. Bitronix JTA transaction manager, 2013.
[18] Wentao Wu, Yun Chi, Shenghuo Zhu, Junichi Tatemura, H. Hacigumus, and Jeffrey F. Naughton. Predicting query execution time: Are optimizer cost models really unusable? pages 1081-1092, 2013.
[19] J. G. Ziegler and N. B. Nichols. Optimum settings for automatic controllers. Trans. ASME, 64(11), 1942.
