How Rakuten Reduced Database Management Spending by 90% through Clustrix Implementation
DESCRIPTION
Session slides from db tech showcase 2012: How Rakuten Reduced Database Management Spending by 90% through Clustrix Implementation
- About Rakuten
- Rakuten database environment and operational issues
- What is Clustrix?
- Clustrix verification results and implementation effectiveness
- Summary
TRANSCRIPT
October 17th, 2012
Ryutaro Yada (矢田 龍太郎)
Database Platform Group, Global Infrastructure Development Dept.
Rakuten, Inc.
Rakuten Case Study: Reducing DB Management Costs by Implementing Clustrix
How Rakuten Reduced Database Management Spending by 90% through Clustrix Implementation
Introduction
Ryutaro Yada
- Joined Rakuten in 2008
- Present job
  - Development of the database platform that supports Rakuten
  - Testing and evaluation of new techniques and new architectures with a view to adopting them
- Previous functions
  - Promotion of Oracle business with specified customers
  - Establishing a collaborative network with Oracle, developing and verifying new solutions, etc.
- LinkedIn profile: http://www.linkedin.com/pub/ryutaro-yada/32/368/4b0
Agenda
- About Rakuten
- Rakuten database environment and operational issues
- What is Clustrix?
- Clustrix verification results and implementation effectiveness
- Summary
Introduction to Rakuten
Rakuten market
- About 3,000 employees (Group: approx. 7,000)
- Market plus more than 40 services provided, including travel
- More than 120,000 contracted firms; more than 80,000,000 registered products
- Group distribution total: 3.2 trillion yen (2011)
Rakuten Global Expansion
Our Goal is to become the No. 1 Internet Service in the World
[Map: Rakuten group services by region. Legend: ★ Ichiba (EC), ★ Travel, ★ Performance marketing. Labeled locations include Taiwan and LS (UK).]
*To be open soon
Rakuten’s Global Position
- Rakuten is aiming to be the world’s largest internet firm.
- A robust and highly flexible infrastructure is required to achieve this goal.
[Chart: Retail / auction site global ranking 2011, based on number of unique visitors. Sites shown: Amazon, eBay, Alibaba, Apple, Rakuten, Walmart. Axis: 0 to 300,000.]
Source: comScore Media Metrix
Rakuten Database
- Breakdown by number of databases: approx. 80% MySQL (more than 1,100 databases)
- More than 350 MySQL database servers
- MySQL has the largest share
[Chart: Number of databases in the production environment by RDBMS: MySQL, Informix, Oracle, PostgreSQL, Teradata]
- STG and DEV each have the same number of databases
MySQL Database Issue (1)
- Data sharding operations
  - Required for scaling
  - Instance/database/table splitting, data redistribution
  - Modification of application code, control of database access
- Data protection, securing HA
  - Replication cannot guarantee zero data loss at failure
  - Switchover/switchback management takes a lot of effort
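To make the sharding burden concrete, here is a minimal sketch of application-side shard routing. It is not Rakuten's implementation: the modulo-hash rule, the function names, and the key format are all assumptions chosen for illustration. It shows why changing the shard count forces data redistribution.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (hypothetical routing rule)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)

# Hypothetical key set; real keys would be user IDs, order IDs, and so on.
keys = [f"user-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_shards=4, new_shards=5)
print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

With plain modulo routing, about four out of five keys land on a different shard after going from 4 to 5 shards, and every application that embeds `shard_for` has to change at the same time. That is the splitting, redistribution, and application modification work listed on the slide.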
MySQL Database Issue (2)
- Online maintainability
  - Schema modification, index addition and rebuild
  - Locks, access concentration
- The number of servers tends to increase
  - Slaves for load distribution, redundant slave configurations
  - Servers tend to be prepared per service (service level differences, avoiding maintenance coordination)
  - CPU efficiency decreases; data center costs increase
What is Clustrix?
Clustrix Characteristics
- Appliance-style database server
- Cluster database
- NewSQL = LegacySQL + NoSQL
  - LegacySQL: SQL access, transaction consistency
  - NoSQL: scalability, high performance
- Fault-tolerance function
- MySQL compatibility
  - Access is usually through the MySQL protocol
Clustrix Provision Model
- 2 models
Looking at Clustrix
- SSD
- Infiniband
- Low latency, high performance
Clustrix Operation
- Distributed data placement at the physical layer
- Redundancy protection, auto rebalance
- Parallel query execution
- The query, not the data, is migrated (this concept differs from Oracle RAC)
TPC-C Benchmark Result
GUI
Useful Command Interface
Clustrix Implementation Cases
- Rakuten is the first case in Japan
- Numerous cases abroad
Verification Points
- Performance
- Scalability
- Fault-tolerance verification
- Online schema modification
OLTP Performance Results (1): Insert throughput (ops/sec)
              p3       p12      p24      p48      p96      p192
Single        4014.7   8350.8   10022.3  10448.3  10520.1  10214.0
Clx 3 nodes   6301.6   18530.3  26182.8  30021.3  27581.9  24401.3
Clx 4 nodes   6090.5   20584.4  30544.8  38545.2  36837.1  33221.7
OLTP Performance Results (2): Update throughput (ops/sec)
              p3       p12      p24      p48      p96      p192
Single        3854.2   8018.6   12186.2  13385.8  13395.1  11538.3
Clx 3 nodes   3377.4   10741.8  16505.8  16964.0  16189.9  15379.6
Clx 4 nodes   3683.0   12679.3  19737.6  22232.7  21568.4  21303.7
OLTP Performance Results (3): Read throughput (ops/sec)
              p3       p12      p24      p48      p96      p192
Single        6134.4   26773.7  44388.8  56144.3  57926.6  49362.5
Clx 3 nodes   5050.2   17380.7  27803.8  39693.8  49822.3  56847.8
Clx 4 nodes   5959.9   20794.2  34743.3  54383.0  70302.3  76000.6
OLTP Performance Results (4): Mix throughput (ops/sec)
              p3       p12      p24      p48      p96      p192
Single        3976.8   8587.2   11632.6  12946.3  13122.3  12769.5
Clx 3 nodes   3113.1   12941.0  21264.6  26759.8  25976.3  25334.2
Clx 4 nodes   5150.5   15220.8  25601.7  34647.4  34697.7  30804.1
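One way to read the four OLTP charts is to compare the best throughput each configuration reached, whatever the load level. The sketch below does this with the chart values rounded to one decimal. The column labels p3 to p192 are taken from the slides as they are. They presumably denote client parallelism, but the slides do not define them, so that reading is an assumption. The names `RESULTS` and `peak_ratio` are ours.

```python
# Throughput in ops/sec, rounded from the slide charts.
# Each list follows the column order p3, p12, p24, p48, p96, p192.
RESULTS = {
    "Insert": {
        "Single":      [4014.7, 8350.8, 10022.3, 10448.3, 10520.1, 10214.0],
        "Clx 3 nodes": [6301.6, 18530.3, 26182.8, 30021.3, 27581.9, 24401.3],
        "Clx 4 nodes": [6090.5, 20584.4, 30544.8, 38545.2, 36837.1, 33221.7],
    },
    "Update": {
        "Single":      [3854.2, 8018.6, 12186.2, 13385.8, 13395.1, 11538.3],
        "Clx 3 nodes": [3377.4, 10741.8, 16505.8, 16964.0, 16189.9, 15379.6],
        "Clx 4 nodes": [3683.0, 12679.3, 19737.6, 22232.7, 21568.4, 21303.7],
    },
    "Read": {
        "Single":      [6134.4, 26773.7, 44388.8, 56144.3, 57926.6, 49362.5],
        "Clx 3 nodes": [5050.2, 17380.7, 27803.8, 39693.8, 49822.3, 56847.8],
        "Clx 4 nodes": [5959.9, 20794.2, 34743.3, 54383.0, 70302.3, 76000.6],
    },
    "Mix": {
        "Single":      [3976.8, 8587.2, 11632.6, 12946.3, 13122.3, 12769.5],
        "Clx 3 nodes": [3113.1, 12941.0, 21264.6, 26759.8, 25976.3, 25334.2],
        "Clx 4 nodes": [5150.5, 15220.8, 25601.7, 34647.4, 34697.7, 30804.1],
    },
}

def peak_ratio(workload: str, config: str, baseline: str = "Single") -> float:
    """Peak throughput of `config` divided by peak throughput of `baseline`."""
    rows = RESULTS[workload]
    return max(rows[config]) / max(rows[baseline])

for workload in RESULTS:
    r3 = peak_ratio(workload, "Clx 3 nodes")
    r4 = peak_ratio(workload, "Clx 4 nodes")
    print(f"{workload:6s}  3 nodes: {r3:.2f}x  4 nodes: {r4:.2f}x (vs. Single peak)")
```

By this measure the write-heavy workloads gain the most from the cluster. The read workload gains least at peak and only overtakes the single server at the highest load levels.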
Complex and Heavy SQL Comparison
                                  Clustrix        IA with SSD        SPARC with SAN
J) Count+GroupBy+OrderBy+Limit    1.9s (3.4s)     2.1s (8.5s)        3.4s (409.32s)
K) Count+GroupBy+OrderBy+Limit    0.7s (1.13s)    5.9s (7.49s)       13.0s (39.41s)
L) 2000 of IN+GroupBy             3.8s (8.97s)    106.5s (103.77s)   193.0s (321.68s)
M) Case+OrderBy                   31.0s (45.66s)  47.3s (60.9s)      90.5s (112.24s)
Example of Performance Improvements
- Example improvement for a particular service
  - Before: 116.8ms
  - After: 21.4ms
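The before and after figures can be turned into a speedup factor and a percentage reduction. This is a small worked calculation. It assumes the two values are response times for the same operation, which the slide does not state.

```python
before_ms = 116.8  # from the slide
after_ms = 21.4    # from the slide

speedup = before_ms / after_ms        # how many times faster
reduction = 1 - after_ms / before_ms  # share of the original time saved

print(f"{speedup:.1f}x faster, {reduction:.0%} lower latency")
# prints "5.5x faster, 82% lower latency"
```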
Fault-Tolerance Inspection
No.  Failure Test Item                        Downtime
1    Front network (port1)                    No
2    Front network (port2)                    No
3    Internal network (primary)               < 12s
4    Internal network (standby)               No
5    MySQL instance                           < 4s
6    Node OS                                  < 4s
7    Online data disk (SSD) failure           < 5s
8    Log/work data disk (SATA) failure        No
9    Infiniband switch (primary)              < 12s
10   Infiniband switch (standby)              No
11   Front network (port1 & 2)                < 18s
12   Internal network (primary & standby)     < 12s
[Diagram: DB nodes connected to Front SW1 / Front SW2 and to Infiniband SW1 / Infiniband SW2; the numbers 1 to 12 on the diagram mark where each failure was injected.]
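The longest downtime in the failure tests is under 18 seconds, so a client that retries for somewhat longer than that could ride out any of the tested failures. The sketch below is one assumed way to do this and is not something the slides describe. `call_with_retry`, its parameters, and the 30-second budget are hypothetical. A real MySQL driver raises its own exception types rather than the built-in `ConnectionError` used here.

```python
import time

def call_with_retry(operation, max_wait_s=30.0, first_delay_s=0.5,
                    max_delay_s=5.0, sleep=time.sleep):
    """Retry `operation` on ConnectionError, backing off between attempts.

    Re-raises once the next pause would push the total wait past
    max_wait_s. All default values are illustrative assumptions.
    """
    waited = 0.0
    delay = first_delay_s
    while True:
        try:
            return operation()
        except ConnectionError:
            if waited + delay > max_wait_s:
                raise
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay_s)


# Demo with a fake operation that fails three times, then succeeds.
attempts = {"count": 0}

def flaky_query():
    attempts["count"] += 1
    if attempts["count"] <= 3:
        raise ConnectionError("node unavailable")
    return "ok"

pauses = []  # record the pauses instead of actually sleeping
result = call_with_retry(flaky_query, sleep=pauses.append)
print(result, pauses)
```

The retry budget is set above the worst measured downtime, so a failure of the kind tested shows up to the caller as a slow request, not as an error.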
Time Required for Online Maintenance
Implementation Time
                Small    Medium   Large
Create Column   1.6s     13.5s    149.8s
Create Index    1.6s     13.0s    172.7s
Drop Column     1.5s     13.8s    125.5s
Drop Index      0.5s     0.5s     0.5s
Table Rows and Size
                Small         Medium          Large
Rows            50,000        500,000         5,000,000
Size (bytes)    113,639,424   1,063,190,528   10,696,130,560
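Dividing each maintenance time by the row count shows how the operations scale with table size. The sketch below normalizes the slide's figures to seconds per million rows. It assumes the two values printed without a unit (13.5 and 149.8) are seconds like the rest. The names `ROWS`, `SECONDS`, and `seconds_per_million_rows` are ours.

```python
# Row counts and timings copied from the slide.
ROWS = {"Small": 50_000, "Medium": 500_000, "Large": 5_000_000}
SECONDS = {
    "Create Column": {"Small": 1.6, "Medium": 13.5, "Large": 149.8},
    "Create Index":  {"Small": 1.6, "Medium": 13.0, "Large": 172.7},
    "Drop Column":   {"Small": 1.5, "Medium": 13.8, "Large": 125.5},
}

def seconds_per_million_rows(operation: str, size: str) -> float:
    """Normalize an operation's duration to seconds per one million rows."""
    return SECONDS[operation][size] / ROWS[size] * 1_000_000

for operation in SECONDS:
    rates = [seconds_per_million_rows(operation, size) for size in ROWS]
    formatted = ", ".join(f"{rate:.1f}" for rate in rates)
    print(f"{operation}: {formatted} s per million rows (Small, Medium, Large)")
```

All nine values fall between roughly 25 and 35 seconds per million rows, so these three operations scale approximately linearly with row count. Drop Index stays at 0.5s at every size and is left out of the calculation.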
Impacts During Online Schema Modification
- Online execution against a table of 5 million rows, about 10 GB in total
- No impact on access performance in areas other than those subject to the work
- Some impact on performance of access to the table being modified (so the work should be scheduled for periods with little impact, such as at night)
Clustrix Implementation Impacts: Release from Sharding (1)
[Diagram: Before and After configurations of DB servers]
No more sharding!
Clustrix Implementation Impacts: Release from Sharding (2)
[Chart: effort in man-months, as-is versus to-be, split into DBA and APP. Axis: 0 to 14 man-months. Basis: a large-scale sharding project, comparing the actual effort with the original.]
- No need to modify the application
- No need for DB distribution
- Sharding effort is reduced by over 90% for both application engineers and DBAs
Clustrix Implementation Impacts: Cost Reductions due to Consolidation (1)
- Sufficient performance scalability
- Fault tolerance ready for mission-critical use
- No data loss
- High online maintainability that doesn’t affect other services
- Existing MySQL databases can be consolidated onto Clustrix
Clustrix Implementation Impacts: Cost Reductions due to Consolidation (2)
- Consolidation of all existing MySQL databases within Clustrix
- Number of servers will be reduced to 10%
- Monthly system costs will be reduced to 40%
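The consolidation ratios can be combined with the server count from the Rakuten Database slide for a rough scale estimate. That slide gives "more than 350" MySQL servers, so 350 is used below as a lower bound. The resulting server count is our own estimate and does not appear in the slides.

```python
mysql_servers_before = 350  # "more than 350" in the slides; used as a lower bound
server_ratio_after = 0.10   # servers reduced to 10%
cost_ratio_after = 0.40     # monthly system costs reduced to 40%

servers_after = round(mysql_servers_before * server_ratio_after)
monthly_saving = 1 - cost_ratio_after

print(f"about {servers_after} servers after consolidation")
print(f"{monthly_saving:.0%} saving on monthly system costs")
```

Costs fall less than server counts (to 40%, not 10%), which suggests that each remaining server costs more per month than the MySQL servers it replaces.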
Back-up Structure
[Diagram: Clustrix DB nodes (Node 1, Node 2, Node 3, …), a MySQL DB, and NAS over NFS, connected as labeled below]
- Replication: slave as first backup
- Backup by mysqldump
Data Migration Procedure
[Diagram: the MySQL DB replicates to Clustrix DEV and to Clustrix PRO]
- Replication to DEV for verification
- Replication to PRO for migration
- Switching of the application access point to PRO
Other Advantages of Clustrix
- Auto-defrag
- Attentive support service
  - Advice regarding structure
  - Troubleshooting
  - Tuning advice
  - Etc.
Operational Issues Resolved with Clustrix
- Data sharding operations: unnecessary, operational cost reduction
- Data protection, securing HA: possible
- Online maintenance: possible
- Tendency for large number of units: consolidation possible, cost reduction
Clustrix at Rakuten
- An important database platform
- Provided as Database-as-a-Service
- No lead time
- Usage-based rate structure