Whitepaper

5 Signs You Might Be Outgrowing Your MySQL Data Warehouse*

*And Why Vertica May Be the Right Fit

Like Outgrowing Old Clothes...

Most of us remember a favorite shirt or pair of pants we had as kids that seemed to fit fine one day and, the next time we put it on, was suddenly much too small. You might let the hems out or cut the arm holes, but you knew it would soon be time to put it in the “too small” pile, and that a trip to the store with your mom was around the corner. Outgrowing things was a way of life back then, an inevitable step in the grand scheme and one that always seemed to lead to the next favorite shirt or toy.

This is not an attempt to trivialize data warehouse and data mart systems, but they too evolve and mature, and one day you might wake up and realize that the MySQL data warehouse you have so faithfully supported and maintained is just too small for your current analytics needs. Data volumes keep increasing, new data sources are added to the system, and performance degrades to the point that your users report that queries are taking too long or never returning. Or maybe your users are starting to run more and more sophisticated queries that you (and the database) weren’t quite ready for. Nobody wants to get to that point, so it is useful to know a few signs that you are starting to outgrow your current system, so that you can start planning the transition to a new one. This paper details the five most common signs that it may be time to consider replacing a MySQL system.

1. You are considering implementing sharding/partitioning.

Your big tables are getting REALLY big, and you’ve started to look at sharding as a way to spread the load over multiple machines and eke out the most performance you can get. Sharding can be a useful tool; however, the effort of managing it can soon outweigh the gains. According to the MySQL Performance Blog, the complexity comes down to two factors. First, the application developer has to write more code to make use of the sharding logic. You will need to rewrite most of your application and queries to point them to the correct data. Second, operational tasks become more difficult (backing up, adding indexes, changing schema). It can take a significant amount of work to build an application that works correctly while you are rolling through an upgrade where the schema is not the same on all nodes. Many of these tasks remain only semi-automated, so from an operations perspective there is often a lot more work to be done. (Tocker, 2009)
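To make the "more code" point concrete, here is a minimal Python sketch of the kind of routing logic an application takes on once it shards by hand. The shard names, the choice of `customer_id` as the sharding key, and the helpers `shard_for` and `plan_query` are all hypothetical, introduced only for illustration; they are not taken from MySQL or from the blog post cited above.

```python
import hashlib

# Hypothetical shard hosts; a real deployment would hold connection settings here.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]


def shard_for(customer_id, shards=SHARDS):
    """Pick a shard by hashing the sharding key (stable across processes)."""
    digest = hashlib.md5(str(customer_id).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def plan_query(customer_ids, shards=SHARDS):
    """Group keys by shard: the application must send one query per shard
    and merge the results itself."""
    plan = {}
    for cid in customer_ids:
        plan.setdefault(shard_for(cid, shards), []).append(cid)
    return plan


if __name__ == "__main__":
    for shard, ids in sorted(plan_query(range(1, 11)).items()):
        print(shard, ids)
```

Every query path that touches a sharded table needs a step like `plan_query`, and changing the number of shards changes where existing keys map, which is part of why the operational side becomes harder.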

Vertica implements a fundamentally better alternative to sharding called segmentation. Segmentation allows you to distribute contiguous pieces of your physical data, called segments, for fact and large dimension tables across database nodes. This maximizes database performance by distributing the load. But unlike sharding in MySQL, this is managed completely by the Vertica engine. When you create your physical tables, you specify whether you want to segment, and Vertica does the rest. Queries do not need to be aware of the segments, so no changes to your existing SQL are necessary. Without introducing any maintenance headaches, segmentation can also be used to provide high availability for your system. Redundant physical storage can be configured to provide performance optimization for different query types. Then the distribution is modified so that segments containing the same data are placed on different nodes. This ensures that if a node goes down, all of the data is still available on the remaining nodes. Again, this is managed automatically by the Vertica engine and only requires a single keyword in the table creation DDL.
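The idea of placing redundant copies of each segment on a different node can be sketched in a few lines of Python. This is an illustrative model only, not Vertica's actual placement algorithm: the functions `place_segments` and `available_segments`, the one-segment-per-node layout, and the fixed offset of 1 are assumptions made for the example.

```python
def place_segments(num_nodes, offset=1):
    """Return {node: set of segments}. Segment i is stored on node i, and a
    redundant copy is stored on node (i + offset) % num_nodes."""
    if num_nodes < 2 or offset % num_nodes == 0:
        raise ValueError("need at least 2 nodes and an offset that moves the copy")
    placement = {node: set() for node in range(num_nodes)}
    for segment in range(num_nodes):
        placement[segment].add(segment)
        placement[(segment + offset) % num_nodes].add(segment)
    return placement


def available_segments(placement, failed_node):
    """Segments still readable when one node is down."""
    surviving = set()
    for node, segments in placement.items():
        if node != failed_node:
            surviving |= segments
    return surviving


if __name__ == "__main__":
    placement = place_segments(4)
    for failed in placement:
        print(failed, sorted(available_segments(placement, failed)))
```

With four nodes, losing any single node still leaves segments 0 through 3 readable, because the copy of each segment lives on a neighbor.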

2. File sizes are too large.

In MySQL, all database interactions are managed at the file system level. Eventually, the files in MySQL become too large for the machine to manage effectively. More and more I/O is required to sift through the data in each file, and you can forget about loading the files into memory. Depending on your operating system and file store choice, a file size limit may be capping the size of your tables. Now you are being forced to make some fundamental architecture decisions. Maybe you are considering moving to InnoDB, enabling Large File Support on MyISAM, or even moving to a different operating system. All of these options carry expensive price tags in terms of time and DBA resources.

Wouldn’t it be nice if bringing more data into a system didn’t cause database structures and files to bloat? Well, Vertica engineers thought so too. Vertica automatically compresses each column using one of fifteen different methods, depending on the data type and distribution. Customers see 10x to 60x data compression as they load their raw data into Vertica. The engine is fully aware of these compression algorithms and can keep processing data in compressed form until the last possible moment. This gives you a double bang for your hardware buck: you use less disk space to store the data, and less CPU and memory to process it. As far as actual file size goes, Vertica continuously monitors file structures to “merge out” deleted data and reorganize the files for maximum space efficiency. Tables can be broken up into smaller storage units (called partitions), usually by some business construct like month or year. That way, data can be rotated out easily by dropping individual partitions, and partitions can be used during query execution to “prune” irrelevant data or to improve parallelism.
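As a rough illustration of what processing compressed data can look like, the Python sketch below run-length encodes a sorted column and answers a count query directly on the encoded runs. Run-length encoding is used here only as a familiar example of column compression; the paper does not name the fifteen methods, and the helper names `rle_encode`, `rle_decode`, and `count_equal` are hypothetical.

```python
from itertools import groupby


def rle_encode(column):
    """Compress a column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(encoded):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, length in encoded for _ in range(length)]


def count_equal(encoded, target):
    """Count rows matching target without decompressing the column."""
    return sum(length for value, length in encoded if value == target)


if __name__ == "__main__":
    state = ["CA"] * 4 + ["MA"] * 3 + ["NY"] * 2   # a sorted column of 9 rows
    encoded = rle_encode(state)
    print(encoded)                     # [('CA', 4), ('MA', 3), ('NY', 2)]
    print(count_equal(encoded, "MA"))  # 3
```

The nine-row column shrinks to three pairs, and the count touches three entries instead of nine rows. The benefit grows with longer runs, which is why sorted, low-cardinality columns compress so well.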

3. The number and size of the indexes is beginning to get cumbersome.

Indexes are good, right? They are, to a point, but eventually you will find that you are using the majority of your disk space for these adjunct structures. And more disk space consumed means less room for growth, more complicated (read: expensive) maintenance, and the need for more and larger hardware. MySQL loads indexes into memory at execution time, so if your indexes no longer fit, the performance benefit of having them disappears, which can mean longer query run times. Again, the possible solutions are smaller indexes (meaning smaller tables) or more memory. Getting this free database up and running strong is starting to look very expensive.

Vertica doesn’t have indexes. It doesn’t need them. Data is physically stored in compressed and sorted columns called projections, which essentially act as a traditional index would, but without the extra I/O overhead required for performing lookups. Projections can use all the columns in a table, or just a subset. They can be sorted differently to provide optimization for different types of queries. Since they actually store the physical data, not a pointer, having multiple projections on a table means they can be used to support high availability, since they will either be replicated or segmented and offset on each node (see #1 above). And don’t forget about the compression explained in #2; this means that even with multiple copies of the data, you are still storing a smaller amount than the actual raw data.
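A small Python sketch can show why physically sorted column data can stand in for an index on range predicates: a binary search over the sorted column finds the matching row positions directly, and other columns stored in the same order are read by position. This is a simplified model, not Vertica's storage format; `range_rows` and the sample columns are hypothetical.

```python
from bisect import bisect_left, bisect_right


def range_rows(sorted_column, low, high):
    """Return (start, stop) so that sorted_column[start:stop] holds exactly
    the rows with low <= value <= high."""
    return bisect_left(sorted_column, low), bisect_right(sorted_column, high)


if __name__ == "__main__":
    # Two columns of one hypothetical projection, both stored in order_date order.
    order_date = [20240101, 20240101, 20240105, 20240110, 20240110, 20240120]
    amount = [10.0, 25.0, 7.5, 40.0, 12.5, 99.0]

    start, stop = range_rows(order_date, 20240105, 20240110)
    print(start, stop)               # 2 5
    print(sum(amount[start:stop]))   # 60.0
```

No separate lookup structure is consulted: the sort order of the stored data itself narrows the scan to the rows of interest.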

4. Tables are getting wider.

It’s bound to happen. Users are doing more complicated analysis and asking for pre-computed columns to be added to the fact table. Or you are bringing in another data source, so your dimension tables start getting wider. MySQL is a row-based database, so every time a query asks for just one column in a table, all the other columns in the table need to come along for the ride. This can get very expensive in I/O and overall query efficiency.

Vertica is a native column-store database. Column stores offer significant gains in performance, I/O, storage footprint, and efficiency when it comes to analytic workloads. Why read and retrieve all columns in the database if you don’t need them? Unlike traditional database vendors who struggle to retrofit columnar storage into their legacy code for marginal gains, Vertica’s columnar orientation was deliberately designed into the core platform from day one. This means that all Vertica components are columnar-aware so that it delivers superior compression and encoding, better and more efficient relational join performance, and the engine is able to operate on compressed columnar data without having to unpack it.
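A back-of-the-envelope calculation makes the row-versus-column difference tangible. The Python sketch below compares the bytes a full scan has to read under each layout. The table, its column widths, and the row count are made-up numbers, and the model deliberately ignores compression, page overhead, and caching.

```python
def bytes_scanned_row_store(num_rows, column_widths):
    """A row store reads every column of every row it scans."""
    return num_rows * sum(column_widths.values())


def bytes_scanned_column_store(num_rows, column_widths, needed):
    """A column store reads only the columns the query references."""
    return num_rows * sum(column_widths[name] for name in needed)


if __name__ == "__main__":
    # Hypothetical fact table: column name -> width in bytes.
    widths = {"order_id": 8, "customer_id": 8, "order_date": 4,
              "amount": 8, "notes": 200}
    rows = 1_000_000

    row_bytes = bytes_scanned_row_store(rows, widths)
    col_bytes = bytes_scanned_column_store(rows, widths, ["amount"])
    print(row_bytes)              # 228000000
    print(col_bytes)              # 8000000
    print(row_bytes / col_bytes)  # 28.5
```

Under these assumptions, summing `amount` reads 28.5 times less data from a column layout, and the gap widens with every column added to the table.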

5. You keep maxing out your servers.

Dan Khasis, a leading MySQL performance and scaling expert, says he sees clients “reaching the threshold (of MySQL) when there are a few billion rows and people want reports (or queries) instantly, with slicing dicing and drill down, sorting and grouping. Their servers start running out of ram and start writing to disk or temp tables.” Adding more and more hardware can get expensive. Even though you are saving on license fees with MySQL, you are sinking a lot of money into your infrastructure/cloud resources.

We have discussed Vertica’s pervasive use of column compression as one way of beating the data bloat seen on other RDBMSs. Combine that with Vertica’s truly shared-nothing MPP architecture, and customers see better than linear scalability when adding new servers to the cluster (see diagram below). And this isn’t proprietary hardware or an appliance. Any well-spec’d Linux server will do just fine. Vertica’s built-in high availability also reduces the need for redundant hardware, because even if a node in the Vertica cluster goes down, the database will still be available and active, with minimal performance impact on user queries and data loads. The total cost of ownership of your data warehouse as it grows, including hardware and the technical resources to manage that hardware, should be an important factor in any long-term maintenance plan. Using a commercial RDBMS that can utilize all the hardware to the maximum extent might be the better financial choice moving forward.

“So, I may be showing some signs of outgrowing my current data warehouse database,” you might say, “but migrating a production data warehouse is no trivial matter. I would rather go back to clothes shopping with my mom when I was in junior high.” But it doesn’t have to be that painful. Vertica has many features that make a migration project a lot easier than you might think. Vertica is ANSI SQL-99 compliant, which means that your DDL and current reports will run with few changes. In most customer engagements, all the needed table DDL and query SQL is converted within hours. Vertica also has a built-in Database Designer that, once pointed to your logical schema, some sample data, and your queries, will tell you exactly which projections (the Vertica physical storage mechanism) need to be built to get optimal performance out of your new database, and will generate the DDL needed to build them. Adding new hardware as your system continues to grow won’t be an issue either. A single command adds a new node to the Vertica cluster and automatically rebalances the system for performance and high availability. As of April 2011, Vertica’s largest deployment was on 230 nodes managing over 1.5 petabytes of data, growing by a terabyte each month. Rest assured, you won’t need a new data warehouse for a long, long time.

About Vertica

Vertica, an HP Company, is the leading provider of next-generation analytics platforms enabling customers to monetize ALL of their data. The elasticity, scale, performance, and simplicity of the Vertica Analytics Platform are unparalleled in the industry, delivering 50x-1000x the performance of traditional solutions at 30% the total cost of ownership. With data warehouses and data marts ranging from hundreds of gigabytes to multiple petabytes, Vertica’s 600+ customers are redefining the speed of business and competitive advantage. Vertica powers some of the largest organizations and most innovative business models globally including Zynga, Groupon, Twitter, Verizon, Guess Inc., Admeld, Capital IQ, Mozilla, AT&T, and Comcast.

Vertica, An HP Company
8 Federal Street, Billerica, MA 01821
+1.978.600.1000 TEL
+1.978.600.1001 FAX
www.vertica.com

© Vertica 2012. All rights reserved. All other company, brand and product names may be trademarks or registered trademarks of their respective holders.