
Page 1: Vertica Database Multi Terabyte

8/11/2019

http://slidepdf.com/reader/full/ver-tica-database-multi-terabyte 1/15

 

Building a Multi-terabyte Vertica Database
Jan. 31

Vertica Confidential. Copyright Vertica Systems Inc. 2007

February, 2007


 

Logical tables are physically stored as columns, partitioned into segments across several machines, and stored in several different projections.

Projections and Performance

A projection, because of its sorting, localizes logically-grouped values, so that a single disk read can pick up many results at once. Today's disks can read astoundingly fast once positioned on the data, but still take many milliseconds to seek to a single record.

The best sort orders are determined by the where clauses of queries. If a sort order is (x,y) and a

query has “where x=1 and y=2”, all the needed data is found together in one place in the sort

order. The query will fly.

 Another important performance factor is data compression. Compression and localization play

well together, because localized data has repeated sequences that compress very well. Also,

compression makes multiple projections affordable, and thus lets more queries fly.

DB Designer

DB Designer is a component of Vertica that uses cost-based metrics to analyze your schema,

data and queries and design the best sort orders for projections. Your essential contribution is


If surprise queries are rare and the training queries are selective (under 5%), the number of disks

can be lowered. The number of processors cannot be lowered, however, without slowing down

the training queries, since they tend to run CPU-bound.

K-safety

If K=1 safety is needed, add one or two systems as replacement systems. These do not need to

be hot standby systems, just on-site. The disk-space suggestions above cover K=1 needs.

Building the Linux Infrastructure

Suggested hardware

 A wide range of possible systems can be utilized. Systems from Dell and HP are well-made and

reliable, although more expensive than ones put together without a name brand. However,

beware of cheap consumer systems, as they are often shoddily made. For good performance,

small size (rack mountable, usually 2U), and low cost, the pizza box server is an excellent choice.

 A typical 2U system can accommodate 6 disks. The following assumes direct attached storage,

the classical shared-nothing approach with local disks on each system. A SAN (storage area

network) configuration can also be utilized, but is more expensive.

Linux system for one Vertica node

1. Two CPUs, dual core if available, with at least 2MB cache, 4MB is preferable.

2. At least 4GB memory, or 2GB/CPU core for more than two CPU cores.

3. 4 matched disks for Vertica data, preferably SATA (or SAS, or other SCSI variants, but

SCSI is more expensive). 10K rotation speed is good, 15K is better, but more expensive.

Hardware RAID is another possibility, but requires another single disk for the system

disk.

4. Possibly another small disk for the operating system. This disk can have lower performance, IDE for example. See Linux System Setup below for discussion.

5. 1Gbps Ethernet interface.

6. USB 2.0 port, for loading from moveable disk.

7. If it comes with RedHat 4, make sure the OS is installed on one approximately 50 GB

partition of one disk, leaving the rest of the system disk usable for other purposes.

Additional Hardware

1. 1Gbps Ethernet switch with ports for all N systems, plus at least 2 more ports for external

connection, and a spare. Dell sells a 24-port 1Gbps switch with a 30 Gbps backplane that

works well in our experience.

2. Enough USB 2.0 disks to hold the fact data, and 5 USB hubs. Not needed if there is a 1Gbps LAN connection from the source of the data.


Linux System Setup

During Linux install, choose locale en-US.UTF-8 for US English, or another appropriate UTF-8

locale. You can set up one system partition and clone it by image copy to the other systems, and

then fix the few things that are specific to each system.

 A straightforward disk layout uses one disk for the system, and four matched disks for Vertica

data. Vertica does not require separate “temp space”, so the 4 disks can be put together in one

Linux software RAID 0 “md0” filesystem for each node, as shown in the figure below. The stripe

size (chunk size for the Linux RAID tool mdadm) should be 1MB.
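As a rough sketch, the four-disk RAID 0 described above can be assembled with mdadm along these lines. The device names /dev/sdb1 through /dev/sde1 are assumptions; run as root, and note that these commands destroy any existing data on the listed partitions.

```shell
# Build a 4-disk software RAID 0 with a 1MB stripe (mdadm --chunk is in KB).
mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=1024 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
mkfs.ext3 /dev/md0                # one big filesystem for Vertica data
mkdir -p /vdata
mount /dev/md0 /vdata
echo "/dev/md0 /vdata ext3 defaults 0 2" >> /etc/fstab   # remount at boot
```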

The system disk does not need to be a high performance disk, that is, it could be IDE for

example. If the system is restricted to a 50 GB partition of the single disk, the rest can be used

as a second filesystem without worry of using up the system filespace. This is the disk layout currently used for Vertica testing.

Proposed disk layout with system disk and 4-disk md RAID for a node 

The system disk can double as a data disk as long as it is matched to the other data disks. The

operating system does not use significant disk bandwidth once booted, so its disk is largely idle in

the above plan.

The most audacious design is a single monolithic md0 filesystem, containing the OS as well as

the Vertica data. Here the reasoning goes that the OS is no more and no less important than the database: the failure of either causes a full rebuild of the node, so everything can be in one

boat. Diagnostics can be run from a bootable CD or temporary additional disk. However, once

RAID is allowed under system, there is no reason not to let the system have its own 50GB

partition, and swap its own twice-memory-size partition, as follows:

[Figure: per-disk layout showing system, md0, swap, and extra-space partitions]


 

Proposed disk layout with system in RAID partition for a node

Mount the big data partition (md0) under the same name on all nodes, say "/vdata". With 4 250GB disks and 60GB reserved for the system, this provides 940 GB on each /vdata filesystem.

NTFS support

If you are bringing data from Windows, add NTFS support to the RedHat Linux kernel. It is

missing from RedHat for legal reasons, not stability issues. It is available as an RPM at

www.linux-ntfs.org.

Naming your systems

For ease of administration, name your Vertica systems using a repeating pattern such as

vnode01, vnode02, …, vnode20. Test the hostnames using the procedure in the installation

guide under the heading Network Configuration: “Check Hostname Resolution.”

Connectivity via port 5433

TCP port 5433 is used for JDBC, underlying both external and internal (psql) tools. Make sure

port 5433 is enabled for inside- and outside-cluster connections. See “Check Remote Access” in

the installation guide. Another port can be used if necessary, but 5433 is the default.

Testing Connectivity

 A test for this connectivity is as follows. From the client machine or another node, try “telnet

vnode01.whatever.com 5433”, and see if it connects, presenting you with a blank screen.

Disconnect with Ctrl-] followed by "q". If this test fails, it may be the fault of the RedHat firewall or SELinux

security protections.

Fixing Connectivity

These can be disabled by running the Linux command system-config-securitylevel and disabling

the firewall and SELinux, if this is consistent with your security policy. If system security depends

on these protections, enable just port 5433 for JDBC and port 22 for ssh.
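Where telnet is not installed, a bash-only port probe can stand in for the telnet test above. This is a sketch using bash's /dev/tcp device; the host name is a placeholder:

```shell
# port_check HOST PORT: print "open" if a TCP connection succeeds, else "closed".
port_check() {
  if timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}
port_check vnode01 5433   # should print "open" once the firewall passes 5433
```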



Linux note: getting X Windows to work for root

To enable X for root use for system-config-securitylevel and other X-enabled (GUI) tools, use the

shell command “xhost +” as root.

Setting the stage for the DB Design

Vertica customizes its data representation to your actual needs, expressed by you in your most

Determining the Training Queries

Consider the most time-crucial queries. Note that all queries on your data will be supported, so

there is no need to include all the columns of the fact table in the training queries. In fact, it is

important to be as restrictive as possible, consistent with your real needs.

What is most important in a training query is the where clause. A typical star query looks like this:

select … from fact
where … and … and …    -- most important part: which columns restrict the data
group by …
order by …

However, the columns mentioned anywhere in the query are also important, to ensure their

presence along with the where-clause columns in a Vertica projection. This is not to say that the

queries should be artificial. They should be legitimate important queries, except for the exact

constants involved, which are expected to be variable in practice.

Determining the Segmentation Key

The DB Designer will choose a good segmentation key based on your sample data. Still, it’s a

good idea to think about the choices and understand the considerations.

The proper choice of segmentation key, a certain column of the fact table, is an important part of

the Vertica database implementation and does not follow directly from the schema or training

queries. Each projection for a fact table has a certain segmentation key, and it is possible to

have more than one segmentation key in use for a table, but K-safety considerations tend to align

these keys so that one segment of data of one projection can be used to reload the same

segment of another projection in a failed node. Thus the important thing is determining one or

two good segmentation keys for a fact table.

The wrong segmentation key could slow down queries because of poor load balance between

nodes. The segmentation key should be a column of the fact table that is not present in the where

clauses of important queries. It should be able to compartmentalize fact data into segments so

that each important query will use all the segments.


For example, if the fact data is about employees, the Social Security number (an integer) would

have this property. Twenty segments could be based on ranges of this number:

Site s1: values less than 050000000
Site s2: values less than 100000000
…and so on.
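One way to formalize the range bucketing, as a sketch; the divisor 50000000 follows from splitting the 9-digit space into 20 equal ranges:

```shell
# segment_for SSN: map a 9-digit value to one of 20 range-based segments.
# 10# forces base-10 arithmetic so leading zeros are not read as octal.
segment_for() {
  echo "s$(( 10#$1 / 50000000 + 1 ))"
}
segment_for 049999999   # s1 (below 050000000)
segment_for 050000000   # s2 (below 100000000)
segment_for 999999999   # s20
```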

In many database systems, practical maintenance requires horizontal partitioning by time. This is

not true of Vertica, because it incrementally merges in new data and deletes old data. In fact,

since many queries are on recent data, segmentation by time is not recommended.

Mobilizing the Data

Vertica is loaded from delimited text files. See the COPY command documentation in the

Database Administrator’s Guide and further discussion below.

 A common case is moving data from its current database home to Vertica. Here is more on that

case.

Migrating the Schema

Obtaining the Schema

You may need to reconstruct the create table statements, etc., that define the schema. An ETL

tool such as Informatica can do this for you. If you want to do it without such a tool, you can see

if your source database system has a way to generate DDL. DB2 has DDL generation in its

db2look tool. Oracle has the DBMS_METADATA package, which can output SQL create table

statements, etc. However, these tools (especially from Oracle) tend to use proprietary data types

and additional storage clauses, so some edits will be needed.

DB Visualizer (www.minq.se/products/dbvis) can export schemas, is moderately priced, and

supports most important databases. It is based on JDBC, so Vertica should soon be usable

through this tool.

If you are using the free Eclipse IDE, add the WTP (Web Tools Project) package, and try its

Generate DDL wizard in the Database Explorer. However, this is not enterprise software, and

may not work for all databases.

Checking the Schema

Make sure that the needed foreign key clauses are there, in the table definitions themselves or in

separate “alter table T add constraint …” commands. See retail_define_schema.sql of the quick

start example database for an example. Add any missing foreign key constraints that hold the

star or snowflake together. The foreign key constraints are very important to guide the DB

Designer in its work.


 

Exporting the Data from the Source Database

The exact way to extract the data varies across source database systems. The data should be

exported to text form by the source database to a local file or attached disk if possible.

Oracle Note

In Oracle, for example, there is no export-to-text tool, only a load tool from text (SQL*Loader). To export data, you can run a select query in Oracle's SQL*Plus command-line

query tool with specified column delimiter, suppressed headers, etc., with redirection of

output to a local file (for example, on a USB 2.0 disk.)

To make a practical export scheme, design queries or unloads that produce 250-500GB files of

various parts of the data of the fact table, plus files for each non-fact table. For 10 TB, this

means 20-40 files for the fact table.
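A hedged sketch of one such SQL*Plus unload script; the connection string, table name, columns, and range predicate are all hypothetical and must be adapted to your schema:

```shell
# Unload one slice of the fact table as '|'-delimited text.
# sqlplus -s suppresses banners; the set commands suppress headers and paging.
sqlplus -s user/password@sourcedb <<'EOF' > /extdata/fact_part01.dat
set pagesize 0 feedback off heading off trimspool on linesize 4000
select sale_id || '|' || cust_id || '|' || amount
from sales_fact
where sale_id < 50000000;
EOF
```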

Some attention to handling special characters is needed to make sure the Vertica COPY

command will accept all the exported rows. See the Appendix on Load Format Details.

ETL products of course can handle these well-known difficulties of moving data from one

database system to another. They typically use ODBC or JDBC to extract data, which gives them

program-level access to column values to fix them up as needed for the load files.

Moving the Data

The data can be transported from its source to the Vertica installation on USB 2.0 (or possibly SATA) disks, or across a fast local network connection. Deliver chunks of data to the different Vertica nodes by connecting the transport disk or writing files from a network copy.

Fast network transfer of data

 A 1Gbps network can deliver about 50 MB/s, or 180GB/hr. Vertica can load about 200GB in 4

hours on each node (of 4 nodes), so about 50GB/hr on each node. Thus a dedicated 1Gbps LAN

should be usable. Slower LANs will be proportionally slower, and non-local networks are probably untenable, because delays over distance slow the TCP protocol down to a small percentage of its apparent bandwidth, even without competing traffic.
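A back-of-envelope check of the rates quoted above, in decimal units as the text uses them:

```shell
# 50 MB/s effective over 1Gbps -> GB delivered per hour (decimal GB).
gb_per_hr=$(( 50 * 3600 / 1000 ))   # 180 GB/hr
# 200GB loaded in 4 hours per node -> load rate per node.
load_gb_per_hr=$(( 200 / 4 ))       # 50 GB/hr
echo "network: ${gb_per_hr} GB/hr, load: ${load_gb_per_hr} GB/hr"
```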

Disk transfer of data

USB 2.0 disks can deliver data at about 30 MB/s, or 108 GB/hour, fast enough. SATA disks are usually internal, but can be external, or unplugged safely internally. USB 2.0 disks are easy to use for transporting data from Linux to Linux. Simply set up an ext3 filesystem on the disk and write large files there. Linux 2.6 has USB plug-and-play support, so a USB 2.0 disk is instantly usable on various Linux systems.
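A sketch of preparing one transport disk. The device name /dev/sdf1 and the file names are assumptions; confirm the actual device with "dmesg | tail" after plugging the disk in, and note that mkfs destroys any existing data on it.

```shell
mkfs.ext3 /dev/sdf1                      # new ext3 filesystem on the USB disk
mkdir -p /extdata
mount /dev/sdf1 /extdata
cp /exports/fact_part01.dat /extdata/    # copy a large export file onto it
umount /extdata                          # flush all writes before unplugging
```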

Using a disk for one file


For other variants of UNIX, if there is no common filesystem format available, the disk can be

used without a filesystem for a single large file. You can use “cp bigfile /dev/sdc1” for example on

the source system and access the file on the Linux system as /dev/sdd1 or whatever device it

ends up with. Even without a filesystem on the disk, the plug-and-play support still works on

Linux to provide a device node for the disk. You can find out the assigned device by the shell

command “dmesg | tail -40” after plugging in the disk.

Data from a Windows System

For Windows to Linux, NTFS is the clear choice for the filesystem, which requires the added RPM

for Linux as discussed above under Linux System Setup. Although RedHat Linux as originally

installed can read Windows FAT32 filesystems, they are useless for such large files.

Building the Vertica Cluster

Setting up Vertica

The Quick Start guide shows the basic steps. The systems involved are the set of nodes of the

cluster, plus at least one additional client system to play the part of the eventual users.

Following the Installation Guide:

1. Make sure the hostnames pass the hostname tests listed in the Installation Guide, under Check Hostname Resolution. The hostnames are used in the Vertica installation process.

2. Make sure the RedHat firewall and SELinux pass port 5433 (or whatever port you are using instead). Test with "telnet <host> 5433" to and from nodes and from the client system.

3. Create the unprivileged Linux account for administration on each node. I called it vadmin. Enable the SSH logins as directed in the Installation Guide. Give vadmin ownership of /vdata.

4. In a root login, install Vertica on one node. This node will be your top-level administration node. See the Installation Guide for details, under Initial Software Installation.

5. Create the sample data, for testing.

Following the Quick Start Guide, a first test on one node.

Do the following logged in on some node as vadmin, the one user who runs adminTools in the current version of Vertica:

1. Follow the directions to copy the sample data, but to make it more realistic, put it on

/vdata: mkdir /vdata/retail_example_database, etc. All “big” data should be under /vdata.

Of course this data is not really big.

2. Try out the single-site quick start install. It only takes a few minutes, and tests the core subset of the configuration. Use /vdata/retail_example_database/single/catalog and /vdata/retail_example_database/single/data in the create-database step (admin tool 4).

3. Try out the canned queries, as described under Running Simple Queries.

Following the Quick Start Guide, a second test on two nodes.

 Again do the following as vadmin, while logged in on the admin node.


1. Shut down the first database. Only one database at a time may be running with the

current version of Vertica.

2. Try out the multi-site quick start install. Use /vdata/retail_example_database/multi/catalog and /vdata/retail_example_database/multi/data in the create-database step (admin tool 4).

3. Again try out the canned queries, on each of the two nodes.

Now you have installed and tested the core Vertica server.

Since the psql SQL environment is working from the above tests, and it depends on JDBC, we

are assured that JDBC is being served. You can use "netstat -a | grep 5433" to see the listener.

 Again try “telnet <host> 5433” from the client system to any node involved in a Vertica database.

Note that psql can be run outside of adminTools on any Vertica node. From a client system, try

out your favorite JDBC client.

Building Your Database

Creating the Database and Running DB Designer

Leave the little example database where it is (and shut down) and set up another Vertica

database for your real data. First bring over your schema, training queries and data for non-fact

tables, and sample data for the fact table. Because this is only a moderate amount of data, it can

be transferred over the network easily.

1. Make a top-level directory in /vdata, say /vdata/sales, for this database.

2. Put the schema definition (say schema.sql) and training queries (say queries.sql) in

directory /vdata/sales/config and the data files in say /vdata/sales/inidata.

3. Do the create-cluster step of the Multi-Site Procedure in the Quick Start Guide, except

name your cluster appropriately, say “sales”.

4. The installation of Vertica has already been done at this point for two nodes, but needs to be done for the others.

5. Do the create-database step using the same name for your database as you did for the

cluster. Specify directories in /vdata such as /vdata/sales/catalog and /vdata/sales/data.

6. Run the Database Designer, from the config directory, entering the schema.sql and

queries.sql. Specify a temp directory on /vdata, say /vdata/temp, for best results. You

also specify the delimiter and null-value representations here. Provide the disk budget.

For our example 10 TB system with 20 nodes, each with 4 250GB disks and 60GB reserved for the system, we have 940 GB * 20 = 18800 GB for the disk budget.
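The budget arithmetic can be sanity-checked mechanically:

```shell
per_node=$(( 4 * 250 - 60 ))   # 4 x 250GB disks, minus 60GB system reserve: 940
budget=$(( per_node * 20 ))    # 20 nodes: 18800 GB total disk budget
echo "${per_node} GB/node, ${budget} GB total"
```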

Checking the Projection Design

It is a good idea to examine the output of the DB Designer to make sure the projections are in fact sorted on columns of importance to the where clauses of your training queries. Make sure the

segmentation key is right. If a projection seems unneeded or otherwise unexpected, check if a

training query is not what you intended, or possibly not really important. Contact technical support

if needed. It is much easier to fix design problems at this point than later on.


Implementing the Projections and doing the initial small load

1. Connect to the production database and set up the schema from

/vdata/sales/config/schema.sql. Primary and foreign key columns cannot have null

values. If “not null” is missing from the schema definition for such a column, it will be

added with a warning at this point, but this is harmless.

2. Generate the projections from the autoDBDesign.sql generated by DB Designer.

3. Load the dimensions and then the sample fact data using COPY DIRECT.

Here is a sample load command for comma-delimited data for a dimension table named

promotion, from a file on a USB disk mounted as /extdata, with comma-separated column values

and null values indicated as “null”:

copy promotion from '/extdata/promotion.dat' delimiter ',' null 'null' direct

Copy commands will not fail if a few rows are rejected, for example, for having the wrong number

of delimited values. Thus the whole COPY is not a transaction but rather each row addition is

committed. See the COPY command documentation in the DBA Guide. Check the Vertica log file

for rejected rows and other diagnostics. The log file location is displayed by the tool

/opt/vertica/bin/dbInfo. You can fix up the problems and load the corrected rows.

Testing the initial database

The database is now functional, although still relatively small, since only the sample fact data is

loaded so far. You should try out your training queries at this point. If anything fails or runs

slowly, study the projection definitions for problems, and contact technical support. If another

projection is needed, for example, you will need to redo the database build with the new

projection, a relatively easy task at this point before the main body of data is loaded, but much

harder later.

The Big Load

Suppose we have a 10-20 node system and 20 500GB USB 2.0 disks for transporting data. We

can start loads on each of 5 nodes by accessing one transport disk on each and starting a COPY

DIRECT for its data. When they are all done, another 5 can be loaded, and so on for four rounds.

This choice of 5 parallel loads is just an example. You may be able to do more at once. You can

add parallel loads until it stops loading faster.

With a USB hub, the 4 transport disks for a node can all be accessible without recabling, and the

four parts can proceed one after another following a script.

Testing Your Database

Before doing serious queries, be sure to run SELECT ANALYZE_STATISTICS('projectionname');

for each projection. The tuple mover will periodically rerun this to keep statistics current.

Now your cluster is up and running. Try the training queries and then some other queries. Check

the size of each table with a count(*) query. Enjoy the speed!


Running Your Database

Over time, new data needs to be added, and eventually old data deleted. Unlike many other

databases, no “reorg” is needed, since Vertica is continuously merging in new data and rewriting

the older data. A process that inserts new rows and deletes old rows typically accesses the

database via JDBC, connected from the intermediate systems that control the external data flow.

This process is called a trickle load, and allows new data to be added even while queries are

actively running.

 APPENDIX

Load Data format details

The data delimiter and quote character

Choosing the right column-value delimiter is important. You need to choose a character that

does not show up in any char(n) or varchar(n) data values. The vertical bar '|' is a good one to try. You can test for the existence of a certain character c in column x with the query "select count(*) from T where x like '%c%'". If a few values are using '|', they can be eliminated from the

main load by a where clause and separately loaded using another delimiter. Alternatively, one

could try to quote the delimiter character with \ if the database can do this. Also, \ chars in the

char data will disappear on load into Vertica unless doubled up, and newlines will cause trouble

too.
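If the data has already been exported with some other delimiter, a plain grep gives the same kind of check on the files themselves. A sketch; the sample file here is made up:

```shell
# count_delim FILE CHAR: rows whose data contains the candidate delimiter.
count_delim() {
  grep -cF -- "$2" "$1" || true
}
printf 'alpha,beta\ngam|ma,delta\n' > /tmp/export_sample.dat   # toy export
count_delim /tmp/export_sample.dat '|'   # 1: one row would break a '|' load
```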

Oracle has a REGEXP_REPLACE function that can substitute one substring with another,

although this will slow down the unload operation significantly. It might be practical to use a where

clause to avoid problem rows on the main load, and the opposite where clause with

REGEXP_REPLACE for just the problem rows.

Non-ASCII data

Vertica stores data in the UTF-8 compressed encoding of Unicode. The resulting UTF-8 codes

are identical to ASCII codes for the ASCII characters (codes 0 to 127 in one byte). If your table

data (char columns) is all ASCII, it should be easy to transport, since all current OS environments

treat it the same way. If you have UTF-8 data, it is just a matter of preserving it that way. Make

sure that the extraction method does not convert char column values to the current (non-UTF-8)

locale of the source system. On most UNIX systems, you can see the current locale with the

“locale” command, and change it for a session by setting the LANG environment variable to

en_US.UTF-8. If you have data in another character encoding such as Latin-1 (ISO 8859-1), it needs to be converted to UTF-8 if your data is actively using the non-ASCII characters of Latin-1

such as the euro sign and the diacritical marks of many European languages. The Linux tool

iconv can do the needed conversions. Luckily it is rare to have non-ASCII characters in the fact table, so these conversions are usually needed only for the smaller dimension tables' data.
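As a small worked example of the iconv conversion (the file names are placeholders; 0xE9 is "é" in Latin-1):

```shell
printf '\xe9' > /tmp/latin1_sample.dat            # one Latin-1 byte: "é"
iconv -f LATIN1 -t UTF-8 /tmp/latin1_sample.dat > /tmp/utf8_sample.dat
od -An -tx1 /tmp/utf8_sample.dat                  # c3 a9: the UTF-8 form
# Typical use: iconv -f LATIN1 -t UTF-8 promotion.dat > promotion.utf8.dat
```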