MySQL: Lessons Learned on a Digital Library

IEEE Software, May/June 2005. Published by the IEEE Computer Society. 0740-7459/05/$20.00 © 2005 IEEE
Open Source. Editor: Christof Ebert, Alcatel, christof.ebert@alcatel.com


Mariella Di Giacomo

Many applications and servers use databases, which no doubt belong alongside operating systems as key middleware and infrastructure that engineers must think about. MySQL is to relational databases what Linux is to operating systems, so I’m glad to have Mariella Di Giacomo of the Los Alamos National Lab contribute her experience, insight, and practical guidance with this major open source database management system.

I also welcome hearing from you, both readers and prospective authors, about the products and tools you’d like to see in this column. (Christof Ebert)

Edgar F. Codd introduced the relational model to database management in 1970 (a copy of his seminal paper is available online at www.acm.org/classics/nov95). The model has since become a standard for database management systems (DBMSs).

Relational databases store not only information about data items but also information about the relationships between them. Relational databases are powerful because they impose minimal constraints on the kinds of relationships they can represent and on how data is extracted from them. Popular relational DBMSs (RDBMSs) include commercial products, such as Oracle (www.oracle.com), Microsoft SQL Server (www.microsoft.com/sql), and Sybase (www.sybase.com), and open source products, such as PostgreSQL (www.postgresql.com) and MySQL (www.mysql.com). All of these products use the Structured Query Language for extracting database information.

In recent years, the open source products have achieved enterprise-level quality. In response, enterprises have become more interested in migrating from proprietary, commercial products to open source. Businesses around the world now commonly use the two leading open source relational DBMSs, MySQL and PostgreSQL.

The Los Alamos National Laboratory’s Research Library has used MySQL databases for years. However, a recent project to develop a comprehensive database of scientific journal articles and citation information revealed MySQL’s unique strengths and features.

Digital library requirements

Digital libraries as well as data centers have demanding requirements for capacity, speed, reliability, and flexibility. Determining the best relational DBMS isn’t an easy task, and it depends on the criteria the application must satisfy. Key RDBMS capabilities include the following:

■ Multiuser simultaneous access.

■ Easy access from APIs written in different languages. The two most well-known APIs are Open Database Connectivity and Java Database Connectivity. An API provides connectivity to the database, and SQL statements manipulate it. The combination of APIs and SQL establishes nearly complete interoperability between the database and a client.

■ Support for multiple operations that happen either all at once or not at all. A true transactional database supports atomicity (all-or-none operations), consistency (a transaction never leaves the database in an inconsistent state), isolation (transactions are separated from each other until they are completed), and durability (the database keeps track of pending changes so that the server can recover from an abnormal situation). Together, these are called the ACID properties.

■ Sophisticated search capabilities, such as joins, subselects, triggers, and views.

■ Consistent online backups performed while the database remains in a read-write state.

■ Support for managing large amounts of data while maintaining high performance.

■ Replication support that scales and achieves high availability.
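
The all-or-none behavior described in the transaction item above can be illustrated with a minimal sketch. It uses Python’s built-in sqlite3 module as a stand-in for a transactional MySQL storage engine; that substitution, the `articles` and `refs` tables, and the `add_article` helper are assumptions made for illustration and are not part of the LANL project.

```python
import sqlite3


def add_article(conn, title, cited_titles):
    """Insert an article and its references as one all-or-nothing unit."""
    try:
        # "with conn" commits on success and rolls back on any exception.
        with conn:
            cur = conn.execute("INSERT INTO articles (title) VALUES (?)", (title,))
            article_id = cur.lastrowid
            for cited in cited_titles:
                conn.execute(
                    "INSERT INTO refs (article_id, cited_title) VALUES (?, ?)",
                    (article_id, cited),
                )
        return True
    except sqlite3.IntegrityError:
        return False


# Hypothetical schema, kept in memory so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE refs (article_id INTEGER NOT NULL, cited_title TEXT NOT NULL)"
)

ok = add_article(conn, "Paper A", ["Ref 1", "Ref 2"])
# The second reference violates NOT NULL, so the article row is rolled back too.
failed = add_article(conn, "Paper B", ["Ref 3", None])

article_count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
ref_count = conn.execute("SELECT COUNT(*) FROM refs").fetchone()[0]
print(ok, failed, article_count, ref_count)
```

After the failed call, neither “Paper B” nor “Ref 3” is stored: the partial work disappears with the rollback, which is the atomicity property.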

Your requirements relative to key capabilities will determine which product works best for any given project. As a database architect and administrator, I’m concerned with all these key capabilities as well as the following:

■ commercial support of at least modest usefulness, and

■ a robust and complete logging mechanism for abnormal end conditions.

The decision must ultimately weigh not only performance, features, and support but also licensing and price factors. When cost is an issue, open source products offer free or inexpensive alternatives. We chose MySQL, which is developed, supported, and marketed by MySQL AB, a commercial company that builds its business providing services for the database product. MySQL is most commonly used for Web and embedded applications. Its speed and reliability have made it a popular alternative to proprietary database systems. It runs on several platforms and has many attractive features. We chose it for the following reasons:

■ It’s easy to use, install, and maintain. It’s well documented, has good support through the users’ group, and also offers commercial support.

■ It offers several storage models, such as InnoDB, MyISAM, and FullText. InnoDB provides MySQL with a transaction-safe (ACID-compliant) storage engine that includes commit, rollback, and crash-recovery capabilities. It also implements locking on the row level and provides a consistent nonlocking read in SELECT statements. MyISAM supports storage of more than a terabyte of data in more than 200 tables in a single project database.

■ It’s fast. MySQL can manage links among our 1.5 billion rows of data in several virtual tables. It can handle hundreds of clients connecting to the server and using multiple databases simultaneously.

■ It provides fault tolerance, load balancing, and security via replication. Replication is not a backup policy, but it provides a basic level of protection against hardware failure. We also use it to update a server inside the firewall and propagate the data to servers outside the firewall in read-only mode.

MySQL was originally developed to handle large databases much faster than existing solutions and has been successfully used in highly demanding production environments for several years. Its connectivity, speed, and security make MySQL server highly suited for accessing databases on the Internet.

Product comparison

Database administrators who have worked with commercial database engines such as Oracle or MSSQL have come to rely on a fairly broad feature set. Table 1 compares the features of four database engines, two of them commercial.

As the table shows, the differences between commercial and open source RDBMSs are minor. Specifically, there are some differences in data storage models, and PostgreSQL offers less replication support. While both Oracle and MSSQL have some features not available in the open source products, none of them were indispensable to our project. We chose MySQL over PostgreSQL primarily because it scales better and has embedded replication.

Lessons learned

The LANL Research Library’s recent project to develop a comprehensive database of scientific journal articles and citation information was its most ambitious project ever. The project converted bibliographic metadata from several data sources into a common format and enhanced the data with links between each of more than 55 million articles as well as 600 million individual references. The project also provided search capabilities and browser access to the data.

In addition to the quantity of data to be managed, other challenges included maintaining flexibility, response time, reliability, fault tolerance, budgets, and security.

Replication

Because of network latency, it’s important to keep servers involved in the replication “close.” Moreover, although MySQL replication works well, the process can break in cases of network outages, exhausted disk space, and other problems. It’s therefore critical to monitor each server’s status and error logs.

MySQL doesn’t provide scripts to monitor the replication process flow and alert operators when a problem occurs. For the LANL project, I wrote scripts to monitor the flow and server updates and to send alerts when the communications between master and slave servers failed.


Some replication management utilities have since become available, such as the MySQL::Replication and My::Rep::MySQL Perl modules.
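
The kind of check such a monitoring script performs can be sketched as follows. This is not the script written for the LANL project, which the article doesn’t show; it is a minimal illustration that assumes one row of SHOW SLAVE STATUS output has already been fetched by some client library and passed in as a dictionary. The field names follow that statement’s output, but which columns are reported depends on the server version, and `replication_alerts` and its lag threshold are hypothetical.

```python
def replication_alerts(status, max_lag_seconds=300):
    """Return a list of human-readable problems found in one status row.

    `status` is assumed to be one row of SHOW SLAVE STATUS output as a dict.
    `max_lag_seconds` is a hypothetical threshold chosen for illustration.
    """
    alerts = []
    if status.get("Slave_IO_Running") != "Yes":
        alerts.append("I/O thread is not running (cannot read from master)")
    if status.get("Slave_SQL_Running") != "Yes":
        alerts.append("SQL thread is not running (updates are not applied)")
    if status.get("Last_Error"):
        alerts.append("last error: " + status["Last_Error"])
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        alerts.append("replication lag is unknown")
    elif int(lag) > max_lag_seconds:
        alerts.append("slave is %d seconds behind the master" % int(lag))
    return alerts


# Two made-up status rows: a healthy slave and one that lost its master.
healthy = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
           "Last_Error": "", "Seconds_Behind_Master": 4}
broken = {"Slave_IO_Running": "No", "Slave_SQL_Running": "Yes",
          "Last_Error": "", "Seconds_Behind_Master": None}

print(replication_alerts(healthy))
print(replication_alerts(broken))
```

A scheduler could run such a check every few minutes and send mail whenever the returned list is not empty; disk space and error logs would need separate checks.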

Optimization

Optimization can be a complicated task because it requires a deep understanding of the whole system. The most important component in speeding up a system is the basic design. It’s essential to know how the system behaves, how it will be used, and where the bottlenecks are.

Indexing

Indexing is the most important tool for speeding up queries. Indexes are used to quickly match rows. MySQL stores most of its indexes in B-trees. It can index all column types, though in the case of CHAR and VARCHAR, it’s much faster and requires less disk space to index a column prefix rather than the entire column. Using indexes on the relevant columns is the best way to make a query faster.
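
One way to pick a prefix length for a CHAR or VARCHAR index is to measure how many rows a prefix of a given length still tells apart. The sketch below is a rule of thumb and not something the article or MySQL prescribes; the sample titles, the 95 percent target, and the table and index names in the printed statement are all made up for illustration.

```python
def prefix_selectivity(values, length):
    """Fraction of rows that a prefix of `length` characters tells apart."""
    if not values:
        return 0.0
    return len({v[:length] for v in values}) / len(values)


def shortest_useful_prefix(values, target=0.95):
    """Smallest prefix length whose selectivity reaches `target` times
    the selectivity of the full column (a heuristic, not a MySQL rule)."""
    full = len(set(values)) / len(values)
    longest = max(len(v) for v in values)
    for length in range(1, longest + 1):
        if prefix_selectivity(values, length) >= target * full:
            return length
    return longest


# Hypothetical sample of a title column.
titles = [
    "Quantum computing with ions",
    "Quantum computing with photons",
    "Quantum error correction",
    "Relational model of data",
    "Relational algebra revisited",
]
n = shortest_useful_prefix(titles)
print("CREATE INDEX idx_title ON articles (title(%d))" % n)
```

On real data the sample would come from the table itself, and a lower target usually yields a much shorter prefix at a small cost in selectivity.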

MySQL can also create n-column composite indexes (that is, indexes on multiple columns). A composite index serves as several indexes because MySQL can use any leftmost column set in the index to match rows. However, it cannot use the composite index for searches that don’t involve a leftmost prefix.
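
The leftmost-prefix rule can be made concrete with a small sketch. It models only the rule as stated above, assuming simple equality conditions on each column; it does not reproduce MySQL’s optimizer, and the `usable_index_prefix` helper and the column names are hypothetical.

```python
def usable_index_prefix(index_columns, where_columns):
    """Return the leading index columns a query can use.

    Walk the index definition from the left and stop at the first column
    the query does not constrain; the columns seen so far are the usable
    leftmost prefix. An empty result means the index cannot help.
    """
    constrained = set(where_columns)
    prefix = []
    for column in index_columns:
        if column not in constrained:
            break
        prefix.append(column)
    return prefix


index = ["journal", "year", "volume"]  # hypothetical composite index

print(usable_index_prefix(index, ["journal"]))
print(usable_index_prefix(index, ["year", "journal"]))
print(usable_index_prefix(index, ["year", "volume"]))
```

The third query constrains only `year` and `volume`, so it skips the first index column and gets an empty prefix: the composite index is of no use to it.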

Overusing indexes can cause problems. Their performance benefits do come at a price. Every additional index takes disk space and lowers the performance of write operations. In addition, indexes must be organized and resorted when table contents change, especially with tables that contain variable-length columns. And the more indexes you have, the longer it takes.

Optimizing nontransactional tables is far less expensive with MySQL tools than with SQL commands.

Disk issues

Disk access becomes important when dealing with databases of hundreds of gigabytes or more, where effective caching becomes impossible. Data distribution is very important, and disk seeks can become a big performance bottleneck.

You can move tables or databases from the MySQL database directory to other locations. The recommended way to do this is to link only databases to a different location. Nevertheless, as the database grows, it’s also useful to link some tables to distant locations, especially those that applications access concurrently. When using this approach, you need to remember that, in MySQL versions prior to 4.0, some commands such as ALTER, REPAIR, and OPTIMIZE TABLE will remove the symbolic links and replace them with the original file. This happens because these statements create a temporary file in the database directory and replace the original file with the temporary file when the statement operation is complete.

Table 1. A comparison of features for four database engines

Feature                       MySQL               PostgreSQL  Oracle              MSSQL

Data storage
  Storage models              MyISAM, InnoDB,     Postgres    Bitmapped, B-tree,  Clustered,
                              Berkeley DB,                    IOT, function-based nonclustered
                              full-text
  Reliability                 High/very high      High        High/very high      High/very high
  Scalability                 Large/very large    Large       Large/very large    Large/very large

Indexes
  Single- and multicolumn,    Yes                 Yes         Yes                 Yes
  primary key, and full text

Data integrity
  ACID compliance, row-level  Yes                 Yes         Yes                 Yes
  locking, hot backup, and
  partial rollback

Replication
  Single master               Yes                 Yes         Yes                 Yes
  Multimaster                 Yes                 Yes/no*     Yes                 Yes
  Clustering                  Yes                 No          Yes                 Yes

Interface methods
  ODBC/JDBC, C/C++, and Java  Yes                 Yes         Yes                 Yes

Advanced features
  Stored procedures, views,   Yes (starting with  Yes         Yes                 Yes
  triggers, sequences, and    version 5.x)
  cursors

*Solutions exist but they are commercial.

Query processing

Complex queries can make it difficult to understand MySQL’s rules for deciding exactly how to fetch data. Fortunately, there are a few general rules and a command to help.

MySQL will not use an index if it decides that scanning the entire table would be faster, for example, when the index would have to access roughly 30 percent of the table’s rows. If multiple indexes can satisfy a query, MySQL will use the most restrictive one (that is, the one that would fetch the fewest rows).

If the columns you’re selecting are all part of an index, MySQL might read all the data you need directly from the index and never access the table itself. When joining several tables, MySQL will first read data from the table that is likely to return the fewest rows. The order in which you specify the tables might not be the same order in which MySQL uses them. This also affects the order in which the rows are ultimately returned to you, so be sure to use an ORDER BY clause in your query if you need the rows in a particular order.

Having said all that, it’s important to realize that some MySQL decisions are actually based on guesses. You can, however, compel MySQL to use, ignore, or force a specific index or index set.

The EXPLAIN command will also help you understand what MySQL is doing to process a query.
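
Telling MySQL to use, ignore, or force an index is done with the USE INDEX, IGNORE INDEX, and FORCE INDEX hints that follow the table name, and EXPLAIN is simply prefixed to the query. The sketch below only builds the statement text; the `select_with_hint` helper, the `articles` table, and the `idx_year` index are hypothetical, and the helper does no escaping, so it suits trusted identifiers only.

```python
def select_with_hint(table, columns, where, hint=None, index=None):
    """Build a SELECT that optionally carries a MySQL index hint.

    `hint` is one of USE, IGNORE, or FORCE. Identifiers are inserted as
    given, without escaping.
    """
    if hint is not None:
        if hint.upper() not in ("USE", "IGNORE", "FORCE"):
            raise ValueError("unknown index hint: %r" % hint)
        if not index:
            raise ValueError("an index name is required with a hint")
    sql = "SELECT %s FROM %s" % (", ".join(columns), table)
    if hint is not None:
        sql += " %s INDEX (%s)" % (hint.upper(), index)
    return sql + " WHERE " + where


query = select_with_hint("articles", ["id", "title"], "year = 2004",
                         hint="force", index="idx_year")
print(query)
# Prefixing EXPLAIN asks the server to describe its plan instead of running it.
print("EXPLAIN " + query)
```

Comparing the EXPLAIN output with and without the hint shows whether the optimizer’s own guess or the forced index examines fewer rows.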

Query processing and LIMIT clause

The LIMIT clause helps to control the number of rows returned, which is useful in the Web-search context when results must be displayed in chunks.

When you combine LIMIT with a query statement such as SELECT, MySQL handles the query differently depending on the number of rows requested and their location. If the selection involves only a few rows, MySQL uses indexes rather than doing a full table scan. With MySQL (as well as other RDBMSs), you can’t assume that successive chunks fetched with LIMIT will contain consecutive results without duplications. To avoid that problem, you must use LIMIT in conjunction with ORDER BY to sort the results. MySQL ends the sorting as soon as it finds the first rows requested with LIMIT.

In cases where the ORDER BY clause costs too much in performance terms, an alternative is to create a temporary table with the data and then apply the query with LIMIT on the temporary table.
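
Chunked retrieval with ORDER BY and LIMIT can be sketched as follows. Python’s built-in sqlite3 module stands in for MySQL here, which is an assumption for the sake of a self-contained example; the `articles` table and `fetch_page` helper are hypothetical. The sketch also adds a unique tie-breaker column to the ORDER BY, a refinement not mentioned in the article: when many rows share the sort value, their relative order is otherwise unspecified and chunks could still overlap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, year INTEGER)")
conn.executemany(
    "INSERT INTO articles (id, year) VALUES (?, ?)",
    [(i, 2000 + i % 3) for i in range(1, 11)],
)


def fetch_page(conn, page, page_size):
    """Return one chunk of ids, newest year first.

    Many rows share a year, so `id` is added as a tie-breaker to make
    the ordering total and the chunks disjoint.
    """
    rows = conn.execute(
        "SELECT id FROM articles ORDER BY year DESC, id LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    )
    return [row[0] for row in rows]


pages = [fetch_page(conn, p, 4) for p in range(3)]
seen = [i for chunk in pages for i in chunk]
print(pages)
```

Every id appears in exactly one chunk, so a Web result list built this way neither repeats nor skips rows between pages as long as the data doesn’t change between requests.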

Query issues

In general, after checking the column indexes and adding any that are missing, you can speed up a slow query such as SELECT … WHERE by reducing the WHERE clause as much as possible and moving the logic into the SELECT part whenever possible. MySQL evaluates the condition in the WHERE clause against a wider set of rows, especially in a join context. The set returned by the SELECT is narrower and therefore easier and faster to manipulate than the set the WHERE clause examines.

Our experience on this project has been extremely positive. MySQL’s reputation for speed has proved itself in handling links among more than 1.5 billion rows of data in several virtual tables, laid over a terabyte of data. Moreover, we have used its replication capabilities to balance the load, reinforce fault tolerance, and protect and update our data. Our lessons learned should certainly help advocates promote MySQL and open source for building a sound technology infrastructure. Some of them might also interest practitioners or teams using other products.

Mariella Di Giacomo is a member of the Library Without Walls team at the Los Alamos National Laboratory Research Library. Contact her at [email protected] or [email protected].
