SQL Server Search Architecture

The SQL Server full text search support leverages the same underlying full text search access method and infrastructure employed in other Microsoft products, including Exchange, Sharepoint Portal Server, and the Indexing Service which supports full text search over filesystem hosted data. This approach has several advantages, the most significant of which are 1) common full text search semantics across all data stored in relational tables, the mail system, web hosted data, and filesystem resident data, and 2) leverage of full text search access method and infrastructure investments across many complementary products.

Indexed text in SQL Server can range from a simple character string data to documents of many types, including Word, Powerpoint, PDF, Excel, HTML, and XML. The document filter support is a public interface, enabling custom filters for proprietary document formats to be integrated into SQL Server. The architecture is composed of five modules hosted in three address spaces (figure 1: Architecture of SQL Server Full Text Search): 1) content reader, 2) filter daemon, 3) word breaker, 4) indexer, and 5) query processor.

Full text indexed data stored in SQL Server tables is scanned by the content reader, where packets are assembled including related metadata. These packets flow to the main search engine, which triggers the search engine filter daemon process to consume the data read by the content reader. Filter daemons are modules managed by MS Search but run outside of the MS Search address space. Because the search architecture is extensible and filters may come from the shipped product, be supplied by an ISV, or be produced by a customer, there is a non-zero risk that a filter bug, or a combination of a poorly formed document and a filter bug, could cause the filter to either fail or not terminate. Running the filters and word breakers in an independent process allows the system to remain robust in the presence of these potential failure modes. For example, if an instance of the daemon process is seen to consume too much memory, the MS Search process kills it and restarts a new instance.

Filters are invoked by the daemon based on the type of the content. Filters parse the content and emit chunks of processed text. A chunk is a contiguous portion of text along with some relevant information about the text segment, such as the language-id of the text and any attribute information. Filters emit chunks separately for any properties in the content. Properties can be items such as title or author; they are specific to the content types and therefore understood by the filters.


Figure 1: Architecture of SQL Server Full-Text Search

The next step in the process is the breaking of the chunks into keywords. Word breakers are modules which are human-language aware. SQL Server search installs word breakers for various languages including, but not limited to, English (USA and UK), Japanese, German, French, Korean, Simplified and Traditional Chinese, Spanish, Thai, Dutch, Italian, and Swedish. The word breakers are also hosted by the filter daemons, and they emit keywords in Unicode, alternate keywords in Unicode, and the location of the keyword in the text. These keywords and related metadata are transferred to the MS Search process via a high speed shared memory protocol which feeds the data into the indexer. The indexer builds an inverted list with a batch of keywords. A batch consists of all the keywords from one or more content items. Once MS Search persists this inverted list to disk, it sends a notification back to the SQL Server process confirming success. This protocol ensures that, although documents are not synchronously indexed, documents won't be lost in the event of process or server failures, and it allows the indexing process to be restartable based upon metadata maintained by the SQL Server kernel.

As with all text indexing systems we've worked upon, the indexes are in a highly compressed form which increases storage efficiency but runs the risk of driving up the keyword insertion cost. To obtain this storage size reduction without substantially penalizing the insertion operation, a stack of indexes is maintained. New documents are built into a small index, which is periodically batch merged into a larger index which, in turn, is periodically merged into the base index. This stack of indexes may be greater than three deep, but the principle remains the same, and it is an engineering approach that allows the use of an aggressively compressed index form without driving up the insertion costs dramatically. When searching for a keyword, all indexes in this stack need to be searched, so there is some advantage in keeping the number of indexes small. During insertion and merge operations, distribution and frequency statistics are maintained for internal query processing use and for ranking purposes.


This whole cycle sets up a pipeline involving the SQL Server kernel, the MS Search engine, and the filter daemons, the combination of which is key to the reliability and performance of the SQL Server full text indexing process.

3. SQL Server Full-text Search Query Features

The full text indexes supported by SQL Server are created using the familiar CREATE INDEX SQL DDL statement. These indexes are fully supported by standard SQL Server utilities, such as backup and restore, and other administrative operations, such as attach/detach of databases, work unchanged in the presence of full text search indexes. Other enterprise-level features, including shared disk failover clustering, are fully supported in the presence of full text indexes.

Indexes are created and maintained online using one of the following options:

1. Full Crawl: scans the full table and builds or rebuilds a complete full text index on the indexed columns of the table. This operation proceeds online with utility progress reporting.

2. Incremental Crawl: uses a timestamp column on the indexed table to track changes to the indexed content since the last re-index.

3. Change Tracking: used to maintain near real time currency between the full text index and the underlying text data. The SQL Server Query Processor directly tracks changes to the indexed data, and these changes are applied in near real time to the full text index. (An illustrative DDL sketch follows this list.)
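To make these options concrete, the following sketch shows one way they surface in DDL. It uses the CREATE FULLTEXT INDEX syntax of later SQL Server releases rather than the system stored procedures of earlier versions, and it assumes a Documents table with a unique key index named PK_Documents; both names are placeholders.

-- Assumes a Documents table with a unique key index PK_Documents; syntax is from later releases.
CREATE FULLTEXT CATALOG DocsCatalog;

-- Option 3, change tracking: SQL Server propagates changes to the full text index automatically.
CREATE FULLTEXT INDEX ON dbo.Documents (Title, Content)
    KEY INDEX PK_Documents ON DocsCatalog
    WITH CHANGE_TRACKING AUTO;

-- With change tracking off, populations are started explicitly:
-- ALTER FULLTEXT INDEX ON dbo.Documents START FULL POPULATION;        -- option 1: full crawl
-- ALTER FULLTEXT INDEX ON dbo.Documents START INCREMENTAL POPULATION; -- option 2: incremental crawl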

The full text search support is exposed in SQL using the following constructs:

1. Contains Predicate: The contains predicate has the basic syntactic form Contains(col_list, '<search condition>'). This predicate is true if any of the indicated columns in the list col_list contains terms which satisfy the given search condition. A search condition can be a keyword, a keyword prefix, a phrase, a more general proximity term, or some combination of all of these. For example, the predicate Contains(description, 'word* or Excel or "Microsoft Access"') will match all entries whose description contains words like 'word', 'wordperfect', 'wordstar', 'wordings', 'Excel', or the phrase 'Microsoft Access'.

2. Freetext Predicate: Freetext predicates are similar to the contains predicate except that they match text containing not only the terms in the search condition, but also terms which are linguistically similar (stemming) to them. Thus Freetext(description, 'run in the rain') will match all items whose description column contains terms like run, running, ran, rain, raining, rains, etc.

3. ContainsTable & FreetextTable: The previous two predicates, Contains and Freetext, match data which satisfy search terms. However, they do not provide any way of obtaining a rank measure of the match. ContainsTable and FreetextTable are table-valued functions which locate entries using a search condition similar to that of the Contains and Freetext predicates, and return the items along with a rank value for each matching item. The rank is computed using keyword distribution within the data as well as in the whole corpus.

The search condition for any of the predicates described above can include:

1. Keyword lookup: E.g. Contains(*,’searched-for-word-or-phrase’)


2. Linguistic generation of relevant keywords:

a. Stemming: Freetext(*,'distributed') finds all documents containing the keyword "distributed" and all similar forms.

b. Thesaurus and Inflectional Forms: For example, Contains(*,'FORMSOF(INFLECTIONAL, distributed) AND FORMSOF(THESAURUS, databases)') will find documents containing inflectional forms of "distributed" and all words meaning the same as databases (thesaurus support).

3. Weighted Terms: Query terms can be assigned relative weights to impact the rank of matching documents when one wants to favor one term over another. In the following, the spread search term is given twice the weight of the sauces search term which is, in turn, given twice the weight of the relishes search term:

SELECT a.CategoryName, a.Description, b.rank
FROM Categories a,
     ContainsTable(Categories, description,
        'ISABOUT (spread weight (.8), sauces weight (.4), relishes weight (.2))') b
WHERE a.categoryId = b.[key]

4. Phrase and Proximity Query: One can specify queries over phrases, and more generally using proximity (NEAR) between terms in a matching document, e.g. “distributed NEAR databases” matches items in which the term distributed appears close to the term databases.

5. Prefix match: Search conditions can specify a query for matching the prefix of a term. For example, 'data*' matches terms such as data, databases, datastore, datasource, etc.

6. Composition: Terms can be composed using conjuncts (AND), disjuncts (OR) and conjuncted complementation ( AND NOT ).

4. Examples of Full text Query Scenarios

Scenario 1. We have a table with documents published in a site. The table has a schema Documents(DocumentId, Title, Author, PublishedDate, Version, RevisionDate, Content) with a full-text index built on columns Title and Content. Following are some queries one can issue on this table.

SELECT title, author

FROM Documents

WHERE Author = 'Linda Chapman' and Contains(Title, 'child NEAR development')

This query finds information on documents authored by Linda Chapman where title includes the term child close to the term development.

SELECT a.Title, a.Author, a.PublishedDate, b.rank
FROM Documents a,
     FreetextTable(Documents, Content, '"child development" AND insomnia') b
WHERE a.DocumentId = b.[key] and a.Author = 'Linda Chapman'
ORDER BY b.rank desc

This query finds all documents authored by Linda Chapman on child development and insomnia. The result is presented in descending order of rank.

Scenario 2. We want to search for information distributed in heterogeneous sources. Data is stored in an Exchange mail server in email format, in the filesystem in form of locally-authored documents, and in SQL Server in form of published documents. SQL Server content schema is the same as above in Scenario 1. The filesystem content index is provided by the filesystem indexing service. The following query gets all documents related to marketing and cosmetics from the email store, the filesystem, and from the SQL Server document store.

--Get qualifying email docs
SELECT DisplayName, hRef, MailFrom, Subject
FROM openquery(exchange,
    'SELECT "DAV:displayname" as DisplayName, "DAV:href" as hRef,
            "urn:schemas:mailheader:from" as MailFrom,
            "urn:schemas:mailheader:subject" as subject
     FROM "manager\Inbox"
     WHERE contains(*, ''marketing AND cosmetics'')')

UNION ALL --Get qualifying filesystem data
SELECT filename, vpath, docauthor, doctitle
FROM OpenQuery(Monarch,
    'SELECT vpath, Filename, size, doctitle, docauthor
     FROM SCOPE(''deep traversal of "c:\tapasnay\My Documents" '')
     WHERE contains(''marketing AND cosmetics'')')

UNION ALL --Get qualifying SQL Server data
SELECT Title, 'SQLServer:Documents:' + cast(DocumentId as varchar(20)) as docref,
       author, title
FROM Documents
WHERE contains(*, 'marketing AND cosmetics')


5. Conclusions

In this paper we motivate the integration of a native full text search access method into the Microsoft SQL Server product, describe the architecture of the access method, and discuss some of the trade-offs and advantages of the engineering approach taken. We explore the features and functions of the full text search support and provide example SQL queries showing query integration over structured, semi-structured, and unstructured data.

For further SQL Server 2000 full text search usage and feature details one may look at Inside Microsoft SQL Server [1] or SQL Server 2000 Books online [2]. On the implementation side, we are just completing a major architectural overhaul of the indexing engine and its integration with SQL Server in the next release of the product and this paper is the first description of this work.

6. References

[1] Delaney, Kalen. Inside Microsoft SQL Server 2000, Microsoft Press, 2001.

[2] SQL Server 2000 Books Online

As a guideline, clustered indexes should be Narrow, Unique, Static and Ever-Increasing (NUSE).

Clustered indexes are the cornerstone of good database design. A poorly-chosen clustered index doesn't just lead to high execution times; it has a 'waterfall effect' on the entire system, causing wasted disk space, poor IO, heavy fragmentation, and more.

The attributes that make up an efficient clustered index key are:

Narrow – as narrow as possible, in terms of the number of bytes it stores

Unique – to avoid the need for SQL Server to add a "uniqueifier" to duplicate key values

Static – ideally, never updated

Ever-increasing – to avoid fragmentation and improve write performance

How clustered indexes work

In order to understand the design principles that underpin a good clustered index, we need to discuss how SQL Server stores clustered indexes. All table data is stored in 8 KB data pages. When a table contains a clustered index, the clustered index tells SQL Server how to order the table's data pages. It does this by organizing those data pages into a B-tree structure, as illustrated in Figure 1.


Figure 1: The b-tree structure of a clustered index

It can be helpful, when trying to remember which levels hold which information, to compare the B-tree to an actual tree. You can visualize the root node as the trunk of a tree, the intermediate levels as the branches of a tree, and the leaf level as the actual leaves on a tree.

The leaf level of the B-tree is always level 0, and the root level is always the highest level. Figure 1 shows only one intermediate level but the number of intermediate levels actually depends on the size of the table. A large index will often have more than one intermediate level, and a small index might not have an intermediate level at all.

Index pages in the root and intermediate levels contain the clustering key and a page pointer down into the next level of the B-tree. This pattern will repeat until the leaf node is reached. You'll often hear the terms "leaf node" and "data page" used interchangeably, as the leaf node of a clustered index contains the data pages belonging to the table. In other words, the leaf level of a clustered index is where the actual data is stored, in an ordered fashion based on the clustering key.
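If you want to see these levels for yourself, the sys.dm_db_index_physical_stats DMV (discussed again later in this article) reports one row per level when run in DETAILED mode. The sketch below is illustrative; dbo.SomeTable is a placeholder name, and index_id 1 denotes the clustered index.

SELECT index_level, page_count, record_count
FROM sys.dm_db_index_physical_stats(
        DB_ID(), OBJECT_ID(N'dbo.SomeTable'), 1, NULL, 'DETAILED');
-- index_level 0 is the leaf level; the highest index_level returned is the root.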

Let's look at the B-tree again. Figure 2 represents the clustered index structure for a fictional table with 1 million records and a clustering key on EmployeeID.

Figure 2: A b-tree index for a 1-million row table

The pages in Level 1 and Level 2, highlighted in green, are index pages. In Level 1, each page contains information for 500,000 records. As discussed, each of these pages stores not half a million rows, but rather half a million clustered index values, plus a pointer down into the associated page on the next level. For example, to retrieve the details for Employee 500, SQL Server would read three pages: the root page in Level 2, the intermediate page in Level 1, and the appropriate leaf level page in Level 0. The root page tells SQL Server which intermediate level page to read, and the intermediate page tells it which specific leaf level page to read.

Index seeks and index scans

When specific data is returned from a data page in this fashion, it is referred to as an index seek. The alternative is an index scan, whereby SQL Server scans all of the leaf level pages in order to locate the required data. As you can imagine, index seeks are almost always much more efficient than index scans. For more information on this topic, please refer to the Further Reading section at the end of this article.

In this manner, SQL Server uses a clustered index structure to retrieve the data requested by a query. For example, consider the following query against the Sales.SalesOrderHeader table in AdventureWorks, to return details of a specific order.

SELECT CustomerID,
       OrderDate,
       SalesOrderNumber
FROM Sales.SalesOrderHeader
WHERE SalesOrderID = 44242;

This table has a clustered index on the SalesOrderID column and SQL Server is able to use it to navigate down through the clustered index B-tree to get the information that is requested. If we were to visualize this operation, it would look something like this:

Root node

SalesOrderID    PageID
NULL            750
59392           751

Intermediate level (Page 750)

SalesOrderID    PageID
44150           814
44197           815
44244           816
44290           817
44333           818

Leaf level (Page 815)

SalesOrderID   OrderDate    SalesOrderNumber   AccountNumber     CustomerID
44240          9/23/2005    SO44240            10-4030-013580    13580
44241          9/23/2005    SO44241            10-4030-028155    28155
44242          9/23/2005    SO44242            10-4030-028163    28163

In the root node, the first entry points to PageID 750, for any values with a SalesOrderID between NULL and 59391. The data we're looking for, with a SalesOrderID of 44242, falls within that range, so we navigate down to page 750, in the intermediate level. Page 750 contains more granular data than the root node and indicates that the PageID 815 contains SalesOrderID values between 44197 and 44243. We navigate down to that page in the leaf level and, finally, upon loading PageID 815, we find all of our data for SalesOrderID 44242.

Characteristics of an effective clustered index

Based on this understanding of how a clustered index works, let's now examine why and how this dictates the components of an effective clustered index key: narrow, unique, static, and ever-increasing.

Narrow

The width of an index refers to the number of bytes in the index key. The first important characteristic of the clustered index key is that it is as narrow as is practical. To illustrate why this is important, consider the following narrow_example table:

CREATE TABLE dbo.narrow_example (
      web_id      INT IDENTITY(1,1) ,   -- unique
      web_key     UNIQUEIDENTIFIER ,    -- unique
      log_date    DATETIME ,            -- not unique
      customer_id INT                   -- not unique
) ;

The table has been populated with 10 million rows, and the table contains two columns that are candidates for use as the clustering key:

web_id – a fixed-length int data type, consuming 4 bytes of space

web_key – a fixed-length uniqueidentifier data type, consuming 16 bytes.

TIP: Use the DATALENGTH function to find how many bytes are being used to store the data in a column.
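For example, against the narrow_example table above, a quick check along those lines might look like this (the byte counts in the comments assume the fixed-length types declared earlier):

SELECT TOP (1)
       DATALENGTH(web_id)  AS web_id_bytes,   -- 4 bytes for the int
       DATALENGTH(web_key) AS web_key_bytes   -- 16 bytes for the uniqueidentifier
FROM dbo.narrow_example;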

So, which column will make a better clustered index key? Let's take a look at the B-tree structure of each, shown in Figure 3.

Figure 3: The b-tree levels for clustered indexes based on int and uniqueidentifier keys

The most obvious difference is that the uniqueidentifier key has an additional non-leaf level, giving 4 levels to its tree, as opposed to only 3 levels for the int key. The simple reason for this is that the uniqueidentifier consumes 300% more space than the int data type, and so when we create a clustered key on uniqueidentifier, fewer rows can be packed into each index page, and the clustered key requires an additional non-leaf level to store the keys.


Conversely, using a narrow int column for the key allows SQL Server to store more data per page, meaning that it has to traverse fewer levels to retrieve a data page, which minimizes the IO required to read the data. The potential benefit of this is large, especially for range scan queries, where more than one row is required to fulfill the query criteria. In general, the more data you can fit onto a page, the better your table can perform. This is why appropriate choice of data types is such an essential component of good database design.

However, our choice of clustering key can affect the performance of not only the clustered index, but also any non-clustered indexes that rely on the clustered index. As shown in Figure 4, a non-clustered index contains the clustered index key in every level of its b-tree structure, as a pointer back into the clustered index. This happens regardless of whether or not the clustering key was explicitly included in the nonclustered index structure, either as part of the index key or as an included column. In other words, whereas in the clustered index the leaf level contains the actual data rows, in a nonclustered index, the leaf level contains the clustered key, which SQL Server uses to find the rest of the data.


Figure 4: Non-clustered indexes also store the clustering key in order to look up data in the clustered index

So, let's see how our choice of clustering key impacts the potential performance of our non-clustered indexes. We'll keep the example pretty simple and create a non-clustered index on customer_id, which is an int data type.

CREATE NONCLUSTERED INDEX IX_example_customerID
ON dbo.narrow_example (customer_id) ;

Figure 5 shows the resulting B-tree structures of our nonclustered index, depending on whether we used the uniqueidentifier or the int column for our clustered index key.


Figure 5

While we have the same number of levels in each version of the index, notice that the non-clustered index based on the int clustering key stores 86% more data in each leaf-level data page than its uniqueidentifier counterpart. Once again, the more rows you can fit on a page, the better the overall system performance: range-scan queries on the narrow int version will consume less IO and execute faster than equivalent queries on the wider, uniqueidentifier version.

In this example, I've kept the table and index structures simple in order to better illustrate the basic points. In a production environment, you'll often encounter tables that are much, much wider. It's possible that such tables will require a composite clustered index, where the clustering key is comprised of more than one column. That's okay; the point isn't to advise you to base all of your clustered keys on integer IDENTITY columns, but to demonstrate the significant, detrimental impact a wide index key can have on a database's performance, compared to a narrow index key. Remember, narrowness refers more to the number of bytes consumed than to the number of columns. For example, a composite clustered key on three int columns would still be narrower than a uniqueidentifier key (4 + 4 + 4 = 12 bytes for the former vs. 16 bytes for the latter).

Unique

Index uniqueness is another highly desirable attribute of a clustering key, and goes hand-in-hand with index narrowness. SQL Server does not require a clustered index to be unique, but it must have some means of uniquely identifying every row. That's why, for non-unique clustered indexes, SQL Server adds to every duplicate instance of a clustering key value a 4-byte integer value called a uniqueifier. This uniqueifier is added everywhere the clustering key is stored. That means the uniqueifier is stored in every level of the B-tree, in both clustered and non-clustered indexes. As you can imagine, if there are many rows using the same clustering key value, this can become quite expensive.

What's more, the uniqueifier is stored as a variable-length column. This is important because if a table does not already contain any other variable-length columns, each duplicate value actually consumes 8 bytes of overhead: 4 bytes for the uniqueifier value and 4 bytes to manage variable-length columns on the row. The following example demonstrates this. We create a table with a non-unique clustered index, insert a single row into it, and then retrieve minimum and maximum record sizes (which currently refer to the same, single record) from the sys.dm_db_index_physical_stats DMV:

CREATE TABLE dbo.overhead ( myID INT NOT NULL ) ;

CREATE CLUSTERED INDEX CIX_overhead -- not unique!
ON dbo.overhead(myID) ;

INSERT INTO dbo.overhead ( myID )
SELECT 1 ;

SELECT  min_record_size_in_bytes ,

max_record_size_in_bytes

FROM    sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.overhead'),

                                       NULL, NULL, N'SAMPLED') ;

 

min_record_size_in_bytes   max_record_size_in_bytes

------------------------ ------------------------

11                         11

 

(1 row(s) affected)

Although we only have a single column in the table, there is a minimum of 7 bytes of overhead per row, in SQL Server. While this overhead may increase with the addition of NULL or variable-length columns, it will never be less than 7 bytes per row. The other 4 bytes are used to store the int column, myID.


Now let's insert a duplicate value into the table:

INSERT  INTO dbo.overhead

        ( myID )

        SELECT  1 ;

 

SELECT  min_record_size_in_bytes ,

        max_record_size_in_bytes

FROM    sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.overhead'),

                                       NULL, NULL, N'SAMPLED') ;

 

min_record_size_in_bytes   max_record_size_in_bytes

------------------------ ------------------------

11                         19

 

(1 row(s) affected)

The duplicate value requires the addition of a uniqueifier, which consumes an extra 4 bytes. However, since a variable-length column, such as a varchar() column, does not already exist on the table, an additional 4 bytes are added by SQL Server to manage the variable-length properties of the uniqueifier. This brings the total uniqueifier overhead to 8 bytes per row.

TIP: The sys.dm_db_index_physical_stats DMV runs in three modes: LIMITED, SAMPLED, or DETAILED. The min_record_size_in_bytes and max_record_size_in_bytes columns are only available in SAMPLED or DETAILED mode. Be careful when running this DMV in production or on large tables, as the SAMPLED mode scans 1% of pages and DETAILED mode scans all pages. Refer to Books Online for more information.

So, returning to our original narrow_example table, let's see what would happen if the clustering key was changed to customer_id, which is a non-unique int. Although the uniqueifier is not readily visible and cannot be queried, internally the leaf-level page might look something like this:

web_id web_key log_date customer_id uniqueifier


1 6870447C-A0EC-4B23-AE5F-9A92A00CE166 12/15/2010 1 NULL

2 5AB480CF-40CD-43FD-8C3D-5C625875E143 12/15/2010 1 1

3 95C312B9-83AF-4725-B53C-77615342D177 12/15/2010 1 2

4 88AA4497-9A20-4AB7-9704-1FDFAE200564 12/15/2010 2 NULL

5 E3EA3014-FC23-48B6-9205-EE6D06D37C5B 12/15/2010 2 1

6 9F6A8933-F6EC-416F-AACA-1C3FF172151C 12/15/2010 3 NULL

7 B16406A8-649B-4E7A-A234-C7B7D8FCE2D3 12/15/2010 4 NULL

8 443B627B-21CE-4466-AD15-1879C8749225 12/15/2010 4 1

9 2F3757DE-3799-4246-BA88-944C5DA3683E 12/15/2010 4 2

10 25D9F2AA-6610-48CD-9AC4-4F1E29FDED1C 12/15/2010 4 3

The uniqueifier is NULL for the first instance of each customer_id, and is then populated, in ascending order, for each subsequent row with the same customer_id value. The overhead for rows with a NULL uniqueifier value is, unsurprisingly, zero bytes. This is why min_record_size_in_bytes remained unchanged in the overhead table; the first insert had a uniqueifier value of NULL. This is also why it is impossible to estimate how much additional storage overhead will result from the addition of a uniqueifier, without first having a thorough understanding of the data being stored. For example, a non-unique clustered index on a datetime column may have very little overhead if data is inserted, say, once per minute. However, if that same table is receiving thousands of inserts per minute, then it is likely that many rows will share the same datetime value, and so the uniqueifier will have a much higher overhead.

If your requirements seem to dictate the use of a non-unique clustered key, my advice would be to look to see if there are a couple of relatively narrow columns that, together, can form a unique key. You'll still see the increase in the row size for your clustering key in the index pages of both your clustered and nonclustered indexes, but you'll at least save the cost of the uniqueifier in the data pages of the leaf level of your clustered index. Also, instead of storing an arbitrary uniqueifier value to the index key, which is meaningless in the context of your data, you would be adding meaningful and potentially useful information to all of your nonclustered indexes.
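As a sketch of that advice, imagine a hypothetical dbo.order_detail table in which order_id and line_number are each int and are unique only in combination; clustering on both gives a unique, still-narrow key and avoids the uniqueifier entirely:

-- Hypothetical table and columns, shown only to illustrate the composite-key alternative.
CREATE UNIQUE CLUSTERED INDEX CIX_order_detail
ON dbo.order_detail (order_id, line_number);   -- 4 + 4 = 8 bytes, no uniqueifier required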

Static

A good clustered index is also built upon static, or unchanging, columns. That is, you want to choose a clustering key that will never be updated. SQL Server must ensure that data exists in a logical order based upon the clustering key. Therefore, when the clustering key value is updated, the data may need to be moved elsewhere in the clustered index so that the clustering order is maintained. Consider a table with a clustered index on LastName, and two non-clustered indexes, where the last name of an employee must be updated.

Figure 6: The effect of updating a clustered key column

Not only is the clustered index updated and the actual data row moved – most likely to a new data page – but each non-clustered index is also updated. In this particular example, at least three pages will be updated. I say "at least" because there are many more variables involved, such as whether or not the data needs to be moved to a new page. Also, as discussed earlier, the upper levels of the B-tree contain the clustering key as pointers down into the leaf level. If one of those index pages happens to contain the clustering key value that is being updated, that page will also need to be updated. For now, though, let's assume only three pages are affected by the UPDATE statement, and compare this to behavior we see for the same UPDATE, but with a clustering key on ID instead of LastName.

Figure 7: An UPDATE that does not affect the clustered index key

In Figure 7, only the data page in the clustered index is changed because the clustering key is not affected by the UPDATE statement. Since only one page is updated instead of three, clustering on ID requires less IO than clustering on LastName. Also, updating fewer pages means the UPDATE can complete in less time.

Of course, this is another simplification of the process. There are other considerations that can affect how many pages are updated, such as whether an update to a variable-length column causes the row to exceed the amount of available space. In such a case, the data would still need to be moved, although only the data page of the clustered index is affected; nonclustered indexes would remain untouched.


Nevertheless, updating the clustering key is clearly more expensive than updating a non-key column. Furthermore, the cost of updating a clustering key increases as the number of non-clustered indexes increases. Therefore, it is a best practice to avoid clustering on columns that are frequently updated, especially in systems where UPDATE performance is critical.

Ever-Increasing

The last important attribute of a clustered index key is that it is ever-increasing. An integer identity column is an excellent example of an ever-increasing column that is also narrow, unique, and static. The identity property continuously increments by the value defined at creation, which is typically one. This allows SQL Server, as new rows are inserted, to keep writing to the same page until the page is full, then repeating with a newly allocated page. (A minimal DDL sketch follows the list below.)

There are two primary benefits to an ever-increasing column:

1. Speed of insert – SQL Server can much more efficiently write data if it knows the row will always be added to the most recently allocated, or last, page

2. Reduction in clustered index fragmentation – this fragmentation results from data modifications and can take the form of gaps in data pages, which wastes space, and a logical ordering of the data that no longer matches the physical ordering.
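As a minimal sketch of such a key, the hypothetical table below clusters on an integer identity column, which satisfies all four attributes discussed in this article:

-- Hypothetical dbo.order_log table; the identity column is narrow, unique, static, and ever-increasing.
CREATE TABLE dbo.order_log (
    order_log_id INT IDENTITY(1,1) NOT NULL,
    logged_at    DATETIME NOT NULL,
    payload      VARCHAR(100) NULL,
    CONSTRAINT PK_order_log PRIMARY KEY CLUSTERED (order_log_id)
);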

However, before we can discuss the effect of the choice of clustering key on insert performance and index fragmentation, we need to briefly review the types of fragmentation that can occur.

Internal and external index fragmentation

There are two types of index fragmentation, which can occur in both clustered and non-clustered indexes: extent (a.k.a. external) and page (a.k.a. internal) fragmentation. First, however, Figure 8 illustrates an un-fragmented index.

Figure 8: Data pages in an un-fragmented clustered index

In this simplified example, a page is full if it contains 3 rows, and in Figure 8 you can see that every page is full and the physical ordering of the pages is sequential. In extent fragmentation, also known as external fragmentation, the pages get out of physical order as a result of data modifications. The pages highlighted in orange in Figure 9 are the pages that are externally fragmented. This type of fragmentation can result in random IO, which does not perform as well as sequential IO.

Figure 9: External fragmentation in a clustered index

Figure 10: Internal fragmentation in a clustered index

Figure 10 illustrates page fragmentation, also known as internal fragmentation, which refers to the fact that there are gaps in the data pages. This reduces the amount of data that can be stored on each page, and so increases the overall amount of space needed to store the data. Again, the pages in orange indicate internally fragmented pages.

For example, comparing Figures 8 and 10, we can see that the un-fragmented index holds 15 data rows in 5 pages. By contrast, the index with internal fragmentation only holds 9 data rows in the same number of pages. This is not necessarily a big issue for singleton queries, where just a single record is needed to fulfill the request. However, when pages are not full and additional pages are required to store the data, range-scan queries will feel the effects, as more IO will be required to retrieve those additional pages.

Indexes suffering from fragmentation will often have both extent and page fragmentation.
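Both kinds of fragmentation can be measured with the sys.dm_db_index_physical_stats DMV introduced earlier. The sketch below checks the clustered index of the narrow_example table; the same caveats about SAMPLED and DETAILED modes apply.

SELECT index_id, index_level,
       avg_fragmentation_in_percent,    -- extent (external) fragmentation
       avg_page_space_used_in_percent   -- page fullness; low values indicate internal fragmentation
FROM sys.dm_db_index_physical_stats(
        DB_ID(), OBJECT_ID(N'dbo.narrow_example'), 1, NULL, 'SAMPLED');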


How non-sequential keys can increase fragmentation

Clustering on ever-increasing columns such as identity integers will result in an un-fragmented index, as illustrated in Figure 8. This results in sequential IO and maximizes the amount of data stored per page, resulting in the most efficient use of system resources. It also results in very fast write performance.

Use of a non-sequential key column can, however, result in a much higher overhead during insertion. First, SQL Server has to find the correct page to write to and pull it into memory. If the page is full, SQL Server will need to perform a page split to create more space. During a page split, a new page is allocated, and half the records are moved from the old page to the newly-allocated page. Each page has a pointer to the previous and next page in the index, so those pages will also need to be updated. Figure 11 illustrates the results of a page split.

Figure 11: Internal and external fragmentation as a result of a page split

Initially, we have two un-fragmented pages, each holding 3 rows of data. However, a request to insert "coconut" into the table results in a page split, because Page 504, where the data naturally belongs, is full. SQL Server allocates a new page, Page 836, to store the new row. In the process, it also moves half the data from Page 504 to the new page in order to make room for new data in the future. Lastly, it updates the previous and next pointers in both pages 504 and 505. We're left with Page 836 out of physical order, and both pages 504 and 836 contain free space. As you can see, not only would writes in this scenario be slower, but both internal and external fragmentation of the table would be much higher.

I once saw a table with 4 billion rows clustered on a non-sequential uniqueidentifier, also known as a GUID. The table had a fragmentation level of 99.999%. Defragmenting the table and changing the clustering key to an identity integer resulted in a space savings of over 200 GB. Extreme, yes, but it illustrates just how much impact the choice of clustering key can have on a table.

I am not suggesting that you only create clustered indexes on identity integer columns. Fragmentation, although generally undesirable, primarily impacts range-scan queries; singleton queries would not notice much impact. Even range-scan queries can benefit from routine defragmentation efforts. However, the ever-increasing attribute of a clustered key is something to consider, and is especially important in OLTP systems where INSERT speed is important.
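As a rough sketch of those routine defragmentation efforts, either of the following can be used, depending on how fragmented the index is (the index and table names here are placeholders):

ALTER INDEX CIX_example ON dbo.example REORGANIZE;                      -- lightweight, always online
ALTER INDEX CIX_example ON dbo.example REBUILD WITH (FILLFACTOR = 90);  -- full rebuild, leaving 10% free space per page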


Summary

In this article, I've discussed the most desirable attributes of a clustered index: narrow, unique, static, and ever-increasing. I've explained what each attribute is and why each is important. I've also presented the basics of B-tree structure for clustered and non-clustered indexes. The topic of "indexing strategy" is a vast one, and we've only scratched the surface. Beyond what I presented in this article, there are also many application-specific considerations when choosing a clustering key, such as how data will be accessed and the ability to use the clustered index in range-scan queries. As such, I'd like to stress that the attributes discussed in this article are not concrete rules but rather time-proven guidelines. The best thing to do if you're not sure whether you've chosen the best clustering key is to test and compare the performance of different strategies.

Further Reading

Brad's Sure Guide to Indexes – Brad McGehee's "ground level" overview of indexes and how they work

Defragmenting Indexes in SQL Server 2005 and 2008 – Rob Sheldon on investigating, and fixing, index fragmentation using sys.dm_db_index_physical_stats.

SQL Server Indexes: The Basics, by Kathi Kellenberger

SQL Server Primary Key vs. Clustered Index: Part 1

By Mike Byrd

Note: This is Part 1 of a three-part article on SQL Server primary keys and clustered indexes.

Many SQL Server data architects and developers have been strong proponents of letting an identity column define the primary key for large tables. Why not? This article examines the pros and cons.

Should the definition of a primary key define the clustered index?

The identity property is always increasing, and newly inserted rows are always added at the end of the table, thus resembling (but not always matching) the properties of a CreateDate column. Old data is then at one end of the table and new data at the other end.

One of the cons to this approach is that you may encounter hot spots at the new data end of the table, but I’ve never encountered this issue. Another plus for using the identity property is that it is an integer and keeps foreign key relationships and indexes to a relatively small size.

With this approach I’ve always let the definition of the primary key define the SQL Server clustered index:



ALTER TABLE dbo.Package WITH CHECK

ADD CONSTRAINT [PK_Package] PRIMARY KEY (PackageID) ON [PRIMARY]

where the default is to create a clustered index.

However, while researching how best to define a partition key (for large tables – further discussed in Part 3 of this article), it finally dawned on me that perhaps it might be better to break apart the Primary Key (i.e., make it nonclustered) and define a separate unique clustered index based on the Primary Key and a natural key (something like CreateDate). Many of the stored procedures and ad hoc queries I've encountered in the business world have some dependency on CreateDate, and although the identity column is monotonically increasing, it has no relationship with the CreateDate column.

Consider an example

Consider the following table definition:


CREATE TABLE dbo.Package(
    PackageID int IDENTITY(1,1) NOT NULL,
    ..,
    CreateDate datetime NOT NULL,
    UpdateDate datetime NULL,
    CreateDateKey AS (CONVERT(int,CONVERT(varchar(8),CreateDate,(112)))),
    CONSTRAINT PK_Package PRIMARY KEY (PackageID ASC));

With the exception perhaps of the computed column CreateDateKey, this is a typical table definition using the identity property as both the SQL Server Primary Key and the Clustered Index. In this instance the CreateDateKey is an integer of the form yyyymmdd.
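To see what the computed column produces, style 112 of CONVERT formats a datetime as yyyymmdd, so a CreateDate of 2012-06-01 yields the integer 20120601:

SELECT CONVERT(int, CONVERT(varchar(8), CAST('2012-06-01' AS datetime), 112)) AS CreateDateKey;  -- 20120601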



Consider using the CreateDateKey (as defined above) together with the Primary Key as the clustered index for the table, instead of the original definition:

1. Persist the CreateDateKey column so it can be used in the clustered index – this causes a schema lock


ALTER TABLE dbo.Package DROP COLUMN CreateDateKey;

ALTER TABLE dbo.Package ADD CreateDateKey AS (CONVERT(INT,CONVERT(varchar(8),CreateDate,112))) PERSISTED;

GO

2. Delete old PK


ALTER TABLE [dbo].[Package] DROP CONSTRAINT [PK_Package];

GO

3. Generate a unique clustered index on CreateDateKey, PackageID – this does cause schema locks


CREATE UNIQUE CLUSTERED INDEX CIDX_Package ON dbo.[Package](CreateDateKey, PackageID)

WITH (ONLINE = ON, DATA_COMPRESSION = ROW)

GO

4. Rebuild PK




ALTER TABLE dbo.Package WITH CHECK ADD CONSTRAINT [PK_Package] PRIMARY KEY NONCLUSTERED

(PackageID) ON [PRIMARY]

GO

The #1 TSQL statement drops the original computed column, and then recreates it with the PERSISTED property. This is necessary for the definition of the SQL Server clustered index.

The #2 TSQL statement drops the original Primary Key. It will fail if there are any other tables with foreign key relationships to dbo.Package. (This will be discussed in a future article, along with a stored procedure to identify and generate drop and create statements for any existing dependent foreign keys.)
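Until that article, a quick way to see which foreign keys would block the drop is to query the catalog views; the sketch below simply lists constraints that reference dbo.Package, it does not generate the drop/create statements.

SELECT fk.name AS foreign_key_name,
       OBJECT_NAME(fk.parent_object_id) AS referencing_table
FROM sys.foreign_keys AS fk
WHERE fk.referenced_object_id = OBJECT_ID(N'dbo.Package');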

The #3 TSQL statement generates a unique, clustered index based on the computed column CreateDateKey and the original SQL Server Primary Key (identity column). Obviously this is unique, since the original Primary Key is unique. Uniqueness is a property that helps the optimizer pick a better query plan. Since query plan generation looks at the selectivity of only the first column in a multi-column index, I picked CreateDateKey first so the optimizer might select a seek (rather than a scan) when CreateDateKey is a parameter in the WHERE clause. The second line can only be used with the Enterprise (or higher) editions of SQL Server 2008 R2 and higher. The ONLINE property allows the index to be created while still giving other users access to the table data. ONLINE is slightly slower than OFFLINE (the default). The row compression property will be discussed in a future article. This line can be eliminated in the Standard or Express editions of SQL Server.

The #4 TSQL statement re-generates the Primary Key but now it is non-clustered. Any dependent Foreign Keys can then be re-created with their original definitions.

Part 2 of this article will discuss differences in query plans between the original table and the revised table with some benchmarking numbers for typical scenarios. The results were better than expected – stay tuned.

SQL Server Primary Key vs. Clustered Index: Part 2

The first article in this three-part series (Primary Key vs. Clustered Index: Maybe they should be different, Part 1) concentrated on the syntax for breaking apart the primary key and the clustered index. This article (Part 2) will look at performance and some benchmarking comparisons of the Primary Key and the Clustered Index being the same and different. The queries used in the following benchmark discussion came from stored procedures used in a real world business OLTP database.



A benchmark analysis of primary key vs. clustered index

The benchmark analysis concentrated on two separate, but related tables: Container and Package. All Package rows are related to a Container row (package.ContainerID = container.ContainerID) by a foreign key relation of package pointing to its parent row in Container. Thus, all packages have an equivalent container row, but also have parent rows (i.e., packages can also aggregate to sacks, pallets, shipments, etc.) as separate rows in the Container table. The Container table does contain a ParentContainerID column and has a foreign key constraint pointing to the ContainerID column.

The SQL Server benchmarking consisted of three separate definitions of the Primary Key and Clustered Key:

1. Primary Key Clustered on ContainerID and PackageID for respective tables (both are identity columns).

2. Primary Key non clustered and Unique Clustered on CreateDateKey* (natural key) and Identity column

3. Primary Key non clustered and Unique Clustered on Identity column and CreateDateKey* (natural key)

*See CreateDateKey definition in Part 1 article.

Case 1 above defines the original definition within the database tables.


ALTER TABLE dbo.Container WITH CHECK ADD CONSTRAINT [PK_Container] PRIMARY KEY CLUSTERED (ContainerID)
ALTER TABLE dbo.Package WITH CHECK ADD CONSTRAINT [PK_Package] PRIMARY KEY CLUSTERED (PackageID)

Case 2 breaks apart the Primary Key and has a separate Unique, Clustered index with selectivity on the natural key (CreateDateKey)


CREATE UNIQUE CLUSTERED INDEX CIDX_Container ON dbo.[Container](CreateDateKey, ContainerID);
ALTER TABLE dbo.Container WITH CHECK ADD CONSTRAINT [PK_Container] PRIMARY KEY NONCLUSTERED (ContainerID);
CREATE UNIQUE CLUSTERED INDEX CIDX_Package ON dbo.[Package](CreateDateKey, PackageID);
ALTER TABLE dbo.Package WITH CHECK ADD CONSTRAINT [PK_Package] PRIMARY KEY NONCLUSTERED (PackageID);

Case 3 breaks apart the Primary Key and has a separate Unique, Clustered index with selectivity on the identity column.


CREATE UNIQUE CLUSTERED INDEX CIDX_Container ON dbo.[Container](ContainerID, CreateDateKey);
ALTER TABLE dbo.Container WITH CHECK ADD CONSTRAINT [PK_Container] PRIMARY KEY NONCLUSTERED (ContainerID);
CREATE UNIQUE CLUSTERED INDEX CIDX_Package ON dbo.[Package](PackageID, CreateDateKey);
ALTER TABLE dbo.Package WITH CHECK ADD CONSTRAINT [PK_Package] PRIMARY KEY NONCLUSTERED (PackageID);

The reason for Case 3 is that most joins in the test database on these two tables use the identity column (Foreign Key) to relate the tables and there was concern that Case 2 might cause an additional table lookup.

A closer look at the query plans

OK, enough on the administrative details; let's look at example queries with emphasis on their query plans and their STATISTICS IO output. Originally, we tested eight individual queries for all three cases, but most had results similar to Queries 1 and 3 below. All queries ran with "SET STATISTICS IO ON" and "Include Actual Query Plan" on.
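For reference, the measurement setup for each run was along these lines (the benchmark query itself is substituted at the placeholder):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- ... benchmark query goes here ...
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;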

Example query 1:


SELECT c.ContainerID, p.PackageID, OrigPostalCode, c.ContainerTypeID, c.Description
FROM dbo.Package p
JOIN dbo.Container c
  ON c.ContainerID = p.ContainerID
WHERE p.PackageID between 233439695 and 233647160

1. Original PK (clustered on identity column):

Statistics IO: Table ‘Package’. Scan count 1, logical reads 2934, physical reads 8, read-ahead reads 2917.

Table ‘Container’. Scan count 1, logical reads 418197, physical reads 188, read-ahead reads 423482.

Query Plan for Query 1:

2. PK (NonClustered, Clustered on CreateDateKey, Identity column):

Statistics IO: Table ‘Package’. Scan count 5, logical reads 846903, physical reads 363, read-ahead reads 448.

Table ‘Container’. Scan count 0, logical reads 1496460, physical reads 376, read-ahead reads 344.

Query Plan:


3. PK (NonClustered, Clustered on Identity column, CreateDateKey):

Statistics IO: Table ‘Package’. Scan count 1, logical reads 2936, physical reads 1, read-ahead reads 2925.

Table ‘Container’. Scan count 1, logical reads 418565, physical reads 18, read-ahead reads 423075.

Query Plan:

Discussion Query 1:

When comparing queries, the number of logical reads is usually the best choice. Physical reads can vary because of data caching. Note that the statistics IO and the query plans for Cases 1 and 3 are nearly identical. Case 2 has a Key Lookup causing a significant increase in IO reads (about double the logical reads). As noted earlier, the concern that basing the clustered index on the natural key first may cause SQL Server query performance issues is valid for most queries in this analysis.

Example query 2


SELECT c.ContainerID, p.PackageID, OrigPostalCode, c.ContainerTypeID, c.Description
FROM dbo.Package p
JOIN dbo.Container c
  ON c.ContainerID = p.ContainerID
WHERE p.CreateDateKey = 20120601

 

1. Original PK (clustered on identity column):

Statistics IO: Table ‘Package’. Scan count 1, logical reads 702935, physical reads 506, read-ahead reads 698652.

Table ‘Container’. Scan count 1, logical reads 418197, physical reads 52, read-ahead reads 631.

Query Plan:

2. PK (NonClustered, Clustered on CreateDateKey, Identity column):

Statistics IO: Table ‘Package’. Scan count 1, logical reads 2936, physical reads 0, read-ahead reads 0.



Table ‘Container’. Scan count 0, logical reads 1478212, physical reads 0, read-ahead reads 0.

Query Plan (really bad – even after update statistics fullscan):

3. PK (NonClustered, Clustered on Identity column , CreateDateKey):

Statistics IO: Table ‘Package’. Scan count 5, logical reads 704154, physical reads 11495, read-ahead reads 261280.

Table ‘Container’. Scan count 207466, logical reads 888655, physical reads 0, read-ahead reads

Query Plan:


Note there is a recommended index for Cases 1 and 3:


CREATE NONCLUSTERED INDEX IDX_Package_CreateDateKey
ON dbo.Package (CreateDateKey) INCLUDE (PackageID, ContainerID, OrigPostalCode)

Implementing this index gives (for Case 3):

Statistics IO: Table ‘Package’. Scan count 1, logical reads 699, physical reads 0, read-ahead reads 0.

Table ‘Container’. Scan count 1, logical reads 418565, physical reads 69, read-ahead reads 420281.

Discussion Query 2:

The WHERE clause in Query 2 uses the CreateDateKey as a means to select only a single day's worth of package/container data. This query took advantage of the clustered index based on the CreateDateKey in the Package table, with an index seek on the clustered index (2936 logical reads), but still required a Key Lookup on the Container table because the join criteria is based only on the Primary Key (1478213 logical reads – it had to hit the Container indexes twice). What troubles me, though, is that the query optimizer picked the wrong query plans for Queries 1 and 3 (even with a fullscan update of statistics). The query optimizer estimated that only one row would come from the PK_Container seek (it actually returned 207466 rows) and picked an inner join for both the PK_Container result set and the CIDX_Container result set. Inner joins in this context are not a good choice for those large result sets.



As noted for Queries 1 and 3, Management Studio suggested an index, and the query plan above shows those results. As it turns out, of the three cases the latter case (with the suggested index) had the best overall performance (fewer disk reads).

Example query 3 (results representative of the remaining benchmark queries)


--csp_GetShipmentDetails
DECLARE @p_nContainerID int = 241949111
declare @nContainerCount int
declare @nPackageContainerType int

select @nPackageContainerType = ContainerTypeID
FROM ContainerType
WHERE Description = 'Package' AND DelFlag = 'N';

WITH PackageContainerIDs (ContainerID, ContainerTypeID)
AS
(
    SELECT ContainerID, ContainerTypeID
    FROM dbo.Container
    WHERE ContainerID = @p_nContainerID
    UNION ALL
    SELECT c2.ContainerID, c2.ContainerTypeID
    FROM dbo.Container c2
    JOIN PackageContainerIDs p
      ON p.ContainerID = c2.ParentContainerID
)
SELECT @nContainerCount = Count(*)
FROM PackageContainerIDs p
WHERE p.ContainerTypeID = @nPackageContainerType;

So is there a benefit to separating the primary key from the clustered index?

In most cases, it appears there is little if any reward for separating the Primary Key from the Clustered Index (although we did see one example where performance was better). And there could be some cons: additional index maintenance, key lookups, etc. If the natural key is the primary join mechanism to other tables, then separate Primary Keys and Clustered Indexes may be a viable choice. As usual the old Microsoft adage of “testing, testing, and more testing” is still applicable.

However, there is one case where breaking apart the Primary Key from the Clustered Index may be viable: selecting a Partition Key for table partitioning. Part 3 of this discussion will dive into that.

Tags: Clustered Index, Primary Key, Query Performance



Part 2 of this series found little performance benefit in separating the Primary Key from the Clustered Index on its own. However, when implementing table partitioning (especially for existing databases), a separate Clustered Index and Primary Key may just be the answer.

In the real world, most large tables in existing databases join through foreign keys based only on an identity column in the parent table (the Primary Key). This works well from a performance standpoint, as queries need only a single unique integer column to join parent and child tables.
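For illustration, a typical join of this kind, using the Package and Container tables from this series, might look like the sketch below (the selected columns are just examples):

-- Child (Package) joins to parent (Container) on the identity-based Primary Key alone.
SELECT p.PackageID, p.OrigPostalCode, c.ContainerID, c.ContainerTypeID
FROM dbo.Package AS p
JOIN dbo.Container AS c
    ON c.ContainerID = p.ContainerID;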

There are many white papers and articles on SQL Server table partitioning with academic examples (see the references at the end of this article), and most suggest partitioning on a natural key, such as a date. This is a good recommendation, but how do you implement it when there are child tables? This article will show how.

Partitioning on a natural key when there are child tables: an example

In this case, we will use the Container table described in Part 2, with child tables having Foreign Key relationships back to ContainerID.

 

The customer's data retention plan called for data to be retained for 6 months (based on the last update and/or create date) and for the data to be partitioned on a month-to-month basis. Data archiving and purging occur monthly, just prior to the end of the month. For this scenario, we therefore defined a computed column, LastDateKey, as shown below:


ALTER TABLE dbo.Container ADD LastDateKey AS
(CONVERT(INT, CONVERT(varchar(8), ISNULL(UpdateDate, CreateDate), 112))) PERSISTED;
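As a quick sanity check of the yyyymmdd conversion used in this computed column, the following query (with a purely illustrative date literal) shows the resulting integer:

SELECT CONVERT(INT, CONVERT(varchar(8), CAST('20120815' AS datetime), 112)) AS LastDateKey; -- returns 20120815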

Note the PERSISTED keyword. This is required so that LastDateKey can be part of the clustered index. Also note that this is an integer column in the form yyyymmdd. Now we can create the Partition Function and Partition Scheme:

CREATE PARTITION FUNCTION DB_PF1 (INT)
AS RANGE RIGHT FOR VALUES (20120301, 20120401, 20120501, 20120601, 20120701, 20120801);
GO

CREATE PARTITION SCHEME DB_PS1
AS PARTITION DB_PF1 ALL TO ([FileGroup01]);
GO

The partition function defines 7 partitions covering the last 6 months and the partition scheme puts all the partitions in the same filegroup (FileGroup01). The RANGE RIGHT option defines the lower range boundary of each partition starting with partition 2 (the first partition has no lower boundary).
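A quick way to see which partition a given LastDateKey value maps to is the $PARTITION function; the sample value below is purely illustrative:

SELECT $PARTITION.DB_PF1(20120815) AS PartitionNumber; -- returns 7 with the boundaries above (values >= 20120801 fall in the last partition)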

Before we can separate the Primary Key from the Clustered Index, the dependent Foreign Keys (from the child tables) must be dropped; otherwise, you will encounter an error when trying to drop the Primary Key. Assuming this has been done, you can then run the following statements:

IF EXISTS (SELECT * FROM sys.indexes
           WHERE object_id = OBJECT_ID(N'dbo.Container') AND name = N'PK_Container')
    ALTER TABLE dbo.Container DROP CONSTRAINT PK_Container;
GO

CREATE UNIQUE CLUSTERED INDEX CIDX_Container ON dbo.Container (ContainerID, LastDateKey)
WITH (ONLINE = ON, DATA_COMPRESSION = ROW)
ON DB_PS1 (LastDateKey);
GO

--Rebuild PK
ALTER TABLE dbo.Container WITH CHECK ADD CONSTRAINT PK_Container PRIMARY KEY NONCLUSTERED
(ContainerID)
ON [PRIMARY];

The first TSQL statement drops the Primary Key (and its clustered index). The second creates a unique clustered index on ContainerID and LastDateKey with row compression; ContainerID is listed first because of the performance degradation we saw in Part 2 when LastDateKey led the index (for this particular database, most table joins are on ContainerID). The third statement rebuilds the Primary Key as non-clustered. It is defined on the PRIMARY filegroup and is not partitioned, because all of the child tables have a single-column Foreign Key constraint referring back to ContainerID. If you wanted a partitioned Primary Key, every child table would have to be modified to add the LastDateKey column. This is very undesirable for a variety of reasons, mainly that LastDateKey is a property of (and defined on) the parent table (Container) and has no relation to the data in the child tables. From a performance viewpoint, there is no need for two join columns back to the Container (parent) table when ContainerID is already unique.
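As an optional check (not part of the original steps), the system catalog can confirm how rows are now distributed across the partitions of the new clustered index:

-- Row counts per partition for the clustered index (index_id = 1).
SELECT p.partition_number, p.rows
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'dbo.Container')
  AND p.index_id = 1
ORDER BY p.partition_number;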

Now that the Container table is partitioned by month (as defined above), it is very easy to add a new month:

ALTER PARTITION SCHEME DB_PS1 NEXT USED [FileGroup01];
GO
ALTER PARTITION FUNCTION DB_PF1() SPLIT RANGE (20120901);
GO

The NEXT USED clause tells SQL Server which filegroup to use for the next partition range. The SPLIT statement actually creates the new boundary. It can be run before or after that date, but be careful: if the SPLIT happens after the date, there may be data movement within the filegroup, with a resulting schema lock and possible application timeouts. The MERGE statement (a sketch follows below) can remove an old partition boundary, but again be careful: there may also be data movement caused by orphaned data or data retained because of other archiving constraints. A MERGE operation also takes a schema lock and may cause application timeouts, and the more data there is to move, the longer the locks are held. Data movement during splits or merges can be very disruptive to your monthly partition maintenance, especially in a 24×7 environment.
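For completeness, a merge that removes the oldest boundary might look like the sketch below; the boundary value is illustrative and assumes the data in that range has already been archived or purged:

ALTER PARTITION FUNCTION DB_PF1() MERGE RANGE (20120301);
GO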

Final thoughts

Do not forget to add back the foreign keys that were dropped earlier (a sketch follows the index example below). Another tip to consider is to drop all of the existing non-clustered indexes on the parent table before dropping the Primary Key and creating the new Clustered Index and new Primary Key. The non-clustered indexes can then be rebuilt; I would suggest using the ONLINE = ON option and partitioning them on the partition scheme, e.g.:

CREATE NONCLUSTERED INDEX [IDX_Container_ParentContainerID] ON dbo.Container
(ParentContainerID ASC, DelFlag ASC)
INCLUDE (CreateDate, ContainerTypeID, UpdateDate)
WITH (ONLINE = ON, DATA_COMPRESSION = ROW)
ON [NGSCoreContainerPS1](ContainerID);
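Re-adding a dropped foreign key might look like the following sketch; the constraint name is an assumption, not taken from the original schema:

-- Hypothetical constraint name; WITH CHECK validates the existing child rows.
ALTER TABLE dbo.Package WITH CHECK
    ADD CONSTRAINT FK_Package_Container FOREIGN KEY (ContainerID)
    REFERENCES dbo.Container (ContainerID);
GO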

Data compression is another tool for consideration, but that is a topic for another article.

Consider an index on LastDateKey; it could be useful in queries using dates or a date range.
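A minimal sketch of such an index, aligned to the partition scheme (the index name is an assumption):

CREATE NONCLUSTERED INDEX IDX_Container_LastDateKey
ON dbo.Container (LastDateKey)
WITH (ONLINE = ON, DATA_COMPRESSION = ROW)
ON DB_PS1 (LastDateKey);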



This article should have given you some insight into selecting a partition key for SQL Server table partitioning. In any case, picking a partition key needs a fair amount of thought; once you choose one and implement it, the cost of change may be prohibitive.