Programming Techniques and Data Structures
Ellis Horowitz, Editor

Implementations for Coalesced Hashing

Jeffrey Scott Vitter
Brown University

The coalesced hashing method is one of the faster searching methods known today. This paper is a practical study of coalesced hashing for use by those who intend to implement or further study the algorithm. Techniques are developed for tuning an important parameter that relates the sizes of the address region and the cellar in order to optimize the average running times of different implementations. A value for the parameter is reported that works well in most cases. Detailed graphs explain how the parameter can be tuned further to meet specific needs. The resulting tuned algorithm outperforms several well-known methods including standard coalesced hashing, separate (or direct) chaining, linear probing, and double hashing. A variety of related methods are also analyzed including deletion algorithms, a new and improved insertion strategy called varied-insertion, and applications to external searching on secondary storage devices.

CR Categories and Subject Descriptors: D.2.8 [Software Engineering]: Metrics--performance measures; E.2 [Data]: Data Storage Representations--hash-table representations; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems--sorting and searching; H.2.2 [Database Management]: Physical Design--access methods; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval--search process

General Terms: Algorithms, Design, Performance, Theory

Additional Key Words and Phrases: analysis of algorithms, coalesced hashing, hashing, data structures, databases, deletion, asymptotic analysis, average-case, optimization, secondary storage, assembly language

Author's Present Address: Jeffrey Scott Vitter, Department of Computer Science, Box 1910, Brown University, Providence, RI 02912.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1982 ACM 0001-0782/82/1200-0911 $00.75.

1. Introduction

One of the primary uses today for computer technology is information storage and retrieval. Typical searching applications include dictionaries, telephone listings, medical databases, symbol tables for compilers, and storing a company's business records. Each package of information is stored in computer memory as a record. We assume there is a special field in each record, called the key, that uniquely identifies it. The job of a searching algorithm is to take an input K and return the record (if any) that has K as its key.

Hashing is a widely used searching technique because no matter how many records are stored, the average search times remain bounded. The common element of all hashing algorithms is a predefined and quickly computed hash function

    hash: {all possible keys} → {1, 2, ..., M}

that assigns each record to a hash address in a uniform manner. (The problem of designing hash functions that justify this assumption, even when the distribution of the keys is highly biased, is well-studied [7, 2].) Hashing methods differ from one another by how they resolve a collision when the hash address of the record to be inserted is already occupied.

This paper investigates the coalesced hashing algorithm, which was first published 22 years ago and is still one of the faster known searching methods [16, 7]. The total number of available storage locations is assumed to be fixed. It is also convenient to assume that these locations are contiguous in memory.
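To make the notions of hash address and collision concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not code from the paper: the names `hash_address` and `find_collisions` are hypothetical, and the toy polynomial hash merely stands in for whatever uniform hash function an implementation would actually use. The sketch only detects colliders; where a collider is then stored is exactly what distinguishes one hashing method from another.

```python
def hash_address(key: str, m: int) -> int:
    """Map a key to a hash address in 1..m.

    Assumption: a toy polynomial hash, used only so the example is
    deterministic; it is not the hash function studied in the paper.
    """
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % m
    return h + 1  # shift from 0..m-1 to the 1..m numbering used in the text


def find_collisions(keys, m):
    """Insert keys in order and report which ones collide.

    Returns (table, colliders): table maps each occupied hash address to
    the first key stored there; colliders lists (key, address) pairs for
    keys whose hash address was already occupied on insertion.
    """
    table = {}
    colliders = []
    for key in keys:
        addr = hash_address(key, m)
        if addr in table:
            colliders.append((key, addr))  # collision: address already taken
        else:
            table[addr] = key
    return table, colliders
```

For example, with M = 7 the keys "a" and "h" both hash to address 7 under this toy function, so `find_collisions(["a", "b", "h"], 7)` stores "a" and "b" and reports "h" as a collider.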
For the purpose of notation, we shall number the hash table slots 1, 2, ..., M'. The first M slots, which serve as the range of the hash function, constitute the address region. The remaining M' - M slots are devoted solely to storing records that collide when inserted; they are called the cellar. Once the cellar becomes full, subsequent colliders must be stored in empty slots in the address region and, thus, may trigger more collisions with records inserted later.

For this reason, the search performance of the coalesced hashing algorithm is very sensitive to the relative sizes of the address region and cellar. In Sec. 4, we apply the analytic results derived in [10, 11, 13] in order to optimize the ratio of their sizes, β = M/M', which we call the address factor. The optimizations are based on two performance measures: the number of probes per search and the running time of assembly language versions. There is no unique best choice for β: the optimum address factor depends on the type of search, the number of inserted records, and the performance measure chosen. However, we shall see that the compromise choice β ≈ 0.86 works well in many situations. The method can be further tuned to meet specific needs. Section 5 shows that this tuned method dominates several popular hashing algorithms including standard coalesced hashing (in which β = 1), separate (or direct)

