In this blog, we have discussed the idea of hashing and hash tables. Now we will discuss the design and properties of some popular hash functions used in programming and data structures.

As discussed in hashing, suppose we want to store m key-value pairs in a hash table of size m. For this, we need a mechanism to transform any given key into an index in the range 0 to m - 1. So we use a hash function that is easy to compute and uniformly distributes the keys in the range 0 to m - 1 i.e. for each key, every integer between 0 and m-1 should be equally likely.

In other words, a good hash function satisfies the **assumption of uniform hashing**: each key is equally likely to hash to any of the m slots. Unfortunately, this ideal situation is challenging to achieve, making the understanding of hashing interesting.

Critical questions are: How do we implement a hash function close to the ideal situation? Think! However, since all possible keys are large and the table size is small, many keys may get mapped into the same place in the table no matter what function we use. So we need to deal with two situations: (1) Finding a good hash function that minimizes the likelihood of collisions and (2) Strategies for handling collisions.

Overall, no hash function can map all possible sets of keys without collisions. If the mapping is uniform, the number of keys is n, and the size of the hash table is m, then each location will have approximately n/m keys mapped into it.

If the keys are real numbers between 0 and 1, we can multiply each key by m and round off to the nearest integer to get an index between 0 and m-1 i.e. the hash function is **h(k) = ⌊km⌋**

This approach looks good, but it is defective because it gives more weight to the most significant bits of the keys and the least significant bits play no role. One way to address this situation is to use modular hashing on the binary representation of the key.

This is the most commonly used method for hashing. In this method, we map a key k into one of the m slots by taking the remainder of k divided by m i.e. the hash function is **h(k) = k mod m**.

For example, if the hash table has size m = 12 and the key is k = 100, then h(k) = 4. Since it requires only a single division operation which is quite fast. But when using the division method, we usually avoid specific values of m or choose a specific value of m.

**Observation 1:** m should not be a power of 2. If m = 2^p, then h(k) is just the p lowest-order bits of k. In other words, if we know that most of the low-order p-bit of keys are similar, then there will be a lot of collisions. So in this situation, the best idea would be to design the hash function that depends on all the key bits. Similarly, if keys are base-10 numbers and m is 10^k, then only the k least significant digits are used. Such a choice might be problematic!

**Observation 2:** One solution to the above problem would be choosing the table size m to be prime and, for any positive integer key k. This approach will evenly disperse the keys between 0 and m - 1 and minimize the chances of collision. In other words, if m is not prime, it may be the case that not all of the key bits play a role, and the hash function might not disperse the values evenly. Note: A prime not too close to an exact power of 2 is often a good choice for m. Think!

**Example 1**

Suppose we wish to allocate a hash table to hold n = 2000 keys. In this scenario, we won’t mind examining an average of 3 elements in an unsuccessful search, so we allocate a hash table of size m = 701. We have chosen m = 701 because it is a prime near 2000/3 but not near any power of 2.

**Example 2**

Suppose that the keys are three-digit telephone area codes and m = 100. If the middle digit of area codes is 0 or 1, this choice strongly favors the values less than 20. A better idea would be to choose a prime value to disperse them better (a prime value not close to 100 would work better). Similarly, IP addresses are binary numbers with a given set of patterns, so we need to use a table size m that is a prime (in particular, not a power of 2) to disperse them better using modular hashing.

**Observation 3:** If the size of the table cannot be adjusted easily to be a prime, then the following hash function can be used: **h(k) =(k mod p) mod m**, where **p** is a prime, and **p > m** (p should be sufficiently larger than m to be effective, but it should also be sufficiently smaller than the number of all possible values of keys).

**Observation 4:** Sometimes finding a large list of primes is not easy. So another possibility would be the following: at random, select two numbers, a and b, such that a < p, b < p, and let **h(k) =((ak + b) mod p) mod m.** This function is more complicated to compute than the previous one, but it gives a good average performance for all inputs.

**Observation 5:** As we have already mentioned, no hash function can be suitable for all inputs. Using primes as described is reasonably safe since most data in practice have no structure related to prime numbers. On the other hand, it can be possible in a particular application that one will want to store the results of some experiments made on integers, all of which are of the form **c + kp (here, c is some** constant). In such a scenario, all these keys will have the same hash values if **m = p** is used. So one idea is to scramble the data one step further, and use a random procedure to select a hash function from a given set of hash functions!

**Observation 5:** Here is an interesting example! Suppose there are 500 students identified by their social security number and there are 1 billion possible social security numbers. So we will not allocate an array of size 1 billion to store information about them. Instead, we can use the last three digits of the numbers, in which case we need only an array of size 1000. But there may be students with the same last three digits (in fact, with 500 students, the probability of that is quite high). We can also use the last four digits, or the last three digits and the first letter of the student’s name, to minimize duplicates even further. However, using more digits requires a larger-size table and results in a smaller utilization.

**Here is another interesting example!**
Suppose there are 500 students identified by their social security number, and all possible social security numbers are 1 billion! So we will not allocate an array of size 1 billion to store information about them. Instead, we can use the last three digits of the numbers, in which case we need only an array of size 1000. But there may be students with the same last three digits (in fact, with 500 students, the probability of that is relatively high). So in this situation, we can use the last four digits, or the last three digits and the first letter of the student’s name, to minimize duplicates even further.

Most of the hash functions assume that the keys are a set of natural numbers, i.e., N = {0, 1, 2, 3, ....}. Thus, if the keys are not natural numbers, we need to find a way to interpret them as natural numbers. For example, in the case of a string, we can interpret a string as the sum of the ASCII values of the characters in a string. We can also design other methods for converting each string key into a large natural number.

```
int stringHashFunction(String s[], int size, int m)
{
int sum = 0
for(int i = 0; i < size; i = i + 1)
sum = sum + s[i]
return sum % m
}
```

This function takes a string as input and processes the four characters of the string (four bytes data) simultaneously, and interprets each of the four-bytes chunks as a single long integer value. The integer values for the four-byte chunks are added together, and the resulting sum will be converted to the range 0 to m−1 using the modulus operator.

```
int stringFolding(String s[], int size, int m)
{
long sum = 0
long mul = 1
for (int i = 0; i < size; i = i + 1)
{
mul = (i % 4 == 0) ? 1 : mul * 256
sum = sum + s[i] * mul
}
return (int)(sum % m)
}
```

For example, if the string “aaaabbbb” is given, then the first four bytes (“aaaa”) will be interpreted as the integer value 1,633,771,873, and the next four bytes (“bbbb”) will be interpreted as the integer value 1,650,614,882. Their sum is 3,284,386,755 (when treated as an unsigned integer). If the table size is 101, then the modulus operator will hash this key to slot 75 in the table.

The mid-square method is a good hash function to use with the integer key. It squares the key value and then takes out the middle r bits of the result and provides a hash value in the range 0 to 2^r - 1. This works well because most or all bits of the key have contributed to the result.

For example, consider records whose keys are 4-digit numbers of base 10, and the goal is to hash these key values to a table of size 100 i.e. a range of 0 to 99. This range is equivalent to two digits in base 10, so r = 2. If the input is the number 4567, squaring yields an 8-digit number, 20857489.

The middle two digits of this result are 57. If we observe, all digits of the original key (equivalently, all bits when the number is viewed in binary) contribute to the middle two digits of the squared value. So the result is not dominated by the distribution of the lower-order digit or the higher-order digit of the original key value.

The multiplication method for creating hash functions operates in two steps. First, we multiply the key k by a constant A in the range 0< A < 1 and extract the fractional part of kA. Then, we multiply this value by m and take the floor of the result. In short, the hash function is **h(k) = ⌊m(kA mod 1)⌋**, where “kA mod 1” means the fractional part of kA, that is, kA - ⌊kA⌋. An advantage of the multiplication method is that the value of m is not critical. We can choose it as a power of 2, i.e., m = 2^p for some integer p.

Suppose that the machine's word size is w bits and key k fits into a single word. We restrict A to be a fraction of the form s/2^w, where s is an integer in the range 0 < s < 2^w.

We first multiply k by the w-bit integer s = A2^w. The result is a 2w-bit value r1*2^w + r0, where r1 is the high-order word of the product and r0 is the low-order word of the product. The desired p-bit hash value consists of the p most significant bits of r0. Although this method works with any value of the constant A, it works better with some values than others. The optimal choice depends on the characteristics of the data being hashed. Think!

- Uniformity and randomness are the essences of hashing. In other words, hash functions should transform a set of keys uniformly to a set of random locations in the range 1 to m.
- A good hash function generates the hash values from keys in a way independent of any patterns that might exist in the data. For example, the modular hashing frequently gives good results, assuming that we choose m as a prime number that is unrelated to any patterns in the distribution of keys.
- Of course, the same hash function must be used for all accesses to the same table. In many cases, however, there is a need for many independent tables or tables that are created and destroyed frequently. In those cases, a different hash function can be used every time a different table is created.
- Information about the distribution of keys can help us to design an effcient hash function. For example, suppose the keys are strings representing the name of the person, then closely related names, such as "Mohan", "Sohan", "Rohan" may often occur. Here a good hash function would minimize the chance that such names hash to the same slot in the hash table. In other words, some applications of hash functions might require stronger properties than the condition of the simple uniform hashing. For example, we might want keys that are “close” to yield hash values that are far apart.

**Enjoy learning, Enjoy algorithms!**

Subscribe to get weekly content on data structure and algorithms, machine learning, system design and oops.