Millions of internet users generate terabytes of data every day, and the amazing part is that the amount of data is doubling every six months! So we need efficient search strategies and data structures to store and access data at such a scale.

Let's understand it from an application perspective! Many applications require an efficient mechanism that supports only the dictionary operations (insert, search, and delete). For example, a compiler that translates a programming language maintains a symbol table, where the keys of elements are character strings corresponding to identifiers in the language.

But the critical question is: what is the problem with arrays or linked lists? If a large data set is stored in an array, searching for an element takes O(log n) time if the array is sorted and O(n) time if it is not. The O(n) scenario is not a viable option when we need to process a large data set.

On the other hand, searching for an element in a hash table takes O(1) time on average and O(n) time in the worst case! The best part is that hashing performs exceptionally well in practice. So it's crucial to understand the concepts of hashing: hash tables, hash functions, etc.

Suppose an application requires a data storage mechanism in which each element has a key that is not too large. We can assume that no two elements have the same key, i.e., keys are distinct, and the range of keys is 0 to m-1.

Then we can simply build an array T[0..m-1], where the element with key k is placed at T[k], and T[i] = NIL if there is no element with key i. Depending on the application, we may either place the value itself at T[k] or just a pointer to the value.

This is an idea of a direct address table where each dictionary operation Insert, Search, and Delete will take O(1) time since they simply correspond to accessing the elements in the array.

In the above image, each key in the universe of all possible keys {0, 1, 2, ..., 9} corresponds to an index in the table. The set {2, 3, 5, 8} of actual keys determines the slots in the table that store values or pointers to the values.
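The idea above can be sketched in a few lines of Python. This is a minimal direct-address table, assuming all keys are distinct integers in the range 0 to m-1; the class and method names are illustrative:

```python
class DirectAddressTable:
    def __init__(self, m):
        # One slot for every possible key in 0..m-1; NIL is modeled as None.
        self.table = [None] * m

    def insert(self, key, value):
        # O(1): the key itself is the array index.
        self.table[key] = value

    def search(self, key):
        # O(1): returns None if no element has this key.
        return self.table[key]

    def delete(self, key):
        # O(1): reset the slot to NIL.
        self.table[key] = None
```

For example, with m = 10 we can insert a value at key 3 and retrieve or delete it in constant time, since each operation is a single array access.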

However, direct addressing has some limitations:

- We need a prior idea about the upper limit of key values.
- It is useful only when the keys are in a small range.
- It wastes memory when the actual number of keys is small compared to the total number of possible keys. In other words, we can only take advantage of it when we can afford an array with one slot for every possible key.

When the actual number of keys is small compared to the total number of possible keys, a hash table is an effective alternative to a direct address table. In other words, a hash table uses an array of size proportional to the number of actual keys and calculates the array index from the key rather than using the key as an array index directly. So the idea of the hash table generalizes the concept of a normal array.

- We use some **hash function** h(k) to compute the slot from the key k and store the element at slot h(k) in the hash table. In other words, we can say that an element with key k hashes to slot h(k), which is the **hash value** of key k. So the hash function reduces the range of array indices and hence the size of the table.
- The hash table requires much less storage than a direct-address table. Specifically, we can reduce the storage requirement to O(n) (where n is the number of actual keys) while maintaining the benefit that searching for an element in the hash table still requires only O(1) time on average.

So hashing operations consist of three separate parts:

- In hashing, the key does not have to be an integer. We use some relevant information from the input data and do some pre-processing to calculate the key (an integer).
- We use a hash function to transform the key into an array index.
- We can insert, delete or access the desired value using a key.
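The three parts above can be sketched as follows. The table size M = 16 and the 31-based polynomial for turning a string into an integer key are assumptions of this sketch, and collisions are ignored here; they are discussed next:

```python
M = 16
table = [None] * M

def make_key(identifier):
    # Part 1: pre-process the raw data (a string) into an integer key.
    key = 0
    for ch in identifier:
        key = key * 31 + ord(ch)
    return key

def hash_function(key):
    # Part 2: transform the integer key into an array index in 0..M-1.
    return key % M

def insert(identifier, value):
    # Part 3: use the key to store the value at its hashed slot.
    table[hash_function(make_key(identifier))] = value

def search(identifier):
    # Part 3: use the key to access the value at its hashed slot.
    return table[hash_function(make_key(identifier))]
```

For example, a compiler's symbol table could map the identifier string "count" to its metadata through exactly these three steps.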

There is one hitch: two keys may hash to the same slot. We call this situation a collision. Fortunately, we have effective techniques for resolving the conflict created by collisions. Of course, the ideal solution would be to avoid collisions altogether. We might try to achieve this goal by choosing a suitable hash function. Thus, we need a well-designed “random” looking hash function to minimize the number of collisions, but we still need a method for resolving the collisions. In a separate blog, we will discuss two different approaches to collision resolution: chaining and open addressing.
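A collision is easy to demonstrate. Assuming the simple hash function h(k) = k mod 10 (an assumption for this sketch), the distinct keys 12 and 22 hash to the same slot:

```python
m = 10

def h(k):
    # Division-method hash: maps any integer key to a slot in 0..9.
    return k % m

slot_a = h(12)  # slot 2
slot_b = h(22)  # also slot 2: a collision with key 12
```

Both keys compete for slot 2, so the table needs a rule (chaining or open addressing) to decide where the second element goes.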

In a mathematical sense, the map data structure is a relation between two sets. We can define a map M as a set of (key, value) pairs, where for a given key, we can find a value using some function. Most simply, we can think of an array as a map where the key is the index and the value is the value available at that index, i.e., for a given array A[], if i is the key, then we can find the value by simply looking up A[i].
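The same small map can be viewed either way, as an array (where the key is the index) or as an explicit set of (key, value) pairs, written here as a Python dict for illustration:

```python
A = [10, 20, 30]           # array view: key 1 maps to the value A[1]
M = {0: 10, 1: 20, 2: 30}  # the same relation as explicit (key, value) pairs
```

Looking up key 1 in either structure yields the same value, 20, which is what it means for an array to be a special case of a map.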

Without memory limitations, we could search with only one memory access by simply using the key as an index into a large array. This ideal is often unachievable because the memory required is impractical when the total number of possible keys far exceeds the actual number of keys. On the other hand, without time limitations, we could simply use linear search in an unordered array.

So hashing provides a way to balance these two extremes with a reasonable amount of memory and time. Indeed, it turns out that we can trade off time and memory in hashing algorithms by adjusting parameters, not by rewriting code! Think!

In hashing, we need a good hash function that is easy to compute and uniformly distributes the keys in the hash table. So a good hash function satisfies (approximately) the **assumption of uniform hashing**: each key is equally likely to hash to any of the m slots in the hash table.
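A quick way to see this assumption in action is to count how evenly a hash function spreads keys over the slots. This sketch assumes the division-method hash h(k) = k mod m with m = 11 (a prime); for 1000 consecutive keys, each of the 11 slots receives 90 or 91 keys, close to the uniform ideal:

```python
from collections import Counter

m = 11
# Hash 1000 consecutive keys and count how many land in each slot.
counts = Counter(k % m for k in range(1000))
```

A badly chosen m (say, a power of 2 with patterned keys) would concentrate keys in a few slots instead, which is why the choice of hash function matters.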

Unfortunately, this is an ideal situation: no hash function can totally avoid the possibility of collisions. So we need to deal with two things: (1) finding a good hash function that minimizes the likelihood of collisions, and (2) developing strategies for handling the collisions that still occur.

The critical question is: how do we identify a hash function close to the ideal situation? Think! Explore the hash functions blog to understand the ideas and properties of several hash functions.

In the next series of blogs, we will discuss collision resolution techniques, the implementation of hash table operations, and problem-solving using hashing. Enjoy learning, Enjoy thinking, Enjoy algorithms!

- Book: Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein (CLRS)
- Book: Algorithms by Robert Sedgewick
- Lectures: Introduction to Algorithms by MIT OpenCourseWare
- Book: The Algorithm Design Manual by Steven Skiena
