Indexing is an excellent way to improve database performance. An index is nothing more than a reference to data in a table. For example, we can think of a database index the same way we think of a book index. The aim of creating an index is to perform read and search queries faster.
It is a way to organize files with several fields. We have to maintain an additional data structure with indexing to retrieve the filed value and the corresponding record point. The index structure is generally sorted and allows binary searches for efficient performance. There are several methods of indexing used but before delving deep into them, let’s first look with an example of why do we need indexing at all :)
Suppose you are in your city’s central library and have 5 minutes to find a particular book. Would you accomplish this task and find the book in the specified time frame? It would be nearly impossible to search for any book in such an extensive collection of books by checking each section. However, if we request access to the library’s index, our task would be easier because indexes provide all the information required to locate the book quickly.
Similarly, a database index includes all of the information needed to access data efficiently. Nowadays, the IT industry is evolving rapidly and generating vast amounts of data. All such data need to be stored, but we need a way to efficiently store and retrieve such a massive amount of data. The company’s ability to access the data quickly is critical to its success. However, one way to achieve this is using the similar idea of maintaining indexes as we saw in the above Library example. The indexes function as lookup tables and store data efficiently for faster retrieval.
Now it is clear why we need indexing in databases. Now let’s proceed ahead and look into the various methods of indexing. There are primarily three methods of indexing:
Cluster indexing is a form of storage in which more than two records are stored in the same file. We may reduce the search cost by using cluster indexing since many records related to the same item are stored in one location.
Here in this method, an ordered data file is used to describe the clustering index, and a non-key field is used to sort the data file. In some instances, the index is built on columns other than the primary key, which may or may not be unique for each record. In such instances, we can group two or more columns to get the unique values and index them to make it easier to find the data. Hence records with similar characteristics are grouped, and indexes for these groups are generated.
A secondary index indicates where the data is stored by providing a list of virtual pointers or references to the data’s location. The data is not physically processed in the index’s order but is stored in leaf nodes.
For example, consider a book’s contents page. Each entry specifies the information’s page number or position. Although the data is not structured, we have an ordered reference showing where the data points are located. We can only have dense ordering in the non-clustered index because data is not physically ordered, so sparse ordering is unlikely. It takes longer than the clustered index because more work is required to extract the data by following the pointer further. The data is immediately in front of the index in the case of a clustered index.
Indexes expand in tandem with the size of the database. A single-level index can become too big to store with multiple disc accesses because it is stored in the main memory. Multilevel indexing divides the main block into smaller blocks stored together in a single block. Internal blocks are separated into outer blocks, which are then pointed to the data blocks. With fewer overheads, this can be conveniently stored in the main memory.
Data is stored in rows and organized into tables in a database. Each row has a unique key that distinguishes it from the others, and these keys are stored in an index for easy retrieval. Since keys are stored in indexes, the index is automatically modified whenever a new row with a specific key is inserted.
As data is saved on a disk-based storage unit, it is organized into blocks. The atomic disc access operation is when these blocks are accessed in their entirety. Disk blocks are organized similarly to linked lists: they have a data section and a pointer to the next node (or block), and they don’t have to be stored in the same order.
We may claim that searching on a field that isn’t sorted requires a Linear Search because multiple records can only be sorted on one field. If that field is a non-key field, N block accesses are needed to search the entire tablespace. With a sorted field with log2 N block accesses, a Binary Search can be used. Furthermore, since the data is sorted based on a non-key field, the rest of the table does not need to be searched for duplicate values until a higher value is discovered. As a result, the performance boost is essential.
Indexes are a powerful method that a database uses in the background to speed up querying. Indexes power queries by allowing for a fast lookup of the requested information. It’s a data structure technique for quickly locating and accessing data in a database. Indexing is a technique for improving database efficiency by reducing the number of disc accesses needed when a query is running.
Subscribe to get weekly content on data structure and algorithms, machine learning, system design and oops.