Database partitioning is the backbone of modern distributed database management systems. It is the process of dividing a large dataset into several smaller partitions, which may be placed on different machines. In other words, it is a way of splitting data such as tables or index-organized tables into smaller pieces so that the data can be accessed and managed more easily.
With the growth in services and user base, it becomes tricky for a single server or database to function efficiently. We may experience lower performance with a single-database-server architecture. Here are some situations that could arise:
Database partitioning helps us address these challenges by distributing data across several partitions. Each partition may reside on the same machine (co-resident) or on different machines (remote). The idea of co-resident partitioning is to reduce the size of individual indexes and the amount of I/O needed to update records. Similarly, the idea of remote partitioning is to increase the bandwidth of access to data by having more RAM (which avoids disk access) and more network interfaces and disk I/O channels available.
A view of partitioned tables
There are several scenarios when partitioning data can be beneficial:
However, not all cases require data partitioning. It is important to carefully consider the specific needs of the system before deciding whether or not to use partitioning.
There are several benefits to using data partitioning:
Improved Availability: By partitioning the database, we can ensure the high availability of our application. Individual partitions can be managed independently, so if one partition becomes unavailable, the other partitions can still execute database queries successfully. This helps to avoid a single point of failure for the entire dataset and increases the overall availability of the service.
Note: Keeping data in different partitions helps the database administrator do backup and recovery operations on each partition, independent of the other partitions. This could allow the active partition of the database to be made available sooner so access to the system can continue while the inactive data is still being restored.
Improved Scalability: Every piece of hardware has capacity limits. As traffic increases, the performance of a service running on a single machine can degrade. Data partitioning allows us to scale out the service by distributing the data across multiple partitions, so capacity is no longer bounded by one machine.
Improved Security: Data partitioning can also improve security by storing sensitive and non-sensitive data in different partitions. This allows for better management and increased security for sensitive data.
Improved Query Performance: Instead of querying the entire database, data partitioning allows the system to query smaller components, improving overall performance.
Improved Data Manageability: Data partitioning divides tables and indexes into smaller, more manageable units. This "divide and conquer" approach to data management allows for easier maintenance of particular table partitions.
Data partitioning can be done through various strategies that split the database into separate, smaller databases. Broadly, three data partitioning strategies are used. Let’s have a look at each of them.
Horizontal partitioning, also known as database sharding, is a strategy for splitting table data horizontally based on the range of values defined by a partition key. In this approach, the table is divided into smaller, more manageable tables, with each row of the table being assigned to a single partition. This allows each partition to be managed independently of the others.
Suppose a large table containing many rows of customer data has slow query performance. We could partition the table horizontally into two separate tables. The first table would contain the first half of the customer data, and the second table would contain the second half. This allows us to query either partition 1 or partition 2, depending on the partition key. For example, if we store contact info for customers, we can keep the contact info for names starting with A-H on one partition and the contact info for names starting with I-Z on another partition.
The benefit of the horizontal partition: The horizontal partition scheme is the most straightforward partitioning method. It involves dividing the database into separate partitions that have the same schema as the original database. As a result, a query that filters on the partition key can be answered from a single partition without combining data from multiple partitions.
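The A-H / I-Z split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the partition labels, the sample customers, and the helper names `partition_for` and `find_contact` are hypothetical and not part of any particular database.

```python
from collections import defaultdict


def partition_for(name: str) -> str:
    """Route a customer name to a partition by its first letter."""
    first = name.strip()[:1].upper()
    if "A" <= first <= "H":
        return "partition_1"
    if "I" <= first <= "Z":
        return "partition_2"
    raise ValueError(f"no partition covers name {name!r}")


# Hypothetical sample data; every partition keeps the same schema.
customers = [
    {"name": "Alice", "contact": "alice@example.com"},
    {"name": "Bob", "contact": "bob@example.com"},
    {"name": "Hana", "contact": "hana@example.com"},
    {"name": "Ivan", "contact": "ivan@example.com"},
    {"name": "Zoe", "contact": "zoe@example.com"},
]

partitions = defaultdict(list)
for row in customers:
    partitions[partition_for(row["name"])].append(row)


def find_contact(parts, name):
    """Look up a contact by scanning only the partition that owns the key."""
    for row in parts[partition_for(name)]:
        if row["name"] == name:
            return row["contact"]
    return None
```

A lookup such as `find_contact(partitions, "Zoe")` touches only the second partition, which is the point of routing by partition key.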
The disadvantage of the horizontal partition: Data may not be evenly distributed across the partitions. For example, if there are many more customers with names that fall in the range of A-H than in the range I-Z, the first partition may experience a much heavier load than the second partition.
Vertical partitioning involves dividing a table into smaller tables based on columns. It resembles normalization in that columns are moved into separate tables, but the goal is to improve performance and storage placement rather than to remove redundancy. In this method, each partition holds a subset of the columns and can be stored separately. For example, in a social media application like LinkedIn, a user's profile data, list of connections, and articles they have written can be placed on separate partitions using vertical partitioning. The user's profile data would be on one partition, the list of connections on a second partition, and the articles on a third partition. This can help to improve the performance and scalability of the database.
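As a rough sketch of the profile / connections / articles split described above, the following Python divides one wide record into column groups that share the primary key. The column names, the sample record, and the helpers `split_vertically` and `join_partitions` are assumptions made for illustration.

```python
# Hypothetical wide user record following the example above.
user_row = {
    "user_id": 42,
    "name": "Asha",
    "headline": "Data engineer",
    "connections": [7, 19, 23],
    "articles": ["Intro to sharding"],
}

# Each vertical partition keeps the primary key plus a subset of columns.
COLUMN_GROUPS = {
    "profiles": ["name", "headline"],
    "connections": ["connections"],
    "articles": ["articles"],
}


def split_vertically(row, key="user_id"):
    """Return one fragment per partition, each carrying the primary key."""
    return {
        part: {key: row[key], **{col: row[col] for col in cols}}
        for part, cols in COLUMN_GROUPS.items()
    }


def join_partitions(parts):
    """Rebuild the full row by merging every fragment on the shared key."""
    merged = {}
    for fragment in parts.values():
        merged.update(fragment)
    return merged


parts = split_vertically(user_row)
```

Reading only the profile touches `parts["profiles"]`, while rebuilding the whole record needs every fragment, which is the kind of join cost that vertical partitioning can introduce.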
There are a few disadvantages to using vertical partitioning:
In this type of partitioning strategy, data is organized based on the contextual dependency of a service. For example, a medical store system might store information about medicines in one partition and invoice data in another partition.
The choice of which type of partitioning to use depends on the structure of the data. In some cases, it may be useful to combine horizontal and vertical partitioning to take advantage of both methods. For example, if we have a large dataset of customer information with different data types, we could use vertical partitioning to move the string-valued columns into their own partition and horizontal partitioning to divide the customer rows.
There are a large number of criteria available for data partitioning. Most of them use partition keys and assign partitions on their basis. Some of the data partitioning criteria are range-partitioning, list-partitioning, hash partitioning, and many more.
In range partitioning, data is organized into partitions based on ranges of values for a partitioning key, so each partition contains the rows whose key falls within a specific range. The ranges are typically contiguous and do not overlap. Each partition usually specifies a non-inclusive upper bound, and the upper bound of the previous partition serves as its inclusive lower bound. Any partitioning key value equal to or higher than a partition's upper bound is therefore placed in the next higher partition.
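The bound rule above can be sketched with a binary search over the upper bounds. The key (`order_year`), the bounds, and the partition names below are hypothetical; the sketch assumes exclusive upper bounds, so a key equal to a bound lands in the next partition.

```python
from bisect import bisect_right

# Hypothetical exclusive upper bounds for an integer key such as order_year.
UPPER_BOUNDS = [2020, 2022, 2024]
PARTITIONS = ["p_before_2020", "p_2020_2021", "p_2022_2023"]


def range_partition(key: int) -> str:
    """Return the partition whose range contains the key."""
    # bisect_right counts the bounds that are <= key, so a key equal to
    # an upper bound is pushed into the next higher partition.
    idx = bisect_right(UPPER_BOUNDS, key)
    if idx == len(PARTITIONS):
        raise ValueError(f"key {key} is above the highest partition bound")
    return PARTITIONS[idx]
```

In this sketch a key beyond the last bound is rejected. A real system might instead route such keys to a catch-all partition, which is a design choice this example does not cover.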
Range partitioning is used in a few specific cases:
In hash partitioning, rows are assigned to partitions by applying a hashing algorithm to the partition key. This is different from range partitioning, which groups database rows based on contiguous ranges of key values.
Hash partitioning can be used in a few different ways:
One disadvantage of hash partitioning is that dynamically adding or removing database servers can be expensive. For example, with a simple scheme that assigns keys by hash value modulo the number of partitions, adding a partition changes the mapping, so many keys must be remapped and migrated to a different partition. During this process, a large number of requests may go unserved, resulting in downtime until the migration is complete. Consistent hashing addresses this problem by limiting how many keys move when the set of servers changes.
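To make the remapping cost concrete, here is a minimal sketch that compares simple modulo-based hash partitioning with a basic consistent-hash ring when a fifth server is added. It is an illustration, not a production design: the node names, the key format, the use of MD5 as a stable hash, and the choice of 100 virtual nodes per server are all assumptions.

```python
import hashlib
from bisect import bisect_right


def stable_hash(value: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def modulo_partition(key: str, n: int) -> int:
    """Simple hash partitioning: hash value modulo the partition count."""
    return stable_hash(key) % n


class ConsistentHashRing:
    """Basic ring where each node owns several points (virtual nodes)."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # The key belongs to the first point clockwise from its hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect_right(self._points, stable_hash(key)) % len(self._points)
        return self._ring[idx][1]


keys = [f"user-{i}" for i in range(10_000)]

# Keys whose partition changes when going from 4 to 5 partitions.
moved_modulo = sum(modulo_partition(k, 4) != modulo_partition(k, 5) for k in keys)

old_ring = ConsistentHashRing(["db-1", "db-2", "db-3", "db-4"])
new_ring = ConsistentHashRing(["db-1", "db-2", "db-3", "db-4", "db-5"])
moved_ring = sum(old_ring.node_for(k) != new_ring.node_for(k) for k in keys)

print(f"modulo: {moved_modulo} keys moved, ring: {moved_ring} keys moved")
```

With modulo partitioning, most keys change partition when the count goes from 4 to 5. On the ring, only the keys now owned by the new server move, which is roughly one fifth of them in this setup.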
In list partitioning, each partition is defined and selected based on a list of values for a particular column, rather than a set of contiguous ranges of values. Some key points to consider when using list partitioning include:
As an example, consider a table with data for 20 video stores distributed among 4 regions, as shown in the following table:
Using list partitioning, we could partition the table so that rows for stores belonging to the same region are stored in the same partition. This would allow us to easily add or drop records relating to specific regions from the table.
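A minimal sketch of this idea in Python follows. The store IDs, region names, and their assignment are illustrative assumptions chosen so that 20 stores fall into 4 regions; the helper `list_partition` is hypothetical as well.

```python
# Hypothetical assignment of 20 store IDs to 4 regions.
REGION_STORES = {
    "north": [3, 5, 6, 9, 17],
    "east": [1, 2, 10, 11, 19, 20],
    "west": [4, 12, 13, 14, 18],
    "central": [7, 8, 15, 16],
}

# Invert the lists so each store ID maps directly to its partition.
STORE_TO_PARTITION = {
    store: f"p_{region}"
    for region, stores in REGION_STORES.items()
    for store in stores
}


def list_partition(store_id: int) -> str:
    """Return the partition whose value list contains the store ID."""
    try:
        return STORE_TO_PARTITION[store_id]
    except KeyError:
        raise ValueError(f"store {store_id} is not in any partition list") from None


# Hypothetical rows, grouped by partition.
rows = [{"store_id": s, "sales": 100 * s} for s in range(1, 21)]
partitions = {}
for row in rows:
    partitions.setdefault(list_partition(row["store_id"]), []).append(row)

# Dropping every record for one region removes a single partition.
dropped = partitions.pop("p_west")
```

Because rows for one region sit together, removing the region is a single partition-level operation instead of a row-by-row delete over the whole table.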
Composite partitioning is a method of partitioning data based on two or more partitioning techniques. In this method, data is first partitioned using one technique, and then each partition is further divided into sub-partitions using the same or a different method.
Some key points to consider when using composite partitioning include:
Composite partitioning can be a useful technique for organizing and managing large datasets. It can help to improve the performance and scalability of the database by allowing for more precise control over data placement.
There are several types of composite partitioning:
Composite Range-Range Partitioning: This method performs range partitioning on two columns. For example, we could first partition the data by date and then sub-partition each date range by price.
Composite Range-Hash Partitioning: This method first partitions the data using range partitioning and then sub-partitions it using hash partitioning within each range partition.
Composite Range-List Partitioning: This method first partitions the data using range partitioning and then sub-partitions it using list partitioning within each range partition.
Composite List-Range Partitioning: This method performs range sub-partitioning within a list partition. For example, we could first perform list partitioning by country name and then perform range sub-partitioning by date.
Composite List-Hash Partitioning: This method sub-partitions list-partitioned data using the hash partitioning technique.
Composite List-List Partitioning: This method performs list partitioning based on two table dimensions. For example, we could perform list partitioning by country name and list sub-partitioning by customer account status.
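As one illustration of the variants above, the following sketch implements list-range partitioning: rows are first placed by country using value lists and then sub-partitioned by date range. The country groups, date bounds, and partition names are hypothetical, and the date bounds are assumed to be exclusive upper bounds.

```python
from bisect import bisect_right
from datetime import date

# First level: list partitioning by country code (hypothetical groups).
COUNTRY_LISTS = {
    "p_emea": {"DE", "FR"},
    "p_apac": {"IN", "JP"},
}

# Second level: range sub-partitioning by order date (exclusive upper bounds).
DATE_UPPER_BOUNDS = [date(2024, 1, 1), date(2025, 1, 1)]
SUBPARTITIONS = ["before_2024", "y2024"]


def composite_partition(country: str, order_date: date) -> tuple:
    """Return (partition, sub-partition) for a row."""
    for partition, countries in COUNTRY_LISTS.items():
        if country in countries:
            break
    else:
        raise ValueError(f"country {country!r} is not in any partition list")

    idx = bisect_right(DATE_UPPER_BOUNDS, order_date)
    if idx == len(SUBPARTITIONS):
        raise ValueError(f"date {order_date} is above the highest bound")
    return partition, SUBPARTITIONS[idx]
```

The other composite variants follow the same two-step shape, with the list lookup or the range search swapped for a hash or another range as needed.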
There are a few key considerations when it comes to data partitioning:
Data Partitioning is the backbone of modern distributed data management systems. Data Partitioning proves very effective in improving the availability, scalability, and performance of the system. In this blog, we tried to present a full conceptual understanding of Data Partitioning. Hope you liked it. Please share your views in the comments below.
Enjoy learning, Enjoy system design!