Bloom filter is a space-efficient probabilistic data structure that tells whether an element may be in a set or definitely is not. If we look up an item in the Bloom filter, we can get two possible results.
Suppose we want to compare two strings. Instead of storing the entire string, we compute the hash of the string and then compare the hashes. Computing hash and then comparing them takes O(1) time instead of the O(n) time it would take to compare the strings directly.
One drawback of this method is the limited range of the hash function. The hash function h1 ranges from 0 to r1–1. We cannot identify more than r1 strings uniquely with this method as at least two strings will have the same hash value. To compensate for this, we can use multiple hash functions. We use k hash functions, h1, h2, …, hk. Suppose we pass two strings through these hash functions and compute their hashes. If all the hashes are identical for both the strings, there is a very high probability that both the strings are the same. On the other hand, even if one hash does not match, we can definitely say that the strings are different.
Bloom filter is essentially an array that stores 0s and 1s. In the image below, each cell represents a bit. The number below the bit is its index for a bloom filter of size 10.
In order to add an element, we need to pass it through the k hash functions. Bits are set at the index of the hashes in the array. For example, we want to add the word “coding”. After passing it through three hash functions, we get the following results.
Suppose that the size of our bloom filter is m = 10. We need to take mod of 10 for each of these values so that the index is within the bounds of the bloom filter. Therefore, the indexes at 125%10 = 5, 67%10 = 7 and 19%10 = 9 have to be set to 1.
If we want to test the membership of an element, we need to pass it through the same hash functions. If the bits are already set for all these indexes, then this element might exist in the set. However, even if one index is not set, we are sure that this element is not present in the set.
Let’s say we want to check the membership of “cat” in our set. Furthermore, we have already added two elements, “coding” and “music”, to our set.
We pass “cat” through the same hash functions and get the following results.
So we check if the indexes {3, 5, 9} are all set to 1. As we can see, even though indexes 5 and 9 are set to 1, 3 is not. Thus we can conclude with 100% certainty that “cat” is not present in the set.
Now let’s say we want to check the existence of “gaming” in our set. We pass it through the same hash functions and get the following results.
We check if the indexes {0, 2, 5} are all set to 1. We can see that all of these indexes are set to 1. However, we know that “gaming” is not present in the set. So this is a false positive.
Can we predict the probability of these false positives? Yes, and it is explained below.
Let n = number of elements, m = length of the bloom filter, k = number of hash functions.
Assume that the hash function selects each index with equal probability. The probability that a particular bit is not set to 1 by a specific hash function during the insertion of an element is 1-(1/m). If we pass this element through k hash functions, the probability that the bit is not set to 1 by any of the hash functions is (1 -(1/m))^k.
After inserting inserted n elements, the probability that a particular bit is still 0 is (1 -(1/m))^(kn). Therefore the probability that it is 1 is 1 - (1 -(1/m))^(kn).
We want to test the membership of an element in the set.
Each of the k array positions computed by the hash functions is 1 with a probability 1 -(1- (1/m))^(kn). The probability of all of them being 1, which would result in a false positive, is (1 -(1 -(1/m))^(kn))^k. This is also known as the error rate.
As we can see, if m tends to infinity, the error rate tends to zero. Another way to avoid a high false-positive rate is to use multiple hash functions.
GitHub link: https://github.com/PrasannaIITM/Bloom-Filter/blob/main/bloomFilter.py
Bloom filters can be used as the first layer of filtering. The advantage is that they require constant space. They can be helpful where we want to know if something is definitely not present or possibly present somewhere. For example, we want to query a large database from a rotating hard drive that is slow to read. Suppose there is a high probability that the thing we want to query is not present in the database, then before querying, we can check for this item in the bloom filter. We can avoid the expensive query if the bloom filter says that this element is definitely not present in the database.
Using an LRU Cache here will remove some older but frequent results from the local storage for one-hit wonders, negatively affecting the user experience.
A Bloom filter is a data structure designed to tell rapidly and memory-efficiently whether an element is present in a set. The tradeoff is that it is probabilistic; it can result in False positives. Nevertheless, it can definitely tell if an element is not present. Bloom filters are space-efficient; they take up O(1)space, regardless of the number of items inserted as the bloom filter size remains the same. However, their accuracy decreases as more elements are added. Insert, and lookup operations are O(1) time.
Enjoy learning, Enjoy algorithms!
Given an array X[] of n integers, write a program to find the length of the longest consecutive elements sequence. In other words, we need to find the length of the longest sub-sequence such that elements in the subsequence are consecutive integers. The consecutive numbers can be in any order.
Ever wondered how does 1-click-buy works on Amazon? How does an e-commerce platform show the status of your order after the order is placed? What happens when you cancel your order right after you place an order, or after your item is shipped, or even delivered? How is all the activity related to an order tied to just one order Id? This blog will try to tackle such system design challenges and lay out key insights on designing a workflow system.
Level order traversal accesses nodes in level by level order. This is also called breadth-first search traversal or BFS traversal. Here we start from the root node and process it, then process all the nodes at the first level, then process all the nodes at the second level, and so on. In other words, we explore all nodes at the current level before moving on to the nodes at the next level.
In today’s world, notification services are widely employed in almost every product. They’re helpful if you’d like to be alerted of a price change or availability of a product you’re interested in, or if you’d want to be told if a new job specification for a job search criterion you’ve provided becomes available.
The Least Recently Used (LRU) is one of the popular caching strategies, which defines the policy to discard the least recently used items first from the cache and make room for new elements when the cache is full. It is used to organize items in order of their use, which allows identifying items that have not been used for a long time.
Subscribe to get free weekly content on data structure and algorithms, machine learning, system design, oops design and mathematics.