Bloom filter is a space-efficient probabilistic data structure that tells whether an element may be in a set or definitely is not. If we look up an item in the Bloom filter, we can get two possible results.

- The item is not present in the set: True negative.
- The item might be present in the set: Can be either a False positive or True positive.

Suppose we want to compare two strings. Instead of storing the entire string, we compute the hash of the string and then compare the hashes. Computing hash and then comparing them takes O(1) time instead of the O(n) time it would take to compare the strings directly.

One drawback of this method is the limited range of the hash function. The hash function h1 ranges from 0 to r1–1. We cannot identify more than r1 strings uniquely with this method as at least two strings will have the same hash value. To compensate for this, we can use multiple hash functions. We use k hash functions, h1, h2, …, hk. Suppose we pass two strings through these hash functions and compute their hashes. If all the hashes are identical for both the strings, there is a very high probability that both the strings are the same. On the other hand, even if one hash does not match, we can definitely say that the strings are different.

Bloom filter is essentially an array that stores 0s and 1s. In the image below, each cell represents a bit. The number below the bit is its index for a bloom filter of size 10.

In order to add an element, we need to pass it through the k hash functions. Bits are set at the index of the hashes in the array. For example, we want to add the word “coding”. After passing it through three hash functions, we get the following results.

- h1(“coding”) = 125
- h2(“coding”) = 67
- h3(“coding”) = 19

Suppose that the size of our bloom filter is m = 10. We need to take mod of 10 for each of these values so that the index is within the bounds of the bloom filter. Therefore, the indexes at 125%10 = 5, 67%10 = 7 and 19%10 = 9 have to be set to 1.

If we want to test the membership of an element, we need to pass it through the same hash functions. If the bits are already set for all these indexes, then this element might exist in the set. However, even if one index is not set, we are sure that this element is not present in the set.

Let’s say we want to check the membership of “cat” in our set. Furthermore, we have already added two elements, “coding” and “music”, to our set.

- Coding has the hash output {125, 67, 19} from the three hash functions, and as discussed above, the indexes {5, 7, 9} are set to 1.
- Music has the hash output {290, 145, 2} and the indexes {0, 2, 5} are set to 1.

We pass “cat” through the same hash functions and get the following results.

- h1(“cat”) = 233
- h2(“cat”) = 155
- h3(“cat”) = 9

So we check if the indexes {3, 5, 9} are all set to 1. As we can see, even though indexes 5 and 9 are set to 1, 3 is not. Thus we can conclude with 100% certainty that “cat” is not present in the set.

Now let’s say we want to check the existence of “gaming” in our set. We pass it through the same hash functions and get the following results.

- h1(“gaming”) = 235
- h2(“gaming”) = 60
- h3(“gaming”) = 22

We check if the indexes {0, 2, 5} are all set to 1. We can see that all of these indexes are set to 1. However, we know that “gaming” is not present in the set. So this is a false positive.

Can we predict the probability of these false positives? Yes, and it is explained below.

Let n = number of elements, m = length of the bloom filter, k = number of hash functions.

Assume that the hash function selects each index with equal probability. The probability that a particular bit is not set to 1 by a specific hash function during the insertion of an element is 1-(1/m). If we pass this element through k hash functions, the probability that the bit is not set to 1 by any of the hash functions is (1 -(1/m))^k.

After inserting inserted *n* elements, the probability that a particular bit is still 0 is (1 -(1/m))^(kn). Therefore the probability that it is 1 is 1 - (1 -(1/m))^(kn).

We want to test the membership of an element in the set.

Each of the *k* array positions computed by the hash functions is 1 with a probability 1 -(1- (1/m))^(kn). The probability of all of them being 1, which would result in a false positive, is (1 -(1 -(1/m))^(kn))^k. This is also known as the error rate.

As we can see, if m tends to infinity, the error rate tends to zero. Another way to avoid a high false-positive rate is to use multiple hash functions.

GitHub link: https://github.com/PrasannaIITM/Bloom-Filter/blob/main/bloomFilter.py

Bloom filters can be used as the first layer of filtering. The advantage is that they require constant space. They can be helpful where we want to know if something is definitely not present or possibly present somewhere. For example, we want to query a large database from a rotating hard drive that is slow to read. Suppose there is a high probability that the thing we want to query is not present in the database, then before querying, we can check for this item in the bloom filter. We can avoid the expensive query if the bloom filter says that this element is definitely not present in the database.

- Facebook uses bloom filters to avoid one-hit wonders. One-hit wonders are web objects requested by users just once. Using a Bloom filter to detect the second request for a web object and caching it only on its second request prevents one-hit wonders from entering the local storage. For example, searching for “coding” often can be stored in the local storage. However, if you search for something only once, like “giraffe”, it should not be stored in the local storage as it will be a one-hit-wonder.

Using an LRU Cache here will remove some older but frequent results from the local storage for one-hit wonders, negatively affecting the user experience.

- Medium uses Bloom filters in its Recommendation module to avoid showing those posts that the user has already seen.

A Bloom filter is a data structure designed to tell rapidly and memory-efficiently whether an element is present in a set. The tradeoff is that it is probabilistic; it can result in False positives. Nevertheless, it can definitely tell if an element is not present. Bloom filters are space-efficient; they take up O(1)space, regardless of the number of items inserted as the bloom filter size remains the same. However, their accuracy decreases as more elements are added. Insert, and lookup operations are O(1) time.

**Enjoy learning, Enjoy algorithms!**

Longest Consecutive Sequence

Given an array X[] of n integers, write a program to find the length of the longest consecutive elements sequence. In other words, we need to find the length of the longest sub-sequence such that elements in the subsequence are consecutive integers. The consecutive numbers can be in any order.

Segment Trees Part 2: Point Update, Range Update and Lazy Propagation

In this blog, we will learn how to use segment trees for efficient point and range updates. For the point update query, update(idx, val), we need to increment the element at the index idx from the original array by val, i.e. arr[idx] = arr[idx] + val. For the range update query, update(l, r, val), we must increment all the elements from index l to r from the original array by val.

Designing Uber: System Design Interview Question

Uber continues to improve its operations and services by deploying and developing new services to meet market demand, find the best and most efficient routes, detect any potential fraud, provide more customer-centric services, and monitor and update data to provide the most efficient real-time services. In this blog, we’ll be looking at how to design the Uber system.

Introduction to Heap in Data Structures and Algorithms

A heap is a complete binary tree structure where each element satisfies a heap property. We learn two types of the heap in programming: 1) max-heap, which satisfies the max heap property, and 2) min-heap, which satisfies the min-heap property. We use the heap to implement the priority queue efficiently.

Subscribe to get free weekly content on data structure and algorithms, machine learning, system design, oops design and mathematics.