Typeahead, or autocomplete, is a feature in which an application guesses the rest of a word as the user types it. In most graphical user interfaces, users can accept a suggestion by hitting the tab key or cycle through several suggestions by pressing the down arrow key. Autocomplete is used in search functions across numerous platforms, including Facebook, Instagram, and Google Search.
If we build a service on the same scale as Google, we can expect around 5 billion searches every day, or roughly 60K queries per second.
Because there will be so many duplicates, we estimate that only 20% of the 5 billion searches will be unique. Furthermore, if we only want to index the top 50% of these search phrases, we can drop the many rarely searched queries. Suppose we want to build an index for 100 million unique terms. Storage estimates:
If each query contains an average of 3 words, and each word averages 5 characters, the average query is 15 characters long. At two bytes per character, an average query requires 30 bytes to store. Total storage needed = 100 million * 30 bytes = 3 GB.
Every day, we can anticipate some growth in this data, but we should also expire phrases that are no longer searched. If we assume 2% new queries every day and keep our index for the previous year: total storage = 3 GB + (0.02 * 3 GB * 365 days) ≈ 25 GB
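The back-of-the-envelope numbers above can be checked with a quick script (the figures are the assumptions stated in the text, not measurements):

```python
# Back-of-the-envelope storage estimate using the assumptions from the text.
unique_terms = 100_000_000        # terms we want to index
avg_query_chars = 3 * 5           # 3 words * 5 characters each
bytes_per_char = 2                # e.g. a two-byte encoding
bytes_per_query = avg_query_chars * bytes_per_char   # 30 bytes

base_storage_gb = unique_terms * bytes_per_query / 10**9   # ~3 GB
daily_growth = 0.02               # 2% new queries per day
yearly_storage_gb = base_storage_gb + daily_growth * base_storage_gb * 365

print(round(base_storage_gb), "GB base,", round(yearly_storage_gb), "GB after a year")
```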
According to our requirements, our autocomplete must be real-time; in other words, new search queries must be continuously added to our database. As a result, we need to not only build a system for serving suggestions but also integrate popular search queries into our database, so that users get suggestions based on both current and popular searches.
As a result, we’ve divided our autocomplete into two parts: serving suggestion queries in real time, and collecting and integrating new search data so the suggestions stay current.
When a user starts typing in the search field, a `get_suggestions()` query is sent to our application servers for each character typed. It retrieves the user’s top seven suggestions from our database and presents them.
Because we need to handle a large number of queries with minimal latency, we need a system that stores our data so that it can be queried very quickly. A traditional disk-backed database won’t be fast enough for this; instead, we’ll keep our index in memory in a highly efficient data structure.
For our needs, the trie is one of the best data structures. A trie is a tree-like data structure for storing phrases, in which each node stores one character of a phrase, in order. The central concept of the trie is the following:
If we need to store ‘camping, campus, cart’ in the trie, for example, it will look like this:
If the user has typed ‘cam,’ our service can traverse the trie to the node ‘m’ to find all the terms that start with this prefix (e.g., cam-ping, cam-pus, etc.).
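The insert-and-traverse behavior described above can be sketched as a minimal trie. This is an illustrative implementation, not the production design (which would also store frequencies and precomputed suggestions):

```python
# Minimal trie sketch: insert terms, then walk a prefix and collect completions.
class TrieNode:
    def __init__(self):
        self.children = {}        # char -> TrieNode
        self.is_terminal = False  # True if a complete term ends at this node

def insert(root, term):
    node = root
    for ch in term:
        node = node.children.setdefault(ch, TrieNode())
    node.is_terminal = True

def terms_with_prefix(root, prefix):
    # Walk down to the node for the prefix, then DFS to gather completions.
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    results = []
    def dfs(n, path):
        if n.is_terminal:
            results.append(prefix + path)
        for ch, child in n.children.items():
            dfs(child, path + ch)
    dfs(node, "")
    return results

root = TrieNode()
for term in ["camping", "campus", "cart"]:
    insert(root, term)
print(terms_with_prefix(root, "cam"))   # -> ['camping', 'campus']
```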
We should expect a massive tree given the amount of data we need to index. Even traversing a sub-tree might take a long time; for example, the term “system design interview questions” has 30 levels. We need to improve the efficiency of our solution because we have very severe latency constraints.
At each node, we can store the top ten suggestions for the prefix it represents and return them to the user directly. This will undoubtedly speed up our searches, but it necessitates a significant amount of extra storage; to reach the desired efficiency, we must accept a substantial increase in our storage footprint.
Rather than saving the complete phrase at every node, we can save space by storing only references to the terminal nodes. To recover a suggested term, we walk back up from the terminal node using parent references. To keep track of the top suggestions, we must also store the frequency with each reference.
We can efficiently construct our trie from the bottom up. Each parent node first recursively calls all of its child nodes to compute their top suggestions and counts, then merges the children’s top suggestions to determine its own.
Assuming a daily search volume of five billion, this equates to about 60K requests every second. If we try to update our trie on every query, it will consume a lot of resources, which will slow down our read requests. One way to deal with this might be to update our trie offline after a set period of time.
We can keep track of the frequency of new queries as they come in. We have the option of logging every query, or of sampling and logging only every 1000th query. Logging every 1000th searched term is acceptable if we don’t want to show terms that have been searched fewer than 1000 times anyway.
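A minimal sketch of that sampling approach, assuming a single-process counter (a real deployment would shard this across servers):

```python
from collections import Counter
import itertools

# Sampled logging sketch: record only every 1000th query. Terms searched far
# fewer than 1000 times will rarely appear in the log, which is acceptable
# since we would not surface them as suggestions anyway.
SAMPLE_RATE = 1000
query_ticker = itertools.count(1)   # counts every incoming query
frequency_log = Counter()           # sampled term -> sampled count

def log_query(term):
    if next(query_ticker) % SAMPLE_RATE == 0:
        frequency_log[term] += 1
```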
We can set up a Map-Reduce (MR) job to process all of the logged data on a regular basis, say once an hour. These MR tasks will calculate the frequency of all terms searched in the previous hour. After that, we can update our trie with the new information, applying all of the new terms and their frequencies to the trie’s current snapshot. We should perform this offline because we don’t want update traffic to obstruct our read queries. There are two possibilities available to us:
1. Make a copy of the trie on each server and update that copy offline. Once it is ready, the server switches over to the new copy and discards the old one.
2. Use a master-slave configuration for each trie server. The slave serves traffic while the master is updated; when the update finishes, the slave is promoted to serve as master, and the old master is updated in turn.
How can we keep the frequency of typeahead suggestions up to date? Since the frequencies are stored with each node, we need a way to change them without recreating all search phrases from scratch; we can merely apply the changes in frequency. For example, if we want to keep track of all the phrases searched in the last 10 days, we need to subtract the counts from the time period that is no longer included and add the counts for the new period. We can add and subtract frequencies based on the Exponential Moving Average (EMA) of each term, which gives the most recent data more weight. The EMA is also known as the exponentially weighted moving average.
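The EMA update is a one-line blend of the latest period’s count with the running average. The smoothing factor `ALPHA` below is an illustrative assumption, not a value from the text:

```python
# Exponential Moving Average of a term's frequency. A larger alpha weights
# recent data more heavily; 0.3 here is purely illustrative.
ALPHA = 0.3

def update_ema(previous_ema, new_period_count):
    """Blend the latest period's count with the running average."""
    return ALPHA * new_period_count + (1 - ALPHA) * previous_ema

ema = 0.0
for hourly_count in [100, 120, 400]:   # a term suddenly trending upward
    ema = update_ema(ema, hourly_count)
print(round(ema, 1))
```

Note how the final spike to 400 dominates the average: recent periods count for more, which is exactly why EMA suits trending search terms.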
We’ll proceed to the phrase’s terminal node and boost its frequency after we’ve added a new term to the trie. Because each node saves the top 10 searches, it’s likely that this search term leaped into the top 10 queries of a few additional nodes. As a result, we’ll need to change the top ten queries for those nodes. From the node, we must travel all the way up to the root. We check if the current query is in the top 10 for each parent. If that’s the case, we’ll adjust the frequency accordingly. If not, we see if the current query has a high enough frequency to be included in the top 10. If that’s the case, we’ll add this new phrase and eliminate the term with the lowest frequency.
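The upward walk described above can be sketched as follows. The `parent` and `top10` fields are hypothetical names, and for simplicity this version appends the term and then truncates to the k best, which drops it again if its frequency is too low:

```python
TOP_K = 10

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.top10 = []   # list of (term, frequency), highest frequency first

def propagate_update(terminal_node, term, new_frequency):
    """Walk from the terminal node's parent to the root, fixing top-10 lists."""
    node = terminal_node.parent
    while node is not None:
        node.top10 = [e for e in node.top10 if e[0] != term]  # drop stale entry
        node.top10.append((term, new_frequency))
        # Keep only the k highest-frequency terms, best first; a term that is
        # not frequent enough falls off the end here.
        node.top10 = sorted(node.top10, key=lambda e: e[1], reverse=True)[:TOP_K]
        node = node.parent

root = Node()
child = Node(parent=root)
leaf = Node(parent=child)
root.top10 = [("car", 9), ("cart", 5)]
child.top10 = [("car", 9), ("cart", 5)]

propagate_update(leaf, "cart", 12)   # "cart" jumps ahead of "car"
print(root.top10)                    # -> [('cart', 12), ('car', 9)]
```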
What is the best way to delete a term from the trie? Assume we need to delete a term from the trie due to legal issues, hate speech, or piracy. We can completely delete such phrases from the trie when the regular update occurs; in the meantime, we can add a filtering layer to each server that removes any such terms before sending suggestions to users.
What different ranking criteria may there be for suggestions? In addition to a simple count, other criteria such as freshness, user location, language, demographics, personal history, and so on must be considered when ranking phrases.
Caching the most frequently searched terms will be very beneficial in reducing latency. Before querying the trie servers, the application server should check the cache to see if the suggestion list for the prefix is already there. On a hit, the prepared list is returned immediately, eliminating list-preparation time and reducing response time. We can place a cache in front of the database servers that holds the most commonly searched terms and their suggestions, and evict with an LRU policy, since recent search results have a higher likelihood of being searched again.
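A minimal LRU cache for the prefix-to-suggestions layer can be built on `collections.OrderedDict`. This is a single-process sketch; a production system would use a distributed cache:

```python
from collections import OrderedDict

# Minimal LRU cache mapping prefix -> suggestion list, sketching the cache
# layer that sits in front of the trie/database servers.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, prefix):
        if prefix not in self.entries:
            return None
        self.entries.move_to_end(prefix)        # mark as recently used
        return self.entries[prefix]

    def put(self, prefix, suggestions):
        self.entries[prefix] = suggestions
        self.entries.move_to_end(prefix)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least recently used

cache = LRUCache(capacity=2)
cache.put("cam", ["camping", "campus"])
cache.put("car", ["cart"])
cache.get("cam")                 # touch "cam", so "car" becomes least recent
cache.put("sys", ["system design"])
print(list(cache.entries))       # -> ['cam', 'sys']
```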
To partition our trie across servers, we can use range-based partitioning on the first character of the terms, combining characters with fewer terms onto one server so the load stays balanced.
For load balancing and fault tolerance, we need to have replicas of our trie servers. We also need a load balancer that keeps track of our data partitioning scheme and routes traffic based on prefixes.
What happens if one of the trie servers goes down? We can have a master-slave arrangement, as mentioned previously; if the master dies, the slave can take over after failover. Any server that restarts can use the last snapshot to recreate the trie.