Design Typeahead (Autocomplete) System

What is Typeahead?

Typeahead, or autocomplete, is a feature in which an application guesses the rest of a word as the user inputs it. In most graphical user interfaces, users can accept a suggestion by hitting the tab key or many suggestions by pressing the down arrow key. Furthermore, autocomplete is used in many search functions across numerous platforms, including Facebook, Instagram, and Google search.


Functional Requirements

  • As the user types their query, our service should offer the top 10 terms that begin with whatever they have written.
  • The frequency and recentness of a phrase or query are used to sort suggestions.

Non-Functional Requirements

  • Our system should have a high level of availability.
  • Suggestions should be able to be seen in real-time. The user should be able to see the suggestions within 200 milliseconds.
  • Scalable : The system can manage a huge volume of traffic.

Estimation of Scale and Constraints

Traffic estimations

We can expect 5 billion searches every day, or around 60K queries per second if we build a service on the same scale as Google.

Because there will be so many duplicates, we estimate that just 20% of the 5 billion searches will be unique. As a result, if we only want to index the top 50% of search phrases, we can remove many less often searched queries. Consider the case where we want to develop an index for 100 million different terms. Storage estimations:

The average query size will be 15 characters if each query contains an average of 3 words and a word has an average length of 5 characters. An average query will require 30 bytes to store if a character needs two bytes to store. As a result, we’ll need the total storage space = 100 million * 30 bytes = 3 GB.

Every day, we can anticipate some growth in this data, but we should also exclude some phrases that are no longer searched. If we assume we receive 2% new requests every day and preserve our index for the previous year, total storage = 3GB + (0.02 * 3 GB * 365 days) = 25 GB


  • get_suggestions(prefix)prefix: Until the search query is sent, whatever the user has written.
  • addtodb(query):query: Our database will be updated with a fresh, unique trending question that has been searched above a particular level.

High-level Design

According to our requirements, our autocomplete must be real-time. In other words, new search queries must be added to our database. As a result, we need to not only develop a system for providing ideas but also need to integrate popular search queries in our database so that users can get suggestions based on both current and popular searches.

As a result, we’ve divided our autocomplete into two sections:

  • Making recommendations: Given a search query or prefix, return the seven most frequently searched keywords.
  • Collecting information: In real-time, it collects and arranges user input requests. Real-time processing is impossible for large data sets, but it is a good starting point. We’ll take a closer look at a more realistic approach.

The get suggestions() query is sent to our application servers for each character typed when a user starts typing in the search field. It retrieves the user’s top seven options from our database and presents them.

Data Structure to store Search Queries

Because we need to handle a large number of queries with minimal latency, we need a system that can efficiently store our data so that it can be queried quickly. We won’t be able to use a database for this; therefore, we’ll have to keep our index in memory in a very efficient data structure.

For our needs, the Trie is one of the best data structures. A trie is a data structure for storing phrases that looks like a tree, with each node storing a character in the phrase in order. The following is the central concept of trie:

  • A trie is a data structure that looks like a tree.
  • An empty string is represented by the root.
  • Each node stores a character and has 26 children, one for each character that can be created. We don’t draw empty links to conserve space.
  • A single word or a prefix string is represented by each tree node.

If we need to store ‘camping, campus, cart’ in the trie, for example, it will look like this:

Trie data structure for storing search queries in Autocomplete System

If the user has typed ‘cam,’ our service can traverse the trie to the node ‘m’ to find all the terms that start with this prefix (e.g., cam-ping, cam-pus, etc.).

We should expect a massive tree given the amount of data we need to index. Even traversing a sub-tree might take a long time; for example, the term “system design interview questions” has 30 levels. We need to improve the efficiency of our solution because we have very severe latency constraints.

This will undoubtedly speed up our searches, but it will necessitate a significant amount of more storage. At each node, we can save the top ten suggestions and return them to the user. To reach the desired efficiency, we must bear a significant increase in our storage capacity.

Rather than saving the complete phrase, we can save space by storing only references to the terminal nodes. We must go back using the parent reference from the terminal node to find the suggested terms. To keep track of top ideas, we’ll need to save the frequency with each reference.

We can efficiently construct our trie from the bottom up. To compute their top suggestions and counts, each parent node will first recursively call all of the child nodes. Then, to decide their top choices, parent nodes will combine the top suggestions from all of their children.

Assuming a daily search volume of five billion, this equates to about 60K requests every second. If we try to update our trie on every query, it will consume a lot of resources, which will slow down our read requests. One way to deal with this might be to update our trie offline after a set period of time.

We can keep track of the frequency of new queries as they come in. We have the option of logging every query or sampling and logging every 1000th query. It’s okay to log every 1000th searched term if we don’t want to show a term that has been searched less than 1000 times.

We can set up a Map-Reduce (MR) job to process all of the logging data on a regular basis, say once an hour. These MR tasks will calculate the frequency of all terms searched in the previous hour. After that, we may update our trie with the new information. We can update the trie’s current snapshot with all of the new terms and their frequency. We should perform this offline because we don’t want to update trie requests to obstruct our read queries. There are two possibilities available to us:

  1. To update the trie offline, we can make a copy of it on each server. After that, we can start using it and throw away the old one.
  2. Another alternative is to set up each trie server as a master-slave arrangement. Then, while the master is serving traffic, we can update the slave. We can make the slave our new master after the update is complete. We can then upgrade our old master, which will then be able to serve traffic as well.

How can we keep the frequency of typeahead suggestions up to date? We need to change the frequencies of our typeahead suggestions because they are stored with each node. Rather than recreating all search phrases from the beginning, we can merely update changes in frequency. For example, if we want to maintain track of all the phrases searched in the last 10 days, we’ll need to deduct the counts from the time period that is no longer included and add the counts for the new period. Based on the Exponential Moving Average (EMA) of each term, we can add and subtract frequencies. The most recent data is given more weight in the EMA. The exponentially weighted moving average is another name for it.

We’ll proceed to the phrase’s terminal node and boost its frequency after we’ve added a new term to the trie. Because each node saves the top 10 searches, it’s likely that this search term leaped into the top 10 queries of a few additional nodes. As a result, we’ll need to change the top ten queries for those nodes. From the node, we must travel all the way up to the root. We check if the current query is in the top 10 for each parent. If that’s the case, we’ll adjust the frequency accordingly. If not, we see if the current query has a high enough frequency to be included in the top 10. If that’s the case, we’ll add this new phrase and eliminate the term with the lowest frequency.

What is the best way to delete a term from the trie? Assume we need to delete a term from the trial due to legal concern, hatred, or piracy. When the normal update occurs, we can totally delete such phrases from the trie; in the meantime, we can add a filtering layer to each server that will remove any such terms before sending them to users.

What different ranking criteria may there be for suggestions? In addition to a simple count, other criteria such as freshness, user location, language, demographics, personal history, and so on must be considered when ranking phrases.


For the most frequently searched terms in the cache, we can employ caching. This will be really beneficial in reducing latency. Caching is a frequent performance method. Before querying database servers, the server should check the cache server to determine if the search list is already in the cache. The suggested list should be ready in the cache in this scenario. As a result, list preparation time is eliminated, and reaction time is reduced. We can put a cache in front of database servers that contain the most commonly searched terms and their suggestions. And, because recent search results have a higher likelihood of being searched, update them using the LRU caching strategy.

Database Partitioning

  • The database can be partitioned based on the first letter of a term. For example, on one server, we can insert words beginning with ‘A,’ whereas in another server, we can enter words beginning with ‘B.’ The issue with this method is that it can result in uneven partitioning. There are extremely few words that begin with the letter ‘Z,’ for example.
  • The hash of the words can also be used to partition the data. The hash algorithm will generate a server number, which we will use to store the word. This may result in a random distribution. However, we may need to ask all of the servers to find a suggested list and then merge them to reach the desired result. As a result, reaction time latency may increase.

As a result, we can take the first strategy. We can combine those characters with fewer terms onto one server.

Replication and Load Balancer

For load balancing and fault tolerance, we need to have replicas for our trie servers. A load balancer that maintains track of our data partitioning method and redirects traffic depending on prefixes is also required.

Fault Tolerance

What happens if one of the trie servers goes down? We can have a master-slave arrangement, as mentioned previously; if the master dies, the slave can take over after failover. Any server that restarts can use the last snapshot to recreate the trie.

Enjoy Learning!

Share feedback with us

More blogs to explore

Our weekly newsletter

Subscribe to get weekly content on data structure and algorithms, machine learning, system design and oops.

© 2022 Code Algorithms Pvt. Ltd.

All rights reserved.