Design Yelp

What is Yelp?

Yelp is a proximity service that helps users discover nearby attractions such as restaurants, theatres, temples, and recreational areas.

Key Requirements

Our service will store information about various locations so that users can search for them. It will deliver a list of places near the user after a query is made. The following requirements should be met by our Yelp-like service:

Functional Requirements

  1. Searching is the fundamental operation of a proximity server. Given a location (latitude and longitude), users should be able to search for all nearby places within a specified radius.
  2. Users can create new places or edit existing ones by adding or updating basic attributes such as photos, a brief description, and so on.
  3. Users can give places ratings (from one to five stars) and reviews (written and photographs).

Non-Functional Requirements

  1. The system should be extremely reliable, with real-time search capabilities and low latency.
  2. We can anticipate this application being read-heavy, since search requests will far outnumber the addition of new places.

Capacity Estimation

The system currently serves 100K queries per second, and the total number of places is estimated at 200 million. We should design the system for at least five years of growth; assuming 20% growth per year, the scale roughly doubles:

  • Our system should be capable of handling about 400 million places.
  • A load of 200K requests per second should be no problem for our system.
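The estimates above can be sanity-checked with a quick calculation (this assumes simple, non-compounded 20% annual growth, which matches the 400M figure):

```python
# Back-of-the-envelope check of the five-year capacity estimates,
# assuming simple (non-compounded) 20% annual growth.
current_places = 200_000_000   # 200M places today
current_qps = 100_000          # 100K queries per second today
growth_rate = 0.20             # 20% per year (assumption)
years = 5

scale_factor = 1 + growth_rate * years   # 2x over five years
future_places = int(current_places * scale_factor)
future_qps = int(current_qps * scale_factor)

print(future_places)  # 400000000 places
print(future_qps)     # 200000 requests per second
```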

High-Level Design

We have three high-level APIs for this service:

Search API:

The search API takes a user query and returns information about matching places.

searchRequest: query_param = {
   search_terms: "paratha",
   radius: 10,
   category_filter: "indian",
   filter: "5 star rated restaurants",
   maximum_no_of_results: 50
}
Response: A JSON containing information about a list of popular places matching the search query.

Add places:

The add-places API creates a new place and returns the newly created place along with its place_id.

placesRequest: body = {
   place_name: "Baba ka Dhaba",
   latitude: 28.63,
   longitude: 77.22,
   category: "Restaurant",
   description: "Best local shop",
   photos: ["url1", "url2"]
}
Response: Response of the newly created place with place_id

Add reviews:

The add-reviews API adds a review for a place.

reviewsRequest: body = {
   user_id: "user12", 
   place_id: "place12",                              
   rating: 5,                               
   review: "Food was awesome",     
   photos: ["url1", "url2"]                   
}
Response: The newly created review with its id.

Database Schema

The basic schema of the database for the system is described below.

Each dataset described above (places, reviews, etc.) must be stored and indexed. Users expect real-time results when searching for local places, so the index must be read-efficient to query this enormous database. We don’t need to worry about frequent updates, because the position of a place rarely changes. In contrast, if we were building a service where objects such as people or cabs change their location regularly, we might come up with a radically different design. Before moving on to the detailed component design, let’s look at the various options for storing this information and determine which technique is appropriate for our needs:

Simple SQL based storage

Places, reviews, and user details can easily be stored in a SQL database, with latitude and longitude indexed for search optimization. We can use the concept of 2D grids to make the search even faster.

We can divide the entire world map into small squares. Assuming a searchable area of roughly 100 million square km and a fixed search radius of 10 km, a grid size of 10 km × 10 km (100 square km per cell) gives us 100M / 100 = 1 million squares. By fixing the grid size to the query radius, we only need to search within the user’s grid and its eight neighboring grids.

Every place belongs to exactly one grid based on its location. Each grid has a unique id that can be indexed and stored in the places table. Let’s see how our basic search flow works :)

Because our grids are statically created with a grid size equal to the search radius, we can directly compute the grid id for any location, along with its eight neighboring grids. As a result, the query’s overall runtime improves dramatically: the search scope narrows to just nine grids, instead of the brute-force strategy of scanning the entire map.

We can make this even faster by keeping each grid’s id and its list of place ids in memory. As stated above, we will have 1 million grids; assuming each grid id is 5 bytes and each place id is 10 bytes (at our scale of 100 million places), the total memory required to cache grid and place ids is about 1 GB:

(1M × 5 bytes) + (100M × 10 bytes) ≈ 1 GB
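The grid lookup above can be sketched as follows. The 10 km cell size matches the text; the rough degree-to-km conversion and the function names are illustrative assumptions:

```python
import math

GRID_KM = 10.0          # fixed grid size, equal to the search radius
KM_PER_DEGREE = 111.0   # rough km per degree of lat/lng (assumption)

def grid_id(lat: float, lng: float) -> tuple:
    """Map a (lat, lng) pair to the integer coordinates of its grid cell."""
    row = math.floor((lat + 90.0) * KM_PER_DEGREE / GRID_KM)
    col = math.floor((lng + 180.0) * KM_PER_DEGREE / GRID_KM)
    return (row, col)

def grids_to_search(lat: float, lng: float) -> list:
    """Return the user's grid plus its eight neighbours (9 cells total)."""
    row, col = grid_id(lat, lng)
    return [(row + dr, col + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

cells = grids_to_search(28.63, 77.22)   # a point in Delhi
print(len(cells))  # 9 candidate grids instead of scanning the whole map
```

With the grid id indexed in the places table, a search becomes a query over these nine grid ids followed by a distance filter.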

Problems with this approach

  1. Grids covering popular areas (e.g., big cities) will hold far more places than grids covering sparse areas such as coastal regions or islands. This results in grid imbalance, with some grids intensely populated and others nearly empty.
  2. We could dynamically adjust grid sizes by enforcing a maximum number of locations per grid, but this strategy is challenging to execute and adds complexity.

However, we can solve this issue using QuadTrees. Let’s look at what QuadTrees are and how they work :)

QuadTrees

A quadtree is a tree data structure in which each internal node has exactly four children and each leaf node has none. A quadtree’s unique strength is efficiently subdividing a flat 2-dimensional space and maintaining location data in its nodes.

In our case, each node can represent a grid and hold information about the locations within that grid. When a node exceeds a bucket size of X places, it is split into four child nodes and its data is redistributed among them recursively.

Initially, we’d keep all of the places in one root node, but because our scale is 100 million places, the root node can’t hold them all. The root node is therefore recursively split into four child nodes until no node holds more than 100 locations. We now have a QuadTree for 100 million locations, with all places stored in the leaf nodes.

Let’s see how our basic search flow works :)

We’ll begin our search at the root node and descend until we reach the appropriate node. As stated above, the required node will always be a leaf node, since places are stored only in leaf nodes.

Our QuadTree construction ensures that neighboring leaf nodes contain geographically nearby locations. As a result, we can also take neighboring nodes into account when gathering nearby places.

By caching the QuadTree in memory, we can make search even faster. As stated above, we will have a total of 100M / 100 = 1 million leaf nodes. Assume each node id is 5 bytes and each internal node holds four child pointers of 5 bytes each. On top of that, each place needs 10 bytes each for its location id, latitude, and longitude. To store everything, we’ll therefore need roughly:

(3 × 10 bytes × 100M) + (4 × 5 bytes × 1M) ≈ 3 GB

Further Optimization

Given this magnitude, we cannot rely on a single server to handle all of the traffic; it would become a single point of failure and undermine availability. Partitioning the QuadTree is therefore a good idea, and we can use a variety of methods to partition the data:

  1. Region-based sharding: We can partition the data by region, but this results in a non-uniform distribution, because some regions are intensely populated while others are sparse, as noted above. Our challenge of uniform data distribution would remain unsolved.
  2. Sharding based on place id: We can shard the data using hashing or consistent hashing on the place id. While building the QuadTree, we iterate over all places and use our hash function to compute the hash of each place id. Each place id then maps to a server that stores information about that place.

The second method is simpler. Even if we end up with multiple QuadTrees, this isn’t a big deal, because it ensures a uniform distribution of places across servers.
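A minimal sketch of this second approach, mapping place ids to QuadTree servers via consistent hashing; the server names, virtual-node count, and the use of MD5 are illustrative assumptions:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map place ids to servers; adding or removing a server moves few keys."""

    def __init__(self, servers, vnodes=100):
        # Each server gets `vnodes` points on the ring for a smoother spread.
        self.ring = []  # sorted list of (hash, server) points
        for server in servers:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, place_id: str) -> str:
        """Walk clockwise from the key's hash to the first server point."""
        idx = bisect.bisect(self.ring, (self._hash(place_id),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["qt-server-1", "qt-server-2", "qt-server-3"])
print(ring.server_for("place12"))  # the shard that stores this place's data
```

Because the mapping depends only on the hash, every lookup for the same place id lands on the same QuadTree server, and adding a new server reshuffles only a small fraction of place ids.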

Data Replication

The ability to keep functioning even in the face of failures is one of the most critical requirements of any large-scale distributed system. We therefore can’t rely on a single machine and risk making it a single point of failure.

We can use a master-slave architecture: only masters handle writes, while slaves serve reads. If a master server goes down, a slave server can step in as the new master and serve the writes. With this strategy there may be a delay of a few milliseconds before newly written data becomes visible, resulting in eventual consistency, but that should be fine since this isn’t a banking application.

A replica of the QuadTree index server is also required for fault tolerance. If a QuadTree server fails, it can be rebuilt from the QuadTree index server instead of re-querying the database and increasing the database load.

Load Balancing

We can use a load balancer to make our system work efficiently. We can add load balancers in two places:

  1. between clients and application servers
  2. between application servers and backend servers.

A basic round-robin technique can initially distribute incoming requests evenly among the backend servers. This kind of LB is easy to set up and adds no extra overhead. Another advantage is that if a server goes down, the load balancer removes it from the rotation and stops sending it traffic. The drawback of round-robin LB is that it doesn’t take server load into account; to tackle this, a more intelligent LB would be required, one that regularly polls the backend servers about their load and adjusts traffic accordingly.
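A round-robin balancer that drops failed servers from rotation, as described above, could look like the following sketch (class and server names are illustrative):

```python
import itertools

class RoundRobinBalancer:
    """Rotate across backends in order; skip servers marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(servers)          # updated by health checks
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        self.healthy.add(server)

    def next_server(self):
        # Try at most one full rotation before giving up.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy backend available")

lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")                         # health check failed
print([lb.next_server() for _ in range(4)])   # app-2 is skipped
```

The smarter, load-aware variant mentioned above would replace the fixed rotation with a choice based on each backend’s reported load.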

Conclusion

In this blog, we discussed the system design of Yelp, a tough system to design and scale. I hope you got a good understanding of it. Please do share your views :)

Our Weekly Newsletter

Subscribe to get well-designed content on data structures and algorithms, machine learning, system design, OOPS, and mathematics. Enjoy learning!
