What is Instagram?
Instagram is a photo and video-sharing social media platform that allows users to share their creations with others. The original poster can set the visibility of these posts (photos/videos) to private or public. Posts can be liked and commented on by users. Users can follow and see the news feeds of other users (a collection of posts from the users they are following).
Users can also search for content across the entire platform. Image editing, location tagging, private messaging, push alerts, group messaging, hashtags, filters, and more are all available on Instagram.
Requirements of the system
- Photos should be able to be uploaded and viewed by users.
- Users can search for photos based on their titles.
- Other users can be followed by a user.
- Create a custom NewsFeed for each user that includes the best photos from all of the individuals and accounts the user follows.
Non Functional Requirements
- Read heavy : read to write ratio is very high.
- Low latency is expected while viewing photos.
- Access pattern for posts : optimize so that media content is easily accessible when the post gets the most interaction.
- Globally available : works on a wide range of devices, supports many languages, and works with a wide range of internet bandwidth.
- Our system should be highly scalable and reliable.
The crucial point to keep in mind is that the number of reading requests will be 100 times higher than the number of uploads (writing) requests. Assume we have 500 million users registered on our platform, with 1 million of them active per day. If 5 million images are posted every day, the number of photos uploaded in one second is:
Uploads per second = 5M / (24*60*60) ≈ 58 photos
If the average photo size is 150 KB, the following is the daily storage usage:
5M * 150KB ≈ 716 GB (taking 1 GB = 1024 * 1024 KB)
If we assume our service would continue for ten years, the space required will be:
716GB * 365 * 10 ≈ 2552TB ≈ 2.5PB (taking 1 TB = 1024 GB and 1 PB = 1024 TB)
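The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes binary units (1 GB = 1024 * 1024 KB); the variable names are illustrative. Because it does not round the daily figure up to 716 GB first, its ten-year total comes out slightly below the hand calculation.

```python
SECONDS_PER_DAY = 24 * 60 * 60

photos_per_day = 5_000_000   # uploads per day, from the text
avg_photo_kb = 150           # average photo size in KB, from the text
years = 10                   # assumed service lifetime, from the text

# Upload rate in photos per second.
uploads_per_second = photos_per_day / SECONDS_PER_DAY

# Daily storage, converting KB -> GB with binary units (assumption).
daily_storage_gb = photos_per_day * avg_photo_kb / 1024**2

# Ten-year storage, converting GB -> TB -> PB.
total_storage_tb = daily_storage_gb * 365 * years / 1024
total_storage_pb = total_storage_tb / 1024

print(f"uploads per second: {uploads_per_second:.1f}")
print(f"daily storage:      {daily_storage_gb:.0f} GB")
print(f"ten-year storage:   {total_storage_tb:.0f} TB ({total_storage_pb:.2f} PB)")
```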
A user service manages user onboarding, login, and profile-related actions. The user service runs on a MySQL database, chosen because the data is structured and largely relational. User data is also read-heavy rather than write-heavy, and MySQL suffices for such a query pattern. The user service is also linked to a Redis store, which holds the user’s data. When the user service receives a request, it first looks the user up in Redis. If Redis contains the information, the service returns it directly; otherwise it reads the record from MySQL, inserts it into Redis for future use, and then returns it to the user. Redis is also updated whenever a new user is added or user information changes.
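The lookup order described above is commonly called the cache-aside pattern. The sketch below illustrates it with plain dictionaries standing in for Redis and MySQL; the `UserService` class and its method names are hypothetical, and a real service would use actual database clients and an expiry policy.

```python
from typing import Dict, Optional


class UserService:
    """Cache-aside reads: check the cache first, fall back to the database."""

    def __init__(self, db: Dict[int, dict], cache: Dict[int, dict]) -> None:
        self.db = db        # stand-in for MySQL
        self.cache = cache  # stand-in for Redis

    def get_user(self, user_id: int) -> Optional[dict]:
        user = self.cache.get(user_id)
        if user is not None:
            return user                    # cache hit
        user = self.db.get(user_id)        # cache miss: read from the database
        if user is not None:
            self.cache[user_id] = user     # populate the cache for future reads
        return user

    def save_user(self, user_id: int, profile: dict) -> None:
        # New or changed user data goes to the database and refreshes the cache.
        self.db[user_id] = profile
        self.cache[user_id] = profile
```

A first call to `get_user` for an uncached user reads from the database and fills the cache; later calls for the same user are served from the cache.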
The system will be made up of multiple microservices, each of which performs a different task. The data will be stored in a graph database such as Neo4j. We’ve chosen a graph data model because our data contains complex relationships: entities such as users, posts, and comments become nodes of the graph, and the graph’s edges record relationships like follows, likes, and comments, among other things. In addition, a wide-column database such as Cassandra can be used to store information such as user feeds, activities, and counters.
Overall Data Flow and API Design
- An API request is sent by the user.
- The request is received by the load balancer, which then sends it to an app server.
- That request is received by an app server.
- After input validation and sanitization, the app server tries to fulfill the request.
- If everything went well, the app server delivers an ok response with or without required data; otherwise, it sends a specified error response.
signup (username, firstname, lastname, saltedpasswordhash, phone_number, email, bio, photo)
- adds the user to the user table
login (username, saltedpasswordhash)
- log in and update the last login time
search_user (searchstring, authtoken)
- return public user data for a given search string (can be searched in user first name, last name, and username)
- return public user data for given user-id
follow_user(userid, targetuserid, authtoken)
add_post(file, caption, userid, authtoken)
- upload file to file storage server
delete_post(userid, postid, auth_token)
- delete given user’s given post along with its metadata(use soft delete).
get_feed(userid, count, offset, timestamp, authtoken)
- return top posts after the given timestamp of users followed by the given user according to count and offset.
get_user_posts(userid, count, offset, authtoken)
- return posts of the given user according to count and offset
post_like(userid, postid, auth_token)
- add given post id to given user’s likes
post_unlike(userid, postid, auth_token)
- remove given post id from given user’s likes
add_comment(userid, postid, comment)
- add a comment from the given user to the given post
- delete given user’s comment of given comment id
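A few of the endpoints above can be sketched as an in-memory service to make the soft delete and the count/offset paging concrete. This is only an illustration under assumptions: the `PostStore` class, the `Post` fields, and the storage layout are hypothetical, and auth token validation and file upload are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Post:
    post_id: int
    user_id: int
    caption: str
    deleted: bool = False  # soft-delete flag: the record is kept but hidden


class PostStore:
    """Hypothetical in-memory stand-in for the post and like tables."""

    def __init__(self) -> None:
        self.posts: Dict[int, Post] = {}
        self.likes: Dict[int, Set[int]] = {}  # user id -> liked post ids
        self._next_id = 1

    def add_post(self, user_id: int, caption: str) -> int:
        post_id = self._next_id
        self._next_id += 1
        self.posts[post_id] = Post(post_id, user_id, caption)
        return post_id

    def delete_post(self, user_id: int, post_id: int) -> bool:
        post = self.posts.get(post_id)
        # Only the owner may delete, and only once.
        if post is None or post.user_id != user_id or post.deleted:
            return False
        post.deleted = True
        return True

    def post_like(self, user_id: int, post_id: int) -> None:
        self.likes.setdefault(user_id, set()).add(post_id)

    def post_unlike(self, user_id: int, post_id: int) -> None:
        self.likes.get(user_id, set()).discard(post_id)

    def get_user_posts(self, user_id: int, count: int, offset: int) -> List[Post]:
        visible = [p for p in self.posts.values()
                   if p.user_id == user_id and not p.deleted]
        return visible[offset:offset + count]
```

Soft-deleted posts stay in `posts`, so they could be restored or purged later by a background job.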
Early in the interview, define the database structure to aid in understanding the data flow between various components and, eventually, data segmentation.
Data about users, their posted images, and the people they follow must be stored. The photo table will store all data connected to a photo; because we need to fetch recent photos first, it requires an index on (PhotoID, CreationDate).
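As a rough illustration of the photo table, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The index columns follow the text; the other column names (`UserID`, `PhotoPath`), the integer timestamp, and the `recent_photos` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Photo (
    PhotoID      INTEGER PRIMARY KEY,
    UserID       INTEGER NOT NULL,
    PhotoPath    TEXT    NOT NULL,
    CreationDate INTEGER NOT NULL   -- Unix timestamp (assumption)
);
CREATE INDEX idx_photo_id_date ON Photo (PhotoID, CreationDate);
""")

# A few sample rows: (PhotoID, UserID, PhotoPath, CreationDate).
conn.executemany(
    "INSERT INTO Photo VALUES (?, ?, ?, ?)",
    [(1, 10, "/p/1.jpg", 1000),
     (2, 10, "/p/2.jpg", 3000),
     (3, 11, "/p/3.jpg", 2000)],
)


def recent_photos(connection: sqlite3.Connection, limit: int) -> list:
    """Return photo ids ordered from newest to oldest."""
    rows = connection.execute(
        "SELECT PhotoID FROM Photo ORDER BY CreationDate DESC LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows]
```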
Because we need joins, a simple option for storing the aforementioned structure would be to utilize an RDBMS like MySQL. However, relational databases have their own set of issues, particularly when it comes to scaling. Photos can be stored in a distributed file system such as HDFS or S3.
To make use of NoSQL’s features, we can store the aforementioned schema in a distributed key-value store. All photo metadata can be stored in a table with a ‘key’ of ‘PhotoID’ and a ‘value’ of an object including PhotoLocation, UserLocation, CreationTimestamp, and so on.
To know who owns which photo, we need to store relationships between users and photos. We also need to keep track of who a user follows. We can use a wide-column datastore like Cassandra for both of these tables. The ‘key’ for the ‘UserPhoto’ table would be ‘UserID,’ and the ‘value’ would be the user’s list of ‘PhotoIDs,’ kept in distinct columns. The ‘UserFollow’ table will follow a similar pattern.
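The key/value layout of these tables can be pictured with ordinary dictionaries. The sketch below is not Cassandra code; it only mirrors the shape of the data, and the helper names (`add_photo`, `follow`, `photos_of_followed`) are hypothetical.

```python
from collections import defaultdict

photo_metadata = {}               # PhotoID -> metadata object
user_photo = defaultdict(list)    # UserID  -> PhotoIDs owned by the user
user_follow = defaultdict(set)    # UserID  -> UserIDs the user follows


def add_photo(photo_id, user_id, photo_location, creation_timestamp):
    """Store the photo metadata and record who owns the photo."""
    photo_metadata[photo_id] = {
        "PhotoLocation": photo_location,
        "CreationTimestamp": creation_timestamp,
    }
    user_photo[user_id].append(photo_id)


def follow(user_id, target_user_id):
    """Record that user_id follows target_user_id."""
    user_follow[user_id].add(target_user_id)


def photos_of_followed(user_id):
    """Collect the PhotoIDs of every account the user follows."""
    return [photo_id
            for followed in sorted(user_follow[user_id])
            for photo_id in user_photo[followed]]
```

Looking up a user's photos or followees is a single key read, which is the access pattern the text is aiming for.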
Cassandra, like most distributed key-value stores, maintains a configured number of replicas to ensure reliability. Deletes are also not applied immediately in such data stores; data is retained for a certain number of days (to allow for undeleting) before being permanently removed from the system.
News Feed Generation
Generating News Feed
One of the most important requirements of an Instagram-like service is a customized newsfeed for each user, featuring the most recent posts from the users they follow. For the sake of simplicity, assume that the accounts a user follows upload 200 new unique photos per day in total. A user’s newsfeed will then consist of a combination of these 200 unique photos, followed by repeats of earlier posts.
So, in order to generate a news feed for a user, we will first acquire the metadata (likes, comments, time, location, and so on) of the most recent 200 photographs and give it to the ranking algorithm, which will determine how the photos should be placed in the newsfeed based on the metadata.
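The ranking step can be sketched as scoring each candidate photo from its metadata and sorting by score. The text does not specify the ranking algorithm, so the formula below (engagement divided by age) and the `PhotoMeta` fields are purely illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhotoMeta:
    photo_id: int
    likes: int
    comments: int
    age_hours: float


def score(photo: PhotoMeta) -> float:
    # Hypothetical formula: comments weigh twice as much as likes,
    # and older photos are penalized.
    return (photo.likes + 2 * photo.comments) / (1 + photo.age_hours)


def rank_feed(candidates: List[PhotoMeta], limit: int = 200) -> List[PhotoMeta]:
    """Return the highest-scoring photos first, at most `limit` of them."""
    return sorted(candidates, key=score, reverse=True)[:limit]
```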
The major disadvantage of the above newsfeed generation approach is that it necessitates simultaneously querying a large number of tables and then ranking them based on predefined criteria. As a result, this approach will result in higher latency, i.e. it will take a long time to generate a newsfeed.
Pregenerating News Feed : To avoid the problems with the approach above, we’ll set up a server that generates a unique newsfeed for each user ahead of time and stores it in a separate newsfeed table. With this method, we simply query this table whenever the user wants to see their most recent newsfeed.
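The split between an offline generation step and a cheap read can be sketched as follows. A dictionary stands in for the newsfeed table, posts are assumed to be `(timestamp, post_id)` pairs, and the function names are hypothetical; a real generator would also apply the ranking step rather than sorting by time alone.

```python
feed_table = {}  # user id -> precomputed list of post ids, newest first


def regenerate_feed(user_id, follows, posts_by_user, limit=200):
    """Background job: rebuild one user's feed and store it in the feed table."""
    candidates = []
    for followed in follows.get(user_id, ()):
        candidates.extend(posts_by_user.get(followed, ()))
    # Posts are (timestamp, post_id) pairs; newest first.
    candidates.sort(key=lambda post: post[0], reverse=True)
    feed_table[user_id] = [post_id for _, post_id in candidates[:limit]]


def get_feed(user_id, count, offset=0):
    """Read path: a single lookup in the precomputed table."""
    return feed_table.get(user_id, [])[offset:offset + count]
```

The read path never touches the follow or post tables, which is where the latency saving comes from.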
Serving the News Feed
We have now seen how to create a news feed. The next big challenge in Instagram architecture design is determining how the user will get the generated newsfeed.
Push : One method is to notify all of a user’s followers whenever that user uploads a new photo. We can do this by using long polling.
A potential issue with this strategy is that if a user follows a large number of people or celebrities, the server will have to push updates and deliver notifications very frequently.
Pull : When users want to see new content, they refresh their newsfeeds (send a pull request to the server). The difficulty with this strategy is that new posts will not appear until users refresh, and most refreshes will return empty results.
Hybrid Approach : The hybrid strategy will employ the Pull-Based approach for all users with a large number of followers (celebrities) and the Push-Based approach for all other users.
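A minimal sketch of the hybrid strategy: posts from ordinary authors are pushed into each follower's inbox at publish time, while posts from authors with many followers are pulled when a feed is read. The `HybridFeed` class, the tiny `CELEBRITY_THRESHOLD`, and the `(timestamp, text)` post format are assumptions chosen to keep the example small.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff, kept tiny for the example


class HybridFeed:
    def __init__(self):
        self.followers = defaultdict(set)  # author -> users following them
        self.following = defaultdict(set)  # user -> authors they follow
        self.inbox = defaultdict(list)     # user -> posts pushed to them
        self.outbox = defaultdict(list)    # author -> their own posts

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def publish(self, author, post):
        """Posts are (timestamp, text) tuples."""
        self.outbox[author].append(post)
        if not self.is_celebrity(author):
            # Push: fan the post out to every follower right away.
            for follower in self.followers[author]:
                self.inbox[follower].append(post)

    def get_feed(self, user):
        # Pull: fetch celebrity posts only when the feed is requested.
        pulled = [post
                  for author in self.following[user]
                  if self.is_celebrity(author)
                  for post in self.outbox[author]]
        # The set removes duplicates if an author crossed the threshold
        # after some of their posts had already been pushed.
        return sorted(set(self.inbox[user]) | set(pulled), reverse=True)
```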
For user requests, we require a load balancer. To distribute requests among app servers, we can use the round-robin technique. However, plain round-robin may send a request to a server that is unavailable. As a solution, we can employ a heartbeat system, in which each server pings the load balancer at a set interval to signal that it is still up. Load balancers are also required for the DB and cache servers because they are distributed as well. Since both hold user-specific data, we can use consistent hashing to decide which server each request should go to.
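Consistent hashing can be sketched as a sorted ring of hashed server positions, where a key is owned by the first server found clockwise from the key's hash. The class below is a simplified illustration: the use of SHA-256, the 100 virtual nodes per server, and the server names are assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self._ring = []  # sorted list of (hash, server) positions
        for server in servers:
            self.add(server, vnodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add(self, server: str, vnodes: int = 100) -> None:
        # Several virtual nodes per server spread the load more evenly.
        for i in range(vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("no servers in the ring")
        # First position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

When a server is removed, only the keys it owned move to other servers; every other key keeps its owner, which is why the technique suits user-specific cache and DB shards.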
The Least Bandwidth Method will be used to spread the load among the servers. This algorithm selects the server currently serving the least amount of traffic, measured in megabits per second (Mbps).
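Least-bandwidth selection combined with a heartbeat check might look like the sketch below. The `Server` fields, the `HEARTBEAT_TIMEOUT` value, and the `pick_server` function are hypothetical; a real load balancer would measure traffic itself rather than read it from a field.

```python
import time
from dataclasses import dataclass
from typing import List, Optional

HEARTBEAT_TIMEOUT = 5.0  # seconds without a ping before a server is skipped


@dataclass
class Server:
    name: str
    mbps: float            # traffic currently served, in Mbps
    last_heartbeat: float  # time of the last ping, on the time.monotonic() clock


def pick_server(servers: List[Server], now: Optional[float] = None) -> Server:
    """Choose the healthy server that is currently serving the least traffic."""
    now = time.monotonic() if now is None else now
    alive = [s for s in servers if now - s.last_heartbeat <= HEARTBEAT_TIMEOUT]
    if not alive:
        raise RuntimeError("no healthy servers available")
    return min(alive, key=lambda s: s.mbps)
```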
The Load Balancers can be placed between:
- The client and the server.
- The database and the server.