Fundamentals of Distributed Systems

Nowadays, almost every large-scale application is based on distributed computing. So understanding distributed system concepts is crucial in designing fault-tolerant, highly scalable, and low-latency services. This blog will introduce the fundamental concepts of distributed systems and their key characteristics.

What is a Distributed System? How does it Work?

A distributed system is a collection of independent software components or machines that work together and appear as a single system to the end user. These components are spread across multiple computers connected by a local or wide area network and exchange messages to complete tasks efficiently. Most of the time, these components operate concurrently and fail independently, ideally without bringing down the system as a whole.

Distributed systems can process large numbers of requests and manage millions of users at the same time. Because several machines can perform the same task concurrently, if one machine is unavailable, the others keep the system running, ensuring fault tolerance and reliability. They are easily scalable and significantly increase performance as well.

For example, suppose we are using a traditional database stored on a single machine to perform read and write operations. As the volume of data increases, query performance will decrease. One solution is to partition the database across multiple machines. If read traffic is much higher than write traffic, we can use master-slave (primary-replica) replication to handle read and write requests separately.
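A minimal sketch of the read/write split described above. The `ReplicatedStore` class is hypothetical and keeps everything in memory: writes go to one primary (master) and are copied to the replicas (slaves), while reads rotate over the replicas. Real databases usually replicate over the network and often asynchronously, which this sketch deliberately ignores.

```python
import itertools
from typing import Optional


class ReplicatedStore:
    """Toy primary-replica store (hypothetical): writes hit the primary, reads hit replicas."""

    def __init__(self, replica_count: int) -> None:
        if replica_count < 1:
            raise ValueError("at least one replica is required")
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key: str, value: str) -> None:
        # Every write is applied to the primary first ...
        self.primary[key] = value
        # ... and then copied to each replica (synchronously, to keep the sketch simple).
        for replica in self.replicas:
            replica[key] = value

    def read(self, key: str) -> tuple:
        # Reads never touch the primary; they are spread over the replicas in turn.
        index = next(self._next_replica)
        value: Optional[str] = self.replicas[index].get(key)
        return index, value
```

With two replicas, consecutive calls to `read` alternate between replica 0 and replica 1, so read traffic no longer competes with writes on the primary.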

So it is evident that such systems involve considerable complexity. The critical question is: Why do we use distributed systems if they are complex and challenging to manage? Think!

Scalability in a distributed system

One of the major advantages of adopting distributed systems is their ability to provide highly scalable services. Such systems can continuously evolve to support growing workloads, such as a large number of user requests or database transactions. So the goal of a scalable distributed system is to handle this growth without performance loss.

Generally, the performance of a system declines as its size increases. For example, network communication may become slower because nodes tend to be far apart, performance may decrease due to increased user traffic, etc. Scalable distributed systems try to avoid such situations by evenly balancing the incoming load among all available nodes.
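One simple way to balance load evenly is round-robin: each new request goes to the next node in a fixed rotation. The sketch below is only an illustration; the `RoundRobinBalancer` class and the node names are made up, and real load balancers usually also consider node health and current load.

```python
import itertools
from collections import Counter


class RoundRobinBalancer:
    """Hands out nodes in a fixed rotation so that each node gets an equal share."""

    def __init__(self, nodes) -> None:
        nodes = list(nodes)
        if not nodes:
            raise ValueError("at least one node is required")
        self._rotation = itertools.cycle(nodes)

    def pick(self) -> str:
        # Return the next node in the rotation.
        return next(self._rotation)


def distribute(balancer: RoundRobinBalancer, request_count: int) -> Counter:
    # Count how many requests each node would receive.
    return Counter(balancer.pick() for _ in range(request_count))
```

Sending 9 requests through a balancer with three nodes gives each node exactly 3 requests.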

Traditional systems generally scale using Vertical Scaling, i.e., by adding more power (CPU, RAM, Storage, etc.) to an existing server. Vertically scaled services are a poor fit for operating on a vast scale because they are expensive and the single server is a single point of failure. Vertical scaling is also limited to the capacity of a single server, and scaling beyond that capacity often involves downtime.

However, horizontal scaling allows us to keep scaling by adding more servers to the pool of resources. If there is any performance degradation, we can add more machines, which is usually cheaper at large scale than vertical scaling, although it adds some coordination overhead.

The figure below compares the cost to a company of Vertical vs Horizontal Scaling.

Comparison between vertical and horizontal scaling

Good examples of horizontal scaling are Cassandra and MongoDB, as both provide an easy way to scale horizontally by adding more machines to meet growing needs. Similarly, MySQL is a good example of a system that is commonly scaled vertically by switching from smaller to bigger machines. However, this process often involves downtime.

Performance of a distributed system

There are two standard parameters to measure the performance of a distributed system: 1) Latency or response time, which denotes the delay between sending a request and receiving its response. 2) Throughput, which denotes the number of requests served in a given time. Both depend on the number of responses sent by nodes and on the size of those responses, i.e. the volume of data exchanged. Note: The performance of a distributed system also depends on factors like network load, the architecture of software and hardware components, etc.
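To make the two measures concrete, here is a small sketch that times a batch of requests. The `measure` function and its `handler` argument are hypothetical stand-ins for a real service call, and the requests are issued one after another, so this only shows how the two numbers are derived and is not a benchmarking tool.

```python
import time


def measure(handler, requests) -> dict:
    """Call handler(request) for each request and report latency and throughput."""
    latencies = []
    started = time.perf_counter()
    for request in requests:
        t0 = time.perf_counter()
        handler(request)
        # Latency: time between sending one request and getting its response.
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started
    if not latencies:
        raise ValueError("at least one request is required")
    return {
        "mean_latency_s": sum(latencies) / len(latencies),
        "max_latency_s": max(latencies),
        # Throughput: number of requests served per second of wall-clock time.
        "throughput_rps": len(latencies) / elapsed if elapsed > 0 else float("inf"),
    }
```

Note that the two numbers answer different questions: latency describes a single request, while throughput describes the system as a whole over a time window.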

Most large-scale services are read-heavy, and a high volume of reads can degrade system performance. To deal with this, one can use replication, which also improves availability and fault tolerance. But there is a limit to this! To further increase performance, distributed systems offer another way to scale the service: sharding the database. With sharding, one can split the central database server into smaller servers called shards and achieve higher performance by distributing the load.
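A minimal sketch of sharding, assuming the simplest possible scheme: hash the key and take the remainder modulo the number of shards. The `shard_for` function and `ShardedStore` class are hypothetical, and each shard is just an in-memory dictionary standing in for a separate database server.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash (same key, same shard)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Toy key-value store whose data is split across several shards."""

    def __init__(self, shard_count: int) -> None:
        if shard_count < 1:
            raise ValueError("at least one shard is required")
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key: str, value: str) -> None:
        # Each key is stored on exactly one shard.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        # Only the shard that owns the key is queried.
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

One known drawback of modulo hashing is that changing the number of shards moves most keys to a different shard. Consistent hashing is a common alternative when shards are added or removed often.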

Scalability vs Performance

Scalability is related to performance, but they are not the same thing. Performance measures how long it takes to process a request or to perform a specific task, whereas scalability measures how much we can grow or shrink.

For example, suppose we have 100 concurrent users, with each user sending a request once every 5 seconds (on average). In this situation, we end up with a throughput requirement of 100 / 5 = 20 requests per second. Performance determines how much time we need to serve these 20 requests per second, and scalability determines how many more users we can handle and how many more requests the system can serve without degrading the user experience.
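The arithmetic in this example can be written down as two small helper functions. Both function names are made up for illustration, and the capacity of 50 requests per second used in the usage note is a hypothetical figure, not a measurement.

```python
def required_throughput(concurrent_users: int, seconds_between_requests: float) -> float:
    """Requests per second generated by users who each send one request per interval."""
    return concurrent_users / seconds_between_requests


def max_users(capacity_rps: float, seconds_between_requests: float) -> int:
    """How many such users a system can serve if it sustains capacity_rps requests per second."""
    return int(capacity_rps * seconds_between_requests)
```

`required_throughput(100, 5)` gives the 20 requests per second from the example. If the system could sustain 50 requests per second, `max_users(50, 5)` says it could serve 250 such users before the user experience starts to degrade.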

Reliability of a distributed system

Reliability is one of the main characteristics of any distributed system, defined as the probability that a system operates without failure over a given time period. A distributed system is considered reliable if it keeps delivering its services even when one or several components fail. In such systems, a healthy machine can take over from a failing one.

For example, one of the primary requirements of an e-commerce store is that any user transaction should never be cancelled due to a failure of a machine. If a user has added an item to the shopping cart, the system is expected not to lose it. So reliability can be achieved through redundancy, i.e. if the server carrying the user’s shopping cart fails, another server with the exact replica of the shopping cart should replace it. Redundancy has a cost, and a reliable system has to pay to achieve such resilience by eliminating every single point of failure.
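The shopping cart scenario can be sketched as follows. `CartServer` and `ReplicatedCartService` are hypothetical classes: every healthy server receives each write, so any healthy server can answer reads when another one fails. The sketch assumes a failed server simply stops receiving writes and does not cover bringing a recovered server back in sync.

```python
class CartServer:
    """One machine holding a full copy of every user's shopping cart."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True
        self.carts = {}


class ReplicatedCartService:
    """Writes to all healthy servers; reads from the first healthy one."""

    def __init__(self, servers) -> None:
        self.servers = list(servers)

    def _healthy_servers(self) -> list:
        healthy = [s for s in self.servers if s.healthy]
        if not healthy:
            raise RuntimeError("no healthy server available")
        return healthy

    def add_item(self, user: str, item: str) -> None:
        # Redundancy: the item is stored on every healthy server.
        for server in self._healthy_servers():
            server.carts.setdefault(user, []).append(item)

    def get_cart(self, user: str) -> list:
        # Failover: any healthy replica can serve the cart.
        server = self._healthy_servers()[0]
        return list(server.carts.get(user, []))
```

If the first server is marked unhealthy after an item has been added, `get_cart` still returns the item from the second server. The extra server is the cost of redundancy mentioned above.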

How important is reliability?

Bugs or outages of critical applications cause lost productivity and can have considerable costs in terms of lost revenue and damage to reputation. Even in noncritical applications, businesses have a responsibility towards their users. Suppose a customer stored all their necessary details in an application. How would they feel if that database was suddenly corrupted? How would they feel if there were no mechanism to restore it from a backup?

Availability of a distributed system

Availability is a simple measure of the percentage of time that a system or service remains operational under normal conditions. An aircraft that can be flown many hours a month without much downtime can be said to have high availability.
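To see what an availability percentage means in practice, it helps to convert it into allowed downtime. The sketch below assumes a 365-day year, and the function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def availability_percent(uptime_minutes: float, downtime_minutes: float) -> float:
    """Percentage of total time during which the system was operational."""
    return 100 * uptime_minutes / (uptime_minutes + downtime_minutes)


def allowed_downtime_minutes_per_year(target_percent: float) -> float:
    """Downtime budget per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - target_percent / 100)
```

With these assumptions, a 99.9% target leaves about 525.6 minutes (under 9 hours) of downtime per year, while 99.99% leaves about 52.6 minutes.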

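A service that is up but responds very slowly can be as good as unavailable to its users, so response time targets matter here as well. For example, Amazon describes response time requirements for internal services in terms of the 99.9th percentile. Even though this only affects 1 in 1,000 requests, it is important because the customers with the slowest requests often have the most data on their accounts because they have made many purchases (i.e. they are the most valuable customers).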
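A percentile such as the 99.9th can be computed from a list of measured values. The sketch below uses the nearest-rank method, which is one of several common definitions; the `percentile` function is written for illustration and is not taken from any particular monitoring tool.

```python
import math


def percentile(values, p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # Rounding guards against floating-point noise before taking the ceiling.
    rank = math.ceil(round(p * len(ordered) / 100, 9))
    return ordered[rank - 1]
```

For 1,000 measurements, the 99.9th percentile is the 999th smallest value, so exactly one request in 1,000 is allowed to be worse than the reported number.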

Reliability vs Availability

If a system is reliable, it is available. However, if it is available, it is not necessarily reliable. In other words, high reliability contributes to high availability. Still, achieving high availability even with an unreliable system is possible by minimising maintenance time and ensuring that machines are always available when needed.

For example, suppose a system has 99.99% availability for the first two years after its launch. Unfortunately, the system was launched without any security testing. The customers are happy, but they never realize that the system is vulnerable to security risks. In the third year, the system experiences a series of security problems that suddenly result in low availability for long periods. This may result in financial loss to the customers. So the system was highly available for two years without ever being reliable.

Other key features of a distributed system

Manageability: Another important consideration while designing a distributed system is manageability, i.e. how simple it is to operate and maintain the system. If the time taken to fix a system failure increases, availability decreases. In other words, early detection of faults can reduce or avoid system downtime.
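The link between repair time and availability is often expressed with the standard steady-state approximation availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. This formula is not taken from the text above, and the numbers in the usage note are hypothetical.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

With a failure every 999 hours on average, a 1-hour repair time gives 0.999 availability, while a 9-hour repair time drops it to roughly 0.991. Faster detection and repair raise availability even if failures are just as frequent.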

Concurrency: Components in distributed systems execute as concurrent processes. In other words, several components can access and update shared resources at the same time, which requires coordination so that they do not interfere with each other. Concurrency helps reduce latency and increase the throughput of the distributed system.
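A small sketch of concurrent access to a shared resource, using threads on one machine as a stand-in for components on different machines. The `SafeCounter` class and `run_workers` function are hypothetical; the lock is what keeps simultaneous updates from interfering with each other.

```python
import threading


class SafeCounter:
    """A shared counter that several threads can update without losing increments."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Only one thread at a time may perform the read-modify-write step.
        with self._lock:
            self._value += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


def run_workers(worker_count: int, increments_per_worker: int) -> int:
    """Start several workers that all update the same counter, then return the total."""
    counter = SafeCounter()

    def work() -> None:
        for _ in range(increments_per_worker):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

Across real machines the same idea needs a distributed mechanism, such as a lock service or transactions, instead of an in-process lock.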

Transparency: Transparency is an essential feature that allows users to see distributed systems as a single logical device without being concerned about the system architecture. This is an abstraction where a distributed system consisting of millions of components spread across multiple computers works as a single system for the end user.

Openness: Openness is about how easily a distributed system can be extended and improved, i.e. how easily one can integrate new components or replace existing ones without affecting the overall computing environment. Open systems have the flexibility to update and scale their components independently.

Security: In a distributed system, users send requests to access critical data managed by servers, for example, doctors requesting records from hospitals or users purchasing items through an e-commerce website. So distributed systems must guard against denial of service attacks and ensure security and privacy by identifying users through a secure authentication process.

Heterogeneity: In distributed systems, components can differ in networks, hardware, operating systems, programming languages, and implementations by different developers.

What are the Types of Distributed Systems?

Distributed systems generally fall into one of four architectural models:

Client-Server Model: Most of the traditional architecture falls under this category. There is a server to which all the requests are made from the clients. Resource sharing is one of the best examples of the client-server model.

Three-Tier: In this architecture, a middle tier sits between the clients and the backend server and mediates all communication between them. The middle tier accepts the request, does some pre-processing, and forwards it to the server for further processing.

Multi-Tier: Such architectures are used when an application needs to forward requests to various network services. Here the application servers interact both with the presentation tiers and data tiers.

Peer-to-Peer: There are no centralised machines required in this architecture. Each entity behaves as an independent server and performs its roles. Responsibilities are distributed among various servers called peers, and they cooperate to achieve a common goal.
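As a rough illustration of the tiered models above, the sketch below wires three tiers together inside one process. All class names and the sample product data are made up. In a real deployment each tier would usually run on separate machines and talk over the network.

```python
class DataTier:
    """Backend tier: owns the data (a hard-coded, hypothetical product table)."""

    def __init__(self) -> None:
        self._products = {1: "keyboard", 2: "monitor"}

    def fetch(self, product_id: int):
        return self._products.get(product_id)


class ApplicationTier:
    """Middle tier: accepts a request, pre-processes it, and forwards it to the data tier."""

    def __init__(self, data_tier: DataTier) -> None:
        self._data_tier = data_tier

    def handle(self, raw_product_id: str) -> str:
        cleaned = raw_product_id.strip()
        # Pre-processing: reject malformed input before it reaches the backend.
        if not cleaned.isdecimal():
            return "400 invalid product id"
        name = self._data_tier.fetch(int(cleaned))
        return "200 " + name if name is not None else "404 not found"


class PresentationTier:
    """Client-facing tier: only talks to the middle tier, never to the data directly."""

    def __init__(self, application_tier: ApplicationTier) -> None:
        self._application_tier = application_tier

    def request(self, raw_product_id: str) -> str:
        return self._application_tier.handle(raw_product_id)
```

Because the presentation tier never touches the data tier directly, each tier can be changed or scaled on its own as long as the interface between tiers stays the same.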

Examples of systems that rely heavily on distributed systems are:

  • Telecommunication Networks
  • Parallel Processing
  • Real-Time Distributed Services
  • Distributed Databases and Caches
  • Distributed File System and Stream Processing Services

Advantages of Distributed Systems

The key benefits of using distributed systems are:

Reliability: Distributed systems remain available most of the time, irrespective of the failure of any particular system server. If one server fails, the service remains operational.

Scalability: Distributed systems offer extensive scalability. Adding a large number of servers allows such systems to achieve horizontal scalability.

Low latency: Distributed systems can offer fast responses because servers are replicated and located close to users, reducing the query time.

Cost-effective: Unlike a single-machine system, a distributed system is made up of several machines working together. Although such systems have high implementation costs, they are far more cost-effective when working on a large scale.

Efficiency: Distributed systems can be efficient because they have multiple machines, each of which can work independently on part of a problem. This makes good use of resources and saves the user significant time.

Disadvantages of Distributed Systems

It is essential to know the various challenges that one may encounter while using any system. This will help in dealing with trade-offs. The shortcomings of distributed systems are:

Complexity: Distributed systems are highly complex. Using a large number of machines makes the system scalable, but it also increases the system’s complexity. There will be more messages, network calls, devices, user requests, etc.

Network failure: Distributed systems rely heavily on network calls for communication and for transferring data. In case of network failure, message mismatch or incorrect ordering of segments leads to communication failure and eventually degrades the application’s overall performance.
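One common way to cope with transient network failures is to retry the call with exponential backoff. The sketch below is only an illustration: `call_with_retries` is a hypothetical helper, it treats Python's built-in `ConnectionError` as the sign of a network failure, and the default attempt count and delays are arbitrary.

```python
import time


def call_with_retries(operation, max_attempts: int = 4, base_delay_s: float = 0.1,
                      sleep=time.sleep):
    """Run operation(); on ConnectionError wait and retry, doubling the delay each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            # Give up and re-raise once the attempt budget is exhausted.
            if attempt == max_attempts:
                raise
            # Wait base_delay_s, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Retries are only safe when the operation can be repeated without side effects, for example a read or an idempotent write. The `sleep` parameter exists so that the waiting can be replaced in tests.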

Consistency: Because these systems are highly complex, it becomes challenging to synchronise application state and maintain data integrity across the service.

Management: Many functions, such as load balancing, monitoring, increased intelligence, logging, etc., need to be added to detect and prevent failures in distributed systems.

Conclusion

Distributed systems are a necessity of the modern world: new machines need to be added, and applications need to scale to keep up with technological advancements and deliver better services. They enable modern systems to offer highly scalable, reliable, and fast services.

Such systems can support many more requests and compute jobs than a single standard system by spreading workloads and requests. Although there are some trade-offs and challenges, distributed systems can transform the world with their services and applications. As a result, almost every application has a distributed system as one of its major components.

Enjoy learning, Enjoy system design!

© 2022 Code Algorithms Pvt. Ltd.

All rights reserved.