Many of us have experienced moments when we could not access certain applications due to an outage. Recently, YouTube faced a global outage that stopped users from streaming videos for about an hour. You may wonder what causes such outages and how they can be prevented. Let’s find out.
Availability is the percentage of time in a given period that a system is available to perform its task and function under normal conditions. One way to look at it is how resistant a system is to failures. The level of availability that a system requires depends on its business logic or usage. Let us take some examples.
Air traffic control systems are among the best examples of systems that require high availability. In today’s world, where air travel is so complex and busy, a single error in directing airplanes can lead to catastrophic results. In contrast, a system with few visitors whose failures are not catastrophic can tolerate somewhat lower availability. High availability comes at a cost, so we have to optimize according to our needs.
A system’s availability is measured as the percentage of time it is up in a given period, that is, the total uptime divided by the sum of the uptime and downtime in that period.
Availability = Uptime ÷ (Uptime + Downtime)
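As a quick illustration of the formula, here is a minimal Python sketch. The function name `availability` and the sample numbers are assumptions chosen for illustration, not measurements from any real system.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a fraction between 0 and 1."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Hypothetical 30-day month (720 hours) with 1 hour of downtime.
monthly = availability(uptime_hours=719, downtime_hours=1)
print(f"{monthly:.4%}")
```

With these assumed numbers, a single hour of downtime in a month already puts the system below 3 nines.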
Availability can also be expressed in terms of nines. In high-demand applications, we usually measure availability in nines rather than percentages. If a system is 99 percent available, it is said to have “2 nines” of availability; if it is 99.9 percent available, it has “3 nines,” and so on. A system with 5 nines (i.e., 99.999%) of availability is often regarded as the gold standard of availability. Each additional nine shrinks the allowed downtime tenfold: 3 nines permits about 8.76 hours of downtime per year, while 5 nines permits only about 5.26 minutes.
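The downtime budget implied by each level of nines can be worked out directly from the availability formula. The following is a small sketch; the helper names are hypothetical, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def availability_for_nines(nines: int) -> float:
    """Availability fraction for a given number of nines, e.g. 3 -> 0.999."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return 1 - 10 ** (-nines)


def downtime_minutes_per_year(nines: int) -> float:
    """Maximum downtime per year allowed by the given number of nines."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


for n in range(1, 6):
    pct = availability_for_nines(n) * 100
    minutes = downtime_minutes_per_year(n)
    print(f"{n} nines ({pct:.3f}%): {minutes:,.2f} minutes of downtime per year")
```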
High availability is the ability of a system to keep operating despite the failure of some of its components. To increase availability, we can add redundancy by duplicating hardware components such as servers or storage. For example, a system with two identical web servers behind a load balancer can continue operating even if one of the servers goes down, as the load balancer will redirect traffic to the remaining server. So by adding redundancy, we can make the system more resilient to failure.
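The two-server example can be sketched as a toy round-robin load balancer that skips servers marked as down. The class `LoadBalancer`, its methods, and the server names are hypothetical; a real load balancer would also handle connections, timeouts, and concurrency.

```python
from itertools import cycle


class LoadBalancer:
    """Round-robin balancer that skips servers marked as down."""

    def __init__(self, servers):
        self.healthy = {name: True for name in servers}
        self._ring = cycle(servers)
        self._size = len(servers)

    def mark_down(self, name):
        self.healthy[name] = False

    def mark_up(self, name):
        self.healthy[name] = True

    def route(self):
        """Return the next healthy server, or raise if none is left."""
        for _ in range(self._size):
            candidate = next(self._ring)
            if self.healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy servers available")


lb = LoadBalancer(["web-1", "web-2"])
print([lb.route() for _ in range(4)])  # alternates between web-1 and web-2
lb.mark_down("web-1")
print([lb.route() for _ in range(4)])  # every request now goes to web-2
```

Note that if both servers are marked down, `route` raises an error: redundancy lowers the chance of a total outage but does not remove it.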
It is important to note that redundancy alone is not enough to guarantee high availability. Failure detection mechanisms must also be in place to identify failures. This requires regular high-availability testing and the ability to take corrective action whenever one of the components in the system becomes unavailable.
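One common way to detect failures is a heartbeat with a timeout: each component reports in periodically, and a component that stays silent for too long is treated as failed. Below is a minimal sketch of that idea. The class `HeartbeatMonitor`, the 5-second timeout, and the component names are assumptions for illustration, and timestamps are passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Flags a component as failed if no heartbeat arrives within a timeout."""

    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self.last_seen = {}  # component name -> time of last heartbeat

    def record_heartbeat(self, component: str, now: float) -> None:
        self.last_seen[component] = now

    def failed_components(self, now: float) -> list:
        """Return the names of components whose last heartbeat is too old."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("web-1", now=0.0)
monitor.record_heartbeat("web-2", now=0.0)
monitor.record_heartbeat("web-2", now=4.0)  # web-1 stays silent

for name in monitor.failed_components(now=6.0):
    print(f"{name} missed its heartbeat; trigger failover")
```

The corrective action here is only a printed message; in practice it could be removing the failed server from the load balancer's rotation.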
There are both hardware-based and software-based approaches to achieving high availability. Redundancy is a hardware-based approach, while other techniques, such as top-to-bottom or distributed high-availability approaches, may involve both hardware and software. Software-based downtime reduction techniques can also be effective.
There is a trade-off between the availability of a system and its performance. To achieve high availability, we often take measures to implement redundancy or disaster recovery strategies, which can hurt other aspects of system performance (higher latency or lower throughput). For example, implementing redundancy may involve replicating data or tasks across multiple resources, which can increase the time it takes to complete a task, resulting in higher latency.
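To make the latency cost concrete, consider a simplified model in which a write is sent to all replicas in parallel and is acknowledged once a required number of them confirm it. The function name and the latency figures below are hypothetical, and the model ignores network variance and retries.

```python
def write_latency_ms(replica_latencies_ms, acks_required):
    """Latency of a write that waits for the fastest `acks_required` replicas."""
    if not 1 <= acks_required <= len(replica_latencies_ms):
        raise ValueError("acks_required must be between 1 and the replica count")
    # Replicas respond in parallel, so the write completes when the
    # acks_required-th fastest replica has confirmed.
    return sorted(replica_latencies_ms)[acks_required - 1]


# Hypothetical per-replica write latencies in milliseconds.
latencies = [4, 9, 25]

print(write_latency_ms(latencies, acks_required=1))  # wait for one replica
print(write_latency_ms(latencies, acks_required=3))  # wait for all replicas
```

Under these assumptions, waiting for every replica makes each write as slow as the slowest replica, which is the latency price paid for keeping all copies up to date.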
Both high availability and fault tolerance are strategies used to achieve high uptime in systems, but they approach the problem differently. High availability is about a system's or component's ability to remain operational and accessible with minimal downtime. Fault tolerance, on the other hand, is about a system's or component's ability to continue functioning normally even in the event of a failure.
Thanks to Chiranjeev and Navtosh for their contributions in creating the first version of this content. If you have any queries, doubts, or feedback, please write to us at email@example.com. Enjoy learning, Enjoy system design, Enjoy algorithms!