Which of the following best explains how fault tolerance in a network is achieved?

Which of the following best describes fault tolerance?

A. The ability of an application to automatically correct user mistakes.

B. The ability of the system to continue operations during and after a hardware failure.

C. Having multiple servers.

D. Measurement of the ability of a datacenter to withstand a natural disaster.


Fault tolerance is a process that enables an operating system to respond to a failure in hardware or software. This fault-tolerance definition refers to the system’s ability to continue operating despite failures or malfunctions.

An operating system with solid fault tolerance is not disrupted by a single point of failure. It ensures business continuity and the high availability of crucial applications and systems regardless of any failures.

Fault tolerance can be built into a system to remove the risk of it having a single point of failure. To do so, the system must have no single component that, if it were to stop working effectively, would result in the entire system failing.

Fault tolerance relies on mechanisms like load balancing and failover, which remove the risk of a single point of failure. It is typically part of the operating system’s interface, which enables programmers to check critical data at specific points throughout a transaction.

A fault-tolerance process follows two core models:

Normal functioning. A fault-tolerant system encounters a fault but continues to function as usual. This means the system sees no change in performance metrics like throughput or response time.

Graceful degradation. Other types of fault-tolerant systems go through graceful degradation of performance when certain faults occur. That means the impact the fault has on the system’s performance is proportionate to the fault severity. In other words, a small fault will only have a small impact on the system’s performance rather than causing the entire system to fail or have major performance issues.

The key benefit of fault tolerance is to minimize or avoid the risk of systems becoming unavailable due to a component error. This is particularly important in critical systems that are relied on to ensure people’s safety, such as air traffic control, and systems that protect and secure critical data and high-value transactions.

The core components for improving fault tolerance include diversity, redundancy, and replication:

If a system’s main electricity supply fails, potentially due to a storm that causes a power outage or affects a power station, the system loses power unless it can draw on alternative electricity sources. In this event, fault tolerance can be achieved through diversity, which provides electricity from sources like backup generators that take over when a main power failure occurs.

Some diverse fault-tolerance options result in the backup not having the same level of capacity as the primary source. This may, in some cases, require the system to ensure graceful degradation until the primary power source is restored.

Fault-tolerant systems use redundancy to remove the single point of failure. The system is equipped with one or more redundant power supply units [PSUs], which do not need to power the system while the primary PSU functions as normal. If the primary PSU fails or suffers a fault, it can be removed from service and replaced by a redundant PSU, which takes over system function and performance.

Alternatively, redundancy can be imposed at a system level, which means an entire alternate computer system is in place in case a failure occurs.
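As a rough illustration of why redundancy helps, the sketch below estimates the availability of a group of redundant units. It assumes each unit fails independently and that the group keeps working as long as at least one unit works; the function name and the 0.99 figure are illustrative assumptions, not values from any vendor.

```python
def redundant_availability(unit_availability: float, units: int) -> float:
    """Availability of `units` redundant components, any one of which suffices.

    Assumes failures are independent, which real systems only approximate
    (shared power or a shared bug can take out several units at once).
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if units < 1:
        raise ValueError("units must be at least 1")
    # The group is down only when every unit is down at the same time.
    return 1.0 - (1.0 - unit_availability) ** units


if __name__ == "__main__":
    # Compare a single unit with two and three redundant units.
    for n in (1, 2, 3):
        print(n, round(redundant_availability(0.99, n), 6))
```

Under these assumptions each added unit shrinks the chance of a total outage by the unit's own failure probability, which is why even one spare makes a large difference.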

Replication is a more complex approach to achieving fault tolerance. It involves using multiple identical versions of systems and subsystems and ensuring their functions always provide identical results. If the results are not identical, then a democratic procedure, such as a majority vote, is used to identify the faulty system. Alternatively, a procedure can be used to check for a system that shows a different result, which indicates it is faulty.

Replication can either take place at the component level, which involves multiple processors running simultaneously, or at the system level, which involves identical computer systems running simultaneously.
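The democratic procedure described above can be sketched as a simple majority vote over replica outputs. This is a minimal illustration, not a specific product's mechanism; the function name and the sample values are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> tuple[Hashable, list[int]]:
    """Return the majority result and the indexes of replicas that disagreed.

    Raises ValueError when no result is produced by more than half of the
    replicas, because the vote then cannot identify a faulty replica.
    """
    if not outputs:
        raise ValueError("at least one replica output is required")
    winner, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no strict majority among replica outputs")
    # Any replica whose output differs from the majority is suspected faulty.
    suspects = [i for i, out in enumerate(outputs) if out != winner]
    return winner, suspects


if __name__ == "__main__":
    # Three hypothetical replicas compute the same value; the third disagrees.
    result, faulty = majority_vote([42, 42, 41])
    print(result, faulty)
```

With three replicas this scheme masks one faulty replica; if two disagree with each other and with the third, the vote fails loudly instead of guessing.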

Fault-tolerant systems also use backup components, which automatically replace failed components to prevent a loss of service. These backup components include:

Hardware systems can be backed up by systems that are identical or equivalent to them. A typical example is a server made fault-tolerant by deploying an identical server that runs in parallel to it and mirrors all its operations. Storage follows the same principle in a redundant array of inexpensive disks [RAID], which combines physical disk components to achieve redundancy and improved performance.

Software systems can be made fault-tolerant by backing them up with other software. A common example is a database that contains customer data and is continuously replicated onto another machine. As a result, if the primary database fails, normal operations continue because they are automatically redirected to the replicated backup database.

Power sources can also be made fault-tolerant by using alternative sources to support them. One approach is to run devices on an uninterruptible power supply [UPS]. Another is to use backup power generators that ensure storage and hardware, heating, ventilation, and air conditioning [HVAC] continue to operate as normal if the primary power source fails.

There are several factors that affect organizations’ decision to implement a fault-tolerant system, including:

The biggest disadvantage of adopting a fault-tolerant approach is the cost of doing so. Organizations must think carefully about the cost elements of a fault-tolerant or highly available system.

Fault-tolerant systems require organizations to have multiple versions of system components to ensure redundancy, extra equipment like backup generators, and additional hardware. These components need regular maintenance and testing. They also take up valuable space in data centers. 

One way around the cost of fault tolerance is to opt for more cost-effective but lower-quality redundant components. This approach can inadvertently increase maintenance and support costs and make the system less reliable. To avoid such a situation, organizations must monitor the performance of individual components and keep an eye on their lifespan in relation to their cost.

Fault tolerance inevitably makes it more difficult to know if components are performing to the expected level because failures do not automatically result in the system going down. As a result, organizations will require additional resources and expenditure to continuously test and monitor their system health for faults. 

Additionally, they may need to acquire or develop custom software and procedures to carry out these detection and testing tasks. 

Fortinet helps organizations achieve fault tolerance through its FortiGate next-generation firewalls [NGFWs]. As an example, Fortinet NGFWs have been core to delivering fault-tolerant RingCentral access, which allowed unified communications provider RingCentral to achieve fault tolerance across its global data centers.

This was made possible through the implementation of a highly fault-tolerant network of active and backup virtual private networks [VPNs]. The solutions deployed included classic FortiGate network security features, failover from a primary to backup wide-area network [WAN], failover between data centers, and many more.


Fault Tolerance Definition

Fault Tolerance simply means a system’s ability to continue operating uninterrupted despite the failure of one or more of its components. This is true whether it is a computer system, a cloud cluster, a network, or something else. In other words, fault tolerance refers to how an operating system [OS] responds to and allows for software or hardware malfunctions and failures.

An OS’s ability to recover and tolerate faults without failing can be handled by hardware, software, or a combined solution leveraging load balancers [see more below]. Some computer systems use multiple duplicate fault tolerant systems to handle faults gracefully. This is called a fault tolerant network.

FAQs

What is Fault Tolerance?

The goal of fault tolerant computer systems is to ensure business continuity and high availability by preventing disruptions arising from a single point of failure. Fault tolerance solutions therefore tend to focus most on mission-critical applications or systems.

Fault tolerant computing may include several levels of tolerance:

  • At the lowest level, the ability to respond to a power failure, for example.
  • A step up: during a system failure, the ability to use a backup system immediately.
  • Enhanced fault tolerance: a disk fails, and mirrored disks take over for it immediately. This provides functionality despite partial system failure, or graceful degradation, rather than an immediate breakdown and loss of function.
  • High level fault tolerant computing: multiple processors collaborate to scan data and output to detect errors, and then immediately correct them.
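The mirrored-disk level in the list above can be sketched with a toy in-memory store. This is only an illustration of the idea, assuming two or more copies that receive every write; the class and method names are hypothetical and do not correspond to any RAID implementation.

```python
class MirroredStore:
    """Toy in-memory mirror: every write goes to all healthy copies."""

    def __init__(self, copies: int = 2) -> None:
        if copies < 2:
            raise ValueError("mirroring needs at least two copies")
        self._disks = [{} for _ in range(copies)]
        self._failed = set()

    def fail_disk(self, index: int) -> None:
        """Simulate a disk failure; that copy is no longer read or written."""
        if not 0 <= index < len(self._disks):
            raise IndexError("no such disk")
        self._failed.add(index)

    def write(self, key: str, value: bytes) -> None:
        healthy = [i for i in range(len(self._disks)) if i not in self._failed]
        if not healthy:
            raise IOError("all mirrors have failed")
        for i in healthy:
            self._disks[i][key] = value

    def read(self, key: str) -> bytes:
        # Serve the read from the first copy that is still healthy.
        for i, disk in enumerate(self._disks):
            if i not in self._failed:
                return disk[key]
        raise IOError("all mirrors have failed")


if __name__ == "__main__":
    store = MirroredStore(copies=2)
    store.write("invoice-1", b"paid")
    store.fail_disk(0)  # the first disk dies
    print(store.read("invoice-1"))  # still served by the surviving mirror
```

The sketch leaves out rebuilding a replaced disk, which a real mirror must also handle before it can tolerate a second failure.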

Fault tolerance software may be part of the OS interface, allowing the programmer to check critical data at specific points during a transaction.

Fault-tolerant systems ensure no break in service by using backup components that take the place of failed components automatically. These may include:

  • Hardware systems backed up by identical or equivalent systems. For example, a server with an identical fault tolerant server mirroring all operations in backup, running in parallel, is fault tolerant. By eliminating single points of failure, hardware fault tolerance in the form of redundancy can make any component or system far safer and more reliable.
  • Software systems backed up by other instances of software. For example, if you replicate your customer database continuously, operations in the primary database can be automatically redirected to the second database if the first goes down.
  • Redundant power sources can help avoid a system fault if alternative sources can take over automatically during power failures, ensuring no loss of service.
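The automatic redirection described in the software example above can be sketched as follows. This is a minimal illustration that assumes the replica already holds a current copy of the data; `primary_db`, `replica_db`, and `call_with_failover` are hypothetical names, not part of any database driver.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when neither the primary nor any backup could serve the request."""


def call_with_failover(backends: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each backend in order (primary first) and return the first success."""
    errors = []
    for backend in backends:
        try:
            return backend(request)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed for {request!r}")


def primary_db(query: str) -> str:
    # Hypothetical primary that is currently down.
    raise ConnectionError("primary database unreachable")


def replica_db(query: str) -> str:
    # Hypothetical replica that holds a continuously replicated copy.
    return f"replica answered: {query}"


if __name__ == "__main__":
    print(call_with_failover([primary_db, replica_db], "SELECT 1"))
```

The caller never sees the primary's failure, which is the behavior the bullet describes; only when every backend fails does an error surface.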

High Availability vs Fault Tolerance

Highly available systems are designed to minimize downtime to avoid loss of service. Availability is expressed as uptime as a percentage of total running time, and 99.999 percent uptime is the ultimate goal of high availability.

Although both high availability and fault tolerance reference a system’s total uptime and functionality over time, there are important differences and both strategies are often necessary. For example, a totally mirrored system is fault-tolerant; if one mirror fails, the other kicks in and the system keeps working with no downtime at all. However, that’s an expensive and sometimes unwieldy solution.

On the other hand, a highly available system such as one served by a load balancer allows minimal downtime and related interruption in service without total redundancy when a failure occurs. A system with some critical parts mirrored and other, smaller components duplicated has a hybrid strategy.

In an organizational setting, there are several important concerns when creating high availability and fault tolerant systems:

Cost. Fault tolerant strategies can be expensive, because they demand the continuous maintenance and operation of redundant components. High availability is usually part of a larger system, one of the benefits of a load balancing solution, for example.

Downtime. The greatest difference between a fault-tolerant system and a highly available system is downtime, in that a highly available system has some minimal permitted level of service interruption. In contrast, a fault-tolerant system should work continuously with no downtime even when a component fails. Even a system with the five nines standard for high availability will experience approximately 5 minutes of downtime annually.

Scope. High availability systems tend to share resources designed to minimize downtime and co-manage failures. Fault tolerant systems require more, including software or hardware that can detect failures and change to redundant components instantly, and reliable power supply backups.

Certain systems may require a fault-tolerant design, which is why fault tolerance is important as a basic matter. On the other hand, high availability is enough for others. The right business continuity strategy may include both fault tolerance and high availability, intended to maintain critical functions throughout both minor failures and major disasters.
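The five nines downtime figure quoted above can be checked with a little arithmetic. The sketch below converts an availability percentage into the downtime it permits per year, assuming a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per year permitted by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = annual_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

For 99.999 percent the result is roughly 5.26 minutes per year, which matches the approximately 5 minutes cited for the five nines standard.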

What are Fault Tolerance Requirements?

Depending on the fault tolerance issues that your organization copes with, there may be different fault tolerance requirements for your system. That is because fault-tolerant software and fault-tolerant hardware solutions both offer very high levels of availability, but in different ways.

Fault-tolerant servers use a minimal amount of system overhead to achieve high availability with an optimal level of performance. Fault-tolerant software may be able to run on servers you already have in place that meet industry standards.

What is Fault Tolerance Architecture?

There is more than one way to create a fault-tolerant server platform and thus prevent data loss and eliminate unplanned downtime. Fault tolerance in computer architecture simply reflects the decisions administrators and engineers use to ensure a system persists even after a failure. This is why there are various types of fault tolerance tools to consider.

At the drive controller level, a redundant array of inexpensive disks [RAID] is a common fault tolerance strategy that can be implemented. Other facility level forms of fault tolerance exist, including cold, hot, warm, and mirror sites.

Fault tolerance computing also deals with outages and disasters. For this reason a fault tolerance strategy may include an uninterruptible power supply [UPS] or a backup generator: some way to run independently from the grid should it fail.

Byzantine fault tolerance [BFT] is another issue for modern fault tolerant architecture. BFT systems are important to the aviation, blockchain, nuclear power, and space industries because these systems prevent downtime even if certain nodes in a system fail or are driven by malicious actors.
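To give a sense of the cost of tolerating malicious nodes, the sketch below applies the classic bound from the Byzantine agreement literature: without relying on message signatures, a system of n nodes can tolerate at most f arbitrarily faulty nodes when n is at least 3f + 1. The function names are illustrative, and specific protocols may use different assumptions and bounds.

```python
def max_byzantine_faults(nodes: int) -> int:
    """Largest f such that nodes >= 3f + 1 (classic Byzantine agreement bound)."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return (nodes - 1) // 3


def min_nodes_for(faults: int) -> int:
    """Smallest cluster size that tolerates `faults` Byzantine nodes."""
    if faults < 0:
        raise ValueError("faults must not be negative")
    return 3 * faults + 1


if __name__ == "__main__":
    for n in (3, 4, 7, 10):
        print(f"{n} nodes tolerate up to {max_byzantine_faults(n)} Byzantine node(s)")
```

Under this bound, three nodes cannot mask even one Byzantine node, while four can, which is part of why BFT designs are more expensive than plain replication.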

What is the Relationship Between Security and Fault Tolerance?

Fault tolerant design supports security by keeping your systems online and by ensuring they are well-designed. A naively designed system can be taken offline easily by an attack, causing your organization to lose data, business, and trust. Each firewall, for example, that is not fault tolerant is a security risk for your site and organization.

What is Fault Tolerance in Cloud Computing?

Conceptually, fault tolerance in cloud computing is mostly the same as it is in hosted environments. Cloud fault tolerance simply means your infrastructure is capable of supporting uninterrupted functionality of your applications despite failures of components.

In a cloud computing setting, that may be achieved through autoscaling across geographic zones or within the same data center. There is likely more than one way to achieve fault tolerant applications in the cloud in most cases. The overall system will still demand monitoring of available resources and potential failures, as with any fault tolerance in distributed systems.

What Are the Characteristics of a Fault Tolerant Data Center?

To be called a fault tolerant data center, a facility must avoid any single point of failure. Therefore, it should have two parallel systems for power and cooling. However, total duplication is costly, gains are not always worth that cost, and infrastructure is not the only answer. Therefore, many data centers practice fault avoidance strategies as a mid-level measure.

Load Balancing Fault Tolerance Issues

Load balancing and failover solutions can work together in the application delivery context. These strategies provide quicker recovery from disasters through redundancy, ensuring availability, which is why load balancing is part of many fault tolerant systems.

Load balancing solutions remove single points of failure, enabling applications to run on multiple network nodes. Most load balancers also make various computing resources more resilient to slowdowns and other disruptions by optimizing distribution of workloads across the system components. Load balancing also helps deal with partial network failures, shifting workloads when individual components experience problems.
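The workload shifting described above can be sketched as a round-robin balancer that skips nodes marked unhealthy. This is a simplified illustration with hypothetical class and node names; it is not how any particular load balancer, including Avi, is implemented, and a real balancer would set health status from active checks.

```python
from typing import Iterable


class RoundRobinBalancer:
    """Rotate requests across nodes, skipping any that are marked down."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self._nodes = list(nodes)
        if not self._nodes:
            raise ValueError("at least one node is required")
        self._healthy = set(self._nodes)
        self._next = 0

    def mark_down(self, node: str) -> None:
        self._healthy.discard(node)

    def mark_up(self, node: str) -> None:
        if node in self._nodes:
            self._healthy.add(node)

    def pick(self) -> str:
        # Look at each node at most once per call, starting from the cursor.
        for _ in range(len(self._nodes)):
            node = self._nodes[self._next]
            self._next = (self._next + 1) % len(self._nodes)
            if node in self._healthy:
                return node
        raise RuntimeError("no healthy nodes available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["node-a", "node-b", "node-c"])
    lb.mark_down("node-b")  # simulate a failed health check
    print([lb.pick() for _ in range(4)])
```

Traffic keeps flowing to the remaining nodes while one is down, and the failed node rejoins the rotation once it is marked up again.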

Does Avi Networks Offer a Fault Tolerance Solution?

Avi offers load balancing capabilities that can keep your systems online reliably. Avi aids fault tolerance by automatically instantiating virtual services when one fails, redistributing traffic, and handling workload moves or additions, reducing the chance of a single point of failure strangling your system.

Avi Networks Software Load Balancer
