Vajra: Benchmarking Survivability in Distributed Systems
Researcher(s): Priya Narasimhan
Research Area: Available and Secure Computing Systems
Abstract
It is important for distributed mission-critical applications to be able to quantify their survivability, and also for application developers to be able to compare different fault-tolerance approaches with each other, in the interests of selecting the best one. Unfortunately, while reliability modeling tools exist today, there are no comprehensive run-time approaches that ensure coverage over a wide span of faults, and moreover, do so in distributed settings. Most run-time fault-injection techniques assume failures are independent (i.e., faults happen in isolation, without any correlation across the system) and benign (i.e., there is no malicious adversary orchestrating the failures). Neither of these assumptions is realistic for applications that must complete their missions despite arbitrary failures.
Current survivable systems rely on strong theoretical properties (such as Byzantine fault tolerance guarantees) to ensure survivability. Unfortunately, the efficiency and effectiveness of these properties are rarely measured in implementations of survivable systems. Instead of focusing on the survivability benefit of a system or technique, evaluations of such systems generally focus on the performance overhead of the mechanisms in the fault-free case, a metric that, by itself, is not a good indicator of survivability. This dearth of metrics makes objective comparison of the survivability of different implementations (even those that employ similar algorithms) nearly impossible. To solve this problem, we propose the development of metrics to characterize and evaluate survivability.
We intend to employ these metrics to evaluate survivable systems, including the dependable systems that we ourselves have built. There are two important categories of operation for any survivable system: (i) the fault-free case; and (ii) the faulty case, in which the system's resistance (though not necessarily its survivability) has been overcome, i.e., a fault, either latent or active, now exists in the system. For the purpose of evaluation, it is useful to divide the faulty case further into (a) proactive and (b) reactive, based on the survivability strategies employed by the system. We intend to measure the survivability, performance and resource usage of a number of different systems under the fault-free, reactive-faulty and proactive-faulty cases. Our aim is to understand better the precise implications of the different survivability approaches of the systems under evaluation, to derive insights into their specific fault-tolerance mechanisms and to determine the effectiveness of these systems under a variety of faults.
The Vajra survivability benchmark project aims to perform run-time fault injection of various kinds of failures, including crash, communication, malicious and timing failures. Beyond this comprehensive coverage, the Vajra tools allow the injection of distributed failures, e.g., a timing fault on one component coupled with a message loss on another. The utility of such a benchmark is that it will allow us to compare the relative dependability of different distributed systems that claim to be survivable. Even for a single distributed application, Vajra allows us to quantify the impact of failures, including the complex ones to which distributed systems are vulnerable. Additionally, Vajra is transparent and target-agnostic: by using an interception approach, it attaches itself non-intrusively, at run-time, to an existing distributed application and allows a variety of failures to be injected into the application in any order. This implies that it can be used to measure and analyze the dependability of any distributed system without requiring any changes to the system.
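To make the interception idea concrete, the following is a minimal illustrative sketch and not Vajra's actual implementation or API. It assumes a hypothetical `Interceptor` class that wraps a component's send routine, and a hypothetical `Fault` enumeration covering three of the failure kinds named above. The scenario reproduces the distributed failure used as an example in the text: a timing fault on one component coupled with a message loss on another. Delays are recorded rather than slept so that the sketch runs deterministically.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Fault(Enum):
    """Hypothetical fault kinds; the names are assumptions for this sketch."""
    CRASH = "crash"          # component stops sending anything
    MESSAGE_LOSS = "loss"    # outgoing messages are silently dropped
    TIMING = "timing"        # outgoing messages are delayed


@dataclass
class Interceptor:
    """Hypothetical wrapper placed between a component and its send routine.

    The wrapped application code is unchanged: it calls the interceptor
    exactly as it would call the original send function.
    """
    name: str
    send: Callable[[str], None]      # the original, unmodified send routine
    delay: Callable[[float], None]   # how to wait (e.g. time.sleep in real use)
    faults: Dict[Fault, float] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def inject(self, fault: Fault, param: float = 0.0) -> None:
        """Activate a fault; faults may be injected in any order."""
        self.faults[fault] = param

    def __call__(self, message: str) -> None:
        if Fault.CRASH in self.faults:
            self.log.append(f"{self.name}: crashed, suppressed {message}")
            return
        if Fault.MESSAGE_LOSS in self.faults:
            self.log.append(f"{self.name}: dropped {message}")
            return
        if Fault.TIMING in self.faults:
            seconds = self.faults[Fault.TIMING]
            self.delay(seconds)
            self.log.append(f"{self.name}: delayed {message} by {seconds}s")
        self.send(message)


def run_scenario():
    """Timing fault on component A coupled with message loss on component B."""
    delivered: List[str] = []   # stands in for the network
    delays: List[float] = []    # recorded instead of actually sleeping

    a = Interceptor("A", delivered.append, delays.append)
    b = Interceptor("B", delivered.append, delays.append)
    a.inject(Fault.TIMING, 0.5)
    b.inject(Fault.MESSAGE_LOSS)

    a("a:heartbeat")
    b("b:heartbeat")
    return delivered, delays, a.log + b.log
```

In this sketch, calling `run_scenario()` delivers only component A's message (after a recorded 0.5-second delay) and drops component B's. A real interception layer would attach at the system-call or library level rather than wrapping a Python callable; the wrapper here only illustrates why the target application needs no source changes.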
