Researcher: Priya Narasimhan
Research Area: Survivable Distributed Systems
Surviving Network Partitioning in Distributed Wireless Systems
Many distributed applications require data to be consistent across multiple nodes that are connected by a network. These applications typically achieve data consistency by ensuring that all nodes apply updates to the application’s state in the same order. Unfortunately, link failures in the underlying network can split a single system into multiple disjoint, disconnected partitions. Under such network-partitioning faults, nodes within a partition can communicate with each other, but nodes in different partitions cannot. If the application is allowed to continue operating in all of the disjoint partitions, its state might become inconsistent across partitions, leading to difficulties when the network subsequently remerges.
Current strategies for handling the network-partition and network-remerge problems adopt the extreme approach of allowing only one of the partitions to survive, while the nodes and application processes in the other partitions are forcibly shut down (the primary component approach). When the partition heals, manual intervention is often required to reintroduce the killed processes and nodes into the distributed system. The primary component strategy is impractical for large networks of nodes, and for distributed systems where a (potentially non-trivial) number of nodes cannot simply shut down and cease operation. For instance, in the embedded distributed network of nodes inside an automotive control system, shutting down half or more of the nodes within the car while the car is on the road is neither safe nor practical. This problem is exacerbated in wireless systems, where node mobility can cause the network to partition and remerge several times.
In such a dynamic environment, manual reconciliation of state may not be feasible because of the high frequency with which the network can partition and remerge, while the primary component approach can significantly degrade the application’s performance by reducing the number of nodes available to it. There is therefore an urgent need for techniques that cope effectively with network-partitioning failures before distributed applications can be used satisfactorily in mobile networks.
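To make the cost of the primary component approach concrete, the following is a minimal sketch, not taken from the project itself. It assumes that the surviving partition is chosen by a strict-majority rule, which is one common way to pick a primary component; the text above does not say how the survivor is selected. The function names `primary_component` and `nodes_to_shut_down` are hypothetical.

```python
from typing import FrozenSet, List, Optional


def primary_component(
    partitions: List[FrozenSet[str]], total_nodes: int
) -> Optional[FrozenSet[str]]:
    """Return the partition holding a strict majority of all nodes, if any.

    Assumption: majority rule. At most one partition can satisfy it,
    so at most one partition is ever allowed to survive.
    """
    for part in partitions:
        if 2 * len(part) > total_nodes:
            return part
    return None  # no partition has a majority (e.g. an even split)


def nodes_to_shut_down(
    partitions: List[FrozenSet[str]], total_nodes: int
) -> List[str]:
    """List every node outside the primary component, in sorted order."""
    primary = primary_component(partitions, total_nodes)
    return sorted(
        node
        for part in partitions
        if part != primary
        for node in part
    )


if __name__ == "__main__":
    split = [frozenset({"a", "b", "c"}), frozenset({"d", "e"})]
    print(primary_component(split, 5))
    print(nodes_to_shut_down(split, 5))
```

Under this assumed rule, an even split leaves no primary component at all, so every node is shut down. This is the kind of outcome that makes the approach unattractive when partitions form and heal frequently.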
Recognizing that the primary component approach is infeasible for real-world applications that can afford neither downtime nor manual intervention, this project aims to develop key building blocks that can be exploited to support partition-tolerant distributed systems, along with mechanisms to facilitate the consistency of a distributed application’s state during remerging. The proposed approach combines a static program-analysis element with distributed run-time fault-tolerance and logging infrastructures.
The intention of this project is to develop key building blocks that can be exploited to support partition-tolerant infrastructures and to facilitate replica consistency during remerging. With these additional mechanisms, we could also sustain continuous, albeit degraded, operation in each partition of a partitioned system, and could facilitate remerging and recovery when the partition heals. This project proposes to address the challenges of network partitioning by a combination of static program analysis and distributed run-time fault-tolerance and logging infrastructures.
Thus, the challenges addressed by our proposed approach for building partition-tolerant distributed systems are:
One of the problems with current state-of-the-art techniques for handling network partitioning is that the application’s performance can suffer not only during the partitioning phase (in non-primary partitions) but also after the network partition has healed (since components in the non-primary partitions have been forcibly shut down). Our approach allows the application to perform to its full potential once the partitions have healed by allowing nodes in all components to continue operating. This is advantageous in mobile systems, where network partitioning and remerging occur very frequently.
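As an illustration of how logging can let every partition keep operating and still converge after a remerge, here is a hypothetical sketch. It is not the project's mechanism: the reconciliation rule (replay all logged updates in a deterministic order, with ties broken by partition identifier), the `Partition` class, and the `remerge` function are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A logged update: (logical_time, partition_id, key, value).
Update = Tuple[int, str, str, int]


@dataclass
class Partition:
    """One partition that keeps operating and logs every update it applies."""
    pid: str
    state: Dict[str, int] = field(default_factory=dict)
    log: List[Update] = field(default_factory=list)
    clock: int = 0  # per-partition logical clock (assumption)

    def apply(self, key: str, value: int) -> None:
        self.clock += 1
        self.state[key] = value
        self.log.append((self.clock, self.pid, key, value))


def remerge(base_state: Dict[str, int], partitions: List[Partition]) -> Dict[str, int]:
    """Replay all logged updates onto the pre-partition state.

    Updates are sorted by (logical_time, partition_id, ...), so every node
    that runs this function obtains the same merged state, regardless of
    the order in which the partitions are listed.
    """
    merged = dict(base_state)
    for _, _, key, value in sorted(u for p in partitions for u in p.log):
        merged[key] = value
    return merged


if __name__ == "__main__":
    base = {"x": 0}
    a = Partition("A", dict(base))
    b = Partition("B", dict(base))
    a.apply("x", 1)   # partition A keeps working
    b.apply("x", 2)   # partition B keeps working, conflicting on "x"
    b.apply("y", 5)
    print(remerge(base, [a, b]))
```

The point of the sketch is only that no node was shut down during the partition, so all nodes are available as soon as the merged state is installed. Which conflicting update should win is an application-level decision that this example settles arbitrarily.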