A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, which leads to nontrivial service disruption or downtime.
This paper utilizes separation of process recovery from data recovery to enable a practical recovery technique (microreboot), which can recover faulty application components, without disturbing the rest of the application. The authors implement a microrebootable prototype above the enterprise edition of Java (J2EE). Moreover, they evaluate their prototype upon a well-designed evaluation system, using their own client emulator, failure injector, and recovery manager.
- Many failures can be successfully recovered by rebooting, even when the failure's root cause is unknown.
- Reboots provide a high-confidence way to reclaim stale or leaked resources, which returns the software to its start state and is easy to implement and automate.
- This paper reuses the concept of the crash-only software in their earlier work and structure each software as a collection of small, well-isolated components, which separate important state from the application logic.
- The authors evaluate their prototype upon a client emulator, a fault injector, and a system for automated failure detection, diagnosis, and recovery.
- Based on the design in this paper, failure and recovery can be masked from end users through transparent call-level retries.
- Microreboot is only a temporary solution for failure recovery and does not mean that the root causes of failures should not be identified and fixed. This work does not cover or solve the later parts.
- Microreboot cannot handle the case of interactions with external resources, which will not be released after microrebooting, so it can lead to unexpected interruptions.
- Microrebooting turned out to be one to two orders of magnitude faster than a full process restart: Microrebooting an EJB takes less than 600 milliseconds, whereas an application server process restart takes almost 20 seconds. A microreboot also loses much less user work than a process restart.
- Microreboots reduce functional disruption during recovery and also reduce lost work.
This paper only explores the fields of failure recovery. It still remains open about how we can identify and fix the root causes of the failure. Besides, microreboot only focuses on J2EE applications, which can be divided into different crash-only components. However, for those non-web applications, how can we employ a lightweight recovery solution? This is also worth considering.