Trustworthy COTS Hardware
-
Aim: develop software approaches to improving the trustworthiness of commodity off-the-shelf (COTS) hardware.
-
Overview: The formal verification of the seL4 microkernel has produced an operating system that is guaranteed mathematically to be correct and enforce isolation. The seL4 microkernel can consolidate safety and security critical software on a single device where previously multiple devices were used to provide air gap isolation. One of the barriers to wide-spread consolidation on commodity hardware is the lack of hardware dependability features. A hardware fault triggered by cosmic rays, alpha particle strikes, etc. could invalidate the strong mathematical guarantees that are based on the hardware behaving correctly, and pierce the virtual air gap seL4 provides.
This project aims to improve the trustworthiness of commodity hardware to enable a verified microkernel to be used in situations previously needing an air gap. We explore leveraging multicore processors to provide redundancy, and thus increase the trustworthiness of a virtual air gap, and thus systems running on COTS hardware.
The key challenge of the approach is to provide improved trustworthiness in the presence of transient hardware faults that also manifest themselves in operating system itself, while retaining good performance. A typical COTS hardware platform also features non-redundant devices, which creates another challenge of improving trustworthiness in the presence of non-redundant device drivers interacting with a redundant overall system.
-
Kernel implementation and mechanisms for improved trustworthiness: We are evolving seL4 and its API to support both improved trustworthiness through redundant software execution on multicore processors. The activity includes:
- Modifying seL4 to support redundant execution of itself, including developing techniques to ensure consistency across redundant replicas without sacrificing performance.
- Extending seL4's API to enable consistent redundant execution of trusted applications, while supporting non-redundant drivers and their interaction with the redundant part of the system.
-
Contact: Kevin Elphinstone, k.elphinstone@unsw.edu.au
Achievements
The project has uccessfully created a version of seL4 that provides redundant co-execution (RCoE) of an seL4-based system on COTS multicore, i.e. without any non-standard hardware support. It can run in dual or triple modular redundancy (DMR or TMR) configuration.
The replication and fault-tolerance implementation is done at the kernel level, meaning that the whole system is replicated, except for the minimal interfaces to non-replicated devices (mostly device register reads and writes). Our RCoE implementation is the first system that achieves such maximal replication on COTS hardware.
Our RCoE setup comes in two versions:
- a loosely-coupled version, LC-RCoE, that synchronises anywhere between kernel entries, minimises overheads, but assumes deterministic execution between system calls. This means it cannot support code with data races, including lock-free algorithms. It generally also cannot support virtual machines.
- a closely-coupled version, CC-RCoE, that has higher overheads but avoids the limitations of LC-RCoE.
Both versions are share the following properties:
- due to kernel-level replication, user-level code is transparently replicated;
- RCoE is effective in detecting faults arising from bitflips in memory and registers;
- RCoE support trade-offs between overhead and detection latency, by providing the option of more frequent synchronisation and including user data in the state comparison of the replicas;
- RCoE support error masking by downgrading from TMR to DMR operation by removing a faulty replica;
- RCoE is implemented on the x64 and ARMv7-A architectures
- RCoE provides massive cost and space, weight and power (SWaP) benefits over commercial radiation-hardened processors.
Status
Having demonstrated the above, the projectis concluded for now. We will be happy to revive it should there be commercial interest.
Artifact downloads
Coming soon!
People
Past
|
Publications
Yanyan Shen, Gernot Heiser and Kevin Elphinstone Fault tolerance through redundant execution on COTS multicores: Exploring trade-offs International Conference on Dependable Systems and Networks (DSN), pp. 188-200, Portland, Oregon, USA, June, 2019 | ||
Yanyan Shen Microkernel mechanisms for improving the trustworthiness of commodity hardware PhD Thesis, UNSW, Sydney, Australia, March, 2019 | ||
Yanyan Shen and Kevin Elphinstone Microkernel mechanisms for improving the trustworthiness of commodity hardware European Dependable Computing Conference, pp. 12, Paris, France, September, 2015 | ||
Kevin Elphinstone and Yanyan Shen Improving the trustworthiness of commodity hardware with software Workshop on Dependability of Clouds, Data Centers and Virtual Machine Technology (DCDV), pp. 6, Budapest, Hungary , June, 2013 |