Trustworthy Systems

Fault tolerance through redundant execution on COTS multicores: Exploring trade-offs

Authors

Yanyan Shen, Gernot Heiser and Kevin Elphinstone

DATA61

UNSW Sydney

Abstract

High availability and integrity are paramount in systems deployed in life- and mission-critical scenarios. Such fault tolerance can be achieved through redundant co-execution (RCoE) on replicated hardware, now cheaply available with multicore processors. RCoE replicates almost all software, including the OS kernel, drivers, and applications, achieving a sphere of replication that covers everything except the minimal interfaces to non-replicated peripherals. We complement our original, loosely-coupled RCoE with a closely-coupled version that improves transparency of replication to application code, and investigate the functionality, performance, and vulnerability trade-offs.

BibTeX Entry

  @inproceedings{Shen_HE_19,
    address          = {Portland, Oregon, USA},
    author           = {Shen, Yanyan and Heiser, Gernot and Elphinstone, Kevin},
    booktitle        = {International Conference on Dependable Systems and Networks (DSN)},
    date             = {2019-06-24},
    doi              = {https://doi.org/10.1109/DSN.2019.00031},
    issn             = {1530-0889},
    keywords         = {{seL4}; microkernel; {SEU}; replication; fault tolerance},
    month            = jun,
    pages            = {188--200},
    paperurl         = {https://trustworthy.systems/publications/full_text/Shen_HE_19.pdf},
    publisher        = {IEEE},
    title            = {Fault Tolerance Through Redundant Execution on {COTS} Multicores: Exploring Trade-offs},
    year             = {2019}
  }
