SoK: benchmarking flaws in systems security
Authors
Erik van der Kouwe, Gernot Heiser, Dennis Andriesse, Herbert Bos, Cristiano Giuffrida
DATA61, Leiden University, Vrije Universiteit, UNSW Sydney
Abstract
Properly benchmarking a system is a difficult and intricate task. Even a seemingly innocuous mistake can compromise the guarantees provided by a systems security defense and threaten its reproducibility and comparability. Moreover, as many modern defenses trade security for performance, the damage caused by benchmarking mistakes is increasingly worrying. To analyze the magnitude of the phenomenon, we identify a set of 22 benchmarking flaws that threaten the validity of systems security evaluations and survey 50 defense papers published in top venues. To ensure the validity of our results, we perform the complete survey twice, with two independent readers; we find only a very small number of disagreements between readers, showing that our assessment of benchmarking flaws is highly reproducible. We show that benchmarking flaws are widespread even in papers published at tier-1 venues: tier-1 papers contain an average of five benchmarking flaws, and we find only a single paper in our sample without any.
Moreover, the scale of the problem appears constant over time, suggesting that the community is not yet taking sufficient countermeasures, despite the problem being more relevant than ever. This threatens the scientific process, which relies on reproducibility and comparability to ensure that published research advances the state of the art. We hope to raise awareness of these issues and provide recommendations for improving benchmarking quality and safeguarding the scientific process in our community.
BibTeX Entry
@inproceedings{vanderKouwe_HABG_19,
  address   = {Stockholm, Sweden},
  author    = {van der Kouwe, Erik and Heiser, Gernot and Andriesse, Dennis and Bos, Herbert and Giuffrida, Cristiano},
  booktitle = {European Conference on Security and Privacy (EuroS\&P)},
  date      = {2019-6-17},
  keywords  = {evaluation, experimental methodology},
  month     = jun,
  numpages  = {16},
  paperurl  = {https://trustworthy.systems/publications/full_text/vanderKouwe_HABG_19.pdf},
  publisher = {IEEE},
  title     = {{SoK}: Benchmarking Flaws in Systems Security},
  year      = {2019}
}