A Way towards Reproducibility in Computer Science — Artifact Evaluation
The replication crisis [1] is a phenomenon in which scientific results cannot be replicated by other research groups. To avoid such a crisis for computer science, the Association for Computing Machinery (ACM) initiated the Reproducibility Task Force in 2015.
Reproducibility
One proposal by this task force was the definition of the terminology for reproducibility (cf. our previous blog post [2]):
- Repeatability, when the same team recreates results on the same Research Infrastructure (RI);
- Reproducibility, when a different team recreates results on the same RI;
- Replicability, when a team recreates results using their own infrastructure.
To reward researchers who invest effort into making their scientific results reproducible, the ACM introduced a badging system [3]. The badges reflect different qualities of the experimental artifacts that underpin reproducibility:
- available,
- evaluated functional,
- evaluated reusable,
- results reproduced, and
- results replicated.
ACM conferences can award these badges to papers whose authors invested the additional effort to document their research and release their artifacts publicly. The artifacts are evaluated by the Artifact Evaluation Committee (AEC), which also awards the badges.
SLICES’ contribution to artifact evaluation
Fostering reproducible research is one of the goals of SLICES, and the SLICES community actively supported the AEC process in 2023. Two members of SLICES, Damien Saucez (Inria) and Sebastian Gallenmüller (TUM), co-chaired the artifact evaluation committees of two major networking conferences: ACM SIGCOMM’23 [4], together with Zhizhen Zhong (MIT), and ACM CoNEXT’23 [5], together with Xiaoqi Chen (Princeton University).
Participating in artifact evaluation helps maximize the impact of SLICES. We plan to offer the future SLICES testbeds to the research community to support artifact evaluation; the testbeds and their experiment toolchains will therefore be developed with this use case in mind. Knowing the challenges and pitfalls of the current evaluation process helps us create testbeds that offer the powerful and specialized hardware and software needed to reproduce complex experiments.
Challenges of artifact evaluation in 2023
One of the biggest challenges of artifact evaluation is the availability of state-of-the-art experiment facilities. Among other resources, some of the 2023 SIGCOMM and CoNEXT artifacts required:
- multiple Nvidia GPUs
- multiple Tofino switches
- CPUs supporting the Intel SGX feature
- large amounts of RAM:
  - 512 GB for a single server
  - 64 GB per machine in a network with at least three servers
- a large AWS instance (costing reviewers over 1000 USD)
A SLICES testbed could help satisfy such requirements by providing a common platform on which paper authors and reviewers can reproduce the experiments.
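To illustrate how a testbed could reason about such requirements, the following Python sketch expresses an artifact’s hardware needs in a machine-readable form and checks them against the nodes a testbed offers. All names, fields, and values are hypothetical and do not correspond to an actual SLICES interface; this is only a minimal sketch of the matching idea.

```python
# Hypothetical sketch: matching an artifact's hardware requirements against
# the nodes a testbed offers. Purely illustrative, not a SLICES API.
from dataclasses import dataclass

@dataclass
class Requirements:
    gpus: int = 0              # e.g. number of Nvidia GPUs per node
    tofino_switches: int = 0   # number of programmable Tofino switches
    sgx: bool = False          # CPUs must support Intel SGX
    ram_gb_per_node: int = 0   # minimum RAM per node
    nodes: int = 1             # how many such nodes the experiment needs

@dataclass
class TestbedNode:
    gpus: int = 0
    tofino_switches: int = 0
    sgx: bool = False
    ram_gb: int = 0

def suitable_nodes(req, nodes):
    """Return the testbed nodes that individually meet the per-node requirements."""
    return [
        n for n in nodes
        if n.gpus >= req.gpus
        and n.tofino_switches >= req.tofino_switches
        and (n.sgx or not req.sgx)
        and n.ram_gb >= req.ram_gb_per_node
    ]

def can_host(req, nodes):
    """An artifact can be hosted if enough suitable nodes are available."""
    return len(suitable_nodes(req, nodes)) >= req.nodes

# Example: the three-server, 64 GB-per-machine setup mentioned above.
experiment = Requirements(ram_gb_per_node=64, nodes=3)
testbed = [TestbedNode(ram_gb=512), TestbedNode(ram_gb=128),
           TestbedNode(ram_gb=64), TestbedNode(ram_gb=32)]
print(can_host(experiment, testbed))  # True: three nodes offer at least 64 GB
```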
Providing access to experiment infrastructures for artifact review also introduces challenges. Authors may decide to give reviewers access to their own resources. However, reviewers should remain anonymous to the authors to ensure an unbiased and fair review process, while authors want to make sure that their resources are not misused, even though they do not know who the reviewers are. SLICES can resolve this tension: it will provide central services that grant easy access to resources for reviewers and authors alike. Because SLICES knows both the authors and the reviewers, it can prevent the misuse of resources while keeping the reviewers anonymous. Sharing a common experimental platform also simplifies debugging by giving artifact authors and reviewers a common reference platform.
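A minimal sketch of such an anonymity-preserving access model is shown below: a central broker knows both authors and reviewers, but the author only ever sees an opaque pseudonym and a time-limited credential. Class names, token formats, and durations are assumptions made for illustration; this is not an actual SLICES service.

```python
# Hypothetical sketch of a central access broker: it alone can link pseudonyms
# to real reviewers, so misuse can be traced without exposing identities.
import secrets
from datetime import datetime, timedelta, timezone

class AccessBroker:
    def __init__(self):
        self._pseudonym_to_reviewer = {}  # known only to the broker
        self._grants = {}                 # token -> (pseudonym, artifact, expiry)

    def register_reviewer(self, reviewer_identity):
        """Issue an opaque pseudonym; the real identity never leaves the broker."""
        pseudonym = "reviewer-" + secrets.token_hex(4)
        self._pseudonym_to_reviewer[pseudonym] = reviewer_identity
        return pseudonym

    def grant_access(self, pseudonym, artifact, days):
        """Create a time-limited credential tied to the pseudonym, not the person."""
        expiry = datetime.now(timezone.utc) + timedelta(days=days)
        token = secrets.token_urlsafe(16)
        self._grants[token] = (pseudonym, artifact, expiry)
        return token

    def check_access(self, token, artifact):
        """The resource owner validates tokens without learning who holds them."""
        grant = self._grants.get(token)
        if grant is None:
            return False
        _pseudonym, granted_artifact, expiry = grant
        return granted_artifact == artifact and datetime.now(timezone.utc) < expiry

# Usage: the author sees only the pseudonym, never the reviewer's identity.
broker = AccessBroker()
alias = broker.register_reviewer("jane.doe@example.org")
token = broker.grant_access(alias, artifact="sigcomm23-artifact-42", days=14)
print(alias, broker.check_access(token, "sigcomm23-artifact-42"))  # e.g. reviewer-1a2b3c4d True
```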
For this year’s artifact evaluation, authors provided reviewers with access to the resources listed above. However, because these resources are scarce, access was granted only for a limited time, and the resources will not be available to researchers other than the reviewers. SLICES, in contrast, is oriented towards long-term availability and ensures that the hardware remains available for multiple years, so experiments can still be reproduced several years from now. The researchers who created the original experiments can rely on SLICES and do not need to ensure themselves that a testbed will remain available to reproduce their results.
Conclusion
The scientific experiment is the primary source of truth in science. Reproducibility guarantees that experimental results can be trusted, i.e., that they are consistent and neither manipulated nor purely coincidental. SLICES can provide access to the required resources, offer useful services, and guarantee long-term availability. None of these three can be provided by small research groups alone; they require a community effort. SLICES thereby helps build a reliable foundation for experimental research and ensures that results can be reproduced independently.
References
[1] https://en.wikipedia.org/wiki/Replication_crisis, accessed 2023/12/20
[2] https://www.slices-ri.eu/reproducibility-by-design/, accessed 2023/12/20
[3] https://www.acm.org/publications/policies/artifact-review-and-badging-current, accessed 2023/12/20
[4] https://conferences.sigcomm.org/sigcomm/2023/cf-artifacts.html, accessed 2023/12/20
[5] https://conferences.sigcomm.org/co-next/2023/#!/program, accessed 2023/12/20