Open repository and analysis of system usage data
Dependability has become a requisite property for many of the computer systems that surround us or work behind the scenes to support our personal and professional lives. Heroic progress has been made by computer systems researchers and practitioners working together to build and deploy dependable systems. However, the overwhelming majority of this work is not based on real, publicly available failure data. Consequently, results obtained in small lab settings are sometimes disproved years later, many productive avenues of work in dependable system design are closed to most researchers, and, conversely, some unproductive work gets done based on faulty assumptions about the way real systems fail. Unfortunately, no open repository of system usage and failure data exists today for any recent computing infrastructure that is large enough, diverse enough, and accompanied by enough information about the infrastructure and the applications that run on it. We are addressing this pressing need, which has been voiced repeatedly by computer systems researchers from various sub-domains.
The project is collecting, curating, and presenting public failure data of large-scale computing systems in a repository called FRESCO. Our initial sources are Purdue, the University of Illinois at Urbana-Champaign, and the University of Texas at Austin. The datasets comprise static and dynamic information about system usage and workloads, as well as failure information for both planned and unplanned outages. We are performing data analytics on these datasets to answer various questions, such as: (1) How do jobs utilize cluster resources in a centrally managed university cluster? (2) How do users use, or fail to use, the options to share resources on a node? (3) How often are the typical resources (compute, memory, local IO, remote IO, networking) overstretched by demand, and does such contention affect the failure rates of jobs? (4) Can users estimate the time their jobs will need on the cluster?
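As a rough illustration of how questions (3) and (4) might be approached, the sketch below computes a walltime-estimate accuracy and compares failure rates under high and low memory pressure. The `JobRecord` schema, its field names, the 0.9 threshold, and the sample records are all hypothetical and are not taken from the FRESCO repository; the actual datasets may be organized quite differently.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class JobRecord:
    # Hypothetical schema for illustration only; real FRESCO fields may differ.
    job_id: str
    requested_walltime_s: int   # walltime the user asked for, in seconds
    actual_runtime_s: int       # time the job actually ran, in seconds
    peak_mem_fraction: float    # peak memory used / memory available on the node
    failed: bool                # True if the job ended in failure


def walltime_accuracy(jobs):
    """Median ratio of actual runtime to requested walltime (1.0 = exact estimate)."""
    ratios = [
        j.actual_runtime_s / j.requested_walltime_s
        for j in jobs
        if j.requested_walltime_s > 0
    ]
    return median(ratios)


def failure_rate_by_contention(jobs, mem_threshold=0.9):
    """Failure rate of jobs above vs. at or below an assumed memory-pressure threshold."""
    def rate(group):
        # None signals an empty group rather than a misleading 0.0.
        return sum(j.failed for j in group) / len(group) if group else None

    high = [j for j in jobs if j.peak_mem_fraction > mem_threshold]
    low = [j for j in jobs if j.peak_mem_fraction <= mem_threshold]
    return {"high": rate(high), "low": rate(low)}


# Made-up sample records, not real cluster data.
jobs = [
    JobRecord("a", 3600, 900, 0.40, False),
    JobRecord("b", 7200, 3600, 0.95, True),
    JobRecord("c", 3600, 3600, 0.97, False),
    JobRecord("d", 1800, 450, 0.20, False),
]

print(walltime_accuracy(jobs))
print(failure_rate_by_contention(jobs))
```

A median ratio well below 1.0 would suggest that users tend to over-request walltime, and a gap between the two failure rates would hint at a link between contention and failures. A real study would need far more records and a proper statistical test before drawing either conclusion.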
Further Reading
- FRESCO: Open Source Data Repository for Computational Usage and Failures. At: https://diagrid.org/resources/1093
- Subrata Mitra, Suhas Raveesh Javagal, Amiya K. Maji (ITaP), Todd Gamblin (LLNL), Adam Moody (LLNL), Stephen Harrell (ITaP), and Saurabh Bagchi, “A Study of Failures in Community Clusters: The Case of Conte,” at the 7th IEEE International Workshop on Program Debugging, co-located with ISSRE, pp. 1-8, Oct 23-27, 2016, Ottawa, Canada.
- Amiya Maji, Subrata Mitra, Bowen Zhou, Saurabh Bagchi, and Akshat Verma (IBM Research), “Mitigating Interference in Cloud Services by Middleware Reconfiguration,” at the 15th Annual ACM/IFIP/USENIX Middleware conference, pp. 277-288, December 8-12, 2014, Bordeaux, France. (Acceptance rate: 27/144 = 18.8%)