Characterizing Impacts of Storage Faults on HPC Applications: A Methodology and Insights
In recent years, the growing complexity of scientific simulations and the emerging demand for training large artificial intelligence models have required massive, fast data access, pushing high-performance computing (HPC) platforms to adopt more advanced storage infrastructure such as solid-state disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges …