Dynamic Resource Allocation for Deep Learning Clusters with Separated Compute and Storage
Published in IEEE INFOCOM, 2023
This paper addresses the challenge of optimizing cost and performance in deep learning (DL) clusters with separated compute and storage. It proposes strategies that alleviate I/O bottlenecks through caching or bandwidth scaling, tailored to the heterogeneous needs of different DL models and job characteristics.