If a GPU fails during a trial running on a resource pool on HPE Machine Learning Development Environment, the conductor will reschedule the trial on another available GPU in the pool, and the trial will restart from the latest checkpoint. The trial will not fail, and the ML engineer will not have to manually restart it from the latest checkpoint using the WebUI.
Contribute your Thoughts:
Chosen Answer:
This is a voting comment (?). You can switch to a simple comment. It is better to Upvote an existing comment if you don't have anything to add.
Submit