On an NVIDIA DGX system, theBaseboard Management Controller (BMC)is an independent processor that runs even if the main CPU and Operating System fail to load. If a server reboots and the administrator can access the BMC web interface or IPMI console, but the OS (Ubuntu/DGX OS) does not load, the most likely cause is aboot disk failure. The DGX H100 uses NVMe drives in a RAID-1 configuration for the OS boot volume. If both drives in the mirror fail, or if the boot partition becomes corrupted, the system will hang at the BIOS or UEFI prompt, unable to find a bootable device. While failed power supplies (Option D) or network links (Option A) can cause issues, they would typically prevent the BMC from being reachable at all or prevent remote network traffic respectively. A GPU failure (Option C) would not stop the OS from booting; the system would simply boot with a degraded GPU count. Therefore, checking the storage health via the BMC "Storage" logs is the correct diagnostic step.
Contribute your Thoughts:
Chosen Answer:
This is a voting comment (?). You can switch to a simple comment. It is better to Upvote an existing comment if you don't have anything to add.
Submit