NVIDIA AI Infrastructure NCP-AII Question # 25 Topic 3 Discussion
A systems engineer is updating firmware across a large DGX cluster using automation. What is the best practice for minimizing risk and ensuring cluster health during and after the process?
A. Drain nodes from the scheduler, run pre-update diagnostics, update firmware in batches, and verify health post-update before scaling to the next batch.
B. To save time, simultaneously update all nodes in the cluster without draining or diagnostics.
C. Update nodes that have reported faults, leaving others on older firmware.
D. Drain nodes from the scheduler, update firmware in batches, skip diagnostics, and verify health post-update before scaling to the next batch.
Updating firmware on an NVIDIA DGX cluster is a critical operation that touches several sensitive components, including the GPU baseboard, the BMC, the motherboard tray (SBC), and the InfiniBand HCAs. In a production environment, batching is the industry standard because it prevents a single corrupted firmware image or failed update from taking down the entire AI factory. The process must begin with draining the nodes in the workload scheduler (such as Slurm or Kubernetes) so that no active training jobs are interrupted. Running pre-update diagnostics with tools like nvsm show health or dcgmi diag establishes a baseline and confirms the hardware is stable before changes are applied; without that baseline, a fault seen after the update cannot be cleanly attributed to either the new firmware or a pre-existing problem. Once the firmware is applied to a controlled batch, post-update verification confirms that each system returns to a "Healthy" state and that all versions match the target manifest. This rolling update strategy lets the engineer pause the automation if a specific node fails to return to service, protecting the overall availability of the cluster. Option A is the only choice that includes every one of these steps. Option B has no safeguards at all, Option D skips the pre-update diagnostics, and Option C leaves nodes on mismatched versions, creating configuration drift that leads to unpredictable performance in collective communication libraries.
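For anyone who wants to see the control flow, here is a minimal Python sketch of the drain, pre-check, update, verify loop described above. It is not NVIDIA's tooling: the function name rolling_update and the five hook parameters (drain, pre_check, apply_firmware, post_check, resume) are hypothetical placeholders that you would wire to your own scheduler and diagnostic commands. The only point is that automation halts at the first failed check instead of moving on to the next batch.

```python
from typing import Callable, List, Optional, Tuple

Action = Callable[[str], None]   # side-effecting step, e.g. drain a node
Check = Callable[[str], bool]    # health check, True means the node passed


def rolling_update(
    nodes: List[str],
    batch_size: int,
    drain: Action,
    pre_check: Check,
    apply_firmware: Action,
    post_check: Check,
    resume: Action,
) -> Tuple[List[str], Optional[str]]:
    """Update nodes batch by batch; stop at the first failed check.

    Returns (nodes returned to service, node that halted the run or None).
    All hooks are hypothetical and supplied by the caller.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    updated: List[str] = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]

        # 1. Take the batch out of the scheduler before touching it.
        for node in batch:
            drain(node)

        # 2. Baseline diagnostics: do not flash hardware that is already unhealthy.
        for node in batch:
            if not pre_check(node):
                return updated, node

        # 3. Apply firmware to this batch only.
        for node in batch:
            apply_firmware(node)

        # 4. Verify health; a failure leaves the whole batch drained for inspection.
        for node in batch:
            if not post_check(node):
                return updated, node

        # 5. Only a fully verified batch goes back into service.
        for node in batch:
            resume(node)
            updated.append(node)

    return updated, None


if __name__ == "__main__":
    # Simulated run: node dgx04 fails its post-update check.
    names = [f"dgx{i:02d}" for i in range(1, 7)]
    done, halted_on = rolling_update(
        names,
        batch_size=2,
        drain=lambda n: None,
        pre_check=lambda n: True,
        apply_firmware=lambda n: None,
        post_check=lambda n: n != "dgx04",
        resume=lambda n: None,
    )
    print("returned to service:", done)
    print("halted on:", halted_on)
```

In the simulated run the first batch returns to service, the second batch stays drained because dgx04 fails verification, and the third batch is never touched. That is the behavior that separates Option A from Option B.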