NVIDIA AI Infrastructure NCP-AII Question #34 Topic 4 Discussion
Question #: 34
Topic #: 4
One of the nodes in a cluster is not running as fast as the others and the system administrator needs to check the status of the GPUs on that system. What command should be used?
The nvidia-smi (NVIDIA System Management Interface) utility is the primary tool for monitoring and managing the state of NVIDIA GPUs. When a node exhibits "jitter" or lower performance than its peers in a cluster, nvidia-smi provides the granular data needed to identify the bottleneck. It reports critical metrics such as GPU utilization percentage, memory usage, and, most importantly, "Clocks Throttle Reasons." If a GPU is running slowly due to power capping, thermal throttling, or a hardware fault (such as an uncorrectable ECC error), nvidia-smi will surface that state immediately. By contrast, lspci (Option A) can only confirm that the GPU is visible on the PCIe bus; it provides no telemetry about operational performance or health. iblinkinfo (Option D) is a network-level tool that reports only InfiniBand link states. In an AI infrastructure context, nvidia-smi is the first-line diagnostic for determining whether a GPU is healthy and running at its intended clock speeds.
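As a quick sketch of how this diagnosis might proceed in practice (the flags below are standard nvidia-smi options, though the exact output layout varies by driver version):

# Overall snapshot: per-GPU utilization, memory usage, temperature, power
nvidia-smi

# Detailed performance state, including the Clocks Throttle Reasons section
# (look for entries such as SW Power Cap or HW Thermal Slowdown marked Active)
nvidia-smi -q -d PERFORMANCE

# Check for ECC errors that could indicate a failing GPU
nvidia-smi -q -d ECC

# Compact comparison of current vs. maximum SM clocks across all GPUs
nvidia-smi --query-gpu=index,utilization.gpu,clocks.sm,clocks.max.sm,temperature.gpu,power.draw --format=csv

If the current SM clock on the slow node sits well below clocks.max.sm while a throttle reason shows as active, that points to power capping or a thermal problem rather than a software-level issue.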
Chosen Answer: nvidia-smi