During a 48-hour NeMo question-answering model burn-in test, GPU memory errors occur when processing large datasets. Which configuration strategy prevents Out-of-Memory (OOM) errors while maintaining processing efficiency?
You are a network administrator responsible for configuring an East-West (E/W) Spectrum-X fabric using SuperNIC. The Bluefield-3 devices in your network should be set to NIC mode with RoCE enabled to optimize data flow between servers. You have access to the Spectrum-X management tools and the necessary documentation. You need to use specific configuration commands to achieve this setup. Which of the following steps and commands are necessary to configure the Bluefield-3 devices in NIC mode for the E/W Spectrum-X fabric using SuperNIC? (Pick the 2 correct responses below)
During HPL execution on a DGX cluster, the benchmark fails with " not enough memory " errors despite sufficient physical RAM. Which HPL.dat parameter adjustment is most effective?
An engineer must ensure that a BlueField-3 NIC firmware download matches the cluster’s PSID. Which step is critical before installation?
An administrator needs to perform a comprehensive pre-production stress test on a DGX H100 system. Which command validates GPU, CPU, memory, and storage components while following NVIDIA’s recommended procedure?
A leaf switch shows " FW Version Mismatch " alerts for transceivers after cluster expansion. Which tool validates transceiver firmware against expected versions?
An InfiniBand administrator needs to run performance benchmarks on new devices added to the fabric. What tool should be used to check the latency?
You are installing the operating system as part of the initial setup for a new NVIDIA Base Command Manager cluster. Which two of the following actions are essential for a successful OS installation on the cluster’s head node?
Pick the 2 correct responses below.
A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?
When verifying network cable signal integrity during cluster deployment, which measurement result most strongly indicates a cable signal problem?