
NVIDIA-Certified Professional NCP-AII Questions and Answers (CertsForce)

Viewing page 4 out of 4 pages
Viewing questions 31-40 out of questions
Question # 31:

When configuring an out-of-core HPL burn-in for a 40B matrix on 8x H100 nodes, which environment variable prevents GPU out-of-memory errors while reserving space for drivers?

Options:

A. export HPL_OOC_SAFE_SIZE=4.0
B. export HPL_OOC_MODE=0
C. export HPL_OOC_NUM_STREAMS=8
D. export HPL_OOC_MAX_GPU_MEM=90


Question # 32:

During East-West fabric validation on a 64-GPU cluster, an engineer runs all_reduce_perf and observes an algorithm bandwidth of 350 GB/s and bus bandwidth of 656 GB/s. What does this indicate about the fabric performance?

Options:

A. Inconclusive; rerun with point-to-point tests.
B. Optimal performance; bus bandwidth near theoretical peak for NDR InfiniBand.
C. Critical failure; bus bandwidth exceeds hardware capabilities.
D. Suboptimal performance; algorithm bandwidth should match bus bandwidth.
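
The relationship between the two figures reported by all_reduce_perf can be checked with simple arithmetic. The sketch below assumes the nccl-tests convention for all-reduce, in which bus bandwidth equals algorithm bandwidth multiplied by 2(n-1)/n for n ranks. The function names are illustrative only and are not part of any NVIDIA tool.

```python
def allreduce_busbw_factor(num_ranks: int) -> float:
    """Correction factor applied to algorithm bandwidth for all-reduce.

    Assumes the nccl-tests convention: busbw = algbw * 2 * (n - 1) / n.
    """
    if num_ranks < 2:
        raise ValueError("all-reduce needs at least 2 ranks")
    return 2 * (num_ranks - 1) / num_ranks


def allreduce_busbw(algbw_gbs: float, num_ranks: int) -> float:
    """Bus bandwidth (GB/s) implied by an algorithm bandwidth (GB/s)."""
    return algbw_gbs * allreduce_busbw_factor(num_ranks)


if __name__ == "__main__":
    # Show how the factor grows toward 2 as the rank count increases.
    for n in (2, 8, 16, 64):
        print(f"ranks={n:3d}  factor={allreduce_busbw_factor(n):.5f}")
```

Under this convention the factor is 1 for two ranks and approaches 2 as the rank count grows, so a bus bandwidth well above the algorithm bandwidth is expected on a large cluster and is not by itself a sign of a fault.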


Question # 33:

As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware according to NVIDIA’s documentation and recommended operational steps?

Options:

A. Perform a software-driven restart on the operating system of every compute node, then use advanced tools to check firmware status and reissue update commands if any firmware appears inactive afterward.
B. Initiate the required cold reset or power cycle to activate updated firmware, reset the BMC using the recommended command, and perform an AC power cycle when required for EROT and CPLD firmware activation.
C. Initiate a cold power cycle on all node trays to activate firmware, follow with a DGX reboot procedure, and use the management interface to finish activating CPLD firmware on the host.
D. Execute a single operating system reboot on the DGX after the update process, then reset the software stack and verify status using diagnostic commands on each node.


Question # 34:

One of the nodes in a cluster is not running as fast as the others and the system administrator needs to check the status of the GPUs on that system. What command should be used?

Options:

A. lspci | grep NVIDIA
B. nvidia-smi
C. nvidia-gpu-status
D. iblinkinfo


Question # 35:

An engineer is reimaging a DGX system in a large cluster. Which method ensures the most efficient and secure remote installation without physical access?

Options:

A. Use apt-get to upgrade the operating system without rebooting the system.
B. Create a USB drive with the ISO and manually boot from it on the DGX system.
C. Build a software image on Base Command Manager and then reimage the system.
D. Skip ISO verification and directly flash the operating system to the disk via SSH.


Question # 36:

A systems administrator is preparing a new DGX server for deployment. What is the most secure approach to configuring the BMC port during initial setup?

Options:

A. Enable remote access to the BMC over the internet using the default admin credentials for initial troubleshooting.
B. Connect the BMC port directly to the production network and retain default admin credentials for convenience.
C. Leave the BMC port disconnected until after the operating system is fully configured and in production.
D. Connect the BMC port to a dedicated and firewalled network and change the default admin credentials.

