Pass the NVIDIA NVIDIA-Certified Professional NCP-AII Questions and answers with CertsForce

Viewing page 1 out of 3 pages
Viewing questions 1-10 out of questions
Questions # 1:

You are tasked with updating both NVIDIA GPU drivers and DOCA drivers on a set of servers used for AI workloads. The environment previously had an older driver stack and custom kernel modules. What is the most important step to successfully upgrade the drivers without causing conflicts?

Options:

A. Update the GPU driver leaving the DOCA and OFED drivers unchanged as long as they are detecting the hardware properly.
B. Validate the driver version post-install since the fresh install will overwrite the legacy drivers.
C. Keep the older driver running alongside the new version in case you need to roll back the upgrade.
D. Uninstall all existing GPU and DOCA-related drivers and associated kernel modules before the new install.


Questions # 2:

As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware?

Options:

A. Perform a software-driven restart on the operating system of every compute node, then use advanced tools to check firmware status, and reissue update commands if any firmware appears inactive afterward.
B. Execute a single AC power cycle on the DGX after the update process, then reset the software stack and verify status using diagnostic commands on each node for confirmation of all component updates.
C. Initiate a cold power cycle on all node trays to activate firmware, follow with a DGX reboot procedure, and use the management interface to finish activating CPLD firmware on the host.
D. Initiate a cold power cycle on the system to activate firmware for components, reset the BMC using the recommended command, and perform an AC power cycle to ensure EROT and CPLD firmware is activated.


Questions # 3:

After upgrading to HPL-AI 2.0 on a DGX A100 cluster, a 2x performance gain is observed. Which optimization is primarily responsible for this improvement?

Options:

A. Reduction of problem size (N) to accelerate computation.
B. MPI-aware GPU communication that reduces CPU bottlenecks and GPU idle time.
C. Doubling of GPU clock speeds through firmware updates and relevant configuration.
D. Automatic NVLink bandwidth doubling via driver updates.


Questions # 4:

When updating the firmware on an NVLink switch transceiver, how can an engineer apply new firmware without interrupting the network?

Options:

A. mlxfwreset -d -lid 27 reset --yes to reset the transceiver
B. Physically disconnect and reconnect the transceiver.
C. flint -d -lid 27 --linkx --linkx_auto_update --activate
D. nv action reboot system to force immediate activation.


Questions # 5:

An engineer needs to validate 400G DAC cable signal integrity in a DGX cluster. Which CVT metric best identifies marginal cables needing replacement?

Options:

A. Lane power variance < 3dB across all transceivers.
B. Transceiver model matching QSFP-DD specifications.
C. Temperature fluctuations > 5°C during validation.
D. Effective BER > 1.5E-254 during a <6-hour monitoring window.
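
As background on why the length of a monitoring window matters when reading a bit error rate, here is a rough sketch: the expected number of bit errors is the BER multiplied by the line rate and the observation time. The function name and the sample figures (a BER of 1e-15 and a nominal 400 Gb/s line rate) are illustrative assumptions, not CVT thresholds or CVT output.

```python
def expected_bit_errors(ber: float, line_rate_bps: float, seconds: float) -> float:
    """Expected bit-error count for a link observed over a time window.

    ber           -- bit error rate (errors per transmitted bit)
    line_rate_bps -- nominal line rate in bits per second (assumed constant)
    seconds       -- length of the monitoring window in seconds
    """
    if ber < 0 or line_rate_bps < 0 or seconds < 0:
        raise ValueError("inputs must be non-negative")
    return ber * line_rate_bps * seconds


if __name__ == "__main__":
    # Hypothetical figures: 1e-15 BER on a 400 Gb/s link over six hours.
    window_s = 6 * 3600
    print(expected_bit_errors(1e-15, 400e9, window_s))
```

With these assumed inputs the result is roughly 8.64 expected errors, which illustrates that a shorter window gives fewer bits over which a marginal cable can show itself.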


Questions # 6:

A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?

Options:

A. Navigate to "Devices" > select a switch > "Cables" tab to see ASIC firmware and transceiver versions.
B. Use "Topology" view to visually inspect cable icons.
C. Run mlxlink -d lid- -m on each port manually.
D. Export all switch logs and grep for "FW Version".


Questions # 7:

During East-West fabric validation on a 64-GPU cluster, an engineer runs all_reduce_perf and observes an algorithm bandwidth of 350 GB/s and bus bandwidth of 656 GB/s. What does this indicate about the fabric performance?

Options:

A. Inconclusive; rerun with point-to-point tests.
B. Optimal performance; bus bandwidth near theoretical peak for NDR InfiniBand.
C. Critical failure; bus bandwidth exceeds hardware capabilities.
D. Suboptimal performance; algorithm bandwidth should match bus bandwidth.
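
For reference when reading all_reduce_perf output, the following is a minimal sketch of how the two figures relate. It assumes the convention documented for nccl-tests, where all-reduce bus bandwidth is the algorithm bandwidth scaled by 2(n-1)/n for n ranks. The function name is hypothetical and the rank counts are only examples.

```python
def allreduce_bus_bandwidth(alg_bw: float, n_ranks: int) -> float:
    """Convert all-reduce algorithm bandwidth to bus bandwidth.

    Assumes the nccl-tests convention: busbw = algbw * 2 * (n - 1) / n.
    Units pass through unchanged (GB/s in, GB/s out).
    """
    if n_ranks < 2:
        raise ValueError("all-reduce needs at least 2 ranks")
    return alg_bw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Example rank counts; 350 GB/s is the algorithm bandwidth from the question.
    for n in (16, 64):
        print(n, allreduce_bus_bandwidth(350.0, n))
```

Under this convention, 350 GB/s of algorithm bandwidth corresponds to 656.25 GB/s of bus bandwidth for 16 ranks and 689.0625 GB/s for 64 ranks, so the two reported numbers are expected to differ.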


Questions # 8:

A system administrator needs to configure a BlueField DPU and enable RShim on the baseboard management controller (BMC). Which command should be executed?

Options:

A. ipmitool raw 0x32 0x6a 1
B. systemctl restart rshim
C. systemctl enable bmc-rshim.service
D. scp root@:/dev/rshim0/boot


Questions # 9:

A system administrator noticed a failure on a DGX H100 server. After a reboot, only the BMC is available. What could be the reason for this behavior?

Options:

A. The network card has no link / connection.
B. A boot disk has failed.
C. Multiple GPUs have failed.
D. There are more than two failed power supplies.


Questions # 10:

What information does the 'ibnodes' command display?

Options:

A. All hosts & switches
B. All host & server names
C. All server names
D. All channel adapters

