Pass the NVIDIA NVIDIA-Certified Professional NCP-AIO Questions and answers with CertsForce

Viewing page 2 out of 2 pages
Viewing questions 11-20 out of questions
Questions # 11:

You are managing a Slurm cluster with multiple GPU nodes, each equipped with different types of GPUs. Some jobs are being allocated GPUs that should be reserved for other purposes, such as display rendering.

How would you ensure that only the intended GPUs are allocated to jobs?

Options:

A.

Verify that the GPUs are correctly listed in both gres.conf and slurm.conf, and ensure that unconfigured GPUs are excluded.


B.

Use nvidia-smi to manually assign GPUs to each job before submission.


C.

Reinstall the NVIDIA drivers to ensure proper GPU detection by Slurm.


D.

Increase the number of GPUs requested in the job script to avoid using unconfigured GPUs.
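
A minimal sketch of the idea behind option A, assuming GPUs are listed by hand in gres.conf rather than auto-detected. The node name `node01`, the type `a100`, the device paths, and the helper `render_gres_lines` are hypothetical and only illustrate how a reserved device can be left out of both files.

```python
def render_gres_lines(node, gpu_type, device_files, reserved=()):
    """Build config lines that expose only non-reserved GPU device files to Slurm."""
    reserved_set = set(reserved)
    usable = [path for path in device_files if path not in reserved_set]
    if not usable:
        raise ValueError("no GPU device files left to expose")
    # One gres.conf line per device file that jobs may use.
    gres_conf = [
        f"NodeName={node} Name=gpu Type={gpu_type} File={path}" for path in usable
    ]
    # slurm.conf fragment; a real NodeName line also carries CPU and memory attributes.
    slurm_conf = [
        "GresTypes=gpu",
        f"NodeName={node} Gres=gpu:{gpu_type}:{len(usable)}",
    ]
    return gres_conf, slurm_conf


# Hypothetical node with three GPUs, where /dev/nvidia0 drives a display.
gres_conf, slurm_conf = render_gres_lines(
    "node01",
    "a100",
    ["/dev/nvidia0", "/dev/nvidia1", "/dev/nvidia2"],
    reserved=["/dev/nvidia0"],
)
print("\n".join(gres_conf))
print("\n".join(slurm_conf))
```

The GPU count in slurm.conf is derived from the same filtered list, so the two files cannot drift apart in this sketch.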


Questions # 12:

A DGX H100 system in a cluster is showing performance issues when running jobs.

Which command should be run to generate system logs related to the health report?

Options:

A.

nvsm show logs --save


B.

nvsm get logs


C.

nvsm dump health


D.

nvsm health --dump-log


Questions # 13:

An administrator is troubleshooting issues with an NVIDIA Unified Fabric Manager Enterprise (UFM) installation and notices that the UFM server is unable to communicate with InfiniBand switches.

What step should be taken to address the issue?

Options:

A.

Reboot the UFM server to refresh network connections.


B.

Install additional GPUs in the UFM server to boost connectivity.


C.

Disable the firewall on the UFM server to allow communication.


D.

Verify the subnet manager configuration on the InfiniBand switches.


Questions # 14:

A system administrator is troubleshooting a Docker container that crashes unexpectedly due to a segmentation fault. They want to generate and analyze core dumps to identify the root cause of the crash.

Why would generating core dumps be a critical step in troubleshooting this issue?

Options:

A.

Core dumps prevent future crashes by stopping any further execution of the faulty process.


B.

Core dumps provide real-time logs that can be used to monitor ongoing application performance.


C.

Core dumps restore the process to its previous state, often fixing the error-causing crash.


D.

Core dumps capture the memory state of the process at the time of the crash.
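
A core file is only written if the process's core-size limit allows it. The sketch below, assuming a Linux container with Python available, raises the soft limit to whatever hard limit the container runtime granted (for Docker this is commonly set with `--ulimit core=-1`, and the dump location follows the host's `kernel.core_pattern`). The function name `enable_core_dumps` is hypothetical.

```python
import resource


def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit for this process."""
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # Raising the soft limit up to the hard limit needs no extra privileges.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


soft, hard = enable_core_dumps()
print("core limit (soft, hard):", soft, hard)
```

If the hard limit is 0, no dump can be produced from inside the container and the limit has to be raised when the container is started.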


Questions # 15:

You are deploying an AI workload on a Kubernetes cluster that requires access to GPUs for training deep learning models. However, the pods are not able to detect the GPUs on the nodes.

What would be the first step to troubleshoot this issue?

Options:

A.

Verify that the NVIDIA GPU Operator is installed and running on the cluster.


B.

Ensure that all pods are using the latest version of TensorFlow or PyTorch.


C.

Check if the nodes have sufficient memory allocated for AI workloads.


D.

Increase the number of CPU cores allocated to each pod to ensure better resource utilization.
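
One way to see whether the GPU software stack is advertising devices at all is to look at each node's allocatable `nvidia.com/gpu` resource. The sketch below assumes the JSON comes from `kubectl get nodes -o json`; the helper name `gpu_allocatable_by_node` and the sample node names are hypothetical.

```python
def gpu_allocatable_by_node(node_list):
    """Map node name to its allocatable nvidia.com/gpu count (0 if not advertised)."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        counts[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return counts


# Hand-written sample shaped like `kubectl get nodes -o json` output (illustrative only).
sample = {
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "64", "nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "gpu-node-2"},
         "status": {"allocatable": {"cpu": "64"}}},
    ]
}
print(gpu_allocatable_by_node(sample))
```

A node that reports zero here is not exposing GPUs to the scheduler, which points at the operator or device plugin rather than at the workload.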


Questions # 16:

You are managing a high availability (HA) cluster that hosts mission-critical applications. One of the nodes in the cluster has failed, but the application remains available to users.

What mechanism is responsible for ensuring that the workload continues to run without interruption?

Options:

A.

Load balancing across all nodes in the cluster.


B.

Manual intervention by the system administrator to restart services.


C.

The failover mechanism that automatically transfers workloads to a standby node.


D.

Data replication between nodes to ensure data integrity.


Questions # 17:

A Slurm user frequently finds that a job gets stuck in the “PENDING” state and cannot progress to the “RUNNING” state.

Which Slurm command can help the user identify the reason for the job’s pending status?

Options:

A.

sinfo -R


B.

scontrol show job


C.

sacct -j


D.

squeue -u
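
The output of `scontrol show job` for a given job ID includes a `JobState=` field followed by a `Reason=` field. A small sketch that pulls the reason out of that text is shown below; the helper name `pending_reason` and the sample output are illustrative, not captured from a real cluster.

```python
import re


def pending_reason(scontrol_output):
    """Return the Reason value if the job is PENDING, otherwise None."""
    match = re.search(r"JobState=(\S+)\s+Reason=(\S+)", scontrol_output)
    if match is None:
        return None
    state, reason = match.groups()
    return reason if state == "PENDING" else None


# Abbreviated, hand-written sample in the style of `scontrol show job` output.
sample_output = """JobId=1234 JobName=train
   JobState=PENDING Reason=Resources Dependency=(null)
"""
print(pending_reason(sample_output))
```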


Questions # 18:

A Slurm user needs to submit a batch job script for execution tomorrow.

Which command should be used to complete this task?

Options:

A.

sbatch --begin=tomorrow


B.

submit --begin=tomorrow


C.

salloc --begin=tomorrow


D.

srun --begin=tomorrow
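
Slurm's long options take two dashes, and `sbatch` accepts the keyword `tomorrow` as a value for `--begin`. As a small sketch, the deferred submission can be assembled as an argument list; the script name `train.sh` and the helper `sbatch_begin_argv` are hypothetical.

```python
import shlex


def sbatch_begin_argv(script_path, begin="tomorrow"):
    """Build the argument list for a batch submission deferred until `begin`."""
    return ["sbatch", f"--begin={begin}", script_path]


argv = sbatch_begin_argv("train.sh")
# Show the command as it would be typed in a shell.
print(shlex.join(argv))
```

The list form can be passed to `subprocess.run` on a login node without going through a shell.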


Questions # 19:

You are configuring cloudbursting for your on-premises cluster using BCM, and you plan to extend the cluster into both AWS and Azure.

What is a key requirement for enabling cloudbursting across multiple cloud providers?

Options:

A.

You only need to configure credentials for one cloud provider, as BCM will automatically replicate them across other providers.


B.

You need to set up a single set of credentials that works across both AWS and Azure for seamless integration.


C.

You must configure separate credentials for each cloud provider in BCM to enable their use in the cluster extension process.


D.

BCM automatically detects and configures credentials for all supported cloud providers without requiring admin input.

