NVIDIA AI Infrastructure NCP-AII Question # 31 Topic 4 Discussion
When configuring an out-of-core HPL burn-in for a 40B matrix on 8x H100 nodes, which environment variable prevents GPU out-of-memory errors while reserving space for drivers?
The correct option is export HPL_OOC_SAFE_SIZE=4.0.

NVIDIA HPL out-of-core mode lets matrix data that exceeds GPU memory capacity reside in host memory, but each GPU must still keep some memory free for the driver and runtime overhead. NVIDIA documents HPL_OOC_SAFE_SIZE as the amount of GPU memory, in GiB, reserved for the driver and not used by HPL out-of-core mode, and recommends increasing it when GPU out-of-memory errors occur.

The other options do not meet the requirement. HPL_OOC_MODE=0 disables out-of-core mode, which would not help run a larger 40B matrix. HPL_OOC_NUM_STREAMS=8 changes the number of CUDA streams used for out-of-core operations but does not reserve driver memory. HPL_OOC_MAX_GPU_MEM=90 limits total GPU memory use, but the variable specifically intended to leave safe space for the driver is HPL_OOC_SAFE_SIZE.

During cluster burn-in, an adequate HPL_OOC_SAFE_SIZE avoids false failures that stem from insufficient reserved memory rather than from actual hardware instability, so the test result stays valid.
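As a rough illustration of the answer, the settings could be exported before launching the benchmark along the lines below. This is a minimal sketch, not NVIDIA's reference procedure: it assumes that HPL_OOC_MODE=1 is the value that enables out-of-core mode, and the helper bump_safe_size is a hypothetical name introduced here to show the "increase it on out-of-memory" advice. The launch command itself is omitted because it depends on the container and launcher in use.

```shell
#!/bin/sh
# Assumption: 1 enables out-of-core mode (the explanation only states that 0 disables it).
export HPL_OOC_MODE=1

# GiB of GPU memory left untouched by HPL out-of-core mode for the driver and runtime.
export HPL_OOC_SAFE_SIZE=4.0

# Hypothetical helper: return the current reservation plus a step, both in GiB,
# for retrying a run that failed with a GPU out-of-memory error.
bump_safe_size() {
    awk -v size="$1" -v step="$2" 'BEGIN { printf "%.1f\n", size + step }'
}

# Example retry policy: raise the reservation by 1 GiB after an OOM failure.
# HPL_OOC_SAFE_SIZE="$(bump_safe_size "$HPL_OOC_SAFE_SIZE" 1.0)"
```

Raising the reservation in small steps keeps as much GPU memory as possible available to HPL while still clearing the out-of-memory condition.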