" GPU-Host latency " issues in NVIDIA DGX or HGX systems are frequently caused by incorrect PCIe affinity or sub-optimal NUMA (Non-Uniform Memory Access) mapping. If a GPU is forced to communicate with a CPU core or an HCA that is not on its local PCIe switch/root complex, latency increases significantly as data must cross the QPI/UPI inter-processor links. The command nvidia-smi topo -m provides a detailed matrix of the system ' s internal topology, showing how GPUs, CPUs, and NICs are connected. It identifies whether the connection is via a single PCIe switch (PIX), multiple switches (PXB), or across the CPU (SYS). By inspecting this map, an administrator can identify if a software process is pinned to the wrong NUMA node or if a hardware path is unexpectedly degraded. While DCGM (Option C) is good for checking component health, it doesn ' t map the logical-to-physical affinity paths that cause specific latency " threshold " warnings.