For 400G (NDR) InfiniBand and Ethernet links, signal integrity is managed through Forward Error Correction (FEC). While "Raw BER" counts errors before correction, the Effective BER (errors remaining after FEC) is the definitive metric for link stability. In a high-performance NVIDIA AI fabric, the Effective BER should ideally be zero. NVIDIA’s Cable Validation Tool (CVT) and Unified Fabric Manager (UFM) flag any link that shows an Effective BER greater than $1.5 \times 10^{-254}$ during standard monitoring periods. Exceeding this threshold indicates that the FEC engine is working at its limit and can no longer guarantee a "lossless" fabric. Unlike optical transceivers, Direct Attach Copper (DAC) cables are electrical and therefore have no "Rx power" reading (Option A), which makes BER the primary health indicator. A marginal cable that fails this threshold will cause intermittent packet retransmissions, which in turn severely degrade NCCL collective operations.
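The threshold check described above can be sketched in a few lines of Python. This is only an illustration: the function names, the port labels, and the BER readings are hypothetical, and it does not call any CVT or UFM API. It assumes the Effective BER values have already been collected as plain floats.

```python
# Threshold quoted in the explanation above.
EFFECTIVE_BER_THRESHOLD = 1.5e-254


def is_flagged(effective_ber: float,
               threshold: float = EFFECTIVE_BER_THRESHOLD) -> bool:
    """Return True when the post-FEC (Effective) BER is strictly above the threshold."""
    return effective_ber > threshold


def flagged_links(readings: dict[str, float]) -> list[str]:
    """Return the sorted names of links whose Effective BER exceeds the threshold."""
    return sorted(name for name, ber in readings.items() if is_flagged(ber))


# Hypothetical readings, keyed by a made-up port label.
readings = {
    "leaf1/p1": 0.0,        # clean link
    "leaf1/p2": 1.5e-254,   # exactly at the threshold, not flagged
    "leaf1/p3": 3e-15,      # marginal cable, flagged
}

print(flagged_links(readings))  # prints ['leaf1/p3']
```

The comparison is strict, matching the "greater than" wording above, so a link sitting exactly at the threshold is not flagged.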