NVIDIA
NCP-AII
Q1:
What information does the 'ibnodes' command display?
○ A. All hosts & switches
○ B. All host & server names
○ C. All server names
○ D. All channel adapters
Q2:
After ClusterKit reports "GPU-Host latency exceeds threshold," which NVIDIA diagnostic tool should be used to isolate hardware faults?
○ A. Re-run ClusterKit with --stress=gpu -Y 60 to extend test duration
○ B. nvidia-smi topo -m to inspect GPU topology connections
○ C. DCGM Diags: dcgmi diag -r 2
○ D. ib_write_bw to measure InfiniBand bandwidth between nodes
Q3:
During East-West fabric validation on a 64-GPU cluster, an engineer runs all_reduce_perf and observes an algorithm bandwidth of 350 GB/s and bus bandwidth of 656 GB/s. What does this indicate about the fabric performance?
○ A. Inconclusive; rerun with point-to-point tests.
○ B. Optimal performance; bus bandwidth near theoretical peak for NDR InfiniBand.
○ C. Critical failure; bus bandwidth exceeds hardware capabilities.
○ D. Suboptimal performance; algorithm bandwidth should match bus bandwidth.
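As background for the two figures in Q3: the nccl-tests documentation derives all-reduce bus bandwidth from algorithm bandwidth by multiplying by 2(n-1)/n, where n is the number of ranks, so a bus bandwidth larger than the algorithm bandwidth is expected rather than a fault. The sketch below only illustrates that relationship. The function name and the rank counts in the loop are assumptions made for the example, not part of the question.

```python
def allreduce_bus_bandwidth(alg_bw: float, n_ranks: int) -> float:
    """Scale all-reduce algorithm bandwidth to bus bandwidth.

    Applies the all-reduce correction factor 2 * (n - 1) / n described in
    the nccl-tests performance notes. Units are whatever alg_bw uses.
    """
    if n_ranks < 2:
        raise ValueError("all-reduce needs at least two ranks")
    return alg_bw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Hypothetical rank counts, chosen only to show how the factor grows
    # toward 2 as the number of ranks increases.
    for n in (8, 16, 64):
        print(f"{n} ranks: {allreduce_bus_bandwidth(350.0, n):.2f} GB/s")
```

With the question's 350 GB/s, the factor for 16 ranks (1.875) gives 656.25 GB/s, which is close to the quoted 656 GB/s. The question does not state how many ranks took part in the run, so treat this only as a consistency check.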
Q4:
What command is needed to measure BER (Bit Error Rate)?
○ A. mlxconfig -d <device> q
○ B. ethtool -S <device>
○ C. mlxlink -d <device> -c -e
○ D. mstflint -d <device> q full
Q5:
You are validating the environment of an NVIDIA GPU-accelerated data center during post-deployment checks. Which one action is essential to confirm that power and cooling are sufficient for the stable operation of NVIDIA DGX H100 systems?
○ A. Confirm the system fans are running at 100% under all workloads to prevent overheating.
○ B. Review the system BIOS to ensure GPU overclocking is enabled for maximum performance.
○ C. Use NVSM to disable unused PCIe devices to reduce overall system heat output.
○ D. Verify that each DGX system is connected to redundant, properly rated PDUs and that all power supplies are reporting nominal input.