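Problem: multi-GPU training over GPU P2P does not run correctly. Symptoms:
GPU memory usage stays at only 2 GB and does not grow, yet GPU utilization shows 100%.
Fix: disable ACS in the BIOS.
Using a Supermicro server as an example:
Press Del to enter BIOS setup, then follow these steps:
(1) Chipset Configuration
(2) North Bridge
(3) IIO Configuration
(4) Intel VT-d Configuration
(5) Set ACS Control to Disable
After ACS is disabled, training runs normally: GPU memory is almost fully used and utilization is also close to 100%.
Test procedure:
Run the program: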
cd /home/omnisky/projects/slz/code/domain-adaption-master
./run.sh
In a new terminal window:
watch -n 1 nvidia-smi
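The symptom can also be checked with a script instead of by watching the screen. This is a minimal sketch, assuming the `nvidia-smi --query-gpu` CSV output; the function name `flag_stalled_gpus` and the default thresholds (at most 3000 MiB used, at least 95% utilization) are hypothetical values picked to match the symptom described in this note.

```shell
# Reads rows of "index, memory.used, utilization.gpu" (csv,noheader,nounits)
# from stdin and prints GPUs that are busy but hold very little memory.
# $1 = memory ceiling in MiB (hypothetical default 3000)
# $2 = utilization floor in percent (hypothetical default 95)
flag_stalled_gpus() {
    awk -F', *' -v mem="${1:-3000}" -v util="${2:-95}" '
        NF >= 3 && $2 + 0 <= mem && $3 + 0 >= util {
            printf "GPU %s: %s MiB used at %s%% utilization (possible P2P stall)\n", $1, $2, $3
        }'
}

# Live usage (needs the NVIDIA driver):
#   nvidia-smi --query-gpu=index,memory.used,utilization.gpu \
#              --format=csv,noheader,nounits | flag_stalled_gpus
```

A GPU that stays on this list for several minutes while training should be loading data matches the faulty state; an empty result does not prove that P2P is healthy.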
Related discussion: https://forums.developer.nvidia.com/t/cpu-amd-5975wx-4-4090-cuda-cuda12-pytoch-2-0/237825/2
Checks:
root@omnisky:~# cat /etc/issue
Ubuntu 22.04.2 LTS \n \l
root@omnisky:~# lspci -vvv |grep -i nvidia
4f:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
Subsystem: NVIDIA Corporation GA102GL [A40]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
52:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
Subsystem: NVIDIA Corporation GA102GL [A40]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
56:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
Subsystem: NVIDIA Corporation GA102GL [A40]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
57:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
Subsystem: NVIDIA Corporation GA102GL [A40]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
ce:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
Subsystem: NVIDIA Corporation GA102GL [A40]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
d1:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
Subsystem: NVIDIA Corporation GA102GL [A40]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
d5:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
Subsystem: NVIDIA Corporation GA102GL [A40]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
d6:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
Subsystem: NVIDIA Corporation GA102GL [A40]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
root@omnisky:~#
root@omnisky:~# lspci -s 4f:00.0 -vvvv|grep -i acs
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
root@omnisky:~# lspci -s d6:00.0 -vvvv|grep -i acs
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
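Note that the output above is taken from the GPU endpoints: `ACSViol` is an AER error bit, and the `ACS` field under `ARICap`/`ARICtl` belongs to the ARI capability. The ACS control state itself is reported as an `ACSCtl:` line on PCIe bridges, switch ports and root ports. The following is a minimal sketch for listing devices where any ACS control bit is still set, assuming `lspci -vvv` is run as root so that capabilities are readable; the helper name `acs_enabled_devices` is hypothetical.

```shell
# Reads `lspci -vvv` output from stdin and prints the header line of every
# device whose ACSCtl line has at least one flag marked "+".
acs_enabled_devices() {
    awk '
        /^[0-9a-fA-F]/  { dev = $0 }      # unindented line starts a new device
        /ACSCtl:.*\+/   { print dev }     # some ACS control bit is enabled
    '
}

# Live usage:
#   sudo lspci -vvv | acs_enabled_devices
```

If the BIOS change took effect, this would be expected to print nothing for the bridges above the GPUs.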
Additional reference (output from another machine, omnisky-X10DGQ):
root@omnisky-X10DGQ:~# lspci -vvv |grep -i nvidia
03:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
Subsystem: NVIDIA Corporation TU104GL [Tesla T4]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
root@omnisky-X10DGQ:~# lspci -s 03:00.0 -vvvv|grep -i acs
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
root@omnisky-X10DGQ:~#
Why use ACS?
https://blog.csdn.net/weixin_40357487/article/details/120295827
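🌟1.1 Risks of P2P transfers
ATS (Address Translation Services) is a trust-based protocol. If the ATC (Address Translation Cache) on an EP (Endpoint) claims that a request it issues carries an already-translated address, and that address happens to fall within the BAR range of the PCIe switch, the request never reaches the RC (Root Complex); instead, the switch routes it to the EP that owns the address. In other words, the request bypasses IOMMU isolation and is carried out as a P2P (peer-to-peer) transfer.
Figure 1: Peer-to-peer PCIe Transaction
The PCIe specification permits P2P transfers, which means that different EPs attached to the same PCIe switch can communicate with each other without going through the RC. If direct P2P communication is not wanted and no countermeasure is taken, this loophole can easily be triggered, accidentally or deliberately, so that some EPs receive invalid, illegal, or even malicious requests, which can lead to a range of problems.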
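🌟1.2 Solution: ACS
ACS provides a mechanism that decides whether a TLP is routed normally, blocked, or redirected. In SR-IOV systems it can also prevent direct communication between device Functions that belong to the VI (Virtualization Intermediary) or to different SIs (System Images). Enabling ACS on a switching node disables P2P forwarding and forces the node to send requests for all addresses up to the RC, which avoids the risks of P2P access. ACS can be applied to any node that routes traffic, such as PCIe bridges, switching nodes, and PFs with VFs, where it acts as a gatekeeper.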
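The redirect behavior described above is visible in the `ACSCtl:` line that `lspci -vvv` prints for a bridge or switch port: `ReqRedir` and `CmpltRedir` are the P2P request and completion redirect bits. This is a minimal sketch that classifies one such line; the function name `acs_p2p_mode` and its three labels are hypothetical, and it only inspects these two bits, not the full ACS configuration.

```shell
# Classify a single line of lspci output by its ACS redirect bits.
#   redirect     : P2P requests or completions are forced up to the RC
#   no-redirect  : an ACSCtl line with both redirect bits cleared
#   no-acs-line  : the line is not an ACSCtl line at all
acs_p2p_mode() {
    case "$1" in
        *ACSCtl:*ReqRedir+*|*ACSCtl:*CmpltRedir+*) echo "redirect" ;;
        *ACSCtl:*)                                 echo "no-redirect" ;;
        *)                                         echo "no-acs-line" ;;
    esac
}

# Example:
#   acs_p2p_mode "ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-"
```

With redirect active, GPU-to-GPU traffic under the same switch is sent through the RC instead of directly between peers, which is consistent with the stalled multi-GPU training described at the top of this note.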