+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
# Run the installer
(self-llm) deepseek@deepseek1:~/installPkgs$ sudo sh cuda_12.4.0_550.54.14_linux.run
# In the dialog that appears, type "accept" to accept the license terms, then make the selections shown below
When the installation process completes, the installer prints the following summary:
===========
= Summary =
===========
Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-12.4/
Please make sure that
 -   PATH includes /usr/local/cuda-12.4/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-12.4/lib64, or, add /usr/local/cuda-12.4/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.4/bin

***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 550.00 is required for CUDA 12.4 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run --silent --driver
Logfile is /var/log/cuda-installer.log
# The installer has created the symlink /usr/local/cuda pointing to /usr/local/cuda-12.4/; we use this symlink from here on
(self-llm) deepseek@deepseek1:~/installPkgs$ ls -al /usr/local/cuda
lrwxrwxrwx 1 root root 21 Feb 21 17:55 /usr/local/cuda -> /usr/local/cuda-12.4/
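Following the installer's reminder, the toolkit paths can be added to the shell profile. A minimal sketch using the generic /usr/local/cuda symlink (the shell and profile file are assumptions; adapt to your setup):

# Append the CUDA paths to ~/.bashrc and reload it
cat >> ~/.bashrc <<'EOF'
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
EOF
source ~/.bashrc
# Verify that the toolkit is now on PATH
nvcc --version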
deepseek@deepseek1:~$ nvidia-smi nvlink --status
GPU 0: NVIDIA A800-SXM4-80GB (UUID: GPU-f275597c-05e4-f7e8-35bc-a3ab26194262)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 1: NVIDIA A800-SXM4-80GB (UUID: GPU-56afa4fb-5618-6b47-7861-51c811bb87d8)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 2: NVIDIA A800-SXM4-80GB (UUID: GPU-9039f623-9c6f-bc1a-f7d4-fd706a7cd7f5)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 3: NVIDIA A800-SXM4-80GB (UUID: GPU-3029201c-2230-e6d4-290b-267c7c8adb03)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 4: NVIDIA A800-SXM4-80GB (UUID: GPU-54d02ccd-6b42-8ec6-1b2a-624974742a62)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 5: NVIDIA A800-SXM4-80GB (UUID: GPU-13254428-498a-5ec1-5dd5-d2c46dd29c36)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 6: NVIDIA A800-SXM4-80GB (UUID: GPU-5442c93d-d690-3603-f0c3-c0592cf3797c)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 7: NVIDIA A800-SXM4-80GB (UUID: GPU-bfe7f980-f583-3778-9779-85c7ebbb9432)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
# Note the journal line "Successfully configured all the available GPUs and NVSwitches to route NVLink traffic"
(self-llm) deepseek@deepseek1:~/installPkgs$ sudo systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2025-02-20 18:11:52 CST; 21h ago
   Main PID: 3697 (nv-fabricmanage)
      Tasks: 19 (limit: 629145)
     Memory: 22.4M
        CPU: 36.002s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─3697 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
Feb 20 18:11:40 deepseek1 systemd[1]: Starting NVIDIA fabric manager service...
Feb 20 18:11:41 deepseek1 nv-fabricmanager[3697]: Connected to 1 node.
Feb 20 18:11:52 deepseek1 nv-fabricmanager[3697]: Successfully configured all the available GPUs and NVSwitches to route NVLink traffic.
Feb 20 18:11:52 deepseek1 systemd[1]: Started NVIDIA fabric manager service.
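On NVSwitch-based SXM systems such as these A800s, NVLink traffic is only routed once nvidia-fabricmanager is up. If the service were not yet running, it could typically be enabled and started via systemd (assuming the nvidia-fabricmanager package matching the driver version is installed):

# Enable the service at boot and start it now
sudo systemctl enable nvidia-fabricmanager
sudo systemctl restart nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager --no-pager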
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
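This legend is the tail end of the GPU topology matrix, presumably produced by the topology query below (the matrix itself is omitted here):

# Print the GPU/NIC interconnect matrix; the legend above accompanies this output
nvidia-smi topo -m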
# After a successful start, the output looks roughly like this
2025-02-23 22:04:33,124 INFO usage_lib.py:467 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-02-23 22:04:33,125 INFO scripts.py:865 -- Local node IP: 10.119.165.139
2025-02-23 22:04:34,147 SUCC scripts.py:902 -- --------------------
2025-02-23 22:04:34,147 SUCC scripts.py:903 -- Ray runtime started.
2025-02-23 22:04:34,147 SUCC scripts.py:904 -- --------------------
2025-02-23 22:04:34,147 INFO scripts.py:906 -- Next steps
2025-02-23 22:04:34,148 INFO scripts.py:909 -- To add another node to this Ray cluster, run
2025-02-23 22:04:34,148 INFO scripts.py:912 --   ray start --address='10.119.165.139:6379'
2025-02-23 22:04:34,148 INFO scripts.py:921 -- To connect to this Ray cluster:
2025-02-23 22:04:34,148 INFO scripts.py:923 --   import ray
2025-02-23 22:04:34,148 INFO scripts.py:924 --   ray.init()
2025-02-23 22:04:34,148 INFO scripts.py:936 -- To submit a Ray job using the Ray Jobs CLI:
2025-02-23 22:04:34,148 INFO scripts.py:937 --   RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
2025-02-23 22:04:34,148 INFO scripts.py:946 -- See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
2025-02-23 22:04:34,148 INFO scripts.py:950 -- for more information on submitting Ray jobs to the Ray cluster.
2025-02-23 22:04:34,148 INFO scripts.py:955 -- To terminate the Ray runtime, run
2025-02-23 22:04:34,148 INFO scripts.py:956 --   ray stop
2025-02-23 22:04:34,148 INFO scripts.py:959 -- To view the status of the cluster, use
2025-02-23 22:04:34,148 INFO scripts.py:960 --   ray status
2025-02-23 22:04:34,148 INFO scripts.py:964 -- To monitor and debug Ray, view the dashboard at
2025-02-23 22:04:34,148 INFO scripts.py:965 --   127.0.0.1:8265
2025-02-23 22:04:34,148 INFO scripts.py:972 -- If connection to the dashboard fails, check your firewall settings and network configuration.
2025-02-23 22:04:34,148 INFO scripts.py:1076 -- --block
2025-02-23 22:04:34,148 INFO scripts.py:1077 -- This command will now block forever until terminated by a signal.
2025-02-23 22:04:34,148 INFO scripts.py:1080 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
# After a successful start, the output looks like this
[2025-02-23 22:06:55,915 W 1 1] global_state_accessor.cc:429: Retrying to get node with node ID de9745473ab8d1adb62391a32ea4254142ca109b2520c761285bc1a0
2025-02-23 22:06:55,851 INFO scripts.py:1047 -- Local node IP: 10.119.85.138
2025-02-23 22:06:56,937 SUCC scripts.py:1063 -- --------------------
2025-02-23 22:06:56,937 SUCC scripts.py:1064 -- Ray runtime started.
2025-02-23 22:06:56,937 SUCC scripts.py:1065 -- --------------------
2025-02-23 22:06:56,937 INFO scripts.py:1067 -- To terminate the Ray runtime, run
2025-02-23 22:06:56,938 INFO scripts.py:1068 --   ray stop
2025-02-23 22:06:56,938 INFO scripts.py:1076 -- --block
2025-02-23 22:06:56,938 INFO scripts.py:1077 -- This command will now block forever until terminated by a signal.
2025-02-23 22:06:56,938 INFO scripts.py:1080 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
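The worker output above corresponds to joining the head node started earlier. Per the head node's own "Next steps" hint, the join command takes roughly this form (run inside the node container on each worker; --block keeps it in the foreground):

# Join this machine to the existing Ray cluster
ray start --address='10.119.165.139:6379' --block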
The steps above start a container named node on each of the servers. In the vLLM documentation's phrasing, this launches "a Ray cluster of containers": each of the four servers contributes its node container as one member of the cluster. A sketch of how such containers are typically launched follows.
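For reference, the vLLM repository ships a helper script (examples/online_serving/run_cluster.sh) that wraps exactly this pattern. A minimal sketch of the underlying docker commands, assuming the vllm/vllm-openai image, host networking, and a shared HuggingFace cache; the flags here are illustrative, not the exact script:

# On the head node (deepseek1):
sudo docker run -d --name node --gpus all --network host --ipc host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --entrypoint /bin/bash vllm/vllm-openai:latest \
    -c "ray start --head --port 6379 --block"
# On each worker node, join the head node instead:
sudo docker run -d --name node --gpus all --network host --ipc host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --entrypoint /bin/bash vllm/vllm-openai:latest \
    -c "ray start --address 10.119.165.139:6379 --block"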
3.1.2 Checking the Ray cluster
# Enter the node container on any node in the cluster (each node runs one node container)
# For example, on server deepseek1, open another terminal window and run the following
(base) deepseek@deepseek1:~$ sudo docker exec -it node /bin/bash
# The following runs inside the node container
root@deepseek1:/vllm-workspace# ray status
======== Autoscaler status: 2025-02-23 22:08:18.818774 ========
Node status
---------------------------------------------------------------
Active:
 1 node_5a3c4cb16e576f2afb9e2c612f5052ab28c46014fcfa7d7b501eb35c
 1 node_de9745473ab8d1adb62391a32ea4254142ca109b2520c761285bc1a0
 1 node_f2f75b7120e56b71d416d9af7301c3ad78ead0a19a6bf810a35e016f
 1 node_d0f9e04c9e3425460679f6fc59d274be1547d81f6a5f1d3dbd96c0fb
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
# Test vLLM NCCL, with cuda graph
# (torch, dist, local_rank, gloo_group, data, and world_size are set up in the
#  earlier PyTorch NCCL/GLOO portion of this test script)
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
# pynccl is enabled by default for 0.6.5+,
# but for 0.6.4 and below it must be enabled manually;
# the line is kept for backward compatibility.
pynccl.disabled = False

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    data.fill_(1)
    out = pynccl.all_reduce(data, stream=s)
    value = out.mean().item()
    assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL is successful!")

# Replay the same all-reduce inside a captured CUDA graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(cuda_graph=g, stream=s):
    out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())

data.fill_(1)
g.replay()
torch.cuda.current_stream().synchronize()
value = out.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL with cuda graph is successful!")
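To exercise this script across all four machines, vLLM's troubleshooting guide runs it under torchrun on every node simultaneously. A sketch, assuming the script is saved as test.py, MASTER_ADDR holds the head node's IP, and 8 GPUs per node (which matches the nranks 8 and nranks 32 communicators in the logs below):

# Run on every node; processes rendezvous through the head node
NCCL_DEBUG=TRACE torchrun --nnodes 4 --nproc-per-node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py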
...
deepseek1:1239:1239 [3] NCCL INFO Connected all trees
deepseek1:1239:1239 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
deepseek1:1239:1239 [3] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
deepseek1:1240:1240 [4] NCCL INFO Connected all trees
deepseek1:1240:1240 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
deepseek1:1240:1240 [4] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
deepseek1:1241:1241 [5] NCCL INFO Connected all trees
deepseek1:1241:1241 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
deepseek1:1241:1241 [5] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
deepseek1:1243:1243 [7] NCCL INFO Connected all trees
deepseek1:1243:1243 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
deepseek1:1243:1243 [7] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
deepseek1:1242:1242 [6] NCCL INFO Connected all trees
deepseek1:1242:1242 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
deepseek1:1242:1242 [6] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
deepseek1:1236:1236 [0] NCCL INFO ncclCommInitRank comm 0xb606410 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 3d000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1238:1238 [2] NCCL INFO ncclCommInitRank comm 0xb88e340 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 61000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1240:1240 [4] NCCL INFO ncclCommInitRank comm 0xad28600 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId ad000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1242:1242 [6] NCCL INFO ncclCommInitRank comm 0xb29e250 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId d0000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1237:1237 [1] NCCL INFO ncclCommInitRank comm 0xb60f020 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 42000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1241:1241 [5] NCCL INFO ncclCommInitRank comm 0xc2f2570 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId b1000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1239:1239 [3] NCCL INFO ncclCommInitRank comm 0xac2c590 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 67000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1243:1243 [7] NCCL INFO ncclCommInitRank comm 0xbd1b050 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId d3000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
vLLM NCCL is successful!vLLM NCCL is successful!vLLM NCCL is successful!vLLM NCCL is successful!vLLM NCCL is successful!vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL with cuda graph is successful!vLLM NCCL with cuda graph is successful!vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
deepseek1:1236:1299 [0] NCCL INFO [Service thread] Connection closed by localRank 0
deepseek1:1237:1286 [1] NCCL INFO [Service thread] Connection closed by localRank 1
deepseek1:1241:1291 [5] NCCL INFO [Service thread] Connection closed by localRank 5
deepseek1:1240:1293 [4] NCCL INFO [Service thread] Connection closed by localRank 4
deepseek1:1239:1295 [3] NCCL INFO [Service thread] Connection closed by localRank 3
deepseek1:1242:1288 [6] NCCL INFO [Service thread] Connection closed by localRank 6
deepseek1:1243:1287 [7] NCCL INFO [Service thread] Connection closed by localRank 7
deepseek1:1238:1290 [2] NCCL INFO [Service thread] Connection closed by localRank 2
deepseek1:1236:1400 [0] NCCL INFO comm 0x8265490 rank 0 nranks 8 cudaDev 0 busId 3d000 - Abort COMPLETE
deepseek1:1243:1405 [7] NCCL INFO comm 0x897f4f0 rank 7 nranks 8 cudaDev 7 busId d3000 - Abort COMPLETE
deepseek1:1237:1403 [1] NCCL INFO comm 0x828e430 rank 1 nranks 8 cudaDev 1 busId 42000 - Abort COMPLETE
deepseek1:1240:1402 [4] NCCL INFO comm 0x79a3f10 rank 4 nranks 8 cudaDev 4 busId ad000 - Abort COMPLETE
deepseek1:1241:1401 [5] NCCL INFO comm 0x8f75840 rank 5 nranks 8 cudaDev 5 busId b1000 - Abort COMPLETE
deepseek1:1242:1404 [6] NCCL INFO comm 0x7f1d4b0 rank 6 nranks 8 cudaDev 6 busId d0000 - Abort COMPLETE
deepseek1:1238:1407 [2] NCCL INFO comm 0x850c0d0 rank 2 nranks 8 cudaDev 2 busId 61000 - Abort COMPLETE
deepseek1:1239:1406 [3] NCCL INFO comm 0x78b0310 rank 3 nranks 8 cudaDev 3 busId 67000 - Abort COMPLETE
... INFO 02-24 00:01:07 pynccl.py:69] vLLM is using nccl==2.21.5 deepseek2:460:460 [4] NCCL INFO Using non-device net plugin version 0 deepseek2:460:460 [4] NCCL INFO Using network IB deepseek2:459:459 [3] NCCL INFO ncclCommInitRank comm 0xb0c19c0 rank 11 nranks 32 cudaDev 3 nvmlDev 3 busId 67000 commId 0x8da567a0342af828 - Init START deepseek2:458:458 [2] NCCL INFO ncclCommInitRank comm 0xb7b98b0 rank 10 nranks 32 cudaDev 2 nvmlDev 2 busId 61000 commId 0x8da567a0342af828 - Init START deepseek2:457:457 [1] NCCL INFO ncclCommInitRank comm 0xc8103e0 rank 9 nranks 32 cudaDev 1 nvmlDev 1 busId 42000 commId 0x8da567a0342af828 - Init START deepseek2:460:460 [4] NCCL INFO ncclCommInitRank comm 0xc1ff420 rank 12 nranks 32 cudaDev 4 nvmlDev 4 busId ad000 commId 0x8da567a0342af828 - Init START deepseek2:456:456 [0] NCCL INFO ncclCommInitRank comm 0xbd9b580 rank 8 nranks 32 cudaDev 0 nvmlDev 0 busId 3d000 commId 0x8da567a0342af828 - Init START deepseek2:461:461 [5] NCCL INFO ncclCommInitRank comm 0xc041a00 rank 13 nranks 32 cudaDev 5 nvmlDev 5 busId b1000 commId 0x8da567a0342af828 - Init START deepseek2:463:463 [7] NCCL INFO ncclCommInitRank comm 0xc41ace0 rank 15 nranks 32 cudaDev 7 nvmlDev 7 busId d3000 commId 0x8da567a0342af828 - Init START deepseek2:462:462 [6] NCCL INFO ncclCommInitRank comm 0xb794080 rank 14 nranks 32 cudaDev 6 nvmlDev 6 busId d0000 commId 0x8da567a0342af828 - Init START deepseek2:463:463 [7] NCCL INFO Setting affinity for GPU 7 to ffff,fff00000,00ffffff,f0000000 deepseek2:463:463 [7] NCCL INFO NVLS multicast support is not available on dev 7 deepseek2:459:459 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff,ff000000,0fffffff deepseek2:459:459 [3] NCCL INFO NVLS multicast support is not available on dev 3 deepseek2:460:460 [4] NCCL INFO Setting affinity for GPU 4 to ffff,fff00000,00ffffff,f0000000 deepseek2:460:460 [4] NCCL INFO NVLS multicast support is not available on dev 4 deepseek2:461:461 [5] NCCL INFO Setting affinity for GPU 5 to ffff,fff00000,00ffffff,f0000000 deepseek2:461:461 [5] NCCL INFO NVLS multicast support is not available on dev 5 deepseek2:456:456 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ff000000,0fffffff deepseek2:456:456 [0] NCCL INFO NVLS multicast support is not available on dev 0 deepseek2:457:457 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ff000000,0fffffff deepseek2:457:457 [1] NCCL INFO NVLS multicast support is not available on dev 1 deepseek2:458:458 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ff000000,0fffffff deepseek2:458:458 [2] NCCL INFO NVLS multicast support is not available on dev 2 deepseek2:462:462 [6] NCCL INFO Setting affinity for GPU 6 to ffff,fff00000,00ffffff,f0000000 deepseek2:462:462 [6] NCCL INFO NVLS multicast support is not available on dev 6 deepseek2:463:463 [7] NCCL INFO comm 0xc41ace0 rank 15 nRanks 32 nNodes 4 localRanks 8 localRank 7 MNNVL 0 deepseek2:462:462 [6] NCCL INFO comm 0xb794080 rank 14 nRanks 32 nNodes 4 localRanks 8 localRank 6 MNNVL 0 deepseek2:457:457 [1] NCCL INFO comm 0xc8103e0 rank 9 nRanks 32 nNodes 4 localRanks 8 localRank 1 MNNVL 0 deepseek2:461:461 [5] NCCL INFO comm 0xc041a00 rank 13 nRanks 32 nNodes 4 localRanks 8 localRank 5 MNNVL 0 deepseek2:460:460 [4] NCCL INFO comm 0xc1ff420 rank 12 nRanks 32 nNodes 4 localRanks 8 localRank 4 MNNVL 0 deepseek2:458:458 [2] NCCL INFO comm 0xb7b98b0 rank 10 nRanks 32 nNodes 4 localRanks 8 localRank 2 MNNVL 0 deepseek2:462:462 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->13 
deepseek2:463:463 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] 8/-1/-1->15->14 [2] -1/-1/-1->15->14 [3] 8/-1/-1->15->14 deepseek2:459:459 [3] NCCL INFO comm 0xb0c19c0 rank 11 nRanks 32 nNodes 4 localRanks 8 localRank 3 MNNVL 0 deepseek2:456:456 [0] NCCL INFO comm 0xbd9b580 rank 8 nRanks 32 nNodes 4 localRanks 8 localRank 0 MNNVL 0 deepseek2:462:462 [6] NCCL INFO P2P Chunksize set to 131072 deepseek2:461:461 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] 14/-1/-1->13->12 deepseek2:463:463 [7] NCCL INFO P2P Chunksize set to 131072 deepseek2:457:457 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] -1/-1/-1->9->8 [2] 10/16/-1->9->8 [3] -1/-1/-1->9->8 deepseek2:460:460 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->11 [3] 13/-1/-1->12->11 deepseek2:461:461 [5] NCCL INFO P2P Chunksize set to 131072 deepseek2:458:458 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->19 [2] 11/-1/-1->10->9 [3] 11/6/-1->10->26 deepseek2:457:457 [1] NCCL INFO P2P Chunksize set to 131072 deepseek2:459:459 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] 12/-1/-1->11->10 [3] 12/18/-1->11->10 deepseek2:458:458 [2] NCCL INFO P2P Chunksize set to 131072 deepseek2:460:460 [4] NCCL INFO P2P Chunksize set to 131072 deepseek2:459:459 [3] NCCL INFO P2P Chunksize set to 131072 deepseek2:456:456 [0] NCCL INFO Trees [0] 9/-1/-1->8->17 [1] 9/-1/-1->8->15 [2] 9/2/-1->8->24 [3] 9/-1/-1->8->15 deepseek2:456:456 [0] NCCL INFO P2P Chunksize set to 131072 deepseek2:460:460 [4] NCCL INFO Channel 00/0 : 12[4] -> 11[3] via P2P/CUMEM/read deepseek2:462:462 [6] NCCL INFO Channel 00/0 : 14[6] -> 13[5] via P2P/CUMEM/read deepseek2:461:461 [5] NCCL INFO Channel 00/0 : 13[5] -> 12[4] via P2P/CUMEM/read deepseek2:460:460 [4] NCCL INFO Channel 01/0 : 12[4] -> 11[3] via P2P/CUMEM/read deepseek2:462:462 [6] NCCL INFO Channel 01/0 : 14[6] -> 13[5] via P2P/CUMEM/read deepseek2:461:461 [5] NCCL INFO Channel 01/0 : 13[5] -> 12[4] via P2P/CUMEM/read deepseek2:460:460 [4] NCCL INFO Channel 02/0 : 12[4] -> 11[3] via P2P/CUMEM/read deepseek2:462:462 [6] NCCL INFO Channel 02/0 : 14[6] -> 13[5] via P2P/CUMEM/read deepseek2:461:461 [5] NCCL INFO Channel 02/0 : 13[5] -> 12[4] via P2P/CUMEM/read deepseek2:460:460 [4] NCCL INFO Channel 03/0 : 12[4] -> 11[3] via P2P/CUMEM/read deepseek2:462:462 [6] NCCL INFO Channel 03/0 : 14[6] -> 13[5] via P2P/CUMEM/read deepseek2:461:461 [5] NCCL INFO Channel 03/0 : 13[5] -> 12[4] via P2P/CUMEM/read deepseek2:457:457 [1] NCCL INFO Channel 00/0 : 9[1] -> 16[0] [send] via NET/IB/0/GDRDMA deepseek2:459:459 [3] NCCL INFO Channel 01/0 : 11[3] -> 18[2] [send] via NET/IB/1/GDRDMA deepseek2:457:457 [1] NCCL INFO Channel 02/0 : 9[1] -> 16[0] [send] via NET/IB/0/GDRDMA deepseek2:459:459 [3] NCCL INFO Channel 03/0 : 11[3] -> 18[2] [send] via NET/IB/1/GDRDMA deepseek2:456:456 [0] NCCL INFO Channel 00/0 : 3[3] -> 8[0] [receive] via NET/IB/0/GDRDMA deepseek2:458:458 [2] NCCL INFO Channel 01/0 : 7[7] -> 10[2] [receive] via NET/IB/1/GDRDMA deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 3[3] -> 8[0] [receive] via NET/IB/0/GDRDMA deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 7[7] -> 10[2] [receive] via NET/IB/1/GDRDMA deepseek2:456:456 [0] NCCL INFO Channel 00/0 : 8[0] -> 15[7] via P2P/CUMEM/read deepseek2:456:456 [0] NCCL INFO Channel 01/0 : 8[0] -> 15[7] via P2P/CUMEM/read deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 8[0] -> 15[7] via P2P/CUMEM/read deepseek2:458:458 [2] NCCL INFO Channel 00/0 : 10[2] -> 9[1] via P2P/CUMEM/read deepseek2:456:456 [0] 
NCCL INFO Channel 03/0 : 8[0] -> 15[7] via P2P/CUMEM/read deepseek2:459:459 [3] NCCL INFO Channel 00/0 : 11[3] -> 10[2] via P2P/CUMEM/read deepseek2:458:458 [2] NCCL INFO Channel 01/0 : 10[2] -> 9[1] via P2P/CUMEM/read deepseek2:457:457 [1] NCCL INFO Channel 01/0 : 9[1] -> 8[0] via P2P/CUMEM/read deepseek2:459:459 [3] NCCL INFO Channel 02/0 : 11[3] -> 10[2] via P2P/CUMEM/read deepseek2:463:463 [7] NCCL INFO Channel 00/0 : 15[7] -> 14[6] via P2P/CUMEM/read deepseek2:457:457 [1] NCCL INFO Channel 03/0 : 9[1] -> 8[0] via P2P/CUMEM/read deepseek2:458:458 [2] NCCL INFO Channel 02/0 : 10[2] -> 9[1] via P2P/CUMEM/read deepseek2:463:463 [7] NCCL INFO Channel 01/0 : 15[7] -> 14[6] via P2P/CUMEM/read deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 10[2] -> 9[1] via P2P/CUMEM/read deepseek2:463:463 [7] NCCL INFO Channel 02/0 : 15[7] -> 14[6] via P2P/CUMEM/read deepseek2:463:463 [7] NCCL INFO Channel 03/0 : 15[7] -> 14[6] via P2P/CUMEM/read deepseek2:458:458 [2] NCCL INFO Connected all rings deepseek2:459:459 [3] NCCL INFO Connected all rings deepseek2:460:460 [4] NCCL INFO Connected all rings deepseek2:458:458 [2] NCCL INFO Channel 00/0 : 10[2] -> 11[3] via P2P/CUMEM/read deepseek2:458:458 [2] NCCL INFO Channel 01/0 : 10[2] -> 11[3] via P2P/CUMEM/read deepseek2:458:458 [2] NCCL INFO Channel 02/0 : 10[2] -> 11[3] via P2P/CUMEM/read deepseek2:459:459 [3] NCCL INFO Channel 00/0 : 11[3] -> 12[4] via P2P/CUMEM/read deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 10[2] -> 11[3] via P2P/CUMEM/read deepseek2:460:460 [4] NCCL INFO Channel 00/0 : 12[4] -> 13[5] via P2P/CUMEM/read deepseek2:459:459 [3] NCCL INFO Channel 01/0 : 11[3] -> 12[4] via P2P/CUMEM/read deepseek2:456:456 [0] NCCL INFO Connected all rings deepseek2:457:457 [1] NCCL INFO Connected all rings deepseek2:456:456 [0] NCCL INFO Channel 00/0 : 8[0] -> 9[1] via P2P/CUMEM/read deepseek2:460:460 [4] NCCL INFO Channel 01/0 : 12[4] -> 13[5] via P2P/CUMEM/read deepseek2:459:459 [3] NCCL INFO Channel 02/0 : 11[3] -> 12[4] via P2P/CUMEM/read deepseek2:456:456 [0] NCCL INFO Channel 01/0 : 8[0] -> 9[1] via P2P/CUMEM/read deepseek2:460:460 [4] NCCL INFO Channel 02/0 : 12[4] -> 13[5] via P2P/CUMEM/read deepseek2:459:459 [3] NCCL INFO Channel 03/0 : 11[3] -> 12[4] via P2P/CUMEM/read deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 8[0] -> 9[1] via P2P/CUMEM/read deepseek2:460:460 [4] NCCL INFO Channel 03/0 : 12[4] -> 13[5] via P2P/CUMEM/read deepseek2:456:456 [0] NCCL INFO Channel 03/0 : 8[0] -> 9[1] via P2P/CUMEM/read deepseek2:457:457 [1] NCCL INFO Channel 00/0 : 9[1] -> 10[2] via P2P/CUMEM/read deepseek2:459:459 [3] NCCL INFO Channel 03/0 : 18[2] -> 11[3] [receive] via NET/IB/1/GDRDMA deepseek2:457:457 [1] NCCL INFO Channel 02/0 : 9[1] -> 10[2] via P2P/CUMEM/read deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 2[2] -> 8[0] [receive] via NET/IB/0/GDRDMA deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 6[6] -> 10[2] [receive] via NET/IB/1/GDRDMA deepseek2:457:457 [1] NCCL INFO Channel 02/0 : 16[0] -> 9[1] [receive] via NET/IB/0/GDRDMA deepseek2:456:456 [0] NCCL INFO Channel 00/0 : 8[0] -> 17[1] [send] via NET/IB/0/GDRDMA deepseek2:458:458 [2] NCCL INFO Channel 01/0 : 10[2] -> 19[3] [send] via NET/IB/1/GDRDMA deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 24[0] -> 8[0] [receive] via NET/IB/0/GDRDMA deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 8[0] -> 24[0] [send] via NET/IB/0/GDRDMA deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 26[2] -> 10[2] [receive] via NET/IB/1/GDRDMA deepseek2:461:461 [5] NCCL INFO Connected all rings deepseek2:463:463 [7] NCCL INFO 
Connected all rings deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 10[2] -> 26[2] [send] via NET/IB/1/GDRDMA deepseek2:462:462 [6] NCCL INFO Connected all rings deepseek2:456:456 [0] NCCL INFO Channel 00/0 : 17[1] -> 8[0] [receive] via NET/IB/0/GDRDMA deepseek2:458:458 [2] NCCL INFO Channel 01/0 : 19[3] -> 10[2] [receive] via NET/IB/1/GDRDMA deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 10[2] -> 6[6] [send] via NET/IB/1/GDRDMA deepseek2:459:459 [3] NCCL INFO Channel 01/0 : 11[3] -> 10[2] via P2P/CUMEM/read deepseek2:459:459 [3] NCCL INFO Channel 03/0 : 11[3] -> 10[2] via P2P/CUMEM/read deepseek2:457:457 [1] NCCL INFO Channel 00/0 : 9[1] -> 8[0] via P2P/CUMEM/read deepseek2:461:461 [5] NCCL INFO Channel 00/0 : 13[5] -> 14[6] via P2P/CUMEM/read deepseek2:457:457 [1] NCCL INFO Channel 02/0 : 9[1] -> 8[0] via P2P/CUMEM/read deepseek2:461:461 [5] NCCL INFO Channel 01/0 : 13[5] -> 14[6] via P2P/CUMEM/read deepseek2:462:462 [6] NCCL INFO Channel 00/0 : 14[6] -> 15[7] via P2P/CUMEM/read deepseek2:461:461 [5] NCCL INFO Channel 02/0 : 13[5] -> 14[6] via P2P/CUMEM/read deepseek2:462:462 [6] NCCL INFO Channel 01/0 : 14[6] -> 15[7] via P2P/CUMEM/read deepseek2:461:461 [5] NCCL INFO Channel 03/0 : 13[5] -> 14[6] via P2P/CUMEM/read deepseek2:462:462 [6] NCCL INFO Channel 02/0 : 14[6] -> 15[7] via P2P/CUMEM/read deepseek2:462:462 [6] NCCL INFO Channel 03/0 : 14[6] -> 15[7] via P2P/CUMEM/read deepseek2:463:463 [7] NCCL INFO Channel 01/0 : 15[7] -> 8[0] via P2P/CUMEM/read deepseek2:463:463 [7] NCCL INFO Channel 03/0 : 15[7] -> 8[0] via P2P/CUMEM/read deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 8[0] -> 2[2] [send] via NET/IB/0/GDRDMA deepseek2:462:462 [6] NCCL INFO Connected all trees deepseek2:462:462 [6] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512 deepseek2:462:462 [6] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer deepseek2:461:461 [5] NCCL INFO Connected all trees deepseek2:461:461 [5] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512 deepseek2:461:461 [5] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer deepseek2:460:460 [4] NCCL INFO Connected all trees deepseek2:460:460 [4] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512 deepseek2:460:460 [4] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer deepseek2:458:458 [2] NCCL INFO Connected all trees deepseek2:459:459 [3] NCCL INFO Connected all trees deepseek2:458:458 [2] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512 deepseek2:458:458 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer deepseek2:459:459 [3] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512 deepseek2:459:459 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer deepseek2:463:463 [7] NCCL INFO Connected all trees deepseek2:463:463 [7] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512 deepseek2:463:463 [7] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer deepseek2:457:457 [1] NCCL INFO Connected all trees deepseek2:457:457 [1] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512 deepseek2:457:457 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer deepseek2:456:456 [0] NCCL INFO Connected all trees deepseek2:456:456 [0] NCCL INFO threadThresholds 8/8/64 
| 256/8/64 | 512 | 512 deepseek2:456:456 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer deepseek2:457:457 [1] NCCL INFO ncclCommInitRank comm 0xc8103e0 rank 9 nranks 32 cudaDev 1 nvmlDev 1 busId 42000 commId 0x8da567a0342af828 - Init COMPLETE deepseek2:459:459 [3] NCCL INFO ncclCommInitRank comm 0xb0c19c0 rank 11 nranks 32 cudaDev 3 nvmlDev 3 busId 67000 commId 0x8da567a0342af828 - Init COMPLETE deepseek2:462:462 [6] NCCL INFO ncclCommInitRank comm 0xb794080 rank 14 nranks 32 cudaDev 6 nvmlDev 6 busId d0000 commId 0x8da567a0342af828 - Init COMPLETE deepseek2:461:461 [5] NCCL INFO ncclCommInitRank comm 0xc041a00 rank 13 nranks 32 cudaDev 5 nvmlDev 5 busId b1000 commId 0x8da567a0342af828 - Init COMPLETE deepseek2:458:458 [2] NCCL INFO ncclCommInitRank comm 0xb7b98b0 rank 10 nranks 32 cudaDev 2 nvmlDev 2 busId 61000 commId 0x8da567a0342af828 - Init COMPLETE deepseek2:463:463 [7] NCCL INFO ncclCommInitRank comm 0xc41ace0 rank 15 nranks 32 cudaDev 7 nvmlDev 7 busId d3000 commId 0x8da567a0342af828 - Init COMPLETE deepseek2:460:460 [4] NCCL INFO ncclCommInitRank comm 0xc1ff420 rank 12 nranks 32 cudaDev 4 nvmlDev 4 busId ad000 commId 0x8da567a0342af828 - Init COMPLETE deepseek2:456:456 [0] NCCL INFO ncclCommInitRank comm 0xbd9b580 rank 8 nranks 32 cudaDev 0 nvmlDev 0 busId 3d000 commId 0x8da567a0342af828 - Init COMPLETE vLLM NCCL is successful!vLLM NCCL is successful!vLLM NCCL is successful!vLLM NCCL is successful! vLLM NCCL is successful! vLLM NCCL is successful!vLLM NCCL is successful!
vLLM NCCL is successful! vLLM NCCL with cuda graph is successful!vLLM NCCL with cuda graph is successful! vLLM NCCL with cuda graph is successful! vLLM NCCL with cuda graph is successful! vLLM NCCL with cuda graph is successful! vLLM NCCL with cuda graph is successful!vLLM NCCL with cuda graph is successful! vLLM NCCL with cuda graph is successful!
deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 0 deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 2 deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 5 deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 5 deepseek2:457:552 [1] NCCL INFO [Service thread] Connection closed by localRank 1 deepseek2:461:549 [5] NCCL INFO [Service thread] Connection closed by localRank 5 deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 6 deepseek2:459:551 [3] NCCL INFO [Service thread] Connection closed by localRank 3 deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 6 deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 5 deepseek2:463:547 [7] NCCL INFO [Service thread] Connection closed by localRank 7 deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 1 deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 6 deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 6 deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 4 deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 2 deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 0 deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 0 deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 5 deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 4 deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 7 deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 1 deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 4 deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 3 deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 4 deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 3 deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 1 deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 7 deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 1 deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 7 deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 2 deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 3 deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 0 deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 3 deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 2 deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 7 deepseek2:456:660 [0] NCCL INFO comm 0x8a11620 rank 8 nranks 32 cudaDev 0 busId 3d000 - Abort COMPLETE deepseek2:458:664 [2] NCCL INFO comm 0x8430430 rank 10 nranks 32 cudaDev 2 busId 61000 - Abort COMPLETE deepseek2:462:658 [6] NCCL INFO comm 0x84085f0 rank 14 nranks 32 cudaDev 6 busId d0000 - Abort COMPLETE deepseek2:460:661 [4] NCCL INFO comm 0x8e5a270 rank 12 nranks 32 cudaDev 4 busId ad000 - Abort COMPLETE deepseek2:463:662 [7] NCCL INFO comm 0x9090570 rank 15 nranks 32 cudaDev 7 busId d3000 - Abort COMPLETE deepseek2:461:657 [5] NCCL INFO comm 0x8cb26b0 rank 13 nranks 32 cudaDev 5 busId 
b1000 - Abort COMPLETE deepseek2:457:659 [1] NCCL INFO comm 0x948a0f0 rank 9 nranks 32 cudaDev 1 busId 42000 - Abort COMPLETE deepseek2:459:663 [3] NCCL INFO comm 0x7d3d650 rank 11 nranks 32 cudaDev 3 busId 67000 - Abort COMPLETE deepseek2:460:636 [32679] NCCL INFO [Service thread] Connection closed by localRank 0 deepseek2:462:632 [32661] NCCL INFO [Service thread] Connection closed by localRank 0 deepseek2:460:636 [32679] NCCL INFO [Service thread] Connection closed by localRank 2 deepseek2:460:636 [32679] NCCL INFO [Service thread] Connection closed by localRank 6
INFO 02-24 04:45:42 model_runner.py:1115] Loading model weights took 35.4806 GB (RayWorkerWrapper pid=1124) INFO 02-24 04:45:44 model_runner.py:1115] Loading model weights took 35.4806 GB (RayWorkerWrapper pid=7564, ip=10.119.165.139) INFO 02-24 04:45:08 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json [repeated 30x across cluster] (RayWorkerWrapper pid=1138) INFO 02-24 04:45:08 cuda.py:161] Using Triton MLA backend. [repeated 31x across cluster] (RayWorkerWrapper pid=976, ip=10.119.85.140) INFO 02-24 04:45:08 utils.py:950] Found nccl from library libnccl.so.2 [repeated 31x across cluster] (RayWorkerWrapper pid=976, ip=10.119.85.140) INFO 02-24 04:45:08 pynccl.py:69] vLLM is using nccl==2.21.5 [repeated 31x across cluster] (RayWorkerWrapper pid=1149) NCCL version 2.21.5+cuda12.4 [repeated 7x across cluster] (RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/IPC/read [repeated 43x across cluster] (RayWorkerWrapper pid=1116) 09/0 : 3[3] -> 2[2] via P2P/IPC/read [repeated 2x across cluster] (RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO Connected all trees [repeated 6x across cluster] (RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 [repeated 6x across cluster] (RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer [repeated 6x across cluster] (RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so [repeated 6x across cluster] (RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO TUNER/Plugin: Using internal tuner plugin. [repeated 6x across cluster] (RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO ncclCommInitRank comm 0xd15faf0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId d3000 commId 0x7f39c29d5bac68b4 - Init COMPLETE [repeated 6x across cluster] (RayWorkerWrapper pid=1115) Channel 09/0 : 6[6] -> 5[5] via P2P/IPC/read [repeated 2x across cluster] (RayWorkerWrapper pid=973, ip=10.119.85.140) INFO 02-24 04:45:08 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_3a784ba1'), local_subscribe_port=56925, remote_subscribe_port=None) [repeated 2x across cluster] (RayWorkerWrapper pid=976, ip=10.119.85.140) INFO 02-24 04:45:08 model_runner.py:1110] Starting to load model /root/.cache/huggingface/hub/models/unsloth/DeepSeek-R1-BF16/... [repeated 30x across cluster] (RayWorkerWrapper pid=978, ip=10.119.85.140) INFO 02-24 04:46:17 model_runner.py:1115] Loading model weights took 42.8992 GB [repeated 7x across cluster] ... ... ... 
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO P2P Chunksize set to 131072 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 00/0 : 2[1] -> 3[1] [receive] via NET/IB/1 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 01/0 : 2[1] -> 3[1] [receive] via NET/IB/1 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[1] [send] via NET/IB/1 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[1] [send] via NET/IB/1 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31505 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Connected all rings (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 01/0 : 1[1] -> 3[1] [receive] via NET/IB/1 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 01/0 : 3[1] -> 1[1] [send] via NET/IB/1 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 00/0 : 3[1] -> 2[1] [send] via NET/IB/1 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Connected all trees (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO ncclCommInitRank comm 0x1179c2d0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 42000 commId 0x8b115fa040bc44fc - Init COMPLETE (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Using non-device net plugin version 0 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Using network IB (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO ncclCommInitRank comm 0x30ca9860 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 42000 commId 0xd37e8e722df3123f - Init START (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ff000000,0fffffff (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO NVLS multicast support is not available on dev 1 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO comm 0x30ca9860 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO P2P Chunksize set to 524288 (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read (RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Channel 02/0 : 1[ 
(RayWorkerWrapper pid=7587, ip=10.119.165.139) ] NCCL INFO Connected all rings (RayWorkerWrapper pid=7591, ip=10.119.165.139) Channel 09/0 : 7[7] -> 6[6] via P2P/IPC/read (RayWorkerWrapper pid=7580, ip=10.119.165.139) 4[4] -> 3[3] via P2P/IPC/read (RayWorkerWrapper pid=7580, ip=10.119.165.139) deepseek1:7580:31567 [4] NCCL I (RayWorkerWrapper pid=7591, ip=10.119.165.139) (RayWorkerWrapper pid=7591, ip=10.119.165.139) deepseek1:759 (RayWorkerWrapper pid=7580, ip=10.119.165.139) NFO Channel 02/0 : 4[4] -> 5[5] via P2P/IPC/read (RayWorkerWrapper pid=7580, ip=10.119.165.139) deepseek1:7580:31594 [4] NCCL INFO Channel 15 INFO 02-24 17:15:16 executor_base.py:110] # CUDA blocks: 68440, # CPU blocks: 13107 ... ... ... Capturing CUDA graph shapes: 43%|████████████████████████████████████████████████████████████▍ | 15/35 [00:06<00:09, 2.18it/s] Capturing CUDA graph shapes: 91%|█████████▏| 32/35 [00:13<00:01, 2.46it/s] Capturing CUDA graph shapes: 46%|████████████████████████████████████████████████████████████████▍ | 16/35 [00:07<00:08, 2.18it/s] Capturing CUDA graph shapes: 94%|█████████▍| 33/35 [00:13<00:00, 2.46it/s] Capturing CUDA graph shapes: 49%|████████████████████████████████████████████████████████████████████▍ | 17/35 [00:07<00:08, 2.19it/s] Capturing CUDA graph shapes: 97%|█████████▋| 34/35 [00:13<00:00, 2.47it/s] Capturing CUDA graph shapes: 51%|████████████████████████████████████████████████████████████████████████▌ | 18/35 [00:08<00:07, 2.13it/s] Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:14<00:00, 2.47it/s] Capturing CUDA graph shapes: 63%|████████████████████████████████████████████████████████████████████████████████████████▋ | 22/35 [00:10<00:06, 2.07it/s] Capturing CUDA graph shapes: 89%|████████▊ | 31/35 [00:13<00:01, 2.36it/s] [repeated 24x across cluster] Capturing CUDA graph shapes: 66%|████████████████████████████████████████████████████████████████████████████████████████████▋ | 23/35 [00:10<00:05, 2.08it/s] (RayWorkerWrapper pid=974, ip=10.119.85.140) INFO 02-24 04:48:08 model_runner.py:1562] Graph capturing finished in 64 secs, took 1.19 GiB Capturing CUDA graph shapes: 71%|████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 25/35 [00:11<00:04, 2.04it/s] (RayWorkerWrapper pid=7561, ip=10.119.165.139) INFO 02-24 04:48:09 custom_all_reduce.py:226] Registering 4480 cuda graph addresses [repeated 16x across cluster] Capturing CUDA graph shapes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:15<00:00, 2.19it/s] INFO 02-24 04:48:13 custom_all_reduce.py:226] Registering 4340 cuda graph addresses (RayWorkerWrapper pid=7564, ip=10.119.165.139) INFO 02-24 04:48:13 model_runner.py:1562] Graph capturing finished in 68 secs, took 1.19 GiB [repeated 23x across cluster] (RayWorkerWrapper pid=1138) INFO 02-24 04:48:18 custom_all_reduce.py:226] Registering 4340 cuda graph addresses [repeated 14x across cluster] INFO 02-24 04:48:18 model_runner.py:1562] Graph capturing finished in 73 secs, took 1.16 GiB INFO 02-24 04:48:18 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 84.43 seconds INFO 02-24 04:48:18 api_server.py:756] Using supplied chat template: INFO 02-24 04:48:18 api_server.py:756] None INFO 02-24 04:48:18 launcher.py:21] Available routes are: INFO 02-24 04:48:18 launcher.py:29] Route: /openapi.json, Methods: HEAD, GET INFO 02-24 04:48:18 launcher.py:29] Route: /docs, Methods: 
HEAD, GET
INFO 02-24 04:48:18 launcher.py:29] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 02-24 04:48:18 launcher.py:29] Route: /redoc, Methods: HEAD, GET
INFO 02-24 04:48:18 launcher.py:29] Route: /health, Methods: GET
INFO 02-24 04:48:18 launcher.py:29] Route: /ping, Methods: GET, POST
INFO 02-24 04:48:18 launcher.py:29] Route: /tokenize, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /detokenize, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /v1/models, Methods: GET
INFO 02-24 04:48:18 launcher.py:29] Route: /version, Methods: GET
INFO 02-24 04:48:18 launcher.py:29] Route: /v1/chat/completions, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /v1/completions, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /v1/embeddings, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /pooling, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /score, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /v1/score, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /rerank, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /v1/rerank, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /v2/rerank, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /invocations, Methods: POST
INFO:     Started server process [12020]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
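With the route list printed and Uvicorn listening on port 8000, the OpenAI-compatible endpoint can be smoke-tested from any machine that can reach the head node. A minimal sketch; the host is deepseek1's IP from earlier, and the model name must match what was passed when the server was launched (unsloth/DeepSeek-R1-BF16 is an assumption based on the weight path in the logs):

# Exercise the /v1/chat/completions route listed above
curl http://10.119.165.139:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "unsloth/DeepSeek-R1-BF16", "messages": [{"role": "user", "content": "Hello!"}]}'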
3.1.6 Checking the Ray cluster status again
root@deepseek1:/vllm-workspace# ray status
======== Autoscaler status: 2025-02-24 06:09:38.499412 ========
Node status
---------------------------------------------------------------
Active:
 1 node_78c610008905942ec65274e7c7ce990d1a554e9627512bf633c15c28
 1 node_0aee3e0efd8a7f95dfa4205cd692f7f08e7d665b779a0facf0d3201d
 1 node_8407508c22842dea6182c7accc2565b42daf4c7b051f1d37a4629258
 1 node_3ec95e5ee056a4484f0f81cc518716edd4d2bfd98ffa771b024edc27
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
Resources
---------------------------------------------------------------
Usage:
 0.0/448.0 CPU
 32.0/32.0 GPU (32.0 used of 32.0 reserved in placement groups)
 0B/3.89TiB memory
 0B/38.91GiB object_store_memory
# Wait until the open-webui container's status changes to healthy
(self-llm) deepseek@deepseek2:~$ sudo watch docker ps -a
# Follow the open-webui container logs in real time
(self-llm) deepseek@deepseek2:~$ sudo docker logs open-webui -f
Loading WEBUI_SECRET_KEY from file, not provided as an environment variable.
Generating WEBUI_SECRET_KEY
Loading WEBUI_SECRET_KEY from .webui_secret_key
/app/backend/open_webui
/app/backend
/app
Running migrations
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [open_webui.env] 'ENABLE_SIGNUP' loaded from the latest database entry
INFO  [open_webui.env] 'DEFAULT_LOCALE' loaded from the latest database entry
INFO  [open_webui.env] 'DEFAULT_PROMPT_SUGGESTIONS' loaded from the latest database entry
WARNI [open_webui.env]
WARNING: CORS_ALLOW_ORIGIN IS SET TO '*' - NOT RECOMMENDED FOR PRODUCTION DEPLOYMENTS.
INFO  [open_webui.env] Embedding model set: sentence-transformers/all-MiniLM-L6-v2
WARNI [langchain_community.utils.user_agent] USER_AGENT environment variable not set, consider setting it to identify your requests.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
# Once these log lines appear, the service is up and usable
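If Open WebUI was not already pointed at the vLLM endpoint when its container was created, it can be launched with the OpenAI-compatible base URL preconfigured. A sketch, assuming vLLM serves on deepseek1 (10.119.165.139:8000); OPENAI_API_BASE_URL and OPENAI_API_KEY are documented Open WebUI environment variables, and the key value is arbitrary since this vLLM server does not check it:

# Run Open WebUI preconfigured to talk to the vLLM OpenAI-compatible endpoint
sudo docker run -d --name open-webui -p 3000:8080 \
    -e OPENAI_API_BASE_URL=http://10.119.165.139:8000/v1 \
    -e OPENAI_API_KEY=dummy \
    -v open-webui:/app/backend/data \
    ghcr.io/open-webui/open-webui:main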