# GLM-5 inference hang on a MetaX C500 cluster with the Ray backend at PP≥3: full write-up

**Date**: 2026-04-17
**Cluster**: 算丰 MetaX (沐曦) C500, 4 nodes × 8 GPUs = 32 GPUs
**vLLM**: v0.14.0 (vllm_metax plugin) + `--distributed-executor-backend ray`
**Model**: GLM-5-W8A8 (FLASHMLA_SPARSE backend)
**Symptom**: PP=2 works; PP=3 / PP=4 hang
**Root cause**: under Ray Compiled Graph with PP≥3, the MCCL `mcclSend` kernel is **never submitted to the GPU for execution**

---

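## Contents

1. [Environment](#1-environment)
2. [Symptoms and test matrix](#2-symptoms-and-test-matrix)
3. [Root-cause analysis](#3-root-cause-analysis)
4. [Key evidence A: gdb thread stacks on rank 1](#4-key-evidence-a-gdb-thread-stacks-on-rank-1)
5. [Key evidence B: vLLM logs at the hang](#5-key-evidence-b-vllm-logs-at-the-hang)
6. [Key evidence C: minimal repro shows the underlying MCCL works](#6-key-evidence-c-minimal-repro-shows-the-underlying-mccl-works)
7. [Key evidence D: MCCL perf tests pass on four nodes](#7-key-evidence-d-mccl-perf-tests-pass-on-four-nodes)
8. [Attempted fixes and results](#8-attempted-fixes-and-results)
9. [Suggested debugging directions for MetaX](#9-suggested-debugging-directions-for-metax)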

---

## 1. Environment

- **Hardware**: 4 nodes × 8 MetaX C500 (64 GB per GPU), 32 GPUs in total
- **Driver**: Kernel Mode Driver 3.4.4, MACA 3.5.3.18, BIOS 1.30.0.0
- **Image**: `cr.metax-tech.com/.../vllm-metax:0.14.0-maca.ai3.5.3.102-torch2.8-py310-ubuntu22.04-amd64_glm_w4a8_full`
- **vLLM**: v0.14.0 (vllm_metax plugin)
- **Network**: 10 Gbps intranet between nodes (IPs `10.66.3.{10-13}`)
- **MCCL library**: `/opt/maca/lib/libmccl.so`, reports NCCL version 2.16.5 (MCCL wrapper)
- **Ray cluster**: 1 head + 3 workers
  ```bash
  # head
  ray start --head --port=6379
  # worker
  ray start --address=10.66.3.10:6379
  ```
- **vLLM launch command**:
  ```bash
  RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 \
  vllm serve /model \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 3 \
    --distributed-executor-backend ray \
    --enforce-eager \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.88 \
    --trust-remote-code
  ```
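
To make rank numbers easier to read for this launch configuration, the sketch below maps a global rank to its pipeline stage and TP index. It assumes the usual vLLM layout in which TP ranks are contiguous within a stage (TP is the fastest-varying dimension); `RankLayout` and `rank_layout` are hypothetical names, not part of vLLM.

```python
from typing import NamedTuple


class RankLayout(NamedTuple):
    pp_stage: int  # pipeline stage index, 0-based
    tp_rank: int   # tensor-parallel index inside the stage


def rank_layout(global_rank: int, tp_size: int = 8, pp_size: int = 3) -> RankLayout:
    """Map a global rank to (pp_stage, tp_rank).

    Assumption: TP ranks are contiguous, so ranks [0, tp_size) form stage 0,
    [tp_size, 2 * tp_size) form stage 1, and so on.
    """
    world_size = tp_size * pp_size
    if not 0 <= global_rank < world_size:
        raise ValueError(f"rank {global_rank} outside world of size {world_size}")
    return RankLayout(pp_stage=global_rank // tp_size, tp_rank=global_rank % tp_size)


if __name__ == "__main__":
    for r in (0, 8, 16, 23):
        print(r, rank_layout(r))
```

With TP=8 and PP=3 the world size is 24. Under this assumed layout, global rank 16 is TP index 0 of stage 2, which is consistent with one pipeline stage per 8-GPU node.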

---

## 2. Symptoms and test matrix

vLLM log (one line every 10 seconds):
```
Engine 000: Avg prompt throughput: 0.6 tokens/s, gen: 0.1 tok/s, Running: 1 reqs, KV: 0.0%
Engine 000: Avg prompt throughput: 0.0 tokens/s, gen: 0.0 tok/s, Running: 1 reqs, KV: 0.0%
(stays at 0/0 from then on; the request eventually fails with RayChannelTimeoutError)
```

curl requests time out at 60/90 seconds with 0 bytes returned.

### Test matrix

| Configuration | Backend | Startup | Inference / result |
|------|------|------|------|
| PP=2 TP=8 (2 nodes) | Ray | ✅ | **✅ OK** |
| PP=3 TP=8 (3 nodes) | Ray | ✅ | ❌ hang |
| PP=4 TP=8 (4 nodes) | Ray | ✅ | ❌ hang |
| MCCL `all_reduce_perf`, 4 nodes, 32 GPUs | MPI | ✅ | 0 errors, 71.24 GB/s |
| MCCL `sendrecv_perf`, 4 nodes, 32 GPUs | MPI | ✅ | 0 errors, 21.78 GB/s |

**An even node count (PP=4) does not avoid the bug.**

---

## 3. Root-cause analysis

### 3.1 Call chain

Path taken by the PP `intermediate_tensors` in vLLM with the Ray backend:

```
Ray Compiled Graph (CGRAPH)
  └─ torch_tensor_accelerator_channel.send
       └─ RayPPCommunicator.send               (vllm/distributed/device_communicators/ray_communicator.py)
            └─ CudaCommunicator.send           (vllm/distributed/device_communicators/cuda_communicator.py:244)
                 └─ PyNcclCommunicator.send    (vllm/distributed/device_communicators/pynccl.py:306)
                      └─ ncclSend (raw, async) (vllm_metax/patch/distributed/pynccl_wrapper.py:425 → mcclSend)
```

Note: the docstring of `CudaCommunicator.send` says "in a blocking way", but the method actually **prefers PyNccl (asynchronous `ncclSend`)**:

```python
# cuda_communicator.py:236-245
def send(self, tensor, dst=None):
    """Sends a tensor to the destination rank in a blocking way"""  # misleading docstring
    ...
    pynccl_comm = self.pynccl_comm
    if pynccl_comm is not None and not pynccl_comm.disabled:
        pynccl_comm.send(tensor, dst)      # ← path actually taken (asynchronous ncclSend)
    else:
        torch.distributed.send(tensor, self.ranks[dst], self.device_group)  # fallback
```

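### 3.2 Causal chain of the hang (full reasoning)

1. The DAG compiled by Ray CGRAPH is `stage0 → stage1 → stage2` (PP=3).
2. Stage 0 finishes its forward pass and calls `RayPPCommunicator.send()` → `PyNcclCommunicator.send()` → `mcclSend()`.
3. `mcclSend` submits the send request to MCCL's asynchronous queue and returns immediately.
4. **The `mcclProxyProgress` thread spins in a `sched_yield` busy loop** and does not submit the request to the GPU.
5. The GPU never produces the corresponding completion event.
6. The MetaX driver's `AsyncEventsLoop` never receives the event (its stack shows `mxcd_evt_wait_many`).
7. Ray CGRAPH never sees the channel completion signal and **does not schedule `execute_model` on stage 1**.
8. The main thread keeps idling in the Ray task loop until `RayChannelTimeoutError` is finally raised.

### 3.3 Why PP=2 works

With PP=2 the PP group has only 2 ranks, and the CGRAPH DAG contains a single P2P transfer, `stage0 → stage1`. With a **single** send/recv pair, MCCL **happens** not to trigger the bug.

PP≥3 involves at least two P2P transfers (stage0→1 and stage1→2). The kernel for the **very first send is never submitted**, so rank 1 never runs and the chain breaks.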
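
The structural difference between PP=2 and PP≥3 can be made concrete with a small sketch. It only enumerates the forward P2P transfers of a linear pipeline and the stages that act as both receiver and sender. It says nothing about MCCL internals, and `pp_edges` / `relay_stages` are hypothetical helper names.

```python
def pp_edges(pp_size: int) -> list[tuple[int, int]]:
    """Return the forward stage-to-stage sends of a linear pipeline as (src, dst)."""
    return [(s, s + 1) for s in range(pp_size - 1)]


def relay_stages(pp_size: int) -> list[int]:
    """Stages that both receive from a predecessor and send to a successor."""
    edges = pp_edges(pp_size)
    senders = {src for src, _ in edges}
    receivers = {dst for _, dst in edges}
    return sorted(senders & receivers)


if __name__ == "__main__":
    for pp in (2, 3, 4):
        print(f"PP={pp}: edges={pp_edges(pp)} relay_stages={relay_stages(pp)}")
```

PP=2 has one edge and no relay stage, while PP=3 and PP=4 each have at least one stage that must receive and then send. That is the only topological property separating the working configuration from the hanging ones.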

---

## 4. Key evidence A: gdb thread stacks on rank 1

Full stacks (live attach about 90 seconds after the hang began):

### Thread 1 (main thread, Python/Ray)

```
#0  epoll_wait (epfd=38, events=..., maxevents=128, timeout=-1)
      at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  boost::asio::detail::epoll_reactor::run (...) [ray/_raylet.so]
#2  boost::asio::detail::scheduler::do_run_one (...) [ray/_raylet.so]
#3  boost::asio::detail::scheduler::run (...) [ray/_raylet.so]
#4  boost::asio::io_context::run () [ray/_raylet.so]
#5  ray::core::CoreWorker::RunTaskExecutionLoop () [ray/_raylet.so]
#6  ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop ()
#7  ray::core::CoreWorkerProcess::RunTaskExecutionLoop ()
```

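**Interpretation**: the main Python thread is parked in the Ray CoreWorker task execution loop, waiting for the next Ray actor task.
→ `execute_model()` on rank 1 was never scheduled.
→ This indicates that Ray CGRAPH considers the stage 0 → stage 1 transfer not yet complete.

### Thread 234 (mcclProxyProgress): key evidence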

```
#0  sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  mcclProxyProgress(void*) () from /opt/maca/lib/libmccl.so
#2  std::execute_native_thread_routine (__p=<optimized out>)
      at ../../../../../libstdc++-v3/src/c++11/thread.cc:104
#3  start_thread (...) at ./nptl/pthread_create.c:442
#4  clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
```

**Interpretation**: the progress thread of the MCCL proxy is spinning in a `sched_yield` busy loop.
→ MCCL **has work pending but is not actually submitting it to the GPU**.

### Thread 233 (mcclProxyService)

```
#0  __GI___poll (fds=0x7f1b92ffa9a0, nfds=65, timeout=500)
      at ../sysdeps/unix/sysv/linux/poll.c:29
#1  mcclProxyService(void*) () from /opt/maca/lib/libmccl.so
```

**Interpretation**: the MCCL network proxy service loops on `poll(timeout=500ms)` and has not received the network/synchronization event it expects.

### MetaX driver kernel thread AsyncEventsLoop: key evidence

(via `/proc/<pid>/task/<tid>/stack`)
```
[<0>] mxos_schedule_timeout + 0xe/0x20         [metax]
[<0>] mxcd_evt_wait_many + 0x543/0x660         [metax]
[<0>] mxcd_ioctl_wait_events + 0x29/0x30       [metax]
[<0>] mxcd_ioctl + 0x4a3/0xa50                 [metax]
[<0>] __x64_sys_ioctl + 0x92/0xd0
[<0>] x64_sys_call + 0x1e5f/0x1fa0
[<0>] do_syscall_64 + 0x56/0xb0
[<0>] entry_SYSCALL_64_after_hwframe + 0x6c/0xd6
```

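**Interpretation**: the MetaX driver's AsyncEventsLoop blocks in `mxcd_evt_wait_many()` via `ioctl(wait_events)`, waiting for a GPU event.
→ **This event never fires** (because the `mcclSend` kernel was never actually submitted).

### Other threads

Several `pt_nccl_watchdg` / `pt_nccl_heartbt` threads are in state S with stack `futex_wait_queue_me`. These are the watchdog/heartbeat threads of PyTorch ProcessGroupNCCL, waiting on a condition variable. The default NCCL watchdog timeout is 30 min and had not yet been reached.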
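
The kernel-side stacks above were read from `/proc/<pid>/task/<tid>/stack`. The sketch below shows one way to collect them for every thread of a worker at once and filter by a symbol of interest. The helper names are hypothetical, and reading these files normally requires root.

```python
from pathlib import Path


def _read(path: Path) -> str:
    """Read a procfs file, returning a placeholder if it is unreadable."""
    try:
        return path.read_text()
    except OSError as exc:  # thread exited, or insufficient privileges
        return f"<unreadable: {exc.strerror}>"


def dump_kernel_stacks(pid: int, proc_root: str = "/proc") -> dict[int, tuple[str, str]]:
    """Return {tid: (thread_name, kernel_stack)} for every thread of `pid`."""
    task_dir = Path(proc_root) / str(pid) / "task"
    stacks = {}
    for entry in task_dir.iterdir():
        if entry.name.isdigit():
            name = _read(entry / "comm").strip()
            stacks[int(entry.name)] = (name, _read(entry / "stack"))
    return stacks


def threads_in(stacks: dict[int, tuple[str, str]], symbol: str) -> list[int]:
    """Thread ids whose kernel stack mentions `symbol`."""
    return sorted(tid for tid, (_, stack) in stacks.items() if symbol in stack)
```

For example, `threads_in(dump_kernel_stacks(pid), "mxcd_evt_wait_many")` lists the threads currently blocked in the driver's event wait. The `proc_root` parameter exists only so the helper can be exercised against a fake directory tree.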

---

## 5. Key evidence B: vLLM logs at the hang

After adding trace prints to `CudaCommunicator.send/recv`, the following was observed once PP=3 hung.
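
The exact trace patch is not reproduced in this write-up. As an illustration only, a decorator along the following lines would emit lines in the same `[CudaComm.send] rank=… -> dst=… shape=[…]` format. It assumes the communicator exposes its PP-group rank as `self.rank_in_group`; `trace_p2p`, `DemoTensor` and `DemoComm` are hypothetical names used so the sketch runs without vLLM or a GPU.

```python
import functools


def trace_p2p(kind: str, log=print):
    """Wrap a send/recv method so each call logs rank, peer and tensor shape.

    Assumption: the communicator stores its rank within the PP group in
    `self.rank_in_group`; adjust the attribute name if it differs.
    """
    arrow, peer_label = ("->", "dst") if kind == "send" else ("<-", "src")

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, tensor, *args, **kwargs):
            # The peer may be passed positionally or as dst=/src=.
            peer = args[0] if args else kwargs.get(peer_label)
            rank = getattr(self, "rank_in_group", "?")
            log(f"[CudaComm.{kind}] rank={rank} {arrow} {peer_label}={peer} "
                f"shape={list(tensor.shape)}")
            return fn(self, tensor, *args, **kwargs)
        return wrapper
    return decorator


class DemoTensor:
    """Stand-in for a torch.Tensor; only `.shape` is used by the tracer."""
    def __init__(self, *shape):
        self.shape = shape


class DemoComm:
    """Stand-in communicator for demonstration purposes."""
    rank_in_group = 0

    @trace_p2p("send")
    def send(self, tensor, dst=None):
        return None  # a real communicator would issue the send here


if __name__ == "__main__":
    DemoComm().send(DemoTensor(6, 6144), 1)
```

Logging before the wrapped call matters here: a line is printed even when the underlying send or recv never completes, which is what makes the call counts usable as evidence.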

### Initialization phase (normal)

```
INFO 04-17 10:24:58 [parallel_state.py:1214] world_size=24 rank=16 local_rank=0
    distributed_init_method=tcp://10.66.3.10:46955 backend=nccl
... (the 24 RayWorkerWrapper processes initialize the same way, spread across the three nodes 10.66.3.{10,11,12})
INFO 04-17 10:29:11 [ray_executor.py:602] Using RayPPCommunicator (which wraps vLLM _PP
    GroupCoordinator) for Ray Compiled Graph communication.
2026-04-17 10:29:15,541 INFO torch_tensor_accelerator_channel.py:833
    -- Communicator group initialized.
```

### Inference phase (at the hang)

```
(RayWorkerWrapper pid=991, ip=10.66.3.11) [CudaComm.recv] rank=1 <- src=0 shape=[6, 6144]
(RayWorkerWrapper pid=1947)               [CudaComm.send] rank=0 -> dst=1 shape=[6, 6144]
(RayWorkerWrapper pid=992,  ip=10.66.3.11) [CudaComm.recv] rank=1 <- src=0 shape=[1, 6144] [repeated 35x]
(RayWorkerWrapper pid=2203)               [CudaComm.send] rank=0 -> dst=1 shape=[1, 6144] [repeated 47x]

# Engine throughput then drops to zero
INFO 04-17 10:29:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.6 tokens/s, gen: 0.1 tok/s
INFO 04-17 10:29:38 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, gen: 0.0 tok/s
... (stays at 0/0)
```

**Key findings**:
- The stage 0 → stage 1 send was called **49 times** (1 × shape=[6,6144] + 48 × shape=[1,6144]).
- The stage 1 → stage 2 send was **never called** (no `rank=1 -> dst=2` line appears anywhere in the log).
- `execute_model` on stage 1 never ran.
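
The call counts can be tallied mechanically from the trace lines. The sketch below is one way to do so. It assumes that Ray's `[repeated Nx]` suffix means N further occurrences in addition to the printed line, which is the reading that matches the count of 49; `count_p2p_calls` is a hypothetical helper.

```python
import re
from collections import Counter

LINE_RE = re.compile(
    r"\[CudaComm\.(?P<kind>send|recv)\] rank=(?P<rank>\d+) (?:->|<-) "
    r"(?:dst|src)=(?P<peer>\d+) shape=\[[^\]]*\]"
    r"(?: \[repeated (?P<rep>\d+)x[^\]]*\])?"
)

# Trace lines copied from the hang log in this section.
SAMPLE = """\
(RayWorkerWrapper pid=991, ip=10.66.3.11) [CudaComm.recv] rank=1 <- src=0 shape=[6, 6144]
(RayWorkerWrapper pid=1947)               [CudaComm.send] rank=0 -> dst=1 shape=[6, 6144]
(RayWorkerWrapper pid=992,  ip=10.66.3.11) [CudaComm.recv] rank=1 <- src=0 shape=[1, 6144] [repeated 35x]
(RayWorkerWrapper pid=2203)               [CudaComm.send] rank=0 -> dst=1 shape=[1, 6144] [repeated 47x]
"""


def count_p2p_calls(log_text: str) -> Counter:
    """Count traced calls keyed by (kind, rank, peer).

    Assumption: "[repeated Nx]" adds N occurrences on top of the printed line.
    """
    counts = Counter()
    for m in LINE_RE.finditer(log_text):
        n = 1 + int(m.group("rep") or 0)
        counts[(m.group("kind"), int(m.group("rank")), int(m.group("peer")))] += n
    return counts


if __name__ == "__main__":
    for key, n in sorted(count_p2p_calls(SAMPLE).items()):
        print(key, n)
```

On the sample lines this gives 49 sends from rank 0 to rank 1 and no entry at all for a send from rank 1 to rank 2.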

### Final error (about 4 minutes later)

```
ERROR 04-17 10:33:47 [core.py:938] EngineCore encountered a fatal error.
ERROR 04-17 10:33:47 [core.py:938]   File ".../ray/experimental/compiled_dag_ref.py", line 145, in get
ERROR 04-17 10:33:47 [core.py:938]     raise execution_error from None
ERROR 04-17 10:33:47 [core.py:938] ray.exceptions.ActorDiedError: The actor died unexpectedly
    before finishing this task.
ERROR 04-17 10:33:47 [core.py:938]   ip: 10.66.3.11
ERROR 04-17 10:33:47 [core.py:938]   Worker exit type: SYSTEM_ERROR
    Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file.
```

(Ray eventually declared rank 1 on worker 10.66.3.11 dead. The process was in fact still alive and simply never scheduled; Ray's heartbeat to it was lost.)

---

## 6. Key evidence C: minimal repro shows the underlying MCCL works

### Script 1: 3-node blocking P2P chain (passes)

`pp_chain_test.py`: 1 process and 1 GPU per node, `nccl` backend, blocking `send/recv`:

```python
import os, time
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
world = int(os.environ["WORLD_SIZE"])
local = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local)
dist.init_process_group(backend="nccl", rank=rank, world_size=world)
print(f"[rank {rank}/{world}] init ok on device {local}", flush=True)

shape = (1024, 4096)
dtype = torch.bfloat16

for step in range(3):
    t0 = time.time()
    if rank == 0:
        buf = torch.full(shape, float(step * 100), dtype=dtype, device="cuda")
        dist.send(buf, dst=1)
        print(f"[rank 0] step {step} sent to 1 ({(time.time()-t0)*1000:.1f}ms)", flush=True)
    elif rank == world - 1:
        buf = torch.empty(shape, dtype=dtype, device="cuda")
        dist.recv(buf, src=rank - 1)
        torch.cuda.synchronize()
        got = buf[0, 0].item()
        expected = float(step * 100)
        ok = "OK" if abs(got-expected)<0.01 else "FAIL"
        print(f"[rank {rank}] step {step} recv<-{rank-1} got={got:.1f} {ok} ({(time.time()-t0)*1000:.1f}ms)", flush=True)
    else:
        buf = torch.empty(shape, dtype=dtype, device="cuda")
        dist.recv(buf, src=rank - 1)
        torch.cuda.synchronize()
        dist.send(buf, dst=rank + 1)
        got = buf[0, 0].item()
        expected = float(step * 100)
        ok = "OK" if abs(got-expected)<0.01 else "FAIL"
        print(f"[rank {rank}] step {step} recv<-{rank-1} send->{rank+1} got={got:.1f} {ok} ({(time.time()-t0)*1000:.1f}ms)", flush=True)

dist.barrier()
if rank == 0:
    print(f"ALL OK world={world} chain test", flush=True)
dist.destroy_process_group()
```

**Result**:
```
[rank 0] step 0 sent to 1 (1744.6ms)      # warmup
[rank 1] step 0 recv<-0 send->2 got=0.0 OK (4297.8ms)
[rank 2] step 0 recv<-1 got=0.0 OK (3510.8ms)
[rank 0] step 1 sent to 1 (0.2ms)
[rank 1] step 1 recv<-0 send->2 got=100.0 OK (0.5ms)
[rank 2] step 1 recv<-1 got=100.0 OK (0.5ms)
[rank 0] step 2 sent to 1 (0.1ms)
[rank 1] step 2 recv<-0 send->2 got=200.0 OK (0.4ms)
[rank 2] step 2 recv<-1 got=200.0 OK (0.3ms)
ALL OK world=3 chain test
```

**Conclusion**: MCCL's blocking `send/recv` works **correctly** on a 3-node chain topology. The underlying communication is not the root cause.

### Script 2: 2-node isend/irecv (fails; secondary bug)

`pp_ring_test2.py`:

```python
import os, time
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
world = int(os.environ["WORLD_SIZE"])
local = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local)
dist.init_process_group(backend="nccl", rank=rank, world_size=world)

dst = (rank + 1) % world
src = (rank - 1) % world
shape = (1024, 4096)  # 8MB bf16
dtype = torch.bfloat16

for step in range(3):
    t0 = time.time()
    send_buf = torch.full(shape, float(rank + step * 10), dtype=dtype, device="cuda")
    recv_buf = torch.empty(shape, dtype=dtype, device="cuda")
    h1 = dist.isend(send_buf, dst=dst)
    h2 = dist.irecv(recv_buf, src=src)
    h1.wait(); h2.wait()
    torch.cuda.synchronize()
    print(f"[rank {rank}] step {step} isend->{dst} irecv<-{src} OK", flush=True)

dist.barrier()
if rank == 0:
    print(f"ALL OK world={world}", flush=True)
dist.destroy_process_group()
```

**Result** (fails even with 2 ranks):
```
torch.distributed.DistBackendError: NCCL error, internal error, NCCL version 2.16.5
Proxy Call to rank X failed (Connect)
```

This is a limitation of MCCL itself and **not the direct cause of the PP hang** (the main vLLM PP path does not use isend/irecv); it is included as a secondary signal.

---

## 7. Key evidence D: MCCL perf tests pass on four nodes

**Test command**:
```bash
export MACA_PATH=/opt/maca
export LD_LIBRARY_PATH=${MACA_PATH}/lib:${MACA_PATH}/ompi/lib:${MACA_PATH}/ucx/lib
${MACA_PATH}/ompi/bin/mpirun -np 32 --allow-run-as-root \
  -mca btl_tcp_if_include 10.66.3.0/24 -mca oob_tcp_if_include 10.66.3.0/24 \
  -mca pml ^ucx -mca osc ^ucx -mca btl ^openib \
  -host 10.66.3.10:8,10.66.3.11:8,10.66.3.12:8,10.66.3.13:8 \
  -x MACA_PATH -x LD_LIBRARY_PATH -x MCCL_CROSS_NIC=1 -x FORCE_ACTIVE_WAIT=2 \
  <test_binary> -b 1K -e 1G -d float -f 2 -g 1 -n 10
```

### `all_reduce_perf` peak

```
#         size        time   algbw   busbw   #wrong
          (B)         (us)   (GB/s)  (GB/s)
   268435456        10429   25.74   49.87      0
   536870912        16236   33.06   64.06      0
  1073741824        29299   36.65   71.00      0    ← 71.24 GB/s in-place peak
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 20.214
```

### `sendrecv_perf` peak

```
#         size        time   algbw   busbw   #wrong
          (B)         (us)   (GB/s)  (GB/s)
   536870912        24670   21.76   21.76      0
  1073741824        49320   21.77   21.77      0    ← 21.78 GB/s peak
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 11.789
```

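**Conclusion**: all four-node MCCL communication tests pass with 0 errors. The problem is **not in MCCL's underlying communication** but in **the interaction between the asynchronous `mcclSend` path and Ray CGRAPH multi-stage scheduling**.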
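
As a sanity check on the tables, the `algbw` and `busbw` columns can be recomputed from size and time. The sketch below uses the nccl-tests convention (algbw = bytes / time; for all-reduce, busbw = algbw × 2(n−1)/n). That the MetaX test binaries follow the same convention is an assumption, although the reported numbers agree with it for n = 32.

```python
def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes).

    bytes per microsecond equals MB/s, so divide by 1e3 to get GB/s.
    """
    return size_bytes / time_us / 1e3


def allreduce_busbw_gbps(size_bytes: int, time_us: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw * 2 * (n - 1) / n."""
    return algbw_gbps(size_bytes, time_us) * 2 * (n_ranks - 1) / n_ranks


# (size in bytes, time in us, busbw reported in the all_reduce_perf table)
ALLREDUCE_ROWS = [
    (268435456, 10429, 49.87),
    (536870912, 16236, 64.06),
    (1073741824, 29299, 71.00),
]

if __name__ == "__main__":
    for size, t_us, reported in ALLREDUCE_ROWS:
        calc = allreduce_busbw_gbps(size, t_us, n_ranks=32)
        print(f"{size:>11} B: computed {calc:.2f} GB/s, reported {reported:.2f} GB/s")
```

For `sendrecv_perf` the two columns are identical in the table, so the factor there is 1 and `algbw_gbps` alone reproduces the reported values.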

---

## 8. Attempted fixes and results

### Patch A: bypass PyNccl and have CudaCommunicator call `torch.distributed.send` directly

```python
# cuda_communicator.py: force the torch.distributed (blocking) path
def send(self, tensor, dst=None):
    ...
    torch.distributed.send(tensor, self.ranks[dst], self.device_group)

def recv(self, tensor, src=None):
    ...
    torch.distributed.recv(tensor, self.ranks[src], self.device_group)
```

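**Result**: the hang is **exactly the same**.
- The trace confirms that `recv` on rank 1 was indeed called (the data arrived).
- But `execute_model` on rank 1 was still not scheduled by Ray CGRAPH.
- Cause: **Ray CGRAPH's channel dependency tracking relies on the CUDA event registered by `ncclSend`**; once `ncclSend` is bypassed, Ray no longer sees the event fire.

### Patch B: keep ncclSend, add group + sync inside PyNccl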

```python
# pynccl.py
def send(self, tensor, dst, stream=None):
    ...
    self.nccl.ncclGroupStart()
    self.nccl.ncclSend(buffer_type(tensor.data_ptr()), tensor.numel(),
                       ncclDataTypeEnum.from_torch(tensor.dtype), dst, self.comm, stream)
    self.nccl.ncclGroupEnd()
    stream.synchronize()  # force a wait until the kernel completes
```

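**Result**: the hang is **still exactly the same**.
- `stream.synchronize()` itself never returns (the kernel was never submitted, so there is no event to wait for).
- This shows that the `ncclSend + ncclGroupEnd + stream.sync` combination cannot get the kernel to execute.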
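
Because a blocking synchronize never returns in this state, a bounded wait is more convenient when instrumenting the send path. The sketch below is a generic polling helper, not part of either patch; `wait_with_deadline` is a hypothetical name, and it is a diagnostic aid rather than a fix.

```python
import time
from typing import Callable


def wait_with_deadline(is_done: Callable[[], bool], timeout_s: float,
                       poll_interval_s: float = 0.01) -> bool:
    """Poll is_done() until it returns True or timeout_s elapses.

    Returns True on completion and False on timeout, so the caller can log
    diagnostics instead of blocking indefinitely.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

With PyTorch, one option is to record a `torch.cuda.Event` on the stream right after the send and pass its `query` method as `is_done`. A `False` return then marks the point at which the send was enqueued but never completed.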

### Conclusion

**Both approaches failed** → this bug cannot be worked around in vLLM's Python layer and **must be fixed inside MCCL**.

---

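## 9. Suggested debugging directions for MetaX

The root cause has been narrowed down to **the MCCL `mcclSend` kernel not executing under Ray CGRAPH with multiple PP stages (≥3)**. Suggestions for the MetaX team:

1. **Find out why `mcclProxyProgress` spins**: it sits in a `sched_yield` busy loop, which suggests it sees pending work but does not advance it. Check **under which conditions the progress thread skips submission to the GPU**.
2. **Inspect the path from `mcclSend` enqueue to GPU stream enqueue**: with PP≥3, rank 1 is simultaneously a recv target (from rank 0) and a send source (to rank 2). Check whether multiple concurrent pairs hit an uncovered branch or a race.
3. **Reproduce on the MetaX side**: `vllm-metax` + PP=3 + TP=8 + Ray backend + `enforce-eager`. The minimal repro does not need the GLM-5 model (any model that requires PP≥3 will do).
4. **MCCL debug logs**: `NCCL_DEBUG=INFO` does not take effect in Ray child processes and has to be passed through Ray's `runtime_env`. We suggest adding the environment variable directly to the worker launch command.
5. **If an MCCL debug build or tracing interface is available**, we can help reproduce the issue and send back traces.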
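
For item 4, the `runtime_env` route can be sketched as follows. Whether MCCL honours the `NCCL_DEBUG` / `NCCL_DEBUG_SUBSYS` names is an assumption based on its NCCL-compatible wrapper and has not been verified here; `build_debug_runtime_env` is a hypothetical helper.

```python
def build_debug_runtime_env(debug_level: str = "INFO",
                            subsys: str = "INIT,NET,P2P") -> dict:
    """Build a Ray runtime_env dict that forwards NCCL-style debug variables.

    Assumption: MCCL reads the same variable names as NCCL.
    """
    return {
        "env_vars": {
            "NCCL_DEBUG": debug_level,
            "NCCL_DEBUG_SUBSYS": subsys,
        }
    }


if __name__ == "__main__":
    print(build_debug_runtime_env())
```

When the driver script initializes Ray itself, the dict can be passed as `ray.init(address="auto", runtime_env=build_debug_runtime_env())`. Exporting the same variables in the shell before `ray start` on each node is an alternative that does not depend on the driver.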

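### Ray configuration and node detection (in reply to the original question)

The Ray configuration is **fine**:
- `ray status` shows 32/32 GPUs ready.
- PP=2 runs stable inference with a 65k context → Ray, MCCL communicator initialization and GPU detection all work.
- The problem **only appears with PP≥3**, so it is not a Ray configuration issue but an MCCL behaviour issue in multi-stage P2P transfers.

See section 1 of this document for the inter-node configuration.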

---

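## Appendix: current cluster state

- The 4-node Ray cluster has been restored to a clean state (`ray status` shows 0/32 GPUs in use).
- All debug patches have been **fully reverted**.
- GPU memory is clean.
- Waiting for a fix or debugging guidance from MetaX before reproducing again.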