Multi-node PP inference test results for the mp backend (external_launcher + torchrun)

Environment: 4 nodes x 8 MetaX C500 GPUs (64 GB), MACA 3.5.3
Image: vllm-metax:0.14.0-maca.ai3.5.3.102-torch2.8-py310-ubuntu22.04-amd64_glm_w4a8_full
Model: GLM-5-W8A8, --enforce-eager

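I. Test method

Inside the container on each node, run torchrun to launch vLLM directly, bypassing Ray (N is the number of nodes, <rank> is the index of the current node):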

torchrun --nnodes=N --nproc-per-node=8 --node-rank=<rank> \
  --master-addr=10.66.3.10 --master-port=29500 \
  -m vllm.entrypoints.openai.api_server \
  --model /model --tensor-parallel-size 8 --pipeline-parallel-size N \
  --distributed-executor-backend external_launcher \
  --enforce-eager --gpu-memory-utilization 0.88
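
The per-node command above differs only in N and <rank>, so it can be generated rather than typed by hand. The following is a minimal sketch, not the script used in this test: build_launch_cmd is a hypothetical helper, and the check assumes that the torchrun world size (nnodes x nproc-per-node) has to equal TP x PP for external_launcher to start.

```shell
#!/usr/bin/env bash
# Sketch only: prints the torchrun command for one node, it does not run it.
# NPROC, MASTER_ADDR and MASTER_PORT mirror the values in the command above.
NPROC=8
MASTER_ADDR=10.66.3.10
MASTER_PORT=29500

# build_launch_cmd <nnodes> <node_rank> <tp> <pp>   (hypothetical helper)
build_launch_cmd() {
  local nnodes=$1 node_rank=$2 tp=$3 pp=$4

  # Assumption: torchrun world size must match TP * PP.
  if [ $(( nnodes * NPROC )) -ne $(( tp * pp )) ]; then
    echo "world size $(( nnodes * NPROC )) != TP*PP $(( tp * pp ))" >&2
    return 1
  fi

  # Node ranks run from 0 to nnodes-1.
  if [ "$node_rank" -lt 0 ] || [ "$node_rank" -ge "$nnodes" ]; then
    echo "node rank $node_rank out of range 0..$(( nnodes - 1 ))" >&2
    return 1
  fi

  echo "torchrun --nnodes=$nnodes --nproc-per-node=$NPROC --node-rank=$node_rank" \
       "--master-addr=$MASTER_ADDR --master-port=$MASTER_PORT" \
       "-m vllm.entrypoints.openai.api_server" \
       "--model /model --tensor-parallel-size $tp --pipeline-parallel-size $pp" \
       "--distributed-executor-backend external_launcher" \
       "--enforce-eager --gpu-memory-utilization 0.88"
}
```

For example, `build_launch_cmd 4 2 8 4` prints the command for the third node of the PP=4 run, and `build_launch_cmd 4 0 8 2` is rejected because 32 processes do not match TP x PP = 16.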

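II. Test results

1. PP=4 TP=8 (4 nodes, 32 GPUs, external_launcher)
   - Startup: OK. Model loaded successfully, KV cache 1,171,840 tokens, API server startup complete
   - Inference: hangs. The curl request timed out after 120 seconds with 0 bytes returned, and the engine printed no throughput log lines
   - Full log: see attachment vllm_pp4_external_launcher.log

2. PP=2 TP=8 (2 nodes, 16 GPUs, external_launcher)
   - Startup: OK. Model loaded successfully, KV cache 141,312 tokens, API server startup complete
   - Inference: hangs. The curl request timed out after 60 seconds with 0 bytes returned, and the engine printed no throughput log lines
   - Full log: see attachment vllm_pp2_external_launcher.log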
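
The hang criterion used in both cases (curl times out and returns 0 bytes) can be written down as a small probe. This is a rough sketch rather than the exact command that was run: probe and classify_probe are hypothetical helpers, and the port (8000, the vLLM default), the /v1/completions endpoint, the request body, and the served model name "/model" are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: classify one request as ok / hang / error.

# classify_probe <curl_exit_code> <bytes_downloaded>   (hypothetical helper)
classify_probe() {
  local curl_exit=$1 bytes=$2
  if [ "$curl_exit" -eq 28 ] && [ "$bytes" -eq 0 ]; then
    echo "hang"     # curl exit code 28 means the operation timed out
  elif [ "$curl_exit" -eq 0 ] && [ "$bytes" -gt 0 ]; then
    echo "ok"
  else
    echo "error"    # connection refused, partial response, and so on
  fi
}

# probe <host> <timeout_seconds>   (hypothetical helper)
probe() {
  local host=$1 timeout_s=$2 bytes rc
  # Assumed endpoint, port and payload; adjust to the actual deployment.
  bytes=$(curl -s -o /dev/null -w '%{size_download}' --max-time "$timeout_s" \
    "http://$host:8000/v1/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model": "/model", "prompt": "Hello", "max_tokens": 16}')
  rc=$?
  classify_probe "$rc" "${bytes:-0}"
}
```

Under these assumptions, `probe 10.66.3.10 120` would correspond to the PP=4 check and `probe 10.66.3.10 60` to the PP=2 check; both runs above fall into the "hang" case.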

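III. Comparison: Ray backend

Same model and parameters, with --distributed-executor-backend ray:
   - PP=2 TP=8 (2 nodes, 16 GPUs): inference works, throughput logs are printed normally
   - PP=3 TP=8 (3 nodes, 24 GPUs): inference hangs
   - PP=4 TP=8 (4 nodes, 32 GPUs): inference hangs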

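IV. Conclusions

- external_launcher backend: multi-node PP inference hangs in every tested configuration, including PP=2
- Ray backend: PP=2 works, PP>2 hangs
- The low-level MCCL four-node communication tests (all_reduce_perf, sendrecv_perf) all passed with 0 errors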