• Members 11 posts
    January 30, 2026, 13:00

    Configuration:
    Lenovo SR658H, 512 GB RAM, GPUs: 2 × N260

    Problem: running a single model works fine, but as soon as a second one is started it cannot allocate GPU memory and hangs.

    Image version:
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.11.2-maca.ai3.3.0.103-torch2.8-py312-ubuntu22.04-amd64

    Docker command:

    docker run -itd \
      --restart always \
      --privileged \
      --device=/dev/dri \
      --device=/dev/mxcd \
      --group-add video \
      --network=host \
      --name Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --security-opt seccomp=unconfined \
      --security-opt apparmor=unconfined \
      --shm-size 100gb \
      --ulimit memlock=-1 \
      -v /models:/models \
      cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.11.2-maca.ai3.3.0.103-torch2.8-py312-ubuntu22.04-amd64 \
      /bin/bash
    

    Model launch command:

    VLLM_USE_V1=0 nohup vllm serve /models/Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --port 8889 \
      -tp 2 \
      --enforce-eager \
      --max-model-len 15000 \
      --gpu-memory-utilization 0.7 \
      --api-key Dzdwd@85416 \
      --max-num-seqs 35 \
      --served-model-name Qwen3-Next-80B-A3B-Instruct.w8a8 > vllm-80b.log 2>&1 &
    

    Embedding model launch command:

    nohup vllm serve /models/qwen3-Embedding-0.6B \
      --port 8890 \
      --enforce-eager \
      --served-model-name qwen3-Embedding-0.6B \
      --max-model-len 1024 \
      --gpu-memory-utilization 0.1 \
      --trust-remote-code \
      --task embed \
      --api-key Dzdwd@85416 > vllm-emb.log 2>&1 &
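
    For completeness, a variant that pins the embedding engine to card 1, so it does not land by default on card 0 (where the TP rank-0 worker and the engine-core process already sit, per the mx-smi output below). Whether this MACA image honors CUDA_VISIBLE_DEVICES the same way the CUDA stack does is an assumption here, not something verified:

    # hypothetical variant: restrict the embedding engine to card 1 only
    CUDA_VISIBLE_DEVICES=1 nohup vllm serve /models/qwen3-Embedding-0.6B \
      --port 8890 \
      --enforce-eager \
      --served-model-name qwen3-Embedding-0.6B \
      --max-model-len 1024 \
      --gpu-memory-utilization 0.1 \
      --trust-remote-code \
      --task embed \
      --api-key Dzdwd@85416 > vllm-emb.log 2>&1 &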
    

    To restate the problem: running only one model works fine, but the second one cannot allocate GPU memory and hangs here:

    (EngineCore_DP0 pid=20179) INFO 01-30 12:54:48 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='/models/qwen3-Embedding-0.6B', speculative_config=None, tokenizer='/models/qwen3-Embedding-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=qwen3-Embedding-0.6B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=PoolerConfig(pooling_type='LAST', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, softmax=None, activation=None, use_activation=None, logit_bias=None, step_tag_id=None, returned_token_ids=None), compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 0, 'local_cache_dir': None}
    (EngineCore_DP0 pid=20179) INFO 01-30 12:54:48 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.196.210.3:40141 backend=nccl
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    (EngineCore_DP0 pid=20179) INFO 01-30 12:54:49 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
    

    mx-smi:

    mx-smi  version: 2.2.9
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Fri Jan 30 12:59:17 2026
    
    Attached GPUs                                     : 2
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
    | MACA Version: 3.3.0.15             BIOS Version: 1.29.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0           Off | 0000:41:00.0        | 0%          Disabled |
    | 52W / 225W       | 43C          P9 | 47883/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1     MetaX N260 | 1           Off | 0000:c1:00.0        | 0%          Disabled |
    | 47W / 225W       | 40C          P9 | 47867/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  0                  2322349         VLLM::Worker_TP              47198          |
    |  0                  2343541         VLLM::EngineCor              16             |
    |  1                  2322350         VLLM::Worker_TP              47198          |
    +---------------------------------------------------------------------------------+
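
    For reference, the rough per-card arithmetic, assuming --gpu-memory-utilization is a fraction of the card's total memory (that is how the vLLM docs describe it; the exact accounting on MACA may differ):

    # N260 total per card (from mx-smi):                 65536 MiB
    # 80B LLM budget at --gpu-memory-utilization 0.7:   ~45875 MiB per card
    # embedding budget at 0.1:                           ~6553 MiB on one card
    # observed with only the LLM running:                47883 MiB used, ~17653 MiB free
    echo "combined budget per card: $((65536 * 70 / 100 + 65536 * 10 / 100)) of 65536 MiB"   # prints 52428

    On paper both budgets fit within a single 64 GB card.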
    

    How should I proceed, or is something wrong with my parameters?

  • Members 221 posts
    February 2, 2026, 15:55

    Dear developer, please try reducing the max-model-len and gpu-memory-utilization of the Qwen3-Next-80B-A3B-Instruct.w8a8 inference service.

  • Thread has been moved from 产品&运维 (Products & O&M).

  • Members 11 posts
    February 2, 2026, 16:40

    I have tried lowering the launch parameters: the LLM needs at least gpu-memory-utilization 0.7 to run at all, and launching the embedding model still fails with the same error. Has running Qwen3-Next-80B-A3B-Instruct.w8a8 already hit the limit? Each GPU still shows plenty of free memory, more than 10 GB.

  • Members 221 posts
    February 2, 2026, 16:45

    Dear developer, please try starting the qwen3-Embedding-0.6B model first, and then starting the Qwen3-Next-80B-A3B-Instruct.w8a8 model.

  • Members 11 posts
    February 2, 2026, 16:48

    Tried that; it doesn't work. Starting either the embedding model or the LLM first succeeds, but whichever one is started second errors out.

    INFO 02-02 16:46:35 [parallel_state.py:1208] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:43627 backend=nccl
    [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
    [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
    [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
    [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
    INFO 02-02 16:46:35 [mccl.py:28] Found nccl from library libmccl.so
    INFO 02-02 16:46:35 [mccl.py:28] Found nccl from library libmccl.so
    INFO 02-02 16:46:35 [pynccl.py:111] vLLM is using nccl==2.16.5
    [16:46:46.087][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:879 type:21. Retrying.
    [16:46:46.087][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:41742 type:21. Retrying.
    [16:46:56.327][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:879 type:21. Retrying.
    [16:46:56.327][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:41742 type:21. Retrying.
    [16:47:06.567][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:879 type:21. Retrying.
    
  • Members 221 posts
    February 2, 2026, 16:49

    Dear developer, please try reducing the max-model-len of the Qwen3-Next-80B-A3B-Instruct.w8a8 inference service.

  • Members 11 posts
    February 2, 2026, 16:51

    It is already down to 2000 now. Should I keep reducing it?

    nohup vllm serve /models/Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --port 8889 \
      -tp 2 \
      --enforce-eager \
      --max-model-len 2000 \
      --gpu-memory-utilization 0.7 \
      --api-key Dzdwd@85416 \
      --max-num-seqs 10 \
      --served-model-name Qwen3-Next-80B-A3B-Instruct.w8a8 > vllm-80b.log 2>&1 &
    
  • Members 221 posts
    February 2, 2026, 16:52

    Dear developer, please continue reducing the max-model-len (e.g. to 512) and gpu-memory-utilization of the Qwen3-Next-80B-A3B-Instruct.w8a8 inference service.

  • Members 11 posts
    February 2, 2026, 17:25
    nohup vllm serve /models/Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --port 8889 \
      -tp 2 \
      --enforce-eager \
      --max-model-len 512 \
      --gpu-memory-utilization 0.6 \
      --api-key Dzdwd@85416 \
      --max-num-seqs 2 \
      --served-model-name Qwen3-Next-80B-A3B-Instruct.w8a8 > vllm-80b.log 2>&1 &
    

    With max-model-len 512 and --gpu-memory-utilization 0.6, the model will not start.

    With max-model-len 512 and --gpu-memory-utilization 0.7, it does start, but launching the embedding model afterwards still fails with the same error.
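
    A back-of-the-envelope reading of why 0.6 seems to be the floor, assuming w8a8 keeps most of the ~80B weights at about one byte per parameter (an assumption, not measured):

    # ~80e9 params at int8  ->  ~75 GiB of weights  ->  ~37-38 GiB per card with -tp 2
    # budget at 0.6: 65536 * 60 / 100 = 39321 MiB (~38.4 GiB) -> almost no room left for
    #                activations, non-quantized layers and KV cache
    # budget at 0.7: 65536 * 70 / 100 = 45875 MiB (~44.8 GiB) -> a few GiB of headroom
    echo "per-card budget: 0.6 -> $((65536 * 60 / 100)) MiB, 0.7 -> $((65536 * 70 / 100)) MiB"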

  • Members 221 posts
    February 2, 2026, 17:28

    Dear developer, please try setting gpu-memory-utilization to 0.65.

  • Members 221 posts
    February 2, 2026, 18:24

    Dear developer, please try switching to a smaller LLM (fewer parameters).

  • Members 11 posts
    February 3, 2026, 09:19

    Hello, I tried Qwen 30B today. It still hangs.

    vllm serve /models/Qwen3-VL-30B-A3B-Instruct \
      --port 8889 \
      -tp 2 \
      --max-model-len 2000 \
      --gpu-memory-utilization 0.6 \
      --api-key Dzdwd@85416 \
      --max-num-seqs 30 \
      --served-model-name Qwen3-VL-30B-A3B-Instruct
    
    dzdwd@dzdwd-server:~$ mx-smi
    mx-smi  version: 2.2.9
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Tue Feb  3 09:17:16 2026
    
    Attached GPUs                                     : 2
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
    | MACA Version: 3.3.0.15             BIOS Version: 1.29.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0           Off | 0000:41:00.0        | 0%          Disabled |
    | 52W / 225W       | 43C          P9 | 38897/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1     MetaX N260 | 1           Off | 0000:c1:00.0        | 0%          Disabled |
    | 47W / 225W       | 40C          P9 | 38881/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  0                   619946         VLLM::Worker_TP              38212          |
    |  0                   665500         VLLM::EngineCor              16             |
    |  1                   619947         VLLM::Worker_TP              38212          |
    +---------------------------------------------------------------------------------+
    

    It hangs here:

    (EngineCore_DP0 pid=28524) INFO 02-03 09:05:54 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.196.210.3:42619 backend=nccl
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    (EngineCore_DP0 pid=28524) INFO 02-03 09:05:54 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
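
    While it sits at that line, both cards can be watched from the host with plain watch (nothing MetaX-specific, just re-running the tool shown above) to see whether memory keeps climbing or the second engine is stuck before allocating anything:

    # refresh mx-smi every second during the second engine's start-up
    watch -n 1 mx-smi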
    
  • Members 221 posts
    February 3, 2026, 09:36

    Dear developer, please try a 7B LLM model first.

  • Members 11 posts
    February 3, 2026, 09:52

    Er, running a 7B model on two N260s seems rather unreasonable.

  • Members 221 posts
    February 3, 2026, 09:53

    Dear developer, please do this as a cross-check.

  • Members 2 posts
    February 3, 2026, 09:58

    I'm running into exactly the same problem: deploying three models on two cards, there is always one that fails to come up.

  • Members 11 posts
    February 3, 2026, 11:59

    Hello, I have tried the qwen2.5 7B model; it still hangs.

    (EngineCore_DP0 pid=34195) INFO 02-03 11:57:19 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.196.210.3:35845 backend=nccl
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    (EngineCore_DP0 pid=34195) INFO 02-03 11:57:19 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
    
    dzdwd@dzdwd-server:~$ mx-smi
    mx-smi  version: 2.2.9
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Tue Feb  3 11:59:24 2026
    
    Attached GPUs                                     : 2
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
    | MACA Version: 3.3.0.15             BIOS Version: 1.29.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0           Off | 0000:41:00.0        | 0%          Disabled |
    | 51W / 225W       | 43C          P9 | 22079/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1     MetaX N260 | 1           Off | 0000:c1:00.0        | 0%          Disabled |
    | 47W / 225W       | 40C          P9 | 22063/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  0                  1384499         VLLM::Worker_TP              21394          |
    |  0                  1400062         VLLM::EngineCor              16             |
    |  1                  1384500         VLLM::Worker_TP              21394          |
    +---------------------------------------------------------------------------------+
    
  • Members 11 posts
    February 3, 2026, 12:00

    Which card do you have, and which models are you deploying? I can only run one at a time.

  • Members 1 post
    February 3, 2026, 14:40

    C500 64G, also two cards running three models; no two of the models can be placed on the same card.

  • Members 2 posts
    February 3, 2026, 15:15

    qwen3-embedding,qwen3-reranker,qwen3-30b-a3b
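
    For comparison, the kind of split I would expect to work in principle: the big model across both cards and each small engine pinned to one card, with the per-card gpu-memory-utilization fractions summing well below 1.0. Paths, ports and fractions below are placeholders, --task score for the reranker is a guess, and whether CUDA_VISIBLE_DEVICES is honored by the MACA image is an assumption:

    # big model spans both cards (hypothetical budget 0.7 per card)
    vllm serve /models/qwen3-30b-a3b -tp 2 --port 8001 --gpu-memory-utilization 0.7 &
    # embedding pinned to card 0, reranker to card 1 (hypothetical 0.1 each)
    CUDA_VISIBLE_DEVICES=0 vllm serve /models/qwen3-embedding --task embed --port 8002 --gpu-memory-utilization 0.1 &
    CUDA_VISIBLE_DEVICES=1 vllm serve /models/qwen3-reranker --task score --port 8003 --gpu-memory-utilization 0.1 &

    In this thread, though, the failure appears whenever a second engine tries to initialize on a card that already hosts one, regardless of the fractions.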

  • Thread has been moved from 解决中 (In Progress).