• Members 11 posts
    January 30, 2026, 13:00

    Configuration:
    Lenovo SR658H, 512 GB RAM, GPUs: 2 × N260

    Problem: running a single model works fine, but as soon as a second one is started it cannot allocate GPU memory and hangs.

    Image version:
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.11.2-maca.ai3.3.0.103-torch2.8-py312-ubuntu22.04-amd64

    Docker command:

    docker run -itd \
      --restart always \
      --privileged \
      --device=/dev/dri \
      --device=/dev/mxcd \
      --group-add video \
      --network=host \
      --name Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --security-opt seccomp=unconfined \
      --security-opt apparmor=unconfined \
      --shm-size 100gb \
      --ulimit memlock=-1 \
      -v /models:/models \
      cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.11.2-maca.ai3.3.0.103-torch2.8-py312-ubuntu22.04-amd64 \
      /bin/bash
    

    Model launch command:

    VLLM_USE_V1=0 nohup vllm serve /models/Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --port 8889 \
      -tp 2 \
      --enforce-eager \
      --max-model-len 15000 \
      --gpu-memory-utilization 0.7 \
      --api-key Dzdwd@85416 \
      --max-num-seqs 35 \
      --served-model-name Qwen3-Next-80B-A3B-Instruct.w8a8 > vllm-80b.log 2>&1 &
    

    Embedding model launch command:

    nohup vllm serve /models/qwen3-Embedding-0.6B \
      --port 8890 \
      --enforce-eager \
      --served-model-name qwen3-Embedding-0.6B \
      --max-model-len 1024 \
      --gpu-memory-utilization 0.1 \
      --trust-remote-code \
      --task embed \
      --api-key Dzdwd@85416 > vllm-emb.log 2>&1 &
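
    For completeness, a variant that pins the embedding engine to card 1, so it does not land by default on card 0 (where the TP rank-0 worker and the engine-core process already sit, per the mx-smi output below). Whether this MACA image honors CUDA_VISIBLE_DEVICES the same way the CUDA stack does is an assumption here, not something verified:

    # hypothetical variant: restrict the embedding engine to card 1 only
    CUDA_VISIBLE_DEVICES=1 nohup vllm serve /models/qwen3-Embedding-0.6B \
      --port 8890 \
      --enforce-eager \
      --served-model-name qwen3-Embedding-0.6B \
      --max-model-len 1024 \
      --gpu-memory-utilization 0.1 \
      --trust-remote-code \
      --task embed \
      --api-key Dzdwd@85416 > vllm-emb.log 2>&1 &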
    

    To restate the problem: running only one model works fine, but the second one cannot allocate GPU memory and hangs here:

    (EngineCore_DP0 pid=20179) INFO 01-30 12:54:48 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='/models/qwen3-Embedding-0.6B', speculative_config=None, tokenizer='/models/qwen3-Embedding-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=qwen3-Embedding-0.6B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=PoolerConfig(pooling_type='LAST', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, softmax=None, activation=None, use_activation=None, logit_bias=None, step_tag_id=None, returned_token_ids=None), compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 0, 'local_cache_dir': None}
    (EngineCore_DP0 pid=20179) INFO 01-30 12:54:48 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.196.210.3:40141 backend=nccl
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    (EngineCore_DP0 pid=20179) INFO 01-30 12:54:49 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
    

    mx-smi:

    mx-smi  version: 2.2.9
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Fri Jan 30 12:59:17 2026
    
    Attached GPUs                                     : 2
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
    | MACA Version: 3.3.0.15             BIOS Version: 1.29.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0           Off | 0000:41:00.0        | 0%          Disabled |
    | 52W / 225W       | 43C          P9 | 47883/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1     MetaX N260 | 1           Off | 0000:c1:00.0        | 0%          Disabled |
    | 47W / 225W       | 40C          P9 | 47867/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  0                  2322349         VLLM::Worker_TP              47198          |
    |  0                  2343541         VLLM::EngineCor              16             |
    |  1                  2322350         VLLM::Worker_TP              47198          |
    +---------------------------------------------------------------------------------+
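
    For reference, the rough per-card arithmetic, assuming --gpu-memory-utilization is a fraction of the card's total memory (that is how the vLLM docs describe it; the exact accounting on MACA may differ):

    # N260 total per card (from mx-smi):                 65536 MiB
    # 80B LLM budget at --gpu-memory-utilization 0.7:   ~45875 MiB per card
    # embedding budget at 0.1:                           ~6553 MiB on one card
    # observed with only the LLM running:                47883 MiB used, ~17653 MiB free
    echo "combined budget per card: $((65536 * 70 / 100 + 65536 * 10 / 100)) of 65536 MiB"   # prints 52428

    On paper both budgets fit within a single 64 GB card.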
    

    How should I proceed, or is something wrong with my parameters?

  • Members 221 posts
    February 2, 2026, 15:55

    Dear developer, please try reducing the max-model-len and gpu-memory-utilization of the Qwen3-Next-80B-A3B-Instruct.w8a8 inference service.

  • Thread has been moved from 产品&运维 (Products & O&M).

  • Members 11 posts
    February 2, 2026, 16:40

    I have tried lowering the launch parameters: the LLM needs at least gpu-memory-utilization 0.7 to run at all, and launching the embedding model still fails with the same error. Has running Qwen3-Next-80B-A3B-Instruct.w8a8 already hit the limit? Each GPU still shows plenty of free memory, more than 10 GB.

  • Members 221 posts
    February 2, 2026, 16:45

    Dear developer, please try starting the qwen3-Embedding-0.6B model first, and then starting the Qwen3-Next-80B-A3B-Instruct.w8a8 model.

  • Members 11 posts
    February 2, 2026, 16:48

    Tried that; it doesn't work. Starting either the embedding model or the LLM first succeeds, but whichever one is started second errors out.

    INFO 02-02 16:46:35 [parallel_state.py:1208] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:43627 backend=nccl
    [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
    [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
    [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
    [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
    INFO 02-02 16:46:35 [mccl.py:28] Found nccl from library libmccl.so
    INFO 02-02 16:46:35 [mccl.py:28] Found nccl from library libmccl.so
    INFO 02-02 16:46:35 [pynccl.py:111] vLLM is using nccl==2.16.5
    [16:46:46.087][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:879 type:21. Retrying.
    [16:46:46.087][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:41742 type:21. Retrying.
    [16:46:56.327][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:879 type:21. Retrying.
    [16:46:56.327][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:41742 type:21. Retrying.
    [16:47:06.567][MXKW][E]queues.c                :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:879 type:21. Retrying.
    
  • Members 221 posts
    February 2, 2026, 16:49

    Dear developer, please try reducing the max-model-len of the Qwen3-Next-80B-A3B-Instruct.w8a8 inference service.

  • Members 11 posts
    February 2, 2026, 16:51

    It is already down to 2000 now. Should I keep reducing it?

    nohup vllm serve /models/Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --port 8889 \
      -tp 2 \
      --enforce-eager \
      --max-model-len 2000 \
      --gpu-memory-utilization 0.7 \
      --api-key Dzdwd@85416 \
      --max-num-seqs 10 \
      --served-model-name Qwen3-Next-80B-A3B-Instruct.w8a8 > vllm-80b.log 2>&1 &
    
  • Members 221 posts
    February 2, 2026, 16:52

    Dear developer, please continue reducing the max-model-len (e.g. to 512) and gpu-memory-utilization of the Qwen3-Next-80B-A3B-Instruct.w8a8 inference service.

  • Members 11 posts
    February 2, 2026, 17:25
    nohup vllm serve /models/Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --port 8889 \
      -tp 2 \
      --enforce-eager \
      --max-model-len 512 \
      --gpu-memory-utilization 0.6 \
      --api-key Dzdwd@85416 \
      --max-num-seqs 2 \
      --served-model-name Qwen3-Next-80B-A3B-Instruct.w8a8 > vllm-80b.log 2>&1 &
    

    With max-model-len 512 and --gpu-memory-utilization 0.6, the model will not start.

    With max-model-len 512 and --gpu-memory-utilization 0.7, it does start, but launching the embedding model afterwards still fails with the same error.
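
    A back-of-the-envelope reading of why 0.6 seems to be the floor, assuming w8a8 keeps most of the ~80B weights at about one byte per parameter (an assumption, not measured):

    # ~80e9 params at int8  ->  ~75 GiB of weights  ->  ~37-38 GiB per card with -tp 2
    # budget at 0.6: 65536 * 60 / 100 = 39321 MiB (~38.4 GiB) -> almost no room left for
    #                activations, non-quantized layers and KV cache
    # budget at 0.7: 65536 * 70 / 100 = 45875 MiB (~44.8 GiB) -> a few GiB of headroom
    echo "per-card budget: 0.6 -> $((65536 * 60 / 100)) MiB, 0.7 -> $((65536 * 70 / 100)) MiB"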

  • Members 221 posts
    February 2, 2026, 17:28

    Dear developer, please try setting gpu-memory-utilization to 0.65.

  • Members 221 posts
    February 2, 2026, 18:24

    Dear developer, please try switching to a smaller LLM (fewer parameters).

  • Members 11 posts
    February 3, 2026, 09:19

    Hello, I tried Qwen 30B today. It still hangs.

    vllm serve /models/Qwen3-VL-30B-A3B-Instruct \
      --port 8889 \
      -tp 2 \
      --max-model-len 2000 \
      --gpu-memory-utilization 0.6 \
      --api-key Dzdwd@85416 \
      --max-num-seqs 30 \
      --served-model-name Qwen3-VL-30B-A3B-Instruct
    
    dzdwd@dzdwd-server:~$ mx-smi
    mx-smi  version: 2.2.9
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Tue Feb  3 09:17:16 2026
    
    Attached GPUs                                     : 2
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
    | MACA Version: 3.3.0.15             BIOS Version: 1.29.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0           Off | 0000:41:00.0        | 0%          Disabled |
    | 52W / 225W       | 43C          P9 | 38897/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1     MetaX N260 | 1           Off | 0000:c1:00.0        | 0%          Disabled |
    | 47W / 225W       | 40C          P9 | 38881/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  0                   619946         VLLM::Worker_TP              38212          |
    |  0                   665500         VLLM::EngineCor              16             |
    |  1                   619947         VLLM::Worker_TP              38212          |
    +---------------------------------------------------------------------------------+
    

    It hangs here:

    (EngineCore_DP0 pid=28524) INFO 02-03 09:05:54 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.196.210.3:42619 backend=nccl
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    (EngineCore_DP0 pid=28524) INFO 02-03 09:05:54 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
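
    While it sits at that line, both cards can be watched from the host with plain watch (nothing MetaX-specific, just re-running the tool shown above) to see whether memory keeps climbing or the second engine is stuck before allocating anything:

    # refresh mx-smi every second during the second engine's start-up
    watch -n 1 mx-smi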
    
  • Members 221 posts
    February 3, 2026, 09:36

    Dear developer, please try a 7B LLM model first.

  • Members 11 posts
    February 3, 2026, 09:52

    Er, running a 7B model on two N260s seems rather unreasonable.

  • Members 221 posts
    February 3, 2026, 09:53

    Dear developer, please do this as a cross-check.

  • Members 2 posts
    February 3, 2026, 09:58

    I'm running into exactly the same problem: deploying three models on two cards, there is always one that fails to come up.

  • Members 11 posts
    February 3, 2026, 11:59

    Hello, I have tried the qwen2.5 7B model; it still hangs.

    (EngineCore_DP0 pid=34195) INFO 02-03 11:57:19 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.196.210.3:35845 backend=nccl
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    (EngineCore_DP0 pid=34195) INFO 02-03 11:57:19 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
    
    dzdwd@dzdwd-server:~$ mx-smi
    mx-smi  version: 2.2.9
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Tue Feb  3 11:59:24 2026
    
    Attached GPUs                                     : 2
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
    | MACA Version: 3.3.0.15             BIOS Version: 1.29.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0           Off | 0000:41:00.0        | 0%          Disabled |
    | 51W / 225W       | 43C          P9 | 22079/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1     MetaX N260 | 1           Off | 0000:c1:00.0        | 0%          Disabled |
    | 47W / 225W       | 40C          P9 | 22063/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  0                  1384499         VLLM::Worker_TP              21394          |
    |  0                  1400062         VLLM::EngineCor              16             |
    |  1                  1384500         VLLM::Worker_TP              21394          |
    +---------------------------------------------------------------------------------+
    
  • Members 11 posts
    February 3, 2026, 12:00

    Which card do you have, and which models are you deploying? I can only run one at a time.

  • Members 1 post
    February 3, 2026, 14:40

    C500 64G, also two cards running three models; no two of the models can be placed on the same card.

  • Members 2 posts
    February 3, 2026, 15:15

    qwen3-embedding,qwen3-reranker,qwen3-30b-a3b
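
    For comparison, the kind of split I would expect to work in principle: the big model across both cards and each small engine pinned to one card, with the per-card gpu-memory-utilization fractions summing well below 1.0. Paths, ports and fractions below are placeholders, --task score for the reranker is a guess, and whether CUDA_VISIBLE_DEVICES is honored by the MACA image is an assumption:

    # big model spans both cards (hypothetical budget 0.7 per card)
    vllm serve /models/qwen3-30b-a3b -tp 2 --port 8001 --gpu-memory-utilization 0.7 &
    # embedding pinned to card 0, reranker to card 1 (hypothetical 0.1 each)
    CUDA_VISIBLE_DEVICES=0 vllm serve /models/qwen3-embedding --task embed --port 8002 --gpu-memory-utilization 0.1 &
    CUDA_VISIBLE_DEVICES=1 vllm serve /models/qwen3-reranker --task score --port 8003 --gpu-memory-utilization 0.1 &

    In this thread, though, the failure appears whenever a second engine tries to initialize on a card that already hosts one, regardless of the fractions.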

  • Thread has been moved from 解决中 (In Progress).