MetaX-Tech Developer Forum
xiaoo

  • Members
  • Joined Jan 30, 2026

xiaoo has started 1 thread.

  • See post
    xiaoo
    Members
    Stuck on GPU memory allocation when running inference, embedding, and reranker models simultaneously [Solved] Jan 30, 2026 13:00

    Configuration:
    Lenovo SR658H, 512 GB RAM; GPUs: 2 × N260

    Problem: running either model alone works fine, but as soon as the second one is started it fails to allocate GPU memory and hangs.

    Image version:
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.11.2-maca.ai3.3.0.103-torch2.8-py312-ubuntu22.04-amd64

    Docker command:

    docker run -itd \
      --restart always \
      --privileged \
      --device=/dev/dri \
      --device=/dev/mxcd \
      --group-add video \
      --network=host \
      --name Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --security-opt seccomp=unconfined \
      --security-opt apparmor=unconfined \
      --shm-size 100gb \
      --ulimit memlock=-1 \
      -v /models:/models \
      cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.11.2-maca.ai3.3.0.103-torch2.8-py312-ubuntu22.04-amd64 \
      /bin/bash
    

    LLM launch command:

    VLLM_USE_V1=0 nohup vllm serve /models/Qwen3-Next-80B-A3B-Instruct.w8a8 \
      --port 8889 \
      -tp 2 \
      --enforce-eager \
      --max-model-len 15000 \
      --gpu-memory-utilization 0.7 \
      --api-key Dzdwd@85416 \
      --max-num-seqs 35 \
      --served-model-name Qwen3-Next-80B-A3B-Instruct.w8a8 > vllm-80b.log 2>&1 &
    
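    As a rough sanity check on my side (back-of-envelope numbers, not vendor figures): each N260 reports 65536 MiB in the mx-smi output below, so --gpu-memory-utilization 0.7 caps this instance at about 45 GiB per GPU, and an 80B model at roughly 1 byte per parameter (w8a8) split across -tp 2 would put about 38 GiB of weights on each card:

```python
# Back-of-envelope per-GPU budget for the 80B instance.
# Assumptions (mine, not measured): 65536 MiB per N260 (from mx-smi),
# ~1 byte/parameter for w8a8 weights, even split across tp=2.
TOTAL_MIB = 65536
GPU_MEM_UTIL = 0.7
TP = 2
PARAMS = 80e9  # "80B" taken at face value

budget_mib = int(GPU_MEM_UTIL * TOTAL_MIB)      # vLLM's cap for this instance
weights_per_gpu_mib = int(PARAMS / TP / 2**20)  # rough weight footprint per GPU

print(budget_mib)                         # per-GPU budget in MiB
print(weights_per_gpu_mib)                # rough weights per GPU in MiB
print(budget_mib - weights_per_gpu_mib)   # what is left for KV cache / activations
```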

    Embedding launch command:

    nohup vllm serve /models/qwen3-Embedding-0.6B \
      --port 8890 \
      --enforce-eager \
      --served-model-name qwen3-Embedding-0.6B \
      --max-model-len 1024 \
      --gpu-memory-utilization 0.1 \
      --trust-remote-code \
      --task embed \
      --api-key Dzdwd@85416 > vllm-emb.log 2>&1 &
    
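    One thing worth checking: as I understand vLLM's --gpu-memory-utilization, it is each instance's fraction of total device memory, measured independently per process, so co-located servers need their fractions plus runtime overhead to stay under 1.0. Here 0.7 + 0.1 = 0.8 nominally leaves headroom:

```python
# Combined nominal claim of both vLLM instances on one 65536 MiB GPU.
TOTAL_MIB = 65536
llm_frac, embed_frac = 0.7, 0.1

combined_mib = int((llm_frac + embed_frac) * TOTAL_MIB)
headroom_mib = TOTAL_MIB - combined_mib

print(combined_mib)  # memory both instances may claim together, in MiB
print(headroom_mib)  # nominal slack left for driver/runtime overhead
```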

    Problem: running one model alone is fine, but starting the second one cannot allocate GPU memory and hangs. Log output from the embedding server:

    (EngineCore_DP0 pid=20179) INFO 01-30 12:54:48 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='/models/qwen3-Embedding-0.6B', speculative_config=None, tokenizer='/models/qwen3-Embedding-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=qwen3-Embedding-0.6B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=PoolerConfig(pooling_type='LAST', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, softmax=None, activation=None, use_activation=None, logit_bias=None, step_tag_id=None, returned_token_ids=None), compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 0, 'local_cache_dir': None}
    (EngineCore_DP0 pid=20179) INFO 01-30 12:54:48 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.196.210.3:40141 backend=nccl
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
    (EngineCore_DP0 pid=20179) INFO 01-30 12:54:49 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
    

    mx-smi:

    mx-smi  version: 2.2.9
    
    =================== MetaX System Management Interface Log ===================
    Timestamp                                         : Fri Jan 30 12:59:17 2026
    
    Attached GPUs                                     : 2
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
    | MACA Version: 3.3.0.15             BIOS Version: 1.29.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0           Off | 0000:41:00.0        | 0%          Disabled |
    | 52W / 225W       | 43C          P9 | 47883/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1     MetaX N260 | 1           Off | 0000:c1:00.0        | 0%          Disabled |
    | 47W / 225W       | 40C          P9 | 47867/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    
    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                    PID         Process Name                 GPU Memory     |
    |                                                                  Usage(MiB)     |
    |=================================================================================|
    |  0                  2322349         VLLM::Worker_TP              47198          |
    |  0                  2343541         VLLM::EngineCor              16             |
    |  1                  2322350         VLLM::Worker_TP              47198          |
    +---------------------------------------------------------------------------------+
    
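    Reading the mx-smi table as a sanity check (my arithmetic, hedged): GPU 0 reports 47883 of 65536 MiB in use, so roughly 17 GiB is still free, which is more than the ~6.4 GiB the embedding server should reserve at --gpu-memory-utilization 0.1. That would suggest the stall is not simple memory exhaustion:

```python
# Free memory vs. the embedding server's nominal claim, using the mx-smi numbers.
TOTAL_MIB = 65536
USED_GPU0_MIB = 47883  # from the mx-smi table, GPU 0
EMBED_FRAC = 0.1

free_mib = TOTAL_MIB - USED_GPU0_MIB
embed_claim_mib = int(EMBED_FRAC * TOTAL_MIB)

print(free_mib)                     # MiB still free on GPU 0
print(embed_claim_mib)              # MiB the embedding instance would reserve
print(free_mib > embed_claim_mib)   # True: raw free memory looks sufficient
```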

    How should I proceed, or is there a problem with my parameters?
