MetaX-Tech Developer Forum

langhongbin

  • Members
  • Joined May 20, 2026

langhongbin has posted 12 messages.

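  • langhongbin
    Members
    How do I deploy the bge-m3 and bge-reranker-v2-m3 models on MetaX C500? (In progress) May 23, 2026 16:38

    After moving the services to different ports, the same error as above still occurs.
    Launch commands: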
    nohup vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen3.6 \
    --dtype bfloat16 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768 \
    --max-num-batched-tokens 131072 \
    --max-num-seqs 128 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    > qwen.log 2>&1 &

    nohup vllm serve /root/vllm/bge-m3/ \
    --host 0.0.0.0 \
    --port 8001 \
    --served-model-name bge-m3 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.1 \
    --trust-remote-code \
    --dtype auto \
    > bge-m3.log 2>&1 &

    Error log:
    (EngineCore pid=26596) INFO 05-23 16:26:48 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='/root/vllm/bge-m3/', speculative_config=None, tokenizer='/root/vllm/bge-m3/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=bge-m3, enable_prefix_caching=False, enable_chunked_prefill=False, pooler_config=PoolerConfig(task=None, pooling_type=None, seq_pooling_type='CLS', tok_pooling_type='ALL', use_activation=True, dimensions=None, enable_chunked_processing=False, max_embed_len=None, logit_bias=None, step_tag_id=None, returned_token_ids=None), compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::mx_sparse_attn_indexer', 'vllm::mx_sparse_attn_indexer_bf16', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
    (EngineCore pid=26596) INFO 05-23 16:26:48 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.217.247.136:40835 backend=nccl
    (EngineCore pid=26596) INFO 05-23 16:26:48 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
    [16:26:59.536][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
    [16:27:09.776][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
    [16:27:20.016][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
    [16:27:30.256][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
    [16:27:40.496][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
    [16:27:50.736][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
    [16:28:00.977][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
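
The retry lines in this log arrive at a steady cadence. A minimal sketch, assuming the log format shown in this post, that pulls the timestamps out of the `mxkwCreateQueueBlock` timeout messages and reports the gap between consecutive retries. The function name `retry_gaps` is hypothetical and only serves this illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# [16:26:59.536][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, ...
TIMEOUT_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3})\]\[MXKW\]\[E\].*create queue block timeout"
)


def retry_gaps(log_text):
    """Return the seconds between consecutive queue-block timeout lines."""
    stamps = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.match(line.strip())
        if match:
            stamps.append(datetime.strptime(match.group(1), "%H:%M:%S.%f"))
    return [round((b - a).total_seconds(), 3) for a, b in zip(stamps, stamps[1:])]


# First three retry lines copied from the log in this post.
SAMPLE = """\
[16:26:59.536][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:27:09.776][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:27:20.016][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
"""

if __name__ == "__main__":
    print(retry_gaps(SAMPLE))
```

A constant gap of about ten seconds suggests a fixed driver-side timeout rather than a slowly progressing operation, though that reading is an inference from the timestamps only.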

  • langhongbin
    Members
    How do I deploy the bge-m3 and bge-reranker-v2-m3 models on MetaX C500? (In progress) May 22, 2026 14:47

    1. Image version:
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

    2. Container launch command:
    docker run -itd \
    --name qwen3.6 \
    --network host \
    --shm-size 512G \
    --device=/dev/dri \
    --device=/dev/mxcd \
    --group-add video \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --shm-size 100gb \
    --ulimit memlock=-1 \
    -v /home/modelscope:/root/vllm \
    -e TZ=Asia/Shanghai \
    -p 8000:8000 \
    -p 8001:8001 \
    -p 8002:8002 \
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

    nohup vllm serve /root/vllm/bge-m3/ \
    --host 0.0.0.0 \
    --port 8001 \
    --served-model-name bge-m3 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.1 \
    --trust-remote-code \
    --dtype auto \
    > bge-m3.log 2>&1 &

    nohup vllm serve /root/vllm/bge-reranker-v2-m3/ \
    --host 0.0.0.0 \
    --port 8001 \
    --served-model-name bge-reranker-v2-m3 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.1 \
    --trust-remote-code \
    --dtype auto \
    > reranker.log 2>&1 &
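
Both launch commands above pass `--port 8001`, so the second server cannot bind while the first one is listening. This is separate from the driver timeouts reported below. A minimal sketch of a pre-launch check using only the Python standard library; the helper name `port_is_free` is hypothetical.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process already holds the port.
            return False
    return True


if __name__ == "__main__":
    # Ports taken from the docker run command in this post.
    for port in (8000, 8001, 8002):
        print(port, "free" if port_is_free(port) else "in use")
```

The check is only a snapshot: another process can still take the port between the check and the actual launch.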

    II. Problem:
    Starting multiple services in the container fails with the following error:
    itionalGeneration.
    WARNING 05-22 14:34:58 [registry.py:915] Model architecture GlmMoeDsaForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:GlmMoeDsaForCausalLM.
    (Worker pid=2004) INFO 05-22 14:34:59 [parallel_state.py:1400] world_size=8 rank=5 local_rank=5 distributed_init_method=tcp://127.0.0.1:45863 backend=nccl
    (Worker pid=2000) [rank1]:W0522 14:34:59.640000 2000 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
    (Worker pid=2000) [rank1]:W0522 14:34:59.640000 2000 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
    (Worker pid=2002) [rank3]:W0522 14:34:59.640000 2002 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
    (Worker pid=2002) [rank3]:W0522 14:34:59.640000 2002 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
    (Worker pid=2001) [rank2]:W0522 14:34:59.640000 2001 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
    (Worker pid=2001) [rank2]:W0522 14:34:59.640000 2001 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
    (Worker pid=1999) [rank0]:W0522 14:34:59.640000 1999 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
    (Worker pid=1999) [rank0]:W0522 14:34:59.640000 1999 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
    (Worker pid=2005) [rank6]:W0522 14:34:59.641000 2005 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
    (Worker pid=2006) [rank7]:W0522 14:34:59.641000 2006 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
    (Worker pid=2005) [rank6]:W0522 14:34:59.641000 2005 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
    (Worker pid=2006) [rank7]:W0522 14:34:59.641000 2006 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
    (Worker pid=2004) [rank5]:W0522 14:34:59.641000 2004 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
    (Worker pid=2004) [rank5]:W0522 14:34:59.641000 2004 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
    (Worker pid=2003) [rank4]:W0522 14:34:59.642000 2003 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
    (Worker pid=2003) [rank4]:W0522 14:34:59.642000 2003 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
    (Worker pid=1999) INFO 05-22 14:34:59 [mccl.py:27] Found mccl from library libmccl.so
    (Worker pid=1999) INFO 05-22 14:34:59 [pynccl.py:111] vLLM is using nccl==2.16.5
    [14:35:11.312][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
    [14:35:21.552][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
    [14:35:31.792][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
    [14:35:42.032][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
    [14:35:52.273][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
    [14:36:02.512][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
    [14:36:12.752][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.

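  • langhongbin
    Members
    How do I deploy the bge-m3 and bge-reranker-v2-m3 models on MetaX C500? (In progress) May 22, 2026 12:22

    I. Hardware and software information:
    1. Server vendor: Inspur

    2. MetaX GPU model: MetaX C500, 8 cards

    3. OS kernel version: 6.6.0-32.7.v2505.ky11.x86_64

    4. CPU virtualization enabled: Yes

    5. mx-smi output: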
    mx-smi version: 2.2.12

    =================== MetaX System Management Interface Log ===================
    Timestamp : Wed May 20 18:14:56 2026

    Attached GPUs : 8
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.12 Kernel Mode Driver Version: 3.6.11 |
    | MACA Version: unknown BIOS Version: 1.31.1.0 |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
    | Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
    |==================+=================+=====================+======================|
    | 0 MetaX C500 | 0 Off | 0000:04:00.0 | 0% Disabled |
    | 82W / 350W | 61C P9 | 40353/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 1 MetaX C500 | 1 Off | 0000:05:00.0 | 0% Disabled |
    | 75W / 350W | 58C P9 | 40993/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 2 MetaX C500 | 2 Off | 0000:63:00.0 | 0% Disabled |
    | 80W / 350W | 56C P9 | 40353/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 3 MetaX C500 | 3 Off | 0000:64:00.0 | 0% Disabled |
    | 80W / 350W | 59C P9 | 40993/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 4 MetaX C500 | 4 Off | 0000:83:00.0 | 0% Disabled |
    | 82W / 350W | 56C P9 | 40993/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 5 MetaX C500 | 5 Off | 0000:84:00.0 | 0% Disabled |
    | 72W / 350W | 53C P9 | 40353/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 6 MetaX C500 | 6 Off | 0000:e4:00.0 | 0% Disabled |
    | 81W / 350W | 58C P9 | 40993/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 7 MetaX C500 | 7 Off | 0000:e5:00.0 | 0% Disabled |
    | 74W / 350W | 54C P9 | 40353/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process: |
    | GPU PID Process Name GPU Memory |
    | Usage(MiB) |
    |=================================================================================|
    | 0 1025936 VLLM::Worker_TP 39386 |
    | 1 1025937 VLLM::Worker_TP 40026 |
    | 2 1025938 VLLM::Worker_TP 39386 |
    | 3 1025939 VLLM::Worker_TP 40026 |
    | 4 1025940 VLLM::Worker_TP 40026 |
    | 5 1025941 VLLM::Worker_TP 39386 |
    | 6 1025942 VLLM::Worker_TP 40026 |
    | 7 1025943 VLLM::Worker_TP 39386 |
    +---------------------------------------------------------------------------------+

    6. docker info output:
    [root@localhost ~]# docker info
    Client:
    Version: 24.0.9
    Context: default
    Debug Mode: false

    Server:
    Containers: 1
    Running: 1
    Paused: 0
    Stopped: 0
    Images: 1
    Server Version: 24.0.9
    Storage Driver: overlay2
    Backing Filesystem: xfs
    Supports d_type: true
    Using metacopy: false
    Native Overlay Diff: true
    userxattr: false
    Logging Driver: json-file
    Cgroup Driver: cgroupfs
    Cgroup Version: 1
    Plugins:
    Volume: local
    Network: bridge host ipvlan macvlan null overlay
    Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
    Swarm: inactive
    Runtimes: io.containerd.runc.v2 runc
    Default Runtime: runc
    Init Binary: docker-init
    containerd version: 9a04df1519ac2967eece6c6a5d13d3b846b574b2.m
    runc version:
    init version:
    Security Options:
    seccomp
    Profile: builtin
    Kernel Version: 6.6.0-32.7.v2505.ky11.x86_64
    Operating System: Kylin Linux Advanced Server V11 (Swan25)
    OSType: linux
    Architecture: x86_64
    CPUs: 256
    Total Memory: 1.472TiB
    Name: localhost.localdomain
    ID: ded90092-4000-426b-a3ca-08950e376242
    Docker Root Dir: /home/docker
    Debug Mode: false
    Experimental: false
    Insecure Registries:
    127.0.0.0/8
    Registry Mirrors:
    docker.1ms.run/
    dockerpull.com/
    registry.docker-cn.com/
    Live Restore Enabled: false

    II. Problem:
    How do I deploy the bge-m3 and bge-reranker-v2-m3 models on MetaX C500?
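
A rough check of memory headroom, using the `Memory-Usage` column from the mx-smi output above and the `--gpu-memory-utilization 0.1` value used for the bge launches elsewhere in this thread. This is a sketch under the assumption that vLLM needs roughly `utilization × total` MiB free on the card at startup; the helper name `headroom` is hypothetical, and the snapshot was taken before the failing launches, so real usage at that time may have differed.

```python
# Values copied from the mx-smi output in this post (MiB).
TOTAL_MIB = 65536
USED_MIB = [40353, 40993, 40353, 40993, 40993, 40353, 40993, 40353]


def headroom(total_mib, used_mib, utilization):
    """Return (requested budget, free memory, whether the budget fits), in MiB.

    Assumption: the budget is simply utilization * total device memory.
    """
    budget = total_mib * utilization
    free = total_mib - used_mib
    return budget, free, free >= budget


if __name__ == "__main__":
    for gpu, used in enumerate(USED_MIB):
        budget, free, fits = headroom(TOTAL_MIB, used, 0.1)
        print(f"GPU {gpu}: budget {budget:.1f} MiB, free {free} MiB, fits={fits}")
```

Under that assumption every card in the snapshot has enough free memory for a 0.1 budget, which would point away from plain memory exhaustion as the cause of the startup failure.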

  • langhongbin
    Members
    Qwen3.6 inference is too slow when deployed on 8 MetaX C500 cards (In progress) May 21, 2026 13:38

    [四 5月 21 11:15:48 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:15:48 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:15:48 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:15:48 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:15:58 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:15:58 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:15:58 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:15:58 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:16:09 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:16:09 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:16:09 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:16:09 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:16:19 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:16:19 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:16:19 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:16:19 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:16:29 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:16:29 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:16:29 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:16:29 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:16:39 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:16:39 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:16:39 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:16:39 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:16:50 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:16:50 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:16:50 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:16:50 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:17:29 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:17:29 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 11:17:29 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 11:17:29 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:11:31 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:11:31 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:11:31 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:11:31 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:11:41 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:11:41 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:11:41 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:11:41 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:11:51 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:11:51 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:11:51 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:11:51 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:12:02 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:12:02 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:12:02 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:12:02 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:12:12 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:12:12 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:12:12 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:12:12 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:12:22 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:12:22 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:12:22 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:12:22 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:12:32 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:12:32 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:12:32 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:12:32 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:18:14 2026] MXCD.B400.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:18:14 2026] MXCD.B400.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
    [四 5月 21 12:18:14 2026] MXCD.B500.D0.RINGBUF.ERROR wait_ret failed, -110
    [四 5月 21 12:18:14 2026] MXCD.B500.D0.RINGBUF.ERROR type 0x0 create ringbuf failed, -110
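
The `-110` in these kernel messages looks like a negative errno value, which is the usual Linux kernel convention. Whether the MXCD driver follows that convention is an assumption. A minimal sketch that decodes such a code with the Python standard library; the helper name `describe_kernel_errno` is hypothetical.

```python
import errno
import os


def describe_kernel_errno(code):
    """Map a negative kernel-style return code to (symbolic name, message)."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # On Linux, errno 110 is ETIMEDOUT.
    print(describe_kernel_errno(-110))
```

Read this way, the ring-buffer creation is timing out in the driver, which would be consistent with the `ioctl create queue block timeout` messages seen from user space in the other posts.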

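  • langhongbin
    Members
    Qwen3.6 inference is too slow when deployed on 8 MetaX C500 cards (In progress) May 21, 2026 12:30

    After this error persisted for a while, the service started normally, but inference is still very slow. The log is below: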

    (APIServer pid=53) INFO 05-21 12:28:35 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 15.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
    (APIServer pid=53) INFO 05-21 12:28:45 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 78.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
    (APIServer pid=53) INFO 05-21 12:28:55 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 78.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
    (APIServer pid=53) INFO 05-21 12:29:05 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 79.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
    (APIServer pid=53) INFO 05-21 12:29:15 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 79.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
    (APIServer pid=53) INFO: 10.217.247.136:40238 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    (APIServer pid=53) INFO 05-21 12:29:25 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 65.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
    (APIServer pid=53) INFO 05-21 12:29:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
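
To put a number on "slow", the generation throughput can be pulled out of the `loggers.py` lines. A minimal sketch, assuming the log format shown in this post; the helper names `generation_throughputs` and `mean_active` are hypothetical. Note that the log reports `Running: 1 reqs`, so these figures describe a single request rather than batched load.

```python
import re

# Captures the number in "Avg generation throughput: 78.9 tokens/s".
GEN_RE = re.compile(r"Avg generation throughput: ([0-9.]+) tokens/s")


def generation_throughputs(log_text):
    """Return every reported generation throughput, in tokens/s."""
    return [float(value) for value in GEN_RE.findall(log_text)]


def mean_active(values):
    """Average over intervals in which tokens were actually generated."""
    active = [v for v in values if v > 0]
    return sum(active) / len(active) if active else 0.0


# Three lines copied from the log in this post.
SAMPLE = "\n".join([
    "(APIServer pid=53) INFO 05-21 12:28:35 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 15.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%",
    "(APIServer pid=53) INFO 05-21 12:28:45 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 78.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%",
    "(APIServer pid=53) INFO 05-21 12:29:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%",
])

if __name__ == "__main__":
    values = generation_throughputs(SAMPLE)
    print(values, mean_active(values))
```

The first and last intervals of a request are partial, so the steady-state figure is better read from the middle lines of the log than from an overall mean.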

  • langhongbin
    Members
    Qwen3.6 inference is too slow when deployed on 8 MetaX C500 cards (In progress) May 21, 2026 12:16

    With four cards the same error still occurs. Launch command:
    nohup vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ -tp 4 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen3.6 \
    --dtype bfloat16 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768 \
    --max-num-batched-tokens 131072 \
    --max-num-seqs 64 \
    > qwen.log 2>&1 &

    Log:
    tail -500f qwen.log
    nohup: ignoring input
    INFO 05-21 12:08:03 [__init__.py:44] Available plugins for group vllm.platform_plugins:
    INFO 05-21 12:08:03 [__init__.py:46] - metax -> vllm_metax:register
    INFO 05-21 12:08:03 [__init__.py:49] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
    INFO 05-21 12:08:03 [__init__.py:239] Platform plugin metax is activated
    (EngineCore pid=758) INFO 05-21 12:08:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore pid=758) INFO 05-21 12:09:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore pid=758) INFO 05-21 12:10:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore pid=758) INFO 05-21 12:11:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore pid=758) INFO 05-21 12:12:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore pid=758) INFO 05-21 12:13:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore pid=758) INFO 05-21 12:14:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore pid=758) INFO 05-21 12:15:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore pid=758) INFO 05-21 12:16:12 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).

  • langhongbin
    Members
    Qwen3.6 inference is too slow when deployed on 8 MetaX C500 cards (In progress) May 21, 2026 11:05

    With two cards, the inference service hangs during startup.
    Service launch command:
    nohup vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ -tp 2 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen3.6 \
    --dtype bfloat16 \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768 \
    --max-num-batched-tokens 131072 \
    --max-num-seqs 64 \
    > qwen.log 2>&1 &

    Log output:
    (Worker_TP0 pid=1133)
    (Worker_TP0 pid=1133) INFO 05-21 10:58:12 [default_loader.py:384] Loading weights took 19.40 seconds
    (Worker_TP0 pid=1133) INFO 05-21 10:58:13 [gpu_model_runner.py:4820] Model loading took 32.86 GiB memory and 20.283825 seconds
    (Worker_TP0 pid=1133) INFO 05-21 10:58:15 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 131072 tokens, and profiled with 8 image items of the maximum feature size.
    (Worker_TP0 pid=1133) INFO 05-21 10:58:30 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/583c9adccf/rank_0_0/backbone for vLLM's torch.compile
    (Worker_TP0 pid=1133) INFO 05-21 10:58:30 [backends.py:1111] Dynamo bytecode transform time: 11.64 s
    (EngineCore pid=785) INFO 05-21 10:59:16 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore pid=785) INFO 05-21 11:00:16 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore pid=785) INFO 05-21 11:01:16 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore pid=785) INFO 05-21 11:02:16 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore pid=785) INFO 05-21 11:03:16 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
    (EngineCore pid=785) INFO 05-21 11:04:16 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).

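  • langhongbin
    Members
    Qwen3.6 inference is too slow when deployed on 8 MetaX C500 cards (In progress) May 21, 2026 10:50

    Do I need to add environment variables for tuning? If so, which ones specifically?

  • langhongbin
    Members
    Qwen3.6 inference is too slow when deployed on 8 MetaX C500 cards (In progress) May 21, 2026 10:45

    Single-card deployment runs out of GPU memory.

  • langhongbin
    Members
    Qwen3.6 inference is too slow when deployed on 8 MetaX C500 cards (In progress) May 21, 2026 09:08

    I. Hardware and software information:
    1. Server vendor: Inspur

    2. MetaX GPU model: MetaX C500, 8 cards

    3. OS kernel version: 6.6.0-32.7.v2505.ky11.x86_64

    4. CPU virtualization enabled: Yes

    5. mx-smi output: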
    mx-smi version: 2.2.12

    =================== MetaX System Management Interface Log ===================
    Timestamp : Wed May 20 18:14:56 2026

    Attached GPUs : 8
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.12 Kernel Mode Driver Version: 3.6.11 |
    | MACA Version: unknown BIOS Version: 1.31.1.0 |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
    | Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
    |==================+=================+=====================+======================|
    | 0 MetaX C500 | 0 Off | 0000:04:00.0 | 0% Disabled |
    | 82W / 350W | 61C P9 | 40353/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 1 MetaX C500 | 1 Off | 0000:05:00.0 | 0% Disabled |
    | 75W / 350W | 58C P9 | 40993/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 2 MetaX C500 | 2 Off | 0000:63:00.0 | 0% Disabled |
    | 80W / 350W | 56C P9 | 40353/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 3 MetaX C500 | 3 Off | 0000:64:00.0 | 0% Disabled |
    | 80W / 350W | 59C P9 | 40993/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 4 MetaX C500 | 4 Off | 0000:83:00.0 | 0% Disabled |
    | 82W / 350W | 56C P9 | 40993/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 5 MetaX C500 | 5 Off | 0000:84:00.0 | 0% Disabled |
    | 72W / 350W | 53C P9 | 40353/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 6 MetaX C500 | 6 Off | 0000:e4:00.0 | 0% Disabled |
    | 81W / 350W | 58C P9 | 40993/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 7 MetaX C500 | 7 Off | 0000:e5:00.0 | 0% Disabled |
    | 74W / 350W | 54C P9 | 40353/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process: |
    | GPU PID Process Name GPU Memory |
    | Usage(MiB) |
    |=================================================================================|
    | 0 1025936 VLLM::Worker_TP 39386 |
    | 1 1025937 VLLM::Worker_TP 40026 |
    | 2 1025938 VLLM::Worker_TP 39386 |
    | 3 1025939 VLLM::Worker_TP 40026 |
    | 4 1025940 VLLM::Worker_TP 40026 |
    | 5 1025941 VLLM::Worker_TP 39386 |
    | 6 1025942 VLLM::Worker_TP 40026 |
    | 7 1025943 VLLM::Worker_TP 39386 |
    +---------------------------------------------------------------------------------+

    6. docker info output:
    [root@localhost ~]# docker info
    Client:
    Version: 24.0.9
    Context: default
    Debug Mode: false

    Server:
    Containers: 1
    Running: 1
    Paused: 0
    Stopped: 0
    Images: 1
    Server Version: 24.0.9
    Storage Driver: overlay2
    Backing Filesystem: xfs
    Supports d_type: true
    Using metacopy: false
    Native Overlay Diff: true
    userxattr: false
    Logging Driver: json-file
    Cgroup Driver: cgroupfs
    Cgroup Version: 1
    Plugins:
    Volume: local
    Network: bridge host ipvlan macvlan null overlay
    Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
    Swarm: inactive
    Runtimes: io.containerd.runc.v2 runc
    Default Runtime: runc
    Init Binary: docker-init
    containerd version: 9a04df1519ac2967eece6c6a5d13d3b846b574b2.m
    runc version:
    init version:
    Security Options:
    seccomp
    Profile: builtin
    Kernel Version: 6.6.0-32.7.v2505.ky11.x86_64
    Operating System: Kylin Linux Advanced Server V11 (Swan25)
    OSType: linux
    Architecture: x86_64
    CPUs: 256
    Total Memory: 1.472TiB
    Name: localhost.localdomain
    ID: ded90092-4000-426b-a3ca-08950e376242
    Docker Root Dir: /home/docker
    Debug Mode: false
    Experimental: false
    Insecure Registries:
    127.0.0.0/8
    Registry Mirrors:
    docker.1ms.run/
    dockerpull.com/
    registry.docker-cn.com/
    Live Restore Enabled: false

    7. Image version:
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

    8. Container start command:
    docker run -itd \
    --name qwen3.6 \
    --network host \
    --shm-size 512G \
    --device=/dev/dri \
    --device=/dev/mxcd \
    --group-add video \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --shm-size 100gb \
    --ulimit memlock=-1 \
    -v /home/modelscope:/root/vllm \
    -e TZ=Asia/Shanghai \
    -p 8000:8000 \
    -p 8001:8001 \
    -p 8002:8002 \
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

    9. Command run inside the container:
    nohup vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ -tp 8 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen3.6 \
    --dtype bfloat16 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768 \
    --max-num-batched-tokens 327680 \
    --kv-cache-dtype fp8_e4m3 > qwen.log 2>&1 &

    II. Symptoms
    Inference is slow. First-round prompt prefill: 2.2 tokens/s (input processing is slow). Generation phase is stable at 70~73 tokens/s.
    Log output:
    (APIServer pid=254754) INFO 05-20 20:11:26 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage:
    0.6%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:11:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage
    : 0.7%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:11:46 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage
    : 0.9%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:11:56 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage
    : 1.2%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:12:06 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage
    : 1.3%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:12:16 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage
    : 1.6%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO: 10.217.247.136:54410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    (APIServer pid=254754) INFO 05-20 20:12:26 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 32.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage
    : 0.0%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:12:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage:
    0.0%, Prefix cache hit rate: 0.0%
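
One possible reading of the 2.2 tokens/s figure, sketched as arithmetic. It assumes each "Avg prompt throughput" value is an average over the whole interval between log lines (10 s here, going by the timestamps) rather than the prefill speed itself. The helper name `tokens_in_interval` is hypothetical.

```shell
#!/usr/bin/env bash
# Tokens processed in one logging interval = average rate * interval length.
# Assumption: the logged rate is averaged over the full interval, not just the busy part.
tokens_in_interval() {
  # $1 = logged tokens/s, $2 = interval length in seconds
  awk -v r="$1" -v s="$2" 'BEGIN { printf "%.0f\n", r * s }'
}

tokens_in_interval 2.2 10   # prompt tokens seen in the window ending 20:11:26
```

If the assumption holds, the prompt was only about 22 tokens, so the low figure would reflect a short prompt rather than slow prefill. Timing the first token of a long prompt would measure prefill more directly.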

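  • See post
    langhongbin
    Members
    Error deploying the Qwen3.6-35B-A3B model  [In progress]  May 20, 2026 20:16

    I. Hardware and software information:
    1. Server vendor: Inspur

    2. MetaX GPU model: MetaX C500, 8 cards

    3. OS kernel version: 6.6.0-32.7.v2505.ky11.x86_64

    4. CPU virtualization enabled: Yes

    5. mx-smi output: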
    mx-smi version: 2.2.12

    =================== MetaX System Management Interface Log ===================
    Timestamp : Wed May 20 18:14:56 2026

    Attached GPUs : 8
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.12 Kernel Mode Driver Version: 3.6.11 |
    | MACA Version: unknown BIOS Version: 1.31.1.0 |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
    | Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
    |==================+=================+=====================+======================|
    | 0 MetaX C500 | 0 Off | 0000:04:00.0 | 0% Disabled |
    | 82W / 350W | 61C P9 | 40353/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 1 MetaX C500 | 1 Off | 0000:05:00.0 | 0% Disabled |
    | 75W / 350W | 58C P9 | 40993/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 2 MetaX C500 | 2 Off | 0000:63:00.0 | 0% Disabled |
    | 80W / 350W | 56C P9 | 40353/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 3 MetaX C500 | 3 Off | 0000:64:00.0 | 0% Disabled |
    | 80W / 350W | 59C P9 | 40993/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 4 MetaX C500 | 4 Off | 0000:83:00.0 | 0% Disabled |
    | 82W / 350W | 56C P9 | 40993/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 5 MetaX C500 | 5 Off | 0000:84:00.0 | 0% Disabled |
    | 72W / 350W | 53C P9 | 40353/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 6 MetaX C500 | 6 Off | 0000:e4:00.0 | 0% Disabled |
    | 81W / 350W | 58C P9 | 40993/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+
    | 7 MetaX C500 | 7 Off | 0000:e5:00.0 | 0% Disabled |
    | 74W / 350W | 54C P9 | 40353/65536 MiB | Available |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process: |
    | GPU PID Process Name GPU Memory |
    | Usage(MiB) |
    |=================================================================================|
    | 0 1025936 VLLM::Worker_TP 39386 |
    | 1 1025937 VLLM::Worker_TP 40026 |
    | 2 1025938 VLLM::Worker_TP 39386 |
    | 3 1025939 VLLM::Worker_TP 40026 |
    | 4 1025940 VLLM::Worker_TP 40026 |
    | 5 1025941 VLLM::Worker_TP 39386 |
    | 6 1025942 VLLM::Worker_TP 40026 |
    | 7 1025943 VLLM::Worker_TP 39386 |
    +---------------------------------------------------------------------------------+

    6. docker info output:
    [root@localhost ~]# docker info
    Client:
    Version: 24.0.9
    Context: default
    Debug Mode: false

    Server:
    Containers: 1
    Running: 1
    Paused: 0
    Stopped: 0
    Images: 1
    Server Version: 24.0.9
    Storage Driver: overlay2
    Backing Filesystem: xfs
    Supports d_type: true
    Using metacopy: false
    Native Overlay Diff: true
    userxattr: false
    Logging Driver: json-file
    Cgroup Driver: cgroupfs
    Cgroup Version: 1
    Plugins:
    Volume: local
    Network: bridge host ipvlan macvlan null overlay
    Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
    Swarm: inactive
    Runtimes: io.containerd.runc.v2 runc
    Default Runtime: runc
    Init Binary: docker-init
    containerd version: 9a04df1519ac2967eece6c6a5d13d3b846b574b2.m
    runc version:
    init version:
    Security Options:
    seccomp
    Profile: builtin
    Kernel Version: 6.6.0-32.7.v2505.ky11.x86_64
    Operating System: Kylin Linux Advanced Server V11 (Swan25)
    OSType: linux
    Architecture: x86_64
    CPUs: 256
    Total Memory: 1.472TiB
    Name: localhost.localdomain
    ID: ded90092-4000-426b-a3ca-08950e376242
    Docker Root Dir: /home/docker
    Debug Mode: false
    Experimental: false
    Insecure Registries:
    127.0.0.0/8
    Registry Mirrors:
    docker.1ms.run/
    dockerpull.com/
    registry.docker-cn.com/
    Live Restore Enabled: false

    7. Image version:
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

    8. Container start command:
    docker run -itd \
    --name qwen3.6 \
    --network host \
    --shm-size 512G \
    --device=/dev/dri \
    --device=/dev/mxcd \
    --group-add video \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --shm-size 100gb \
    --ulimit memlock=-1 \
    -v /home/modelscope:/root/vllm \
    -e TZ=Asia/Shanghai \
    -p 8000:8000 \
    -p 8001:8001 \
    -p 8002:8002 \
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

    9. Command run inside the container:
    nohup vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ -tp 8 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen3.6 \
    --dtype bfloat16 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768 \
    --max-num-batched-tokens 327680 \
    --kv-cache-dtype fp8_e4m3 > qwen.log 2>&1 &

    II. Symptoms
    Inference is slow. First-round prompt prefill: 2.2 tokens/s (input processing is slow). Generation phase is stable at 70~73 tokens/s.
    Log output:
    (APIServer pid=254754) INFO 05-20 20:11:26 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage:
    0.6%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:11:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage
    : 0.7%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:11:46 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage
    : 0.9%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:11:56 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage
    : 1.2%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:12:06 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage
    : 1.3%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:12:16 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage
    : 1.6%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO: 10.217.247.136:54410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    (APIServer pid=254754) INFO 05-20 20:12:26 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 32.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage
    : 0.0%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:12:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage:
    0.0%, Prefix cache hit rate: 0.0%

  • See post
    langhongbin
    Members
    Error deploying the Qwen3.6-35B-A3B model  [In progress]  May 20, 2026 16:41

    Deploying the Qwen3.6-35B-A3B model on 8 MetaX C500 cards. The container start command is:
    docker run -itd \
    --name qwen3.6 \
    --network host \
    --shm-size 512G \
    --device=/dev/dri \
    --device=/dev/mxcd \
    --group-add video \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --shm-size 100gb \
    --ulimit memlock=-1 \
    -v /home/modelscope:/root/vllm \
    -e TZ=Asia/Shanghai \
    -p 8000:8000 \
    -p 8001:8001 \
    -p 8002:8002 \
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

    The vLLM start command is:
    vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ -tp 8 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen3.6 \
    --dtype bfloat16 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768 \
    --max-num-batched-tokens 524288 \
    --kv-cache-dtype fp8_e4m3

    The error output is:
    (EngineCore pid=157812) ERROR 05-20 16:37:27 [core.py:1108] RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacity of 63.59 GiB of which 22.74
    GiB is free. Of the allocated memory 35.24 GiB is allocated by PyTorch, and 442.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_
    CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace abov
    e for the root cause
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return self._call_impl(*args, **kwargs)
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return forward_call(*args, **kwargs)
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "<eval_with_key>.82", line 258, in forward
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] submod_2 = self.submod_2(getitem_3, s59, getitem_4, l_self_modules_layers_modules_0_modules_linear_attn_modules_norm_parameter
    s_weight_, getitem_5, l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_, getitem_6, s18, l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_
    weight_, l_inputs_embeds_, l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_, l_self_
    modules_layers_modules_1_modules_linear_attn_modules_in_proj_ba_parameters_weight_); getitem_3 = getitem_4 = l_self_modules_layers_modules_0_modules_linear_attn_modules_norm_parameters_weight_ = getitem
    _5 = l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_ = getitem_6 = l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_ = l_inputs_e
    mbeds_ = l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_ = l_self_modules_layers_m
    odules_1_modules_linear_attn_modules_in_proj_ba_parameters_weight_ = None
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return self.runnable(*args, **kwargs)
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/vllm/compilation/piecewise_backend.py", line 367, in __call__
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return range_entry.runnable(*args)
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 62, in __call__
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return self._compiled_fn(*args)
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return fn(*args, **kwargs)
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1241, in forward
    (Worker_TP5 pid=158165) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP4 pid=158164) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP0 pid=158160) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP2 pid=158162) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP6 pid=158166) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP1 pid=158161) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP7 pid=158167) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP3 pid=158163) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (EngineCore pid=157812) ERROR 05-20 16:37:38 [multiproc_executor.py:273] Worker proc VllmWorker-4 died unexpectedly, shutting down executor.
    (EngineCore pid=157812) Process EngineCore:
    (EngineCore pid=157812) Traceback (most recent call last):
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    (EngineCore pid=157812) self.run()
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 108, in run
    (EngineCore pid=157812) self._target(*self._args, **self._kwargs)
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core
    (EngineCore pid=157812) raise e
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
    (EngineCore pid=157812) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
    (EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    (EngineCore pid=157812) return func(*args, **kwargs)
    (EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
    (EngineCore pid=157812) super().__init__(
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 124, in __init__
    (EngineCore pid=157812) kv_cache_config = self._initialize_kv_caches(vllm_config)
    (EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    (EngineCore pid=157812) return func(*args, **kwargs)
    (EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
    (EngineCore pid=157812) available_gpu_memory = self.model_executor.determine_available_memory()
    (EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
    (EngineCore pid=157812) return self.collective_rpc("determine_available_memory")
    (EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 397, in collective_rpc
    (EngineCore pid=157812) return aggregate(get_response())
    (EngineCore pid=157812) ^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 380, in get_response
    (EngineCore pid=157812) raise RuntimeError(
    (EngineCore pid=157812) RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacity of 63.59 GiB of which 22.74 GiB is free. Of the allocated memor
    y 35.24 GiB is allocated by PyTorch, and 442.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avo
    id fragmentation. See documentation for Memory Management (pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause
    (APIServer pid=157458) Traceback (most recent call last):
    (APIServer pid=157458) File "/opt/conda/bin/vllm", line 8, in <module>
    (APIServer pid=157458) sys.exit(main())
    (APIServer pid=157458) ^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
    (APIServer pid=157458) args.dispatch_function(args)
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
    (APIServer pid=157458) uvloop.run(run_server(args))
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
    (APIServer pid=157458) return asyncio.run(
    (APIServer pid=157458) ^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 195, in run
    (APIServer pid=157458) return runner.run(main)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 118, in run
    (APIServer pid=157458) return self._loop.run_until_complete(task)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
    (APIServer pid=157458) return await main
    (APIServer pid=157458) ^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
    (APIServer pid=157458) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
    (APIServer pid=157458) async with build_async_engine_client(
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
    (APIServer pid=157458) return await anext(self.gen)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
    (APIServer pid=157458) async with build_async_engine_client_from_engine_args(
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
    (APIServer pid=157458) return await anext(self.gen)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
    (APIServer pid=157458) async_llm = AsyncLLM.from_vllm_config(
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
    (APIServer pid=157458) return cls(
    (APIServer pid=157458) ^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
    (APIServer pid=157458) self.engine_core = EngineCoreClient.make_async_mp_client(
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    (APIServer pid=157458) return func(*args, **kwargs)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
    (APIServer pid=157458) return AsyncMPClient(*client_args)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    (APIServer pid=157458) return func(*args, **kwargs)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 887, in __init__
    (APIServer pid=157458) super().__init__(
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
    (APIServer pid=157458) with launch_core_engines(
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/contextlib.py", line 144, in __exit__
    (APIServer pid=157458) next(self.gen)
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
    (APIServer pid=157458) wait_for_engine_startup(
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
    (APIServer pid=157458) raise RuntimeError(
    (APIServer pid=157458) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
    /opt/conda/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 8 leaked shared_memory objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
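
A small arithmetic sketch relating the failed allocation to the launch flags, using only numbers from this post: the 32.00 GiB request and `--max-num-batched-tokens 524288`. The helper name `bytes_per_token` is hypothetical, and the link between the two numbers is an inference, not something the log states.

```shell
#!/usr/bin/env bash
# Bytes per batched token implied by the failed allocation.
# Inputs are taken from the post: 32 GiB requested, 524288 max batched tokens.
bytes_per_token() {
  # $1 = allocation size in GiB (integer), $2 = max batched tokens
  echo $(( $1 * 1024 * 1024 * 1024 / $2 ))
}

bytes_per_token 32 524288
```

The request divides evenly into 64 KiB per token, which suggests, without proving, that the buffer is sized by `--max-num-batched-tokens`. At the same per-token size, the 327680 used in the later posts would need 20 GiB, which is below the 22.74 GiB reported free here.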
