• Members 12 posts
    May 20, 2026, 16:41

    Deploying the Qwen3.6-35B-A3B model on 8 MetaX C500 cards. The container is started with the following command:
    docker run -itd \
    --name qwen3.6 \
    --network host \
    --shm-size 512G \
    --device=/dev/dri \
    --device=/dev/mxcd \
    --group-add video \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --shm-size 100gb \
    --ulimit memlock=-1 \
    -v /home/modelscope:/root/vllm \
    -e TZ=Asia/Shanghai \
    -p 8000:8000 \
    -p 8001:8001 \
    -p 8002:8002 \
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

    The vLLM launch command is:
    vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ -tp 8 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen3.6 \
    --dtype bfloat16 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768 \
    --max-num-batched-tokens 524288 \
    --kv-cache-dtype fp8_e4m3

    The error output is:
    (EngineCore pid=157812) ERROR 05-20 16:37:27 [core.py:1108] RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacity of 63.59 GiB of which 22.74 GiB is free. Of the allocated memory 35.24 GiB is allocated by PyTorch, and 442.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return self._call_impl(*args, **kwargs)
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return forward_call(*args, **kwargs)
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "<eval_with_key>.82", line 258, in forward
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] submod_2 = self.submod_2(getitem_3, s59, getitem_4, l_self_modules_layers_modules_0_modules_linear_attn_modules_norm_parameters_weight_, getitem_5, l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_, getitem_6, s18, l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_, l_inputs_embeds_, l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_, l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_ba_parameters_weight_); getitem_3 = getitem_4 = l_self_modules_layers_modules_0_modules_linear_attn_modules_norm_parameters_weight_ = getitem_5 = l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_ = getitem_6 = l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_ = l_inputs_embeds_ = l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_ = l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_ba_parameters_weight_ = None
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return self.runnable(*args, **kwargs)
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/vllm/compilation/piecewise_backend.py", line 367, in __call__
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return range_entry.runnable(*args)
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 62, in __call__
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return self._compiled_fn(*args)
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] return fn(*args, **kwargs)
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^
    (Worker_TP5 pid=158165) ERROR 05-20 16:37:27 [multiproc_executor.py:949] File "/opt/conda/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1241, in forward
    (Worker_TP5 pid=158165) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP4 pid=158164) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP0 pid=158160) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP2 pid=158162) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP6 pid=158166) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP1 pid=158161) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP7 pid=158167) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (Worker_TP3 pid=158163) WARNING 05-20 16:37:27 [multiproc_executor.py:871] WorkerProc was terminated
    (EngineCore pid=157812) ERROR 05-20 16:37:38 [multiproc_executor.py:273] Worker proc VllmWorker-4 died unexpectedly, shutting down executor.
    (EngineCore pid=157812) Process EngineCore:
    (EngineCore pid=157812) Traceback (most recent call last):
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    (EngineCore pid=157812) self.run()
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 108, in run
    (EngineCore pid=157812) self._target(*self._args, **self._kwargs)
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core
    (EngineCore pid=157812) raise e
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
    (EngineCore pid=157812) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
    (EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    (EngineCore pid=157812) return func(*args, **kwargs)
    (EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
    (EngineCore pid=157812) super().__init__(
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 124, in __init__
    (EngineCore pid=157812) kv_cache_config = self._initialize_kv_caches(vllm_config)
    (EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    (EngineCore pid=157812) return func(*args, **kwargs)
    (EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
    (EngineCore pid=157812) available_gpu_memory = self.model_executor.determine_available_memory()
    (EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
    (EngineCore pid=157812) return self.collective_rpc("determine_available_memory")
    (EngineCore pid=157812) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 397, in collective_rpc
    (EngineCore pid=157812) return aggregate(get_response())
    (EngineCore pid=157812) ^^^^^^^^^^^^^^
    (EngineCore pid=157812) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 380, in get_response
    (EngineCore pid=157812) raise RuntimeError(
    (EngineCore pid=157812) RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacity of 63.59 GiB of which 22.74 GiB is free. Of the allocated memory 35.24 GiB is allocated by PyTorch, and 442.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause
    (APIServer pid=157458) Traceback (most recent call last):
    (APIServer pid=157458) File "/opt/conda/bin/vllm", line 8, in <module>
    (APIServer pid=157458) sys.exit(main())
    (APIServer pid=157458) ^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
    (APIServer pid=157458) args.dispatch_function(args)
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
    (APIServer pid=157458) uvloop.run(run_server(args))
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
    (APIServer pid=157458) return asyncio.run(
    (APIServer pid=157458) ^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 195, in run
    (APIServer pid=157458) return runner.run(main)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 118, in run
    (APIServer pid=157458) return self._loop.run_until_complete(task)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
    (APIServer pid=157458) return await main
    (APIServer pid=157458) ^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
    (APIServer pid=157458) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
    (APIServer pid=157458) async with build_async_engine_client(
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
    (APIServer pid=157458) return await anext(self.gen)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
    (APIServer pid=157458) async with build_async_engine_client_from_engine_args(
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
    (APIServer pid=157458) return await anext(self.gen)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
    (APIServer pid=157458) async_llm = AsyncLLM.from_vllm_config(
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
    (APIServer pid=157458) return cls(
    (APIServer pid=157458) ^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
    (APIServer pid=157458) self.engine_core = EngineCoreClient.make_async_mp_client(
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    (APIServer pid=157458) return func(*args, **kwargs)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
    (APIServer pid=157458) return AsyncMPClient(*client_args)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    (APIServer pid=157458) return func(*args, **kwargs)
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 887, in __init__
    (APIServer pid=157458) super().__init__(
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
    (APIServer pid=157458) with launch_core_engines(
    (APIServer pid=157458) ^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/contextlib.py", line 144, in __exit__
    (APIServer pid=157458) next(self.gen)
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
    (APIServer pid=157458) wait_for_engine_startup(
    (APIServer pid=157458) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
    (APIServer pid=157458) raise RuntimeError(
    (APIServer pid=157458) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
    /opt/conda/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 8 leaked shared_memory objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
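
A quick arithmetic check on the numbers in the error above, offered as a hedged sketch rather than a diagnosis: dividing the failed 32.00 GiB allocation by the --max-num-batched-tokens value of 524288 gives exactly 65536 bytes per batched token, which hints that the buffer is sized by the token budget. The linear scaling used below to project other budgets is an assumption that the log does not confirm, and estimated_alloc_gib is a hypothetical helper name.

```python
GIB = 1024 ** 3

failed_alloc_bytes = 32 * GIB       # "Tried to allocate 32.00 GiB"
free_gib = 22.74                    # "22.74 GiB is free"
max_num_batched_tokens = 524_288    # value passed to vllm serve

# Size of the failed allocation per token of the batch budget.
bytes_per_token = failed_alloc_bytes / max_num_batched_tokens
print(f"bytes per batched token: {bytes_per_token:.0f}")


def estimated_alloc_gib(token_budget: int) -> float:
    """Project the allocation for another budget.

    Assumption: the buffer grows linearly with the token budget.
    """
    return token_budget * bytes_per_token / GIB


for budget in (524_288, 327_680, 32_768, 8_192):
    need = estimated_alloc_gib(budget)
    verdict = "fits" if need <= free_gib else "does not fit"
    print(f"{budget:>7} tokens -> ~{need:.1f} GiB, {verdict} in {free_gib} GiB free")
```

Under that assumption, a smaller budget such as 327680 tokens would need about 20 GiB, which is below the 22.74 GiB reported free, while 524288 needs 32 GiB and does not fit.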

  • Thread has been moved from 产品&运维 (Products & Operations).

  • Members 12 posts
    May 20, 2026, 20:16

    I. Hardware and software information:
    1. Server vendor: Inspur

    2. MetaX GPU model: MetaX C500, 8 cards

    3. OS kernel version: 6.6.0-32.7.v2505.ky11.x86_64

    4. CPU virtualization enabled: yes

    5. mx-smi output:
    mx-smi version: 2.2.12

    =================== MetaX System Management Interface Log ===================
    Timestamp : Wed May 20 18:14:56 2026

    Attached GPUs : 8
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.2.12                       Kernel Mode Driver Version: 3.6.11          |
    | MACA Version: unknown               BIOS Version: 1.31.1.0                      |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name       | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
    | Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0  MetaX C500    | 0           Off | 0000:04:00.0        | 0%          Disabled |
    | 82W / 350W       | 61C          P9 | 40353/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 1  MetaX C500    | 1           Off | 0000:05:00.0        | 0%          Disabled |
    | 75W / 350W       | 58C          P9 | 40993/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 2  MetaX C500    | 2           Off | 0000:63:00.0        | 0%          Disabled |
    | 80W / 350W       | 56C          P9 | 40353/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 3  MetaX C500    | 3           Off | 0000:64:00.0        | 0%          Disabled |
    | 80W / 350W       | 59C          P9 | 40993/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 4  MetaX C500    | 4           Off | 0000:83:00.0        | 0%          Disabled |
    | 82W / 350W       | 56C          P9 | 40993/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 5  MetaX C500    | 5           Off | 0000:84:00.0        | 0%          Disabled |
    | 72W / 350W       | 53C          P9 | 40353/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 6  MetaX C500    | 6           Off | 0000:e4:00.0        | 0%          Disabled |
    | 81W / 350W       | 58C          P9 | 40993/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+
    | 7  MetaX C500    | 7           Off | 0000:e5:00.0        | 0%          Disabled |
    | 74W / 350W       | 54C          P9 | 40353/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    |  GPU                   PID         Process Name                 GPU Memory      |
    |                                                                 Usage(MiB)      |
    |=================================================================================|
    |  0                 1025936         VLLM::Worker_TP              39386           |
    |  1                 1025937         VLLM::Worker_TP              40026           |
    |  2                 1025938         VLLM::Worker_TP              39386           |
    |  3                 1025939         VLLM::Worker_TP              40026           |
    |  4                 1025940         VLLM::Worker_TP              40026           |
    |  5                 1025941         VLLM::Worker_TP              39386           |
    |  6                 1025942         VLLM::Worker_TP              40026           |
    |  7                 1025943         VLLM::Worker_TP              39386           |
    +---------------------------------------------------------------------------------+

    6. docker info output:
    [root@localhost ~]# docker info
    Client:
     Version: 24.0.9
     Context: default
     Debug Mode: false

    Server:
     Containers: 1
      Running: 1
      Paused: 0
      Stopped: 0
     Images: 1
     Server Version: 24.0.9
     Storage Driver: overlay2
      Backing Filesystem: xfs
      Supports d_type: true
      Using metacopy: false
      Native Overlay Diff: true
      userxattr: false
     Logging Driver: json-file
     Cgroup Driver: cgroupfs
     Cgroup Version: 1
     Plugins:
      Volume: local
      Network: bridge host ipvlan macvlan null overlay
      Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
     Swarm: inactive
     Runtimes: io.containerd.runc.v2 runc
     Default Runtime: runc
     Init Binary: docker-init
     containerd version: 9a04df1519ac2967eece6c6a5d13d3b846b574b2.m
     runc version:
     init version:
     Security Options:
      seccomp
       Profile: builtin
     Kernel Version: 6.6.0-32.7.v2505.ky11.x86_64
     Operating System: Kylin Linux Advanced Server V11 (Swan25)
     OSType: linux
     Architecture: x86_64
     CPUs: 256
     Total Memory: 1.472TiB
     Name: localhost.localdomain
     ID: ded90092-4000-426b-a3ca-08950e376242
     Docker Root Dir: /home/docker
     Debug Mode: false
     Experimental: false
     Insecure Registries:
      127.0.0.0/8
     Registry Mirrors:
      docker.1ms.run/
      dockerpull.com/
      registry.docker-cn.com/
     Live Restore Enabled: false

    7. Image version:
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

    8. Container start command:
    docker run -itd \
    --name qwen3.6 \
    --network host \
    --shm-size 512G \
    --device=/dev/dri \
    --device=/dev/mxcd \
    --group-add video \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --shm-size 100gb \
    --ulimit memlock=-1 \
    -v /home/modelscope:/root/vllm \
    -e TZ=Asia/Shanghai \
    -p 8000:8000 \
    -p 8001:8001 \
    -p 8002:8002 \
    cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

    9. Command run inside the container:
    nohup vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ -tp 8 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen3.6 \
    --dtype bfloat16 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768 \
    --max-num-batched-tokens 327680 \
    --kv-cache-dtype fp8_e4m3 >qwen.log 2>& 1 &

    II. Symptoms
    Inference is slow. First-round prompt prefill: 2.2 tokens/s (input processing is slow); the generation phase is stable at 70~73 tokens/s.
    The log output is:
    (APIServer pid=254754) INFO 05-20 20:11:26 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:11:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:11:46 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:11:56 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.2%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:12:06 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:12:16 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO: 10.217.247.136:54410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    (APIServer pid=254754) INFO 05-20 20:12:26 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 32.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
    (APIServer pid=254754) INFO 05-20 20:12:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
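
One way to read these stats lines, sketched under an assumption: the "Avg ... throughput" figures look like token counts averaged over the whole reporting interval, which the timestamps suggest is about 10 seconds. If so, 2.2 tokens/s would correspond to a prompt of roughly 22 tokens that arrived during that window, not to the actual prefill speed. The snippet below parses the throughput fields and converts them back to approximate token counts; LOG_INTERVAL_S and parse_throughput are illustrative names, not part of vLLM.

```python
import re
from typing import Optional, Tuple

# Assumption: one stats line every 10 seconds, as the timestamps in the
# log (20:11:26, 20:11:36, ...) suggest.
LOG_INTERVAL_S = 10.0

THROUGHPUT_RE = re.compile(
    r"Avg prompt throughput: (?P<prompt>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s"
)


def parse_throughput(line: str) -> Optional[Tuple[float, float]]:
    """Return (prompt_tps, generation_tps) for a stats line, else None."""
    match = THROUGHPUT_RE.search(line)
    if match is None:
        return None
    return float(match["prompt"]), float(match["gen"])


# Stats lines from the log, trimmed to the throughput fields.
log_lines = [
    "Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 7.1 tokens/s",
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.8 tokens/s",
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.9 tokens/s",
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.6 tokens/s",
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.7 tokens/s",
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.0 tokens/s",
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 32.3 tokens/s",
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s",
]

samples = [s for s in map(parse_throughput, log_lines) if s is not None]

# Convert interval averages back into approximate token counts.
prompt_tokens = sum(p for p, _ in samples) * LOG_INTERVAL_S
generated_tokens = sum(g for _, g in samples) * LOG_INTERVAL_S

# Intervals that were spent entirely on decoding (threshold is arbitrary).
steady = [g for _, g in samples if g > 50.0]
steady_tps = sum(steady) / len(steady)

print(f"approx. prompt tokens:    {prompt_tokens:.0f}")
print(f"approx. generated tokens: {generated_tokens:.0f}")
print(f"steady decode rate:       {steady_tps:.1f} tokens/s")
```

To measure prefill speed directly, timing a request with a long prompt and a very small output limit would give a clearer number than these interval averages.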

  • Members 458 posts
    May 21, 2026, 10:32

    Dear developer, hello. Please try running inference on a single card.
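
Before trying a single card, a rough capacity estimate may be useful. The sketch below assumes about 35e9 parameters (read off the model name, not confirmed anywhere in this thread) stored at 2 bytes each for bfloat16, and compares the per-card weight size against the 63.59 GiB capacity from the error message and the 0.8 memory utilization setting. It ignores the KV cache, activations, and any tensors not stored in bf16, so treat the output as an order-of-magnitude guide only.

```python
GIB = 1024 ** 3

PARAMS = 35e9                  # assumption: taken from "35B" in the model name
BYTES_PER_PARAM = 2            # bfloat16, as set by --dtype bfloat16
CARD_CAPACITY_GIB = 63.59      # total capacity reported in the OOM message
GPU_MEMORY_UTILIZATION = 0.8   # value of --gpu-memory-utilization


def weight_gib_per_card(tensor_parallel_size: int) -> float:
    """Weights per card, assuming an even split across the TP group."""
    return PARAMS * BYTES_PER_PARAM / tensor_parallel_size / GIB


budget_gib = CARD_CAPACITY_GIB * GPU_MEMORY_UTILIZATION

for tp in (1, 2, 4, 8):
    need = weight_gib_per_card(tp)
    verdict = "within" if need <= budget_gib else "exceeds"
    print(f"tp={tp}: ~{need:.1f} GiB of weights per card, {verdict} the {budget_gib:.1f} GiB budget")
```

With these assumptions the bf16 weights alone come to roughly 65 GiB at tensor parallel size 1, which is more than one card holds, so a single-card run may need a quantized checkpoint or a smaller test model.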