1. Model: cpatonn-mirror/Qwen3-Next-80B-A3B-Thinking-AWQ-8bit
2. Image: cr.metax-tech.com/public-ai-release/maca/modelzoo.llm.vllm:1.0.0-maca.ai3.2.1.8-torch2.6-py310-ubuntu22.04-amd64
3. GPU: C500 x 8
4. OS: Ubuntu 22.04
Command executed:
docker run --device=/dev/dri --device=/dev/mxcd --name Qwen3-Next-80B-A3B-Thinking-AWQ-8bit -v /8T/xxxx/model/cpatonn-mirror/Qwen3-Next-80B-A3B-Thinking-AWQ-8bit:/data/Qwen3-Next-80B-A3B-Thinking-AWQ-8bit -e CUDA_VISIBLE_DEVICES=4,5,6,7, -e TRITON_ENABLE_MACA_OPT_MOVE_DOT_OPERANDS_OUT_LOOP=1 -e TRITON_DISABLE_MACA_OPT_MMA_PREFETCH=1 -e TRITON_ENABLE_MACA_CHAIN_DOT_OPT=1 -e TRITON_ENABLE_MACA_COMPILER_INT8_OPT=True -e MACA_SMALL_PAGESIZE_ENABLE=1 -p 2031:30889 --security-opt seccomp=unconfined --security-opt apparmor=unconfined --shm-size 100gb --ulimit memlock=-1 --group-add video af4bbc08aa93 /opt/conda/bin/python -m vllm.entrypoints.openai.api_server --model /data/Qwen3-Next-80B-A3B-Thinking-AWQ-8bit --api-key c01b24fc-4bf1-4871-a1c3-8663e151555b --served-model-name Qwen3-Next-80B-A3B-Thinking-AWQ-8bit --max-model-len 8192 --gpu-memory-utilization 0.95 --port 30889 --tensor-parallel-size 4 --disable-log-stats --disable-log-requests --max-num-seqs 50
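Kernel selection for a quantized checkpoint depends on its bit width and group size, so it can help to read those values from the checkpoint before launching. The sketch below assumes the checkpoint's config.json follows the usual compressed-tensors layout (`quantization_config` -> `config_groups` -> `<group>` -> `weights`); that layout is an assumption and was not verified against this particular checkpoint. The helper name `weight_schemes` is hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path


def weight_schemes(config: dict) -> list[tuple[str, int | None, int | None]]:
    """Return (group name, num_bits, group_size) for each quantization group.

    Assumes the compressed-tensors layout described above; missing keys
    simply yield None instead of raising.
    """
    quant = config.get("quantization_config", {})
    groups = quant.get("config_groups", {})
    result = []
    for name, group in groups.items():
        weights = group.get("weights") or {}
        result.append((name, weights.get("num_bits"), weights.get("group_size")))
    return result


if __name__ == "__main__":
    # In-container path taken from the -v mount in the command above.
    path = Path("/data/Qwen3-Next-80B-A3B-Thinking-AWQ-8bit/config.json")
    if path.exists():
        for row in weight_schemes(json.loads(path.read_text())):
            print(row)
```

Run inside the container (or against the host path of the mounted model directory) to see which bit width and group size the kernel selector will be asked to support.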
Error output:
INFO 11-13 20:06:15 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_482a0da8'), local_subscribe_addr='ipc:///tmp/fafe91bf-0c90-4f70-81ca-b406e9c4f98c', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-13 20:06:15 [parallel_state.py:1165] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 11-13 20:06:15 [parallel_state.py:1165] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 11-13 20:06:15 [parallel_state.py:1165] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
INFO 11-13 20:06:15 [parallel_state.py:1165] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(Worker_TP2 pid=435) INFO 11-13 20:06:15 [gpu_model_runner.py:2338] Starting to load model /data/Qwen3-Next-80B-A3B-Thinking-AWQ-8bit...
(Worker_TP3 pid=449) INFO 11-13 20:06:15 [gpu_model_runner.py:2338] Starting to load model /data/Qwen3-Next-80B-A3B-Thinking-AWQ-8bit...
(Worker_TP1 pid=427) INFO 11-13 20:06:15 [gpu_model_runner.py:2338] Starting to load model /data/Qwen3-Next-80B-A3B-Thinking-AWQ-8bit...
(Worker_TP0 pid=424) INFO 11-13 20:06:15 [gpu_model_runner.py:2338] Starting to load model /data/Qwen3-Next-80B-A3B-Thinking-AWQ-8bit...
(Worker_TP2 pid=435) INFO 11-13 20:06:15 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP1 pid=427) INFO 11-13 20:06:15 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP3 pid=449) INFO 11-13 20:06:15 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP2 pid=435) torch_dtype is deprecated! Use dtype instead!
(Worker_TP0 pid=424) INFO 11-13 20:06:15 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP2 pid=435) INFO 11-13 20:06:15 [compressed_tensors.py:122] Using CompressedTensorsWNA16MoEMethod
(Worker_TP0 pid=424) torch_dtype is deprecated! Use dtype instead!
(Worker_TP0 pid=424) INFO 11-13 20:06:15 [compressed_tensors.py:122] Using CompressedTensorsWNA16MoEMethod
(Worker_TP3 pid=449) torch_dtype is deprecated! Use dtype instead!
(Worker_TP1 pid=427) torch_dtype is deprecated! Use dtype instead!
(Worker_TP3 pid=449) INFO 11-13 20:06:15 [compressed_tensors.py:122] Using CompressedTensorsWNA16MoEMethod
(Worker_TP1 pid=427) INFO 11-13 20:06:15 [compressed_tensors.py:122] Using CompressedTensorsWNA16MoEMethod
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] WorkerProc failed to start.
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] Traceback (most recent call last):
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 559, in worker_main
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] worker = WorkerProc(*args, **kwargs)
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 427, in __init__
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] self.worker.load_model()
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2371, in load_model
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] self.model = model_loader.load_model(
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 45, in load_model
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] model = initialize_model(vllm_config=vllm_config,
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_next.py", line 1079, in __init__
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] self.model = Qwen3NextModel(vllm_config=vllm_config,
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 199, in __init__
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_next.py", line 915, in __init__
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 642, in make_layers
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] [PPMissingLayer() for _ in range(start_layer)] + [
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 643, in <listcomp>
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_next.py", line 904, in get_layer
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] return Qwen3NextDecoderLayer(
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_next.py", line 782, in __init__
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] self.mlp = Qwen3NextSparseMoeBlock(
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_next.py", line 134, in __init__
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] self.shared_expert = Qwen3NextMLP(
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 77, in __init__
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] self.gate_up_proj = MergedColumnParallelLinear(
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 588, in __init__
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] super().__init__(input_size=input_size,
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 442, in __init__
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] self.quant_method.create_weights(
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 729, in create_weights
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] layer.scheme.create_weights(
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16.py", line 92, in create_weights
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] kernel_type = choose_mp_linear_kernel(mp_linear_kernel_config)
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/kernels/mixed_precision/__init__.py", line 90, in choose_mp_linear_kernel
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] raise ValueError(
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] ValueError: Failed to find a kernel that can implement the WNA16 linear layer. Reasons:
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] CutlassW4A8LinearKernel cannot implement due to: CUTLASS only supported on CUDA
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] MacheteLinearKernel cannot implement due to: Machete only supported on CUDA
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] AllSparkLinearKernel cannot implement due to: AllSpark currently does not support device_capability = 90.
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] MarlinLinearKernel cannot implement due to: Marlin only supported on CUDA
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] Dynamic4bitLinearKernel cannot implement due to: Only CPU is supported
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] BitBLASLinearKernel cannot implement due to: bitblas is not installed. Please install bitblas by running pip install bitblas>=0.1.0
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] ConchLinearKernel cannot implement due to: Group size (32) not supported by ConchLinearKernel, supported group sizes are: [-1, 128]
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] ExllamaLinearKernel cannot implement due to: Exllama only supports float16 activations
(Worker_TP2 pid=435) INFO 11-13 20:06:15 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP0 pid=424) INFO 11-13 20:06:15 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP1 pid=427) INFO 11-13 20:06:15 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP3 pid=449) INFO 11-13 20:06:16 [multiproc_executor.py:546] Parent process exited, terminating worker
[rank0]:[W1113 20:06:16.555908797 ProcessGroupNCCL.cpp:1502] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] EngineCore failed to start.
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] Traceback (most recent call last):
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] self._init_executor()
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 99, in _init_executor
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 497, in wait_for_ready
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] raise e from None
(EngineCore_DP0 pid=286) ERROR 11-13 20:06:19 [core.py:718] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=286) Process EngineCore_DP0:
(EngineCore_DP0 pid=286) Traceback (most recent call last):
(EngineCore_DP0 pid=286) File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=286) self.run()
(EngineCore_DP0 pid=286) File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=286) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=286) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_DP0 pid=286) raise e
(EngineCore_DP0 pid=286) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=286) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=286) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=286) super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=286) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=286) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=286) File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=286) self._init_executor()
(EngineCore_DP0 pid=286) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 99, in _init_executor
(EngineCore_DP0 pid=286) self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=286) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 497, in wait_for_ready
(EngineCore_DP0 pid=286) raise e from None
(EngineCore_DP0 pid=286) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(APIServer pid=1) return _run_code(code, main_globals, None,
(APIServer pid=1) File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
(APIServer pid=1) exec(code, run_globals)
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 2011, in <module>
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run
(APIServer pid=1) return loop.run_until_complete(wrapper())
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, uvicorn_kwargs)
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/vllm/utils/__init__.py", line 1589, in inner
(APIServer pid=1) return fn(*args, **kwargs)
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 212, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 136, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=1) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=1) File "/opt/conda/lib/python3.10/contextlib.py", line 142, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 729, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
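
The root cause in the log above is the `ValueError` raised by `choose_mp_linear_kernel`: every candidate WNA16 kernel was rejected, each for its own reason. Because the same traceback is repeated by the engine core and API server, it can be easier to pull those reasons out with a small script. The following is a minimal sketch, not part of vLLM; the helper name `kernel_rejections` and the regular expression are assumptions based only on the message format seen in this log.

```python
import re

# Matches lines such as
# "... FooLinearKernel cannot implement due to: <reason>"
REJECTION = re.compile(r"(\w+LinearKernel) cannot implement due to: (.*)$")


def kernel_rejections(log_text: str) -> dict[str, str]:
    """Map each rejected kernel name to the reason reported in the log."""
    reasons: dict[str, str] = {}
    for line in log_text.splitlines():
        match = REJECTION.search(line)
        if match:
            # Keep the first occurrence if several workers log the same kernel.
            reasons.setdefault(match.group(1), match.group(2).strip())
    return reasons


# Two lines copied verbatim from the log above.
SAMPLE = """\
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] ConchLinearKernel cannot implement due to: Group size (32) not supported by ConchLinearKernel, supported group sizes are: [-1, 128]
(Worker_TP2 pid=435) ERROR 11-13 20:06:15 [multiproc_executor.py:585] ExllamaLinearKernel cannot implement due to: Exllama only supports float16 activations
"""

if __name__ == "__main__":
    for kernel, reason in kernel_rejections(SAMPLE).items():
        print(f"{kernel}: {reason}")
```

In this log, most kernels are rejected because they are CUDA-only, while the Conch rejection cites the checkpoint's group size (32) and the Exllama rejection cites the activation dtype. Those two messages suggest that a checkpoint quantized with group size 128, or float16 activations, might satisfy one of these kernels, but whether either works on this platform is untested here.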