ai.gitee.com/serverless-api/packages/1492
Hello, I can see Qwen3-VL-8B-Instruct listed at the link above. Does the C500 currently not support this model?
Problem description (see the attachment for details; the relevant log is pasted below):
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) ERROR 11-21 17:05:53 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is unavaible due to: libcudart.so.12: cannot open shared object file: No such file or directory
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:55 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) WARNING 11-21 17:05:55 [utils.py:181] TransformersForMultimodalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:57 [gpu_model_runner.py:2338] Starting to load model /data/Qwen3-VL-8B-Instruct...
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) `torch_dtype` is deprecated! Use `dtype` instead!
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:57 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:57 [transformers.py:439] Using Transformers backend.
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:05:58 [platform.py:298] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:05<00:16, 5.52s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:11<00:11, 5.79s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:13<00:03, 3.97s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:16<00:00, 3.57s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:16<00:00, 4.06s/it]
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791)
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:06:14 [default_loader.py:268] Loading weights took 16.46 seconds
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:06:15 [gpu_model_runner.py:2392] Model loading took 16.3341 GiB and 16.796520 seconds
(EngineCore_DP0 pid=296) (RayWorkerWrapper pid=791) INFO 11-21 17:06:15 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] EngineCore failed to start.
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] Traceback (most recent call last):
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 183, in _initialize_kv_caches
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] self.model_executor.determine_available_memory())
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 84, in determine_available_memory
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 309, in collective_rpc
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return self._run_workers(method, *args, **(kwargs or {}))
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 505, in _run_workers
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] ray_worker_outputs = ray.get(ray_worker_outputs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return fn(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return func(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2858, in get
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 958, in get_objects
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] raise value.as_instanceof_cause()
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] ray.exceptions.RayTaskError(OutOfMemoryError): ray::RayWorkerWrapper.execute_method() (pid=791, ip=172.17.0.4, actor_id=2b9d7f7d597adf4159ecbb8101000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7f5d455714e0>)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 628, in execute_method
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] raise e
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 619, in execute_method
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return run_method(self, method, args, kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/utils/__init__.py", line 3060, in run_method
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return func(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return func(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 263, in determine_available_memory
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] self.model_runner.profile_run()
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3017, in profile_run
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] self.model.get_multimodal_embeddings(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/transformers.py", line 844, in get_multimodal_embeddings
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] vision_embeddings = self.model.get_image_features(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1061, in get_image_features
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] image_embeds, deepstack_image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 739, in forward
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] hidden_states = blk(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_layers.py", line 94, in __call__
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return super().__call__(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 267, in forward
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] hidden_states = hidden_states + self.attn(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 230, in forward
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] attn_outputs = [
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 231, in <listcomp>
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] attention_interface(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/transformers/integrations/sdpa_attention.py", line 96, in sdpa_attention_forward
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] attn_output = torch.nn.functional.scaled_dot_product_attention(
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 5912, in scaled_dot_product_attention
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] return _scaled_dot_product_attention(query, key, value, attn_mask, dropout_p, is_causal, scale = scale, enable_gqa = enable_gqa)
(EngineCore_DP0 pid=296) ERROR 11-21 17:06:26 [core.py:718] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 63.59 GiB of which 41.68 GiB is free. Of the allocated memory 19.19 GiB is allocated by PyTorch, and 442.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
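My reading of the traceback above, offered tentatively: because FA2 cannot be loaded ("libcudart.so.12: cannot open shared object file", first ERROR line of the log), the Transformers-backend vision tower falls back to plain SDPA, and the profile run feeds it a single image at the maximum feature size (the 16384-token encoder cache budget). An unfused attention path then materializes a full S x S score matrix per head, which is what blows up. A back-of-the-envelope check of the 256 GiB figure follows; the 2x2 patch merge, the 16-head count and fp32 scores are my assumptions, not taken from the log:

python3 - <<'EOF'
# Rough size of the attention score tensor the vision tower would materialize
patches = 16384 * 4            # 16384 merged encoder tokens ~= 65536 pre-merge patches (assumed 2x2 merge)
heads = 16                     # assumed vision attention head count
bytes_per_elem = 4             # assumed fp32 attention scores
score_bytes = patches * patches * heads * bytes_per_elem
print(score_bytes / 2**30, "GiB")   # -> 256.0 GiB, matching the failed allocation
EOF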
docker info output:
Client: Docker Engine - Community
Version: 26.0.2
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.14.0
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.26.1
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 103
Running: 72
Paused: 0
Stopped: 31
Images: 56
Server Version: 26.0.2
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e377cd56a71523140ca6ae87e30244719194a521
runc version: v1.1.12-0-g51d5e94
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.19.0-46-generic
Operating System: Ubuntu 22.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 128
Total Memory: 1008GiB
Name: ZNDX-CA100
ID: 43f2ba6f-191c-4779-ad11-f360c2d5fc11
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
https://docker.1ms.run/
https://docker.m.daocloud.io/
https://dockerpull.com/
https://dockerproxy.com/
Live Restore Enabled: false
Image version: cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.10.2-maca.ai3.2.1.7-torch2.6-py310-ubuntu22.04-amd64
Container launch command (a reduced-footprint variant is sketched after the full command):
docker run -it --device=/dev/dri --device=/dev/mxcd \
--name Qwen3-VL-8B-Instruct \
-v /8T/perfxcloud/model/Qwen/Qwen3-VL-8B-Instruct:/data/Qwen3-VL-8B-Instruct \
-e CUDA_VISIBLE_DEVICES=5 \
-e TRITON_ENABLE_MACA_OPT_MOVE_DOT_OPERANDS_OUT_LOOP=1 \
-e TRITON_ENABLE_MACA_CHAIN_DOT_OPT=1 \
-e TRITON_DISABLE_MACA_OPT_MMA_PREFETCH=1 \
-e TRITON_ENABLE_MACA_COMPILER_INT8_OPT=True \
-e MACA_SMALL_PAGESIZE_ENABLE=1 \
-e RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 \
-p 2032:30889 \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--shm-size 100gb \
--ulimit memlock=-1 \
--group-add video \
6e519687a9e4 \
/opt/conda/bin/python -m vllm.entrypoints.openai.api_server \
--model /data/Qwen3-VL-8B-Instruct \
--api-key c01b24fc-4bf1-4871-a1c3-8663e151555b \
--served-model-name Qwen3-VL-8B-Instruct \
--max-model-len 16384 \
--gpu-memory-utilization 0.95 \
--port 30889 \
--swap-space 8 \
--tensor-parallel-size 1 \
--disable-log-stats \
--disable-log-requests \
--trust-remote-code \
--distributed-executor-backend ray \
--dtype bfloat16 \
--max-num-seqs 5
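For reference, the reduced-footprint variant I plan to try next is sketched below. Caveats: PYTORCH_CUDA_ALLOC_CONF is only the generic hint printed in the OOM message and will not by itself cover a 256 GiB request; --mm-processor-kwargs and max_pixels exist in upstream vLLM for the Qwen-VL family, but whether the Qwen3-VL processor honors them under the Transformers fallback in this vllm-metax 0.10.2 build is unverified, and 1048576 is just an example cap; the smaller --max-model-len is a guess at shrinking the maximum profiled image. Devices, mounts and the remaining flags are unchanged from the command above and elided here for brevity.

docker run -it --device=/dev/dri --device=/dev/mxcd \
  ... (same name, mount, environment variables, port and security options as above) ... \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  6e519687a9e4 \
  /opt/conda/bin/python -m vllm.entrypoints.openai.api_server \
  --model /data/Qwen3-VL-8B-Instruct \
  --served-model-name Qwen3-VL-8B-Instruct \
  --max-model-len 8192 \
  --mm-processor-kwargs '{"max_pixels": 1048576}' \
  ... (remaining flags unchanged) ...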
Server vendor/model: H3C UniServer R5300 G6
MetaX GPU model: MetaX C500
OS kernel version: 5.19.0-46-generic
CPU virtualization enabled: yes
mx-smi output:
mx-smi version: 2.2.3
=================== MetaX System Management Interface Log ===================
Timestamp : Fri Nov 21 16:53:45 2025
Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.3 Kernel Mode Driver Version: 2.14.6 |
| MACA Version: 3.0.0.8 BIOS Version: 1.24.3.0 |
|------------------------------------+---------------------+----------------------+
| GPU NAME | Bus-id | GPU-Util |
| Temp Pwr:Usage/Cap | Memory-Usage | |
|====================================+=====================+======================|
| 0 MetaX C500 | 0000:08:00.0 | 0% |
| 33C 52W / 350W | 64603/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 1 MetaX C500 | 0000:09:00.0 | 1% |
| 34C 54W / 350W | 58204/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 2 MetaX C500 | 0000:0e:00.0 | 0% |
| 35C 53W / 350W | 63899/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 3 MetaX C500 | 0000:11:00.0 | 0% |
| 34C 53W / 350W | 63643/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 4 MetaX C500 | 0000:32:00.0 | 1% |
| 32C 52W / 350W | 58204/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 5 MetaX C500 | 0000:38:00.0 | 0% |
| 31C 40W / 350W | 858/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 6 MetaX C500 | 0000:3b:00.0 | 0% |
| 34C 51W / 350W | 59997/65536 MiB | |
+------------------------------------+---------------------+----------------------+
| 7 MetaX C500 | 0000:3c:00.0 | 0% |
| 33C 52W / 350W | 59997/65536 MiB | |
+------------------------------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| 0 2155589 VLLM::EngineCor 63744 |
| 1 2231555 python 57344 |
| 2 1290848 VLLM::EngineCor 63040 |
| 3 1586485 VLLM::EngineCor 62784 |
| 4 2232951 python 57344 |
| 6 2235998 VLLM::Worker_TP 59136 |
| 7 2235999 VLLM::Worker_TP 59136 |
+---------------------------------------------------------------------------------+