After switching to the Qwen3-Embedding-0.6B model, everything works fine, and there is no need to set the V0 environment variable either.
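For reference, a minimal serve command for that Qwen3-Embedding-0.6B test could look like the following; the model path is an assumption (adjust it to wherever the weights are stored), and the other flags simply mirror the bge-m3 command below:

vllm serve /root/models/Qwen3-Embedding-0.6B \
    --tensor-parallel-size 1 \
    --dtype auto \
    --served-model-name embed \
    --gpu-memory-utilization 0.3 \
    --host 0.0.0.0 \
    --port 8001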
root@ictrek:/workspace# export VLLM_USE_V1=0
root@ictrek:/workspace# vllm serve /root/models/bge-m3 --tensor-parallel-size 1 --trust-remote-code --dtype auto --served-model-name embed --gpu-memory-utilization 0.3 --host 0.0.0.0 --port 8001
/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
INFO 11-21 09:53:59 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-21 09:53:59 [__init__.py:38] - metax -> vllm_metax:register
INFO 11-21 09:53:59 [__init__.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-21 09:53:59 [__init__.py:207] Platform plugin metax is activated
WARNING 11-21 09:54:08 [registry.py:483] Model architecture BaichuanForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.baichuan:BaichuanForCausalLM.
WARNING 11-21 09:54:08 [registry.py:483] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_metax.models.qwen2_vl:Qwen2VLForConditionalGeneration.
WARNING 11-21 09:54:08 [registry.py:483] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_metax.models.internvl:InternVLChatModel.
WARNING 11-21 09:54:08 [registry.py:483] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
INFO 11-21 09:54:08 [platform.py:423] [hook] platform:pre_register_and_update...
INFO 11-21 09:54:08 [platform.py:423] [hook] platform:pre_register_and_update...
INFO 11-21 09:54:08 [platform.py:423] [hook] platform:pre_register_and_update...
(APIServer pid=1710) INFO 11-21 09:54:08 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=1710) INFO 11-21 09:54:08 [platform.py:423] [hook] platform:pre_register_and_update...
(APIServer pid=1710) INFO 11-21 09:54:08 [utils.py:328] non-default args: {'model_tag': '/root/models/bge-m3', 'host': '0.0.0.0', 'port': 8001, 'model': '/root/models/bge-m3', 'trust_remote_code': True, 'served_model_name': ['embed'], 'gpu_memory_utilization': 0.3}
(APIServer pid=1710) INFO 11-21 09:54:08 [platform.py:423] [hook] platform:pre_register_and_update...
(APIServer pid=1710) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1710) INFO 11-21 09:54:08 [config.py:810] Found sentence-transformers tokenize configuration.
(APIServer pid=1710) INFO 11-21 09:54:24 [config.py:708] Found sentence-transformers modules configuration.
(APIServer pid=1710) INFO 11-21 09:54:24 [config.py:728] Found pooling configuration.
(APIServer pid=1710) INFO 11-21 09:54:24 [__init__.py:962] Resolved --runner auto to --runner pooling. Pass the value explicitly to silence this message.
(APIServer pid=1710) INFO 11-21 09:54:24 [__init__.py:742] Resolved architecture: XLMRobertaModel
(APIServer pid=1710) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=1710) INFO 11-21 09:54:24 [__init__.py:2764] Downcasting torch.float32 to torch.float16.
(APIServer pid=1710) INFO 11-21 09:54:24 [__init__.py:1815] Using max model len 8192
(APIServer pid=1710) INFO 11-21 09:54:24 [__init__.py:3479] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
(APIServer pid=1710) INFO 11-21 09:54:24 [api_server.py:296] Started engine process with PID 1984
/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
INFO 11-21 09:54:28 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-21 09:54:28 [__init__.py:38] - metax -> vllm_metax:register
INFO 11-21 09:54:28 [__init__.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-21 09:54:28 [__init__.py:207] Platform plugin metax is activated
WARNING 11-21 09:54:37 [registry.py:483] Model architecture BaichuanForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.baichuan:BaichuanForCausalLM.
WARNING 11-21 09:54:37 [registry.py:483] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_metax.models.qwen2_vl:Qwen2VLForConditionalGeneration.
WARNING 11-21 09:54:37 [registry.py:483] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_metax.models.internvl:InternVLChatModel.
WARNING 11-21 09:54:37 [registry.py:483] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
INFO 11-21 09:54:37 [llm_engine.py:221] Initializing a V0 LLM engine (v0.10.2) with config: model='/root/models/bge-m3', speculative_config=None, tokenizer='/root/models/bge-m3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=embed, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='CLS', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, activation=None, logit_bias=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":256,"local_cache_dir":null}, use_cached_outputs=True,
INFO 11-21 09:54:37 [config.py:810] Found sentence-transformers tokenize configuration.
ERROR 11-21 09:54:38 [engine.py:468] V0 engine is deprecated on Maca. Please switch to V1.
ERROR 11-21 09:54:38 [engine.py:468] Traceback (most recent call last):
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 456, in run_mp_engine
ERROR 11-21 09:54:38 [engine.py:468] engine = MQLLMEngine.from_vllm_config(
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/utils/__init__.py", line 1589, in inner
ERROR 11-21 09:54:38 [engine.py:468] return fn(*args, **kwargs)
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 144, in from_vllm_config
ERROR 11-21 09:54:38 [engine.py:468] return cls(
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 88, in __init__
ERROR 11-21 09:54:38 [engine.py:468] self.engine = LLMEngine(*args, **kwargs)
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 260, in __init__
ERROR 11-21 09:54:38 [engine.py:468] self.model_executor = executor_class(vllm_config=vllm_config)
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in __init__
ERROR 11-21 09:54:38 [engine.py:468] self._init_executor()
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 11-21 09:54:38 [engine.py:468] self.collective_rpc("init_worker", args=([kwargs], ))
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 11-21 09:54:38 [engine.py:468] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/utils/__init__.py", line 3060, in run_method
ERROR 11-21 09:54:38 [engine.py:468] return func(*args, **kwargs)
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 600, in init_worker
ERROR 11-21 09:54:38 [engine.py:468] self.worker = worker_class(**kwargs)
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 88, in __init__
ERROR 11-21 09:54:38 [engine.py:468] self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1006, in __init__
ERROR 11-21 09:54:38 [engine.py:468] self.attn_backend = get_attn_backend(
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/attention/selector.py", line 154, in get_attn_backend
ERROR 11-21 09:54:38 [engine.py:468] return _cached_get_attn_backend(
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm/attention/selector.py", line 205, in _cached_get_attn_backend
ERROR 11-21 09:54:38 [engine.py:468] attention_cls = current_platform.get_attn_backend_cls(
ERROR 11-21 09:54:38 [engine.py:468] File "/opt/conda/lib/python3.10/site-packages/vllm_metax/platform.py", line 331, in get_attn_backend_cls
ERROR 11-21 09:54:38 [engine.py:468] raise AssertionError(
ERROR 11-21 09:54:38 [engine.py:468] AssertionError: V0 engine is deprecated on Maca. Please switch to V1.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 470, in run_mp_engine
raise e from None
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 456, in run_mp_engine
engine = MQLLMEngine.from_vllm_config(
File "/opt/conda/lib/python3.10/site-packages/vllm/utils/__init__.py", line 1589, in inner
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 144, in from_vllm_config
return cls(
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 88, in __init__
self.engine = LLMEngine(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 260, in __init__
self.model_executor = executor_class(vllm_config=vllm_config)
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in __init__
self._init_executor()
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
self.collective_rpc("init_worker", args=([kwargs], ))
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/utils/__init__.py", line 3060, in run_method
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 600, in init_worker
self.worker = worker_class(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 88, in __init__
self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1006, in __init__
self.attn_backend = get_attn_backend(
File "/opt/conda/lib/python3.10/site-packages/vllm/attention/selector.py", line 154, in get_attn_backend
return _cached_get_attn_backend(
File "/opt/conda/lib/python3.10/site-packages/vllm/attention/selector.py", line 205, in _cached_get_attn_backend
attention_cls = current_platform.get_attn_backend_cls(
File "/opt/conda/lib/python3.10/site-packages/vllm_metax/platform.py", line 331, in get_attn_backend_cls
raise AssertionError(
AssertionError: V0 engine is deprecated on Maca. Please switch to V1.
(APIServer pid=1710) Traceback (most recent call last):
(APIServer pid=1710) File "/opt/conda/bin/vllm", line 8, in <module>
(APIServer pid=1710) sys.exit(main())
(APIServer pid=1710) File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=1710) args.dispatch_function(args)
(APIServer pid=1710) File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=1710) uvloop.run(run_server(args))
(APIServer pid=1710) File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run
(APIServer pid=1710) return loop.run_until_complete(wrapper())
(APIServer pid=1710) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1710) File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1710) return await main
(APIServer pid=1710) File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
(APIServer pid=1710) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1710) File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
(APIServer pid=1710) async with build_async_engine_client(
(APIServer pid=1710) File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=1710) return await anext(self.gen)
(APIServer pid=1710) File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
(APIServer pid=1710) async with build_async_engine_client_from_engine_args(
(APIServer pid=1710) File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=1710) return await anext(self.gen)
(APIServer pid=1710) File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 319, in build_async_engine_client_from_engine_args
(APIServer pid=1710) raise RuntimeError(
(APIServer pid=1710) RuntimeError: Engine process failed to start. See stack trace for the root cause.
root@ictrek:/workspace#
With the V0 environment variable set, startup fails with the error shown above.
Before the NVIDIA configuration was removed, Qwen3-32B-AWQ had been deployed in the same environment and tested with evalscope; it returned test results and did not trigger any error.
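As a minimal sketch of such a sanity check (not the actual evalscope invocation), the OpenAI-compatible chat endpoint of that deployment can also be exercised directly with curl; the port and served model name below are assumptions and should be adjusted to the real Qwen3-32B-AWQ deployment:

# hypothetical port and model name for the earlier Qwen3-32B-AWQ deployment
curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen3-32B-AWQ", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 16}'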
ictrek@ictrek:~$ docker info
Client: Docker Engine - Community
Version: 28.3.3
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.26.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.39.1
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 1
Running: 0
Paused: 0
Stopped: 1
Images: 5
Server Version: 28.3.3
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
CDI spec directories:
/etc/cdi
/var/run/cdi
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
runc version: v1.2.5-0-g59923ef
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.15.0-78-generic
Operating System: Ubuntu 22.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 104
Total Memory: 503.5GiB
Name: ictrek
ID: 978b718f-8738-4bf1-af4c-900e02826555
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
::1/128
127.0.0.0/8
Registry Mirrors:
https://docker.1ms.run/
Live Restore Enabled: false
ictrek@ictrek:~$ docker info | grep -i nvi
ictrek@ictrek:~$ dpkg -l | grep -i nvidia-container
ictrek@ictrek:~$ docker run -itd \
--device=/dev/dri \
--device=/dev/mxcd \
--group-add video \
--name vllm \
--device=/dev/mem \
--network=host \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--shm-size '100gb' \
--ulimit memlock=-1 \
-v /usr/local/:/usr/local/ \
-v /home/ictrek/models:/root/models \
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.10.2-maca.ai3.2.1.7-torch2.6-py310-ubuntu22.04-amd64 \
/bin/bash
ce24695aa89b6eec7260bdd2e623249baa2ceb998c677dfcd92912a34e65b907
ictrek@ictrek:~$ docker exec -ti vllm bash
root@ictrek:/workspace# vllm serve /root/models/bge-m3 \
--tensor-parallel-size 1 \
--trust-remote-code \
--dtype auto \
--served-model-name embed \
--gpu-memory-utilization 0.3 \
--host 0.0.0.0 \
--port 8001
/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
INFO 11-20 18:07:05 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-20 18:07:05 [__init__.py:38] - metax -> vllm_metax:register
INFO 11-20 18:07:05 [__init__.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-20 18:07:05 [__init__.py:207] Platform plugin metax is activated
WARNING 11-20 18:07:14 [registry.py:483] Model architecture BaichuanForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.baichuan:BaichuanForCausalLM.
WARNING 11-20 18:07:14 [registry.py:483] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_metax.models.qwen2_vl:Qwen2VLForConditionalGeneration.
WARNING 11-20 18:07:14 [registry.py:483] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_metax.models.internvl:InternVLChatModel.
WARNING 11-20 18:07:14 [registry.py:483] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
INFO 11-20 18:07:14 [platform.py:423] [hook] platform:pre_register_and_update...
INFO 11-20 18:07:14 [platform.py:423] [hook] platform:pre_register_and_update...
INFO 11-20 18:07:14 [platform.py:423] [hook] platform:pre_register_and_update...
(APIServer pid=31) INFO 11-20 18:07:14 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=31) INFO 11-20 18:07:14 [platform.py:423] [hook] platform:pre_register_and_update...
(APIServer pid=31) INFO 11-20 18:07:14 [utils.py:328] non-default args: {'model_tag': '/root/models/bge-m3', 'host': '0.0.0.0', 'port': 8001, 'model': '/root/models/bge-m3', 'trust_remote_code': True, 'served_model_name': ['embed'], 'gpu_memory_utilization': 0.3}
(APIServer pid=31) INFO 11-20 18:07:14 [platform.py:423] [hook] platform:pre_register_and_update...
(APIServer pid=31) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=31) INFO 11-20 18:07:14 [config.py:810] Found sentence-transformers tokenize configuration.
(APIServer pid=31) INFO 11-20 18:07:29 [config.py:708] Found sentence-transformers modules configuration.
(APIServer pid=31) INFO 11-20 18:07:29 [config.py:728] Found pooling configuration.
(APIServer pid=31) INFO 11-20 18:07:29 [__init__.py:962] Resolved --runner auto to --runner pooling. Pass the value explicitly to silence this message.
(APIServer pid=31) INFO 11-20 18:07:29 [__init__.py:742] Resolved architecture: XLMRobertaModel
(APIServer pid=31) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=31) INFO 11-20 18:07:29 [__init__.py:2764] Downcasting torch.float32 to torch.float16.
(APIServer pid=31) INFO 11-20 18:07:29 [__init__.py:1815] Using max model len 8192
(APIServer pid=31) INFO 11-20 18:07:29 [arg_utils.py:1639] (Disabling) chunked prefill by default
(APIServer pid=31) INFO 11-20 18:07:29 [arg_utils.py:1642] (Disabling) prefix caching by default
(APIServer pid=31) INFO 11-20 18:07:29 [__init__.py:3479] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
INFO 11-20 18:07:34 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-20 18:07:34 [__init__.py:38] - metax -> vllm_metax:register
INFO 11-20 18:07:34 [__init__.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-20 18:07:34 [__init__.py:207] Platform plugin metax is activated
(EngineCore_DP0 pid=312) INFO 11-20 18:07:36 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=312) WARNING 11-20 18:07:42 [registry.py:483] Model architecture BaichuanForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.baichuan:BaichuanForCausalLM.
(EngineCore_DP0 pid=312) WARNING 11-20 18:07:42 [registry.py:483] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_metax.models.qwen2_vl:Qwen2VLForConditionalGeneration.
(EngineCore_DP0 pid=312) WARNING 11-20 18:07:42 [registry.py:483] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_metax.models.internvl:InternVLChatModel.
(EngineCore_DP0 pid=312) WARNING 11-20 18:07:42 [registry.py:483] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
(EngineCore_DP0 pid=312) INFO 11-20 18:07:42 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='/root/models/bge-m3', speculative_config=None, tokenizer='/root/models/bge-m3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=embed, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='CLS', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, activation=None, logit_bias=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":256,"local_cache_dir":null}
(EngineCore_DP0 pid=312) ERROR 11-20 18:07:42 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is unavaible due to: libcudart.so.12: cannot open shared object file: No such file or directory
(EngineCore_DP0 pid=312) INFO 11-20 18:07:43 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=312) INFO 11-20 18:07:43 [gpu_model_runner.py:2338] Starting to load model /root/models/bge-m3...
(EngineCore_DP0 pid=312) INFO 11-20 18:07:43 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pid=312) INFO 11-20 18:07:43 [platform.py:298] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 91.89it/s]
(EngineCore_DP0 pid=312)
(EngineCore_DP0 pid=312) INFO 11-20 18:07:44 [default_loader.py:268] Loading weights took 0.62 seconds
(EngineCore_DP0 pid=312) INFO 11-20 18:07:44 [gpu_model_runner.py:2392] Model loading took 1.0558 GiB and 0.723533 seconds
(EngineCore_DP0 pid=312) INFO 11-20 18:07:49 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/d76eda0a72/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=312) INFO 11-20 18:07:49 [backends.py:550] Dynamo bytecode transform time: 4.72 s
(EngineCore_DP0 pid=312) INFO 11-20 18:07:51 [backends.py:194] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=312) [rank0]:W1120 18:07:52.007000 312 site-packages/torch/_inductor/utils.py:1138] [0/0] Not enough SMs to use max_autotune_gemm mode
(EngineCore_DP0 pid=312) [rank0]:W1120 18:07:52.007000 312 site-packages/torch/_inductor/utils.py:1197] [0/0] Forcing disable 'CUTLASS' backend as it is not supported in maca platform.
(EngineCore_DP0 pid=312) [rank0]:W1120 18:07:52.022000 312 site-packages/torch/_inductor/utils.py:1197] [0/0] Forcing disable 'CUTLASS' backend as it is not supported in maca platform.
(EngineCore_DP0 pid=312) [rank0]:W1120 18:07:52.030000 312 site-packages/torch/_inductor/utils.py:1197] [0/0] Forcing disable 'CUTLASS' backend as it is not supported in maca platform.
(EngineCore_DP0 pid=312) [rank0]:W1120 18:08:04.751000 312 site-packages/torch/_inductor/utils.py:1197] [0/0] Forcing disable 'CUTLASS' backend as it is not supported in maca platform.
(EngineCore_DP0 pid=312) [rank0]:W1120 18:08:04.767000 312 site-packages/torch/_inductor/utils.py:1197] [0/0] Forcing disable 'CUTLASS' backend as it is not supported in maca platform.
(EngineCore_DP0 pid=312) [rank0]:W1120 18:08:04.775000 312 site-packages/torch/_inductor/utils.py:1197] [0/0] Forcing disable 'CUTLASS' backend as it is not supported in maca platform.
(EngineCore_DP0 pid=312) INFO 11-20 18:08:04 [backends.py:215] Compiling a graph for dynamic shape takes 15.29 s
(EngineCore_DP0 pid=312) INFO 11-20 18:08:14 [monitor.py:34] torch.compile takes 20.01 s in total
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:00<00:00, 35.76it/s]
(EngineCore_DP0 pid=312) INFO 11-20 18:08:16 [gpu_model_runner.py:3118] Graph capturing finished in 2 secs, took -0.34 GiB
(EngineCore_DP0 pid=312) INFO 11-20 18:08:17 [core.py:218] init engine (profile, create kv cache, warmup model) took 32.31 seconds
(EngineCore_DP0 pid=312) INFO 11-20 18:08:17 [config.py:810] Found sentence-transformers tokenize configuration.
(EngineCore_DP0 pid=312) INFO 11-20 18:08:17 [core.py:120] Disabling chunked prefill for model without KVCache
(EngineCore_DP0 pid=312) INFO 11-20 18:08:17 [__init__.py:3479] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
(APIServer pid=31) INFO 11-20 18:08:18 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 1
(APIServer pid=31) INFO 11-20 18:08:18 [async_llm.py:180] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
(APIServer pid=31) INFO 11-20 18:08:18 [api_server.py:1692] Supported_tasks: ['encode', 'embed']
(APIServer pid=31) INFO 11-20 18:08:18 [__init__.py:36] No IOProcessor plugins requested by the model
(APIServer pid=31) INFO 11-20 18:08:18 [api_server.py:1971] Starting vLLM API server 0 on http://0.0.0.0:8001
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:36] Available routes are:
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /docs, Methods: GET, HEAD
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /redoc, Methods: GET, HEAD
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=31) INFO 11-20 18:08:18 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=31) INFO: Started server process [31]
(APIServer pid=31) INFO: Waiting for application startup.
(APIServer pid=31) INFO: Application startup complete.
(APIServer pid=31) INFO: 127.0.0.1:42122 - "GET /v1/models HTTP/1.1" 200 OK
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.2) with config: model='/root/models/bge-m3', speculative_config=None, tokenizer='/root/models/bge-m3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=embed, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='CLS', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, activation=None, logit_bias=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":3,"debug_dump_path":"","cache_dir":"/root/.cache/vllm/torch_compile_cache/d76eda0a72","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":256,"local_cache_dir":"/root/.cache/vllm/torch_compile_cache/d76eda0a72/rank_0_0/backbone"},
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=embd-4779351f7240402e8ea30461d30c7800-0,prompt_token_ids_len=4,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=(),num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_from_preemption=[], new_token_ids=[], new_block_ids=[], num_computed_tokens=[]), num_scheduled_tokens={embd-4779351f7240402e8ea30461d30c7800-0: 4}, total_num_scheduled_tokens=4, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[], finished_req_ids=[], free_encoder_mm_hashes=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] Traceback (most recent call last):
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 711, in run_engine_core
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] engine_core.run_busy_loop()
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 738, in run_busy_loop
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] self._process_engine_step()
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 764, in _process_engine_step
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 292, in step
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 278, in execute_model_with_error_logging
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] raise err
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 269, in execute_model_with_error_logging
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return model_fn(scheduler_output)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 93, in execute_model
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] output = self.collective_rpc("execute_model",
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/utils/init.py", line 3060, in run_method
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return func(args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return func(args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 436, in execute_model
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] output = self.model_runner.execute_model(scheduler_output,
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return func(*args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2064, in execute_model
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] model_output = self.model(
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return self._call_impl(args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return forward_call(args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/roberta.py", line 126, in forward
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return self.model(input_ids=input_ids,
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 312, in call
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] model_output = self.forward(*args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/bert.py", line 351, in forward
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] def forward(
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return self._call_impl(args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return forward_call(args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return fn(*args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return self._wrapped_call(self, args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in call
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] raise e
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in call
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return super(self.cls, obj).call(args, kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return self._call_impl(*args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return forward_call(args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "<eval_with_key>.50", line 306, in forward
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3); getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return self._wrapped_call(self, args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in call
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] raise e
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in call
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return super(self.cls, obj).call(*args, kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return self._call_impl(args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return forward_call(args, kwargs)
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "<eval_with_key>.2", line 5, in forward
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query, key, value, output_1, 'model.encoder.layer.0.attention.output.attn'); query = key = value = output_1 = unified_attention_with_output = None
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 1123, in call
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return self._op(*args, (kwargs or {}))
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/attention/layer.py", line 521, in unified_attention_with_output
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] self.impl.forward(self,
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm_metax/v1/attention/backends/flash_attn.py", line 560, in forward
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] return self._forward_encoder_attention(query[:num_actual_tokens],
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm_metax/v1/attention/backends/flash_attn.py", line 709, in _forward_encoder_attention
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] flash_attn_varlen_func(
(EngineCore_DP0 pid=312) ERROR 11-20 18:08:34 [core.py:720] TypeError: flash_attn_varlen_func() got an unexpected keyword argument 'out'
(EngineCore_DP0 pid=312) Process EngineCore_DP0:
(EngineCore_DP0 pid=312) Traceback (most recent call last):
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=312) self.run()
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=312) self._target(self._args, self._kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_DP0 pid=312) raise e
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 711, in run_engine_core
(EngineCore_DP0 pid=312) engine_core.run_busy_loop()
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 738, in run_busy_loop
(EngineCore_DP0 pid=312) self._process_engine_step()
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 764, in _process_engine_step
(EngineCore_DP0 pid=312) outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 292, in step
(EngineCore_DP0 pid=312) model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 278, in execute_model_with_error_logging
(EngineCore_DP0 pid=312) raise err
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 269, in execute_model_with_error_logging
(EngineCore_DP0 pid=312) return model_fn(scheduler_output)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 93, in execute_model
(EngineCore_DP0 pid=312) output = self.collective_rpc("execute_model",
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_DP0 pid=312) answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/utils/init.py", line 3060, in run_method
(EngineCore_DP0 pid=312) return func(args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=312) return func(*args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 436, in execute_model
(EngineCore_DP0 pid=312) output = self.model_runner.execute_model(scheduler_output,
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=312) return func(args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2064, in execute_model
(EngineCore_DP0 pid=312) model_output = self.model(
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=312) return self._call_impl(args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=312) return forward_call(*args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/roberta.py", line 126, in forward
(EngineCore_DP0 pid=312) return self.model(input_ids=input_ids,
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 312, in call
(EngineCore_DP0 pid=312) model_output = self.forward(args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/bert.py", line 351, in forward
(EngineCore_DP0 pid=312) def forward(
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=312) return self._call_impl(args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=312) return forward_call(*args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
(EngineCore_DP0 pid=312) return fn(args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(EngineCore_DP0 pid=312) return self._wrapped_call(self, args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in call
(EngineCore_DP0 pid=312) raise e
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in call
(EngineCore_DP0 pid=312) return super(self.cls, obj).call(*args, kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=312) return self._call_impl(args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=312) return forward_call(args, kwargs)
(EngineCore_DP0 pid=312) File "<eval_with_key>.50", line 306, in forward
(EngineCore_DP0 pid=312) submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3); getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(EngineCore_DP0 pid=312) return self._wrapped_call(self, *args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in call
(EngineCore_DP0 pid=312) raise e
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in call
(EngineCore_DP0 pid=312) return super(self.cls, obj).call(args, kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=312) return self._call_impl(args, kwargs)
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=312) return forward_call(*args, kwargs)
(EngineCore_DP0 pid=312) File "<eval_with_key>.2", line 5, in forward
(EngineCore_DP0 pid=312) unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query, key, value, output_1, 'model.encoder.layer.0.attention.output.attn'); query = key = value = output_1 = unified_attention_with_output = None
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 1123, in call
(EngineCore_DP0 pid=312) return self._op(args, *(kwargs or {}))
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm/attention/layer.py", line 521, in unified_attention_with_output
(EngineCore_DP0 pid=312) self.impl.forward(self,
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm_metax/v1/attention/backends/flash_attn.py", line 560, in forward
(EngineCore_DP0 pid=312) return self._forward_encoder_attention(query[:num_actual_tokens],
(EngineCore_DP0 pid=312) File "/opt/conda/lib/python3.10/site-packages/vllm_metax/v1/attention/backends/flash_attn.py", line 709, in _forward_encoder_attention
(EngineCore_DP0 pid=312) flash_attn_varlen_func(
(EngineCore_DP0 pid=312) TypeError: flash_attn_varlen_func() got an unexpected keyword argument 'out'
(APIServer pid=31) ERROR 11-20 18:08:34 [async_llm.py:485] AsyncLLM output_handler failed.
(APIServer pid=31) ERROR 11-20 18:08:34 [async_llm.py:485] Traceback (most recent call last):
(APIServer pid=31) ERROR 11-20 18:08:34 [async_llm.py:485] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 444, in output_handler
(APIServer pid=31) ERROR 11-20 18:08:34 [async_llm.py:485] outputs = await engine_core.get_output_async()
(APIServer pid=31) ERROR 11-20 18:08:34 [async_llm.py:485] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 845, in get_output_async
(APIServer pid=31) ERROR 11-20 18:08:34 [async_llm.py:485] raise self._format_exception(outputs) from None
(APIServer pid=31) ERROR 11-20 18:08:34 [async_llm.py:485] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=31) INFO: 127.0.0.1:42320 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
[rank0]:[W1120 18:08:34.229014027 ProcessGroupNCCL.cpp:1502] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=31) INFO: Shutting down
(APIServer pid=31) INFO: Waiting for application shutdown.
(APIServer pid=31) INFO: Application shutdown complete.
(APIServer pid=31) INFO: Finished server process [31]
I have already removed the NVIDIA-related components, but the error is exactly the same, and judging from the vLLM traceback inside the container, the exception is raised from the flash_attn.py script in the vllm_metax package.
bge-m3 is a widely used model in the industry. Could you please check whether there is a compatibility issue here, or provide a working bge-m3 deployment example that I can try to reproduce?
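If it helps with triage, a quick check inside the container should confirm the signature mismatch directly. The module path below is taken from the traceback, and I am assuming flash_attn_varlen_func is reachable as an attribute of that module (not verified on my side):

python3 -c "
import inspect
# Module path taken from the traceback above; adjust if the MACA build exposes it elsewhere
from vllm_metax.v1.attention.backends import flash_attn as fa
# If 'out' does not appear in this signature, the installed kernel wrapper and
# the caller in vllm/attention/layer.py are out of sync
print(inspect.signature(fa.flash_attn_varlen_func))
"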
I. Hardware and software information
1. Server vendor: Supermicro
2. MetaX GPU model: MetaX N260
3. OS kernel version: 5.15.0-78-generic
4. CPU virtualization enabled: Yes
5. mx-smi output:
mx-smi version: 2.2.9
=================== MetaX System Management Interface Log ===================
Timestamp : Thu Nov 20 10:59:49 2025
Attached GPUs : 1
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.9 Kernel Mode Driver Version: 3.3.12 |
| MACA Version: 3.2.1.10 BIOS Version: 1.29.1.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX N260 | 0 Off | 0000:86:00.0 | 0% Disabled |
| 37W / 225W | 36C P0 | 666/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
End of Log
6. docker info output:
Client: Docker Engine - Community
Version: 28.3.3
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.26.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.39.1
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 2
Running: 1
Paused: 0
Stopped: 1
Images: 5
Server Version: 28.3.3
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
CDI spec directories:
/etc/cdi
/var/run/cdi
Discovered Devices:
cdi: nvidia.com/gpu=0
cdi: nvidia.com/gpu=GPU-6097b7c8-2459-81ea-fae9-1aa50b097119
cdi: nvidia.com/gpu=all
Swarm: inactive
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
runc version: v1.2.5-0-g59923ef
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.15.0-78-generic
Operating System: Ubuntu 22.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 104
Total Memory: 503.5GiB
Name: ictrek
ID: 978b718f-8738-4bf1-af4c-900e02826555
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
::1/128
127.0.0.0/8
Registry Mirrors:
docker.1ms.run/
Live Restore Enabled: false
7. Image version:
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.10.2-maca.ai3.2.1.7-torch2.6-py310-ubuntu22.04-amd64
8. Container launch command:
docker run -itd \
--device=/dev/dri \
--device=/dev/mxcd \
--group-add video \
--name vllm \
--device=/dev/mem \
--network=host \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--shm-size '100gb' \
--ulimit memlock=-1 \
-v /usr/local/:/usr/local/ \
-v /home/ictrek/models:/root/models \
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.10.2-maca.ai3.2.1.7-torch2.6-py310-ubuntu22.04-amd64 \
/bin/bash
9. Command executed inside the container:
vllm serve /root/models/bge-m3 \
--tensor-parallel-size 1 \
--trust-remote-code \
--dtype auto \
--served-model-name embed \
--gpu-memory-utilization 0.3 \
--host 0.0.0.0 \
--port 8001
Problem:
When testing the embedding endpoint with the following curl command, the request fails and no result is returned.
ictrek@ictrek:~$ curl http://localhost:8001/v1/models
{"object":"list","data":[{"id":"embed","object":"model","created":1763606648,"owned_by":"vllm","root":"/root/models/bge-m3","parent":null,"max_model_len":8192,"permission":[{"id":"modelperm-8111aff2f88b4d7387d382e932293d52","object":"model_permission","created":1763606648,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}ictrek@ictrek:~$
ictrek@ictrek:~$
ictrek@ictrek:~$ curl -X POST \
http://127.0.0.1:8001/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"model": "embed", "input":"Hello World"}'
{"error":{"message":"EngineCore encountered an issue. See stack trace (above) for the root cause.","type":"BadRequestError","param":null,"code":400}}ictrek@ictrek:~$
Observed behavior (full server log):
root@ictrek:~/models# vllm serve /root/models/bge-m3 \
--tensor-parallel-size 1 \
--trust-remote-code \
--dtype auto \
--served-model-name embed \
--gpu-memory-utilization 0.3 \
--host 0.0.0.0 \
--port 8001
/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/init.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: github.com/pytorch/vision/issues/6753, and you can also check out github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/init.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: github.com/pytorch/vision/issues/6753, and you can also check out github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
INFO 11-20 10:41:17 [init.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-20 10:41:17 [init.py:38] - metax -> vllm_metax:register
INFO 11-20 10:41:17 [init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-20 10:41:17 [init.py:207] Platform plugin metax is activated
WARNING 11-20 10:41:26 [registry.py:483] Model architecture BaichuanForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.baichuan:BaichuanForCausalLM.
WARNING 11-20 10:41:26 [registry.py:483] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_metax.models.qwen2_vl:Qwen2VLForConditionalGeneration.
WARNING 11-20 10:41:26 [registry.py:483] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_metax.models.internvl:InternVLChatModel.
WARNING 11-20 10:41:26 [registry.py:483] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
INFO 11-20 10:41:26 [platform.py:423] [hook] platform:pre_register_and_update...
INFO 11-20 10:41:26 [platform.py:423] [hook] platform:pre_register_and_update...
INFO 11-20 10:41:26 [platform.py:423] [hook] platform:pre_register_and_update...
(APIServer pid=3604) INFO 11-20 10:41:26 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=3604) INFO 11-20 10:41:26 [platform.py:423] [hook] platform:pre_register_and_update...
(APIServer pid=3604) INFO 11-20 10:41:26 [utils.py:328] non-default args: {'model_tag': '/root/models/bge-m3', 'host': '0.0.0.0', 'port': 8001, 'model': '/root/models/bge-m3', 'trust_remote_code': True, 'served_model_name': ['embed'], 'gpu_memory_utilization': 0.3}
(APIServer pid=3604) INFO 11-20 10:41:26 [platform.py:423] [hook] platform:pre_register_and_update...
(APIServer pid=3604) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=3604) INFO 11-20 10:41:26 [config.py:810] Found sentence-transformers tokenize configuration.
(APIServer pid=3604) INFO 11-20 10:41:41 [config.py:708] Found sentence-transformers modules configuration.
(APIServer pid=3604) INFO 11-20 10:41:41 [config.py:728] Found pooling configuration.
(APIServer pid=3604) INFO 11-20 10:41:41 [init.py:962] Resolved --runner auto to --runner pooling. Pass the value explicitly to silence this message.
(APIServer pid=3604) INFO 11-20 10:41:41 [init.py:742] Resolved architecture: XLMRobertaModel
(APIServer pid=3604) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=3604) INFO 11-20 10:41:41 [init.py:2764] Downcasting torch.float32 to torch.float16.
(APIServer pid=3604) INFO 11-20 10:41:41 [init.py:1815] Using max model len 8192
(APIServer pid=3604) INFO 11-20 10:41:41 [arg_utils.py:1639] (Disabling) chunked prefill by default
(APIServer pid=3604) INFO 11-20 10:41:41 [arg_utils.py:1642] (Disabling) prefix caching by default
(APIServer pid=3604) INFO 11-20 10:41:41 [init.py:3479] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/init.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: github.com/pytorch/vision/issues/6753, and you can also check out github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/init.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: github.com/pytorch/vision/issues/6753, and you can also check out github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
INFO 11-20 10:41:46 [init.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-20 10:41:46 [init.py:38] - metax -> vllm_metax:register
INFO 11-20 10:41:46 [init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-20 10:41:46 [init.py:207] Platform plugin metax is activated
(EngineCore_DP0 pid=3880) INFO 11-20 10:41:48 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=3880) WARNING 11-20 10:41:54 [registry.py:483] Model architecture BaichuanForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.baichuan:BaichuanForCausalLM.
(EngineCore_DP0 pid=3880) WARNING 11-20 10:41:54 [registry.py:483] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_metax.models.qwen2_vl:Qwen2VLForConditionalGeneration.
(EngineCore_DP0 pid=3880) WARNING 11-20 10:41:54 [registry.py:483] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_metax.models.internvl:InternVLChatModel.
(EngineCore_DP0 pid=3880) WARNING 11-20 10:41:54 [registry.py:483] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_mtp:DeepSeekMTP.
(EngineCore_DP0 pid=3880) INFO 11-20 10:41:54 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='/root/models/bge-m3', speculative_config=None, tokenizer='/root/models/bge-m3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=embed, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='CLS', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, activation=None, logit_bias=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":256,"local_cache_dir":null}
(EngineCore_DP0 pid=3880) ERROR 11-20 10:41:54 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is unavaible due to: libcudart.so.12: cannot open shared object file: No such file or directory
(EngineCore_DP0 pid=3880) INFO 11-20 10:41:55 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=3880) INFO 11-20 10:41:55 [gpu_model_runner.py:2338] Starting to load model /root/models/bge-m3...
(EngineCore_DP0 pid=3880) INFO 11-20 10:41:55 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pid=3880) INFO 11-20 10:41:55 [platform.py:298] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 93.10it/s]
(EngineCore_DP0 pid=3880)
(EngineCore_DP0 pid=3880) INFO 11-20 10:41:56 [default_loader.py:268] Loading weights took 0.65 seconds
(EngineCore_DP0 pid=3880) INFO 11-20 10:41:57 [gpu_model_runner.py:2392] Model loading took 1.0558 GiB and 0.759257 seconds
(EngineCore_DP0 pid=3880) INFO 11-20 10:42:01 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/d76eda0a72/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=3880) INFO 11-20 10:42:01 [backends.py:550] Dynamo bytecode transform time: 4.66 s
(EngineCore_DP0 pid=3880) INFO 11-20 10:42:04 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 2.829 s
(EngineCore_DP0 pid=3880) INFO 11-20 10:42:06 [monitor.py:34] torch.compile takes 4.66 s in total
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████| 35/35 [00:00<00:00, 39.26it/s]
(EngineCore_DP0 pid=3880) INFO 11-20 10:42:08 [gpu_model_runner.py:3118] Graph capturing finished in 2 secs, took 0.07 GiB
(EngineCore_DP0 pid=3880) INFO 11-20 10:42:08 [core.py:218] init engine (profile, create kv cache, warmup model) took 11.10 seconds
(EngineCore_DP0 pid=3880) INFO 11-20 10:42:08 [config.py:810] Found sentence-transformers tokenize configuration.
(EngineCore_DP0 pid=3880) INFO 11-20 10:42:09 [core.py:120] Disabling chunked prefill for model without KVCache
(EngineCore_DP0 pid=3880) INFO 11-20 10:42:09 [init.py:3479] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
(APIServer pid=3604) INFO 11-20 10:42:09 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 1
(APIServer pid=3604) INFO 11-20 10:42:09 [async_llm.py:180] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
(APIServer pid=3604) INFO 11-20 10:42:09 [api_server.py:1692] Supported_tasks: ['encode', 'embed']
(APIServer pid=3604) INFO 11-20 10:42:09 [init.py:36] No IOProcessor plugins requested by the model
(APIServer pid=3604) INFO 11-20 10:42:09 [api_server.py:1971] Starting vLLM API server 0 on http://0.0.0.0:8001
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:36] Available routes are:
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /docs, Methods: GET, HEAD
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /redoc, Methods: GET, HEAD
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=3604) INFO 11-20 10:42:09 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=3604) INFO: Started server process [3604]
(APIServer pid=3604) INFO: Waiting for application startup.
(APIServer pid=3604) INFO: Application startup complete.
(APIServer pid=3604) INFO: 127.0.0.1:55612 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=3604) INFO: 127.0.0.1:36214 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=3604) INFO: 127.0.0.1:38848 - "GET /v1/models HTTP/1.1" 200 OK
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.2) with config: model='/root/models/bge-m3', speculative_config=None, tokenizer='/root/models/bge-m3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=embed, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='CLS', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, activation=None, logit_bias=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":3,"debug_dump_path":"","cache_dir":"/root/.cache/vllm/torch_compile_cache/d76eda0a72","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":256,"local_cache_dir":"/root/.cache/vllm/torch_compile_cache/d76eda0a72/rank_0_0/backbone"},
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=embd-99b875a2f0344fc2906265c0362d7623-0,prompt_token_ids_len=4,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=(),num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_from_preemption=[], new_token_ids=[], new_block_ids=[], num_computed_tokens=[]), num_scheduled_tokens={embd-99b875a2f0344fc2906265c0362d7623-0: 4}, total_num_scheduled_tokens=4, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[], finished_req_ids=[], free_encoder_mm_hashes=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] Traceback (most recent call last):
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 711, in run_engine_core
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] engine_core.run_busy_loop()
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 738, in run_busy_loop
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] self._process_engine_step()
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 764, in _process_engine_step
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 292, in step
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 278, in execute_model_with_error_logging
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] raise err
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 269, in execute_model_with_error_logging
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return model_fn(scheduler_output)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 93, in execute_model
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] output = self.collective_rpc("execute_model",
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/utils/init.py", line 3060, in run_method
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return func(args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return func(args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 436, in execute_model
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] output = self.model_runner.execute_model(scheduler_output,
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return func(*args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2064, in execute_model
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] model_output = self.model(
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return self._call_impl(args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return forward_call(args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/roberta.py", line 126, in forward
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return self.model(input_ids=input_ids,
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 312, in call
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] model_output = self.forward(*args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/bert.py", line 351, in forward
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] def forward(
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return self._call_impl(args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return forward_call(args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return fn(*args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return self._wrapped_call(self, args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in call
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] raise e
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in call
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return super(self.cls, obj).call(args, kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return self._call_impl(*args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return forward_call(args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "<eval_with_key>.50", line 306, in forward
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3); getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return self._wrapped_call(self, args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in call
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] raise e
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in call
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return super(self.cls, obj).call(*args, kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return self._call_impl(args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return forward_call(args, kwargs)
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "<eval_with_key>.2", line 5, in forward
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query, key, value, output_1, 'model.encoder.layer.0.attention.output.attn'); query = key = value = output_1 = unified_attention_with_output = None
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 1123, in call
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return self._op(*args, (kwargs or {}))
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm/attention/layer.py", line 521, in unified_attention_with_output
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] self.impl.forward(self,
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm_metax/v1/attention/backends/flash_attn.py", line 560, in forward
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] return self._forward_encoder_attention(query[:num_actual_tokens],
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] File "/opt/conda/lib/python3.10/site-packages/vllm_metax/v1/attention/backends/flash_attn.py", line 709, in _forward_encoder_attention
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] flash_attn_varlen_func(
(EngineCore_DP0 pid=3880) ERROR 11-20 10:44:16 [core.py:720] TypeError: flash_attn_varlen_func() got an unexpected keyword argument 'out'
(EngineCore_DP0 pid=3880) Process EngineCore_DP0:
(EngineCore_DP0 pid=3880) Traceback (most recent call last):
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=3880) self.run()
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=3880) self._target(self._args, self._kwargs)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_DP0 pid=3880) raise e
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 711, in run_engine_core
(EngineCore_DP0 pid=3880) engine_core.run_busy_loop()
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 738, in run_busy_loop
(EngineCore_DP0 pid=3880) self._process_engine_step()
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 764, in _process_engine_step
(EngineCore_DP0 pid=3880) outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 292, in step
(EngineCore_DP0 pid=3880) model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 278, in execute_model_with_error_logging
(EngineCore_DP0 pid=3880) raise err
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 269, in execute_model_with_error_logging
(EngineCore_DP0 pid=3880) return model_fn(scheduler_output)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 93, in execute_model
(EngineCore_DP0 pid=3880) output = self.collective_rpc("execute_model",
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_DP0 pid=3880) answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/utils/init.py", line 3060, in run_method
(EngineCore_DP0 pid=3880) return func(args, kwargs)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=3880) return func(*args, kwargs)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 436, in execute_model
(EngineCore_DP0 pid=3880) output = self.model_runner.execute_model(scheduler_output,
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_DP0 pid=3880) return func(args, kwargs)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2064, in execute_model
(EngineCore_DP0 pid=3880) model_output = self.model(
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=3880) return self._call_impl(args, kwargs)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=3880) return forward_call(*args, kwargs)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/roberta.py", line 126, in forward
(EngineCore_DP0 pid=3880) return self.model(input_ids=input_ids,
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 312, in call
(APIServer pid=3604) ERROR 11-20 10:44:16 [async_llm.py:485] AsyncLLM output_handler failed.
(EngineCore_DP0 pid=3880) model_output = self.forward(args, kwargs)
(APIServer pid=3604) ERROR 11-20 10:44:16 [async_llm.py:485] Traceback (most recent call last):
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/bert.py", line 351, in forward
(APIServer pid=3604) ERROR 11-20 10:44:16 [async_llm.py:485] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 444, in output_handler
(EngineCore_DP0 pid=3880) def forward(
(APIServer pid=3604) ERROR 11-20 10:44:16 [async_llm.py:485] outputs = await engine_core.get_output_async()
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(APIServer pid=3604) ERROR 11-20 10:44:16 [async_llm.py:485] File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 845, in get_output_async
(EngineCore_DP0 pid=3880) return self._call_impl(args, kwargs)
(APIServer pid=3604) ERROR 11-20 10:44:16 [async_llm.py:485] raise self._format_exception(outputs) from None
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(APIServer pid=3604) ERROR 11-20 10:44:16 [async_llm.py:485] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore_DP0 pid=3880) return forward_call(*args, kwargs)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
(EngineCore_DP0 pid=3880) return fn(args, kwargs)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(EngineCore_DP0 pid=3880) return self._wrapped_call(self, args, kwargs)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in call
(EngineCore_DP0 pid=3880) raise e
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in call
(EngineCore_DP0 pid=3880) return super(self.cls, obj).call(*args, kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=3880) return self._call_impl(args, kwargs)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=3880) return forward_call(args, kwargs)
(EngineCore_DP0 pid=3880) File "<eval_with_key>.50", line 306, in forward
(EngineCore_DP0 pid=3880) submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3); getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(EngineCore_DP0 pid=3880) return self._wrapped_call(self, *args, kwargs)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in call
(EngineCore_DP0 pid=3880) raise e
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in call
(EngineCore_DP0 pid=3880) return super(self.cls, obj).call(args, kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(EngineCore_DP0 pid=3880) return self._call_impl(args, kwargs)
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(EngineCore_DP0 pid=3880) return forward_call(*args, kwargs)
(EngineCore_DP0 pid=3880) File "<eval_with_key>.2", line 5, in forward
(EngineCore_DP0 pid=3880) unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query, key, value, output_1, 'model.encoder.layer.0.attention.output.attn'); query = key = value = output_1 = unified_attention_with_output = None
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 1123, in call
(EngineCore_DP0 pid=3880) return self._op(args, *(kwargs or {}))
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm/attention/layer.py", line 521, in unified_attention_with_output
(EngineCore_DP0 pid=3880) self.impl.forward(self,
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm_metax/v1/attention/backends/flash_attn.py", line 560, in forward
(EngineCore_DP0 pid=3880) return self._forward_encoder_attention(query[:num_actual_tokens],
(EngineCore_DP0 pid=3880) File "/opt/conda/lib/python3.10/site-packages/vllm_metax/v1/attention/backends/flash_attn.py", line 709, in _forward_encoder_attention
(EngineCore_DP0 pid=3880) flash_attn_varlen_func(
(EngineCore_DP0 pid=3880) TypeError: flash_attn_varlen_func() got an unexpected keyword argument 'out'
(APIServer pid=3604) INFO: 127.0.0.1:55752 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
[rank0]:[W1120 10:44:16.605324213 ProcessGroupNCCL.cpp:1502] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=3604) INFO: Shutting down
(APIServer pid=3604) INFO: Waiting for application shutdown.
(APIServer pid=3604) INFO: Application shutdown complete.
(APIServer pid=3604) INFO: Finished server process [3604]
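In case it helps narrow this down, the next variant I plan to try is eager execution, to rule out the torch.compile / piecewise-graph path. --enforce-eager is a standard vLLM flag; I am not sure it changes anything here, since the failure appears to be inside the attention backend itself rather than in the compiled graph:

# Same serve command as above, with compilation disabled (not yet tested on my side)
vllm serve /root/models/bge-m3 \
    --served-model-name embed \
    --gpu-memory-utilization 0.3 \
    --enforce-eager \
    --host 0.0.0.0 --port 8001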
A separate question: is there a best-practice reference for deploying multiple models on a single N260 at the same time, for example serving Qwen3-32B-AWQ and bge-m3 together on one card?
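What I have in mind is roughly two vLLM instances inside the same container, splitting the card via --gpu-memory-utilization and listening on different ports. The Qwen3-32B-AWQ model path, the memory fractions, and the ports below are placeholders of my own, not tested values:

# Instance 1: Qwen3-32B-AWQ for chat (placeholder path and memory fraction;
# --max-model-len may also need to be lowered to fit alongside the embedding model)
vllm serve /root/models/Qwen3-32B-AWQ \
    --served-model-name qwen3 \
    --gpu-memory-utilization 0.6 \
    --host 0.0.0.0 --port 8000 &

# Instance 2: bge-m3 for embeddings, once the issue above is resolved
vllm serve /root/models/bge-m3 \
    --served-model-name embed \
    --gpu-memory-utilization 0.3 \
    --host 0.0.0.0 --port 8001 &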