Configuration:
Lenovo SR658H, 512 GB RAM, GPUs: MetaX N260 × 2
Problem: either instance runs fine on its own, but starting the second one cannot allocate GPU memory and hangs.
Image version:
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.11.2-maca.ai3.3.0.103-torch2.8-py312-ubuntu22.04-amd64
Docker command:
docker run -itd \
--restart always \
--privileged \
--device=/dev/dri \
--device=/dev/mxcd \
--group-add video \
--network=host \
--name Qwen3-Next-80B-A3B-Instruct.w8a8 \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--shm-size 100gb \
--ulimit memlock=-1 \
-v /models:/models \
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.11.2-maca.ai3.3.0.103-torch2.8-py312-ubuntu22.04-amd64 \
/bin/bash
Model launch command:
VLLM_USE_V1=0 nohup vllm serve /models/Qwen3-Next-80B-A3B-Instruct.w8a8 \
--port 8889 \
-tp 2 \
--enforce-eager \
--max-model-len 15000 \
--gpu-memory-utilization 0.7 \
--api-key Dzdwd@85416 \
--max-num-seqs 35 \
--served-model-name Qwen3-Next-80B-A3B-Instruct.w8a8 > vllm-80b.log 2>&1 &
Embedding launch command:
nohup vllm serve /models/qwen3-Embedding-0.6B \
--port 8890 \
--enforce-eager \
--served-model-name qwen3-Embedding-0.6B \
--max-model-len 1024 \
--gpu-memory-utilization 0.1 \
--trust-remote-code \
--task embed \
--api-key Dzdwd@85416 > vllm-emb.log 2>&1 &
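For context, here is a back-of-the-envelope sketch of what the two instances should reserve, assuming `--gpu-memory-utilization` is a fraction of each GPU's total VRAM (the documented vLLM semantics). The 65536 MiB total is taken from the mx-smi output below; the numbers are estimates, not measurements:

```python
# Rough per-GPU VRAM budget for the two vLLM instances above.
# Assumes --gpu-memory-utilization is a fraction of TOTAL GPU memory.

TOTAL_MIB = 65536            # per-GPU memory reported by mx-smi (64 GiB)

big_model = 0.7 * TOTAL_MIB  # Qwen3-Next-80B with -tp 2 -> reserved on BOTH GPUs
embed     = 0.1 * TOTAL_MIB  # qwen3-Embedding-0.6B, single GPU (GPU 0 by default)

print(f"80B instance reserves ~{big_model:.0f} MiB per GPU")  # ~45875 MiB
print(f"embedding reserves    ~{embed:.0f} MiB on its GPU")   # ~6554 MiB
print(f"combined on GPU 0     ~{big_model + embed:.0f} MiB")  # ~52429 MiB
```

On paper the 0.7 + 0.1 split fits in 64 GiB, but mx-smi below shows ~47883 MiB already used per GPU, about 2 GiB above what the 0.7 fraction implies, so actual runtime overhead is higher than the fractions suggest.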
Problem: either instance runs fine on its own, but starting the second one cannot allocate GPU memory and hangs. Log from launching the embedding instance:
(EngineCore_DP0 pid=20179) INFO 01-30 12:54:48 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='/models/qwen3-Embedding-0.6B', speculative_config=None, tokenizer='/models/qwen3-Embedding-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=qwen3-Embedding-0.6B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=PoolerConfig(pooling_type='LAST', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, softmax=None, activation=None, use_activation=None, logit_bias=None, step_tag_id=None, returned_token_ids=None), compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 0, 'local_cache_dir': None}
(EngineCore_DP0 pid=20179) INFO 01-30 12:54:48 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.196.210.3:40141 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=20179) INFO 01-30 12:54:49 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
mx-smi output:
mx-smi version: 2.2.9
=================== MetaX System Management Interface Log ===================
Timestamp : Fri Jan 30 12:59:17 2026
Attached GPUs : 2
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.9 Kernel Mode Driver Version: 3.4.4 |
| MACA Version: 3.3.0.15 BIOS Version: 1.29.1.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX N260 | 0 Off | 0000:41:00.0 | 0% Disabled |
| 52W / 225W | 43C P9 | 47883/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 1 MetaX N260 | 1 Off | 0000:c1:00.0 | 0% Disabled |
| 47W / 225W | 40C P9 | 47867/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| 0 2322349 VLLM::Worker_TP 47198 |
| 0 2343541 VLLM::EngineCor 16 |
| 1 2322350 VLLM::Worker_TP 47198 |
+---------------------------------------------------------------------------------+
How should I fix this, or is something wrong with my parameters?