How do I deploy the bge-m3 and bge-reranker-v2-m3 models on a MetaX C500?

langhongbin

Members 15 posts

May 22, 2026, 12:22

I. Hardware and software information:
1. Server vendor: Inspur

2. MetaX GPU model: MetaX C500, 8 cards

3. OS kernel version: 6.6.0-32.7.v2505.ky11.x86_64

4. CPU virtualization enabled: yes

5. mx-smi output:
mx-smi version: 2.2.12

=================== MetaX System Management Interface Log ===================
Timestamp : Wed May 20 18:14:56 2026

Attached GPUs : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.12 Kernel Mode Driver Version: 3.6.11 |
| MACA Version: unknown BIOS Version: 1.31.1.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name | GPU Persist-M | Bus-id | GPU-Util sGPU-M |
| Pwr:Usage/Cap | Temp Perf | Memory-Usage | GPU-State |
|==================+=================+=====================+======================|
| 0 MetaX C500 | 0 Off | 0000:04:00.0 | 0% Disabled |
| 82W / 350W | 61C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 1 MetaX C500 | 1 Off | 0000:05:00.0 | 0% Disabled |
| 75W / 350W | 58C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 2 MetaX C500 | 2 Off | 0000:63:00.0 | 0% Disabled |
| 80W / 350W | 56C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 3 MetaX C500 | 3 Off | 0000:64:00.0 | 0% Disabled |
| 80W / 350W | 59C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 4 MetaX C500 | 4 Off | 0000:83:00.0 | 0% Disabled |
| 82W / 350W | 56C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 5 MetaX C500 | 5 Off | 0000:84:00.0 | 0% Disabled |
| 72W / 350W | 53C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 6 MetaX C500 | 6 Off | 0000:e4:00.0 | 0% Disabled |
| 81W / 350W | 58C P9 | 40993/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+
| 7 MetaX C500 | 7 Off | 0000:e5:00.0 | 0% Disabled |
| 74W / 350W | 54C P9 | 40353/65536 MiB | Available |
+------------------+-----------------+---------------------+----------------------+

6. docker info output:
[root@localhost ~]# docker info
Client:
Version: 24.0.9
Context: default
Debug Mode: false

Server:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 1
Server Version: 24.0.9
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9a04df1519ac2967eece6c6a5d13d3b846b574b2.m
runc version:
init version:
Security Options:
seccomp
Profile: builtin
Kernel Version: 6.6.0-32.7.v2505.ky11.x86_64
Operating System: Kylin Linux Advanced Server V11 (Swan25)
OSType: linux
Architecture: x86_64
CPUs: 256
Total Memory: 1.472TiB
Name: localhost.localdomain
ID: ded90092-4000-426b-a3ca-08950e376242
Docker Root Dir: /home/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
docker.1ms.run/
dockerpull.com/
registry.docker-cn.com/
Live Restore Enabled: false

II. Question
How do I deploy the bge-m3 and bge-reranker-v2-m3 models on the MetaX C500?

shuai_chen

Members 650 posts

May 22, 2026, 13:30

Dear developer, please deploy with the vLLM-MetaX image from the MetaX Developer Community image center: developer.metax-tech.com/softnova/docker?chip_name=%E6%9B%A6%E4%BA%91C500%E7%B3%BB%E5%88%97&package_kind=AI&dimension=docker&deliver_type=%E5%88%86%E5%B1%82%E5%8C%85&ai_frame=vllm-metax
For the service launch commands, refer to the vLLM community documentation:
docs.vllm.ai/en/stable/examples/pooling/embed/
docs.vllm.ai/en/stable/examples/pooling/score/
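
For readers following those two documentation pages, here is a minimal sketch of what client requests might look like once both services are running. It assumes vLLM's OpenAI-compatible `/v1/embeddings` route and its `/rerank` route. The ports (8001 and 8002), the served model names, and the helper function names are illustrative assumptions, not values confirmed in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, ports, and model names are assumptions.

# Build the JSON body for an embedding request.
# Naive interpolation: the text must not contain double quotes or backslashes.
embed_payload() {
  local model="$1" text="$2"
  printf '{"model":"%s","input":["%s"]}' "$model" "$text"
}

# Build the JSON body for a rerank request with one query and one document.
rerank_payload() {
  local model="$1" query="$2" doc="$3"
  printf '{"model":"%s","query":"%s","documents":["%s"]}' "$model" "$query" "$doc"
}

# Send an embedding request to a service assumed to listen on the given port.
query_embed() {
  local port="$1" text="$2"
  curl -s "http://127.0.0.1:${port}/v1/embeddings" \
    -H 'Content-Type: application/json' \
    -d "$(embed_payload bge-m3 "$text")"
}

# Send a rerank request to a service assumed to listen on the given port.
query_rerank() {
  local port="$1" query="$2" doc="$3"
  curl -s "http://127.0.0.1:${port}/rerank" \
    -H 'Content-Type: application/json' \
    -d "$(rerank_payload bge-reranker-v2-m3 "$query" "$doc")"
}
```

With both services up, `query_embed 8001 "hello"` and `query_rerank 8002 "what is vLLM" "vLLM is an inference engine"` would exercise each endpoint. The model name in each payload has to match the value passed to `--served-model-name`.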

langhongbin

Members 15 posts

May 22, 2026, 14:47

1. Image version:
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

2. Container start command:
docker run -itd \
--name qwen3.6 \
--network host \
--shm-size 512G \
--device=/dev/dri \
--device=/dev/mxcd \
--group-add video \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--shm-size 100gb \
--ulimit memlock=-1 \
-v /home/modelscope:/root/vllm \
-e TZ=Asia/Shanghai \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64

nohup vllm serve /root/vllm/bge-m3/ \
--host 0.0.0.0 \
--port 8001 \
--served-model-name bge-m3 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.1 \
--trust-remote-code \
--dtype auto \
> bge-m3.log 2>&1 &

nohup vllm serve /root/vllm/bge-reranker-v2-m3/ \
--host 0.0.0.0 \
--port 8001 \
--served-model-name bge-reranker-v2-m3 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.1 \
--trust-remote-code \
--dtype auto \
> reranker.log 2>&1 &

II. Problem:
Starting multiple services in the container fails with the following errors:
itionalGeneration.
WARNING 05-22 14:34:58 [registry.py:915] Model architecture GlmMoeDsaForCausalLM is already registered, and will be overwritten by the new model class vllm_metax.models.deepseek_v2:GlmMoeDsaForCausalLM.
(Worker pid=2004) INFO 05-22 14:34:59 [parallel_state.py:1400] world_size=8 rank=5 local_rank=5 distributed_init_method=tcp://127.0.0.1:45863 backend=nccl
(Worker pid=2000) [rank1]:W0522 14:34:59.640000 2000 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2000) [rank1]:W0522 14:34:59.640000 2000 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=2002) [rank3]:W0522 14:34:59.640000 2002 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2002) [rank3]:W0522 14:34:59.640000 2002 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=2001) [rank2]:W0522 14:34:59.640000 2001 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2001) [rank2]:W0522 14:34:59.640000 2001 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=1999) [rank0]:W0522 14:34:59.640000 1999 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=1999) [rank0]:W0522 14:34:59.640000 1999 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=2005) [rank6]:W0522 14:34:59.641000 2005 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2006) [rank7]:W0522 14:34:59.641000 2006 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2005) [rank6]:W0522 14:34:59.641000 2005 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=2006) [rank7]:W0522 14:34:59.641000 2006 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=2004) [rank5]:W0522 14:34:59.641000 2004 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2004) [rank5]:W0522 14:34:59.641000 2004 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=2003) [rank4]:W0522 14:34:59.642000 2003 site-packages/torch/utils/cpp_extension.py:2527] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(Worker pid=2003) [rank4]:W0522 14:34:59.642000 2003 site-packages/torch/utils/cpp_extension.py:2527] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(Worker pid=1999) INFO 05-22 14:34:59 [mccl.py:27] Found mccl from library libmccl.so
(Worker pid=1999) INFO 05-22 14:34:59 [pynccl.py:111] vLLM is using nccl==2.16.5
[14:35:11.312][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[14:35:21.552][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[14:35:31.792][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[14:35:42.032][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[14:35:52.273][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[14:36:02.512][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[14:36:12.752][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.

shuai_chen

Members 650 posts

May 22, 2026, 14:49

Dear developer, both of your services use the same port. Please change one of them and try again.
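
Both `vllm serve` commands in the previous post pass `--port 8001`. A quick pre-launch check for this kind of clash can be sketched as follows; the helper name `find_duplicate_ports` is hypothetical and the port list is typed in by hand rather than parsed from the commands.

```shell
#!/usr/bin/env bash
# Sketch only: prints every port number that appears more than once in its arguments.
find_duplicate_ports() {
  # One port per line, sorted numerically so duplicates are adjacent;
  # uniq -d keeps one copy of each repeated value.
  printf '%s\n' "$@" | sort -n | uniq -d
}

# The two ports used in the commands above.
dups="$(find_duplicate_ports 8001 8001)"
if [ -n "$dups" ]; then
  echo "port(s) used more than once: $dups"
fi
```

This only catches duplicates among the ports you plan to use. It does not detect a port that is already held by some other process on the host.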

langhongbin

Members 15 posts

May 23, 2026, 16:38

After giving the services different ports, the same error still occurs.
Launch commands:
nohup vllm serve /root/vllm/Qwen/Qwen3.6-35B-A3B/ \
--host 0.0.0.0 \
--port 8000 \
--served-model-name qwen3.6 \
--dtype bfloat16 \
--trust-remote-code \
--tensor-parallel-size 4 \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--max-model-len 32768 \
--max-num-batched-tokens 131072 \
--max-num-seqs 128 \
--enable-chunked-prefill \
--enable-prefix-caching \
> qwen.log 2>&1 &

nohup vllm serve /root/vllm/bge-m3/ \
--host 0.0.0.0 \
--port 8001 \
--served-model-name bge-m3 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.1 \
--trust-remote-code \
--dtype auto \
> bge-m3.log 2>&1 &

Error log:
(EngineCore pid=26596) INFO 05-23 16:26:48 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='/root/vllm/bge-m3/', speculative_config=None, tokenizer='/root/vllm/bge-m3/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=bge-m3, enable_prefix_caching=False, enable_chunked_prefill=False, pooler_config=PoolerConfig(task=None, pooling_type=None, seq_pooling_type='CLS', tok_pooling_type='ALL', use_activation=True, dimensions=None, enable_chunked_processing=False, max_embed_len=None, logit_bias=None, step_tag_id=None, returned_token_ids=None), compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::mx_sparse_attn_indexer', 'vllm::mx_sparse_attn_indexer_bf16', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=26596) INFO 05-23 16:26:48 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.217.247.136:40835 backend=nccl
(EngineCore pid=26596) INFO 05-23 16:26:48 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
[16:26:59.536][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:27:09.776][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:27:20.016][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:27:30.256][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:27:40.496][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:27:50.736][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
[16:28:00.977][MXKW][E]queues.c :826 : [mxkwCreateQueueBlock][Hint]ioctl create queue block timeout, gpu_id:65475 type:21. Retrying.
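
The driver message above repeats roughly every 10 seconds. When attaching a log like this to a follow-up post, a small helper such as the following sketch can summarize how many times the `mxkwCreateQueueBlock` timeout occurred and over what time span. The function names are hypothetical and the log file name is taken from the redirect in the launch command; the sketch only summarizes the log and does not diagnose the cause.

```shell
#!/usr/bin/env bash
# Sketch only: summarize mxkwCreateQueueBlock timeout lines in a vLLM log file.

# Print the number of lines that mention mxkwCreateQueueBlock.
count_queue_timeouts() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps callers
  # that run under "set -e" from aborting.
  grep -c 'mxkwCreateQueueBlock' "$1" || true
}

# Print the timestamps of the first and last matching lines, one per line.
# A log with a single match prints the same timestamp twice.
queue_timeout_window() {
  grep 'mxkwCreateQueueBlock' "$1" \
    | sed -n '1p;$p' \
    | cut -d']' -f1 \
    | tr -d '['
}

# Example usage against the log written by the launch command above:
#   count_queue_timeouts bge-m3.log
#   queue_timeout_window bge-m3.log
```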