MetaX-Tech Developer Forum

mukewang

  • Members
  • Joined May 11, 2026

mukewang has started 3 threads.

  • mukewang
    Members
    Model terminates abnormally while running and then fails to start on every restart [In progress] May 22, 2026 13:51

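    I. Hardware and software information
    1. Server vendor: Inspur Information
    2. MetaX GPU model: one MetaX Xisi N260
    3. OS kernel version: 4.19.90-89.11.v2401.ky10.x86_64
    4. CPU virtualization enabled: yes
    5. mx-smi output: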
    mx-smi version: 2.3.1

    =================== MetaX System Management Interface Log ===================
    Timestamp : Mon May 11 10:18:11 2026

    Attached GPUs : 1
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.3.1                       Kernel Mode Driver Version: 3.7.11           |
    | MACA Version: 3.7.0.38             BIOS Version: 1.31.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name       | GPU  Persist-M  | Bus-id              | GPU-Util  sGPU-M     |
    | Pwr:Usage/Cap    | Temp  Perf      | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0    Off        | 0000:c1:00.0        | 0%        Disabled   |
    | 60W / 225W       | 59C   P9        | 52895/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    | GPU     PID         Process Name                            GPU Memory          |
    |                                                             Usage(MiB)          |
    |=================================================================================|
    | 0       3760427     VLLM::EngineCor                         52228               |
    +---------------------------------------------------------------------------------+
    6. docker info output:
    Client:
     Version: 29.3.1
     Context: default
     Debug Mode: false
     Plugins:
      compose: Docker Compose (Docker Inc.)
        Version: v2.24.6
        Path: /usr/local/lib/docker/cli-plugins/docker-compose

    Server:
     Containers: 25
      Running: 24
      Paused: 0
      Stopped: 1
     Images: 56
     Server Version: 29.3.1
     Storage Driver: overlayfs
      driver-type: io.containerd.snapshotter.v1
     Logging Driver: json-file
     Cgroup Driver: cgroupfs
     Cgroup Version: 1
     Plugins:
      Volume: local
      Network: bridge host ipvlan macvlan null overlay
      Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
     CDI spec directories:
      /etc/cdi
      /var/run/cdi
     Swarm: inactive
     Runtimes: io.containerd.runc.v2 metax runc
     Default Runtime: runc
     Init Binary: docker-init
     containerd version: 301b2dac98f15c27117da5c8af12118a041a31d9
     runc version: v1.3.4-0-gd6d73eb
     init version: de40ad0
     Security Options:
      seccomp
       Profile: builtin
     Kernel Version: 4.19.90-89.11.v2401.ky10.x86_64
     Operating System: Kylin Linux Advanced Server V10 (Halberd)
     OSType: linux
     Architecture: x86_64
     CPUs: 64
     Total Memory: 61.55GiB
     Name: localhost.localdomain
     ID: f92e3bfc-06d2-4441-886f-8b48bf0e6b27
     Docker Root Dir: /var/lib/docker
     Debug Mode: false
     Experimental: false
     Insecure Registries:
      ::1/128
      127.0.0.0/8
     Live Restore Enabled: false
     Product License: Community Engine
     Firewall Backend: iptables

    WARNING: Support for cgroup v1 is deprecated and planned to be removed by no later than May 2029 (github.com/moby/moby/issues/51111)
    7. Image version:
    vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64
    8. Container start command:
    metax-docker run -itd --gpus="[<sgpu:${GPU_UUID}>]" --group-add video --network=host --name llm-model --entrypoint bash --restart unless-stopped --shm-size=32g --security-opt seccomp=unconfined --security-opt apparmor=unconfined --ulimit memlock=-1 -v /home/models:/models cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.19.0-maca.ai3.5.3.502-torch2.8-py312-kylinv11-amd64 -c "/models/run_model.sh"
    9. Command run inside the container:
    VLLM_USE_V1=1 /opt/conda/bin/vllm serve /models/Qwen3-32B-AWQ --max-num-seqs 8 --async-scheduling --host 0.0.0.0 --port 9901 --served-model-name qwen3 -tp 1 --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 8192 --max-num-batched-tokens 8192 --reasoning-parser qwen3 --no-enable-prefix-caching

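    II. Problem description
    While serving the model with vLLM, the model hit an error and stopped abnormally, and it can no longer be restarted. Every restart attempt fails with the message "memory access offset is negative, out of bounds, or misaligned in kernel". How can this be resolved?

    The error log is provided in the attachment.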
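Before restarting, it may be worth ruling out an engine process from the crashed run that still holds GPU memory. The following is a minimal sketch, assuming the mx-smi Process table keeps the row layout shown in the output above. `PROCESS_ROW` and `parse_gpu_processes` are illustrative names, not part of any MetaX tool, and nothing in this post establishes that a leftover process is related to the restart error.

```python
import re

# Matches a data row of the mx-smi "Process" table, for example
# "| 0 3760427 VLLM::EngineCor 52228 |" or "| 0-s1 1800252 VLLM::EngineCor 10976 |".
# Assumption: the process name contains no spaces, as in the rows shown in this post.
PROCESS_ROW = re.compile(
    r"^\s*\|\s*(?P<gpu>\d+(?:-s\d+)?)\s+(?P<pid>\d+)\s+(?P<name>\S+)\s+(?P<mem_mib>\d+)\s*\|\s*$"
)


def parse_gpu_processes(mx_smi_text):
    """Return one dict per process row found in mx-smi output text."""
    rows = []
    for line in mx_smi_text.splitlines():
        match = PROCESS_ROW.match(line)
        if match:
            rows.append(
                {
                    "gpu": match["gpu"],
                    "pid": int(match["pid"]),
                    "name": match["name"],
                    "mem_mib": int(match["mem_mib"]),
                }
            )
    return rows


# Sample copied from the mx-smi output in this post.
SAMPLE = """\
| 0 MetaX N260 | 0 Off | 0000:c1:00.0 | 0% Disabled |
| 60W / 225W | 59C P9 | 52895/65536 MiB | Available |
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
| 0 3760427 VLLM::EngineCor 52228 |
"""

if __name__ == "__main__":
    for proc in parse_gpu_processes(SAMPLE):
        print(f"GPU {proc['gpu']}: pid {proc['pid']} ({proc['name']}) holds {proc['mem_mib']} MiB")
```

In practice the text would come from running mx-smi on the host. Any PID reported there is a host PID, so it will not match the PIDs printed inside the container logs.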

  • mukewang
    Members
    sGPU persistence issue [In progress] May 15, 2026 11:15

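    I. Hardware and software information
    1. Server vendor: Inspur Information
    2. MetaX GPU model: one MetaX Xisi N260
    3. OS kernel version: 4.19.90-89.11.v2401.ky10.x86_64
    4. CPU virtualization enabled: yes
    5. mx-smi output: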
    mx-smi version: 2.3.1

    =================== MetaX System Management Interface Log ===================
    Timestamp : Fri May 15 11:06:59 2026

    Attached GPUs : 1
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.3.1                       Kernel Mode Driver Version: 3.8.23           |
    | MACA Version: unknown              BIOS Version: 1.31.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name       | GPU  Persist-M  | Bus-id              | GPU-Util  sGPU-M     |
    | Pwr:Usage/Cap    | Temp  Perf      | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0    Off        | 0000:c1:00.0        | 0%        Enabled    |
    | 58W / 225W       | 56C   P9        | 58711/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Sliced GPU                                                                      |
    |------------------------------------+---------------------+----------------------|
    | Minor  GPU  sGPU-Id  Compute       | Vram Quota          | sGPU-Util            |
    |====================================+=====================+======================|
    | 000    0    0        80%           | 47068/49152 MiB     | 0%                   |
    +------------------------------------+---------------------+----------------------+
    | 001    0    1        20%           | 10976/12288 MiB     | 0%                   |
    +------------------------------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    | GPU     PID         Process Name                            GPU Memory          |
    |                                                             Usage(MiB)          |
    |=================================================================================|
    | 0-s0    1797542     VLLM::EngineCor                         47068               |
    | 0-s1    1800252     VLLM::EngineCor                         10976               |
    +---------------------------------------------------------------------------------+

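    II. Problem description
    Because there is only one card and both the qwen3 and qwen3-embedding models must run at the same time, I previously used sGPU to split the single card into two slices: 48 GiB with 80% compute and 12 GiB with 20% compute. However, whenever the machine reboots, the sGPU slices disappear and the card's sGPU mode reverts to Disabled. How can the sGPU configuration be made persistent across reboots? The mx-smi output after a reboot is shown below: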
    mx-smi version: 2.3.1

    =================== MetaX System Management Interface Log ===================
    Timestamp : Fri May 15 11:10:17 2026

    Attached GPUs : 1
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.3.1                       Kernel Mode Driver Version: 3.8.23           |
    | MACA Version: unknown              BIOS Version: 1.31.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name       | GPU  Persist-M  | Bus-id              | GPU-Util  sGPU-M     |
    | Pwr:Usage/Cap    | Temp  Perf      | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0    Off        | 0000:c1:00.0        | 0%        Disabled   |
    | 35W / 225W       | 50C   P0        | 666/65536 MiB       | Available            |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    | GPU     PID         Process Name                            GPU Memory          |
    |                                                             Usage(MiB)          |
    |=================================================================================|
    | no process found                                                                |
    +---------------------------------------------------------------------------------+
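The difference between the two mx-smi snapshots above can be detected mechanically, for instance by a check that runs at boot. The sketch below only parses mx-smi text and reports whether the sGPU layout is missing. It does not re-create the slices, because this post does not show the command that created them. `GPU_ROW`, `SLICE_ROW`, `sgpu_state` and `needs_recreation` are hypothetical names, and the row layouts are assumed to match the output shown here.

```python
import re

# Data row of the main GPU table, for example
# "| 0 MetaX N260 | 0 Off | 0000:c1:00.0 | 0% Enabled |".
GPU_ROW = re.compile(
    r"^\s*\|\s*(?P<board>\d+)\s+[^|]+\|[^|]*\|\s*[0-9a-fA-F:.]+\s*\|"
    r"\s*\d+%\s+(?P<sgpu_mode>Enabled|Disabled)\s*\|\s*$"
)
# Data row of the "Sliced GPU" table, for example
# "| 000 0 0 80% | 47068/49152 MiB | 0% |".
SLICE_ROW = re.compile(
    r"^\s*\|\s*(?P<minor>\d+)\s+(?P<gpu>\d+)\s+(?P<sgpu_id>\d+)\s+(?P<compute>\d+)%\s*\|"
    r"\s*(?P<used>\d+)/(?P<quota>\d+)\s*MiB\s*\|\s*\d+%\s*\|\s*$"
)


def sgpu_state(mx_smi_text):
    """Return ({board: sGPU mode}, [slice dicts]) parsed from mx-smi output text."""
    modes, slices = {}, []
    for line in mx_smi_text.splitlines():
        gpu = GPU_ROW.match(line)
        if gpu:
            modes[int(gpu["board"])] = gpu["sgpu_mode"]
            continue
        sl = SLICE_ROW.match(line)
        if sl:
            slices.append(
                {
                    "gpu": int(sl["gpu"]),
                    "sgpu_id": int(sl["sgpu_id"]),
                    "compute_pct": int(sl["compute"]),
                    "vram_quota_mib": int(sl["quota"]),
                }
            )
    return modes, slices


def needs_recreation(mx_smi_text, expected_slices):
    """True when sGPU mode is not Enabled or fewer slices exist than expected."""
    modes, slices = sgpu_state(mx_smi_text)
    mode_lost = any(mode != "Enabled" for mode in modes.values())
    return mode_lost or len(slices) < expected_slices


# Samples copied from the two mx-smi outputs in this post.
BEFORE_REBOOT = """\
| 0 MetaX N260 | 0 Off | 0000:c1:00.0 | 0% Enabled |
| 58W / 225W | 56C P9 | 58711/65536 MiB | Available |
| 000 0 0 80% | 47068/49152 MiB | 0% |
| 001 0 1 20% | 10976/12288 MiB | 0% |
"""
AFTER_REBOOT = """\
| 0 MetaX N260 | 0 Off | 0000:c1:00.0 | 0% Disabled |
| 35W / 225W | 50C P0 | 666/65536 MiB | Available |
"""

if __name__ == "__main__":
    print("before reboot:", needs_recreation(BEFORE_REBOOT, expected_slices=2))
    print("after reboot: ", needs_recreation(AFTER_REBOOT, expected_slices=2))
```

A check like this could gate the start of the model containers, so that they are not launched against slices that no longer exist. The actual persistence mechanism, if one exists, is for MetaX to confirm.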

  • mukewang
    Members
    Qwen3.5-27B-W8A8 inference is slow and frequently terminates abnormally [Resolved] May 11, 2026 11:27

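    I. Hardware and software information
    1. Server vendor: Inspur Information
    2. MetaX GPU model: one MetaX Xisi N260
    3. OS kernel version: 4.19.90-89.11.v2401.ky10.x86_64
    4. CPU virtualization enabled: yes
    5. mx-smi output: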
    mx-smi version: 2.3.1

    =================== MetaX System Management Interface Log ===================
    Timestamp : Mon May 11 10:18:11 2026

    Attached GPUs : 1
    +---------------------------------------------------------------------------------+
    | MX-SMI 2.3.1                       Kernel Mode Driver Version: 3.7.11           |
    | MACA Version: 3.7.0.38             BIOS Version: 1.31.1.0                       |
    |------------------+-----------------+---------------------+----------------------|
    | Board Name       | GPU  Persist-M  | Bus-id              | GPU-Util  sGPU-M     |
    | Pwr:Usage/Cap    | Temp  Perf      | Memory-Usage        | GPU-State            |
    |==================+=================+=====================+======================|
    | 0     MetaX N260 | 0    Off        | 0000:c1:00.0        | 0%        Disabled   |
    | 60W / 225W       | 59C   P9        | 52895/65536 MiB     | Available            |
    +------------------+-----------------+---------------------+----------------------+

    +---------------------------------------------------------------------------------+
    | Process:                                                                        |
    | GPU     PID         Process Name                            GPU Memory          |
    |                                                             Usage(MiB)          |
    |=================================================================================|
    | 0       3760427     VLLM::EngineCor                         52228               |
    +---------------------------------------------------------------------------------+
    6. docker info output:
    Client:
     Version: 29.3.1
     Context: default
     Debug Mode: false
     Plugins:
      compose: Docker Compose (Docker Inc.)
        Version: v2.24.6
        Path: /usr/local/lib/docker/cli-plugins/docker-compose

    Server:
     Containers: 25
      Running: 24
      Paused: 0
      Stopped: 1
     Images: 56
     Server Version: 29.3.1
     Storage Driver: overlayfs
      driver-type: io.containerd.snapshotter.v1
     Logging Driver: json-file
     Cgroup Driver: cgroupfs
     Cgroup Version: 1
     Plugins:
      Volume: local
      Network: bridge host ipvlan macvlan null overlay
      Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
     CDI spec directories:
      /etc/cdi
      /var/run/cdi
     Swarm: inactive
     Runtimes: io.containerd.runc.v2 metax runc
     Default Runtime: runc
     Init Binary: docker-init
     containerd version: 301b2dac98f15c27117da5c8af12118a041a31d9
     runc version: v1.3.4-0-gd6d73eb
     init version: de40ad0
     Security Options:
      seccomp
       Profile: builtin
     Kernel Version: 4.19.90-89.11.v2401.ky10.x86_64
     Operating System: Kylin Linux Advanced Server V10 (Halberd)
     OSType: linux
     Architecture: x86_64
     CPUs: 64
     Total Memory: 61.55GiB
     Name: localhost.localdomain
     ID: f92e3bfc-06d2-4441-886f-8b48bf0e6b27
     Docker Root Dir: /var/lib/docker
     Debug Mode: false
     Experimental: false
     Insecure Registries:
      ::1/128
      127.0.0.0/8
     Live Restore Enabled: false
     Product License: Community Engine
     Firewall Backend: iptables

    WARNING: Support for cgroup v1 is deprecated and planned to be removed by no later than May 2029 (github.com/moby/moby/issues/51111)
    7. Image version:
    vllm-metax:0.18.0-maca.ai3.5.3.405-torch2.8-py312-kylinv11-amd64
    8. Container start command:
    metax-docker run -itd --device=/dev/dri --device=/dev/mxcd --group-add video --network=host --name vllm --restart unless-stopped --security-opt seccomp=unconfined --security-opt apparmor=unconfined --ulimit memlock=-1 -v /home/models:/models cr.metax-tech.com/public-ai-release/maca/vllm-metax:0.18.0-maca.ai3.5.3.405-torch2.8-py312-kylinv11-amd64
    9. Command run inside the container:
    /opt/conda/bin/vllm serve /models/Qwen3.5-27B-W8A8 --host 0.0.0.0 --port 9901 --served-model-name qwen3 -tp 1 --trust-remote-code --dtype half --gpu-memory-utilization 0.8 --max-model-len 8192 --max-num-batched-tokens 8192 --reasoning-parser qwen3
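As a quick sanity check on the `--gpu-memory-utilization 0.8` setting in the command above, the arithmetic below compares the resulting memory budget with the usage reported by mx-smi. It is a sketch that assumes the mx-smi snapshot in this post was taken while this command was running. `memory_budget_mib` is an illustrative helper, not a vLLM function.

```python
def memory_budget_mib(total_mib, gpu_memory_utilization):
    """Memory vLLM may use, given the card total and the CLI fraction."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_mib * gpu_memory_utilization


TOTAL_MIB = 65536     # card total, from the mx-smi Memory-Usage column
OBSERVED_MIB = 52228  # VLLM::EngineCor usage, from the mx-smi Process table

budget = memory_budget_mib(TOTAL_MIB, 0.8)

if __name__ == "__main__":
    print(f"budget:             {budget:.1f} MiB")
    print(f"observed:           {OBSERVED_MIB} MiB ({OBSERVED_MIB / budget:.1%} of budget)")
    print(f"outside the budget: {TOTAL_MIB - budget:.1f} MiB")
```

The budget works out to about 52429 MiB, close to the 52228 MiB that mx-smi attributes to the engine process, so the engine appears to be filling the share it was given and leaving roughly 13107 MiB of the card unused.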

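    II. Problem description
    When serving Qwen3.5-27B-W8A8 with vLLM, inference speed peaks at only 12.4 tokens/s, and switching to Qwen3-32B-W8A8 gives the same result. Qwen3-14B-AWQ is somewhat better, reaching 18 to 22 tokens/s. In addition, a memory addressing error occurs while the model is running and aborts it, which crashes the vLLM engine and disrupts normal use. So: 1. How can I improve inference speed? The current speed is affecting production workloads. 2. How can I fix the error that aborts the running model?

    The error log is shown below: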
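When comparing speeds, it helps to separate aggregate throughput from per-request throughput. The vLLM stats line in the log for this post reports generation throughput summed over all running requests. The sketch below divides it by the `Running` count to get a rough per-request average. `LOG_STATS` and `per_request_throughput` are illustrative names, and the even split between requests is an assumption, since concurrent requests need not progress at the same rate.

```python
import re

# Matches the relevant part of a vLLM periodic stats line such as
# "... Avg generation throughput: 33.6 tokens/s, Running: 2 reqs, ...".
LOG_STATS = re.compile(
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, Running: (?P<running>\d+) reqs"
)


def per_request_throughput(log_line):
    """Average generation tokens/s per running request, or None if unavailable."""
    match = LOG_STATS.search(log_line)
    if match is None:
        return None
    running = int(match["running"])
    if running == 0:
        return None
    return float(match["gen"]) / running


# Stats line copied from the log in this post.
SAMPLE_LINE = (
    "(APIServer pid=8943) INFO 05-11 09:56:50 [loggers.py:259] Engine 000: "
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.6 tokens/s, "
    "Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.0%, Prefix cache hit rate: 0.0%"
)

if __name__ == "__main__":
    print(per_request_throughput(SAMPLE_LINE))
```

For the sample line this gives 16.8 tokens/s per request with two requests in flight. The post does not say whether the 12.4 tokens/s figure was measured per request or in aggregate, so the two numbers are not directly comparable without that detail.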
    (APIServer pid=8943) INFO 05-11 09:56:50 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.0%, Prefix cache hit rate: 0.0%
    [09:56:52.419][MXC][E]xnack(0x8): kernel causes atu address translation error
    [09:56:52.420][MCR][E]mx_trapProcess.cpp :76 : traping pipeID=0,queueID=3!
    [09:56:52.420][MCR][E]mx_trapProcess.cpp :77 : traping virtualDevice=0x7f3a48000c70!
    [09:56:52.421][MCR][E]mx_trapProcess.cpp :707 : Xnack(0x8) exception happened in the shader, the mcruntime api will be disabled
    [09:56:52.421][MCR][E]mx_trapProcess.cpp :798 : Node ID is 4
    [09:56:52.421][MCR][E]mx_trapProcess.cpp :811 : debug info: 0x04,0x00,0x00,0x00,0x04,0x00,0x00,0x00,0xe0,0x23,0x00,0x00,0x00,0x20,0x00,0x00,
    [09:56:52.421][MCR][E]mx_device.cpp :9606: trap:precise positioning.
    [09:56:52.421][MCR][E]mx_device.cpp :9611: traping: kernelName: _ZN4vllm30reshape_and_cache_flash_kernelIttLNS_18Fp8KVCacheDataTypeE0EEEvPKT_S4_PT0_S6_PKlllllliiiPKfSA_ ,commandIndex: 1179567 , trapType: Xnack Error/ATU Fault(0x8)
    [09:56:53.408][MCR][E]mx_signal.cpp :592 : Failed signal [0x7f3d423766c0] wait
    [09:56:53.408][MCR][E]mx_signal.cpp :592 : Failed signal [0x7f3d423766c0] wait
    [09:56:53.408][MCR][E]mx_signal.cpp :160 : shouldn't destroy a profiling signal 0x5594b2ae4b70 that is still busy!
    [09:56:53.408][MCR][E]mx_signal.cpp :160 : shouldn't destroy a profiling signal 0x7f3a48004aa0 that is still busy!
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.18.0) with config: model='/models/Qwen3.5-27B-W8A8', speculative_config=None, tokenizer='/models/Qwen3.5-27B-W8A8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/f62306b0e2', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::mx_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 400, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/f62306b0e2/rank_0_0/backbone', 'fast_moe_cold_start': True, 'static_all_moe_layers': []},
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-a2089b62a4a5346f-a2d0cdd1', 'chatcmpl-b6dcfcbbedded5d1-898d1a0f'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[None, None],num_computed_tokens=[2914, 1184],num_output_tokens=[2904, 458]), num_scheduled_tokens={chatcmpl-a2089b62a4a5346f-a2d0cdd1: 1, chatcmpl-b6dcfcbbedded5d1-898d1a0f: 1}, total_num_scheduled_tokens=2, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=null)
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=2, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.02970297029702973, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] EngineCore encountered a fatal error.
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] Traceback (most recent call last):
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1092, in run_engine_core
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] engine_core.run_busy_loop()
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1133, in run_busy_loop
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] self._process_engine_step()
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1172, in _process_engine_step
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] outputs, model_executed = self.step_fn()
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 398, in step
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] model_output = self.model_executor.sample_tokens(grammar_output)
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 118, in sample_tokens
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] return self.collective_rpc(
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] result = run_method(self.driver_worker, method, args, kwargs)
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] return func(*args, **kwargs)
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] return func(*args, **kwargs)
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 759, in sample_tokens
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] return self.model_runner.sample_tokens(grammar_output)
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] return func(*args, **kwargs)
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4029, in sample_tokens
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ) = self._bookkeeping_sync(
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3161, in _bookkeeping_sync
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] valid_sampled_token_ids = self._to_list(sampled_token_ids)
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 6643, in _to_list
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] self.transfer_event.synchronize()
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
    (EngineCore pid=9223) ERROR 05-11 09:56:53 [core.py:1101]
    (EngineCore pid=9223) Process EngineCore:
    (EngineCore pid=9223) Traceback (most recent call last):
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    (EngineCore pid=9223) self.run()
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 108, in run
    (EngineCore pid=9223) self._target(*self._args, **self._kwargs)
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1103, in run_engine_core
    (EngineCore pid=9223) raise e
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1092, in run_engine_core
    (EngineCore pid=9223) engine_core.run_busy_loop()
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1133, in run_busy_loop
    (EngineCore pid=9223) self._process_engine_step()
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1172, in _process_engine_step
    (EngineCore pid=9223) outputs, model_executed = self.step_fn()
    (APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] AsyncLLM output_handler failed.
    (EngineCore pid=9223) ^^^^^^^^^^^^^^
    (APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] Traceback (most recent call last):
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 398, in step
    (APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 663, in output_handler
    (EngineCore pid=9223) model_output = self.model_executor.sample_tokens(grammar_output)
    (APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] outputs = await engine_core.get_output_async()
    (EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 118, in sample_tokens
    (APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1022, in get_output_async
    (EngineCore pid=9223) return self.collective_rpc(
    (APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] raise self._format_exception(outputs) from None
    (EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^
    (APIServer pid=8943) ERROR 05-11 09:56:53 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
    (EngineCore pid=9223) result = run_method(self.driver_worker, method, args, kwargs)
    (EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
    (EngineCore pid=9223) return func(*args, **kwargs)
    (EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    (EngineCore pid=9223) return func(*args, **kwargs)
    (EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 759, in sample_tokens
    (EngineCore pid=9223) return self.model_runner.sample_tokens(grammar_output)
    (EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    (EngineCore pid=9223) return func(*args, **kwargs)
    (EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4029, in sample_tokens
    (EngineCore pid=9223) ) = self._bookkeeping_sync(
    (EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3161, in _bookkeeping_sync
    (EngineCore pid=9223) valid_sampled_token_ids = self._to_list(sampled_token_ids)
    (EngineCore pid=9223) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=9223) File "/opt/conda/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 6643, in _to_list
    (EngineCore pid=9223) self.transfer_event.synchronize()
    (EngineCore pid=9223) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
    (EngineCore pid=9223) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    (EngineCore pid=9223) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    (EngineCore pid=9223) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
    (EngineCore pid=9223)
    (APIServer pid=8943) INFO: 10.75.45.11:43660 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
    (APIServer pid=8943) INFO: 172.20.0.51:33402 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
    (APIServer pid=8943) INFO: Shutting down
    (APIServer pid=8943) INFO: Waiting for application shutdown.
    (APIServer pid=8943) INFO: Application shutdown complete.
    (APIServer pid=8943) INFO: Finished server process [8943]
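The Python traceback above only shows where the asynchronous error surfaced. The faulting kernel is named in the earlier MCR trap line. The sketch below pulls the kernel name, command index and trap type out of such a line and decodes the leading length-prefixed components of the mangled C++ name. `TRAP_LINE`, `find_traps` and `leading_name` are illustrative names. The decoder handles only the simple nested-name prefix seen in this log and is not a full demangler.

```python
import re

# Matches the MCR trap report line shown in this post's log.
TRAP_LINE = re.compile(
    r"kernelName:\s*(?P<kernel>\S+)\s*,\s*commandIndex:\s*(?P<index>\d+)\s*,"
    r"\s*trapType:\s*(?P<trap>.+?)\s*$"
)


def leading_name(mangled):
    """Decode the leading <length><identifier> parts of an Itanium-style nested name."""
    start = re.match(r"_?ZN", mangled)  # the leading underscore may have been lost
    if start is None:
        return None
    pos, parts = start.end(), []
    while True:
        digits = re.match(r"\d+", mangled[pos:])
        if digits is None:
            break  # reached template arguments or the end of the nested name
        length = int(digits.group())
        begin = pos + digits.end()
        parts.append(mangled[begin:begin + length])
        pos = begin + length
    return "::".join(parts) or None


def find_traps(log_text):
    """Return one dict per trap report line found in the log text."""
    traps = []
    for line in log_text.splitlines():
        match = TRAP_LINE.search(line)
        if match:
            traps.append(
                {
                    "kernel": leading_name(match["kernel"]) or match["kernel"],
                    "command_index": int(match["index"]),
                    "trap_type": match["trap"],
                }
            )
    return traps


# Trap line as it appears in this post's log.
SAMPLE_LOG = (
    "[09:56:52.421][MCR][E]mx_device.cpp :9611: traping: kernelName: "
    "ZN4vllm30reshape_and_cache_flash_kernelIttLNS_18Fp8KVCacheDataTypeE0EEEvPKT_S4_PT0_S6_"
    "PKlllllliiiPKfSA ,commandIndex: 1179567 , trapType: Xnack Error/ATU Fault(0x8)"
)

if __name__ == "__main__":
    for trap in find_traps(SAMPLE_LOG):
        print(trap)
```

For this log the decoded name is `vllm::reshape_and_cache_flash_kernel`. That identifies which kernel faulted, but not why it was given an invalid address.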
