This is a rented cloud server, not bare metal, so I apparently lack permission to read the kernel log.
# whoami
root
# dmesg -T | grep -i err
dmesg: read kernel buffer failed: Operation not permitted
I. Software and hardware information
1. Server vendor: 模力方舟
2. MetaX GPU model: C500
3. OS kernel version: 5.15.0-58-generic
4. CPU virtualization enabled: yes (VT-x)
5. mx-smi output:
mx-smi version: 2.2.12
Attached GPUs: 1
GPU 0: MetaX C500
- Bus-id: 0000:12:00.0
- GPU-Util: 0%
- Power: 56W / 350W
- Temp: 35°C
- Memory: 826/65536 MiB
- sGPU: Enabled
- Kernel Mode Driver: 3.0.11
- MACA Version: 3.5.3.20
- BIOS Version: 1.27.5.0
Sliced GPU:
- Minor 015, sGPU-Id 2, Compute 25%, Vram Quota 0/16000 MiB
Process: no process found
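6. docker info output: Docker is not installed
7. Image version: Docker is not installed
8. Container start command: Docker is not installed
9. Command run inside the container: Docker is not installed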
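II. Problem description
Please describe the problem and its logs in detail. If the log is too long, upload it as an attachment (txt format).
The minimal JAX test script below was run on both the C500 and an NVIDIA GPU.
Test code: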
"""
Minimal reproducer for Metax MACA backend bug:
llvm-link: error: dereferenceable bytes must be non-zero
Environment:
- GPU: MetaX C500 (XCORE1000)
- MACA stack: 3.5.3.20
- JAX: 0.4.34.dev20260402+maca
- LLVM: 19.1.3
Trigger: JIT-compiling a function that returns a NamedTuple/PyTree
containing a zero-size jax.Array. XLA generates a copy_fusion kernel
with 'dereferenceable(0)' in the LLVM IR, which llvm-link rejects.
"""
import os
os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = 'false'
import jax
import jax.numpy as jnp
from typing import NamedTuple
print("JAX version:", jax.__version__)
print("JAX devices :", jax.devices())
# A NamedTuple with one zero-size field and one non-zero field.
# This pattern is common in JAX libraries (e.g. MJX contact data).
class MyStruct(NamedTuple):
    empty: jax.Array  # shape (0,) -> triggers the bug
    data: jax.Array   # shape (3,) -> normal field
s = MyStruct(empty=jnp.zeros(0), data=jnp.ones(3))
# Direct call works fine.
print("Direct call OK:", s)
# JIT compile a passthrough function. XLA generates a copy_fusion
# kernel to copy the struct fields. The zero-size 'empty' field
# produces 'dereferenceable(0)' in LLVM IR, which is rejected by
# llvm-link because LLVM requires dereferenceable(N) with N > 0.
print("JIT-compiling passthrough function...")
jit_fn = jax.jit(lambda x: x)
jit_fn(s)
print("If you see this, the bug is fixed.")
C500 output:
JAX version: 0.4.34.dev20260402
JAX devices : [CudaDevice(id=0)]
Direct call OK: MyStruct(empty=Array([], shape=(0,), dtype=float32), data=Array([1., 1., 1.], dtype=float32))
JIT-compiling passthrough function...
/opt/maca/mxgpu_llvm/bin/llvm-link: /tmp/xla_maca_llvm-84c5c7.ll:6:79: error: dereferenceable bytes must be non-zero
define metaxgpu_kernel void @copy_fusion(ptr noalias align 16 dereferenceable(0) %0, ptr noalias align 16 dereferenceable(12) %1, ptr noalias align 128 dereferenceable(0) %2, ptr noalias align 128 dereferenceable(12) %3) #0 {
^
/opt/maca/mxgpu_llvm/bin/llvm-link: error: loading file '/tmp/xla_maca_llvm-84c5c7.ll'
Traceback (most recent call last):
File "/data/MJX/MJX_ZS/ai_tmp/test_metax_bug_minimal.py", line 43, in <module>
jit_fn(s)
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: failed to execute llvm-link to link LLVM IR:
--------------------
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
NVIDIA GPU output:
JAX version: 0.6.2
JAX devices : [CudaDevice(id=0)]
Direct call OK: MyStruct(empty=Array([], shape=(0,), dtype=float32), data=Array([1., 1., 1.], dtype=float32))
JIT-compiling passthrough function...
If you see this, the bug is fixed.
The configuration is shown in the attached image. When running the real JAX workload, the same kind of error appears:
/opt/maca/mxgpu_llvm/bin/llvm-link: /tmp/xla_maca_llvm-6ddb98.ll:1035:82: error: dereferenceable bytes must be non-zero
define metaxgpu_kernel void @copy_fusion_10(ptr noalias align 16 dereferenceable(0) %0, ptr noalias align 16 dereferenceable(0) %1, ptr noalias align 16 dereferenceable(4) %2, ptr noalias align 128 dereferenceable(0) %3, ptr noalias align 128 dereferenceable(0) %4, ptr noalias align 128 dereferenceable(4) %5)
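To see which kernels and parameters are affected, the generated IR text can be scanned with a short script. This is only a minimal sketch: `find_zero_deref_params` is a hypothetical helper name, and it assumes each kernel signature sits on a single `define` line with no commas inside a parameter, as in the logs above. Whether the temporary `.ll` file is still on disk after the failure is also an assumption.

```python
import re

# Captures the kernel name and the raw parameter list of a one-line 'define'.
DEFINE_RE = re.compile(r"^define\b.*?@([\w.$-]+)\((.*)\)", re.MULTILINE)
ZERO_DEREF_RE = re.compile(r"\bdereferenceable\(0\)")


def find_zero_deref_params(ir_text):
    """Return (kernel_name, param_index) pairs marked dereferenceable(0)."""
    hits = []
    for match in DEFINE_RE.finditer(ir_text):
        name, params = match.group(1), match.group(2)
        # Assumption: parameters are separated by plain commas.
        for index, param in enumerate(params.split(",")):
            if ZERO_DEREF_RE.search(param):
                hits.append((name, index))
    return hits


# The signature copied from the C500 log above.
sample = (
    "define metaxgpu_kernel void @copy_fusion("
    "ptr noalias align 16 dereferenceable(0) %0, "
    "ptr noalias align 16 dereferenceable(12) %1, "
    "ptr noalias align 128 dereferenceable(0) %2, "
    "ptr noalias align 128 dereferenceable(12) %3) #0 {\n"
)
for kernel, index in find_zero_deref_params(sample):
    print(f"{kernel}: parameter %{index} is dereferenceable(0)")
```

For a real file, the same function can be applied to the contents read from the path named in the error message, if that file still exists.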
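Analysis suggested by an AI assistant, for reference:
This is an LLVM IR syntax error: dereferenceable(0) is invalid.
LLVM requires the byte count of the dereferenceable attribute to be greater than 0, but the IR generated by XLA contains dereferenceable(0).
This suggests that MetaX's XLA compiler (xla_maca_llvm) emits an invalid IR attribute in its front end when handling certain zero-size buffers.
Is there any workaround for this IR parsing error on size-0 buffers?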
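One user-side workaround that might be worth trying is to keep zero-size arrays out of the jit boundary entirely. The sketch below is untested on the MACA backend and rests on an assumption: that the invalid attribute only appears when a zero-size buffer is a kernel input or output. The names `jit_skip_empty`, `_split` and `_merge` are hypothetical helpers, not JAX APIs, and every pytree leaf is assumed to be an array.

```python
import math
from typing import NamedTuple

import jax
import jax.numpy as jnp


def _split(tree):
    """Flatten a pytree; return (non-empty leaves, per-leaf specs, treedef)."""
    leaves, treedef = jax.tree_util.tree_flatten(tree)
    # Each spec is (shape, dtype, is_empty) and is hashable.
    specs = tuple(
        (tuple(x.shape), x.dtype, math.prod(x.shape) == 0) for x in leaves
    )
    kept = [x for x, spec in zip(leaves, specs) if not spec[2]]
    return kept, specs, treedef


def _merge(kept, specs, treedef):
    """Rebuild a pytree, recreating empty leaves from their shape and dtype."""
    remaining = iter(kept)
    leaves = [
        jnp.zeros(shape, dtype) if empty else next(remaining)
        for shape, dtype, empty in specs
    ]
    return jax.tree_util.tree_unflatten(treedef, leaves)


def jit_skip_empty(fn):
    """Like jax.jit(fn) for a one-argument fn, minus zero-size inputs/outputs."""
    cache = {}

    def wrapper(tree):
        kept_in, in_specs, in_def = _split(tree)
        key = (in_def, in_specs)
        if key not in cache:
            # Abstract evaluation only: gives the output structure and shapes.
            _, out_specs, out_def = _split(jax.eval_shape(fn, tree))

            def inner(kept):
                out = fn(_merge(kept, in_specs, in_def))
                # Return only the non-empty output leaves from the jitted code.
                return [
                    x
                    for x in jax.tree_util.tree_leaves(out)
                    if math.prod(x.shape) != 0
                ]

            cache[key] = (jax.jit(inner), out_specs, out_def)
        jitted, out_specs, out_def = cache[key]
        return _merge(jitted(kept_in), out_specs, out_def)

    return wrapper


class MyStruct(NamedTuple):
    empty: jax.Array
    data: jax.Array


s = MyStruct(empty=jnp.zeros(0), data=jnp.ones(3))
out = jit_skip_empty(lambda x: x)(s)
print(out)
```

Empty leaves are recreated with `jnp.zeros` outside the compiled function, which matches the eager `jnp.zeros(0)` call that already works in the reproducer. If the backend still emits `dereferenceable(0)` for zero-size constants inside the traced function, this approach would not help and the fix would have to come from the MACA compiler itself.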