• Members 4 posts
    May 13, 2026, 11:53

    The configuration is shown in the attached image. When running the JAX framework, the following error occurs:
    /opt/maca/mxgpu_llvm/bin/llvm-link: /tmp/xla_maca_llvm-6ddb98.ll:1035:82: error: dereferenceable bytes must be non-zero
    define metaxgpu_kernel void @copy_fusion_10(ptr noalias align 16 dereferenceable(0) %0, ptr noalias align 16 dereferenceable(0) %1, ptr noalias align 16 dereferenceable(4) %2, ptr noalias align 128 dereferenceable(0) %3, ptr noalias align 128 dereferenceable(0) %4, ptr noalias align 128 dereferenceable(4) %5)

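    According to an AI-generated analysis:
    This is an LLVM IR syntax error: dereferenceable(0) is illegal.
    LLVM requires the byte count of the dereferenceable attribute to be greater than 0, but the IR generated by XLA contains dereferenceable(0).
    This suggests that MetaX's XLA compiler (xla_maca_llvm) emits an illegal IR attribute in its front end when it handles certain zero-size buffers.

    Is there any workaround for this IR parsing error, which is triggered by buffers of size 0?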
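
To confirm that a dumped `.ll` file fails for the reason described above, a small stdlib-only check can list every `dereferenceable(0)` occurrence. This is only a sketch: the function name `find_zero_dereferenceable` is made up for illustration, and the sample string is the kernel signature copied from the error log later in this thread.

```python
import re

# Matches the attribute that llvm-link rejects: dereferenceable(0).
DEREF_ZERO = re.compile(r"\bdereferenceable\(0\)")


def find_zero_dereferenceable(ir_text):
    """Return 1-based (line, column) pairs pointing at the 0 in each dereferenceable(0)."""
    hits = []
    for lineno, line in enumerate(ir_text.splitlines(), start=1):
        for match in DEREF_ZERO.finditer(line):
            # match.end() is one past ")", so the "0" sits at 0-based index
            # match.end() - 2, which is 1-based column match.end() - 1.
            hits.append((lineno, match.end() - 1))
    return hits


# Kernel signature taken from the llvm-link error message in this thread.
sample = (
    "define metaxgpu_kernel void @copy_fusion("
    "ptr noalias align 16 dereferenceable(0) %0, "
    "ptr noalias align 16 dereferenceable(12) %1, "
    "ptr noalias align 128 dereferenceable(0) %2, "
    "ptr noalias align 128 dereferenceable(12) %3) #0 {"
)

hits = find_zero_dereferenceable(sample)
for lineno, column in hits:
    print(f"line {lineno}, column {column}: dereferenceable(0)")
```

The column is reported at the `0` itself so that it lines up with the caret position that llvm-link prints.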

    image.png

    PNG, 270.9 KB, uploaded by ninja on May 13, 2026.

  • Thread has been moved from 公共 (Public).

  • Members 4 posts
    May 13, 2026, 14:02

    I. Hardware and software information
    1. Server vendor: 模力方舟
    2. MetaX GPU model: C500
    3. OS kernel version: 5.15.0-58-generic
    4. CPU virtualization enabled: yes (VT-x)
    5. mx-smi output:

    mx-smi  version: 2.2.12
      Attached GPUs: 1
      GPU 0: MetaX C500
        - Bus-id: 0000:12:00.0
        - GPU-Util: 0%
        - Power: 56W / 350W
        - Temp: 35°C
        - Memory: 826/65536 MiB
        - sGPU: Enabled
        - Kernel Mode Driver: 3.0.11
        - MACA Version: 3.5.3.20
        - BIOS Version: 1.27.5.0
      Sliced GPU:
        - Minor 015, sGPU-Id 2, Compute 25%, Vram Quota 0/16000 MiB
      Process: no process found
    

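    6. docker info output: Docker is not installed
    7. Image version: Docker is not installed
    8. Container start command: Docker is not installed
    9. Command run inside the container: Docker is not installed
    II. Problem description
    Please describe the problem in detail, with logs. If the logs are too long, upload them as an attachment (txt format).
    The minimal JAX test script below was run on both the C500 and an NVIDIA GPU.
    Test code: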

    """
    Minimal reproducer for Metax MACA backend bug:
      llvm-link: error: dereferenceable bytes must be non-zero
    
    Environment:
      - GPU: MetaX C500 (XCORE1000)
      - MACA stack: 3.5.3.20
      - JAX: 0.4.34.dev20260402+maca
      - LLVM: 19.1.3
    
    Trigger: JIT-compiling a function that returns a NamedTuple/PyTree
    containing a zero-size jax.Array. XLA generates a copy_fusion kernel
    with 'dereferenceable(0)' in the LLVM IR, which llvm-link rejects.
    """
    
    import os
    os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = 'false'
    
    import jax
    import jax.numpy as jnp
    from typing import NamedTuple
    
    print("JAX version:", jax.__version__)
    print("JAX devices :", jax.devices())
    
    # A NamedTuple with one zero-size field and one non-zero field.
    # This pattern is common in JAX libraries (e.g. MJX contact data).
    class MyStruct(NamedTuple):
        empty: jax.Array   # shape (0,)  -> triggers the bug
        data:  jax.Array   # shape (3,)  -> normal field
    
    s = MyStruct(empty=jnp.zeros(0), data=jnp.ones(3))
    
    # Direct call works fine.
    print("Direct call OK:", s)
    
    # JIT compile a passthrough function. XLA generates a copy_fusion
    # kernel to copy the struct fields. The zero-size 'empty' field
    # produces 'dereferenceable(0)' in LLVM IR, which is rejected by
    # llvm-link because LLVM requires dereferenceable(N) with N > 0.
    print("JIT-compiling passthrough function...")
    jit_fn = jax.jit(lambda x: x)
    jit_fn(s)
    print("If you see this, the bug is fixed.")
    

    C500 output:

    JAX version: 0.4.34.dev20260402
    JAX devices : [CudaDevice(id=0)]
    Direct call OK: MyStruct(empty=Array([], shape=(0,), dtype=float32), data=Array([1., 1., 1.], dtype=float32))
    JIT-compiling passthrough function...
    /opt/maca/mxgpu_llvm/bin/llvm-link: /tmp/xla_maca_llvm-84c5c7.ll:6:79: error: dereferenceable bytes must be non-zero
    define metaxgpu_kernel void @copy_fusion(ptr noalias align 16 dereferenceable(0) %0, ptr noalias align 16 dereferenceable(12) %1, ptr noalias align 128 dereferenceable(0) %2, ptr noalias align 128 dereferenceable(12) %3) #0 {
                                                                                  ^
    /opt/maca/mxgpu_llvm/bin/llvm-link: error:  loading file '/tmp/xla_maca_llvm-84c5c7.ll'
    Traceback (most recent call last):
      File "/data/MJX/MJX_ZS/ai_tmp/test_metax_bug_minimal.py", line 43, in <module>
        jit_fn(s)
    jaxlib.xla_extension.XlaRuntimeError: INTERNAL: failed to execute llvm-link to link LLVM IR: 
    --------------------
    For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
    

    NVIDIA GPU output:

    JAX version: 0.6.2
    JAX devices : [CudaDevice(id=0)]
    Direct call OK: MyStruct(empty=Array([], shape=(0,), dtype=float32), data=Array([1., 1., 1.], dtype=float32))
    JIT-compiling passthrough function...
    If you see this, the bug is fixed.
    
  • Members 458 posts
    May 14, 2026, 09:28

    Dear developer, please provide the commands you used to install and configure the JAX framework.

  • Members 458 posts
    May 14, 2026, 13:16

    Dear developer, please run the following on the bare-metal host:

    dmesg -T | grep -i err
    
  • Members 4 posts
    May 14, 2026, 14:33

    This is a rented cloud server, not bare metal, and it looks like I do not have permission.

    # whoami
    root
    # dmesg -T | grep -i err
    dmesg: read kernel buffer failed: Operation not permitted
    
  • Members 458 posts
    May 14, 2026, 16:42

    Dear developer, this is not supported in the current version. Please wait for a future release.
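
Until a fixed release is available, one possible stopgap for the workaround question above is to keep zero-size leaves out of the compiled call, so that no zero-size buffer becomes a kernel argument. The sketch below is untested on the MACA backend and is not an official fix. `call_on_nonempty_leaves` is a hypothetical helper, and it assumes the jitted function takes a list of arrays and returns a list of the same length. It does not help if zero-size arrays are created and copied inside the jitted function itself.

```python
import jax
import jax.numpy as jnp
from typing import NamedTuple


class MyStruct(NamedTuple):
    empty: jax.Array  # shape (0,), the field that triggered the error
    data: jax.Array   # shape (3,)


def call_on_nonempty_leaves(jitted_fn, tree):
    """Hypothetical helper: run `jitted_fn` on the non-empty leaves only.

    Zero-size leaves bypass the compiled call and are put back unchanged.
    """
    leaves, treedef = jax.tree_util.tree_flatten(tree)
    nonempty = [leaf for leaf in leaves if leaf.size > 0]
    outputs = iter(jitted_fn(nonempty))
    # Re-insert each zero-size leaf at its original position.
    merged = [leaf if leaf.size == 0 else next(outputs) for leaf in leaves]
    return jax.tree_util.tree_unflatten(treedef, merged)


s = MyStruct(empty=jnp.zeros(0), data=jnp.ones(3))
passthrough = jax.jit(lambda leaves: leaves)
result = call_on_nonempty_leaves(passthrough, s)
print(result)
```

The jitted function here sees only the `data` leaf, so the compiled program should not receive the `(0,)` array as a parameter at all. Whether this avoids every `dereferenceable(0)` in a larger program such as MJX depends on what XLA generates internally.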