MetaX-Tech Developer Forum

Cafba

  • Members
  • Joined December 9, 2025

Cafba has started 2 threads.

  • Cafba
    Members
    Two problems using HAMi scheduling after deploying GPU-Operator (status: in progress), December 9, 2025 19:22

    I. Hardware and software information
    1. Server vendor:
    2. MetaX GPU model: C550
    3. OS kernel version: Linux mx-oam-151 5.15.0-58-generic #64~20.04.1-Ubuntu SMP Fri Jan 6 16:42:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
    4. CPU virtualization enabled: Yes
    image.png
    II. Problem 1
    After deploying GPU-Operator, the SDK in use is shown in the screenshot:
    image.png
    The pod is created from the following manifest:

    apiVersion: v1
    kind: Pod
    metadata:
      name: task-sample-pod-2
    spec:
      schedulerName: hami-scheduler
      containers:
        - name: ubuntu-task
          image: docker.io/ubuntu:20.04
          imagePullPolicy: Never
          command: [
            "bash",
            "-c",
            "cp -r /opt/maca/samples/0_Introduction/vectorAdd /home;
            cd /home/vectorAdd;
            mxcc -x maca vectorAdd.cpp -o vectorAdd --maca-path=/opt/maca;
            ./vectorAdd > log/vectoradd_exec_output.log;
            tail -f /dev/null",
          ]
          resources:
            limits:
              metax-tech.com/sgpu: 1 # requesting 1 GPU
              metax-tech.com/vcore: 40 # requesting 40% compute of full GPU
              metax-tech.com/vmemory: 4 # requesting 4 GiB device memory of full GPU
    

    The error is as follows:
    image.png
    III. Problem 2
    Creating a new pod then reports:
    Allocate failed due to rpc error: code = Unknown desc = set 0000:c2:00.0 model error write /proc/1/root/sys/bus/pci/devices/0000:c2:00.0/model: device or resource busy, which is unexpected
    However, the mx-smi output below shows that the second card is not in use:
    image.png
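
To confirm which card the allocation error refers to, one option is to pull the PCI address out of the message and compare it by hand with the bus IDs in the mx-smi output. Below is a minimal Python sketch of that extraction. It assumes mx-smi lists a PCI bus ID for each card, and `pci_addresses` is a hypothetical helper written for illustration, not part of any MetaX or HAMi tooling.

```python
import re

# Error text copied verbatim from the failed allocation reported above.
ERROR = (
    "Allocate failed due to rpc error: code = Unknown desc = set 0000:c2:00.0 "
    "model error write /proc/1/root/sys/bus/pci/devices/0000:c2:00.0/model: "
    "device or resource busy, which is unexpected"
)

# PCI address in domain:bus:device.function form, e.g. 0000:c2:00.0
PCI_ADDR = re.compile(
    r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b"
)


def pci_addresses(message: str) -> list[str]:
    """Return the distinct PCI addresses in a message, in order of first appearance."""
    seen: list[str] = []
    for addr in PCI_ADDR.findall(message):
        addr = addr.lower()
        if addr not in seen:
            seen.append(addr)
    return seen


if __name__ == "__main__":
    # Prints the address(es) to look up in the mx-smi listing.
    print(pci_addresses(ERROR))
```

If the address found this way belongs to the card that mx-smi shows as idle, that would support the observation that the device is reported busy even though no workload is running on it.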
