MetaX-Tech Developer Forum

SunnySun

  • Members
  • Joined September 2, 2025

SunnySun has posted 6 messages.

  • See post
    SunnySun
    Members
    How should async copy be used to achieve the expected optimization? [Solved] September 10, 2025 10:06

    Hello, this issue is still unresolved. Is anyone following up on it?

  • See post
    SunnySun
    Members
    How should async copy be used to achieve the expected optimization? [Solved] September 4, 2025 09:38

    Hello, we also tried the image with SDK version 2.33 (image: cr.metax-tech.com/public-ai-release/maca/vllm:maca.ai2.33.0.13-torch2.6-py310-ubuntu22.04-amd64), but the test results are the same as above (the async copy performs worse than the sync copy).

    We also compared the contents of /opt/maca/mxgpu_llvm/lib/clang/12.0.0/include/maca_async.h between the two SDKs and found that the 2.33.0 file is identical to the 2.32.0 one: as described above, the async interface ultimately calls the synchronous interface.

    Could you point us to an environment in which async copy is actually implemented, along with a corresponding test script? Thanks.

  • See post
    SunnySun
    Members
    How should async copy be used to achieve the expected optimization? [Solved] September 3, 2025 17:52

    Hello, following your reply we tried the following:

    1) Using the cu-bridge 2.32.0 source, with cuda_pipeline_primitives.h modified to match the master branch;

    2) Using the master branch directly.

    We then installed it as follows:

    rm -rf /opt/maca/tools/cu-bridge  ## remove the cu-bridge that ships in the environment
    ## install with the following steps
    export MACA_PATH=/opt/maca
    git clone https://gitee.com/metax-maca/cu-bridge.git
    sudo chmod 755 cu-bridge -Rf
    cd cu-bridge
    mkdir build && cd ./build
    cmake -DCMAKE_INSTALL_PREFIX=/opt/maca/tools/cu-bridge ../
    make && make install
    
    ### configure the environment after installation
    export MACA_PATH=/opt/maca       
    export CUCC_PATH=/opt/maca/tools/cu-bridge 
    export PATH=$PATH:${CUCC_PATH}/tools:${CUCC_PATH}/bin
    export CUCC_CMAKE_ENTRY=2        
    export CUDA_PATH=${CUCC_PATH}
    

    After installation, we recompiled and reran the memcpy_async test code (the same code as in the problem description) and found that the async copy still performs worse than the sync copy (the async copy has no effect).

    In addition, starting from the __pipeline_memcpy_async function defined in cuda_pipeline_primitives.h, we searched the container for the places where this function is used and traced it to /opt/maca/mxgpu_llvm/lib/clang/12.0.0/include/maca_async.h. Part of that code is shown below:

    template <size_t CopySize, size_t SourceSize>
    __CG_QUALIFIER__ void pipeline_memcpy_async(void *__restrict__ dst,
                                                const void *__restrict__ src) {
      /*
       assert(CopySize == 4 || CopySize == 4 || CopySize == 4);
       assert(SourceSize <= CopySize);
       assert(__isShared(dst));
       assert(__isGlobal(src));
       assert(!(reinterpret_cast<uintptr_t>(dst) & (CopySize - 1)));
       assert(!(reinterpret_cast<uintptr_t>(src) & (CopySize - 1)));
       */
      pipeline_memcpy_sync<CopySize, SourceSize>(dst, src);
    }
    
    __CG_STATIC_QUALIFIER__ void
    __pipeline_memcpy_async(void *__restrict dst_shared,
                            const void *__restrict__ src_global,
                            size_t size_and_align, size_t zfill = 0) {
      /*
       assert(size_and_align == 4 || size_and_align == 4 || size_and_align == 4);
       assert(zfill <= size_and_align);
       assert(__isShared(dst_shared));
       assert(__isGlobal(src_global));
       assert(!(reinterpret_cast<uintptr_t>(dst_shared) & (size_and_align - 1)));
       assert(!(reinterpret_cast<uintptr_t>(src_global) & (size_and_align - 1)));
       */
    
      switch (size_and_align) {
      case 16:
        switch (zfill) {
        case 0:
          pipeline_memcpy_async<16, 16>(dst_shared, src_global);
          return;
        case 1:
          pipeline_memcpy_async<16, 15>(dst_shared, src_global);
          return;
        case 2:
          pipeline_memcpy_async<16, 14>(dst_shared, src_global);
    ...}
    

    The pipeline_memcpy_async(void *__restrict__ dst, const void *__restrict__ src) called inside __pipeline_memcpy_async(void *__restrict dst_shared, const void *__restrict__ src_global, size_t size_and_align, size_t zfill = 0) ultimately still calls pipeline_memcpy_sync<CopySize, SourceSize>(dst, src), i.e. the synchronous interface.
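
    For context, the snippet below is a minimal sketch of how this primitives-level interface is normally driven (our own illustration, assuming the standard __pipeline_memcpy_async / __pipeline_commit / __pipeline_wait_prior signatures declared by cuda_pipeline_primitives.h and mirrored by maca_async.h). On a backend with a real async path, the work placed between the commit and the wait can overlap with the copy; with the fallback to pipeline_memcpy_sync shown above, no such overlap is possible.

    #include <cuda_pipeline_primitives.h>

    // Minimal sketch: each thread stages one float from global to shared memory
    // with the pipeline primitives, then consumes it after the wait.
    // Launch as <<<numBlocks, 256>>>.
    __global__ void pipeline_primitives_demo(const float *__restrict__ src,
                                             float *__restrict__ dst) {
        __shared__ float tile[256];
        const int i = threadIdx.x;
        // Issue a 4-byte asynchronous copy into shared memory.
        __pipeline_memcpy_async(&tile[i], &src[blockIdx.x * 256 + i], sizeof(float));
        __pipeline_commit();        // close the current batch of async copies
        // ... independent computation could overlap with the copy here ...
        __pipeline_wait_prior(0);   // wait until all committed batches have landed
        // Not strictly needed when each thread reads only the element it staged,
        // but required if threads consume elements staged by other threads.
        __syncthreads();
        dst[blockIdx.x * 256 + i] = tile[i] * 2.0f;
    }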

    In summary, we have the following questions:
    1) Has async copy simply not been implemented in this image? If so, could you provide an environment that does support async copy, or tell us how to modify the current environment so that async copy works?
    2) Could you provide a working async-copy test case for our reference?
    Looking forward to your reply. Thanks.

  • See post
    SunnySun
    Members
    How should async copy be used to achieve the expected optimization? [Solved] September 3, 2025 09:15

    Hello, we have always used the cu-bridge bundled in the image and have never downloaded and installed it separately. Is it this repository: gitee.com/p4ul/cu-bridge? (Its latest version was also last updated 11 months ago.) To install it inside the cr.metax-tech.com/public-ai-release/c500/vllm image, does anything need to be changed for it to install correctly?

    Also, could you provide a test case that uses memcpy_async for asynchronous copy? We would like to use it as a reference. Thanks.
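
    (For concreteness, the kind of minimal pattern we mean is sketched below. This is our own illustration assuming the standard CUDA cooperative_groups headers, not MetaX-provided code; in the MACA image the same pattern goes through <maca_async.h> via cu-bridge, as in the full double-buffered benchmark in the post further down this page.)

    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>

    namespace cg = cooperative_groups;

    // One 256-element tile per block; launch as <<<num_tiles, 256>>> with the
    // input length equal to num_tiles * 256.
    __global__ void scale_with_async_copy(const float *in, float *out) {
        auto cta = cg::this_thread_block();
        __shared__ float tile[256];
        // The whole block cooperatively stages one tile from global memory.
        cg::memcpy_async(cta, tile, in + blockIdx.x * 256, sizeof(float) * 256);
        cg::wait(cta);  // block until the staged tile is visible in shared memory
        out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * 2.0f;
    }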

  • See post
    SunnySun
    Members
    How should async copy be used to achieve the expected optimization? [Solved] September 2, 2025 17:38

    CPU model: Intel(R) Xeon(R) Platinum 8480+
    System kernel version: 5.15.0-58-generic #64~20.04.1-Ubuntu
    mx-smi output:
    Timestamp : Tue Sep 2 09:31:33 2025

    Attached GPUs : 8
    MX-SMI 2.2.3                Kernel Mode Driver Version: 2.14.6
    MACA Version: 2.32.0.6      BIOS Version: 1.24.3.0

    GPU  NAME        Bus-id        GPU-Util  Temp  Pwr:Usage/Cap  Memory-Usage
    0    MetaX C500  0000:0e:00.0  0%        43C   77W / 350W     17636/65536 MiB
    1    MetaX C500  0000:0f:00.0  0%        51C   87W / 350W     18264/65536 MiB
    2    MetaX C500  0000:10:00.0  0%        47C   80W / 350W     1565/65536 MiB

    Docker version: 27.1.1, build 6312585
    Image name: cr.metax-tech.com/public-ai-release/c500/vllm maca2.32.0.11-torch2.4-py310-ubuntu22.04-amd64
    Container launch command:
    sudo docker run -it --name mx_test -v /home/workspace:/home/workspace -v /usr/local/cuda-11.6:/usr/local/cuda --cap-add=SYS_PTRACE --privileged=true --ulimit stack=68719476736 --network host --shm-size=20G -w /home/workspace cr.metax-tech.com/public-ai-release/c500/vllm:maca2.32.0.11-torch2.4-py310-ubuntu22.04-amd64 /bin/bash
    Compile and run:
    mxcc --std=c++17 -I/opt/maca/tools/cu-bridge/include/ -lruntime_cu test_copy.cu -o test_copy
    ./test_copy

  • See post
    SunnySun
    Members
    How should async copy be used to achieve the expected optimization? [Solved] September 2, 2025 16:48
    #include <cooperative_groups.h>
    #include <maca_async.h>
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <cstdio>
    #include <cmath>
    
    namespace cg = cooperative_groups;
    
    constexpr int BLOCK_SIZE = 256;
    constexpr int TILE_SIZE = 6144;       // number of elements per tile
    constexpr int NUM_TILES = 2048;       // number of tiles to process
    constexpr int COMPUTE_ITERATIONS = 1; // compute workload per tile
    
    // compute func
    __device__ float compute_kernel(float data, int iterations) {
        float result = data;
        for (int i = 0; i < iterations; ++i) {
            result = result * 0.99f + 0.01f;
        }
        return result;
    }
    
    __global__ void async_copy_pipeline(float* global_data, float* global_output) {
        auto cta = cg::this_thread_block();
        __shared__ float smem[2][TILE_SIZE];
        // calc offset
        const int cta_offset = blockIdx.x * NUM_TILES * TILE_SIZE;
    
        float *src = global_data + cta_offset;
    
        int read_buf  = 0;   // buffer currently being computed on
        int write_buf = 1;   // buffer currently being filled by the async copy
    
        // async copy of the first tile
        cg::memcpy_async(cta, smem[read_buf], src, sizeof(float) * TILE_SIZE);
        cg::wait(cta);
        cta.sync();
    
        for (int tile_idx = 0; tile_idx < NUM_TILES; ++tile_idx) {
            if (tile_idx < NUM_TILES - 1) {
                // kick off the async copy of the next tile
                float* next_src = src + (tile_idx + 1) * TILE_SIZE;
                cg::memcpy_async(cta, smem[write_buf], next_src, sizeof(float) * TILE_SIZE);
            }
            // current tile compute
            for (int i = threadIdx.x; i < TILE_SIZE; i += blockDim.x) {
                smem[read_buf][i] = compute_kernel(smem[read_buf][i], COMPUTE_ITERATIONS);
            }
    
            if (tile_idx < NUM_TILES - 1) {
                cg::wait(cta);
            }
            cta.sync();
    
            float *dst = global_output + cta_offset + tile_idx * TILE_SIZE;
            for (int i = threadIdx.x; i < TILE_SIZE; i += blockDim.x) {
                dst[i] = smem[read_buf][i];
            }
            if (tile_idx < NUM_TILES - 1) {
                int temp = read_buf;
                read_buf = write_buf;
                write_buf = temp;
            }
        }
    }
    
    // synchronous version for comparison
    __global__ void sync_copy_sequential(float* global_data, float* global_output) {
        __shared__ float smem[TILE_SIZE];
    
        float* src = global_data + blockIdx.x * NUM_TILES * TILE_SIZE;
        float* dst = global_output + blockIdx.x * NUM_TILES * TILE_SIZE;
    
        for (int tile_idx = 0; tile_idx < NUM_TILES; ++tile_idx) {
            float* current_src = src + tile_idx * TILE_SIZE;
            float* current_dst = dst + tile_idx * TILE_SIZE;
    
            //copy to smem
            for (int i = threadIdx.x; i < TILE_SIZE; i += blockDim.x) {
                smem[i] = current_src[i];
            }
            __syncthreads();
    
            for (int i = threadIdx.x; i < TILE_SIZE; i += blockDim.x) {
                smem[i] = compute_kernel(smem[i], COMPUTE_ITERATIONS);
            }
            __syncthreads();
    
            for (int i = threadIdx.x; i < TILE_SIZE; i += blockDim.x) {
                current_dst[i] = smem[i];
            }
            __syncthreads();
        }
    }
    
    int main() {
        cudaError_t cudaStatus = cudaSetDevice(0);
        const int num_blocks = 4;
        const int total_elements = num_blocks * NUM_TILES * TILE_SIZE;
    
        printf("Testing async copy pipeline with %d blocks, %d tiles per block, %d elements per tile\n",
               num_blocks, NUM_TILES, TILE_SIZE);
    
        thrust::host_vector<float> h_input(total_elements);
        thrust::host_vector<float> h_output_async(total_elements);
        thrust::host_vector<float> h_output_sync(total_elements);
        for (int j = 0; j < total_elements; ++j) h_output_async[j] = static_cast<float>(-1);
        for (int j = 0; j < total_elements; ++j) h_output_sync[j] = static_cast<float>(-1);
    
        for (int i = 0; i < total_elements; ++i) {
            h_input[i] = static_cast<float>(i) * 0.1f;
        }
    
        thrust::device_vector<float> d_input = h_input;
        thrust::device_vector<float> d_output_async = h_output_async;
        thrust::device_vector<float> d_output_sync = h_output_sync;
    
        // test async
        cudaEvent_t start_async, stop_async;
        cudaEventCreate(&start_async);
        cudaEventCreate(&stop_async);
    
        cudaEventRecord(start_async);
        async_copy_pipeline<<<num_blocks, BLOCK_SIZE>>>(
            d_input.data().get(),d_output_async.data().get()
        );
        cudaEventRecord(stop_async);
        cudaDeviceSynchronize();
    
        float async_ms = 0;
        cudaEventElapsedTime(&async_ms, start_async, stop_async);
    
        // test sync
        cudaEvent_t start_sync, stop_sync;
        cudaEventCreate(&start_sync);
        cudaEventCreate(&stop_sync);
    
        cudaEventRecord(start_sync);
        sync_copy_sequential<<<num_blocks, BLOCK_SIZE>>>(
            d_input.data().get(),d_output_sync.data().get()
        );
        cudaEventRecord(stop_sync);
        cudaDeviceSynchronize();
    
        float sync_ms = 0;
        cudaEventElapsedTime(&sync_ms, start_sync, stop_sync);
    
        // verify that the two results match
        thrust::host_vector<float> h_result_async = d_output_async;
        thrust::host_vector<float> h_result_sync = d_output_sync;
    
        bool results_match = true;
        for (int i = 0; i < total_elements; ++i) {
            if (fabs(h_result_async[i] - h_result_sync[i]) > 1e-6) {
                results_match = false;
                break;
            }
        }
    
        printf("Results: %s\n", results_match ? "MATCH" : "MISMATCH");
        printf("Async pipeline time: %.3f ms\n", async_ms);
        printf("Sync sequential time: %.3f ms\n", sync_ms);
        printf("Speedup: %.2fx\n", sync_ms / async_ms);
    
        // clear
        cudaEventDestroy(start_async);
        cudaEventDestroy(stop_async);
        cudaEventDestroy(start_sync);
        cudaEventDestroy(stop_sync);
    
        return 0;
    }
    

    The code above measures the runtime of async copy plus compute versus sync copy plus compute. The expected result is that the async-copy version outperforms the sync-copy version, but on the MetaX C500 the results are as follows (the async copy takes longer):

    Testing async copy pipeline with 4 blocks, 2048 tiles per block, 6144 elements per tile
    Results: MATCH
    Async pipeline time: 30.342 ms
    Sync sequential time: 9.460 ms
    Speedup: 0.31x

    With the same code on an L20, the results are as follows (on the L20 the async copy takes less time than the sync copy):

    Testing async copy pipeline with 4 blocks, 2048 tiles per block, 6144 elements per tile
    Results: MATCH
    Async pipeline time: 4.221 ms
    Sync sequential time: 17.118 ms
    Speedup: 4.06x

    The async copy in the code follows the usage described in the C500 MXMACA C++ Programming Guide at developer.metax-tech.com/api/client/document/preview/559/C500_MXMACAC%2B%2BProgrammingGuide_CN.html#ehg6356t2io61

    Question: Why is async copy slower than sync copy on the C500? Is the async-copy interface being used incorrectly, or is there some other cause? Please advise, thanks.
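
    As a possible next step, a copy-only variant of the benchmark (our own illustrative sketch, reusing the constants and the cg alias from the code above, not something we have measured) would separate the cost of the cg::memcpy_async interface itself from the double-buffering scheme: with no compute to overlap, any large slowdown relative to a plain-load version points at the copy path rather than at the pipelining logic.

    // Illustrative copy-only kernel: same tiling as async_copy_pipeline above,
    // but without compute_kernel, so timing it against a plain strided-load
    // equivalent isolates the cost of cg::memcpy_async itself on this device.
    __global__ void async_copy_only(const float *global_data, float *global_output) {
        auto cta = cg::this_thread_block();
        __shared__ float smem[TILE_SIZE];
        const float *src = global_data + blockIdx.x * NUM_TILES * TILE_SIZE;
        float *dst = global_output + blockIdx.x * NUM_TILES * TILE_SIZE;

        for (int tile_idx = 0; tile_idx < NUM_TILES; ++tile_idx) {
            cg::memcpy_async(cta, smem, src + tile_idx * TILE_SIZE,
                             sizeof(float) * TILE_SIZE);
            cg::wait(cta);  // the staged tile must land before it is written back
            for (int i = threadIdx.x; i < TILE_SIZE; i += blockDim.x) {
                dst[tile_idx * TILE_SIZE + i] = smem[i];
            }
            cta.sync();     // smem is reused by the next tile's copy
        }
    }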
