9. MacaRT-LMDeploy

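9.1. Introduction to MacaRT-LMDeploy

MacaRT-LMDeploy is an inference tool that adapts the official LMDeploy to 曦云 series GPUs. Built on the MXMACA backend, it adds compatibility adaptations and kernel optimizations to LMDeploy. When MacaRT-LMDeploy is used for large-model inference on 曦云 series GPUs, its usage and features are compatible with the official LMDeploy.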

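9.2. MacaRT-LMDeploy Features and Limitations

MacaRT-LMDeploy is adapted to the latest LMDeploy. Apart from the limitations listed below, it is compatible with all existing LMDeploy features, including offline batch processing, online inference, and interactive command-line chat. See the official LMDeploy documentation for details.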

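Limitations:

Only the PyTorch inference backend is supported; the TurboMind engine is not supported.
FLOAT16 and BFLOAT16 inference is supported; deployment of quantized models is not yet supported.
Functional verification and performance optimization have so far been carried out only for a subset of models, such as Qwen2.5 and InternLM.
For performance reasons, block_size supports only 8, 16, and 32; 16 is recommended.
Only Ubuntu 20 and Ubuntu 22 are currently covered; support for other systems will be added later.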
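
The limitations above can be checked before an engine is launched. The following is a minimal sketch and not part of MacaRT-LMDeploy: the function name check_maca_config and its parameters are hypothetical, and the allowed values are simply copied from the list above.

```python
from typing import List

# Allowed values mirror the limitations listed above (assumed to be exhaustive).
SUPPORTED_BACKENDS = {"pytorch"}            # TurboMind is not supported
SUPPORTED_DTYPES = {"float16", "bfloat16"}  # quantized models are not supported yet
SUPPORTED_BLOCK_SIZES = {8, 16, 32}         # 16 is the recommended value


def check_maca_config(backend: str, dtype: str, block_size: int) -> List[str]:
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    problems = []
    if backend not in SUPPORTED_BACKENDS:
        problems.append(f"backend {backend!r} is not supported; use 'pytorch'")
    if dtype not in SUPPORTED_DTYPES:
        problems.append(f"dtype {dtype!r} is not supported; use float16 or bfloat16")
    if block_size not in SUPPORTED_BLOCK_SIZES:
        problems.append(f"block_size {block_size} is not supported; use 8, 16, or 32")
    return problems


if __name__ == "__main__":
    # A configuration that respects every limitation yields no problems.
    print(check_maca_config("pytorch", "float16", 16))
    # A configuration that violates all three checks yields three messages.
    for problem in check_maca_config("turbomind", "int4", 64):
        print(problem)
```

Returning a list instead of raising on the first failure lets a launcher script report every violation at once.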

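9.3. MacaRT-LMDeploy Usage Workflow

This section describes how to use MacaRT-LMDeploy. It covers environment preparation, offline inference, static inference performance testing, and dynamic inference performance testing against a running server.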

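9.3.1. Environment Preparation

Running inference with MacaRT-LMDeploy requires the following preparation:

Obtain the vLLM image
Install dlinfer
Install lmdeploy

Note

After the steps above are complete, LMDeploy's own dependencies are fully installed. Some models, however, have dependencies of their own that surface only when the model is run; install them as prompted, and configure the Python pip source in advance.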

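9.3.1.1. Obtaining the vLLM Image

Obtain the vLLM image from the released software package and start it. See the "容器相关场景支持" (container scenario support) chapter of 《曦云® 系列通用GPU用户指南》 (曦云 Series General-Purpose GPU User Guide).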

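9.3.1.2. Installing dlinfer

Building dlinfer requires the CUDA toolkit; CUDA 11.6 is recommended. When starting the container, you can pass -v /usr/local:/usr/local to mount the host's /usr/local directory, and with it /usr/local/cuda, into the container.

Set the MACA environment variables: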

DEFAULT_DIR="/opt/maca"
export MACA_PATH=${1:-$DEFAULT_DIR}
export CUDA_PATH=/usr/local/cuda
export CUCC_PATH=${MACA_PATH}/tools/cu-bridge
export PATH=${CUDA_PATH}/bin:${MACA_PATH}/mxgpu_llvm/bin:${MACA_PATH}/bin:${CUCC_PATH}/tools:${CUCC_PATH}/bin:$PATH
export LD_LIBRARY_PATH=${MACA_PATH}/lib:${MACA_PATH}/ompi/lib:${MACA_PATH}/mxgpu_llvm/lib:${LD_LIBRARY_PATH}
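
When a build or benchmark is launched from a Python script rather than an interactive shell, the same variables can be assembled in a dict and handed to a subprocess. This is a minimal sketch: build_maca_env is a hypothetical helper, and the default paths are the ones used in the shell snippet above.

```python
import os
from typing import Dict, Optional


def build_maca_env(maca_path: str = "/opt/maca",
                   cuda_path: str = "/usr/local/cuda",
                   base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: return a copy of base_env with the MACA variables applied."""
    env = dict(os.environ if base_env is None else base_env)
    cucc_path = f"{maca_path}/tools/cu-bridge"
    env["MACA_PATH"] = maca_path
    env["CUDA_PATH"] = cuda_path
    env["CUCC_PATH"] = cucc_path
    # Same search order as the shell snippet; any existing value is kept at the end.
    path_parts = [
        f"{cuda_path}/bin",
        f"{maca_path}/mxgpu_llvm/bin",
        f"{maca_path}/bin",
        f"{cucc_path}/tools",
        f"{cucc_path}/bin",
        env.get("PATH", ""),
    ]
    lib_parts = [
        f"{maca_path}/lib",
        f"{maca_path}/ompi/lib",
        f"{maca_path}/mxgpu_llvm/lib",
        env.get("LD_LIBRARY_PATH", ""),
    ]
    # Unlike the shell version, empty entries are dropped to avoid a trailing colon.
    env["PATH"] = ":".join(p for p in path_parts if p)
    env["LD_LIBRARY_PATH"] = ":".join(p for p in lib_parts if p)
    return env


if __name__ == "__main__":
    env = build_maca_env(base_env={"PATH": "/usr/bin"})
    for name in ("MACA_PATH", "CUDA_PATH", "CUCC_PATH", "PATH", "LD_LIBRARY_PATH"):
        print(f"{name}={env[name]}")
```

The returned dict can be passed as the env argument of subprocess.run so that the calling process's own environment stays untouched.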

Install from source:

git clone https://github.com/DeepLink-org/dlinfer.git
cd dlinfer
# The following commit is recommended; it has been tested
git checkout dbb1feb71b0983d8b5b166771a7bb99e00461b36
rm -rf _skbuild
pip3 install -r requirements/maca/full.txt
DEVICE=maca python3 setup.py develop

9.3.1.3. Installing LMDeploy

Install from source:

git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
# The following commit is recommended; it has been tested
git checkout 832bfc45b4497e8d16e08ecfd663671e634aae40
LMDEPLOY_TARGET_DEVICE=maca python setup.py develop

9.3.2. Offline Inference

An offline inference example is shown below:

import lmdeploy
from lmdeploy import PytorchEngineConfig

if __name__ == "__main__":
    # Build a pipeline on the PyTorch backend, targeting the MACA device.
    pipe = lmdeploy.pipeline(
        "internlm/internlm2-chat-7b",
        backend_config=PytorchEngineConfig(
            tp=1,
            cache_max_entry_count=0.8,
            device_type="maca",
            block_size=16,
        ),
    )
    question = ["Shanghai is", "Please introduce China", "How are you?"]
    response = pipe(question, request_output_len=256, do_preprocess=False)
    # Print each prompt together with its generated text.
    for q, r in zip(question, response):
        print(f"Q: {q}")
        print(f"A: {r.text}")
        print()

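For basic API usage, see the official documentation.

9.3.3. Static Inference Performance Test

Because profile_generation.py does not currently accept a device_type argument, modify the code manually as follows: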

--- a/benchmark/profile_generation.py
+++ b/benchmark/profile_generation.py
@@ -430,6 +430,7 @@ def main():
                  eager_mode=args.eager_mode,
                  enable_prefix_caching=args.enable_prefix_caching,
                  dtype=args.dtype,
+                 device_type='maca',
               )

Test command:

python profile_generation.py /models/llm/Internlm2-chat-7b --backend pytorch -c 1 -pt 256 -ct 128 --tp 1 --cache-block-seq-len 16 --dtype float16

python profile_generation.py: the script is located in lmdeploy/benchmark.
/models/llm/Internlm2-chat-7b: path to the model files.
--backend pytorch: use PyTorch as the backend.
-c 1: concurrency of 1.
-pt 256: input length of 256 tokens.
-ct 128: output length of 128 tokens.
--tp 1: tensor parallel size of 1.
--cache-block-seq-len 16: block size of 16.
--dtype float16: use float16 as the data type.
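
When the same test is repeated for several concurrency levels or sequence lengths, the command can be assembled programmatically. The following is a sketch under stated assumptions: build_profile_command is a hypothetical helper, and it uses only the flags explained above.

```python
import shlex
from typing import List


def build_profile_command(model_path: str,
                          concurrency: int = 1,
                          prompt_tokens: int = 256,
                          completion_tokens: int = 128,
                          tp: int = 1,
                          block_size: int = 16,
                          dtype: str = "float16") -> List[str]:
    """Hypothetical helper: build the argument list for profile_generation.py."""
    return [
        "python", "profile_generation.py", model_path,
        "--backend", "pytorch",               # only the PyTorch backend is supported
        "-c", str(concurrency),               # concurrency
        "-pt", str(prompt_tokens),            # input length
        "-ct", str(completion_tokens),        # output length
        "--tp", str(tp),                      # tensor parallel size
        "--cache-block-seq-len", str(block_size),
        "--dtype", dtype,
    ]


if __name__ == "__main__":
    # Print one command line per concurrency level instead of running it.
    for c in (1, 4, 16):
        cmd = build_profile_command("/models/llm/Internlm2-chat-7b", concurrency=c)
        print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run with cwd set to the lmdeploy/benchmark directory, which avoids shell quoting issues in the model path.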

After the program runs, it prints results like the following:

-------------------------------------
total time: 5.51s
concurrency: 1, test_round: 3
input_tokens: 256, output_tokens: 128
first_token latency(min, max, ave): 0.045s, 0.047s, 0.046s
total_token latency(min, max, ave): 1.833s, 1.846s, 1.839s
token_latency percentiles(50%,75%,95%,99%)(s):[0.014, 0.014, 0.02, 0.021]
throughput(output): 69.68 token/s
throughput(total): 209.04 token/s
--------------------------------------

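The printed results report the first-token latency, the output throughput, and the total throughput (which includes the first token).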
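
The two throughput figures can be cross-checked against the other printed values. The sketch below assumes that throughput is the token count summed over all rounds divided by the total time, and that the total figure counts input and output tokens together; both are inferred from the sample numbers, not stated by the tool.

```python
# Values copied from the sample output above.
total_time_s = 5.51
concurrency = 1
test_round = 3
input_tokens = 256
output_tokens = 128


def throughput(tokens_per_request: int, concurrency: int, rounds: int, elapsed_s: float) -> float:
    """Tokens processed across all requests and rounds, per second of wall time."""
    return tokens_per_request * concurrency * rounds / elapsed_s


# Assumption: output throughput counts generated tokens only.
output_tps = throughput(output_tokens, concurrency, test_round, total_time_s)
# Assumption: total throughput counts input and output tokens together.
total_tps = throughput(input_tokens + output_tokens, concurrency, test_round, total_time_s)

if __name__ == "__main__":
    print(f"derived output throughput: {output_tps:.2f} token/s (reported 69.68)")
    print(f"derived total throughput:  {total_tps:.2f} token/s (reported 209.04)")
```

The derived values agree with the reported ones to within a few hundredths of a token per second; the remaining gap is consistent with the total time being rounded to two decimals in the printout.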

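9.3.4. Server Dynamic Inference Performance Test

For details, see the request throughput performance test and api_server performance test sections of the official documentation.

Starting the service

Example command for starting the service: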

lmdeploy serve api_server --server-port 23333 --tp 1 --backend pytorch --max-batch-size 256 /models/llm/Internlm2-chat-7b --dtype float16 --device maca --cache-block-seq-len 16

The output is as follows:

HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Started server process[1384373]
HINT: Waiting for application startup.
HINT: Application startup complete.
HINT: Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
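
Once the startup log above appears, a single request is a quick way to confirm that the server responds before a full benchmark is run. The following is a minimal sketch using only the Python standard library. It assumes the server exposes the OpenAI-style /v1/models and /v1/completions routes and returns OpenAI-style JSON; the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict


def build_completion_payload(model: str, prompt: str, max_tokens: int = 64) -> Dict[str, Any]:
    """Hypothetical helper: body for a non-streaming completion request."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens, "stream": False}


def request_json(url: str, payload: Any = None, timeout: float = 60.0) -> Any:
    """Send GET (no payload) or POST (JSON payload) and decode the JSON reply."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def smoke_test(base_url: str, prompt: str = "Shanghai is") -> str:
    """Ask the server for its model name, then request one short completion."""
    # Assumption: the response shapes follow the OpenAI schema.
    model = request_json(f"{base_url}/v1/models")["data"][0]["id"]
    reply = request_json(f"{base_url}/v1/completions", build_completion_payload(model, prompt))
    return reply["choices"][0]["text"]


if __name__ == "__main__":
    # Only prints the request body; smoke_test needs a running server.
    print(json.dumps(build_completion_payload("<model-name>", "Shanghai is"), indent=2))
```

With the server from the command above running, calling smoke_test("http://127.0.0.1:23333") should return the generated text for the prompt.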

Sending throughput test requests

profile_restful_api.py occasionally hits a NaN exception. As a temporary workaround, modify the code manually as follows:

--- a/benchmark/profile_restful_api.py
+++ b/benchmark/profile_restful_api.py
@@ -153,7 +153,7 @@ async def async_request_openai_completions(
      payload = {
         'model': request_func_input.model,
         'prompt': prompt,
-        'temperature': 0.0,
+        'temperature': 1.0,
         'best_of': 1,
         'max_tokens': request_func_input.output_len,
         'stream': not args.disable_stream,

Download ShareGPT_V3_unfiltered_cleaned_split.json, then run the test:

python profile_restful_api.py --port 23333 --backend lmdeploy --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json

The output is as follows:

Backend: lmdeploy
Traffic request rate: inf
Successful requests: 1000
Benchmark duration(s): 96.23
Total input tokens: 228316
Total generated tokens: 195534
Total generated tokens (retokenized): 181775
Request throughput (req/s): 10.39
Input token throughput (tok/s): 2372.62
Output token throughput (tok/s): 2031.95
End-to-End Latency
Mean E2E Latency (ms): 43531.86
Median E2E Latency (ms): 42845.51
Mean TTFT (ms): 26513.87
Median TTFT (ms): 24519.39
P99 TTFT (ms): 62724.06
Time per Output Token (excl. 1st token)
Mean TPOT (ms): 110.65
Median TPOT (ms): 95.38
P99 TPOT (ms): 535.69
Inter-token Latency
Mean ITL (ms): 91.67
Median ITL (ms): 61.98
P99 ITL (ms): 724.82
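
To compare runs, the report can be parsed into a dictionary and its throughput lines cross-checked against the token counts. The sketch below assumes each throughput is the corresponding count divided by the benchmark duration, which is inferred from the sample numbers; parse_metrics is a hypothetical helper, and SAMPLE is an excerpt of the output above.

```python
from typing import Dict

# Excerpt of the sample output above.
SAMPLE = """\
Backend: lmdeploy
Successful requests: 1000
Benchmark duration(s): 96.23
Total input tokens: 228316
Total generated tokens: 195534
Request throughput (req/s): 10.39
Input token throughput (tok/s): 2372.62
Output token throughput (tok/s): 2031.95
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Hypothetical helper: map 'name: number' lines to floats, skipping other lines."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.rpartition(":")
        if not sep:
            continue  # section titles such as 'End-to-End Latency' have no colon
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            continue  # non-numeric values such as 'Backend: lmdeploy'
    return metrics


if __name__ == "__main__":
    m = parse_metrics(SAMPLE)
    duration = m["Benchmark duration(s)"]
    # Assumption: each throughput is the matching count divided by the duration.
    derived = {
        "Request throughput (req/s)": m["Successful requests"] / duration,
        "Input token throughput (tok/s)": m["Total input tokens"] / duration,
        "Output token throughput (tok/s)": m["Total generated tokens"] / duration,
    }
    for name, value in derived.items():
        print(f"{name}: derived {value:.2f}, reported {m[name]:.2f}")
```

The derived and reported values can differ in the last digit because the duration in the printout is rounded to two decimals.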