The customer also ran this test on an NVIDIA GPU, where this error does not occur.
Below is the console output with the try/except removed. The attachment contains the test code and a screenshot of the error. The test container is sglang.
python test4.py
Using device: cuda
query_states shape: torch.Size([8, 16, 1, 24]), device: cuda:0
key_states shape: torch.Size([8, 16, 1, 24]), device: cuda:0
value_states shape: torch.Size([8, 16, 1, 48]), device: cuda:0
/opt/conda/lib/python3.10/contextlib.py:103: FutureWarning: torch.backends.cuda.sdp_kernel()
is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel()
for the new context manager, with updated signature.
self.gen = func(*args, **kwds)
Traceback (most recent call last):
File "/data/lhz/BD/test1.py", line 51, in <module>
attn_output = flash_attn_func(
File "/opt/conda/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 1054, in flash_attn_func
return FlashAttnFunc.apply(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/opt/conda/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 704, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state, attn_mask = _flash_attn_forward(
File "/opt/conda/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 110, in _flash_attn_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state, attn_mask = flash_attn_cuda.fwd(
RuntimeError: Head dimension of query/key must greater or equal to head dimension in query
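For reference, here is a minimal sketch of the failing call, reconstructed only from the shapes printed above. The actual test4.py is in the attachment; the dtype, the causal flag, and the variable names are assumptions, not the customer's exact code. The point it illustrates is that the value head dimension (48) is larger than the query/key head dimension (24), which this build's flash_attn_cuda.fwd rejects with the RuntimeError shown above.

import torch
from flash_attn import flash_attn_func

device = "cuda"
dtype = torch.float16  # assumption; fp16 or bf16 is required by flash_attn

# flash_attn_func expects (batch, seqlen, nheads, headdim)
query_states = torch.randn(8, 16, 1, 24, dtype=dtype, device=device)
key_states   = torch.randn(8, 16, 1, 24, dtype=dtype, device=device)
value_states = torch.randn(8, 16, 1, 48, dtype=dtype, device=device)

# Q/K head_dim = 24, V head_dim = 48: on this container the call below raises
# the RuntimeError from the traceback, while the customer reports it succeeds
# on NVIDIA GPUs.
attn_output = flash_attn_func(query_states, key_states, value_states)
print(attn_output.shape)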