[CANN]: add the basic supports of Flash Attention kernel #13627
@@ -1747,6 +1747,9 @@ static bool ggml_cann_compute_forward(ggml_backend_cann_context& ctx,
         case GGML_OP_COUNT_EQUAL:
             ggml_cann_count_equal(ctx, dst);
             break;
+        case GGML_OP_FLASH_ATTN_EXT:
+            ggml_cann_flash_attn_ext(ctx, dst);
+            break;
         default:
             return false;
     }
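For reference, judging only from the call site above (this declaration is inferred, not copied from the PR), the new CANN entry point follows the same shape as the other ggml_cann_* ops, taking the backend context and the destination tensor that carries Q/K/V/mask as its sources:

    // assumed declaration, mirroring ops such as ggml_cann_count_equal
    void ggml_cann_flash_attn_ext(ggml_backend_cann_context & ctx, ggml_tensor * dst);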
@@ -2161,6 +2164,36 @@ static bool ggml_backend_cann_supports_op(ggml_backend_dev_t dev,
         case GGML_OP_PAD_REFLECT_1D:
         case GGML_OP_COUNT_EQUAL:
             return true;
+        case GGML_OP_FLASH_ATTN_EXT:{
+            // copy from [ggml-cuda.cu]
+            if (op->src[1]->ne[0] != op->src[2]->ne[0]) {
+                // different head sizes of K and V are not supported yet
+                return false;
+            }
+            if (op->src[0]->ne[0] == 192) {
+                return false;
+            }
+            if (op->src[0]->ne[0] == 576) {
+                // DeepSeek MLA
+                return false;
+            }
+            if (op->src[0]->ne[3] != 1) {
+                return false;
+            }
+            if (op->src[1]->type == GGML_TYPE_BF16 || op->src[2]->type == GGML_TYPE_BF16) {
+                return false;
+            }
+            if (op->src[0]->ne[0] == 64 && op->src[1]->type == GGML_TYPE_F16) {
+                return true;
+            }
+            if (op->src[0]->ne[0] == 128) {
+                return true;
+            }
+            if (op->src[0]->ne[0] == 256 && op->src[1]->type == GGML_TYPE_F16 && op->src[2]->type == GGML_TYPE_F16) {
+                return true;
+            }
Reviewer: It seems that the current FA doesn't support cases where the logit softcap is non-zero, so a check along these lines is needed:

    float logitSoftcap = 0.0f;
    memcpy(&logitSoftcap, (float*)op->op_params + 2, sizeof(float));
    if(logitSoftcap != 0.0f) {
        return false;
    }

Author: Thanks for your comment. We have added it.
+            return op->src[1]->type == GGML_TYPE_F16 && op->src[2]->type == GGML_TYPE_F16;
+        }
         default:
             return false;
     }
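For context (this is my reading of current ggml, not something stated in the PR): GGML_OP_FLASH_ATTN_EXT packs its float parameters into op->op_params in the order scale, max_bias, logit_softcap, which is why the reviewer's snippet reads the value at offset 2. A minimal sketch of extracting all three, under that assumption:

    float scale        = 0.0f;
    float maxBias      = 0.0f;
    float logitSoftcap = 0.0f;
    // op_params is a raw parameter block; the FA op stores three floats in it
    memcpy(&scale,        (const float *) op->op_params + 0, sizeof(float));
    memcpy(&maxBias,      (const float *) op->op_params + 1, sizeof(float));
    memcpy(&logitSoftcap, (const float *) op->op_params + 2, sizeof(float));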
Reviewer: I think there's a logical issue here. Currently, when op->src[0]->ne[0] == 128, the code allows K/V to have a data type like Q4/Q8, implying that this case is supported. However, quantized formats are actually not supported at the moment. I believe the logic should be adjusted accordingly to reflect this. Could you please help confirm the logic?

Author: We have updated the if-else logic to pass all of the tests.
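As an illustration of the kind of tightening the reviewer asks for (a hypothetical sketch, not the code that was actually merged), the head-size-128 branch could require F16 K and V as well:

    if (op->src[0]->ne[0] == 128 &&
        op->src[1]->type == GGML_TYPE_F16 &&
        op->src[2]->type == GGML_TYPE_F16) {
        // head size 128: only F16 K/V is assumed to work here;
        // quantized K/V (e.g. Q4/Q8) is not supported yet
        return true;
    }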