
[SYCL] Fix WARP_SIZE=16 bug of Intel GPU #8266


Merged — 8 commits merged into ggml-org:master on Jul 5, 2024

Conversation

@luoyu-intel (Contributor) commented on Jul 3, 2024

This PR fixes several WARP_SIZE=16 bugs for Intel GPUs. All warp-related unit tests pass, and WARP_SIZE=16 produces the same output as WARP_SIZE=32 on Intel GPUs.

NOTE: the QX_K kernels are specialized for WARP_SIZE=32, so they keep a fixed warp size of 32.
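
For orientation only, here is a minimal sketch of the idea rather than the actual ggml-sycl code: a warp-level reduction parameterized on the logical warp width, with WARP_SIZE used by the general kernels and a fixed QK_WARP_SIZE kept for the QX_K kernels. The names WARP_SIZE and QK_WARP_SIZE follow this PR; the helper warp_reduce_sum and the exact values are assumptions for illustration.

```cpp
// Minimal sketch only, not the actual ggml-sycl implementation.
// WARP_SIZE / QK_WARP_SIZE follow the names used in this PR; the helper
// below is hypothetical and just illustrates the parameterization.
#include <sycl/sycl.hpp>

constexpr int WARP_SIZE    = 16; // sub-group width assumed for Intel GPUs here
constexpr int QK_WARP_SIZE = 32; // fixed width the QX_K kernels are written for

template <int width>
float warp_reduce_sum(float x, const sycl::nd_item<3> &item) {
    // XOR (butterfly) reduction across the sub-group lanes.
#pragma unroll
    for (int mask = width / 2; mask > 0; mask >>= 1) {
        x += sycl::permute_group_by_xor(item.get_sub_group(), x, mask);
    }
    return x;
}
```

In this sketch, generic kernels would instantiate warp_reduce_sum<WARP_SIZE>, while the QX_K kernels would stay on the 32-lane layout via QK_WARP_SIZE; how the real kernels realize that on 16-wide sub-groups is not shown here.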

Performance change

llama-2-7b-chat-hf-q4_0.gguf, 32 tokens in and 32 tokens out, on an Arc A770: eval speed improves from 40 tokens/s to 44 tokens/s.

Master Branch

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, have  new experiences and learn new things. She was a curious child, always eager to explore and discover new things.

One day, she found a small door
llama_print_timings:        load time =   14573.13 ms
llama_print_timings:      sample time =       0.53 ms /    32 runs   (    0.02 ms per token, 60952.38 tokens per second)
llama_print_timings: prompt eval time =     217.69 ms /    32 tokens (    6.80 ms per token,   147.00 tokens per second)
llama_print_timings:        eval time =     771.81 ms /    31 runs   (   24.90 ms per token,    40.17 tokens per second)
llama_print_timings:       total time =     994.93 ms /    63 tokens

PR Branch

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, have  new experiences and learn new things. She was a curious child, always eager to explore and discover new things.

One day, she found a small door
llama_print_timings:        load time =   14716.53 ms
llama_print_timings:      sample time =       0.52 ms /    32 runs   (    0.02 ms per token, 61776.06 tokens per second)
llama_print_timings: prompt eval time =     216.68 ms /    32 tokens (    6.77 ms per token,   147.69 tokens per second)
llama_print_timings:        eval time =     698.20 ms /    31 runs   (   22.52 ms per token,    44.40 tokens per second)
llama_print_timings:       total time =     920.69 ms /    63 tokens

@github-actions bot added the testing, ggml, and SYCL labels on Jul 3, 2024
@airMeng requested a review from AidanBeltonS on Jul 3, 2024
@airMeng (Collaborator) left a comment

Tested Meta-Llama-3-8B-Instruct-Q4_K_S.gguf and llama-2-7b.Q4_0.gguf.

@airMeng (Collaborator) commented on Jul 3, 2024

@joeatodd @OuadiElfarouki

@NeoZhangJianyu (Collaborator) left a comment

It passes on MTL in my testing.

@qnixsynapse (Collaborator) commented

Tested iq4_XS, Q4_K_S.

LGTM

@luoyu-intel (Contributor, Author) commented

@qnixsynapse Thanks for testing! Q4_K models still use WARP_SIZE=32, so they won't benefit from this PR.
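
As a hedged illustration of that distinction (hypothetical helper, not the actual ggml-sycl dispatch code), the selection could look like choosing the reduction width per quantization type:

```cpp
// Hypothetical sketch only. GGML_TYPE_* are real ggml type identifiers, but
// pick_warp_size() is an illustrative helper, not the actual dispatch code.
#include "ggml.h"

static constexpr int WARP_SIZE    = 16; // assumed Intel sub-group width
static constexpr int QK_WARP_SIZE = 32; // width the K-quant kernels keep

static int pick_warp_size(ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q2_K:
        case GGML_TYPE_Q3_K:
        case GGML_TYPE_Q4_K:
        case GGML_TYPE_Q5_K:
        case GGML_TYPE_Q6_K:
            return QK_WARP_SIZE; // K-quant kernels stay on the 32-lane layout
        default:
            return WARP_SIZE;    // other kernels use the 16-lane sub-group
    }
}
```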

@qnixsynapse (Collaborator) commented

@luoyu-intel Yes, I am aware. I am testing IQ4 models currently.

@Alcpz (Collaborator) commented on Jul 3, 2024

@joeatodd @OuadiElfarouki Performance of this SYCL branch on an NVIDIA A100 with Q4_K shows no regressions.

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 78 | none | 0 | pp512 | 2203.66 ± 15.26 |
| llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | SYCL | 78 | none | 0 | pp512 | 1720.49 ± 23.80 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | SYCL | 78 | none | 0 | pp512 | 606.95 ± 5.49 |

build: 4887fdc (3293)

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 81 | none | 0 | tg128 | 5.36 ± 0.00 |
| llama 13B Q4_K - Medium | 7.33 GiB | 13.02 B | SYCL | 81 | none | 0 | tg128 | 4.27 ± 0.00 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | SYCL | 81 | none | 0 | tg128 | 2.08 ± 0.00 |

build: 4887fdc (3293)

@mofosyne added the Review Complexity: Medium label on Jul 3, 2024
@airMeng merged commit a9554e2 into ggml-org:master on Jul 5, 2024
53 checks passed
@luoyu-intel deleted the sycl-acc branch on July 8, 2024
arthw added a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024
[SYCL] Fix WARP_SIZE=16 bug of Intel GPU (ggml-org#8266)
* fix group_norm ut
* split softmax
* fix softmax
* add concat support condition
* revert debug code
* move QK_WARP_SIZE to presets.hpp

Fix issues in the above PR:
* fix norm() nullptr leading to a crash on iGPU
* use WARP_32_SIZE to replace QK_WARP_SIZE
* optimize dmmv.cpp for iGPU
* add sycl_hw.cpp to detect hardware info
arthw added a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024
[SYCL] Fix WARP_SIZE=16 bug of Intel GPU (ggml-org#8266) cherry-pick b549a1b