Quantization is a method for improving model inference speeds by compressing model weights and performing (faster) computation in lower-precision data types. However, quantization can result in accuracy loss due to the presence of outliers. Recent works like [QuaRot](https://arxiv.org/abs/2404.00456), [SpinQuant](https://arxiv.org/abs/2405.16406), and [FlashAttention-3](https://arxiv.org/pdf/2407.08608) introduce methods to increase the numerical accuracy of INT4, INT8 and FP8 quantization in LLMs. These methods rely on [Hadamard Transforms](https://en.wikipedia.org/wiki/Hadamard_transform). In this blog, we present HadaCore, a Hadamard Transform CUDA kernel that achieves state-of-the-art performance on NVIDIA A100 and H100 GPUs. Our kernel achieves speedups of **1.1–1.4x** and **1.0–1.3x**, with peak gains of **3.5x** and **3.6x** respectively, over Dao AI Lab's [Fast Hadamard Transform Kernel](https://github.com/Dao-AILab/fast-hadamard-transform). We leverage a hardware-aware work decomposition that benefits from Tensor Core acceleration while maintaining quantization error reduction.
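To make the intuition concrete, here is a small PyTorch sketch (not the HadaCore kernel; the vector size, injected outlier, and simple symmetric INT8 round-trip are illustrative assumptions). Rotating by an orthonormal Hadamard matrix spreads the outlier's magnitude across all coordinates, which shrinks the quantization scale and, typically, the error:

```python
# Illustrative sketch (not HadaCore): why a Hadamard rotation helps quantization.
import torch
from scipy.linalg import hadamard

head_dim = 128                                           # example Hadamard size
H = torch.tensor(hadamard(head_dim), dtype=torch.float32) / head_dim ** 0.5  # orthonormal

x = torch.randn(head_dim)
x[0] = 50.0                                              # inject an activation outlier

def int8_roundtrip(v):
    """Symmetric per-tensor INT8 fake-quantization."""
    scale = v.abs().max() / 127.0
    return torch.round(v / scale).clamp(-127, 127) * scale

err_plain   = (int8_roundtrip(x) - x).norm()             # quantize directly
x_rot       = H @ x
err_rotated = (H.T @ int8_roundtrip(x_rot) - x).norm()   # quantize in the rotated domain

print(f"max|x| = {x.abs().max():.1f}, max|Hx| = {x_rot.abs().max():.1f}")
print(f"INT8 error: plain = {err_plain:.3f}, rotated = {err_rotated:.3f}")
```

Because the Hadamard matrix is orthogonal, the rotation can be undone exactly (or folded into adjacent weights), so only the numerics of quantization change.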
We benchmark HadaCore against the [Dao AI Lab Hadamard Kernel](https://github.com/Dao-AILab/fast-hadamard-transform).

*Figure 4: HadaCore Kernel Speedup on NVIDIA A100 over Dao AI Lab Fast Hadamard Kernel*

*Color coded Speedup Table for NVIDIA A100, Green = Speedup over Baseline*

*Figure 5: HadaCore Kernel Speedup on NVIDIA H100 over Dao AI Lab Fast Hadamard Kernel*

*Color coded Speedup Table for NVIDIA H100, Green = Speedup over Baseline*
We showcase our speedup as the input tensor size (labeled element count) in our benchmarks increases.

The queries (Q), keys (K) and values (V) tensors are 4D tensors of size:

`(batch_size, seq_len, n_heads, head_dim)`

A Hadamard matrix of size `head_dim` is applied to these activation tensors, so we refer to this as using a Hadamard size of `head_dim` with an element count of:

`batch_size*seq_len*n_heads*head_dim`.
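For concreteness, the following PyTorch sketch (a plain reshape-and-matmul reference, not the HadaCore kernel) shows how a size-`head_dim` Hadamard matrix acts on the last dimension of a 4D activation tensor, and how the element counts in the table below arise:

```python
# Reference sketch: applying a head_dim-sized Hadamard rotation to a 4D activation.
import torch
from scipy.linalg import hadamard

# Element count for the Llama-2 70b prefill example below:
batch_size, seq_len, n_heads, head_dim = 1, 4096, 64, 128
print(batch_size * seq_len * n_heads * head_dim)     # 33554432 elements, Hadamard size 128

# Rotate only the last (head_dim) dimension of a decode-shaped Q tensor:
q = torch.randn(1, 1, 64, head_dim)                  # (batch, seq, heads, head_dim)
H = torch.tensor(hadamard(head_dim), dtype=torch.float32) / head_dim ** 0.5
q_rot = (q.reshape(-1, head_dim) @ H.T).reshape(q.shape)
print(q_rot.shape, q.numel())                        # torch.Size([1, 1, 64, 128]) 8192
```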
Common element counts for query rotations in an attention block:
<table class="table table-bordered">
  <tr>
   <td><strong>Model \ Tokens</strong>
   </td>
   <td><strong>Prefill</strong>
   </td>
   <td><strong>Decoding</strong>
   </td>
  </tr>
  <tr>
   <td><strong>Llama-2 70b</strong>
   </td>
   <td>33,554,432 elements
   <br>
   128 Hadamard size
   <br>
   (1 batch * 64 heads * 4096 tokens * 128 dimensional embeddings per head per token)
   </td>
   <td>8,192 elements
   <br>
   128 Hadamard size
   <br>
   (1 batch * 64 heads * 1 token * 128 dimensional embeddings per head per token)
   </td>
  </tr>
  <tr>
   <td><strong>Llama-3 8b</strong>
   </td>
   <td>33,554,432 elements
   <br>
   128 Hadamard size
   <br>
   (1 batch * 32 heads * 8192 tokens * 128 dimensional embeddings per head per token)
   </td>
   <td>4,096 elements
   <br>
   128 Hadamard size
   <br>
   (1 batch * 32 heads * 1 token * 128 dimensional embeddings per head per token)
   </td>
  </tr>
</table>
HadaCore achieves **1.1–1.4x** speedup on A100 and **1.0–1.3x** speedup on H100.
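A minimal timing harness for this kind of comparison might look like the sketch below. It times only the baseline kernel; the `hadamard_transform` entry point is assumed from the Dao AI Lab package's Python API, and HadaCore's own binding is not shown:

```python
# Hedged benchmarking sketch: times the Dao AI Lab baseline kernel on one
# element-count configuration; swap in the HadaCore entry point to compare.
import torch
from fast_hadamard_transform import hadamard_transform  # baseline kernel (assumed API)

x = torch.randn(1 * 4096 * 64, 128, device="cuda", dtype=torch.float16)
scale = 128 ** -0.5

for _ in range(10):                                      # warmup
    hadamard_transform(x, scale)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    hadamard_transform(x, scale)
end.record()
torch.cuda.synchronize()
print(f"avg latency: {start.elapsed_time(end) / 100:.3f} ms")
```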
**MMLU Benchmarks**
We evaluated MMLU scores on a [Llama 3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) inference workload where the FlashAttention computation was performed in FP8. Newer generation [NVIDIA Hopper GPUs](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/) come equipped with FP8 Tensor Cores that deliver substantial compute gains over FP16.
Our results show the benefit of using HadaCore for accuracy preservation when combined with optimizations such as FP8 FlashAttention.
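As a rough illustration of the FP8 rows in the table below (the per-tensor scaling scheme and tensor shapes are assumptions for the sketch, and the FP8 FlashAttention call itself is not shown), the query tensor is Hadamard-rotated and then cast to FP8:

```python
# Illustrative sketch: Hadamard-rotate Q, then quantize to FP8 for an FP8 attention kernel.
import torch
from scipy.linalg import hadamard

head_dim = 128
H = (torch.tensor(hadamard(head_dim), dtype=torch.float32) / head_dim ** 0.5).cuda()

q = torch.randn(1, 1, 32, head_dim, device="cuda")           # decode-like Q shape
q_rot = (q.reshape(-1, head_dim) @ H.T).reshape(q.shape)

fp8_max = torch.finfo(torch.float8_e4m3fn).max               # 448 for e4m3
scale = q_rot.abs().amax() / fp8_max
q_fp8 = (q_rot / scale).to(torch.float8_e4m3fn)              # q_fp8 and scale go to the FP8 kernel
```

Since Q and K receive the same orthogonal rotation, their dot products, and therefore the attention scores, are preserved up to quantization error.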
<table class="table table-bordered">
  <tr>
   <td><strong>Format</strong>
   </td>
   <td><strong>Method</strong>
   </td>
   <td><strong>Llama3.1-8B</strong>
   <br>
   <strong>Avg. 5-Shot MMLU Accuracy</strong>
   </td>
  </tr>
  <tr>
   <td><strong>Q, K, V: FP16</strong>
   <br>
   <strong>FlashAttention: FP16</strong>
   </td>
   <td>N/A
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td><strong>Q, K, V: FP16</strong>
   <br>
   <strong>FlashAttention: FP8</strong>
   </td>
   <td>No Hadamard
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td><strong>Q, K, V: FP8</strong>
   <br>
   <strong>FlashAttention: FP8</strong>
   </td>
   <td>HadaCore
   </td>
   <td>
   </td>
  </tr>
</table>