
Commit 3607c5f

dvnagorny authored and randall77 committed
runtime: improve memmove for amd64
Use AVX if available on 4th generation of Intel(TM) Core(TM) processors.

(collected on E5 2609v3 @1.9GHz)

name                         old speed      new speed      delta
Memmove/1-6                   158MB/s ± 0%   172MB/s ± 0%    +9.09%  (p=0.000 n=16+16)
Memmove/2-6                   316MB/s ± 0%   345MB/s ± 0%    +9.09%  (p=0.000 n=18+16)
Memmove/3-6                   517MB/s ± 0%   517MB/s ± 0%       ~    (p=0.445 n=16+16)
Memmove/4-6                   687MB/s ± 1%   690MB/s ± 0%    +0.35%  (p=0.000 n=20+17)
Memmove/5-6                   729MB/s ± 0%   729MB/s ± 0%    +0.01%  (p=0.000 n=16+18)
Memmove/6-6                   875MB/s ± 0%   875MB/s ± 0%    +0.01%  (p=0.000 n=18+18)
Memmove/7-6                  1.02GB/s ± 0%  1.02GB/s ± 1%       ~    (p=0.139 n=19+20)
Memmove/8-6                  1.26GB/s ± 0%  1.26GB/s ± 0%    +0.00%  (p=0.000 n=18+18)
Memmove/9-6                  1.42GB/s ± 0%  1.42GB/s ± 0%    +0.00%  (p=0.000 n=17+18)
Memmove/10-6                 1.58GB/s ± 0%  1.58GB/s ± 0%    +0.00%  (p=0.000 n=19+19)
Memmove/11-6                 1.74GB/s ± 0%  1.74GB/s ± 0%    +0.00%  (p=0.001 n=18+17)
Memmove/12-6                 1.90GB/s ± 0%  1.90GB/s ± 0%    +0.00%  (p=0.000 n=19+19)
Memmove/13-6                 2.05GB/s ± 0%  2.05GB/s ± 0%    +0.00%  (p=0.000 n=18+19)
Memmove/14-6                 2.21GB/s ± 0%  2.21GB/s ± 0%    +0.00%  (p=0.000 n=16+20)
Memmove/15-6                 2.37GB/s ± 0%  2.37GB/s ± 0%    +0.00%  (p=0.004 n=19+20)
Memmove/16-6                 2.53GB/s ± 0%  2.53GB/s ± 0%    +0.00%  (p=0.000 n=16+16)
Memmove/32-6                 4.67GB/s ± 0%  4.67GB/s ± 0%    +0.00%  (p=0.000 n=17+17)
Memmove/64-6                 8.67GB/s ± 0%  8.64GB/s ± 0%    -0.33%  (p=0.000 n=18+17)
Memmove/128-6                12.6GB/s ± 0%  11.6GB/s ± 0%    -8.05%  (p=0.000 n=16+19)
Memmove/256-6                16.3GB/s ± 0%  16.6GB/s ± 0%    +1.66%  (p=0.000 n=20+18)
Memmove/512-6                21.5GB/s ± 0%  24.4GB/s ± 0%   +13.35%  (p=0.000 n=18+17)
Memmove/1024-6               24.7GB/s ± 0%  33.7GB/s ± 0%   +36.12%  (p=0.000 n=18+18)
Memmove/2048-6               27.3GB/s ± 0%  43.3GB/s ± 0%   +58.77%  (p=0.000 n=19+17)
Memmove/4096-6               37.5GB/s ± 0%  50.5GB/s ± 0%   +34.56%  (p=0.000 n=19+19)
MemmoveUnalignedDst/1-6       135MB/s ± 0%   146MB/s ± 0%    +7.69%  (p=0.000 n=16+14)
MemmoveUnalignedDst/2-6       271MB/s ± 0%   292MB/s ± 0%    +7.69%  (p=0.000 n=18+18)
MemmoveUnalignedDst/3-6       438MB/s ± 0%   438MB/s ± 0%       ~    (p=0.352 n=16+19)
MemmoveUnalignedDst/4-6       584MB/s ± 0%   584MB/s ± 0%       ~    (p=0.876 n=17+17)
MemmoveUnalignedDst/5-6       631MB/s ± 1%   632MB/s ± 0%    +0.25%  (p=0.000 n=20+17)
MemmoveUnalignedDst/6-6       759MB/s ± 0%   759MB/s ± 0%    +0.00%  (p=0.000 n=19+16)
MemmoveUnalignedDst/7-6       885MB/s ± 0%   883MB/s ± 1%       ~    (p=0.647 n=18+20)
MemmoveUnalignedDst/8-6      1.08GB/s ± 0%  1.08GB/s ± 0%    +0.00%  (p=0.035 n=19+18)
MemmoveUnalignedDst/9-6      1.22GB/s ± 0%  1.22GB/s ± 0%       ~    (p=0.251 n=18+17)
MemmoveUnalignedDst/10-6     1.35GB/s ± 0%  1.35GB/s ± 0%       ~    (p=0.327 n=17+18)
MemmoveUnalignedDst/11-6     1.49GB/s ± 0%  1.49GB/s ± 0%       ~    (p=0.531 n=18+19)
MemmoveUnalignedDst/12-6     1.63GB/s ± 0%  1.63GB/s ± 0%       ~    (p=0.886 n=19+18)
MemmoveUnalignedDst/13-6     1.76GB/s ± 0%  1.76GB/s ± 1%    -0.24%  (p=0.006 n=18+20)
MemmoveUnalignedDst/14-6     1.90GB/s ± 0%  1.90GB/s ± 0%       ~    (p=0.818 n=20+19)
MemmoveUnalignedDst/15-6     2.03GB/s ± 0%  2.03GB/s ± 0%       ~    (p=0.294 n=17+16)
MemmoveUnalignedDst/16-6     2.17GB/s ± 0%  2.17GB/s ± 0%       ~    (p=0.602 n=16+18)
MemmoveUnalignedDst/32-6     4.05GB/s ± 0%  4.05GB/s ± 0%    +0.00%  (p=0.010 n=18+17)
MemmoveUnalignedDst/64-6     7.59GB/s ± 0%  7.59GB/s ± 0%    +0.00%  (p=0.022 n=18+16)
MemmoveUnalignedDst/128-6    11.1GB/s ± 0%  11.4GB/s ± 0%    +2.79%  (p=0.000 n=18+17)
MemmoveUnalignedDst/256-6    16.4GB/s ± 0%  16.7GB/s ± 0%    +1.59%  (p=0.000 n=20+17)
MemmoveUnalignedDst/512-6    15.7GB/s ± 0%  21.3GB/s ± 0%   +35.87%  (p=0.000 n=18+20)
MemmoveUnalignedDst/1024-6   16.0GB/s ±20%  31.5GB/s ± 0%   +96.93%  (p=0.000 n=20+14)
MemmoveUnalignedDst/2048-6   19.6GB/s ± 0%  42.1GB/s ± 0%  +115.16%  (p=0.000 n=17+18)
MemmoveUnalignedDst/4096-6   6.41GB/s ± 0%  33.18GB/s ± 0%  +417.56%  (p=0.000 n=17+18)
MemmoveUnalignedSrc/1-6       171MB/s ± 0%   166MB/s ± 0%    -3.33%  (p=0.000 n=19+16)
MemmoveUnalignedSrc/2-6       343MB/s ± 0%   342MB/s ± 1%    -0.41%  (p=0.000 n=17+20)
MemmoveUnalignedSrc/3-6       508MB/s ± 0%   493MB/s ± 1%    -2.90%  (p=0.000 n=17+17)
MemmoveUnalignedSrc/4-6       677MB/s ± 0%   660MB/s ± 2%    -2.55%  (p=0.000 n=17+20)
MemmoveUnalignedSrc/5-6       790MB/s ± 0%   790MB/s ± 0%       ~    (p=0.139 n=17+17)
MemmoveUnalignedSrc/6-6       948MB/s ± 0%   946MB/s ± 1%       ~    (p=0.330 n=17+19)
MemmoveUnalignedSrc/7-6      1.11GB/s ± 0%  1.11GB/s ± 0%    -0.05%  (p=0.026 n=17+17)
MemmoveUnalignedSrc/8-6      1.38GB/s ± 0%  1.38GB/s ± 0%       ~    (p=0.091 n=18+16)
MemmoveUnalignedSrc/9-6      1.42GB/s ± 0%  1.40GB/s ± 1%    -1.04%  (p=0.000 n=19+20)
MemmoveUnalignedSrc/10-6     1.58GB/s ± 0%  1.56GB/s ± 1%    -1.15%  (p=0.000 n=18+19)
MemmoveUnalignedSrc/11-6     1.73GB/s ± 0%  1.71GB/s ± 1%    -1.30%  (p=0.000 n=20+20)
MemmoveUnalignedSrc/12-6     1.89GB/s ± 0%  1.87GB/s ± 1%    -1.18%  (p=0.000 n=17+20)
MemmoveUnalignedSrc/13-6     2.05GB/s ± 0%  2.02GB/s ± 1%    -1.18%  (p=0.000 n=17+20)
MemmoveUnalignedSrc/14-6     2.21GB/s ± 0%  2.18GB/s ± 1%    -1.14%  (p=0.000 n=17+20)
MemmoveUnalignedSrc/15-6     2.36GB/s ± 0%  2.34GB/s ± 1%    -1.04%  (p=0.000 n=17+20)
MemmoveUnalignedSrc/16-6     2.52GB/s ± 0%  2.49GB/s ± 1%    -1.26%  (p=0.000 n=19+20)
MemmoveUnalignedSrc/32-6     4.82GB/s ± 0%  4.61GB/s ± 0%    -4.40%  (p=0.000 n=19+20)
MemmoveUnalignedSrc/64-6     5.03GB/s ± 4%  7.97GB/s ± 0%   +58.55%  (p=0.000 n=20+16)
MemmoveUnalignedSrc/128-6    11.1GB/s ± 0%  11.2GB/s ± 0%    +0.52%  (p=0.000 n=17+18)
MemmoveUnalignedSrc/256-6    16.5GB/s ± 0%  16.4GB/s ± 0%    -0.10%  (p=0.000 n=20+18)
MemmoveUnalignedSrc/512-6    21.0GB/s ± 0%  22.1GB/s ± 0%    +5.48%  (p=0.000 n=14+17)
MemmoveUnalignedSrc/1024-6   24.9GB/s ± 0%  31.9GB/s ± 0%   +28.20%  (p=0.000 n=19+20)
MemmoveUnalignedSrc/2048-6   23.3GB/s ± 0%  33.8GB/s ± 0%   +45.22%  (p=0.000 n=17+19)
MemmoveUnalignedSrc/4096-6   37.3GB/s ± 0%  42.7GB/s ± 0%   +14.30%  (p=0.000 n=17+17)

Change-Id: Iab488d93a293cdf573ab5cd89b95a818bbb5d531
Reviewed-on: https://go-review.googlesource.com/22515
Run-TryBot: Denis Nagorny <denis.nagorny@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
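For context on how rows like Memmove/1024-6 turn into MB/s figures: this is the shape of output the Go testing framework produces when a benchmark calls b.SetBytes, compared across two builds with benchstat. A minimal, purely illustrative benchmark of that shape (not the runtime's actual benchmark code; names and sizes are made up) might look like:

	package copybench

	import (
		"fmt"
		"testing"
	)

	// BenchmarkMemmoveSizes reports copy throughput for a few sizes. With
	// b.SetBytes, the framework prints MB/s, and each size shows up as a row
	// like "BenchmarkMemmoveSizes/512-6" (the -6 suffix is GOMAXPROCS),
	// similar to the rows above. Sizes and buffer handling are illustrative.
	func BenchmarkMemmoveSizes(b *testing.B) {
		for _, size := range []int{1, 16, 512, 4096} {
			b.Run(fmt.Sprint(size), func(b *testing.B) {
				src := make([]byte, size)
				dst := make([]byte, size)
				b.SetBytes(int64(size)) // report throughput in MB/s
				for i := 0; i < b.N; i++ {
					copy(dst, src) // byte-slice copy goes through the runtime memmove
				}
			})
		}
	}

Running such a benchmark against the old and new toolchains and feeding both outputs to benchstat yields old speed / new speed / delta columns and p-values of the kind shown above.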
1 parent 04ade8e commit 3607c5f

File tree

4 files changed: +443 -1 lines


src/runtime/cpuflags_amd64.go

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
// Copyright 2015 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package runtime

var vendorStringBytes [12]byte
var maxInputValue uint32
var featureFlags uint32
var processorVersionInfo uint32

var useRepMovs bool

func hasFeature(feature uint32) bool {
	return (featureFlags & feature) != 0
}

func cpuid_low(arg1, arg2 uint32) (eax, ebx, ecx, edx uint32) // implemented in cpuidlow_amd64.s
func xgetbv_low(arg1 uint32) (eax, edx uint32)                // implemented in cpuidlow_amd64.s

func init() {
	const cfOSXSAVE uint32 = 1 << 27
	const cfAVX uint32 = 1 << 28

	leaf0()
	leaf1()

	enabledAVX := false
	// Let's check if OS has set CR4.OSXSAVE[bit 18]
	// to enable XGETBV instruction.
	if hasFeature(cfOSXSAVE) {
		eax, _ := xgetbv_low(0)
		// Let's check that XCR0[2:1] = ‘11b’
		// i.e. XMM state and YMM state are enabled by OS.
		enabledAVX = (eax & 0x6) == 0x6
	}

	isIntelBridgeFamily := (processorVersionInfo == 0x206A0 ||
		processorVersionInfo == 0x206D0 ||
		processorVersionInfo == 0x306A0 ||
		processorVersionInfo == 0x306E0) &&
		isIntel()

	useRepMovs = !(hasFeature(cfAVX) && enabledAVX) || isIntelBridgeFamily
}

func leaf0() {
	eax, ebx, ecx, edx := cpuid_low(0, 0)
	maxInputValue = eax
	int32ToBytes(ebx, vendorStringBytes[0:4])
	int32ToBytes(edx, vendorStringBytes[4:8])
	int32ToBytes(ecx, vendorStringBytes[8:12])
}

func leaf1() {
	if maxInputValue < 1 {
		return
	}
	eax, _, ecx, _ := cpuid_low(1, 0)
	// Let's remove stepping and reserved fields
	processorVersionInfo = eax & 0x0FFF3FF0
	featureFlags = ecx
}

func int32ToBytes(arg uint32, buffer []byte) {
	buffer[3] = byte(arg >> 24)
	buffer[2] = byte(arg >> 16)
	buffer[1] = byte(arg >> 8)
	buffer[0] = byte(arg)
}

func isIntel() bool {
	intelSignature := [12]byte{'G', 'e', 'n', 'u', 'i', 'n', 'e', 'I', 'n', 't', 'e', 'l'}
	return vendorStringBytes == intelSignature
}
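The processorVersionInfo values excluded above (0x206A0, 0x206D0, 0x306A0, 0x306E0) are masked CPUID leaf-1 signatures of Sandy Bridge and Ivy Bridge parts, which keep the REP MOVS path. As a rough illustration of how such a masked signature decodes into a display family/model under the Intel SDM field layout (the helper below is a hypothetical sketch, not part of the patch):

	package main

	import "fmt"

	// decodeSignature splits a masked CPUID leaf-1 EAX value (stepping and
	// reserved bits already cleared, as in cpuflags_amd64.go) into the
	// display family and display model defined by the Intel SDM.
	func decodeSignature(v uint32) (family, model uint32) {
		family = (v >> 8) & 0xF
		model = (v >> 4) & 0xF
		if family == 0xF {
			family += (v >> 20) & 0xFF // extended family
		}
		if family == 0x6 || family == 0xF {
			model += ((v >> 16) & 0xF) << 4 // extended model
		}
		return
	}

	func main() {
		// The values treated as "Intel Bridge family" in the patch.
		for _, v := range []uint32{0x206A0, 0x206D0, 0x306A0, 0x306E0} {
			f, m := decodeSignature(v)
			fmt.Printf("0x%05X -> family 0x%X, model 0x%02X\n", v, f, m)
		}
	}

Decoded this way, the four values come out as family 6, models 0x2A and 0x2D (Sandy Bridge) and 0x3A and 0x3E (Ivy Bridge).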

src/runtime/cpuidlow_amd64.s

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
// Copyright 2015 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// func cpuid_low(arg1, arg2 uint32) (eax, ebx, ecx, edx uint32)
TEXT ·cpuid_low(SB), 4, $0-24
	MOVL	arg1+0(FP), AX
	MOVL	arg2+4(FP), CX
	CPUID
	MOVL	AX, eax+8(FP)
	MOVL	BX, ebx+12(FP)
	MOVL	CX, ecx+16(FP)
	MOVL	DX, edx+20(FP)
	RET
// func xgetbv_low(arg1 uint32) (eax, edx uint32)
TEXT ·xgetbv_low(SB), 4, $0-16
	MOVL	arg1+0(FP), CX
	// XGETBV
	BYTE $0x0F; BYTE $0x01; BYTE $0xD0
	MOVL	AX, eax+8(FP)
	MOVL	DX, edx+12(FP)
	RET

src/runtime/memmove_amd64.s

Lines changed: 242 additions & 1 deletion
@@ -64,6 +64,9 @@ tail:
	JBE	move_129through256
	// TODO: use branch table and BSR to make this just a single dispatch

	TESTB	$1, runtime·useRepMovs(SB)
	JZ	avxUnaligned

	/*
	 * check and set for backwards
	 */
@@ -108,7 +111,6 @@ back:
	ADDQ	BX, CX
	CMPQ	CX, DI
	JLS	forward
	/*
	 * whole thing backwards has
	 * adjusted addresses
	 */
@@ -273,3 +275,242 @@ move_256through2048:
	LEAQ	256(DI), DI
	JGE	move_256through2048
	JMP	tail

avxUnaligned:
	// There are two implementations of the move algorithm.
	// The first one is for non-overlapping memory regions; it uses forward copying.
	// The second one is for overlapping regions; it uses backward copying.
	MOVQ	DI, CX
	SUBQ	SI, CX
	// Now CX contains the distance between SRC and DEST.
	CMPQ	CX, BX
	// If the distance is less than the region length, the regions overlap.
	JC	copy_backward

	// Non-temporal copy would be better for big sizes.
	CMPQ	BX, $0x100000
	JAE	gobble_big_data_fwd

	// Memory layout on the source side
	// SI                                       CX
	// |<---------BX before correction--------->|
	// |       |<--BX corrected-->|             |
	// |       |                  |<--- AX --->|
	// |<-R11->|                  |<-128 bytes->|
	// +----------------------------------------+
	// | Head  | Body             | Tail        |
	// +-------+------------------+-------------+
	// ^       ^                  ^
	// |       |                  |
	// Save head into Y4          Save tail into X5..X12
	//         |
	//         SI+R11, where R11 = ((DI & -32) + 32) - DI
	// Algorithm:
	// 1. Unaligned save of the tail's 128 bytes
	// 2. Unaligned save of the head's 32 bytes
	// 3. Destination-aligned copying of body (128 bytes per iteration)
	// 4. Put head on the new place
	// 5. Put the tail on the new place
	// It can be important to satisfy the processor's pipeline requirements for
	// small sizes, as the cost of copying the unaligned memory regions is
	// comparable with the cost of the main loop, so the code is slightly messy here.
	// There is a cleaner implementation of this algorithm for bigger sizes,
	// where the cost of copying the unaligned parts is negligible.
	// You can see it after the gobble_big_data_fwd label.
	LEAQ	(SI)(BX*1), CX
	MOVQ	DI, R10
	// CX points to the end of the buffer, so we need to go back slightly. We will use negative offsets here.
	MOVOU	-0x80(CX), X5
	MOVOU	-0x70(CX), X6
	MOVQ	$0x80, AX
	// Align destination address
	ANDQ	$-32, DI
	ADDQ	$32, DI
	// Continue tail saving.
	MOVOU	-0x60(CX), X7
	MOVOU	-0x50(CX), X8
	// Make R11 the delta between the aligned and unaligned destination addresses.
	MOVQ	DI, R11
	SUBQ	R10, R11
	// Continue tail saving.
	MOVOU	-0x40(CX), X9
	MOVOU	-0x30(CX), X10
	// Adjust the bytes-to-copy value, as the unaligned part is already prepared for copying.
	SUBQ	R11, BX
	// Continue tail saving.
	MOVOU	-0x20(CX), X11
	MOVOU	-0x10(CX), X12
	// The tail will be put in its place after the main body is copied.
	// It's time for the unaligned head part.
	VMOVDQU	(SI), Y4
	// Adjust source address to point past head.
	ADDQ	R11, SI
	SUBQ	AX, BX
	// Aligned memory copying here
gobble_128_loop:
	VMOVDQU	(SI), Y0
	VMOVDQU	0x20(SI), Y1
	VMOVDQU	0x40(SI), Y2
	VMOVDQU	0x60(SI), Y3
	ADDQ	AX, SI
	VMOVDQA	Y0, (DI)
	VMOVDQA	Y1, 0x20(DI)
	VMOVDQA	Y2, 0x40(DI)
	VMOVDQA	Y3, 0x60(DI)
	ADDQ	AX, DI
	SUBQ	AX, BX
	JA	gobble_128_loop
	// Now we can store the unaligned parts.
	ADDQ	AX, BX
	ADDQ	DI, BX
	VMOVDQU	Y4, (R10)
	VZEROUPPER
	MOVOU	X5, -0x80(BX)
	MOVOU	X6, -0x70(BX)
	MOVOU	X7, -0x60(BX)
	MOVOU	X8, -0x50(BX)
	MOVOU	X9, -0x40(BX)
	MOVOU	X10, -0x30(BX)
	MOVOU	X11, -0x20(BX)
	MOVOU	X12, -0x10(BX)
	RET

gobble_big_data_fwd:
	// Forward copying for big regions.
	// It uses non-temporal move instructions.
	// Details of this algorithm are commented above for small sizes.
	LEAQ	(SI)(BX*1), CX
	MOVOU	-0x80(SI)(BX*1), X5
	MOVOU	-0x70(CX), X6
	MOVOU	-0x60(CX), X7
	MOVOU	-0x50(CX), X8
	MOVOU	-0x40(CX), X9
	MOVOU	-0x30(CX), X10
	MOVOU	-0x20(CX), X11
	MOVOU	-0x10(CX), X12
	VMOVDQU	(SI), Y4
	MOVQ	DI, R8
	ANDQ	$-32, DI
	ADDQ	$32, DI
	MOVQ	DI, R10
	SUBQ	R8, R10
	SUBQ	R10, BX
	ADDQ	R10, SI
	LEAQ	(DI)(BX*1), CX
	SUBQ	$0x80, BX
gobble_mem_fwd_loop:
	PREFETCHNTA 0x1C0(SI)
	PREFETCHNTA 0x280(SI)
	// Prefetch values were chosen empirically.
	// Approach for prefetch usage as in 7.6.6 of [1]
	// [1] 64-ia-32-architectures-optimization-manual.pdf
	// http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
	VMOVDQU	(SI), Y0
	VMOVDQU	0x20(SI), Y1
	VMOVDQU	0x40(SI), Y2
	VMOVDQU	0x60(SI), Y3
	ADDQ	$0x80, SI
	VMOVNTDQ	Y0, (DI)
	VMOVNTDQ	Y1, 0x20(DI)
	VMOVNTDQ	Y2, 0x40(DI)
	VMOVNTDQ	Y3, 0x60(DI)
	ADDQ	$0x80, DI
	SUBQ	$0x80, BX
	JA	gobble_mem_fwd_loop
	// NT instructions don't follow the normal cache-coherency rules.
	// We need SFENCE here to make the copied data available in a timely manner.
	SFENCE
	VMOVDQU	Y4, (R8)
	VZEROUPPER
	MOVOU	X5, -0x80(CX)
	MOVOU	X6, -0x70(CX)
	MOVOU	X7, -0x60(CX)
	MOVOU	X8, -0x50(CX)
	MOVOU	X9, -0x40(CX)
	MOVOU	X10, -0x30(CX)
	MOVOU	X11, -0x20(CX)
	MOVOU	X12, -0x10(CX)
	RET

copy_backward:
	MOVQ	DI, AX
	// Backward copying is about the same as the forward one.
	// First we load the unaligned tail from the beginning of the region.
	MOVOU	(SI), X5
	MOVOU	0x10(SI), X6
	ADDQ	BX, DI
	MOVOU	0x20(SI), X7
	MOVOU	0x30(SI), X8
	LEAQ	-0x20(DI), R10
	MOVQ	DI, R11
	MOVOU	0x40(SI), X9
	MOVOU	0x50(SI), X10
	ANDQ	$0x1F, R11
	MOVOU	0x60(SI), X11
	MOVOU	0x70(SI), X12
	XORQ	R11, DI
	// Let's point SI to the end of the region
	ADDQ	BX, SI
	// and load the unaligned head into Y4.
	VMOVDQU	-0x20(SI), Y4
	SUBQ	R11, SI
	SUBQ	R11, BX
	// If there is enough data for non-temporal moves, go to the special loop.
	CMPQ	BX, $0x100000
	JA	gobble_big_data_bwd
	SUBQ	$0x80, BX
gobble_mem_bwd_loop:
	VMOVDQU	-0x20(SI), Y0
	VMOVDQU	-0x40(SI), Y1
	VMOVDQU	-0x60(SI), Y2
	VMOVDQU	-0x80(SI), Y3
	SUBQ	$0x80, SI
	VMOVDQA	Y0, -0x20(DI)
	VMOVDQA	Y1, -0x40(DI)
	VMOVDQA	Y2, -0x60(DI)
	VMOVDQA	Y3, -0x80(DI)
	SUBQ	$0x80, DI
	SUBQ	$0x80, BX
	JA	gobble_mem_bwd_loop
	// Let's store the unaligned data
	VMOVDQU	Y4, (R10)
	VZEROUPPER
	MOVOU	X5, (AX)
	MOVOU	X6, 0x10(AX)
	MOVOU	X7, 0x20(AX)
	MOVOU	X8, 0x30(AX)
	MOVOU	X9, 0x40(AX)
	MOVOU	X10, 0x50(AX)
	MOVOU	X11, 0x60(AX)
	MOVOU	X12, 0x70(AX)
	RET

gobble_big_data_bwd:
	SUBQ	$0x80, BX
gobble_big_mem_bwd_loop:
	PREFETCHNTA -0x1C0(SI)
	PREFETCHNTA -0x280(SI)
	VMOVDQU	-0x20(SI), Y0
	VMOVDQU	-0x40(SI), Y1
	VMOVDQU	-0x60(SI), Y2
	VMOVDQU	-0x80(SI), Y3
	SUBQ	$0x80, SI
	VMOVNTDQ	Y0, -0x20(DI)
	VMOVNTDQ	Y1, -0x40(DI)
	VMOVNTDQ	Y2, -0x60(DI)
	VMOVNTDQ	Y3, -0x80(DI)
	SUBQ	$0x80, DI
	SUBQ	$0x80, BX
	JA	gobble_big_mem_bwd_loop
	SFENCE
	VMOVDQU	Y4, (R10)
	VZEROUPPER
	MOVOU	X5, (AX)
	MOVOU	X6, 0x10(AX)
	MOVOU	X7, 0x20(AX)
	MOVOU	X8, 0x30(AX)
	MOVOU	X9, 0x40(AX)
	MOVOU	X10, 0x50(AX)
	MOVOU	X11, 0x60(AX)
	MOVOU	X12, 0x70(AX)
	RET
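To make the structure of the AVX forward path easier to follow, here is a rough Go rendering of the same head/body/tail idea: the overlap test corresponds to the unsigned DI-SI vs. BX comparison above, and the saved 128-byte tail covers whatever the 128-byte body loop leaves uncopied near the end. This sketch is purely illustrative; it omits the destination-alignment trick and the non-temporal (VMOVNTDQ plus SFENCE) path the assembly switches to at 1 MiB (0x100000).

	package main

	import (
		"bytes"
		"fmt"
	)

	// forwardCopy mimics the control flow of the avxUnaligned forward path for
	// non-overlapping copies of at least 128 bytes: save an unaligned 32-byte
	// head and 128-byte tail, copy the body in 128-byte chunks, then store the
	// saved head and tail. None of the alignment or non-temporal tricks that
	// make the real assembly fast are reproduced here.
	func forwardCopy(dst, src []byte) {
		n := len(src)
		if n < 128 {
			copy(dst, src) // small sizes take a different path in the real code
			return
		}
		var head [32]byte
		var tail [128]byte
		copy(head[:], src[:32])    // 1. save the unaligned head
		copy(tail[:], src[n-128:]) // 2. save the unaligned tail
		// 3. copy the body in 128-byte chunks; anything the loop leaves
		//    uncovered near the end lies within the saved 128-byte tail.
		for i := 32; i < n-128; i += 128 {
			copy(dst[i:i+128], src[i:i+128])
		}
		copy(dst[:32], head[:])    // 4. put the head in place
		copy(dst[n-128:], tail[:]) // 5. put the tail in place
	}

	func main() {
		src := make([]byte, 300)
		for i := range src {
			src[i] = byte(i)
		}
		dst := make([]byte, 300)
		forwardCopy(dst, src)
		fmt.Println(bytes.Equal(dst, src)) // true
	}

Saving the head and tail up front is what lets the body loop run on destination-aligned 128-byte chunks without worrying about partial chunks at either end.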
