Description
The caching added in 3733ed6 by @MaskRay seems to have broken LTO combined with --gc-sections for certain use cases. Specifically, the change in lld/ELF/MarkLive.cpp to use the cached isExported value instead of calling includeInDynsym seems to cause additional sections to be dropped that should not be (or at least were not before the change); see 3733ed6#diff-3c88c62d912008cc04f796b330a035ecda925645264eaef43185ad43991cb8e9L224.
The AMDGPU target inserts special kernel descriptor object symbols that must be preserved in the final ELF for the runtime to load it. Each one is named after an exported kernel with a .kd suffix and is emitted by AMDGPUTargetELFStreamer::EmitAmdhsaKernelDescriptor. Prior to the referenced commit these symbols were present in the output; after it they are not.
Reverting the mentioned line in MarkLive.cpp restores the original behavior. I'm not familiar with the codebase, but I suspect isExported is either not initialized or not safe to cache at that location.
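To make the suspicion concrete, here is a minimal, self-contained sketch of how a cached flag can go stale when a symbol appears after the cache was filled. It is not lld's code: Sym, includeInDynsymSketch, computeExported, countRoots and demo are hypothetical names, and the ordering it models (cache filled first, descriptor symbol added afterwards) is an assumption, not something established in this report.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a linker symbol; this is not lld's Symbol class.
struct Sym {
  std::string name;
  bool exportable; // stand-in for "visibility and flags allow export"
  bool isExported; // cached answer, valid only after computeExported()
};

// Stand-in for an on-demand check in the spirit of includeInDynsym.
bool includeInDynsymSketch(const Sym &s) { return s.exportable; }

// Fills the cache for every symbol that exists at the time of the call.
void computeExported(std::vector<Sym> &syms) {
  for (Sym &s : syms)
    s.isExported = includeInDynsymSketch(s);
}

// Counts the symbols a mark-live pass would treat as GC roots.
std::size_t countRoots(const std::vector<Sym> &syms, bool useCache) {
  std::size_t roots = 0;
  for (const Sym &s : syms)
    if (useCache ? s.isExported : includeInDynsymSketch(s))
      ++roots;
  return roots;
}

struct RootCounts {
  std::size_t cached;
  std::size_t onDemand;
};

RootCounts demo() {
  std::vector<Sym> syms = {Sym{"some_kernel", true, false}};
  computeExported(syms);
  // Assumed ordering: this symbol is created after the cache was filled,
  // so its isExported flag was never computed.
  syms.push_back(Sym{"some_kernel.kd", true, false});
  return {countRoots(syms, /*useCache=*/true),
          countRoots(syms, /*useCache=*/false)};
}
```

In this toy model the cached path sees one root and the on-demand path sees two, which would match the symptom of the section holding some_kernel.kd being collected. Whether lld actually orders its passes this way needs to be confirmed by someone who knows the code.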
The following repro shows the issue (lld_lto_bug.c):
[[clang::amdgpu_kernel, gnu::visibility("protected")]] void some_kernel(int n) {
//
}
compiled using
$ clang \
-x c -std=c23 \
-target amdgcn-amd-amdhsa -march=gfx1100 \
-nogpulib \
-fgpu-rdc \
-fno-ident \
-fvisibility=hidden \
-O3 \
lld_lto_bug.c \
-c -emit-llvm -o lld_lto_bug.bc
or, since .bc files cannot be attached, the equivalent textual IR:
; ModuleID = 'lld_lto_bug.bc'
source_filename = "lld_lto_bug.c"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-p9:192:256:256:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9"
target triple = "amdgcn-amd-amdhsa"
@__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 500
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define protected amdgpu_kernel void @some_kernel(i32 noundef %n) local_unnamed_addr #0 {
entry:
ret void
}
attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx1100" "target-features"="+16-bit-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot10-insts,+dot12-insts,+dot5-insts,+dot7-insts,+dot8-insts,+dot9-insts,+dpp,+gfx10-3-insts,+gfx10-insts,+gfx11-insts,+gfx8-insts,+gfx9-insts,+wavefrontsize32" "uniform-work-group-size"="false" }
!llvm.module.flags = !{!0, !1, !2}
!0 = !{i32 1, !"amdhsa_code_object_version", i32 500}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 8, !"PIC Level", i32 2}
Linking with LTO and gc-sections:
lld \
-flavor gnu \
-m elf64_amdgpu \
-shared \
-plugin-opt=mcpu=gfx1100 \
-plugin-opt=O3 \
--lto-CGO3 \
--gc-sections \
--print-gc-sections \
--strip-debug \
--discard-all \
--discard-locals \
-o lld_lto_bug.so \
lld_lto_bug.bc
Before the commit, this prints the expected output (.rodata is not removed):
removing unused section lld_lto_bug_patched.so.lto.o:(.text)
After the commit, the regression shows up as .rodata also being removed:
removing unused section lld_lto_bug.so.lto.o:(.text)
removing unused section lld_lto_bug.so.lto.o:(.rodata)
This can be verified with llvm-readelf. Before:
Symbol table '.dynsym' contains 3 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000001500 4 FUNC GLOBAL PROTECTED 7 some_kernel
2: 0000000000000480 64 OBJECT GLOBAL PROTECTED 6 some_kernel.kd
The some_kernel.kd OBJECT is what is required at runtime to use the ELF.
And after:
Symbol table '.dynsym' contains 2 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000001500 4 FUNC GLOBAL PROTECTED 6 some_kernel
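For anyone who wants to check other builds for the same symptom, the comparison of the two tables above can be automated. The following is a rough sketch that scans llvm-readelf style .dynsym text and reports every FUNC symbol that has no matching OBJECT symbol with a .kd suffix. The helper missingDescriptors and the constants kBefore and kAfter are hypothetical names. It also assumes that every exported FUNC is a kernel, which holds for this repro but not in general.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// The two .dynsym tables from this report, used as sample input.
const char *kBefore = R"(Symbol table '.dynsym' contains 3 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000001500 4 FUNC GLOBAL PROTECTED 7 some_kernel
2: 0000000000000480 64 OBJECT GLOBAL PROTECTED 6 some_kernel.kd
)";

const char *kAfter = R"(Symbol table '.dynsym' contains 2 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000001500 4 FUNC GLOBAL PROTECTED 6 some_kernel
)";

// Returns the names of FUNC symbols that lack a "<name>.kd" OBJECT symbol.
// Assumption: every FUNC in .dynsym is a kernel that needs a descriptor.
std::vector<std::string> missingDescriptors(const std::string &dynsymText) {
  std::set<std::string> funcs, objects;
  std::istringstream lines(dynsymText);
  std::string line;
  while (std::getline(lines, line)) {
    // Split the row on whitespace: Num Value Size Type Bind Vis Ndx Name.
    std::istringstream fields(line);
    std::vector<std::string> cols;
    for (std::string f; fields >> f;)
      cols.push_back(f);
    if (cols.size() != 8)
      continue; // banner line or a row without a name
    if (cols[3] == "FUNC")
      funcs.insert(cols[7]);
    else if (cols[3] == "OBJECT")
      objects.insert(cols[7]);
  }
  std::vector<std::string> missing;
  for (const std::string &f : funcs)
    if (objects.count(f + ".kd") == 0)
      missing.push_back(f);
  return missing;
}
```

With the sample input, missingDescriptors(kBefore) is empty and missingDescriptors(kAfter) contains some_kernel.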