Skip to content

[NVPTX] Add tcgen05 alloc/dealloc intrinsics #124961

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 5 additions & 4 deletions clang/lib/Basic/Targets/NVPTX.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -62,12 +62,13 @@ NVPTXTargetInfo::NVPTXTargetInfo(const llvm::Triple &Triple,
HasFloat16 = true;

if (TargetPointerWidth == 32)
resetDataLayout("e-p:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64");
else if (Opts.NVPTXUseShortPointers)
resetDataLayout(
"e-p3:32:32-p4:32:32-p5:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64");
"e-p:32:32-p6:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64");
else if (Opts.NVPTXUseShortPointers)
resetDataLayout("e-p3:32:32-p4:32:32-p5:32:32-p6:32:32-i64:64-i128:128-v16:"
"16-v32:32-n16:32:64");
else
resetDataLayout("e-i64:64-i128:128-v16:16-v32:32-n16:32:64");
resetDataLayout("e-p6:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64");

// If possible, get a TargetInfo for our host triple, so we can match its
// types.
Expand Down
4 changes: 2 additions & 2 deletions clang/test/CodeGen/target-data.c
Original file line number Diff line number Diff line change
Expand Up @@ -160,11 +160,11 @@

// RUN: %clang_cc1 -triple nvptx-unknown -o - -emit-llvm %s | \
// RUN: FileCheck %s -check-prefix=NVPTX
// NVPTX: target datalayout = "e-p:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64"
// NVPTX: target datalayout = "e-p:32:32-p6:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64"

// RUN: %clang_cc1 -triple nvptx64-unknown -o - -emit-llvm %s | \
// RUN: FileCheck %s -check-prefix=NVPTX64
// NVPTX64: target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
// NVPTX64: target datalayout = "e-p6:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64"

// RUN: %clang_cc1 -triple r600-unknown -o - -emit-llvm %s | \
// RUN: FileCheck %s -check-prefix=R600
Expand Down
98 changes: 98 additions & 0 deletions llvm/docs/NVPTXUsage.rst
Original file line number Diff line number Diff line change
Expand Up @@ -962,6 +962,104 @@ The ``griddepcontrol`` intrinsics allows the dependent grids and prerequisite gr
For more information, refer
`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-griddepcontrol>`__.

TCGEN05 family of Intrinsics
----------------------------

The llvm.nvvm.tcgen05.* intrinsics model the TCGEN05 family of instructions
exposed by PTX. These intrinsics use 'Tensor Memory' (henceforth ``tmem``).
NVPTX represents this memory using ``addrspace(6)`` and is always 32-bits.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory>`_.

The tensor-memory pointers may only be used with the tcgen05 intrinsics.
There are specialized load/store instructions provided (tcgen05.ld/st) to
work with tensor-memory.

See the PTX ISA for more information on tensor-memory load/store instructions
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-and-register-load-store-instructions>`_.

'``llvm.nvvm.tcgen05.alloc``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.tcgen05.alloc.cg1(ptr %dst, i32 %ncols)
declare void @llvm.nvvm.tcgen05.alloc.cg2(ptr %dst, i32 %ncols)
declare void @llvm.nvvm.tcgen05.alloc.shared.cg1(ptr addrspace(3) %dst, i32 %ncols)
declare void @llvm.nvvm.tcgen05.alloc.shared.cg2(ptr addrspace(3) %dst, i32 %ncols)

Overview:
"""""""""

The '``@llvm.nvvm.tcgen05.alloc.*``' intrinsics correspond to the
``tcgen05.alloc.cta_group*.sync.aligned.b32`` family of PTX instructions.
The ``tcgen05.alloc`` is a potentially blocking instruction which dynamically
allocates the specified number of columns in the Tensor Memory and writes
the address of the allocated Tensor Memory into shared memory at the
location specified by ``%dst``. The 32-bit operand ``%ncols`` specifies
the number of columns to be allocated and it must be a power-of-two.
The ``.shared`` variant explicitly uses shared memory address space for
the ``%dst`` operand. The ``.cg1`` and ``.cg2`` variants generate
``cta_group::1`` and ``cta_group::2`` variants of the instruction respectively.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions>`_.

'``llvm.nvvm.tcgen05.dealloc``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.tcgen05.dealloc.cg1(ptr addrspace(6) %tmem_addr, i32 %ncols)
declare void @llvm.nvvm.tcgen05.dealloc.cg2(ptr addrspace(6) %tmem_addr, i32 %ncols)

Overview:
"""""""""

The '``@llvm.nvvm.tcgen05.dealloc.*``' intrinsics correspond to the
``tcgen05.dealloc.*`` set of PTX instructions. The ``tcgen05.dealloc``
instructions deallocates the Tensor Memory specified by the Tensor Memory
address ``%tmem_addr``. The operand ``%tmem_addr`` must point to a previous
Tensor Memory allocation. The 32-bit operand ``%ncols`` specifies the number
of columns to be de-allocated. The ``.cg1`` and ``.cg2`` variants generate
``cta_group::1`` and ``cta_group::2`` variants of the instruction respectively.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions>`_.

'``llvm.nvvm.tcgen05.relinq.alloc.permit``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.tcgen05.relinq.alloc.permit.cg1()
declare void @llvm.nvvm.tcgen05.relinq.alloc.permit.cg2()

Overview:
"""""""""

The '``@llvm.nvvm.tcgen05.relinq.alloc.permit.*``' intrinsics correspond
to the ``tcgen05.relinquish_alloc_permit.*`` set of PTX instructions.
This instruction specifies that the CTA of the executing thread is
relinquishing the right to allocate Tensor Memory. So, it is illegal
for a CTA to perform ``tcgen05.alloc`` after any of its constituent
threads execute ``tcgen05.relinquish_alloc_permit``. The ``.cg1``
and ``.cg2`` variants generate ``cta_group::1`` and ``cta_group::2``
flavors of the instruction respectively.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions>`_.

Other Intrinsics
----------------

Expand Down
30 changes: 30 additions & 0 deletions llvm/include/llvm/IR/IntrinsicsNVVM.td
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@

def llvm_global_ptr_ty : LLVMQualPointerType<1>; // (global)ptr
def llvm_shared_ptr_ty : LLVMQualPointerType<3>; // (shared)ptr
def llvm_tmem_ptr_ty : LLVMQualPointerType<6>; // (tensor memory)ptr

//
// MISC
Expand Down Expand Up @@ -5055,4 +5056,33 @@ def int_nvvm_cp_async_bulk_prefetch_L2
def int_nvvm_griddepcontrol_launch_dependents: Intrinsic<[], [], [IntrNoMem, IntrHasSideEffects]>;
def int_nvvm_griddepcontrol_wait: Intrinsic<[], [], [IntrNoMem, IntrHasSideEffects]>;

//
// Tcgen05 family of Intrinsics
//

// Tcgen05 alloc/dealloc related intrinsics

foreach cta_group = ["cg1", "cg2"] in {
def int_nvvm_tcgen05_alloc_ # cta_group : Intrinsic<[],
[llvm_ptr_ty, // dst_ptr
llvm_i32_ty] , // num_columns
[IntrConvergent, IntrInaccessibleMemOrArgMemOnly,
WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>]>;

def int_nvvm_tcgen05_alloc_shared_ # cta_group : Intrinsic<[],
[llvm_shared_ptr_ty, // dst_ptr
llvm_i32_ty], // num_columns
[IntrConvergent, IntrInaccessibleMemOrArgMemOnly,
WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>]>;

def int_nvvm_tcgen05_dealloc_ # cta_group : Intrinsic<[],
[llvm_tmem_ptr_ty, // tmem_addr
llvm_i32_ty], // num_columns
[IntrConvergent, IntrArgMemOnly,
NoCapture<ArgIndex<0>>]>;

def int_nvvm_tcgen05_relinq_alloc_permit_ # cta_group : Intrinsic<[], [],
[IntrConvergent, IntrInaccessibleMemOnly]>;
}

} // let TargetPrefix = "nvvm"
1 change: 1 addition & 0 deletions llvm/include/llvm/Support/NVPTXAddrSpace.h
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ enum AddressSpace : unsigned {
ADDRESS_SPACE_SHARED = 3,
ADDRESS_SPACE_CONST = 4,
ADDRESS_SPACE_LOCAL = 5,
ADDRESS_SPACE_TENSOR = 6,

ADDRESS_SPACE_PARAM = 101,
};
Expand Down
1 change: 1 addition & 0 deletions llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
Original file line number Diff line number Diff line change
Expand Up @@ -163,6 +163,7 @@ def do_SQRTF32_RN : Predicate<"usePrecSqrtF32()">;
def hasHWROT32 : Predicate<"Subtarget->hasHWROT32()">;
def noHWROT32 : Predicate<"!Subtarget->hasHWROT32()">;
def hasDotInstructions : Predicate<"Subtarget->hasDotInstructions()">;
def hasTcgen05Instructions : Predicate<"Subtarget->hasTcgen05Instructions()">;

def True : Predicate<"true">;
def False : Predicate<"false">;
Expand Down
41 changes: 41 additions & 0 deletions llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
Original file line number Diff line number Diff line change
Expand Up @@ -7582,3 +7582,44 @@ def GRIDDEPCONTROL_WAIT :
Requires<[hasSM<90>, hasPTX<78>]>;

def INT_EXIT : NVPTXInst<(outs), (ins), "exit;", [(int_nvvm_exit)]>;

// Tcgen05 intrinsics
let isConvergent = true in {

multiclass TCGEN05_ALLOC_INTR<NVPTXRegClass rc, string AS, string num, Intrinsic Intr> {
def NAME : NVPTXInst<(outs),
(ins rc:$dst, Int32Regs:$ncols),
!strconcat("tcgen05.alloc.cta_group::", num, ".sync.aligned", AS, ".b32 [$dst], $ncols;"),
[(Intr rc:$dst, Int32Regs:$ncols)]>,
Requires<[hasTcgen05Instructions]>;
}

defm TCGEN05_ALLOC_CG1 : TCGEN05_ALLOC_INTR<Int64Regs, "", "1", int_nvvm_tcgen05_alloc_cg1>;
defm TCGEN05_ALLOC_CG2 : TCGEN05_ALLOC_INTR<Int64Regs, "", "2", int_nvvm_tcgen05_alloc_cg2>;

defm TCGEN05_ALLOC_S64_CG1 : TCGEN05_ALLOC_INTR<Int64Regs, ".shared::cta", "1", int_nvvm_tcgen05_alloc_shared_cg1>;
defm TCGEN05_ALLOC_S64_CG2 : TCGEN05_ALLOC_INTR<Int64Regs, ".shared::cta", "2", int_nvvm_tcgen05_alloc_shared_cg2>;

defm TCGEN05_ALLOC_S32_CG1 : TCGEN05_ALLOC_INTR<Int32Regs, ".shared::cta", "1", int_nvvm_tcgen05_alloc_shared_cg1>;
defm TCGEN05_ALLOC_S32_CG2 : TCGEN05_ALLOC_INTR<Int32Regs, ".shared::cta", "2", int_nvvm_tcgen05_alloc_shared_cg2>;

multiclass TCGEN05_DEALLOC_INTR<string num, Intrinsic Intr> {
def NAME : NVPTXInst<(outs),
(ins Int32Regs:$tmem_addr, Int32Regs:$ncols),
!strconcat("tcgen05.dealloc.cta_group::", num, ".sync.aligned.b32 $tmem_addr, $ncols;"),
[(Intr Int32Regs:$tmem_addr, Int32Regs:$ncols)]>,
Requires<[hasTcgen05Instructions]>;
}
defm TCGEN05_DEALLOC_CG1: TCGEN05_DEALLOC_INTR<"1", int_nvvm_tcgen05_dealloc_cg1>;
defm TCGEN05_DEALLOC_CG2: TCGEN05_DEALLOC_INTR<"2", int_nvvm_tcgen05_dealloc_cg2>;

multiclass TCGEN05_RELINQ_PERMIT_INTR<string num, Intrinsic Intr> {
def NAME : NVPTXInst<(outs), (ins),
!strconcat("tcgen05.relinquish_alloc_permit.cta_group::", num, ".sync.aligned;"),
[(Intr)]>,
Requires<[hasTcgen05Instructions]>;
}
defm TCGEN05_RELINQ_CG1: TCGEN05_RELINQ_PERMIT_INTR<"1", int_nvvm_tcgen05_relinq_alloc_permit_cg1>;
defm TCGEN05_RELINQ_CG2: TCGEN05_RELINQ_PERMIT_INTR<"2", int_nvvm_tcgen05_relinq_alloc_permit_cg2>;

} // isConvergent
15 changes: 15 additions & 0 deletions llvm/lib/Target/NVPTX/NVPTXSubtarget.h
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,21 @@ class NVPTXSubtarget : public NVPTXGenSubtargetInfo {
bool hasDotInstructions() const {
return SmVersion >= 61 && PTXVersion >= 50;
}
// Tcgen05 instructions in Blackwell family
bool hasTcgen05Instructions() const {
bool HasTcgen05 = false;
switch (FullSmVersion) {
default:
break;
case 1001: // sm_100a
case 1011: // sm_101a
HasTcgen05 = true;
break;
}

return HasTcgen05 && PTXVersion >= 86;
}

// Prior to CUDA 12.3 ptxas did not recognize that the trap instruction
// terminates a basic block. Instead, it would assume that control flow
// continued to the next instruction. The next instruction could be in the
Expand Down
3 changes: 3 additions & 0 deletions llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -140,6 +140,9 @@ static std::string computeDataLayout(bool is64Bit, bool UseShortPointers) {
else if (UseShortPointers)
Ret += "-p3:32:32-p4:32:32-p5:32:32";

// Tensor Memory (addrspace:6) is always 32-bits.
Ret += "-p6:32:32";

Ret += "-i64:64-i128:128-v16:16-v32:32-n16:32:64";

return Ret;
Expand Down
Loading