Commit 65bb9ef
Implements accumulation functions in dpctl.tensor (#1602)
* Use `shT` instead of `std::vector<py::ssize_t>` in `repeat`
* Add missing host task to `host_tasks_list` in _reduction.py
* Implements `dpt.cumulative_logsumexp`, `dpt.cumulative_prod`, and `dpt.cumulative_sum`. The Python bindings for these functions are implemented in a new submodule `_tensor_accumulation_impl`
* Adds the first tests for `dpt.cumulative_sum`
* Pass host task vector to accumulator kernel calls. This resolves hangs in unique functions
* Implements `out` keyword for accumulators
* Fixes cumulative_logsumexp when both an intermediate input and result temporary are needed
* Only permute dims of allocated outputs if the accumulated axis is not the trailing axis. Fixes a bug where in some cases output axes were not being permuted
* Enable scalar inputs to accumulation functions
* Adds test for scalar inputs to cumulative_sum
* Adds docstrings for cumulative_sum, cumulative_prod, and cumulative_logsumexp
* Removed redundant dtype kind check in _default_accumulation_dtype
* Reduce repetition of code allocating the out array in _accumulate_common
* Adds tests for accumulation function identities and the `include_initial` keyword
* Adds more tests for cumulative_sum
* Correct typo in kernels/accumulators.hpp: constexpr nwiT variables rather than nwiT constexpr variables
* Increase work per work item in inclusive_scan_iter_1d update step
* Removes a dead branch from _accumulate_common. As `out` and the input would have to have the same data type to overlap, the second branch is never reached if `out` is the same array as the input
* More accumulator tests
* Removes dead branch from _accumulators.py. A second out temporary does not need to be made in either branch when input and requested dtype are not implemented, as temporaries are always made. Also removes part of a test intended to reach this branch
* Adds tests for `cumulative_prod` and `cumulative_logsumexp`. Also fixes incorrect TypeError in _accumulation.py
* Widen acceptable results of test_logcumsumexp
* Use np.logaddexp.accumulate in hopes of better numerical accuracy of the expected result for cumulative_logsumexp
* Attempt to improve cumulative_logsumexp testing by computing a running logsumexp of the test array
* Reduce size of array in test_logcumsumexp_basic
* Use const qualifiers to make the compiler's job easier. Indexers are made const, and integral variables in kernels are made const too. Make two-offset instances const references to avoid copying. Got rid of unused get_src_const_ptr methods in stack_t structs. Replaced auto with size_t as appropriate. Added const to make compiler analysis easier (and faster)
* Add test for cumulative_logsumexp of geometric series summation, testing against the closed form
* Fix race condition in `custom_inclusive_scan_over_group`. By returning data from `local_mem_acc` after the group barrier, a race condition follows if that memory is later overwritten; this was especially obvious on CPU. Now the value is stored in a variable before the barrier and then returned
* Remove use of NumPy functions from test_tensor_accumulation and increase size of test_logcumsumexp_basic
* Need barrier after call to custom inclusive scan to avoid race condition (#1624). Added comments explaining why barriers are needed
* Docstring edits. Add an empty line after list items to make Sphinx happy

---------

Co-authored-by: Oleksandr Pavlyk <oleksandr.pavlyk@intel.com>
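The semantics these accumulators implement (an inclusive scan, with the `include_initial` keyword prepending the operation's identity to the result) can be pinned down with a pure-Python reference sketch. The names `ref_cumulative_sum` and `ref_cumulative_logsumexp` are hypothetical, and this serial code only illustrates the expected results; the actual dpctl kernels are parallel SYCL scans:

```python
import math


def ref_cumulative_sum(xs, include_initial=False):
    """Serial reference for an inclusive scan with addition.

    With include_initial=True the identity (0) is prepended,
    so the output has one more element than the input.
    """
    out = []
    acc = 0  # identity of addition
    if include_initial:
        out.append(acc)
    for x in xs:
        acc += x
        out.append(acc)
    return out


def ref_cumulative_logsumexp(xs, include_initial=False):
    """Running log(exp(x0) + ... + exp(xk)), computed stably.

    Each step is a logaddexp, which avoids overflow by factoring
    out the larger operand before exponentiating.
    """
    out = []
    acc = -math.inf  # identity: log(0)
    if include_initial:
        out.append(acc)
    for x in xs:
        hi, lo = max(acc, x), min(acc, x)
        acc = hi + math.log1p(math.exp(lo - hi))
        out.append(acc)
    return out
```

Writing each logsumexp step as `hi + log1p(exp(lo - hi))` mirrors why a naive `log(sum(exp(x)))` is avoided in the tests: the exponentials overflow long before the running logarithm does.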
1 parent 57495af commit 65bb9ef

19 files changed: +3405 −171 lines

dpctl/tensor/CMakeLists.txt

Lines changed: 17 additions & 0 deletions
@@ -158,6 +158,17 @@ set(_tensor_linalg_impl_sources
     ${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/simplify_iteration_space.cpp
     ${_linalg_sources}
 )
+set(_accumulator_sources
+    ${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/accumulators/accumulators_common.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/accumulators/cumulative_logsumexp.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/accumulators/cumulative_prod.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/accumulators/cumulative_sum.cpp
+)
+set(_tensor_accumulation_impl_sources
+    ${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/tensor_accumulation.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/simplify_iteration_space.cpp
+    ${_accumulator_sources}
+)

 set(_py_trgts)

@@ -186,6 +197,11 @@ pybind11_add_module(${python_module_name} MODULE ${_tensor_linalg_impl_sources})
 add_sycl_to_target(TARGET ${python_module_name} SOURCES ${_tensor_linalg_impl_sources})
 list(APPEND _py_trgts ${python_module_name})

+set(python_module_name _tensor_accumulation_impl)
+pybind11_add_module(${python_module_name} MODULE ${_tensor_accumulation_impl_sources})
+add_sycl_to_target(TARGET ${python_module_name} SOURCES ${_tensor_accumulation_impl_sources})
+list(APPEND _py_trgts ${python_module_name})
+
 set(_clang_prefix "")
 if (WIN32)
     set(_clang_prefix "/clang:")

@@ -203,6 +219,7 @@ list(APPEND _no_fast_math_sources
     ${_reduction_sources}
     ${_sorting_sources}
     ${_linalg_sources}
+    ${_accumulator_sources}
 )

 foreach(_src_fn ${_no_fast_math_sources})
dpctl/tensor/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -96,6 +96,7 @@
 from dpctl.tensor._usmarray import usm_ndarray
 from dpctl.tensor._utility_functions import all, any

+from ._accumulation import cumulative_logsumexp, cumulative_prod, cumulative_sum
 from ._array_api import __array_api_version__, __array_namespace_info__
 from ._clip import clip
 from ._constants import e, inf, nan, newaxis, pi

@@ -367,4 +368,7 @@
     "tensordot",
     "vecdot",
     "searchsorted",
+    "cumulative_logsumexp",
+    "cumulative_prod",
+    "cumulative_sum",
 ]
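One of the tests described in the commit message checks `cumulative_logsumexp` of a geometric series against its closed form. The idea can be sketched in pure Python, since scanning the terms `log(r**k)` with logsumexp must reproduce the log of the partial sums `(1 - r**(n+1)) / (1 - r)`; the `running_logsumexp` helper below is illustrative, not the dpctl implementation:

```python
import math


def running_logsumexp(xs):
    """Stable running log(exp(x0) + ... + exp(xk)) for each k."""
    acc = -math.inf  # log(0), identity of logaddexp
    out = []
    for x in xs:
        hi, lo = max(acc, x), min(acc, x)
        acc = hi + math.log1p(math.exp(lo - hi))
        out.append(acc)
    return out


r = 0.5
# terms of the geometric series in log space: log(r**k) = k * log(r)
terms = [k * math.log(r) for k in range(10)]
got = running_logsumexp(terms)
# closed form of the partial sums: sum_{k=0}^{n} r**k = (1 - r**(n+1)) / (1 - r)
want = [math.log((1 - r ** (n + 1)) / (1 - r)) for n in range(10)]
```

Comparing `got` against `want` gives a scan test whose expected values come from an exact formula rather than from another floating-point accumulation, which is what makes the closed-form check attractive.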
