
Implement BandedDot Op #1416


Status: Open. Wants to merge 25 commits into base branch main.

Conversation

@jessegrabowski (Member) commented May 23, 2025

Description

This PR adds a BandedDot Op that uses the BLAS routine gbmv to do matrix-vector multiplication for the case where A is a banded matrix.

In my testing, I found that exploiting the banded structure sped up computation significantly. Benchmarked against PyTensor's dot, however, the current implementation is significantly slower:
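The core trick can be sketched standalone with SciPy's low-level BLAS wrappers (a minimal example of the idea, not the PR's actual code): pack the matrix into LAPACK band storage and call dgbmv on it.

```python
import numpy as np
from scipy.linalg.blas import dgbmv

# Tridiagonal A: one sub- and one super-diagonal (kl = ku = 1)
n, kl, ku = 4, 1, 1
A = (
    np.diag(np.full(n, 2.0))
    + np.diag(np.full(n - 1, -1.0), k=1)
    + np.diag(np.full(n - 1, -1.0), k=-1)
)
x = np.arange(1.0, n + 1)

# Pack A into LAPACK band storage: row i of ab holds diagonal k = ku - i
ab = np.zeros((kl + ku + 1, n))
for i, k in enumerate(range(ku, -kl - 1, -1)):
    padding = (k, 0) if k >= 0 else (0, -k)
    ab[i, :] = np.pad(np.diag(A, k=k), padding)

# y = alpha * A @ x, computed from the band storage only
y = dgbmv(n, n, kl, ku, 1.0, ab, x)
assert np.allclose(y, A @ x)
```

For large, narrow bands this touches O(n * (kl + ku + 1)) memory instead of O(n^2), which is where the speedup at 10,000x10,000 comes from.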

------------------------------------------------------------------------------------------------- benchmark: 8 tests ------------------------------------------------------------------------------------------------
Name (time in us)                       Min                    Max                  Mean              StdDev                Median                IQR            Outliers           OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_dot_perf[10]                    1.7500 (1.0)          17.3330 (1.0)          1.9054 (1.0)        0.1292 (1.0)          1.9160 (1.0)       0.0420 (1.0)      585;1740  524,831.2234 (1.0)       38401           1
test_banded_dot_perf[10]            19.9580 (11.40)    13,765.1250 (794.16)      32.5111 (17.06)    282.5468 (>1000.0)     20.5830 (10.74)     0.3750 (8.93)        6;349   30,758.7051 (0.06)       3275           1

test_dot_perf[100]                   2.4580 (1.40)         42.5420 (2.45)         2.7856 (1.46)       0.3265 (2.53)         2.7500 (1.44)      0.0420 (1.0)      343;7436  358,988.7425 (0.68)      71429           1
test_banded_dot_perf[100]           19.8330 (11.33)    15,203.3750 (877.13)      30.9185 (16.23)    193.8617 (>1000.0)     20.9580 (10.94)     0.4160 (9.90)      51;3057   32,343.1413 (0.06)      20566           1

test_dot_perf[1000]                 15.0000 (8.57)         61.5000 (3.55)        16.6383 (8.73)       1.4182 (10.98)       17.2920 (9.03)      2.2080 (52.57)     905;126   60,102.3508 (0.11)      18377           1
test_banded_dot_perf[1000]          27.0420 (15.45)       423.8750 (24.45)       32.9042 (17.27)      5.2005 (40.25)       32.6250 (17.03)     0.6250 (14.88)    129;1334   30,391.2634 (0.06)      12501           1

test_dot_perf[10_000]            3,369.4580 (>1000.0)   5,011.3330 (289.12)   3,412.7784 (>1000.0)  119.9981 (928.81)   3,394.5625 (>1000.0)  17.2910 (411.69)       4;25      293.0164 (0.00)        198           1
test_banded_dot_perf[10_000]       109.9170 (62.81)       611.5830 (35.28)      139.2751 (73.10)     52.3002 (404.81)     116.5000 (60.80)    14.0000 (333.33)    472;678    7,180.0341 (0.01)       3386           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I guess there's some major overhead from doing the diagonal extractions and looking up the BLAS function in Python? This could, and probably should, be a C Op, but I realistically don't have time to dig into all that anytime soon. Help wanted, at any rate.
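That suspicion is cheap to check with a rough micro-benchmark that times the band packing, the BLAS-function lookup, and the gbmv call separately (a sketch, not the PR's benchmark; numbers are machine-dependent):

```python
import timeit

import numpy as np
import scipy.linalg as scipy_linalg

n, kl, ku = 100, 1, 1
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = np.ones(n)

def pack(A):
    # Same diagonal-extraction loop as the PR's helper
    ab = np.zeros((kl + ku + 1, n), dtype=A.dtype)
    for i, k in enumerate(range(ku, -kl - 1, -1)):
        padding = (k, 0) if k >= 0 else (0, -k)
        ab[i, :] = np.pad(np.diag(A, k=k), padding)
    return ab

ab = pack(A)
gbmv = scipy_linalg.get_blas_funcs("gbmv", dtype="float64")

t_pack = timeit.timeit(lambda: pack(A), number=10_000)
t_lookup = timeit.timeit(
    lambda: scipy_linalg.get_blas_funcs("gbmv", dtype="float64"), number=10_000
)
t_gbmv = timeit.timeit(lambda: gbmv(n, n, kl, ku, 1.0, ab, x), number=10_000)
print(f"pack: {t_pack:.4f}s  lookup: {t_lookup:.4f}s  gbmv: {t_gbmv:.4f}s")
```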

Related Issue

Checklist

Type of change

  • New feature / enhancement
  • Bug fix
  • Documentation
  • Maintenance
  • Other (please specify):

📚 Documentation preview 📚: https://pytensor--1416.org.readthedocs.build/en/1416/

@jessegrabowski jessegrabowski added enhancement New feature or request help wanted Extra attention is needed Op implementation linalg Linear algebra labels May 23, 2025
jessegrabowski and others added 2 commits May 23, 2025 17:32
Co-authored-by: Ricardo Vieira <28983449+ricardoV94@users.noreply.github.com>
@jessegrabowski (Member Author)

I added trust_input, and I now load the BLAS functions once at import time and save them. That should reduce the most obvious sources of Python overhead. New benchmarks (note that they're in ns now, not µs):

------------------------------------------------------------------------------------------------------------------- benchmark: 8 tests -------------------------------------------------------------------------------------------------------------------
Name (time in ns)                                      Min                       Max                      Mean                  StdDev                    Median                     IQR            Outliers             OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_banded_dot_perf[10-dot]                      541.9988 (1.0)          4,292.0001 (1.0)            638.1136 (1.0)           51.0902 (1.0)            625.0011 (1.0)           41.0000 (40.91)    1506;209  1,567,119.1257 (1.0)       15636           1
test_banded_dot_perf[10-banded_dot]            17,500.0005 (32.29)      418,167.0010 (97.43)       18,191.1183 (28.51)      3,829.7598 (74.96)       18,083.0011 (28.93)        167.0014 (166.62)     70;630     54,971.8815 (0.04)      11353           1

test_banded_dot_perf[100-dot]                   1,209.0004 (2.23)        23,959.0008 (5.58)         1,340.3628 (2.10)         103.1441 (2.02)         1,333.0009 (2.13)           1.0023 (1.0)    1217;34675    746,066.6804 (0.48)      88889           1
test_banded_dot_perf[100-banded_dot]           17,542.0009 (32.37)       77,083.9997 (17.96)       18,240.8191 (28.59)      1,230.1810 (24.08)       18,000.0006 (28.80)        250.0001 (249.44)   654;2431     54,822.0996 (0.03)      19018           1

test_banded_dot_perf[1000-dot]                 13,291.9995 (24.52)       49,874.9996 (11.62)       15,195.7498 (23.81)      1,137.7872 (22.27)       15,833.0004 (25.33)      1,832.9993 (>1000.0)  2954;119     65,807.8747 (0.04)      22347           1
test_banded_dot_perf[1000-banded_dot]          24,624.9983 (45.43)       74,874.9990 (17.45)       30,233.2753 (47.38)      1,347.0049 (26.37)       30,125.0002 (48.20)        375.0010 (374.15)   874;1333     33,076.1385 (0.02)      15595           1

test_banded_dot_perf[10_000-dot]            3,394,874.9988 (>1000.0)  5,084,541.9992 (>1000.0)  3,585,834.0104 (>1000.0)  191,227.5142 (>1000.0)  3,558,604.5005 (>1000.0)  199,729.5003 (>1000.0)      16;3        278.8752 (0.00)        192           1
test_banded_dot_perf[10_000-banded_dot]       105,208.0006 (194.11)     389,250.0008 (90.69)      124,879.6041 (195.70)    35,967.3472 (704.00)     110,375.0001 (176.60)     8,343.4998 (>1000.0)   320;440      8,007.7128 (0.01)       2665           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Comment on lines 1690 to 1699
A = np.asarray(A)
m, n = A.shape
ab = np.zeros((kl + ku + 1, n), dtype=A.dtype, order="C")

for i, k in enumerate(range(ku, -kl - 1, -1)):
    padding = (k, 0) if k >= 0 else (0, -k)
    diag = np.pad(np.diag(A, k=k), padding)
    ab[i, :] = diag

return ab
Member:

I imagine this explains most of the python overhead for small cases?

Member Author:

One way or another we have to do that as part of the cost of the Op, unless we demand users have inputs ready in that form.

Member:

Yeah it's fine, I was just thinking out loud.

Member Author:

This rearrangement could be done symbolically in a wrapper Op that calls the blas Op (which expects things to be ready in the correct form)

It might also be better to do smart column indexing on ab instead of using pad
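That slicing idea can be sketched as follows (the helper name is hypothetical, not the PR's): write each diagonal straight into the right slice of ab instead of building a padded copy with np.pad.

```python
import numpy as np

def to_banded_form(A, kl, ku):
    # Hypothetical pad-free variant of the band-packing helper
    A = np.asarray(A)
    m, n = A.shape
    ab = np.zeros((kl + ku + 1, n), dtype=A.dtype)
    for i, k in enumerate(range(ku, -kl - 1, -1)):
        if k >= 0:
            ab[i, k:] = np.diag(A, k=k)  # superdiagonals: zeros lead
        else:
            ab[i, : n + k] = np.diag(A, k=k)  # subdiagonals: zeros trail
    return ab
```

Each diagonal is written exactly once, and the zero padding comes for free from the np.zeros allocation.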

Member:

Yeah, it's similar to Solve, in that you can possibly do it once and reuse it many times, but I think that's too much micro-optimization for now. We also don't want to autodiff through it.

Comment on lines 1702 to 1703
_dgbmv = scipy_linalg.get_blas_funcs("gbmv", dtype="float64")
_sgbmv = scipy_linalg.get_blas_funcs("gbmv", dtype="float32")
Member:

This will add import-time overhead to PyTensor.

I'm okay paying the extra 3µs at runtime instead, since virtually nobody will ever use this (or use it in a case where they need those extra µs).

Member Author:

I thought about this as well. It won't stay in the final version.

@ricardoV94 (Member), May 23, 2025:

You can exploit prepare_node and add the function to node.tag, which the perform method can then retrieve it from. That's two attribute accesses instead of a string check / scipy caching...
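A PyTensor-free sketch of that caching idea (SimpleNamespace stands in for node.tag, and the prepare/perform function names are only illustrative): resolve the BLAS routine once up front, so the hot path pays only attribute lookups instead of a scipy string lookup per call.

```python
from types import SimpleNamespace

import numpy as np
import scipy.linalg as scipy_linalg

tag = SimpleNamespace()  # plays the role of node.tag

def prepare(tag, dtype):
    # One-time lookup, done at preparation time
    tag.gbmv = scipy_linalg.get_blas_funcs("gbmv", dtype=dtype)

def perform(tag, m, n, kl, ku, ab, x):
    # Hot path: attribute accesses only, no scipy lookup
    return tag.gbmv(m, n, kl, ku, 1.0, ab, x)

prepare(tag, "float64")
# Diagonal 2x2 matrix (kl = ku = 0): band storage is just the main diagonal
y = perform(tag, 2, 2, 0, 0, np.array([[2.0, 3.0]]), np.array([1.0, 1.0]))
```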

Member:

Or you can sidestep perform and use make_thunk instead

@ricardoV94 (Member)

I think the Op is fine, especially if we are not trying to introduce it automatically via rewrites. If we are, we may consider the backend (once we have it in numba I suspect it will win for smaller matrices) and/or static shapes, if we think the worst-case penalty is still too big.

@jessegrabowski (Member Author)

Benchmark after tuning up the _to_banded_form function:

------------------------------------------------------------------------------------------------------------------- benchmark: 8 tests ------------------------------------------------------------------------------------------------------------------
Name (time in ns)                                      Min                       Max                      Mean                 StdDev                    Median                     IQR            Outliers             OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_banded_dot_perf[10-dot]                      499.9965 (1.0)         55,500.0006 (1.41)           665.4888 (1.0)         390.9718 (1.0)            666.0011 (1.0)           42.0005 (1.00)      31;2639  1,502,654.9287 (1.0)       32129           1
test_banded_dot_perf[10-banded_dot]             2,832.9996 (5.67)        71,957.9984 (1.82)         3,356.9474 (5.04)        782.8860 (2.00)         3,332.9998 (5.00)         332.9988 (7.93)    1874;2239    297,889.6806 (0.20)      32833           1

test_banded_dot_perf[100-dot]                   1,000.0003 (2.00)        58,208.9997 (1.47)         1,191.9862 (1.79)        396.5918 (1.01)         1,166.9981 (1.75)          41.9968 (1.0)      305;3163    838,935.8643 (0.56)      91258           1
test_banded_dot_perf[100-banded_dot]            3,332.9998 (6.67)        39,499.9988 (1.0)          3,874.8349 (5.82)        471.5917 (1.21)         3,875.0004 (5.82)          84.0009 (2.00)   1020;11972    258,075.5142 (0.17)      71008           1

test_banded_dot_perf[1000-dot]                 13,584.0019 (27.17)      118,374.9991 (3.00)        16,143.5130 (24.26)     1,984.1144 (5.07)        16,291.0001 (24.46)      2,042.0011 (48.62)    1390;171     61,944.3861 (0.04)      14202           1
test_banded_dot_perf[1000-banded_dot]           8,167.0005 (16.33)       68,749.9996 (1.74)        10,694.7895 (16.07)     1,131.4230 (2.89)        11,000.0001 (16.52)        416.9997 (9.93)    6811;7582     93,503.4764 (0.06)      32521           1

test_banded_dot_perf[10_000-dot]            3,379,415.9972 (>1000.0)  3,680,959.0019 (93.19)    3,463,207.0645 (>1000.0)  79,485.8545 (203.30)   3,434,124.9993 (>1000.0)  114,541.9992 (>1000.0)       6;0        288.7497 (0.00)         31           1
test_banded_dot_perf[10_000-banded_dot]        93,582.9994 (187.17)     294,458.0010 (7.45)       100,154.2338 (150.50)   22,660.4163 (57.96)       95,479.0012 (143.36)     2,083.4996 (49.61)       10;27      9,984.6004 (0.01)        248           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

@ricardoV94 (Member)

That looks much better!

@jessegrabowski (Member Author)

I agree numba will probably be better across the board. I'd really like this Op to win on the 100x100 case; that's already a pretty big matrix. 1000x1000 and 10,000x10,000 don't really show up in nature too often.

@ricardoV94 (Member) commented May 23, 2025

100x100 is 1µs; you are at the edge of Python overhead there. Calling an identity PyTensor function without trust_input is 300-500ns. Calling np.zeros is like 100-200ns. That means you would basically need to have no Python overhead whatsoever.

Edit: those are on my machine, don't know about yours

@ricardoV94 (Member) commented May 23, 2025

This is the best I think we can get out of this in python?

def make_thunk(self, node, storage_map, compute_map, no_recycling, impl):
    kl = self.lower_diags
    ku = self.upper_diags

    # Look up the BLAS routine once, when the thunk is created
    if node.outputs[0].dtype == "float64":
        gbmv = scipy_linalg.get_blas_funcs("gbmv", dtype="float64")
    else:
        gbmv = scipy_linalg.get_blas_funcs("gbmv", dtype="float32")

    ab_size = kl + ku + 1
    a_storage = storage_map[node.inputs[0]]
    b_storage = storage_map[node.inputs[1]]
    out_storage = storage_map[node.outputs[0]]
    out_computed = compute_map[node.outputs[0]] if compute_map is not None else [False]

    def thunk(
        a_storage=a_storage,
        b_storage=b_storage,
        out_storage=out_storage,
        out_computed=out_computed,
        kl=kl,
        ku=ku,
        ab_size=ab_size,
        gbmv=gbmv,
    ):
        A = a_storage[0]
        b = b_storage[0]
        m, n = A.shape

        # Pack A into band storage by slicing, avoiding np.pad copies
        ab = np.zeros((ab_size, n), dtype=A.dtype, order="C")
        for i, k in enumerate(range(ku, -kl - 1, -1)):
            if k > 0:
                ab[i, k:] = np.diag(A, k=k)
            else:
                ab[i, : n + k] = np.diag(A, k=k)

        out_storage[0] = gbmv(m, n, kl, ku, 1, ab, b)
        out_computed[0] = True

    return thunk

A = as_tensor_variable(A)
B = as_tensor_variable(b)

out_dtype = pytensor.scalar.upcast(A.dtype, B.dtype)
Member:

I suspect this is wrong for integer types

@ricardoV94 (Member)

That's much more palatable.

The difference between the numba and Python gbmv is also what you should expect to see if you implemented gbmv in C, so you don't have to wonder.

@jessegrabowski (Member Author)

[image: benchmark plot]

🤔

@jessegrabowski (Member Author)

The problem in the timings was some copies being done in both Python and numba mode. Here are the updated timings. They're essentially the same, except on the low end, where getting rid of the Python overhead gives numba a small but consistent speed bump.

[images: updated benchmark plots]

@jessegrabowski jessegrabowski marked this pull request as ready for review May 24, 2025 13:03
@jessegrabowski (Member Author)

I'd like to call this one done for now, although there are three major things that are left to do:

  1. Enable GEMV rewrites in numba and re-use that machinery to allow all arguments to the numba xgemv function. Right now I'm forcing alpha=1, beta=0.
  2. Split off the code that converts a dense banded matrix into banded storage form into a separate Op. Then we can add a rewrite to lift that outside of scan, for example. More importantly, we can:
  3. Introduce a rewrite that converts GEMV(BandedMatrix(A), x, ...) into BandedGEMV(BandedMatrix(A), x, ...). The existing BandedDot can become BandedGEMV and we can use all arguments.

I want to merge this and then do these three things, because I want to do #1418 first, put the resulting function into the new _BLAS.py file in this PR, and enable the relevant rewrites, then revisit this code.

I also need to think about how to handle splitting out the BandedMatrix Op, because it destroys information about how many rows the input matrix has (gemv needs to know this).


codecov bot commented May 24, 2025

Codecov Report

Attention: Patch coverage is 72.72727% with 33 lines in your changes missing coverage. Please review.

Project coverage is 82.09%. Comparing base (261aaf3) to head (481814f).
Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
pytensor/link/numba/dispatch/linalg/dot/banded.py 46.93% 26 Missing ⚠️
pytensor/tensor/slinalg.py 90.24% 2 Missing and 2 partials ⚠️
pytensor/link/numba/dispatch/slinalg.py 75.00% 2 Missing and 1 partial ⚠️

❌ Your patch check has failed because the patch coverage (72.72%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1416      +/-   ##
==========================================
- Coverage   82.11%   82.09%   -0.02%     
==========================================
  Files         211      213       +2     
  Lines       49686    49843     +157     
  Branches     8813     8827      +14     
==========================================
+ Hits        40798    40920     +122     
- Misses       6710     6740      +30     
- Partials     2178     2183       +5     
Files with missing lines Coverage Δ
pytensor/link/numba/dispatch/basic.py 79.54% <ø> (ø)
pytensor/link/numba/dispatch/linalg/_BLAS.py 100.00% <100.00%> (ø)
pytensor/link/numba/dispatch/slinalg.py 70.10% <75.00%> (+0.34%) ⬆️
pytensor/tensor/slinalg.py 93.00% <90.24%> (-0.18%) ⬇️
pytensor/link/numba/dispatch/linalg/dot/banded.py 46.93% <46.93%> (ø)

... and 6 files with indirect coverage changes


A = as_tensor_variable(A)
x = as_tensor_variable(x)

out_dtype = pytensor.scalar.upcast(A.dtype, x.dtype)
Member:

wrong for integers/should raise. Also reject complex?

Member Author:

I copied this from other make_node implementations in slinalg (eigvalsh, the eigvalsh grad, the solve Lyapunov stuff). What's the right way to upcast here?

Member:

The right way is to predict what scipy outputs. Some Ops are lazy and just call scipy with a minimal input case to find out the output type; I don't love that.

Which makes me wonder: I guess numba / a direct call to gbmv doesn't work with integer arrays, so we may need to cast/raise?

Member Author:

What does JAX do on integer inputs?

Also it's not that onerous to just try every combination of input pairs on the scipy function, write it in a dictionary, and just look it up. Is that too crazy?

Member:

What does JAX do on integer inputs?

No idea, cast them to float or call a dot function that works on integers?

Also it's not that onerous to just try every combination of input pairs on the scipy function, write it in a dictionary, and just look it up. Is that too crazy?

I think it's a bit crazy. You could add a function with lru_cache on the dtypes that tries it and stores the result. Most combinations will never be needed, and we don't want to do it at import time.
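That suggestion can be sketched like this (the function name is hypothetical): probe gbmv once per dtype pair on a 1x1 problem and let lru_cache remember the answer, so the cost is paid lazily and at most once per combination.

```python
from functools import lru_cache

import numpy as np
import scipy.linalg as scipy_linalg

@lru_cache(maxsize=None)
def gbmv_out_dtype(a_dtype: str, x_dtype: str) -> str:
    # Probe with a minimal 1x1 case; lru_cache stores one entry per pair
    a = np.ones((1, 1), dtype=a_dtype)  # band storage of a 1x1 matrix
    x = np.ones(1, dtype=x_dtype)
    gbmv = scipy_linalg.get_blas_funcs("gbmv", (a, x))
    return gbmv(1, 1, 0, 0, 1.0, a, x).dtype.name
```

Since get_blas_funcs falls back to the double-precision routine for non-float inputs, integer pairs would come back as float64 here, which also answers the cast-or-raise question for the integer case.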


def make_node(self, A, x):
A = as_tensor_variable(A)
x = as_tensor_variable(x)
Member:

Raise ValueError for non-core ndims.

@@ -1669,6 +1670,73 @@ def block_diag(*matrices: TensorVariable):
return _block_diagonal_matrix(*matrices)


class BandedDot(Op):
Member:

Put in blas.py?

Member:

I saw your message, fine

Member Author:

You mean in pytensor.tensor.blas? I can do that if you think it's better.

KU = val_to_int_ptr(ku)

ALPHA = np.array(1.0, dtype=dtype)
INCX = val_to_int_ptr(x.strides[0] // x.itemsize)
@ricardoV94 (Member), May 24, 2025:

Please test non-unit, positive, and negative strides for x. In the C Gemv, for instance, we need to point to the last memory position when the stride is negative.

@ricardoV94 (Member), May 24, 2025:

You can, but need not, also test strides for A and y. Since we're creating them ourselves right now, we know they're always correct. But once we split the Op we will need to worry about those as well.

Member Author:

lmk what you think about the way I'm testing strides now, and I can expand it if the approach is adequate.

Member Author:

Unresolving this because the negative stride tests are failing

@ricardoV94 (Member), May 27, 2025:

If it's like the C blas, when you have negative strides you have to point to the end of the numpy array (x[-1]). BLAS wants to know where the block of memory starts, even if it iterates in reverse, but numpy's data pointer targets the end of the buffer when the stride is negative.
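The numpy side of this is easy to see from Python (a small demonstration, not part of the PR): a reversed view has a negative stride, and its data pointer sits on the last element of the underlying buffer, which is exactly what the C code has to compensate for.

```python
import numpy as np

x = np.arange(5.0)
r = x[::-1]  # negative-stride view, no copy

# The view's data pointer is &x[-1], the highest address in the buffer,
# while BLAS expects the lowest address of the memory block.
assert r.strides == (-x.itemsize,)
assert r.ctypes.data == x.ctypes.data + (x.size - 1) * x.itemsize
```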

Member:

// gemv expects pointers to the beginning of memory arrays,
// but numpy provides a pointer to the first element,
// so when the stride is negative, we need to get the last one.
if (Sx < 0)
x_data += (Nz0 - 1) * Sx;
if (Sy < 0)
y_data += (Nz1 - 1) * Sy;

@jessegrabowski (Member Author)

@ricardoV94 since #1418 got resolved without adding the GEMV rewrite to numba, how should I handle expanding this Op to include rank-1 updates?

@ricardoV94 (Member) commented May 28, 2025

We may still link directly to BLAS for the full update; I'm not sure numba does it beyond dispatching the matrix/vector dot part.

@ricardoV94 (Member) commented May 28, 2025

I would start by benchmarking directly with numba to see if we get a speedup from calling the fused gemv directly, or if numba already does the fusion (for the regular gemv, that is; it for sure doesn't do it for gbmv).

Successfully merging this pull request may close these issues: Implement linalg.BandedDot