group mean over several axes of array with nans is wrong #1118

Open
@gdementen

The mean is currently computed on each axis in turn (a mean of means). When no nans are involved, this gives the same result in theory; in practice we lose some precision, but that has been deemed acceptable so far. When nans are involved, however, the result is significantly wrong, because each intermediate mean gets the same weight regardless of how many non-nan values it was computed from.

>>> arr = Array([[1, 3], [4, nan]], [Axis('a=a0,a1'), Axis('b=b0,b1')])
>>> arr
a\b   b0   b1
 a0  1.0  3.0
 a1  4.0  nan
>>> arr.mean("a0,a1 >> a01", "b0,b1 >> b01")
2.75

The result should be 2.6666... instead. What happens is that it computes:

>>> (((1 + 4) / 2) + ((3 + 0) / 1)) / 2
2.75
>>> 1/4 + 4/4 + 3/2
2.75

Instead of:

>>> (1 + 4 + 3) / 3
2.6666666666666665
>>> 1/3 + 4/3 + 3/3
2.6666666666666665

As a workaround until larray 0.35 is released, I have recommended using:

>>> # TODO: stop using this function once larray 0.35 is available
... def nd_mean(array, axes_or_groups):
...     """
...     Computes the mean of array over axes_or_groups.
...
...     This function is temporarily necessary because larray versions up to (and including) 0.34.x
...     give wrong results when computing the mean of groups over several dimensions
...     when some values are nans. See https://github.com/larray-project/larray/issues/1118
...     """
...     return array.sum(*axes_or_groups) / (~isnan(array)).sum(*axes_or_groups)
>>> nd_mean(arr, ("a0,a1 >> a01", "b0,b1 >> b01"))
2.6666666666666665
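
The workaround is safe because sums and counts, unlike means, can be reduced one axis at a time without changing the final result, as long as the division happens only once at the end. The following plain-Python sketch illustrates that idea on the same data; it is an illustration under that assumption, not a description of how larray implements `sum` internally:

```python
import math

nan = float("nan")
# rows are a0, a1; columns are b0, b1 (same values as arr above)
data = [[1.0, 3.0], [4.0, nan]]

# Step 1: reduce over a, keeping the sum and the non-nan count per column.
col_sums = [sum(v for v in col if not math.isnan(v)) for col in zip(*data)]
col_counts = [sum(1 for v in col if not math.isnan(v)) for col in zip(*data)]

# Step 2: reduce over b, then divide once.
total = sum(col_sums)
count = sum(col_counts)
mean = total / count

print(col_sums, col_counts)  # [5.0, 3.0] [2, 1]
print(mean)                  # 2.6666666666666665
```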
