Description
Describe the bug
There seems to be an inconsistency in the calculation of tile_latent_min_width inside the decoder of the AutoencoderKLMochi model. In the following line of code, tile_latent_min_width is calculated using:
tile_latent_min_width = self.tile_sample_stride_width // self.spatial_compression_ratio
However, directly above it, tile_latent_min_height is calculated as:
tile_latent_min_height = self.tile_sample_min_height // self.spatial_compression_ratio
This discrepancy appears to be unintended. Logically, the width counterpart should follow the same pattern. The correct computation for tile_latent_min_width should likely be:
tile_latent_min_width = self.tile_sample_min_width // self.spatial_compression_ratio
Further evidence supporting this interpretation can be found in the tiled decoding section at line 1057, where tile_latent_min_width is indeed computed from self.tile_sample_min_width. This suggests that the line at L912 is a copy-paste or naming oversight.
While this issue may not cause runtime errors under standard conditions (for example, when generating 848x480 videos with Mochi, where the check at L914 still passes), it can affect edge cases such as profiling or debugging with custom video dimensions, potentially leading to inconsistent or unintended behavior.
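To make the edge case concrete, here is a minimal standalone sketch that does not import diffusers. It assumes the check tiles when either latent dimension exceeds its threshold and that latent sizes are the pixel sizes floor-divided by spatial_compression_ratio; the function would_tile and the 256x256 input are hypothetical and only serve as illustration.

```python
# Default values quoted in this report.
SPATIAL_COMPRESSION_RATIO = 8
TILE_SAMPLE_MIN_HEIGHT = 256
TILE_SAMPLE_MIN_WIDTH = 256
TILE_SAMPLE_STRIDE_WIDTH = 192


def would_tile(pixel_width: int, pixel_height: int, use_stride_for_width: bool) -> bool:
    """Hypothetical stand-in for the tiling decision (assumed form, not the library code)."""
    latent_width = pixel_width // SPATIAL_COMPRESSION_RATIO
    latent_height = pixel_height // SPATIAL_COMPRESSION_RATIO
    # use_stride_for_width=True mimics the current line; False mimics the proposed fix.
    width_source = TILE_SAMPLE_STRIDE_WIDTH if use_stride_for_width else TILE_SAMPLE_MIN_WIDTH
    min_latent_width = width_source // SPATIAL_COMPRESSION_RATIO
    min_latent_height = TILE_SAMPLE_MIN_HEIGHT // SPATIAL_COMPRESSION_RATIO
    return latent_width > min_latent_width or latent_height > min_latent_height


for size in [(848, 480), (256, 256)]:
    current = would_tile(*size, use_stride_for_width=True)
    proposed = would_tile(*size, use_stride_for_width=False)
    print(size, "current:", current, "proposed:", proposed)
```

Under these assumptions, 848x480 gives the same decision either way, while a 256x256 input would take the tiled path only with the current computation.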
If the maintainers agree this is indeed a bug, I’d be happy to submit a PR to fix it.
Reproduction
This issue does not trigger an immediate runtime error but leads to an incorrect value being calculated, which may result in unintended behavior when tiling is enabled in the VAE.
To reproduce the problem, simply run the Mochi pipeline with the following configuration:
width = 848
height = 480
vae_tiling = True # Enable VAE tiling
With this setup and default parameters, the AutoencoderKLMochi._decode function will receive the following relevant attributes:
width = 106 (downsampled by spatial_compression_ratio)
height = 60 (downsampled by spatial_compression_ratio)
self.tile_sample_min_height = 256
self.tile_sample_min_width = 256
self.tile_sample_stride_width = 192
self.spatial_compression_ratio = 8
From the current code at L912, the computed value becomes:
tile_latent_min_width = self.tile_sample_stride_width // self.spatial_compression_ratio
= 192 // 8
= 24
Instead, it should have been:
tile_latent_min_width = self.tile_sample_min_width // self.spatial_compression_ratio
= 256 // 8
= 32
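As a result, the conditional check:
if width > tile_latent_min_width:
incorrectly evaluates to True for 24 < width <= 32, even though it should evaluate to False with the correct tile_latent_min_width = 32. For the latent width of 106 in this configuration, the check is True with either value, so the difference only shows up for latent widths in that range.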
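The affected range can be checked with a short standalone sketch. It models only the width comparison quoted above, using the default values from this report; width_check is a hypothetical helper, not a diffusers function.

```python
# Default values quoted in this report.
TILE_SAMPLE_MIN_WIDTH = 256
TILE_SAMPLE_STRIDE_WIDTH = 192
SPATIAL_COMPRESSION_RATIO = 8

# Threshold as computed today (from the stride) and as proposed (from the min width).
current_threshold = TILE_SAMPLE_STRIDE_WIDTH // SPATIAL_COMPRESSION_RATIO
proposed_threshold = TILE_SAMPLE_MIN_WIDTH // SPATIAL_COMPRESSION_RATIO


def width_check(latent_width: int, threshold: int) -> bool:
    """Hypothetical stand-in for `width > tile_latent_min_width`."""
    return latent_width > threshold


# Latent widths for which the two thresholds give different answers.
affected = [
    w for w in range(1, 129)
    if width_check(w, current_threshold) != width_check(w, proposed_threshold)
]

print("thresholds:", current_threshold, proposed_threshold)
print("affected latent widths:", affected[0], "to", affected[-1])
```

With these values, the thresholds are 24 and 32, and the two checks disagree only for latent widths 25 through 32.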
Logs
System Info
I noticed this in my environment with diffusers 0.32.1, but it is still present in the master branch.