Inconsistent variable usage in tile_latent_min_width computation in AutoencoderKLMochi decoder #11291

@kuantuna

Describe the bug

There seems to be an inconsistency in the calculation of tile_latent_min_width inside the decoder of the AutoencoderKLMochi model. In the following line of code, tile_latent_min_width is calculated using:

tile_latent_min_width = self.tile_sample_stride_width // self.spatial_compression_ratio

However, directly above it, tile_latent_min_height is calculated as:

tile_latent_min_height = self.tile_sample_min_height // self.spatial_compression_ratio

This discrepancy appears to be unintended. Logically, the width counterpart should follow the same pattern. The correct computation for tile_latent_min_width should likely be:

tile_latent_min_width = self.tile_sample_min_width // self.spatial_compression_ratio

Further evidence for this interpretation can be found in the tiled decoding section at line 1057, where tile_latent_min_width is computed from self.tile_sample_min_width.

This reinforces the idea that the line at L912 is likely a copy-paste or naming oversight.

While this issue may not cause runtime errors under standard conditions—such as when generating 848x480 videos with mochi, where the check at L914 still passes—it can affect edge cases (e.g., for profiling, debugging with custom video dimensions), potentially leading to inconsistent or unintended behavior.

If the maintainers agree this is indeed a bug, I’d be happy to submit a PR to fix it.

Reproduction

This issue does not trigger an immediate runtime error but leads to an incorrect value being calculated, which may result in unintended behavior when tiling is enabled in the VAE.

To reproduce the problem, simply run the Mochi pipeline with the following configuration:

width = 848
height = 480
vae_tiling = True  # Enable VAE tiling

With this setup and default parameters, the AutoencoderKLMochi._decode function will receive the following relevant attributes:

width = 106 (downsampled by spatial_compression_ratio)
height = 60 (downsampled by spatial_compression_ratio)
self.tile_sample_min_height = 256
self.tile_sample_min_width  = 256
self.tile_sample_stride_width = 192
self.spatial_compression_ratio = 8

From the current code at L912, the computed value becomes:

tile_latent_min_width = self.tile_sample_stride_width // self.spatial_compression_ratio
                       = 192 // 8
                       = 24

Instead, it should have been:

tile_latent_min_width = self.tile_sample_min_width // self.spatial_compression_ratio
                       = 256 // 8
                       = 32

As a result, the conditional check:

if width > tile_latent_min_width:

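incorrectly evaluates to True for latent widths in the range 24 < width <= 32, even though it should evaluate to False there with the correct tile_latent_min_width = 32. For the 848x480 configuration above, width = 106 exceeds both thresholds, so the result of the check is unchanged; the discrepancy only shows up for narrower inputs.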
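The affected range can be checked with a small standalone sketch. It does not import diffusers; the constants are the attribute values listed above, and takes_tiled_path is a hypothetical helper that models only the width comparison quoted in this issue, not the full condition in the library.

```python
# Attribute values as reported in this issue for the default configuration.
TILE_SAMPLE_MIN_WIDTH = 256
TILE_SAMPLE_STRIDE_WIDTH = 192
SPATIAL_COMPRESSION_RATIO = 8

# Threshold as computed by the current code (uses the stride).
current_threshold = TILE_SAMPLE_STRIDE_WIDTH // SPATIAL_COMPRESSION_RATIO
# Threshold as it would be computed from tile_sample_min_width.
expected_threshold = TILE_SAMPLE_MIN_WIDTH // SPATIAL_COMPRESSION_RATIO


def takes_tiled_path(latent_width: int, threshold: int) -> bool:
    """Hypothetical stand-in for the check `width > tile_latent_min_width`."""
    return latent_width > threshold


# Latent widths for which the two thresholds give different answers.
affected = [
    w
    for w in range(1, 129)
    if takes_tiled_path(w, current_threshold) != takes_tiled_path(w, expected_threshold)
]

print(current_threshold, expected_threshold)  # 24 32
print(affected[0], affected[-1])              # 25 32
# The same range expressed in pixels (latent width * compression ratio).
print(affected[0] * SPATIAL_COMPRESSION_RATIO,
      affected[-1] * SPATIAL_COMPRESSION_RATIO)  # 200 256
# The 848-pixel-wide example (latent width 106) is not in the affected range.
print(848 // SPATIAL_COMPRESSION_RATIO in affected)  # False
```

Under these assumptions, only inputs whose pixel width is between 200 and 256 would take a different branch on the width comparison, which is consistent with the default 848x480 run showing no visible difference.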

Logs

System Info

I noticed this in my environment with diffusers 0.32.1, and it is still present in the master branch.

Who can help?

@DN6
