ot.emd with large datasets #397

Description

@johnalison

🚀 Feature

I was wondering if it would be possible to do the OT matrix computation (ot.emd) using sparse matrices
(e.g. https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_array.html#scipy.sparse.coo_array),
where only the distances between pairs of entries below some cutoff would be given, and only the couplings above some cutoff would be returned.
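To make the request concrete, here is a minimal sketch of how the cutoff-based sparse cost matrix could be built on the input side, using scipy's KD-tree to store only the below-cutoff pairs. The point clouds and cutoff are illustrative, and this only prepares the input; ot.emd itself does not currently accept sparse matrices, which is the feature being requested.

```python
# Build only the below-cutoff entries of the pairwise distance matrix,
# so memory scales with the number of close pairs rather than
# len(a) * len(b).  (Illustrative data and cutoff.)
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
xs = rng.normal(size=(1000, 2))   # source samples
xt = rng.normal(size=(1000, 2))   # target samples
cutoff = 0.5

# COO sparse matrix holding only distances <= cutoff
M_sparse = cKDTree(xs).sparse_distance_matrix(
    cKDTree(xt), max_distance=cutoff, output_type="coo_matrix")

dense_entries = xs.shape[0] * xt.shape[0]
print(f"stored {M_sparse.nnz} of {dense_entries} pairwise distances")
```

A sparse solver could then consume the (row, col, data) triplets of `M_sparse` directly, and return the coupling in the same format.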

Motivation

I'd like to compute the OT matrix between large (O(500k)) datasets.
Currently, the biggest limitation on dataset size is the memory required to store the len(a) x len(b) matrices.
In practice, nearly all of the input distances are large, and nearly all of the corresponding couplings are small,
so sparse matrices could be a good solution.
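A back-of-envelope calculation makes the memory limitation concrete (assuming float64 and square matrices of the size mentioned above):

```python
# One dense float64 cost (or coupling) matrix at O(500k) x O(500k):
n = m = 500_000
bytes_needed = n * m * 8          # 8 bytes per float64 entry
print(bytes_needed / 1e12, "TB")  # -> 2.0 TB per matrix
```

That is roughly 2 TB for the cost matrix alone, before the solver allocates anything else, which is why the dense formulation cannot reach these dataset sizes.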

Alternatives

We are open to any suggestions for alternatives.

We currently approximate the full OT matrix by computing many OT matrices using O(10k) subsets of the source and target datasets. The effect of the full OT between the datasets is then approximated by "stitching together" the smaller coupling matrices. This reduces the effective statistical power of our data, which we think we can avoid with the sparse matrix implementation.
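For reference, the subset-and-stitch workaround can be sketched as follows. This is a simplified illustration, not our actual code: scipy's linear_sum_assignment stands in for the OT solver (it solves uniform-weight, equal-size OT exactly), and the chunk size and the pairing of source chunks with target chunks are arbitrary choices.

```python
# Sketch of the "stitching" approximation: solve OT on small subsets
# and concatenate the resulting matchings into one global coupling.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
xs = rng.normal(size=(4000, 2))   # source dataset (illustrative)
xt = rng.normal(size=(4000, 2))   # target dataset (illustrative)
chunk = 1000                      # subset size (illustrative)

pairs = []
for lo in range(0, len(xs), chunk):
    s, t = xs[lo:lo + chunk], xt[lo:lo + chunk]
    M = cdist(s, t)                   # dense, but only chunk x chunk
    i, j = linear_sum_assignment(M)   # exact OT on this subset
    pairs.append(np.stack([i + lo, j + lo], axis=1))

matching = np.concatenate(pairs)      # "stitched" global coupling
print(matching.shape)                 # (4000, 2)
```

The stitched matching only ever pairs points within the same chunk, which is exactly the loss of statistical power described above: cross-chunk pairings that the full OT would find are excluded by construction.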

Additional context

The large datasets come from LHC data:
https://arxiv.org/abs/2208.02807
Thank you for your OT implementation! We have found it very useful in our work.
