ot.emd with large datasets #397

Description

@johnalison

🚀 Feature

I was wondering if it would be possible to do the OT matrix computation (ot.emd) using sparse matrices
(e.g. https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_array.html#scipy.sparse.coo_array),
where only the distances between pairs of entries below some cutoff would be given, and only the couplings above some cutoff would be returned.
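To make the request concrete, here is a minimal sketch of how the cutoff-based sparse cost matrix could be built on the input side, using scipy's KD-tree to store only the below-cutoff pairs. The point clouds and cutoff are illustrative, and this only prepares the input; ot.emd itself does not currently accept sparse matrices, which is the feature being requested.

```python
# Build only the below-cutoff entries of the pairwise distance matrix,
# so memory scales with the number of close pairs rather than
# len(a) * len(b).  (Illustrative data and cutoff.)
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
xs = rng.normal(size=(1000, 2))   # source samples
xt = rng.normal(size=(1000, 2))   # target samples
cutoff = 0.5

# COO sparse matrix holding only distances <= cutoff
M_sparse = cKDTree(xs).sparse_distance_matrix(
    cKDTree(xt), max_distance=cutoff, output_type="coo_matrix")

dense_entries = xs.shape[0] * xt.shape[0]
print(f"stored {M_sparse.nnz} of {dense_entries} pairwise distances")
```

A sparse solver could then consume the (row, col, data) triplets of `M_sparse` directly, and return the coupling in the same format.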

Motivation

I'd like to compute the OT matrix between large (O(500k)) datasets.
Currently, the biggest limitation on dataset size is the memory required to store the len(a) x len(b) matrices.
In practice, nearly all of the input distances are large, and nearly all of the corresponding couplings are small,
so sparse matrices could be a good solution.
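A back-of-envelope calculation makes the memory limitation concrete (assuming float64 and square matrices of the size mentioned above):

```python
# One dense float64 cost (or coupling) matrix at O(500k) x O(500k):
n = m = 500_000
bytes_needed = n * m * 8          # 8 bytes per float64 entry
print(bytes_needed / 1e12, "TB")  # -> 2.0 TB per matrix
```

That is roughly 2 TB for the cost matrix alone, before the solver allocates anything else, which is why the dense formulation cannot reach these dataset sizes.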

Alternatives

We are open to any suggestions for alternatives.

We currently approximate the full OT matrix by computing many OT matrices using O(10k) subsets of the source and target datasets. The effect of the full OT between the datasets is then approximated by "stitching together" the smaller coupling matrices. This reduces the effective statistical power of our data, which we think we can avoid with the sparse matrix implementation.
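For reference, the subset-and-stitch workaround can be sketched as follows. This is a simplified illustration, not our actual code: scipy's linear_sum_assignment stands in for the OT solver (it solves uniform-weight, equal-size OT exactly), and the chunk size and the pairing of source chunks with target chunks are arbitrary choices.

```python
# Sketch of the "stitching" approximation: solve OT on small subsets
# and concatenate the resulting matchings into one global coupling.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
xs = rng.normal(size=(4000, 2))   # source dataset (illustrative)
xt = rng.normal(size=(4000, 2))   # target dataset (illustrative)
chunk = 1000                      # subset size (illustrative)

pairs = []
for lo in range(0, len(xs), chunk):
    s, t = xs[lo:lo + chunk], xt[lo:lo + chunk]
    M = cdist(s, t)                   # dense, but only chunk x chunk
    i, j = linear_sum_assignment(M)   # exact OT on this subset
    pairs.append(np.stack([i + lo, j + lo], axis=1))

matching = np.concatenate(pairs)      # "stitched" global coupling
print(matching.shape)                 # (4000, 2)
```

The stitched matching only ever pairs points within the same chunk, which is exactly the loss of statistical power described above: cross-chunk pairings that the full OT would find are excluded by construction.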

Additional context

The large datasets come from LHC data:
https://arxiv.org/abs/2208.02807
Thank you for your OT implementation! We have found it very useful in our work.
