Add description of algorithms to the doc #178


Merged 10 commits on May 3, 2019

3 changes: 0 additions & 3 deletions .gitignore
@@ -7,6 +7,3 @@ htmlcov/
.cache/
.pytest_cache/
doc/auto_examples/*
coverage
.coverage
.coverage*
153 changes: 136 additions & 17 deletions doc/supervised.rst
@@ -41,17 +41,37 @@ the covariance matrix of the input data. This is a simple baseline method.

.. [1] On the Generalized Distance in Statistics, P.C.Mahalanobis, 1936

.. _lmnn:

wdevazelhes (Member) commented on Apr 18, 2019:

You should leave the reference for LMNN here (right now the link from the
docstring to the user guide doesn't work), this way:

.. _lmnn:

LMNN
--------

Contributor (Author) replied: done

LMNN
-----

Large-margin nearest neighbor metric learning.
Large Margin Nearest Neighbor Metric Learning
(:py:class:`LMNN <metric_learn.lmnn.LMNN>`)

`LMNN` learns a Mahanalobis distance metric in the kNN classification
setting using semidefinite programming. The learned metric attempts to keep
k-nearest neighbors in the same class, while keeping examples from different
classes separated by a large margin. This algorithm makes no assumptions about
`LMNN` learns a Mahalanobis distance metric in the kNN classification
setting. The learned metric attempts to keep close k-nearest neighbors
from the same class, while keeping examples from different classes
separated by a large margin. This algorithm makes no assumptions about
the distribution of the data.

The distance is learned by solving the following optimization problem:

.. math::

\min_\mathbf{L}\sum_{i, j}\eta_{ij}||\mathbf{L(x_i-x_j)}||^2 +
c\sum_{i, j, l}\eta_{ij}(1-y_{il})[1+||\mathbf{L(x_i-x_j)}||^2-||
\mathbf{L(x_i-x_l)}||^2]_+

where :math:`\mathbf{x}_i` is a data point, :math:`\mathbf{x}_j` is one
of its k-nearest neighbors sharing the same label, and :math:`\mathbf{x}_l`
are all the other instances within that region with different labels.
:math:`\eta_{ij}, y_{il} \in \{0, 1\}` are indicator variables:
:math:`\eta_{ij} = 1` indicates that :math:`\mathbf{x}_{j}` is one of the
k-nearest neighbors (with the same label) of :math:`\mathbf{x}_{i}`,
:math:`y_{il} = 0` indicates that :math:`\mathbf{x}_{i}` and
:math:`\mathbf{x}_{l}` belong to different classes, and
:math:`[\cdot]_+ = \max(0, \cdot)` is the hinge loss.
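
For illustration only (this is not part of the PR, nor the metric-learn API),
the objective above can be evaluated directly for a fixed linear map
:math:`\mathbf{L}` with a few lines of NumPy; picking the target neighbors by
plain Euclidean distance in the input space is a simplifying convention here::

    import numpy as np

    def lmnn_loss(L, X, y, k=3, c=1.0):
        """Evaluate the LMNN objective above for a fixed transformation L."""
        n = len(X)
        # target neighbors (eta_ij = 1) are chosen in the input space
        d2_in = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        Xt = X @ L.T
        d2 = ((Xt[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)  # ||L(x_i - x_j)||^2
        pull, push = 0.0, 0.0
        for i in range(n):
            same = np.where((y == y[i]) & (np.arange(n) != i))[0]
            targets = same[np.argsort(d2_in[i, same])[:k]]
            impostors = np.where(y != y[i])[0]                  # y_il = 0
            for j in targets:
                pull += d2[i, j]
                # hinge penalty for differently-labeled points inside the margin
                push += np.maximum(0.0, 1 + d2[i, j] - d2[i, impostors]).sum()
        return pull + c * push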

.. topic:: Example Code:

::
@@ -80,16 +100,44 @@ The two implementations differ slightly, and the C++ version is more complete.
-margin -nearest-neighbor-classification>`_ Kilian Q. Weinberger, John
Blitzer, Lawrence K. Saul

.. _nca:

Member commented: Same here

Contributor (Author) replied: done

NCA
---

Neighborhood Components Analysis (`NCA`) is a distance metric learning
algorithm which aims to improve the accuracy of nearest neighbors
classification compared to the standard Euclidean distance. The algorithm
directly maximizes a stochastic variant of the leave-one-out k-nearest
neighbors (KNN) score on the training set. It can also learn a low-dimensional
linear embedding of data that can be used for data visualization and fast
classification.
Neighborhood Components Analysis (:py:class:`NCA <metric_learn.nca.NCA>`)

`NCA` is a distance metric learning algorithm which aims to improve the
accuracy of nearest neighbors classification compared to the standard
Euclidean distance. The algorithm directly maximizes a stochastic variant
of the leave-one-out k-nearest neighbors (KNN) score on the training set.
It can also learn a low-dimensional linear transformation of data that can
be used for data visualization and fast classification.

`NCA` uses the decomposition :math:`\mathbf{M} = \mathbf{L}^T\mathbf{L}` and
defines the probability :math:`p_{ij}` that :math:`\mathbf{x}_i` selects
:math:`\mathbf{x}_j` as its neighbor, computed as a softmax over the
Mahalanobis distances:

.. math::

p_{ij} = \frac{\exp(-|| \mathbf{Lx}_i - \mathbf{Lx}_j ||_2^2)}
{\sum_{l\neq i}\exp(-||\mathbf{Lx}_i - \mathbf{Lx}_l||_2^2)},
\qquad p_{ii}=0

Then the probability that :math:`\mathbf{x}_i` will be correctly classified
by the stochastic nearest neighbors rule is:

.. math::

p_{i} = \sum_{j:j\neq i, y_j=y_i}p_{ij}
Member commented: again here, it does not show properly

Member replied: it works for me too here


The optimization problem is to find the matrix :math:`\mathbf{L}` that
maximizes the sum of the probabilities of correct classification:

.. math::

\mathbf{L} = \underset{\mathbf{L}}{\arg\max}\sum_i p_i
Member commented: argmax does not show properly. Maybe try
\underset{\mathbf{L}}{\arg\max}

Member replied: for me it shows this, is it not the right thing?
[screenshot]

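As a minimal sketch of the quantities defined above (not the metric-learn
implementation itself), the probabilities :math:`p_{ij}` and the objective
:math:`\sum_i p_i` can be computed for a fixed :math:`\mathbf{L}` as follows;
a real implementation would use a numerically stable softmax::

    import numpy as np

    def nca_objective(L, X, y):
        """Compute sum_i p_i for a fixed transformation L."""
        Xt = X @ L.T
        d2 = ((Xt[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)                 # enforces p_ii = 0
        expd = np.exp(-d2)
        p = expd / expd.sum(axis=1, keepdims=True)   # p_ij, normalized over l != i
        same = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)
        return (p * same).sum()                      # sum_i p_i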

.. topic:: Example Code:

@@ -116,16 +164,55 @@ classification.
.. [2] Wikipedia entry on Neighborhood Components Analysis
https://en.wikipedia.org/wiki/Neighbourhood_components_analysis

.. _lfda:

Member commented: Same here

Contributor (Author) replied: done

LFDA
----

Local Fisher Discriminant Analysis (LFDA)
Local Fisher Discriminant Analysis (:py:class:`LFDA <metric_learn.lfda.LFDA>`)

`LFDA` is a linear supervised dimensionality reduction method. It is
particularly useful when dealing with multimodality, where one ore more classes
particularly useful when dealing with multi-modality, where one or more classes
consist of separate clusters in input space. The core optimization problem of
LFDA is solved as a generalized eigenvalue problem.


The algorithm defines the Fisher local within-class and between-class scatter
matrices :math:`\mathbf{S}^{(w)}` and :math:`\mathbf{S}^{(b)}` in a pairwise
fashion:

.. math::

\mathbf{S}^{(w)} = \frac{1}{2}\sum_{i,j=1}^nW_{ij}^{(w)}(\mathbf{x}_i -
\mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T,\\
\mathbf{S}^{(b)} = \frac{1}{2}\sum_{i,j=1}^nW_{ij}^{(b)}(\mathbf{x}_i -
\mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T,\\

where

.. math::

W_{ij}^{(w)} = \left\{\begin{aligned}0 \qquad y_i\neq y_j \\
\,\,\mathbf{A}_{i,j}/n_l \qquad y_i = y_j\end{aligned}\right.\\
W_{ij}^{(b)} = \left\{\begin{aligned}1/n \qquad y_i\neq y_j \\
\,\,\mathbf{A}_{i,j}(1/n-1/n_l) \qquad y_i = y_j\end{aligned}\right.\\

here :math:`\mathbf{A}_{i,j}` is the :math:`(i,j)`-th entry of the affinity
matrix :math:`\mathbf{A}`, which can be calculated with local scaling methods,
:math:`n` is the total number of samples, and :math:`n_l` is the number of
samples in class :math:`l`.
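
A common choice for :math:`\mathbf{A}` is the local-scaling affinity; the
sketch below illustrates that choice (the neighbor index ``k=7`` is a
conventional default, not something specified in this document)::

    import numpy as np

    def local_scaling_affinity(X, k=7):
        """Affinity A_ij = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j))."""
        d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        sigma = np.sort(d, axis=1)[:, k]   # distance to the k-th nearest neighbor
        return np.exp(-d ** 2 / (sigma[:, None] * sigma[None, :]))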

The learning problem then becomes finding the LFDA transformation matrix
:math:`\mathbf{T}_{LFDA}`:

.. math::

\mathbf{T}_{LFDA} = \arg\max_\mathbf{T}
[\text{tr}((\mathbf{T}^T\mathbf{S}^{(w)}
\mathbf{T})^{-1}\mathbf{T}^T\mathbf{S}^{(b)}\mathbf{T})]

That is, LFDA looks for a transformation matrix :math:`\mathbf{T}` such that
nearby data pairs in the same class are brought close together and data pairs
in different classes are pushed apart, while pairs from the same class that
are already far apart are not forced to be close.
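
The generalized eigenvalue problem above can be sketched in a few lines of
NumPy/SciPy for a precomputed affinity matrix ``A`` (for instance from the
local-scaling sketch above); this is an illustration of the formulas, not the
metric-learn implementation, and the small ridge added to
:math:`\mathbf{S}^{(w)}` is just a numerical convenience::

    import numpy as np
    from scipy.linalg import eigh

    def lfda_components(X, y, A, dim):
        """Solve max tr((T^T S^(w) T)^{-1} T^T S^(b) T) via eigh(S_b, S_w)."""
        n, d = X.shape
        Ww = np.zeros((n, n))
        Wb = np.full((n, n), 1.0 / n)
        for label in np.unique(y):
            idx = np.where(y == label)[0]
            n_l = len(idx)
            block = np.ix_(idx, idx)
            Ww[block] = A[block] / n_l
            Wb[block] = A[block] * (1.0 / n - 1.0 / n_l)
        # 1/2 sum_ij W_ij (x_i - x_j)(x_i - x_j)^T  ==  X^T (D - W) X
        Sw = X.T @ (np.diag(Ww.sum(axis=1)) - Ww) @ X + 1e-9 * np.eye(d)
        Sb = X.T @ (np.diag(Wb.sum(axis=1)) - Wb) @ X
        vals, vecs = eigh(Sb, Sw)                       # generalized eigenvectors
        return vecs[:, np.argsort(vals)[::-1][:dim]]    # top-`dim` columns as T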

.. topic:: Example Code:

::
@@ -151,17 +238,50 @@ LFDA is solved as a generalized eigenvalue problem.
<https://gastrograph.com/resources/whitepapers/local-fisher
-discriminant-analysis-on-beer-style-clustering.html#>`_ Yuan Tang.

.. _mlkr:

Member commented: Same here

Contributor (Author) replied: done

MLKR
----

Metric Learning for Kernel Regression.
Metric Learning for Kernel Regression (:py:class:`MLKR <metric_learn.mlkr.MLKR>`)

`MLKR` is an algorithm for supervised metric learning, which learns a
distance function by directly minimising the leave-one-out regression error.
distance function by directly minimizing the leave-one-out regression error.
This algorithm can also be viewed as a supervised variation of PCA and can be
used for dimensionality reduction and high dimensional data visualization.

Theoretically, `MLKR` can be applied with many types of kernel functions and
distance metrics. The exposition here focuses on a particular instance, the
Gaussian kernel combined with a Mahalanobis metric, as used in the original
empirical study. The Gaussian kernel is denoted as:

.. math::

k_{ij} = \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{d(\mathbf{x}_i,
\mathbf{x}_j)}{\sigma^2})

where :math:`d(\cdot, \cdot)` is the squared distance under a given metric.
In the Mahalanobis case it is :math:`d(\mathbf{x}_i,
\mathbf{x}_j) = ||\mathbf{A}(\mathbf{x}_i - \mathbf{x}_j)||^2`, where the
transformation matrix :math:`\mathbf{A}` is derived from the decomposition
of the Mahalanobis matrix :math:`\mathbf{M=A^TA}`.

Since :math:`\sigma^2` can be absorbed into :math:`d(\cdot, \cdot)`, we set
:math:`\sigma^2=1` for simplicity. The loss function is the cumulative
leave-one-out quadratic regression error over the training samples:

.. math::

\mathcal{L} = \sum_i(y_i - \hat{y}_i)^2

where the prediction :math:`\hat{y}_i` is obtained by kernel regression, as a
weighted average of the targets of all the other training samples:

.. math::

\hat{y}_i = \frac{\sum_{j\neq i}y_jk_{ij}}{\sum_{j\neq i}k_{ij}}
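
Putting the two formulas together, the leave-one-out loss can be evaluated for
a fixed matrix :math:`\mathbf{A}` as in the short sketch below (illustrative
only, not the library internals); the constant :math:`1/(\sqrt{2\pi}\sigma)`
cancels between numerator and denominator and is therefore dropped::

    import numpy as np

    def mlkr_loss(A, X, y):
        """Leave-one-out quadratic regression error for a fixed A (sigma^2 = 1)."""
        Xt = X @ A.T
        d = ((Xt[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)   # d(x_i, x_j)
        K = np.exp(-d)                # Gaussian kernel, constant factor dropped
        np.fill_diagonal(K, 0.0)      # leave-one-out: exclude j = i
        y_hat = (K @ y) / K.sum(axis=1)
        return ((y - y_hat) ** 2).sum()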

.. topic:: Example Code:

::
@@ -193,7 +313,6 @@ generated from the labels information and passed to the underlying algorithm.
.. todo:: add more details about that (see issue `<https://github
.com/metric-learn/metric-learn/issues/135>`_)


.. topic:: Example Code:

::