Add description of algorithms to the doc #178


Merged 10 commits on May 3, 2019

3 changes: 0 additions & 3 deletions .gitignore
@@ -7,6 +7,3 @@ htmlcov/
.cache/
.pytest_cache/
doc/auto_examples/*
coverage
.coverage
.coverage*
153 changes: 136 additions & 17 deletions doc/supervised.rst
@@ -41,17 +41,37 @@ the covariance matrix of the input data. This is a simple baseline method.

.. [1] On the Generalized Distance in Statistics, P.C.Mahalanobis, 1936

.. _lmnn:

wdevazelhes (Member) commented on Apr 18, 2019:

You should leave the reference for LMNN here (right now the link from the
docstring to the user guide doesn't work), this way:

.. _lmnn:

LMNN
--------

Contributor (Author) replied: done

LMNN
-----

Large-margin nearest neighbor metric learning.
Large Margin Nearest Neighbor Metric Learning
(:py:class:`LMNN <metric_learn.lmnn.LMNN>`)

`LMNN` learns a Mahanalobis distance metric in the kNN classification
setting using semidefinite programming. The learned metric attempts to keep
k-nearest neighbors in the same class, while keeping examples from different
classes separated by a large margin. This algorithm makes no assumptions about
`LMNN` learns a Mahalanobis distance metric in the kNN classification
setting. The learned metric attempts to keep close k-nearest neighbors
from the same class, while keeping examples from different classes
separated by a large margin. This algorithm makes no assumptions about
the distribution of the data.

The distance is learned by solving the following optimization problem:

.. math::

\min_\mathbf{L}\sum_{i, j}\eta_{ij}||\mathbf{L(x_i-x_j)}||^2 +
c\sum_{i, j, l}\eta_{ij}(1-y_{il})[1+||\mathbf{L(x_i-x_j)}||^2-||
\mathbf{L(x_i-x_l)}||^2]_+

where :math:`\mathbf{x}_i` is a data point, :math:`\mathbf{x}_j` is one
of its k-nearest neighbors sharing the same label, and :math:`\mathbf{x}_l`
are all the other instances within that region with different labels.
:math:`\eta_{ij}, y_{il} \in \{0, 1\}` are indicator variables:
:math:`\eta_{ij} = 1` indicates that :math:`\mathbf{x}_{j}` is one of the
k-nearest neighbors (with the same label) of :math:`\mathbf{x}_{i}`,
:math:`y_{il} = 0` indicates that :math:`\mathbf{x}_{i}` and
:math:`\mathbf{x}_{l}` belong to different classes, and
:math:`[\cdot]_+ = \max(0, \cdot)` is the hinge loss.
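
For illustration only (this is not part of the PR, nor the metric-learn API),
the objective above can be evaluated directly for a fixed linear map
:math:`\mathbf{L}` with a few lines of NumPy; picking the target neighbors by
plain Euclidean distance in the input space is a simplifying convention here::

    import numpy as np

    def lmnn_loss(L, X, y, k=3, c=1.0):
        """Evaluate the LMNN objective above for a fixed transformation L."""
        n = len(X)
        # target neighbors (eta_ij = 1) are chosen in the input space
        d2_in = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        Xt = X @ L.T
        d2 = ((Xt[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)  # ||L(x_i - x_j)||^2
        pull, push = 0.0, 0.0
        for i in range(n):
            same = np.where((y == y[i]) & (np.arange(n) != i))[0]
            targets = same[np.argsort(d2_in[i, same])[:k]]
            impostors = np.where(y != y[i])[0]                  # y_il = 0
            for j in targets:
                pull += d2[i, j]
                # hinge penalty for differently-labeled points inside the margin
                push += np.maximum(0.0, 1 + d2[i, j] - d2[i, impostors]).sum()
        return pull + c * push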

.. topic:: Example Code:

::
@@ -80,16 +100,44 @@ The two implementations differ slightly, and the C++ version is more complete.
-margin -nearest-neighbor-classification>`_ Kilian Q. Weinberger, John
Blitzer, Lawrence K. Saul

.. _nca:

Member commented: Same here

Contributor (Author) replied: done

NCA
---

Neighborhood Components Analysis (`NCA`) is a distance metric learning
algorithm which aims to improve the accuracy of nearest neighbors
classification compared to the standard Euclidean distance. The algorithm
directly maximizes a stochastic variant of the leave-one-out k-nearest
neighbors (KNN) score on the training set. It can also learn a low-dimensional
linear embedding of data that can be used for data visualization and fast
classification.
Neighborhood Components Analysis (:py:class:`NCA <metric_learn.nca.NCA>`)

`NCA` is a distance metric learning algorithm which aims to improve the
accuracy of nearest neighbors classification compared to the standard
Euclidean distance. The algorithm directly maximizes a stochastic variant
of the leave-one-out k-nearest neighbors (KNN) score on the training set.
It can also learn a low-dimensional linear transformation of data that can
be used for data visualization and fast classification.

`NCA` uses the decomposition :math:`\mathbf{M} = \mathbf{L}^T\mathbf{L}` and
defines the probability :math:`p_{ij}` that :math:`\mathbf{x}_i` selects
:math:`\mathbf{x}_j` as its neighbor, computed as a softmax over the
Mahalanobis distances:

.. math::

p_{ij} = \frac{\exp(-|| \mathbf{Lx}_i - \mathbf{Lx}_j ||_2^2)}
{\sum_{l\neq i}\exp(-||\mathbf{Lx}_i - \mathbf{Lx}_l||_2^2)},
\qquad p_{ii}=0

Then the probability that :math:`\mathbf{x}_i` will be correctly classified
by the stochastic nearest neighbors rule is:

.. math::

p_{i} = \sum_{j:j\neq i, y_j=y_i}p_{ij}
Member commented: again here, it does not show properly

Member replied: it works for me too here


The optimization problem is to find the matrix :math:`\mathbf{L}` that
maximizes the sum of the probabilities of correct classification:

.. math::

\mathbf{L} = \underset{\mathbf{L}}{\arg\max}\sum_i p_i
Member commented: argmax does not show properly. Maybe try
\underset{\mathbf{L}}{\arg\max}

Member replied: for me it shows this, is it not the right thing?
[screenshot]

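As a minimal sketch of the quantities defined above (not the metric-learn
implementation itself), the probabilities :math:`p_{ij}` and the objective
:math:`\sum_i p_i` can be computed for a fixed :math:`\mathbf{L}` as follows;
a real implementation would use a numerically stable softmax::

    import numpy as np

    def nca_objective(L, X, y):
        """Compute sum_i p_i for a fixed transformation L."""
        Xt = X @ L.T
        d2 = ((Xt[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)                 # enforces p_ii = 0
        expd = np.exp(-d2)
        p = expd / expd.sum(axis=1, keepdims=True)   # p_ij, normalized over l != i
        same = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)
        return (p * same).sum()                      # sum_i p_i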

.. topic:: Example Code:

@@ -116,16 +164,55 @@ classification.
.. [2] Wikipedia entry on Neighborhood Components Analysis
https://en.wikipedia.org/wiki/Neighbourhood_components_analysis

.. _lfda:

Member commented: Same here

Contributor (Author) replied: done

LFDA
----

Local Fisher Discriminant Analysis (LFDA)
Local Fisher Discriminant Analysis (:py:class:`LFDA <metric_learn.lfda.LFDA>`)

`LFDA` is a linear supervised dimensionality reduction method. It is
particularly useful when dealing with multimodality, where one ore more classes
particularly useful when dealing with multi-modality, where one or more classes
consist of separate clusters in input space. The core optimization problem of
LFDA is solved as a generalized eigenvalue problem.


The algorithm defines the Fisher local within-class and between-class scatter
matrices :math:`\mathbf{S}^{(w)}` and :math:`\mathbf{S}^{(b)}` in a pairwise
fashion:

.. math::

\mathbf{S}^{(w)} = \frac{1}{2}\sum_{i,j=1}^nW_{ij}^{(w)}(\mathbf{x}_i -
\mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T,\\
\mathbf{S}^{(b)} = \frac{1}{2}\sum_{i,j=1}^nW_{ij}^{(b)}(\mathbf{x}_i -
\mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T,\\

where

.. math::

W_{ij}^{(w)} = \left\{\begin{aligned}0 \qquad y_i\neq y_j \\
\,\,\mathbf{A}_{i,j}/n_l \qquad y_i = y_j\end{aligned}\right.\\
W_{ij}^{(b)} = \left\{\begin{aligned}1/n \qquad y_i\neq y_j \\
\,\,\mathbf{A}_{i,j}(1/n-1/n_l) \qquad y_i = y_j\end{aligned}\right.\\

here :math:`\mathbf{A}_{i,j}` is the :math:`(i,j)`-th entry of the affinity
matrix :math:`\mathbf{A}`, which can be calculated with local scaling methods,
:math:`n` is the total number of samples, and :math:`n_l` is the number of
samples in class :math:`l`.
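
A common choice for :math:`\mathbf{A}` is the local-scaling affinity; the
sketch below illustrates that choice (the neighbor index ``k=7`` is a
conventional default, not something specified in this document)::

    import numpy as np

    def local_scaling_affinity(X, k=7):
        """Affinity A_ij = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j))."""
        d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        sigma = np.sort(d, axis=1)[:, k]   # distance to the k-th nearest neighbor
        return np.exp(-d ** 2 / (sigma[:, None] * sigma[None, :]))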

The learning problem then becomes finding the LFDA transformation matrix
:math:`\mathbf{T}_{LFDA}`:

.. math::

\mathbf{T}_{LFDA} = \arg\max_\mathbf{T}
[\text{tr}((\mathbf{T}^T\mathbf{S}^{(w)}
\mathbf{T})^{-1}\mathbf{T}^T\mathbf{S}^{(b)}\mathbf{T})]

That is, LFDA looks for a transformation matrix :math:`\mathbf{T}` such that
nearby data pairs in the same class are brought close together and data pairs
in different classes are pushed apart, while pairs from the same class that
are already far apart are not forced to be close.
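
The generalized eigenvalue problem above can be sketched in a few lines of
NumPy/SciPy for a precomputed affinity matrix ``A`` (for instance from the
local-scaling sketch above); this is an illustration of the formulas, not the
metric-learn implementation, and the small ridge added to
:math:`\mathbf{S}^{(w)}` is just a numerical convenience::

    import numpy as np
    from scipy.linalg import eigh

    def lfda_components(X, y, A, dim):
        """Solve max tr((T^T S^(w) T)^{-1} T^T S^(b) T) via eigh(S_b, S_w)."""
        n, d = X.shape
        Ww = np.zeros((n, n))
        Wb = np.full((n, n), 1.0 / n)
        for label in np.unique(y):
            idx = np.where(y == label)[0]
            n_l = len(idx)
            block = np.ix_(idx, idx)
            Ww[block] = A[block] / n_l
            Wb[block] = A[block] * (1.0 / n - 1.0 / n_l)
        # 1/2 sum_ij W_ij (x_i - x_j)(x_i - x_j)^T  ==  X^T (D - W) X
        Sw = X.T @ (np.diag(Ww.sum(axis=1)) - Ww) @ X + 1e-9 * np.eye(d)
        Sb = X.T @ (np.diag(Wb.sum(axis=1)) - Wb) @ X
        vals, vecs = eigh(Sb, Sw)                       # generalized eigenvectors
        return vecs[:, np.argsort(vals)[::-1][:dim]]    # top-`dim` columns as T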

.. topic:: Example Code:

::
@@ -151,17 +238,50 @@ LFDA is solved as a generalized eigenvalue problem.
<https://gastrograph.com/resources/whitepapers/local-fisher
-discriminant-analysis-on-beer-style-clustering.html#>`_ Yuan Tang.

.. _mlkr:

Member commented: Same here

Contributor (Author) replied: done

MLKR
----

Metric Learning for Kernel Regression.
Metric Learning for Kernel Regression (:py:class:`MLKR <metric_learn.mlkr.MLKR>`)

`MLKR` is an algorithm for supervised metric learning, which learns a
distance function by directly minimising the leave-one-out regression error.
distance function by directly minimizing the leave-one-out regression error.
This algorithm can also be viewed as a supervised variation of PCA and can be
used for dimensionality reduction and high dimensional data visualization.

Theoretically, `MLKR` can be applied with many types of kernel functions and
distance metrics. The exposition here focuses on a particular instance, the
Gaussian kernel combined with a Mahalanobis metric, as used in the original
empirical study. The Gaussian kernel is denoted as:

.. math::

k_{ij} = \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{d(\mathbf{x}_i,
\mathbf{x}_j)}{\sigma^2})

where :math:`d(\cdot, \cdot)` is the squared distance under a given metric.
In the Mahalanobis case it is :math:`d(\mathbf{x}_i,
\mathbf{x}_j) = ||\mathbf{A}(\mathbf{x}_i - \mathbf{x}_j)||^2`, where the
transformation matrix :math:`\mathbf{A}` is derived from the decomposition
of the Mahalanobis matrix :math:`\mathbf{M=A^TA}`.

Since :math:`\sigma^2` can be absorbed into :math:`d(\cdot, \cdot)`, we set
:math:`\sigma^2=1` for simplicity. The loss function is the cumulative
leave-one-out quadratic regression error over the training samples:

.. math::

\mathcal{L} = \sum_i(y_i - \hat{y}_i)^2

where the prediction :math:`\hat{y}_i` is obtained by kernel regression, as a
weighted average of the targets of all the other training samples:

.. math::

\hat{y}_i = \frac{\sum_{j\neq i}y_jk_{ij}}{\sum_{j\neq i}k_{ij}}
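
Putting the two formulas together, the leave-one-out loss can be evaluated for
a fixed matrix :math:`\mathbf{A}` as in the short sketch below (illustrative
only, not the library internals); the constant :math:`1/(\sqrt{2\pi}\sigma)`
cancels between numerator and denominator and is therefore dropped::

    import numpy as np

    def mlkr_loss(A, X, y):
        """Leave-one-out quadratic regression error for a fixed A (sigma^2 = 1)."""
        Xt = X @ A.T
        d = ((Xt[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)   # d(x_i, x_j)
        K = np.exp(-d)                # Gaussian kernel, constant factor dropped
        np.fill_diagonal(K, 0.0)      # leave-one-out: exclude j = i
        y_hat = (K @ y) / K.sum(axis=1)
        return ((y - y_hat) ** 2).sum()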

.. topic:: Example Code:

::
@@ -193,7 +313,6 @@ generated from the labels information and passed to the underlying algorithm.
.. todo:: add more details about that (see issue `<https://github
.com/metric-learn/metric-learn/issues/135>`_)


.. topic:: Example Code:

::