diff --git a/.gitignore b/.gitignore
index a51c1a82..449f70ea 100644
--- a/.gitignore
+++ b/.gitignore
@@ -7,6 +7,3 @@ htmlcov/
 .cache/
 .pytest_cache/
 doc/auto_examples/*
-coverage
-.coverage
-.coverage*
diff --git a/doc/supervised.rst b/doc/supervised.rst
index 26934a47..83bf4449 100644
--- a/doc/supervised.rst
+++ b/doc/supervised.rst
@@ -41,17 +41,37 @@ the covariance matrix of the input data. This is a simple baseline method.

 .. [1] On the Generalized Distance in Statistics, P.C.Mahalanobis, 1936

+.. _lmnn:
+
 LMNN
 -----

-Large-margin nearest neighbor metric learning.
+Large Margin Nearest Neighbor Metric Learning
+(:py:class:`LMNN `)

-`LMNN` learns a Mahanalobis distance metric in the kNN classification
-setting using semidefinite programming. The learned metric attempts to keep
-k-nearest neighbors in the same class, while keeping examples from different
-classes separated by a large margin. This algorithm makes no assumptions about
+`LMNN` learns a Mahalanobis distance metric in the kNN classification
+setting. The learned metric attempts to keep close k-nearest neighbors
+from the same class, while keeping examples from different classes
+separated by a large margin. This algorithm makes no assumptions about
 the distribution of the data.

+The distance is learned by solving the following optimization problem:
+
+.. math::
+
+    \min_\mathbf{L}\sum_{i, j}\eta_{ij}||\mathbf{L(x_i-x_j)}||^2 +
+    c\sum_{i, j, l}\eta_{ij}(1-y_{ij})[1+||\mathbf{L(x_i-x_j)}||^2-||
+    \mathbf{L(x_i-x_l)}||^2]_+
+
+where :math:`\mathbf{x}_i` is a data point, :math:`\mathbf{x}_j` is one of
+its k-nearest neighbors sharing the same label, and the :math:`\mathbf{x}_l`
+are all the other instances within that region with different labels.
+:math:`\eta_{ij}, y_{ij} \in \{0, 1\}` are both indicator variables:
+:math:`\eta_{ij} = 1` indicates that :math:`\mathbf{x}_{j}` is one of the
+k-nearest neighbors (with the same label) of :math:`\mathbf{x}_{i}`, and
+:math:`y_{ij} = 0` indicates that :math:`\mathbf{x}_{i}` and
+:math:`\mathbf{x}_{j}` belong to different classes.
+:math:`[\cdot]_+ = \max(0, \cdot)` is the Hinge loss.
+
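+As a rough illustration (this is not the library's optimizer; the helper name
+``lmnn_objective`` and its inputs are chosen here for exposition only), the
+objective above can be evaluated with NumPy for a fixed :math:`\mathbf{L}`,
+a list of target-neighbor pairs and a list of impostor triplets:
+
+::
+
+    import numpy as np
+
+    def lmnn_objective(L, X, target_pairs, impostor_triplets, c=1.0):
+        # pull term: squared distances to target neighbors of the same class
+        pull = sum(np.sum((L @ (X[i] - X[j]))**2) for i, j in target_pairs)
+        # push term: hinge on differently-labeled points invading the margin
+        push = 0.0
+        for i, j, l in impostor_triplets:
+            d_ij = np.sum((L @ (X[i] - X[j]))**2)
+            d_il = np.sum((L @ (X[i] - X[l]))**2)
+            push += max(0.0, 1.0 + d_ij - d_il)
+        return pull + c * push
+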
 .. topic:: Example Code:

 ::

@@ -80,16 +100,44 @@ The two implementations differ slightly, and the C++ version is more complete.
 -margin
 -nearest-neighbor-classification>`_ Kilian Q. Weinberger, John Blitzer,
 Lawrence K. Saul

+.. _nca:
+
 NCA
 ---

-Neighborhood Components Analysis (`NCA`) is a distance metric learning
-algorithm which aims to improve the accuracy of nearest neighbors
-classification compared to the standard Euclidean distance. The algorithm
-directly maximizes a stochastic variant of the leave-one-out k-nearest
-neighbors (KNN) score on the training set. It can also learn a low-dimensional
-linear embedding of data that can be used for data visualization and fast
-classification.
+Neighborhood Components Analysis (:py:class:`NCA `)
+
+`NCA` is a distance metric learning algorithm which aims to improve the
+accuracy of nearest neighbors classification compared to the standard
+Euclidean distance. The algorithm directly maximizes a stochastic variant
+of the leave-one-out k-nearest neighbors (KNN) score on the training set.
+It can also learn a low-dimensional linear transformation of data that can
+be used for data visualization and fast classification.
+
+`NCA` uses the decomposition :math:`\mathbf{M} = \mathbf{L}^T\mathbf{L}` and
+defines the probability :math:`p_{ij}` that :math:`\mathbf{x}_j` is the
+neighbor of :math:`\mathbf{x}_i` by calculating the softmax likelihood of
+the Mahalanobis distance:
+
+.. math::
+
+    p_{ij} = \frac{\exp(-|| \mathbf{Lx}_i - \mathbf{Lx}_j ||_2^2)}
+    {\sum_{l\neq i}\exp(-||\mathbf{Lx}_i - \mathbf{Lx}_l||_2^2)},
+    \qquad p_{ii}=0
+
+Then the probability that :math:`\mathbf{x}_i` will be correctly classified
+by the stochastic nearest neighbors rule is:
+
+.. math::
+
+    p_{i} = \sum_{j:j\neq i, y_j=y_i}p_{ij}
+
+The optimization problem is to find the matrix :math:`\mathbf{L}` that
+maximizes the sum of the probabilities of being correctly classified:
+
+.. math::
+
+    \mathbf{L} = \text{argmax}\sum_i p_i
+
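+The following NumPy sketch (illustrative only, not the library's
+implementation) computes :math:`p_{ij}`, :math:`p_i` and the objective for a
+given :math:`\mathbf{L}`:
+
+::
+
+    import numpy as np
+
+    def nca_objective(L, X, y):
+        LX = X @ L.T
+        d2 = np.sum((LX[:, None, :] - LX[None, :, :])**2, axis=-1)
+        np.fill_diagonal(d2, np.inf)          # enforces p_ii = 0
+        p = np.exp(-d2)
+        p /= p.sum(axis=1, keepdims=True)     # p_ij
+        p_i = np.where(y[:, None] == y[None, :], p, 0.0).sum(axis=1)
+        return p_i.sum()                      # quantity maximized over L
+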
 .. topic:: Example Code:

@@ -116,16 +164,55 @@ classification.
 .. [2] Wikipedia entry on Neighborhood Components Analysis
    https://en.wikipedia.org/wiki/Neighbourhood_components_analysis

+.. _lfda:
+
 LFDA
 ----

-Local Fisher Discriminant Analysis (LFDA)
+Local Fisher Discriminant Analysis (:py:class:`LFDA `)

 `LFDA` is a linear supervised dimensionality reduction method. It is
-particularly useful when dealing with multimodality, where one ore more classes
+particularly useful when dealing with multi-modality, where one or more classes
 consist of separate clusters in input space. The core optimization problem of
 LFDA is solved as a generalized eigenvalue problem.
+
+The algorithm defines the Fisher local within-/between-class scatter matrices
+:math:`\mathbf{S}^{(w)}/ \mathbf{S}^{(b)}` in a pairwise fashion:
+
+.. math::
+
+    \mathbf{S}^{(w)} = \frac{1}{2}\sum_{i,j=1}^nW_{ij}^{(w)}(\mathbf{x}_i -
+    \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T,\\
+    \mathbf{S}^{(b)} = \frac{1}{2}\sum_{i,j=1}^nW_{ij}^{(b)}(\mathbf{x}_i -
+    \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T,\\
+
+where
+
+.. math::
+
+    W_{ij}^{(w)} = \left\{\begin{aligned}0 \qquad y_i\neq y_j \\
+    \,\,\mathbf{A}_{i,j}/n_l \qquad y_i = y_j\end{aligned}\right.\\
+    W_{ij}^{(b)} = \left\{\begin{aligned}1/n \qquad y_i\neq y_j \\
+    \,\,\mathbf{A}_{i,j}(1/n-1/n_l) \qquad y_i = y_j\end{aligned}\right.\\
+
+here :math:`\mathbf{A}_{i,j}` is the :math:`(i,j)`-th entry of the affinity
+matrix :math:`\mathbf{A}`, which can be calculated with local scaling methods.
+
+The learning problem then becomes deriving the LFDA transformation matrix
+:math:`\mathbf{T}_{LFDA}`:
+
+.. math::
+
+    \mathbf{T}_{LFDA} = \arg\max_\mathbf{T}
+    [\text{tr}((\mathbf{T}^T\mathbf{S}^{(w)}
+    \mathbf{T})^{-1}\mathbf{T}^T\mathbf{S}^{(b)}\mathbf{T})]
+
+That is, it is looking for a transformation matrix :math:`\mathbf{T}` such that
+nearby data pairs in the same class are made close and the data pairs in
+different classes are separated from each other; far apart data pairs in the
+same class are not imposed to be close.
+
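+For concreteness, here is an illustrative NumPy sketch of the scatter matrices
+(a plain Gaussian affinity is used in place of the local scaling method, and
+the class size :math:`n_l` is read from the labels; this is not the library's
+implementation):
+
+::
+
+    import numpy as np
+
+    def lfda_scatter_matrices(X, y, sigma=1.0):
+        n = X.shape[0]
+        d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
+        A = np.exp(-d2 / sigma**2)            # affinity (simple Gaussian here)
+        same = (y[:, None] == y[None, :])
+        n_l = np.array([np.sum(y == c) for c in y])   # class size per sample
+        Ww = np.where(same, A / n_l[:, None], 0.0)
+        Wb = np.where(same, A * (1.0 / n - 1.0 / n_l[:, None]), 1.0 / n)
+        diff = X[:, None, :] - X[None, :, :]
+        Sw = 0.5 * np.einsum('ij,ijk,ijl->kl', Ww, diff, diff)
+        Sb = 0.5 * np.einsum('ij,ijk,ijl->kl', Wb, diff, diff)
+        return Sw, Sb
+
+    # T is then given by the leading generalized eigenvectors of
+    # Sb v = lambda Sw v, e.g. via scipy.linalg.eigh(Sb, Sw)
+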
 .. topic:: Example Code:

 ::

@@ -151,17 +238,50 @@ LFDA is solved as a generalized eigenvalue problem.
 `_ Yuan Tang.

+.. _mlkr:
 MLKR
 ----

-Metric Learning for Kernel Regression.
+Metric Learning for Kernel Regression (:py:class:`MLKR `)

 `MLKR` is an algorithm for supervised metric learning, which learns a
-distance function by directly minimising the leave-one-out regression error.
+distance function by directly minimizing the leave-one-out regression error.
 This algorithm can also be viewed as a supervised variation of PCA and can be
 used for dimensionality reduction and high dimensional data visualization.

+Theoretically, `MLKR` can be applied with many types of kernel functions and
+distance metrics. Here we focus the exposition on the particular combination
+of a Gaussian kernel and a Mahalanobis metric, as these are the ones used
+here. The Gaussian kernel is denoted as:
+
+.. math::
+
+    k_{ij} = \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{d(\mathbf{x}_i,
+    \mathbf{x}_j)}{\sigma^2})
+
+where :math:`d(\cdot, \cdot)` is the squared distance under the chosen metric.
+In the Mahalanobis case it is :math:`d(\mathbf{x}_i,
+\mathbf{x}_j) = ||\mathbf{A}(\mathbf{x}_i - \mathbf{x}_j)||^2`, where the
+transformation matrix :math:`\mathbf{A}` is derived from the decomposition
+of the Mahalanobis matrix :math:`\mathbf{M=A^TA}`.
+
+Since :math:`\sigma^2` can be integrated into :math:`d(\cdot)`, we can set
+:math:`\sigma^2=1` for the sake of simplicity. Here we use the cumulative
+leave-one-out quadratic regression error of the training samples as the
+loss function:
+
+.. math::
+
+    \mathcal{L} = \sum_i(y_i - \hat{y}_i)^2
+
+where the prediction :math:`\hat{y}_i` is derived from kernel regression by
+calculating a weighted average of all the other training samples:
+
+.. math::
+
+    \hat{y}_i = \frac{\sum_{j\neq i}y_jk_{ij}}{\sum_{j\neq i}k_{ij}}
+
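+Put together, the loss can be sketched in a few lines of NumPy (illustrative
+only; the constant factor of the kernel is dropped since it cancels in
+:math:`\hat{y}_i`):
+
+::
+
+    import numpy as np
+
+    def mlkr_loss(A, X, y):
+        AX = X @ A.T
+        d2 = np.sum((AX[:, None, :] - AX[None, :, :])**2, axis=-1)
+        K = np.exp(-d2)              # Gaussian kernel with sigma^2 = 1
+        np.fill_diagonal(K, 0.0)     # leave-one-out: exclude j = i
+        y_hat = (K @ y) / K.sum(axis=1)
+        return np.sum((y - y_hat)**2)
+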
 .. topic:: Example Code:

 ::

@@ -193,7 +313,6 @@ generated from the labels information and passed to the underlying algorithm.

 .. todo:: add more details about that (see issue ``_)

-
 .. topic:: Example Code:

 ::
diff --git a/doc/weakly_supervised.rst b/doc/weakly_supervised.rst
index 6bf6f993..93720ffc 100644
--- a/doc/weakly_supervised.rst
+++ b/doc/weakly_supervised.rst
@@ -190,18 +190,55 @@ See also: `sklearn.calibration`.

 Algorithms
 ==========

+.. _itml:
+
 ITML
 ----

-Information Theoretic Metric Learning, Davis et al., ICML 2007
+Information Theoretic Metric Learning (:py:class:`ITML `)

-`ITML` minimizes the differential relative entropy between two multivariate
-Gaussians under constraints on the distance function, which can be formulated
-into a Bregman optimization problem by minimizing the LogDet divergence subject
-to linear constraints. This algorithm can handle a wide variety of constraints
+`ITML` minimizes the (differential) relative entropy, aka Kullback–Leibler
+divergence, between two multivariate Gaussians subject to constraints on the
+associated Mahalanobis distance, which can be formulated into a Bregman
+optimization problem by minimizing the LogDet divergence subject to
+linear constraints. This algorithm can handle a wide variety of constraints
 and can optionally incorporate a prior on the distance function. Unlike some
-other methods, ITML does not rely on an eigenvalue computation or semi-definite
-programming.
+other methods, `ITML` does not rely on an eigenvalue computation or
+semi-definite programming.
+
+Given a Mahalanobis distance parameterized by :math:`A`, its corresponding
+multivariate Gaussian is denoted as:
+
+.. math::
+
+    p(\mathbf{x}; \mathbf{A}) = \frac{1}{Z}\exp(-\frac{1}{2}d_\mathbf{A}
+    (\mathbf{x}, \mu))
+    = \frac{1}{Z}\exp(-\frac{1}{2}(\mathbf{x} - \mu)^T\mathbf{A}
+    (\mathbf{x} - \mu))
+
+where :math:`Z` is the normalization constant and the inverse of the
+Mahalanobis matrix, :math:`\mathbf{A}^{-1}`, is the covariance of the Gaussian.
+
+Given pairs of similar points :math:`S` and pairs of dissimilar points
+:math:`D`, the distance metric learning problem is to minimize the LogDet
+divergence, which is equivalent to minimizing :math:`\textbf{KL}(p(\mathbf{x};
+\mathbf{A}_0) || p(\mathbf{x}; \mathbf{A}))`:
+
+.. math::
+
+    \min_\mathbf{A} D_{\ell \mathrm{d}}\left(A, A_{0}\right) =
+    \operatorname{tr}\left(A A_{0}^{-1}\right)-\log \operatorname{det}
+    \left(A A_{0}^{-1}\right)-n\\
+    \text{subject to } \quad d_\mathbf{A}(\mathbf{x}_i, \mathbf{x}_j)
+    \leq u \qquad (\mathbf{x}_i, \mathbf{x}_j)\in S \\
+    d_\mathbf{A}(\mathbf{x}_i, \mathbf{x}_j) \geq l \qquad (\mathbf{x}_i,
+    \mathbf{x}_j)\in D
+
+where :math:`u` and :math:`l` are the upper and the lower bounds on the
+distance for similar and dissimilar pairs respectively, :math:`\mathbf{A}_0`
+is the prior distance metric, set to the identity matrix by default, and
+:math:`D_{\ell \mathrm{d}}(\cdot)` is the LogDet divergence.
+
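+As an illustration (not the library's solver), the LogDet divergence and the
+bound constraints can be checked directly with NumPy:
+
+::
+
+    import numpy as np
+
+    def logdet_divergence(A, A0):
+        P = A @ np.linalg.inv(A0)
+        return np.trace(P) - np.linalg.slogdet(P)[1] - A.shape[0]
+
+    def d_A(A, x, y):
+        return (x - y) @ A @ (x - y)
+
+    def within_bounds(A, similar, dissimilar, u, l):
+        # similar / dissimilar are lists of (x_i, x_j) pairs
+        return (all(d_A(A, x, y) <= u for x, y in similar) and
+                all(d_A(A, x, y) >= l for x, y in dissimilar))
+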
 .. topic:: Example Code:

@@ -231,11 +268,124 @@ programming.

 .. [2] Adapted from Matlab code at http://www.cs.utexas.edu/users/pjain/
    itml/

+
+.. _lsml:
+
+LSML
+----
+
+Metric Learning from Relative Comparisons by Minimizing Squared Residual
+(:py:class:`LSML `)
+
+`LSML` proposes a simple, yet effective, algorithm that minimizes a convex
+objective function corresponding to the sum of squared residuals of
+constraints. This algorithm uses constraints in the form of relative
+distance comparisons, which is especially useful when pairwise constraints
+are not natural to obtain and algorithms based on pairwise constraints
+therefore cannot be deployed. Furthermore, its sparsity extension leads to
+more stable estimation when the dimension is high and only a small amount
+of constraints is given.
+
+The loss function for each constraint
+:math:`d(\mathbf{x}_a, \mathbf{x}_b) < d(\mathbf{x}_c, \mathbf{x}_d)` is
+denoted as:
+
+.. math::
+
+    H(d_\mathbf{M}(\mathbf{x}_a, \mathbf{x}_b)
+    - d_\mathbf{M}(\mathbf{x}_c, \mathbf{x}_d))
+
+where :math:`H(\cdot)` is the squared Hinge loss function defined as:
+
+.. math::
+
+    H(x) = \left\{\begin{aligned}0 \qquad x\leq 0 \\
+    \,\,x^2 \qquad x>0\end{aligned}\right.\\
+
+The summed loss function :math:`L(C)` is the simple sum over all constraints
+:math:`C = \{(\mathbf{x}_a , \mathbf{x}_b , \mathbf{x}_c , \mathbf{x}_d)
+: d(\mathbf{x}_a , \mathbf{x}_b) < d(\mathbf{x}_c , \mathbf{x}_d)\}`. The
+original paper suggests a weighted sum, since the confidence or probability
+of each constraint might differ. However, for the sake of simplicity, and
+assuming no extra knowledge is provided, we use the simple sum here, as the
+authors did in their experiments.
+
+The distance metric learning problem then becomes minimizing the summed loss
+function of all constraints plus a regularization term w.r.t. the prior
+knowledge:
+
+.. math::
+
+    \min_\mathbf{M}(D_{ld}(\mathbf{M, M_0}) + \sum_{(\mathbf{x}_a,
+    \mathbf{x}_b, \mathbf{x}_c, \mathbf{x}_d)\in C}H(d_\mathbf{M}(
+    \mathbf{x}_a, \mathbf{x}_b) - d_\mathbf{M}(\mathbf{x}_c, \mathbf{x}_d))\\
+
+where :math:`\mathbf{M}_0` is the prior metric matrix, set to the identity
+by default, and :math:`D_{ld}(\mathbf{\cdot, \cdot})` is the LogDet
+divergence:
+
+.. math::
+
+    D_{ld}(\mathbf{M, M_0}) = \text{tr}(\mathbf{MM_0}) - \text{logdet}
+    (\mathbf{M})
+
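+A minimal NumPy sketch of this objective (illustrative only; the helper name
+``lsml_loss`` is chosen here for exposition) reads:
+
+::
+
+    import numpy as np
+
+    def lsml_loss(M, quadruplets, M0):
+        def d(x, y):
+            return (x - y) @ M @ (x - y)
+        # squared hinge on each comparison d(a, b) < d(c, e)
+        hinge = sum(max(0.0, d(a, b) - d(c, e))**2
+                    for a, b, c, e in quadruplets)
+        # LogDet regularizer towards the prior M0, as defined above
+        reg = np.trace(M @ M0) - np.linalg.slogdet(M)[1]
+        return reg + hinge
+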
+.. topic:: Example Code:
+
+::
+
+    from metric_learn import LSML
+
+    quadruplets = [[[1.2, 7.5], [1.3, 1.5], [6.4, 2.6], [6.2, 9.7]],
+                   [[1.3, 4.5], [3.2, 4.6], [6.2, 5.5], [5.4, 5.4]],
+                   [[3.2, 7.5], [3.3, 1.5], [8.4, 2.6], [8.2, 9.7]],
+                   [[3.3, 4.5], [5.2, 4.6], [8.2, 5.5], [7.4, 5.4]]]
+
+    # we want to make closer points where the first feature is close, and
+    # further if the second feature is close
+
+    lsml = LSML()
+    lsml.fit(quadruplets)
+
+.. topic:: References:
+
+    .. [1] Liu et al.
+       "Metric Learning from Relative Comparisons by Minimizing Squared
+       Residual". ICDM 2012. http://www.cs.ucla.edu/~weiwang/paper/ICDM12.pdf
+
+    .. [2] Adapted from https://gist.github.com/kcarnold/5439917
+
+.. _sdml:
+
 SDML
 ----

-`SDML`: An efficient sparse metric learning in high-dimensional space via
-L1-penalized log-determinant regularization
+Sparse High-Dimensional Metric Learning
+(:py:class:`SDML `)
+
+`SDML` is an efficient sparse metric learning method for high-dimensional
+spaces, using double regularization: an L1-penalization on the off-diagonal
+elements of the Mahalanobis matrix :math:`\mathbf{M}`, and a log-determinant
+divergence between :math:`\mathbf{M}` and :math:`\mathbf{M_0}` (set as either
+:math:`\mathbf{I}` or :math:`\mathbf{\Omega}^{-1}`, where
+:math:`\mathbf{\Omega}` is the covariance matrix).
+
+The resulting optimization problem over the semidefinite matrix
+:math:`\mathbf{M}` is convex:
+
+.. math::
+
+    \min_{\mathbf{M}} \text{tr}((\mathbf{M}_0 + \eta \mathbf{XLX}^{T})
+    \cdot \mathbf{M}) - \log\det \mathbf{M} + \lambda ||\mathbf{M}||_{1, off}
+
+where :math:`\mathbf{X}=[\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n]` is
+the training data, and the incidence matrix :math:`\mathbf{K}` has
+:math:`\mathbf{K}_{ij} = 1` if :math:`(\mathbf{x}_i, \mathbf{x}_j)` is a
+similar pair and :math:`-1` otherwise. The Laplacian matrix
+:math:`\mathbf{L}=\mathbf{D}-\mathbf{K}` is calculated from
+:math:`\mathbf{K}` and :math:`\mathbf{D}`, a diagonal matrix whose entries
+are the sums of the rows of :math:`\mathbf{K}`. :math:`||\cdot||_{1, off}`
+is the off-diagonal L1 norm.
+
 .. topic:: Example Code:

@@ -265,18 +415,33 @@ L1-penalized log-determinant regularization

 .. [2] Adapted from https://gist.github.com/kcarnold/5439945

+.. _rca:
 RCA
 ---

-Relative Components Analysis (RCA)
+Relative Components Analysis (:py:class:`RCA `)

 `RCA` learns a full rank Mahalanobis distance metric based on a weighted sum of
-in-class covariance matrices. It applies a global linear transformation to
-assign large weights to relevant dimensions and low weights to irrelevant
-dimensions. Those relevant dimensions are estimated using "chunklets", subsets
+in-chunklets covariance matrices. It applies a global linear transformation to
+assign large weights to relevant dimensions and low weights to irrelevant
+dimensions. Those relevant dimensions are estimated using "chunklets", subsets
 of points that are known to belong to the same class.

+For a training set with :math:`n` training points in :math:`k` chunklets, the
+algorithm is efficient since it simply amounts to computing
+
+.. math::
+
+    \mathbf{C} = \frac{1}{n}\sum_{j=1}^k\sum_{i=1}^{n_j}
+    (\mathbf{x}_{ji}-\hat{\mathbf{m}}_j)
+    (\mathbf{x}_{ji}-\hat{\mathbf{m}}_j)^T
+
+where chunklet :math:`j` consists of :math:`\{\mathbf{x}_{ji}\}_{i=1}^{n_j}`
+with a mean :math:`\hat{\mathbf{m}}_j`. The inverse :math:`\mathbf{C}^{-1}`
+is used as the Mahalanobis matrix.
+
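+This computation is short enough to sketch directly (illustrative only, not
+the library's implementation):
+
+::
+
+    import numpy as np
+
+    def rca_mahalanobis_matrix(chunklets):
+        # chunklets: list of arrays, each of shape (n_j, n_features)
+        n = sum(len(c) for c in chunklets)
+        d = chunklets[0].shape[1]
+        C = np.zeros((d, d))
+        for c in chunklets:
+            centered = c - c.mean(axis=0)     # x_ji - m_hat_j
+            C += centered.T @ centered
+        C /= n
+        return np.linalg.inv(C)               # used as the Mahalanobis matrix
+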
 .. topic:: Example Code:

 ::

@@ -295,7 +460,6 @@ of points that are known to belong to the same class.

     rca = RCA()
     rca.fit(pairs, y)

-
 .. topic:: References:

 .. [1] `Adjustment learning and relevant component analysis
    `_ Noam Shental, et al.

 .. [3]'Learning a Mahalanobis metric from equivalence constraints', JMLR 2005

+.. _mmc:
+
 MMC
 ---

-Mahalanobis Metric Learning with Application for Clustering with
-Side-Information, Xing et al., NIPS 2002
-
-`MMC` minimizes the sum of squared distances between similar examples, while
-enforcing the sum of distances between dissimilar examples to be greater than a
-certain margin. This leads to a convex and, thus, local-minima-free
-optimization problem that can be solved efficiently. However, the algorithm
-involves the computation of eigenvalues, which is the main speed-bottleneck.
-Since it has initially been designed for clustering applications, one of the
-implicit assumptions of MMC is that all classes form a compact set, i.e.,
-follow a unimodal distribution, which restricts the possible use-cases of this
-method. However, it is one of the earliest and a still often cited technique.
+Metric Learning with Application to Clustering with Side Information
+(:py:class:`MMC `)
+
+`MMC` minimizes the sum of squared distances between similar points, while
+enforcing the sum of distances between dissimilar ones to be greater than one.
+This leads to a convex and, thus, local-minima-free optimization problem that
+can be solved efficiently.
+However, the algorithm involves the computation of eigenvalues, which is the
+main speed-bottleneck. Since it has initially been designed for clustering
+applications, one of the implicit assumptions of MMC is that all classes form
+a compact set, i.e., follow a unimodal distribution, which restricts the
+possible use-cases of this method. However, it is one of the earliest and a
+still often cited technique.
+
+The algorithm aims at minimizing the sum of squared distances between all the
+similar points, while constraining the sum of distances between dissimilar
+points:
+
+.. math::
+
+    \min_{\mathbf{M}\in\mathbb{S}_+^d}\sum_{(\mathbf{x}_i,
+    \mathbf{x}_j)\in S} d^2_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_j)
+    \qquad \qquad \text{s.t.} \qquad \sum_{(\mathbf{x}_i, \mathbf{x}_j)
+    \in D} d_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_j) \geq 1
+
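+For illustration (not the library's solver), the objective and the constraint
+can be evaluated as follows, with ``d2`` being the squared Mahalanobis
+distance:
+
+::
+
+    import numpy as np
+
+    def mmc_objective_and_constraint(M, similar, dissimilar):
+        def d2(x, y):
+            return (x - y) @ M @ (x - y)
+        objective = sum(d2(x, y) for x, y in similar)
+        constraint = sum(np.sqrt(d2(x, y)) for x, y in dissimilar)
+        return objective, constraint          # constraint must stay >= 1
+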
 .. topic:: Example Code:

diff --git a/metric_learn/itml.py b/metric_learn/itml.py
index 9b6dccb2..6cb34313 100644
--- a/metric_learn/itml.py
+++ b/metric_learn/itml.py
@@ -1,16 +1,17 @@
-"""
-Information Theoretic Metric Learning, Kulis et al., ICML 2007
-
-ITML minimizes the differential relative entropy between two multivariate
-Gaussians under constraints on the distance function,
-which can be formulated into a Bregman optimization problem by minimizing the
-LogDet divergence subject to linear constraints.
-This algorithm can handle a wide variety of constraints and can optionally
-incorporate a prior on the distance function.
-Unlike some other methods, ITML does not rely on an eigenvalue computation
-or semi-definite programming.
-
-Adapted from Matlab code at http://www.cs.utexas.edu/users/pjain/itml/
+r"""
+Information Theoretic Metric Learning (ITML)
+
+`ITML` minimizes the (differential) relative entropy, aka Kullback-Leibler
+divergence, between two multivariate Gaussians subject to constraints on the
+associated Mahalanobis distance, which can be formulated into a Bregman
+optimization problem by minimizing the LogDet divergence subject to
+linear constraints. This algorithm can handle a wide variety of constraints
+and can optionally incorporate a prior on the distance function. Unlike some
+other methods, `ITML` does not rely on an eigenvalue computation or
+semi-definite programming.
+
+Read more in the :ref:`User Guide `.
+
 """

 from __future__ import print_function, absolute_import
diff --git a/metric_learn/lfda.py b/metric_learn/lfda.py
index 2feff211..2ca085d4 100644
--- a/metric_learn/lfda.py
+++ b/metric_learn/lfda.py
@@ -1,14 +1,13 @@
-"""
-Local Fisher Discriminant Analysis (LFDA)
+r"""
+Local Fisher Discriminant Analysis (LFDA)
+
+LFDA is a linear supervised dimensionality reduction method. It is
+particularly useful when dealing with multimodality, where one or more classes
+consist of separate clusters in input space. The core optimization problem of
+LFDA is solved as a generalized eigenvalue problem.

-Local Fisher Discriminant Analysis for Supervised Dimensionality Reduction
-Sugiyama, ICML 2006
+Read more in the :ref:`User Guide `.

-LFDA is a linear supervised dimensionality reduction method.
-It is particularly useful when dealing with multimodality,
-where one ore more classes consist of separate clusters in input space.
-The core optimization problem of LFDA is solved as a generalized
-eigenvalue problem.
 """
 from __future__ import division, absolute_import
 import numpy as np
diff --git a/metric_learn/lmnn.py b/metric_learn/lmnn.py
index f9cd0e91..9e606c56 100644
--- a/metric_learn/lmnn.py
+++ b/metric_learn/lmnn.py
@@ -1,11 +1,14 @@
-"""
-Large-margin nearest neighbor metric learning. (Weinberger 2005)
+r"""
+Large Margin Nearest Neighbor Metric Learning (LMNN)
+
+LMNN learns a Mahalanobis distance metric in the kNN classification
+setting. The learned metric attempts to keep close k-nearest neighbors
+from the same class, while keeping examples from different classes
+separated by a large margin. This algorithm makes no assumptions about
+the distribution of the data.
+
+Read more in the :ref:`User Guide `.

-LMNN learns a Mahanalobis distance metric in the kNN classification setting
-using semidefinite programming.
-The learned metric attempts to keep k-nearest neighbors in the same class,
-while keeping examples from different classes separated by a large margin.
-This algorithm makes no assumptions about the distribution of the data.
 """

 #TODO: periodic recalculation of impostors, PCA initialization
diff --git a/metric_learn/lsml.py b/metric_learn/lsml.py
index 536719ba..1d66cbc0 100644
--- a/metric_learn/lsml.py
+++ b/metric_learn/lsml.py
@@ -1,10 +1,17 @@
-"""
-Liu et al.
-"Metric Learning from Relative Comparisons by Minimizing Squared Residual".
-ICDM 2012.
+r"""
+Metric Learning from Relative Comparisons by Minimizing Squared Residual (LSML)
+
+`LSML` proposes a simple, yet effective, algorithm that minimizes a convex
+objective function corresponding to the sum of squared residuals of
+constraints. This algorithm uses constraints in the form of relative
+distance comparisons, which is especially useful when pairwise constraints
+are not natural to obtain and algorithms based on pairwise constraints
+therefore cannot be deployed. Furthermore, its sparsity extension leads to
+more stable estimation when the dimension is high and only a small amount
+of constraints is given.
+
+Read more in the :ref:`User Guide `.

-Adapted from https://gist.github.com/kcarnold/5439917
-Paper: http://www.cs.ucla.edu/~weiwang/paper/ICDM12.pdf
 """

 from __future__ import print_function, absolute_import, division
diff --git a/metric_learn/mlkr.py b/metric_learn/mlkr.py
index 74a21a82..927c64e3 100644
--- a/metric_learn/mlkr.py
+++ b/metric_learn/mlkr.py
@@ -1,10 +1,13 @@
-"""
-Metric Learning for Kernel Regression (MLKR), Weinberger et al.,
+r"""
+Metric Learning for Kernel Regression (MLKR)
+
+MLKR is an algorithm for supervised metric learning, which learns a
+distance function by directly minimizing the leave-one-out regression error.
+This algorithm can also be viewed as a supervised variation of PCA and can be
+used for dimensionality reduction and high dimensional data visualization.
+
+Read more in the :ref:`User Guide `.

-MLKR is an algorithm for supervised metric learning, which learns a distance
-function by directly minimising the leave-one-out regression error. This
-algorithm can also be viewed as a supervised variation of PCA and can be used
-for dimensionality reduction and high dimensional data visualization.
 """
 from __future__ import division, print_function
 import time
diff --git a/metric_learn/mmc.py b/metric_learn/mmc.py
index 346db2f8..eb7dc529 100644
--- a/metric_learn/mmc.py
+++ b/metric_learn/mmc.py
@@ -1,19 +1,19 @@
-"""
-Mahalanobis Metric Learning with Application for Clustering with Side-Information, Xing et al., NIPS 2002
+r"""
+Metric Learning with Application to Clustering with Side Information (MMC)

-MMC minimizes the sum of squared distances between similar examples,
-while enforcing the sum of distances between dissimilar examples to be
-greater than a certain margin.
-This leads to a convex and, thus, local-minima-free optimization problem
-that can be solved efficiently.
+MMC minimizes the sum of squared distances between similar points, while
+enforcing the sum of distances between dissimilar ones to be greater than one.
+This leads to a convex and, thus, local-minima-free optimization problem that
+can be solved efficiently.
 However, the algorithm involves the computation of eigenvalues, which is the
-main speed-bottleneck.
-Since it has initially been designed for clustering applications, one of the
-implicit assumptions of MMC is that all classes form a compact set, i.e.,
-follow a unimodal distribution, which restricts the possible use-cases of
-this method. However, it is one of the earliest and a still often cited technique.
+main speed-bottleneck. Since it has initially been designed for clustering
+applications, one of the implicit assumptions of MMC is that all classes form
+a compact set, i.e., follow a unimodal distribution, which restricts the
+possible use-cases of this method. However, it is one of the earliest and a
+still often cited technique.
+
+Read more in the :ref:`User Guide `.

-Adapted from Matlab code at http://www.cs.cmu.edu/%7Eepxing/papers/Old_papers/code_Metric_online.tar.gz
 """

 from __future__ import print_function, absolute_import, division
diff --git a/metric_learn/nca.py b/metric_learn/nca.py
index 5abe52e3..7139f0ff 100644
--- a/metric_learn/nca.py
+++ b/metric_learn/nca.py
@@ -1,6 +1,15 @@
-"""
-Neighborhood Components Analysis (NCA)
-Ported to Python from https://github.com/vomjom/nca
+r"""
+Neighborhood Components Analysis (NCA)
+
+NCA is a distance metric learning algorithm which aims to improve the
+accuracy of nearest neighbors classification compared to the standard
+Euclidean distance. The algorithm directly maximizes a stochastic variant
+of the leave-one-out k-nearest neighbors (KNN) score on the training set.
+It can also learn a low-dimensional linear transformation of data that can
+be used for data visualization and fast classification.
+
+Read more in the :ref:`User Guide `.
+
 """

 from __future__ import absolute_import
diff --git a/metric_learn/rca.py b/metric_learn/rca.py
index c9fedd59..88538e8b 100644
--- a/metric_learn/rca.py
+++ b/metric_learn/rca.py
@@ -1,14 +1,14 @@
-"""Relative Components Analysis (RCA)
+r"""
+Relative Components Analysis (RCA)

-RCA learns a full rank Mahalanobis distance metric based on a
-weighted sum of in-class covariance matrices.
-It applies a global linear transformation to assign large weights to
-relevant dimensions and low weights to irrelevant dimensions.
-Those relevant dimensions are estimated using "chunklets",
-subsets of points that are known to belong to the same class.
+RCA learns a full rank Mahalanobis distance metric based on a weighted sum of
+in-chunklets covariance matrices. It applies a global linear transformation to
+assign large weights to relevant dimensions and low weights to irrelevant
+dimensions. Those relevant dimensions are estimated using "chunklets", subsets
+of points that are known to belong to the same class.
+
+Read more in the :ref:`User Guide `.

-'Learning distance functions using equivalence relations', ICML 2003
-'Learning a Mahalanobis metric from equivalence constraints', JMLR 2005
 """

 from __future__ import absolute_import
diff --git a/metric_learn/sdml.py b/metric_learn/sdml.py
index e9828d07..b300b9ac 100644
--- a/metric_learn/sdml.py
+++ b/metric_learn/sdml.py
@@ -1,11 +1,15 @@
-"""
-Qi et al.
-An efficient sparse metric learning in high-dimensional space via
-L1-penalized log-determinant regularization.
-ICML 2009
+r"""
+Sparse High-Dimensional Metric Learning (SDML)
+
+SDML is an efficient sparse metric learning method for high-dimensional
+spaces, using double regularization: an L1-penalization on the off-diagonal
+elements of the Mahalanobis matrix :math:`\mathbf{M}`, and a log-determinant
+divergence between :math:`\mathbf{M}` and :math:`\mathbf{M_0}` (set as either
+:math:`\mathbf{I}` or :math:`\mathbf{\Omega}^{-1}`, where
+:math:`\mathbf{\Omega}` is the covariance matrix).
+
+Read more in the :ref:`User Guide `.

-Adapted from https://gist.github.com/kcarnold/5439945
-Paper: http://lms.comp.nus.edu.sg/sites/default/files/publication-attachments/icml09-guojun.pdf
 """

 from __future__ import absolute_import