@@ -41,17 +41,37 @@ the covariance matrix of the input data. This is a simple baseline method.
.. [1] On the Generalized Distance in Statistics, P. C. Mahalanobis, 1936

+ .. _lmnn:
+
LMNN
-----

- Large-margin nearest neighbor metric learning.
+ Large Margin Nearest Neighbor Metric Learning
+ (:py:class:`LMNN <metric_learn.lmnn.LMNN>`)

- `LMNN` learns a Mahanalobis distance metric in the kNN classification
- setting using semidefinite programming. The learned metric attempts to keep
- k-nearest neighbors in the same class, while keeping examples from different
- classes separated by a large margin. This algorithm makes no assumptions about
+ `LMNN` learns a Mahalanobis distance metric in the kNN classification
+ setting. The learned metric attempts to keep the k-nearest neighbors
+ from the same class close, while keeping examples from different classes
+ separated by a large margin. This algorithm makes no assumptions about
the distribution of the data.

+ The distance is learned by solving the following optimization problem:
+
+ .. math::
+
+       \min_\mathbf{L}\sum_{i, j}\eta_{ij}||\mathbf{L(x_i-x_j)}||^2 +
+       c\sum_{i, j, l}\eta_{ij}(1-y_{il})[1+||\mathbf{L(x_i-x_j)}||^2-||
+       \mathbf{L(x_i-x_l)}||^2]_+
+
+ where :math:`\mathbf{x}_i` is a data point, :math:`\mathbf{x}_j` is one
+ of its k-nearest neighbors sharing the same label, and :math:`\mathbf{x}_l`
+ are all the other instances within that region with different labels.
+ :math:`\eta_{ij}, y_{il} \in \{0, 1\}` are both indicators:
+ :math:`\eta_{ij}` indicates that :math:`\mathbf{x}_{j}` is one of the
+ k-nearest neighbors (with the same label) of :math:`\mathbf{x}_{i}`,
+ :math:`y_{il}=0` indicates that :math:`\mathbf{x}_{i}` and
+ :math:`\mathbf{x}_{l}` belong to different classes, and
+ :math:`[\cdot]_+=\max(0, \cdot)` is the hinge loss.
+
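+ To make the objective concrete, here is a small NumPy sketch (purely
+ illustrative, not the library's implementation; the helper name
+ ``lmnn_loss`` and its inputs are made up) that evaluates this loss for a
+ fixed linear map ``L``:
+
+ ::
+
+     import numpy as np
+
+     def lmnn_loss(L, X, y, target_neighbors, c=1.0):
+         """Evaluate the LMNN objective for a fixed linear map L.
+
+         target_neighbors[i] lists the indices j of the k nearest
+         same-class neighbors of X[i] (the pairs with eta_ij = 1).
+         """
+         LX = X @ L.T                      # map every point through L
+         pull, push = 0.0, 0.0
+         for i, neighbors in enumerate(target_neighbors):
+             for j in neighbors:
+                 d_ij = np.sum((LX[i] - LX[j]) ** 2)
+                 pull += d_ij              # pull term: keep target neighbors close
+                 for l in range(len(X)):
+                     if y[l] != y[i]:      # only differently-labeled points can violate the margin
+                         d_il = np.sum((LX[i] - LX[l]) ** 2)
+                         push += max(0.0, 1.0 + d_ij - d_il)  # hinge loss
+         return pull + c * push
+
+ With ``L = np.eye(X.shape[1])`` this simply scores the plain Euclidean
+ metric, which is the natural starting point before optimization.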

.. topic:: Example Code:

::
@@ -80,16 +100,44 @@ The two implementations differ slightly, and the C++ version is more complete.
-margin-nearest-neighbor-classification>`_ Kilian Q. Weinberger, John
Blitzer, Lawrence K. Saul

+ .. _nca:
+
NCA
---

- Neighborhood Components Analysis (`NCA`) is a distance metric learning
- algorithm which aims to improve the accuracy of nearest neighbors
- classification compared to the standard Euclidean distance. The algorithm
- directly maximizes a stochastic variant of the leave-one-out k-nearest
- neighbors (KNN) score on the training set. It can also learn a low-dimensional
- linear embedding of data that can be used for data visualization and fast
- classification.
+ Neighborhood Components Analysis (:py:class:`NCA <metric_learn.nca.NCA>`)
+
+ `NCA` is a distance metric learning algorithm which aims to improve the
+ accuracy of nearest neighbors classification compared to the standard
+ Euclidean distance. The algorithm directly maximizes a stochastic variant
+ of the leave-one-out k-nearest neighbors (KNN) score on the training set.
+ It can also learn a low-dimensional linear transformation of data that can
+ be used for data visualization and fast classification.
+
+ NCA uses the decomposition :math:`\mathbf{M} = \mathbf{L}^T\mathbf{L}` and
+ defines the probability :math:`p_{ij}` that :math:`\mathbf{x}_i` selects
+ :math:`\mathbf{x}_j` as its neighbor, via a softmax over the Mahalanobis
+ distances:
+
+ .. math::
+
+       p_{ij} = \frac{\exp(-||\mathbf{Lx}_i - \mathbf{Lx}_j||_2^2)}
+       {\sum_{l\neq i}\exp(-||\mathbf{Lx}_i - \mathbf{Lx}_l||_2^2)},
+       \qquad p_{ii}=0
+
+ Then the probability that :math:`\mathbf{x}_i` will be correctly classified
+ by the stochastic nearest neighbors rule is:
+
+ .. math::
+
+       p_{i} = \sum_{j:j\neq i, y_j=y_i}p_{ij}
+
+ The optimization problem is to find the matrix :math:`\mathbf{L}` that
+ maximizes the sum of the probabilities of correct classification:
+
+ .. math::
+
+       \mathbf{L} = \arg\max_\mathbf{L}\sum_i p_i
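+
+ As an illustration, here is a small NumPy sketch (not the library's
+ implementation; the helper name ``nca_objective`` is made up) that
+ evaluates this objective for a fixed transformation ``L``:
+
+ ::
+
+     import numpy as np
+
+     def nca_objective(L, X, y):
+         """Sum over i of p_i, the stochastic leave-one-out kNN score."""
+         LX = X @ L.T
+         # pairwise squared distances in the transformed space
+         d = np.sum((LX[:, None, :] - LX[None, :, :]) ** 2, axis=-1)
+         np.fill_diagonal(d, np.inf)        # enforces p_ii = 0
+         p = np.exp(-d)
+         p /= p.sum(axis=1, keepdims=True)  # softmax over j != i
+         same_class = (y[:, None] == y[None, :])
+         return np.sum(p * same_class)      # sum_i p_i
+
+ NCA then searches for the ``L`` that maximizes this score (typically by
+ gradient-based optimization).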

.. topic:: Example Code:
@@ -116,16 +164,55 @@ classification.
.. [2] Wikipedia entry on Neighborhood Components Analysis
   https://en.wikipedia.org/wiki/Neighbourhood_components_analysis

+ .. _lfda:
+
LFDA
----

- Local Fisher Discriminant Analysis (LFDA)
+ Local Fisher Discriminant Analysis (:py:class:`LFDA <metric_learn.lfda.LFDA>`)

`LFDA` is a linear supervised dimensionality reduction method. It is
- particularly useful when dealing with multimodality, where one ore more classes
+ particularly useful when dealing with multi-modality, where one or more classes
consist of separate clusters in input space. The core optimization problem of
LFDA is solved as a generalized eigenvalue problem.
+
+ The algorithm defines the local Fisher within-class and between-class
+ scatter matrices :math:`\mathbf{S}^{(w)}` / :math:`\mathbf{S}^{(b)}` in a
+ pairwise fashion:
+
+ .. math::
+
+       \mathbf{S}^{(w)} = \frac{1}{2}\sum_{i,j=1}^nW_{ij}^{(w)}(\mathbf{x}_i -
+       \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T,\\
+       \mathbf{S}^{(b)} = \frac{1}{2}\sum_{i,j=1}^nW_{ij}^{(b)}(\mathbf{x}_i -
+       \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T,\\
+
+ where
+
+ .. math::
+
+       W_{ij}^{(w)} = \left\{\begin{aligned}0 \qquad y_i\neq y_j \\
+       \,\,\mathbf{A}_{i,j}/n_l \qquad y_i = y_j\end{aligned}\right.\\
+       W_{ij}^{(b)} = \left\{\begin{aligned}1/n \qquad y_i\neq y_j \\
+       \,\,\mathbf{A}_{i,j}(1/n-1/n_l) \qquad y_i = y_j\end{aligned}\right.\\
+
+ Here :math:`\mathbf{A}_{i,j}` is the :math:`(i,j)`-th entry of the affinity
+ matrix :math:`\mathbf{A}`, which can be computed with local scaling methods,
+ and :math:`n_l` is the number of training samples in the class shared by
+ :math:`\mathbf{x}_i` and :math:`\mathbf{x}_j` (:math:`n` is the total number
+ of samples).
+
+ The learning problem then becomes deriving the LFDA transformation matrix
+ :math:`\mathbf{T}_{LFDA}`:
+
+ .. math::
+
+       \mathbf{T}_{LFDA} = \arg\max_\mathbf{T}
+       [\text{tr}((\mathbf{T}^T\mathbf{S}^{(w)}
+       \mathbf{T})^{-1}\mathbf{T}^T\mathbf{S}^{(b)}\mathbf{T})]
+
+ That is, LFDA looks for a transformation matrix :math:`\mathbf{T}` such that
+ nearby data pairs from the same class are made close and data pairs from
+ different classes are separated from each other; far-apart pairs from the
+ same class are not forced to be close.
+
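+ As a rough sketch of how these quantities could be assembled (illustrative
+ only; the helper names ``lfda_scatters`` and ``lfda_transform`` are made up
+ and this is not how metric-learn implements LFDA internally):
+
+ ::
+
+     import numpy as np
+     from scipy.linalg import eigh
+
+     def lfda_scatters(X, y, A):
+         """Local within/between-class scatter matrices from affinity A."""
+         n, d = X.shape
+         Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
+         class_sizes = {c: np.sum(y == c) for c in np.unique(y)}
+         for i in range(n):
+             for j in range(n):
+                 diff = np.outer(X[i] - X[j], X[i] - X[j])
+                 if y[i] == y[j]:
+                     n_l = class_sizes[y[i]]
+                     Sw += 0.5 * (A[i, j] / n_l) * diff
+                     Sb += 0.5 * (A[i, j] * (1.0 / n - 1.0 / n_l)) * diff
+                 else:
+                     Sb += 0.5 * (1.0 / n) * diff
+         return Sw, Sb
+
+     def lfda_transform(Sw, Sb, num_dims):
+         """Leading generalized eigenvectors of (Sb, Sw); assumes Sw is
+         positive definite (regularize it otherwise)."""
+         evals, vecs = eigh(Sb, Sw)          # eigenvalues in ascending order
+         return vecs[:, ::-1][:, :num_dims]  # top eigenvectors as columns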

.. topic:: Example Code:

::
@@ -151,17 +238,50 @@ LFDA is solved as a generalized eigenvalue problem.
<https://gastrograph.com/resources/whitepapers/local-fisher
-discriminant-analysis-on-beer-style-clustering.html#>`_ Yuan Tang.

+ .. _mlkr:
+
MLKR
----

- Metric Learning for Kernel Regression.
+ Metric Learning for Kernel Regression (:py:class:`MLKR <metric_learn.mlkr.MLKR>`)

`MLKR` is an algorithm for supervised metric learning, which learns a
- distance function by directly minimising the leave-one-out regression error.
+ distance function by directly minimizing the leave-one-out regression error.
This algorithm can also be viewed as a supervised variation of PCA and can be
used for dimensionality reduction and high dimensional data visualization.

+ Theoretically, `MLKR` can be applied with many types of kernel functions
+ and distance metrics; here we focus on the particular instance of a
+ Gaussian kernel and a Mahalanobis metric. The Gaussian kernel is denoted as:
+
+ .. math::
+
+       k_{ij} = \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{d(\mathbf{x}_i,
+       \mathbf{x}_j)}{\sigma^2})
+
+ where :math:`d(\cdot, \cdot)` is the squared distance under some metric; in
+ the Mahalanobis case it is :math:`d(\mathbf{x}_i, \mathbf{x}_j) =
+ ||\mathbf{A}(\mathbf{x}_i - \mathbf{x}_j)||^2`, where the transformation
+ matrix :math:`\mathbf{A}` is derived from the decomposition of the
+ Mahalanobis matrix :math:`\mathbf{M=A^TA}`.
+
+ Since :math:`\sigma^2` can be integrated into :math:`d(\cdot)`, we can set
+ :math:`\sigma^2=1` for the sake of simplicity. The cumulative leave-one-out
+ quadratic regression error of the training samples is used as the loss
+ function:
+
+ .. math::
+
+       \mathcal{L} = \sum_i(y_i - \hat{y}_i)^2
+
+ where the prediction :math:`\hat{y}_i` is derived from kernel regression,
+ as a weighted average over all the other training samples:
+
+ .. math::
+
+       \hat{y}_i = \frac{\sum_{j\neq i}y_jk_{ij}}{\sum_{j\neq i}k_{ij}}
+
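+ A compact NumPy sketch of this loss (illustrative only; ``mlkr_loss`` is a
+ made-up helper, not the routine used inside metric-learn) for a fixed
+ transformation ``A``:
+
+ ::
+
+     import numpy as np
+
+     def mlkr_loss(A, X, y):
+         """Leave-one-out kernel regression error for a fixed map A."""
+         AX = X @ A.T
+         # pairwise squared Mahalanobis distances ||A(x_i - x_j)||^2
+         d = np.sum((AX[:, None, :] - AX[None, :, :]) ** 2, axis=-1)
+         K = np.exp(-d)                   # Gaussian kernel with sigma^2 = 1
+         np.fill_diagonal(K, 0.0)         # leave-one-out: drop the j = i term
+         y_hat = (K @ y) / K.sum(axis=1)  # weighted average of the other targets
+         return np.sum((y - y_hat) ** 2)
+
+ The constant factor :math:`1/(\sqrt{2\pi}\sigma)` cancels between the
+ numerator and denominator of the weighted average, so it is omitted here.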

.. topic:: Example Code:

::
@@ -193,7 +313,6 @@ generated from the labels information and passed to the underlying algorithm.
.. todo:: add more details about that (see issue `<https://github
   .com/metric-learn/metric-learn/issues/135>`_)

-
.. topic:: Example Code:

::