@@ -11,145 +11,6 @@ For more information on the motivation and implementation details behind
BackendConfig, please refer to this
`README <https://github.com/pytorch/pytorch/tree/master/torch/ao/quantization/backend_config>`__.

- BackendConfig API Specification
- -------------------------------
-
- At a high level, BackendConfig specifies the quantization behavior for
- each supported operator pattern (e.g. linear, conv-bn-relu). The API is
- broken down into the following class hierarchy, with a short illustrative
- sketch after the list:
-
- - `BackendConfig <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.BackendConfig.html>`__:
-   The main class to be passed to prepare and convert functions.
- - `BackendPatternConfig <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.BackendPatternConfig.html>`__:
-   Config object that specifies quantization behavior for a given
-   operator pattern. Each BackendConfig consists of many of these.
- - `DTypeConfig <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.DTypeConfig.html>`__:
-   Config object that specifies the supported data types passed as
-   arguments to quantize ops in the reference model spec, for input
-   and output activations, weights, and biases. This object also
-   optionally specifies constraints associated with the data types.
-   Each BackendPatternConfig consists of one or more of these.
- - `DTypeWithConstraints <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.DTypeWithConstraints.html>`__:
-   Constraints imposed by the backend on the quantization parameters
-   (scale and zero point) and ranges when quantizing to a given data
-   type. Each DTypeConfig consists of many of these.
-
- The pattern specified in BackendPatternConfig follows the format
- described `here <https://github.com/pytorch/pytorch/blob/master/torch/ao/quantization/backend_config/README.md#pattern-specification>`__.
-
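- Concretely, these classes compose as in the following minimal sketch; the
- pattern, dtypes, and constraint values below are illustrative placeholders
- rather than requirements of any particular backend:
-
- .. code:: ipython3
-
-     import torch
-     from torch.ao.quantization.backend_config import (
-         BackendConfig,
-         BackendPatternConfig,
-         DTypeConfig,
-         DTypeWithConstraints,
-         ObservationType,
-     )
-
-     # DTypeWithConstraints: a dtype plus backend constraints on its quantization params
-     act_quint8 = DTypeWithConstraints(dtype=torch.quint8, scale_min_lower_bound=2 ** -12)
-
-     # DTypeConfig: supported input/output/weight/bias dtypes for a pattern
-     int8_dtype_config = DTypeConfig(
-         input_dtype=act_quint8,
-         output_dtype=act_quint8,
-         weight_dtype=torch.qint8,
-         bias_dtype=torch.float)
-
-     # BackendPatternConfig: quantization behavior for one operator pattern
-     linear_pattern_config = BackendPatternConfig(torch.nn.Linear) \
-         .set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT) \
-         .add_dtype_config(int8_dtype_config)
-
-     # BackendConfig: the collection of BackendPatternConfigs passed to prepare/convert
-     example_backend_config = BackendConfig("example_backend") \
-         .set_backend_pattern_config(linear_pattern_config)
-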
- BackendPatternConfig Specification
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
- set_observation_type
- ^^^^^^^^^^^^^^^^^^^^
-
- Observation type here refers to how observers (or quant-dequant ops)
- will be placed in the graph. This is used to produce the desired
- reference patterns understood by the backend. Weighted ops such as
- linear and conv require different observers (or quantization parameters
- passed to quantize ops in the reference model) for the input and the
- output (see `ObservationType <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.ObservationType.html>`__).
-
- Note: This will be renamed in the near future, since we will soon insert
- QuantDeQuantStubs with observers (and fake quantizes) attached instead
- of observers themselves.
-
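- For example, a weighted op like linear observes its input and output
- separately, while an op like max pooling can simply reuse the input's
- quantization parameters for its output. A hypothetical sketch:
-
- .. code:: ipython3
-
-     import torch
-     from torch.ao.quantization.backend_config import BackendPatternConfig, ObservationType
-
-     # Linear: input and output get different observers / quantization params
-     linear_obs_config = BackendPatternConfig(torch.nn.Linear) \
-         .set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT)
-
-     # MaxPool2d: output shares the observer / quantization params with the input
-     maxpool_obs_config = BackendPatternConfig(torch.nn.MaxPool2d) \
-         .set_observation_type(ObservationType.OUTPUT_SHARE_OBSERVER_WITH_INPUT)
-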
- set_dtype_configs / add_dtype_config
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- Each operator pattern may support one or more sets of
- input/output/weight/bias data types, and each set may have its own
- constraints. These requirements are captured in DTypeConfigs, which will
- be described in more detail in the next section.
-
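- For example, a pattern that supports both static and dynamic quantization
- could attach one DTypeConfig for each case. A hypothetical sketch:
-
- .. code:: ipython3
-
-     import torch
-     from torch.ao.quantization.backend_config import BackendPatternConfig, DTypeConfig
-
-     static_int8_dtype_config = DTypeConfig(
-         input_dtype=torch.quint8,
-         output_dtype=torch.quint8,
-         weight_dtype=torch.qint8,
-         bias_dtype=torch.float)
-
-     dynamic_int8_dtype_config = DTypeConfig(
-         input_dtype=torch.quint8,
-         output_dtype=torch.float,
-         weight_dtype=torch.qint8,
-         bias_dtype=torch.float,
-         is_dynamic=True)
-
-     # The pattern is quantized with whichever DTypeConfig matches the user's QConfig
-     linear_dtype_config = BackendPatternConfig(torch.nn.Linear) \
-         .add_dtype_config(static_int8_dtype_config) \
-         .add_dtype_config(dynamic_int8_dtype_config)
-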
- set_root_module / set_reference_quantized_module
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- When we construct the reference quantized model during the convert
- phase, the root modules (e.g. ``torch.nn.Linear`` for
- ``torch.ao.nn.intrinsic.LinearReLU``) will be swapped to the
- corresponding reference quantized modules (e.g.
- ``torch.ao.nn.quantized.reference.Linear``). This allows custom backends
- to specify custom reference quantized module implementations to match
- the numerics of their lowered operators. Since this is a one-to-one
- mapping, both the root module and the reference quantized module must be
- specified in the same BackendPatternConfig in order for the conversion
- to take place.
-
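- Continuing the example above, a hypothetical (and intentionally partial)
- config for the fused linear-relu pattern might look like the following sketch:
-
- .. code:: ipython3
-
-     import torch
-     import torch.ao.nn.intrinsic as nni
-     from torch.ao.quantization.backend_config import BackendPatternConfig
-
-     # During convert, the root module (Linear) inside the fused LinearReLU
-     # is swapped for the reference quantized Linear
-     linear_relu_quant_config = BackendPatternConfig(nni.LinearReLU) \
-         .set_root_module(torch.nn.Linear) \
-         .set_reference_quantized_module(torch.ao.nn.quantized.reference.Linear)
-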
- set_fuser_method
- ^^^^^^^^^^^^^^^^
-
- As an optimization, operator patterns such as (``torch.nn.Linear``,
- ``torch.nn.ReLU``) may be fused into ``nni.LinearReLU``.
- ``set_fuser_method`` specifies the function through which this is
- performed. The first argument of this function is ``is_qat``, and the
- rest of the arguments are the items in the tuple pattern, e.g. the fuser
- method for the above pattern will have three arguments, ``is_qat``,
- ``linear``, and ``relu``. See `this
- example <https://gist.github.com/jerryzh168/8bea7180a8ba3c279f2c9b050f2a69a6>`__
- for a slightly more complicated usage. A combined sketch of
- ``set_fuser_method`` and ``set_fused_module`` is shown after the next section.
-
- set_fused_module
- ^^^^^^^^^^^^^^^^
-
- This is used to identify fused weighted modules (e.g.
- ``torch.ao.nn.intrinsic.LinearReLU``) that need to be converted to
- reference quantized modules.
-
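- As an illustrative sketch combining the two settings above for a
- hypothetical (Linear, ReLU) fusion, mirroring the conv-relu setup used
- later in this tutorial:
-
- .. code:: ipython3
-
-     import torch
-     import torch.ao.nn.intrinsic as nni
-     from torch.ao.quantization.backend_config import BackendPatternConfig
-
-     def fuse_linear_relu(is_qat, linear, relu):
-         """Return a fused LinearReLU from individual linear and relu modules."""
-         return nni.LinearReLU(linear, relu)
-
-     # (Linear, ReLU) is fused via fuse_linear_relu, and the resulting fused
-     # module type is identified through set_fused_module
-     linear_relu_fusion_config = BackendPatternConfig((torch.nn.Linear, torch.nn.ReLU)) \
-         .set_fuser_method(fuse_linear_relu) \
-         .set_fused_module(nni.LinearReLU)
-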
- Data Type Restrictions
- ~~~~~~~~~~~~~~~~~~~~~~
-
- Each DTypeConfig attached to a BackendPatternConfig represents a set of
- supported data types passed as arguments to quantize ops in the reference
- model spec. For example, consider the following reference model::
-
-     quant1 - [dequant1 - fp32_linear - quant2] - dequant2
-
- The pattern in the square brackets refers to the reference pattern of
- statically quantized linear. Setting the input dtype as `torch.quint8`
- in the DTypeConfig means we pass in `torch.quint8` as the dtype argument
- to the first quantize op (quant1). Similarly, setting the output dtype as
- `torch.quint8` means we pass in `torch.quint8` as the dtype argument to
- the second quantize op (quant2).
-
- Note that the dtype here does not refer to the interface dtypes of the
- op. For example, the "input dtype" here is not the dtype of the input
- tensor passed to the quantized linear op. Though it can still be the
- same as the interface dtype, this is not always the case, e.g. the
- interface dtype is fp32 in dynamic quantization but the "input dtype"
- specified in the DTypeConfig would still be quint8. The semantics of
- dtypes here are the same as the semantics of the dtypes specified in
- the observers.
-
- These dtypes are matched against the ones specified in the user’s
- QConfig. If there is a match, and the QConfig satisfies the constraints
- specified in the DTypeConfig (if any), then we will quantize the given
- pattern using this DTypeConfig. Otherwise, the QConfig is ignored and
- the pattern will not be quantized.
-
- There are two ways of specifying ``input_dtype``, ``output_dtype``, and
- ``weight_dtype``: as a plain ``torch.dtype`` or as
- ``DTypeWithConstraints``. The constraints currently supported are listed
- below, with a short example after the list:
-
- - **quant_min_lower_bound** and **quant_max_upper_bound**: Lower and upper
-   bounds for the minimum and maximum quantized values respectively. If the
-   QConfig’s ``quant_min`` and ``quant_max`` fall outside this range, then
-   the QConfig will be ignored.
- - **scale_min_lower_bound** and **scale_max_upper_bound**: Lower and
-   upper bounds for the minimum and maximum scale values respectively. If
-   the QConfig’s minimum scale value (currently exposed as ``eps``) falls
-   below the lower bound, then the QConfig will be ignored. Note that the
-   upper bound is currently not enforced.
- - **scale_exact_match** and **zero_point_exact_match**: Exact match
-   requirements for scale and zero point, to be used for operators with
-   fixed quantization parameters such as sigmoid and tanh. If the observer
-   specified in the QConfig is neither ``FixedQParamsObserver`` nor
-   ``FixedQParamsFakeQuantize``, or if the quantization parameters don't
-   match, then the QConfig will be ignored.
-
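- For example, a quint8 activation dtype with restricted quantization and
- scale ranges can be expressed as follows; the bounds shown are a sketch
- consistent with the constraints referenced later in this tutorial:
-
- .. code:: ipython3
-
-     import torch
-     from torch.ao.quantization.backend_config import DTypeConfig, DTypeWithConstraints
-
-     quint8_with_constraints = DTypeWithConstraints(
-         dtype=torch.quint8,
-         quant_min_lower_bound=0,
-         quant_max_upper_bound=255,
-         scale_min_lower_bound=2 ** -12)
-
-     # Used in place of a plain torch.dtype when constructing a DTypeConfig
-     constrained_int8_dtype_config = DTypeConfig(
-         input_dtype=quint8_with_constraints,
-         output_dtype=quint8_with_constraints,
-         weight_dtype=torch.qint8,
-         bias_dtype=torch.float)
-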
- End-to-End Example
- ------------------
-
Suppose we are a backend developer and we wish to integrate our backend
with PyTorch's quantization APIs. Our backend consists of two ops only:
quantized linear and quantized conv-relu. In this section, we will walk
@@ -175,6 +36,9 @@ BackendConfig through `prepare_fx` and `convert_fx`.
    )
    from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

+ 1. Derive reference pattern for each quantized pattern
+ -------------------------------------------------------
+
For quantized linear, suppose our backend expects the reference pattern
`[dequant - fp32_linear - quant]` and lowers it into a single quantized
linear op. The way to achieve this is to first insert quant-dequant ops
@@ -183,17 +47,21 @@ reference model::
    quant1 - [dequant1 - fp32_linear - quant2] - dequant2

- Here we specify using different observers (will be renamed) for the input
- and output for the linear op, so the quantization params passed to the two
- quantize ops (quant1 and quant2) will be different. This is commonly the
- case for weighted ops like linear and conv.
+ Similarly, for quantized conv-relu, we wish to produce the following
+ reference model, where the reference pattern in the square brackets will
+ be lowered into a single quantized conv-relu op::
+
+     quant1 - [dequant1 - fp32_conv_relu - quant2] - dequant2

- The input dtype specified in the DTypeConfig will be passed as the dtype
- argument to quant1, while the output dtype will be passed as the dtype
- argument to quant2. If the output dtype is fp32, as in the case of dynamic
- quantization, then the output quant-dequant pair will not be inserted.
- This example also shows how to specify restrictions on quantization and
- scale ranges on a particular dtype.
+ 2. Set DTypeConfigs with backend constraints
+ ---------------------------------------------
+
+ In the reference patterns above, the input dtype specified in the
+ DTypeConfig will be passed as the dtype argument to quant1, while the
+ output dtype will be passed as the dtype argument to quant2. If the output
+ dtype is fp32, as in the case of dynamic quantization, then the output
+ quant-dequant pair will not be inserted. This example also shows how to
+ specify restrictions on quantization and scale ranges on a particular dtype.

.. code:: ipython3

@@ -211,6 +79,38 @@ scale ranges on a particular dtype.
        weight_dtype=torch.qint8,
        bias_dtype=torch.float)

+ 3. Set up fusion for conv-relu
+ -------------------------------
+
+ Note that the original user model contains separate conv and relu ops,
+ so we need to first fuse the conv and relu ops into a single conv-relu
+ op (`fp32_conv_relu`), and then quantize this op similar to how the linear
+ op is quantized. We can set up fusion by defining a function that accepts
+ 3 arguments, where the first is whether or not this is for QAT, and the
+ remaining arguments refer to the individual items of the fused pattern.
+
+ .. code:: ipython3
+
+     def fuse_conv2d_relu(is_qat, conv, relu):
+         """Return a fused ConvReLU2d from individual conv and relu modules."""
+         return torch.ao.nn.intrinsic.ConvReLU2d(conv, relu)
+
+ 4. Define the BackendConfig
+ ----------------------------
+
+ Now we have all the necessary pieces, so we go ahead and define our
+ BackendConfig. Here we use different observers (will be renamed) for
+ the input and output for the linear op, so the quantization params
+ passed to the two quantize ops (quant1 and quant2) will be different.
+ This is commonly the case for weighted ops like linear and conv.
+
+ For the conv-relu op, the observation type is the same. However, we
+ need two BackendPatternConfigs to support this op, one for fusion
+ and one for quantization. For both conv-relu and linear, we use the
+ DTypeConfig defined above.
+
+ .. code:: ipython3
+
    linear_config = BackendPatternConfig() \
        .set_pattern(torch.nn.Linear) \
        .set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT) \
@@ -219,24 +119,6 @@ scale ranges on a particular dtype.
        .set_qat_module(torch.nn.qat.Linear) \
        .set_reference_quantized_module(torch.ao.nn.quantized.reference.Linear)

- For quantized conv-relu, the observation type and DTypeConfig settings
- are the same, since we wish to produce the following reference model,
- where the reference pattern in the square brackets will be lowered into
- a single quantized conv-relu op::
-
-     quant1 - [dequant1 - fp32_conv_relu - quant2] - dequant2
-
- However, first we need to fuse the conv and relu ops into a single
- conv-relu op (`fp32_conv_relu`), and then quantize this op similar to
- how the linear op is quantized. Thus, we need two BackendPatternConfigs
- to support this op, one for fusion and one for quantization:
-
- .. code:: ipython3
-
-     def fuse_conv2d_relu(is_qat, conv, relu):
-         """Return a fused ConvReLU2d from individual conv and relu modules."""
-         return torch.ao.nn.intrinsic.ConvReLU2d(conv, relu)
-
    # For fusing Conv2d + ReLU into ConvReLU2d
    # No need to set observation type and dtype config here, since we are not
    # inserting quant-dequant ops in this step yet
@@ -254,23 +136,43 @@ to support this op, one for fusion and one for quantization:
        .set_qat_module(torch.ao.nn.intrinsic.qat.ConvReLU2d) \
        .set_reference_quantized_module(torch.ao.nn.quantized.reference.Conv2d)

- Now we have all the necessary pieces, so we go ahead and define our
- BackendConfig and test it out on an example model. Here we see that
- both linear and conv-relu are quantized.
-
- .. code:: ipython3
-
    backend_config = BackendConfig("my_backend") \
        .set_backend_pattern_config(linear_config) \
        .set_backend_pattern_config(conv_relu_config) \
        .set_backend_pattern_config(fused_conv_relu_config)

+ 5. Set up QConfigMapping that satisfies the backend constraints
+ ----------------------------------------------------------------
+
+ In order to use the ops defined above, the user must define a QConfig
+ that satisfies the constraints specified in the DTypeConfig. For more
+ detail, see the documentation for `DTypeConfig <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.DTypeConfig.html>`__.
+ We will then use this QConfig for all the modules used in the patterns
+ we wish to quantize.
+
+ .. code:: ipython3
+
+     # Note: Here we use a quant_max of 127, but this could be up to 255 (see `quint8_with_constraints`)
+     activation_observer = MinMaxObserver.with_args(quant_min=0, quant_max=127, eps=2 ** -12)
+     qconfig = QConfig(activation=activation_observer, weight=default_weight_observer)
+
+     # Note: All individual items of a fused pattern, e.g. Conv2d and ReLU in
+     # (Conv2d, ReLU), must have the same QConfig
+     qconfig_mapping = QConfigMapping() \
+         .set_object_type(torch.nn.Linear, qconfig) \
+         .set_object_type(torch.nn.Conv2d, qconfig) \
+         .set_object_type(torch.nn.BatchNorm2d, qconfig) \
+         .set_object_type(torch.nn.ReLU, qconfig)
+
+ 6. Quantize the model through prepare and convert
+ --------------------------------------------------
+
+ Finally, we quantize the model by passing the BackendConfig we defined
+ into prepare and convert. This produces a quantized linear module and
+ a fused quantized conv-relu module.
+
.. code:: ipython3

-     # ====================
-     # Example user model
-     # ====================
-
    class MyModel(torch.nn.Module):
        def __init__(self, use_bn: bool):
            super().__init__()
@@ -280,7 +182,7 @@ both linear and conv-relu are quantized.
            self.relu = torch.nn.ReLU()
            self.sigmoid = torch.nn.Sigmoid()
            self.use_bn = use_bn
-
+
        def forward(self, x):
            x = self.linear(x)
            x = self.conv(x)
@@ -290,31 +192,6 @@ both linear and conv-relu are quantized.
            x = self.sigmoid(x)
            return x

- .. code:: ipython3
-
-     # =======================
-     # Custom QConfigMapping
-     # =======================
-
-     # Define a QConfig that satisfies the constraints specified in DTypeConfig
-     # Note: Here we use a quant_max of 127, but this could be up to 255 (see `quint8_with_constraints`)
-     activation_observer = MinMaxObserver.with_args(quant_min=0, quant_max=127, eps=2 ** -12)
-     qconfig = QConfig(activation=activation_observer, weight=default_weight_observer)
-
-     # Note: All individual items of a fused pattern, e.g. Conv2d and ReLU in
-     # (Conv2d, ReLU), must have the same QConfig
-     qconfig_mapping = QConfigMapping() \
-         .set_object_type(torch.nn.Linear, qconfig) \
-         .set_object_type(torch.nn.Conv2d, qconfig) \
-         .set_object_type(torch.nn.BatchNorm2d, qconfig) \
-         .set_object_type(torch.nn.ReLU, qconfig)
-
- .. code:: ipython3
-
-     # =====================
-     # Prepare and Convert
-     # =====================
-
    example_inputs = (torch.rand(1, 3, 10, 10, dtype=torch.float),)
    model = MyModel(use_bn=False)
    prepared = prepare_fx(model, qconfig_mapping, example_inputs, backend_config=backend_config)
@@ -341,17 +218,16 @@ both linear and conv-relu are quantized.
        sigmoid = self.sigmoid(dequantize_2); dequantize_2 = None
        return sigmoid

+ (7. Experiment with faulty BackendConfig setups)
+ -------------------------------------------------
+
As an experiment, here we modify the model to use conv-bn-relu
instead of conv-relu, but use the same BackendConfig, which doesn't
know how to quantize conv-bn-relu. As a result, only linear is
quantized, but conv-bn-relu is neither fused nor quantized.

.. code:: ipython3
-
-     # ================================================
-     # Prepare and Convert (only linear is quantized)
-     # ================================================
-
+     # Only linear is quantized, since there's no rule for fusing conv-bn-relu
    example_inputs = (torch.rand(1, 3, 10, 10, dtype=torch.float),)
    model = MyModel(use_bn=True)
    prepared = prepare_fx(model, qconfig_mapping, example_inputs, backend_config=backend_config)
@@ -387,11 +263,7 @@ doesn't satisfy the dtype constraints specified in the backend. As
a result, nothing is quantized since the QConfigs are simply ignored.

.. code:: ipython3
-
-     # ============================================
-     # Prepare and Convert (nothing is quantized)
-     # ============================================
-
+     # Nothing is quantized or fused, since backend constraints are not satisfied
    example_inputs = (torch.rand(1, 3, 10, 10, dtype=torch.float),)
    model = MyModel(use_bn=True)
    prepared = prepare_fx(model, get_default_qconfig_mapping(), example_inputs, backend_config=backend_config)