@@ -508,6 +508,10 @@ Now we can compare the size and model accuracy with baseline model.
target device, it's just a representation of quantized computation in ATen
operators.
+ .. note::
+    The weights are still in fp32 right now; we may do constant propagation for the quantize op to
+    get integer weights in the future.
+
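+ As an illustrative aside (assuming ``quantized_model`` is the ``GraphModule``
+ returned by ``convert_pt2e`` above), printing the generated code is one way to
+ see the quantize/dequantize ops surrounding the still-fp32 weights:
+
+ .. code-block:: python
+
+    # the printed code contains quantize_per_tensor/dequantize_per_tensor
+    # calls wrapping the fp32 weights
+    print(quantized_model.code)
+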
If you want to get better accuracy or performance, try configuring
``quantizer`` in different ways, and each ``quantizer`` will have its own way
of configuration, so please consult the documentation for the
@@ -519,46 +523,54 @@ Save and Load Quantized Model
We'll show how to save and load the quantized model.
- .. code-block:: python
-
-    # 1. Save state_dict
-    pt2e_quantized_model_file_path = saved_model_dir + "resnet18_pt2e_quantized.pth"
-    torch.save(quantized_model.state_dict(), pt2e_quantized_model_file_path)
+ .. code-block:: python

-    # Get a reference output
+    # 0. Store a reference output for the example inputs and check evaluation accuracy:
     example_inputs = (next(iter(data_loader))[0],)
     ref = quantized_model(*example_inputs)
+    top1, top5 = evaluate(quantized_model, criterion, data_loader_test)
+    print("[before serialization] Evaluation accuracy on test dataset: %2.2f, %2.2f" % (top1.avg, top5.avg))

-    # 2. Initialize the quantized model and Load state_dict
-    # Rerun all steps to get a quantized model
-    model_to_quantize = load_model(saved_model_dir + float_model_file).to("cpu")
-    model_to_quantize.eval()
-    from torch._export import capture_pre_autograd_graph
-
-    exported_model = capture_pre_autograd_graph(model_to_quantize, example_inputs)
-    from torch.ao.quantization.quantizer.xnnpack_quantizer import (
-        XNNPACKQuantizer,
-        get_symmetric_quantization_config,
-    )
+    # 1. Export the model and save the ExportedProgram
+    pt2e_quantized_model_file_path = saved_model_dir + "resnet18_pt2e_quantized.pth"
+    # capture the model to get an ExportedProgram
+    quantized_ep = torch.export.export(quantized_model, example_inputs)
+    # use torch.export.save to save an ExportedProgram
+    torch.export.save(quantized_ep, pt2e_quantized_model_file_path)

-    quantizer = XNNPACKQuantizer()
-    quantizer.set_global(get_symmetric_quantization_config())
-    prepared_model = prepare_pt2e(exported_model, quantizer)
-    prepared_model(*example_inputs)
-    loaded_quantized_model = convert_pt2e(prepared_model)

-    # load the state_dict from saved file to intialized model
-    loaded_quantized_model.load_state_dict(torch.load(pt2e_quantized_model_file_path))
+    # 2. Load the saved ExportedProgram
+    loaded_quantized_ep = torch.export.load(pt2e_quantized_model_file_path)
+    loaded_quantized_model = loaded_quantized_ep.module()

-    # Sanity check with sample data
+    # 3. Check results for example inputs and check evaluation accuracy again:
     res = loaded_quantized_model(*example_inputs)
-
-    # 3. Evaluate the loaded quantized model
+    print("diff:", ref - res)
+
     top1, top5 = evaluate(loaded_quantized_model, criterion, data_loader_test)
     print("[after serialization/deserialization] Evaluation accuracy on test dataset: %2.2f, %2.2f" % (top1.avg, top5.avg))
+
+ Output:
+
+ .. code-block:: python
+
+    [before serialization] Evaluation accuracy on test dataset: 79.82, 94.55
+    diff: tensor([[0., 0., 0.,  ..., 0., 0., 0.],
+            [0., 0., 0.,  ..., 0., 0., 0.],
+            [0., 0., 0.,  ..., 0., 0., 0.],
+            ...,
+            [0., 0., 0.,  ..., 0., 0., 0.],
+            [0., 0., 0.,  ..., 0., 0., 0.],
+            [0., 0., 0.,  ..., 0., 0., 0.]])
+
+    [after serialization/deserialization] Evaluation accuracy on test dataset: 79.82, 94.55
+
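+ As a quick illustrative check (a sketch, not one of the tutorial's measured
+ results), the on-disk size of the serialized ``ExportedProgram`` can be
+ inspected with the Python standard library:
+
+ .. code-block:: python
+
+    import os
+
+    # size of the serialized ExportedProgram, in megabytes
+    print("quantized model file size (MB):",
+          os.path.getsize(pt2e_quantized_model_file_path) / 1e6)
+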
Debugging the Quantized Model
- ----------------------------
+ ------------------------------
You can use `Numeric Suite <https://pytorch.org/docs/stable/quantization-accuracy-debugging.html#numerical-debugging-tooling-prototype >`_
that can help with debugging in eager mode and FX graph mode. The new version of
@@ -569,9 +581,10 @@ Lowering and Performance Evaluation
The model produced at this point is not the final model that runs on the device,
it is a reference quantized model that captures the intended quantized computation
- from the user, expressed as ATen operators, to get a model that runs on real
- devices, we'll need to lower the model. For example for the models that run on
- edge devices, we can lower to executorch.
+ from the user, expressed as ATen operators and some additional quantize/dequantize operators;
+ to get a model that runs on real devices, we'll need to lower the model.
+ For example, for models that run on edge devices, we can lower the model with delegation and ExecuTorch runtime
+ operators.
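+
+ As a rough sketch of that flow (this assumes the separate ``executorch``
+ package is installed and that ``executorch.exir.to_edge`` is available; exact
+ APIs may differ across versions):
+
+ .. code-block:: python
+
+    import torch
+    from executorch.exir import to_edge
+
+    # re-export the quantized model to get an ExportedProgram
+    quantized_ep = torch.export.export(quantized_model, example_inputs)
+    # convert to the edge dialect, then to an ExecuTorch program
+    et_program = to_edge(quantized_ep).to_executorch()
+    # serialize for the on-device ExecuTorch runtime
+    with open(saved_model_dir + "resnet18_pt2e_quantized.pte", "wb") as f:
+        f.write(et_program.buffer)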
Conclusion
--------------