Update Dynamic Quant BERT Tutorial 2 #753


Merged
134 changes: 53 additions & 81 deletions intermediate_source/dynamic_quantization_bert_tutorial.py
@@ -16,8 +16,8 @@
#
#
# In this tutorial, we will apply the dynamic quantization on a BERT
# model, closely following the BERT model from the HuggingFace
# Transformers examples (https://github.com/huggingface/transformers).
# model, closely following the BERT model from `the HuggingFace
# Transformers examples <https://github.com/huggingface/transformers>`_.
# With this step-by-step journey, we would like to demonstrate how to
# convert a well-known state-of-the-art model like BERT into a dynamic
# quantized model.
@@ -27,18 +27,16 @@
# achieves state-of-the-art accuracy results on many popular
# Natural Language Processing (NLP) tasks, such as question answering,
# text classification, and others. The original paper can be found
# here: https://arxiv.org/pdf/1810.04805.pdf.
# `here <https://arxiv.org/pdf/1810.04805.pdf>`_.
#
# - Dynamic quantization support in PyTorch converts a float model to a
# quantized model with static int8 or float16 data types for the
# weights and dynamic quantization for the activations. The activations
# are quantized dynamically (per batch) to int8 when the weights are
# quantized to int8.
#
# In PyTorch, we have `torch.quantization.quantize_dynamic API
# <https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic>`_
# ,which replaces specified modules with dynamic weight-only quantized
# versions and output the quantized model.
# quantized to int8. In PyTorch, we have the `torch.quantization.quantize_dynamic API
# <https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic>`_,
# which replaces specified modules with dynamic weight-only quantized
# versions and outputs the quantized model (a short usage sketch on a toy
# module follows this overview).
#
# - We demonstrate the accuracy and inference performance results on the
# `Microsoft Research Paraphrase Corpus (MRPC) task <https://www.microsoft.com/en-us/download/details.aspx?id=52398>`_
@@ -47,29 +45,24 @@
# a corpus of sentence pairs automatically extracted from online news
# sources, with human annotations of whether the sentences in the pair
# are semantically equivalent. Because the classes are imbalanced (68%
# positive, 32% negative), we follow common practice and report both
# accuracy and `F1 score <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html>`_
# positive, 32% negative), we follow the common practice and report the
# `F1 score <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html>`_.
# MRPC is a common NLP task for language pair classification, as shown
# below.
#
# .. figure:: /_static/img/bert_mrpc.png
# .. figure:: /_static/img/bert.png
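#
# As a quick preview of the ``torch.quantization.quantize_dynamic`` API
# introduced above, here is a minimal sketch on a toy module (an illustration
# only; the tutorial applies the same call to the fine-tuned BERT model in
# step 9):
#
# .. code:: python
#
#     import torch
#     import torch.nn as nn
#
#     float_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
#
#     # Replace the nn.Linear modules with dynamic weight-only quantized versions
#     quantized_model = torch.quantization.quantize_dynamic(
#         float_model, {nn.Linear}, dtype=torch.qint8
#     )
#     print(quantized_model)  # the Linear layers are now DynamicQuantizedLinear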


######################################################################
# Setup
# 1. Setup
# --------
Contributor:

The length of the header underlining must be equal to the length of the header

#
# Install PyTorch and HuggingFace Transformers
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# To start this tutorial, let’s first follow the installation instructions
# in PyTorch and HuggingFace Github Repo: -
#
# * https://github.com/pytorch/pytorch/#installation -
#
# * https://github.com/huggingface/transformers#installation
#
# In addition, we also install ``sklearn`` package, as we will reuse its
# for PyTorch `here <https://github.com/pytorch/pytorch/#installation>`_ and HuggingFace Transformers `here <https://github.com/huggingface/transformers#installation>`_.
# In addition, we also install the `scikit-learn <https://github.com/scikit-learn/scikit-learn>`_ package, as we will reuse its
# built-in F1 score calculation helper function.
#
# .. code:: shell
@@ -94,7 +87,7 @@


######################################################################
# Import the necessary modules
# 2. Import the necessary modules
# -------------------------------
#
# In this step we import the necessary Python modules for the tutorial.
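#
# The full import list sits outside the lines changed in this diff; a
# representative subset (an assumption based on what the later steps use) is:
#
# .. code:: python
#
#     import os
#     import time
#     from argparse import Namespace
#
#     import torch
#     from transformers import (BertConfig, BertForSequenceClassification,
#                               BertTokenizer,
#                               glue_convert_examples_to_features)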
@@ -137,61 +130,51 @@


######################################################################
# Download the dataset
# 3. Download the dataset
# -----------------------
#
# Before running MRPC tasks we download the `GLUE data
# <https://gluebenchmark.com/tasks>`_ by running this `script
# <https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e>`_ followed by
# `download_glue_data <https://github.com/nyu-mll/GLUE-baselines/blob/master/download_glue_data.py>`_.
# and unpack it to some directory “glue_data/MRPC”.
# <https://gluebenchmark.com/tasks>`_ by running `this script
# <https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e>`_
# and unpack it to a directory ``glue_data``.
#
#
# .. code:: shell
#
# wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
# python download_glue_data.py --data_dir='glue_data' --tasks='MRPC'
# ls glue_data/MRPC
#


######################################################################
# Helper functions
# 4. Helper functions
# -------------------
#
# The helper functions are built into the transformers library. We mainly use
# the following helper functions: one for converting the text examples
# into the feature vectors; the other for measuring the F1 score of
# the predicted result.
#
# Convert the texts into features
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# `glue_convert_examples_to_features <https://github.com/huggingface/transformers/blob/master/transformers/data/processors/glue.py>`_.
# load a data file into a list of ``InputFeatures``.
# The `glue_convert_examples_to_features <https://github.com/huggingface/transformers/blob/master/transformers/data/processors/glue.py>`_ function converts the texts into input features:
#
# - Tokenize the input sequences;
# - Insert [CLS] at the beginning;
# - Insert [SEP] between the first sentence and the second sentence, and
# at the end;
# - Generate token type ids to indicate whether a token belongs to the
# first sequence or the second sequence;
#
# F1 metric
# ~~~~~~~~~
# first sequence or the second sequence.
#
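# As a small illustration of those steps (a minimal sketch that calls the
# tokenizer directly; the sentence pair below is made up):
#
# .. code:: python
#
#     from transformers import BertTokenizer
#
#     tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
#     # encode_plus inserts [CLS]/[SEP] and builds the token type ids for the pair
#     encoded = tokenizer.encode_plus("The cat sat on the mat.",
#                                     "A cat was sitting on a mat.")
#     print(encoded["input_ids"])       # token ids, starting with [CLS]
#     print(encoded["token_type_ids"])  # 0 for the first sentence, 1 for the second
#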
# The `F1 score <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html>`_
# can be interpreted as a weighted average of the precision and recall,
# where an F1 score reaches its best value at 1 and worst score at 0. The
# relative contributions of precision and recall to the F1 score are equal.
# The formula for the F1 score is:
# The equation for the F1 score is:
#
# F1 = 2 \* (precision \* recall) / (precision + recall)
# - F1 = 2 \* (precision \* recall) / (precision + recall)
Contributor:
.. math:: F1 = 2 * (\text{precision} * \text{recall}) / (\text{precision} + \text{recall})

#
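# Since the tutorial reuses the scikit-learn helper for this metric, here is a
# minimal, self-contained example (with made-up labels):
#
# .. code:: python
#
#     from sklearn.metrics import f1_score
#
#     y_true = [0, 1, 1, 0, 1]
#     y_pred = [0, 1, 0, 0, 1]
#     print(f1_score(y_true, y_pred))  # 0.8
#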


######################################################################
# Fine-tune the BERT model
# 5. Fine-tune the BERT model
# ---------------------------
#

@@ -204,15 +187,15 @@
# with the pre-trained BERT model to classify semantically equivalent
# sentence pairs on the MRPC task.
#
# To fine-tune the pre-trained BERT model (bert-base-uncased model in
# To fine-tune the pre-trained BERT model (``bert-base-uncased`` model in
# HuggingFace transformers) for the MRPC task, you can follow the command
# in `examples<https://github.com/huggingface/transformers/tree/master/examples>`_"
# in `examples <https://github.com/huggingface/transformers/tree/master/examples#mrpc>`_:
#
# ::
#
# export GLUE_DIR=./glue_data
# export TASK_NAME=MRPC
# export OUT_DIR=/mnt/homedir/jianyuhuang/public/bert/$TASK_NAME/
# export OUT_DIR=./$TASK_NAME/
# python ./run_glue.py \
# --model_type bert \
# --model_name_or_path bert-base-uncased \
@@ -229,24 +212,11 @@
# --save_steps 100000 \
# --output_dir $OUT_DIR
#
# We provide the fined-tuned BERT model for MRPC task here (We did the
# fine-tuning on CPUs with a total train batch size of 8):
#
# https://drive.google.com/drive/folders/1mGBx0t-YJAWXHbgab2f_IimaMiVHlKh-
#
# To save time, you can manually copy the fined-tuned BERT model for MRPC
# task in your Google Drive (Create the same “BERT_Quant_Tutorial/MRPC”
# folder in the Google Drive directory), and then mount your Google Drive
# on your runtime using an authorization code, so that we can directly
# read and write the models into Google Drive in the following steps.
#

from google.colab import drive
drive.mount('/content/drive')

# We provide the fine-tuned BERT model for the MRPC task `here <https://download.pytorch.org/tutorial/MRPC.zip>`_.
# To save time, you can download the model file (~400 MB) directly into your local folder ``$OUT_DIR``.

######################################################################
# Set global configurations
# 6. Set global configurations
# ----------------------------
#

@@ -258,11 +228,11 @@

configs = Namespace()

# The output directory for the fine-tuned model.
configs.output_dir = "/content/drive/My Drive/BERT_Quant_Tutorial/MRPC/"
# The output directory for the fine-tuned model, $OUT_DIR.
configs.output_dir = "./MRPC/"

# The data directory for the MRPC task in the GLUE benchmark.
configs.data_dir = "/content/glue_data/MRPC"
# The data directory for the MRPC task in the GLUE benchmark, $GLUE_DIR/$TASK_NAME.
configs.data_dir = "./glue_data/MRPC"

# The model name or path for the pre-trained model.
configs.model_name_or_path = "bert-base-uncased"
@@ -294,7 +264,7 @@ def set_seed(seed):


######################################################################
# Load the fine-tuned BERT model
# 7. Load the fine-tuned BERT model
# ---------------------------------
#
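# The loading code itself is outside the lines changed in this diff. A minimal
# sketch of what it does (assuming the fine-tuned model from step 5 was saved
# to ``configs.output_dir``, and that ``BertTokenizer`` and
# ``BertForSequenceClassification`` were imported in step 2) is:
#
# .. code:: python
#
#     tokenizer = BertTokenizer.from_pretrained(configs.output_dir)
#     model = BertForSequenceClassification.from_pretrained(configs.output_dir)
#     model.to("cpu")  # dynamic quantization targets CPU inference
#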

@@ -312,11 +282,12 @@ def set_seed(seed):


######################################################################
# Define the tokenize and evaluation function
# 8. Define the tokenize and evaluation function
# ----------------------------------------------
#
# We reuse the tokenize and evaluation function from `huggingface <https://github.com/huggingface/transformers/blob/master/examples/run_glue.py>`_.
# We reuse the tokenize and evaluation function from `HuggingFace <https://github.com/huggingface/transformers/blob/master/examples/run_glue.py>`_.
#

# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
@@ -455,7 +426,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):


######################################################################
# Apply the dynamic quantization
# 9. Apply the dynamic quantization
# ---------------------------------
#
# We call ``torch.quantization.quantize_dynamic`` on the model to apply
@@ -474,11 +445,11 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):


######################################################################
# Check the model size
# 10. Check the model size
# ------------------------
#
# Let’s first check the model size. We can observe a significant reduction
# in model size:
# in model size (FP32 total size: 438 MB; INT8 total size: 181 MB):
#

def print_size_of_model(model):
Expand All @@ -491,7 +462,7 @@ def print_size_of_model(model):


######################################################################
# The BERT model used in this tutorial (bert-base-uncased) has a
# The BERT model used in this tutorial (``bert-base-uncased``) has a
# vocabulary size V of 30522. With the embedding size of 768, the total
# size of the word embedding table is ~ 4 (Bytes/FP32) \* 30522 \* 768 =
# 90 MB. So with the help of quantization, the model size of the
@@ -501,15 +472,14 @@ def print_size_of_model(model):
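
# A quick back-of-the-envelope check of the embedding-table estimate above (an
# illustrative sketch, not part of the original tutorial code):

vocab_size = 30522      # V for bert-base-uncased
embedding_size = 768    # hidden size of bert-base-uncased
fp32_bytes = 4          # bytes per FP32 weight
embedding_mb = vocab_size * embedding_size * fp32_bytes / 2**20
# Prints ~89 MB, i.e. roughly the 90 MB quoted above
print("FP32 word embedding table: ~{:.0f} MB".format(embedding_mb))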


######################################################################
# Evaluate the inference accuracy and time
# 11. Evaluate the inference accuracy and time
# --------------------------------------------
#
# Next, let’s compare the inference time as well as the evaluation
# accuracy between the original FP32 model and the INT8 model after the
# dynamic quantization.
#

# Evaluate the original FP32 BERT model
def time_model_evaluation(model, configs, tokenizer):
eval_start_time = time.time()
result = evaluate(configs, model, tokenizer, prefix="")
Expand All @@ -518,6 +488,7 @@ def time_model_evaluation(model, configs, tokenizer):
print(result)
print("Evaluate total time (seconds): {0:.1f}".format(eval_duration_time))

# Evaluate the original FP32 BERT model
time_model_evaluation(model, configs, tokenizer)

# Evaluate the INT8 BERT model after the dynamic quantization
Expand All @@ -539,7 +510,8 @@ def time_model_evaluation(model, configs, tokenizer):
#
# We have 0.6% lower F1 score accuracy after applying the post-training dynamic
# quantization on the fine-tuned BERT model on the MRPC task. As a
# comparison, in the recent paper [3] (Table 1), it achieved 0.8788 by
# comparison, in a `recent paper <https://arxiv.org/pdf/1910.06188.pdf>`_ (Table 1),
# it achieved 0.8788 by
# applying the post-training dynamic quantization and 0.8956 by applying
# the quantization-aware training. The main reason is that we support the
# asymmetric quantization in PyTorch while that paper supports the
Expand All @@ -561,7 +533,7 @@ def time_model_evaluation(model, configs, tokenizer):


######################################################################
# Serialize the quantized model
# 12. Serialize the quantized model
# ---------------------------------
#
# We can serialize and save the quantized model for future use.
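#
# The serialization code itself is outside the lines shown in this diff. One
# simple way to do it (a sketch, assuming ``quantized_model`` and ``configs``
# from the previous steps, with ``os`` imported) is:
#
# .. code:: python
#
#     quantized_output_dir = configs.output_dir + "quantized/"
#     if not os.path.exists(quantized_output_dir):
#         os.makedirs(quantized_output_dir)
#     # the file name below is arbitrary
#     torch.save(quantized_model.state_dict(),
#                quantized_output_dir + "pytorch_model.bin")
#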
Expand All @@ -583,7 +555,7 @@ def time_model_evaluation(model, configs, tokenizer):
# having a limited impact on accuracy.
#
# Thanks for reading! As always, we welcome any feedback, so please create
# an issue here (https://github.com/pytorch/pytorch/issues) if you have
# an issue `here <https://github.com/pytorch/pytorch/issues>`_ if you have
# any.
#

Expand All @@ -592,14 +564,14 @@ def time_model_evaluation(model, configs, tokenizer):
# References
# -----------
#
# [1] J.Devlin, M. Chang, K. Lee and K. Toutanova, BERT: Pre-training of
# [1] J. Devlin, M. Chang, K. Lee and K. Toutanova, `BERT: Pre-training of
# Deep Bidirectional Transformers for Language Understanding (2018)
# <https://arxiv.org/pdf/1810.04805.pdf>`_.
#
# [2] HuggingFace Transformers.
# https://github.com/huggingface/transformers
# [2] `HuggingFace Transformers <https://github.com/huggingface/transformers>`_.
#
# [3] O. Zafrir, G. Boudoukh, P. Izsak, & M. Wasserblat (2019). Q8BERT:
# Quantized 8bit BERT. arXiv preprint arXiv:1910.06188.
# [3] O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat (2019). `Q8BERT:
# Quantized 8bit BERT <https://arxiv.org/pdf/1910.06188.pdf>`_.
#

