
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "gyexamples/plot_quantization.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_gyexamples_plot_quantization.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_gyexamples_plot_quantization.py:


.. _l-plot-quantization:

==============================
Quantization with onnxruntime
==============================

.. index:: quantization, onnxruntime

Quantization aims at reducing the model size, but it also computes the output
at a lower precision. Static quantization estimates the best quantization
parameters once, from a calibration dataset. Dynamic quantization estimates
these parameters for every observation at inference time. Let's see the
differences (see also *Quantize ONNX Models* in the onnxruntime
documentation).

.. contents::
    :local:

A model
=======

Let's retrieve a reasonably small model. Models are taken from the
`ONNX Model Zoo <https://github.com/onnx/models>`_ but a custom model works
as well.

.. GENERATED FROM PYTHON SOURCE LINES 29-73

.. code-block:: default


    import os
    import urllib.request
    import time
    import tqdm
    import numpy
    import onnx
    import pandas
    import matplotlib.pyplot as plt
    from onnxruntime import InferenceSession
    from onnxruntime.quantization.quantize import quantize_dynamic, quantize_static
    from onnxruntime.quantization.calibrate import CalibrationDataReader
    from onnxruntime.quantization.quant_utils import QuantFormat, QuantType
    from onnxruntime.quantization.shape_inference import quant_pre_process


    def download_file(url, name, min_size):
        if not os.path.exists(name):
            print(f"download '{url}'")
            with urllib.request.urlopen(url) as u:
                content = u.read()
            if len(content) < min_size:
                raise RuntimeError(
                    f"Unable to download '{url}' due to\n{content}")
            print(f"downloaded {len(content)} bytes.")
            with open(name, "wb") as f:
                f.write(content)
        else:
            print(f"'{name}' already downloaded")


    small = "small"  # any non-empty value selects the small model
    if small:
        model_name = "mobilenetv2-12.onnx"
        url_name = ("https://github.com/onnx/models/raw/main/vision/"
                    "classification/mobilenet/model")
    else:
        model_name = "resnet50-v1-12.onnx"
        url_name = ("https://github.com/onnx/models/raw/main/vision/"
                    "classification/resnet/model")

    if url_name is not None:
        url_name += "/" + model_name
    download_file(url_name, model_name, 100000)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    download 'https://github.com/onnx/models/raw/main/vision/classification/mobilenet/model/mobilenetv2-12.onnx'
    downloaded 13964571 bytes.


.. GENERATED FROM PYTHON SOURCE LINES 74-75

Inputs and outputs.

.. GENERATED FROM PYTHON SOURCE LINES 75-93

.. code-block:: default


    sess_full = InferenceSession(model_name, providers=["CPUExecutionProvider"])

    for i in sess_full.get_inputs():
        print(f"input {i}, name={i.name!r}, type={i.type}, shape={i.shape}")
        input_name = i.name
        input_shape = list(i.shape)
        if input_shape[0] in [None, "batch_size", "N"]:
            input_shape[0] = 1

    output_name = None
    for i in sess_full.get_outputs():
        print(f"output {i}, name={i.name!r}, type={i.type}, shape={i.shape}")
        if output_name is None:
            output_name = i.name

    print(f"input_name={input_name!r}, output_name={output_name!r}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    input NodeArg(name='input', type='tensor(float)', shape=['batch_size', 3, 224, 224]), name='input', type=tensor(float), shape=['batch_size', 3, 224, 224]
    output NodeArg(name='output', type='tensor(float)', shape=['batch_size', 1000]), name='output', type=tensor(float), shape=['batch_size', 1000]
    input_name='input', output_name='output'

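
A quick sanity check before quantizing anything: run the full-precision
session once on a random tensor with the declared input shape. This is a
minimal sketch relying only on the objects defined above (``sess_full``,
``input_name``, ``output_name``, ``input_shape``); the input is random noise,
not a real image, so the predicted class is meaningless.

.. code-block:: python

    # Sanity check (sketch): one inference on a random tensor shaped like
    # the declared input, keeping only the requested output.
    x = numpy.random.rand(*input_shape).astype(numpy.float32)
    scores = sess_full.run([output_name], {input_name: x})[0]
    print(scores.shape, scores.argmax())  # e.g. (1, 1000) and an arbitrary class id
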

.. GENERATED FROM PYTHON SOURCE LINES 94-95

We build random data.

.. GENERATED FROM PYTHON SOURCE LINES 95-102

.. code-block:: default


    maxN = 50
    imgs = [numpy.random.rand(*input_shape).astype(numpy.float32)
            for i in range(maxN)]

    experiments = []


.. GENERATED FROM PYTHON SOURCE LINES 103-109

Static Quantization
===================

Static quantization estimates the best quantization parameters (scale and
zero point) to minimize the error compared to the original model. It
requires calibration data.

.. GENERATED FROM PYTHON SOURCE LINES 109-126

.. code-block:: default


    class DataReader(CalibrationDataReader):
        def __init__(self, input_name, imgs):
            self.input_name = input_name
            self.data = imgs
            self.pos = -1

        def get_next(self):
            if self.pos >= len(self.data) - 1:
                return None
            self.pos += 1
            return {self.input_name: self.data[self.pos]}

        def rewind(self):
            self.pos = -1


.. GENERATED FROM PYTHON SOURCE LINES 127-128

Runs the quantization.

.. GENERATED FROM PYTHON SOURCE LINES 128-138

.. code-block:: default


    quantize_name = model_name + ".qdq.onnx"

    quantize_static(model_name,
                    quantize_name,
                    calibration_data_reader=DataReader(input_name, imgs),
                    quant_format=QuantFormat.QDQ)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    WARNING:root:Please consider pre-processing before quantization. See https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/image_classification/cpu/ReadMe.md
    WARNING:root:Please consider pre-processing before quantization. See https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/image_classification/cpu/ReadMe.md


.. GENERATED FROM PYTHON SOURCE LINES 139-140

Compares the size.

.. GENERATED FROM PYTHON SOURCE LINES 140-153

.. code-block:: default


    with open(model_name, "rb") as f:
        model_onnx = onnx.load(f)

    with open(quantize_name, "rb") as f:
        quant_onnx = onnx.load(f)

    model_onnx_bytes = model_onnx.SerializeToString()
    quant_onnx_bytes = quant_onnx.SerializeToString()
    print(f"Model size: {len(model_onnx_bytes)} and "
          f"quantized: {len(quant_onnx_bytes)}, "
          f"ratio={len(quant_onnx_bytes) / len(model_onnx_bytes)}.")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Model size: 13964571 and quantized: 3597737, ratio=0.2576331918825147.

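
The size reduction comes from storing the weights on 8 bits. In QDQ format the
float operators are kept and surrounded by ``QuantizeLinear`` /
``DequantizeLinear`` pairs. A small sketch makes that visible; it only uses
the ``onnx`` protobuf API and the ``model_onnx`` / ``quant_onnx`` objects
loaded above, and the exact node counts depend on the onnxruntime version.

.. code-block:: python

    # Sketch: compare the operator histograms of the original and QDQ models.
    from collections import Counter

    ops_full = Counter(node.op_type for node in model_onnx.graph.node)
    ops_quant = Counter(node.op_type for node in quant_onnx.graph.node)
    print("original :", ops_full.most_common(5))
    print("quantized:", ops_quant.most_common(5))
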

.. GENERATED FROM PYTHON SOURCE LINES 154-155

Let's measure the discrepancies.

.. GENERATED FROM PYTHON SOURCE LINES 155-202

.. code-block:: default


    def compare_with(sess_full, imgs, quantize_name):
        sess = InferenceSession(quantize_name,
                                providers=["CPUExecutionProvider"])
        mean_diff = 0
        mean_max = 0
        time_full = 0
        time_quant = 0
        disa = 0
        for img in tqdm.tqdm(imgs):
            feeds = {input_name: img}

            begin = time.perf_counter()
            full = sess_full.run(None, feeds)
            time_full += time.perf_counter() - begin

            begin = time.perf_counter()
            quant = sess.run(None, feeds)
            time_quant += time.perf_counter() - begin

            diff = numpy.abs(full[0] - quant[0]).ravel()
            mean_max += numpy.abs(full[0].ravel().max() -
                                  quant[0].ravel().max())
            mean_diff += diff.mean()
            if full[0].argmax() != quant[0].argmax():
                disa += 1

        mean_diff /= len(imgs)
        mean_max /= len(imgs)
        time_full /= len(imgs)
        time_quant /= len(imgs)
        return dict(mean_diff=mean_diff, mean_max=mean_max,
                    time_full=time_full, time_quant=time_quant,
                    disagree=disa / len(imgs),
                    ratio=time_quant / time_full)


    res = compare_with(sess_full, imgs, quantize_name)
    res["name"] = "static"
    experiments.append(res)
    print(f"Discrepancies: mean={res['mean_diff']:.2f}, "
          f"mean_max={res['mean_max']:.2f}, "
          f"times {res['time_full']} -> {res['time_quant']}, "
          f"disagreement={res['disagree']:.2f}")
    res


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Discrepancies: mean=0.39, mean_max=0.33, times 0.045105748001951725 -> 0.07221101615577936, disagreement=0.82

    {'mean_diff': 0.38776373863220215, 'mean_max': 0.33206552505493164, 'time_full': 0.045105748001951725, 'time_quant': 0.07221101615577936, 'disagree': 0.82, 'ratio': 1.600927140209598, 'name': 'static'}


.. GENERATED FROM PYTHON SOURCE LINES 203-205

With preprocessing
==================

.. GENERATED FROM PYTHON SOURCE LINES 205-210

.. code-block:: default


    preprocessed_name = model_name + ".pre.onnx"
    quant_pre_process(model_name, preprocessed_name)


.. GENERATED FROM PYTHON SOURCE LINES 211-212

And quantization again.

.. GENERATED FROM PYTHON SOURCE LINES 212-227

.. code-block:: default


    quantize_static(preprocessed_name,
                    quantize_name,
                    calibration_data_reader=DataReader(input_name, imgs),
                    quant_format=QuantFormat.QDQ)

    res = compare_with(sess_full, imgs, quantize_name)
    res["name"] = "static-pre"
    experiments.append(res)
    print(f"Discrepancies: mean={res['mean_diff']:.2f}, "
          f"mean_max={res['mean_max']:.2f}, "
          f"times {res['time_full']} -> {res['time_quant']}, "
          f"disagreement={res['disagree']:.2f}")
    res


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Discrepancies: mean=0.39, mean_max=0.33, times 0.04743068429874256 -> 0.07532987864222378, disagreement=0.82

    {'mean_diff': 0.38776373863220215, 'mean_max': 0.33206552505493164, 'time_full': 0.04743068429874256, 'time_quant': 0.07532987864222378, 'disagree': 0.82, 'ratio': 1.5882098214682696, 'name': 'static-pre'}


.. GENERATED FROM PYTHON SOURCE LINES 228-230

Dynamic quantization
====================

.. GENERATED FROM PYTHON SOURCE LINES 230-245

.. code-block:: default


    quantize_name = model_name + ".qdq.dyn.onnx"

    quantize_dynamic(preprocessed_name,
                     quantize_name,
                     weight_type=QuantType.QUInt8)

    res = compare_with(sess_full, imgs, quantize_name)
    res["name"] = "dynamic"
    experiments.append(res)
    print(f"Discrepancies: mean={res['mean_diff']:.2f}, "
          f"mean_max={res['mean_max']:.2f}, "
          f"times {res['time_full']} -> {res['time_quant']}, "
          f"disagreement={res['disagree']:.2f}")
    res


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Discrepancies: mean=0.29, mean_max=0.20, times 0.021916336235590278 -> 0.13009194625774398, disagreement=0.18

    {'mean_diff': 0.294784417450428, 'mean_max': 0.1978781032562256, 'time_full': 0.021916336235590278, 'time_quant': 0.13009194625774398, 'disagree': 0.18, 'ratio': 5.935843694827316, 'name': 'dynamic'}

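
Both flavours rely on the same affine mapping ``x ~ (q - zero_point) * scale``
with ``q`` stored on 8 bits: static quantization freezes ``scale`` and
``zero_point`` for the activations from the calibration data, while dynamic
quantization recomputes them for every observation at inference time. The
sketch below illustrates that mapping on a plain numpy array;
``quantize_uint8`` is a hypothetical helper written for this example (not an
onnxruntime API) and it skips refinements such as forcing the quantized range
to contain zero.

.. code-block:: python

    # Per-tensor affine (asymmetric) uint8 quantization, simplified:
    #     q = clip(round(x / scale) + zero_point, 0, 255)
    #     x ~ (q - zero_point) * scale
    import numpy


    def quantize_uint8(x):
        # Map the observed range [min(x), max(x)] onto [0, 255].
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / 255.0 if hi > lo else 1.0
        zero_point = int(numpy.clip(round(-lo / scale), 0, 255))
        q = numpy.clip(numpy.round(x / scale) + zero_point, 0, 255)
        return q.astype(numpy.uint8), scale, zero_point


    x = numpy.random.randn(5).astype(numpy.float32)
    q, scale, zp = quantize_uint8(x)
    dequantized = (q.astype(numpy.float32) - zp) * scale
    print(x, q, numpy.abs(x - dequantized).max())  # error on the order of scale
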

.. GENERATED FROM PYTHON SOURCE LINES 246-254

Conclusion
==========

Static quantization (the same quantization parameters for every observation)
does not work well here: the quantized model disagrees with the original one
on most observations. Dynamic quantization (quantization parameters estimated
for every observation) is much closer to the original model but it is also a
lot slower.

.. GENERATED FROM PYTHON SOURCE LINES 254-263

.. code-block:: default


    fig, ax = plt.subplots(1, 3, figsize=(12, 4))
    df = pandas.DataFrame(experiments).set_index("name")
    df[["ratio"]].plot(
        ax=ax[0], title="Time ratio quantized / original\nlower is better",
        kind="bar")
    df[["mean_diff"]].plot(ax=ax[1], title="Average difference", kind="bar")
    df[["disagree"]].plot(
        ax=ax[2], title="Proportion of disagreements\n(best class changes)",
        kind="bar")

    # plt.show()


.. image-sg:: /gyexamples/images/sphx_glr_plot_quantization_001.png
   :alt: Time ratio quantized / original lower is better, Average difference, Proportion of disagreements (best class changes)
   :srcset: /gyexamples/images/sphx_glr_plot_quantization_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  40.287 seconds)


.. _sphx_glr_download_gyexamples_plot_quantization.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_quantization.py <plot_quantization.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_quantization.ipynb <plot_quantization.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_