module onnxrt.ops_cpu.op_quantize_linear

Inheritance diagram of mlprodict.onnxrt.ops_cpu.op_quantize_linear

Short summary

module mlprodict.onnxrt.ops_cpu.op_quantize_linear

Runtime operator.




truncated documentation



DynamicQuantizeLinear: A function to fuse the calculation of scale, zero point and FP32->8-bit conversion …


QuantizeLinear: The linear quantization operator. It consumes a high precision tensor, a scale, and a zero …



truncated documentation


Returns the list of arguments as well as the list of parameters with the default values (close to the signature). …

Returns the list of modified parameters.

Returns the list of optional arguments.

Returns all parameters in a dictionary.


class mlprodict.onnxrt.ops_cpu.op_quantize_linear.DynamicQuantizeLinear(onnx_node, desc=None, **options)

Bases: OpRun

A function that fuses the calculation of the scale, the zero point and the FP32->8-bit conversion of FP32 input data. It outputs the scale, the zero point and the quantized input for a given FP32 input.

Scale is calculated as:

    y_scale = (max(x) - min(x)) / (qmax - qmin)

  • where qmax and qmin are the max and min values of the quantization range, i.e. [0, 255] in case of uint8

  • the data range is adjusted to include 0.

Zero point is calculated as:

    intermediate_zero_point = qmin - min(x) / y_scale
    y_zero_point = cast(round(saturate(intermediate_zero_point)))

  • where qmax and qmin are the max and min values of the quantization range, i.e. [0, 255] in case of uint8

  • for saturation, it saturates to [0, 255] if it’s uint8, or [-127, 127] if it’s int8. Right now only uint8 is supported.

  • rounding is to nearest, ties to even.

Data quantization formula is:

    y = saturate(round(x / y_scale) + y_zero_point)

  • for saturation, it saturates to [0, 255] if it’s uint8, or [-127, 127] if it’s int8. Right now only uint8 is supported.

  • rounding is to nearest, ties to even.
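The three formulas above can be traced step by step with a small plain-Python sketch. This is an illustration only, not the runtime's implementation: the helper names `saturate` and `dynamic_quantize_linear` are hypothetical, the input is assumed to be a flat list of floats, and the fallback for an all-zero input (where the scale would be 0) is an assumption of this sketch.

```python
def saturate(value, low=0, high=255):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


def dynamic_quantize_linear(x, qmin=0, qmax=255):
    """Hypothetical uint8 sketch of the formulas above for a flat list of floats."""
    # Adjust the data range so that it always includes 0.
    x_min = min(0.0, min(x))
    x_max = max(0.0, max(x))
    y_scale = (x_max - x_min) / (qmax - qmin)
    # Assumption of this sketch: an all-zero input would give y_scale == 0,
    # so fall back to 1.0 to avoid a division by zero.
    if y_scale == 0.0:
        y_scale = 1.0
    intermediate_zero_point = qmin - x_min / y_scale
    # Python's built-in round() rounds ties to the nearest even integer.
    y_zero_point = int(round(saturate(intermediate_zero_point, qmin, qmax)))
    y = [saturate(round(v / y_scale) + y_zero_point, qmin, qmax) for v in x]
    return y, y_scale, y_zero_point
```

Calling `dynamic_quantize_linear([0.5, 1.5, 2.5, 255.0])` gives a scale of exactly 1.0 and a zero point of 0, which makes the ties-to-even rounding visible: 0.5 maps to 0 while 1.5 and 2.5 both map to 2.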


Inputs

  • x (heterogeneous) T1: Input tensor

Outputs

  • y (heterogeneous) T2: Quantized output tensor

  • y_scale (heterogeneous) tensor(float): Output scale. It’s a scalar, which means a per-tensor/layer quantization.

  • y_zero_point (heterogeneous) T2: Output zero point. It’s a scalar, which means a per-tensor/layer quantization.

Type Constraints

  • T1 tensor(float): Constrain ‘x’ to float tensor.

  • T2 tensor(uint8): Constrain ‘y_zero_point’ and ‘y’ to 8-bit unsigned integer tensor.


Onnx name: DynamicQuantizeLinear

This version of the operator has been available since version 11.

Runtime implementation: DynamicQuantizeLinear

__init__(onnx_node, desc=None, **options)
_run(x, attributes=None, verbose=0, fLOG=None)

Should be overwritten.


class mlprodict.onnxrt.ops_cpu.op_quantize_linear.QuantizeLinear(onnx_node, desc=None, **options)

Bases: _CommonQuantizeLinear

The linear quantization operator. It consumes a high precision tensor, a scale, and a zero point to compute the low precision / quantized tensor. The scale factor and zero point must have the same shape, and can be either a scalar for per-tensor / per-layer quantization, or a 1-D tensor for per-axis quantization. The quantization formula is y = saturate((x / y_scale) + y_zero_point). For saturation, it saturates to [0, 255] if it’s uint8, or [-128, 127] if it’s int8. For (x / y_scale), it rounds to the nearest integer, with ties to even. ‘y_zero_point’ and ‘y’ must have the same type.
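As a rough illustration of the per-tensor case, the formula can be written in a few lines of plain Python. The function name `quantize_linear` and its `signed` flag are hypothetical: in the operator itself the output type follows the type of ‘y_zero_point’, whereas this sketch simply switches the saturation range with a boolean.

```python
def quantize_linear(x, y_scale, y_zero_point=0, signed=False):
    """Hypothetical per-tensor sketch: y = saturate(round(x / y_scale) + y_zero_point)."""
    # uint8 saturates to [0, 255], int8 to [-128, 127].
    low, high = (-128, 127) if signed else (0, 255)
    # round() on a Python float rounds ties to the nearest even integer.
    return [max(low, min(high, round(v / y_scale) + y_zero_point)) for v in x]


# Values far outside the representable range are clamped by the saturation step.
quantized = quantize_linear([0.0, 2.0, 3.0, 1000.0, -254.0, -1000.0], 2.0, 128)
```

With a scale of 2 and a zero point of 128, the value 3.0 becomes 1.5 after scaling, which rounds to the even neighbour 2 before the zero point is added.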


Attributes

  • axis: (Optional) The axis of the quantization dimension of the input tensor. Ignored for per-tensor quantization. A negative value means counting dimensions from the back. Accepted range is [-r, r-1] where r = rank(input). Default value is 1 (INT)
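To make the role of axis concrete, here is a minimal sketch of per-axis quantization on a 2-D input stored as a list of rows. Both helpers, `normalize_axis` and `quantize_per_axis_2d`, are hypothetical names introduced for illustration, and the sketch only covers uint8 output and rank 2; the runtime works on arrays of arbitrary rank.

```python
def normalize_axis(axis, rank):
    """Map an axis in [-rank, rank - 1] to its non-negative equivalent."""
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} is out of range for rank {rank}")
    return axis % rank


def quantize_per_axis_2d(x, y_scale, y_zero_point, axis=1):
    """Hypothetical per-axis uint8 quantization of a 2-D list of rows."""
    axis = normalize_axis(axis, 2)
    result = []
    for i, row in enumerate(x):
        out_row = []
        for j, value in enumerate(row):
            # Pick the scale and zero point that belong to this slice of the axis.
            k = i if axis == 0 else j
            q = round(value / y_scale[k]) + y_zero_point[k]
            # Saturate to the uint8 range [0, 255].
            out_row.append(max(0, min(255, q)))
        result.append(out_row)
    return result
```

With axis=1 (or equivalently axis=-1 for a 2-D input), each column gets its own scale and zero point; with axis=0 the same 1-D parameters are applied per row instead.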


Between 2 and 3 inputs.

  • x (heterogeneous) T1: N-D full precision input tensor to be quantized.

  • y_scale (heterogeneous) tensor(float): Scale for doing quantization to get ‘y’. It can be a scalar, which means per-tensor/layer quantization, or a 1-D tensor for per-axis quantization.

  • y_zero_point (optional, heterogeneous) T2: Zero point for doing quantization to get ‘y’. Shape must match y_scale. Default is uint8 with zero point of 0 if it’s not specified.

Outputs

  • y (heterogeneous) T2: N-D quantized output tensor. It has the same shape as input ‘x’.

Type Constraints

  • T1 tensor(float), tensor(int32): Constrain ‘x’ to float or int32 tensor.

  • T2 tensor(int8), tensor(uint8): Constrain ‘y_zero_point’ and ‘y’ to 8-bit integer tensor.


Onnx name: QuantizeLinear

This version of the operator has been available since version 13.

Runtime implementation: QuantizeLinear

__init__(onnx_node, desc=None, **options)
_run(*args, attributes=None, verbose=0, fLOG=None)

Should be overwritten.


class mlprodict.onnxrt.ops_cpu.op_quantize_linear._CommonQuantizeLinear(onnx_node, desc=None, expected_attributes=None, **options)

Bases: OpRun

__init__(onnx_node, desc=None, expected_attributes=None, **options)