com.microsoft - QAttention
QAttention - 1 (com.microsoft)
Version
name: QAttention (GitHub)
domain: com.microsoft
since_version: 1
This version of the operator has been available since version 1 of domain com.microsoft.
Summary
Quantization of Multi-Head Self Attention.
Attributes
num_heads (required): Number of attention heads. No default value.
unidirectional: Whether every token can only attend to previous tokens. Default value is 0.
Inputs
Between 5 and 9 inputs.
input (heterogeneous) - T1: 3D input tensor with shape (batch_size, sequence_length, input_hidden_size)
weight (heterogeneous) - T2: 2D input tensor with shape (input_hidden_size, 3 * hidden_size), hidden_size = num_heads * head_size
bias (heterogeneous) - T3: 1D input tensor with shape (3 * hidden_size)
input_scale (heterogeneous) - T3: scale of quantized input tensor. It's a scalar, which means a per-tensor/layer quantization.
weight_scale (heterogeneous) - T3: scale of quantized weight tensor. It's a scalar or a 1D tensor, which means a per-tensor/per-column quantization. Its size should be 3 * hidden_size if it is per-column quantization.
mask_index (optional, heterogeneous) - T4: Attention mask index with shape (batch_size)
input_zero_point (optional, heterogeneous) - T1: zero point of quantized input tensor. It's a scalar, which means a per-tensor/layer quantization.
weight_zero_point (optional, heterogeneous) - T2: zero point of quantized weight tensor. It's a scalar or a 1D tensor, which means a per-tensor/per-column quantization. Its size should be 3 * hidden_size if it is per-column quantization.
past (optional, heterogeneous) - T3: past state for key and value with shape (2, batch_size, num_heads, past_sequence_length, head_size).
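The distinction between per-tensor and per-column quantization of the weight can be sketched in NumPy. The shapes below are made-up illustration values, not taken from any model; the dequantization formula `(q - zero_point) * scale` is the standard linear-quantization convention:

```python
import numpy as np

# Hypothetical sizes for illustration only.
input_hidden_size, hidden_size = 8, 4
q_weight = np.random.randint(0, 255, size=(input_hidden_size, 3 * hidden_size), dtype=np.uint8)

# Per-tensor quantization: a single scalar scale and zero point for the whole weight.
scale_per_tensor = np.float32(0.02)
zp_per_tensor = np.uint8(128)
w_per_tensor = (q_weight.astype(np.float32) - np.float32(zp_per_tensor)) * scale_per_tensor

# Per-column quantization: one scale/zero point per output column,
# so the 1D vectors have size 3 * hidden_size and broadcast over rows.
scale_per_col = np.full(3 * hidden_size, 0.02, dtype=np.float32)
zp_per_col = np.full(3 * hidden_size, 128, dtype=np.uint8)
w_per_col = (q_weight.astype(np.float32) - zp_per_col.astype(np.float32)) * scale_per_col
```

With identical values in every column, the two schemes dequantize to the same result; per-column quantization simply allows each of the 3 * hidden_size output columns its own scale and zero point.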
Outputs
Between 1 and 2 outputs.
output (heterogeneous) - T3: 3D output tensor with shape (batch_size, sequence_length, hidden_size)
present (optional, heterogeneous) - T3: present state for key and value with shape (2, batch_size, num_heads, past_sequence_length + sequence_length, head_size)
Examples