com.microsoft - QAttention

QAttention - 1 (com.microsoft)

Version

  • name: QAttention (GitHub)

  • domain: com.microsoft

  • since_version: 1

  • function:

  • support_level:

  • shape inference:

This version of the operator has been available since version 1 of domain com.microsoft.

Summary

Quantized multi-head self-attention. The input and weight tensors are quantized; they are dequantized using the supplied scales and zero points before the attention computation.
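As a rough guide to the semantics, here is a minimal NumPy sketch of the computation, assuming per-tensor quantization, no mask_index, no past state, and unidirectional left at 0 (the function name and all variable names are illustrative, not part of the operator):

```python
import numpy as np

def qattention_reference(input_q, weight_q, bias, input_scale, weight_scale,
                         num_heads, input_zero_point=0, weight_zero_point=0):
    # Dequantize the quantized input and weight (per-tensor assumption).
    x = (input_q.astype(np.int32) - input_zero_point) * input_scale
    w = (weight_q.astype(np.int32) - weight_zero_point) * weight_scale

    batch, seq, _ = x.shape
    hidden = w.shape[1] // 3          # weight packs the Q, K and V projections
    head = hidden // num_heads

    # Fused QKV projection: (batch_size, sequence_length, 3 * hidden_size).
    qkv = x @ w + bias
    q, k, v = np.split(qkv, 3, axis=-1)

    def to_heads(t):
        # (batch, seq, hidden) -> (batch, num_heads, seq, head_size)
        return t.reshape(batch, seq, num_heads, head).transpose(0, 2, 1, 3)

    q, k, v = to_heads(q), to_heads(k), to_heads(v)

    # Scaled dot-product attention per head, followed by a softmax.
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    out = probs @ v

    # Merge heads back: (batch_size, sequence_length, hidden_size).
    return out.transpose(0, 2, 1, 3).reshape(batch, seq, hidden)
```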

Attributes

  • num_heads (required): Number of attention heads.

  • unidirectional: Whether each token can attend only to preceding tokens (see the masking sketch after this list). Default value is 0.
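When unidirectional is 1, the attention scores are masked so that position i attends only to positions j <= i. A small sketch of such a causal mask (illustrative only):

```python
import numpy as np

sequence_length = 4
# Lower-triangular matrix: entry (i, j) is True when j <= i.
allowed = np.tril(np.ones((sequence_length, sequence_length), dtype=bool))
# Adding this bias to the raw scores before the softmax zeroes out the
# probability of attending to future positions.
mask_bias = np.where(allowed, 0.0, -np.inf)
```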

Inputs

Between 5 and 9 inputs.

  • input (heterogeneous) - T1: 3D input tensor with shape (batch_size, sequence_length, input_hidden_size)

  • weight (heterogeneous) - T2: 2D input tensor with shape (input_hidden_size, 3 * hidden_size), hidden_size = num_heads * head_size

  • bias (heterogeneous) - T3: 1D input tensor with shape (3 * hidden_size)

  • input_scale (heterogeneous) - T3: scale of the quantized input tensor. It’s a scalar, which means per-tensor/layer quantization.

  • weight_scale (heterogeneous) - T3: scale of the quantized weight tensor. It’s a scalar or a 1D tensor, which means per-tensor or per-column quantization. Its size should be 3 * hidden_size for per-column quantization (see the quantization sketch after this list).

  • mask_index (optional, heterogeneous) - T4: Attention mask index with shape (batch_size)

  • input_zero_point (optional, heterogeneous) - T1: zero point of the quantized input tensor. It’s a scalar, which means per-tensor/layer quantization.

  • weight_zero_point (optional, heterogeneous) - T2: zero point of the quantized weight tensor. It’s a scalar or a 1D tensor, which means per-tensor or per-column quantization. Its size should be 3 * hidden_size for per-column quantization.

  • past (optional, heterogeneous) - T3: past state for key and value with shape (2, batch_size, num_heads, past_sequence_length, head_size).
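The scale and zero-point inputs describe the usual affine quantization scheme; a minimal sketch of the per-tensor case (the helper names are illustrative):

```python
import numpy as np

def quantize_per_tensor(x, scale, zero_point, dtype=np.uint8):
    # real_value ~= (quantized_value - zero_point) * scale
    info = np.iinfo(dtype)
    q = np.round(x / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)

def dequantize_per_tensor(q, scale, zero_point):
    return (q.astype(np.int32) - zero_point) * scale
```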

Outputs

Between 1 and 2 outputs.

  • output (heterogeneous) - T3: 3D output tensor with shape (batch_size, sequence_length, hidden_size)

  • present (optional, heterogeneous) - T3: present state for key and value with shape (2, batch_size, num_heads, past_sequence_length + sequence_length, head_size); see the sketch below.
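To relate past and present: present stacks the key and value states and extends the cached past along the sequence axis. A shape-only sketch (all sizes are placeholders):

```python
import numpy as np

batch, num_heads, past_seq, seq, head = 1, 2, 3, 4, 4
past = np.zeros((2, batch, num_heads, past_seq, head), dtype=np.float32)
k = np.zeros((batch, num_heads, seq, head), dtype=np.float32)  # current keys
v = np.zeros((batch, num_heads, seq, head), dtype=np.float32)  # current values

# Axis 3 is the sequence axis of the cache.
present = np.concatenate([past, np.stack([k, v])], axis=3)
assert present.shape == (2, batch, num_heads, past_seq + seq, head)
```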

Examples
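No official sample accompanies this page; the following is a hedged sketch that builds a one-node model with onnx.helper and runs it through onnxruntime, assuming uint8 quantization for input and weight, float32 for the T3 inputs, and made-up dimensions:

```python
import numpy as np
import onnxruntime as ort
from onnx import TensorProto, helper

batch, seq, hidden, num_heads = 1, 4, 8, 2  # illustrative sizes

node = helper.make_node(
    "QAttention",
    inputs=["input", "weight", "bias", "input_scale", "weight_scale"],
    outputs=["output"],
    domain="com.microsoft",
    num_heads=num_heads,
)

graph = helper.make_graph(
    [node],
    "qattention_example",
    [
        helper.make_tensor_value_info("input", TensorProto.UINT8, [batch, seq, hidden]),
        helper.make_tensor_value_info("weight", TensorProto.UINT8, [hidden, 3 * hidden]),
        helper.make_tensor_value_info("bias", TensorProto.FLOAT, [3 * hidden]),
        helper.make_tensor_value_info("input_scale", TensorProto.FLOAT, []),
        helper.make_tensor_value_info("weight_scale", TensorProto.FLOAT, []),
    ],
    [helper.make_tensor_value_info("output", TensorProto.FLOAT, [batch, seq, hidden])],
)

model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("com.microsoft", 1), helper.make_opsetid("", 13)],
)

sess = ort.InferenceSession(model.SerializeToString(),
                            providers=["CPUExecutionProvider"])
feeds = {
    "input": np.random.randint(0, 128, size=(batch, seq, hidden), dtype=np.uint8),
    "weight": np.random.randint(0, 128, size=(hidden, 3 * hidden), dtype=np.uint8),
    "bias": np.zeros(3 * hidden, dtype=np.float32),
    "input_scale": np.array(0.02, dtype=np.float32),
    "weight_scale": np.array(0.02, dtype=np.float32),
}
(output,) = sess.run(None, feeds)
print(output.shape)  # (1, 4, 8)
```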