com.microsoft - DecoderAttention

DecoderAttention - 1 (com.microsoft)

Version

This version of the operator has been available since version 1 of domain com.microsoft.

Summary

DecoderAttention supports self-attention and cross-attention, a key/value cache, and a key_padding_mask. An attention mask is not supported at the moment. Several boolean parameters are passed as runtime inputs rather than attributes so that the same node can be used generically.
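As a rough orientation only (this is not the kernel implementation), the boolean runtime inputs documented below can be read as in the following sketch; the function name is a hypothetical placeholder.

```python
# Rough sketch of what the boolean runtime inputs mean, based solely on the
# input descriptions on this page; the actual kernel logic is more involved.
def describe_decoder_attention(static_kv: bool, use_past: bool,
                               has_key_padding_mask: bool) -> str:
    mode = "cross-attention" if static_kv else "self-attention"
    cache = "uses key_cache/value_cache" if use_past else "no cache"
    mask = "applies key_padding_mask" if has_key_padding_mask else "no padding mask"
    return f"{mode}; {cache}; {mask}"
```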

Attributes

  • num_heads (required): Number of attention heads. Default value is ?.

Inputs

  • query (heterogeneous) - T: 3D input tensor with shape (sequence_length, batch_size, hidden_size), hidden_size = num_heads * head_size

  • key (heterogeneous) - T: 3D input tensor with shape (total_sequence_length, batch_size, hidden_size)

  • q_weight (heterogeneous) - T: 2D input tensor with shape (hidden_size, hidden_size)

  • kv_weight (heterogeneous) - T: 2D input tensor with shape (hidden_size, 2 * hidden_size)

  • bias (heterogeneous) - T: 1D input tensor with shape (3 * hidden_size)

  • key_padding_mask (optional, heterogeneous) - B: 2D input tensor with shape (batch_size, total_sequence_length)

  • key_cache (optional, heterogeneous) - T: input tensor with shape (batch_size, num_heads, sequence_length or total_sequence_length, head_size)

  • value_cache (optional, heterogeneous) - T: input tensor with shape (batch_size, num_heads, sequence_length or total_sequence_length, head_size)

  • static_kv (heterogeneous) - B: If static_kv = true, cross-attention; else self-attention

  • use_past (heterogeneous) - B: If use_past = true, use cache; else no cache

  • has_layer_state (heterogeneous) - B: If has_layer_state = true, layer_state = {} or [a,b]; else layer_state = None

  • has_key_padding_mask (heterogeneous) - B: If has_key_padding_mask = true, a key_padding_mask is provided; else no key_padding_mask is provided
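
To make the shape constraints concrete, the sketch below builds dummy inputs. The sizes are hypothetical and chosen only to illustrate the dimension relationships described above.

```python
import numpy as np

# Hypothetical sizes; only the shape relationships follow the documentation.
batch_size, sequence_length, total_sequence_length = 2, 4, 8
num_heads, head_size = 8, 64
hidden_size = num_heads * head_size  # 512

query = np.zeros((sequence_length, batch_size, hidden_size), dtype=np.float32)
key = np.zeros((total_sequence_length, batch_size, hidden_size), dtype=np.float32)
q_weight = np.zeros((hidden_size, hidden_size), dtype=np.float32)
kv_weight = np.zeros((hidden_size, 2 * hidden_size), dtype=np.float32)
bias = np.zeros((3 * hidden_size,), dtype=np.float32)
key_padding_mask = np.zeros((batch_size, total_sequence_length), dtype=bool)
key_cache = np.zeros((batch_size, num_heads, total_sequence_length, head_size), dtype=np.float32)
value_cache = np.zeros((batch_size, num_heads, total_sequence_length, head_size), dtype=np.float32)
```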

Outputs

Between 1 and 3 outputs.

  • output (heterogeneous) - T: 3D output tensor with shape (sequence_length, batch_size, hidden_size)

  • new_key_cache (optional, heterogeneous) - T: output tensor with shape (batch_size, num_heads, new sequence_length, head_size)

  • new_value_cache (optional, heterogeneous) - T: output tensor with shape (batch_size, num_heads, new sequence_length, head_size)

Examples
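
No example is given in the source page; the snippet below is only a sketch of how such a node might be declared with onnx.helper. The tensor names and the num_heads value are illustrative placeholders, and the input order follows the Inputs list above.

```python
from onnx import helper

# Hypothetical sketch: declare a DecoderAttention node in the com.microsoft
# domain. Names and num_heads are placeholders, not values from the source.
node = helper.make_node(
    "DecoderAttention",
    inputs=[
        "query", "key", "q_weight", "kv_weight", "bias",
        "key_padding_mask", "key_cache", "value_cache",
        "static_kv", "use_past", "has_layer_state", "has_key_padding_mask",
    ],
    outputs=["output", "new_key_cache", "new_value_cache"],
    domain="com.microsoft",
    num_heads=8,
)
```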