com.microsoft - DecoderAttention

DecoderAttention - 1 (com.microsoft)

Version

This version of the operator has been available since version 1 of domain com.microsoft.

Summary

DecoderAttention supports self-attention and cross-attention, a key/value cache, and a key_padding_mask. An attention mask is not supported at the moment. Several boolean parameters are passed as runtime inputs rather than attributes so that the same node can be used generically.
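As a rough orientation only (this is not the kernel implementation), the boolean runtime inputs documented below can be read as in the following sketch; the function name is a hypothetical placeholder.

```python
# Rough sketch of what the boolean runtime inputs mean, based solely on the
# input descriptions on this page; the actual kernel logic is more involved.
def describe_decoder_attention(static_kv: bool, use_past: bool,
                               has_key_padding_mask: bool) -> str:
    mode = "cross-attention" if static_kv else "self-attention"
    cache = "uses key_cache/value_cache" if use_past else "no cache"
    mask = "applies key_padding_mask" if has_key_padding_mask else "no padding mask"
    return f"{mode}; {cache}; {mask}"
```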

Attributes

  • num_heads (required): Number of attention heads. Default value is ?.

Inputs

  • query (heterogeneous) - T: 3D input tensor with shape (sequence_length, batch_size, hidden_size), hidden_size = num_heads * head_size

  • key (heterogeneous) - T: 3D input tensor with shape (total_sequence_length, batch_size, hidden_size)

  • q_weight (heterogeneous) - T: 2D input tensor with shape (hidden_size, hidden_size)

  • kv_weight (heterogeneous) - T: 2D input tensor with shape (hidden_size, 2 * hidden_size)

  • bias (heterogeneous) - T: 1D input tensor with shape (3 * hidden_size)

  • key_padding_mask (optional, heterogeneous) - B: 2D input tensor with shape (batch_size, total_sequence_length)

  • key_cache (optional, heterogeneous) - T: input tensor with shape (batch_size, num_heads, sequence_length or total_sequence_length, head_size)

  • value_cache (optional, heterogeneous) - T: input tensor with shape (batch_size, num_heads, sequence_length or total_sequence_length, head_size)

  • static_kv (heterogeneous) - B: If static_kv = true, cross-attention; else self-attention

  • use_past (heterogeneous) - B: If use_past = true, use cache; else no cache

  • has_layer_state (heterogeneous) - B: If has_layer_state = true, layer_state = {} or [a,b]; else layer_state = None

  • has_key_padding_mask (heterogeneous) - B: If has_key_padding_mask = true, a key_padding_mask is provided; else no key_padding_mask is provided
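
To make the shape constraints concrete, the sketch below builds dummy inputs. The sizes are hypothetical and chosen only to illustrate the dimension relationships described above.

```python
import numpy as np

# Hypothetical sizes; only the shape relationships follow the documentation.
batch_size, sequence_length, total_sequence_length = 2, 4, 8
num_heads, head_size = 8, 64
hidden_size = num_heads * head_size  # 512

query = np.zeros((sequence_length, batch_size, hidden_size), dtype=np.float32)
key = np.zeros((total_sequence_length, batch_size, hidden_size), dtype=np.float32)
q_weight = np.zeros((hidden_size, hidden_size), dtype=np.float32)
kv_weight = np.zeros((hidden_size, 2 * hidden_size), dtype=np.float32)
bias = np.zeros((3 * hidden_size,), dtype=np.float32)
key_padding_mask = np.zeros((batch_size, total_sequence_length), dtype=bool)
key_cache = np.zeros((batch_size, num_heads, total_sequence_length, head_size), dtype=np.float32)
value_cache = np.zeros((batch_size, num_heads, total_sequence_length, head_size), dtype=np.float32)
```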

Outputs

Between 1 and 3 outputs.

  • output (heterogeneous) - T: 3D output tensor with shape (sequence_length, batch_size, hidden_size)

  • new_key_cache (optional, heterogeneous) - T: output tensor with shape (batch_size, num_heads, new sequence_length, head_size)

  • new_value_cache (optional, heterogeneous) - T: output tensor with shape (batch_size, num_heads, new sequence_length, head_size)

Examples
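
No example is given in the source page; the snippet below is only a sketch of how such a node might be declared with onnx.helper. The tensor names and the num_heads value are illustrative placeholders, and the input order follows the Inputs list above.

```python
from onnx import helper

# Hypothetical sketch: declare a DecoderAttention node in the com.microsoft
# domain. Names and num_heads are placeholders, not values from the source.
node = helper.make_node(
    "DecoderAttention",
    inputs=[
        "query", "key", "q_weight", "kv_weight", "bias",
        "key_padding_mask", "key_cache", "value_cache",
        "static_kv", "use_past", "has_layer_state", "has_key_padding_mask",
    ],
    outputs=["output", "new_key_cache", "new_value_cache"],
    domain="com.microsoft",
    num_heads=8,
)
```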