com.microsoft - LongformerAttention

LongformerAttention - 1 (com.microsoft)

Version

This version of the operator has been available since version 1 of domain com.microsoft.

Summary

Longformer self-attention with a local context and a global context. Tokens attend locally: each token attends to its W previous tokens and its W succeeding tokens, where W is the window length. A selected few tokens attend globally to all other tokens.

The attention mask has shape (batch_size, sequence_length), where sequence_length is a multiple of 2W after padding. A mask value below 0 (such as -10000.0) means the token is masked; a value of 0 means it is not.

Global attention flags have value 1 for tokens that attend globally and 0 otherwise.
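
As a minimal NumPy sketch of these conventions (the padding to a multiple of 2W, the -10000.0 mask value, and the 0/1 flags follow the summary above; the float16 mask dtype, the int32 flag dtype, and the choice of which token is global are assumptions for illustration):

    import numpy as np

    batch_size, valid_len, W = 1, 10, 3

    # Pad sequence_length up to the next multiple of 2*W (here 10 -> 12).
    seq_len = -(-valid_len // (2 * W)) * (2 * W)

    # Attention mask: 0 for tokens that may be attended to, a large
    # negative value for masked (padding) positions.
    mask = np.zeros((batch_size, seq_len), dtype=np.float16)
    mask[:, valid_len:] = -10000.0

    # Global attention flags: 1 for tokens that attend globally (here the
    # first token, mirroring Longformer's convention for a CLS token),
    # 0 otherwise.
    global_flags = np.zeros((batch_size, seq_len), dtype=np.int32)
    global_flags[:, 0] = 1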

Attributes

  • num_heads (required): Number of attention heads.

  • window (required): One-sided attention window length W, i.e. half of the total window length.

Inputs

  • input (heterogeneous) - T: 3D input tensor with shape (batch_size, sequence_length, hidden_size), where hidden_size = num_heads * head_size

  • weight (heterogeneous) - T: 2D input tensor with shape (hidden_size, 3 * hidden_size); see the packing sketch after this list

  • bias (heterogeneous) - T: 1D input tensor with shape (3 * hidden_size)

  • mask (heterogeneous) - T: Attention mask with shape (batch_size, sequence_length)

  • global_weight (heterogeneous) - T: 2D input tensor with shape (hidden_size, 3 * hidden_size)

  • global_bias (heterogeneous) - T: 1D input tensor with shape (3 * hidden_size)

  • global (heterogeneous) - G: Global attention flags with shape (batch_size, sequence_length)
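
The weight and bias inputs pack the query, key, and value projections into a single tensor. By analogy with the com.microsoft Attention operator, the 3 * hidden_size axis is assumed here to concatenate the Q, K, and V projections in that order; a hypothetical construction:

    import numpy as np

    hidden_size = 8
    rng = np.random.default_rng(0)

    # Separate Q, K, V projection matrices, each (hidden_size, hidden_size);
    # random values stand in for trained parameters.
    Wq, Wk, Wv = (rng.standard_normal((hidden_size, hidden_size)).astype(np.float16)
                  for _ in range(3))
    weight = np.concatenate([Wq, Wk, Wv], axis=1)  # (hidden_size, 3 * hidden_size)

    bq, bk, bv = (np.zeros(hidden_size, dtype=np.float16) for _ in range(3))
    bias = np.concatenate([bq, bk, bv])            # (3 * hidden_size,)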

Outputs

  • output (heterogeneous) - T: 3D output tensor with shape (batch_size, sequence_length, hidden_size)

Examples
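
A minimal end-to-end sketch with onnx and onnxruntime, assuming float16 for type T, int32 for type G, and an onnxruntime build in which this contrib kernel is available (it is typically implemented for the CUDA execution provider); all tensor contents are random placeholders:

    import numpy as np
    import onnxruntime as ort
    from onnx import TensorProto, helper

    batch, seq, heads, head_size, W = 1, 12, 2, 4, 3
    hidden = heads * head_size  # 8; seq is a multiple of 2*W

    node = helper.make_node(
        "LongformerAttention",
        inputs=["input", "weight", "bias", "mask",
                "global_weight", "global_bias", "global"],
        outputs=["output"],
        domain="com.microsoft",
        num_heads=heads,
        window=W,
    )

    def fp16(name, shape):
        return helper.make_tensor_value_info(name, TensorProto.FLOAT16, shape)

    graph = helper.make_graph(
        [node],
        "longformer_attention_example",
        inputs=[
            fp16("input", [batch, seq, hidden]),
            fp16("weight", [hidden, 3 * hidden]),
            fp16("bias", [3 * hidden]),
            fp16("mask", [batch, seq]),
            fp16("global_weight", [hidden, 3 * hidden]),
            fp16("global_bias", [3 * hidden]),
            helper.make_tensor_value_info("global", TensorProto.INT32, [batch, seq]),
        ],
        outputs=[fp16("output", [batch, seq, hidden])],
    )

    model = helper.make_model(
        graph,
        opset_imports=[helper.make_opsetid("com.microsoft", 1),
                       helper.make_opsetid("", 17)],
    )

    rng = np.random.default_rng(0)
    feeds = {
        "input": rng.standard_normal((batch, seq, hidden)).astype(np.float16),
        "weight": rng.standard_normal((hidden, 3 * hidden)).astype(np.float16),
        "bias": rng.standard_normal(3 * hidden).astype(np.float16),
        "mask": np.zeros((batch, seq), dtype=np.float16),
        "global_weight": rng.standard_normal((hidden, 3 * hidden)).astype(np.float16),
        "global_bias": rng.standard_normal(3 * hidden).astype(np.float16),
        "global": np.zeros((batch, seq), dtype=np.int32),
    }
    feeds["global"][:, 0] = 1  # first token attends globally

    sess = ort.InferenceSession(model.SerializeToString(),
                                providers=["CUDAExecutionProvider"])
    (output,) = sess.run(None, feeds)
    print(output.shape)  # (1, 12, 8)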