Updated September 5, 2025

Attention

Description

Multi-Head Attention that can be either unidirectional (like GPT-2) or bidirectional (like BERT).

The weights for the input projections of Q, K and V are merged into a single tensor, stacked along the second dimension. Its shape is (input_hidden_size, hidden_size + hidden_size + v_hidden_size). Here hidden_size is the hidden dimension of Q and K, and v_hidden_size is that of V.
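
As an illustration of the merged layout, the following Python sketch splits one row of the merged tensor back into its Q, K and V parts. The function name split_qkv_columns and the sizes used are hypothetical; only the column order (Q, then K, then V) comes from the description above.

```python
def split_qkv_columns(merged_row, hidden_size, v_hidden_size):
    """Split one row of the merged weight (or bias) into Q, K and V parts.

    Hypothetical helper: the column order Q, K, V follows the merged shape
    (input_hidden_size, hidden_size + hidden_size + v_hidden_size).
    """
    expected = 2 * hidden_size + v_hidden_size
    if len(merged_row) != expected:
        raise ValueError(f"expected {expected} columns, got {len(merged_row)}")
    q = merged_row[:hidden_size]                  # first hidden_size columns
    k = merged_row[hidden_size:2 * hidden_size]   # next hidden_size columns
    v = merged_row[2 * hidden_size:]              # last v_hidden_size columns
    return q, k, v


# Illustrative sizes only: hidden_size = 2 for Q and K, v_hidden_size = 3 for V.
row = [0, 1, 2, 3, 4, 5, 6]
q, k, v = split_qkv_columns(row, hidden_size=2, v_hidden_size=3)
```

The same slicing applies to the bias, whose single dimension has the same length as the second dimension of the merged weights.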

The mask_index is optional. Besides a raw attention mask with shape (batch_size, total_sequence_length) or (batch_size, sequence_length, total_sequence_length), with value 0 for masked positions and 1 otherwise, two other formats are supported. When the input has right-side padding, mask_index is one-dimensional with shape (batch_size), where each value is the actual sequence length excluding padding. When the input has left-side padding, mask_index has shape (2 * batch_size), where the values are the exclusive end positions followed by the inclusive start positions.
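
To make the two index formats concrete, here is a minimal Python sketch that expands each of them into the equivalent raw mask of shape (batch_size, total_sequence_length). The helper names mask_from_lengths and mask_from_end_start are hypothetical and are not part of the library; they only restate the conventions described above.

```python
def mask_from_lengths(lengths, total_sequence_length):
    """Right-side padding: mask_index holds one actual length per batch item."""
    return [
        [1 if j < n else 0 for j in range(total_sequence_length)]
        for n in lengths
    ]


def mask_from_end_start(mask_index, total_sequence_length):
    """Left-side padding: exclusive end positions, then inclusive start positions."""
    batch_size = len(mask_index) // 2
    ends = mask_index[:batch_size]
    starts = mask_index[batch_size:]
    return [
        [1 if start <= j < end else 0 for j in range(total_sequence_length)]
        for start, end in zip(starts, ends)
    ]


# Two sequences padded to length 4; the first one has only 2 real tokens.
right_padded = mask_from_lengths([2, 4], 4)
left_padded = mask_from_end_start([4, 4, 2, 0], 4)
```

In both results, 1 marks a position that can be attended to and 0 marks padding.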

When unidirectional is enabled, each token attends only to previous tokens.
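
The unidirectional constraint can be pictured as a lower-triangular mask. The sketch below builds one in plain Python. It assumes the usual causal convention in which a token may also attend to itself, and that any past positions remain fully visible. Both points are assumptions, and causal_mask is a hypothetical helper name.

```python
def causal_mask(sequence_length, past_sequence_length=0):
    """Return a (sequence_length, total_sequence_length) mask of 0/1 values.

    Assumption: query i sits at absolute position past_sequence_length + i
    and may attend to every position up to and including its own.
    """
    total = past_sequence_length + sequence_length
    return [
        [1 if j <= past_sequence_length + i else 0 for j in range(total)]
        for i in range(sequence_length)
    ]


no_past = causal_mask(3)
with_past = causal_mask(2, past_sequence_length=1)
```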

Both past and present state are optional. They must be used together; using only one of them is not allowed. The qkv_hidden_sizes is required only when K and V have different hidden sizes.

When there is a past state, the hidden dimension of Q, K and V must be the same.

The total_sequence_length is past_sequence_length + kv_sequence_length. Here kv_sequence_length is the length of K or V. For self attention, kv_sequence_length equals sequence_length (the sequence length of Q). For cross attention, query and key might have different lengths.
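
As a worked example of this arithmetic, the sketch below tracks the three lengths during a typical incremental decoding run with self attention, where the whole prompt is processed first and each later step feeds a single new token. The decoding scenario and the helper name decode_lengths are illustrative assumptions, not something this node prescribes.

```python
def decode_lengths(prompt_length, new_tokens):
    """List (sequence_length, past_sequence_length, total_sequence_length) per step."""
    steps = []
    past = 0
    seq = prompt_length
    for _ in range(new_tokens + 1):
        # Self attention: kv_sequence_length equals sequence_length.
        total = past + seq
        steps.append((seq, past, total))
        past = total  # the present state becomes the next step's past state
        seq = 1       # later steps feed one new token at a time
    return steps


steps = decode_lengths(prompt_length=5, new_tokens=2)
# steps == [(5, 0, 5), (1, 5, 6), (1, 6, 7)]
```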

Input parameters

specified_outputs_name : array, this parameter lets you manually assign custom names to the output tensors of a node.

Graphs in : cluster, ONNX model architecture.

input – T : object, input tensor with shape (batch_size, sequence_length, input_hidden_size).
weights – T : object, merged Q/K/V weights with shape (input_hidden_size, hidden_size + hidden_size + v_hidden_size).
bias – T : object, bias tensor with shape (hidden_size + hidden_size + v_hidden_size) for input projection.
mask_index – M : object, attention mask with shape (batch_size, 1, max_sequence_length, max_sequence_length), (batch_size, total_sequence_length) or (batch_size, sequence_length, total_sequence_length), or index with shape (batch_size) or (2 * batch_size) or (3 * batch_size + 2).
past – T : object, past state for key and value with shape (2, batch_size, num_heads, past_sequence_length, head_size). When past_present_share_buffer is set, its shape is (2, batch_size, num_heads, max_sequence_length, head_size).
relative_position_bias – T : object, additional bias added to QxK’ with shape (batch_size or 1, num_heads or 1, sequence_length, total_sequence_length).
past_sequence_length – M : object, required when past_present_share_buffer is used (may be 0).

Parameters : cluster,

do rotary : boolean, whether to use rotary position embedding.
Default value “False”.
mask filter value : float, the value to be filled in the attention mask.
Default value “-10000”.
num heads : integer, number of attention heads.
Default value “2”.
past present share buffer : boolean, whether the corresponding past and present share the same tensor, of size (2, batch_size, num_heads, max_sequence_length, head_size).
Default value “False”.
qkv hidden sizes : array, hidden dimension of Q, K, V: hidden_size, hidden_size and v_hidden_size.
Default value “empty”.
rotary embedding dim : integer, dimension of rotary embedding. Limited to 32, 64 or 128.
Default value “0”.
scale : float, custom scale will be used if specified.
Default value “0.5”.
unidirectional : boolean, whether every token can only attend to previous tokens.
Default value “False”.
training? : boolean, whether the layer is in training mode (can store data for backward).
Default value “True”.
lda coeff : float, defines the coefficient by which the loss derivative will be multiplied before being sent to the previous layer (since during the backward run we go backwards).
Default value “1”.
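
To show how the mask filter value could interact with the attention mask, here is a minimal Python sketch of a masked softmax over one row of attention scores. It assumes that masked positions receive the mask filter value as an additive offset before the softmax. That behaviour is an assumption about the implementation and is not stated on this page, and masked_softmax is a hypothetical helper name.

```python
import math


def masked_softmax(scores, mask, mask_filter_value=-10000.0):
    """Softmax over one row of scores; mask uses 1 = keep, 0 = masked.

    Assumption: masked positions get mask_filter_value added to their score.
    """
    shifted = [
        s if m == 1 else s + mask_filter_value
        for s, m in zip(scores, mask)
    ]
    peak = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


probs = masked_softmax([1.0, 2.0, 3.0], mask=[1, 1, 0])
```

With a large negative filter value, the masked position ends up with a probability that is effectively zero, while the remaining positions still sum to one.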

name (optional) : string, name of the node.

Output parameters

Graphs out : cluster, ONNX model architecture.

output – T : object, 3D output tensor with shape (batch_size, sequence_length, v_hidden_size).
present – T : object, present state for key and value with shape (2, batch_size, num_heads, total_sequence_length, head_size). If past_present_share_buffer is set, its shape is (2, batch_size, num_heads, max_sequence_length, head_size), while the effective sequence length is past_sequence_length + kv_sequence_length.
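
The two shape rules for the present output can be summarized in a small Python sketch. The helper present_shape is hypothetical. Passing max_sequence_length stands in for enabling the shared buffer, and the overflow check is an added assumption.

```python
def present_shape(batch_size, num_heads, head_size,
                  past_sequence_length, kv_sequence_length,
                  max_sequence_length=None):
    """Return (shape of present, effective sequence length).

    max_sequence_length=None models the default case; a value models the
    case where past and present share one preallocated buffer.
    """
    effective = past_sequence_length + kv_sequence_length
    if max_sequence_length is None:
        return (2, batch_size, num_heads, effective, head_size), effective
    if effective > max_sequence_length:
        # Assumption: the shared buffer must be large enough to hold all tokens.
        raise ValueError("past + kv length exceeds max_sequence_length")
    return (2, batch_size, num_heads, max_sequence_length, head_size), effective


default_case = present_shape(1, 2, 8, past_sequence_length=5, kv_sequence_length=1)
shared_case = present_shape(1, 2, 8, past_sequence_length=5, kv_sequence_length=1,
                            max_sequence_length=16)
```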

Type Constraints

T in (tensor(float), tensor(float16)) : Constrain input and output types to float tensors.

M in (tensor(int32)) : Constrain mask index to integer types.

Example

All these examples are PNG snippets; you can drop a snippet onto the block diagram to add the depicted code to your VI (do not forget to install the Deep Learning library to run it).
