QOrderedAttention

Description

Quantized version of simplified Multi-Head Self Attention (using int8 with a specific matrix layout). Multi-Head Self Attention can be either unidirectional (like GPT-2) or bidirectional (like BERT). The mask_index input is optional. Besides a raw attention mask with shape (batch_size, past_sequence_length + sequence_length) or (batch_size, sequence_length, past_sequence_length + sequence_length), with value 0 for masked and 1 otherwise, two other formats are supported. When the input has right-side padding, mask_index is one-dimensional with shape (batch_size), where each element is the end position, i.e. the valid length of the actual sequence excluding padding. When the input has left-side padding, mask_index has shape (2 * batch_size), where the values are the exclusive end positions followed by the inclusive start positions. When unidirectional is 1, each token attends only to previous tokens. For GPT-2, both past and present states are optional. The present state can appear in the output even when the past state is not in the input. The current version does not support past/present, attention_bias, or qkv_hidden_sizes.
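The three mask formats described above can be sketched as follows (a minimal NumPy illustration; the variable names are hypothetical and only serve to show the layout of each format):

```python
import numpy as np

batch_size, sequence_length = 2, 5

# Raw attention mask: 1 = attend, 0 = masked.
# Here the first sequence has 3 valid tokens and is right-padded to length 5.
valid_lengths = np.array([3, 5], dtype=np.int32)
raw_mask = (np.arange(sequence_length) < valid_lengths[:, None]).astype(np.int32)
# raw_mask:
# [[1 1 1 0 0]
#  [1 1 1 1 1]]

# Right-side padding format: mask_index holds the valid length per sequence,
# shape (batch_size,).
mask_index_right = valid_lengths  # [3, 5]

# Left-side padding format: the same sequences padded on the left instead.
# mask_index has shape (2 * batch_size): exclusive end positions
# followed by inclusive start positions.
starts = np.array([2, 0], dtype=np.int32)               # first valid position
ends = np.full(batch_size, sequence_length, np.int32)   # one past the last
mask_index_left = np.concatenate([ends, starts])        # [5, 5, 2, 0]
```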

 

Input parameters

 

specified_outputs_name : array, lets you manually assign custom names to the output tensors of the node.

 Graphs in : cluster, ONNX model architecture.

input (heterogeneous) – Q : object, 3D input tensor with shape (batch_size, sequence_length, input_hidden_size).
scale_input (heterogeneous) – S : object, scale of the input, scalar value (per tensor) currently.
scale_Q_gemm (heterogeneous) – S : object, scale of the gemm – scalar (per-tensor quantization).
scale_K_gemm (heterogeneous) – S : object, scale of the gemm – scalar (per-tensor quantization).
scale_V_gemm (heterogeneous) – S : object, scale of the gemm – scalar (per-tensor quantization).
Q_weight (heterogeneous) – Q : object, 2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size.
K_weight (heterogeneous) – Q : object, 2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size.
V_weight (heterogeneous) – Q : object, 2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size.
scale_Q_weight (heterogeneous) – S : object, scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization).
scale_K_weight (heterogeneous) – S : object, scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization).
scale_V_weight (heterogeneous) – S : object, scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization).
Q_bias (heterogeneous) – S : object, 1D input tensor with shape (hidden_size).
K_bias (heterogeneous) – S : object, 1D input tensor with shape (hidden_size).
V_bias (heterogeneous) – S : object, 1D input tensor with shape (hidden_size).
scale_QKT_gemm (optional, heterogeneous) – S : object, scale of the gemm – scalar (per-tensor quantization).
scale_QKT_softmax (optional, heterogeneous) – S : object, scale of the softmax result – scalar (per-tensor quantization).
scale_values_gemm (heterogeneous) – S : object, scale of the gemm – scalar (per-tensor quantization). Also this is the output scale for the operator.
mask_index (optional, heterogeneous) – G : object, attention mask with shape (batch_size, 1, max_sequence_length, max_sequence_length), (batch_size, past_sequence_length + sequence_length) or (batch_size, sequence_length, past_sequence_length + sequence_length), or index with shape (batch_size) or (2 * batch_size).
past (optional, heterogeneous) – Q : object, past state for key and value with shape (2, batch_size, num_heads, past_sequence_length, head_size).
relative_position_bias (optional, heterogeneous) – S : object, additional bias added to QxK’ with shape (batch_size or 1, num_heads or 1, sequence_length, total_sequence_length).
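As a rough sketch of how the per-tensor scale inputs relate to the int8 data: a scale maps the int8 values back to float via multiplication. The max-abs-over-127 calibration used below is an assumption for illustration only, not necessarily the calibration method used to produce the scales for this operator.

```python
import numpy as np

def quantize_per_tensor(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: q = round(x / scale)."""
    scale = np.abs(x).max() / 127.0  # assumed max-abs calibration
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize(q: np.ndarray, scale: np.float32) -> np.ndarray:
    """Recover an approximate float tensor from int8 data and its scale."""
    return q.astype(np.float32) * scale

x = np.array([[0.5, -1.0, 0.25]], dtype=np.float32)
q, scale = quantize_per_tensor(x)
x_hat = dequantize(q, scale)  # close to x within one quantization step
```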

 Parameters : cluster,

num_heads : integer, number of attention heads.
Default value “0”.
order_input : integer, cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.
Default value “0”.
order_output : integer, cublasLt order of global bias.
Default value “0”.
order_weight : integer, cublasLt order of weight matrix.
Default value “0”.
qkv_hidden_sizes : array, hidden layer sizes of Q, K, V paths in Attention.
Default value “empty”.
unidirectional : boolean, whether every token can only attend to previous tokens.
Default value “False”.
 training? : boolean, whether the layer is in training mode (can store data for backward).
Default value “True”.
 lda coeff : float, coefficient by which the loss derivative is multiplied before being passed to the previous layer during the backward pass.
Default value “1”.

 name (optional) : string, name of the node.

Output parameters

 

 output (heterogeneous) – Q : object, 3D output tensor with shape (batch_size, sequence_length, hidden_size).
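For reference, the float-precision computation this quantized node approximates can be sketched as below (a simplified sketch ignoring quantization, masking, and matrix ordering; the function and variable names are hypothetical):

```python
import numpy as np

def simple_mha(x, Wq, Wk, Wv, bq, bk, bv, num_heads):
    """Simplified multi-head self attention, float reference."""
    batch, seq, _ = x.shape
    hidden = Wq.shape[1]          # hidden_size = num_heads * head_size
    head = hidden // num_heads

    def split(t):  # (batch, seq, hidden) -> (batch, heads, seq, head)
        return t.reshape(batch, seq, num_heads, head).transpose(0, 2, 1, 3)

    q, k, v = (split(x @ W + b) for W, b in ((Wq, bq), (Wk, bk), (Wv, bv)))
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)     # softmax over keys
    out = probs @ v                                # (batch, heads, seq, head)
    return out.transpose(0, 2, 1, 3).reshape(batch, seq, hidden)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8)).astype(np.float32)
W = lambda: rng.standard_normal((8, 8)).astype(np.float32)
b = np.zeros(8, dtype=np.float32)
y = simple_mha(x, W(), W(), W(), b, b, b, num_heads=2)
# y has shape (batch_size, sequence_length, hidden_size) = (2, 4, 8)
```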

Type Constraints

Q in (tensor(int8)) : Constrain input and output types to int8 tensors.

S in (tensor(float)) : Constrain scales to float32 tensors.

G in (tensor(int32)) : Constrain to integer types.

Example

All these examples are PNG snippets; you can drop a snippet onto the block diagram and the depicted code is added to your VI (do not forget to install the Deep Learning library to run it).