
# algorithm

algorithm: enum, the name of the optimizer to use for the optimizer instance.

#### Adadelta

Adadelta scales the learning rate based on the historical gradient while only taking into account a recent time window, rather than the entire history as AdaGrad does. It also uses a component that serves as an acceleration term, which accumulates historical updates (similar to momentum).

The terms in the update rule are:

- a decay constant,
- the learning rate,
- a numerical-stability term (1e-7),
- the weight being updated.
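As a rough sketch of the idea in Python (the toolkit itself exposes this through VIs; all variable names here are illustrative assumptions, not toolkit API):

```python
# Minimal Adadelta step on f(w) = w**2 (illustrative sketch, not the toolkit's code).
rho = 0.95      # decay constant
eps = 1e-7      # numerical-stability term
w = 1.0         # weight being optimized

acc_grad = 0.0    # running average of squared gradients (recent window only)
acc_update = 0.0  # running average of squared updates (the acceleration term)

for _ in range(100):
    grad = 2.0 * w  # gradient of w**2
    acc_grad = rho * acc_grad + (1 - rho) * grad**2
    # Step is scaled by the RMS of past updates over the RMS of past gradients,
    # so no global learning rate is strictly required.
    update = -((acc_update + eps) ** 0.5 / (acc_grad + eps) ** 0.5) * grad
    acc_update = rho * acc_update + (1 - rho) * update**2
    w += update
```

Because the decaying averages only weight recent gradients, old history is gradually forgotten, which is the key difference from AdaGrad's ever-growing accumulator.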

- the momentum,
- the gradients of the parameters we want to update,
- the learning rate,
- a smoothing term (avoids division by zero),
- the weight being updated.

#### Adam

Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments.

The terms in the update rule are:

- the bias-corrected first and second moment estimates,
- the momentum coefficients,
- the gradients of the parameters we want to update,
- the learning rate,
- a smoothing term (avoids division by zero),
- the weight being updated.
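A minimal Python sketch of one Adam run, assuming the standard update rule (hyperparameter names are illustrative, not the toolkit's):

```python
# Minimal Adam on f(w) = w**2 (illustrative sketch, not the toolkit's code).
beta1, beta2 = 0.9, 0.999  # momentum coefficients
lr = 0.1                   # learning rate
eps = 1e-7                 # smoothing term (avoids division by zero)
w = 1.0                    # weight being optimized

m = v = 0.0  # first and second moment estimates
for t in range(1, 201):
    grad = 2.0 * w
    m = beta1 * m + (1 - beta1) * grad     # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2  # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)             # bias-corrected estimates: both moments
    v_hat = v / (1 - beta2**t)             # start at zero, so early steps are rescaled
    w -= lr * m_hat / (v_hat**0.5 + eps)
```

The bias correction matters mostly in the first iterations, when `m` and `v` are still dominated by their zero initialization.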

The AdaMax algorithm is an extension of the Adaptive Movement Estimation (Adam) optimization algorithm and, more broadly, of the gradient descent optimization algorithm. Adam can be understood as updating weights inversely proportionally to the scaled L2 norm (squared) of past gradients; AdaMax extends this to the so-called infinite norm (max) of past gradients.

The terms in the update rule are:

- the momentum coefficients,
- the gradients of the parameters we want to update,
- the learning rate,
- the updated learning rate,
- a smoothing term (avoids division by zero),
- the weight being updated.
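The infinity-norm idea can be sketched in Python as follows (an illustrative assumption of the standard AdaMax rule, not the toolkit's implementation):

```python
# Minimal AdaMax on f(w) = w**2 (illustrative sketch, not the toolkit's code).
beta1, beta2 = 0.9, 0.999  # momentum coefficients
lr = 0.1                   # learning rate
eps = 1e-7                 # smoothing term
w = 1.0                    # weight being optimized

m = 0.0  # first moment estimate
u = 0.0  # exponentially weighted infinity norm of past gradients
for t in range(1, 201):
    grad = 2.0 * w
    m = beta1 * m + (1 - beta1) * grad
    # The max below replaces Adam's accumulated (squared) L2 norm:
    u = max(beta2 * u, abs(grad))
    w -= (lr / (1 - beta1**t)) * m / (u + eps)
```

Replacing the L2 accumulation with a running max makes the denominator less sensitive to occasional small gradients, since `u` can only decay slowly through `beta2`.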

#### Inertia

- the momentum,
- the momentum coefficient,
- the gradients of the parameters we want to update,
- the learning rate,
- the weight being updated.
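A Python sketch of a classical momentum ("inertia") update, assuming the usual velocity formulation (names are illustrative):

```python
# Classical momentum on f(w) = w**2 (illustrative sketch, not the toolkit's code).
mu = 0.9   # momentum coefficient
lr = 0.1   # learning rate
w = 1.0    # weight being optimized

velocity = 0.0  # the accumulated momentum term
for _ in range(50):
    grad = 2.0 * w
    velocity = mu * velocity - lr * grad  # inertia: blend old direction with new gradient
    w += velocity
```

The velocity term lets consistent gradient directions build up speed, while oscillating directions partially cancel out.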

- the bias-corrected first and second moment estimates,
- the momentum coefficients,
- the gradients of the parameters we want to update,
- the learning rate,
- a smoothing term (avoids division by zero),
- the weight being updated.

#### Nesterov

Nesterov momentum is an extension of momentum that involves calculating the decaying moving average of the gradients of projected positions in the search space rather than the actual positions themselves.

- the momentum,
- the momentum coefficient,
- the gradients of the parameters we want to update,
- the learning rate,
- the weight being updated.
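The "projected position" idea can be sketched in Python as follows (an illustrative assumption of the common Nesterov formulation, not the toolkit's code):

```python
# Nesterov momentum on f(w) = w**2 (illustrative sketch, not the toolkit's code).
mu = 0.9   # momentum coefficient
lr = 0.1   # learning rate
w = 1.0    # weight being optimized

velocity = 0.0
for _ in range(50):
    lookahead = w + mu * velocity  # projected position in the search space
    grad = 2.0 * lookahead         # gradient evaluated at the lookahead point,
                                   # not at the current position
    velocity = mu * velocity - lr * grad
    w += velocity
```

Evaluating the gradient at the projected position gives the update a form of early correction: if the momentum is about to overshoot, the lookahead gradient already points back.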

#### RMSprop

The gist of RMSprop is to:

1. Maintain a moving (discounted) average of the square of gradients
2. Divide the gradient by the root of this average

This implementation of RMSprop uses plain momentum, not Nesterov momentum.
The centered version additionally maintains a moving average of the gradients, and uses that average to estimate the variance.

- the momentum coefficient,
- the gradients of the parameters we want to update,
- the learning rate,
- a smoothing term (avoids division by zero),
- the weight being updated.
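The two steps above can be sketched in Python (illustrative; the momentum term is written out but disabled here for clarity, and names are assumptions, not the toolkit's API):

```python
# Minimal RMSprop on f(w) = w**2 (illustrative sketch, not the toolkit's code).
rho = 0.9    # discount factor for the squared-gradient average
mu = 0.0     # plain momentum coefficient (0 disables momentum in this sketch)
lr = 0.02    # learning rate
eps = 1e-7   # smoothing term
w = 1.0      # weight being optimized

avg_sq = 0.0    # step 1: moving (discounted) average of squared gradients
velocity = 0.0
for _ in range(200):
    grad = 2.0 * w
    avg_sq = rho * avg_sq + (1 - rho) * grad**2
    # Step 2: divide the gradient by the root of that average.
    velocity = mu * velocity - lr * grad / (avg_sq**0.5 + eps)
    w += velocity
```

Dividing by the root of the running average normalizes the step size per parameter, so dimensions with consistently large gradients take proportionally smaller steps.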

#### SGD

- the gradients of the parameters we want to update,
- the learning rate,
- the weight being updated.
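Plain SGD reduces to a single line per step, sketched here in Python (illustrative, not the toolkit's code):

```python
# Plain SGD on f(w) = w**2 (illustrative sketch, not the toolkit's code).
lr = 0.1  # learning rate
w = 1.0   # weight being optimized

for _ in range(50):
    grad = 2.0 * w   # gradient of the loss with respect to w
    w -= lr * grad   # move against the gradient
```

Every other optimizer on this page can be read as this rule plus some combination of momentum and per-parameter learning-rate scaling.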

This parameter is used in the add_to_graph and define VIs of the AdditiveAttention, Attention, BatchNormalization, Conv1D, Conv1DTranspose, Conv2D, Conv2DTranspose, Conv3D, Conv3DTranspose, Dense, DepthwiseConv2D, Embedding, GRU, LayerNormalization, LSTM, MultiHeadAttention, SeparableConv1D, SeparableConv2D, SimpleRNN, PReLU, ConvLSTM1DCell, ConvLSTM2DCell, ConvLSTM3DCell, GRUCell, LSTMCell, and SimpleRNNCell layers.