Layers

Transformer layers

interpolate_pos_embeddings(pos_embed, src_grid_size, tgt_grid_size, nb_tokens=0)

Interpolates pre-trained position embeddings so that the model can be used on higher-resolution images.

Parameters:

pos_embed (Tensor) – Positional embeddings to interpolate, shape (1, N, D).
src_grid_size (Tuple[int, int]) – Grid size of the given embeddings.
tgt_grid_size (Tuple[int, int]) – Target grid size to which the position embeddings should be adapted.
nb_tokens (int) – Number of tokens to exclude from interpolation (e.g., class or distillation tokens).

Returns:

Position embeddings (including class tokens) adapted to tgt_grid_size.

Return type:

Tensor
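The steps described above can be sketched as follows. This is a minimal illustration in NumPy, not the library's implementation: the helper names are hypothetical, the excluded tokens are assumed to come first in the sequence, and corner-aligned linear interpolation is an assumption, since the interpolation method is not stated here.

```python
import numpy as np


def resize_axis_linear(a, new_len, axis):
    """Linearly resample `a` along `axis` to `new_len` samples (corners aligned)."""
    old_len = a.shape[axis]
    pos = np.linspace(0.0, old_len - 1, new_len)  # sample positions in source coordinates
    lo = np.floor(pos).astype(int)                # lower neighbour index
    hi = np.minimum(lo + 1, old_len - 1)          # upper neighbour index, clamped
    w_shape = [1] * a.ndim
    w_shape[axis] = new_len
    w = (pos - lo).reshape(w_shape)               # weight of the upper neighbour
    return np.take(a, lo, axis=axis) * (1.0 - w) + np.take(a, hi, axis=axis) * w


def interpolate_pos_embeddings_sketch(pos_embed, src_grid_size, tgt_grid_size, nb_tokens=0):
    """Hypothetical re-implementation, for illustration only."""
    dim = pos_embed.shape[-1]
    extra = pos_embed[:, :nb_tokens]              # assumed: class/distillation tokens come first
    grid = pos_embed[:, nb_tokens:]
    grid = grid.reshape(1, src_grid_size[0], src_grid_size[1], dim)
    grid = resize_axis_linear(grid, tgt_grid_size[0], axis=1)  # resize height
    grid = resize_axis_linear(grid, tgt_grid_size[1], axis=2)  # resize width
    grid = grid.reshape(1, tgt_grid_size[0] * tgt_grid_size[1], dim)
    return np.concatenate([extra, grid], axis=1)  # excluded tokens are passed through unchanged
```

With a (1, 5, 3) input holding one class token and a 2x2 grid, a target grid of (4, 4) yields shape (1, 17, 3), and the class token is unchanged.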

interpolate_pos_embeddings_grid(pos_embed, tgt_grid_size)

Interpolates pre-trained position embeddings so that the model can be used on higher-resolution images. Use this function if the position embeddings are already given as a grid.

Parameters:

pos_embed (Tensor) – Positional embeddings to interpolate, shape (B, H, W, D)
tgt_grid_size (Tuple[int, int]) – Target grid size to which the position embeddings should be adapted.

Returns:

Position embeddings adapted to tgt_grid_size.

Return type:

Tensor

class PatchEmbeddings(*args, **kwargs)

Image to Patch Embedding.

Supports overlapping patches when stride is specified. Used, e.g., in Pyramid Vision Transformer V2 or PoolFormer.

Parameters:

patch_size (int) – Patch size. The image is patchified by a convolutional layer whose kernel size equals patch_size.
embed_dim (int) – Number of embedding dimensions.
stride (int | None) – If None, we use non-overlapping patches, i.e., stride=patch_size. Other values are used, e.g., in PVT v2 or PoolFormer.
padding (int | None) – Padding is only applied when overlapping patches are used. In that case we default to padding = patch_size // 2, but other values can be specified via this parameter. For non-overlapping patches, padding is always 0.
flatten (bool) – If True, we reshape the patchified image from (H', W', D) into (H'*W', D).
norm_layer (str) – Normalization layer to be applied after the patch projection.
kernel_initializer (str) – Initializer for the kernel weights.
bias_initializer (str) – Initializer for the bias.
**kwargs – Other arguments are passed to the parent class.

call(x, training=False, return_shape=False)

Forward pass through the layer.

An image of shape (H, W, C) will be mapped to patches of shape (H', W', D), where D is embed_dim. The output is then optionally flattened to (H'*W', D).

Parameters:

x – Input to the layer.
training – Whether the layer is called in training mode (True) or inference mode (False).
return_shape – If True, we additionally return the spatial shape (H', W') of the patch grid along with the tokens.

Returns:

The tokens and, if return_shape=True, additionally the shape (H', W').

This shape is used by models that combine convolutional layers with attention layers, because the convolutional layers need to know the original spatial shape of the token list.
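To make the relation between (H, W) and (H', W') concrete, the sketch below applies the standard output-size formula for a strided convolution. The function name is hypothetical, and symmetric explicit padding on both sides is an assumption. The layer's actual padding behaviour may differ.

```python
def patch_grid_size(image_size, patch_size, stride=None, padding=None):
    """Spatial shape (H', W') of the patch grid, assuming symmetric padding."""
    if stride is None:
        # Non-overlapping patches: stride equals patch_size and padding is always 0.
        stride, padding = patch_size, 0
    elif padding is None:
        # Documented default for overlapping patches.
        padding = patch_size // 2
    height, width = image_size
    out_h = (height + 2 * padding - patch_size) // stride + 1
    out_w = (width + 2 * padding - patch_size) // stride + 1
    return out_h, out_w


# Non-overlapping 16x16 patches on a 224x224 image.
print(patch_grid_size((224, 224), patch_size=16))            # (14, 14)
# Overlapping 7x7 patches with stride 4 (padding defaults to 3).
print(patch_grid_size((224, 224), patch_size=7, stride=4))   # (56, 56)
```

With flatten=True, these grids correspond to 14 * 14 = 196 and 56 * 56 = 3136 tokens, respectively.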

Normalization layers

class Affine(*args, **kwargs)

Affine normalization as used in ResMLP networks.

For NHWC x, we return alpha * x + beta, where alpha and beta are tensors with C entries.
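The per-channel broadcasting can be illustrated with a few lines of NumPy. The values of alpha and beta are made up for the example. In the layer they would be learned variables.

```python
import numpy as np

x = np.ones((2, 4, 4, 3))           # NHWC input with C = 3
alpha = np.array([1.0, 2.0, 3.0])   # one scale per channel
beta = np.array([0.0, 0.5, -1.0])   # one offset per channel

y = alpha * x + beta                # broadcasts over N, H and W
print(y.shape)                      # (2, 4, 4, 3)
print(y[0, 0, 0])                   # [1.  2.5 2. ]
```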

class GroupNormalization(*args, **kwargs)

A group-norm “layer” (see abs/1803.08494, go/dune-gn).

This layer creates beta/gamma variables in a name_scope and uses them to apply group_normalize on the input x. You can either specify a fixed number of groups nb_groups, which automatically selects the corresponding group size from the input’s number of channels, or specify a group_size, which automatically determines the number of groups from the input’s number of channels. If you specify neither, the paper’s recommended nb_groups=32 is used.

Authors: Lucas Beyer, Joan Puigcerver.

Parameters:

nb_groups – int, the number of channel-groups to normalize over.
group_size – int, size of the groups to normalize over.
eps – float, a small additive constant to avoid /sqrt(0).
beta_init – initializer for bias, defaults to zeros.
gamma_init – initializer for scale, defaults to ones.
**kwargs – other tf.keras.layers.Layer arguments.
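The interplay between nb_groups and group_size described above can be sketched as a small helper. The function name is hypothetical. The divisibility checks follow from the requirement that both values divide C; whether the library raises in exactly these cases is not stated here.

```python
def resolve_groups(nb_channels, nb_groups=None, group_size=None):
    """Return (nb_groups, group_size) for an input with `nb_channels` channels."""
    if group_size is not None:
        if nb_channels % group_size != 0:
            raise ValueError("group_size must divide the number of channels")
        nb_groups = nb_channels // group_size
    elif nb_groups is None:
        nb_groups = 32  # default when neither argument is given
    if nb_channels % nb_groups != 0:
        raise ValueError("nb_groups must divide the number of channels")
    return nb_groups, nb_channels // nb_groups


print(resolve_groups(64))                  # (32, 2)
print(resolve_groups(64, nb_groups=8))     # (8, 8)
print(resolve_groups(64, group_size=16))   # (4, 16)
```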

group_normalize(x, gamma, beta, nb_groups=None, group_size=None, eps=1e-05)

Applies group-normalization to NHWC x (see abs/1803.08494, go/dune-gn).

This function just does the math; if you want a “layer” that creates the necessary variables etc., see GroupNormalization above. You must specify either a fixed number of groups nb_groups, which automatically selects the corresponding group size from the input’s number of channels, or a group_size, which automatically determines the number of groups from the input’s number of channels.

Author: Lucas Beyer

Parameters:

x – N..C-tensor, the input to group-normalize. For images, this would be an NHWC tensor; for time series, NTC; for videos, NHWTC or NTHWC. All of them work, as normalization includes everything between N and C. Even a plain NC shape works, as C is grouped and normalized.
gamma – tensor with C entries, learnable scale after normalization.
beta – tensor with C entries, learnable bias after normalization.
nb_groups – int, number of groups to normalize over (divides C).
group_size – int, size of the groups to normalize over (divides C).
eps – float, a small additive constant to avoid /sqrt(0).

Returns:

Group-normalized x, of the same shape and type as x.
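As a rough illustration of the math, the NumPy sketch below normalizes over all axes between N and C together with the channels inside each group. It is not the library's code: the function name is hypothetical, nb_groups is passed directly, and grouping consecutive channels together is an assumption.

```python
import numpy as np


def group_normalize_sketch(x, gamma, beta, nb_groups, eps=1e-5):
    """Group-normalize an N..C array; statistics are computed per sample and group."""
    n, c = x.shape[0], x.shape[-1]
    # Collapse everything between N and C, then split C into (groups, group_size).
    grouped = x.reshape(n, -1, nb_groups, c // nb_groups)
    mean = grouped.mean(axis=(1, 3), keepdims=True)
    var = grouped.var(axis=(1, 3), keepdims=True)
    normed = (grouped - mean) / np.sqrt(var + eps)
    # Restore the original shape, then apply the per-channel scale and bias.
    return normed.reshape(x.shape) * gamma + beta


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 4, 8))   # NHWC input
y = group_normalize_sketch(x, gamma=np.ones(8), beta=np.zeros(8), nb_groups=2)
print(y.shape)   # (2, 4, 4, 8)
```

Because the intermediate axes are collapsed before computing statistics, the same function also accepts NTC or plain NC inputs.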