PiT
We provide an implementation and pretrained weights for Pooling-based Vision Transformers (PiT).
Paper: Rethinking Spatial Dimensions of Vision Transformers. [arXiv:2103.16302].
Original PyTorch code and weights from NAVER AI.
This code has been ported from the timm implementation.
The following models are available.
Models trained on ImageNet-1k
pit_ti_224
pit_xs_224
pit_s_224
pit_b_224
Models trained on ImageNet-1k, using knowledge distillation
pit_ti_distilled_224
pit_xs_distilled_224
pit_s_distilled_224
pit_b_distilled_224
- class PoolingVisionTransformerConfig(name='', url='', nb_classes=1000, in_channels=3, input_size=(224, 224), patch_size=16, stride=8, embed_dim=(64, 128, 256), nb_blocks=(2, 6, 4), nb_heads=(2, 4, 8), mlp_ratio=4.0, distilled=False, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer='layer_norm_eps_1e-6', act_layer='gelu', interpolate_input=False, crop_pct=0.9, interpolation='bicubic', mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225), first_conv='patch_embed/conv', classifier='head')
Configuration class for PiT models.
- Parameters:
name (str) – Name of the model.
url (str) – URL for pretrained weights.
nb_classes (int) – Number of classes for classification head.
in_channels (int) – Number of input image channels.
input_size (Tuple[int, int]) – Input image size (height, width)
patch_size (int) – Patchifying the image is implemented via a convolutional layer with kernel size patch_size and stride given by stride.
stride (int) – Stride in patch embedding layer.
embed_dim (Tuple) – Feature dimensions at each stage.
nb_blocks (Tuple) – Number of blocks at each stage.
nb_heads (Tuple) – Number of self-attention heads at each stage.
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim
distilled (bool) – If True, we add a distillation head in addition to the classification head.
drop_rate (float) – Dropout rate.
attn_drop_rate (float) – Attention dropout rate.
drop_path_rate (float) – Dropout rate for stochastic depth.
norm_layer (str) – Normalization layer. See norm_layer_factory() for possible values.
act_layer (str) – Activation function. See act_layer_factory() for possible values.
interpolate_input (bool) – If True, we interpolate position embeddings to the given input size, so inference can be done for arbitrary input shapes. If False, inference can only be performed at input_size.
crop_pct (float) – Crop percentage for ImageNet evaluation.
interpolation (str) – Interpolation method for ImageNet evaluation.
mean (Tuple[float, float, float]) – Defines preprocessing function. If x is an image with pixel values in (0, 1), the preprocessing function is (x - mean) / std.
std (Tuple[float, float, float]) – Defines preprocessing function.
first_conv (str) – Name of first convolutional layer. Used by create_model() to adapt the number of input channels when loading pretrained weights.
classifier (str | Tuple[str, str]) – Name of classifier layer. Used by create_model() to adapt the classifier when loading pretrained weights.
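Two of the parameters above come down to simple arithmetic, which the following sketch works through for the default configuration. It is an illustration only, not code from the library: the helper names patch_grid_size and normalize_pixel are hypothetical, and the grid-size formula assumes the patch embedding convolution uses no padding, which is not stated in this documentation.

```python
def patch_grid_size(input_size, patch_size, stride, padding=0):
    """Spatial size of the token grid produced by the patch embedding conv.

    Standard convolution output-size formula. padding=0 is an assumption;
    the documentation above does not state the padding that is used.
    """
    return tuple((n + 2 * padding - patch_size) // stride + 1 for n in input_size)


def normalize_pixel(pixel, mean, std):
    """Apply (x - mean) / std per channel to one pixel with values in (0, 1)."""
    return tuple((v - m) / s for v, m, s in zip(pixel, mean, std))


# Default values from PoolingVisionTransformerConfig.
grid = patch_grid_size((224, 224), patch_size=16, stride=8)
print(grid)               # (27, 27)
print(grid[0] * grid[1])  # 729 spatial tokens entering the first stage

mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)
# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(mean, mean, std))  # (0.0, 0.0, 0.0)
```

Because stride is smaller than patch_size in the default configuration, neighbouring patches overlap, which is why the grid is 27 x 27 rather than the 14 x 14 a non-overlapping 16 x 16 patchification would give.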
- class PoolingVisionTransformer(*args, **kwargs)
Class implementing a Pooling-based Vision Transformer (PiT).
Paper: Rethinking Spatial Dimensions of Vision Transformers. [arXiv:2103.16302]
- Parameters:
cfg (PoolingVisionTransformerConfig) – Configuration class for the model.
**kwargs – Arguments are passed to tf.keras.Model.
- call(x, training=False, return_features=False)
Forward pass through the full model.
- Parameters:
x – Input to model
training (bool) – Whether the model is called in training mode (True) or inference mode (False).
return_features (bool) – If True, we return not only the model output, but also a dictionary with intermediate features.
- Returns:
If return_features=True, we return a tuple (y, features), where y is the model output and features is a dictionary with intermediate features.
If return_features=False, we return only y.
- property feature_names: List[str]
Names of features, returned when calling call with return_features=True.
- forward_features(x, training=False, return_features=False)
Forward pass through model, excluding the classifier layer. This function is useful if the model is used as input for downstream tasks such as object detection.
- Parameters:
x – Input to model
training (bool) – Whether the model is called in training mode (True) or inference mode (False).
return_features (bool) – If True, we return not only the model output, but also a dictionary with intermediate features.
- Returns:
If return_features=True, we return a tuple (y, features), where y is the model output and features is a dictionary with intermediate features.
If return_features=False, we return only y.