convolution – “filter”
notation
advantages
motivation for padding (with zeros)
convolutions can only be evaluated where the kernel lies entirely within the input domain – that's inconvenient, as it couples the architecture to the input size
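a minimal numpy sketch of this coupling: a "valid" convolution (kernel fully inside the input) shrinks the output, while zero padding ("same") keeps the output length equal to the input length regardless of kernel width

```python
import numpy as np

signal = np.arange(8, dtype=float)   # input of length 8
kernel = np.ones(3) / 3.0            # averaging kernel of width 3

# 'valid': kernel must lie entirely inside the input -> output length 8 - 3 + 1 = 6
valid = np.convolve(signal, kernel, mode="valid")

# 'same': implicit zero padding -> output length equals input length (8)
same = np.convolve(signal, kernel, mode="same")
```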
downsampling approaches
upsampling approaches
architectures
RNNs
Transformer
probabilistic models – aim to learn a parametric distribution that approximates the complex data distribution
Kullback-Leibler divergence
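the divergence above, for discrete distributions, is KL(P‖Q) = Σₓ p(x) log(p(x)/q(x)); a small numpy sketch showing its two key properties (zero iff the distributions match, non-negative, and asymmetric)

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions
    (assumes q(x) > 0 wherever p(x) > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, p))  # 0.0 -- zero exactly when the distributions match
print(kl_divergence(p, q))  # positive, and in general KL(P||Q) != KL(Q||P)
```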
latent variables
simple example: clustering
more advanced approach: Gaussian mixture model
we can also consider continuous latent variables
variational autoencoders (VAEs)
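the central trick in a VAE can be sketched in a few lines: sample the continuous latent as z = μ + σ·ε with ε ~ N(0, I) (the reparameterization trick, so gradients flow through μ and σ), and penalize the encoder with the closed-form KL to a standard normal; the numbers below are hypothetical encoder outputs for illustration

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical encoder output for one data point: mean and log-variance of q(z|x)
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -0.5])

# reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I);
# the randomness is external, so gradients can flow through mu and sigma
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# closed-form KL( N(mu, sigma^2) || N(0, I) ) used in the VAE objective
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```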
generative adversarial network (GAN)
it's important but hard to evaluate the quality of generated samples
objective metrics
subjective evaluation
hybrid alternative – mean opinion score
introduction
audio representations based on self-supervised learning
end-to-end approaches (audio → audio)
generating audio from intermediate representations
basic division of generative models
basic concepts
forward diffusion process (fixed) – start with data, gradually add Gaussian noise in steps
reverse denoising process (generative)
training
common model architectures
reverse denoising process – sampling
flow matching models vs. diffusion models
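the fixed forward process above has a convenient closed form, q(x_t | x_0) = N(√ᾱ_t · x_0, (1 − ᾱ_t) I), so any noise level can be sampled in one shot; a numpy sketch with an assumed linear noise schedule (the exact schedule values are a common choice, not prescribed by the notes)

```python
import numpy as np

rng = np.random.default_rng(0)

# linear noise schedule (assumed values, a common choice)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product a_bar_t

def forward_diffuse(x0, t):
    """Closed-form forward process: x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

x0 = np.ones(4)                          # toy "data"
x_early, _ = forward_diffuse(x0, 10)     # still close to the data
x_late, _ = forward_diffuse(x0, T - 1)   # nearly pure Gaussian noise
```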
types of learning
initial approach: pretraining (e.g. ImageNet) & fine-tuning
generation-based methods
self-supervised learning – supervision comes from the data (no need to annotate)
what is a good representation? – we need robustness (to handle domain shift)
unsupervised domain adaptation (DA)
in some contexts, discrete representations may be useful
variational autoencoders (VAEs)
GANs
image-to-image translation
video generation
neural radiance fields (NeRF)
instead of predicting the image, we predict the noise
we can use a simpler loss formula, even though there's no theoretical justification for it
training and sampling algorithms
we want the results to follow a distribution → we add some randomness according to the variance
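a sketch of these three points together, with a dummy stand-in for the learned noise-prediction network eps_theta: the simplified loss compares predicted and true noise, and each reverse step adds σ_t·z so the samples follow a distribution rather than collapsing to a point

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t):
    # stand-in for the noise-prediction network (a real model would be learned)
    return np.zeros_like(x_t)

def simple_loss(x0, t):
    """Simplified objective: || eps - eps_theta(x_t, t) ||^2 -- predict the noise, not the image."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return float(np.sum((eps - eps_theta(x_t, t)) ** 2))

def sample_step(x_t, t):
    """One reverse step: denoise via eps_theta, then add sigma_t * z for the stochastic part."""
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_theta(x_t, t)) / np.sqrt(alphas[t])
    z = rng.standard_normal(x_t.shape) if t > 0 else 0.0  # no noise on the final step
    return mean + np.sqrt(betas[t]) * z
```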
we want the distribution to be conditional
latent diffusion models
diffusion transformers (DiT)
conditioning is used to predict scale and shift (similar to StyleGAN)
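the scale-and-shift conditioning above (adaptive layer norm, in the spirit of StyleGAN's modulation) can be sketched as follows; the projection matrix W is a hypothetical learned parameter, initialized small so the map starts near the identity

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_cond = 8, 4
# hypothetical learned projection from the conditioning vector to (scale, shift)
W = rng.standard_normal((d_cond, 2 * d_model)) * 0.01

def adaln_modulate(x, cond):
    """Normalize x, then apply a scale and shift predicted from the conditioning."""
    normed = (x - x.mean()) / (x.std() + 1e-6)
    scale, shift = np.split(cond @ W, 2)
    return normed * (1.0 + scale) + shift  # (1 + scale) keeps the map near identity at init

features = rng.standard_normal(d_model)
cond = rng.standard_normal(d_cond)  # e.g. timestep + class embedding
out = adaln_modulate(features, cond)
```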
2D → 3D latent space
why it changes everything
how can we reuse a pretrained diffusion model so that we can condition using spatial data (a sketch…)
personalized text-to-image
multimodal learning
sensor fusion – using information from diverse sensors to make predictions
multimodal translation (from one modality to another)
hybrid tasks
multimodal alignment (identifying and modeling correspondences)
masked image model vs. masked language model
training LLM
multimodal LLMs
[mask] token is appended to the question (→ answer is predicted by the model)
image-to-text vs. text-to-image