first term – reconstruction (does the decoder work well?)
second term – regularization (is the latent distribution standard normal?)
L_ELBO(θ,ϕ) = E_{q(z∣x)}[ log ( p(x,z) / q(z∣x) ) ]
note: we need to maximize this (or we can minimize −LELBO in gradient descent)
we cannot compute the expectation in closed form, we need to sample from q(z∣x)
sampling is non-differentiable, we cannot backpropagate
reparametrization trick
we cannot sample directly from the posterior like this: ẑ ∼ N(μ,Σ)
so we sample like this: z̄ = μ + Σ^{1/2} ϵ with ϵ ∼ N(0,I)
z̄ is differentiable w.r.t. μ and Σ and follows the same distribution as ẑ
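A minimal numpy sketch of the trick, assuming a diagonal covariance; the `log_var` parametrization is a common convention, not something from these notes:

```python
import numpy as np

def reparametrize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I): the randomness lives in
    eps, so z is a deterministic (differentiable) function of mu, log_var."""
    eps = rng.standard_normal(mu.shape)
    sigma = np.exp(0.5 * log_var)  # diagonal Sigma^{1/2}
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu, log_var = np.array([1.0, -2.0]), np.array([0.0, 0.0])  # sigma = 1
z = np.stack([reparametrize(mu, log_var, rng) for _ in range(20000)])
print(z.mean(axis=0))  # ≈ [1.0, -2.0]
print(z.std(axis=0))   # ≈ [1.0, 1.0]
```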
posterior collapse
it can happen that the VAE stops learning if the posterior q gets too close to the standard prior
KL term dominates the ELBO – we should reduce its weight
also reducing dimensionality D of the latent space helps
exact EM × VAE
there also exist things in the middle (variational EM)
limitation of VAE
frames modeled independently – we need time/sequential modeling!
for spectrogram, for example
one solution: consider blocks of spectrogram as inputs
probabilistic sequential modeling & inference
we can use a RNN
the sampling occurs sequentially and cannot be parallelized
→ dynamical VAEs (DVAEs)
generative adversarial network (GAN)
dataset (real samples), generator (fake samples)
generator Gθ takes a random noise z as input and generates an image x=Gθ(z)
discriminator Dϕ takes an image x as input and outputs the probability that x is real
generator and discriminator are trained jointly in a minimax game
max_θ min_ϕ L_BCE(D_ϕ; x, G_θ(z))
or min_θ max_ϕ E_{x∼p_data(x)}[log D_ϕ(x)] + E_{z∼p_z(z)}[log(1 − D_ϕ(G_θ(z)))]
this corresponds to minimizing the Jensen-Shannon divergence between pdata(x) and pθ(x) with optimal ϕ
D_JS(p,q) = ½ D_KL(p ∥ (p+q)/2) + ½ D_KL(q ∥ (p+q)/2)
typically, GANs are trained by alternating between updating the discriminator and the generator with different batches of data
update the discriminator Dϕ for a few steps
update the generator Gθ for a step
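The two objectives and the alternating schedule can be sketched in numpy; the non-saturating generator loss shown is a common practical variant, an assumption here rather than something stated above:

```python
import numpy as np

def bce_d_loss(d_real, d_fake):
    # Discriminator step: minimize BCE, pushing D(x_real) -> 1, D(G(z)) -> 0.
    return -(np.log(d_real) + np.log(1.0 - d_fake)).mean()

def g_loss_nonsaturating(d_fake):
    # Generator step: maximize log D(G(z)) (non-saturating variant) instead
    # of minimizing log(1 - D(G(z))) -- stronger gradients for bad samples.
    return -np.log(d_fake).mean()

# Alternating schedule per batch:
#   for _ in range(k): update phi to decrease bce_d_loss
#   then one update of theta to decrease g_loss_nonsaturating
print(bce_d_loss(np.array([0.9]), np.array([0.1])))  # good discriminator: low
print(g_loss_nonsaturating(np.array([0.1])))         # rarely fooled: high G loss
```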
problems
very sensitive to the choice of hyperparameters
weak discriminator → generator may produce non-realistic samples
strong discriminator → generator cannot learn (if the generator is always caught, it does not know how to improve) or tends to replicate the training set (overfitting)
mode collapse – the model does not generate the diversity of the dataset and focuses on one thing instead (e.g. generates just ones from MNIST)
if we consider two different Dirac distributions (all mass at a single point each), their JS divergence is the constant log 2 regardless of how far apart the points are → bad
solution: Wasserstein distance
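A small numpy demo of the problem, using discrete Diracs on a 1-D grid (points 2, 5, 9 are arbitrary choices): the JS divergence is log 2 no matter the separation, while the 1-D Wasserstein-1 distance (area between the CDFs) tracks the actual distance:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

n = 20
def dirac(i):
    p = np.zeros(n)
    p[i] = 1.0
    return p

# 1-D Wasserstein-1 = area between the CDFs (unit grid spacing).
def w1(p, q):
    return np.abs(np.cumsum(p - q))[:-1].sum()

js_near, js_far = js(dirac(2), dirac(5)), js(dirac(2), dirac(9))
w_near, w_far = w1(dirac(2), dirac(5)), w1(dirac(2), dirac(9))
print(js_near, js_far)  # both log 2 ≈ 0.693: blind to distance
print(w_near, w_far)    # 3.0 and 7.0: reflects the geometry
```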
avoiding mode collapse: Wasserstein GAN
based on Wasserstein distance – “minimum cost of transporting mass from one distribution to another”
W(p,q) = inf_{γ∈Γ(p,q)} E_{(x,y)∼γ}[∥x−y∥]
where Γ(p,q) is the set of all joint distributions with marginals p and q
properties of W
it is a real distance, not a divergence – it satisfies the triangle inequality and is sensitive to the geometry of the underlying space
it is useful for comparing distributions that are not well-aligned or have different supports (as opposed to the JS divergence)
it's hard to compute efficiently (due to the infimum) in high-dim spaces
but by Kantorovich–Rubinstein duality it can be written as max_{∥f∥_L≤1} { E_{x∼p}[f(x)] − E_{y∼q}[f(y)] }
WGANs use the Wasserstein distance instead of the JS divergence
they use a critic Cϕ (instead of the discriminator) which is trained to approximate W
they use weight clipping to enforce a Lipschitz constraint on the critic
it's not a competition anymore
the critic is trained to approximate the Wasserstein distance
the generator is trained to minimize it
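A sketch of the critic objective and weight clipping from the original WGAN recipe, with toy numpy stand-ins instead of a real network:

```python
import numpy as np

def critic_loss(c_real, c_fake):
    # Critic maximizes E[C(x_real)] - E[C(x_fake)], so we minimize the
    # negative; under the Lipschitz constraint this estimates W(p_data, p_theta).
    return -(c_real.mean() - c_fake.mean())

def clip_weights(params, c=0.01):
    # Crude Lipschitz enforcement from the original WGAN: after every
    # critic update, clamp each weight to [-c, c].
    return [np.clip(w, -c, c) for w in params]

params = [np.array([[0.5, -0.02], [0.003, -1.0]])]
clipped = clip_weights(params)
print(clipped[0])  # all entries in [-0.01, 0.01]
```

Later work replaces clipping with a gradient penalty, but the clipped-critic version above matches the original formulation.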
Evaluation of Generative Models
it's important but hard to evaluate the quality of generated samples
we need to know how well the models perform
but they can produce a wide range of outputs → it's hard to define a single evaluation metric capturing all aspects of quality
also, evaluation metrics may not align with human perception of quality
Inception Score (IS)
p(y∣x) is the class distribution predicted by a pretrained Inception model
the higher the better (confident per-image predictions p(y∣x), diverse marginal p(y))
Fréchet Inception Distance
measures the Wasserstein distance between the distribution of generated images and the distribution of real images in the feature space of a pretrained Inception model
extracts features from (provided) real images and generated images using a pretrained Inception model
FID score is computed based on the means and covariances of the features
FID(G_θ) = ∥μ_r − μ_g∥² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2})
assumes Gaussian distributions
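A minimal numpy/scipy sketch of the FID formula on stand-in features (random Gaussians, not real Inception activations):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """||mu_r - mu_g||^2 + Tr(Sig_r + Sig_g - 2 (Sig_r Sig_g)^{1/2})
    on Gaussians fitted to the two feature sets."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    sig_r = np.cov(feats_real, rowvar=False)
    sig_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sig_r @ sig_g)
    if np.iscomplexobj(covmean):  # tiny imaginary parts from numerical noise
        covmean = covmean.real
    return np.sum((mu_r - mu_g) ** 2) + np.trace(sig_r + sig_g - 2.0 * covmean)

rng = np.random.default_rng(0)
a = rng.standard_normal((500, 8))        # stand-in "real" features
b = rng.standard_normal((500, 8)) + 1.0  # shifted "generated" features
print(fid(a, a))  # ~0 for identical sets
print(fid(a, b))  # clearly larger for the shifted distribution
```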
comparison of IS and FID
FID is more robust to mode collapse
both rely on Inception model trained on ImageNet
trained for classification, might not reflect all the aspects of the image quality
the model might not well reflect the evaluated data
example: spectrograms
both scores depend on the pretrained model we choose
not suitable for non-image data
don't provide insights into the diversity or realism of generated samples
subjective evaluation
user studies and visual inspection
provide valuable insights
costly, subjective :)
might need specific user expertise
hybrid alternative – mean opinion score
crowd-source evaluation technique where human evaluators rate the quality of generated samples on a scale (e.g. 1 to 5)
MOS network trained to predict the score
so the network is trained to estimate subjective criteria
can be used for non-image data
Audio
introduction
we sample the signal using frequency Fs
Nyquist-Shannon sampling theorem: we can only reconstruct content corresponding to frequencies <Fs/2
most of energy within 0.3–3 kHz → phone standards sample at 8 kHz
discrete Fourier transform
outputs sequence of numbers describing the magnitude and phase at each frequency bin
computed using FFT
but energy for frequencies changes over time
short-time Fourier transform (STFT)
sliding window (e.g. Hann) over the signal
apply DFT to each windowed segment
taking the modulus gives the magnitude spectrogram (phase is discarded)
mel frequency scale
better matches human perception (compared to the linear scale)
mel-frequency cepstral coefficients (MFCC)
popular audio representations
usual pipeline to get MFCC: raw data → STFT → Mel scale → log → discrete cosine transform
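The pipeline can be sketched end-to-end in numpy/scipy; the parameter choices (`n_fft=256`, 20 mel bands, 13 coefficients) are illustrative assumptions, not values from the notes:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, n_fft=256, hop=128, n_mels=20, n_mfcc=13):
    # 1) STFT power spectrogram via framing + windowing + FFT
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # 2) triangular mel filterbank (equally spaced on the mel scale)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # 3) log of mel energies, 4) DCT, keep the first n_mfcc coefficients
    log_mel = np.log(power @ fb.T + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]

t = np.arange(8000) / 8000.0
x = np.sin(2 * np.pi * 440 * t)  # 1 s of a 440 Hz tone at Fs = 8 kHz
print(mfcc(x).shape)             # (frames, n_mfcc)
```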
audio representations based on self-supervised learning
sometimes, part of the representation is not learned (classical audio representations are used)
using learned representations may be better – they can be tailored to specific data or provide more general, reusable representations
wav2vec 2.0
masked speech in latent space (approach similar to masked language modeling)
architecture similar to STFT, but the transformation is learnt
CNN (to get latent representations based on raw waveform) + Transformer encoder (to get contextualized representations)
contrastive loss to train using masked prediction
some latent representations are masked
did the model predict the true latent? – cosine similarity
hidden-unit BERT (HuBERT)
extension of wav2vec 2.0
learns from unlabeled audio, predicts masked portions like BERT
idea: use clustering to generate pseudo-labels for audio segments, then train a model to predict those labels
pseudo-labels initialized using MFCC (and k-means)
fine-tuned for automatic speech recognition (ASR)
WavLM
extension of HuBERT
more robust learning objective, more data
idea: incorporate speech denoising as well as time/channel-wise masking in pre-training to improve robustness and generalization
strong performance on both speech recognition and speaker-related tasks (e.g. speaker verification, diarization = partitioning of audio according to speakers)
gated relative position bias
relative position bias – attention is affected by the distance between tokens (tokens close to each other should attend more)
learnable gate controls the influence of the position bias
end-to-end approaches (audio → audio)
WaveNet
deep generative model for raw audio waveforms
unconditioned
idea: model the joint probability of an audio waveform as a product of conditional probabilities (using the chain rule)
capturing long-range dependencies in an efficient way (using stacks of dilated causal convolutions)
the output at time t depends only on x_{<t} (causality)
exponentially increase receptive field
gated activation units and residual connections for stable training
applications: audio generation (e.g. speech synthesis) and enhancement; can be adapted for music or other sequential data
SampleRNN
similar to WaveNet
hierarchical structure
upper tiers summarize longer contexts
lower tiers generate fine-grained details
it's very difficult to capture very long dependencies (like 60 seconds)
generating audio from intermediate representations
STFT is invertible, but reconstructing the audio signal from a spectrogram is not immediate (we kept only the modulus)
we need the phase of the signal
classical approach: Griffin-Lim
iterative algorithm to estimate the phase
exploits the redundancy between time-frames and between frequency bins of the STFT representation
widely used in classical speech synthesis and as a baseline
learning-based approach: HiFi-GAN
uses GAN to synthesize realistic waveforms conditioned on Mel-spectrograms
enables real-time speech synthesis with high perceptual quality
generator takes Mel-spectrogram as input, outputs raw audio
uses transposed convolutions and residual blocks to upsample and generate waveform samples
multi-scale discriminators – operate at different resolutions of the waveform
multi-period discriminators – focus on periodic patterns in speech
Tacotron
end-to-end TTS model
maps character or phoneme sequences to Mel-spectrograms
no need for hand-crafted linguistic features
typically used together with WaveNet or HiFi-GAN to generate the final waveform from the predicted spectrogram
encoder + decoder
uses attention to map text position to audio frames
L1 loss
AnCoGen
masked-modeling-based model
idea: map the spectrogram to attributes (pitch, SNR, reverberation, content, …)
enables control over the audio attributes
ratio – used for masking
(0,1) → audio non-masked, attributes masked
masking can be partial (e.g. 0.7)
is combined with a neural vocoder (HiFi-GAN) to generate the final waveform from the predicted spectrogram
Diffusion
basic division of generative models
explicit density
tractable density → autoregressive
approximate density → VAE
implicit density
direct sampling → GAN
indirect sampling → diffusion
basic concepts
Brownian motion – continuous random movement of a particle, with increments that are Gaussian and independent
diffusion process – stochastic system whose evolution is governed by Brownian motion
diffusion (in image generation) – we add noise
the goal of the model (DDPM, denoising diffusion probabilistic model) is to remove the noise
forward diffusion process (fixed) – start with data x₀, gradually add Gaussian noise in T steps
q(x_t ∣ x_{t−1}) = N(x_t; √(1−β_t) x_{t−1}, β_t I)
reverse denoising process (generative)
learn pθ(xt−1∣xt) to denoise
pθ(xt−1∣xt)=N(xt−1;μθ(xt,t),Σθ(xt,t))
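Because Gaussian noising steps compose, the forward process has the closed form x_t = √(ᾱ_t) x₀ + √(1−ᾱ_t) ϵ with ᾱ_t = ∏_{s≤t} α_s; a numpy sketch (the linear β schedule values are an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# linear noise schedule beta_1..beta_T (assumed values)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # abar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Sample x_t directly from x_0: all t noising steps of
    q(x_t | x_{t-1}) collapse into one Gaussian."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones(4)
eps = rng.standard_normal(4)
print(q_sample(x0, 0, eps))      # barely noised
print(q_sample(x0, T - 1, eps))  # essentially pure noise
```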
training
the model should estimate a noise vector ϵ∈Rn from a given noise level σ>0 and noisy input xσ∈Rn s.t. for some x0 in the data manifold K it holds that xσ≈x0+σϵ
a denoiser ϵθ:Rn×R+→Rn is learned by minimizing L(θ):=Ex0,σ,ϵ∥ϵθ(x0+σϵ,σ)−ϵ∥2
x0 sampled from training data
σ sampled from a training noise schedule
in practice, noise levels σ range from 0.01 to 100
ϵ sampled from N(0,In)
we are trying to find an ideal denoiser ϵ* that minimizes L(θ)
Representation Learning
to fine-tune an ImageNet-pretrained model, we drop the last weight matrix of dimension f×1000 and replace it with a matrix of dimension f×c, where c is the desired number of classes (f … number of features)
problems
not optimal for every problem (e.g. video, medical)
humans don't need ImageNet pretraining
solution
replace ImageNet pre-training by an unsupervised training (representation learning)
generation-based methods
autoencoders
train such features that can be used to reconstruct original data
input data x → encoder → features z → decoder → reconstructed input data x^
z typically has less features than x
we minimize ∥x−x^∥2
the encoder learns the representation
if we have a large unlabeled dataset and a small annotated dataset, we can use an encoder or a GAN to initialize a supervised model
limitations
features not trained to discriminate
limited performance
additional computation cost (decoder or generator)
solution: self-supervised learning (SSL)
self-supervised learning – supervision comes from the data (no need to annotate)
pretext task
we don't care about this specific task but it helps the model to learn the representations
e.g. relative patch prediction
but it's not that easy
low-level color cues (e.g. chromatic aberration) let the model cheat the task
solution: drop two channels, replace by Gaussian noise
another task: solving jigsaw puzzles
to make it tractable, we sample a fixed subset of 1000 permutations and train the classifier only on those
other tasks
colorization
rotation prediction
super-resolution
contrastive learning
goal: to learn features that are discriminative among instances
but we would need too many classes (one for each instance in the training dataset) – we need non-parametric softmax & memory bank
memory bank contains feature representations of all images in the dataset
invariant information clustering – maximizing the mutual information between encoded variables
SimCLR
we have two images A,B, apply two random transformations to each of them
so we get four images A1,A2,B1,B2, we want to maximize agreement between the ones based on the same image (e.g. A1,A2) and minimize agreement between the ones based on different images (e.g. A1,B1)
agreement defined as cosine similarity (pairwise)
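A numpy sketch of the resulting NT-Xent loss; the temperature of 0.5 and the row layout (views of the same image in adjacent rows) are assumptions:

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """NT-Xent: rows (2k, 2k+1) are the two augmented views of image k;
    each view's positive is its sibling, every other row is a negative."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    pos = np.arange(len(z)) ^ 1                        # sibling: 0<->1, 2<->3, ...
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(z)), pos].mean()

rng = np.random.default_rng(0)
anchor = rng.standard_normal((2, 8))
views = np.repeat(anchor, 2, axis=0)  # "perfect" augmentations: identical views
print(nt_xent(views))                 # low: positives agree perfectly
print(nt_xent(views[[0, 2, 1, 3]]))   # higher: siblings are mismatched
```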
MoCo (momentum contrast)
instead of end-to-end learning or memory bank, we use momentum encoder
we mix the previous parameters of the network with the current one
SimSiam
DINO
vision transformer (ViT)
linear projection of flattened 16×16 patches + learned position embedding
additional classifier token
two networks: student and teacher
momentum teacher as in MoCo
segmentation emerges
what is a good representation? – we need robustness (to handle domain shift)
appearance changes due to different sensors – infrared vs. normal camera
use of synthetic data – synthetic datasets may be cheaper to make
unseen scenarios (e.g. natural disasters)
biased datasets
unsupervised domain adaptation (DA)
source and target distributions, we want them to have similar representations
e.g. we trained the model on labeled photos, we want it to handle (unlabeled) cartoon images
approaches
discrepancy-based method
use maximum mean discrepancy (MMD) to align the distributions
alignment layers
idea: learn domain-agnostic representation by adjusting the network architecture
batch normalization
adversarial-based methods
employ an adversarial objective to ensure that the network cannot distinguish between the source and target domains
adaptation through translation
train model which can translate between domains
in some contexts, discrete representations may be useful
VQ-VAE = VAE with vector quantization
vector quantization maps a vector from a continuous space to a vector from a dictionary (codebook)
Image Generation
variational autoencoders (VAEs)
encoder (predicts distribution in latent space) + decoder (predicts distribution in feature space)
we want the latent space to be close to Gaussian
that's what KL divergence term does
dimensions in latent space may correspond to some properties of the objects in the image
we can do linear interpolation – we encode two images, “mix” them (in some ratio), then decode
GANs
problem: want to sample from complex, high-dimensional training distribution (no direct way to do this!)
solution: sample from a simple distribution (e.g. random noise) & learn transformation to training distribution
minimax objective function
alternate between gradient ascent on discriminator and gradient descent on generator
in practice: instead of minimizing likelihood of discriminator being correct, we maximize likelihood of discriminator being wrong (higher gradient signal for bad samples → works better)
Progressive GAN – training layer by layer (we start by training simple small layers, then add larger layers)
BigGAN
style transfer
we want to take content from one image and style from the other one
we don't want to transfer only color but also brush strokes
we don't change the structure of the original image, we change statistical properties of its patches (to get different style)
to compute loss, the VGG encoder needs to be used again on the result and the “style” image
style-based GAN
traditional approach: latent vector comes from the source image
style-based GAN starts with learned constant tensor, adds noise and style (in each layer) by predicting scale and shift
we swap source images at some point in the process to get the mix of style and content
image-to-image translation
goal: translate image from one representation to another
edges (drawing) → photo
labels → street scene
BW → color
aerial → map
day → night
Pix2Pix
use GAN, discriminator gets both images (we want the generated images to be both plausible and to correspond to the original image)
generator is just autoencoder (encoder + decoder)
convolution & deconvolution
U-Net uses skip connections from the encoder to the decoder (not everything has to be encoded in the latent space)
works better
example: generating image based on segmentation
we can use a trained segmentation model to segment the generated image
then, we can apply metrics used for image segmentation evaluation
smarter discriminator
instead of predicting only one score (on the scale from real to fake), we can predict multiple scores (one for each region of the image)
this doesn't work for too small regions – “is this pixel realistic?” is not a good question (the discriminator cannot see patterns, only colors of individual pixels)
we assume we have access to p(x,y) and train model to sample y∼p(y∣x) or x∼p(x∣y)
but we don't always have p(x,y) → unpaired image-to-image generation
example: you may have many images of horses and many images of zebras, but never a pair of corresponding images
CycleGAN
uses both GAN losses and cycle-consistency loss
if we generate zebra based on a horse, we want to be able to generate horse based on the zebra and get the same horse as before
based on ResNet (not U-Net)
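The cycle-consistency term itself is just an L1 reconstruction penalty in both directions; a toy numpy sketch with invertible linear stand-ins for the two generators (the weight λ=10 is a common choice, assumed here):

```python
import numpy as np

def cycle_loss(G, F, x_horse, y_zebra, lam=10.0):
    """L1 cycle-consistency: F(G(x)) should recover x, and G(F(y))
    should recover y. G: horse->zebra, F: zebra->horse (toy stand-ins)."""
    forward = np.abs(F(G(x_horse)) - x_horse).mean()
    backward = np.abs(G(F(y_zebra)) - y_zebra).mean()
    return lam * (forward + backward)

# Toy "generators" that are exact inverses -> zero cycle loss.
G = lambda x: 2.0 * x + 1.0
F = lambda y: (y - 1.0) / 2.0
x = np.linspace(0.0, 1.0, 5)
y = np.linspace(0.0, 1.0, 5)
print(cycle_loss(G, F, x, y))  # 0.0
```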
let's have a shared latent space!
so we have two encoders (one for zebra, one for horse) and two decoders that share the same latent space
weights are shared between encoders
geometry-consistency
we check how well the model works for transformed images (we then invert the transformation and compare with the result for the untransformed image)
GcGAN
high resolution images
Pix2PixHD
we don't want to use many layers – you lose information
architecture similar to style-based GAN
struggles with uniform surfaces
video generation
we need temporal consistency
we could do 3D convolution instead of 2D convolution
but we would need a lot of data
we could consider static background and moving objects
so we generate static background (image) and two videos – foreground and mask (ratio for mixing the foreground and background)
let's generate a trajectory of vectors in latent space we can then pass to a decoder
limitations
fixed-length videos only
no control over motion and content
MoCoGAN
DVDGAN
video-to-video translation
animating single subject – latent space with human pose
neural radiance fields (NeRF)
estimate “shape” of an object based on several photos
can render novel views
Diffusion models
we address generation as a denoising problem
similar to GAN, we start with a distribution easy to sample (Gaussian) and get a distribution we want (but we do it in multiple steps)
we estimate mean of the next distribution
we can combine multiple steps of adding noise just into one step
instead of predicting the image, we predict the noise
it's easier as the variance is fixed (we can focus on predicting mean)
also, the image changes over time – the noise does not (?)
we can use simpler loss formula even though there's no theoretical explanation for it
L_t = E_{t∼U[1,T], x₀, ϵ_t}[ ∥ϵ_t − ϵ_θ(x_t, t)∥² ]
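A numpy sketch of this simplified loss; an analytic "oracle" stands in for the learned denoiser ϵ_θ (with x₀ = 0 it recovers ϵ exactly, so the loss is ~0):

```python
import numpy as np

def simple_loss(eps_model, x0, t, alpha_bar, rng):
    """L_t = E ||eps - eps_theta(x_t, t)||^2 with
    x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

# Oracle denoiser: with x0 = 0 we have x_t = sqrt(1 - abar_t) * eps,
# so rescaling x_t recovers eps exactly and the loss is ~0.
oracle = lambda xt, t: xt / np.sqrt(1.0 - alpha_bar[t])
x0 = np.zeros(16)
print(simple_loss(oracle, x0, 500, alpha_bar, np.random.default_rng(0)))
```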
training and sampling algorithms
we want the results to follow a distribution → we add some randomness according to the variance
we want the distribution to be conditional
first approach: classifier guidance
we have a diffusion model P(x)
we have a classifier P(y∣x)
we want to be able to sample P(x∣y)
idea: instead of just denoising, we also move in the direction that makes the probability P(y∣x) higher
but we need to compute the gradient of the classifier (using backpropagation) – high computational cost
second approach: classifier-free guidance
the noise model is trained with y in mind
we also use an unconditioned denoiser – mixing conditional and unconditional predictions trades off sample diversity against fidelity to the condition
latent diffusion models
problem: to generate high-resolution images, we need to start from high-resolution noise and it takes many denoising steps (→ computational cost)
we could use distillation
or we can project the images in a discrete latent space
let's have an encoder and a decoder
we consider a diffusion model in the latent space
U-Net used for denoising
cross-attention to apply conditioning (in the U-Net)
conditioning is in a single vector C
diffusion transformers (DiT)
conditioning is used to predict scale and shift (similar to StyleGAN)
2D → 3D latent space
denoiser for video gets very computationally intensive if we want to attend everywhere
instead, we consider separate spatial and temporal layers
why it changes everything
GAN worked only on datasets with limited diversity
fine-grained control with text
we have a general-purpose image prior
we can start with pretrained large models and use transfer learning for specific tasks
one training of a large model costs 600 000 euros
how can we reuse a pretrained diffusion model so that we can condition using spatial data (a sketch…)
how to fit all the information in a single vector C
ControlNet – encoder with skip connections to the U-Net
inpainting
we want to put a specific object in the image
idea: we add noise to the whole image and let it generate with a conditioning
to make sure that the rest of the image does not change, we can replace the rest of the image with the original image (+ noise) in every step of denoising
we use a mask for that
(other slides skipped)
personalized text-to-image
example: I want to generate something based on this specific (real) statue
how can I describe this specific object using an embedding vector?
has to be learnt
Dreambooth
problem: if I finetune the model using photos of my dog standing, I will only get results with my dog standing (not sitting)
so I use specific loss that ensures the generated diversity is similar to the diversity of real dog poses
Multimodal Learning
multimodal learning
many research questions
many other modalities than just video and audio
text, lidar, thermal, events, …
interactions between modalities
sensor fusion – using information from diverse sensors to make predictions
types
camera + depth sensor → RGB-D object detection
RGB + thermal
RGB + optical flow (object moving in the video)
we have aligned inputs; when to perform fusion?
early fusion – concat, then pass to the model
late fusion – two models, jointly predict
easier to train (we don't need that much paired data)
can be run in parallel
middle fusion – two models (with shared weights), then fuse features and pass to third model
another approach: learn when to perform fusion (siamese network)
using ViT with two modalities
either pass shorter sequence of pairs → early fusion
or pass two sequences (so the entire sequence is longer) → late fusion
worked better
RGB + lidar detection
advantages
effective in low light and some adverse weather
robust in low-texture areas
penetrates dense foliage (vegetation) – for satellite imagery
long range
lidar returns a point cloud (points detected in 3D)
approaches
first find object in image, then use lidar points (sequential)
or we can use late fusion
multimodal translation (from one modality to another)
I2T: image captioning – we condition text on some visual observation
first RNNs with LSTM, then Transformer (attention-based approaches)
V2T: lip reading
T2I: text-to-image generation
A2V: audio to video
ASR: speech recognition – not generative (???)
TTS: text-to-speech
hybrid tasks
visual question answering
projecting the image and the question in the same vector space
image processed by CNN
question processed by CNN/LSTM
attention layers – which part of the image should I look at?
lip reading
seq2seq with attention – which time should I look at to predict the next word? (predicting alignment between text and audio/video frames)
multimodal alignment (identifying and modeling correspondences)
ImageNet
hard to scale up
vision is not only about classes
limited robustness to distribution shifts
adaptation to other tasks (new classes) requires further training
zero-shot classification: CLIP
frame the problem as an image-caption matching problem
captions
easier to get than classes
contain semantic, geometric, and stylistic information
multi-object images
collected 400M pairs (312× more than ImageNet)
contrastive pre-training
captions encoded using transformer
images encoded by ViT or ResNet
for each image, probability (softmax) of every possible caption (and vice versa)
loss function – maximize likelihood of predicting correct text for the image and correct image for the text
can be then used for zero-shot classification
user-defined classes can be expressed as captions: "a photo of a {object}."
can be also used as a search engine
you compute the similarity between the provided caption and the images in your database
can be used to build on top of (prompt engineering – “software 3.0”)
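Zero-shot classification then reduces to cosine similarity plus a softmax over class captions; a numpy sketch with random stand-in embeddings (the logit scale of ~100 corresponds to CLIP's learned temperature, an assumption here):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Cosine similarity between one image embedding and one text embedding
    per class caption, softmax over classes. Embeddings here are random
    stand-ins, not real CLIP outputs."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * (txt @ img)  # ~100 is CLIP's learned logit scale
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
classes = ["a photo of a dog.", "a photo of a cat."]
text_embs = rng.standard_normal((2, 512))                  # stand-in captions
image_emb = text_embs[0] + 0.1 * rng.standard_normal(512)  # image near "dog"
probs = zero_shot_classify(image_emb, text_embs)
print(classes[int(np.argmax(probs))])
```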
CLIP can be also used for audio classification
Imagebind
based on contrastive loss
paired data of many different modalities: images, videos, text, audio, depth, thermal, IMU (movement)
connecting modalities which were not connected before
masked modeling vs. causal language modeling
masked model – BERT, bidirectional
causal language model – GPT, unidirectional
training LLM
pre-training → instruction fine-tuning → reinforcement learning with human feedback
having several copies of LLMs fine-tuned for different tasks is expensive
alternative approach: prefix-tuning
freeze the weights
train new prefix embeddings (added at the beginning of the sequence) that optimize the behavior of the network for the specific task
similar to prompt engineering (“You are an AI agent, be kind to the user.“)
but here, we use gradient descent to find tokens which work the best (they don't have to correspond to existing words)
bidirectional vs. causal (unidirectional) attention
bidirectional models cannot be used to generate
unidirectional uses masked self-attention
→ can generate, is more compute efficient, has good modeling capacities
multimodal LLMs
VisualBERT (bidirectional)
image (split using bounding boxes by an object detector) + caption
masking words in the caption
objective 1: predict masked words
objective 2: predict if the image matches the caption or not
downstream task: visual question answering
[mask] token is appended to the question (→ answer is predicted by the model)
VQA is considered as a classification problem
unidirectional MLLM (encoder + decoder)
encoder gets image and beginning of the sentence
image is first split into patches and processed by convolution
bidirectional attention
decoder continues the sentence
decoder-only
vision encoder trained using the task of next token prediction
produces encoded representations of the image
used as prefix for LLM (or even as a part of the input – anywhere)
LLM frozen – why?
cost of training
contains a lot of useful knowledge we don't want to lose
it combines perception of vision encoder and reasoning capacity of LLM
alternative approach: use CLIP instead of encoder training (to get the embeddings)
Flamingo
images are removed from the text and replaced by placeholders
then, images are provided using gated cross-attention
skip connections make sure that the model preserves its pre-trained abilities even after modification (at the beginning of the fine-tuning phase – with the initial parameters for the new blocks in the architecture)
each text token cross-attends only to the most recent preceding image
but the self-attention layer still ensures everyone sees everything
Socratic Models
idea: convert all modalities into text
then perform reasoning in text form
but there are things hard to describe with text
training conversational agent for visual question answering
hard to get data
Llava architecture
frozen vision encoder
frozen LLM
trained projection is applied to the output of the vision encoder (before passing these “tokens” to the LLM)
Visual LLM
Qwen
supports audio, video, and text
uses vision and audio encoders
we need positional embeddings
rotary position embedding (RoPE)
multiple rotations happening at once (the vector of dimension n is split into n/2 two-dimensional pairs and each pair is rotated by a different frequency)
for videos, we use M-RoPE
rotary embedding decomposed into temporal, width, and height component
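A numpy sketch of plain 1-D RoPE showing the key property that attention scores depend only on relative position (base 10000 as in the original RoPE formulation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Split the d-dim vector into d/2 pairs; rotate pair i by
    angle pos * base^(-2i/d)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
d1 = rope(q, 3) @ rope(k, 7)    # relative offset 4
d2 = rope(q, 13) @ rope(k, 17)  # same offset 4, different absolute positions
print(np.isclose(d1, d2))       # True: score depends only on the offset
```

M-RoPE applies the same idea per component, splitting the rotation budget across temporal, width, and height indices.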
best open-source model
speech synthesis in an autoregressive manner
image-to-text vs. text-to-image
generating text – autoregressive approach (predicting next token based on the previous ones)
generating images – diffusion
how to unify the two tasks? → Mixed-Modal Auto-Regressive LM
image converted to tokens and back using tokenizer and de-tokenizer