# Exam: Multimodal AI

## Convolution

- fully connected neural network – impractical for images (too many weights)
- convolution – “filter”
	- we move a function over the signal and integrate
	- what to do at the ends? → shrink or pad
- a CNN learns the filters that transform the images
- notation
	- batch size $B$
	- image size $W×H$
	- $C$ … number of feature channels (neurons per pixel)
		- $C_{in}$ … number of feature channels in current layer
		- $C_{out}$ … number of feature channels in next layer
		- usually $C_{in}=3$ for the first layer (for a color image)
	- $K×K$ … convolutional filter kernel size
- number of weights of a convolutional layer … $C_{out}×(K×K×C_{in}+1)$ (the $+1$ is the bias of each output channel)
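
A minimal sketch of the parameter count above, assuming one bias per output channel. The function names and the 224×224 input size are illustrative and not taken from the notes.

```python
def conv_layer_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Number of learnable parameters of a K x K convolutional layer."""
    weights_per_filter = k * k * c_in        # one K x K kernel per input channel
    bias_per_filter = 1 if bias else 0       # the "+1" in the formula
    return c_out * (weights_per_filter + bias_per_filter)


def dense_layer_params(n_in: int, n_out: int) -> int:
    """Parameters of a fully connected layer, for comparison."""
    return n_out * (n_in + 1)


# First layer of a hypothetical network: RGB input, 64 filters of size 3 x 3.
print(conv_layer_params(c_in=3, c_out=64, k=3))  # 1792
# A dense layer mapping a flattened 224 x 224 RGB image to 64 units needs far more.
print(dense_layer_params(n_in=224 * 224 * 3, n_out=64))
```

The convolutional count does not depend on $W×H$, which is the parameter-sharing advantage in numbers.
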
- advantages
	- spatial locality (local receptive fields) – every neuron is looking at a small patch of the image
	- parameter sharing – we don't need that many weights
	- translation equivariance – we don't need to preprocess the images that much (object detection works no matter the position of the object in the image)
- motivation for padding (with zeros)
	- convolutions can only be executed where the kernel lies entirely within the input domain, so without padding every layer shrinks the feature map – that's inconvenient as it couples architecture and input size
- downsampling approaches
	- stride – we are sliding the filter with a step size larger than one
	- pooling – we apply a function (usually max) over a patch
- if pixel-level outputs are expected, we need to use upsampling afterwards
- upsampling approaches
	- nearest neighbor (we just copy the value)
	- bed of nails (we put the value in the upper-left corner and use zeros elsewhere)
	- max unpooling (we need to remember where we took the maximum from, then put it back there and put zeros elsewhere)
		- requires corresponding pairs of down- and upsampling layers
		- used in SegNet
- benchmark: ImageNet Large Scale Visual Recognition Challenge
- architectures
	- LeNet – 2 convolution layers, 2 pooling, 2 fully connected
		- state-of-the-art accuracy on MNIST
	- AlexNet – 8 layers, ReLUs, dropout, data augmentation
		- number of feature channels increases with depth, spatial resolution decreases
	- VGG architecture
		- uses 3×3 convolutions everywhere
		- receptive field size
			- in the original image, the receptive field is 1×1
			- in the first layer, the receptive field is 3×3
			- by applying the convolution to the already convolved pixels, we get a 5×5 receptive field in the second layer
			- the formula looks like this: $RF_0=1,\ RF_i=RF_{i-1}+(K-1)$
	- Inception / GoogLeNet – 22 layers
		- multiple intermediate classification heads to improve gradient flow
		- global average pooling (no FC layers), fewer parameters than VGG
		- uses 1×1 convolutions (only across channels) to reduce number of features → higher efficiency
	- ResNet (2016)
		- residual connections allow for training deeper networks (up to 152 layers)
		- very simple and regular network structure with 3×3 convolutions
		- strided convolutions for downsampling
	- U-Net
		- max-pooling, up-convolutions and skip-connections
		- de facto standard for many tasks with image output (e.g. depth, segmentation)
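
The receptive-field recurrence from the VGG notes, $RF_0=1,\ RF_i=RF_{i-1}+(K-1)$, can be sketched as follows. This assumes stride 1 and no pooling or dilation between the layers; the function name is illustrative.

```python
def receptive_fields(num_layers: int, kernel_size: int = 3) -> list[int]:
    """Receptive field size after each of `num_layers` stacked K x K convolutions."""
    rf = [1]  # RF_0: a pixel of the original image sees only itself
    for _ in range(num_layers):
        rf.append(rf[-1] + (kernel_size - 1))  # RF_i = RF_{i-1} + (K - 1)
    return rf


print(receptive_fields(3))  # [1, 3, 5, 7]
```

Three stacked 3×3 layers cover the same 7×7 window as a single 7×7 layer, but with 3·9 = 27 instead of 49 weights per input/output channel pair.
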
- RNNs
	- hidden state
		- combination of the current input and the previous hidden state
		- updated at each time step
		- allows for processing sequences of variable length
	- usually tanh activation
	- output of a cell is based on current hidden state
	- there can be one or multiple outputs
		- one to many – image captioning (image → sentence)
		- many to one – action recognition (video → action)
		- many to many – machine translation (sentence → sentence)
		- many to many – object tracking (every frame: video → object location)
		- to determine the length of the output sequence, a stop symbol can be predicted
	- backpropagation through time becomes intractable for long sequences
		- so truncated backpropagation (over a limited number of time steps) may be used
	- we can even have multiple layers
		- or we can make the cells deeper
		- often combined with residual connections in vertical direction
	- problem: vanishing or exploding gradients
		- RNNs require careful initialization to avoid saturating activation functions
		- to prevent exploding – gradient clipping
		- to prevent vanishing – architectural change is required
		- GRU and LSTM units are used to solve this problem
			- use gates for filtering information
			- Gated Recurrent Unit: reset gate, update gate
			- Long Short-Term Memory: forget, input, output
- Transformer
	- CNNs could see more context but only through many stacked layers, which made them inefficient for truly long sequences
	- RNNs processed sequences one step at a time, making training slow and making it hard to capture relationships across long distances
	- idea: let every element of a sequence directly “pay attention” to every other element → no recurrence, no deep stacks required
		- attention … $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
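
The attention formula can be sketched in plain Python as below. This is a single-head toy version without masking or learned projections; `Q`, `K`, `V` are lists of row vectors and the example values are made up.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # scaled dot product of this query with every key
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # each output row is a weighted average of the value rows
    output = [[sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
              for w in weights]
    return output, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = attention(Q, K, V)
print(w[0])  # attention weights of the first query over the three keys
```

Every query attends to every key in one step, which is the "no recurrence" point made above.
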

## Probabilistic models with latent variables: VAE, GAN

- probabilistic models – aim to learn a parametric distribution $p_\theta(x)$ that approximates the complex data distribution $p_\mathrm{data}(x)$
	- once learned, we can (ideally) sample new data
	- we can jointly learn them with other probabilistic models using maximum likelihood
- Kullback-Leibler divergence
	- non-negative, asymmetric (it's not a distance)
	- $D_{\mathrm{KL}}(p(x)\|q(x))=-\mathbb E_{p(x)}[\log\frac{q(x)}{p(x)}]$
	- we minimize the KL divergence between the data distribution and the learned distribution
		- this leads to the maximum likelihood estimation of parameters (data distribution is constant)
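
A small numerical check of the KL properties above for discrete distributions. The decomposition into cross-entropy minus entropy is the reason minimizing KL over the model amounts to maximum likelihood; the distributions `p` and `q` are made-up examples.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) = E_p[log p/q]; requires q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Expected negative log-likelihood of q under data drawn from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.4, 0.1]  # plays the role of the data distribution
q = [0.8, 0.1, 0.1]  # plays the role of the model
print(kl_divergence(p, q), kl_divergence(q, p))  # different values: asymmetric
print(kl_divergence(p, p))                       # 0.0
# KL = cross-entropy - entropy; the entropy of p does not depend on the model.
print(cross_entropy(p, q) - entropy(p))
```
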
- latent variables
	- not observed directly
	- we try to get a more compact representation based on the observation
	- examples
		- speech enhancement: noisy speech (observation) → clean speech (latent variable)
		- person tracking: detections (observation) → person positions (latent variable)
		- representation learning: raw data (observation) → representation (latent variable)
	- notation
		- observed variable … $x$
		- latent variable … $z$
	- $p_\theta(x)=\int p_\theta(x,z)\ dz=\int p_\theta(x|z)p_\theta(z)\ dz$
	- to get samples, first draw $\hat z\sim p_\theta(z)$, then draw $\hat x\sim p_\theta(x|\hat z)$
- simple example: clustering
	- basic approach: K-means algorithm
	- point-to-cluster assignment … latent variable (unknown)
		- must be inferred together with the centroids (parameters of the model)
- more advanced approach: Gaussian mixture model
	- $p(x_n|z_n=k)=\mathcal N(x_n;\mu_k,\Sigma_k)$
	- we find parameters using EM algorithm
		- we maximize $\mathcal Q$ (expected complete-data log-likelihood)
		- $\mathcal Q(\theta,\theta^{r-1})=\mathbb E_{p(z|x;\theta^{r-1})}\log p(x,z;\theta)$
	- relationship with log-likelihood
		- $\log p(x)=\mathbb E_{q(z)}[\log\frac{p(x,z)}{q(z)}]+D_{KL}(q(z)\|p(z|x))$
		- first term – M-step ($\mathcal Q$)
		- second term – E-step
	- we set $q(z)=p(z\mid x)$ so $D_{KL}=0$
- we can also consider continuous latent variables
	- PPCA (probabilistic principal component analysis)
	- $z\in\mathbb R^D,\;x\in\mathbb R^F$ where $D\ll F$
		- we want to extract a representation $z$ of each $x$
	- $p(z)=\mathcal N(z;0,I)$
	- linear model → $p(x|z)=\mathcal N(x;Az+b,\nu I)$
	- non-linear model → $p_\theta(x|z)=\mathcal N(x;\mu_\theta(z),\Sigma_\theta(z))$
- variational autoencoders (VAEs)
	- encoder + decoder
		- encoder learns $q_\phi(z|x)$, an approximation of the posterior $p(z|x)$
		- decoder learns $p(x|z)$
		- we consider Gaussian prior $p(z)=\mathcal N(z;0,I)$
		- to infer $z$ from $x$, we can use the encoder
		- to generate $x$, we can use the prior and the decoder
	- decoder … $p(x|z)$
		- covariance matrix has to be symmetric and positive definite
			- we assume the matrix to be diagonal
			- trick: instead of estimating the variance $\nu$ directly, we estimate the log-variance $\eta$ (→ variance is positive)
		- the network outputs $\mu_\theta,\eta_\theta$
			- $z$ goes in, the outputs have the dimension of $x$
	- if $p(x\mid z)$ is non-linear (implemented as a deep network), the posterior distribution $p(z\mid x)$ cannot be computed analytically and needs to be approximated
		- we use another feed-forward network to do that → encoder
		- outputs $\mu_\phi,\eta_\phi$ have the dimension of $z$ (but $x$ goes in)
	- we “chain” the posterior (encoder) and the generative (decoder) model
	- learning – ELBO (evidence lower-bound)
		- formulation from EM: $\log p(x)=\mathbb E_{q(z|x)}[\log\frac{p(x,z)}{q(z|x)}]+D_{KL}(q(z|x)\|p(z|x))$
		- second term cannot be computed but it is non-negative, so $\log p(x;\theta,\phi)\geq\mathbb E_{q_\phi(z|x)}[\log\frac{p(x,z)}{q_\phi(z|x)}]$
		- $\log p(x;\theta,\phi)\geq\mathbb E_{q_\phi(z|x)}[\log p_\theta(x|z)]-D_{KL}(q_\phi (z|x)\| p(z))$
			- first term – reconstruction (does the decoder work well?)
			- second term – regularization (is the latent distribution standard normal?)
		- $\mathcal L_{ELBO}(\theta,\phi)=\mathbb E_{q(z|x)}[\log\frac{p(x,z)}{q(z|x)}]$
			- note: we need to maximize this (or we can minimize $-\mathcal L_{ELBO}$ in gradient descent)
			- we cannot compute the expectation in closed form, we need to sample from $q(z|x)$
			- sampling is non-differentiable, we cannot backpropagate
		- reparametrization trick
			- we cannot backpropagate through a direct sample from the posterior like this: $\hat z\sim\mathcal N(\mu,\Sigma)$
			- so we sample like this: $\bar z=\mu+\Sigma^{1/2}\epsilon$ with $\epsilon\sim\mathcal N(0,I)$
			- $\bar z$ is differentiable and follows the same distribution as $\hat z$
		- posterior collapse
			- it can happen that the VAE stops learning if the posterior $q$ gets too close to the standard prior
			- KL term dominates the ELBO – we should reduce its weight
			- also reducing dimensionality $D$ of the latent space helps
	- exact EM × VAE
		- there also exist things in the middle (variational EM)
	- limitation of VAE
		- frames modeled independently – we need time/sequential modeling!
			- for spectrogram, for example
		- one solution: consider *blocks* of spectrogram as inputs
	- probabilistic sequential modeling & inference
		- we can use an RNN
		- the sampling occurs sequentially and cannot be parallelized
		- → dynamical VAEs (DVAEs)
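
A sketch of two VAE ingredients from the notes above: the reparametrization trick and the KL regularizer of the ELBO. It assumes a diagonal covariance parametrized by the log-variance; the closed-form KL between a diagonal Gaussian and $\mathcal N(0,I)$ is the standard result and is not spelled out in the notes. The values of `mu` and `log_var` stand in for hypothetical encoder outputs.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu, log_var = [1.0, -2.0], [0.0, math.log(0.25)]  # hypothetical encoder outputs
samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]
mean_0 = sum(s[0] for s in samples) / len(samples)
mean_1 = sum(s[1] for s in samples) / len(samples)
print(mean_0, mean_1)                      # close to mu
print(kl_to_standard_normal(mu, log_var))  # > 0: posterior differs from the prior
```

The randomness lives only in `eps`, so the sample is a differentiable function of `mu` and `log_var`; the KL term reaches zero exactly when the encoder outputs the prior, which is the posterior-collapse situation.
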
- generative adversarial network (GAN)
	- dataset (real samples), generator (fake samples)
	- generator $G_\theta$ takes a random noise $z$ as input and generates an image $x=G_\theta(z)$
	- discriminator $D_\phi$ takes an image $x$ as input and outputs the probability that $x$ is real
	- generator and discriminator are trained jointly in a minimax game
		- $\max_\theta\min_\phi\mathcal L_{BCE}(D_\phi;x,G_\theta(z))$
		- or $\min_\theta\max_\phi\mathbb E_{x\sim p_{\mathrm{data}}(x)}[\log D_{\phi}(x)]+\mathbb E_{z\sim p_z(z)}[\log(1-D_\phi(G_\theta(z)))]$
		- this corresponds to minimizing the Jensen-Shannon divergence between $p_{\mathrm{data}}(x)$ and $p_\theta(x)$ with optimal $\phi$
		- $D_{JS}(p,q)=\frac12 D_{KL}(p\|\frac{p+q}2)+\frac12 D_{KL}(q\|\frac{p+q}2)$
	- typically, GANs are trained by alternating between updating the discriminator and the generator with different batches of data
		- update the discriminator $D_\phi$ for a few steps
		- update the generator $G_\theta$ for a step
	- problems
		- very sensitive to the choice of hyperparameters
		- weak discriminator → generator may produce non-realistic samples
		- strong discriminator → generator cannot learn (if the generator is always caught, it does not know how to improve) or tends to replicate the training set (overfitting)
		- mode collapse – the model does not generate the diversity of the dataset and focuses on one thing instead (e.g. generates just ones from MNIST)
	- if we consider two different “Dirac distributions” (with 1 at a single point), their JS divergence is constant and does not reflect the distance of the two points → bad
		- solution: Wasserstein distance
	- avoiding mode collapse: Wasserstein GAN
		- based on Wasserstein distance – “minimum cost of transporting mass from one distribution to another”
			- $W(p,q)=\inf_{\gamma\in\Gamma(p,q)}\mathbb E_{(x,y)\sim\gamma}[\|x-y\|]$
			- where $\Gamma(p,q)$ is the set of all joint distributions with marginals $p$ and $q$
		- properties of $W$
			- it is a real distance, not a divergence – it satisfies the triangle inequality and is sensitive to the geometry of the underlying space
			- it is useful for comparing distributions that are not well-aligned or have different supports (as opposed to the JS divergence)
			- it's hard to compute efficiently (due to the infimum) in high-dim spaces
			- but it can be written as $\max_{\|f\|_L\leq 1}\left\{\mathbb E_{x\sim p}[f(x)]-\mathbb E_{y\sim q}[f(y)]\right\}$
		- WGANs use the Wasserstein distance instead of the JS divergence
			- they use a critic $C_\phi$ (instead of the discriminator) which is trained to approximate $W$
			- they use weight clipping to enforce a Lipschitz constraint on the critic
			- it's not a competition anymore
				- the critic is trained to approximate the Wasserstein distance
				- the generator is trained to minimize it
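
The two-Dirac argument above can be checked numerically. The sketch represents discrete distributions on the real line as `{point: probability}` dictionaries and, as an assumption that only holds in one dimension, computes $W$ as the area between the two cumulative distribution functions instead of solving the infimum.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence of two discrete distributions (dicts)."""
    support = set(p) | set(q)

    def kl_to_mixture(a):
        total = 0.0
        for x in support:
            ax = a.get(x, 0.0)
            mx = 0.5 * (p.get(x, 0.0) + q.get(x, 0.0))
            if ax > 0:
                total += ax * math.log(ax / mx)
        return total

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def wasserstein_1d(p, q):
    """W_1 on the real line, computed as the area between the two CDFs."""
    points = sorted(set(p) | set(q))
    cdf_p = cdf_q = total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_p += p.get(left, 0.0)
        cdf_q += q.get(left, 0.0)
        total += abs(cdf_p - cdf_q) * (right - left)
    return total


for shift in [0.1, 1.0, 10.0]:
    p, q = {0.0: 1.0}, {shift: 1.0}  # two point masses `shift` apart
    print(shift, js_divergence(p, q), wasserstein_1d(p, q))
```

The JS column stays at $\log 2$ for every shift, so it gives the generator no signal about how far off it is, while the Wasserstein column equals the shift.
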

## Evaluation of Generative Models

- it's important but hard to evaluate the quality of generated samples
	- we need to know how well the models perform
	- but they can produce a wide range of outputs → it's hard to define a single evaluation metric capturing all aspects of quality
	- also, evaluation metrics may not align with human perception of quality
- objective metrics
	- precision, recall (for classification)
	- Inception Score (IS)
		- measures *diversity* and *quality* of the data
		- based on the Inception classification model
		- $IS(G_\theta)=\exp(\mathbb E_{x\sim p_\theta(x)}[D_{KL}(p(y|x)\|\int p(y|x)p_\theta(x)dx)])$
			- $p(y|x)$ is the class distribution predicted by Inception
		- the higher the better
	- Fréchet Inception Distance
		- measures the Wasserstein distance between the distribution of generated images and the distribution of real images in the feature space of a pretrained Inception model
			- extracts features from (provided) real images and generated images using a pretrained Inception model
			- FID score is computed based on the means and covariances of the features
				- $FID(G_\theta)=\|\mu_r-\mu_g\|^2+\mathrm{Tr}(\Sigma_r+\Sigma_g-2\sqrt{\Sigma_r\Sigma_g})$
		- assumes Gaussian distributions
	- comparison of IS and FID
		- FID is more robust to mode collapse
		- both rely on Inception model trained on ImageNet
			- trained for classification, might not reflect all the aspects of the image quality
			- the model might not well reflect the evaluated data
				- example: spectrograms
			- both scores depend on the pretrained model we choose
		- not suitable for non-image data
		- don't provide insights into the diversity or realism of generated samples
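
To get a feeling for the FID formula, here is a sketch for scalar features, where the trace term reduces to $\sigma_r^2+\sigma_g^2-2\sigma_r\sigma_g$. The "features" are random numbers standing in for Inception features; no pretrained model is involved, and the function name is illustrative.

```python
import math
import random
import statistics


def fid_1d(real, generated):
    """FID for 1-D features: (mu_r - mu_g)^2 + var_r + var_g - 2*sqrt(var_r*var_g)."""
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    var_r, var_g = statistics.pvariance(real), statistics.pvariance(generated)
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)


rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [v + 2.0 for v in real]   # same spread, wrong mean
collapsed = [0.0 for _ in real]     # roughly the right mean, no diversity

print(fid_1d(real, real))       # approximately 0
print(fid_1d(real, shifted))    # approximately 4 = 2^2
print(fid_1d(real, collapsed))  # clearly above 0: missing variance is penalized
```

The last case illustrates why FID reacts to mode collapse: a generator that always outputs the same feature is penalized through the covariance term.
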
- subjective evaluation
	- user studies and visual inspection
	- provide valuable insights
	- costly, subjective :)
	- might need specific user expertise
- hybrid alternative – mean opinion score
	- crowd-sourced evaluation technique where human evaluators rate the quality of generated samples on a scale (e.g. 1 to 5)
	- MOS network trained to predict the score
		- so the network is trained to estimate subjective criteria
		- can be used for non-image data

## Audio

- introduction
	- we sample the signal using frequency $F_s$
	- Nyquist-Shannon sampling theorem: we can only reconstruct content corresponding to frequencies $\lt F_s/2$
	- telephone voice effect – losing high-frequency details
		- speech signal can contain energy up to 20 kHz
		- most of energy within 0.3–3 kHz → phone standards sample at 8 kHz
	- discrete Fourier transform
		- outputs sequence of numbers describing the magnitude and phase at each frequency bin
		- computed using FFT
		- but energy for frequencies changes over time
	- short-time Fourier transform (STFT)
		- sliding window with a kernel
		- apply DFT to each segment
		- taking the modulus (magnitude) of the result gives the spectrogram; the phase is discarded
	- mel frequency scale
		- better matches human perception (compared to the linear scale)
		- mel-frequency cepstral coefficients (MFCC)
			- popular audio representations
		- usual pipeline to get MFCC: raw data → STFT → Mel scale → log → discrete cosine transform
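
A sketch of the mel scale and the Nyquist limit from the introduction above. The conversion formula $m=2595\log_{10}(1+f/700)$ is one common convention and an assumption here, since the notes do not specify which variant is used; the function names are illustrative.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def nyquist(fs: float) -> float:
    """Highest frequency that can be represented at sampling rate fs."""
    return fs / 2.0


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> list[float]:
    """Band edges equally spaced on the mel scale, returned in Hz."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / n_bands
    return [mel_to_hz(m_min + i * step) for i in range(n_bands + 1)]


# Telephone-quality audio: 8 kHz sampling, so content only up to 4 kHz.
edges = mel_band_edges(0.0, nyquist(8000.0), 8)
print([round(e) for e in edges])
```

The bands are equally wide in mel but get wider in Hz towards high frequencies, which mirrors the coarser frequency resolution of human hearing there.
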
- audio representations based on self-supervised learning
	- sometimes, part of the representation is not learned (classical audio representations are used)
		- using learned representations may be better – they can be tailored to specific data or have more general and reusable representations
	- wav2vec 2.0
		- masked speech in latent space (approach similar to masked language modeling)
		- architecture similar to STFT, but the transformation is learnt
			- CNN (to get latent representations based on raw waveform) + Transformer encoder (to get contextualized representations)
		- contrastive loss to train using masked prediction
			- some latent representations are masked
			- did the model predict the true latent? – cosine similarity
	- hidden-unit BERT (HuBERT)
		- extension of wav2vec 2.0
		- learns from unlabeled audio, predicts masked portions like BERT
		- idea: use clustering to generate pseudo-labels for audio segments, then train a model to predict those labels
			- pseudo-labels initialized using MFCC (and k-means)
		- fine-tuned for automatic speech recognition (ASR)
	- WavLM
		- extension of HuBERT
		- more robust learning objective, more data
		- idea: incorporate speech denoising as well as time/channel-wise masking in pre-training to improve robustness and generalization
		- strong performance on both speech recognition and speaker-related tasks (e.g. speaker verification, diarization = partitioning of audio according to speakers)
		- gated relative position bias
			- relative position bias – attention is affected by the distance between tokens (tokens close to each other should attend more)
			- learnable gate controls the influence of the position bias
- end-to-end approaches (audio → audio)
	- WaveNet
		- deep generative model for raw audio waveforms
		- unconditioned
		- idea: model the joint probability of an audio waveform as a product of conditional probabilities (using chain rule)
		- capturing long-range dependencies in an efficient way (using stacks of dilated causal convolutions)
			- output at time $t$ depends only on $x_{\lt t}$
			- exponentially growing receptive field (the dilation is doubled from layer to layer within a stack)
			- gated activation units and residual connections for stable training
		- applications: audio generation (e.g. speech synthesis) and enhancement; can be adapted for music or other sequential data
	- SampleRNN
		- similar to WaveNet
		- hierarchical structure
			- upper tiers summarize longer contexts
			- lower tiers generate fine-grained details
	- it's very difficult to capture very long dependencies (like 60 seconds)
- generating audio from intermediate representations
	- STFT is invertible but the reconstruction of the audio signal from the spectrogram is not immediate (we used modulus)
		- we need the phase of the signal
	- classical approach: Griffin-Lim
		- iterative algorithm to estimate the phase
		- exploits the redundancy between time-frames and between frequency bins of the STFT representation
		- widely used in classical speech synthesis and as a baseline
	- learning-based approach: HiFi-GAN
		- uses GAN to synthesize realistic waveforms conditioned on Mel-spectrograms
		- enables real-time speech synthesis with high perceptual quality
		- generator takes Mel-spectrogram as input, outputs raw audio
			- uses transposed convolutions and residual blocks to upsample and generate waveform samples
		- multi-scale discriminators – operate at different resolutions of the waveform
		- multi-period discriminators – focus on periodic patterns in speech
	- Tacotron
		- end-to-end TTS model
		- maps character or phoneme sequences to Mel-spectrograms
		- no need for hand-crafted linguistic features
		- typically used together with WaveNet or HiFi-GAN to generate the final waveform from the predicted spectrogram
		- encoder + decoder
		- uses attention to map text position to audio frames
		- L1 loss
	- AnCoGen
		- masked-modeling-based model
		- idea: map the spectrogram to attributes (pitch, SNR, reverberation, content, …)
		- enables control over the audio attributes
		- ratio – used for masking
			- (0,1) → audio non-masked, attributes masked
			- masking can be partial (e.g. 0.7)
		- is combined with a neural vocoder (HiFi-GAN) to generate the final waveform from the predicted spectrogram

## Diffusion

- basic division of generative models
	- explicit density
		- tractable density → autoregressive
		- approximate density → VAE
	- implicit density
		- direct sampling → GAN
		- indirect sampling → diffusion
- basic concepts
	- Brownian motion – continuous random movement of a particle, with increments that are Gaussian and independent
	- diffusion process – stochastic system whose evolution is governed by Brownian motion
	- diffusion (in image generation) – we add noise
	- the goal of the model (DDPM, denoising diffusion probabilistic model) is to remove the noise
- forward diffusion process (fixed) – start with data $x_0$, gradually add Gaussian noise in $T$ steps
	- $q(x_t|x_{t-1})=\mathcal N(x_t;\sqrt{1-\beta_t}x_{t-1},\beta_tI)$
- reverse denoising process (generative)
	- learn $p_\theta(x_{t-1}|x_t)$ to denoise
	- $p_\theta(x_{t-1}|x_t)=\mathcal N(x_{t-1};\mu_\theta(x_t,t),\Sigma_\theta(x_t,t))$
- training
	- the model should estimate a noise vector $\epsilon\in\mathbb R^n$ from a given noise level $\sigma\gt 0$ and noisy input $x_\sigma\in\mathbb R^n$ s.t. for some $x_0$ in the data manifold $\mathcal K$ it holds that $x_\sigma\approx x_0+\sigma\epsilon$
	- a denoiser $\epsilon_\theta:\mathbb R^n\times\mathbb R_+\to\mathbb R^n$ is learned by minimizing $L(\theta):=\mathbb E_{x_0,\sigma,\epsilon}\|\epsilon_\theta(x_0+\sigma\epsilon,\sigma)-\epsilon\|^2$
		- $x_0$ sampled from training data
		- $\sigma$ sampled from a training noise schedule
			- in practice, noise levels $\sigma$ range from 0.01 to 100
		- $\epsilon$ sampled from $\mathcal N(0,I_n)$
	- we are trying to find an ideal *denoiser* $\epsilon^*$ that minimizes $L(\theta)$
		- for finite $\mathcal K$, there is a closed-form solution
			- $\epsilon^*(x_\sigma,\sigma)=\frac{\sum_{x_0\in\mathcal K}(x_\sigma-x_0)\exp(-\|x_\sigma-x_0\|^2/2\sigma^2)}{\sigma\sum_{x_0\in\mathcal K}\exp(-\|x_\sigma-x_0\|^2/2\sigma^2)}$
		- starting point: the minimizer of the squared error is the conditional expectation $\epsilon^*(x_\sigma,\sigma)=\mathbb E[\epsilon\mid x_\sigma,\sigma]$
		- steps
			- replace $\epsilon$ by the forward noise relation $x_\sigma=x_0+\sigma\epsilon\implies\epsilon=\frac{x_\sigma-x_0}{\sigma}$
				- so we get $\epsilon^*(x_\sigma,\sigma)=\mathbb E[\frac{x_\sigma-x_0}{\sigma}\mid x_\sigma,\sigma]=\frac1\sigma(x_\sigma-\mathbb E[x_0\mid x_\sigma,\sigma])$
				- and $\mathbb E[x_0\mid x_\sigma,\sigma]=\sum_{x_0\in\mathcal K} x_0\cdot p(x_0\mid x_\sigma,\sigma)$
			- posterior $p(x_0\mid x_\sigma,\sigma)$
				- forward step (likelihood): $p(x_\sigma\mid x_0,\sigma)\propto\exp(-\frac{\|x_\sigma-x_0\|^2}{2\sigma^2})$
					- equal up to a constant factor (it gets canceled out in the following formula)
				- Bayes: $p(x_0\mid x_\sigma,\sigma)=\frac{p(x_\sigma\mid x_0,\sigma)p(x_0)}{\sum_{x'_0\in\mathcal K} p(x_\sigma\mid x'_0,\sigma)p(x'_0)}=\frac{\exp(-\frac{\|x_\sigma-x_0\|^2}{2\sigma^2})}{\sum_{x'_0\in\mathcal K}\exp(-\frac{\|x_\sigma-x'_0\|^2}{2\sigma^2})}$
					- because $p(x_0)=\frac1{|\mathcal K|}$
			- so $\mathbb E[x_0\mid x_\sigma,\sigma]= \frac{\sum_{x_0\in\mathcal K} x_0\cdot\exp(-\frac{\|x_\sigma-x_0\|^2}{2\sigma^2})}{\sum_{x'_0\in\mathcal K}\exp(-\frac{\|x_\sigma-x'_0\|^2}{2\sigma^2})}$
			- and $\epsilon^*(x_\sigma,\sigma)=\frac{\sum_{x_0\in\mathcal K} (x_\sigma-x_0)\cdot\exp(-\frac{\|x_\sigma-x_0\|^2}{2\sigma^2})}{\sigma\cdot \sum_{x'_0\in\mathcal K}\exp(-\frac{\|x_\sigma-x'_0\|^2}{2\sigma^2})}$
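
The closed-form ideal denoiser derived above can be evaluated directly for a tiny data set. This is a sketch with a made-up two-point $\mathcal K$ in one dimension; subtracting the largest exponent before exponentiating is a numerical-stability choice that cancels in the ratio.

```python
import math


def ideal_denoiser(x_sigma, sigma, dataset):
    """epsilon*(x_sigma, sigma) for a finite dataset with uniform prior."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(x_sigma, x0)) for x0 in dataset]
    logits = [-d / (2.0 * sigma ** 2) for d in sq_dists]
    m = max(logits)                              # cancels between numerator and denominator
    weights = [math.exp(l - m) for l in logits]
    norm = sum(weights)
    return [
        sum(w * (x_sigma[c] - x0[c]) for w, x0 in zip(weights, dataset)) / (sigma * norm)
        for c in range(len(x_sigma))
    ]


dataset = [[-1.0], [1.0]]  # hypothetical data manifold K

# Low noise, close to the point +1: the denoised estimate snaps to that point.
x_sigma, sigma = [1.05], 0.1
eps = ideal_denoiser(x_sigma, sigma, dataset)
x0_hat = x_sigma[0] - sigma * eps[0]
print(x0_hat)

# Exactly between the two points: the posterior mean of x_0 is 0, so epsilon* = 0.
print(ideal_denoiser([0.0], 1.0, dataset))
```
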
- common model architectures
	- convolutional U-nets
	- patch-wise transformers
- reverse denoising process – sampling
	- the learned denoiser $\epsilon_\theta(x_\sigma,\sigma)$ estimates $\hat x_0=x_\sigma-\sigma\epsilon_\theta(x_\sigma,\sigma)$
	- sampling is a *for loop*: we denoise the data in several steps
	- DDIM (denoising diffusion *implicit* model) × DDPM (*probabilistic*)
		- deterministic (DDIM) update: $x_{t-1}=x_t+(\sigma_{t-1}-\sigma_t)\epsilon_\theta(x_t,\sigma_t)$
		- probabilistic (DDPM) update: $x_{t-1}=x_t+(\sigma_{t'}-\sigma_t)\epsilon_\theta(x_t,\sigma_t)+\eta w_t$ with fresh Gaussian noise $w_t\sim\mathcal N(0,I)$
		- DDPM is derived from the true reverse diffusion (a stochastic differential equation / SDE)
			- we need to add noise proportional to uncertainty in the denoising steps
		- DDIM replaces the SDE with a probability-flow ODE, which has no diffusion term, so the evolution is deterministic
		- they share the same deterministic mean; DDPM differs by the Gaussian noise scaled by uncertainty
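
A sketch of the deterministic (DDIM) sampling loop, $x_{t-1}=x_t+(\sigma_{t-1}-\sigma_t)\epsilon_\theta(x_t,\sigma_t)$. Instead of a trained network it uses a toy denoiser that is exact for a data set consisting of the single point `TARGET`; the noise schedule and all names are assumptions made for illustration.

```python
def ddim_sample(denoiser, x_start, sigmas):
    """Run the deterministic update over decreasing noise levels sigma_T > ... > sigma_0."""
    x = list(x_start)
    for sigma_t, sigma_prev in zip(sigmas, sigmas[1:]):
        eps = denoiser(x, sigma_t)
        x = [xi + (sigma_prev - sigma_t) * ei for xi, ei in zip(x, eps)]
    return x


TARGET = [0.5, -1.5]  # hypothetical one-point "dataset"


def toy_denoiser(x, sigma):
    """Exact noise estimate when the only possible clean sample is TARGET."""
    return [(xi - ti) / sigma for xi, ti in zip(x, TARGET)]


# Geometric schedule from sigma = 10 downwards, ending at sigma_0 = 0.
sigmas = [10.0 * (0.5 ** i) for i in range(12)] + [0.0]
# Starting point x_T = x_0 + sigma_T * eps for some fixed noise vector.
x_T = [TARGET[0] + 10.0 * 0.3, TARGET[1] + 10.0 * -1.1]
sample = ddim_sample(toy_denoiser, x_T, sigmas)
print(sample)  # ends at TARGET (up to rounding)
```

With this denoiser each step shrinks the remaining offset by the factor $\sigma_{t-1}/\sigma_t$, so the trajectory follows the noise schedule down to the data point. The stochastic DDPM variant would add the $\eta w_t$ term inside the loop.
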
- flow matching models vs. diffusion models
	- in flow matching models, we are trying to get a function which maps from one distribution to another
	- so we need fewer sampling steps

## Representation Learning

- types of learning
	- supervised – training data + desired outputs (labels)
	- unsupervised – unlabeled data
	- semi-supervised – training data + a few desired outputs
- unsupervised/representation learning – useful if we don't have enough annotations
- initial approach: pretraining (e.g. ImageNet) & fine-tuning
	- to fine-tune, we drop the last weight matrix with dimension $f\times 1000$ and replace it with a matrix with dimension $f\times c$ where $c$ is the desired number of classes ($f$ … number of features)
	- problems
		- not optimal for every problem (e.g. video, medical)
		- humans don't need ImageNet pretraining
	- solution
		- replace ImageNet pre-training by an unsupervised training (representation learning)
- generation-based methods
	- autoencoders
		- learn features from which the original data can be reconstructed
		- input data $x$ → encoder → features $z$ → decoder → reconstructed input data $\hat x$
			- $z$ typically has fewer features than $x$
			- we minimize $\|x-\hat x\|^2$
		- the encoder learns the representation
	- if we have a large unlabeled dataset and a small annotated dataset, we can use an encoder or a GAN to initialize a supervised model
	- limitations
		- features not trained to discriminate
		- limited performance
		- additional computation cost (decoder or generator)
	- solution: self-supervised learning (SSL)
- self-supervised learning – supervision comes from the data (no need to annotate)
	- pretext task 
		- we don't care about this specific task but it helps the model to learn the representations
		- e.g. relative patch prediction
			- but it's not that easy
				- color distortion (chromatic aberration of the lens) helps the model cheat the task
				- solution: drop two channels, replace by Gaussian noise
		- another task: solving jigsaw puzzles
			- to make it easier, we can subset 1000 permutations and only train the classifier on them
		- other tasks
			- colorization
			- rotation prediction
			- super-resolution
	- contrastive learning
		- goal: to learn features that are discriminative among instances
		- but we would need too many classes (one for each instance in the training dataset) – we need non-parametric softmax & memory bank
			- memory bank contains feature representations of all images in the dataset
		- invariant information clustering – maximizing the mutual information between encoded variables
		- SimCLR
			- we have two images $A,B$, apply two random transformations to each of them
			- so we get four images $A_1,A_2,B_1,B_2$, we want to maximize agreement between the ones based on the same image (e.g. $A_1,A_2$) and minimize agreement between the ones based on different images (e.g. $A_1,B_1$)
			- agreement defined as cosine similarity (pairwise)
		- MoCo
			- instead of end-to-end learning or memory bank, we use a momentum encoder
			- its parameters are a moving average: we mix the previous parameters of the momentum encoder with the current parameters of the trained encoder
		- SimSiam
		- DINO
			- vision transformer (ViT)
				- linear projection of flattened 16×16 patches + *learned* position embedding
				- additional classifier token
			- two networks: student and teacher
			- momentum teacher as in MoCo
			- segmentation emerges
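
The SimCLR idea of maximizing agreement between views of the same image can be sketched as below. The notes only state that agreement is cosine similarity; turning it into a loss through a temperature-scaled softmax over one positive and several negatives is an assumption here (a simplified form of the usual contrastive objective), and the 2-D feature vectors are made up.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """-log softmax of the positive pair's similarity among all candidates."""
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


# Hypothetical features of the four views A1, A2 (image A) and B1, B2 (image B).
a1, a2 = [1.0, 0.1], [0.9, 0.2]
b1, b2 = [-0.2, 1.0], [0.1, 0.9]

good = contrastive_loss(a1, a2, [b1, b2])  # positive is the other view of A
bad = contrastive_loss(a1, b1, [a2, b2])   # wrongly treating B1 as the positive
print(good, bad)
```

The loss is small when the two views of the same image are the most similar pair, and large otherwise.
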
- what is a good representation? – we need robustness (to handle domain shift)
	- appearance changes due to different sensors – infrared vs. normal camera
	- use of synthetic data – synthetic datasets may be cheaper to make
	- unseen scenarios (e.g. natural disasters)
	- biased datasets
- unsupervised domain adaptation (DA)
	- source and target distributions, we want them to have similar representations
	- e.g. we trained the model on labeled photos, we want it to handle (unlabeled) cartoon images
	- approaches
		- discrepancy-based method
			- use maximum mean discrepancy (MMD) to align the distributions
		- alignment layers
			- idea: learn domain-agnostic representation by adjusting the network architecture
			- batch normalization
		- adversarial-based methods
			- employ an adversarial objective to ensure that the network cannot distinguish between the source and target domains
		- adaptation through translation
			- train model which can translate between domains
- in some contexts, discrete representations may be useful
	- VQ-VAE = VAE with vector quantization
	- vector quantization maps a vector from a continuous space to a vector from a dictionary (codebook)
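
Vector quantization as described above is a nearest-neighbour lookup in the codebook. A minimal sketch, assuming squared Euclidean distance and a made-up three-entry codebook:

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]  # hypothetical dictionary
index, code = quantize([0.9, 1.2], codebook)
print(index, code)  # 1 [1.0, 1.0]
```

The discrete representation is the index; the decoder of a VQ-VAE would receive the corresponding codebook vector.
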

## Image Generation

- variational autoencoders (VAEs)
	- encoder (predicts distribution in latent space) + decoder (predicts distribution in feature space)
	- we want the latent space to be close to Gaussian
		- that's what KL divergence term does
	- dimensions in latent space may correspond to some properties of the objects in the image
	- we can do linear interpolation – we encode two images, “mix” them (in some ratio), then decode
- GANs
	- problem: want to sample from complex, high-dimensional training distribution (no direct way to do this!)
	- solution: sample from a simple distribution (e.g. random noise) & learn transformation to training distribution
		- minimax objective function
		- alternate between gradient ascent on discriminator and gradient descent on generator
		- in practice: instead of minimizing likelihood of discriminator being correct, we maximize likelihood of discriminator being wrong (higher gradient signal for bad samples → works better)
	- Progressive GAN – training layer by layer (we start by training simple small layers, then add larger layers)
	- BigGAN
	- style transfer
		- we want to take content from one image and style from the other one
		- we don't want to transfer only color but also brush strokes
			- we don't change the structure of the original image, we change statistical properties of its patches (to get different style)
			- that's what AdaIN normalization does
				- $\mathrm{AdaIN}(x,y)=\sigma(y)(\frac{x-\mu(x)}{\sigma(x)})+\mu(y)$
		- architecture: VGG encoder → normalization tricks → decoder
			- to compute loss, the VGG encoder needs to be used again on the result and the “style” image
	- style-based GAN
		- traditional approach: latent vector comes from the source image
		- style-based GAN starts with learned constant tensor, adds noise and style (in each layer) by predicting scale and shift
			- we swap source images at some point in the process to get the mix of style and content
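
The AdaIN formula from the style-transfer notes can be tried on a single feature channel. In this sketch the channel is a flat list of activations (in a real network the statistics are taken per channel over the spatial positions); the small `eps` is an added assumption to avoid division by zero.

```python
import statistics


def adain(content, style, eps=1e-5):
    """sigma(y) * (x - mu(x)) / sigma(x) + mu(y) for one feature channel."""
    mu_x, sigma_x = statistics.fmean(content), statistics.pstdev(content)
    mu_y, sigma_y = statistics.fmean(style), statistics.pstdev(style)
    return [sigma_y * (v - mu_x) / (sigma_x + eps) + mu_y for v in content]


content = [1.0, 2.0, 3.0, 4.0]    # hypothetical content activations
style = [10.0, 20.0, 10.0, 20.0]  # hypothetical style activations
stylized = adain(content, style)
print(stylized)
print(statistics.fmean(stylized), statistics.pstdev(stylized))  # about 15 and 5
```

The ordering of the content activations (the structure) is preserved, while mean and standard deviation (the statistics) are taken from the style.
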
- image-to-image translation
	- goal: translate image from one representation to another
		- edges (drawing) → photo
		- labels → street scene
		- BW → color
		- aerial → map
		- day → night
	- Pix2Pix
		- use GAN, discriminator gets both images (we want the generated images to be both plausible and to correspond to the original image)
		- generator is just autoencoder (encoder + decoder)
			- convolution & deconvolution
			- U-Net uses skip connections from the encoder to the decoder (not everything has to be encoded in the latent space)
				- works better
	- example: generating image based on segmentation
		- we can use a trained segmentation model to segment the generated image
		- then, we can apply metrics used for image segmentation evaluation
	- smarter discriminator
		- instead of predicting only one score (on the scale from real to fake), we can predict multiple scores (one for each region of the image)
		- this doesn't work for too small regions – “is this pixel realistic?” is not a good question (the discriminator cannot see patterns, only colors of individual pixels)
	- we assume we have access to $p(x,y)$ and train model to sample $y\sim p(y\mid x)$ or $x\sim p(x\mid y)$
	- but we don't always have $p(x,y)$ → *unpaired image-to-image generation*
		- example: you may have many images of horses and many images of zebras, but never a pair of corresponding images
		- CycleGAN
			- uses both GAN losses and *cycle-consistency loss*
				- if we generate zebra based on a horse, we want to be able to generate horse based on the zebra and get the same horse as before
			- based on ResNet (not U-Net)
		- let's have a shared latent space!
			- so we have two encoders (one for zebra, one for horse) and two decoders that share the same latent space
			- weights are shared between encoders
		- geometry-consistency
			- we check how well the model works for transformed images (we then invert the transformation and compare the output with the result for the untransformed image)
			- GcGAN
	- high resolution images
		- Pix2PixHD
		- we don't want to use many layers – you lose information
		- architecture similar to style-based GAN
		- struggles with uniform surfaces
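
The cycle-consistency idea from the CycleGAN notes above, as a minimal sketch. The L1 distance and the toy affine "generators" are assumptions for illustration only; in practice both generators are trained networks and this term is added to the GAN losses.

```python
import numpy as np

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """x -> Y -> X should give back x, and y -> X -> Y should give back y."""
    forward = np.abs(g_yx(g_xy(x)) - x).mean()
    backward = np.abs(g_xy(g_yx(y)) - y).mean()
    return forward + backward

# Hypothetical toy generators (not real models)
def to_zebra(img):
    return 2.0 * img + 1.0

def to_horse(img):
    return (img - 1.0) / 2.0      # exact inverse of to_zebra

def bad_to_horse(img):
    return img                    # ignores the mapping entirely

rng = np.random.default_rng(0)
horses = rng.random((2, 3, 4, 4))
zebras = rng.random((2, 3, 4, 4))
good = cycle_consistency_loss(horses, zebras, to_zebra, to_horse)
bad = cycle_consistency_loss(horses, zebras, to_zebra, bad_to_horse)
```

A pair of generators that invert each other gets a loss close to zero, while the inconsistent pair is penalized.
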
- video generation
	- we need temporal consistency
	- we could do 3D convolution instead of 2D convolution
		- but we would need a lot of data
	- we could consider static background and moving objects
		- so we generate static background (image) and two videos – foreground and mask (ratio for mixing the foreground and background)
	- let's generate a trajectory of vectors in latent space we can then pass to a decoder
		- limitations
			- fixed-length videos only
			- no control over motion and content
		- MoCoGAN
		- DVDGAN
	- video-to-video translation
	- animating single subject – latent space with human pose
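
The static background + foreground + mask decomposition from the video generation notes above can be sketched as a per-pixel blend. The array shapes and the toy moving strip are assumptions for illustration; in a real model all three inputs are produced by generator networks.

```python
import numpy as np

def composite_video(background, foreground, mask):
    """background: (H, W, 3), foreground: (T, H, W, 3), mask: (T, H, W, 1) in [0, 1].

    The mask is the mixing ratio: 1 shows the foreground, 0 the background.
    """
    return mask * foreground + (1.0 - mask) * background[None]

T, H, W = 4, 8, 8
background = np.zeros((H, W, 3))        # one static image
foreground = np.ones((T, H, W, 3))      # one frame per time step
mask = np.zeros((T, H, W, 1))
for t in range(T):
    mask[t, :, t:t + 2] = 1.0           # a 2-pixel strip moving to the right
video = composite_video(background, foreground, mask)
```
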
- neural radiance fields (NeRF)
	- estimate “shape” of an object based on several photos
	- can render novel views

### Diffusion models

- we address generation as a denoising problem
- similar to GAN, we start with a distribution easy to sample (Gaussian) and get a distribution we want (but we do it in multiple steps)
- we estimate mean of the next distribution
- we can combine multiple steps of adding noise just into one step
- instead of predicting the image, we predict the noise
	- it's easier as the variance is fixed (we can focus on predicting mean)
	- also, the image changes over time – the noise does not (?)
- we can use simpler loss formula even though there's no theoretical explanation for it
	- $L_t=\mathbb E_{t\sim[1,T],x_0,\epsilon_t}[\|\epsilon_t-\epsilon_\theta(x_t,t)\|^2]$
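
A minimal sketch of the two ideas above: jumping to step $t$ with a single noising step, and the simplified noise-prediction loss. The linear $\beta$ schedule, $T=1000$, and the `eps_model` callable are assumptions for illustration; the notes do not specify them.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # product of (1 - beta) up to each step

def q_sample(x0, t, eps):
    """Noise x0 directly to step t (0-indexed) instead of applying t small steps."""
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def simple_loss(eps_model, x0, rng):
    """One Monte Carlo sample of the simplified loss.

    Uses the mean instead of the squared norm, which differs only by a constant factor.
    """
    t = rng.integers(0, T)               # uniform time step
    eps = rng.standard_normal(x0.shape)  # the noise the model should predict
    x_t = q_sample(x0, t, eps)
    return np.mean((eps - eps_model(x_t, t)) ** 2)

rng = np.random.default_rng(0)
x0 = rng.random((3, 16, 16))
# Placeholder "model" that always predicts zero noise
loss_zero = simple_loss(lambda x_t, t: np.zeros_like(x_t), x0, rng)
```

A real denoiser would be a trained network taking `x_t` and `t`; the zero predictor only shows the interface.
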
- training and sampling algorithms
	- we want the results to follow a distribution → we add some randomness according to the variance
- we want the distribution to be conditional
	- first approach: classifier guidance
		- we have a diffusion model $P(x)$
		- we have a classifier $P(y\mid x)$
		- we want to be able to sample $P(x\mid y)$
		- idea: instead of just denoising, we also move in the direction that makes the probability $P(y\mid x)$ higher
		- but we need to compute the gradient of the classifier (using backpropagation) – high computational cost
	- second approach: classifier-free guidance
		- the noise model is trained with $y$ in mind
		- we also use the unconditional denoiser and mix the two predictions – the guidance weight balances diversity and fidelity
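
Classifier-free guidance mixes the conditional and unconditional noise predictions at sampling time. The sketch below uses one common parameterization of the guidance weight `w`; the exact convention varies between papers, so treat it as an assumption.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, w):
    """Extrapolate from the unconditional towards the conditional prediction.

    w = 0: unconditional, w = 1: purely conditional, w > 1: condition is emphasized.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_cond = np.array([1.0, 2.0, 3.0])     # stand-in for eps_theta(x_t, t, y)
eps_uncond = np.array([0.0, 1.0, 1.0])   # stand-in for eps_theta(x_t, t, no condition)
eps_guided = guided_noise(eps_cond, eps_uncond, w=2.0)
```

Unlike classifier guidance, no gradient of a separate classifier is needed; the cost is two forward passes of the denoiser per step.
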
- latent diffusion models
	- problem: to generate high-resolution images, we need to start from high-resolution noise and it takes many denoising steps (→ computational cost)
		- we could use distillation
		- or we can project the images into a lower-dimensional *latent space*
	- let's have an encoder and a decoder
	- we consider a diffusion model in the latent space
		- U-Net used for denoising
		- cross-attention to apply conditioning (in the U-Net)
			- conditioning is in a single vector $C$
- diffusion transformers (DiT)
	- conditioning is used to predict scale and shift (similar to StyleGAN)
- 2D → 3D latent space
	- denoiser for video gets very computationally intensive if we want to attend everywhere
	- instead, we consider separate spatial and temporal layers
- why it changes everything
	- GANs worked well only on datasets with limited diversity
	- fine-grained control with text
	- we have a general-purpose image prior
		- we can start with pretrained large models and use transfer learning for specific tasks
		- one training of a large model costs 600 000 euros
- how can we reuse a pretrained diffusion model so that we can condition using spatial data (a sketch…)
	- problem: how do we fit all the spatial information into a single vector $C$?
	- ControlNet – encoder with skip connections to the U-Net
	- inpainting
		- we want to put a specific object in the image
		- idea: we add noise to the whole image and let it generate with a conditioning
		- to make sure that the rest of the image does not change, we can replace the rest of the image with the original image (+ noise) in every step of denoising
			- we use a mask for that
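
The masked replacement used for inpainting can be sketched as one blending operation applied after every denoising step. The function name, the convention that `mask == 1` marks the region to regenerate, and the `alpha_bar_t` argument (cumulative signal level of the current step) are assumptions for illustration.

```python
import numpy as np

def blend_known_region(x_t, original, mask, alpha_bar_t, rng):
    """Keep the generated content inside the mask, restore the original elsewhere.

    The original image is first noised to the current noise level,
    so both parts of the blend have matching statistics.
    """
    noise = rng.standard_normal(original.shape)
    known = np.sqrt(alpha_bar_t) * original + np.sqrt(1.0 - alpha_bar_t) * noise
    return mask * x_t + (1.0 - mask) * known

rng = np.random.default_rng(0)
original = rng.random((8, 8))
x_t = rng.standard_normal((8, 8))        # current sample of the denoising process
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0                     # region where the new object should appear
blended = blend_known_region(x_t, original, mask, alpha_bar_t=0.5, rng=rng)
```
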
- (other slides skipped)
- personalized text-to-image
	- example: I want to generate something based on this specific (real) statue
	- how can I describe this specific object using an embedding vector?
		- has to be learnt
	- DreamBooth
		- problem: if I finetune the model using photos of my dog standing, I will only get results with my dog standing (not sitting)
		- so I use a specific loss (prior preservation) that ensures the generated diversity is similar to the diversity of real dog poses

## Multimodal Learning

- multimodal learning
	- many research questions
	- many other modalities than just video and audio
		- text, lidar, thermal, events, …
	- interactions between modalities
- sensor fusion – using information from diverse sensors to make predictions
	- types
		- camera + depth sensor → RGB-D object detection
		- RGB + thermal
		- RGB + optical flow (object moving in the video)
	- we have aligned inputs; when to perform fusion?
		- early fusion – concat, then pass to the model
		- late fusion – two models, jointly predict
			- easier to train (we don't need that much paired data)
			- can be run in parallel
		- middle fusion – two models (with shared weights), then fuse features and pass to third model
		- another approach: learn when to perform fusion (siamese network)
	- using ViT with two modalities
		- either pass shorter sequence of pairs → early fusion
		- or pass two sequences (so the entire sequence is longer) → late fusion
			- worked better
	- RGB + lidar detection
		- advantages
			- effective in low light and some adverse weather
			- robust in low-texture areas
			- penetrates dense foliage (vegetation) – for satellite imagery
			- long range
		- lidar returns a point cloud (points detected in 3D)
		- approaches
			- first find object in image, then use lidar points (sequential)
			- or we can use late fusion
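
Early and late fusion differ only in where the modalities are combined. The sketch below uses random linear classifiers as hypothetical stand-ins for real networks, and averaging of class probabilities as one possible way to "jointly predict"; both are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
rgb = rng.random((2, 3, 8, 8))       # (B, C, H, W)
depth = rng.random((2, 1, 8, 8))     # aligned depth map
n_classes = 5

def linear_classifier(x, weights):
    """Hypothetical stand-in for a network: flatten and project to class logits."""
    return x.reshape(x.shape[0], -1) @ weights

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Early fusion: concatenate channels, then a single model sees RGB-D input
rgbd = np.concatenate([rgb, depth], axis=1)
w_early = rng.normal(size=(4 * 8 * 8, n_classes))
p_early = softmax(linear_classifier(rgbd, w_early))

# Late fusion: one model per modality, predictions are combined at the end
w_rgb = rng.normal(size=(3 * 8 * 8, n_classes))
w_depth = rng.normal(size=(1 * 8 * 8, n_classes))
p_late = 0.5 * (softmax(linear_classifier(rgb, w_rgb))
                + softmax(linear_classifier(depth, w_depth)))
```

In the late-fusion variant the two models never see each other's input, which is why they can be trained separately and run in parallel.
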
- multimodal translation (from one modality to another)
	- I2T: image captioning – we condition text on some visual observation
		- first RNNs with LSTM, then Transformer (attention-based approaches)
	- V2T: lip reading
	- T2I: text-to-image generation
	- A2V: audio to video
	- ASR: speech recognition – not generative (???)
	- TTS: text-to-speech
- hybrid tasks
	- visual question answering
		- projecting the image and the question in the same vector space
			- image processed by CNN
			- question processed by CNN/LSTM
		- attention layers – which part of the image should I look at?
	- lip reading
		- seq2seq with attention – which time should I look at to predict the next word? (predicting alignment between text and audio/video frames)
- multimodal alignment (identifying and modeling correspondences)
	- ImageNet
		- hard to scale up
		- vision is not only about classes
		- limited robustness to distribution shifts
		- adaptation to other tasks (new classes) requires further training
	- zero-shot classification: CLIP
		- frame the problem as an image-caption matching problem
		- captions
			- easier to get than classes
			- contain semantic, geometric, and stylistic information
			- multi-object images
			- collected 400M pairs (312× more than ImageNet)
		- contrastive pre-training
			- captions encoded using transformer
			- images encoded by ViT or ResNet
			- for each image, probability (softmax) of every possible caption (and vice versa)
				- loss function – maximize likelihood of predicting correct text for the image and correct image for the text
		- can then be used for zero-shot classification
			- user-defined classes can be expressed as captions: “a photo of a {object}.”
		- can also be used as a search engine
			- you compute the similarity between the provided caption and the images in your database
		- can serve as a building block for other systems (prompt engineering – “software 3.0”)
		- CLIP can also be used for audio classification
	- ImageBind
		- based on contrastive loss
		- paired data of many different modalities: images, videos, text, audio, depth, thermal, IMU (movement); each modality is paired with images, which act as the common anchor
		- connecting modalities which were not connected before
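
The contrastive pre-training objective and the zero-shot use of CLIP described above can be sketched as follows. The embeddings are random stand-ins for encoder outputs, and the fixed `temperature` is an assumption (in CLIP it is a learned parameter).

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an (N, N) similarity matrix.

    Row i / column i is the matching image-caption pair.
    """
    logits = normalize(image_emb) @ normalize(text_emb).T / temperature
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # correct caption per image
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # correct image per caption
    return 0.5 * (loss_i2t + loss_t2i)

def zero_shot_classify(image_emb, class_text_emb):
    """Pick the class whose caption embedding is most similar to the image."""
    return (normalize(image_emb) @ normalize(class_text_emb).T).argmax(axis=1)

rng = np.random.default_rng(0)
images = rng.normal(size=(8, 16))
aligned_loss = clip_loss(images, images)                         # perfectly matching pairs
shuffled_loss = clip_loss(images, np.roll(images, 1, axis=0))    # every pair mismatched
```

The same similarity matrix also covers the search-engine use case: embed the query caption once and rank the stored image embeddings by similarity.
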
- masked language model vs. (causal) language model
	- masked language model – BERT, bidirectional
	- causal language model – GPT, unidirectional
- training LLM
	- pre-training → instruction fine-tuning → reinforcement learning from human feedback (RLHF)
	- having several copies of LLMs fine-tuned for different tasks is expensive
	- alternative approach: prefix-tuning
		- freeze the weights
		- train new prefix embeddings (added at the beginning of the sequence) that optimize the behavior of the network for the specific task
		- similar to prompt engineering (“You are an AI agent, be kind to the user.”)
			- but here, we use gradient descent to find tokens which work the best (they don't have to correspond to existing words)
	- bidirectional vs. causal (unidirectional) attention
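		- bidirectional models cannot be used directly for autoregressive generation (they are trained to fill in masked tokens using context from both sides)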
		- unidirectional uses masked self-attention
			- → can generate, is more compute efficient, has good modeling capacities
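
The difference between bidirectional and causal attention is a single mask on the score matrix. A minimal single-head sketch, assuming no learned projections (queries, keys, and values are the raw inputs):

```python
import numpy as np

def attention(q, k, v, causal):
    """Scaled dot-product attention; q, k, v have shape (T, d)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    if causal:
        # position i must not look at positions j > i
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
out_causal, w_causal = attention(x, x, x, causal=True)
out_bidir, w_bidir = attention(x, x, x, causal=False)
```

With the causal mask, the output at position $i$ does not change when later tokens are appended, which is what makes token-by-token generation possible.
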
- multimodal LLMs
	- VisualBERT (bidirectional)
		- image (split using bounding boxes by an object detector) + caption
		- masking words in the caption
		- objective 1: predict masked words
		- objective 2: predict if the image matches the caption or not
		- downstream task: visual question answering
			- `[mask]` token is appended to the question (→ answer is predicted by the model)
			- VQA is considered as a classification problem
	- unidirectional MLLM (encoder + decoder)
		- encoder gets image and beginning of the sentence
			- image is first split into patches and processed by convolution
			- bidirectional attention
		- decoder continues the sentence
	- decoder-only
		- vision encoder trained using the task of next token prediction
			- produces encoded representations of the image
			- used as prefix for LLM (or even as a part of the input – anywhere)
		- LLM frozen – why?
			- cost of training
			- contains a lot of useful knowledge we don't want to lose
		- it combines perception of vision encoder and reasoning capacity of LLM
		- alternative approach: use CLIP instead of encoder training (to get the embeddings)
	- Flamingo
		- images are removed from the text and replaced by placeholders
		- then, images are provided using gated cross-attention
			- skip connections make sure that the model preserves its pre-trained abilities even after modification (at the beginning of the fine-tuning phase – with the initial parameters for the new blocks in the architecture)
			- the text cross-attends only to the last preceding image
				- but the self-attention layer still ensures everyone sees everything
	- Socratic Models
		- idea: convert all modalities into text
		- then perform reasoning in text form
		- but there are things hard to describe with text
	- training conversational agent for visual question answering
		- hard to get data
		- LLaVA architecture
			- frozen vision encoder
			- frozen LLM
			- *trained* projection is applied to the output of the vision encoder (before passing these “tokens” to the LLM)
	- Visual LLM
		- Qwen
		- supports audio, video, and text
		- uses vision and audio encoders
		- we need positional embeddings
			- rotary position embedding (RoPE)
			- multiple rotations happening at once (a vector of dimension $n$ is split into $n/2$ pairs and each pair is rotated by a different angle that depends on the position)
		- for videos, we use M-RoPE
			- rotary embedding decomposed into temporal, width, and height component
		- best open-source model
		- speech synthesis in an autoregressive manner
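
A minimal sketch of rotary position embedding (RoPE) for a 1D sequence, as described above. Pairing consecutive dimensions and the frequency base of 10000 are common choices but assumptions here; M-RoPE, which splits the rotation into temporal, height, and width components, is not shown.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """x: (T, n) with n even, positions: (T,).

    Each pair (x[2i], x[2i+1]) is rotated by the angle
    position * base**(-2i / n), so every pair rotates at a different speed.
    """
    n = x.shape[-1]
    freqs = base ** (-np.arange(0, n, 2) / n)        # (n/2,)
    angles = positions[:, None] * freqs[None, :]     # (T, n/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))
# The query-key dot product depends only on the relative offset (here 2)
score_a = rope(q, np.array([3.0])) @ rope(k, np.array([1.0])).T
score_b = rope(q, np.array([7.0])) @ rope(k, np.array([5.0])).T
```

Because rotations preserve vector length, the embedding changes only the relative angle between queries and keys.
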
- image-to-text vs. text-to-image
	- generating text – autoregressive approach (predicting next token based on the previous ones)
	- generating images – diffusion
	- how to unify the two tasks? → *Mixed-Modal Auto-Regressive LM*
		- image converted to tokens and back using tokenizer and de-tokenizer
- Vision-Language-Action models (VLA)