first term – reconstruction (does the decoder work well?)
second term – regularization (is the latent distribution standard normal?)
L_ELBO(θ,ϕ) = E_{q(z∣x)}[ log ( p(x,z) / q(z∣x) ) ]
note: we need to maximize this (or we can minimize −LELBO in gradient descent)
we cannot compute the expectation in closed form, we need to sample from q(z∣x)
sampling is non-differentiable, we cannot backpropagate
reparametrization trick
we cannot sample directly from the posterior like this: ẑ ∼ N(μ,Σ)
so we sample like this: z̄ = μ + Σ^{1/2} ϵ with ϵ ∼ N(0,I)
z̄ is differentiable w.r.t. μ and Σ and follows the same distribution as ẑ
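A minimal numpy sketch of the trick, assuming a diagonal covariance; the `log_var` parametrization is a common convention, not something from these notes:

```python
import numpy as np

def reparametrize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I): the randomness lives in
    eps, so z is a deterministic (differentiable) function of mu, log_var."""
    eps = rng.standard_normal(mu.shape)
    sigma = np.exp(0.5 * log_var)  # diagonal Sigma^{1/2}
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu, log_var = np.array([1.0, -2.0]), np.array([0.0, 0.0])  # sigma = 1
z = np.stack([reparametrize(mu, log_var, rng) for _ in range(20000)])
print(z.mean(axis=0))  # ≈ [1.0, -2.0]
print(z.std(axis=0))   # ≈ [1.0, 1.0]
```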
posterior collapse
it can happen that the VAE stops learning if the posterior q gets too close to the standard prior
KL term dominates the ELBO – we should reduce its weight
also reducing dimensionality D of the latent space helps
exact EM × VAE
there also exist things in the middle (variational EM)
limitation of VAE
frames modeled independently – we need time/sequential modeling!
for spectrogram, for example
one solution: consider blocks of spectrogram as inputs
probabilistic sequential modeling & inference
we can use a RNN
the sampling occurs sequentially and cannot be parallelized
→ dynamical VAEs (DVAEs)
generative adversarial network (GAN)
dataset (real samples), generator (fake samples)
generator Gθ takes a random noise z as input and generates an image x=Gθ(z)
discriminator Dϕ takes an image x as input and outputs the probability that x is real
generator and discriminator are trained jointly in a minimax game
max_θ min_ϕ L_BCE(D_ϕ; x, G_θ(z))
or min_θ max_ϕ E_{x∼p_data(x)}[log D_ϕ(x)] + E_{z∼p_z(z)}[log(1 − D_ϕ(G_θ(z)))]
this corresponds to minimizing the Jensen-Shannon divergence between pdata(x) and pθ(x) with optimal ϕ
D_JS(p,q) = ½ D_KL(p ∥ (p+q)/2) + ½ D_KL(q ∥ (p+q)/2)
typically, GANs are trained by alternating between updating the discriminator and the generator with different batches of data
update the discriminator Dϕ for a few steps
update the generator Gθ for a step
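The two objectives and the alternating schedule can be sketched in numpy; the non-saturating generator loss shown is a common practical variant, an assumption here rather than something stated above:

```python
import numpy as np

def bce_d_loss(d_real, d_fake):
    # Discriminator step: minimize BCE, pushing D(x_real) -> 1, D(G(z)) -> 0.
    return -(np.log(d_real) + np.log(1.0 - d_fake)).mean()

def g_loss_nonsaturating(d_fake):
    # Generator step: maximize log D(G(z)) (non-saturating variant) instead
    # of minimizing log(1 - D(G(z))) -- stronger gradients for bad samples.
    return -np.log(d_fake).mean()

# Alternating schedule per batch:
#   for _ in range(k): update phi to decrease bce_d_loss
#   then one update of theta to decrease g_loss_nonsaturating
print(bce_d_loss(np.array([0.9]), np.array([0.1])))  # good discriminator: low
print(g_loss_nonsaturating(np.array([0.1])))         # rarely fooled: high G loss
```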
problems
very sensitive to the choice of hyperparameters
weak discriminator → generator may produce non-realistic samples
strong discriminator → generator cannot learn (if the generator is always caught, it does not know how to improve) or tends to replicate the training set (overfitting)
mode collapse – the model does not generate the diversity of the dataset and focuses on one thing instead (e.g. generates just ones from MNIST)
if we consider two different Dirac distributions (all mass at a single point each), their JS divergence is the constant log 2 regardless of how far apart the points are → bad
solution: Wasserstein distance
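A small numpy demo of the problem, using discrete Diracs on a 1-D grid (points 2, 5, 9 are arbitrary choices): the JS divergence is log 2 no matter the separation, while the 1-D Wasserstein-1 distance (area between the CDFs) tracks the actual distance:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

n = 20
def dirac(i):
    p = np.zeros(n)
    p[i] = 1.0
    return p

# 1-D Wasserstein-1 = area between the CDFs (unit grid spacing).
def w1(p, q):
    return np.abs(np.cumsum(p - q))[:-1].sum()

js_near, js_far = js(dirac(2), dirac(5)), js(dirac(2), dirac(9))
w_near, w_far = w1(dirac(2), dirac(5)), w1(dirac(2), dirac(9))
print(js_near, js_far)  # both log 2 ≈ 0.693: blind to distance
print(w_near, w_far)    # 3.0 and 7.0: reflects the geometry
```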
avoiding mode collapse: Wasserstein GAN
based on Wasserstein distance – “minimum cost of transporting mass from one distribution to another”
W(p,q) = inf_{γ∈Γ(p,q)} E_{(x,y)∼γ}[∥x−y∥]
where Γ(p,q) is the set of all joint distributions with marginals p and q
properties of W
it is a real distance, not a divergence – it satisfies the triangle inequality and is sensitive to the geometry of the underlying space
it is useful for comparing distributions that are not well-aligned or have different supports (as opposed to the JS divergence)
it's hard to compute efficiently (due to the infimum) in high-dim spaces
but by Kantorovich–Rubinstein duality it can be written as max_{∥f∥_L≤1} { E_{x∼p}[f(x)] − E_{y∼q}[f(y)] }
WGANs use the Wasserstein distance instead of the JS divergence
they use a critic Cϕ (instead of the discriminator) which is trained to approximate W
they use weight clipping to enforce a Lipschitz constraint on the critic
it's not a competition anymore
the critic is trained to approximate the Wasserstein distance
the generator is trained to minimize it
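A sketch of the critic objective and weight clipping from the original WGAN recipe, with toy numpy stand-ins instead of a real network:

```python
import numpy as np

def critic_loss(c_real, c_fake):
    # Critic maximizes E[C(x_real)] - E[C(x_fake)], so we minimize the
    # negative; under the Lipschitz constraint this estimates W(p_data, p_theta).
    return -(c_real.mean() - c_fake.mean())

def clip_weights(params, c=0.01):
    # Crude Lipschitz enforcement from the original WGAN: after every
    # critic update, clamp each weight to [-c, c].
    return [np.clip(w, -c, c) for w in params]

params = [np.array([[0.5, -0.02], [0.003, -1.0]])]
clipped = clip_weights(params)
print(clipped[0])  # all entries in [-0.01, 0.01]
```

Later work replaces clipping with a gradient penalty, but the clipped-critic version above matches the original formulation.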
Evaluation of Generative Models
it's important but hard to evaluate the quality of generated samples
we need to know how well the models perform
but they can produce a wide range of outputs → it's hard to define a single evaluation metric capturing all aspects of quality
also, evaluation metrics may not align with human perception of quality
Inception Score (IS)
p(y∣x) is the class distribution predicted by a pretrained Inception model
the higher the better (confident per-image predictions p(y∣x), diverse marginal p(y))
Fréchet Inception Distance
measures the Wasserstein distance between the distribution of generated images and the distribution of real images in the feature space of a pretrained Inception model
extracts features from (provided) real images and generated images using a pretrained Inception model
FID score is computed based on the means and covariances of the features
FID(G_θ) = ∥μ_r − μ_g∥² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2})
assumes Gaussian distributions
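A minimal numpy/scipy sketch of the FID formula on stand-in features (random Gaussians, not real Inception activations):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """||mu_r - mu_g||^2 + Tr(Sig_r + Sig_g - 2 (Sig_r Sig_g)^{1/2})
    on Gaussians fitted to the two feature sets."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    sig_r = np.cov(feats_real, rowvar=False)
    sig_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sig_r @ sig_g)
    if np.iscomplexobj(covmean):  # tiny imaginary parts from numerical noise
        covmean = covmean.real
    return np.sum((mu_r - mu_g) ** 2) + np.trace(sig_r + sig_g - 2.0 * covmean)

rng = np.random.default_rng(0)
a = rng.standard_normal((500, 8))        # stand-in "real" features
b = rng.standard_normal((500, 8)) + 1.0  # shifted "generated" features
print(fid(a, a))  # ~0 for identical sets
print(fid(a, b))  # clearly larger for the shifted distribution
```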
comparison of IS and FID
FID is more robust to mode collapse
both rely on Inception model trained on ImageNet
trained for classification, might not reflect all the aspects of the image quality
the model might not well reflect the evaluated data
example: spectrograms
both scores depend on the pretrained model we choose
not suitable for non-image data
don't provide insights into the diversity or realism of generated samples
subjective evaluation
user studies and visual inspection
provide valuable insights
costly, subjective :)
might need specific user expertise
hybrid alternative – mean opinion score
crowd-source evaluation technique where human evaluators rate the quality of generated samples on a scale (e.g. 1 to 5)
MOS network trained to predict the score
so the network is trained to estimate subjective criteria
can be used for non-image data
Audio
introduction
we sample the signal using frequency Fs
Nyquist-Shannon sampling theorem: we can only reconstruct content corresponding to frequencies <Fs/2
most of energy within 0.3–3 kHz → phone standards sample at 8 kHz
discrete Fourier transform
outputs sequence of numbers describing the magnitude and phase at each frequency bin
computed using FFT
but energy for frequencies changes over time
short-time Fourier transform (STFT)
sliding window (e.g. Hann) over the signal
apply DFT to each windowed segment
taking the modulus gives the magnitude spectrogram (phase is discarded)
mel frequency scale
better matches human perception (compared to the linear scale)
mel-frequency cepstral coefficients (MFCC)
popular audio representations
usual pipeline to get MFCC: raw data → STFT → Mel scale → log → discrete cosine transform
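The pipeline can be sketched end-to-end in numpy/scipy; the parameter choices (`n_fft=256`, 20 mel bands, 13 coefficients) are illustrative assumptions, not values from the notes:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, n_fft=256, hop=128, n_mels=20, n_mfcc=13):
    # 1) STFT power spectrogram via framing + windowing + FFT
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # 2) triangular mel filterbank (equally spaced on the mel scale)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # 3) log of mel energies, 4) DCT, keep the first n_mfcc coefficients
    log_mel = np.log(power @ fb.T + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]

t = np.arange(8000) / 8000.0
x = np.sin(2 * np.pi * 440 * t)  # 1 s of a 440 Hz tone at Fs = 8 kHz
print(mfcc(x).shape)             # (frames, n_mfcc)
```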
audio representations based on self-supervised learning
sometimes, part of the representation is not learned (classical audio representations are used)
using learned representations may be better – they can be tailored to specific data or provide more general, reusable representations
wav2vec 2.0
masked speech in latent space (approach similar to masked language modeling)
architecture similar to STFT, but the transformation is learnt
CNN (to get latent representations based on raw waveform) + Transformer encoder (to get contextualized representations)
contrastive loss to train using masked prediction
some latent representations are masked
did the model predict the true latent? – cosine similarity
hidden-unit BERT (HuBERT)
extension of wav2vec 2.0
learns from unlabeled audio, predicts masked portions like BERT
idea: use clustering to generate pseudo-labels for audio segments, then train a model to predict those labels
pseudo-labels initialized using MFCC (and k-means)
fine-tuned for automatic speech recognition (ASR)
WavLM
extension of HuBERT
more robust learning objective, more data
idea: incorporate speech denoising as well as time/channel-wise masking in pre-training to improve robustness and generalization
strong performance on both speech recognition and speaker-related tasks (e.g. speaker verification, diarization = partitioning of audio according to speakers)
gated relative position bias
relative position bias – attention is affected by the distance between tokens (tokens close to each other should attend more)
learnable gate controls the influence of the position bias
end-to-end approaches (audio → audio)
WaveNet
deep generative model for raw audio waveforms
unconditioned
idea: model the joint probability of an audio waveform as a product of conditional probabilities (using the chain rule)
capturing long-range dependencies in an efficient way (using stacks of dilated causal convolutions)
the output at time t depends only on x_{<t} (causality)
exponentially increase receptive field
gated activation units and residual connections for stable training
applications: audio generation (e.g. speech synthesis) and enhancement; can be adapted for music or other sequential data
SampleRNN
similar to WaveNet
hierarchical structure
upper tiers summarize longer contexts
lower tiers generate fine-grained details
it's very difficult to capture very long dependencies (like 60 seconds)
generating audio from intermediate representations
STFT is invertible, but reconstructing the audio signal from a spectrogram is not immediate (we kept only the modulus)
we need the phase of the signal
classical approach: Griffin-Lim
iterative algorithm to estimate the phase
exploits the redundancy between time-frames and between frequency bins of the STFT representation
widely used in classical speech synthesis and as a baseline
learning-based approach: HiFi-GAN
uses GAN to synthesize realistic waveforms conditioned on Mel-spectrograms
enables real-time speech synthesis with high perceptual quality
generator takes Mel-spectrogram as input, outputs raw audio
uses transposed convolutions and residual blocks to upsample and generate waveform samples
multi-scale discriminators – operate at different resolutions of the waveform
multi-period discriminators – focus on periodic patterns in speech
Tacotron
end-to-end TTS model
maps character or phoneme sequences to Mel-spectrograms
no need for hand-crafted linguistic features
typically used together with WaveNet or HiFi-GAN to generate the final waveform from the predicted spectrogram
encoder + decoder
uses attention to map text position to audio frames
L1 loss
AnCoGen
masked-modeling-based model
idea: map the spectrogram to attributes (pitch, SNR, reverberation, content, …)
enables control over the audio attributes
ratio – used for masking
(0,1) → audio non-masked, attributes masked
masking can be partial (e.g. 0.7)
is combined with a neural vocoder (HiFi-GAN) to generate the final waveform from the predicted spectrogram
Diffusion
basic division of generative models
explicit density
tractable density → autoregressive
approximate density → VAE
implicit density
direct sampling → GAN
indirect sampling → diffusion
basic concepts
Brownian motion – continuous random movement of a particle, with increments that are Gaussian and independent
diffusion process – stochastic system whose evolution is governed by Brownian motion
diffusion (in image generation) – we add noise
the goal of the model (DDPM, denoising diffusion probabilistic model) is to remove the noise
forward diffusion process (fixed) – start with data x₀, gradually add Gaussian noise in T steps
q(x_t ∣ x_{t−1}) = N(x_t; √(1−β_t) x_{t−1}, β_t I)
reverse denoising process (generative)
learn pθ(xt−1∣xt) to denoise
pθ(xt−1∣xt)=N(xt−1;μθ(xt,t),Σθ(xt,t))
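Because Gaussian noising steps compose, the forward process has the closed form x_t = √(ᾱ_t) x₀ + √(1−ᾱ_t) ϵ with ᾱ_t = ∏_{s≤t} α_s; a numpy sketch (the linear β schedule values are an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# linear noise schedule beta_1..beta_T (assumed values)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # abar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Sample x_t directly from x_0: all t noising steps of
    q(x_t | x_{t-1}) collapse into one Gaussian."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones(4)
eps = rng.standard_normal(4)
print(q_sample(x0, 0, eps))      # barely noised
print(q_sample(x0, T - 1, eps))  # essentially pure noise
```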
training
the model should estimate a noise vector ϵ∈Rn from a given noise level σ>0 and noisy input xσ∈Rn s.t. for some x0 in the data manifold K it holds that xσ≈x0+σϵ
a denoiser ϵθ:Rn×R+→Rn is learned by minimizing L(θ):=Ex0,σ,ϵ∥ϵθ(x0+σϵ,σ)−ϵ∥2
x0 sampled from training data
σ sampled from a training noise schedule
in practice, noise levels σ range from 0.01 to 100
ϵ sampled from N(0,In)
we are trying to find an ideal denoiser ϵ* that minimizes L(θ)
Representation Learning
to fine-tune an ImageNet-pretrained model, we drop the last weight matrix of dimension f×1000 and replace it with a matrix of dimension f×c, where c is the desired number of classes (f … number of features)
problems
not optimal for every problem (e.g. video, medical)
humans don't need ImageNet pretraining
solution
replace ImageNet pre-training by an unsupervised training (representation learning)
generation-based methods
autoencoders
train such features that can be used to reconstruct original data
input data x → encoder → features z → decoder → reconstructed input data x^
z typically has less features than x
we minimize ∥x−x^∥2
the encoder learns the representation
if we have a large unlabeled dataset and a small annotated dataset, we can use an encoder or a GAN to initialize a supervised model
limitations
features not trained to discriminate
limited performance
additional computation cost (decoder or generator)
solution: self-supervised learning (SSL)
self-supervised learning – supervision comes from the data (no need to annotate)
pretext task
we don't care about this specific task but it helps the model to learn the representations
e.g. relative patch prediction
but it's not that easy
low-level color cues (e.g. chromatic aberration) let the model cheat the task
solution: drop two channels, replace by Gaussian noise
another task: solving jigsaw puzzles
to make it tractable, we sample a fixed subset of 1000 permutations and train the classifier only on those
other tasks
colorization
rotation prediction
super-resolution
contrastive learning
goal: to learn features that are discriminative among instances
but we would need too many classes (one for each instance in the training dataset) – we need non-parametric softmax & memory bank
memory bank contains feature representations of all images in the dataset
invariant information clustering – maximizing the mutual information between encoded variables
SimCLR
we have two images A,B, apply two random transformations to each of them
so we get four images A1,A2,B1,B2, we want to maximize agreement between the ones based on the same image (e.g. A1,A2) and minimize agreement between the ones based on different images (e.g. A1,B1)
agreement defined as cosine similarity (pairwise)
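A numpy sketch of the resulting NT-Xent loss; the temperature of 0.5 and the row layout (views of the same image in adjacent rows) are assumptions:

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """NT-Xent: rows (2k, 2k+1) are the two augmented views of image k;
    each view's positive is its sibling, every other row is a negative."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    pos = np.arange(len(z)) ^ 1                        # sibling: 0<->1, 2<->3, ...
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(z)), pos].mean()

rng = np.random.default_rng(0)
anchor = rng.standard_normal((2, 8))
views = np.repeat(anchor, 2, axis=0)  # "perfect" augmentations: identical views
print(nt_xent(views))                 # low: positives agree perfectly
print(nt_xent(views[[0, 2, 1, 3]]))   # higher: siblings are mismatched
```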
MoCo (momentum contrast)
instead of end-to-end learning or memory bank, we use momentum encoder
we mix the previous parameters of the network with the current one
SimSiam
DINO
vision transformer (ViT)
linear projection of flattened 16×16 patches + learned position embedding
additional classifier token
two networks: student and teacher
momentum teacher as in MoCo
segmentation emerges
what is a good representation? – we need robustness (to handle domain shift)
appearance changes due to different sensors – infrared vs. normal camera
use of synthetic data – synthetic datasets may be cheaper to make
unseen scenarios (e.g. natural disasters)
biased datasets
unsupervised domain adaptation (DA)
source and target distributions, we want them to have similar representations
e.g. we trained the model on labeled photos, we want it to handle (unlabeled) cartoon images
approaches
discrepancy-based method
use maximum mean discrepancy (MMD) to align the distributions
alignment layers
idea: learn domain-agnostic representation by adjusting the network architecture
batch normalization
adversarial-based methods
employ an adversarial objective to ensure that the network cannot distinguish between the source and target domains
adaptation through translation
train model which can translate between domains
in some contexts, discrete representations may be useful
VQ-VAE = VAE with vector quantization
vector quantization maps a vector from a continuous space to a vector from a dictionary (codebook)
Image Generation
variational autoencoders (VAEs)
encoder (predicts distribution in latent space) + decoder (predicts distribution in feature space)
we want the latent space to be close to Gaussian
that's what KL divergence term does
dimensions in latent space may correspond to some properties of the objects in the image
we can do linear interpolation – we encode two images, “mix” them (in some ratio), then decode
GANs
problem: want to sample from complex, high-dimensional training distribution (no direct way to do this!)
solution: sample from a simple distribution (e.g. random noise) & learn transformation to training distribution
minimax objective function
alternate between gradient ascent on discriminator and gradient descent on generator
in practice: instead of minimizing likelihood of discriminator being correct, we maximize likelihood of discriminator being wrong (higher gradient signal for bad samples → works better)
Progressive GAN – training layer by layer (we start by training simple small layers, then add larger layers)
BigGAN
style transfer
we want to take content from one image and style from the other one
we don't want to transfer only color but also brush strokes
we don't change the structure of the original image, we change statistical properties of its patches (to get different style)
to compute loss, the VGG encoder needs to be used again on the result and the “style” image
style-based GAN
traditional approach: latent vector comes from the source image
style-based GAN starts with learned constant tensor, adds noise and style (in each layer) by predicting scale and shift
we swap source images at some point in the process to get the mix of style and content
image-to-image translation
goal: translate image from one representation to another
edges (drawing) → photo
labels → street scene
BW → color
aerial → map
day → night
Pix2Pix
use GAN, discriminator gets both images (we want the generated images to be both plausible and to correspond to the original image)
generator is just autoencoder (encoder + decoder)
convolution & deconvolution
U-Net uses skip connections from the encoder to the decoder (not everything has to be encoded in the latent space)
works better
example: generating image based on segmentation
we can use a trained segmentation model to segment the generated image
then, we can apply metrics used for image segmentation evaluation
smarter discriminator
instead of predicting only one score (on the scale from real to fake), we can predict multiple scores (one for each region of the image)
this doesn't work for too small regions – “is this pixel realistic?” is not a good question (the discriminator cannot see patterns, only colors of individual pixels)
we assume we have access to p(x,y) and train model to sample y∼p(y∣x) or x∼p(x∣y)
but we don't always have p(x,y) → unpaired image-to-image generation
example: you may have many images of horses and many images of zebras, but never a pair of corresponding images
CycleGAN
uses both GAN losses and cycle-consistency loss
if we generate zebra based on a horse, we want to be able to generate horse based on the zebra and get the same horse as before
based on ResNet (not U-Net)
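The cycle-consistency term itself is just an L1 reconstruction penalty in both directions; a toy numpy sketch with invertible linear stand-ins for the two generators (the weight λ=10 is a common choice, assumed here):

```python
import numpy as np

def cycle_loss(G, F, x_horse, y_zebra, lam=10.0):
    """L1 cycle-consistency: F(G(x)) should recover x, and G(F(y))
    should recover y. G: horse->zebra, F: zebra->horse (toy stand-ins)."""
    forward = np.abs(F(G(x_horse)) - x_horse).mean()
    backward = np.abs(G(F(y_zebra)) - y_zebra).mean()
    return lam * (forward + backward)

# Toy "generators" that are exact inverses -> zero cycle loss.
G = lambda x: 2.0 * x + 1.0
F = lambda y: (y - 1.0) / 2.0
x = np.linspace(0.0, 1.0, 5)
y = np.linspace(0.0, 1.0, 5)
print(cycle_loss(G, F, x, y))  # 0.0
```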
let's have a shared latent space!
so we have two encoders (one for zebra, one for horse) and two decoders that share the same latent space
weights are shared between encoders
geometry-consistency
we check how well the model works for transformed images (we then invert the transformation and compare with the result for the untransformed image)
GcGAN
high resolution images
Pix2PixHD
we don't want to use many layers – you lose information
architecture similar to style-based GAN
struggles with uniform surfaces
video generation
we need temporal consistency
we could do 3D convolution instead of 2D convolution
but we would need a lot of data
we could consider static background and moving objects
so we generate static background (image) and two videos – foreground and mask (ratio for mixing the foreground and background)
let's generate a trajectory of vectors in latent space we can then pass to a decoder
limitations
fixed-length videos only
no control over motion and content
MoCoGAN
DVDGAN
video-to-video translation
animating single subject – latent space with human pose
neural radiance fields (NeRF)
estimate “shape” of an object based on several photos
can render novel views
Diffusion models
we address generation as a denoising problem
similar to GAN, we start with a distribution easy to sample (Gaussian) and get a distribution we want (but we do it in multiple steps)
we estimate mean of the next distribution
we can combine multiple steps of adding noise just into one step
instead of predicting the image, we predict the noise
it's easier as the variance is fixed (we can focus on predicting mean)
also, the image changes over time – the noise does not (?)
we can use simpler loss formula even though there's no theoretical explanation for it
L_t = E_{t∼U[1,T], x₀, ϵ_t}[ ∥ϵ_t − ϵ_θ(x_t, t)∥² ]
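A numpy sketch of this simplified loss; an analytic "oracle" stands in for the learned denoiser ϵ_θ (with x₀ = 0 it recovers ϵ exactly, so the loss is ~0):

```python
import numpy as np

def simple_loss(eps_model, x0, t, alpha_bar, rng):
    """L_t = E ||eps - eps_theta(x_t, t)||^2 with
    x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

# Oracle denoiser: with x0 = 0 we have x_t = sqrt(1 - abar_t) * eps,
# so rescaling x_t recovers eps exactly and the loss is ~0.
oracle = lambda xt, t: xt / np.sqrt(1.0 - alpha_bar[t])
x0 = np.zeros(16)
print(simple_loss(oracle, x0, 500, alpha_bar, np.random.default_rng(0)))
```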
training and sampling algorithms
we want the results to follow a distribution → we add some randomness according to the variance
we want the distribution to be conditional
first approach: classifier guidance
we have a diffusion model P(x)
we have a classifier P(y∣x)
we want to be able to sample P(x∣y)
idea: instead of just denoising, we also move in the direction that makes the probability P(y∣x) higher
but we need to compute the gradient of the classifier (using backpropagation) – high computational cost
second approach: classifier-free guidance
the noise model is trained with y in mind
we also use an unconditioned denoiser – mixing conditional and unconditional predictions trades off sample diversity against fidelity to the condition
latent diffusion models
problem: to generate high-resolution images, we need to start from high-resolution noise and it takes many denoising steps (→ computational cost)
we could use distillation
or we can project the images in a discrete latent space
let's have an encoder and a decoder
we consider a diffusion model in the latent space
U-Net used for denoising
cross-attention to apply conditioning (in the U-Net)
conditioning is in a single vector C
diffusion transformers (DiT)
conditioning is used to predict scale and shift (similar to StyleGAN)
2D → 3D latent space
denoiser for video gets very computationally intensive if we want to attend everywhere
instead, we consider separate spatial and temporal layers
why it changes everything
GAN worked only on datasets with limited diversity
fine-grained control with text
we have a general-purpose image prior
we can start with pretrained large models and use transfer learning for specific tasks
one training of a large model costs 600 000 euros
how can we reuse a pretrained diffusion model so that we can condition using spatial data (a sketch…)
how to fit all the information in a single vector C
ControlNet – encoder with skip connections to the U-Net
inpainting
we want to put a specific object in the image
idea: we add noise to the whole image and let it generate with a conditioning
to make sure that the rest of the image does not change, we can replace the rest of the image with the original image (+ noise) in every step of denoising
we use a mask for that
(other slides skipped)
personalized text-to-image
example: I want to generate something based on this specific (real) statue
how can I describe this specific object using an embedding vector?
has to be learnt
Dreambooth
problem: if I finetune the model using photos of my dog standing, I will only get results with my dog standing (not sitting)
so I use specific loss that ensures the generated diversity is similar to the diversity of real dog poses
Multimodal Learning
multimodal learning
many research questions
many other modalities than just video and audio
text, lidar, thermal, events, …
interactions between modalities
sensor fusion – using information from diverse sensors to make predictions
types
camera + depth sensor → RGB-D object detection
RGB + thermal
RGB + optical flow (object moving in the video)
we have aligned inputs; when to perform fusion?
early fusion – concat, then pass to the model
late fusion – two models, jointly predict
easier to train (we don't need that much paired data)
can be run in parallel
middle fusion – two models (with shared weights), then fuse features and pass to third model
another approach: learn when to perform fusion (siamese network)
using ViT with two modalities
either pass shorter sequence of pairs → early fusion
or pass two sequences (so the entire sequence is longer) → late fusion
worked better
RGB + lidar detection
advantages
effective in low light and some adverse weather
robust in low-texture areas
penetrates dense foliage (vegetation) – for satellite imagery
long range
lidar returns a point cloud (points detected in 3D)
approaches
first find object in image, then use lidar points (sequential)
or we can use late fusion
multimodal translation (from one modality to another)
I2T: image captioning – we condition text on some visual observation
first RNNs with LSTM, then Transformer (attention-based approaches)
V2T: lip reading
T2I: text-to-image generation
A2V: audio to video
ASR: speech recognition – not generative (???)
TTS: text-to-speech
hybrid tasks
visual question answering
projecting the image and the question in the same vector space
image processed by CNN
question processed by CNN/LSTM
attention layers – which part of the image should I look at?
lip reading
seq2seq with attention – which time should I look at to predict the next word? (predicting alignment between text and audio/video frames)
multimodal alignment (identifying and modeling correspondences)
ImageNet
hard to scale up
vision is not only about classes
limited robustness to distribution shifts
adaptation to other tasks (new classes) requires further training
zero-shot classification: CLIP
frame the problem as an image-caption matching problem
captions
easier to get than classes
contain semantic, geometric, and stylistic information
multi-object images
collected 400M pairs (312× more than ImageNet)
contrastive pre-training
captions encoded using transformer
images encoded by ViT or ResNet
for each image, probability (softmax) of every possible caption (and vice versa)
loss function – maximize likelihood of predicting correct text for the image and correct image for the text
can be then used for zero-shot classification
user-defined classes can be expressed as captions: "a photo of a {object}."
can be also used as a search engine
you compute the similarity between the provided caption and the images in your database
can be used to build on top of (prompt engineering – “software 3.0”)
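Zero-shot classification then reduces to cosine similarity plus a softmax over class captions; a numpy sketch with random stand-in embeddings (the logit scale of ~100 corresponds to CLIP's learned temperature, an assumption here):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Cosine similarity between one image embedding and one text embedding
    per class caption, softmax over classes. Embeddings here are random
    stand-ins, not real CLIP outputs."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * (txt @ img)  # ~100 is CLIP's learned logit scale
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
classes = ["a photo of a dog.", "a photo of a cat."]
text_embs = rng.standard_normal((2, 512))                  # stand-in captions
image_emb = text_embs[0] + 0.1 * rng.standard_normal(512)  # image near "dog"
probs = zero_shot_classify(image_emb, text_embs)
print(classes[int(np.argmax(probs))])
```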
CLIP can be also used for audio classification
Imagebind
based on contrastive loss
paired data of many different modalities: images, videos, text, audio, depth, thermal, IMU (movement)
connecting modalities which were not connected before
masked modeling vs. causal language modeling
masked model – BERT, bidirectional
causal language model – GPT, unidirectional
training LLM
pre-training → instruction fine-tuning → reinforcement learning with human feedback
having several copies of LLMs fine-tuned for different tasks is expensive
alternative approach: prefix-tuning
freeze the weights
train new prefix embeddings (added at the beginning of the sequence) that optimize the behavior of the network for the specific task
similar to prompt engineering (“You are an AI agent, be kind to the user.“)
but here, we use gradient descent to find tokens which work the best (they don't have to correspond to existing words)
bidirectional vs. causal (unidirectional) attention
bidirectional models cannot be used to generate
unidirectional uses masked self-attention
→ can generate, is more compute efficient, has good modeling capacities
multimodal LLMs
VisualBERT (bidirectional)
image (split using bounding boxes by an object detector) + caption
masking words in the caption
objective 1: predict masked words
objective 2: predict if the image matches the caption or not
downstream task: visual question answering
[mask] token is appended to the question (→ answer is predicted by the model)
VQA is considered as a classification problem
unidirectional MLLM (encoder + decoder)
encoder gets image and beginning of the sentence
image is first split into patches and processed by convolution
bidirectional attention
decoder continues the sentence
decoder-only
vision encoder trained using the task of next token prediction
produces encoded representations of the image
used as prefix for LLM (or even as a part of the input – anywhere)
LLM frozen – why?
cost of training
contains a lot of useful knowledge we don't want to lose
it combines perception of vision encoder and reasoning capacity of LLM
alternative approach: use CLIP instead of encoder training (to get the embeddings)
Flamingo
images are removed from the text and replaced by placeholders
then, images are provided using gated cross-attention
skip connections make sure that the model preserves its pre-trained abilities even after modification (at the beginning of the fine-tuning phase – with the initial parameters for the new blocks in the architecture)
each text token cross-attends only to the most recent preceding image
but the self-attention layer still ensures everyone sees everything
Socratic Models
idea: convert all modalities into text
then perform reasoning in text form
but there are things hard to describe with text
training conversational agent for visual question answering
hard to get data
Llava architecture
frozen vision encoder
frozen LLM
trained projection is applied to the output of the vision encoder (before passing these “tokens” to the LLM)
Visual LLM
Qwen
supports audio, video, and text
uses vision and audio encoders
we need positional embeddings
rotary position embedding (RoPE)
multiple rotations happening at once (the vector of dimension n is split into n/2 two-dimensional pairs and each pair is rotated by a different frequency)
for videos, we use M-RoPE
rotary embedding decomposed into temporal, width, and height component
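A numpy sketch of plain 1-D RoPE showing the key property that attention scores depend only on relative position (base 10000 as in the original RoPE formulation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Split the d-dim vector into d/2 pairs; rotate pair i by
    angle pos * base^(-2i/d)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
d1 = rope(q, 3) @ rope(k, 7)    # relative offset 4
d2 = rope(q, 13) @ rope(k, 17)  # same offset 4, different absolute positions
print(np.isclose(d1, d2))       # True: score depends only on the offset
```

M-RoPE applies the same idea per component, splitting the rotation budget across temporal, width, and height indices.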
best open-source model
speech synthesis in an autoregressive manner
image-to-text vs. text-to-image
generating text – autoregressive approach (predicting next token based on the previous ones)
generating images – diffusion
how to unify the two tasks? → Mixed-Modal Auto-Regressive LM
image converted to tokens and back using tokenizer and de-tokenizer