convolution – “filter”
notation
advantages
motivation for padding (with zeros)
convolutions can only be evaluated where the kernel lies entirely within the input domain – that's inconvenient, as it couples the architecture to the input size
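a minimal numpy sketch of this coupling: a "valid" convolution (kernel fully inside the input) shrinks the output, while zero padding ("same") keeps the output length equal to the input length regardless of kernel width

```python
import numpy as np

signal = np.arange(8, dtype=float)   # input of length 8
kernel = np.ones(3) / 3.0            # averaging kernel of width 3

# 'valid': kernel must lie entirely inside the input -> output length 8 - 3 + 1 = 6
valid = np.convolve(signal, kernel, mode="valid")

# 'same': implicit zero padding -> output length equals input length (8)
same = np.convolve(signal, kernel, mode="same")
```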
downsampling approaches
upsampling approaches
architectures
RNNs
Transformer
probabilistic models – aim to learn a parametric distribution that approximates the complex data distribution
Kullback-Leibler divergence
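the divergence above, for discrete distributions, is KL(P‖Q) = Σₓ p(x) log(p(x)/q(x)); a small numpy sketch showing its two key properties (zero iff the distributions match, non-negative, and asymmetric)

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions
    (assumes q(x) > 0 wherever p(x) > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, p))  # 0.0 -- zero exactly when the distributions match
print(kl_divergence(p, q))  # positive, and in general KL(P||Q) != KL(Q||P)
```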
latent variables
simple example: clustering
more advanced approach: Gaussian mixture model
we can also consider continuous latent variables
variational autoencoders (VAEs)
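the central trick in a VAE can be sketched in a few lines: sample the continuous latent as z = μ + σ·ε with ε ~ N(0, I) (the reparameterization trick, so gradients flow through μ and σ), and penalize the encoder with the closed-form KL to a standard normal; the numbers below are hypothetical encoder outputs for illustration

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical encoder output for one data point: mean and log-variance of q(z|x)
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -0.5])

# reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I);
# the randomness is external, so gradients can flow through mu and sigma
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# closed-form KL( N(mu, sigma^2) || N(0, I) ) used in the VAE objective
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```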
generative adversarial network (GAN)
it's important but hard to evaluate the quality of generated samples
objective metrics
subjective evaluation
hybrid alternative – mean opinion score
introduction
audio representations based on self-supervised learning
end-to-end approaches (audio → audio)
generating audio from intermediate representations
basic division of generative models
basic concepts
forward diffusion process (fixed) – start with data, gradually add Gaussian noise in steps
reverse denoising process (generative)
training
common model architectures
reverse denoising process – sampling
flow matching models vs. diffusion models
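the fixed forward process above has a convenient closed form, q(x_t | x_0) = N(√ᾱ_t · x_0, (1 − ᾱ_t) I), so any noise level can be sampled in one shot; a numpy sketch with an assumed linear noise schedule (the exact schedule values are a common choice, not prescribed by the notes)

```python
import numpy as np

rng = np.random.default_rng(0)

# linear noise schedule (assumed values, a common choice)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product a_bar_t

def forward_diffuse(x0, t):
    """Closed-form forward process: x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

x0 = np.ones(4)                          # toy "data"
x_early, _ = forward_diffuse(x0, 10)     # still close to the data
x_late, _ = forward_diffuse(x0, T - 1)   # nearly pure Gaussian noise
```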
types of learning
initial approach: pretraining (e.g. ImageNet) & fine-tuning
generation-based methods
self-supervised learning – supervision comes from the data (no need to annotate)
what is a good representation? – we need robustness (to handle domain shift)
unsupervised domain adaptation (DA)
in some contexts, discrete representations may be useful
variational autoencoders (VAEs)
GANs
image-to-image translation
video generation
neural radiance fields (NeRF)
instead of predicting the image, we predict the noise
we can use a simpler loss formula, even though there's no theoretical justification for it
training and sampling algorithms
we want the results to follow a distribution → we add some randomness according to the variance
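a sketch of these three points together, with a dummy stand-in for the learned noise-prediction network eps_theta: the simplified loss compares predicted and true noise, and each reverse step adds σ_t·z so the samples follow a distribution rather than collapsing to a point

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t):
    # stand-in for the noise-prediction network (a real model would be learned)
    return np.zeros_like(x_t)

def simple_loss(x0, t):
    """Simplified objective: || eps - eps_theta(x_t, t) ||^2 -- predict the noise, not the image."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return float(np.sum((eps - eps_theta(x_t, t)) ** 2))

def sample_step(x_t, t):
    """One reverse step: denoise via eps_theta, then add sigma_t * z for the stochastic part."""
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_theta(x_t, t)) / np.sqrt(alphas[t])
    z = rng.standard_normal(x_t.shape) if t > 0 else 0.0  # no noise on the final step
    return mean + np.sqrt(betas[t]) * z
```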
we want the distribution to be conditional
latent diffusion models
diffusion transformers (DiT)
conditioning is used to predict scale and shift (similar to StyleGAN)
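the scale-and-shift conditioning above (adaptive layer norm, in the spirit of StyleGAN's modulation) can be sketched as follows; the projection matrix W is a hypothetical learned parameter, initialized small so the map starts near the identity

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_cond = 8, 4
# hypothetical learned projection from the conditioning vector to (scale, shift)
W = rng.standard_normal((d_cond, 2 * d_model)) * 0.01

def adaln_modulate(x, cond):
    """Normalize x, then apply a scale and shift predicted from the conditioning."""
    normed = (x - x.mean()) / (x.std() + 1e-6)
    scale, shift = np.split(cond @ W, 2)
    return normed * (1.0 + scale) + shift  # (1 + scale) keeps the map near identity at init

features = rng.standard_normal(d_model)
cond = rng.standard_normal(d_cond)  # e.g. timestep + class embedding
out = adaln_modulate(features, cond)
```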
2D → 3D latent space
why it changes everything
how can we reuse a pretrained diffusion model so that we can condition using spatial data (a sketch…)
personalized text-to-image
multimodal learning
sensor fusion – using information from diverse sensors to make predictions
multimodal translation (from one modality to another)
hybrid tasks
multimodal alignment (identifying and modeling correspondences)
masked image model vs. masked language model
training LLM
multimodal LLMs
[mask] token is appended to the question (→ answer is predicted by the model)
image-to-text vs. text-to-image