
Exam: Multimodal AI

Convolution

convolution – “filter”

Convolution

notation

Convolution

advantages

Convolution

motivation for padding (with zeros)

convolutions can only be executed if the kernel lies entirely within the input domain – that's inconvenient, as it couples the architecture to the input size
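
For intuition, here's the standard output-size arithmetic (a well-known formula, not taken from the card) for a 1-D convolution with input length $n$, kernel size $k$, padding $p$, and stride $s$:

$n_\mathrm{out} = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1$

With stride $s = 1$, odd $k$, and "same" zero padding $p = (k-1)/2$, this gives $n_\mathrm{out} = n$, so the architecture no longer constrains the input size.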

Convolution

downsampling approaches

Convolution

upsampling approaches

Convolution

architectures

Convolution

RNNs

Convolution

Transformer

Probabilistic models with latent variables: VAE, GAN

probabilistic models – aim to learn a parametric distribution $p_\theta(x)$ that approximates the complex data distribution $p_\mathrm{data}(x)$
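
A useful connection (a standard result, not spelled out on the card): maximizing the expected log-likelihood of the data under $p_\theta$ is equivalent to minimizing the KL divergence from $p_\mathrm{data}$ to $p_\theta$:

$\arg\max_\theta \mathbb{E}_{x \sim p_\mathrm{data}}[\log p_\theta(x)] = \arg\min_\theta D_\mathrm{KL}(p_\mathrm{data} \,\|\, p_\theta)$

since $D_\mathrm{KL}(p_\mathrm{data} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_\mathrm{data}}[\log p_\mathrm{data}(x)] - \mathbb{E}_{x \sim p_\mathrm{data}}[\log p_\theta(x)]$ and the first term does not depend on $\theta$.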

Probabilistic models with latent variables: VAE, GAN

Kullback-Leibler divergence

Probabilistic models with latent variables: VAE, GAN

latent variables

Probabilistic models with latent variables: VAE, GAN

simple example: clustering

Probabilistic models with latent variables: VAE, GAN

more advanced approach: Gaussian mixture model

Probabilistic models with latent variables: VAE, GAN

we can also consider continuous latent variables

Probabilistic models with latent variables: VAE, GAN

variational autoencoders (VAEs)

Probabilistic models with latent variables: VAE, GAN

generative adversarial network (GAN)

Evaluation of Generative Models

it's important but hard to evaluate the quality of generated samples

Evaluation of Generative Models

objective metrics

Evaluation of Generative Models

subjective evaluation

Evaluation of Generative Models

hybrid alternative – mean opinion score

Audio

introduction

Audio

audio representations based on self-supervised learning

Audio

end-to-end approaches (audio → audio)

Audio

generating audio from intermediate representations

Diffusion

basic division of generative models

Diffusion

basic concepts

Diffusion

forward diffusion process (fixed) – start with data $x_0$, gradually add Gaussian noise in $T$ steps

$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$
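
A minimal PyTorch sketch of the forward process (function names and shapes are my assumptions; $x_0$ is a $(B, C, H, W)$ tensor, betas a length-$T$ schedule). It uses the closed form $q(x_t \mid x_0) = \mathcal{N}(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t) I)$ with $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$, which follows by composing the per-step Gaussians:

import torch

def forward_diffusion(x0, t, betas):
    # closed-form sample x_t ~ q(x_t | x_0): all t noising steps
    # collapse into a single Gaussian
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)        # (T,)
    mean_coef = alpha_bar[t].sqrt().view(-1, 1, 1, 1)    # sqrt(alpha_bar_t)
    std = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)  # sqrt(1 - alpha_bar_t)
    eps = torch.randn_like(x0)                           # noise we later learn to predict
    return mean_coef * x0 + std * eps, eps

Here t is a LongTensor of per-example timesteps, so one batch can mix noise levels – exactly what the training loss below needs.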

Diffusion

reverse denoising process (generative)

Diffusion

training

Diffusion

common model architectures

Diffusion

reverse denoising process – sampling

Diffusion

flow matching models vs. diffusion models

Representation Learning

types of learning

Representation Learning

initial approach: pretraining (e.g. ImageNet) & fine-tuning

Representation Learning

generation-based methods

Representation Learning

self-supervised learning – supervision comes from the data (no need to annotate)

Representation Learning

what is a good representation? – we need robustness (to handle domain shift)

Representation Learning

unsupervised domain adaptation (DA)

Representation Learning

in some contexts, discrete representations may be useful

Image Generation

variational autoencoders (VAEs)

Image Generation

GANs

Image Generation

image-to-image translation

Image Generation

video generation

Image Generation

neural radiance fields (NeRF)

Image Generation

instead of predicting the image, we predict the noise

Image Generation

we can use a simpler loss formula even though there's no theoretical justification for it

$L_t = \mathbb{E}_{t \sim [1,T],\, x_0,\, \epsilon_t}\left[\|\epsilon_t - \epsilon_\theta(x_t, t)\|^2\right]$
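
A hedged sketch of how this loss is computed in training (reusing the forward_diffusion helper from the Diffusion section and assuming a model(x_t, t) that predicts the noise; names are mine):

def ddpm_loss(model, x0, betas):
    # sample a random timestep per example, noise the input,
    # and regress the predicted noise onto the true noise
    B, T = x0.shape[0], betas.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)
    xt, eps = forward_diffusion(x0, t, betas)
    return ((eps - model(xt, t)) ** 2).mean()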

Image Generation

training and sampling algorithms

we want the samples to follow a distribution, not collapse to the mean → at each reverse step we add noise scaled according to the variance
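
Concretely, one reverse (ancestral sampling) step might look like this sketch (assuming the setup above, a scalar int timestep t, and the common choice $\sigma_t^2 = \beta_t$):

@torch.no_grad()
def reverse_step(model, xt, t, betas):
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    eps_pred = model(xt, t)
    # posterior mean: subtract the predicted noise, then rescale by 1/sqrt(alpha_t)
    mean = (xt - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps_pred) \
           / (1.0 - betas[t]).sqrt()
    if t == 0:
        return mean                    # final step is deterministic
    z = torch.randn_like(xt)           # the added randomness
    return mean + betas[t].sqrt() * z  # scaled by sigma_t = sqrt(beta_t)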

Image Generation

we want the distribution to be conditional

Image Generation

latent diffusion models

Image Generation

diffusion transformers (DiT)

conditioning is used to predict scale and shift (similar to StyleGAN)
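A minimal sketch of that modulation (adaptive LayerNorm, as used in DiT blocks; the class name and shapes are my assumptions):

import torch.nn as nn

class AdaLayerNorm(nn.Module):
    # map the conditioning vector (timestep/class embedding)
    # to a per-channel scale and shift for the normalized tokens
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        # x: (B, tokens, dim), cond: (B, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)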

Image Generation

2D → 3D latent space

Image Generation

why it changes everything

Image Generation

how can we reuse a pretrained diffusion model so that we can condition on spatial data (a sketch…)

Image Generation

personalized text-to-image

Multimodal Learning

multimodal learning

Multimodal Learning

sensor fusion – using information from diverse sensors to make predictions

Multimodal Learning

multimodal translation (from one modality to another)

Multimodal Learning

hybrid tasks

Multimodal Learning

multimodal alignment (identifying and modeling correspondences)

Multimodal Learning

masked image model vs. language model

Multimodal Learning

training LLMs

Multimodal Learning

multimodal LLMs

Multimodal Learning

image-to-text vs. text-to-image

Hooray, you're done! 🎉
If my flashcards helped you, you can buy me a beer.