supervised ML
gradient descent
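A minimal sketch of gradient descent on a toy least-squares objective (data, step size, and names are illustrative, not from the notes):

```python
import numpy as np

# Toy least-squares problem: minimise f(w) = (1/2n) * ||X w - y||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)                          # initial parameters
lr = 0.1                                 # step size (learning rate), illustrative
for step in range(500):
    grad = X.T @ (X @ w - y) / len(y)    # gradient of f at the current w
    w = w - lr * grad                    # gradient-descent update
print(w)                                 # should be close to w_true
```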
multi-layer neural networks
the case of two layers
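A minimal sketch of the two-layer case, f(x) = W2 σ(W1 x + b1) + b2, with one manual backpropagation step (dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5,))        # one input of dimension 5
y = np.array([1.0])              # scalar target

W1, b1 = rng.normal(size=(8, 5)) * 0.1, np.zeros(8)   # hidden layer (8 units)
W2, b2 = rng.normal(size=(1, 8)) * 0.1, np.zeros(1)   # output layer

# Forward pass: two layers with a ReLU non-linearity in between.
h = np.maximum(W1 @ x + b1, 0.0)
y_hat = W2 @ h + b2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass (chain rule) and one gradient-descent step.
g_out = y_hat - y                       # dloss/dy_hat
gW2, gb2 = np.outer(g_out, h), g_out
g_h = W2.T @ g_out
g_pre = g_h * (h > 0)                   # ReLU derivative
gW1, gb1 = np.outer(g_pre, x), g_pre

lr = 0.1
W1 -= lr * gW1; b1 -= lr * gb1
W2 -= lr * gW2; b2 -= lr * gb2
```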
Universal approximation theorem (Cybenko '89, Hornik '89)
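The statement in Cybenko's form: for any continuous sigmoidal σ, finite sums of sigmoids are dense in C([0,1]^n):

```latex
% Universal approximation (Cybenko '89): sums of sigmoids are dense in C([0,1]^n).
\forall f \in C([0,1]^n),\ \forall \varepsilon > 0,\ \exists N \in \mathbb{N},\
\alpha_i, \theta_i \in \mathbb{R},\ w_i \in \mathbb{R}^n \ \text{such that}\
\sup_{x \in [0,1]^n} \Bigl|\, f(x) - \sum_{i=1}^{N} \alpha_i\, \sigma\bigl(w_i^{\top} x + \theta_i\bigr) \Bigr| < \varepsilon .
```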
very deep networks (CNN, Transformer, …)
attention, transformers, BERT
example
recent technical revolutions
residual NN (ResNet)
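A minimal sketch of the residual idea, y = x + F(x), with F a small learned block (shapes and names are illustrative):

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x) with F a two-layer MLP; the identity shortcut carries the signal."""
    f = np.maximum(x @ W1, 0.0)   # first layer + ReLU
    f = f @ W2                    # second layer, back to the input dimension
    return x + f                  # skip connection: gradients also flow through the identity

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))                   # batch of 4 vectors, dimension 16
W1 = rng.normal(size=(16, 32)) * 0.1
W2 = rng.normal(size=(32, 16)) * 0.1
y = residual_block(x, W1, W2)                  # same shape as x
```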
transformer and attention mechanism
theorem
assume there is no causality in the order of the tokens; then (see the sketch below)
theorem
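A minimal sketch of single-head scaled dot-product attention, softmax(QKᵀ/√d)V, together with a numerical check of what the assumption above suggests: with no causal mask and no positional encoding, permuting the input tokens simply permutes the output. This reading of the theorem is an assumption; all names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V, no mask, no positions."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_tokens, n_tokens) similarities
    return softmax(scores) @ V                # weighted average of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                               # 6 tokens of dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)                   # self-attention

perm = rng.permutation(6)                                 # reorder the tokens
out_perm = attention(X[perm] @ Wq, X[perm] @ Wk, X[perm] @ Wv)
print(np.allclose(out_perm, out[perm]))                   # True: the output is permuted identically
```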
a concise, pseudo-code-like view of transformers (for the exam)
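In that spirit, a compact sketch of one (pre-norm) transformer encoder block, residual + attention then residual + MLP; single-head and without learned normalisation parameters to keep it short (all names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def transformer_block(x, p):
    """One pre-norm encoder block: x + Attention(LN(x)), then x + MLP(LN(x))."""
    h = layer_norm(x)
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    x = x + attn @ p["Wo"]                              # residual connection around attention
    h = layer_norm(x)
    x = x + np.maximum(h @ p["W1"], 0.0) @ p["W2"]      # residual connection around the MLP
    return x

rng = np.random.default_rng(0)
d = 16
p = {name: rng.normal(size=shape) * 0.1
     for name, shape in [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
                         ("Wo", (d, d)), ("W1", (d, 4 * d)), ("W2", (4 * d, d))]}
tokens = rng.normal(size=(10, d))       # 10 token embeddings (positional encoding omitted)
out = transformer_block(tokens, p)      # same shape: (10, d)
```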
encoder × decoder
instruction fine-tuning
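A minimal sketch of the core idea of supervised instruction fine-tuning: train with the usual next-token cross-entropy, but mask out the prompt tokens so the loss is computed only on the response (all names and shapes are illustrative):

```python
import numpy as np

def masked_next_token_loss(logits, targets, loss_mask):
    """Cross-entropy averaged over the positions where loss_mask == 1 (response tokens only)."""
    logits = logits - logits.max(-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    token_ll = log_probs[np.arange(len(targets)), targets]   # log-prob of each target token
    return -(token_ll * loss_mask).sum() / loss_mask.sum()

rng = np.random.default_rng(0)
vocab, seq = 50, 8
logits = rng.normal(size=(seq, vocab))              # model outputs for one sequence
targets = rng.integers(0, vocab, size=seq)          # next-token targets
loss_mask = np.array([0, 0, 0, 1, 1, 1, 1, 1])      # first 3 tokens are the instruction/prompt
print(masked_next_token_loss(logits, targets, loss_mask))
```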

teaching DeepSeek-R1-Zero to reason
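R1-Zero is trained with reinforcement learning against simple rule-based rewards (answer accuracy plus a format reward for putting the reasoning inside think tags). A toy sketch of that reward idea; the tags, weights, and parsing below are illustrative assumptions, not the DeepSeek implementation:

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: format bonus for <think>...</think>, accuracy bonus for the final answer."""
    reward = 0.0
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5                                    # format reward (illustrative weight)
    answer = completion.split("</think>")[-1].strip()    # text after the reasoning block
    if answer == reference_answer.strip():
        reward += 1.0                                    # accuracy reward (illustrative weight)
    return reward

print(rule_based_reward("<think>2+2 is 4</think> 4", "4"))   # 1.5
```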
convolution
a CNN learns the filters used to transform the images
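A minimal sketch of a 2D convolution (valid padding, stride 1, written as cross-correlation as in deep-learning libraries); in a CNN the entries of `kernel` are exactly the parameters that are learned (names are illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image and take dot products (valid padding, stride 1)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(8, 8))
edge_filter = np.array([[1.0, 0.0, -1.0],
                        [2.0, 0.0, -2.0],
                        [1.0, 0.0, -1.0]])   # a hand-crafted edge filter; a CNN learns such filters
print(conv2d(image, edge_filter).shape)      # (6, 6)
```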
VGG architecture
RNNs
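A minimal sketch of a vanilla RNN cell, h_t = tanh(Wx x_t + Wh h_{t-1} + b), unrolled over a sequence (shapes and names are illustrative):

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b):
    """Unroll h_t = tanh(Wx x_t + Wh h_{t-1} + b) over the sequence xs."""
    h = np.zeros(Wh.shape[0])
    hidden_states = []
    for x_t in xs:                        # process the tokens in order
        h = np.tanh(Wx @ x_t + Wh @ h + b)
        hidden_states.append(h)
    return np.stack(hidden_states)

rng = np.random.default_rng(0)
d_in, d_h = 4, 6
xs = rng.normal(size=(10, d_in))                  # a sequence of 10 inputs
Wx = rng.normal(size=(d_h, d_in)) * 0.1
Wh = rng.normal(size=(d_h, d_h)) * 0.1
b = np.zeros(d_h)
H = rnn_forward(xs, Wx, Wh, b)                    # (10, d_h) hidden states
```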