E_n^{(ℓ)} … contextualized embedding of the n-th token after layer ℓ
W_↑ ∈ ℝ^{d × d_model}, B_↑ ∈ ℝ^d, W_↓ ∈ ℝ^{d_model × d}, B_↓ ∈ ℝ^{d_model}
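These look like the up- and down-projections of the transformer's feed-forward (MLP) sub-layer; a minimal PyTorch sketch, assuming a GELU non-linearity (the class and attribute names are illustrative):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Transformer MLP block: up-project to width d, apply a non-linearity, project back."""
    def __init__(self, d_model: int, d: int):
        super().__init__()
        self.up = nn.Linear(d_model, d)      # W_↑, B_↑
        self.down = nn.Linear(d, d_model)    # W_↓, B_↓
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_model) -> (..., d_model)
        return self.down(self.act(self.up(x)))
```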
Z = W_U · E_n^{(L)}
logit vector for next word
W_U ∈ ℝ^{|V| × d_model}, Z ∈ ℝ^{|V|}
the logit vector provides a weight indicating the proximity of each token to E_n^{(L)}
P=softmax(Z)
p_i = e^{z_i/α} / ∑_{j=1}^{|V|} e^{z_j/α} … as α → 0, this approaches a Dirac (one-hot) distribution
probability distribution over tokens derived from logit vector
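A minimal numpy sketch of this temperature softmax (the max-subtraction is only for numerical stability; the function name is illustrative):

```python
import numpy as np

def softmax_with_temperature(z: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """p_i = exp(z_i / alpha) / sum_j exp(z_j / alpha)."""
    scaled = z / alpha
    scaled = scaled - scaled.max()           # subtract the max for numerical stability
    e = np.exp(scaled)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax_with_temperature(z, alpha=1.0))    # smooth distribution over tokens
print(softmax_with_temperature(z, alpha=0.01))   # nearly one-hot (Dirac-like)
```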
training
compute the loss between P and the true distribution Q (a one-hot vector)
L(P,Q) = −∑_{i=1}^{|V|} Q_i log P_i … cross-entropy (NLL) loss
backpropagation
θ → θ − η ∇_θ L(P,Q)
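A hedged PyTorch sketch of one training step, assuming `model` maps token ids to the logits Z at every position; `F.cross_entropy` computes exactly the NLL against the one-hot Q:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, tokens: torch.Tensor) -> float:
    """One step of next-token prediction on a batch of token ids of shape (B, T)."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t+1 from tokens <= t
    logits = model(inputs)                            # (B, T-1, |V|)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten to (B*(T-1), |V|)
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()                                   # backpropagation
    optimizer.step()                                  # theta <- theta - eta * grad
    return loss.item()
```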
inference
we have P
for example, we can sample at random from the top-k most likely tokens (usually weighting the draw by P)
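A small PyTorch sketch of top-k sampling from the logit vector Z (names are illustrative):

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> int:
    """Sample the next token id from the k most likely tokens, weighted by P."""
    top_logits, top_ids = torch.topk(logits, k)             # keep the k largest logits
    probs = torch.softmax(top_logits / temperature, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)        # draw one of the k indices
    return top_ids[choice].item()
```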
encoder vs. decoder
encoder aims at building a representation of the sequence
in our decoder-only model, the hidden states can serve as encodings, but they are not particularly good ones
also, encoder would be bidirectional
a representation of the whole sequence can be obtained by pooling the representations of the individual tokens (see the sketch after this list)
models with both encoder and decoder are used for machine translation
encoder – bidirectional, to encode the meaning of the original sentence
decoder – unidirectional, to generate a new sentence based on the meaning of the original sentence and the previously generated words
in the decoder-only model, there's no “original sentence”
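A sketch of the pooling mentioned above, using mean pooling over non-padding tokens (a common choice, but not the only one; shapes are assumptions):

```python
import torch

def mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average token representations (B, T, d) into one sequence representation (B, d).

    `mask` is 1 for real tokens and 0 for padding, shape (B, T).
    """
    mask = mask.unsqueeze(-1).float()          # (B, T, 1)
    summed = (hidden * mask).sum(dim=1)        # sum over the real tokens only
    counts = mask.sum(dim=1).clamp(min=1.0)    # avoid division by zero
    return summed / counts
```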
instruction fine-tuning
step 1: pretraining – only focused on predicting the next token
step 2: supervised fine-tuning (SFT) on various tasks – summarization, translation, sentiment analysis, text classification, …
using cross-entropy loss
step 3: instruction fine-tuning using RLHF (reinforcement learning from human feedback) – the model generates several responses for a given prompt and a human ranks them
direct preference optimization (DPO)
we use pairwise preferences
x … prompt
y_w … preferred output
y_ℓ … less preferred output
σ … sigmoid function
β … regularization parameter
π_θ … probability distribution of our model with parameters θ
π_ref … probability distribution of the reference model we get after SFT
we want to maximize the first log-ratio term (for y_w) and minimize the second one (for y_ℓ); the loss is written out below
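for reference, the loss these two terms come from, written with the symbols above (this is the standard formulation from the DPO paper, reconstructed here rather than taken from the slides):

L_DPO(π_θ; π_ref) = −E_{(x, y_w, y_ℓ)} [ log σ( β log (π_θ(y_w∣x) / π_ref(y_w∣x)) − β log (π_θ(y_ℓ∣x) / π_ref(y_ℓ∣x)) ) ]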
teaching DeepSeek-R1-Zero to reason
simple prompt
reward based on accuracy and formatting
GRPO loss (group relative policy optimization)
we maximize A_θ(o,q) = (π_θ(o∣q) / π_{θ_old}(o∣q)) · r(o)
q … question
o … answer
r … reward (positive for desirable answers, otherwise negative)
to achieve stability
clip the probability ratio π_θ(o∣q) / π_{θ_old}(o∣q) to the interval (1−ε, 1+ε)
add KL-divergence between πθ and πθold
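A minimal PyTorch sketch of the clipped surrogate for a single answer, assuming we already have the summed log-probabilities of the answer under the current and old policies and a scalar reward-derived advantage (the KL penalty from the bullet above would be added on top):

```python
import torch

def clipped_objective(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantage: torch.Tensor,
                      eps: float = 0.2) -> torch.Tensor:
    """PPO/GRPO-style clipped surrogate (to be maximized).

    logp_new, logp_old: log pi_theta(o|q) and log pi_theta_old(o|q)
    advantage: reward-derived advantage of the answer o
    """
    ratio = torch.exp(logp_new - logp_old)                # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)    # keep the policy update small
    return torch.minimum(ratio * advantage, clipped * advantage)
```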
Multimodal AI
Pia Bideau, PhD
fully connected neural network – impractical for images (too many weights)
convolution
“filter”
we move a function over the signal and integrate
what to do at the ends?
shrink or pad
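A numpy sketch of a 2-D convolution (cross-correlation, as CNNs actually compute it) with zero padding at the borders:

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray, pad: int = 1) -> np.ndarray:
    """Slide a k x k filter over the image; zero-pad so the output keeps (roughly) its size."""
    k = kernel.shape[0]
    padded = np.pad(image, pad, mode="constant")           # zeros at the borders
    h, w = image.shape
    out = np.zeros((h + 2 * pad - k + 1, w + 2 * pad - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

edge_filter = np.array([[-1, 0, 1]] * 3, dtype=float)      # simple 3x3 vertical-edge filter
print(conv2d(np.random.rand(5, 5), edge_filter).shape)     # (5, 5) with pad=1 and k=3
```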
a CNN learns the filters used to transform the images
advantages
spatial locality (local receptive fields) – every neuron is looking at a small patch of the image
parameter sharing – we don't need that many weights
translation equivariance – much less preprocessing of the images is needed (object detection works no matter where the object appears in the image)
downsampling approaches
stride – we are sliding the filter with a step size larger than one
pooling – we apply a function (usually max) over a patch
if pixel-level outputs are expected, we need to use upsampling afterwards
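A PyTorch sketch comparing the two downsampling routes above on a toy input (channel counts are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                  # (batch, channels, H, W)

strided = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
pooled = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1),
                       nn.MaxPool2d(kernel_size=2))

print(strided(x).shape)   # torch.Size([1, 8, 16, 16]) -- downsampled by the stride
print(pooled(x).shape)    # torch.Size([1, 8, 16, 16]) -- downsampled by max pooling
```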
upsampling approaches
nearest neighbor (we just copy the value)
bed of nails (we put the value in the upper-left corner and use zeros elsewhere)
max unpooling (we need to remember where the maximum was taken from, then put the value back there and zeros elsewhere)
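A PyTorch sketch of two of these upsampling options; MaxUnpool2d uses the indices remembered by the matching MaxPool2d:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)

# nearest neighbor: every value is simply copied into a 2x2 block
up_nn = nn.Upsample(scale_factor=2, mode="nearest")
print(up_nn(x).shape)                  # torch.Size([1, 1, 8, 8])

# max unpooling: pool while remembering where each maximum came from ...
pool = nn.MaxPool2d(kernel_size=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2)
pooled, indices = pool(x)
# ... then put the values back at those positions and zeros elsewhere
print(unpool(pooled, indices).shape)   # torch.Size([1, 1, 4, 4])
```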
VGG architecture
uses 3×3 convolutions everywhere
receptive field size
in the original image, the receptive field is 1×1
in the first layer, the receptive field is 3×3
by applying the convolution to already-convolved pixels, we get a 5×5 receptive field in the second layer
the formula: RF_0 = 1, RF_i = RF_{i−1} + (K − 1)
K … convolution kernel size (K=3 for a 3×3 filter)
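A tiny helper applying this formula to a stack of 3×3, stride-1 convolutions (as in VGG):

```python
def receptive_field(num_layers: int, k: int = 3) -> int:
    """RF_0 = 1, RF_i = RF_{i-1} + (K - 1) for stride-1 convolutions."""
    rf = 1
    for _ in range(num_layers):
        rf += k - 1
    return rf

print([receptive_field(n) for n in range(4)])   # [1, 3, 5, 7]
```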
other architectures: ResNet, Inception, GoogLeNet, U-Net
RNNs
hidden state … combination of the current input and the previous hidden state
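A minimal sketch of this update as a vanilla (Elman-style) RNN cell; the tanh non-linearity and the weight names are the usual textbook choice, not anything specific from the lecture:

```python
import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    """h_t = tanh(W_x x_t + W_h h_{t-1} + b): mix the current input with the previous hidden state."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.w_x = nn.Linear(input_size, hidden_size, bias=True)
        self.w_h = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.w_x(x_t) + self.w_h(h_prev))
```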