# Lecture

- supervised ML
  - training data $(x^i,y^i)$
  - given a new $x^0$, provide a prediction $\tilde y^0$
  - consider a class $\varphi$ of functions
  - find a function $f_\theta\in\varphi$ (i.e., find parameters $\theta$) that minimizes some loss function
    - $\mathrm{argmin}_\theta E(\theta)$
    - where $E(\theta):=\sum_{i=1}^n\ell(f_\theta(x^i),y^i)$
- gradient descent
  - $\theta_{t+1}\leftarrow\theta_t-\eta_t\cdot\nabla E(\theta_t)$
  - works well for convex functions $E$
  - doesn't work well for non-convex $E$
    - in practice, it nevertheless works well for neural networks
- multi-layer neural networks
  - $x_{\ell+1}=\sigma(W_\ell x_\ell+b_\ell)$
    - $\sigma$ … activation function (sigmoid, ReLU, …)
    - $W$ … weight matrix
    - $b$ … bias vector
  - for two layers
    - $f_\theta(x)=\frac1n\sum_{k=1}^n u_k\cdot\sigma(\braket{v_k,x}+b_k)$
    - $n$ … number of neurons in the hidden layer
    - weights $\theta=(u_k,v_k,b_k)_{k=1}^n$
  - universal approximation theorem (Cybenko '89, Hornik '89)
    - any continuous function $f:\mathbb R^d\to\mathbb R$ can be approximated by a 2-layer neural network
    - (Barron '93): $\|g-f_\theta\|\leq \frac{O(R\|g\|)}{\sqrt n}$
    - conclusion: a large 2-layer neural network is good enough
  - Chizat and Bach '18: gradient descent (gradient flow) converges to the global minimum for 2-layer neural networks
    - we don't know how long it could take
- very deep networks (CNN, Transformer, …)
  - residual networks (ResNet '16)
    - $x_{\ell+1}=x_\ell+\frac 1L U_\ell^T\cdot\sigma(V_\ell x_\ell+b_\ell)$
    - $U_\ell,V_\ell\in\mathbb R^{n\times d}$ … weight matrices
    - $n$ … number of neurons
    - $x_0\xrightarrow{f_\theta} x_L$
    - $L$ layers
  - neural differential equation
    - for $L\to\infty$
    - $\frac{dx_s}{ds}=U_s^T\cdot\sigma(V_sx_s+b_s)$, where $s\in(0,1)$
    - theorem (2024): if the starting point is close enough to a local minimum, then the flow induced by this equation converges to that local minimum
- attention, transformers, BERT
  - modeling sequences with RNNs
  - attention in neural networks – originally introduced in computer vision
  - problems of RNNs: long-range dependencies, vanishing (and exploding) gradients, large number of training steps, no parallel computation
    - transformers solve all four of these
  - complexity per layer
    - for self-attention … $O(n^2d)$
  - BERT – just the encoder
  - how to process a long sequence
    - divide it into smaller chunks
    - optimize (attention)
  - recommended watch: Transformers, the tech behind LLMs (3Blue1Brown)
  - what happens in the first layer
    - $E^{(1)}=\mathrm{Norm}(E^{(1)}_1+\mathrm{FFN}(E^{(1)}_1))$
    - $E^{(1)}_1=\mathrm{Norm}(E^{(0)}+\mathrm{MATL}(E^{(0)}))$
    - FFN … feed-forward network (multi-layer perceptron)
    - MATL … multi-head attention layer
    - $E^{(0)}$ … embeddings with positional encoding
  - similarly for layer $\ell$: $E^{(\ell)}_1=\mathrm{Norm}(E^{(\ell-1)}+\mathrm{MATL}^{(\ell)}(E^{(\ell-1)}))$ and $E^{(\ell)}=\mathrm{Norm}(E^{(\ell)}_1+\mathrm{FFN}(E^{(\ell)}_1))$
  - example
    - input text: Machine learning is learning from
    - tokenization: machine learn ing is learn ing from
    - embeddings: $E^{(0)}_1,E^{(0)}_2,E^{(0)}_3,E^{(0)}_4,E^{(0)}_5,E^{(0)}_6,E^{(0)}_7$
      - where $E^{(0)}_2=E^{(0)}_5$ and $E^{(0)}_3=E^{(0)}_6$ (repeated tokens share the same token embedding)
  - $|V|$ … vocabulary size (number of possible tokens)
    - 50 257 tokens in GPT-2/GPT-3
  - $d_{\mathrm{model}}$ … dimension of the context-independent vector representing a token
    - 12 288 in GPT-3
  - $W_E$ (embedding matrix) has $|V|$ columns and $d_\mathrm{model}$ rows; it is initialized randomly
  - adding positional information (a small sketch follows below)
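To make the embedding step concrete, here is a minimal NumPy sketch of $E^{(0)}_n=E_n+\mathrm{Pos}(n)$ with toy dimensions. The lecture does not fix the positional-encoding formula; the sinusoidal scheme below (from the original Transformer paper) and the made-up token ids are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (the lecture's GPT-3 figures would be |V| = 50 257, d_model = 12 288).
V_SIZE, D_MODEL = 10, 8

# W_E has d_model rows and |V| columns; column i is the embedding of token i.
W_E = rng.normal(size=(D_MODEL, V_SIZE))  # initialized randomly, learned during training

def positional_encoding(n: int, d_model: int) -> np.ndarray:
    """Sinusoidal Pos(n), one possible choice of positional information."""
    i = np.arange(d_model // 2)
    angles = n / (10000 ** (2 * i / d_model))
    pos = np.empty(d_model)
    pos[0::2] = np.sin(angles)
    pos[1::2] = np.cos(angles)
    return pos

# "Machine learning is learning from" -> token ids after tokenization (made-up ids);
# "learn" (7) and "ing" (1) repeat, so they share the same column of W_E.
token_ids = [3, 7, 1, 5, 7, 1, 4]

# E^(0)_n = E_n + Pos(n): the raw embedding depends only on the token id,
# the Pos(n) term then injects the position in the sequence.
E0 = np.stack([W_E[:, t] + positional_encoding(n, D_MODEL)
               for n, t in enumerate(token_ids, start=1)], axis=1)
print(E0.shape)  # (d_model, N) = (8, 7)
```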
- MATL … multi-head attention layer, in more detail
  - attention output: $V\cdot\mathrm{softmax}\!\left(\frac{K^\top Q}{\sqrt{d_k}}\right)$ (softmax taken column-wise over the token index)
    - $K=W_kE$, $Q=W_qE$, $V=W_vE$
  - attention helps to distinguish the meaning of a token based on its context
  - sometimes we need to use masking (in a decoder – otherwise it could just read the next token from the target output); a small sketch of one masked attention head follows after this part
  - per token: $\Delta E_i^{(\ell)}=W_o\left(\sum_{j=1}^i\mathrm{softmax}_j(K_j\cdot Q_i)\,V_j\right)$ (the sum up to $i$ is the causal mask)
  - dimensions of the weight matrices ($W_k,W_q,W_v$) … 128 × 12 288
    - 128 … size of the hidden (head) space
    - the output vector of one attention head has size 128
  - we have 96 attention heads
    - so just by concatenating the outputs, we get $128 \cdot 96 = 12\,288$
  - to get the output of the MATL, we concatenate the outputs of the individual heads and multiply by $W_o$
    - dimensions of $W_o$ are 12 288 × 12 288
- Norm … just normalize (subtract the mean, divide by the standard deviation)
- parameter check
  - embedding
    - $W_E$ … 12 288 × 50 257
    - $W_U$ … 12 288 × 50 257
      - unembedding matrix; it computes a similarity score (logit) for each token in the vocabulary (then we select the closest one)
  - attention
    - $W_q$ … 128 × 12 288 × 96 heads × 96 layers
    - same for $W_k,W_v$
    - $W_o$ … 12 288 × 12 288 × 96 layers
  - MLP
    - 115 (?) billion parameters
    - MLP – projection to a vector space with 4 times more dimensions (and back)
    - ReLU activation
- recent technical revolutions
  - ResNet – very deep networks
  - Transformers – adapted to next-token prediction
  - Adam – suitable for transformer optimization
- residual NN (ResNet)
  - $x(0)=x$
  - $x(k+1)=x(k)+w(k)\cdot\sigma(a(k)\cdot x(k)+b(k))$, for $0\leq k\leq L-1$
  - for $L\to\infty$, the ResNet becomes $\dot x(t)=w(t)\cdot\sigma(a(t)\cdot x(t)+b(t))$
  - derivative behavior
- NNs for text generation are trained to minimize the distance of the generated text from the target
- transformer and attention mechanism
  - replaces previous mechanisms (like convolution) with attention
  - first approach, similar to the ResNet ODE
    - $\forall i\leq n:\ \dot x_i(t)=\frac 1Z \sum_{j=1}^n e^{\beta\braket{Q(t)x_i(t),K(t)x_j(t)}}\cdot V(t)x_j(t)$
    - parameters
      - matrices $Q,K,V$
      - scalar $\beta$ (hyperparameter?)
      - $Z$ … normalization factor
  - theorem
    - for “simple functions”, there exists an assignment that achieves a $(1+\varepsilon)$-optimum, and the assignment has the form $x_i(t)\sim e^{\text{something}}$
      - where *something* contains a $\frac 1\varepsilon$ factor
    - → if you have a simple function, you can use attention in a simple way
  - assume there is no causality in the order of tokens; then
    - $x(t+1)=\delta_s$, where $\delta_s$ is the indicator vector equal to 1 at the next token $s$ and 0 elsewhere
    - better: $x(t+1)=\mu$, a probability distribution over all tokens in the dictionary
  - theorem
    - under non-causality, for $t\to+\infty$
      - $\mu\to\delta_s$ (Dirac distribution)
      - $\|\mu(t)-\delta_{s^+}\|\leq\frac{O(1)}{f(t)}$
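Before the pseudo-code view below, here is a minimal NumPy sketch of a single masked (causal) attention head, in the column convention used above ($K=W_kE$, columns of $E$ are token embeddings). The toy dimensions and random weights are illustrative assumptions, not the lecture's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)

D_MODEL, D_K, D_V, N = 8, 4, 4, 5   # toy sizes; GPT-3 uses d_model = 12 288, d_k = 128

E = rng.normal(size=(D_MODEL, N))        # columns = token embeddings
W_q = rng.normal(size=(D_K, D_MODEL))
W_k = rng.normal(size=(D_K, D_MODEL))
W_v = rng.normal(size=(D_V, D_MODEL))
W_o = rng.normal(size=(D_MODEL, D_V))    # single head here, so W_o maps d_v -> d_model

Q, K, V = W_q @ E, W_k @ E, W_v @ E      # column j is Q_j, K_j, V_j

scores = (K.T @ Q) / np.sqrt(D_K)        # scores[j, i] = K_j . Q_i / sqrt(d_k)

# Causal mask (decoder): token i may only attend to tokens j <= i,
# so it cannot "read" the next token from the target output.
j_idx, i_idx = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
scores = np.where(j_idx <= i_idx, scores, -np.inf)

# Softmax over j, column-wise (the max subtraction is only for numerical stability).
weights = np.exp(scores - scores.max(axis=0, keepdims=True))
weights /= weights.sum(axis=0, keepdims=True)

delta_E = W_o @ (V @ weights)            # Delta E_i = W_o * sum_{j<=i} softmax_j(...) V_j
print(delta_E.shape)                     # (d_model, N)
```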
- a synthetic, pseudo-code-like view of transformers (for the exam)
  - see the notes in Chamilo :)
  - decoder-only model
  - definitions
    - $V$ … vocabulary of $|V|$ tokens, $W_E\in\mathbb R^{d_\mathrm{model}\times|V|}$
      - columns of $W_E$ correspond to token embeddings in the $d_\mathrm{model}$-dimensional vector space
      - $\overrightarrow E_i$ denotes the embedding of the $i$-th token
    - $L$ … number of layers
    - $M$ … number of heads
    - $W_q^{(\ell,m)}\in\mathbb R^{d_q\times d_\mathrm{model}},\ W_k^{(\ell,m)}\in\mathbb R^{d_k\times d_\mathrm{model}},\ W_v^{(\ell,m)}\in\mathbb R^{d_v\times d_\mathrm{model}}$
      - $Q,K,V$ matrices for each layer and head
    - $N$ … length of the input sequence
  - For $n=1$ to $N$
    - $\overrightarrow E_n^{(0)}=\overrightarrow E_n+\overrightarrow{\mathrm{Pos}(n)}$
      - embedding with positional information, $\in\mathbb R^{d_\mathrm{model}}$
    - For $\ell=1$ to $L$
      - For $m=1$ to $M$
        - we define, for $j\leq n$
          - $\overrightarrow Q_j^{(\ell,m)}=W_q^{(\ell,m)}\overrightarrow E_j^{(\ell-1)}$
          - $\overrightarrow K_j^{(\ell,m)}=W_k^{(\ell,m)}\overrightarrow E_j^{(\ell-1)}$
          - $\overrightarrow V_j^{(\ell,m)}=W_v^{(\ell,m)}\overrightarrow E_j^{(\ell-1)}$
        - $\overrightarrow O_n^{(\ell,m)}=\sum_{j=1}^n\mathrm{softmax}_j\!\left(\frac{\overrightarrow Q_n^{(\ell,m)}\cdot \overrightarrow K_j^{(\ell,m)}}{\sqrt{d_k}}\right)\overrightarrow V_j^{(\ell,m)}$
      - $\overrightarrow O_n^{(\ell)}=\mathrm{concat}(\overrightarrow O_n^{(\ell,1)},\dots,\overrightarrow O_n^{(\ell,M)})$
        - column vector in $\mathbb R^{d_v\cdot M}$
      - $\Delta\overrightarrow E_n^{(\ell)}=W_O^{(\ell)}\cdot\overrightarrow O_n^{(\ell)}\quad (\in\mathbb R^{d_\mathrm{model}})$
        - $W_O^{(\ell)}\in\mathbb R^{d_\mathrm{model}\times (d_v\cdot M)}$
      - $\overrightarrow E_{n,1}^{(\ell)}=\mathrm{LayerNorm}(\overrightarrow E_n^{(\ell-1)}+\Delta\overrightarrow E_n^{(\ell)})$
      - $\overrightarrow E_{n}^{(\ell)}=\mathrm{LayerNorm}\!\left(\overrightarrow E_{n,1}^{(\ell)}+W_\downarrow^{(\ell)}\,\mathrm{GELU}(W_\uparrow^{(\ell)}\overrightarrow E_{n,1}^{(\ell)}+\overrightarrow B^{(\ell)}_\uparrow)+\overrightarrow B^{(\ell)}_\downarrow\right)$
        - contextualized embedding of the $n$-th token after layer $\ell$
        - $W_\uparrow^{(\ell)}\in\mathbb R^{d\times d_\mathrm{model}},\ B_\uparrow^{(\ell)}\in\mathbb R^{d},\ W_\downarrow^{(\ell)}\in\mathbb R^{d_\mathrm{model}\times d},\ B_\downarrow^{(\ell)}\in\mathbb R^{d_\mathrm{model}}$
    - $\overrightarrow Z=W_U\cdot\overrightarrow E_n^{(L)}$
      - logit vector for the next word
      - $W_U\in\mathbb R^{{|V|}\times d_\mathrm{model}},\ \overrightarrow Z\in\mathbb R^{|V|}$
      - the logit vector provides a weight indicating the proximity of each token to $\overrightarrow E_n^{(L)}$
    - $P=\mathrm{softmax}(\overrightarrow Z)$
      - $p_i=\frac{e^{z_i/\alpha}}{\sum_{j=1}^{|V|} e^{z_j/\alpha}}$ … with temperature $\alpha\to 0$, $P$ approaches a Dirac distribution
      - probability distribution over tokens, derived from the logit vector
  - *training*
    - compute the loss between $P$ and the true distribution $Q$ (a one-hot vector)
    - $L(P,Q)=-\sum_{i=1}^{|V|} Q_i\log P_i$ … cross-entropy (NLL) loss
    - backpropagation
      - $\theta\to\theta-\eta\nabla_\theta L(P,Q)$
  - *inference*
    - we have $P$
    - for example, we can sample at random from the top $k$ tokens (usually weighted by $P$) – see the sketch below
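The last three steps (temperature softmax, cross-entropy loss, top-$k$ sampling) fit in a few lines of NumPy. This is a minimal sketch with a toy logit vector, not the lecture's exact setup.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax_with_temperature(z: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """p_i = exp(z_i / alpha) / sum_j exp(z_j / alpha); alpha -> 0 approaches a Dirac."""
    z = z / alpha
    z = z - z.max()                 # numerical stability, does not change the result
    p = np.exp(z)
    return p / p.sum()

def cross_entropy(P: np.ndarray, Q: np.ndarray) -> float:
    """L(P, Q) = -sum_i Q_i log P_i, with Q the one-hot target distribution."""
    return float(-(Q * np.log(P + 1e-12)).sum())

def sample_top_k(P: np.ndarray, k: int = 3) -> int:
    """Inference: keep the k most probable tokens, renormalize, sample one of them."""
    top = np.argsort(P)[-k:]
    p_top = P[top] / P[top].sum()
    return int(rng.choice(top, p=p_top))

Z = rng.normal(size=10)             # toy logit vector (|V| = 10)
Q = np.eye(10)[4]                   # one-hot target: the true next token is token 4

for alpha in (2.0, 1.0, 0.1):       # lower temperature -> sharper distribution
    P = softmax_with_temperature(Z, alpha)
    print(alpha, round(cross_entropy(P, Q), 3), sample_top_k(P))
```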
- encoder × decoder
  - the encoder aims at building a representation of the sequence
  - in our decoder-only model we also get representations, but they are not that good
  - also, an encoder would be bidirectional
  - a representation of the whole sequence can be obtained by pooling the representations of the individual tokens
  - models with both an encoder and a decoder are used for machine translation
    - encoder – bidirectional, to encode the meaning of the original sentence
    - decoder – unidirectional, to generate a new sentence based on the meaning of the original sentence and the previously generated words
    - in the decoder-only model, there is no “original sentence”
- instruction fine-tuning
  - step 1: pretraining – only focused on predicting the next token
  - step 2: supervised fine-tuning (SFT) on various tasks – summarization, translation, sentiment analysis, text classification, …
    - using cross-entropy loss
  - step 3: instruction fine-tuning using RLHF (reinforcement learning from human feedback) – the model provides several responses for a given prompt and a human ranks them
- direct preference optimization (DPO)
  - we use pairwise preferences
    - $x$ … prompt
    - $y_w$ … preferred output
    - $y_\ell$ … less preferred output
  - ![DPO loss](https://miro.medium.com/v2/resize:fit:1400/1*LrLZgBo-4mHScN-75oWK-A.png)
    - $\sigma$ … sigmoid function
    - $\beta$ … regularization parameter
    - $\pi_\theta$ … probability distribution of our model with parameters $\theta$
    - $\pi_\mathrm{ref}$ … probability distribution of the reference model we get after SFT
    - we want to maximize the first term and minimize the second term (a small sketch of this loss follows at the end of this section)
- teaching *DeepSeek-R1 Zero* to reason
  - simple prompt
  - reward based on accuracy and formatting
  - GRPO loss (group relative policy optimization)
    - we maximize $A_\theta(o,q)=\frac{\pi_\theta(o\mid q)}{\pi_{\theta_\mathrm{old}}(o\mid q)}\, r(o)$
      - $q$ … question
      - $o$ … answer
      - $r$ … reward (positive for desirable answers, otherwise negative)
    - to achieve stability
      - clip the probability ratio $\frac{\pi_\theta(o\mid q)}{\pi_{\theta_\mathrm{old}}(o\mid q)}$ to the interval $(1-\varepsilon,1+\varepsilon)$
      - add a KL-divergence penalty between $\pi_\theta$ and $\pi_{\theta_\mathrm{old}}$
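The notes reference the DPO loss only via the image above; below is a minimal sketch of the standard DPO objective as it is usually stated, written with the symbols from the notes ($\beta$, $\sigma$, $\pi_\theta$, $\pi_\mathrm{ref}$, $y_w$, $y_\ell$). Treating the per-sequence log-probabilities as given inputs is an assumption made for brevity.

```python
import math

def dpo_loss(logp_theta_w: float, logp_theta_l: float,
             logp_ref_w: float, logp_ref_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt x, preferred y_w, less preferred y_l) pair.

    Inputs are log pi(y | x) of the whole output sequences, under the current
    model (theta) and the frozen reference model obtained after SFT.
    """
    # Implicit rewards: how much the current model up-weights each answer vs. the reference.
    reward_w = beta * (logp_theta_w - logp_ref_w)   # should grow for the preferred answer
    reward_l = beta * (logp_theta_l - logp_ref_l)   # should shrink for the other answer
    margin = reward_w - reward_l
    # -log sigmoid(margin); minimizing this pushes the margin up.
    return math.log1p(math.exp(-margin))

# Toy numbers: the current model already slightly prefers y_w relative to the reference.
print(round(dpo_loss(-12.0, -15.0, -12.5, -14.0, beta=0.1), 4))
```

The two `reward_*` terms correspond to the “first term” and “second term” mentioned above: the loss decreases when the model assigns relatively more probability to $y_w$ and relatively less to $y_\ell$ than the reference model does.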
## Multimodal AI

- Pia Bideau, PhD
- fully connected neural networks – impractical for images (too many weights)
- convolution
  - “filter”
  - we move a function over the signal and integrate
  - what to do at the ends?
    - shrink or pad
- a CNN learns the filters used to transform the images
- advantages of CNNs
  - spatial locality (local receptive fields) – every neuron looks at a small patch of the image
  - parameter sharing – we don't need that many weights
  - translation equivariance – we don't need to preprocess the images that much (object detection works no matter where the object is in the image)
- downsampling approaches
  - stride – we slide the filter with a step size larger than one
  - pooling – we apply a function (usually max) over a patch
  - if pixel-level outputs are expected, we need to use upsampling afterwards
- upsampling approaches
  - nearest neighbor (we just copy the value)
  - bed of nails (we put the value in the upper-left corner and use zeros elsewhere)
  - max unpooling (we need to remember where we took the maximum from, then put the value back there and zeros elsewhere)
- VGG architecture
  - uses 3×3 convolutions everywhere
  - receptive field size
    - in the original image, the receptive field is 1×1
    - in the first layer, the receptive field is 3×3
    - by applying the convolution to the already convolved pixels, we get a 5×5 receptive field in the second layer
    - the formula (for stride 1): $RF_0=1,\ RF_i=RF_{i-1}+(K-1)$
      - $K$ … convolution kernel size ($K=3$ for a 3×3 filter)
- other architectures: ResNet, Inception, GoogLeNet, U-Net
- RNNs (see the sketch below)
  - hidden state … a combination of the current input and the previous hidden state
  - usually tanh activation
  - there can be one or multiple outputs
  - backpropagation (through time) becomes intractable for long sequences
    - so truncated backpropagation may be used
  - we can even have multiple layers
  - problem: vanishing or exploding gradients
    - GRU and LSTM units are used to solve this problem
  - attention
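A minimal NumPy sketch of the vanilla RNN recurrence described above (toy sizes and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

D_IN, D_HID, T = 4, 6, 10           # toy input size, hidden size, sequence length

W_xh = rng.normal(scale=0.5, size=(D_HID, D_IN))
W_hh = rng.normal(scale=0.5, size=(D_HID, D_HID))
b_h  = np.zeros(D_HID)

xs = rng.normal(size=(T, D_IN))     # an input sequence of T vectors
h = np.zeros(D_HID)                 # initial hidden state

hidden_states = []
for x_t in xs:
    # Hidden state = combination of the current input and the previous hidden state,
    # squashed through tanh (the usual activation for vanilla RNNs).
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    hidden_states.append(h)

print(np.stack(hidden_states).shape)    # (T, D_HID): one hidden state per time step
# Backpropagating through all T steps multiplies by W_hh^T (times tanh') at every step,
# which is where vanishing/exploding gradients come from; GRU/LSTM cells mitigate this.
```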