# Lecture

- supervised ML
  - training data $(x^i,y^i)$
  - given a new $x^0$, provide a prediction $\tilde y^0$
  - consider a class $\varphi$ of functions
  - find a function $f_\theta\in\varphi$ (i.e., find parameters $\theta$) that minimizes some loss function
    - $\mathrm{argmin}_\theta E(\theta)$
    - where $E(\theta):=\sum_{i=1}^n\ell(f_\theta(x^i),y^i)$
- gradient descent
  - $\theta_{t+1}\leftarrow\theta_t-\eta_t\cdot\nabla E(\theta_t)$
  - works well for convex functions $E$
  - doesn't work well for non-convex $E$
    - in practice, it nevertheless works well for neural networks
- multi-layer neural networks
  - $x_{\ell+1}=\sigma(W_\ell x_\ell+b_\ell)$
    - $\sigma$ … activation function (sigmoid, ReLU, …)
    - $W$ … weight matrix
    - $b$ … bias vector
  - for two layers
    - $f_\theta(x)=\frac1n\sum_{k=1}^n u_k\cdot\sigma(\braket{v_k,x}+b_k)$
    - $n$ … number of neurons in the hidden layer
    - weights $\theta=(u_k,v_k,b_k)_{k=1}^n$
  - universal approximation theorem (Cybenko '89, Hornik '89)
    - any continuous function $f:\mathbb R^d\to\mathbb R$ can be approximated by a 2-layer neural network
    - (Barron '93): $\|g-f_\theta\|\leq \frac{O(R\|g\|)}{\sqrt n}$
    - conclusion: a large 2-layer neural network is good enough
  - Chizat and Bach '18: gradient descent (gradient flow) converges to the global minimum for 2-layer neural networks
    - we don't know how long it could take
- very deep networks (CNN, Transformer, …)
  - residual networks (ResNet '16)
    - $x_{\ell+1}=x_\ell+\frac 1L U_\ell^T\cdot\sigma(V_\ell x_\ell+b_\ell)$
    - $U_\ell,V_\ell\in\mathbb R^{n\times d}$ … weight matrices
    - $n$ … number of neurons
    - $x_0\xrightarrow{f_\theta} x_L$
    - $L$ layers
  - neural differential equation
    - for $L\to\infty$
    - $\frac{dx_s}{ds}=U_s^T\cdot\sigma(V_sx_s+b_s)$, where $s\in(0,1)$
    - theorem (2024): if the starting point is close enough to a local minimum, then the flow induced by this equation converges to that local minimum
- attention, transformers, BERT
  - modeling sequences with RNNs
  - attention in neural networks – originally introduced in computer vision
  - problems of RNNs: long-range dependencies, vanishing (and exploding) gradients, large number of training steps, no parallel computation
    - transformers solve all four of these
  - complexity per layer
    - for self-attention … $O(n^2d)$
  - BERT – just the encoder
  - how to process a long sequence
    - divide it into smaller chunks
    - optimize (attention)
  - recommended watch: Transformers, the tech behind LLMs (3Blue1Brown)
  - what happens in the first layer
    - $E^{(1)}=\mathrm{Norm}(E^{(1)}_1+\mathrm{FFN}(E^{(1)}_1))$
    - $E^{(1)}_1=\mathrm{Norm}(E^{(0)}+\mathrm{MATL}(E^{(0)}))$
    - FFN … feed-forward network (multi-layer perceptron)
    - MATL … multi-head attention layer
    - $E^{(0)}$ … embeddings with positional encoding
  - similarly for layer $\ell$: $E^{(\ell)}_1=\mathrm{Norm}(E^{(\ell-1)}+\mathrm{MATL}^{(\ell)}(E^{(\ell-1)}))$ and $E^{(\ell)}=\mathrm{Norm}(E^{(\ell)}_1+\mathrm{FFN}(E^{(\ell)}_1))$
  - example
    - input text: Machine learning is learning from
    - tokenization: machine learn ing is learn ing from
    - embeddings: $E^{(0)}_1,E^{(0)}_2,E^{(0)}_3,E^{(0)}_4,E^{(0)}_5,E^{(0)}_6,E^{(0)}_7$
      - where $E^{(0)}_2=E^{(0)}_5$ and $E^{(0)}_3=E^{(0)}_6$ (repeated tokens share the same token embedding)
  - $|V|$ … vocabulary size (number of possible tokens)
    - 50 257 tokens in GPT-2/GPT-3
  - $d_{\mathrm{model}}$ … dimension of the context-independent vector representing a token
    - 12 288 in GPT-3
  - $W_E$ (embedding matrix) has $|V|$ columns and $d_\mathrm{model}$ rows; it is initialized randomly
  - adding positional information (a small sketch follows below)
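To make the embedding step concrete, here is a minimal NumPy sketch of $E^{(0)}_n=E_n+\mathrm{Pos}(n)$ with toy dimensions. The lecture does not fix the positional-encoding formula; the sinusoidal scheme below (from the original Transformer paper) and the made-up token ids are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (the lecture's GPT-3 figures would be |V| = 50 257, d_model = 12 288).
V_SIZE, D_MODEL = 10, 8

# W_E has d_model rows and |V| columns; column i is the embedding of token i.
W_E = rng.normal(size=(D_MODEL, V_SIZE))  # initialized randomly, learned during training

def positional_encoding(n: int, d_model: int) -> np.ndarray:
    """Sinusoidal Pos(n), one possible choice of positional information."""
    i = np.arange(d_model // 2)
    angles = n / (10000 ** (2 * i / d_model))
    pos = np.empty(d_model)
    pos[0::2] = np.sin(angles)
    pos[1::2] = np.cos(angles)
    return pos

# "Machine learning is learning from" -> token ids after tokenization (made-up ids);
# "learn" (7) and "ing" (1) repeat, so they share the same column of W_E.
token_ids = [3, 7, 1, 5, 7, 1, 4]

# E^(0)_n = E_n + Pos(n): the raw embedding depends only on the token id,
# the Pos(n) term then injects the position in the sequence.
E0 = np.stack([W_E[:, t] + positional_encoding(n, D_MODEL)
               for n, t in enumerate(token_ids, start=1)], axis=1)
print(E0.shape)  # (d_model, N) = (8, 7)
```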
- MATL … multi-head attention layer, in more detail
  - attention output: $V\cdot\mathrm{softmax}\!\left(\frac{K^\top Q}{\sqrt{d_k}}\right)$ (softmax taken column-wise over the token index)
    - $K=W_kE$, $Q=W_qE$, $V=W_vE$
  - attention helps to distinguish the meaning of a token based on its context
  - sometimes we need to use masking (in a decoder – otherwise it could just read the next token from the target output); a small sketch of one masked attention head follows after this part
  - per token: $\Delta E_i^{(\ell)}=W_o\left(\sum_{j=1}^i\mathrm{softmax}_j(K_j\cdot Q_i)\,V_j\right)$ (the sum up to $i$ is the causal mask)
  - dimensions of the weight matrices ($W_k,W_q,W_v$) … 128 × 12 288
    - 128 … size of the hidden (head) space
    - the output vector of one attention head has size 128
  - we have 96 attention heads
    - so just by concatenating the outputs, we get $128 \cdot 96 = 12\,288$
  - to get the output of the MATL, we concatenate the outputs of the individual heads and multiply by $W_o$
    - dimensions of $W_o$ are 12 288 × 12 288
- Norm … just normalize (subtract the mean, divide by the standard deviation)
- parameter check
  - embedding
    - $W_E$ … 12 288 × 50 257
    - $W_U$ … 12 288 × 50 257
      - unembedding matrix; it computes a similarity score (logit) for each token in the vocabulary (then we select the closest one)
  - attention
    - $W_q$ … 128 × 12 288 × 96 heads × 96 layers
    - same for $W_k,W_v$
    - $W_o$ … 12 288 × 12 288 × 96 layers
  - MLP
    - 115 (?) billion parameters
    - MLP – projection to a vector space with 4 times more dimensions (and back)
    - ReLU activation
- recent technical revolutions
  - ResNet – very deep networks
  - Transformers – adapted to next-token prediction
  - Adam – suitable for transformer optimization
- residual NN (ResNet)
  - $x(0)=x$
  - $x(k+1)=x(k)+w(k)\cdot\sigma(a(k)\cdot x(k)+b(k))$, for $0\leq k\leq L-1$
  - for $L\to\infty$, the ResNet becomes $\dot x(t)=w(t)\cdot\sigma(a(t)\cdot x(t)+b(t))$
  - derivative behavior
- NNs for text generation are trained to minimize the distance of the generated text from the target
- transformer and attention mechanism
  - replaces previous mechanisms (like convolution) with attention
  - first approach, similar to the ResNet ODE
    - $\forall i\leq n:\ \dot x_i(t)=\frac 1Z \sum_{j=1}^n e^{\beta\braket{Q(t)x_i(t),K(t)x_j(t)}}\cdot V(t)x_j(t)$
    - parameters
      - matrices $Q,K,V$
      - scalar $\beta$ (hyperparameter?)
      - $Z$ … normalization factor
  - theorem
    - for “simple functions”, there exists an assignment that achieves a $(1+\varepsilon)$-optimum, and the assignment has the form $x_i(t)\sim e^{\text{something}}$
      - where *something* contains a $\frac 1\varepsilon$ factor
    - → if you have a simple function, you can use attention in a simple way
  - assume there is no causality in the order of tokens; then
    - $x(t+1)=\delta_s$, where $\delta_s$ is the indicator vector equal to 1 at the next token $s$ and 0 elsewhere
    - better: $x(t+1)=\mu$, a probability distribution over all tokens in the dictionary
  - theorem
    - under non-causality, for $t\to+\infty$
      - $\mu\to\delta_s$ (Dirac distribution)
      - $\|\mu(t)-\delta_{s^+}\|\leq\frac{O(1)}{f(t)}$
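Before the pseudo-code view below, here is a minimal NumPy sketch of a single masked (causal) attention head, in the column convention used above ($K=W_kE$, columns of $E$ are token embeddings). The toy dimensions and random weights are illustrative assumptions, not the lecture's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)

D_MODEL, D_K, D_V, N = 8, 4, 4, 5   # toy sizes; GPT-3 uses d_model = 12 288, d_k = 128

E = rng.normal(size=(D_MODEL, N))        # columns = token embeddings
W_q = rng.normal(size=(D_K, D_MODEL))
W_k = rng.normal(size=(D_K, D_MODEL))
W_v = rng.normal(size=(D_V, D_MODEL))
W_o = rng.normal(size=(D_MODEL, D_V))    # single head here, so W_o maps d_v -> d_model

Q, K, V = W_q @ E, W_k @ E, W_v @ E      # column j is Q_j, K_j, V_j

scores = (K.T @ Q) / np.sqrt(D_K)        # scores[j, i] = K_j . Q_i / sqrt(d_k)

# Causal mask (decoder): token i may only attend to tokens j <= i,
# so it cannot "read" the next token from the target output.
j_idx, i_idx = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
scores = np.where(j_idx <= i_idx, scores, -np.inf)

# Softmax over j, column-wise (the max subtraction is only for numerical stability).
weights = np.exp(scores - scores.max(axis=0, keepdims=True))
weights /= weights.sum(axis=0, keepdims=True)

delta_E = W_o @ (V @ weights)            # Delta E_i = W_o * sum_{j<=i} softmax_j(...) V_j
print(delta_E.shape)                     # (d_model, N)
```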
- a synthetic, pseudo-code-like view of transformers (for the exam)
  - see the notes in Chamilo :)
  - decoder-only model
  - definitions
    - $V$ … vocabulary of $|V|$ tokens, $W_E\in\mathbb R^{d_\mathrm{model}\times|V|}$
      - columns of $W_E$ correspond to token embeddings in the $d_\mathrm{model}$-dimensional vector space
      - $\overrightarrow E_i$ denotes the embedding of the $i$-th token
    - $L$ … number of layers
    - $M$ … number of heads
    - $W_q^{(\ell,m)}\in\mathbb R^{d_q\times d_\mathrm{model}},\ W_k^{(\ell,m)}\in\mathbb R^{d_k\times d_\mathrm{model}},\ W_v^{(\ell,m)}\in\mathbb R^{d_v\times d_\mathrm{model}}$
      - $Q,K,V$ matrices for each layer and head
    - $N$ … length of the input sequence
  - For $n=1$ to $N$
    - $\overrightarrow E_n^{(0)}=\overrightarrow E_n+\overrightarrow{\mathrm{Pos}(n)}$
      - embedding with positional information, $\in\mathbb R^{d_\mathrm{model}}$
    - For $\ell=1$ to $L$
      - For $m=1$ to $M$
        - we define, for $j\leq n$
          - $\overrightarrow Q_j^{(\ell,m)}=W_q^{(\ell,m)}\overrightarrow E_j^{(\ell-1)}$
          - $\overrightarrow K_j^{(\ell,m)}=W_k^{(\ell,m)}\overrightarrow E_j^{(\ell-1)}$
          - $\overrightarrow V_j^{(\ell,m)}=W_v^{(\ell,m)}\overrightarrow E_j^{(\ell-1)}$
        - $\overrightarrow O_n^{(\ell,m)}=\sum_{j=1}^n\mathrm{softmax}_j\!\left(\frac{\overrightarrow Q_n^{(\ell,m)}\cdot \overrightarrow K_j^{(\ell,m)}}{\sqrt{d_k}}\right)\overrightarrow V_j^{(\ell,m)}$
      - $\overrightarrow O_n^{(\ell)}=\mathrm{concat}(\overrightarrow O_n^{(\ell,1)},\dots,\overrightarrow O_n^{(\ell,M)})$
        - column vector in $\mathbb R^{d_v\cdot M}$
      - $\Delta\overrightarrow E_n^{(\ell)}=W_O^{(\ell)}\cdot\overrightarrow O_n^{(\ell)}\quad (\in\mathbb R^{d_\mathrm{model}})$
        - $W_O^{(\ell)}\in\mathbb R^{d_\mathrm{model}\times (d_v\cdot M)}$
      - $\overrightarrow E_{n,1}^{(\ell)}=\mathrm{LayerNorm}(\overrightarrow E_n^{(\ell-1)}+\Delta\overrightarrow E_n^{(\ell)})$
      - $\overrightarrow E_{n}^{(\ell)}=\mathrm{LayerNorm}\!\left(\overrightarrow E_{n,1}^{(\ell)}+W_\downarrow^{(\ell)}\,\mathrm{GELU}(W_\uparrow^{(\ell)}\overrightarrow E_{n,1}^{(\ell)}+\overrightarrow B^{(\ell)}_\uparrow)+\overrightarrow B^{(\ell)}_\downarrow\right)$
        - contextualized embedding of the $n$-th token after layer $\ell$
        - $W_\uparrow^{(\ell)}\in\mathbb R^{d\times d_\mathrm{model}},\ B_\uparrow^{(\ell)}\in\mathbb R^{d},\ W_\downarrow^{(\ell)}\in\mathbb R^{d_\mathrm{model}\times d},\ B_\downarrow^{(\ell)}\in\mathbb R^{d_\mathrm{model}}$
    - $\overrightarrow Z=W_U\cdot\overrightarrow E_n^{(L)}$
      - logit vector for the next word
      - $W_U\in\mathbb R^{{|V|}\times d_\mathrm{model}},\ \overrightarrow Z\in\mathbb R^{|V|}$
      - the logit vector provides a weight indicating the proximity of each token to $\overrightarrow E_n^{(L)}$
    - $P=\mathrm{softmax}(\overrightarrow Z)$
      - $p_i=\frac{e^{z_i/\alpha}}{\sum_{j=1}^{|V|} e^{z_j/\alpha}}$ … with temperature $\alpha\to 0$, $P$ approaches a Dirac distribution
      - probability distribution over tokens, derived from the logit vector
  - *training*
    - compute the loss between $P$ and the true distribution $Q$ (a one-hot vector)
    - $L(P,Q)=-\sum_{i=1}^{|V|} Q_i\log P_i$ … cross-entropy (NLL) loss
    - backpropagation
      - $\theta\to\theta-\eta\nabla_\theta L(P,Q)$
  - *inference*
    - we have $P$
    - for example, we can sample at random from the top $k$ tokens (usually weighted by $P$) – see the sketch below
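The last three steps (temperature softmax, cross-entropy loss, top-$k$ sampling) fit in a few lines of NumPy. This is a minimal sketch with a toy logit vector, not the lecture's exact setup.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax_with_temperature(z: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """p_i = exp(z_i / alpha) / sum_j exp(z_j / alpha); alpha -> 0 approaches a Dirac."""
    z = z / alpha
    z = z - z.max()                 # numerical stability, does not change the result
    p = np.exp(z)
    return p / p.sum()

def cross_entropy(P: np.ndarray, Q: np.ndarray) -> float:
    """L(P, Q) = -sum_i Q_i log P_i, with Q the one-hot target distribution."""
    return float(-(Q * np.log(P + 1e-12)).sum())

def sample_top_k(P: np.ndarray, k: int = 3) -> int:
    """Inference: keep the k most probable tokens, renormalize, sample one of them."""
    top = np.argsort(P)[-k:]
    p_top = P[top] / P[top].sum()
    return int(rng.choice(top, p=p_top))

Z = rng.normal(size=10)             # toy logit vector (|V| = 10)
Q = np.eye(10)[4]                   # one-hot target: the true next token is token 4

for alpha in (2.0, 1.0, 0.1):       # lower temperature -> sharper distribution
    P = softmax_with_temperature(Z, alpha)
    print(alpha, round(cross_entropy(P, Q), 3), sample_top_k(P))
```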
- encoder × decoder
  - the encoder aims at building a representation of the sequence
  - in our decoder-only model we also get representations, but they are not that good
  - also, an encoder would be bidirectional
  - a representation of the whole sequence can be obtained by pooling the representations of the individual tokens
  - models with both an encoder and a decoder are used for machine translation
    - encoder – bidirectional, to encode the meaning of the original sentence
    - decoder – unidirectional, to generate a new sentence based on the meaning of the original sentence and the previously generated words
    - in the decoder-only model, there is no “original sentence”
- instruction fine-tuning
  - step 1: pretraining – only focused on predicting the next token
  - step 2: supervised fine-tuning (SFT) on various tasks – summarization, translation, sentiment analysis, text classification, …
    - using cross-entropy loss
  - step 3: instruction fine-tuning using RLHF (reinforcement learning from human feedback) – the model provides several responses for a given prompt and a human ranks them
- direct preference optimization (DPO)
  - we use pairwise preferences
    - $x$ … prompt
    - $y_w$ … preferred output
    - $y_\ell$ … less preferred output
  - ![DPO loss](https://miro.medium.com/v2/resize:fit:1400/1*LrLZgBo-4mHScN-75oWK-A.png)
    - $\sigma$ … sigmoid function
    - $\beta$ … regularization parameter
    - $\pi_\theta$ … probability distribution of our model with parameters $\theta$
    - $\pi_\mathrm{ref}$ … probability distribution of the reference model we get after SFT
    - we want to maximize the first term and minimize the second term (a small sketch of this loss follows at the end of this section)
- teaching *DeepSeek-R1 Zero* to reason
  - simple prompt
  - reward based on accuracy and formatting
  - GRPO loss (group relative policy optimization)
    - we maximize $A_\theta(o,q)=\frac{\pi_\theta(o\mid q)}{\pi_{\theta_\mathrm{old}}(o\mid q)}\, r(o)$
      - $q$ … question
      - $o$ … answer
      - $r$ … reward (positive for desirable answers, otherwise negative)
    - to achieve stability
      - clip the probability ratio $\frac{\pi_\theta(o\mid q)}{\pi_{\theta_\mathrm{old}}(o\mid q)}$ to the interval $(1-\varepsilon,1+\varepsilon)$
      - add a KL-divergence penalty between $\pi_\theta$ and $\pi_{\theta_\mathrm{old}}$
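The notes reference the DPO loss only via the image above; below is a minimal sketch of the standard DPO objective as it is usually stated, written with the symbols from the notes ($\beta$, $\sigma$, $\pi_\theta$, $\pi_\mathrm{ref}$, $y_w$, $y_\ell$). Treating the per-sequence log-probabilities as given inputs is an assumption made for brevity.

```python
import math

def dpo_loss(logp_theta_w: float, logp_theta_l: float,
             logp_ref_w: float, logp_ref_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt x, preferred y_w, less preferred y_l) pair.

    Inputs are log pi(y | x) of the whole output sequences, under the current
    model (theta) and the frozen reference model obtained after SFT.
    """
    # Implicit rewards: how much the current model up-weights each answer vs. the reference.
    reward_w = beta * (logp_theta_w - logp_ref_w)   # should grow for the preferred answer
    reward_l = beta * (logp_theta_l - logp_ref_l)   # should shrink for the other answer
    margin = reward_w - reward_l
    # -log sigmoid(margin); minimizing this pushes the margin up.
    return math.log1p(math.exp(-margin))

# Toy numbers: the current model already slightly prefers y_w relative to the reference.
print(round(dpo_loss(-12.0, -15.0, -12.5, -14.0, beta=0.1), 4))
```

The two `reward_*` terms correspond to the “first term” and “second term” mentioned above: the loss decreases when the model assigns relatively more probability to $y_w$ and relatively less to $y_\ell$ than the reference model does.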
## Multimodal AI

- Pia Bideau, PhD
- fully connected neural networks – impractical for images (too many weights)
- convolution
  - “filter”
  - we move a function over the signal and integrate
  - what to do at the ends?
    - shrink or pad
- a CNN learns the filters used to transform the images
- advantages of CNNs
  - spatial locality (local receptive fields) – every neuron looks at a small patch of the image
  - parameter sharing – we don't need that many weights
  - translation equivariance – we don't need to preprocess the images that much (object detection works no matter where the object is in the image)
- downsampling approaches
  - stride – we slide the filter with a step size larger than one
  - pooling – we apply a function (usually max) over a patch
  - if pixel-level outputs are expected, we need to use upsampling afterwards
- upsampling approaches
  - nearest neighbor (we just copy the value)
  - bed of nails (we put the value in the upper-left corner and use zeros elsewhere)
  - max unpooling (we need to remember where we took the maximum from, then put the value back there and zeros elsewhere)
- VGG architecture
  - uses 3×3 convolutions everywhere
  - receptive field size
    - in the original image, the receptive field is 1×1
    - in the first layer, the receptive field is 3×3
    - by applying the convolution to the already convolved pixels, we get a 5×5 receptive field in the second layer
    - the formula (for stride 1): $RF_0=1,\ RF_i=RF_{i-1}+(K-1)$
      - $K$ … convolution kernel size ($K=3$ for a 3×3 filter)
- other architectures: ResNet, Inception, GoogLeNet, U-Net
- RNNs (see the sketch below)
  - hidden state … a combination of the current input and the previous hidden state
  - usually tanh activation
  - there can be one or multiple outputs
  - backpropagation (through time) becomes intractable for long sequences
    - so truncated backpropagation may be used
  - we can even have multiple layers
  - problem: vanishing or exploding gradients
    - GRU and LSTM units are used to solve this problem
  - attention
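A minimal NumPy sketch of the vanilla RNN recurrence described above (toy sizes and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

D_IN, D_HID, T = 4, 6, 10           # toy input size, hidden size, sequence length

W_xh = rng.normal(scale=0.5, size=(D_HID, D_IN))
W_hh = rng.normal(scale=0.5, size=(D_HID, D_HID))
b_h  = np.zeros(D_HID)

xs = rng.normal(size=(T, D_IN))     # an input sequence of T vectors
h = np.zeros(D_HID)                 # initial hidden state

hidden_states = []
for x_t in xs:
    # Hidden state = combination of the current input and the previous hidden state,
    # squashed through tanh (the usual activation for vanilla RNNs).
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    hidden_states.append(h)

print(np.stack(hidden_states).shape)    # (T, D_HID): one hidden state per time step
# Backpropagating through all T steps multiplies by W_hh^T (times tanh') at every step,
# which is where vanishing/exploding gradients come from; GRU/LSTM cells mitigate this.
```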