E_n^{(ℓ)} … contextualized embedding of the n-th token after layer ℓ
W_↑ ∈ ℝ^{d × d_model}, B_↑ ∈ ℝ^d, W_↓ ∈ ℝ^{d_model × d}, B_↓ ∈ ℝ^{d_model}
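These look like the up- and down-projections of the transformer's feed-forward (MLP) sub-layer; a minimal PyTorch sketch, assuming a GELU non-linearity (the class and attribute names are illustrative):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Transformer MLP block: up-project to width d, apply a non-linearity, project back."""
    def __init__(self, d_model: int, d: int):
        super().__init__()
        self.up = nn.Linear(d_model, d)      # W_↑, B_↑
        self.down = nn.Linear(d, d_model)    # W_↓, B_↓
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_model) -> (..., d_model)
        return self.down(self.act(self.up(x)))
```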
Z = W_U · E_n^{(L)}
logit vector for next word
W_U ∈ ℝ^{|V| × d_model}, Z ∈ ℝ^{|V|}
the logit vector provides a weight indicating the proximity of each token to E_n^{(L)}
P=softmax(Z)
p_i = e^{z_i/α} / ∑_{j=1}^{|V|} e^{z_j/α} … as α → 0, this approaches a Dirac (one-hot) distribution
probability distribution over tokens derived from logit vector
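A minimal numpy sketch of this temperature softmax (the max-subtraction is only for numerical stability; the function name is illustrative):

```python
import numpy as np

def softmax_with_temperature(z: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """p_i = exp(z_i / alpha) / sum_j exp(z_j / alpha)."""
    scaled = z / alpha
    scaled = scaled - scaled.max()           # subtract the max for numerical stability
    e = np.exp(scaled)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax_with_temperature(z, alpha=1.0))    # smooth distribution over tokens
print(softmax_with_temperature(z, alpha=0.01))   # nearly one-hot (Dirac-like)
```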
training
compute the loss between P and the true distribution Q (a one-hot vector)
L(P,Q) = −∑_{i=1}^{|V|} Q_i log P_i … cross-entropy (NLL) loss
backpropagation
θ → θ − η ∇_θ L(P,Q)
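A hedged PyTorch sketch of one training step, assuming `model` maps token ids to the logits Z at every position; `F.cross_entropy` computes exactly the NLL against the one-hot Q:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, tokens: torch.Tensor) -> float:
    """One step of next-token prediction on a batch of token ids of shape (B, T)."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t+1 from tokens <= t
    logits = model(inputs)                            # (B, T-1, |V|)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten to (B*(T-1), |V|)
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()                                   # backpropagation
    optimizer.step()                                  # theta <- theta - eta * grad
    return loss.item()
```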
inference
we have P
for example, we can sample at random from the top-k most likely tokens (usually weighting the draw by P)
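A small PyTorch sketch of top-k sampling from the logit vector Z (names are illustrative):

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> int:
    """Sample the next token id from the k most likely tokens, weighted by P."""
    top_logits, top_ids = torch.topk(logits, k)             # keep the k largest logits
    probs = torch.softmax(top_logits / temperature, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)        # draw one of the k indices
    return top_ids[choice].item()
```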
encoder vs. decoder
encoder aims at building a representation of the sequence
in our decoder-only model, the hidden states can serve as encodings, but they are not particularly good ones
also, encoder would be bidirectional
a representation of the whole sequence can be obtained by pooling the representations of the individual tokens (see the sketch after this list)
models with both encoder and decoder are used for machine translation
encoder – bidirectional, to encode the meaning of the original sentence
decoder – unidirectional, to generate a new sentence based on the meaning of the original sentence and the previously generated words
in the decoder-only model, there's no “original sentence”
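A sketch of the pooling mentioned above, using mean pooling over non-padding tokens (a common choice, but not the only one; shapes are assumptions):

```python
import torch

def mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average token representations (B, T, d) into one sequence representation (B, d).

    `mask` is 1 for real tokens and 0 for padding, shape (B, T).
    """
    mask = mask.unsqueeze(-1).float()          # (B, T, 1)
    summed = (hidden * mask).sum(dim=1)        # sum over the real tokens only
    counts = mask.sum(dim=1).clamp(min=1.0)    # avoid division by zero
    return summed / counts
```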
instruction fine-tuning
step 1: pretraining – only focused on predicting the next token
step 2: supervised fine-tuning (SFT) on various tasks – summarization, translation, sentiment analysis, text classification, …
using cross-entropy loss
step 3: instruction fine-tuning using RLHF (reinforcement learning from human feedback) – the model generates several responses for a given prompt and a human ranks them
direct preference optimization (DPO)
we use pairwise preferences
x … prompt
y_w … preferred output
y_ℓ … less preferred output
σ … sigmoid function
β … regularization parameter
π_θ … probability distribution of our model with parameters θ
π_ref … probability distribution of the reference model we get after SFT
we want to maximize the first log-ratio term (for y_w) and minimize the second one (for y_ℓ); the loss is written out below
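for reference, the loss these two terms come from, written with the symbols above (this is the standard formulation from the DPO paper, reconstructed here rather than taken from the slides):

L_DPO(π_θ; π_ref) = −E_{(x, y_w, y_ℓ)} [ log σ( β log (π_θ(y_w∣x) / π_ref(y_w∣x)) − β log (π_θ(y_ℓ∣x) / π_ref(y_ℓ∣x)) ) ]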
teaching DeepSeek-R1-Zero to reason
simple prompt
reward based on accuracy and formatting
GRPO loss (group relative policy optimization)
we maximize A_θ(o,q) = (π_θ(o∣q) / π_{θ_old}(o∣q)) · r(o)
q … question
o … answer
r … reward (positive for desirable answers, otherwise negative)
to achieve stability
clip the probability ratio π_θ(o∣q) / π_{θ_old}(o∣q) to the interval (1−ε, 1+ε)
add KL-divergence between πθ and πθold
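A minimal PyTorch sketch of the clipped surrogate for a single answer, assuming we already have the summed log-probabilities of the answer under the current and old policies and a scalar reward-derived advantage (the KL penalty from the bullet above would be added on top):

```python
import torch

def clipped_objective(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantage: torch.Tensor,
                      eps: float = 0.2) -> torch.Tensor:
    """PPO/GRPO-style clipped surrogate (to be maximized).

    logp_new, logp_old: log pi_theta(o|q) and log pi_theta_old(o|q)
    advantage: reward-derived advantage of the answer o
    """
    ratio = torch.exp(logp_new - logp_old)                # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)    # keep the policy update small
    return torch.minimum(ratio * advantage, clipped * advantage)
```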
Multimodal AI
Pia Bideau, PhD
fully connected neural network – impractical for images (too many weights)
convolution
“filter”
we move a function over the signal and integrate
what to do at the ends?
shrink or pad
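A numpy sketch of a 2-D convolution (cross-correlation, as CNNs actually compute it) with zero padding at the borders:

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray, pad: int = 1) -> np.ndarray:
    """Slide a k x k filter over the image; zero-pad so the output keeps (roughly) its size."""
    k = kernel.shape[0]
    padded = np.pad(image, pad, mode="constant")           # zeros at the borders
    h, w = image.shape
    out = np.zeros((h + 2 * pad - k + 1, w + 2 * pad - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

edge_filter = np.array([[-1, 0, 1]] * 3, dtype=float)      # simple 3x3 vertical-edge filter
print(conv2d(np.random.rand(5, 5), edge_filter).shape)     # (5, 5) with pad=1 and k=3
```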
a CNN learns the filters used to transform the images
advantages
spatial locality (local receptive fields) – every neuron is looking at a small patch of the image
parameter sharing – we don't need that many weights
translation equivariance – much less preprocessing of the images is needed (object detection works no matter where the object appears in the image)
downsampling approaches
stride – we are sliding the filter with a step size larger than one
pooling – we apply a function (usually max) over a patch
if pixel-level outputs are expected, we need to use upsampling afterwards
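A PyTorch sketch comparing the two downsampling routes above on a toy input (channel counts are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                  # (batch, channels, H, W)

strided = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
pooled = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1),
                       nn.MaxPool2d(kernel_size=2))

print(strided(x).shape)   # torch.Size([1, 8, 16, 16]) -- downsampled by the stride
print(pooled(x).shape)    # torch.Size([1, 8, 16, 16]) -- downsampled by max pooling
```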
upsampling approaches
nearest neighbor (we just copy the value)
bed of nails (we put the value in the upper-left corner and use zeros elsewhere)
max unpooling (we need to remember where the maximum was taken from, then put the value back there and zeros elsewhere)
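A PyTorch sketch of two of these upsampling options; MaxUnpool2d uses the indices remembered by the matching MaxPool2d:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)

# nearest neighbor: every value is simply copied into a 2x2 block
up_nn = nn.Upsample(scale_factor=2, mode="nearest")
print(up_nn(x).shape)                  # torch.Size([1, 1, 8, 8])

# max unpooling: pool while remembering where each maximum came from ...
pool = nn.MaxPool2d(kernel_size=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2)
pooled, indices = pool(x)
# ... then put the values back at those positions and zeros elsewhere
print(unpool(pooled, indices).shape)   # torch.Size([1, 1, 4, 4])
```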
VGG architecture
uses 3×3 convolutions everywhere
receptive field size
in the original image, the receptive field is 1×1
in the first layer, the receptive field is 3×3
by applying the convolution to already-convolved pixels, we get a 5×5 receptive field in the second layer
the formula: RF_0 = 1, RF_i = RF_{i−1} + (K − 1)
K … convolution kernel size (K=3 for a 3×3 filter)
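A tiny helper applying this formula to a stack of 3×3, stride-1 convolutions (as in VGG):

```python
def receptive_field(num_layers: int, k: int = 3) -> int:
    """RF_0 = 1, RF_i = RF_{i-1} + (K - 1) for stride-1 convolutions."""
    rf = 1
    for _ in range(num_layers):
        rf += k - 1
    return rf

print([receptive_field(n) for n in range(4)])   # [1, 3, 5, 7]
```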
other architectures: ResNet, Inception, GoogLeNet, U-Net
RNNs
hidden state … combination of the current input and the previous hidden state
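A minimal sketch of this update as a vanilla (Elman-style) RNN cell; the tanh non-linearity and the weight names are the usual textbook choice, not anything specific from the lecture:

```python
import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    """h_t = tanh(W_x x_t + W_h h_{t-1} + b): mix the current input with the previous hidden state."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.w_x = nn.Linear(input_size, hidden_size, bias=True)
        self.w_h = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.w_x(x_t) + self.w_h(h_prev))
```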