supervised ML
gradient descent
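A minimal sketch of gradient descent on a toy least-squares objective (data, step size, and names are illustrative, not from the notes):

```python
import numpy as np

# Toy least-squares problem: minimise f(w) = (1/2n) * ||X w - y||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)                          # initial parameters
lr = 0.1                                 # step size (learning rate), illustrative
for step in range(500):
    grad = X.T @ (X @ w - y) / len(y)    # gradient of f at the current w
    w = w - lr * grad                    # gradient-descent update
print(w)                                 # should be close to w_true
```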
multi-layer neural networks
the case of two layers
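A minimal sketch of the two-layer case, f(x) = W2 σ(W1 x + b1) + b2, with one manual backpropagation step (dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5,))        # one input of dimension 5
y = np.array([1.0])              # scalar target

W1, b1 = rng.normal(size=(8, 5)) * 0.1, np.zeros(8)   # hidden layer (8 units)
W2, b2 = rng.normal(size=(1, 8)) * 0.1, np.zeros(1)   # output layer

# Forward pass: two layers with a ReLU non-linearity in between.
h = np.maximum(W1 @ x + b1, 0.0)
y_hat = W2 @ h + b2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass (chain rule) and one gradient-descent step.
g_out = y_hat - y                       # dloss/dy_hat
gW2, gb2 = np.outer(g_out, h), g_out
g_h = W2.T @ g_out
g_pre = g_h * (h > 0)                   # ReLU derivative
gW1, gb1 = np.outer(g_pre, x), g_pre

lr = 0.1
W1 -= lr * gW1; b1 -= lr * gb1
W2 -= lr * gW2; b2 -= lr * gb2
```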
Universal approximation theorem (Cybenko '89, Hornik '89)
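The statement in Cybenko's form: for any continuous sigmoidal σ, finite sums of sigmoids are dense in C([0,1]^n):

```latex
% Universal approximation (Cybenko '89): sums of sigmoids are dense in C([0,1]^n).
\forall f \in C([0,1]^n),\ \forall \varepsilon > 0,\ \exists N \in \mathbb{N},\
\alpha_i, \theta_i \in \mathbb{R},\ w_i \in \mathbb{R}^n \ \text{such that}\
\sup_{x \in [0,1]^n} \Bigl|\, f(x) - \sum_{i=1}^{N} \alpha_i\, \sigma\bigl(w_i^{\top} x + \theta_i\bigr) \Bigr| < \varepsilon .
```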
very deep networks (CNN, Transformer, …)
attention, transformers, BERT
example
recent technical revolutions
residual NN (ResNet)
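A minimal sketch of the residual idea, y = x + F(x), with F a small learned block (shapes and names are illustrative):

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x) with F a two-layer MLP; the identity shortcut carries the signal."""
    f = np.maximum(x @ W1, 0.0)   # first layer + ReLU
    f = f @ W2                    # second layer, back to the input dimension
    return x + f                  # skip connection: gradients also flow through the identity

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))                   # batch of 4 vectors, dimension 16
W1 = rng.normal(size=(16, 32)) * 0.1
W2 = rng.normal(size=(32, 16)) * 0.1
y = residual_block(x, W1, W2)                  # same shape as x
```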
transformer and attention mechanism
theorem
assume there is no causality in the order of the tokens; then (see the sketch below)
theorem
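A minimal sketch of single-head scaled dot-product attention, softmax(QKᵀ/√d)V, together with a numerical check of what the assumption above suggests: with no causal mask and no positional encoding, permuting the input tokens simply permutes the output. This reading of the theorem is an assumption; all names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V, no mask, no positions."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_tokens, n_tokens) similarities
    return softmax(scores) @ V                # weighted average of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                               # 6 tokens of dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)                   # self-attention

perm = rng.permutation(6)                                 # reorder the tokens
out_perm = attention(X[perm] @ Wq, X[perm] @ Wk, X[perm] @ Wv)
print(np.allclose(out_perm, out[perm]))                   # True: the output is permuted identically
```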
a concise, pseudo-code-like view of transformers (for the exam)
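In that spirit, a compact sketch of one (pre-norm) transformer encoder block, residual + attention then residual + MLP; single-head and without learned normalisation parameters to keep it short (all names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def transformer_block(x, p):
    """One pre-norm encoder block: x + Attention(LN(x)), then x + MLP(LN(x))."""
    h = layer_norm(x)
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    x = x + attn @ p["Wo"]                              # residual connection around attention
    h = layer_norm(x)
    x = x + np.maximum(h @ p["W1"], 0.0) @ p["W2"]      # residual connection around the MLP
    return x

rng = np.random.default_rng(0)
d = 16
p = {name: rng.normal(size=shape) * 0.1
     for name, shape in [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
                         ("Wo", (d, d)), ("W1", (d, 4 * d)), ("W2", (4 * d, d))]}
tokens = rng.normal(size=(10, d))       # 10 token embeddings (positional encoding omitted)
out = transformer_block(tokens, p)      # same shape: (10, d)
```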
encoder × decoder
instruction fine-tuning
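A minimal sketch of the core idea of supervised instruction fine-tuning: train with the usual next-token cross-entropy, but mask out the prompt tokens so the loss is computed only on the response (all names and shapes are illustrative):

```python
import numpy as np

def masked_next_token_loss(logits, targets, loss_mask):
    """Cross-entropy averaged over the positions where loss_mask == 1 (response tokens only)."""
    logits = logits - logits.max(-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    token_ll = log_probs[np.arange(len(targets)), targets]   # log-prob of each target token
    return -(token_ll * loss_mask).sum() / loss_mask.sum()

rng = np.random.default_rng(0)
vocab, seq = 50, 8
logits = rng.normal(size=(seq, vocab))              # model outputs for one sequence
targets = rng.integers(0, vocab, size=seq)          # next-token targets
loss_mask = np.array([0, 0, 0, 1, 1, 1, 1, 1])      # first 3 tokens are the instruction/prompt
print(masked_next_token_loss(logits, targets, loss_mask))
```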

teaching DeepSeek-R1-Zero to reason
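R1-Zero is trained with reinforcement learning against simple rule-based rewards (answer accuracy plus a format reward for putting the reasoning inside think tags). A toy sketch of that reward idea; the tags, weights, and parsing below are illustrative assumptions, not the DeepSeek implementation:

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: format bonus for <think>...</think>, accuracy bonus for the final answer."""
    reward = 0.0
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5                                    # format reward (illustrative weight)
    answer = completion.split("</think>")[-1].strip()    # text after the reasoning block
    if answer == reference_answer.strip():
        reward += 1.0                                    # accuracy reward (illustrative weight)
    return reward

print(rule_based_reward("<think>2+2 is 4</think> 4", "4"))   # 1.5
```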
convolution
a CNN learns the filters used to transform the images
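A minimal sketch of a 2D convolution (valid padding, stride 1, written as cross-correlation as in deep-learning libraries); in a CNN the entries of `kernel` are exactly the parameters that are learned (names are illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image and take dot products (valid padding, stride 1)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(8, 8))
edge_filter = np.array([[1.0, 0.0, -1.0],
                        [2.0, 0.0, -2.0],
                        [1.0, 0.0, -1.0]])   # a hand-crafted edge filter; a CNN learns such filters
print(conv2d(image, edge_filter).shape)      # (6, 6)
```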
VGG architecture
RNNs
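A minimal sketch of a vanilla RNN cell, h_t = tanh(Wx x_t + Wh h_{t-1} + b), unrolled over a sequence (shapes and names are illustrative):

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b):
    """Unroll h_t = tanh(Wx x_t + Wh h_{t-1} + b) over the sequence xs."""
    h = np.zeros(Wh.shape[0])
    hidden_states = []
    for x_t in xs:                        # process the tokens in order
        h = np.tanh(Wx @ x_t + Wh @ h + b)
        hidden_states.append(h)
    return np.stack(hidden_states)

rng = np.random.default_rng(0)
d_in, d_h = 4, 6
xs = rng.normal(size=(10, d_in))                  # a sequence of 10 inputs
Wx = rng.normal(size=(d_h, d_in)) * 0.1
Wh = rng.normal(size=(d_h, d_h)) * 0.1
b = np.zeros(d_h)
H = rnn_forward(xs, Wx, Wh, b)                    # (10, d_h) hidden states
```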