# Exam

- image processing vs. computer vision
  - processing: reasoning focused on the image, pixels, pixel groups
  - vision: focuses on the knowledge the image brings from a real scene
  - difficult problem
- the goal of computer vision is not to mimic human vision but to build systems that extract information
- computer vision is an inverse of the synthesis problem
  - projection is fundamentally ambiguous – we are losing information (depth, size, occlusions)
  - in general it's an ill-posed problem: no unique solution for a given observation, ambiguous solutions, incomplete data (example: scale of the observed scene, a toy car instead of a normal car)
  - need of injecting a priori knowledge and regularization (example: penalize non-smooth solutions)
  - from noisy observations, we estimate the parameters of a model
  - a priori knowledge: physics, geometry, semantics
    - example: by counting the visible wheels of the car, we can tell the position of the camera
- desirable characteristics
  - robustness – be able to identify observation noise/errors (have a plan B in case of error)
  - speed
  - precision
  - generality – the algorithm should be generic; the pool of situations it can handle should be large enough
- main vision problems
  - calibration – where is the camera, which model, …
  - segmentation – which parts of the image belong together
  - detection – what is interesting/salient in the image
  - reconstruction – where are the objects (position, shape, 3D surface)
  - tracking – which motion is present in the image, how do objects move
  - recognition – what do we see, semantics
- image – 2D signal (depicts a 3D scene), matrix of values that represent a signal, has semantic information
- light
  - plays a fundamental role in 3D perception
  - no light → no image
  - shadow, diffuse reflection (in many directions), specular reflection (mirror-like, in one direction)
  - wavelength, spectra
    - solar spectrum: almost continuous spectrum, some wavelengths are stronger
    - white light: continuous spectrum (energy evenly distributed)
    - sodium vapor lamp: only yellow → a red car appears dark
  - what happens when a light ray hits a surface
    - absorption (black surface)
    - reflection, refraction (mirror, reflector)
    - diffusion (milk)
    - fluorescence
    - transmission and emission (human skin)
    - most surfaces can be approximated by simple models
  - first simplified hypothesis
    - no fluorescent surfaces
    - “cold” objects (they don't emit light)
    - all light emitted from a surface point is formed solely by the light arriving there
  - standard model: BRDF
    - bi-directional reflectance distribution function
    - models the ratio of energy for each wavelength $\lambda$
      - incoming from direction $\hat v_i$
      - emitted towards direction $\hat v_r$
    - reciprocity (we can swap the light source and the camera)
    - isotropy (no favorite direction) – does not always hold
    - energy corresponds to an integral (we can use a discrete sum)
  - Lambertian assumption
    - diffuse surface: uniform in all directions (paper, milk, matte paint)
    - the BRDF is a constant function $f_d(\hat v_i,\hat v_r,\hat n,\lambda)=f_d(\lambda)$
    - “when a ray hits the surface, it spreads equally in all directions”
    - do papers use it?
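Under the Lambertian assumption described above, shading depends only on the angle between the surface normal and the light direction (the constant BRDF drops out). A minimal sketch, where the function name and albedo value are my own illustration:

```python
import numpy as np

def lambertian_intensity(albedo, normal, light_dir):
    """Diffuse shading under the Lambertian assumption: the BRDF is a
    constant, so reflected intensity is proportional to cos(angle between
    surface normal and incoming light), clamped at zero."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    return albedo * max(0.0, float(np.dot(n, l)))

# Light hitting the surface head-on vs. at a grazing angle:
print(lambertian_intensity(0.8, np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0])))  # 0.8
print(lambertian_intensity(0.8, np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0, 0.0])))  # 0.0
```

Note that the emitted intensity is the same for every viewing direction $\hat v_r$, which is exactly why the viewing direction does not appear in the computation.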
- plenoptic function – does not assume it, explicitly models the angular dimension (intensity may differ per direction)
- focused plenoptic camera – uses it for matching and depth processing
- depth estimation – yes (for the rays)
- spray-on optics – yes (required by the matching of the droplets)
- Middlebury – putting the clay on
- SPIN
- ViT, DINO
- specular (reflective) material, central lobe on $\hat s_i$
  - Phong … $f_s(\theta_s,\lambda)=k_s(\lambda)\cos ^{k_e}\theta_s$
  - Torrance-Sparrow
  - dichromatic model = diffuse + specular
- pipeline in a digital camera
  - to get RAW: optics → aperture → shutter → sensor → gain (ISO) → A/D
  - to get JPEG from RAW: demosaic → sharpen → white balance → gamma/curve → compress
- role of the optics: isolate the light rays (coming from one particular part of the scene)
- pinhole cameras
  - Rayleigh: ideal diameter $d=1.9\sqrt{f\lambda}$ ($f$ focal distance, $\lambda$ wavelength)
  - image distorted by the shape of the pinhole
- we can model a complicated system of lenses using just one lens
- perfect lens: hypothesis
  - a point in the scene corresponds to a point in the image
  - all image points are in the same plane
  - this is not true, there are artifacts
    - chromatic artifacts (fringing)
      - diffraction – wavelength dependent
      - not all colors are focused to the same point
    - vignetting … border of the image is darker
    - geometric distortion (for wide-angle cameras)
- CCD sensor, CMOS sensor
  - rolling shutter
    - lines of the sensor are discharged sequentially → moving objects can be distorted
    - usually in cheap CMOS
- color spaces
  - RGB … additive
  - CMY … subtractive
- color perception
  - retina – contains the fovea
  - rods – achromatic perception of light; the pigment (rhodopsin) is sensitive to the whole visible spectrum (peak on green)
    - on the peripheral retina
  - cones – color perception
    - on the fovea
  - the mantis shrimp has the most complex visual system ever discovered
- color perception in a camera
  - deviation/dispersion prism
    - 3 CCD sensors
    - precise alignment, high quality filter
    - expensive
  - Bayer filter
    - individual (plastic) filter for each pixel (RGGB, RGCB)
    - to get colors in each pixel, we interpolate (integrate over spectra)
- sensor artifacts
  - noise: salt and pepper, thermal noise (as the camera heats up)
  - aliasing (low-frequency artifacts emerge from a high-frequency signal)
  - gamma correction – adjusting brightness to match human perception
    - otherwise, most of the linear scale would be bright
  - JPEG compression artifacts
- color models
  - RGB – additive model, mixing wavelengths
  - CMY – subtractive model, mixing pigments (“subtracting” wavelengths)
  - HSV or HSB – hue (“intrinsic color”), saturation, value
    - useful for artists
  - YUV, YIQ – 1 luma component + 2 chroma components
    - for TV broadcasting; the luminance channel can be used for black & white TV
  - application: robotic detection
    - we can detect an object based on its color (ignoring luminance or value)

## Filters, Contours, Segmentation

- characterization of an image (what is interesting?)
  - salient (key) points – areas of strong contrast (0D)
  - lines, salient contours – contrast boundaries (1D)
  - regions – groups of pixels with similar properties (2D)
  - duality – contours & regions (between the contours)
- noise filtering
  - “how to distinguish between information and noise”
  - the simplest and most widely adopted model: Gaussian noise (thermal)
    - additive noise, independently sampled in each pixel from a Gaussian distribution
  - convolution with a low-pass Gaussian filter – to remove the high frequencies (likely noise) and make derivatives easier to compute
- convolution
  - $(m*f)(x)=\int_u m(u)f(x-u)\,du$
  - $(m*f)(x,y)=\int_u\int_v m(u,v) f(x-u,y-v)\,du\,dv$
  - theorem: derivative of a convolution = convolution with the derivative
  - average over neighboring pixels … mean filter
    - but a Gaussian filter works better – more distant pixels have less effect
- edge
  - some kind of discontinuity (surface, depth, color, illumination, …)
  - properties: contrast, orientation
  - descriptors: normal (unit vector, direction of maximal change), direction (perpendicular to the normal), position, intensity
- detecting edges
  - first apply smoothing, then find the color change
  - peak of the first derivative → Sobel
    - preserves edge strength, detects its direction
  - or zero-crossing of the second derivative → LoG
    - precisely locates edges (but is binary)
    - very sensitive to noise
  - the edge notion is binary – we need a threshold
  - the image is not a continuous function – we approximate it using a Taylor expansion (just using a plane)
    - finite differences (used to approximate the derivative)
  - gradient operator: Sobel
    - modern neural networks tend to learn this filter to process data :)
    - approximates horizontal ($\frac{\partial I}{\partial x}$) and vertical ($\frac{\partial I}{\partial y}$) gradients
    - it's a particular case of the composition of a $3\times 3$ Gaussian filter $\begin{pmatrix}1&2&1\\2&4&2\\1&2&1\end{pmatrix}$ and finite differences
    - scaling of the filtering kernel influences the detected contours
  - another operator: Laplace $\begin{pmatrix}0&1&0\\1&-4&1\\0&1&0\end{pmatrix}$
    - uses the second derivative
    - Laplacian of Gaussian (LoG) … we apply the Laplacian to the Gaussian and then convolve it with the image
      - inverted Mexican hat
- contour extraction (Canny)
  - based on Sobel
  - non-maximum suppression: we keep only the pixel with the highest value in the direction of the gradient
  - we still need to use a reasonable scale and threshold
- line detection – Hough transform
  - how to express a line
    - $ax+by+c=0$
    - or we can use polar coordinates: $(r,\theta)$
      - $r=x\cos\theta+y\sin\theta$
      - $r$ … distance from the origin
      - $\theta$ … slope (kind of)
      - we consider the line segment perpendicular to the line – $r$ is its length, $\theta$ is its slope
  - let's vote!
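The Hough voting scheme can be sketched as follows; the accumulator resolution and the test points are my own illustration, not from the notes:

```python
import numpy as np

def hough_lines(points, img_size, n_theta=180, n_r=200):
    """Hough transform for lines: each edge point votes for every (r, theta)
    pair of a line passing through it, using r = x*cos(theta) + y*sin(theta).
    Peaks in the accumulator correspond to detected lines."""
    r_max = float(np.hypot(*img_size))            # largest possible |r|
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((n_r, n_theta), dtype=int)
    for x, y in points:
        r = x * np.cos(thetas) + y * np.sin(thetas)
        # quantize r from [-r_max, r_max] into n_r accumulator bins
        r_idx = np.round((r + r_max) / (2 * r_max) * (n_r - 1)).astype(int)
        acc[r_idx, np.arange(n_theta)] += 1
    return acc, thetas

# Points on the horizontal line y = 5 all vote for the same (r, theta) bin:
pts = [(x, 5) for x in range(20)]
acc, thetas = hough_lines(pts, (100, 100))
print(acc.max())   # the peak collects one vote per collinear point
```

This works precisely because the parameter space $(r,\theta)$ is restricted and discretized, so collinear points pile their votes into the same cell.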
  - each observation adds one vote to each line going through it – to the pair $(r,\theta)$
  - we have a restricted set of parameters → it works
- segmentation
  - Gestalt theory
  - region – group of pixels with similar properties
    - we need a similarity measure
      - distance to the mean of the region
      - Mahalanobis distance (distance weighted by the variance of the characteristics)
      - similarity in color space
  - clustering by aggregation × subdivision
  - k-means
  - Gaussian mixture models
    - probability of a point belonging to a cluster
    - the probability of observing a point is a mixture of Gaussians
    - expectation maximization is used to solve it
      - E: knowing the current estimates of the blobs, we compute the probabilities of the points belonging to them
      - M: we recompute the blob parameters to maximize the likelihood
    - advantages: allows non-spherical clusters, can better handle outliers (we can add an outlier cluster)
    - problems: we still need to choose $k$; it is sensitive to initialization; the generative model (shape of the blobs) needs to be chosen
  - superpixels
    - we need to group based on both color and spatial proximity
    - SLIC algorithm … k-means in 5D (color + 2D coordinates)
  - segmentation using a graph partitioning strategy
    - one pixel → one vertex
    - edges to other pixels
    - remove the edges of low similarity
      - based on distance, intensity, color, …
    - subgraphs represent different groups
    - Graph cut algorithm
      - high memory and computation requirements, biases in the cuts (regions with similar sizes are preferred)
    - GrabCut – interactive method (the user selects and refines the segmentation)
  - Meta AI: Segment Anything Model (SAM)
    - trained using superpixels
  - deep clustering
    - an encoder and a decoder are trained
    - k-means in the latent (embedding) space
  - DINO (segmentation emerges from self-supervised learning)

## Interest Points

- motivation: 2D tracking (motion), finding correspondences (in the 3D world), recognition
  - segmentation is usually avoided (it is a source of errors – finding interest points is more robust)
- challenges: scale change, rotation, viewpoint change, change in appearance (lighting, colorimetry)
- usual pipeline: detect points, describe points, find correspondences
- desired properties of a detector
  - use *local* (not global) characteristics – more robust to occlusions
  - invariance to translation, rotation, scale, lighting
  - robust to noise, bad conditions, compression
  - discriminative – allows identifying objects
  - quantity – we need enough points
  - precision – accurate location in the image
  - efficiency – fast computation
- Moravec's detector
  - small window (does shifting it lead to a change of intensity?)
  - if a shift of the window in any direction results in an intensity change, we have found a corner
- Harris detector – “differentiable Moravec”
  - Taylor expansion
  - bilinear form
  - eigenvalues
    - minimum eigenvalue – slowest change direction
    - maximum eigenvalue – fastest change direction
    - hard to compute → we use $R$ instead
      - $R=\det M-\alpha(\mathrm{trace}\ M)^2$
      - $R\gt 0\implies$ corner
      - $R\lt 0\implies$ contour
      - small $|R|\implies$ homogeneous region
  - algorithm
    - compute the derivatives of the image (using Sobel, …)
    - compute $R$ at each point
    - find points with a high $R$ (over a certain threshold)
    - keep the local maxima of $R$
  - we don't want too many interest points – various strategies (keep the top $k$ points per image; ensure that every radius of $n$ pixels has its top interest point)
  - properties
    - rotation invariance
    - invariance to intensity shift, partial invariance to intensity scaling
    - no invariance to scale changes
      - we consider circular regions of different sizes around a point
      - but how to choose them?
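The Harris response above can be sketched per pixel; a minimal version, assuming a simple 3×3 box window in place of the usual Gaussian weighting of the structure tensor (the test image is my own illustration):

```python
import numpy as np

def box3(a):
    """Sum over a 3x3 neighborhood (zero-padded) - a crude window term."""
    p = np.pad(a, 1)
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3))

def harris_response(img, alpha=0.04):
    """Harris response R = det(M) - alpha * trace(M)^2 at each pixel,
    where M is the structure tensor built from image gradients and summed
    over a small window (a Gaussian window is more common in practice)."""
    Iy, Ix = np.gradient(img.astype(float))       # finite-difference gradients
    Sxx, Syy, Sxy = box3(Ix * Ix), box3(Iy * Iy), box3(Ix * Iy)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - alpha * trace ** 2

# White square on black background: R > 0 at corners, R < 0 along edges,
# R ~ 0 in homogeneous regions, matching the classification of R above.
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
R = harris_response(img)
print(R[5, 5] > 0, R[5, 10] < 0)   # corner vs. edge
```

The window sum is what makes the detector work: without it, $\det M$ would be identically zero at every pixel.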
- scale invariant detector
  - image pyramid (consider several smaller versions of the same image)
  - a scale invariant function serves as the basis of the detection algorithm – LoG; we can also use the difference of Gaussians (DoG), which is similar (but more efficient to compute)
  - Harris-Laplacian detector – find the local maxima of…
    1. the Harris detector in image space (coordinates)
    2. LoG in scale space
  - SIFT (Lowe) – find the local maxima of DoG in image space *and* scale space
- how to match detected points?
  - we need descriptors – they should be invariant and discriminant
  - Harris: use eigenvalues? not so discriminant
  - windowed approaches – not so invariant
  - multi-scale oriented patches (MOPS) – oriented patch of size 8×8 (orientation from the image gradient)
  - scale invariant feature transform (SIFT) – histogram of local gradient directions (the maximum defines the principal direction)
- matching approaches
  - two-way matching (are we the best match for our best match?)
  - exhaustive search is slow
    - hashing is faster (compute a hash value for the descriptor)
  - nearest neighbour
    - k-dimensional tree, best bin first (BBF)

## Paper Session 1

- plenoptic function, early vision
  - systematic framework capturing visual information
  - plenoptic function
    - $P(x,y,t,\lambda,V_x,V_y,V_z)$
    - output – intensity
    - $t$ time, $\lambda$ wavelength, $V$ viewing point
    - humans – only two samples along the $V_x$ axis (both for a single value of $V_y$ and $V_z$)
  - extraction of information … derivatives
    - “periodic table”
    - “blobs” – convolutional filters for extracting information
- focused plenoptic camera
  - angular-spatial resolution tradeoff
  - hi-res rendering algorithm
  - depth estimation
- we want
  - simultaneous detection and depth estimation
  - a local method to detect rays
  - a method invariant to object size and depth
  - Ray Gaussian kernel
  - lisad – operator activation
  - ray $\neq$ light
  - we stack pictures taken from different angles
    - in the cut, there are “rays” → we can get depth (thanks to parallax)
    - if we move, the objects in the foreground shift
    - the objects in the background not that much
- spray-on optics
  - droplet extraction and simulation + distortion removal
  - match the images
  - what is the final resolution?
  - manual droplet segmentation
  - better image quality with more droplets
  - how to get real data? ground truth?
    - usually, we need both synthetic and real data

## 3D Vision

- definitions & notation
  - point $p\in\mathbb R^3$
  - matrix $M\in\mathbb R^{m\times n}$ ($m$ rows, $n$ columns)
  - line … $ax+by+c=0$
    - $(a,b,c)(x,y,1)^T=0$ … dot product
    - one interpretation of the dot product – projection of one vector onto the other
  - cross product of vectors $\in\mathbb R^3$
    - computed using determinants of 2×2 matrices (with alternating signs)
    - $x\times y=(x_2y_3-x_3y_2,\, x_3y_1-x_1y_3,\,x_1y_2-x_2y_1)$
  - determinant
    - $\det(AB)=\det A\cdot\det B$
    - determinant zero/nonzero ↔ co-linearity of the column vectors
  - equivalence
    - $X\sim Y$ if $\exists\lambda\neq 0:\lambda X=Y$
  - homogeneous coordinates – we add a *one* at the end of the vector, so instead of $(x,y)$ we get $(x,y,1)$
  - Cholesky decomposition
- projective geometry
  - perspective deformation can be modeled with a 2D projective transformation
  - Euclidean geometry $\subset$ affine geometry $\subset$ projective geometry
  - infinity is modeled in projective geometry – so even parallel lines intersect (at infinity)
  - ray (point) – the set $\mathcal R_v$ of all vectors $kv$ with direction $v$
    - where $v\in\mathbb R^{n+1}\setminus\{0\},\ k\in\mathbb R^*$
  - projective space
    - vectors in $\mathbb R^{n+1}$ → projective space $\mathcal P^n$
    - two vectors $X,Y$ represent the same point if $X\sim Y$
    - we divide by the last coordinate
    - if the last coordinate is zero, the rays don't intersect the plane (the points are at infinity)
    - another intuition – using a sphere/dome
  - projective geometry – infinity
    - $(x,y,1)$, we move in direction $(a,b)$
    - $(x+\lambda a,y+\lambda b,1)\sim (\frac x\lambda+a,\frac y\lambda +b,\frac 1\lambda)$
    - for $\lambda\to\infty$, we get $(a,b,0)$
    - all lines with direction $(a,b)$ meet at infinity – at the point $(a,b,0)$
  - projective transformations
    - preserve incidence relationships – collinearity and concurrence
      - concurrence … if two lines intersect at $p$, they will intersect at $Mp$ after applying the transformation defined by $M$
  - projective basis
    - a set of $(n+2)$ points of $\mathcal P^n$, no $(n+1)$ of which are linearly dependent
    - example – the canonical basis of $\mathbb R^{n+1}$, plus $(1,1,\dots,1)^T$
    - a change of basis is a projective transformation
    - the projection matrix $M$ is determined up to a scale factor
  - hyperplanes – $m$ linearly independent points form a projective space of dimension $m-1$
  - duality
    - two lines define a point, two points define a line
    - also in some other cases
  - exercises
    - find the intersection $X$ of lines $L_1,L_2$
      - both lines must contain $X$ → $L_1X=0$ and $L_2X=0$
      - use the cross product: $X=L_1\times L_2$
    - find the line $L$ going through $X_1,X_2$ → similar reasoning (cross product)
  - conics – projective transformations of the circle (the “infinity line” touches the parabola and intersects the hyperbola)
  - transformation groups ![](attachments/transf-groups.png)
    - Euclidean/isometry (allows translation and rotation)
    - similarity (also allows resizing)
    - affine
    - projective – only 8 DOF, because it is invariant to scaling
  - exercises
    - homography defined by 4 point correspondences
      - homography … projective transformation matrix
      - to move from 3 variables “up to scale” to 2 variables, we divide by the third variable (if it's nonzero) and assume it's 1
        - we need to consider that it could be zero → similarly, we then divide by the second variable (so the second variable is 1 and the third is 0)
        - so we need to consider several situations: $p_z'\neq 0$; $p_z'=0$ and $p'_x\neq 0$; …
      - then we just write $P'\sim HP$, which corresponds to $\gamma(u,v,1)^T=H\cdot (x,y,1)^T$
        - $\gamma u=h_1x+h_2y+h_3$
        - $\gamma v=h_4x+h_5y+h_6$
        - $\gamma=h_7x+h_8y+h_9$
      - this can be rewritten as
        - $u(h_7x+h_8y+h_9)=h_1x+h_2y+h_3$
        - $v(h_7x+h_8y+h_9)=h_4x+h_5y+h_6$
      - and rearranged as two linear equations with 9 variables “up to scale” (so 8 variables)
      - so we need 4 correspondences to get 8 equations
      - refer to [CMU slides](https://www.cs.cmu.edu/~16385/lectures/lecture9.pdf#page=47) for the entire procedure
    - for a homography $H$ transforming points, the associated transformation for lines is $H^{-T}$
      - $\ell^Tx=0$
      - $\ell^TH^{-1}Hx=0$
      - $(H^{-T}\ell)^THx=0$
    - affine transformations preserve parallelism
      - parallel lines … $L_1\times L_2=(x,y,0)^T$
      - $L_1'\times L_2'=H(x,y,0)^T$
      - if $H$ is an affine transformation, it preserves the last 0 (a projective transformation might not)
- 3D geometry
  - elementary transformations
    - translation $T=\begin{bmatrix}1&0&0&T_x\\0&1&0&T_y\\0&0&1&T_z\\0&0&0&1\end{bmatrix}$
    - rotations $R=R_z\cdot R_y\cdot R_x$
      - $\det R=1$
      - $R^{-1}=R^T$
      - $R_{x}=\begin{bmatrix}1&0&0&0\\0&\cos\theta &-\sin\theta &0\\0&\sin\theta &\cos\theta&0\\0&0&0&1\end{bmatrix}$
      - $R_{y}=\begin{bmatrix}\cos\theta &0&\sin\theta &0\\0&1&0&0\\-\sin\theta &0&\cos\theta&0\\0&0&0&1\end{bmatrix}$
      - $R_{z}=\begin{bmatrix}\cos\theta &-\sin\theta &0&0\\\sin\theta &\cos\theta&0&0\\0&0&1&0\\0&0&0&1\end{bmatrix}$
  - perspective projections
    - into the optical centre
    - focal length $f$ … distance between the projection centre and the image plane
    - image point $(x,y,z{=}f)$, world point $(X,Y,Z)$, optical axis $z$
    - here, we consider that the origin of the axes and the projection centre are the same point
    - perspective projection matrix $\begin{bmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&1/f&0\end{bmatrix}$ to get an image point from a world point
      - can be deduced using the intercept theorem (triangle similarity)
      - so $x=fX/Z$ (and $y=fY/Z$)
    - parallel lines intersect at infinity – this location defines a *vanishing point* in the image
      - for lines in a plane, the vanishing points define a *horizon line*
  - parallel projections (orthographic)
    - the projection is perpendicular to the image plane
    - we simply remove one coordinate (set it to zero)
    - as if $f\to+\infty$
  - exercise
    - you have a projection matrix $P$
    - you move or rotate the camera (you apply $T$) – what does the new projection matrix look like?
    - $P'=TPT^{-1}$
      - first, we return to the original projection space using $T^{-1}$ (we move the point there)
      - we apply the projection $P$
      - then, we come back using $T$
    - sanity check: consider an identity transformation (e.g. for a parameterized rotation matrix, set $\alpha=0$) → it should hold that $P\sim P'$
  - note: the (projected) size of an object is inversely proportional to its distance from the optical plane
    - so if we move the camera (while keeping the focal length), the object gets smaller or larger
    - vertigo effect: you move the camera (toward or away from an object) and update the focal length to make the object appear the same size (but the background changes)
- camera models
  - most used camera model – pinhole model (perspective projection)
  - full transformation composed of
    - rigid transformation: world coordinates → camera coordinates
      - both 3D, only rotation & translation
    - perspective projection onto the retinal plane (2D)
      - see the matrix above
    - 2D transformation: retinal coordinates → image pixel coordinates
      - shift & scale
  - these transformations can be written as matrix multiplications
    - the leftmost is the last one (retinal → image)
    - the third row is useless – we can remove it
    - also the perspective projection can be removed
    - so we get the global transformation matrix $M\sim(K\cdot R\quad K\cdot T)$
      - $K$ … 3×3 intrinsic parameter matrix (shift & scale)
      - $(R\ \ T)$ … 3×4 extrinsic parameter matrix (camera location)
  - a camera can therefore be described by 11 parameters (3×4 projective matrix, up to scale)
  - we need 5.5 correspondences (3D to 2D) to estimate $M$

## Applied 3D Vision

- panoramic mosaics
  - we assume all the images are taken from the same viewpoint
  - we apply a homography
    - we need four correspondences to estimate it
    - we can use least squares if we have more than four
  - we can remap all the images onto one image plane by combining transformations (so we don't need correspondences for every pair of planes)
  - photometric mapping needs to be applied to make the colors consistent
  - there may be some artifacts as a result of geometric errors
- two view geometry
  - in theory: if we have 2 image projections of the same 3D point, we can estimate its position by triangulation
  - in practice: the two rays usually don't cross (due to noise)
    - algebraic approach: use least squares to solve an overdetermined system of equations
    - geometric approach: find the closest point to the 2 viewing lines
      - $(X,Y,Z)^T=C_i+\lambda_iD_i+\frac d2D_j\times D_i$
      - $(X,Y,Z)^T=C_j+\lambda_jD_j-\frac d2D_j\times D_i$
      - so $C_i-C_j+\lambda_i D_i-\lambda_jD_j+d(D_j\times D_i)=0$
      - $C_i,C_j$ … camera (image) centers
      - $D_i,D_j$ … directions of the rays
      - $d$ … distance between the rays
      - 3 unknowns $\lambda_i,\lambda_j,d$
- epipolar geometry
  - a point in the first image has to lie on a certain line in the second image
    - this epipolar constraint is symmetric
  - the epipolar line (seen by the second camera) is the projection of the viewing ray (of the first camera)
  - so all the epipolar lines go through a single point (the epipole) – the projection of the other image center (= the first camera)
  - we can also estimate the projection of the point at infinity
  - using these two points (the epipole and “infinity”), we get the epipolar line $L\sim F_{ji}p_i$
    - $p_i$ … a point in the first image
    - $F_{ji}$ … fundamental matrix
  - the epipolar constraint between two corresponding points is then $p_j^T\underbrace{F_{ji}p_i}_L=0$
  - $F^T_{ji}=F_{ij}$ … the fundamental matrix defining epipolar lines in image $i$ for points in image $j$
  - fundamental matrix … 3×3 matrix
    - captures all the geometric information between two images
    - is singular
      - maps points to lines (is one-to-many)
      - has a kernel: $F_{ij}e_j=0$
      - in practice has rank 2
    - every correspondence gives us only one equation (equal to 0)
      - so we need 8 correspondences
      - but without any additional constraint, the estimated matrix will not be singular
  - essential matrix – takes the calibrations $K_i,K_j$ into account
    - $E_{ji}\sim K_j^TF_{ji}K_i$
- robust estimation
  - RANSAC … random sample consensus
  - we randomly select a set of observations, fit our model, and count the outliers (points with distance $\gt$ threshold)
  - we run $n$ iterations and keep the best model (the one with the fewest outliers)
  - probability estimation
    - $w$ … proportion of good data
    - $n$ … sample size
    - $k$ … number of draws
    - $w^n$ … probability of drawing $n$ good points
    - $(1-w^n)^k$ … probability that none of the $k$ draws is good
    - we want the probability of getting a good sample to be higher than $p$
      - $1-p=(1-w^n)^k$
      - $k\gt \ln(1-p)/\ln(1-w^n)$

## Shape Modeling

- motion capture systems using markers
  - markers provide sparse motion information
  - no information on complete shapes or their appearances
- relief perception – two nearby viewpoints
  - hologram – relief image
  - 3D video can be observed from any viewpoint (so 3D cinema is not really 3D)
- 4D modeling issues
  - acquisition issues – cameras with different modalities
  - shapes and appearances – need to be learned over time
  - recovering robust motion information, modeling and analyzing motions/gaits
- multi-view platforms
  - input: multi-view videos
  - output: 4D representations
- shape representation models
  - point clouds
    - usually correspond to the output of sensing devices
    - we don't know what is inside/outside, we don't have surfaces (but there are algorithms we can use to get surfaces)
  - voxels
    - 3D occupancy grid
    - complexity proportional to the size of the box
      - we usually want to first find the shape and modify the size of the box accordingly
    - we can use *marching cubes* to get a surface model (mesh)
  - 3D meshes
    - similar to a point cloud, but we have connectivity (often triangles)
    - we can easily attach textures
    - discretization attached to the shape (so simpler shapes can be discretized using only a few triangles)
    - irregular data grid – a mesh is a graph with nodes (vertices), edges (between neighboring vertices), and faces (often triangles)
  - image based
    - no explicit 3D model
    - we generate a new image based on existing images
    - the set of images has to be calibrated (position and orientation data need to be stored), so that we can properly generate the new ones
    - appearance implicitly handled by the model
    - discretization attached to the images
    - parallax occlusions are difficult to handle without going to 3D
  - implicit
    - the object is described by a function $f:\mathbb R^3\to\mathbb R$
    - points with a constant value define surfaces (level-sets)
    - explicit meshes can be extracted (using the marching cubes algorithm)
    - no explicit correspondences between different shapes
    - used in ML (the functions can be learned by MLPs)
- traditional generative 3D-4D modeling pipeline (without a prior model)
  - image acquisition → primitive extraction & calibration → shape construction → shape tracking → 3D-4D model
  - the appearance of the model is based on the initial image and the constructed shape
- silhouette
  - extraction
    - chroma keying (blue or green background)
      - problem: white clothes, sweat, … tend to reflect green
    - background subtraction (static background)
      - take a photo without the object, take a photo with the object → what has changed?
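The background subtraction idea can be sketched as a per-pixel color-distance threshold; the threshold value and toy images are my own illustration:

```python
import numpy as np

def background_subtract(frame, background, thresh=0.1):
    """Silhouette extraction by background subtraction: pixels whose color
    differs enough from the empty-scene reference image are marked as
    foreground (True in the returned binary mask)."""
    diff = np.linalg.norm(frame.astype(float) - background.astype(float), axis=-1)
    return diff > thresh

bg = np.zeros((4, 4, 3))            # reference photo without the object
frame = bg.copy()
frame[1:3, 1:3] = [0.9, 0.2, 0.1]   # an "object" appears in a 2x2 region
mask = background_subtract(frame, bg)
print(mask.sum())  # 4 foreground pixels
```

A fixed threshold is exactly where the listed problems bite: color ambiguities and lighting changes shift the per-pixel distance, which is why learned approaches (e.g. a U-Net) are also used.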
      - problems: color ambiguities between background and foreground objects, lighting changes, …
      - we can also use a U-Net
- from silhouettes to shapes
  - visual hull
    - the observed object is contained in it
    - some points are touching
  - voxel carving algorithm
    - we start with each voxel full (equal to 1)
    - we project every voxel into every camera (using the corresponding projection matrix)
    - we set it to zero if the pixel is outside of the silhouette for the given camera
    - note: usually, the cameras are not orthographic, they have an optical centre
  - notes
    - camera images have the origin in the top left corner
    - a pixel with coordinates $(32,10)$ … column 32, row 10
      - in a matrix, it's vice versa
    - concave shapes are difficult to carve
      - we can put a camera inside the shape – but it is impractical (wiring, safety, etc.), and not general
- getting 3D points
  - active system: depth camera
    - we can use a structured light projector (projects “stripes” so that we can detect depth); the calibration model is the same as for a camera
    - time of flight cameras – measure the light travel from the source to the 3D scene and back
    - stereovision cameras – a 3D point matched in 2 images
  - passive system: multi view stereo (with color)
    - more than 2 cameras
    - we construct a depth map using photo-consistency
    - challenge: occlusion (some other object, visible from a different angle, is blocking the view of a certain camera)
  - some platforms use both systems
  - active systems have inherently more limitations than passive ones

## Deep Learning

- non-linear transformations applied to the input data
  - we compose simple functions to create complex functions
- CNNs
  - convolutions with learned kernels
  - pooling to enforce invariances
    - we want translation equivariance (detect the object wherever it is in the image)
  - one layer usually consists of 1) convolution, 2) non-linearity, 3) spatial pooling, 4) normalization
  - AlexNet
    - 8 layers in total
      - convolution & pooling – layers 1, 2, 5
      - convolution only – 3, 4
      - fully connected – 6, 7
      - output layer (with softmax) – 8
    - learned on ImageNet (1000 categories, 1.2M training images, 150k test images)
- The Bitter Lesson (essay) – instead of trying to build our domain knowledge into AI, we should build better models so they can learn it themselves
- training: gradient descent, backpropagation, parallelism on GPUs
- encoder/decoder (autoencoder) architecture – useful for segmentation, getting 3D from 2D
- McCulloch-Pitts neuron – $n$ binary inputs, one binary output (the result of $\sum_i x_i\geq\theta$ for a threshold $\theta$)
- Rosenblatt's perceptron
  - $w\cdot x+b\gt 0$
  - update: $w_i\leftarrow w_i+\alpha(t_j-y(x_j))x_{j,i}$
  - problem: a set that is not linearly separable (the algorithm does not terminate), e.g. XOR
    - an MLP solves that
- activation functions: linear, logistic (sigmoid), tanh, ReLU, GELU
  - without non-linearity, the network could be written using only 2 layers (without hidden layers)
  - a continuous function can be approximated with a linear combination of translated/scaled ReLU functions
- gradient descent
  - $w\leftarrow w-\alpha\nabla E(w)$
  - SGD – only use a subset to compute the loss
  - backpropagation – previous computations can be reused (they are needed because of the chain rule)
  - initialization of weights using small random values
- families of NNs: MLPs, CNNs, autoencoders & VAEs, generative adversarial networks (GANs), RNNs & long short-term memory (LSTMs), graph neural networks (GNNs), transformers
- bias-variance tradeoff (underfitting × overfitting)
- methodology – in general
  - state the problem you want to solve
    - classification, regression, segmentation
  - state the input and output
  - narrow down the application domain
  - choose a “vanilla” architecture
  - choose the losses
  - evaluate properly (cross validation, train/validation sets)
  - be aware of (and explore) the biases in the datasets
  - push the boundaries

## Deep Learning Architectures

- CNNs, pooling, normalization
  - 32×32×3 image & 5×5×3 filter → 28×28×1 output
  - striding, padding, dilation (spaces between the convolved pixels)
  - max and average pooling are the most common
  - normalization – idea: the output of the linear combinations should follow a standard Gaussian distribution (zero mean, unit variance)
- CNN architectures
  - LeNet – images of size 32×32
  - AlexNet – 224×224
  - VGG – only small 3×3 filters, more layers than AlexNet
  - GoogLeNet Inception
    - very deep
    - problem: vanishing gradients
    - solution: intermediate losses are used
  - ResNet
    - solution: skip connections
  - U-Net
    - convolution/deconvolution architecture for segmentation
    - skip connections from the intermediate representations to the deconvolutions (across the U shape)
- object detection
  - two-stage detection: 1) is there something? (where?) 2) what is it?
    - R-CNN (region CNN) – extract regions, classify
      - problems: too many regions (slow), region extraction is not learning-based
    - Fast R-CNN – first compute the convolutions for the entire image, then look at the regions (the convolutions don't have to be computed separately for each region)
    - Faster R-CNN – learns to predict the regions of interest
    - Mask R-CNN – predicts a binary mask for each detected object
  - one-stage detection
    - YOLO (you only look once)
      - class probabilities, bounding box prediction (with confidence)
      - v4 uses mosaic augmentation (combining several annotated images into a mosaic and using it for training makes the model more robust)
- generative models
  - autoencoder
    - learns a representation (encoding) of unlabeled data
    - dimensionality reduction (ignoring insignificant data)
      - the usual technique for dimensionality reduction: PCA
        - limitation: it cannot do non-linear reparametrization (like switching from Cartesian to polar coordinates)
    - encoder → code (“bottleneck”) → decoder
    - the code size is very important
      - too big → overfitting, lack of interest
      - too small → loss of information
    - denoising AE
      - add noise to the input image
      - pass it through the network
      - measure the reconstruction loss against the original image
    - with PCA, we can interpolate (we can safely take a point between two points and it is meaningful)
    - generally, AEs can lead to an irregular latent space (with no meaningful interpolation → close points in the latent space are not similar when decoded)
    - we want to regularize the latent space to enable generation → **variational AE**
      - an input is encoded as a distribution (not as a point) in the latent space
      - regularization using the KL-divergence
        - closed form for two normal distributions
      - the decoder then decodes a sample from the distribution
      - the loss consists of the distance of the decoded image from the original and the KL divergence of the encoded distribution from the $N(0, I)$ distribution
      - how to sample in a way that allows backpropagation?
        - reparametrization trick
        - instead of sampling $z$ from $N(\mu_x,\sigma_x)$, we always sample $\zeta$ from $N(0,I)$ and then compute $z=\sigma_x\zeta+\mu_x$
        - so we can differentiate $z$ (and backpropagate)
      - limitation: the images are blurry
  - generative adversarial network
    - generator, training set, discriminator
    - tries to match the real distribution
    - goal: fool the discriminator
    - GANs produce sharper images than VAEs
      - VAEs use the mean squared error, so producing sharp edges is not worth the risk
      - a GAN discriminator can easily spot blurry images
  - diffusion models
    - slowly destroy structure in the data distribution by adding noise
      - a parameter $\beta_t$ (for time $t$) governs the noise distribution (Gaussian)
      - its schedule affects the learning
    - the model tries to remove the noise
      - only step by step, to improve stability
    - trained using a U-Net (with adjustments)
    - can be conditioned on other inputs (other images, text, style, …)
      - then $\epsilon$ gets another parameter
      - Stable Diffusion uses CLIP to transform text into embeddings for conditioning
    - ways to make diffusion faster – distillation, one-step diffusion
  - diffusion vs. GANs
    - training stability
      - a GAN's optimum is a saddle point (min max)
      - diffusion is more stable
    - diversity
      - GANs have a risk of mode collapse (generating just one type of output)
      - diffusion models the distribution
    - cost & memory
      - GANs – fast, medium memory
      - diffusion – slow, high memory
    - conditional generation – diffusion outperforms GANs
- recurrent neural networks
  - for sequences, persistence
  - problem of long-term dependencies
  - long short-term memory
    - we update the state with multiplication and addition
    - forget gate: is this worth remembering? (multiplying by a number $\in[0,1]$)
    - input gate: what new information are we storing in the cell state
    - output gate: output of the cell
    - variants: peephole connections (let the gate layers look at the cell state), grouping the forget and keep gates, GRU (gated recurrent unit)
  - types: one to one, one to many, many to one, many to many (2×)
  - problems
    - a GRU still forgets stuff (especially with long sequences)
    - processing is sequential, hard to parallelize on GPUs
    - → Transformer architecture
- existing tasks are solved (to some extent)
  - new tasks need to be created → ARC-AGI datasets

## Paper Session 2

- ViT
  - self-attention can be as expressive as a convolutional layer
  - the image is split into patches
  - encoder-only Transformer, learned position embeddings (capture relative distances of patches)
  - needs enough data to learn the patterns (in CNNs, there is an inductive bias)
- DINO
  - emerging properties in self-supervised vision transformers
  - ViT limitations
    - supervised labels are coarse (one image – one label)
    - no explicit pressure to learn object boundaries, parts vs.
background, spatial grouping - idea: use self-supervised pre-training - knowledge **di**stillation with **no** labels - different crops - teacher gets only global - student gets all crops - → enforce local-to-global correspondences - to avoid mode collapse – centering and sharpening applied to the teacher's output - we are forcing the teacher to “be creative” when predicting distribution - sharpening prevents the distribution from being flat - centering prevents the distribution to always peak at the same place (by computing a running average) - setup - MLP on top of `[class]` encoding (like in ViT) → distribution - goal: minimize cross-entropy between student and teacher - hi-res stereo datasets with subpixel ground truth (Middlebury) - capturing objects with stereo camera - how to create such datasets - two rounds of calibration (using checkerboard) - code images to get depth - ambient images - imperfection – useful for evaluation (real-world images are not perfect) - nothing in the world is perfect - pose reconstruction - SMPL body model – shape & pose - two approaches - regression – predicts shape & pose based on the image - optimization – optimizes shape & pose based on camera parameters and ground truth of 2D joints (and initial shape & pose) - SPIN combines both - initializes optimization with regression outputs - uses optimization to supervise regression ## Practical Units ### Augmented Reality - detection – detecting, refining, and drawing the projected image points of the checkerboard from a captured image 1. implement simple checkerboard spot detection with **findChessboardCorners** on grayscale version of the input image 2. draw the checkerboard points on the color input image with **drawChessboardCorners**. 3. the points from question 1 can be improved in precision by refining based on a local window around the initial detections (before drawing the checkerboard points, add the refinement step using **cornerSubPix**) - calibration, positioning 1. 
calibrate the camera using **calibrateCamera** (3D to 2D matches at every detection that will be given to this function) 2. find the camera extrinsics using **solvePnPRansac** 3. use **projectPoints** to find the 2D reprojections of the 3D points using the estimated intrinsics and extrinsics to check that the result is correct - drawing using reprojected points - the coordinates of a 3D cube are provided which you can use as initial augmentation object, but you can draw whatever you wish (use **circle**, **line**, **drawContours** for this purpose) ### 3D Shape Modeling - at the beginning of the file the calibration matrices for the $12$ silhouette cameras are stored in the array $calib$ – each matrix corresponds to a $3\times 4$ linear transformation from 3D to 2D - then the voxel 3D coordinate arrays $X, Y, Z$ and the associated occupancy grid, that shall be filled with $0$ or $1$, are defined - grid resolution can be modified with the parameter resolution (if you program is too slow, you can reduce the resolution) - training MLP to capture implicit representation - input: $x,y,z$ - output: occupancy (0 or 1) - training based on the voxel representation ### Segmentation – HourGlass vs. U-Net - HourGlass – autoencoder with convolutions and deconvolutions, has bottleneck - U-Net was invented to address precision problems of the original hourglass network - provides skip connections from intermediate representations
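As a worked example for the VAE reparametrization trick from the generative-models notes: a minimal NumPy sketch with made-up latent parameters (`mu` and `sigma` are illustrative values, not outputs of any real encoder).

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy encoder output for one input: a diagonal Gaussian N(mu_x, sigma_x^2)
# over a 2-dimensional latent space (values made up for illustration).
mu = np.array([0.0, 1.0])     # mu_x
sigma = np.array([0.5, 2.0])  # sigma_x

# Reparametrization trick: instead of sampling z ~ N(mu, sigma^2) directly
# (not differentiable w.r.t. mu and sigma), sample zeta ~ N(0, I) and
# compute z = sigma * zeta + mu -- plain differentiable arithmetic.
zeta = rng.standard_normal((100_000, 2))
z = sigma * zeta + mu

# Sanity check: the samples indeed follow N(mu, sigma^2).
print(z.mean(axis=0))  # close to [0.0, 1.0]
print(z.std(axis=0))   # close to [0.5, 2.0]
```

In an autodiff framework, gradients of the loss then flow through `sigma * zeta + mu` into the encoder weights, which direct sampling would block.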
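The silhouette-based occupancy filling described in the 3D Shape Modeling unit can be sketched roughly as follows. This is a hypothetical stand-alone version, not the assignment's code: the function name `carve` and the argument layout are my own, assuming `calib` holds the $3\times 4$ projection matrices and each silhouette is a binary mask.

```python
import numpy as np

def carve(calib, silhouettes, X, Y, Z):
    """Space-carving sketch: a voxel is kept (occupancy 1) iff its projection
    falls inside the silhouette of every camera.

    calib: iterable of 3x4 projection matrices (one per camera)
    silhouettes: matching list of 2D binary masks
    X, Y, Z: arrays of voxel-center coordinates (any common shape)
    """
    # Homogeneous 4 x N matrix of voxel centers.
    pts = np.stack([X.ravel(), Y.ravel(), Z.ravel(), np.ones(X.size)])
    occupancy = np.ones(X.size, dtype=bool)
    for P, sil in zip(calib, silhouettes):
        uvw = P @ pts                               # 3 x N image coordinates
        u = np.round(uvw[0] / uvw[2]).astype(int)   # column index
        v = np.round(uvw[1] / uvw[2]).astype(int)   # row index
        h, w = sil.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(X.size, dtype=bool)
        hit[inside] = sil[v[inside], u[inside]] > 0
        occupancy &= hit                            # carve away misses
    return occupancy.reshape(X.shape).astype(np.uint8)
```

Vectorizing the projection over all voxels at once (rather than looping per voxel) is what keeps this usable at higher values of `resolution`.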
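The MLP occupancy training mentioned in the 3D Shape Modeling unit can be sketched with a tiny NumPy network trained on a toy occupancy function; here a unit sphere stands in for the voxel grid, and the architecture and hyperparameters are illustrative, not the assignment's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ground truth standing in for the voxel grid: points inside the
# unit sphere are occupied (1), points outside are free (0).
pts = rng.uniform(-1.5, 1.5, size=(2000, 3))
occ = (np.linalg.norm(pts, axis=1) < 1.0).astype(float).reshape(-1, 1)

# Implicit representation: a single-hidden-layer MLP, (x, y, z) -> occupancy.
W1 = rng.normal(0.0, 0.5, (3, 64)); b1 = np.zeros(64)
W2 = rng.normal(0.0, 0.5, (64, 1)); b2 = np.zeros(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr, losses = 0.5, []
for _ in range(500):
    # Forward pass.
    h = np.tanh(pts @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    eps = 1e-9  # numerical guard inside the logs
    losses.append(float(-(occ * np.log(p + eps)
                          + (1 - occ) * np.log(1 - p + eps)).mean()))
    # Backward pass: for sigmoid + BCE, dL/dlogits = p - occ (averaged).
    d2 = (p - occ) / len(pts)
    gW2, gb2 = h.T @ d2, d2.sum(0)
    dh = (d2 @ W2.T) * (1.0 - h ** 2)  # backprop through tanh
    gW1, gb1 = pts.T @ dh, dh.sum(0)
    # Gradient-descent update.
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

acc = float(((p > 0.5) == (occ > 0.5)).mean())
```

The network can then be queried at arbitrary continuous $(x, y, z)$ coordinates, which is the point of an implicit representation over a fixed voxel grid.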