# Exam ## Introduction - data compression - process of converting an input data stream into output data stream that has a smaller size - compression algorithm = encoding (compression) + decoding (decompression) - compression - lossless – the restored and original data are identical - lossy – the restored data are a reasonable approximation of the original - methods - static / adapative - streaming / block - block – we compress the data by blocks - goals – to save storage, to reduce the transmission bandwidth - measuring the performance - units - input data size … $u$ bytes (uncompressed) - compressed data size … $k$ bytes, $K$ bits - compression ratio … $k/u$ (written as percentage) - compression factor … $u:k$ - compression gain … $(u-k)/u$ (written as percentage) - average codeword length … $K/u$ (may be bpc or bpp) - bpc … bits per char - bpp … bits per pixel - relative compression (percent log ratio) … $100\ln(k/k')$ - $k'$ … size of data compressed by a standard algorithm - data corpora - Calgary Corpus – 14 files: text, graphics, binary files - Canterbury Corpus – 11 files + artificial corpus (4) + large corpus (3) + miscellaneous corpus (1) - Silesia Corpus - Prague Corpus - contests - Calgary Corpus Compression Challenge - Hutter Prize – Wikipedia compression - goal – encourage research in AI - limits of lossless compression - encoding $f:$ {$n$-bit strings} → {strings of length $\lt n$} - $|\text{Dom}f|=2^n$ - $|\text{Im}f|\leq 2^n-1$ - such $f$ cannot be injective → does not have an inverse function - let $M\subseteq\text{Dom} f$ such that $\forall s\in M:|f(s)|\leq 0.9n$ - $f$ injective on $M\implies|M|\leq2^{1+0.9n}-1$ - $n=100$ … $|M|/2^n\lt 2^{-9}$ - $n=1000$ … $|M|/2^n\lt 2^{-99}\approx 1.578\cdot 10^{-30}$ - we can't compress all data efficiently (but that does not matter, we often focus on specific instances of data) ## Lossless Compression ### Statistical Methods - basic concepts - source alphabet … $A$ - coding alphabet … $A_C$ - encoding … $f:A^*\to A^*_C$ - $f$ injective → uniquely decodable encoding - typical strategy - input string is factorized into concatenation of substrings - $s=s_1s_2\dots s_k$ - substrings = phrases - $f(s)=C(s_1)C(s_2)\dots C(s_k)$ - $C(s_i)$ … codeword - codes - (source) code … $K:A\to A^*_c$ - $K(s_i)$ … codeword for symbol $s_i\in A$ - encoding $K^*$ generated by code $K$ is the mapping $K^*(s_1s_2\dots s_n)=K(s_1)K(s_2)\dots K(s_n)$ for every $s\in A^*$ - $K$ is uniquely decodable $\equiv$ it generates a uniquely decodable encoding $K^*$ - example - $\set{0,01,11}$ … is uniquely decodable - $\set{0,01,10}$ … is not, counterexample: $010$ - $K_1$ … fixed length - $K_2$ … no codeword is a prefix of another one - prefix codes - prefix code is a code such that no codeword is a prefix of another codeword - observation - every prefix code $K$ (over a binary alphabet) may be represented by a binary tree = prefix tree for $K$ - prefix tree may be used for decoding - idea: some letters are more frequent, let's assign shorter codewords to them - historical example – Morse code - Shannon-Fano coding - input – source alphabet $A$, frequency $f(s)$ for every symbol $s\in A$ - output – prefix code for $A$ - algorithm - sort the source alphabet by symbol frequencies - divide the table into parts with similar sums of frequencies - append 0/1 to prefix, repeat - divide and conquer algorithm - definition: the algorithm constructs an optimal prefix / uniquely decodable code $K\equiv$ $|K'^*(a_1a_2\dots a_n)|\geq |K^*(a_1a_2\dots a_n)|$ for every prefix 
(uniquely decodable) code $K'$ and for every string $a\in A^*$ with symbol frequencies $f$ - Shannon-Fano is not optimal - Fano's student David Huffman described a construction of optimal prefix code ### Huffman Code - algorithm - constructs the tree from the bottom - starts with the symbols that have the lowest frequencies - sum of the frequencies is used as the frequency of the new node - greedy algorithm - optimality of Huffman code - notation - alphabet $A$ of symbols that occur in the input string - $\forall s\in A:f(s)$ … frequency of the symbol - let $T$ be a prefix tree for alphabet $A$ with frequencies $f$ - leaves of $T$ … symbols of $A$ - let $d_T(s)$ denote the depth of a leaf $s$ in the tree $T$ - $d_T(s)=$ length of the path from root to $s$ - $d_T(s)=$ length of the codeword for symbols $s$ - cost $C(T)$ of tree $T$ - $C(T)=\sum_{z\in A}f(z)d_T(z)$ - $C(T)$ … length of the encoded message - lemma 0: if a prefix tree $T$ ($|V(T)|\geq 3$) contains a vertex with exactly one child, then $T$ is not optimal - proof: we could contract the edge between the vertex and its child - lemma 1 - let $A$ ($|A|\geq 2$) be an alphabet with frequencies $f$ - let $x,y\in A$ be symbols with the lowest frequencies - then there exists an optimal prefix tree for $A$ in which $x,y$ are siblings of maximum depth - proof - let $T$ be an optimal tree where $x,y$ are not siblings of maximum depth (instead, there are $a,b$ with maximum depth) - now by switching $a$ and $x$, we get $T'$ - $C(T)-C(T')=d_T(x)f(x)+d_T(a)f(a)-d_{T'}(a)f(a)-d_{T'}(x)f(x)=$ $\dots\geq 0$ - $T'$ is also optimal - we repeat the steps for $y$ - we get $T''$ - $T$ is optimal $\implies T''$ is optimal - lemma 2 - let $A$ be an alphabet with $f$ - let $x,y\in A$ be symbols with the lowest frequencies - let $T'$ be an optimal prefix tree for alphabet $A'=A\setminus\set{x,y}\cup\set{z}$ where $f(z)=f(x)+f(y)$ - then tree $T$ obtained from $T'$ by adding children $x$ and $y$ to $z$ is an optimal prefix tree for $A$ - proof - $T'$ optimal for $A'$ - we get $T$ by adding $x,y$ below $z$ (therefore $z$ is no longer a symbol of the alphabet) - there exists $T''$ optimal for $A$ in which $x,y$ are siblings of maximum depth - using lemma 1 - we label their parent $z$ - we remove $x,y$ and get $T'''$ for $A'$ - $C(T)=C(T')+f(x)+f(y)\leq C(T''')+f(x)+f(y)=C(T'')$ - clearly $C(T')\leq C(T''')$ as $T'$ is optimal - $T''$ optimal $\implies T$ optimal - theorem: algorithm Huffman produces an optimal prefix code - proof by induction - theorem (Kraft-McMillan inequality) - codewords lengths $l_1,l_2,\dots,l_n$ of a uniquely decodable code satisfy $\sum_i 2^{-l_i}\leq 1$ - conversely, if positive integers $l_1,l_2,\dots,l_n$ satisfy the inequality, then there is a prefix code with codeword lengths $l_1,l_2,\dots,l_n$ - corollary - Huffman coding algorithm produces an optimal uniquely decodable code - if there was some better uniquely decodable code, there would be also a better prefix code (by Kraft-McMillan inequality) so Huffman coding would not be optimal - idea: we could encode larger parts of the message, it might be better than Huffman coding - extended Huffman code … we use $n$-grams instead of individual symbols - Huffman coding is not deterministic, there is not specified the order in which we take the symbols to form the tree – it may lead to optimal trees of different height - can we modify Huffman algorithm to guarantee that the resulting code minimizes the maximum codeword length? 
- yes, when constructing the tree, we can sort the nodes not only by their frequencies but also by the depth of their subtree - what is the maximum height of a Huffman tree for an input message of length $n$? - we find the minimal input size to get the Huffman tree of height $k$ - how to make the tree as deep as possible? - the frequencies of the symbols will be Lucas numbers (with the zeroth number broken in two ones) or Fibonacci sequence – depending on the rules of the construction of the tree - Huffman tree of height $k\implies n\geq f_{k+1}$ - where $f_0=1,\;f_1=1,\;f_{k+2}=f_{k+1}+f_k$ - $n\lt f_{k+2}\implies$ max. codeword length $\leq k$ - there is an optimal prefix code which cannot be obtained using Huffman algorithm - generalize the construction of binary Huffman code to the case of $m$-ary coding alphabet ($m\gt 2$) - the trivial approach does not produce an optimal solution - for the ternary coding alphabet, we can modify the first step – we will only take two leaves to form a subtree - implementation notes - bitwise input and output (usually, we work with bytes) - overflow problem (for integers etc.) - code reconstruction necessary for decoding - canonical Huffman code - if we know that the code contains 2 codewords of length 2 and 4 codewords of length 3, we can construct the code (we don't need to encode the tree) - 00, 01, 100, 101, 110, 111 - adaptive compression - static × adaptive methods - statistical data compression consists of two parts: modeling and coding - what if we cannot read the data twice? - we will use the adaptive model - adaptive Huffman code - brute force strategy – we reconstruct the whole tree - after each frequency change - after reading $k$ symbols - after change of order of symbols sorted by frequencies - characterization of Huffman trees - Huffman tree – binary tree with nonnegative vertex weights - two vertices of a binary tree with the same parent are called siblings - binary tree with nonnegative vertex weights has a sibling property if - $\forall$ parent: weight(parent) = $\sum$ weight(child) - each vertex except root has a sibling - vertices can be listed in order of non-increasing weight with each vertex adjacent to its sibling (the weights are non-decreasing if we print them by levels, bottom to top, left to right) - theorem (Faller 1973): binary tree with non-negative vertex weights is a Huffman tree iff it has the sibling property - FGK algorithm (Faller, Gallagher, Knuth) - we maintain the sibling property - zero-frequency problem - how to encode a novel symbol - 1st solution: initialize Huffman tree with all symbols of the source alphabet, each with weight one - 2nd solution: initialize Huffman tree with a special symbol $esc$ - encode the 1st occurence of symbol $s$ as a Huffman code of $esc$ followed by $s$ - insert a new leaf representing $s$ into the Huffman tree - average codeword length … $l_{FGK}\leq l_H+O(1)$ - implementation problems - overflow - we can multiply frequencies by $\frac12$ - this may lead to tree reconstruction - statistical properties of the input may be changing over time – it might be useful to reduce the influence of older frequencies in time - by multiplying by $x\in(0,1)$ - we'll use integral arithmetic → we need to scale the numbers ### Arithmetic Coding - each string $s\in A^*$ associate with a subinterval $I_s\subseteq[0,1]$ such that - $s\neq s'\implies I_s\cap I_{s'}=\emptyset$ - length of $I_s=p(s)$ - $C(s)=$ number from $I_s$ with the shortest code possible - we'll first assume that $p(a_1a_2\dots 
a_m)=p(a_1)\cdot p(a_2)\cdot \ldots \cdot p(a_m)$ - encoding - we count the frequencies → probabilities - we sort the symbols in some order - we assign intervals to the symbols in the order - symbol with a higher frequency will be assigned a larger interval - to encode the message, we subdivide the intervals - for example, let's say that $I_B=(0.2,0.3)$ and $I_I=(0.5,0.6)$ - then $I_{BI}=(0.25,0.26)$ - the shorter the message, the longer the interval - we have to store the length of the input or encode a special symbol (end of file) to be able to distinguish between the encoded string and its prefixes - decoding - we reconstruct the input string from the left - we inverse the mapping - problem: using floating point arithmetic leads to rounding errors → we will use integers - underflow may happen that way if we use too short integers - instead of individual frequencies of each symbol, we will use cumulative frequencies (it directly represents the intervals) - if we have $b$ bits available to represent the current state, the (initial) interval can be $[0,2^{b}-1)$ or $[0,2^{b-1})$ - another problem - we use bit shifting and assume that the first bits of both bounds are the same - what if the first bits differ? - we don't output anything yet and increment a counter - it is similar to the situation (in real numbers) when we get to the range 0.53–0.67 - we cannot write 5 or 6 and shift the number because we don't know whether the next range will be something like 0.53–0.56 or 0.63–0.67 - we postpone the decision instead - algorithm - start with the interval $[0,2^{b-1})$ represented as left bound $L$ and range $R$ - encode each character of the block - if the range $R$ falls below $2^{b-2}$, we need to prevent underflow - if the current interval (range) starts in the first half of the interval $[0,2^{b-1})$ and ends in the second half, we don't output anything, only increment the counter and rescale the current interval - otherwise, we output one or more (most significant) bits of $L$ and rescale the current interval (also, we decrement the counter by printing the correct bits) - time complexity - encoding $O(|A|+|\text{output}|+|\text{input}|)$ - decoding $O(|A|+|\text{output}|+|\text{input}|\log|A|)$ - it is also possible to make decoding faster by using more memory - we will represent the table of frequencies as an implicit tree - that way we can decode in time $O(|A|+|\text{output}|+|\text{input}|)$ - adaptive version - cumulative frequencies table as an implicit tree - Fenwick tree - Fenwick frequency = prefix sum (sum of all the weights of the lexicographically smaller vertices) ### Theoretical Limits - Hartley's formula - minimum and average number of yes/no questions to find the answer - if $x$ is a member of $n$ element set, then $x$ carries $\log_2 n$ bits of information - if we have a $k$-tuple of elements of $n$ elements set and $S_k$ is the least number of questions necessary to determine all $x_i$, then $S_k/k$ is the average number of questions necessary to determine one $x_i$ - $|R|=n,\;x\in R^k$ - $\log n^k\leq S_k\lt \log n^k+1$ - $k\log n\leq S_k\lt k\log n+1$ - $\log n\leq S_k/k\lt\log n+1/k$ - Shannon's formula - entropy … $H(X)=-\sum_{x\in R}p(x)\log p(x)$ - theorem: $H(X)\geq 0$ - lemma: Gibbs inequality - theorem: for a random variable $X$ with a finite range $R$, $|R|=n$, we have $H(X)\leq\log n$ - and $H(X)=\log n\iff \forall x\in R:p(x)=\frac1n$ - we plug it into the inequality … $q_i=\frac1n$ - Kraft-McMillan theorem - codeword lengths $l_1,l_2,\dots$ of a uniquely 
decodable code $C$ satisfy $\sum_i 2^{-l_i}\leq 1$ - on the other hand, if natural numbers $l_1,l_2,\dots$ satisfy the inequality, then there is a prefix code with codewords of these lengths - proof - let's start with the $k$-th power of the sum - $(\sum_{i=1}^n 2^{-l_i})^k=(\sum_{x\in R} 2^{-l(x)})^k$ - we can rewrite that as a sum of products - and $l(x_{i_1})+l(x_{i_2})=l(x_{i_1}x_{i_2})$ - therefore it is equal to $\sum_{x_{i_1}\dots x_{i_k}\in R^*}2^{-l(x_{i_1}\dots x_{i_k})}$ - which equals $\sum_{i=1}^{k\cdot l_\text{max}}n(i)\cdot 2^{-i}\leq \sum_{i=1}^{k\cdot l_\text{max}}2^i\cdot 2^{-i}=k\cdot l_\text{max}$ - now $\sum_{i=1}^n2^{-l_i}\leq\lim_{k\to\infty}(k\cdot l_\text{max})^{1/k}=1$ - also, $R$ can be infinite - we can get the prefix code using arithmetic coding (probably?) - theorem: let $C$ be a uniquely decodable code for a random variable $X$, then $H(X)\leq L(C)$ - $L(C)$ … average codeword length - $L(C)=\sum_{x\in R} p(x)|C(x)|$ - proof: by Gibbs inequality, we use $\frac{2^{-l_i}}{\sum_j 2^{-l_j}}$ as the second distribution - entropy of a discrete random vector - $H(X)=nH(X_i)$ - for independent components with the same probability distribution - theorem: an arbitrary optimal prefix code $C$ for a random variable $X$ satisfies $H(X)\leq L(C)\lt H(X)+1$ - proof of the upper bound: put $l_i=\lceil-\log p_i\rceil$, then apply Kraft-McMillan - this also holds for the Huffman code (if we consider the entropy of one symbol) - analysis of arithmetic coding - $H(X)\leq L(C)\lt H(X)+2$ - $\overline{F(x)}=0.y_1y_2\dots$ is the midpoint of the interval for $x\in R$ ($F$ is a CDF) - then $C(x)=y_1y_2\dots y_{l(x)}$ where $l(x)=\lceil\log\frac1{p(x)}\rceil+1$ - in the proof, we show that $C$ is a prefix code - then, we use $L(C)=\sum p(x)\cdot l(x)$ which is upper bounded by $H(X)+2$ - here, we don't consider one symbol but the whole message → therefore, the result is much better than with the Huffman code - the main problem of the Huffman code is that it cannot deal very well with skewed symbol probabilities, as the codewords must have integral length - problems - we assume that the encoded random variables are independent - we are working with a probabilistic model of our data, which is simplified too much - a better probabilistic model - we cannot use an “unlimited” context – the probabilities would be too small - we will use a context of $k$ symbols - according to an experimental estimate, the entropy of English is 1.3 bits per symbol
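To make the bounds above concrete, here is a small Python check. The distribution is an illustrative dyadic example (not taken from the notes); it computes $H(X)$, the codeword lengths $l_i=\lceil-\log p_i\rceil$ used in the upper-bound proof, and the Kraft-McMillan sum.

```python
# Small numeric check of H(X) <= L < H(X)+1 and the Kraft-McMillan inequality
# (illustrative toy distribution, not from the lecture).
from math import log2, ceil

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}      # assumed toy distribution

H = -sum(px * log2(px) for px in p.values())            # Shannon entropy
lengths = {x: ceil(-log2(px)) for x, px in p.items()}   # l_i = ceil(-log p_i) from the proof
L = sum(p[x] * lengths[x] for x in p)                   # average codeword length
kraft = sum(2 ** -l for l in lengths.values())          # Kraft-McMillan sum

print(f"H(X) = {H:.3f}, L = {L:.3f}, Kraft sum = {kraft:.3f}")
assert H <= L < H + 1 and kraft <= 1
```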
### Context Methods - model of order $i$ … makes use of a context of length $i$ - methods - fixed length contexts - combined – contexts of various lengths - complete – all contexts of lengths $i,i-1,\dots,0$ - partial - static, adaptive - Prediction by Partial Matching (PPM) - Cleary, Witten, Moffat - uses arithmetic coding - we try to encode symbol $s$ in a context of length $i$ - if the symbol has zero frequency in that context (the context is too long), we output the ESC symbol and decrease the context length - assumption: every symbol has non-zero frequency for some (small) context - how to encode symbol $s$ - we have read the context $c$ of length $i$ and the current symbol $s$ - if $f(s\mid c)\gt 0$, encode $s$ using $f(*\mid c)$ - each context $c$ has its own table of frequencies (for arithmetic encoding) - otherwise, output the code for ESC and try the order $i-1$ context - exclusion principle - situation: symbol $x$ occurs in context $abc$ for the first time - we will encode it using the second order model $f(x\mid bc)$ - there is a symbol $y$ such that $f(y\mid bc)\gt 0$ and also $f(y\mid abc)\gt 0$ - by using ESC, we already signaled that we are not trying to encode any of the letters present in the third order model (such as $y$) - therefore, we can temporarily remove $y$ from the second order model so that the compression is more efficient - there are various approaches to assigning $P(esc\mid c)$ - for example, we can decrement the frequencies of all symbols by one - data structure … trie (context tree) - where each node has an edge to its largest proper suffix, which simplifies adding new leaves - it might not fit in memory - to solve that, we can freeze the model (update frequencies in existing contexts, stop adding new contexts) - or we can rebuild the model using only the recent history (not the whole file) - we can rebuild the model either when the free memory drops too low or when the relative compression ratio starts decreasing - advanced data structure: DAWG (Directed Acyclic Word Graph) - PPMII - “information inheritance” - uses heuristics - became part of RAR - PAQ - arithmetic coding over a binary alphabet with sophisticated context modelling - uses a weighted average of probability estimates from various models - this algorithm has probably the best compression ratio - but it is very time and space consuming ### Integer Coding - unary code $\alpha$ - $\alpha(1)=1$ - $\alpha(i+1)=0\,\alpha(i)$ - clearly, $|\alpha(i)|=i$ - it is the optimal code for the probability distribution $p(i)=2^{-i}$ - binary code $\beta$ - is not a prefix code! - we can use fixed length - optimal code for $p(i)=\frac1{|A|}$ - or we can gradually increase the codeword length as the dictionary size increases - Elias codes - a binary code always starts with 1 - reduced binary code $\beta'$ … without the leading 1 - $\gamma'$ … first, we encode the length of the $\beta$ code using $\alpha$, then we encode $\beta'$ - $\gamma$ … we interleave the bits of $\alpha$ and $\beta$ - $\delta$ … instead of $\alpha$, we use $\gamma$ to encode the length of $\beta$ - we could construct further codes in a similar fashion, but the $\delta$ code suffices - for large numbers, it is shorter than $\gamma$ - Elias codes are universal - observation: each probability distribution $p_1\geq p_2\geq\dots\geq p_n$ satisfies $\forall i\in[n]:p_i\leq \frac1i$ - corollary: $-\log p_i\geq -\log\frac1i=\log i$ - $|\gamma(i)|=\log i+\Theta(\log i)$ - $|\delta(i)|=\log i+o(\log i)$ ### Dictionary Methods - idea - repeated phrases stored in a dictionary - phrase occurrence in the text → pointer - problems - construction of the optimal dictionary is NP-hard → greedy algorithm - the dictionary is needed for decoding → dynamic methods - LZ77 - Jacob Ziv, Abraham Lempel - sliding window - search buffer - look-ahead buffer - search for the longest string $S$ such that - $S$ starts in the search buffer - $S$ matches the prefix of the string in the look-ahead buffer - $S$ is encoded as a triple $\braket{i,j,z}$ where - $i$ … distance of the beginning of $S$ from the look-ahead buffer - $j=|S|$ - $z$ … the symbol following $S$ in the look-ahead buffer - example - `ac|cabracad|abrarr|ar` - `___\search/\lahead/__` - $\to\braket{7,4,r}$ - the window slides $j+1$ symbols to the right - sliding window size $2^{12}$–$2^{16}$ B, look-ahead buffer size usually tens to hundreds of bytes - slow compression, fast decompression - typical application – single compression, multiple decompressions
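A minimal Python sketch of the LZ77 loop described above – emit ⟨distance, length, next symbol⟩ and slide the window. The buffer sizes and the brute-force longest-match search are illustrative simplifications; a real implementation searches the window with hashing or a trie.

```python
def lz77_encode(data: bytes, search_size: int = 4096, lookahead_size: int = 16):
    """Encode data as a list of (distance, length, next_symbol) triples."""
    out, pos = [], 0
    while pos < len(data):
        start = max(0, pos - search_size)
        best_len, best_dist = 0, 0
        # brute-force search for the longest match starting in the search buffer
        for cand in range(start, pos):
            length = 0
            while (length < lookahead_size - 1
                   and pos + length < len(data) - 1
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - cand
        nxt = data[pos + best_len]            # symbol following the match
        out.append((best_dist, best_len, nxt))
        pos += best_len + 1                   # the window slides j+1 symbols
    return out

def lz77_decode(triples):
    data = bytearray()
    for dist, length, nxt in triples:
        for _ in range(length):               # byte-by-byte copy also handles overlapping matches
            data.append(data[-dist])
        data.append(nxt)
    return bytes(data)
```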
- LZSS - the triple $(i,j,z)$ is replaced with $(i,j)$ or a literal $z$ - a bit indicator distinguishes one from the other - if a pair of pointers requires the same space as $p$ symbols, we use pairs to encode only phrases of length $\gt p$ (shorter phrases are simply copied to the output) - sliding window → cyclic buffer - possible implementation - codeword 16b, $|i|=11$, $|j|=5$ - 8 bit indicators in 1 byte (one byte describes the types of the 8 following items) - other variants - LZB (LZSS with more efficient encoding of $(i,j)$ – using a binary code with increasing length for $i$ and the $\gamma$ code for $j$) - LZH (two-pass: LZSS, Huffman) - Deflate algorithm - by Phil Katz, originally for PKZIP 2 - zip, gzip, jar, cab, png - LZSS + static Huffman coding - search buffer size 32 kB - look-ahead buffer size 258 B - match length 3..258 → 256 options - for a smaller match, it uses literals - input divided into blocks - 3bit header - is it the last block? (1b) - type of the block (2b) - 3 block types 1. no compression 2. static Huffman code compressed using trees defined in the Deflate RFC 3. dynamic Huffman code based on input data statistics - type 3 block - starts with two Huffman trees - the first tree for literals and match lengths - 0..255 for literals - 256 = end-of-block - 257..285 for match lengths (extra bits are used to distinguish between particular numbers) - the second tree for offsets (again with extra bits) - hashing is used for string searching in the search buffer - the hash table stores strings of length 3 - strings in linked lists sorted by their age - a parameter to limit the maximum list length - greedy strategy extension - first, look for a primary match - slide the window one symbol to the right, find a secondary match - if it is better, encode as literal + secondary match - LZ77 disadvantages - limited outlook of the search window (longer window → better compression, more complex search) - limited length of the look-ahead buffer (match length limitation) - LZ78 - no sliding window – an explicit dictionary instead (but the dictionary itself is not transmitted) - efficient data structure for storing the dictionary … trie - compression - start with an empty dictionary - read the longest prefix of the input which matches some phrase $f$ in the dictionary (zero length is also possible – especially if the dictionary is empty) - output $\braket{i,k(s)}$ - $i$ points to $f$ in the dictionary - $k(s)$ is a codeword for the symbol $s$ which follows $f$ in the input - insert the new phrase $fs$ into the dictionary - dictionary reconstruction is necessary when the dictionary space is exhausted - we can delete the whole dictionary - or delete only the phrases that occurred least frequently or have not occurred recently - we need to preserve the prefix property - if the dictionary contains $S$, it has to contain all prefixes of $S$ - if the compression ratio starts getting worse, we can delete the dictionary and start over - LZW - we initialize the dictionary with the alphabet - we encode the longest prefix present in the dictionary - then, we append the next character and add the result to the dictionary - when decoding, it may happen that the item is not in the dictionary yet - because the encoder is one step ahead - so we use the last decoded string and append its first character - we store only the pointers (and the alphabet?)
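A compact Python sketch of LZW as described above, including the decoder's special case for a phrase that is not in the dictionary yet; the emission of fixed-width or growing codewords is left out.

```python
def lzw_encode(data: bytes):
    """LZW: emit dictionary indices; the dictionary is initialized with all single bytes."""
    dictionary = {bytes([i]): i for i in range(256)}
    phrase, out = b"", []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate                      # keep extending the current phrase
        else:
            out.append(dictionary[phrase])
            dictionary[candidate] = len(dictionary) # longest prefix + next character
            phrase = bytes([byte])
    if phrase:
        out.append(dictionary[phrase])
    return out

def lzw_decode(codes):
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    prev = dictionary[codes[0]]
    out = bytearray(prev)
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                          # the "not in the dictionary yet" case described above:
            entry = prev + prev[:1]    # last decoded string + its first character
        out += entry
        dictionary[len(dictionary)] = prev + entry[:1]
        prev = entry
    return bytes(out)
```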
- LZC - the compress 4.0 utility in Unix (the LZC method is an implementation of the LZW algorithm) - pointers into the dictionary – binary code with increasing length (9–16 bits) - when the maximum size is reached, it continues with a static dictionary - when the compression ratio starts getting worse, it deletes the dictionary and starts over with the initial setting (a special symbol is used for that) - LZMW - idea: new phrase = concatenation of the two last ones (→ phrase length increases faster) - dictionary full → LRU strategy (delete the phrases that have not occurred recently) - the dictionary lacks the prefix property – this complicates searching, backtracking is used - LZAP - idea: instead of $ST$ add all substrings $Sp$ where $p$ is a prefix of $T$ - larger dictionary → longer codewords, wider choice for phrase selection - faster search → backtracking is not needed ### Lossless Image Compression - digital images – raster (bitmap), vector - color spaces - RGB - CMYK - HSV - YUV – for backwards compatibility of the TV broadcast - representation of bitmap images - monochrome - shades of gray - color - pixel = $(r, g, b)$ - alpha channel - color palette - conversion table (pixel = pointer into the table) - types: grayscale (palette with shades of gray), pseudocolor (color palette), direct color (three color palettes, one for each part of the color – R/G/B) - GIF (Graphics Interchange Format) - usage - originally for image transmission over phone lines, suitable for WWW - sharp-edged line art with a limited number of colors → can be used for logos - small animations - not used for digital photography - color palette, max. 256 colors - supports a “transparent” color - multiple images can be stored in one file - supports interlacing - rough image after 25–50 % of the data - we can stop the transmission if it's not the image we wanted - 4-pass interlacing by lines - LZW-based compression (actually, it uses LZC) - pointers … binary code with increasing length - blocks … up to 255 B - we can increase the number of possible colors by stacking multiple images on top of one another (and using transparent pixels) - PNG (Portable Network Graphics) - motivation: replace GIF with a format based on a non-patented method (LZW was patented by Sperry Corporation / Unisys) - color space – gray scale, true color, palette - alpha channel – transparency $\in[0,1]$ - supports only RGB, no other systems (it was designed for transferring images over the network) - compression has two phases - preprocessing – predict the value of the pixel, encode the difference between the real and predicted value (prediction based on neighboring pixels) - prediction types: none, sub (left pixel), up, average (average of sub and up), Paeth (more complicated formula) - each line may be processed with a different method - dictionary compression - deflate (LZ77) - supports interlacing - 7-pass interlacing by blocks of pixels - [![Adam7 interlacing](https://upload.wikimedia.org/wikipedia/commons/2/27/Adam7_passes.gif)](https://commons.wikimedia.org/wiki/File:Adam7_passes.gif) - single image only, no animation ### Burrows-Wheeler Transform - BSTW - Bentley, Sleator, Tarjan, Wei - *move to front* heuristic to restructure the dictionary - we start with an empty dictionary $D$ - for an input word $s$ - if $s$ is in dictionary $D$ at position $i$, we output $\gamma(i)$ and move the $i$-th word in $D$ to the front of $D$ - otherwise, we output $\gamma(|D|+1)$ and $s$ itself, and we insert $s$ at the front of $D$
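A small Python sketch of the move-to-front heuristic itself, over a fixed byte alphabet rather than BSTW's growing word dictionary with $\gamma$ codes; the same transform reappears below in the BWT pipeline.

```python
def mtf_encode(data: bytes):
    """Move-to-front: each symbol is coded as its current position in the table."""
    table = list(range(256))                 # fixed initial alphabet (unlike BSTW's empty dictionary)
    out = []
    for byte in data:
        index = table.index(byte)
        out.append(index)
        table.insert(0, table.pop(index))    # move the symbol to the front
    return out

def mtf_decode(indices):
    table = list(range(256))
    out = bytearray()
    for index in indices:
        byte = table[index]
        out.append(byte)
        table.insert(0, table.pop(index))
    return bytes(out)
```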
- Burrows–Wheeler transform - we are looking for a permutation of the string $x$ such that similar symbols end up close to each other - matrix of cyclic shifts of $x$ - sort the rows lexicographically - the desired permutation is the last column - also, we need to store the number of the row where the original string lies in the matrix to be able to restore the original - if we used the first column, it would not be decodable - encoding - MTF (move to front) heuristic - for each symbol, its code is the number of preceding symbols in the dictionary - then, move the symbol to the front of the dictionary - therefore, the number we use to encode the symbol matches the time that elapsed since the last encoding of that symbol (if it was encoded previously) - RLE … run length encoding - instead of $n$ zeros, we can just encode the zero and the length of the block - Huffman/arithmetic coding - decoding - how to obtain the original string from the last column and the number? - we sort the last column to get the first one - we can determine the penultimate symbols from the pairs of first and last symbols – thanks to the cyclic shift - what if there are multiple lines that end with the same symbol? we know that the lines were sorted lexicographically - the $n×n$ matrix is not necessary - we can use a suffix array - this would be equivalent to the matrix of rotations of `x$`, where `x` is the original string and `$` is a sentinel character - why does the BW transform work? - two occurrences of the same letter often share a very similar right context - therefore they will be close to each other in the rightmost column - (we are actually sorting the letters by their right context) - bzip2 - Julian Seward - open source - has more efficient compression than gzip/zip - but bzip2 is slower - it does not support archiving - it can only compress a single file – just like gzip - can be used to compress tar archives - compression - split the input into blocks of 100–900 kB (determined by an input parameter) - Burrows–Wheeler transform - MTF, RLE - arithmetic coding (patented) was replaced with Huffman coding
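A naive Python sketch of both directions of the transform – the $O(n^2\log n)$ sort of all rotations is for illustration only; as noted above, a real implementation uses a suffix array.

```python
def bwt_encode(s: str):
    """Sort all cyclic shifts of s; output the last column and the row of the original string."""
    n = len(s)
    rotations = sorted(range(n), key=lambda i: s[i:] + s[:i])
    last_column = "".join(s[(i - 1) % n] for i in rotations)
    row = rotations.index(0)
    return last_column, row

def bwt_decode(last_column: str, row: int):
    n = len(last_column)
    # order[k] = the row whose rotation starts one symbol later than row k:
    # the stable sort pairs the k-th smallest symbol of the first column with
    # its corresponding occurrence in the last column.
    order = sorted(range(n), key=lambda i: (last_column[i], i))
    out, i = [], row
    for _ in range(n):
        i = order[i]
        out.append(last_column[i])
    return "".join(out)

assert bwt_decode(*bwt_encode("banana")) == "banana"
```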
## Lossy Compression - no standard corpora exist - how to measure the loss of information - mean squared error (MSE) - signal to noise ratio (SNR) - peak signal to noise ratio ### Digitizing Sound - sound is a physical disturbance in a medium, it propagates as a pressure wave - Fourier theorem - sampling - Nyquist-Shannon sampling theorem - the sampling frequency should be at least twice the highest frequency we want to record - if the condition fails, it leads to aliasing (a form of distortion) - false frequency components appear that were not present in the original analog signal - for sampling frequency $F_s$ and an original frequency $f\gt\frac12F_s$, the new false frequency will be $f'=F_s-f$ - example: the carriage (or stagecoach) wheels in a movie appear to spin backwards - quantization - we have $B$ bits available - we split the whole interval into $2^B$ subintervals - we want the largest signal to noise ratio possible - audio processing - input - amplifier - band-pass filter - sampler … sampling - A/D converter … quantization - RAM - output - RAM - D/A converter - band-pass filter - amplifier ### Quantization - main idea - large set → smaller set - a quantizer can be scalar or vector - quantization consists of two mappings – encoding and decoding - central limit theorem - if we collect $n$ samples of a random variable $X$, the distribution of the sample mean approaches $N(\mathbb EX,\text{var}(X)/n)$ for a sufficiently large $n$ - problem formulation - source … random variable $X$ with density $f(x)$ (common assumption – $f$ is symmetric about 0) - determine: the number of intervals (levels), the interval endpoints (decision boundaries), a representative $y_i$ for each interval (reconstruction levels) - goal: to minimize the quantization error (distortion) $\sigma^2$ … MSE - $\sigma^2=\mathbb E[(x-Q(x))^2]$ - the number of intervals (levels) may be fixed if we want fixed length codewords - uniform quantizer - uniform probability distribution - all intervals of equal length - two types - midrise quantizer – even number of levels, $Q(0+\varepsilon)\neq 0$ (zero is a decision boundary) - midtread quantizer – odd number of levels, $Q(0+\varepsilon)=0$ (zero is in the middle of the central interval) - we assume that the range of $X$ is bounded - forward adaptive quantizer (offline) - split the input into blocks - analyze each separately, set parameters accordingly - parameters are transmitted as side information - backward adaptive quantizer (online) - no side info necessary - Jayant quantizer - multipliers - intervals can shrink and expand depending on the input - for each input value, we get the corresponding code $c$ and we multiply $\Delta$ by the multiplier $M_c$ - the intervals automatically adapt to the input data - 3bit Jayant quantizer – table of intervals

| code | interval | multiplier |
| ---- | --------------------- | ---------- |
| 0 | $(0,\Delta)$ | 0.8 |
| 1 | $(\Delta,2\Delta)$ | 0.9 |
| 2 | $(2\Delta,3\Delta)$ | 1 |
| 3 | $(3\Delta,\infty)$ | 1.2 |
| 4 | $(-\Delta,0)$ | 0.8 |
| 5 | $(-2\Delta,-\Delta)$ | 0.9 |
| 6 | $(-3\Delta,-2\Delta)$ | 1 |
| 7 | $(-\infty,-3\Delta)$ | 1.2 |
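A minimal Python sketch of the backward-adaptive behaviour in the table above – both sides multiply $\Delta$ by the same $M_c$ after every sample, so no side information is needed. Reconstructing at interval midpoints is an illustrative choice.

```python
# Sketch of a 3-bit Jayant quantizer (8 codes, as in the table above).
MULTIPLIERS = [0.8, 0.9, 1.0, 1.2, 0.8, 0.9, 1.0, 1.2]   # indexed by code

def jayant_encode(samples, delta=1.0):
    codes = []
    for x in samples:
        level = min(int(abs(x) / delta), 3)      # 0..3, the outermost interval is open-ended
        code = level if x >= 0 else level + 4    # codes 4..7 are the negative intervals
        codes.append(code)
        delta *= MULTIPLIERS[code]               # adapt the step size
    return codes

def jayant_decode(codes, delta=1.0):
    values = []
    for code in codes:
        level = code % 4
        magnitude = (level + 0.5) * delta        # reconstruct at the interval midpoint
        values.append(magnitude if code < 4 else -magnitude)
        delta *= MULTIPLIERS[code]               # the decoder adapts identically
    return values
```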
- nonuniform quantization - we set $y_j$ to the average value in the interval - $y_j=\frac{\int_{b_{j-1}}^{b_j}xf(x)dx}{\int_{b_{j-1}}^{b_j}f(x)dx}$ - we set the interval bounds to the averages of neighboring representatives - $b_j=\frac{y_{j+1}+y_j}{2}$ - Lloyd-Max algorithm - $M$-level symmetric quantizer - we set $b_0=0$ - we somehow approximate $y_1$ - then we compute $b_1$ from the equation for $y_j$ stated above - $y_2=2b_1-y_1$ (similarly $b_2,y_3,\dots$) - note: the $b$ bounds and $y$ values are symmetric, $y_{-1}=y_1$ etc. - the computation depends on the initial estimate of $y_1$ - we want to correct a wrong estimate - we know that $b_{M/2}$ = the maximum input value - we can compare the resulting $y_{M/2}$ with the value $y^*_{M/2}$ computed using the quotient of integrals - if they differ too much, we correct the initial estimate of $y_1$ and repeat the whole procedure - companding - compressing + expanding - first we stretch the high probability regions close to the origin & compress the regions with low probability - then apply uniform quantization - expand afterwards (transform the quantized values back to the original scale) - $\mu$-law - telephony, USA, Japan - $\mu=255$ - there is $c(x)$ which compresses the values and $c^{-1}(x)$ which expands them (the constant $\mu$ appears in the formulas) - A-law - telephony, Europe - $A=87.6$ - there are different $c(x)$ and $c^{-1}(x)$ than in $\mu$-law - G.711 - ITU-T standard for audio companding - sampling frequency 8 kHz - sample length is 14b ($\mu$-law) or 13b (A-law) - $\mu$-law uses a midtread quantizer - A-law uses a midrise quantizer - comparison - $\mu$ … higher dynamic range, noise reduction in blocks of silence (thanks to midtread) - A … lower relative distortion of small values - application: VoIP ### Vector Quantization - we split the file (or group the individual samples) into vectors - $L$ consecutive samples of speech - block of $L$ pixels of an image (a rectangle/square) - → vector of dimension $L$ - idea: codebook - a set of representative code-vectors, each one has a binary index - the encoder finds the closest vector in the codebook (using Euclidean distance or MSE) - the decoder retrieves the code-vector given its binary index - codebook size $K$ → $K$-level vector quantizer - we can design the codebook using the k-means algorithm (or we can use LBG as described below) - LBG algorithm - Linde, Buzo, Gray - we have a codebook $C$, a training set $T$, a threshold $\varepsilon$ - we can compute quantization regions – assign each input vector to its code-vector - we can compute the average distortion (MSE) in each step of the algorithm - if it (more precisely, its decrease since the previous iteration) is less than the threshold, we can return the codebook - otherwise, we need to update it (we set each code-vector to the average of all the input vectors in its group – like in k-means) - initialization - LBG splitting technique - start with a single code-vector – the average of all vectors in the training set - in each step we generate a perturbation vector, add it to each code-vector and introduce those new vectors into the codebook → the codebook size doubles - then, we run LBG with the codebook (repeatedly update the code-vectors until the distortion drops below a certain level) - we repeat this until the desired number of levels is reached - random init (Hilbert) - initialize $C$ randomly several times - select the alternative with the lowest distortion - pairwise nearest neighbor (Equitz) - start with $C=T$ - in each step, replace two code-vectors by their mean - goal: smallest increase in distortion - it is possible that during the algorithm, we are left with a code-vector that does not have any corresponding training set vectors (“empty cell problem”) → we replace it with a new code-vector chosen randomly from the cluster with the highest population of training vectors or the highest associated distortion
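A sketch of the LBG design loop just described, assuming `training` and `codebook` are float NumPy arrays and squared-error distortion; the initialization variants (splitting / random / PNN) and the empty-cell handling are left out.

```python
import numpy as np

def lbg(training, codebook, eps=1e-3):
    """One LBG design loop: assign vectors to the nearest code-vectors, recompute centroids,
    stop when the relative drop in average distortion falls below eps."""
    prev_distortion = None
    while True:
        # nearest code-vector for every training vector (squared Euclidean distance)
        d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)
        distortion = d[np.arange(len(training)), nearest].mean()
        if prev_distortion is not None and prev_distortion - distortion < eps * distortion:
            return codebook, distortion
        prev_distortion = distortion
        for k in range(len(codebook)):        # centroid update (empty cells left unchanged here)
            members = training[nearest == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
```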
- tree-structured vector quantization (TSVQ) - when quantizing a vector, we don't want to compare it to every vector in the codebook - we divide the codebook into two groups for which there exist two test vectors such that all vectors of the first group are closer to the first test vector and all vectors of the second group are closer to the second one - we repeat the process → binary tree - pros: we need a lower number of comparisons, and we can build a binary string (code) as we progress down the tree - cons: increased storage requirement (we need to store the test vectors, the codebook does not suffice), possible increase in distortion - we may use the LBG splitting technique to get the test vectors - we need to modify the technique: we always run LBG only with two code-vectors and the corresponding subset of the training set - pruned TSVQ - we can remove some subtrees to decrease the average rate (codeword length) – but this will increase the average distortion - also, we can split some nodes - this way, we get binary codewords of variable length, the codewords correspond to paths in the decision tree → prefix code - lattice vector quantizers - idea: regular arrangement of code-vectors in space - quantization regions should tile the output space of the source - D lattices - $D_2$ lattice (in two-dimensional space) - $a\cdot v_1+b\cdot v_2=a(0,2)+b(1,1)=(b,2a+b)$ - squares - the sum of the coordinates is always even - A lattices - $A_2$ lattice is hexagonal - G.719 - ITU-T standard for wideband audio coding - modified discrete cosine transform (MDCT) - vector quantizer based on the $D_8$ lattice - classified vector quantization - idea: divide source vectors into classes with different properties, perform quantization separately - example: edge/nonedge regions (in an image) - large variance of pixels $\implies$ presence of an edge (we will use the edge codebook for this vector) - the output has to contain the codebook label and the vector label - applications - audio: G.719, CELP, Ogg Vorbis, TwinVQ - video: QuickTime, VQA ### Differential Encoding - idea: consecutive samples are quite similar (that holds especially for speech and images) - if we quantize the differences directly, the error will accumulate - instead of that, we will quantize the difference between the previous **predicted** value and the current value - or we can use a more general function of multiple previous predicted values instead of just the last value - this function can be called $p_n$
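A minimal Python sketch of the closed-loop idea just described: the encoder predicts only from values the decoder will also have, so the quantization error does not accumulate. The one-tap predictor and the uniform quantizer below are illustrative placeholders, not the DPCM design discussed next.

```python
def dpcm_encode(samples, quantize, predict):
    """Closed-loop differential coding: the difference is taken from a prediction that
    uses only reconstructed values, so the decoder stays in sync."""
    history, codes = [], []
    for x in samples:
        prediction = predict(history)
        code = quantize(x - prediction)        # quantized prediction error
        codes.append(code)
        history.append(prediction + code)      # reconstruct exactly as the decoder will
    return codes

def dpcm_decode(codes, predict):
    history = []
    for code in codes:
        history.append(predict(history) + code)
    return history

# illustrative choices (not from the notes): previous reconstructed value as the prediction,
# uniform quantizer with step 0.5
predict = lambda h: h[-1] if h else 0.0
quantize = lambda d: round(d / 0.5) * 0.5
```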
- DPCM (differential pulse-code modulation) - C. Chapin Cutler, Bell Labs - goal: minimize the MSE (the error between the real value and $p_n$) - linear predictor $p_n=\sum_{i=1}^N a_i\hat x_{n-i}$ - where $\hat x_k$ is the $k$-th predicted value - problem: find the $a_j$ which minimize the MSE - set the derivative w.r.t. $a_j$ equal to zero - $\forall j:-2\mathbb E[(x_n-\sum a_ix_{n-i})x_{n-j}]=0$ - $\mathbb E[(x_n-\sum a_ix_{n-i})x_{n-j}]=0$ - $\mathbb E[x_nx_{n-j}]=\sum a_i\mathbb E[x_{n-i}x_{n-j}]$ - we can rewrite this using the autocorrelation $R_{XX}(k)=\mathbb E[x_nx_{n+k}]$ - we get the Wiener-Hopf equations (matrix multiplication) - we will estimate $R_{XX}(k)$ – nothing complicated, it's just an average of products - we assume that the process is stationary - ADPCM - adaptive DPCM - forward adaptive prediction (DPCM-APF) - estimate the autocorrelation for blocks - we need to transfer the coefficients and buffer the input (group it into blocks) - backward adaptive prediction (DPCM-APB) - no need to transfer additional info - we will use stochastic gradient descent during coding (LMS = least mean squares algorithm) - $A^{(n+1)}=A^{(n)}+\alpha \hat d_n\hat X_{n-1}$ - G.726 - ITU-T standard for speech compression based on ADPCM - the predictor uses the two latest predictions and the six latest differences (between predicted values) - simplified LMS algorithm - backward adaptive quantizer (modified Jayant) - delta modulation - speech coding - a very simple form of DPCM - 2-level quantizer (we can only encode $+\Delta$ or $-\Delta$) - we need a really high sampling frequency - problems - constant signal (we encode it as a sequence of `↑↓↑↓↑↓↑↓`) - “granular regions” - a slope steeper than delta - “slope overload region” - solution: adapt $\Delta$ to the characteristics of the input - constant factor adaptive delta modulation (CFDM) - objective: increase delta in overload regions, decrease it in granular regions - how to identify the regions? - use a history of one sample (or more) - application: space shuttle ↔ ground terminal (CFDM with 7-sample memory) ### Transform Coding - scheme: group individual samples into blocks, transform each block, quantization, coding - for images, we have to make a two-dimensional transform - it is better to transform each dimension separately = separable transform - separable transform using a matrix $A$ - transformed data $\Theta=AXA^T$ - original data $X=A^{-1}\Theta(A^{-1})^T$ - orthogonal transform … we use an orthogonal transform matrix - transformed data $\Theta=AXA^T$ - original data $X=A^T\Theta A$ - moreover $\sum_i \theta_i^2=\sum_i x_i^2$ - gain $G_T$ (of a transform $T$) = the ratio of the arithmetic and geometric means of the variances $\sigma_i^2$ - $\sigma_i^2$ … variance of the coefficient $\theta_i$ - Karhunen-Loève transform - minimizes the geometric mean of the variances of the transform coefficients - the transform matrix consists of eigenvectors of the autocorrelation matrix $R$ - problem: non-stationary input → the matrix $R$ changes with time → the KLT needs to be recomputed - discrete cosine transform - DCT is the “cosine component” of the DFT - redistributes information so that we can quantize it more easily - why is DCT better than DFT?
- FT requires a periodic signal - periodic extension of non-periodic signal would introduce sharp discontinuities and non-zero high-frequency coefficients - DCT may be obtained from DFT (start with the original non-periodic signal consisting of $N$ points, mirror it to get $2N$ points, apply DFT, take the first $N$ points) - DCT transform matrix – 1st row is constant vector, following rows consist of vectors with increasing variation - these are the portions of the original “signal” (data matrix $X$) that the coefficients in the $\Theta$ matrix correspond to - [![DCT](https://users.cs.cf.ac.uk/Dave.Marshall/Multimedia/DCT.jpg)](https://users.cs.cf.ac.uk/Dave.Marshall/Multimedia/node231.html) - these can be called “basis matrices” - basis matrix corresponding to $\theta_{11}$ is a constant matrix $\implies$ $\theta_{11}$ is some multiple of the average value in the data matrix - $\theta_{11}$ … DC coefficient - other $\theta_{ij}$ … AC coefficients - gain $G_{DCT}$ is close to the optimum value for Markov sources with high correlation coefficient $\rho=\frac{\mathbb E[x_nx_{n+1}]}{\mathbb E[x_n^2]}$ - discrete sine transform - gain $G_{DST}$ is close to the optimum value for Markov sources with low correlation coefficient $\rho=\frac{\mathbb E[x_nx_{n+1}]}{\mathbb E[x_n^2]}$ - Hadamard matrix - square matrix of order $n$ such that $HH^T=nI$ - $H_1=(1)$ - $H_{2n}=\begin{pmatrix} H_n & H_n \\ H_n & -H_n \end{pmatrix}$ - discrete Walsh-Hadamard transform - DWHT transform matrix is a rearrangement of $H_n$ - multiply $H_n$ by normalizing factor $\frac1{\sqrt n}$ - reorder the rows by increasing number of sign changes - pro: speed - con: $G_{DWHT}\ll G_{DCT}$ - quantization of transform coefficients - zonal sampling - coefficients with higher expected variance are assigned more bits for sampling - pro: simplicity - con: bit allocations based on average rules → local variations (e.g. 
sharp edges on plain background) might not be reconstructed properly - quantization and coding - threshold coding - based on the threshold, we decide which coefficients to keep (or discard) - we have to store the numbers of discarded coefficients - threshold is not decided a priori - Chen & Pratt suggest zigzag scan - tail of the scan should consist of zeros (generally, the higher-order coefficients have smaller amplitude) → for fixed length of the block, we don't need to specify the number of zeros, just send end-of-block (EOB) signal - note: DC coefficients correspond to contours, higher-order coefficients correspond to details of the image → we can get rid of them - JPEG (Joint Photographic Experts Group) - JPEG standard specifies a codec - lossy / lossless compression - progressive transmission scheme - file format - JPEG/JFIF – minimal format version - JPEG/Exif – digital photography - compression scheme - color space transform - RGB → YCbCr - human eye is more sensitive to details in Y (luminance) component - reduction of colors (chroma subsampling) in color components - 4:4:4 (no reduction) or 4:2:2 or 4:2:0 - 4:2:2 or 4:2:0 → colors have lower resolution than the luminance - [![chroma subsampling](https://d1hjkbq40fs2x4.cloudfront.net/2022-04-21/files/video-faq-420-422-colour-sampling_2214-04.jpg)](https://snapshot.canon-asia.com/article/eng/videography-faq-what-do-422-and-420-mean) - J:a:b (see [Chroma subsampling, Wikipedia](https://en.wikipedia.org/wiki/Chroma_subsampling)) - J … width of the region (as a reference) - a … number of chrominance samples in the first row - b … number of changes (in chrominance) between first and second row - each color component is handled separately - split into blocks of 8×8 pixels - DCT of each block - uniform quantization of DCT coefficients - coefficients are divided by table values (recommended in the standard, stored in the compressed file) and rounded - table values are adjusted according to specified JPEG quality $\in\set{1,\dots,100}$ - coding – zigzag matrix scan, Huffman coding (extended JPEG can use arithmetic coding) - Huffman encoding of coefficients - DC coefficient encoding - $\theta_{11}$ = constant multiple of the average value in the block (see the image in the DCT section above) - instead of $\theta_{11}$ encode the difference between neighboring blocks - split the range of values into categories (so that the Huffman tree is not too large) - we output the Huffman code of the category $n$ and then $n$ bits identifying the exact value - values - category 0 … 0 - category 1 … -1, 1 - category 2 … -3, -2, 2, 3 - etc. 
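A small Python sketch of the category mapping listed above: the category is the number of bits of $|v|$ (so category 2 covers $-3,-2,2,3$), and the extra bits identify the value inside the category. The rule used for negative values below is the usual JPEG one's-complement convention, stated here as an assumption.

```python
def jpeg_category(value: int):
    """Map a DC difference to its JPEG category and the extra bits identifying the value."""
    category = abs(value).bit_length()            # category 0 is just the value 0
    if category == 0:
        return 0, ""
    # negative values are stored as |value| one's-complemented on `category` bits
    bits = value if value > 0 else value + (1 << category) - 1
    return category, format(bits, f"0{category}b")

# e.g. jpeg_category(-3) -> (2, '00'), jpeg_category(5) -> (3, '101')
```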
- AC coefficient encoding - the categories are the same - but we don't encode zeros – we use the Chen & Pratt method with the EOB signal - for each value, we need to encode its category (+ value) and the number of preceding zeros - use a Huffman tree to encode $(Z,C)$ where $Z$ is the number of preceding zeros and $C$ is the category - append $C$ bits to store the exact value - if the number of preceding zeros is greater than 15, use a special ZRL signal - ZRL is encoded as $(15,0)$ - EOB is encoded as $(0,0)$ - JPEG use - digitized images, realistic scenes, subtle variations in tone and color - disadvantage: artificial division into blocks → coding artifacts at the block edges → tile effect - WebP - color space: RGB → YCbCr - chroma subsampling 4:2:0 - split into macroblocks 16×16 Y, 8×8 Cb, 8×8 Cr - high detail areas: 4×4 - macroblock prediction - transform: DCT, DWHT - quantization – partition into segments with similar features - arithmetic coding ### Subband Coding - example - rapid sample-to-sample variation - but the long-term trend varies slowly - idea: let's average the samples $x_n$ using a sliding window - this would smooth out the rapid variation - we could transmit both the averages $y_n$ and their distances $z_n$ from the original values $x_n$ - we would need to transmit twice as many values as before - fortunately, if we transmit only the even-indexed values, we can reconstruct the odd-indexed ones from them - takeaway - it is useful to decompose the original sequence $(x_n)$ into two subsequences $(y_n)$ and $(z_n)$ - it did not result in any increase in the number of values we need to transmit - the two subsequences have different characteristics, so we can use different techniques to encode them - filters – selectively block frequencies that do not belong to the filtered range - low-pass filters – block frequencies above $f_0$ - high-pass filters – block frequencies below $f_0$ - band-pass filters – let through only the components between $f_1$ and $f_2$ - digital filter with input $(x_n)$, output $(y_n)$, coefficients $(a_i),(b_i)$ - $y_n=\sum_{i=0}^N a_ix_{n-i}+\sum_{i=1}^Mb_iy_{n-i}$ - finite impulse response (FIR) - $\forall i:b_i=0$ - the impulse response dies out after $N$ samples - $N$ … number of taps - infinite impulse response (IIR) - example for $a_0=1$ and $b_1=0.9$ and the impulse 1000000… - $y_n:$ 1, 0.9, 0.81, 0.729, … - this $y_n$ is equal to the impulse response $h_n$ - we can use convolution… - the impulse response completely specifies the filter - $y_n=\sum_{k=0}^M h_kx_{n-k}$ - for IIR, $M$ is infinite - examples of filters - averaging - low-pass filter - two-tap FIR filter - $h_n:0.5,0.5,0,\dots$ - difference - high-pass filter - $h_n:0.5,-0.5,0,\dots$ - filters used in subband coding - filter bank - a cascade of stages - each stage contains low-pass and high-pass filters - QMF – quadrature mirror filter - the impulse response of the low-pass filter is specified as $(h_n)$ - the impulse response of the high-pass filter is derived from it as $((-1)^nh_{N-1-n})$ - the order is reversed and every other sign is flipped - example: 8-tap Johnston QMF filter - fewer taps → lower efficiency in decomposition - basic algorithm 1. analysis - $M$ filters cover the frequency range of the input - generalized Nyquist rule – the sampling frequency has to be at least twice the supported frequency range 2. downsampling (decimation) - keep every $M$-th sample 3. quantization + coding - allocation of bits between subbands - subband variance is a suitable heuristic - ADPCM, PCM, vector quantization 4. decoding 5.
upsampling - insert appropriate number of zeros 6. synthesis - bank of reconstruction filters - G.722 - ITU recommendation for speech coding - anti-aliasing filter with cutoff frequency 7 kHz - sampling frequency 16 kHz - 14b uniform quantizer - filtering – a bank of 2 QMF filters - 24 tap FIR filter - low-pass 0–4 kHz, high-pass remaining frequencies - down sampling by a factor of two - encoding – ADPCM - low-pass: 6b/sample - high-pass: 2b/sample - quantization – a variation of Jayant algorithm - MPEG audio (Moving Picture Experts Group) - three coding schemes – layers (each is more complex than the previous) - MPEG-1 Audio Layer I (MP1) - MPEG-1 (or MPEG-2) Audio Layer II (MP2) - MPEG-1 (or MPEG-2) Audio Layer III (MP3) - subband coding - MP1 and MP2 use a bank of QMF filters - MP3 uses QMF and modified DCT - quantization - uniform (1, 2) - non-uniform (3) - bit allocation to subbands - based on a psychoacoustic model of human perception - sounds at different frequencies are perceived as differently loud even if they have the same pressure levels - threshold-of-hearing curve (boundary of audible sounds) - spectral masking - if there are multiple sounds of the similar frequencies, some might not be audible (audibility threshold rises) - temporal masking - sound raises the audibility threshold for a brief interval preceding and following the sound - coding - fixed codeword length (1, 2) - Huffman coding + a buffer to ensure fixed transmission rate (3) - image compression - dependencies in two dimensions – it is better to use two one-dimensional filters rather than one two-dimensional filter - filter each row of an image separately using high-pass and low-pass filters, then perform downsampling by a factor of two - one image N×N → 2 images N×N/2 - filter each column of the two images using high-pass and low-pass filters, then perform downsampling by a factor of two - 2 images N×N/2 → 4 images N/2×N/2 - then we can stop or continue the decomposition (note: we don't have to decompose all of the subimages) - application: JPEG 2000 - wavelet transform - coefficients correspond to details for various resolutions - fairly precise quantization possible - FBI fingerprint image compression standard, medical images compression - data optimization for progressive transmission over a network - a possibility to define regions of interest for higher-quality encoding ### Video Compression - video = time sequence of images (frames) - can we reduce the problem to image compression? - no! - if we change the average intensity of pixels, it becomes noticeable in a video - on the other hand, we don't care about the reproduction of sharp edges that much - basic types of algorithms - two-way communication → symmetric compression / decompression, it needs to be fast, videotelephony etc. 
- one-way → used for storage, the compression can be more complex - video signal representation - CRT - analog TV standards: NTSC (USA), PAL (Germany), SECAM (France) - composite signal – YCbCr - Y (luminance) used for black-and-white television - gamma correction – to make colors look more natural - MPEG-SIF format based on ITU-R 601 4:2:2 - luminance compression – take the odd rows, use a horizontal filter and subsample (the horizontal filter prevents aliasing) - chrominance compression – take the odd rows, use a horizontal filter and subsample, use a vertical filter and subsample - ITU-T H.261 - videotelephony, videoconferencing - CIF format - idea - frame = 8×8 blocks - prediction (by the previous frame) + differencing - DCT - quantization - entropy coder with variable codeword length - why do we have to do the inverse quantization and transform when encoding? - to keep the encoder and decoder in sync, the prediction needs to be based only on the information available to the decoder (see differential encoding) - motion compensation - which parts of the image are moving? - for each block of the frame, we are going to find the most similar block of the previous frame - if there's no similar previous block, we just encode the current block - how to reduce the search space? - increase the block size - H.261 uses macroblocks - for the luminance component, group the blocks by four (into 16×16 macroblocks) - search for the best match (we get the motion vector) - to find the motion vector for the chrominance components, we just divide the motion vector by two
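An illustrative Python/NumPy sketch of exhaustive block matching for one macroblock – minimize the sum of absolute differences over a ±`search` window. Neither H.261 nor MPEG mandates a particular motion-estimation method (the standards leave it to the encoder), and real encoders use faster search strategies.

```python
import numpy as np

def best_match(previous, current, top, left, block=16, search=7):
    """Find the motion vector (dy, dx) within +-search pixels that minimizes the
    sum of absolute differences (SAD) against the previous frame."""
    target = current[top:top + block, left:left + block].astype(int)
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > previous.shape[0] or x + block > previous.shape[1]:
                continue                      # candidate block falls outside the frame
            sad = np.abs(previous[y:y + block, x:x + block].astype(int) - target).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad
```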
- the loop filter - sharp edges in the block used for prediction have some negative consequences - we use a smoothing filter - transform - DCT of the blocks - the coder simulates the decoder - two modes - inter – motion compensation is used (we only have to encode the differences) - intra – the original block is encoded directly - quantization - intra blocks have large DC coefficients, inter blocks have smaller ones - there are 32 different quantizers, they are identified in the macroblock header - coding - zigzag scan through the block of quantized coefficients (like in JPEG) - the 20 most common combinations → a single variable length codeword - others → fixed length 20b - to avoid zero blocks, a CBP (coded block pattern) is used to describe which blocks have non-zero coefficients - rate control - during a videoconference we need the transmission rate to be steady - transmission buffer – keeps the output rate of the encoder fixed - the buffer may request that the coder change the transmission rate - the coder then may change the quantizer (or, as a last resort, drop some frames) - MPEG - the standard describes the bitstream format and the decoder, not the encoder - MPEG-1 - lossy compression of sound and video - YUV color space - standard for video CD - layer 3 (MP3) format for audio compression - MPEG-2 - TV broadcast, DVD - MPEG-3 - originally intended for HDTV - development stopped, merged with MPEG-2 - MPEG-1 - same basis as H.261 - 8×8 blocks - inter/intra - motion-compensated prediction on the macroblock level - DCT - quantization and coding - rate control using a transmission buffer - but it has different applications (digital storage and retrieval) - frame types - I-frame (intraframe) – current frame (coded with no reference to other frames) - P-frame (predictive frame) – difference between the current and the most recent I/P-frame - B-frame (bidirectionally predictive) – difference between the two closest I/P-frames (most recent and closest future frame) - e.g. $I:P:B \approx 2:5:12$ - group of pictures (GOP) - smallest random access unit in the video sequence - a repeating sequence of 10–30 frames - contains at least one I-frame - starts with an I-frame or “forward-pointing” B-frames - usage - allows the video to replay at the original speed even on slow hardware - some frames may be skipped to keep the original speed - forward, jump forward/back - when processing a GOP, referenced frames have to be processed before the frames that reference them - coding - the standard does not specify a particular method for motion compensation - the search space size should grow with the distance between frames - DCT & quantization are similar to JPEG (different quantization tables for different frames) - rate control – we can discard B-frames first - properties - VHS quality for low-to-moderate motion (worse than VHS for high motion) - interlacing was not considered - MPEG-2 - idea: an application-independent standard, toolkit approach - profiles: simple, main, snr-scalable, spatially scalable, high - simple: no B-frames - main: like MPEG-1 - possibility to use several layers (in the other three profiles) - base layer: encodes a low-quality video signal - enhancement layer: differential encoding to improve quality - several resolution levels - interlacing can be used - field-based alternatives to I/P/B-frames - P-frames may be replaced by 2 P-fields - B-frames may be replaced by 2 B-fields - I-frames may be replaced by 2 I-fields or an I-field and a P-field - predict the bottom field by the top field - new motion-compensated prediction modes - MPEG-4 - objective: encode video at low bitrates - object oriented approach: audio, video and VRML objects - allows the use of arbitrary codecs - MPEG-4 Part 10 … H.264 (used in HDTV, Blu-ray) - MPEG-M (Part 2) - HEVC (High Efficiency Video Coding) - ITU-T H.265 - used in DVB-T2 - H.264 improvements (examples) - macroblocks may be further divided - 9 prediction modes for 4×4 blocks - H.265 improvements (examples) - the frame is partitioned into Coding Tree Units - blocks can be subdivided if a higher “resolution” is needed
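A toy Python illustration of the reordering rule above (“referenced frames have to be processed before the frames that reference them”): B-frames are held back until the following I/P-frame has been emitted. Real streams also let B-frames at the end of a GOP reference the next GOP's I-frame, which is ignored in this simplification.

```python
def decode_order(display_order):
    """Reorder a GOP given in display order (e.g. 'IBBPBBP') so that every referenced
    I/P-frame precedes the B-frames that point at it (simplified rule)."""
    out, pending_b = [], []
    for index, frame_type in enumerate(display_order):
        if frame_type in "IP":
            out.append((index, frame_type))   # anchor frame: emit it first...
            out.extend(pending_b)             # ...then the B-frames that needed it
            pending_b = []
        else:
            pending_b.append((index, frame_type))
    return out + pending_b

# decode_order("IBBPBBP") -> [(0,'I'), (3,'P'), (1,'B'), (2,'B'), (6,'P'), (4,'B'), (5,'B')]
```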