k′ … size of data compressed by a standard algorithm
data corpora
Calgary Corpus – 14 files: text, graphics, binary files
Canterbury Corpus – 11 files + artificial corpus (4) + large corpus (3) + miscellaneous corpus (1)
Silesia Corpus
Prague Corpus
contests
Calgary Corpus Compression Challenge
Hutter Prize – Wikipedia compression
goal – encourage research in AI
limits of lossless compression
encoding f: {n-bit strings} → {strings of length <n}
|Dom f| = 2^n
|Im f| ≤ 2^n − 1
such f cannot be injective → does not have an inverse function
let M⊆Domf such that ∀s∈M:∣f(s)∣≤0.9n
f injective on M ⟹ |M| ≤ 2^(0.9n+1) − 1
n = 100 … |M|/2^n < 2^(−9)
n = 1000 … |M|/2^n < 2^(−99) ≈ 1.578·10^(−30)
we can't compress all data efficiently (but that does not matter, we often focus on specific instances of data)
Lossless Compression
Statistical Methods
basic concepts
source alphabet … A
coding alphabet … A_C
encoding … f: A^* → A_C^*
f injective → uniquely decodable encoding
typical strategy
input string is factorized into concatenation of substrings
s = s_1 s_2 … s_k
substrings = phrases
f(s) = C(s_1) C(s_2) … C(s_k)
C(s_i) … codeword
codes
(source) code … K: A → A_C^*
K(s_i) … codeword for symbol s_i ∈ A
encoding K^* generated by code K is the mapping K^*(s_1 s_2 … s_n) = K(s_1) K(s_2) … K(s_n) for every s ∈ A^*
K is uniquely decodable ≡ it generates a uniquely decodable encoding K^*
example
{0,01,11} … is uniquely decodable
{0,01,10} … is not, counterexample: 010
K1 … fixed length
K2 … no codeword is a prefix of another one
prefix codes
prefix code is a code such that no codeword is a prefix of another codeword
observation
every prefix code K (over a binary alphabet) may be represented by a binary tree = prefix tree for K
prefix tree may be used for decoding
idea: some letters are more frequent, let's assign shorter codewords to them
historical example – Morse code
Shannon-Fano coding
input – source alphabet A, frequency f(s) for every symbol s∈A
output – prefix code for A
algorithm
sort the source alphabet by symbol frequencies
divide the table into parts with similar sums of frequencies
append 0/1 to prefix, repeat
divide and conquer algorithm
definition: a prefix (uniquely decodable) code K is optimal ≡ |K′^*(a_1 a_2 … a_n)| ≥ |K^*(a_1 a_2 … a_n)| for every prefix (uniquely decodable) code K′ and for every string a ∈ A^* with symbol frequencies f
Shannon-Fano is not optimal
Fano's student David Huffman described a construction of optimal prefix code
Huffman Code
algorithm
constructs the tree from the bottom
starts with the symbols that have the lowest frequencies
sum of the frequencies is used as the frequency of the new node
greedy algorithm
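A minimal sketch of this greedy construction in Python using a binary heap (the frequency table and the tie-breaking order are illustrative, not from the lecture):

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code from a dict {symbol: frequency}."""
    # heap items: (frequency, tie_breaker, subtree); a subtree is either
    # a symbol (leaf) or a pair of subtrees (internal node)
    heap = [(f, i, s) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if counter == 1:                       # degenerate one-symbol alphabet
        return {next(iter(freqs)): "0"}
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)    # two lowest-frequency subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))  # merged node
        counter += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):        # internal node
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                              # leaf = symbol
            code[tree] = prefix
    walk(heap[0][2], "")
    return code

freqs = Counter("abracadabra")
print(huffman_code(freqs))   # codeword lengths: a -> 1 bit, b, r, c, d -> 3 bits
```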
optimality of Huffman code
notation
alphabet A of symbols that occur in the input string
∀s∈A:f(s) … frequency of the symbol
let T be a prefix tree for alphabet A with frequencies f
leaves of T … symbols of A
let d_T(s) denote the depth of a leaf s in the tree T
d_T(s) = length of the path from the root to s
d_T(s) = length of the codeword for symbol s
cost C(T) of tree T
C(T) = ∑_{z∈A} f(z)·d_T(z)
C(T) … length of the encoded message
lemma 0: if a prefix tree T (∣V(T)∣≥3) contains a vertex with exactly one child, then T is not optimal
proof: we could contract the edge between the vertex and its child
lemma 1
let A (∣A∣≥2) be an alphabet with frequencies f
let x,y∈A be symbols with the lowest frequencies
then there exists an optimal prefix tree for A in which x,y are siblings of maximum depth
proof
let T be an optimal tree in which x,y are not siblings of maximum depth; instead, some a,b are – swapping x with a and y with b cannot increase the cost (f(x) ≤ f(a), f(y) ≤ f(b), d_T(x) ≤ d_T(a), d_T(y) ≤ d_T(b)), so the resulting tree is optimal too
lemma 2: let T′ be an optimal prefix tree for alphabet A′ = A ∖ {x,y} ∪ {z} where f(z) = f(x) + f(y)
then tree T obtained from T′ by adding children x and y to z is an optimal prefix tree for A
proof
T′ optimal for A′
we get T by adding x,y below z (therefore z is no longer a symbol of the alphabet)
there exists T′′ optimal for A in which x,y are siblings of maximum depth
using lemma 1
we label their parent z
we remove x,y and get T′′′ for A′
C(T)=C(T′)+f(x)+f(y)≤C(T′′′)+f(x)+f(y)=C(T′′)
clearly C(T′)≤C(T′′′) as T′ is optimal
T′′ optimal ⟹T optimal
theorem: algorithm Huffman produces an optimal prefix code
proof by induction
theorem (Kraft-McMillan inequality)
codeword lengths l_1, l_2, …, l_n of a uniquely decodable code satisfy ∑_i 2^(−l_i) ≤ 1
conversely, if positive integers l_1, l_2, …, l_n satisfy the inequality, then there is a prefix code with codeword lengths l_1, l_2, …, l_n
corollary
Huffman coding algorithm produces an optimal uniquely decodable code
if there was some better uniquely decodable code, there would be also a better prefix code (by Kraft-McMillan inequality) so Huffman coding would not be optimal
idea: we could encode larger parts of the message, it might be better than Huffman coding
extended Huffman code … we use n-grams instead of individual symbols
Huffman coding is not deterministic – the order in which symbols with equal frequencies are combined is not specified, and different choices may lead to optimal trees of different heights
can we modify Huffman algorithm to guarantee that the resulting code minimizes the maximum codeword length?
yes, when constructing the tree, we can sort the nodes not only by their frequencies but also by the depth of their subtree
what is the maximum height of a Huffman tree for an input message of length n?
we find the minimal input size to get the Huffman tree of height k
how to make the tree as deep as possible?
the frequencies of the symbols will be the Lucas numbers (with the zeroth number split into two ones) or the Fibonacci sequence, depending on the exact rules for constructing the tree
Huffman tree of height k ⟹ n ≥ f_{k+1}
where f_0 = 1, f_1 = 1, f_{k+2} = f_{k+1} + f_k
n < f_{k+2} ⟹ max. codeword length ≤ k
there is an optimal prefix code which cannot be obtained using Huffman algorithm
generalize the construction of binary Huffman code to the case of m-ary coding alphabet (m>2)
the trivial approach does not produce an optimal solution
for the ternary coding alphabet, we can modify the first step – we will only take two leaves to form a subtree
implementation notes
bitwise input and output (usually, we work with bytes)
overflow problem (for integers etc.)
code reconstruction necessary for decoding
canonical Huffman code
if we know that the code contains 2 codewords of length 2 and 4 codewords of length 3, we can construct the code (we don't need to encode the tree)
00, 01, 100, 101, 110, 111
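A sketch of how such a canonical code can be rebuilt from the codeword lengths alone (the symbols a–f are made up):

```python
def canonical_code(lengths):
    """lengths: dict {symbol: codeword length} -> dict {symbol: codeword}.
    Symbols are sorted by (length, symbol); codewords are consecutive
    binary numbers, shifted left whenever the length increases."""
    code, value, prev_len = {}, 0, 0
    for sym, l in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        value <<= (l - prev_len)          # append zeros when the length grows
        code[sym] = format(value, "0{}b".format(l))
        value += 1
        prev_len = l
    return code

# the example from the notes: 2 codewords of length 2, 4 codewords of length 3
print(canonical_code({"a": 2, "b": 2, "c": 3, "d": 3, "e": 3, "f": 3}))
# {'a': '00', 'b': '01', 'c': '100', 'd': '101', 'e': '110', 'f': '111'}
```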
adaptive compression
static × adaptive methods
statistical data compression consists of two parts: modeling and coding
what if we cannot read the data twice?
we will use the adaptive model
adaptive Huffman code
brute force strategy – we reconstruct the whole tree
after each frequency change
after reading k symbols
after change of order of symbols sorted by frequencies
characterization of Huffman trees
Huffman tree – binary tree with nonnegative vertex weights
two vertices of a binary tree with the same parent are called siblings
binary tree with nonnegative vertex weights has a sibling property if
∀ parent: weight(parent) = ∑ weight(child)
each vertex except root has a sibling
vertices can be listed in order of non-increasing weight with each vertex adjacent to its sibling (the weights are non-decreasing if we print them by levels, bottom to top, left to right)
theorem (Faller 1973): binary tree with non-negative vertex weights is a Huffman tree iff it has the sibling property
FGK algorithm (Faller, Gallager, Knuth)
we maintain the sibling property
zero-frequency problem
how to encode a novel symbol
1st solution: initialize Huffman tree with all symbols of the source alphabet, each with weight one
2nd solution: initialize Huffman tree with a special symbol esc
encode the 1st occurrence of symbol s as a Huffman code of esc followed by s
insert a new leaf representing s into the Huffman tree
average codeword length … l_FGK ≤ l_H + O(1)
implementation problems
overflow
we can multiply the frequencies by 1/2
this may lead to tree reconstruction
statistical properties of the input may be changing over time – it might be useful to reduce the influence of older frequencies in time
by multiplying by x∈(0,1)
we'll use integer arithmetic → we need to scale the numbers
Arithmetic Coding
with each string s ∈ A^* we associate a subinterval I_s ⊆ [0,1] such that
s ≠ s′ ⟹ I_s ∩ I_s′ = ∅
length of I_s = p(s)
C(s) = a number from I_s with the shortest possible code
we'll first assume that p(a_1 a_2 … a_m) = p(a_1)·p(a_2)·…·p(a_m)
encoding
we count the frequencies → probabilities
we sort the symbols in some order
we assign intervals to the symbols in the order
symbol with a higher frequency will be assigned a larger interval
to encode the message, we subdivide the intervals
for example, let's say that I_B = (0.2, 0.3) and I_I = (0.5, 0.6)
then I_BI = (0.25, 0.26)
the shorter the message, the longer the interval
we have to store the length of the input or encode a special symbol (end of file) to be able to distinguish between the encoded string and its prefixes
decoding
we reconstruct the input string from the left
we inverse the mapping
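An illustrative floating-point sketch of this interval narrowing (the three-symbol model and the message are made up); real coders use the integer arithmetic described below to avoid exactly the rounding problems mentioned next:

```python
# cumulative intervals for a toy model: p(A)=0.5, p(B)=0.3, p(C)=0.2
model = {"A": (0.0, 0.5), "B": (0.5, 0.8), "C": (0.8, 1.0)}

def encode(msg):
    low, high = 0.0, 1.0
    for ch in msg:
        lo, hi = model[ch]
        low, high = low + (high - low) * lo, low + (high - low) * hi
    return (low + high) / 2            # any number inside the final interval

def decode(x, length):
    out = []
    for _ in range(length):            # we must know the message length
        for ch, (lo, hi) in model.items():
            if lo <= x < hi:
                out.append(ch)
                x = (x - lo) / (hi - lo)   # rescale and continue
                break
    return "".join(out)

x = encode("ABAC")
print(x, decode(x, 4))                 # 0.3175 ABAC
```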
problem: using floating point arithmetic leads to rounding errors → we will use integers
underflow may happen that way if we use too short integers
instead of individual frequencies of each symbol, we will use cumulative frequencies (it directly represents the intervals)
if we have b bits available to represent the current state, the (initial) interval can be [0, 2^b − 1) or [0, 2^(b−1))
another problem
we use bit shifting and assume that the first bits of both bounds are the same
what if the first bits differ?
we don't output anything yet and increment a counter
it is similar to the situation (in real numbers) when we get to the range 0.53–0.67
we cannot write 5 or 6 and shift the number because we don't know whether the next range will be something like 0.53–0.56 or 0.63–0.67
we postpone the decision instead
algorithm
start with the interval [0, 2^b − 1) represented as left bound L and range R
encode each character of the block
if the range R falls below 2^(b−2), we need to prevent underflow
if the current interval (range) starts in the first half of the interval [0, 2^b − 1) and ends in the second half, we don't output anything, only increment the counter and rescale the current interval
otherwise, we output one or more (most significant) bits of L and rescale the current interval (also, we decrement the counter by printing the correct bits)
time complexity
encoding O(|A| + |output| + |input|)
decoding O(|A| + |output| + |input|·log|A|)
it is also possible to make decoding faster by using more memory
we will represent the table of frequencies as an implicit tree
that way we can decode in time O(|A| + |output| + |input|)
adaptive version
cumulative frequencies table as an implicit tree
Fenwick tree
Fenwick frequency = prefix sum (sum of all the weights of the lexicographically smaller vertices)
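A sketch of such a Fenwick (binary indexed) tree for maintaining cumulative frequencies; the class name and 1-based indexing convention are my own choices:

```python
class Fenwick:
    def __init__(self, n):
        self.tree = [0] * (n + 1)          # 1-based indexing

    def update(self, i, delta):
        """Add delta to the frequency of symbol i (1-based)."""
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i                    # move to the next responsible node

    def prefix_sum(self, i):
        """Cumulative frequency of symbols 1..i."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i                    # strip the lowest set bit
        return s

f = Fenwick(8)
for sym in [3, 3, 5, 1]:
    f.update(sym, 1)
print(f.prefix_sum(3), f.prefix_sum(5))    # 3 4
```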
Theoretical Limits
Hartley's formula
minimum and average number of yes/no questions to find the answer
if x is a member of an n-element set, then x carries log_2 n bits of information
if we have a k-tuple of elements of an n-element set and S_k is the least number of questions necessary to determine all x_i, then S_k/k is the average number of questions necessary to determine one x_i
|R| = n, x ∈ R^k
log n^k ≤ S_k < log n^k + 1
k·log n ≤ S_k < k·log n + 1
log n ≤ S_k/k < log n + 1/k
Shannon's formula
entropy … H(X) = −∑_{x∈R} p(x)·log p(x)
theorem: H(X)≥0
lemma (Gibbs inequality): for probability distributions (p_i) and (q_i) we have −∑_i p_i·log p_i ≤ −∑_i p_i·log q_i, with equality iff p_i = q_i for all i
theorem: for a random variable X with a finite range R, |R| = n, we have H(X) ≤ log n
and H(X) = log n ⟺ ∀x∈R: p(x) = 1/n
we plug q_i = 1/n into the inequality
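A quick numeric illustration of H(X) ≤ log n (the two example distributions are arbitrary):

```python
from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

uniform = [1/4] * 4
skewed  = [0.7, 0.1, 0.1, 0.1]
print(entropy(uniform), log2(4))   # 2.0 2.0  -> equality for the uniform distribution
print(entropy(skewed))             # ~1.357   -> strictly below log 4
```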
Kraft-McMillan theorem
codeword lengths l_1, l_2, … of a uniquely decodable code C satisfy ∑_i 2^(−l_i) ≤ 1
on the other hand, if natural numbers l_1, l_2, … satisfy the inequality, then there is a prefix code with codewords of these lengths
proof
let's start with the k-th power of the sum ∑_i 2^(−l_i)
(∑_{i=1}^{n} 2^(−l_i))^k = (∑_{x∈R} 2^(−l(x)))^k
we can rewrite that as a sum of products
and l(x_{i1}) + l(x_{i2}) + … + l(x_{ik}) = l(x_{i1} x_{i2} … x_{ik})
therefore it is equal to ∑_{x_{i1}…x_{ik} ∈ R^k} 2^(−l(x_{i1}…x_{ik}))
which equals ∑_{i=1}^{k·l_max} n(i)·2^(−i) ≤ ∑_{i=1}^{k·l_max} 2^i·2^(−i) = k·l_max, where n(i) is the number of k-tuples whose code has length i (unique decodability gives n(i) ≤ 2^i)
now ∑_{i=1}^{n} 2^(−l_i) ≤ lim_{k→∞} (k·l_max)^(1/k) = 1
also, R can be infinite
to get the prefix code, sort the lengths l_1 ≤ l_2 ≤ … and take as the i-th codeword the first l_i bits of the binary expansion of ∑_{j<i} 2^(−l_j); the Kraft inequality guarantees that no codeword is a prefix of another (this is essentially the arithmetic-coding construction)
theorem: let C be a uniquely decodable code for a random variable X, then H(X)≤L(C)
L(C) … average codeword length
L(C) = ∑_{x∈R} p(x)·|C(x)|
proof: by Gibbs inequality, we use q_i = 2^(−l_i) / ∑_j 2^(−l_j) as the second distribution
entropy of a discrete random vector
H(X) = n·H(X_i)
for independent components with the same probability distribution
theorem: an arbitrary optimal prefix code C for a random variable X satisfies H(X)≤L(C)<H(X)+1
proof of the upper bound: put l_i = ⌈−log p_i⌉, then use Kraft-McMillan
this also holds for Huffman code (if we consider the entropy of one symbol)
analysis of arithmetic coding
H(X)≤L(C)<H(X)+2
F(x) = 0.y_1y_2… is the midpoint of the interval for x ∈ R (F is a CDF)
then C(x) = y_1y_2…y_{l(x)} where l(x) = ⌈log(1/p(x))⌉ + 1
in the proof, we show that C is a prefix code
then, we use L(C)=∑p(x)⋅l(x) which is upper bounded by H(X)+2
here, we don't consider one symbol but the whole message → therefore, the result is much better than the Huffman code
the main problem of the Huffman code is that it cannot deal very well with skewed symbol probabilities, since every codeword must have an integral number of bits
problems
we assume that the encoded random variables are independent
we are working with a probabilistic model of our data, which is simplified too much
a better probabilistic model
we cannot use “unlimited” context – the probabilities would be too small
we will use the context of k symbols
according to an experimental estimate
the entropy of English is 1.3 bits per symbol
Context Methods
model of order i … makes use of context of length i
methods
fixed length contexts
combined – contexts of various lengths
complete – all contexts of lengths i,i−1,…,0
partial
static, adaptive
Prediction by Partial Matching (PPM)
Cleary, Witten, Moffat
uses arithmetic coding
we try to encode symbol s in context of length i
if the symbol has zero frequency in the current context, we output the ESC symbol and decrease the context length
assumption: every symbol has non-zero frequency for some (small) context
how to encode symbol s
we have read the context c of length i and the current symbol s
if f(s∣c)>0, encode s using f(∗∣c)
each context c has its own table of frequencies (for arithmetic encoding)
otherwise, output the code for ESC and try order i−1 context
exclusion principle
situation: symbol x occurs in context abc for the first time
we will encode it using second order model f(x∣bc)
there is symbol y such that f(y∣bc)>0 and also f(y∣abc)>0
by using ESC, we already signaled that we are not trying to encode any of the letters present in the third order model (such as y)
therefore, we can temporarily remove y from the second order model so that the compression is more efficient
there are various approaches to assign P(esc∣c)
for example, we can decrement frequencies of all symbols by one
data structure … trie (context tree)
where each node has an edge to its largest proper suffix, this simplifies adding of new leaves
it might not fit in the memory
to solve that, we can freeze the model (update frequencies in existing contexts, stop adding new contexts)
or we can rebuild the model using only the recent history (not the whole file)
we can rebuild the model either if the free memory size drops or if the relative compression ratio starts decreasing
advanced data structure: DAWG (Directed Acyclic Word Graph)
PPMII
“information inheritance”
uses heuristics
became part of RAR
PAQ
arithmetic coding over binary alphabet with sophisticated context modelling
using weighted average of probability estimates from various models
this algorithm has probably the best compression ratio
but it is very time and space consuming
Integer Coding
unary code α
α(1)=1
α(i+1)=0α(i)
clearly, |α(i)| = i
it is the optimal code for the probability distribution p(i) = 2^(−i)
binary code β
is not a prefix code!
we can use fixed length
optimal code for p(i) = 1/|A|
or we can gradually increase the codeword length as the dictionary size increases
Elias codes
binary code always starts with 1
reduced binary code β′ … without the first 1
γ′ … first, we encode the length of the β code using α, then we encode the β′
γ … we interleave the bits of α and β
δ … instead of α, we use γ to encode the length of β
we could construct other codes in a similar fashion but δ code suffices
for large numbers, it is shorter than γ
Elias codes are universal
observation: each probability distribution p_1 ≥ p_2 ≥ ⋯ ≥ p_n satisfies ∀i ∈ [n]: p_i ≤ 1/i
corollary: −log p_i ≥ −log(1/i) = log i
|γ(i)| = log i + Θ(log i)
|δ(i)| = log i + o(log i)
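A sketch of the α, γ and δ encoders following the definitions above (γ is shown in its non-interleaved γ′ form, which has the same codeword lengths):

```python
def alpha(i):                       # unary code: alpha(1)='1', alpha(i+1)='0'+alpha(i)
    return "0" * (i - 1) + "1"

def beta(i):                        # plain binary, always starts with 1
    return bin(i)[2:]

def gamma(i):                       # alpha(|beta(i)|) followed by beta'(i)
    b = beta(i)
    return alpha(len(b)) + b[1:]

def delta(i):                       # gamma(|beta(i)|) followed by beta'(i)
    b = beta(i)
    return gamma(len(b)) + b[1:]

for i in (1, 2, 9, 1000):
    print(i, alpha(i) if i < 20 else "-", gamma(i), delta(i))
# gamma(9) = '0001' + '001' = 0001001 (7 bits, roughly 2*log 9 + 1)
# delta(1000) has 16 bits, shorter than gamma(1000) with 19 bits
```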
Dictionary Methods
idea
repeated phrases stored in a dictionary
phrase occurrence in the text → pointer
problems
construction of the optimal dictionary is NP-hard → greedy algorithm
dictionary is needed for decoding → dynamic methods
LZ77
Jacob Ziv, Abraham Lempel
sliding window
search buffer
look-ahead buffer
search for the longest string S such that
S starts in the search buffer
S matches the prefix of the string in look-ahead buffer
S is encoded as a triple ⟨i,j,z⟩ where
i … distance of the beginning of S from the look-ahead buffer
j=∣S∣
z … the symbol following S in the look-ahead buffer
example
ac|cabracad|abrarr|ar
___\search/\lahead/__
→⟨7,4,r⟩
window slides j+1 symbols to the right
sliding window size 2^12–2^16 B, look-ahead buffer sizes are usually tens or hundreds of bytes
slow compression, fast decompression
typical application – single compression, multiple decompressions
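A simplified LZ77 encoder emitting ⟨i, j, z⟩ triples; the tiny buffer sizes and the linear search are for illustration only (real implementations use much larger windows and faster matching):

```python
def lz77_encode(data, search_size=8, lookahead_size=6):
    out, pos = [], 0
    while pos < len(data):
        start = max(0, pos - search_size)
        best_i, best_j = 0, 0
        for i in range(1, pos - start + 1):           # i = distance back from pos
            j = 0
            while (j < lookahead_size - 1 and pos + j < len(data) - 1
                   and data[pos - i + j] == data[pos + j]):
                j += 1                                # extend the match
            if j > best_j:
                best_i, best_j = i, j
        z = data[pos + best_j]                        # symbol following the match
        out.append((best_i, best_j, z))
        pos += best_j + 1                             # window slides j+1 symbols
    return out

print(lz77_encode("accabracadabrarrar"))
```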
LZSS
triple (i,j,z) replaced with (i,j) or literal z
bit indicator to distinguish one from another
if a pair of pointers requires the same space as p symbols, we use pairs to encode only phrases of length > p (shorter phrases are simply copied to the output)
sliding window → cyclic buffer
possible implementation
codeword 16b, ∣i∣=11, ∣j∣=5
8 bit indicators in 1 byte (one byte describes the types of 8 following items)
other variants
LZB (LZSS with more efficient encoding of (i,j) – using binary code with increasing length for i and γ code for j)
LZH (two-pass: LZSS, Huffman)
Deflate algorithm
by Phil Katz, originally for PKZIP 2
zip, gzip, jar, cab, png
LZSS + static Huffman coding
search buffer size 32 kB
look-ahead buffer size 258 B
match length 3..258 → 256 options
for a smaller match, it uses literals
input divided into blocks
3bit header
is it last block? (1b)
type of the block (2b)
3 block types
no compression
static Huffman code – compression with the fixed trees defined in the Deflate RFC
dynamic Huffman code based on input data statistics
the third block type (dynamic Huffman code)
starts with two Huffman trees
the first tree for literals and match lengths
0..255 for literals
256 = end-of-block
257..285 for match length (extra bits are used to distinguish between particular numbers)
the second tree for offset (again with extra bits)
hashing is used for string searching in search buffer
hash table stores strings of length 3
strings in linked lists sorted by their age
a parameter to limit the maximum list length
greedy strategy extension
first, look for a primary match
slide window one symbol to the right, find a secondary match
if better, encode as literal + secondary match
LZ77 disadvantages
limited outlook of the search window (longer window → better compression, more complex search)
limited length of the look-ahead buffer (match length limitation)
LZ78
no sliding window – an explicit dictionary instead (but the dictionary is not transmitted)
efficient data structure for storing the dictionary … trie
compression
start with an empty dictionary
read the longest prefix of the input which matches some phrase f in the dictionary (zero length is also possible – especially if the dictionary is empty)
output ⟨i,k(s)⟩
i points to f in the dictionary
k(s) is a codeword for symbol s which follows f in the input
insert new phrase fs into the dictionary
dictionary reconstruction is necessary when the dictionary space is exhausted
we can delete the whole dictionary
or delete only the phrases that occurred least frequently or have not occurred recently
we need to preserve the prefix property
if the dictionary contains S, it has to contain all prefixes of S
if compression ratio starts getting worse, we can delete the dictionary and start over
LZW
we initialize the dictionary by the alphabet
we encode the longest prefix present in the dictionary
then, we append the next character and add it to the dictionary
when decoding, it may happen that the item is not in the dictionary yet
because the encoder is one step ahead
so we use the last decoded string and append its first character
we store only the pointers (and the alphabet?)
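A sketch of LZW over a byte alphabet, including the decoder's handling of the "not in the dictionary yet" case described above (no dictionary size limit, for brevity):

```python
def lzw_encode(s):
    dictionary = {chr(i): i for i in range(256)}      # init with the alphabet
    out, phrase = [], ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch                              # keep extending the prefix
        else:
            out.append(dictionary[phrase])            # emit longest known prefix
            dictionary[phrase + ch] = len(dictionary) # add the new phrase
            phrase = ch
    if phrase:
        out.append(dictionary[phrase])
    return out

def lzw_decode(codes):
    dictionary = {i: chr(i) for i in range(256)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                         # not in the dictionary yet:
            entry = prev + prev[0]                    # the encoder is one step ahead
        out.append(entry)
        dictionary[len(dictionary)] = prev + entry[0] # mirror the encoder's insert
        prev = entry
    return "".join(out)

codes = lzw_encode("abababab")
print(codes, lzw_decode(codes))                       # round-trips back to the input
```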
LZC
compress 4.0 utility in Unix (LZC method is the implementation of LZW algorithm)
pointers into dictionary – binary code with increasing length (9–16 bits)
when the maximum size is reached, it continues with a static dictionary
when the compression ratio starts getting worse, it deletes the dictionary and starts over with the initial setting (a special symbol signals this to the decoder)
LZMW
idea: new phrase = concatenation of two last ones (→ phrase length increases faster)
dictionary full → LRU strategy (delete the phrases that have not occurred recently)
dictionary lacks prefix property – this complicates searching, backtracking is used
LZAP
idea: instead of ST add all substrings Sp where p is a prefix of T
larger dictionary → longer codewords, wider choice for phrase selection
faster search → backtracking is not needed
Lossless Image Compression
digital images – raster (bitmap), vector
color spaces
RGB
CMYK
HSV
YUV – for backwards compatibility of the TV broadcast
representation of bitmap images
monochrome
shades of gray
color
pixel = (r,g,b)
alpha channel
color palette
conversion table (pixel = pointer into the table)
types: grayscale (palette with shades of gray), pseudocolor (color palette), direct color (three color palettes, one for each part of the color – R/G/B)
GIF (Graphics Interchange Format)
usage
originally for image transmission over phone lines, suitable for WWW
sharp-edged line art with a limited number of colors → can be used for logos
small animations
not used for digital photography
color palette, max. 256 colors
supports “transparent” color
multiple images can be stored in one file
supports interlacing
rough image after 25–50 % of the data
we can stop the transmission if it's not the image we wanted
4-pass interlacing by lines
LZW-based compression (actually, it uses LZC)
pointers … binary code with increasing length
blocks … up to 255 B
we can increase the number of possible colors by stacking multiple images on top of another (and using transparent pixels)
PNG (Portable Network Graphics)
motivation: replace GIF with a format based on a non-patented method (LZW was patented by Sperry Corporation / Unisys)
color space – gray scale, true color, palette
alpha channel – transparency ∈[0,1]
supports only RGB, no other systems (it was designed for transferring images over the network)
compression has two phases
preprocessing – predict the value of the pixel, encode the difference between the real and predicted value (prediction based on neighboring pixels)
prediction types: none, sub (left pixel), up, average (average of sub and up), Paeth (more complicated formula)
each line may be processed with a different method
dictionary compression
deflate (LZ77)
supports interlacing
7-pass interlacing by blocks of pixels
single image only, no animation
Burrows-Wheeler Transform
BSTW
Bentley, Sleator, Tarjan, Wei
move to front heuristic to restructure the dictionary
we start with an empty dictionary D
for input word s
if s is in dictionary D at position i, we output γ(i) and move the i-th word in D to the front of D
otherwise, we output γ(|D|+1) followed by s itself, and insert s at the front of D
Burrows–Wheeler transform
we are looking for a permutation of string x such that the similar symbols are close to each other
matrix of cyclic shifts of x
sort rows lexicographically
the desired permutation is the last column
also, we need to store the number of the row where the original string lies in the matrix to be able to restore the original
if we used the first column, it would not be decodable
encoding
MTF (move to front) heuristic
for each symbol, its code would be the number of preceding symbols in the dictionary
then, move the symbol to the front of the dictionary
therefore, the number we use to encode the symbol matches the time that elapsed from the last encoding of that symbol (if it was encoded previously)
RLE … run length encoding
instead of n zeros, we can just encode the zero and length of the block
Huffman/arithmetic coding
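A sketch of the MTF step described above (the initial alphabet ordering is an assumption; in bzip2-like pipelines it is fixed to the byte values):

```python
def mtf_encode(s, alphabet):
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)            # number of symbols currently before ch
        out.append(i)
        table.insert(0, table.pop(i))  # move the symbol to the front
    return out

def mtf_decode(codes, alphabet):
    table = list(alphabet)
    out = []
    for i in codes:
        ch = table[i]
        out.append(ch)
        table.insert(0, table.pop(i))
    return "".join(out)

bwt_like = "nnbaaa"                     # runs of equal symbols, as after BWT
codes = mtf_encode(bwt_like, "abn")
print(codes, mtf_decode(codes, "abn"))  # [2, 0, 2, 2, 0, 0] nnbaaa
```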
decoding
how to obtain the original string from the last column and the number?
we sort the last column to get the first one
we can determine the penultimate symbols from the pairs of first and last – thanks to the cyclic shift
what if there are multiple lines that end with the same symbol? the lines are sorted lexicographically, so equal symbols keep the same relative order in the first and last columns – the k-th occurrence of a symbol in the last column corresponds to its k-th occurrence in the first column
matrix n×n is not necessary
we can use a suffix array
this would be equivalent to the matrix of rotations of x$, where x is the original string and $ is a sentinel character
why does BW transform work?
two same letters often share very similar right context
therefore they will be close in the rightmost column
(we are actually sorting the letters by their right context)
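A naive sketch of the transform and its inverse built directly from the matrix of rotations (fine for illustration; real implementations use suffix arrays as noted above):

```python
def bwt(s):
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))    # matrix of cyclic shifts
    last = "".join(row[-1] for row in rotations)            # last column
    return last, rotations.index(s)                         # + row of the original

def inverse_bwt(last, idx):
    n = len(last)
    table = [""] * n
    for _ in range(n):                 # repeatedly prepend the last column and sort
        table = sorted(last[i] + table[i] for i in range(n))
    return table[idx]

last, idx = bwt("abracadabra")
print(last, idx)                       # rdarcaaaabb 2
print(inverse_bwt(last, idx))          # abracadabra
```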
bzip2
Julian Seward
open source
has more efficient compression than gzip/zip
but bzip2 is slower
it does not support archiving
it can only compress single files
just like gzip
can be used to compress tar archives
compression
split the input into blocks of 100–900 kB (determined by an input parameter)
Burrows–Wheeler transform
MTF, RLE
arithmetic coding (patented) was replaced with Huffman coding
Lossy Compression
no standard corpora exist
how to measure the loss of information
mean squared error (MSE)
signal to noise ratio (SNR)
peak signal to noise ratio
Digitizing Sound
sound is a physical disturbance in a medium, it propagates as a pressure wave
Fourier theorem
sampling
Nyquist-Shannon sampling theorem
the sampling frequency should be at least twice the highest frequency we want to record
if the condition fails, it leads to aliasing (distortion)
there appear false frequency components that were not present in the original analog signal
for sampling frequency F_s and an original frequency f > F_s/2, the new false frequency will be f′ = F_s − f
example: the carriage (or stagecoach) wheels in the movie are spinning backwards
quantization
we have B bits available
we split the whole interval into 2^B subintervals
we want the largest signal to noise ratio that is possible
audio processing
input
amplifier
band-pass filter
sampler … sampling
A/D converter … quantization
RAM
output
RAM
D/A converter
band-pass filter
amplifier
Quantization
main idea
large set → smaller set
quantizer can be scalar or vector
quantization consists of two mappings – encoding and decoding
central limit theorem
if we collect n samples of random variable X, the distribution of the sample mean approaches N(EX, var(X)/n) for a sufficiently large n
problem formulation
source … random variable X with density f(x) (common assumption – f is symmetric about 0)
determine: number of intervals (levels), interval endpoints (decision boundaries), representative y_i for each interval (reconstruction levels)
goal: to minimize the quantization error (distortion) σ² … MSE
σ² = E[(x − Q(x))²]
the number of intervals (levels) may be fixed if we want fixed length codewords
uniform quantizer
uniform probability distribution
all intervals of equal length
two types
midrise quantizer – even number of levels, Q(0+ε) ≠ 0 (zero is a decision boundary)
midtread quantizer – odd number of levels, Q(0+ε)=0 (zero is in the middle of the central interval)
we assume that the range of X is bounded
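A sketch of the two variants with step size Δ (clipping of out-of-range inputs is omitted):

```python
import math

def midrise(x, delta):
    # even number of levels, zero is a decision boundary, Q(0+eps) = delta/2
    return (math.floor(x / delta) + 0.5) * delta

def midtread(x, delta):
    # odd number of levels, zero lies in the middle of the central interval
    return round(x / delta) * delta

for x in (0.1, 0.6, -0.3):
    print(x, midrise(x, 0.5), midtread(x, 0.5))
# 0.1 -> 0.25 / 0.0,  0.6 -> 0.75 / 0.5,  -0.3 -> -0.25 / -0.5
```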
forward adaptive quantizer (offline)
split the input into blocks
analyze each separately, set parameters accordingly
parameters are transmitted as side information
backward adaptive quantizer (online)
no side info necessary
Jayant quantizer
multipliers
intervals can shrink and expand depending on the input
for each input value, we get the corresponding code c and we multiply Δ by the multiplier Mc
the intervals automatically adapt to the input data
3-bit Jayant quantizer
table of intervals and multipliers

code   interval        multiplier
0      (0, Δ)          0.8
1      (Δ, 2Δ)         0.9
2      (2Δ, 3Δ)        1
3      (3Δ, ∞)         1.2
4      (−Δ, 0)         0.8
5      (−2Δ, −Δ)       0.9
6      (−3Δ, −2Δ)      1
7      (−∞, −3Δ)       1.2
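A sketch of the backward-adaptive step-size update with the multipliers from the table above (the Δ limits are assumptions to keep Δ bounded):

```python
M = [0.8, 0.9, 1.0, 1.2]                       # multipliers for levels 0..3

def jayant_encode(samples, delta=1.0, dmin=0.01, dmax=100.0):
    codes = []
    for x in samples:
        level = min(int(abs(x) / delta), 3)    # which interval (0..3)
        code = level if x >= 0 else level + 4  # codes 4..7 are the negative side
        codes.append(code)
        # adapt: expand the step after outer levels, shrink it after inner ones
        delta = min(max(delta * M[level], dmin), dmax)
    return codes

print(jayant_encode([0.2, 0.3, 3.5, 4.0, 0.1]))   # [0, 0, 3, 3, 0]
```

The decoder applies the same multipliers to its own copy of Δ, which is why no side information is needed.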
nonuniform quantization
we set y_j to the average value in the interval
y_j = (∫_{b_{j−1}}^{b_j} x·f(x) dx) / (∫_{b_{j−1}}^{b_j} f(x) dx)
we set the interval bounds as the averages of neighboring values
b_j = (y_j + y_{j+1}) / 2
Lloyd-Max algorithm
M-level symmetric quantizer
we set b_0 = 0
we somehow approximate y_1
then we compute b_1 from the equation for y_j stated above
y_2 = 2b_1 − y_1 (similarly b_2, y_3, …)
note: the b bounds and y values are symmetric about 0, y_−1 = −y_1 etc.
computation depends on the initial estimate y_1
we want to correct our wrong estimate
we know that b_{M/2} = max input value
we can compare the resulting y_{M/2} with the value y*_{M/2} computed using the quotient of integrals
if they differ too much, we correct the initial estimate of y_1 and repeat the whole procedure
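A sample-based sketch of the underlying centroid/midpoint iteration (using training samples instead of the density f; the quantile initialization and the stopping rule are my assumptions):

```python
import numpy as np

def lloyd_quantizer(samples, m, iters=100, eps=1e-8):
    samples = np.asarray(samples, dtype=float)
    y = np.quantile(samples, (np.arange(m) + 0.5) / m)  # initial representatives
    prev_mse = np.inf
    for _ in range(iters):
        b = (y[:-1] + y[1:]) / 2                  # boundaries = midpoints
        idx = np.searchsorted(b, samples)         # assign samples to intervals
        for j in range(m):
            cell = samples[idx == j]
            if cell.size:                         # representatives = centroids
                y[j] = cell.mean()
        mse = np.mean((samples - y[idx]) ** 2)
        if prev_mse - mse < eps:
            break
        prev_mse = mse
    return y, b

rng = np.random.default_rng(0)
y, b = lloyd_quantizer(rng.normal(size=10_000), m=4)
print(np.round(y, 3), np.round(b, 3))   # representatives cluster where f(x) is large
```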
companding
compressing + expanding
first we stretch the high probability regions close to the origin & compress the regions with low probability
then apply uniform quantization
expand afterwards (transform quantized values to the original scale)
μ-law
telephony, USA, Japan
μ = 255
there is c(x) which compresses the values and c^(−1)(x) which expands them (the constant μ appears in the formulas)
A-law
telephony, Europe
A=87.6
there are different c(x) and c^(−1)(x) than in μ-law
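A sketch of the μ-law compressor/expander pair mentioned above, for inputs normalized to [−1, 1] and μ = 255:

```python
import math

MU = 255.0

def mu_compress(x):                 # |x| <= 1
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):                   # inverse mapping
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

for x in (0.01, 0.1, -0.5):
    y = mu_compress(x)
    print(x, round(y, 3), round(mu_expand(y), 3))
# small amplitudes are stretched before uniform quantization, then expanded back
```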
G.711
ITU-T standard for audio companding
sampling frequency 8 kHz
sample length is 14b (μ-law) or 13b (A-law)
μ-law uses midtread quantizer
A-law uses midrise quantizer
comparison
μ … higher dynamic range, noise reduction in blocks of silence (thanks to midtread)
A … lower relative distortion of small values
application: VoIP
Vector Quantization
we split the file (or group the individual samples) into vectors
L consecutive samples of speech
block of L pixels of an image (a rectangle/square)
→ vector of dimension L
idea: codebook
code-vectors selected to be represented, each one has a binary index
encoder finds the closest vector in the codebook (using Euclidean distance or MSE)
decoder retrieves the code-vector given its binary index
codebook size K → K-level vector quantizer
we can design the codebook using k-means algorithm (or we can use LBG as described below)
LBG algorithm
Linde, Buzo, Gray
we have codebook C, training set T, threshold ε
we can compute quantization regions – assign each input vector to its code-vector
we can compute the average distortion (MSE) in each step of the algorithm
if it is less than the threshold, we can return the codebook
otherwise, we need to update it (we set each code-vector to the average of all the input vectors in its group – like in k-means)
initialization
LBG splitting technique
start with a single code-vector – the average of all vectors in the training set
in each step we generate a perturbation vector, add it to each code-vector and introduce those new vectors to the codebook → the codebook size doubles
then, we run LBG with the codebook (repeatedly update the code-vectors until the distortion drops below certain level)
we repeat this until the desired number of levels is reached
random init (Hilbert)
initialize C randomly several times
select the alternative with lowest distortion
pairwise nearest neighbor (Equitz)
start with C=T
in each step, replace two code-vectors by their mean
goal: smallest increase in distortion
it is possible that during the algorithm, we are left with a code-vector that does not have any corresponding training set vectors (“empty cell problem”) → we will replace it with a new code-vector chosen randomly from the cluster with the highest population of training vectors or the highest associated distortion
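A compact sketch of the LBG loop with the splitting initialization (the perturbation size and the stopping rule are assumptions):

```python
import numpy as np

def lbg(training, k, eps=1e-4, perturb=0.01):
    """training: (N, L) array of training vectors; k: codebook size (power of two)."""
    delta = perturb * training.std(axis=0)            # small perturbation vector
    codebook = training.mean(axis=0, keepdims=True)   # start with a single code-vector
    while len(codebook) < k:
        codebook = np.concatenate([codebook + delta, codebook - delta])  # split
        prev = np.inf
        while True:                                   # plain LBG / k-means update
            d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            assign = d.argmin(axis=1)                 # nearest code-vector per input
            distortion = d[np.arange(len(training)), assign].mean()
            for j in range(len(codebook)):
                cell = training[assign == j]
                if len(cell):                         # centroid of the region
                    codebook[j] = cell.mean(axis=0)
            if prev - distortion < eps * distortion:
                break
            prev = distortion
        # (the empty-cell problem mentioned above is ignored in this sketch)
    return codebook

rng = np.random.default_rng(1)
print(lbg(rng.normal(size=(2000, 2)), k=4).round(2))  # 4 code-vectors around the origin
```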
tree-structured vector quantization (TSVQ)
when quantizing a vector, we don't want to compare it to each vector in the codebook
we divide the codebook into two groups such that there exist two test vectors where every vector from the first group is closer to the first test vector and every vector from the second group is closer to the second one
we repeat the process → binary tree
pros: we need a lower number of comparisons, we can build a binary string (code) as we progress down the tree
cons: increased storage requirement (we need to store the test vectors, the codebook does not suffice), possible increase in distortion
we may use LBG splitting technique to get test vectors
we need to modify the technique, we will always run LBG only with two code-vectors and the corresponding subset of the test set
pruned TSVQ
we can remove some subtrees to decrease the average rate (codeword length) – but this will increase average distortion
also, we can split some nodes
this way, we get binary codewords of variable length, the codewords correspond to paths in the decision tree → prefix code
lattice vector quantizers
idea: regular arrangement of code-vectors in space
quantization regions should tile the output space of the source
D lattices
D2 lattice (in two-dimensional space)
a·v_1 + b·v_2 = a·(0,2) + b·(1,1) = (b, 2a+b)
squares
the sum of the coordinates is always even
A lattices
A2 lattice is hexagonal
G.719
ITU-T standard for wideband audio coding
modified discrete cosine transform (MDCT)
vector quantizer based on D8 lattice
classified vector quantization
idea: divide source vectors into classes with different properties, perform quantization separately
example: edge/nonedge regions (in an image)
large variance of pixels ⟹ presence of an edge (will use edge codebook for this vector)
output has to contain the codebook label and the vector label
applications
audio: G.719, CELP, Ogg Vorbis, TwinVQ
video: QuickTime, VQA
Differential Encoding
idea: consecutive samples are quite similar (that holds especially for speech and images)
if we quantize the differences directly, the error will accumulate
instead, we quantize the difference between the current value and a prediction computed from previously reconstructed (decoded) values
more generally, the prediction can be a function of several previous reconstructed values, not just the last one
this prediction function can be called p_n
DPCM (differential pulse-code modulation)
C. Chapin Cutler, Bell Labs
goal: minimize MSE (error between the real value and pn)
linear predictor p_n = ∑_{i=1}^{N} a_i·x̂_{n−i}
where x̂_k is the k-th reconstructed value
problem: find a_j which minimize MSE
set the derivative w.r.t. a_j equal to zero
∀j: −2·E[(x_n − ∑_i a_i·x_{n−i})·x_{n−j}] = 0
E[(x_n − ∑_i a_i·x_{n−i})·x_{n−j}] = 0
E[x_n·x_{n−j}] = ∑_i a_i·E[x_{n−i}·x_{n−j}]
we can rewrite this using the autocorrelation R_XX(k) = E[x_n·x_{n+k}]
we get Wiener-Hopf equations (matrix multiplication)
we will estimate R_XX(k) – nothing complicated, it's just an average of products
we assume that the process is stationary
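A numeric sketch: estimate R_XX(k) from the data and solve the Wiener-Hopf system for the predictor coefficients (the predictor order and the synthetic AR(2) test signal are arbitrary choices):

```python
import numpy as np

def predictor_coeffs(x, N):
    x = np.asarray(x, dtype=float)
    # autocorrelation estimate R_XX(k) as an average of products
    R = np.array([np.mean(x[:len(x) - k] * x[k:]) for k in range(N + 1)])
    A = np.array([[R[abs(i - j)] for j in range(N)] for i in range(N)])  # Toeplitz
    r = R[1:N + 1]
    return np.linalg.solve(A, r)      # solve the Wiener-Hopf equations

# synthetic stationary signal: x_n = 0.75 x_{n-1} - 0.5 x_{n-2} + noise
rng = np.random.default_rng(0)
x = np.zeros(5000)
for n in range(2, len(x)):
    x[n] = 0.75 * x[n - 1] - 0.5 * x[n - 2] + rng.normal(scale=0.1)
print(predictor_coeffs(x, 2).round(2))   # should be close to [0.75, -0.5]
```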
ADPCM
adaptive DPCM
forward adaptive prediction (DPCM-APF)
estimate autocorrelation for blocks
we need to transfer coefficients and buffer the input (group it into blocks)
backward adaptive prediction (DPCM-APB)
no need to transfer additional info
we will use stochastic gradient descent during coding (LMS = least mean squares algorithm)
A^(n+1) = A^(n) + α·d̂_n·X̂_{n−1}
G.726
ITU-T standard for speech compression based on ADPCM
predictor uses two latest predictions and six latest differences (between predicted values)
simplified LMS algorithm
backward adaptive quantizer (modified Jayant)
delta modulation
speech coding
very simple form of DPCM
2-level quantizer (we can only encode +Δ or −Δ)
we need really high sampling frequency
problems
constant signal (we encode it as a sequence of ↑↓↑↓↑↓↑↓)
“granular regions”
a slope steeper than delta
“slope overload region”
solution: adapt Δ to the characteristics of the input
constant factor adaptive delta modulation (CFDM)
objective: increase delta in overload regions, decrease it in granular regions
how to identify the regions?
use a history of one sample (or more)
application: space shuttle ↔ ground terminal (CFDM with 7-sample memory)
Transform Coding
scheme: group individual samples into blocks, transform each block, quantization, coding
for images, we have to make a two-dimensional transform
it is better to transform each dimension separately = separable transform
separable transform using matrix A
transformed data Θ = A·X·A^T
original data X = A^(−1)·Θ·(A^(−1))^T
orthogonal transform … we use orthogonal transform matrix
transformed data Θ = A·X·A^T
original data X = A^T·Θ·A
moreover ∑_i θ_i² = ∑_i x_i²
gain G_T (of a transform T) = ratio of the arithmetic to the geometric mean of the variances σ_i²
σ_i² … variance of coefficient θ_i
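A sketch of the separable orthogonal transform Θ = A·X·A^T using the orthonormal DCT-II matrix as A (the 8×8 block is random data, just to check the reconstruction and energy-preservation identities):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix: first row constant, then rows of increasing variation."""
    A = np.zeros((n, n))
    for i in range(n):
        c = np.sqrt(1 / n) if i == 0 else np.sqrt(2 / n)
        for j in range(n):
            A[i, j] = c * np.cos((2 * j + 1) * i * np.pi / (2 * n))
    return A

A = dct_matrix(8)
X = np.random.default_rng(0).normal(size=(8, 8))    # one 8x8 block of "pixels"
Theta = A @ X @ A.T                                 # forward separable transform
print(np.allclose(A @ A.T, np.eye(8)))              # True: A is orthogonal
print(np.allclose(A.T @ Theta @ A, X))              # True: perfect reconstruction
print(np.allclose((Theta**2).sum(), (X**2).sum()))  # True: energy is preserved
```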
Karhunen-Loève transform
minimizes the geometric mean of the variance of the transform coefficients
transform matrix consists of eigenvectors of the autocorrelation matrix R
problem: non-stationary input → matrix R changes with time → KLT needs to be recomputed
discrete cosine transform
DCT is the “cosine component” of DFT
redistributes information so that we can quantize it more easily
why is DCT better than DFT?
FT requires a periodic signal
periodic extension of non-periodic signal would introduce sharp discontinuities and non-zero high-frequency coefficients
DCT may be obtained from DFT (start with the original non-periodic signal consisting of N points, mirror it to get 2N points, apply DFT, take the first N points)
DCT transform matrix – 1st row is constant vector, following rows consist of vectors with increasing variation
these are the portions of the original “signal” (data matrix X) that the coefficients in the Θ matrix correspond to
these can be called “basis matrices”
basis matrix corresponding to θ_11 is a constant matrix ⟹ θ_11 is some multiple of the average value in the data matrix
θ_11 … DC coefficient
other θ_ij … AC coefficients
gain G_DCT is close to the optimum value for Markov sources with high correlation coefficient ρ = E[x_n·x_{n+1}] / E[x_n²]
discrete sine transform
gain G_DST is close to the optimum value for Markov sources with low correlation coefficient ρ = E[x_n·x_{n+1}] / E[x_n²]
Hadamard matrix
square matrix of order n such that H·H^T = n·I
H_1 = (1)
H_{2n} = ( H_n   H_n
           H_n  −H_n )
discrete Walsh-Hadamard transform
DWHT transform matrix is a rearrangement of H_n
multiply H_n by the normalizing factor 1/√n (so the matrix becomes orthogonal)
reorder the rows by increasing number of sign changes
pro: speed
con: G_DWHT ≪ G_DCT
quantization of transform coefficients
zonal sampling
coefficients with higher expected variance are assigned more bits for sampling
pro: simplicity
con: bit allocations based on average rules → local variations (e.g. sharp edges on plain background) might not be reconstructed properly
quantization and coding
threshold coding
based on the threshold, we decide which coefficients to keep (or discard)
we have to store the numbers of discarded coefficients
threshold is not decided a priori
Chen & Pratt suggest zigzag scan
tail of the scan should consist of zeros (generally, the higher-order coefficients have smaller amplitude) → for fixed length of the block, we don't need to specify the number of zeros, just send end-of-block (EOB) signal
note: DC coefficients correspond to contours, higher-order coefficients correspond to details of the image → we can get rid of them
JPEG (Joint Photographic Experts Group)
JPEG standard specifies a codec
lossy / lossless compression
progressive transmission scheme
file format
JPEG/JFIF – minimal format version
JPEG/Exif – digital photography
compression scheme
color space transform
RGB → YCbCr
human eye is more sensitive to details in Y (luminance) component
reduction of colors (chroma subsampling) in color components
4:4:4 (no reduction) or 4:2:2 or 4:2:0
4:2:2 or 4:2:0 → colors have lower resolution than the luminance