k′ … size of data compressed by a standard algorithm
data corpora
Calgary Corpus – 14 files: text, graphics, binary files
Canterbury Corpus – 11 files + artificial corpus (4) + large corpus (3) + miscellaneous corpus (1)
Silesia Corpus
Prague Corpus
contests
Calgary Corpus Compression Challenge
Hutter Prize – Wikipedia compression
goal – encourage research in AI
limits of lossless compression
encoding f: {n-bit strings} → {strings of length <n}
|Dom f| = 2^n
|Im f| ≤ 2^n − 1
such f cannot be injective → does not have an inverse function
let M⊆Domf such that ∀s∈M:∣f(s)∣≤0.9n
f injective on M ⟹ |M| ≤ 2^(0.9n+1) − 1
n = 100 … |M|/2^n < 2^(−9)
n = 1000 … |M|/2^n < 2^(−99) ≈ 1.578·10^(−30)
we can't compress all data efficiently (but that does not matter, we often focus on specific instances of data)
Lossless Compression
Statistical Methods
basic concepts
source alphabet … A
coding alphabet … A_C
encoding … f: A^* → A_C^*
f injective → uniquely decodable encoding
typical strategy
input string is factorized into concatenation of substrings
s = s_1 s_2 … s_k
substrings = phrases
f(s) = C(s_1) C(s_2) … C(s_k)
C(s_i) … codeword
codes
(source) code … K: A → A_C^*
K(s_i) … codeword for symbol s_i ∈ A
encoding K^* generated by code K is the mapping K^*(s_1 s_2 … s_n) = K(s_1) K(s_2) … K(s_n) for every s ∈ A^*
K is uniquely decodable ≡ it generates a uniquely decodable encoding K^*
example
{0,01,11} … is uniquely decodable
{0,01,10} … is not, counterexample: 010
K1 … fixed length
K2 … no codeword is a prefix of another one
prefix codes
prefix code is a code such that no codeword is a prefix of another codeword
observation
every prefix code K (over a binary alphabet) may be represented by a binary tree = prefix tree for K
prefix tree may be used for decoding
idea: some letters are more frequent, let's assign shorter codewords to them
historical example – Morse code
Shannon-Fano coding
input – source alphabet A, frequency f(s) for every symbol s∈A
output – prefix code for A
algorithm
sort the source alphabet by symbol frequencies
divide the table into parts with similar sums of frequencies
append 0/1 to prefix, repeat
divide and conquer algorithm
definition: a prefix (uniquely decodable) code K is optimal ≡ |K′^*(a_1 a_2 … a_n)| ≥ |K^*(a_1 a_2 … a_n)| for every prefix (uniquely decodable) code K′ and for every string a ∈ A^* with symbol frequencies f
Shannon-Fano is not optimal
Fano's student David Huffman described a construction of optimal prefix code
Huffman Code
algorithm
constructs the tree from the bottom
starts with the symbols that have the lowest frequencies
sum of the frequencies is used as the frequency of the new node
greedy algorithm
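A minimal sketch of this greedy construction in Python using a binary heap (the frequency table and the tie-breaking order are illustrative, not from the lecture):

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code from a dict {symbol: frequency}."""
    # heap items: (frequency, tie_breaker, subtree); a subtree is either
    # a symbol (leaf) or a pair of subtrees (internal node)
    heap = [(f, i, s) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if counter == 1:                       # degenerate one-symbol alphabet
        return {next(iter(freqs)): "0"}
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)    # two lowest-frequency subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))  # merged node
        counter += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):        # internal node
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                              # leaf = symbol
            code[tree] = prefix
    walk(heap[0][2], "")
    return code

freqs = Counter("abracadabra")
print(huffman_code(freqs))   # codeword lengths: a -> 1 bit, b, r, c, d -> 3 bits
```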
optimality of Huffman code
notation
alphabet A of symbols that occur in the input string
∀s∈A:f(s) … frequency of the symbol
let T be a prefix tree for alphabet A with frequencies f
leaves of T … symbols of A
let d_T(s) denote the depth of a leaf s in the tree T
d_T(s) = length of the path from the root to s
d_T(s) = length of the codeword for symbol s
cost C(T) of tree T
C(T) = ∑_{z∈A} f(z)·d_T(z)
C(T) … length of the encoded message
lemma 0: if a prefix tree T (∣V(T)∣≥3) contains a vertex with exactly one child, then T is not optimal
proof: we could contract the edge between the vertex and its child
lemma 1
let A (∣A∣≥2) be an alphabet with frequencies f
let x,y∈A be symbols with the lowest frequencies
then there exists an optimal prefix tree for A in which x,y are siblings of maximum depth
proof
let T be an optimal tree in which x,y are not siblings of maximum depth; instead, some a,b are – swapping x with a and y with b cannot increase the cost (f(x) ≤ f(a), f(y) ≤ f(b), d_T(x) ≤ d_T(a), d_T(y) ≤ d_T(b)), so the resulting tree is optimal too
lemma 2: let T′ be an optimal prefix tree for alphabet A′ = A ∖ {x,y} ∪ {z} where f(z) = f(x) + f(y)
then tree T obtained from T′ by adding children x and y to z is an optimal prefix tree for A
proof
T′ optimal for A′
we get T by adding x,y below z (therefore z is no longer a symbol of the alphabet)
there exists T′′ optimal for A in which x,y are siblings of maximum depth
using lemma 1
we label their parent z
we remove x,y and get T′′′ for A′
C(T)=C(T′)+f(x)+f(y)≤C(T′′′)+f(x)+f(y)=C(T′′)
clearly C(T′)≤C(T′′′) as T′ is optimal
T′′ optimal ⟹T optimal
theorem: algorithm Huffman produces an optimal prefix code
proof by induction
theorem (Kraft-McMillan inequality)
codeword lengths l_1, l_2, …, l_n of a uniquely decodable code satisfy ∑_i 2^(−l_i) ≤ 1
conversely, if positive integers l_1, l_2, …, l_n satisfy the inequality, then there is a prefix code with codeword lengths l_1, l_2, …, l_n
corollary
Huffman coding algorithm produces an optimal uniquely decodable code
if there was some better uniquely decodable code, there would be also a better prefix code (by Kraft-McMillan inequality) so Huffman coding would not be optimal
idea: we could encode larger parts of the message, it might be better than Huffman coding
extended Huffman code … we use n-grams instead of individual symbols
Huffman coding is not deterministic – the order in which symbols with equal frequencies are combined is not specified, and different choices may lead to optimal trees of different heights
can we modify Huffman algorithm to guarantee that the resulting code minimizes the maximum codeword length?
yes, when constructing the tree, we can sort the nodes not only by their frequencies but also by the depth of their subtree
what is the maximum height of a Huffman tree for an input message of length n?
we find the minimal input size to get the Huffman tree of height k
how to make the tree as deep as possible?
the frequencies of the symbols will be the Lucas numbers (with the zeroth number split into two ones) or the Fibonacci sequence, depending on the exact rules for constructing the tree
Huffman tree of height k ⟹ n ≥ f_{k+1}
where f_0 = 1, f_1 = 1, f_{k+2} = f_{k+1} + f_k
n < f_{k+2} ⟹ max. codeword length ≤ k
there is an optimal prefix code which cannot be obtained using Huffman algorithm
generalize the construction of binary Huffman code to the case of m-ary coding alphabet (m>2)
the trivial approach does not produce an optimal solution
for the ternary coding alphabet, we can modify the first step – we will only take two leaves to form a subtree
implementation notes
bitwise input and output (usually, we work with bytes)
overflow problem (for integers etc.)
code reconstruction necessary for decoding
canonical Huffman code
if we know that the code contains 2 codewords of length 2 and 4 codewords of length 3, we can construct the code (we don't need to encode the tree)
00, 01, 100, 101, 110, 111
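A sketch of how such a canonical code can be rebuilt from the codeword lengths alone (the symbols a–f are made up):

```python
def canonical_code(lengths):
    """lengths: dict {symbol: codeword length} -> dict {symbol: codeword}.
    Symbols are sorted by (length, symbol); codewords are consecutive
    binary numbers, shifted left whenever the length increases."""
    code, value, prev_len = {}, 0, 0
    for sym, l in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        value <<= (l - prev_len)          # append zeros when the length grows
        code[sym] = format(value, "0{}b".format(l))
        value += 1
        prev_len = l
    return code

# the example from the notes: 2 codewords of length 2, 4 codewords of length 3
print(canonical_code({"a": 2, "b": 2, "c": 3, "d": 3, "e": 3, "f": 3}))
# {'a': '00', 'b': '01', 'c': '100', 'd': '101', 'e': '110', 'f': '111'}
```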
adaptive compression
static × adaptive methods
statistical data compression consists of two parts: modeling and coding
what if we cannot read the data twice?
we will use the adaptive model
adaptive Huffman code
brute force strategy – we reconstruct the whole tree
after each frequency change
after reading k symbols
after change of order of symbols sorted by frequencies
characterization of Huffman trees
Huffman tree – binary tree with nonnegative vertex weights
two vertices of a binary tree with the same parent are called siblings
binary tree with nonnegative vertex weights has a sibling property if
∀ parent: weight(parent) = ∑ weight(child)
each vertex except root has a sibling
vertices can be listed in order of non-increasing weight with each vertex adjacent to its sibling (the weights are non-decreasing if we print them by levels, bottom to top, left to right)
theorem (Faller 1973): binary tree with non-negative vertex weights is a Huffman tree iff it has the sibling property
FGK algorithm (Faller, Gallager, Knuth)
we maintain the sibling property
zero-frequency problem
how to encode a novel symbol
1st solution: initialize Huffman tree with all symbols of the source alphabet, each with weight one
2nd solution: initialize Huffman tree with a special symbol esc
encode the 1st occurrence of symbol s as a Huffman code of esc followed by s
insert a new leaf representing s into the Huffman tree
average codeword length … l_FGK ≤ l_H + O(1)
implementation problems
overflow
we can multiply the frequencies by 1/2
this may lead to tree reconstruction
statistical properties of the input may be changing over time – it might be useful to reduce the influence of older frequencies in time
by multiplying by x∈(0,1)
we'll use integer arithmetic → we need to scale the numbers
Arithmetic Coding
with each string s ∈ A^* we associate a subinterval I_s ⊆ [0,1] such that
s ≠ s′ ⟹ I_s ∩ I_s′ = ∅
length of I_s = p(s)
C(s) = a number from I_s with the shortest possible code
we'll first assume that p(a_1 a_2 … a_m) = p(a_1)·p(a_2)·…·p(a_m)
encoding
we count the frequencies → probabilities
we sort the symbols in some order
we assign intervals to the symbols in the order
symbol with a higher frequency will be assigned a larger interval
to encode the message, we subdivide the intervals
for example, let's say that I_B = (0.2, 0.3) and I_I = (0.5, 0.6)
then I_BI = (0.25, 0.26)
the shorter the message, the longer the interval
we have to store the length of the input or encode a special symbol (end of file) to be able to distinguish between the encoded string and its prefixes
decoding
we reconstruct the input string from the left
we inverse the mapping
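An illustrative floating-point sketch of this interval narrowing (the three-symbol model and the message are made up); real coders use the integer arithmetic described below to avoid exactly the rounding problems mentioned next:

```python
# cumulative intervals for a toy model: p(A)=0.5, p(B)=0.3, p(C)=0.2
model = {"A": (0.0, 0.5), "B": (0.5, 0.8), "C": (0.8, 1.0)}

def encode(msg):
    low, high = 0.0, 1.0
    for ch in msg:
        lo, hi = model[ch]
        low, high = low + (high - low) * lo, low + (high - low) * hi
    return (low + high) / 2            # any number inside the final interval

def decode(x, length):
    out = []
    for _ in range(length):            # we must know the message length
        for ch, (lo, hi) in model.items():
            if lo <= x < hi:
                out.append(ch)
                x = (x - lo) / (hi - lo)   # rescale and continue
                break
    return "".join(out)

x = encode("ABAC")
print(x, decode(x, 4))                 # 0.3175 ABAC
```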
problem: using floating point arithmetic leads to rounding errors → we will use integers
underflow may happen that way if we use too short integers
instead of individual frequencies of each symbol, we will use cumulative frequencies (it directly represents the intervals)
if we have b bits available to represent the current state, the (initial) interval can be [0, 2^b − 1) or [0, 2^(b−1))
another problem
we use bit shifting and assume that the first bits of both bounds are the same
what if the first bits differ?
we don't output anything yet and increment a counter
it is similar to the situation (in real numbers) when we get to the range 0.53–0.67
we cannot write 5 or 6 and shift the number because we don't know whether the next range will be something like 0.53–0.56 or 0.63–0.67
we postpone the decision instead
algorithm
start with the interval [0, 2^b − 1) represented as left bound L and range R
encode each character of the block
if the range R falls below 2^(b−2), we need to prevent underflow
if the current interval (range) starts in the first half of the interval [0, 2^b − 1) and ends in the second half, we don't output anything, only increment the counter and rescale the current interval
otherwise, we output one or more (most significant) bits of L and rescale the current interval (also, we decrement the counter by printing the correct bits)
time complexity
encoding O(|A| + |output| + |input|)
decoding O(|A| + |output| + |input|·log|A|)
it is also possible to make decoding faster by using more memory
we will represent the table of frequencies as an implicit tree
that way we can decode in time O(|A| + |output| + |input|)
adaptive version
cumulative frequencies table as an implicit tree
Fenwick tree
Fenwick frequency = prefix sum (sum of all the weights of the lexicographically smaller vertices)
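A sketch of such a Fenwick (binary indexed) tree for maintaining cumulative frequencies; the class name and 1-based indexing convention are my own choices:

```python
class Fenwick:
    def __init__(self, n):
        self.tree = [0] * (n + 1)          # 1-based indexing

    def update(self, i, delta):
        """Add delta to the frequency of symbol i (1-based)."""
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i                    # move to the next responsible node

    def prefix_sum(self, i):
        """Cumulative frequency of symbols 1..i."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i                    # strip the lowest set bit
        return s

f = Fenwick(8)
for sym in [3, 3, 5, 1]:
    f.update(sym, 1)
print(f.prefix_sum(3), f.prefix_sum(5))    # 3 4
```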
Theoretical Limits
Hartley's formula
minimum and average number of yes/no questions to find the answer
if x is a member of an n-element set, then x carries log_2 n bits of information
if we have a k-tuple of elements of an n-element set and S_k is the least number of questions necessary to determine all x_i, then S_k/k is the average number of questions necessary to determine one x_i
|R| = n, x ∈ R^k
log n^k ≤ S_k < log n^k + 1
k·log n ≤ S_k < k·log n + 1
log n ≤ S_k/k < log n + 1/k
Shannon's formula
entropy … H(X) = −∑_{x∈R} p(x)·log p(x)
theorem: H(X)≥0
lemma (Gibbs inequality): for probability distributions (p_i) and (q_i) we have −∑_i p_i·log p_i ≤ −∑_i p_i·log q_i, with equality iff p_i = q_i for all i
theorem: for a random variable X with a finite range R, |R| = n, we have H(X) ≤ log n
and H(X) = log n ⟺ ∀x∈R: p(x) = 1/n
we plug q_i = 1/n into the inequality
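A quick numeric illustration of H(X) ≤ log n (the two example distributions are arbitrary):

```python
from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

uniform = [1/4] * 4
skewed  = [0.7, 0.1, 0.1, 0.1]
print(entropy(uniform), log2(4))   # 2.0 2.0  -> equality for the uniform distribution
print(entropy(skewed))             # ~1.357   -> strictly below log 4
```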
Kraft-McMillan theorem
codeword lengths l_1, l_2, … of a uniquely decodable code C satisfy ∑_i 2^(−l_i) ≤ 1
on the other hand, if natural numbers l_1, l_2, … satisfy the inequality, then there is a prefix code with codewords of these lengths
proof
let's start with the k-th power of the sum ∑_i 2^(−l_i)
(∑_{i=1}^{n} 2^(−l_i))^k = (∑_{x∈R} 2^(−l(x)))^k
we can rewrite that as a sum of products
and l(x_{i1}) + l(x_{i2}) + … + l(x_{ik}) = l(x_{i1} x_{i2} … x_{ik})
therefore it is equal to ∑_{x_{i1}…x_{ik} ∈ R^k} 2^(−l(x_{i1}…x_{ik}))
which equals ∑_{i=1}^{k·l_max} n(i)·2^(−i) ≤ ∑_{i=1}^{k·l_max} 2^i·2^(−i) = k·l_max, where n(i) is the number of k-tuples whose code has length i (unique decodability gives n(i) ≤ 2^i)
now ∑_{i=1}^{n} 2^(−l_i) ≤ lim_{k→∞} (k·l_max)^(1/k) = 1
also, R can be infinite
to get the prefix code, sort the lengths l_1 ≤ l_2 ≤ … and take as the i-th codeword the first l_i bits of the binary expansion of ∑_{j<i} 2^(−l_j); the Kraft inequality guarantees that no codeword is a prefix of another (this is essentially the arithmetic-coding construction)
theorem: let C be a uniquely decodable code for a random variable X, then H(X)≤L(C)
L(C) … average codeword length
L(C) = ∑_{x∈R} p(x)·|C(x)|
proof: by Gibbs inequality, we use q_i = 2^(−l_i) / ∑_j 2^(−l_j) as the second distribution
entropy of a discrete random vector
H(X) = n·H(X_i)
for independent components with the same probability distribution
theorem: an arbitrary optimal prefix code C for a random variable X satisfies H(X)≤L(C)<H(X)+1
proof of the upper bound: put l_i = ⌈−log p_i⌉, then use Kraft-McMillan
this also holds for Huffman code (if we consider the entropy of one symbol)
analysis of arithmetic coding
H(X)≤L(C)<H(X)+2
F(x) = 0.y_1y_2… is the midpoint of the interval for x ∈ R (F is a CDF)
then C(x) = y_1y_2…y_{l(x)} where l(x) = ⌈log(1/p(x))⌉ + 1
in the proof, we show that C is a prefix code
then, we use L(C)=∑p(x)⋅l(x) which is upper bounded by H(X)+2
here, we don't consider one symbol but the whole message → therefore, the result is much better than the Huffman code
the main problem of the Huffman code is that it cannot deal very well with skewed symbol probabilities, since every codeword must have an integral number of bits
problems
we assume that the encoded random variables are independent
we are working with a probabilistic model of our data, which is simplified too much
a better probabilistic model
we cannot use “unlimited” context – the probabilities would be too small
we will use the context of k symbols
according to an experimental estimate
the entropy of English is 1.3 bits per symbol
Context Methods
model of order i … makes use of context of length i
methods
fixed length contexts
combined – contexts of various lengths
complete – all contexts of lengths i,i−1,…,0
partial
static, adaptive
Prediction by Partial Matching (PPM)
Cleary, Witten, Moffat
uses arithmetic coding
we try to encode symbol s in context of length i
if the symbol has zero frequency in the current context, we output the ESC symbol and decrease the context length
assumption: every symbol has non-zero frequency for some (small) context
how to encode symbol s
we have read the context c of length i and the current symbol s
if f(s∣c)>0, encode s using f(∗∣c)
each context c has its own table of frequencies (for arithmetic encoding)
otherwise, output the code for ESC and try order i−1 context
exclusion principle
situation: symbol x occurs in context abc for the first time
we will encode it using second order model f(x∣bc)
there is symbol y such that f(y∣bc)>0 and also f(y∣abc)>0
by using ESC, we already signaled that we are not trying to encode any of the letters present in the third order model (such as y)
therefore, we can temporarily remove y from the second order model so that the compression is more efficient
there are various approaches to assign P(esc∣c)
for example, we can decrement frequencies of all symbols by one
data structure … trie (context tree)
where each node has an edge to its largest proper suffix, this simplifies adding of new leaves
it might not fit in the memory
to solve that, we can freeze the model (update frequencies in existing contexts, stop adding new contexts)
or we can rebuild the model using only the recent history (not the whole file)
we can rebuild the model either if the free memory size drops or if the relative compression ratio starts decreasing
advanced data structure: DAWG (Directed Acyclic Word Graph)
PPMII
“information inheritance”
uses heuristics
became part of RAR
PAQ
arithmetic coding over binary alphabet with sophisticated context modelling
using weighted average of probability estimates from various models
this algorithm has probably the best compression ratio
but it is very time and space consuming
Integer Coding
unary code α
α(1)=1
α(i+1)=0α(i)
clearly, |α(i)| = i
it is the optimal code for the probability distribution p(i) = 2^(−i)
binary code β
is not a prefix code!
we can use fixed length
optimal code for p(i) = 1/|A|
or we can gradually increase the codeword length as the dictionary size increases
Elias codes
binary code always starts with 1
reduced binary code β′ … without the first 1
γ′ … first, we encode the length of the β code using α, then we encode the β′
γ … we interleave the bits of α and β
δ … instead of α, we use γ to encode the length of β
we could construct other codes in a similar fashion but δ code suffices
for large numbers, it is shorter than γ
Elias codes are universal
observation: each probability distribution p_1 ≥ p_2 ≥ ⋯ ≥ p_n satisfies ∀i ∈ [n]: p_i ≤ 1/i
corollary: −log p_i ≥ −log(1/i) = log i
|γ(i)| = log i + Θ(log i)
|δ(i)| = log i + o(log i)
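A sketch of the α, γ and δ encoders following the definitions above (γ is shown in its non-interleaved γ′ form, which has the same codeword lengths):

```python
def alpha(i):                       # unary code: alpha(1)='1', alpha(i+1)='0'+alpha(i)
    return "0" * (i - 1) + "1"

def beta(i):                        # plain binary, always starts with 1
    return bin(i)[2:]

def gamma(i):                       # alpha(|beta(i)|) followed by beta'(i)
    b = beta(i)
    return alpha(len(b)) + b[1:]

def delta(i):                       # gamma(|beta(i)|) followed by beta'(i)
    b = beta(i)
    return gamma(len(b)) + b[1:]

for i in (1, 2, 9, 1000):
    print(i, alpha(i) if i < 20 else "-", gamma(i), delta(i))
# gamma(9) = '0001' + '001' = 0001001 (7 bits, roughly 2*log 9 + 1)
# delta(1000) has 16 bits, shorter than gamma(1000) with 19 bits
```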
Dictionary Methods
idea
repeated phrases stored in a dictionary
phrase occurrence in the text → pointer
problems
construction of the optimal dictionary is NP-hard → greedy algorithm
dictionary is needed for decoding → dynamic methods
LZ77
Jacob Ziv, Abraham Lempel
sliding window
search buffer
look-ahead buffer
search for the longest string S such that
S starts in the search buffer
S matches the prefix of the string in look-ahead buffer
S is encoded as a triple ⟨i,j,z⟩ where
i … distance of the beginning of S from the look-ahead buffer
j=∣S∣
z … the symbol following S in the look-ahead buffer
example
ac|cabracad|abrarr|ar
___\search/\lahead/__
→⟨7,4,r⟩
window slides j+1 symbols to the right
sliding window size 2^12–2^16 B, look-ahead buffer sizes are usually tens or hundreds of bytes
slow compression, fast decompression
typical application – single compression, multiple decompressions
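A simplified LZ77 encoder emitting ⟨i, j, z⟩ triples; the tiny buffer sizes and the linear search are for illustration only (real implementations use much larger windows and faster matching):

```python
def lz77_encode(data, search_size=8, lookahead_size=6):
    out, pos = [], 0
    while pos < len(data):
        start = max(0, pos - search_size)
        best_i, best_j = 0, 0
        for i in range(1, pos - start + 1):           # i = distance back from pos
            j = 0
            while (j < lookahead_size - 1 and pos + j < len(data) - 1
                   and data[pos - i + j] == data[pos + j]):
                j += 1                                # extend the match
            if j > best_j:
                best_i, best_j = i, j
        z = data[pos + best_j]                        # symbol following the match
        out.append((best_i, best_j, z))
        pos += best_j + 1                             # window slides j+1 symbols
    return out

print(lz77_encode("accabracadabrarrar"))
```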
LZSS
triple (i,j,z) replaced with (i,j) or literal z
bit indicator to distinguish one from another
if a pair of pointers requires the same space as p symbols, we use pairs to encode only phrases of length > p (shorter phrases are simply copied to the output)
sliding window → cyclic buffer
possible implementation
codeword 16b, ∣i∣=11, ∣j∣=5
8 bit indicators in 1 byte (one byte describes the types of 8 following items)
other variants
LZB (LZSS with more efficient encoding of (i,j) – using binary code with increasing length for i and γ code for j)
LZH (two-pass: LZSS, Huffman)
Deflate algorithm
by Phil Katz, originally for PKZIP 2
zip, gzip, jar, cab, png
LZSS + static Huffman coding
search buffer size 32 kB
look-ahead buffer size 258 B
match length 3..258 → 256 options
for a smaller match, it uses literals
input divided into blocks
3bit header
is it last block? (1b)
type of the block (2b)
3 block types
no compression
static Huffman code – compression with the fixed trees defined in the Deflate RFC
dynamic Huffman code based on input data statistics
the third block type (dynamic Huffman code)
starts with two Huffman trees
the first tree for literals and match lengths
0..255 for literals
256 = end-of-block
257..285 for match length (extra bits are used to distinguish between particular numbers)
the second tree for offset (again with extra bits)
hashing is used for string searching in search buffer
hash table stores strings of length 3
strings in linked lists sorted by their age
a parameter to limit the maximum list length
greedy strategy extension
first, look for a primary match
slide window one symbol to the right, find a secondary match
if better, encode as literal + secondary match
LZ77 disadvantages
limited outlook of the search window (longer window → better compression, more complex search)
limited length of the look-ahead buffer (match length limitation)
LZ78
no sliding window – an explicit dictionary instead (but the dictionary is not transmitted)
efficient data structure for storing the dictionary … trie
compression
start with an empty dictionary
read the longest prefix of the input which matches some phrase f in the dictionary (zero length is also possible – especially if the dictionary is empty)
output ⟨i,k(s)⟩
i points to f in the dictionary
k(s) is a codeword for symbol s which follows f in the input
insert new phrase fs into the dictionary
dictionary reconstruction is necessary when the dictionary space is exhausted
we can delete the whole dictionary
or delete only the phrases that occurred least frequently or have not occurred recently
we need to preserve the prefix property
if the dictionary contains S, it has to contain all prefixes of S
if compression ratio starts getting worse, we can delete the dictionary and start over
LZW
we initialize the dictionary by the alphabet
we encode the longest prefix present in the dictionary
then, we append the next character and add it to the dictionary
when decoding, it may happen that the item is not in the dictionary yet
because the encoder is one step ahead
so we use the last decoded string and append its first character
we store only the pointers (and the alphabet?)
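A sketch of LZW over a byte alphabet, including the decoder's handling of the "not in the dictionary yet" case described above (no dictionary size limit, for brevity):

```python
def lzw_encode(s):
    dictionary = {chr(i): i for i in range(256)}      # init with the alphabet
    out, phrase = [], ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch                              # keep extending the prefix
        else:
            out.append(dictionary[phrase])            # emit longest known prefix
            dictionary[phrase + ch] = len(dictionary) # add the new phrase
            phrase = ch
    if phrase:
        out.append(dictionary[phrase])
    return out

def lzw_decode(codes):
    dictionary = {i: chr(i) for i in range(256)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                         # not in the dictionary yet:
            entry = prev + prev[0]                    # the encoder is one step ahead
        out.append(entry)
        dictionary[len(dictionary)] = prev + entry[0] # mirror the encoder's insert
        prev = entry
    return "".join(out)

codes = lzw_encode("abababab")
print(codes, lzw_decode(codes))                       # round-trips back to the input
```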
LZC
compress 4.0 utility in Unix (LZC method is the implementation of LZW algorithm)
pointers into dictionary – binary code with increasing length (9–16 bits)
when the maximum size is reached, it continues with a static dictionary
when the compression ratio starts getting worse, it deletes the dictionary and starts over with the initial setting (a special symbol signals this to the decoder)
LZMW
idea: new phrase = concatenation of two last ones (→ phrase length increases faster)
dictionary full → LRU strategy (delete the phrases that have not occurred recently)
dictionary lacks prefix property – this complicates searching, backtracking is used
LZAP
idea: instead of ST add all substrings Sp where p is a prefix of T
larger dictionary → longer codewords, wider choice for phrase selection
faster search → backtracking is not needed
Lossless Image Compression
digital images – raster (bitmap), vector
color spaces
RGB
CMYK
HSV
YUV – for backwards compatibility of the TV broadcast
representation of bitmap images
monochrome
shades of gray
color
pixel = (r,g,b)
alpha channel
color palette
conversion table (pixel = pointer into the table)
types: grayscale (palette with shades of gray), pseudocolor (color palette), direct color (three color palettes, one for each part of the color – R/G/B)
GIF (Graphics Interchange Format)
usage
originally for image transmission over phone lines, suitable for WWW
sharp-edged line art with a limited number of colors → can be used for logos
small animations
not used for digital photography
color palette, max. 256 colors
supports “transparent” color
multiple images can be stored in one file
supports interlacing
rough image after 25–50 % of the data
we can stop the transmission if it's not the image we wanted
4-pass interlacing by lines
LZW-based compression (actually, it uses LZC)
pointers … binary code with increasing length
blocks … up to 255 B
we can increase the number of possible colors by stacking multiple images on top of another (and using transparent pixels)
PNG (Portable Network Graphics)
motivation: replace GIF with a format based on a non-patented method (LZW was patented by Sperry Corporation / Unisys)
color space – gray scale, true color, palette
alpha channel – transparency ∈[0,1]
supports only RGB, no other systems (it was designed for transferring images over the network)
compression has two phases
preprocessing – predict the value of the pixel, encode the difference between the real and predicted value (prediction based on neighboring pixels)
prediction types: none, sub (left pixel), up, average (average of sub and up), Paeth (more complicated formula)
each line may be processed with a different method
dictionary compression
deflate (LZ77)
supports interlacing
7-pass interlacing by blocks of pixels
single image only, no animation
Burrows-Wheeler Transform
BSTW
Bentley, Sleator, Tarjan, Wei
move to front heuristic to restructure the dictionary
we start with an empty dictionary D
for input word s
if s is in dictionary D at position i, we output γ(i) and move the i-th word in D to the front of D
otherwise, we output γ(|D|+1) followed by s itself, and insert s at the front of D
Burrows–Wheeler transform
we are looking for a permutation of string x such that the similar symbols are close to each other
matrix of cyclic shifts of x
sort rows lexicographically
the desired permutation is the last column
also, we need to store the number of the row where the original string lies in the matrix to be able to restore the original
if we used the first column, it would not be decodable
encoding
MTF (move to front) heuristic
for each symbol, its code would be the number of preceding symbols in the dictionary
then, move the symbol to the front of the dictionary
therefore, the number we use to encode the symbol matches the time that elapsed from the last encoding of that symbol (if it was encoded previously)
RLE … run length encoding
instead of n zeros, we can just encode the zero and length of the block
Huffman/arithmetic coding
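A sketch of the MTF step described above (the initial alphabet ordering is an assumption; in bzip2-like pipelines it is fixed to the byte values):

```python
def mtf_encode(s, alphabet):
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)            # number of symbols currently before ch
        out.append(i)
        table.insert(0, table.pop(i))  # move the symbol to the front
    return out

def mtf_decode(codes, alphabet):
    table = list(alphabet)
    out = []
    for i in codes:
        ch = table[i]
        out.append(ch)
        table.insert(0, table.pop(i))
    return "".join(out)

bwt_like = "nnbaaa"                     # runs of equal symbols, as after BWT
codes = mtf_encode(bwt_like, "abn")
print(codes, mtf_decode(codes, "abn"))  # [2, 0, 2, 2, 0, 0] nnbaaa
```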
decoding
how to obtain the original string from the last column and the number?
we sort the last column to get the first one
we can determine the penultimate symbols from the pairs of first and last – thanks to the cyclic shift
what if there are multiple lines that end with the same symbol? the lines are sorted lexicographically, so equal symbols keep the same relative order in the first and last columns – the k-th occurrence of a symbol in the last column corresponds to its k-th occurrence in the first column
matrix n×n is not necessary
we can use a suffix array
this would be equivalent to the matrix of rotations of x$, where x is the original string and $ is a sentinel character
why does BW transform work?
two same letters often share very similar right context
therefore they will be close in the rightmost column
(we are actually sorting the letters by their right context)
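A naive sketch of the transform and its inverse built directly from the matrix of rotations (fine for illustration; real implementations use suffix arrays as noted above):

```python
def bwt(s):
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))    # matrix of cyclic shifts
    last = "".join(row[-1] for row in rotations)            # last column
    return last, rotations.index(s)                         # + row of the original

def inverse_bwt(last, idx):
    n = len(last)
    table = [""] * n
    for _ in range(n):                 # repeatedly prepend the last column and sort
        table = sorted(last[i] + table[i] for i in range(n))
    return table[idx]

last, idx = bwt("abracadabra")
print(last, idx)                       # rdarcaaaabb 2
print(inverse_bwt(last, idx))          # abracadabra
```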
bzip2
Julian Seward
open source
has more efficient compression than gzip/zip
but bzip2 is slower
it does not support archiving
it can only compress single files
just like gzip
can be used to compress tar archives
compression
split the input into blocks of 100–900 kB (determined by an input parameter)
Burrows–Wheeler transform
MTF, RLE
arithmetic coding (patented) was replaced with Huffman coding
Lossy Compression
no standard corpora exist
how to measure the loss of information
mean squared error (MSE)
signal to noise ratio (SNR)
peak signal to noise ratio
Digitizing Sound
sound is a physical disturbance in a medium, it propagates as a pressure wave
Fourier theorem
sampling
Nyquist-Shannon sampling theorem
the sampling frequency should be at least twice the highest frequency we want to record
if the condition fails, it leads to aliasing (distortion)
there appear false frequency components that were not present in the original analog signal
for sampling frequency F_s and an original frequency f > F_s/2, the new false frequency will be f′ = F_s − f
example: the carriage (or stagecoach) wheels in the movie are spinning backwards
quantization
we have B bits available
we split the whole interval into 2^B subintervals
we want the largest signal to noise ratio that is possible
audio processing
input
amplifier
band-pass filter
sampler … sampling
A/D converter … quantization
RAM
output
RAM
D/A converter
band-pass filter
amplifier
Quantization
main idea
large set → smaller set
quantizer can be scalar or vector
quantization consists of two mappings – encoding and decoding
central limit theorem
if we collect n samples of random variable X, the distribution of the sample mean approaches N(EX, var(X)/n) for a sufficiently large n
problem formulation
source … random variable X with density f(x) (common assumption – f is symmetric about 0)
determine: number of intervals (levels), interval endpoints (decision boundaries), representative y_i for each interval (reconstruction levels)
goal: to minimize the quantization error (distortion) σ² … MSE
σ² = E[(x − Q(x))²]
the number of intervals (levels) may be fixed if we want fixed length codewords
uniform quantizer
uniform probability distribution
all intervals of equal length
two types
midrise quantizer – even number of levels, Q(0+ε) ≠ 0 (zero is a decision boundary)
midtread quantizer – odd number of levels, Q(0+ε)=0 (zero is in the middle of the central interval)
we assume that the range of X is bounded
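A sketch of the two variants with step size Δ (clipping of out-of-range inputs is omitted):

```python
import math

def midrise(x, delta):
    # even number of levels, zero is a decision boundary, Q(0+eps) = delta/2
    return (math.floor(x / delta) + 0.5) * delta

def midtread(x, delta):
    # odd number of levels, zero lies in the middle of the central interval
    return round(x / delta) * delta

for x in (0.1, 0.6, -0.3):
    print(x, midrise(x, 0.5), midtread(x, 0.5))
# 0.1 -> 0.25 / 0.0,  0.6 -> 0.75 / 0.5,  -0.3 -> -0.25 / -0.5
```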
forward adaptive quantizer (offline)
split the input into blocks
analyze each separately, set parameters accordingly
parameters are transmitted as side information
backward adaptive quantizer (online)
no side info necessary
Jayant quantizer
multipliers
intervals can shrink and expand depending on the input
for each input value, we get the corresponding code c and we multiply Δ by the multiplier Mc
the intervals automatically adapt to the input data
3-bit Jayant quantizer
table of intervals and multipliers

code   interval        multiplier
0      (0, Δ)          0.8
1      (Δ, 2Δ)         0.9
2      (2Δ, 3Δ)        1
3      (3Δ, ∞)         1.2
4      (−Δ, 0)         0.8
5      (−2Δ, −Δ)       0.9
6      (−3Δ, −2Δ)      1
7      (−∞, −3Δ)       1.2
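A sketch of the backward-adaptive step-size update with the multipliers from the table above (the Δ limits are assumptions to keep Δ bounded):

```python
M = [0.8, 0.9, 1.0, 1.2]                       # multipliers for levels 0..3

def jayant_encode(samples, delta=1.0, dmin=0.01, dmax=100.0):
    codes = []
    for x in samples:
        level = min(int(abs(x) / delta), 3)    # which interval (0..3)
        code = level if x >= 0 else level + 4  # codes 4..7 are the negative side
        codes.append(code)
        # adapt: expand the step after outer levels, shrink it after inner ones
        delta = min(max(delta * M[level], dmin), dmax)
    return codes

print(jayant_encode([0.2, 0.3, 3.5, 4.0, 0.1]))   # [0, 0, 3, 3, 0]
```

The decoder applies the same multipliers to its own copy of Δ, which is why no side information is needed.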
nonuniform quantization
we set y_j to the average value in the interval
y_j = (∫_{b_{j−1}}^{b_j} x·f(x) dx) / (∫_{b_{j−1}}^{b_j} f(x) dx)
we set the interval bounds as the averages of neighboring values
b_j = (y_j + y_{j+1}) / 2
Lloyd-Max algorithm
M-level symmetric quantizer
we set b_0 = 0
we somehow approximate y_1
then we compute b_1 from the equation for y_j stated above
y_2 = 2b_1 − y_1 (similarly b_2, y_3, …)
note: the b bounds and y values are symmetric about 0, y_−1 = −y_1 etc.
computation depends on the initial estimate y_1
we want to correct our wrong estimate
we know that b_{M/2} = max input value
we can compare the resulting y_{M/2} with the value y*_{M/2} computed using the quotient of integrals
if they differ too much, we correct the initial estimate of y_1 and repeat the whole procedure
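A sample-based sketch of the underlying centroid/midpoint iteration (using training samples instead of the density f; the quantile initialization and the stopping rule are my assumptions):

```python
import numpy as np

def lloyd_quantizer(samples, m, iters=100, eps=1e-8):
    samples = np.asarray(samples, dtype=float)
    y = np.quantile(samples, (np.arange(m) + 0.5) / m)  # initial representatives
    prev_mse = np.inf
    for _ in range(iters):
        b = (y[:-1] + y[1:]) / 2                  # boundaries = midpoints
        idx = np.searchsorted(b, samples)         # assign samples to intervals
        for j in range(m):
            cell = samples[idx == j]
            if cell.size:                         # representatives = centroids
                y[j] = cell.mean()
        mse = np.mean((samples - y[idx]) ** 2)
        if prev_mse - mse < eps:
            break
        prev_mse = mse
    return y, b

rng = np.random.default_rng(0)
y, b = lloyd_quantizer(rng.normal(size=10_000), m=4)
print(np.round(y, 3), np.round(b, 3))   # representatives cluster where f(x) is large
```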
companding
compressing + expanding
first we stretch the high probability regions close to the origin & compress the regions with low probability
then apply uniform quantization
expand afterwards (transform quantized values to the original scale)
μ-law
telephony, USA, Japan
μ = 255
there is c(x) which compresses the values and c^(−1)(x) which expands them (the constant μ appears in the formulas)
A-law
telephony, Europe
A=87.6
there are different c(x) and c^(−1)(x) than in μ-law
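A sketch of the μ-law compressor/expander pair mentioned above, for inputs normalized to [−1, 1] and μ = 255:

```python
import math

MU = 255.0

def mu_compress(x):                 # |x| <= 1
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):                   # inverse mapping
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

for x in (0.01, 0.1, -0.5):
    y = mu_compress(x)
    print(x, round(y, 3), round(mu_expand(y), 3))
# small amplitudes are stretched before uniform quantization, then expanded back
```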
G.711
ITU-T standard for audio companding
sampling frequency 8 kHz
sample length is 14b (μ-law) or 13b (A-law)
μ-law uses midtread quantizer
A-law uses midrise quantizer
comparison
μ … higher dynamic range, noise reduction in blocks of silence (thanks to midtread)
A … lower relative distortion of small values
application: VoIP
Vector Quantization
we split the file (or group the individual samples) into vectors
L consecutive samples of speech
block of L pixels of an image (a rectangle/square)
→ vector of dimension L
idea: codebook
code-vectors selected to be represented, each one has a binary index
encoder finds the closest vector in the codebook (using Euclidean distance or MSE)
decoder retrieves the code-vector given its binary index
codebook size K → K-level vector quantizer
we can design the codebook using k-means algorithm (or we can use LBG as described below)
LBG algorithm
Linde, Buzo, Gray
we have codebook C, training set T, threshold ε
we can compute quantization regions – assign each input vector to its code-vector
we can compute the average distortion (MSE) in each step of the algorithm
if it is less than the threshold, we can return the codebook
otherwise, we need to update it (we set each code-vector to the average of all the input vectors in its group – like in k-means)
initialization
LBG splitting technique
start with a single code-vector – the average of all vectors in the training set
in each step we generate a perturbation vector, add it to each code-vector and introduce those new vectors to the codebook → the codebook size doubles
then, we run LBG with the codebook (repeatedly update the code-vectors until the distortion drops below certain level)
we repeat this until the desired number of levels is reached
random init (Hilbert)
initialize C randomly several times
select the alternative with lowest distortion
pairwise nearest neighbor (Equitz)
start with C=T
in each step, replace two code-vectors by their mean
goal: smallest increase in distortion
it is possible that during the algorithm, we are left with a code-vector that does not have any corresponding training set vectors (“empty cell problem”) → we will replace it with a new code-vector chosen randomly from the cluster with the highest population of training vectors or the highest associated distortion
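A compact sketch of the LBG loop with the splitting initialization (the perturbation size and the stopping rule are assumptions):

```python
import numpy as np

def lbg(training, k, eps=1e-4, perturb=0.01):
    """training: (N, L) array of training vectors; k: codebook size (power of two)."""
    delta = perturb * training.std(axis=0)            # small perturbation vector
    codebook = training.mean(axis=0, keepdims=True)   # start with a single code-vector
    while len(codebook) < k:
        codebook = np.concatenate([codebook + delta, codebook - delta])  # split
        prev = np.inf
        while True:                                   # plain LBG / k-means update
            d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            assign = d.argmin(axis=1)                 # nearest code-vector per input
            distortion = d[np.arange(len(training)), assign].mean()
            for j in range(len(codebook)):
                cell = training[assign == j]
                if len(cell):                         # centroid of the region
                    codebook[j] = cell.mean(axis=0)
            if prev - distortion < eps * distortion:
                break
            prev = distortion
        # (the empty-cell problem mentioned above is ignored in this sketch)
    return codebook

rng = np.random.default_rng(1)
print(lbg(rng.normal(size=(2000, 2)), k=4).round(2))  # 4 code-vectors around the origin
```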
tree-structured vector quantization (TSVQ)
when quantizing a vector, we don't want to compare it to each vector in the codebook
we divide the codebook into two groups such that there exist two test vectors where every vector from the first group is closer to the first test vector and every vector from the second group is closer to the second one
we repeat the process → binary tree
pros: we need a lower number of comparisons, we can build a binary string (code) as we progress down the tree
cons: increased storage requirement (we need to store the test vectors, the codebook does not suffice), possible increase in distortion
we may use LBG splitting technique to get test vectors
we need to modify the technique, we will always run LBG only with two code-vectors and the corresponding subset of the test set
pruned TSVQ
we can remove some subtrees to decrease the average rate (codeword length) – but this will increase average distortion
also, we can split some nodes
this way, we get binary codewords of variable length, the codewords correspond to paths in the decision tree → prefix code
lattice vector quantizers
idea: regular arrangement of code-vectors in space
quantization regions should tile the output space of the source
D lattices
D2 lattice (in two-dimensional space)
a·v_1 + b·v_2 = a·(0,2) + b·(1,1) = (b, 2a+b)
squares
the sum of the coordinates is always even
A lattices
A2 lattice is hexagonal
G.719
ITU-T standard for wideband audio coding
modified discrete cosine transform (MDCT)
vector quantizer based on D8 lattice
classified vector quantization
idea: divide source vectors into classes with different properties, perform quantization separately
example: edge/nonedge regions (in an image)
large variance of pixels ⟹ presence of an edge (will use edge codebook for this vector)
output has to contain the codebook label and the vector label
applications
audio: G.719, CELP, Ogg Vorbis, TwinVQ
video: QuickTime, VQA
Differential Encoding
idea: consecutive samples are quite similar (that holds especially for speech and images)
if we quantize the differences directly, the error will accumulate
instead, we quantize the difference between the current value and a prediction computed from previously reconstructed (decoded) values
more generally, the prediction can be a function of several previous reconstructed values, not just the last one
this prediction function can be called p_n
DPCM (differential pulse-code modulation)
C. Chapin Cutler, Bell Labs
goal: minimize MSE (error between the real value and pn)
linear predictor p_n = ∑_{i=1}^{N} a_i·x̂_{n−i}
where x̂_k is the k-th reconstructed value
problem: find a_j which minimize MSE
set the derivative w.r.t. a_j equal to zero
∀j: −2·E[(x_n − ∑_i a_i·x_{n−i})·x_{n−j}] = 0
E[(x_n − ∑_i a_i·x_{n−i})·x_{n−j}] = 0
E[x_n·x_{n−j}] = ∑_i a_i·E[x_{n−i}·x_{n−j}]
we can rewrite this using the autocorrelation R_XX(k) = E[x_n·x_{n+k}]
we get Wiener-Hopf equations (matrix multiplication)
we will estimate R_XX(k) – nothing complicated, it's just an average of products
we assume that the process is stationary
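A numeric sketch: estimate R_XX(k) from the data and solve the Wiener-Hopf system for the predictor coefficients (the predictor order and the synthetic AR(2) test signal are arbitrary choices):

```python
import numpy as np

def predictor_coeffs(x, N):
    x = np.asarray(x, dtype=float)
    # autocorrelation estimate R_XX(k) as an average of products
    R = np.array([np.mean(x[:len(x) - k] * x[k:]) for k in range(N + 1)])
    A = np.array([[R[abs(i - j)] for j in range(N)] for i in range(N)])  # Toeplitz
    r = R[1:N + 1]
    return np.linalg.solve(A, r)      # solve the Wiener-Hopf equations

# synthetic stationary signal: x_n = 0.75 x_{n-1} - 0.5 x_{n-2} + noise
rng = np.random.default_rng(0)
x = np.zeros(5000)
for n in range(2, len(x)):
    x[n] = 0.75 * x[n - 1] - 0.5 * x[n - 2] + rng.normal(scale=0.1)
print(predictor_coeffs(x, 2).round(2))   # should be close to [0.75, -0.5]
```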
ADPCM
adaptive DPCM
forward adaptive prediction (DPCM-APF)
estimate autocorrelation for blocks
we need to transfer coefficients and buffer the input (group it into blocks)
backward adaptive prediction (DPCM-APB)
no need to transfer additional info
we will use stochastic gradient descent during coding (LMS = least mean squares algorithm)
A^(n+1) = A^(n) + α·d̂_n·X̂_{n−1}
G.726
ITU-T standard for speech compression based on ADPCM
predictor uses two latest predictions and six latest differences (between predicted values)
simplified LMS algorithm
backward adaptive quantizer (modified Jayant)
delta modulation
speech coding
very simple form of DPCM
2-level quantizer (we can only encode +Δ or −Δ)
we need really high sampling frequency
problems
constant signal (we encode it as a sequence of ↑↓↑↓↑↓↑↓)
“granular regions”
a slope steeper than delta
“slope overload region”
solution: adapt Δ to the characteristics of the input
constant factor adaptive delta modulation (CFDM)
objective: increase delta in overload regions, decrease it in granular regions
how to identify the regions?
use a history of one sample (or more)
application: space shuttle ↔ ground terminal (CFDM with 7-sample memory)
Transform Coding
scheme: group individual samples into blocks, transform each block, quantization, coding
for images, we have to make a two-dimensional transform
it is better to transform each dimension separately = separable transform
separable transform using matrix A
transformed data Θ = A·X·A^T
original data X = A^(−1)·Θ·(A^(−1))^T
orthogonal transform … we use orthogonal transform matrix
transformed data Θ = A·X·A^T
original data X = A^T·Θ·A
moreover ∑_i θ_i² = ∑_i x_i²
gain G_T (of a transform T) = ratio of the arithmetic to the geometric mean of the variances σ_i²
σ_i² … variance of coefficient θ_i
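A sketch of the separable orthogonal transform Θ = A·X·A^T using the orthonormal DCT-II matrix as A (the 8×8 block is random data, just to check the reconstruction and energy-preservation identities):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix: first row constant, then rows of increasing variation."""
    A = np.zeros((n, n))
    for i in range(n):
        c = np.sqrt(1 / n) if i == 0 else np.sqrt(2 / n)
        for j in range(n):
            A[i, j] = c * np.cos((2 * j + 1) * i * np.pi / (2 * n))
    return A

A = dct_matrix(8)
X = np.random.default_rng(0).normal(size=(8, 8))    # one 8x8 block of "pixels"
Theta = A @ X @ A.T                                 # forward separable transform
print(np.allclose(A @ A.T, np.eye(8)))              # True: A is orthogonal
print(np.allclose(A.T @ Theta @ A, X))              # True: perfect reconstruction
print(np.allclose((Theta**2).sum(), (X**2).sum()))  # True: energy is preserved
```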
Karhunen-Loève transform
minimizes the geometric mean of the variance of the transform coefficients
transform matrix consists of eigenvectors of the autocorrelation matrix R
problem: non-stationary input → matrix R changes with time → KLT needs to be recomputed
discrete cosine transform
DCT is the “cosine component” of DFT
redistributes information so that we can quantize it more easily
why is DCT better than DFT?
FT requires a periodic signal
periodic extension of non-periodic signal would introduce sharp discontinuities and non-zero high-frequency coefficients
DCT may be obtained from DFT (start with the original non-periodic signal consisting of N points, mirror it to get 2N points, apply DFT, take the first N points)
DCT transform matrix – 1st row is constant vector, following rows consist of vectors with increasing variation
these are the portions of the original “signal” (data matrix X) that the coefficients in the Θ matrix correspond to
these can be called “basis matrices”
basis matrix corresponding to θ_11 is a constant matrix ⟹ θ_11 is some multiple of the average value in the data matrix
θ_11 … DC coefficient
other θ_ij … AC coefficients
gain G_DCT is close to the optimum value for Markov sources with high correlation coefficient ρ = E[x_n·x_{n+1}] / E[x_n²]
discrete sine transform
gain G_DST is close to the optimum value for Markov sources with low correlation coefficient ρ = E[x_n·x_{n+1}] / E[x_n²]
Hadamard matrix
square matrix of order n such that H·H^T = n·I
H_1 = (1)
H_{2n} = ( H_n   H_n
           H_n  −H_n )
discrete Walsh-Hadamard transform
DWHT transform matrix is a rearrangement of H_n
multiply H_n by the normalizing factor 1/√n (so the matrix becomes orthogonal)
reorder the rows by increasing number of sign changes
pro: speed
con: G_DWHT ≪ G_DCT
quantization of transform coefficients
zonal sampling
coefficients with higher expected variance are assigned more bits for sampling
pro: simplicity
con: bit allocations based on average rules → local variations (e.g. sharp edges on plain background) might not be reconstructed properly
quantization and coding
threshold coding
based on the threshold, we decide which coefficients to keep (or discard)
we have to store the numbers of discarded coefficients
threshold is not decided a priori
Chen & Pratt suggest zigzag scan
tail of the scan should consist of zeros (generally, the higher-order coefficients have smaller amplitude) → for fixed length of the block, we don't need to specify the number of zeros, just send end-of-block (EOB) signal
note: DC coefficients correspond to contours, higher-order coefficients correspond to details of the image → we can get rid of them
JPEG (Joint Photographic Experts Group)
JPEG standard specifies a codec
lossy / lossless compression
progressive transmission scheme
file format
JPEG/JFIF – minimal format version
JPEG/Exif – digital photography
compression scheme
color space transform
RGB → YCbCr
human eye is more sensitive to details in Y (luminance) component
reduction of colors (chroma subsampling) in color components
4:4:4 (no reduction) or 4:2:2 or 4:2:0
4:2:2 or 4:2:0 → colors have lower resolution than the luminance