# Lecture

- https://ktiml.mff.cuni.cz/~gregor/ds1/
- http://mj.ucw.cz/vyuka/dsnotes/

## Introduction

- data structures: data + operations (queries, updates)
  - complexity depends on the specific implementation
  - static or dynamic (does the data structure support updates?)
- queue (FIFO)
  - operations: enqueue(x), dequeue, isempty
  - implementations
    - (singly) linked list with two pointers (head, tail)
      - enqueue $\Theta(1)$
      - dequeue $\Theta(1)$
      - isempty $\Theta(1)$
    - (rotating) array with two pointers
      - it is useful to let the tail pointer point at the first free cell
      - the queue can then store at most one element fewer than the capacity of the array
      - again, constant complexities
  - note: when we use $\Theta$, we should make it clear whether it refers to the given implementation or to the operation in general (for any possible implementation)
- set $S\subseteq\mathcal U$
  - operations
    - insert(x)
    - delete(x)
    - find(x)
    - build($x_1,\dots,x_n$) for distinct elements (can be faster than running insert $n$ times)
  - implementations
    - unsorted linked list
      - insert … $O(1)$ if we may assume $x\notin S$; without this assumption $O(n)$ (we must first check that $x$ is not present)
      - delete, find … $O(n)$
      - build … $O(n)$ (which is faster than repeating insert $n$ times)
    - (rotating) array
      - $O(n)$ all the way
    - sorted array
      - we need some linear order on $\mathcal U$
      - find … binary search, $O(\log n)$
      - insert … $O(n)$ (we need to shift the elements)
      - delete … $O(n)$ again
      - build … $\Theta(n\log n)$
    - balanced BST (binary search tree)
      - search-tree property: elements in the left subtree are strictly smaller than the current node, and similarly for the right subtree
      - insert, delete, find … $O(\log n)$
      - build … $O(n\log n)$
      - note: we could write thetas everywhere
    - hashing
      - we randomly choose a hashing function satisfying some statistical properties
      - insert, delete, find in $O(1)$, but it is *expected* time
      - build in $O(n)$
- dictionary – like the set, but for each item we also store a pointer to its data
- multiset – set with counters (→ dictionary where the data is the number of occurrences)
- ordered set
  - new operations: min, max
    - linked list $O(n)$, array $O(n)$, sorted array $O(1)$, BST $O(\log n)$, hashing $O(n)$ (we need to go through the entire table)
  - pred(x), succ(x)
    - linked list $O(n)$, array $O(n)$, sorted array $O(\log n)$, BST $O(\log n)$, hashing $O(n)$
- RAM (random access machine)
  - potentially infinite memory organized into cells
    - each cell is addressed by a natural number
    - we keep natural numbers inside the cells (we need to set an upper bound!)
    - we can access an arbitrary cell
    - numbers in cells are $\leq p(\mathrm{len}(I),\mathrm{max}(I))$
      - $p$ … a polynomial
      - $I$ … instance of the problem (= the input)
  - instructions: store, arithmetical operations, logical operations, goto, memory allocator (allocates consecutive space)
  - time … number of instructions
  - space … number of cells used (including allocated empty space)

## Amortized Complexity

- stretchable array
  - append usually takes constant time
  - sometimes we need to reallocate the array (to a larger one)
    - takes linear time
    - worst-case complexity $O(n)$
    - but such a reallocation does not happen often
  - let's consider $n$ operations
    - we consider $n$ such that $2^{k-1}\lt n\leq 2^{k}$
    - we start with capacity 1 and double it on each reallocation
    - total cost of reallocations (aggregated)
      - $2^0+2^1+\dots+2^{k-1}=2^k-1=\Theta(n)$
    - amortized cost $\Theta(1)$
    - “aggregation method”
- flexible array
  - stack
    - append(x), removelast
  - shrink to $C/2$ if $n\lt C/2$?
    - this is problematic, we would get linear amortized complexity (alternating append/removelast right at the threshold reallocates on every operation)
    - we should use $n\lt C/4$ instead
  - amortized analysis
    - average over operations in the worst-case scenario
    - append – double the capacity if the array is full … costs $O(C)$ in this case
    - removelast – halve the capacity if the array becomes shorter than $C/4$ … costs $O(C)$
  - we consider blocks of operations
    - block boundaries = growing/shrinking operations
    - “how many elements does the array contain at the beginning and at the end of
the block?”
      - if growing: $C$ elements at the end, $C/2$ at the beginning
      - if shrinking: $C/4$ elements at the end, $C/2$ at the beginning
      - so at the beginning of the block, the array always contains $C/2$ elements
      - at the end of the block, the array contains $C/4$ or $C$ elements
      - → there are at least $C/4$ operations in each block
      - → amortized $O(1)$ per operation
      - “accounting method” – we account the cost to the operations
- binary counter, $\ell$ bits
  - *increment* operation
    - how many bits do we flip?
    - worst case: $O(\ell)$ bit flips
  - aggregation method
    - observation: bit $i$ changes once every $2^{i-1}$ operations
    - → number of bit flips $=\sum_{i=1}^\ell\lfloor\frac n{2^{i-1}}\rfloor\leq n\sum_{i=0}^\infty\frac1{2^i}=2n$
  - coin method
    - 1 coin … 1 bit flip
    - 000 → 001
      - we pay 1 coin for flipping 0→1
      - we store 1 coin on the 1
    - 001 → 010
      - we use the stored coin – it pays for the 1→0 flip
      - we pay 1 coin for 0→1
      - we store 1 coin on the 1 again
    - → 2 coins per increment suffice
    - so we get $O(1)$ amortized
- potential method (generalization of the coin method)
  - $m$ operations, real costs $R_i$
    - we want to compute upper estimates on the costs
  - amortized cost $A_i$ … our choice
  - we want $\sum_{j=1}^mA_j-\sum_{j=1}^mR_j\geq 0$
    - “never owe time”
    - this should hold not only for $m$ but for any $i\leq m$
  - we define the potential $\Phi_i:=\sum_{j=1}^iA_j-\sum_{j=1}^iR_j$
    - $\Phi_0=0$ by definition
    - $\Delta\Phi_i=\Phi_i-\Phi_{i-1}=A_i-R_i$
      - $\gt 0$ → saving time
      - $=0$ → paying exactly
      - $\lt 0$ → spending saved time
  - back to the bit flips
    - $\Phi_i=$ number of coins saved = number of ones
    - $A_i=2$
    - $R_i=1+t$
      - $t$ … number of trailing ones
    - $\Delta\Phi_i=1-t$
  - flexible array
    - $\Phi_i=$ number of operations since the last reallocation
    - $R_i=1$ (without reallocation) or $1+\Theta(n)$
    - $A_i=2$
    - we may need to rescale this
  - different approach
    - we choose $\Phi_i\geq 0$
    - $A_i:=R_i+\Delta\Phi_i$
    - so $\sum_{i=1}^m A_i=\sum_{i=1}^mR_i+\sum_{i=1}^m(\Phi_i-\Phi_{i-1})$
      - the last sum is telescopic
    - $\sum_{i=1}^m A_i=\sum_{i=1}^mR_i+\Phi_m-\Phi_0$
      - $\Phi_m-\Phi_0$ should be $\geq 0$
      - we can have a non-zero $\Phi_0$, but this difference has to be non-negative
  - steps
    0. choose the unit of cost
    1. estimate the real costs $R_i$ from above
    2. choose a potential $\Phi$
    3. show $\Phi_i\geq 0$
    4. aim for $A_i=R_i+\Delta\Phi_i$ to be small
- lazily balanced trees
  - notation
    - $n$ … number of items stored
    - $v$ … node
    - $T(v)$ … subtree rooted at $v$
    - $\ell(v),r(v)$ … left/right child of $v$
    - $s(v)$ … size = number of nodes in $T(v)$, including $v$
  - def: a BST is perfectly balanced if for every node $v$ it holds that $|s(\ell(v))-s(r(v))|\leq 1$
    - the ratio is approximately 1 : 1
  - def: a BST is balanced if for every $v$ and every child $c$ of $v$ it holds that $s(c)\leq\frac23s(v)$
    - the ratio is between 1 : 2 and 2 : 1
  - lemma: any balanced BST has depth $O(\log n)$
    - consider the longest path from the root to a leaf
    - in each step from the root towards the leaf, the size drops to at most 2/3 of the preceding size
    - size of the root … $n$, size of the leaf … 1
    - $1\leq(\frac23)^dn$
    - $d\leq\log_{2/3}(1/n)=\log_{3/2}n=O(\log n)$
  - *insert* operation
    - *find*, add a leaf, update sizes (we keep track of the sizes $s(v)$)
    - if the tree is no longer balanced, *rebuild* the subtree of the highest unbalanced node
    - observation: for $n$ sorted items, *build* takes linear time
      - select the middle item as the root and split the rest into the two subtrees
      - proceed recursively
  - $\Phi:=\sum_v\varphi(v)$
    - $\varphi(v):=\begin{cases}|s(\ell(v))-s(r(v))|&\text{if at least }2\\0&\text{otherwise}\end{cases}$
    - the clamping ensures that a perfectly balanced tree has zero potential
  - cost of *insert*
    - no rebuild: $A=R+\Delta\Phi=O(\log n)+O(\log n)=O(\log n)$
      - $\Delta\Phi=O(\log n)$ because $\Delta\varphi\leq 2$ for each visited node (usually $\Delta\varphi=1$, but there may be nodes with $\varphi$ hopping from 0 to 2 due to the clamping)
    - rebuild at $v$
      - the invariant was broken for $v$ and its child $c$
      - WLOG $s(\ell(v))\gt\frac23 s(v)\implies s(r(v))\lt\frac13 s(v)\implies\varphi(v)\gt\frac13 s(v)$
      - after the rebuild, this contribution and all the contributions inside the subtree become zero
      - contributions elsewhere stay the same
      - amortized cost of *rebuild*: $O(s(v))+c\cdot\Delta\Phi\leq 0$ for a sufficiently large constant $c$ (we scale the potential by $c$)
- splay trees
  - splay(x) … pull node $x$ to the root
    - preferably using double rotations (zigzig, zigzag)
    - possibly with one single rotation (zig) at the end
  - zig
    - $y$ has child $x$; zig(x) makes $y$ a child of $x$
  - zigzig, zigzag
    - $z$ has child $y$, $y$ has child $x$
    - zigzig(x)/zigzag(x) moves $x$ above both $y$ and $z$
    - zigzig is used if $x\lt y\lt z$ or $z\lt y\lt x$
    - zigzag is used if $y\lt x\lt z$ or $z\lt x\lt y$
  - splay performs a sequence of these operations until $x$ becomes the root
  - find(x)
    - search(x)
    - splay(lowest visited node)
  - how to choose the potential
    - $\Phi:=\sum_v r(v)$
    - rank $r(v)=\log s(v)$
    - $1\leq s(v)\leq n$, hence $0\leq r(v)\leq\log n$
    - $0\leq\Phi\leq n\log n$
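
As a small illustration of the rotating-array queue from the introduction (a sketch in Python; the class and method names are my own, not from the lecture):

```python
class RingQueue:
    """Queue in a fixed-size 'rotating' array with two pointers.

    The tail pointer points at the first free cell, so the queue can hold
    at most capacity - 1 elements: head == tail must unambiguously mean
    "empty", not "full".
    """

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0  # index of the oldest element
        self.tail = 0  # index of the first free cell

    def isempty(self):
        return self.head == self.tail

    def enqueue(self, x):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:  # only capacity - 1 cells are usable
            raise OverflowError("queue is full")
        self.buf[self.tail] = x
        self.tail = nxt

    def dequeue(self):
        if self.isempty():
            raise IndexError("queue is empty")
        x = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return x
```

Each operation touches a constant number of cells, which is the $\Theta(1)$ bound claimed above.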
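
The flexible-array analysis can also be checked experimentally. The sketch below (my own rendering, not the lecture's code) counts one unit of cost per element written, including copies during reallocation, so `cost / #operations` approximates the amortized cost; it grows by doubling and shrinks to $C/2$ once $n\lt C/4$, as above.

```python
class FlexArray:
    """Stack in a manually managed array: double the capacity when full,
    halve it when n < C/4.  self.cost counts element writes (including
    copies during reallocation) as a stand-in for real time."""

    def __init__(self):
        self.cap = 1
        self.n = 0
        self.buf = [None]
        self.cost = 0

    def _realloc(self, newcap):
        new = [None] * newcap
        for i in range(self.n):   # copying is the expensive part
            new[i] = self.buf[i]
        self.cost += self.n
        self.buf, self.cap = new, newcap

    def append(self, x):
        if self.n == self.cap:
            self._realloc(2 * self.cap)
        self.buf[self.n] = x
        self.n += 1
        self.cost += 1

    def removelast(self):
        if self.n == 0:
            raise IndexError("empty")
        self.n -= 1
        x = self.buf[self.n]
        self.cost += 1
        if self.cap > 1 and self.n < self.cap // 4:
            self._realloc(self.cap // 2)
        return x
```

Running any long mixed sequence of operations keeps the total cost within a small constant factor of the number of operations, matching the $O(1)$ amortized bound from the block argument.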
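
For the binary counter, both the aggregation bound and the potential-method identity can be verified directly. In the sketch below (assumptions mine: 16-bit counter, 1000 increments), $\Phi=$ number of ones, and every increment satisfies $R+\Delta\Phi=(1+t)+(1-t)=2=A$:

```python
def increment(bits):
    """Increment a binary counter (bits[0] = least significant bit).
    Returns the real cost R = number of bit flips = 1 + #trailing ones."""
    flips = 0
    i = 0
    while i < len(bits) and bits[i] == 1:
        bits[i] = 0            # each trailing 1 flips to 0
        flips += 1
        i += 1
    if i < len(bits):          # unless the counter overflows,
        bits[i] = 1            # exactly one 0 flips to 1
        flips += 1
    return flips

bits = [0] * 16
total = 0
for _ in range(1000):
    phi_before = sum(bits)     # potential = coins stored on the 1-bits
    r = increment(bits)
    assert r + (sum(bits) - phi_before) == 2   # A = R + delta(Phi) = 2
    total += r
```

The aggregate cost of the 1000 increments stays below $2n=2000$ flips, as the sum $\sum_i\lfloor n/2^{i-1}\rfloor\leq 2n$ predicts.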
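
The lazily balanced tree can be sketched as follows (my own Python rendering of the scheme above, not the lecture's code): a plain BST insert, sizes maintained along the search path, and, if some node ends up with a child bigger than $\frac23$ of it, a linear-time rebuild of the subtree of the highest such node into a perfectly balanced one. Only nodes on the search path can start violating the invariant, since only their sizes grew.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.size = key, None, None, 1

def size(v):
    return v.size if v is not None else 0

def flatten(v, out):
    """In-order walk: collects the keys of T(v) in sorted order."""
    if v is not None:
        flatten(v.left, out)
        out.append(v.key)
        flatten(v.right, out)

def build(keys, lo, hi):
    """Perfectly balanced BST from sorted keys[lo:hi] in linear time:
    the middle key becomes the root, the halves are built recursively."""
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    v = Node(keys[mid])
    v.left = build(keys, lo, mid)
    v.right = build(keys, mid + 1, hi)
    v.size = 1 + size(v.left) + size(v.right)
    return v

def insert(root, key):
    """BST insert, then rebuild at the highest node v on the path whose
    path-child c violates s(c) <= (2/3) * s(v)."""
    path, v = [], root
    while v is not None:
        path.append(v)
        v = v.left if key < v.key else v.right
    leaf = Node(key)
    if not path:
        return leaf
    if key < path[-1].key:
        path[-1].left = leaf
    else:
        path[-1].right = leaf
    for v in path:
        v.size += 1
    path.append(leaf)
    for i in range(len(path) - 1):
        v, c = path[i], path[i + 1]
        if 3 * c.size > 2 * v.size:          # highest unbalanced node
            keys = []
            flatten(v, keys)
            sub = build(keys, 0, len(keys))
            if i == 0:
                return sub                   # rebuilt the whole tree
            p = path[i - 1]
            if p.left is v:
                p.left = sub
            else:
                p.right = sub
            return root
    return root
```

Inserting keys in sorted order, the worst case for a plain BST, keeps this tree balanced after every operation.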
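
One common way to implement splay is the compact recursive version sketched below (an illustration, not the lecture's code): the zigzig and zigzag cases appear as the two nested-rotation branches, and possibly one final single rotation (zig) remains.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(y):          # zig: left child of y comes above y
    x = y.left
    y.left, x.right = x.right, y
    return x

def rotate_left(y):           # mirror image
    x = y.right
    y.right, x.left = x.left, y
    return x

def splay(root, key):
    """Brings the node with `key` (or the last node visited while searching
    for it) to the root of the subtree."""
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:                      # zigzig
            root.left.left = splay(root.left.left, key)
            root = rotate_right(root)
        elif key > root.left.key:                    # zigzag
            root.left.right = splay(root.left.right, key)
            if root.left.right is not None:
                root.left = rotate_left(root.left)
        return rotate_right(root) if root.left is not None else root  # zig
    else:
        if root.right is None:
            return root
        if key > root.right.key:                     # zigzig
            root.right.right = splay(root.right.right, key)
            root = rotate_left(root)
        elif key < root.right.key:                   # zigzag
            root.right.left = splay(root.right.left, key)
            if root.right.left is not None:
                root.right = rotate_right(root.right)
        return rotate_left(root) if root.right is not None else root  # zig
```

find(x) is then search plus splay: splay the key if present, otherwise the lowest visited node ends up at the root. The rotations reorganize the tree but never change the in-order sequence of keys.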