# Lecture

- https://ktiml.mff.cuni.cz/~gregor/ds1/
- http://mj.ucw.cz/vyuka/dsnotes/

## Introduction

- data structures: data + operations (queries, updates)
  - complexity depends on the specific implementation
  - static or dynamic (does the data structure support updates?)
- queue (FIFO)
  - operations: enqueue(x), dequeue, isempty
  - implementations
    - (singly) linked list with two pointers (head, tail)
      - enqueue $\Theta(1)$
      - dequeue $\Theta(1)$
      - isempty $\Theta(1)$
    - (rotating) array with two pointers
      - it is useful to let the tail pointer point at the first free cell
      - the queue can then store at most one element fewer than the capacity of the array
      - again, constant complexities
  - note: when we use $\Theta$, we should make it clear whether it refers to the given implementation or to the operation in general (for any possible implementation)
- set $S\subseteq\mathcal U$
  - operations
    - insert(x)
    - delete(x)
    - find(x)
    - build($x_1,\dots,x_n$) for distinct elements (can be faster than running insert $n$ times)
  - implementations
    - unsorted linked list
      - insert … $O(1)$ if we may assume $x\notin S$; without this assumption $O(n)$ (we must first check that $x$ is not present)
      - delete, find … $O(n)$
      - build … $O(n)$ (which is faster than repeating insert $n$ times)
    - (rotating) array
      - $O(n)$ all the way
    - sorted array
      - we need some linear order on $\mathcal U$
      - find … binary search, $O(\log n)$
      - insert … $O(n)$ (we need to shift the elements)
      - delete … $O(n)$ again
      - build … $\Theta(n\log n)$
    - balanced BST (binary search tree)
      - search-tree property: elements in the left subtree are strictly smaller than the current node, and similarly for the right subtree
      - insert, delete, find … $O(\log n)$
      - build … $O(n\log n)$
      - note: we could write thetas everywhere
    - hashing
      - we randomly choose a hashing function satisfying some statistical properties
      - insert, delete, find in $O(1)$, but it is *expected* time
      - build in $O(n)$
- dictionary – like the set, but for each item we also store a pointer to its data
- multiset – set with counters (→ dictionary where the data is the number of occurrences)
- ordered set
  - new operations: min, max
    - linked list $O(n)$, array $O(n)$, sorted array $O(1)$, BST $O(\log n)$, hashing $O(n)$ (we need to go through the entire table)
  - pred(x), succ(x)
    - linked list $O(n)$, array $O(n)$, sorted array $O(\log n)$, BST $O(\log n)$, hashing $O(n)$
- RAM (random access machine)
  - potentially infinite memory organized into cells
    - each cell is addressed by a natural number
    - we keep natural numbers inside the cells (we need to set an upper bound!)
    - we can access an arbitrary cell
    - numbers in cells are $\leq p(\mathrm{len}(I),\mathrm{max}(I))$
      - $p$ … a polynomial
      - $I$ … instance of the problem (= the input)
  - instructions: store, arithmetical operations, logical operations, goto, memory allocator (allocates consecutive space)
  - time … number of instructions
  - space … number of cells used (including allocated empty space)

## Amortized Complexity

- stretchable array
  - append usually takes constant time
  - sometimes we need to reallocate the array (to a larger one)
    - takes linear time
    - worst-case complexity $O(n)$
    - but such a reallocation does not happen often
  - let's consider $n$ operations
    - we consider $n$ such that $2^{k-1}\lt n\leq 2^{k}$
    - we start with capacity 1 and double it on each reallocation
    - total cost of reallocations (aggregated)
      - $2^0+2^1+\dots+2^{k-1}=2^k-1=\Theta(n)$
    - amortized cost $\Theta(1)$
    - “aggregation method”
- flexible array
  - stack
    - append(x), removelast
  - shrink to $C/2$ if $n\lt C/2$?
    - this is problematic, we would get linear amortized complexity (alternating append/removelast right at the threshold reallocates on every operation)
    - we should use $n\lt C/4$ instead
  - amortized analysis
    - average over operations in the worst-case scenario
    - append – double the capacity if the array is full … costs $O(C)$ in this case
    - removelast – halve the capacity if the array becomes shorter than $C/4$ … costs $O(C)$
  - we consider blocks of operations
    - block boundaries = growing/shrinking operations
    - “how many elements does the array contain at the beginning and at the end of
the block?”
      - if growing: $C$ elements at the end, $C/2$ at the beginning
      - if shrinking: $C/4$ elements at the end, $C/2$ at the beginning
      - so at the beginning of the block, the array always contains $C/2$ elements
      - at the end of the block, the array contains $C/4$ or $C$ elements
      - → there are at least $C/4$ operations in each block
      - → amortized $O(1)$ per operation
      - “accounting method” – we account the cost to the operations
- binary counter, $\ell$ bits
  - *increment* operation
    - how many bits do we flip?
    - worst case: $O(\ell)$ bit flips
  - aggregation method
    - observation: bit $i$ changes once every $2^{i-1}$ operations
    - → number of bit flips $=\sum_{i=1}^\ell\lfloor\frac n{2^{i-1}}\rfloor\leq n\sum_{i=0}^\infty\frac1{2^i}=2n$
  - coin method
    - 1 coin … 1 bit flip
    - 000 → 001
      - we pay 1 coin for flipping 0→1
      - we store 1 coin on the 1
    - 001 → 010
      - we use the stored coin – it pays for the 1→0 flip
      - we pay 1 coin for 0→1
      - we store 1 coin on the 1 again
    - → 2 coins per increment suffice
    - so we get $O(1)$ amortized
- potential method (generalization of the coin method)
  - $m$ operations, real costs $R_i$
    - we want to compute upper estimates on the costs
  - amortized cost $A_i$ … our choice
  - we want $\sum_{j=1}^mA_j-\sum_{j=1}^mR_j\geq 0$
    - “never owe time”
    - this should hold not only for $m$ but for any $i\leq m$
  - we define the potential $\Phi_i:=\sum_{j=1}^iA_j-\sum_{j=1}^iR_j$
    - $\Phi_0=0$ by definition
    - $\Delta\Phi_i=\Phi_i-\Phi_{i-1}=A_i-R_i$
      - $\gt 0$ → saving time
      - $=0$ → paying exactly
      - $\lt 0$ → spending saved time
  - back to the bit flips
    - $\Phi_i=$ number of coins saved = number of ones
    - $A_i=2$
    - $R_i=1+t$
      - $t$ … number of trailing ones
    - $\Delta\Phi_i=1-t$
  - flexible array
    - $\Phi_i=$ number of operations since the last reallocation
    - $R_i=1$ (without reallocation) or $1+\Theta(n)$
    - $A_i=2$
    - we may need to rescale this
  - different approach
    - we choose $\Phi_i\geq 0$
    - $A_i:=R_i+\Delta\Phi_i$
    - so $\sum_{i=1}^m A_i=\sum_{i=1}^mR_i+\sum_{i=1}^m(\Phi_i-\Phi_{i-1})$
      - the last sum is telescopic
    - $\sum_{i=1}^m A_i=\sum_{i=1}^mR_i+\Phi_m-\Phi_0$
      - $\Phi_m-\Phi_0$ should be $\geq 0$
      - we can have a non-zero $\Phi_0$, but this difference has to be non-negative
  - steps
    0. choose the unit of cost
    1. estimate the real costs $R_i$ from above
    2. choose a potential $\Phi$
    3. show $\Phi_i\geq 0$
    4. aim for $A_i=R_i+\Delta\Phi_i$ to be small
- lazily balanced trees
  - notation
    - $n$ … number of items stored
    - $v$ … node
    - $T(v)$ … subtree rooted at $v$
    - $\ell(v),r(v)$ … left/right child of $v$
    - $s(v)$ … size = number of nodes in $T(v)$, including $v$
  - def: a BST is perfectly balanced if for every node $v$ it holds that $|s(\ell(v))-s(r(v))|\leq 1$
    - the ratio is approximately 1 : 1
  - def: a BST is balanced if for every $v$ and every child $c$ of $v$ it holds that $s(c)\leq\frac23s(v)$
    - the ratio is between 1 : 2 and 2 : 1
  - lemma: any balanced BST has depth $O(\log n)$
    - consider the longest path from the root to a leaf
    - in each step from the root towards the leaf, the size drops to at most 2/3 of the preceding size
    - size of the root … $n$, size of the leaf … 1
    - $1\leq(\frac23)^dn$
    - $d\leq\log_{2/3}(1/n)=\log_{3/2}n=O(\log n)$
  - *insert* operation
    - *find*, add a leaf, update sizes (we keep track of the sizes $s(v)$)
    - if the tree is no longer balanced, *rebuild* the subtree of the highest unbalanced node
    - observation: for $n$ sorted items, *build* takes linear time
      - select the middle item as the root and split the rest into the two subtrees
      - proceed recursively
  - $\Phi:=\sum_v\varphi(v)$
    - $\varphi(v):=\begin{cases}|s(\ell(v))-s(r(v))|&\text{if at least }2\\0&\text{otherwise}\end{cases}$
    - the clamping ensures that a perfectly balanced tree has zero potential
  - cost of *insert*
    - no rebuild: $A=R+\Delta\Phi=O(\log n)+O(\log n)=O(\log n)$
      - $\Delta\Phi=O(\log n)$ because $\Delta\varphi\leq 2$ for each visited node (usually $\Delta\varphi=1$, but there may be nodes with $\varphi$ hopping from 0 to 2 due to the clamping)
    - rebuild at $v$
      - the invariant was broken for $v$ and its child $c$
      - WLOG $s(\ell(v))\gt\frac23 s(v)\implies s(r(v))\lt\frac13 s(v)\implies\varphi(v)\gt\frac13 s(v)$
      - after the rebuild, this contribution and all the contributions inside the subtree become zero
      - contributions elsewhere stay the same
      - amortized cost of *rebuild*: $O(s(v))+c\cdot\Delta\Phi\leq 0$ for a sufficiently large constant $c$ (we scale the potential by $c$)
- splay trees
  - splay(x) … pull node $x$ to the root
    - preferably using double rotations (zigzig, zigzag)
    - possibly with one single rotation (zig) at the end
  - zig
    - $y$ has child $x$; zig(x) makes $y$ a child of $x$
  - zigzig, zigzag
    - $z$ has child $y$, $y$ has child $x$
    - zigzig(x)/zigzag(x) moves $x$ above both $y$ and $z$
    - zigzig is used if $x\lt y\lt z$ or $z\lt y\lt x$
    - zigzag is used if $y\lt x\lt z$ or $z\lt x\lt y$
  - splay performs a sequence of these operations until $x$ becomes the root
  - find(x)
    - search(x)
    - splay(lowest visited node)
  - how to choose the potential
    - $\Phi:=\sum_v r(v)$
    - rank $r(v)=\log s(v)$
    - $1\leq s(v)\leq n$, hence $0\leq r(v)\leq\log n$
    - $0\leq\Phi\leq n\log n$
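
As a small illustration of the rotating-array queue from the introduction (a sketch in Python; the class and method names are my own, not from the lecture):

```python
class RingQueue:
    """Queue in a fixed-size 'rotating' array with two pointers.

    The tail pointer points at the first free cell, so the queue can hold
    at most capacity - 1 elements: head == tail must unambiguously mean
    "empty", not "full".
    """

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0  # index of the oldest element
        self.tail = 0  # index of the first free cell

    def isempty(self):
        return self.head == self.tail

    def enqueue(self, x):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:  # only capacity - 1 cells are usable
            raise OverflowError("queue is full")
        self.buf[self.tail] = x
        self.tail = nxt

    def dequeue(self):
        if self.isempty():
            raise IndexError("queue is empty")
        x = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return x
```

Each operation touches a constant number of cells, which is the $\Theta(1)$ bound claimed above.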
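
The flexible-array analysis can also be checked experimentally. The sketch below (my own rendering, not the lecture's code) counts one unit of cost per element written, including copies during reallocation, so `cost / #operations` approximates the amortized cost; it grows by doubling and shrinks to $C/2$ once $n\lt C/4$, as above.

```python
class FlexArray:
    """Stack in a manually managed array: double the capacity when full,
    halve it when n < C/4.  self.cost counts element writes (including
    copies during reallocation) as a stand-in for real time."""

    def __init__(self):
        self.cap = 1
        self.n = 0
        self.buf = [None]
        self.cost = 0

    def _realloc(self, newcap):
        new = [None] * newcap
        for i in range(self.n):   # copying is the expensive part
            new[i] = self.buf[i]
        self.cost += self.n
        self.buf, self.cap = new, newcap

    def append(self, x):
        if self.n == self.cap:
            self._realloc(2 * self.cap)
        self.buf[self.n] = x
        self.n += 1
        self.cost += 1

    def removelast(self):
        if self.n == 0:
            raise IndexError("empty")
        self.n -= 1
        x = self.buf[self.n]
        self.cost += 1
        if self.cap > 1 and self.n < self.cap // 4:
            self._realloc(self.cap // 2)
        return x
```

Running any long mixed sequence of operations keeps the total cost within a small constant factor of the number of operations, matching the $O(1)$ amortized bound from the block argument.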
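
For the binary counter, both the aggregation bound and the potential-method identity can be verified directly. In the sketch below (assumptions mine: 16-bit counter, 1000 increments), $\Phi=$ number of ones, and every increment satisfies $R+\Delta\Phi=(1+t)+(1-t)=2=A$:

```python
def increment(bits):
    """Increment a binary counter (bits[0] = least significant bit).
    Returns the real cost R = number of bit flips = 1 + #trailing ones."""
    flips = 0
    i = 0
    while i < len(bits) and bits[i] == 1:
        bits[i] = 0            # each trailing 1 flips to 0
        flips += 1
        i += 1
    if i < len(bits):          # unless the counter overflows,
        bits[i] = 1            # exactly one 0 flips to 1
        flips += 1
    return flips

bits = [0] * 16
total = 0
for _ in range(1000):
    phi_before = sum(bits)     # potential = coins stored on the 1-bits
    r = increment(bits)
    assert r + (sum(bits) - phi_before) == 2   # A = R + delta(Phi) = 2
    total += r
```

The aggregate cost of the 1000 increments stays below $2n=2000$ flips, as the sum $\sum_i\lfloor n/2^{i-1}\rfloor\leq 2n$ predicts.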
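
The lazily balanced tree can be sketched as follows (my own Python rendering of the scheme above, not the lecture's code): a plain BST insert, sizes maintained along the search path, and, if some node ends up with a child bigger than $\frac23$ of it, a linear-time rebuild of the subtree of the highest such node into a perfectly balanced one. Only nodes on the search path can start violating the invariant, since only their sizes grew.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.size = key, None, None, 1

def size(v):
    return v.size if v is not None else 0

def flatten(v, out):
    """In-order walk: collects the keys of T(v) in sorted order."""
    if v is not None:
        flatten(v.left, out)
        out.append(v.key)
        flatten(v.right, out)

def build(keys, lo, hi):
    """Perfectly balanced BST from sorted keys[lo:hi] in linear time:
    the middle key becomes the root, the halves are built recursively."""
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    v = Node(keys[mid])
    v.left = build(keys, lo, mid)
    v.right = build(keys, mid + 1, hi)
    v.size = 1 + size(v.left) + size(v.right)
    return v

def insert(root, key):
    """BST insert, then rebuild at the highest node v on the path whose
    path-child c violates s(c) <= (2/3) * s(v)."""
    path, v = [], root
    while v is not None:
        path.append(v)
        v = v.left if key < v.key else v.right
    leaf = Node(key)
    if not path:
        return leaf
    if key < path[-1].key:
        path[-1].left = leaf
    else:
        path[-1].right = leaf
    for v in path:
        v.size += 1
    path.append(leaf)
    for i in range(len(path) - 1):
        v, c = path[i], path[i + 1]
        if 3 * c.size > 2 * v.size:          # highest unbalanced node
            keys = []
            flatten(v, keys)
            sub = build(keys, 0, len(keys))
            if i == 0:
                return sub                   # rebuilt the whole tree
            p = path[i - 1]
            if p.left is v:
                p.left = sub
            else:
                p.right = sub
            return root
    return root
```

Inserting keys in sorted order, the worst case for a plain BST, keeps this tree balanced after every operation.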
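
One common way to implement splay is the compact recursive version sketched below (an illustration, not the lecture's code): the zigzig and zigzag cases appear as the two nested-rotation branches, and possibly one final single rotation (zig) remains.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(y):          # zig: left child of y comes above y
    x = y.left
    y.left, x.right = x.right, y
    return x

def rotate_left(y):           # mirror image
    x = y.right
    y.right, x.left = x.left, y
    return x

def splay(root, key):
    """Brings the node with `key` (or the last node visited while searching
    for it) to the root of the subtree."""
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:                      # zigzig
            root.left.left = splay(root.left.left, key)
            root = rotate_right(root)
        elif key > root.left.key:                    # zigzag
            root.left.right = splay(root.left.right, key)
            if root.left.right is not None:
                root.left = rotate_left(root.left)
        return rotate_right(root) if root.left is not None else root  # zig
    else:
        if root.right is None:
            return root
        if key > root.right.key:                     # zigzig
            root.right.right = splay(root.right.right, key)
            root = rotate_left(root)
        elif key < root.right.key:                   # zigzag
            root.right.left = splay(root.right.left, key)
            if root.right.left is not None:
                root.right = rotate_right(root.right)
        return rotate_left(root) if root.right is not None else root  # zig
```

find(x) is then search plus splay: splay the key if present, otherwise the lowest visited node ends up at the root. The rotations reorganize the tree but never change the in-order sequence of keys.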