# Lecture

- credit
  - homeworks
  - attendance at practicals is mandatory, at most 3 absences allowed
  - final test

## Probability, information theory

- sample space $\Omega$
- event $A$ as a set of basic outcomes, $A\subseteq\Omega$
  - we can estimate the probability of event $A$ by experiment: divide the number of times $A$ occurred by the total number of experiments
  - maximum likelihood estimation
- axioms
  - $p(A)\in[0,1]$
  - $p(\Omega)=1$
  - $p(\bigcup A_i)=\sum p(A_i)$ for disjoint events $A_i$
- joint probability, conditional probability
  - estimating conditional probability
- Bayes rule
- independence
- chain rule
- golden rule of statistical NLP
- expectation
- entropy
  - nothing can be more uncertain than the uniform distribution
  - perplexity … $G(p)=2^{H(p)}$
  - joint entropy, conditional entropy
  - entropy is non-negative
  - chain rule: $H(X,Y)=H(Y\mid X)+H(X)$
  - $H(Y\mid X)\leq H(Y)$
  - other properties of entropy
  - coding interpretation: entropy … the least average number of bits needed to encode a message
- KL distance (divergence)
- mutual information
  - $I(X,Y)=D(p(x,y)\Vert p(x)p(y))$
  - we can derive that $I(X,Y)=H(X)-H(X\mid Y)$
  - by symmetry, $I(X,Y)=H(Y)-H(Y\mid X)$
- cross-entropy (these quantities are illustrated in the sketch below)
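A minimal Python sketch of the quantities above on toy discrete distributions (dictionaries mapping outcomes to probabilities). The function names and example distributions are mine, not from the lecture.

```python
import math

def entropy(p):
    """H(p) in bits for a discrete distribution {outcome: probability}."""
    return -sum(pr * math.log2(pr) for pr in p.values() if pr > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x)."""
    return -sum(pr * math.log2(q[x]) for x, pr in p.items() if pr > 0)

def kl_divergence(p, q):
    """D(p || q) = H(p, q) - H(p), always >= 0."""
    return cross_entropy(p, q) - entropy(p)

def mutual_information(pxy):
    """I(X, Y) = D(p(x, y) || p(x) p(y)) for a joint distribution {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), pr in pxy.items():
        px[x] = px.get(x, 0.0) + pr
        py[y] = py.get(y, 0.0) + pr
    return sum(pr * math.log2(pr / (px[x] * py[y]))
               for (x, y), pr in pxy.items() if pr > 0)

uniform = {c: 0.25 for c in "abcd"}                 # maximally uncertain over 4 outcomes
skewed = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}
print(entropy(uniform), 2 ** entropy(uniform))      # 2.0 bits, perplexity G(p) = 2^H(p) = 4.0
print(entropy(skewed))                              # < 2.0: less uncertain than uniform
print(kl_divergence(skewed, uniform))               # non-negative
print(mutual_information({("x", "y"): 0.4, ("x", "z"): 0.1,
                          ("w", "y"): 0.1, ("w", "z"): 0.4}))   # > 0: X and Y are dependent
```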
## Noisy channel model

- we try to recover the original input from a noised output
  - original input $A$ = the message in someone's mind
  - noised output $B$ = the written/spoken representation
- usage: OCR, handwriting recognition, speech recognition, machine translation, POS tagging
- to recover the input
  - $p(A\mid B)=p(B\mid A)\cdot\frac{p(A)}{p(B)}$
  - $A^*=\text{argmax}_A\;p(B\mid A)\cdot p(A)$
  - $p(B\mid A)$ … acoustic model (in speech recognition)
  - $p(A)$ … language model
- language modelling
  - we need to model the probabilities of sequences of words
  - we will use n-gram probabilities so that we don't need too many parameters
  - $p(w_i\mid w_{i-2},w_{i-1})=\frac{c(w_{i-2},w_{i-1},w_i)}{c(w_{i-2},w_{i-1})}$
  - number of parameters
    - uniform … 1
    - unigram … $|V|-1$
    - bigram … $\sim|V|^2$
    - trigram … $\sim|V|^3$
    - for $|V|\approx 60\text{k}$, this is larger than the number of parameters of the Llama model
- smoothing
  - ADD-1 smoothing: add 1 to all of the counts
    - $p(w\mid h)=\frac{c(w,h)+1}{c(h)+|V|}$
  - what if there is a word that we did not see in the training set? we will have a special word for that ``
  - ADD-$\lambda$ smoothing: add $\lambda$ to all of the counts
    - $p(w\mid h)=\frac{c(w,h)+\lambda}{c(h)+\lambda\cdot |V|}$
    - how to estimate $\lambda$: we want to minimize the cross-entropy
    - we cannot use training (nor test) data for that, we need held-out (dev) data
  - linear interpolation
    - $p(w_i\mid w_{i-2},w_{i-1})=\lambda_0\cdot p_0+\lambda_1 \cdot p_1(w_i)+\lambda_2\cdot p_2(w_i\mid w_{i-1})+\lambda_3\cdot p_3(w_i\mid w_{i-1},w_{i-2})$
    - $\sum\lambda_i=1$
    - expectation–maximization (EM) algorithm (see the sketch at the end of this section)
      - expectation: $c(\lambda_j)=\sum_{(h,w)}\frac{\lambda_j\,p_j(w\mid h)}{p'(w\mid h)}$, summing over the held-out data, where $p'(w\mid h)=\sum_k\lambda_k\,p_k(w\mid h)$ is the current interpolated model
      - maximization: $\lambda_j=\frac{c(\lambda_j)}{\sum_k c(\lambda_k)}$
- homework
  - language identification using a character-level language model
  - at least 2 languages
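A minimal sketch of the EM re-estimation of the interpolation weights, directly following the expectation and maximization formulas above. The function name, the data format, and the toy component models are mine; the uniform component $p_0=1/|V|$ is added inside the function, which keeps the interpolated probability nonzero on held-out data.

```python
def em_lambdas(heldout, models, vocab_size, iterations=20):
    """Re-estimate interpolation weights on held-out data.

    heldout: list of (history, word) pairs, e.g. ((w_{i-2}, w_{i-1}), w_i)
    models:  component models p_1..p_n as callables (history, word) -> probability;
             p_0, the uniform distribution 1/|V|, is prepended here
    Returns one lambda per component; they sum to 1.
    """
    components = [lambda h, w: 1.0 / vocab_size] + list(models)
    n = len(components)
    lambdas = [1.0 / n] * n                          # start from uniform weights

    for _ in range(iterations):
        expected = [0.0] * n
        for h, w in heldout:
            probs = [components[j](h, w) for j in range(n)]
            p_interp = sum(lambdas[j] * probs[j] for j in range(n))
            # E-step: c(lambda_j) += lambda_j * p_j(w|h) / p'(w|h)
            for j in range(n):
                expected[j] += lambdas[j] * probs[j] / p_interp
        # M-step: lambda_j = c(lambda_j) / sum_k c(lambda_k)
        total = sum(expected)
        lambdas = [c / total for c in expected]
    return lambdas

# Toy usage with a made-up "unigram" and "bigram" component over the vocabulary {a, b}.
p_uni = {"a": 0.7, "b": 0.3}
p_bi = {("a", "a"): 0.6, ("a", "b"): 0.4, ("b", "a"): 0.2, ("b", "b"): 0.8}
models = [lambda h, w: p_uni[w], lambda h, w: p_bi[(h[-1], w)]]
heldout = [(("a",), "b"), (("b",), "b"), (("b",), "a"), (("a",), "a")]
print(em_lambdas(heldout, models, vocab_size=2))     # three weights summing to 1
```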
## Morphological analysis

- morphological annotation
  - POS tags
  - tagsets
    - English: Penn Treebank (45 tags), Brown Corpus (87), CLAWS C5 (62), London-Lund (197)
    - Czech: Prague Dependency Treebank (4294; positional), Multext-East (1485; the Orwell 1984 parallel corpus), Prague Spoken Corpus (over 10000)
    - Universal Dependencies: 17 universal POS tags, 27 universal features (each with 1–37 possible values)
  - Czech positional tags of PDT
    - positions: part of speech, subPOS, gender, number, case, possessor's gender, possessor's number, person, tense, degree, polarity, voice, (reserved), (reserved), style
    - gender ambiguities → additional values for the position
  - Penn Treebank tagset
    - prepositions and subordinating conjunctions have the same tag (hard to distinguish)
    - “to” has its own tag (it marks an infinitive or works as a preposition, which is not easy to distinguish)
  - universal POS tags (from Universal Dependencies)
    - noun, proper noun, verb, adjective, adverb, interjection, pronoun, determiner, auxiliary, numeral, adposition, subordinating conjunction, coordinating conjunction, particle, punctuation, symbol, unknown
- ancient Greek word classes
  - adjectives are missing – they are a relatively recent invention from France
  - in French, adjectives behave quite differently from nouns
- traditional parts of speech
  - English: noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection
  - Czech: noun, adjective, pronoun, numeral, verb, adverb, preposition, conjunction, particle, interjection
- openness vs. closedness, content vs. function words
  - open classes (take new words)
    - verbs (non-auxiliary), nouns, adjectives, adjectival adverbs, interjections
    - word formation (derivation) across classes
  - closed classes (words can be enumerated)
    - pronouns/determiners, adpositions, conjunctions, particles
    - pronominal adverbs
    - auxiliary and modal verbs/particles
    - numerals
    - typically they are not a base for derivation
    - even closed classes evolve, but over a longer period of time
- morphological analysis
  - input: word form (token)
  - output: a set of analyses (possibly empty)
    - an analysis consists of
      - lemma (base form of the lexeme)
      - tag (morphological, POS)
        - POS
        - features and their values
- morphological analysis vs. tagging
  - tagging … context-based disambiguation
  - most taggers employ ML methods
  - taggers may or may not work on top of morphological analysis
- finite-state morphology
  - finite-state automaton/machine
    - example: an FSA checking correct spelling of Czech *dě, tě, ně*
  - lexicon is implemented as an FSA (trie)
    - composed of multiple sublexicons (prefixes, stems, suffixes)
    - notes (glosses) at the end of every sublexicon – e.g. POS tags
    - lexicon is a DAG, not a tree
  - problem with phonology: baby+s → babies (not babys)
    - two-level morphology solves that
      - upper (lexical) language
      - lower (surface) language
    - two-level rules
      - lexical: `baby+0s`
      - surface: `babi0es`
      - zero is treated as a normal symbol (but corresponds to an empty string)
  - finite-state transducer
    - a transducer is a special case of an automaton
    - checking (finite-state automaton): does the word belong to the language (lexicon)?
    - analysis (finite-state transducer): surface string → lexical string
    - generation (finite-state transducer): lexical string → surface string
  - another way of writing rules: two-level grammar
    - `a:b <=> l:l _ r:r`
    - lexical `a` must be realized as surface `b` in this context and only in this context
    - context … between `l` and `r`
    - an FST can be constructed from that
  - disadvantage: capturing long-distance dependencies is clumsy
    - example of a long-distance dependency – Czech adjectives (superlatives in particular): the `nej-` prefix is legal only if there is a `-ší` suffix

## Syntactic analysis

- syntactic annotation
  - dependency tree – we only need “pointers” to parent elements and labels of the connections
  - annotation schemes differ (e.g. Prague Dependency Treebank vs. Universal Dependencies)
- surface syntax
  - relations between sentence parts
  - sentence part = token (word, number, punctuation)
  - we do not restore elided constituents at this level
  - different shapes in different theories, hierarchical structure (tree)
    - phrasal (constituent) tree, parse tree
    - dependency tree
- constituents vs. dependencies
  - there exist multiple constituent trees for one dependency tree
  - in a dependency tree, the phrases have “heads” (parent nodes of the words of the phrases)
- phrases
  - phrase replaceability – a phrase can be replaced by a different phrase of the same type
  - phrases of different POS types
  - prepositional phrase attachment ambiguity
  - verb phrase × clause
- coordination
- ellipsis
- syntactic parsers
  - transition-based Malt parser
    - complete for projective trees
    - there are extensions that can produce non-projective trees
    - transitions (see the sketch below)
      - Shift – move a word from the buffer to the stack
      - Larc – connect the two topmost stack words, the higher one is the parent (remove the child from the stack)
      - Rarc – connect the two topmost stack words, the lower one is the parent (remove the child from the stack)
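A minimal sketch of the transition system described above (an arc-standard-style system with Shift/Larc/Rarc). A real Malt parser chooses each transition with a trained classifier; this sketch only replays a hand-written transition sequence, and the function name and example sentence are mine.

```python
def parse(words, transitions):
    """Apply Shift / Larc / Rarc transitions and return dependency arcs
    as (head_index, dependent_index) pairs over positions in `words`."""
    buffer = list(range(len(words)))      # word indices still waiting to be processed
    stack, arcs = [], []

    for t in transitions:
        if t == "Shift":                  # move the next word from the buffer to the stack
            stack.append(buffer.pop(0))
        elif t == "Larc":                 # topmost stack word is the parent of the one below it
            child = stack.pop(-2)
            arcs.append((stack[-1], child))
        elif t == "Rarc":                 # second stack word is the parent of the topmost one
            child = stack.pop()
            arcs.append((stack[-1], child))
    return arcs

# "A dog barked" with 0 = A, 1 = dog, 2 = barked; A <- dog <- barked
print(parse(["A", "dog", "barked"],
            ["Shift", "Shift", "Larc", "Shift", "Larc"]))   # [(1, 0), (2, 1)]
```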
## Information Retrieval

- there will be a homework similar to last year's (but with slight modifications)
- information retrieval
- boolean retrieval
  - queries are boolean expressions
  - the search engine returns documents that satisfy the expression
  - term-document incidence matrix
    - for bigger collections, the incidence matrix would be very sparse
    - we will use a different way of storage
  - inverted index
    - for each term in a dictionary, we store a list of all documents that contain the term (postings list)
    - the tokens are usually normalized (lemmatized?); we also sort the whole list of (token, document) pairs
    - we want to keep the dictionary in memory; the postings can be stored on disk
  - data structures for looking up a term – hash tables and trees
    - is there a fixed number of terms or will it keep growing?
    - what are the frequencies with which various keys will be accessed?
    - how many terms are we likely to have?
    - hash tables
      - term → integer, try to avoid collisions
      - fast lookup time
      - no way to find minor variants, no prefix search
      - might need to be rehashed
    - trees
      - prefixes can be searched
      - slightly slower search
      - need to be rebalanced
      - …
      - binary tree
      - B-tree
  - simple conjunctive query (two terms)
    - we need to find the intersection of two postings lists
    - that is linear if the postings are sorted
  - boolean queries
    - use `and`, `or`, `not`
    - each document is considered to be a set of terms
- text processing
  - format, language, character set
  - normalization, tokenization
    - hyphens and compound words are problematic
    - accents and umlauts too
    - capital letters
  - stop words – extremely common words
    - older IR systems usually ignored them
    - you need stop words for phrase queries like “King of Denmark”
  - more equivalence classing
    - phonetic equivalence (Soundex)
    - semantic equivalence (thesauri)
  - lemmatization – we reduce the word to its proper dictionary headword form
    - inflectional
    - derivational
  - stemming
    - uses heuristics to chop off the end of the word
    - Porter stemmer
      - conventions + 5 phases of reductions applied sequentially
      - each phase consists of a set of commands
      - sample command – delete final *ement* if what remains is longer than 1 character (replacement → replac)
    - stemming is usually good enough (we don't need lemmatization)

### Ranked Retrieval

- boolean retrieval: good for experts, good for applications, not good for the majority of users
  - most users are not able to (or don't want to) write boolean queries
  - we usually get either too few or too many results
- with ranking, we can just show the top 10 results
  - more relevant results are ranked higher than less relevant results
  - we will use a rank $\in[0,1]$
- Jaccard coefficient
- term weighting
  - bag of words model
  - “term frequency”
  - we want high weights for terms that occur rarely in the collection
    - inverse document frequency
    - collection frequency vs. document frequency
  - tf-idf weighting: $w_{t,d}=(1+\log\text{tf}_{t,d})\cdot\log\frac N{\text{df}_t}$
- documents (and the query) as vectors (components = tf-idf scores)
  - we want to somehow measure the similarity between the query and the documents
  - distance does not work well – longer documents are far from the shorter ones
  - we will use the angle (cosine similarity)
  - cosine similarity of normalized vectors is just the dot product (see the sketch at the end of this section)
- evaluation
  - precision, recall, F-score
  - accuracy is usually not helpful in information retrieval, as $TP\ll TN$
  - precision and recall are computed for unranked sets
    - we can compute them for the top $n$ results (for different values of $n$)
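A minimal sketch of tf-idf weighting and cosine-similarity ranking, using the $w_{t,d}=(1+\log\text{tf}_{t,d})\cdot\log\frac N{\text{df}_t}$ formula above (log base 10 here; the base only rescales the weights). The function names and toy documents are mine.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Sparse tf-idf vector {term: weight} with w = (1 + log10 tf) * log10(N / df)."""
    tf = Counter(tokens)
    return {t: (1 + math.log10(c)) * math.log10(n_docs / df[t])
            for t, c in tf.items() if df.get(t)}    # skip terms unseen in the collection

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [["king", "of", "denmark"],
        ["queen", "of", "england"],
        ["tf", "idf", "weighting", "weighting"]]
df = Counter(t for d in docs for t in set(d))        # document frequency of each term
doc_vecs = [tfidf_vector(d, df, len(docs)) for d in docs]

query_vec = tfidf_vector(["king", "denmark"], df, len(docs))
# Rank documents by cosine similarity to the query, highest first.
ranking = sorted(range(len(docs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
print(ranking)                                       # the "king of denmark" document comes first
```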
## Neural Networks

- the old view: a network of artificial neurons
- the current view: a network of layers
- …
- representing words
  - words are not vectors of continuous numbers :((
  - one-hot encoding
  - prediction of the next word
    - the result has interesting properties: similar words have similar weight vectors
- representing sequences
  - RNN = an information pipeline
  - CNN = information in a tree-like structure (used in speech recognition)
  - self-attentive = Transformers = information flow in a weighted complete bipartite graph
- Transformers
  - originally designed for machine translation
  - attention weights = a similarity matrix between all pairs of states
    - dividing by $\sqrt{d}$ helps numerical stability
  - feed-forward layer
- named entity recognition
- LM as sequence labeling
  - for each word, we predict the following one
  - we need to modify the attention so that it does not attend to the right context
    - otherwise, it could look into the future and see the word it is supposed to predict
  - the LM itself only computes probability

## Machine translation

- “the royal discipline of computational linguistics”
- the Transformer model was originally created for machine translation
- the space of good solutions is large
  - manual evaluation depends on expectations – one sentence can be evaluated very differently by different people
  - metrics can drive the research for the topics they evaluate
- BLEU score combines precision and recall (see the sketch below)
  - precision: n-gram counting
  - recall: brevity penalty
- phrase-based machine translation
  - the morphology is in the dictionary – it contains the exact forms of the words
  - the dictionary contains both longer and shorter units
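A simplified, sentence-level sketch of a BLEU-style score against a single reference: clipped n-gram precisions for n = 1–4 combined by a geometric mean and multiplied by a brevity penalty. Real BLEU is defined at corpus level and differs in details; the function names and toy sentences are mine.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        if total == 0:
            return 0.0
        # Precision part: each candidate n-gram is credited at most as many
        # times as it occurs in the reference (clipping).
        matched = sum(min(c, ref[g]) for g, c in cand.items())
        if matched == 0:
            return 0.0
        precisions.append(matched / total)
    # Recall part: the brevity penalty punishes candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat is on the mat".split()
print(bleu("the cat is on the mat".split(), ref))     # 1.0 (exact match)
print(bleu("the cat is on the table".split(), ref))   # about 0.76
```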