it is not symmetric and does not satisfy triangle inequality
it is better to use “KL divergence” instead of ”KL distance”
H(p)+D(p∥q) … the average number of bits needed to encode data from p using a code optimized for q
Explain Mutual Information (formula expected too).
I(X,Y)=D(p(x,y)∥p(x)p(y))
how much our knowledge of Y contributes (on average) to easing the prediction of X
by how many bits the knowledge of Y lowers the entropy H(X)
I(X,Y)=H(X)−H(X∣Y)=H(Y)−H(Y∣X)
how p(x,y) deviates from p(x)p(y)
I(X,Y) = ∑x∈Ω ∑y∈Ψ p(x,y) · log2( p(x,y) / (p(x)·p(y)) )
measured in bits
I(X,X)=H(X)
I(X,Y)=I(Y,X)
I(X,Y)≥0
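A minimal Python sketch (not from the lecture) that evaluates the mutual-information formula above on a small made-up joint distribution:

```python
# computing I(X;Y) in bits from a small joint distribution table p(x, y)
import math

# hypothetical joint distribution over X in {0,1} and Y in {0,1}
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# marginals p(x) and p(y)
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
mi = sum(p * math.log2(p / (p_x[x] * p_y[y]))
         for (x, y), p in p_xy.items() if p > 0)
print(f"I(X;Y) = {mi:.4f} bits")  # 0 iff X and Y are independent
```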
Language models, the noisy channel model
Explain the notion of The Noisy Channel.
idea: we have the noisy output, we want to “predict” the original input
we want to find such an original input that p(original∣noisy) is maximum
Bayes formula: p(A∣B) = p(B∣A)·p(A) / p(B)
The Golden Rule: A_best = argmax_A p(B∣A)·p(A)
if used for speech recognition
p(B∣A) … acoustic model
p(A) … language model
Explain the notion of the n-gram language model.
it would be impractical to compute the probability of a sequence of words as p(W)=p(w1,w2,…,wn)=p(w1)⋅p(w2∣w1)⋅…⋅p(wn∣w1,w2,…,wn−1)
there would be too many parameters in the model
in an n-gram language model, we only remember n−1 previous words
examples
0-gram LM (uniform model) … p(w)=1/∣V∣
1-gram LM … we store p(w) for every word
2-gram LM … p(wi∣wi−1) for every pair of words
3-gram LM … p(wi∣wi−2,wi−1) etc.
so in a 3-gram LM, we estimate p(W) = p(w1∣s,s)⋅p(w2∣s,w1)⋅p(w3∣w1,w2)⋅p(w4∣w2,w3)⋅…⋅p(wn∣wn−2,wn−1), where s is the sentence-start symbol
Describe how Maximum Likelihood estimate of a trigram language model is computed. (2 points)
we estimate p(wi∣wi−2,wi−1) = c3(wi−2,wi−1,wi) / c2(wi−2,wi−1)
to get c3, we count sequences of three words in training data
to get c2 we count sequences of two words
we can either get it from c3
c2(y,z)=∑wc3(y,z,w)
or count differently at the beginning of the data
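A minimal sketch of the maximum-likelihood estimate above on a toy corpus, assuming <s> padding at the sentence start:

```python
# MLE of p(w_i | w_{i-2}, w_{i-1}) = c3 / c2 from toy data
from collections import Counter

sentence = "the cat sat on the mat".split()
tokens = ["<s>", "<s>"] + sentence                  # pad the beginning of the data

c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))   # trigram counts
c2 = Counter(zip(tokens, tokens[1:]))               # bigram counts (histories)

def p_mle(w, u, v):
    """MLE of p(w | u, v); 0.0 here for unseen histories."""
    return c3[(u, v, w)] / c2[(u, v)] if c2[(u, v)] else 0.0

print(p_mle("cat", "<s>", "the"))   # 1.0 on this toy corpus
```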
Why do we need smoothing (in language modelling)?
in the trigram model, there will always be some zero counts (trigrams never seen in the training data)
however, we cannot be sure that the trigram really does not exist in the language as our data is possibly incomplete
when calculating the probability of a sequence, one trigram with zero probability makes the whole sequence have zero probability, which is impractical
we need to make the system more robust and avoid infinite cross entropy (this happens when test data contain an event which has not been seen in training data)
solution: redistribute probabilities (take from non-zero words, give to words with zero probabilities)
Give at least two examples of smoothing methods. (2 points)
add 1
∀w∈V: p′(w∣h) = (c(h,w) + 1) / (c(h) + ∣V∣)
h … history
V … vocabulary
problem: if ∣V∣ > c(h) (which happens often), the added counts dominate and too much probability mass is shifted to unseen words
add less than 1
p′(w∣h) = (c(h,w) + λ) / (c(h) + λ∣V∣)
λ<1
we choose λ in a way that minimizes cross-entropy of p′ relative to holdout (dev) data
cross-entropy of q relative to p: H(p,q)=−Ep[logq]
linear interpolation, multiple lambdas
p′(wi∣wi−2,wi−1)=λ3p3(wi∣wi−2,wi−1)
+λ2p2(wi∣wi−1)+λ1p1(wi)+λ0/∣V∣
the weights are normalized (constrained) so that…
λi>0
∑iλi=1
λ0=1−λ1−λ2−λ3
estimated using expectation maximization (EM) algorithm
another version: bucketed smoothing
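A minimal sketch of the linear-interpolation formula above with fixed, made-up lambdas (in practice they would be estimated by EM on held-out data); p3, p2, p1 are assumed to be MLE trigram/bigram/unigram estimates:

```python
# linear interpolation of trigram, bigram, unigram, and uniform estimates
V = 10_000                                   # hypothetical vocabulary size
lambdas = (0.5, 0.3, 0.15, 0.05)             # l3, l2, l1, l0; they sum to 1

def p_smoothed(w, u, v, p3, p2, p1):
    l3, l2, l1, l0 = lambdas
    return (l3 * p3.get((u, v, w), 0.0)
            + l2 * p2.get((v, w), 0.0)
            + l1 * p1.get(w, 0.0)
            + l0 / V)                        # uniform component, never zero
```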
Morphological analysis
What is a morphological tag? List at least five features that are often encoded in morphological tag sets.
morphological tag = a summary of the grammatical information about a specific word in the given context
features: part of speech, case (pád), number, gender, person, mood (způsob), tense, animacy (životnost), degree, voice (rod), aspect (vid), polarity (zápor), …
List the open and closed part-of-speech classes and explain the difference between open and closed classes.
closed classes (e.g. pronouns, prepositions, conjunctions, determiners, particles, numerals): their words can be enumerated (even though these classes also evolve, but over a much longer period of time); the words are typically not a base for derivation
open classes (nouns, verbs, adjectives, adverbs): new members are added all the time (borrowings, derivation, neologisms)
Explain the difference between a finite-state automaton and a finite-state transducer. Describe the algorithm of using a finite-state transducer to transform a surface string to a lexical string (pseudocode or source code in your favorite programming language). (2 points)
finite-state machine is a five-tuple (A,Q,P,q0,F)
input alphabet A, set of states Q, transition function P, initial state q0, accepting states F
it can accept or reject a word
morphology of a language can be described using a finite-state automaton
attaching a suffix may result in phoneme or grapheme (spelling) changes
we call this phonology for simplicity
plural of baby is not *babys but babies
we will use two-level morphology: upper (lexical) language and lower (surface) language
finite-state transducer
a type of finite-state automaton that maps between two sets of symbols
pairs (r:s) from finite alphabets R and S
R … lexical language
S … surface language
tasks: analysis (get lexical string based on a surface string), generation (generate surface string from a lexical string)
difference
FSA defines a set of accepted strings (lexicon)
FST defines a relation between sets of strings (lexical string → surface string or surface → lexical)
algorithm
initialize set of paths P={}
read input symbols one-by-one
for each input symbol x:
generate all lexical symbols y that may be paired with the empty surface symbol (pairs y:0)
repeat until the maximum possible number of subsequent zeros is reached:
extend all paths in P by all corresponding pairs y:0
check all new extensions against the phonological transducers and the lexical automaton; remove disallowed paths (partial solutions) from P
generate all possible lexical symbols z for the current input symbol x (pairs z:x); extend each path in P by all such pairs
check all paths in P (the next transition in FST/FSA); remove impossible paths
collect glosses from the lexicon for all paths that survived
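A hedged Python sketch of the analysis step, using a tiny hand-made transducer instead of the lecture's full two-level machinery (the transitions below cover only the y+s ↔ ies fragment of baby+s ↔ babies and are made up):

```python
# A nondeterministic finite-state transducer used for analysis: it maps a
# surface string to all accepted lexical strings. Transitions are
# (state, lexical_symbol, surface_symbol) -> next_state; "0" stands for the
# empty symbol on either side.
TRANS = {
    (0, "y", "i"): 1,   # y:i  (y is spelled i before the plural suffix)
    (1, "+", "0"): 2,   # +:0  (the morpheme boundary has no surface form)
    (2, "0", "e"): 3,   # 0:e  (epenthetic e appears only on the surface)
    (3, "s", "s"): 4,   # s:s
}
FINAL = {4}

def analyze(surface, state=0, lexical=""):
    """Return all lexical strings the transducer accepts for `surface`."""
    results = []
    if not surface and state in FINAL:
        results.append(lexical)
    for (st, lex, surf), nxt in TRANS.items():
        if st != state:
            continue
        if surf == "0":                       # consume nothing from the surface
            results += analyze(surface, nxt, lexical + lex)
        elif surface and surf == surface[0]:  # consume one surface symbol
            out = lex if lex != "0" else ""   # do not emit the empty symbol
            results += analyze(surface[1:], nxt, lexical + out)
    return results

print(analyze("ies"))   # ['y+s']
```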
Give an example of a phonological or an orthographical change caused by morphological inflection (any natural language). Describe the rule that would take care of the change during analysis or generation. It is not required that you draw a transducer, although drawing a transducer is one of the possible ways of describing the rule.
Czech: káď → kádě
ď:d <= _ +:0 e:ě
káď+e (lexical)
kád0ě (surface)
complete version of the rule: [ď:d | ň:n | ť:t] <=> _ +:0 [e:ě | i:i | í:í]
another example: nať → nati
Give an example of a long-distance dependency in morphology (any natural language). How would you handle it in a morphological analyzer?
umlauts in German plurals
Buch → Bücher
simplified rule: u:ü <= _ c:c h:h +:0 e:e r:r
Buch+er (lexical)
Büch0er (surface)
context should contain +:0 and perhaps test end of word (#)
this is too simplified, the suffix must be plural and the stem must be marked for plural umlauting (Kocher and Besucher are correct!)
capturing long-distance dependencies is clumsy
Kraut / Kräuter has different intervening symbols → different rule
a transducer could be more general but it would probably overgenerate
Syntactic analysis
Describe dependency trees, constituent trees, differences between them and phenomena that must be addressed when converting between them. (2 points)
constituent tree
the sentence is divided into phrases (constituents)
recursive: phrases are divided into smaller phrases
the smallest phrases are words
dependency tree
similar to the sentence analysis (“větný rozbor”) taught in Czech schools (there are some differences, e.g., in dependency trees the subject depends on the verb)
graph of dependencies between words (constituents)
head of phrase = governing node, parent node
the other nodes are dependent nodes
differences
constituent trees
usually do not mark the head of a phrase
may not mark the function of a constituent within its superordinate constituent
dependency trees
do not show nonterminals (phrase types like NP, VP, …) nor any other phrase-level features
do not show “how the sentence is generated” (order, recursion, proximity of constituents)
Give an example of a sentence (in any natural language) that has at least two plausible, semantically different syntactic analyses (readings). Draw the corresponding dependency trees and explain the difference in meaning. Are there other additional readings that are less probable but still grammatically acceptable? (2 points)
I saw the man with a telescope.
saw with a telescope × man with a telescope
Ženu holí stroj. (a classic Czech ambiguity)
reading 1: (já) ženu (čím?) holí (koho, co?) stroj – “I am chasing the machine with a stick.”
ženu – root/verb
holí – adverbial (instrument)
stroj – object
reading 2: (koho?) ženu (čím?) holí (ty) stroj – “Adorn the woman with a stick.”
stroj – root/verb (imperative)
ženu – object
holí – adverbial (instrument)
reading 3: (koho?) ženu holí (kdo, co?) stroj – “The machine is shaving the woman.”
holí – root/verb
stroj – subject
ženu – object
there will always be a special node for the sentence-final period (punctuation)
What is coordination? Why is it difficult in dependency parsing? How would you capture coordination in a dependency structure? What are the advantages and disadvantages of your solution?
examples
chickens, hens, rabbits, cats, and dogs
new or even newer
quickly and finely
either now or later
the sun was shining, so we went outside
there is no real head → difficult to capture in dependency trees
the coordinate phrases (conjuncts) are usually of the same type
my solution
add a special node acting as a root of the coordination
advantage: we do not emphasize any of the conjuncts as the most important
disadvantage: we add a new element that was not in the original sentence
What is ellipsis? Why is it difficult in parsing? Give examples of different kinds of ellipsis (any natural language).
a phrase omitted from the sentence (its surface form) although it is present in the underlying meaning (deep structure)
to parse the sentence correctly, we need to deduce some missing information from the context
examples
Whom did you see there? — Peter. (missing verb)
Czech and German researchers discussed… (it means that Czech researchers and German researchers discussed; another possible meaning is that every researcher was both Czech and German)
The Penguins are leading 4:0, while the Colorado Avalanche only 3:2. (missing verb in the second part)
Sedím. (omitted subject; the information is contained in the verb form)
Information retrieval
Explain the difference between information need and query.
information need
a topic that the user desires to know more about
a vague idea of what the user is looking for
query
what the user conveys to the computer in an attempt to communicate the information need
formal statement of information need
What is inverted index and what are the optimal data structures for it?
for each term in the dictionary, we store a list of all documents that contain it (postings; ids of such documents)
the lists of documents should be sorted so that we can easily merge/intersect them
optimal data structures: for the dictionary (term lookup), a hash table or a search tree (binary search tree, B-tree, …); the postings lists themselves are kept as sorted arrays (possibly with skip pointers) so that they can be merged efficiently
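A minimal sketch of building a toy inverted index and intersecting two sorted postings lists (the documents are made up):

```python
from collections import defaultdict

docs = {1: "new home sales rise", 2: "home sales fall", 3: "new car sales"}

index = defaultdict(list)                 # term -> sorted list of doc ids
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

def intersect(p1, p2):
    """Merge two sorted postings lists (AND query)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect(index["new"], index["sales"]))   # [1, 3]
```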
What is a stopword and what is it useful for?
stop words = extremely common words which would appear to be of little value in helping select documents matching a user need
examples: a, an, and, are, as, at, be, by, for, from, …
it may be useful to omit these words to save the space needed to store their (very long) postings lists
however, you need stop words for phrase queries like “King of Denmark”
Explain the bag-of-words principle.
we do not consider the order of words in a document
we only store the information whether the document contains a given word (and how many times it occurs)
What is the main advantage and disadvantage of the Boolean model?
advantage
the model is simple and fast; we do not need any scoring function
good for experts: precise understanding of the needs and collection
good for applications: can easily consume thousands of results
disadvantage
not good for the majority of users
most users are not capable of writing Boolean queries (or are too lazy to do so)
Boolean queries often result in either too few or too many results
in Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits
solution: ranked retrieval
we score the documents by relevance
show the top 10 results
Explain the role of the two components in the TF-IDF weighting scheme.
tf-idf … measure of importance of a word to a document in a collection
tf-idf weight of a term is product of its tf weight and its idf weight
wt,d = (1 + log tft,d) · log(N / dft)
tft,d … term frequency (number of times that t occurs in d)
dft … document frequency (number of documents in the collection that t occurs in)
the fewer documents t appears in, the more informative it is
the collection has N documents in total
idf … inverse document frequency
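A minimal worked example of the tf-idf weight above with made-up counts (log base 10 is an assumption here, the choice of base is not essential):

```python
import math

N = 1_000_000          # hypothetical number of documents in the collection
tf_td = 3              # the term occurs 3 times in document d
df_t = 100             # the term occurs in 100 documents

w_td = (1 + math.log10(tf_td)) * math.log10(N / df_t)
print(round(w_td, 3))  # ≈ 5.908  (a rare term, moderately frequent in d)
```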
Explain length normalization in the vector space model. What is it useful for?
we represent the documents and the query as vectors in a vector space
it is a bad idea to use Euclidean distance as a measure of similarity
Euclidean distance is large for vectors of different lengths
instead, we use cosine similarity cos(q,d) = (q·d) / (∣q∣·∣d∣), where q,d are the query and document vectors
this normalizes the vector lengths
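A minimal sketch of cosine similarity between made-up tf-idf vectors:

```python
import math

def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (math.sqrt(sum(qi * qi for qi in q)) *
                  math.sqrt(sum(di * di for di in d)))

# made-up vectors over a 3-term vocabulary
print(round(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]), 3))  # 1.0 (same direction)
print(round(cosine([1.0, 2.0, 0.0], [0.0, 0.0, 5.0]), 3))  # 0.0 (no shared terms)
```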
Language data resources
Explain what a corpus is.
dataset of language resources, collection of texts
it is often annotated
it should be somehow balanced
there are special types of corpora: spoken, parallel, domain-specific, diachronic
Explain what annotation is (in the context of language resources). What types of annotation do you know? (2 points)
annotation = adding selected linguistic information in an explicit form to a corpus
raw texts are difficult to exploit
examples
morphological annotation – lemma, POS, other morphological categories
syntactic annotation – constituent trees, dependency trees
semantic annotation – dozens of approaches, no consensus so far
anaphora – “what does the pronoun refer to?”
word sense disambiguated corpora – which sense is used for a given polysemous word in a given context
sentiment corpora – positive/negative emotions induced by some expressions in a text
What are the reasons for variability of even basic types of annotation, such as the annotation of morphological categories (parts of speech etc.).
different languages may have different morphological categories
different linguistic traditions of different national schools
Explain what a treebank is. Why trees are used? (2 points)
treebank = a corpus in which the syntax and/or semantics of sentences is analyzed using tree-shaped data structures
why trees? we believe that…
trees are attractive data structures :D
sentences can be represented by discrete units and relations among them
some relations make more sense than others
there is an identifiable discrete structure hidden in each sentence
the structure must allow for various kinds of nestedness → recursivity → trees
two types of trees that are broadly used: constituency (phrase-structure) trees, dependency trees
Explain what a parallel corpus is. What kind of alignments can we distinguish? (2 points)
it contains texts in two (or more) languages
the texts are aligned (for example, we know which Czech and English documents are equivalent)
there are multiple possible levels of alignment
document level alignment
sentence level alignment
word level alignment
(morpheme level alignment?)
examples
The Rosetta Stone
CzEng
Canadian parliamentary transcripts (Hansard)
What is a sentiment-annotated corpus? How can it be used?
it captures the attitude (emotional polarity) of a speaker with respect to some topic/expression
examples
“it was an ok (neutral) club, but terribly crowded (negative)”
“Taiwan-made products stood a good chance (positive) of becoming even more competitive thanks to (positive) wider access to overseas markets”
obviously over-simplified, but highly demanded e.g. by the marketing industry
What is a coreference-annotated corpus?
it captures relations between expressions that refer to the same entity of the real world
for example, the expressions “Audi”, “the company”, and “the Audi company” all refer to one entity
example of a coreference-annotated corpus: Prague Dependency Treebanks
Explain how WordNet is structured?
it is a hyponymy forest composed of synsets (sets of synonymous words)
example of a branch
event
act, human action, human activity
action
change
motion, movement, move
locomotion, travel
run, running
dash, sprint
another subbranch
motion, movement, move
descent
jump, parachuting
Explain the difference between derivation and inflection?
inflection = ohýbání in Czech (skloňování = declension, časování = conjugation)
change in the form of a word to express a grammatical function/attribute (tense, mood, person, number, case, gender)
derivation = odvození in Czech
forming a new word from an existing word
usually prefixing and suffixing
inflection does not change the core meaning of the word
derivation produces a new word (a distinct lexeme), inflection produces grammatical variants of the same word
Evaluation measures in NLP
Give at least two examples of situations in which measuring a percentage accuracy is not adequate.
if the classes are imbalanced (e.g. a rare disease)
if there is more than one correct answer for each instance
if our system does not give exactly one answer for each instance
if there are some errors that are more wrong than others
Explain: precision, recall
let's assume our system either gives a prediction (“positive”) or does nothing (“negative”)
precision = if our system gives a prediction, what is its average quality
correct answers given ÷ all answers given
precision = tp / (tp + fp)
recall = what average proportion of all possible correct answers is given by our system
correct answers given ÷ all possible correct answers
recall = tp / (tp + fn)
What is F-measure, what is it useful for?
weighted harmonic mean of precision and recall
Fβ = (β² + 1)·P·R / (β²·P + R)
F1 = 2·P·R / (P + R)
why?
using both precision and recall means that we are in 2D
“is it better to have a system with P=0.8 and R=0.2, or P=0.9 and R=0.1?”
we want to be able to use a single scale → F-measure
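A small worked sketch of the precision, recall, and F1 formulas above, using made-up counts:

```python
# true positives (tp), false positives (fp), false negatives (fn)
tp, fp, fn = 40, 10, 60

precision = tp / (tp + fp)            # 0.8  – quality of the answers given
recall = tp / (tp + fn)               # 0.4  – coverage of all correct answers
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 3))   # 0.8 0.4 0.533
```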
What is k-fold cross-validation?
it is a special way to train and evaluate a (ML) model based on train/test data
used especially in the case of very small data, when a particular train/test division could have a huge impact on the measured quantities
we partition the data into k subsamples
we train the model based on k−1 subsamples and evaluate it using the last unused subsample → we get a score
we repeat this k times so that every time, we use a different subsample for evaluation → we get k scores and compute their average (this is our final score)
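A minimal sketch of k-fold cross-validation; train and evaluate are hypothetical placeholders for the actual model code:

```python
def cross_validate(data, k, train, evaluate):
    folds = [data[i::k] for i in range(k)]          # k roughly equal subsamples
    scores = []
    for i in range(k):
        test = folds[i]
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(train_set)                    # fit on k-1 folds
        scores.append(evaluate(model, test))        # score on the held-out fold
    return sum(scores) / k                          # final score = average
```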
Explain BLEU (the exact formula not needed, just the main principles).
BLEU = bilingual evaluation understudy
it is a common measure in machine translation
it measures similarity between a machine's output and a human translation
usually, there are multiple reference (human) translations
n-gram precision = C/N
where N … number of n-grams that the candidate translation contains
and C … number of n-grams from the candidate translation that exist in at least one reference translation
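A minimal sketch of the simplified n-gram precision C/N defined above (real BLEU additionally clips the counts and multiplies by a brevity penalty); the sentences are made up:

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, references, n):
    cand = ngrams(candidate.split(), n)
    ref_sets = [set(ngrams(r.split(), n)) for r in references]
    if not cand:
        return 0.0
    c = sum(1 for g in cand if any(g in rs for rs in ref_sets))
    return c / len(cand)                      # C / N

print(ngram_precision("the cat sat on mat",
                      ["the cat sat on the mat"], 2))   # 0.75
```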
(Transformer) each encoder layer has two sublayers (self-attention and feed-forward); each decoder layer has a third sublayer, the encoder-decoder attention
Why do we use positional encodings in the Transformer model?
the attention mechanism itself has no information about the order of the tokens
one way to solve this is to add positional encoding vectors to the token embeddings
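A minimal sketch of one common choice, the sinusoidal positional encoding from the original Transformer paper, which is added to the token embeddings:

```python
import math

def positional_encoding(max_len, d_model):
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimensions: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions: cosine
    return pe     # pe[pos] is added to the embedding of the token at `pos`
```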
What are residual connections in neural networks? Why do we use them?
we add the input of a sublayer (i.e., the output of the previous sublayer) to its output
we use them to make the learning numerically stable and to prevent the vanishing gradient problem
in a deep network, the gradient of an early layer's weights is a product of many factors (one per layer it passes through) → it can become vanishingly small
to solve this, we add the output of the previous sublayer to the output of the current one
it helps because addition passes the gradient through unchanged (the derivative of a sum routes the incoming gradient to both branches)
Use formulas to express the loss function for training sequence labeling tasks.
we use cross-entropy between estimated distribution P and one-hot ground-truth distribution T
for a single item
L = H(T,P) = −E_{i∼T} log P(i) = −∑i T(i)·log P(i) = −log P(y*)
y∗ … ground truth
for a sequence of T items
L = −∑t=1..T log Pt(y*t)
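A minimal sketch of the sequence loss above as the negative log-probability of the gold labels; the distributions are made up:

```python
import math

probs = [                       # P_t(i) for a 3-label tagset, T = 2 tokens
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
]
gold = [0, 1]                   # ground-truth label indices y*_t

loss = -sum(math.log(probs[t][gold[t]]) for t in range(len(gold)))
print(round(loss, 4))           # -(log 0.7 + log 0.8) ≈ 0.5798
```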
Explain the pre-training procedure of the BERT model. (2 points)
it is a “masked language model”
it is trained as sequence labeling: change some of the input tokens, and the labeler guesses the original tokens
once trained, throw away the labeling head and use the last layer as the representation
for some of the words in the training data (randomly sampled; 15 % in the original paper), we do one of the following things:
with 80% probability replace the word with special MASK token
with 10% probability replace the word with a random token
with 10% probability keep the original word
the classifier should then predict the correct (original) word
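A minimal sketch of the 15% / 80-10-10 masking described above; the vocabulary and token strings are made up:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_tokens(tokens, p_select=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < p_select:        # token selected for prediction
            labels.append(tok)                # the classifier must recover it
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(VOCAB))  # 10%: random token
            else:
                inputs.append(tok)                   # 10%: keep the original
        else:
            inputs.append(tok)
            labels.append(None)               # not predicted, no loss here
    return inputs, labels

print(mask_tokens("the cat sat on the mat".split()))
```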
Explain what is the pre-train and finetune paradigm in NLP.
two steps: pre-training and finetuning
pre-training
do once, large (unlabeled) data, long time
for example, masked language modeling
finetuning
small data (task-specific), fast
to fulfill a task-specific objective
Describe the task of named entity recognition (NER). Explain the intuition behind CRF models compared to standard sequence labeling. (2 points)
the goal of named entity recognition is to find named entities in the text
examples of named entities: person, location, organization, event, product, skill, address, phone, email, URL, IP, datetime, quantity
it is linked to coreference resolution
usage: entity linking, indexing text for search, direct use in smart devices (making addresses clickable etc.)
the entities can span multiple words; they can be nested
there are multiple schemes to deal with that
example (IOB scheme)
There are over 1000 compositions by Johan Sebastian Bach.
O O B-QUANT I-QUANT O O B-PERSON I-PERSON I-PERSON O
standard tagging
conditional independence assumption
but tags have their own internal grammar: e.g., I-PERSON cannot follow O
CRF (Conditional Random Fields)
instead of doing probability distribution over each token independently, we do a probability distribution over all possible tagging sequences
each possible tagging sequence gets a score expressing how probable it is
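Returning to the IOB example above, a minimal sketch of producing such tags from (made-up) entity spans:

```python
tokens = "There are over 1000 compositions by Johan Sebastian Bach .".split()
entities = [(2, 4, "QUANT"), (6, 9, "PERSON")]   # (start, end) spans, end exclusive

tags = ["O"] * len(tokens)
for start, end, label in entities:
    tags[start] = "B-" + label                   # B- marks the first token
    for i in range(start + 1, end):
        tags[i] = "I-" + label                   # I- marks the continuation

print(list(zip(tokens, tags)))
```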
Explain how does the self-attention differ in encoder-only and decoder-only models.
encoder-only models
predict sentence labeling (e.g. POS tags)
the self-attention mechanism can look in all directions (at the previous and following tokens)
decoder-only models
predict the next word
we have to ensure that the self-attention mechanism cannot look at the future words
we need to use a mask (a triangular matrix with negative infinities)
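A minimal sketch of that triangular (causal) mask; it is added to the attention scores before the softmax, so −inf becomes weight 0:

```python
def causal_mask(n):
    # position i may attend only to positions j <= i
    return [[0.0 if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]

for row in causal_mask(4):
    print(row)
# [0.0, -inf, -inf, -inf]
# [0.0, 0.0, -inf, -inf]
# ...
```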
Machine translation fundamentals
Why is MT difficult from linguistic point of view? Provide examples and explanation for at least three different phenomena. (2 points)
ambiguity and word senses
“The plant is next to the bank.”
”Ženu holí stroj.”
a single word can have several meanings
target word forms
tenses, cases, genders
it is necessary to select the right word form
“The cat is on the mat.” → kočka
“He saw a cat.” → kočku
negation
French negation wraps around the verb: “Je ne parle pas français.” (I do not speak French)
Czech negation is doubled: “Nemám žádné námitky.” (literally “I don't have no objections” = I have no objections)
phrases with negation can be ambiguous
pronouns
English requires an explicit subject; in Czech, the information is implicit in the verb form
Četl knihu. He read a book.
Spal jsem. I slept.
the gender must match the referent
He saw a book; it was red.
Viděl knihu. Byla červená. (kniha is feminine, so the English “it” surfaces as feminine agreement)
tricky “svůj” (reflexive possessive that refers back to the subject)
Could I use your cell phone?
an old machine translation: Mohl bych používat svůj mobilní telefon? (= “Could I use my own cell phone?” – “svůj” wrongly points back to the subject “I”)
coordination and apposition, word order
“This book is dedicated to my parents, Ayn Rand and God.”
space of possible translations
there are many correct translations
idioms
cultural context (local time, customs, etc.)
Why is MT difficult from computational point of view?
the space of possible translations is quite large – there are many bad translations but also many acceptable translations
there is a lot of ambiguity in language; language is messy (not all sentences are grammatical)
evaluation is hard
Briefly describe at least three methods of manual MT evaluation. (1-2 points)
ranking of constituents on the scale 1–5
we see the original sentence, reference translation, and several machine translations
we assign a score to the highlighted part (constituent) of the translated sentence
this way, we do not evaluate overall coherence, which can be important
ranking sentences
scoring the whole sentences
eye tracking can be used
long sentences are hard to rank
candidates are incomparably poor :(
comprehension test
blind editing + correctness check
first, we look at the result (without seeing the original sentence) and edit it to make it as fluent as possible
second, the edited sentences are checked against the originals (whether the meaning is the same)
task-based
does MT output help as much as the original?
do I dress appropriately given a translated weather forecast?
evaluation by flagging errors
classification of MT errors
we label the words/phrases in the MT output using the corresponding errors
Describe BLEU. 1 point for the core properties explained, 1 point for the (commented) formula.
it is based on the geometric mean of n-gram precisions (usually n = 1…4) and a brevity penalty
BLEU = BP · exp( ∑n=1..4 (1/4)·log pn ), where pn is the (clipped) n-gram precision
BP = min(1, e^(1−r/c)) penalizes candidates shorter than the reference (c = candidate length, r = reference length)
log-linear model (as used in phrase-based MT): each feature function hm(e,f) relates the source sentence f to the target sentence e
model weights λm specify the relative importance of features
the constant denominator is not needed in maximization
ê = argmax_e exp( ∑m λm·hm(e,f) )
log-linear model → Noisy channel
we consider equal weights and only two features
hTM(e,f)=logp(f∣e)
hLM(e,f)=logp(e)
ê = argmax_e exp( ∑m λm·hm(e,f) )
= argmax_e exp( hTM(e,f) + hLM(e,f) )
= argmax_e exp( log p(f∣e) + log p(e) )
= argmax_e p(f∣e)·p(e)
Describe the loop of weight optimization for the log-linear model as used in phrase-based MT.
MERT (Minimum Error Rate Training) loop
(start with some initial weights λ1,…,λM)
translate input
evaluate candidates using an external score
find new weights for a better match of external and internal score
if the internal weights stayed the same, stop; otherwise loop
Neural machine translation
Describe the critical limitation of PBMT that NMT solves. Provide example training data and example input where PBMT is very likely to introduce an error.
PBMT = phrase-based machine translation
limitation: too strong phrase-independence assumption
but phrases do depend on each other
example training data
Nemám žádného psa. → I have no dog.
Viděl kočku. → He saw a cat.
example input
Nemám kočku. (correct: “I don't have a cat.”; PBMT is likely to output “I have a cat.”)
the negation is expressed not only as “žádného” but also in the verb
Use formulas to highlight the similarity of NMT and LMs.
if we modeled p(e∣f) directly…
p(e∣f) = p(e1,…,eI ∣ f) = ∏i=1..I p(ei ∣ e1,…,ei−1, f)
this is “just a cleverer language model”
p(e) = ∏i=1..I p(ei ∣ e1,…,ei−1)
a neural network with a single hidden layer (possibly huge) can approximate any continuous function to any precision
we will use NNs to model p(ei∣e1,…,ei−1,f)
“what is the next word in the translated sentence?”
Describe, how words are fed to current NMT architectures and explain why is this beneficial over 1-hot representation.
using word embeddings
each word is mapped to a dense vector
in practice 300–2000 dimensions are used (they have no clear interpretation)
embeddings are trained for each particular task
in 1-hot representation, there are no relations among the words; they are all equally close/far
word embeddings somehow reflect the meanings of the words
Sketch the structure of an encoder-decoder architecture of neural MT, remember to describe the components in the picture (2 points)
the encoder first processes the source sentence (e.g., English): it embeds the tokens and encodes them into hidden-state vectors
the decoder then generates the target sentence (e.g., French) word by word, conditioned on the encoder's representation
What is the difference in RNN decoder application at training time vs. at runtime?
at runtime, the decoder generates the next word based on the previous generated word and its internal (hidden) state
at training time, we “force” the decoder to produce the target sentence (kind of)
the decoder generates the probability distribution of the next word based on the previous target word and its internal state
we get the loss function from the probability distributions (their difference from the one-hot distributions of target words)
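A minimal sketch contrasting the two decoder loops; decoder_step is a hypothetical function that returns a distribution over the target vocabulary and a new hidden state:

```python
import math

def train_loss(decoder_step, state, target_sentence):
    """Teacher forcing: always feed the gold previous token."""
    loss, prev = 0.0, "<s>"
    for gold in target_sentence:
        probs, state = decoder_step(prev, state)
        loss += -math.log(probs[gold])       # cross-entropy w.r.t. the gold word
        prev = gold                          # gold token, not the model's guess
    return loss

def greedy_decode(decoder_step, state, max_len=50):
    """Runtime: feed the model's own previous prediction."""
    out, prev = [], "<s>"
    for _ in range(max_len):
        probs, state = decoder_step(prev, state)
        prev = max(probs, key=probs.get)     # pick the most probable word
        if prev == "</s>":
            break
        out.append(prev)
    return out
```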
What problem does attention in NMT address? Provide the key idea of the method.
sentences of arbitrary length fit badly into a single fixed-size vector
without attention
encoder needs to compress the information contained in the original (source) sentence
for longer sentences, important details can be lost
attention allows the decoder to look at different parts of the source sentence at each decoding step
it uses a weighted sum
What problem/task do both RNN and self-attention resolve and what is the main benefit of self-attention over RNN?
the problem of representing source sentence (of an arbitrary length) using a fixed-size vector
RNN
it has turned out that reading input backwards (in RNNs) works better: the information contained in the early words is not diluted that much when generating the first words of the translation
bi-directional RNN can be used to “attend” to all hidden states
we could use the two outputs of the encoder – reading the sentence forwards and backwards
but the middle words would be still quite lost in the representation
self-attention
each state can directly attend to any other state: a sub-network predicts the importance (weight) of the other states at each step
main benefit over RNN: no sequential recurrence → a constant path length between any two positions and easy parallelization
What are the three roles each state at a Transformer encoder layer takes in self-attention.
query, key, value
to get any of the roles based on the word (embedding), we multiply the word by a corresponding weight matrix (WQ, WK, or WV)
by multiplying queries and keys (and softmaxing them), the self-attention mechanism finds which words are relevant – then gets their values
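A minimal numpy sketch of single-head self-attention; the random matrices stand in for the learned parameters WQ, WK, WV:

```python
import numpy as np

def self_attention(X, WQ, WK, WV):
    Q, K, V = X @ WQ, X @ WK, X @ WV                  # the three roles
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # how relevant is each word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V                                # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                          # 5 tokens, dim 16
WQ, WK, WV = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, WQ, WK, WV).shape)            # (5, 16)
```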
What are the three uses of self-attention in the Transformer model?
encoder-decoder attention
Q … previous decoder layers
K=V … outputs of encoder
decoder positions attend to all positions of the input
encoder self-attention
Q=K=V … outputs of the previous layer of the encoder
encoder positions attend to all positions of previous layer
decoder self-attention
Q=K=V … outputs of the previous decoder layer
masking used to prevent depending on future outputs
decoder attends to all its previous outputs
Provide an example of NMT improvement that was assumed to come from additional linguistic information but occurred also for a simpler reason.
interleaving the generated output with syntax tags
it was better than generating output without tags
but with random tags, the model achieved even better results (when using seq2seq)
likely reason: the model had “more thinking time” (the decoder was “deeper”)
requiring that one of the self-attention heads in one of the layers corresponds to dependency tree of the source sentence
it improved the performance of the model
but again, producing a “linear” (dummy) tree improved the performance even more
Summarize and compare the strategy of "classical statistical MT" vs. the strategy of neural approaches to MT.
goal of “classical” SMT
find minimum translation units / graph partitions
such that they are frequent across many sentence pairs
without imposing (too hard) constraints on reordering
in an unsupervised fashion
there is an independence assumption between the translation units
strategy of neural approaches: train a single network end-to-end to model p(ei ∣ e1,…,ei−1, f), i.e. the whole target sentence conditioned on the whole source sentence, with learned continuous representations and no independence assumption between translation units