Exam: kartičky | natural-language-processing

Basic notions from probability and information theory

What are the three basic properties of a probability function?

$p:2^\Omega\to[0,1]$
$p(\Omega)=1$
it holds $p(\bigcup_i A_i)=\sum_i p(A_i)$ for disjoint events $A_1,\dots,A_n\subseteq\Omega$
( $\Omega$ is a sample space / set of possible basic outcomes)

Basic notions from probability and information theory

When do we say that two events are (statistically) independent?

$A,B$ are independent iff $p(A\cap B)=p(A)\cdot p(B)$
or $p(B\mid A)=p(B)$

Basic notions from probability and information theory

Show how Bayes' Theorem can be derived.

$p(A\cap B)=p(B\cap A)$ $p (A \cap B) = p (B \cap A)$
- or $p(A,B)=p(B,A)$
$p(A\mid B)\cdot p(B)=p(B\mid A)\cdot p(A)$ $p (A ∣ B) \cdot p (B) = p (B ∣ A) \cdot p (A)$
- from the definition of conditional probability $p(A\mid B)=\frac{p(A,B)}{p(B)}$
theorem: $p(A\mid B)=\frac{p(B\mid A)\cdot p(A)}{p(B)}$

Basic notions from probability and information theory

Explain Chain Rule.

chain rule: $p(A_1,A_2,A_3,\dots,A_n)=p(A_1\mid A_2,A_3,\dots,A_n)\cdot p(A_2\mid A_3,\dots,A_n)\cdot\ldots$ $p (A_{1}, A_{2}, A_{3}, \dots, A_{n}) = p (A_{1} ∣ A_{2}, A_{3}, \dots, A_{n}) \cdot p (A_{2} ∣ A_{3}, \dots, A_{n}) \cdot \dots$
- $\cdot\,p(A_{n-1}\mid A_n)\cdot p(A_n)$
if follows directly from the definition of conditional probability (or from Bayes rule) as $p(A_1,(A_2,\dots,A_n))=p(A_1\mid (A_2,\dots,A_n))\cdot p(A_2,\dots,A_n)$ etc.

Basic notions from probability and information theory

Explain the notion of Entropy (formula expected too).

measure of uncertainty
higher entropy → higher uncertainty → higher "surprise"
$H(X)=-\sum_{x\in\Omega}p(x)\log_2 p(x)$
or $H(X)=\mathbb E[\log_2(\frac1{p(X)})]$
coding interpretation: the least (average) number of bits needed to encode a message
note that $H(X)\geq 0$
$H(X)=0$ if a result of an experiment is known ahead of time
there is no upper bound in general
nothing can be more uncertain than the uniform distribution

Basic notions from probability and information theory

Explain Kullback-Leibler distance (formula expected too).

relative entropy
“if we estimate $p$ using a series of experiments, how big error are we making?”
$D(p\|q)=\sum_{x\in\Omega} p(x)\log_2\frac{p(x)}{q(x)}=\mathbb E_p[\log_2\frac{p(x)}{q(x)}]$
it is not symmetric and does not satisfy triangle inequality
- it is better to use “KL divergence” instead of ”KL distance”
$H(p)+D(p\|q)$ bits needed to encode $p$ using $q$

Basic notions from probability and information theory

Explain Mutual Information (formula expected too).

$I(X,Y)=D(p(x,y)\| p(x)p(y))$
how much our knowledge of $Y$ $Y$ contributes (on average) to easing the prediction of $X$ $X$
- by how many bits the knowledge of $Y$ lowers the entropy $H(X)$
- $I(X,Y)=H(X)-H(X\mid Y)=H(Y)-H(Y\mid X)$
how $p(x,y)$ deviates from $p(x)p(y)$
$I(X,Y)=\sum_{x\in\Omega}\sum_{y\in\Psi} p(x,y)\log_2\frac{p(x,y)}{p(x)p(y)}$
measured in bits
$I(X,X)=H(X)$
$I(X,Y)=I(Y,X)$
$I(X,Y)\geq 0$

Language models, the noisy channel model

Explain the notion of The Noisy Channel.

idea: we have the noisy output, we want to “predict” the original input
we want to find such an original input that $p(\text{original}\mid\text{noisy})$ is maximum
Bayes formula: $p(A\mid B)=\frac{p(B\mid A)p(A)}{p(B)}$
The Golden Rule: $A_\text{best}=\text{argmax}_A\;p(B\mid A)p(A)$
if used for speech recognition
- $p(B\mid A)$ … acoustic model
- $p(A)$ … language model

Language models, the noisy channel model

Explain the notion of the n-gram language model.

it would be impractical to compute the probability of a sequence of words as $p(W)=p(w_1,w_2,\dots,w_n)=$ $p (W) = p (w_{1}, w_{2}, \dots, w_{n}) =$ $p(w_1)\cdot p(w_2\mid w_1)\cdot\ldots\cdot p(w_n\mid w_1,w_2,\dots,w_{n-1})$ $p (w_{1}) \cdot p (w_{2} ∣ w_{1}) \cdot \dots \cdot p (w_{n} ∣ w_{1}, w_{2}, \dots, w_{n - 1})$
- there would be too many parameters in the model
in an $n$ -gram language model, we only remember $n-1$ previous words
examples
- 0-gram LM (uniform model) … $p(w)=1/|V|$
- 1-gram LM … we store $p(w)$ for every word
- 2-gram LM … $p(w_i\mid w_{i-1})$ for every pair of words
- 3-gram LM … $p(w_i\mid w_{i-2},w_{i-1})$ etc.
so in 3-gram LM, we estimate $p(W)=p(w_1\mid s,s)\cdot p(w_2\mid s,w_1)\cdot p(w_3\mid w_1,w_2)$ $\cdot\, p(w_4\mid w_2,w_3)\cdot\ldots\cdot p(w_n\mid w_{n-2},w_{n-1})$

Language models, the noisy channel model

Describe how Maximum Likelihood estimate of a trigram language model is computed. (2 points)

we estimate $p(w_i\mid w_{i-2},w_{i-1})=\frac{c_3(w_{i-2},w_{i-1},w_i)}{c_2(w_{i-2},w_{i-1})}$
to get $c_3$ , we count sequences of three words in training data
to get $c_2$ $c_{2}$ we count sequences of two words
- we can either get it from $c_3$ $c_{3}$
  - $c_2(y,z)=\sum_w c_3(y,z,w)$
- or count differently at the beginning of the data

Language models, the noisy channel model

Why do we need smoothing (in language modelling)?

in the trigram model, there will be always some zeros
however, we cannot be sure that the trigram really does not exist in the language as our data is possibly incomplete
when calculating the probability of a sequence, one trigram with zero probability makes the whole sequence have zero probability, which is impractical
we need to make the system more robust and avoid infinite cross entropy (this happens when test data contain an event which has not been seen in training data)
solution: redistribute probabilities (take from non-zero words, give to words with zero probabilities)

Language models, the noisy channel model

Give at least two examples of smoothing methods. (2 points)

add 1
- $\forall w\in V:p'(w\mid h)=\frac{c(h,w)+1}{c(h)+|V|}$ $\forall w \in V : p^{'} (w ∣ h) = \frac{c ( h , w ) + 1}{c ( h ) + ∣ V ∣}$
  - $h$ … history
  - $V$ … vocabulary
- problem if $|V|>c(h)$ , it happens often
add less than 1
- $p'(w\mid h)=\frac{c(h,w)+\lambda}{c(h)+\lambda |V|}$ $p^{'} (w ∣ h) = \frac{c ( h , w ) + λ}{c ( h ) + λ ∣ V ∣}$
  - $\lambda \lt 1$
- we choose $\lambda$ $λ$ in a way that minimizes cross-entropy of $p'$ $p^{'}$ relative to holdout (dev) data
  - cross-entropy of $q$ relative to $p$ : $H(p,q)=-\mathbb E_p[\log q]$
linear interpolation, multiple lambdas
- $p'(w_i\mid w_{i-2},w_{i-1})=\lambda_3p_3(w_i\mid w_{i-2},w_{i-1})$ $p^{'} (w_{i} ∣ w_{i - 2}, w_{i - 1}) = λ_{3} p_{3} (w_{i} ∣ w_{i - 2}, w_{i - 1})$
  - $+\ \lambda_2p_2(w_i\mid w_{i-1})+\lambda_1 p_1(w_i)+\lambda_0/|V|$
- we can normalize it so that…
  - $\lambda_i\gt 0$
  - $\sum_i\lambda_i=1$ $\sum_{i} λ_{i} = 1$
    - $\lambda_0=1-\lambda_1-\lambda_2-\lambda_3$
- estimated using expectation maximization (EM) algorithm
- another version: bucketed smoothing

Morphological analysis

What is a morphological tag? List at least five features that are often encoded in morphological tag sets.

morphological tag = a summary of the grammatical information about a specific word in the given context
features: part of speech, case (pád) number, gender, person, mood (způsob), tense, animacy (životnost), degree, voice (rod), aspect (vid), polarity (zápor), …

Morphological analysis

List the open and closed part-of-speech classes and explain the difference between open and closed classes.

open classes
- verbs (non-auxiliary), nouns, adjectives, adjectival adverbs, interjections
- take new words
- word formation (derivation) across classes
closed classes
- pronouns/determiners, adpositions, conjunctions, particles, pronomial adverbs, auxiliary and modal verbs, numerals
- words can be enumerated (even though these classes also evolve but over longer period of time)
- the words are typically not base for derivation

Morphological analysis

Explain the difference between a finite-state automaton and a finite-state transducer. Describe the algorithm of using a finite-state transducer to transform a surface string to a lexical string (pseudocode or source code in your favorite programming language). (2 points)

finite-state machine is a five-tuple $(A,Q,P,q_0,F)$ $(A, Q, P, q_{0}, F)$
- input alphabet $A$ , set of states $Q$ , transition function $P$ , initial state $q_0$ , accepting states $F$
- it can accept or reject a word
morphology of a language can be described using a finite-state automaton
attaching a suffix may result in phoneme or grapheme (spelling) changes
- we call this phonology for simplicity
- plural of baby is not *babys but babies
we will use two-level morphology: upper (lexical) language and lower (surface) language
finite-state transducer
- a type of finite-state automaton that maps between two sets of symbols
- pairs ( $r:s$ $r : s$ ) from finite alphabets $R$ $R$ and $S$ $S$
  - $R$ … lexical language
  - $S$ … surface language
- tasks: analysis (get lexical string based on a surface string), generation (generate surface string from a lexical string)
difference
- FSA defines a set of accepted strings (lexicon)
- FST defines a relation between sets of strings (lexical string → surface string or surface → lexical)
algorithm
- initialize set of paths $P=\set{}$
- read input symbols one-by-one
- for each input symbol $x$ $x$ :
  - generate all lexical symbols $y$ that may correspond to the empty symbol $y:0$
  - repeat until the maximum possible number of subsequent zeros is reached:
    - extend all paths in $P$ by all corresponding pairs $y:0$
    - check all new extensions against the phonological transducers and the lexical automaton; remove disallowed paths (partial solutions) from $P$
  - generate all possible lexical symbols $z$ for the current input symbol $x$ (pairs $z:x$ ); extend each path in $P$ by all such pairs
  - check all paths in $P$ (the next transition in FST/FSA); remove impossible paths
- collect glosses from the lexicon for all paths that survived

Morphological analysis

Give an example of a phonological or an orthographical change caused by morphological inflection (any natural language). Describe the rule that would take care of the change during analysis or generation. It is not required that you draw a transducer, although drawing a transducer is one of the possible ways of describing the rule.

Czech: káď → kádě
ď:d <= _ +:0 e:ě
- káď+e (lexical)
- kád0ě (surface)
complete version of the rule: [ď:d | ň:n | ť:t] <=> _ +:0 [e:ě | i:i | í:í]
another example: nať → nati

Morphological analysis

Give an example of a long-distance dependency in morphology (any natural language). How would you handle it in a morphological analyzer?

umlauts in German plurals
Buch → Bücher
simplified rule: u:ü <= _ c:c h:h +:0 e:e r:r
- Buch+er (lexical)
- Büch0er (surface)
context should contain +:0 and perhaps test end of word (#)
this is too simplified, the suffix must be plural and the stem must be marked for plural umlauting (Kocher and Besucher are correct!)
capturing long-distance dependencies is clumsy
- Kraut / Kräuter has different intervening symbols → different rule
- a transducer could be more general but it would probably overgenerate

Syntactic analysis

Describe dependency trees, constituent trees, differences between them and phenomena that must be addressed when converting between them. (2 points)

constituent tree
- sentence is divided to phrases (contituents)
- recursive: phrases are divided into smaller phrases
- the smallest phrases are words
dependency tree
- similar to “větný rozbor” taught in Czech schools (there are some differences, e.g., in dependency trees, the subject depends on the verb)
- graph of dependencies between words (constituents)
- head of phrase = governing node, parent node
- the other nodes are dependent nodes
differences
- phrase (constituent) trees
  - usually do not mark the head
  - may not mark the function of the constituent in the superordinate constituent
  - cannot capture nonprojective relations (see Úvod do počítačové lingvistiky)
- dependency trees
  - do not show nonterminals (phrase types, like NP, VP, …) nor any other phrase-level features
  - do not show “how the sentence is generated” (order, recursion, proximity of constituents)

Syntactic analysis

Give an example of a sentence (in any natural language) that has at least two plausible, semantically different syntactic analyses (readings). Draw the corresponding dependency trees and explain the difference in meaning. Are there other additional readings that are less probable but still grammatically acceptable? (2 points)

I saw the man with a telescope.
- saw with a telescope × man with a telescope
Ženu holí stroj.
- (já) ženu (čím?) holí (koho, co?) stroj
  - ženu – root/verb
  - holí – adverbial
  - stroj – object
- (koho?) ženu (čím?) holí (ty) stroj
  - stroj – root/verb
  - ženu – subject
  - holí – adverbial
- (koho?) ženu holí (kdo, co?) stroj
  - holí – root/verb
  - stroj – subject
  - ženu – object
- there will be always a special node for the period (punctuation)

Syntactic analysis

What is coordination? Why is it difficult in dependency parsing? How would you capture coordination in a dependency structure? What are the advantages and disadvantages of your solution?

examples
- chickens, hens, rabbits, cats, and dogs
- new or even newer
- quickly and finely
- either now or later
- the sun was shining, so we went outside
there is no real head → difficult to capture in dependency trees
the coordinate phrases (conjuncts) are usually of the same type
my solution
- add a special node acting as a root of the coordination
- advantage: we do not emphasize any of the conjuncts as the most important
- disadvantage: we add a new element that was not in the original sentence
- it is not clear where the connective(s) should be
see Universal Dependencies for a different possible solution

Syntactic analysis

What is ellipsis? Why is it difficult in parsing? Give examples of different kinds of ellipsis (any natural language).

a phrase omitted from the sentence (its surface form) although it is present in the underlying meaning (deep structure)
to parse the sentence correctly, we need to deduce some missing information from the context
examples
- Whom did you see there? — Peter. (missing verb)
- Czech and German researchers discussed… (it means that Czech researchers and German researchers discussed; another possible meaning is that every researcher was both Czech and German)
- The Penguins are leading 4:0, while the Colorado Avalanches only 3:2. (missing verb in the second part)
- Sedím. (omitted subject; the information is contained in the verb form)

Information retrieval

Explain the difference between information need and query.

information need
- a topic that the user desires to know more about
- a vague idea of what the user is looking for
query
- what the user conveys to the computer in an attempt to communicate the information need
- formal statement of information need

Information retrieval

What is inverted index and what are the optimal data structures for it?

for each term in the dictionary, we store a list of all documents that contain it (postings; ids of such documents)
the lists of documents should be sorted so that we can easily merge/intersect them
an optimal data structure for storing an inverted index is a hash table or a tree (binary search tree, B-tree, …)

Information retrieval

What is stopword and what is it useful for?

stop words = extremely common words which would appear to be of little value in helping select documents matching a user need
examples: a, an, and, are, as, at, be, by, for, from, …
it may be useful to skip the words to save the space needed to store the postings
however, you need stop words for phrase queries like “King of Denmark”

Information retrieval

Explain the bag-of-word principle?

we do not consider the order of words in a document
we only store the information whether the document contains a given word (and how many such words it contains)

Information retrieval

What is the main advantage and disadvantage of boolean model.

advantage
- the model is simple and fast; we do not need any scoring function
- good for experts: precise understanding of the needs and collection
- good for applications: can easily consume thousands of results
disadvantage
- not good for the majority of users
- most users are not capable or lazy to write Boolean queries
- Boolean queries often result in either too few or too many results
- in Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits
solution: ranked retrieval
- we score the documents by relevance
- show the top 10 results

Information retrieval

Explain the role of the two components in the TF-IDF weighting scheme.

tf-idf … measure of importance of a word to a document in a collection
tf-idf weight of a term is product of its tf weight and its idf weight
- $w_{t,d}=(1+\log\mathrm{tf}_{t,d})\cdot\log\frac{N}{\mathrm{df}_t}$
$\mathrm{tf}_{t,d}$ … term frequency (number of times that $t$ occurs in $d$ )
$\mathrm{df}_t$ $df_{t}$ … document frequency (number of documents in the collection that $t$ $t$ occurs in)
- the fewer documents $t$ appears in, the more informative it is
- the collection has $N$ documents in total
- idf … inverse document frequency

Information retrieval

Explain length normalization in vector space model what is it useful for?

we represent the documents and the query as vectors in a vector space
it is a bad idea to use Euclidean distance as a measure of similarity
- Euclidean distance is large for vectors of different lengths
instead, we use cosine similarity $\cos(q,d)=\frac{q\cdot d}{|q|\cdot |d|}$ $cos (q, d) = \frac{q \cdot d}{∣ q ∣ \cdot ∣ d ∣}$ where $q,d$ $q, d$ are vectors
- this normalizes the vector lengths

Language data resources

Explain what a corpus is.

dataset of language resources, collection of texts
it is often annotated
it should be somehow balanced
there are special types of corpora: spoken, parallel, domain-specific, diachronic

Language data resources

Explain what annotation is (in the context of language resources). What types of annotation do you know? (2 points)

annotation = adding selected linguistic information in an explicit form to a corpus
- raw texts are difficult to exploit
examples
- morphological annotation – lemma, POS, other categories)
- syntactic annotation – constituent trees, dependency trees
- semantic annotation – dozens of approaches, no consensus so far
- anaphora – “what does the pronoun refer to?”
- word sense disambiguated corpora – which sense is used for a given polysemous word in a given context
- sentiment corpora – positive/negative emotions induced by some expressions in a text

Language data resources

What are the reasons for variability of even basic types of annotation, such as the annotation of morphological categories (parts of speech etc.).

different languages may have different morphological categories
different linguistic traditions of different national schools

Language data resources

Explain what a treebank is. Why trees are used? (2 points)

treebank = a corpus in which the syntax and/or semantics of sentences is analyzed using tree-shaped data structures
why trees? we believe that…
- trees are attractive data structures :D
- sentences can be represented by discrete units and relations among them
- some relations make more sense than others
- there is an identifiable discrete structure hidden in each sentence
- the structure must allow for various kinds of nestedness → recursivity → trees
two types of trees that are broadly used: constituency (phrase-structure) trees, dependency trees

Language data resources

Explain what a parallel corpus is. What kind of alignments can we distinguish? (2 points)

it contains texts in two (or more) languages
the texts are aligned (for example, we know which Czech and English documents are equivalent)
there are multiple possible levels of alignment
- document level alignment
- sentence level alignment
- word level alignment
- (morpheme level alignment?)
examples
- The Rosetta Stone
- CzEng
- Canadian parliamentary transcripts (Hansard)

Language data resources

What is a sentiment-annotated corpus? How can it be used?

it captures the attitude (emotional polarity) of a speaker with respect to some topic/expression
examples
- “it was an ok (neutral) club, but terribly crowded (negative)”
- “Taiwan-made products stood a good chance (positive) of becoming even more competitive thanks to (positive) wider access to overseas markets”
obviously over-simplified, but highly demanded e.g. by the marketing industry

Language data resources

What is a coreference-annotated corpus?

it captures relations between expressions that refer to the same entity of the real world
- for example, the expressions “Audi”, “the company”, and “the Audi company” all refer to one entity
example of a coreference-annotated corpus: Prague Dependency Treebanks

Language data resources

Explain how WordNet is structured?

it is a hyponymy forest composed of synsets (sets of synonymous words)
example of a branch
- event
- act, human action, human activity
- action
- change
- motion, movement, move
- locomotion, travel
- run, running
- dash, sprint
another subbranch
- motion, movement, move
- descent
- jump, parachuting

Language data resources

Explain the difference between derivation and inflection?

inflection = ohýbání (skloňování, časování)
- change in the form of a word to express a grammatical function/attribute (tense, mood, person, number, case, gender)
derivation = odvození
- forming a new word from an existing word
- usually prefixing and suffixing
inflection does not change the core meaning of the word
derivation produces a new word (a distinct lexeme), inflection produces grammatical variants of the same word

Evaluation measures in NLP

Give at least two examples of situations in which measuring a percentage accuracy is not adequate.

if the classes are imbalanced (e.g. a rare disease)
if there is more than one correct answer for each instance
if our system does not give exactly one answer for each instance
if there are some errors that are more wrong than others

Evaluation measures in NLP

Explain: precision, recall

let's assume our system either gives prediction (”positive”) or does nothing (“negative”)
precision = if our systems gives a prediction, what’s its avarage quality
- correct answers given ÷ all answers given
- $\frac{tp}{tp+fp}$
recall = what average proportion of all possible correct answers is given by our system
- correct answers given ÷ all possible correct answers
- $\frac{tp}{tp+fn}$

Evaluation measures in NLP

What is F-measure, what is it useful for?

weighted harmonic mean of precision and recall
$F_\beta=\frac{(\beta^2+1)PR}{\beta^2P+R}$
$F_1=\frac{2PR}{P+R}$
why?
- using both precision and recall means that we are in 2D
- “is it better to have a system with $P=0.8$ and $R=0.2$ , or $P=0.9$ and $R=0.1$ ?”
- we want to be able to use a single scale → F-measure

Evaluation measures in NLP

What is k-fold cross-validation?

it is a special way to train and evaluate a (ML) model based on train/test data
used especially in the case of very small data, when an actual train/test division could have huge impact on the measured quantities
we partition the data into $k$ subsamples
we train the model based on $k-1$ $k - 1$ subsamples and evaluate it using the last unused subsample → we get a score
- we repeat this $k$ times so that every time, we use a different subsample for evaluation → we get $k$ scores and compute their average (this is our final score)

Evaluation measures in NLP

Explain BLEU (the exact formula not needed, just the main principles).

BLEU = bilingual evaluation understudy
it is a common measure in machine translation
it measures similarity between a machine's output and a human translation
usually, there are multiple reference (human) translations
n-gram precision = $C/N$ $C / N$
- where $N$ … number of n-grams that the candidate translation contains
- and $C$ … number of n-grams from the candidate translation that exist in at least one reference translation
brevity penalty
- shorter sentences naturally achieve better n-gram precision
- to make it fair, we penalize them
to compute BLEU score, we get 1-gram, 2-gram, 3-gram, and 4-gram precisions, compute their geometric mean, and multiply it by the brevity penalty

Evaluation measures in NLP

Explain the purpose of brevity penalty in BLEU.

shorter sentences naturally achieve better n-gram precision
to make it fair, we penalize them

Evaluation measures in NLP

What is Labeled Attachment Score (in parsing)?

percentage of words that are assigned the correct head and the correct dependency tag (label)
this measure is used to evaluate generated dependency trees
- head = the parent node of a given word
- label = the relation between the head and the given word

Evaluation measures in NLP

What is Word Error Rate (in speech recognition)?

it is a common metric for speech recognition
the number of recognized words can be quite different from the number of true words
$WER=\frac{S+D+I}{S+D+C}$ $W ER = \frac{S + D + I}{S + D + C}$
- $S$ … substituted word
- $D$ … deleted words
- $I$ … inserted words
- $C$ … correct words

Evaluation measures in NLP

What is inter-annotator agreement? How can it be measured?

inter-annotator agreement (IAA) measures the reliability of manual annotations
“how good human experts perform?”
baseline: what if they annotated randomly?
possible ways to measure IAA
- treat one annotator as a virtual gold standard and measure the accuracy of the second annotator
- use F-measure instead of accuracy
- Cohen's kappa

Evaluation measures in NLP

What is Cohen's kappa?

inter-annotator agreement metric that takes into account the agreement by chance
$\kappa=\frac{P_a-P_e}{1-P_e}$
$P_a$ … relative observed agreement between the annotators
$P_e$ … probability of agreement by chance
scale from $-1$ to $+1$ (negative kappa unexpected but possible)
interpretation still unclear, but at least we abstracted from the by-chance agreement baseline

Deep learning for NLP

Describe the two methods for training of the Word2Vec model.

we use a sliding window over the input text → the embedding of a given word is influenced by the neighboring words
either we can predict the neighboring words based on the current word or vice versa
two methods
- CBOW: minimize cross-entropy of the middle word of a sliding window
  - the model predicts the middle word based on the other words in the window
- skip-gram: minimize cross-entropy of a bag of words around a word
  - the model predicts the neighboring words (without specifying their order) based on the middle word

Deep learning for NLP

Use formulas to describe how Word2Vec is trained with negative sampling. (2 points)

we don't do softmax of all the words that could be in the window but are not – that would be computationally expensive
instead of softmax, we use sigmoid function: “can this word be in this context?”
we randomly sample negative samples from the vocabulary
as our loss function, we use $\mathcal L=-\log\sigma(U^T_{w_O}V_{w_I})-\sum_{i=1}^k\log\sigma(-U^T_{w_i}V_{w_I})$ $L = - lo g σ (U_{w_{O}}^{T} V_{w_{I}}) - \sum_{i = 1}^{k} lo g σ (- U_{w_{i}}^{T} V_{w_{I}})$
- where $w_I$ is an input word, $w_O$ is an output word
- there are $k$ negative samples $w_1,\dots,w_k$
- $V$ is the first layer of the neural network (input embedding matrix)
- $U$ is the second layer (output matrix)

Deep learning for NLP

Explain the difference between Word2Vec and FastText embeddings.

Word2Vec training treats words as independent entities
there are regularities in how words look like (morphology)
FastText tackles this
- represent each word as a bag of character n-grams
- keep a table of character n-grams embeddings instead of words
- word embedding = average of character n-gram embeddings

Deep learning for NLP

Sketch the structure of the Transformer model. (2 points)

- author: Jindřich Libovický
- license: CC BY-SA
there are several such layers
each layer has two sublayers (self-attention and feed-forward)

Deep learning for NLP

Why do we use positional encodings in the Transformer model.

attention mechanism does not have any information about the order of the tokens
one way to solve this is to sum positional coefficients with the token embeddings

Deep learning for NLP

What are residual connections in neural networks? Why do we use them?

we always sum the output of the sublayer with the output of the previous sublayer
we use them to make the learning numerically stable and to prevent the vanishing gradient problem
- in a deep network, there are many forward propagation steps and the gradients of earlier weights (layers) are calculated with too many multiplications → they become too small
- to solve this, we add the output of the previous sublayer to the output of the current one
- it helps, because summation is linear with respect to the gradient

Deep learning for NLP

Use formulas to express the loss function for training sequence labeling tasks.

we use cross-entropy between estimated distribution $P$ and one-hot ground-truth distribution $T$
for a single item
- $\mathcal L=H(T,P)=-\mathbb E_{i\sim T}\log P(i)=-\sum_i T(i)\log P(i)=-\log P(y^*)$
- $y^*$ … ground truth
for a sequence of $T$ $T$ items
- $\mathcal L=-\sum_{t=1}^T\log P_t(y^*_t)$

Deep learning for NLP

Explain the pre-training procedure of the BERT model. (2 points)

it is a “masked language model”
it is trained as sequence labeling: change some input token, labeler guesses original tokens
when trained, throw away labeling, use last layer as the representation
for some of the words in the training data (randomly sampled; 15 % in the original paper), we do one of the following things:
- with 80% probability replace the word with special MASK token
- with 10% probability replace the word with a random token
- with 10% probability keep the original word
the classifier should then predict the correct (original) word

Deep learning for NLP

Explain what is the pre-train and finetune paradigm in NLP.

two steps: pre-training and finetuning
pre-training
- do once, large (unlabeled) data, long time
- for example, masked language modeling
finetuning
- small data (task-specific), fast
- to fulfill a task-specific objective

Deep learning for NLP

Describe the task of named entitity recognition (NER). Explain the intution behind the CRF models compared to standard sequence labeling. (2 points)

the goal of named entity recognition is to find named entities in the text
- examples of named entities: person, location, organization, event, product, skill, address, phone, email, URL, IP, datetime, quantity
- it is linked to coreference resolution
- usage: entity linking, indexing text for search, direct use in smart devices (making addresses clickable etc.)
- the entities can span multiple words; they can be nested
  - there are multiple schemes to deal with that
  - example (IOB scheme)
    - There are over 1000 compositions by Johan Sebastian Bach.
    - O O B-QUANT I-QUANT O O B-PERSON I-PERSON I-PERSON O
standard tagging
- conditional independence assumption
- but tags have their internal grammar: I-PERSON cannot follow after O
CRF (Conditional Random Fields)
- instead of doing probability distribution over each token independently, we do a probability distribution over all possible tagging sequences
- for a tagging sequence, we get a score, how probable it is

Deep learning for NLP

Explain how does the self-attention differ in encoder-only and decoder-only models.

encoder-only models
- predict sentence labeling (e.g. POS tags)
- the self-attention mechanism can look in all directions (at the previous and following tokens)
decoder-only models
- predict the next word
- we have to ensure that the self-attention mechanism cannot look at the future words
  - we need to use a mask (a triangular matrix with negative infinities)