# Exam

The exam will have 10 questions, mostly from this pool. In general, none of them requires you to memorize formulas, but you should know the main ideas and principles.

## Introduction

- What's the difference between task-oriented and non-task-oriented systems?
    - task-oriented
        - focused on completing certain tasks (booking restaurants/flights/hotels, finding bus schedules, smart home, …)
        - most actual dialogue systems in production
        - “backend access” vs. “agent/assistant”
    - non-task-oriented
        - chitchat – social conversation, entertainment
        - getting to know the user, specific persona
        - gaming the Turing test
- Describe the difference between closed-domain, multi-domain, and open-domain systems.
    - single/closed-domain – on a well-defined area, small set of specific tasks (e.g. banking system on a specific phone number)
    - multi-domain – joining several single-domain systems
    - open-domain – “responds to anything”, used to be mostly chitchat, now somewhat working via LLMs
- Describe the difference between user-initiative, mixed-initiative, and system-initiative systems.
    - user-initiative – user asks, machine responds
    - system-initiative – “form-filling”, system asks questions, user must reply (traditional, most robust, least natural)
    - mixed-initiative – system and user both can ask & react to queries; most natural, most complex

## Linguistics of Dialogue

- What are turn taking cues/hints in a dialogue? Name a few examples.
    - a speaker can use a turn-taking cue/hint to signal when their turn ends (they yield)
    - examples: linguistic (e.g. finished sentence), voice pitch, timing (gaps), eye gaze, gestures, …
- Explain the main idea of the speech acts theory.
    - each utterance is an act: intentional, changing the state of the world (changing the knowledge/mood of the listener, influencing their behavior)
    - speech acts consist of several levels: the words, their semantics, meaning, effect
    - types of speech acts: assertive, directive, commissive, expressive, declarative
    - explicit vs. implicit; direct vs. indirect
        - explicit: I **promise** to come by later.
        - implicit: I'll come by later.
        - direct: Please close the window.
        - indirect: Could you close the window?
        - even more indirect: I'm cold.
- What is grounding in dialogue?
    - dialogue is cooperative → need to ensure mutual understanding
    - common ground = shared knowledge, mutual assumptions of dialogue participants
        - the knowledge has to be *knowingly* shared
    - common ground is expanded/updated/refined in an informative conversation
        - validated/verified via grounding feedback/evidence
        - speaker presents utterance
        - listener accepts utterance by providing evidence of understanding
        - information added to common ground only after acceptance
- Give some examples of grounding signals in dialogue.
    - positive – understanding/acceptance signals
        - visual – eye gaze, facial expressions, smile
        - backchannels – particles signalling understanding (uh-uh, hmm, yeah, …)
        - explicit feedback – explicitly stating understanding (I know; yes, I understand)
        - implicit feedback – showing understanding implicitly in the next utterance
    - negative – misunderstanding
        - visual – stunned/puzzled silence
        - implicit/explicit repairs – denying (no, that's not right) / presenting alternative
        - clarification requests – demonstrating ambiguity & asking for additional information (Which John? John Smith or John Doe?)
        - repair requests – showing non-understanding & asking for correction (Oh, so you're not flying to London? Where are you going then?)
- What is deixis? Give some examples of deictic expressions.
    - “pointing” – relating between language & context/world
    - dialogue is typically set/situated in a specific context
    - deictic expressions
        - their meaning depends on the context (who is talking, when, where)
        - pronouns (I, you, him, this)
        - verbs: tense & person markers
        - adverbs (here, now, yesterday)
        - lexical meaning (come × go)
        - non-verbal (gestures, gaze)
        - typically egocentric
    - main types of deixis
        - personal – I, me, you, she
        - temporal – now, yesterday, later, on Monday
        - local – here, there
        - other types: social (politeness), discourse/textual (next chapter)
- What is coreference and how is it used in dialogue?
    - expression referring to something mentioned in context
        - anaphora = referring back
        - cataphora = referring forward
    - avoiding repetition, faster expression
    - can refer to basically anything (objects/persons/events, qualities, actions / full sentences / portions of text)
    - used frequently in dialogue, may be ambiguous
    - examples
        - anaphora: Susan dropped the plate. **It** shattered.
        - cataphora: When **he** hears that fire alarm, Sam is always cool and calm.
        - I don't like it as much as he **does**.
        - Her dress is green. **So** is mine.
        - Shall I book a room for you? – Sure, I'd like **that**.
        - ambiguity: Bill stands next to John. **He** is tall.
- What do Shannon entropy and conditional entropy measure? No need to give the formula, just the principle.
    - entropy – expected value of information conveyed (in bits)
        - $H(\mathrm{text})=-\mathbb E[\log p(\mathrm{word})]$
    - entropy plays well with the social interaction perspective
        - people tend to use all available channel capacity
        - people tend to spread information evenly (words carrying more information are emphasized)
    - conditional entropy – how hard is it to guess the next word in the sentence?
        - given preceding context (n-gram)
        - related to Shannon entropy but may differ (it is typically much lower than Shannon entropy)
        - better estimate of prediction difficulty (although humans work with “unlimited” preceding context and reevaluate using following context)
        - $H_\mathrm{cond}(\mathrm{text})=-\mathbb E_p\left[\log\frac{p(c,w)}{p(c)}\right]$
    - see the toy estimation sketch at the end of this section
- What is entrainment/adaptation/alignment in dialogue?
    - people subconsciously adapt/align/entrain to their dialogue partner over the course of the dialogue
        - wording (lexical items) – they use the same words as their dialogue partner
        - grammar (sentential constructions)
        - speech rate, prosody, loudness
        - accent/dialect – BrE speaker uses AmE words when talking to AmE speaker
    - this helps a successful dialogue (also helps social bonding, feels natural)
    - systems typically don't align, people align to dialogue systems
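A toy illustration of the entropy vs. conditional entropy idea above (a minimal sketch; the tiny corpus and the unsmoothed unigram/bigram counts are illustrative assumptions, not the course's exact estimation procedure):

```python
import math
from collections import Counter

# tiny toy corpus; real estimates need far more data
corpus = "the cat sat on the mat the cat ate the rat".split()

# unigram (Shannon) entropy: H = -sum_w p(w) log2 p(w)
unigrams = Counter(corpus)
total = sum(unigrams.values())
H = -sum(c / total * math.log2(c / total) for c in unigrams.values())

# conditional entropy given the previous word (bigram context):
# H(W|C) = -sum_{c,w} p(c,w) log2 p(w|c)
bigrams = Counter(zip(corpus, corpus[1:]))
n_bi = sum(bigrams.values())
context = Counter(c for c, _ in bigrams)
H_cond = -sum((n / n_bi) * math.log2(n / context[c])
              for (c, _w), n in bigrams.items())

print(f"unigram entropy: {H:.2f} bits, conditional entropy: {H_cond:.2f} bits")
# the conditional entropy is lower: knowing the previous word makes the next one easier to guess
```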
## Data & Evaluation

- What are the typical options for collecting dialogue data?
    - in-house collection using experts or students
        - safe, high-quality, but very expensive & time-consuming
        - free talk / scripting whole dialogues / Wizard-of-Oz
    - web crawling
        - fast & cheap, but typically not real dialogues, may not be fit for purpose
        - potentially unsafe (offensive stuff)
        - need to be careful about the licensing
    - crowdsourcing
        - compromise: employing (untrained) people over the web
        - crowd workers tend to game the system
- How does Wizard-of-Oz data collection work?
    - users believe they're talking to a system → their behavior is simpler than when talking to a human
    - system is in fact controlled by a human “wizard”, who is selecting options (free typing is too slow)
    - usage: in-house data collection, prototyping/evaluating the system before implementing it
- What is corpus annotation, what is inter-annotator agreement?
    - annotation = labels, description added to the collected data (dialogues)
        - transcriptions (for ASR)
        - semantic annotation (for NLU) – dialogue acts, …
        - named entity labelling (for NLU)
    - inter-annotator agreement (IAA)
        - measures the reliability of manual annotations
        - multiple people annotate the same thing
        - needs to account for agreement by chance
        - typical measure: Cohen's kappa
            - $\kappa=\frac{\text{agreement}-\text{chance}}{1-\text{chance}}$
- What is the difference between intrinsic and extrinsic evaluation?
    - intrinsic … checks properties of systems/components in isolation, self-contained
    - extrinsic … how the system/component works in its intended purpose
        - effect of the system on something outside itself, in the real world (i.e. user)
- What is the difference between subjective and objective evaluation?
    - subjective … asking users' opinions, e.g. questionnaires (manual)
        - not repeatable
        - we should ask many people → not so subjective
    - objective … measuring properties directly from data (automatic)
        - might or might not correlate with users' perception
- What are the main extrinsic evaluation techniques for task-oriented dialogue systems?
    - objective metrics (we record people interacting with the system, analyze the logs)
        - task success / goal completion rate – did the user get what they wanted?
            - testers can have an agenda → we can check if they found what they were supposed to
            - basic check: did we provide any information at all? (any bus/restaurant)
        - duration – number of turns or time (less is better)
        - retention rate – percentage of users that return to use our dialogue system again (over a time period)
        - fallback rate – percentage of failed dialogues
        - number of total/new/active users
    - subjective evaluation
        - questionnaires for users/testers
        - example questions
            - success rate: Did you get all the information you wanted?
            - future use: Would you use the system again?
            - ASR/NLU: Do you think the system understood you well?
            - NLG: Were the system replies fluent/well-phrased?
            - TTS: Was the system's speech natural?
- What are some evaluation metrics for non-task-oriented systems (chatbots)?
    - objective metrics
        - duration (longer = better)
        - other: % returning users, checks for users swearing vs. thanking the system
    - subjective
        - likeability/engagement: Did you enjoy the conversation?
        - other similar to task-oriented
- What's the main metric for evaluating ASR systems?
    - word error rate (WER)
    - ASR output is compared to human-authored reference
    - $\mathrm{WER}=\frac{S+I+D}{N}$
        - $S$ … substitutions
        - $I$ … insertions
        - $D$ … deletions
        - $N$ … reference length
    - ~ length-normalized edit distance (Levenshtein distance)
    - sometimes insertions & deletions are weighted $0.5\times$
    - can be $\gt 1$
    - assumes one correct answer
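A minimal WER sketch via dynamic-programming edit distance (illustrative only; it weights substitutions, insertions and deletions equally and assumes a single reference):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = (S + I + D) / N, computed as Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("book a table for two", "book the table for two please"))  # 1 sub + 1 ins => 2/5 = 0.4
```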
- What's the main metric for NLU (both slots and intents)?
    - slots: precision, recall, F-measure (F1)
        - precision $P=\frac{\mathrm{correct}}{\mathrm{detected}}$
        - recall $R=\frac{\mathrm{correct}}{\mathrm{true}}$
        - F-measure $F=\frac{2PR}{P+R}$ harmonic mean
        - example (see the sketch at the end of this section)
            - NLU: inform(name=Golden Dragon, food=Chinese)
            - true: inform(name=Golden Dragon, food=Czech, price=high)
            - only name=Golden Dragon is correct → $P=1/2,\;R=1/3,\;F=0.4$
    - accuracy (% correct) used for intent/act type
    - alternatively also exact matches on the whole semantic structure (easier, but ignores partial matches)
    - one true answer assumed
- Explain an NLG evaluation metric of your choice.
    - BLEU score
        - word-overlap with reference text(s)
        - $BLEU=BP\cdot\sqrt[4]{p_1p_2p_3p_4}$
            - $p_n$ … $n$-gram precision (how many $n$-grams of the output text exist in any reference text)
            - $BP$ … brevity penalty (short sentences achieve higher $n$-gram precisions, so we penalize them)
    - slot error rate
    - diversity – can our system produce different replies?
- Why do you need to check for statistical significance (when evaluating an NLP experiment and comparing systems)?
    - higher score is not enough to prove your model is better
        - it can happen by chance
    - we need to define the hypotheses and select a significance level $\alpha$, then compute the observed value of the test statistic and reject $H_0$ or not
- Why do you need to evaluate on a separate test set?
    - we want to know how well our model works on new, unseen data (how well it generalizes)
    - memorizing training data would give us 100% accuracy (on training data)
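A minimal sketch of the slot precision/recall/F1 computation, using the Golden Dragon example from the notes above (slot-value pairs only; counting the intent separately is a different convention):

```python
def slot_prf(predicted: set, gold: set):
    """Precision/recall/F1 over slot-value pairs (exact match)."""
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# the example from the notes above
predicted = {("name", "Golden Dragon"), ("food", "Chinese")}
gold = {("name", "Golden Dragon"), ("food", "Czech"), ("price", "high")}
print(slot_prf(predicted, gold))  # (0.5, 0.333..., 0.4)
```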
## Natural Language Understanding

- What are some alternative semantic representations of utterances, in addition to dialogue acts?
    - syntax/semantic trees (dependency trees, constituent trees, …)
    - frames – technically also trees, not directly connected to words
    - graphs – abstract meaning representation (AMR), more of a toy task, but popular
    - predicate logic
- Describe language understanding as classification and language understanding as sequence tagging.
    - NLU as classification
        - we treat DAs as a set of semantic concepts
            - concepts: intents, slot-value pairs
        - binary classification: is concept Y contained in utterance X?
            - independent for each concept
        - consistency problems – conflicting intents/values need to be solved externally (e.g. based on classifier confidence)
    - language understanding as sequence tagging
        - we want to parse slot values from the text
        - we can classify each word using IOB format (inside/outside/beginning) – isolate the slot values (can consist of several words)
        - pure classification can lead to inconsistencies (I cannot follow after O)
        - it is useful to tag the whole sentence (the sequence of words) at once
- How do you deal with conflicting slots or intents in classification-based NLU?
    - we need to resolve such situations externally (e.g. based on classifier confidence)
- What is delexicalization and why is it helpful in NLU?
    - delexicalization = replacement of slot values / named entities with placeholders (indicating entity type)
    - generally needed for NLU as classification (otherwise in-domain data is too sparse)
    - named-entity recognition (NER) is a problem on its own
        - in-domain gazetteers (dictionaries of names) alone may be enough
- Describe one of the approaches to slot tagging as sequence tagging.
    - basic idea
        - we classify each word using IOB format to isolate the slot values
        - to avoid inconsistencies, we tag the whole sentence (the sequence of words) at once
    - approaches
        - maximum entropy Markov model (MEMM)
            - looking at past classifications when making next ones
                - whole history would be too sparse/complex → Markov assumption: only the most recent classifications matter
            - looking at the whole input
            - not modelling the sequence globally
                - error propagation … during inference (prediction), one error can lead to a series of errors
                - label bias problem
        - hidden Markov model (HMM)
            - modelling the sequence as a whole
            - very basic model – tag depends on current word + previous tag
                - Markov assumption
            - we can get globally best tagging (using Viterbi algorithm)
        - linear-chain conditional random field (CRF)
            - combines the advantages of HMM and MEMM – global sequence modelling with rich input features
            - uses global normalization → slow to train
            - state-of-the-art for many sequence tagging tasks (until neural networks took over; can be also used in conjunction with NNs)
- What is the IOB/BIO format for slot tagging?
    - it is used to get the slot values from the text
    - the words in the text are tagged; slot values can span several words
    - tags
        - B-$s$ … beginning of slot $s$
        - I-$s$ … inside slot $s$
        - O … outside
    - example
        - There are **over 1000** compositions by **Johan Sebastian Bach**.
        - O O B-quantity I-quantity O O B-person I-person I-person O
- What is the label bias problem?
    - it occurs in maximum entropy Markov models (MEMM)
    - due to local normalization, states with fewer outbound transitions are preferred – the transitions have larger probabilities than in states with more transitions
    - this makes the model less immune to error propagation (= one wrongly classified word leads to a series of errors)
- How can an NLU system deal with noisy ASR output? Propose an example solution.
    - simple approach
        - ASR produces multiple hypotheses (texts)
        - ASR → $p(\mathrm{text}\mid\mathrm{audio})$
        - NLU → $p(\mathrm{DA}\mid \mathrm{text})$
        - we want $p(\mathrm{DA}\mid \mathrm{audio})$
        - we sum it up: $p(\mathrm{DA}\mid\mathrm{audio})=\sum_{\mathrm{texts}} p(\mathrm{DA}\mid\mathrm{text})\,p(\mathrm{text}\mid\mathrm{audio})$
    - alternative approach: confusion networks
        - we use per-word ASR confidence
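A minimal sketch of the n-best summation just described (the hypothesis list, its confidences, the toy NLU and the DA labels are made-up placeholders):

```python
from collections import defaultdict

# made-up ASR n-best list: (text, p(text | audio))
asr_nbest = [
    ("book a table for two", 0.6),
    ("book a cable for two", 0.3),
    ("look a table for you", 0.1),
]

def toy_nlu(text):
    """Placeholder NLU returning p(DA | text); a real system would use a trained classifier."""
    if "book" in text and "table" in text:
        return {"inform(task=booking)": 0.9, "other": 0.1}
    return {"inform(task=booking)": 0.3, "other": 0.7}

# p(DA | audio) = sum over texts of p(DA | text) * p(text | audio)
p_da = defaultdict(float)
for text, p_text in asr_nbest:
    for da, p in toy_nlu(text).items():
        p_da[da] += p * p_text

print(dict(p_da))  # DA probabilities aggregated over all ASR hypotheses
```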
## Neural NLU & Dialogue State Tracking

- Describe an example of a neural architecture for NLU.
    - we can use simple classification or sequence tagging
    - when using sequence tagging, we can tag the intent at the start of the sentence (and then assign the IOB tags to all of its words)
    - examples of architecture
        - RNN-based NLU
            - bidirectional encoder (see [NLP notes](../natural-language-processing/exam.md#neural-machine-translation))
            - decoder that tags word-by-word (uses the encoder as one of its inputs)
            - intent classification – we can do softmax over last encoder state
            - attention can be used in the decoder and to classify the intent
        - (pretrained) Transformer-based NLU
            - slot tagging on top of pretrained BERT Transformer model
                - BERT was trained to guess masked words
                - further trained for NLU
            - standard IOB approach
                - softmax the final hidden layers → output tags
                - in case of split words, classify only the first subword (IOB tags should not change mid-word)
            - special start token tagged with intent
            - optional CRF on top of the tagger
- How can you use pretrained language models in NLU?
    - we can use BERT Transformer model and fine-tune it for NLU
        - BERT was trained to guess masked words
- What is the dialogue state and what does it contain?
    - dialogue state remembers what was said in the past
        - it acts as a basis for action selection decisions
    - dialogue state … current context of the conversation
    - contents = “all that is used when the system decides what to say next”
        - user goal / preferences (slots & values provided, information requested)
        - past system actions
        - other semantic context
    - usually, we consider a probability distribution over all possible states
- What is an ontology in task-oriented dialogue systems?
    - it is used to describe possible states
    - it defines all concepts in the system
        - list of slots
        - possible range of values per slot
        - possible actions per slot
        - dependencies (some concepts are only applicable for some values of parent concepts)
- Describe the task of a dialogue state tracker.
    - NLU is unreliable (it takes unreliable ASR output and adds its own errors), output might conflict with ontology
    - solution: we use belief state (probability distribution over all possible states)
        - per-slot distributions are used in practice
    - dialogue state tracker updates the belief state based on new information
    - to make it more robust, the state tracker can accumulate probability mass over multiple turns / over NLU n-best lists
    - probabilistic dialogue state tracker plays well with probabilistic dialogue policies
- What's a partially observable Markov decision process?
    - Markov decision process
        - model for sequential decision making when outcomes are uncertain
        - set of states, actions, probabilities that an action leads from a state $s$ to a state $s'$, and rewards received after transitioning from state $s$ to state $s'$ using action $a$
        - we are looking for a policy function – mapping from state space to action space (can be probabilistic)
    - partially observable MDP – we do not know the current state with certainty
        - belief state can be modelled using a hidden Markov model
- Describe a viable architecture for a belief state tracker.
    - basic discriminative belief tracker – we assume slot independence and trust the NLU
    - we have probabilities of states $p_s$ (tracked by our belief tracker) and probabilities of observations $p_o$ (returned by NLU)
    - in each step, for every slot…
        - we have the probability of null observation $p_o(\mathrm{null})$
        - for every state $x$, we multiply $p_s(x)$ by $p_o(\mathrm{null})$
        - for every non-null $x$, we then add $p_o(x)$ to $p_s(x)$
    - such a belief tracker is very fast and parameter-free (see the sketch at the end of this section)
- What is the difference between dialogue state and belief state?
    - dialogue state is the current context of a conversation
    - belief state is a probability distribution over dialogue states – it reflects the fact that the NLU is not completely reliable
- What's the difference between a static and a dynamic state tracker?
    - static state tracker encodes whole history into features
    - dynamic/sequence state tracker explicitly models dialogue as sequential
        - can use CRF or RNNs
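A minimal sketch of the per-slot belief update described above (slot values and NLU confidences are made up):

```python
def update_belief(belief: dict, nlu_obs: dict) -> dict:
    """One turn of the basic discriminative belief update for a single slot.

    belief:  {value: p_s(value)}, sums to 1
    nlu_obs: {value: p_o(value)} including the special "null" (nothing observed this turn)
    """
    p_null = nlu_obs.get("null", 0.0)
    new_belief = {v: p * p_null for v, p in belief.items()}        # discount old probability mass
    for value, p_obs in nlu_obs.items():
        if value != "null":
            new_belief[value] = new_belief.get(value, 0.0) + p_obs  # add newly observed mass
    return new_belief

belief = {"chinese": 0.7, "czech": 0.3}                # current belief for the "food" slot
nlu_obs = {"italian": 0.5, "czech": 0.2, "null": 0.3}  # NLU output for this turn
print(update_belief(belief, nlu_obs))
# {'chinese': 0.21, 'czech': 0.29, 'italian': 0.5} -- still sums to 1
```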
- How can you use pretrained language models or large language models for state tracking?
    - BERT (pretrained language model)
        - we let BERT process previous system & current user utterance
        - we use it to predict per-slot span (value of a dialogue state slot – where to find it in the message)
            - from the first token's representation, we get a single decision: none/dontcare/span
            - using 2 softmaxes over tokens, we can then predict start & end token
        - we apply rule-based update to the static state tracker – if *none* was predicted, we keep the previous value
    - LLM prompting – two alternatives were presented
        - SQL & examples: we present SQL schema to the LLM, show several examples, and provide the previous state + one dialogue turn → the (dynamic) state changes are produced as SQL requests
        - chain-of-thought style: we prompt the LLM to explain the inputs and produce state based on them (it uses the whole history, the state tracker is static)

## Dialogue Policies

- What are the non-statistical approaches to dialogue management/action selection?
    - finite-state machines
        - dialogue state is machine state
        - nodes – system actions
        - edges – possible user response semantics
        - FSMs are easy to design and predictable, but very rigid and do not scale to complex domains
        - good for basic DTMF (touch-tone) phone systems
    - frame-based (VoiceXML)
        - slot-filling + providing information
        - required slots need to be filled, this can be done in any order, more information in one utterance possible
        - if all slots are filled, query the database
    - rule-based – any kind of rules (e.g. Python code)
        - we can use a probabilistic belief state
        - if-then-else rules in programming code, using thresholds over belief state for reasoning (see the sketch at the end of this section)
        - output: system DA
        - very flexible and easy to code, but gets messy; the resulting dialogue policy is pre-set (hand-crafted, not learned)
- Why is reinforcement learning preferred over supervised learning for training dialogue managers?
    - you need large human-human data for supervised learning (hard to get)
        - if we used human-machine, the model would just mimic the original system
    - dialogue is ambiguous & complex
        - there is no single correct next action
        - some paths will be unexplored in data, but you may encounter them
    - dialogue systems won't behave the same as people
        - there are ASR errors, limited NLU, limited environment model/actions
        - dialogue systems *should* behave differently than people – make the best of what they have
    - in reinforcement learning, the goal is to find a policy that maximizes long-term reward – this corresponds well to the goal of dialogue management (overall dialogue success rather than just the locally best next action)
    - note that for a typical dialogue system, the belief state is too large to make RL tractable – we map the state into a reduced space, optimize there, and map it back
- Describe the main idea of reinforcement learning (agent, environment, states, rewards).
    - Markov decision process (MDP)
        - agent in an environment
            - has internal state
            - chooses actions according to policy
            - gets rewards and state changes from the environment
        - Markov property – state defines everything (no other temporal dependency)
    - RL = finding a policy that maximizes long-term reward
        - unlike supervised learning, we don't know if an action is good
        - immediate reward might be low while long-term reward high
        - return $R_t$ = accumulated long-term reward (from timestep $t$ onwards)
    - state transition is stochastic (has a random probability distribution) → we maximize expected return
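A minimal sketch of the rule-based, threshold-over-belief-state policy mentioned above (the slot names, the simplified belief representation with one top value per slot, and the confidence thresholds are made-up assumptions):

```python
def rule_based_policy(belief: dict) -> str:
    """Hand-coded policy: if-then-else rules with thresholds over the belief state.

    belief maps slot name -> (most likely value, confidence); returns a system DA string.
    """
    CONF = 0.7  # made-up confidence threshold

    food_value, food_conf = belief.get("food", (None, 0.0))
    area_value, area_conf = belief.get("area", (None, 0.0))

    if food_value is None or food_conf < CONF:
        return "request(food)"                      # ask for a missing / very uncertain slot
    if food_conf < 0.9:
        return f"confirm(food={food_value})"        # mid confidence -> explicit confirmation
    if area_value is None or area_conf < CONF:
        return "request(area)"
    return f"inform(name=..., food={food_value}, area={area_value})"  # query DB & make an offer

print(rule_based_policy({"food": ("chinese", 0.95), "area": ("centre", 0.8)}))
# inform(name=..., food=chinese, area=centre)
```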
- What are deterministic and stochastic policies in dialogue management?
    - deterministic policy
        - always take the same action $\pi(s)$ in state $s$
        - enumerable in a table, equivalent to a rule-based system
            - but can be learned instead of hand-coded!
    - stochastic
        - specifies a probability distribution
        - $\pi(s,a)$ … probability of choosing action $a$ in state $s$
- What's a value function in a reinforcement learning scenario?
    - state-value function $V^\pi(s)$ … the value of a state $s$ under policy $\pi$
        - expected return for starting in state $s$ and following policy $\pi$
    - action-value function $Q^\pi(s,a)$
        - expected return of taking action $a$ in state $s$ under policy $\pi$
    - value functions can be used to evaluate states (or actions) and make better decisions
- What's the difference between actor and critic methods in reinforcement learning?
    - actor model learns the policy
        - for a given state, it predicts a probability distribution over actions
        - the agent can then decide according to this distribution
    - critic model learns the value function
        - for a given state $s$, it predicts its value function $V(s)$ or $Q(s,a)$ for action $a$
        - this guides the agent (e.g. it can act greedily with respect to the value estimates)
- What's the difference between model-based and model-free approaches in RL?
    - model-based
        - we assume that transition probabilities and rewards are known
        - the solutions are mathematically nice
        - but you can only know the full model in limited settings
    - model-free
        - we don't assume anything
        - this is the one for “real-world” use
        - using $Q$ instead of $V$ comes in handy here (we do not need the transition probability $p(s'\mid s,a)$ to get the expected return of taking action $a$ in state $s$)
- What are the main optimization approaches in reinforcement learning (what measures can you optimize and how)?
    - quantity to optimize
        - value function – critic
        - policy – actor
        - environment model: model-based × model-free
    - how to optimize
        - dynamic programming – find the exact solution from Bellman equation
            - iterative algorithms, refining estimates
            - expensive, assumes known environment (model-based)
        - Monte Carlo learning – learn from experience
            - sample, then update based on experience
            - when we arrive at state $s$, we update the model to match the observation
        - temporal difference learning – like MC but look ahead (bootstrap)
            - sample, refine estimates as you go
            - even before we arrive at $s$, we have a good idea what the observation will be when we arrive at $s$ → we can update the model based on that guess
            - see the Q-learning sketch at the end of this section
    - sampling & updates
        - on-policy – improve the policy while we are using it for decisions
        - off-policy – decide according to a different policy
- Why do you typically need a user simulator to train a reinforcement learning dialogue policy?
    - we can't really learn just from static datasets
        - on-policy algorithms don't work (the system needs to navigate the dialogues according to the current policy – old dialogues are not sufficient)
    - RL needs a lot of data, more than real people would handle (also, the system behaves weirdly in the early phases of RL)
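A minimal tabular sketch of the temporal-difference idea above, using the standard Q-learning update (the states, actions, rewards and hyperparameters are made-up toy values):

```python
import random
from collections import defaultdict

Q = defaultdict(float)               # Q[(state, action)] -> estimated return
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.2    # made-up learning rate, discount, exploration rate
ACTIONS = ["request(food)", "confirm(food)", "inform(restaurant)"]

def choose_action(state):
    """Epsilon-greedy behaviour policy (Q-learning is off-policy: it targets the greedy policy)."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def td_update(state, action, reward, next_state):
    """Q-learning: move Q(s,a) towards r + gamma * max_a' Q(s',a') (bootstrapped TD target)."""
    target = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

# one simulated transition: asking for the food slot costs a small per-turn penalty
td_update(state="food_unknown", action="request(food)", reward=-1, next_state="food_known")
print(Q[("food_unknown", "request(food)")])  # -0.1 after one update
```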
## Neural Policies & Natural Language Generation

- How do you involve neural networks in reinforcement learning (describe a Q network or a policy network)?
    - part of the agent is handled by a neural network – value function (typically $Q$) or policy
    - we are assuming huge state space (no more summary space)
    - REINFORCE (policy gradients)
        - works out of the box
        - we maximize performance – value of the initial state
    - deep Q-networks
        - Q-learning, where the $Q$ function is represented by a neural net
        - problems we need to fix
            - SGD is unstable
            - correlated samples (data is sequential)
            - TD updates aim at a moving target (using $Q$ to compute updates to $Q$)
            - numeric instability (scale of rewards and $Q$ values unknown)
        - fixes
            - minibatches (updates by averaged $n$ samples, not just one)
            - experience replay – to break correlated samples (store experience in a buffer, train using minibatches sampled from the buffer)
            - target $Q$ function freezing (so that the target is not moving that often)
            - clipping rewards
- What are the main steps of a traditional NLG pipeline – describe at least 2.
    - entire process: inputs → content plan → sentence plan → text
    - content/text/document planning
        - inputs → content plan
        - content selection according to communication goal
        - basic structuring & ordering
        - typically handled by dialogue manager
    - sentence planning / microplanning
        - content plan → sentence plan
        - organizing content into sentences, merging simple sentences
        - lexical choice, referring expressions (restaurant vs. it)
    - surface realization
        - sentence plan → text
        - linearization according to grammar
        - word order, morphology
    - for NLG in dialogue systems, we need sentence planning and surface realization
- Describe one approach to NLG of your choice.
    - canned text
        - most trivial – completely hand-written prompts, no variation
        - doesn't scale (good for DTMF phone systems)
    - templates
        - “fill in blanks” approach
        - simple, but much more expressive, covers most common domains nicely
        - can scale, but still laborious
        - most production dialogue systems
    - grammars & rules
        - rules: mostly content & sentence planning
        - grammars: mostly older research systems, realization
    - machine learning
        - modern research systems
        - pre-neural attempts often combined with rules/grammar
        - neural nets made it work much better
- Describe how template-based NLG works.
    - we define templates for system DAs (see the sketch at the end of this section)
    - it can be enhanced with rules
        - inflection of the filled-in phrases
        - template coverage/selection rules
- What are some problems you need to deal with in template-based NLG?
    - it lacks generality and variation; it is difficult to maintain, expensive to scale up
    - the texts may sound unnatural
    - it is difficult to express rich information – the templates may be limiting
    - the templates lack context awareness
- Describe a possible neural networks based NLG architecture.
    - our example: neural end-to-end NLG using recurrent neural networks (RNNs)
        - we don't need alignments
    - binary-encoded DA (is intent/slot-value present?)
        - delexicalized: does not use real values – generates templates
    - this approach uses modified LSTM (long short-term memory) cells – input DA is passed in every time step
    - it generates delexicalized templates word-by-word (decoder-only architecture)
    - other approaches: seq2seq, Transformer
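A minimal template-based NLG sketch for the approach described above (the DA types, templates and slot names are made-up examples):

```python
# one template per dialogue act type (several per type could be kept for variation)
TEMPLATES = {
    "inform": "{name} is a nice {food} restaurant in the {area} of town.",
    "request_area": "Which part of town would you like?",
    "confirm_food": "You are looking for {food} food, is that right?",
}

def realize(da_type: str, slots: dict) -> str:
    """Fill slot values into the template for the given dialogue act."""
    template = TEMPLATES[da_type]
    return template.format(**slots)   # rules could additionally handle inflection, articles, ...

print(realize("inform", {"name": "Golden Dragon", "food": "Chinese", "area": "centre"}))
# Golden Dragon is a nice Chinese restaurant in the centre of town.
```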
- How can you use pretrained language models or large language models in NLG?
    - pretrained LMs
        - architectures
            - guess masked word (encoder only: BERT)
            - generate next word (decoder only: GPT-2)
            - fix distorted sentences (both: BART, T5)
        - can be finetuned for our task/domain and for meaning representation (MR), can learn implicit copying
        - lots of them released online, plug-and-play (including multilingual versions)
    - LLMs
        - Transformer decoder models (slightly updated)
        - instruction tuning – finetune on problems & solutions
        - trained using reinforcement learning from human feedback (RLHF)
            - humans are paid to rate different solutions for instructions
            - a rating model is trained based on these ratings → such a model can be used as RL reward for LLM training
        - usage: simple prompting, no need for finetuning
            - just feed in instructions/questions/example → LLM generates solution

## Voice assistants & Question Answering

- What is a smart speaker made of and how does it work?
    - smart speaker = internet-connected mic & speaker with a virtual assistant running
        - optionally display/camera
        - multiple microphones for far-field ASR
    - it listens for a wake word
    - everything is then processed in vendor's cloud service (raw audio is sent to the cloud)
        - follow-up mode – no wake word needed for follow-up questions
        - privacy concerns
    - NLU includes domain detection
    - rules on top of machine learning
- Briefly describe a viable approach to question answering.
    - our example: IR-based QA pipeline
        - IR … information retrieval
    - three steps
        - question processing
            - query formulation
            - answer type detection (what should the answer look like?)
        - passage retrieval
            - get relevant documents from the index (similar to web search) … document retrieval
            - find phrases in the documents that respond to the question
        - answer processing
            - generate a suitable answer to the original question
- What is document retrieval and how is it used in question answering?
    - document retrieval = getting relevant documents (candidates) according to the query by searching in the index
        - can use TF-IDF (or other metrics) for weighting
    - document retrieval works as a coarse filter that filters out irrelevant documents (selects the ones that are relevant to the query and can possibly contain an answer to the question)
- What is dense retrieval (in the context of question answering)?
    - the documents are embedded in a vector space
    - such embeddings can then be compared to query embeddings via cosine similarity (see the sketch at the end of this section)
    - they can be also clustered into Voronoi cells, quantized, …
    - dense retrieval focuses more on semantics than on the specific contained words
- How can you use neural models in answer extraction (for question answering)?
    - passage extraction
        - we feed the question and extracted passage(s) to a Transformer model (e.g. BERT)
        - 2 classifiers: start + end of answer span (softmax over passage tokens)
    - generative QA
        - feed in passage
        - generate reply word-by-word
- How can you use retrieval-augmented generation in question answering?
    - Transformer generative language model (decoder architecture)
        - input: retrieved passage
        - output: full-sentence response
    - not just extraction, but full-sentence answer formulation
    - the model has to be trained to provide a reply (avoid hallucination, avoid copying everything verbatim)
- What is a knowledge graph?
    - large repository of structured, linked information
        - entities … nodes
        - relations … edges
    - entities and relations are typed, the types form a similar graph (ontology)
    - knowledge graphs can be used for question answering
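A minimal dense-retrieval sketch (the document texts are made up and the embeddings are random stand-ins; a real system would use a trained query/passage encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in passages; a real system embeds them with a trained encoder and stores the vectors in an index
doc_texts = ["Prague is the capital of the Czech Republic.",
             "Mozart wrote his opera Don Giovanni for Prague.",
             "The Vltava is the longest river in the Czech Republic."]
doc_vecs = rng.normal(size=(len(doc_texts), 8))   # random stand-in embeddings

def retrieve(query_vec, doc_vecs, k=1):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                     # cosine similarity after normalization
    return np.argsort(-scores)[:k]

query_vec = rng.normal(size=8)         # stand-in for the embedded question
for i in retrieve(query_vec, doc_vecs, k=2):
    print(doc_texts[i])
```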
## Dialogue Tooling

- What is a dialogue flow/tree?
    - graph structure that describes a non-linear dialogue
    - there are conditions to get to the individual nodes of the graph (and fallback strategies if none of the conditions is satisfied)
- What are intents and entities/slots?
    - intents correspond to the actions supported by the dialogue (represent what the user wants to achieve)
    - entities/slots are parameters of the actions (intents) – information needed to fulfill the intents
    - example
        - intent: reserve table
        - slots: date, time, number of guests
- How can you improve a chatbot in production?
    - automatically
        - learning from user selections
        - statistics on user selections → automated pre-selection for next users
    - semi-automatically or manually
        - chat log analysis → model update
    - used measures
        - coverage – is the chatbot confident that it can address the user's request? (per dialogue turn)
        - containment – can the chatbot satisfy a user's request without human intervention? (per conversation)
- What is the containment rate (in the context of using dialogue systems in call centers)?
    - rate at which your chatbot can satisfy a user's request without human intervention, i.e. no hand-over to a human agent is requested (per conversation)
    - it is a measure that can be used to evaluate the chatbot
- What is retrieval-augmented generation?
    - process of optimizing the output of a large language model so that it references an authoritative knowledge base *outside of its training data sources* before generating a response

## Automatic Speech Recognition

- What is a speech activity detector?
    - it is a preprocessing step in ASR
    - to save CPU – run ASR only when there is speech, ignore non-speech sounds
    - approaches
        - handcrafted (now obsolete) – track signal amplitude contours, assumes low noise
        - statistical / neural – binary classifier trained on large corpora, accurate but more CPU-demanding than a handcrafted detector
- Describe the main components of an ASR pipeline system.
    - speech activity detector – detects that someone is speaking, can depend on wake words
    - feature extractor – uses Fourier transform and mel frequency subsampling to extract features from the sound, and normalizes the signal
    - acoustic model – models the probability that a word corresponds to a given audio
    - language model – models probability of words and sentences, uses pronouncing dictionary
    - decoder – combines acoustic and language model
- What do input features for an ASR model look like?
    - mel frequency cepstral coefficients (MFCCs)
        - representation of the sound that is inspired by human perception
        - in older systems
    - mel spectrogram (filterbank)
        - uses mel (logarithmic) scale
        - less processed than MFCCs
    - raw spectrograms
    - raw audio
    - see the feature extraction sketch at the end of this section
- What is the function of the acoustic model in a pipeline ASR system?
    - to estimate $P(\mathrm{audio}\mid \mathrm{text})$
    - it helps to map audio features to phonemes or subwords (using Gaussian mixtures or neural networks)
- What's the function of a decoder/language model in a pipeline ASR system?
    - language model: to estimate $P(\mathrm{text})$
        - what is the probability of certain words/sentences in our language?
    - decoder: it decodes audio features back to text (with the help of an acoustic model)
        - it can use a pronouncing dictionary
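A small feature-extraction sketch for the input features listed above (using librosa is an assumption of this sketch, not something prescribed by the notes; the input here is a synthetic tone instead of real speech):

```python
import numpy as np
import librosa   # assumed library choice for feature extraction

sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)  # 1 s synthetic 440 Hz tone

# mel spectrogram (filterbank features): STFT magnitudes mapped onto the mel scale
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)     # log compression, roughly matching loudness perception

# MFCCs: a further DCT-compressed representation, typical of older pipeline systems
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(log_mel.shape, mfcc.shape)       # (80, frames), (13, frames)
```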
- Describe an (example) architecture of an end-to-end neural ASR system.
    - our example: attention encoder-decoder
        - encoder encodes audio features
        - decoder decodes text character-by-character
        - RNN (LSTM) + attention / Transformer
        - the audio feature sequence is much longer than the output text, so it is typically downsampled in the encoder
    - pros
        - direct audio to letter (no need to model pronunciation explicitly)
        - no need to align phones & audio frames
        - audio & transcript is enough to train
    - cons
        - inaccurate word/character timestamps
        - not low-latency
        - hard to customize

## Text-to-speech Synthesis

- How do humans produce sounds of speech?
    - air flow from lungs → vocal cord vibration → frequency characteristics further moderated by vocal tract
    - vocal cord vibration
        - base frequency (F0)
        - upper harmonic frequencies
    - vocal tract moderation
        - shape of vocal tract changes (tongue, soft palate, lip, jaw positions)
        - some frequencies resonate
        - some are suppressed
- What's the difference between a vowel and a consonant?
    - vowel – sound produced with open vocal tract
        - typically voiced (vocal cords vibrate)
        - quality of vowels depends mainly on vocal tract shape (raised tongue position, jaw/tongue height, shape of lips)
    - consonant – sound produced with (partially) closed vocal tract
        - voiced/voiceless (often come in pairs, e.g. \[p], \[b])
        - quality also depends on type + position of closing
- What is F0 and what are formants?
    - F0 … base vocal cord frequency (voice pitch)
    - formants … upper harmonics of F0 that are amplified by vocal tract resonance
        - distinct for different phonemes
        - F1, F2 – first, second formant
- What is a spectrogram?
    - frequency-time-loudness graph
- What are the main distinguishing characteristics of consonants?
    - do vocal cords vibrate? (voiced × voiceless)
    - type and position of vocal tract closing; vocal tract shape
        - stops/plosives … total closing + “explosive” release (p, d, k)
        - nasals … stops with open nasal cavity (n, m)
        - fricatives … partial closing (f, s, z)
        - approximants … movement towards partial closing and back, half-vowels (w, j)
- What is a phoneme?
    - sound that distinguishes meaning
    - changing it for another would change meaning (**d**og → **f**og)
- What are the main distinguishing characteristics of different vowel phonemes (both how they're produced and perceived)?
    - production – influenced by vocal tract shape
        - raised tongue position – front, central, back
        - jaw/tongue height – open, open-mid, close-mid, close
        - shape of lips – round, non-round
    - perception – depends on which formants are present in the spectrum or not (which are suppressed)
- What are the main approaches to grapheme-to-phoneme conversion in TTS?
    - main approaches: pronouncing dictionaries + rules
        - rules are good for languages with regular orthography (spelling)
            - Czech, German, Dutch
        - dictionaries good for irregular/historical orthography
            - English, French
    - typically it's a combination anyway
        - rules = fallback for out-of-vocabulary items
        - dictionary used for foreign words (overrides rules)
            - can be a pain in a domain with a lot of foreign names
    - pronunciation is sometimes context dependent
        - part-of-speech tagging
        - contextual rules
    - phonemes typically coded using ASCII
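A toy grapheme-to-phoneme sketch combining a small pronunciation dictionary with fallback letter-to-sound rules (both the dictionary entries and the rules are made-up simplifications of real, much larger resources):

```python
# made-up pronunciation dictionary for exceptions / foreign words (overrides the rules)
LEXICON = {
    "one": "w ah n",
    "two": "t uw",
}

# made-up letter-to-sound rules for the regular cases
RULES = {"ch": "ch", "sh": "sh", "a": "ae", "e": "eh", "i": "ih", "o": "aa", "u": "ah",
         "b": "b", "c": "k", "d": "d", "f": "f", "g": "g", "k": "k", "l": "l", "m": "m",
         "n": "n", "p": "p", "r": "r", "s": "s", "t": "t", "v": "v", "w": "w", "z": "z"}

def g2p(word: str) -> str:
    word = word.lower()
    if word in LEXICON:                    # dictionary lookup first
        return LEXICON[word]
    phones, i = [], 0
    while i < len(word):                   # longest-match rules as the fallback
        if word[i:i + 2] in RULES:
            phones.append(RULES[word[i:i + 2]]); i += 2
        elif word[i] in RULES:
            phones.append(RULES[word[i]]); i += 1
        else:
            i += 1                         # silently skip letters with no rule
    return " ".join(phones)

print(g2p("chip"), "|", g2p("one"))   # ch ih p | w ah n
```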
- Describe the main idea of concatenative speech synthesis.
    - cut & paste on recordings
        - but there are too many words/syllables to record them all, and single phonemes are too few (heavy coarticulation at the joins)
        - so we use diphones = second half of one phoneme and first half of another
        - about 1500 diphones in English – manageable (even though we need lots of recordings of a single person)
        - this eliminates the heaviest coarticulation problems (but not all)
    - still artefacts at diphone boundaries
        - smoothing/overlay & F0 adjustments
        - over-smoothing makes the sound robotic
        - pitch adjustments limited – don't sound natural
    - modification: unit-selection concatenative synthesis
        - more instances of each diphone
        - we select units that best match the target position (to minimize adjustments needed)
- Describe the main ideas of statistical parametric speech synthesis.
    - trying to be more flexible, less resource-hungry than unit selection
    - inverse of model-based ASR
    - based on HMMs (hidden Markov models)
    - principle
        - in corpus, we have text and audio
        - for training and prediction, we need:
            - model that can extract linguistic features (phonemes, stress, pitch) from the text
            - vocoder that can both extract acoustic features (spectrum, excitation) from a waveform (audio) and synthesize a waveform from acoustic features
        - to train the statistical acoustic model, we extract both acoustic and linguistic features from the corpus and use the features as training data
        - during prediction, we first extract the linguistic features from the text, then the acoustic model predicts acoustic features, and the vocoder synthesizes them into a waveform
- How can you use neural networks in speech synthesis?
    - we can use feed-forward networks or recurrent neural networks to replace HMMs used in statistical speech synthesis
        - RNNs predict smoother outputs (given temporal dependencies)
        - NNs allow better features (e.g. raw spectrum)
    - examples
        - WaveNet generates waveform directly, it is based on convolutional NNs
        - Tacotron is trained on waveforms and transcriptions (no linguistic features), it is based on seq2seq models with attention

## Chatbots

- What are the three main approaches to building chitchat/non-task-oriented open-domain chatbots?
    - rule-based
        - human-scripted, react to keywords/phrases in user input
        - very time-consuming to make, but still popular
    - data-driven: retrieval
        - gets replies from a corpus
        - “nearest neighbor” approaches
        - corpus can contain past conversations with users
        - chatbots differ in the sophistication of reply selection
    - data-driven: generative
        - seq2seq-based models (typically RNN/Transformer)
        - usually trained on static corpora
        - (theoretically) able to handle unseen inputs, produce original replies
        - basic seq2seq architecture is weak (dull responses) → many extensions
- How does the Turing test work? Does it have any weaknesses?
    - evaluator leads two text-only conversations – with a machine and a human
        - needs to tell which is which
    - the evaluator can be gamed if the conversation is framed well (paranoid schizophrenic, therapist, Ukrainian boy, …)
- What are some techniques rule-based chitchat chatbots use to convince their users that they're human-like?
    - signalling understanding – repeating and reformulating user's phrasing
    - good framing – it's easier to appear human as a therapist (or paranoid schizophrenic, Ukrainian boy, …)
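A tiny ELIZA-style sketch of the “signalling understanding” trick above (the keyword patterns, pronoun reflections and canned replies are illustrative, not a real rule set):

```python
import re

REFLECTIONS = {"i": "you", "my": "your", "am": "are", "me": "you"}

def reflect(text: str) -> str:
    """Swap pronouns so the user's own phrasing can be echoed back."""
    return " ".join(REFLECTIONS.get(w, w) for w in text.lower().split())

def respond(user: str) -> str:
    match = re.search(r"i feel (.*)", user, re.IGNORECASE)
    if match:                                   # keyword rule + reformulated user phrasing
        return f"Why do you feel {reflect(match.group(1))}?"
    if "mother" in user.lower():
        return "Tell me more about your family."
    return "I see. Please go on."               # generic fallback keeps the conversation going

print(respond("I feel ignored by my brother"))  # Why do you feel ignored by your brother?
```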
- Describe how a retrieval-based chitchat chatbot works.
    - it first checks for similar inputs in the corpus (rough retrieval) – see the sketch at the end of these notes
    - then it reranks the best candidates to find the most suitable one
        - this step can use machine learning (problem: we need negative examples to train the classifier)
    - it cannot produce unseen sentences and sometimes replies inconsistently
        - postprocessing and rules can partially fix this
- How can you use neural networks for chatbots (non-task-oriented, open-domain systems)? Does that have any problems?
    - we can use neural networks for reranking
        - training data problem – datasets contain only positive examples, but we also need negative examples
    - NNs can also be used end-to-end
        - we can use a similar approach as in phrase-based machine translation (MT)
            - the task is harder than MT – possible responses are much more variable than possible translations
            - it works, but fluency is not ideal and the context is too limited
        - RNN LMs without LSTM
            - more fluent than phrase-based
            - problems with long replies (less fluent, wander off-topic)
        - encoder-decoder RNN model with LSTM (seq2seq)
            - encode input, decode response
            - generic/dull responses
                - MLE/softmax prefer 1 option → models settle on safe replies and become over-confident
            - limited context
                - encoding long contexts is slow and ineffective
                - contexts are too sparse to learn much
            - inconsistency
                - ask the same question twice, get two different answers
                - no notion of own personality
- Describe a possible architecture of an ensemble non-task-oriented chatbot.
    - rule-based for sensitive/frequent/important questions
    - retrieval for jokes, trivia etc.
    - task-oriented-like (handcrafted / specially trained) systems for specific topics – news, weather, etc.
    - seq2seq as a backup or not at all
- What do you need to train a large language model?
    - trillions of tokens
    - enough compute power
    - well-defined evaluation metrics
- What are some issues you may encounter when chatting to LLMs?
    - it may not be factually accurate
        - it only uses information it memorized
        - hallucinates instead of saying “I don't know”
    - eager to please, easily swayed
    - hard to control
    - over-hyped
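A minimal sketch of the “rough retrieval” step of a retrieval-based chatbot (the corpus of past turns is made up, scikit-learn's TfidfVectorizer is an assumed stand-in for nearest-neighbour matching, and the reranking step is omitted):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# made-up corpus of past (user input, reply) pairs
pairs = [
    ("hello there", "Hi! How are you doing?"),
    ("tell me a joke", "Why did the chicken cross the road? To get to the other side."),
    ("what is your favourite food", "I'm quite fond of electricity, actually."),
]

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(p[0] for p in pairs)   # index the stored user inputs

def reply(user_input: str) -> str:
    """Return the reply attached to the most similar stored input (nearest neighbour)."""
    query = vectorizer.transform([user_input])
    best = cosine_similarity(query, index).argmax()
    return pairs[best][1]

print(reply("could you tell me a joke"))
```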