The exam will have 10 questions, mostly from this pool. In general, none of them requires you to memorize formulas, but you should know the main ideas and principles.
Introduction
What's the difference between task-oriented and non-task-oriented systems?
task-oriented
focused on completing certain tasks (booking restaurants/flights/hotels, finding bus schedules, smart home, …)
most dialogue systems currently in production
“backend access” vs. “agent/assistant”
non-task-oriented
chitchat – social conversation, entertainment
getting to know the user, specific persona
gaming the Turing test
Describe the difference between closed-domain, multi-domain, and open-domain systems.
single/closed-domain – on a well-defined area, small set of specific tasks (e.g. banking system on a specific phone number)
multi-domain – joining several single-domain systems
open-domain – “responds to anything”, used to be mostly chitchat, now somewhat working via LLMs
Describe the difference between user-initiative, mixed-initiative, and system-initiative systems.
user-initiative – user asks, machine responds
system-initiative – “form-filling”, system asks questions, user must reply (traditional, most robust, least natural)
mixed-initiative – system and user both can ask & react to queries; most natural, most complex
Linguistics of Dialogue
What are turn taking cues/hints in a dialogue? Name a few examples.
a speaker can use a turn-taking cue/hint to signal that their turn ends (they yield the floor)
examples: pauses, final/falling intonation, gaze, gestures, syntactic completion of the utterance, fillers (“uh”, “you know”)
alternatively also exact matches on the whole semantic structure (easier, but ignores partial matches)
one true answer assumed
Explain an NLG evaluation metric of your choice.
BLEU score
word-overlap with reference text(s)
BLEU = BP · (p1 · p2 · p3 · p4)^(1/4) … geometric mean of the n-gram precisions times the brevity penalty (see the sketch below)
pn … n-gram precision (how many n-grams of the output text exist in any reference text)
BP … brevity penalty (short sentences achieve higher n-gram precisions, so we penalize them)
slot error rate
diversity – can our system produce different replies?
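A minimal sentence-level BLEU sketch matching the definition above (clipped n-gram precisions plus brevity penalty); real implementations such as sacreBLEU add smoothing and corpus-level aggregation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(hypothesis, references, max_n=4):
    """Minimal BLEU sketch: geometric mean of clipped 1..4-gram precisions times a brevity penalty."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        # clip each n-gram count by the maximum count found in any single reference
        max_ref_counts = Counter()
        for ref in refs:
            for ng, cnt in Counter(ngrams(ref, n)).items():
                max_ref_counts[ng] = max(max_ref_counts[ng], cnt)
        matched = sum(min(cnt, max_ref_counts[ng]) for ng, cnt in hyp_counts.items())
        precisions.append(matched / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:        # any zero precision → BLEU is 0 (no smoothing in this sketch)
        return 0.0
    # brevity penalty: penalize hypotheses shorter than the closest reference length
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(sentence_bleu("the restaurant serves italian food",
                    ["the restaurant serves italian food in the centre"]))
```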
Why do you need to check for statistical significance (when evaluating an NLP experiment and comparing systems)?
higher score is not enough to prove your model is better
it can happen by chance
we need to define the hypotheses and select a significance level α, then compute the observed value of the test statistic and decide whether to reject H0
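One common option is paired bootstrap resampling; a minimal sketch, assuming we have per-example scores (e.g. 0/1 task success) for two systems on the same test set.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=10_000, seed=42):
    """Paired bootstrap: resample test items with replacement and count how often
    system A fails to beat system B; a small fraction suggests a significant difference."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    a_not_better = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]            # resample test items with replacement
        diff = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if diff <= 0:                                         # A did not beat B on this resample
            a_not_better += 1
    return a_not_better / n_samples    # rough p-value for H0: "A is not better than B"

# toy usage: per-example 0/1 success of two systems on the same 8 test dialogues
p = paired_bootstrap([1, 1, 0, 1, 1, 0, 1, 1], [1, 0, 0, 1, 0, 0, 1, 0])
print(p, "→ reject H0 at α = 0.05?", p < 0.05)
```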
Why do you need to evaluate on a separate test set?
we want to know how well our model works on new, unseen data (how well it generalizes)
memorizing training data would give us 100% accuracy (on training data)
Natural Language Understanding
What are some alternative semantic representations of utterances, in addition to dialogue acts?
syntax/semantic trees (dependency trees, constituent trees, …)
frames – technically also trees, not directly connected to words
graphs – abstract meaning representation (AMR), more of a toy task, but popular
predicate logic
Describe language understanding as classification and language understanding as sequence tagging.
NLU as classification
we treat DAs as a set of semantic concepts
concepts: intents, slot-value pairs
binary classification: is concept Y contained in utterance X?
independent for each concept
consistency problems – conflicting intents/values need to be solved externally (e.g. based on classifier confidence)
language understanding as sequence tagging
we want to parse slot values from the text
we can classify each word using IOB format (inside/outside/beginning) – isolate the slot values (can consist of several words)
per-word classification alone can lead to inconsistent tag sequences (e.g. an I tag directly following an O tag)
it is therefore useful to tag the whole sentence (word sequence) at once
How do you deal with conflicting slots or intents in classification-based NLU?
we need to resolve such situations externally (e.g. based on classifier confidence)
What is delexicalization and why is it helpful in NLU?
delexicalization = replacement of slot values / named entities with placeholders (indicating entity type)
generally needed for NLU as classification (otherwise in-domain data is too sparse)
named-entity recognition (NER) is a problem on its own
in-domain gazetteers (dictionaries of names) alone may be enough
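A minimal gazetteer-based delexicalization sketch; the slot names and gazetteer entries are made up for illustration.

```python
# in-domain gazetteers: dictionaries of known names/values per slot (toy examples)
GAZETTEERS = {
    "food": ["chinese", "italian", "indian"],
    "area": ["city centre", "north", "south"],
}

def delexicalize(utterance):
    """Replace known slot values with placeholders and remember what was replaced."""
    text = utterance.lower()
    values = {}
    for slot, names in GAZETTEERS.items():
        for name in sorted(names, key=len, reverse=True):   # match longer names first
            if name in text:
                text = text.replace(name, f"<{slot}>")
                values[slot] = name
    return text, values

print(delexicalize("I want Italian food in the city centre"))
# ('i want <food> food in the <area>', {'food': 'italian', 'area': 'city centre'})
```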
Describe one of the approaches to slot tagging as sequence tagging.
basic idea
we classify each word using IOB format to isolate the slot values
to avoid inconsistencies, we tag the whole sentences (sequences of words) at once
approaches
maximum entropy Markov model (MEMM)
looking at past classifications when making next ones
whole history would be too sparse/complex → Markov assumption: only the most recent classifications matter
looking at the whole input
not modelling the sequence globally
error propagation … during inference (prediction), one error can lead to a series of errors
label bias problem
hidden Markov model (HMM)
modelling the sequence as a whole
very basic model – tag depends on current word + previous tag
Markov assumption
we can get the globally best tagging using the Viterbi algorithm (see the sketch below)
linear-chain conditional random field (CRF)
somehow combines HMM and MEMM
uses global normalization → slow to train
state-of-the-art for many sequence tagging tasks (until neural networks took over; can be also used in conjunction with NNs)
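A minimal sketch of Viterbi decoding for an HMM-style tagger; the toy transition/emission probabilities are hand-set here, a real model would estimate them from data.

```python
import math

def viterbi(words, tags, trans, emit, start):
    """Find the globally best tag sequence under an HMM-style model
    (start/transition probabilities over tags, emission probabilities of words given tags)."""
    # best[t] … log-probability of the best tag sequence for the words so far that ends in tag t
    best = {t: math.log(start.get(t, 1e-12)) + math.log(emit.get((t, words[0]), 1e-12))
            for t in tags}
    backpointers = []
    for word in words[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            prev, score = max(((p, best[p] + math.log(trans.get((p, t), 1e-12))) for p in tags),
                              key=lambda x: x[1])
            new_best[t] = score + math.log(emit.get((t, word), 1e-12))
            pointers[t] = prev
        best = new_best
        backpointers.append(pointers)
    # backtrack from the best final tag to recover the whole sequence
    sequence = [max(best, key=best.get)]
    for pointers in reversed(backpointers):
        sequence.append(pointers[sequence[-1]])
    return list(reversed(sequence))

# toy example with IOB tags for a single "food" slot; probabilities are hand-set, not trained
tags = ["O", "B-food", "I-food"]
start = {"O": 0.8, "B-food": 0.2}
trans = {("O", "O"): 0.7, ("O", "B-food"): 0.3, ("B-food", "I-food"): 0.5, ("B-food", "O"): 0.5,
         ("I-food", "O"): 0.9, ("I-food", "I-food"): 0.1}
emit = {("O", "i"): 0.3, ("O", "want"): 0.3, ("O", "food"): 0.3,
        ("B-food", "modern"): 0.6, ("I-food", "european"): 0.6}
print(viterbi(["i", "want", "modern", "european", "food"], tags, trans, emit, start))
# ['O', 'O', 'B-food', 'I-food', 'O']
```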
What is the IOB/BIO format for slot tagging?
it is used to get the slot values from the text
each word in the text is tagged; a slot value can span several consecutive words
tags
B-s … beginning of slot s
I-s … inside slot s
O … outside
example
There are over 1000 compositions by Johann Sebastian Bach.
O O B-quantity I-quantity O O B-person I-person I-person O
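A minimal sketch of reading the slot values back out of an IOB-tagged sentence:

```python
def iob_to_slots(words, tags):
    """Collect slot values from an IOB-tagged sentence (B-x starts a slot, I-x continues it)."""
    slots, current_slot, current_words = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current_slot:                              # close the previous slot, if any
                slots.append((current_slot, " ".join(current_words)))
            current_slot, current_words = tag[2:], [word]
        elif tag.startswith("I-") and tag[2:] == current_slot:
            current_words.append(word)
        else:                                             # O tag (or inconsistent I-) ends the slot
            if current_slot:
                slots.append((current_slot, " ".join(current_words)))
            current_slot, current_words = None, []
    if current_slot:
        slots.append((current_slot, " ".join(current_words)))
    return slots

words = "There are over 1000 compositions by Johann Sebastian Bach .".split()
tags = "O O B-quantity I-quantity O O B-person I-person I-person O".split()
print(iob_to_slots(words, tags))
# [('quantity', 'over 1000'), ('person', 'Johann Sebastian Bach')]
```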
What is the label bias problem?
it occurs in maximum entropy Markov models (MEMM)
due to local normalization, states with fewer outbound transitions are preferred – the transitions have larger probabilities than in states with more transitions
this makes the model less immune to error propagation (= one wrongly classified word leads to a series of errors)
How can an NLU system deal with noisy ASR output? Propose an example solution.
simple approach
ASR produces multiple hypotheses (texts)
ASR → p(text∣audio)
NLU → p(DA∣text)
we want p(DA∣audio)
we sum over the hypotheses: p(DA∣audio) = Σ_text p(DA∣text) · p(text∣audio)
alternative approach: confusion networks
we use per-word ASR confidence
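A minimal sketch of the summation above over an ASR n-best list; the toy NLU and its outputs are made up.

```python
from collections import defaultdict

def nlu_over_nbest(asr_nbest, nlu):
    """Combine NLU outputs over an ASR n-best list:
    p(DA | audio) = sum over hypotheses of p(DA | text) * p(text | audio)."""
    p_da = defaultdict(float)
    for text, p_text in asr_nbest:                  # p_text ~ p(text | audio)
        for da, p in nlu(text).items():             # p ~ p(DA | text)
            p_da[da] += p * p_text
    return dict(p_da)

# toy NLU that only "understands" one phrase
def toy_nlu(text):
    if "cheap" in text:
        return {"inform(price=cheap)": 0.9, "null()": 0.1}
    return {"null()": 1.0}

nbest = [("a cheap restaurant", 0.6), ("a jeep restaurant", 0.3), ("keep rest on rant", 0.1)]
print(nlu_over_nbest(nbest, toy_nlu))
# {'inform(price=cheap)': 0.54, 'null()': 0.46}
```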
Neural NLU & Dialogue State Tracking
Describe an example of a neural architecture for NLU.
we can use simple classification or sequence tagging
when using sequence tagging, we can tag the intent at the start of the sentence (and then assign the IOB tags to all of its words)
encoder-decoder (seq2seq) setup: the encoder reads the utterance, the decoder tags it word-by-word (using the encoder outputs as one of its inputs)
intent classification – we can do softmax over last encoder state
attention can be used in the decoder and to classify the intent
(pretrained) Transformer-based NLU
slot tagging on top of pretrained BERT Transformer model
BERT was trained to guess masked words
further trained for NLU
standard IOB approach
softmax over the final hidden states → output tags
in case of split words, classify only the first subword (IOB tags should not change mid-word)
special start token tagged with intent
optional CRF on top of the tagger
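A minimal sketch of the two heads (intent classification + IOB slot tagging) on top of an encoder's hidden states; a plain embedding layer stands in for the pretrained BERT so the sketch runs standalone, and all dimensions are made up.

```python
import torch
import torch.nn as nn

class JointNLU(nn.Module):
    """Intent classification from the first token's state + IOB slot tagging for every token."""
    def __init__(self, encoder, hidden_dim, n_intents, n_slot_tags):
        super().__init__()
        self.encoder = encoder                           # e.g. a pretrained BERT; here any module
        self.intent_head = nn.Linear(hidden_dim, n_intents)
        self.slot_head = nn.Linear(hidden_dim, n_slot_tags)

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)                 # (batch, seq_len, hidden_dim)
        intent_logits = self.intent_head(hidden[:, 0])   # special start token → intent
        slot_logits = self.slot_head(hidden)             # per-token IOB tags
        return intent_logits, slot_logits

# stand-in encoder (an embedding layer) so the sketch runs without downloading a model
encoder = nn.Embedding(num_embeddings=1000, embedding_dim=64)
model = JointNLU(encoder, hidden_dim=64, n_intents=5, n_slot_tags=9)
intent_logits, slot_logits = model(torch.randint(0, 1000, (2, 12)))   # batch of 2, 12 tokens
print(intent_logits.shape, slot_logits.shape)   # torch.Size([2, 5]) torch.Size([2, 12, 9])
```

Training would then apply cross-entropy over the intent label and over the per-token slot tags (with the optional CRF layer replacing the independent per-token softmaxes).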
How can you use pretrained language models in NLU?
we can use BERT Transformer model and fine-tune it for NLU
BERT was trained to guess masked words
What is the dialogue state and what does it contain?
dialogue state remembers what was said in the past
it acts as a basis for action selection decisions
dialogue state … current context of the conversation
contents = “all that is used when the system decides what to say next”
user goal / preferences (slots & values provided, information requested)
past system actions
other semantic context
usually, we consider a probability distribution over all possible states
What is an ontology in task-oriented dialogue systems?
it is used to describe possible states
it defines all concepts in the system
list of slots
possible range of values per slot
possible actions per slot
dependencies (some concepts are only applicable for some values of parent concepts)
Describe the task of a dialogue state tracker.
NLU is unreliable (it takes unreliable ASR output and adds its own errors), output might conflict with ontology
solution: we use belief state (probability distribution over all possible states)
per-slot distributions are used in practice
dialogue state tracker updates the belief state based on new information
to make it more robust, the state tracker can accumulate probability mass over multiple turns / over NLU n-best lists
probabilistic dialogue state tracker plays well with probabilistic dialogue policies
What's a partially observable Markov decision process?
Markov decision process
model for sequential decision making when outcomes are uncertain
set of states, actions, probabilities that action leads from a state s to a state s′, and rewards received after transitioning from state s to state s′ using action a
we are looking for a policy function – mapping from state space to action space (can be probabilistic)
partially observable MDP – we do not know the current state certainly
belief state can be modelled using a hidden Markov model
Describe a viable architecture for a belief state tracker.
basic discriminative belief tracker – we assume slot independence and trust the NLU
we have probabilities of states ps (tracked by our belief tracker) and probabilities of observations po (returned by NLU)
in each step, for every slot…
we have the probability of null observation po(null)
for every value x, we multiply ps(x) by po(null)
for every non-null observed value x, we then add po(x) to ps(x) (if both distributions sum to 1, the updated belief stays normalized)
such belief tracker is very fast and parameter-free
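A minimal sketch of this per-slot update; the value names are made up.

```python
def update_slot_belief(belief, nlu_obs):
    """One turn of the simple per-slot belief update described above.

    belief:  {value: ps(value)} from previous turns (sums to 1, includes 'none')
    nlu_obs: {value: po(value)} from the NLU for this turn, 'null' = slot not mentioned
    """
    p_null = nlu_obs.get("null", 0.0)
    new_belief = {value: p * p_null for value, p in belief.items()}   # discount old belief
    for value, p in nlu_obs.items():
        if value != "null":
            new_belief[value] = new_belief.get(value, 0.0) + p        # add new evidence
    return new_belief

belief = {"none": 1.0}                                  # nothing known about the slot yet
belief = update_slot_belief(belief, {"italian": 0.6, "indian": 0.1, "null": 0.3})
belief = update_slot_belief(belief, {"italian": 0.5, "null": 0.5})
print(belief)   # 'italian' accumulates probability mass over the two turns
```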
What is the difference between dialogue state and belief state?
dialogue state is the current context of a conversation
belief state is a probability distribution over dialogue states – it reflects the fact that the NLU is not completely reliable
What's the difference between a static and a dynamic state tracker?
static state tracker encodes whole history into features
dynamic/sequence state tracker explicitly models dialogue as sequential
can use CRF or RNNs
How can you use pretrained language models or large language models for state tracking?
BERT (pretrained language model)
we let BERT process previous system & current user utterance
we use it to predict per-slot span (value of a dialogue state slot – where to find it in the message)
from the first token's representation, we get a single decision: none/dontcare/span
using 2 softmaxes over tokens, we can then predict start & end token
we apply rule-based update to the static state tracker – if none was predicted, we keep the previous value
LLM prompting – two alternatives were presented
SQL & examples: we present SQL schema to the LLM, show several examples, and provide the previous state + one dialogue turn → the (dynamic) state changes are produced as SQL requests
chain-of-thought style: we prompt the LLM to explain the inputs and produce state based on them (it uses the whole history, the state tracker is static)
Dialogue Policies
What are the non-statistical approaches to dialogue management/action selection?
finite-state machines
dialogue state is machine state
nodes – system actions
edges – possible user response semantics
FSMs are easy to design and predictable, but very rigid and do not scale to complex domains
good for basic DTMF (touch-tone) phone systems
frame-based (VoiceXML)
slot-filling + providing information
required slots need to be filled, this can be done in any order, more information in one utterance possible
if all slots are filled, query the database
rule-based – any kind of rules (e.g. Python code)
we can use a probabilistic belief state
if-then-else rules in programming code, using thresholds over belief state for reasoning
output: system DA
very flexible and easy to code, but gets messy, the dialogue policy is pre-set (not flexible)
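A minimal sketch of such if-then-else rules over a per-slot belief state; the slot names and thresholds are made up.

```python
def rule_based_policy(belief):
    """Toy handcrafted policy: request missing slots, confirm uncertain ones, then query the DB."""
    for slot in ["food", "area", "price"]:                 # required slots for a restaurant query
        value, confidence = max(belief[slot].items(), key=lambda kv: kv[1])
        if value == "none" or confidence < 0.3:
            return f"request({slot})"                      # we know (almost) nothing → ask
        if confidence < 0.8:
            return f"confirm({slot}={value})"              # we have a guess, but are unsure
    return "offer(name=...)"                               # all slots filled → query DB & offer

belief = {"food": {"italian": 0.9, "none": 0.1},
          "area": {"centre": 0.6, "north": 0.3, "none": 0.1},
          "price": {"none": 1.0}}
print(rule_based_policy(belief))   # confirm(area=centre)
```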
Why is reinforcement learning preferred over supervised learning for training dialogue managers?
you need large human-human data for supervised learning (hard to get)
if we used human-machine data, the model would just mimic the original system
dialogue is ambiguous & complex
there is no single correct next action
some paths will be unexplored in data, but you may encounter them
dialogue systems won't behave the same as people
there are ASR errors, limited NLU, limited environment model/actions
dialogue systems should behave differently than people – make the best of what they have
in reinforcement learning, the goal is to find a policy that maximizes long-term reward – this matches the goal of dialogue management (overall dialogue success rather than just the best next turn)
note that for a typical dialogue system, the belief state space is too large to make RL tractable – we map the state into a reduced summary space, optimize there, and map the chosen actions back to the full space
Describe the main idea of reinforcement learning (agent, environment, states, rewards).
Markov decision process (MDP)
agent in an environment
has internal state
chooses actions according to policy
gets rewards and state changes from the environment
Markov property – state defines everything (no other temporal dependency)
RL = finding a policy that maximizes long-term reward
unlike supervised learning, we don't know if an action is good
immediate reward might be low while long-term reward high
return Rt = accumulated long-term reward (from timestep t onwards)
state transitions are stochastic (governed by a probability distribution) → we maximize the expected return
What are deterministic and stochastic policies in dialogue management?
deterministic policy
always take the same action π(s) in state s
enumerable in a table, equivalent to a rule-based system
but can be learned instead of hand-coded!
stochastic
specifies a probability distribution
π(s,a) … probability of choosing action a in state s
What's a value function in a reinforcement learning scenario?
state-value function Vπ(s) … the value of a state s under policy π
expected return for starting in state s and following policy π
action-value function Qπ(s,a)
expected return of taking action a in state s under policy π
value functions can be used to evaluate states (or actions) and make better decisions
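Written out (assuming discounting with a factor γ, which the notes above leave implicit):

```latex
R_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \le \gamma \le 1

V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ R_t \mid s_t = s \right], \qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ R_t \mid s_t = s,\, a_t = a \right]
```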
What's the difference between actor and critic methods in reinforcement learning?
actor model learns the policy
for a given state, it predicts a probability distribution over actions
the agent can then decide according to this distribution
critic model learns the value function
for a given state s, it predicts its value function V(s) or Q(s,a) for action a
this guides the agent (they can then use the greedy policy or something like that)
What's the difference between model-based and model-free approaches in RL?
model-based
we assume that transition probabilities and rewards are known
the solutions are mathematically nice
but you can only know the full model in limited settings
model-free
we don't assume anything
this is the one for “real-world” use
using Q instead of V comes handy here (we do not need the transition probability p(s′∣s,a) to get the expected return of taking action a in state s)
What are the main optimization approaches in reinforcement learning (what measures can you optimize and how)?
quantity to optimize
value function – critic
policy – actor
environment model: model-based × model-free
how to optimize
dynamic programming – find the exact solution from Bellman equation
iterative algorithms, refining estimates
expensive, assumes known environment (model-based)
Monte Carlo learning – learn from experience
sample, then update based on experience
once we reach state s and observe the actual return, we update the estimate to match the observation
Temporal difference learning – like MC but look ahead (bootstrap)
sample, refine estimates as you go
even before the full return is observed, the current estimate for the next state gives us a good guess → we can update based on that guess (bootstrapping; see the sketch below)
sampling & updates
on-policy – improve the policy while we are using it for decision
off-policy – decide according to a different policy
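A minimal sketch contrasting the two update styles for a state-value table; α and γ are arbitrary constants here.

```python
ALPHA, GAMMA = 0.1, 0.9    # learning rate and discount factor (arbitrary values)

def mc_update(V, state, observed_return):
    """Monte Carlo: after the episode ends, move V(s) toward the actually observed return."""
    V[state] = V.get(state, 0.0) + ALPHA * (observed_return - V.get(state, 0.0))

def td0_update(V, state, reward, next_state):
    """TD(0): update immediately, using the current estimate of the next state (bootstrapping)."""
    target = reward + GAMMA * V.get(next_state, 0.0)
    V[state] = V.get(state, 0.0) + ALPHA * (target - V.get(state, 0.0))

V = {}
mc_update(V, "asked_food", observed_return=20.0)                    # full return known at the end
td0_update(V, "asked_area", reward=-1.0, next_state="asked_food")   # uses V['asked_food'] estimate
print(V)
```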
Why do you typically need a user simulator to train a reinforcement learning dialogue policy?
we can't really learn just from static datasets
on-policy algorithms don't work (the system needs to navigate the dialogues according to the current policy – old dialogues are not sufficient)
RL needs a lot of data, more than real people would handle (also, the system behaves weirdly in the early phases of RL)
Neural Policies & Natural Language Generation
How do you involve neural networks in reinforcement learning (describe a Q network or a policy network)?
part of the agent is handled by a neural network – value function (typically Q) or policy
we are assuming huge state space (no more summary space)
REINFORCE (policy gradients)
works out of the box
we maximize performance – value of the initial state
deep Q-networks
Q-learning, where Q function is represented by a neural net
problems we need to fix
SGD is unstable
correlated samples (data is sequential)
TD updates aim at a moving target (using Q to compute updates to Q)
numeric instability (scale of rewards and Q values unknown)
fixes
minibatches (updates by averaged n samples, not just one)
experience replay – to break correlated samples (store experience in a buffer, train using minibatches sampled from the buffer)
target Q function freezing (so that the target is not moving that often)
clipping rewards
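A minimal PyTorch sketch showing where the listed fixes appear in a DQN training step; the network sizes and hyperparameters are made up.

```python
import random
from collections import deque
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS = 20, 10      # made-up sizes; the state would be a belief-state vector

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())     # frozen copy, synced only every N updates
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=10_000)               # experience replay → breaks sample correlation

def dqn_update(batch_size=32, gamma=0.99):
    """One DQN training step on a minibatch sampled from the replay buffer."""
    states, actions, rewards, next_states, dones = zip(*random.sample(replay_buffer, batch_size))
    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # TD target from the frozen target network
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(1).values
    loss = F.mse_loss(q_taken, target)             # (reward/error clipping would go here)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```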
What are the main steps of a traditional NLG pipeline – describe at least 2.
entire process: inputs → content plan → sentence plan → text
content/text/document planning
inputs → content plan
content selection according to communication goal
basic structuring & ordering
typically handled by dialogue manager
sentence planning / microplanning
content plan → sentence plan
organizing content into sentences, merging simple sentences
lexical choice, referring expressions (restaurant vs. it)
surface realization
sentence plan → text
linearization according to grammar
word order, morphology
for NLG in dialogue systems, we need sentence planning and surface realization
Describe one approach to NLG of your choice.
canned text
most trivial – completely hand-written prompts, no variation
doesn't scale (good for DTMF phone systems)
templates
“fill in blanks” approach
simple, but much more expressive, covers most common domains nicely
can scale, but still laborious
most production dialogue systems
grammars & rules
rules: mostly content & sentence planning
grammars: mostly older research systems, realization
machine learning
modern research systems
pre-neural attempts often combined with rules/grammar
neural nets made it work much better
Describe how template-based NLG works.
we define templates for system DAs
it can be enhanced with rules
inflection of the filled-in phrases
template coverage/selection rules
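A minimal template-based NLG sketch; the DA names and templates are made up (a real system would add rules for inflection, articles, and template selection/variation).

```python
# templates keyed by system dialogue act type; {slot} placeholders are filled in at runtime
TEMPLATES = {
    "inform_count": "There are {count} {food} restaurants in the {area} part of town.",
    "request_area": "Which part of town would you like?",
    "confirm_food": "You are looking for a {food} restaurant, is that right?",
}

def generate(da_type, **slots):
    template = TEMPLATES.get(da_type)
    if template is None:
        return "Sorry, I did not understand."        # fallback when no template matches
    return template.format(**slots)

print(generate("inform_count", count=3, food="Italian", area="western"))
print(generate("request_area"))
```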
What are some problems you need to deal with in template-based NLG?
it lacks generality and variation; it is difficult to maintain, expensive to scale up
the texts may sound unnatural
it is difficult to express rich information – the templates may be limiting
the templates lack context awareness
Describe a possible neural networks based NLG architecture.
our example: neural end-to-end NLG using recurrent neural networks (RNNs)
we don't need alignments
binary-encoded DA (is intent/slot-value present?)
delexicalized: does not use real values – generates templates
this approach uses modified LSTM (long short-term memory) cells – input DA is passed in every time step
it generates delexicalized templates word-by-word (decoder-only architecture)
other approaches: seq2seq, Transformer
How can you use pretrained language models or large language models in NLG?
pretrained LMs
architectures
guess masked word (encoder only: BERT)
generate next word (decoder only: GPT-2)
fix distorted sentences (both: BART, T5)
can be finetuned for our task/domain and for meaning representation (MR), can learn implicit copying
lot of them released online, plug-and-play (including multilingual versions)
LLMs
Transformer decoder models (slightly updated)
instruction tuning – finetune on problems & solutions
trained using reinforcement learning from human feedback (RLHF)
humans are paid to rate different solutions for instructions
a reward model is trained on these ratings → this model can then be used as the RL reward for LLM training
usage: simple prompting, no need for finetuning
just feed in instructions/questions/example → LLM generates solution
Voice assistants & Question Answering
What is a smart speaker made of and how does it work?
smart speaker = internet-connected mic & speaker with a virtual assistant running
optionally display/camera
multiple microphones for far-field ASR
it listens for a wake word
everything is then processed in vendor's cloud service (raw audio is sent to the cloud)
follow-up mode – no wake word needed for follow-up questions
privacy concerns
NLU includes domain detection
rules on top of machine learning
Briefly describe a viable approach to question answering.
our example: IR-based QA pipeline
IR … information retrieval
three steps
question processing
query formulation
answer type detection (what should the answer look like?)
passage retrieval
get relevant documents from the index (similar to web search) … document retrieval
find phrases in the documents that respond to the question
answer processing
generate a suitable answer to the original question
What is document retrieval and how is it used in question answering?
document retrieval = getting relevant documents (candidates) according to the query by searching in the index
can use TF-IDF (or other metrics) for weighting
document retrieval works as a coarse filter that filters out irrelevant documents (selects the ones that are relevant to the query and can possibly contain an answer to the question)
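A minimal TF-IDF retrieval sketch using scikit-learn; the documents and query are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Johann Sebastian Bach was a German composer of the Baroque period.",
    "Prague is the capital and largest city of the Czech Republic.",
    "The Turing test measures a machine's ability to exhibit human-like behaviour.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)          # sparse TF-IDF matrix, one row per doc

query = "Who was Bach?"
query_vector = vectorizer.transform([query])
scores = cosine_similarity(query_vector, doc_vectors)[0]   # similarity of the query to each doc
print(documents[scores.argmax()])                          # most relevant candidate document
```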
What is dense retrieval (in the context of question answering)?
the documents are embedded in a vector space
such embeddings can then be compared to query embeddings via cosine similarity
they can be also clustered into Voronoi cells, quantized, …
dense retrieval focuses more on semantics than on the specific contained words
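A minimal dense-retrieval sketch with toy precomputed embeddings; in practice the vectors come from a neural encoder and are indexed for fast search (the Voronoi clustering / quantization mentioned above).

```python
import numpy as np

# toy precomputed embeddings – in practice produced by a neural encoder (e.g. a BERT-like model)
documents = ["doc about composers", "doc about cities", "doc about the Turing test"]
doc_embeddings = np.array([[0.9, 0.1, 0.0],
                           [0.1, 0.9, 0.1],
                           [0.0, 0.2, 0.9]])
query_embedding = np.array([0.8, 0.0, 0.2])                # e.g. embedding of "Who was Bach?"

# cosine similarity between the query and every document embedding
sims = doc_embeddings @ query_embedding / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding))
print(documents[int(np.argmax(sims))])                     # → "doc about composers"
```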
How can you use neural models in answer extraction (for question answering)?
passage extraction
we feed the question and extracted passage(s) to Transformer model (e.g. BERT)
2 classifiers: start + end of answer span (softmax over passage tokens)
generative QA
feed in passage
generate reply word-by-word
How can you use retrieval-augmented generation in question answering?
Transformer generative language model (decoder architecture)
input: retrieved passage
output: full-sentence response
not just extraction, but full-sentence answer formulation
the model has to be trained to provide a proper reply (avoid hallucination, avoid copying everything verbatim)
What is a knowledge graph?
large repository of structured, linked information
entities … nodes
relations … edges
entities and relations are typed, the types form a similar graph (ontology)
knowledge graphs can be used for question answering
Dialogue Tooling
What is a dialogue flow/tree?
graph structure that describes a non-linear dialogue
there are conditions for reaching the individual nodes of the graph (and fallback strategies if none of the conditions is met)
What are intents and entities/slots?
intents correspond to the actions supported by the dialogue (represent what the user wants to achieve)
entities/slots are parameters of the actions (intents) – information needed to fulfill the intents
example
intent: reserve table
slots: date, time, number of guests
How can you improve a chatbot in production?
automatically
learning from user selections
statistics on user selections → automated pre-selection for next users
semi-automatically or manually
chat log analysis → model update
used measures
coverage – is the chatbot confident that it can address the user's request? (per dialogue turn)
containment – can the chatbot satisfy a user's request without human intervention? (per conversation)
What is the containment rate (in the context of using dialogue systems in call centers)?
rate at which the chatbot can satisfy a user's request without human intervention, i.e. without a hand-over to a human agent (measured per conversation)
it is a measure that can be used to evaluate the chatbot
What is retrieval-augmented generation?
process of optimizing the output of a large language model so that it references an authoritative knowledge base outside of its training data sources before generating a response
Automatic Speech Recognition
What is a speech activity detector?
it is a preprocessing step in ASR that classifies incoming audio frames as speech vs. non-speech (silence, noise)
to save CPU – run ASR only when there is speech, ignore non-speech sounds
we select units that best match the target position (to minimize adjustments needed)
Describe the main ideas of statistical parametric speech synthesis.
trying to be more flexible, less resource-hungry than unit selection
inverse of model-based ASR
based on HMMs (hidden Markov models)
principle
in corpus, we have text and audio
for training and prediction, we need:
model that can extract linguistic features (phonemes, stress, pitch) from the text
vocoder that can both extract acoustic features (spectrum, excitation) from a waveform (audio) and synthesize a waveform from acoustic features
to train the statistical acoustic model, we extract both acoustic and linguistic features from the corpus and use the features as training data
during prediction, we first extract the linguistic features from the text, then the acoustic model predicts acoustic features, and the vocoder synthesizes them into a waveform
How can you use neural networks in speech synthesis?
we can use feed-forward networks or recurrent neural networks to replace HMMs used in statistical speech synthesis