The exam will have 10 questions, mostly from this pool. In general, none of them requires you to memorize formulas, but you should know the main ideas and principles.
Introduction
What's the difference between task-oriented and non-task-oriented systems?
task-oriented
focused on completing certain tasks (booking restaurants/flights/hotels, finding bus schedules, smart home, …)
most dialogue systems currently in production
“backend access” vs. “agent/assistant”
non-task-oriented
chitchat – social conversation, entertainment
getting to know the user, specific persona
gaming the Turing test
Describe the difference between closed-domain, multi-domain, and open-domain systems.
single/closed-domain – on a well-defined area, small set of specific tasks (e.g. banking system on a specific phone number)
multi-domain – joining several single-domain systems
open-domain – “responds to anything”, used to be mostly chitchat, now somewhat working via LLMs
Describe the difference between user-initiative, mixed-initiative, and system-initiative systems.
user-initiative – user asks, machine responds
system-initiative – “form-filling”, system asks questions, user must reply (traditional, most robust, least natural)
mixed-initiative – system and user both can ask & react to queries; most natural, most complex
Linguistics of Dialogue
What are turn taking cues/hints in a dialogue? Name a few examples.
a speaker can use a turn-taking cue/hint to signal that their turn ends (they yield the floor)
examples: pauses, final/falling intonation, gaze, gestures, syntactic completion of the utterance, fillers (“uh”, “you know”)
alternatively also exact matches on the whole semantic structure (easier, but ignores partial matches)
one true answer assumed
Explain an NLG evaluation metric of your choice.
BLEU score
word-overlap with reference text(s)
BLEU = BP · (p1 · p2 · p3 · p4)^(1/4) … geometric mean of the n-gram precisions times the brevity penalty (see the sketch below)
pn … n-gram precision (how many n-grams of the output text exist in any reference text)
BP … brevity penalty (short sentences achieve higher n-gram precisions, so we penalize them)
slot error rate
diversity – can our system produce different replies?
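A minimal sentence-level BLEU sketch matching the definition above (clipped n-gram precisions plus brevity penalty); real implementations such as sacreBLEU add smoothing and corpus-level aggregation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(hypothesis, references, max_n=4):
    """Minimal BLEU sketch: geometric mean of clipped 1..4-gram precisions times a brevity penalty."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        # clip each n-gram count by the maximum count found in any single reference
        max_ref_counts = Counter()
        for ref in refs:
            for ng, cnt in Counter(ngrams(ref, n)).items():
                max_ref_counts[ng] = max(max_ref_counts[ng], cnt)
        matched = sum(min(cnt, max_ref_counts[ng]) for ng, cnt in hyp_counts.items())
        precisions.append(matched / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:        # any zero precision → BLEU is 0 (no smoothing in this sketch)
        return 0.0
    # brevity penalty: penalize hypotheses shorter than the closest reference length
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(sentence_bleu("the restaurant serves italian food",
                    ["the restaurant serves italian food in the centre"]))
```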
Why do you need to check for statistical significance (when evaluating an NLP experiment and comparing systems)?
higher score is not enough to prove your model is better
it can happen by chance
we need to define the hypotheses and select a significance level α, then compute the observed value of the test statistic and decide whether to reject H0
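One common option is paired bootstrap resampling; a minimal sketch, assuming we have per-example scores (e.g. 0/1 task success) for two systems on the same test set.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=10_000, seed=42):
    """Paired bootstrap: resample test items with replacement and count how often
    system A fails to beat system B; a small fraction suggests a significant difference."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    a_not_better = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]            # resample test items with replacement
        diff = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if diff <= 0:                                         # A did not beat B on this resample
            a_not_better += 1
    return a_not_better / n_samples    # rough p-value for H0: "A is not better than B"

# toy usage: per-example 0/1 success of two systems on the same 8 test dialogues
p = paired_bootstrap([1, 1, 0, 1, 1, 0, 1, 1], [1, 0, 0, 1, 0, 0, 1, 0])
print(p, "→ reject H0 at α = 0.05?", p < 0.05)
```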
Why do you need to evaluate on a separate test set?
we want to know how well our model works on new, unseen data (how well it generalizes)
memorizing training data would give us 100% accuracy (on training data)
Natural Language Understanding
What are some alternative semantic representations of utterances, in addition to dialogue acts?
syntax/semantic trees (dependency trees, constituent trees, …)
frames – technically also trees, not directly connected to words
graphs – abstract meaning representation (AMR), more of a toy task, but popular
predicate logic
Describe language understanding as classification and language understanding as sequence tagging.
NLU as classification
we treat DAs as a set of semantic concepts
concepts: intents, slot-value pairs
binary classification: is concept Y contained in utterance X?
independent for each concept
consistency problems – conflicting intents/values need to be solved externally (e.g. based on classifier confidence)
language understanding as sequence tagging
we want to parse slot values from the text
we can classify each word using IOB format (inside/outside/beginning) – isolate the slot values (can consist of several words)
per-word classification alone can lead to inconsistent tag sequences (e.g. an I tag directly following an O tag)
it is therefore useful to tag the whole sentence (word sequence) at once
How do you deal with conflicting slots or intents in classification-based NLU?
we need to resolve such situations externally (e.g. based on classifier confidence)
What is delexicalization and why is it helpful in NLU?
delexicalization = replacement of slot values / named entities with placeholders (indicating entity type)
generally needed for NLU as classification (otherwise in-domain data is too sparse)
named-entity recognition (NER) is a problem on its own
in-domain gazetteers (dictionaries of names) alone may be enough
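A minimal gazetteer-based delexicalization sketch; the slot names and gazetteer entries are made up for illustration.

```python
# in-domain gazetteers: dictionaries of known names/values per slot (toy examples)
GAZETTEERS = {
    "food": ["chinese", "italian", "indian"],
    "area": ["city centre", "north", "south"],
}

def delexicalize(utterance):
    """Replace known slot values with placeholders and remember what was replaced."""
    text = utterance.lower()
    values = {}
    for slot, names in GAZETTEERS.items():
        for name in sorted(names, key=len, reverse=True):   # match longer names first
            if name in text:
                text = text.replace(name, f"<{slot}>")
                values[slot] = name
    return text, values

print(delexicalize("I want Italian food in the city centre"))
# ('i want <food> food in the <area>', {'food': 'italian', 'area': 'city centre'})
```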
Describe one of the approaches to slot tagging as sequence tagging.
basic idea
we classify each word using IOB format to isolate the slot values
to avoid inconsistencies, we tag the whole sentences (sequences of words) at once
approaches
maximum entropy Markov model (MEMM)
looking at past classifications when making next ones
whole history would be too sparse/complex → Markov assumption: only the most recent classifications matter
looking at the whole input
not modelling the sequence globally
error propagation … during inference (prediction), one error can lead to a series of errors
label bias problem
hidden Markov model (HMM)
modelling the sequence as a whole
very basic model – tag depends on current word + previous tag
Markov assumption
we can get the globally best tagging using the Viterbi algorithm (see the sketch below)
linear-chain conditional random field (CRF)
somehow combines HMM and MEMM
uses global normalization → slow to train
state-of-the-art for many sequence tagging tasks (until neural networks took over; can be also used in conjunction with NNs)
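A minimal sketch of Viterbi decoding for an HMM-style tagger; the toy transition/emission probabilities are hand-set here, a real model would estimate them from data.

```python
import math

def viterbi(words, tags, trans, emit, start):
    """Find the globally best tag sequence under an HMM-style model
    (start/transition probabilities over tags, emission probabilities of words given tags)."""
    # best[t] … log-probability of the best tag sequence for the words so far that ends in tag t
    best = {t: math.log(start.get(t, 1e-12)) + math.log(emit.get((t, words[0]), 1e-12))
            for t in tags}
    backpointers = []
    for word in words[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            prev, score = max(((p, best[p] + math.log(trans.get((p, t), 1e-12))) for p in tags),
                              key=lambda x: x[1])
            new_best[t] = score + math.log(emit.get((t, word), 1e-12))
            pointers[t] = prev
        best = new_best
        backpointers.append(pointers)
    # backtrack from the best final tag to recover the whole sequence
    sequence = [max(best, key=best.get)]
    for pointers in reversed(backpointers):
        sequence.append(pointers[sequence[-1]])
    return list(reversed(sequence))

# toy example with IOB tags for a single "food" slot; probabilities are hand-set, not trained
tags = ["O", "B-food", "I-food"]
start = {"O": 0.8, "B-food": 0.2}
trans = {("O", "O"): 0.7, ("O", "B-food"): 0.3, ("B-food", "I-food"): 0.5, ("B-food", "O"): 0.5,
         ("I-food", "O"): 0.9, ("I-food", "I-food"): 0.1}
emit = {("O", "i"): 0.3, ("O", "want"): 0.3, ("O", "food"): 0.3,
        ("B-food", "modern"): 0.6, ("I-food", "european"): 0.6}
print(viterbi(["i", "want", "modern", "european", "food"], tags, trans, emit, start))
# ['O', 'O', 'B-food', 'I-food', 'O']
```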
What is the IOB/BIO format for slot tagging?
it is used to get the slot values from the text
each word in the text is tagged; a slot value can span several consecutive words
tags
B-s … beginning of slot s
I-s … inside slot s
O … outside
example
There are over 1000 compositions by Johann Sebastian Bach.
O O B-quantity I-quantity O O B-person I-person I-person O
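A minimal sketch of reading the slot values back out of an IOB-tagged sentence:

```python
def iob_to_slots(words, tags):
    """Collect slot values from an IOB-tagged sentence (B-x starts a slot, I-x continues it)."""
    slots, current_slot, current_words = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current_slot:                              # close the previous slot, if any
                slots.append((current_slot, " ".join(current_words)))
            current_slot, current_words = tag[2:], [word]
        elif tag.startswith("I-") and tag[2:] == current_slot:
            current_words.append(word)
        else:                                             # O tag (or inconsistent I-) ends the slot
            if current_slot:
                slots.append((current_slot, " ".join(current_words)))
            current_slot, current_words = None, []
    if current_slot:
        slots.append((current_slot, " ".join(current_words)))
    return slots

words = "There are over 1000 compositions by Johann Sebastian Bach .".split()
tags = "O O B-quantity I-quantity O O B-person I-person I-person O".split()
print(iob_to_slots(words, tags))
# [('quantity', 'over 1000'), ('person', 'Johann Sebastian Bach')]
```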
What is the label bias problem?
it occurs in maximum entropy Markov models (MEMM)
due to local normalization, states with fewer outbound transitions are preferred – the transitions have larger probabilities than in states with more transitions
this makes the model less immune to error propagation (= one wrongly classified word leads to a series of errors)
How can an NLU system deal with noisy ASR output? Propose an example solution.
simple approach
ASR produces multiple hypotheses (texts)
ASR → p(text∣audio)
NLU → p(DA∣text)
we want p(DA∣audio)
we sum over the hypotheses: p(DA∣audio) = Σ_text p(DA∣text) · p(text∣audio)
alternative approach: confusion networks
we use per-word ASR confidence
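A minimal sketch of the summation above over an ASR n-best list; the toy NLU and its outputs are made up.

```python
from collections import defaultdict

def nlu_over_nbest(asr_nbest, nlu):
    """Combine NLU outputs over an ASR n-best list:
    p(DA | audio) = sum over hypotheses of p(DA | text) * p(text | audio)."""
    p_da = defaultdict(float)
    for text, p_text in asr_nbest:                  # p_text ~ p(text | audio)
        for da, p in nlu(text).items():             # p ~ p(DA | text)
            p_da[da] += p * p_text
    return dict(p_da)

# toy NLU that only "understands" one phrase
def toy_nlu(text):
    if "cheap" in text:
        return {"inform(price=cheap)": 0.9, "null()": 0.1}
    return {"null()": 1.0}

nbest = [("a cheap restaurant", 0.6), ("a jeep restaurant", 0.3), ("keep rest on rant", 0.1)]
print(nlu_over_nbest(nbest, toy_nlu))
# {'inform(price=cheap)': 0.54, 'null()': 0.46}
```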
Neural NLU & Dialogue State Tracking
Describe an example of a neural architecture for NLU.
we can use simple classification or sequence tagging
when using sequence tagging, we can tag the intent at the start of the sentence (and then assign the IOB tags to all of its words)
encoder-decoder (seq2seq) setup: the encoder reads the utterance, the decoder tags it word-by-word (using the encoder outputs as one of its inputs)
intent classification – we can do softmax over last encoder state
attention can be used in the decoder and to classify the intent
(pretrained) Transformer-based NLU
slot tagging on top of pretrained BERT Transformer model
BERT was trained to guess masked words
further trained for NLU
standard IOB approach
softmax over the final hidden states → output tags
in case of split words, classify only the first subword (IOB tags should not change mid-word)
special start token tagged with intent
optional CRF on top of the tagger
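A minimal sketch of the two heads (intent classification + IOB slot tagging) on top of an encoder's hidden states; a plain embedding layer stands in for the pretrained BERT so the sketch runs standalone, and all dimensions are made up.

```python
import torch
import torch.nn as nn

class JointNLU(nn.Module):
    """Intent classification from the first token's state + IOB slot tagging for every token."""
    def __init__(self, encoder, hidden_dim, n_intents, n_slot_tags):
        super().__init__()
        self.encoder = encoder                           # e.g. a pretrained BERT; here any module
        self.intent_head = nn.Linear(hidden_dim, n_intents)
        self.slot_head = nn.Linear(hidden_dim, n_slot_tags)

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)                 # (batch, seq_len, hidden_dim)
        intent_logits = self.intent_head(hidden[:, 0])   # special start token → intent
        slot_logits = self.slot_head(hidden)             # per-token IOB tags
        return intent_logits, slot_logits

# stand-in encoder (an embedding layer) so the sketch runs without downloading a model
encoder = nn.Embedding(num_embeddings=1000, embedding_dim=64)
model = JointNLU(encoder, hidden_dim=64, n_intents=5, n_slot_tags=9)
intent_logits, slot_logits = model(torch.randint(0, 1000, (2, 12)))   # batch of 2, 12 tokens
print(intent_logits.shape, slot_logits.shape)   # torch.Size([2, 5]) torch.Size([2, 12, 9])
```

Training would then apply cross-entropy over the intent label and over the per-token slot tags (with the optional CRF layer replacing the independent per-token softmaxes).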
How can you use pretrained language models in NLU?
we can use BERT Transformer model and fine-tune it for NLU
BERT was trained to guess masked words
What is the dialogue state and what does it contain?
dialogue state remembers what was said in the past
it acts as a basis for action selection decisions
dialogue state … current context of the conversation
contents = “all that is used when the system decides what to say next”
user goal / preferences (slots & values provided, information requested)
past system actions
other semantic context
usually, we consider a probability distribution over all possible states
What is an ontology in task-oriented dialogue systems?
it is used to describe possible states
it defines all concepts in the system
list of slots
possible range of values per slot
possible actions per slot
dependencies (some concepts are only applicable for some values of parent concepts)
Describe the task of a dialogue state tracker.
NLU is unreliable (it takes unreliable ASR output and adds its own errors), output might conflict with ontology
solution: we use belief state (probability distribution over all possible states)
per-slot distributions are used in practice
dialogue state tracker updates the belief state based on new information
to make it more robust, the state tracker can accumulate probability mass over multiple turns / over NLU n-best lists
probabilistic dialogue state tracker plays well with probabilistic dialogue policies
What's a partially observable Markov decision process?
Markov decision process
model for sequential decision making when outcomes are uncertain
set of states, actions, probabilities that action leads from a state s to a state s′, and rewards received after transitioning from state s to state s′ using action a
we are looking for a policy function – mapping from state space to action space (can be probabilistic)
partially observable MDP – we do not know the current state certainly
belief state can be modelled using a hidden Markov model
Describe a viable architecture for a belief state tracker.
basic discriminative belief tracker – we assume slot independence and trust the NLU
we have probabilities of states ps (tracked by our belief tracker) and probabilities of observations po (returned by NLU)
in each step, for every slot…
we have the probability of null observation po(null)
for every value x, we multiply ps(x) by po(null)
for every non-null observed value x, we then add po(x) to ps(x) (if both distributions sum to 1, the updated belief stays normalized)
such belief tracker is very fast and parameter-free
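A minimal sketch of this per-slot update; the value names are made up.

```python
def update_slot_belief(belief, nlu_obs):
    """One turn of the simple per-slot belief update described above.

    belief:  {value: ps(value)} from previous turns (sums to 1, includes 'none')
    nlu_obs: {value: po(value)} from the NLU for this turn, 'null' = slot not mentioned
    """
    p_null = nlu_obs.get("null", 0.0)
    new_belief = {value: p * p_null for value, p in belief.items()}   # discount old belief
    for value, p in nlu_obs.items():
        if value != "null":
            new_belief[value] = new_belief.get(value, 0.0) + p        # add new evidence
    return new_belief

belief = {"none": 1.0}                                  # nothing known about the slot yet
belief = update_slot_belief(belief, {"italian": 0.6, "indian": 0.1, "null": 0.3})
belief = update_slot_belief(belief, {"italian": 0.5, "null": 0.5})
print(belief)   # 'italian' accumulates probability mass over the two turns
```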
What is the difference between dialogue state and belief state?
dialogue state is the current context of a conversation
belief state is a probability distribution over dialogue states – it reflects the fact that the NLU is not completely reliable
What's the difference between a static and a dynamic state tracker?
static state tracker encodes whole history into features
dynamic/sequence state tracker explicitly models dialogue as sequential
can use CRF or RNNs
How can you use pretrained language models or large language models for state tracking?
BERT (pretrained language model)
we let BERT process previous system & current user utterance
we use it to predict per-slot span (value of a dialogue state slot – where to find it in the message)
from the first token's representation, we get a single decision: none/dontcare/span
using 2 softmaxes over tokens, we can then predict start & end token
we apply rule-based update to the static state tracker – if none was predicted, we keep the previous value
LLM prompting – two alternatives were presented
SQL & examples: we present SQL schema to the LLM, show several examples, and provide the previous state + one dialogue turn → the (dynamic) state changes are produced as SQL requests
chain-of-thought style: we prompt the LLM to explain the inputs and produce state based on them (it uses the whole history, the state tracker is static)
Dialogue Policies
What are the non-statistical approaches to dialogue management/action selection?
finite-state machines
dialogue state is machine state
nodes – system actions
edges – possible user response semantics
FSMs are easy to design and predictable, but very rigid and do not scale to complex domains
good for basic DTMF (touch-tone) phone systems
frame-based (VoiceXML)
slot-filling + providing information
required slots need to be filled, this can be done in any order, more information in one utterance possible
if all slots are filled, query the database
rule-based – any kind of rules (e.g. Python code)
we can use a probabilistic belief state
if-then-else rules in programming code, using thresholds over belief state for reasoning
output: system DA
very flexible and easy to code, but gets messy, the dialogue policy is pre-set (not flexible)
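A minimal sketch of such if-then-else rules over a per-slot belief state; the slot names and thresholds are made up.

```python
def rule_based_policy(belief):
    """Toy handcrafted policy: request missing slots, confirm uncertain ones, then query the DB."""
    for slot in ["food", "area", "price"]:                 # required slots for a restaurant query
        value, confidence = max(belief[slot].items(), key=lambda kv: kv[1])
        if value == "none" or confidence < 0.3:
            return f"request({slot})"                      # we know (almost) nothing → ask
        if confidence < 0.8:
            return f"confirm({slot}={value})"              # we have a guess, but are unsure
    return "offer(name=...)"                               # all slots filled → query DB & offer

belief = {"food": {"italian": 0.9, "none": 0.1},
          "area": {"centre": 0.6, "north": 0.3, "none": 0.1},
          "price": {"none": 1.0}}
print(rule_based_policy(belief))   # confirm(area=centre)
```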
Why is reinforcement learning preferred over supervised learning for training dialogue managers?
you need large human-human data for supervised learning (hard to get)
if we used human-machine data, the model would just mimic the original system
dialogue is ambiguous & complex
there is no single correct next action
some paths will be unexplored in data, but you may encounter them
dialogue systems won't behave the same as people
there are ASR errors, limited NLU, limited environment model/actions
dialogue systems should behave differently than people – make the best of what they have
in reinforcement learning, the goal is to find a policy that maximizes long-term reward – this matches the goal of dialogue management (overall dialogue success rather than just the best next turn)
note that for a typical dialogue system, the belief state space is too large to make RL tractable – we map the state into a reduced summary space, optimize there, and map the chosen actions back to the full space
Describe the main idea of reinforcement learning (agent, environment, states, rewards).
Markov decision process (MDP)
agent in an environment
has internal state
chooses actions according to policy
gets rewards and state changes from the environment
Markov property – state defines everything (no other temporal dependency)
RL = finding a policy that maximizes long-term reward
unlike supervised learning, we don't know if an action is good
immediate reward might be low while long-term reward high
return Rt = accumulated long-term reward (from timestep t onwards)
state transitions are stochastic (governed by a probability distribution) → we maximize the expected return
What are deterministic and stochastic policies in dialogue management?
deterministic policy
always take the same action π(s) in state s
enumerable in a table, equivalent to a rule-based system
but can be learned instead of hand-coded!
stochastic
specifies a probability distribution
π(s,a) … probability of choosing action a in state s
What's a value function in a reinforcement learning scenario?
state-value function Vπ(s) … the value of a state s under policy π
expected return for starting in state s and following policy π
action-value function Qπ(s,a)
expected return of taking action a in state s under policy π
value functions can be used to evaluate states (or actions) and make better decisions
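Written out (assuming discounting with a factor γ, which the notes above leave implicit):

```latex
R_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \le \gamma \le 1

V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ R_t \mid s_t = s \right], \qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ R_t \mid s_t = s,\, a_t = a \right]
```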
What's the difference between actor and critic methods in reinforcement learning?
actor model learns the policy
for a given state, it predicts a probability distribution over actions
the agent can then decide according to this distribution
critic model learns the value function
for a given state s, it predicts its value function V(s) or Q(s,a) for action a
this guides the agent (they can then use the greedy policy or something like that)
What's the difference between model-based and model-free approaches in RL?
model-based
we assume that transition probabilities and rewards are known
the solutions are mathematically nice
but you can only know the full model in limited settings
model-free
we don't assume anything
this is the one for “real-world” use
using Q instead of V comes handy here (we do not need the transition probability p(s′∣s,a) to get the expected return of taking action a in state s)
What are the main optimization approaches in reinforcement learning (what measures can you optimize and how)?
quantity to optimize
value function – critic
policy – actor
environment model: model-based × model-free
how to optimize
dynamic programming – find the exact solution from Bellman equation
iterative algorithms, refining estimates
expensive, assumes known environment (model-based)
Monte Carlo learning – learn from experience
sample, then update based on experience
once we reach state s and observe the actual return, we update the estimate to match the observation
Temporal difference learning – like MC but look ahead (bootstrap)
sample, refine estimates as you go
even before the full return is observed, the current estimate for the next state gives us a good guess → we can update based on that guess (bootstrapping; see the sketch below)
sampling & updates
on-policy – improve the policy while we are using it for decision
off-policy – decide according to a different policy
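A minimal sketch contrasting the two update styles for a state-value table; α and γ are arbitrary constants here.

```python
ALPHA, GAMMA = 0.1, 0.9    # learning rate and discount factor (arbitrary values)

def mc_update(V, state, observed_return):
    """Monte Carlo: after the episode ends, move V(s) toward the actually observed return."""
    V[state] = V.get(state, 0.0) + ALPHA * (observed_return - V.get(state, 0.0))

def td0_update(V, state, reward, next_state):
    """TD(0): update immediately, using the current estimate of the next state (bootstrapping)."""
    target = reward + GAMMA * V.get(next_state, 0.0)
    V[state] = V.get(state, 0.0) + ALPHA * (target - V.get(state, 0.0))

V = {}
mc_update(V, "asked_food", observed_return=20.0)                    # full return known at the end
td0_update(V, "asked_area", reward=-1.0, next_state="asked_food")   # uses V['asked_food'] estimate
print(V)
```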
Why do you typically need a user simulator to train a reinforcement learning dialogue policy?
we can't really learn just from static datasets
on-policy algorithms don't work (the system needs to navigate the dialogues according to the current policy – old dialogues are not sufficient)
RL needs a lot of data, more than real people would handle (also, the system behaves weirdly in the early phases of RL)
Neural Policies & Natural Language Generation
How do you involve neural networks in reinforcement learning (describe a Q network or a policy network)?
part of the agent is handled by a neural network – value function (typically Q) or policy
we are assuming huge state space (no more summary space)
REINFORCE (policy gradients)
works out of the box
we maximize performance – value of the initial state
deep Q-networks
Q-learning, where Q function is represented by a neural net
problems we need to fix
SGD is unstable
correlated samples (data is sequential)
TD updates aim at a moving target (using Q to compute updates to Q)
numeric instability (scale of rewards and Q values unknown)
fixes
minibatches (updates by averaged n samples, not just one)
experience replay – to break correlated samples (store experience in a buffer, train using minibatches sampled from the buffer)
target Q function freezing (so that the target is not moving that often)
clipping rewards
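A minimal PyTorch sketch showing where the listed fixes appear in a DQN training step; the network sizes and hyperparameters are made up.

```python
import random
from collections import deque
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS = 20, 10      # made-up sizes; the state would be a belief-state vector

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())     # frozen copy, synced only every N updates
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=10_000)               # experience replay → breaks sample correlation

def dqn_update(batch_size=32, gamma=0.99):
    """One DQN training step on a minibatch sampled from the replay buffer."""
    states, actions, rewards, next_states, dones = zip(*random.sample(replay_buffer, batch_size))
    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # TD target from the frozen target network
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(1).values
    loss = F.mse_loss(q_taken, target)             # (reward/error clipping would go here)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```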
What are the main steps of a traditional NLG pipeline – describe at least 2.
entire process: inputs → content plan → sentence plan → text
content/text/document planning
inputs → content plan
content selection according to communication goal
basic structuring & ordering
typically handled by dialogue manager
sentence planning / microplanning
content plan → sentence plan
organizing content into sentences, merging simple sentences
lexical choice, referring expressions (restaurant vs. it)
surface realization
sentence plan → text
linearization according to grammar
word order, morphology
for NLG in dialogue systems, we need sentence planning and surface realization
Describe one approach to NLG of your choice.
canned text
most trivial – completely hand-written prompts, no variation
doesn't scale (good for DTMF phone systems)
templates
“fill in blanks” approach
simple, but much more expressive, covers most common domains nicely
can scale, but still laborious
most production dialogue systems
grammars & rules
rules: mostly content & sentence planning
grammars: mostly older research systems, realization
machine learning
modern research systems
pre-neural attempts often combined with rules/grammar
neural nets made it work much better
Describe how template-based NLG works.
we define templates for system DAs
it can be enhanced with rules
inflection of the filled-in phrases
template coverage/selection rules
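A minimal template-based NLG sketch; the DA names and templates are made up (a real system would add rules for inflection, articles, and template selection/variation).

```python
# templates keyed by system dialogue act type; {slot} placeholders are filled in at runtime
TEMPLATES = {
    "inform_count": "There are {count} {food} restaurants in the {area} part of town.",
    "request_area": "Which part of town would you like?",
    "confirm_food": "You are looking for a {food} restaurant, is that right?",
}

def generate(da_type, **slots):
    template = TEMPLATES.get(da_type)
    if template is None:
        return "Sorry, I did not understand."        # fallback when no template matches
    return template.format(**slots)

print(generate("inform_count", count=3, food="Italian", area="western"))
print(generate("request_area"))
```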
What are some problems you need to deal with in template-based NLG?
it lacks generality and variation; it is difficult to maintain, expensive to scale up
the texts may sound unnatural
it is difficult to express rich information – the templates may be limiting
the templates lack context awareness
Describe a possible neural networks based NLG architecture.
our example: neural end-to-end NLG using recurrent neural networks (RNNs)
we don't need alignments
binary-encoded DA (is intent/slot-value present?)
delexicalized: does not use real values – generates templates
this approach uses modified LSTM (long short-term memory) cells – input DA is passed in every time step
it generates delexicalized templates word-by-word (decoder-only architecture)
other approaches: seq2seq, Transformer
How can you use pretrained language models or large language models in NLG?
pretrained LMs
architectures
guess masked word (encoder only: BERT)
generate next word (decoder only: GPT-2)
fix distorted sentences (both: BART, T5)
can be finetuned for our task/domain and for meaning representation (MR), can learn implicit copying
lot of them released online, plug-and-play (including multilingual versions)
LLMs
Transformer decoder models (slightly updated)
instruction tuning – finetune on problems & solutions
trained using reinforcement learning from human feedback (RLHF)
humans are paid to rate different solutions for instructions
a reward model is trained on these ratings → this model can then be used as the RL reward for LLM training
usage: simple prompting, no need for finetuning
just feed in instructions/questions/example → LLM generates solution
Voice assistants & Question Answering
What is a smart speaker made of and how does it work?
smart speaker = internet-connected mic & speaker with a virtual assistant running
optionally display/camera
multiple microphones for far-field ASR
it listens for a wake word
everything is then processed in vendor's cloud service (raw audio is sent to the cloud)
follow-up mode – no wake word needed for follow-up questions
privacy concerns
NLU includes domain detection
rules on top of machine learning
Briefly describe a viable approach to question answering.
our example: IR-based QA pipeline
IR … information retrieval
three steps
question processing
query formulation
answer type detection (what should the answer look like?)
passage retrieval
get relevant documents from the index (similar to web search) … document retrieval
find phrases in the documents that respond to the question
answer processing
generate a suitable answer to the original question
What is document retrieval and how is it used in question answering?
document retrieval = getting relevant documents (candidates) according to the query by searching in the index
can use TF-IDF (or other metrics) for weighting
document retrieval works as a coarse filter that filters out irrelevant documents (selects the ones that are relevant to the query and can possibly contain an answer to the question)
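A minimal TF-IDF retrieval sketch using scikit-learn; the documents and query are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Johann Sebastian Bach was a German composer of the Baroque period.",
    "Prague is the capital and largest city of the Czech Republic.",
    "The Turing test measures a machine's ability to exhibit human-like behaviour.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)          # sparse TF-IDF matrix, one row per doc

query = "Who was Bach?"
query_vector = vectorizer.transform([query])
scores = cosine_similarity(query_vector, doc_vectors)[0]   # similarity of the query to each doc
print(documents[scores.argmax()])                          # most relevant candidate document
```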
What is dense retrieval (in the context of question answering)?
the documents are embedded in a vector space
such embeddings can then be compared to query embeddings via cosine similarity
they can be also clustered into Voronoi cells, quantized, …
dense retrieval focuses more on semantics than on the specific contained words
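A minimal dense-retrieval sketch with toy precomputed embeddings; in practice the vectors come from a neural encoder and are indexed for fast search (the Voronoi clustering / quantization mentioned above).

```python
import numpy as np

# toy precomputed embeddings – in practice produced by a neural encoder (e.g. a BERT-like model)
documents = ["doc about composers", "doc about cities", "doc about the Turing test"]
doc_embeddings = np.array([[0.9, 0.1, 0.0],
                           [0.1, 0.9, 0.1],
                           [0.0, 0.2, 0.9]])
query_embedding = np.array([0.8, 0.0, 0.2])                # e.g. embedding of "Who was Bach?"

# cosine similarity between the query and every document embedding
sims = doc_embeddings @ query_embedding / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding))
print(documents[int(np.argmax(sims))])                     # → "doc about composers"
```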
How can you use neural models in answer extraction (for question answering)?
passage extraction
we feed the question and extracted passage(s) to Transformer model (e.g. BERT)
2 classifiers: start + end of answer span (softmax over passage tokens)
generative QA
feed in passage
generate reply word-by-word
How can you use retrieval-augmented generation in question answering?
Transformer generative language model (decoder architecture)
input: retrieved passage
output: full-sentence response
not just extraction, but full-sentence answer formulation
the model has to be trained to provide a proper reply (avoid hallucination, avoid copying everything verbatim)
What is a knowledge graph?
large repository of structured, linked information
entities … nodes
relations … edges
entities and relations are typed, the types form a similar graph (ontology)
knowledge graphs can be used for question answering
Dialogue Tooling
What is a dialogue flow/tree?
graph structure that describes a non-linear dialogue
there are conditions for reaching the individual nodes of the graph (and fallback strategies if none of the conditions is met)
What are intents and entities/slots?
intents correspond to the actions supported by the dialogue (represent what the user wants to achieve)
entities/slots are parameters of the actions (intents) – information needed to fulfill the intents
example
intent: reserve table
slots: date, time, number of guests
How can you improve a chatbot in production?
automatically
learning from user selections
statistics on user selections → automated pre-selection for next users
semi-automatically or manually
chat log analysis → model update
used measures
coverage – is the chatbot confident that it can address the user's request? (per dialogue turn)
containment – can the chatbot satisfy a user's request without human intervention? (per conversation)
What is the containment rate (in the context of using dialogue systems in call centers)?
rate at which the chatbot can satisfy a user's request without human intervention, i.e. without a hand-over to a human agent (measured per conversation)
it is a measure that can be used to evaluate the chatbot
What is retrieval-augmented generation?
process of optimizing the output of a large language model so that it references an authoritative knowledge base outside of its training data sources before generating a response
Automatic Speech Recognition
What is a speech activity detector?
it is a preprocessing step in ASR that classifies incoming audio frames as speech vs. non-speech (silence, noise)
to save CPU – run ASR only when there is speech, ignore non-speech sounds
we select units that best match the target position (to minimize adjustments needed)
Describe the main ideas of statistical parametric speech synthesis.
trying to be more flexible, less resource-hungry than unit selection
inverse of model-based ASR
based on HMMs (hidden Markov models)
principle
in corpus, we have text and audio
for training and prediction, we need:
model that can extract linguistic features (phonemes, stress, pitch) from the text
vocoder that can both extract acoustic features (spectrum, excitation) from a waveform (audio) and synthesize a waveform from acoustic features
to train the statistical acoustic model, we extract both acoustic and linguistic features from the corpus and use the features as training data
during prediction, we first extract the linguistic features from the text, then the acoustic model predicts acoustic features, and the vocoder synthesizes them into a waveform
How can you use neural networks in speech synthesis?
we can use feed-forward networks or recurrent neural networks to replace HMMs used in statistical speech synthesis