Lecture
- credit requirements
- https://web.stanford.edu/~jurafsky/slp3/
- communication domains
- single/closed-domain
- multi-domain
- open-domain
- application areas
- phone
- apps
- smart speakers
- appliances
- cars
- web
- embodied (robots)
- virtual characters
- modes of communication
- text
- voice
- multimodal – video, facial expressions, touch, …
- dialogue initiative
- system-initiative
- user-initiative
- mixed-initiative
- traditional architecture
- main loop
- voice → text → meaning → reaction → text → voice
- components
- speech recognition
- language understanding
- dialogue management
- has access to backend (in order to perform tasks)
- language generation
- speech synthesis
- multimodal system would have additional components
- automatic speech recognition (ASR)
- converting speech signal into text
- typically produces several possible hypotheses with confidence scores
- n-best list
- lattice
- confusion network
- very good in ideal conditions
- problems: noise, accents, distance, channel (phone), …
- voice activity detection
- is the user talking to the system?
- wake words (OK, Google)
- ASR is usually implemented using neural networks
- natural/spoken language understanding (NLU/SLU)
- extracting the meaning from the user utterance
- converting into a structured semantic representation
- dialogue acts
- act type/intent (inform, request, confirm)
- slot/attribute
- value
- examples
- inform(food=Chinese, price=cheap)
- request(address)
- can be more complex (using syntax trees, predicate logic)
- specific steps
- named entity recognition
- coreference resolution
- implementation varies
- handcrafting often works for limited domains
- keyword spotting, regular expressions, handcrafted grammars
- machine learning approaches
- can also provide n-best outputs
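- example (a sketch, not from the lecture): keyword spotting with regular expressions mapped to a dialogue act; the patterns, slot names, and act names below are invented for illustration

```python
import re

# Toy keyword-spotting NLU: map an utterance to a dialogue act
# (act type + slot-value pairs), e.g. inform(food=Chinese, price=cheap).
FOOD_RE = re.compile(r"\b(chinese|italian|indian)\b", re.I)
PRICE_RE = re.compile(r"\b(cheap|moderate|expensive)\b", re.I)
REQUEST_RE = re.compile(r"\b(address|phone|postcode)\b", re.I)

def parse(utterance: str) -> dict:
    if m := REQUEST_RE.search(utterance):
        return {"act": "request", "slots": {m.group(1).lower(): None}}
    slots = {}
    if m := FOOD_RE.search(utterance):
        slots["food"] = m.group(1).lower()
    if m := PRICE_RE.search(utterance):
        slots["price"] = m.group(1).lower()
    return {"act": "inform" if slots else "null", "slots": slots}

print(parse("I'd like a cheap Chinese place"))
# {'act': 'inform', 'slots': {'food': 'chinese', 'price': 'cheap'}}
```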
- problems
- recovering from bad ASR
- ambiguities – next Friday (it is Tuesday now)
- variation – there are many ways to express the same thing
- dialogue manager (DM)
- stores dialogue history modeled by dialogue state
- handcrafted × probabilistic
- handcrafted … just replace the slot value with the last-mentioned one
- probabilistic … keep an estimate
- system actions described by dialogue policy
- decision on next system action, given dialogue state
- involves backend queries
- result represented as system dialogue act
- handcrafted
- if-then-else clauses
- flowcharts
- machine learning
- often trained with reinforcement learning
- POMDP (partially observable Markov decision process)
- recurrent neural networks
- natural language generation (NLG)
- how to express things might depend on context
- goals: fluency, naturalness, avoid repetition, …
- traditional approach: templates
- fill in values into predefined templates (sentence skeletons)
- works well for limited domains
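- example (illustrative sketch): filling slot values into predefined sentence skeletons; the templates below are invented, not from the lecture

```python
# Toy template-based NLG: pick a sentence skeleton for the system act
# and fill in the slot values.
TEMPLATES = {
    "inform": "{name} is a {price} restaurant serving {food} food.",
    "request": "What {slot} are you looking for?",
}

def realize(act: str, **values) -> str:
    return TEMPLATES[act].format(**values)

print(realize("inform", name="Golden Dragon", price="cheap", food="Chinese"))
print(realize("request", slot="price range"))
```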
- grammar-based approaches
- grammar/semantic structures
- syntactic transformation rules are applied
- statistical approaches
- most prominent: transformer neural networks
- generating word-by-word
- speech synthesis
- standard pipeline: text normalization, pronunciation analysis, intonation/stress generation, waveform synthesis
- TTS methods
- formant-based – phoneme-specific frequencies, rules
- concatenative – record a single person, cut into phoneme transitions
- hidden Markov models
- neural networks
- no need for phoneme conversion, can go directly from text
- text to spectrograms → vocoder (spectrogram to audio)
- organizing the components
- basic – pipeline
- components oblivious of each other
- interconnected
- read/write changes to dialogue state
- more reactive but more complex
- joining the modules
- ASR + NLU
- NLU + state tracking
- NLU & DM & NLG – using LLMs, may be end-to-end (without module separation)
- audio based end-to-end (audio-to-audio)
- research areas
- LLM-based systems
- dialogue flows from data – finding patterns in human dialogue recordings/transcripts
- multimodality – adding video (input/output)
- context dependency – understand/reply in context (grounding, speaker adaptation)
- incrementality – don't wait for the whole sentence to start processing
What happens in a dialogue
- dialogue = conversational communication between two or more people
- verbal + non-verbal
- collaborative, social
- practical, related to actions
- interactive, incremental, messy
- dialogue systems are simpler than that
- linguistic description
- phonetics/phonology
- morphology
- syntax
- semantics – sentence (propositional) meaning
- pragmatics – meaning in context, communication goal
- underlying meaning of the sentence
- turn-taking (interactivity)
- turn = continuous utterance from one speaker
- normal dialogue – very fluent, fast
- minimizing overlaps and gaps
- cues/markers for turn boundaries
- overlaps happen naturally
- ambiguities in turn-taking
- barge-in
- natural speech is very different from written text
- turn taking in dialogue systems
- consecutive turns are typically assumed
- system waits for the user to finish their turn
- voice activity detection (VAD)
- quite hard
- we need to figure out if it is the user speaking and if they are speaking to the system
- wake words make VAD easier
- some systems allow user's barge-in
- may be tied to the wake word
- voice activity detection
- overlapping windows of ~30 ms + binary classifier
- features are similar to speech recognition itself
- onset is easier to detect than end of speech
- but it is hard to detect speech towards the system vs. someone else (that's why wake words are used)
- postprocessing – smoothing out short-term errors
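- sketch (simplified): frame-based VAD over overlapping ~30 ms windows; a plain energy threshold stands in for the trained binary classifier, and the window/threshold values are arbitrary assumptions

```python
import numpy as np

def vad(signal: np.ndarray, sr: int = 16000,
        win: float = 0.03, hop: float = 0.01, thresh: float = 0.01) -> np.ndarray:
    """Label overlapping ~30 ms windows as speech (True) / non-speech (False)."""
    w, h = int(win * sr), int(hop * sr)
    frames = [signal[i:i + w] for i in range(0, len(signal) - w, h)]
    labels = np.array([np.mean(f ** 2) > thresh for f in frames])
    # post-processing: smooth out short-term errors with a majority vote
    k = 5
    return np.array([labels[max(0, i - k):i + k + 1].mean() > 0.5
                     for i in range(len(labels))])

sr = 16000
sig = np.concatenate([np.zeros(sr),                                         # 1 s silence
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)])  # 1 s tone
print(vad(sig, sr).mean())   # roughly half of the frames labelled as speech
```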
- speech acts
- John L. Austin, John Searle
- each utterance is an act (intentional, changing the state of the world)
- speech acts consist of
- utterance act – uttering of the words
- propositional act – semantics (surface meaning)
- illocutionary act – pragmatic meaning
- perlocutionary act – listener obeys command, listener's worldview changes, …
- types of speech acts
- assertive – speaker commits to the truth of a proposition
- directive – speaker wants the listener to do something
- commissive – speaker commits to do something themselves
- expressive – speaker expresses their psychological state
- declarative – performing actions (“performative verbs”)
- explicit (using a verb directly corresponding to the act) vs. implicit
- I promise to come by later. × I'll come by later.
- direct vs. indirect
- indirect – the surface meaning does not correspond to the actual one
- primary illocution – the actual meaning
- secondary illocution – how it's expressed
- example
- direct: Please close the window.
- indirect: Could you close the window?
- more indirect: I'm cold.
- conversational maxims
- Paul Grice
- based on Grice's cooperative principle (“dialogue is cooperative”)
- 4 maxims: quantity, quality, relation, manner
- implicatures
- obvious violation of the maxim implies additional meanings
- speech acts, maxims and implicatures in dialogue systems
- learned from the data / hand-coded
- understanding
- tested on real users → usually handles indirect speech acts
- implicatures limited – there's no common sense
- responses
- mostly strive for clarity – the user shouldn't need to work out implicatures
- grounding
- dialogue is cooperative → need to ensure mutual understanding
- common ground = shared knowledge, mutual assumptions of dialogue participants
- knowingly shared
- expanded/updated/refined in an informative conversation
- validated/verified via grounding feedback/evidence
- positive – understanding/acceptance signals
- visual, backchannels
- explicit feedback
- implicit feedback – showing understanding implicitly in the next utterance
- negative – misunderstanding
- visual
- implicit/explicit repairs
- clarification requests
- repair requests
- in dialogue systems
- crucial for successful dialogue
- backchannels / visual signals typically not present
- implicit confirmation very common
- explicit confirmation may be required for important steps
- clarification & repair requests very common
- part of dialogue management
- deixis = pointing
- relating between language & context/world
- very important in dialogue
- deictic expressions
- meaning dependent on the context
- pronouns, verbs (tense and person markers), adverbs, other (lexical meaning – e.g. come/go)
- typically egocentric, I – here – now is the center (origo)
- main types of deixis: personal, temporal, local
- other: social, discourse/textual
- anaphora/coreference
- anaphora – referring back
- cataphora – referring forward
- in dialogue systems
- systems typically assume a single user → personal deixis becomes much easier
- most systems are aware of time, location is more complicated
- coreference resolution is a separate problem
- prediction
- dialogue is a social interaction
- brain does not listen passively
- prediction is crucial for human cognition
- this is why we understand in adverse conditions
- we predict what the person might say → we can understand even in noisy environment
- entropy
- Claude Shannon
- communication channel, entropy
- plays well with the social interaction perspective
- people tend to use all available channel capacity
- in noisy environment, we speak louder and slower
- people tend to spread information evenly
- words carrying more information are emphasized
- conditional entropy
- how hard is it to guess the next word in the sentence?
- given n-gram preceding context
- related to Shannon entropy but may differ
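- worked example (toy, assuming a single-word/bigram context): estimating the conditional entropy of the next word given the previous word from counts on a made-up corpus

```python
import math
from collections import Counter

# H(next | previous) = -sum_{x,y} p(x, y) * log2 p(y | x), estimated from bigrams
words = "the cat sat on the mat the cat ate".split()
bigrams = Counter(zip(words, words[1:]))
contexts = Counter(words[:-1])
total = sum(bigrams.values())

h = 0.0
for (prev, nxt), c in bigrams.items():
    p_joint = c / total          # p(previous, next)
    p_cond = c / contexts[prev]  # p(next | previous)
    h -= p_joint * math.log2(p_cond)
print(f"H(next word | previous word) = {h:.2f} bits")
```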
- prediction in dialogue systems
- used a lot in speech recognition
- statistical language models – based on information theory
- not as good as humans
- less use in other DS components
- adaptation/entrainment
- people subconsciously adapt/align/entrain to their dialogue partner over the course of the dialogue
- wording, grammar
- speech rate, prosody, loudness
- accent/dialect
- this helps a successful dialogue and social bonding, feels natural
- dialogue systems typically don't align
- NLG is rigid (templates, machine learning trained without context)
- but people align to dialogue systems
- politeness
- dialogue as social interaction – follows social conventions
- indirect is polite
- this is the point of most indirect speech acts
- clashes with conversational maxims (maxim of manner)
- appropriate level of politeness might be hard to find (culturally dependent)
- face-saving (Brown & Levinson)
- positive face = desire to be accepted, liked
- negative face = desire to act freely
- face-threatening acts – potentially any utterance
- threatening other's/own negative/positive face
- politeness softens FTAs
- in dialogue systems
- typically handcrafted, does not adapt to the situation
- typically not much indirect speech, but trying to stay polite
- learning from data can be tricky – may contain offensive speech (not just swearwords, problems can be hard to find)
Data
- two main questions before building a dialogue system
- what data to base it on
- how to evaluate it
- observation: if you have extensive data of a high-enough quality, the LLM learns how to count etc. just from the examples
- data
- corpus/dataset = collection of linguistic data
- Hugging Face, Czech National Corpus, …
- dialogue corpora/dataset types
- modality: written/spoken/multimodal
- source
- human-human conversations – real dialogues, scripted (from movies)
- human-machine
- automatically generated
- domain
- closed/constrained/limited domain
- multi-domain (multiple closed domains)
- open domain (any topic, chitchat)
- dialogue data collection
- in-house collection using experts (or students)
- safe, high-quality
- expensive, time-consuming
- Wizard-of-Oz (WoZ)
- for in-house data collection
- also: to prototype/evaluate a system before implementing it
- users believe they're talking to a system
- they behave differently than when talking to a human
- usually simpler
- system in fact controlled by a human “wizard”
- typically selecting options (free typing is too slow)
- web crawling
- typically not real dialogues
- offensive stuff
- many copies of the same content
- problematic licensing
- crowdsourcing
- compromise: employing (untrained) people over the web
- platforms: Amazon Mechanical Turk, Appen, Prolific
- people tend to game the system, causing noise
- corpus annotation
- what we need to add to the data (recordings)
- transcriptions (textual representation of audio)
- semantic annotation such as dialogue acts
- named entity labelling
- other linguistic annotation: POS, syntax (usually not in DSs)
- getting annotation
- similar task as getting the data itself
- inter-annotator agreement
- typical measure: Cohen's Kappa
- for categorical annotation
- κ ≤ 1 (1 = perfect agreement, ~0 = chance-level agreement)
- 0.4 ~ fair, >0.7 ~ great
- κ = (agreement − chance) / (1 − chance)
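- example (sketch): computing Cohen's kappa for two annotators' categorical labels with the formula above; the labels are made up

```python
from collections import Counter

def cohens_kappa(a, b):
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement)"""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    chance = sum((ca[label] / n) * (cb[label] / n) for label in set(a) | set(b))
    return (observed - chance) / (1 - chance)

ann1 = ["inform", "request", "inform", "confirm", "inform"]
ann2 = ["inform", "request", "confirm", "confirm", "inform"]
print(round(cohens_kappa(ann1, ann2), 2))   # 0.69
```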
- corpus size
- we need enough examples for an accurate model
- speech: 10s–100s of hours minimum
- pretrained LMs/audio LLMs: 100k–10M hours
- NLU, DM, NLG
- handcrafting: 10s–100s of dialogues may be enough to inform the design
- simple model / limited domain: 100s–1000s dialogues might be fine
- open domain: sky's the limit (LLMs: 1T+ tokens)
- TTS – single person, several hours at least
- it pays off to have high-quality recordings of a single person speaking in a flat tone
- pretrained LMs: 10k+ hours (multilingual)
- available dialogue datasets
- domain choice is rather limited
- size is very often not enough
- vast majority is English-only
- few free datasets with audio
- MultiWOZ
- task-oriented written dataset
- crowdsourced
- dataset types
- dataset splits
- train/dev/test split
- dev … validation
- test … evaluation
- cross-validation
Evaluation
- types
- extrinsic (how does the system affect the world) × intrinsic (how do its components work)
- subjective (what users think about it, manual) × objective (measuring properties directly from data, automatic)
- we use quantitative evaluation (based on numeric data), not qualitative (detailed interviews with the users)
- significance testing (Student's t-test, Mann-Whitney U-test), bootstrap resampling
- getting the subjects for extrinsic evaluation
- extrinsic evaluation
- how to measure
- record people
- analyze the logs
- metrics
- task success
- duration
- retention rate (percentage of returning users)
- fallback rate (percentage of failed dialogues)
- number of users – not applicable in a research setting
- subjective
- questionnaires
- question types: open-ended, yes/no, Likert scales
- …
- intrinsic
- ASR: word error rate ~ length-normalized Levenshtein distance (see the sketch at the end of this section)
- NLU
- slot precision & recall & F1-measure
- accuracy used for intent/act type
- dialogue manager
- objective measures can be collected with user simulator
- NLG
- word-overlap (BLEU score)
- slot error rate
- diversity
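- sketch of the word error rate mentioned under ASR above: word-level Levenshtein distance normalized by reference length

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # standard Levenshtein dynamic-programming table
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

print(wer("book a table for two", "book table for you"))   # 2 errors / 5 words = 0.4
```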
Natural Language Understanding
- words → meaning
- challenges
- non-grammaticality
- disfluencies
- ASR errors
- synonymy
- out-of-domain utterances
- semantic representations
- syntax/semantic trees
- frames
- graphs
- dialogue acts
- basic approaches
- for trees/frames/graphs
- grammar-based parsing
- grammars are expensive, hard to maintain
- hardware-hungry, brittle
- CFGs are too simple for full natural language
- Phoenix Parser
- statistical
- for dialogue acts (both options can be rule-based or statistical)
- classification
- concepts: intent, slot-value pair
- consistency problems (conflicting intents, conflicting values) need to be solved externally
- sequence labelling
- named-entity recognition (NER) + delexicalization
- identify slot values / named entities
- delexicalize = replace them with placeholders (indicating entity type)
- slot filling as sequence tagging
- get slot values directly, automatic delexicalization
- each word classified
- IOB format (inside-outside-beginning)
- O … word does not belong to any slot
- B … beginning of the slot
- I … another word inside the slot
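- example: IOB tags produced by a slot tagger and how they map back to slot values; the sentence and slot names are illustrative

```python
# tagger output for "I want cheap Chinese food in the north"
tagged = [("I", "O"), ("want", "O"), ("cheap", "B-price"),
          ("Chinese", "B-food"), ("food", "O"),
          ("in", "O"), ("the", "O"), ("north", "B-area")]

def iob_to_slots(tagged):
    slots, current = {}, None
    for word, tag in tagged:
        if tag.startswith("B-"):                  # beginning of a new slot value
            current = tag[2:]
            slots[current] = word
        elif tag.startswith("I-") and current:    # continuation of the same slot
            slots[current] += " " + word
        else:                                     # O: word outside any slot
            current = None
    return slots

print(iob_to_slots(tagged))   # {'price': 'cheap', 'food': 'Chinese', 'area': 'north'}
```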
- it is useful to combine rules and classifiers
- keywords/regexes found at specific position
- apply classifier to each word in the sentence left-to-right
- problem: overall consistency (slots found elsewhere in the sentence might influence what's classified now)
- solution: structured/sequence prediction
- machine learning
- generative × discriminative models
- example: elephants vs. dogs
- generative: ~ 2 models, what elephants and dogs look like
- discriminative: establish decision boundary
- logistic regression
- SVM
- soft-margin SVM
- regularized logistic regression
- sequence prediction
- maximum entropy Markov model
- one error might lead to a series of errors
- hidden Markov model
- linear-chain conditional random field
- neural networks
- both for classification & sequence models
- non-linear functions, composed of basic building blocks, stacked into layers
- activation functions
- linear functions
- nonlinearities – sigmoid, tanh, ReLU
- softmax – probability estimates
- fully differentiable – training by gradient descent
- gradients backpropagated from outputs to all parameters
- features: word embeddings
- recurrent neural networks (RNN)
- many identical layers with shared parameters
- output of the first layer is fed as an input to the second
- additionally, each layer gets another token from the input
- other cell types: GRU, LSTM
- make backpropagation work better
- gates to keep old values
- encoder-decoder networks
- attention = “memory” of all encoder hidden states
- transformer
- getting rid of encoder recurrences
- the whole encoder can run in parallel → networks (and datasets) can be larger and are faster to train
- …
- apart from word embedding, it has also positional embedding
- neural NLU
- various architectures possible
- classification
- feed-forward NN
- RNN + attention weight → softmax
- convolutional networks
- transformer
- sequence tagging
- RNN (LSTM/GRU) → softmax over hidden states
- transformer works the same
- intent can be tagged at start of sentence
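- sketch (assuming PyTorch, invented layer sizes, and mean-pooling for the intent instead of a dedicated start-of-sentence tag): an LSTM over word embeddings with a softmax over hidden states for slot tags, plus an utterance-level intent classifier

```python
import torch
import torch.nn as nn

class JointNLU(nn.Module):
    """Bi-LSTM over word embeddings: per-token slot tags + one intent."""
    def __init__(self, vocab_size, n_tags, n_intents, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.tag_head = nn.Linear(2 * hidden, n_tags)        # softmax over hidden states
        self.intent_head = nn.Linear(2 * hidden, n_intents)

    def forward(self, tokens):                    # tokens: (batch, seq_len) word ids
        h, _ = self.lstm(self.embed(tokens))      # (batch, seq_len, 2*hidden)
        tag_logits = self.tag_head(h)             # per-token IOB tag scores
        intent_logits = self.intent_head(h.mean(dim=1))   # pooled utterance intent
        return tag_logits, intent_logits

model = JointNLU(vocab_size=5000, n_tags=9, n_intents=4)
tags, intent = model(torch.randint(0, 5000, (2, 7)))
print(tags.shape, intent.shape)    # (2, 7, 9) and (2, 4)
```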
- handling ASR noise
- we can run NLU for all the hypotheses and sum the results
- we can use confusion networks
- the word features can be weighed by word confidence
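- sketch of the n-best combination described above: run a stand-in NLU on every ASR hypothesis and sum the slot-value scores weighted by hypothesis confidence; the hypotheses and the toy NLU are invented

```python
from collections import defaultdict

# ASR n-best list: (hypothesis, confidence)
nbest = [("cheap chinese food", 0.6),
         ("sheep chinese food", 0.3),
         ("cheap genies food", 0.1)]

def nlu(text):
    """Stand-in NLU returning {(slot, value): score}."""
    out = {}
    if "chinese" in text:
        out[("food", "Chinese")] = 1.0
    if "cheap" in text:
        out[("price", "cheap")] = 1.0
    return out

combined = defaultdict(float)
for hyp, conf in nbest:
    for slot_value, score in nlu(hyp).items():
        combined[slot_value] += conf * score     # weigh by ASR confidence, then sum

print(dict(combined))   # {('food', 'Chinese'): 0.9, ('price', 'cheap'): 0.7}
```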
- context
- user response can depend on last system action
- we might need to add last system DA/text into input features
- but the system cannot expect the user to always respond to the last question
Dialogue State Tracking
- we need to remember what happened in the past during the dialogue
- past system actions! (user may react to them)
- ontology
- to describe possible states
- defines all concepts in the system
- problems with dialogue state
- NLU is unreliable
- to solve that, we can ignore low-confidence input
- but if there is some noise and the user keeps repeating the same thing, should we still treat the confidence as low?
- belief state
- we estimate a probability distribution over all possible states
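- sketch of one simple per-slot belief update heuristic (in the spirit of simple baseline trackers, not necessarily the exact algorithm from the lecture): the confidence mass of the current turn's NLU output overrides a proportional part of the previous belief

```python
def update_belief(belief: dict, slu: dict) -> dict:
    """belief: P(value) for one slot so far; slu: this turn's NLU confidences."""
    observed_mass = sum(slu.values())                       # <= 1.0
    new = {v: (1 - observed_mass) * p for v, p in belief.items()}
    for value, conf in slu.items():
        new[value] = new.get(value, 0.0) + conf
    return new

belief = {"Chinese": 0.6, "Italian": 0.4}
belief = update_belief(belief, {"Indian": 0.7})   # noisy but fairly confident new value
print(belief)   # roughly {'Chinese': 0.18, 'Italian': 0.12, 'Indian': 0.7}
```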
- Markov decision process
- partially observable Markov decision process
- naïve generative belief tracking
- parameter tying
- …
- LLM prompting
Dialogue Policy
- dialogue management
- DST tracks the past
- dialogue policy navigates towards the future
- policy selects the next action
- action selection approaches
- finite-state machines
- good for touch-tone (DTMF) phone systems
- frame-based
- state = frame with slots
- slots can be filled in any order
- more information in one utterance possible
- the system asks until all the slots are filled
- standard implementation: VoiceXML
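- sketch of frame-based action selection: keep requesting unfilled slots (the user can fill them in any order), then query the backend; slot names and acts are illustrative

```python
def next_action(frame: dict) -> str:
    """Frame-based policy: request the first empty required slot, else act."""
    required = ["food", "area", "price"]
    for slot in required:
        if frame.get(slot) is None:
            return f"request({slot})"
    return "query_backend_and_inform()"

print(next_action({"food": "Chinese", "area": None, "price": None}))    # request(area)
print(next_action({"food": "Chinese", "area": "north", "price": "cheap"}))
```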
- rule-based
- if-then-else rules in programming code
- very flexible, but gets messy
- dialogue policy is still pre-set, which might not be the best approach
- statistical – with machine learning
- dialogue management with supervised learning
- action selection ~ classification → use supervised learning?
- hard to get sufficiently large human-human data
- dialogue is ambiguous and complex, there's no single correct next action
- some paths are not explored in the data but you may encounter them
- DSs should behave differently from people – they are in a different situation
- DM as a Markov Decision Process
- it has Markov property – current state defines everything
- deterministic vs. stochastic policy
- ~ pure vs. mixed strategy profile
- deterministic … for every state, the next action is fixed
- stochastic … for a state, there is a probability distribution of possible actions
- reinforcement learning
- finding a policy that maximizes long-term reward
- example reward mechanism
- for each turn … -1 (to minimize total number of turns)
- success … +20
- fail … -10
- discount factor … gives less weight to rewards further in the future
- state-value function
- Bellman equation
- the equations will not be required :)
- action-value function
- optimal policy
- RL agent taxonomy
- RL approaches
- dynamic programming – exact solution from Bellman equation
- Monte Carlo – sample, learn from experience
- temporal difference – look-ahead sampling (bootstrap), refine estimates as you go
- sampling & updates – on-policy vs. off-policy
- examples of RL approaches
- value iteration
- dynamic programming, model-based, value-based
- we update the V(s) value until it converges for all the states
- can be done with Q instead
- we assume p and r are known; they can be estimated from data, but that is expensive
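- sketch of value iteration on a tiny hand-made MDP (states, transition probabilities p and rewards r are invented; the reward scale loosely follows the -1/+20 example above)

```python
# transitions[s][a] = list of (probability, next_state, reward)
transitions = {
    "ask":  {"ask":     [(1.0, "ask", -1)],
             "confirm": [(0.8, "done", 20), (0.2, "ask", -1)]},
    "done": {},                                    # terminal state
}
gamma = 0.9
V = {s: 0.0 for s in transitions}

for _ in range(100):                               # repeat until V stops changing
    for s, actions in transitions.items():
        if not actions:
            continue
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                   for outcomes in actions.values())

print({s: round(v, 2) for s, v in V.items()})      # V('ask') converges to ~19.27
```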
- Monte Carlo methods
- SARSA (state-action-reward-state-action)
- Q-learning
- REINFORCE
- we are learning the policy directly (we update its parameters)
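- sketch of the tabular Q-learning update mentioned above: learn from sampled (s, a, r, s′) transitions, bootstrapping with the best next action (hence off-policy); states, actions and values are invented

```python
import random
from collections import defaultdict

Q = defaultdict(float)                      # Q[(state, action)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1
actions = ["ask", "confirm", "inform"]

def choose(state):
    """Epsilon-greedy behaviour policy."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next):
    """Off-policy TD update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

s = "slot_missing"
a = choose(s)                               # pick an action while interacting
q_update(s, a, -1, "slot_filled")           # reward and next state observed in the dialogue
```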
- POMDP
- MDP algorithms need the states to be quantized/discretized
- policy gradients work out of the box
- summary space
- nowadays, probably not necessary when using deep neural networks
- simulated users
- RL needs a lot of data
- in the beginning, the system behaves randomly, which people don't like
- that's why we need to build another dialogue system (or at least dialogue manager) that can simulate the user
- deep reinforcement learning
- part of the agent is handled by a neural network
- deep Q-networks
- Q function is represented by a neural net
- usual Q-learning does not converge well with NNs
- SGD is unstable
- correlated samples (data is sequential)
- …
- there are some fixes we can use
- interesting tricks
- experience replay – buffer of 10k moves → we sample from a set of both old and recent moves
- target Q function freezing – have a copy of Q function that does not get updated every time
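- sketch of the experience replay trick: a fixed-size buffer of transitions from which random mini-batches are sampled, mixing old and recent moves and breaking the correlation of sequential data; buffer and batch sizes are illustrative

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions fall out

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        # uniform random sampling decorrelates consecutive transitions
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(1000):
    buf.add((t, "ask", -1, t + 1, False))
batch = buf.sample(32)                            # mini-batch for one DQN update
```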
Natural Language Generation
- subtasks
- content planning
- sentence planning
- surface realization
- NLG basic approaches
- canned text – hand-written prompts
- templates – “fill in blanks” approach
- grammars & rules
- machine learning – with or without NNs
- neural networks
- Seq2seq RNNs
- Transformer
- in theory, it's weaker than RNNs
- but the models can be larger (we can train them in parallel)
- usually, only the decoder model is used
Voice Assistants & Question Answering