# Lecture

- credit requirements
  - final exam
  - lab exercises
- communication domains
  - single/closed-domain
  - multi-domain
  - open-domain
- application areas
  - phone
  - apps
  - smart speakers
  - appliances
  - cars
  - web
  - embodied (robots)
  - virtual characters
- modes of communication
  - text
  - voice
  - multimodal – video, facial expressions, touch, …
- dialogue initiative
  - system-initiative
  - user-initiative
  - mixed-initiative
- traditional architecture
  - main loop: voice → text → meaning → reaction → text → voice
  - components
    - speech recognition
    - language understanding
    - dialogue management
      - has access to the backend (in order to perform tasks)
    - language generation
    - speech synthesis
    - a multimodal system would have additional components
- automatic speech recognition (ASR)
  - converting the speech signal into text
  - typically produces several possible hypotheses with confidence scores
    - n-best list
    - lattice
    - confusion network
  - very good in ideal conditions
  - problems: noise, accents, distance, channel (phone), …
  - voice activity detection
    - is the user talking to the system?
    - wake words (“OK, Google”)
  - ASR is usually implemented using neural networks
- natural/spoken language understanding (NLU/SLU)
  - extracting the meaning from the user utterance
  - converting it into a structured semantic representation
    - dialogue acts
      - act type/intent (inform, request, confirm)
      - slot/attribute
      - value
      - examples
        - inform(food=Chinese, price=cheap)
        - request(address)
    - can be more complex (using syntax trees, predicate logic)
  - specific steps
    - named entity recognition
    - coreference resolution
  - implementation varies
    - handcrafting often works for limited domains
      - keyword spotting, regular expressions, handcrafted grammars (a sketch follows at the end of this section)
    - machine learning approaches
    - can also provide n-best outputs
  - problems
    - recovering from bad ASR
    - ambiguities – “next Friday” (it is Tuesday now)
    - variation – there are many ways to express the same thing
- dialogue manager (DM)
  - stores the dialogue history, modeled by the dialogue state
    - handcrafted × probabilistic
      - handcrafted … just replace the value in the slot with the last-mentioned one
      - probabilistic … keep an estimate
  - system actions described by the dialogue policy
    - decision on the next system action, given the dialogue state
    - involves backend queries
    - result represented as a system dialogue act
    - handcrafted
      - if-then-else clauses
      - flowcharts
    - machine learning
      - often trained with reinforcement learning
      - POMDP (partially observable Markov decision process)
      - recurrent neural networks
- natural language generation (NLG)
  - how to express things might depend on context
  - goals: fluency, naturalness, avoiding repetition, …
  - traditional approach: templates
    - fill values into predefined templates (sentence skeletons)
    - works well for limited domains
  - grammar-based approaches
    - grammar/semantic structures
    - syntactic transformation rules are applied
  - statistical approaches
    - most prominent: transformer neural networks
    - generating word by word
- speech synthesis
  - standard pipeline: text normalization, pronunciation analysis, intonation/stress generation, waveform synthesis
  - TTS methods
    - formant-based – phoneme-specific frequencies, rules
    - concatenative – record a single person, cut into phoneme transitions
    - hidden Markov models
    - neural networks
      - no need for phoneme conversion, can go directly from text
      - text to spectrograms → vocoder (spectrogram to audio)
- organizing the components
  - basic – pipeline
    - components oblivious of each other
  - interconnected
    - read/write changes to the dialogue state
    - more reactive but more complex
  - joining the modules
    - ASR + NLU
    - NLU + state tracking
    - NLU & DM & NLG – using LLMs, may be end-to-end (without module separation)
    - audio-based end-to-end (audio-to-audio)
- research areas
  - LLM-based systems
  - dialogue flows from data – finding patterns in human dialogue recordings/transcripts
  - multimodality – adding video (input/output)
  - context dependency – understanding/replying in context (grounding, speaker adaptation)
  - incrementality – don't wait for the whole sentence to start processing
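To make the handcrafted NLU approach mentioned above concrete, here is a minimal keyword-spotting sketch that maps an utterance to dialogue acts such as `inform(food=chinese)`. The domain, slot names, and patterns are invented for illustration; a real system would have far more patterns and a proper grammar.

```python
import re

# Hypothetical patterns for a toy restaurant domain: (intent, slot) -> regex.
PATTERNS = {
    ("inform", "food"): r"\b(chinese|italian|indian)\b",
    ("inform", "price"): r"\b(cheap|moderate|expensive)\b",
    ("request", "address"): r"\b(address|where is it)\b",
}

def rule_based_nlu(utterance: str) -> list[str]:
    """Keyword-spotting NLU: map an utterance to a list of dialogue acts."""
    acts = []
    for (intent, slot), pattern in PATTERNS.items():
        match = re.search(pattern, utterance.lower())
        if match:
            if intent == "inform":
                acts.append(f"inform({slot}={match.group(1)})")
            else:
                acts.append(f"request({slot})")
    return acts

print(rule_based_nlu("I'd like some cheap Chinese food"))
# ['inform(food=chinese)', 'inform(price=cheap)']
```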
## What happens in a dialogue

- dialogue = conversational communication between two or more people
  - verbal + non-verbal
  - collaborative, social
  - practical, related to actions
  - interactive, incremental, messy
  - dialogue systems are simpler than that
- linguistic description
  - phonetics/phonology
  - morphology
  - syntax
  - semantics – sentence (propositional) meaning
  - **pragmatics** – meaning in context, communication goal
    - the underlying meaning of the sentence
- turn-taking (interactivity)
  - turn = continuous utterance from one speaker
  - normal dialogue – very fluent, fast
    - minimizing overlaps and gaps
    - cues/markers for turn boundaries
    - overlaps happen naturally
  - ambiguities in turn-taking
  - barge-in
  - natural speech is very different from written text
- turn-taking in dialogue systems
  - consecutive turns are typically assumed
    - the system waits for the user to finish their turn
  - voice activity detection (VAD)
    - quite hard – we need to figure out whether it is the user speaking and whether they are speaking to the system
    - wake words make VAD easier
  - some systems allow the user to barge in
    - may be tied to the wake word
- voice activity detection
  - overlapping windows of ~30 ms + a binary classifier (see the sketch below)
  - features are similar to those used for speech recognition itself
  - speech onset is easier to detect than the end of speech
    - but it is hard to distinguish speech directed at the system from speech directed at someone else (that's why wake words are used)
  - postprocessing – smoothing out short-term errors
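A minimal sketch of frame-based VAD with smoothing postprocessing. Simple energy thresholding stands in for the binary classifier described above (a real system would use a trained model over spectral features); the window/hop sizes and threshold are illustrative.

```python
import numpy as np

def frame_decisions(audio: np.ndarray, sr: int = 16000,
                    win_ms: int = 30, hop_ms: int = 10,
                    energy_thresh: float = 1e-3) -> list[bool]:
    """Per-frame speech/non-speech decisions over overlapping ~30 ms windows.
    Assumes float audio in [-1, 1]; energy thresholding stands in for a
    trained binary classifier."""
    win, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
    decisions = []
    for start in range(0, len(audio) - win + 1, hop):
        frame = audio[start:start + win]
        decisions.append(float(np.mean(frame ** 2)) > energy_thresh)
    return decisions

def smooth(decisions: list[bool], k: int = 5) -> list[bool]:
    """Postprocessing: a majority vote over a sliding window of k frames
    smooths out short-term classifier errors."""
    out = []
    for i in range(len(decisions)):
        window = decisions[max(0, i - k // 2): i + k // 2 + 1]
        out.append(sum(window) > len(window) / 2)
    return out
```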
- speech acts
  - John L. Austin, John Searle
  - each utterance is an act (intentional, changing the state of the world)
  - speech acts consist of
    - utterance act – uttering the words
    - propositional act – semantics (surface meaning)
    - illocutionary act – pragmatic meaning
    - perlocutionary act – the listener obeys a command, the listener's worldview changes, …
  - types of speech acts
    - assertive – the speaker commits to the truth of a proposition
    - directive – the speaker wants the listener to do something
    - commissive – the speaker commits to doing something themselves
    - expressive – the speaker expresses their psychological state
    - declarative – performing actions (“performative verbs”)
  - explicit (using a verb directly corresponding to the act) vs. implicit
    - I **promise** to come by later. × I'll come by later.
  - direct vs. indirect
    - indirect – the surface meaning does not correspond to the actual one
      - primary illocution – the actual meaning
      - secondary illocution – how it's expressed
    - example
      - direct: Please close the window.
      - indirect: Could you close the window?
      - more indirect: I'm cold.
- conversational maxims
  - Paul Grice
  - based on Grice's cooperative principle (“dialogue is cooperative”)
  - 4 maxims: quantity, quality, relation, manner
  - implicatures
    - an obvious violation of a maxim implies additional meaning
- speech acts, maxims and implicatures in dialogue systems
  - learned from data / hand-coded
  - understanding
    - tested on real users → usually handles indirect speech acts
    - implicatures limited – there's no common sense
  - responses
    - mostly strive for clarity – the user shouldn't need to work out implicatures
- grounding
  - dialogue is cooperative → need to ensure mutual understanding
  - common ground = shared knowledge, mutual assumptions of the dialogue participants
    - knowingly shared
    - expanded/updated/refined in an informative conversation
    - validated/verified via grounding feedback/evidence
      - positive – understanding/acceptance signals
        - visual, backchannels
        - explicit feedback
        - implicit feedback – showing understanding implicitly in the next utterance
      - negative – misunderstanding
        - visual
        - implicit/explicit repairs
        - clarification requests
        - repair requests
  - in dialogue systems
    - crucial for successful dialogue
    - backchannels / visual signals typically not present
    - implicit confirmation very common
    - explicit confirmation may be required for important steps
    - clarification & repair requests very common
    - part of dialogue management
- deixis = pointing
  - relating language to the context/world
  - very important in dialogue
  - deictic expressions
    - meaning dependent on the context
    - pronouns, verbs (tense and person markers), adverbs, others (lexical meaning – e.g. come/go)
  - typically egocentric, *I – here – now* is the center (origo)
  - main types of deixis: personal, temporal, local
    - others: social, discourse/textual
  - anaphora/coreference
    - anaphora – referring back
    - cataphora – referring forward
  - in dialogue systems
    - systems typically assume a single user → personal deixis becomes much easier
    - most systems are aware of time; location is more complicated
    - coreference resolution is a separate problem
- prediction
  - dialogue is a social interaction
  - the brain does not listen passively
  - prediction is crucial for human cognition
    - this is why we understand in adverse conditions
    - we predict what the person might say → we can understand even in a noisy environment
- entropy
  - Claude Shannon
  - communication channel, entropy
  - plays well with the social-interaction perspective
    - people tend to use all available channel capacity
      - in a noisy environment, we speak louder and slower
    - people tend to spread information evenly
      - words carrying more information are emphasized
  - conditional entropy
    - how hard is it to guess the next word in the sentence, given an n-gram preceding context?
    - related to Shannon entropy but may differ
    - (a sketch of the computation follows below)
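A minimal sketch of estimating the conditional entropy of the next word given a one-word (bigram) preceding context. The toy corpus is invented; a real estimate would need far more data and smoothing.

```python
import math
from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of (previous, next) pairs
unigrams = Counter(corpus[:-1])              # counts of the context word
total = sum(bigrams.values())

# H(next | prev) = -sum over (prev, next) of p(prev, next) * log2 p(next | prev)
cond_entropy = -sum(
    (count / total) * math.log2(count / unigrams[prev])
    for (prev, nxt), count in bigrams.items()
)
print(f"H(next word | previous word) = {cond_entropy:.3f} bits")
```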
- prediction in dialogue systems
  - used a lot in speech recognition
    - statistical language models – based on information theory
    - not as good as humans
  - less used in other DS components
- adaptation/entrainment
  - people subconsciously adapt/align/entrain to their dialogue partner over the course of the dialogue
    - wording, grammar
    - speech rate, prosody, loudness
    - accent/dialect
  - this helps successful dialogue and social bonding, feels natural
  - dialogue systems typically don't align
    - NLG is rigid (templates, machine learning trained without context)
  - but people align to dialogue systems
- politeness
  - dialogue as social interaction – follows social conventions
  - indirectness is polite
    - this is the point of most indirect speech acts
    - clashes with the conversational maxims (maxim of manner)
  - the appropriate level of politeness might be hard to find (culturally dependent)
  - face-saving (Brown & Levinson)
    - positive face = desire to be accepted, liked
    - negative face = desire to act freely
    - face-threatening acts (FTAs) – potentially any utterance
      - threatening the other's/one's own negative/positive face
    - politeness softens FTAs
  - in dialogue systems
    - typically handcrafted, does not adapt to the situation
    - typically not much indirect speech, but trying to stay polite
    - learning from data can be tricky – it may contain offensive speech (not just swearwords; problems can be hard to find)

## Data

- two main questions before building a dialogue system
  - what data to base it on
  - how to evaluate it
- observation: if you have extensive data of high enough quality, an LLM learns how to count etc. just from the examples
- data
  - corpus/dataset = collection of linguistic data
  - Hugging Face, Czech National Corpus, …
- dialogue corpus/dataset types
  - modality: written/spoken/multimodal
  - source
    - human-human conversations – real dialogues, scripted (from movies)
    - human-machine
    - automatically generated
  - domain
    - closed/constrained/limited domain
    - multi-domain (several closed domains)
    - open domain (any topic, chitchat)
- dialogue data collection
  - in-house collection using experts (or students)
    - safe, high-quality
    - expensive, time-consuming
  - Wizard-of-Oz (WoZ)
    - for in-house data collection
    - also: to prototype/evaluate a system before implementing it
    - users believe they're talking to a system
      - they behave differently than when talking to a human – usually more simply
    - the system is in fact controlled by a human “wizard”
      - typically selecting options (free typing is too slow)
  - web crawling
    - typically not real dialogues
    - offensive content
    - many copies of the same content
    - problematic licensing
  - crowdsourcing
    - compromise: employing (untrained) people over the web
    - platforms: Amazon Mechanical Turk, Appen, Prolific
    - people tend to game the system, causing noise
- corpus annotation
  - what we need to add to the data (recordings)
    - transcriptions (textual representation of the audio)
    - semantic annotation such as dialogue acts
    - named entity labelling
    - other linguistic annotation: POS, syntax (usually not in DSs)
  - getting annotation
    - a similar task to getting the data itself
  - inter-annotator agreement
    - typical measure: Cohen's kappa
      - for categorical annotation
      - $\kappa=\frac{\text{agreement}-\text{chance}}{1-\text{chance}}$
      - at most 1 (perfect agreement); 0 means chance-level agreement; 0.4 ~ fair, >0.7 ~ great
      - (a sketch of the computation follows below)
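A minimal sketch of Cohen's kappa for two annotators, directly following the formula above; the toy labels are invented.

```python
from collections import Counter

def cohens_kappa(ann1: list[str], ann2: list[str]) -> float:
    n = len(ann1)
    # observed agreement: fraction of items where the annotators agree
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    # chance agreement: probability both pick the same label independently
    c1, c2 = Counter(ann1), Counter(ann2)
    chance = sum(c1[label] * c2[label] for label in c1) / n ** 2
    return (observed - chance) / (1 - chance)

# two annotators labelling dialogue-act types (toy data)
a = ["inform", "request", "inform", "confirm", "inform"]
b = ["inform", "request", "confirm", "confirm", "inform"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.69 for this toy data
```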
- corpus size
  - we need enough examples for an accurate model
  - speech: 10s–100s of hours minimum
    - pretrained LMs / audio LLMs: 100k–10M hours
  - NLU, DM, NLG
    - handcrafting: 10s–100s of dialogues may be enough to inform you
    - simple model / limited domain: 100s–1000s of dialogues might be fine
    - open domain: the sky's the limit (LLMs: 1T+ tokens)
  - TTS – a single person, several hours at least
    - it pays off to have high-quality recordings of only one person with a flat tone
    - pretrained LMs: 10k+ hours (multilingual)
- available dialogue datasets
  - domain choice is rather limited
  - size is very often not enough
  - the vast majority are English-only
  - few free datasets with audio
    - non-dialogue ones: https://www.openslr.org/
  - MultiWOZ
    - task-oriented written dataset
    - crowdsourced
- dataset types
  - dataset splits
    - train/dev/test split
      - dev … validation
      - test … evaluation
  - cross-validation

## Evaluation

- types
  - extrinsic (how the system affects the world) × intrinsic (how its components work)
  - subjective (what users think about it; manual) × objective (measuring properties directly from data; automatic)
  - we use quantitative evaluation (based on numeric data), not qualitative (detailed interviews with users)
- significance testing (Student's t-test, Mann-Whitney U-test), bootstrap resampling
- getting subjects for extrinsic evaluation
  - can't do without people
- extrinsic evaluation
  - how to measure
    - record people
    - analyze the logs
  - metrics
    - task success
    - duration
    - retention rate (percentage of returning users)
    - fallback rate (percentage of failed dialogues)
    - number of users – not in a research setting
  - subjective
    - questionnaires
      - question types: open-ended, yes/no, Likert scales
    - …
- intrinsic
  - ASR: word error rate ~ length-normalized Levenshtein distance
  - NLU
    - slot precision & recall & F1 measure
    - accuracy used for intent/act type
  - dialogue manager
    - objective measures can be collected with a user simulator
  - NLG
    - word overlap (BLEU score)
    - slot error rate
    - diversity

## Natural Language Understanding

- words → meaning
- challenges
  - non-grammaticality
  - disfluencies
  - ASR errors
  - synonymy
  - out-of-domain utterances
- semantic representations
  - syntactic/semantic trees
  - frames
  - graphs
  - **dialogue acts**
- basic approaches
  - for trees/frames/graphs
    - grammar-based parsing
      - grammars are expensive, hard to maintain
      - hardware-hungry, brittle
      - CFGs are too simple for full natural language
      - Phoenix Parser
    - statistical
  - for dialogue acts (both options can be rule-based or statistical)
    - classification
      - concepts: intent, slot-value pair
      - consistency problems (conflicting intents, conflicting values) need to be solved externally
    - sequence labelling
- named-entity recognition (NER) + delexicalization
  - identify slot values / named entities
  - delexicalize = replace them with placeholders (indicating the entity type)
- slot filling as sequence tagging
  - get slot values directly, automatic delexicalization
  - each word is classified
  - IOB format (inside-outside-beginning); see the example after this list
    - O … the word does not belong to any slot
    - B … the beginning of a slot
    - I … another word inside a slot
- it is useful to combine rules and classifiers
  - keywords/regexes found at specific positions
- applying a classifier to each word in the sentence left-to-right
  - problem: overall consistency (slots found elsewhere in the sentence might influence what's classified now)
  - solution: structured/sequence prediction
    - HMM, MEMM, CRF
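To make the IOB format concrete, here is one possible tagging of an utterance with hypothetical `price` and `food` slots, plus a small helper that reads the slot values back out of the tags.

```python
# A possible IOB tagging of one utterance (slot names are illustrative).
tokens = ["I", "want", "cheap",   "New",    "York",   "style",  "pizza"]
tags   = ["O", "O",    "B-price", "B-food", "I-food", "I-food", "I-food"]

def iob_to_slots(tokens: list[str], tags: list[str]) -> dict[str, str]:
    """Collect slot values from IOB tags (B- starts a slot, I- continues it)."""
    slots, current = {}, None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = tag[2:]
            slots[current] = token
        elif tag.startswith("I-") and current == tag[2:]:
            slots[current] += " " + token
        else:
            current = None
    return slots

print(iob_to_slots(tokens, tags))
# {'price': 'cheap', 'food': 'New York style pizza'}
```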
- machine learning
  - generative × discriminative models
    - example: elephants vs. dogs
      - generative: ~2 models, of what elephants and what dogs look like
      - discriminative: establish a decision boundary
  - logistic regression
  - SVM
    - soft-margin SVM
  - regularized logistic regression
  - sequence prediction
    - maximum entropy Markov model
      - one error might lead to a series of errors
    - hidden Markov model
      - limited feature function
    - linear-chain conditional random field
- neural networks
  - both for classification & sequence models
  - non-linear functions, composed of basic building blocks, stacked into layers
  - activation functions
    - linear functions
    - nonlinearities – sigmoid, tanh, ReLU
    - softmax – probability estimates
  - fully differentiable – training by gradient descent
    - gradients backpropagated from the outputs to all parameters
  - features: word embeddings
  - recurrent neural networks (RNN)
    - many identical layers with shared parameters
    - the output of each layer is fed as input to the next
    - additionally, each layer gets another token from the input
    - other cell types: GRU, LSTM
      - make backpropagation work better
      - gates to keep old values
  - encoder-decoder networks
    - …
    - attention = “memory” of all encoder hidden states
  - transformer
    - gets rid of encoder recurrences
    - the whole encoder part can run in parallel → networks (and datasets) can be larger, and they are faster to train
    - …
    - apart from word embeddings, it also has positional embeddings
- neural NLU
  - various architectures possible
  - classification
    - feed-forward NN
    - RNN + attention weights → softmax
    - convolutional networks
    - transformer
  - sequence tagging
    - RNN (LSTM/GRU) → softmax over hidden states
    - a transformer works the same way
    - the intent can be tagged at the start of the sentence
- handling ASR noise
  - we can run NLU on all the hypotheses and sum the results
  - we can use confusion networks
    - the word features can be weighted by word confidence
- context
  - the user's response can depend on the last system action
  - we might need to add the last system DA/text into the input features
  - but the system cannot expect the user to always respond to the last question

## Dialogue State Tracking

- we need to remember what happened earlier in the dialogue
  - past system actions! (the user may react to them)
- ontology
  - describes the possible states
  - defines all concepts in the system
- problems with the dialogue state
  - NLU is unreliable
  - to solve that, we can ignore low-confidence input
    - but if there is some level of noise and the user repeats the same thing multiple times, is the confidence still low?
- belief state
  - we estimate a probability distribution over all possible states
- Markov decision process
- partially observable Markov decision process
- naïve generative belief tracking
  - parameter tying
  - …
- LLM prompting

## Dialogue Policy

- dialogue management
  - DST tracks the past
  - the dialogue policy navigates towards the future
    - the policy selects the next action
- action selection approaches
  - finite-state machines
    - good for tone-selection phone systems
  - frame-based
    - state = frame with slots
    - slots can be filled in any order
    - more pieces of information in one utterance are possible
    - the system asks until all the slots are filled
    - standard implementation: VoiceXML
  - rule-based
    - if-then-else rules in programming code
    - very flexible, but gets messy
    - the dialogue policy is still pre-set, which might not be the best approach
  - statistical – with machine learning
- dialogue management with supervised learning
  - action selection ~ classification → use supervised learning?
  - hard to get sufficiently large human-human data
  - dialogue is ambiguous and complex; there's no single correct next action
  - some paths are not explored in the data, but you may encounter them
  - DSs should behave differently from people; they are in a different situation
- DM as a Markov decision process
  - it has the Markov property – the current state defines everything
  - deterministic vs. stochastic policy
    - ~ pure vs. mixed strategy profile
    - deterministic … for every state, the next action is fixed
    - stochastic … for a state, there is a probability distribution over possible actions
- reinforcement learning
  - finding a policy that maximizes the long-term reward
  - example reward mechanism
    - each turn … −1 (to minimize the total number of turns)
    - success … +20
    - failure … −10
  - discount factor … gives less weight to rewards that are further in the future
  - state-value function
  - Bellman equation
    - the equations will not be required :)
  - action-value function
  - optimal policy
- RL agent taxonomy
- RL approaches
  - dynamic programming – exact solution from the Bellman equation
  - Monte Carlo – sample, learn from experience
  - temporal difference – look-ahead sampling (bootstrapping), refine estimates as you go
  - sampling & updates – on-policy vs. off-policy
- examples of RL approaches
  - value iteration
    - dynamic programming, model-based, value-based
    - we update the value $V(s)$ until it converges for all states
    - can be done with $Q$ instead
    - we assume $p$ and $r$ to be known; they can be estimated from data, but it's expensive
  - Monte Carlo methods
  - SARSA (state-action-reward-state-action)
    - on-policy
  - Q-learning
    - off-policy (a sketch follows at the end of this section)
  - REINFORCE
    - we learn the policy directly (we update its parameters)
- POMDP
  - MDP algorithms need the states to be quantized/discretized
    - policy gradients work out of the box
  - summary space
  - nowadays probably not necessary when using deep neural networks
- simulated users
  - RL needs a lot of data
  - in the beginning, the system behaves randomly, and people don't like this
  - that's why we build another dialogue system (or at least a dialogue manager) that can simulate the user
- deep reinforcement learning
  - part of the agent is handled by a neural network
  - deep Q-networks
    - the Q function is represented by a neural net
    - plain Q-learning does not converge well with NNs
      - SGD is unstable
      - correlated samples (data is sequential)
      - …
    - there are some fixes we can use
  - interesting tricks
    - experience replay – a buffer of 10k moves → we sample from a mix of both old and recent moves
    - target Q-function freezing – keep a copy of the Q function that does not get updated every step
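A minimal sketch of tabular Q-learning as outlined above. The environment object is hypothetical: it is assumed to expose a list `actions`, a Gym-style `reset() -> state`, and `step(action) -> (next_state, reward, done)`; the hyperparameter values are illustrative.

```python
import random
from collections import defaultdict

def q_learning(env, episodes: int = 1000, alpha: float = 0.1,
               gamma: float = 0.99, epsilon: float = 0.1):
    """Tabular Q-learning (off-policy temporal-difference control)."""
    Q = defaultdict(float)  # Q[(state, action)], implicitly initialized to 0

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy: mostly greedy, sometimes random
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # off-policy: the target uses the greedy (max) value of the next state
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```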
## Natural Language Generation

- subtasks
  - content planning
  - sentence planning
  - surface realization
- NLG basic approaches
  - canned text – hand-written prompts
  - templates – a “fill in the blanks” approach
  - grammars & rules
  - machine learning – with or without NNs
- neural networks
  - seq2seq RNNs
  - transformer
    - in theory, it's weaker than RNNs
    - but the models can be larger (we can train them in parallel)
    - usually, only the decoder model is used

## Voice Assistants & Question Answering