# Lecture

- credit requirements
  - final exam
  - lab exercises
- communication domains
  - single/closed-domain
  - multi-domain
  - open-domain
- application areas
  - phone
  - apps
  - smart speakers
  - appliances
  - cars
  - web
  - embodied (robots)
  - virtual characters
- modes of communication
  - text
  - voice
  - multimodal – video, facial expressions, touch, …
- dialogue initiative
  - system-initiative
  - user-initiative
  - mixed-initiative
- traditional architecture
  - main loop: voice → text → meaning → reaction → text → voice
  - components
    - speech recognition
    - language understanding
    - dialogue management
      - has access to the backend (in order to perform tasks)
    - language generation
    - speech synthesis
    - a multimodal system would have additional components
- automatic speech recognition (ASR)
  - converting the speech signal into text
  - typically produces several possible hypotheses with confidence scores
    - n-best list
    - lattice
    - confusion network
  - very good in ideal conditions
  - problems: noise, accents, distance, channel (phone), …
  - voice activity detection
    - is the user talking to the system?
    - wake words (“OK, Google”)
  - ASR is usually implemented using neural networks
- natural/spoken language understanding (NLU/SLU)
  - extracting the meaning from the user utterance
  - converting it into a structured semantic representation
    - dialogue acts
      - act type/intent (inform, request, confirm)
      - slot/attribute
      - value
      - examples
        - inform(food=Chinese, price=cheap)
        - request(address)
    - can be more complex (using syntax trees, predicate logic)
  - specific steps
    - named entity recognition
    - coreference resolution
  - implementation varies
    - handcrafting often works for limited domains
      - keyword spotting, regular expressions, handcrafted grammars (a sketch follows at the end of this section)
    - machine learning approaches
    - can also provide n-best outputs
  - problems
    - recovering from bad ASR
    - ambiguities – “next Friday” (it is Tuesday now)
    - variation – there are many ways to express the same thing
- dialogue manager (DM)
  - stores the dialogue history, modeled by the dialogue state
    - handcrafted × probabilistic
      - handcrafted … just replace the value in the slot with the last-mentioned one
      - probabilistic … keep an estimate
  - system actions described by the dialogue policy
    - decision on the next system action, given the dialogue state
    - involves backend queries
    - result represented as a system dialogue act
    - handcrafted
      - if-then-else clauses
      - flowcharts
    - machine learning
      - often trained with reinforcement learning
      - POMDP (partially observable Markov decision process)
      - recurrent neural networks
- natural language generation (NLG)
  - how to express things might depend on context
  - goals: fluency, naturalness, avoiding repetition, …
  - traditional approach: templates
    - fill values into predefined templates (sentence skeletons)
    - works well for limited domains
  - grammar-based approaches
    - grammar/semantic structures
    - syntactic transformation rules are applied
  - statistical approaches
    - most prominent: transformer neural networks
    - generating word by word
- speech synthesis
  - standard pipeline: text normalization, pronunciation analysis, intonation/stress generation, waveform synthesis
  - TTS methods
    - formant-based – phoneme-specific frequencies, rules
    - concatenative – record a single person, cut into phoneme transitions
    - hidden Markov models
    - neural networks
      - no need for phoneme conversion, can go directly from text
      - text to spectrograms → vocoder (spectrogram to audio)
- organizing the components
  - basic – pipeline
    - components oblivious of each other
  - interconnected
    - read/write changes to the dialogue state
    - more reactive but more complex
  - joining the modules
    - ASR + NLU
    - NLU + state tracking
    - NLU & DM & NLG – using LLMs, may be end-to-end (without module separation)
    - audio-based end-to-end (audio-to-audio)
- research areas
  - LLM-based systems
  - dialogue flows from data – finding patterns in human dialogue recordings/transcripts
  - multimodality – adding video (input/output)
  - context dependency – understanding/replying in context (grounding, speaker adaptation)
  - incrementality – don't wait for the whole sentence to start processing
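To make the handcrafted NLU approach mentioned above concrete, here is a minimal keyword-spotting sketch that maps an utterance to dialogue acts such as `inform(food=chinese)`. The domain, slot names, and patterns are invented for illustration; a real system would have far more patterns and a proper grammar.

```python
import re

# Hypothetical patterns for a toy restaurant domain: (intent, slot) -> regex.
PATTERNS = {
    ("inform", "food"): r"\b(chinese|italian|indian)\b",
    ("inform", "price"): r"\b(cheap|moderate|expensive)\b",
    ("request", "address"): r"\b(address|where is it)\b",
}

def rule_based_nlu(utterance: str) -> list[str]:
    """Keyword-spotting NLU: map an utterance to a list of dialogue acts."""
    acts = []
    for (intent, slot), pattern in PATTERNS.items():
        match = re.search(pattern, utterance.lower())
        if match:
            if intent == "inform":
                acts.append(f"inform({slot}={match.group(1)})")
            else:
                acts.append(f"request({slot})")
    return acts

print(rule_based_nlu("I'd like some cheap Chinese food"))
# ['inform(food=chinese)', 'inform(price=cheap)']
```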
## What happens in a dialogue

- dialogue = conversational communication between two or more people
  - verbal + non-verbal
  - collaborative, social
  - practical, related to actions
  - interactive, incremental, messy
  - dialogue systems are simpler than that
- linguistic description
  - phonetics/phonology
  - morphology
  - syntax
  - semantics – sentence (propositional) meaning
  - **pragmatics** – meaning in context, communication goal
    - the underlying meaning of the sentence
- turn-taking (interactivity)
  - turn = continuous utterance from one speaker
  - normal dialogue – very fluent, fast
    - minimizing overlaps and gaps
    - cues/markers for turn boundaries
    - overlaps happen naturally
  - ambiguities in turn-taking
  - barge-in
  - natural speech is very different from written text
- turn-taking in dialogue systems
  - consecutive turns are typically assumed
    - the system waits for the user to finish their turn
  - voice activity detection (VAD)
    - quite hard – we need to figure out whether it is the user speaking and whether they are speaking to the system
    - wake words make VAD easier
  - some systems allow the user to barge in
    - may be tied to the wake word
- voice activity detection
  - overlapping windows of ~30 ms + a binary classifier (see the sketch below)
  - features are similar to those used for speech recognition itself
  - speech onset is easier to detect than the end of speech
    - but it is hard to distinguish speech directed at the system from speech directed at someone else (that's why wake words are used)
  - postprocessing – smoothing out short-term errors
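A minimal sketch of frame-based VAD with smoothing postprocessing. Simple energy thresholding stands in for the binary classifier described above (a real system would use a trained model over spectral features); the window/hop sizes and threshold are illustrative.

```python
import numpy as np

def frame_decisions(audio: np.ndarray, sr: int = 16000,
                    win_ms: int = 30, hop_ms: int = 10,
                    energy_thresh: float = 1e-3) -> list[bool]:
    """Per-frame speech/non-speech decisions over overlapping ~30 ms windows.
    Assumes float audio in [-1, 1]; energy thresholding stands in for a
    trained binary classifier."""
    win, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
    decisions = []
    for start in range(0, len(audio) - win + 1, hop):
        frame = audio[start:start + win]
        decisions.append(float(np.mean(frame ** 2)) > energy_thresh)
    return decisions

def smooth(decisions: list[bool], k: int = 5) -> list[bool]:
    """Postprocessing: a majority vote over a sliding window of k frames
    smooths out short-term classifier errors."""
    out = []
    for i in range(len(decisions)):
        window = decisions[max(0, i - k // 2): i + k // 2 + 1]
        out.append(sum(window) > len(window) / 2)
    return out
```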
- speech acts
  - John L. Austin, John Searle
  - each utterance is an act (intentional, changing the state of the world)
  - speech acts consist of
    - utterance act – uttering the words
    - propositional act – semantics (surface meaning)
    - illocutionary act – pragmatic meaning
    - perlocutionary act – the listener obeys a command, the listener's worldview changes, …
  - types of speech acts
    - assertive – the speaker commits to the truth of a proposition
    - directive – the speaker wants the listener to do something
    - commissive – the speaker commits to doing something themselves
    - expressive – the speaker expresses their psychological state
    - declarative – performing actions (“performative verbs”)
  - explicit (using a verb directly corresponding to the act) vs. implicit
    - I **promise** to come by later. × I'll come by later.
  - direct vs. indirect
    - indirect – the surface meaning does not correspond to the actual one
      - primary illocution – the actual meaning
      - secondary illocution – how it's expressed
    - example
      - direct: Please close the window.
      - indirect: Could you close the window?
      - more indirect: I'm cold.
- conversational maxims
  - Paul Grice
  - based on Grice's cooperative principle (“dialogue is cooperative”)
  - 4 maxims: quantity, quality, relation, manner
  - implicatures
    - an obvious violation of a maxim implies additional meaning
- speech acts, maxims and implicatures in dialogue systems
  - learned from data / hand-coded
  - understanding
    - tested on real users → usually handles indirect speech acts
    - implicatures limited – there's no common sense
  - responses
    - mostly strive for clarity – the user shouldn't need to work out implicatures
- grounding
  - dialogue is cooperative → need to ensure mutual understanding
  - common ground = shared knowledge, mutual assumptions of the dialogue participants
    - knowingly shared
    - expanded/updated/refined in an informative conversation
    - validated/verified via grounding feedback/evidence
      - positive – understanding/acceptance signals
        - visual, backchannels
        - explicit feedback
        - implicit feedback – showing understanding implicitly in the next utterance
      - negative – misunderstanding
        - visual
        - implicit/explicit repairs
        - clarification requests
        - repair requests
  - in dialogue systems
    - crucial for successful dialogue
    - backchannels / visual signals typically not present
    - implicit confirmation very common
    - explicit confirmation may be required for important steps
    - clarification & repair requests very common
    - part of dialogue management
- deixis = pointing
  - relating language to the context/world
  - very important in dialogue
  - deictic expressions
    - meaning dependent on the context
    - pronouns, verbs (tense and person markers), adverbs, others (lexical meaning – e.g. come/go)
  - typically egocentric, *I – here – now* is the center (origo)
  - main types of deixis: personal, temporal, local
    - others: social, discourse/textual
  - anaphora/coreference
    - anaphora – referring back
    - cataphora – referring forward
  - in dialogue systems
    - systems typically assume a single user → personal deixis becomes much easier
    - most systems are aware of time; location is more complicated
    - coreference resolution is a separate problem
- prediction
  - dialogue is a social interaction
  - the brain does not listen passively
  - prediction is crucial for human cognition
    - this is why we understand in adverse conditions
    - we predict what the person might say → we can understand even in a noisy environment
- entropy
  - Claude Shannon
  - communication channel, entropy
  - plays well with the social-interaction perspective
    - people tend to use all available channel capacity
      - in a noisy environment, we speak louder and slower
    - people tend to spread information evenly
      - words carrying more information are emphasized
  - conditional entropy
    - how hard is it to guess the next word in the sentence, given an n-gram preceding context?
    - related to Shannon entropy but may differ
    - (a sketch of the computation follows below)
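A minimal sketch of estimating the conditional entropy of the next word given a one-word (bigram) preceding context. The toy corpus is invented; a real estimate would need far more data and smoothing.

```python
import math
from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of (previous, next) pairs
unigrams = Counter(corpus[:-1])              # counts of the context word
total = sum(bigrams.values())

# H(next | prev) = -sum over (prev, next) of p(prev, next) * log2 p(next | prev)
cond_entropy = -sum(
    (count / total) * math.log2(count / unigrams[prev])
    for (prev, nxt), count in bigrams.items()
)
print(f"H(next word | previous word) = {cond_entropy:.3f} bits")
```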
- prediction in dialogue systems
  - used a lot in speech recognition
    - statistical language models – based on information theory
    - not as good as humans
  - less used in other DS components
- adaptation/entrainment
  - people subconsciously adapt/align/entrain to their dialogue partner over the course of the dialogue
    - wording, grammar
    - speech rate, prosody, loudness
    - accent/dialect
  - this helps successful dialogue and social bonding, feels natural
  - dialogue systems typically don't align
    - NLG is rigid (templates, machine learning trained without context)
  - but people align to dialogue systems
- politeness
  - dialogue as social interaction – follows social conventions
  - indirectness is polite
    - this is the point of most indirect speech acts
    - clashes with the conversational maxims (maxim of manner)
  - the appropriate level of politeness might be hard to find (culturally dependent)
  - face-saving (Brown & Levinson)
    - positive face = desire to be accepted, liked
    - negative face = desire to act freely
    - face-threatening acts (FTAs) – potentially any utterance
      - threatening the other's/one's own negative/positive face
    - politeness softens FTAs
  - in dialogue systems
    - typically handcrafted, does not adapt to the situation
    - typically not much indirect speech, but trying to stay polite
    - learning from data can be tricky – it may contain offensive speech (not just swearwords; problems can be hard to find)

## Data

- two main questions before building a dialogue system
  - what data to base it on
  - how to evaluate it
- observation: if you have extensive data of high enough quality, an LLM learns how to count etc. just from the examples
- data
  - corpus/dataset = collection of linguistic data
  - Hugging Face, Czech National Corpus, …
- dialogue corpus/dataset types
  - modality: written/spoken/multimodal
  - source
    - human-human conversations – real dialogues, scripted (from movies)
    - human-machine
    - automatically generated
  - domain
    - closed/constrained/limited domain
    - multi-domain (several closed domains)
    - open domain (any topic, chitchat)
- dialogue data collection
  - in-house collection using experts (or students)
    - safe, high-quality
    - expensive, time-consuming
  - Wizard-of-Oz (WoZ)
    - for in-house data collection
    - also: to prototype/evaluate a system before implementing it
    - users believe they're talking to a system
      - they behave differently than when talking to a human – usually more simply
    - the system is in fact controlled by a human “wizard”
      - typically selecting options (free typing is too slow)
  - web crawling
    - typically not real dialogues
    - offensive content
    - many copies of the same content
    - problematic licensing
  - crowdsourcing
    - compromise: employing (untrained) people over the web
    - platforms: Amazon Mechanical Turk, Appen, Prolific
    - people tend to game the system, causing noise
- corpus annotation
  - what we need to add to the data (recordings)
    - transcriptions (textual representation of the audio)
    - semantic annotation such as dialogue acts
    - named entity labelling
    - other linguistic annotation: POS, syntax (usually not in DSs)
  - getting annotation
    - a similar task to getting the data itself
  - inter-annotator agreement
    - typical measure: Cohen's kappa
      - for categorical annotation
      - $\kappa=\frac{\text{agreement}-\text{chance}}{1-\text{chance}}$
      - at most 1 (perfect agreement); 0 means chance-level agreement; 0.4 ~ fair, >0.7 ~ great
      - (a sketch of the computation follows below)
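A minimal sketch of Cohen's kappa for two annotators, directly following the formula above; the toy labels are invented.

```python
from collections import Counter

def cohens_kappa(ann1: list[str], ann2: list[str]) -> float:
    n = len(ann1)
    # observed agreement: fraction of items where the annotators agree
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    # chance agreement: probability both pick the same label independently
    c1, c2 = Counter(ann1), Counter(ann2)
    chance = sum(c1[label] * c2[label] for label in c1) / n ** 2
    return (observed - chance) / (1 - chance)

# two annotators labelling dialogue-act types (toy data)
a = ["inform", "request", "inform", "confirm", "inform"]
b = ["inform", "request", "confirm", "confirm", "inform"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.69 for this toy data
```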
- corpus size
  - we need enough examples for an accurate model
  - speech: 10s–100s of hours minimum
    - pretrained LMs / audio LLMs: 100k–10M hours
  - NLU, DM, NLG
    - handcrafting: 10s–100s of dialogues may be enough to inform you
    - simple model / limited domain: 100s–1000s of dialogues might be fine
    - open domain: the sky's the limit (LLMs: 1T+ tokens)
  - TTS – a single person, several hours at least
    - it pays off to have high-quality recordings of only one person with a flat tone
    - pretrained LMs: 10k+ hours (multilingual)
- available dialogue datasets
  - domain choice is rather limited
  - size is very often not enough
  - the vast majority are English-only
  - few free datasets with audio
    - non-dialogue ones: https://www.openslr.org/
  - MultiWOZ
    - task-oriented written dataset
    - crowdsourced
- dataset types
  - dataset splits
    - train/dev/test split
      - dev … validation
      - test … evaluation
  - cross-validation

## Evaluation

- types
  - extrinsic (how the system affects the world) × intrinsic (how its components work)
  - subjective (what users think about it; manual) × objective (measuring properties directly from data; automatic)
  - we use quantitative evaluation (based on numeric data), not qualitative (detailed interviews with users)
- significance testing (Student's t-test, Mann-Whitney U-test), bootstrap resampling
- getting subjects for extrinsic evaluation
  - can't do without people
- extrinsic evaluation
  - how to measure
    - record people
    - analyze the logs
  - metrics
    - task success
    - duration
    - retention rate (percentage of returning users)
    - fallback rate (percentage of failed dialogues)
    - number of users – not in a research setting
  - subjective
    - questionnaires
      - question types: open-ended, yes/no, Likert scales
    - …
- intrinsic
  - ASR: word error rate ~ length-normalized Levenshtein distance
  - NLU
    - slot precision & recall & F1 measure
    - accuracy used for intent/act type
  - dialogue manager
    - objective measures can be collected with a user simulator
  - NLG
    - word overlap (BLEU score)
    - slot error rate
    - diversity

## Natural Language Understanding

- words → meaning
- challenges
  - non-grammaticality
  - disfluencies
  - ASR errors
  - synonymy
  - out-of-domain utterances
- semantic representations
  - syntactic/semantic trees
  - frames
  - graphs
  - **dialogue acts**
- basic approaches
  - for trees/frames/graphs
    - grammar-based parsing
      - grammars are expensive, hard to maintain
      - hardware-hungry, brittle
      - CFGs are too simple for full natural language
      - Phoenix Parser
    - statistical
  - for dialogue acts (both options can be rule-based or statistical)
    - classification
      - concepts: intent, slot-value pair
      - consistency problems (conflicting intents, conflicting values) need to be solved externally
    - sequence labelling
- named-entity recognition (NER) + delexicalization
  - identify slot values / named entities
  - delexicalize = replace them with placeholders (indicating the entity type)
- slot filling as sequence tagging
  - get slot values directly, automatic delexicalization
  - each word is classified
  - IOB format (inside-outside-beginning); see the example after this list
    - O … the word does not belong to any slot
    - B … the beginning of a slot
    - I … another word inside a slot
- it is useful to combine rules and classifiers
  - keywords/regexes found at specific positions
- applying a classifier to each word in the sentence left-to-right
  - problem: overall consistency (slots found elsewhere in the sentence might influence what's classified now)
  - solution: structured/sequence prediction
    - HMM, MEMM, CRF
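To make the IOB format concrete, here is one possible tagging of an utterance with hypothetical `price` and `food` slots, plus a small helper that reads the slot values back out of the tags.

```python
# A possible IOB tagging of one utterance (slot names are illustrative).
tokens = ["I", "want", "cheap",   "New",    "York",   "style",  "pizza"]
tags   = ["O", "O",    "B-price", "B-food", "I-food", "I-food", "I-food"]

def iob_to_slots(tokens: list[str], tags: list[str]) -> dict[str, str]:
    """Collect slot values from IOB tags (B- starts a slot, I- continues it)."""
    slots, current = {}, None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = tag[2:]
            slots[current] = token
        elif tag.startswith("I-") and current == tag[2:]:
            slots[current] += " " + token
        else:
            current = None
    return slots

print(iob_to_slots(tokens, tags))
# {'price': 'cheap', 'food': 'New York style pizza'}
```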
- machine learning
  - generative × discriminative models
    - example: elephants vs. dogs
      - generative: ~2 models, of what elephants and what dogs look like
      - discriminative: establish a decision boundary
  - logistic regression
  - SVM
    - soft-margin SVM
  - regularized logistic regression
  - sequence prediction
    - maximum entropy Markov model
      - one error might lead to a series of errors
    - hidden Markov model
      - limited feature function
    - linear-chain conditional random field
- neural networks
  - both for classification & sequence models
  - non-linear functions, composed of basic building blocks, stacked into layers
  - activation functions
    - linear functions
    - nonlinearities – sigmoid, tanh, ReLU
    - softmax – probability estimates
  - fully differentiable – training by gradient descent
    - gradients backpropagated from the outputs to all parameters
  - features: word embeddings
  - recurrent neural networks (RNN)
    - many identical layers with shared parameters
    - the output of each layer is fed as input to the next
    - additionally, each layer gets another token from the input
    - other cell types: GRU, LSTM
      - make backpropagation work better
      - gates to keep old values
  - encoder-decoder networks
    - …
    - attention = “memory” of all encoder hidden states
  - transformer
    - gets rid of encoder recurrences
    - the whole encoder part can run in parallel → networks (and datasets) can be larger, and they are faster to train
    - …
    - apart from word embeddings, it also has positional embeddings
- neural NLU
  - various architectures possible
  - classification
    - feed-forward NN
    - RNN + attention weights → softmax
    - convolutional networks
    - transformer
  - sequence tagging
    - RNN (LSTM/GRU) → softmax over hidden states
    - a transformer works the same way
    - the intent can be tagged at the start of the sentence
- handling ASR noise
  - we can run NLU on all the hypotheses and sum the results
  - we can use confusion networks
    - the word features can be weighted by word confidence
- context
  - the user's response can depend on the last system action
  - we might need to add the last system DA/text into the input features
  - but the system cannot expect the user to always respond to the last question

## Dialogue State Tracking

- we need to remember what happened earlier in the dialogue
  - past system actions! (the user may react to them)
- ontology
  - describes the possible states
  - defines all concepts in the system
- problems with the dialogue state
  - NLU is unreliable
  - to solve that, we can ignore low-confidence input
    - but if there is some level of noise and the user repeats the same thing multiple times, is the confidence still low?
- belief state
  - we estimate a probability distribution over all possible states
- Markov decision process
- partially observable Markov decision process
- naïve generative belief tracking
  - parameter tying
  - …
- LLM prompting

## Dialogue Policy

- dialogue management
  - DST tracks the past
  - the dialogue policy navigates towards the future
    - the policy selects the next action
- action selection approaches
  - finite-state machines
    - good for tone-selection phone systems
  - frame-based
    - state = frame with slots
    - slots can be filled in any order
    - more pieces of information in one utterance are possible
    - the system asks until all the slots are filled
    - standard implementation: VoiceXML
  - rule-based
    - if-then-else rules in programming code
    - very flexible, but gets messy
    - the dialogue policy is still pre-set, which might not be the best approach
  - statistical – with machine learning
- dialogue management with supervised learning
  - action selection ~ classification → use supervised learning?
  - hard to get sufficiently large human-human data
  - dialogue is ambiguous and complex; there's no single correct next action
  - some paths are not explored in the data, but you may encounter them
  - DSs should behave differently from people; they are in a different situation
- DM as a Markov decision process
  - it has the Markov property – the current state defines everything
  - deterministic vs. stochastic policy
    - ~ pure vs. mixed strategy profile
    - deterministic … for every state, the next action is fixed
    - stochastic … for a state, there is a probability distribution over possible actions
- reinforcement learning
  - finding a policy that maximizes the long-term reward
  - example reward mechanism
    - each turn … −1 (to minimize the total number of turns)
    - success … +20
    - failure … −10
  - discount factor … gives less weight to rewards that are further in the future
  - state-value function
  - Bellman equation
    - the equations will not be required :)
  - action-value function
  - optimal policy
- RL agent taxonomy
- RL approaches
  - dynamic programming – exact solution from the Bellman equation
  - Monte Carlo – sample, learn from experience
  - temporal difference – look-ahead sampling (bootstrapping), refine estimates as you go
  - sampling & updates – on-policy vs. off-policy
- examples of RL approaches
  - value iteration
    - dynamic programming, model-based, value-based
    - we update the value $V(s)$ until it converges for all states
    - can be done with $Q$ instead
    - we assume $p$ and $r$ to be known; they can be estimated from data, but it's expensive
  - Monte Carlo methods
  - SARSA (state-action-reward-state-action)
    - on-policy
  - Q-learning
    - off-policy (a sketch follows at the end of this section)
  - REINFORCE
    - we learn the policy directly (we update its parameters)
- POMDP
  - MDP algorithms need the states to be quantized/discretized
    - policy gradients work out of the box
  - summary space
  - nowadays probably not necessary when using deep neural networks
- simulated users
  - RL needs a lot of data
  - in the beginning, the system behaves randomly, and people don't like this
  - that's why we build another dialogue system (or at least a dialogue manager) that can simulate the user
- deep reinforcement learning
  - part of the agent is handled by a neural network
  - deep Q-networks
    - the Q function is represented by a neural net
    - plain Q-learning does not converge well with NNs
      - SGD is unstable
      - correlated samples (data is sequential)
      - …
    - there are some fixes we can use
  - interesting tricks
    - experience replay – a buffer of 10k moves → we sample from a mix of both old and recent moves
    - target Q-function freezing – keep a copy of the Q function that does not get updated every step
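A minimal sketch of tabular Q-learning as outlined above. The environment object is hypothetical: it is assumed to expose a list `actions`, a Gym-style `reset() -> state`, and `step(action) -> (next_state, reward, done)`; the hyperparameter values are illustrative.

```python
import random
from collections import defaultdict

def q_learning(env, episodes: int = 1000, alpha: float = 0.1,
               gamma: float = 0.99, epsilon: float = 0.1):
    """Tabular Q-learning (off-policy temporal-difference control)."""
    Q = defaultdict(float)  # Q[(state, action)], implicitly initialized to 0

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy: mostly greedy, sometimes random
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # off-policy: the target uses the greedy (max) value of the next state
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```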
## Natural Language Generation

- subtasks
  - content planning
  - sentence planning
  - surface realization
- NLG basic approaches
  - canned text – hand-written prompts
  - templates – a “fill in the blanks” approach
  - grammars & rules
  - machine learning – with or without NNs
- neural networks
  - seq2seq RNNs
  - transformer
    - in theory, it's weaker than RNNs
    - but the models can be larger (we can train them in parallel)
    - usually, only the decoder model is used

## Voice Assistants & Question Answering