Lecture
- credit requirements
- https://web.stanford.edu/~jurafsky/slp3/
- communication domains
- single/closed-domain
- multi-domain
- open-domain
- application areas
- phone
- apps
- smart speakers
- appliances
- cars
- web
- embodied (robots)
- virtual characters
- modes of communication
- text
- voice
- multimodal – video, facial expressions, touch, …
- dialogue initiative
- system-initiative
- user-initiative
- mixed-initiative
- traditional architecture
- main loop
- voice → text → meaning → reaction → text → voice
- components
- speech recognition
- language understanding
- dialogue management
- has access to backend (in order to perform tasks)
- language generation
- speech synthesis
- multimodal system would have additional components
- automatic speech recognition (ASR)
- converting speech signal into text
- typically produces several possible hypotheses with confidence scores
- n-best list
- lattice
- confusion network
- very good in ideal conditions
- problems: noise, accents, distance, channel (phone), …
- voice activity detection
- is the user talking to the system?
- wake words (OK, Google)
- ASR is usually implemented using neural networks
- natural/spoken language understanding (NLU/SLU)
- extracting the meaning from the user utterance
- converting into a structured semantic representation
- dialogue acts
- act type/intent (inform, request, confirm)
- slot/attribute
- value
- examples
- inform(food=Chinese, price=cheap)
- request(address)
- can be more complex (using syntax trees, predicate logic)
- specific steps
- named entity recognition
- coreference resolution
- implementation varies
- handcrafting often works for limited domains
- keyword spotting, regular expressions, handcrafted grammars
- machine learning approaches
- can also provide n-best outputs
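- example (a sketch, not from the lecture): keyword spotting with regular expressions mapped to a dialogue act; the patterns, slot names, and act names below are invented for illustration

```python
import re

# Toy keyword-spotting NLU: map an utterance to a dialogue act
# (act type + slot-value pairs), e.g. inform(food=Chinese, price=cheap).
FOOD_RE = re.compile(r"\b(chinese|italian|indian)\b", re.I)
PRICE_RE = re.compile(r"\b(cheap|moderate|expensive)\b", re.I)
REQUEST_RE = re.compile(r"\b(address|phone|postcode)\b", re.I)

def parse(utterance: str) -> dict:
    if m := REQUEST_RE.search(utterance):
        return {"act": "request", "slots": {m.group(1).lower(): None}}
    slots = {}
    if m := FOOD_RE.search(utterance):
        slots["food"] = m.group(1).lower()
    if m := PRICE_RE.search(utterance):
        slots["price"] = m.group(1).lower()
    return {"act": "inform" if slots else "null", "slots": slots}

print(parse("I'd like a cheap Chinese place"))
# {'act': 'inform', 'slots': {'food': 'chinese', 'price': 'cheap'}}
```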
- problems
- recovering from bad ASR
- ambiguities – next Friday (it is Tuesday now)
- variation – there are many ways to express the same thing
- dialogue manager (DM)
- stores dialogue history modeled by dialogue state
- handcrafted × probabilistic
- handcrafted … just replace the slot value with the last-mentioned one
- probabilistic … keep an estimate
- system actions described by dialogue policy
- decision on next system action, given dialogue state
- involves backend queries
- result represented as system dialogue act
- handcrafted
- if-then-else clauses
- flowcharts
- machine learning
- often trained with reinforcement learning
- POMDP (partially observable Markov decision process)
- recurrent neural networks
- natural language generation (NLG)
- how to express things might depend on context
- goals: fluency, naturalness, avoid repetition, …
- traditional approach: templates
- fill in values into predefined templates (sentence skeletons)
- works well for limited domains
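- example (illustrative sketch): filling slot values into predefined sentence skeletons; the templates below are invented, not from the lecture

```python
# Toy template-based NLG: pick a sentence skeleton for the system act
# and fill in the slot values.
TEMPLATES = {
    "inform": "{name} is a {price} restaurant serving {food} food.",
    "request": "What {slot} are you looking for?",
}

def realize(act: str, **values) -> str:
    return TEMPLATES[act].format(**values)

print(realize("inform", name="Golden Dragon", price="cheap", food="Chinese"))
print(realize("request", slot="price range"))
```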
- grammar-based approaches
- grammar/semantic structures
- syntactic transformation rules are applied
- statistical approaches
- most prominent: transformer neural networks
- generating word-by-word
- speech synthesis
- standard pipeline: text normalization, pronunciation analysis, intonation/stress generation, waveform synthesis
- TTS methods
- formant-based – phoneme-specific frequencies, rules
- concatenative – record a single person, cut into phoneme transitions
- hidden Markov models
- neural networks
- no need for phoneme conversion, can go directly from text
- text to spectrograms → vocoder (spectrogram to audio)
- organizing the components
- basic – pipeline
- components oblivious of each other
- interconnected
- read/write changes to dialogue state
- more reactive but more complex
- joining the modules
- ASR + NLU
- NLU + state tracking
- NLU & DM & NLG – using LLMs, may be end-to-end (without module separation)
- audio based end-to-end (audio-to-audio)
- research areas
- LLM-based systems
- dialogue flows from data – finding patterns in human dialogue recordings/transcripts
- multimodality – adding video (input/output)
- context dependency – understand/reply in context (grounding, speaker adaptation)
- incrementality – don't wait for the whole sentence to start processing
What happens in a dialogue
- dialogue = conversational communication between two or more people
- verbal + non-verbal
- collaborative, social
- practical, related to actions
- interactive, incremental, messy
- dialogue systems are simpler than that
- linguistic description
- phonetics/phonology
- morphology
- syntax
- semantics – sentence (propositional) meaning
- pragmatics – meaning in context, communication goal
- underlying meaning of the sentence
- turn-taking (interactivity)
- turn = continuous utterance from one speaker
- normal dialogue – very fluent, fast
- minimizing overlaps and gaps
- cues/markers for turn boundaries
- overlaps happen naturally
- ambiguities in turn-taking
- barge-in
- natural speech is very different from written text
- turn taking in dialogue systems
- consecutive turns are typically assumed
- system waits for the user to finish their turn
- voice activity detection (VAD)
- quite hard
- we need to figure out if it is the user speaking and if they are speaking to the system
- wake words make VAD easier
- some systems allow user's barge-in
- may be tied to the wake word
- voice activity detection
- overlapping windows of ~30 ms + binary classifier
- features are similar to speech recognition itself
- onset is easier to detect than end of speech
- but it is hard to detect speech towards the system vs. someone else (that's why wake words are used)
- postprocessing – smoothing out short-term errors
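- sketch (simplified): frame-based VAD over overlapping ~30 ms windows; a plain energy threshold stands in for the trained binary classifier, and the window/threshold values are arbitrary assumptions

```python
import numpy as np

def vad(signal: np.ndarray, sr: int = 16000,
        win: float = 0.03, hop: float = 0.01, thresh: float = 0.01) -> np.ndarray:
    """Label overlapping ~30 ms windows as speech (True) / non-speech (False)."""
    w, h = int(win * sr), int(hop * sr)
    frames = [signal[i:i + w] for i in range(0, len(signal) - w, h)]
    labels = np.array([np.mean(f ** 2) > thresh for f in frames])
    # post-processing: smooth out short-term errors with a majority vote
    k = 5
    return np.array([labels[max(0, i - k):i + k + 1].mean() > 0.5
                     for i in range(len(labels))])

sr = 16000
sig = np.concatenate([np.zeros(sr),                                         # 1 s silence
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)])  # 1 s tone
print(vad(sig, sr).mean())   # roughly half of the frames labelled as speech
```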
- speech acts
- John L. Austin, John Searle
- each utterance is an act (intentional, changing the state of the world)
- speech acts consist of
- utterance act – uttering of the words
- propositional act – semantics (surface meaning)
- illocutionary act – pragmatic meaning
- perlocutionary act – listener obeys command, listener's worldview changes, …
- types of speech acts
- assertive – speaker commits to the truth of a proposition
- directive – speaker wants the listener to do something
- commissive – speaker commits to do something themselves
- expressive – speaker expresses their psychological state
- declarative – performing actions (“performative verbs”)
- explicit (using a verb directly corresponding to the act) vs. implicit
- I promise to come by later. × I'll come by later.
- direct vs. indirect
- indirect – the surface meaning does not correspond to the actual one
- primary illocution – the actual meaning
- secondary illocution – how it's expressed
- example
- direct: Please close the window.
- indirect: Could you close the window?
- more indirect: I'm cold.
- conversational maxims
- Paul Grice
- based on Grice's cooperative principle (“dialogue is cooperative”)
- 4 maxims: quantity, quality, relation, manner
- implicatures
- obvious violation of the maxim implies additional meanings
- speech acts, maxims and implicatures in dialogue systems
- learned from the data / hand-coded
- understanding
- tested on real users → usually handles indirect speech acts
- implicatures limited – there's no common sense
- responses
- mostly strive for clarity – the user shouldn't need to work out implicatures
- grounding
- dialogue is cooperative → need to ensure mutual understanding
- common ground = shared knowledge, mutual assumptions of dialogue participants
- knowingly shared
- expanded/updated/refined in an informative conversation
- validated/verified via grounding feedback/evidence
- positive – understanding/acceptance signals
- visual, backchannels
- explicit feedback
- implicit feedback – showing understanding implicitly in the next utterance
- negative – misunderstanding
- visual
- implicit/explicit repairs
- clarification requests
- repair requests
- in dialogue systems
- crucial for successful dialogue
- backchannels / visual signals typically not present
- implicit confirmation very common
- explicit confirmation may be required for important steps
- clarification & repair requests very common
- part of dialogue management
- deixis = pointing
- relating between language & context/world
- very important in dialogue
- deictic expressions
- meaning dependent on the context
- pronouns, verbs (tense and person markers), adverbs, other (lexical meaning – e.g. come/go)
- typically egocentric, I – here – now is the center (origo)
- main types of deixis: personal, temporal, local
- other: social, discourse/textual
- anaphora/coreference
- anaphora – referring back
- cataphora – referring forward
- in dialogue systems
- systems typically assume a single user → personal deixis becomes much easier
- most systems are aware of time, location is more complicated
- coreference resolution is a separate problem
- prediction
- dialogue is a social interaction
- brain does not listen passively
- prediction is crucial for human cognition
- this is why we understand in adverse conditions
- we predict what the person might say → we can understand even in noisy environment
- entropy
- Claude Shannon
- communication channel, entropy
- plays well with the social interaction perspective
- people tend to use all available channel capacity
- in noisy environment, we speak louder and slower
- people tend to spread information evenly
- words carrying more information are emphasized
- conditional entropy
- how hard is it to guess the next word in the sentence?
- given n-gram preceding context
- related to Shannon entropy but may differ
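- worked example (toy, assuming a single-word/bigram context): estimating the conditional entropy of the next word given the previous word from counts on a made-up corpus

```python
import math
from collections import Counter

# H(next | previous) = -sum_{x,y} p(x, y) * log2 p(y | x), estimated from bigrams
words = "the cat sat on the mat the cat ate".split()
bigrams = Counter(zip(words, words[1:]))
contexts = Counter(words[:-1])
total = sum(bigrams.values())

h = 0.0
for (prev, nxt), c in bigrams.items():
    p_joint = c / total          # p(previous, next)
    p_cond = c / contexts[prev]  # p(next | previous)
    h -= p_joint * math.log2(p_cond)
print(f"H(next word | previous word) = {h:.2f} bits")
```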
- prediction in dialogue systems
- used a lot in speech recognition
- statistical language models – based on information theory
- not as good as humans
- less use in other DS components
- adaptation/entrainment
- people subconsciously adapt/align/entrain to their dialogue partner over the course of the dialogue
- wording, grammar
- speech rate, prosody, loudness
- accent/dialect
- this helps a successful dialogue and social bonding, feels natural
- dialogue systems typically don't align
- NLG is rigid (templates, machine learning trained without context)
- but people align to dialogue systems
- politeness
- dialogue as social interaction – follows social conventions
- indirect is polite
- this is the point of most indirect speech acts
- clashes with conversational maxims (maxim of manner)
- appropriate level of politeness might be hard to find (culturally dependent)
- face-saving (Brown & Levinson)
- positive face = desire to be accepted, liked
- negative face = desire to act freely
- face-threatening acts – potentially any utterance
- threatening other's/own negative/positive face
- politeness softens FTAs
- in dialogue systems
- typically handcrafted, does not adapt to the situation
- typically not much indirect speech, but trying to stay polite
- learning from data can be tricky – may contain offensive speech (not just swearwords, problems can be hard to find)
Data
- two main questions before building a dialogue system
- what data to base it on
- how to evaluate it
- observation: if you have extensive data of a high-enough quality, the LLM learns how to count etc. just from the examples
- data
- corpus/dataset = collection of linguistic data
- Hugging Face, Czech National Corpus, …
- dialogue corpora/dataset types
- modality: written/spoken/multimodal
- source
- human-human conversations – real dialogues, scripted (from movies)
- human-machine
- automatically generated
- domain
- closed/constrained/limited domain
- multi-domain (multiple closed domains)
- open domain (any topic, chitchat)
- dialogue data collection
- in-house collection using experts (or students)
- safe, high-quality
- expensive, time-consuming
- Wizard-of-Oz (WoZ)
- for in-house data collection
- also: to prototype/evaluate a system before implementing it
- users believe they're talking to a system
- they behave differently than when talking to a human
- usually simpler
- system in fact controlled by a human “wizard”
- typically selecting options (free typing is too slow)
- web crawling
- typically not real dialogues
- offensive stuff
- many copies of the same content
- problematic licensing
- crowdsourcing
- compromise: employing (untrained) people over the web
- platforms: Amazon Mechanical Turk, Appen, Prolific
- people tend to game the system, causing noise
- corpus annotation
- what we need to add to the data (recordings)
- transcriptions (textual representation of audio)
- semantic annotation such as dialogue acts
- named entity labelling
- other linguistic annotation: POS, syntax (usually not in DSs)
- getting annotation
- similar task as getting the data itself
- inter-annotator agreement
- typical measure: Cohen's Kappa
- for categorical annotation
- κ ≤ 1 (1 = perfect agreement, ~0 = chance-level agreement)
- 0.4 ~ fair, >0.7 ~ great
- κ = (agreement − chance) / (1 − chance)
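- example (sketch): computing Cohen's kappa for two annotators' categorical labels with the formula above; the labels are made up

```python
from collections import Counter

def cohens_kappa(a, b):
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement)"""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    chance = sum((ca[label] / n) * (cb[label] / n) for label in set(a) | set(b))
    return (observed - chance) / (1 - chance)

ann1 = ["inform", "request", "inform", "confirm", "inform"]
ann2 = ["inform", "request", "confirm", "confirm", "inform"]
print(round(cohens_kappa(ann1, ann2), 2))   # 0.69
```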
- corpus size
- we need enough examples for an accurate model
- speech: 10s–100s of hours minimum
- pretrained LMs/audio LLMs: 100k–10M hours
- NLU, DM, NLG
- handcrafting: 10s–100s of dialogues may be enough to inform the design
- simple model / limited domain: 100s–1000s dialogues might be fine
- open domain: sky's the limit (LLMs: 1T+ tokens)
- TTS – single person, several hours at least
- it pays off to have high-quality recordings of a single person speaking in a flat tone
- pretrained LMs: 10k+ hours (multilingual)
- available dialogue datasets
- domain choice is rather limited
- size is very often not enough
- vast majority is English-only
- few free datasets with audio
- MultiWOZ
- task-oriented written dataset
- crowdsourced
- dataset types
- dataset splits
- train/dev/test split
- dev … validation
- test … evaluation
- cross-validation
Evaluation
- types
- extrinsic (how does the system affect the world) × intrinsic (how do its components work)
- subjective (what users think about it, manual) × objective (measuring properties directly from data, automatic)
- we use quantitative evaluation (based on numeric data), not qualitative (detailed interviews with the users)
- significance testing (Student's t-test, Mann-Whitney U-test), bootstrap resampling
- getting the subjects for extrinsic evaluation
- extrinsic evaluation
- how to measure
- record people
- analyze the logs
- metrics
- task success
- duration
- retention rate (percentage of returning users)
- fallback rate (percentage of failed dialogues)
- number of users – not applicable in a research setting
- subjective
- questionnaires
- question types: open-ended, yes/no, Likert scales
- …
- intrinsic
- ASR: word error rate ~ length-normalized Levenshtein distance (see the sketch at the end of this section)
- NLU
- slot precision & recall & F1-measure
- accuracy used for intent/act type
- dialogue manager
- objective measures can be collected with user simulator
- NLG
- word-overlap (BLEU score)
- slot error rate
- diversity
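- sketch of the word error rate mentioned under ASR above: word-level Levenshtein distance normalized by reference length

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # standard Levenshtein dynamic-programming table
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

print(wer("book a table for two", "book table for you"))   # 2 errors / 5 words = 0.4
```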
Natural Language Understanding
- words → meaning
- challenges
- non-grammaticality
- disfluencies
- ASR errors
- synonymy
- out-of-domain utterances
- semantic representations
- syntax/semantic trees
- frames
- graphs
- dialogue acts
- basic approaches
- for trees/frames/graphs
- grammar-based parsing
- grammars are expensive, hard to maintain
- hardware-hungry, brittle
- CFGs are too simple for full natural language
- Phoenix Parser
- statistical
- for dialogue acts (both options can be rule-based or statistical)
- classification
- concepts: intent, slot-value pair
- consistency problems (conflicting intents, conflicting values) need to be solved externally
- sequence labelling
- named-entity recognition (NER) + delexicalization
- identify slot values / named entities
- delexicalize = replace them with placeholders (indicating entity type)
- slot filling as sequence tagging
- get slot values directly, automatic delexicalization
- each word classified
- IOB format (inside-outside-beginning)
- O … word does not belong to any slot
- B … beginning of the slot
- I … another word inside the slot
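- example: IOB tags produced by a slot tagger and how they map back to slot values; the sentence and slot names are illustrative

```python
# tagger output for "I want cheap Chinese food in the north"
tagged = [("I", "O"), ("want", "O"), ("cheap", "B-price"),
          ("Chinese", "B-food"), ("food", "O"),
          ("in", "O"), ("the", "O"), ("north", "B-area")]

def iob_to_slots(tagged):
    slots, current = {}, None
    for word, tag in tagged:
        if tag.startswith("B-"):                  # beginning of a new slot value
            current = tag[2:]
            slots[current] = word
        elif tag.startswith("I-") and current:    # continuation of the same slot
            slots[current] += " " + word
        else:                                     # O: word outside any slot
            current = None
    return slots

print(iob_to_slots(tagged))   # {'price': 'cheap', 'food': 'Chinese', 'area': 'north'}
```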
- it is useful to combine rules and classifiers
- keywords/regexes found at specific position
- apply classifier to each word in the sentence left-to-right
- problem: overall consistency (slots found elsewhere in the sentence might influence what's classified now)
- solution: structured/sequence prediction
- machine learning
- generative × discriminative models
- example: elephants vs. dogs
- generative: ~ 2 models, what elephants and dogs look like
- discriminative: establish decision boundary
- logistic regression
- SVM
- soft-margin SVM
- regularized logistic regression
- sequence prediction
- maximum entropy Markov model
- one error might lead to a series of errors
- hidden Markov model
- linear-chain conditional random field
- neural networks
- both for classification & sequence models
- non-linear functions, composed of basic building blocks, stacked into layers
- activation functions
- linear functions
- nonlinearities – sigmoid, tanh, ReLU
- softmax – probability estimates
- fully differentiable – training by gradient descent
- gradients backpropagated from outputs to all parameters
- features: word embeddings
- recurrent neural networks (RNN)
- many identical layers with shared parameters
- output of the first layer is fed as an input to the second
- additionally, each layer gets another token from the input
- other cell types: GRU, LSTM
- make backpropagation work better
- gates to keep old values
- encoder-decoder networks
- attention = “memory” of all encoder hidden states
- transformer
- getting rid of encoder recurrences
- the whole encoder can run in parallel → networks (and datasets) can be larger and are faster to train
- …
- apart from word embedding, it has also positional embedding
- neural NLU
- various architectures possible
- classification
- feed-forward NN
- RNN + attention weight → softmax
- convolutional networks
- transformer
- sequence tagging
- RNN (LSTM/GRU) → softmax over hidden states
- transformer works the same
- intent can be tagged at start of sentence
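- sketch (assuming PyTorch, invented layer sizes, and mean-pooling for the intent instead of a dedicated start-of-sentence tag): an LSTM over word embeddings with a softmax over hidden states for slot tags, plus an utterance-level intent classifier

```python
import torch
import torch.nn as nn

class JointNLU(nn.Module):
    """Bi-LSTM over word embeddings: per-token slot tags + one intent."""
    def __init__(self, vocab_size, n_tags, n_intents, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.tag_head = nn.Linear(2 * hidden, n_tags)        # softmax over hidden states
        self.intent_head = nn.Linear(2 * hidden, n_intents)

    def forward(self, tokens):                    # tokens: (batch, seq_len) word ids
        h, _ = self.lstm(self.embed(tokens))      # (batch, seq_len, 2*hidden)
        tag_logits = self.tag_head(h)             # per-token IOB tag scores
        intent_logits = self.intent_head(h.mean(dim=1))   # pooled utterance intent
        return tag_logits, intent_logits

model = JointNLU(vocab_size=5000, n_tags=9, n_intents=4)
tags, intent = model(torch.randint(0, 5000, (2, 7)))
print(tags.shape, intent.shape)    # (2, 7, 9) and (2, 4)
```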
- handling ASR noise
- we can run NLU for all the hypotheses and sum the results
- we can use confusion networks
- the word features can be weighed by word confidence
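- sketch of the n-best combination described above: run a stand-in NLU on every ASR hypothesis and sum the slot-value scores weighted by hypothesis confidence; the hypotheses and the toy NLU are invented

```python
from collections import defaultdict

# ASR n-best list: (hypothesis, confidence)
nbest = [("cheap chinese food", 0.6),
         ("sheep chinese food", 0.3),
         ("cheap genies food", 0.1)]

def nlu(text):
    """Stand-in NLU returning {(slot, value): score}."""
    out = {}
    if "chinese" in text:
        out[("food", "Chinese")] = 1.0
    if "cheap" in text:
        out[("price", "cheap")] = 1.0
    return out

combined = defaultdict(float)
for hyp, conf in nbest:
    for slot_value, score in nlu(hyp).items():
        combined[slot_value] += conf * score     # weigh by ASR confidence, then sum

print(dict(combined))   # {('food', 'Chinese'): 0.9, ('price', 'cheap'): 0.7}
```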
- context
- user response can depend on last system action
- we might need to add last system DA/text into input features
- but the system cannot expect the user to always respond to the last question
Dialogue State Tracking
- we need to remember what happened in the past during the dialogue
- past system actions! (user may react to them)
- ontology
- to describe possible states
- defines all concepts in the system
- problems with dialogue state
- NLU is unreliable
- to solve that, we can ignore low-confidence input
- but if there is some noise and the user keeps repeating the same thing, should we still treat the confidence as low?
- belief state
- we estimate a probability distribution over all possible states
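- sketch of one simple per-slot belief update heuristic (in the spirit of simple baseline trackers, not necessarily the exact algorithm from the lecture): the confidence mass of the current turn's NLU output overrides a proportional part of the previous belief

```python
def update_belief(belief: dict, slu: dict) -> dict:
    """belief: P(value) for one slot so far; slu: this turn's NLU confidences."""
    observed_mass = sum(slu.values())                       # <= 1.0
    new = {v: (1 - observed_mass) * p for v, p in belief.items()}
    for value, conf in slu.items():
        new[value] = new.get(value, 0.0) + conf
    return new

belief = {"Chinese": 0.6, "Italian": 0.4}
belief = update_belief(belief, {"Indian": 0.7})   # noisy but fairly confident new value
print(belief)   # roughly {'Chinese': 0.18, 'Italian': 0.12, 'Indian': 0.7}
```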
- Markov decision process
- partially observable Markov decision process
- naïve generative belief tracking
- parameter tying
- …
- LLM prompting
Dialogue Policy
- dialogue management
- DST tracks the past
- dialogue policy navigates towards the future
- policy selects the next action
- action selection approaches
- finite-state machines
- good for touch-tone (DTMF) phone systems
- frame-based
- state = frame with slots
- slots can be filled in any order
- more information in one utterance possible
- the system asks until all the slots are filled
- standard implementation: VoiceXML
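- sketch of frame-based action selection: keep requesting unfilled slots (the user can fill them in any order), then query the backend; slot names and acts are illustrative

```python
def next_action(frame: dict) -> str:
    """Frame-based policy: request the first empty required slot, else act."""
    required = ["food", "area", "price"]
    for slot in required:
        if frame.get(slot) is None:
            return f"request({slot})"
    return "query_backend_and_inform()"

print(next_action({"food": "Chinese", "area": None, "price": None}))    # request(area)
print(next_action({"food": "Chinese", "area": "north", "price": "cheap"}))
```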
- rule-based
- if-then-else rules in programming code
- very flexible, but gets messy
- dialogue policy is still pre-set, which might not be the best approach
- statistical – with machine learning
- dialogue management with supervised learning
- action selection ~ classification → use supervised learning?
- hard to get sufficiently large human-human data
- dialogue is ambiguous and complex, there's no single correct next action
- some paths are not explored in the data but you may encounter them
- DSs should behave differently from people – they are in a different situation
- DM as a Markov Decision Process
- it has Markov property – current state defines everything
- deterministic vs. stochastic policy
- ~ pure vs. mixed strategy profile
- deterministic … for every state, the next action is fixed
- stochastic … for a state, there is a probability distribution of possible actions
- reinforcement learning
- finding a policy that maximizes long-term reward
- example reward mechanism
- for each turn … -1 (to minimize total number of turns)
- success … +20
- fail … -10
- discount factor … gives less weight to rewards further in the future
- state-value function
- Bellman equation
- the equations will not be required :)
- action-value function
- optimal policy
- RL agent taxonomy
- RL approaches
- dynamic programming – exact solution from Bellman equation
- Monte Carlo – sample, learn from experience
- temporal difference – look-ahead sampling (bootstrap), refine estimates as you go
- sampling & updates – on-policy vs. off-policy
- examples of RL approaches
- value iteration
- dynamic programming, model-based, value-based
- we update the V(s) value until it converges for all the states
- can be done with Q instead
- we assume p and r are known; they can be estimated from data, but that is expensive
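- sketch of value iteration on a tiny hand-made MDP (states, transition probabilities p and rewards r are invented; the reward scale loosely follows the -1/+20 example above)

```python
# transitions[s][a] = list of (probability, next_state, reward)
transitions = {
    "ask":  {"ask":     [(1.0, "ask", -1)],
             "confirm": [(0.8, "done", 20), (0.2, "ask", -1)]},
    "done": {},                                    # terminal state
}
gamma = 0.9
V = {s: 0.0 for s in transitions}

for _ in range(100):                               # repeat until V stops changing
    for s, actions in transitions.items():
        if not actions:
            continue
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                   for outcomes in actions.values())

print({s: round(v, 2) for s, v in V.items()})      # V('ask') converges to ~19.27
```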
- Monte Carlo methods
- SARSA (state-action-reward-state-action)
- Q-learning
- REINFORCE
- we are learning the policy directly (we update its parameters)
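- sketch of the tabular Q-learning update mentioned above: learn from sampled (s, a, r, s′) transitions, bootstrapping with the best next action (hence off-policy); states, actions and values are invented

```python
import random
from collections import defaultdict

Q = defaultdict(float)                      # Q[(state, action)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1
actions = ["ask", "confirm", "inform"]

def choose(state):
    """Epsilon-greedy behaviour policy."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next):
    """Off-policy TD update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

s = "slot_missing"
a = choose(s)                               # pick an action while interacting
q_update(s, a, -1, "slot_filled")           # reward and next state observed in the dialogue
```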
- POMDP
- MDP algorithms need the states to be quantized/discretized
- policy gradients work out of the box
- summary space
- nowadays, probably not necessary when using deep neural networks
- simulated users
- RL needs a lot of data
- in the beginning, the system behaves randomly, which people don't like
- that's why we need to build another dialogue system (or at least dialogue manager) that can simulate the user
- deep reinforcement learning
- part of the agent is handled by a neural network
- deep Q-networks
- Q function is represented by a neural net
- usual Q-learning does not converge well with NNs
- SGD is unstable
- correlated samples (data is sequential)
- …
- there are some fixes we can use
- interesting tricks
- experience replay – buffer of 10k moves → we sample from a set of both old and recent moves
- target Q function freezing – have a copy of Q function that does not get updated every time
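- sketch of the experience replay trick: a fixed-size buffer of transitions from which random mini-batches are sampled, mixing old and recent moves and breaking the correlation of sequential data; buffer and batch sizes are illustrative

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions fall out

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        # uniform random sampling decorrelates consecutive transitions
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(1000):
    buf.add((t, "ask", -1, t + 1, False))
batch = buf.sample(32)                            # mini-batch for one DQN update
```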
Natural Language Generation
- subtasks
- content planning
- sentence planning
- surface realization
- NLG basic approaches
- canned text – hand-written prompts
- templates – “fill in blanks” approach
- grammars & rules
- machine learning – with or without NNs
- neural networks
- Seq2seq RNNs
- Transformer
- in theory, it's weaker than RNNs
- but the models can be larger (we can train them in parallel)
- usually, only the decoder model is used
Voice Assistants & Question Answering