# Exam

The exam will have 10 questions, mostly from this pool. In general, none of them requires you to memorize formulas, but you should know the main ideas and principles.

## Introduction

- What's the difference between task-oriented and non-task-oriented systems?
    - task-oriented
        - focused on completing certain tasks (booking restaurants/flights/hotels, finding bus schedules, smart home, …)
        - most actual dialogue systems in production
        - “backend access” vs. “agent/assistant”
    - non-task-oriented
        - chitchat – social conversation, entertainment
        - getting to know the user, specific persona
        - gaming the Turing test
- Describe the difference between closed-domain, multi-domain, and open-domain systems.
    - single/closed-domain – on a well-defined area, small set of specific tasks (e.g. banking system on a specific phone number)
    - multi-domain – joining several single-domain systems
    - open-domain – “responds to anything”, used to be mostly chitchat, now somewhat working via LLMs
- Describe the difference between user-initiative, mixed-initiative, and system-initiative systems.
    - user-initiative – user asks, machine responds
    - system-initiative – “form-filling”, system asks questions, user must reply (traditional, most robust, least natural)
    - mixed-initiative – system and user both can ask & react to queries; most natural, most complex

## Linguistics of Dialogue

- What are turn taking cues/hints in a dialogue? Name a few examples.
    - a speaker can use a turn-taking cue/hint to signal when their turn ends (they yield)
    - examples: linguistic (e.g. finished sentence), voice pitch, timing (gaps), eye gaze, gestures, …
- Explain the main idea of the speech acts theory.
    - each utterance is an act: intentional, changing the state of the world (changing the knowledge/mood of the listener, influencing their behavior)
    - speech acts consist of several levels: the words, their semantics, meaning, effect
    - types of speech acts: assertive, directive, commissive, expressive, declarative
    - explicit vs. implicit; direct vs. indirect
        - explicit: I **promise** to come by later.
        - implicit: I'll come by later.
        - direct: Please close the window.
        - indirect: Could you close the window?
        - even more indirect: I'm cold.
- What is grounding in dialogue?
    - dialogue is cooperative → need to ensure mutual understanding
    - common ground = shared knowledge, mutual assumptions of dialogue participants
        - the knowledge has to be *knowingly* shared
    - common ground is expanded/updated/refined in an informative conversation
        - validated/verified via grounding feedback/evidence
        - speaker presents utterance
        - listener accepts utterance by providing evidence of understanding
        - information added to common ground only after acceptance
- Give some examples of grounding signals in dialogue.
    - positive – understanding/acceptance signals
        - visual – eye gaze, facial expressions, smile
        - backchannels – particles signalling understanding (uh-uh, hmm, yeah, …)
        - explicit feedback – explicitly stating understanding (I know; yes, I understand)
        - implicit feedback – showing understanding implicitly in the next utterance
    - negative – misunderstanding
        - visual – stunned/puzzled silence
        - implicit/explicit repairs – denying (no, that's not right) / presenting alternative
        - clarification requests – demonstrating ambiguity & asking for additional information (Which John? John Smith or John Doe?)
        - repair requests – showing non-understanding & asking for correction (Oh, so you're not flying to London? Where are you going then?)
- What is deixis? Give some examples of deictic expressions.
    - “pointing” – relating between language & context/world
    - dialogue is typically set/situated in a specific context
    - deictic expressions
        - their meaning depends on the context (who is talking, when, where)
        - pronouns (I, you, him, this)
        - verbs: tense & person markers
        - adverbs (here, now, yesterday)
        - lexical meaning (come × go)
        - non-verbal (gestures, gaze)
        - typically egocentric
    - main types of deixis
        - personal – I, me, you, she
        - temporal – now, yesterday, later, on Monday
        - local – here, there
        - other types: social (politeness), discourse/textual (next chapter)
- What is coreference and how is it used in dialogue?
    - expression referring to something mentioned in context
        - anaphora = referring back
        - cataphora = referring forward
    - avoiding repetition, faster expression
    - can refer to basically anything (objects/persons/events, qualities, actions / full sentences / portions of text)
    - used frequently in dialogue, may be ambiguous
    - examples
        - anaphora: Susan dropped the plate. **It** shattered.
        - cataphora: When **he** hears that fire alarm, Sam is always cool and calm.
        - I don't like it as much as he **does**.
        - Her dress is green. **So** is mine.
        - Shall I book a room for you? – Sure, I'd like **that**.
        - ambiguity: Bill stands next to John. **He** is tall.
- What do Shannon entropy and conditional entropy measure? No need to give the formula, just the principle.
    - entropy – expected value of information conveyed (in bits)
        - $H(\mathrm{text})=-\mathbb E[\log p(\mathrm{word})]$
    - entropy plays well with the social interaction perspective
        - people tend to use all available channel capacity
        - people tend to spread information evenly (words carrying more information are emphasized)
    - conditional entropy – how hard is it to guess the next word in the sentence?
        - given preceding context (n-gram)
        - related to Shannon entropy but may differ (it is typically much lower than Shannon entropy)
        - better estimate of prediction difficulty (although humans work with “unlimited” preceding context and reevaluate using following context)
        - $H_\mathrm{cond}(\mathrm{text})=-\mathbb E_p\left[\log\frac{p(c,w)}{p(c)}\right]$
    - see the toy estimation sketch at the end of this section
- What is entrainment/adaptation/alignment in dialogue?
    - people subconsciously adapt/align/entrain to their dialogue partner over the course of the dialogue
        - wording (lexical items) – they use the same words as their dialogue partner
        - grammar (sentential constructions)
        - speech rate, prosody, loudness
        - accent/dialect – BrE speaker uses AmE words when talking to AmE speaker
    - this helps a successful dialogue (also helps social bonding, feels natural)
    - systems typically don't align, people align to dialogue systems
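A toy illustration of the entropy vs. conditional entropy idea above (a minimal sketch; the tiny corpus and the unsmoothed unigram/bigram counts are illustrative assumptions, not the course's exact estimation procedure):

```python
import math
from collections import Counter

# tiny toy corpus; real estimates need far more data
corpus = "the cat sat on the mat the cat ate the rat".split()

# unigram (Shannon) entropy: H = -sum_w p(w) log2 p(w)
unigrams = Counter(corpus)
total = sum(unigrams.values())
H = -sum(c / total * math.log2(c / total) for c in unigrams.values())

# conditional entropy given the previous word (bigram context):
# H(W|C) = -sum_{c,w} p(c,w) log2 p(w|c)
bigrams = Counter(zip(corpus, corpus[1:]))
n_bi = sum(bigrams.values())
context = Counter(c for c, _ in bigrams)
H_cond = -sum((n / n_bi) * math.log2(n / context[c])
              for (c, _w), n in bigrams.items())

print(f"unigram entropy: {H:.2f} bits, conditional entropy: {H_cond:.2f} bits")
# the conditional entropy is lower: knowing the previous word makes the next one easier to guess
```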
## Data & Evaluation

- What are the typical options for collecting dialogue data?
    - in-house collection using experts or students
        - safe, high-quality, but very expensive & time-consuming
        - free talk / scripting whole dialogues / Wizard-of-Oz
    - web crawling
        - fast & cheap, but typically not real dialogues, may not be fit for purpose
        - potentially unsafe (offensive stuff)
        - need to be careful about the licensing
    - crowdsourcing
        - compromise: employing (untrained) people over the web
        - crowd workers tend to game the system
- How does Wizard-of-Oz data collection work?
    - users believe they're talking to a system → their behavior is simpler than when talking to a human
    - system is in fact controlled by a human “wizard”, who is selecting options (free typing is too slow)
    - usage: in-house data collection, prototyping/evaluating the system before implementing it
- What is corpus annotation, what is inter-annotator agreement?
    - annotation = labels, description added to the collected data (dialogues)
        - transcriptions (for ASR)
        - semantic annotation (for NLU) – dialogue acts, …
        - named entity labelling (for NLU)
    - inter-annotator agreement (IAA)
        - measures the reliability of manual annotations
        - multiple people annotate the same thing
        - needs to account for agreement by chance
        - typical measure: Cohen's kappa
            - $\kappa=\frac{\text{agreement}-\text{chance}}{1-\text{chance}}$
- What is the difference between intrinsic and extrinsic evaluation?
    - intrinsic … checks properties of systems/components in isolation, self-contained
    - extrinsic … how the system/component works in its intended purpose
        - effect of the system on something outside itself, in the real world (i.e. user)
- What is the difference between subjective and objective evaluation?
    - subjective … asking users' opinions, e.g. questionnaires (manual)
        - not repeatable
        - we should ask many people → not so subjective
    - objective … measuring properties directly from data (automatic)
        - might or might not correlate with users' perception
- What are the main extrinsic evaluation techniques for task-oriented dialogue systems?
    - objective metrics (we record people interacting with the system, analyze the logs)
        - task success / goal completion rate – did the user get what they wanted?
            - testers can have an agenda → we can check if they found what they were supposed to
            - basic check: did we provide any information at all? (any bus/restaurant)
        - duration – number of turns or time (less is better)
        - retention rate – percentage of users that return to use our dialogue system again (over a time period)
        - fallback rate – percentage of failed dialogues
        - number of total/new/active users
    - subjective evaluation
        - questionnaires for users/testers
        - example questions
            - success rate: Did you get all the information you wanted?
            - future use: Would you use the system again?
            - ASR/NLU: Do you think the system understood you well?
            - NLG: Were the system replies fluent/well-phrased?
            - TTS: Was the system's speech natural?
- What are some evaluation metrics for non-task-oriented systems (chatbots)?
    - objective metrics
        - duration (longer = better)
        - other: % returning users, checks for users swearing vs. thanking the system
    - subjective
        - likeability/engagement: Did you enjoy the conversation?
        - other similar to task-oriented
- What's the main metric for evaluating ASR systems?
    - word error rate (WER)
    - ASR output is compared to human-authored reference
    - $\mathrm{WER}=\frac{S+I+D}{N}$
        - $S$ … substitutions
        - $I$ … insertions
        - $D$ … deletions
        - $N$ … reference length
    - ~ length-normalized edit distance (Levenshtein distance)
    - sometimes insertions & deletions are weighted $0.5\times$
    - can be $\gt 1$
    - assumes one correct answer
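A minimal WER sketch via dynamic-programming edit distance (illustrative only; it weights substitutions, insertions and deletions equally and assumes a single reference):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = (S + I + D) / N, computed as Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("book a table for two", "book the table for two please"))  # 1 sub + 1 ins => 2/5 = 0.4
```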
- What's the main metric for NLU (both slots and intents)?
    - slots: precision, recall, F-measure (F1)
        - precision $P=\frac{\mathrm{correct}}{\mathrm{detected}}$
        - recall $R=\frac{\mathrm{correct}}{\mathrm{true}}$
        - F-measure $F=\frac{2PR}{P+R}$ harmonic mean
        - example (see the sketch at the end of this section)
            - NLU: inform(name=Golden Dragon, food=Chinese)
            - true: inform(name=Golden Dragon, food=Czech, price=high)
            - only name=Golden Dragon is correct → $P=1/2,\;R=1/3,\;F=0.4$
    - accuracy (% correct) used for intent/act type
    - alternatively also exact matches on the whole semantic structure (easier, but ignores partial matches)
    - one true answer assumed
- Explain an NLG evaluation metric of your choice.
    - BLEU score
        - word-overlap with reference text(s)
        - $BLEU=BP\cdot\sqrt[4]{p_1p_2p_3p_4}$
            - $p_n$ … $n$-gram precision (how many $n$-grams of the output text exist in any reference text)
            - $BP$ … brevity penalty (short sentences achieve higher $n$-gram precisions, so we penalize them)
    - slot error rate
    - diversity – can our system produce different replies?
- Why do you need to check for statistical significance (when evaluating an NLP experiment and comparing systems)?
    - higher score is not enough to prove your model is better
        - it can happen by chance
    - we need to define the hypotheses and select a significance level $\alpha$, then compute the observed value of the test statistic and reject $H_0$ or not
- Why do you need to evaluate on a separate test set?
    - we want to know how well our model works on new, unseen data (how well it generalizes)
    - memorizing training data would give us 100% accuracy (on training data)
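A minimal sketch of the slot precision/recall/F1 computation, using the Golden Dragon example from the notes above (slot-value pairs only; counting the intent separately is a different convention):

```python
def slot_prf(predicted: set, gold: set):
    """Precision/recall/F1 over slot-value pairs (exact match)."""
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# the example from the notes above
predicted = {("name", "Golden Dragon"), ("food", "Chinese")}
gold = {("name", "Golden Dragon"), ("food", "Czech"), ("price", "high")}
print(slot_prf(predicted, gold))  # (0.5, 0.333..., 0.4)
```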
## Natural Language Understanding

- What are some alternative semantic representations of utterances, in addition to dialogue acts?
    - syntax/semantic trees (dependency trees, constituent trees, …)
    - frames – technically also trees, not directly connected to words
    - graphs – abstract meaning representation (AMR), more of a toy task, but popular
    - predicate logic
- Describe language understanding as classification and language understanding as sequence tagging.
    - NLU as classification
        - we treat DAs as a set of semantic concepts
            - concepts: intents, slot-value pairs
        - binary classification: is concept Y contained in utterance X?
            - independent for each concept
        - consistency problems – conflicting intents/values need to be solved externally (e.g. based on classifier confidence)
    - language understanding as sequence tagging
        - we want to parse slot values from the text
        - we can classify each word using IOB format (inside/outside/beginning) – isolate the slot values (can consist of several words)
        - pure classification can lead to inconsistencies (I cannot follow after O)
        - it is useful to tag the whole sentence (the sequence of words) at once
- How do you deal with conflicting slots or intents in classification-based NLU?
    - we need to resolve such situations externally (e.g. based on classifier confidence)
- What is delexicalization and why is it helpful in NLU?
    - delexicalization = replacement of slot values / named entities with placeholders (indicating entity type)
    - generally needed for NLU as classification (otherwise in-domain data is too sparse)
    - named-entity recognition (NER) is a problem on its own
        - in-domain gazetteers (dictionaries of names) alone may be enough
- Describe one of the approaches to slot tagging as sequence tagging.
    - basic idea
        - we classify each word using IOB format to isolate the slot values
        - to avoid inconsistencies, we tag the whole sentence (the sequence of words) at once
    - approaches
        - maximum entropy Markov model (MEMM)
            - looking at past classifications when making next ones
                - whole history would be too sparse/complex → Markov assumption: only the most recent classifications matter
            - looking at the whole input
            - not modelling the sequence globally
                - error propagation … during inference (prediction), one error can lead to a series of errors
                - label bias problem
        - hidden Markov model (HMM)
            - modelling the sequence as a whole
            - very basic model – tag depends on current word + previous tag
                - Markov assumption
            - we can get globally best tagging (using Viterbi algorithm)
        - linear-chain conditional random field (CRF)
            - combines the advantages of HMM and MEMM – global sequence modelling with rich input features
            - uses global normalization → slow to train
            - state-of-the-art for many sequence tagging tasks (until neural networks took over; can be also used in conjunction with NNs)
- What is the IOB/BIO format for slot tagging?
    - it is used to get the slot values from the text
    - the words in the text are tagged; slot values can span several words
    - tags
        - B-$s$ … beginning of slot $s$
        - I-$s$ … inside slot $s$
        - O … outside
    - example
        - There are **over 1000** compositions by **Johan Sebastian Bach**.
        - O O B-quantity I-quantity O O B-person I-person I-person O
- What is the label bias problem?
    - it occurs in maximum entropy Markov models (MEMM)
    - due to local normalization, states with fewer outbound transitions are preferred – the transitions have larger probabilities than in states with more transitions
    - this makes the model less immune to error propagation (= one wrongly classified word leads to a series of errors)
- How can an NLU system deal with noisy ASR output? Propose an example solution.
    - simple approach
        - ASR produces multiple hypotheses (texts)
        - ASR → $p(\mathrm{text}\mid\mathrm{audio})$
        - NLU → $p(\mathrm{DA}\mid \mathrm{text})$
        - we want $p(\mathrm{DA}\mid \mathrm{audio})$
        - we sum it up: $p(\mathrm{DA}\mid\mathrm{audio})=\sum_{\mathrm{texts}} p(\mathrm{DA}\mid\mathrm{text})\,p(\mathrm{text}\mid\mathrm{audio})$
    - alternative approach: confusion networks
        - we use per-word ASR confidence
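A minimal sketch of the n-best summation just described (the hypothesis list, its confidences, the toy NLU and the DA labels are made-up placeholders):

```python
from collections import defaultdict

# made-up ASR n-best list: (text, p(text | audio))
asr_nbest = [
    ("book a table for two", 0.6),
    ("book a cable for two", 0.3),
    ("look a table for you", 0.1),
]

def toy_nlu(text):
    """Placeholder NLU returning p(DA | text); a real system would use a trained classifier."""
    if "book" in text and "table" in text:
        return {"inform(task=booking)": 0.9, "other": 0.1}
    return {"inform(task=booking)": 0.3, "other": 0.7}

# p(DA | audio) = sum over texts of p(DA | text) * p(text | audio)
p_da = defaultdict(float)
for text, p_text in asr_nbest:
    for da, p in toy_nlu(text).items():
        p_da[da] += p * p_text

print(dict(p_da))  # DA probabilities aggregated over all ASR hypotheses
```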
## Neural NLU & Dialogue State Tracking

- Describe an example of a neural architecture for NLU.
    - we can use simple classification or sequence tagging
    - when using sequence tagging, we can tag the intent at the start of the sentence (and then assign the IOB tags to all of its words)
    - examples of architecture
        - RNN-based NLU
            - bidirectional encoder (see [NLP notes](../natural-language-processing/exam.md#neural-machine-translation))
            - decoder that tags word-by-word (uses the encoder as one of its inputs)
            - intent classification – we can do softmax over last encoder state
            - attention can be used in the decoder and to classify the intent
        - (pretrained) Transformer-based NLU
            - slot tagging on top of pretrained BERT Transformer model
                - BERT was trained to guess masked words
                - further trained for NLU
            - standard IOB approach
                - softmax the final hidden layers → output tags
                - in case of split words, classify only the first subword (IOB tags should not change mid-word)
            - special start token tagged with intent
            - optional CRF on top of the tagger
- How can you use pretrained language models in NLU?
    - we can use BERT Transformer model and fine-tune it for NLU
        - BERT was trained to guess masked words
- What is the dialogue state and what does it contain?
    - dialogue state remembers what was said in the past
        - it acts as a basis for action selection decisions
    - dialogue state … current context of the conversation
    - contents = “all that is used when the system decides what to say next”
        - user goal / preferences (slots & values provided, information requested)
        - past system actions
        - other semantic context
    - usually, we consider a probability distribution over all possible states
- What is an ontology in task-oriented dialogue systems?
    - it is used to describe possible states
    - it defines all concepts in the system
        - list of slots
        - possible range of values per slot
        - possible actions per slot
        - dependencies (some concepts are only applicable for some values of parent concepts)
- Describe the task of a dialogue state tracker.
    - NLU is unreliable (it takes unreliable ASR output and adds its own errors), output might conflict with ontology
    - solution: we use belief state (probability distribution over all possible states)
        - per-slot distributions are used in practice
    - dialogue state tracker updates the belief state based on new information
    - to make it more robust, the state tracker can accumulate probability mass over multiple turns / over NLU n-best lists
    - probabilistic dialogue state tracker plays well with probabilistic dialogue policies
- What's a partially observable Markov decision process?
    - Markov decision process
        - model for sequential decision making when outcomes are uncertain
        - set of states, actions, probabilities that an action leads from a state $s$ to a state $s'$, and rewards received after transitioning from state $s$ to state $s'$ using action $a$
        - we are looking for a policy function – mapping from state space to action space (can be probabilistic)
    - partially observable MDP – we do not know the current state with certainty
        - belief state can be modelled using a hidden Markov model
- Describe a viable architecture for a belief state tracker.
    - basic discriminative belief tracker – we assume slot independence and trust the NLU
    - we have probabilities of states $p_s$ (tracked by our belief tracker) and probabilities of observations $p_o$ (returned by NLU)
    - in each step, for every slot…
        - we have the probability of null observation $p_o(\mathrm{null})$
        - for every state $x$, we multiply $p_s(x)$ by $p_o(\mathrm{null})$
        - for every non-null $x$, we then add $p_o(x)$ to $p_s(x)$
    - such a belief tracker is very fast and parameter-free (see the sketch at the end of this section)
- What is the difference between dialogue state and belief state?
    - dialogue state is the current context of a conversation
    - belief state is a probability distribution over dialogue states – it reflects the fact that the NLU is not completely reliable
- What's the difference between a static and a dynamic state tracker?
    - static state tracker encodes whole history into features
    - dynamic/sequence state tracker explicitly models dialogue as sequential
        - can use CRF or RNNs
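A minimal sketch of the per-slot belief update described above (slot values and NLU confidences are made up):

```python
def update_belief(belief: dict, nlu_obs: dict) -> dict:
    """One turn of the basic discriminative belief update for a single slot.

    belief:  {value: p_s(value)}, sums to 1
    nlu_obs: {value: p_o(value)} including the special "null" (nothing observed this turn)
    """
    p_null = nlu_obs.get("null", 0.0)
    new_belief = {v: p * p_null for v, p in belief.items()}        # discount old probability mass
    for value, p_obs in nlu_obs.items():
        if value != "null":
            new_belief[value] = new_belief.get(value, 0.0) + p_obs  # add newly observed mass
    return new_belief

belief = {"chinese": 0.7, "czech": 0.3}                # current belief for the "food" slot
nlu_obs = {"italian": 0.5, "czech": 0.2, "null": 0.3}  # NLU output for this turn
print(update_belief(belief, nlu_obs))
# {'chinese': 0.21, 'czech': 0.29, 'italian': 0.5} -- still sums to 1
```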
- How can you use pretrained language models or large language models for state tracking?
    - BERT (pretrained language model)
        - we let BERT process previous system & current user utterance
        - we use it to predict per-slot span (value of a dialogue state slot – where to find it in the message)
            - from the first token's representation, we get a single decision: none/dontcare/span
            - using 2 softmaxes over tokens, we can then predict start & end token
        - we apply rule-based update to the static state tracker – if *none* was predicted, we keep the previous value
    - LLM prompting – two alternatives were presented
        - SQL & examples: we present SQL schema to the LLM, show several examples, and provide the previous state + one dialogue turn → the (dynamic) state changes are produced as SQL requests
        - chain-of-thought style: we prompt the LLM to explain the inputs and produce state based on them (it uses the whole history, the state tracker is static)

## Dialogue Policies

- What are the non-statistical approaches to dialogue management/action selection?
    - finite-state machines
        - dialogue state is machine state
        - nodes – system actions
        - edges – possible user response semantics
        - FSMs are easy to design and predictable, but very rigid and do not scale to complex domains
        - good for basic DTMF (touch-tone) phone systems
    - frame-based (VoiceXML)
        - slot-filling + providing information
        - required slots need to be filled, this can be done in any order, more information in one utterance possible
        - if all slots are filled, query the database
    - rule-based – any kind of rules (e.g. Python code)
        - we can use a probabilistic belief state
        - if-then-else rules in programming code, using thresholds over belief state for reasoning (see the sketch at the end of this section)
        - output: system DA
        - very flexible and easy to code, but gets messy; the resulting dialogue policy is pre-set (hand-crafted, not learned)
- Why is reinforcement learning preferred over supervised learning for training dialogue managers?
    - you need large human-human data for supervised learning (hard to get)
        - if we used human-machine, the model would just mimic the original system
    - dialogue is ambiguous & complex
        - there is no single correct next action
        - some paths will be unexplored in data, but you may encounter them
    - dialogue systems won't behave the same as people
        - there are ASR errors, limited NLU, limited environment model/actions
        - dialogue systems *should* behave differently than people – make the best of what they have
    - in reinforcement learning, the goal is to find a policy that maximizes long-term reward – this corresponds well to the goal of dialogue management (overall dialogue success rather than just the locally best next action)
    - note that for a typical dialogue system, the belief state is too large to make RL tractable – we map the state into a reduced space, optimize there, and map it back
- Describe the main idea of reinforcement learning (agent, environment, states, rewards).
    - Markov decision process (MDP)
        - agent in an environment
            - has internal state
            - chooses actions according to policy
            - gets rewards and state changes from the environment
        - Markov property – state defines everything (no other temporal dependency)
    - RL = finding a policy that maximizes long-term reward
        - unlike supervised learning, we don't know if an action is good
        - immediate reward might be low while long-term reward high
        - return $R_t$ = accumulated long-term reward (from timestep $t$ onwards)
    - state transition is stochastic (has a random probability distribution) → we maximize expected return
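A minimal sketch of the rule-based, threshold-over-belief-state policy mentioned above (the slot names, the simplified belief representation with one top value per slot, and the confidence thresholds are made-up assumptions):

```python
def rule_based_policy(belief: dict) -> str:
    """Hand-coded policy: if-then-else rules with thresholds over the belief state.

    belief maps slot name -> (most likely value, confidence); returns a system DA string.
    """
    CONF = 0.7  # made-up confidence threshold

    food_value, food_conf = belief.get("food", (None, 0.0))
    area_value, area_conf = belief.get("area", (None, 0.0))

    if food_value is None or food_conf < CONF:
        return "request(food)"                      # ask for a missing / very uncertain slot
    if food_conf < 0.9:
        return f"confirm(food={food_value})"        # mid confidence -> explicit confirmation
    if area_value is None or area_conf < CONF:
        return "request(area)"
    return f"inform(name=..., food={food_value}, area={area_value})"  # query DB & make an offer

print(rule_based_policy({"food": ("chinese", 0.95), "area": ("centre", 0.8)}))
# inform(name=..., food=chinese, area=centre)
```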
- What are deterministic and stochastic policies in dialogue management?
    - deterministic policy
        - always take the same action $\pi(s)$ in state $s$
        - enumerable in a table, equivalent to a rule-based system
            - but can be learned instead of hand-coded!
    - stochastic
        - specifies a probability distribution
        - $\pi(s,a)$ … probability of choosing action $a$ in state $s$
- What's a value function in a reinforcement learning scenario?
    - state-value function $V^\pi(s)$ … the value of a state $s$ under policy $\pi$
        - expected return for starting in state $s$ and following policy $\pi$
    - action-value function $Q^\pi(s,a)$
        - expected return of taking action $a$ in state $s$ under policy $\pi$
    - value functions can be used to evaluate states (or actions) and make better decisions
- What's the difference between actor and critic methods in reinforcement learning?
    - actor model learns the policy
        - for a given state, it predicts a probability distribution over actions
        - the agent can then decide according to this distribution
    - critic model learns the value function
        - for a given state $s$, it predicts its value function $V(s)$ or $Q(s,a)$ for action $a$
        - this guides the agent (e.g. it can act greedily with respect to the value estimates)
- What's the difference between model-based and model-free approaches in RL?
    - model-based
        - we assume that transition probabilities and rewards are known
        - the solutions are mathematically nice
        - but you can only know the full model in limited settings
    - model-free
        - we don't assume anything
        - this is the one for “real-world” use
        - using $Q$ instead of $V$ comes in handy here (we do not need the transition probability $p(s'\mid s,a)$ to get the expected return of taking action $a$ in state $s$)
- What are the main optimization approaches in reinforcement learning (what measures can you optimize and how)?
    - quantity to optimize
        - value function – critic
        - policy – actor
        - environment model: model-based × model-free
    - how to optimize
        - dynamic programming – find the exact solution from Bellman equation
            - iterative algorithms, refining estimates
            - expensive, assumes known environment (model-based)
        - Monte Carlo learning – learn from experience
            - sample, then update based on experience
            - when we arrive at state $s$, we update the model to match the observation
        - temporal difference learning – like MC but look ahead (bootstrap)
            - sample, refine estimates as you go
            - even before we arrive at $s$, we have a good idea what the observation will be when we arrive at $s$ → we can update the model based on that guess
            - see the Q-learning sketch at the end of this section
    - sampling & updates
        - on-policy – improve the policy while we are using it for decisions
        - off-policy – decide according to a different policy
- Why do you typically need a user simulator to train a reinforcement learning dialogue policy?
    - we can't really learn just from static datasets
        - on-policy algorithms don't work (the system needs to navigate the dialogues according to the current policy – old dialogues are not sufficient)
    - RL needs a lot of data, more than real people would handle (also, the system behaves weirdly in the early phases of RL)
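A minimal tabular sketch of the temporal-difference idea above, using the standard Q-learning update (the states, actions, rewards and hyperparameters are made-up toy values):

```python
import random
from collections import defaultdict

Q = defaultdict(float)               # Q[(state, action)] -> estimated return
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.2    # made-up learning rate, discount, exploration rate
ACTIONS = ["request(food)", "confirm(food)", "inform(restaurant)"]

def choose_action(state):
    """Epsilon-greedy behaviour policy (Q-learning is off-policy: it targets the greedy policy)."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def td_update(state, action, reward, next_state):
    """Q-learning: move Q(s,a) towards r + gamma * max_a' Q(s',a') (bootstrapped TD target)."""
    target = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

# one simulated transition: asking for the food slot costs a small per-turn penalty
td_update(state="food_unknown", action="request(food)", reward=-1, next_state="food_known")
print(Q[("food_unknown", "request(food)")])  # -0.1 after one update
```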
## Neural Policies & Natural Language Generation

- How do you involve neural networks in reinforcement learning (describe a Q network or a policy network)?
    - part of the agent is handled by a neural network – value function (typically $Q$) or policy
    - we are assuming huge state space (no more summary space)
    - REINFORCE (policy gradients)
        - works out of the box
        - we maximize performance – value of the initial state
    - deep Q-networks
        - Q-learning, where the $Q$ function is represented by a neural net
        - problems we need to fix
            - SGD is unstable
            - correlated samples (data is sequential)
            - TD updates aim at a moving target (using $Q$ to compute updates to $Q$)
            - numeric instability (scale of rewards and $Q$ values unknown)
        - fixes
            - minibatches (updates by averaged $n$ samples, not just one)
            - experience replay – to break correlated samples (store experience in a buffer, train using minibatches sampled from the buffer)
            - target $Q$ function freezing (so that the target is not moving that often)
            - clipping rewards
- What are the main steps of a traditional NLG pipeline – describe at least 2.
    - entire process: inputs → content plan → sentence plan → text
    - content/text/document planning
        - inputs → content plan
        - content selection according to communication goal
        - basic structuring & ordering
        - typically handled by dialogue manager
    - sentence planning / microplanning
        - content plan → sentence plan
        - organizing content into sentences, merging simple sentences
        - lexical choice, referring expressions (restaurant vs. it)
    - surface realization
        - sentence plan → text
        - linearization according to grammar
        - word order, morphology
    - for NLG in dialogue systems, we need sentence planning and surface realization
- Describe one approach to NLG of your choice.
    - canned text
        - most trivial – completely hand-written prompts, no variation
        - doesn't scale (good for DTMF phone systems)
    - templates
        - “fill in blanks” approach
        - simple, but much more expressive, covers most common domains nicely
        - can scale, but still laborious
        - most production dialogue systems
    - grammars & rules
        - rules: mostly content & sentence planning
        - grammars: mostly older research systems, realization
    - machine learning
        - modern research systems
        - pre-neural attempts often combined with rules/grammar
        - neural nets made it work much better
- Describe how template-based NLG works.
    - we define templates for system DAs (see the sketch at the end of this section)
    - it can be enhanced with rules
        - inflection of the filled-in phrases
        - template coverage/selection rules
- What are some problems you need to deal with in template-based NLG?
    - it lacks generality and variation; it is difficult to maintain, expensive to scale up
    - the texts may sound unnatural
    - it is difficult to express rich information – the templates may be limiting
    - the templates lack context awareness
- Describe a possible neural networks based NLG architecture.
    - our example: neural end-to-end NLG using recurrent neural networks (RNNs)
        - we don't need alignments
    - binary-encoded DA (is intent/slot-value present?)
        - delexicalized: does not use real values – generates templates
    - this approach uses modified LSTM (long short-term memory) cells – input DA is passed in every time step
    - it generates delexicalized templates word-by-word (decoder-only architecture)
    - other approaches: seq2seq, Transformer
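A minimal template-based NLG sketch for the approach described above (the DA types, templates and slot names are made-up examples):

```python
# one template per dialogue act type (several per type could be kept for variation)
TEMPLATES = {
    "inform": "{name} is a nice {food} restaurant in the {area} of town.",
    "request_area": "Which part of town would you like?",
    "confirm_food": "You are looking for {food} food, is that right?",
}

def realize(da_type: str, slots: dict) -> str:
    """Fill slot values into the template for the given dialogue act."""
    template = TEMPLATES[da_type]
    return template.format(**slots)   # rules could additionally handle inflection, articles, ...

print(realize("inform", {"name": "Golden Dragon", "food": "Chinese", "area": "centre"}))
# Golden Dragon is a nice Chinese restaurant in the centre of town.
```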
- How can you use pretrained language models or large language models in NLG?
    - pretrained LMs
        - architectures
            - guess masked word (encoder only: BERT)
            - generate next word (decoder only: GPT-2)
            - fix distorted sentences (both: BART, T5)
        - can be finetuned for our task/domain and for meaning representation (MR), can learn implicit copying
        - lots of them released online, plug-and-play (including multilingual versions)
    - LLMs
        - Transformer decoder models (slightly updated)
        - instruction tuning – finetune on problems & solutions
        - trained using reinforcement learning from human feedback (RLHF)
            - humans are paid to rate different solutions for instructions
            - a rating model is trained based on these ratings → such a model can be used as RL reward for LLM training
        - usage: simple prompting, no need for finetuning
            - just feed in instructions/questions/example → LLM generates solution

## Voice assistants & Question Answering

- What is a smart speaker made of and how does it work?
    - smart speaker = internet-connected mic & speaker with a virtual assistant running
        - optionally display/camera
        - multiple microphones for far-field ASR
    - it listens for a wake word
    - everything is then processed in vendor's cloud service (raw audio is sent to the cloud)
        - follow-up mode – no wake word needed for follow-up questions
        - privacy concerns
    - NLU includes domain detection
    - rules on top of machine learning
- Briefly describe a viable approach to question answering.
    - our example: IR-based QA pipeline
        - IR … information retrieval
    - three steps
        - question processing
            - query formulation
            - answer type detection (what should the answer look like?)
        - passage retrieval
            - get relevant documents from the index (similar to web search) … document retrieval
            - find phrases in the documents that respond to the question
        - answer processing
            - generate a suitable answer to the original question
- What is document retrieval and how is it used in question answering?
    - document retrieval = getting relevant documents (candidates) according to the query by searching in the index
        - can use TF-IDF (or other metrics) for weighting
    - document retrieval works as a coarse filter that filters out irrelevant documents (selects the ones that are relevant to the query and can possibly contain an answer to the question)
- What is dense retrieval (in the context of question answering)?
    - the documents are embedded in a vector space
    - such embeddings can then be compared to query embeddings via cosine similarity (see the sketch at the end of this section)
    - they can be also clustered into Voronoi cells, quantized, …
    - dense retrieval focuses more on semantics than on the specific contained words
- How can you use neural models in answer extraction (for question answering)?
    - passage extraction
        - we feed the question and extracted passage(s) to a Transformer model (e.g. BERT)
        - 2 classifiers: start + end of answer span (softmax over passage tokens)
    - generative QA
        - feed in passage
        - generate reply word-by-word
- How can you use retrieval-augmented generation in question answering?
    - Transformer generative language model (decoder architecture)
        - input: retrieved passage
        - output: full-sentence response
    - not just extraction, but full-sentence answer formulation
    - the model has to be trained to provide a reply (avoid hallucination, avoid copying everything verbatim)
- What is a knowledge graph?
    - large repository of structured, linked information
        - entities … nodes
        - relations … edges
    - entities and relations are typed, the types form a similar graph (ontology)
    - knowledge graphs can be used for question answering
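A minimal dense-retrieval sketch (the document texts are made up and the embeddings are random stand-ins; a real system would use a trained query/passage encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in passages; a real system embeds them with a trained encoder and stores the vectors in an index
doc_texts = ["Prague is the capital of the Czech Republic.",
             "Mozart wrote his opera Don Giovanni for Prague.",
             "The Vltava is the longest river in the Czech Republic."]
doc_vecs = rng.normal(size=(len(doc_texts), 8))   # random stand-in embeddings

def retrieve(query_vec, doc_vecs, k=1):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                     # cosine similarity after normalization
    return np.argsort(-scores)[:k]

query_vec = rng.normal(size=8)         # stand-in for the embedded question
for i in retrieve(query_vec, doc_vecs, k=2):
    print(doc_texts[i])
```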
## Dialogue Tooling

- What is a dialogue flow/tree?
    - graph structure that describes a non-linear dialogue
    - there are conditions to get to the individual nodes of the graph (and fallback strategies if none of the conditions is satisfied)
- What are intents and entities/slots?
    - intents correspond to the actions supported by the dialogue (represent what the user wants to achieve)
    - entities/slots are parameters of the actions (intents) – information needed to fulfill the intents
    - example
        - intent: reserve table
        - slots: date, time, number of guests
- How can you improve a chatbot in production?
    - automatically
        - learning from user selections
        - statistics on user selections → automated pre-selection for next users
    - semi-automatically or manually
        - chat log analysis → model update
    - used measures
        - coverage – is the chatbot confident that it can address the user's request? (per dialogue turn)
        - containment – can the chatbot satisfy a user's request without human intervention? (per conversation)
- What is the containment rate (in the context of using dialogue systems in call centers)?
    - rate at which your chatbot can satisfy a user's request without human intervention, i.e. no hand-over to a human agent is requested (per conversation)
    - it is a measure that can be used to evaluate the chatbot
- What is retrieval-augmented generation?
    - process of optimizing the output of a large language model so that it references an authoritative knowledge base *outside of its training data sources* before generating a response

## Automatic Speech Recognition

- What is a speech activity detector?
    - it is a preprocessing step in ASR
    - to save CPU – run ASR only when there is speech, ignore non-speech sounds
    - approaches
        - handcrafted (now obsolete) – track signal amplitude contours, assumes low noise
        - statistical / neural – binary classifier trained on large corpora, accurate but more CPU-demanding than a handcrafted detector
- Describe the main components of an ASR pipeline system.
    - speech activity detector – detects that someone is speaking, can depend on wake words
    - feature extractor – uses Fourier transform and mel frequency subsampling to extract features from the sound, and normalizes the signal
    - acoustic model – models the probability that a word corresponds to a given audio
    - language model – models probability of words and sentences, uses pronouncing dictionary
    - decoder – combines acoustic and language model
- What do input features for an ASR model look like?
    - mel frequency cepstral coefficients (MFCCs)
        - representation of the sound that is inspired by human perception
        - in older systems
    - mel spectrogram (filterbank)
        - uses mel (logarithmic) scale
        - less processed than MFCCs
    - raw spectrograms
    - raw audio
    - see the feature extraction sketch at the end of this section
- What is the function of the acoustic model in a pipeline ASR system?
    - to estimate $P(\mathrm{audio}\mid \mathrm{text})$
    - it helps to map audio features to phonemes or subwords (using Gaussian mixtures or neural networks)
- What's the function of a decoder/language model in a pipeline ASR system?
    - language model: to estimate $P(\mathrm{text})$
        - what is the probability of certain words/sentences in our language?
    - decoder: it decodes audio features back to text (with the help of an acoustic model)
        - it can use a pronouncing dictionary
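A small feature-extraction sketch for the input features listed above (using librosa is an assumption of this sketch, not something prescribed by the notes; the input here is a synthetic tone instead of real speech):

```python
import numpy as np
import librosa   # assumed library choice for feature extraction

sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)  # 1 s synthetic 440 Hz tone

# mel spectrogram (filterbank features): STFT magnitudes mapped onto the mel scale
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)     # log compression, roughly matching loudness perception

# MFCCs: a further DCT-compressed representation, typical of older pipeline systems
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(log_mel.shape, mfcc.shape)       # (80, frames), (13, frames)
```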
- Describe an (example) architecture of an end-to-end neural ASR system.
    - our example: attention encoder-decoder
        - encoder encodes audio features
        - decoder decodes text character-by-character
        - RNN (LSTM) + attention / Transformer
        - the audio feature sequence is much longer than the output text, so it is typically downsampled in the encoder
    - pros
        - direct audio to letter (no need to model pronunciation explicitly)
        - no need to align phones & audio frames
        - audio & transcript is enough to train
    - cons
        - inaccurate word/character timestamps
        - not low-latency
        - hard to customize

## Text-to-speech Synthesis

- How do humans produce sounds of speech?
    - air flow from lungs → vocal cord vibration → frequency characteristics further moderated by vocal tract
    - vocal cord vibration
        - base frequency (F0)
        - upper harmonic frequencies
    - vocal tract moderation
        - shape of vocal tract changes (tongue, soft palate, lip, jaw positions)
        - some frequencies resonate
        - some are suppressed
- What's the difference between a vowel and a consonant?
    - vowel – sound produced with open vocal tract
        - typically voiced (vocal cords vibrate)
        - quality of vowels depends mainly on vocal tract shape (raised tongue position, jaw/tongue height, shape of lips)
    - consonant – sound produced with (partially) closed vocal tract
        - voiced/voiceless (often come in pairs, e.g. \[p], \[b])
        - quality also depends on type + position of closing
- What is F0 and what are formants?
    - F0 … base vocal cord frequency (voice pitch)
    - formants … upper harmonics of F0 that are amplified by vocal tract resonance
        - distinct for different phonemes
        - F1, F2 – first, second formant
- What is a spectrogram?
    - frequency-time-loudness graph
- What are the main distinguishing characteristics of consonants?
    - do vocal cords vibrate? (voiced × voiceless)
    - type and position of vocal tract closing; vocal tract shape
        - stops/plosives … total closing + “explosive” release (p, d, k)
        - nasals … stops with open nasal cavity (n, m)
        - fricatives … partial closing (f, s, z)
        - approximants … movement towards partial closing and back, half-vowels (w, j)
- What is a phoneme?
    - sound that distinguishes meaning
    - changing it for another would change meaning (**d**og → **f**og)
- What are the main distinguishing characteristics of different vowel phonemes (both how they're produced and perceived)?
    - production – influenced by vocal tract shape
        - raised tongue position – front, central, back
        - jaw/tongue height – open, open-mid, close-mid, close
        - shape of lips – round, non-round
    - perception – depends on which formants are present in the spectrum or not (which are suppressed)
- What are the main approaches to grapheme-to-phoneme conversion in TTS?
    - main approaches: pronouncing dictionaries + rules
        - rules are good for languages with regular orthography (spelling)
            - Czech, German, Dutch
        - dictionaries good for irregular/historical orthography
            - English, French
    - typically it's a combination anyway
        - rules = fallback for out-of-vocabulary items
        - dictionary used for foreign words (overrides rules)
            - can be a pain in a domain with a lot of foreign names
    - pronunciation is sometimes context dependent
        - part-of-speech tagging
        - contextual rules
    - phonemes typically coded using ASCII
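A toy grapheme-to-phoneme sketch combining a small pronunciation dictionary with fallback letter-to-sound rules (both the dictionary entries and the rules are made-up simplifications of real, much larger resources):

```python
# made-up pronunciation dictionary for exceptions / foreign words (overrides the rules)
LEXICON = {
    "one": "w ah n",
    "two": "t uw",
}

# made-up letter-to-sound rules for the regular cases
RULES = {"ch": "ch", "sh": "sh", "a": "ae", "e": "eh", "i": "ih", "o": "aa", "u": "ah",
         "b": "b", "c": "k", "d": "d", "f": "f", "g": "g", "k": "k", "l": "l", "m": "m",
         "n": "n", "p": "p", "r": "r", "s": "s", "t": "t", "v": "v", "w": "w", "z": "z"}

def g2p(word: str) -> str:
    word = word.lower()
    if word in LEXICON:                    # dictionary lookup first
        return LEXICON[word]
    phones, i = [], 0
    while i < len(word):                   # longest-match rules as the fallback
        if word[i:i + 2] in RULES:
            phones.append(RULES[word[i:i + 2]]); i += 2
        elif word[i] in RULES:
            phones.append(RULES[word[i]]); i += 1
        else:
            i += 1                         # silently skip letters with no rule
    return " ".join(phones)

print(g2p("chip"), "|", g2p("one"))   # ch ih p | w ah n
```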
- Describe the main idea of concatenative speech synthesis.
    - cut & paste on recordings
        - but there are too many words/syllables to record them all, and single phonemes are too few (heavy coarticulation at the joins)
        - so we use diphones = second half of one phoneme and first half of another
        - about 1500 diphones in English – manageable (even though we need lots of recordings of a single person)
        - this eliminates the heaviest coarticulation problems (but not all)
    - still artefacts at diphone boundaries
        - smoothing/overlay & F0 adjustments
        - over-smoothing makes the sound robotic
        - pitch adjustments limited – don't sound natural
    - modification: unit-selection concatenative synthesis
        - more instances of each diphone
        - we select units that best match the target position (to minimize adjustments needed)
- Describe the main ideas of statistical parametric speech synthesis.
    - trying to be more flexible, less resource-hungry than unit selection
    - inverse of model-based ASR
    - based on HMMs (hidden Markov models)
    - principle
        - in corpus, we have text and audio
        - for training and prediction, we need:
            - model that can extract linguistic features (phonemes, stress, pitch) from the text
            - vocoder that can both extract acoustic features (spectrum, excitation) from a waveform (audio) and synthesize a waveform from acoustic features
        - to train the statistical acoustic model, we extract both acoustic and linguistic features from the corpus and use the features as training data
        - during prediction, we first extract the linguistic features from the text, then the acoustic model predicts acoustic features, and the vocoder synthesizes them into a waveform
- How can you use neural networks in speech synthesis?
    - we can use feed-forward networks or recurrent neural networks to replace HMMs used in statistical speech synthesis
        - RNNs predict smoother outputs (given temporal dependencies)
        - NNs allow better features (e.g. raw spectrum)
    - examples
        - WaveNet generates waveform directly, it is based on convolutional NNs
        - Tacotron is trained on waveforms and transcriptions (no linguistic features), it is based on seq2seq models with attention

## Chatbots

- What are the three main approaches to building chitchat/non-task-oriented open-domain chatbots?
    - rule-based
        - human-scripted, react to keywords/phrases in user input
        - very time-consuming to make, but still popular
    - data-driven: retrieval
        - gets replies from a corpus
        - “nearest neighbor” approaches
        - corpus can contain past conversations with users
        - chatbots differ in the sophistication of reply selection
    - data-driven: generative
        - seq2seq-based models (typically RNN/Transformer)
        - usually trained on static corpora
        - (theoretically) able to handle unseen inputs, produce original replies
        - basic seq2seq architecture is weak (dull responses) → many extensions
- How does the Turing test work? Does it have any weaknesses?
    - evaluator leads two text-only conversations – with a machine and a human
        - needs to tell which is which
    - the evaluator can be gamed if the conversation is framed well (paranoid schizophrenic, therapist, Ukrainian boy, …)
- What are some techniques rule-based chitchat chatbots use to convince their users that they're human-like?
    - signalling understanding – repeating and reformulating user's phrasing
    - good framing – it's easier to appear human as a therapist (or paranoid schizophrenic, Ukrainian boy, …)
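A tiny ELIZA-style sketch of the “signalling understanding” trick above (the keyword patterns, pronoun reflections and canned replies are illustrative, not a real rule set):

```python
import re

REFLECTIONS = {"i": "you", "my": "your", "am": "are", "me": "you"}

def reflect(text: str) -> str:
    """Swap pronouns so the user's own phrasing can be echoed back."""
    return " ".join(REFLECTIONS.get(w, w) for w in text.lower().split())

def respond(user: str) -> str:
    match = re.search(r"i feel (.*)", user, re.IGNORECASE)
    if match:                                   # keyword rule + reformulated user phrasing
        return f"Why do you feel {reflect(match.group(1))}?"
    if "mother" in user.lower():
        return "Tell me more about your family."
    return "I see. Please go on."               # generic fallback keeps the conversation going

print(respond("I feel ignored by my brother"))  # Why do you feel ignored by your brother?
```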
- Describe how a retrieval-based chitchat chatbot works.
    - it first checks for similar inputs in the corpus (rough retrieval) – see the sketch at the end of these notes
    - then it reranks the best candidates to find the most suitable one
        - this step can use machine learning (problem: we need negative examples to train the classifier)
    - it cannot produce unseen sentences and sometimes replies inconsistently
        - postprocessing and rules can partially fix this
- How can you use neural networks for chatbots (non-task-oriented, open-domain systems)? Does that have any problems?
    - we can use neural networks for reranking
        - training data problem – datasets contain only positive examples, but we also need negative examples
    - NNs can also be used end-to-end
        - we can use a similar approach as in phrase-based machine translation (MT)
            - the task is harder than MT – possible responses are much more variable than possible translations
            - it works, but fluency is not ideal and the context is too limited
        - RNN LMs without LSTM
            - more fluent than phrase-based
            - problems with long replies (less fluent, wander off-topic)
        - encoder-decoder RNN model with LSTM (seq2seq)
            - encode input, decode response
            - generic/dull responses
                - MLE/softmax prefer 1 option → models settle on safe replies and become over-confident
            - limited context
                - encoding long contexts is slow and ineffective
                - contexts are too sparse to learn much
            - inconsistency
                - ask the same question twice, get two different answers
                - no notion of own personality
- Describe a possible architecture of an ensemble non-task-oriented chatbot.
    - rule-based for sensitive/frequent/important questions
    - retrieval for jokes, trivia etc.
    - task-oriented-like (handcrafted / specially trained) systems for specific topics – news, weather, etc.
    - seq2seq as a backup or not at all
- What do you need to train a large language model?
    - trillions of tokens
    - enough compute power
    - well-defined evaluation metrics
- What are some issues you may encounter when chatting to LLMs?
    - it may not be factually accurate
        - it only uses information it memorized
        - hallucinates instead of saying “I don't know”
    - eager to please, easily swayed
    - hard to control
    - over-hyped
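A minimal sketch of the “rough retrieval” step of a retrieval-based chatbot (the corpus of past turns is made up, scikit-learn's TfidfVectorizer is an assumed stand-in for nearest-neighbour matching, and the reranking step is omitted):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# made-up corpus of past (user input, reply) pairs
pairs = [
    ("hello there", "Hi! How are you doing?"),
    ("tell me a joke", "Why did the chicken cross the road? To get to the other side."),
    ("what is your favourite food", "I'm quite fond of electricity, actually."),
]

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(p[0] for p in pairs)   # index the stored user inputs

def reply(user_input: str) -> str:
    """Return the reply attached to the most similar stored input (nearest neighbour)."""
    query = vectorizer.transform([user_input])
    best = cosine_similarity(query, index).argmax()
    return pairs[best][1]

print(reply("could you tell me a joke"))
```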