Gary Geunbae Lee, Dept. of CSE/AIGS, POSTECH
Deep Learning-based Spoken Dialog Systems
Virtual Personal Assistant: Trends
[Figure: timeline of virtual personal assistants, ~2012 to 2016~]
- S/W assistants: Apple, Google Now, MS Cortana, Samsung S-Voice, Facebook M, Samsung Bixby
- Devices: Amazon Dash & Echo, Google Home, IKEA Smart Table
- Robots: Jibo (social robot), Pepper (service robot), Botlr (hotel robot)
- New form factors: IoT, wearables, robots
- Concept precursor: Apple Knowledge Navigator ('87)
AI for Personal Health Advisor
- From Medical Futurist report, “Artificial Intelligence Will Redesign Healthcare”
AI for Shopping Assistant
- From Information Age report, “3 ways artificial intelligence is transforming e-commerce”
AI for 24/7 After-sales Care
- From Wikipedia, “Customer support Automation”
AI for Social Robot
- From Wikipedia, “Social robot”
Traditional Pipeline Architecture
Spoken Language Understanding
• Spoken language understanding (SLU) maps natural language speech to a frame-structure encoding of its meaning.
• What's the difference between NLU and SLU?
► Robustness: noise and ungrammatical spoken language
► Domain dependence: deeper domain-specific semantics (e.g., Person vs. Cast)
► Dialog: dialog-history-dependent, utterance-by-utterance analysis
• Traditional approach: natural language to SQL conversion
[Figure: a typical ATIS system (from [Wang et al., 2005]) — Speech → ASR → Text → SLU → Semantic Frame → SQL Generation → SQL → Database → Response]
Semantic Representation
• Two common components in a semantic frame:
► Dialog acts (DA): the meaning of an utterance at the discourse level; approximately equivalent to the intent or subject slot in practice.
► Named entities (NE): identifiers of entities such as person, location, organization, or time. In SLU, an NE represents the domain-specific meaning of a word (or word group).
• Example (ATIS and EPG domain, simplified representation)
"Show me flights from Denver to New York on Nov. 18th"
DIALOG_ACT = Show_Flight
FROMLOC.CITY_NAME = Denver
TOLOC.CITY_NAME = New York
MONTH_NAME = Nov.
DAY_NUMBER = 18th

"I want to watch LOST"
DIALOG_ACT = Search_Program
PROGRAM = LOST
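In code, such a frame is often just a nested key-value structure; a minimal Python sketch (key and slot names are illustrative, not from a specific toolkit):

```python
# Minimal sketch of the ATIS semantic frame above as a Python dict.
frame = {
    "dialog_act": "Show_Flight",
    "slots": {
        "FROMLOC.CITY_NAME": "Denver",
        "TOLOC.CITY_NAME": "New York",
        "MONTH_NAME": "Nov.",
        "DAY_NUMBER": "18th",
    },
}
```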
Semantic Frame Extraction
[Figure: overall architecture of the semantic analyzer — information source → feature extraction/selection → dialog act identification + frame-slot extraction + relation extraction → unification]

Examples of semantic frame structure:
"How do I get to Lotte World?" (롯데월드에 어떻게 가나요?)
Domain: Navigation / Dialog Act: WH-question / Main Action: Search / Object.Location.Destination = Lotte World
"I really like Lotte World." (난 롯데월드가 너무 좋아.)
Domain: Chat / Dialog Act: Statement / Main Action: Like / Object.Location = Lotte World
• Semantic frame extraction (~ information extraction approach)
1) Dialog act / main action identification ~ classification
2) Frame-slot object extraction ~ named entity recognition
3) Object-attribute attachment ~ relation extraction
► 1) + 2) + 3) ~ unification
Machine Learning for SLU
• Maximum Entropy (a.k.a. logistic regression)
► Conditional and discriminative
► Unstructured (no dependency among the y's)
► Used for the dialog act classification problem
• Conditional Random Fields [Lafferty et al., 2001]
► Structured version of MaxEnt (argmax search at inference)
► Undirected graphical models
► Popular in language and text processing
► Linear-chain structure for practical implementation
► Used for the named entity recognition problem
[Figures: MaxEnt as a single label z over the whole input x; linear-chain CRF as a label chain y_{t-1}, y_t, y_{t+1} over observations x_{t-1}, x_t, x_{t+1}, with feature functions f_k, g_k, h_k]
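As a concrete illustration of the MaxEnt side, here is a hedged sketch of dialog-act classification with scikit-learn's logistic regression (the feature set here is only bag-of-words n-grams; real systems add richer features such as the previous dialog act):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data in the spirit of the ATIS/EPG examples above.
utterances = ["show me flights from denver to new york",
              "i want to watch lost"]
dialog_acts = ["Show_Flight", "Search_Program"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(utterances, dialog_acts)
print(clf.predict(["show flights to new york"]))  # likely ['Show_Flight']
```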
Semantic NER as Sequence Labeling
• Relational learning for language processing
► Left-to-right n-th order Markov model (linear chain or sequence)
► E.g., part-of-speech tagging, noun phrase chunking, information extraction, speech recognition, etc.
► Very large feature space (e.g., state-of-the-art NP chunking uses > 1M features)
► Open problem: how to reduce the training cost (even for a 1st-order Markov model)
• Transformation to BIO representation [Ramshaw and Marcus, 1995]
► Begin of entity, Inside of entity, and Outside
Show  me  flight  from  Denver    to  New       York      on  Nov.     18th
O     O   O       O     F.CITY-B  O   T.CITY-B  T.CITY-I  O   MONTH-B  DAY-B

Entities: F.CITY = Denver, T.CITY = New York, MONTH = Nov., DAY = 18th
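A small sketch of how BIO tags are decoded back into entity spans (the "-B"/"-I" suffix convention follows the example above):

```python
def bio_to_spans(tags):
    """Decode BIO tags into (label, start, end) spans; end is exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last span
        if tag == "O" or tag.endswith("-B"):
            if label is not None:
                spans.append((label, start, i))
                label = None
            if tag.endswith("-B"):
                start, label = i, tag[:-2]
        # "-I" tags simply extend the current span
    return spans

tags = ["O", "O", "O", "O", "F.CITY-B", "O",
        "T.CITY-B", "T.CITY-I", "O", "MONTH-B", "DAY-B"]
print(bio_to_spans(tags))
# [('F.CITY', 4, 5), ('T.CITY', 6, 8), ('MONTH', 9, 10), ('DAY', 10, 11)]
```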
Long-distance Dependency Problem
… fly from denver to chicago on dec. 10th 1999 …  → dec. = DEPART.MONTH
… return from denver to chicago on dec. 10th 1999 …  → dec. = RETURN.MONTH

• Most practical NLP models employ a local feature set
► Local context features (sliding window)
► E.g., for "dec.": current = dec., prev-2 = on, prev-1 = chicago, next+1 = 10th, next+2 = 1999, POS-tag = NN, chunk = PP
► However, the two occurrences of "dec." have exactly the same feature set (yet different labels)
• Non-local features or higher-order dependencies should be considered
Motivation: Joint System
• Joint prediction of DA and NE [Jeong and Lee, 2006]
► DA and NE are mutually dependent
► An integrated DA and NE model encoding their inter-dependency
► GOAL: improve performance on both the DA and NE tasks

[Figure: joint inference — automatic speech recognition feeds both classification (dialog act / intent, over x) and sequence labeling (named entity / frame slot, over x, y, z); a joint model (e.g., TriCRFs) combines them and feeds dialog management]
Triangular-chain CRFs
• Modeling the inter-dependency (y ↔ z)
► Factorizing the potential into edge-transition, NE-observation (g_k), and DA-observation (h_k) features

[Figure: triangular-chain CRF — NE label chain y_{t-1}, y_t, y_{t+1} over observations x_{t-1}, x_t, x_{t+1}, plus a dialog-act variable z over x; f1_k models the y,y dependency and f2_k the y,z dependency]

In general, f_k can be a function of triangulated cliques. However, we assume that the NE state transition is independent of the DA, i.e., the DA operates as an observation feature for identifying NE labels.
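The same joint-prediction idea appears in modern neural SLU as a shared encoder with two output heads. The sketch below is an analogue of TriCRF-style joint modeling, not the original TriCRF; all sizes are placeholders:

```python
import torch
import torch.nn as nn

class JointSLU(nn.Module):
    """Shared BiLSTM encoder; one head per task (dialog act, slot tags)."""
    def __init__(self, vocab, n_acts, n_tags, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.act_head = nn.Linear(2 * dim, n_acts)  # utterance-level (z)
        self.tag_head = nn.Linear(2 * dim, n_tags)  # token-level (y)

    def forward(self, x):                    # x: (batch, seq)
        h, _ = self.enc(self.emb(x))         # (batch, seq, 2*dim)
        return self.act_head(h.mean(dim=1)), self.tag_head(h)

model = JointSLU(vocab=1000, n_acts=5, n_tags=9)
act_logits, tag_logits = model(torch.randint(0, 1000, (2, 12)))
```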
Active Learning
• Certainty-based method
► Predict on the candidate raw data
► Estimate confidence (e.g., probability) and use it for selection

[Figure: active learning loop — a model trained on a small labeled set predicts on raw data and estimates confidence; a filter selects samples, which are human-labeled and added back to the labeled data]
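A minimal sketch of the certainty-based selection step, assuming a scikit-learn-style classifier with predict_proba:

```python
import numpy as np

def select_for_labeling(model, raw_texts, k=10):
    """Return the k least confident samples for human annotation."""
    probs = model.predict_proba(raw_texts)   # (n_samples, n_classes)
    confidence = probs.max(axis=1)           # top-class probability
    return [raw_texts[i] for i in np.argsort(confidence)[:k]]
```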
Semi-supervised Learning
• Augmenting the machine-labeled data
• Augmenting the classification model
► Similar to adaptation (model interpolation); a classifier-dependent method

[Figure: the same select-and-augment loop as in active learning, but confidently machine-labeled samples are added to the labeled data instead of being sent to a human]
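By contrast, a self-training sketch of the "augment the machine-labeled data" idea, keeping only predictions above a confidence threshold (again assuming a scikit-learn-style classifier):

```python
import numpy as np

def self_train(model, X_labeled, y_labeled, X_raw, threshold=0.9):
    model.fit(X_labeled, y_labeled)
    probs = model.predict_proba(X_raw)
    keep = probs.max(axis=1) >= threshold          # confident samples only
    X_aug = list(X_labeled) + [x for x, k in zip(X_raw, keep) if k]
    y_aug = list(y_labeled) + list(np.asarray(model.predict(X_raw))[keep])
    model.fit(X_aug, y_aug)                        # retrain on augmented data
    return model
```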
Traditional Pipeline Architecture
Dialog Management
Dialogue Management
• A system that provides an interface between the user and a computer-based application
• Interacts on a turn-by-turn basis
• Dialogue manager
► Controls the flow of the dialogue
► Main flow:
◦ gathering information from the user
◦ communicating with the external application
◦ communicating information back to the user
► Three types of dialogue system:
◦ finite-state- (or graph-) based
◦ frame-based
◦ agent-based
Dialog System Architecture
• The DARPA Communicator program was designed to support the creation of speech-enabled interfaces that scale gracefully across modalities, from speech-only to interfaces that include graphics, maps, pointing and gesture.
[Figure: DARPA Communicator participating sites — AT&T, CMU, MIT, CU, BBN, Bell Labs, SRI, DARPA]
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu>
    <prompt>Say one of: <enumerate/></prompt>
    <choice next="http://www.example.com/sports.vxml">
      Sports scores
    </choice>
    <choice next="http://www.example.com/weather.vxml">
      Weather information
    </choice>
    <choice next="#login">
      Log in
    </choice>
  </menu>
</vxml>
Browser : Say one of: Sports scores; Weather information; Log in.
User : Sports scores
Frame-based Approach
• Frame-based system
► Asks the user questions to fill slots in a template in order to perform a task (form-filling task)
► Permits the user to respond more flexibly to the system's prompts (as in Example 2)
► Recognizes the main concepts in the user's utterance

Example 1)
• System: What is your destination?
• User: London.
• System: What day do you want to travel?
• User: Friday.

Example 2)
• System: What is your destination?
• User: London, on Friday, around 10 in the morning.
• System: I have the following connection …
Agent-based Approach
• Properties
► Complex communication using unrestricted natural language
► Mixed initiative
► Co-operative problem solving
► Theorem proving, planning, distributed architectures
► Conversational agents
• Example
User: I'm looking for a job in the Calais area. Are there any servers?
System: No, there aren't any employment servers for Calais. However, there is an employment server for Pas-de-Calais and an employment server for Lille. Are you interested in one of these?

The system attempts to provide a more co-operative response that might address the user's needs.
TRIPS Architecture
[Figure: the TRIPS system architecture]
Information State Approach (plan-based)
• A method of specifying a dialogue theory in a way that is straightforward to implement
• Consists of the following five constituents:
► Information components
◦ Including aspects of common context
◦ (e.g., participants, common ground, linguistic and intentional structure, obligations and commitments, beliefs, intentions, user models, etc.)
► Formal representations
◦ How to model the information components
◦ (e.g., as lists, sets, typed feature structures, records, etc.)
Information State Approach
► Dialogue moves
◦ Trigger the update of the information state
◦ Are correlated with externally performed actions
► Update rules
◦ Govern the updating of the information state
► Update strategy
◦ Decides which rules to apply at a given point from the set of applicable ones
Dialog as a Markov Decision Process
[Figure: dialog as an MDP (from [S. Young, 2006]) — the user, with goal $s_u$, produces a user dialog act $a_u$; speech understanding yields a noisy estimate $\tilde{a}_u$; the state estimator combines it with the dialog history $s_d$ into the machine state $\tilde{s}_m = \langle \tilde{s}_u, \tilde{a}_u, s_d \rangle$; the dialog policy $\pi$ maps $\tilde{s}_m$ to a machine dialog act $a_m$, which speech generation renders back to the user; reinforcement learning optimizes the policy against the reward $r(s_m, a_m)$]

The quantity optimized is the discounted return $R = \sum_k \gamma^k r_k$.

[S. Young, 2006]
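A small sketch of the quantities involved: the discounted return from the slide plus one tabular Q-learning update for a dialog policy; states, actions, and rewards are illustrative placeholders, not the machinery of a full statistical dialog system:

```python
from collections import defaultdict

def discounted_return(rewards, gamma=0.95):
    """R = sum_k gamma^k * r_k, computed backwards."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

Q = defaultdict(float)   # Q[(state, action)] -> value

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```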
Deep Learning-based Dialog System
AI vs. ML vs. DL
AI: Artificial Intelligence ML: Machine Learning DL: Deep Learning
[Figure: nested circles — Deep Learning ⊂ Machine Learning ⊂ Artificial Intelligence]

• Artificial Intelligence: reasoning, knowledge representation, planning, learning, perception, manipulation
• Machine Learning: supervised learning (with teachers), semi-supervised learning (with a teacher), unsupervised learning (without a teacher), reinforcement learning (with rewards)
• Deep Learning: input (text, sound, image, video) → output (generated information / autonomous control); e.g., RBM (Restricted Boltzmann Machine), DBN (Deep Belief Network), CNN (Convolutional Neural Network), deep reinforcement learning
Artificial Neural Networks (ANN)

[Figure: a biological neuron (dendrites, axon, terminal branches of the axon) beside an artificial neuron that sums weighted inputs $x_1 w_1 + x_2 w_2 + \cdots + x_n w_n$]

Layered networks: input nodes → hidden nodes → output nodes, joined by weighted connections. Each output is

$$y_j = f\Big(\sum_i w_{ij}\, x_i\Big), \quad \text{i.e., } y = f(w_1 x_1 + w_2 x_2 + w_3 x_3 + \cdots + w_m x_m)$$
Deep Learning Innovation
• Combines feature learning and classification in a unified framework (※ learning what to learn and how to learn)
[Figure: the feature-learning aspect of DNN-based image classification]
Vanilla recurrent neural networks (RNNs)
• RNNs have connections from the outputs of previous time steps to the inputs of the next time steps
• For sequential data, an RNN computes the hidden state $h_t$ from the previous hidden state $h_{t-1}$ and the input $x_t$:
• $h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$
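The recurrence in plain numpy, as a minimal sketch (tanh stands in for $\sigma$; shapes are arbitrary):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """h_t = sigma(W_h h_{t-1} + W_x x_t + b), with sigma = tanh."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

rng = np.random.default_rng(0)
dim_h, dim_x = 4, 3
W_h = 0.1 * rng.standard_normal((dim_h, dim_h))
W_x = 0.1 * rng.standard_normal((dim_h, dim_x))
b, h = np.zeros(dim_h), np.zeros(dim_h)
for x_t in rng.standard_normal((10, dim_x)):   # 10 time steps
    h = rnn_step(h, x_t, W_h, W_x, b)
```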
Vanishing gradient problem
• $h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$
• Let's assume $\sigma$ is the identity function
• Then $\partial h_t / \partial h_{t-1} = W_h$; if all $\left\lVert \partial h_t / \partial h_{t-1} \right\rVert < 1$, then $\partial J_t / \partial h_1 \approx 0$
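A numeric illustration of the claim: with the identity activation, $\partial J_t / \partial h_1$ contains a product of $t-1$ Jacobians $\partial h_t / \partial h_{t-1} = W_h$, and if each has norm below 1 the product decays to zero:

```python
import numpy as np

W_h = 0.9 * np.eye(4)       # every Jacobian has norm 0.9 < 1
J = np.eye(4)
for _ in range(100):        # 100 time steps back
    J = W_h @ J
print(np.linalg.norm(J))    # ~5e-5: the gradient has vanished
```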
Long short-term memory networks (LSTMs)
• LSTMs explicitly keep and update a cell memory $c^{(t)}$ by:
► removing the previous cell content $c^{(t-1)}$, multiplying it by the forget gate $f^{(t)}$
► adding the new cell content $\tilde{c}^{(t)}$, multiplied by the input gate $i^{(t)}$
• LSTMs produce the output $h^{(t)} = o^{(t)} \circ \tanh c^{(t)}$
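One LSTM step written out in numpy as a sketch of the equations above (the four gates are packed into one matrix W for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, W, b):
    z = W @ np.concatenate([h_prev, x]) + b      # all gate pre-activations
    d = len(h_prev)
    f = sigmoid(z[:d])                           # forget gate f(t)
    i = sigmoid(z[d:2*d])                        # input gate i(t)
    o = sigmoid(z[2*d:3*d])                      # output gate o(t)
    c_tilde = np.tanh(z[3*d:])                   # new cell content
    c = f * c_prev + i * c_tilde                 # remove old, add new
    h = o * np.tanh(c)                           # output
    return h, c
```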
Gated recurrent units (GRUs)
• GRUs keep and update $h^{(t)}$ with two gates:
• The update gate $u^{(t)}$ decides how much of the old hidden representation $h^{(t)}$ is removed and how much of the new hidden representation $\tilde{h}^{(t)}$ is added
• The reset gate $r^{(t)}$ decides how much of the old representation $h^{(t)}$ is needed to compute the new representation $\tilde{h}^{(t)}$
• GRUs use fewer gates and have fewer parameters than LSTMs
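For reference, the standard GRU update (Cho et al., 2014) that these bullets describe can be written as:

$$
\begin{aligned}
u^{(t)} &= \sigma\big(W_u x^{(t)} + U_u h^{(t-1)}\big) \\
r^{(t)} &= \sigma\big(W_r x^{(t)} + U_r h^{(t-1)}\big) \\
\tilde{h}^{(t)} &= \tanh\big(W x^{(t)} + U\,(r^{(t)} \circ h^{(t-1)})\big) \\
h^{(t)} &= (1 - u^{(t)}) \circ h^{(t-1)} + u^{(t)} \circ \tilde{h}^{(t)}
\end{aligned}
$$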
Bidirectional Multi-Layer RNNs
Word Vector
• Represent words as vectors
Word Vector
• Distributional semantics: a word's meaning is given by the words that frequently appear close by
• "You shall know a word by the company it keeps"
• Word2vec objective function (skip-gram), shown below
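The skip-gram objective referred to above is, in its standard form (window size $m$, center word $w_t$):

$$
J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\ \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j}\mid w_t),
\qquad
p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V}\exp(u_w^{\top} v_c)}
$$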
Contextual word embedding
• A word’s contextual embedding must consider its context
GloVe vs. a contextual method
• GloVe: "the actor plays a show" → $\epsilon(\textit{plays})$ (one vector per word type)
• Some contextual method: "the actor plays a show" → $\epsilon(\textit{plays} \mid \textit{the actor} \_ \textit{a show})$
ELMo: Embeddings from Language Model
• Multi-layer bidirectional LSTM language model

$\mathbf{h}_{k,0}^{LM} = x_k^{LM}$ (token representation, e.g., GloVe)
$\mathbf{h}_{k,j}^{LM} = \big[\overrightarrow{\mathbf{h}}_{k,j}^{LM};\ \overleftarrow{\mathbf{h}}_{k,j}^{LM}\big]$ (LSTM state at layer $j$)

[Figure: a two-layer bidirectional LSTM language model — layer 0 holds the token representations of $x_{t-1}, x_t, x_{t+1}$; layers 1 and 2 hold forward and backward LSTM states; the top predicts $p_{t-1}, p_t, p_{t+1}$]

$\gamma^{task}$: scale (hyper-parameter); $s_j^{task}$: per-layer weight (learned)
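A sketch of the task-specific combination $\mathrm{ELMo}_k = \gamma^{task}\sum_j s_j^{task}\,\mathbf{h}_{k,j}^{LM}$ in PyTorch (shapes are illustrative):

```python
import torch

def elmo_combine(layer_states, s_logits, gamma):
    """layer_states: (L+1, seq, dim); s_logits: (L+1,); gamma: scalar."""
    s = torch.softmax(s_logits, dim=0)            # learned layer weights
    return gamma * torch.einsum("l,lsd->sd", s, layer_states)

states = torch.randn(3, 7, 1024)                  # token layer + 2 LSTM layers
elmo = elmo_combine(states, torch.zeros(3), torch.tensor(1.0))
```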
ELMo for MRC (Machine Reading Comprehension)
• ELMo as a word embedding
Transformer
• Parallel self-attention
• Looks at self, and determines where to focus
Vaswani, Ashish, et al. "Attention is all you need." NIPS 2017
BERT: Bidirectional Encoder Representations from Transformers
• Training task 1: masked word prediction
► 15% of words are [MASK]ed
• Training task 2: next sentence prediction
► To understand text beyond a single sentence
• BERT as a universal pre-trained model for NLP
• BERT requires only minimal additional layers and fine-tuning

Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019
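A hedged sketch of the "minimal additional layers" point, using the Hugging Face transformers library (assumed available): a single classification head is placed on top of the pre-trained encoder:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)            # one new linear layer

batch = tok(["a fine movie", "a dull movie"],
            return_tensors="pt", padding=True)
out = model(**batch)      # out.logits: (2, 2), ready for fine-tuning
```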
GPT (Generative Pre-Training)
• In pre-training, optimize $L_1(\mathcal{U})$ — $\mathcal{U}$: unlabeled dataset, $\Theta$: model parameters
• In fine-tuning, optimize $L_3(\mathcal{C})$ — $\mathcal{C}$: labeled dataset, $\lambda$: hyper-parameter weight
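For reference, the objectives named above are, in Radford et al. (2018):

$$
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1};\ \Theta),
\qquad
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
$$

where $L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m)$ is the supervised fine-tuning objective.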
Parallel computing for Deep Learning
• History of parallel/distributed systems for deep learning computing:
► Google taps 16K computers to look for cats ("for science!")
► Univ. of Toronto uses 2 GPUs to train on 1.2M images for 1,000-class image classification (※ ImageNet Large Scale Visual Recognition Challenge)
► Stanford uses 12 GPUs for large-scale video classification with convolutional neural networks (※ 10M YouTube videos)
► Google uses 16K CPU cores to train a 22-layer deep neural network (※ GoogLeNet, 2014)
► Baidu's artificial-intelligence supercomputer beats Google at image recognition
Korean pre-trained language models (SA: Sentiment Analysis; SRL: Semantic Role Labeling):

Model | Company | Framework | Parameters (vocab, layers, hidden, heads) | Performance | Training data
KoBERT | SKTBrain | PyTorch | (8002, 12, 3072, 12) | SA 90.1 (Google: 87.5) | Korean Wikipedia/news
KoGPT | SKTBrain | PyTorch | (50000, 12, -, 12) | SA 89.9 (Google: 87.5) | Korean Wikipedia/news + other (unknown)
Korean BERT LM | ETRI | PyTorch/TensorFlow | (30349, 12, 768, 12) | SRL 85.77 (Google: 81.85) | 23 GB raw corpus
HanBERT | TwoBlock AI | PyTorch | (54000, 12, 768, 12) | KorQuAD 17th place | Documents (general, patents)
Multi-domain Task-oriented Dialogue
• The system interacts with the user to help the user achieve his/her goal
► e.g., restaurant reservation, hotel reservation, …
► Specific domains & specific goals (in contrast to open-domain chit-chat)
• Multi-turn
• User goals are not limited to just one domain
► e.g., book a hotel → book a restaurant near the hotel → book a taxi from the hotel to the restaurant
Dataset: MultiWOZ
• Standard benchmark dataset for end-to-end task-oriented dialogue systems
• Multi-domain dataset for task-oriented dialog systems
► Virtual conversations between a tourist and a guide in Cambridge
► 7 domains: hotel, restaurant, attraction, train, taxi, police, hospital
► Number of dialogs: 10,000 / avg. number of turns per dialog: 13
• Evaluation in MultiWOZ
► Inform rate: how many of the entities provided by the system meet the user's goals?
► Success rate: how many user requests are fulfilled by the system?
► BLEU: similarity between the generated response and the reference response in the dataset
A short example in MultiWOZ
User: I am planning a trip to town and want to sight see a bit.
System: There are 79 attractions in the city. Do you have a specific type of attraction?
User: I'd like a architecture.
System: How about All Saints Church? It has fantastic architecture and free entrance.
User: Could I get the address? I would also like an expensive place to eat around it.
System: The address is Jesus Lane. What type of food would you like to eat?
User: I'm in the mood for Turkish food.
System: Anatolia is an expensive Turkish restaurant in the city centre.
User: Can I have the phone number?
System: The phone number is 01223362372.
User: Thank you!
System: You're welcome, feel free to contact us about anything else you need.
DST (Dialogue State Tracking)
• The most important subtask of a task-oriented dialogue system
► A dialogue state generally consists of slot-value pairs
► A slot is a general class like food or area; a value is a specific value of that slot
► In multi-domain settings, the dialogue state consists of domain-slot-value triplets
◦ e.g., (restaurant-pricerange-cheap), (attraction-type-museum), … — see the sketch below
• The dialogue state is inferred at each turn of the dialogue between user and system
► There are ground-truth dialogue states for every turn
► The inferred state is used to generate the next system action and response
• TRADE (TRAnsferable Dialogue statE generator)
► ACL 2019 @ Hong Kong Univ. & Salesforce [https://www.aclweb.org/anthology/P19-1078]
• DS-DST (Dual Strategy for DST)
► State of the art in DST
► 2019 arXiv @ Illinois Univ. & Salesforce [https://arxiv.org/abs/1910.03544]
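The sketch referred to above: a multi-domain dialogue state as a plain mapping of domain-slot pairs to values, updated turn by turn (names are illustrative):

```python
def update_state(state, triplets):
    """state: {(domain, slot): value}; triplets: iterable of (d, s, v)."""
    for domain, slot, value in triplets:
        state[(domain, slot)] = value
    return state

state = {}
update_state(state, [("restaurant", "pricerange", "cheap"),
                     ("attraction", "type", "museum")])
```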
Existing model: SOLOIST
• An auto-regressive model trained as a language model
• Task 1: predict the dialogue state (slot-value pairs)
• Task 2: predict the system response
• Task 3 (auxiliary loss):
► Randomly replace the dialogue state or system response in the input sequence with negative samples
► Then predict whether the input sequence is a negative sample (binary classification)
• Jointly trained with the sum of the 3 losses
DAMD (Domain Aware Multi-Decoder)

[Figure: DAMD architecture — GRU encoders with attention encode the user utterance and the previous-turn belief, action, and response contexts; combined with DB results, three GRU decoders (concat + softmax, with a copy pointer for the action) generate the new belief, the new system action, and the delexicalized system response]
Domain state tracking
• Tracks the flow of the conversation from a domain perspective
• Contains binary values indicating whether each domain is activated in the current conversation
• Changes over turns

User: I would like to get some information about a restaurant and a hotel.
System: Okay, let's start with a hotel. Any preference of type, area, or price range?
→ hotel: ON, restaurant: ON

User: The hotel that I am looking for is called Gonville.
System: Do you want to book the hotel?
…
→ hotel: ON, restaurant: OFF

User: I want an Italian restaurant near the hotel.
System: How about Prezzo? It is in the city centre.
…
→ hotel: OFF, restaurant: ON
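A minimal sketch of such a binary domain state (the domain list and update rule are illustrative):

```python
DOMAINS = ["hotel", "restaurant", "attraction", "train", "taxi"]

def domain_state(active):
    """Mark each domain ON/OFF depending on the current turn."""
    return {d: ("ON" if d in active else "OFF") for d in DOMAINS}

print(domain_state({"hotel", "restaurant"}))   # turn 1: both ON
print(domain_state({"hotel"}))                 # later: restaurant OFF
```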
Architecture w/ domain state tracking
• The NLU module is shared by DST, POL, and NLG
• Darker blocks denote the previous turn
• The DB result contains the number of matched entries for each domain

[Figure: BERT encoder feeding an FC (fully connected) layer and three GRUs; trained with cross-entropy and binary cross-entropy losses]
Training / inference
• Inputs
► user_t, turn domain_{t-1}, dialog state_{t-1}
• Outputs (training)
► turn domain_t, dialog state_t, gate_t, system action_t, system response_t
• Outputs (inference)
► turn domain_t, dialog state_t, system response_t
• Training: the domain, value, gate, action, and response losses (all cross-entropy) are summed and optimized with Adam — see the sketch below
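The sketch referred to above: five cross-entropy terms summed into a single loss and optimized with Adam; the output/target tensors are placeholders for the model's five heads:

```python
import torch
import torch.nn.functional as F

HEADS = ["domain", "value", "gate", "action", "response"]

def total_loss(outputs, targets):
    """Sum of per-head cross-entropy losses."""
    return sum(F.cross_entropy(outputs[h], targets[h]) for h in HEADS)

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = total_loss(outputs, targets); loss.backward(); optimizer.step()
```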
End-to-end ASR
[Figure: end-to-end ASR pipeline — audio inputs → frontend preprocessing (STFT, mel filterbank) → speech representation → Transformer encoder → Transformer decoder (attention) with a CTC branch → beam search → text outputs]

*Connectionist Temporal Classification (CTC); Short-Time Fourier Transform (STFT)
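A sketch of CTC training with PyTorch's built-in loss (shapes follow the torch.nn.CTCLoss convention: log-probabilities are (T, N, C)):

```python
import torch

ctc = torch.nn.CTCLoss(blank=0)
log_probs = torch.randn(50, 2, 30).log_softmax(2)  # T=50 frames, 30 symbols
targets = torch.randint(1, 30, (2, 12))            # 2 label sequences, len 12
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 50),
           target_lengths=torch.full((2,), 12))
```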
Conformer Encoder
• Conformer (encoder) + LSTM decoder — https://arxiv.org/pdf/2005.08100.pdf
• *Conformer: convolution-augmented Transformer for speech recognition
• *SpecAugment: a simple data augmentation method for automatic speech recognition (applied to the log-mel spectrogram): time warping, frequency & time masking
Conformer + wav2vec 2.0 (encoder), LSTM decoder — https://arxiv.org/pdf/2010.10504v1.pdf
• *Pre-training: the contrastive loss between the context vectors from masked features and the quantization units is optimized
CTC-attention decoder
Joint CTC-Attention Decoding
https://arxiv.org/pdf/1609.06773.pdf
• *CTC uses an intermediate label representation allowing repetitions of labels and blank labels; Baum-Welch-style optimization is used for the Viterbi alignment (S: decoder state, A: attention)
• *The shared encoder is trained with both the CTC and attention model objectives simultaneously
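In joint decoding, the two model scores are interpolated; a common form (with interpolation weight $\lambda$) is:

$$
\hat{y} = \arg\max_{y}\ \big\{\ \lambda \log p_{\mathrm{ctc}}(y \mid x) + (1-\lambda)\log p_{\mathrm{att}}(y \mid x)\ \big\}
$$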
LAS (Listen, Attend and Spell)
https://arxiv.org/pdf/1508.01211.pdf
• *Accepts filter bank spectra as inputs
• *Attention-based recurrent network decoder (character decoding)
• *LAS does not make independence assumptions about the label sequence, unlike CTC
Tacotron 2: Seq2seq with Attention RNN + Modified WaveNet
• *Location-sensitive attention: mitigates potential failure modes where some subsequences are repeated or ignored by the decoder
• *Auto-regressive decoder generates the mel spectrogram frame by frame; the previous frame is fed to a pre-net, and a post-net adds a residual correction
• *A stop token ends generation
• *WaveNet: inverts the mel spectrogram feature representation into time-domain waveform samples (a 10-component mixture of logistic distributions (MoL) generates 16-bit samples at 24 kHz)
MelGAN
• Replaces the WaveNet vocoder with MelGAN for faster inference
• Multiple discriminators cover different frequency ranges of the audio
• Loss: GAN objective + feature-matching objective (L1 distance between the discriminator feature maps of real and synthetic audio)
Dialog System for Counseling (TOD Open-ended)
[Figure: counseling skills — paraphrasing, reflecting feelings, correct information, logical answering, listening, memorizing, questioning, showing interest, empathy, motivating, sensibleness, attending]
CBT (Cognitive Behaviour Therapy) Counseling Scenario

[Figure: scenario flow — Initiation (greeting, explanation of the counseling method, extraction of user information) → Situation (5W1H details) → Automatic thoughts → Ground belief (emotion, supporting evidence, non-supporting evidence) → Refutation → Alternative thoughts → Emotion → Motivating (for the next visit); note the contrast between automatic and alternative thoughts]
VHRED: RNN Dialog Model
*VHRED: Variational Hierarchical Recurrent Encoder-Decoder
Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., & Bengio, Y. (2016). A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069.
User : How are things?
Meena : They’re good. How about you?
User : I’m trying to decide what to watch
Meena : What are some of your favorite shows?
User : I like the Good Place
Meena : I haven’t watched that. My favorite shows are Supernatural, and Star trek : TNG.
User : What is TNG?
Meena : The Next Generation
Meena: Transformer Dialog Model

[Figure: Meena architecture — 1 Evolved Transformer encoder block (×1) and 13 Evolved Transformer decoder blocks (×13)]

*Architecture automatically learned with NAS (neural architecture search)
Adiwardana, D., Luong, M. T., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., ... & Le, Q. V. (2020). Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.
Summary
- NLP has enjoyed rapid progress over the last 10 years thanks to deep learning.
- NLP is reaching the point of having big social impact, making issues like bias and security increasingly important.
- Big models, big computational resources, and huge training times are problematic; we need to focus more on lightweight ways of doing NLP (even embedded models).