Transcript of "Toward the Engineering of Improving Semantic Embedding"
Toward the Engineering of Improving Semantic Embedding
Xugang Ye
What is Semantic Embedding?
[Figure: a point $s$ in the semantic space generates an expression $e$ in the language space through a generative process $e \sim p(e \mid s; \Theta)$; a neural network model approximates the relation between the two spaces, producing $e' \approx e$.]
Relevance (by variational Bayes):
$\ln p(e; \Theta) \;\ge\; \mathbb{E}_{q(s \mid e)}\!\left[\ln p(e, s; \Theta)\right] - \mathbb{E}_{q(s \mid e)}\!\left[\ln q(s \mid e)\right]$
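For reference, the bound follows from Jensen's inequality applied to the marginal likelihood (a standard step, not spelled out on the slide; $q(s \mid e)$ denotes the variational posterior):
$\ln p(e; \Theta) = \ln \int q(s \mid e)\,\frac{p(e, s; \Theta)}{q(s \mid e)}\,ds \;\ge\; \int q(s \mid e)\,\ln \frac{p(e, s; \Theta)}{q(s \mid e)}\,ds = \mathbb{E}_{q}\!\left[\ln p(e, s; \Theta)\right] - \mathbb{E}_{q}\!\left[\ln q(s \mid e)\right].$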
Model Architecture
[Figure: a DSSM-style tower for the Context side and one for the Expression side, mapping from the language space to the semantic space. Parsed words/phrases are turned into count-based representations (~30k dims) via word/phrase hashing, then into compressed representations (300), aggregated representations (300), and semantic representations (128), $y^{(C)}$ and $y^{(E)}$; the two sides are compared by the cosine similarity $R(y^{(C)}, y^{(E)})$ and the relevance probability $P(y^{(E)} \mid y^{(C)})$.]
Relevance probability:
$P\big(y^{(E)} \mid y^{(C)}\big) = \dfrac{\exp\big(\gamma R(y^{(C)}, y^{(E)})\big)}{\int \exp\big(\gamma R(y^{(C)}, y^{(\tilde E)})\big)\, d\tilde E}$,
where
$R\big(y^{(C)}, y^{(E)}\big) = \dfrac{y^{(C)} \cdot y^{(E)}}{\|y^{(C)}\|\,\|y^{(E)}\|} = \dfrac{\sum_i y_i^{(C)} y_i^{(E)}}{\sqrt{\sum_i \big(y_i^{(C)}\big)^2}\,\sqrt{\sum_i \big(y_i^{(E)}\big)^2}}$,
$y_i^{(C)} = h\big(z_i^{(1,C)}\big)$, $y_i^{(E)} = h\big(z_i^{(1,E)}\big)$,   ~ level 0
$z_i^{(1,C)} = \sum_j w_{ij}^{(1,C)} x_j^{(1,C)}$, $z_i^{(1,E)} = \sum_j w_{ij}^{(1,E)} x_j^{(1,E)}$;   $W^{(1,C)}$, $W^{(1,E)}$ are fully connected,
$x_i^{(1,C)} = h\big(z_i^{(2,C)}\big)$, $x_i^{(1,E)} = h\big(z_i^{(2,E)}\big)$,   ~ level 1
$z_i^{(2,C)} = \sum_j w_{ij}^{(2,C)} x_j^{(2,C)}$, $z_i^{(2,E)} = \sum_j w_{ij}^{(2,E)} x_j^{(2,E)}$;   $W^{(2,C)}$, $W^{(2,E)}$ are fully connected,
$x_i^{(2,C)} = h\big(z_i^{(3,C)}\big)$, $x_i^{(2,E)} = h\big(z_i^{(3,E)}\big)$,   ~ level 2
$z_i^{(3,C)} = \sum_j w_{ij}^{(3,C)} x_j^{(3,C)}$, $z_i^{(3,E)} = \sum_j w_{ij}^{(3,E)} x_j^{(3,E)}$;   $W^{(3,C)}$, $W^{(3,E)}$ are partially connected,
$x_i^{(3,C)}$, $x_i^{(3,E)}$ are count-based.
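A minimal NumPy sketch of the forward pass described above (not the original implementation: the tanh nonlinearity, the dense hashing matrix W3, and the toy sizes are assumptions; on the slide W3 is partially connected and the input is a ~30k-dim count vector):

import numpy as np

def h(z):
    return np.tanh(z)                      # assumed element-wise nonlinearity

def semantic_vector(x_count, W3, W2, W1):
    """Count-based input -> compressed -> aggregated -> semantic (level 2 -> level 0)."""
    x2 = h(W3 @ x_count)                   # level 2: count-based -> compressed (300)
    x1 = h(W2 @ x2)                        # level 1: compressed -> aggregated (300)
    return h(W1 @ x1)                      # level 0: aggregated -> semantic (128)

def cosine(yc, ye):
    return yc @ ye / (np.linalg.norm(yc) * np.linalg.norm(ye))

def relevance_prob(yc, ye_candidates, pos_idx, gamma=10.0):
    """P(E | C): softmax over candidate expressions of gamma * R(y_C, y_E)."""
    scores = gamma * np.array([cosine(yc, ye) for ye in ye_candidates])
    scores -= scores.max()                 # numerical stability
    p = np.exp(scores) / np.exp(scores).sum()
    return p[pos_idx]

# Toy usage (vocabulary shrunk to 1k for the demo; the slide uses ~30k):
rng = np.random.default_rng(0)
W3, W2, W1 = (rng.normal(0, 0.05, (300, 1000)),
              rng.normal(0, 0.05, (300, 300)),
              rng.normal(0, 0.05, (128, 300)))
y_c = semantic_vector(rng.integers(0, 3, 1000).astype(float), W3, W2, W1)
y_es = [semantic_vector(rng.integers(0, 3, 1000).astype(float), W3, W2, W1) for _ in range(4)]
print(relevance_prob(y_c, y_es, pos_idx=0))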
Learning
• The gradient of the relevance probability ($\Lambda$ denotes the model parameters):
$\nabla_\Lambda \ln P\big(y^{(E)} \mid y^{(C)}\big) = \nabla_\Lambda \ln \dfrac{\exp\big(\gamma R(y^{(C)}, y^{(E)})\big)}{\int \exp\big(\gamma R(y^{(C)}, y^{(\tilde E)})\big)\, d\tilde E}$
$= \gamma\, \nabla_\Lambda R\big(y^{(C)}, y^{(E)}\big) - \nabla_\Lambda \ln \int \exp\big(\gamma R(y^{(C)}, y^{(\tilde E)})\big)\, d\tilde E$
$= \gamma\, \nabla_\Lambda R\big(y^{(C)}, y^{(E)}\big) - \gamma \int P'\big(\tilde E \mid y^{(C)}\big)\, \nabla_\Lambda R\big(y^{(C)}, y^{(\tilde E)}\big)\, d\tilde E$, where $P'\big(\tilde E \mid y^{(C)}\big) = \dfrac{\exp\big(\gamma R(y^{(C)}, y^{(\tilde E)})\big)}{\int \exp\big(\gamma R(y^{(C)}, y^{(E')})\big)\, dE'}$
$= \gamma \int P'\big(\tilde E \mid y^{(C)}\big) \Big[\nabla_\Lambda R\big(y^{(C)}, y^{(E)}\big) - \nabla_\Lambda R\big(y^{(C)}, y^{(\tilde E)}\big)\Big]\, d\tilde E$
$\approx \gamma \sum_{\tilde E \neq E} P'\big(\tilde E \mid y^{(C)}\big) \Big[\nabla_\Lambda R\big(y^{(C)}, y^{(E)}\big) - \nabla_\Lambda R\big(y^{(C)}, y^{(\tilde E)}\big)\Big]$
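As a quick sanity check (my own setup, not from the deck): with respect to the candidate similarities themselves, the derivation above implies that the gradient of $\ln P(E \mid y^{(C)})$ is $\gamma(1 - P(E \mid y^{(C)}))$ at the positive expression and $-\gamma P(\tilde E \mid y^{(C)})$ at each sampled negative, which is exactly the structure of the last line. A finite-difference check in NumPy:

import numpy as np

def log_relevance(R, pos, gamma=10.0):
    s = gamma * R
    return s[pos] - np.log(np.exp(s - s.max()).sum()) - s.max()   # s[pos] - logsumexp(s)

rng = np.random.default_rng(0)
R = rng.uniform(-1, 1, size=5)          # cosine similarities of 5 candidate expressions
pos, gamma, eps = 2, 10.0, 1e-6

p = np.exp(gamma * R) / np.exp(gamma * R).sum()
analytic = -gamma * p
analytic[pos] += gamma                   # gamma * (1 - P(E|C)) at the positive

numeric = np.zeros_like(R)
for k in range(len(R)):
    Rp, Rm = R.copy(), R.copy()
    Rp[k] += eps; Rm[k] -= eps
    numeric[k] = (log_relevance(Rp, pos, gamma) - log_relevance(Rm, pos, gamma)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # tiny, ~1e-8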
Learning
• The gradient of the semantic similarity (via back-propagation):
$\dfrac{\partial R(y^{(C)}, y^{(E)})}{\partial y_i^{(C)}} = \dfrac{1}{\|y^{(C)}\|}\left(\dfrac{y_i^{(E)}}{\|y^{(E)}\|} - R\big(y^{(C)}, y^{(E)}\big)\,\dfrac{y_i^{(C)}}{\|y^{(C)}\|}\right)$,  $\dfrac{\partial R(y^{(C)}, y^{(E)})}{\partial y_i^{(E)}} = \dfrac{1}{\|y^{(E)}\|}\left(\dfrac{y_i^{(C)}}{\|y^{(C)}\|} - R\big(y^{(C)}, y^{(E)}\big)\,\dfrac{y_i^{(E)}}{\|y^{(E)}\|}\right)$,
$\dfrac{\partial R}{\partial w_{kl}^{(1,C)}} = \dfrac{\partial R}{\partial y_k^{(C)}}\, h'\!\big(z_k^{(1,C)}\big)\, x_l^{(1,C)}$,  $\dfrac{\partial R}{\partial w_{kl}^{(1,E)}} = \dfrac{\partial R}{\partial y_k^{(E)}}\, h'\!\big(z_k^{(1,E)}\big)\, x_l^{(1,E)}$,
$\dfrac{\partial R}{\partial x_l^{(1,C)}} = \sum_i \dfrac{\partial R}{\partial y_i^{(C)}}\, h'\!\big(z_i^{(1,C)}\big)\, w_{il}^{(1,C)}$,  $\dfrac{\partial R}{\partial x_l^{(1,E)}} = \sum_i \dfrac{\partial R}{\partial y_i^{(E)}}\, h'\!\big(z_i^{(1,E)}\big)\, w_{il}^{(1,E)}$,
$\dfrac{\partial R}{\partial w_{kl}^{(2,C)}} = \dfrac{\partial R}{\partial x_k^{(1,C)}}\, h'\!\big(z_k^{(2,C)}\big)\, x_l^{(2,C)}$,  $\dfrac{\partial R}{\partial w_{kl}^{(2,E)}} = \dfrac{\partial R}{\partial x_k^{(1,E)}}\, h'\!\big(z_k^{(2,E)}\big)\, x_l^{(2,E)}$,
$\dfrac{\partial R}{\partial x_l^{(2,C)}} = \sum_i \dfrac{\partial R}{\partial x_i^{(1,C)}}\, h'\!\big(z_i^{(2,C)}\big)\, w_{il}^{(2,C)}$,  $\dfrac{\partial R}{\partial x_l^{(2,E)}} = \sum_i \dfrac{\partial R}{\partial x_i^{(1,E)}}\, h'\!\big(z_i^{(2,E)}\big)\, w_{il}^{(2,E)}$,
$\dfrac{\partial R}{\partial w_{kl}^{(3,C)}} = \dfrac{\partial R}{\partial x_k^{(2,C)}}\, h'\!\big(z_k^{(3,C)}\big)\, x_l^{(3,C)}$,  $\dfrac{\partial R}{\partial w_{kl}^{(3,E)}} = \dfrac{\partial R}{\partial x_k^{(2,E)}}\, h'\!\big(z_k^{(3,E)}\big)\, x_l^{(3,E)}$.
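A small numeric check of the cosine-similarity gradient above (again my own sketch; the dimension is an assumption):

import numpy as np

def R(yc, ye):
    return yc @ ye / (np.linalg.norm(yc) * np.linalg.norm(ye))

rng = np.random.default_rng(1)
yc, ye = rng.normal(size=128), rng.normal(size=128)

# analytic: dR/dyc = (1/||yc||) * ( ye/||ye|| - R * yc/||yc|| )
nc, ne = np.linalg.norm(yc), np.linalg.norm(ye)
analytic = (ye / ne - R(yc, ye) * yc / nc) / nc

eps = 1e-6
numeric = np.array([(R(yc + eps * np.eye(128)[i], ye) - R(yc - eps * np.eye(128)[i], ye)) / (2 * eps)
                    for i in range(128)])
print(np.max(np.abs(analytic - numeric)))   # tiny, ~1e-9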
Learning
• The objective function:
Suppose the association between $C^{(i)}$ and $E^{(j)}$ is observed $n_j^{(i)}$ times, and let $n^{(i)} = \sum_j n_j^{(i)}$. Then, approximated by the multinomial distribution, the probability of observing $\{n_j^{(i)}: j \in J^{(i)}\}$ given $\Lambda$, $n^{(i)}$, $J^{(i)}$, where $J^{(i)} = \{j: E^{(j)} \text{ is associated with } C^{(i)}\}$, is
$P\big(\{n_j^{(i)}: j \in J^{(i)}\} \mid \Lambda, n^{(i)}, J^{(i)}\big) \propto \prod_{j \in J^{(i)}} P\big(y^{(E^{(j)})} \mid y^{(C^{(i)})}\big)^{n_j^{(i)}}$.  (1)
Considering all $i \in I$ and assuming independence, the joint probability of observing $\big\{\{n_j^{(i)}: j \in J^{(i)}\}: i \in I\big\}$ is
$P\big(\{\{n_j^{(i)}: j \in J^{(i)}\}: i \in I\} \mid \{\Lambda, n^{(i)}, J^{(i)}: i \in I\}\big) \propto \prod_{i \in I} \prod_{j \in J^{(i)}} P\big(y^{(E^{(j)})} \mid y^{(C^{(i)})}\big)^{n_j^{(i)}}$.  (2)
Hence a loss function can be constructed as
$L = -\sum_{i \in I} \sum_{j \in J^{(i)}} n_j^{(i)} \ln P\big(y^{(E^{(j)})} \mid y^{(C^{(i)})}\big)$.  (3)
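A minimal sketch of loss (3) over observed (context, expression, count) triples; the data layout is hypothetical:

import numpy as np

def loss(triples, log_P):
    """triples: list of (i, j, n_ij); log_P[i][j] = ln P(y_E(j) | y_C(i))."""
    return -sum(n_ij * log_P[i][j] for i, j, n_ij in triples)

# Toy usage: rows of log_P are log relevance probabilities per context.
log_P = np.log(np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.8, 0.1]]))
triples = [(0, 0, 5), (0, 1, 1), (1, 1, 7)]   # (i, j, n_j^(i))
print(loss(triples, log_P))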
Logic behind the data
[Figure: each entity has a complete description in the semantic space, but is only observed through non-complete descriptions {main features, meta features}. A query is answered by an infer + retrieve step, the results are judged, and the feedback is fed back into the system.]
Engineering platform
[Figure: the platform spans an open space and a closed space. NLP knowledge (parser, tagger, tokenizer, ...) and contents feed a search engine (index, ranking models), which powers the services: search, recommendation, suggestion, conversation. Feedbacks are captured by instrumentation and processed by data analytics (statistical models, ...), which in turn update the search engine.]
Closed learning (cold start mining for IR)
[Figure: tables of entities/meanings/concepts, each with main features and meta features, are mined from structured and unstructured contents using NLP models (tokenizing, tagging, parsing, ...). A query is featurized and matched against docs by ranking models to produce retrieved results; featurized (query, doc) pairs serve as signals for auto-complete and search.]
Feedback learning: search & suggestion
[Figure: queries and retrieved results are logged through instrumentation; data analytics (statistical models) aggregate the feedbacks into (query, result, count) triples. Together with the NLP models (tokenizing, tagging, parsing), these featurized pairs act as signals for learning the mapping f(·) from the language space (context, expression) into the semantic space, and the learned relevance is used for re-ranking.]
Feedback learning: recommendation
[Figure: (query, result, count) signals and knowledge feed both rule-based and model-based recommendation; the mapping f(·) from the language space (context, expression) into the semantic space is used to match queries with candidate results.]
Feedback learning: conversation
[Figure: a language input is processed by NLP models (tokenizing, tagging, parsing) into tags, matched by a search engine (index, ranking models) against knowledge to produce retrieved results; language models (RNNs/LSTM) generate the response; the chat log is instrumented to collect feedback.]
The LSTM recurrence used by the language model:
$\hat{c}_t = h\big(U^{(c)} x_t + W^{(c)} h_{t-1}\big)$, $i_t = \sigma\big(U^{(i)} x_t + W^{(i)} h_{t-1}\big)$, $f_t = \sigma\big(U^{(f)} x_t + W^{(f)} h_{t-1}\big)$, $o_t = \sigma\big(U^{(o)} x_t + W^{(o)} h_{t-1}\big)$,
$c_t = c_{t-1} \circ f_t + i_t \circ \hat{c}_t$, $h_t = h(c_t) \circ o_t$.
The training loss is the cross-entropy over the predicted words:
$L = -\sum_{n} \sum_{t} \sum_{k} t_{t,k}^{(n)} \ln p_{t,k}^{(n)}$, where $t_{t,k}^{(n)}$ indicates the observed word $k$ at step $t$ of sample $n$ and $p_{t,k}^{(n)}$ is its predicted probability.
Demos
• Auto suggest
https://youtu.be/iDcKuOPU1q4
https://www.zillow.com:443/abs/AssignTrialBucket.htm?redirect=www.zillow.com&treatment=POI_TYPEAHEAD&trial=SHO_POI_TYPEAHEAD
• Chat bot
https://youtu.be/JJ6tg94LLw8
https://youtu.be/_Iu6uqaA6to
Recurrent DSSM for Sentence Embedding
From $E^{(C)} = (w_1^{(C)}, \dots, w_m^{(C)})$ (the raw word sequence) to $E = (w_1, \dots, w_m)$ (the word-embedding sequence) to $y_m$ (the sentence embedding), this is model-based, word-sequence-level featurization, supervised by maximizing the joint likelihood of the sequences.
[Figure: each word is mapped from a 50k vocabulary to a 300-dim word embedding; an LSTM consumes the word-embedding sequence together with a context embedding, and its last state gives the sentence embedding. The paired sentence $E'^{(C)} = (w_1'^{(C)}, \dots, w_n'^{(C)})$, $E' = (w_1', \dots, w_n')$ is processed the same way, yielding $y_m$ and $y_n'$.]
The LSTM recurrence, with the context embedding $y^{(C)}$ fed into every gate:
$h_t = h(c_t) \circ o_t$, where
$o_t = \sigma\big(U^{(o)} x_t + W^{(o)} h_{t-1} + W^{(o,C)} y^{(C)}\big)$, $c_t = c_{t-1} \circ f_t + i_t \circ g_t$,
$f_t = \sigma\big(U^{(f)} x_t + W^{(f)} h_{t-1} + W^{(f,C)} y^{(C)}\big)$, $g_t = h\big(U^{(g)} x_t + W^{(g)} h_{t-1} + W^{(g,C)} y^{(C)}\big)$, $i_t = \sigma\big(U^{(i)} x_t + W^{(i)} h_{t-1} + W^{(i,C)} y^{(C)}\big)$.
Similarity: $R(y_m, y_n') = \dfrac{y_m \cdot y_n'}{\|y_m\|\,\|y_n'\|}$, for retrieval.
$R(y_m, y_n')$, as the main part of DSSM, measures the sentence-level similarity, trained from additional signals like clicks.
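A compact NumPy sketch of the idea (shapes, parameter names, and initialization are assumptions; the context terms in the gates are omitted here for brevity): run the LSTM over the word-embedding sequence, take the last state as the sentence embedding, and retrieve by cosine similarity.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM step; P holds the gate parameters U*, W*."""
    g = np.tanh(P["Ug"] @ x + P["Wg"] @ h_prev)       # candidate
    i = sigmoid(P["Ui"] @ x + P["Wi"] @ h_prev)       # input gate
    f = sigmoid(P["Uf"] @ x + P["Wf"] @ h_prev)       # forget gate
    o = sigmoid(P["Uo"] @ x + P["Wo"] @ h_prev)       # output gate
    c = c_prev * f + i * g
    h = np.tanh(c) * o
    return h, c

def sentence_embedding(word_vectors, P, dim=128):
    """Last LSTM state over the word-embedding sequence = sentence embedding."""
    h, c = np.zeros(dim), np.zeros(dim)
    for x in word_vectors:                            # word_vectors: list of 300-d embeddings
        h, c = lstm_step(x, h, c, P)
    return h

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy usage; retrieval ranks candidates by cosine(query_embedding, candidate_embedding).
rng = np.random.default_rng(0)
P = {k: rng.normal(0, 0.05, (128, 300 if k[0] == "U" else 128))
     for k in ["Ug", "Ui", "Uf", "Uo", "Wg", "Wi", "Wf", "Wo"]}
query = sentence_embedding([rng.normal(size=300) for _ in range(5)], P)
doc = sentence_embedding([rng.normal(size=300) for _ in range(8)], P)
print(cosine(query, doc))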
A Quick Summary on Building and Evaluating Seq-to-Seq Models
Build model
Suppose the source sequence is $X = (x_1, \dots, x_m)$ and the target sequence is $Y = (y_1, \dots, y_n)$; we want to model the relevance probability $P(Y \mid X)$. By the probability chain rule, we have
$P(Y \mid X) = P(y_1, \dots, y_n \mid X) = \prod_{t=1}^{n} P(y_t \mid y_1, \dots, y_{t-1}, X)$,
which says we need to model $P(y_t \mid y_1, \dots, y_{t-1}, X)$. Suppose $y_1, \dots, y_{t-1}, X$ are encoded into a state $h_{t-1}$; then $P(y_t \mid y_1, \dots, y_{t-1}, X) \approx P(y_t \mid h_{t-1})$. Define $R(h_{t-1}, y_t)$ as the similarity function (e.g., $R(h_{t-1}, y_t) = h_{t-1}^{\top} W y_t$, where $W$ is a projection matrix); then we can further model $P(y_t \mid y_1, \dots, y_{t-1}, X)$ as a softmax:
$P(y_t \mid y_1, \dots, y_{t-1}, X) \approx \dfrac{\exp\big(\gamma R(h_{t-1}, y_t)\big)}{\sum_{y' \in V} \exp\big(\gamma R(h_{t-1}, y')\big)}$.
The model depends on the sequence of state vectors $h_0, \dots, h_{n-1}$. By using an LSTM, the sequence of state vectors has the recurrence relation shown below.
Suppose we have the data points $\{(X^{(i)}, Y^{(i)}): i = 1, \dots, N\}$; then, by assuming the independence of the data points, we have the loss function
$\mathrm{Loss}\big(\{(X^{(i)}, Y^{(i)}): i = 1, \dots, N\}\big) = -\dfrac{1}{N} \sum_{i=1}^{N} \ln P\big(Y^{(i)} \mid X^{(i)}\big)$.
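A short sketch (assumed shapes and names) of how $\ln P(Y \mid X)$ is accumulated from the per-step softmax above, given the decoder states:

import numpy as np

def log_likelihood(H, Y_ids, E, W, gamma=1.0):
    """H: (n, d) states h_0..h_{n-1}; Y_ids: target word ids y_1..y_n;
    E: (|V|, e) word embeddings; W: (d, e) projection matrix."""
    scores = gamma * (H @ W) @ E.T                    # (n, |V|): gamma * R(h_{t-1}, y') for all y'
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    log_p = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return sum(log_p[t, Y_ids[t]] for t in range(len(Y_ids)))

# Toy usage; the loss is then -1/N times the sum of log_likelihood over the data points.
rng = np.random.default_rng(0)
H, W, E = rng.normal(size=(4, 8)), rng.normal(size=(8, 16)), rng.normal(size=(100, 16))
print(log_likelihood(H, [5, 17, 2, 99], E, W))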
[Figure: an LSTM chain generating the decoder states $h_1, \dots, h_{n-1}$ from the input sequence, with the initial state obtained from the source $X$.]
$h_t = h(c_t) \circ o_t$, where $o_t = \sigma\big(U^{(o)} x_t + W^{(o)} h_{t-1}\big)$, $c_t = c_{t-1} \circ f_t + i_t \circ g_t$, $f_t = \sigma\big(U^{(f)} x_t + W^{(f)} h_{t-1}\big)$, $g_t = h\big(U^{(g)} x_t + W^{(g)} h_{t-1}\big)$, $i_t = \sigma\big(U^{(i)} x_t + W^{(i)} h_{t-1}\big)$.
Evaluate model
1) Perplexity
Suppose we have reference pairs $\{(X'^{(r)}, Y'^{(r)}): r = 1, \dots, M\}$; then we can define the perplexity:
$\mathrm{Perplexity} = 2^{-\frac{1}{M} \sum_{r=1}^{M} \frac{1}{|Y'^{(r)}|} \log_2 P(Y'^{(r)} \mid X'^{(r)})}$, where $|Y'^{(r)}|$ is the number of words in $Y'^{(r)}$.
Pros: the perplexity evaluates the model without generating a target sequence for each source sequence, so it naturally sidesteps the multiple-reference problem.
Cons: it does not consider the actually generated output.
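A small sketch of the perplexity computation, assuming the model exposes per-word $\log_2$ probabilities for each reference pair:

import numpy as np

def perplexity(log2_probs_per_ref):
    """log2_probs_per_ref: list of arrays, one per reference Y'^(r), holding
    log2 P(y_t | y_1..y_{t-1}, X'^(r)) for each word of Y'^(r)."""
    per_ref = [np.sum(lp) / len(lp) for lp in log2_probs_per_ref]   # (1/|Y'|) log2 P(Y'|X')
    return 2.0 ** (-np.mean(per_ref))

# e.g. two references with per-word log2-probabilities:
print(perplexity([np.log2([0.5, 0.25, 0.5]), np.log2([0.125, 0.5])]))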
2) BLEU
BLEU is a statistical metric that compares the generated target sequence for each source with its reference target sequence.
2.1) Modified precision:
$p_k\big(Y^{(k)}, Y'^{(k)}\big) = \dfrac{|Y^{(k)} \cap Y'^{(k)}|}{|Y^{(k)}|}$, which is the percentage of $Y^{(k)}$ that appears in $Y'^{(k)}$.
$\mathrm{Pre}\big(\{(Y^{(k)}, Y'^{(k)}): k = 1, \dots, M\}\big) = \Big(\prod_{k=1}^{M} p_k\big(Y^{(k)}, Y'^{(k)}\big)\Big)^{1/M}$, which is the geometric mean.
2.2) Brevity penalty:
This heuristically captures the idea of recall: the longer the generated text, the more likely it is to contain the reference components.
$\mathrm{BP} = \begin{cases} 1, & \text{if } c > r \\ \exp\!\big(1 - \frac{r}{c}\big), & \text{otherwise} \end{cases}$
where $c = \sum_{k=1}^{M} |Y^{(k)}|$ and $r = \sum_{k=1}^{M} |Y'^{(k)}|$.
Putting these together yields:
$\mathrm{BLEU} = \mathrm{BP} \cdot \mathrm{Pre}$.
Pros: it is very intuitive and easy to use.
Cons: it is bad for comparing very different systems.
Xugang, Nov. 2018
On Feeding Context into the RNN/LSTM Unit of RLM?
In a recurrent language model (RLM):
Conditional sequence probability: $P(Y \mid C) = \prod_{t} P(y_t \mid y_1, \dots, y_{t-1}, C)$,
Conditional word probability: $P(y_t \mid y_1, \dots, y_{t-1}, C) = \dfrac{\exp\big(\gamma R(s_{t-1}, y_t)\big)}{\sum_{y' \in V} \exp\big(\gamma R(s_{t-1}, y')\big)}$, with $R(s_{t-1}, y_t) = s_{t-1}^{\top} W y_t$.
In the simplest RNN:
State transition mechanism: $s_t = h\big(U x_t + W s_{t-1} + W^{(c)} v_{\mathrm{context}}\big)$.
In a typical LSTM:
State transition mechanism: $s_t = h(c_t) \circ o_t$,
where
$o_t = \sigma\big(U^{(o)} x_t + W^{(o)} s_{t-1} + W^{(o,c)} v_{\mathrm{context}}\big)$,
$c_t = c_{t-1} \circ f_t + i_t \circ g_t$,
$f_t = \sigma\big(U^{(f)} x_t + W^{(f)} s_{t-1} + W^{(f,c)} v_{\mathrm{context}}\big)$,
$g_t = h\big(U^{(g)} x_t + W^{(g)} s_{t-1} + W^{(g,c)} v_{\mathrm{context}}\big)$,
$i_t = \sigma\big(U^{(i)} x_t + W^{(i)} s_{t-1} + W^{(i,c)} v_{\mathrm{context}}\big)$.
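A one-step sketch of the context-fed LSTM above (parameter names and sizes are assumptions); the only change from a plain LSTM step is the extra $W^{(\cdot,c)} v_{\mathrm{context}}$ term in each gate:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_with_context(x, s_prev, c_prev, v_ctx, P):
    o = sigmoid(P["Uo"] @ x + P["Wo"] @ s_prev + P["Woc"] @ v_ctx)   # output gate
    f = sigmoid(P["Uf"] @ x + P["Wf"] @ s_prev + P["Wfc"] @ v_ctx)   # forget gate
    g = np.tanh(P["Ug"] @ x + P["Wg"] @ s_prev + P["Wgc"] @ v_ctx)   # candidate
    i = sigmoid(P["Ui"] @ x + P["Wi"] @ s_prev + P["Wic"] @ v_ctx)   # input gate
    c = c_prev * f + i * g
    s = np.tanh(c) * o
    return s, c

# Toy usage with assumed dimensions (state d, input e, context k):
rng = np.random.default_rng(0)
d, e, k = 64, 100, 32
P = {name: rng.normal(0, 0.05, (d, e if name[0] == "U" else (k if name.endswith("c") else d)))
     for name in ["Uo", "Wo", "Woc", "Uf", "Wf", "Wfc", "Ug", "Wg", "Wgc", "Ui", "Wi", "Wic"]}
s, c = lstm_step_with_context(rng.normal(size=e), np.zeros(d), np.zeros(d), rng.normal(size=k), P)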
[Figure: the LSTM unit, with the context vector $v_{\mathrm{context}}$ fed into each gate alongside $x_t$ and $s_{t-1}$.]
Xugang, Nov. 2018
On Making a Personally Consistent Response
A good example is the SPEAKER model in (Li et al 2017). This model basically treats the embedding vector of personal info as context. As the following figure shows, $v^{(i)}$ is the embedding vector of person $i$. The intuition is that when generating $y_t$ (from $s_{t-1}$), $v^{(i)}$ also plays a role, reflecting the observation that person $i$ and $y_t$ have significant co-occurrence.
$P(y_t \mid y_1, \dots, y_{t-1}, v^{(i)}, X) = \dfrac{\exp\big(\gamma R(s_{t-1}, y_t)\big)}{\sum_{y' \in V} \exp\big(\gamma R(s_{t-1}, y')\big)}$,
$P(Y \mid X, v^{(i)}) = P(y_1, \dots, y_n \mid X, v^{(i)}) = \prod_{t=1}^{n} P(y_t \mid y_1, \dots, y_{t-1}, v^{(i)}, X)$.
A natural extension is the SPEAKER-ADDRESSEE model, in which $v^{(i)}$ is replaced by
$V^{(i,j)} = h\big(W_1 v^{(i)} + W_2 v^{(j)}\big)$,
where $i$ is the speaker and $j$ is the addressee.
Note that even if the two speakers at test time were never involved in the same conversation in the training data, two speakers whose embeddings are respectively close to theirs may have been, and this can help model how one speaker should respond to the other.
[Figure: the decoder LSTM, with the speaker embedding $v^{(i)}$, drawn from a personalized dictionary of speaker embeddings, fed in at each step alongside the word inputs.]
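A tiny sketch (hypothetical shapes and names) of the SPEAKER-ADDRESSEE combination, producing the interaction embedding that replaces the single speaker embedding:

import numpy as np

rng = np.random.default_rng(0)
d = 64                                    # speaker-embedding dimension (assumed)
W1, W2 = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))
speaker_emb = rng.normal(size=(100, d))   # one row per known speaker (hypothetical table)

def interaction_embedding(i, j):
    """V(i,j) = h(W1 v_i + W2 v_j), speaker i addressing addressee j."""
    return np.tanh(W1 @ speaker_emb[i] + W2 @ speaker_emb[j])

v_ij = interaction_embedding(3, 17)       # would be fed to the decoder as the context vector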
Xugang, Nov. 2018
Deriving Maximum Mutual Information Objective Function for Recurrent Language Model
Consider a response sentence $Y = (y_1, \dots, y_n)$, with message $X$ and context $C$.
In a recurrent language model (RLM),
$P(Y \mid X, C) = P(y_1, \dots, y_n \mid X, C) = \prod_{t=1}^{n} P(y_t \mid y_1, \dots, y_{t-1}, X, C)$.
Suppose we use an LSTM to model $P(y_t \mid y_1, \dots, y_{t-1}, X, C)$; then
$P(y_t \mid y_1, \dots, y_{t-1}, X, C) = \dfrac{\exp\big(\gamma R(s_{t-1}, y_t)\big)}{\sum_{y' \in V} \exp\big(\gamma R(s_{t-1}, y')\big)}$,
where $R$ is a similarity function. For example, $R(s_{t-1}, y_t) = (W s_{t-1})^{\top} y_t = s_{t-1}^{\top} W^{\top} y_t$, where $W$ is the projection matrix that handles the dimension mismatch between $s_{t-1}$ and $y_t$. Putting these together yields
$P(Y \mid X, C) = \prod_{t=1}^{n} \dfrac{\exp\big(\gamma R(s_{t-1}, y_t)\big)}{\sum_{y' \in V} \exp\big(\gamma R(s_{t-1}, y')\big)}$, which is the target probability in the usual seq-to-seq objective function.
Now, let's consider $P(Y \mid C)$ and still use an LSTM; then
$P(Y \mid C) = \prod_{t=1}^{n} \dfrac{\exp\big(\gamma R(u_{t-1}, y_t)\big)}{\sum_{y' \in V} \exp\big(\gamma R(u_{t-1}, y')\big)}$.
We have now parametrized $\log \dfrac{P(Y \mid X, C)}{P(Y \mid C)}$, which is the target mutual-information score in the MMI objective function.
*Note: it has been shown (in Li et al 2016) that maximum mutual information (MMI) models produce more diverse, interesting, and appropriate responses in dialogues.
[Figure: two LSTM chains over the same response $Y$: one conditioned on the message $X$ and the context $C$ (states $s_t$), and one conditioned on the context $C$ only (states $u_t$).]
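A minimal sketch (hypothetical interface) of the MMI score: with per-token probabilities from the two models parametrized above, a candidate response is scored by $\log P(Y \mid X, C) - \log P(Y \mid C)$, e.g. for re-ranking responses:

import numpy as np

def mmi_score(p_tokens_full, p_tokens_ctx_only):
    """Each argument: per-token probabilities P(y_t | ...) for the same response Y,
    from the (X, C)-conditioned model and the C-only model respectively."""
    return np.sum(np.log(p_tokens_full)) - np.sum(np.log(p_tokens_ctx_only))

# e.g. a 3-token response scored by both models:
print(mmi_score([0.20, 0.10, 0.30], [0.05, 0.04, 0.10]))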