Hierarchical Text Generation and Planning for Strategic Dialogue using RL
Moving up the value chain, CellStrat would like to encourage discussions and webinars focusing on the application of AI to real-life problem solving. A beginning has already been made, and this is another step in that direction. The use of deep learning and reinforcement learning to solve a complex strategic negotiation is a good example, showcasing how RL can optimize a decision-making process.
The topic is divided into two parts. Part I introduces the hierarchical text generation process; in Part II we will take up a case study.
The word-by-word approach to text generation has been successful in many tasks. However, it has limitations in under-constrained generation settings, such as dialogue response or summarization, where models have significant freedom in the semantics of the text to generate. Such models tend to produce overly generic responses that may be valid but not necessarily accurate. Further, they are hard to interpret and at times intellectually dissatisfying, because they do not clearly distinguish between the semantics of language and its surface realization. Entangling form and meaning is problematic for reinforcement learning, where errors back-propagated from semantic decisions can adversely affect the linguistic quality of the text (Lewis et al., 2017), and for candidate generation in long-term planning, as linguistically diverse text may lack semantic diversity. Here we concentrate on negotiation dialogues, and more specifically on strategic negotiation.
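For contrast, here is a minimal sketch of plain word-by-word (autoregressive) decoding; note that every sampling step decides meaning and wording jointly. The `model.next_token_probs` interface is a hypothetical placeholder, not an API from the paper.

```python
import random

def decode_word_by_word(model, context, max_len=20, eos="<eos>"):
    """Generate a reply one token at a time; semantics and surface form
    are chosen together at every step (hypothetical model interface)."""
    tokens = []
    for _ in range(max_len):
        # Hypothetical call: returns a dict mapping next word -> probability.
        probs = model.next_token_probs(context + tokens)
        word = random.choices(list(probs), weights=list(probs.values()))[0]
        if word == eos:
            break
        tokens.append(word)
    return tokens
```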
The approach is a method for learning discrete latent representations of sentences (z_t) based on their effect on the continuation of the dialogue. It consists of:
- Decoupling the semantics of a dialogue utterance from its linguistic realization.
- Using the latent sentence representations (z_t) for hierarchical language generation, planning, and reinforcement learning.
- Improving the model's ability to plan ahead by sampling z_t to create a set of semantically diverse candidate messages, then using rollouts to estimate the expected reward of each (see the sketch after this list).
- Applying reinforcement learning to optimize the end-task reward.
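A hedged sketch of that planning loop follows. The callables `sample_latent` (draws a plan z_t), `realize` (generates text conditioned on the plan), and `simulate_reward` (rolls out the rest of the dialogue and scores it) are illustrative stand-ins supplied by the caller, not the paper's actual interfaces.

```python
def plan_next_message(history, sample_latent, realize, simulate_reward,
                      n_candidates=10, n_rollouts=5):
    """Return the candidate message with the highest estimated end-task reward.

    sample_latent(history)   -> a latent short-term plan z_t
    realize(history, z)      -> a message realizing that plan
    simulate_reward(history) -> reward from one simulated continuation
    """
    best_msg, best_score = None, float("-inf")
    for _ in range(n_candidates):
        z = sample_latent(history)   # a semantically distinct short-term plan
        msg = realize(history, z)    # surface realization of that plan
        # Estimate expected reward by averaging simulated continuations.
        score = sum(simulate_reward(history + [msg])
                    for _ in range(n_rollouts)) / n_rollouts
        if score > best_score:
            best_msg, best_score = msg, score
    return best_msg
```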
Advantages of this approach:
- It increases the end-task reward achieved by the model.
- It improves the effectiveness of long-term planning using rollouts.
- It allows self-play reinforcement learning to improve decision making without diverging from human language.
The text generated by the model has consequences that can be measured with reference to the human response. This motivates a hierarchical, or phased, generation approach for a strategic dialogue agent:
- In the first phase, the agent samples a short-term plan in the form of a latent sentence representation.
- In the second phase, the agent conditions on this plan during generation, allowing precise and consistent generation of text toward a short-term goal.
- In doing so, we aim to disentangle the concepts of "what to say" and "how to say it" (see the sketch below).
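A minimal sketch of the two phases. `plan_model` ("what to say") and `language_model` ("how to say it") are hypothetical components; together they factor generation as p(x_t | x_{0:t−1}) = Σ_z p(z_t = z | x_{0:t−1}) · p(x_t | z, x_{0:t−1}).

```python
def generate_response(plan_model, language_model, history):
    z = plan_model.sample(history)              # phase 1: choose a short-term plan z_t
    return language_model.generate(history, z)  # phase 2: realize the plan as text
```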
(Figures: a negotiation dialogue sequence, and the separation of the strategic and NLG aspects.)
Hierarchical generation of dialogue responses
A simple approach would be to infer the latent variable z_t to maximize the likelihood of a message x_t given the previous messages x_{0:t−1} ≡ (x_0, …, x_{t−1}), which has the effect of clustering similar message strings. Instead:
- The latent variable z_t is optimized to maximize the likelihood of the messages and actions in the continuation of the dialogue, not the message x_t itself.
- z_t therefore learns to represent x_t's effect on the dialogue, not the words of x_t.
- The distinction is important because messages with similar words can have very different semantics; conversely, the same meaning can be conveyed with different sentences.
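One way to express this training signal in code, as a simplified sketch rather than the paper's exact objective: `z_probs` is a hypothetical distribution p(z_t | x_{0:t−1}) over discrete latent values, and `continuation_ll` a hypothetical log-likelihood of the future messages and final action given z_t. Marginalizing over z_t means z_t is credited for predicting what follows, not for reproducing x_t.

```python
import math

def latent_nll(z_probs, history, future_messages, action, continuation_ll):
    """-log sum_z p(z | history) * p(future messages, action | z, history).
    Note that the current message x_t does not appear in the objective."""
    log_terms = [
        math.log(p) + continuation_ll(z, history, future_messages, action)
        for z, p in z_probs.items()
    ]
    m = max(log_terms)  # log-sum-exp for numerical stability
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))
```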
Results show, empirically and through human evaluation, that this method leads to:
- better perplexity and higher end-task reward;
- representations that, qualitatively, group sentences that are semantically coherent but linguistically diverse;
- improved strategic decision making by the dialogue agent.
The agents X and Y are initially given a space A of possible agreements:
- Value functions v_X and v_Y specify a non-negative reward for each agreement a ∈ A.
- Agents cannot directly observe each other's value functions and can only infer them through dialogue.
- The agents sequentially exchange turns of natural language x_t, consisting of n_t + 1 words x_t^{0:n_t} = (x_t^0, …, x_t^{n_t}), until one agent enters a special turn that ends the dialogue.
- Both agents then independently enter agreements a_X, a_Y ∈ A.
- If the agreements are compatible, each agent receives a reward determined by its own agreement and value function.
- If the agreements are incompatible, neither agent receives any reward.
- Training dialogues, from an agent's perspective, consist of the agreement space A, a value function v, messages x_{0:T}, and an agreement a. A toy encoding of this setup is sketched below.
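The following is illustrative only; the names and the item-division compatibility rule are assumptions, not the paper's code.

```python
from typing import Callable, Dict, Tuple

Agreement = Tuple[int, ...]  # e.g. how many of each item type an agent takes

def payoff(a_x: Agreement, a_y: Agreement,
           v_x: Dict[Agreement, float], v_y: Dict[Agreement, float],
           compatible: Callable[[Agreement, Agreement], bool]) -> Tuple[float, float]:
    """Rewards (r_X, r_Y) for the independently entered agreements."""
    if not compatible(a_x, a_y):
        return 0.0, 0.0           # incompatible agreements: no reward for either
    return v_x[a_x], v_y[a_y]     # each agent scores its own agreement

def divides(total: Agreement) -> Callable[[Agreement, Agreement], bool]:
    """Example rule for an item-division game: the two claims must exactly
    partition the available items."""
    return lambda a_x, a_y: all(x + y == t for x, y, t in zip(a_x, a_y, total))

# Example: 3 books and 2 hats; X takes (2, 1) and Y takes (1, 1).
r_x, r_y = payoff((2, 1), (1, 1), {(2, 1): 5.0}, {(1, 1): 4.0}, divides((3, 2)))
```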
In summary, this is a hierarchical generation approach for a strategic dialogue agent: the agent first samples a short-term plan in the form of a latent sentence representation, and then generates a message conditioned on that plan.