On constructing J(θ) in reinforcement learning

凯撒•特雷西  

2017-11-15

Is anyone working on policy-based methods? I already have the reward and the output probability distribution. How should the loss be computed?
Reinforcement Learning for Machine Comprehension
One way to tackle this problem is to directly optimize the F1 score with reinforcement learning. The F1 score measures the overlap between the predicted answer and the ground-truth answer, serving as a "soft" metric compared to the "hard" EM. Taking the F1 score as the reward, we use the REINFORCE algorithm (Williams 1992) to maximize the model's expected reward. For each sampled answer Â, we define the loss as:

J_RL(θ) = −E_{Â ∼ p_θ(A|C,Q)} [ R(Â, A*) ]    (10)

where p_θ is the policy to be learned, and R(Â, A*) is the reward function for a sampled answer, computed as the F1 score with the ground-truth answer A*. Â is obtained by sampling from the predicted probability distribution p_θ(A|C, Q).
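
Since the question is how to turn a reward and an output probability distribution into a loss, here is a minimal PyTorch sketch of the per-sample REINFORCE surrogate loss in Eq. (10), assuming the answer is a (start, end) span drawn from two softmax distributions; the function names (reinforce_loss, f1_fn) and the span parameterization are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def reinforce_loss(start_logits, end_logits, gold_span, f1_fn):
    # start_logits, end_logits: unnormalized scores over context positions, shape (seq_len,)
    # gold_span: (start, end) indices of the ground-truth answer A*
    # f1_fn: returns the F1 overlap R(Â, A*) between a sampled span and the gold span
    start_probs = F.softmax(start_logits, dim=-1)
    end_probs = F.softmax(end_logits, dim=-1)

    # sample an answer span Â from the predicted distribution p_θ(A | C, Q)
    start = torch.multinomial(start_probs, num_samples=1)
    end = torch.multinomial(end_probs, num_samples=1)

    # reward R(Â, A*): a plain number, treated as a constant w.r.t. θ
    reward = f1_fn((start.item(), end.item()), gold_span)

    # surrogate loss -R(Â) * log p_θ(Â); backpropagating through it yields
    # the REINFORCE gradient -R(Â) * ∇log p_θ(Â)
    log_prob = torch.log(start_probs[start]) + torch.log(end_probs[end])
    return -reward * log_prob.squeeze()
```

In practice one sample per step is typical, and a baseline (for example, the reward of the greedy answer) is often subtracted from R(Â) before multiplying by the log-probability to reduce the variance of the gradient estimate.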

 
