https://zhuanlan.zhihu.com/p/32727209
This Zhihu column is very well written; it is worth browsing after finishing the course, and you will get a lot out of it.
1. It is interesting to see that, under the umbrella of policy gradients, the REINFORCE method, a Monte Carlo (MC) approach, comes first, and the value-function-fitting (actor-critic) approach comes later. It is very easy to get confused when trying to understand PG. IMHO, the key points are:
1.1 G(s,a): unbiased vs. biased estimation
1.2 The variance-reduction idea
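A minimal numerical sketch of the variance-reduction idea in 1.2: subtracting a state-independent baseline b from the return G leaves the policy-gradient estimate unbiased (because the score function has zero mean) but shrinks its variance. The distributions and numbers below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy model: samples of the score ∇log π(a|s), which has zero mean for
# any policy, and noisy Monte Carlo returns G (hypothetical numbers).
score = rng.normal(0.0, 1.0, size=n)
G = rng.normal(5.0, 2.0, size=n)

def grad_estimates(baseline):
    # Policy-gradient samples: ∇log π(a|s) * (G - b)
    return score * (G - baseline)

var_no_baseline = grad_estimates(0.0).var()
var_with_baseline = grad_estimates(G.mean()).var()

# Same mean (no bias added), but much smaller variance with the baseline.
print(var_with_baseline < var_no_baseline)
```

With b = 0 the per-sample variance is roughly E[s^2]E[G^2] = 29, while with b = E[G] it drops to roughly E[s^2]Var[G] = 4, which is why baselines are used at all.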
2. Under a given policy, the exact Q(s_t, a_t) is the immediate reward plus the (discounted) expectation of the value function at the next state: Q(s_t, a_t) = r(s_t, a_t) + γ E[V(s_{t+1})]. As an unbiased estimate of that expectation, the V of a single sampled next state is combined with the reward: r_t + γ V(s_{t+1}).
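The one-step bootstrapped estimate in point 2 can be sketched as follows; the reward, critic value, and discount factor are hypothetical numbers for illustration.

```python
def one_step_q_estimate(reward, v_next, gamma=0.99):
    """Bootstrapped estimate Q(s_t, a_t) ≈ r_t + γ * V(s_{t+1}).

    A single sampled next state replaces the expectation over s_{t+1},
    so this is an unbiased sample of the target only if V is the true
    value function; with a learned (fitted) V it is biased.
    """
    return reward + gamma * v_next

# Hypothetical numbers: reward 1.0, critic's V(s_{t+1}) = 10.0
print(one_step_q_estimate(1.0, 10.0))  # 1.0 + 0.99 * 10.0 = 10.9
```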
3. The tradeoff between the actor-critic (AC) based estimator and the MC based estimator is a bias-variance tradeoff:
For AC: lower variance, but higher bias whenever the fitted value function is wrong (it always is).
For MC (full-return advantage): no bias, but higher variance.
4. Generalized advantage estimation (GAE) is a good framework that gives a consistent way to handle this tradeoff, interpolating between the two extremes with a single parameter λ.
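The GAE interpolation in point 4 can be sketched as follows. It exponentially averages TD errors δ_t = r_t + γV(s_{t+1}) - V(s_t); λ=0 recovers the one-step actor-critic estimate (low variance, biased by the critic), λ=1 recovers the Monte Carlo advantage (unbiased, high variance). The rollout numbers are made up for illustration.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE: A_t = sum_{l>=0} (gamma*lam)^l * delta_{t+l},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has one extra entry, V(s_T), to bootstrap the last step.
    Computed with a backward recursion: A_t = delta_t + gamma*lam*A_{t+1}.
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 1.0, 1.0])      # hypothetical 3-step rollout
values = np.array([2.0, 2.0, 2.0, 0.0])  # critic values incl. V(s_T)
print(gae_advantages(rewards, values, lam=0.0))  # one-step TD errors
print(gae_advantages(rewards, values, lam=1.0))  # Monte Carlo advantages
```

Note that λ=0 returns exactly the per-step TD errors, which makes the two limits easy to check against each other on a short rollout.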