https://zhuanlan.zhihu.com/p/32727209
This Zhihu column is very well written; it is worth browsing after finishing the course, and you will get a lot out of it.
1. It is interesting to see that, under the umbrella of policy gradients, the REINFORCE method, a Monte Carlo (MC) approach, comes first, and the value-function-fitting (actor-critic) approach comes later. It is very easy to get confused when trying to understand PG. IMHO, the key points are:
1.1 G(s,a): unbiased vs. biased estimation
1.2 The variance-reduction idea
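A minimal numerical sketch of the variance-reduction idea in 1.2: subtracting a state-independent baseline b from the return G leaves the policy-gradient estimate unbiased (because the score function has zero mean) but shrinks its variance. The distributions and numbers below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy model: samples of the score ∇log π(a|s), which has zero mean for
# any policy, and noisy Monte Carlo returns G (hypothetical numbers).
score = rng.normal(0.0, 1.0, size=n)
G = rng.normal(5.0, 2.0, size=n)

def grad_estimates(baseline):
    # Policy-gradient samples: ∇log π(a|s) * (G - b)
    return score * (G - baseline)

var_no_baseline = grad_estimates(0.0).var()
var_with_baseline = grad_estimates(G.mean()).var()

# Same mean (no bias added), but much smaller variance with the baseline.
print(var_with_baseline < var_no_baseline)
```

With b = 0 the per-sample variance is roughly E[s^2]E[G^2] = 29, while with b = E[G] it drops to roughly E[s^2]Var[G] = 4, which is why baselines are used at all.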
2. Under a given policy, the exact Q(s_t, a_t) is the immediate reward plus the (discounted) expectation of the value function at the next state: Q(s_t, a_t) = r(s_t, a_t) + γ E[V(s_{t+1})]. As an unbiased estimate of that expectation, the V of a single sampled next state is combined with the reward: r_t + γ V(s_{t+1}).
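The one-step bootstrapped estimate in point 2 can be sketched as follows; the reward, critic value, and discount factor are hypothetical numbers for illustration.

```python
def one_step_q_estimate(reward, v_next, gamma=0.99):
    """Bootstrapped estimate Q(s_t, a_t) ≈ r_t + γ * V(s_{t+1}).

    A single sampled next state replaces the expectation over s_{t+1},
    so this is an unbiased sample of the target only if V is the true
    value function; with a learned (fitted) V it is biased.
    """
    return reward + gamma * v_next

# Hypothetical numbers: reward 1.0, critic's V(s_{t+1}) = 10.0
print(one_step_q_estimate(1.0, 10.0))  # 1.0 + 0.99 * 10.0 = 10.9
```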
3. The tradeoff between the actor-critic (AC) based estimator and the MC based estimator is a bias-variance tradeoff:
For AC: lower variance, but higher bias whenever the fitted value function is wrong (it always is).
For MC (full-return advantage): no bias, but higher variance.
4. Generalized advantage estimation (GAE) is a good framework that gives a consistent way to handle this tradeoff, interpolating between the two extremes with a single parameter λ.
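The GAE interpolation in point 4 can be sketched as follows. It exponentially averages TD errors δ_t = r_t + γV(s_{t+1}) - V(s_t); λ=0 recovers the one-step actor-critic estimate (low variance, biased by the critic), λ=1 recovers the Monte Carlo advantage (unbiased, high variance). The rollout numbers are made up for illustration.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE: A_t = sum_{l>=0} (gamma*lam)^l * delta_{t+l},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has one extra entry, V(s_T), to bootstrap the last step.
    Computed with a backward recursion: A_t = delta_t + gamma*lam*A_{t+1}.
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 1.0, 1.0])      # hypothetical 3-step rollout
values = np.array([2.0, 2.0, 2.0, 0.0])  # critic values incl. V(s_T)
print(gae_advantages(rewards, values, lam=0.0))  # one-step TD errors
print(gae_advantages(rewards, values, lam=1.0))  # Monte Carlo advantages
```

Note that λ=0 returns exactly the per-step TD errors, which makes the two limits easy to check against each other on a short rollout.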