
[Chinese-English subtitles] UC Berkeley Fall 2018 CS 294-112 Deep Reinforcement Learning

Start date: December 20, 2018
Length: 26 lectures

https://zhuanlan.zhihu.com/p/32727209

This Zhihu column is very well written; it is worth browsing after finishing the course, and you will get a lot out of it.


1. It is interesting to see that, under the umbrella of policy gradients, the REINFORCE method, as a Monte Carlo (MC) approach, comes first, and the value-function-fitting approach comes later. It is very easy to get confused when trying to understand PG. IMHO, the key points are these (see the sketch after this list):

1.1 Unbiased vs. biased estimation of G(s,a) (the reward-to-go)

1.2 The variance reduction idea
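A minimal sketch of these two points, assuming NumPy and a hypothetical `log_prob_grads[t]` array standing in for the gradient of log pi(a_t|s_t); none of the names below come from the course code:

```python
import numpy as np

def reinforce_gradient(rewards, log_prob_grads, baselines=None, gamma=0.99):
    """REINFORCE policy-gradient estimate from one complete episode.

    The Monte Carlo reward-to-go G_t is an unbiased estimate of the
    expected return; subtracting a state-dependent baseline b(s_t)
    reduces variance without introducing bias.
    """
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):            # reward-to-go, computed backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    if baselines is not None:
        returns = returns - np.asarray(baselines)  # variance reduction
    # grad J ~= sum_t grad log pi(a_t|s_t) * G_t
    return sum(g * G for g, G in zip(log_prob_grads, returns))
```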


2.  Under a given policy, the exact Q(st,at) is the reward plus the expectation of the value function at the next state. As an unbiased estimate of that expectation, a single sample of V at the observed next state is combined with the reward.
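In symbols (standard notation; the discount factor gamma is my addition, since the note above does not mention it):

```latex
Q^{\pi}(s_t, a_t)
  = r(s_t, a_t)
  + \gamma \, \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\!\left[ V^{\pi}(s_{t+1}) \right]
  \approx r(s_t, a_t) + \gamma \, V^{\pi}(s_{t+1})
```

where the approximation plugs in the single next state actually observed in the rollout.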


3.  The tradeoff between the AC-based and the MC-based estimators lies in bias versus variance.

For AC: lower variance, but higher bias if the fitted value function is wrong (it always is).

For MC (the advantage built from full Monte Carlo returns): no bias, but higher variance.
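A minimal sketch contrasting the two estimators, assuming a complete episode and a fitted critic whose outputs `values` have length T+1 (one bootstrap value for the final state); the function names are mine, not the course's:

```python
import numpy as np

def mc_advantages(rewards, values, gamma=0.99):
    """MC advantage: full reward-to-go minus the baseline V(s_t).
    Unbiased (for a complete episode), but high variance."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values[:-1]

def ac_advantages(rewards, values, gamma=0.99):
    """One-step actor-critic advantage (the TD error):
    lower variance, but biased whenever V is wrong."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]
```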


4. Generalized advantage estimation (GAE) is a good framework that gives a consistent way to handle this tradeoff.
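A sketch of the GAE recursion under the same assumptions as above; lam interpolates between the two estimators of point 3 (lam=0 gives the one-step AC estimate, lam=1 recovers the MC estimate for a complete episode):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation: an exponentially weighted
    mixture of n-step advantage estimators, controlled by lam."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD errors
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):   # backward recursion
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```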
