1.  It is interesting that, under the umbrella of policy gradient (PG), the REINFORCE method, a Monte Carlo (MC) approach, comes first, and the value-function fitting approach comes only later. This ordering makes it easy to get confused when trying to understand PG. IMHO, the key points are these:

1.1.  Unbiased versus biased estimates of G(s,a)

1.2.  The idea of variance reduction


2.  Under a given policy, the exact Q(s_t, a_t) is the reward plus the expectation of the value function over the next state. As an unbiased estimate of it, the value V of a single sampled next state is combined with the reward (this is unbiased only as long as V itself is exact).
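
To make point 2 concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the two next states "A" and "B", their values, the transition probabilities, the reward of 1.0, and the discount factor `gamma` (set to 1.0 so that it matches the statement above). It compares the exact expectation with the single-sample estimate, and checks that the single-sample estimate averages out to the exact value.

```python
import random


def exact_q(reward, next_state_probs, value, gamma=1.0):
    """Q(s_t, a_t) = r + gamma * E[V(s_{t+1})], with the expectation computed exactly."""
    expected_v = sum(p * value[s] for s, p in next_state_probs.items())
    return reward + gamma * expected_v


def sampled_q(reward, next_state_probs, value, rng, gamma=1.0):
    """Single-sample estimate: draw one next state and plug in its value."""
    states = list(next_state_probs)
    weights = [next_state_probs[s] for s in states]
    next_state = rng.choices(states, weights=weights, k=1)[0]
    return reward + gamma * value[next_state]


# Hypothetical toy numbers, chosen only for illustration.
value = {"A": 1.0, "B": 3.0}   # assumed to be the true V under the policy
probs = {"A": 0.5, "B": 0.5}   # assumed transition probabilities for (s_t, a_t)
rng = random.Random(0)

q_exact = exact_q(1.0, probs, value)
samples = [sampled_q(1.0, probs, value, rng) for _ in range(10000)]
q_mean = sum(samples) / len(samples)

print(q_exact, q_mean)
```

Each single sample is either 2.0 or 4.0, so any one of them is noisy, but their mean stays close to the exact value. If `value` were a wrong approximation instead of the true V, the mean would converge to the wrong number, which is where the bias discussed in point 3 comes from.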


3.  The tradeoff between the actor-critic (AC) approach and the MC approach lies in bias versus variance.

For AC: lower variance, but higher bias if the value estimate is wrong (it always is).

For A-MC: no bias, but higher variance.
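
The two estimators in point 3 can be sketched side by side. This is only an illustration under stated assumptions: `rewards` is one finite episode that ends after the last reward, `values` holds the critic's estimates V(s_0), ..., V(s_T) with the terminal entry last, and `gamma` is a discount factor. The function names are made up for this note.

```python
def mc_advantages(rewards, values, gamma=0.99):
    """MC-style advantage: full sampled return minus the baseline V(s_t).

    Assumes the episode terminates after the last reward.
    """
    advantages = []
    ret = 0.0
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ret      # discounted return from step t
        advantages.append(ret - values[t])
    advantages.reverse()
    return advantages


def td_advantages(rewards, values, gamma=0.99):
    """AC-style advantage: one-step bootstrap r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


rewards = [1.0, 1.0, 1.0]

# A deliberately wrong critic (all zeros): the two estimators disagree.
wrong_values = [0.0, 0.0, 0.0, 0.0]
mc_wrong = mc_advantages(rewards, wrong_values, gamma=1.0)
td_wrong = td_advantages(rewards, wrong_values, gamma=1.0)

# The true values for this deterministic episode: the two estimators agree.
true_values = [3.0, 2.0, 1.0, 0.0]
mc_true = mc_advantages(rewards, true_values, gamma=1.0)
td_true = td_advantages(rewards, true_values, gamma=1.0)

print(mc_wrong, td_wrong, mc_true, td_true)
```

With the wrong critic, the one-step estimate inherits the critic's error through V(s_{t+1}), while the MC estimate only uses the critic as a baseline and keeps the real return. The price of the MC version is that it sums every random reward until the end of the episode, which is where its higher variance comes from in a stochastic environment.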


4.  Generalized advantage estimation (GAE) is a good framework that gives a consistent way to handle this tradeoff.
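
A minimal sketch of the usual GAE recursion, again in plain Python. The assumptions: `rewards` is one finite episode, `values` holds V(s_0), ..., V(s_T) with the terminal or bootstrap value last, and `gamma` and `lam` are the discount factor and the GAE parameter. The function name and the default values are chosen for this note, not taken from any particular library.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one episode.

    A_t = delta_t + (gamma * lam) * A_{t+1},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]   # a deliberately wrong critic, for illustration

gae_lam0 = gae_advantages(rewards, values, gamma=1.0, lam=0.0)
gae_lam1 = gae_advantages(rewards, values, gamma=1.0, lam=1.0)

print(gae_lam0, gae_lam1)
```

Setting `lam=0.0` recovers the one-step AC-style estimate (lower variance, biased by the critic), and `lam=1.0` recovers the MC-style estimate (no bias from the critic, higher variance). Values in between interpolate between the two, which is the sense in which GAE gives one consistent knob for the tradeoff.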









