Blackjack (21)
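import numpy as np

def generate_episode_from_limit_stochastic(bj_env):
    """Play one episode with the stochastic threshold policy and return its steps."""
    episode = []
    state = bj_env.reset()
    while True:
        # state[0] is the player's current sum. Above 18, action 0 is picked
        # with probability 0.8; otherwise action 1 is picked with probability 0.8.
        probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]
        action = np.random.choice(np.arange(2), p=probs)
        next_state, reward, done, info = bj_env.step(action)
        # Record the state the action was taken in, not the state it led to.
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode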
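episode is a list of (state, action, reward) tuples; the result of every step is appended to it.
The action probabilities start from 0.8 and 0.2: when the player's sum is > 18, the policy sticks with probability 0.8...
Each call to the function produces one episode.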
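Since the later snippets index into separate `actions` and `rewards` sequences, it may help to see how an episode can be split into them. This is a minimal sketch with a made-up two-step episode; it assumes each state is a tuple whose first entry is the player's sum, as the `state[0] > 18` check suggests, and the remaining entries are illustrative only.

```python
# Hypothetical episode in the (state, action, reward) layout described above.
episode = [
    ((13, 10, False), 1, 0.0),  # sum 13: took action 1, no reward yet
    ((19, 10, False), 0, 1.0),  # sum 19: took action 0, episode ended with +1
]

# zip(*episode) transposes the list of tuples into three parallel tuples.
states, actions, rewards = zip(*episode)

print(states)
print(actions)
print(rewards)
```

After this, `states[i]`, `actions[i]` and `rewards[i]` all refer to the same time step i.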
Compute the discount factors:
discounts = np.array([gamma**i for i in range(len(rewards)+1)])
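To see what these discount factors do, here is a small worked example. The values of `gamma` and `rewards` are made up for illustration. The array has one more entry than `rewards` so that the slice `discounts[:-(1+i)]` also works for i = 0 (a plain `[:-0]` would be empty).

```python
import numpy as np

gamma = 0.9
rewards = np.array([0.0, 0.0, 1.0])  # made-up rewards of a 3-step episode

# gamma^0, gamma^1, ..., gamma^len(rewards): one entry more than rewards
discounts = np.array([gamma**i for i in range(len(rewards) + 1)])

# Return from step i: the remaining rewards weighted by gamma^0, gamma^1, ...
returns = [
    float(np.sum(rewards[i:] * discounts[:-(1 + i)]))
    for i in range(len(rewards))
]
print(discounts)
print(returns)
```

The return from the first step is gamma^2 * 1, from the second step gamma * 1, and from the last step just 1.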
Sample an episode:
episode = generate_episode(env)
Mean update (accumulated return divided by visit count):
returns_sum[state][actions[i]] += sum(rewards[i:]*discounts[:-(1+i)])
N[state][actions[i]] += 1.0
Q[state][actions[i]] = returns_sum[state][actions[i]] / N[state][actions[i]]
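The three lines above keep a sum and a count and divide them. The same mean can also be written in incremental form, Q <- Q + (G - Q) / N, which needs no stored sum. The sketch below compares both forms for a single (state, action) pair; the list of returns is made up.

```python
# Made-up sampled returns G for one (state, action) pair.
sampled_returns = [1.0, -1.0, 1.0, 1.0]

total, count = 0.0, 0
q_from_sum = 0.0
q_incremental = 0.0
for G in sampled_returns:
    # Sum-and-count form, as in the lines above.
    total += G
    count += 1
    q_from_sum = total / count
    # Incremental form: move the old estimate toward G by 1/N of the error.
    q_incremental += (G - q_incremental) / count

print(q_from_sum, q_incremental)
```

Both variables end at the plain average of the sampled returns, up to floating-point rounding.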
As the number of iterations increases, Q settles toward stable values.
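The pieces above can be put together into one loop. This is only a sketch under assumptions: `mc_prediction_q`, `OneStepEnv` and `generate_episode_random` are hypothetical names, and the tiny stand-in environment replaces Blackjack so the example runs without extra packages. In it, action 0 always pays 0 and action 1 pays +1 with probability 0.7 and -1 otherwise, so the true value of action 1 is 0.4.

```python
from collections import defaultdict
import numpy as np

def mc_prediction_q(env, num_episodes, generate_episode, gamma=1.0, n_actions=2):
    # Every-visit Monte Carlo estimate of Q for the policy inside generate_episode.
    returns_sum = defaultdict(lambda: np.zeros(n_actions))
    N = defaultdict(lambda: np.zeros(n_actions))
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(num_episodes):
        episode = generate_episode(env)
        states, actions, rewards = zip(*episode)
        rewards = np.array(rewards, dtype=float)
        discounts = np.array([gamma**i for i in range(len(rewards) + 1)])
        for i, state in enumerate(states):
            a = actions[i]
            returns_sum[state][a] += np.sum(rewards[i:] * discounts[:-(1 + i)])
            N[state][a] += 1.0
            Q[state][a] = returns_sum[state][a] / N[state][a]
    return Q

class OneStepEnv:
    # Hypothetical stand-in: a single state, every episode ends after one step.
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        return "s"

    def step(self, action):
        if action == 0:
            reward = 0.0
        else:
            reward = 1.0 if self.rng.random() < 0.7 else -1.0
        return "s", reward, True, {}

def generate_episode_random(env):
    # Same loop shape as the episode generator above, with a uniform random policy.
    episode = []
    state = env.reset()
    while True:
        action = int(env.rng.integers(2))
        next_state, reward, done, info = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode

Q = mc_prediction_q(OneStepEnv(seed=0), 5000, generate_episode_random)
print(Q["s"])
```

With more episodes the estimate for action 1 should move closer to 0.4, which is the stabilizing behavior the note describes.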