
Blackjack

import numpy as np

def generate_episode_from_limit_stochastic(bj_env):
    # Play one Blackjack episode under a fixed stochastic policy.
    episode = []
    state = bj_env.reset()
    while True:
        # Hand sum > 18: stick (action 0) with probability 0.8;
        # otherwise: hit (action 1) with probability 0.8.
        probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]
        action = np.random.choice(np.arange(2), p=probs)
        next_state, reward, done, info = bj_env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode
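
A minimal usage sketch, assuming the classic OpenAI Gym Blackjack environment (the id 'Blackjack-v0' and the four-value step API are assumptions based on the code above):

import gym

env = gym.make('Blackjack-v0')  # assumed environment id
episode = generate_episode_from_limit_stochastic(env)
print(episode)
# e.g. [((13, 10, False), 1, 0.0), ((19, 10, False), 0, 1.0)]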

 

episode is a list of (state, action, reward) tuples; the result of each step is appended to this list.

The two action probabilities are 0.8 and 0.2: when the hand sum is greater than 18, the policy sticks (action 0) with probability 0.8; otherwise it hits (action 1) with probability 0.8.

Each call to the function produces one complete episode.

 

Computing the discount factors:

discounts = np.array([gamma**i for i in range(len(rewards)+1)])
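
The array has one extra entry (len(rewards)+1) so that the slice discounts[:-(1+i)] used below always lines up with rewards[i:]. A small worked example, assuming gamma = 0.9 and an illustrative three-step episode:

import numpy as np

gamma = 0.9
rewards = [0.0, 0.0, 1.0]  # illustrative rewards

discounts = np.array([gamma**i for i in range(len(rewards)+1)])
print(discounts)                      # [1.    0.9   0.81  0.729]
print(sum(rewards * discounts[:-1]))  # discounted return from step 0: 0.81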

 

Sampling an episode:

episode = generate_episode(env)
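
Before the update below, the episode is typically unzipped into separate state, action, and reward sequences (a standard step, implied by the indexing actions[i] and rewards[i:] that follows):

states, actions, rewards = zip(*episode)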

 

Incremental mean update (kept as a running sum and a visit count):

for i, state in enumerate(states):
    returns_sum[state][actions[i]] += sum(rewards[i:]*discounts[:-(1+i)])
    N[state][actions[i]] += 1.0
    Q[state][actions[i]] = returns_sum[state][actions[i]] / N[state][actions[i]]
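
Putting the fragments together, a sketch of the full every-visit MC prediction loop (the function name mc_prediction_q, the defaultdict bookkeeping, and the gamma default are assumptions; the per-step update matches the snippets above):

from collections import defaultdict

import numpy as np

def mc_prediction_q(env, num_episodes, generate_episode, gamma=1.0):
    # Running sums of returns, visit counts, and the action-value estimate.
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    for i_episode in range(1, num_episodes + 1):
        # Sample one episode under the given policy.
        episode = generate_episode(env)
        states, actions, rewards = zip(*episode)
        # Discount factors, with one extra entry (see above).
        discounts = np.array([gamma**i for i in range(len(rewards) + 1)])
        # Accumulate the discounted return following each (state, action)
        # pair and re-average.
        for i, state in enumerate(states):
            returns_sum[state][actions[i]] += sum(rewards[i:] * discounts[:-(1 + i)])
            N[state][actions[i]] += 1.0
            Q[state][actions[i]] = returns_sum[state][actions[i]] / N[state][actions[i]]
    return Q

Called, for example, as Q = mc_prediction_q(env, 500000, generate_episode_from_limit_stochastic); the episode count here is illustrative.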

 

As the number of iterations increases, the Q estimates converge: the running averages stabilize.

