CliffWalking environment

 

This gridworld example compares Sarsa and Q-learning, highlighting the difference between on-policy (Sarsa) and off-policy (Q-learning) methods.
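For reference, the two update rules make the distinction concrete (standard textbook form, following Sutton and Barto). Sarsa bootstraps from the action its behavior policy actually selects next, while Q-learning bootstraps from the greedy action regardless of what is actually executed:

Sarsa (on-policy):
    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) ]

Q-learning (off-policy):
    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]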

Task: a 4×12 grid of states. The bottom row between the start (bottom-left) and goal (bottom-right) corners is the cliff, which appears as N/A = -1 in the estimated policy below.

Estimated Optimal Policy (UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3, N/A = -1):
[[ 0  3  1  1  1  1  1  1  1  2  2  1]
 [ 1  1  0  3  2  2  1  3  2  2  2  2]
 [ 1  1  1  1  1  1  1  1  1  1  1  2]
 [ 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0]]
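A minimal sketch of how a policy grid like this can be derived from a learned Q-table. The helper name and the convention of marking never-updated states as -1 are my assumptions, not from the original notes; Q is assumed to map each of the 48 states to a length-4 array of action values.

import numpy as np

def estimate_policy(Q, shape=(4, 12)):
    # greedy action per state; states whose values were never updated
    # (e.g. the cliff cells) are marked N/A = -1
    policy = np.array([
        int(np.argmax(Q[s])) if np.any(Q[s] != 0) else -1
        for s in range(shape[0] * shape[1])
    ])
    return policy.reshape(shape)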

The agent has 4 possible actions (environment setup is sketched after this list):

UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3
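A minimal setup sketch using the classic Gym toy-text environment. This assumes the older gym API in which step() returns a 4-tuple; the gymnasium fork registers the same CliffWalking-v0 id but returns (obs, info) from reset() and a 5-tuple from step().

import gym

env = gym.make("CliffWalking-v0")
print(env.observation_space)   # Discrete(48): the 4*12 grid, indexed row-major
print(env.action_space)        # Discrete(4): UP, RIGHT, DOWN, LEFT

state = env.reset()                            # start state is 36 (bottom-left corner)
next_state, reward, done, info = env.step(1)   # RIGHT; each step gives reward -1,
                                               # stepping into the cliff gives -100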
First, estimate the value function under a random (equiprobable) policy as a baseline.
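A sketch of that baseline using TD(0) policy evaluation under a uniform random policy; the function name and hyperparameter defaults (num_episodes, alpha) are arbitrary choices of mine, not values from the notes.

import numpy as np

def evaluate_random_policy(env, num_episodes=500, alpha=0.1, gamma=1.0):
    V = np.zeros(env.observation_space.n)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = env.action_space.sample()             # equiprobable random policy
            next_state, reward, done, _ = env.step(action)
            # TD(0): move V(s) toward the one-step bootstrapped target
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state
    return V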
Sarsa algorithm. Parameter num_episodes: the number of episodes that are generated through agent-environment interaction.
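A sketch of a single Sarsa episode. The epsilon-greedy helper and all hyperparameter defaults are assumptions; Q is assumed to be a defaultdict mapping each state to a length-4 numpy array of action values.

import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon):
    # explore uniformly with probability epsilon, otherwise act greedily
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def sarsa_episode(env, Q, alpha=0.5, gamma=1.0, epsilon=0.1):
    score = 0
    state = env.reset()
    action = epsilon_greedy(Q, state, env.action_space.n, epsilon)
    done = False
    while not done:
        next_state, reward, done, _ = env.step(action)
        next_action = epsilon_greedy(Q, next_state, env.action_space.n, epsilon)
        # on-policy target: bootstrap from the action actually chosen next
        Q[state][action] += alpha * (reward + gamma * Q[next_state][next_action]
                                     - Q[state][action])
        state, action = next_state, next_action
        score += reward
    return score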
 
First read the current Q-table entry Q(S0, A0), then apply the update to the Q-table.
Keep two records for analysis: a queue of instantaneous per-episode scores and the running average score.
Iterate over episodes, print progress every 100 episodes, and finally use the average reward to judge whether learning has converged (see the loop sketch below).
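A sketch of that outer loop, reusing sarsa_episode from above; the 100-episode window and episode count are assumptions. On CliffWalking the optimal episode return is -13, so an average reward approaching that value suggests convergence.

import numpy as np
from collections import defaultdict, deque

def train(env, num_episodes=5000):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    scores = deque(maxlen=100)   # queue of instantaneous per-episode scores
    avg_scores = []              # running average over the window, for analysis
    for episode in range(1, num_episodes + 1):
        scores.append(sarsa_episode(env, Q))
        avg_scores.append(np.mean(scores))
        if episode % 100 == 0:   # print progress every 100 episodes
            print(f"episode {episode}, average reward {avg_scores[-1]:.1f}")
    return Q, avg_scores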
 