CliffWalking environment
This gridworld example compares Sarsa and Q-learning, highlighting the difference between on-policy (Sarsa) and off-policy (Q-learning) methods.
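For reference, the on-policy versus off-policy distinction shows up in the bootstrap term of the two update rules. The notation below (step size \alpha, discount \gamma, reward R_{t+1}) follows the common textbook convention and is an assumption, since these notes do not define symbols:

```latex
% Sarsa (on-policy): bootstraps from the action A_{t+1} actually selected by the behaviour policy
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]

% Q-learning (off-policy): bootstraps from the greedy action, whatever is actually taken next
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \, \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]
```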
Task: a 4×12 grid.
The agent has 4 potential actions:
UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3
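The action encoding above can be tied to grid moves with a small transition function. This is a minimal sketch, not the environment's actual implementation: the start cell, goal cell, cliff cells and the rewards (-1 per step, -100 for stepping into the cliff and being sent back to the start) are assumptions taken from the usual textbook description of cliff walking, and the names `step`, `is_cliff`, `START` and `GOAL` are hypothetical.

```python
# Action encoding from the notes: UP=0, RIGHT=1, DOWN=2, LEFT=3.
UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3
N_ROWS, N_COLS = 4, 12

# (row delta, column delta) for each action; row 0 is the top row.
MOVES = {UP: (-1, 0), RIGHT: (0, 1), DOWN: (1, 0), LEFT: (0, -1)}

START = (3, 0)    # assumed: bottom-left corner
GOAL = (3, 11)    # assumed: bottom-right corner


def is_cliff(row, col):
    """Assumed cliff: the bottom-row cells between start and goal."""
    return row == 3 and 1 <= col <= 10


def step(state, action):
    """Return (next_state, reward, done) for one move on the grid."""
    row, col = state
    d_row, d_col = MOVES[action]
    # Moves that would leave the grid keep the agent on the border cell.
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    if is_cliff(row, col):
        # Assumed: falling off costs -100 and resets the agent to START.
        return START, -100, False
    return (row, col), -1, (row, col) == GOAL
```

For example, `step(START, UP)` moves the agent one row up at a cost of -1, while `step(START, RIGHT)` walks into the assumed cliff and returns the agent to the start.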
First, compute a random value function.
Sarsa algorithm parameters
num_episodes: the number of episodes generated through agent-environment interaction.
First, look up the existing Q-table value for S0, A0 (the value before the update).
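Update the Q-table. One queue records the instantaneous (per-episode) score and another records the average score; both are used for analysis.
Iterate over the episodes, printing the result every 100 episodes. Finally, use the average reward to judge whether the algorithm has converged.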
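The training loop described above might look like the following. It is a sketch under stated assumptions rather than the original code: the grid layout and rewards are the textbook cliff-walking ones, the `epsilon = 1 / episode` schedule, `alpha`, `gamma`, and the `max_steps` safety cap are illustrative choices, and every name other than `num_episodes` is hypothetical.

```python
import random
from collections import defaultdict, deque

N_ROWS, N_COLS, N_ACTIONS = 4, 12, 4
START, GOAL = (3, 0), (3, 11)                 # assumed layout
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]    # UP, RIGHT, DOWN, LEFT


def step(state, action):
    """Assumed dynamics: -1 per step, -100 and reset when entering the cliff."""
    row = min(max(state[0] + MOVES[action][0], 0), N_ROWS - 1)
    col = min(max(state[1] + MOVES[action][1], 0), N_COLS - 1)
    if row == 3 and 1 <= col <= 10:
        return START, -100, False
    return (row, col), -1, (row, col) == GOAL


def epsilon_greedy(Q, state, eps, rng):
    """Random action with probability eps, otherwise a greedy one."""
    if rng.random() < eps:
        return rng.randrange(N_ACTIONS)
    values = Q[state]
    return values.index(max(values))


def sarsa(num_episodes, alpha=0.5, gamma=1.0, plot_every=100,
          max_steps=1000, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0] * N_ACTIONS)
    tmp_scores = deque(maxlen=plot_every)     # per-episode scores
    avg_scores = deque(maxlen=num_episodes)   # one average per plot_every episodes
    for i_episode in range(1, num_episodes + 1):
        eps = 1.0 / i_episode                 # assumed exploration schedule
        state = START
        action = epsilon_greedy(Q, state, eps, rng)   # S0, A0
        score = 0
        for _ in range(max_steps):            # cap is a safety assumption
            next_state, reward, done = step(state, action)
            score += reward
            if done:
                # Terminal transition: no bootstrap term.
                Q[state][action] += alpha * (reward - Q[state][action])
                break
            next_action = epsilon_greedy(Q, next_state, eps, rng)
            # Sarsa target uses the action actually selected next.
            target = reward + gamma * Q[next_state][next_action]
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
        tmp_scores.append(score)
        if i_episode % plot_every == 0:
            avg = sum(tmp_scores) / len(tmp_scores)
            avg_scores.append(avg)
            print(f"Episode {i_episode}/{num_episodes}: average reward {avg:.1f}")
    return Q, list(avg_scores)
```

Calling `Q, avg_scores = sarsa(5000)` returns the learned table and one average per 100 episodes; a flattening `avg_scores` curve is a simple, informal indication that learning has converged.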