Instructor: Sergey Levine, Electrical Engineering and Computer Sciences (EECS)
-
Full list on course website (click "Lecture Slides")
-
From supervised learning to decision making (how to turn a supervised learning problem into a decision-making problem)
-
Model-free algorithms: Q-learning, policy gradients, actor-critic
-
Advanced model learning and prediction
-
Exploration (a newer RL topic, along with the latest advances)
-
Transfer and multi-task learning, meta-learning
-
Open problems, research talks, invited lectures
-
Assignments
-
Homework 1: imitation learning (control via supervised learning)
-
Homework 2: Policy gradients ("REINFORCE")
-
Homework 3: Q-learning and actor-critic algorithms
-
Homework 4: Model-based reinforcement learning
-
Homework 5: Advanced model-free RL algorithms
-
Final project: research-level project of your choice (form a group of 2-3 students; you're welcome to start early!)
What is reinforcement learning, and why should we care?
-
How do we build intelligent machines? (framing question)
-
Intelligent machines must be able to adapt. (Adaptivity; but in practice, the complexity of real-world environments makes adaptation very hard for an agent.)
-
Deep learning helps us handle unstructured environments. (The benefit of deep learning is that it gives us tools for handling unstructured environments, i.e. environments where you cannot predict in advance the layout of everything, or what unexpected situations will arise.)
-
Reinforcement learning provides a formalism for behavior. (Deep learning lets us handle unstructured real-world inputs, but it does not tell us how to make decisions — roughly, it is a black box, hard to interpret, and too brute-force and inflexible on its own. To build an intelligent system that decides flexibly, such as a household robot, you need a mathematical framework for decision making: in practice the system must not only recognize things but also choose how to respond. Reinforcement learning is exactly such a mathematical framework for decision making, formalizing the interaction between an agent and its environment.)
Other success stories include AlphaGo and so on.
-
An early success of reinforcement learning in games: TD-Gammon, which combined reinforcement learning with a neural network.
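TD-Gammon paired temporal-difference learning with a neural-network value estimator. A minimal sketch of the underlying tabular TD(0) value update (TD-Gammon replaced the table with a network; all names and numbers here are illustrative, not from the lecture):

```python
# Minimal sketch of a TD(0) value update, the core idea behind TD-Gammon
# (which used a neural network in place of this table). Names are illustrative.

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Move V(state) toward the bootstrapped target r + gamma * V(next_state)."""
    target = reward + gamma * V.get(next_state, 0.0)
    td_error = target - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error

# Toy usage: a single transition s0 -> s1 with reward 1, starting from V = 0.
V = {}
err = td0_update(V, state="s0", reward=1.0, next_state="s1")
print(V["s0"])  # 0.1 after one update (alpha * td_error, starting from 0)
```

Repeating such updates over many self-play games is what lets the value estimate propagate backward from winning positions to earlier ones.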
-
-
What is deep RL, and why should we care?
In early computer vision, constructing good features was crucial but laborious; with deep learning, features no longer have to be hand-engineered, since the model learns them itself. Standard reinforcement learning, by contrast, used classical function approximators such as linear functions: you would start from some states (as in certain games), encode them somehow, and extract features from them. Such features are hard to design for RL, because the state features that support decision making — say, for a linear policy or value function — can be quite counter-intuitive. They can be harder to find than features in classical vision, and even once found there are difficulties in abstracting them to higher levels; moreover, since the state keeps changing, many discriminative high-level features are still needed even after low-level features have been abstracted upward. This is why early applications of reinforcement learning were difficult. The benefit of deep reinforcement learning is that the whole model can be trained end to end as a single unit: no one has to hand-construct features or specify which features are correct; optimization discovers good low-level features that help the higher levels make decisions.
-
What does end-to-end learning mean for sequential decision making?
What does an end-to-end model actually buy us when making decisions? In the real world especially, a decision depends on the agent's own perception. The traditional approach, e.g. for a robot, separates a perception system, which detects the surrounding environment, from a control system that operates on the perception system's output and makes decisions based on it. The difficulty with this design is that the machine needs high-level abstract features, which to some degree depend on human guidance.
Conventional neural networks mainly handle the perception side (the robot's perception system); the action/response side (the control system, with its feedback loops, like positive and negative feedback in control theory) is much harder. The key issue is that to train the control and perception systems jointly, the machine must first know whether what it perceives is good or bad for itself (roughly, seeking benefit and avoiding harm); this relates to the value functions covered later.
In fact, reinforcement learning is a generalization of other machine learning problems: any supervised task can be recast as a reinforcement learning task. For image classification, the action is the output label, the observation is the image pixels, and the reward is the classification accuracy. This sounds more cumbersome, but the idea is important. (In practice, much of NLP relies on constructed reward signals.)
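To make the reframing concrete, here is a toy sketch of classification viewed as one-step RL: the observation is the input, the action is the predicted label, and the reward is 1 for a correct answer. The policy and dataset are made-up illustrations, not from the lecture:

```python
# Toy sketch: image classification recast as one-step reinforcement learning.
# Observation = the input "pixels" (a toy feature vector here), action = the
# predicted label, reward = 1 if the label is correct, else 0.
# The policy and dataset are illustrative assumptions.

def classify_policy(observation):
    """A trivial 'policy': predict label 1 if the feature sum is positive."""
    return 1 if sum(observation) > 0 else 0

dataset = [([0.5, 0.2], 1), ([-0.3, -0.1], 0), ([0.9, -0.2], 1)]

total_reward = 0
for obs, true_label in dataset:
    action = classify_policy(obs)                # the "action" is the output label
    reward = 1 if action == true_label else 0    # reward signal = correctness
    total_reward += reward

accuracy = total_reward / len(dataset)           # expected reward = accuracy
print(accuracy)  # 1.0 on this toy dataset
```

Each example is a one-step episode, so maximizing expected reward here is exactly maximizing classification accuracy.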
Deep models are what allow reinforcement learning algorithms to solve complex problems end to end! (Deep learning provides low-level representations of the observations, and reinforcement learning then uses them to carry out complex tasks — again an end-to-end structure.)
Why should we study this now?
-
Advances in deep learning (deep learning has advanced rapidly and provides powerful tools)
-
Advances in reinforcement learning (reinforcement learning has seen recent progress)
-
Advances in computational capability (growing compute makes the above applications feasible)
What other problems do we need to solve to enable real-world sequential decision making?
-
Beyond learning from reward (a recurring problem: when building an effective reinforcement learning system for the real world, we cannot simply assume that a perfect, fully correct reward function is given)
-
Basic reinforcement learning deals with maximizing rewards
-
This is not the only problem that matters for sequential decision making!
-
We will cover more advanced topics
-
Learning reward functions from examples (inverse reinforcement learning): you can start directly from data rather than from a hand-specified reward function. (Tried this once; the optimization search space is very large, which makes it hard for NLP tasks.)
-
Transferring knowledge between domains (transfer learning, meta learning)
-
Learning to predict and using prediction to act
-
-
-
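The "maximizing rewards" objective of basic reinforcement learning mentioned above is conventionally written as maximizing the expected (discounted) cumulative reward under the policy's trajectory distribution:

```latex
\theta^\star = \arg\max_\theta \; \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[\sum_{t=0}^{T} \gamma^t\, r(s_t, a_t)\right]
```

Inverse RL, mentioned above, goes in the opposite direction: it recovers $r$ from demonstrated trajectories instead of assuming it is given.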
Where do rewards come from? (Rewards are a basic building block of classical reinforcement learning. In general, people do not know what the precise reward is, and it is hard to measure — one of the difficulties of doing RL. Human "rewards" are mostly hormones the brain directs the body to release... a guess.)
An amusing quotation from the Berkeley lecture:
"You know as human agents we are accustomed to operating with rewards that are so sparse that we only experienced them once or twice in a lifetime if at all."
--"I pity the author" (from Reddit)
-
Are there other forms of supervision?
-
Learning from demonstrations
-
Directly copying observed behavior
-
Inferring rewards from observed behavior (inverse reinforcement learning; automatically producing rewards by inference through a perception network might also count)
-
-
Learning from observing the world (what this can be used for:)
-
Learning to predict
-
Unsupervised learning
-
-
Learning from other tasks
-
Transfer learning
-
Meta-learning: learn to learn
-
-
-
Some examples
-
Imitation learning: NVIDIA's self-driving car (no reward; directly copying observed behavior via supervision).
-
More than imitation: inferring intentions. A person carries something toward a cabinet; a child watching infers the intent — that the person probably wants to put it inside — and walks over to open the cabinet door.
-
Inverse RL example: a person guides a robot arm to pour water, repeated many times, producing many trajectories; the robot learns a reward from those trajectories. When a cup is then placed in front of it at random, it acts according to the learned reward — pouring water into the cup.
-
Prediction: "the idea that we predict the consequences of our motor commands has emerged as an important theoretical concept in all aspects of sensorimotor control." (Some research suggests that part of human intelligent reasoning is in fact based on prediction.)
-
-
What can we do with a perfect model?
-
Prediction for real world control
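One minimal illustration of "prediction for real-world control": given a (here, perfect and toy) model of the dynamics, a planner can simulate candidate action sequences and execute the first action of the best one — random-shooting planning. Everything in this sketch (the 1-D dynamics, reward, and parameters) is an illustrative assumption, not from the lecture:

```python
import random

# Minimal random-shooting planner: with a known (perfect) model, simulate
# candidate action sequences and pick the best one. The toy dynamics and
# reward below are illustrative assumptions.

def model_step(state, action):
    """Toy known dynamics: the action nudges a 1-D state; goal is state 0."""
    return state + action

def reward(state):
    return -abs(state)  # closer to the goal at 0 is better

def plan(state, horizon=5, n_candidates=200, rng=None):
    rng = rng or random.Random(0)
    best_return, best_first_action = float("-inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(-1, 1) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:                 # roll the sequence out inside the model
            s = model_step(s, a)
            total += reward(s)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

first_action = plan(state=3.0)            # first action of the best candidate
```

Replanning at every step (executing only the first action, then planning again) turns this into model-predictive control.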
-
-
How do we build intelligent machines?
-
Imagine you have to build an intelligent machine — where do you start?
One approach: take a known intelligent agent (e.g. a human), mimic its component functions, and code them up. In practice, this is very hard.
-
Learning as the basis of intelligence (learning is the foundation of intelligence; simplify the functions of the agent being imitated, i.e. adopt an agreed-upon working hypothesis before building a model)
-
Some things we can all do (e.g. walking)
-
Some things we can only learn (e.g. driving a car)
-
We can learn a huge variety of things, including very difficult things
-
Therefore our learning mechanism(s) are likely powerful enough to do everything we associate with intelligence (not very realistic)
-
But it may still be very convenient to "hard-code" a few really important bits (hard-coding is sometimes convenient; not everything has to be learned by an algorithm)
-
-
-
A single algorithm? (A stronger version of the earlier hypothesis: one, or a few, algorithms could accomplish any given task. Algorithms as the basis of learning — a meaningful hypothesis.)
-
An algorithm for each "module"? (Current research leans toward designing one algorithm per component — e.g. for a human, the nose & an olfaction algorithm, the tongue & a taste algorithm.)
-
-
What must that single algorithm do?
-
Interpret rich sensory inputs
-
Choose complex actions
-
-
Why deep reinforcement learning?
-
Deep = can process complex sensory input
-
... and also compute really complex functions
-
-
Reinforcement learning = can choose complex actions
-
-
Some evidence in favor of deep learning (Papers)
-
Unsupervised learning models of primary cortical receptive fields and receptive field plasticity.
-
Reinforcement learning in the brain
-
Percepts that anticipate reward become associated with similar firing patterns as the reward itself.
If you see something and you know that this particular signal always precedes a certain reward, you come to associate the signal with the reward. This is quite similar to the so-called Bellman backup, covered later with value functions.
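The Bellman backup referred to here can be written as the value update

```latex
V(s_t) \leftarrow \mathbb{E}_{a_t \sim \pi,\; s_{t+1}}\!\left[ r(s_t, a_t) + \gamma\, V(s_{t+1}) \right]
```

so a state that reliably precedes reward acquires value itself, mirroring the neural finding above.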
-
The basal ganglia appear to be related to the reward system.
-
Model-free RL-like adaptation is often a good fit for experimental data on animal adaptation.
-
-
-
-
What can deep learning & RL do well now?
-
Acquire a high degree of proficiency in domains governed by simple, known rules.
-
Learn simple skills with raw sensory inputs, given enough experience
-
Learn from imitating enough human-provided expert behavior
-
-
What has proven challenging so far?
-
Humans can learn incredibly quickly
-
Deep RL methods are usually slow
-
-
Humans can reuse past knowledge
-
Transfer learning in deep RL is an open problem
-
-
It is not clear what the reward function should be
-