Reinforcement Learning via Recurrent Convolutional Neural Networks
My Bachelor's thesis at IIT Guwahati pursued an elegant connection between classical model-based reinforcement learning and deep learning. More details below!
My paper on this topic was published at ICPR!
- Tanmay Shankar, Santosha K. Dwivedy, and Prithwijit Guha, "Reinforcement Learning via Recurrent Convolutional Neural Networks", in Proceedings of the 23rd International Conference on Pattern Recognition (ICPR 2016), Cancún, Mexico.
For code, please visit the RCNN_MDP repository, and here's a talk that summarizes this work:
Motivation
Deep RL typically learns model-free policies for complex tasks, i.e. without learning the underlying model of the task. While these approaches have been tremendously successful, they typically ignore the structure of the task.
Model-based RL, on the other hand, estimates the underlying Markov Decision Process (MDP) of the task. While this requires the more indirect approach of re-planning with the learnt models, inspecting a learnt model provides valuable insight into how the agent acts.
My bachelor's thesis focused on building more natural representations of model-based RL within recurrent convolutional neural networks (RCNNs). This lets us construct interesting solutions to:
- Value Iteration in an MDP.
- Belief Propagation for a POMDP.
- Learning Transition Models and Reward Functions associated with a POMDP.
- Planning under partial observability.
Here's an overview of these ideas - please refer to the paper for details!
The Value Iteration RCNN
Consider the Bellman backup in Value Iteration:
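In standard notation, with transition model $T(s,a,s')$, reward $R(s,a)$, and discount factor $\gamma$, the backup is:

$$V_{k+1}(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V_k(s') \Big]$$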
Under some assumptions, I realized the expectation with respect to the transition dynamics above could be represented as a convolution:
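Concretely, one sufficient assumption is that the transition probabilities depend only on the relative displacement between $s$ and $s'$, as in a grid world with translation-invariant dynamics. Then:

$$\sum_{s'} T(s,a,s')\, V_k(s') = \big(T^{a} * V_k\big)(s),$$

where $T^{a}$ is a fixed convolutional filter holding the transition probabilities for action $a$.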
This update looks surprisingly like the forward pass of a convolutional layer in a recurrent convolutional network:
- The convolution T(s,a,s') * V(s') represents the convolution stage.
- The addition of R(s,a) represents the addition of a bias term.
- The max over actions represents a pooling operation.
- The iterative nature of value iteration is captured as a temporal recurrence.
These tricks let me build the Value Iteration RCNN above, which implements an end-to-end differentiable approximation to Value Iteration. This VI RCNN can be trained to learn the transition dynamics as convolutional filters, and the reward function, using a deep network as a function approximator.
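As a minimal NumPy sketch of that forward pass, assuming a 2D grid world with one small transition filter per action (the function and variable names here are mine, not the repository's):

```python
import numpy as np
from scipy.signal import convolve2d

def vi_rcnn_forward(reward, trans_filters, gamma=0.95, n_iters=100):
    """Hypothetical sketch of the VI-RCNN forward pass on a 2D grid world.

    reward:        (A, H, W) array, R(s, a) for each action a
    trans_filters: (A, k, k) array, T(s, a, s') as one filter per action
    """
    n_actions, H, W = reward.shape
    value = np.zeros((H, W))
    for _ in range(n_iters):  # recurrence: one step per Bellman backup
        # Convolution stage plus bias: Q(s, a) = R(s, a) + gamma * (T^a * V)(s)
        q = np.stack([reward[a] + gamma * convolve2d(value, trans_filters[a], mode='same')
                      for a in range(n_actions)])
        value = q.max(axis=0)  # pooling stage: max over actions
    return value, q
```

Since every stage here is differentiable, gradients with respect to the filters (the transition model) and the bias (the reward) can flow through the whole recurrence.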
The Belief Propagation RCNN
The same trick applies to the Bayes filter: the prediction step of the belief update, $\sum_s T(s,a,s')\, b(s)$, is again a convolution, and the correction step is an elementwise multiplication by the observation likelihood. The resulting Belief Propagation RCNN can be trained to learn the transition dynamics of states (even with a noisy observation model, provided it is differentiable).
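A sketch of a single belief update in the same NumPy style, again with hypothetical names:

```python
from scipy.signal import convolve2d

def bp_rcnn_step(belief, trans_filter, obs_likelihood):
    """Hypothetical sketch of one Belief Propagation RCNN step.

    belief:         (H, W) current belief b(s)
    trans_filter:   (k, k) transition filter for the executed action
    obs_likelihood: (H, W) likelihood of the received observation at each state
    """
    predicted = convolve2d(belief, trans_filter, mode='same')  # prediction step
    corrected = predicted * obs_likelihood                     # correction step
    return corrected / corrected.sum()                         # renormalize
```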
The QMDP RCNN
By combining the belief over states from the Belief Propagation RCNN with the Q-value estimates from the Value Iteration RCNN, according to the QMDP approximation, we construct the QMDP RCNN! The QMDP RCNN is a differentiable approximation to planning under partial observability that can be trained to learn reward functions by imitating expert demonstrations.
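The QMDP combination itself is a single belief-weighted sum over the Q-values; sketched here with the hypothetical arrays from the earlier snippets:

```python
import numpy as np

def qmdp_action(belief, q_values):
    """Hypothetical sketch of QMDP action selection.

    belief:   (H, W) belief over states from the BP RCNN
    q_values: (A, H, W) Q-value maps from the VI RCNN
    """
    q_belief = (q_values * belief).sum(axis=(1, 2))  # Q(b, a) = sum_s b(s) Q(s, a)
    return int(q_belief.argmax())                    # greedy action under QMDP
```

For training from expert demonstrations, one would typically replace the argmax with a softmax over these belief-space Q-values so the whole pipeline stays differentiable; see the paper for the exact formulation.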