Off-Policy Deep Reinforcement Learning without Exploration
This paper proposes a new algorithm for off-policy reinforcement learning from a fixed batch of previously collected data, with no further interaction with the environment. It combines state-of-the-art deep Q-learning algorithms with a state-conditioned generative model that produces only actions similar to those already seen in the batch.
The authors' algorithm, Batch-Constrained deep Q-learning (BCQ), uses the generative model to propose candidate actions that closely resemble those in the batch, and then selects the highest-valued candidate using a learned Q-network.
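The candidate-then-select step can be sketched in Python/NumPy as follows. This is a simplified, hypothetical illustration, not the authors' implementation. The learned state-conditioned generative model (a conditional VAE in the original work) is replaced by a toy nearest-neighbour sampler, `sample_candidates`, and the learned perturbation network that continuous-action BCQ applies to the candidates is omitted. `q_fn` stands in for any trained Q-network that maps a batch of (state, action) pairs to values; all function and parameter names here are assumptions for illustration.

```python
import numpy as np


def sample_candidates(state, batch_states, batch_actions, n, k=5, noise_std=0.05, rng=None):
    """Hypothetical stand-in for BCQ's state-conditioned generative model.

    Samples actions that were taken in the k batch states closest to `state`
    and adds small Gaussian noise, so candidates stay close to the batch data.
    (The paper trains a generative model for this; this is only a toy.)
    """
    rng = np.random.default_rng() if rng is None else rng
    dists = np.linalg.norm(batch_states - state, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k nearest batch states
    picks = rng.choice(nearest, size=n)      # draw n of them, with replacement
    noise = noise_std * rng.standard_normal((n, batch_actions.shape[1]))
    return np.clip(batch_actions[picks] + noise, -1.0, 1.0)  # assume actions lie in [-1, 1]


def select_action(state, batch_states, batch_actions, q_fn, n=10, **kwargs):
    """Return the highest-valued batch-like candidate according to q_fn."""
    candidates = sample_candidates(state, batch_states, batch_actions, n, **kwargs)
    states = np.repeat(state[None, :], n, axis=0)  # pair the same state with every candidate
    q_values = q_fn(states, candidates)            # shape (n,)
    return candidates[np.argmax(q_values)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch_states = rng.uniform(-1, 1, size=(100, 3))
    batch_actions = rng.uniform(-1, 1, size=(100, 2))

    # Toy critic that prefers actions near (0.3, 0.3); a real Q-network is learned from the batch.
    def q_fn(s, a):
        return -np.sum((a - 0.3) ** 2, axis=1)

    print(select_action(batch_states[0], batch_states, batch_actions, q_fn, n=10, rng=rng))
```

Because every candidate is drawn near actions the batch actually contains, the argmax over Q-values cannot settle on an arbitrary unseen action whose value the Q-network may have badly overestimated; a plain argmax over the whole action space offers no such guarantee.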
OpenReview for this presentation
Powerpoint for this talk