MSc Thesis Defense by Michael Feveile Mariboe: Sampling in reinforcement learning
Reinforcement learning is concerned with learning a policy that determines which actions an agent should take, given observations of an environment’s states, to maximise the agent’s reward. The policy space can be parametrised by a function approximator such as an artificial neural network. This can be advantageous when an environment’s state space is too large to fit in memory, and it is necessary when the state space is infinite, as in the continuous case. Function approximation can enable generalisation to unseen observations. A neuroevolution strategy is an evolution strategy used to teach an artificial neural network to perform a specific task. The variable metric stochastic optimisation method called the covariance matrix adaptation neuroevolution strategy has been applied to model-free episodic control learning by teaching static shallow artificial neural networks to balance poles on a cart by searching directly in the policy space. This raises the question of how it performs on more difficult tasks. Temporal difference learning algorithms have been applied to model-free control learning by teaching a static deep convolutional artificial neural network to approximate a value function on deterministic Atari 2600 games [5, 2, 16, 3, 19, 23, 18, 4, 22, 17, 15, 6].
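To illustrate the kind of direct policy search the thesis evaluates, the following is a minimal sketch of the cross-entropy method searching a policy's parameter space. The episodic return function, parameter dimensionality, and all hyperparameter values here are hypothetical placeholders, not the thesis's actual setup: a real experiment would roll out a neural network policy in an Atari environment instead of scoring a parameter vector directly.

```python
import math
import random

def episode_return(theta):
    # Hypothetical stand-in for an episodic rollout: reward peaks when
    # the policy parameters reach an (arbitrary) target vector.
    target = [0.5, -1.0, 2.0]
    return -sum((t - g) ** 2 for t, g in zip(theta, target))

def cross_entropy_method(dim=3, pop_size=50, elite_frac=0.2,
                         iters=60, init_std=1.0, seed=0):
    rng = random.Random(seed)
    mean = [0.0] * dim           # search distribution mean
    std = [init_std] * dim       # initial standard deviation per dimension
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(iters):
        # Sample candidate policy parameter vectors.
        population = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(pop_size)]
        # Keep the elite fraction with the highest episodic return.
        population.sort(key=episode_return, reverse=True)
        elite = population[:n_elite]
        # Refit the sampling distribution to the elites.
        mean = [sum(e[i] for e in elite) / n_elite for i in range(dim)]
        std = [max(1e-3, math.sqrt(sum((e[i] - mean[i]) ** 2
                                       for e in elite) / n_elite))
               for i in range(dim)]
    return mean

best = cross_entropy_method()
```

The initial standard deviation `init_std` controls how widely the first generation explores, which is why the experiments compare several settings of it.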
We evaluate the covariance matrix adaptation neuroevolution strategy and a direct policy search approach based on another evolution strategy, the cross-entropy method, on 59 stochastic, synchronous Atari 2600 games. We evaluate three different initial standard deviations and compare both the evolution strategies and the initial standard deviations. The covariance matrix adaptation neuroevolution strategy performs significantly better than the cross-entropy method on 53 games. We also evaluate the covariance matrix adaptation neuroevolution strategy with uncertainty handling on 5 games. It performs worse than without uncertainty handling but still significantly better than the cross-entropy method, likely because a low level of stochasticity makes the reevaluations an overhead.