Sublinear regret for learning POMDPs
We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment, measured by the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral …