Order-Optimal Instance-Dependent Bounds for Offline Reinforcement
Learning with Preference Feedback
We consider offline reinforcement learning (RL) with preference feedback, in which the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective is to identify the optimal action for each state, with the ultimate goal of minimizing the {\em simple regret}. We propose an …