Abstract
We consider regret minimization in adversarial deterministic Markov Decision Processes (ADMDPs) with bandit feedback. We devise a new algorithm that pushes the state-of-theart forward in two ways: First, it attains a regret of with respect to the best fixed policy in hindsight, whereas the previous best regret bound was Second, the algorithm and its analysis are compatible with any feasible ADMDP graph topology, while all previous approaches required additional restrictions on the graph topology.