Abstract
We investigate how Safe Policy Improvement (SPI) algorithms can exploit the structure of factored Markov decision processes when such structure is unknown a priori. To facilitate the application of reinforcement learning in the real world, SPI provides probabilistic guarantees that policy changes in a running process will improve its performance. However, current SPI algorithms have requirements that may be impractical, such as (i) the availability of a large amount of historical data, or (ii) prior knowledge of the underlying structure. To overcome these limitations, we enhance a Factored SPI (FSPI) algorithm with different structure learning methods. The resulting algorithms need fewer samples to improve the policy and require weaker prior-knowledge assumptions. In well-factorized domains, the proposed algorithms improve performance significantly over a flat SPI algorithm, demonstrating a sample complexity closer to that of an FSPI algorithm that knows the structure. This indicates that combining FSPI with structure learning algorithms is a promising solution to real-world problems involving many variables.