Off-policy temporal difference learning with distribution adaptation in fast mixing chains

https://people.iut.ac.ir/en/content/93566