Off-policy temporal difference learning with distribution adaptation in fast mixing chains. doi: 10.1007/s00500-017-2490-1; 2017-01; article type: Journal