TY - JOUR
T1 - Guided policy search for sequential multitask learning
AU - Xiong, Fangzhou
AU - Sun, Biao
AU - Yang, Xu
AU - Qiao, Hong
AU - Huang, Kaizhu
AU - Hussain, Amir
AU - Liu, Zhiyong
N1 - Funding Information:
Manuscript received November 25, 2017; accepted January 11, 2018. Date of publication February 19, 2018; date of current version December 14, 2018. This work was supported in part by the NSFC under Grant U1613213, Grant 61375005, Grant 61503383, Grant 61210009, Grant 61627808, Grant 91648205, Grant 61702516, and Grant 61473236, in part by the National Key Research and Development Plan of China under Grant 2017YFB1300202 and Grant 2016YFC0300801, in part by the MOST under Grant 2015BAK35B00 and Grant 2015BAK35B01, in part by the Guangdong Science and Technology Department under Grant 2016B090910001, in part by the Suzhou Science and Technology Program under Grant SYG201712 and Grant SZS201613, in part by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDB02080003, in part by the Key Program Special Fund in XJTLU under Grant KSF-A-01, and in part by the U.K. Engineering and Physical Sciences Research Council under Grant EP/M026981/1. This paper was recommended by Associate Editor A. Hussain. (Corresponding author: Zhiyong Liu.) F. Xiong and X. Yang are with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Computer and Control, University of Chinese Academy of Sciences, Beijing 100049, China.
Publisher Copyright:
© 2013 IEEE.
PY - 2019/1
Y1 - 2019/1
N2 - Policy search in reinforcement learning (RL) is a practical approach that interacts directly with environments in parameter space, but it often faces the dilemmas of local optima and real-time sample collection. A promising algorithm, known as guided policy search (GPS), is capable of handling the challenge of training samples using trajectory-centric methods, and it can also provide asymptotic local convergence guarantees. However, in its current form, the GPS algorithm cannot operate in sequential multitask learning scenarios. This is due to its batch-style training requirement, where all training samples are collectively provided at the start of the learning process. Its adaptation to real-time applications, where training samples or tasks can arrive randomly, is thus hindered. In this paper, the GPS approach is reformulated by adapting a recently proposed lifelong learning method, elastic weight consolidation (EWC). Specifically, Fisher information is incorporated to retain knowledge from previously learned tasks. The proposed algorithm, termed sequential multitask learning-GPS, is able to operate in sequential multitask learning settings while ensuring continuous policy learning without catastrophic forgetting. Pendulum and robotic manipulation experiments demonstrate the new algorithm's efficacy in learning control policies from sequentially arriving training samples, delivering performance comparable to the traditional batch-based GPS algorithm. In conclusion, the proposed algorithm is posited as a new benchmark for the real-time RL and robotics research community.
AB - Policy search in reinforcement learning (RL) is a practical approach that interacts directly with environments in parameter space, but it often faces the dilemmas of local optima and real-time sample collection. A promising algorithm, known as guided policy search (GPS), is capable of handling the challenge of training samples using trajectory-centric methods, and it can also provide asymptotic local convergence guarantees. However, in its current form, the GPS algorithm cannot operate in sequential multitask learning scenarios. This is due to its batch-style training requirement, where all training samples are collectively provided at the start of the learning process. Its adaptation to real-time applications, where training samples or tasks can arrive randomly, is thus hindered. In this paper, the GPS approach is reformulated by adapting a recently proposed lifelong learning method, elastic weight consolidation (EWC). Specifically, Fisher information is incorporated to retain knowledge from previously learned tasks. The proposed algorithm, termed sequential multitask learning-GPS, is able to operate in sequential multitask learning settings while ensuring continuous policy learning without catastrophic forgetting. Pendulum and robotic manipulation experiments demonstrate the new algorithm's efficacy in learning control policies from sequentially arriving training samples, delivering performance comparable to the traditional batch-based GPS algorithm. In conclusion, the proposed algorithm is posited as a new benchmark for the real-time RL and robotics research community.
KW - Elastic weight consolidation (EWC)
KW - guided policy search (GPS)
KW - reinforcement learning (RL)
KW - sequential multitask learning
UR - http://www.scopus.com/inward/record.url?scp=85042198472&partnerID=8YFLogxK
U2 - 10.1109/TSMC.2018.2800040
DO - 10.1109/TSMC.2018.2800040
M3 - Article
AN - SCOPUS:85042198472
SN - 2168-2216
VL - 49
SP - 216
EP - 226
JO - IEEE Transactions on Systems, Man, and Cybernetics: Systems
JF - IEEE Transactions on Systems, Man, and Cybernetics: Systems
IS - 1
M1 - 8294227
ER -