Professor, Doctoral Supervisor
Admissions Disciplines:
Control Science and Engineering -- [Doctoral and Master's students] -- College of Automation Engineering
Armament Science and Technology -- [Master's students] -- College of Automation Engineering
Electronic Information -- [Doctoral and Master's students] -- College of Automation Engineering
Gender: Female
Education: Nanjing University of Aeronautics and Astronautics
Degree: Doctor of Engineering
Affiliation: College of Automation Engineering
Office: Room 401, Building 4, College of Automation Engineering
Phone: 025-84892301-8023
Email:
Affiliated Unit: College of Computer Science and Technology / College of Artificial Intelligence / College of Software
Published in: Conference on Uncertainty in Artificial Intelligence (UAI)
Abstract: Proximal policy optimization (PPO) is one of the most successful deep reinforcement learning methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, its optimization behavior is still far from fully understood. In this paper, we show that PPO can neither strictly restrict the probability ratio, as it attempts to do, nor enforce a well-defined trust region constraint, which means it may still suffer from performance instability. To address this issue, we present an enhanced PPO method, named Trust Region-based PPO with Rollback (TR-PPO-RB). Our method makes two critical improvements: 1) it adopts a new clipping function that supports a rollback behavior to restrict the ratio between the new policy and the old one; 2) the triggering condition for clipping is replaced with a trust region-based one, which is theoretically justified by the trust region theorem. By adhering more closely to the "proximal" property, that is, restricting the policy to the trust region, the new algorithm improves on the original PPO in both stability and sample efficiency. [A schematic sketch of these two mechanisms follows this record.]
Translated Work: No
Publication Date: 2019-01-01
Co-authors: Wang, Yuhui; He, Hao; Tan, Xiaoyang (谭晓阳)
Corresponding Author: Wang, Yuhui (王玉惠)
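The abstract describes two mechanisms: a rollback clipping function that penalizes, rather than merely flattens, the objective once the policy moves too far, and a trust region-based trigger that fires on the KL divergence between the old and new policies instead of on the probability ratio. The following is a minimal PyTorch sketch of how the two could fit together. It is an illustration under stated assumptions, not the paper's exact formulation: the function name, the default values of `delta` and `alpha`, and the precise rollback slope `-alpha * ratio` are all assumed for the example.

```python
import torch

def tr_ppo_rb_loss(logp_new: torch.Tensor,
                   logp_old: torch.Tensor,
                   kl_old_new: torch.Tensor,
                   adv: torch.Tensor,
                   delta: float = 0.035,   # trust-region bound (assumed value)
                   alpha: float = 0.3      # rollback coefficient (assumed value)
                   ) -> torch.Tensor:
    """Schematic TR-PPO-RB surrogate loss (hypothetical sketch).

    logp_new   -- log pi_theta(a_t | s_t) under the current policy
    logp_old   -- log pi_old(a_t | s_t) under the behavior policy
    kl_old_new -- per-state KL(pi_old(.|s_t) || pi_theta(.|s_t))
    adv        -- advantage estimates A_t
    """
    # Probability ratio r_t(theta) = pi_theta / pi_old.
    ratio = torch.exp(logp_new - logp_old)

    # Rollback term: negative slope in the ratio, so pushing the policy
    # further out actively decreases the objective, whereas PPO's flat
    # clipping merely zeroes the gradient outside the clip range.
    rollback = -alpha * ratio

    # Trust region-based trigger: the rollback fires when the per-state
    # KL exceeds delta, replacing PPO's ratio-based clipping condition.
    surrogate = torch.where(kl_old_new >= delta, rollback, ratio) * adv

    # Pessimistic elementwise minimum, mirroring PPO's clipped objective.
    return -torch.min(ratio * adv, surrogate).mean()
```

The elementwise minimum keeps PPO's pessimistic lower bound on the surrogate; the KL-based trigger and the negative-slope rollback are the two departures the abstract attributes to TR-PPO-RB. Consult the paper for the exact functional form and the theoretical justification via the trust region theorem.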