TY - JOUR
T1 - A novel multi-agent dynamic portfolio optimization learning system based on hierarchical deep reinforcement learning
AU - Sun, Ruoyu
AU - Xi, Yue
AU - Stefanidis, Angelos
AU - Jiang, Zhengyong
AU - Su, Jionglong
N1 - Publisher Copyright:
© The Author(s) 2025.
PY - 2025/7
Y1 - 2025/7
N2 - Deep reinforcement learning (DRL) has been extensively used to address portfolio optimization problems. DRL agents acquire knowledge and make decisions through unsupervised interactions with their environment without requiring explicit knowledge of the joint dynamics of portfolio assets. Among DRL algorithms, the combination of actor-critic methods and deep function approximators is the most widely used. Here, we find that training a DRL agent with actor-critic algorithms and deep function approximators may lead to scenarios in which the improvement in the agent's risk-adjusted profitability is insignificant. We argue that such situations primarily arise from two problems: sparse positive rewards and the curse of dimensionality. These limitations prevent DRL agents from comprehensively learning asset price change patterns in the training environment. As a result, the agents cannot effectively explore dynamic portfolio optimization policies that improve risk-adjusted profitability during training. To address these problems, we propose a novel multi-agent learning system based on a hierarchical deep reinforcement learning (HDRL) algorithmic framework. Under this framework, the agents work together as a learning system for portfolio optimization. Specifically, by designing an auxiliary agent that works with the executive agent for optimal policy exploration, the learning system can focus on exploring policies with higher risk-adjusted returns in the region of the action space with positive returns and low variance. The performance of the proposed learning system is evaluated on a portfolio of 29 stocks from the Dow Jones index in four different experiments. During training, the objective functions of both the actor and the critic ultimately achieve stable convergence, and the risk-adjusted profitability of our learning system in the training environment is significantly improved. Hence, we demonstrate that the policies executed by our learning system in out-of-sample experiments originate from the DRL agents' comprehensive learning of asset price change patterns in the training environment. Furthermore, we find that adopting the auxiliary agent and the HDRL training algorithm can efficiently overcome the curse of dimensionality and improve training efficiency in a sparse positive-reward environment. In each back-test experiment, the proposed learning system is compared with sixteen traditional strategies and ten machine-learning-based strategies in terms of profitability and risk control. The empirical results in the four evaluation experiments demonstrate the efficacy of our learning system, which outperforms all other strategies by at least 8.2% in terms of the Sharpe ratio, Sortino ratio, and Calmar ratio. This indicates that the policies learned in the training environment exhibit excellent generalization ability in the back-testing experiments.
AB - Deep reinforcement learning (DRL) has been extensively used to address portfolio optimization problems. DRL agents acquire knowledge and make decisions through unsupervised interactions with their environment without requiring explicit knowledge of the joint dynamics of portfolio assets. Among DRL algorithms, the combination of actor-critic methods and deep function approximators is the most widely used. Here, we find that training a DRL agent with actor-critic algorithms and deep function approximators may lead to scenarios in which the improvement in the agent's risk-adjusted profitability is insignificant. We argue that such situations primarily arise from two problems: sparse positive rewards and the curse of dimensionality. These limitations prevent DRL agents from comprehensively learning asset price change patterns in the training environment. As a result, the agents cannot effectively explore dynamic portfolio optimization policies that improve risk-adjusted profitability during training. To address these problems, we propose a novel multi-agent learning system based on a hierarchical deep reinforcement learning (HDRL) algorithmic framework. Under this framework, the agents work together as a learning system for portfolio optimization. Specifically, by designing an auxiliary agent that works with the executive agent for optimal policy exploration, the learning system can focus on exploring policies with higher risk-adjusted returns in the region of the action space with positive returns and low variance. The performance of the proposed learning system is evaluated on a portfolio of 29 stocks from the Dow Jones index in four different experiments. During training, the objective functions of both the actor and the critic ultimately achieve stable convergence, and the risk-adjusted profitability of our learning system in the training environment is significantly improved. Hence, we demonstrate that the policies executed by our learning system in out-of-sample experiments originate from the DRL agents' comprehensive learning of asset price change patterns in the training environment. Furthermore, we find that adopting the auxiliary agent and the HDRL training algorithm can efficiently overcome the curse of dimensionality and improve training efficiency in a sparse positive-reward environment. In each back-test experiment, the proposed learning system is compared with sixteen traditional strategies and ten machine-learning-based strategies in terms of profitability and risk control. The empirical results in the four evaluation experiments demonstrate the efficacy of our learning system, which outperforms all other strategies by at least 8.2% in terms of the Sharpe ratio, Sortino ratio, and Calmar ratio. This indicates that the policies learned in the training environment exhibit excellent generalization ability in the back-testing experiments.
KW - Hierarchical deep reinforcement learning
KW - Learning system
KW - Multi-agent
KW - Portfolio optimization
UR - http://www.scopus.com/inward/record.url?scp=105006910484&partnerID=8YFLogxK
U2 - 10.1007/s40747-025-01884-y
DO - 10.1007/s40747-025-01884-y
M3 - Article
AN - SCOPUS:105006910484
SN - 2199-4536
VL - 11
JO - Complex and Intelligent Systems
JF - Complex and Intelligent Systems
IS - 7
M1 - 311
ER -