Hi @Yiwen Zhang,
You asked, “When setting "centralized" learning strategy in rlMultiAgentTrainingOptions for PPO algorithm, do all the agents share one critic network? Does each agent have its own actor network? In other words, for a MAPPO training task with N agents, does the number of critic and actor be 1 and N?”
Please see my response to your comments below.
Based on the documentation linked below,
https://www.mathworks.com/help/reinforcement-learning/ref/rlmultiagenttrainingoptions.html
the actor and critic architecture of a typical MAPPO setup that uses rlMultiAgentTrainingOptions with a centralized learning strategy can be summarized as follows:
Critic Network: When you set the learning strategy to "centralized", all agents share a single critic network. The shared critic aggregates experiences across all agents, which leads to more stable training and improved performance. Because the centralized critic considers the states and actions of all agents, it can produce better value estimates and promote cooperative behavior.
Actor Networks: Each agent maintains its own distinct actor network. While the critic is shared, each agent's policy (actor) is independent, so each agent learns its own strategy from its own observations and interactions with the environment while still benefiting from the collective knowledge captured by the shared critic.
In summary, for a MAPPO training task with N agents configured under a centralized learning strategy:
Number of Critic Networks: 1 (shared among all agents)
Number of Actor Networks: N (one for each agent)
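As a rough illustration, here is a minimal, untested sketch of how such a setup could be configured and trained. The environment env, the number of agents N, and all training hyperparameters below are placeholders you would replace with your own values.

% Assumes "env" is an existing multi-agent environment with N agent channels.
N = 3;                               % example number of agents
obsInfo = getObservationInfo(env);   % 1-by-N array of observation specs
actInfo = getActionInfo(env);        % 1-by-N array of action specs

% Create one default PPO agent per agent index.
agents = [];
for k = 1:N
    agents = [agents, rlPPOAgent(obsInfo(k), actInfo(k))];
end

% Put all agents in a single group and let that group learn centrally.
trainOpts = rlMultiAgentTrainingOptions( ...
    AgentGroups = {1:N}, ...
    LearningStrategy = "centralized", ...
    MaxEpisodes = 1000, ...
    MaxStepsPerEpisode = 500, ...
    StopTrainingCriteria = "AverageReward", ...
    StopTrainingValue = 480);

% Train all agents against the multi-agent environment.
trainResults = train(agents, env, trainOpts);

Note that AgentGroups places all N agents into one group; with the default "auto" grouping, each agent would sit in its own group and there would be nothing to centralize.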
When implementing this architecture, make sure the shared critic can adequately process inputs from all agents; this typically means designing input layers that concatenate or otherwise aggregate the agents' observations. While sharing a critic has advantages, it can also add complexity when managing gradients during backpropagation and can cause interference among agents if their policies are not well aligned.
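For example, one way to build such a critic is to give it one input channel per agent and concatenate the channels before the value head. The sketch below is untested; the layer sizes, layer names, and the assumption that each observation spec is a flat feature vector are all placeholders.

obsInfo = getObservationInfo(env);   % 1-by-N array of observation specs
N = numel(obsInfo);

% One featureInputLayer per agent observation channel.
inputNames = strings(1, N);
criticNet = layerGraph();
for k = 1:N
    inputNames(k) = "obs" + k;
    criticNet = addLayers(criticNet, ...
        featureInputLayer(prod(obsInfo(k).Dimension), Name = inputNames(k)));
end

% Common trunk: concatenate all observations, then estimate a single value.
trunk = [
    concatenationLayer(1, N, Name = "concat")
    fullyConnectedLayer(128, Name = "fc1")
    reluLayer(Name = "relu1")
    fullyConnectedLayer(1, Name = "value")
    ];
criticNet = addLayers(criticNet, trunk);

% Wire every observation channel into the concatenation layer.
for k = 1:N
    criticNet = connectLayers(criticNet, inputNames(k), "concat/in" + k);
end

% Wrap the network as a value-function critic over all agents' observations.
critic = rlValueFunction(dlnetwork(criticNet), obsInfo, ...
    ObservationInputNames = inputNames);

Depending on your release and on how you construct the agents, the toolbox may handle much of this wiring for you, so treat the above only as a starting point for a custom critic.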
Hope this helps.
Please let me know if you have any further questions.