📜  Asynchronous Advantage Actor Critic (A3C) Algorithm

📅  Last modified: 2022-05-13 01:54:50.689000             🧑  Author: Mango

The Asynchronous Advantage Actor Critic (A3C) algorithm is one of the newer algorithms developed in the field of deep reinforcement learning. It was developed by DeepMind, the artificial intelligence division of Google. The algorithm was first introduced in 2016 in a research paper titled Asynchronous Methods for Deep Reinforcement Learning.

Decoding the different parts of the algorithm's name:

  • Asynchronous: Unlike other popular deep reinforcement learning algorithms, such as Deep Q-Learning, which use a single agent and a single environment, this algorithm uses multiple agents, each with its own network parameters and its own copy of the environment. The agents interact with their respective environments asynchronously, learning with each interaction. Each agent is controlled by a global network. As each agent gains more knowledge, it contributes to the total knowledge of the global network. The presence of the global network allows each agent to be trained on more diversified data. This setup mimics the real-life environment in which humans live, where each person gains knowledge from the experiences of others, making the whole "global network" better.
  • Actor-Critic: Unlike some simpler techniques based either on value-iteration methods or on policy-gradient methods, the A3C algorithm combines the best parts of both: it predicts both the value function V(s) and the optimal policy function π(s). The learning agent uses the value of the value function (the Critic) to update the optimal policy function (the Actor). Note that the policy function here means a probability distribution over the action space. To be precise, the learning agent determines the conditional probability P(a|s; θ), the parameterized probability that the agent chooses action a in state s.
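
The actor-critic split can be sketched in a few lines (a minimal illustration assuming a linear model with one-hot state encodings — these modeling choices are illustrative, not from the paper): the actor outputs the distribution P(a|s; θ) over actions via a softmax, while the critic outputs a scalar value estimate V(s; θ_v).

```python
import numpy as np

n_states, n_actions = 4, 2                # toy sizes, chosen for illustration
theta = np.zeros((n_actions, n_states))   # actor parameters θ
theta_v = np.zeros(n_states)              # critic parameters θ_v

def actor(s_vec):
    """P(a|s; θ): softmax over linear logits."""
    logits = theta @ s_vec
    e = np.exp(logits - logits.max())     # subtract max for numerical stability
    return e / e.sum()

def critic(s_vec):
    """V(s; θ_v): scalar value estimate."""
    return float(theta_v @ s_vec)

s = np.array([1.0, 0.0, 0.0, 0.0])        # one-hot encoding of state 0
probs = actor(s)                          # with θ = 0, uniform: [0.5, 0.5]
```

With untrained (zero) parameters the policy is uniform over actions; training shifts probability mass toward actions the critic deems advantageous.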

Advantage: Typically, in an implementation of policy gradient, the value of the discounted returns (γr) tells the agent which of its actions were rewarded and which were penalized. Using the Advantage instead, the agent also learns how much better the rewards were than expected. This gives the agent new insight into the environment, and the learning process is better as a result. The advantage metric is given by the following expression:

Advantage: A = Q(s, a) – V(s)
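
As a quick numeric illustration (the numbers here are made up for this example): in practice the sampled discounted return R stands in for Q(s, a), so the advantage estimate becomes R – V(s), "how much better the outcome was than the critic expected".

```python
import numpy as np

# Hypothetical 3-step rollout: no reward until the last step.
rewards = np.array([0.0, 0.0, 1.0])
gamma = 0.9        # discount factor γ
V_s = 0.5          # critic's estimate V(s) for the first state (assumed)

# Discounted return from the first state: R = Σ_k γ^k r_k = 0.9² ≈ 0.81
R = sum(gamma**k * r for k, r in enumerate(rewards))

# Advantage estimate: the outcome beat the critic's expectation by ≈ 0.31
advantage = R - V_s
```

A positive advantage pushes the policy toward the chosen actions; a negative one pushes it away.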

The following pseudocode is adapted from the research paper mentioned above.

Define global shared parameter vectors θ and θ_v
Define global shared counter T = 0
Define thread-specific parameter vectors θ' and θ'_v
Define thread step counter t = 1
while ( T < T_max )
{
    dθ = 0
    dθ_v = 0
    θ' = θ
    θ'_v = θ_v
    t_start = t
    s = s_t
    while ( s_t is not terminal and t - t_start < t_max )
    {
        Simulate action a_t according to π(a_t | s_t; θ')
        Receive reward r_t and next state s_{t+1}
        t++
        T++
    }
    if ( s_t is terminal )
    {
        R = 0
    }
    else
    {
        R = V(s_t, θ'_v)
    }
    for ( i = t-1; i >= t_start; i-- )
    {
        R = r_i + γR
        dθ = dθ + ∇_θ' log( π(a_i | s_i; θ') ) ( R - V(s_i; θ'_v) )
        dθ_v = dθ_v + ∂( ( R - V(s_i; θ'_v) )² ) / ∂θ'_v
    }
    θ = θ + dθ
    θ_v = θ_v + dθ_v
}

where,

T_max – maximum number of iterations

dθ – change in the global parameter vector

R – total reward

π – policy function

V – value function

γ – discount factor
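
The pseudocode above can be sketched as a single-threaded worker in Python. This is a minimal illustration assuming a toy 4-state environment, one-hot state encodings, and linear softmax actor and linear critic — none of these modeling choices come from the paper. A real A3C run launches several such workers in parallel threads, each copying the shared parameters and pushing its accumulated gradients dθ, dθ_v back to them.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 4, 2      # toy sizes, chosen for illustration
GAMMA, LR, T_MAX_STEPS = 0.9, 0.1, 5

theta = np.zeros((N_ACTIONS, N_STATES))   # global policy parameters θ
theta_v = np.zeros(N_STATES)              # global value parameters θ_v

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def pi(s_vec, th):
    """Softmax policy π(a|s; θ') for a linear model."""
    logits = th @ s_vec
    e = np.exp(logits - logits.max())
    return e / e.sum()

def step_env(s, a):
    """Hypothetical toy dynamics: action 1 in state 3 ends the episode
    with reward +1; anything else moves to a random state, reward 0."""
    if s == 3 and a == 1:
        return None, 1.0
    return int(rng.integers(N_STATES)), 0.0

def worker_update(s):
    """One pass through the outer loop of the pseudocode (single thread)."""
    global theta, theta_v
    d_theta = np.zeros_like(theta)            # dθ = 0
    d_theta_v = np.zeros_like(theta_v)        # dθ_v = 0
    th, th_v = theta.copy(), theta_v.copy()   # θ' = θ, θ'_v = θ_v
    rollout, R = [], 0.0
    for _ in range(T_MAX_STEPS):              # until terminal or t_max steps
        s_vec = one_hot(s, N_STATES)
        a = int(rng.choice(N_ACTIONS, p=pi(s_vec, th)))  # a_t ~ π(·|s_t; θ')
        s_next, r = step_env(s, a)            # receive r_t and s_{t+1}
        rollout.append((s_vec, a, r))
        if s_next is None:                    # terminal state: R = 0
            break
        s = s_next
    else:
        R = float(th_v @ one_hot(s, N_STATES))  # bootstrap: R = V(s_t; θ'_v)
    for s_vec, a, r in reversed(rollout):     # i = t-1 down to t_start
        R = r + GAMMA * R                     # R = r_i + γR
        adv = R - float(th_v @ s_vec)         # R - V(s_i; θ'_v)
        p = pi(s_vec, th)
        # ∇_θ' log π(a_i|s_i; θ') for a softmax-linear policy:
        d_theta += np.outer(one_hot(a, N_ACTIONS) - p, s_vec) * adv
        # gradient of (R - V(s_i; θ'_v))² with respect to θ'_v:
        d_theta_v += -2.0 * adv * s_vec
    theta += LR * d_theta                     # ascend the policy objective
    theta_v -= LR * d_theta_v                 # descend the value loss

# Run many rollouts from random start states (one thread standing in
# for the asynchronous pool of workers).
for _ in range(200):
    worker_update(int(rng.integers(N_STATES)))
```

The θ_v update is applied as gradient descent on the accumulated squared-error gradient, which is the standard reading of the pseudocode's dθ_v accumulation.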

Advantages:

  • The algorithm is faster and more robust than standard reinforcement learning algorithms.
  • It performs better than other reinforcement learning techniques because of the diversification of knowledge described above.
  • It can be used on both discrete and continuous action spaces.