📜  Asynchronous Advantage Actor Critic (A3C) Algorithm

📅  Last modified: 2022-05-13 01:54:50.689000             🧑  Author: Mango

The Asynchronous Advantage Actor Critic (A3C) algorithm is one of the newer algorithms developed in the field of deep reinforcement learning. It was developed by DeepMind, the artificial intelligence division of Google. The algorithm was first introduced in 2016 in a research paper titled Asynchronous Methods for Deep Reinforcement Learning.

Decoding the different parts of the algorithm's name:

  • Asynchronous: Unlike other popular deep reinforcement learning algorithms, such as Deep Q-Learning, which use a single agent and a single environment, this algorithm uses multiple agents, each with its own network parameters and its own copy of the environment. The agents interact with their respective environments asynchronously, learning with each interaction. Each agent is controlled by a global network. As each agent gains more knowledge, it contributes to the total knowledge of the global network. The presence of the global network allows each agent to be trained on more diversified data. This setup mimics the real-life environment in which humans live, where each person gains knowledge from the experiences of others, making the whole "global network" better.
  • Actor-Critic: Unlike some simpler techniques based either on value-iteration methods or on policy-gradient methods, the A3C algorithm combines the best parts of both: it predicts both the value function V(s) and the optimal policy function π(s). The learning agent uses the value of the value function (the Critic) to update the optimal policy function (the Actor). Note that the policy function here means a probability distribution over the action space. To be precise, the learning agent determines the conditional probability P(a|s; θ), the parameterized probability that the agent chooses action a in state s.
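
The actor-critic split can be sketched in a few lines (a minimal illustration assuming a linear model with one-hot state encodings — these modeling choices are illustrative, not from the paper): the actor outputs the distribution P(a|s; θ) over actions via a softmax, while the critic outputs a scalar value estimate V(s; θ_v).

```python
import numpy as np

n_states, n_actions = 4, 2                # toy sizes, chosen for illustration
theta = np.zeros((n_actions, n_states))   # actor parameters θ
theta_v = np.zeros(n_states)              # critic parameters θ_v

def actor(s_vec):
    """P(a|s; θ): softmax over linear logits."""
    logits = theta @ s_vec
    e = np.exp(logits - logits.max())     # subtract max for numerical stability
    return e / e.sum()

def critic(s_vec):
    """V(s; θ_v): scalar value estimate."""
    return float(theta_v @ s_vec)

s = np.array([1.0, 0.0, 0.0, 0.0])        # one-hot encoding of state 0
probs = actor(s)                          # with θ = 0, uniform: [0.5, 0.5]
```

With untrained (zero) parameters the policy is uniform over actions; training shifts probability mass toward actions the critic deems advantageous.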

Advantage: Typically, in an implementation of policy gradient, the value of the discounted returns (γr) tells the agent which of its actions were rewarded and which were penalized. Using the Advantage instead, the agent also learns how much better the rewards were than expected. This gives the agent new insight into the environment, and the learning process is better as a result. The advantage metric is given by the following expression:

Advantage: A = Q(s, a) – V(s)
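
As a quick numeric illustration (the numbers here are made up for this example): in practice the sampled discounted return R stands in for Q(s, a), so the advantage estimate becomes R – V(s), "how much better the outcome was than the critic expected".

```python
import numpy as np

# Hypothetical 3-step rollout: no reward until the last step.
rewards = np.array([0.0, 0.0, 1.0])
gamma = 0.9        # discount factor γ
V_s = 0.5          # critic's estimate V(s) for the first state (assumed)

# Discounted return from the first state: R = Σ_k γ^k r_k = 0.9² ≈ 0.81
R = sum(gamma**k * r for k, r in enumerate(rewards))

# Advantage estimate: the outcome beat the critic's expectation by ≈ 0.31
advantage = R - V_s
```

A positive advantage pushes the policy toward the chosen actions; a negative one pushes it away.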

The following pseudocode is adapted from the research paper mentioned above.

Define global shared parameter vectors θ and θ_v
Define global shared counter T = 0
Define thread-specific parameter vectors θ' and θ'_v
Define thread step counter t = 1
while ( T < T_max )
{
    dθ = 0
    dθ_v = 0
    θ' = θ
    θ'_v = θ_v
    t_start = t
    s = s_t
    while ( s_t is not terminal and t - t_start < t_max )
    {
        Simulate action a_t according to π(a_t | s_t; θ')
        Receive reward r_t and next state s_{t+1}
        t++
        T++
    }
    if ( s_t is terminal )
    {
        R = 0
    }
    else
    {
        R = V(s_t, θ'_v)
    }
    for ( i = t-1; i >= t_start; i-- )
    {
        R = r_i + γR
        dθ = dθ + ∇_θ' log( π(a_i | s_i; θ') ) ( R - V(s_i; θ'_v) )
        dθ_v = dθ_v + ∂( ( R - V(s_i; θ'_v) )² ) / ∂θ'_v
    }
    θ = θ + dθ
    θ_v = θ_v + dθ_v
}

where,

T_max – maximum number of iterations

dθ – change in the global parameter vector

R – total reward

π – policy function

V – value function

γ – discount factor
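
The pseudocode above can be sketched as a single-threaded worker in Python. This is a minimal illustration assuming a toy 4-state environment, one-hot state encodings, and linear softmax actor and linear critic — none of these modeling choices come from the paper. A real A3C run launches several such workers in parallel threads, each copying the shared parameters and pushing its accumulated gradients dθ, dθ_v back to them.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 4, 2      # toy sizes, chosen for illustration
GAMMA, LR, T_MAX_STEPS = 0.9, 0.1, 5

theta = np.zeros((N_ACTIONS, N_STATES))   # global policy parameters θ
theta_v = np.zeros(N_STATES)              # global value parameters θ_v

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def pi(s_vec, th):
    """Softmax policy π(a|s; θ') for a linear model."""
    logits = th @ s_vec
    e = np.exp(logits - logits.max())
    return e / e.sum()

def step_env(s, a):
    """Hypothetical toy dynamics: action 1 in state 3 ends the episode
    with reward +1; anything else moves to a random state, reward 0."""
    if s == 3 and a == 1:
        return None, 1.0
    return int(rng.integers(N_STATES)), 0.0

def worker_update(s):
    """One pass through the outer loop of the pseudocode (single thread)."""
    global theta, theta_v
    d_theta = np.zeros_like(theta)            # dθ = 0
    d_theta_v = np.zeros_like(theta_v)        # dθ_v = 0
    th, th_v = theta.copy(), theta_v.copy()   # θ' = θ, θ'_v = θ_v
    rollout, R = [], 0.0
    for _ in range(T_MAX_STEPS):              # until terminal or t_max steps
        s_vec = one_hot(s, N_STATES)
        a = int(rng.choice(N_ACTIONS, p=pi(s_vec, th)))  # a_t ~ π(·|s_t; θ')
        s_next, r = step_env(s, a)            # receive r_t and s_{t+1}
        rollout.append((s_vec, a, r))
        if s_next is None:                    # terminal state: R = 0
            break
        s = s_next
    else:
        R = float(th_v @ one_hot(s, N_STATES))  # bootstrap: R = V(s_t; θ'_v)
    for s_vec, a, r in reversed(rollout):     # i = t-1 down to t_start
        R = r + GAMMA * R                     # R = r_i + γR
        adv = R - float(th_v @ s_vec)         # R - V(s_i; θ'_v)
        p = pi(s_vec, th)
        # ∇_θ' log π(a_i|s_i; θ') for a softmax-linear policy:
        d_theta += np.outer(one_hot(a, N_ACTIONS) - p, s_vec) * adv
        # gradient of (R - V(s_i; θ'_v))² with respect to θ'_v:
        d_theta_v += -2.0 * adv * s_vec
    theta += LR * d_theta                     # ascend the policy objective
    theta_v -= LR * d_theta_v                 # descend the value loss

# Run many rollouts from random start states (one thread standing in
# for the asynchronous pool of workers).
for _ in range(200):
    worker_update(int(rng.integers(N_STATES)))
```

The θ_v update is applied as gradient descent on the accumulated squared-error gradient, which is the standard reading of the pseudocode's dθ_v accumulation.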

Advantages:

  • The algorithm is faster and more robust than standard reinforcement learning algorithms.
  • It performs better than other reinforcement learning techniques because of the diversification of knowledge described above.
  • It can be used on both discrete and continuous action spaces.