INFO

Deep Deterministic Policy Gradient (DDPG) is an off-policy, actor-critic algorithm designed for environments with continuous action spaces.

Components

  • Actor Network: Learns a deterministic policy that maps states to actions
  • Critic Network: Estimates the Q-value for given state-action pairs
  • Target Networks: Stabilize training by slowly updating target versions of actor and critic
  • Replay Buffer: Stores transitions for off-policy learning
  • Exploration Noise: Adds noise (e.g. Ornstein-Uhlenbeck) to actions for exploration in continuous action spaces
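
A minimal PyTorch sketch of these pieces (the framework choice, the two-layer MLP sizes, and the buffer capacity are illustrative assumptions, not part of the algorithm itself):

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: maps a state to exactly one action."""
    def __init__(self, state_dim, action_dim, max_action, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Q-network: scores a (state, action) pair."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class ReplayBuffer:
    """Fixed-size FIFO store of transitions for off-policy learning."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, float(done)))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = (torch.as_tensor(np.array(x), dtype=torch.float32)
                          for x in zip(*batch))
        return s, a, r.unsqueeze(-1), s2, d.unsqueeze(-1)
```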

Key Features

  1. Deterministic Policy
    • Unlike stochastic policy-gradient methods, DDPG uses a deterministic actor that outputs a single action per state
    • Efficient in high-dimensional action spaces, where sampling from a stochastic policy is costly
  2. Off-Policy Learning
    • Learns from stored experiences, improving sample efficiency
    • Enables reuse of past trajectories for training
  3. Actor-Critic Architecture
    • Actor proposes actions
    • Critic evaluates them
    • Critic guides actor updates via the gradient of its Q-value estimate with respect to the action (see the update sketch after this list)
  4. Target Networks
    • Reduce training instability by slowly updating the target parameters toward the online networks
  5. Exploration via Noise
    • Adds temporally correlated noise (e.g. Ornstein-Uhlenbeck) to actions to encourage exploration
    • Especially useful in physical control tasks (see the noise sketch below)
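
Putting features 2–4 together, a hedged sketch of one DDPG update step using the classes above; `gamma`, `tau`, the learning rates, and the Pendulum-style dimensions are assumed example values, not prescriptions:

```python
import copy

import torch
import torch.nn.functional as F

state_dim, action_dim, max_action = 3, 1, 2.0  # e.g. Pendulum-v1 (assumed)
gamma, tau = 0.99, 0.005                       # discount and soft-update rate

actor = Actor(state_dim, action_dim, max_action)
critic = Critic(state_dim, action_dim)
# Target networks start as exact copies, then trail the online networks.
actor_target = copy.deepcopy(actor)
critic_target = copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(buffer, batch_size=128):
    s, a, r, s2, d = buffer.sample(batch_size)

    # Critic: regress Q(s, a) toward the bootstrapped TD target,
    # computed entirely with the slow-moving target networks.
    with torch.no_grad():
        target_q = r + gamma * (1 - d) * critic_target(s2, actor_target(s2))
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the critic's Q-value, i.e. minimize -Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak) update: targets move a small step tau toward the online nets.
    with torch.no_grad():
        for net, target in ((actor, actor_target), (critic, critic_target)):
            for p, p_t in zip(net.parameters(), target.parameters()):
                p_t.mul_(1 - tau).add_(tau * p)
```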
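
And a sketch of feature 5, Ornstein-Uhlenbeck exploration noise added at action-selection time, reusing `actor` and `max_action` from above; `theta` and `sigma` are commonly used example values with the time-step factor folded in:

```python
import numpy as np
import torch

class OUNoise:
    """Temporally correlated noise: each step drifts back toward the mean,
    so consecutive perturbations are smooth rather than independent."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(action_dim, mu)

    def reset(self):
        # Call at the start of each episode.
        self.state = np.full_like(self.state, self.mu)

    def sample(self):
        dx = (self.theta * (self.mu - self.state)
              + self.sigma * np.random.randn(*self.state.shape))
        self.state = self.state + dx
        return self.state

noise = OUNoise(action_dim)

def act(state):
    """Deterministic action plus correlated noise, clipped to the valid range."""
    with torch.no_grad():
        a = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
    return np.clip(a + noise.sample(), -max_action, max_action)
```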