INFO
Simplifies trust region policy optimization by constraining policy updates within a predefined clipping threshold → providing stability and reliability during training
- An advanced deep reinforcement learning method (a practical successor to TRPO)
- Significantly reduces sensitivity to hyperparameter selection and ensures consistent learning across diverse environments
Components
- Stochastic Policy: Learns $\pi_\theta(a \mid s)$, a probability distribution over actions given the current state
- Clipped Objective: Limits how much the new policy can deviate from the old one during updates
- Actor-Critic Architecture[^1]: Actor updates the policy; critic estimates the value function
- Advantage Estimation: Uses Generalized Advantage Estimation (GAE) for variance reduction
- On-Policy Learning: Uses fresh trajectories from the current policy for updates
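The GAE component above can be sketched in a few lines of NumPy. This is a minimal illustration, not any library's API: the function name `gae_advantages` and the toy reward/value numbers are made up for the example.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (GAE) over a single trajectory.

    `values` holds V(s_0)..V(s_T): one bootstrap entry more than `rewards`.
    """
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros_like(rewards)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.5, 0.5, 0.0])  # last entry bootstraps V(s_T)
advantages = gae_advantages(rewards, values)
print(advantages)
```

The parameter `lam` trades bias for variance: `lam=0` reduces to one-step TD residuals (low variance, higher bias), while `lam=1` recovers full Monte Carlo returns minus the baseline.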
Key Features
- Clipped Surrogate Objective
- Prevents large, destabilizing policy updates
- Objective: $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right]$
- Where: $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ is the probability ratio, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clipping threshold (typically 0.1–0.2)
- Trust Region Approximation
- Inspired by TRPO but avoids second-order derivatives
- Uses clipping instead of KL-divergence constraints
- Sample Efficiency
- More efficient than vanilla policy gradients
- Can reuse mini-batches for multiple epochs
- Wide Applicability
- Works well in high-dimensional, continuous control tasks
- Used in robotics, games, and simulated physics
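The clipped surrogate objective from the list above can be sketched directly in NumPy. This is an illustrative sketch, not a specific framework's implementation; the name `ppo_clip_loss` and the toy inputs are assumptions for the example.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """-E[min(r * A, clip(r, 1-eps, 1+eps) * A)], with r the prob. ratio."""
    ratio = np.exp(logp_new - logp_old)              # r_t(theta)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Negated so a gradient-descent optimizer maximizes the surrogate.
    return -np.mean(np.minimum(unclipped, clipped))

adv = np.array([1.0, 1.0])
# Identical old/new log-probs -> ratio 1 -> loss is simply -mean(adv)
print(ppo_clip_loss(np.zeros(2), np.zeros(2), adv))
```

Once the ratio drifts outside $[1-\epsilon,\ 1+\epsilon]$, the clipped branch caps the incentive to move further, which is what "prevents large, destabilizing policy updates" means concretely, and why the same mini-batch can safely be reused for several epochs.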
Footnotes
[^1]: Uses two parts:
    - Actor: decides what action to take
    - Critic: evaluates how good that action was

    The critic provides feedback to the actor, helping it improve its decisions. This setup makes learning more stable and reduces variance in the updates.