INFO
Simplifies trust region policy optimization by constraining policy updates within a predefined clipping threshold → providing stability and reliability during training
- An advanced deep reinforcement learning method (a practical successor to TRPO)
- Significantly reduces sensitivity to hyperparameter selection and ensures consistent learning across diverse environments
Components
- Stochastic Policy: Learns $\pi_\theta(a \mid s)$, a probability distribution over actions given the current state
- Clipped Objective: Limits how much the new policy can deviate from the old one during updates
- Actor-Critic Architecture[^1]: Actor updates the policy; critic estimates the value function
- Advantage Estimation: Uses Generalized Advantage Estimation (GAE) for variance reduction
- On-Policy Learning: Uses fresh trajectories from the current policy for updates
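The GAE component above can be sketched in a few lines of NumPy. This is a minimal illustration, not any library's API: the function name `gae_advantages` and the toy reward/value numbers are made up for the example.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (GAE) over a single trajectory.

    `values` holds V(s_0)..V(s_T): one bootstrap entry more than `rewards`.
    """
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros_like(rewards)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.5, 0.5, 0.0])  # last entry bootstraps V(s_T)
advantages = gae_advantages(rewards, values)
print(advantages)
```

The parameter `lam` trades bias for variance: `lam=0` reduces to one-step TD residuals (low variance, higher bias), while `lam=1` recovers full Monte Carlo returns minus the baseline.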
Key Features
- Clipped Surrogate Objective
- Prevents large, destabilizing policy updates
- Objective: $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right]$
- Where: $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ is the probability ratio, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clipping threshold (typically 0.1–0.2)
- Trust Region Approximation
- Inspired by TRPO but avoids second-order derivatives
- Uses clipping instead of KL-divergence constraints
- Sample Efficiency
- More efficient than vanilla policy gradients
- Can reuse mini-batches for multiple epochs
- Wide Applicability
- Works well in high-dimensional, continuous control tasks
- Used in robotics, games, and simulated physics
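The clipped surrogate objective from the list above can be sketched directly in NumPy. This is an illustrative sketch, not a specific framework's implementation; the name `ppo_clip_loss` and the toy inputs are assumptions for the example.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """-E[min(r * A, clip(r, 1-eps, 1+eps) * A)], with r the prob. ratio."""
    ratio = np.exp(logp_new - logp_old)              # r_t(theta)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Negated so a gradient-descent optimizer maximizes the surrogate.
    return -np.mean(np.minimum(unclipped, clipped))

adv = np.array([1.0, 1.0])
# Identical old/new log-probs -> ratio 1 -> loss is simply -mean(adv)
print(ppo_clip_loss(np.zeros(2), np.zeros(2), adv))
```

Once the ratio drifts outside $[1-\epsilon,\ 1+\epsilon]$, the clipped branch caps the incentive to move further, which is what "prevents large, destabilizing policy updates" means concretely, and why the same mini-batch can safely be reused for several epochs.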
Footnotes
[^1]: Uses two parts:
    - Actor: decides what action to take
    - Critic: evaluates how good that action was

    The critic provides feedback to the actor, helping it improve its decisions. This setup makes learning more stable and reduces variance in the updates.