INFO

A class of reinforcement learning algorithms that directly optimize the policy using gradient ascent on expected rewards, rather than estimating value functions like DQN does.

  • Exemplified by algorithms like REINFORCE¹ and PPO³
  • Directly parameterize the policy
    • Allow the agent to select actions by sampling from probabilities produced by a neural network
  • Useful when
    • the action space is high-dimensional or continuous
    • the goal is to optimize policy performance directly
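The "directly parameterize the policy" idea can be sketched in a few lines: a softmax turns raw preferences into action probabilities, and the agent samples from them. The three-action setup and the preference values below are hypothetical stand-ins for a neural network's output:

```python
import math
import random

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-action policy parameterized by raw preferences (logits).
theta = [0.2, 1.5, -0.3]
probs = softmax(theta)                               # action probabilities
action = random.choices(range(3), weights=probs)[0]  # stochastic selection
```

Because the parameters map straight to action probabilities, gradient ascent can adjust them without ever estimating a value function.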

Key Features

  1. Direct Policy Optimization
    • Learns the policy directly without relying on value functions
    • Ideal for continuous or high-dimensional action spaces
  2. Stochastic Policies
    • Output a probability distribution over actions, which gives built-in exploration
  3. High Variance, Low Bias
    • Monte Carlo gradient estimates are unbiased but can fluctuate widely between episodes
  4. Compatible with Actor-Critic Architectures²
    • Combines policy (actor) and value function (critic) for more stable learning
    • Critic estimates value to guide actor’s updates
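Direct policy optimization rests on the score function, the gradient of log π(a | θ). For a softmax policy this gradient has a simple closed form, sketched below (the three-action setup is hypothetical):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def grad_log_pi(theta, action):
    # For a softmax policy, d/d theta_k of log pi(action) is
    # indicator(k == action) - pi(k): raise the chosen action's
    # preference, lower the others in proportion to their probability.
    probs = softmax(theta)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

theta = [0.0, 0.0, 0.0]          # uniform policy: each action has prob 1/3
g = grad_log_pi(theta, action=2)
```

Note that the components sum to zero: increasing one action's preference necessarily decreases the others' probabilities.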

Footnotes

  1. A simple Monte Carlo method that directly estimates the policy gradient using complete episodes from the environment. It updates the policy parameters in the direction of the log probability of the actions taken, weighted by the return (cumulative reward) that followed. While simple, it suffers from high variance in its gradient estimates.
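The update in this footnote can be sketched on a toy two-armed bandit, where each "episode" is a single pull, so the return is just the reward. The reward model and learning rate are hypothetical:

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reward(action):
    # Hypothetical bandit: arm 1 pays ~1.0 on average, arm 0 pays ~0.0.
    return random.gauss(1.0 if action == 1 else 0.0, 0.1)

theta = [0.0, 0.0]  # policy parameters: one preference per arm
lr = 0.1
for _ in range(500):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]
    G = reward(a)  # one-step episode, so the return equals the reward
    # REINFORCE: theta <- theta + lr * G * grad log pi(a | theta)
    for k in range(2):
        theta[k] += lr * G * ((1.0 if k == a else 0.0) - probs[k])

final_probs = softmax(theta)  # should now favor the better-paying arm
```

Running the same loop with different seeds shows the variance problem: early lucky or unlucky returns can swing the parameters noticeably before the policy settles.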

  2. Uses two parts:

    • Actor: decides what action to take
    • Critic: evaluates how good that action was

    The critic provides feedback to the actor, helping it improve its decisions. This setup makes learning more stable and reduces the variance of the updates.
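A minimal sketch of this actor/critic interplay on a hypothetical one-state, two-action problem: the critic tracks the expected reward, and the actor is updated by the advantage (reward minus the critic's estimate) instead of the raw return. All constants below are illustrative:

```python
import math
import random

random.seed(1)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

theta = [0.0, 0.0]   # actor: action preferences
value = 0.0          # critic: estimate of the state's value
actor_lr, critic_lr = 0.1, 0.2

for _ in range(300):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]
    r = 1.0 if a == 1 else 0.0        # action 1 is the better choice
    advantage = r - value             # critic's feedback: better or worse than expected?
    value += critic_lr * (r - value)  # critic moves toward the observed reward
    for k in range(2):                # actor follows the advantage-weighted gradient
        theta[k] += actor_lr * advantage * ((1.0 if k == a else 0.0) - probs[k])

probs = softmax(theta)
```

Because the advantage is centered near zero once the critic is accurate, the actor's updates shrink as the policy converges, which is the stabilizing effect the footnote describes.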

  3. A method that updates the policy in small, careful steps. It avoids making big changes at once, which keeps training steady → makes PPO reliable and popular for tough problems
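PPO's "avoid big changes" idea is usually implemented as a clipped surrogate objective: the probability ratio between the new and old policy is clipped to [1 − ε, 1 + ε]. A sketch, with ε = 0.2 as the commonly cited default and a made-up function name:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    # ratio = pi_new(a) / pi_old(a); clip it to the trust region.
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Take the minimum: a pessimistic bound that removes any incentive
    # to push the ratio outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped * advantage)
```

For a positive advantage, pushing the ratio past 1 + ε earns nothing extra; for a negative advantage, a ratio below 1 − ε is penalized at the clipped value. Either way, one update cannot move the policy far.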