Lunar Lander
Application Overview
Mission Objective
“The lunar lander lets you land a simulated vehicle on the moon. It’s like a fun little video game that’s been used by a lot of reinforcement learning researchers.”
Goal: “You’re in command of a lunar lander that is rapidly approaching the surface of the moon. And your job is to fire the thrusters at the appropriate times to land it safely on the landing pad.”
Success vs Failure Examples
- Successful landing: “The lunar lander lands successfully, firing thrusters downward and to the left and right to position itself to land between these two yellow flags”
- Failed landing: “If the reinforcement learning algorithm’s policy does not do well, then this is what it might look like, where the lander unfortunately has crashed on the surface of the moon”
Action Space
Four Discrete Actions
On every time step, the agent can choose from:
- Do nothing (0): “The forces of inertia and gravity pull you towards the surface of the moon”
- Fire left thruster (1): “You see a little red dot come out on the left; that’s firing the left thruster, which will tend to push the lunar lander to the right”
- Fire main engine (2): “Thrusting down at the bottom here”
- Fire right thruster (3): “That’s firing the right thruster which will push you to the left”
Objective: “Your job is to keep on picking actions over time so as to land the lunar lander safely between these two flags here on the landing pad.”
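As a concrete illustration, here is a minimal sketch of stepping through this action space using the Gymnasium library’s LunarLander environment with a random policy (illustration only, not a trained agent; this assumes gymnasium with its Box2D extra is installed, and the environment id may be “LunarLander-v2” or “LunarLander-v3” depending on the installed version):

```python
import gymnasium as gym

# Lunar Lander environment (requires gymnasium[box2d]); newer
# Gymnasium versions use the id "LunarLander-v3" instead of "-v2".
env = gym.make("LunarLander-v2")

obs, info = env.reset(seed=42)
total_reward = 0.0

for _ in range(1000):
    # 0 = do nothing, 1 = fire left thruster,
    # 2 = fire main engine, 3 = fire right thruster
    action = env.action_space.sample()  # random actions, illustration only
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        break

env.close()
print(f"Episode return under a random policy: {total_reward:.1f}")
```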
State Space (8-Dimensional)
Position and Motion Variables
The state includes eight numbers:
Position and Velocity:
- x: Horizontal position (“how far to the left or right”)
- y: Vertical position (“how high up is it”)
- ẋ: Horizontal velocity (“how fast is it moving in the horizontal… directions”)
- ẏ: Vertical velocity (“how fast is it moving in the… vertical directions”)
Orientation:
- θ (theta): Angle (“how far is the lunar lander tilted to the left or tilted to the right”)
- θ̇ (theta dot): Angular velocity
Ground Contact:
- l: Left leg status (“whether the left leg is grounded, meaning whether or not the left leg is sitting on the ground”)
- r: Right leg status (“whether or not the right leg is sitting on the ground”)
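Continuing the sketch above, unpacking the 8-dimensional observation makes these variables explicit (the ordering below follows the environment’s documentation; in practice the leg indicators come back as 0.0/1.0 floats):

```python
# obs is the 8-dimensional state vector from env.reset()/env.step()
x, y, x_dot, y_dot, theta, theta_dot, left_leg, right_leg = obs

print(f"position (x, y):        ({x:+.2f}, {y:+.2f})")
print(f"velocity (ẋ, ẏ):        ({x_dot:+.2f}, {y_dot:+.2f})")
print(f"tilt θ, ang. vel. θ̇:    ({theta:+.2f}, {theta_dot:+.2f})")
print(f"legs grounded (l, r):   ({bool(left_leg)}, {bool(right_leg)})")
```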
Reward Function
Complex Multi-Component Rewards
“This is a moderately complex reward function. The designers of the lunar lander application actually put some thought into exactly what behavior you want and codified it in the reward function.”
Landing Success/Failure
- Successful landing: “If it manages to get to the landing pad, it receives a reward between 100 and 140, depending on how well it has flown and gotten to the center of the landing pad”
- Crash: “If it crashes it gets a large -100 reward”
- Soft landing: “If it achieves a soft landing, that is, a landing that’s not a crash, it gets a +100 reward”
Continuous Feedback Components
- Position incentive: “It gets an additional reward for moving toward or away from the pad: as it moves closer to the pad it receives a positive reward, and if it moves away and drifts away it receives a negative reward”
- Ground contact: “For each leg, the left leg or the right leg, that gets grounded, it receives a +10 reward”
Fuel Efficiency Incentives
“To encourage it not to waste too much fuel and fire thrusters when it isn’t necessary”:
- Main engine: “-0.3 reward” each time fired
- Side thrusters: “-0.03 reward” each time fired
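To see how these components compose, here is a toy sketch of a per-step reward built from the pieces listed above. This is not the environment’s actual reward code (which also includes velocity- and tilt-based shaping terms), and every name and argument here is hypothetical:

```python
def sketch_reward(action, crashed, soft_landing, dist_delta,
                  left_leg_grounded, right_leg_grounded):
    """Toy illustration only -- NOT the environment's real reward."""
    reward = 0.0
    if crashed:                  # large penalty for crashing
        reward -= 100.0
    if soft_landing:             # large bonus for a soft landing
        reward += 100.0
    # Shaping: dist_delta is the change in distance to the pad this step,
    # so moving closer yields a positive reward, drifting away a negative one.
    reward -= dist_delta
    # +10 for each leg that makes ground contact
    reward += 10.0 * (left_leg_grounded + right_leg_grounded)
    if action == 2:              # main engine fired this step
        reward -= 0.3
    elif action in (1, 3):       # side thruster fired this step
        reward -= 0.03
    return reward
```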
Problem Formulation
Reinforcement Learning Goal
Objective: “Learn a policy π that, when given a state s, picks an action a = π(s), so as to maximize the return, the sum of discounted rewards.”
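As a sketch of the quantity being maximized, the return is R₁ + γR₂ + γ²R₃ + …; for a finished episode it can be computed as below (the function name is illustrative, and the default γ anticipates the value given in the next subsection):

```python
def discounted_return(rewards, gamma=0.985):
    """Sum of discounted rewards: r1 + gamma*r2 + gamma^2*r3 + ..."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the last reward backward
        g = r + gamma * g
    return g

# e.g. three steps: 0 + 0.985*(-0.3) + 0.985**2 * 100 ≈ 96.73
print(discounted_return([0.0, -0.3, 100.0]))
```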
Discount Factor
“Usually for the lunar lander we would use a fairly large value for gamma… In fact, we’ll use a value of gamma equal to 0.985, so pretty close to one.”
γ = 0.985
Reward Function Design Philosophy
Behavior Specification Advantage
“You find, when you’re building your own reinforcement learning application, it usually takes some thought to specify exactly what you want or don’t want and to codify that in the reward function.”
Key Advantage: “Specifying the reward function should still turn out to be much easier than specifying the exact right action to take from every single state, which is much harder for this and many other reinforcement learning applications.”
Incentive Structure
The reward function incentivizes:
- Moving toward landing pad
- Controlled descent (not too fast)
- Successful ground contact with both legs
- Fuel conservation
- Avoiding crashes
The lunar lander environment demonstrates how reinforcement learning can handle complex control tasks with continuous state spaces, multiple objectives, and sophisticated reward structures that would be difficult to specify through direct programming of optimal actions.