Lunar Lander
Application Overview
Mission Objective
“The lunar lander lets you land a simulated vehicle on the moon. It’s like a fun little video game that’s been used by a lot of reinforcement learning researchers.”
Goal: “You’re in command of a lunar lander that is rapidly approaching the surface of the moon. And your job is to fire the thrusters at the appropriate times to land it safely on the landing pad.”
Success vs Failure Examples
- Successful landing: “The lunar lander lands successfully, firing thrusters downward and to the left and right to position itself to land between these two yellow flags”
- Failed landing: “If the reinforcement learning algorithm’s policy does not do well, then this is what it might look like, where the lander unfortunately has crashed on the surface of the moon”
Action Space
Four Discrete Actions
On every time step, the agent can choose from:
- Do nothing (0): “The forces of inertia and gravity pull you towards the surface of the moon”
- Fire left thruster (1): “You see a little red dot come out on the left; that’s firing the left thruster, which will tend to push the lunar lander to the right”
- Fire main engine (2): “Thrusting down at the bottom here”
- Fire right thruster (3): “That’s firing the right thruster which will push you to the left”
Objective: “Your job is to keep on picking actions over time so as to land the lunar lander safely between these two flags here on the landing pad.”
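As a concrete illustration, here is a minimal sketch of stepping through this action space using the Gymnasium library’s LunarLander environment with a random policy (illustration only, not a trained agent; this assumes gymnasium with its Box2D extra is installed, and the environment id may be “LunarLander-v2” or “LunarLander-v3” depending on the installed version):

```python
import gymnasium as gym

# Lunar Lander environment (requires gymnasium[box2d]); newer
# Gymnasium versions use the id "LunarLander-v3" instead of "-v2".
env = gym.make("LunarLander-v2")

obs, info = env.reset(seed=42)
total_reward = 0.0

for _ in range(1000):
    # 0 = do nothing, 1 = fire left thruster,
    # 2 = fire main engine, 3 = fire right thruster
    action = env.action_space.sample()  # random actions, illustration only
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        break

env.close()
print(f"Episode return under a random policy: {total_reward:.1f}")
```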
State Space (8-Dimensional)
Position and Motion Variables
The state includes eight numbers:
Position and Velocity:
- x: Horizontal position (“how far to the left or right”)
- y: Vertical position (“how high up is it”)
- ẋ: Horizontal velocity (“how fast is it moving in the horizontal… directions”)
- ẏ: Vertical velocity (“how fast is it moving in the… vertical directions”)
Orientation:
- θ (theta): Angle (“how far is the lunar lander tilted to the left or tilted to the right”)
- θ̇ (theta dot): Angular velocity
Ground Contact:
- l: Left leg status (“whether the left leg is grounded, meaning whether or not the left leg is sitting on the ground”)
- r: Right leg status (“whether or not the right leg is sitting on the ground”)
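Continuing the sketch above, unpacking the 8-dimensional observation makes these variables explicit (the ordering below follows the environment’s documentation; in practice the leg indicators come back as 0.0/1.0 floats):

```python
# obs is the 8-dimensional state vector from env.reset()/env.step()
x, y, x_dot, y_dot, theta, theta_dot, left_leg, right_leg = obs

print(f"position (x, y):        ({x:+.2f}, {y:+.2f})")
print(f"velocity (ẋ, ẏ):        ({x_dot:+.2f}, {y_dot:+.2f})")
print(f"tilt θ, ang. vel. θ̇:    ({theta:+.2f}, {theta_dot:+.2f})")
print(f"legs grounded (l, r):   ({bool(left_leg)}, {bool(right_leg)})")
```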
Reward Function
Complex Multi-Component Rewards
“This is a moderately complex reward function. The designers of the lunar lander application actually put some thought into exactly what behavior you want and codified it in the reward function.”
Landing Success/Failure
- Successful landing: “If it manages to get to the landing pad, it receives a reward between 100 and 140, depending on how well it has flown and gotten to the center of the landing pad”
- Crash: “If it crashes it gets a large -100 reward”
- Soft landing: “If it achieves a soft landing, that is, a landing that’s not a crash, it gets a +100 reward”
Continuous Feedback Components
- Position incentive: “It gets an additional reward for moving toward or away from the pad: as it moves closer to the pad it receives a positive reward, and if it moves away and drifts away it receives a negative reward”
- Ground contact: “For each leg, the left leg or the right leg, that gets grounded, it receives a +10 reward”
Fuel Efficiency Incentives
“To encourage it not to waste too much fuel and fire thrusters when it isn’t necessary”:
- Main engine: “-0.3 reward” each time fired
- Side thrusters: “-0.03 reward” each time fired
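To see how these components compose, here is a toy sketch of a per-step reward built from the pieces listed above. This is not the environment’s actual reward code (which also includes velocity- and tilt-based shaping terms), and every name and argument here is hypothetical:

```python
def sketch_reward(action, crashed, soft_landing, dist_delta,
                  left_leg_grounded, right_leg_grounded):
    """Toy illustration only -- NOT the environment's real reward."""
    reward = 0.0
    if crashed:                  # large penalty for crashing
        reward -= 100.0
    if soft_landing:             # large bonus for a soft landing
        reward += 100.0
    # Shaping: dist_delta is the change in distance to the pad this step,
    # so moving closer yields a positive reward, drifting away a negative one.
    reward -= dist_delta
    # +10 for each leg that makes ground contact
    reward += 10.0 * (left_leg_grounded + right_leg_grounded)
    if action == 2:              # main engine fired this step
        reward -= 0.3
    elif action in (1, 3):       # side thruster fired this step
        reward -= 0.03
    return reward
```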
Problem Formulation
Reinforcement Learning Goal
Objective: “Learn a policy π that, when given a state s, picks an action a = π(s), so as to maximize the return, the sum of discounted rewards.”
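As a sketch of the quantity being maximized, the return is R₁ + γR₂ + γ²R₃ + …; for a finished episode it can be computed as below (the function name is illustrative, and the default γ anticipates the value given in the next subsection):

```python
def discounted_return(rewards, gamma=0.985):
    """Sum of discounted rewards: r1 + gamma*r2 + gamma^2*r3 + ..."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the last reward backward
        g = r + gamma * g
    return g

# e.g. three steps: 0 + 0.985*(-0.3) + 0.985**2 * 100 ≈ 96.73
print(discounted_return([0.0, -0.3, 100.0]))
```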
Discount Factor
“Usually for the lunar lander we would use a fairly large value for gamma… In fact, we’ll use a value of gamma equal to 0.985, so pretty close to one.”
γ = 0.985
Reward Function Design Philosophy
Behavior Specification Advantage
“You find, when you’re building your own reinforcement learning application, it usually takes some thought to specify exactly what you want or don’t want and to codify that in the reward function.”
Key Advantage: “Specifying the reward function should still turn out to be much easier than specifying the exact right action to take from every single state, which is much harder for this and many other reinforcement learning applications.”
Incentive Structure
The reward function incentivizes:
- Moving toward landing pad
- Controlled descent (not too fast)
- Successful ground contact with both legs
- Fuel conservation
- Avoiding crashes
The lunar lander environment demonstrates how reinforcement learning can handle complex control tasks with continuous state spaces, multiple objectives, and sophisticated reward structures that would be difficult to specify through direct programming of optimal actions.