Reinforcement Learning Example
Below, I’ll provide a few conceptual examples of Reinforcement Learning (RL), followed by a basic Python code example for a simple scenario. RL involves an agent learning to maximize a reward by interacting with an environment, so the examples and code reflect that.
Simple Examples of Reinforcement Learning
- Child Learning to Walk
  - Agent: The child.
  - Environment: The floor or room.
  - Action: Take a step, crawl, or stand still.
  - Reward: Moving forward (+1), falling (-1).
  - Learning: The child learns through trial and error that taking balanced steps leads to progress.
- Training a Robot Arm to Pick Up Objects
  - Agent: The robot arm.
  - Environment: A table with objects.
  - Action: Move up, down, left, right, or grasp.
  - Reward: Successfully picking up an object (+10), dropping it (-5).
  - Learning: The arm adjusts its movements to maximize successful grabs.
- Grid World Game
  - Agent: A character in a grid.
  - Environment: A 3x3 grid with a goal and obstacles.
  - Action: Move up, down, left, or right.
  - Reward: Reaching the goal (+10), hitting a wall (-1).
  - Learning: The character learns the shortest path to the goal.
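To make the grid-world example above concrete, here is a minimal environment sketch. The specific layout (start at (0, 0), goal at (2, 2), one blocked cell at (1, 1)) and the class name are assumptions chosen for illustration; the rewards follow the bullet points above.

# Minimal 3x3 grid-world environment sketch.
# Assumed layout: start at (0, 0), goal at (2, 2), one blocked cell at (1, 1).
class GridWorld:
    def __init__(self):
        self.goal = (2, 2)
        self.obstacle = (1, 1)
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        # Actions: 0 = up, 1 = down, 2 = left, 3 = right
        moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
        dr, dc = moves[action]
        r, c = self.state
        nr, nc = r + dr, c + dc
        # Stepping off the grid or into the blocked cell counts as hitting a wall: -1, stay put
        if not (0 <= nr <= 2 and 0 <= nc <= 2) or (nr, nc) == self.obstacle:
            return self.state, -1, False
        self.state = (nr, nc)
        if self.state == self.goal:
            return self.state, 10, True   # reaching the goal: +10, episode ends
        return self.state, 0, False       # an ordinary move

An agent would call reset() at the start of each episode and then call step(action) repeatedly, which is exactly the pattern the 1D code example below uses.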
Simple Python Code Example: Q-Learning in a Grid World
Here’s a basic implementation of Q-Learning, a popular RL algorithm, in a simple 1D “world” where an agent moves left or right to reach a goal. The agent learns by updating a Q-table based on rewards.
import numpy as np
import random

# Environment setup: 1D world with 5 positions (0 to 4), goal at position 4
state_space = 5   # Positions: [0, 1, 2, 3, 4]
action_space = 2  # Actions: 0 = move left, 1 = move right
goal = 4

# Initialize Q-table with zeros (states x actions)
q_table = np.zeros((state_space, action_space))

# Hyperparameters
learning_rate = 0.1
discount_factor = 0.9
exploration_rate = 1.0
exploration_decay = 0.995
min_exploration_rate = 0.01
episodes = 1000

# Reward function
def get_reward(state):
    if state == goal:
        return 10  # Big reward for reaching the goal
    return -1      # Small penalty for each step

# Step function: Move agent and get new state
def step(state, action):
    if action == 0:  # Move left
        new_state = max(0, state - 1)
    else:            # Move right
        new_state = min(state_space - 1, state + 1)
    reward = get_reward(new_state)
    done = (new_state == goal)
    return new_state, reward, done

# Training loop
for episode in range(episodes):
    state = 0  # Start at position 0
    done = False
    while not done:
        # Exploration vs Exploitation
        if random.uniform(0, 1) < exploration_rate:
            action = random.randint(0, action_space - 1)  # Explore
        else:
            action = np.argmax(q_table[state])  # Exploit
        # Take action and observe result
        new_state, reward, done = step(state, action)
        # Update Q-table using the Q-learning formula
        old_value = q_table[state, action]
        next_max = np.max(q_table[new_state])
        new_value = (1 - learning_rate) * old_value + learning_rate * (reward + discount_factor * next_max)
        q_table[state, action] = new_value
        # Move to new state
        state = new_state
    # Decay exploration rate
    exploration_rate = max(min_exploration_rate, exploration_rate * exploration_decay)

# Test the learned policy
state = 0
steps = 0
print("Testing the learned policy:")
while state != goal:
    action = np.argmax(q_table[state])
    state, reward, done = step(state, action)
    steps += 1
    print(f"Step {steps}: Moved to state {state}, Action: {'Left' if action == 0 else 'Right'}")
print(f"Reached goal in {steps} steps!")

# Print the Q-table
print("\nLearned Q-table:")
print(q_table)
Explanation of the Code
- Environment: A 1D line with 5 positions (0 to 4). The goal is at position 4.
- Actions: The agent can move left (0) or right (1).
- Rewards: +10 for reaching the goal, -1 for each step (to encourage efficiency).
- Q-Table: A table storing the expected future rewards for each state-action pair.
- Q-Learning: The agent updates the Q-table using the formula Q(s, a) = (1 - α) * Q(s, a) + α * (reward + γ * max(Q(s', a'))), where α = learning rate, γ = discount factor, s = current state, a = action, and s' = next state (a worked example with concrete numbers follows this list).
- Exploration vs Exploitation: The agent sometimes picks random actions (exploration) and sometimes uses the Q-table (exploitation), controlled by exploration_rate.
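To make the update rule concrete, here is a single Q-learning update worked through with the hyperparameters from the code above. The starting Q-values in it (0.0 and 5.0) are made-up numbers chosen only for illustration.

# One hand-worked Q-learning update using the hyperparameters from the code above.
learning_rate = 0.1
discount_factor = 0.9

old_value = 0.0   # assumed current Q(s=2, a=right), just for illustration
reward = -1       # step penalty returned by get_reward() for a non-goal state
next_max = 5.0    # assumed best Q-value available from the next state (s' = 3)

new_value = (1 - learning_rate) * old_value + learning_rate * (reward + discount_factor * next_max)
print(round(new_value, 2))  # 0.35

Because each ordinary step costs -1, a Q-value only grows when the discounted value of the next state outweighs that penalty, which is what gradually pulls the learned policy toward the goal.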
Sample Output
Testing the learned policy:
Step 1: Moved to state 1, Action: Right
Step 2: Moved to state 2, Action: Right
Step 3: Moved to state 3, Action: Right
Step 4: Moved to state 4, Action: Right
Reached goal in 4 steps!
Learned Q-table:
[[-2.5 2.3]
[-1.8 4.5]
[-1.2 6.8]
[-0.5 9. ]
[ 0. 0. ]]
The agent learns to always move right from any position to reach the goal efficiently. The Q-table shows higher values for “move right” actions as you approach the goal (the exact numbers will vary from run to run because of the random exploration).
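To read the learned table programmatically, taking the arg-max of each row recovers the greedy policy. The sketch below reuses the illustrative numbers from the sample output; with a real run you would pass in the q_table produced by the training loop.

import numpy as np

# Recover the greedy policy from a learned Q-table (values copied from the sample output above).
q_table = np.array([[-2.5, 2.3],
                    [-1.8, 4.5],
                    [-1.2, 6.8],
                    [-0.5, 9.0],
                    [ 0.0, 0.0]])

greedy_actions = np.argmax(q_table, axis=1)  # best action index for each state
policy = ["Left" if a == 0 else "Right" for a in greedy_actions]
print(policy)  # ['Right', 'Right', 'Right', 'Right', 'Left'] -- state 4 is the goal, so its row was never updated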
This is a very basic example, but it captures the essence of RL. For more complex scenarios (e.g., 2D grids or games), you’d expand the state and action spaces accordingly!
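As a rough sketch of that extension (the 4x4 grid size and the index mapping are assumptions, not part of the example above), a 2D grid can be flattened into a single state index so the same Q-table machinery carries over:

import numpy as np

# Sketch: a hypothetical 4x4 grid flattened into 16 states with 4 actions,
# so the Q-table from the 1D example simply grows to shape (16, 4).
rows, cols = 4, 4
state_space = rows * cols  # one state per grid cell
action_space = 4           # up, down, left, right
q_table = np.zeros((state_space, action_space))

def to_index(row, col):
    """Map a (row, col) cell to a single Q-table row index."""
    return row * cols + col

print(q_table.shape)   # (16, 4)
print(to_index(2, 3))  # 11

From there, the training loop stays the same; only the step function needs to know about the grid layout, its walls, and the goal position.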