Note

This notebook can be downloaded here: 02_RL_2_CarRacing.ipynb

Practical work: Reinforcement Learning

Part 2: Solving the CarRacing environment

In this environment, just like a real human would, we need to learn from pixels.
The observable space is really large, so we need to reduce it, either by hand-crafting features or by deep learning:

  • Hand-crafted features rely heavily on image processing (finding contours and lines, predicting motion, ...).

  • Deep learning on images relies heavily on convolutional neural networks to extract features from the pixels.

Then, we need to choose the next action, through control theory or a deep predictive model (the state space is also large). Note that the action space here is small.

Here, we will take the deep learning approach.

II - CarRacing

The easiest continuous control task to learn from pixels: a top-down racing environment.

Discrete control is reasonable in this environment, on/off discretization is fine.
The state consists of STATE_W x STATE_H pixels.
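For illustration only, such an on/off discretization could be as coarse as a handful of fixed [steering, gas, brake] triples; the five triples below are an assumption made for this sketch, while the agent later in this notebook uses a finer 11-action discretization (see convert_argmax_qval_to_env_action).

import numpy as np

# Coarse on/off discretization of the continuous [steering, gas, brake] action space (illustrative only)
DISCRETE_ACTIONS = np.array([
    [ 0.0, 0.0, 0.0],   # coast
    [-1.0, 0.0, 0.0],   # steer full left
    [+1.0, 0.0, 0.0],   # steer full right
    [ 0.0, 1.0, 0.0],   # full throttle
    [ 0.0, 0.0, 0.8],   # brake
])
# A discrete agent then picks an index and calls env.step(DISCRETE_ACTIONS[action_index]).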

Reward

The reward is -0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles on the track. For example, if you have finished in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points.

Solving

CarRacing-v0 defines "solving" as getting an average reward of 900 over 100 consecutive trials.
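As a quick sanity check, here is a small sketch of the reward arithmetic and of the "solved" criterion (the helper names are illustrative, not part of the original notebook):

import numpy as np

def final_score(n_frames, n_tiles_visited, n_tiles_total):
    # -0.1 per frame, +1000/N for each of the N track tiles that was visited
    return -0.1 * n_frames + (1000.0 / n_tiles_total) * n_tiles_visited

print(final_score(732, 300, 300))  # ~926.8 points: the example above, assuming a fully visited 300-tile track

def is_solved(scores):
    # "Solved" = average reward of at least 900 over the last 100 consecutive trials
    return len(scores) >= 100 and np.mean(scores[-100:]) >= 900.0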

Track

The track is random every episode.
The episode finishes when all tiles are visited. The car can also go outside of the playfield, i.e. far off the track; it then receives -100 and dies.

Indicators

Some indicators are shown at the bottom of the window and in the state RGB buffer:

  • true speed

  • 4 ABS sensors

  • steering wheel position

  • gyroscope

Environment created by Oleg Klimov. Licensed on the same terms as the rest of OpenAI Gym.

Technical fixes

HDPI fix 1

Also, to fix an oversized window, you can change the display DPI on your Mac, or change:
WINDOW_W = 1200
WINDOW_H = 1000

to smaller values in gym/envs/box2d/car_racing.py.

HDPI fix 2

Change line 386 of car_racing.py to:
gl.glViewport(0, 0, WINDOW_W*2, WINDOW_H*2)
avconv -version  # check that avconv is available (gym needs it to record videos)
# if not installed (macOS): brew install libav

1 - Play by yourself!

# Try to play by yourself!
import gym
from pyglet.window import key
import numpy as np
import time

bool_do_not_quit = True  # Boolean to quit pyglet
scores = []  # Your gaming score
a = np.array( [0.0, 0.0, 0.0] )  # Actions

def key_press(k, mod):
    global bool_do_not_quit, a, restart
    if k==0xff0d: restart = True
    if k==key.ESCAPE: bool_do_not_quit=False  # To Quit
    if k==key.Q: bool_do_not_quit=False  # To Quit
    if k==key.LEFT:  a[0] = -1.0
    if k==key.RIGHT: a[0] = +1.0
    if k==key.UP:    a[1] = +1.0
    if k==key.DOWN:  a[2] = +0.8   # set 1.0 for wheels to block to zero rotation

def key_release(k, mod):
    global a
    if k==key.LEFT  and a[0]==-1.0: a[0] = 0
    if k==key.RIGHT and a[0]==+1.0: a[0] = 0
    if k==key.UP:    a[1] = 0
    if k==key.DOWN:  a[2] = 0

def run_carRacing_asHuman(policy=None, record_video=False):
    global bool_do_not_quit, a, restart
    # env = CarRacing()
    env = gym.make('CarRacing-v0').env

    env.reset()
    env.render()
    if record_video:
        env.monitor.start('/tmp/video-test', force=True)
    env.viewer.window.on_key_press = key_press
    env.viewer.window.on_key_release = key_release

    while bool_do_not_quit:
        env.reset()
        total_reward = 0.0
        steps = 0
        restart = False
        t1 = time.time()  # Trial timer
        while bool_do_not_quit:
            state, reward, done, info = env.step(a)
            # time.sleep(1/10)  # Slow down to 10fps for us poor little human!
            total_reward += reward
            if steps % 200 == 0 or done:
                print("Step: {} | Reward: {:+0.2f}".format(steps, total_reward), "| Action:", a)
            steps += 1
            if not record_video: # Faster, but you can as well call env.render() every time to play full window.
                env.render()
            if done or restart:
                t1 = time.time()-t1
                scores.append(total_reward)
                print("Trial", len(scores), "| Score:", total_reward, '|', steps, "steps | %0.2fs."% t1)
                break
        if not bool_do_not_quit:  # Quit was requested during the trial
            t1 = time.time() - t1  # Elapsed time of the interrupted trial
            scores.append(total_reward)
            print("Trial", len(scores), "| Score:", total_reward, '|', steps, "steps | %0.2fs."% t1)
    env.close()

run_carRacing_asHuman()  # Run with human keyboard input
Track generation: 1140..1436 -> 296-tiles track
Track generation: 1107..1388 -> 281-tiles track
Step: 0 | Reward: +7.04 | Action: [0. 0. 0.]
Step: 200 | Reward: +22.76 | Action: [0. 1. 0.]
Step: 400 | Reward: +59.90 | Action: [0.  0.  0.8]
Step: 600 | Reward: +39.90 | Action: [1. 1. 0.]
Trial 2 | Score: 22.999999999999254 | 770 steps | 20.88s.
Track generation: 1025..1288 -> 263-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1068..1339 -> 271-tiles track
Step: 0 | Reward: +7.31 | Action: [0. 0. 0.]
Step: 200 | Reward: +31.75 | Action: [-1.  1.  0.]
Step: 400 | Reward: +48.79 | Action: [0.  0.  0.8]
Step: 600 | Reward: +158.42 | Action: [0. 0. 0.]
Step: 800 | Reward: +264.34 | Action: [0.  0.  0.8]
Trial 3 | Score: 261.1444444444442 | 833 steps | 1551202624.86s.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Plot your score
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(scores)+1), scores)
plt.title('My human performance on CarRacing-v0')
plt.ylabel('Score')
plt.xlabel('Human Episode')
plt.show()

2 - Create an agent to solve the problem

  • Reinforcement Learning

    • Policy Gradient

    • Imitation Learning

    • Deep Q-Learning

import gym
import numpy as np
# Define the Environments
env = gym.make('CarRacing-v0').env

# Number of Dimensions in the Observable Space and number of Control Actions in the Environments
print('Observation Space:', env.observation_space)
print('Action Space:', env.action_space)

print("\n")
print("Observation Space Param: 96x96x3 values for Red, Green and Blue pixels")
print("Observation Space Highs:", np.mean(env.observation_space.high))
print("Observation Space Lows:   ", np.mean(env.observation_space.low))
Observation Space: Box(96, 96, 3)
Action Space: Box(3,)


Observation Space Param: 96x96x3 values for Red, Green and Blue pixels
Observation Space Highs: 255.0
Observation Space Lows:    0.0
# Run the environment

from collections import deque
import gym
import numpy as np
from pyglet.window import key
import time

bool_quit = False

def key_press(k, mod):
    global bool_quit, action, restart
    if k==0xff0d: restart = True
    if k==key.ESCAPE: bool_quit=True  # To Quit (run_carRacing checks bool_quit)
    if k==key.Q: bool_quit=True  # To Quit
    if k==key.LEFT:  action[0] = -1.0
    if k==key.RIGHT: action[0] = +1.0
    if k==key.UP:    action[1] = +1.0
    if k==key.DOWN:  action[2] = +0.8   # set 1.0 for wheels to block to zero rotation

def key_release(k, mod):
    global action
    if k==key.LEFT  and action[0]==-1.0: action[0] = 0
    if k==key.RIGHT and action[0]==+1.0: action[0] = 0
    if k==key.UP:    action[1] = 0
    if k==key.DOWN:  action[2] = 0

def run_carRacing(policy, n_episodes=1000, max_t=1500, print_every=100, record_video=False):
    """Run the CarRacing-v0 environment.

    Params
    ======
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        print_every (int): how often to print average score (over last 100 episodes)

    Adapted from:
    https://gist.github.com/lmclupr/b35c89b2f8f81b443166e88b787b03ab

    """
    global bool_quit

    if policy.__class__.__name__ == 'Policy_Human':
        global action
        action = np.array( [0.0, 0.0, 0.0] ) # Global variable used for manual control actions with key_press
        env = gym.make('CarRacing-v0').env
        env.reset()  # This is mandatory for keyboard input
        env.render()  # This is mandatory for keyboard input
        env.viewer.window.on_key_press = key_press
        env.viewer.window.on_key_release = key_release
    else:
        print('** Evaluating', policy.__class__.__name__, '**')
        # Define the Environments
        env = gym.make('CarRacing-v0').env
        # Set random generator for reproducible runs
        env.seed(0)
        np.random.seed(0)

    if record_video:
        env.monitor.start('/tmp/video-test', force=True)

    scores_deque = deque(maxlen=100)
    scores = []
    trials_to_solve=[]

    for i_episode in range(1, n_episodes+1):
        rewards = []
        state = env.reset()
        if 'reset' in dir(policy):  # Check if the .reset method exists
            policy.reset(state)
        for t in range(max_t):  # Avoid episodes getting stuck
            action = policy.act(state)
            state, reward, done, info = env.step(action)
            rewards.append(reward)
            if 'memorize' in dir(policy):  # Check if the .memorize method exists
                policy.memorize(state, action, reward, done)
            # Environment must be rendered! If not, all pixels are white...
            env.render()  # (mode='rgb_array')
            if done:
                trials_to_solve.append(t)
                break
        # if t + 1 == max_t:
        #     print('This episode is stuck.')

        scores_deque.append(sum(rewards))
        scores.append(sum(rewards))

        if 'update' in dir(policy):  # Check if the .update method exists
            policy.update(state)  # Update the policy

        if i_episode % print_every == 0:
            print('Episode {}\tAverage Score: {:.2f}\tSteps: {:d}'.format(i_episode, np.mean(scores_deque), t))
        if np.mean(scores_deque) >= 900.0:
            print('Episode {}\tAverage Score: {:.2f}\tSteps: {:d}'.format(i_episode, np.mean(scores_deque), t))
            print('** Environment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(max(1, i_episode-100), np.mean(scores_deque)))
            break
        if bool_quit:
            break

    if np.mean(scores_deque) < 900.0:
        print('** The environment has never been solved!')
        print('   Mean scores on last 100 runs was < 900.0')
    if record_video:
        env.env.close()
    env.close()
    return scores, trials_to_solve
#  Performance plots
import matplotlib.pyplot as plt
%matplotlib inline
def plot_performance(scores):
    # Plot the policy performance
    fig = plt.figure()
    ax = fig.add_subplot(111)
    x = np.arange(1, len(scores) + 1)
    y = scores
    plt.scatter(x, y, marker='x', c=y)
    fit = np.polyfit(x, y, deg=4)
    p = np.poly1d(fit)
    plt.plot(x,p(x),"r--")
    plt.ylabel('Score')
    plt.xlabel('Episode #')
    plt.title(policy.__class__.__name__ +  ' performance on CarRacing-v0')
    plt.show()

def plot_trials_to_solve(trials_to_solve):
    # Plot the policy number of trials to solve the Environment
    fig = plt.figure()
    ax = fig.add_subplot(111)
    plt.hist(trials_to_solve, bins='auto', density=True, facecolor='g', alpha=0.75)
    plt.ylabel('Frequency')
    plt.xlabel('Number of Trial to solve')
    plt.title(policy.__class__.__name__ +  ' trials to solve CarRacing-v0')
    plt.show()

Temporal Difference: TD(0)
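As a reminder (a standard formulation, not spelled out in the original notebook), the Policy_TD0 agent below uses the one-step TD / Q-learning target in its memorize() method:

$$
G = r + \gamma \max_{a'} Q(s', a'),
\qquad
Q(s, a) \leftarrow Q(s, a) + \alpha \, \bigl(G - Q(s, a)\bigr)
$$

In the code, the network output for the chosen discrete action is simply regressed toward G with an MSE loss, so the optimizer step plays the role of the learning rate α of the tabular update.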

Image processing for information reduction

From a (96, 96, 3) = 27,648-value observation space down to 10*10 + 7 + 4 = 111 values.

Note that the problem can still be solved without this reduction, but it then requires substantial computing power.
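For reference, here is a minimal sketch of that deep-learning alternative: a small convolutional network mapping the raw 96x96x3 pixels to Q-values. The architecture and hyper-parameters below are illustrative assumptions, not what this notebook actually uses.

# Illustrative sketch only: a CNN that maps raw 96x96x3 pixels to Q-values
# for the 11 discrete actions used later in this notebook.
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense
from keras.optimizers import Adamax

def build_cnn_q_model(n_actions=11):
    model = Sequential()
    model.add(Conv2D(16, (8, 8), strides=(4, 4), activation='relu', input_shape=(96, 96, 3)))
    model.add(Conv2D(32, (4, 4), strides=(2, 2), activation='relu'))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dense(n_actions, activation='linear'))  # one Q-value per discrete action
    model.compile(loss='mse', optimizer=Adamax(lr=0.001))
    return model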

From the original (96, 96, 3) pixel image:

  • Crop the track view and downscale it to 10*10 pixels.

  • Extract 1 steering setting, 1 speed, 1 gyro and 4 ABS values from the black bar at the bottom of the window.

Resources

# env = wrappers.Monitor(env, "monitor-folder", force=True)  # video_callable=lambda episode_id: True
import cv2
# Hand-crafted image processing to extract: 10*10 pixels track, upper and car images
def transform(state):
    # crop_img = img[200:400, 100:300]  # Crop from x, y, w, h -> 100, 200, 300, 400
    # NOTE: its img[y: y + h, x: x + w] and *not* img[x: x + w, y: y + h]
    # bottom_black_bar is the section of the screen with steering, speed, abs and gyro information.
    # we crop off the digits on the right as they are illegible, even for ML.
    # since color is irrelevant, we convert it to grayscale.
    bottom_black_bar = state[84:, 12:]
    img = cv2.cvtColor(bottom_black_bar, cv2.COLOR_RGB2GRAY)
    bottom_black_bar_bw = cv2.threshold(img, 1, 255, cv2.THRESH_BINARY)[1]
    bottom_black_bar_bw = cv2.resize(bottom_black_bar_bw, (84, 12), interpolation = cv2.INTER_NEAREST)

    # upper_field = observation[:84, :96]  # This is the section of the screen that contains the track
    upper_field = state[:84, 6:90]  # We crop the sides of the screen as they carry little information
    img = cv2.cvtColor(upper_field, cv2.COLOR_RGB2GRAY)
    upper_field_bw = cv2.threshold(img, 120, 255, cv2.THRESH_BINARY)[1]
    upper_field_bw = cv2.resize(upper_field_bw, (10, 10), interpolation = cv2.INTER_NEAREST) # rescaled to 10*10 pixels
    upper_field_bw = upper_field_bw.astype('float') / 255

    car_field = state[66:78, 43:53]
    img = cv2.cvtColor(car_field, cv2.COLOR_RGB2GRAY)
    car_field_bw = cv2.threshold(img, 80, 255, cv2.THRESH_BINARY)[1]
    car_field_t = [
        car_field_bw[:, 3].mean() / 255,
        car_field_bw[:, 4].mean() / 255,
        car_field_bw[:, 5].mean() / 255,
        car_field_bw[:, 6].mean() / 255]

    return bottom_black_bar_bw, upper_field_bw, car_field_t
def convert_argmax_qval_to_env_action(output_value):
    # we reduce the action space to 11 values: 9 for steering, 1 for gaz, 1 for brake.
    # to reduce the action space, gaz and brake cannot be applied at the same time,
    # and steering input and gaz/brake cannot be applied at the same time either.
    # similarly to real-life driving: you brake/accelerate in a straight line, and you coast while steering.

    gaz = 0.0
    brake = 0.0
    steering = 0.0

    # output value ranges from 0 to 10

    if output_value <= 8:
        # steering. brake and gaz are zero.
        output_value -= 4
        steering = float(output_value) / 4
    elif output_value == 9:
        gaz = 1.0 / 3  # 33% gaz
    elif output_value == 10:
        brake = 0.5  # 50% brakes
    else:
        print("[WARNING] Error in convert_argmax_qval_to_env_action()")

    white = np.ones((round(brake * 100), 10))
    black = np.zeros((round(100 - brake * 100), 10))
    brake_display = np.concatenate((black, white))*255

    white = np.ones((round(gaz * 100), 10))
    black = np.zeros((round(100 - gaz * 100), 10))
    gaz_display = np.concatenate((black, white))*255

    control_display = np.concatenate((brake_display, gaz_display), axis=1)

    # cv2.namedWindow('controls', cv2.WINDOW_NORMAL)
    # cv2.resizeWindow('controls', 100, 20)
    # cv2.imshow('controls', control_display)
    # cv2.waitKey(1)

    return [steering, gaz, brake]
# This function uses the black bar at the window bottom to extract the steering setting, speed, gyro and ABS data
def compute_steering_speed_gyro_abs(a):
    right_steering = a[6, 36:46].mean() / 255
    left_steering = a[6, 26:36].mean() / 255
    steering = (right_steering - left_steering + 1.0) / 2

    left_gyro = a[6, 46:60].mean() / 255
    right_gyro = a[6, 60:76].mean() / 255
    gyro = (right_gyro - left_gyro + 1.0) / 2

    speed = a[:, 0][:-2].mean() / 255
    abs1 = a[:, 6][:-2].mean() / 255
    abs2 = a[:, 8][:-2].mean() / 255
    abs3 = a[:, 10][:-2].mean() / 255
    abs4 = a[:, 12][:-2].mean() / 255

    # white = np.ones((round(speed * 100), 10))
    # black = np.zeros((round(100 - speed * 100), 10))
    # speed_display = np.concatenate((black, white))*255

    # cv2.imshow('sensors', speed_display)
    # cv2.waitKey(1)

    return [steering, speed, gyro, abs1, abs2, abs3, abs4]
# Information reduction from the original image pixels:
#   - Crop the track to 10*10 pixels
#   - Extract 1 steering setting, 1 speed, 1 gyro and 4 abs values from the bar on the window bottom
state_space_dim = 10 * 10 + 7 + 4
print(state_space_dim)
111
from keras.models import load_model, Sequential
from keras.layers import Activation, Dense
from keras.optimizers import SGD, RMSprop, Adam, Adamax
import math
import os
from pathlib import Path
import random
import shutil
import time

state_space_dim = 10 * 10 + 7 + 4  # We extract a 10*10 image and values from the window bottom
# (alternative crop sizes, e.g. 7*7 or 14*14 for the track, could be used instead)

# Define a Temporal Difference Policy
class Policy_TD0():
    def __init__(self, state_space_dim=state_space_dim, action_space_dim=11):  # 11 discrete actions (see convert_argmax_qval_to_env_action)
        self.action_space_dim = action_space_dim
        self.state_space_dim = state_space_dim
        self.gamma = 0.99  # 0.95 is slow  # Discount rate
        # self.epsilon = 1.0  # 1.0 0.2 1.0<->0.1  # This should be tuned carefully
        self.epsilon = 0.5 / np.sqrt(901)
        # self.epsilon_min = 0.01  # 0.0001 0.001
        self.epsilon_decay = 0.995  # 0.995
        self.learning_rate = 0.002  # 0.001  # This should be tuned carefully
        self.learning_rate_decay = 0.01  # 0.01 Learning rate decay
        # self.batch_size = 64  # 32

        self.episode = 0  # Episode counter
        print('Deep Neural Networks to model the Q Table:')
        if os.path.exists('race-car.h5'):
            print('** Found a local race-car.h5 model.')
            self.model = load_model('race-car.h5')
            print('  race-car.h5 model loaded!')
        else:
            self.model = self._build_model(self.state_space_dim, action_space_dim)
        self.model.summary()
        # self.memory = []
        # self.memory = deque(maxlen=100000)  # We can limit the memory size

    def _build_model(self, state_space_dim, action_space_dim):
        # Neural Net for Deep-Q learning Model
        print('** Build a 2-FC-layer network to model the Q-table of this problem **')
        model = Sequential()
        model.add(Dense(512, input_dim=state_space_dim))  # 512 hidden units (matches the race-car.h5 summary below)
        model.add(Activation('relu'))
        model.add(Dense(action_space_dim))  # Linear output so we can have a range of real-valued outputs
        model.add(Activation('linear'))

        # Optimizer:
        # rms = RMSprop(lr=0.005)
        # sgd = SGD(lr=0.1, decay=0.0, momentum=0.0, nesterov=False)
        # Adam(lr=0.0005)
        # Adamax(lr=0.001)
        model.compile(loss='mse', optimizer=Adamax(lr=self.learning_rate, decay=self.learning_rate_decay))
        return model

    def act(self, state):
        self.qval = self.model.predict(self.current_state, verbose=0)[0]

        # epsilon = max(self.epsilon_min, min(self.epsilon, 1.0 - math.log10((self.episode + 1) * self.epsilon_decay)))
        if np.random.random() < self.epsilon:  # epsilon-greedy action selection
            self.argmax_qval = random.randint(0, 10)
            # action = env.action_space.sample()
        else:
            self.argmax_qval = np.argmax(self.qval)
        return convert_argmax_qval_to_env_action(self.argmax_qval)

    def memorize(self, next_state, action, reward, done):
        # Memorize all observables and environment trial results
        a, b, c = transform(next_state)
        next_state = np.concatenate(  # 7 + 10*10 + 4 = 111 values, all scaled in range 0..1
            (np.array([compute_steering_speed_gyro_abs(a)]).reshape(1,-1).flatten(),
             b.reshape(1,-1).flatten(), c), axis=0)
        next_state = np.array(next_state).reshape(1, self.state_space_dim)
        # self.memory.append((self.current_state, action, reward, next_state, done))

        # Standard Q-Learning TD(0)
        next_qval = self.model.predict(next_state, verbose=0)[0]
        G = reward + self.gamma * np.max(next_qval)
        y = self.qval[:]
        # y[np.argmax(self.qval)] = G
        y[self.argmax_qval] = G
        self.train(self.current_state, y)  # Update model on every run?
        self.current_state = next_state

    def update(self, state):
        self.episode += 1  # Increment trial counter
        # Train the Q-Network
        # if len(self.memory) > self.batch_size:  # If there are enough trial in memory
        # if len(self.memory) % 4:  # Train only every 4th trial
        # if len(self.memory) > 2000:  # We can lazy start to acquire more data before learn on it
        #     self.replay_to_train(self.batch_size)

        # Update epsilon
        # if self.epsilon > self.epsilon_min:
        #     self.epsilon *= self.epsilon_decay
        # self.epsilon = 0.5 / np.sqrt(self.episode + 1 + 900)
        self.epsilon = 0.5 / np.sqrt(self.episode + 900)

    def train(self, state, G):
        self.model.fit(state, np.array(G).reshape(1, self.action_space_dim), epochs=1, verbose=0)

    def replay_to_train(self, batch_size):  # TODO: implement TD
        state_batch, Q_batch = [], []
        minibatch = random.sample(self.memory, min(len(self.memory), batch_size))

        for state, action, reward, next_state, done in minibatch:
            Q_target = self.model.predict(state)[0]  # Use the model to predict the target
            if done :  # Full reward because environment was solved
                Q_target[np.argmax(Q_target)] = reward
            else:  # Discount the reward by gamma because environment was not solved
                Q_target[np.argmax(Q_target)] = reward + self.gamma * np.max(self.model.predict(next_state)[0])

            state_batch.append(state[0])
            Q_batch.append(Q_target)
        self.model.fit(np.array(state_batch), np.array(Q_batch), batch_size=len(state_batch), verbose=0)

    def reset(self, state):
        # Set current_state
        a, b, c = transform(state)  # 7 + 10*10 + 4 = 111 values, all scaled in range 0..1
        self.current_state = np.concatenate((np.array([compute_steering_speed_gyro_abs(a)]).reshape(1, -1).flatten(),
                b.reshape(1, -1).flatten(), c), axis=0).reshape(1, self.state_space_dim)

    def save_model(self):
        filepath = 'race-car.h5'
        if Path(filepath).exists():
            timestamp = os.stat(filepath).st_ctime
            print('** OLD', filepath, 'exists! Created:', time.ctime(timestamp))
            shutil.copy(filepath, filepath + '_' + str(timestamp))
        self.model.save('race-car.h5')
        print('** Model saved to', filepath, '!')

policy = Policy_TD0()
scores, trials_to_solve = run_carRacing(policy, n_episodes=102, max_t=1000, print_every=1)
Deep Neural Networks to model the Q Table:
** Found a local race-car.h5 model.
  race-car.h5 model loaded!
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 512)               57344
_________________________________________________________________
activation_1 (Activation)    (None, 512)               0
_________________________________________________________________
dense_2 (Dense)              (None, 11)                5643
_________________________________________________________________
activation_2 (Activation)    (None, 11)                0
=================================================================
Total params: 62,987
Trainable params: 62,987
Non-trainable params: 0
_________________________________________________________________
** Evaluating Policy_TD0 **
Track generation: 1143..1442 -> 299-tiles track
Episode 1   Average Score: -13.34   Steps: 979
Track generation: 1087..1369 -> 282-tiles track
Episode 2   Average Score: 89.24    Steps: 999
Track generation: 964..1212 -> 248-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1176..1474 -> 298-tiles track
Episode 3   Average Score: 158.60   Steps: 999
Track generation: 1283..1608 -> 325-tiles track
Episode 4   Average Score: 174.19   Steps: 999
Track generation: 1217..1526 -> 309-tiles track
Episode 5   Average Score: 208.32   Steps: 999
Track generation: 1096..1374 -> 278-tiles track
Episode 6   Average Score: 208.07   Steps: 999
Track generation: 1198..1501 -> 303-tiles track
Episode 7   Average Score: 191.03   Steps: 999
Track generation: 1159..1453 -> 294-tiles track
Episode 8   Average Score: 192.62   Steps: 999
Track generation: 957..1205 -> 248-tiles track
Episode 9   Average Score: 175.85   Steps: 999
Track generation: 1181..1480 -> 299-tiles track
Episode 10  Average Score: 198.60   Steps: 999
Track generation: 979..1234 -> 255-tiles track
Episode 11  Average Score: 183.43   Steps: 257
Track generation: 1320..1654 -> 334-tiles track
Episode 12  Average Score: 181.59   Steps: 999
Track generation: 1067..1338 -> 271-tiles track
Episode 13  Average Score: 184.14   Steps: 999
Track generation: 1067..1338 -> 271-tiles track
Episode 14  Average Score: 189.51   Steps: 999
Track generation: 1207..1513 -> 306-tiles track
Episode 15  Average Score: 193.16   Steps: 999
Track generation: 1106..1396 -> 290-tiles track
Episode 16  Average Score: 184.44   Steps: 262
Track generation: 1296..1628 -> 332-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1047..1319 -> 272-tiles track
Episode 17  Average Score: 172.57   Steps: 760
Track generation: 1245..1560 -> 315-tiles track
Episode 18  Average Score: 176.19   Steps: 999
Track generation: 1232..1544 -> 312-tiles track
Episode 19  Average Score: 181.62   Steps: 999
Track generation: 1120..1408 -> 288-tiles track
Episode 20  Average Score: 177.29   Steps: 999
Track generation: 1225..1536 -> 311-tiles track
Episode 21  Average Score: 182.98   Steps: 999
Track generation: 1077..1357 -> 280-tiles track
Episode 22  Average Score: 174.58   Steps: 488
Track generation: 1322..1664 -> 342-tiles track
Episode 23  Average Score: 183.94   Steps: 999
Track generation: 1035..1297 -> 262-tiles track
Episode 24  Average Score: 182.61   Steps: 429
Track generation: 1297..1626 -> 329-tiles track
Episode 25  Average Score: 183.62   Steps: 999
Track generation: 1294..1621 -> 327-tiles track
Episode 26  Average Score: 181.33   Steps: 999
Track generation: 1055..1331 -> 276-tiles track
Episode 27  Average Score: 176.12   Steps: 338
Track generation: 1202..1507 -> 305-tiles track
Episode 28  Average Score: 177.42   Steps: 999
Track generation: 863..1087 -> 224-tiles track
Episode 29  Average Score: 186.72   Steps: 999
Track generation: 1227..1538 -> 311-tiles track
Episode 30  Average Score: 199.85   Steps: 999
Track generation: 1317..1652 -> 335-tiles track
Episode 31  Average Score: 201.77   Steps: 999
Track generation: 1115..1402 -> 287-tiles track
Episode 32  Average Score: 211.90   Steps: 999
Track generation: 1168..1464 -> 296-tiles track
Episode 33  Average Score: 208.74   Steps: 872
Track generation: 1056..1324 -> 268-tiles track
Episode 34  Average Score: 207.11   Steps: 688
Track generation: 1204..1509 -> 305-tiles track
Episode 35  Average Score: 212.53   Steps: 999
Track generation: 1260..1579 -> 319-tiles track
Episode 36  Average Score: 214.41   Steps: 999
Track generation: 1100..1379 -> 279-tiles track
Episode 37  Average Score: 220.21   Steps: 999
Track generation: 1185..1485 -> 300-tiles track
Episode 38  Average Score: 220.23   Steps: 999
Track generation: 1034..1296 -> 262-tiles track
Episode 39  Average Score: 232.55   Steps: 999
Track generation: 1076..1352 -> 276-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1132..1419 -> 287-tiles track
Episode 40  Average Score: 232.72   Steps: 999
Track generation: 1292..1619 -> 327-tiles track
Episode 41  Average Score: 234.48   Steps: 999
Track generation: 1275..1598 -> 323-tiles track
Episode 42  Average Score: 232.72   Steps: 999
Track generation: 1056..1324 -> 268-tiles track
Episode 43  Average Score: 236.05   Steps: 999
Track generation: 1092..1377 -> 285-tiles track
Episode 44  Average Score: 231.00   Steps: 586
Track generation: 1127..1413 -> 286-tiles track
Episode 45  Average Score: 230.74   Steps: 999
Track generation: 1104..1384 -> 280-tiles track
Episode 46  Average Score: 235.86   Steps: 999
Track generation: 1131..1430 -> 299-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1195..1498 -> 303-tiles track
Episode 47  Average Score: 239.00   Steps: 999
Track generation: 1155..1448 -> 293-tiles track
Episode 48  Average Score: 237.08   Steps: 999
Track generation: 1317..1650 -> 333-tiles track
Episode 49  Average Score: 235.85   Steps: 999
Track generation: 1180..1479 -> 299-tiles track
Episode 50  Average Score: 236.79   Steps: 999
Track generation: 1170..1467 -> 297-tiles track
Episode 51  Average Score: 235.48   Steps: 573
Track generation: 1022..1282 -> 260-tiles track
Episode 52  Average Score: 236.68   Steps: 999
Track generation: 1230..1542 -> 312-tiles track
Episode 53  Average Score: 239.73   Steps: 999
Track generation: 1018..1278 -> 260-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1126..1412 -> 286-tiles track
Episode 54  Average Score: 240.20   Steps: 999
Track generation: 1251..1568 -> 317-tiles track
Episode 55  Average Score: 237.32   Steps: 936
Track generation: 1202..1507 -> 305-tiles track
Episode 56  Average Score: 232.17   Steps: 232
Track generation: 1265..1592 -> 327-tiles track
Episode 57  Average Score: 229.03   Steps: 999
Track generation: 1049..1315 -> 266-tiles track
Episode 58  Average Score: 231.49   Steps: 999
Track generation: 1345..1685 -> 340-tiles track
Episode 59  Average Score: 231.53   Steps: 999
Track generation: 929..1167 -> 238-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1059..1328 -> 269-tiles track
Episode 60  Average Score: 233.84   Steps: 999
Track generation: 981..1236 -> 255-tiles track
Episode 61  Average Score: 234.69   Steps: 999
Track generation: 1220..1529 -> 309-tiles track
Episode 62  Average Score: 236.05   Steps: 999
Track generation: 1062..1332 -> 270-tiles track
Episode 63  Average Score: 241.33   Steps: 999
Track generation: 1148..1439 -> 291-tiles track
Episode 64  Average Score: 241.12   Steps: 999
Track generation: 983..1235 -> 252-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1106..1387 -> 281-tiles track
Episode 65  Average Score: 240.24   Steps: 554
Track generation: 1068..1339 -> 271-tiles track
Episode 66  Average Score: 244.23   Steps: 999
Track generation: 1285..1611 -> 326-tiles track
Episode 67  Average Score: 241.20   Steps: 999
Track generation: 1224..1534 -> 310-tiles track
Episode 68  Average Score: 242.47   Steps: 999
Track generation: 1095..1373 -> 278-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1209..1515 -> 306-tiles track
Episode 69  Average Score: 245.68   Steps: 999
Track generation: 1167..1463 -> 296-tiles track
Episode 70  Average Score: 247.52   Steps: 999
Track generation: 1215..1523 -> 308-tiles track
Episode 71  Average Score: 249.51   Steps: 999
Track generation: 1292..1619 -> 327-tiles track
Episode 72  Average Score: 252.92   Steps: 999
Track generation: 1099..1378 -> 279-tiles track
Episode 73  Average Score: 255.52   Steps: 999
Track generation: 1179..1478 -> 299-tiles track
Episode 74  Average Score: 254.44   Steps: 999
Track generation: 1107..1388 -> 281-tiles track
Episode 75  Average Score: 253.66   Steps: 999
Track generation: 1209..1519 -> 310-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1233..1544 -> 311-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1234..1555 -> 321-tiles track
Episode 76  Average Score: 251.07   Steps: 999
Track generation: 1232..1553 -> 321-tiles track
Episode 77  Average Score: 247.70   Steps: 614
Track generation: 1214..1522 -> 308-tiles track
Episode 78  Average Score: 247.08   Steps: 999
Track generation: 1151..1450 -> 299-tiles track
Episode 79  Average Score: 244.98   Steps: 999
Track generation: 1213..1529 -> 316-tiles track
Episode 80  Average Score: 242.93   Steps: 999
Track generation: 1252..1569 -> 317-tiles track
Episode 81  Average Score: 242.20   Steps: 583
Track generation: 1040..1304 -> 264-tiles track
Episode 82  Average Score: 241.67   Steps: 471
Track generation: 1128..1414 -> 286-tiles track
Episode 83  Average Score: 244.15   Steps: 999
Track generation: 1121..1408 -> 287-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1048..1319 -> 271-tiles track
Episode 84  Average Score: 246.18   Steps: 999
Track generation: 1224..1534 -> 310-tiles track
Episode 85  Average Score: 246.45   Steps: 999
Track generation: 1140..1429 -> 289-tiles track
Episode 86  Average Score: 248.44   Steps: 999
Track generation: 1114..1400 -> 286-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1052..1319 -> 267-tiles track
Episode 87  Average Score: 250.57   Steps: 999
Track generation: 1045..1313 -> 268-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1131..1418 -> 287-tiles track
Episode 88  Average Score: 252.82   Steps: 999
Track generation: 1122..1412 -> 290-tiles track
Episode 89  Average Score: 249.96   Steps: 231
Track generation: 1239..1553 -> 314-tiles track
Episode 90  Average Score: 249.05   Steps: 999
Track generation: 1022..1283 -> 261-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1123..1408 -> 285-tiles track
Episode 91  Average Score: 249.59   Steps: 999
Track generation: 1158..1462 -> 304-tiles track
Episode 92  Average Score: 248.59   Steps: 999
Track generation: 1313..1645 -> 332-tiles track
Episode 93  Average Score: 248.80   Steps: 999
Track generation: 1000..1254 -> 254-tiles track
Episode 94  Average Score: 248.66   Steps: 999
Track generation: 1259..1578 -> 319-tiles track
Episode 95  Average Score: 246.75   Steps: 999
Track generation: 1316..1655 -> 339-tiles track
Episode 96  Average Score: 244.65   Steps: 999
Track generation: 1086..1367 -> 281-tiles track
Episode 97  Average Score: 244.74   Steps: 999
Track generation: 988..1238 -> 250-tiles track
Episode 98  Average Score: 241.85   Steps: 223
Track generation: 1072..1344 -> 272-tiles track
Episode 99  Average Score: 240.71   Steps: 999
Track generation: 1152..1444 -> 292-tiles track
Episode 100 Average Score: 237.61   Steps: 413
Track generation: 1066..1341 -> 275-tiles track
Episode 101 Average Score: 242.11   Steps: 999
Track generation: 1070..1341 -> 271-tiles track
Episode 102 Average Score: 242.06   Steps: 724
** The environment has never been solved!
   Mean scores on last 100 runs was < 900.0
** Mean average score: 239.06616008258862
[Output plots: Policy_TD0 performance on CarRacing-v0, and histogram of trials to solve]
print('** Mean average score:', np.mean(scores))
plot_performance(scores)
plot_trials_to_solve(trials_to_solve)
** Mean average score: 239.06616008258862
[Output plots: Policy_TD0 performance on CarRacing-v0, and histogram of trials to solve]
# Save the model
policy.save_model()

Going further

On self-driving cars

Comma.ai’s real self-driving dataset & solutions
Self Racing Car challenge & dataset

Reinforcement learning in latent space: World Models

David Silver's course

Reinforcement Learning course page, Videos on YouTube

The book by Richard Sutton and Andrew Barto

Reinforcement Learning: An Introduction

OpenAI documentation

Technical tips