Create custom OpenAI Gym environment for Deep Reinforcement Learning (drl4t-04)

Xiaoguang Li
5 min read · Mar 26, 2023

Stock trading is a complex economic activity, and no one can give a formula that predicts it accurately. This is exactly where Deep Reinforcement Learning (DRL), which has been growing rapidly in recent years, can come into play.

Deep Reinforcement Learning (DRL) combines reinforcement learning algorithms with deep neural networks to learn how to make decisions in complex environments. In DRL, an agent learns to interact with an environment by taking actions and receiving feedback in the form of rewards. The goal of the agent is to learn a policy that maximizes its cumulative reward over time.

OpenAI Gym is a popular toolkit for developing reinforcement learning algorithms. It provides a collection of environments that simulate different scenarios for agents to learn in, such as playing Atari games, controlling robots, or navigating mazes. The environments are designed to be easy to use and provide a standardized interface for agents to interact with.

Install Gym

We will create a custom Gym Env class to simulate stock trading for deep reinforcement learning. Before we start coding, install the Gym library:

!pip install gym
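
Note that the environment in this article follows the classic Gym interface, where reset() returns only the observation and step() returns a four-item tuple (observation, reward, done, info). Gym releases from 0.26 onward changed those signatures, so if you run into compatibility issues, one option is to pin an earlier release:

!pip install "gym<0.26"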

Custom gym.Env Class

A Gym Env provides methods for the agent to take actions, observe the state, and receive rewards. The simplest custom Gym Env class is shown below and includes only two methods, step() and reset(). We will build on it from here.

import gym

class DRL4TEnv(gym.Env):
    metadata = {'render.modes': ['human']}

    def __init__(self):
        super(DRL4TEnv, self).__init__()

    def step(self, action):
        pass

    def reset(self):
        pass

Class Constructor

The first step is to pass the data and other parameters into the DRL4TEnv class and initialize its settings. The data used to train the model is a dictionary: the key of each item is a stock’s symbol, and the value contains that stock’s historical trading data and technical indicators. The dictionary can include thousands of stocks, and each episode uses one of them to train the model. (A sketch of the expected layout follows the parameter list below.)

Other parameters include:

  • indicator_columns: the columns of the data that will be used for training
  • price_column: the column of the data that will be used as the stock price
  • sample_days: how many previous trading days’ data can be observed at each step (day). Accordingly, we keep only stocks whose history is longer than sample_days
  • starting_balance: the initial amount of cash for each episode
  • commission_rate: the commission rate for buying and selling shares
  • random_on_reset: whether each episode selects a stock at random for training
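
Before looking at the constructor, here is a hypothetical sketch of what the data dictionary is expected to look like; the symbol, dates, and values are made up for illustration, and the real data comes from the download step described at the end of this article:

import pandas as pd

# Hypothetical example of the expected layout: a dict keyed by stock symbol,
# where each value is a DataFrame indexed by date that holds the price column
# and the technical-indicator columns listed above.
data = {
    'AAPL': pd.DataFrame({
        'Close':      [150.1, 151.3, 149.8],
        'SMARatio10': [1.01, 1.02, 0.99],
        'SMARatio20': [1.03, 1.04, 1.00],
        'MACD':       [0.5, 0.6, 0.4],
        'BBP':        [0.7, 0.8, 0.3],
        'CMF':        [0.1, 0.2, -0.1],
    }, index=pd.to_datetime(['2023-01-03', '2023-01-04', '2023-01-05'])),
    # ... one entry per symbol; in practice there are thousands of them,
    # each with far more rows than sample_days
}

With the data in this shape, the constructor stores the parameters and filters out stocks with too little history:
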
import numpy as np
import gym
from gym import spaces
import random

class DRL4TEnv(gym.Env):
    metadata = {'render.modes': ['human']}

    def __init__(self,
                 data,
                 indicator_columns=['SMARatio10', 'SMARatio20', 'MACD', 'BBP', 'CMF'],
                 price_column='Close',
                 sample_days=30,
                 starting_balance=100000,
                 commission_rate=0.001,
                 random_on_reset=True):
        super(DRL4TEnv, self).__init__()

        self.indicator_columns = indicator_columns
        self.price_column = price_column
        self.sample_days = sample_days
        self.starting_balance = starting_balance
        self.commission_rate = commission_rate
        self.random_on_reset = random_on_reset

        self.data = dict(filter(lambda item: len(item[1]) > sample_days, data.items()))

In the class constructor, we also initialize some state: cur_episode and cur_step point to the current position in the data, while cash and shares track the current holdings of cash and shares.

        self.cur_episode = self.next_episode() if random_on_reset else 0
        self.cur_step = self.sample_days

        self.action_space = spaces.Discrete(len(Actions))
        self.observation_space = spaces.Box(low=-np.inf,
                                            high=np.inf,
                                            shape=(len(self.cur_indicators.columns) * self.sample_days + 3,),
                                            dtype=np.float16)
        self.reward_range = (-np.inf, np.inf)

        self.cash = self.starting_balance
        self.shares = 0

The other important settings are:

  • action_space: the actions we will use are Hold, Buy, and Sell (as defined in the code below), so the action space is represented by three discrete values (0, 1, and 2)
  • observation_space: the observation is a flat vector of floating point numbers containing the technical indicators of the last sample_days days plus three additional positions holding the current cash, share value, and action
  • reward_range: the maximum and minimum reward that the environment can give
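
With the default parameters (five indicator columns and sample_days=30), the observation is therefore a vector of 5 × 30 + 3 = 153 values.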

Here is the definition of actions that we will use:

import enum

class Actions(enum.Enum):
    Hold = 0
    Buy = 1
    Sell = 2

Reset Method

The reset() method is used to initialize the environment when starting a new episode.

    def reset(self):
        self.cash = self.starting_balance
        self.shares = 0
        return self.next_observation(Actions.Hold.value)

Step Method

The step() method is used to execute an action, return the reward, and move to the next state. The major tasks here are calculating the new position pointers (cur_episode and cur_step), the reward (the increase or decrease of the balance), and whether the current episode is finished (done).

The return value consists of the observation, reward, done flag, and info, which will be used to predict the next action.

    def step(self, action):
        balance = self.cur_balance

        self.cur_step += 1
        if self.cur_step == self.total_steps:
            self.cur_episode = self.next_episode()
            self.cur_step = self.sample_days

        self.take_action(action)

        obs = self.next_observation(action)
        reward = self.cur_balance - balance
        done = self.cur_step == self.total_steps - 1
        info = {'Date': self.cur_data.index[self.cur_step].strftime('%Y-%m-%d'),
                'Reward': round(reward, 2),
                'Symbol': self.cur_symbol,
                'Action': Actions(action).name,
                'Shares': self.shares,
                'Close': round(self.cur_close_price, 2),
                'Cash': round(self.cash, 2),
                'Total': round(self.cur_balance, 2)}

        if done:
            self.reset()

        return obs, reward, done, info

The take_action() method used here updates the cash and share holdings based on whether the action is Buy or Sell.

    def take_action(self, action):
        if action == Actions.Buy.value:
            if self.shares == 0:
                # buy as many round lots of 100 shares as the cash allows,
                # paying the close price plus commission
                price = self.cur_close_price * (1 + self.commission_rate)
                self.shares = int(self.cash / price / 100) * 100
                self.cash -= self.shares * price
        elif action == Actions.Sell.value:
            if self.shares > 0:
                # sell the whole position at the close price minus commission
                price = self.cur_close_price * (1 - self.commission_rate)
                self.cash += self.shares * price
                self.shares = 0

The next_observation() method constructs the new observation, which includes the technical indicators of the last sample_days days and three additional positions holding the current cash, share value, and action.

    def next_observation(self, action):
        observation = []
        for i in range(self.sample_days, 0, -1):
            observation = np.append(observation, self.cur_indicators.values[self.cur_step - i + 1])
        return np.append(observation, [self.cash, self.shares * self.cur_close_price, action])

When the current episode is finished, the next_episode() method is used to select the stock to be used in the next episode.

    def next_episode(self):
        if self.random_on_reset:
            return random.randrange(0, self.total_episodes)
        else:
            return (self.cur_episode + 1) % self.total_episodes

Additional Properties

Listed here are the additional helper properties used to simplify the code above.

    @property
    def total_episodes(self):
        return len(self.data)

    @property
    def cur_symbol(self):
        return list(self.data.keys())[self.cur_episode]

    @property
    def cur_data(self):
        return self.data[self.cur_symbol]

    @property
    def cur_indicators(self):
        return self.cur_data[self.indicator_columns]

    @property
    def total_steps(self):
        return len(self.cur_data)

    @property
    def cur_close_price(self):
        return self.cur_data[self.price_column][self.cur_step]

    @property
    def cur_balance(self):
        return self.cash + (self.shares * self.cur_close_price)

How to use

Creating an instance of this trading environment is simple: we just need to download trading data using the method described in the previous article, Data preparation, where to get a list of all NYSE and NASDAQ stocks (drl4t-03). For convenience, I packaged it as a download() method and saved it in a file called “drl4t_data.py”.

from drl4t_data import download

train_data, test_data = download('nyse.csv')
env = DRL4TEnv(train_data)
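
To sanity-check the environment before training, we can drive it with randomly sampled actions; this is a minimal sketch assuming the classic Gym API used above (reset() returning the observation and step() returning a four-item tuple):

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random Hold/Buy/Sell
    obs, reward, done, info = env.step(action)
    print(info)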

Next time, we will use this trading environment to train a deep reinforcement learning model.
