Create custom OpenAI Gym environment for Deep Reinforcement Learning (drl4t-04)

Xiaoguang Li
5 min read · Mar 26, 2023

Stock trading is a complex economic activity, and no one can give a formula that predicts it accurately. This is exactly where Deep Reinforcement Learning (DRL), which has been growing rapidly in recent years, can come into play.

Deep Reinforcement Learning (DRL) combines reinforcement learning algorithms with deep neural networks to learn how to make decisions in complex environments. In DRL, an agent learns to interact with an environment by taking actions and receiving feedback in the form of rewards. The goal of the agent is to learn a policy that maximizes its cumulative reward over time.

OpenAI Gym is a popular toolkit for developing reinforcement learning algorithms. It provides a collection of environments that simulate different scenarios for agents to learn in, such as playing Atari games, controlling robots, or navigating mazes. The environments are designed to be easy to use and provide a standardized interface for agents to interact with.

Install Gym

We will create a custom Gym Env class to simulate stock trading for deep reinforcement learning. Before we start coding, install the Gym library:

!pip install gym
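
Note that the environment in this article follows the classic Gym interface, where reset() returns only the observation and step() returns a four-item tuple (observation, reward, done, info). Gym releases from 0.26 onward changed those signatures, so if you run into compatibility issues, one option is to pin an earlier release:

!pip install "gym<0.26"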

Custom gym.Env Class

A Gym Env provides methods for the agent to take actions, observe the state, and receive rewards. The simplest custom Gym Env class is shown below and includes only two methods, step() and reset(). We will build on it from here.

import gym

class DRL4TEnv(gym.Env):
    metadata = {'render.modes': ['human']}

    def __init__(self):
        super(DRL4TEnv, self).__init__()

    def step(self, action):
        pass

    def reset(self):
        pass

Class Constructor

The first step is to pass the data and other parameters into the DRL4TEnv class and initialize its settings. The data used to train the model is a dictionary: the key of each item is a stock’s symbol, and the value contains that stock’s historical trading data and technical indicators. The dictionary can include thousands of stocks, and each episode uses one of them to train the model. (A sketch of the expected layout follows the parameter list below.)

Other parameters include:

  • indicator_columns: the columns of the data that will be used for training
  • price_column: the column of the data that will be used as the stock price
  • sample_days: how many previous trading days’ data can be observed at each step (day). Accordingly, we keep only stocks whose history is longer than sample_days
  • starting_balance: the initial amount of cash for each episode
  • commission_rate: the commission rate for buying and selling shares
  • random_on_reset: whether each episode selects a stock at random for training
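
Before looking at the constructor, here is a hypothetical sketch of what the data dictionary is expected to look like; the symbol, dates, and values are made up for illustration, and the real data comes from the download step described at the end of this article:

import pandas as pd

# Hypothetical example of the expected layout: a dict keyed by stock symbol,
# where each value is a DataFrame indexed by date that holds the price column
# and the technical-indicator columns listed above.
data = {
    'AAPL': pd.DataFrame({
        'Close':      [150.1, 151.3, 149.8],
        'SMARatio10': [1.01, 1.02, 0.99],
        'SMARatio20': [1.03, 1.04, 1.00],
        'MACD':       [0.5, 0.6, 0.4],
        'BBP':        [0.7, 0.8, 0.3],
        'CMF':        [0.1, 0.2, -0.1],
    }, index=pd.to_datetime(['2023-01-03', '2023-01-04', '2023-01-05'])),
    # ... one entry per symbol; in practice there are thousands of them,
    # each with far more rows than sample_days
}

With the data in this shape, the constructor stores the parameters and filters out stocks with too little history:
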
import numpy as np
import gym
from gym import spaces
import random

class DRL4TEnv(gym.Env):
    metadata = {'render.modes': ['human']}

    def __init__(self,
                 data,
                 indicator_columns=['SMARatio10', 'SMARatio20', 'MACD', 'BBP', 'CMF'],
                 price_column='Close',
                 sample_days=30,
                 starting_balance=100000,
                 commission_rate=0.001,
                 random_on_reset=True):
        super(DRL4TEnv, self).__init__()

        self.indicator_columns = indicator_columns
        self.price_column = price_column
        self.sample_days = sample_days
        self.starting_balance = starting_balance
        self.commission_rate = commission_rate
        self.random_on_reset = random_on_reset

        self.data = dict(filter(lambda item: len(item[1]) > sample_days, data.items()))

In the class constructor, we also initialize some state: cur_episode and cur_step point to the current position in the data, while cash and shares track the current holdings of cash and shares.

        self.cur_episode = self.next_episode() if random_on_reset else 0
        self.cur_step = self.sample_days

        self.action_space = spaces.Discrete(len(Actions))
        self.observation_space = spaces.Box(low=-np.inf,
                                            high=np.inf,
                                            shape=(len(self.cur_indicators.columns) * self.sample_days + 3,),
                                            dtype=np.float16)
        self.reward_range = (-np.inf, np.inf)

        self.cash = self.starting_balance
        self.shares = 0

The other important settings are:

  • action_space: the actions we will use are Hold, Buy, and Sell (as defined in the code below), so the action space is represented by three discrete values (0, 1, and 2)
  • observation_space: the observation is a flat vector of floating point numbers containing the technical indicators of the last sample_days days plus three additional positions holding the current cash, share value, and action
  • reward_range: the maximum and minimum reward that the environment can give
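
With the default parameters (five indicator columns and sample_days=30), the observation is therefore a vector of 5 × 30 + 3 = 153 values.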

Here is the definition of actions that we will use:

import enum

class Actions(enum.Enum):
    Hold = 0
    Buy = 1
    Sell = 2

Reset Method

The reset() method is used to initialize the environment when starting a new episode.

    def reset(self):
        self.cash = self.starting_balance
        self.shares = 0
        return self.next_observation(Actions.Hold.value)

Step Method

The step() method is used to execute an action, return the reward, and move to the next state. The major tasks here are calculating the new position pointers (cur_episode and cur_step), the reward (the increase or decrease of the balance), and whether the current episode is finished (done).

The return value consists of the observation, reward, done flag, and info, which will be used to predict the next action.

    def step(self, action):
        balance = self.cur_balance

        self.cur_step += 1
        if self.cur_step == self.total_steps:
            self.cur_episode = self.next_episode()
            self.cur_step = self.sample_days

        self.take_action(action)

        obs = self.next_observation(action)
        reward = self.cur_balance - balance
        done = self.cur_step == self.total_steps - 1
        info = {'Date': self.cur_data.index[self.cur_step].strftime('%Y-%m-%d'),
                'Reward': round(reward, 2),
                'Symbol': self.cur_symbol,
                'Action': Actions(action).name,
                'Shares': self.shares,
                'Close': round(self.cur_close_price, 2),
                'Cash': round(self.cash, 2),
                'Total': round(self.cur_balance, 2)}

        if done:
            self.reset()

        return obs, reward, done, info

The take_action() method used here updates the cash and share holdings based on whether the action is Buy or Sell.

    def take_action(self, action):
        if action == Actions.Buy.value:
            if self.shares == 0:
                # buy as many round lots of 100 shares as the cash allows,
                # paying the close price plus commission
                price = self.cur_close_price * (1 + self.commission_rate)
                self.shares = int(self.cash / price / 100) * 100
                self.cash -= self.shares * price
        elif action == Actions.Sell.value:
            if self.shares > 0:
                # sell the whole position at the close price minus commission
                price = self.cur_close_price * (1 - self.commission_rate)
                self.cash += self.shares * price
                self.shares = 0

The next_observation() method constructs the new observation, which includes the technical indicators of the last sample_days days and three additional positions holding the current cash, share value, and action.

    def next_observation(self, action):
        observation = []
        for i in range(self.sample_days, 0, -1):
            observation = np.append(observation, self.cur_indicators.values[self.cur_step - i + 1])
        return np.append(observation, [self.cash, self.shares * self.cur_close_price, action])

When the current episode is finished, the next_episode() method is used to select the stock to be used in the next episode.

    def next_episode(self):
        if self.random_on_reset:
            return random.randrange(0, self.total_episodes)
        else:
            return (self.cur_episode + 1) % self.total_episodes

Additional Properties

Listed here are the additional helper properties used to simplify the code above.

    @property
    def total_episodes(self):
        return len(self.data)

    @property
    def cur_symbol(self):
        return list(self.data.keys())[self.cur_episode]

    @property
    def cur_data(self):
        return self.data[self.cur_symbol]

    @property
    def cur_indicators(self):
        return self.cur_data[self.indicator_columns]

    @property
    def total_steps(self):
        return len(self.cur_data)

    @property
    def cur_close_price(self):
        return self.cur_data[self.price_column][self.cur_step]

    @property
    def cur_balance(self):
        return self.cash + (self.shares * self.cur_close_price)

How to use

Creating an instance of this trading environment is simple: we just need to download trading data using the method described in the previous article, Data preparation, where to get a list of all NYSE and NASDAQ stocks (drl4t-03). For convenience, I packaged it as a download() method and saved it in a file called “drl4t_data.py”.

from drl4t_data import download

train_data, test_data = download('nyse.csv')
env = DRL4TEnv(train_data)
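
To sanity-check the environment before training, we can drive it with randomly sampled actions; this is a minimal sketch assuming the classic Gym API used above (reset() returning the observation and step() returning a four-item tuple):

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random Hold/Buy/Sell
    obs, reward, done, info = env.step(action)
    print(info)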

Next time, we will use this trading environment to train a deep reinforcement learning model.
