Blackjack, Stocks, and Reinforcement Learning
Machine Learning for Risk Management
Origins
Reinforcement learning (RL) is one of the three main fields of machine learning. RL is unique in that, instead of attempting to memorize a set of labeled training samples (supervised learning) or finding similarities via methods like clustering (unsupervised learning), it attempts to replicate how biological creatures actually learn.
The origins of Reinforcement Learning are deeply rooted in physiological and psychological experiments, namely those of Ivan Petrovich Pavlov, who is without doubt best known for his studies on the salivation of dogs.
For those of you who have never heard of Pavlov, the gist of his fame is that he observed that dogs would first salivate to the anticipation of being fed, rather than to the act of being fed itself. Better known as Classical Conditioning, this phenomenon basically shows that the brain is associative in nature.
Ironically, Pavlov’s discovery was a complete accident. His initial hypothesis was that the dogs would salivate in response to food being placed in front of them. However, he quickly realized that the dogs began salivating as soon as they heard the footsteps of the lab assistants who were feeding them.
After observing this, Pavlov introduced the dogs to the sound of a metronome immediately before they were fed. After a relatively short period of time, the dogs began to salivate to the clicking of the metronome, regardless of whether food was present or not.
In short, the dogs learned to associate the sound of the metronome with the prospect of being fed. And because this behavior was learned, it was called a conditioned response.
In 1927, Pavlov formally used the term reinforcement to describe the strengthening of a behavior due to the reception of a precursory stimulus. And with that, the concept of Reinforcement Learning was born.
Modern Reinforcement Learning
The concept of modern Reinforcement Learning is centered around a biologically-inspired feedback loop. The feedback loop itself can be visualized below:
There are four main components to this feedback loop:
- the environment
- the agent
- a set of states or observations
- the reward given to the agent for taking an action (A) in a given state (S)
At a high level, it works like this: an agent observes the current conditions of its environment, takes an action, and receives a reward. This reward can be positive or negative in nature. And because the agent has a memory of past events, we can write an algorithm to make it learn from its past mistakes, which ultimately decreases the probability of it making the same mistakes again.
Sound familiar? Well it should. This is the basis for how you, I, and pretty much every living thing learn. You fail a few times, learn from your mistakes, and ultimately become better at the skill in question.
Such is the concept behind Reinforcement Learning for machines. We first define an environment specific to the task we are trying to solve. Then, we design an agent to take actions in this environment. Lastly, we craft a reward function that governs how the agent will receive feedback from the environment after it performs the action.
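To make this loop concrete, here is a minimal sketch of a single episode using OpenAI Gym’s classic interface (Gym’s Blackjack environment is used purely as an example here, and the agent simply acts at random for now):

```python
import gym

# One episode of the agent-environment feedback loop (classic Gym API).
env = gym.make("Blackjack-v1")

state = env.reset()   # the agent observes the initial state of its environment
done = False
while not done:
    action = env.action_space.sample()             # the agent takes an action (random for now)
    state, reward, done, info = env.step(action)   # the environment returns a new state and a reward
env.close()
```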
Q-Learning
One of the most well-known RL algorithms is Q-Learning. Developed in 1989, this procedure aims to find the action that maximizes the agent’s expected future reward, given the current state of the environment.
It does this by storing a reference table (known as a Q-table) that contains values representing the maximum future rewards for taking an action (A) in a given state (S). The table itself is discrete in nature and contains values for all possible state-action combinations.
The values in the Q-table are updated over time as our agent interacts with its environment. If our agent takes action (A) in state (S), and receives a positive reward for doing so, then the value corresponding to that state-action pair will be increased. Likewise, if our agent receives a negative reward, the corresponding value will be decreased accordingly.
The algorithm to update these values (known as Q-values) is shown below.
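For reference, the standard form of the Q-learning update rule can be written as follows (the notation in the original figure may differ slightly):

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]
```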
There are a couple of important parameters in this algorithm that we should touch upon.
The first is the learning rate, denoted by the Greek letter alpha (α). This controls the degree to which the Q-values are updated with each step. Stated differently, it governs the speed at which our agent can learn.
The second is the discount factor, denoted by the Greek letter gamma (γ). This parameter dictates the degree to which the agent considers the maximum expected future reward.
In the formula below, we can see how the maximum future reward is calculated by looking at the next state S(t+1).
But what if our agent’s ultimate objective is several timesteps ahead?
This is where gamma (γ) comes into play. By discounting some percentage of the maximum reward at the next state S(t+1), we can better account for such delays in our agent’s reward.
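As a rough illustration, a single tabular Q-value update could look like the sketch below; the Q-table indexing and the default values for alpha and gamma are illustrative assumptions, not the exact settings used later in this post:

```python
import numpy as np

def update_q(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # The target blends the immediate reward with the discounted value of
    # the best action available in the next state.
    target = reward + gamma * np.max(Q[next_state])
    # Nudge the current estimate toward the target at a rate controlled by alpha.
    Q[state, action] += alpha * (target - Q[state, action])
    return Q
```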
Exploration vs. Exploitation
Although not specified directly in the Q-learning update rule, there is one last parameter we need to worry about. This parameter, denoted by the Greek letter epsilon (ε), determines the agent’s appetite for taking random, exploratory actions. To better explain the epsilon parameter, it is important to understand the trade-off between exploration and exploitation.
During periods of exploration, the agent doesn’t really care about its performance. Its sole goal is to learn more about its environment using a randomized trial-and-error methodology.
However, during periods of exploitation, the agent focuses on fine-tuning its strategy in hopes of maximizing its performance.
It is extremely important that our agent undergoes periods of both exploration and exploitation. Without exploration, our agent will not efficiently learn about its environment, and may very well decide the best strategy is to simply do nothing (depending on the environment). And without exploitation, our agent will not be able to come close to an optimal solution.
As you might have guessed, periods of exploration should generally come before periods of exploitation. For this reason, the epsilon (ε) parameter is initially set to a value where it’s almost certain the agent will take random actions, for the sole purpose of gaining insight about its environment. Then, after each iteration, this value is gradually decayed to a point where the agent is almost certain to take an action determined solely by its previous experiences.
By allowing the agent to explore its environment before exploiting it, we ultimately increase its chances of converging on a near-optimal strategy.
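Here is a minimal sketch of epsilon-greedy action selection with multiplicative decay; the decay rate and floor shown are illustrative placeholders (the values actually used for our agent appear later in the post):

```python
import random
import numpy as np

def select_action(Q, state, epsilon):
    # Explore with probability epsilon, otherwise exploit the best known action.
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])
    return int(np.argmax(Q[state]))

def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.02):
    # Gradually shift the agent from exploration toward exploitation,
    # while always keeping a small chance of exploring.
    return max(epsilon * decay, epsilon_min)
```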
Deep Q-Learning
This brings us to Deep Q-Learning.
There are a few differences between normal Q-Learning and Deep Q-Learning; the most important of which is that instead of creating a Q-table, we use a neural network to approximate the Q-values.
This difference is shown in the image below:
The input to our agent’s neural network is the state, while the output is a probability for each possible action. The agent takes the action with the highest probability.
Another key difference is that our agent undergoes something called “experience replay”, whereby it reflects upon and learns from its past experiences. This is similar to when we update our agent’s Q-values, except that it usually happens in randomly sub-sampled batches after some number of iterations.
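A minimal sketch of such a replay buffer might look like this (the capacity and batch size are arbitrary placeholder values):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and hands back random mini-batches to learn from."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```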
Creating a Winning Blackjack Strategy
Now that you have an idea of how Deep Reinforcement Learning works, we can focus on its applications.
Deep Reinforcement Learning (DRL), and Reinforcement Learning (RL) in general, are used in a variety of domains. That being said, RL is one of the best learning approaches for problems that can be gamified. This is why DRL systems have achieved superhuman success at playing actual games, like Pac-Man, Super Mario, and (more famously) Go.
For this reason, we are going to develop a Deep Reinforcement Learning agent to play Blackjack.
Blackjack is a game of both skill and luck. Even so, it is highly random, and winning really comes down to successful risk management. To make things harder, our version of Blackjack will be even more randomized, as we will be playing with an infinite deck of cards (so unfortunately a card-counting strategy won’t work 😟).
For those who don’t know, here are the basic rules of Blackjack:
- The goal is to beat the dealer by obtaining cards that sum up to as close to 21 as possible (without going over — called a bust).
- Face cards (Jack, Queen, King) have a point value of 10
- Aces can either count as 11 or 1
- All other cards are worth their numerical value (2–9)
- The game starts with the dealer having one card face-down and one card face-up. The player starts with two cards face-up.
- The player can draw additional cards (called hit) until they decide to stop (called stay or stick)
- After the player sticks, the dealer reveals their face-down card, and draws until their sum is 17 or greater.
- Whoever’s cards have a cumulative sum closer to (but not over) 21 wins
Our agent’s neural network is straightforward: it uses 3 fully-connected layers with ReLU non-linearities. The input is the current state, and the output is a probability for each possible action. We choose the action with the highest probability.
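Here is a sketch of what such a network could look like, assuming PyTorch and the Gym Blackjack observation of 3 values (player sum, dealer’s visible card, usable ace) with 2 possible actions; the hidden layer width of 64 is an illustrative choice, not necessarily the one used in the notebooks:

```python
import torch.nn as nn

class BlackjackNet(nn.Module):
    def __init__(self, state_size=3, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),  # output a probability for each action, as described above
        )

    def forward(self, x):
        return self.net(x)
```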
We set our initial epsilon (ε) to 0.99, meaning the agent has a 99% chance of taking a random action during the first hand (exploration). Our epsilon decay rate is set to 0.995, so ε will decrease over time to allow our agent to gradually exploit its previous experiences in its decision making process. An important note here is that we also set a lower boundary of ε to 0.02 in order to allow the agent a chance of exploring throughout the entirety of the training process.
Moreover, we set our agent’s discount factor (γ) to 0.99, so that the final reward of a hand is passed back to earlier decisions almost undiminished; in practice the discount matters little anyway, since a single game of Blackjack can be over as soon as the cards are dealt. And lastly, we set our learning rate (α) to 3e-4.
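Collected in one place, the hyperparameters described above look like this:

```python
HYPERPARAMS = {
    "epsilon_start": 0.99,   # initial probability of taking a random action
    "epsilon_decay": 0.995,  # multiplicative decay applied after each iteration
    "epsilon_min": 0.02,     # floor so the agent never stops exploring entirely
    "gamma": 0.99,           # discount factor
    "learning_rate": 3e-4,   # alpha
}
```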
But what’s really key here — just like in all Reinforcement Learning algorithms — is the reward function involved. The vanilla setup of OpenAI’s Blackjack environment (which we use to train our agent) rewards +1 point for a win and -1 point for a loss. However, this doesn’t allow our agent to account for position (bet) size, which ultimately prevents our agent from learning about risk management.
To remedy this issue, we multiply the reward by the raw probability given by the output of our agent’s neural network. This, in turn, allows our agent to place higher bets on the hands it feels are more favorable (and vice versa).
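Conceptually, the modified reward looks something like the sketch below, where action_probs is assumed to be the output of the agent’s network and base_reward is the environment’s +1/-1 signal; the function name is hypothetical:

```python
def shaped_reward(base_reward, action_probs, action):
    # Treat the network's confidence in the chosen action as a bet size,
    # so confident wins pay more and confident losses cost more.
    bet_size = float(action_probs[action])
    return base_reward * bet_size
```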
As you can see from the plot below, our agent was rather successful in its gambling degeneracy.
Despite only having a win percentage of 56%, our trained agent was able to net a 643% return over the course of 100 hands. This success is almost solely due to the reward function we crafted, which ultimately taught our agent how to manage risk.
Managing Risk in the Stock Market
Now that we’ve shown that Reinforcement Learning can teach an agent to manage risk while playing Blackjack, we will attempt to use this strategy for the stock market.
To do this, we’ll have to code up a custom environment for our agent to play in, taking inspiration from Blackjack.
Our environment works as follows:
- The agent is given a long position, represented by a randomized entry price
- At each state (S), the agent takes an action (A) from the set {0, 1}, which corresponds to sticking or exiting.
The agent can lose in 3 ways:
- It doesn’t exit before a maximum number of days (20)
- The stop loss (-4%) hits before it sells
- It sells its position and takes a loss
Moreover, the agent can only win by selling its position for a profit before its time limit expires.
Lastly, it can draw in two ways:
- It sells its position and breaks even
- It reaches the end of the time series
The states (observations) for this environment are a one-dimensional flattened array of price-derived features (mostly technical analysis indicators tweaked to work well with machine learning). Each individual state represents the previous 10 days of data.
The environment rewards the agent with its percentage return on a win or a loss, and 0 for a draw.
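Putting the rules above together, a rough sketch of such an environment could look like the following. The window size, time limit, and stop loss come from the description above, while the class name, feature handling, and bookkeeping details are illustrative assumptions rather than the exact implementation:

```python
import numpy as np

class LongPositionEnv:
    STICK, EXIT = 0, 1

    def __init__(self, prices, features, window=10, max_days=20, stop_loss=-0.04):
        self.prices = prices        # daily closing prices
        self.features = features    # price-derived features, one row per day
        self.window = window        # days of history in each state
        self.max_days = max_days    # time limit on the position
        self.stop_loss = stop_loss  # forced-exit threshold

    def reset(self):
        # Enter a long position at a random point, leaving room for the
        # lookback window before it and the time limit after it.
        self.entry = np.random.randint(self.window, len(self.prices) - self.max_days - 1)
        self.day = 0
        return self._state()

    def _state(self):
        end = self.entry + self.day
        return self.features[end - self.window : end].flatten()

    def step(self, action):
        self.day += 1
        ret = self.prices[self.entry + self.day] / self.prices[self.entry] - 1.0
        # The episode ends if the stop loss is hit, the agent exits, or the
        # time limit expires (the last case counts as a loss in the post's rules).
        done = ret <= self.stop_loss or action == self.EXIT or self.day >= self.max_days
        reward = ret if done else 0.0
        return self._state(), reward, done, {}
```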
We set our gamma to 0.99, so that rewards received later are worth slightly less, which nudges the agent not to hold its position longer than necessary. Furthermore, we lower our initial epsilon value to 0.9 in order to slightly decrease its probability of exploring versus exploiting.
Our training data spans from 2003 to August 2017, while our test data spans from October 2017 to August 2022. It is crucial that we split up our data like this. If we measured our agent’s skill on the data it was trained on, that would be cheating, and the results wouldn’t be indicative of its potential performance on live data (which would obviously be unseen by our agent).
Furthermore, you may have noticed that there is a 2-month gap between our training and test datasets. This is because some of the technical indicators we use lag by 50 days. So, to prevent data leakage between our datasets, we drop at least 50 days’ worth of data.
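As a simple illustration, assuming the data lives in a date-indexed pandas DataFrame, the split could be expressed like this (the exact boundary dates are approximate):

```python
import pandas as pd

def split_with_gap(df: pd.DataFrame):
    # Chronological split with a gap wide enough to cover the 50-day
    # lookback of the slowest indicator, preventing leakage into the test set.
    train = df.loc[:"2017-08-01"]
    test = df.loc["2017-10-01":]
    return train, test
```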
After training our agent for 20,000 iterations, we let it play with $1000 for 200 rounds on the unseen data to test how well it was able to learn. The results of this are seen below:
As you can see, our agent made a ~118% profit with only a 64% win percentage. This somewhat low win percentage is a testament to the random nature of the stock market. The hefty return, on the other hand, shows that the agent was indeed able to learn to manage risk.
It is worth noting that during this timeframe, the price of $SPY increased by nearly 65%. So, with 200 random entries, our agent was able to outperform this benchmark by a factor of almost 2.
Because our environment is highly randomized, our agent’s performance will vary each time. So, to get some hard statistics about our agent’s performance, we let it play 100 “games”, where each game consists of 200 rounds of test data. For each game, we calculate the total percentage return.
The results are shown below:
- Mean: 97.57%
- Standard deviation: 52.92%
- Minimum: -29.95%
- 25th percentile: 65.89%
- 50th percentile (median): 98.43%
- 75th percentile: 129.95%
- Maximum: 236.25%
On average, our agent yielded a 97% return after 200 rounds of play. At its worst, it posted a -29% total return. And at its best, it was able to take home a total return of 236%.
Recall that the $SPY buy-and-hold benchmark for this timeframe was 65%. Looking at the statistics above, we can conclude that after buying randomly 200 times over the course of 5 years, our agent was able to outperform $SPY in 75% of the simulations. And even better, it was able to outperform $SPY by a factor of 2 in roughly 25% of the simulations.
In an attempt to further improve the agent’s performance, we set our state size to be the previous 5 days of data (as opposed to the 10 days we were using). Everything else remaining the same, we run the same training loop and get the following results on the unseen data:
Once again, we let our agent play 100 games, each consisting of 200 rounds:
- Mean: 98.03%
- Standard deviation: 30.70%
- Minimum: 21.33%
- 25th percentile: 75.76%
- 50th percentile (median): 100.94%
- 75th percentile: 118.68%
- Maximum: 171.60%
While our agent’s maximum return decreased significantly, I would argue that the agent actually performed better than it did with the 10-day observations. My reasoning behind this is summarized below:
- The standard deviation of returns dropped by nearly 50% (meaning the agent was way more consistent)
- At its worst, the agent was still in the green (+20% profit)
- Both the 25th and 50th percentiles increased
It is important to reiterate here that the agent didn’t learn how to trade stocks, per se. Rather, it learned how to properly manage risk in a notoriously difficult environment. And many would argue that this is perhaps the most important aspect of trading.
And that is exactly what this experiment demonstrates. After all, even though our agent was forced to buy $SPY at random, it was still able to make a significant profit.
On top of that, there are a few other things we need to consider.
The first is that during testing, our agent was forced to trade $SPY between October 2017 to August 2022.
For those who don’t know, this was one of the most volatile periods in market history. In fact, if you need a refresher on how volatile this period truly was (source):
- 2 of $SPY’s 3 largest one-day percentage losses occurred during this period
- 2 of its 4 largest one-day percentage gains occurred during this period
- all 20 of its largest intraday point swings occurred during this period
We can visualize this by looking at the VIX, which attempts to measure volatility using the strike prices of a wide variety of put/call options on the $SPX, the index that $SPY tracks. The period of time in question is highlighted in orange:
The VIX, which was in a clear uptrend during our testing period, is generally negatively correlated with stock returns. This means that our agent was capable of risk-managing an extremely difficult market environment, despite never encountering it before.
The reason for our agent’s success relates to some of the statistics mentioned before. That is, despite this timeframe being responsible for some of the largest short-term losses in history, it was also responsible for some of the largest short-term gains.
Stated differently, our agent was able to take advantage of the gains and avoid the losses, despite being forced to hold a long position entered at a random point in time.
Concluding Thoughts
After reflecting upon the results of these experiments, I have a few concluding thoughts that I would like to share with you.
To begin, I am a huge proponent of the theory behind Reinforcement Learning. After all, the first neural networks were based on biology, so why shouldn’t their training loops also be based on it?
If you think about it, supervised learning is akin to using an enormous deck of flashcards to study for a test (coupled with the fact that you need to be really good at memorization). Unsupervised learning, on the other hand, is akin to using your intuition to identify patterns in order to make a “best guess”.
Reinforcement Learning, however, is a biologically-tested, real, and effective learning method utilized every day by almost every species on Earth. You could even extend this thought further and say that natural selection is itself a form of long-term Reinforcement Learning.
After all, the genome of a species undergoes periods of exploration via random genetic mutations. Some of these mutations work out for the best, and are subsequently exploited in their respective environments. The mutations that produce a positive result for the host have a higher chance of being passed down to later generations, and so they tend to stick around due to their utility in the host’s environment. The mutations that don’t have utility, however, tend to fade away over time.
To continue, risk management is an inherently tricky endeavor for humans. The primary reason for this is that humans often let emotions get in the way of making rational decisions. Computers, on the other hand, obviously don’t consider the emotional dimension in their decision making process.
As Pavlov proved well over a hundred years ago, classical conditioning is a real thing, even for humans. And attached to this is the fact that classical conditioning can easily lead to bias.
For example, think of Pavlov’s dogs. Because they were conditioned to believe that the sound of the metronome was correlated to the prospect of being fed, they did not understand the difference between the two sensations.
Likewise, classical conditioning can lead humans to place too much emphasis on their wins and losses while playing Blackjack or when trading stocks. While the brain is undoubtedly a powerful thing, it is extremely flawed when it comes to certain tasks. And risk management is one of those tasks.
As such, it makes sense to get a purely statistical point-of-view when it comes to these things. After all, a person’s previous experiences can hinder their ability to make the optimal choice. A computer, on the other hand, doesn’t care whether it wins or loses. It only puts resources towards the decision-making process involved.
Finally, the stock market isn’t that different from most casino games. Both generally lead to random results. As a consequence, both are inherently risky in nature. For this reason, it makes sense to use a machine learning architecture that can successfully do both. After all, both Blackjack and trading boil down to one thing: risk management.
And as Ben Hogan once said (regarding golf):
“This is a game of misses. The person who misses the best is going to win.”
As always, thank you for reading.
The notebooks associated with this post can be found here.