Using Dynamic Time Warping (DTW) to Cluster Stocks
Stock Classification
Classifying stocks is a technique commonly used in financial and investment analysis to group similar stocks together based on certain characteristics or patterns. The goal of stock classification is to identify underlying structures or relationships in a large group of stocks that may not be immediately apparent.
Stocks can be classified according to the industry or sector they belong to, market capitalization, geographic location, or investment style (e.g., growth stocks or value stocks).
Clustering Stocks Based on Close Price
From a technical analysis perspective, this article attempts to use machine learning techniques to cluster stocks based on adjusted closing prices, which in turn can help investors in the following ways:
- Pattern Recognition: Clustering helps identify patterns or similarities in the price behavior of stocks. By grouping stocks with similar closing price trends, investors can uncover patterns that may not be immediately apparent. This can aid in identifying potential trading opportunities or predicting future price movements.
- Risk Management: Clustering stocks based on close price allows investors to assess risk more effectively. Stocks within the same cluster tend to exhibit similar price movements, indicating that they may be influenced by common factors. Understanding these clusters can help investors manage risk by diversifying their portfolios across different clusters or adjusting their positions based on the performance of the cluster as a whole.
In the rest of this article, I will demonstrate how to complete the stock clustering step by step, starting from collecting data.
Data Collection
The first step is to gather historical price data for a set of stocks, including their daily closing prices over a specified period.
Get Stock List
A list of NYSE or Nasdaq stocks can be searched and downloaded from the following link:
https://www.nasdaq.com/market-activity/stocks/screener
This article uses a downloaded list of Nasdaq stocks of medium market capitalization and above, 647 stocks in total, as sample data.
Get Historical Trading Data
yfinance is a free and popular Python library for downloading historical stock trading data from Yahoo Finance. Its download() function can fetch historical data for multiple stocks in a single call. The code below shows how to install yfinance and use it to download weekly trading data of multiple stocks for the most recent year:
!pip install yfinance
import pandas as pd
import yfinance as yf
symbols = pd.read_csv('nasdaq.csv')
tickers = symbols['Symbol'].str.cat(sep=' ')
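# auto_adjust=False keeps the separate 'Adj Close' column (newer yfinance releases adjust prices by default)
hist = yf.download(tickers=tickers, period='1y', interval='1wk', actions=False, auto_adjust=False)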
Feature Selection
This article uses the adjusted closing price as the clustering feature:
data = hist['Adj Close']
Preprocessing
The prices of different stocks vary widely. Because this article is interested in price trends rather than price levels, we normalize each stock's prices by its initial price so that different stocks are comparable:
Normalized Price = Current Price / Initial Price
Preprocessing also includes transposing the index and columns (so that each row represents one stock) and removing stocks with incomplete data.
data = data / data.iloc[0]
data = data.T
data = data.dropna()
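To make the effect of these three steps concrete, here is a minimal sketch on a made-up price table. The tickers AAA, BBB, and CCC and all prices are hypothetical; CCC has no price in the first week, so its normalized series is entirely NaN and the stock is dropped.

```python
import pandas as pd

# Hypothetical weekly prices for three made-up tickers; CCC is missing its first week.
prices = pd.DataFrame(
    {
        "AAA": [10.0, 11.0, 12.0],
        "BBB": [200.0, 190.0, 210.0],
        "CCC": [float("nan"), 5.0, 6.0],
    },
    index=pd.to_datetime(["2023-01-02", "2023-01-09", "2023-01-16"]),
)

# Same steps as above: divide by the first row, transpose, drop incomplete rows.
normalized = (prices / prices.iloc[0]).T.dropna()
print(normalized)
```

Every remaining row starts at 1.0, so a stock trading at 10 and one trading at 200 can be compared directly. Note that a stock whose first price is missing is removed entirely, because dividing by NaN leaves nothing to compare.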
Selecting a Clustering Algorithm
Some commonly used algorithms for stock clustering include k-means, hierarchical clustering, and DBSCAN. Each algorithm has its own strengths and weaknesses. This article chooses k-means for demonstration.
tslearn is a Python library designed for time series analysis and clustering tasks. This article uses tslearn to implement the k-means algorithm. First, we need to install the library:
!pip install tslearn
tslearn implements the k-means algorithm in the TimeSeriesKMeans class, which supports the 'euclidean', 'dtw', and 'softdtw' metrics for both cluster assignment and barycenter computation.
Dynamic Time Warping (DTW)
Dynamic Time Warping (DTW) is a popular algorithm for measuring the similarity between two time series that may differ in length or in speed.
DTW calculates the minimum distance between two time series by warping one series in time to match the other. It is often used in pattern recognition, classification, and clustering of time series data. DTW allows for the detection of temporal structure and can account for time shifting, stretching, and compression of the series being compared. The following figure demonstrates the difference between using Euclidean distance and DTW to evaluate the similarity of two time series:

DTW has proven effective in a wide range of time series applications, including speech recognition, handwriting recognition, and music analysis. This article will use 'dtw' to measure the similarity between time series.
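The difference between the two distances can also be seen numerically. The following is a minimal sketch of the textbook dynamic-programming form of DTW, written in plain Python for illustration; it is not tslearn's implementation, and the two short series are made up. The second series is the first one shifted right by one step.

```python
import math

def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) DTW using squared point-wise costs."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

def euclidean_distance(a, b):
    """Point-by-point distance; requires equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

base = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same bump, one step later

print("Euclidean:", euclidean_distance(base, shifted))
print("DTW:      ", dtw_distance(base, shifted))
```

Euclidean distance penalizes the one-step shift at every point where the bumps do not line up, while DTW warps the time axis so that the two bumps coincide and reports a distance of zero. This is the behavior that makes DTW attractive for stocks that follow the same trend with a lag.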
Determine the Number of Clusters
Before implementing the clustering algorithm, we still need to determine how many clusters we are going to segment the data into. This article uses the elbow method to determine the optimal number of clusters. The following is the code using the elbow method:
from tslearn.clustering import TimeSeriesKMeans
import matplotlib.pyplot as plt
ssd = []
for i in range(10, 150, 10):
    # Fit k-means with the DTW metric and record the inertia for each cluster count
    model = TimeSeriesKMeans(n_clusters=i, metric="dtw", max_iter=200)
    model.fit_predict(data)
    ssd.append(model.inertia_)
The following code displays the results of the elbow method:
plt.plot(range(10, 150, 10), ssd)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of Squared Distances')
plt.show()
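Reading the elbow off a chart is subjective. As a rough cross-check, one common heuristic picks the point that lies farthest from the straight line joining the first and last points of the curve. The sketch below is an illustration of that heuristic only, not part of tslearn or of the method used in this article; the function name elbow_index and the inertia values are made up.

```python
def elbow_index(ks, inertias):
    """Index of the point farthest from the chord joining the first and last points.

    Assumes the inertia curve is decreasing overall (first value > last value).
    """
    # Rescale both axes to [0, 1] so neither one dominates the distance.
    k_span = ks[-1] - ks[0]
    i_span = inertias[0] - inertias[-1]
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(v - inertias[-1]) / i_span for v in inertias]
    # After rescaling, the chord runs from (0, 1) to (1, 0), i.e. x + y = 1,
    # and a point's distance to it is proportional to |1 - x - y|.
    gaps = [abs(1.0 - x - y) for x, y in zip(xs, ys)]
    return gaps.index(max(gaps))

# Made-up inertia values for illustration only.
ks = [10, 20, 30, 40, 50, 60]
inertias = [100.0, 60.0, 35.0, 20.0, 17.0, 15.0]

best_k = ks[elbow_index(ks, inertias)]
print("Suggested number of clusters:", best_k)
```

With real data, the same function could be called as elbow_index(list(range(10, 150, 10)), ssd). The result is only a starting point: when the curve bends gradually, as it does here, several neighboring values are equally defensible.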

Perform Clustering
The elbow plot above suggests that any cluster count between 40 and 80 is reasonable. This article uses 40 to train the model. The code is as follows:
model = TimeSeriesKMeans(n_clusters=40, metric="dtw", max_iter=200)
model.fit_predict(data)
The following code merges the results of the model training (model.labels_) into the stock list:
data['Category'] = model.labels_
symbols = pd.merge(symbols[['Symbol', 'Name']], data, left_on="Symbol", right_index=True, how='left')[['Symbol', 'Name', 'Category']]
Analysis and Validation
First, let's look at the clustering results. The number of stocks in each cluster varies widely:
symbols.groupby(by=['Category'], dropna=False).count()
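The dropna=False argument matters here because of the left merge: stocks that were removed during preprocessing have no label, so their Category is NaN. The following minimal sketch reproduces this on a made-up three-stock table (the tickers, names, and labels are hypothetical).

```python
import pandas as pd

# Hypothetical stock list; BBB was dropped in preprocessing, so it has no cluster label.
symbols = pd.DataFrame(
    {"Symbol": ["AAA", "BBB", "CCC"], "Name": ["A Corp", "B Corp", "C Corp"]}
)
clustered = pd.DataFrame({"Category": [0, 1]}, index=["AAA", "CCC"])

# A left merge keeps every stock; unmatched ones get NaN in 'Category'.
merged = pd.merge(symbols, clustered, left_on="Symbol", right_index=True, how="left")

with_unlabeled = merged.groupby("Category", dropna=False)["Symbol"].count()
labeled_only = merged.groupby("Category")["Symbol"].count()

print(with_unlabeled)
print(labeled_only)
```

With dropna=False the unlabeled stocks appear as their own NaN group, which makes it easy to see how many stocks were excluded from clustering. Without it, they silently disappear from the counts.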

The following code lists the stocks in a given cluster (here, cluster 9):
symbols[symbols['Category']==9]

Finally, plot the price trends of the stocks in the same cluster on one chart to see how similar they are:
def plot_price(symbols):
    # Plot each stock's adjusted close relative to its first value
    plt.ylabel('Adj Close')
    plt.xlabel('Date')
    for s in symbols:
        plt.plot(hist['Adj Close'][s] / hist['Adj Close'][s].iloc[0])
    plt.legend(symbols)
    plt.show()
plot_price(symbols[symbols['Category']==9]['Symbol'])
plot_price(symbols[symbols['Category']==29]['Symbol'])


Conclusion
The two charts above show the price trends of the stocks in two randomly selected clusters (9 and 29). Within each cluster the price trends are visibly similar, which suggests that clustering stocks with DTW is feasible and effective at grouping stocks with similar price behavior.