Cryptocurrency algorithmic trading with Python (1/4)

May 8, 2021

Part I: In search of data

When it comes to designing and backtesting systematic investment strategies, the limit is usually not our imagination but access to reliable, affordable datasets and the computing power to test our ideas. For years, unless we had the chance to work at a hedge fund with an extensive research data budget and state-of-the-art IT infrastructure, obtaining sufficient quality data and backtesting power was an almost impossible step to climb, and our investment ideas would most likely be doomed to remain sketches in the margin of a notebook.

Thankfully, in recent years computing power has become incredibly affordable, whether on your laptop or through cloud services, and cryptocurrencies have emerged. What’s the link, you may ask? Well, with the rise of blockchain technology and the growing interest in cryptocurrencies came cryptocurrency exchanges, and it turns out that they are a formidable source of completely free, high-quality data for anyone who can code a bit.

So, we have a good laptop ✅, plenty of free cryptocurrency exchanges to choose from ✅, a booming crypto market ✅ and a bit of time (thank you lockdown…) ✅. Great, that’s all we need to get started.

So, where to start? 🤷‍♂️

Our starting point is to select a cryptocurrency exchange. To make our life easier, the first criterion will be the quality of its Python API (Application Programming Interface, i.e. in our case a high-level Python library that wraps the interface an app or website exposes, so that users like us can interact with it programmatically. Without those APIs, we would need to spend a great deal of time coding very specific functions to interact with each and every app or website).

Several crypto exchanges offer very complete APIs to easily “plug” our algorithm into their platform, allowing us to retrieve data and trade automatically.

Some Python libraries (like Cryptofeed or CCXT) even offer a single interface to plug into multiple crypto exchange APIs. Those libraries would have been particularly useful if we had planned to develop cross-exchange arbitrage strategies, which require the price of the same asset across multiple platforms in order to profit from potential price differences between exchanges. However, as we will focus first on rather simple momentum strategies, one single platform is enough for us.

In this post, we will be using the Binance cryptocurrency exchange because Binance (i) has an extensive number of listed assets (see below), (ii) has relatively attractive transaction fees, which is fundamental if we want to trade extensively, (iii) has great liquidity, which is important to reduce our trading costs, and (iv) has a great API that makes it easy to get data and trade. On the downside, Binance is not the oldest exchange out there (2017 vs 2011 for Kraken, for example), so historical data may be missing if we want to backtest over a long period of time. However, we can consider that the 2017–2021 period offers enough variety in terms of market regimes to backtest our strategy through various market cycles.

Kraken also offers a rather good API; however, the number of data points we can get in a single request is very limited, which is very restrictive when backtesting over a long period.

List of exchanges, sorted by number of listed pairs (Source: Cryptowatch)

Getting BTC/USDT historical data from Binance 📈

To use the Binance API, we can refer to the documentation of python-binance, an unofficial Python wrapper for the Binance API (https://python-binance.readthedocs.io/en/latest/overview.html). The idea here is not to go through all the details of the code and the way the API works, but rather to give a quick overview.

1. First, we install the python-binance library from our terminal:

pip install python-binance

2. Then we define the Binance Client API setup:
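A minimal sketch of what this setup could look like, not necessarily the author's original code. The environment variable names `BINANCE_API_KEY` and `BINANCE_API_SECRET` and the helper `make_client()` are hypothetical; only `binance.client.Client` comes from python-binance.

```python
import os


def make_client():
    """Build a python-binance REST client from environment variables."""
    # Imported here so the helper can be defined even if the package
    # is not installed yet (see the pip command above).
    from binance.client import Client

    # Hypothetical variable names. Public market data endpoints such as
    # klines should not need real credentials, but trading endpoints do.
    api_key = os.environ.get("BINANCE_API_KEY", "")
    api_secret = os.environ.get("BINANCE_API_SECRET", "")
    return Client(api_key, api_secret)
```

Reading the keys from the environment rather than hard-coding them keeps credentials out of the notebook or repository.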

3. Now we can start fetching the BTC/USDT historical data by simply using the get_historical_klines() function of the API:

The get_historical_klines() function returns a list of candlesticks (“klines”), which we load into a DataFrame (a Python indexed 2-D table) whose rows are dates in ms timestamp format and whose columns are the Open time, the Open price, the Highest price over the period, etc. (cf. the columns_binance list).
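As a rough sketch of this step: the contents of `columns_binance` below follow the kline fields documented by Binance, and the helper `fetch_btcusdt()` and its default dates are assumptions made for illustration. The client is passed in as an argument so the function can be tried with any object exposing `get_historical_klines()`.

```python
import pandas as pd

# Field order of one Binance kline, as listed in the Binance API docs.
columns_binance = [
    "Open time", "Open", "High", "Low", "Close", "Volume",
    "Close time", "Quote asset volume", "Number of trades",
    "Taker buy base asset volume", "Taker buy quote asset volume",
    "Ignore",
]


def fetch_btcusdt(client, start="1 Jan, 2018", end="1 Mar, 2021"):
    """Download daily BTC/USDT klines and wrap them in a DataFrame."""
    # Each kline comes back as a plain list of 12 values.
    raw = client.get_historical_klines("BTCUSDT", "1d", start, end)
    df = pd.DataFrame(raw, columns=columns_binance)
    # Index the rows by open time (milliseconds since the Unix epoch).
    df.index = df["Open time"]
    return df
```

Note that prices and volumes arrive as strings at this stage; converting them to numbers is part of the cleaning step.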

Cleaning the data 🧹

Getting the data is one thing, but making sure the data is “clean” is equally important. There is a reason why data scientists can spend up to 80% of their time cleaning data. In the case of a DataFrame, this consists mainly of:

Dropping unnecessary columns
Changing the index
Renaming columns to a more recognizable set of labels
Replacing missing values

In our case, a quick look at our data shows that, for some reason, values are missing on some specific dates and appear as NaNs. We will therefore fill those missing values using linear interpolation, via the DataFrame interpolate() function:
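The four cleaning steps listed above can be sketched as follows. The helper name `clean_klines()`, the choice of columns to keep and the "Date" index label are assumptions; the input is expected to be a raw klines DataFrame with an "Open time" column in milliseconds.

```python
import pandas as pd


def clean_klines(df):
    """Return a tidy OHLCV DataFrame indexed by datetime."""
    keep = ["Open", "High", "Low", "Close", "Volume"]
    # Drop unnecessary columns and convert string values to floats;
    # anything that cannot be parsed becomes NaN.
    out = df[keep].apply(pd.to_numeric, errors="coerce")
    # Change the index: millisecond timestamps become datetimes.
    out.index = pd.to_datetime(df["Open time"].to_numpy(), unit="ms")
    # Rename the index to a more recognizable label.
    out.index.name = "Date"
    # Replace missing values by linear interpolation between neighbours.
    return out.interpolate(method="linear")
```

With `method="linear"`, pandas treats the rows as equally spaced, which is reasonable for fixed-frequency candles. Interpolation only makes sense for isolated gaps; long runs of missing data deserve a closer look.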

Plotting the data 📈

Now that we have clean data, we can use the Matplotlib Python library to plot the BTC/USDT price:
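A possible plotting helper, given as an illustration only. It assumes a cleaned DataFrame with a datetime index and a numeric "Close" column; the function name, figure size and labels are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_close(df: pd.DataFrame, title: str = "BTC/USDT close price"):
    """Plot the Close column of a cleaned klines DataFrame."""
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(df.index, df["Close"], label="Close")
    ax.set_title(title)
    ax.set_xlabel("Date")
    ax.set_ylabel("Price (USDT)")
    ax.legend()
    # Rotate the date labels so they do not overlap.
    fig.autofmt_xdate()
    return fig, ax
```

Calling `plt.show()` after `plot_close(df)` displays the chart in a script; in a notebook the figure is rendered automatically.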

and here we go:

Getting historical data for multiple assets in one single function 📦

Using the Binance API client.get_historical_klines() function, we can create our own GET_DATA() function to automatically get prices for several assets in one call. Our function takes as input a Series of the asset pairs we want data for, the start and end dates of the period to retrieve, and the data frequency as a string (‘1m’, ‘1h’, ‘1d’, ‘1w’…, cf. the Binance API documentation). The function simply iterates through all the tickers of the investment universe passed in, each time calling client.get_historical_klines(), cleaning the DataFrame obtained, and storing it in a Python dictionary. The output of this function is therefore a Dict of DataFrames keyed by asset pair name.

Finally, to save some time later on, this function also returns, in the same Dict, a DataFrame of close prices for all assets (one column per asset), a similar DataFrame with log returns, and one with simple returns.

If we want to go into the details, this function actually performs one last task: it resizes the datasets so that all DataFrames have the same size. Indeed, the Binance API returns historical data going back only to the asset’s listing date on the Binance exchange, not to start_date. Therefore, if you request historical data for an asset that has been listed for only a few weeks, you will get a DataFrame with fewer rows than what you would obtain for BTC/USDT, for example. Although this makes sense, having DataFrames of different sizes would be an issue, so we reindex all our DataFrames on the dates of the asset with the longest history. This is done in one line using the pandas reindex() function.
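The function described above could look roughly like the sketch below. It is not the author's GET_DATA(): the name `get_data_sketch()`, its argument order and the dictionary keys "Log returns" and "Returns" are assumptions (only the "Close" key appears later in the post). The client is passed in explicitly so the function can be exercised without network access.

```python
import numpy as np
import pandas as pd

KLINE_COLUMNS = [
    "Open time", "Open", "High", "Low", "Close", "Volume",
    "Close time", "Quote asset volume", "Number of trades",
    "Taker buy base asset volume", "Taker buy quote asset volume",
    "Ignore",
]


def get_data_sketch(client, tickers, start, end, interval="1d"):
    """Fetch, clean and align klines for several asset pairs."""
    frames = {}
    for ticker in tickers:
        raw = client.get_historical_klines(ticker, interval, start, end)
        df = pd.DataFrame(raw, columns=KLINE_COLUMNS)
        df.index = pd.to_datetime(df["Open time"].to_numpy(), unit="ms")
        df.index.name = "Date"
        keep = ["Open", "High", "Low", "Close", "Volume"]
        frames[ticker] = df[keep].apply(pd.to_numeric, errors="coerce")

    # Align every asset on the dates of the one with the longest history;
    # dates before an asset's listing are filled with NaN.
    longest = max(frames.values(), key=len).index
    data = {t: df.reindex(longest) for t, df in frames.items()}

    # Convenience tables: one column per asset.
    close = pd.DataFrame({t: df["Close"] for t, df in data.items()})
    data["Close"] = close
    data["Log returns"] = np.log(close / close.shift(1))
    data["Returns"] = close / close.shift(1) - 1
    return data
```

Storing the summary tables under plain string keys in the same dictionary mirrors the description above, at the cost of reserving those names: a ticker literally called "Close" would collide.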

In order to save time and be nice to the API, I also explored saving the DataFrames to Excel files so that the data could be reused in a future backtest instead of calling the API again. However, it turned out that opening an Excel file, reading the data and closing the file took much longer than simply fetching the data from the API again. I have therefore left this functionality aside for the time being.

Great, we are now able to easily get, via a single function, all the data for a list of assets into a single dictionary. But what data are we actually talking about? What is the investment universe we have in our hands to develop our strategies? How many assets? What time horizon? Let’s find out in this last section.

Defining the Investment Universe 📚

First, we need to define the currency of our portfolio. We could use GBP, EUR or USD, or even BTC, but Binance offers more assets quoted against USDT (“Tether”, a cryptocurrency pegged to the USD; the quality of its backing reserves is still debated, but that’s another story) than against any of those currencies. So, in order to maximize our investment universe (and consequently the potential for alpha generation), we will use USDT as our portfolio’s main currency.

Using the general client.get_exchange_info() function from the Binance API, we can easily retrieve all asset pairs listed on Binance together with their current status (trading or not).

Symbols_info is a dictionary containing information about all asset pairs listed on Binance. We will start by filtering this dictionary to get a list of pair tickers which (i) still quote today and (ii) include USDT:
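One way to express this filter, as a sketch. It relies on the response of `client.get_exchange_info()` containing a "symbols" list whose entries carry "symbol", "status" and "quoteAsset" fields. Interpreting "include USDT" as "quoted in USDT" is an assumption, and the helper name is hypothetical.

```python
def usdt_trading_pairs(exchange_info):
    """Return tickers that are currently trading and quoted in USDT."""
    return [
        s["symbol"]
        for s in exchange_info["symbols"]
        if s["status"] == "TRADING" and s["quoteAsset"] == "USDT"
    ]
```

Typical usage would be `inv_universe = usdt_trading_pairs(client.get_exchange_info())`. Filtering on the quote asset rather than on the substring "USDT" avoids picking up pairs where USDT is the base asset.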

By keeping only the assets that are trading today, we are exposed to survivorship bias. However, we will keep this approach for now, as the number of delisted assets is relatively small compared to the universe. Finding a way to keep those delisted assets in our investment universe history is definitely an improvement to test in the future.

Then, we remove the Binance leveraged tokens (“UP” and “DOWN” tokens) from our investment universe, although they might offer interesting ways to increase our exposure without going through the futures market and consequently having to manage margin.

Finally, we also remove the composite “BULL” and “BEAR” tokens:
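Both removals can be sketched with a simple suffix test on the base asset. This is a naive heuristic chosen for illustration, not necessarily the author's filter: it would also drop any ordinary coin whose name happens to end in "UP", "DOWN", "BULL" or "BEAR", so the result is worth checking by eye.

```python
LEVERAGED_SUFFIXES = ("UP", "DOWN", "BULL", "BEAR")


def drop_leveraged_tokens(tickers, quote="USDT"):
    """Remove tickers whose base asset looks like a leveraged token."""
    kept = []
    for ticker in tickers:
        # Strip the quote currency to recover the base asset name.
        base = ticker[: -len(quote)] if ticker.endswith(quote) else ticker
        if not base.endswith(LEVERAGED_SUFFIXES):
            kept.append(ticker)
    return kept
```

For example, `drop_leveraged_tokens(["BTCUSDT", "BTCUPUSDT", "ETHBEARUSDT"])` keeps only "BTCUSDT".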

Having applied all those filters, let’s see the actual size of our investment universe across time:

First, we get the data using our function GET_DATA.

DICT = GET_DATA(start_date, end_date, period_dict, inv_universe)

The function returns NaNs where data is not available yet, so to count the number of assets listed at a given time, we simply count the non-NaN values in each row:

inv_universe_size = np.sum(~np.isnan(DICT['Close']), 1)

Then we plot the result:
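A sketch of the count and the plot together. `universe_size()` is a pandas equivalent of the NumPy one-liner above; both function names and the step-plot styling are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def universe_size(close: pd.DataFrame) -> pd.Series:
    """Count, for each date, the assets that already have a price."""
    return close.notna().sum(axis=1)


def plot_universe_size(size: pd.Series):
    """Draw the number of tradable assets through time as a step line."""
    fig, ax = plt.subplots(figsize=(10, 4))
    # A step plot reflects that the count only changes on listing dates.
    ax.step(size.index, size.values, where="post")
    ax.set_title("Investment universe size")
    ax.set_xlabel("Date")
    ax.set_ylabel("Number of listed pairs")
    fig.autofmt_xdate()
    return fig, ax
```

With the dictionary returned earlier, this would be called as `plot_universe_size(universe_size(DICT['Close']))`.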

Here we go:

So, we are left with around 211 pairs available for trading as of today (March 2021), which seems large enough to seek some nice alpha. We also see that we must be careful when backtesting over periods before 2019, as the investment universe was substantially smaller at that time.

Talking about backtesting… let’s now create our backtesting environment. But enough for today! To be continued in the next episode!
