How I used Machine Learning to predict ESG Fund performance

Bjoern Holste
4 min read · Aug 9, 2020


While taking the dog out for a morning walk, I thought about how one of my new ML toys could be tested on an ESG-related topic. So here it goes.

Many investors care about ESG, that is, investing in companies that care about environmental, social and governance factors. The approach ties in closely with the UN SDGs, the 17 Sustainable Development Goals issued by the United Nations. On the asset side, ESG factors are still gaining momentum mostly as a marketing gadget. However, a growing number of scientific papers show the intended effect of creating impact and nudging corporates to be better global citizens. So all is going in the right direction. Even on the credit side, climate risks will have to be included in banks’ credit risk models, and my next post will be about that.

So let’s get coding with our SDG machine learning project.

First, we need some data. I chose to work with the time series of the ETF with the longest available history (symbol SDG, the iShares MSCI Global Impact ETF), which I pulled from Yahoo Finance through pandas-datareader, just because it was handy. The pickled data is available in the git repo if you don’t want to select and download your own factors.

As you can see from the code, we’ll also feed our ML model some other factors, namely oil (CL=F), gold (GC=F), volatility (VIX) and the EURUSD and BTCUSD currency exchange rates. Feel free to add factors you think might hold relevant information and play around with them.

We need to make sure we use returns and not the price series, since returns are approximately stationary and prices are not. Pandas has a built-in function for simple returns, df.pct_change(), but we use log returns here, computed with np.log().
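As a minimal sketch of the two return definitions, the snippet below uses a few made-up prices (not the actual SDG series) and hypothetical variable names. It shows that log returns are the differences of log prices and that they add up over time.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, for illustration only (not the real data).
prices = pd.DataFrame(
    {"SDG": [50.0, 51.0, 50.5, 52.0], "CL=F": [40.0, 42.0, 41.0, 41.5]},
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

# Simple returns: P_t / P_{t-1} - 1
simple_returns = prices.pct_change().dropna()

# Log returns: ln(P_t) - ln(P_{t-1}) = ln(P_t / P_{t-1})
log_returns = np.log(prices).diff().dropna()

# Log returns are additive: their sum is the log of the total price ratio.
total_log_return = log_returns["SDG"].sum()
print(total_log_return, np.log(52.0 / 50.0))
```

For small daily moves the two definitions are nearly identical; the additivity of log returns is what makes them convenient to sum over a period.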

Some visualisations to see what we’re dealing with are done with Plotly. As expected, volatility is negatively correlated with our target fund’s return, but not much can be learned by looking at the other charts. Let’s see if the machine can learn more from this.

For the actual analysis I use one of my shiny new toys, PyCaret, a low-code wrapper for the standard libraries I usually play with.

Ordinarily, in investing, exceptional performance can be achieved by avoiding the negative days. So our simple analysis will try to build a classifier that predicts whether tomorrow’s performance will be positive or negative based on all previous observations. While this sounds very much like Bayesian statistics, we might be able to find a model with better results.
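One way to set up such a classification target is sketched below. It assumes that each day’s factor returns are the features and that the label is the sign of the fund’s return on the following day. The column name target and the numbers are placeholders, and the notebook may build the label differently.

```python
import pandas as pd

# Hypothetical daily log returns, for illustration only.
returns = pd.DataFrame(
    {
        "SDG": [0.010, -0.004, 0.006, -0.002, 0.003],
        "VIX": [-0.05, 0.08, -0.03, 0.02, -0.01],
    },
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

data = returns.copy()
# Label for day t: 1 if the fund's return on day t+1 is positive, else 0.
next_day = data["SDG"].shift(-1)
data["target"] = (next_day > 0).astype(int)
# The last row has no "tomorrow", so drop it instead of labelling it.
data = data[next_day.notna()]
print(data)
```

Shifting the label rather than the features keeps every row free of look-ahead: the model only sees information that was available on the day the decision is made.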

As with all investing, it would be great to know on which days we should be invested. The difference in total return over the observed timeframe is quite substantial: +273% for only the positive days vs. -227% for only the negative days and +45% for being invested every day.
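The three figures are consistent with summed daily returns: 273 minus 227 gives roughly 45, and a loss beyond -100% could not come from compounding. Assuming they are sums of daily log returns, a comparison like this can be sketched as follows, with made-up numbers and hypothetical variable names.

```python
import numpy as np

# Hypothetical daily log returns, for illustration only.
log_returns = np.array([0.012, -0.008, 0.005, -0.011, 0.009, 0.004, -0.003])

positive_days = log_returns[log_returns > 0].sum()  # invested only on up days
negative_days = log_returns[log_returns < 0].sum()  # invested only on down days
every_day = log_returns.sum()                       # buy and hold

print(f"positive days: {positive_days:+.1%}")
print(f"negative days: {negative_days:+.1%}")
print(f"every day:     {every_day:+.1%}")
```

By construction, the positive-day and negative-day sums always add up to the buy-and-hold sum.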

So let’s see if we can teach our machine to predict, with some confidence, whether the next day’s performance will be positive or negative. What will be the best model to do so?

We can easily find out with PyCaret’s compare_models() function, which gives an overview of all 15 available models from Linear Discriminant Analysis to Extreme Gradient Boosting. Ex ante, one could have assumed that Naive Bayes would be a suitable model, since we’re making predictions given observed data, but as you can see, Linear Discriminant Analysis gives the best scores both in the initial evaluation and in the final tuned version.
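A rough sketch of this step is shown below, assuming PyCaret 2.0 or later, where compare_models() returns the best model and tune_model() accepts a model object. The function names, the target column name and the seed are placeholders, not taken from the notebook.

```python
def compare_classifiers(data, target="target", seed=42):
    """Rank PyCaret's classifiers on `data` and return the best one.

    `data` is assumed to be a pandas DataFrame of factor returns plus a
    binary label column; `target` and `seed` are placeholder names.
    """
    # Imported lazily so the sketch can be read without PyCaret installed.
    from pycaret.classification import setup, compare_models

    # Older PyCaret releases may ask to confirm the inferred column types.
    setup(data=data, target=target, session_id=seed)
    # Cross-validates the available classifiers and prints a score table.
    return compare_models()


def tune_and_finalize(model):
    """Tune a model from compare_classifiers() and refit it on all data."""
    from pycaret.classification import tune_model, finalize_model

    tuned = tune_model(model)     # hyperparameter search with cross-validation
    return finalize_model(tuned)  # refit including the hold-out set
```

In a notebook this would be called as best = compare_classifiers(data) followed by final = tune_and_finalize(best).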

ML model comparison

With some model tuning and finalization, the model predicts the right course of action more often than a coin flip would, as the area-under-curve (AUC) measure above 0.5 shows. In asset management terms this is pretty good and will be enough to set your performance apart from the crowd if deployed efficiently.

AUC plot
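To make the area-under-curve measure concrete: it equals the probability that a randomly chosen positive day gets a higher model score than a randomly chosen negative day, so 0.5 corresponds to guessing. The plain-Python sketch below computes it that way on made-up labels and scores. The helper roc_auc is written for illustration and is not part of PyCaret.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = next day positive) and model scores.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.62, 0.48, 0.55, 0.41, 0.52, 0.39]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

Note that AUC measures how well the scores rank the days. It is related to, but not the same as, the share of correct predictions.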

Interestingly, the Feature Importance Plot reveals that the price of oil (CL=F) turns out to be the most important factor for the LDA model. My guess would have been that volatility (VIX) would show good predictive qualities due to its negative correlation, but I’m just human after all.

feature plot

The model is making fair predictions in terms of true negatives or, in plain English, it tells you when not to invest the next day. While this will not be sufficient to achieve the maximum return of only investing on the positive days, it holds the potential to be built out into an outperforming strategy with some additional optimisation.
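The idea of acting on true negatives can be sketched as follows: count the confusion-matrix cells, then compare the summed return of staying out on predicted negative days with buy and hold. All numbers and names here are hypothetical and only illustrate the mechanics, not the model’s actual results.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["tn" if a == 0 else "fn"] += 1
    return counts


# Hypothetical next-day log returns and model calls (1 = invest, 0 = stay out).
next_day_returns = [0.010, -0.006, 0.004, -0.009, 0.007, -0.002]
predicted = [1, 0, 0, 0, 1, 1]
actual = [1 if r > 0 else 0 for r in next_day_returns]

counts = confusion_counts(actual, predicted)
# Only collect the return on days the model says to be invested.
strategy_return = sum(r for r, p in zip(next_day_returns, predicted) if p == 1)
buy_and_hold = sum(next_day_returns)
print(counts, strategy_return, buy_and_hold)
```

Each true negative is a losing day avoided, while each false negative is a winning day missed, which is why good true-negative performance alone does not reach the positive-days-only maximum.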

Björn Holste — Technology Institute

PS: The full notebook for the analysis (which was run on Kaggle for speed advantages over my local laptop) lives on GitHub as well:
