Optimizing Run Time for Large Universe
22 Apr 2015
To install the Systematic Investor Toolbox (SIT), please visit the About page.
Recently, I was asked whether the Systematic Investor Toolbox (SIT) back-testing framework is suitable for running large-universe back-tests.
Below, I will present the steps I took to test and optimize the running time. I will base this experiment on the Volatility Quantiles post I wrote back in 2012.
First, let’s generate 10 years of daily data for 25,000 securities.
Please note that this code requires a lot of memory to store the results; hence, you might want to add --max-mem-size=8000M to your R startup parameters.
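For example (a minimal sketch; the --max-mem-size option and memory.limit() apply to R on Windows, and the install path below is only a placeholder):
# start R with a larger memory cap, e.g. on Windows
# "C:\Program Files\R\R-x.y.z\bin\Rgui.exe" --max-mem-size=8000M
# or raise the cap from within an existing Windows R session (size is in MB)
memory.limit(size = 8000)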
# https://systematicinvestor.wordpress.com/2012/06/05/volatility-quantiles/
#*****************************************************************
# Load historical data
#*****************************************************************
library(SIT)
load.packages('quantmod')
tic(11)
tic(1)
n = 25 * 1000
nperiods = 10 * 252
# force high vol to small return
library(MASS)
mu = c(5,20) / 100
sigma = c(5,10) / 100
rho = -0.9
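# 2 x 2 covariance of the (mu, sigma) draws: outer product of their standard deviations scaled by the correlation matrix (rho = -0.9 ties high vol to low drift)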
cov.matrix = sigma%*%t(sigma) * matrix(c(1,rho,rho,1),2,2)
temp = mvrnorm(n, mu, cov.matrix)
mu = temp[,1]
sigma = temp[,2]
sigma[sigma < 1/100] = 1/100
ret = matrix(rnorm(n*nperiods), nr=n)
ret = t(ret * sigma/sqrt(252) + mu/252)
#colnames(ret) = paste('A',1:n)
dates = seq(Sys.Date()-nperiods+1, Sys.Date(), 1)
#prices = bt.apply.matrix(1 + ret, cumprod)
#plot(prices[,1])
#data = env(
# symbolnames = colnames(prices)
# dates = dates
# fields = 'Cl'
# Cl = prices)
#bt.prep.matrix(data)
Next, I tried to run the original code from the Volatility Quantiles post and eventually ran out of memory.
Below, I modified the original code to make it work.
#*****************************************************************
# Create Quantiles based on the historical one year volatility
#******************************************************************
period.ends = endpoints(dates, 'weeks')
period.ends = period.ends[period.ends > 0]
tic(2)
tic(1)
# compute historical one year volatility
sd252 = bt.apply.matrix(ret, runSD, 252)
toc(1); tic(1)
Elapsed time is 48.12 seconds
# split stocks into Quantiles using one year historical Volatility
n.quantiles=5
start.t = which(period.ends >= (252+2))[1]
quantiles = weights = ret[period.ends,] * NA
for( t in start.t:len(period.ends) ) {
i = period.ends[t]
factor = sd252[i,]
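# cross-sectional rank on volatility, mapped into equal-count buckets 1..n.quantiles (NAs kept out of the ranking)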
ranking = ceiling(n.quantiles * rank(factor, na.last = 'keep','first') / count(factor))
quantiles[t,] = ranking
weights[t,] = 1/tapply(rep(1,n), ranking, sum)[ranking]
}
# free memory
env.del(spl('prices,sd252'), globalenv())
gc(T)
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    561806  30.1     984024   52.6    984024   52.6
Vcells  81877454 624.7  199500757 1522.1 199499815 1522.1
#*****************************************************************
# Create backtest for each Quintile
#******************************************************************
toc(1); tic(1)
Elapsed time is 10.63 seconds
models = list()
quantiles = ifna(quantiles,0)
for( i in 1:n.quantiles) {
temp = weights * NA
temp[] = 0
temp[quantiles == i] = weights[quantiles == i]
weight = ret * NA
weight[period.ends,] = temp
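# lag the weekly weights by one period and carry them forward to daily frequency; repeating this for every quantile is the second bottleneck addressed later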
weight = ifna(apply(mlag(weight), 2, ifna.prev),0)
rets = rowSums(ret * weight)
models[[ paste('Q',i,sep='_') ]] = list(
ret = make.xts(rets, dates),
equity = make.xts(cumprod(1 + rets),dates))
}
toc(1)
Elapsed time is 50.27 seconds
toc(2)
Elapsed time is 109.02 seconds
Looking at the timing, there are two bottlenecks:
- computation of the historical one-year volatility
- creating the back-test for each quantile
To address the running time of the historical one-year volatility computation, I tried the following approaches:
# Run Cluster
load.packages('parallel')
tic(2)
tic(1)
cl = setup.cluster(NULL, 'ret',envir=environment())
toc(1); tic(1)
Elapsed time is 19.56 seconds
out = clusterApplyLB(cl, 1:n, function(i) { runSD(ret[,i], 252) } )
toc(1); tic(1)
Elapsed time is 17.43 seconds
stopCluster(cl)
sd252 = do.call(c, out)
dim(sd252) = dim(ret)
toc(1)
Elapsed time is 0.54 seconds
toc(2)
Elapsed time is 37.67 seconds
Not much faster: it takes almost 19 seconds just to move the large ret matrix to the cluster workers.
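For completeness, here is a rough sketch (using the base parallel package directly; it assumes TTR is installed on the workers and that detectCores() picks a sensible worker count) that sends one chunk of columns per worker instead of one task per column. The one-time copy of the ret matrix to every worker would still dominate the run time:
load.packages('parallel')
cl = makeCluster(detectCores())
# exporting the big ret matrix is the expensive step
clusterExport(cl, 'ret')
# one chunk of columns per worker instead of one task per column
chunks = splitIndices(ncol(ret), length(cl))
out = parLapply(cl, chunks, function(ix) apply(ret[, ix, drop=FALSE], 2, TTR::runSD, n=252))
stopCluster(cl)
sd252 = do.call(cbind, out)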
Next, I looked to RcppParallel for help.
#include <Rcpp.h>
using namespace Rcpp;
using namespace std;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
using namespace RcppParallel;
//[Standard deviation](http://mathforum.org/library/drmath/view/71724.html)
// sqrt((sX2 - sX^2/n)/(n-1))
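// When the window slides forward by one observation, sX and sX2 are updated
// incrementally (subtract the observation leaving the window, add the one
// entering it) instead of being recomputed over the full window -- see
// do_update() below; this makes every period after the first O(1) per column.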
struct run_sd_helper : public Worker {
// input matrix to read from
const RMatrix<double> mat;
// internal variables
int rstart, rend;
const int nwindow, nwindow1;
RVector<double> sum, sum2;
// output matrix to write to
RMatrix<double> rmat;
// flag to indicate initialization
bool first_run;
// initialize from Rcpp input and output matrixes
run_sd_helper(const NumericMatrix& mat, int rstart, int rend,
NumericVector sum, NumericVector sum2, NumericMatrix rmat)
: mat(mat), rstart(rstart), rend(rend), nwindow(rend - rstart), nwindow1(nwindow-1),
sum(sum), sum2(sum2), rmat(rmat), first_run(true) { }
// helper function to shift one period in time
void update(int i) {
rstart = i;
rend = rstart + nwindow;
first_run = false;
}
// do initialization, pre-compute sum and sum2
void do_first_run(size_t begin, size_t end) {
for (size_t c1 = begin; c1 < end; c1++) {
double dsum, dsum2;
dsum = dsum2 = 0;
for (int r = rstart; r < rend; r++) {
double d = mat(r, c1);
dsum += d;
dsum2 += pow(d, 2);
}
sum[c1] = dsum;
sum2[c1] = dsum2;
rmat(rend - 1, c1) = sqrt((dsum2 - pow(dsum, 2) / nwindow) / nwindow1);
}
}
// do incremental update, no need to re-compute
void do_update(size_t begin, size_t end) {
for (size_t c1 = begin; c1 < end; c1++) {
double d0 = mat(rstart - 1, c1);
double d1 = mat(rend - 1, c1);
sum[c1] += -d0 + d1;
sum2[c1] += -pow(d0, 2) + pow(d1, 2);
rmat(rend - 1, c1) = sqrt((sum2[c1] - pow(sum[c1], 2) / nwindow) / nwindow1);
}
}
// main entry -function call operator that work for the specified range (begin/end)
void operator()(size_t begin, size_t end) {
if (first_run)
do_first_run(begin, end);
else
do_update(begin, end);
}
};
// [[Rcpp::export]]
NumericMatrix cp_run_sd(NumericMatrix mat, int nwindow) {
// allocate the matrix we will return
int nc = mat.ncol();
int nperiod = mat.nrow();
NumericMatrix rmat(nperiod, nc);
// pad the first n-1 elements with NA
for (int i = 0; i < nc; i++)
fill_n(rmat.begin() + i*nperiod, nwindow - 1, NA_REAL);
// pre-compute first run
NumericVector rsum(nc), rsum2(nc);
run_sd_helper p(mat, 0, nwindow, rsum, rsum2, rmat);
parallelFor(0, nc, p);
// for loop for each time period in parallel
for (int i = 1; i < (nperiod - nwindow + 1); i++) {
// update stats
p.update(i);
parallelFor(0, nc, p);
}
return rmat;
}
Please save the above code in the runsd.cpp file or download runsd.cpp.
#*****************************************************************
# Compile Rcpp helper function to compute running window standard deviation in parallel
#*****************************************************************
# make sure to install [Rtools on windows](http://cran.r-project.org/bin/windows/Rtools/)
load.packages('Rcpp')
load.packages('RcppParallel')
sourceCpp('runsd.cpp')
tic(1)
sd252 = cp_run_sd(ret, 252)
toc(1)
Elapsed time is 0.98 seconds
The Rcpp helper function took less than a second. This is the improvement I was looking for.
Next, let’s address the run time of creating the back-test for each quantile. Notice that instead of lagging and back-filling the weights separately for each quantile, we can do it once for all quantiles to save time.
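Here is a toy sketch of the indexing idea used in the final code (a hypothetical 6-day history; mlag, ifna and ifna.prev are the usual SIT helpers): build a daily index that points at the prior period-end, so the weekly weights can be lagged and carried forward for all quantiles with a single subscript.
toy.period.ends = c(2, 4, 6)                    # rebalance days
toy.index = rep(NA, 6)
toy.index[toy.period.ends] = 1:len(toy.period.ends)
toy.index = ifna(ifna.prev(mlag(toy.index)), 1) # 1 1 1 1 2 2
# daily weights = weekly weights[toy.index, ] : lagged and back-filled once;
# rows before the first rebalance fall back to index 1, where weights are zero in the real code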
Following is the final code, which runs in about 28 seconds.
# https://systematicinvestor.wordpress.com/2012/06/05/volatility-quantiles/
#*****************************************************************
# Load historical data
#*****************************************************************
library(SIT)
load.packages('quantmod')
tic(2)
tic(1)
n = 25 * 1000
#n = 250
nperiods = 10 * 252
# force high vol to small return
library(MASS)
mu = c(5,20) / 100
sigma = c(5,10) / 100
rho = -0.9
cov.matrix = sigma%*%t(sigma) * matrix(c(1,rho,rho,1),2,2)
temp = mvrnorm(n, mu, cov.matrix)
mu = temp[,1]
sigma = temp[,2]
sigma[sigma < 1/100] = 1/100
ret = matrix(rnorm(n*nperiods), nr=n)
ret = t(ret * sigma/sqrt(252) + mu/252)
#colnames(ret) = paste('A',1:n)
dates = seq(Sys.Date()-nperiods+1, Sys.Date(), 1)
#prices = bt.apply.matrix(1 + ret, cumprod)
#plot(prices[,1])
#*****************************************************************
# Create Quantiles based on the historical one year volatility
#******************************************************************
period.ends = endpoints(dates, 'weeks')
period.ends = period.ends[period.ends > 0]
# compute historical one year volatility
toc(1); tic(1)
Elapsed time is 6.85 seconds
if(F) { # 48 seconds
sd252 = bt.apply.matrix(ret, runSD, 252)
}
if(F) { # 38 seconds
# Run Cluster - will be about 10sec faster, please note transferring ret matrix takes almost 17secs
load.packages('parallel')
cl = setup.cluster(NULL, 'ret',envir=environment())
out = clusterApplyLB(cl, 1:n, function(i) { runSD(ret[,i], 252) } )
stopCluster(cl)
sd252 = do.call(c, out)
dim(sd252) = dim(ret)
}
if(T) { # 0.9 seconds
sd252 = cp_run_sd(ret, 252)
}
toc(1); tic(1)
Elapsed time is 0.86 seconds
#all(abs(sd252 - result) < 0.0001, na.rm=T)
# split stocks into Quantiles using one year historical Volatility
n.quantiles=5
start.t = which(period.ends >= (252+2))[1]
quantiles = weights = ret[period.ends,] * NA
for( t in start.t:len(period.ends) ) {
i = period.ends[t]
factor = sd252[i,]
ranking = ceiling(n.quantiles * rank(factor, na.last = 'keep','first') / count(factor))
quantiles[t,] = ranking
weights[t,] = 1/tapply(rep(1,n), ranking, sum)[ranking]
}
quantiles = ifna(quantiles,0)
toc(1); tic(1)
Elapsed time is 11.4 seconds
# free memory
env.del(spl('prices,sd252'), globalenv())
gc(T)
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    637966   34.1    1073225   57.4   1012438   54.1
Vcells 208010169 1587.0  518657040 3957.1 518656088 3957.1
#*****************************************************************
# Create backtest for each Quintile
#******************************************************************
models = list()
weight = weights
# Simplified bt.run for speed
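# map each day to the prior period-end so that weekly weights are lagged and
# carried forward to daily frequency with a single subscript (weight[index,]) below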
index = ret[,1] * NA
index[period.ends] = 1:len(period.ends)
index = ifna(ifna.prev(mlag(index)), 1)
for( i in 1:n.quantiles) {
weight[] = 0
weight[quantiles == i] = weights[quantiles == i]
rets = rowSums(ret * weight[index,])
models[[ paste('Q',i,sep='_') ]] = list(
ret = make.xts(rets, dates),
equity = make.xts(cumprod(1 + rets),dates))
}
gc(T)
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    638113   34.1    1073225   57.4   1012438   54.1
Vcells 154037802 1175.3  414925632 3165.7 518656088 3957.1
toc(1); tic(1)
Elapsed time is 9.17 seconds
#*****************************************************************
# Report
#*****************************************************************
#strategy.performance.snapshoot(models, T)
plotbt(models, plotX = T, log = 'y', LeftMargin = 3, main = NULL)
mtext('Cumulative Performance', side = 2, line = 1)
toc(1);
Elapsed time is 0.03 seconds
toc(2);
Elapsed time is 28.31 seconds
Aside: Rcpp helper function testing:
#*****************************************************************
# Compile Rcpp helper function to compute running window standard deviation in parallel
#*****************************************************************
# [RcppParallel help site](http://rcppcore.github.io/RcppParallel/)
# [RcppParallel source code](https://github.com/RcppCore/RcppParallel)
# [RcppParallel introduction](http://dirk.eddelbuettel.com/blog/2014/07/16/#introducing_rcppparallel)
# [RcppParallel gallery](http://gallery.rcpp.org/articles/parallel-distance-matrix/)
# defaultNumThreads()
#[High performance functions with Rcpp](http://adv-r.had.co.nz/Rcpp.html)
#
# make sure to install [Rtools on windows](http://cran.r-project.org/bin/windows/Rtools/)
load.packages('Rcpp')
load.packages('RcppParallel')
sourceCpp('runsd.cpp')
# test - generate random data
n = 24
nperiods = 10 * 252
ret = matrix(rnorm(n*nperiods), nc=n)
# compute running window standard deviation using Rcpp
cp.sd252 = cp_run_sd(ret, 252)
# compute running window standard deviation using R
library(SIT)
load.packages('quantmod')
r.sd252 = bt.apply.matrix(ret, runSD, 252)
print(test.equality(r.sd252, cp.sd252))
item1 | item2 | equal |
---|---|---|
r.sd252 | cp.sd252 | TRUE |
# compare performance
library(rbenchmark)
res <- benchmark(
cp_run_sd(ret, 252),
bt.apply.matrix(ret, runSD, 252),
replications = 20,
order="relative")
print(res[,1:4])
  | test | replications | elapsed | relative |
---|---|---|---|---|
1 | cp_run_sd(ret, 252) | 20 | 0.16 | 1.000 |
2 | bt.apply.matrix(ret, runSD, 252) | 20 | 0.94 | 5.875 |
(this report was produced on: 2015-05-01)