UNIVERSITY OF PENNSYLVANIA

Statistical Pair Trading on International ETFs

Rebecca Wu

Troy Shu

STAT 434 Final Project Report Steele

December 18, 2012

I. Executive Summary

Pair trading international ETFs with a non-adaptive strategy does not seem to perform

well over longer time frames due to the changing dynamics of mean reversion and momentum in

international ETFs. However, after applying an adaptive “filter” to our pair trading strategy, our

returns improved dramatically, which suggests that being able to successfully capture these mean

reversion and momentum dynamic changes can be profitable.

Our first step was to conduct exploratory data analysis on the price and return data for 22

international ETFs. We did not find anything out of the ordinary: the international ETF prices are

highly autocorrelated, not normally distributed and not stationary while the ETF returns are

autocorrelated, heavy-tailed and stationary.

Next, we backtested our international ETF pair trading strategy. Our strategy used the

Augmented Dickey-Fuller stationarity test to select only the cointegrated ETF pairs as potential

trades. After regressing the price of one ETF against the price of the other over a rolling 120-day

formation period, our strategy then ordered the ETF pairs by the magnitude of the current

residual on the 120th day and selected the top 5 ETF pairs with the largest residual/divergence to

trade for the next 20 days.

Initial results were poor: the strategy produced a full period Sharpe Ratio of -0.16 and a

max drawdown of -53.3%. Plotting the rolling Sharpe Ratios showed that they oscillated around

0.00, so the overall risk-reward relationship of our initial strategy remained poor over time.

We considered using a GARCH(1,1) model to obtain a clearer picture of the standard deviation

of our strategy returns, and thus a clearer picture of the rolling Sharpe Ratio. However, the fact

that the residuals of our strategy’s returns are heavy-tailed precluded the use of GARCH to

model the standard deviation of our strategy’s returns. We also conducted an analysis using

different Kelly criterion bets and, as expected, the terminal wealth and compound

annualized growth rate under Kelly betting were higher than those of the strategy without Kelly bets.

In looking for ways to improve our initial international pairs trading ETF strategy, we

noticed that there seemed to be “regime shifts” over time between the dominance of mean

reversion or momentum in the returns of the ETFs. We applied a moving average “filter” to the

initial trading strategy to reverse the international ETF pair trades in the correct direction when it

seemed that these regime shifts occurred. Our pair trading strategy’s returns improved

dramatically, producing a full period Sharpe Ratio close to 1 and a max drawdown of -35%.

II. Premise: “Pairs Trading on International ETFs” Paper

We decided to base our project on the premise of the quantitative financial research paper

titled “Pairs Trading on International ETFs”, authored by Schizas, Thomakos, and Wang. In their

paper, Schizas and his colleagues developed an international ETFs pair trading strategy that

produced spectacular results but did not seem to have a strong statistical foundation.

The authors of the paper used 23 international ETFs, representing countries such as the

USA, Germany, Brazil, Japan, and even smaller countries like Belgium and Malaysia. The

authors implemented their backtest using a rolling window: They had a 120-day formation

period in which they ranked all pairs of international ETFs and selected the top five to trade in a

simple 1-to-1 ratio. Then they had a 20-day trading period in which they calculated the ex-post

returns of the ETF pairs that they selected in the formation period. Rolling these two windows

forward together by 20 days produced ex-post returns for another 20 days.
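The rolling schedule described above can be sketched in a few lines of Python. This is only an illustration under the stated window lengths (120-day formation, 20-day trading, rolled forward by 20 days); the function name `rolling_windows` and the use of integer day indices are assumptions, not something taken from the paper.

```python
def rolling_windows(n_days, formation=120, trading=20):
    """Yield (formation_start, formation_end, trading_end) day-index triples.

    The formation period covers days [formation_start, formation_end) and the
    trading period covers [formation_end, trading_end). Both windows are then
    rolled forward together by `trading` days.
    """
    start = 0
    while start + formation + trading <= n_days:
        formation_end = start + formation
        yield start, formation_end, formation_end + trading
        start += trading


# Hypothetical example: 200 days of data give four complete windows.
windows = list(rolling_windows(200))
```

Each trading period starts exactly where the previous one ended, so the ex-post returns tile the sample without overlap.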

To order the ETF pairs, the authors used the average absolute difference between the

cumulative returns of two ETFs starting from the beginning of the 120-day formation period. In

doing so, they were essentially betting that two international ETFs whose prices have shown to

diverge a lot will tend to converge in the future. However, they did not offer either a fundamental

economic reason or statistical evidence to explain such convergence behavior.
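For concreteness, the ordering metric described above could be computed roughly as follows. This is a sketch under assumptions: cumulative returns are measured against the first price of the formation period, the first day (where both cumulative returns are zero) is included in the average, and the function names are hypothetical.

```python
def cumulative_returns(prices):
    """Cumulative return of each day relative to the first price in the window."""
    return [p / prices[0] - 1.0 for p in prices]


def avg_abs_divergence(prices_a, prices_b):
    """Average absolute difference between two cumulative return paths."""
    cum_a = cumulative_returns(prices_a)
    cum_b = cumulative_returns(prices_b)
    return sum(abs(a - b) for a, b in zip(cum_a, cum_b)) / len(cum_a)


# Hypothetical prices: one ETF rises 20% while the other stays flat.
divergence = avg_abs_divergence([100.0, 110.0, 120.0], [50.0, 50.0, 50.0])
```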

When assessing the performance of their strategy, the authors neglected to provide basic

performance metrics such as monthly return, compound annualized growth rate, or max

drawdown numbers for their strategy. They only provided a single bar chart of monthly returns

and a few equity curves that are depicted below.

Their results seemed too good to be true given that very few months had negative returns,

even throughout the 2008 financial crisis. Furthermore, the negative returns never fell below -5%

while the positive returns frequently exceeded +5%, even reaching levels of 10% or 20% at some

points. The equity curves also seemed suspect since the portfolio for the top five pairs

consistently beat the market throughout all time.

The goal of our project was to develop a more statistically sound international ETFs pair

trading strategy by only trading cointegrated international ETF pairs in order to avoid the

problem of basing our trading decisions on spurious regressions, which can

produce a highly significant alpha and beta even if the two international ETFs were completely

independent of each other. In addition, rather than only trading the pair on a 1-to-1 ratio, we also

used the Engle-Granger two-step method to determine the optimal cointegration ratio and

construct our trades by going long for ETF y but then going short for ETF x in proportion to the

cointegration ratio.

III. Description of Our Strategy

In our project, we used the same type of rolling backtest that Schizas and his co-authors

used in their strategy; however, we implemented the Engle-Granger two-step method in selecting

our pairs to trade. First, we performed a regression on the price data from the 120-day formation

period for each of the ETF pairs. Next, we ran a Dickey-Fuller stationarity test on the set of

residuals from the regression to check whether the spread between the prices of the ETF pair was

stationary. If the residuals were not stationary, we eliminated the pair since it did not make

statistical sense to pair trade two ETFs whose price spread is not expected to revert.
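A minimal sketch of the two steps, written in plain Python, may help make the screening concrete. It is not the code used in the project. Assumptions: the regression includes an intercept, the Dickey-Fuller regression on the residuals has no lagged differences and no constant, and the cutoff of -3.34 is an approximate asymptotic 5% Engle-Granger critical value for two series; all function names are hypothetical.

```python
import math
import random


def ols(y, x):
    """Fit y = alpha + beta * x by least squares; return (alpha, beta, residuals)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    residuals = [yi - alpha - beta * xi for xi, yi in zip(x, y)]
    return alpha, beta, residuals


def dickey_fuller_stat(e):
    """t-statistic of rho in the regression e[t] - e[t-1] = rho * e[t-1] + noise."""
    lagged = e[:-1]
    diff = [b - a for a, b in zip(e[:-1], e[1:])]
    sum_sq_lagged = sum(v * v for v in lagged)
    rho = sum(l * d for l, d in zip(lagged, diff)) / sum_sq_lagged
    sse = sum((d - rho * l) ** 2 for l, d in zip(lagged, diff))
    std_err = math.sqrt(sse / (len(diff) - 1) / sum_sq_lagged)
    return rho / std_err


def is_cointegrated(y, x, critical=-3.34):
    """Step 1: regress y on x. Step 2: test the residuals for a unit root."""
    _, beta, residuals = ols(y, x)
    return dickey_fuller_stat(residuals) < critical, beta, residuals


# Synthetic example (not ETF data): x is a random walk, y tracks 2 * x plus noise.
rng = random.Random(0)
x = [100.0]
for _ in range(119):
    x.append(x[-1] + rng.gauss(0, 1))
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]
flag, beta, residuals = is_cointegrated(y, x)
```

Because the residuals come from an estimated regression, ordinary Dickey-Fuller critical values are too lenient; that is why the sketch uses the more negative Engle-Granger cutoff.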

After determining the valid pairs to potentially trade, we ranked the remaining pairs

based on the absolute magnitude of the last residual in the 120-day formation period (i.e. the

most recent residual). Then we selected the top five pairs with the greatest level of divergence to

trade for the following 20 days.
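The ranking step can be illustrated as below. The dictionary layout, the function name `select_top_pairs`, and the residual values are hypothetical; only the rule itself (sort by absolute value of the last formation-period residual, keep the top five) comes from the text.

```python
def select_top_pairs(last_residuals, k=5):
    """Return the k pairs whose most recent residual is largest in absolute value.

    last_residuals maps (ticker_y, ticker_x) to the residual on the final day
    of the formation period, for pairs that already passed the stationarity test.
    """
    ranked = sorted(last_residuals.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [pair for pair, _ in ranked[:k]]


# Hypothetical residuals for six surviving pairs.
example = {
    ("EWM", "SPY"): -1.00,
    ("EWJ", "SPY"): 0.40,
    ("EWG", "SPY"): -0.05,
    ("EWG", "EWJ"): 0.75,
    ("EWM", "EWJ"): -0.60,
    ("EWM", "EWG"): 0.20,
}
top_five = select_top_pairs(example)
```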

We constructed dollar neutral positions in the top ETF pairs that we selected using the

estimated betas from the regressions. We made sure that these trades were made in the correct

direction. For example, say that we selected to trade the SPY-EWM (iShares MSCI Malaysia

Index) pair because it had a very negative residual of -1 (very high absolute magnitude)

after regressing y = EWM price on x = SPY price. By definition, residuals were calculated as

y − βx = residual, and so a residual that was very negative meant that our βx leg was

“overpriced” compared to our y leg. So in this case, we shorted our x leg (SPY) and went long

our y leg (EWM).
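The direction rule in the example can be written as a small function. This sketch assumes a positive β and expresses the position as one unit of y against β units of x; the function name and the unit convention are assumptions, and the scaling to dollar amounts is left out.

```python
def pair_position(residual, beta):
    """Return (units_y, units_x); positive means long, negative means short.

    residual = y - beta * x. A negative residual means y is cheap relative to
    beta * x, so we go long y and short x; a positive residual is the reverse.
    """
    if residual == 0:
        return 0.0, 0.0
    sign = 1.0 if residual < 0 else -1.0
    return sign * 1.0, -sign * beta


# The SPY-EWM example from the text: residual of -1, with a hypothetical beta of 0.5.
units_ewm, units_spy = pair_position(-1.0, 0.5)
```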

Once we had the trading period (next 20 days) returns for the five pairs, we simply took

the arithmetic average across all pair returns for each day to arrive at our portfolio’s overall return

for that day; in other words, we equal weighted the five pairs that we traded in our portfolio.

One of our exit criteria was as follows: if, within the 20 day trading period, the price

residual of a pair reversed sign from the original price residual, we exited that pair (and did not

rebalance our portfolio). For example, consider the SPY-EWM pair example that we used

before: the original residual was quite negative, -1. The price residual was calculated every

trading day as y − βx, or the price of EWM minus β times the price of SPY (since we defined x

as the price of SPY and y as the price of EWM before). One day, the price residual became 0.01;

since the sign was now opposite to the sign of our original residual (-1), we exited (on the next

day’s close, since we are only using close data). Intuitively, this made sense because we

originally entered the pair trade to capture the divergence in price as measured by the residual

with the expectation that this divergence will close, and the residual will cross zero—this

difference between the original residual and zero would be our profit. So once the residual

switched signs we would be trading in the wrong direction. Since we captured most of the profit

in the “spread” as measured by the residual one day before the entry date, we exited as soon as

possible once this happened. We exited all positions in any pairs that we were still trading once

the 20-day trading period was over.
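The exit logic above might be sketched as follows. The function name and the convention of returning a day index within the trading period are assumptions; the one-day delay reflects the statement that exits happen on the next day's close.

```python
def exit_day(entry_residual, daily_residuals):
    """Return the index of the trading day on whose close the pair is exited.

    daily_residuals holds y - beta * x for each day of the trading period.
    If the residual changes sign relative to the entry residual on day t, the
    exit happens at the close of day t + 1. Otherwise the pair is held until
    the last day of the trading period.
    """
    last_day = len(daily_residuals) - 1
    for t, residual in enumerate(daily_residuals):
        if residual * entry_residual < 0:
            return min(t + 1, last_day)
    return last_day


# Entry residual of -1; the residual turns positive (0.01) on day index 2.
day = exit_day(-1.0, [-0.8, -0.3, 0.01, 0.2, -0.1])
```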

Using a rolling backtest prevented possible data mining issues that could be encountered

from selecting a fixed training and testing period. Additionally, such a rolling backtest made

particular sense in the context of our project for 2 reasons: (1) our trading strategy was relatively

short-term so it would have been erroneous to include historical data from too far back in time,

and (2) the nature of a pairs trading strategy required screening all the pairs and only trading those

pairs that seemed the most promising. In this case, our pairs were ordered by a univariate metric,

namely the magnitude of the price regression residual.

IV. EDA of the International ETFs Data

After conducting exploratory data analysis on our international ETFs data, we found that

there were no surprises because both the ETF price and return series displayed the expected

behavior of typical financial series. As expected, the ETF prices were highly autocorrelated and

not stationary while the returns exhibited heavy tails, autocorrelation, and stationarity.

Out of the 23 ETFs that Schizas and his colleagues used, we collected closing price data

for 22 ETFs that spanned the time period from April 01, 1996 to December 31, 2011. We

omitted the 23rd ETF, EZU (iShares MSCI EMU Index), because its data was not

found on CRSP. While the majority of ETF records started on April 01, 1996, two ETFs (Korea

and Taiwan) had data starting on May 10, 2000 and June 20, 2000. From the 22 ETFs we could

form a maximum number of 231 pairs.
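The count of 231 is the number of unordered pairs from 22 ETFs, 22 × 21 / 2. A short check, using placeholder ticker names rather than the actual ETF list:

```python
from itertools import combinations

# Placeholder names; the actual 22 tickers are not listed here.
tickers = [f"ETF{i:02d}" for i in range(22)]
pairs = list(combinations(tickers, 2))
n_pairs = len(pairs)
```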

Normality:

When testing the international ETFs return data for normality, both the Shapiro-Wilk and Jarque-Bera normality tests yielded p-values of 0.00, strongly rejecting the null hypothesis of a normal distribution. The Jarque-Bera test resulted in very high values of the test statistic for each of the return series, consistent with the presence of heavy tails. The normal QQ plots for several of the largest ETFs make the heavy-tailed distribution easy to see.

[Figure: Normal QQ plots of SPY, EWG (Germany), and EWJ (Japan) returns. x-axis: quantiles of standard normal; y-axis: empirical quantiles of returns.]
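For reference, the Jarque-Bera statistic combines sample skewness S and kurtosis K as JB = n/6 · (S² + (K − 3)²/4), and under normality it is approximately chi-square with 2 degrees of freedom (5% critical value about 5.99). A plain Python sketch, with a hypothetical function name and synthetic data rather than ETF returns:

```python
def jarque_bera(x):
    """Jarque-Bera statistic: n/6 * (skew^2 + (kurtosis - 3)^2 / 4)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skew = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


# Synthetic two-point sample: skewness 0 and kurtosis 1, far from the normal value of 3.
jb = jarque_bera([-1.0, 1.0] * 50)
```

Heavy-tailed return series have kurtosis well above 3, which is what drives the very large statistics reported here.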

Independence:

Next, we conducted the Ljung-Box test to check for the presence of autocorrelation for all 22 ETFs. The histogram of the p-values from the Ljung-Box tests shows that all the p-values were very close to zero. Therefore, we strongly rejected the null hypothesis that the return series contained no autocorrelation for all international ETFs.

[Figure: Histogram of p-value for Ljung Box on ETF Returns. x-axis: Ljung-Box p-value (0.0 to 0.0020); y-axis: number of ETFs.]
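As an illustration of what the test measures, the Ljung-Box statistic is Q = n(n + 2) Σ r_k² / (n − k) over lags k = 1..h, where r_k is the lag-k sample autocorrelation; under the null of no autocorrelation, Q is approximately chi-square with h degrees of freedom. The sketch below uses hypothetical function names and a synthetic series, and it stops at the statistic because the p-value needs a chi-square distribution function that the standard library does not provide.

```python
def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(x, max_lag):
    """Ljung-Box Q statistic using lags 1..max_lag."""
    n = len(x)
    return n * (n + 2) * sum(autocorr(x, k) ** 2 / (n - k) for k in range(1, max_lag + 1))


# Synthetic alternating series with strong negative lag-1 autocorrelation.
series = [1.0, -1.0] * 50
r1 = autocorr(series, 1)
q1 = ljung_box_q(series, 1)
```

With one lag, the 5% chi-square critical value is about 3.84, so a Q near 101 would reject the null decisively.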

Autocorrelation:

In the ACF plots included below for several of the largest ETFs, we found that the lag 1

coefficient had the largest magnitude. Since the lag 1 coefficient was negative, the ETFs seemed

to be short-term mean reverting. There were also some lags between 10 and 20 that had large

positive magnitudes, which may potentially signal medium-term momentum, but the lags could

be too far away to be meaningful.

[Figure: ACF plots of returns for lags 0 to 30. Panel titles: Autocorrelation of SPY Returns; Autocorrelation of EWG Returns; Autocorrelation of EWG Returns.]

We then collected all of the AR(1) coefficients when modeling each ETF as an AR(1)

process. The histogram on the left side of the following page shows that all of the AR(1)

coefficients were negative, meaning that all of the ETFs in our universe tended to display mean

reversion in the short term.
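One simple way to obtain an AR(1) coefficient is the least-squares slope of each return on the previous return. The sketch below is an assumption about the estimator (the report does not say which fitting method was used), and the function name is hypothetical.

```python
def ar1_coefficient(returns):
    """Least-squares slope of r[t] on r[t-1], with an intercept."""
    x, y = returns[:-1], returns[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


# Synthetic series in which each value is exactly -0.5 times the previous one.
synthetic = [(-0.5) ** t for t in range(12)]
phi = ar1_coefficient(synthetic)
```

A negative coefficient means an up day tends to be followed by a down day, which is the short-term mean reversion described above.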

Stationarity:

We conducted the Augmented Dickey-Fuller stationarity test on the ETF returns to see whether they contained a unit root. The histogram of the Augmented Dickey-Fuller test statistics shows that they were highly negative for all international ETFs: the most negative test statistic was around -73, and the least negative test statistic was -17. This result indicated that none of the international ETF return series contained a unit root and that they were consequently all stationary, which is expected for financial return series.

[Figure: Histogram of AR(1) Coefficient on ETF Returns. x-axis: AR(1) coefficients (-0.15 to 0.0); y-axis: number of ETFs.]

[Figure: Histogram of Augmented Dickey-Fuller Test Statistic. x-axis: ADF test statistic (-80 to 0); y-axis: number of ETFs.]

V. Overview of Our Strategy’s Performance

After confirming that our data was clean, we backtested our pair trading strategy to

generate returns and see how our strategy would have performed over time. We ran our rolling

backtest over the time period from September 04, 2004 to December 31, 2011. We selected

September 04, 2004 as the start date because the previous trading day was the

last day on which at least one of the 22 ETFs had zero trading volume. We did not

want to be trading any low liquidity ETFs.

On the following page, we included a graph of the strategy’s equity curve (cumulative

growth of investing $1 in the strategy). The strategy has a compound annualized growth rate of

-3.7%, a maximum drawdown of -53.3%, and a full period annualized Sharpe Ratio of -0.16 (the

full period annualized Sharpe Ratio was calculated by first dividing the mean daily excess return

above 10 year Treasuries, for the full period, by the daily standard deviation, then

annualizing this quotient by multiplying by 250/√250 = √250).
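The three performance numbers quoted above can be computed from a daily return series along the following lines. This is a sketch, not the project's code: it assumes 250 trading days per year (as in the annualization factor above), takes the daily risk-free rate as an input, and uses hypothetical function names.

```python
import math


def cagr(returns, periods_per_year=250):
    """Compound annualized growth rate of a daily return series."""
    level = 1.0
    for r in returns:
        level *= 1.0 + r
    return level ** (periods_per_year / len(returns)) - 1.0


def max_drawdown(returns):
    """Largest peak-to-trough decline of the equity curve, as a negative number."""
    level, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        level *= 1.0 + r
        peak = max(peak, level)
        worst = min(worst, level / peak - 1.0)
    return worst


def annualized_sharpe(returns, rf_daily=0.0, periods_per_year=250):
    """Mean daily excess return over daily standard deviation, times sqrt(250)."""
    excess = [r - rf_daily for r in returns]
    n = len(excess)
    mean = sum(excess) / n
    variance = sum((e - mean) ** 2 for e in excess) / (n - 1)
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)


# Small hypothetical examples.
dd = max_drawdown([0.10, -0.50, 0.20])
sharpe = annualized_sharpe([0.01, 0.03])
growth = cagr([0.10] + [0.0] * 249)
```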

[Figure: Equity Curve of Our Pair Trading Strategy. y-axis: levels (0 to 1.2).]

Interestingly, the above equity curve shows that our international ETF pairs trading strategy seemed to consistently lose money from early 2005 to early 2008. The strategy returns then jumped upwards erratically for a few years from mid 2008 to mid 2011. However, from mid 2011 onwards, the returns became negative again.

These results demonstrated the dynamics of dominance between mean reversion and

momentum in our pair trading strategy. The fact that our strategy consistently lost money from

2005 to 2008 meant that we were consistently making the wrong bets: instead of making a

successful bet on international ETF convergence, the ETF pairs seemed to diverge even more

after we selected them. In other words, there appeared to be momentum in the international ETFs

we traded during that period.

However, there seemed to be a regime shift after 2008, as our pair trading strategy’s

returns improved. What this suggests is that international ETFs started to become more mean

reverting than trending in the short to intermediate term. This makes intuitive sense, as the

economies of the world were in crisis during the couple of years after 2008, and so they—or at

least their markets—probably tended to move together. However, our project was an empirical

one, so the “economic story” behind the performance of our trading strategy was left as a future

research topic.

VI. EDA of Strategy Returns

The average strategy returns had a

heavy-tailed distribution and contained

autocorrelation, which was typical of most

financial returns. The ranked pair returns

displayed the same statistical

characteristics as the average returns,

although there did not appear to be a

relationship between the rank of the pair

and the level of autocorrelation or

stationarity.

Normality:

Both the Shapiro-Wilk and Jarque-Bera normality tests yielded p-values of 0.0000,

providing strong evidence that the returns were not normally distributed. The Jarque-Bera test

statistic had a very high value of 10,586.97, signifying the presence of heavy tails. The normal

qq-plot below confirmed this observation by showing that the strategy returns indeed followed a

heavy-tailed distribution.

Independence:

The Ljung-Box test resulted in a p-value of 0.0000, meaning that the

returns definitely contained autocorrelation and were not independent. The

acf-plot above supported this conclusion since it shows that there were

significant lags at lag 1, lag 5 and lag 6. Fitting an AR(p) model (p=1-6) to

the average returns revealed that the AR(1) model provided the best fit, with

the AR(6) model being a close second since they had the lowest AIC values.

This outcome was consistent with the results from the acf-plot.

Stationarity:

Finally, the Dickey-Fuller test resulted in a p-value of 1.01e-16, which indicated that the

returns were stationary and did not contain a unit root. This conclusion was expected because

a unit root is typical of series that accumulate past shocks, such as price data.

Returns data, on the other hand, are differences of prices and should be stationary

without containing a unit root.

Analysis of Ranked Pair Returns:

After performing the same analysis on the 5 ranked pair returns series, we found that each series also had heavy tails and autocorrelation, much like the average strategy returns. Some ranked pair returns contained more autocorrelation than others, but there did not appear to be a relationship between the rank of the pair and the degree of autocorrelation or stationarity. The pairs with rank 0 and 2 only had a few statistically significant lags while ranks 1, 3 and 4 had nearly all significant lags up until lag 20, and the pairs with rank 1 and 3 had the lowest p-values for the Dickey-Fuller test on an order of 10^-19 while ranks 0, 2 and 4 had higher p-values on an order of 10^-16.

AIC of AR(p) models fitted to the average strategy returns:

AR(p)   AIC
1       -8877
2       -8870
3       -8865
4       -8858
5       -8868
6       -8872

VII. Analysis of Strategy Performance: Rolling Sharpe Ratio

Analyzing the 20-day rolling Sharpe ratio of our trading strategy revealed that the

performance of our trading strategy was not very good given that the Sharpe ratio oscillated

around 0.00 across time. Calculating the rolling Sharpe ratio using the GARCH conditional

standard deviation, rather than the rolling standard deviation, resulted in a larger range of outliers in

addition to a smaller spread between quartile 1 and quartile 3 and did not improve the overall

performance of the strategy. A closer look revealed that the GARCH model should not be used

to fit the trading strategy returns at all.

Sharpe Ratio Using Unconditional Standard Deviation:

We first calculated the rolling Sharpe ratio by dividing the rolling mean by the rolling

standard deviation. Below on the left side of the page, the plot of the rolling mean and rolling

standard deviation showed that there was a positive relationship between risk and reward, since

the mean return tended to increase as the standard deviation increased. On the right side, the plot

of the rolling Sharpe ratio using rolling standard deviation showed that the series seemed to

fluctuate around 0.00.

The box plot below showed that the rolling Sharpe ratio did in fact have a mean of 0.00, and the

values between quartiles 1 and 3 ranged from -0.01 to +0.01. Based on this result, our trading

strategy did not seem to have much value.
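A rolling Sharpe ratio of this kind (rolling mean divided by rolling standard deviation over a 20-day window) can be sketched as below. The function name is hypothetical, no risk-free rate is subtracted, and no annualization is applied; whether the project did either inside the rolling window is not stated.

```python
import math


def rolling_sharpe(returns, window=20):
    """Rolling mean divided by rolling sample standard deviation."""
    out = []
    for end in range(window, len(returns) + 1):
        chunk = returns[end - window:end]
        mean = sum(chunk) / window
        sd = math.sqrt(sum((r - mean) ** 2 for r in chunk) / (window - 1))
        out.append(mean / sd if sd > 0 else float("nan"))
    return out


# Hypothetical returns alternating between 1% and 3%.
ratios = rolling_sharpe([0.01, 0.03] * 15)
```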

Sharpe Ratio Using Conditional Standard Deviation:

The plot of the average strategy returns above on the right showed that there was some

volatility clustering from Q4 2008 to Q2 2009, so after fitting a GARCH(1,1) model to the

returns we recalculated the rolling Sharpe ratio by dividing the rolling mean by the conditional

standard deviation. We thought this might improve our results since the GARCH conditional

standard deviation should be better at accounting for volatility clustering than rolling standard

deviation, but the performance of the strategy ended up being worse as the mean of the rolling

Sharpe ratio still remained at 0.00 while the spread between quartiles 1 and 3 shrank even further

as shown in the box plot. It was also evident from the box plot on the previous page that there

were more outliers when using the conditional standard deviation than for the unconditional

standard deviation. One particular outlier could be seen in Q3 2010 of the plot of the rolling

Sharpe ratio using conditional standard deviation.

Comparing Conditional vs. Unconditional Standard Deviation:

A comparison of the conditional standard deviation with the unconditional rolling

standard deviation in the plot to the left revealed that the conditional standard deviation

contained much more variance and had higher peaks. The conditional standard deviation had a

variance of 0.00018 while the unconditional standard deviation had a variance of 0.00013. This

explained why the Sharpe ratios calculated using the conditional standard deviation were smaller

than the Sharpe ratios calculated using the unconditional standard deviation, since the

denominator of the ratio was standard deviation.

Comparing the above box plots of the unconditional standard deviation and conditional

standard deviation further supported this observation by showing that the conditional standard

deviation had many more outliers on the upward end than for the unconditional standard

deviation.

Evaluation of GARCH(1,1) Model:

To evaluate whether using a GARCH model was appropriate in the first place, we first

confirmed that there was significant autocorrelation in the trading strategy’s average squared

returns, which suggested that the returns might display time-varying conditional

heteroskedasticity. Additionally, the Lagrange-Multiplier test produced a p-value of 0.0002,

indicating that the residuals of the GARCH model did show an ARCH effect. However, despite

the previous evidence supporting the use of the GARCH model, the normal qq-plot of the

GARCH residuals showed that the residuals were not normally distributed at all, which meant

that the GARCH model could not be used for modeling the standard deviation of the trading

strategy’s returns.

Furthermore, there was still autocorrelation present in both the residuals and squared residuals,

based on the p-values of 0.0000 from the Ljung-Box test, which meant that the GARCH model

did not successfully model the serial correlation structure in the conditional standard deviation.

Finally, the “C” coefficient in the GARCH model was not statistically significant, with a p-value

of 0.5662 that showed that the coefficient could actually be zero instead.

VIII. Analysis of Strategy Performance: Kelly Betting

We also analyzed our strategy performance by studying the wealth process of an investor

who used Kelly betting when making daily investments in our trading strategy (i.e. he would

lever our strategy’s performance—the average return of the five pairs traded—every day based

on the Kelly Criterion). We simulated both the full and fractional versions of the Kelly Criterion

under varying restrictions. In terms of the full version of the simulation, the long-only,

unleveraged strategy performed better than the long-short, leveraged strategy. On the other hand,

the fractional version of the simulation considerably outperformed the full version altogether.

Regardless of the Kelly strategies’ relative performances to each other, all the strategies

did poorly and had negative CAGR values. However, each CAGR was still higher than our

trading strategy’s CAGR of -3.70%. This was consistent with the fact that Kelly betting focuses

on maximizing long-term terminal wealth, rather than short-term wealth, so the Kelly strategies

should have higher CAGR values than the underlying trading strategy. Overall, the results from

our Kelly simulation further confirmed the weak performance of our trading strategy since even

maximizing the long-term wealth could not produce positive returns.

Full Kelly Criterion (Long-Only, Unleveraged)

In this scenario, the investor began with an

initial wealth of 100 and only made unleveraged bets

if there was a positive expected return (calculated

from time 0 to t). To the right, the plot of the wealth

time series showed that no bets were made for a long

period of time since the expected return was

consistently negative up until approximately day

1200 (year 2008). However, after day 1200, the level of wealth spiked briefly before plummeting

to a value of 88. The simulation resulted in a negative CAGR of -0.65%. Calculating the

summary statistics for the wealth return series showed that the returns

had both a negative mean and Sharpe Ratio, and the downside was 4

times greater than the upside. Since the expected returns were so

consistently negative over time we also tried applying the long-short, leveraged version of the

Kelly betting strategy to take unleveraged short positions when expected returns were negative,

but this strategy did not fare any better as we will discuss next.

Full Kelly Criterion (Long-Short, Leveraged)

We updated the investor’s strategy to include

unleveraged short positions as well as leveraged long

positions up to twice the amount of the investor’s

total wealth. There was visibly more variance in the

wealth time series, and a higher number of bets were

made due to the inclusion of short positions.

However, after ending at a value of 84, the

strategy resulted in a negative CAGR of -1.53%, indicating that it was

even less effective than the previous long-only strategy. The summary

statistics for the wealth return series showed that the standard deviation

doubled in the positive direction, while both the mean return and CAGR doubled in the negative

direction. The maximum return also doubled, but the minimum return actually remained close to

the same as for the long-only strategy. This suggested that since the max loss of wealth on any

given day stayed roughly the same, we were unfortunately taking more bets that went against us

when compared to the long-only Kelly Criterion.

Summary Statistics, Full Kelly (Long-Only, Unleveraged):

Max: 2.39%   μ: -0.01%   Min: -8.03%   σ: 0.27%   CAGR: -0.65%

Summary Statistics, Full Kelly (Long-Short, Leveraged):

Max: 4.28%   μ: -0.03%   Min: -8.04%   σ: 0.52%   CAGR: -1.53%

Fractional Kelly Criterion (Long-Short, Leveraged)

In the full Kelly criterion, a huge assumption was made that the historical returns were indicative of future expected returns and variance. Fractional Kelly mitigated that risk by scaling down the size of the bet. After testing a few values of f (0 < f < 1), we found that decreasing the value of f improved the stability of the returns, with the tradeoff of expected returns of smaller magnitude. For f=0.20, the strategy had a CAGR of -0.29% and a

mean return of -0.03%. Decreasing to f=0.05 resulted in a higher CAGR

of -0.07%, smaller spread between the minimum and maximum from

-1.61% to +0.86%, lower standard deviation of 0.03%, but also a very

small expected return close to 0. On the other hand, increasing to f=0.50 resulted in a lower

CAGR of -0.74%, larger spread between the minimum and maximum from -4.02% to +2.14%,

higher standard deviation of 0.26%, and also a more negative expected return of -0.05%.
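The three Kelly variants can be sketched with one sizing function. Important assumptions, none of which are confirmed by the report: the bet size is the common continuous-time approximation f* = mean / variance of past returns, estimated on all returns before the current day; long-only unleveraged means clamping to [0, 1]; long-short leveraged means clamping to [-1, 2]; and fractional Kelly multiplies f* by a scale before clamping. All names are hypothetical.

```python
def kelly_fraction(past_returns, scale=1.0, lower=0.0, upper=1.0):
    """Approximate Kelly bet size mean/variance, scaled and clamped to [lower, upper]."""
    n = len(past_returns)
    if n < 2:
        return 0.0  # not enough history to estimate a variance
    mean = sum(past_returns) / n
    variance = sum((r - mean) ** 2 for r in past_returns) / (n - 1)
    if variance == 0:
        return 0.0
    return max(lower, min(upper, scale * mean / variance))


def kelly_wealth(returns, initial=100.0, **sizing):
    """Wealth path when each day's bet uses only returns observed before that day."""
    wealth = initial
    path = [wealth]
    for t, r in enumerate(returns):
        f = kelly_fraction(returns[:t], **sizing)
        wealth *= 1.0 + f * r
        path.append(wealth)
    return path


# Hypothetical returns: no bet on the first two days, then a full bet on day three.
path = kelly_wealth([0.01, 0.03, 0.10])
```

For example, `kelly_wealth(returns, scale=0.2, lower=-1.0, upper=2.0)` would correspond to the fractional long-short case with f = 0.20 under these assumptions.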

IX. Improving Our Strategy’s Performance

Since our international ETF pairs trading strategy seems to lose money most of the time,

we decided to look at improving our strategy’s performance, firstly by reversing the direction of

the positions we take, and then applying a moving average filter to our strategy’s equity curve.

Reversing the Strategy:

By reversing the direction of our original pair trades, we were now betting that the

international ETF pairs would continue to diverge after selecting them; since we selected pairs

based on the magnitude of divergence (as measured by a residual), the bet is essentially that there

is momentum in the divergence of international ETF pairs. The reversed strategy had a

compound annualized growth rate of 3.8%, a maximum drawdown of -66.7%, and a full period

annualized Sharpe Ratio of 0.16.

Summary Statistics, Fractional Kelly (f=0.20):

Max: 0.86%   μ: -0.03%   Min: -1.61%   σ: 0.10%   CAGR: -0.29%

We noticed that this strategy’s returns tend to trend over the medium to long term.

Applying a moving average filter to the equity curve seems to be a good way to capture the

trending nature of the strategy’s returns. Specifically, we would calculate the moving average of

the strategy’s equity curve/cumulative growth. If the strategy is currently underperforming its

average performance, we would short the strategy (in this case, since the “strategy” under

consideration is actually the reverse of our original pair trading strategy, shorting the “strategy”

means taking the original unreversed trade). Likewise, if the strategy’s performance is higher

than its average performance, we would long the strategy.

Moving Average Filter:

We decided to test the performance of a 200-day moving average as the type of trade

filter described above. The first graph is a plot of the reversed strategy’s equity curve along with

its 200-day moving average. The second graph is a plot of the reversed strategy’s equity curve

after filtering the trades by the 200-day moving average as described in the previous paragraph.

The performance numbers of the reversed pair trading strategy with the 200-day moving average

filter were as follows: 23.8% compound annualized growth rate, -35.2% maximum drawdown,

and a full period annualized Sharpe Ratio of 0.95.
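The filter rule described above can be sketched as follows. This is a minimal illustration, not the original backtest code. It makes three assumptions the report does not state: the signal for day t uses only the equity curve through day t-1 (to avoid look-ahead), the strategy stays flat until a full window of history exists, and a tie between the equity curve and its average counts as underperformance. The function name is hypothetical.

```python
def moving_average_filter(daily_returns, window=200):
    """Return filtered daily returns: long the strategy when its equity
    curve closed above its moving average on the previous day, short it
    otherwise."""
    equity = []     # unfiltered equity curve, one value per day
    filtered = []   # daily returns after applying the filter
    value = 1.0
    for t, r in enumerate(daily_returns):
        if t >= window:
            # Moving average over the `window` days ending at t-1.
            avg = sum(equity[t - window:t]) / window
            position = 1 if equity[t - 1] > avg else -1
        else:
            position = 0  # assumed: no position during the warm-up period
        filtered.append(position * r)
        value *= 1.0 + r
        equity.append(value)
    return filtered
```

Because the "strategy" here is the reversed pair trade, a position of -1 corresponds to taking the original, unreversed trade.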

[Figure: Reversed Strategy Equity Curve]

The filtered strategy was still very volatile: its daily return standard deviation was roughly

the same as that of the unfiltered strategy (1.79%). However, its compound

annualized growth rate was relatively high at 23.8%, which suggests that the 200-day

moving average filter, on average, correctly captures the regime shifts between

momentum and mean reversion in our international ETF pairs.

[Figure: Reversed Strategy Equity Curve Plotted with MA 200 (series: Reversed Strategy Equity Curve, MA200)]

[Figure: Equity Curve After Applying MA 200 Filter]

X. Final Considerations

There are several considerations to take into account in the trading strategy that we

developed, and these may be interpreted either as determinants of risk in our strategy or as jumping-off

points for extensions of our analysis. First of all, our trading strategy does not factor in

transaction costs, which may be significant since we are making trades somewhat frequently at a

rate of 5 new positions every 20 days; since we are trading pairs, this means 10 trades per rebalance,

and 20 trades including both entry and exit. This averages to about a single trade every trading

day, which may be too frequent for an individual speculator, but may not be out of the realm of

possibility for a large institution like a hedge fund. In addition to transaction costs, there is

slippage due to illiquidity. We noticed that some of the international ETFs only had average

daily volume in the tens of thousands for about the first year in our backtesting period: an

institution could have had trouble trading large quantities of these ETFs in the early years

without moving the market too much. Including transaction costs and slippage would reduce the

overall profitability of our strategy.
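The trade-count arithmetic above, and a rough sense of how costs scale with it, can be written out as a short sketch. The cost model is purely an assumption for illustration: `cost_per_trade` is a hypothetical proportional cost on traded notional, and the whole portfolio is assumed to be entered and exited once per 20-day holding period, so turnover is twice the capital per period. No actual cost estimate appears in this report.

```python
def trades_per_day(pairs=5, holding_days=20):
    """Average number of trades per trading day for the pair strategy."""
    legs = 2 * pairs               # each pair has one long and one short leg
    round_trip_trades = 2 * legs   # every leg is entered and later exited
    return round_trip_trades / holding_days


def annual_cost_drag(cost_per_trade, holding_days=20, trading_days=252):
    """Approximate annual return lost to proportional trading costs.

    Assumes the full capital is traded in and out once per holding period.
    """
    turnover_per_period = 2.0      # entry plus exit, as a multiple of capital
    periods_per_year = trading_days / holding_days
    return turnover_per_period * cost_per_trade * periods_per_year
```

For example, a hypothetical cost of 10 basis points per trade would reduce annual returns by about 2.5 percentage points under these assumptions, before accounting for slippage.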

Secondly, our trading strategy determines the ranking of the pairs based only on the last

residual of the 120-day formation period. This is to ensure that we are trading pairs that have the

greatest residual—the greatest “divergence”—right before we enter the trades, since we are

betting that the pairs will converge in the near future. We considered incorporating an

exponential moving average of the formation period residuals (weighting the more recent

residuals more) instead of basing our trading decision on a single data point, but we decided

against it because we did not expect problems with highly fluctuating residuals,

since our exploratory data analysis on the international ETFs’ data came out clean. Extensions

of our work could consider using either an exponential moving average or some

other method of incorporating the past residuals in the formation period.
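The exponential moving average extension mentioned above could look like the sketch below. It was not part of our backtest. It assumes the common smoothing factor alpha = 2 / (span + 1), a hypothetical `span` of 20 days, and residuals listed oldest first; the function names and the pair labels in the usage example are placeholders.

```python
def ema_residual(residuals, span=20):
    """Exponentially weighted average of the formation-period residuals,
    weighting the more recent residuals more heavily."""
    alpha = 2.0 / (span + 1)       # assumed smoothing convention
    ema = residuals[0]
    for e in residuals[1:]:
        ema = alpha * e + (1.0 - alpha) * ema
    return ema


def rank_pairs(residuals_by_pair, top_n=5, span=20):
    """Return the `top_n` pairs with the largest absolute smoothed residual."""
    scores = {
        pair: abs(ema_residual(res, span))
        for pair, res in residuals_by_pair.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Usage with placeholder pair labels and made-up residuals:
example = {"A/B": [0.0, 0.0, 0.1], "C/D": [0.0, 0.0, 2.0], "E/F": [0.0, 0.0, -3.0]}
top_two = rank_pairs(example, top_n=2)
```

Compared with ranking on the last residual alone, this dampens the effect of a single unusually large final-day residual, at the cost of reacting more slowly to a fresh divergence.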

Thirdly, our initial strategy equity curve suggests that international ETFs tend to diverge

during good economic times and converge during bad economic times. Our trading strategy

performed poorly up until the financial crisis in 2008. It then started performing well as the

world’s economies began to move together during the crisis, before dropping

again partway through 2011. Whether this pattern holds more generally is a research question worth investigating.

Lastly, there is the peso problem, where historical data may not reflect all risks,

especially those in the future. An example of this phenomenon can be seen in the backtest period

from 2004 to 2008. The returns to our strategy were very consistent during that time period and

volatility was low (in this specific case, the international ETF pairs tended to diverge even more

after we picked them). If we were to put ourselves in 2008, given the historical data up until that

point, we would not have known when—if ever—the persistence in the momentum of

international ETF pairs would break down; indeed, it did break down immediately, during the

financial crisis, when international ETF pairs suddenly became mean reverting (i.e., the

international ETF pairs started moving together) and our bets on ETF pair convergence started

making money. Our models could not have foreseen this risk from the historical data alone.

This is why adaptive models and trading strategies are valuable in the volatile and

unpredictable world we live in today.