Cointegration and spurious regression: Enhancing delivery via replication, empirical application and simulation

Steve Cook
Department of Economics
Swansea University
s.cook at swan.ac.uk

Published July 2022

Summary
1. Introduction
2. Spurious regression
3. Single equation cointegration analysis
4. Further cointegration analysis
5. Conclusion
References
Notes

Summary

Spurious regression and cointegration are familiar components of higher-level undergraduate and postgraduate econometrics courses. Drawing upon both empirical and simulation analysis, this paper discusses and demonstrates an active learning approach to the delivery of these topics. The approach presented emphasises two forms of replication. While the first form relates to the familiar reproduction of empirical findings, the second alternative form considers the replication of results automatically generated by software. To support the adoption of the proposed active learning approach within the classroom, the resources employed in this paper are provided. These resources include both an adaptable programme to simulate artificial data and a data set allowing cointegration analysis to be illustrated via examination of the much-researched hypothesis of a ripple effect within the UK housing market.

1. Introduction

The notion of cointegration occupies a central role in higher-level undergraduate and postgraduate econometrics courses. A natural approach to teaching this topic is to begin with discussion of the notion of spurious regression before considering the seminal work of Engle and Granger (1987) on the use of single equation cointegration analysis. While important in themselves, these topics serve also to provide the foundation for the coverage of further approaches such as multivariate cointegration analysis (Johansen, 1988) and bounds testing for long-run relationships (Pesaran et al., 2001). This paper considers how the teaching of the Engle-Granger procedure and spurious regression can be enhanced by the adoption of an approach based upon active learning. Combining relevant elements from econometric research with exercises based upon both empirical and simulation analysis, a discussion and accompanying resources are provided to support a practical coverage of these topics.

A particular feature of the approach proposed is the use of two forms of replication. In addition to considering replication in the usual sense of reproducing empirical results, its use is also championed here in an alternative manner involving the manual reproduction of results automatically produced by software. Under this second approach to replication, knowledge of methods and approaches is both challenged and developed by undertaking the hidden steps underlying the production of automated results. In simple terms, this second form of replication poses the following question: do you know the methods being considered in sufficient detail to work through the required, and hidden, steps of the analysis to reproduce the results automatically produced by the software?

The use of active learning has been discussed for both econometrics, and statistics more generally, in a range of studies (see, inter alia, Snee, 1993; Allen and Baughman, 2016; Cook, 2016; Rock et al., 2016; Bloomer Green et al., 2018). Within this research, benefits are reported as a result of students undertaking activities in relation to the topics being studied rather than simply acting as third-party recipients of information. The present paper adds to this body of work with its emphasis on practical tasks and the provision of resources to allow others to adopt an active learning approach to delivery in the classroom or, more precisely, computer lab. In the process of undertaking simulation and empirical analyses, the approach adopted also supports the development of transferable skills in relation to coding and data analysis. However, a key feature of the proposed approach is the use of replication, which has the benefit of allowing further themes in the pedagogical literature to be addressed, namely those relating to ‘anxiety towards quants’ (Dreger and Aiken, 1957; Dowker et al., 2016) and self-efficacy (Bandura, 1977; Zajacova et al., 2005).

In particular, by achieving the targeted goal or outcome provided by the replication exercise, one’s confidence or level of self-efficacy can be developed and in the process potential anxiety towards quantitative methods can be addressed. In combination, the practical coverage and use of replication proposed herein provide the opportunity to test understanding of how methods are structured and which choices are made at each stage of their implementation, before providing the reassurance and confidence afforded by the successful reproduction of results automatically produced by software.

As has been noted, the coverage of spurious regression and cointegration that follows provides a discussion of relevant elements of econometric research combined with practical exercises to support delivery in the classroom. Importantly, the code and data required for this are provided. In addition to working through analysis with artificial data, a further set of data is provided to allow unit root and cointegration analysis to be undertaken within the context of a topical example relating to the UK housing market. For this final example, a collection of results is presented, with their replication providing the basis for additional classroom activities and exercises. All of the analysis, both empirical and simulation, reported in this paper is undertaken using the EViews 12 software package.

2. Spurious regression

The notion of spurious regression relates to the incorrect detection of an apparent relationship between independent time series processes. While Granger et al. (2001) consider this in relation to stationary series with ‘strong temporal properties’ (p.899), spurious regression is most commonly considered in relation to the false detection of relationships between independent unit root processes. Prominent studies associated with this topic are those of Granger and Newbold (1974) and Phillips (1986), where simulation and theoretical analyses are presented. The work of Phillips (1986) is particularly informative as it provides some very important results concerning spurious regression. These findings can be illustrated via consideration of the following simple example. Suppose an investigator wishing to examine the relationship between two independent unit root processes {y_t,x_t} estimates the following regression:

(1) y_t= α + βx_t + w_t

An obvious approach to considering the relationship between {y_t,x_t} would involve examination of the t-ratio for β̂ and the R² obtained from this regression. However, the findings of Phillips (1986) show that as the sample size tends to infinity, β̂ has a non-degenerate limiting distribution and the distribution of its associated t-ratio, t(β̂), diverges. In addition, Phillips (1986) shows the coefficient of determination from (1) to have a non-degenerate limiting distribution. In short, the statistics we would wish to consult to determine whether a relationship is present cannot be relied upon in these circumstances given their tendency to spuriously detect a relationship when none exists.
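As a compact restatement of the results just described, the limiting behaviour can be sketched as follows. The symbols ξ_β, ξ_t and ξ_R are placeholder names for non-degenerate limiting random variables (they are not the notation of Phillips, 1986), and T denotes the sample size. The √T rate for the t-ratio and the Durbin-Watson result are drawn from Phillips (1986) rather than from the discussion above.

```latex
\hat{\beta} \Rightarrow \xi_{\beta}, \qquad
T^{-1/2}\, t(\hat{\beta}) \Rightarrow \xi_{t}, \qquad
R^{2} \Rightarrow \xi_{R}, \qquad
DW \xrightarrow{\;p\;} 0 .
```

Because it is T^{-1/2} t(β̂), rather than t(β̂) itself, that settles down to a limiting distribution, |t(β̂)| grows without bound as T increases, so conventional critical values reject the hypothesis of no relationship ever more frequently. A Durbin-Watson statistic close to zero is a familiar warning sign in such regressions.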

To adopt an active learning approach towards spurious regression, simulation can be employed to generate two independent unit root processes which can then be included in a simple regression of the form of (1) above to provide an illustration of spurious regression in action. The important issue here is that the use of simulation allows the creation of series that are known to be independent by design. Following standard practice, two unit root processes can be generated as given by (2) and (3):

(2) x_t = x_{t–1} + v_t,   v_t ~ N(0, 1)

(3) y_t = y_{t–1} + e_t,   e_t ~ N(0, 1)

Code to employ within the programming facility of EViews 12 to generate these artificial series is provided in the file spurious_xy.prg. Following convention, an initial value of zero is chosen for both series: x₀ = 0; y₀ = 0. To avoid use of the same starting value for the series within the sample employed for analysis, this initial observation is deleted to create a sample of 280 observations. Although obviously artificial, the sample is labelled as 1952Q1 to 2021Q4 to mimic the type of sample available for macroeconomic series from known outlets such as the FRED database of the Federal Reserve Bank of St. Louis. With the resulting series {y_t,x_t} created, a simple linear regression given by (1) can be run. The summarised results from this regression are reported in Table One below. From inspection of the results obtained, it can be seen that β̂ = –0.4009, t(β̂) = –13.5917, the p-value associated with t(β̂) is 0.00% and R² = 39.9%. As a result, these series have provided an illustration of spurious regression in action: while t(β̂) indicates a link between the two series and the R² is well above zero, it is known that the two series are independent processes.

Table One: Estimation of equation (1)

Variable	Coefficient	Std. Error	t-Statistic	Prob.
Dependent Variable: y
C	–0.683207	0.413254	–1.653238	0.0994
x	–0.400921	0.029498	–13.59170	0.0000
R-squared: 0.399223 Durbin-Watson stat: 0.064917
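The simulation design of (2) and (3) and the regression in (1) can also be sketched outside EViews. The Python illustration below is not the code contained in spurious_xy.prg: the seed, the function names and the reliance on Python's standard library are all assumptions made for illustration, so the numbers it prints will differ from those in Table One.

```python
import math
import random

def random_walk(n, rng):
    """Return n observations of a driftless random walk started at zero.
    The initial value of zero is itself excluded, mirroring the deletion
    of the initial observation described in the text."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)  # N(0, 1) innovation
        path.append(level)
    return path

def simple_ols(y, x):
    """OLS of y on an intercept and x; returns (alpha, beta, t_beta, r2)."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    ssr = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    se_beta = math.sqrt(ssr / (n - 2) / sxx)
    return alpha, beta, beta / se_beta, 1.0 - ssr / sst

rng = random.Random(12345)  # hypothetical seed, not the one used in spurious_xy.prg
x = random_walk(280, rng)   # equation (2)
y = random_walk(280, rng)   # equation (3), independent of x by construction
alpha, beta, t_beta, r2 = simple_ols(y, x)
print(f"beta = {beta:.4f}, t(beta) = {t_beta:.4f}, R-squared = {r2:.4f}")
```

Changing the seed produces a fresh pair of independent series, which allows the frequency of apparently significant t-ratios to be explored informally.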

The use of code to produce artificial series which are independent by construction allows spurious regression to be actively explored rather than simply considered. In addition, while the code provided has been employed to create a single pairing of the series {y_t,x_t} it can obviously be manipulated to generate more series to allow further analysis of the concept of spurious regression. A simple approach would be to change the seed for the random number generator to produce different series. Alternatively, a loop could be added and the code amended to produce a range of series, with a matrix created to collect the relevant t-ratios and R² values from estimated simple linear regressions for a collection of series.
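The loop-based extension suggested above might look as follows. This is a minimal Python sketch rather than an amendment of the EViews program: the number of replications, the seed, the 1.96 critical value and all function names are assumptions chosen for illustration.

```python
import math
import random

def random_walk(n, rng):
    """n observations of a driftless random walk started at zero."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path

def t_ratio_and_r2(y, x):
    """t-ratio on the slope and R-squared from OLS of y on an intercept and x."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    ssr = sum((yi - y_bar - beta * (xi - x_bar)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return beta / math.sqrt(ssr / (n - 2) / sxx), 1.0 - ssr / sst

def spurious_experiment(reps=500, n=280, crit=1.96, seed=2022):
    """Repeat the spurious regression 'reps' times and collect the results."""
    rng = random.Random(seed)  # hypothetical seed
    results = []               # one (t-ratio, R-squared) pair per replication
    for _ in range(reps):
        x = random_walk(n, rng)
        y = random_walk(n, rng)
        results.append(t_ratio_and_r2(y, x))
    reject_share = sum(abs(t) > crit for t, _ in results) / reps
    mean_r2 = sum(r2 for _, r2 in results) / reps
    return results, reject_share, mean_r2

results, reject_share, mean_r2 = spurious_experiment()
print(f"share of |t| > 1.96: {reject_share:.3f}, mean R-squared: {mean_r2:.3f}")
```

With independent series and a well-behaved test, rejections at the 5% level would be expected in roughly one replication in twenty. The share reported here can be compared against that benchmark in class, and the experiment repeated for different sample sizes.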

3. Single equation cointegration analysis

Following the discussion of spurious regression, attention can now turn to the concept of cointegration and its focus on the presence of genuine long-run relationships between series. A foundation for this is provided by the single equation approach of Engle and Granger (1987). Continuing with our bivariate analysis, this method involves a two-stage procedure. The first stage is provided by the simple linear regression given by equation (1) above. Given the focus on potential cointegration, this is now labelled as a static cointegrating regression. To examine whether a cointegrating relationship exists between {y_t,x_t}, the residuals {ŵ_t} from (1) are saved and the second stage of the procedure involves application of a (cointegration) Augmented Dickey-Fuller (CADF) test to examine whether the residual series possesses a unit root. If the linear combination of {y_t,x_t} produced by regression analysis creates a residual series that does not have a unit root, then {y_t,x_t} are deemed to be cointegrated. Conversely, if the residual series is found to be a unit root process, then {y_t,x_t} are viewed as not cointegrated. Formally, the residuals {ŵ_t} from (1) are employed in the following CADF testing equation:

(4) Δŵ_t = ρŵ_{t–1} + Σ_{i=1}^{p} γ_i Δŵ_{t–i} + ε_t

where the presence of a unit root in {ŵ_t}, or absence of cointegration between {y_t,x_t}, is tested via the null H₀: ρ = 0 in (4) using the t-ratio of the estimated ρ. The test is one-sided with the alternative being given as H₁: ρ < 0. As either y_t or x_t can be the dependent variable in the static cointegrating regression, the analysis can be performed twice, with each series used in turn as the dependent variable and the other as the regressor. This is the issue of ‘normalisation’ and results in two sets of residuals being available for analysis.

A further issue to consider relates to the use of deterministic terms when employing this method. While deterministics are absent from the testing equation of (4) due to the consideration of a residual series, the Engle-Granger procedure has been considered here with a single deterministic term (an intercept, α) included in the first stage static cointegrating regression of (1). However, the deterministic terms employed can be extended to include a trend term as a regressor if required, with the inclusion or exclusion of a trend term leading to discussion of the notions of stochastic and deterministic cointegration (see Ogaki and Park, 1997). Here the commonly considered approach of including an intercept as the sole deterministic term in the cointegrating regression will be followed.

To illustrate the Engle-Granger procedure in practice, the artificial series {y_t,x_t} above can be revisited. This has the additional advantage, as will be seen, of re-emphasising the issue of spurious regression. Ahead of this analysis, unit root testing can be employed to examine the order of integration of the {y_t,x_t} series. Although these series are known to be unit root processes by construction, their presence provides the opportunity to run through an application of unit root analysis. The DF-GLS test of Elliott et al. (1996) was applied with an intercept as the sole deterministic term and the Schwarz Information Criterion (SIC) used to determine the degree of augmentation of the unit root testing equations. At conventionally considered levels of significance, the unit root hypothesis could not be rejected for the {y_t,x_t} series but was rejected for their first differences.[1]

With {y_t,x_t} found to be unit root processes, the Engle-Granger approach can be applied to the series. Within EViews 12, this can be undertaken via an ‘automated’ approach whereby the two series are selected and a cointegration test option is applied automatically. Application of this test to the two series, with the Modified Akaike Information Criterion (MAIC) used to determine the degree of augmentation of the testing equation of (4), produced the summarised results presented in Table Two below. Considering Table Two, the heading ‘Dependent’ indicates which series has been used as the dependent variable in the first stage static cointegrating regression. From inspection of these results, it can be seen that the null of no cointegration is not rejected at conventionally considered levels for either set of results, with large p-values of over 63% (y_t as the dependent variable) and 89% (x_t as the dependent variable) reported. This absence of cointegration reinforces the previous discussion of spurious regression between the {y_t,x_t} series.

Table Two: Engle-Granger test results for {y_t,x_t}

Dependent	tau-statistic	Prob.
x_t	–1.028292	0.8965
y_t	–1.798171	0.6317

As noted above, the results presented in Table Two were calculated using the automated option within EViews 12. An opportunity to test understanding then arises by attempting to replicate these results manually by undertaking the steps underlying their generation. Alternatively expressed, knowledge of the Engle-Granger procedure can be developed by reflecting upon and conducting the underlying steps hidden in the production of the findings in Table Two. To do this, the appropriate static cointegrating regressions need to be run, with the residuals from these saved and employed in appropriately specified (C)ADF tests to replicate each of the results above. To replicate these results, a number of issues have to be appreciated and understood: the appropriate use of deterministic terms in the first step of the procedure; the absence of deterministic terms in the second step; the use of a relevant approach to select the degree of augmentation of the CADF testing equation; and use of the relevant residual series.

Estimating the static cointegrating regressions, saving the residuals and performing the unit root tests produced the results presented in Tables Three and Four, where ‘RES_AB’ denotes the residual series from a static cointegrating regression in which the variable ‘A’ is used as the dependent variable and ‘B’ is the regressor. In addition to testing understanding of the mechanics of the Engle-Granger procedure, the process of replicating the automated findings presents a further challenge to understanding, as the unit root analysis undertaken in the second stage of the procedure will generate unit root critical values and p-values rather than those for a cointegration test. This is something that can be explored in class to emphasise a number of issues, including the difference between unit root analysis of an actual series and a derived residual series, and the difference between unit root and cointegration critical values.

Table Three: Replicating Engle-Granger test results 1

Null Hypothesis: RES_YX has a unit root
Exogenous: None
Lag Length: 1 (Automatic - based on Modified AIC, maxlag=15)
	t-Statistic
Augmented Dickey-Fuller test statistic	–1.798171

Table Four: Replicating Engle-Granger test results 2

Null Hypothesis: RES_XY has a unit root
Exogenous: None
Lag Length: 1 (Automatic - based on Modified AIC, maxlag=15)
	t-Statistic
Augmented Dickey-Fuller test statistic	–1.028292
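The two ‘hidden’ stages replicated in Tables Three and Four can be sketched in a few lines of Python. This is an illustrative sketch only, written under several stated assumptions: the series are freshly simulated (so the statistics will not match those in the tables), the degree of augmentation is fixed at one lag rather than selected by the Modified AIC, and the helper names (solve, ols, engle_granger_tau) are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with pivoting."""
    k = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [mr - f * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

def ols(y, cols):
    """OLS of y on the regressors in cols (no constant is added automatically).
    Returns (coefficients, standard errors, residuals)."""
    k, n = len(cols), len(y)
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    coef = solve(xtx, xty)
    resid = [yi - sum(c * col[t] for c, col in zip(coef, cols)) for t, yi in enumerate(y)]
    s2 = sum(e * e for e in resid) / (n - k)
    se = []
    for j in range(k):  # j-th diagonal element of the inverse of X'X
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        se.append(math.sqrt(s2 * solve(xtx, unit)[j]))
    return coef, se, resid

def engle_granger_tau(dep, reg, lags=1):
    """Two-stage Engle-Granger statistic with a fixed degree of augmentation."""
    n = len(dep)
    # Stage 1: static cointegrating regression with an intercept, as in (1)
    _, _, w = ols(dep, [[1.0] * n, reg])
    # Stage 2: ADF regression on the residuals with no deterministic terms
    dw = [w[t] - w[t - 1] for t in range(1, n)]  # dw[t - 1] is the change at time t
    rows = range(lags + 1, n)
    lhs = [dw[t - 1] for t in rows]
    cols = [[w[t - 1] for t in rows]]            # lagged residual level
    cols += [[dw[t - 1 - j] for t in rows] for j in range(1, lags + 1)]
    coef, se, _ = ols(lhs, cols)
    return coef[0] / se[0]                       # t-ratio on rho

def random_walk(n, rng):
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path

rng = random.Random(7)            # hypothetical seed
x = random_walk(280, rng)
y = random_walk(280, rng)
tau_yx = engle_granger_tau(y, x)  # y as dependent variable (residuals RES_YX)
tau_xy = engle_granger_tau(x, y)  # x as dependent variable (residuals RES_XY)
print(f"tau (y dependent) = {tau_yx:.4f}, tau (x dependent) = {tau_xy:.4f}")
```

The statistics produced in this way must be compared with cointegration critical values rather than standard Dickey-Fuller ones, since the residuals are estimated rather than observed. This is the distinction highlighted in the discussion above.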

4. Further cointegration analysis

To further develop understanding of the Engle-Granger procedure, additional empirical analysis can be undertaken. While Economics contains numerous examples of theories where long-run relationships between series are proposed, one particular example that can be considered is the notion of a ripple effect within the UK housing market. Under this hypothesis, changes in house prices in the UK are observed first in London and the South East before filtering across to other regions. As a result, while house prices across regions may deviate over the shorter term, this hypothesis proposes a long-run relationship between regional house prices as changes eventually ‘ripple out’ to more distant regions. Therefore, the ripple effect hypothesis provides a useful vehicle to illustrate the Engle-Granger approach in practice and is considered here via analysis of seasonally adjusted house price indices for the Outer Metropolitan and North regions of the UK obtained from the Nationwide Building Society.[2] These series, expressed in natural logarithmic form and measured at a quarterly frequency over the period 1973Q4 to 2022Q1, are labelled OM and N respectively and provided in the EViews workfile ripple.wf1.

The analysis of the properties of, and relationship between, OM and N can provide the basis for classroom activity on both unit root testing and cointegration analysis. Continuing with the theme of replication, Tables Five and Six provide results that can be reproduced and discussed. In Table Five, more standard replication can be considered in the form of reproducing the reported DF-GLS test statistics. These results were obtained using the SIC to determine the degree of augmentation of the ADF testing equations, with deterministic terms chosen to match the properties of the {OM, N} series and their first differences. To consider this latter issue, the series can be graphed to examine their nature, and the potential impact upon inferences of incorrectly specifying deterministic terms can be considered.

The discussion of these results, the associated inferences to be drawn at the 5% level of significance and their interpretation leads naturally to the cointegration analysis in Table Six. Considering Table Six, the previously discussed ‘dual-approach’ to replication arises. The findings indicating a long-run relationship between {OM, N} can be reproduced both using the automated approach and via the two stages of the Engle-Granger approach being performed ‘manually’ using the software. To ensure exact reproduction, it should be noted that an intercept was employed as the sole deterministic term in the first stage cointegrating regressions and the degree of augmentation of the second stage CADF testing equations was selected using the SIC. Again, the analysis provides the basis for classroom discussion, which can focus on the support provided for the presence of a ripple effect within the UK housing market given the long-run relationship detected.

Table Five: DF-GLS unit root test statistics for house price indices

OM	∆ OM	N	∆ N
–1.921691	–3.308741	–1.626704	–4.543800
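For classes wishing to see what lies behind a DF-GLS statistic of the kind reported in Table Five, the following Python sketch sets out the intercept-only version of the Elliott et al. (1996) procedure: GLS demeaning using quasi-differences with c̄ = –7, followed by an ADF regression without deterministic terms. It is a sketch under stated assumptions and will not reproduce Table Five: the series is simulated rather than taken from ripple.wf1, the lag length is fixed at one rather than chosen by the SIC, and the function names are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with pivoting."""
    k = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [mr - f * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

def ols(y, cols):
    """OLS of y on the regressors in cols; returns (coefficients, standard errors)."""
    k, n = len(cols), len(y)
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    coef = solve(xtx, xty)
    ssr = sum((yi - sum(c * col[t] for c, col in zip(coef, cols))) ** 2
              for t, yi in enumerate(y))
    s2 = ssr / (n - k)
    se = [math.sqrt(s2 * solve(xtx, [1.0 if i == j else 0.0 for i in range(k)])[j])
          for j in range(k)]
    return coef, se

def dfgls_stat(y, lags=1, c_bar=-7.0):
    """DF-GLS statistic with an intercept as the sole deterministic term."""
    n = len(y)
    a = 1.0 + c_bar / n
    # Quasi-difference the series and the constant (first observation kept as is)
    y_qd = [y[0]] + [y[t] - a * y[t - 1] for t in range(1, n)]
    z_qd = [1.0] + [1.0 - a] * (n - 1)
    (mu,), _ = ols(y_qd, [z_qd])      # GLS estimate of the intercept
    yd = [v - mu for v in y]          # GLS-demeaned series
    # ADF regression on the demeaned series, no deterministic terms
    dy = [yd[t] - yd[t - 1] for t in range(1, n)]
    rows = range(lags + 1, n)
    lhs = [dy[t - 1] for t in rows]
    cols = [[yd[t - 1] for t in rows]]
    cols += [[dy[t - 1 - j] for t in rows] for j in range(1, lags + 1)]
    coef, se = ols(lhs, cols)
    return coef[0] / se[0]

rng = random.Random(99)               # hypothetical seed
series, level = [], 0.0
for _ in range(194):                  # 1973Q4 to 2022Q1 spans 194 quarters
    level += rng.gauss(0.0, 1.0)
    series.append(level)
stat = dfgls_stat(series)
print(f"DF-GLS statistic (simulated random walk): {stat:.4f}")
```

Because the intercept is removed by the GLS step, adding a constant to the series leaves the statistic unchanged. This offers a simple check that the demeaning stage has been coded correctly.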

Table Six: Engle-Granger test results for {N, OM}

Dependent	tau-statistic	Prob.
N	–3.827089	0.0146
OM	–3.753380	0.0181

5. Conclusion

This paper has considered the topics of spurious regression and single equation cointegration analysis. In addition to providing information and discussion on these topics, an active learning approach to delivery has been championed. Resources in the form of code for required simulation and data for empirical analysis, along with a discussion of relevant issues from econometric research, have been provided to support the adoption of this approach within the classroom. The use and importance of replication have been promoted throughout the discussion, with the use of replication in the form of reproducing results automatically generated by software emphasised. As noted, the confidence developed by achieving the known outcome provided by the replication exercise provides a form of active learning that addresses prominent pedagogical issues concerning self-efficacy and anxiety towards quants.

References

Allen, P. and Baughman, F. 2016. Active learning in research methods classes is associated with higher knowledge and confidence, though not evaluations or satisfaction. Frontiers in Psychology 7, 279. https://doi.org/10.3389/fpsyg.2016.00279

Bandura, A. 1977. Self-efficacy: toward a unifying theory of behavioral change. Psychological Review 84, 191-215. https://doi.org/10.1037/0033-295X.84.2.191

Bloomer Green, L., McCormick, N., McDaniel, S., Holmes Rowell G. and Strayer, J. 2018. Implementing active learning department wide: A course community for a culture change. Journal of Statistics Education 26, 190-196. https://doi.org/10.1080/10691898.2018.1527195

Cook, S. 2016. Modern econometrics: Structuring delivery and assessment. Cogent Economics & Finance 4:1, https://doi.org/10.1080/23322039.2016.1152705

Dickey, D. and Fuller, W. 1979. Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association 74, 427-431. https://doi.org/10.2307/2286348

Dowker, A., Sarkar A. and Looi, C. 2016. Mathematics anxiety: what have we learned in 60 years? Frontiers in Psychology 7, 508. https://doi.org/10.3389/fpsyg.2016.00508

Dreger, R. and Aiken, L. 1957. The identification of number anxiety in a college population. Journal of Educational Psychology 48, 344-351. https://psycnet.apa.org/doi/10.1037/h0045894

Elliott, G., Rothenberg, T. and Stock, J. 1996. Efficient tests for an autoregressive unit root. Econometrica 64, 813-836. https://doi.org/10.2307/2171846

Engle, R. and Granger, C. 1987. Co-integration and error correction: Representation, estimation and testing. Econometrica 55, 251-276. https://doi.org/10.2307/1913236

Granger, C. and Newbold, P. 1974. Spurious regressions in econometrics. Journal of Econometrics 2, 111-120. https://doi.org/10.1016/0304-4076(74)90034-7

Granger, C., Hyung, N. and Jeon, Y. 2001. Spurious regressions with stationary series. Applied Economics 33, 899-904. https://doi.org/10.1080/00036840121734

Johansen, S. 1988. Statistical analysis of cointegrating vectors. Journal of Economic Dynamics and Control 12, 231-254. https://doi.org/10.1016/0165-1889(88)90041-3

Ogaki, M. and Park, J. 1997. A cointegration approach to estimating preference parameters. Journal of Econometrics 82, 107-134. https://doi.org/10.1016/S0304-4076(97)00053-5

Pesaran, M., Shin, Y. and Smith, R. 2001. Bounds testing approaches to the analysis of level relationships. Journal of Applied Econometrics 16, 289-326. http://www.jstor.org/stable/2678547

Phillips, P. 1986. Understanding spurious regressions in econometrics. Journal of Econometrics 33, 311-340. https://doi.org/10.1016/0304-4076(86)90001-1

Rock, A., Coventry, W., Morgan, M. and Loi, N. 2016. Teaching research methods and statistics in eLearning environments: Pedagogy, practical examples, and possible futures. Frontiers in Psychology 7, 339. https://doi.org/10.3389/fpsyg.2016.00339

Snee, R. 1993. What’s missing in statistical education? American Statistician 47, 149-154. https://doi.org/10.2307/2685201

Zajacova, A., Lynch, S. and Espenshade, T. 2005. Self-efficacy, stress, and academic success in college. Research in Higher Education 46, 677-706. https://doi.org/10.1007/s11162-004-4139-z

Notes

[1] Throughout this paper references are made to optimising the degree of augmentation of testing equations using alternative information criteria. In all cases optimisation occurs over a range from zero to a maximum lag length given as:

[2] https://www.nationwidehousepriceindex.co.uk/resources/f/uk-and-regional-quarterly-data-all-properties-series
