Overview of VaR Backtesting

Market risk is the risk of losses in positions arising from movements in market prices. Value-at-risk (VaR) is one of the main measures of financial risk. VaR is an estimate of how much value a portfolio can lose in a given time period with a given confidence level. For example, if the one-day 95% VaR of a portfolio is 10MM, then there is a 95% chance that the portfolio loses less than 10MM the following day. In other words, only 5% of the time (or about once in 20 days) the portfolio losses exceed 10MM.

For many portfolios, especially trading portfolios, VaR is computed daily. At the closing of the following day, the actual profits and losses for the portfolio are known and can be compared to the VaR estimated the day before. You can use this daily data to assess the performance of VaR models, which is the goal of VaR backtesting. The performance of VaR models can be measured in different ways. In practice, many different metrics and statistical tests are used to identify VaR models that are performing poorly or performing better. As a best practice, use more than one criterion to backtest the performance of VaR models, because all tests have strengths and weaknesses.

Suppose that you have VaR limits and corresponding returns or profits and losses for days t = 1,…,N. Use VaR_t to denote the VaR estimate for day t (determined on day t − 1). Use R_t to denote the actual return or profit and loss observed on day t. Profits and losses are expressed in monetary units and represent value changes in a portfolio. The corresponding VaR limits are also given in monetary units. Returns represent the change in portfolio value as a proportion (or percentage) of its value on the previous day. The corresponding VaR limits are also given as a proportion (or percentage). The VaR limits must be produced from existing VaR models. Then, to perform a VaR backtesting analysis, provide these limits and their corresponding returns as data inputs to the VaR backtesting tools in Risk Management Toolbox™.
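The comparison of daily returns against VaR limits can be sketched in a few lines of Python. This is only an illustration and not the toolbox's implementation. The function name `count_exceptions`, the sample data, and the sign convention (VaR limits are positive numbers, and day t is an exception when the return is below the negative of the limit) are assumptions made for this example.

```python
def count_exceptions(returns, var_limits):
    """Count the days on which the loss exceeded the VaR limit.

    Assumed convention: var_limits are positive and a loss is a negative
    return, so day t is an exception when returns[t] < -var_limits[t].
    """
    if len(returns) != len(var_limits):
        raise ValueError("returns and var_limits must have the same length")
    flags = [r < -v for r, v in zip(returns, var_limits)]
    return sum(flags), flags


# Hypothetical daily returns and one-day VaR limits, both as proportions.
returns = [0.004, -0.021, 0.010, -0.003, -0.018, 0.007]
var_limits = [0.015, 0.015, 0.016, 0.016, 0.017, 0.017]

x, flags = count_exceptions(returns, var_limits)
print(x, flags)
```

The exception count x and the sequence of exception flags are the inputs that the frequency tests and the independence tests described in this overview work from.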

The toolbox supports these VaR backtests:

Binomial test
Traffic light test
Kupiec’s tests
Christoffersen’s tests
Haas’s tests

Binomial Test

The most straightforward test is to compare the observed number of exceptions, x, to the expected number of exceptions. From the properties of a binomial distribution, you can build a confidence interval for the expected number of exceptions, using either exact probabilities from the binomial distribution or a normal approximation. The bin function uses a normal approximation. By computing the probability of observing x exceptions, you can compute the probability of wrongly rejecting a good model when x exceptions occur. This is the p-value for the observed number of exceptions x. For a given test confidence level, a straightforward accept-or-reject result in this case is to fail the VaR model whenever x is outside the test confidence interval for the expected number of exceptions. “Outside the confidence interval” can mean too many exceptions, or too few exceptions. Too few exceptions might be a sign that the VaR model is too conservative.

The test statistic is

$Z_{bin} = \frac{x - Np}{\sqrt{Np(1 - p)}}$

where x is the number of failures, N is the number of observations, and p = 1 – VaR level. If the VaR model is correct, the binomial test statistic is approximately distributed as a standard normal variable.
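As a rough illustration of the statistic above, the following Python sketch computes the z-score and a two-sided p-value from the standard normal distribution. The function name `binomial_test` and the sample inputs are hypothetical, and treating the p-value as two-sided is an assumption of this sketch, not a statement about how the bin function reports it.

```python
import math


def binomial_test(x, n, var_level):
    """Return the z statistic and a two-sided normal-approximation p-value."""
    p = 1.0 - var_level  # probability of an exception under the model
    z = (x - n * p) / math.sqrt(n * p * (1.0 - p))
    # Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical case: 20 exceptions in 250 days for a 95% VaR model.
z, p_value = binomial_test(x=20, n=250, var_level=0.95)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

With 250 observations at the 95% VaR level, the expected number of exceptions is 12.5, so 20 exceptions gives a positive z-score. A negative z-score of similar size would indicate too few exceptions.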

For more information, see References for Jorion and bin.

Traffic Light Test

A variation on the binomial test proposed by the Basel Committee is the traffic light test or three zones test. For a given number of exceptions x, you can compute the probability of observing up to x exceptions. That is, any number of exceptions from 0 to x, or the cumulative probability up to x. The probability is computed using a binomial distribution. The three zones are defined as follows:

The “red” zone starts at the number of exceptions where this probability equals or exceeds 99.99%. It is unlikely that too many exceptions come from a correct VaR model.
The “yellow” zone covers the number of exceptions where the probability equals or exceeds 95% but is smaller than 99.99%. Even though there is a high number of violations, the violation count is not exceedingly high.
Everything below the yellow zone is "green." If you have too few failures, they fall in the green zone. Only too many failures lead to model rejections.
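The zone boundaries described above can be reproduced with a short Python sketch. It is an illustration only. The names `binomial_cdf` and `traffic_light_zone` are hypothetical, and the 250-day, 99% VaR setting used in the example is an assumed configuration rather than something fixed by the text above.

```python
import math


def binomial_cdf(x, n, p):
    """Probability of observing up to x exceptions in n days."""
    return sum(math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(x + 1))


def traffic_light_zone(x, n, var_level):
    """Classify x exceptions using the 95% and 99.99% cumulative thresholds."""
    p = 1.0 - var_level
    prob = binomial_cdf(x, n, p)
    if prob >= 0.9999:
        return "red", prob
    if prob >= 0.95:
        return "yellow", prob
    return "green", prob


# Assumed setting: 250 trading days and a 99% VaR level.
for x in (4, 5, 9, 10):
    zone, prob = traffic_light_zone(x, n=250, var_level=0.99)
    print(x, zone, round(prob, 4))
```

Because the cumulative probability only increases with x, the zones are contiguous: a run of green counts, then yellow, then red.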

For more information, see References for Basel Committee on Banking Supervision and tl.

Kupiec’s POF and TUFF Tests

Kupiec (1995) introduced a variation on the binomial test called the proportion of failures (POF) test. The POF test works with the binomial distribution approach. In addition, it uses a likelihood ratio to test whether the probability of exceptions is consistent with the probability p implied by the VaR confidence level. If the data suggests that the probability of exceptions is different from p, the VaR model is rejected. The POF test statistic is

$LR_{POF} = -2 \log \left( \frac{(1 - p)^{N - x} p^{x}}{\left(1 - \frac{x}{N}\right)^{N - x} \left(\frac{x}{N}\right)^{x}} \right)$

where x is the number of failures, N is the number of observations, and p = 1 – VaR level.

This statistic is asymptotically distributed as a chi-square variable with 1 degree of freedom. The VaR model fails the test if this likelihood ratio exceeds a critical value. The critical value depends on the test confidence level.
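A minimal Python sketch of the POF likelihood ratio follows. It is not the toolbox's pof function. The names `pof_statistic` and `_xlogy` are hypothetical, and the treatment of the edge cases x = 0 and x = N (where a term of the form 0·log 0 is taken as 0) is an assumption of this sketch. For 1 degree of freedom, the chi-square tail probability can be written with the complementary error function, so only the standard library is needed.

```python
import math


def _xlogy(k, q):
    """k * log(q), with the convention that the term is 0 when k == 0."""
    return 0.0 if k == 0 else k * math.log(q)


def pof_statistic(x, n, var_level):
    """Return the POF likelihood ratio and its chi-square(1) p-value."""
    p = 1.0 - var_level
    p_hat = x / n  # observed proportion of failures
    log_null = _xlogy(n - x, 1.0 - p) + _xlogy(x, p)
    log_alt = _xlogy(n - x, 1.0 - p_hat) + _xlogy(x, p_hat)
    lr = max(-2.0 * (log_null - log_alt), 0.0)  # guard against rounding
    # Tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# Hypothetical case: 20 exceptions in 250 days for a 95% VaR model.
lr, p_value = pof_statistic(x=20, n=250, var_level=0.95)
print(f"LR_POF = {lr:.3f}, p-value = {p_value:.4f}")
```

At a 95% test confidence level, the chi-square critical value with 1 degree of freedom is about 3.84, so a likelihood ratio above that value fails the model.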

Kupiec also proposed a second test called the time until first failure (TUFF). The TUFF test looks at when the first exception occurred. If it happens too soon, the test fails the VaR model. Checking only the first exception leaves much information out; specifically, whatever happened after the first exception is ignored. The TBFI test extends the TUFF approach to include all the failures. See tbfi.

The TUFF test is also based on a likelihood ratio, but the underlying distribution is a geometric distribution. If n is the number of days until the first exception, the test statistic is given by

$LR_{TUFF} = -2 \log \left( \frac{p (1 - p)^{n - 1}}{\left(\frac{1}{n}\right) \left(1 - \frac{1}{n}\right)^{n - 1}} \right)$

This statistic is asymptotically distributed as a chi-square variable with 1 degree of freedom. For more information, see References for Kupiec, pof, and tuff.
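The TUFF statistic can be sketched the same way. The function name `tuff_statistic` and the example inputs are hypothetical, and this is an illustration of the formula above, not the toolbox's tuff function. The case n = 1 is handled separately because the denominator reduces to 1.

```python
import math


def tuff_statistic(n_first, var_level):
    """Return the TUFF likelihood ratio and its chi-square(1) p-value.

    n_first is the number of days until the first exception (1-based).
    """
    p = 1.0 - var_level
    log_null = math.log(p) + (n_first - 1) * math.log(1.0 - p)
    if n_first == 1:
        log_alt = 0.0  # (1/1) * (1 - 1/1)^0 = 1
    else:
        q = 1.0 / n_first
        log_alt = math.log(q) + (n_first - 1) * math.log(1.0 - q)
    lr = max(-2.0 * (log_null - log_alt), 0.0)  # guard against rounding
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# Hypothetical case: first exception on day 5 for a 99% VaR model.
lr, p_value = tuff_statistic(n_first=5, var_level=0.99)
print(f"LR_TUFF = {lr:.3f}, p-value = {p_value:.4f}")
```

The statistic is zero when 1/n equals p, that is, when the first exception arrives exactly when the VaR level says it should on average.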

Christoffersen’s Interval Forecast Tests

Christoffersen (1998) proposed a test to measure whether the probability of observing an exception on a particular day depends on whether an exception occurred on the previous day. Unlike tests of the unconditional probability of observing an exception, Christoffersen's test measures only the dependency between consecutive days. The test statistic for independence in Christoffersen’s interval forecast (IF) approach is given by

$LR_{CCI} = -2 \log \left( \frac{(1 - \pi)^{n_{00} + n_{10}} \pi^{n_{01} + n_{11}}}{(1 - \pi_{0})^{n_{00}} \pi_{0}^{n_{01}} (1 - \pi_{1})^{n_{10}} \pi_{1}^{n_{11}}} \right)$

where

n00 = Number of periods with no failures followed by a period with no failures.
n10 = Number of periods with failures followed by a period with no failures.
n01 = Number of periods with no failures followed by a period with failures.
n11 = Number of periods with failures followed by a period with failures.

and

π₀ = n01 / (n00 + n01): Probability of having a failure on period t, given that no failure occurred on period t − 1.
π₁ = n11 / (n10 + n11): Probability of having a failure on period t, given that a failure occurred on period t − 1.
π = (n01 + n11) / (n00 + n01 + n10 + n11): Probability of having a failure on period t.

This statistic is asymptotically distributed as a chi-square variable with 1 degree of freedom. You can combine this statistic with the frequency POF test to get a conditional coverage (CC) mixed test:

$LR_{CC} = LR_{POF} + LR_{CCI}$

This test is asymptotically distributed as a chi-square variable with 2 degrees of freedom.
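The transition counts and the independence statistic can be sketched in Python as follows. This is an illustration under stated assumptions, not the toolbox's cci or cc functions. The names `transition_counts`, `cci_statistic`, and `chi2_sf_2df` are hypothetical, the input is assumed to be a sequence of 0/1 exception flags in date order, and terms of the form 0·log 0 are taken as 0.

```python
import math


def _xlogy(k, q):
    """k * log(q), with the convention that the term is 0 when k == 0."""
    return 0.0 if k == 0 else k * math.log(q)


def transition_counts(flags):
    """Count consecutive-day transitions: n00, n01, n10, n11."""
    counts = {"n00": 0, "n01": 0, "n10": 0, "n11": 0}
    for prev, cur in zip(flags, flags[1:]):
        counts[f"n{int(bool(prev))}{int(bool(cur))}"] += 1
    return counts


def cci_statistic(flags):
    """Return the independence likelihood ratio LR_CCI."""
    c = transition_counts(flags)
    n00, n01, n10, n11 = c["n00"], c["n01"], c["n10"], c["n11"]
    total = n00 + n01 + n10 + n11
    if total == 0:
        raise ValueError("need at least two observations")
    pi0 = n01 / (n00 + n01) if (n00 + n01) else 0.0
    pi1 = n11 / (n10 + n11) if (n10 + n11) else 0.0
    pi = (n01 + n11) / total
    log_null = _xlogy(n00 + n10, 1.0 - pi) + _xlogy(n01 + n11, pi)
    log_alt = (_xlogy(n00, 1.0 - pi0) + _xlogy(n01, pi0)
               + _xlogy(n10, 1.0 - pi1) + _xlogy(n11, pi1))
    return max(-2.0 * (log_null - log_alt), 0.0)


def chi2_sf_2df(lr):
    """Tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-lr / 2.0)


# Hypothetical flags with two clusters of back-to-back exceptions.
flags = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
counts = transition_counts(flags)
lr_cci = cci_statistic(flags)
print(counts, round(lr_cci, 3))
```

To form the conditional coverage statistic, add a POF likelihood ratio computed on the same data to `lr_cci` and pass the sum to `chi2_sf_2df` for a p-value.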

For more information, see References for Christoffersen, cc, and cci.

Haas’s Time Between Failures or Mixed Kupiec’s Test

Haas (2001) extended Kupiec’s TUFF test to incorporate the time information between all the exceptions in the sample. Haas’s test applies the TUFF test to each exception in the sample and aggregates the results into the time between failures independence (TBFI) test statistic.

$LR_{TBFI} = -2 \sum_{i = 1}^{x} \log \left( \frac{p (1 - p)^{n_{i} - 1}}{\left(\frac{1}{n_{i}}\right) \left(1 - \frac{1}{n_{i}}\right)^{n_{i} - 1}} \right)$

In this statistic, p = 1 – VaR level and n_i is the number of days between failures i-1 and i (or until the first exception for i = 1). This statistic is asymptotically distributed as a chi-square variable with x degrees of freedom, where x is the number of failures.
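The aggregation over all exceptions can be sketched as follows. The names `tuff_term` and `tbfi_statistic` are hypothetical, the input is assumed to be a sequence of 0/1 exception flags in date order with day 1 as the first entry, and the code illustrates the formula above without reproducing the toolbox's tbfi function. It returns the statistic together with its degrees of freedom and leaves the chi-square p-value to a statistics library.

```python
import math


def tuff_term(n, p):
    """Single TUFF-style likelihood ratio term for a gap of n days."""
    log_null = math.log(p) + (n - 1) * math.log(1.0 - p)
    if n == 1:
        log_alt = 0.0  # (1/1) * (1 - 1/1)^0 = 1
    else:
        log_alt = math.log(1.0 / n) + (n - 1) * math.log(1.0 - 1.0 / n)
    return -2.0 * (log_null - log_alt)


def tbfi_statistic(flags, var_level):
    """Return LR_TBFI and its degrees of freedom (the number of failures)."""
    p = 1.0 - var_level
    fail_days = [t for t, f in enumerate(flags, start=1) if f]
    # Gap to each failure from the previous one, or from day 0 for the first.
    gaps = [d - prev for prev, d in zip([0] + fail_days[:-1], fail_days)]
    lr = sum(tuff_term(n, p) for n in gaps)
    return lr, len(gaps)


# Hypothetical flags: exceptions on days 5 and 8 of a 10-day sample, 95% VaR.
flags = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
lr_tbfi, dof = tbfi_statistic(flags, var_level=0.95)
print(f"LR_TBFI = {lr_tbfi:.3f} with {dof} degrees of freedom")
```

Each term is zero when the gap equals 1/p and grows as the gap becomes much shorter or much longer than that, so clustered exceptions inflate the sum.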

As with Christoffersen’s test, you can combine this test with the frequency POF test to get a TBF mixed test, sometimes called Haas’s mixed Kupiec’s test:

$LR_{TBF} = LR_{POF} + LR_{TBFI}$

This test is asymptotically distributed as a chi-square variable with x+1 degrees of freedom. For more information, see References for Haas, tbf, and tbfi.

References

[1] Basel Committee on Banking Supervision, Supervisory framework for the use of “backtesting” in conjunction with the internal models approach to market risk capital requirements. January 1996, https://www.bis.org/publ/bcbs22.htm.

[2] Christoffersen, P. "Evaluating Interval Forecasts." International Economic Review. Vol. 39, 1998, pp. 841–862.

[3] Cogneau, P. "Backtesting Value-at-Risk: how good is the model?" Intelligent Risk, PRMIA, July, 2015.

[4] Haas, M. "New Methods in Backtesting." Financial Engineering, Research Center Caesar, Bonn, 2001.

[5] Jorion, P. Financial Risk Manager Handbook. 6th Edition, Wiley Finance, 2011.

[6] Kupiec, P. "Techniques for Verifying the Accuracy of Risk Management Models." Journal of Derivatives. Vol. 3, 1995, pp. 73–84.

[7] McNeil, A., Frey, R., and Embrechts, P. Quantitative Risk Management. Princeton University Press, 2005.

[8] Nieppola, O. “Backtesting Value-at-Risk Models.” Helsinki School of Economics, 2009.