Anderson-Darling goodness-of-fit test?
I want to assess the goodness of fit (GOF) of a given dataset to a variety of distributions (e.g. exponential, lognormal, normal) using the Anderson-Darling test statistic. Unfortunately, this doesn't seem to have been implemented in Matlab, so I was wondering whether any of you might have a function that does exactly this?
Mike Croucher on 11 Apr 2022
The statistics toolbox has an adtest function. It can test against a specific distribution with known parameters, or a more general test against any of the following distributions with unknown parameters: 'norm', 'exp', 'ev', 'logn', 'weibull'
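A minimal sketch of how adtest might be called for the cases described above, assuming the Statistics and Machine Learning Toolbox is installed. The sample data, the seed, and the variable names are made up for illustration.

```matlab
% Illustrative data only: 200 exponential samples with mean 3
rng(42);
x = exprnd(3, 200, 1);

% Composite test: distribution family given, parameters estimated from x
[h, p, adstat] = adtest(x, 'Distribution', 'exp');
fprintf('exp:  h = %d, p = %.4f, A^2 = %.4f\n', h, p, adstat);

% Default call tests against the normal family
[hNorm, pNorm] = adtest(x);
fprintf('norm: h = %d, p = %.4f\n', hNorm, pNorm);

% Fully specified test: pass a probability distribution object
pd = makedist('Exponential', 'mu', 3);
[hKnown, pKnown] = adtest(x, 'Distribution', pd);
fprintf('exp with mu = 3: h = %d, p = %.4f\n', hKnown, pKnown);
```

Here h = 1 means the null hypothesis is rejected at the default 5% level. Note that 'exp', 'logn' and 'weibull' only make sense for positive data.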
More Answers (1)
William Rose on 10 Apr 2022
Calculating the A-D statistic is straightforward, and the attached function does it. Finding the p-value associated with a given A-D statistic is not straightforward. The attached function does that too, but beware.
This 2018 paper gives a formula for the p-value (Table 3). The authors cite a 1986 textbook, which I cannot find online. They say the formula is supposed to work no matter which distribution is being tested. However, I have tested their Table 3 formula with Monte Carlo simulation*. The tests show that the formula gives very wrong answers when the data is uniform and the tested distribution is uniform, and also when the data is exponential and the tested distribution is exponential. It gives reasonable answers when the data is normal and the tested distribution is normal.
By "very wrong answers", I mean the following. If you generate 10,000 sets of 100 uniformly distributed random numbers and test each set against a uniform distribution, about 1% of the p-values should be <.01, about 5% should be <.05, and about 10% should be <.10. But the actual fraction of p-values below each level is much greater than expected. In other words, there are too many low p-values. Thus, when testing data that really is uniformly (or exponentially) distributed, one would reject the hypothesis "this data came from a uniform (or exponential) distribution" far more often than one should. The p-values for normal data tested against a normal distribution appear to be approximately correct, at least at p=0.01, 0.05, and 0.10, with data sets of 20, 50, and 100 points per set.
The function that computes the A-D statistic and the associated p-value is attached: AndersonDarling.m. Note the warnings above. I am also attaching four scripts which test the function. See the comments in the function and in each script.
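For readers without the attachment, the statistic itself can be sketched in a few lines. This is not the attached AndersonDarling.m, only an illustration of the standard computational formula A^2 = -n - (1/n) * sum_{k=1..n} (2k-1) * [ln F(x_k) + ln(1 - F(x_{n+1-k}))], where the x_k are the sorted data and F is the hypothesized CDF. The sketch assumes a fully specified null distribution (exponential with a known mean); the sample, seed, and variable names are made up.

```matlab
% Illustrative data: exponential samples with mean mu, via the inverse CDF
rng(1);
n  = 100;
mu = 2;
x  = -mu * log(rand(n, 1));

% Hypothesized CDF evaluated at the sorted data (fully specified, mean mu)
F = 1 - exp(-sort(x) / mu);

% Anderson-Darling statistic from the standard computational formula
k  = (1:n)';
A2 = -n - sum((2*k - 1) .* (log(F) + log(1 - flipud(F)))) / n;
fprintf('A^2 = %.4f\n', A2);
```

The flipud call pairs each F(x_k) with F(x_{n+1-k}), as the formula requires. If the distribution's parameters were estimated from the same data instead of being known in advance, the statistic is computed the same way, but its null distribution (and therefore the p-value) is different.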
The 2018 paper linked above and here has references to other papers and books which may have formulas that give better results for non-normal distributions.
*I did the Monte Carlo testing with 10,000 data sets of normally distributed random numbers, with 20 data points in each data set. Then I did it again with 50 and with 100 points in each data set. Then I did it all again with uniformly and with exponentially distributed random numbers, for 90,000 data sets in all.
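A rough sketch of this kind of calibration check, for the uniform case only. It is not one of the attached scripts. As a stand-in p-value routine it uses the toolbox function adtest with a fully specified uniform distribution, because the calling syntax of the attached function is not shown here; substitute whichever p-value function you want to examine. The number of data sets is reduced from 10,000 to 2,000 to keep the run time short.

```matlab
% Calibration check: data drawn from the null distribution should give
% p-values that are roughly uniform on [0, 1].
rng(0);
nSets = 2000;             % the test described above used 10,000
nPts  = 100;              % points per data set
pd    = makedist('Uniform');   % default is uniform on [0, 1]

p = zeros(nSets, 1);
for s = 1:nSets
    x = rand(nPts, 1);                           % data truly uniform
    [~, p(s)] = adtest(x, 'Distribution', pd);   % stand-in p-value routine
end

% Fraction of p-values below each nominal level
levels    = [0.01 0.05 0.10];
fracBelow = arrayfun(@(a) mean(p < a), levels);
for j = 1:numel(levels)
    fprintf('nominal %.2f: observed fraction %.4f\n', levels(j), fracBelow(j));
end
```

If the p-value routine is well calibrated, each observed fraction should be close to its nominal level. Fractions well above the nominal levels indicate too many low p-values, which is the problem described above.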