The GRIMMER test: A method for testing the validity of
reported measures of variability
omnesres.com, email: firstname.lastname@example.org, twitter: @omnesresnetwork
Charlottesville, VA, US
Email address: email@example.com
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.2400v1 | CC BY 4.0 Open Access | rec: 29 Aug 2016, publ: 29 Aug 2016
Jordan Anaya¹
¹ Omnes Res
ABSTRACT

GRIMMER (Granularity-Related Inconsistency of Means Mapped to Error Repeats) builds upon the GRIM test and allows for testing whether reported measures of variability are mathematically possible. GRIMMER relies upon the statistical phenomenon that variances display a simple repetitive pattern when the data is discrete, i.e. granular.
This observation allows for the generation of an algorithm that can quickly identify whether a reported statistic of any size or precision is consistent with the stated sample size and granularity. My implementation of the test is available at PrePubMed and currently allows for testing variances, standard deviations, and standard errors for integer data. It is possible to extend the test to other measures of variability such as deviation from the mean, or apply the test to non-integer data such as data reported to halves or tenths. The ability of the test to identify inconsistent statistics relies upon four factors: (1) the sample size; (2) the granularity of the data; (3) the precision (number of decimals) of the reported statistic; and (4) the size of the standard deviation or standard error (but not the variance). The test is most powerful when the sample size is small, the granularity is large, the statistic is reported to a large number of decimal places, and the standard deviation or standard error is small (variance is immune to size considerations). This test has important implications for any field that routinely reports statistics for granular data to at least two decimal places because it can help identify errors in publications, and should be used by journals during their initial screen of new submissions. The errors detected can be the result of anything from something as innocent as a typo or rounding error to large statistical mistakes or unfortunately even fraud. In this report I describe the mathematical foundations of the GRIMMER test and the algorithm I use to implement it.
Keywords: Standard deviations, standard errors, variances, statistics, reproducibility, replicability
THEORETICAL FOUNDATIONS

In 1983 Magic Johnson received 304.5 votes for Most Valuable Player (link). Votes for Most Valuable Player can only be whole numbers, and the redditors at r/nba recognized the impossibility of this sum and proposed some interesting reasons for the apparent half vote.
If something like this was discovered in the scientific literature it would probably also cause a bit of confusion.
Imagine a lab reported a statistic that could only be a whole number, such as the number of mice used in an experiment. If they claimed their experiment involved 10.5 mice it would be unclear how many mice they actually used, and similarly to the half MVP vote, if the number was taken seriously it may engender some interesting speculation as to how this half mouse came to be, or perhaps there were two quarter mice, or even four eighth mice.
Just as a simple statistic such as a sum can be nonsensical and potentially humorous when dealing with discrete data, it is possible for other statistics to be just as nonsensical, albeit less conspicuously so and likely not as humorous. These errors have thus far gone undetected by the scientific community because detecting incorrect values for anything but the simplest of statistics requires more effort than just checking whether the reported statistic has the same precision as the data. Only recently has there been progress on evaluating the statistics of granular data, and the work revealed a striking number of nonsensical values (Brown and Heathers, 2016). One can only imagine how many errors will be revealed with further advances in this field and widespread adoption of the techniques.
Review of the GRIM test

This work and future work on statistics with granular data are possible because of the solid foundations laid by the GRIM test (Brown and Heathers, 2016). The authors of the GRIM test made the simple observation that when the values of data sets are granular, the means are also granular, and this makes certain means mathematically impossible, i.e. "inconsistent" with the data. The authors referred to their test checking for inconsistent means as the Granularity-Related Inconsistency of Means (GRIM) test.
The GRIM test is elegant in its simplicity. Given a data set with granularity G and sample size N, the granularity of the mean is G/N.
For example, with a data set of 10 values where the values are reported as integers (a granularity of 1), the means of all possible sets can be enumerated:
0, 0.1, 0.2, 0.3, 0.4,..., 1.0, 1.1, 1.2, 1.3, 1.4,...
The granularity of the means in this case is 1/10 = 0.1. Any reported mean that is not a multiple of 0.1 would be inconsistent and would fail the GRIM test.
The GRIM test has no upper limit on the size of means it can test. For example, even means around a million will still have to be a multiple of 0.1. The limitations of the test arise in the sample size and the number of reported decimals of the mean. Going back to our example, if the researcher rounded his or her means to the nearest integer the possible reported means would now be:
0, 1, 2, 3, 4,...
In this case the GRIM test is useless because due to rounding the granularity of the means is smaller than the granularity of the reported means. Stated generally, if the mean is reported to D decimal places, then the GRIM test can detect inconsistent means given

N < G × 10^D
For example, if G is 1, i.e. integer values, and the means are reported to two decimals (D is 2), the test is applicable for sample sizes up to 99, after which the granularity of the means is equal to or less than the granularity of the reported means. This is because at two decimals the granularity of the reported means is 0.01, and at a sample size of 100 the granularity of the means is also 0.01.
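This applicability condition can be captured in a small helper; the function name `grim_applicable` is mine, introduced purely for illustration:

```python
def grim_applicable(n, decimals, granularity=1.0):
    """Return True when the GRIM test can still detect inconsistent means:
    the spacing of possible means (granularity / n) must exceed the
    spacing of the reported means (10 ** -decimals)."""
    return granularity / n > 10 ** -decimals
```

For integer data reported to two decimals, `grim_applicable(99, 2)` is True while `grim_applicable(100, 2)` is False, matching the worked example above.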
A naive algorithm for the GRIM test would be to enumerate all possible means, round them to the same decimals as the reported mean, and then check if the mean is one of the possibilities. Although this would work, the authors of
the GRIM test created a clever algorithm for testing the consistency of means:
1. Multiply the reported mean by the sample size
2. Round the result to the nearest integer
3. Divide by the sample size
4. Round the result to D decimal places and compare to reported mean

This algorithm assumes a granularity of 1 for the values. If the data is present at a different granularity you must round to the nearest granular value instead of rounding to the nearest integer at step 2.
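The four steps above translate directly into Python; this sketch is my own (the function name is not from the original description, and note that Python's built-in round() resolves exact ties to the nearest even digit rather than rounding half up):

```python
def grim_test(mean, n, decimals=2, granularity=1.0):
    """Check whether a reported mean is mathematically possible for a
    sample of n granular values, following the four steps above."""
    total = mean * n                                     # step 1
    nearest = round(total / granularity) * granularity   # step 2: nearest granular value
    reconstructed = nearest / n                          # step 3
    return round(reconstructed, decimals) == round(mean, decimals)  # step 4
```

For example, `grim_test(0.12, 10)` returns False, because with ten integer values the mean must be a multiple of 0.1.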
Beyond the GRIM test

A natural extension of the GRIM test would be to apply it to other commonly reported statistics, for example standard deviations. While it might be obvious that the standard deviations of granular data should also be granular, it is not obvious what that granularity would be. The standard deviation is defined as

σ = √( Σ(xᵢ − x̄)² / N )

It is unclear to me how to determine the granularity of standard deviations from this formula.
Another problem with standard deviations is the presence of the square root. The ultimate goal of a granularity test is to determine if the fractional component of the reported statistic is consistent regardless of the value of the integer component. We saw that when working with means 10.1 was a consistent value, as was 100.1, as was 1,000.1, etc. It is difficult to imagine that standard deviations would display any sort of consistent fractional values. The curve of √x is just that, a curve, and values become compressed as x → ∞, with the effect that the integer portions of the values are not independent of the fractional components.
To get around this we can use another commonly reported statistic, the variance, which is σ². With the elimination of the square root there may exist a simple granularity G that is independent of the size of the variance. But again, it is unclear how to determine what this granularity might be. A naive approach would be to employ a brute force algorithm that records all possible variances and then checks a reported statistic against this table of values. However, there is no upper limit to variances, and computing all possible variances for a given data set can quickly become computationally intensive.
Nota bene: From now on the data will be assumed to be integers and thus have a G of 1.
Using Python's itertools library it is possible to generate all possible combinations of a data set of size N and possible unique values 0, 1, 2, 3,..., X with the function combinations_with_replacement. The number of possible unique combinations that the function will generate is

(N + X)! / (N! X!)

What this equation shows is that when N or X gets large the number of possible combinations that need to be tested quickly becomes unmanageable. Despite this I decided to start calculating as many variances as I could and see what I observed.
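Such a brute-force enumeration is straightforward to sketch. Here I assume the population variance (denominator N), which is the definition that reproduces the sequences reported in this section; the function name is my own:

```python
from itertools import combinations_with_replacement

def unique_variances(n, max_value):
    """Enumerate the unique population variances of every multiset of
    n integers drawn from 0..max_value."""
    variances = set()
    for combo in combinations_with_replacement(range(max_value + 1), n):
        mean = sum(combo) / n
        var = sum((x - mean) ** 2 for x in combo) / n
        variances.add(round(var, 10))  # collapse floating-point noise
    return sorted(variances)
```

For a sample size of 5, `unique_variances(5, 4)` begins 0.0, 0.16, 0.24, 0.4, ..., matching the sequence given next.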
Enumerating the unique possible variances for a sample size of 5 and sorting them results in this sequence:
0.0, 0.16, 0.24, 0.4, 0.56, 0.64, 0.8, 0.96, 1.04, 1.2, 1.36, 1.44, 1.6, 1.76, 1.84, 2.0, 2.16, 2.24, 2.4, 2.56, 2.64, 2.8, 2.96, 3.04, 3.2, 3.36, 3.44, 3.6, 3.76, 3.84,...
At first glance it may appear that the granularity of the variances, G_σ², is 0.16, however there are times when the step is 0.08 instead of 0.16, such as the step between the values 0.56 and 0.64.
A close inspection of the sequence reveals that the fractional values of variances that have an even integer are the same regardless of the size of the even integer. Similarly, the fractional values of variances that have an odd integer are the same regardless of the size of the odd integer. I will refer to the fractional values for even integers as the even pattern (EP), and the fractional values for odd integers as the odd pattern (OP). I will also take the liberty to refer to variances with an even integer component as even variances (EV) and variances with an odd integer component as odd variances (OV).
These repeating fractional values are not limited to a sample size of 5. Enumerating the unique possible variances
for a sample size of 6 and sorting them results in this sequence:
0.0, 0.1388..., 0.2222..., 0.25, 0.3333..., 0.4722..., 0.5555..., 0.5833..., 0.6666..., 0.8055..., 0.8888..., 0.9166..., 1.0, 1.1388..., 1.2222..., 1.25, 1.3333..., 1.4722..., 1.5555..., 1.5833..., 1.6666..., 1.8055..., 1.8888..., 1.9166...,...
In this case the odd pattern is the same as the even pattern. This leads to the following theorem:

Theorem: For integer data with sample size N, the fractional components of the possible variances repeat with a period of 2: every even variance draws its fractional component from EP and every odd variance draws its fractional component from OP, regardless of the size of the integer component. For some sample sizes, such as N = 6, EP and OP are identical.
I am completely unaware of any previous reports of repeating patterns of variances of discrete numbers and this may be the first time this is reported in the literature.
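The even and odd patterns can be extracted mechanically from the same enumeration. This sketch (again assuming the population variance, with a function name of my own choosing) splits the fractional components by the parity of the variance's integer component:

```python
from itertools import combinations_with_replacement

def even_odd_patterns(n, max_value):
    """Collect the fractional components of all unique population
    variances, split by parity of the variance's integer component."""
    even_pattern, odd_pattern = set(), set()
    for combo in combinations_with_replacement(range(max_value + 1), n):
        mean = sum(combo) / n
        var = round(sum((x - mean) ** 2 for x in combo) / n, 10)
        frac = round(var - int(var), 10)
        (even_pattern if int(var) % 2 == 0 else odd_pattern).add(frac)
    return sorted(even_pattern), sorted(odd_pattern)
```

For a sample size of 5 the two returned patterns differ, while for a sample size of 6 they coincide, consistent with the sequences shown above.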
It appears we now have everything we need to apply granularity testing to variances, and consequently standard deviations and standard errors since they can be derived from variances. However, the power of this granularity test to detect inconsistent values appears weaker than the GRIM test. Recall that the GRIM test can detect inconsistent values for means rounded to one decimal for sample sizes up to 9. However, at only a sample size of 5 we see that a test for variances is already beginning to break down if the reported statistic is rounded to 1 decimal. For example, the variances 0.16 and 0.24 would both get rounded to 0.2 and be indistinguishable.
But does the test have to be less powerful than the GRIM test? When researchers report a variance or standard deviation they also often report a mean. Can we somehow use that mean to narrow down which variances in EP and OP we should test? Intuition says yes. For example, imagine the reported variance is 0.0. The only way to have a variance of 0 is for every value in the data set to be the same. If every value in the data set is the same then the mean would have to have a fractional value of 0 since the mean would just be the integer value that is repeated in the data set.
To investigate this I began recording which means are consistent with which variances. Below are the means coupled to the variances for a sample size of 5:

Variance 0.0: mean fraction .0
Variance 0.16: mean fractions .2 or .8
Variance 0.24: mean fractions .4 or .6
Variance 0.4: mean fraction .0
Variance 0.56: mean fractions .2 or .8
Variance 0.64: mean fractions .4 or .6
Variance 0.8: mean fraction .0
Variance 0.96: mean fractions .2 or .8
...
What I hope this table shows is that only certain means are consistent with certain variances. In fact, for a given variance the means that are consistent for that variance always either have fractional component F or fractional component 1 − F.
This observation now provides us with a second check. Actually, it provides a third check if we first apply the GRIM test on the reported mean. As a result, when a researcher reports a mean and a variance, for the values to be consistent the mean must first pass the GRIM test, the variance must then match either pattern EP or OP, and then the mean must be consistent with the EP or OP value. Failing any of these three checks indicates the values were reported incorrectly or possibly fabricated.
Incorporating the mean into the test increases the power of the test. Now if the variances for a sample size of 5 are rounded to a single decimal, and the mean is provided, it is possible to determine if a reported value of 0.2 originated from a variance of 0.16 or 0.24. More importantly, if a researcher reports a variance of 0.2 but reports a mean of 1.0, they would not pass the test despite the fact that two possible variances round to 0.2. And of course if they report any means with an odd number as a fractional value such as 1.1 they would also fail the GRIM test.
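These three checks can be validated with a naive brute-force search over candidate data sets. This is not the efficient algorithm described in this report, just a self-contained cross-check; the function name and search window are my own, and the population variance is assumed:

```python
import math
from itertools import combinations_with_replacement

def grimmer_consistent(mean, variance, n, decimals=2):
    """Brute-force check (exponential in n, fine for small samples):
    does any multiset of n integers reproduce both the reported mean
    and the reported population variance after rounding?"""
    # The variance is shift-invariant, so only values within a window
    # around the mean need to be tried: no value can deviate from the
    # mean by more than sqrt(n * variance).
    center = round(mean)
    spread = math.isqrt(int(n * (variance + 1))) + 1
    for combo in combinations_with_replacement(range(center - spread, center + spread + 1), n):
        m = sum(combo) / n
        v = sum((x - m) ** 2 for x in combo) / n
        # Note: round() resolves exact ties to even, which may differ
        # from the half-up rounding used in a given publication.
        if round(m, decimals) == round(mean, decimals) and \
           round(v, decimals) == round(variance, decimals):
            return True
    return False
```

This reproduces the example above: a variance of 0.2 with a mean of 1.2 is achievable for a sample size of 5 at one decimal, but the same variance paired with a mean of 1.0 is not.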