Jurg Ott, 10 May 2014

For
some test statistics, determining the associated empirical
significance level is mathematically intractable, which is why we
perform permutation (randomization) tests [1]
to generate distributions of test statistics under the null
hypothesis. Randomization tests furnish unbiased and accurate
estimates of *p*-values.
Under the hypothesis of no association, any permutation of the labels
for the two outcome types (case, control) is assumed to be equally
likely and represents a random occurrence of the data under the null
hypothesis of no association. Because of the potentially very large
number of possible permutations, we take a random sample of
permutations according to an approach to sequentially generate a new
random sample based on the previous random sample [2].
In each randomization sample, we compute our test statistic exactly
as it was done in the observed data. The empirical significance level
is then estimated as the proportion of randomization samples with a
test statistic at least as large as the one observed. With multiple
tests such as testing large numbers of variables (SNPs), a major
advantage of permutation-based tests is that both types of empirical
significance levels can be estimated, nominal or point-wise (for each
given variable), and an experiment-wise or global *p*-value.
The latter tends to be smaller than the Bonferroni-corrected
significance level, particularly when variables (and test results)
are correlated.

Some
authors argue that under the null hypothesis the observed data should
be included in the randomized data, and we do this here. Thus, the
smallest significance level achievable is equal to 1/*n*,
where *n* is
the number of permutation samples (including the observed data).

Note:
The maxstat programs require a positive integer seed for their random
number generator, which they will read from the *seed.txt* file
if present, and if not, they will create a seed based on the system
clock.

Let
*p*_{0} denote
a *p*-value uncorrected for testing *n* SNPs, with *p*
denoting the corresponding corrected *p*-value. When SNPs are
independent then *p* = *np*_{0} (Bonferroni
correction), or, more accurately, *p* = 1 – (1 –
*p*_{0})^{n}. Permutation analysis furnishes *p*_{0}
and *p* for possibly dependent SNPs so that these two *p*-values
can be used to obtain the effective number of independent SNPs, which
is given by *n*_{eff} = log(1 – *p*)/log(1 –
*p*_{0}).

We
implemented our approach in the *maxstat*
computer package, available for
__Windows__
and __Linux__
(__Ubuntu__).
As in our __sumstat__
package, the *maxstat*
program requires two input
files, a datafile and a parameter file. The former has the same
structure as for the __sumstat__
program, that is, one row
(line) per SNP and one column for each individual, with the body of
the file containing a genotype code (for example, 1 = AA, 2 = AB, 3 =
BB, 0 = missing) for each SNP and individual. Optionally, the last
three columns specify, for each SNP, the chromosome number, position
in bp, and the SNP ID (e.g., rs1212) The last row consists of a
series of two integers, for example, 1 and 2, or 0 and 1, where the
higher number refers to cases and the lower number to controls.
Examples of parameter files are shown below.

With
small sample sizes, contingency tables sometimes contain empty cells.
It has been recommended to replace zero in such cells by ½
[3]. The *maxstat*
programs, where appropriate,
initially fill all cells by a quantity chosen by the user on input.

As
described below, the *maxstat*
program comes in different
versions. All programs are invoked in a command/terminal window by
typing the program name followed by the name of the parameter file.

Some
of the programs work with quantitative trait values (QTLs) and then
allow for __power
transformation__ to
make distributions more normal if necessary, but this does not seem
needed in permutation testing..

This version assumes that markers (variables) are SNPs and phenotypes are binary such as cases and controls. An example of a relevant parameter file is as follows:

maxstat: Sample data, generated, 200 variables. Heritability 0.50, threshold 2.32 0 0 0 0 codes for genotypes, missing, #obs, empty 14 code for test statistic 10000 number of permutation samples 0 list number of test-SNP for interaction. Below: Input/output file names sampledata.txt maxstatSample.out --------------------------------------------------------- EXPLANATIONS (only the text above is read by the program) Line 1: Any text, truncated to 250 characters. Line 2: Code for genotypes given in the data: 0 = the genotypes are 1, 2, 3 1 = the genotypes are 0, 1, 2 2 = the genotypes are -1, 0, 1 Value for "missing" genotype Number of observations, N. Enter 0 if program should find N. It determines N from the first input line that should end with (1) chromosome number, (2) bp position, and (3) SNP ID. If (1) - (3) are missing then only numbers are present on this line. If (1) - (3) are present then the program recognizes (3) because it starts with a letter. If that is not the case and the SNP ID is a pure number then the program will take it to be a genotype code, which will generally lead to an error message. If such an error occurs at the end of the line, at item k, then the number of observations is k-3, which should be entered here. Initial number of observations in contingency tables. For sparse tables, 0.5 is recommended for stability. Line 3: Code for single-locus test statistic (for details, see below) Line 4: Number of permutation samples Line 5: List number of test SNP, which will be paired with all other SNPs in turn (statistic code 16). If the test SNP number is given as a positive number then the maximum of the 3 chi-squares serves as test statistic, otherwise it's the sum of the 3 chi-squares. Line 6: Name of datafile. No other characters before or after the name. Line 7: Name of output file. No other characters before or after the name. If this line is blank the output file will be called "maxstat.out". ---------------------------------------- Statistic codes on line 4: 1 = chi-square for 2 x 3 table, case-control versus 3 genotypes 2 = difference in mean codes, group 2 vs. group 1. SEE NOTE BELOW. 8 = chi-square for proportion of homozygotes, cases vs controls. SEE NOTE BELOW. 10 = Armitage test for trend 11 = t-test comparing mean genotype codes. SEE NOTE BELOW. 12 = chi-square for 2 x 2 table of alleles, case-control versus 2 alleles 14 = chi-square for differences in allele frequencies and F values. SEE NOTE BELOW. 16 = chi-squares for 2 x 3 table of genotypes conditioned on genotypes of test SNP. 17 = interaction chi-square for given test SNP versus any target SNP 18 = Max2 test: maximum chi-square for dominant and recessive genotype action

NOTE 1 (statistic codes 2, 11, and 14): When applicable, absolute differences between the two groups are computed (two-sided test). But when the statistic code is given as a negative number then the difference type 2 minus type 1 observations will be computed. For example, the test statistic with a code of 11 is |t|. Also, a test code of -14 will test whether Fcase > Fctrl.

NOTE 2 (where appropriate): Chi-square is generally computed as a likelihood ratio (LR) statistic. If the statistic code is entered as a negative number then the Pearson chi-square will be computed.

Statistic code 16 allows for conditional search for disease-associated (target) SNPs given genotypes at a test SNP [4]. This technique might uncover SNPs with negligible main effect (little association effect by themselves) but strong association through interaction with the test SNP.

For
this version, input variables are also SNP genotypes but the
phenotype (output variable) is a quantitative observation, a QTL.
Multiple QTLs may be present. The test statistic for each QTL is *F*
in one-way ANOVA for three
marker genotypes, and the overall test statistic for all QTLs is the
maximum of the test statistics (*F*-values)
for each QTL [5] which
represents an approximation to multivariate analysis [1].
A simple datafile (*sampleQdata.txt*)
with 3 SNPs and 2 QTLs is as follows:

1 1 2 2 3 3 1 1 2 2 3 3 3 1 1 2 2 3 3 3 12 13778 B1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 12 14998 B2 1 3 3 2 2 3 3 1 1 1 1 3 1 3 3 1 1 2 2 2 12 334 B3 5 7 6 6.1 6 8 6 9 6 7 6 7 9 7 10 8 11 10.3 11 13 5 7 6 6 5.9 8 6 10 6 7 6 7 9 7 10 8 11 9 11.2 13

A corresponding parameter file
(*maxstatQ.param*)
is as follows:

maxstatQ: Test data, 2 QTLs 1 0 2 codes for genotypes, missing, number of phenotypes 3 99 code for test statistic, missing QTL value 10000 number of permutation samples. Below: Name of datafile sampleQdata.txt --------------------------------------------------------- EXPLANATIONS (only the text above is read by the program) Line 1 Any text Line 2 Code for genotypes given in the data: -1 = the codes are -1, 0, 1 (genotype codes will 0 = the codes are 0, 1, 2 (be adjusted to 1, 2, 3 1 = the codes are 1, 2, 3 (internally. Value for missing genotype Number of phenotypes (QTLs) appended to genotypes (last input lines) Line 3 Code for single-locus test statistic (for details, see below) Value for missing phenotype (QTL) Line 4 Number of permutation samples Line 5 Name of datafile holding genotypes and phenotypes ---------------------------------------- Codes on line 3 2 = ANOVA for 3 genotypes, max. of F over QTLs 3 = ANOVA for 3 genotypes, sum of F values over QTLs

The
program will write output to a file, *maxstatQ.out*.

Here, input variables are QTL observations and the phenotype (output variable) is binary. This situation may occur, for example, when you have two groups of animals, treated and untreated, and measure the expression level for a gene on each animal. An example of a relevant parameter file is as follows:

maxstatR: sample data for maxstatR program 9 0 codes for missing, #obs 11 1 1 code for test statistic, expo, auto limits (statis=3) 19999 number of permutation samples. Below,input and output file names sampleRdata.txt maxstatRsample.out –------------------------------------------------------------- EXPLANATIONS (only the text above will be read by the program) Line 1: Any text, truncated to 250 characters. Line 2: Value for "missing" Number of observations, N. Enter 0 if program should find N. It determines N from the first input line that should end with (1) chromosome number, (2) bp position, and (3) SNP ID. If (1) - (3) are missing then only numbers are present on this line. If (1) - (3) are present then the program recognizes (3) because it starts with a letter. If that is not the case and the SNP ID is a pure number then the program will take it to be a genotype code, which will generally lead to an error message. If such an error occurs at the end of the line, at item k, then the number of observations is k-3, which should be entered here. Line 3: Code for single-locus test statistic (see below) Exponent for power transformation; expo=1 for no transformation Automatic determination of limits (1) or user input (0) (min and max of values for distribution function; statistic code = 3 only), after transformation if any. If user-chosen, these limits just apply to the range of x-values, for which the max(D) statistic is computed, but all x-values will be used in the calculation of the test. Line 4: Number of permutation samples Line 5: Name of data file. No other characters before or after the name. Line 6: Name of output file. If this line is blank the output file will be "maxstatR.out". ---------------------------------------- Codes on line 3: 2 = difference in mean values, group 2 vs. group 1 3 = Kolmogorov-Smirnov 2-sample test statistic 4 = Anderson-Darling test 11 = 2-sample t statistic

NOTE: A positive value of the
code indicates a 2-sided test, a negative value specifies a 1-sided
test: group 2 mean > group 1 mean is significant. For example, -3
indicates whether values in group 2 tend to be larger than those in
group 1, that is, the test statistic is *D*
= max_{x}[F_{1}(*x*)
– F_{2}(*x*)],
where F_{i}(*x*)
is the empirical distribution function of the i-th group at the point
*x*.
The (usual) two-sided test statistic is

max_{x}|F_{2}(*x*)
– F_{1}(*x*)|.

The
above graph shows empirical distribution functions, F_{case}(*x*)
and F_{control}(*x*);
*x*-axis
= observed quantitative trait values, *y*-axis
= distribution function, that is, proportion of individuals smaller
than or equal to *x*.
The case *x*-values
(blue) are generally larger than the control values (red). That is,
for most *x*-values,
F_{control}(*x*)
> F_{case}(*x*).
The arrow indicates how the statistic, *D*,
is measured.

1.
Manly BFJ: **Randomization, bootstrap, and Monte Carlo methods in
biology**, 3rd edn. Boca Raton, FL: Chapman & Hall/ CRC; 2007.

2.
Nijenhuis A, Wilf HS: **Combinatorial algorithms for computers and
calculators**, 2d edn. New York: Academic Press; 1978.

3.
Berkson J: **Maximum likelihood and minimum chi-square estimates of
the logistic function**. *J Amer Statist Assoc *1955,
**50**:130-162.

4.
Wang G, Yang Y, Ott J: **Genome-wide conditional search for
epistatic disease-predisposing variants in human association studies**.
*Hum Hered *2010, **70**(1):34-41.

5.
Suo C, Toulopoulou T, Bramon E, Walshe M, Picchioni M, Murray R, Ott
J: **Analysis of multiple phenotypes in genome-wide genetic mapping
studies**. *BMC Bioinformatics *2013, **14**:151.