Permutation-based tests for case-control designs

Jurg Ott, 10 May 2014


Permutation tests

For some test statistics, determining the associated empirical significance level is mathematically intractable, which is why we perform permutation (randomization) tests [1] to generate distributions of test statistics under the null hypothesis. Randomization tests furnish unbiased and accurate estimates of p-values. Under the hypothesis of no association, any permutation of the labels for the two outcome types (case, control) is assumed to be equally likely and represents a random occurrence of the data under the null hypothesis of no association. Because of the potentially very large number of possible permutations, we take a random sample of permutations according to an approach to sequentially generate a new random sample based on the previous random sample [2]. In each randomization sample, we compute our test statistic exactly as it was done in the observed data. The empirical significance level is then estimated as the proportion of randomization samples with a test statistic at least as large as the one observed. With multiple tests such as testing large numbers of variables (SNPs), a major advantage of permutation-based tests is that both types of empirical significance levels can be estimated, nominal or point-wise (for each given variable), and an experiment-wise or global p-value. The latter tends to be smaller than the Bonferroni-corrected significance level, particularly when variables (and test results) are correlated.

Some authors argue that under the null hypothesis the observed data should be included in the randomized data, and we do this here. Thus, the smallest significance level achievable is equal to 1/n, where n is the number of permutation samples (including the observed data).

Note: The maxstat programs require a positive integer seed for their random number generator, which they will read from the seed.txt file if present, and if not, they will create a seed based on the system clock.


Effective number of independent observations

Let p0 denote a p-value uncorrected for testing n SNPs, with p denoting the corresponding corrected p-value. When SNPs are independent then p = np0 (Bonferroni correction), or, more accurately, p = 1 – (1 – p0)n. Permutation analysis furnishes p0 and p for possibly dependent SNPs so that these two p-values can be used to obtain the effective number of independent SNPs, which is given by neff = log(1 – p)/log(1 – p0).


Implementation

We implemented our approach in the maxstat computer package, available for Windows and Linux (Ubuntu). As in our sumstat package, the maxstat program requires two input files, a datafile and a parameter file. The former has the same structure as for the sumstat program, that is, one row (line) per SNP and one column for each individual, with the body of the file containing a genotype code (for example, 1 = AA, 2 = AB, 3 = BB, 0 = missing) for each SNP and individual. Optionally, the last three columns specify, for each SNP, the chromosome number, position in bp, and the SNP ID (e.g., rs1212) The last row consists of a series of two integers, for example, 1 and 2, or 0 and 1, where the higher number refers to cases and the lower number to controls. Examples of parameter files are shown below.

With small sample sizes, contingency tables sometimes contain empty cells. It has been recommended to replace zero in such cells by ½ [3]. The maxstat programs, where appropriate, initially fill all cells by a quantity chosen by the user on input.


Program versions

As described below, the maxstat program comes in different versions. All programs are invoked in a command/terminal window by typing the program name followed by the name of the parameter file.

Some of the programs work with quantitative trait values (QTLs) and then allow for power transformation to make distributions more normal if necessary, but this does not seem needed in permutation testing..


Standard maxstat

This version assumes that markers (variables) are SNPs and phenotypes are binary such as cases and controls. An example of a relevant parameter file is as follows:

maxstat: Sample data, generated, 200 variables. Heritability 0.50, threshold 2.32
0 0 0 0  codes for genotypes, missing, #obs, empty
14       code for test statistic
10000    number of permutation samples
0        list number of test-SNP for interaction. Below: Input/output file names
sampledata.txt
maxstatSample.out

---------------------------------------------------------
EXPLANATIONS (only the text above is read by the program)

Line 1: Any text, truncated to 250 characters.

Line 2:
  Code for genotypes given in the data:
    0 = the genotypes are 1, 2, 3
    1 = the genotypes are 0, 1, 2
    2 = the genotypes are -1, 0, 1

  Value for "missing" genotype

  Number of observations, N. Enter 0 if program should find N. It determines N from the first input line that should end with (1) chromosome number, (2) bp position, and (3) SNP ID. If (1) - (3) are missing then only numbers are present on this line. If (1) - (3) are     present then the program recognizes (3) because it starts with a letter. If that is not the case and the SNP ID is a pure number then the program will take it to be a genotype code, which will generally lead to an error message. If such an error occurs at the end of the line, at item k, then the number of observations is k-3, which should be entered here.

  Initial number of observations in contingency tables. For sparse tables, 0.5 is recommended for stability.

Line 3:
  Code for single-locus test statistic (for details, see below)

Line 4:
  Number of permutation samples

Line 5:
  List number of test SNP, which will be paired with all other SNPs in turn (statistic code 16). If the test SNP number is given as a positive number then the maximum of the 3 chi-squares serves as test statistic, otherwise it's the sum of the 3 chi-squares.

Line 6:
  Name of datafile. No other characters before or after the name.

Line 7:
  Name of output file. No other characters before or after the name. If this line is blank the output file will be called "maxstat.out".
----------------------------------------
Statistic codes on line 4:
 1 = chi-square for 2 x 3 table, case-control versus 3 genotypes
 2 = difference in mean codes, group 2 vs. group 1. SEE NOTE BELOW.
 8 = chi-square for proportion of homozygotes, cases vs controls. SEE NOTE BELOW.
10 = Armitage test for trend
11 = t-test comparing mean genotype codes. SEE NOTE BELOW.
12 = chi-square for 2 x 2 table of alleles, case-control versus 2 alleles
14 = chi-square for differences in allele frequencies and F values. SEE NOTE BELOW.
16 = chi-squares for 2 x 3 table of genotypes conditioned on genotypes of test SNP.
17 = interaction chi-square for given test SNP versus any target SNP
18 = Max2 test: maximum chi-square for dominant and recessive genotype action

NOTE 1 (statistic codes 2, 11, and 14): When applicable, absolute differences between the two groups are computed (two-sided test). But when the statistic code is given as a negative number then the difference type 2 minus type 1 observations will be computed. For example, the test statistic with a code of 11 is |t|. Also, a test code of -14 will test whether Fcase > Fctrl.

NOTE 2 (where appropriate): Chi-square is generally computed as a likelihood ratio (LR) statistic. If the statistic code is entered as a negative number then the Pearson chi-square will be computed.

Statistic code 16 allows for conditional search for disease-associated (target) SNPs given genotypes at a test SNP [4]. This technique might uncover SNPs with negligible main effect (little association effect by themselves) but strong association through interaction with the test SNP.


MaxstatQ

For this version, input variables are also SNP genotypes but the phenotype (output variable) is a quantitative observation, a QTL. Multiple QTLs may be present. The test statistic for each QTL is F in one-way ANOVA for three marker genotypes, and the overall test statistic for all QTLs is the maximum of the test statistics (F-values) for each QTL [5] which represents an approximation to multivariate analysis [1]. A simple datafile (sampleQdata.txt) with 3 SNPs and 2 QTLs is as follows:

 1 1 2 2 3 3 1 1 2 2 3 3 3 1 1 2 2 3 3 3  12 13778 B1
 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3  12 14998 B2
 1 3 3 2 2 3 3 1 1 1 1 3 1 3 3 1 1 2 2 2  12   334 B3
 5 7 6 6.1 6 8 6 9 6 7 6 7 9 7 10 8 11 10.3 11 13
 5 7 6 6 5.9 8 6 10 6 7 6 7 9 7 10 8 11 9 11.2 13

A corresponding parameter file (maxstatQ.param) is as follows:

maxstatQ: Test data, 2 QTLs
1 0 2    codes for genotypes, missing, number of phenotypes
3 99     code for test statistic, missing QTL value
10000    number of permutation samples. Below: Name of datafile
sampleQdata.txt

---------------------------------------------------------
EXPLANATIONS (only the text above is read by the program)

Line 1
  Any text

Line 2
  Code for genotypes given in the data:
    -1 = the codes are -1, 0, 1     (genotype codes will
     0 = the codes are 0, 1, 2      (be adjusted to 1, 2, 3
     1 = the codes are 1, 2, 3      (internally.

  Value for missing genotype

  Number of phenotypes (QTLs) appended to genotypes (last input lines)

Line 3
  Code for single-locus test statistic (for details, see below)

  Value for missing phenotype (QTL)

Line 4
  Number of permutation samples

Line 5
  Name of datafile holding genotypes and phenotypes
----------------------------------------
Codes on line 3
 2 = ANOVA for 3 genotypes, max. of F over QTLs
 3 = ANOVA for 3 genotypes, sum of F values over QTLs

The program will write output to a file, maxstatQ.out.


MaxstatR

Here, input variables are QTL observations and the phenotype (output variable) is binary. This situation may occur, for example, when you have two groups of animals, treated and untreated, and measure the expression level for a gene on each animal. An example of a relevant parameter file is as follows:

maxstatR: sample data for maxstatR program
9 0      codes for missing, #obs
11 1 1   code for test statistic, expo, auto limits (statis=3)
19999    number of permutation samples. Below,input and output file names
sampleRdata.txt
maxstatRsample.out

-------------------------------------------------------------
EXPLANATIONS (only the text above will be read by the program)

Line 1: Any text, truncated to 250 characters.

Line 2:
  Value for "missing"

  Number of observations, N. Enter 0 if program should find N. It determines N from the first input line that should end with (1) chromosome number, (2) bp position, and (3) SNP ID. If (1) - (3) are missing then only numbers are present on this line. If (1) - (3) are     present then the program recognizes (3) because it starts with a letter. If that is not the case and the SNP ID is a pure number then the program will take it to be a genotype code, which will generally lead to an error message. If such an error occurs at the end of the line, at item k, then the number of observations is k-3, which should be entered here.

Line 3:
  Code for single-locus test statistic (see below)

  Exponent for power transformation; expo=1 for no transformation

  Automatic determination of limits (1) or user input (0) (min and max of values for distribution function; statistic code = 3 only), after transformation if any. If user-chosen, these limits just apply to the range of x-values, for which the max(D) statistic is computed, but all x-values will be used in the calculation of the test.

Line 4:
  Number of permutation samples

Line 5:
  Name of data file. No other characters before or after the name.

Line 6:
  Name of output file. If this line is blank the output file will be "maxstatR.out".

----------------------------------------
Codes on line 3:
 2 = difference in mean values, group 2 vs. group 1
 3 = Kolmogorov-Smirnov 2-sample test statistic
 4 = Anderson-Darling test
11 = 2-sample t statistic

NOTE: A positive value of the code indicates a 2-sided test, a negative value specifies a 1-sided test: group 2 mean > group 1 mean is significant. For example, -3 indicates whether values in group 2 tend to be larger than those in group 1, that is, the test statistic is D = maxx[F1(x) – F2(x)], where Fi(x) is the empirical distribution function of the i-th group at the point x. The (usual) two-sided test statistic is

maxx|F2(x) – F1(x)|.



The above graph shows empirical distribution functions, Fcase(x) and Fcontrol(x); x-axis = observed quantitative trait values, y-axis = distribution function, that is, proportion of individuals smaller than or equal to x. The case x-values (blue) are generally larger than the control values (red). That is, for most x-values, Fcontrol(x) > Fcase(x). The arrow indicates how the statistic, D, is measured.


References

1. Manly BFJ: Randomization, bootstrap, and Monte Carlo methods in biology, 3rd edn. Boca Raton, FL: Chapman & Hall/ CRC; 2007.

2. Nijenhuis A, Wilf HS: Combinatorial algorithms for computers and calculators, 2d edn. New York: Academic Press; 1978.

3. Berkson J: Maximum likelihood and minimum chi-square estimates of the logistic function. J Amer Statist Assoc 1955, 50:130-162.

4. Wang G, Yang Y, Ott J: Genome-wide conditional search for epistatic disease-predisposing variants in human association studies. Hum Hered 2010, 70(1):34-41.

5. Suo C, Toulopoulou T, Bramon E, Walshe M, Picchioni M, Murray R, Ott J: Analysis of multiple phenotypes in genome-wide genetic mapping studies. BMC Bioinformatics 2013, 14:151.