SNPstat program for genetic association analysis

Jurg Ott   15 Dec 2012

Introduction

This program carries out case-control association analysis employing a variety of possible test statistics. Most of these test one SNP after another for disease association but some allow for conditional analysis of (target) SNPs given a designated (test) SNP. As discussed below, three input files are required, a datafile holding genotype and individual data, a parameter file, and a command file. This program is designed for test statistics with known null distributions so that nominal p-values can be computed or approximated numerically. For more complex test statistics without known null distributions, the sumstat programs approximate null distributions many powerful test statistics via permutation (randomization) sampling. The SNPstat program package includes a sample dataset and corresponding output.

Input files

Datafile

This file is in the sumstat format, that is, rows are SNPs and columns are individuals, with the last three columns (optional) representing, for a given SNP, the chromosome, position on the chromosome, and SNP identifier. The last row identifies disease status for each individual, for example, 0 = unknown, 1 = unaffected, and 2 = affected. This format is similar to plink’s transposed format. To convert a dataset from plink to sumstat format see the appropriate document. An example datafile, sumstatSample.dat, is contained in the program package. It is comprised of 200 SNPs, all are assumed to be on chromosome 8 with artificial position numbers. The SNP identifiers are the letters rs with the position numbers appended.

Parameter file

An example parameter file, paramSample.txt, is shown below, with brief explanations. Details about the various test statistics will be provided in the “Test statistics” section below.

SNPstat: Sample data, generated, 200 variables. Heritability 0.50, threshold 2.32
0 0 0    codes for genotypes and missing, and #obs
13 1 0.5 code for test statistic, lambda, initial # obs
0        list number of test-SNP for interaction. Below: Input file name
sumstatSample.dat
SNPstatSampleResults.out
 

EXPLANATIONS
 
Line 1:  Any text
 
Line 2:
  Code for genotypes given in the data:
      0 = the genotypes are 1, 2, 3 (for AA, AB, BB, respectively)
      1 = the genotypes are 0, 1, 2
      2 = the genotypes are -1, 0, 1
 
  Value for "missing"
 
  Number of observations, N. Enter 0 if program should find N. It determines N from the first input line that should end with (1) chromosome number, (2) bp position, and (3) SNP ID. If (1) - (3) are missing then only numbers (genotype codes) are present. If (1) - (3) are present then the program recognizes (3) because it starts with a letter. If that is not the case and the SNP ID is a pure number then the program will take it to be a genotype code, which will generally lead to an error message. If such an error occurs at the end of the line, at item k, then the number of observations is k-3, which should be entered here.
 
Line 3:
  Code for single-locus test statistic (see below). Chi-square (codes 1 and 12) is generally computed from the likelihood ratio (LR) statistic. If these codes are entered as negative numbers, for example, -1 instead of 1, then the Pearson chi-square will be computed.
 
  Lambda value = correction factor for genomic control; 1 for no correction
 
  Initial number of observations in contingency tables. May be zero but 0.5 has recommended for stability in sparse tables [1].
 
Line 4:
  List number of test SNP, which will be paired with all other SNPs in turn (statistic 16 or 17). Enter 0 if no such conditioning on test SNP genotypes is desired. If the SNP number is given as a positive number then the maximum of the 3 chi-squares serves as test statistic and the resulting p-value is multiplied by 3 (Bonferroni correction); if the code is given as a negative number then the sum of the 3 chi-squares is computed leading to a chi-square with 6 df.
 
Line 5:
  Name of input file. No other characters before or after the name.
 
Line 6:
  Name of output file. If this line is blank then the output file will be SNPstat.out.
----------------------------------------
Codes on line 3:
 
 1 = chi-square for 2 x 3 table, case-control versus 3 genotypes
 2 = chi-square for 2 x 2 table, case-control versus 2 homozygotes (no heterozygotes)
 8 = Homozygotes versus heterozygotes, 2-sided
 9 = Homozygotes versus heterozygotes, 1-sided: Homoz associated with disease
10 = test for trend
11 = |t|, t-test
12 = chi-square for 2 x 2 table of alleles, case-control versus 2 alleles
13 = Chi-square, rec (1 vs 2+3) or dom (1+2 vs 3) (the larger of the two)
16 = chi-square for 2 x 3 table of genotypes conditioned on genotypes of test SNP
17 = interaction chi-square for given SNP versus all other SNPs

Running the program

The program should be run in a Windows command (cmd) window. It should reside in the same folder as the datafile or be accessible through the Windows path. With the sample datafile and parameter file above, the program is started when you type the command, SNPstat paramSample.txt. Output will then be found in a file called SNPstatSampleResults.out. As this file may be rather large, it is preferably viewed, for example, with the Wordpad program (part of Windows).

Test statistics

Code 1. Consider a 2 x 3 contingency table, where rows correspond to controls and cases, and columns represent the the three genotypes of a given, while the cells contain numbers of individuals. the SNPstat program will construct such a table for each SNP and compute chi-square (2 df). With small numbers of observations, the initial cell count may be set to 0.5 rather than 0 for stability [1].

Code 2. A 2 x 2 table is constructed for each SNP, where rows are cases/controls and columns are the two homozygotes, AA/BB. Heterozygotes are disregarded.

Code 8. Analogous to code 2, but the two columns refer to homozygotes (AA + BB) and heterozygotes (AB).

Code 9. Analogous to code 8 but the test is carried out in a 1-sided manner, that is, a higher proportion of homozygotes in cases than controls is significant.

Code 10. Armitage’s test for trend [2].

Code 11. t-test carried out on genotype codes as “quantitative measurements”

Code 12. Chi-square in 2 x 2 table, rows = cases and controls, columns = two SNP alleles

Code 13. Two 2 x 2 tables are constructed, each with rows for cases and controls, but (1) columns = genotype 1 versus genotypes 2 + 3 combined (recessive), and (2) columns = genotypes 1 + 2 versus 3 (dominant). The larger of the resulting two chi-squares is the test statistic, and the associated p-value is corrected for  two tests, that is, the p-value reported is p(2 – p) (Bonferroni correction).

Code 16. For a given SNP (called the test SNP), data are divided into 3 groups depending on the genotype at this SNP. The sequence number (order in which the SNP is listed in the datafile) of the test SNP must be indicated on line 4. Then an association analysis (code 1) is carried out for any other SNP (called target SNP) in each of the 3 groups of individuals. The resulting 3 chi-squares are independent and are either (a) summed up for a resulting chi-square with 6 df, or (b) the maximum of the 3 chi-squares is retained and its associated p-value corrected for 3 tests. If the target SNP number is positive then (b) is carried out, if it is negative then (a) is carried out. For example, if line 4 contains the number 311 that means that the 311-th SNP should be the test SNP. This analysis tests for the main effect of the target SNP and its interaction effect of the test SNP [3].

Code 17. Analogous to code 16 (target SNP number is always positive) but the test statistic only reflects the interaction between the test and target SNPs. This interaction term is obtained by partitioning chi-square into main and interaction effects [4].

SNPstatQ program

An analogous program version is in preparation that can handle quantitative traits rather than disease statuses, with 1-way ANOVA (F-tests) as test statistics.

References

1 Berkson J: Maximum likelihood and minimum chi-square estimates of the logistic function. J Amer Statist Assoc 1955;50:130-162.

2 Agresti A: Categorical data analysis, ed 2nd. New York, Wiley-Interscience, 2002.

3 Wang G, Yang Y, Ott J: Genome-wide conditional search for epistatic disease-predisposing variants in human association studies. Hum Hered 2010;70:34-41.

4 Yang Y, He C, Ott J: Testing association with interactions by partitioning chi-squares. Ann Hum Genet 2009;73:109-117.