Jurg Ott 15 Dec 2012
This program carries out case-control association analysis employing a variety of possible test statistics. Most of these test one SNP after another for disease association, but some allow for conditional analysis of (target) SNPs given a designated (test) SNP. As discussed below, three input files are required: a datafile holding genotype and individual data, a parameter file, and a command file. This program is designed for test statistics with known null distributions, so that nominal p-values can be computed or approximated numerically. For more complex test statistics without known null distributions, the sumstat programs approximate the null distributions of many powerful test statistics via permutation (randomization) sampling. The SNPstat program package includes a sample dataset and corresponding output.
This file is in the sumstat format, that is, rows are SNPs and columns are individuals, with the last three columns (optional) representing, for a given SNP, the chromosome, the position on the chromosome, and the SNP identifier. The last row identifies disease status for each individual, for example, 0 = unknown, 1 = unaffected, and 2 = affected. This format is similar to plink’s transposed format. To convert a dataset from plink to sumstat format, see the appropriate document. An example datafile, sumstatSample.dat, is contained in the program package. It comprises 200 SNPs, all of which are assumed to be on chromosome 8 with artificial position numbers. The SNP identifiers are the letters rs with the position numbers appended.
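The layout described above can be illustrated with a minimal Python sketch. It is not part of the SNPstat package. It assumes whitespace-separated fields, integer genotype codes, and a status row that holds only one code per individual; the function name parse_sumstat and the tiny inline dataset are hypothetical.

```python
def parse_sumstat(text):
    """Split sumstat-style text into genotypes, SNP annotations, and disease status.

    Assumptions (not confirmed by the SNPstat documentation): fields are
    whitespace-separated, genotypes are integer codes, and the last row
    contains exactly one status code per individual.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    *snp_rows, status_row = rows              # last row = disease status
    status = [int(s) for s in status_row]
    n_ind = len(status)                       # number of individuals
    genotypes, info = [], []
    for row in snp_rows:
        genotypes.append([int(g) for g in row[:n_ind]])
        extra = row[n_ind:]                   # optional: chromosome, position, SNP id
        info.append(tuple(extra) if len(extra) == 3 else None)
    return genotypes, info, status


# Two made-up SNPs for four individuals, followed by the status row.
sample = """\
1 2 3 1 8 1001 rs1001
2 2 1 3 8 1002 rs1002
1 2 2 1
"""
genotypes, info, status = parse_sumstat(sample)
```

Because the three annotation columns are optional, the sketch infers the number of individuals from the status row rather than from the SNP rows.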
An example parameter file, paramSample.txt, is shown below, with brief explanations. Details about the various test statistics will be provided in the “Test statistics” section below.
SNPstat: Sample data, generated, 200 variables. Heritability 0.50, threshold 2.32
The program should be run in a Windows command (cmd) window. It should reside in the same folder as the datafile or be accessible through the Windows path. With the sample datafile and parameter file above, start the program by typing the command SNPstat paramSample.txt. Output will then be found in a file called SNPstatSampleResults.out. As this file may be rather large, it is best viewed with a program such as WordPad (part of Windows).
Code 1. Consider a 2 x 3 contingency table, where rows correspond to controls and cases, columns represent the three genotypes of a given SNP, and the cells contain numbers of individuals. The SNPstat program constructs such a table for each SNP and computes chi-square (2 df). With small numbers of observations, the initial cell count may be set to 0.5 rather than 0 for stability [1].
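As an illustration of the genotype test just described, here is a minimal Python sketch of a Pearson chi-square on a 2 x 3 table. It is not taken from the SNPstat source; the function name chi_square_2x3, the parameter start (the initial cell count), and the example counts are hypothetical.

```python
import math


def chi_square_2x3(controls, cases, start=0.0):
    """Pearson chi-square (2 df) for a 2 x 3 genotype table.

    controls, cases: counts of the three genotypes in each group.
    start: value added to every cell before testing (for example 0.5).
    With start = 0, a genotype absent from both groups would give a
    zero expected count and a division error; this sketch does not guard
    against that case.
    """
    table = [[c + start for c in controls], [c + start for c in cases]]
    row_tot = [sum(r) for r in table]
    col_tot = [table[0][j] + table[1][j] for j in range(3)]
    n = sum(row_tot)
    chi2 = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    p = math.exp(-chi2 / 2.0)  # exact upper tail of chi-square with 2 df
    return chi2, p


# Made-up genotype counts for 100 controls and 100 cases.
chi2, p = chi_square_2x3(controls=[30, 50, 20], cases=[20, 50, 30])
```

For 2 df the upper-tail probability has the closed form exp(-chi2/2), so no statistics library is needed.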
Code 2. A 2 x 2 table is constructed for each SNP, where rows are cases/controls and columns are the two homozygotes, AA/BB. Heterozygotes are disregarded.
Code 8. Analogous to code 2, but the two columns refer to homozygotes (AA + BB) and heterozygotes (AB).
Code 9. Analogous to code 8, but the test is carried out in a 1-sided manner, that is, only a higher proportion of homozygotes in cases than in controls counts as evidence for association.
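The three homozygote-based tests (codes 2, 8, and 9) can be sketched in Python as follows. This is an illustration under assumptions, not the program's own code: it uses the Pearson chi-square without continuity correction, and it obtains the 1-sided p-value by halving the 2-sided one when the deviation is in the tested direction. The function names and example counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df
    return chi2, p


def homozygote_tests(controls, cases):
    """controls, cases: (AA, AB, BB) genotype counts."""
    # Code 2: AA versus BB, heterozygotes disregarded.
    code2 = chi_square_2x2(controls[0], controls[2], cases[0], cases[2])
    # Code 8: homozygotes (AA + BB) versus heterozygotes (AB).
    hom_ctl, het_ctl = controls[0] + controls[2], controls[1]
    hom_cas, het_cas = cases[0] + cases[2], cases[1]
    chi2, p_two = chi_square_2x2(hom_ctl, het_ctl, hom_cas, het_cas)
    # Code 9: 1-sided; assumed here to halve p when cases show the excess.
    excess = hom_cas / (hom_cas + het_cas) > hom_ctl / (hom_ctl + het_ctl)
    p_one = p_two / 2.0 if excess else 1.0 - p_two / 2.0
    return code2, (chi2, p_two), (chi2, p_one)


# Made-up genotype counts.
code2, code8, code9 = homozygote_tests(controls=(30, 50, 20), cases=(40, 30, 30))
```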
Code 10. Armitage’s test for trend [2].
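For orientation, the trend statistic can be written out in a few lines of Python. The sketch below uses the standard Cochran-Armitage form with genotype scores 0, 1, 2; whether SNPstat uses exactly these scores is an assumption, and the function name and counts are hypothetical.

```python
import math


def armitage_trend(controls, cases, weights=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 df) for three genotype classes.

    controls, cases: counts of the three genotypes in each group.
    weights: genotype scores (assumed additive, 0/1/2).
    """
    n = [controls[j] + cases[j] for j in range(3)]   # column totals
    total = sum(n)                                   # all individuals
    r = sum(cases)                                   # number of cases
    sum_wr = sum(w * c for w, c in zip(weights, cases))
    sum_wn = sum(w * m for w, m in zip(weights, n))
    sum_w2n = sum(w * w * m for w, m in zip(weights, n))
    num = total * (total * sum_wr - r * sum_wn) ** 2
    den = r * (total - r) * (total * sum_w2n - sum_wn ** 2)
    chi2 = num / den
    p = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df
    return chi2, p


# Made-up genotype counts for 100 controls and 100 cases.
chi2, p = armitage_trend(controls=[30, 50, 20], cases=[20, 50, 30])
```

In this example the case-control difference is purely linear in the genotype score, so the trend chi-square (1 df) equals the 2-df genotype chi-square for the same table and yields a smaller p-value.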
Code 11. A t-test is carried out on genotype codes, treated as “quantitative measurements”.
Code 12. Chi-square in a 2 x 2 table, where rows = cases and controls, and columns = the two SNP alleles.
Code 13. Two 2 x 2 tables are constructed, each with rows for cases and controls, but (1) columns = genotype 1 versus genotypes 2 + 3 combined (recessive), and (2) columns = genotypes 1 + 2 versus 3 (dominant). The larger of the resulting two chi-squares is the test statistic, and the associated p-value is corrected for two tests, that is, the p-value reported is p(2 – p) = 1 – (1 – p)^2. This is the Šidák form of the correction; the Bonferroni value would be 2p, which is slightly larger.
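The maximum-of-two-tests procedure can be sketched in Python as below. This is an illustration only: it assumes a Pearson chi-square without continuity correction for each 2 x 2 table, and the function names and counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df
    return chi2, p


def max_of_two_models(controls, cases):
    """controls, cases: counts of genotypes 1, 2, 3 in each group."""
    # (1) genotype 1 versus genotypes 2 + 3 combined
    t1 = chi_square_2x2(controls[0], controls[1] + controls[2],
                        cases[0], cases[1] + cases[2])
    # (2) genotypes 1 + 2 combined versus genotype 3
    t2 = chi_square_2x2(controls[0] + controls[1], controls[2],
                        cases[0] + cases[1], cases[2])
    chi2, p = max(t1, t2, key=lambda t: t[0])  # keep the larger chi-square
    return chi2, p * (2.0 - p)                 # p(2 - p), corrected for two tests


# Made-up genotype counts for 100 controls and 100 cases.
chi2, p_corrected = max_of_two_models(controls=[40, 45, 15], cases=[20, 50, 30])
```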
Code 16. For a given SNP (called the test SNP), data are divided into 3 groups depending on the genotype at this SNP. The sequence number (order in which the SNP is listed in the datafile) of the test SNP must be indicated on line 4. Then an association analysis (code 1) is carried out for every other SNP (called the target SNP) in each of the 3 groups of individuals. The resulting 3 chi-squares are independent and are either (a) summed up for a resulting chi-square with 6 df, or (b) the maximum of the 3 chi-squares is retained and its associated p-value corrected for 3 tests. If the SNP number on line 4 is positive, (b) is carried out; if it is negative, (a) is carried out. For example, if line 4 contains the number 311, the 311th SNP is the test SNP. This analysis tests for the main effect of the target SNP and its interaction with the test SNP [3].
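The conditional analysis can be sketched in Python for a single test/target pair. The sketch is not the SNPstat implementation and rests on several assumptions: genotypes are coded 1, 2, 3 with no missing values, every cell starts at 0.5 so that sparse groups do not cause division by zero, and the correction for 3 tests is taken to be 1 – (1 – p)^3 by analogy with code 13. All function and parameter names are hypothetical.

```python
import math


def chi_square_2x3(controls, cases, start=0.5):
    """Pearson chi-square (2 df); every cell starts at `start` (assumed 0.5)."""
    table = [[c + start for c in controls], [c + start for c in cases]]
    row_tot = [sum(r) for r in table]
    col_tot = [table[0][j] + table[1][j] for j in range(3)]
    n = sum(row_tot)
    chi2 = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


def conditional_test(test_geno, target_geno, status, use_sum):
    """Genotypes coded 1, 2, 3; status 1 = unaffected, 2 = affected, 0 = unknown."""
    # counts[g][s][t]: test-SNP group g, row s (0 = control, 1 = case), target genotype t
    counts = [[[0] * 3 for _ in range(2)] for _ in range(3)]
    for g, t, s in zip(test_geno, target_geno, status):
        if s in (1, 2):                      # individuals of unknown status are skipped
            counts[g - 1][s - 1][t - 1] += 1
    chi2s = [chi_square_2x3(c[0], c[1]) for c in counts]
    if use_sum:                              # variant (a): sum, 6 df
        total = sum(chi2s)
        x = total / 2.0
        p = math.exp(-x) * (1.0 + x + x * x / 2.0)  # chi-square upper tail, 6 df
        return total, p
    best = max(chi2s)                        # variant (b): maximum, 2 df
    p = math.exp(-best / 2.0)
    return best, 1.0 - (1.0 - p) ** 3        # assumed correction for 3 tests


# Made-up data for 12 individuals.
test_geno = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
target_geno = [1, 2, 3, 1, 1, 2, 3, 2, 3, 3, 1, 2]
status = [1, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 0]
sum_chi2, sum_p = conditional_test(test_geno, target_geno, status, use_sum=True)
max_chi2, max_p = conditional_test(test_geno, target_geno, status, use_sum=False)
```

In the program, this computation would be repeated for every target SNP other than the test SNP.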
Code 17. Analogous to code 16 (the SNP number on line 4 is always positive), but the test statistic only reflects the interaction between the test and target SNPs. This interaction term is obtained by partitioning chi-square into main and interaction effects [4].
An analogous program version is in preparation that can handle quantitative traits rather than disease statuses, with 1-way ANOVA (F-tests) as test statistics.
1 Berkson J: Maximum likelihood and minimum chi-square estimates of the logistic function. J Amer Statist Assoc 1955;50:130-162.
2 Agresti A: Categorical data analysis, 2nd ed. New York, Wiley-Interscience, 2002.
3 Wang G, Yang Y, Ott J: Genome-wide conditional search for epistatic disease-predisposing variants in human association studies. Hum Hered 2010;70:34-41.
4 Yang Y, He C, Ott J: Testing association with interactions by partitioning chi-squares. Ann Hum Genet 2009;73:109-117.