Jurg Ott 15 Dec 2012

This program carries out case-control
association analysis
employing a variety of possible test statistics. Most of these test one
SNP
after another for disease association but some allow for conditional
analysis
of (target) SNPs given a designated (test) SNP. As discussed below,
three input
files are required, a datafile holding genotype and individual data, a
parameter file, and a command file. This
program is designed
for test statistics with known null distributions so that nominal
p-values can
be computed or approximated numerically. For more complex test
statistics
without known null distributions, the *sumstat* programs approximate
null distributions many powerful
test statistics via permutation (randomization) sampling. The SNPstat program package
includes a sample dataset and corresponding output.

This file is in the *sumstat* format, that is, rows
are SNPs and columns are
individuals, with the last three columns (optional) representing, for a
given
SNP, the chromosome, position on the chromosome, and SNP identifier.
The last
row identifies disease status for each individual, for example, 0 =
unknown, 1
= unaffected, and 2 = affected. This format is similar to *plink*’s transposed format. To
convert a dataset from *plink* to *sumstat* format see the appropriate document.
An example
datafile, *sumstatSample.dat*, is
contained in the program package. It is comprised of 200 SNPs, all are
assumed
to be on chromosome 8 with artificial position numbers. The SNP
identifiers are
the letters *rs* with the position
numbers appended.

An example parameter file, *paramSample.txt*, is shown below, with
brief explanations.
Details about the various test statistics will be provided in the “Test
statistics” section below.

0 0 0 codes for genotypes and missing, and #obs

13 1 0.5 code for test statistic, lambda, initial # obs

0 list number of test-SNP for interaction. Below: Input file name

sumstatSample.dat

SNPstatSampleResults.out

EXPLANATIONS

Line 1: Any text

Line 2:

Code for genotypes given in the data:

0 = the genotypes are 1, 2, 3 (for AA, AB, BB, respectively)

1 = the genotypes are 0, 1, 2

2 = the genotypes are -1, 0, 1

Value for "missing"

Number of observations, N. Enter 0 if program should find N. It determines N from the first input line that should end with (1) chromosome

Line 3:

Code for single-locus test statistic (see below). Chi-square (codes 1 and 12) is generally computed from the likelihood ratio (LR) statistic. If these codes are entered as negative numbers, for example, -1 instead

Lambda value = correction factor for genomic control; 1 for no correction

Initial number of observations in contingency tables. May be zero but 0.5

Line 4:

List number of test SNP, which will be paired with all other SNPs in turn (statistic 16 or 17). Enter 0 if no such conditioning on test SNP genotypes is

Line 5:

Name of input file. No other characters before or after the name.

Line 6:

Name of output file. If this line is blank then the output file will be SNPstat.out.

1 = chi-square for 2 x 3 table, case-control versus 3 genotypes

2 = chi-square for 2 x 2 table, case-control versus 2 homozygotes (no heterozygotes)

8 = Homozygotes versus heterozygotes, 2-sided

9 = Homozygotes versus heterozygotes, 1-sided: Homoz associated with disease

10 = test for trend

11 = |t|, t-test

12 = chi-square for 2 x 2 table of alleles, case-control versus 2 alleles

13 = Chi-square, rec (1 vs 2+3) or dom (1+2 vs 3) (the larger of the two)

16 = chi-square for 2 x 3 table of genotypes conditioned on genotypes of test SNP

17 = interaction chi-square for given SNP versus all other SNPs

The program should be run in a
Windows command (cmd) window.
It should reside in the same folder as the datafile or be accessible
through
the Windows path. With the sample datafile and parameter file above,
the
program is started when you type the command, SNPstat
paramSample.txt. Output will then be found in a file called *SNPstatSampleResults.out*.
As this file may be rather large, it is preferably viewed, for example,
with
the *Wordpad* program (part of
Windows).

**Code 1**.
Consider
a 2 x 3 contingency table, where rows correspond to controls and cases,
and
columns represent the the three genotypes of a given, while the cells
contain
numbers of individuals. the SNPstat program will construct such a table
for
each SNP and compute chi-square (2 df). With small numbers of
observations, the
initial cell count may be set to 0.5 rather than 0 for
stability [1].

**Code 2**.
A 2 x 2
table is constructed for each SNP, where rows are cases/controls and
columns
are the two homozygotes, AA/BB. Heterozygotes are disregarded.

**Code 8**.
Analogous
to code 2, but the two columns refer to homozygotes (AA + BB) and
heterozygotes
(AB).

**Code 9**.
Analogous
to code 8 but the test is carried out in a 1-sided manner, that is, a
higher
proportion of homozygotes in cases than controls is significant.

**Code 10**.
Armitage’s test for trend [2].

**Code 11**.
t-test
carried out on genotype codes as “quantitative measurements”

**Code 12**.
Chi-square in 2 x 2 table, rows = cases and controls, columns = two SNP
alleles

**Code 13**.
Two 2 x
2 tables are constructed, each with rows for cases and controls, but
(1)
columns = genotype 1 versus genotypes 2 + 3 combined (recessive), and
(2)
columns = genotypes 1 + 2 versus 3 (dominant). The larger of the
resulting two
chi-squares is the test statistic, and the associated p-value is
corrected for two
tests, that is, the p-value reported is
p(2 – p) (Bonferroni correction).

**Code 16**.
For a
given SNP (called the test SNP), data are divided into 3 groups
depending on
the genotype at this SNP. The sequence number (order in which the SNP
is listed
in the datafile) of the test SNP must be indicated on line 4. Then an
association analysis (code 1) is carried out for any other SNP (called
target
SNP) in each of the 3 groups of individuals. The resulting 3
chi-squares are
independent and are either (a) summed up for a resulting chi-square
with 6 df,
or (b) the maximum of the 3 chi-squares is retained and its associated
p-value
corrected for 3 tests. If the target SNP number is positive then (b) is
carried
out, if it is negative then (a) is carried out. For example, if line 4
contains
the number 311 that means that the 311-th SNP should be the test SNP.
This
analysis tests for the main effect of the target SNP *and*
its interaction effect of the test SNP [3].

**Code 17**.
Analogous to code 16 (target SNP number is always positive) but the
test
statistic only reflects the interaction between the test and target
SNPs. This
interaction term is obtained by partitioning chi-square into main and
interaction effects [4].

An analogous program version is in preparation that can handle quantitative traits rather than disease statuses, with 1-way ANOVA (F-tests) as test statistics.

1 Berkson J: Maximum likelihood and minimum chi-square estimates of the logistic function. J Amer Statist Assoc 1955;50:130-162.

2 Agresti A: Categorical data analysis, ed 2nd. New York, Wiley-Interscience, 2002.

3 Wang G, Yang Y, Ott J: Genome-wide conditional search for epistatic disease-predisposing variants in human association studies. Hum Hered 2010;70:34-41.

4 Yang Y, He C, Ott J: Testing association with interactions by partitioning chi-squares. Ann Hum Genet 2009;73:109-117.