Josephine Hoh, Yale University, New Haven (josephine.hoh@yale.edu)

Jurg Ott, Institute of Psychology CAS, Beijing, and Rockefeller University, New York (ott@rockefeller.edu)

20 Jan 2013, http://lab.rockefeller.edu/ott/

(*Sumstat* and *sumstatQ* computer programs)

The general approach is described
in Hoh et al. (2001). Briefly, consider a number of marker loci in
the genome. At each marker, genotypes are available for two types of
observations (but see below for quantitative trait phenotypes,
*sumstatQ*), for example:

Case and control individuals, or

Two types of sibling pairs, affected-affected (AA) and affected-unaffected (AU), or

Families for which two types of genotypes are available, (1) observed genotypes and (2) genotypes generated under the hypothesis of no linkage, for example, with the simulate2 computer program.

The general idea is to find a *set*
of SNPs (variables, generally) that jointly are associated with
disease. We do this by initially computing an association statistic
for each SNP, for example, χ^{2} for a 2 × 3 table,
where the two rows correspond to cases and controls, and the three
columns refer to SNP genotypes. Then markers are ordered by the size
of their test statistics, and sums, *S*, are formed
sequentially, starting with the largest test statistic and adding
one SNP after another up to a maximum number of terms in *S*.
For example, *S*_{3} will be the sum of the three
largest test statistics for SNPs wherever they are in the genome. For
each *S*_{i}, an associated significance level is
computed with permutation testing. The smallest such significance
level serves as the experiment-wise test statistic, whose own
significance level is in turn computed from the permutation samples.

The problem handled here is
discussed in Manly (2007) under the heading of *Comparison
of Sample Mean Vectors*.
The use of a sum of univariate test statistics (which is what we
compute here) as a multivariate test statistic for a randomization
test seems to have been first proposed by Chung and Fraser
(1958).

In the current implementation, marker loci must be
SNPs (two alleles each). The newest version of
sumstat represents a major upgrade and contains source and
executable code for Windows and Linux. The program comes in two
flavors, (1) regular *sumstat*
that keeps data in memory (heap),
and (2) *sumstatS*
that repeatedly reads data from
disk; (2) is slower than (1) but can run when (1) might run out of
memory. For Linux users, the following section provides some
practical hints.

Instructions given here are specific for Ubuntu Linux but should work in most Linux environments. Download the *sumstat* package and extract all files. Delete *sumstat.exe* and *sumstatS.exe*. The compilable files are those with the extension *.pas*, while files with the extension *.p* are include files that will be read by the main program upon compilation. The *sumstat.linux* file is executable code for Linux with some settings but you may want to create executable code with settings for your specific needs. Relevant settings are as follows and may be found in the CONST section of the *sumstat.pas* file:

maxobs = 5000; {maximum number of observations}

maxvar = 800000; {max. number of SNPs excl. responses}

maxsum = 500; {max. number of terms in a sum}

howoften = 50; {after how many random samples to write to screen}

maxnamelength=20; {max length of SNP name}

The program will allocate memory for
as many individuals (cases and controls) as are present in the data,
but for each individual it will allocate memory for as many variables
(SNPs) as given by the value of *maxvar*. Thus, to keep memory
requirements low you may want to change the value of *maxvar* to
something just slightly higher than the actual number of SNPs in your
data. Do this by opening the *sumstat.pas* file in your text
editor and changing numbers as desired. Then compile the program in a
terminal window (ctrl-alt-T) by typing **fpc sumstat.pas**, which
should take only a few seconds. Disregard warnings about code not
being accessible or about “link.res containing output
sections”. You may check the compiled version by typing
**./sumstat**. If everything looks satisfactory, make the program
accessible (put it in the path) by typing **sudo mv sumstat
/usr/local/bin**, which will move the *sumstat* program file
to a folder in the path. Verify your actions by typing **sumstat**,
which should invoke the program, but it will not carry out
calculations as you have not yet furnished a parameter file name (see
elsewhere in this document).

For each of our sum statistics,
determining the associated significance level analytically is
intractable, which is why we perform permutation
(randomization) tests (Manly 2007). Under the hypothesis of no
association, any permutation of the labels for the two outcome types
(case, control) is expected to be equally likely. Because of the
potentially very large number of possible permutations, we take a
random sample of permutations and approximate empirical significance
levels. Based on the given array of case and control labels in the
observed data, random permutations of labels are generated
sequentially according to an algorithm by Nijenhuis and Wilf (1978).
Specifically, let *S*_{obs} be the sum statistic
obtained in the observed data for a given number of SNPs. In each
permutation sample (same genotypes as observed but labels “affected”
and “unaffected” permuted), we compute a sum statistic,
*S*_{perm}, in the same way as in the observed data. The
proportion of permutation samples for which *S*_{perm} ≥
*S*_{obs} is then an estimate for the empirical
significance level associated with *S*_{obs}.
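The first-level procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the *sumstat* source code: the function names are hypothetical, genotypes are assumed to be coded 1, 2, 3 without missing values, labels are 1 (control) and 2 (case), a Pearson chi-square stands in for the single-SNP statistic, and `random.shuffle` stands in for the Nijenhuis and Wilf algorithm. The observed data are counted as one null sample.

```python
import random

def chi2_2x3(genos, labels):
    """Pearson chi-square for a 2 x 3 table (rows: labels 1/2, columns: genotypes 1/2/3)."""
    table = [[0.0] * 3 for _ in range(2)]
    for g, y in zip(genos, labels):
        table[y - 1][g - 1] += 1
    n = float(len(labels))
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(3)]
    stat = 0.0
    for i in range(2):
        for j in range(3):
            expected = row[i] * col[j] / n
            if expected > 0:  # skip genotypes that do not occur
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

def sum_statistics(snps, labels, max_terms):
    """Return [S_1, ..., S_max_terms], cumulative sums of the largest single-SNP statistics."""
    stats = sorted((chi2_2x3(g, labels) for g in snps), reverse=True)
    sums, total = [], 0.0
    for s in stats[:max_terms]:
        total += s
        sums.append(total)
    return sums

def permutation_pvalues(snps, labels, max_terms, n_perm, rng):
    """Estimate the significance level of each S_i by permuting the labels."""
    observed = sum_statistics(snps, labels, max_terms)
    counts = [1] * len(observed)  # the observed data count as one null sample
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # genotypes stay fixed, only labels move
        permuted = sum_statistics(snps, shuffled, max_terms)
        for k, (s_perm, s_obs) in enumerate(zip(permuted, observed)):
            if s_perm >= s_obs:
                counts[k] += 1
    return [c / (n_perm + 1) for c in counts]
```

Here `snps` is a list with one genotype list per SNP, and `rng` is a `random.Random` instance so that runs can be repeated with a fixed seed.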

A second level of permutations is performed (in the *statpval*
procedure) as follows. Assume that we evaluate *N* sum
statistics, that is, the *N*-th sum statistic contains the test
statistics summed over *N* SNPs (*N* = 15 is generally a
suitable number). For each sum statistic with a given number of
terms, the associated empirical significance level is determined. We
then take the smallest of these as our single overall statistic of
interest and determine its associated (experiment-wise) significance
level. This is done on the basis of the permutation samples obtained
at the first level (Manly 2007, section 6.8): each of these samples in turn
is taken as an “observed” (test) dataset, and the
remaining permutation samples are used to evaluate the significance
level of the smallest *p*-value in the “observed”
permutation sample.
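A minimal Python sketch of this second level follows. It assumes that the sum statistics of all samples have been stored in a list `sums`, with `sums[0]` holding the observed data and each later entry holding one permutation sample; the function name and data layout are hypothetical and do not reflect how *statpval* stores its values. For simplicity, each sample is compared with all samples including itself.

```python
def experimentwise_pvalue(sums):
    """sums[b][k] = sum statistic with k+1 terms in sample b; sample 0 is the observed data.

    Returns (smallest p-value over the sums in the observed data,
             experiment-wise p-value of that smallest p-value).
    """
    n = len(sums)
    n_terms = len(sums[0])
    min_p = []
    for b in range(n):
        # treat sample b as "observed" and find its smallest p-value over all sums
        best = 1.0
        for k in range(n_terms):
            at_least = sum(1 for other in sums if other[k] >= sums[b][k])
            best = min(best, at_least / n)
        min_p.append(best)
    # proportion of samples whose smallest p-value is as small as the observed one
    final = sum(1 for m in min_p if m <= min_p[0]) / n
    return min_p[0], final
```

The double loop is quadratic in the number of samples, which is fine for a demonstration; ranking each column once would be the faster choice for large numbers of permutations.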

Some authors argue that under the null
hypothesis the observed data should be included in the randomized
data, and we do this here. Thus, the smallest significance level
achievable is equal to 1/*n*, where *n* is the number of
permutation samples (including the observed data).
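Written as a formula, and assuming that *B* denotes the number of random permutations actually drawn (a symbol introduced here only for illustration, so that *n* = *B* + 1), the estimate and its lower bound would read:

```latex
\hat{p} \;=\; \frac{1 + \#\{\, b : S_{\mathrm{perm}}^{(b)} \ge S_{\mathrm{obs}},\; b = 1,\dots,B \,\}}{B + 1},
\qquad
\hat{p} \;\ge\; \frac{1}{B + 1} \;=\; \frac{1}{n}.
```

For example, with *B* = 10000 the bound is 1/10001, which prints as 0.000100 when rounded to six decimals.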

With
missing observations, it may happen for one or more variables that a
permutation sample contains observations on only one class (either
cases or controls). Such ill-conditioned permutation samples will be
skipped. The approach described above has been implemented in the
computer program, *sumstat*,
which is available for downloading including a sample dataset.

The *sumstat* program works
with the following three text files.

The data file (named in the parameter file below) should consist of a matrix with *K* + 1 rows and *N* columns, where *N* = number of observations (e.g., case and control individuals) and *K* = number of input variables (e.g., SNPs). Each cell in the matrix contains a genotype code. The last row contains the code for the type of observation (e.g., 2 = case and 1 = control). Optionally, the *N* input observations in each row (for a given marker) may be followed by three quantities: (1) chromosome number, (2) position in base pairs of the marker on the chromosome, and (3) marker name. These items allow for an easy ordering of markers by their chromosomal positions, for example, with the *sort* command in Windows (not really relevant for *sumstat*, but for *scanstat*). On the last input line, you may use the following dummy values: (1) 99, (2) 99, (3) xx.

NOTE: Do not use marker names starting
with x, &, %, $, followed by digits or else the program will
interpret these as numbers to a base other than 10. For example, x11
will be interpreted as a hexadecimal number equal to (decimal) 17. If
you must use x plus digits then use xx plus digits, for example,
xx11. The program will recognize this as text, which it must do to
properly count individuals.
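The layout described above can be checked before running the program with a small Python helper such as the following. It is a sketch only: the function names are hypothetical, genotype and label codes are assumed to be integers, and the name check simply encodes the rule stated in the note (pure numbers, or x, &, %, $ followed by digits).

```python
import re

# Names such as x11, &17, %101 or $12 risk being read as numbers in another base.
RISKY_NAME = re.compile(r"[x&%$]\d+")

def is_risky_marker_name(name):
    """True for pure numbers and for x, &, %, $ followed only by digits."""
    return name.isdigit() or RISKY_NAME.fullmatch(name) is not None

def _is_int(token):
    try:
        int(token)
        return True
    except ValueError:
        return False

def parse_sumstat_data(text):
    """Split a data file into genotype rows, the label row, and optional marker info."""
    rows, info = [], []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        marker = None
        # optional trailing fields: chromosome, bp position, marker name
        if len(tokens) > 3 and not _is_int(tokens[-1]):
            chrom, pos, name = tokens[-3:]
            marker = (chrom, int(pos), name)
            tokens = tokens[:-3]
        rows.append([int(t) for t in tokens])
        info.append(marker)
    if len(rows) < 2 or len({len(r) for r in rows}) != 1:
        raise ValueError("expected K + 1 rows with the same number of observations")
    # the last row holds the observation types (e.g., 2 = case, 1 = control)
    return rows[:-1], rows[-1], info[:-1]
```

For instance, `is_risky_marker_name("x11")` is true while `is_risky_marker_name("xx11")` is false, in line with the advice given in the note.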

Your data may be in *plink* format, in which case you may want to use the p2s program to convert *plink* files to *sumstat* format.

The *sumstat*
program reads parameter values from
a file, here called a parameter file. An example for such a parameter
file (**Sample.par**,
included in the program package) is as follows:

sumstat: Sample data, generated, 200 variables. Heritability 0.50, threshold 2.32

0 0 0 codes for genotypes, missing, #obs

14 1 code for test statistic, lambda (the latter only needed for codes 1 and 12)

20 0.5 max # terms in sum, initial # obs in contingency tables

10000 number of permutation samples

0 list number of test-SNP for interaction. Below: Input/output file names

Sample.dat

SampleResults.out

EXPLANATIONS

Line 1: Any text, truncated to 250 characters.

Line 2:

Code for genotypes given in the data:

0 = the genotypes are 1, 2, 3

1 = the genotypes are 0, 1, 2

2 = the genotypes are -1, 0, 1

Value for "missing"

Number of observations, N. Enter 0 if program should find N. It determines N from the first input line that should end with (1) chromosome number, (2) bp position, and (3) SNP ID. If (1) - (3) are missing then only numbers are present on this line. If (1) - (3) are present then the program recognizes (3) because it starts with a letter. If that is not the case and the SNP ID is a pure number then the program will take it to be a genotype code, which will generally lead to an error message. If such an error occurs at the end of the line, at item k, then the number of observations is k-3, which should be entered here. The simplest remedy is, of course, to have all SNP IDs start with a letter.

Line 3:

Code for single-locus test statistic to select SNPs (see Test Statistics, below)

Lambda value for genomic control (if code=1 or code=12)

Line 4:

Maximum number of terms in sum

Initial number of observations in contingency tables. For sparse tables, 0.5 has been recommended for stability (Berkson 1955).

Line 5:

Number of permutation samples

Line 6:

List number of test SNP, which will be paired with all other SNPs in turn (statistic code 16). If SNP number is given as a positive number then the maximum of the 3 chi-squares serves as test statistic, otherwise it's the sum of the 3 chi-squares.

Line 7:

Name of datafile. No other characters before or after the name.

Line 8:

Name of output file. No other characters before or after the name. If this line is blank the output file will be called *sumstat.out*.
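When many analyses are to be run, a parameter file with the eight lines explained above can be generated by a script. The following Python sketch assumes that each numeric line may carry a trailing comment, as in *Sample.par*; the function and argument names are hypothetical, and the default values simply mirror the sample file.

```python
def make_sumstat_par(title, datafile, outfile, *, geno_code=0, missing=0, n_obs=0,
                     stat_code=14, lam=1.0, max_terms=20, initial_count=0.5,
                     n_perm=10000, test_snp=0):
    """Return the text of an 8-line sumstat parameter file."""
    lines = [
        title[:250],  # line 1: free text, at most 250 characters
        f"{geno_code} {missing} {n_obs}   codes for genotypes, missing, #obs",
        f"{stat_code} {lam:g}   code for test statistic, lambda",
        f"{max_terms} {initial_count:g}   max # terms in sum, initial # obs in contingency tables",
        f"{n_perm}   number of permutation samples",
        f"{test_snp}   list number of test-SNP for interaction",
        datafile,     # line 7: name only, no other characters
        outfile,      # line 8: name only; may be left blank
    ]
    return "\n".join(lines) + "\n"
```

A typical use would be `open("My.par", "w").write(make_sumstat_par("my run", "My.dat", "My.out"))`, where the file names are placeholders.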

----------------------------------------

Codes on line 3:

1 = chi-square for 2 x 3 table, case-control versus 3 genotypes

2 = difference in mean codes, group 2 vs. group 1. SEE NOTE BELOW.

7 = proportion of homozygotes in cases minus controls

8 = difference in proportion of homozygotes, cases vs. controls. SEE NOTE BELOW.

10 = Armitage test for trend

11 = t-test. SEE NOTE BELOW.

12 = chi-square for 2 x 2 table of alleles, case-control versus 2 alleles

14 = chi-square for differences in allele frequencies and F values. SEE NOTE BELOW.

16 = chi-squares for 2 x 3 table of genotypes conditioned on genotypes of test SNP.

17 = interaction chi-square for given test SNP versus any target SNP

18 = Max2 test: maximum chi-square for dominant and recessive genotype action

21 = difference in proportions of risk alleles, cases minus controls

NOTE (statistic codes 2, 11, and 14)

When applicable, absolute differences between the two groups are computed (two-sided test). But when the statistic code is given as a negative number then the difference type 2 minus type 1 observations will be computed. For example, the test statistic with a code of 11 is |t|. Also, a test code of -14 will test whether Fcase > Fctrl.

NOTE

Chi-square is generally computed as a likelihood ratio (LR) statistic. If the statistic code is entered as a negative number then the Pearson chi-square will be computed.

An input file called *seed.txt*
holds a seed for the random number generator. This seed must be a
positive integer number. At program termination, this file will be
overwritten with the ending seed so that seeds are always updated
from one to the next program run. We implemented in our Pascal
programs the newest random number generator, *ran (int64)*,
highly recommended by Press et al (2007), which has a period of
approximately 10^{57}. We are grateful to Dr. Quan Long for
helping us understand the C-code of this random number generator. If
no *seed.txt* file is present when the program starts it will
create one based on the system time.
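The seed bookkeeping described above amounts to a read at start-up and a write at termination. A short Python sketch of that cycle follows; the function names are hypothetical, and the use of the nanosecond clock for a missing file is an assumption, not necessarily what the Pascal program does.

```python
import os
import time

def write_seed(path, seed):
    """Store the seed so that the next run continues from it."""
    with open(path, "w") as fh:
        fh.write(f"{seed}\n")

def read_seed(path):
    """Read a positive integer seed; create the file from the system time if absent."""
    if not os.path.exists(path):
        seed = max(1, time.time_ns())
        write_seed(path, seed)
        return seed
    with open(path) as fh:
        seed = int(fh.read().split()[0])
    if seed <= 0:
        raise ValueError("the seed must be a positive integer")
    return seed
```

At the end of a run one would call `write_seed` with the generator's ending seed, so that consecutive runs never reuse the same starting value.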

The output file will be named
*sumstat.out* or
whatever name you have chosen on the last line of the parameter file.
Most of the output is self-explanatory. The main table contains the
following items:

*i* = rank of given SNP

SNP# = consecutive number of given SNP in the datafile. For example, 7 refers to the 7th SNP.

Stat = statistic used, for example, chi-square with 2 df.

Sum = all the statistics summed up for SNPs listed in *i* = 1 through the current value of *i* for the given SNP.

p0Stat = nominal p-value, not adjusted for multiple testing

pStat = p-value for given statistic, adjusted for multiple testing

pSum = p-value for the given sum, adjusted for multiple testing, not yet adjusted for testing different sets of SNPs.

The final p-value at the bottom is generally different from any p-value listed in the table. It represents the significance level associated with the smallest pSum value, which is taken to be the single experiment-wise test statistic.

With the given parameter file,
the output file looks as follows:

Program Sumstat version 15 Dec 2012

sumstat: Sample data, generated, 200 variables. Heritability 0.50, threshold 2.32

Input file = Sample.dat

Current time: Fri 21 Dec 2012 17:35:04h

Locus-specific statistic for selection:

Chi-square for differences in allele freq and F values (statis=14)

Lambda used = 1.0000. Initial cell count = 0.50

Number of SNPs = 200

Number of permutation samples = 10000. Smallest possible p-value = 0.000100
| i | SNP# | Stat | Sum | p0Stat | pStat | pSum | ch | position | SNP |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 14.8792 | 14.8792 | 0.000100 | 0.029797 | 0.029797 | 8 | 200310 | rs200310 |
| 2 | 190 | 10.7103 | 25.5894 | 0.002300 | 0.268273 | 0.022298 | 8 | 202190 | rs202190 |
| 3 | 7 | 10.1387 | 35.7281 | 0.004000 | 0.352565 | 0.013999 | 8 | 200360 | rs200360 |
| 4 | 74 | 7.7747 | 43.5028 | 0.005599 | 0.825617 | 0.014999 | 8 | 201030 | rs201030 |
| 5 | 5 | 7.4946 | 50.9974 | 0.006999 | 0.870513 | 0.014699 | 8 | 200340 | rs200340 |
| 6 | 8 | 7.1620 | 58.1594 | 0.018498 | 0.914109 | 0.013099 | 8 | 200370 | rs200370 |
| 7 | 96 | 6.4181 | 64.5775 | 0.017198 | 0.975502 | 0.013099 | 8 | 201250 | rs201250 |
| 8 | 73 | 6.3256 | 70.9031 | 0.034697 | 0.979302 | 0.012599 | 8 | 201020 | rs201020 |
| 9 | 15 | 5.2778 | 76.1809 | 0.021298 | 0.998600 | 0.014699 | 8 | 200440 | rs200440 |
| 10 | 115 | 5.2502 | 81.4312 | 0.036796 | 0.998700 | 0.016898 | 8 | 201440 | rs201440 |
| 11 | 71 | 5.0635 | 86.4946 | 0.065193 | 0.999600 | 0.017498 | 8 | 201000 | rs201000 |
| 12 | 145 | 5.0267 | 91.5213 | 0.046695 | 0.999800 | 0.018598 | 8 | 201740 | rs201740 |
| 13 | 75 | 4.6681 | 96.1894 | 0.074093 | 1.000000 | 0.019698 | 8 | 201040 | rs201040 |
| 14 | 180 | 4.6275 | 100.8169 | 0.048995 | 1.000000 | 0.020998 | 8 | 202090 | rs202090 |
| 15 | 80 | 4.5901 | 105.4070 | 0.062994 | 1.000000 | 0.022098 | 8 | 201090 | rs201090 |
| 16 | 164 | 4.4619 | 109.8689 | 0.089791 | 1.000000 | 0.022498 | 8 | 201930 | rs201930 |
| 17 | 27 | 4.4320 | 114.3009 | 0.115188 | 1.000000 | 0.022598 | 8 | 200560 | rs200560 |
| 18 | 195 | 4.3773 | 118.6782 | 0.094891 | 1.000000 | 0.022598 | 8 | 202240 | rs202240 |
| 19 | 118 | 4.2614 | 122.9396 | 0.126287 | 1.000000 | 0.022598 | 8 | 201470 | rs201470 |
| 20 | 122 | 4.1496 | 127.0892 | 0.047895 | 1.000000 | 0.022498 | 8 | 201510 | rs201510 |

| | Gr 1 | Gr 2 |
|---|---|---|
| Total #observations | 250 | 250 |

p0Stat = sig. level, uncorrected for multiple testing

pStat = sig. level of given statistic, corrected (equivalent to Bonferroni correction for independent tests)

(p0Stat and pStat include observed data as a null dataset)

pSum = sig. level of sum statistic, corrected

Starting seed = 17985685530142462792

Final p-value = 0.032097

Current time: Fri 21 Dec 2012 17:35:43h

The
smallest SNP-specific p-value (corrected for multiple testing) is
0.0298 for SNP #2 (rs200310). The smallest p-value for any of the 20
sums is 0.0126, which is associated with an experiment-wise p-value
of 0.0321. In this case, constructing sums has not gained anything as
0.0321 > 0.0298.

The file, *statout.txt*,
will contain values of permutation samples. It is required for
calculation of the overall p-value and will be deleted at program
termination, so the user will generally not see it.

You cannot simply click on
program names. Instead you need to open a command (DOS) box, for
example, by clicking on *Start*
and then on *Command Prompt*
(in Windows XP). In earlier Windows versions, click on *Start*,
then on *Run...* and
type **cmd**. It is a good idea to change one of the standard
features in Windows: Make sure you see extensions of known file types
on your screen; for example, you should see *sumstat.exe*,
not just *sumstat*.

The
program is run by typing **sumstat** followed on the same line by
the name of the parameter file. For example, you type **sumstat
Sample.par**.

**Code 1**. Consider a 2 x 3
contingency table, where rows correspond to controls and cases, and
columns represent the three genotypes of a given SNP, while the cells
contain numbers of individuals. The *sumstat* program will
construct such a table for each SNP and compute chi-square (2 df).
With small numbers of observations, the initial cell count may be set
to 0.5 rather than 0 for stability (Berkson 1955).
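To make the role of the initial cell count concrete, here is a small Python sketch of the 2 x 3 chi-square in both the likelihood ratio and the Pearson form. The function names are hypothetical, and the sketch assumes that the initial count is simply added to every cell before the statistic is computed; the p-value uses the exact tail of the chi-square distribution with 2 df.

```python
import math

def _table(controls, cases, initial):
    """Build the 2 x 3 table with the initial count added to every cell."""
    return [[c + initial for c in controls], [c + initial for c in cases]]

def lr_chi_square_2x3(controls, cases, initial=0.0):
    """Likelihood ratio chi-square, 2 * sum(O * ln(O / E)), for genotype counts."""
    table = _table(controls, cases, initial)
    n = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(3)]
    g = 0.0
    for i in range(2):
        for j in range(3):
            o = table[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                g += o * math.log(o * n / (row[i] * col[j]))
    return 2.0 * g

def pearson_chi_square_2x3(controls, cases, initial=0.0):
    """Pearson chi-square, sum((O - E)^2 / E), for genotype counts."""
    table = _table(controls, cases, initial)
    n = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(3)]
    stat = 0.0
    for i in range(2):
        for j in range(3):
            e = row[i] * col[j] / n
            if e > 0:
                stat += (table[i][j] - e) ** 2 / e
    return stat

def p_value_2df(chi_square):
    """Upper tail probability of chi-square with 2 df, which equals exp(-x/2)."""
    return math.exp(-chi_square / 2.0)
```

With an initial count of 0.5 no cell is empty, so the logarithm in the likelihood ratio form is always defined, which is the practical reason for the stability remark above.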

**Code 2**. Test statistic is
the difference in mean genotype codes between cases and controls
(1-sided or 2-sided).

**Code 7**. Proportion of
homozygotes in cases versus controls (1-sided or 2-sided).

**Code 8**. Chi-square for
proportion of homozygotes in cases versus controls (1-sided or
2-sided).

**Code 10**. Armitage’s
test for trend (Agresti 2002).
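For readers unfamiliar with the trend test, the following Python sketch computes the usual Cochran-Armitage chi-square (1 df) from genotype counts. The function name and the default scores 0, 1, 2 are assumptions made for illustration; the scores used inside *sumstat* are not stated in this document.

```python
import math

def armitage_trend(controls, cases, scores=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 df) and its p-value."""
    totals = [a + b for a, b in zip(controls, cases)]
    n = sum(totals)   # all individuals
    r = sum(cases)    # all cases
    sx = sum(x * t for x, t in zip(scores, totals))
    sxx = sum(x * x * t for x, t in zip(scores, totals))
    srx = sum(x * c for x, c in zip(scores, cases))
    denom = r * (n - r) * (n * sxx - sx * sx)
    if denom == 0:    # one group empty or only one genotype present
        return 0.0, 1.0
    stat = n * (n * srx - r * sx) ** 2 / denom
    # upper tail of chi-square with 1 df
    return stat, math.erfc(math.sqrt(stat / 2.0))
```

The statistic equals the number of individuals times the squared correlation between case status and genotype score, which is why it is sensitive to a linear increase in risk across the three genotypes.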

**Code 11**. t-test carried
out on genotype codes as “quantitative measurements”.
This statistic tests whether mean genotype codes are different in
cases and controls. It may be more appropriate for small minor allele
frequencies than statistic #2 as the variance is then small so that
mean differences are “enlarged”.

**Code 12**. Chi-square in 2 x
2 table, rows = cases and controls, columns = two SNP alleles

**Code 14**. Chi-square for
differences, cases versus controls, of allele frequencies and F value
(alpha, inbreeding coefficient) (Zhang et al 2008)

**Code 16**. For a given SNP
(called the test SNP), data are divided into 3 groups depending on
the genotype at this SNP. The sequence number (order in which the SNP
is listed in the datafile) of the test SNP must be indicated on line
6 of the parameter file. Then an association analysis (code 1) is carried out for any other
SNP (called target SNP) in each of the 3 groups of individuals. The
resulting 3 chi-squares are independent and are either (a) summed up
for a resulting chi-square with 6 df, or (b) the maximum of the 3
chi-squares is retained and its associated p-value corrected for 3
tests. If the test SNP number is provided as a positive number then
(b) is carried out, if it is negative then (a) is carried out. For
example, if line 6 contains the number 311 that means that the 311-th
SNP should be the test SNP. This analysis tests for the main effect
of the target SNP *and* its interaction effect with the test SNP
(Wang et al 2010).

**Code 17**. Analogous to code
16 (the test SNP number is always positive) but the test statistic only
reflects the interaction between the test and target SNPs. This
interaction term is obtained by partitioning chi-square into main and
interaction effects and retaining only the interaction term (Yang et
al 2009).

**Code 18**. Two 2 x 2 tables
of numbers of individuals are constructed, one for recessive
inheritance (genotypes AA+AB versus BB) and one for dominant
inheritance (genotypes AA versus AB + BB), with rows corresponding to
cases/controls. The larger of the resulting two chi-squares is the
test statistic; its p-value will be corrected for multiple testing,
that is, the reported p-value is p(2 – p) (Bonferroni, 2
tests).
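The Max2 construction can be illustrated with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, genotype counts are given in the order AA, AB, BB, and a Pearson chi-square is used for the 2 x 2 tables, whereas the program generally computes a likelihood ratio statistic. Note that p(2 – p) equals 1 – (1 – p)^2.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty row or column carries no information
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def max2_test(controls, cases):
    """Return (max chi-square, p-value corrected for 2 tests) from AA, AB, BB counts."""
    # recessive table: AA+AB versus BB
    rec = chi2_2x2(controls[0] + controls[1], controls[2],
                   cases[0] + cases[1], cases[2])
    # dominant table: AA versus AB+BB
    dom = chi2_2x2(controls[0], controls[1] + controls[2],
                   cases[0], cases[1] + cases[2])
    stat = max(rec, dom)
    p = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-square, 1 df
    return stat, p * (2.0 - p)
```

For the counts `[30, 20, 10]` in controls and `[10, 20, 30]` in cases, both tables give a chi-square of 15, so the maximum is 15 as well.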

**Code 21**. For each SNP, the
risk allele is defined as the allele leading to an odds ratio > 1.
The sum of risk alleles in cases is compared with those in controls
(Ott and Sun 2012).

The following data set
(*Sample.dat*) was generated on the computer: 500 SNP markers,
250 cases, 250 control individuals, 10 susceptibility loci at SNPs 1
through 10 obtained via Hartl and Clark's liability threshold model,
heritability for all 10 disease loci combined = 0.50, population
trait prevalence = 0.01. An input parameter file, *Sample.par*,
is provided. With a particular run of 5000 permutation replicates,
the output collected in the file *SampleResults.out*
has been obtained.

This program represents a
modification of the *sumstat* program in that it works with
quantitative phenotypes (QTLs) rather than case-control type data.
The datafile structure is the same as for *sumstat* but after
the input lines containing genotype data, one or more QTLs can be
present, one per input line for each individual.

Here is an example of a parameter
file (also contained in the program package
as *SampleQ.par*):

sumstatQ: Sample data, generated.

0 0 0 codes for genotypes, missing genotype, #obs

19 2 -9.9 code for test statistic, number of QTLs, missing phenotype/QTL

10 max # terms in sum

10000 number of permutation samples

0 reserved (not used). Below: Input/output file names

SampleQ.dat

SampleQResults.out

The various lines of
input specify the following parameter values (this text is also
contained in the *inputSampleQ.txt*
file):

Line 1: Any text, truncated to 250 characters.

Line 2:

Code for genotypes given in the data: 0 = the genotypes are 1, 2, 3; 1 = the genotypes are 0, 1, 2; 2 = the genotypes are -1, 0, 1

Value for "missing" genotype

Number of observations, N. Enter 0 if program should find N. It determines N from the first input line that should end with (1) chromosome number, (2) bp position, and (3) SNP ID. If (1) - (3) are missing then only numbers are present on this line. If (1) - (3) are present then the program recognizes (3) because it starts with a letter. If that is not the case and the SNP ID is a pure number then the program will take it to be a genotype code, which will generally lead to an error message. If such an error occurs at the end of the line, at item k, then the number of observations is k-3, which should be entered here.

Line 3:

Code for single-locus test statistic to select SNPs (for details, see below)

Number of QTL (phenotype) lines following genotype lines in datafile

Value of missing phenotype.

Line 4:

Maximum number of terms in the sum

Line 5:

Number of permutation samples

Line 6:

(not currently used)

Line 7:

Name of datafile. No other characters before or after the name.

Line 8:

Name of output file. No other characters before or after the name. If this line is blank the output file will be called *sumstatQ.out*.

----------------------------------------

Codes on line 3:

19 = 1-way ANOVA for mean differences of QTL between three SNP genotypes

20 = Max3 test: dom / rec / linear increase in means

The test statistic for each SNP
is a standard one-way analysis of variance resulting in an *F*-value.
Multivariate analysis for multiple phenotypes (QTLs) is approximated
by focusing, at each SNP, on the largest *F*-value (Manly 2006),
where *F*_{max} may occur at different QTLs for
different SNPs.
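The per-SNP statistic for code 19 can be sketched in Python as follows. The function names are hypothetical, and the missing-value codes shown as defaults (0 for genotypes, -9.9 for phenotypes) are taken from the sample parameter file; how the program treats degenerate cases, such as a SNP with a single genotype, is an assumption here.

```python
def anova_f(genotypes, phenotype, missing_geno=0, missing_pheno=-9.9):
    """One-way ANOVA F-value of a quantitative phenotype across SNP genotypes."""
    groups = {}
    for g, y in zip(genotypes, phenotype):
        if g == missing_geno or y == missing_pheno:
            continue  # skip individuals with a missing value
        groups.setdefault(g, []).append(y)
    k = len(groups)
    n = sum(len(v) for v in groups.values())
    if k < 2 or n <= k:
        return 0.0  # no between-group comparison possible
    grand = sum(sum(v) for v in groups.values()) / n
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    ss_between = sum(len(v) * (means[g] - grand) ** 2 for g, v in groups.items())
    ss_within = sum((y - means[g]) ** 2 for g, v in groups.items() for y in v)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def max_f(genotypes, phenotypes):
    """Largest F-value over several QTL phenotypes for one SNP."""
    return max(anova_f(genotypes, q) for q in phenotypes)
```

Here `phenotypes` is a list with one list of values per QTL, so `max_f` returns the quantity called *F*_{max} in the text for a single SNP.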

A sample datafile, *SampleQ.dat*,
was created as follows. Genotype codes are as in the datafile for the
sumstat program. Case-control labels were replaced by values of two
QTL phenotypes, which are normally distributed with given means *m*
and variances *s*. The first QTL has a mean equal to the
genotype code at SNP # 5 and a variance of *s* = 3.5; the second
QTL has a mean equal to the genotype code at SNP # 10 and a variance of
*s* = 3.7.

The output file generated with
the given parameter file shows the following results:

Program SumstatQ for QTLs version 15 Dec 2012

sumstatQ: Sample data, generated.

Input file = SampleQ.dat

Current time: Fri 21 Dec 2012 12:46:08h

Locus-specific statistic for selection:

1-way ANOVA for mean differences of QTL between three SNP genotypes (statis=19)

Number of missing QTL observations = 0

Number of QTL phenotypes = 2

Number of SNPs = 200

Number of individuals = 500

Number of permutation samples = 10000+1

| i | SNP# | Stat | Sum | p0Stat | pStat | pSum | ch | position | SNP |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 10.1240 | 10.1240 | 0.000100 | 0.020398 | 0.020398 | 8 | 200340 | rs200340 |
| 2 | 10 | 9.1865 | 19.3104 | 0.000200 | 0.050995 | 0.002700 | 8 | 200390 | rs200390 |
| 3 | 20 | 7.0128 | 26.3232 | 0.001900 | 0.333967 | 0.002000 | 8 | 200490 | rs200490 |
| 4 | 177 | 5.9307 | 32.2540 | 0.006599 | 0.683732 | 0.002100 | 8 | 202060 | rs202060 |
| 5 | 83 | 4.2145 | 36.4684 | 0.029897 | 0.997600 | 0.005000 | 8 | 201120 | rs201120 |
| 6 | 110 | 4.2040 | 40.6725 | 0.035296 | 0.997800 | 0.008899 | 8 | 201390 | rs201390 |
| 7 | 154 | 4.0568 | 44.7292 | 0.038996 | 0.999300 | 0.012499 | 8 | 201830 | rs201830 |
| 8 | 190 | 3.9875 | 48.7168 | 0.035996 | 0.999600 | 0.016098 | 8 | 202190 | rs202190 |
| 9 | 100 | 3.7143 | 52.4311 | 0.050095 | 1.000000 | 0.023898 | 8 | 201290 | rs201290 |
| 10 | 30 | 3.6626 | 56.0937 | 0.050695 | 1.000000 | 0.029497 | 8 | 200590 | rs200590 |

p0Stat = sig. level, uncorrected for multiple testing

pStat = sig. level of given statistic, corrected (equivalent to Bonferroni correction for independent tests)

(p0Stat and pStat include observed data as a null dataset)

pSum = sig. level of sum statistic, corrected

Starting seed = 6976125602393045611

Final p-value = 0.005199

Current time: Fri 21 Dec 2012 12:47:01h

The
best single-locus result is for SNP # 5 (rs200340) with a genome-wide
significance level of 0.0204. The smallest p-value among all sums is
equal to 0.0020 with an associated genome-wide significance level of
0.0052. Thus, building sums of the F statistics has been beneficial
as 0.0052 < 0.0204.

Agresti A (2002) Categorical data analysis, 2nd edition. New York, Wiley-Interscience

Berkson J (1955) Maximum likelihood and minimum chi-square estimates of the logistic function. J Amer Statist Assoc 50:130-162

Chung JH, Fraser DAS (1958) Randomization tests for a multivariate two sample problem. J Amer Statist Assoc 53:729-735

de Quervain DJ, Poirier R, Wollmer MA, Grimaldi LM, Tsolaki M, Streffer JR, Hock C, Nitsch RM, Mohajeri MH, Papassotiropoulos A (2004) Glucocorticoid-related genetic susceptibility for Alzheimer's disease. Hum Mol Genet 13:47-52 [an example of the use of sum statistics]

Edgington ES, Onghena P (2007) Randomization Tests, 4th edition. Chapman & Hall/CRC, Boca Raton

Hoh J, Ott J (2003) Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet 4:701-709.

Hoh J, Wille A, Ott J (2001) Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res 11:2115-2119. [main reference for the method described here]

Hoh J, Wille A, Zee R, Cheng S, Reynolds R, Lindpaintner K, Ott J (2000) Selecting SNPs in two-stage analysis of disease association data: a model-free approach. Ann Hum Genet 64:413-417. [a nested bootstrap approach to SNP selection]

Kim S, Zhang K, Sun F (2003) Detecting susceptibility genes in case-control studies using set association. BMC Genet 4 Suppl 1:S9 [documents increased power of sum statistics]

Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, Sangiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J, Barnstable C, Hoh J (2005) Complement Factor H Polymorphism in Age-Related Macular Degeneration. Science 308:385-389

Manly BFJ (2006) Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall/CRC, New York

Nijenhuis A, Wilf HS (1978) Combinatorial algorithms for computers and calculators. Academic Press, New York

Ott J, Hoh J (2003) Set association analysis of SNP case-control and microarray data. J Comput Biol 10:569-574

Ott J, Sun D (2012) Multilocus association analysis under polygenic models. Int J Data Mining and Bioinformatics 6:482-489

Papassotiropoulos A, Wollmer MA, Tsolaki M, Brunner F, Molyva D, Lutjohann D, Nitsch RM, Hock C (2005) A cluster of cholesterol-related genes confers susceptibility for Alzheimer's disease. J Clin Psychiatry 66:940-947 [an example of the use of sum statistics]

Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007) Numerical recipes 3rd edition: The art of scientific computing. Cambridge University Press, Cambridge, UK; New York

Wang G, Yang Y, Ott J (2010) Genome-wide conditional search for epistatic disease-predisposing variants in human association studies. Hum Hered 70:34-41

Weir BS (1996) Genetic data analysis II : methods for discrete population genetic data. Sinauer Associates, Sunderland, Mass.

Wille A, Hoh J, Ott J (2003) Sum statistics for the joint detection of multiple disease loci in case-control association studies with SNP markers. Genet Epidemiol 25:350-359.

Yang Y, He C, Ott J (2009) Testing association with interactions by partitioning chi-squares. Ann Hum Genet 73:109-117

Zee RY, Hoh J, Cheng S, Reynolds R, Grow MA, Silbergleit A, Walker K, Steiner L, Zangenberg G, Fernandez-Ortiz A, Macaya C, Pintor E, Fernandez-Cruz A, Ott J, Lindpaintner K (2002) Multi-locus interactions predict risk for post-PTCA restenosis: an approach to the genetic analysis of common complex disease. Pharmacogenomics J 2:197-201

Zhang Q, Wang S, Ott J (2008) Combining identity by descent and association in genetic case-control studies. BMC Genet 9:42