Josephine Hoh

Jurg Ott

Yale University, New Haven

Institute of Psychology CAS, Beijing, and Rockefeller University New York

josephine.hoh@yale.edu

ott@rockefeller.edu

20 Jan 2013

http://lab.rockefeller.edu/ott/

S statistic in gene mapping

(Sumstat and sumstatQ computer programs)

Methodology

The general approach is described in Hoh et al. (2001). Briefly, consider a number of marker loci in the genome. At each marker, genotypes are available for two types of observations, for example, case and control individuals (for quantitative trait phenotypes, see the sumstatQ program below).

The general idea is to find a set of SNPs (variables, generally) that are jointly associated with disease. We do this by first computing an association statistic for each SNP, for example, χ² for a 2 × 3 table, where the two rows correspond to cases and controls and the three columns to SNP genotypes. Markers are then ordered by the size of their test statistics, and sums, S, are formed sequentially, starting with the largest test statistic and adding one SNP after another up to a maximum number of terms in S. For example, S3 is the sum of the three largest test statistics for SNPs wherever they are in the genome. For each Si, an associated significance level is computed with permutation testing. The smallest such significance level represents the experiment-wise test statistic, for which an associated significance level is in turn computed from the permutation samples.
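The construction of the sums can be illustrated with a few lines of Python. This is only a sketch of the ordering and summing step, not the program's Pascal code; the function name sum_statistics and the example values are made up for illustration.

```python
def sum_statistics(stats, max_terms):
    """Order SNPs by decreasing test statistic and form the cumulative sums.

    stats     : one single-SNP test statistic per SNP (hypothetical values)
    max_terms : maximum number of terms in a sum
    Returns (order, sums), where order[i] is the 0-based index of the SNP
    with the (i+1)-th largest statistic and sums[i] is S_(i+1).
    """
    order = sorted(range(len(stats)), key=lambda j: stats[j], reverse=True)
    order = order[:max_terms]
    sums = []
    running = 0.0
    for j in order:
        running += stats[j]      # add the next largest statistic
        sums.append(running)
    return order, sums


# Five SNPs, sums with up to three terms.
order, sums = sum_statistics([2.1, 14.9, 0.3, 10.7, 10.1], max_terms=3)
print(order)   # indices of the three largest statistics
print(sums)    # S1, S2, S3
```

Here S3 is the sum of the statistics at indices 1, 3, and 4, regardless of where those SNPs lie in the genome.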

The problem handled here is discussed in Manly (2007) under the heading of Comparison of Sample Mean Vectors. The use of a sum of univariate test statistics (which is what we compute here) as a multivariate test statistic for a randomization test seems to have been first proposed by Chung and Fraser (1958).

In the current implementation, marker loci must be SNPs (two alleles each). The newest version of sumstat represents a major upgrade and contains source and executable code for Windows and Linux. The program comes in two flavors, (1) regular sumstat that keeps data in memory (heap), and (2) sumstatS that repeatedly reads data from disk; (2) is slower than (1) but can run when (1) might run out of memory. For Linux users, the following section provides some practical hints.

Notes for Linux users

Instructions given here are specific for Ubuntu Linux but should work in most Linux environments. Download the sumstat package and extract all files. Delete sumstat.exe and sumstatS.exe. The compilable files are *.pas while the *.p files are include files that will be read by the main program upon compilation. The sumstat.linux file is executable code for Linux with some settings but you may want to create executable code with settings for your specific needs. Relevant settings are as follows and may be found in the CONST section of the sumstat.pas file:

maxobs = 5000;    {maximum number of observations}
maxvar = 800000; {max. number of SNPs excl. responses}
maxsum = 500; {max. number of terms in a sum}
howoften = 50; {after how many random samples to write to screen}
maxnamelength=20; {max length of SNP name}

The program will allocate memory for as many individuals (cases and controls) as are present in the data, but for each individual it will allocate memory for as many variables (SNPs) as given by the value of maxvar. Thus, to keep memory requirements low you may want to change the value of maxvar to something just slightly higher than the actual number of SNPs in your data. Do this by opening the sumstat.pas file in your text editor and changing numbers as desired. Then compile the program in a terminal window (ctrl-alt-T) by typing fpc sumstat.pas, which should take only a few seconds. Disregard warnings about code not being accessible or about “link.res containing output sections”. You may check the compiled version by typing ./sumstat. If everything looks satisfactory, make the program accessible (put it in the path) by typing sudo mv sumstat /usr/local/bin, which will move the sumstat program file to a folder in the path. Verify your actions by typing sumstat, which should invoke the program, but it will not carry out calculations as you have not yet furnished a parameter file name (see elsewhere in this document).

Permutation tests

For each of our sum statistics, the null distribution is mathematically intractable, which is why we estimate the associated significance level empirically with permutation (randomization) tests (Manly 2007). Under the hypothesis of no association, all permutations of the labels for the two outcome types (case, control) are equally likely. Because the number of possible permutations is potentially very large, we take a random sample of permutations and approximate empirical significance levels. Based on the given array of case and control labels in the observed data, random permutations of labels are generated sequentially according to an algorithm by Nijenhuis and Wilf (1978). Specifically, let Sobs be the sum statistic obtained in the observed data for a given number of SNPs. In each permutation sample (same genotypes as observed but labels “affected” and “unaffected” permuted), we compute a sum statistic, Sperm, in the same way as in the observed data. The proportion of permutation samples for which Sperm ≥ Sobs is then an estimate of the empirical significance level associated with Sobs.
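The first level of permutations might be sketched in Python as follows. Several simplifying assumptions are made here: labels are shuffled with Python's random.shuffle rather than with the Nijenhuis and Wilf algorithm used by sumstat, the single-SNP statistic is the absolute difference in mean genotype codes, and the observed data are counted as one of the null datasets. All function names are hypothetical.

```python
import random


def abs_mean_diff(genotypes, labels):
    """|mean genotype code in cases - mean in controls| (2 = case, 1 = control)."""
    cases = [g for g, y in zip(genotypes, labels) if y == 2]
    ctrls = [g for g, y in zip(genotypes, labels) if y == 1]
    return abs(sum(cases) / len(cases) - sum(ctrls) / len(ctrls))


def sum_stat(data, labels, n_terms):
    """Sum of the n_terms largest single-SNP statistics; data has one row per SNP."""
    stats = sorted((abs_mean_diff(row, labels) for row in data), reverse=True)
    return sum(stats[:n_terms])


def permutation_pvalue(data, labels, n_terms, n_perm, seed=1):
    """Estimate the significance level of the observed sum statistic."""
    rng = random.Random(seed)
    s_obs = sum_stat(data, labels, n_terms)
    hits = 1                      # the observed data count as one null dataset
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)         # genotypes stay fixed, labels are permuted
        if sum_stat(data, perm, n_terms) >= s_obs:
            hits += 1
    return hits / (n_perm + 1)


# Toy data: 2 SNPs, 6 cases followed by 6 controls (made-up genotypes 0/1/2).
labels = [2] * 6 + [1] * 6
data = [[2] * 6 + [0] * 6,
        [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]]
print(permutation_pvalue(data, labels, n_terms=1, n_perm=199))
```

Because shuffling keeps the numbers of cases and controls fixed, both groups are always non-empty in this toy example; missing genotypes are not handled.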

A second level of permutations is performed (in the statpval procedure) as follows. Assume that we evaluate N sum statistics, that is, the N-th sum statistic contains the test statistics summed over N SNPs (N = 15 is generally a suitable number). For each sum statistic with a given number of terms, the associated empirical significance level is determined. We then take the smallest of these as our single overall statistic of interest and determine its associated (experiment-wise) significance level. This is done on the basis of the permutation samples obtained in level (1) (Manly 2007, section 6.8): Each of these samples in turn is taken as an “observed” (test) dataset and the remaining permutation samples are used to evaluate the significance level of the smallest p-value in the “observed” permutation sample.

Some authors argue that under the null hypothesis the observed data should be included in the randomized data, and we do this here. Thus, the smallest significance level achievable is equal to 1/n, where n is the number of permutation samples (including the observed data).
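The second level can be sketched as below. This is an illustration only: it assumes that the sum statistics of all datasets are held in a matrix whose first row is the observed data, and that each dataset's significance levels are evaluated against all n datasets including itself. The statpval procedure in sumstat may organize the computation differently (it works from a file of permutation values), and the quadratic loop here is written for clarity, not speed.

```python
def experimentwise_pvalue(S):
    """Experiment-wise significance level of the smallest per-sum p-value.

    S[r][k] is the sum statistic with k+1 terms in dataset r;
    row 0 holds the observed data, the other rows are permutation samples.
    Returns (smallest per-sum p-value in the observed data, final p-value).
    """
    n = len(S)
    n_sums = len(S[0])

    def min_p(r):
        # Treat dataset r as "observed": p-value of each of its sums is the
        # proportion of datasets with a sum at least as large; keep the minimum.
        return min(
            sum(1 for q in range(n) if S[q][k] >= S[r][k]) / n
            for k in range(n_sums)
        )

    min_ps = [min_p(r) for r in range(n)]
    # Proportion of datasets whose smallest p-value is as small as the observed one.
    final = sum(1 for m in min_ps if m <= min_ps[0]) / n
    return min_ps[0], final


# Observed data (row 0) plus three permutation samples, two sums each.
S = [[10.0, 12.0],
     [1.0, 5.0],
     [2.0, 3.0],
     [3.0, 4.0]]
print(experimentwise_pvalue(S))
```

With n = 4 datasets the smallest achievable value is 1/4, in line with the 1/n bound stated above.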

With missing observations, it may happen for one or more variables that a permutation sample contains observations on only one class (either cases or controls). Such ill-conditioned permutation samples will be skipped. The approach described above has been implemented in the computer program, sumstat, which is available for downloading including a sample dataset.

Input and output for Sumstat programs

The sumstat program works with the following three text files.

1. Data file

This input file (named in the parameter file below) should consist of a matrix with K + 1 rows and N columns, where N = number of observations (e.g., case and control individuals) and K = number of input variables (e.g., SNPs). Each cell in the matrix contains a genotype code. The last row contains the code for the type of observation (e.g., 2 = case and 1 = control). Optionally, the N input observations in each row (for a given marker) may be followed by three quantities: (1) chromosome number, (2) position in base pairs of the marker on the chromosome, and (3) marker name. These items will allow for an easy ordering of markers by their chromosomal positions, for example, with the sort command in Windows (not really relevant for sumstat, but for scanstat). On the last input line, you may use the following dummy values: (1) 99, (2) 99, (3) xx.
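To make the layout concrete, here is a small Python reader for lines in this format. It is a sketch under assumptions, not part of the sumstat package: it takes whitespace-separated fields, decides from the last field of the first line whether the three optional items are present, and does no error checking. The function names and the example lines are hypothetical.

```python
def _is_int(token):
    """True if the token can be read as an integer."""
    try:
        int(token)
        return True
    except ValueError:
        return False


def parse_sumstat_lines(lines):
    """Split data-file lines into genotypes (K rows), labels, and marker names."""
    rows = [ln.split() for ln in lines if ln.strip()]
    first = rows[0]
    # If the first line ends with a non-numeric marker name, the last three
    # fields are chromosome, position, and name; otherwise all fields are data.
    annotated = not _is_int(first[-1])
    n_obs = len(first) - 3 if annotated else len(first)
    genotypes = [[int(v) for v in r[:n_obs]] for r in rows[:-1]]
    labels = [int(v) for v in rows[-1][:n_obs]]      # last row: 2 = case, 1 = control
    names = [r[n_obs + 2] for r in rows[:-1]] if annotated else None
    return genotypes, labels, names


# Two SNPs and four individuals, followed by the row of case/control codes.
example = [
    "1 2 3 1  8 200300 rs200300",
    "2 2 1 3  8 200310 rs200310",
    "2 2 1 1 99     99 xx",
]
genotypes, labels, names = parse_sumstat_lines(example)
print(genotypes, labels, names)
```

Note that Python's int() does not treat x11 as a number, so this reader does not reproduce the name restrictions that apply to the Pascal program.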

NOTE: Do not use marker names starting with x, &, %, $, followed by digits or else the program will interpret these as numbers to a base other than 10. For example, x11 will be interpreted as a hexadecimal number equal to (decimal) 17. If you must use x plus digits then use xx plus digits, for example, xx11. The program will recognize this as text, which it must do to properly count individuals.
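Marker names can be screened before running the program. The following Python check is a sketch of the rule stated in this note, combined with the recommendation that SNP IDs start with a letter. Treating an upper-case X like a lower-case x is a cautious assumption, not something the note specifies, and the function name is hypothetical.

```python
import re

# A prefix character x, &, % or $ followed directly by a digit may be read
# as a number in another base (assumption: upper-case X is treated the same).
_AMBIGUOUS = re.compile(r"^[xX&%$][0-9]")


def is_safe_marker_name(name):
    """True if the name starts with a letter and cannot be taken for a number."""
    return bool(name) and name[0].isalpha() and not _AMBIGUOUS.match(name)


for candidate in ["rs200310", "x11", "xx11", "$12", "12345"]:
    print(candidate, is_safe_marker_name(candidate))
```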

Your data may be in plink format, in which case you may want to use the p2s program to convert plink files to sumstat format.

2. Parameter file

The sumstat program reads parameter values from a file, here called a parameter file. An example for such a parameter file (Sample.par, included in the program package) is as follows:

sumstat: Sample data, generated, 200 variables. Heritability 0.50, threshold 2.32
0 0 0    codes for genotypes, missing, #obs
14 1     code for test statistic, lambda (the latter only needed for codes 1 and 12)
20 0.5   max # terms in sum, initial # obs in contingency tables
10000    number of permutation samples
0        list number of test-SNP for interaction. Below: Input/output file names
Sample.dat
SampleResults.out


EXPLANATIONS

Line 1: Any text, truncated to 250 characters.

Line 2:
  Code for genotypes given in the data:
   0 = the genotypes are 1, 2, 3
   1 = the genotypes are 0, 1, 2
   2 = the genotypes are -1, 0, 1
  Value for "missing"
  Number of observations, N. Enter 0 if program should find N. It determines N from the first input line that should end with (1) chromosome number, (2) bp position, and (3) SNP ID. If (1) - (3) are missing then only numbers are present on this line. If (1) - (3) are present then the program recognizes (3) because it starts with a letter. If that is not the case and the SNP ID is a pure number then the program will take it to be a genotype code, which will generally lead to an error message. If such an error occurs at the end of the line, at item k, then the number of observations is k-3, which should be entered here. The simplest remedy is, of course, to have all SNP ID's start with a letter.

Line 3:
  Code for single-locus test statistic to select SNPs (see Test Statistics, below)
  Lambda value for genomic control (if code=1 or code=12)

Line 4:
  Maximum number of terms in sum
  Initial number of observations in contingency tables. For sparse tables, 0.5 has been recommended for stability (Berkson 1955).

Line 5:
  Number of permutation samples

Line 6:
  List number of test SNP, which will be paired with all other SNPs in turn (statistic code 16). If the SNP number is given as a positive number then the maximum of the 3 chi-squares serves as test statistic, otherwise it is the sum of the 3 chi-squares.

Line 7:
  Name of datafile. No other characters before or after the name.

Line 8:
  Name of output file. No other characters before or after the name. If this
  line is blank the output file will be called sumstat.out.
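The eight-line layout can be checked with a short Python reader such as the one below. It is a sketch only: it assumes that the numeric values are the leading whitespace-separated fields on lines 2 to 6 (everything after them being a comment), that a lambda value is present on line 3, and that the missing-value code is an integer. How the Pascal program itself reads the file is not shown in this document. The function name and dictionary keys are hypothetical.

```python
SAMPLE_PAR = """sumstat: Sample data, generated, 200 variables. Heritability 0.50, threshold 2.32
0 0 0    codes for genotypes, missing, #obs
14 1     code for test statistic, lambda (the latter only needed for codes 1 and 12)
20 0.5   max # terms in sum, initial # obs in contingency tables
10000    number of permutation samples
0        list number of test-SNP for interaction. Below: Input/output file names
Sample.dat
SampleResults.out
"""


def parse_sumstat_parameters(text):
    """Read the eight parameter-file lines into a dictionary (hypothetical keys)."""
    lines = text.splitlines()
    while len(lines) < 8:            # a missing line 8 means the default output name
        lines.append("")
    l2, l3, l4 = lines[1].split(), lines[2].split(), lines[3].split()
    return {
        "title": lines[0][:250],
        "genotype_coding": int(l2[0]),
        "missing_code": int(l2[1]),
        "n_obs": int(l2[2]),                  # 0 = let the program find N
        "statistic_code": int(l3[0]),
        "lambda_gc": float(l3[1]),            # assumed present on line 3
        "max_terms": int(l4[0]),
        "initial_cell_count": float(l4[1]),
        "n_permutations": int(lines[4].split()[0]),
        "test_snp": int(lines[5].split()[0]),
        "data_file": lines[6].strip(),
        "output_file": lines[7].strip() or "sumstat.out",
    }


params = parse_sumstat_parameters(SAMPLE_PAR)
print(params["statistic_code"], params["max_terms"], params["data_file"])
```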

----------------------------------------

Codes on line 3:
 1 = chi-square for 2 x 3 table, case-control versus 3 genotypes
 2 = difference in mean codes, group 2 vs. group 1. SEE NOTE BELOW.
 7 = proportion of homozygotes in cases minus controls
 8 = difference in proportion of homozygotes, cases vs. controls. SEE NOTE BELOW.
10 = Armitage test for trend
11 = t-test. SEE NOTE BELOW.
12 = chi-square for 2 x 2 table of alleles, case-control versus 2 alleles
14 = chi-square for differences in allele frequencies and F values. SEE NOTE BELOW.
16 = chi-squares for 2 x 3 table of genotypes conditioned on genotypes of test SNP.
17 = interaction chi-square for given test SNP versus any target SNP
18 = Max2 test: maximum chi-square for dominant and recessive genotype action
21 = difference in proportions of risk alleles, cases minus controls

NOTE (statistic codes 2, 11, and 14)
When applicable, absolute differences between the two groups are computed (two-sided test). But when the statistic code is given as a negative number then the difference type 2 minus type 1 observations will be computed. For example, the test statistic with a code of 11 is |t|. Also, a test code of -14 will test whether Fcase>Fctrl.

NOTE
Chi-square is generally computed as a likelihood ratio (LR) statistic. If the statistic code is entered as a negative number then the Pearson chi-square will be computed.

3. Seed for random number generator (optional)

An input file called seed.txt holds a seed for the random number generator. This seed must be a positive integer number. At program termination, this file will be overwritten with the ending seed so that seeds are always updated from one to the next program run. We implemented in our Pascal programs the newest random number generator, ran (int64), highly recommended by Press et al (2007), which has a period of approximately 10^57. We are grateful to Dr. Quan Long for helping us understand the C-code of this random number generator. If no seed.txt file is present when the program starts it will create one based on the system time.

Output file

The output file will be named sumstat.out or whatever name you have chosen on the last line of the parameter file. Most of the output is self-explanatory; the p-value columns of the main table (p0Stat, pStat, pSum) are defined in the legend printed below the table.

With the given parameter file, the output file looks as follows:

Program Sumstat version 15 Dec 2012

sumstat: Sample data, generated, 200 variables. Heritability 0.50, threshold 2.32
Input file = Sample.dat
Current time: Fri 21 Dec 2012  17:35:04h

Locus-specific statistic for selection:
Chi-square for differences in allele freq and F values (statis=14)
Lambda used = 1.0000. Initial cell count = 0.50
Number of SNPs = 200
Number of permutation samples = 10000. Smallest possible p-value = 0.000100

   i   SNP#      Stat         Sum    p0Stat     pStat      pSum ch   position  SNP
   1      2   14.8792     14.8792  0.000100  0.029797  0.029797  8     200310 rs200310
   2    190   10.7103     25.5894  0.002300  0.268273  0.022298  8     202190 rs202190
   3      7   10.1387     35.7281  0.004000  0.352565  0.013999  8     200360 rs200360
   4     74    7.7747     43.5028  0.005599  0.825617  0.014999  8     201030 rs201030
   5      5    7.4946     50.9974  0.006999  0.870513  0.014699  8     200340 rs200340
   6      8    7.1620     58.1594  0.018498  0.914109  0.013099  8     200370 rs200370
   7     96    6.4181     64.5775  0.017198  0.975502  0.013099  8     201250 rs201250
   8     73    6.3256     70.9031  0.034697  0.979302  0.012599  8     201020 rs201020
   9     15    5.2778     76.1809  0.021298  0.998600  0.014699  8     200440 rs200440
  10    115    5.2502     81.4312  0.036796  0.998700  0.016898  8     201440 rs201440
  11     71    5.0635     86.4946  0.065193  0.999600  0.017498  8     201000 rs201000
  12    145    5.0267     91.5213  0.046695  0.999800  0.018598  8     201740 rs201740
  13     75    4.6681     96.1894  0.074093  1.000000  0.019698  8     201040 rs201040
  14    180    4.6275    100.8169  0.048995  1.000000  0.020998  8     202090 rs202090
  15     80    4.5901    105.4070  0.062994  1.000000  0.022098  8     201090 rs201090
  16    164    4.4619    109.8689  0.089791  1.000000  0.022498  8     201930 rs201930
  17     27    4.4320    114.3009  0.115188  1.000000  0.022598  8     200560 rs200560
  18    195    4.3773    118.6782  0.094891  1.000000  0.022598  8     202240 rs202240
  19    118    4.2614    122.9396  0.126287  1.000000  0.022598  8     201470 rs201470
  20    122    4.1496    127.0892  0.047895  1.000000  0.022498  8     201510 rs201510

                        Gr 1    Gr 2
------------------------------------
Total #observations      250     250

p0Stat = sig. level, uncorrected for multiple testing
pStat  = sig. level of given statistic, corrected
 (equivalent to Bonferroni correction for independent tests)
 (p0Stat and pStat include observed data as a null dataset)
pSum   = sig. level of sum statistic, corrected

Starting seed = 17985685530142462792

Final p-value = 0.032097
Current time: Fri 21 Dec 2012  17:35:43h

The smallest SNP-specific p-value (corrected for multiple testing) is 0.0298 for SNP #2 (rs200310). The smallest p-value for any of the 20 sums is 0.0126, which is associated with an experiment-wise p-value of 0.0321. In this case, constructing sums has not gained anything as 0.0321 > 0.0298.

The file, statout.txt, will contain values of permutation samples. It is required for calculation of the overall p-value and will be deleted at program termination, so the user will generally not see it.

Running the program

You cannot simply click on program names. Instead you need to open a command (DOS) box, for example, by clicking on Start and then on Command Prompt (in Windows XP). In earlier Windows versions, click on Start, then on Run... and type cmd. It is a good idea to change one of the standard features in Windows: Make sure you see extensions of known file types on your screen; for example, you should see sumstat.exe, not just sumstat.

The program is run by typing sumstat followed on the same line by the name of the parameter file. For example, you type sumstat Sample.par.

Test statistics

Code 1. Consider a 2 x 3 contingency table, where rows correspond to controls and cases, and columns represent the three genotypes of a given SNP, while the cells contain numbers of individuals. The sumstat program will construct such a table for each SNP and compute chi-square (2 df). With small numbers of observations, the initial cell count may be set to 0.5 rather than 0 for stability (Berkson 1955).
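As an illustration of code 1, the following Python sketch builds the 2 x 3 table and computes chi-square either as a likelihood ratio statistic or in Pearson's form. It assumes genotype codes 1, 2, 3, labels 1 = control and 2 = case, and that the initial cell count is simply the starting value of every cell before individuals are counted. Genomic control (lambda) is not applied. The function names are hypothetical.

```python
import math


def genotype_table(genotypes, labels, init=0.0):
    """2 x 3 table: rows = controls, cases; columns = genotypes 1, 2, 3."""
    table = [[init] * 3 for _ in range(2)]
    for g, y in zip(genotypes, labels):
        table[y - 1][g - 1] += 1
    return table


def chi_square(table, pearson=False):
    """Likelihood ratio chi-square (default) or Pearson chi-square."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    stat = 0.0
    for i, r in enumerate(table):
        for j, observed in enumerate(r):
            expected = row[i] * col[j] / total
            if expected == 0:
                continue                      # empty column contributes nothing
            if pearson:
                stat += (observed - expected) ** 2 / expected
            elif observed > 0:                # 0 * log(0) is taken as 0
                stat += 2 * observed * math.log(observed / expected)
    return stat


table = [[20, 10, 10],     # controls
         [10, 10, 20]]     # cases
print(chi_square(table), chi_square(table, pearson=True))
```

Both statistics have 2 df for a 2 x 3 table; starting the cells at 0.5 keeps every cell positive, so the logarithm in the likelihood ratio form is always defined.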

Code 2. Test statistic is the difference in mean genotype codes between cases and controls (1-sided or 2-sided).

Code 7. Proportion of homozygotes in cases versus controls (1-sided or 2-sided).

Code 8. Chi-square for proportion of homozygotes in cases versus controls (1-sided or 2-sided).

Code 10. Armitage’s test for trend (Agresti 2002).

Code 11. t-test carried out on genotype codes as “quantitative measurements”. This statistic tests whether mean genotype codes are different in cases and controls. It may be more appropriate for small minor allele frequencies than statistic #2 as the variance is then small so that mean differences are “enlarged”.

Code 12. Chi-square in 2 x 2 table, rows = cases and controls, columns = two SNP alleles

Code 14. Chi-square for differences, cases versus controls, of allele frequencies and F value (alpha, inbreeding coefficient) (Zhang et al 2008)

Code 16. For a given SNP (called the test SNP), data are divided into 3 groups depending on the genotype at this SNP. The sequence number (order in which the SNP is listed in the datafile) of the test SNP must be indicated on line 6 of the parameter file. Then an association analysis (code 1) is carried out for any other SNP (called target SNP) in each of the 3 groups of individuals. The resulting 3 chi-squares are independent and are either (a) summed up for a resulting chi-square with 6 df, or (b) the maximum of the 3 chi-squares is retained and its associated p-value corrected for 3 tests. If the test SNP number is provided as a positive number then (b) is carried out; if it is negative then (a) is carried out. For example, if line 6 contains the number 311, the 311th SNP is the test SNP. This analysis tests for the main effect of the target SNP and its interaction with the test SNP (Wang et al 2010).

Code 17. Analogous to code 16 (the test SNP number is always positive) but the test statistic only reflects the interaction between the test and target SNPs. This interaction term is obtained by partitioning chi-square into main and interaction effects and retaining only the interaction term (Yang et al 2009).

Code 18. Two 2 x 2 tables of numbers of individuals are constructed, one for recessive inheritance (genotypes AA+AB versus BB) and one for dominant inheritance (genotypes AA versus AB + BB), with rows corresponding to cases/controls. The larger of the resulting two chi-squares is the test statistic; its p-value will be corrected for multiple testing, that is, the reported p-value is p(2 – p) (Bonferroni, 2 tests).
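The Max2 computation can be sketched in Python as follows. Two assumptions are made for brevity: Pearson's chi-square is used for the 2 x 2 tables, and the uncorrected p-value is taken from the chi-square distribution with 1 df, for which the upper tail probability equals erfc(sqrt(x/2)). The function names and genotype counts are hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def max2(case_counts, ctrl_counts):
    """case_counts and ctrl_counts are numbers of (AA, AB, BB) individuals.

    Returns (larger chi-square, p-value corrected as p(2 - p)).
    """
    ca, cb, cc = case_counts
    ka, kb, kc = ctrl_counts
    recessive = chi2_2x2(ca + cb, cc, ka + kb, kc)   # AA+AB versus BB
    dominant = chi2_2x2(ca, cb + cc, ka, kb + kc)    # AA versus AB+BB
    stat = max(recessive, dominant)
    p = math.erfc(math.sqrt(stat / 2))               # chi-square tail, 1 df
    return stat, p * (2 - p)


stat, p_corrected = max2((10, 20, 20), (20, 20, 10))
print(stat, p_corrected)
```

The correction p(2 - p) equals 1 - (1 - p)^2, the probability that at least one of two independent tests reaches significance level p.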

Code 21. For each SNP, the risk allele is defined as the allele leading to an odds ratio > 1. The sum of risk alleles in cases is compared with those in controls (Ott and Sun 2012).

Sample data

Generated Data

The following data set (Sample.dat) was generated on the computer: 200 SNP markers, 250 cases, 250 control individuals, 10 susceptibility loci at SNPs 1 through 10 obtained via Hartl and Clark's liability threshold model, heritability for all 10 disease loci combined = 0.50, population trait prevalence = 0.01. An input parameter file, Sample.par, is provided. With a particular run of 5000 permutation replicates, the output collected in the file SampleResults.out has been obtained.

sumstatQ program

This program represents a modification of the sumstat program in that it works with quantitative phenotypes (QTLs) rather than case-control type data. The datafile structure is the same as for sumstat but after the input lines containing genotype data, one or more QTLs can be present, one per input line for each individual.

Parameter file

Here is an example of a parameter file (also contained in the program package as SampleQ.par):

sumstatQ: Sample data, generated.
0 0 0      codes for genotypes, missing genotype, #obs
19 2 -9.9  code for test statistic, number of QTLs, missing phenotype/QTL
10         max # terms in sum
10000      number of permutation samples
0          reserved (not used). Below: Input/output file names
SampleQ.dat
SampleQResults.out

The various lines of input specify the following parameter values (this text is also contained in the inputSampleQ.txt file):

Line 1: Any text, truncated to 250 characters.

Line 2:
  Code for genotypes given in the data:
   0 = the genotypes are 1, 2, 3
   1 = the genotypes are 0, 1, 2
   2 = the genotypes are -1, 0, 1
  Value for "missing" genotype
  Number of observations, N. Enter 0 if program should find N. It determines N from the first input line that should end with (1) chromosome number, (2) bp position, and (3) SNP ID. If (1) - (3) are missing then only numbers are present on this line. If (1) - (3) are present then the program recognizes (3) because it starts with a letter. If that is not the case and the SNP ID is a pure number then the program will take it to be a genotype code, which will generally lead to an error message. If such an error occurs at the end of the line, at item k, then the number of observations is k-3, which should be entered here.

Line 3:
  Code for single-locus test statistic to select SNPs (for details, see below)
  Number of QTL (phenotype) lines following genotype lines in datafile
  Value of missing phenotype

Line 4:
  Maximum number of terms in the sum

Line 5:
  Number of permutation samples

Line 6:
  (not currently used)

Line 7:
  Name of datafile. No other characters before or after the name.

Line 8:
  Name of output file. No other characters before or after the name. If this line is blank the output file will be called sumstatQ.out.

----------------------------------------

Codes on line 3:
19 = 1-way ANOVA for mean differences of QTL between three SNP genotypes
20 = Max3 test: dom / rec / linear increase in means

Test statistics

The test statistic for each SNP is a standard one-way analysis of variance resulting in an F-value. Multivariate analysis for multiple phenotypes (QTLs) is approximated by focusing, at each SNP, on the largest F-value (Manly 2006), where Fmax may occur at different QTLs for different SNPs.
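The F statistic and the maximum over several QTLs can be sketched in Python as below. This is a plain textbook one-way ANOVA written for illustration, not the sumstatQ code; missing genotypes and phenotypes are not handled, and the function names and data are hypothetical.

```python
def anova_f(values, genotypes):
    """One-way ANOVA F-value of a quantitative trait across SNP genotype groups."""
    groups = {}
    for v, g in zip(values, genotypes):
        groups.setdefault(g, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        return 0.0                                   # F is not defined
    grand_mean = sum(values) / n
    means = {g: sum(x) / len(x) for g, x in groups.items()}
    ss_between = sum(len(x) * (means[g] - grand_mean) ** 2 for g, x in groups.items())
    ss_within = sum((v - means[g]) ** 2 for g, x in groups.items() for v in x)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def max_f(qtls, genotypes):
    """Largest F-value over several QTLs at one SNP."""
    return max(anova_f(q, genotypes) for q in qtls)


genotypes = [1, 1, 2, 2, 3, 3]                       # one SNP, six individuals
qtl_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
qtl_b = [3.0, 1.0, 2.0, 3.0, 1.0, 2.0]
print(anova_f(qtl_a, genotypes), anova_f(qtl_b, genotypes), max_f([qtl_a, qtl_b], genotypes))
```

In this toy example the first QTL gives the larger F-value, so it supplies the statistic for this SNP; at another SNP the maximum could come from the second QTL.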

Datafile

A sample datafile, SampleQ.dat, was created as follows. Genotype codes are as in the datafile for the sumstat program. Case-control labels were replaced by values of two QTL phenotypes, which are normally distributed with given means m and variances s. The first QTL has a mean equal to the genotype code at SNP # 5 and a variance of s = 3.5; the second QTL has a mean equal to the genotype code at SNP # 10 and a variance of s = 3.7.

Output file

The output file generated with the given parameter file shows the following results:

Program SumstatQ for QTLs version 15 Dec 2012

sumstatQ: Sample data, generated.
Input file = SampleQ.dat
Current time: Fri 21 Dec 2012  12:46:08h

Locus-specific statistic for selection:
1-way ANOVA for mean differences of QTL between three SNP genotypes (statis=19)
Number of missing QTL observations = 0
Number of QTL phenotypes = 2
Number of SNPs = 200
Number of individuals = 500
Number of permutation samples = 10000+1

  
   i   SNP#      Stat         Sum    p0Stat     pStat      pSum ch   position  SNP
   1      5   10.1240     10.1240  0.000100  0.020398  0.020398  8     200340 rs200340
   2     10    9.1865     19.3104  0.000200  0.050995  0.002700  8     200390 rs200390
   3     20    7.0128     26.3232  0.001900  0.333967  0.002000  8     200490 rs200490
   4    177    5.9307     32.2540  0.006599  0.683732  0.002100  8     202060 rs202060
   5     83    4.2145     36.4684  0.029897  0.997600  0.005000  8     201120 rs201120
   6    110    4.2040     40.6725  0.035296  0.997800  0.008899  8     201390 rs201390
   7    154    4.0568     44.7292  0.038996  0.999300  0.012499  8     201830 rs201830
   8    190    3.9875     48.7168  0.035996  0.999600  0.016098  8     202190 rs202190
   9    100    3.7143     52.4311  0.050095  1.000000  0.023898  8     201290 rs201290
  10     30    3.6626     56.0937  0.050695  1.000000  0.029497  8     200590 rs200590

p0Stat = sig. level, uncorrected for multiple testing
pStat  = sig. level of given statistic, corrected
 (equivalent to Bonferroni correction for independent tests)
 (p0Stat and pStat include observed data as a null dataset)
pSum   = sig. level of sum statistic, corrected

Starting seed = 6976125602393045611

Final p-value = 0.005199
Current time: Fri 21 Dec 2012  12:47:01h

The best single-locus result is for SNP # 5 (rs200340) with a genome-wide significance level of 0.0204. The smallest p-value among all sums is equal to 0.0020 with an associated genome-wide significance level of 0.0052. Thus, building sums of the F statistics has been beneficial as 0.0052 < 0.0204.

References

Agresti A (2002) Categorical data analysis, 2nd edition. New York, Wiley-Interscience

Berkson J (1955) Maximum likelihood and minimum chi-square estimates of the logistic function. J Amer Statist Assoc 50:130-162

Chung JH, Fraser DAS (1958) Randomization tests for a multivariate two sample problem. J Amer Statist Assoc 53:729-735

de Quervain DJ, Poirier R, Wollmer MA, Grimaldi LM, Tsolaki M, Streffer JR, Hock C, Nitsch RM, Mohajeri MH, Papassotiropoulos A (2004) Glucocorticoid-related genetic susceptibility for Alzheimer's disease. Hum Mol Genet 13:47-52 [an example of the use of sum statistics]

Edgington ES, Onghena P (2007) Randomization Tests, 4th edition. Chapman & Hall/CRC, Boca Raton

Hoh J, Ott J (2003) Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet 4:701-709.

Hoh J, Wille A, Ott J (2001) Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res 11:2115-2119. [main reference for the method described here]

Hoh J, Wille A, Zee R, Cheng S, Reynolds R, Lindpaintner K, Ott J (2000) Selecting SNPs in two-stage analysis of disease association data: a model-free approach. Ann Hum Genet 64:413-417. [a nested bootstrap approach to SNP selection]

Kim S, Zhang K, Sun F (2003) Detecting susceptibility genes in case-control studies using set association. BMC Genet 4 Suppl 1:S9 [documents increased power of sum statistics]

Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, Sangiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J, Barnstable C, Hoh J (2005) Complement Factor H Polymorphism in Age-Related Macular Degeneration. Science 308:385-389

Manly BFJ (2006) Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall/CRC, New York

Nijenhuis A, Wilf HS (1978) Combinatorial algorithms for computers and calculators. Academic Press, New York

Ott J, Hoh J (2003) Set association analysis of SNP case-control and microarray data. J Comput Biol 10:569-574

Ott J, Sun D (2012) Multilocus association analysis under polygenic models. Int J Data Mining and Bioinformatics 6:482-489

Papassotiropoulos A, Wollmer MA, Tsolaki M, Brunner F, Molyva D, Lutjohann D, Nitsch RM, Hock C (2005) A cluster of cholesterol-related genes confers susceptibility for Alzheimer's disease. J Clin Psychiatry 66:940-947 [an example of the use of sum statistics]

Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007) Numerical recipes 3rd edition: The art of scientific computing. Cambridge University Press, Cambridge, UK; New York

Wang G, Yang Y, Ott J (2010) Genome-wide conditional search for epistatic disease-predisposing variants in human association studies. Hum Hered 70:34-41

Weir BS (1996) Genetic data analysis II: methods for discrete population genetic data. Sinauer Associates, Sunderland, Mass.

Wille A, Hoh J, Ott J (2003) Sum statistics for the joint detection of multiple disease loci in case-control association studies with SNP markers. Genet Epidemiol 25:350-359.

Yang Y, He C, Ott J (2009) Testing association with interactions by partitioning chi-squares. Ann Hum Genet 73:109-117

Zee RY, Hoh J, Cheng S, Reynolds R, Grow MA, Silbergleit A, Walker K, Steiner L, Zangenberg G, Fernandez-Ortiz A, Macaya C, Pintor E, Fernandez-Cruz A, Ott J, Lindpaintner K (2002) Multi-locus interactions predict risk for post-PTCA restenosis: an approach to the genetic analysis of common complex disease. Pharmacogenomics J 2:197-201

Zhang Q, Wang S, Ott J (2008) Combining identity by descent and association in genetic case-control studies. BMC Genet 9:42