Josephine Hoh

Jurg Ott

Yale University, New Haven

Rockefeller University New York

11 Oct 2017

S statistic in gene mapping

(Sumstat computer program)


The general approach to gene mapping by our method is described in Hoh et al. (2001). Briefly, consider a number of marker loci in the genome. At each marker, genotypes are available for two types of observations, for example:

The general idea is to find a set of SNPs (bi-allelic DNA variants) that jointly are associated with disease. We do this by initially computing an association statistic for each SNP, for example, χ2 for a 2 × 3 table, where the two rows correspond to cases and controls, and the three columns refer to SNP genotypes. Then markers are ordered by the size of their test statistics, and sums, S, are formed sequentially, starting with the largest test statistic and gradually adding one after another SNP up to a maximum number of terms in S. For example, S3 will be the sum of the three largest test statistics for SNPs wherever they are in the genome. For each Si, an associated significance level is computed with permutation testing. The smallest such significance level represents the experiment-wise test statistic, for which an associated significance level is computed via permutation samples.

The problem handled here is discussed in Manly (2007) under the heading of Comparison of Sample Mean Vectors. The use of a sum of univariate test statistics (which is what we compute here) as a multivariate test statistic for a randomization test seems to have been first proposed by Chung and Fraser (1958). In the current implementation, marker loci must be SNPs (two alleles each). The newest version of sumstat contains source and executable code for Windows and Linux. For Linux users, the following section provides some practical hints.

Compiling, and notes for Linux users

Instructions given here are specific for Ubuntu Linux but should work in most Linux environments. Download the sumstat package and extract all files. Delete sumstat.exe. The compilable files are *.pas while the *.p files are include files that will be read by the main program upon compilation. The sumstat.linux file is executable code for Linux with some default settings but you may want to create executable code with settings for your specific needs. Relevant settings are as follows and may be found in the CONST section of the sumstat.pas file:

maxobs = 5000;    {maximum number of observations}
maxvar = 500000;  {max. number of SNPs}
maxsum = 100;     {max. number of terms in a sum}
howoften = 2000;  {after how many random samples to write to screen}
maxnamelength=20; {max length of SNP name} 

The program will allocate memory for as many individuals (cases and controls) as are present in the data, but for each individual it will allocate memory for as many variables (SNPs) as given by the value of maxvar. Thus, to keep memory requirements low you may want to change the value of maxvar to something just slightly higher than the actual number of SNPs in your data. Do this by opening the sumstat.pas file in your text editor and changing numbers as desired. Then compile the program in a terminal window (ctrl-alt-T) by typing fpc sumstat.pas, which should take only a few seconds. Disregard warnings about code not being accessible or about “link.res containing output sections”. You may check the compiled version by typing ./sumstat. If everything looks satisfactory, make the program accessible (put it in the path) by typing sudo mv sumstat /usr/local/bin, which will move the sumstat program file to a folder in the path. Verify your actions by typing sumstat, which should invoke the program, but it will not carry out calculations as you have not yet furnished a parameter file name (see elsewhere in this document).

The sumstat program is written in Free Pascal, which supplies a free compiler for Windows, Linux, and other operating systems.

Permutation tests

For each of our sum statistics, determining the associated empirical significance level is mathematically intractable, which is why we perform permutation (randomization) tests (Manly 2007). Under the hypothesis of no association, any permutation of the labels for the two outcome types (case, control) is expected to be equally likely. Because of the potentially very large number of possible permutations, we take a random sample of permutations and approximate empirical significance levels. Based on the given array of case and control labels in the observed data, random permutations of labels are generated sequentially according to an algorithm by Nijenhuis and Wilf (1978). Specifically, let Sobs be the sum statistic obtained in the observed data for a given number of SNPs. In each permutation sample (same genotypes as observed but labels “affected” and “unaffected” permuted), we compute a sum statistic, Sperm, in the same way as in the observed data. The proportion of permutation samples for which SpermSobs is then an estimate for the empirical significance level associated with Sobs.

A second level of permutations is performed (in the statpval procedure) as follows. Assume that we evaluate N sum statistics, that is, the N-th sum statistic contains the test statistics summed over N SNPs (N = 15 is generally a suitable number). For each sum statistic with a given number of terms, the associated empirical significance level is determined. We then take the smallest of these as our single overall statistic of interest and determine its associated (experiment-wise) significance level. This is done on the basis of the permutation samples obtained in level (1) (Manly 2007, section 6.8): Each of these samples in turn is taken as an “observed” (test) dataset and the remaining permutation samples are used to evaluate the significance level of the smallest p-value in the “observed” permutation sample.

Some authors argue that under the null hypothesis the observed data should be included in the randomized data, and we do this here. Thus, the smallest significance level achievable is equal to 1/n, where n is the number of permutation samples (including the observed data).

With missing observations, it may happen for one or more variables that a permutation sample contains observations on only one class (either cases or controls). Such ill-conditioned permutation samples will be skipped.

Input and output for Sumstat program

The sumstat program works with the following three text files.

1. Data file

This input file (named in the parameter file below) should consist of a matrix with K+1 rows and N columns, where N = number of observations (e.g., case and control individuals) and K = number of input variables (eg. SNPs). Each cell in the matrix contains a genotype code. The last row contains the code for the type of observation (e.g., 2 = case and 1 = control). Optionally, the N input observations in each row (for a given marker) may be followed by three quantities: (1) Chromosome number, (2) position in base pairs of the marker on the chromosome, and (3) marker name. These items will allow for an easy ordering of markers by their chromosomal positions, for example, with the sort command in Windows (not really relevant for sumstat, but for scanstat). On the last input line, you may use the following dummy values: (1) 99, (2) 99, (3) xx.

NOTE: Do not use marker names starting with x, &, %, $, followed by digits or else the program will interpret these as numbers to a base other than 10. For example, x11 will be interpreted as a hexadecimal number equal to (decimal) 17. If you must use x plus digits then use xx plus digits, for example, xx11. The program will recognize this as text, which it must do to properly count individuals. Your data may be in
plink format, in which case you may want to use the p2s program to convert plink files to sumstat format.

2. Parameter file

The sumstat program reads parameter values from a file, here called a parameter file. An example for such a parameter file (sumstat.param, included in the program package) is as follows:

sumstat: Sample data, generated, heritability = 0.50
0 0 0   codes for genotypes, missing, #obs
14 1    code for test statistic, lambda used
10 0    max # terms in sum, initial # obs in contingency tables
99999   number of permutation samples
0       list number of test-SNP for interaction. Below: Input/output file names


Line 1: Any text, truncated to 250 characters.

Line 2:
 Code for genotypes given in the data:
        0 = the genotypes are 1, 2, 3
        1 = the genotypes are 0, 1, 2
        2 = the genotypes are -1, 0, 1
 Value for "missing"
 Number of observations, N. Enter 0 if the program should find N.

Line 3:
 Code for single-locus test statistic to select SNPs (for details, see below)
 Lambda value for genomic control (if code=1 or code=12)

Line 4:
 Maximum number of terms in sum
 Initial number of observations in contingency tables. For sparse tables, 0.5
is recommended for stability.

Line 5:
 Number of permutation samples

Line 6:
 List number of test SNP, which will be paired with all other SNPs in turn
(statistic code 16). If the test SNP number is given as a positive number
then the maximum of the 3 chi-squares serves as the test statistic, otherwise
it's the sum of the 3 chi-squares.

Line 7:
 Name of datafile. No other characters before or after the name.

Line 8:
 Name of output file. No other characters before or after the name. If this
line is blank the output file will be called "sumstat.out".

Codes on line 4:

 1 = chi-square for 2 x 3 table, case-control versus 3 genotypes
 2 = difference in mean codes, group 2 vs. group 1. SEE NOTE BELOW.
 7 = proportion of homozygotes in cases minus controls. SEE NOTE BELOW.
 8 = chi-square for proportion of homozygotes, cases vs controls. SEE NOTE BELOW.
10 = Armitage test for trend
11 = t-test. SEE NOTE BELOW.
12 = chi-square for 2 x 2 table of alleles, case-control versus 2 alleles
14 = chi-square for differences in allele frequencies and F values. SEE NOTE BELOW.
16 = chi-squares for 2 x 3 table of genotypes conditioned on genotypes of test SNP.
17 = interaction chi-square for given test SNP versus any target SNP
18 = Max2 test: maximum chi-square for dominant and recessive genotype action
19 = Max3 test: maximum chi-square for dom/rec/linear models
21 = difference in proportions of risk alleles, cases minus controls

NOTE (statistic codes 2, 11, and 14)
When applicable, absolute differences between the two groups are computed (two-sided
test). But when the statistic code is given as a negative number then the difference
type 2 (cases) minus type 1 (controls) observations will be computed. For example,
the test statistic with a code of 11 is |t|; a test code of -14 will test whether
Fcase > Fctrl.

Chi-square is generally computed as a likelihood ratio (LR) statistic. If the statistic
code is entered as a negative number then the Pearson chi-square will be computed.

3. Seed for random number generator (optional)

An input file called seed.txt holds a seed for the random number generator. This seed must be a positive integer number. At program termination, this file will be overwritten with a new seed so that seeds are always updated from one to the next program run. We implemented in our Pascal programs the newest random number generator, ran (int64), highly recommended by Press et al (2007), which has a period of approximately 1057. We are grateful to Dr. Quan Long for helping us understand the C-code of this random number generator. If no seed.txt file is present when the program starts it will create one based on the system time.

Output file

The output file will be named sumstat.out or whatever name you have chosen on the last line of the parameter file. Most of the output is self-explanatory. The main table contains the following items:

With the given parameter file, the output file looks as follows:

Program Sumstat version 06 Oct 2017

sumstat: Sample data, generated, heritability = 0.50
Input file = sample.dat
Current time:  9 Oct 2017   5:19:05h

Locus-specific statistic for selection:
Chi-square for differences in allele freq and F values (stat.code = 14)
Initial cell count = 0.00.
Number of SNPs = 200
Number of permutation samples = 19999+1. Smallest possible p-value = 0.000050

rank   SNP#      Stat         Sum      p0       p    pSum ch   position   SNP
   1      2   15.0294     15.0294 0.00015 0.03060 0.03060  8     200310 rs200310
   2    190   10.6404     25.6698 0.00315 0.28560 0.02415  8     202190 rs202190
   3      7   10.2412     35.9109 0.00390 0.34690 0.01345  8     200360 rs200360
   4     74    7.9087     43.8197 0.00295 0.80190 0.01390  8     201030 rs201030
   5      5    7.5414     51.3611 0.00545 0.86305 0.01305  8     200340 rs200340
   6      8    7.2514     58.6125 0.01975 0.90525 0.01165  8     200370 rs200370
   7     96    6.4601     65.0726 0.01615 0.97555 0.01160  8     201250 rs201250
   8     73    6.2648     71.3375 0.03445 0.98530 0.01155  8     201020 rs201020
   9     15    5.2777     76.6152 0.02310 0.99940 0.01390  8     200440 rs200440
  10     71    5.1003     81.7155 0.06775 0.99970 0.01600  8     201000 rs201000

                        Gr 1    Gr 2
Total #observations      250     250

p0 = sig. level, uncorrected for multiple testing
p  = sig. level of given statistic, corrected
 (equivalent to Bonferroni correction for independent tests)
 (p0Stat and pStat include observed data as a null dataset)
pSum   = sig. level of sum statistic, corrected

Starting seed = 4109687182758328260

Final p-value = 0.027101
Current time:  9 Oct 2017   5:20:16h

The smallest SNP-specific p-value (corrected for multiple testing) is 0.0298 for SNP #2 (rs200310). The smallest p-value for any of the 20 sums is 0.0126, which is associated with an experiment-wise p-value of 0.0321. In this case, constructing sums has not gained anything as 0.0321 > 0.0298.

The file, statout.txt, will contain values of permutation samples. It is required for calculation of the overall p-value and will be deleted at program termination, so the user will generally not see it.

Running the program

You cannot simply click on program names. Instead you need to open a command (DOS) box, for example, by clicking on Start and then on Command Prompt (in Windows XP). In earlier Windows versions, click on Start, then on Run... and type cmd. It is a good idea to change one of the standard features in Windows: Make sure you see extensions of known file types on your screen; for example, you should see sumstat.exe, not just sumstat.

The program is run by typing sumstat followed on the same line by the name of the parameter file. For example, you type sumstat sumstat.param; any name for the parameter file will be fine as long as it conforms to the rules shown above. If the parameter file is sumstat.param then you may simply type sumstat, which implies a parameter file by the name of sumstat.param.

Test statistics

Code 1. Consider a 2 x 3 contingency table, where rows correspond to controls and cases, and columns represent the three genotypes of a given SNP, while the cells contain numbers of individuals. The sumstat program will construct such a table for each SNP and compute chi-square (2 df). With small numbers of observations, the initial cell count may be set to 0.5 rather than 0 for stability (Berkson 1955).

Code 2. Test statistic is the difference in mean genotype codes between cases and controls (1-sided or 2-sided).

Code 7. Proportion of homozygotes in cases versus controls (1-sided or 2-sided).

Code 8. Chi-square for proportion of homozygotes in cases versus controls (1-sided or 2-sided).

Code 10. Armitage’s test for trend (Agresti 2002).

Code 11. t-test carried out on genotype codes as “quantitative measurements”. This statistic tests whether mean genotype codes are different in cases and controls. It may be more appropriate for small minor allele frequencies than statistic #2 as the variance is then small so that mean differences are “enlarged”.

Code 12. Chi-square in 2 x 2 table, rows = cases and controls, columns = two SNP alleles

Code 14. Chi-square for differences, cases versus controls, of allele frequencies and F value (alpha, inbreeding coefficient) (Zhang et al 2008)

Code 16. For a given SNP (called the test SNP), data are divided into 3 groups depending on the genotype at this SNP. The sequence number (order in which the SNP is listed in the datafile) of the test SNP must be indicated on line 4. Then an association analysis (code 1) is carried out for any other SNP (called target SNP) in each of the 3 groups of individuals. The resulting 3 chi-squares are independent and are either (a) summed up for a resulting chi-square with 6 df, or (b) the maximum of the 3 chi-squares is retained and its associated p-value corrected for 3 tests. If the target SNP number is provided as a positive number then (b) is carried out, if it is negative then (a) is carried out. For example, if line 4 contains the number 311 that means that the 311-th SNP should be the test SNP. This analysis tests for the main effect of the target SNP and its interaction effect of the test SNP (Wang et al 2010).

Code 17. Analogous to code 16 (target SNP number is always positive) but the test statistic only reflects the interaction between the test and target SNPs. This interaction term is obtained by partitioning chi-square into main and interaction effects and retaining only the interaction term (Yang et al 2009).

Code 18. Two 2 x 2 tables of numbers of individuals are constructed, one for recessive inheritance (genotypes AA+AB versus BB) and one for dominant inheritance (genotypes AA versus AB + BB), with rows corresponding to cases/controls. The larger of the resulting two chi-square is the test statics; its p-value will be corrected for multiple testing, that is, the reported p-value is p(2 – p) (Bonferroni, 2 tests).

Code 21. For each SNP, the risk allele is defined as the allele leading to an odds ratio > 1. The sum of risk alleles in cases is compared with those in controls (Ott and Sun 2012).

Sample data

Generated Data

The following data set (Sample.dat) was generated on the computer: 500 SNP markers, 250 cases, 250 control individuals, 10 susceptibility loci at SNPs 1 through 10 obtained via Hartl and Clark's liability threshold model, heritability for all 10 disease loci combined = 0.50, population trait prevalence = 0.01. An input parameter file, sumstat.param, is provided. With a particular run of 5000 permutation replicates, the output collected in the file SampleResults.out has been obtained.


Agresti A (2002) Categorical data analysis, 2nd edition. New York, Wiley-Interscience

Berkson J (1955) Maximum likelihood and minimum chi-square estimates of the logistic function. J Amer Statist Assoc 50, 130-162

Chung JH, Fraser DAS (1958) Randomization tests for a multivariate two sample problem. J Amer Statist Assoc 53:729-735

de Quervain DJ, Poirier R, Wollmer MA, Grimaldi LM, Tsolaki M, Streffer JR, Hock C, Nitsch RM, Mohajeri MH, Papassotiropoulos A (2004) Glucocorticoid-related genetic susceptibility for Alzheimer's disease. Hum Mol Genet 13:47-52 [an example of the use of sum statistics]

Edgington ES, Onghena P (2007) Randomization Tests, 4th edition. Chapman & Hall/CRC, Boca Raton

Hoh J, Ott J (2003) Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet 4:701-709.

Hoh J, Wille A, Ott J (2001) Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res 11:2115-2119. [main reference for the method described here]

Hoh J, Wille A, Zee R, Cheng S, Reynolds R, Lindpaintner K, Ott J (2000) Selecting SNPs in two-stage analysis of disease association data: a model-free approach. Ann Hum Genet 64:413-417. [a nested bootstrap approach to SNP selection]

Kim S, Zhang K, Sun F (2003) Detecting susceptibility genes in case-control studies using set association. BMC Genet 4 Suppl 1:S9 [documents increased power of sum statistics]

Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, Sangiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J, Barnstable C, Hoh J (2005) Complement Factor H Polymorphism in Age-Related Macular Degeneration. Science 308:385-389

Manly BFJ (2006) Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall/CRC, New York

Nijenhuis A, Wilf HS (1978) Combinatorial algorithms for computers and calculators. Academic Press, New York. This book is now freely available as a pdf file.

Ott J, Hoh J (2003) Set association analysis of SNP case-control and microarray data. J Comput Biol 10:569-574

Ott J, Sun D (2012) Multilocus association analysis under polygenic models. Int J Data Ming and Bioinformatics 6, 482-489

Papassotiropoulos A, Wollmer MA, Tsolaki M, Brunner F, Molyva D, Lutjohann D, Nitsch RM, Hock C (2005) A cluster of cholesterol-related genes confers susceptibility for Alzheimer's disease. J Clin Psychiatry 66:940-947 [an example of the use of sum statistics]

Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007) Numerical recipes 3rd edition: The art of scientific computing. Cambridge University Press, Cambridge, UK; New York

Wang G, Yang Y, Ott J (2010) Genome-wide conditional search for epistatic disease-predisposing variants in human association studies. Hum Hered 70, 34-41

Weir BS (1996) Genetic data analysis II: methods for discrete population genetic data. Sinauer Associates, Sunderland, Mass.

Wille A, Hoh J, Ott J (2003) Sum statistics for the joint detection of multiple disease loci in case-control association studies with SNP markers. Genet Epidemiol 25:350-359.

Yang Y, He C, Ott J (2009) Testing association with interactions by partitioning chi-squares. Ann Hum Genet 73, 109-117

Zee RY, Hoh J, Cheng S, Reynolds R, Grow MA, Silbergleit A, Walker K, Steiner L, Zangenberg G, Fernandez-Ortiz A, Macaya C, Pintor E, Fernandez-Cruz A, Ott J, Lindpainter K (2002) Multi-locus interactions predict risk for post-PTCA restenosis: an approach to the genetic analysis of common complex disease. Pharmacogenomics J 2:197-201

Zhang Q, Wang S, Ott J (2008) Combining identity by descent and association in genetic case-control studies. BMC Genet 9, 42