Josephine Hoh |
Jurg Ott |
Yale University, New Haven |
Rockefeller University New York |
josephine.hoh@yale.edu |
ott@rockefeller.edu |
11 Oct 2017 |
http://lab.rockefeller.edu/ott/ |
(Sumstat computer program)
The general approach to gene mapping by our method is described in Hoh et al. (2001). Briefly, consider a number of marker loci in the genome. At each marker, genotypes are available for two types of observations, for example:
Case and control individuals, or
Two types of sibling pairs, affected-affected (AA) and affected-unaffected (AU), or
Families for which two types of genotypes are available, (1) observed genotypes and (2) genotypes generated under the hypothesis of no linkage, for example, with the simulate2 computer program.
The general idea is to find a set of SNPs (bi-allelic DNA variants) that jointly are associated with disease. We do this by initially computing an association statistic for each SNP, for example, χ2 for a 2 × 3 table, where the two rows correspond to cases and controls, and the three columns refer to SNP genotypes. Then markers are ordered by the size of their test statistics, and sums, S, are formed sequentially, starting with the largest test statistic and gradually adding one after another SNP up to a maximum number of terms in S. For example, S3 will be the sum of the three largest test statistics for SNPs wherever they are in the genome. For each Si, an associated significance level is computed with permutation testing. The smallest such significance level represents the experiment-wise test statistic, for which an associated significance level is computed via permutation samples.
The problem handled here is discussed in Manly (2007) under the heading of Comparison of Sample Mean Vectors. The use of a sum of univariate test statistics (which is what we compute here) as a multivariate test statistic for a randomization test seems to have been first proposed by Chung and Fraser (1958). In the current implementation, marker loci must be SNPs (two alleles each). The newest version of sumstat contains source and executable code for Windows and Linux. For Linux users, the following section provides some practical hints.
Instructions given here are specific for Ubuntu Linux but should work in most Linux environments. Download the sumstat package and extract all files. Delete sumstat.exe. The compilable files are *.pas while the *.p files are include files that will be read by the main program upon compilation. The sumstat.linux file is executable code for Linux with some default settings but you may want to create executable code with settings for your specific needs. Relevant settings are as follows and may be found in the CONST section of the sumstat.pas file:
maxobs = 5000; {maximum number of observations} maxvar = 500000; {max. number of SNPs} maxsum = 100; {max. number of terms in a sum} howoften = 2000; {after how many random samples to write to screen} maxnamelength=20; {max length of SNP name}
The program will allocate memory for as many individuals (cases and controls) as are present in the data, but for each individual it will allocate memory for as many variables (SNPs) as given by the value of maxvar. Thus, to keep memory requirements low you may want to change the value of maxvar to something just slightly higher than the actual number of SNPs in your data. Do this by opening the sumstat.pas file in your text editor and changing numbers as desired. Then compile the program in a terminal window (ctrl-alt-T) by typing fpc sumstat.pas, which should take only a few seconds. Disregard warnings about code not being accessible or about “link.res containing output sections”. You may check the compiled version by typing ./sumstat. If everything looks satisfactory, make the program accessible (put it in the path) by typing sudo mv sumstat /usr/local/bin, which will move the sumstat program file to a folder in the path. Verify your actions by typing sumstat, which should invoke the program, but it will not carry out calculations as you have not yet furnished a parameter file name (see elsewhere in this document).
The sumstat program is written in Free Pascal, which supplies a free compiler for Windows, Linux, and other operating systems.
For each of our sum statistics,
determining the associated empirical significance level is
mathematically intractable, which is why we perform permutation
(randomization) tests (Manly 2007). Under the hypothesis of no
association, any permutation of the labels for the two outcome types
(case, control) is expected to be equally likely. Because of the
potentially very large number of possible permutations, we take a
random sample of permutations and approximate empirical significance
levels. Based on the given array of case and control labels in the
observed data, random permutations of labels are generated
sequentially according to an algorithm by Nijenhuis and Wilf (1978).
Specifically, let Sobs be the sum statistic
obtained in the observed data for a given number of SNPs. In each
permutation sample (same genotypes as observed but labels “affected”
and “unaffected” permuted), we compute a sum statistic,
Sperm, in the same way as in the observed data. The
proportion of permutation samples for which Sperm ≥
Sobs is then an estimate for the empirical
significance level associated with Sobs.
A
second level of permutations is performed (in the statpval
procedure) as follows. Assume that we evaluate N sum
statistics, that is, the N-th sum statistic contains the test
statistics summed over N SNPs (N = 15 is generally a
suitable number). For each sum statistic with a given number of
terms, the associated empirical significance level is determined. We
then take the smallest of these as our single overall statistic of
interest and determine its associated (experiment-wise) significance
level. This is done on the basis of the permutation samples obtained
in level (1) (Manly 2007, section 6.8): Each of these samples in turn
is taken as an “observed” (test) dataset and the
remaining permutation samples are used to evaluate the significance
level of the smallest p-value in the “observed”
permutation sample.
Some authors argue that under the null
hypothesis the observed data should be included in the randomized
data, and we do this here. Thus, the smallest significance level
achievable is equal to 1/n, where n is the number of
permutation samples (including the observed data).
With
missing observations, it may happen for one or more variables that a
permutation sample contains observations on only one class (either
cases or controls). Such ill-conditioned permutation samples will be
skipped.
The sumstat program works with the following three text files.
This input file (named in the
parameter file below) should consist of a matrix with K+1
rows and N
columns, where N
= number of observations (e.g., case
and control individuals) and K
= number of input variables (eg.
SNPs). Each cell in the matrix contains a genotype code. The last row
contains the code for the type of observation (e.g., 2 = case and 1 =
control). Optionally, the N
input observations in each row (for
a given marker) may be followed by three quantities: (1) Chromosome
number, (2) position in base pairs of the marker on the chromosome,
and (3) marker name. These items will allow for an easy ordering of
markers by their chromosomal positions, for example, with the sort
command in Windows (not really
relevant for sumstat,
but for scanstat).
On the last input line, you may use the following dummy values: (1)
99, (2) 99, (3) xx.
NOTE: Do not use marker names starting
with x, &, %, $, followed by digits or else the program will
interpret these as numbers to a base other than 10. For example, x11
will be interpreted as a hexadecimal number equal to (decimal) 17. If
you must use x plus digits then use xx plus digits, for example,
xx11. The program will recognize this as text, which it must do to
properly count individuals. Your data may be in plink
format, in which case you may want
to use the p2s
program to convert plink
files to sumstat
format.
The sumstat program reads parameter values from a file, here called a parameter file. An example for such a parameter file (sumstat.param, included in the program package) is as follows:
sumstat: Sample data, generated, heritability = 0.50 0 0 0 codes for genotypes, missing, #obs 14 1 code for test statistic, lambda used 10 0 max # terms in sum, initial # obs in contingency tables 99999 number of permutation samples 0 list number of test-SNP for interaction. Below: Input/output file names sample.dat SampleResults.out EXPLANATIONS Line 1: Any text, truncated to 250 characters. Line 2: Code for genotypes given in the data: 0 = the genotypes are 1, 2, 3 1 = the genotypes are 0, 1, 2 2 = the genotypes are -1, 0, 1 Value for "missing" Number of observations, N. Enter 0 if the program should find N. Line 3: Code for single-locus test statistic to select SNPs (for details, see below) Lambda value for genomic control (if code=1 or code=12) Line 4: Maximum number of terms in sum Initial number of observations in contingency tables. For sparse tables, 0.5 is recommended for stability. Line 5: Number of permutation samples Line 6: List number of test SNP, which will be paired with all other SNPs in turn (statistic code 16). If the test SNP number is given as a positive number then the maximum of the 3 chi-squares serves as the test statistic, otherwise it's the sum of the 3 chi-squares. Line 7: Name of datafile. No other characters before or after the name. Line 8: Name of output file. No other characters before or after the name. If this line is blank the output file will be called "sumstat.out". ---------------------------------------- Codes on line 4: 1 = chi-square for 2 x 3 table, case-control versus 3 genotypes 2 = difference in mean codes, group 2 vs. group 1. SEE NOTE BELOW. 7 = proportion of homozygotes in cases minus controls. SEE NOTE BELOW. 8 = chi-square for proportion of homozygotes, cases vs controls. SEE NOTE BELOW. 10 = Armitage test for trend 11 = t-test. SEE NOTE BELOW. 12 = chi-square for 2 x 2 table of alleles, case-control versus 2 alleles 14 = chi-square for differences in allele frequencies and F values. SEE NOTE BELOW. 16 = chi-squares for 2 x 3 table of genotypes conditioned on genotypes of test SNP. 17 = interaction chi-square for given test SNP versus any target SNP 18 = Max2 test: maximum chi-square for dominant and recessive genotype action 19 = Max3 test: maximum chi-square for dom/rec/linear models 21 = difference in proportions of risk alleles, cases minus controls NOTE (statistic codes 2, 11, and 14) When applicable, absolute differences between the two groups are computed (two-sided test). But when the statistic code is given as a negative number then the difference type 2 (cases) minus type 1 (controls) observations will be computed. For example, the test statistic with a code of 11 is |t|; a test code of -14 will test whether Fcase > Fctrl. NOTE Chi-square is generally computed as a likelihood ratio (LR) statistic. If the statistic code is entered as a negative number then the Pearson chi-square will be computed.
An input file called seed.txt holds a seed for the random number generator. This seed must be a positive integer number. At program termination, this file will be overwritten with a new seed so that seeds are always updated from one to the next program run. We implemented in our Pascal programs the newest random number generator, ran (int64), highly recommended by Press et al (2007), which has a period of approximately 1057. We are grateful to Dr. Quan Long for helping us understand the C-code of this random number generator. If no seed.txt file is present when the program starts it will create one based on the system time.
The output file will be named sumstat.out or whatever name you have chosen on the last line of the parameter file. Most of the output is self-explanatory. The main table contains the following items:
i = rank of given SNP
SNP# = consecutive number of given SNP in the datafile. For example, 7 refers to the 7th SNP.
Stat = statistic used, for example, chi-square with 2 df.
Sum = all the statistics summed up for SNPs listed in i = 1 through the current value of i for the given SNP.
p0Stat = nominal p-value, not adjusted for multiple testing
pStat = p-value for given statistic, adjusted for multiple testing
pSum = p-value for the given sum, adjusted for multiple testing, not yet adjusted for testing different sets of SNPs.
The final p-value at the bottom is generally different from any p-value listed in the table. It represents the significance level associated with the smallest pSum value, which is taken to be the single experiment-wise test statistic.
With the given parameter file, the output file looks as follows:
Program Sumstat version 06 Oct 2017 sumstat: Sample data, generated, heritability = 0.50 Input file = sample.dat Current time: 9 Oct 2017 5:19:05h Locus-specific statistic for selection: Chi-square for differences in allele freq and F values (stat.code = 14) Initial cell count = 0.00. Number of SNPs = 200 Number of permutation samples = 19999+1. Smallest possible p-value = 0.000050 rank SNP# Stat Sum p0 p pSum ch position SNP 1 2 15.0294 15.0294 0.00015 0.03060 0.03060 8 200310 rs200310 2 190 10.6404 25.6698 0.00315 0.28560 0.02415 8 202190 rs202190 3 7 10.2412 35.9109 0.00390 0.34690 0.01345 8 200360 rs200360 4 74 7.9087 43.8197 0.00295 0.80190 0.01390 8 201030 rs201030 5 5 7.5414 51.3611 0.00545 0.86305 0.01305 8 200340 rs200340 6 8 7.2514 58.6125 0.01975 0.90525 0.01165 8 200370 rs200370 7 96 6.4601 65.0726 0.01615 0.97555 0.01160 8 201250 rs201250 8 73 6.2648 71.3375 0.03445 0.98530 0.01155 8 201020 rs201020 9 15 5.2777 76.6152 0.02310 0.99940 0.01390 8 200440 rs200440 10 71 5.1003 81.7155 0.06775 0.99970 0.01600 8 201000 rs201000 Gr 1 Gr 2 ------------------------------------ Total #observations 250 250 p0 = sig. level, uncorrected for multiple testing p = sig. level of given statistic, corrected (equivalent to Bonferroni correction for independent tests) (p0Stat and pStat include observed data as a null dataset) pSum = sig. level of sum statistic, corrected Starting seed = 4109687182758328260 Final p-value = 0.027101 Current time: 9 Oct 2017 5:20:16h
The smallest SNP-specific p-value (corrected for multiple testing) is 0.0298 for SNP #2 (rs200310). The smallest p-value for any of the 20 sums is 0.0126, which is associated with an experiment-wise p-value of 0.0321. In this case, constructing sums has not gained anything as 0.0321 > 0.0298.
The file, statout.txt, will contain values of permutation samples. It is required for calculation of the overall p-value and will be deleted at program termination, so the user will generally not see it.
You cannot simply click on program names. Instead you need to open a command (DOS) box, for example, by clicking on Start and then on Command Prompt (in Windows XP). In earlier Windows versions, click on Start, then on Run... and type cmd. It is a good idea to change one of the standard features in Windows: Make sure you see extensions of known file types on your screen; for example, you should see sumstat.exe, not just sumstat.
The program is run by typing sumstat followed on the same line by the name of the parameter file. For example, you type sumstat sumstat.param; any name for the parameter file will be fine as long as it conforms to the rules shown above. If the parameter file is sumstat.param then you may simply type sumstat, which implies a parameter file by the name of sumstat.param.
Code 1. Consider a 2 x 3 contingency table, where rows correspond to controls and cases, and columns represent the three genotypes of a given SNP, while the cells contain numbers of individuals. The sumstat program will construct such a table for each SNP and compute chi-square (2 df). With small numbers of observations, the initial cell count may be set to 0.5 rather than 0 for stability (Berkson 1955).
Code 2. Test statistic is the difference in mean genotype codes between cases and controls (1-sided or 2-sided).
Code 7. Proportion of homozygotes in cases versus controls (1-sided or 2-sided).
Code 8. Chi-square for proportion of homozygotes in cases versus controls (1-sided or 2-sided).
Code 10. Armitage’s test for trend (Agresti 2002).
Code 11. t-test carried out on genotype codes as “quantitative measurements”. This statistic tests whether mean genotype codes are different in cases and controls. It may be more appropriate for small minor allele frequencies than statistic #2 as the variance is then small so that mean differences are “enlarged”.
Code 12. Chi-square in 2 x 2 table, rows = cases and controls, columns = two SNP alleles
Code 14. Chi-square for differences, cases versus controls, of allele frequencies and F value (alpha, inbreeding coefficient) (Zhang et al 2008)
Code 16. For a given SNP (called the test SNP), data are divided into 3 groups depending on the genotype at this SNP. The sequence number (order in which the SNP is listed in the datafile) of the test SNP must be indicated on line 4. Then an association analysis (code 1) is carried out for any other SNP (called target SNP) in each of the 3 groups of individuals. The resulting 3 chi-squares are independent and are either (a) summed up for a resulting chi-square with 6 df, or (b) the maximum of the 3 chi-squares is retained and its associated p-value corrected for 3 tests. If the target SNP number is provided as a positive number then (b) is carried out, if it is negative then (a) is carried out. For example, if line 4 contains the number 311 that means that the 311-th SNP should be the test SNP. This analysis tests for the main effect of the target SNP and its interaction effect of the test SNP (Wang et al 2010).
Code 17. Analogous to code 16 (target SNP number is always positive) but the test statistic only reflects the interaction between the test and target SNPs. This interaction term is obtained by partitioning chi-square into main and interaction effects and retaining only the interaction term (Yang et al 2009).
Code 18. Two 2 x 2 tables of numbers of individuals are constructed, one for recessive inheritance (genotypes AA+AB versus BB) and one for dominant inheritance (genotypes AA versus AB + BB), with rows corresponding to cases/controls. The larger of the resulting two chi-square is the test statics; its p-value will be corrected for multiple testing, that is, the reported p-value is p(2 – p) (Bonferroni, 2 tests).
Code 21. For each SNP, the risk allele is defined as the allele leading to an odds ratio > 1. The sum of risk alleles in cases is compared with those in controls (Ott and Sun 2012).
The following data set (Sample.dat) was generated on the computer: 500 SNP markers, 250 cases, 250 control individuals, 10 susceptibility loci at SNPs 1 through 10 obtained via Hartl and Clark's liability threshold model, heritability for all 10 disease loci combined = 0.50, population trait prevalence = 0.01. An input parameter file, sumstat.param, is provided. With a particular run of 5000 permutation replicates, the output collected in the file SampleResults.out has been obtained.
Agresti A (2002) Categorical data analysis, 2nd edition. New York, Wiley-Interscience
Berkson J (1955) Maximum likelihood and minimum chi-square estimates of the logistic function. J Amer Statist Assoc 50, 130-162
Chung JH, Fraser DAS (1958) Randomization tests for a multivariate two sample problem. J Amer Statist Assoc 53:729-735
de Quervain DJ, Poirier R, Wollmer MA, Grimaldi LM, Tsolaki M, Streffer JR, Hock C, Nitsch RM, Mohajeri MH, Papassotiropoulos A (2004) Glucocorticoid-related genetic susceptibility for Alzheimer's disease. Hum Mol Genet 13:47-52 [an example of the use of sum statistics]
Edgington ES, Onghena P (2007) Randomization Tests, 4th edition. Chapman & Hall/CRC, Boca Raton
Hoh J, Ott J (2003) Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet 4:701-709.
Hoh J, Wille A, Ott J (2001) Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res 11:2115-2119. [main reference for the method described here]
Hoh J, Wille A, Zee R, Cheng S, Reynolds R, Lindpaintner K, Ott J (2000) Selecting SNPs in two-stage analysis of disease association data: a model-free approach. Ann Hum Genet 64:413-417. [a nested bootstrap approach to SNP selection]
Kim S, Zhang K, Sun F (2003) Detecting susceptibility genes in case-control studies using set association. BMC Genet 4 Suppl 1:S9 [documents increased power of sum statistics]
Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, Sangiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J, Barnstable C, Hoh J (2005) Complement Factor H Polymorphism in Age-Related Macular Degeneration. Science 308:385-389
Manly BFJ (2006) Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall/CRC, New York
Nijenhuis A, Wilf HS (1978) Combinatorial algorithms for computers and calculators. Academic Press, New York. This book is now freely available as a pdf file.
Ott J, Hoh J (2003) Set association analysis of SNP case-control and microarray data. J Comput Biol 10:569-574
Ott J, Sun D (2012) Multilocus association analysis under polygenic models. Int J Data Ming and Bioinformatics 6, 482-489
Papassotiropoulos A, Wollmer MA, Tsolaki M, Brunner F, Molyva D, Lutjohann D, Nitsch RM, Hock C (2005) A cluster of cholesterol-related genes confers susceptibility for Alzheimer's disease. J Clin Psychiatry 66:940-947 [an example of the use of sum statistics]
Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007) Numerical recipes 3rd edition: The art of scientific computing. Cambridge University Press, Cambridge, UK; New York
Wang G, Yang Y, Ott J (2010) Genome-wide conditional search for epistatic disease-predisposing variants in human association studies. Hum Hered 70, 34-41
Weir BS (1996) Genetic data analysis II: methods for discrete population genetic data. Sinauer Associates, Sunderland, Mass.
Wille A, Hoh J, Ott J (2003) Sum statistics for the joint detection of multiple disease loci in case-control association studies with SNP markers. Genet Epidemiol 25:350-359.
Yang Y, He C, Ott J (2009) Testing association with interactions by partitioning chi-squares. Ann Hum Genet 73, 109-117
Zee RY, Hoh J, Cheng S, Reynolds R, Grow MA, Silbergleit A, Walker K, Steiner L, Zangenberg G, Fernandez-Ortiz A, Macaya C, Pintor E, Fernandez-Cruz A, Ott J, Lindpainter K (2002) Multi-locus interactions predict risk for post-PTCA restenosis: an approach to the genetic analysis of common complex disease. Pharmacogenomics J 2:197-201
Zhang Q, Wang S, Ott J (2008) Combining identity by descent and association in genetic case-control studies. BMC Genet 9, 42