Josephine Hoh

Jurg Ott

Yale University, New Haven

Rockefeller University New York

josephine.hoh@yale.edu

ott@rockefeller.edu

22 May 2016

http://lab.rockefeller.edu/ott/

Scan statistics in gene mapping

(scanstat computer program)

Methodology

The general approach is described in Hoh and Ott (2000). The current program version corrected an error that occurred when the length of the scan statistic exceeded the number of markers in any chromosome.

Briefly, consider a number of marker loci in the genome. At each marker, genotypes are available for two types of observations, for example:

The general idea is to find a set of contiguous SNPs that jointly are associated with disease. We do this by initially computing an association statistic for each SNP, for example, χ2 for a 2 × 3 table, where the two rows correspond to cases and controls, and the three columns refer to SNP genotypes. Then a window of a given length is chosen, that is, a number L of consecutive SNPs on a chromosome, where L is also called the window size. For example, a window may extend from SNP # 5 through SNP # 10 (L = 6). A window of a given size is now moved across the genome one SNP at a time. For each window, the sum over association statistics of all SNPs in the window is computed. The largest such sum over all windows is defined as the scan statistic of length L (Glaz and Balakrishnan 1999). An associated significance level pL is computed with permutation samples (Manly 2006). We perform this action for all windows of a chosen minimum and maximum length, for example, from L = 100 through L = 2000. The smallest value of pL encountered for all window sizes is our genome-wide test statistic, for which we compute an associated significance level, called "Final p-value" in the output. This is analogous to what we do in our sumstat programs.

Notes for Linux users

Instructions given here are specific for Ubuntu Linux but should work in most Linux environments. Download the scanstat package and extract all files. Delete scanstat.exe. Compilable files are *.pas while the *.p files are include files that will be read by the main program upon compilation. The scanstatLinux file is executable code for Linux with some settings but you may want to create executable code with settings for your specific needs. Relevant settings are as follows and may be found in the CONST section of the sumstat.pas file:

maxobs = 5000;     {maximum number of observations}
maxvar = 800000; {max. number of SNPs excl. responses}
maxsum = 20000; {max. length of scan statistic}
howoften = 1000; {after how many random samples to write to screen}
maxperm = 100000; {max # permutations}

The program will allocate memory for as many individuals (cases and controls) as are present in the data, but for each individual it will allocate memory for as many variables (SNPs) as given by the value of maxvar. Thus, to keep memory requirements low you may want to change the value of maxvar to something just slightly higher than the actual number of SNPs in your data. Do this by opening the scanstat.pas file in your text editor and changing numbers as desired. Then compile the program in a terminal window (ctrl-alt-T) by typing fpc scanstat.pas, which should take only a few seconds. Disregard warnings about code not being accessible or about “link.res containing output sections”. You may check the compiled version by typing ./scanstat. If everything looks satisfactory, make the program accessible (put it in the path) by typing sudo mv scanstat /usr/local/bin, which will move the scanstat program file to a folder in the path. Verify your actions by typing scanstat, which should invoke the program, but it will not carry out calculations as you have not yet furnished a parameter file name (see elsewhere in this document).

Input and output for scanstat programs

The scanstat program works with the following text files.

1. Data file

This input file (named in the parameter file below) should consist of a matrix with K + 1 rows and N columns, where N = number of observations (e.g., case and control individuals) and K = number of input variables (eg. SNPs). Each cell in the matrix contains a genotype code. The last row contains the code for the type of individual (e.g., 2 = case and 1 = control). Optionally, the N input observations on each row (for a given marker) may be followed by three quantities: (1) Chromosome number, (2) position in base pairs of the marker on the chromosome, and (3) marker name. These items will allow for an easy ordering of markers by their chromosomal positions, for example, with the sort command in Windows. On the last input line, you may use the following dummy values: (1) 99, (2) 99, (3) xx.

NOTE: Do not use marker names starting with x, &, %, $, followed by digits or else the program will interpret these as numbers to a base other than 10. For example, x11 will be interpreted as a hexadecimal number equal to (decimal) 17. If you must use x plus digits then use xx plus digits, for example, xx11. The program will recognize this as text, which it must do to properly count individuals.

Your data may be in
plink format, in which case you may want to use the p2s program to convert plink files to sumstat / scanstat format.

2. Parameter file

The scanstat program reads parameter values from a file, here called a parameter file. An example for such a parameter file (Sample.par, included in program package, the same file as in the sumstat package) is as follows:

scanstat: Sample data, generated
0 0 0 codes for genotypes, missing, #obs
10 1 0.5 Statistic code, lambda, initial # obs in tables
1 100 min. and max. length of scan statistic
10000 number of random samples. Below: Input/output file names
Sample.dat
SampleResults.out


EXPLANATIONS

Line 1: Any text, truncated to 250 characters

Line 2:
  Code for genotypes
  Code for missing
  Number of observations, N. Enter 0 if program should find N. It determines N from the first input line that should end with (1) chromosome number, (2) bp position, and (3) SNP ID. If (1) - (3) are missing then only numbers are present on this line. If (1) - (3) are present then the program recognizes (3) because it starts with a letter. If that is not the case and the SNP ID is a pure number then the program will take it to be a genotype code, which will generally lead to an error message. If such an error occurs at the end of the line, at item k, then the number of observations is k-3, which should be entered here.

Line 3:
  Code for single-locus test statistic (see section Test Statistics below)

Line 4:
  Minimum length (number of SNPs) of scan statistic
  Maximum length of scan statistic
  Stepsize from min to max; may be 1

Line 5:
  Number of permutation samples

Line 6:
  Input file name

Line 7:
  Input file name. If this line is blank the output file name will be "scanstat.out"
----------------------------------------
Statistic codes on line 3
(below, "type 2 observations" and "cases" are used synonymously):

2. Difference between mean in type 2 minus mean in type 1 genotype codes. SEE NOTE BELOW.

5. t-test. SEE NOTE BELOW.

7. Differences in allele frequencies and inbreeding coefficient (FP test)between cases and controls. SEE NOTE BELOW.

8. Difference of homozygote frequencies cases-controls. SEE NOTE BELOW.

9. Chi-square for 2 x 3 table of genotypes

10. Chi-square for 2 x 2 table of alleles

NOTE (statistic codes 2, 5, 7, and 8)
When applicable, absolute differences between the two groups are computed (two-sided test). But when the statistic code is given as a negative number then the difference type 2 minus type 1 observations will be computed. For example, the test statistic with a code of 5 is |t|. A code of -8 will test whether the proportion of homozygotes is larger in cases than controls, and a code of 8 will compute chi-square for the difference in homozygote frequencies between cases and controls. Also, a test code of -7 will test whether Fcase > Fctrl.

3. Seed file (optional)

An input file called seed.txt holds a seed for the random number generator. This seed must be a positive integer number. At program termination, this file will be overwritten with the ending seed so that seeds are always updated from one to the next program run. We implemented in our Pascal programs the newest random number generator, ran (int64), highly recommended by Press et al (2007), which has a period of approximately 1057. We are grateful to Dr. Quan Long for helping us understand the C-code of this random number generator. If no seed.txt file is present when the program starts it will create one based on the system time.

4. Output file

The output file will be named sumstat.out or whatever name you have chosen on the last line of the parameter file. Most of the output is self-explanatory. The main table contains the following items:

With the given parameter file, the output file looks as follows:

Program Scanstat version 15 Dec 2012

scanstat: Sample data, generated
Input file = Sample.dat
Current time: Wed  9 Jan 2013   8:54:28h

Single-locus statistic: chi-square for 2x2 table of alleles (statis=10)
Lambda used = 1.0000

     L     no1     no2         stat   p-value CH         SNP1         SNP2         BP1         BP2    BPdist
     1       2       2      14.9789  0.022798  8     rs200310     rs200310      200310      200310         0
     2       2       3      18.6992  0.014299  8     rs200310     rs200320      200310      200320        10
     3       2       4      18.7991  0.036796  8     rs200310     rs200330      200310      200330        20
     4       2       5      26.3249  0.005299  8     rs200310     rs200340      200310      200340        30
     5       2       6      27.5242  0.005499  8     rs200310     rs200350      200310      200350        40
     6       2       7      36.1817  0.000300  8     rs200310     rs200360      200310      200360        50
     7       2       8      43.6835  0.000100  8     rs200310     rs200370      200310      200370        60
     8       2       9      44.2468  0.000100  8     rs200310     rs200380      200310      200380        70
     9       2      10      44.4518  0.000200  8     rs200310     rs200390      200310      200390        80
    10       2      11      46.0838  0.000300  8     rs200310     rs200400      200310      200400        90
    11       1      11      46.1623  0.000500  8     rs200300     rs200400      200300      200400       100
    12       2      13      46.4892  0.001000  8     rs200310     rs200420      200310      200420       110
    13       2      14      47.2797  0.001000  8     rs200310     rs200430      200310      200430       120
    14       2      15      52.3956  0.000200  8     rs200310     rs200440      200310      200440       130
    15       2      16      53.1694  0.000300  8     rs200310     rs200450      200310      200450       140
    16       2      17      53.3079  0.000500  8     rs200310     rs200460      200310      200460       150
    17       2      18      53.5084  0.001100  8     rs200310     rs200470      200310      200470       160
    18       2      19      53.7955  0.001500  8     rs200310     rs200480      200310      200480       170
    19       1      19      53.8740  0.002200  8     rs200300     rs200480      200300      200480       180
    20       2      21      54.3839  0.002600  8     rs200310     rs200500      200310      200500       190
    21       1      21      54.4624  0.004400  8     rs200300     rs200500      200300      200500       200
    22       2      23      55.7795  0.004600  8     rs200310     rs200520      200310      200520       210
    23       2      24      57.6830  0.003900  8     rs200310     rs200530      200310      200530       220
    24       2      25      58.4719  0.004800  8     rs200310     rs200540      200310      200540       230
    25       2      26      58.9263  0.005699  8     rs200310     rs200550      200310      200550       240
    26       2      27      63.4201  0.002800  8     rs200310     rs200560      200310      200560       250
    27       2      28      63.5091  0.003400  8     rs200310     rs200570      200310      200570       260
    28       2      29      63.9952  0.004400  8     rs200310     rs200580      200310      200580       270
    29       1      29      64.0737  0.006499  8     rs200300     rs200580      200300      200580       280
    30       2      31      66.3675  0.005399  8     rs200310     rs200600      200310      200600       290
    31       2      32      67.9321  0.004600  8     rs200310     rs200610      200310      200610       300
    32       2      33      68.1151  0.006499  8     rs200310     rs200620      200310      200620       310
    33       2      34      68.8482  0.007499  8     rs200310     rs200630      200310      200630       320
    34       2      35      69.5211  0.007999  8     rs200310     rs200640      200310      200640       330
    35       2      36      69.9737  0.010799  8     rs200310     rs200650      200310      200650       340
    36       2      37      71.8021  0.008499  8     rs200310     rs200660      200310      200660       350
    37       2      38      72.6697  0.009599  8     rs200310     rs200670      200310      200670       360
    38       1      38      72.7483  0.013799  8     rs200300     rs200670      200300      200670       370
    39       2      40      72.9596  0.018498  8     rs200310     rs200690      200310      200690       380
    40       2      41      73.0858  0.022698  8     rs200310     rs200700      200310      200700       390
    41       2      42      73.1996  0.029497  8     rs200310     rs200710      200310      200710       400
    42       2      43      73.3545  0.036696  8     rs200310     rs200720      200310      200720       410
    43       2      44      73.5028  0.047695  8     rs200310     rs200730      200310      200730       420
    44       1      44      73.5813  0.058994  8     rs200300     rs200730      200300      200730       430
    45       2      46      75.7977  0.048895  8     rs200310     rs200750      200310      200750       440
    46       2      47      77.3556  0.046195  8     rs200310     rs200760      200310      200760       450
    47       2      48      77.4608  0.055894  8     rs200310     rs200770      200310      200770       460
    48       1      48      77.5393  0.069193  8     rs200300     rs200770      200300      200770       470
    49       2      50      77.9411  0.077692  8     rs200310     rs200790      200310      200790       480
    50       1      50      78.0196  0.093991  8     rs200300     rs200790      200300      200790       490
    51       2      52      81.1917  0.064894  8     rs200310     rs200810      200310      200810       500
    52       1      52      81.2702  0.079692  8     rs200300     rs200810      200300      200810       510
    53       2      54      81.3527  0.096590  8     rs200310     rs200830      200310      200830       520
    54       2      55      81.7064  0.111289  8     rs200310     rs200840      200310      200840       530
    55       1      55      81.7850  0.130487  8     rs200300     rs200840      200300      200840       540
    56       2      57      81.9338  0.147785  8     rs200310     rs200860      200310      200860       550
    57       2      58      82.1298  0.165683  8     rs200310     rs200870      200310      200870       560
    58       2      59      83.0357  0.172183  8     rs200310     rs200880      200310      200880       570
    59       2      60      84.5707  0.159284  8     rs200310     rs200890      200310      200890       580
    60       1      60      84.6492  0.183882  8     rs200300     rs200890      200300      200890       590
    61       1      61      84.6589  0.212079  8     rs200300     rs200900      200300      200900       600
    62       2      63      84.6988  0.240576  8     rs200310     rs200920      200310      200920       610
    63       2      64      87.3358  0.198980  8     rs200310     rs200930      200310      200930       620
    64       2      65      89.4274  0.169683  8     rs200310     rs200940      200310      200940       630
    65       1      65      89.5059  0.193681  8     rs200300     rs200940      200300      200940       640
    66       2      67      91.4288  0.172783  8     rs200310     rs200960      200310      200960       650
    67       2      68      91.7665  0.188881  8     rs200310     rs200970      200310      200970       660
    68       2      69      91.9742  0.211679  8     rs200310     rs200980      200310      200980       670
    69       1      69      92.0527  0.238276  8     rs200300     rs200980      200300      200980       680
    70       2      71      96.8998  0.145285  8     rs200310     rs201000      200310      201000       690
    71       2      72      98.0896  0.146785  8     rs200310     rs201010      200310      201010       700
    72       1      72      98.1681  0.164384  8     rs200300     rs201010      200300      201010       710
    73       2      74     106.0232  0.064494  8     rs200310     rs201030      200310      201030       720
    74       2      75     109.4331  0.044996  8     rs200310     rs201040      200310      201040       730
    75       1      75     109.5116  0.053195  8     rs200300     rs201040      200300      201040       740
    76       2      77     109.5659  0.062194  8     rs200310     rs201060      200310      201060       750
    77       1      77     109.6445  0.072993  8     rs200300     rs201060      200300      201060       760
    78       1      78     109.6670  0.083692  8     rs200300     rs201070      200300      201070       770
    79       2      80     112.3526  0.067893  8     rs200310     rs201090      200310      201090       780
    80       2      81     113.3582  0.067893  8     rs200310     rs201100      200310      201100       790
    81       2      82     114.8295  0.065093  8     rs200310     rs201110      200310      201110       800
    82       2      83     115.0570  0.073693  8     rs200310     rs201120      200310      201120       810
    83       2      84     115.9673  0.075692  8     rs200310     rs201130      200310      201130       820
    84       1      84     116.0459  0.085391  8     rs200300     rs201130      200300      201130       830
    85       2      86     116.4058  0.090891  8     rs200310     rs201150      200310      201150       840
    86       2      87     116.5025  0.103690  8     rs200310     rs201160      200310      201160       850
    87       2      88     117.1280  0.108789  8     rs200310     rs201170      200310      201170       860
    88       2      89     118.8707  0.099090  8     rs200310     rs201180      200310      201180       870
    89       2      90     119.8328  0.101590  8     rs200310     rs201190      200310      201190       880
    90       1      90     119.9114  0.111189  8     rs200300     rs201190      200300      201190       890
    91       1      91     119.9501  0.126487  8     rs200300     rs201200      200300      201200       900
    92       2      93     120.5025  0.134287  8     rs200310     rs201220      200310      201220       910
    93       2      94     122.1530  0.124988  8     rs200310     rs201230      200310      201230       920
    94       2      95     122.2851  0.139586  8     rs200310     rs201240      200310      201240       930
    95       2      96     128.7142  0.075092  8     rs200310     rs201250      200310      201250       940
    96       2      97     129.4820  0.076792  8     rs200310     rs201260      200310      201260       950
    97       1      97     129.5606  0.085391  8     rs200300     rs201260      200300      201260       960
    98       1      98     129.5606  0.097290  8     rs200300     rs201270      200300      201270       970
    99       2     100     129.7445  0.106989  8     rs200310     rs201290      200310      201290       980
   100       2     101     130.4134  0.109989  8     rs200310     rs201300      200310      201300       990


L = number of SNPs in scan statistic (its length)
no1 = number of first SNP in scan statistic
no2 = number of last SNP in scan statistic
stat = scan statistic of given length, L
p-value = empirical significance level, genome-wide
CH = chromosome number
SNP1 = ID of first SNP; SNP2 = ID of last SNP
BP1 = base pair position of first SNP; BP2 for last SNP

Number of permutation samples = 10000. Smallest possible p-value = 0.000100
Number of SNPs = 200

Number of observations in group 1:   250
Number of observations in group 2:   250
     Total number of observations:   500
Above numbers include missing observations.
Code for missing = 0

Starting seed = 91201385428578

Final p-value = 0.002000
Current time: Wed  9 Jan 2013   8:57:10h

The best single-locus test (L = 1) has a p-value of 0.0228 and the best scan statistic (at L = 7/8) has p = 0.0001. To allow for the fact that the latter was chosen among 100 scan statistics, we computed a final p-value of 0.0020. As 0.0020 < 0.0228, it has been beneficial to apply scan statistics – they provided an improvement over testing one SNP at a time.

The file, statout.txt, will contain values of permutation samples. It is required for calculation of the overall p-value and will be deleted at program termination, so the user will generally not see it.

Running the program

You cannot simply click on program names. Instead you need to open a command (DOS) box, for example, by clicking on Start and then on Command Prompt (in Windows XP). In earlier Windows versions, click on Start, then on Run... and type cmd. It is also a good idea to change one of the standard features in Windows: Make sure you see extensions of known file types on your screen; for example, you should see scanstat.exe, not just scanstat.

The program is run by typing scanstat followed on the same line by the name of the parameter file. For example, you type scanstat Sample.par.

Test statistics

Code 2. Test statistic is the difference in mean genotype codes between cases and controls (1-sided or 2-sided).

Code 5. t-test carried out on genotype codes as “quantitative measurements”. This statistic tests whether on average genotype codes in cases are different from those in controls. It may be more appropriate for small minor allele frequencies than  than statistic #2 as the variance is then small so that mean differences are “enlarged”.

Code 8. Chi-square for proportion of homozygotes in cases versus controls (1-sided or 2-sided).

Code 9. Consider a 2 x 3 contingency table, where rows correspond to controls and cases, and columns represent the three genotypes of a given SNP, while the cells contain numbers of individuals. The scanstat program will construct such a table for each SNP and compute chi-square (2 df). With small numbers of observations, the initial cell count may be set to 0.5 rather than 0 for stability (Berkson 1955).

Code 10. Chi-square in 2 x 2 table, rows = cases and controls, columns = two SNP alleles

Sample data

The following data set (Sample.dat) was generated on the computer: 500 SNP markers, 250 cases, 250 control individuals, 10 susceptibility loci at SNPs 1 through 10 obtained via Hartl and Clark's liability threshold model, heritability for all 10 disease loci combined = 0.50, population trait prevalence = 0.01. An input parameter file, Sample.par, is provided. With a particular run of 10,000 permutation replicates, the output collected in the file SampleResults.out has been obtained.

References

Berkson J (1955) Maximum likelihood and minimum chi-square estimates of the logistic function. J Amer Statist Assoc 50, 130-162

Glaz J, Balakrishnan N (1999) Introduction to scan statistics. In: Glaz J, Balakrishnan N (eds) Scan statistics and applications: Statistics for Industry and Technology. Birkhäuser, Boston, pp 3-26

Hoh J, Ott J (2000) Scan statistics to scan markers for susceptibility genes. Proc Natl Acad Sci USA 97:9615-9617.

Manly BFJ (2006) Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall/CRC, New York

Ott J, Hoh J (2012) Scan statistics in human gene mapping. Am J Hum Genet 91, 970 (letter)

Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007) Numerical recipes 3rd edition: The art of scientific computing. Cambridge University Press, Cambridge, UK; New York