Manual for PTest program (Partition Test)

Jurg Ott / 20 Jan 2013
Institute of Psychology, CAS, Beijing
Rockefeller University, New York

Introduction

This program implements the Partition Test (PTest) [1] for case-control genetic association studies in human gene mapping. The PTest is based on single-SNP p-values resulting from an association test, for example, the chi-square test comparing genotype or allele frequencies in cases and controls. For a given threshold, pT, of these p-values, the observed number of p-values below the threshold are compared with the number expected under no association, where the latter is given by pT × N, and N is the total number of SNPs in the study. Observed and expected numbers are compared with a normal deviate, Z = √(chi-square), which is set equal to zero when the observed number is smaller than the expected number. Such Z values are obtained for a range of thresholds taking a number of steps such as 1000 from a given maximum such as 0.10 down to zero. The largest Z value, Zmax, is taken to be the overall test statistic. In a number of permutation (randomization) samples [2], the significance level associated with Zmax is estimated by the proportion of randomization samples with Zmax at least as large as the one in the observed dataset, where the latter is counted as one of the null samples. Thus, with n permutation samples, the smallest possible overall significance level is given by 1/(1 + n). The program is written in Free Pascal (Linux or Windows, 32 bit or 64 bit) and may be downloaded as a program package including sample input files. It comes in two versions: PTest allocates memory to hold genotypes and randomization samples while PTestS reads genotypes again and again for each new randomization sample; the latter is slower than the former but can potentially accommodate larger datasets.

Notes for Linux users

Instructions given here are specific for Ubuntu Linux but should work in most Linux environments. Download the PTest package and extract all files. Delete PTest.exe and PTestS.exe. The compilable files are *.pas while the *.p files are include files that will be read by the main program upon compilation. To compile the program in Linux you may want to adjust the current parameter settings and then proceed as shown below. Relevant settings are as follows and may be found in the CONST section of the PTest.pas file:

maxobs = 5000; {maximum number of observations}
maxvar = 900000; {max. number of SNPs excl. responses}
maxstep = 10000; {max. number of steps the threshold can take}
howoften = 50; {after how many random samples to write to screen}
maxnamelength=20; {max length of SNP name}

The program will allocate memory for as many individuals (cases and controls) as are present in the data, but for each individual it will allocate memory for as many variables (SNPs) as given by the value of maxvar. Thus, to keep memory requirements low you may want to change the value of maxvar to something just slightly higher than the actual number of SNPs in your data. Do this by opening the PTest.pas file in your text editor and changing numbers as desired. Then compile the program in a terminal window (ctrl-alt-T) by typing fpc PTest.pas, which should take only a few seconds. Disregard warnings about code not being accessible or about “link.res containing output sections”. You may check the compiled version by typing ./PTest. If everything looks satisfactory, make the program accessible (put it in the path) by typing sudo mv PTest /usr/local/bin, which will move the PTest program file to a folder in the path. Verify your actions by typing PTest, which should invoke the program, but it will not carry out calculations as you have not yet furnished a parameter file name (see elsewhere in this document).

Input files

Three input files are required, a datafile holding genotypes and phenotypes (affected or unaffected), a parameter file, and a seed file. The latter is optional; it will be created with a random seed based on the system clock if it is not present.

Data file

The datafile conforms to the rules of the sumstat programs [3], that is, each row corresponds to a SNP and each column to an individual. Each cell in the body of this table represents a genotype code, for example, 1, 2, 3 corresponding to genotypes AA, AB, BB, respectively, with 0 representing unknown (missing). The last row contains codes for affection status, for example, 1 = control and 2 = case (any two numbers will be fine, with the larger of the two numbers standing for “case”). Optionally, the last three columns identify each SNP and represent (1) the chromosome number, (2) position in bp, and (3) SNP symbol. These text files are typically large and cannot generally be viewed in a text editor. Up to a certain file size such as 30-50 kb, the Wordpad program in Windows is recommended. For smaller files up to about 30 MB in size, the Crimson Editor is highly recommended.

For datafiles in
plink format, a small utility program, p2s, may be obtained to convert data files from plink to sumstat format. The p2s package also contains a utility program, copylines, that allows you to cut small portions out of very large text files.

Parameter file

The parameter file holds information about file names, what single-SNP test statistics should be used, and so on. A sample parameter file, sampleP.par, is provided and reproduced below (the same sample data as in the sumstat package). Text beginning with EXPLANATIONS will not be read by the program.

PTest: Sample data, generated, 200 variables. Heritability 0.50, threshold 2.32
0 0 0 0.5          codes for genotypes, missing, #obs, initial #obs in tables
12 1000 0.10 1.1   code for test statistic, #steps, upper limit, lambda
9999               number of permutation samples. Below: Input file name
sampleP.dat
samplePresults.out

EXPLANATIONS

Line 1:    Any text, truncated to 250 characters.

Line 2:
  Code for genotypes given in the data:
    0 = the genotypes are 1, 2, 3
    1 = the genotypes are 0, 1, 2
    2 = the genotypes are -1, 0, 1
  Value for "missing"
  Number of observations, N. Enter 0 if program should find N. It determines N from the first input line that should end with (1) chromosome number, (2) bp position, and (3) SNP ID. If (1) - (3) are missing then only numbers are present on this line. If (1) - (3) are present then the program recognizes (3) because it starts with a letter. If that is not the case and the SNP ID is a pure number then the program will take it to be a genotype code, which will generally lead to an error message. If such an error occurs at the end of the line, at item k, then the number of observations is k-3, which should be entered here.
  Initial number of observations in contingency tables. For sparse tables, 0.5 is recommended for stability.

Line 3:
  Code for single-locus test statistic (see below) to be used
  Number of steps of p to take
  Upper limit of p-value (0.2 recommended)

Line 4:
  Number of permutation samples

Line 5:
  Name of datafile. No other characters before or after the name.

Line 6:
  Name of output file. No other characters before or after the name.
  If this line is blank (empty) the output file will be "PTest.out".


----------------------------------------
Codes on line 4:
 1 = chi-square for 2 x 3 table, case-control versus 3 genotypes
 2 = TDT analysis; must read plink tdt output file
10 = test for trend
12 = chi-square for 2 x 2 table of alleles, case-control versus 2 alleles

Chi-square (codes 1 and 12) is generally computed from the likelihood ratio (LR) statistic. If these codes are entered as negative numbers, for example, -1 instead of 1, then the Pearson chi-square will be computed.

Seed file (optional)

A text file called seed.txt is required for the random number generator. It holds a single positive integer number. After completion of a run, the ending random number will be written into this file so that each run starts with a new random number. If this file is missing it will be created by the program based on the system clock.

Running the PTest program

The program must be run in a Windows CMD (command) window, with analogous application in Linux. Open such a window and move to the folder holding input files as given above. If you do not see file extensions (for example, exe), it is recommended that you make appropriate changes to your system so you will see file extensions. With the input file names given above, the program is run by typing PTest sampleP.par. Experience shows that the PTest is preferentially applied to data after rigorous quality control requiring, for example, a call rate for SNPs of at least 99% (SNPs with lower call rates to be deleted by the user; the PTest program will not perform any deletions).

Output files

The program will produce three output files whose names depend on input line 6 in the parameter file. Assume that the output file name given there is samplePresults.out. Then the three files are:

samplePresults.out

This is the basic output file holding summary results. With the parameter file shown above, the program furnishes the following output:

Program PTest version 15 Jan 2013
maxstatP: Sample data, generated, 200 variables. Heritability 0.50, threshold 2.32
Input file = sampleP.dat
Locus-specific statistic:
LR chi-square for 2 x 2 table, case-control versus 2 alleles (statis=12)
Lambda = 1.1000 Upper limit of p = 0.1000
Initial value of empty cells in contingency tables = 0.50
Initial value of cells in lik. calculations = 0.00
200 SNPs present
Number of permutation samples = 1999. P-values include observed dataset.
Smallest possible p-value = 1/(1999+1) = 0.000500
Max. chi-square occurs at pLimit = 0.00910000, step 91 of 1000 steps
5 SNPs below best threshold.
Associated global p-value = 137/2000 = 0.068500
Gr 1 Gr 2
------------------------------------
Total #observations 250 250
Starting seed = 12219258721316005406
p-values associated with each step:
Step p-value
1 1.000000
2 1.000000
3 0.028000
4 0.039000
5 0.051500
6 0.060500
7 0.071500
8 0.079000
9 0.087500
10 0.097000
11 0.108000
12 0.120000
13 0.136000
14 0.149500
15 0.156500
16 0.166500
17 0.175500
18 0.188000
19 0.200500
20 0.207500
21 0.218500
22 0.230500
23 0.243500
24 0.250000
25 0.258500
26 0.271000
27 0.276000
28 0.286500
29 0.292000
30 0.302000
31 0.323000
32 0.336000
33 0.343500
34 0.358000
35 0.359500
36 0.367000
37 0.374000
38 0.384500
39 0.395000
40 0.401500
41 0.409500
42 0.412000
43 0.420500
44 0.433000
45 0.438500
46 0.445500
47 0.458000
48 0.470500
49 0.476000
50 1.000000
51 0.140000
52 0.144000
53 0.147000
54 0.154000
55 0.160500
56 0.170000
57 0.174500
58 0.180500
59 0.187000
60 0.191500
61 0.197000
62 0.199000
63 0.204000
64 0.207000
65 0.215500
66 0.221000
67 0.231500
68 0.234000
69 0.242500
70 0.245000
71 0.254500
72 0.260000
73 0.261500
74 0.270500
75 0.273500
76 0.075000
77 0.077500
78 0.080500
79 0.085500
80 0.089000
81 0.089500
82 0.090000
83 0.091500
84 0.096500
85 0.099500
86 0.102500
87 0.103500
88 0.106000
89 0.109000
90 0.029000
91 0.007500
92 0.008500
93 0.008500
94 0.009500
95 0.010000
96 0.011000
97 0.011500
98 0.012000
...
991 1.000000
992 1.000000
993 0.250500
994 0.254500
995 0.254500
996 0.254500
997 0.254500
998 0.254500
999 0.254500
1000 1.000000

These results state that the largest test statistic (comparing observed and expected numbers of p-values) occurred at p-value threshold of pT = 0.0091 and that 5 SNPs have their individual p-values at or below this threshold. The overall significance level for this dataset is 0.0685. Thus, there is suggestive evidence that these 5 SNPs are susceptibility variants for the disease studied.

samplePresults.out1

For each step that the p-value threshold takes, this file records observed (#pObs) and expected numbers (#pExp) of single-SNP p-values at or below the current step. For example, if the upper limit of the threshold is set at 0.10 and a number of 100 steps should be taken, then each step size is 0.10/100 and, with a total number of n SNPs, the expected number of SNPs at the lowest step is given by n × 0.10/1000. With the parameter settings as given, the first 100 lines of this file are as follows:

Step pLimit #pObs #pExp chi-sq Z-value
1 0.000100 0 0.02 0.00 0.0000
2 0.000200 0 0.04 0.00 0.0000
3 0.000300 1 0.06 3.75 1.9368
4 0.000400 1 0.08 3.22 1.7932
5 0.000500 1 0.10 2.81 1.6761
6 0.000600 1 0.12 2.48 1.5762
7 0.000700 1 0.14 2.22 1.4886
8 0.000800 1 0.16 1.99 1.4102
9 0.000900 1 0.18 1.79 1.3390
10 0.001000 1 0.20 1.62 1.2736
11 0.001100 1 0.22 1.47 1.2130
12 0.001200 1 0.24 1.34 1.1563
13 0.001300 1 0.26 1.22 1.1031
14 0.001400 1 0.28 1.11 1.0529
15 0.001500 1 0.30 1.01 1.0052
16 0.001600 1 0.32 0.92 0.9598
17 0.001700 1 0.34 0.84 0.9164
18 0.001800 1 0.36 0.77 0.8748
19 0.001900 1 0.38 0.70 0.8349
20 0.002000 1 0.40 0.63 0.7965
21 0.002100 1 0.42 0.58 0.7594
22 0.002200 1 0.44 0.52 0.7236
23 0.002300 1 0.46 0.47 0.6889
24 0.002400 1 0.48 0.43 0.6552
25 0.002500 1 0.50 0.39 0.6225
26 0.002600 1 0.52 0.35 0.5908
27 0.002700 1 0.54 0.31 0.5599
28 0.002800 1 0.56 0.28 0.5297
29 0.002900 1 0.58 0.25 0.5003
30 0.003000 1 0.60 0.22 0.4717
31 0.003100 1 0.62 0.20 0.4436
32 0.003200 1 0.64 0.17 0.4162
33 0.003300 1 0.66 0.15 0.3894
34 0.003400 1 0.68 0.13 0.3631
35 0.003500 1 0.70 0.11 0.3373
36 0.003600 1 0.72 0.10 0.3121
37 0.003700 1 0.74 0.08 0.2873
38 0.003800 1 0.76 0.07 0.2630
39 0.003900 1 0.78 0.06 0.2391
40 0.004000 1 0.80 0.05 0.2156
41 0.004100 1 0.82 0.04 0.1925
42 0.004200 1 0.84 0.03 0.1698
43 0.004300 1 0.86 0.02 0.1475
44 0.004400 1 0.88 0.02 0.1255
45 0.004500 1 0.90 0.01 0.1038
46 0.004600 1 0.92 0.01 0.0824
47 0.004700 1 0.94 0.00 0.0614
48 0.004800 1 0.96 0.00 0.0406
49 0.004900 1 0.98 0.00 0.0202
50 0.005000 1 1.00 0.00 0.0000
51 0.005100 2 1.02 0.74 0.8592
52 0.005200 2 1.04 0.70 0.8369
53 0.005300 2 1.06 0.66 0.8148
54 0.005400 2 1.08 0.63 0.7931
55 0.005500 2 1.10 0.60 0.7716
56 0.005600 2 1.12 0.56 0.7504
57 0.005700 2 1.14 0.53 0.7295
58 0.005800 2 1.16 0.50 0.7088
59 0.005900 2 1.18 0.47 0.6884
60 0.006000 2 1.20 0.45 0.6682
61 0.006100 2 1.22 0.42 0.6483
62 0.006200 2 1.24 0.40 0.6285
63 0.006300 2 1.26 0.37 0.6090
64 0.006400 2 1.28 0.35 0.5897
65 0.006500 2 1.30 0.33 0.5706
66 0.006600 2 1.32 0.30 0.5517
67 0.006700 2 1.34 0.28 0.5330
68 0.006800 2 1.36 0.26 0.5145
69 0.006900 2 1.38 0.25 0.4962
70 0.007000 2 1.40 0.23 0.4780
71 0.007100 2 1.42 0.21 0.4601
72 0.007200 2 1.44 0.20 0.4423
73 0.007300 2 1.46 0.18 0.4246
74 0.007400 2 1.48 0.17 0.4072
75 0.007500 2 1.50 0.15 0.3899
76 0.007600 3 1.52 1.13 1.0632
77 0.007700 3 1.54 1.09 1.0449
78 0.007800 3 1.56 1.05 1.0267
79 0.007900 3 1.58 1.02 1.0086
80 0.008000 3 1.60 0.98 0.9907
81 0.008100 3 1.62 0.95 0.9730
82 0.008200 3 1.64 0.91 0.9554
83 0.008300 3 1.66 0.88 0.9380
84 0.008400 3 1.68 0.85 0.9207
85 0.008500 3 1.70 0.82 0.9036
86 0.008600 3 1.72 0.79 0.8866
87 0.008700 3 1.74 0.76 0.8697
88 0.008800 3 1.76 0.73 0.8530
89 0.008900 3 1.78 0.70 0.8364
90 0.009000 4 1.80 2.01 1.4187
91 0.009100 5 1.82 3.80 1.9487
92 0.009200 5 1.84 3.73 1.9306
93 0.009300 5 1.86 3.66 1.9128
94 0.009400 5 1.88 3.59 1.8950
95 0.009500 5 1.90 3.52 1.8774
96 0.009600 5 1.92 3.46 1.8599
97 0.009700 5 1.94 3.40 1.8426
98 0.009800 5 1.96 3.33 1.8253
99 0.009900 5 1.98 3.27 1.8082
100 0.010000 5 2.00 3.21 1.7913

The largest discrepancy between observed and expected numbers of p-values, as measured by chi-square, occurs at step 91.

samplePresults.outp

This file provides, for each SNP, its individual p-value and chromosomal position. As those SNPs with p-values lower than the overall significant p-value threshold are considered susceptibility SNPs, this output file will allow identifying these SNPs.

Literature

[1] Ott J, Liu Z, Shen Y (2012) Challenging False Discovery Rate: A partition test based on p values in human case-control association studies. Hum Hered 74, 45-50

[2] Manly BFJ (2006) Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall/CRC, New York

[3] Hoh J, Wille A, Ott J (2001) Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res 11:2115-2119