Documentation to maxstatRS program

Jurg Ott / 16 March 2016

Introduction

This program processes data on the Hamming distance between sets of variants in a case individual versus a number of control individuals (Imai et al. 2015). The program package contains source and executable files for Windows and Linux (current version dated 16 Mar 2016). A GUI version is also available.

Consider 1 case individual from a family pedigree, and n control individuals. The program reads HDR values and numbers of pairs (HDR2) furnishing a given HDR value. Values of HDR < 0 and HDR > 1 will be set equal to missing (= -9).

We form all possible pairs of individuals and distinguish those pairs containing the case individual (group 1 pairs) from the pairs consisting of two control individuals (group 2 pairs). For each pair, each candidate variant, and each region (for example, 10K, 20K, , 100K), we compute HDR as the proportion of variants that differ between the two individuals in the pair. Then we compute a t statistic, that is, the scaled mean difference between the group 1 and 2 pairs. The t statistic is then maximized over regions, which results in the variant-specific test statistic, tmax. We generally use a one-sided statistic as the difference, group 1 mean minus group 2 mean, so that only positive values of t are considered significant.

Candidate variants are ranked by their tmax value. Significance levels are obtained from pseudo-samples: Each of the control individuals in turn is taken to be a pseudo-affected individual with all other individuals considered pseudo-controls. In each pseudo-sample the same analysis is carried out as in the observed data. The proportion of pseudo-samples with a test statistic at least as large as the observed test statistic is taken to be the p-value associated with tmax for a given candidate variant.

Empirical significance levels are computed (1) by counting the observed dataset as one-half of a null dataset; this is a compromise between treating the observed data as null data and not considering the observed data among the null datasets. (2) Also, we compute mid-p values (Agresti, 2002) rather than standard p-values but results are almost the same as with standard p-values.

Minimum number of pairs

For a given variant and region, if the average number of HDR2 values for all pairs (case-ctrl and ctrl-ctrl pairs) is less than the critical value (currently = 3) stated in the parameter file, then all HDR values for that variant and region are set equal to missing.

Parameter file

The maxstatRS program must be initiated with a parameter file named on the command line. For example, in a terminal window you type maxstatRS maxstatRS.param, where the second item is the required parameter file described below. An example parameter file is included in the program package; it provides detailed explanations on the various input lines.

Input file

All data, HDR and HDR2 values, are furnished in a single input file (named in the parameter file), in which candidate variants are identified by their sequential number, for example, 1..606. A sample input file is provided in the program package.

Top line: Four numbers, indicating the number of candidate variants, the number of case individuals (currently =1), the number of control individuals, and the number of regions. For example, for family B, this line would be

606 1 41 10

The program will know that there is an affected individual (numbered 0) and 41 control individuals (numbered 1..41). Thus, for each candidate variant and each region, it will expect 42 41/2 = 861 HDR values and the same number of HDR2 values.

Header line: There is a second input line identifying the various columns.

The program will then read two sets of 606 lines each. The first set of 606 lines will contain HDR values, and the second set will contain HDR2 values. Each set will be repeated for each region. The numbers on each line correspond to pairs of individuals (numbered 0..n), where n is the number of control individuals. For example, with n = 4, the 10 pairs must be furnished in the order indicated below, where rows and columns represent individuals numbered 0..4:


0

1

2

3

4

0


1

2

3

4

1



5

6

7

2




8

9

3





10

4






All numbers must be separated by white space (blank or tab). The program will expect the input structure shown above. In summary, for n = 41, the input file will be as follows:

606 41 10 #variants #control_individuals #regions

606 lines, each line containing 861 HDR values, to be repeated for each region

606 lines, each line containing 861 HDR2 values, to be repeated for each region

Thus, in the input file, after the initial line, what changes first from line to line is the variant, then the region, and then the HDR type (HDR/HDR2). For example, after the first 606 variants, the region changes from 10 to 20 while HDR remains =1; after the next 606 variants, the region changes from 20 to 30 with HDR=1, and so on; after the 606 variants with region = 100, the region is set back to 10 while HDR is increased from 1 to 2, and so on.

Regions: The region size is given in kb, for example, 10, 20, and so on. Region sizes should be in multiples of 10.

At a given candidate variant and region, we have a number n of pairs of individuals (e.g., n = 861). For each pair, we compute HDR based on a number k of variants flanking the candidate locus. If the average of the k values (over the n pairs) is less than 3 then we set all HDR values for the given candidate variant and region equal to missing, that is, in effect, we don't use that region for the given candidate variant.

Technical comments

Given are one case (affected) individual and n control (unaffected) individuals. Individuals are numbered as follows: Case = 0, controls = 1, 2, , n. There are N = n(n 1)/2 control-control pairs and T = (n + 1)n/2 pairs total, where N + n = T. In the program, ncase = n, nctrl = N, numobs = ncase + nctrl = T.

Case-control pairs are ordered by the sequence number of control individuals, that is, (case control 1), (case control 2), and so on. This is self-evident but is mentioned here for completeness.

Control-control pairs should on input occur in the following order (they do so now): (1 2), (1 3), , (1 n), (2 3), (2 4), , (2 n), , ( (n-1) n).

Example with n = 4 (N = 6, T = 10): Control-control pairs


0

1

2

3

4

0






1



1

2

3

2




4

5

3





6

4






Example with n = 4 (N = 6, T = 10): All pairs


0

1

2

3

4

0


1

2

3

4

1



5

6

7

2




8

9

3





10

4






Empirical significance levels

To estimate significance levels associated with the test statistics, a number n of pseudo-data are defined, where in the i-th pseudo-sample, the i-th control individual functions as a pseudo-case and all other individuals (including the case) function as pseudo-control individuals. For each pseudo-dataset, the same procedure is carried out as for the observed data, that is, pairs of individuals are formed and groups 1 and 2 are defined. Among the results from the pseudo datasets, we count the number k of pseudo test statistics that are at least as large as the observed test statistic. The p-value is then obtained as

p = (k + e)/(n + e), 

where e is either 0 or 1 (preferred), depending on the user's choice in the parameter file. With e = 1, the observed data are included in the pseudo data.

References

Agresti A (2002) Categorical data analysis, 2nd edn. Wiley-Interscience, New York

Imai A, Nakaya A, Fahiminiya S, Tetreault M, Majewski J, Sakata Y, Takashima S, Lathrop M, Ott J (2015) Beyond Homozygosity Mapping: Family-Control analysis based on Hamming distance for prioritizing variants in exome sequencing. Sci Rep 5: 12028. doi: 10.1038/srep12028