Documentation to maxstatRS program
Jurg Ott / 16 March 2016
This program processes data on the Hamming distance between sets of variants in a case individual versus a number of control individuals (Imai et al. 2015). The program package contains source and executable files for Windows and Linux (current version dated 16 Mar 2016). A GUI version is also available.
Consider 1 case individual from a family pedigree, and n control individuals. The program reads HDR values and numbers of pairs (HDR2) furnishing a given HDR value. Values of HDR < 0 and HDR > 1 will be set equal to missing (= 9).
We form all possible pairs of individuals and distinguish those pairs containing the case individual (group 1 pairs) from the pairs consisting of two control individuals (group 2 pairs). For each pair, each candidate variant, and each region (for example, 10K, 20K, …, 100K), we compute HDR as the proportion of variants that differ between the two individuals in the pair. Then we compute a t statistic, that is, the scaled mean difference between the group 1 and 2 pairs. The t statistic is then maximized over regions, which results in the variantspecific test statistic, t_{max}. We generally use a onesided statistic as the difference, group 1 mean minus group 2 mean, so that only positive values of t are considered significant.
Candidate variants are ranked by their t_{max} value. Significance levels are obtained from pseudosamples: Each of the control individuals in turn is taken to be a pseudoaffected individual with all other individuals considered pseudocontrols. In each pseudosample the same analysis is carried out as in the observed data. The proportion of pseudosamples with a test statistic at least as large as the observed test statistic is taken to be the pvalue associated with t_{max} for a given candidate variant.
Empirical significance levels are computed (1) by counting the observed dataset as onehalf of a null dataset; this is a compromise between treating the observed data as null data and not considering the observed data among the null datasets. (2) Also, we compute midp values (Agresti, 2002) rather than standard pvalues but results are almost the same as with standard pvalues.
For a given variant and region, if the average number of HDR2 values for all pairs (casectrl and ctrlctrl pairs) is less than the critical value (currently = 3) stated in the parameter file, then all HDR values for that variant and region are set equal to missing.
The maxstatRS program must be initiated with a parameter file named on the command line. For example, in a terminal window you type maxstatRS maxstatRS.param, where the second item is the required parameter file described below. An example parameter file is included in the program package; it provides detailed explanations on the various input lines.
All data, HDR and HDR2 values, are furnished in a single input file (named in the parameter file), in which candidate variants are identified by their sequential number, for example, 1..606. A sample input file is provided in the program package.
Top line: Four numbers, indicating the number of candidate variants, the number of case individuals (currently =1), the number of control individuals, and the number of regions. For example, for family B, this line would be
606 1 41 10
The program will know that there is an affected individual (numbered 0) and 41 control individuals (numbered 1..41). Thus, for each candidate variant and each region, it will expect 42 × 41/2 = 861 HDR values and the same number of HDR2 values.
Header line: There is a second input line identifying the various columns.
The program will then read two sets of 606 lines each. The first set of 606 lines will contain HDR values, and the second set will contain HDR2 values. Each set will be repeated for each region. The numbers on each line correspond to pairs of individuals (numbered 0..n), where n is the number of control individuals. For example, with n = 4, the 10 pairs must be furnished in the order indicated below, where rows and columns represent individuals numbered 0..4:

0 
1 
2 
3 
4 
0 

1 
2 
3 
4 
1 


5 
6 
7 
2 



8 
9 
3 




10 
4 





All numbers must be separated by white space (blank or tab). The program will expect the input structure shown above. In summary, for n = 41, the input file will be as follows:
606 41 10 #variants #control_individuals #regions
606 lines, each line containing 861 HDR values, to be repeated for each region
606 lines, each line containing 861 HDR2 values, to be repeated for each region
Thus, in the input file, after the initial line, what changes first from line to line is the variant, then the region, and then the HDR type (HDR/HDR2). For example, after the first 606 variants, the region changes from 10 to 20 while HDR remains =1; after the next 606 variants, the region changes from 20 to 30 with HDR=1, and so on; after the 606 variants with region = 100, the region is set back to 10 while HDR is increased from 1 to 2, and so on.
Regions: The region size is given in kb, for example, 10, 20, and so on. Region sizes should be in multiples of 10.
At a given candidate variant and region, we have a number n of pairs of individuals (e.g., n = 861). For each pair, we compute HDR based on a number k of variants flanking the candidate locus. If the average of the k values (over the n pairs) is less than 3 then we set all HDR values for the given candidate variant and region equal to missing, that is, in effect, we don't use that region for the given candidate variant.
Given are one case (affected) individual and n control (unaffected) individuals. Individuals are numbered as follows: Case = 0, controls = 1, 2, …, n. There are N = n(n – 1)/2 controlcontrol pairs and T = (n + 1)n/2 pairs total, where N + n = T. In the program, ncase = n, nctrl = N, numobs = ncase + nctrl = T.
Casecontrol pairs are ordered by the sequence number of control individuals, that is, (case – control 1), (case – control 2), and so on. This is selfevident but is mentioned here for completeness.
Controlcontrol pairs should on input occur in the following order (they do so now): (1 – 2), (1 – 3), …, (1 – n), (2 – 3), (2 – 4), …, (2 – n), …, ( (n1) – n).
Example with n = 4 (N = 6, T = 10): Controlcontrol pairs

0 
1 
2 
3 
4 
0 





1 


1 
2 
3 
2 



4 
5 
3 




6 
4 





Example with n = 4 (N = 6, T = 10): All pairs

0 
1 
2 
3 
4 
0 

1 
2 
3 
4 
1 


5 
6 
7 
2 



8 
9 
3 




10 
4 





To estimate significance levels associated with the test statistics, a number n of pseudodata are defined, where in the ith pseudosample, the ith control individual functions as a pseudocase and all other individuals (including the case) function as pseudocontrol individuals. For each pseudodataset, the same procedure is carried out as for the observed data, that is, pairs of individuals are formed and groups 1 and 2 are defined. Among the results from the pseudo datasets, we count the number k of pseudo test statistics that are at least as large as the observed test statistic. The pvalue is then obtained as
p = (k + e)/(n + e),
where e is either 0 or 1 (preferred), depending on the user's choice in the parameter file. With e = 1, the observed data are included in the pseudo data.
Agresti A (2002) Categorical data analysis, 2nd edn. WileyInterscience, New York
Imai A, Nakaya A, Fahiminiya S, Tetreault M, Majewski J, Sakata Y, Takashima S, Lathrop M, Ott J (2015) Beyond Homozygosity Mapping: FamilyControl analysis based on Hamming distance for prioritizing variants in exome sequencing. Sci Rep 5: 12028. doi: 10.1038/srep12028