Documentation to maxstatRS program
Jurg Ott / 05 Nov 2017
This program processes data on the Hamming distance between sets of variants in a case individual versus a number of control individuals (Imai et al. 2015). The program package contains source and executable files for Windows and Linux (current version dated 05 Nov 2017). A GUI version is also available.
Consider 1 case individual from a family pedigree, and n control individuals. The program reads HDR values and numbers of pairs (HDR2) furnishing a given HDR value. Values of HDR < 0 and HDR > 1 will be set equal to missing (= 9). Currently the total number of individuals is limited to 50; increasing this number will require increased memory (RAM) resources.
We form all possible pairs of individuals and distinguish those pairs containing the case individual (group 1 pairs) from the pairs consisting of two control individuals (group 2 pairs). For each pair, each candidate variant, and each region (for example, 10K, 20K, …, 100K), we compute HDR as the proportion of variants that differ between the two individuals in the pair. Then we compute a t statistic, that is, the scaled mean difference between the group 1 and 2 pairs. The t statistic is then maximized over regions, which results in the variantspecific test statistic, t_{max}. We generally use a onesided statistic as the difference, group 1 mean minus group 2 mean, so that only positive values of t are considered significant.
Candidate variants are ranked by their t_{max} value. Significance levels are obtained from pseudosamples: Each of the control individuals in turn is taken to be a pseudoaffected individual with all other individuals considered pseudocontrols. In each pseudosample the same analysis is carried out as in the observed data. The proportion of pseudosamples with a test statistic at least as large as the observed test statistic is taken to be the pvalue associated with t_{max} for a given candidate variant.
Empirical significance levels are computed (1) by counting the observed dataset as onehalf of a null dataset; this is a compromise between treating the observed data as null data and not considering the observed data among the null datasets. (2) Also, we compute midp values (Agresti, 2002) rather than standard pvalues but results are almost the same as with standard pvalues.
For a given variant and region, if the average number of HDR2 values for all pairs (casectrl and ctrlctrl pairs) is less than the critical value (currently = 3) stated in the parameter file, then all HDR values for that variant and region are set equal to missing.
The maxstatRS program must be initiated with a parameter file named on the command line. For example, in a terminal window you type maxstatRS maxstatRS.param, where the second item is the required parameter file described below. An example parameter file is included in the program package, reproduced below:
maxstatRS: Family B 9 14 23900677 code for missing, chrom and bp position of disease variant 12 1 1 code for test statistic, exponent, include obs 30 100 3.0 begin/end regions, min bp HDR. Below: File names inputfile_familyB.txt famB.T1.out 10 First "region" in the data 20 30 40 50 60 70 80 90 100 Last "region" in the data == EXPLANATIONS == Line 1: Any text, truncated to 250 characters. Line 2: Value for "missing", should be left equal to 9 Chromosome number for disease variant if known, =0 if unknown Chromosomal position of disease variant if known, =0 if unknown Line 3: Code for primary test statistic (see below) Exponent for power transformation of HDR values (see Note 2 below) 1 if observed data are counted among null data, 0 otherwise (see Note 4) Line 4: First variant region to analyze; must be a multiple of 10 Last variant region to analyze; must be a multiple of 10 Min. number of basepairs (variants) for a valid HDR Line 5: Name of datafile (input). See Note 3 below. Line 6: Name of results (output) file. If this line is blank the output file will be "maxstatRS.out". Line 7+: As many lines as there are regions in the input file, with each line indicating the size of the region in kb. This list must be terminated with one or more blank lines, which will tell the program that the list has ended.  Codes on line 3: 1 = mean difference 3 = KolmogorovSmirnov 2sample test statistic 11 = 2sample t statistic, equal variances 12 = 2sample t statistic, unequal variances NOTE 1: A positive value of the code indicates a 2sided test, a negative value specifies a 1sided test. For example, 3 indicates whether values in group 2 tend to be larger than those in group 1 (distribution function in group 1 tends to be larger than that in group 2). NOTE 2: That exponent has the function of transforming HDR data with a skewed distribution to become (nearly) normal. Assume that x is a quantitative trait like body height or cholesterol level. Often such measurements have a long upper tail and a lower bound like 0. To make their distributions more symmetric (and thus more normal) one might transform x, for example, into y = √x = x^λ, for λ = ½. More generally, such power transformations are given by y=λ+(x^λ – 1)/λ, with y = ln(x) for λ = 0, and λ = 1 for y = x (no transformation). In practice, however, there may not be a need for power transformations. See Ott J (1979) Detection of rare major genes in lipid levels. Hum Genet 51:7991. NOTE 3: Structure of data input file. Line 1 Number of candidate variants Number of control individuals in addition to the case individual Number of "regions" Line 2 Any text Line 3 chr These 3 (or any other) characters in columns 13 (not used by program) Chromosome number following "chr" Basepair position of variant Region Number 1 or 2 for 1=HDR value, 2=HDR2 value (number of obs leading to HDR) HDR values of casecontrol pairs and controlcontrol pairs Line 4 Analogous input for variant 2, etc. NOTE 4: If item 3 on line 3 is equal to 1 then the observed data are included among the null (pseudo) data. This is a standard practice in statistics. If the item is equal to 0 then observed data are not counted as null data.
All data, HDR and HDR2 values, are furnished in a single input file (named in the parameter file), in which candidate variants are identified by their sequential number, for example, 1..606. A sample input file is provided in the program package.
Top line: Four numbers, indicating the number of candidate variants, the number of case individuals (currently =1), the number of control individuals, and the number of regions. For example, for family B, this line would be
606 1 41 10
The program will know that there is an affected individual (numbered 0) and 41 control individuals (numbered 1..41). Thus, for each candidate variant and each region, it will expect 42 × 41/2 = 861 HDR values and the same number of HDR2 values.
Header line: There is a second input line identifying the various columns.
The program will then read two sets of 606 lines each. The first set of 606 lines will contain HDR values, and the second set will contain HDR2 values. Each set will be repeated for each region. The numbers on each line correspond to pairs of individuals (numbered 0..n), where n is the number of control individuals. For example, with n = 4, the 10 pairs must be furnished in the order indicated below, where rows and columns represent individuals numbered 0..4:

0 
1 
2 
3 
4 
0 

1 
2 
3 
4 
1 


5 
6 
7 
2 



8 
9 
3 




10 
4 





All numbers must be separated by white space (blank or tab). The program will expect the input structure shown above. In summary, for n = 41, the input file will be as follows:
606 41 10 #variants #control_individuals #regions
606 lines, each line containing 861 HDR values, to be repeated for each region
606 lines, each line containing 861 HDR2 values, to be repeated for each region
Thus, in the input file, after the initial line, what changes first from line to line is the variant, then the region, and then the HDR type (HDR/HDR2). For example, after the first 606 variants, the region changes from 10 to 20 while HDR remains =1; after the next 606 variants, the region changes from 20 to 30 with HDR=1, and so on; after the 606 variants with region = 100, the region is set back to 10 while HDR is increased from 1 to 2, and so on.
Regions: The region size is given in kb, for example, 10, 20, and so on. Region sizes should be in multiples of 10.
At a given candidate variant and region, we have a number n of pairs of individuals (e.g., n = 861). For each pair, we compute HDR based on a number k of variants flanking the candidate locus. If the average of the k values (over the n pairs) is less than 3 then we set all HDR values for the given candidate variant and region equal to missing, that is, in effect, we don't use that region for the given candidate variant.
Given are one case (affected) individual and n control (unaffected) individuals. Individuals are numbered as follows: Case = 0, controls = 1, 2, …, n. There are N = n(n – 1)/2 controlcontrol pairs and T = (n + 1)n/2 pairs total, where N + n = T. In the program, ncase = n, nctrl = N, numobs = ncase + nctrl = T.
Casecontrol pairs are ordered by the sequence number of control individuals, that is, (case – control 1), (case – control 2), and so on. This is selfevident but is mentioned here for completeness.
Controlcontrol pairs should on input occur in the following order (they do so now): (1 – 2), (1 – 3), …, (1 – n), (2 – 3), (2 – 4), …, (2 – n), …, ( (n1) – n).
Example with n = 4 (N = 6, T = 10): Controlcontrol pairs

0 
1 
2 
3 
4 
0 





1 


1 
2 
3 
2 



4 
5 
3 




6 
4 





Example with n = 4 (N = 6, T = 10): All pairs

0 
1 
2 
3 
4 
0 

1 
2 
3 
4 
1 


5 
6 
7 
2 



8 
9 
3 




10 
4 





To estimate significance levels associated with the test statistics, a number n of pseudodata are defined, where in the ith pseudosample, the ith control individual functions as a pseudocase and all other individuals (including the case) function as pseudocontrol individuals. For each pseudodataset, the same procedure is carried out as for the observed data, that is, pairs of individuals are formed and groups 1 and 2 are defined. Among the results from the pseudo datasets, we count the number k of pseudo test statistics that are at least as large as the observed test statistic. The pvalue is then obtained as
p = (k + e)/(n + e),
where e is either 0 or 1 (preferred), depending on the user's choice in the parameter file. With e = 1, the observed data are included in the pseudo data.
Agresti A (2002) Categorical data analysis, 2nd edn. WileyInterscience, New York
Imai A, Nakaya A, Fahiminiya S, Tetreault M, Majewski J, Sakata Y, Takashima S, Lathrop M, Ott J (2015) Beyond Homozygosity Mapping: FamilyControl analysis based on Hamming distance for prioritizing variants in exome sequencing. Sci Rep 5: 12028. doi: 10.1038/srep12028