Documentation to maxstatRS program

Jurg Ott / 05 Nov 2017

Introduction

This program processes data on the Hamming distance between sets of variants in a case individual versus a number of control individuals (Imai et al. 2015). The program package contains source and executable files for Windows and Linux (current version dated 05 Nov 2017). A GUI version is also available.

Consider 1 case individual from a family pedigree, and n control individuals. The program reads HDR values and numbers of pairs (HDR2) furnishing a given HDR value. Values of HDR < 0 and HDR > 1 will be set equal to missing (= -9). Currently the total number of individuals is limited to 50; increasing this number will require increased memory (RAM) resources.

We form all possible pairs of individuals and distinguish those pairs containing the case individual (group 1 pairs) from the pairs consisting of two control individuals (group 2 pairs). For each pair, each candidate variant, and each region (for example, 10K, 20K, …, 100K), we compute HDR as the proportion of variants that differ between the two individuals in the pair. Then we compute a t statistic, that is, the scaled mean difference between the group 1 and 2 pairs. The t statistic is then maximized over regions, which results in the variant-specific test statistic, tmax. We generally use a one-sided statistic as the difference, group 1 mean minus group 2 mean, so that only positive values of t are considered significant.

Candidate variants are ranked by their tmax value. Significance levels are obtained from pseudo-samples: Each of the control individuals in turn is taken to be a pseudo-affected individual with all other individuals considered pseudo-controls. In each pseudo-sample the same analysis is carried out as in the observed data. The proportion of pseudo-samples with a test statistic at least as large as the observed test statistic is taken to be the p-value associated with tmax for a given candidate variant.

Empirical significance levels are computed (1) by counting the observed dataset as one-half of a null dataset; this is a compromise between treating the observed data as null data and not considering the observed data among the null datasets. (2) Also, we compute mid-p values (Agresti, 2002) rather than standard p-values but results are almost the same as with standard p-values.

Minimum number of pairs

For a given variant and region, if the average number of HDR2 values for all pairs (case-ctrl and ctrl-ctrl pairs) is less than the critical value (currently = 3) stated in the parameter file, then all HDR values for that variant and region are set equal to missing.

Parameter file

The maxstatRS program must be initiated with a parameter file named on the command line. For example, in a terminal window you type maxstatRS maxstatRS.param, where the second item is the required parameter file described below. An example parameter file is included in the program package, reproduced below:

maxstatRS: Family B
-9 14 23900677  code for missing, chrom and bp position of disease variant
-12 1 1         code for test statistic, exponent, include obs
30 100 3.0      begin/end regions, min bp HDR. Below: File names
inputfile_familyB.txt
famB.T1.out
10      First "region" in the data
20
30
40
50
60
70
80
90
100     Last "region" in the data

== EXPLANATIONS ==
Line 1: Any text, truncated to 250 characters.

Line 2:
  Value for "missing", should be left equal to -9
  Chromosome number for disease variant if known, =0 if unknown
  Chromosomal position of disease variant if known, =0 if unknown

Line 3:
  Code for primary test statistic (see below)
  Exponent for power transformation of HDR values (see Note 2 below)
  1 if observed data are counted among null data, 0 otherwise (see Note 4)

Line 4:
  First variant region to analyze; must be a multiple of 10
  Last variant region to analyze; must be a multiple of 10
  Min. number of basepairs (variants) for a valid HDR

Line 5:
  Name of datafile (input). See Note 3 below.

Line 6:
  Name of results (output) file. If this line is blank the output file
    will be "maxstatRS.out".

Line 7+:
  As many lines as there are regions in the input file, with each
  line indicating the size of the region in kb. This list must be
  terminated with one or more blank lines, which will tell the
  program that the list has ended.

----------------------------------------
Codes on line 3:
 1 = mean difference
 3 = Kolmogorov-Smirnov 2-sample test statistic
11 = 2-sample t statistic, equal variances
12 = 2-sample t statistic, unequal variances

NOTE 1: A positive value of the code indicates a 2-sided test, a negative
value specifies a 1-sided test. For example, -3 indicates whether values
in group 2 tend to be larger than those in group 1 (distribution function
in group 1 tends to be larger than that in group 2).

NOTE 2: That exponent has the function of transforming HDR data with a
skewed distribution to become (nearly) normal. Assume that x is a
quantitative trait like body height or cholesterol level. Often such
measurements have a long upper tail and a lower bound like 0. To make
their distributions more symmetric (and thus more normal) one might
transform x, for example, into y = x = x^λ, for λ = ½. More generally,
such power transformations are given by y=λ+(x^λ – 1)/λ, with y = ln(x)
for λ = 0, and λ = 1 for y = x (no transformation). In practice, however,
there may not be a need for power transformations. See Ott J (1979)
Detection of rare major genes in lipid levels. Hum Genet 51:79-91.

NOTE 3: Structure of data input file.
Line 1
  Number of candidate variants
  Number of control individuals in addition to the case individual
  Number of "regions"
Line 2
  Any text
Line 3
  chr  These 3 (or any other) characters in columns 1-3 (not used by program)
  Chromosome number following "chr"
  Basepair position of variant
  Region
  Number 1 or 2 for 1=HDR value, 2=HDR2 value (number of obs leading to HDR)
  HDR values of case-control pairs and control-control pairs
Line 4
  Analogous input for variant 2, etc.

NOTE 4: If item 3 on line 3 is equal to 1 then the observed data are included
among the null (pseudo) data. This is a standard practice in statistics. If the
item is equal to 0 then observed data are not counted as null data.

Input file

All data, HDR and HDR2 values, are furnished in a single input file (named in the parameter file), in which candidate variants are identified by their sequential number, for example, 1..606. A sample input file is provided in the program package.

Top line: Four numbers, indicating the number of candidate variants, the number of case individuals (currently =1), the number of control individuals, and the number of regions. For example, for family B, this line would be

606 1 41 10

The program will know that there is an affected individual (numbered 0) and 41 control individuals (numbered 1..41). Thus, for each candidate variant and each region, it will expect 42 × 41/2 = 861 HDR values and the same number of HDR2 values.

Header line: There is a second input line identifying the various columns.

The program will then read two sets of 606 lines each. The first set of 606 lines will contain HDR values, and the second set will contain HDR2 values. Each set will be repeated for each region. The numbers on each line correspond to pairs of individuals (numbered 0..n), where n is the number of control individuals. For example, with n = 4, the 10 pairs must be furnished in the order indicated below, where rows and columns represent individuals numbered 0..4:


0

1

2

3

4

0


1

2

3

4

1



5

6

7

2




8

9

3





10

4






All numbers must be separated by white space (blank or tab). The program will expect the input structure shown above. In summary, for n = 41, the input file will be as follows:

606 41 10 #variants #control_individuals #regions

606 lines, each line containing 861 HDR values, to be repeated for each region

606 lines, each line containing 861 HDR2 values, to be repeated for each region

Thus, in the input file, after the initial line, what changes first from line to line is the variant, then the region, and then the HDR type (HDR/HDR2). For example, after the first 606 variants, the region changes from 10 to 20 while HDR remains =1; after the next 606 variants, the region changes from 20 to 30 with HDR=1, and so on; after the 606 variants with region = 100, the region is set back to 10 while HDR is increased from 1 to 2, and so on.

Regions: The region size is given in kb, for example, 10, 20, and so on. Region sizes should be in multiples of 10.

At a given candidate variant and region, we have a number n of pairs of individuals (e.g., n = 861). For each pair, we compute HDR based on a number k of variants flanking the candidate locus. If the average of the k values (over the n pairs) is less than 3 then we set all HDR values for the given candidate variant and region equal to missing, that is, in effect, we don't use that region for the given candidate variant.

Technical comments

Given are one case (affected) individual and n control (unaffected) individuals. Individuals are numbered as follows: Case = 0, controls = 1, 2, …, n. There are N = n(n – 1)/2 control-control pairs and T = (n + 1)n/2 pairs total, where N + n = T. In the program, ncase = n, nctrl = N, numobs = ncase + nctrl = T.

Case-control pairs are ordered by the sequence number of control individuals, that is, (case – control 1), (case – control 2), and so on. This is self-evident but is mentioned here for completeness.

Control-control pairs should on input occur in the following order (they do so now): (1 – 2), (1 – 3), …, (1 – n), (2 – 3), (2 – 4), …, (2 – n), …, ( (n-1) – n).

Example with n = 4 (N = 6, T = 10): Control-control pairs


0

1

2

3

4

0






1



1

2

3

2




4

5

3





6

4






Example with n = 4 (N = 6, T = 10): All pairs


0

1

2

3

4

0


1

2

3

4

1



5

6

7

2




8

9

3





10

4








Empirical significance levels

To estimate significance levels associated with the test statistics, a number n of pseudo-data are defined, where in the i-th pseudo-sample, the i-th control individual functions as a pseudo-case and all other individuals (including the case) function as pseudo-control individuals. For each pseudo-dataset, the same procedure is carried out as for the observed data, that is, pairs of individuals are formed and groups 1 and 2 are defined. Among the results from the pseudo datasets, we count the number k of pseudo test statistics that are at least as large as the observed test statistic. The p-value is then obtained as

p = (k + e)/(n + e), 

where e is either 0 or 1 (preferred), depending on the user's choice in the parameter file. With e = 1, the observed data are included in the pseudo data.

References

Agresti A (2002) Categorical data analysis, 2nd edn. Wiley-Interscience, New York

Imai A, Nakaya A, Fahiminiya S, Tetreault M, Majewski J, Sakata Y, Takashima S, Lathrop M, Ott J (2015) Beyond Homozygosity Mapping: Family-Control analysis based on Hamming distance for prioritizing variants in exome sequencing. Sci Rep 5: 12028. doi: 10.1038/srep12028