Jurg Ott / 22 June 2017
Rockefeller University, New York

Heterozygosity Mapping for Dominant Trait Variants


This document describes a new statistical approach for genetically localizing pathogenic variants underlying inherited dominant traits, and provides software for carrying out corresponding analyses, both for Windows and Linux. This outline refers to the Windows version but Linux users will easily see how to use this in Linux.

The program package provides two programs, PropHet and NegPos, described below. Also included is a test dataset. Data should be formatted according to plink2, that is, plink version 1.9 or later, and should be provided as transposed datasets, for example, FamSc13.tfam and FamSc13.tped (included in package), where each line refers to a variant, with variants being in chromosomal order. Options used to prepare these files should include the following two lines:
  --output-missing-phenotype 0
  --recode 12 transpose
As described below, each of the two programs must be invoked in a terminal by typing its name followed by the name of a parameter file on the command line. In Windows, the cmd program provides such a terminal. The Linux executables, prophet and negpos, must be made executable after downloading.

PropHet program

This program computes for each variant an H value, which is an n-point moving average. The number n should be odd, for example, 101. H would then be equal to the average heterozygosity at 101 variants, comprising the given variant and the 50 variants preceding it and the 50 variants following it. Thus, in a dataset, the first and last 50 variants don't have H values. Variants with missing H values are skipped in this process.

This program should be invoked by typing, for example, prophet prophet.param, where the second item refers to the parameter file. This file consists of 6 lines as follows:

101            Line 2: n for n-point averages
2              Line 3: (Type of) individual(s) to be used
0 13 32893405  Line 4: Value of P(het) to be ranked, chr and bp of known variant

=== EXPLANATIONS, prophet program ===
Line 1
  Input file name, without ".tped". The associated ".tfam" file will also be accessed.

Line 2
  n for n-point averages, like n = 101. Must be odd.

Line 3
  s for selecting individuals to work on:
  s = 2 for all affecteds
  s = 1 use unaffecteds
  s = -k to select individual with sequence number k in the "tfam" file

Line 4
  Value of P(het) (6 decimal places) to be ranked; 0 = no ranking
  Chromosome and bp position of known functional variant; 0 0 if unknown

Line 5
  Two sets of chrom bp values for printing (0 = skip), for example:
  4 98223
  4 98360
  There must not be any text trailing these numbers!

Line 6
  Output file name. If blank, that name will be equal to the name on line 1 with ".het" appended.

When you run the PropHet program with the parameter file as shown above, an output file called FamSc13.tped.het is produced, which contains one line per variant. Taking this into a spreadsheet will you give a file (included as FamSc13.tped.het.xlsx) with the following columns:
chr -- chromosome of given variant
marker -- variant name
bp -- basepair position of variant
dist-kb -- distance in kb from disease variant (this line is missing when disease variant not identified)
p.het -- n-point averaged H (heterozygosity) value at given variant
dis=&& -- this column contains "&&" at the disease variant

Thus, the output provides an H value for each variant except for the first (n - 1)/2 and last (n - 1)/2 variants, where n refers to n-point averages.

NegPos program

Based on an output file containing H values as the one discussed above, this program produces runs of increased heterozygosity, RIH, one run per line. It does so as follows: First, values d1, d2, and so on are computed, where di = hi+1 - hi, and hi is the H value at the ith variant. Values of d < s (s = 0.00001, for example) are set equal to zero, and zero d values are skipped. Thus, successively increasing H values are indicated by negative d values, and decreasing H values lead to positive d values (hence the program name, NegPos). The program then searches for the variant, vi, with the first negative d value and keeps going from one to the next variant until a positive d value is encountered at variant vj, then keeps going until at variant vk it again finds a negative d value. The RIH so found now extends from variants vi through vk, with the locally largest H value being at variant vj-1.

This program should be invoked by typing, for example, negpos negpos.param, where the second item refers to the parameter file. This file consists of 6 lines as follows:

0.00001    s; values d<s will be set equal to zero.
1          which column of H values to read
0.02       stepsize
1          "dist-kb" present if 1

=== Parameter file for negpos program ===

Line 1
 Input file name (output of prophet program)

Line 2
 s; values of d < s will be set equal to zero, that is, will be skipped. Best to keep small = 0.00001

Line 3
 col = number of column of H values to read.

Line 4
 Step size to use for Hmax classes shown in table output file

Line 5
 =1 if "dist-kb" input column is present, = 0 otherwise

Line 6
  Output file name. If blank, that name will be equal to the name on line 1 with ".negpos" appended.

The program produces two output files with names, in this case, of FamSc13.tped.het.negpos and FamSc13.tped.het.negpos.tab. The former file contains one RIH per line and the latter contains a table, that is, a histogram of H values. Below is a description of these two output files, for the given data.

FamSc13.tped.het.negpos output file

chrom -- chromosome number for given RIH
start -- start bp position of RIH
end -- end bp position of RIH
LENbp -- length in bp of RIH
mid-point -- the bp position at which negative d values change to positive d values (not necessarily in the middle of the RIH)
het[mid] -- H value at "mid-point"

These Hmax values (one per RIH) are indicative for pathogenic dominant trait variants at or near the "mid-point", within the given RIH. To judge which RIHs are reliable the histogram of Hmax values shown in the other output file may be consulted.

FamSc13.tped.het.negpos.tab output file

This output file shows a histogram of Hmax values, with step size given by the corresponding parameter in the parameter file.
Hlo -- lower bound of given class of Hmax values
Hhi -- upper bound of given class of Hmax values
n -- number of Hmax values in given class
Lmin -- shortest RIH associated with the Hmax values in this class
Lmax -- longest RIH associated with the Hmax values in this class
Lrange -- corresponding rage, Lmax - Lmin
Lmean -- corresponding mean RIH lengths
Lsd -- corresponding standard deviation of RIH lengths

The given data result in the following histogram:

Hlo Hhi n Lmin Lmax Lrange Lmean Lsd
0  0.02  2,868 6  160,146  160,140 4,003  7,564
0.02 0.04 5,065 3 43,667 43,664 3,158 3,210
0.04 0.06 6,916 4 159,526 159,522 2,873 3,458
0.06 0.08 6,167 3 39,534 39,531 2,742 2,756
0.08 0.10 4,692 9 102,190 102,181 2,619 2,988
0.10 0.12 3,077 4 101,294 101,290 2,522 3,126
0.12 0.14 2,000 3 36,056 36,053 2,394 2,440
0.14 0.16 1,188 3 152,369 152,366 2,526 4,914
0.16 0.18 635 12 17,567 17,555 2,181 2,204
0.18 0.20 361 7 14,428 14,421 2,109 2,234
0.20 0.22 316 3 18,920 18,917 2,025 2,177
0.22 0.24 234 3 18,990 18,987 1,783 2,221
0.24 0.26 156 25 13,172 13,147 1,879 2,155
0.26 0.28 100 24 8,314 8,290 1,535 1,760
0.28 0.30 63 45 6,167 6,122 1,306 1,396
0.30 0.32 78 13 8,765 8,752 1,525 1,656
0.32 0.34 36 17 3,560 3,543 915 847
0.34 0.36 27 30 4,784 4,754 1,502 1,420
0.36 0.38 30 19 4,195 4,176 968 865
0.38 0.40 13 56 4,602 4,546 746 1,159
0.40 0.42 3 105 1,482 1,377 861 570
0.42 0.44 2 141 169 28 155 14
0.44 0.46 1 98 98 0 98 0

A scatterplot of the 34,032 Hmax values (y-axis) against the length of the RIH in which they occur shows the following picture:
Large values of H seem to be associated with a limited range of RIH lengths. Very small H values, presumably random, seemingly appear in a wide range of RIH lengths and may well be disregarded. Based on the scatterplot above and the table (histogram of H values), it is safe to say that Hmax values < 0.10 presumably are random, perhaps even values < 0.12.


(to be furnished)