Jurg Ott / 28 Jan 2020
Rockefeller University, New York

Heterozygosity Analysis for Dominant Trait Variant Mapping


This document describes a new statistical approach [1] for genetically localizing pathogenic variants underlying inherited dominant traits, and provides software for carrying out corresponding analyses, both for Windows and Linux. This outline refers to the Windows version but Linux users will easily see how to use this in Linux.

The program package provides a program, PH, which in turn calls two subprograms, PropHet and NegPos (file names are prophetRun and negposRun), described below. Also included is a test dataset [2]. Data should be formatted according to plink2, that is, plink version 1.9 or later, and should be provided as transposed datasets, for example, Sch13.tfam and Sch13.tped (included in package; thanks to Drs. Y. Riazalhosseini and M. Garshasbi for making this dataset available), where each line refers to a variant, with variants being in chromosomal order. Options used to prepare these files should include the following two lines:

 --output-missing-phenotype 0
--recode 12 transpose

Two-step approach

Step 1: Consider a window of Navg contiguous variants and slide this window across the genome, one variant at a time. At each window position, compute the average heterozygosity, H, over the variants in the given window, where H = Nhet/(Nhet + Nhom), Nhet = number of heterozygous genotypes and Nhom = number of homozygous genotypes over all markers and individuals. This step is carried out by the PropHet program but, at this time, no instructions are provided for the use of the PropHet and NegPos programs separately.

The PH program will initially try a value of Navg = 101. If this furnishes values of H = 1 (100%) then it keeps increasing Navg until all H values are smaller than 1. Large H values are considered indicative of an inherited dominant trait variant but step 2, below, will furnish more informative values.

Step 2. Based on the generally large number of H values obtained in step 1, segments of contiguous variants are formed such that in a given segment, H values keep increasing up to a local maximum, Hmax, and then decrease again. These segments are also referred to as RIHs for Runs of Increased Heterozygosity. The highest local maxima are taken to be located in the vicinity of inherited dominant pathogenic variants. This step is carried out by the NegPos program -- it first constructs successive differences, D(i) = H(i) - H(i+1), where negative values of D(i) indicate increasing H values, and positive D(i) values show decreasing H values. It also allows for a number Nint of interruptions, that is, for example, a positive D value occurring within a sequence of negative D values. Large values of Nint will force the resulting segments to be wide.

The PH program and sample data

A sample dataset, files Sch13.tfam and Sch13.tped, is provided in the program package. These files refer to family Sh [2] and variants on chromosome 13, which contains the known disease variant in the BRCA2 gene. A parameter (script) file, PH.param, is required to run the PH program. For the given dataset, the PH.param file may look as follows:

Sch13           Line 1: Input file identifier
-2 0 0 Line 2: (Type of) individual(s) used; fixed Navg, Nint
13 32893405 Line 3: chrom and bp of known BRCA2 variant (rs276174825)
0 1 Line 4: Number of top ranked variants to print; rank tier
PH-Schrom13 Line 5: Output ID

Line 1
Input file name, without ".tped". The associated ".tfam" file will also be accessed.

Line 2
s for selecting individuals to work on:
s = 2 for all affecteds (affection status specified in the tfam file)
s = 1 to select unaffecteds
s = 0 to select individuals of unknown phenotype
s = -k to select the individual on line k in the tfam file, irrespective of their
phenotype. Here, the second individual in the tfam file is chosen for analysis.
Navg = number of variants over which an average H value is computed.
Must be odd. Set = 0 for automatic choice (recommended!)
Nint = number of interruptions allowed in negpos (no effect for
prophet). Set = 0 for automatic choice (recommended!)

Line 3
Chromosome and bp position of known functional variant, which on output will be flagged "dis";
0 0 if unknown

Line 4
Number of top ranked variants to be printed, for example, print variants ranked 10 or better (lower).
In case of tied ranks, the actual number of candidate variants printed might be slightly lower than
specified here.
= 0 for no such output, = -1 to output all variants
Category of rank chosen (1 = basic rank, currently the only choice)

Line 5
Output identifier. With the identifier given here, two output files will be written:
PH-Schrom13.PHrun.log provides various statistics
PH-Schrom13.negpos.log shows ranks of disease variant

Line 6
Name and location of folder holding the negposRun and prophetRun programs.
Examples are as follows:
F:\Home\bin\ for Windows
.\ for Windows when programs are in the current folder
/home/joe/bin/ for Linux

More details will be provided in an upcoming version of this document.
To run the program, simply type PH (see below), optionally followed by one or more command line parameters. Each such parameter consists of a single alphanumeric character but currently there are only three recognized parameters as follows:

Important: The program is currently set up to use path names for the subprograms. Examples of path names are given on line 6 in the sample parameter file above. Your PH program name may be in the path, in which case you do not need to use a path name for it; otherwise, you may need to call the program as F:\Home\bin\PH (/home/joe/bin/PH in Linux), for example. However, even though you may not need a path name for PH, you do need to provide a path name for the subprograms to indicate where they reside. If you leave line 6 of the parameter file empty (no path name) it is assumed that the subprogram names are in the path and the PH program will ask you to confirm this. For example, in Windows, you make a folder (directory) C:\bin and move the prophetRun.exe and negposRun.exe files there. Then, line 6 in your PH.param file should read C:\bin\.

For given chromosome and bp numbers (line 3 of parameter file), the program will rank this variant among all variants. If the program does not find the specified variant it indicates the variant nearest to the specified variant and then stops.

Ranking variants

Ranking of variants regarding their likelihood of being a disease variant is based on the Hmax values generated by the negpos program. We try to eliminate some of the false positive Hmax values as follows. Presumably, the RIH containing the disease variant is wider than the average of all RIHs. Thus, one might in a spreadsheet have two columns, column 1 with Hmax values and column 2 with values 0 and 1, where a value of 1 represents the fact a given RIH has a width larger than the average width. Then one would sort the Hmax values followed by sorting on the values in column 2. In practice, it is not always the case that the disease RIH is wider than the average RIH. Therefore, the negpos program will rank RIHs assuming, for example, that the disease RIH has a width greater than a quarter of the average width of all segments. This seems quite safe and eliminates some of the false positive Hmax values. The output file, in this case, Sch13.negpos.log, will show rankings based on different assumptions about the width of the disease RIH.

In principle, H values obtained by the prophet program can also be used to rank variants but, because of the generally large total number of variants, even good percentiles such as 0.5% still refer to fairly large numbers of candidate variants. If a candidate ("disease") variant is a false positive, percentiles for the H values presumably follow a uniform distribution (there is no maximization involved in generating H values), so these percentiles may help in discriminating true from false positive results but this aspect has not yet been investigated in detail.

Moving averages for heterozygosity

There is a question of how many variants should be combined to form a moving average for the variant in the middle of this set of variants. For the families (in ref 1) analyzed with all affecteds combined in each family, we tested different numbers of variants in a moving average to see, which ones provided the best ranks for Hmax. It turns out that the best value inversely depends on variant density, Nvar/kb = number of variants per kb of sequence length of all chromosomes. Let y = value of (2m + 1) with the best prediction of Hmax, and x = variant density, that is, the number of variants per kb of sequence in a given family dataset. For families S, L, and M, we observe the following respective pairs of values, (y, x) = (41, 2.27), (151, 0.66), and (201, 0.041). These pairs define an almost exact linear relationship, y = 201 71x, which was used in the analysis for all individuals, that is, the variant density x defines the value of y = (2m + 1) to be used in the analysis. In addition, at stage 1, we imposed a minimum value of m = 100 for obtaining H averages, and at stage 2 we forced RIHs to contain 370 or 50 variants depending on whether variant density was above or below 0.1 variants/kb. These empirically obtained parameters turned out to be optimal or nearly so in all examples we analyzed.


1 Imai-Okazaki A, Li Y, Horpaopan S, Riazalhosseini Y, Garshasbi M, Mosse YP, Zhang D, Schrauwen I, Leal SM, Lathrop M, Ott J: Heterozygosity mapping for dominant trait variants. Human Mutation 2019, https://doi.org/10.1002/humu.23765.

2 Ataei-Kachouei M, Nadaf J, Akbari MT, Atri M, Majewski J, Riazalhosseini Y, Garshasbi M: Double heterozygosity of BRCA2 and STK11 in familial breast cancer detected by exome sequencing. Iran J Public Health 2015;44:1348-1352.