Jurg Ott / 20 May 2018
Rockefeller University, New York
This document describes a new statistical approach  for genetically localizing pathogenic variants underlying inherited dominant traits, and provides software for carrying out corresponding analyses, both for Windows and Linux. This outline refers to the Windows version but Linux users will easily see how to use this in Linux.
The program package provides a program, PH, which in turn calls two subprograms, PropHet and NegPos (file names are prophetRun and negposRun), described below. Also included is a test dataset . Data should be formatted according to plink2, that is, plink version 1.9 or later, and should be provided as transposed datasets, for example, Sch13.tfam and Sch13.tped (included in package; thanks to Drs. Y. Riazalhosseini and M. Garshasbi for making this dataset available), where each line refers to a variant, with variants being in chromosomal order. Options used to prepare these files should include the following two lines:
--recode 12 transpose
Step 1: Consider a window of Navg contiguous
variants and slide this window across the genome, one variant at a
time. At each window position, compute the average heterozygosity, H,
over the variants in the given window, where H = Nhet/(Nhet
+ Nhom), Nhet = number of heterozygous genotypes and
Nhom = number of homozygous genotypes over all markers and
individuals. This step is carried out by the PropHet program
but, at this time, no instructions are provided for the use of the
PropHet and NegPos programs separately.
The PH program will initially try a value of Navg = 101. If this furnishes values of H = 1 then it keeps increasing Navg until all H values are smaller than 1. Large H values are considered indicative of an inherited dominant trait variant but step 2, below, will furnish more informative values.
Step 2. Based on the generally large number of H values obtained in step 1, segments of contiguous variants are formed such that in a given segment, H values keep increasing up to a local maximum, Hmax, and then decrease again. These segments are also referred to as RIHs for Runs of Increased Heterozygosity. The highest local maxima are taken to be located in the vicinity of inherited dominant pathogenic variants. This step is carried out by the NegPos program -- it first constructs successive differences, D(i) = H(i) - H(i+1), where negative values of D(i) indicate increasing H values, and positive D(i) values show decreasing H values. It also allows for a number Nint of interruptions, that is, for example, a positive D value occurring within a sequence of negative D values. Large values of Nint will force the resulting segments to be wide.
A sample dataset, files Sch13.tfam and Sch13.tped, is provided in the program package. These files refer to family Sh  and variants on chromosome 13, which contains the known disease variant in the BRCA2 gene. A parameter (script) file, PH.param, is required to run the PH program. For the given dataset, the PH.param file may look as follows:
Sch13 Line 1: Input file identifier
-2 0 0 Line 2: (Type of) individual(s) used; fixed Navg, Nint
13 32893405 Line 3: chrom and bp of known BRCA2 variant (rs276174825)
0 1 Line 4: Number of top ranked variants to print; rank tier
PH-Schrom13 Line 5: Output ID
=== EXPLANATIONS ===
Input file name, without ".tped". The associated ".tfam" file will also be accessed.
s for selecting individuals to work on:
s = 2 for all affecteds (affection status specified in the tfam file)
s = 1 to select unaffecteds
s = 0 to select individuals of unknown phenotype
s = -k to select the individual on line k in the tfam file, irrespective of their
phenotype. Here, the second individual in the tfam file is chosen for analysis.
Navg = number of variants over which an average H value is computed.
Must be odd. Set = 0 for automatic choice (recommended!)
Nint = number of interruptions allowed in negpos (no effect for
prophet). Set = 0 for automatic choice (recommended!)
Chromosome and bp position of known functional variant, which on output will be flagged "dis";
0 0 if unknown
Number of top ranked variants to be printed, for example, print variants ranked 10 or better (lower).
In case of tied ranks, the actual number of candidate variants printed might be slightly lower than
= 0 for no such output, = -1 to output all variants
Category of rank chosen (1 = basic rank)
Output identifier. With the identifier given here, two output files will be written:
PH-Schrom13.PHrun.log provides various statistics
PH-Schrom13.negpos.log shows ranks of disease variant
Name and location of folder holding the negposRun and prophetRun programs.
Examples are as follows:
F:\Home\bin\ for Windows
.\ for Windows when programs are in the current folder
/home/joe/bin/ for Linux
More details will be provided in an upcoming version of this
To run the program, simply type PH (see below), optionally followed by one or more command line parameters. Each such parameter consists of a single alphanumeric character but currently there are only three recognized parameters as follows:
For given chromosome and bp numbers (line 3 of parameter file), the program will rank this variant among all variants. If the program does not find the specified variant it indicates the variant nearest to the specified variant and then stops.
Ranking of variants regarding their likelihood of being a disease variant is based on the Hmax values generated by the negpos program. We try to eliminate some of the false positive Hmax values as follows. Presumably, the RIH containing the disease variant is wider than the average of all RIHs. Thus, one might in a spreadsheet have two columns, column 1 with Hmax values and column 2 with values 0 and 1, where a value of 1 represents the fact a given RIH has a width larger than the average width. Then one would sort the Hmax values followed by sorting on the values in column 2. In practice, it is not always the case that the disease RIH is wider than the average RIH. Therefore, the negpos program will rank RIHs assuming, for example, that the disease RIH has a width greater than a quarter of the average width of all segments. This seems quite safe and eliminates some of the false positive Hmax values. The output file, in this case, Sch13.negpos.log, will show rankings based on different assumptions about the width of the disease RIH.
In principle, H values obtained by the prophet program can also be used to rank variants but, because of the generally large total number of variants, even good percentiles such as 0.5% still refer to fairly large numbers of candidate variants. If a candidate ("disease") variant is a false positive, percentiles for the H values presumably follow a uniform distribution (there is no maximization involved in generating H values), so these percentiles may help in discriminationg true from false positive results but this aspect has not yet been investigated in detail.
1 Imai A, Li Y, Horpaopan S, Riazalhosseini Y, Garshasbi M, Mosse YP, Zhang D, Leal SM, Lathrop M, Ott J: Heterozygosity mapping for dominant trait variants (under review).
2 Ataei-Kachouei M, Nadaf J, Akbari MT, Atri M, Majewski J, Riazalhosseini Y, Garshasbi M: Double heterozygosity of BRCA2 and STK11 in familial breast cancer detected by exome sequencing. Iran J Public Health 2015;44:1348-1352.