Jurg Ott / 28 October 2017
Rockefeller University, New York

Heterozygosity Mapping for Dominant Trait Variants

Introduction

This document describes a new statistical approach [1] for genetically localizing pathogenic variants underlying inherited dominant traits, and provides software for carrying out corresponding analyses, both for Windows and Linux. This outline refers to the Windows version but Linux users will easily see how to use this in Linux.

The program package provides two programs, PropHet and NegPos, described below. Also included is a test dataset [2]. Data should be formatted according to plink2, that is, plink version 1.9 or later, and should be provided as transposed datasets, for example, FamSc13.tfam and FamSc13.tped (included in package; thanks to Drs. Y. Riazalhosseini and M. Garshasbi for making this dataset available), where each line refers to a variant, with variants being in chromosomal order. Options used to prepare these files should include the following two lines:

 --output-missing-phenotype 0
 --recode 12 transpose

As described below, each of the two programs is preferably invoked in a terminal window by typing its name followed by the name of a parameter file on the command line, but the user may also just click on their names although this allows less flexibility regarding parameter file names. In Windows, cmd provides such a terminal. The Linux executables, prophet and negpos, must be made executable after downloading. Each program produces output data and a log file containing relevant statistics.

PropHet program

This program computes for each variant an H value, which is an n-point moving average of heterozygosity. The number n should be odd, for example, 101. H would then be equal to the average heterozygosity at 101 variants, comprising the given variant and the 50 variants preceding it and the 50 variants following it. Thus, on each chromosome, the first and last 50 variants don't have H values. Variants with missing H values are skipped in this process.

This program should be invoked by typing, for example, prophet prophet.param, where the second item refers to the parameter file (any file name will do as long as its content conforms to the appropriate structure). As shown below, this file consists of 6 lines followed by some explanatory text.

-------------begin----------
FamSc13
39            Line 2: n for n-point averages
2             Line 3: (Type of) individual(s) to be used
13 32893405   Line 4: chr and bp of known variant
2000          Line 5: Number of top ranked variants to print
FamSc13.het

=== EXPLANATIONS, prophet program ===
Line 1
 Input file name, without ".tped". The associated ".tfam" file
 will also be accessed. No text after this name!

Line 2
 n for n-point averages, like n=51. Must be odd. The program will
 print on screen when an n-point average is exactly equal to 1. If
 so, n should be increased so that no sets of consecutive markers (or
 at most one) will all be heterozygous.

Line 3
 s for selecting individuals to work on:
 s = 2 for all affecteds
 s = 1 use unaffecteds
 s = -k to select individual on line k in tfam file

Line 4
 Chromosome and bp position of known functional variant,
 which will be flagged "dis"; 0 0 if unknown

Line 5
 Number of top ranked variants to be output by negpos program
 when run after the prophet program. This number has an effect
 only when the disease variant is unknown.

Line 6
 Name of output file. No text after this name!
 If this line is blank then the output file name will
 be as on line 1 with ".het" appended.
--------------end-----------

When you run the PropHet program with the parameter file as shown above, an output file called FamSc13.het is produced, which contains one line per variant. Taking this into a spreadsheet will you give a file (included as FamSc13.het.xlsx) with the following header line, indicating column contents:

chr      chromosome of given variant
marker   variant name
bp       basepair position of variant
dist-kb  distance in kb from disease variant (this line
         is missing when disease variant is unknown)
p.het    n-point averaged H (heterozygosity) value at given variant
dis      this column shows "dis" at the disease variant so it
         can easily be searched for.

The header line finishes with an indication of n (for n-point moving average) and how individuals were selected for analysis from the dataset (see parameter file). Thus, the output provides an H value for each variant except for the first (n - 1)/2 and last (n - 1)/2 variants on each chromosome, where n refers to n-point averages. The prophet program will also write a file, negpos.param, so that the program discussed below can be run directly as negpos negpos.param, or simply negpos.

Occasionally, multiple values of H = 1 are encountered. If this happens you may want to try running prophet with a higher value of n, but be aware that results for different individuals are strictly comparable only when obtained with the same n value.

NegPos program

Based on an output file containing H values like the one discussed above, this program produces runs (RIHs) of increasing heterozygosity followed by a sequence of variants with decreasing heterozygosity, one run per line. It does so as follows: First, values d1, d2, and so on are computed, where di = hi+1 - hi, and hi is the H value at the ith variant. Values of d = 0 (actually, d < 0.00001) are skipped. Thus, successively increasing H values are identified by negative d values, and decreasing H values lead to positive d values (hence the name, NegPos). The program then searches for the variant, vi, with the first negative d value and keeps going from one to the next variant until a positive d value is encountered at variant vj, then keeps going until at variant vk it again finds a negative d value. The RIH so found now extends from variants vi through vk, with the locally largest H value, Hmax, being at variant vj-1. Thus, while prophet generates H values, one per variant, the negpos program produces Hmax values, one per RIH.

This program should be invoked by typing, for example, negpos negpos.param, where the second item refers to the parameter file. This file consists of 4 lines as follows:

-------------begin-----------------
FamSc13.het
    2       Line 2: number of interruptions allowed
 1000       Line 3: number of top ranked variants to print
FamSc13.het.negpos

=== Parameter file for negpos program ===
For the convenience of the user, this file is
routinely written by the prophet program but it
may be modified by the user if desired.

Line 1
 Complete input file name (output of prophet program).
 No text after this name!

Line 2
 Number of interruptions allowed in neg-pos sequence
 so that, for example, a sequence if - signs can contain
 a + sign and still count as a sequence os minuses.

Line 3
 Number of top ranked neg-pos sequences to be output
 when the disease variant is unknown. This number has
 no effect when the disease variant is known.

Line 4
 Output file name. If blank, the output file name will be
 line 1 appended with ".negpos".
--------------end------------------

The program produces an output file called, in this case, FamSc13.het.negpos. It contains one RIH per line. Below is a description of this file for the given data.

FamSc13.het.negpos output file

The column headings in this file have the following meaning:

chrom      chromosome number for given RIH
start      start bp position of RIH
end        end bp position of RIH
LENbp      length in bp of RIH
bp_Hmax    the bp position at which negative d values change to 
           positive d values (not necessarily in the middle of
           the RIH), i.e. at Hmax.
Hmax       maximum H value in the given neg-pos sequence (at "bp_Hmax")
dis        the line containing the disease variant will be flagged
           with the word “dis”. This column is missing when the
           disease variant is unknown.

These Hmax values (one per RIH) are indicative of pathogenic dominant trait variants at or near bp_Hmax, within the given RIH.

Both programs also write log files containing information about the completed run. If the disease variant is unknown, the negpos program will also write a file called, in this case, FamSc13.het.negpos.res, containing variants ranked by four different criteria (see next section).

Ranking variants

When the disease variant is unknown, ranking of variants regarding their likelihood of being a disease variant is based on the Hmax values generated by the negpos program. We try to eliminate some of the false positive Hmax values as follows. Presumably, the RIH containing the disease variant is wider than the average of all RIHs. Thus, one might in a spreadsheet have two columns, column 1 with Hmax values and column 2 with values 0 and 1, where a value of 1 represents the fact a given RIH has a width larger than the average width. Then one would sort the Hmax values followed by sorting on the values in column 2. In practice, it is not always the case that the disease RIH is wider than the average RIH. Therefore, the negpos program will rank RIHs assuming, for example, that the disease RIH has a width greater than a quarter of the average width. This seems quite safe and eliminates some of the false positive Hmax values. The output file, in this case, FamSc13.het.negpos.res, will show rankings based on different assumptions about the width of the disease RIH.

References

1 Imai A, Li Y, Horpaopan S, Riazalhosseini Y, Garshasbi M, Mosse YP, Zhang D, Leal SM, Lathrop M, Ott J: Heterozygosity mapping for dominant trait variants (under review).

2 Ataei-Kachouei M, Nadaf J, Akbari MT, Atri M, Majewski J, Riazalhosseini Y, Garshasbi M: Double heterozygosity of BRCA2 and STK11 in familial breast cancer detected by exome sequencing. Iran J Public Health 2015;44:1348-1352.