Statdel program for prioritizing pathogenic chromosomal deletions

Jurg Ott, 20 March 2020

Introduction

This page documents statdel, a computer program described in Imai-Okazaki et al. (2017). Only the basic instructions for its use are given here. More details may be found in the sample parameter file.

Parameter file

The program is run from a command window (Windows) or terminal window (Linux) by typing on the command line, for example: statdel Pt369.param, where the second item is the name of the required parameter file. The program package includes a sample parameter file that documents the relevant parameters. The sample parameter file looks as follows:

Statdel: Pt369 with known pathogenic deletions
-9 17 14005439 15217437 17p12 code for missing, chrom, start and end bp of dis.del, 0 0 0 if unknown
-12 1 1 code for test statistic, exponent, include obs
1 3 0.5 0.8 no. of "n" values, min #variants for HDR, min median(HDR), min. overlap
hdr_scores_Pt369.txt
StatdelPt369.out

EXPLANATIONS

Line 1: Any text, truncated to 255 characters.

Line 2:

  1. Value for "missing", should be left equal to -9
  2. Chromosome number for disease deletion if known (=0 otherwise)
  3. Chromosomal start position of disease deletion if known (=0 otherwise)
  4. Chromosomal end position of disease deletion if known (=0 otherwise). Items 2-4 are only used in a research setting when the true disease deletion is known. Any text following this item is not used by the program.

Line 3:

  1. Code for primary test statistic (see below)
  2. Exponent for power transformation of HDR values (see Note 2 below)
  3. 1 if observed data are counted among null data, 0 otherwise (see Note 4)

Line 4:

  1. Number, N, of different boundary values ("n") flanking an ROH, currently fixed at 1
  2. Min. average number of basepairs (variants) for a valid HDR (preferably 3 or more)
  3. Min. median HDR for an ROH to have a valid result.
  4. Min. proportion of disease deletion to be covered by an ROH so that this ROH is taken to identify the disease deletion. Such an ROH is flagged * on output. Two examples, assuming min. overlap = 0.8, are as follows:
Disease deletion:     o o o o o o o o
Candidate ROH:  + + + + +

The ROH overlaps a proportion of 2/8 = 0.25 of the deletion, so does not qualify to represent the deletion.

Disease deletion:  o o o o o o o o o o
Candidate ROH:       + + + + + + + + + + + +

The ROH overlaps a proportion of 9/10 = 0.90 of the deletion, so it qualifies to represent the deletion.

Exception: If an ROH is shorter than the known deletion but completely inside the deletion boundaries, then the overlap is declared to be 1.0 (100%).
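
The overlap rule and its exception can be sketched in a few lines of Python. This is only an illustration of the two examples above, not the code in overlap.p: the function names are hypothetical, and positions are treated as inclusive integer coordinates, which is an assumption.

```python
def overlap_proportion(del_start, del_end, roh_start, roh_end):
    """Proportion of the disease deletion covered by a candidate ROH.

    Assumption: positions are inclusive integer coordinates.
    """
    # Exception: an ROH lying completely inside the deletion counts as 100% overlap.
    if roh_start >= del_start and roh_end <= del_end:
        return 1.0
    shared = min(del_end, roh_end) - max(del_start, roh_start) + 1
    if shared <= 0:
        return 0.0  # the two intervals do not intersect
    return shared / (del_end - del_start + 1)


def identifies_deletion(del_start, del_end, roh_start, roh_end, min_overlap=0.8):
    """True if the ROH would be flagged as representing the deletion."""
    return overlap_proportion(del_start, del_end, roh_start, roh_end) >= min_overlap


# First example: deletion at positions 4..11, ROH at 1..5, overlap 2/8
print(overlap_proportion(4, 11, 1, 5))
# Second example: deletion at positions 1..10, ROH at 2..13, overlap 9/10
print(overlap_proportion(1, 10, 2, 13))
```

With min. overlap = 0.8, identifies_deletion returns False for the first example and True for the second.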

Line 5: Name of the file holding HDR values

Line 6: Name of results (output) file. If this line is blank the output file will be called "statdel.out".
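
For readers who want to check a parameter file before running statdel, the six lines described above can be read with a short Python script. This is a sketch only: the class and function names are hypothetical, and it assumes that items on a line are separated by whitespace and that everything after the required items is comment text.

```python
from dataclasses import dataclass

SAMPLE = """\
Statdel: Pt369 with known pathogenic deletions
-9 17 14005439 15217437 17p12 code for missing, chrom, start and end bp of dis.del, 0 0 0 if unknown
-12 1 1 code for test statistic, exponent, include obs
1 3 0.5 0.8 no. of "n" values, min #variants for HDR, min median(HDR), min. overlap
hdr_scores_Pt369.txt
StatdelPt369.out
"""


@dataclass
class StatdelParams:
    title: str
    missing: float
    chrom: int
    del_start: int
    del_end: int
    stat_code: int
    exponent: float
    include_obs: bool
    n_boundaries: int
    min_variants: float
    min_median_hdr: float
    min_overlap: float
    hdr_file: str
    out_file: str


def parse_param_text(text):
    """Parse the six parameter lines (assumes whitespace-separated items)."""
    lines = text.splitlines()
    lines += [""] * max(0, 6 - len(lines))  # a missing last line counts as blank
    line2, line3, line4 = (lines[i].split() for i in (1, 2, 3))
    return StatdelParams(
        title=lines[0][:255],  # line 1 is truncated to 255 characters
        missing=float(line2[0]),
        chrom=int(line2[1]),
        del_start=int(line2[2]),
        del_end=int(line2[3]),  # trailing text on line 2 is ignored
        stat_code=int(line3[0]),
        exponent=float(line3[1]),
        include_obs=(line3[2] == "1"),
        n_boundaries=int(line4[0]),
        min_variants=float(line4[1]),
        min_median_hdr=float(line4[2]),
        min_overlap=float(line4[3]),
        hdr_file=lines[4].strip(),
        out_file=lines[5].strip() or "statdel.out",  # default for a blank line 6
    )


params = parse_param_text(SAMPLE)
print(params.chrom, params.stat_code, params.out_file)
```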

----------------------------------------

Codes on line 3:

   1 = mean difference
   3 = Kolmogorov-Smirnov 2-sample test statistic
  11 = 2-sample t statistic, equal variances
  12 = 2-sample t statistic, unequal variances

NOTE 1: A positive value of the code requests a 2-sided test, a negative value a 1-sided test. For example, -3 tests whether values in group 2 tend to be larger than those in group 1 (that is, whether the distribution function in group 1 tends to be larger than that in group 2).
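
The sign convention for the code can be illustrated with a small Python helper. The function name and the returned tuple are hypothetical; only the four codes listed above are recognized.

```python
STATISTICS = {
    1: "mean difference",
    3: "Kolmogorov-Smirnov 2-sample test statistic",
    11: "2-sample t statistic, equal variances",
    12: "2-sample t statistic, unequal variances",
}


def decode_stat_code(code):
    """Return (statistic name, number of sides) for a code from line 3."""
    name = STATISTICS.get(abs(code))
    if name is None:
        raise ValueError(f"unknown test statistic code: {code}")
    sides = 2 if code > 0 else 1  # positive = 2-sided, negative = 1-sided
    return name, sides


# The sample parameter file uses -12: a 1-sided unequal-variance t statistic.
print(decode_stat_code(-12))
```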

NOTE 2: The exponent transforms HDR data with a skewed distribution so that they become (nearly) normal. Assume that x is a quantitative trait like body height or cholesterol level. Often such measurements have a long upper tail and a lower bound like 0. To make their distributions more symmetric (and thus more normal) one might transform x, for example, into y = √x = x^λ, with λ = 1/2. More generally, such power transformations are given by y = (x^λ - 1)/λ + λ, with y = ln(x) for λ = 0 and y = x for λ = 1 (no transformation). In practice, however, there may not be a need for power transformations. See Ott J (1979) Detection of rare major genes in lipid levels. Hum Genet 51:79-91.
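
As a numerical illustration, the power transformation can be written in Python, reading the general form as y = (x^λ - 1)/λ + λ with the logarithm as the limiting case for λ = 0. The function name is hypothetical, and whether statdel applies exactly this form internally is an assumption.

```python
import math


def power_transform(x, lam):
    """Power transformation y = (x**lam - 1)/lam + lam, with ln(x) for lam = 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    if lam == 0:
        return math.log(x)  # limiting case as lam approaches 0
    return (x ** lam - 1.0) / lam + lam


print(power_transform(5.0, 1))    # lam = 1 leaves x unchanged
print(power_transform(4.0, 0.5))  # lam = 1/2 gives a linear function of sqrt(x)
print(power_transform(4.0, 0))    # lam = 0 gives ln(x)
```

Note that for λ = 1/2 the general form yields 2√x - 1.5, which is a linear function of √x and therefore has the same effect on the shape of the distribution.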

NOTE 3: Structure of data input file, with each line providing the following information.

Line 1:

  1. Number of candidate variants. Number of case individuals = 1, fixed.
  2. Number of control individuals in addition to the case individual
  3. (obsolete: Number of "regions", may be set to 1, no longer needed)

Line 2: Any text

Line 3:

  1. "chr": these 3 (or any other) characters in columns 1-3 are not used by the program
  2. Chromosome number following "chr"
  3. Basepair start position of deletion
  4. Basepair end position of deletion
  5. Number 1 or 2, with 1=HDR value, 2=HDR2 value (number of obs leading to HDR)
  6. HDR values of case-control pairs and control-control pairs

Line 4: Analogous input for variant 2, etc.

The total number of lines (after lines 1 and 2) is twice the number of candidate variants.
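
The layout described in NOTE 3 can be read with a short Python sketch. The function name is hypothetical, the example data are made up purely for illustration (they are not taken from hdr_scores_Pt369.txt), and whitespace-separated items are an assumption. The reader does not depend on the order of HDR and HDR2 lines, since each line carries its own type in item 5.

```python
EXAMPLE = """\
2 2 1
Made-up example, not real data
chr17 14005439 15217437 1 0.91 0.88 0.12
chr17 14005439 15217437 2 25 31 28
chr3 500000 650000 1 0.20 0.15 0.18
chr3 500000 650000 2 12 9 14
"""


def parse_hdr_text(text):
    """Read the header, the text line, and two data lines per candidate variant."""
    lines = text.splitlines()
    header = lines[0].split()
    n_variants, n_controls = int(header[0]), int(header[1])  # item 3 is obsolete
    title = lines[1]
    records = []
    for line in lines[2:]:
        if not line.strip():
            continue  # tolerate trailing blank lines
        fields = line[3:].split()  # columns 1-3 ("chr") are skipped
        records.append({
            "chrom": fields[0],          # kept as text
            "start": int(fields[1]),
            "end": int(fields[2]),
            "kind": int(fields[3]),      # 1 = HDR value, 2 = HDR2 value
            "values": [float(v) for v in fields[4:]],
        })
    if len(records) != 2 * n_variants:
        raise ValueError("expected two data lines per candidate variant")
    return n_variants, n_controls, title, records


n_variants, n_controls, title, records = parse_hdr_text(EXAMPLE)
print(n_variants, n_controls, len(records))
```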

NOTE 4: If item 3 on line 3 is equal to 1 then the observed data are included among the null (pseudo) data. This is standard practice in statistics. If the item is equal to 0 then observed data are not counted as null data.
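
One common way to read this option is as the choice between two forms of an empirical p-value. The Python sketch below shows both; the function name is hypothetical, and whether statdel computes its p-values in exactly this way is an assumption.

```python
def empirical_p_value(observed, null_values, include_obs=True):
    """Proportion of null statistics at least as large as the observed one.

    With include_obs=True the observed statistic is counted among the null
    values, giving (r + 1)/(n + 1); otherwise the result is r/n.
    """
    if not null_values:
        raise ValueError("at least one null value is required")
    r = sum(1 for v in null_values if v >= observed)
    n = len(null_values)
    if include_obs:
        return (r + 1) / (n + 1)
    return r / n


null = [1.0, 2.0, 3.5, 0.5]  # made-up null statistics
print(empirical_p_value(3.0, null, include_obs=True))   # (1 + 1)/(4 + 1)
print(empirical_p_value(3.0, null, include_obs=False))  # 1/4
```

Counting the observed data among the null data has the practical effect that the p-value can never be exactly 0.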

Included files

The following files are included in the program package:

errtrap.p               include file to source program
hdr_scores_Pt369.txt    sample data file
statdel                 Linux executable
statdel.exe             Windows executable
statdel.pas             source program
StatdelPt369.out        sample output file
overlap.p               include file to source program
Pt369.param             sample parameter file
readstr.p               include file to source program


Literature

Imai-Okazaki et al. (2017) HDR-del: A tool based on Hamming distance for prioritizing pathogenic chromosomal deletions in exome sequencing. Hum Mutat 38:1796-1800 (PMID: 28722338)