Jurg Ott
July 2018
Rockefeller University, New York
ott@rockefeller.edu
Contents
MULTIPLE TESTS
MAPFUN program: apply map functions to recombination fractions
Updated: The conting program now has an added feature: for 2 × 3 contingency tables, the table entries are interpreted as numbers of genotypes, with rows = cases/controls and columns referring to the three genotypes 1/1, 1/2, 2/2, in that order. In addition to computing the usual chi-square for this 2 × 3 table, the program constructs the associated 2 × 2 table of alleles and computes its chi-square and permutation-based p-value. The latter is immune to population admixture (stratification, substructure).
In the TREND program, equation (15.1) has been replaced by equation (15.6) of Armitage et al (2002) to better allow for small numbers of observations.
Windows:
The UTIL package may be obtained as a set of small programs. Simply copy the UtilWin.zip file into a specific directory (folder), for example, c:\util, and unzip it. Then, double-click on UP (the up.exe file). Note that all extracted files must be in the same folder or be accessible through the path.
Linux:
Compiled programs and source code are available here. The source code is the same as in the Windows distribution. In Ubuntu Linux, the binaries should be made executable with the chmod command and copied, for example, into the /usr/local/bin folder.
Most programs are
interactive. To use interactive programs in batch mode, an input file
may be created (for example, with the Windows Notepad program)
containing exactly the information asked by the program during
interactive operation; the program is then run with redirection of
standard input and output. An example is given for the HET program,
below.
The UTIL package is a
set of programs useful in statistical genetics. Information about
most of the methods underlying these programs may be found in Ott
(1999). All but the IBD and NOCOM programs are written in Free
Pascal in a manner as close to standard Pascal as
possible. The Free Pascal compiler is available at no charge and is
largely compatible with Turbo Pascal but does not suffer from Turbo
Pascal's many restrictions. It is available for Windows, Linux, etc.
Programs may be used as they come and there is generally no need to
recompile them. Linux users will find additional information in a
separate
document.
Generally, a statistical test of a hypothesis is carried out at a certain significance level, α_1, where α_1 is the probability of a false positive test result. When m independent tests are carried out, each at the α_1 level, the probability that at least one of them leads to a false positive result is given by

α_m = 1 − (1 − α_1)^m,

where α_m > α_1 may be regarded as the overall significance level. To keep α_m at some predetermined value, for example, 0.05, one must carry out each individual test at the α_1 level, which is given by

α_1 = 1 − (1 − α_m)^(1/m) ≈ α_m/m

(Bonferroni correction). However, when tests are non-independent, this correction is known to be quite conservative. The most accurate and unbiased way of computing significance levels is with permutation tests.
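These two formulas are easy to check numerically; the sketch below is a plain illustration (function names are mine, not part of UTIL):

```python
# Overall significance level for m independent tests, each at level alpha1,
# and the per-test level needed to keep the overall level fixed.

def overall_alpha(alpha1: float, m: int) -> float:
    """alpha_m = 1 - (1 - alpha1)^m."""
    return 1.0 - (1.0 - alpha1) ** m

def per_test_alpha(alpha_m: float, m: int) -> float:
    """alpha_1 = 1 - (1 - alpha_m)^(1/m), approximately alpha_m / m."""
    return 1.0 - (1.0 - alpha_m) ** (1.0 / m)

print(round(overall_alpha(0.05, 10), 4))   # -> 0.4013 (chance of >= 1 false positive)
print(round(per_test_alpha(0.05, 10), 6))  # -> 0.005116 (exact per-test level)
print(0.05 / 10)                           # -> 0.005 (Bonferroni approximation)
```

Note how close the Bonferroni approximation α_m/m is to the exact per-test level even for m = 10.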
Fisher's exact test (based on the hypergeometric distribution) is carried out for 2 × 2 tables. The maximum number of observations is set equal to 100,000. In addition to p-values, the program computes the disequilibrium parameter, both in absolute value and in % of its maximum, and the odds ratio (approx. relative risk; see also ODDSRATIO program). Example input table (with row totals):

 4 16 | 20
 5 37 | 42
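The hypergeometric computation for such a table can be sketched as follows. This is a plain illustration, not the program's Pascal code; the two-sided rule used here (sum all tables no more probable than the observed one) is a common convention and may differ in detail from the program's:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p for a 2x2 table: with margins fixed, the first
    cell is hypergeometric; sum the probabilities of all tables whose
    probability does not exceed that of the observed table."""
    r1, r2 = a + b, c + d      # row totals
    c1 = a + c                 # first column total
    n = r1 + r2
    def prob(x):               # P(first cell = x) given the fixed margins
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)
    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # small tolerance so ties with the observed probability are counted
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)
```

For the table above, fisher_exact_two_sided(4, 16, 5, 37) returns the two-sided p-value.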
For a 3 × 3 table of genotypes at two SNPs, this program estimates haplotype frequencies and determines D'.
The BINOM program is interactive and computes 1) class probabilities for binomial data, or 2) confidence limits for p on the basis of an observed proportion, k/n, where the number k of "successes" is binomially distributed as Bin(p, n). Multinomial data may be handled by considering one class at a time and lumping together the other classes, thus reducing the problem to a binomial one (Murphy 1982, p. 296). The overall confidence coefficient for confidence intervals obtained in this manner will then be larger than the confidence coefficient specified for each class (see section on multiple tests). A well-known example of a multinomial problem is that of estimating gene frequencies for more than two alleles.
1) CLASS PROBABILITIES: BINOM will ask for values of n (number of observations) and p (probability of 'success' for an observation). One may enter either a single value of k = observed number of successes, or two values of k, k_1 and k_2. The program will then calculate the binomial probability, P(k), or the sum of these probabilities for all k-values between k_1 and k_2 (including k_1 and k_2), assuming that k follows a binomial distribution, Bin(n, p). The number, n, currently has an upper limit of 100,000.
Among the many uses of the BINOM program, consider the following application. Assume that under the null hypothesis, each observation (opportunity for recombination) has probability 0.5 of resulting in a 'success' (i.e., a recombination). In n = 20 observations, only 4 (rather than the 10 expected) were recombinants. Is the observed proportion of recombinants, t = 4/20 = 0.20, significantly smaller than the expected proportion of 50%? The empirical significance level is defined as the probability, under the null hypothesis, of obtaining an outcome as extreme as or more extreme than the one observed. Using BINOM with n = 20 and p = 0.5, one finds that the k-values from 0 through 4 have a combined probability of occurrence of 0.0059. The result is, thus, significant at the 0.01 level. Note that in linkage analysis, more stringent significance levels are generally required, e.g., 0.0001.
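The tail sum in this example is easy to reproduce; the sketch below (plain Python, names mine) mirrors what BINOM reports for a k-range:

```python
from math import comb

def binom_prob(k, n, p):
    """P(K = k) for K ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_range(k1, k2, n, p):
    """P(k1 <= K <= k2), the sum BINOM reports for a range of k-values."""
    return sum(binom_prob(k, n, p) for k in range(k1, k2 + 1))

# 4 or fewer recombinants in 20 meioses under theta = 0.5:
print(round(binom_range(0, 4, 20, 0.5), 4))  # -> 0.0059
```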
2) CONFIDENCE INTERVALS: The user will be asked to specify the lower (E_L) and upper (E_U) error probabilities associated with the lower and upper endpoints of the confidence interval. Usually, E_L and E_U will be equal (e.g., 0.025 each for a 95% confidence coefficient). The lower endpoint, p_L, is determined numerically such that the sum of the binomial probabilities, P(i, p_L, n), i = k..n, is equal to E_L; p_U is found analogously such that E_U is equal to the sum of the P(i, p_U, n), i = 0..k. The numerical method used for solving these equations is that of halving the interval. Statistics are based on Pfanzagl (1966).
The
results will be accurate to the number of decimal places chosen for
output. Thus, be aware that a large number of decimal places and/or a
large number, n,
may require long computing times on older machines. For numerical
reasons, the endpoint of confidence intervals cannot be smaller than
0.0001 or larger than 0.9999.
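The interval-halving step for the lower endpoint can be sketched as follows (plain illustration, names mine; the upper endpoint is found the same way with the opposite tail):

```python
from math import comb

def binom_upper_tail(k, n, p):
    """Sum of binomial probabilities P(i; n, p) for i = k..n."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def lower_limit(k, n, e_l, tol=1e-8):
    """Find p_L with upper-tail(k; n, p_L) = E_L by halving the interval.
    The upper tail increases monotonically in p, so bisection applies."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binom_upper_tail(k, n, mid) < e_l:
            lo = mid           # tail too small: p_L must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For k = 4, n = 20, and E_L = 0.025 this yields the familiar Clopper-Pearson-style lower limit of about 0.057.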
This program computes the p-value associated with a given value of chi-square and its number of degrees of freedom (df). The method is based on formulas 26.4.4 and 26.4.21 in Abramowitz and Stegun (1968) (exact calculation of p-values). For χ² > 169 and df > 1, an approximation is used, corresponding to formula 26.4.14 in Abramowitz and Stegun (1968).
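For integer df the upper tail has closed forms (cf. the even- and odd-df series in Abramowitz and Stegun, section 26.4); the sketch below is a plain re-derivation, not the program's Pascal code:

```python
from math import erfc, exp, sqrt, gamma

def chiprob(x, df):
    """Upper-tail probability Q(x | df) of the chi-square distribution
    for positive integer df."""
    h = x / 2.0
    if df % 2 == 0:
        # even df: Q = e^-h * sum_{j=0}^{df/2-1} h^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= h / j
            total += term
        return exp(-h) * total
    # odd df: Q = erfc(sqrt(h)) + e^-h * sum_{j=1}^{(df-1)/2} h^(j-1/2)/Gamma(j+1/2)
    total = 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += h ** (j - 0.5) / gamma(j + 0.5)
    return erfc(sqrt(h)) + exp(-h) * total
```

For example, the familiar 5% critical values 3.841 (1 df), 5.991 (2 df), and 7.815 (3 df) all give p close to 0.05.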
For contingency tables with given numbers of rows and columns, this program interactively calculates likelihood ratio or Pearson chi-squares and associated p-values. For 2 × 3 tables it interprets the two rows as corresponding to control and case individuals, with the three columns referring to SNP genotypes 1/1, 1/2, and 2/2, in that order. In addition to computing the regular chi-square and nominal p-value for this table of genotypes, conting will also find the associated 2 × 2 table of alleles and compute its chi-square and the corresponding permutation-based p-value, which is immune to population stratification. Note: The program requires a positive integer seed for its random number generator, which it will read from the seed.txt file if present; if not, it will create a seed based on the system clock. The number of permutations used is 100,000, including the observed data, so that the smallest possible significance level is 1/100,000 = 0.00001.
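The genotype-to-allele collapse and the permutation step can be sketched as follows. This is my illustration, not conting's actual code; in particular, the label-shuffling scheme (permuting case/control status over individuals while keeping each individual's genotype intact) is an assumption about how such a test is typically done:

```python
import random

def allele_table(geno):
    """Collapse a 2x3 genotype table (rows: cases/controls; columns
    1/1, 1/2, 2/2) into the 2x2 table of allele counts."""
    return [[2 * row[0] + row[1], row[1] + 2 * row[2]] for row in geno]

def allele_chisq(t):
    """Pearson chi-square for a 2x2 table."""
    r = [sum(row) for row in t]
    c = [t[0][j] + t[1][j] for j in range(2)]
    n = sum(r)
    return sum((t[i][j] - r[i] * c[j] / n) ** 2 / (r[i] * c[j] / n)
               for i in range(2) for j in range(2))

def perm_pvalue(geno, n_perm=100_000, seed=1):
    """Permutation p-value: shuffle case/control labels over individuals
    and recompute the allele chi-square. The observed data count as one
    permutation, so the smallest possible p is 1/n_perm."""
    rng = random.Random(seed)
    obs = allele_chisq(allele_table(geno))
    genotypes = [g for row in geno for g, cnt in enumerate(row) for _ in range(cnt)]
    n_cases = sum(geno[0])
    hits = 1                                  # the observed table itself
    for _ in range(n_perm - 1):
        rng.shuffle(genotypes)
        cases = genotypes[:n_cases]
        row0 = [cases.count(0), cases.count(1), cases.count(2)]
        row1 = [geno[0][j] + geno[1][j] - row0[j] for j in range(3)]
        if allele_chisq(allele_table([row0, row1])) >= obs - 1e-12:
            hits += 1
    return hits / n_perm
```

Because individuals, not alleles, are permuted, the two alleles of each person stay together, which is what makes the allele-table p-value honest despite the non-independence of alleles within individuals.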
For codominant marker systems, this program calculates maximum likelihood and unbiased estimates of heterozygosity (see Weir 1996). Also, standard errors are computed (based on Nei and Roychoudhury 1974) and used to provide confidence intervals for true heterozygosity.
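The two point estimates can be sketched as follows (plain illustration; the small-sample correction shown is the usual Nei-style factor, and the program's standard-error-based confidence intervals are not reproduced here):

```python
def heterozygosity(allele_counts):
    """ML and bias-corrected estimates of expected heterozygosity from
    observed allele counts at one codominant locus."""
    m = sum(allele_counts)            # number of alleles sampled (2 x individuals)
    p2 = sum((c / m) ** 2 for c in allele_counts)
    h_ml = 1.0 - p2                   # maximum likelihood estimate
    h_unb = m * (1.0 - p2) / (m - 1)  # small-sample bias correction
    return h_ml, h_unb
```

For a SNP with 50 copies of each allele, the ML estimate is 0.5 and the corrected estimate is slightly larger.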
The HIST program is interactive and almost self-explanatory. It produces on the screen a simple histogram for a number of quantitative observations. The number of classes is 20. Observed values x may be subjected to a power transformation,

y = (x^r − 1)/r + r,

where the user can choose the exponent (power) r. Note that the limiting value r = 0 corresponds to y = ln(x), and r = 1 (the default value) is equivalent to y = x. A sample input file is provided. It was obtained by generating random normal deviates with a standard deviation of 2, where the first 150 observations have mean 10 and the next 50 observations have mean 15.
This interactive program tests deviations from Hardy-Weinberg equilibrium (HWE) in codominant systems. It requests the number of alleles and observed genotypes. As noted in program output, genotypes must be furnished in the order 1/1, 1/2, 2/2, 1/3, 2/3, 3/3, 1/4, and so on. The program compares observed numbers of genotypes with the numbers of genotypes expected under HWE, where the latter are computed on the basis of allele frequencies estimated from the genotype frequencies (null hypothesis, H_0). For observed numbers, the relative cell frequencies are the estimates of the genotype probabilities (alternative hypothesis, H_1). For the comparison between observed and expected numbers of genotypes, the likelihood ratio chi-square is computed. If n is the number of alleles, the number of parameters estimated under H_0 and H_1 is n − 1 and n(n + 1)/2 − 1, respectively. Thus, the number of df for the chi-square test is n(n − 1)/2. Results are unreliable with small numbers of observations; then, more elaborate approaches are advocated (Guo and Thompson 1992, Zaykin et al. 1995). The program also lists the log likelihoods (natural logarithms) under H_0 (HWE) and H_1 (no HWE).
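For the two-allele case (df = n(n − 1)/2 = 1) the computation reduces to a few lines; this is a sketch of the likelihood-ratio statistic, not the program's code:

```python
from math import log

def hwe_lr_chisq(n11, n12, n22):
    """Likelihood-ratio chi-square (1 df for 2 alleles) comparing observed
    genotype counts with their Hardy-Weinberg expectations."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)              # allele 1 frequency under H0
    expected = [n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2]
    observed = [n11, n12, n22]
    # 2 * sum O * ln(O/E); empty cells contribute nothing
    return 2.0 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
```

For genotype counts 30/40/30 the statistic is about 4.03 (p < 0.05 with 1 df), a mild heterozygote deficit.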
This program carries out exact (probability) tests of HWE, using a formula on page 99 in Weir (1996). Consider three genotypes, AA, AB, and BB at a SNP, and corresponding observed numbers of genotypes, nAA, nAB, and nBB. For a constant total number of observed genotypes and number of A alleles, the program evaluates all possible numbers of genotypes consistent with these constant numbers of genotypes and A alleles, and computes the probability of occurrence of each such set of genotypes. The empirical significance level p is then obtained as the sum of all those probabilities of occurrence that are smaller than or equal to the probability of occurrence of the observed set of genotypes. The current program version works with a total number of observations of up to 1,000,000.
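The enumeration can be sketched as follows, using the conditional probability of the heterozygote count given the fixed totals (the formula commonly attributed to Levene; I have not verified that Weir's page-99 formula is written in exactly this form, but the two agree numerically):

```python
from math import factorial

def hwe_exact_p(naa, nab, nbb):
    """Exact HWE test at a SNP: enumerate all heterozygote counts compatible
    with the fixed genotype total and allele count, and sum the probabilities
    of outcomes no more probable than the observed one."""
    n = naa + nab + nbb
    na = 2 * naa + nab                 # copies of allele A
    nb = 2 * n - na
    def prob(h):                       # P(nAB = h | n, na), h = heterozygotes
        a = (na - h) // 2              # implied AA count
        b = (nb - h) // 2              # implied BB count
        return (factorial(n) * 2**h * factorial(na) * factorial(nb)
                / (factorial(a) * factorial(h) * factorial(b) * factorial(2 * n)))
    h_min = na % 2                     # parity of nA fixes parity of nAB
    support = range(h_min, min(na, nb) + 1, 2)
    p_obs = prob(nab)
    return sum(prob(h) for h in support if prob(h) <= p_obs + 1e-12)
```

Data sitting exactly at HWE expectations (e.g., 25/50/25) give p = 1, while a complete absence of heterozygotes with common alleles (50/0/50) gives a vanishingly small p.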
The program computes expected frequencies of ibd sharing from both parents given that (exactly) two offspring are affected with a two-locus trait. The two trait loci are unlinked with each other. Locus parameters are the penetrances for the 9 two-locus genotype pairs and the population allele frequencies at each locus. Each locus is assumed to be in Hardy-Weinberg equilibrium. In particular, absence of assortative mating and full viability/fertility are assumed. All meioses are taken to be informative with regard to ibd sharing at both disease loci (i.e., 100% informative). The program also computes deviations from independence between the ibd sharing at the two loci, where the expected values under independence are computed from the marginals for the two loci. In addition, the trait prevalence based on the entered parameter values is computed.
The program goes through all possible parental genotypes at both loci. (The order of alleles is taken into account for convenience of programming; i.e., genotype ab is different from ba.) For each parental genotype, all possible transmissions to two offspring are considered, and ibd sharing at each locus with regard to both parents is computed. The ibd sharing is weighted by the (unconditional) probability of the joint parental two-locus genotypes and the probability that both sibs are affected. The weighted ibd sharing is added to the appropriate cell in the ibd sharing table. Finally, the frequency of two-locus ibd sharing in each cell of the table is divided by the sum of the weights (see above) so that the frequencies in the table sum to 1.

The program prompts the user for keyboard input of the 9 penetrances, f1 through f9, at the two loci jointly (a, b = wildtype alleles; A, B = disease alleles). Also, the allele frequencies of the trait-predisposing alleles at both trait loci need to be entered. Output appears on the screen. Input order of the penetrances:

              Locus 2
Locus 1    bb    bB/Bb    BB
aa         f1    f2       f3
aA/Aa      f4    f5       f6
AA         f7    f8       f9

Example output (excerpt):

              Locus 2
Locus 1    b         B         marginal
a          0.2024    0.2262    0.4286
marginal   0.4286    0.5714    1
This program calculates equivalent observations of recombinants and nonrecombinants, as introduced by J.H. Edwards (for details, see Ott 1999). For a given maximum lod score Z and recombination fraction t at which Z occurs, the program assumes phase-known data and computes the number k of recombinants and total number n of meioses leading to Z and t, where t = k/n. Analogous calculations are carried out for any two pairs of values (Z, t) on the lod score curve.
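Assuming the phase-known lod score form Z(θ) = k·log10(2θ) + (n − k)·log10(2(1 − θ)), which is maximized at θ = k/n, the counts follow by dividing Z by the per-meiosis contribution at t (sketch, names mine; t must be below 0.5):

```python
from math import log10

def equivalent_observations(z, t):
    """Given a maximum lod score z attained at recombination fraction t
    (phase known, 0 < t < 0.5), recover k recombinants and n meioses with
    t = k/n and z = k*log10(2t) + (n-k)*log10(2(1-t))."""
    per_meiosis = t * log10(2 * t) + (1 - t) * log10(2 * (1 - t))
    n = z / per_meiosis
    return t * n, n          # (k, n)
```

For example, Z = 1.6742 at t = 0.2 corresponds to about 4 recombinants in 20 meioses, the data of the BINOM example above.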
For seven map functions, MAPFUN will calculate the map distance for a given recombination fraction, or vice versa. The program is interactive and self-explanatory. Recombination fractions (theta values) must be given as decimal fractions (not in %), and map distances in Morgans (not cM).
As
an example, the M option expects input of a recombination fraction
and will compute the corresponding map distance. With the option MS
(S standing for sum), MAPFUN will calculate a running addition of the
map distances and corresponding recombination fractions for each of
the map functions.
The map
functions used are Haldane, Kosambi, CarterFalconer, Rao,
Felsenstein, Sturt, and Binomial. The latter assumes a maximum of N
binomially distributed crossovers.
To
convert map distances from one metric into another (e.g., from
Haldane into Kosambi cM), one first converts the map distances in
each interval between adjacent loci into recombination fractions and
then converts the recombination fraction in each interval into the
desired type of map distance (Keats, Ott, and Conneally 1989).
Clearly, with today's dense marker maps, there is no longer much need
for mapping functions.
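The interval-wise conversion described above can be sketched for two of the seven functions; the Haldane and Kosambi formulas are standard:

```python
from math import log, exp, tanh

# Haldane:  x = -ln(1 - 2*theta)/2        theta = (1 - exp(-2x))/2
# Kosambi:  x = ln((1+2t)/(1-2t))/4       theta = tanh(2x)/2

def haldane_to_morgans(theta):  return -0.5 * log(1 - 2 * theta)
def haldane_to_theta(x):        return 0.5 * (1 - exp(-2 * x))
def kosambi_to_morgans(theta):  return 0.25 * log((1 + 2 * theta) / (1 - 2 * theta))
def kosambi_to_theta(x):        return 0.5 * tanh(2 * x)

# Converting 10 Haldane cM into Kosambi cM, interval by interval:
theta = haldane_to_theta(0.10)             # first back to a recombination fraction
print(round(100 * kosambi_to_morgans(theta), 3))   # then into Kosambi cM
```

The Kosambi distance comes out slightly below 10 cM, since Kosambi's function allows for crossover interference.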
This program computes the inverse of the normal distribution function, that is, it provides the normal deviate, z, for a given tail probability P or Q of the normal distribution. It is based on formula 26.2.14 in Abramowitz and Stegun (1968, page 932). Under option 1 (upper tail probability, Q), the program reports the corresponding normal deviate, z, and the density of the normal curve at z.
This program computes the upper tail probability, Q, associated with a normal deviate, z. It is based on several formulas in Abramowitz and Stegun (1968); formulas 26.2.12 and 26.2.14, tables 26.1 and 26.2.
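Both directions, the NORPROB tail probability and the NORINV inverse, can be sketched in a few lines (plain illustration using the complementary error function and interval halving, not the A&S polynomial approximations the programs cite):

```python
from math import erfc, sqrt

def norprob(z):
    """Upper-tail probability Q(z) of the standard normal distribution."""
    return 0.5 * erfc(z / sqrt(2))

def norinv(q, tol=1e-10):
    """Normal deviate z with Q(z) = q, found by halving the interval."""
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if norprob(mid) > q:      # Q decreases in z
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, Q(1.96) is approximately 0.025, and norinv(0.025) recovers 1.96.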
This program replaces an earlier program
called RELRISK. It is based on the combination of the log(odds ratio)
(OR) over 2 x 2 contingency tables by the Woolf (1955) method (see
Armitage et al 2002). It focuses on two factors, each of which is
either present or absent. One of these may be a risk factor and the
other a response variable. For example, the layout of each table may
be as follows:
              Cases   Controls                       present   absent
Risk factor     a        b          or   Cases          a        b
not at risk     c        d               Controls       c        d
For a
set of 2 × 2 tables (observed numbers, a,
b, c, d in the four cells of each
table), odds ratios (approx. relative risks) are calculated in each
table and tests of homogeneity are carried out for the tables in a
set. Observations in the different tables are considered independent.
For example, in a given table, the two rows may correspond to case
and control individuals, and the two columns may represent the two
alleles at a given SNP marker. Different tables (same SNP) may
correspond to different ethnicities, or to groups of people with
different non-genetic risk factors. As recommended by Breslow and Day (1980), 1/2 is added to each table entry.
The
program is interactive and prints results after each set of tables.
The input data (numbers of observations in the four cells per 2 ×
2 table) are expected to be entered on the keyboard, but an input
file (e.g., OR.txt)
may be created and the program run with the command, oddsratio
<OR.txt. The program creates an output file called
ORout.txt.
Example data as quoted in Table 4.3, page 145, of Breslow and Day (1980):

6
1 0 9 106
4 5 26 164
25 21 29 138
42 34 27 139
19 36 18 88
5 8 0 31
0
Resulting output produced by the ODDSRATIO program:

Group   OR       95% confid. interval
1       1.5714   0.5333   4.6304
2       1.9412   0.5904   6.3819
All     1.7289   0.7768   3.8479

Chi-square for het: 0.0664, 1 df, p = 0.796691
NOTE:
The odds ratio for "All" is the weighted average of table
odds ratios as given by the Woolf method; it is not the OR for the
summed table entries.
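Woolf's weighting can be sketched as follows (a plain re-implementation, not the program's Pascal code): each table's log odds ratio is weighted by the reciprocal of its variance, 1/a + 1/b + 1/c + 1/d, after the 1/2 correction:

```python
from math import log, exp

def woolf_combined_or(tables):
    """Combine 2x2 tables (a, b, c, d) by Woolf's method. Returns the
    combined odds ratio and the heterogeneity chi-square (k - 1 df)."""
    lors, weights = [], []
    for a, b, c, d in tables:
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5   # Breslow-Day correction
        lors.append(log(a * d / (b * c)))
        weights.append(1.0 / (1 / a + 1 / b + 1 / c + 1 / d))
    mean = sum(w, l_ := 0) if False else sum(w * l for w, l in zip(weights, lors)) / sum(weights)
    het = sum(w * (l - mean) ** 2 for w, l in zip(weights, lors))
    return exp(mean), het
```

With identical tables the combined OR equals the single-table OR and the heterogeneity chi-square is zero, which is a handy sanity check.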
The population attributable risk (PAR) estimates the proportion of cases in the total population that are attributable to a given risk factor. Based on equation (19.35) on page 684 in Armitage/Berry/Matthews (2002), the PAR program computes this quantity for the following input table:

              Cases   Controls                       present   absent
Risk factor     a        c          =    Cases          a        b
not at risk     b        d               Controls       c        d

Example input (smoking as the risk factor; cases row: 83 3)
This program combines independent p-values by two different methods as follows.

1) Fisher method. When p-values from n independent investigations are to be combined into one overall p-value, R.A. Fisher's method (page 99 in Fisher [1970]) specifies that one should transform each value of p, which has a uniform distribution under the null hypothesis, into c = −2 × ln(p), which has a chi-square distribution with 2 df. The resulting n c-values are added together. Their sum, Σ(c), represents a chi-square variable with 2n df. The PVALUES program carries out these transformations and applies the CHIPROB program to arrive at a total p-value.
For example, assume that three independent tests (not necessarily chi-square tests) have furnished the p-values 0.011, 0.047, and 0.35. The combined p-value is then equal to 0.008. Elston (1991) has published interesting comments about Fisher's method.
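The three-p-value example works out as follows (sketch, names mine; since 2n df is always even, the chi-square tail has a simple closed form):

```python
from math import log, exp

def fisher_combine(pvalues):
    """Fisher's method: chi2 = -2 * sum(ln p) with df = 2n. With even df,
    the upper tail is e^-h * sum_{j<df/2} h^j/j!, where h = chi2/2."""
    chi2 = -2.0 * sum(log(p) for p in pvalues)
    df = 2 * len(pvalues)
    h = chi2 / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= h / j
        total += term
    return exp(-h) * total

print(round(fisher_combine([0.011, 0.047, 0.35]), 3))  # -> 0.008
```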
For a given p_1, the table below shows the p_2 value at which the Fisher-combined p-value equals p_1; a second p-value larger than this weakens, rather than strengthens, the overall result.

p_1       p_2
0.05      0.1741
0.01      0.1309
0.001     0.0977
0.0001    0.0784
0.00001   0.0656
1.0E-6    0.0565
1.0E-7    0.0497
1.0E-8    0.0444
Calculates conversions among

Tm = male recombination fraction
Tf = female recombination fraction
Ta = sex-averaged recombination fraction (assuming equal numbers of meioses in both sexes)
Xm = male map distance
Xf = female map distance
R  = Xf/Xm
For example, SEXDIF can transform a pair of values (Tm, Tf) into (Tm, R); that is, for given male and female recombination fractions, it calculates the ratio of female-to-male map distance. The program is interactive and self-explanatory. Also, for a given sex-averaged recombination fraction and R, it solves for the Tm and Tf values. With dense genetic maps, there is no longer much need for mapping functions.
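Two of these conversions can be sketched. The averaging assumes equal numbers of meioses in both sexes, as stated above; the use of Haldane's map function for R is my illustrative assumption, since the program does not document which function it applies:

```python
from math import log

def sex_averaged_theta(tm, tf):
    """Ta from Tm and Tf, assuming equal numbers of male and female meioses."""
    return 0.5 * (tm + tf)

def female_to_male_ratio(tm, tf):
    """R = Xf/Xm after converting each theta to a map distance
    (Haldane's function used here for illustration)."""
    xm = -0.5 * log(1 - 2 * tm)
    xf = -0.5 * log(1 - 2 * tf)
    return xf / xm
```

Equal recombination fractions give R = 1, and a larger female theta gives R > 1, reflecting the generally longer female map.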
Interactive program to calculate the two-sided tail probability for a given value t of the t distribution with a given number of degrees of freedom (whole numbers only, no fractional degrees of freedom). The method is based on equations 4.2-4.4, page 96, in Johnson & Kotz (1970).
For a 2 × 3 table with 2 rows (cases, controls or vice versa) and 3 columns (SNP genotypes A/A, A/B, B/B), this program carries out the (approximate) Cochran-Armitage test for trend in proportions across columns. It is based on equation (15.1) in Armitage/Berry/Matthews (2002). For example, the following data,

19 29 24
497 560 269

furnish chi-square (1 df) = 7.1927, p = 0.0073, indicating evidence for a difference in trend in the two rows.
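The trend statistic for this example can be reproduced with the standard Cochran-Armitage formula and default scores 0, 1, 2 for the three genotypes (a sketch, not the program's code; whether TREND uses these exact scores is an assumption):

```python
from math import erfc, sqrt

def trend_chisq(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage chi-square (1 df) for trend across genotype columns."""
    col = [x + y for x, y in zip(cases, controls)]   # column totals
    r1 = sum(cases)
    n = r1 + sum(controls)
    r2 = n - r1
    sxr = sum(s * r for s, r in zip(scores, cases))
    sxn = sum(s * c for s, c in zip(scores, col))
    sx2n = sum(s * s * c for s, c in zip(scores, col))
    num = n * (n * sxr - r1 * sxn) ** 2
    den = r1 * r2 * (n * sx2n - sxn ** 2)
    return num / den

chi2 = trend_chisq([19, 29, 24], [497, 560, 269])
p = erfc(sqrt(chi2 / 2))                  # 1-df chi-square upper tail
print(round(chi2, 4), round(p, 4))        # -> 7.1927 0.0073
```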
Abramowitz M, Stegun IA (1968) Handbook of mathematical functions. Dover, New York

Armitage P, Berry G, Matthews JNS (2002) Statistical methods in medical research, 4th edition. Blackwell, Oxford

Breslow NE, Day NE (1980) Statistical methods in cancer research, vol 1: The analysis of case-control studies. International Agency for Research on Cancer, Lyon

Elston RC (1991) On Fisher's method of combining p-values. Biom J 33, 339-345

Fisher RA (1970) Statistical methods for research workers, 14th edition. Hafner/MacMillan, New York

Frankel WN, Schork NJ (1996) Who's afraid of epistasis? Nat Genet 14, 371-373

Guo SW, Thompson EA (1992) Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics 48, 361-372

Johnson SM (1963) Generation of permutations by adjacent transposition. Math Comp 17, 282-285

Johnson NL, Kotz S (1970) Continuous univariate distributions-2. Houghton Mifflin, Boston

Keats BJB, Ott J, Conneally PM (1989) Human Gene Mapping 10 - Report of the committee on linkage and gene order. Cytogenet Cell Genet 51, 459-502

Lehmer DH (1964) The machine tools of combinatorics. Chapter 1 in: Applied combinatorial mathematics, ed. Beckenbach EF. Wiley, New York

Mantel N, Haenszel W (1959) Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 22, 719-748

Murphy EA (1982) Biostatistics in medicine. Johns Hopkins University Press, Baltimore

Nei M, Roychoudhury AK (1974) Sampling variances of heterozygosity and genetic distance. Genetics 76, 379-390

Nijenhuis A, Wilf HS (1978) Combinatorial algorithms for computers and calculators, 2nd edition. Academic Press, New York

Ott J (1985) A chi-square test to distinguish allelic association from other causes of phenotypic association between two loci. Genet Epidemiol 2, 79-84

Ott J (1992) Strategies for characterizing highly polymorphic markers in human gene mapping. Am J Hum Genet 51, 283-290

Ott J (1995) Estimating crossover frequencies and testing for numerical interference with highly polymorphic markers. In: Genetic mapping and DNA sequencing, vol 81 of "The IMA Volumes in Mathematics and its Applications," eds. Speed T, Waterman MS. Springer, New York, pp 49-63

Ott J (1997) Testing for interference in human genetic maps. J Mol Med 75, 414-419

Ott J (1999) Analysis of human genetic linkage, 3rd edition. Johns Hopkins University Press, Baltimore

Ott J (2004) Association of genetic loci - Replication or not, that is the question. Neurology 63, 955-958 (review)

Pfanzagl J (1966) Allgemeine Methodenlehre der Statistik II. Sammlung Göschen Band 747, Walter de Gruyter, Berlin

Terwilliger JD, Ott J (1994) Handbook of human genetic linkage. Johns Hopkins University Press, Baltimore

Weir BS (1996) Genetic data analysis II. Sinauer, Sunderland

Whitlock MC (2005) Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach. J Evol Biol 18, 1368-1373

Woolf B (1955) On estimating the relation between blood group and disease. Ann Hum Genet 19, 251-253

Zaykin D, Zhivotovsky L, Weir BS (1995) Exact tests for association between alleles at arbitrary numbers of loci. Genetica 96, 169-178