Jurg Ott
Rockefeller University New York
18 August 2015
Documentation to HOMOG programs
Contents
HOMOG program
Additional Programs
Construction of Support Intervals
References
INTRODUCTION
Two test situations may be distinguished: a mixture of families in which individual families cannot unequivocally be assigned to one type or group (HOMOG programs), or known groups of families (fixed classifications, MTEST program, see below), for example, as distinguished by their origin. The HOMOG programs are written in close to standard Pascal (exception: HOMOGM is written in C) and have been compiled with Free Pascal under Windows and Linux. HOMOG has been updated to accommodate multiple sets of input data.
This documentation describes various forms of the homogeneity
(admixture) test but only the basic HOMOG program
is furnished as a downloadable package. The additional programs are
somewhat outdated and presumably no longer of interest. They may be
obtained from the author upon request.
All HOMOG programs analyze heterogeneity (two or more disease loci) with respect to single marker loci or known maps of markers. In the first case, the programs expect lod scores between disease phenotype and the marker(s), and in the second case they expect multipoint lod scores for disease versus a known map (or maps) of markers. HOMOG carries out a homogeneity test (A-test) under the following alternative hypothesis: two family types, one with linkage between a trait (or any gene locus, for that matter) and a marker or map of markers, the other without linkage. For more information see Ott (1999). What follows are instructions on setting up the input file for the HOMOG program.
HOMOG program
Input is as described below. Default file names are HOMOG.DAT for input and HOMOG.OUT for output.
Line 1: Title line
Line 2: N STEPSIZE LDIFF where
- N = number of θ values at which lod scores are available or should be computed. Omit lod = 0 at θ = 0.5.
- STEPSIZE = step size at which α values are incremented in the search over the likelihood surface (for example, 0.05).
- LDIFF (optional) = difference in log likelihood, used in the construction of support intervals. If it is not given (or when LDIFF = 0), no support intervals will be computed. In regular situations, the joint support interval for α and θ corresponds to an approximate 95% confidence region when Ldiff = 3.00.
Line 3: OUT ALOW where the OUTput option is set as follows:
OUT   Table of lnL(alpha,theta)   Lods for families
---------------------------------------------------
0     no                          no
1     no                          yes
2     yes                         no
3     yes                         yes
ALOW = lowest value of α analyzed (e.g., ALOW=0)
Line 4: N recombination fraction (θ) values, e.g., 0.01, 0.05, 0.1, etc. At these points, lod scores will be computed. A large number N of θs (e.g., 10) will yield more accurate results than a small number. Note that these θ values are not used by the program other than to identify the lod scores.
Line 5: NFAM = number of families for which lods are provided
Line 6: Lod scores for family 1. Lods smaller than -99 are taken to represent minus infinity, and a log likelihood of minus infinity will appear as -99 on output.
Repeat line 6 for families 2, 3, etc.
Sample data: The file HOMOG.DAT shows a specific example based on the analysis by Morton (1956) on Elliptocytosis vs. Rh and is reproduced below.
Morton (1956) Elliptocytosis vs. Rh
5 0.05 2.0
3 0
0.01 0.05 0.1 0.2 0.3
14
 0.292  0.258  0.215  0.134  0.064
-9.548 -4.232 -2.182 -0.540  0.018
 5.022  4.682  4.236  3.264  2.165
 1.560  2.321  2.317  1.953  1.376
 2.489  3.304  2.735  2.109  1.274
 0.276  0.260  0.238  0.191  0.138
 0.992  0.908  0.805  0.597  0.380
 2.657  2.442  2.161  1.554  0.892
-0.110 -0.091 -0.070 -0.038 -0.016
-7.307 -3.840 -2.399 -1.086 -0.455
-1.402 -0.721 -0.444 -0.194 -0.076
-0.300 -0.101 -0.020  0.059  0.046
-4.036 -2.020 -1.220 -0.531 -0.224
-1.118 -0.471 -0.235 -0.064 -0.014
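As an illustration of what the A-test evaluates on such data, the following Python sketch searches a grid of α and θ values for the largest ln likelihood. It assumes the standard admixture formula, ln L(α, θ) = Σ ln[α × 10^Z(θ) + 1 – α], summed over families; the function names (ln_likelihood, grid_search) are made up for this example, and the sketch is not the Pascal code of the HOMOG program.

```python
import math

# Lod scores from the sample file (Morton 1956), one row per family,
# at the theta values listed on line 4 of the input.
thetas = [0.01, 0.05, 0.1, 0.2, 0.3]
lods = [
    [0.292, 0.258, 0.215, 0.134, 0.064],
    [-9.548, -4.232, -2.182, -0.540, 0.018],
    [5.022, 4.682, 4.236, 3.264, 2.165],
    [1.560, 2.321, 2.317, 1.953, 1.376],
    [2.489, 3.304, 2.735, 2.109, 1.274],
    [0.276, 0.260, 0.238, 0.191, 0.138],
    [0.992, 0.908, 0.805, 0.597, 0.380],
    [2.657, 2.442, 2.161, 1.554, 0.892],
    [-0.110, -0.091, -0.070, -0.038, -0.016],
    [-7.307, -3.840, -2.399, -1.086, -0.455],
    [-1.402, -0.721, -0.444, -0.194, -0.076],
    [-0.300, -0.101, -0.020, 0.059, 0.046],
    [-4.036, -2.020, -1.220, -0.531, -0.224],
    [-1.118, -0.471, -0.235, -0.064, -0.014],
]

def ln_likelihood(alpha, k, lods):
    """Assumed admixture formula: sum over families of
    ln(alpha * 10**Z(theta_k) + 1 - alpha)."""
    total = 0.0
    for fam in lods:
        # Lods below -99 stand for minus infinity (likelihood ratio 0).
        ratio = 0.0 if fam[k] < -99 else 10.0 ** fam[k]
        term = alpha * ratio + (1.0 - alpha)
        if term <= 0.0:
            return float("-inf")
        total += math.log(term)
    return total

def grid_search(lods, n_theta, stepsize=0.05, alow=0.0):
    """Return (lnL, alpha, theta index) with the largest lnL on the grid."""
    steps = round((1.0 - alow) / stepsize)
    best = (float("-inf"), None, None)
    for i in range(steps + 1):
        alpha = min(1.0, alow + i * stepsize)
        for k in range(n_theta):
            value = ln_likelihood(alpha, k, lods)
            if value > best[0]:
                best = (value, alpha, k)
    return best

best_lnl, best_alpha, best_k = grid_search(lods, len(thetas))
print("max lnL =", best_lnl, "at alpha =", best_alpha, "theta =", thetas[best_k])
```

Setting α = 1 in ln_likelihood gives the ln likelihood under homogeneity (H1), so the difference between the grid maximum and the best α = 1 value is the ln likelihood ratio for heterogeneity.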
Additional Programs
The programs described below are somewhat obsolete. The usage
instructions are retained here for the unlikely event that someone
still needs to use one of these programs.
HOMOG1 is an extension of the
homogeneity test, with the following alternative hypothesis: Two family
types, one linked the other unlinked, plus a possible sex difference in
the recombination fraction. This program comes in two versions,
depending on whether the lods for the two sexes are independent or
not: HOMOG1a reads independent lods, HOMOG1b reads dependent
lods. For the same problem, with independent lods, HOMOG1a is more
efficient in terms of memory space.
HOMOG2 is another extension of the homogeneity test. There are two postulated trait loci at different map positions (or different map distances from a marker), with only one of the trait loci occurring in any one family. That is, there are two family types, both with linkage. The recombination fraction between trait locus 1 and the marker is θ1, that between trait locus 2 and the marker is θ2, where θ1 < θ2 < 0.5 (male and female recombination fractions the same). Typically, HOMOG2 is used with a marker map rather than a single marker, in which case θ1 and θ2 refer to the map positions of the two trait loci. The proportion, α, of families with trait locus 1 is also estimated.
HOMOG3 and HOMOG4 are analogous to HOMOG2 but
specify 3 or 4 family types (trait loci). They only calculate the
maximum log likelihood and the ML estimates but no support intervals.
HOMOG3R is a specialized version of the HOMOG3 program. It calculates log likelihoods under the assumption of two trait loci, where in a proportion α1 of families the trait is linked to marker 1, and in a proportion α2 of families it is linked to marker 2. A third family type with proportion α3 has its trait locus elsewhere. The two trait genes are taken to be unlinked with each other.
HOMOGM is an extension of
HOMOG3R to any number of trait loci. It was written in C by Ajita
Bhat and uses a numerical maximization routine to maximize the
likelihood over parameter values.
POINT4 is interactive and
calculates, for a mixture of up to 4 family types, the log likelihood
at specific parameter values.
Before running one of the programs, an input file must be constructed
according to the rules given below. Input and output files have fixed
names. For example, for the HOMOG program, the input file is HOMOG.DAT and the output file is HOMOG.OUT. Output of lod scores and
log likelihoods is preset to a width of 80 columns unless the input
quantity LL is read.
In each of the homogeneity tests, groups or types of families are
assumed where any given family cannot unequivocally be assigned to
either of these types. The statistical hypotheses referred to in the
programs are defined as follows:
- H0 is the very basic hypothesis of both homogeneity and absence of linkage.
- H1 is the usual null hypothesis of homogeneity, i.e., all families belong to a single family type with linkage between the main locus and the marker locus.
- H2 refers to the hypothesis of heterogeneity, with two family types, type 1 and type 2, where α is the proportion of families of type 1 or, equivalently, the probability that a family belongs to type 1. Family type 1 is characterized by a recombination fraction θ (programs HOMOG, HOMOG1a, and HOMOG1b) or θ1 (program HOMOG2), while in families of type 2 the recombination fraction is assumed to be equal to ½ (programs HOMOG, HOMOG1a, and HOMOG1b) or θ2 (program HOMOG2, θ1 < θ2 < 0.5).
- H3 refers to a particular type of "homogeneity": There is only one family type with recombination fraction θ, but allowance is made for a difference in the recombination fraction between the sexes.
- H4 is the heterogeneity alternative to H3, i.e., there are two family types with recombination fractions of θ and ½ and, in addition, there might be a difference in θ between the sexes.
The relationship between the hypotheses 1 through 4 may be displayed as follows:
----------------------------------------------------------
Recombination fraction    α = 1            α < 1
in the two sexes          (Homogeneity)    (Heterogeneity)
----------------------------------------------------------
equal                     H1               H2
unequal                   H3               H4
----------------------------------------------------------
In the programs and on output, genetic distance is labeled in terms of
the recombination fraction, θ. However, the programs may also be used
when the genetic distances are in centimorgans, x. Free recombination
(infinite map distance) is labeled as θ = 99 or -99.
Tests of one hypothesis against another are carried out as likelihood
ratio tests, where the likelihood ratio with respect to the two
hypotheses is calculated. Asymptotic p-values based on theoretical
chi-square distributions are no longer reported because in most of
these tests they are not really appropriate.
CONSTRUCTION OF SUPPORT INTERVALS
Some of the programs described below will calculate support intervals for parameters estimated (support "regions" for two or more parameters) and, for each family, the conditional probability of being of one of the family types considered. Such calculations are carried out only when a value for LDIFF is specified on input. Otherwise, no support interval calculations will be performed, which will result in faster program execution. Such approximate confidence intervals will be very crude and practically useless if only a few θ values with lod scores are present, so that the θ values are far apart from each other, or when the step size for α is too large, say, larger than 0.10. It is important for good support intervals that lod scores are available at many θ values. When only a few lod scores are available, a possible solution is to approximate lod scores by interpolating between calculated lod scores before inputting them to the HOMOG programs.
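One simple way to carry out such an interpolation is sketched below in Python. Linear interpolation is an assumption made for this example (the text does not prescribe a method), the function name interpolate_lods is hypothetical, and the sketch expects finite lod scores at strictly increasing θ values.

```python
def interpolate_lods(thetas, lods, new_thetas):
    """Linearly interpolate one family's lod scores at new theta values.

    thetas must be strictly increasing and lods must be finite; theta
    values outside the range of the input grid are rejected.
    """
    result = []
    for t in new_thetas:
        if not thetas[0] <= t <= thetas[-1]:
            raise ValueError("theta outside the range of the input grid")
        for j in range(len(thetas) - 1):
            lo, hi = thetas[j], thetas[j + 1]
            if lo <= t <= hi:
                weight = (t - lo) / (hi - lo)
                result.append(lods[j] + weight * (lods[j + 1] - lods[j]))
                break
    return result

# Hypothetical lod scores for one family at three theta values.
coarse_thetas = [0.1, 0.2, 0.3]
coarse_lods = [2.0, 1.0, 0.4]
print(interpolate_lods(coarse_thetas, coarse_lods, [0.1, 0.15, 0.2, 0.25, 0.3]))
```

Interpolated values are only as good as the lod curve is smooth between the calculated points, so computing additional lod scores directly is preferable where that is feasible.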
In the HOMOG programs, support regions/intervals are computed as
follows. First, the program determines the highest Ln likelihood, Lmax,
under the most general alternative hypothesis, i.e., the one with the
largest number of parameters estimated. Then, the program recalculates
likelihoods and marks all those parameter values which have an Ln
likelihood larger than Lmax – Ldiff (Ln likelihood within Ldiff of the
maximum). The marked parameter values then form the support interval.
Such a support interval is called an Ldiff-unit support interval. The
table below gives examples for the correspondence between Ldiff and the
associated likelihood ratio.
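The marking step just described can be sketched in a few lines of Python. The grid values are hypothetical, and the function name support_region is made up for this example.

```python
def support_region(lnl, ldiff):
    """Return Lmax and the parameter values whose ln likelihood
    is larger than Lmax - ldiff."""
    lmax = max(lnl.values())
    marked = sorted(key for key, value in lnl.items() if value > lmax - ldiff)
    return lmax, marked

# Hypothetical ln likelihoods on a small grid of (alpha, theta) values.
lnl = {
    (0.5, 0.05): 1.2, (0.5, 0.10): 2.9,
    (0.7, 0.05): 3.4, (0.7, 0.10): 5.1,
    (0.9, 0.05): 2.0, (0.9, 0.10): 4.0,
}
lmax, region = support_region(lnl, ldiff=2.0)
print(lmax, region)
```

With a coarse grid such as this one, the marked region can change abruptly from one grid point to the next, which is why many θ values and a small α step size are recommended above.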
Under regular conditions, support intervals may be interpreted as approximate confidence intervals. For example, with two-point analysis and two family types (one linked and the other unlinked), 2 × Ldiff approximately follows a chi-square distribution on 1 df when no heterogeneity is present. In multipoint situations, however, the approximation by chi-square is unreliable because the distribution of the test statistic is unknown.
-----------------------------------------------------
Likelihood    Difference in units of      Approx.
ratio         ------------------------    p-value
(LR)          ln(LR)=Ldiff   lod score    (1 df)
-----------------------------------------------------
   7.39       2.00           0.87         .046
  10          2.30           1            .032
  20          3.00           1.30         .014
  50          3.91           1.70         .005
 100          4.60           2            .002
1000          6.91           3            .0002
-----------------------------------------------------
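The columns of this table are related by LR = exp(Ldiff), lod score = Ldiff / ln(10), and p = P(chi-square on 1 df > 2 × Ldiff). The Python sketch below recomputes the rows under that reading; the function name ldiff_row is made up for this example, and the chi-square tail probability on 1 df is obtained from the complementary error function.

```python
import math

def ldiff_row(ldiff):
    """Return (likelihood ratio, lod score, approximate 1-df p-value)."""
    lr = math.exp(ldiff)
    lod = ldiff / math.log(10.0)
    # For 1 df: P(chi-square > x) = erfc(sqrt(x / 2)); here x = 2 * ldiff.
    p_value = math.erfc(math.sqrt(ldiff))
    return lr, lod, p_value

for ldiff in (2.00, 2.30, 3.00, 3.91, 4.60, 6.91):
    lr, lod, p_value = ldiff_row(ldiff)
    print(f"{ldiff:5.2f} {lr:10.2f} {lod:6.2f} {p_value:8.4f}")
```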
HOMOG1a program
The possible hypotheses under which likelihoods are calculated by the HOMOG1 program may be displayed as follows, where df stands for degrees of freedom (see also introduction):
----------------------------------------------------------
Male and female    Homogeneity (one    Heterogeneity (two
rec. fractions     family type)        family types)
----------------------------------------------------------
equal              H1 (1 df)           H2 (2 df)
unequal            H3 (2 df)           H4 (3 df)
----------------------------------------------------------
The test of H1 against H4 leads to a chi-square value with 2 df that may be partitioned into two components according to the manner in which H4 is reached from H1. Note that there are two possible paths leading from H1 to H4.
Input to the HOMOG1a program is similar to that for the HOMOG program
and is as given in the following table. But refer to the notes below
this table. File names are HOMOG1A.DAT for input and HOMOG1A.OUT for
output.
Line 1: Title line
Line 2: NM NF STEPSIZE LDIFF where
- NM = no. of male θ values, tm, at which lod scores
are available. Do not count θ = 0.5.
- NF = no. of female θ values, tf, at which lod
scores are available. Do not count θ = 0.5.
- STEPSIZE = step size at which α values are
incremented in the search over the likelihood surface (e.g., 0.05).
- LDIFF (optional) = difference in log likelihood, used in the construction of support intervals (see Construction of Support Intervals, above). In regular situations, the joint support interval for α and θ corresponds to an approximate 95% confidence region when Ldiff = 3.91.
Line 3: OUT ALOW LL where the OUTput option is set as follows (Warning: the table of lnL(α, θ) contains [NM × NF – 1]/STEPSIZE lines):
OUT   Table of lnL(alpha,theta)   Lods for families
---------------------------------------------------
0     no                          no
1     no                          yes
2     yes                         no
3     yes                         yes
ALOW = lowest value of alpha analyzed (e.g., ALOW = 0)
LL (optional) = line length of output (may be missing)
Line 4: NM male θ values, tm. They may be entered on a single line, or distributed over several lines. The order is irrelevant.
Line 5: NF female θ values, tf.
Line 6: NFAM = number of families
Line 7: NM male lod scores for family 1. Lods smaller than -80 are taken to represent minus infinity, and a log likelihood of minus infinity will appear as -99 on output.
Line 8: NF female lod scores for family 1.
Repeat lines 7 and 8 for families 2, 3, etc.
As to the θ values at which lod scores are available in each family, the user is essentially free to choose which θ values to use. However, there should be a sufficiently large number of pairs both with tm = tf and with tm ≠ tf.
Sample data: The file HOMOG1A.DAT provides an example of data that may
be analyzed for heterogeneity as well as for a sex difference in the
recombination fraction. Data quoted in Ott (1986).
HOMOG1b program
The calculations performed by the HOMOG1b program are basically the same as those by the HOMOG1a program. HOMOG1b allows input of nonindependent lod scores for the two sexes and, everything else being equal, requires more memory to run. The input is slightly different from that of the HOMOG1a program and is as follows. File names are HOMOG1B.DAT for input and HOMOG1B.OUT for output.
Line 1: Title line
Line 2: N STEPSIZE LDIFF where
- N = no. of pairs of θ values, tm and tf (male and female recombination fractions), at which lod scores are available. Do not count θ = 0.5.
- STEPSIZE = step size at which α values are
incremented in the search over the likelihood surface (e.g., 0.05).
- LDIFF (optional) = difference in log likelihood, used in the construction of support intervals (see Construction of Support Intervals, above). In regular situations, the joint support interval for α and θ corresponds to an approximate 95% confidence region when Ldiff = 3.91.
Line 3: OUT ALOW LL where the OUTput option is set as follows:
OUT   Table of lnL(alpha,theta)   Lods for families
---------------------------------------------------
0     no                          no
1     no                          yes
2     yes                         no
3     yes                         yes
ALOW = lowest value of alpha analyzed (e.g., ALOW=0)
LL (optional) = line length of output (may be missing)
Line 4: N pairs of θ values, tm and tf, where in each pair the first value is the male and the second value is the female recombination fraction. Each pair may be entered on a single line, or several pairs may be entered on one line, e.g., 0.01, 0.01, 0.05, 0.05, 0.01, 0.05,... There must be exactly as many pairs (N of them) as there are lod scores for each family as provided on line 6, below. Omit lod = 0 at tm = tf = 0.5. The order in which these pairs are provided is irrelevant.
Line 5: NFAM = number of families
Line 6: Lod scores (N of them) for family 1. Lods smaller than -80 are taken to represent minus infinity, and a log likelihood of minus infinity will appear as -99 on output.
Repeat line 6 for families 2, 3, etc.
As to the pairs of θ values at which lod scores are available in each family, the user is essentially free to choose which θ values to use. However, there should be a sufficiently large number of pairs both with tm = tf and with tm ≠ tf. A minimum set of pairs would be all possible combinations of tm and tf with tm = 0, 0.05, ..., 0.5, and with tf = 0, 0.05, ..., 0.5. When the highest lod score occurs at tf > tm, it might be sufficient to provide lods in one triangle of the plane of (tm, tf)-values only, e.g., at tm = 0, 0.05, ..., 0.5, and tm ≤ tf ≤ 0.5.
Such a set of theta values may graphically be represented as follows
(marked with crosses):
tm = 0.5 |
     0.3 |                     x     x
     0.1 |               x     x     x
     0.05|         x     x     x     x
     0   |   x     x     x     x     x
---------+-------------------------------
    tf =     0    0.05  0.1   0.3   0.5
On input, for example, the following θ values would have to be given on line(s) 4:
0 0   0 0.05   0 0.1   0 0.3   0 0.5
0.05 0.05   0.05 0.1   0.05 0.3   0.05 0.5
0.1 0.1   ...   0.3 0.5
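A list of pairs of this kind can be generated rather than typed by hand. The Python sketch below enumerates the triangle tm ≤ tf for the five values used in the diagram and leaves out the pair tm = tf = 0.5, whose lod score of 0 is to be omitted; the function name triangle_pairs is made up for this example.

```python
def triangle_pairs(values):
    """All (tm, tf) pairs with tm <= tf, omitting tm = tf = 0.5."""
    pairs = []
    for tm in values:
        for tf in values:
            if tm <= tf and not (tm == 0.5 and tf == 0.5):
                pairs.append((tm, tf))
    return pairs

pairs = triangle_pairs([0, 0.05, 0.1, 0.3, 0.5])
print(len(pairs))  # this count is the value N for line 2
print(" ".join(f"{tm:g} {tf:g}" for tm, tf in pairs))
```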
Sample data: the file HOMOG1B.DAT contains the same data as the file HOMOG1A.DAT referenced in the previous section except that the joint lod scores, Z(tm, tf), have been reconstructed from the independent sex-specific lod scores, Z(tm) and Z(tf), as Z(tm, tf) = Z(tm) + Z(tf).
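The reconstruction Z(tm, tf) = Z(tm) + Z(tf) for one family might be sketched in Python as follows. The lod scores are hypothetical, the function name joint_lods is made up for this example, and a lod score of 0 at θ = 0.5 is filled in because that value is not part of the input.

```python
def joint_lods(pairs, male_thetas, male_lods, female_thetas, female_lods):
    """Joint lods for independent sex-specific lods: Z(tm, tf) = Z(tm) + Z(tf)."""
    zm = dict(zip(male_thetas, male_lods))
    zf = dict(zip(female_thetas, female_lods))
    # The lod score at theta = 0.5 is 0 by definition and is not read from input.
    zm.setdefault(0.5, 0.0)
    zf.setdefault(0.5, 0.0)
    return [zm[tm] + zf[tf] for tm, tf in pairs]

# Hypothetical sex-specific lod scores for one family.
male_thetas, male_lods = [0.05, 0.1], [1.4, 1.2]
female_thetas, female_lods = [0.05, 0.1], [0.2, 0.3]
pairs = [(0.05, 0.05), (0.05, 0.1), (0.1, 0.1), (0.1, 0.5)]
print(joint_lods(pairs, male_thetas, male_lods, female_thetas, female_lods))
```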
HOMOG2 program
As mentioned in the introduction, in this extension to the A-test, the alternative hypothesis H2 of heterogeneity specifies two family types, both with linkage, one with recombination fraction θ1 between trait locus 1 and a marker (or θ1 is the map position of trait 1), the other with recombination fraction θ2 between trait locus 2 and the marker (or θ2 is the map position of trait 2) (θ1 < θ2 < 0.5). α is the probability of belonging to type 1 (with θ1). The two trait loci are on the same chromosome.
Input format is the same as for the HOMOG program and is as given in
the following table, but refer to the notes after the table. File names
are HOMOG2.DAT for input and HOMOG2.OUT for output.
Line 1: Title line
Line 2: NT STEPSIZE LDIFF where
- NT = no. of θ values (or map distances) at which lod scores are available or should be computed. Omit lod = 0 at θ = 0.5.
- STEPSIZE = step size at which the α values are incremented in the search over the likelihood surface (e.g., 0.05).
- LDIFF (optional) = difference in log likelihood, used in the construction of support intervals. When it is missing, no support intervals will be calculated.
Line 3: OUT ALOW LL where the OUTput option is set as follows (Warning: table of lnL(α, θ) will contain 0.5 × N(N + 3)/STEPSIZE lines):
---------------------------------------------------
OUT   Table of lnL(alpha,theta)   Lods for families
---------------------------------------------------
0     no                          no
1     no                          yes
2     yes                         no
3     yes                         yes
---------------------------------------------------
ALOW = lowest value of α analyzed (e.g., ALOW = 0)
LL = line length of output (optional; if missing: LL = 80)
Line 4: Recombination fraction (θ) values, e.g., 0.01, 0.05, etc. At these points, lod scores will be computed. A rather large number NT of recombination fractions (e.g., 10) will yield more accurate results than a small number.
Line 5: NFAM = number of families for which lod scores are provided.
Line 6: Lod scores for family 1. Lods smaller than -80 are taken to represent minus infinity, and a log likelihood of minus infinity will appear as -99 on output.
Repeat line 6 for each family.
The null hypothesis (H1) is specified by α = 1 or, equivalently, by θ1 = θ2, and has one degree of freedom (df). The alternative hypothesis (H2) is characterized by three df, i.e., α, θ1, and θ2 are estimated. However, setting α = 1 forces θ2 to be equal to θ1 so that the asymptotic chi-square distribution may not apply. Thus, instead of p-values, the current version of the HOMOG2 program outputs likelihood ratios (odds ratios) for the hypotheses considered.
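How H1 is nested in H2 can be seen from a small Python sketch. It assumes the mixture formula ln L(α, θ1, θ2) = Σ ln[α × 10^Z(θ1) + (1 – α) × 10^Z(θ2)], summed over families, uses hypothetical lod scores, and restricts θ2 to the supplied grid (θ2 = 0.5 is not included); the function names are made up, and this is not the code of the HOMOG2 program.

```python
import math

def ln_likelihood2(alpha, k1, k2, lods):
    """Assumed two-locus mixture: sum over families of
    ln(alpha * 10**Z(theta1) + (1 - alpha) * 10**Z(theta2))."""
    total = 0.0
    for fam in lods:
        # Lods below -80 stand for minus infinity (likelihood ratio 0).
        r1 = 0.0 if fam[k1] < -80 else 10.0 ** fam[k1]
        r2 = 0.0 if fam[k2] < -80 else 10.0 ** fam[k2]
        term = alpha * r1 + (1.0 - alpha) * r2
        if term <= 0.0:
            return float("-inf")
        total += math.log(term)
    return total

def maximize2(lods, n_theta, n_alpha=20):
    """Grid search over alpha = i / n_alpha and theta index pairs k1 < k2."""
    best = (float("-inf"), None, None, None)
    for i in range(n_alpha + 1):
        for k1 in range(n_theta):
            for k2 in range(k1 + 1, n_theta):
                value = ln_likelihood2(i / n_alpha, k1, k2, lods)
                if value > best[0]:
                    best = (value, i / n_alpha, k1, k2)
    return best

# Hypothetical lod scores for three families at theta = 0.05, 0.1, 0.2, 0.3.
thetas = [0.05, 0.1, 0.2, 0.3]
lods = [
    [1.20, 1.05, 0.70, 0.35],
    [-2.10, -0.90, 0.10, 0.25],
    [0.80, 0.75, 0.55, 0.30],
]
h2 = maximize2(lods, len(thetas))
# Under H1 (alpha = 1, or theta1 = theta2) only one theta matters.
h1 = max(ln_likelihood2(1.0, k, k, lods) for k in range(len(thetas)))
print("ln L under H2:", h2[0], " ln L under H1:", h1)
print("likelihood ratio for heterogeneity:", math.exp(h2[0] - h1))
```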
HOMOG3 and HOMOG4 programs
These programs are straightforward extensions of the HOMOG2 program to 3 and 4 family types (linked trait loci). They use the same input format as the HOMOG and HOMOG2 programs, and the input files are HOMOG3.DAT and HOMOG4.DAT, respectively. Output files will be called HOMOG3.OUT and HOMOG4.OUT.
The HOMOG3 and HOMOG4 programs simply calculate the max. Ln likelihood
under the most general hypothesis of heterogeneity. Appropriate
significance tests will have to be carried out manually by the user by
comparing output from these programs with output from the HOMOG or
HOMOG2 programs. Notice that HOMOG3 and HOMOG4 carry out an exhaustive
search of the parameter space and may require a large amount of
computer time. While they are running, they display the current alpha
values so that programs may be interrupted by the user.
Interpreting results of HOMOG3 or HOMOG4 is not straightforward. For
example, whenever one of the components (α's) is equal to zero, the
associated θ value is irrelevant. Also, there may be more than one
parameter constellation with the same maximum likelihood. The HOMOG3
and HOMOG4 programs differ in their output as follows.
In the HOMOG3 program, if the OUTput option (line 3) is set to a value
larger than 1, all possible sets of α values will be printed (one set
per line), and for each set the maximum likelihood over the θ values
will be given along with those θ values at which the maximum occurred.
In the HOMOG4 program, if the OUTput option (line 3) is set to a value
larger than 1, a table containing the Ln likelihood for each possible
set of parameter values will be written to the output file.
WARNING: THIS FILE COULD BE VERY LARGE! For example, when the
sample HOMOG.DAT file is analyzed by the HOMOG4 program, the output
file will be 1.5MB long. For most practical situations, one should set
OUT = 0 on line 3.
Notice that not every α component can take on the whole range of values from 0 through 1. For computational efficiency, only α3 is allowed to have a value of 1; if in the course of the calculations α2 or α1 were also allowed to be equal to 1, the resulting likelihoods and θ estimates would be exactly the same as with α3 = 1. Similarly, α3 cannot be equal to zero. In this program, if a single α is zero, it must be α1; if two α's are zero, these must be α1 and α2.
HOMOG3R program
This is a specialized version of the HOMOG3 program. It calculates log likelihoods under the assumption that in a proportion α1 of families trait locus 1 is linked to marker 1 and in a proportion α2 of families trait locus 2 is linked to marker 2, where the two markers (or maps of markers) are located in different regions of the genome such that the trait loci are unlinked with each other. There may be a third proportion, α3 = 1 – α1 – α2, of families whose trait locus is unlinked to either of the two map regions harboring trait loci 1 and 2. The two regions of the genome usually correspond to different chromosomes and are identified in the following table by the respective sets of θ values, NT1 and NT2.
Default file names are HOMOG3R.DAT for input and HOMOG3R.OUT for output.
Line 1: Title line
Line 2: NT1 NT2 STEPSIZE where
- NT1 = no. of θ values (or map locations) at which
lod scores are available for trait versus marker 1. Omit lod = 0 at θ =
0.5.
- NT2 analogous for marker 2.
- STEPSIZE = step size at which the α values are
incremented in the search over the likelihood surface (e.g., 0.05).
Line 3: OUT ALOW where the OUTput option is set as follows:
---------------------------------------------------
OUT   Table of lnL(alpha,theta)   Lods for families
---------------------------------------------------
0     no                          no
1     no                          yes
2     yes                         no
3     yes                         yes
---------------------------------------------------
The table of lnL(α, θ) will print one line for each pair of α1 and α2. On each line, the log likelihood, maximized over the θs, is printed.
ALOW = lowest value of α analyzed (e.g., ALOW=0).
Line 4: All NT1 + NT2 θ values, e.g., 0.01, 0.05, etc., that is, the θ values for marker 1 immediately followed by the θ values for marker 2. These values are for identification purposes only and are not used in the calculations. It may thus be useful to distinguish θ values for marker 1 (e.g., -0.10 or 0.11) from those for marker 2 (e.g., 0.10). A large number of recombination fractions will yield more accurate results than a small number.
Line 5: NFAM = number of families for which lod scores are provided.
Line 6: The NT1 + NT2 lod scores for family 1. Lods smaller than -80 are taken to represent minus infinity, and a log likelihood of minus infinity will appear as -99 on output.
Repeat line 6 for each family.
A special situation is given when the two markers near trait loci 1 and
2 are taken to be candidate genes and lod scores are evaluated at θ = 0
only. In this case, the HOMOG3R program will maximize likelihoods over
only two θ values, 0 and 0.5. Consider the following input file
(another sample data set is provided in the file HOMOG3R.DAT):
Linkage to two candidate genes on different chromosomes
1 1 .05
1 0
-0.01 0.01
4
0.903 -99
2.007 -99
0.601 0.601
-99 1.204
For each of four families, at each of two chromosomes, the file
contains lod scores at θ = 0 (identified as -0.01 for marker 1 on
chromosome 1 and 0.01 for marker 2 on chromosome 2).
There are three possible hypotheses of homogeneity:
- All families are linked with marker 1 but unlinked with
marker 2.
- All families are linked with marker 2 but not with
marker 1.
- All families are unlinked with markers 1 and 2.
For the given data, hypotheses 1 and 2 have zero likelihood because there is always at least one family with one or more known recombinations (but note the comments below). The HOMOG3R program furnishes the following output:
Program HOMOG3R version 1.70   J. Ott
Heterogeneity -- Three family types, type 1 with linkage to first set of
theta values, type 2 with linkage to second set of theta values (usually
two different chromosomes), type 3 unlinked.
>> Linkage to two candidate genes on different chromosomes <<
Fam.   Lod scores
1       0.9030  -99.0000
2       2.0070  -99.0000
3       0.6010    0.6010
4     -99.0000    1.2040
Theta  -0.0100    0.0100
Results for different hypotheses (fixed values in parentheses)
Hypothesis              a1     a2     a3     t1        t2        lnL
----------------------------------------------------------------------
H1 Heterogeneity        0.65   0.35   0.00   -0.010    0.010     8.9453
H2 Het, a3=0            0.65   0.35   (0)    -0.010    0.010     8.9453
H3 Het, a2=0            0.70   (0)    0.30   -0.010    (-99)     5.9688
H4 Het, a1=0            (0)    0.40   0.60   (-99)     -0.010    1.7107
H5 Homogeneity, a1=1    (1)    (0)    (0)    -99.000   (-99)     0.0000
H6 Homogeneity, a2=1    (0)    (1)    (0)    (-99)     -99.000   0.0000
H7 Homogeneity, a3=1    (0)    (0)    (1)    (-99)     (-99)     (0)
Evidence for heterogeneity (H1 vs. H5/6/7): Het. versus homogeneity
Difference in Ln(L)      = 8.9453
Lik. ratio for heterog.  = 7671.7558
Evidence for 2 versus 1 disease locus (H1 vs. H3/4):
Difference in Ln(L)      = 2.9765
Lik. ratio for heterog.  = 19.6190
Family   Conditional prob. of being
no.      type 1   type 2   type 3   (under heterogeneity, H1)
1        1.0000   0.0000   0.0000
2        1.0000   0.0000   0.0000
3        0.6500   0.3500   0.0000
4        0.0000   1.0000   0.0000
The program output shows positive likelihoods for hypotheses H1 through H4. It may come as a surprise that formally the program also indicates non-null likelihoods (log likelihood greater than -99 = -∞) for hypotheses H5 and H6, that is, the data are compatible with homogeneity (locus in all families on chromosome 1 or chromosome 2). This is so because the likelihood is maximized over θ = 0 and θ = 0.5. Then the recombination fraction is estimated as 0.5 (indicated by -99.000 in the output) whenever θ = 0 is incompatible with the data. Of course, if θ were kept fixed at 0, the data would be incompatible with hypotheses α1 = 1 and α2 = 1.
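The ln likelihoods in the output for H1 and H3 can be checked with a short Python sketch. It assumes the three-component mixture ln L = Σ ln[α1 × 10^Z1 + α2 × 10^Z2 + α3], summed over families, keeps θ fixed at 0 for both markers (so it does not perform the additional maximization over θ = 0.5 described above), and uses made-up function names; it is not the code of the HOMOG3R program.

```python
import math

# Lod scores at theta = 0 for marker 1 and marker 2 (sample input above).
lods = [(0.903, -99), (2.007, -99), (0.601, 0.601), (-99, 1.204)]

def ratio(z):
    """Likelihood ratio 10**z; lods below -80 stand for minus infinity."""
    return 0.0 if z < -80 else 10.0 ** z

def ln_likelihood(a1, a2, a3):
    """Assumed mixture: sum over families of ln(a1*LR1 + a2*LR2 + a3)."""
    total = 0.0
    for z1, z2 in lods:
        term = a1 * ratio(z1) + a2 * ratio(z2) + a3
        if term <= 0.0:
            return float("-inf")
        total += math.log(term)
    return total

def grid_max(n=20):
    """Search alpha1 = i/n, alpha2 = j/n, alpha3 = 1 - alpha1 - alpha2."""
    best = (float("-inf"), None, None)
    for i in range(n + 1):
        for j in range(n + 1 - i):
            value = ln_likelihood(i / n, j / n, (n - i - j) / n)
            if value > best[0]:
                best = (value, i / n, j / n)
    return best

best_lnl, a1, a2 = grid_max()
print("H1:", a1, a2, best_lnl)
print("H3 at alpha1 = 0.70:", ln_likelihood(14 / 20, 0.0, 6 / 20))
```

With n = 20 the grid corresponds to the STEPSIZE of .05 in the sample input, and the values printed should be close to the ln likelihoods of 8.9453 (H1) and 5.9688 (H3) shown in the program output.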
Testing for significance of heterogeneity may be carried out in different ways. Under heterogeneity that allows for 3 components (hypothesis H1, above), 4 parameters are specified: α1, α2, θ1, and θ2. Setting α1 = 1 leads to H5 (all families have their gene at position 1; it so happens that with the above data the estimated θ1 = 0.50, i.e., map position = ∞). The number of estimated parameters is then equal to 1 (setting α1 = 1 makes θ2 irrelevant). Thus the number of degrees of freedom is 4 – 1 = 3 (this is probably somewhat conservative because statistically this is a nonstandard situation). The corresponding test statistic might then be constructed as χ2 = 2 × (difference in ln(L) between H1 and H5) = 2 × 8.9453 = 17.8906. One may proceed analogously for hypotheses H6 and H7.
HOMOGM program
This is an extension of the HOMOG3R program to any number of loci. A given family can have its disease locus at any one of n loci or at some other unknown locus. For the i-th family type, a proportion αi and disease location xi (or recombination fraction θi in the case of a single marker linked with the i-th gene) are estimated. For the (n + 1)st family type (unlinked), only the proportion αn+1 is estimated. Because of the large number of parameters, the likelihood is no longer computed for a complete grid over the whole parameter space as is done in the other HOMOG programs. Rather, the simplex algorithm was implemented in a computer program written by Dr. Ajita Bhat for numerical maximization of the likelihood. A detailed program description is given below.
INTRODUCTION
HOMOGM analyzes heterogeneity for two or more disease loci that are unlinked with each other. This program also allows for a proportion of families to be unlinked to any of the loci. Different hypotheses can be generated using multiple fixed values of θ (theta, recombination fractions) and α (alpha, proportion of families with linkage to different locations). For a detailed explanation please see Bhat et al. (1999). The example used in that paper is contained in the homogm.dat file.
The HOMOGM program is written in C. It can be compiled on UNIX platforms with a C compiler using one of the following commands:
>gcc -O2 homogm.c -o homogm -lm
or
>cc -O homogm.c -o homogm -lm
For Windows, the executable, homogmpc.exe, is available. The source code, homogmpc.c, is also available and may be compiled using the djgpp compiler (http://www.delorie.com/djgpp/) and the following command:
>gcc -O2 homogmpc.c -o homogmpc -lm
FILES NEEDED
The default input file name is homogm.dat and the program creates an output file named homogm.out. The input file needs to be supplied by the user.
Sample input file
A template of the homogm.dat file is given below. This template file is created assuming there are 3 loci and 200 families in the data set. A brief description of each line is given and a detailed explanation of the file is given further below. The format of the input file homogm.dat is as follows:
Families with 3 locations                               ⇐ Title line {Any suitable title of your choice} <Line 1>
6 6 7                                                   ⇐ Number of θ values for each locus <Line 2>
0 0 0 0 0 0 0                                           ⇐ Fixed parameters (hypothesis generation) <Line 3>
0.001 0.100 0.200 0.300 0.400 0.5                       ⇐ Theta values for locus 1 <Line 4>
0.001 0.020 0.050 0.100 0.200 0.5                       ⇐ Theta values for locus 2 <Line 5>
0.001 0.010 0.060 0.100 0.200 0.300 0.400               ⇐ Theta values for locus 3 <Line 6>
200                                                     ⇐ Number of families <Line 7>
2.107 1.787 1.429 1.023 0.554 0                         ⇐ Lod scores for 1st family and 1st locus <Line 8>
.................
.................
-99 -0.058 0.313 0.357 0.237 0                          ⇐ Lod score for the 100th family at 1st locus
................
................
-99 -1.675 -0.781 -0.347 -0.092 0                       ⇐ Lod scores for 200th family and 1st locus
-99 -0.212 0.046 0.053 0.011 0                          ⇐ Lod scores for 1st family and 2nd locus
................
................
-99 -2.092 -0.562 0.022 0.164 0                         ⇐ Lod score for the 100th family and 2nd locus
................
................
-99 -1.172 -0.317 0.013 0.094 0                         ⇐ Lod score for the 200th family and 2nd locus
-1.4966 -0.5123 -0.2289 0.1758 0.3221 0.4185 0.3627     ⇐ Lod scores for 1st family and 3rd locus
................
................
-10.4953 -6.4992 -5.2995 -3.4091 -2.5406 -1.3876 -0.7413 ⇐ Lod scores for 100th family and 3rd locus
...............
...............
-4.4962 -2.5079 -1.9191 -1.0192 -0.6321 -0.1835 -0.0053 ⇐ Lod score for the 200th family and 3rd locus
Detailed description of the input file, homogm.dat
Line 1 is a title line of the user's choice.
Line 2 indicates the number of
θ values considered for the study for
loci 1, 2 and 3 respectively.
Line 3 holds the fixed-parameter flags (hypothesis option): The number
of zeros and ones on this line must equal twice the number of loci plus
one, i.e. 2n + 1, where n is the number of loci. A 0 in a given
position means that the user does not wish to fix that particular value
(i.e., that parameter should be estimated), whereas a 1 in that
position means that the user wishes to fix that particular value
(parameter not estimated).
The first n values denote
fixing of the θ values for the loci in serial
order. The next n values
denote fixing of α values for the
corresponding loci in serial order. The (2n + 1)-th (i.e. the last)
number denotes whether the α value for the unlinked locus is
fixed/not-fixed. Any combination of α and θ can be fixed at a time.
Note: If the θ value for any particular locus is fixed at 0.5, then
the α value for that locus is automatically fixed at 0. In addition, if
n of the (n + 1) α values are fixed, the (n + 1)th α value is
automatically fixed at (1 - sum of the n fixed α values). In
particular, if n of the (n + 1) α values sum to one, the (n + 1)th α
value is automatically fixed at zero. As an aid, the output file
homogm.out reports the number of free parameters under the hypothesis
that was run (details given below). An explanation of hypothesis
testing can be found in the Handbook of Human Genetic Linkage
(Terwilliger and Ott 1994).
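The bookkeeping for line 3 can be illustrated with a short Python
sketch. It is not taken from the HOMOGM source; the function name
count_free_parameters is hypothetical, and the sketch assumes that the
only implicit constraint is that the α values sum to 1 (it does not
model the special rule for θ fixed at 0.5, because that depends on the
constant entered at the prompt).

```python
def count_free_parameters(flags):
    """Count free parameters implied by the 2n + 1 flags on line 3.

    flags: list of 0/1 values. The first n entries refer to the thetas,
    the next n to the alphas of the linked loci, and the last entry to
    the alpha of the unlinked locus.
    Assumption: the only implicit constraint is sum(alpha) = 1.
    """
    if len(flags) < 3 or len(flags) % 2 != 1:
        raise ValueError("expected 2n + 1 flags with n >= 1")
    if any(f not in (0, 1) for f in flags):
        raise ValueError("flags must be 0 (estimate) or 1 (fixed)")
    n = (len(flags) - 1) // 2
    theta_flags = flags[:n]
    alpha_flags = flags[n:]  # n linked alphas plus the unlinked alpha
    free_thetas = theta_flags.count(0)
    # One unfixed alpha is determined by the others because they sum to 1.
    free_alphas = max(alpha_flags.count(0) - 1, 0)
    return free_thetas + free_alphas


if __name__ == "__main__":
    print(count_free_parameters([0, 0, 0, 0, 0, 0, 0]))
```

For the all-zero line 3 of the template this gives 6, which agrees with
the number of free parameters reported in the sample output.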
Hypothesis generation: When the program is run, the user is prompted
for the values of the fixed parameters. Each fixed θ value must lie
between the minimum and maximum θ values given for that locus in the
file homogm.dat, and the sum of all αs must never exceed 1.
Some examples of hypotheses that can be run (3 loci):
1). No values are fixed. Line 3 looks like:
0 0 0 0 0 0 0
2). Second α value is fixed to 1. Line 3 looks like:
0 0 0 0 1 0 0
The program prompts for the constant, which should be entered as 1.
3). Second θ value is fixed to 0.5. Line 3 looks like:
0 1 0 0 0 0 0
The program prompts for the constant and 0.5 should be entered. In this
case, the program will automatically set the second α value to
0.
4). First θ value is fixed to 0.1 and second α value fixed to 0.2.
Line 3 looks like:
1 0 0 0 1 0 0
The program will prompt the user for the value of only these 2 fixed
parameters. So the first input by the user should be 0.1 and second
input must be 0.2.
5). The α value for the unlinked locus is fixed to 0 and the first θ
value to 0.2. Line 3 looks like:
1 0 0 0 0 0 1
The program will prompt the user for the first θ value, which should be
entered as 0.2, and then for the α value of the unlinked locus, which
in this case should be entered as 0.
6). If the α value for the unlinked locus is fixed to 1, the maximum
log likelihood is 0 and there is no linkage to any of the loci.
Note: The program does not interpolate between two θ values that are
specified in the data file homogm.dat. Thus, if the user wishes to fix
the value of θ to 0.03, then the lod scores at that particular value
(obtained, e.g., using MLINK) have to be included in the input data
file.
Thus in this example, if the user fixes the value of the second theta
to 0.4 the program automatically sets it back to the closest value in
the file (i.e. 0.35 in this case).
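The "closest value in the file" rule can be pictured with the following
Python sketch. It only illustrates the idea; the function name
snap_to_grid is hypothetical, and how HOMOGM breaks ties between two
equally close θ values is not documented here (the sketch simply takes
the first one).

```python
def snap_to_grid(theta, grid):
    """Return the theta value in grid that is closest to theta.

    Assumption: on a tie, the value that comes first in grid is used.
    """
    if not grid:
        raise ValueError("the list of theta values is empty")
    return min(grid, key=lambda g: abs(g - theta))


if __name__ == "__main__":
    # Theta values for locus 3 in the template file above.
    locus3 = [0.001, 0.010, 0.060, 0.100, 0.200, 0.300, 0.400]
    print(snap_to_grid(0.03, locus3))
```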
Lines 4, 5, 6: Contain the list of all possible θ values, one line per
locus. In this particular example, line 4 lists the possible θ values
for locus 1, line 5 those for locus 2, and line 6 those for locus 3.
All θ values must lie between 0 and 0.5. A larger number of θ values
gives a more accurate maximum of the log likelihood. It is not
necessary to include θ = 0.5 (for which the lod score is 0 by
definition).
Line 7 indicates the number of families; in this case 200.
Lines 8 onward list the lod scores, one line per family, at each θ
value of the respective locus. First are the lod scores for all
families at locus 1, then those for all families at locus 2, etc. The
number of lod scores on each line should match the number of θ values
for that locus.
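As an aid to checking a data file before running the program, the
layout described above can be read with a few lines of Python. This is
a sketch under stated assumptions, not part of the HOMOG package: the
function name parse_homogm is hypothetical, the real file is assumed to
contain only the numbers (without the ⇐ annotations of the template),
and the number of loci is taken from the number of entries on line 2.

```python
def parse_homogm(text):
    """Parse the contents of a homogm.dat-style file into a dictionary.

    Returns title, flags, thetas[locus][j] and lods[locus][family][j].
    Raises ValueError if the file holds too few or too many numbers.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    title = lines[0].strip()
    n_theta = [int(t) for t in lines[1].split()]  # one count per locus
    n_loci = len(n_theta)
    tokens = " ".join(lines[2:]).split()
    pos = 0

    def take(count, convert):
        nonlocal pos
        chunk = tokens[pos:pos + count]
        if len(chunk) != count:
            raise ValueError("input ended before all values were read")
        pos += count
        return [convert(t) for t in chunk]

    flags = take(2 * n_loci + 1, int)                # line 3
    thetas = [take(k, float) for k in n_theta]       # one line per locus
    n_fam = take(1, int)[0]                          # number of families
    # All families at locus 1 first, then all families at locus 2, etc.
    lods = [[take(k, float) for _ in range(n_fam)] for k in n_theta]
    if pos != len(tokens):
        raise ValueError("unexpected extra values at end of input")
    return {"title": title, "flags": flags, "thetas": thetas,
            "n_fam": n_fam, "lods": lods}


if __name__ == "__main__":
    toy = ("Toy example\n2 3\n0 0 0 0 0\n0.1 0.5\n0.1 0.2 0.5\n2\n"
           "1.0 0\n-0.5 0\n0.3 0.2 0\n-99 -1.0 0\n")
    data = parse_homogm(toy)
    print(data["n_fam"], data["lods"][1][1])
```

The toy input (2 loci, 2 families) is made up for illustration only.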
Output of the program
The program prints the values of the fixed parameters and the maximum
likelihood for those values on the screen. The program also creates the
output file homogm.out. This output file gives the time and date it was
created. Following that is the number of free parameters under the
hypothesis that was run (in this example it is 6 since line 3 in
homogm.dat has all values set to 0). The maximum log likelihood is
given next, followed by the table of θ and α values for the loci in
serial order. In this example, the maximum log likelihood, ln(L), is
174.0028. The α value of the unlinked locus is 0.148 in this example,
meaning that a proportion of 0.148 (about 1/7) of the families is
unlinked to any of the three loci. Fixed parameters are shown in
parentheses in the output file (as shown in the manuscript and in the
output file of homog3r).
Following that is the table of conditional probabilities of each family
being linked to each particular locus; the last column is the
probability of being unlinked to any of the three loci. In this
example, family 1 has a probability of 0.25 of being unlinked to any of
the three loci, and its highest probability (0.35) is that of being
linked to locus 3. Similarly, family 2 has the highest probability
(0.93) of being linked to locus 1 and no probability of being linked to
locus 3, and so on for the other families.
*** Program HOMOGM (Created on Nov 20 1998, at 12:00:52) Dr. Jurg Ott ***
This output file created on Fri Nov 20 12:01:36 1998
Heterogeneity -- Trait vs. 3 loci on different chromosomes
The number of free parameters is 6
The value of Max ln(L) is 174.002816
The value of log10(L) is 75.569423
t[1]      t[2]      t[3]
===========================================================
0.100     0.200     0.020
a[1]      a[2]      a[3]
===========================================================
0.221     0.282     0.349
Alpha-unlinked is 0.148
Family linkage probabilities
1    0.08746  0.31311  0.34903  0.25040
2    0.92632  0.04094  0.00000  0.03274
..............
..............
100  0.03388  0.06823  0.02484  0.87306
..............
..............
200  0.00055  0.01771  0.96757  0.01417
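How a row of the family linkage probabilities may arise can be sketched
in Python as follows. The sketch assumes the usual admixture form, in
which each family's likelihood ratio is the α-weighted sum of the
antilogs of its lod scores at the estimated θ values plus the unlinked
proportion; it is not copied from the HOMOGM source, the function name
family_posteriors is hypothetical, and the numbers in the example call
are made up.

```python
def family_posteriors(lods_at_estimates, alphas, alpha_unlinked):
    """Conditional probabilities for one family: linked to each locus, or unlinked.

    lods_at_estimates: the family's lod score at the estimated theta of each locus.
    alphas: estimated proportions of families linked to each locus.
    alpha_unlinked: estimated proportion of unlinked families.
    Assumption: standard admixture weighting, alpha * 10**lod.
    """
    if len(lods_at_estimates) != len(alphas):
        raise ValueError("need one lod score and one alpha per locus")
    weights = [a * 10.0 ** z for a, z in zip(alphas, lods_at_estimates)]
    weights.append(alpha_unlinked)  # unlinked: lod = 0, likelihood ratio 1
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical family with lod 1.0 at locus 1 and lod 0.0 at locus 2.
    for p in family_posteriors([1.0, 0.0], [0.5, 0.25], 0.25):
        print(round(p, 5))
```

The last entry of the returned list corresponds to the last column of
the table, i.e., the probability of being unlinked.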
Starting values for parameter estimation
The program does not depend on starting values. However, if users are
interested in exploring various starting points, the user-defined
method needs to be used. In that case the program needs to be compiled
using the following command:
>> gcc homogm.c -DSTARTING_VALUES -o homogm -lm -O2
or
>> cc homogm.c -DSTARTING_VALUES -o homogm -lm -O
The program will then prompt the user to input different starting
values (for θ only).
Hypothesis testing
The program provides only one maximum likelihood value (reported in the
form of the natural and the decimal log), i.e. the likelihood
associated with the parameter estimates in the output. The program
could have been designed to test various hypotheses. However, because
of the essentially unlimited number of parameters and hypotheses, it is
left to the user to specify which hypotheses to test. This may be
accomplished by setting any of the parameter values to "fixed" or
"variable" (input line 3). For example, suppose one wants to test
whether all families belong to class 2 (have the disease locus linked
with marker 2). Then there are two possibilities:
(1) Set α2 = 1 and all other αs = 0 and keep these values fixed, or
(2) fix all αs other than α2 and α(n+1), the α for the unlinked locus,
at the value 0.
The latter hypothesis differs from the former in that it allows for a
proportion of families to be unlinked to locus 2. The log likelihood
obtained under such a restricted hypothesis is then compared with the
log likelihood under a less restricted or unrestricted hypothesis. See
explanations at the end of the section on the HOMOG3R program for
further details.
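The comparison of two runs can be written down as a small Python
sketch. It only forms twice the difference in ln(L) and the difference
in the number of free parameters; which reference distribution is
appropriate for a given pair of hypotheses is discussed in the HOMOG3R
section and is not decided here. The function names are hypothetical,
and the restricted log likelihood of 170.0 in the example is a made-up
number.

```python
import math


def likelihood_ratio(lnl_general, lnl_restricted, free_general, free_restricted):
    """Return (2 * difference in ln L, difference in free parameters).

    Both log likelihoods must be natural logs, as reported in the line
    "The value of Max ln(L) is ..." of homogm.out.
    """
    statistic = 2.0 * (lnl_general - lnl_restricted)
    df = free_general - free_restricted
    if statistic < -1e-9 or df < 0:
        raise ValueError("restricted hypothesis must be nested in the general one")
    return max(statistic, 0.0), df


def ln_to_log10(lnl):
    """Convert a natural-log likelihood to the decimal log."""
    return lnl / math.log(10.0)


if __name__ == "__main__":
    # 174.002816 with 6 free parameters is the sample output above;
    # 170.0 with 4 free parameters is hypothetical.
    print(likelihood_ratio(174.002816, 170.0, 6, 4))
```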
POINT4 program
The POINT4 program is interactive and calculates the log likelihood at
specific parameter values for a mixture of up to 4 family types. It
reads input files in the regular format for the HOMOG program. When
files in the format for HOMOG3R are to be used by POINT4, the two
numbers on the second line indicating numbers of recombination
fractions must be replaced by a single number, which is the sum of the
previous two numbers.
To use the program, you will have to furnish 4 values of α (proportions
of family types), e.g., 0.23 0.77 0 0 for two components. Also, you
need to specify "theta" values. However, rather than the actual
recombination fractions, the program expects the consecutive (integer)
numbers corresponding to the θ values given in the input file, for
example, 3 for the third θ value. To specify a recombination fraction
of 50%, enter a number outside the range of numbers of θ values, e.g.,
0. The θ numbers corresponding to α = 0 are irrelevant (any θ will do).
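The calculation at fixed parameter values can be pictured with the
following Python sketch. It assumes the usual mixture form (for each
family, the α-weighted sum of the antilogs of its lod scores) and is
not taken from the POINT4 source; the function name mixture_loglik and
the toy numbers are hypothetical.

```python
import math


def mixture_loglik(lods, alphas, theta_numbers):
    """Natural-log likelihood of a mixture of family types at fixed values.

    lods: one list of lod scores per family, in the order of the theta
        values in the input file.
    alphas: proportions of the family types (must sum to 1).
    theta_numbers: 1-based number of the theta value for each family
        type; a number outside the valid range stands for theta = 0.5,
        where the lod score is 0.
    """
    if len(alphas) != len(theta_numbers):
        raise ValueError("need one theta number per alpha")
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("alphas must sum to 1")
    total = 0.0
    for family in lods:
        likelihood = 0.0
        for alpha, number in zip(alphas, theta_numbers):
            lod = family[number - 1] if 1 <= number <= len(family) else 0.0
            likelihood += alpha * 10.0 ** lod
        total += math.log(likelihood)
    return total


if __name__ == "__main__":
    toy_lods = [[1.0, 0.5], [-1.0, 0.0]]  # 2 families, 2 theta values
    print(mixture_loglik(toy_lods, [0.5, 0.5, 0.0, 0.0], [1, 0, 1, 1]))
```

In the example call, half of the families are of a type linked at the
first θ value and half are unlinked (θ number 0, i.e., θ = 0.5).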
MTEST program
The MTEST program implements Morton's likelihood ratio test for
heterogeneity of the recombination fraction among different groups of
families (Morton 1956). Each group consists of a certain number of
families, e.g., groups may correspond to investigators, or to countries
of origin, etc. Also, each family may be regarded as forming a group of
its own (Morton's original usage of the test). The test assumes
homogeneity within each group (same θ). The null hypothesis
specifies overall homogeneity while under the alternative hypothesis of
heterogeneity, a potentially different θ value exists for each group.
This test statistic is also reported by the standard HOMOG program.
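In terms of lod scores, Morton's statistic is 2 ln(10) times the
difference between the sum of the group-wise maximum lod scores and the
maximum of the total lod score, with (number of groups - 1) degrees of
freedom. The Python sketch below illustrates this on a grid of θ
values. It assumes that maxima are taken over the θ values in the input
file plus θ = 0.5 (lod = 0), without interpolation; whether MTEST
maximizes in exactly this way is an assumption. The function name
morton_test and the toy numbers are hypothetical.

```python
import math


def morton_test(lods, groups):
    """Morton's heterogeneity statistic and its degrees of freedom.

    lods[f][j]: lod score of family f at the j-th theta value.
    groups: list of groups, each a list of 0-based family indices.
    Assumption: maxima are taken over the grid plus theta = 0.5 (lod 0).
    """
    n_theta = len(lods[0])

    def max_lod(members):
        sums = [sum(lods[f][j] for f in members) for j in range(n_theta)]
        return max(max(sums), 0.0)

    pooled = [f for group in groups for f in group]
    within = sum(max_lod(group) for group in groups)
    statistic = 2.0 * math.log(10.0) * (within - max_lod(pooled))
    return statistic, len(groups) - 1


if __name__ == "__main__":
    toy_lods = [[1.0, 0.5], [-2.0, -0.5]]  # 2 families, 2 theta values
    print(morton_test(toy_lods, [[0], [1]]))
```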
Files used by the program have the following fixed names:
- MTEST.DAT is the input file. It has the same
structure as the input file to the HOMOG program. A sample
MTEST.DAT file is provided.
- MTEST.OUT is the output file.
- MTEST.GRP is an input file holding the family
group definitions. The first line contains the number, NGR, of groups
to follow on subsequent lines. On each of the following NGR lines,
family numbers are given that form one group, e.g., 3 11 12 15.
Contiguous family numbers may be given in abbreviated form, e.g.,
numbers 7 through 11 may be given as -7 11. The first line and the
following NGR lines define one set of groups. As many such sets may be
given as desired. An example MTEST.GRP file is provided. Note that
after the number NGR of groups, a title may follow on the same line,
but there must be at least one space between NGR and the title. If
NGR = 0 is given as the number of groups, this is taken to indicate
that each family should form one group of its own (original usage of
Morton's test). In that case, no family numbers are to be provided. The
last line of the file should contain the number -1 to indicate the end
of input.
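The group file layout, including the abbreviated form for contiguous
family numbers, can be checked with a short Python sketch before
running MTEST. This is an illustration based only on the description
above, not on the MTEST source; the function names expand_group and
parse_group_sets and the toy file contents are hypothetical.

```python
def expand_group(line):
    """Expand one group line into family numbers.

    A negative number starts a range that ends at the next number,
    so "-7 11" stands for 7 8 9 10 11.
    """
    numbers = [int(t) for t in line.split()]
    families = []
    i = 0
    while i < len(numbers):
        if numbers[i] < 0:
            if i + 1 >= len(numbers):
                raise ValueError("range start without an end: " + line)
            families.extend(range(-numbers[i], numbers[i + 1] + 1))
            i += 2
        else:
            families.append(numbers[i])
            i += 1
    return families


def parse_group_sets(text):
    """Return a list of (title, groups) pairs, one per set of groups.

    An empty list of groups (NGR = 0) means that each family forms a
    group of its own. Reading stops at a line starting with -1.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    sets = []
    i = 0
    while i < len(lines):
        fields = lines[i].split()
        ngr = int(fields[0])
        if ngr == -1:
            break
        body = lines[i + 1:i + 1 + ngr]
        if len(body) != ngr:
            raise ValueError("fewer group lines than NGR")
        sets.append((" ".join(fields[1:]), [expand_group(ln) for ln in body]))
        i += 1 + ngr
    return sets


if __name__ == "__main__":
    print(expand_group("3 -7 11 15"))
```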
REFERENCES
Bhat A, Heath SC, Ott J (1999) Heterogeneity for multiple disease loci
in linkage analysis. Hum Hered 49:229-231
Morton NE (1956) The detection and estimation of linkage between the
genes for elliptocytosis and the Rh blood type. Am J Hum Genet 8:80-96
Ott J (1986) Linkage probability and its approximate confidence
interval under possible heterogeneity. Genet Epidemiol Suppl 1:251-257
Ott J (1999) Analysis of Human Genetic Linkage, third edition. Johns
Hopkins University Press, Baltimore
Terwilliger JD, Ott J (1994) Handbook of Human Genetic Linkage. Johns
Hopkins University Press, Baltimore. Now available as a pdf file.