Jurg Ott, Rockefeller University, New York (ott@rockefeller.edu)
EH is a program to test and estimate linkage disequilibrium between different markers or between a disease locus and markers (Xie and Ott 1993). This is an updated version in which the previous "disease" (case-control) option has been deleted (but see below how to work with case-control data). The program is written in Free Pascal. Free Pascal is available for various platforms, e.g. Windows and Linux. Linux users will find useful information in a separate document. Software reference: http://lab.rockefeller.edu/ott/.
The data are taken to consist of a number of individuals collected at random from a population. Based on these sample data, the EH program estimates allele frequencies for each marker. Haplotype frequencies are estimated with allelic association (H1) and without (H0). The EH program also provides log likelihood, chi-square and the number of degrees of freedom under hypotheses H0 and H1. For more information please refer to Terwilliger and Ott (1994). Examples of uses of the EH program include Marroni et al (2005), Bialecka et al (2008), and Formicola et al (2010).
Notes:
EH.PAS: Source code of EH program.
EH.EXE: Executable code of EH program, which is compiled with a maximum of 30 alleles per locus, 10 loci, 1000 haplotypes, and 3600 genotype patterns (product of numbers of genotypes at each locus).
EH.DAT, EH.OUT, etc: Sample input and output files.
First line:
Number of alleles at the first marker, number of alleles at the second marker, and so on. For example, if you have 2 markers, the first with 2 alleles and the second with 3, you write 2 3 in the first line. The order of markers in the remainder of the input file is determined by the order in which you entered them in the first line.
Subsequent lines:
Number of observations for given genotypes. These numbers must be arranged as follows:
The number of columns is the number of the possible genotypes at the last locus. Let M be the number of alleles at the last locus, then the number of the possible genotypes equals M(M + 1)/2. For example, if the last locus has two alleles, then there are 3 possible genotypes which are 1/1, 1/2 and 2/2. Therefore, in each row there are 3 columns corresponding to the genotypes 1/1, 1/2, 2/2. Similarly, if the last marker has three alleles, then there are 6 columns corresponding to 1/1, 1/2, 2/2, 1/3, 2/3, 3/3.
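The column order can be generated mechanically. The following Python sketch is not part of EH, and the function name genotype_order is purely illustrative. It reproduces the orders listed above for two and three alleles; that the same pattern continues for loci with more than three alleles is an assumption.

```python
def genotype_order(m):
    """Return genotype labels for a locus with m alleles, in column order."""
    # For each second allele b, list every first allele a with a <= b.
    return [f"{a}/{b}" for b in range(1, m + 1) for a in range(1, b + 1)]


print(genotype_order(2))  # ['1/1', '1/2', '2/2']
print(genotype_order(3))  # ['1/1', '1/2', '2/2', '1/3', '2/3', '3/3']
```

The length of the returned list is always M(M + 1)/2, the number of columns required.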
The number of rows is the product of the numbers of possible genotypes at the first (N - 1) markers, where N is the total number of markers. That is, no. of rows = L1(L1 + 1)/2 x L2(L2 + 1)/2 x ... x L(N-1)(L(N-1) + 1)/2, where Li is the number of alleles at the i-th locus.
For example, assume you have 3 loci and the first and the second locus each have 2 alleles, and the third locus has 3 alleles. Thus, there are 6 columns and 9 rows (see example 1 below). However, if the first locus has 3 alleles and the second and third have 2 alleles each, there are 18 rows and 3 columns.
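As a quick check of the table layout before typing in the data, the row and column counts can be computed from the first line of the input file. This is a minimal Python sketch; the function name table_shape is an illustrative choice and not something EH provides.

```python
from math import prod


def table_shape(alleles):
    """Return (rows, columns) of the genotype count table.

    alleles lists the number of alleles per locus, in input-file order.
    """
    genotypes = [m * (m + 1) // 2 for m in alleles]  # genotypes per locus
    return prod(genotypes[:-1]), genotypes[-1]


print(table_shape([2, 2, 3]))  # (9, 6)
print(table_shape([3, 2, 2]))  # (18, 3)
```

The product of rows and columns is the number of genotype patterns, which for EH.EXE as distributed must not exceed 3600.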
The output file from the EH program, EH.OUT by default, contains the estimated haplotype frequencies and their corresponding log likelihoods.
Below, two examples are provided. In the methodology section, a case-control example is given.
Example 1:
Assume we have a random sample with three markers, 2 alleles each for the first and the second markers, and 3 alleles for the third marker. The observations for each genotype are as follows:
-------------------------------------------
                        Locus 3
                  -----------------------
Locus 1  Locus 2  1/1 1/2 2/2 1/3 2/3 3/3
-------------------------------------------
1/1      1/1        0   0   0  12   3   9
1/1      1/2        2  12   2   2   3   4
1/1      2/2        5   4   4   2   0  18
1/2      1/1        7   2   0   2   6   3
1/2      1/2        9   0   0   3   4   2
1/2      2/2       10   3   0   2   3   8
2/2      1/1        1   2  10   2   9   4
2/2      1/2        5   3   4   6   7   3
2/2      2/2        9   3   0   0   3  10
-------------------------------------------
According to the instructions in the input file section, we set up the input file as follows (these data are furnished in the EH.DAT file):
2 2 3
0 0 0 12 3 9
2 12 2 2 3 4
5 4 4 2 0 18
7 2 0 2 6 3
9 0 0 3 4 2
10 3 0 2 3 8
1 2 10 2 9 4
5 3 4 6 7 3
9 3 0 0 3 10
The output file from the EH program is shown below. It lists two kinds of haplotype frequencies. "Independent": these are obtained from the allele frequencies at the individual loci. That is, these haplotype frequencies are not estimated but calculated from allele frequencies under the assumption of no association. "w/Association": these are estimated from the data, allowing for association (linkage disequilibrium), assuming Hardy-Weinberg equilibrium.
Estimates of Gene Frequencies (Assuming Independence)
----\------------------------------------
locus \ allele 1 2 3
--------\--------------------------------
1 | 0.5022 0.4978
2 | 0.4736 0.5264
3 | 0.3436 0.2357 0.4207
-----------------------------------------
# of Typed Individuals: 227
There are 12 Possible Haplotypes of These 3 Loci.
They are Listed Below, with their Estimated Frequencies:
-------------------------------------------------------------
| Allele Allele Allele | Haplotype Frequency |
| at at at | |
| Locus 1 Locus 2 Locus 3 | Independent w/Association |
-------------------------------------------------------------
1 1 1 0.081720 0.091796
1 1 2 0.056052 0.020709
1 1 3 0.100055 0.115110
1 2 1 0.090843 0.078947
1 2 2 0.062309 0.067771
1 2 3 0.111224 0.127870
2 1 1 0.081004 0.055169
2 1 2 0.055560 0.117023
2 1 3 0.099177 0.073761
2 2 1 0.090046 0.117700
2 2 2 0.061762 0.030180
2 2 3 0.110248 0.103964
-------------------------------------------------------------
# of Iterations = 14
#param Ln(L) Chi-square
----------------------------------------------------------------
H0: No Association 4 -953.90 0.00
H1: Allelic Associations Allowed 11 -934.98 37.84
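The allele frequencies and the "Independent" haplotype column in the output above can be checked by simple allele counting. The following Python sketch is an illustration only and is not taken from EH.PAS; the names ALLELES, ROWS, genotypes and allele_frequencies are hypothetical. It assumes the row and column order described in the input file section. With the EH.DAT counts it should give 0.5022 and 0.4978 for locus 1, and 0.081720 for haplotype 1-1-1.

```python
from itertools import product

ALLELES = [2, 2, 3]  # first line of EH.DAT
ROWS = [             # remaining lines of EH.DAT
    [0, 0, 0, 12, 3, 9],
    [2, 12, 2, 2, 3, 4],
    [5, 4, 4, 2, 0, 18],
    [7, 2, 0, 2, 6, 3],
    [9, 0, 0, 3, 4, 2],
    [10, 3, 0, 2, 3, 8],
    [1, 2, 10, 2, 9, 4],
    [5, 3, 4, 6, 7, 3],
    [9, 3, 0, 0, 3, 10],
]


def genotypes(m):
    """Genotypes (a, b) with a <= b, in the input-file order."""
    return [(a, b) for b in range(1, m + 1) for a in range(1, b + 1)]


def allele_frequencies(alleles, rows):
    """Count alleles locus by locus and convert to relative frequencies."""
    counts = [[0] * m for m in alleles]
    flat = [n for row in rows for n in row]
    # product() varies the last locus fastest, matching the table layout.
    for combo, n in zip(product(*(genotypes(m) for m in alleles)), flat):
        for locus, (a, b) in enumerate(combo):
            counts[locus][a - 1] += n
            counts[locus][b - 1] += n
    total = 2 * sum(flat)  # two alleles per individual
    return [[c / total for c in locus] for locus in counts]


freqs = allele_frequencies(ALLELES, ROWS)
for locus, f in enumerate(freqs, start=1):
    print(locus, [round(x, 4) for x in f])

# "Independent" frequency of haplotype 1-1-1: product of allele frequencies.
print(round(freqs[0][0] * freqs[1][0] * freqs[2][0], 6))
```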
Example 2:
Again we have three markers, but with different numbers of alleles. The first marker has 3 alleles, and the second and the third each have 2 alleles. The observations for each genotype are as follows:
-------------------------------
                    Locus 3
                  -----------
Locus 1  Locus 2  1/1 1/2 2/2
-------------------------------
1/1      1/1        0   2   5
1/1      1/2        7   9  10
1/1      2/2        1   5   9
1/2      1/1        0  12   4
1/2      1/2        2   0   3
1/2      2/2        2   3   3
2/2      1/1        0   2   4
2/2      1/2        0   0   0
2/2      2/2       10   4   0
1/3      1/1       12   2   2
1/3      1/2        2   3   2
1/3      2/2        2   6   0
2/3      1/1        3   3   0
2/3      1/2        6   4   3
2/3      2/2        9   7   3
3/3      1/1        9   4  18
3/3      1/2        3   2   8
3/3      2/2        4   3  10
-------------------------------
Data input looks as follows:
3 2 2
0 2 5
7 9 10
1 5 9
0 12 4
2 0 3
2 3 3
0 2 4
0 0 0
10 4 0
12 2 2
2 3 2
2 6 0
3 3 0
6 4 3
9 7 3
9 4 18
3 2 8
4 3 10
The estimated allele frequencies and haplotypes are the same as in example 1, apart from the order of the loci.
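The counts in example 2 appear to be those of example 1 with the loci listed in a different order (the third locus of example 1 becomes the first). The Python sketch below illustrates this rearrangement; the names EXAMPLE1 and reorder_to_example2 are hypothetical, and the mapping of loci is inferred from the two data sets rather than stated in the EH documentation.

```python
# Example 1 counts: rows run over (locus 1, locus 2), columns over locus 3.
EXAMPLE1 = [
    [0, 0, 0, 12, 3, 9],
    [2, 12, 2, 2, 3, 4],
    [5, 4, 4, 2, 0, 18],
    [7, 2, 0, 2, 6, 3],
    [9, 0, 0, 3, 4, 2],
    [10, 3, 0, 2, 3, 8],
    [1, 2, 10, 2, 9, 4],
    [5, 3, 4, 6, 7, 3],
    [9, 3, 0, 0, 3, 10],
]


def reorder_to_example2(table):
    """Rearrange a 2-2-3 allele table into 3-2-2 order.

    New locus 1 = old locus 3, new locus 2 = old locus 1,
    new locus 3 = old locus 2.
    """
    new_rows = []
    for a in range(6):      # genotype index at old locus 3 (6 genotypes)
        for b in range(3):  # genotype index at old locus 1 (3 genotypes)
            # c runs over the genotypes at old locus 2 (new columns)
            new_rows.append([table[3 * b + c][a] for c in range(3)])
    return new_rows


print("3 2 2")
for row in reorder_to_example2(EXAMPLE1):
    print(*row)
```

Because only the order of the loci changes, the total number of individuals (227) and the allele counts at each marker are unaffected.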
The method of gene counting is a particular form of the EM algorithm, which iteratively furnishes ML estimates of parameters. Consider, for example, an individual with genotypes Dd at locus 1 and bb at locus 2. This individual is known to carry a Db and a db haplotype. Therefore, these two haplotypes can be directly observed and included in a count of haplotypes. On the other hand, an individual with genotypes Dd and Bb has one or the other of the phases DB/db and Db/dB. Each phase occurs with a probability given by the haplotype frequencies. According to the principle of gene counting, such a doubly heterozygous individual may be split into two components in the proportions given by the phase probabilities. With a given phase, haplotypes can again be counted, but these counts contribute towards the total only with a weight (of less than 1) as given by the appropriate phase probability.
With a given set of initial haplotype frequencies, the gene counting method furnishes estimates in an iterative manner. Each iteration leads to a new count of haplotypes, which is converted into relative frequencies. After a sufficient number of iterations, the haplotype frequencies will be close to their maximum likelihood estimates (MLEs).
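For the simplest case of two loci with two alleles each, gene counting can be sketched in a few lines of Python. This is an illustration of the principle described above and not a transcription of EH.PAS, which handles several multiallelic loci. The function name em_two_loci, the equal starting frequencies and the stopping rule are assumptions made for the example. Alleles D and B correspond to allele 1, and d and b to allele 2.

```python
def em_two_loci(counts, max_iter=1000, tol=1e-12):
    """Gene counting for two biallelic loci.

    counts[i][j] is the number of individuals with genotype i at locus 1
    and j at locus 2, where index 0, 1, 2 stands for 1/1, 1/2, 2/2.
    Returns frequencies of haplotypes 1-1, 1-2, 2-1, 2-2.
    """
    pairs = [(0, 0), (0, 1), (1, 1)]  # alleles carried by 1/1, 1/2, 2/2
    n_haplotypes = 2 * sum(sum(row) for row in counts)
    freq = [0.25] * 4  # assumed starting values
    for _ in range(max_iter):
        hap = [0.0] * 4
        for i in range(3):
            for j in range(3):
                n = counts[i][j]
                if n == 0:
                    continue
                if i == 1 and j == 1:
                    # Double heterozygote: split between the two phases.
                    cis = freq[0] * freq[3]    # phase 1-1 / 2-2
                    trans = freq[1] * freq[2]  # phase 1-2 / 2-1
                    w = cis / (cis + trans)
                    hap[0] += n * w
                    hap[3] += n * w
                    hap[1] += n * (1 - w)
                    hap[2] += n * (1 - w)
                else:
                    # At least one locus is homozygous: phase is known.
                    for a, b in zip(pairs[i], pairs[j]):
                        hap[2 * a + b] += n
        new_freq = [h / n_haplotypes for h in hap]
        converged = max(abs(x - y) for x, y in zip(new_freq, freq)) < tol
        freq = new_freq
        if converged:
            break
    return freq


# 10 individuals 1/1 1/1, 20 double heterozygotes, 10 individuals 2/2 2/2
print(em_two_loci([[10, 0, 0], [0, 20, 0], [0, 0, 10]]))
```

In this artificial data set the unambiguous individuals carry only haplotypes 1-1 and 2-2, so successive iterations assign the double heterozygotes almost entirely to the phase 1-1/2-2 and the estimates approach 0.5, 0, 0, 0.5.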
Case-control data
If you want to test whether haplotype frequencies are significantly different in cases and controls, you run EH three times, (1) for cases, (2) for controls, and (3) for cases and controls combined. For a given data set (CASE.DAT, CONTROL.DAT, provided), results are shown below.
cases
                                  #param   Ln(L)   Chi-square
-------------------------------------------------------------
H0: No Association                   3    -255.62      0.00
H1: Allelic Associations Allowed     7    -238.70     33.83
controls
                                  #param   Ln(L)   Chi-square
-------------------------------------------------------------
H0: No Association                   3    -144.41      0.00
H1: Allelic Associations Allowed     7    -135.83     17.15
cases+controls
                                  #param   Ln(L)   Chi-square
-------------------------------------------------------------
H0: No Association                   3    -404.36      0.00
H1: Allelic Associations Allowed     7    -384.26     40.20
The relevant test statistic is T = ln(L, cases) + ln(L, controls) - ln(L, cases + controls together), where in each case the Ln(L) value associated with H1 is used. With a sufficient number of observations, and in the absence of a true difference between case and control haplotype frequencies, twice this value has an approximate chi-square distribution with a number of df equal to the number of haplotypes estimated minus 1. For the above data, one obtains T = (-238.70) + (-135.83) - (-384.26) = 9.73; chi-square = 2 x 9.73 = 19.46 on 7 df is associated with an empirical significance level of 0.007.
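The arithmetic above can be reproduced with the Python sketch below, which takes the three H1 log likelihoods from the EH output and evaluates the chi-square tail probability. EH itself does not perform this step. The helper chi2_sf_odd is a hypothetical name; it uses the standard series expression for the chi-square tail probability with an odd number of df, so that no statistics library is needed.

```python
from math import erfc, exp, pi, sqrt


def chi2_sf_odd(x, df):
    """Upper tail probability of a chi-square variable with odd df."""
    if df % 2 != 1:
        raise ValueError("this series form is valid for odd df only")
    chi = sqrt(x)
    series = 0.0
    denom = 1.0
    for r in range(1, (df - 1) // 2 + 1):
        denom *= 2 * r - 1  # 1, 1*3, 1*3*5, ...
        series += chi ** (2 * r - 1) / denom
    density = exp(-x / 2) / sqrt(2 * pi)  # standard normal density at chi
    return erfc(chi / sqrt(2)) + 2 * density * series


# Ln(L) under H1, copied from the three EH runs shown above
LNL_H1 = {"cases": -238.70, "controls": -135.83, "combined": -384.26}

t = LNL_H1["cases"] + LNL_H1["controls"] - LNL_H1["combined"]
chi_square = 2 * t
p_value = chi2_sf_odd(chi_square, 7)  # 8 haplotypes minus 1

print(f"T = {t:.2f}, chi-square = {chi_square:.2f}, p = {p_value:.3f}")
# T = 9.73, chi-square = 19.46, p = 0.007
```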
Formicola D et al (2010) Common variants in the regulative regions of GRIA1 and GRIA3 receptor genes are associated with migraine susceptibility. BMC Medical Genetics 11:103
Marchini J et al (2006) A comparison of phasing algorithms for trios and unrelated individuals. Am J Hum Genet 78, 437-450
Marroni AS et al (2005) Consistent interethnic differences in the distribution of clinically relevant endothelial nitric oxide synthase genetic polymorphisms. Nitric Oxide 12, 177-182
Xie X, Ott J (1993) Testing linkage disequilibrium between a disease gene and marker loci. Am J Hum Genet 53, 1107 (abstract)
Terwilliger J, Ott J (1994) Handbook of Human Genetic Linkage. Johns Hopkins University Press, Baltimore