11 Jan 2013

User's Guide to the EH program

Jurg Ott, Rockefeller University, New York (ott@rockefeller.edu)

Introduction

EH is a program to test and estimate linkage disequilibrium between different markers or between a disease locus and markers (Xie and Ott 1993). This is an updated version in which the previous "disease" (case-control) option has been deleted (but see below how to work with case-control data). The program is written in Free Pascal. Free Pascal is available for various platforms, e.g. Windows and Linux. Linux users will find useful information in a separate document. Software reference: http://lab.rockefeller.edu/ott/.

The data are taken to consist of a number of individuals collected at random from a population. Based on these sample data, the EH program estimates allele frequencies for each marker. Haplotype frequencies are estimated with allelic association (H1) and without (H0). The EH program also provides log likelihood, chi-square and the number of degrees of freedom under hypotheses H0 and H1. For more information please refer to Terwilliger and Ott (1994). Examples of uses of the EH program include Marroni et al (2005), Bialecka et al (2008), and Formicola et al (2010).

Notes:

Files in this package (Windows):

EH.PAS: Source code of EH program.

EH.EXE: Executable code of EH program, which is compiled with a maximum of 30 alleles per locus, 10 loci, 1000 haplotypes, and 3600 genotype patterns (product of numbers of genotypes at each locus).

EH.DAT, EH.OUT, etc: Sample input and output files.  

Input file

The EH program does not require additional programs, although you need a Pascal compiler (Free Pascal) to recompile it after changing program constants. There is one input file, whose name the user can choose; the default name is EH.DAT. It contains the number of alleles for each marker and the number of observations for each genotype.

First line:
Number of alleles at the first marker, number of alleles at the second marker, and so on. Assuming you have 2 markers, the first marker has 2 alleles and the second marker has 3, you write 2 3 in the first line. The order of markers in the remainder of the input file is determined by the order of markers you entered in the first line.

Subsequent lines:
Number of observations for given genotypes. These numbers must be arranged as follows:

The number of columns is the number of possible genotypes at the last locus. If M is the number of alleles at the last locus, the number of possible genotypes equals M(M + 1)/2. For example, if the last locus has two alleles, there are 3 possible genotypes, 1/1, 1/2 and 2/2, so each row has 3 columns corresponding to these genotypes in this order. Similarly, if the last marker has three alleles, there are 6 columns corresponding to 1/1, 1/2, 2/2, 1/3, 2/3, 3/3.

The number of rows is the product of the numbers of possible genotypes at the first (N - 1) markers, where N is the total number of markers. That is, no. of rows = L1(L1 + 1)/2 x L2(L2 + 1)/2 x ... x Lk(Lk + 1)/2 with k = N - 1, where Li is the number of alleles at the i-th locus.

For example, assume you have 3 loci and the first and the second locus each have 2 alleles, and the third locus has 3 alleles. Thus, there are 6 columns and 9 rows (see example 1 below). However, if the first locus has 3 alleles and the second and third have 2 alleles each, there are 18 rows and 3 columns.
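The layout rules above can be checked with a few lines of code. The following is a minimal sketch in Python; it is not part of the EH package, and the function names are made up for illustration. For loci with more than three alleles, the genotype order is assumed to continue the pattern shown above (1/4, 2/4, 3/4, 4/4, and so on).

```python
def n_genotypes(n_alleles):
    """Number of unordered genotypes at a locus with n_alleles alleles: M(M + 1)/2."""
    return n_alleles * (n_alleles + 1) // 2


def genotype_labels(n_alleles):
    """Genotype labels in the column order described above: 1/1, 1/2, 2/2, 1/3, ..."""
    return [f"{i}/{j}" for j in range(1, n_alleles + 1) for i in range(1, j + 1)]


def table_shape(alleles):
    """Return (rows, columns) of the count table for the allele numbers on the first line."""
    rows = 1
    for m in alleles[:-1]:          # all loci except the last determine the rows
        rows *= n_genotypes(m)
    return rows, n_genotypes(alleles[-1])


print(table_shape([2, 2, 3]))   # (9, 6)
print(table_shape([3, 2, 2]))   # (18, 3)
print(genotype_labels(3))       # ['1/1', '1/2', '2/2', '1/3', '2/3', '3/3']
```

A check of this kind before running EH can catch input files with a wrong number of rows or columns.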

The output file from the EH program, EH.OUT by default, contains the estimated haplotype frequencies and their corresponding log likelihoods.  

Running the EH program

Open a command window and make certain the EH program is accessible, or double-click on the EH.EXE file. In Windows 2000, you open a command window by double-clicking on an icon called "CMD". Prepare an input file according to the description above. Then simply type EH and follow the instructions. If the program does not start, i.e., you receive the message "'eh' is not recognized as an internal or external command, operable program or batch file", then the EH program is not in the path (ask a computer expert what this means).

Below, two examples are provided. In the methodology section, a case-control example is given.

Example 1:

Assume we have a random sample with three markers, 2 alleles each for the first and the second markers, and 3 alleles for the third marker. The observations for each genotype are as follows:

-------------------------------------------
                             Locus 3
                    -----------------------
Locus 1  Locus 2    1/1 1/2 2/2 1/3 2/3 3/3
-------------------------------------------
  1/1      1/1       0   0   0  12   3   9
  1/1      1/2       2  12   2   2   3   4
  1/1      2/2       5   4   4   2   0  18
  1/2      1/1       7   2   0   2   6   3
  1/2      1/2       9   0   0   3   4   2
  1/2      2/2      10   3   0   2   3   8
  2/2      1/1       1   2  10   2   9   4
  2/2      1/2       5   3   4   6   7   3
  2/2      2/2       9   3   0   0   3  10
-------------------------------------------
According to the instructions in the input file section, we set up the input file as follows (these data are furnished in the EH.DAT file):
  2  2  3
  0  0  0 12  3  9
  2 12  2  2  3  4
  5  4  4  2  0 18
  7  2  0  2  6  3
  9  0  0  3  4  2
 10  3  0  2  3  8
  1  2 10  2  9  4
  5  3  4  6  7  3
  9  3  0  0  3 10
The output file from the EH program is:
Estimates of Gene Frequencies (Assuming Independence)
----\------------------------------------
locus \ allele     1       2       3   
--------\--------------------------------
   1    |         0.5022  0.4978        
   2    |         0.4736  0.5264        
   3    |         0.3436  0.2357  0.4207
-----------------------------------------
# of Typed Individuals: 227

There are 12 Possible Haplotypes of These 3 Loci.
They are Listed Below, with their Estimated Frequencies:
-------------------------------------------------------------
| Allele   Allele   Allele  |      Haplotype Frequency      |
|   at       at       at    |                               |
| Locus 1  Locus 2  Locus 3 |  Independent   w/Association  |
-------------------------------------------------------------
   1        1        1           0.081720     0.091796
   1        1        2           0.056052     0.020709
   1        1        3           0.100055     0.115110
   1        2        1           0.090843     0.078947
   1        2        2           0.062309     0.067771
   1        2        3           0.111224     0.127870
   2        1        1           0.081004     0.055169
   2        1        2           0.055560     0.117023
   2        1        3           0.099177     0.073761
   2        2        1           0.090046     0.117700
   2        2        2           0.061762     0.030180
   2        2        3           0.110248     0.103964
-------------------------------------------------------------
# of Iterations = 14

                                   #param   Ln(L)     Chi-square
----------------------------------------------------------------
H0: No Association                      4   -953.90      0.00
H1: Allelic Associations Allowed       11   -934.98     37.84
Two kinds of haplotype frequencies are listed. "Independent" -- these are obtained from the allele frequencies at the individual loci. That is, these haplotype frequencies are not estimated as such but calculated as products of the allele frequencies under the assumption of no association. "w/Association" -- these are estimated from the data, allowing for association (linkage disequilibrium) and assuming Hardy-Weinberg equilibrium.
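The "Independent" column and the allele frequencies can be reproduced by simple counting. The following Python sketch is an illustration only and is not taken from the EH source code; the variable names are invented. It reads the counts of example 1, counts alleles, and multiplies the allele frequencies. It also checks that the chi-square printed for H1 equals twice the difference between the two Ln(L) values.

```python
from itertools import product

ALLELES = [2, 2, 3]            # first line of EH.DAT in example 1
ROWS = [                       # remaining lines of EH.DAT in example 1
    [0, 0, 0, 12, 3, 9],
    [2, 12, 2, 2, 3, 4],
    [5, 4, 4, 2, 0, 18],
    [7, 2, 0, 2, 6, 3],
    [9, 0, 0, 3, 4, 2],
    [10, 3, 0, 2, 3, 8],
    [1, 2, 10, 2, 9, 4],
    [5, 3, 4, 6, 7, 3],
    [9, 3, 0, 0, 3, 10],
]


def genotypes(m):
    # Genotype order used in the input file: 1/1, 1/2, 2/2, 1/3, 2/3, 3/3
    return [(i, j) for j in range(1, m + 1) for i in range(1, j + 1)]


# Pair every multi-locus genotype with its count (the last locus varies fastest).
counts = dict(zip(product(*[genotypes(m) for m in ALLELES]),
                  [n for row in ROWS for n in row]))
n_individuals = sum(counts.values())

# Allele frequencies by direct counting: every individual carries two alleles per locus.
allele_freq = [[0.0] * m for m in ALLELES]
for multilocus_genotype, n in counts.items():
    for locus, pair in enumerate(multilocus_genotype):
        for allele in pair:
            allele_freq[locus][allele - 1] += n / (2 * n_individuals)

# "Independent" haplotype frequencies: products of the allele frequencies.
independent = {}
for haplotype in product(*[range(1, m + 1) for m in ALLELES]):
    f = 1.0
    for locus, allele in enumerate(haplotype):
        f *= allele_freq[locus][allele - 1]
    independent[haplotype] = f

print(n_individuals)                       # 227
print(f"{allele_freq[0][0]:.4f}")          # 0.5022
print(f"{independent[(1, 1, 1)]:.6f}")     # 0.081720

# Chi-square for H1 = 2 x (Ln(L) under H1 - Ln(L) under H0), values copied from the output.
chi_square = 2 * (-934.98 - (-953.90))
print(f"{chi_square:.2f}")                 # 37.84
```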

Example 2:

Again we have three markers, but with different numbers of alleles. The first marker has 3 alleles, and the second and the third each have 2 alleles. The observations for each genotype are as follows:

-------------------------------
                      Locus 3
                    -----------
Locus 1   Locus 2   1/1 1/2 2/2
-------------------------------
           1/1       0   2   5   
   1/1     1/2       7   9  10
           2/2       1   5   9

           1/1       0  12   4
   1/2     1/2       2   0   3
           2/2       2   3   3

           1/1       0   2   4
   2/2     1/2       0   0   0
           2/2      10   4   0

           1/1      12   2   2
   1/3     1/2       2   3   2
           2/2       2   6   0

           1/1       3   3   0
   2/3     1/2       6   4   3
           2/2       9   7   3

           1/1       9   4  18
   3/3     1/2       3   2   8
           2/2       4   3  10
-------------------------------
Data input looks as follows:
3 2 2
   0   2   5
   7   9  10
   1   5   9
   0  12   4
   2   0   3
   2   3   3
   0   2   4
   0   0   0
  10   4   0
  12   2   2
   2   3   2
   2   6   0
   3   3   0
   6   4   3
   9   7   3
   9   4  18
   3   2   8
   4   3  10
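These are the data of example 1 with the loci in a different order: locus 1 here is locus 3 of example 1, and loci 2 and 3 here are loci 1 and 2 of example 1. The estimated allele and haplotype frequencies are therefore the same as in example 1, apart from the order of the loci.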
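Comparing the two input files shows that example 2 holds the counts of example 1 with locus 3 moved to the first position. The following Python sketch, with a hypothetical helper named reorder that is not part of EH, rewrites a count table for a different locus order and confirms that the input of example 1 turns into the input of example 2.

```python
from itertools import product

EXAMPLE_1 = [
    [0, 0, 0, 12, 3, 9], [2, 12, 2, 2, 3, 4], [5, 4, 4, 2, 0, 18],
    [7, 2, 0, 2, 6, 3], [9, 0, 0, 3, 4, 2], [10, 3, 0, 2, 3, 8],
    [1, 2, 10, 2, 9, 4], [5, 3, 4, 6, 7, 3], [9, 3, 0, 0, 3, 10],
]
EXAMPLE_2 = [
    [0, 2, 5], [7, 9, 10], [1, 5, 9],
    [0, 12, 4], [2, 0, 3], [2, 3, 3],
    [0, 2, 4], [0, 0, 0], [10, 4, 0],
    [12, 2, 2], [2, 3, 2], [2, 6, 0],
    [3, 3, 0], [6, 4, 3], [9, 7, 3],
    [9, 4, 18], [3, 2, 8], [4, 3, 10],
]


def genotypes(m):
    # Genotype order used in the input file: 1/1, 1/2, 2/2, 1/3, 2/3, 3/3
    return [(i, j) for j in range(1, m + 1) for i in range(1, j + 1)]


def reorder(alleles, rows, order):
    """Rewrite a count table so that new locus k is old locus order[k] (0-based)."""
    flat = [n for row in rows for n in row]
    counts = dict(zip(product(*[genotypes(m) for m in alleles]), flat))
    new_alleles = [alleles[i] for i in order]
    new_genotypes = [genotypes(m) for m in new_alleles]
    new_flat = []
    for combo in product(*new_genotypes):
        old = [None] * len(alleles)
        for new_pos, old_pos in enumerate(order):
            old[old_pos] = combo[new_pos]     # put each genotype back at its old locus
        new_flat.append(counts[tuple(old)])
    width = len(new_genotypes[-1])            # columns = genotypes at the new last locus
    new_rows = [new_flat[i:i + width] for i in range(0, len(new_flat), width)]
    return new_alleles, new_rows


new_alleles, new_rows = reorder([2, 2, 3], EXAMPLE_1, [2, 0, 1])
print(new_alleles)              # [3, 2, 2]
print(new_rows == EXAMPLE_2)    # True
```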

Methodology

Gene counting

The method of gene counting is a particular form of the EM algorithm, which iteratively furnishes maximum likelihood (ML) estimates of parameters. Consider, for example, an individual with genotypes Dd at locus 1 and bb at locus 2. This individual is known to carry a Db and a db haplotype. Therefore, these two haplotypes can be directly observed and included in a count of haplotypes. On the other hand, an individual with genotypes Dd and Bb has one or the other of the phases DB/db and Db/dB. Each phase occurs with a probability given by the haplotype frequencies. According to the principle of gene counting, such a doubly heterozygous individual may be split into two components in the proportions given by the phase probabilities. With a given phase, haplotypes can again be counted, but these counts contribute towards the total only with a weight (of less than 1) as given by the appropriate phase probability.

With a given set of initial haplotype frequencies, the gene counting method furnishes estimates in an iterative manner. Each iteration leads to a new count of haplotypes, which is converted into relative frequencies. After a sufficient number of iterations, the haplotype frequencies will be close to their ML estimates.
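The gene counting scheme described above can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the code of the EH program: the function names are invented, the starting values are taken to be the products of the allele frequencies, and a fixed number of 100 iterations replaces whatever convergence criterion EH uses. With the data of example 1, the starting Ln(L) corresponds to the H0 line of the output; the final Ln(L) and haplotype frequencies can be compared with the H1 line and the "w/Association" column.

```python
from itertools import product
from math import log

ALLELES = [2, 2, 3]            # example 1: numbers of alleles per locus
ROWS = [[0, 0, 0, 12, 3, 9], [2, 12, 2, 2, 3, 4], [5, 4, 4, 2, 0, 18],
        [7, 2, 0, 2, 6, 3], [9, 0, 0, 3, 4, 2], [10, 3, 0, 2, 3, 8],
        [1, 2, 10, 2, 9, 4], [5, 3, 4, 6, 7, 3], [9, 3, 0, 0, 3, 10]]


def genotypes(m):
    # Genotype order of the input file: 1/1, 1/2, 2/2, 1/3, 2/3, 3/3
    return [(i, j) for j in range(1, m + 1) for i in range(1, j + 1)]


def phases(multilocus_genotype):
    # All ordered haplotype pairs compatible with the genotype; each
    # heterozygous locus doubles the number of possibilities.
    per_locus = [[(a, b)] if a == b else [(a, b), (b, a)]
                 for a, b in multilocus_genotype]
    return [(tuple(c[0] for c in choice), tuple(c[1] for c in choice))
            for choice in product(*per_locus)]


def log_likelihood(counts, freq):
    # Ln(L) = sum over genotypes of count x ln(probability of the genotype)
    return sum(n * log(sum(freq[h1] * freq[h2] for h1, h2 in phases(g)))
               for g, n in counts.items() if n > 0)


def gene_counting_step(counts, freq):
    # One iteration: split each genotype over its phases in proportion to the
    # current phase probabilities, count haplotypes, convert to frequencies.
    tally = dict.fromkeys(freq, 0.0)
    for g, n in counts.items():
        if n == 0:
            continue
        pairs = phases(g)
        weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
        total = sum(weights)
        for (h1, h2), w in zip(pairs, weights):
            tally[h1] += n * w / total
            tally[h2] += n * w / total
    n_haplotypes = 2 * sum(counts.values())
    return {h: c / n_haplotypes for h, c in tally.items()}


counts = dict(zip(product(*[genotypes(m) for m in ALLELES]),
                  [n for row in ROWS for n in row]))

# Starting values (an assumption): products of allele frequencies, i.e. no association.
n_haplotypes = 2 * sum(counts.values())
allele_freq = [[0.0] * m for m in ALLELES]
for g, n in counts.items():
    for locus, pair in enumerate(g):
        for allele in pair:
            allele_freq[locus][allele - 1] += n / n_haplotypes
freq = {}
for h in product(*[range(1, m + 1) for m in ALLELES]):
    freq[h] = 1.0
    for locus, allele in enumerate(h):
        freq[h] *= allele_freq[locus][allele - 1]

history = [log_likelihood(counts, freq)]
for _ in range(100):
    freq = gene_counting_step(counts, freq)
    history.append(log_likelihood(counts, freq))

print(f"Ln(L) at start: {history[0]:.2f}")
print(f"Ln(L) after 100 iterations: {history[-1]:.2f}")
for h, f in sorted(freq.items()):
    print(h, f"{f:.6f}")
```

Fully homozygous and singly heterozygous individuals have only one distinct phase, so they are counted directly; only individuals heterozygous at two or more loci are split with weights below 1, as described above.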

Case-control data

If you want to test whether haplotype frequencies are significantly different in cases and controls, you run EH three times, (1) for cases, (2) for controls, and (3) for cases and controls combined. For a given data set (CASE.DAT, CONTROL.DAT, provided), results are shown below.

cases                              #param   Ln(L)  Chi-square
-------------------------------------------------------------
H0: No Association                      3   -255.62      0.00
H1: Allelic Associations Allowed        7   -238.70     33.83

controls                           #param   Ln(L)  Chi-square
-------------------------------------------------------------
H0: No Association                      3   -144.41      0.00
H1: Allelic Associations Allowed        7   -135.83     17.15

cases+controls                     #param   Ln(L)  Chi-square
-------------------------------------------------------------
H0: No Association                      3   -404.36      0.00
H1: Allelic Associations Allowed        7   -384.26     40.20

The relevant test statistic is T = Ln(L, cases) + Ln(L, controls) – Ln(L, cases + controls together), where in each case the Ln(L) value associated with H1 is used. With a sufficient number of observations, and in the absence of a true difference between case and control haplotype frequencies, twice this value has an approximate chi-square distribution with a number of df equal to the number of haplotypes estimated minus 1. For the above data, one obtains T = (–238.70) + (–135.83) – (–384.26) = 9.73; chi-square = 2 x 9.73 = 19.46 on 7 df is associated with an empirical significance level of 0.007.
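The arithmetic of this test can be retraced with the short Python sketch below. It is an illustration only: the function chi2_upper_tail is a hypothetical helper that uses the standard closed-form expressions for the chi-square tail probability with integer df, and the number of haplotypes (8) is inferred from the 7 parameters listed under H1, assuming three loci with two alleles each.

```python
from math import erfc, exp, pi, sqrt


def chi2_upper_tail(x, df):
    """P(chi-square with integer df exceeds x), using the standard closed forms."""
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return exp(-x / 2) * total
    # Odd df: normal tail plus a finite sum of chi^(2r-1) / (1 x 3 x ... x (2r-1))
    chi = sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return erfc(chi / sqrt(2)) + 2 * exp(-x / 2) / sqrt(2 * pi) * total


# Ln(L) under H1 from the three runs shown above
lnl_cases, lnl_controls, lnl_combined = -238.70, -135.83, -384.26
n_haplotypes = 2 * 2 * 2    # assumption: three loci with two alleles each

t = lnl_cases + lnl_controls - lnl_combined
chi_square = 2 * t
df = n_haplotypes - 1
p_value = chi2_upper_tail(chi_square, df)

print(f"T = {t:.2f}, chi-square = {chi_square:.2f}, df = {df}, p = {p_value:.3f}")
```

The printed significance level should agree with the value of about 0.007 quoted above.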

References

Bialecka M et al (2008) The association of functional catechol-O-methyltransferase haplotypes with risk of Parkinson's disease, levodopa treatment response, and complications. Pharmacogenet Genomics 18, 815-821

Formicola D et al (2010) Common variants in the regulative regions of GRIA1 and GRIA3 receptor genes are associated with migraine susceptibility. BMC Med Genet 11, 103

Marchini J et al (2006) A comparison of phasing algorithms for trios and unrelated individuals. Am J Hum Genet 78, 437-450

Marroni AS et al (2005) Consistent interethnic differences in the distribution of clinically relevant endothelial nitric oxide synthase genetic polymorphisms. Nitric Oxide 12, 177-182

Xie X, Ott J (1993) Testing linkage disequilibrium between a disease gene and marker loci. Am J Hum Genet 53, 1107 (abstract)

Terwilliger J, Ott J (1994) Handbook of Human Genetic Linkage. Johns Hopkins University Press, Baltimore