Genotype Pattern Mining for Human Digenic Traits

Jurg Ott - 29 March 2023

This document describes a program package, GPM, for finding pairs of genotypes or DNA variants associated with a genetic trait (so-called digenic disease), where each genotype or variant by itself may not be disease associated. GPM consists of two algorithms, Gpairs and Vpairs, described below. Both are available for Windows and Linux (Kubuntu), but they perform much better under Linux than under Windows. Two published case-control datasets in plink format [9] may serve as sample data: 1) age-related macular degeneration (AMD) [1] and 2) AMD data collected in Hong Kong (HK) [2]. The programs must be run from a command-line window (terminal).

Frequent Pattern Mining
Frequent Pattern Mining (FPM) methods can rapidly pick frequent patterns from large databases of items. The first such approach was implemented in the Apriori program [3, 4], which was developed to mine consumer data. In addition to finding frequent items commonly purchased by a consumer, Apriori also furnishes association rules, that is, estimates of P(Y|X), where Y and X are purchased goods like bread, milk, and wine. Such association rules can predict how likely it is that someone buying X will also buy Y. Further information may be found in our recent reviews [5-7].

Here we apply FPM principles to genetic case-control studies with a moderate number of DNA variants and individuals. Specifically, we consider pairs of genotypes (genotype patterns) with one genotype each from two variants. Genotypes are labeled 1, 2, and 3 for AA, AB, and BB, respectively, and the label 0 (zero) stands for “missing”. For example, the genotype pattern consisting of AB at the first variant and AA at the second is written X = 21. Phenotypes are labeled 2 for cases and 1 for controls, and we want to find association rules, P(Y = 2|X), that is, estimates of the conditional probability of being a case, given by the proportion of cases among all individuals with genotype pattern X. In FPM terminology [3, 4], P(Y|X) is referred to as confidence, and P(X) is the support for X. In statistics, confidence is known as the (positive) predictive value.
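As a minimal sketch of these definitions, the following Python fragment counts the support of one genotype pattern and estimates its confidence. The function name and the toy data are made up for illustration and are not part of GPM; support is returned as a count, which divided by the number of individuals gives P(X).

```python
# Genotype codes follow the text: 1 = AA, 2 = AB, 3 = BB, 0 = missing.
# Phenotype codes: 2 = case, 1 = control.

def support_and_confidence(geno1, geno2, pheno, pattern):
    """Return (support count, confidence) for one genotype pattern.

    pattern is a pair such as (2, 1): AB at the first variant and
    AA at the second. This helper is hypothetical, not a GPM function.
    """
    # Phenotypes of all individuals who carry the pattern
    carriers = [y for g1, g2, y in zip(geno1, geno2, pheno)
                if (g1, g2) == pattern]
    support = len(carriers)
    if support == 0:
        return 0, None          # confidence is undefined without carriers
    n_cases = sum(1 for y in carriers if y == 2)
    return support, n_cases / support

# Invented toy data for eight individuals
geno1 = [2, 2, 1, 2, 3, 2, 0, 2]
geno2 = [1, 1, 1, 3, 1, 1, 1, 1]
pheno = [2, 2, 1, 1, 2, 1, 2, 2]

support, confidence = support_and_confidence(geno1, geno2, pheno, (2, 1))
print(support, confidence)
```

Individuals with a missing genotype (code 0) at either variant never match a pattern made of codes 1 to 3, so they are left out of the count automatically.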

In a previous publication [8], we applied the Apriori algorithm to find interaction (epistasis) effects of variants in case-control data. Here, however, the focus is simply on the frequencies of genotype patterns and whether they differ between cases and controls [10], even when the individual variants show no such differences. Other approaches to finding epistatic variants exist [5].

Pairs of variants and genotypes

Consider two variants, possibly on different chromosomes. Each variant has 3 genotypes, so there are 3 × 3 = 9 genotype pairs. All possible variant pairs are numbered 1, 2, ..., M, where M may be rather large. A given pair of variants is analyzed in one of two ways, by the Vpairs or the Gpairs program described below. Both programs assume that the data are in standard plink format [9].

Vpairs program. For each of cases and controls, a 3 × 3 table of genotype pairs is set up, as in the example below (controls first, then cases).

Controls
       |   SNP 2
SNP 1  |  1  2  3
 1     | 14  9 14
 2     | 19 55 29
 3     | 13 31 16

Cases
       |   SNP 2
SNP 1  |  1  2  3
 1     |  8 31  6
 2     | 17 50 31
 3     | 10 30 17

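Then a likelihood ratio test is carried out to test for differences in interaction between cases and controls, which results in a chi-square value with 4 df, as shown below for the example data. In this output, the CONTROLS, CASES, and BOTH rows give the chi-square for association between the genotypes of the two SNPs in controls, in cases, and in the pooled sample, respectively. The heterogeneity chi-square (Heterog) equals CONTROLS + CASES - BOTH, here 11.0769 + 6.5365 - 2.9578 = 14.6556.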

Heterogeneity analysis, genotype tests
          chi-sq df   p
CONTROLS 11.0769  4 0.0257
CASES     6.5365  4 0.1625
BOTH      2.9578  4 0.5649
Heterog  14.6556  4 0.0055
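The example output can be checked with a short Python sketch. It assumes that each row is a likelihood-ratio (G) statistic for independence in a 3 × 3 table, that the first example table holds the controls and the second the cases, and that the heterogeneity value is the sum of the two group statistics minus the pooled one. Under these assumptions the results come out close to the values listed above, but the actual Vpairs implementation may differ in detail. All function names are illustrative.

```python
import math

def g_statistic(table):
    """Likelihood-ratio chi-square (G) for independence in an r x c table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:                      # empty cells contribute 0
                expected = row_sums[i] * col_sums[j] / total
                g += observed * math.log(observed / expected)
    return 2.0 * g

def chi2_sf_4df(x):
    """Upper-tail probability of chi-square with 4 df (closed form)."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

# Example tables from the text (assumed order: controls, then cases)
controls = [[14, 9, 14], [19, 55, 29], [13, 31, 16]]
cases = [[8, 31, 6], [17, 50, 31], [10, 30, 17]]
pooled = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(controls, cases)]

g_controls = g_statistic(controls)
g_cases = g_statistic(cases)
g_both = g_statistic(pooled)
g_heterog = g_controls + g_cases - g_both         # 4 + 4 - 4 = 4 df

for label, g in [("CONTROLS", g_controls), ("CASES", g_cases),
                 ("BOTH", g_both), ("Heterog", g_heterog)]:
    print(f"{label:8s} {g:8.4f}  4  {chi2_sf_4df(g):.4f}")
```

The closed-form tail probability is valid only for 4 df; a general chi-square distribution function would be needed for tables of other sizes.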

Each of the M variant pairs furnishes such a chi-square result. The Vpairs program reads a number of input parameters from the command line (just type Vpairs to see what parameters need to be furnished), one of which is the number of threads the program should use. It will then assign an equal number of variant pairs to each thread for analysis.
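One way to picture the division of work is sketched below in Python. It assumes that all unordered pairs of m variants are enumerated, so that M = m(m - 1)/2, and that consecutive blocks of pairs go to each thread. The actual partitioning scheme used by Vpairs is not described here, and the function name is hypothetical.

```python
from itertools import combinations

def split_pairs(n_variants, n_threads):
    """Split all variant pairs into n_threads blocks of near-equal size."""
    pairs = list(combinations(range(n_variants), 2))   # M = m(m - 1)/2 pairs
    base, extra = divmod(len(pairs), n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        # The first `extra` threads receive one additional pair
        size = base + (1 if t < extra else 0)
        chunks.append(pairs[start:start + size])
        start += size
    return chunks

chunks = split_pairs(n_variants=6, n_threads=4)
sizes = [len(chunk) for chunk in chunks]
print(sizes)
```

When M is not a multiple of the number of threads, block sizes can differ by at most one pair in this sketch.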

Gpairs program. For two variants, this program will work on each of the 9 genotype pairs. For each genotype pair (pattern), X, the number of cases and controls with and without the pattern will be determined, which leads to a table as shown below, where a, b, c, and d refer to numbers of individuals:

Phenotype       |  X present  X absent
Cases, Y = 2    |      a         b
Controls, Y = 1 |      c         d
                |     N_X      N_noX

The numbers a, b, c, and d form a 2 × 2 table, for which a chi-square or Fisher’s exact test is computed. This is done for each pattern, X, that passes some filters; that is, a pattern may be skipped when:

1. its two variants are on the same chromosome.
2. an expected number in the 2 × 2 table falls below a threshold like 1 (only applies to chi-square test).
3. the support, a + c, is smaller than a threshold like 10.
4. the confidence, a/(a + c), lies between a threshold like 95% and its complement, 5%.

Type Gpairs to see what parameters to enter on the command line. Specifying a high minimum confidence of, say, 95% will select patterns with a/(a + c) > 0.95 or < 0.05, that is, patterns seen almost exclusively in cases or almost exclusively in controls. To disable this filter, set the minimum confidence to zero.
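The filters and the test for a single pattern might be sketched in Python as follows. This is an illustration under assumptions, not the Gpairs code: the function and parameter names are invented, the Pearson chi-square is used without continuity correction, and Fisher's exact test is omitted.

```python
import math

def pattern_test(a, b, c, d, same_chromosome=False,
                 min_expected=1.0, min_support=10, min_conf=0.95):
    """Chi-square test (1 df) for one pattern, or None if it is filtered out.

    a, b = cases with / without the pattern; c, d = controls with / without.
    """
    if same_chromosome:                            # filter 1
        return None
    n = a + b + c + d
    support = a + c
    if support == 0 or support < min_support:      # filter 3
        return None
    confidence = a / support
    # Filter 4 (two-sided): keep only confidence > min_conf or < 1 - min_conf.
    # With min_conf = 0 this condition is never true, so nothing is skipped.
    if 1.0 - min_conf <= confidence <= min_conf:
        return None
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    smallest_expected = min(rows) * min(cols) / n
    if smallest_expected == 0 or smallest_expected < min_expected:  # filter 2
        return None
    chi2 = n * (a * d - b * c) ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])
    p = math.erfc(math.sqrt(chi2 / 2.0))           # upper tail, 1 df
    # Y = 2 if the pattern is more frequent in cases, otherwise 1
    y = 2 if a * rows[1] > c * rows[0] else 1
    return {"chi2": chi2, "p": p, "support": support,
            "confidence": confidence, "Y": y}

# Invented counts: 19 of 100 cases and 0 of 100 controls carry the pattern
result = pattern_test(19, 81, 0, 100)
print(result)
```

Checking the cheap filters first avoids computing a test statistic for patterns that would be discarded anyway.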

Statistical testing. Due to the large computational burden, there is currently no provision for permutation testing. Instead, Bonferroni correction is applied to obtain p-values corrected for multiple testing. Tests are two-sided, that is, patterns that are either more frequent or less frequent in cases than in controls are of interest. The Y field of the output shows which of these two possibilities applies (Y = 2 or 1, respectively).
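Bonferroni correction itself is a one-line computation, sketched here in Python. Which number of tests GPM uses as the multiplier is not stated above, so n_tests is left as a parameter; the function name and the p-values are invented.

```python
def bonferroni(p_values, n_tests=None):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * m) for p in p_values]

# Invented p-values, corrected for an assumed 1000 tests
adjusted = bonferroni([1e-8, 0.002, 0.4], n_tests=1000)
print(adjusted)
```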

GPMsort program. Sometimes an output file is too large to be read into Excel or another spreadsheet program such as LibreOffice Calc. GPMsort will then read the output file, sort the lines by decreasing chi-square, and output as many of the lines with the largest chi-square values as desired.
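The idea behind such a selection can be sketched in Python with a bounded heap, which keeps only the best lines in memory. The layout of the GPM output file is not described here, so the whitespace-separated toy lines and the chi-square column position below are purely hypothetical, as is the function name.

```python
import heapq

def top_lines(lines, n_keep, chisq_col):
    """Return the n_keep lines with the largest chi-square, largest first.

    chisq_col is the zero-based position of the chi-square value in a
    whitespace-separated line (an assumed layout, not the GPM format).
    """
    def chisq(line):
        return float(line.split()[chisq_col])
    return heapq.nlargest(n_keep, lines, key=chisq)

# Invented example lines: two variant names, a pattern, a chi-square value
lines = ["rs1 rs2 21 4.2",
         "rs3 rs4 13 25.1",
         "rs5 rs6 33 9.8",
         "rs7 rs8 12 17.0"]
best = top_lines(lines, n_keep=2, chisq_col=3)
print(best)
```

Because heapq.nlargest accepts any iterable, an open file object could be passed in place of the list, so a very large file would not have to be loaded in full.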


References

1. R. J. Klein, C. Zeiss, E. Y. Chew, J. Y. Tsai, R. S. Sackler, C. Haynes, A. K. Henning, J. P. SanGiovanni, S. M. Mane, S. T. Mayne, M. B. Bracken, F. L. Ferris, J. Ott, C. Barnstable and J. Hoh: Complement factor H polymorphism in age-related macular degeneration. Science, 308(5720), 385-9 (2005) doi: 10.1126/science.1109557

2. A. Dewan, M. Liu, S. Hartman, S. S. Zhang, D. T. Liu, C. Zhao, P. O. Tam, W. M. Chan, D. S. Lam, M. Snyder, C. Barnstable, C. P. Pang and J. Hoh: HTRA1 promoter polymorphism in wet age-related macular degeneration. Science, 314(5801), 989-92 (2006) doi: 10.1126/science.1133807

3. R. Agrawal, T. Imielinski and A. Swami: Mining association rules between sets of items in large databases. In: ACM SIGMOD Conference on Management of Data. Washington DC (1993) doi: 10.1145/170035.170072

4. R. Agrawal and R. Srikant: Fast algorithms for mining association rules. In: Proceedings of the 20th VLDB Conference, Santiago, Chile (1994)

5. A. Okazaki, S. Horpaopan, Q. Zhang, M. Randesi and J. Ott: Genotype pattern mining for pairs of interacting variants underlying digenic traits. Genes, 12(8), 1160 (2021) doi: 10.3390/genes12081160

6. A. Okazaki and J. Ott: Machine learning approaches to explore digenic inheritance. Trends Genet (2022) doi: 10.1016/j.tig.2022.04.009

7. J. Ott and T. Park: Overview of frequent pattern mining. Genomics Inform, 20(4), e39 (2022) doi: 10.5808/gi.22074

8. Q. Zhang, Q. Long and J. Ott: AprioriGWAS, a new pattern mining strategy for detecting genetic variants associated with disease through interaction effects. PLoS Comput Biol, 10(6), e1003627 (2014) doi: 10.1371/journal.pcbi.1003627

9. C. C. Chang, C. C. Chow, L. C. Tellier, S. Vattikuti, S. M. Purcell and J. J. Lee: Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4, 7 (2015) doi: 10.1186/s13742-015-0047-8

10. Q. Zhang et al: A multi-threaded approach to genotype pattern mining for detecting digenic disease genes (in preparation)