Jurg Ott - 29 March 2023
This document describes a program package, GPM, for finding pairs of genotypes or DNA variants associated with a genetic trait (a so-called digenic disease), where each genotype or variant by itself may not be disease associated. GPM consists of two algorithms, Gpairs and Vpairs, described below. They are available for Windows and Linux (Kubuntu), but the programs perform much better in Linux than in Windows. Two published case-control datasets in PLINK format [9] may serve as sample data: 1) age-related macular degeneration (AMD) [1] and 2) AMD data collected in Hong Kong (HK) [2]. The programs must be run in a command box (terminal).
Frequent Pattern Mining
(FPM) methods can rapidly pick frequent patterns from large databases
of items. The first such approach was implemented in the Apriori
program [3, 4], which was developed to mine consumer data. In
addition to finding frequent items commonly purchased by a consumer, Apriori
also furnishes association rules, that is, estimates of
P(Y|X), where Y and X are purchased goods like bread, milk, and wine.
Such association rules can predict how likely it is that someone
buying X will also buy Y. Further information may be found in our
recent reviews [5-7].
Here we apply FPM principles to
genetic case-control studies with a moderate number of DNA variants
and individuals. Specifically, we consider pairs of genotypes
(genotype patterns) with one genotype each from two variants.
Genotypes are labeled 1, 2, and 3 for AA, AB, and BB, respectively,
and the label 0 (zero) stands for “missing”. For example, a
genotype pattern might be X = AB–AA = 2–1. Phenotypes are labeled 2
for cases and 1 for controls, and we want to find association rules,
P(Y = 2|X), that is, estimates for the conditional
probability (based on the proportion) of being a case among all
individuals with genotype pattern X. In FPM terminology [3, 4],
P(Y|X) is referred to as confidence, and P(X) is the support
for X. In statistics, confidence is known as the (positive) predictive value.
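The definitions of support and confidence can be illustrated with a minimal Python sketch. It is not part of GPM; the function name and the toy data are hypothetical, and support is reported as a count of individuals (dividing by the sample size gives the proportion P(X)).

```python
def support_and_confidence(geno1, geno2, pheno, pattern):
    """Return (support count, confidence) for one genotype pattern.

    geno1, geno2: genotype codes per individual (0 = missing, 1 = AA, 2 = AB, 3 = BB)
    pheno:        phenotype per individual (2 = case, 1 = control)
    pattern:      tuple such as (2, 1) for the pattern AB-AA
    """
    # Phenotypes of all individuals who carry the pattern
    carriers = [y for g1, g2, y in zip(geno1, geno2, pheno) if (g1, g2) == pattern]
    support = len(carriers)
    n_cases = sum(1 for y in carriers if y == 2)
    # Confidence is the proportion of cases among carriers, an estimate of P(Y = 2 | X)
    confidence = n_cases / support if support else float("nan")
    return support, confidence


# Toy data for eight individuals (made up for illustration only)
geno1 = [2, 2, 1, 2, 3, 2, 0, 2]
geno2 = [1, 1, 1, 1, 2, 3, 1, 1]
pheno = [2, 2, 1, 1, 2, 1, 2, 2]

support, confidence = support_and_confidence(geno1, geno2, pheno, (2, 1))
print(support, confidence)  # 4 carriers, 3 of them cases -> 4 0.75
```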
In a previous publication [8], we applied the
Apriori algorithm to find interaction (epistasis) effects of
variants in case-control data. Here, however, the focus is simply on
frequencies of genotype patterns and whether they are different in
cases and controls [10] while individual variants may not show such
differences. Other approaches to find epistatic variants exist [5].
Vpairs program. For each of cases and controls, a 3 × 3 table of genotype pairs is set up as shown in the example below.
CONTROLS
       |      SNP 2
 SNP 1 |   1    2    3
-------+---------------
     1 |  14    9   14
     2 |  19   55   29
     3 |  13   31   16
-----------------------

CASES
       |      SNP 2
 SNP 1 |   1    2    3
-------+---------------
     1 |   8   31    6
     2 |  17   50   31
     3 |  10   30   17
-----------------------
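Then a likelihood ratio test of independence between the two genotypes is carried out in controls, in cases, and in the pooled table (BOTH). The heterogeneity chi-square, CONTROLS + CASES - BOTH, tests for differences in interaction between cases and controls and has 4 df, as shown below for the example data.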
Heterogeneity analysis, genotype tests
            chi-sq   df        p
CONTROLS   11.0769    4   0.0257
CASES       6.5365    4   0.1625
BOTH        2.9578    4   0.5649
Heterog    14.6556    4   0.0055
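The chi-square values in this example can be reproduced with the following Python sketch. It computes a likelihood ratio (G) statistic for independence in each 3 × 3 table and in the pooled table, and takes the heterogeneity statistic as CONTROLS + CASES - BOTH. The function names are hypothetical and the code is only an illustration of the arithmetic, not the Vpairs source.

```python
import math


def g_statistic(table):
    """Likelihood ratio chi-square (G) for independence in an r x c table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # an empty cell contributes nothing to G
                expected = row_sums[i] * col_sums[j] / total
                g += observed * math.log(observed / expected)
    return 2.0 * g


def chi2_sf_4df(x):
    """Upper-tail probability of a chi-square variable with 4 df (closed form)."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


controls = [[14, 9, 14], [19, 55, 29], [13, 31, 16]]
cases = [[8, 31, 6], [17, 50, 31], [10, 30, 17]]
# Pooled table: cell-wise sum of controls and cases
both = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(controls, cases)]

g_controls = g_statistic(controls)
g_cases = g_statistic(cases)
g_both = g_statistic(both)
g_heterog = g_controls + g_cases - g_both  # df = 4 + 4 - 4 = 4

for label, g in [("CONTROLS", g_controls), ("CASES", g_cases),
                 ("BOTH", g_both), ("Heterog", g_heterog)]:
    print(f"{label:<10}{g:9.4f}   4   {chi2_sf_4df(g):.4f}")
```

The closed-form tail probability is valid only for 4 df; a general implementation would use a statistics library instead.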
Each of the M variant pairs furnishes such a chi-square result. The Vpairs program reads a number of input parameters from the command line (just type Vpairs to see what parameters need to be furnished), one of which is the number of threads the program should use. It will then assign an equal number of variant pairs to each thread for analysis.
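How variant pairs might be divided among threads is sketched below in Python. This is an assumption-laden illustration, not the Vpairs implementation: it assumes that all M = n(n - 1)/2 pairs of n variants are analyzed and that each thread receives one contiguous block, with block sizes differing by at most one when M is not divisible by the number of threads. The function name split_pairs is hypothetical.

```python
from itertools import combinations


def split_pairs(n_variants, n_threads):
    """Divide all variant pairs into n_threads blocks of (nearly) equal size."""
    pairs = list(combinations(range(n_variants), 2))  # (0, 1), (0, 2), ...
    base, extra = divmod(len(pairs), n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        # The first `extra` threads take one additional pair each
        size = base + (1 if t < extra else 0)
        chunks.append(pairs[start:start + size])
        start += size
    return chunks


chunks = split_pairs(6, 4)  # 6 variants give 15 pairs, shared by 4 threads
print([len(chunk) for chunk in chunks])  # [4, 4, 4, 3]
```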
Gpairs program. For two variants, this program will work on each of the 9 genotype pairs. For each genotype pair (pattern), X, the number of cases and controls with and without the pattern will be determined, which leads to a table as shown below, where a, b, c, and d refer to numbers of individuals:
--------------------------------------
Phenotype       | X present   X absent
----------------+---------------------
Cases, Y = 2    |     a           b
Controls, Y = 1 |     c           d
----------------+---------------------
                |    N_X        N_noX
--------------------------------------
The numbers a, b, c, and d form a 2 × 2
table, for which chi-square or Fisher’s exact test is computed.
This is done for each pattern, X, passing some filters, that is, a pattern may be skipped when:
1. its two variants are on the same chromosome.
2. an expected number in the 2 × 2 table falls below a threshold like 1 (only applies to chi-square test).
3. the support, a + c, is smaller than a threshold like 10.
4. the confidence, a/(a + c), is smaller than a threshold like 95%.
Type Gpairs to see what parameters to enter on the command line. Specifying a high minimum confidence of, say, 95% will select only patterns with a/(a + c) > 0.95 or < 0.05. To disable the confidence filter, set the minimum confidence equal to zero.
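The 2 × 2 test and the four filters can be sketched in Python as follows. The function names, argument names, and default thresholds are hypothetical (the defaults simply mirror the example values 1, 10, and 95% given above), the Pearson chi-square stands in for the chi-square test, and the order in which Gpairs applies its filters is not specified here.

```python
def expected_counts(a, b, c, d):
    """Expected counts under independence, in the cell order a, b, c, d."""
    n = a + b + c + d
    return [r * k / n for r in (a + b, c + d) for k in (a + c, b + d)]


def pearson_chi2(a, b, c, d):
    """Pearson chi-square (1 df) for the 2 x 2 table of counts a, b, c, d."""
    cells = zip((a, b, c, d), expected_counts(a, b, c, d))
    return sum((o - e) ** 2 / e for o, e in cells if e > 0)


def keep_pattern(a, b, c, d, chrom1, chrom2,
                 min_expected=1.0, min_support=10, min_confidence=0.95):
    """Return True if a pattern passes all four filters described above."""
    if chrom1 == chrom2:                      # filter 1: same chromosome
        return False
    support = a + c
    if support == 0 or support < min_support:  # filter 3: support too small
        return False
    if min(expected_counts(a, b, c, d)) < min_expected:  # filter 2
        return False
    confidence = a / support
    # Filter 4, two-sided: keep patterns enriched in cases or in controls
    return confidence > min_confidence or confidence < 1.0 - min_confidence


# Hypothetical counts: pattern present in 20 cases and 0 controls
chi2 = pearson_chi2(20, 80, 0, 100)
print(round(chi2, 4), keep_pattern(20, 80, 0, 100, "1", "6"))
```

With min_confidence=0.0 the last condition is always true, which matches the advice to set the minimum confidence to zero when no confidence filtering is wanted.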
Statistical testing. Due to the large computational burden, there is currently no provision for permutation testing. Instead, Bonferroni correction is applied to obtain p-values corrected for multiple testing. Tests are carried out in a two-sided manner, that is, patterns that are either more frequent or less frequent in cases than in controls are of interest, and the Y field of the output shows which of these two possibilities applies (Y = 2 or 1, respectively).
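Bonferroni correction multiplies each p-value by the number of tests and caps the result at 1. The Python sketch below assumes that the number of tests equals the number of p-values supplied; how GPM counts tests (for example, all pairs versus only pairs passing the filters) is not stated here, and the function name is hypothetical.

```python
def bonferroni(p_values):
    """Return Bonferroni-corrected p-values, capped at 1."""
    m = len(p_values)  # assumed number of tests
    return [min(1.0, p * m) for p in p_values]


# Three p-values taken from the heterogeneity example, for illustration only
corrected = bonferroni([0.0055, 0.0257, 0.5649])
print([round(p, 4) for p in corrected])  # [0.0165, 0.0771, 1.0]
```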
GPMsort program. Sometimes an output file is so large that it cannot be read into Excel or another spreadsheet program such as LibreOffice Calc. GPMsort will then read the output file, sort lines by decreasing chi-square, and output as many lines with the largest chi-square values as desired.
1. R. J. Klein, C. Zeiss, E. Y. Chew, J. Y. Tsai, R. S. Sackler, C. Haynes, A. K. Henning, J. P. SanGiovanni, S. M. Mane, S. T. Mayne, M. B. Bracken, F. L. Ferris, J. Ott, C. Barnstable and J. Hoh: Complement factor H polymorphism in age-related macular degeneration. Science, 308(5720), 385-9 (2005) doi: 10.1126/science.1109557
2. A. Dewan, M. Liu, S. Hartman, S. S. Zhang, D. T. Liu, C. Zhao, P. O. Tam, W. M. Chan, D. S. Lam, M. Snyder, C. Barnstable, C. P. Pang and J. Hoh: HTRA1 promoter polymorphism in wet age-related macular degeneration. Science, 314(5801), 989-92 (2006) doi: 10.1126/science.1133807
3. R. Agrawal, T. Imielinski and A. Swami: Mining association rules between sets of items in large databases. In: ACM SIGMOD Conference on Management of Data. Washington DC (1993) doi: 10.1145/170035.170072
4. R. Agrawal and R. Srikant: Fast algorithms for mining association rules. In: Proceedings of the 20th VLDB Conference, Santiago, Chile (1994) URL: http://www.vldb.org/conf/1994/P487.PDF
5. A. Okazaki, S. Horpaopan, Q. Zhang, M. Randesi and J. Ott: Genotype pattern mining for pairs of interacting variants underlying digenic traits. Genes, 12(8), 1160 (2021) doi:10.3390/genes12081160
6. A. Okazaki and J. Ott: Machine learning approaches to explore digenic inheritance. Trends Genet (2022) doi:10.1016/j.tig.2022.04.009
7. J. Ott and T. Park: Overview of frequent pattern mining. Genomics Inform, 20(4), e39 (2022) doi:10.5808/gi.22074
8. Q. Zhang, Q. Long and J. Ott: AprioriGWAS, a new pattern mining strategy for detecting genetic variants associated with disease through interaction effects. PLoS Comput Biol, 10(6), e1003627 (2014) doi:10.1371/journal.pcbi.1003627
9. C. C. Chang, C. C. Chow, L. C. Tellier, S. Vattikuti, S. M. Purcell and J. J. Lee: Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4, 7 (2015) doi:10.1186/s13742-015-0047-8
10. Q. Zhang et al: A multi-threaded approach to genotype pattern mining for detecting digenic disease genes (in preparation)