The Rockefeller University | 1230 York Avenue, New York, NY 10065 – 30 March 2021

Laboratory of Statistical Genetics
Jurg Ott, PhD, Director


Course on Genotype Pattern Mining for Human Digenic Traits

Date: January 17-21, 2022
Location: Rockefeller University, New York

Description

The course is currently being planned, and details will follow. If you are interested in attending, please send me an email so I can add you to our attendance list.

We previously developed an approach to harness Frequent Pattern Mining for assessing the combined effects on disease of two DNA variants [15]. We updated this approach with a modern FPM engine and implemented it in a computer program, GPM [16], for Genotype Pattern Mining.

In this course, we will focus largely on detecting the combined effects of two DNA variants on disease. Various examples of the joint action of two variants (digenic inheritance) have been published [1;2]. There has been much debate about the definition of epistasis [3-5]. Here, as outlined below, we are simply concerned with diplotypes that differ in frequency between cases and controls, where we consider diplotypes of length 2, that is, genotype patterns consisting of two genotypes, each from a different variant. Combined effects of two variants may be assessed in 3 × 3 tables of genotypes, with rows corresponding to genotypes at one variant and columns to genotypes at the other variant; one such table refers to cases and another to controls. The well-known Multifactor Dimensionality Reduction (MDR) method applies a specific machine-learning approach to such tables [6-10], while our approach, GPM, uses a general FPM algorithm adapted to find frequent genotype patterns in cases versus controls.
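As a minimal illustration of the 3 × 3 genotype tables described above, the following Python sketch tabulates genotype patterns separately for cases and controls. The function name genotype_table, the 0/1/2 genotype coding, and the toy data are assumptions made for this example only; they are not taken from the GPM program.

```python
def genotype_table(genotypes_a, genotypes_b):
    """Return a 3 x 3 table of counts.

    Entry [i][j] is the number of individuals with genotype i at the
    first variant and genotype j at the second variant. Genotypes are
    assumed to be coded 0, 1, 2 (hypothetical coding for this sketch).
    """
    table = [[0] * 3 for _ in range(3)]
    for a, b in zip(genotypes_a, genotypes_b):
        table[a][b] += 1
    return table


# Hypothetical toy data: six cases and six controls, two variants each.
cases_a = [0, 1, 2, 1, 1, 2]
cases_b = [2, 1, 2, 1, 0, 2]
controls_a = [0, 0, 1, 2, 0, 1]
controls_b = [0, 1, 0, 0, 2, 1]

case_table = genotype_table(cases_a, cases_b)
control_table = genotype_table(controls_a, controls_b)

print("cases:")
for row in case_table:
    print(row)
print("controls:")
for row in control_table:
    print(row)
```

Each of the nine cells corresponds to one genotype pattern, so comparing a cell between the two tables compares the frequency of that pattern in cases and in controls.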

The first FPM approach, the Apriori algorithm [11], was developed to handle the ever-increasing databases of consumer transactions. It was of interest to learn what consumers tend to buy together so that predictions (so-called association rules) could be made, for example, how likely is a consumer to buy wine when they buy bread and cheese? In our implementation of FPM methods, we focus on individual genotype patterns, that is, sets of two genotypes (diplotypes), one each from a different genomic location (possibly from different genes). A 3 × 3 table of genotypes exhibits nine genotype patterns. The specific FPM algorithm used is fpgrowth [12], and we built code around it so that it works in a case-control setting [16]. The whole approach is embedded in a straightforward permutation framework [13;14]. As applied to case-control studies, our implementation predicts, based on the presence of specific genotype patterns, whether an individual is likely to be or become a case. These methods are particularly important in situations where single variants show little or no disease association, in which case it would be very difficult or downright impossible to detect digenic genotype patterns associated with disease by standard statistical methods.
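The general idea of a permutation framework for genotype patterns can be sketched in Python as follows. This is only an illustration under stated assumptions: the test statistic (the largest absolute difference in pattern frequency between cases and controls), the function names, and the toy data are hypothetical choices for this example. It does not use fpgrowth and is not the statistic or code of the GPM program.

```python
import random
from collections import Counter


def max_pattern_difference(patterns, labels):
    """Largest absolute difference in pattern frequency, cases vs. controls.

    patterns: one genotype pattern per individual, e.g. the tuple (2, 2).
    labels:   1 for a case, 0 for a control.
    """
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    case_counts = Counter(p for p, y in zip(patterns, labels) if y == 1)
    control_counts = Counter(p for p, y in zip(patterns, labels) if y == 0)
    return max(
        abs(case_counts[p] / n_cases - control_counts[p] / n_controls)
        for p in set(patterns)
    )


def permutation_p_value(patterns, labels, n_perm=999, seed=1):
    """Estimate a p-value by shuffling case/control labels."""
    rng = random.Random(seed)
    observed = max_pattern_difference(patterns, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # A small tolerance guards against floating-point ties.
        if max_pattern_difference(patterns, shuffled) >= observed - 1e-12:
            hits += 1
    # Count the observed data as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 10 cases followed by 10 controls.
patterns = [(2, 2)] * 6 + [(0, 0)] * 4 + [(2, 2)] * 1 + [(0, 0)] * 9
labels = [1] * 10 + [0] * 10

observed, p_value = permutation_p_value(patterns, labels)
print("observed statistic:", observed)
print("permutation p-value:", p_value)
```

Because the maximum is taken over all patterns within each permutation, the resulting p-value already accounts for having examined several patterns at once, which is one reason permutation frameworks are convenient in this setting.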

  1. Ming, J.E., and Muenke, M. (2002). Multiple hits during early embryonic development: digenic diseases and holoprosencephaly. Am J Hum Genet 71, 1017-1032.

  2. Schaffer, A.A. (2013). Digenic inheritance in medical genetics. J Med Genet 50, 641-652.

  3. Frankel, W.N., and Schork, N.J. (1996). Who's afraid of epistasis? Nat Genet 14, 371-373.

  4. Wang, X., Elston, R.C., and Zhu, X. (2010). Statistical interaction in human genetics: how should we model it if we are looking for biological interaction? Nat Rev Genet 12, 74.

  5. Wang, X., Elston, R.C., and Zhu, X. (2010). The meaning of interaction. Hum Hered 70, 269-277.

  6. Gola, D., Mahachie John, J.M., van Steen, K., and Konig, I.R. (2016). A roadmap to multifactor dimensionality reduction methods. Briefings in Bioinformatics 17, 293-308.

  7. Moore, J.H., and Andrews, P.C. (2015). Epistasis Analysis Using Multifactor Dimensionality Reduction. In Epistasis: Methods and Protocols, J.H. Moore and S.M. Williams, eds. (New York, NY, Springer New York), pp 301-314.

  8. Moore, J.H., and Hahn, L.W. (2002). A cellular automata approach to detecting interactions among single-nucleotide polymorphisms in complex multifactorial diseases. Pac Symp Biocomput, 53-64.

  9. Ritchie, M.D., Hahn, L.W., and Moore, J.H. (2003). Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol 24, 150-157.

  10. Ritchie, M.D., Hahn, L.W., Roodi, N., Bailey, L.R., Dupont, W.D., Parl, F.F., and Moore, J.H. (2001). Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69, 138-147.

  11. Agrawal, R., and Srikant, R. (1994). Fast algorithms for mining association rules. In: 20th VLDB Conference. (Santiago, Chile, Proceedings of the 20th VLDB Conference), pp 487-499.

  12. Borgelt, C. (2005). An implementation of the FP-growth algorithm. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations. (Chicago, Illinois, Association for Computing Machinery), pp 1–5.

  13. Manly, B.F.J. (2007). Randomization, bootstrap, and Monte Carlo methods in biology. (Boca Raton, FL: Chapman & Hall/CRC).

  14. Manly, B.F.J., and Navarro Alberto, J.A. (2021). Randomization, bootstrap and Monte Carlo methods in biology. In: Chapman & Hall/CRC Texts in Statistical Science. (Boca Raton, Taylor & Francis), online resource.

  15. Zhang, Q., Long, Q., and Ott, J. (2014). AprioriGWAS, a new pattern mining strategy for detecting genetic variants associated with disease through interaction effects. PLoS Comput Biol 10, e1003627. doi: 10.1371/journal.pcbi.1003627. PubMed PMID: 24901472; PubMed Central PMCID: PMC4046917.

  16. Ott, J. (2021). Frequent Pattern Mining of Genotypes Underlying Digenic Traits. 8th International Conference on Big Data Analysis and Data Mining, August 9-10, 2021, Zurich (abstract accepted).