
The Rockefeller University | 1230 York Avenue, New York, NY 10065 7 October 2021

Laboratory of Statistical Genetics

Jurg Ott, PhD, Director





Based on our recent paper describing methodology.


Aim: To detect pairs of genotypes from different genes that significantly discriminate between cases and controls while individual genotypes have little effect.

Course on Genotype Pattern Mining for Digenic Traits

Date: February 21-25, 2022

Location: Rockefeller University, New York; Weiss Building room 305

Frequent Pattern Mining

Various examples of the joint action of two variants (digenic inheritance) have been published [1;2]. There has been much debate about the definition of epistasis and how to detect it [3-5]. Here, as outlined below, we are simply concerned with diplotypes that have different frequencies in cases and controls. We consider diplotypes of length 2, that is, genotype patterns consisting of two genotypes, each from a different variant. Combined effects of two variants may be assessed in 3 × 3 tables of genotypes, with rows corresponding to genotypes at one variant and columns to genotypes at the other variant, one table for cases and another for controls. The well-known Multifactor Dimensionality Reduction (MDR) method applies a specific machine-learning approach to such tables [6-10], while our approach, Genotype Pattern Mining (GPM), uses a general-purpose frequent pattern mining (FPM) algorithm adapted to find genotype patterns whose frequencies differ between cases and controls.
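To make the table layout concrete, here is a minimal Python sketch that builds the two 3 × 3 genotype tables and lists the frequency difference for each of the nine genotype patterns. It is an illustration only, not the GPM program: the 0/1/2 genotype coding, the function names, and the toy data are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Assumption: genotypes are coded 0, 1, 2 (e.g. copies of the minor allele).
GENOTYPES = (0, 1, 2)

def genotype_table(g1, g2):
    """3 x 3 table of counts: rows = genotype at variant 1, columns = variant 2."""
    counts = Counter(zip(g1, g2))
    return [[counts[(a, b)] for b in GENOTYPES] for a in GENOTYPES]

def pattern_frequency_differences(case_table, ctrl_table):
    """Case frequency minus control frequency for each of the nine patterns."""
    n_case = sum(map(sum, case_table))
    n_ctrl = sum(map(sum, ctrl_table))
    return {
        (a, b): case_table[a][b] / n_case - ctrl_table[a][b] / n_ctrl
        for a, b in product(GENOTYPES, repeat=2)
    }

# Toy data invented for illustration: genotypes of 8 cases and 8 controls
# at two variants.
case_v1 = [2, 2, 1, 2, 0, 2, 1, 2]
case_v2 = [1, 1, 0, 1, 2, 1, 1, 1]
ctrl_v1 = [0, 1, 2, 0, 1, 0, 2, 1]
ctrl_v2 = [0, 1, 0, 2, 1, 0, 2, 1]

case_table = genotype_table(case_v1, case_v2)
ctrl_table = genotype_table(ctrl_v1, ctrl_v2)
diffs = pattern_frequency_differences(case_table, ctrl_table)

# Pattern with the largest absolute frequency difference.
best = max(diffs, key=lambda p: abs(diffs[p]))
print(best, diffs[best])
```

In this toy data set, neither variant is informative on its own in any designed way; the point is only that each cell of the two tables is one genotype pattern whose case and control frequencies can be compared directly.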

The first FPM approach, the Apriori algorithm [11], was developed to handle the ever-increasing databases of consumer transactions. It was of interest to learn what consumers tend to buy together so that predictions (so-called association rules) can be made, for example, how likely is a consumer to buy wine when they buy bread and cheese? In our implementation of FPM methods, we focus on individual genotype patterns, that is, sets of two genotypes (diplotypes), one each from a different genomic location (possibly from different genes). A 3 × 3 table of genotypes exhibits nine genotype patterns. The specific FPM algorithm used is fpgrowth [12], and we built code around it so that it works in a case-control setting [16]. The whole approach is embedded in a straightforward permutation framework [13;14]. As applied to case-control studies, our implementation develops predictions, based on the presence of specific genotype patterns, of whether an individual is likely to be or become a case. These methods are particularly important in situations where single variants show little or no disease association, in which case it would be very difficult or downright impossible to detect digenic genotype patterns associated with disease by standard statistical methods. We previously developed an approach to harness Frequent Pattern Mining for assessing the combined effects on disease of two DNA variants [15]. We recently updated this approach with a modern FPM engine and implemented it in a computer program, GPM, for Genotype Pattern Mining [16].
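The permutation idea can be sketched in a few lines of Python. This is a generic label-shuffling test under stated assumptions, not the procedure implemented in GPM: it does not call fpgrowth, the test statistic (largest absolute case-control frequency difference over all observed patterns) is chosen here for simplicity, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter

def max_pattern_difference(patterns, labels):
    """Largest absolute case-control frequency difference over observed patterns.

    patterns: one genotype pattern (a hashable tuple) per individual
    labels:   1 for a case, 0 for a control
    """
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    case_counts = Counter(p for p, y in zip(patterns, labels) if y == 1)
    ctrl_counts = Counter(p for p, y in zip(patterns, labels) if y == 0)
    return max(
        abs(case_counts[p] / n_case - ctrl_counts[p] / n_ctrl)
        for p in set(patterns)
    )

def permutation_p_value(patterns, labels, n_perm=1000, seed=0):
    """Shuffle case/control labels and count how often the statistic is at
    least as large as observed. Taking the maximum over patterns inside each
    permutation accounts for having tested many patterns at once."""
    rng = random.Random(seed)
    observed = max_pattern_difference(patterns, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_pattern_difference(patterns, shuffled) >= observed:
            exceed += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return observed, (exceed + 1) / (n_perm + 1)

# Toy data invented for illustration: (genotype at variant 1, genotype at variant 2).
case_patterns = [(2, 1), (2, 1), (1, 0), (2, 1), (0, 2), (2, 1), (1, 1), (2, 1)]
ctrl_patterns = [(0, 0), (1, 1), (2, 0), (0, 2), (1, 1), (0, 0), (2, 2), (1, 1)]
patterns = case_patterns + ctrl_patterns
labels = [1] * len(case_patterns) + [0] * len(ctrl_patterns)

observed, p_value = permutation_p_value(patterns, labels)
print(f"observed statistic = {observed:.3f}, permutation p = {p_value:.4f}")
```

Because genotypes stay attached to individuals and only the case/control labels are shuffled, each permuted data set keeps the same pattern counts overall, so the resulting p-value needs no distributional assumptions about the statistic.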

Course Description

In this course, we will largely focus on detecting combined effects of two DNA variants on disease. The course is being planned for in-person attendance. Should this turn out to be impossible, we may hold the course virtually by Zoom. Details will be forthcoming. If you are interested in attending, please send me an email so I can put you on our attendance list. Please note that you cannot enter the Rockefeller campus without being vaccinated against Covid-19.

Tentative Outline

The course will be taught by Profs. Jurg Ott (Rockefeller University, New York), Taesung Park (Seoul National University, Korea), and Atsuko Okazaki (Juntendo University, Tokyo, Japan), with a guest lecture by Prof. Suzanne Leal (Columbia University, New York). Prof. Park is Professor of Statistics at Seoul National University and has published in the area of pattern mining.

Course fees are $950 for academics and $1,900 for non-academics. An initial deposit of $100 will be required, refundable until December 31, 2021. Payment details will be provided shortly. At this point, no money is due, but you may want to reserve your spot on the participant list (first come, first served) by sending me an email. You will be notified when a deposit is due.

As in previous courses, there will be lectures followed by exercises. Most exercises can be done in Windows, but a small number of programs run only in Linux. We will provide accounts on our Linux servers. Course participants are expected to bring their own Windows laptops, perhaps with dual-boot installed so the laptop can be booted up in Windows or Linux (Kubuntu preferred). If you want to prepare for the course, a recently published review on FPM methods is very useful [17].

    Monday, Feb 21, 2022
  • Welcome
  • Statistical principles in hypothesis testing (J. Ott, lecture)
  • Principles of frequent pattern mining (FPM) or frequent itemset mining (FIM) (C. Borgelt, lecture)

    Tuesday, Feb 22, 2022
  • Identification of highly penetrant disease variants (S. Leal, lecture)
  • Implementations of FPM methods (C. Borgelt, lecture and exercises)

    Wednesday, Feb 23, 2022
  • FPM methods in genetics, permutation testing, plink program for genetic databases (J. Ott, lecture)
  • Implementation of fpgrowth for mining genotype patterns, GPM (J. Ott, lecture and exercises)
  • Disease-causing interactions between mitochondrial and nuclear genes (A. Okazaki, lecture)

    Thursday, Feb 24, 2022
  • Statistical evaluation of significance (p-values) and discovery (q-values, false discovery rates) (J. Ott, lecture)
  • GPM program for Linux (J. Ott, exercises)

    Friday, Feb 25, 2022
  • Linkage analysis independent of marker allele frequencies and LD between markers and disease using pseudomarker (J. Ott, lecture and exercises)
  • Data presented by course participants


References

  1. Ming, J.E., and Muenke, M. (2002). Multiple hits during early embryonic development: digenic diseases and holoprosencephaly. Am J Hum Genet 71, 1017-1032.
  2. Schaffer, A.A. (2013). Digenic inheritance in medical genetics. J Med Genet 50, 641-652.
  3. Frankel, W.N., and Schork, N.J. (1996). Who's afraid of epistasis? Nat Genet 14, 371-373.
  4. Wang, X., Elston, R.C., and Zhu, X. (2010). Statistical interaction in human genetics: how should we model it if we are looking for biological interaction? Nat Rev Genet 12, 74.
  5. Wang, X., Elston, R.C., and Zhu, X. (2010). The meaning of interaction. Hum Hered 70, 269-277.
  6. Gola, D., Mahachie John, J.M., van Steen, K., and Konig, I.R. (2016). A roadmap to multifactor dimensionality reduction methods. Briefings in Bioinformatics 17, 293-308.
  7. Moore, J.H., and Andrews, P.C. (2015). Epistasis Analysis Using Multifactor Dimensionality Reduction. In Epistasis: Methods and Protocols, J.H. Moore and S.M. Williams, eds. (New York, NY, Springer New York), pp 301-314.
  8. Moore, J.H., and Hahn, L.W. (2002). A cellular automata approach to detecting interactions among single-nucleotide polymorphisms in complex multifactorial diseases. Pac Symp Biocomput, 53-64.
  9. Ritchie, M.D., Hahn, L.W., and Moore, J.H. (2003). Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol 24, 150-157.
  10. Ritchie, M.D., Hahn, L.W., Roodi, N., Bailey, L.R., Dupont, W.D., Parl, F.F., and Moore, J.H. (2001). Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69, 138-147.
  11. Agrawal, R., and Srikant, R. (1994). Fast algorithms for mining association rules. In: 20th VLDB Conference. (Santiago, Chile, Proceedings of the 20th VLDB Conference), pp 487-499.
  12. Borgelt, C. (2005). An implementation of the FP-growth algorithm. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations. (Chicago, Illinois, Association for Computing Machinery), pp 1-5.
  13. Manly, B.F.J. (2007). Randomization, bootstrap, and Monte Carlo methods in biology. (Boca Raton, FL: Chapman & Hall/CRC).
  14. Manly, B.F.J., and Navarro Alberto, J.A. (2021). Randomization, bootstrap and Monte Carlo methods in biology. In: Chapman & Hall/CRC Texts in Statistical Science. (Boca Raton, Taylor & Francis), online resource.
  15. Zhang Q, Long Q, Ott J. AprioriGWAS, a new pattern mining strategy for detecting genetic variants associated with disease through interaction effects. PLoS Comput Biol. 2014;10(6):e1003627, doi: 10.1371/journal.pcbi.1003627. PubMed PMID: 24901472; PubMed Central PMCID: PMC4046917.
  16. Okazaki A, Horpaopan S, Zhang Q, Randesi M, Ott J. Genotype Pattern Mining for Pairs of Interacting Variants Underlying Digenic Traits. Genes 2021, 12, 1160, doi:10.3390/genes12081160.
  17. Chee C-H, Jaafar J, Aziz IA, Hasan MH, Yeoh W. Algorithms for frequent itemset mining: a literature review. Artificial Intelligence Review 2019, 52, 2603-2621, doi:10.1007/s10462-018-9629-z.