The Rockefeller University | 1230 York Avenue, New York, NY 10065 – 29 July 2021
Course on Genotype Pattern Mining for Human Digenic Traits
Date: January 17-21, 2022
Location: Rockefeller University, New York
Frequent Pattern Mining
Various examples of the joint actions of two variants (digenic inheritance) have been published [1;2]. There has been much debate about the definition of epistasis and how to detect it [3-5]. Here, as outlined below, we are simply concerned with diplotypes having different frequencies in cases and controls, where we consider diplotypes of length 2, that is, genotype patterns consisting of two genotypes, each from a different variant. Combined effects of two variants may be assessed in 3 × 3 tables of genotypes, with rows corresponding to genotypes at one variant and columns to genotypes at another variant, one table for cases and another for controls. The well-known Multifactor Dimensionality Reduction (MDR) method applies a specific machine-learning approach to such tables [6-10], while our approach, GPM, uses a general FPM algorithm adapted to find genotype patterns that are frequent in cases versus controls.
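As a rough illustration of the 3 × 3 tables described above, the following Python sketch tabulates genotype pairs separately for cases and controls. The data, the 0/1/2 genotype coding, and the function name genotype_table are assumptions made for this example; none of it is taken from the GPM program.

```python
def genotype_table(g1, g2):
    """Return a 3x3 table of counts.

    Rows index the genotype at the first variant, columns the genotype
    at the second variant. Genotypes are assumed to be coded 0, 1, 2.
    """
    table = [[0] * 3 for _ in range(3)]
    for a, b in zip(g1, g2):
        table[a][b] += 1
    return table

# Toy data, invented for illustration only
cases_v1 = [0, 1, 2, 1, 1, 2]
cases_v2 = [2, 1, 2, 1, 0, 2]
controls_v1 = [0, 0, 1, 1, 2, 0]
controls_v2 = [0, 1, 0, 1, 0, 2]

case_table = genotype_table(cases_v1, cases_v2)
control_table = genotype_table(controls_v1, controls_v2)

# Each of the nine cells is one genotype pattern (diplotype of length 2)
for row_cases, row_controls in zip(case_table, control_table):
    print(row_cases, row_controls)
```

Each cell of the two tables corresponds to one of the nine genotype patterns, so comparing a cell between the case table and the control table shows whether that pattern differs in frequency between the two groups.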
The first FPM approach, the Apriori algorithm, was developed to handle the ever-increasing databases of consumer transactions. It was of interest to learn what consumers tend to buy together so that predictions (so-called association rules) can be made, for example, how likely is a consumer to buy wine when they buy bread and cheese? In our implementation of FPM methods, we focus on individual genotype patterns, that is, sets of two genotypes (diplotypes), one each from a different genomic location (possibly from different genes). A 3 × 3 table of genotypes exhibits nine genotype patterns. The specific FPM algorithm used is fpgrowth, and we built code around it so it works in a case-control setting. The whole approach is embedded in a straightforward permutation framework [13;14]. As applied to case-control studies, our implementation predicts, based on the presence of specific genotype patterns, whether an individual is likely to be or become a case. These methods are particularly important in situations where single variants show little or no disease association, in which case it would be very difficult or downright impossible to detect digenic genotype patterns associated with disease by standard statistical methods. We previously developed an approach to harness Frequent Pattern Mining for assessing the combined effects on disease of two DNA variants and recently updated this approach with a modern FPM engine and implemented it in a computer program, GPM, for Genotype Pattern Mining.
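To give a feel for the permutation framework mentioned above, here is a minimal Python sketch that permutes case/control labels and recomputes a simple test statistic. The statistic (the largest absolute difference in pattern frequency between cases and controls), the toy data, and the function names are assumptions chosen for illustration; this is not the statistic or code used in GPM, and it does not call fpgrowth.

```python
import random
from collections import Counter

def max_pattern_difference(patterns, labels):
    """Largest absolute difference in pattern frequency, cases (1) vs. controls (0)."""
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    case_counts = Counter(p for p, y in zip(patterns, labels) if y == 1)
    control_counts = Counter(p for p, y in zip(patterns, labels) if y == 0)
    return max(
        abs(case_counts[p] / n_cases - control_counts[p] / n_controls)
        for p in set(patterns)
    )

def permutation_p_value(patterns, labels, n_perm=1000, seed=1):
    """Empirical p-value obtained by shuffling the case/control labels."""
    rng = random.Random(seed)
    observed = max_pattern_difference(patterns, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any link between pattern and status
        if max_pattern_difference(patterns, shuffled) >= observed:
            exceed += 1
    # Counting the observed data as one permutation keeps the p-value above 0
    return (exceed + 1) / (n_perm + 1)

# Toy data, invented for illustration: one genotype pattern per individual
patterns = [(2, 2)] * 8 + [(0, 1)] * 2 + [(2, 2)] * 2 + [(0, 1)] * 8
labels = [1] * 10 + [0] * 10

p = permutation_p_value(patterns, labels)
print(p)
```

Because the maximum is taken over all patterns within each permutation, the resulting p-value already accounts for having tested several patterns at once, which is one common reason for using permutations in this kind of analysis.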
In this course, we will largely focus on detecting combined effects of two DNA variants on disease. The course is being planned for in-person attendance. Should this turn out to be impossible, we will hold the course virtually by Zoom. Details will be forthcoming. If you are interested in attending, please send me email so I can put you on our attendance list.
The course will be taught by Profs. Jurg Ott (Rockefeller University, New York), Christian Borgelt (University of Salzburg, Austria; https://borgelt.net/index.html), and Qingrun Zhang (University of Calgary, Canada; https://cumming.ucalgary.ca/departments/bmb/profiles/dr-qingrun-zhang), with a guest lecture by Prof. Suzanne Leal (Columbia University, New York; http://statgen.us/Suzanne_M_Leal_PhD). Prof. Borgelt is Professor for Data Science with a joint appointment at the Departments of Mathematics and of Computer Science.
Costs for the course are $950 for academics and $1,900 for non-academics. An initial deposit of $100 will be required, refundable until November 1, 2021. Payment details will be provided shortly. At this point, no money is due but you may want to reserve your spot on the participant list (first come first served) by sending me email. You will be notified when a deposit is due.
As in previous courses, there will be lectures followed by exercises. Most exercises can be done in Windows, but a small number of programs run only in Linux. We will provide accounts on our Linux servers. Course participants are expected to bring their own Windows laptops, perhaps set up for dual boot so the laptop can be booted up in Windows or Linux (Kubuntu preferred). If you want to prepare for the course, a recently published review on FPM methods is very useful.
Monday, Jan 17, 2022
Tuesday, Jan 18, 2022
Identification of highly penetrant disease variants (S. Leal, lecture)
Implementations of FPM methods (C. Borgelt, lecture and exercises)
Wednesday, Jan 19, 2022
FPM methods in genetics, permutation testing, plink program for genetic databases (J. Ott, lecture)
Thursday, Jan 20, 2022
Statistical evaluation of significance (p-values) and discovery (q-values, false discovery rates) (J. Ott, lecture)
GPM program for Linux (J. Ott, exercises)
Friday, Jan 21, 2022
analysis independent of marker allele frequencies and LD between markers and disease using pseudomarker (J. Ott, lecture and exercises)