Genotype Pattern Mining for Human Digenic Traits

Jurg Ott - 14 August 2021

This document describes a program package, GPM, for finding genotype patterns with frequencies different in cases and controls. This package is for Windows; the Linux version is discussed below.

Frequent Pattern Mining (FPM) methods can rapidly pick frequent patterns from large databases of items. The first such approach was implemented in the Apriori program [1,2], which was developed to mine consumer data. In addition to finding items commonly purchased by a consumer, Apriori also furnishes association rules, that is, estimates of P(Y|X), where Y and X are purchased goods like bread, milk, and wine. Such association rules can predict how likely it is that someone buying X will also buy Y.

Here we apply FPM principles to genetic case-control studies with a moderate number of variants and observations. Specifically, we consider pairs of genotypes (genotype patterns), with one genotype from each of two variants. Genotypes are labeled 0, 1, 2, and 3 for "missing", AA, AB, and BB, respectively. For example, a genotype pattern might be X = AB-AA = 2-1. Phenotypes are labeled 2 for cases and 1 for controls, and we want to find association rules, P(Y = 2|X), that is, estimates of the conditional probability (based on the proportion) of being a case among all individuals with genotype pattern X. In FPM terminology [1,2], P(Y|X) is referred to as confidence, and P(X) is the support for X.
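As an illustration of these two quantities, here is a minimal Python sketch. It is not part of GPM; the toy data and the function name support_and_confidence are made up for this example. Support is returned as a count of carriers; dividing it by the number of individuals gives the proportion P(X).

```python
# Hypothetical toy data: (genotype at variant 1, genotype at variant 2, phenotype).
# Genotypes: 0 = missing, 1 = AA, 2 = AB, 3 = BB; phenotype: 2 = case, 1 = control.
individuals = [
    (2, 1, 2), (2, 1, 2), (2, 1, 1), (1, 1, 1),
    (3, 2, 2), (2, 1, 2), (0, 1, 1), (1, 3, 1),
]

def support_and_confidence(data, pattern, phenotype=2):
    """Return (support count, confidence) for a genotype pattern.

    support    = number of individuals carrying the pattern
    confidence = proportion of those carriers with the given phenotype
    """
    carriers = [y for g1, g2, y in data if (g1, g2) == pattern]
    support = len(carriers)
    confidence = carriers.count(phenotype) / support if support else float("nan")
    return support, confidence

# Pattern X = AB-AA = 2-1
support, confidence = support_and_confidence(individuals, (2, 1))
print(support, confidence)
```

In this toy dataset, four individuals carry pattern 2-1 and three of them are cases, so the confidence for cases is 3/4.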

In a previous publication [8], we applied the Apriori algorithm to find interaction (epistasis) effects of variants in case-control data. Here, however, the focus is simply on frequencies of genotype patterns and whether they are different in cases and controls while individual variants may not show such differences. Other approaches to find epistatic variants exist, notably Multifactor Dimensionality Reduction (MDR) [3-4] and Conditional Search [6]. In our recent paper [9], properties of GPM and MDR are compared.

The GPM program introduced here makes use of the fpgrowth algorithm [5], a modern implementation of FPM, which users should download from Prof. Borgelt's webpage. GPM comes in a Windows and a Linux version; Windows is discussed first. As mentioned above, GPM is designed to find patterns of genotypes at different DNA variants with pattern frequencies different in cases and controls, whether or not individual variants show different genotype frequencies in cases and controls. Specifically, we consider patterns of length 2, that is, sets of two genotypes, one from each of two variants. The GPM program package contains the main program that runs fpgrowth as needed, and several helper programs useful for data preparation. Initially, the case-control data should be in plink format, that is, as two files, *.map and *.ped. Here is a description of the various analysis steps.

It is assumed that you invoke the program on the command line. That is, in Windows, you run a command window (type cmd in the search box at the bottom left in Windows 10).

Case-control data

We use the sample data that come with the MDR program [7], transformed into a plink dataset, sada.map and sada.ped (furnished in the program package). In practice, you may want to keep the missing genotype rate as low as possible, for example, with the plink filter --geno 0.05 or an even smaller threshold (the current dataset has no missing genotypes). The first step is to transform the genotype data into a format suitable for fpgrowth: calling the makeFPM program with sada[.ped] as input produces the sadaFPM.txt file.

The fpgrowth program will treat missing genotypes (coded 0) as an additional category. When reporting results furnished by fpgrowth, the GPM program will filter out all pairs of genotypes with one or both genotypes missing.

The next step is to prepare a parameter file for the GPM program. Included in the program package is an example, sada.param, with the following lines:

sada-2s5c70  {Line 1}
999 {Line 2}
c:\bin\ {Line 3}
2 5 70 {Line 4}
sada.map {Line 5}
sadaFPM.txt {Line 6}
10 {Line 7}

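Explanations of the various input lines:

  1. Prefix, string to appear in front of output files
  2. Number of permutations
  3. Name of folder containing fpgrowth program
  4. Three numbers -- 1) “2” refers to confidence for cases, “1” for controls; 2) min. number of individuals carrying a given pattern (support); 3) min. confidence as a percentage
  5. Map file name
  6. Input file to fpgrowth, *FPM.txt
  7. After how many permutations to write to report file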

Optionally, there can be an additional two numbers, kmin and kmax, referring to the respective minimum and maximum numbers of items in a pattern, including the phenotype. Default values are kmin = kmax = 3 (2 genotypes per pattern plus the phenotype).
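To make the layout of the parameter file concrete, here is a minimal Python sketch that reads the sample file shown above. It is not how GPM itself reads the file. It assumes that the trailing {Line n} annotations are comments to be ignored and that the optional kmin and kmax follow the three numbers on line 4. The dictionary keys are hypothetical names chosen for this example, and the meaning of each line is taken from the explanations given for the Linux parameter file.

```python
import re

SAMPLE = """\
sada-2s5c70  {Line 1}
999 {Line 2}
c:\\bin\\ {Line 3}
2 5 70 {Line 4}
sada.map {Line 5}
sadaFPM.txt {Line 6}
10 {Line 7}
"""

def parse_param(text):
    """Parse a GPM-style parameter file into a dictionary (illustrative only)."""
    # Drop a trailing {...} annotation (assumed to be a comment) and blank lines.
    values = [re.sub(r"\s*\{.*\}\s*$", "", ln).strip() for ln in text.splitlines()]
    values = [v for v in values if v]
    nums = [int(x) for x in values[3].split()]
    has_k = len(nums) >= 5  # optional kmin and kmax present?
    return {
        "prefix": values[0],             # prefix of output file names
        "n_perm": int(values[1]),        # number of permutations
        "fpgrowth_dir": values[2],       # folder containing fpgrowth
        "phenotype": nums[0],            # 2 = cases, 1 = controls
        "min_support": nums[1],          # min. number of carriers
        "min_confidence_pct": nums[2],   # min. confidence in percent
        "kmin": nums[3] if has_k else 3, # default: 2 genotypes + phenotype
        "kmax": nums[4] if has_k else 3,
        "map_file": values[4],
        "fpm_file": values[5],
        "report_every": int(values[6]),  # permutations between report writes
    }

params = parse_param(SAMPLE)
print(params)
```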

After you download the program package, move the following files into a folder containing programs (for example, C:\bin): fpgrowth.exe (from Dr. Borgelt's webpage), GPM.exe, makeFPM.exe, and pairSNPs.exe. Everything else should go into your work folder, for example, \work. Open a cmd window and change directory to \work, typing cd \work. To call the program, type its name followed by the name of the parameter file, for example, GPM sada.param. If you don't furnish a name for the parameter file on the command line, the program will assume that its name is GPM.param. If no parameter file is found, the program will issue an error message and stop.

Explanation of output

For a given pattern (genotype pair), X, identified by fpgrowth, consider the following 2 × 2 table of observed numbers, where a, b, c, and d refer to numbers of individuals:

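Phenotype        | X present   X absent | Total
-----------------+----------------------+------
Cases, Y = 2     |     a           b    | Ncase
Controls, Y = 1  |     c           d    | Nctrl
-----------------+----------------------+------
Total            |    N_X       N_notX  | N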

Table 1. The numbers a, b, c, and d form a 2 × 2 table, for which chi-square is computed. This is done for each pattern passing the requirements spelled out in the parameter file. Often, Fisher's exact test is applied to such 2 × 2 tables. However, while this is a good solution for 1-sided tests, we are interested in 2-sided tests, that is, we want to find patterns that are more frequent or less frequent in cases than in controls, and 2-sided applications of the Fisher test are not unique [10], with different approaches having different properties. It is for this reason that we apply a (2-sided) chi-square statistic.

The “X present” column refers to individuals carrying the given pattern, X. The proportion of cases is Ncase/N, and the proportion of cases among people with X is P(Y = 2 | X present), estimated by a/(a + c). Certain behaviors of the fpgrowth program can be changed. After downloading it, please run the sample data to verify that you obtain the same sada.obs.out file as furnished in this program package. If not, I will be happy to investigate the situation.
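The quantities defined for Table 1 can be computed with a few lines of Python. The sketch below is not GPM's own code. It assumes the likelihood ratio form of chi-square and lets cells with a zero count contribute nothing to the sum; the counts in the usage lines are made up.

```python
import math

def lr_chi_square(a, b, c, d):
    """Likelihood ratio chi-square (1 df) for the 2 x 2 table [[a, b], [c, d]].

    Cells with an observed count of zero contribute nothing (assumption).
    """
    n = a + b + c + d
    row_tot = (a + b, c + d)   # Ncase, Nctrl
    col_tot = (a + c, b + d)   # N_X, N_notX
    g = 0.0
    for i, row in enumerate(((a, b), (c, d))):
        for j, obs in enumerate(row):
            if obs > 0:
                expected = row_tot[i] * col_tot[j] / n
                g += obs * math.log(obs / expected)
    return 2.0 * g

def confidence(a, c):
    """Estimate of P(Y = 2 | X present): proportion of cases among carriers."""
    return a / (a + c)

# Hypothetical counts: 30 of 200 cases and 10 of 200 controls carry X.
a, b, c, d = 30, 170, 10, 190
print(confidence(a, c), lr_chi_square(a, b, c, d))
```

When the pattern is equally frequent in cases and controls, observed and expected counts coincide and the statistic is zero.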

With the given parameters, the fpgrowth program reports 9 patterns in the sada.obs.out file, which is best imported into a spreadsheet. For each of the 9 patterns, numbers a through d are shown as in Table 1, with associated chi-square values. Sorted by decreasing chi-square, the sada.obs.out.xlsx file shows one large chi-square followed by rather small ones.

For each pattern, a likelihood ratio chi-square is computed from the associated 2 × 2 table (Table 1), and the test statistic, Tobs, is the largest chi-square over all patterns. In each permutation, labels for cases and controls are randomly permuted and the whole process of pattern search and calculations is repeated, leading to a null maximum chi-square, T0. The p-value is then given as the proportion of null samples with T0 ≥ Tobs. Here, the observed data are counted among the null data so that the smallest possible p-value is 1/M, where M is the number of permutations including the observed data. Each observed and permuted dataset will contribute one line with a chi-square value to the output file name on line 6 of the parameter file. Two special values are as follows:

The sada-2s5c70.perm.out.xlsx spreadsheet shows chi-square for the observed data in the line for perm = 0; all remaining chi-squares, obtained from permuted data, are in the lines below and ordered by decreasing values. Column C demonstrates how to compute estimated p-values for null chi-squares. It lists permutations from 0 through 999, where 0 refers to the observed data.
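The permutation procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the GPM implementation: a brute-force enumeration of genotype pairs stands in for the fpgrowth search, support is taken as the number of carriers (a + c), ties with the observed statistic are counted as exceedances, and all function names and the toy data are hypothetical.

```python
import math
import random
from itertools import combinations

def lr_chi_square(a, b, c, d):
    """Likelihood ratio chi-square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_tot, col_tot = (a + b, c + d), (a + c, b + d)
    g = 0.0
    for i, row in enumerate(((a, b), (c, d))):
        for j, obs in enumerate(row):
            if obs > 0:
                g += obs * math.log(obs * n / (row_tot[i] * col_tot[j]))
    return 2.0 * g

def max_chi_square(genos, pheno, min_support=5, min_conf=0.70):
    """Largest chi-square over all genotype pairs passing the thresholds."""
    n_case = sum(1 for y in pheno if y == 2)
    n_ctrl = len(pheno) - n_case
    best = 0.0
    for v1, v2 in combinations(range(len(genos[0])), 2):
        counts = {}  # (genotype 1, genotype 2) -> [cases, controls]
        for row, y in zip(genos, pheno):
            g1, g2 = row[v1], row[v2]
            if g1 == 0 or g2 == 0:  # skip pairs with a missing genotype
                continue
            cell = counts.setdefault((g1, g2), [0, 0])
            cell[0 if y == 2 else 1] += 1
        for a, c in counts.values():
            if a + c >= min_support and a / (a + c) >= min_conf:
                best = max(best, lr_chi_square(a, n_case - a, c, n_ctrl - c))
    return best

def permutation_p_value(genos, pheno, n_perm=99, seed=1):
    """Return (Tobs, p-value); the observed data count as one null sample."""
    rng = random.Random(seed)
    t_obs = max_chi_square(genos, pheno)
    exceed = 1  # the observed dataset itself
    labels = list(pheno)
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute case/control labels
        if max_chi_square(genos, labels) >= t_obs:
            exceed += 1
    return t_obs, exceed / (n_perm + 1)

# Toy data: 60 individuals, 4 variants, genotypes drawn at random.
toy_rng = random.Random(0)
genos = [[toy_rng.choice([1, 2, 3]) for _ in range(4)] for _ in range(60)]
pheno = [2] * 30 + [1] * 30
t_obs, p_value = permutation_p_value(genos, pheno, n_perm=19)
print(t_obs, p_value)
```

With 19 permutations plus the observed data, M = 20, so the smallest p-value this sketch can return is 1/20.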

To view the numbers of cases and controls with bivariate genotypes for a given pair of variants, you can use the pairSNPs program. It requires a parameter file, which may be named on the command line; otherwise it expects a file called pairSNPs.param (a sample file is furnished in the program package). It also requires an input file, that is, the pedigree file but in transposed format, for example, sadaT.tped. In our sample data, the variant pair (5, 7) had the largest chi-square value (#5 is referred to as test SNP, and #7 as target SNP [6]). In the sadaT.pairs.out file, we see the following output:

CONTROLS
Test SNP |  Target SNP 7
       5 |     1     2     3
---------+------------------
       1 |    14     9    14
       2 |    19    55    29
       3 |    13    31    16
----------------------------

CASES
Test SNP |  Target SNP 7
       5 |     1     2     3
---------+------------------
       1 |     8    31     6
       2 |    17    50    31
       3 |    10    30    17
----------------------------

Heterogeneity analysis, genotype tests
              chi-sq   df        p
CONTROLS     11.0769    4   0.0257
CASES         6.5365    4   0.1625
BOTH          2.9578    4   0.5649
Heterog      14.6556    4   0.0055

Partitioning chi-square, genotype tests
Source        chi-sq   df        p
SNP 1 main    0.9831    2   0.6117
SNP 2 main    2.9637    2   0.2272
Interaction  14.6556    4   0.0055
Total table  18.6024    8   0.0009

There is clearly a strong heterogeneity between cases and controls, or interaction between the two variants, in that the two variants are much more correlated in controls than in cases.
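The heterogeneity chi-square in the output above equals the sum of the chi-squares for controls and cases minus the chi-square for the pooled table (11.0769 + 6.5365 - 2.9578 = 14.6556). The following Python sketch recomputes these values from the two 3 × 3 tables, assuming that each is a likelihood ratio chi-square for independence. It is an illustration, not the pairSNPs source code, and the function name is made up.

```python
import math

def lr_chi_square_table(table):
    """Likelihood ratio chi-square for independence in an r x c table."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    g = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:  # empty cells contribute nothing
                g += obs * math.log(obs * n / (row_tot[i] * col_tot[j]))
    return 2.0 * g

# Genotype counts for test SNP 5 (rows) by target SNP 7 (columns).
controls = [[14, 9, 14], [19, 55, 29], [13, 31, 16]]
cases = [[8, 31, 6], [17, 50, 31], [10, 30, 17]]
both = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(controls, cases)]

g_controls = lr_chi_square_table(controls)
g_cases = lr_chi_square_table(cases)
g_both = lr_chi_square_table(both)
heterogeneity = g_controls + g_cases - g_both  # 4 df

for name, value in [("CONTROLS", g_controls), ("CASES", g_cases),
                    ("BOTH", g_both), ("Heterog", heterogeneity)]:
    print(f"{name:10s}{value:9.4f}")
```

Up to rounding, the four printed values should agree with the chi-sq column of the heterogeneity analysis shown above.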

Some hints on using command windows in Windows: (1) Make sure you see extensions of file names, for example, that you see FPM.txt and not just FPM. (2) Type cmd in the search box at the bottom left of your Windows screen, then click on "Open File Location" and drag the Command Prompt to the desktop to make a shortcut there. This will allow you to permanently change the appearance of the cmd prompt. (3) Make a folder called bin in the C: drive. Prepare a text file with the following lines:

@echo off
set dircmd=/p/o
set path=C:\bin;%PATH%

Save this file in the C:\bin folder with the name setbin.bat, making sure it is not called setbin.bat.txt. Each time you open your cmd window, you type C:\bin\setbin and press Enter. Then all executable program files you place into the C:\bin folder will be accessible (in the path) from anywhere. To verify that everything is ok, type path and press Enter. You should then see PATH=C:\bin;C:\Program Files....

General notes

Please don't hesitate to send me email to report any problems or difficulties you might be having.

Linux version of GPM program

The Linux version is analogous to the Windows version, with a few exceptions. It was tested in Kubuntu. Here is the Linux version of the parameter file discussed above:

sada-s5c70	{Line 1}
9999 {Line 2}
/home/jurg/GPMlinux/ {Line 3}
2 5 70 {Line 4}
sada.map {Line 5}
sadaFPM.txt {Line 6}
10 {Line 7}
/home/jurg/ {Line 8}

Explanations

  1. Prefix, string to appear in front of output files
  2. Number of permutations
  3. Name of folder containing fpgrowth program
  4. Three numbers -- 1) 2 for cases, 1 for controls; 2) min. support; 3) min. confidence as a percentage
  5. Map file name
  6. Input file to fpgrowth, *FPM.txt
  7. After how many permutations to write to report file
  8. Folder where report file will be written

After program termination, in contrast to the Windows version, GPM Linux will write a log file, whose name starts with the prefix and ends in ".log". The program is best run in a terminal window (console) with the following two commands (replace sada.param with your parameter file name):

GPM sada.param > /dev/null 2>&1 &
disown

The trailing & on the first command runs the program in the background, and the second command, disown, ensures that it is not interrupted when you log out. Depending on your data, GPM can run for many days.

References

  1. Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. ACM SIGMOD Conference on Management of Data. Washington DC 1993. p. 207-16.
  2. Agrawal R, Srikant R. Fast algorithms for mining association rules. Proceedings of the 20th VLDB Conference. Santiago, Chile; 1994. p. 487-99.
  3. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, et al. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69(1):138-47.
  4. Moore JH, Hahn LW. A cellular automata approach to detecting interactions among single-nucleotide polymorphisms in complex multifactorial diseases. Pac Symp Biocomput. 2002:53-64.
  5. Borgelt C. An implementation of the FP-growth algorithm. Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations. Chicago, Illinois: Association for Computing Machinery; 2005. p. 1–5.
  6. Wang G, Yang Y, Ott J. Genome-wide conditional search for epistatic disease-predisposing variants in human association studies. Hum Hered. 2010 Apr 23;70(1):34-41.
  7. Moore JH, Andrews PC. Epistasis Analysis Using Multifactor Dimensionality Reduction. In: Moore JH, Williams SM, editors. Epistasis: Methods and Protocols. New York, NY: Springer New York; 2015. p. 301-14.
  8. Zhang Q, Long Q, Ott J. AprioriGWAS, a new pattern mining strategy for detecting genetic variants associated with disease through interaction effects. PLoS Comput Biol. 2014;10(6):e1003627. Epub 2014/06/06. doi: 10.1371/journal.pcbi.1003627. PubMed PMID: 24901472; PubMed Central PMCID: PMC4046917.
  9. Okazaki A, Horpaopan S, Zhang Q, Randesi M, Ott J. Genotype pattern mining for pairs of interacting variants underlying digenic traits. Genes. 2021;12(8):1160. doi: 10.3390/genes12081160.
  10. Agresti A. Categorical data analysis. New York: Wiley-Interscience; 2002.