FPMp program to find interactions among variants in case-control studies

Jurg Ott - 14 Feb 2021

Frequent Pattern Mining (FPM) methods can rapidly pick frequent patterns from large databases of items. The first such approach was implemented in the Apriori program [1,2], which was developed to mine consumer data. In addition to finding items commonly purchased by a consumer, Apriori also furnishes association rules, that is, estimates of P(Y|X), where Y and X are purchased goods like bread, milk, and wine. Such association rules can predict how likely it is that someone buying X will also buy Y.

Here we apply FPM principles to genetic case-control studies with a moderate number of variants and observations. Specifically, we consider pairs of genotypes (genotype patterns), with one genotype from each of two variants. Genotypes are labeled 0, 1, 2, and 3 for "missing", AA, AB, and BB, respectively. For example, a genotype pattern might be X = AB-AA = 2-1. Phenotypes are labeled 2 for cases and 1 for controls, and we want to find association rules, P(Y = 2|X), that is, estimates of the probability (based on the proportion) of being a case among all individuals with genotype pattern X. In FPM terminology [1,2], P(Y|X) is referred to as confidence, and P(X) is the support for X.
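As a minimal illustration of these two quantities, the following Python sketch computes support and confidence from a handful of made-up individuals. The data and the function name support_and_confidence are hypothetical and not part of the FPMp package; only the genotype and phenotype codes follow the labeling described above.

```python
from fractions import Fraction

# Hypothetical toy data: (genotype at variant 1, genotype at variant 2, phenotype).
# Genotypes: 0 = missing, 1 = AA, 2 = AB, 3 = BB. Phenotypes: 2 = case, 1 = control.
individuals = [
    (2, 1, 2), (2, 1, 2), (2, 1, 1), (1, 1, 1),
    (3, 2, 2), (2, 1, 2), (1, 3, 1), (2, 2, 1),
]

def support_and_confidence(data, pattern):
    """Return (P(X), P(Y = 2 | X)) for a genotype pattern X = (g1, g2)."""
    carriers = [pheno for g1, g2, pheno in data if (g1, g2) == pattern]
    if not carriers:
        raise ValueError("pattern not observed")
    support = Fraction(len(carriers), len(data))
    confidence = Fraction(sum(1 for p in carriers if p == 2), len(carriers))
    return support, confidence

# Pattern X = AB-AA = 2-1
support, confidence = support_and_confidence(individuals, (2, 1))
print(f"support P(X) = {support}, confidence P(Y=2|X) = {confidence}")
```

In this toy example, four of the eight individuals carry the pattern 2-1 and three of those four are cases.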

The purpose of this endeavor is to discover interaction (digenic) effects of two variants, that is, genotype patterns whose frequencies differ between cases and controls even though the individual variants may show no such differences. Other approaches to finding epistatic variants exist, notably Multifactor Dimensionality Reduction (MDR) [3,4] and Conditional Search [6]. In a forthcoming manuscript, properties of FPMp and MDR will be compared.

The FPMp program introduced here makes use of the fpgrowth program [5], a modern implementation of FPM, which users should download from Prof. Borgelt's webpage. It comes in a Windows version; a Linux version is in preparation. FPMp is designed to find patterns of genotypes at different DNA variants with pattern frequencies different in cases and controls, whether or not individual variants show different genotype frequencies between cases and controls. We discuss patterns of length 2, that is, we consider sets of two genotypes, one from each of two variants. The FPMp program package contains the main program that runs fpgrowth as needed, and several helper programs useful for data preparation. Initially, the case-control data should be in plink format, that is, as two files, *.map and *.ped. Here is a description of the various analysis steps.

It is assumed that you invoke the program on the command line. That is, in Windows, you run a command window (type cmd in the search box at the bottom left in Windows 10).

Case-control data

We use the sample data that come with the MDR program [7], transformed into a plink dataset, sada.map and sada.ped (furnished in the program package). In practice, you may want to keep the missing genotype rate as low as possible, for example, with the plink filter, --geno 0.05 or even smaller (the current dataset has no missing genotypes). The first step is to transform the genotype data into a format suitable for fpgrowth: Calling the makeFPM program with sada[.ped] as input will lead to the sadaFPM.txt file.

The fpgrowth program will treat missing genotypes (coded 0) as an additional category. When reporting results furnished by fpgrowth, the FPMp program will filter out all pairs of genotypes with one or both genotypes missing.

The next step is to prepare a parameter file for the FPMp program. Included in the program package is an example, sada.param, with the following lines:
999           {Line 1}
c:\bin\       {Line 2}
 2 5 70        {Line 3}
sada.map      {Line 4}
sadaFPM.txt   {Line 5}
sada.perm.out {Line 6}
sada.obs.out  {Line 7}
10            {Line 8}

Explanations to the various input lines:
After you download the program package, move the following files into a folder containing programs, for example, C:\bin\: fpgrowth.exe (from Dr. Borgelt's webpage), FPMp.exe, makeFPM.exe, and pairSNPs.exe. Everything else should go into your work folder, for example, \work. Open a cmd window and change directory to \work, typing cd \work. To call the program, type its name followed by the name of the parameter file, for example, FPMp sada.param. If you don't furnish a name for the parameter file on the command line, the program will issue an error message and stop.

Explanations to output

For each of the patterns (genotype pairs) identified by fpgrowth, consider the following 2 × 2 table of observed numbers, where a, b, c, and d refer to numbers of individuals:

Phenotype        | Pattern, X | not X  | Total
-----------------+------------+--------+------
Cases, Y = 2     |     a      |   b    | Ncase
Controls, Y = 1  |     c      |   d    | Nctrl
-----------------+------------+--------+------
Total            |    N_X     | N_notX |   N

Table 1. The numbers a, b, c, and d form a 2 × 2 table, for which chi-square is computed. This is done for each pattern passing the requirements spelled out in the parameter file.

The X column refers to individuals carrying a given pattern, X, with "not X" referring to anyone else. The proportion of cases is Ncase/N, and the proportion of cases among people with X is P(Y = 2|X), estimated by a/(a + c). Certain behaviors of the fpgrowth program can be changed. After downloading it, please run the sample data to verify that you obtain the same sada.obs.out file as furnished in this program package. If not, I will be happy to investigate the situation.
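The quantities derived from Table 1 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and since the exact formula used inside FPMp is not spelled out here, the standard likelihood ratio (G) statistic for a 2 × 2 table is assumed.

```python
import math

def confidence(a, c):
    """Estimate P(Y = 2 | X) as a / (a + c)."""
    return a / (a + c)

def lr_chisq_2x2(a, b, c, d):
    """Likelihood ratio chi-square (G statistic, 1 df) for the table in Table 1.

    Assumption: standard form G = 2 * sum(O * ln(O / E)) over the four cells.
    """
    n = a + b + c + d
    rows = (a + b, c + d)        # Ncase, Nctrl
    cols = (a + c, b + d)        # N_X, N_notX
    observed = ((a, b), (c, d))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n   # expected count under independence
            if o > 0:                   # a zero cell contributes nothing
                g += o * math.log(o / e)
    return 2.0 * g

# Hypothetical counts: 20 cases and 10 controls carry X; 10 cases and 20 controls do not.
print(confidence(20, 10))
print(lr_chisq_2x2(20, 10, 10, 20))
```

A table whose four cells match their expected counts exactly, such as a = b = c = d = 10, gives a chi-square of zero.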

With the given parameters, the fpgrowth program reports 9 patterns in the sada.obs.out file, which is best imported into a spreadsheet. For each of the 9 patterns, a table like Table 1 is shown with the associated chi-square value. Sorted by decreasing chi-square, the sada.obs.out.xlsx file shows one large chi-square followed by rather small ones.

For each pattern, a likelihood ratio chi-square is computed from the associated 2 × 2 table (Table 1), and the test statistic, Tobs, is the largest chi-square over all patterns. In each permutation, labels for cases and controls are randomly permuted and the whole process of pattern search and calculations is repeated, leading to a null maximum chi-square, T0. The p-value is then given as the proportion of null samples with T0 ≥ Tobs. Here, the observed data are counted among the null data so that the smallest possible p-value is 1/N, where N here denotes the number of permutations including the observed data. Each observed and permuted dataset will contribute one line with a chi-square value to the output file named on line 6 of the parameter file. Two special values are as follows:
The sada.perm.out.xlsx spreadsheet shows chi-square for the observed data in the line for perm = 0; all remaining chi-squares, obtained from permuted data, are in the lines below and ordered by decreasing values. Column C demonstrates how to compute estimated p-values for null chi-squares. It lists permutations from 0 through 999, where 0 refers to the observed data.
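The p-value calculation described above can be sketched as follows. The function name and the chi-square values are hypothetical; the only assumption taken from the text is that the observed data count as one of the N samples, so the comparison with Tobs is inclusive and the smallest attainable estimate is 1/N (0.001 when permutations run from 0 through 999).

```python
def permutation_p_value(t_obs, t_null):
    """Proportion of the N = len(t_null) + 1 values that are at least t_obs."""
    values = [t_obs] + list(t_null)   # the observed data count as one null sample
    n = len(values)
    return sum(1 for t in values if t >= t_obs) / n

# Hypothetical maximum chi-squares: one observed value and nine permutations.
t_obs = 12.5
t_null = [3.1, 4.7, 13.0, 2.2, 5.9, 6.4, 1.8, 7.3, 4.0]
p = permutation_p_value(t_obs, t_null)
print(p)
```

Here one permuted value exceeds the observed one, so two of the ten values count and the estimate is 2/10.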

To view the numbers of cases and controls with bivariate genotypes for a given pair of variants, you can use the pairSNPs program. It requires a parameter file, which may be named on the command line; otherwise it is expected as a file called pairSNPs.param (a sample file is furnished in the program package). It requires an input file, that is, the pedigree file but in transposed format, for example, sadaT.tped. In our sample data, the variant pair (5, 7) had the largest chi-square value (#5 is referred to as test SNP, and #7 as target SNP [6]). In the sadaT.pairs.out file, we see the following output:

CONTROLS
Test SNP |     Target SNP 7
       5 |     1     2     3
  -------+------------------
       1 |    14     9    14
       2 |    19    55    29
       3 |    13    31    16
  --------------------------

CASES
Test SNP |     Target SNP 7
       5 |     1     2     3
  -------+------------------
       1 |     8    31     6
       2 |    17    50    31
       3 |    10    30    17
  --------------------------

Heterogeneity analysis, genotype tests
              chi-sq  df     p
CONTROLS     11.0769   4  0.025713
   CASES      6.5365   4  0.162508
    BOTH      2.9578   4  0.564919
 Heterog     14.6556   4  0.005471

Partitioning chi-square, genotype tests
      Source    chi-sq  df     p
  SNP 1 main    0.9831   2  0.611692
  SNP 2 main    2.9637   2  0.227213
 Interaction   14.6556   4  0.005471
 Total table   18.6024   8  0.000941

There is clearly a strong heterogeneity between cases and controls, or interaction between the two variants, in that the two variants are much more correlated in controls than cases.
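The chi-square values printed above are consistent with a simple additive partition, which the short Python check below makes explicit. This is an observation about the numbers in the listing, not a statement about how pairSNPs computes them internally; the variable names are chosen here for illustration.

```python
# Chi-square values copied from the sadaT.pairs.out listing above.
controls, cases, both = 11.0769, 6.5365, 2.9578
heterogeneity = 14.6556
snp1_main, snp2_main, interaction, total = 0.9831, 2.9637, 14.6556, 18.6024

# Heterogeneity equals the two within-group chi-squares minus the pooled one.
het_check = controls + cases - both
# The two main effects and the interaction add up to the total-table chi-square.
total_check = snp1_main + snp2_main + interaction

print(round(het_check, 4), round(total_check, 4))

# Degrees of freedom combine in the same way: 4 + 4 - 4 = 4 and 2 + 2 + 4 = 8.
df_het = 4 + 4 - 4
df_total = 2 + 2 + 4
```

The interaction line in the partitioning table carries the same chi-square and degrees of freedom as the heterogeneity line.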

If you see a file, seed.txt, after program termination, it is best to just leave this file as is. If you delete it, the program will make a new one based on the system clock, but random number generation performs best when new seeds are based on previously used seeds.

Some hints on using command windows in Windows: (1) Make sure that file name extensions are displayed, for example, that you see FPM.txt and not just FPM. (2) Type cmd in the search box at the bottom left of your Windows screen, then click on "Open File Location" and drag the Command Prompt to the desktop to make a shortcut there. This will allow you to permanently change the appearance of the cmd prompt. (3) Make a folder called bin in the C drive. Prepare a text file with the following two lines:
set dircmd=/p/o
set path=C:\bin;%PATH%

Save this file in the C:\bin folder with the name setbin.bat, making sure it is not called setbin.bat.txt. Each time you open your cmd window, you type C:\bin\setbin and press Enter. Then all executable program files you place into the C:\bin folder will be accessible (in the path) from anywhere. To verify that everything is ok, type path and press Enter. You should then see PATH=C:\bin;C:\Program Files....

If FPMp stops with the message, Error running ...fpgrowth, this most likely means that fpgrowth ran out of memory.

Please don't hesitate to send me email to report any problems or difficulties you might be having.

References

1. Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. ACM SIGMOD Conference on Management of Data. Washington DC 1993. p. 207-16.

2. Agrawal R, Srikant R. Fast algorithms for mining association rules. Proceedings of the 20th VLDB Conference. Santiago, Chile; 1994. p. 487-99.

3. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, et al. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69(1):138-47.

4. Moore JH, Hahn LW. A cellular automata approach to detecting interactions among single-nucleotide polymorphisms in complex multifactorial diseases. Pac Symp Biocomput. 2002:53-64.

5. Borgelt C. An implementation of the FP-growth algorithm. Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations. Chicago, Illinois: Association for Computing Machinery; 2005. p. 1–5.

6. Wang G, Yang Y, Ott J. Genome-wide conditional search for epistatic disease-predisposing variants in human association studies. Hum Hered. 2010 Apr 23;70(1):34-41.

7. Moore JH, Andrews PC. Epistasis Analysis Using Multifactor Dimensionality Reduction. In: Moore JH, Williams SM, editors. Epistasis: Methods and Protocols. New York, NY: Springer New York; 2015. p. 301-14.