Jurg Ott / 4 Feb 2013
Rockefeller University New York and
Institute of Psychology, Chinese Academy of Sciences, Beijing

Disease-Associated Genotype Patterns

This document refers to our publication (Long et al 2009) on estimating genotype patterns (diplotypes) and testing for frequency differences between case and control individuals. Here is a brief description of how to use our software, called randompat (RP). It is currently available for Windows PCs, but the included source code may be compiled for Linux PCs with the Free Pascal compiler. A Ukrainian (Belarusian?) translation of this document is available here.

Installation

Input Files

The program takes two input files, a parameter file and a datafile. The sample parameter file provides a brief description of how to set up this file. The datafile must have the structure of a sumstat file and may or may not contain chromosomal information for the SNPs. Briefly, rows in the datafile correspond to SNPs and columns represent individuals, while the body of the file contains genotype codes, for example, 1 = AA, 2 = AB, 3 = BB, 0 = unknown. The last row contains indicator codes for disease status, for example, 1 = control, 2 = case (affected). The last three columns are optional and may specify chromosome number, position, and marker identifier.
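To make this layout concrete, here is a minimal Python sketch of a reader for such a file. It is not part of the randompat package. It assumes whitespace-separated fields and that the caller states whether the three optional annotation columns are present; neither detail is specified in this document. The function name read_sumstat and the toy genotype values are hypothetical.

```python
def read_sumstat(lines, has_annotation=False):
    """Parse sumstat-style rows (assumed to be whitespace-separated).

    Returns (genotypes, status, annotation):
      genotypes  - one list of integer codes per SNP (0 = unknown)
      status     - disease codes per individual, taken from the last row
      annotation - (chromosome, position, name) per SNP, or [] if absent
    """
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) < 2:
        raise ValueError("need at least one SNP row and the status row")
    *snp_rows, status_row = rows

    genotypes, annotation = [], []
    for fields in snp_rows:
        if has_annotation:
            # Assumption: the last three fields are chromosome, position, name.
            fields, extra = fields[:-3], fields[-3:]
            annotation.append((extra[0], int(extra[1]), extra[2]))
        genotypes.append([int(code) for code in fields])

    n_individuals = len(genotypes[0])
    # Assumption: the status row starts with one code per individual.
    status = [int(code) for code in status_row[:n_individuals]]
    same_width = all(len(g) == n_individuals for g in genotypes)
    if not same_width or len(status) != n_individuals:
        raise ValueError("rows do not all cover the same number of individuals")
    return genotypes, status, annotation


# Toy input: two SNP rows for five individuals (invented genotype codes),
# followed by the disease-status row (1 = control, 2 = case).
example = [
    "1 2 3 0 2 17 113 TP53-2_17p13",
    "2 2 1 3 1 7 236 NOS3_07q36",
    "1 2 2 1 2",
]
genotypes, status, annotation = read_sumstat(example, has_annotation=True)
```

Keeping the chromosome as text rather than a number is a deliberate choice in this sketch, so that labels such as X would not cause an error.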

Your data may be in PLINK format, in which case you can use the p2s program to convert it to sumstat format.

Interpretation of Output

As described in our publication (Long et al 2009), the randompat program picks m SNPs on the basis of their individual significance for association. This can be done with the allele test (based on 2x2 tables of alleles) or the genotype test. Naturally, the order of SNPs picked is generally different depending on which test statistic is used to pick SNPs. The number m of SNPs for which genotype patterns are formed is an input quantity. Two parameter files are included in this package, RPparamZee2.txt and RPparamZee3.txt. They differ in the test type used to pick SNPs. The sample dataset provided here was previously described (Hoh et al 2001).
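The SNP-picking step based on the allele test can be sketched in Python as follows. This is an illustration only, not the randompat source: it assumes the genotype and status codes given under Input Files (1 = AA, 2 = AB, 3 = BB, 0 = unknown; 1 = control, 2 = case) and a Pearson chi-square without continuity correction, which this document does not confirm. All function names are hypothetical.

```python
import math


def allele_counts(genotypes, status, group):
    """Count A and B alleles among individuals whose status equals `group`."""
    n_a = n_b = 0
    for code, s in zip(genotypes, status):
        if s != group or code == 0:      # skip other group and unknown genotypes
            continue
        n_a += {1: 2, 2: 1, 3: 0}[code]  # AA carries 2 A alleles, AB 1, BB 0
        n_b += {1: 0, 2: 1, 3: 2}[code]
    return n_a, n_b


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def p_value_1df(chi_sq):
    """Upper-tail probability of a chi-square variate with 1 df."""
    return math.erfc(math.sqrt(chi_sq / 2.0))


def pick_snps(snp_genotypes, status, m):
    """Return the indices of the m SNPs with the largest allele-test chi-square."""
    scores = []
    for index, genotypes in enumerate(snp_genotypes):
        a, b = allele_counts(genotypes, status, group=2)  # cases
        c, d = allele_counts(genotypes, status, group=1)  # controls
        scores.append((chi_square_2x2(a, b, c, d), index))
    scores.sort(key=lambda pair: -pair[0])  # largest chi-square first
    return [index for _, index in scores[:m]]


# A chi-square of 6.4891 with 1 df corresponds to p of about 0.01085.
print(p_value_1df(6.4891))
```

Picking by the genotype test would follow the same pattern with a 2x3 table of genotype counts and 2 df in place of the 2x2 allele table.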

Running the program on the ZeeData.txt file produces the following output:

Program RANDOMPAT version 04 Feb 2013

Sample data
Input file = ZeeData.txt
Current time:  7 Feb 2013  10:37:07h

Pattern is rare when exp #obs <  1.00 in cases or controls
Number of permutations = 5000
SNPs picked by allele test. Lambda = 1.0000
Input file = ZeeData.txt
Number of individuals = 779
Number of SNPs = 88

=== Observed data, best 2 SNPs ===
TestSNP   seq.#      chi-sq     p-value chr   position  name
      1      75      6.4891 1.0854E-002  17        113 TP53-2_17p13
      2      44      4.7591 2.9144E-002   7        236 NOS3_07q36
779 of 779 individuals showed complete patterns

Observed table of genotype patterns and odds ratios (zero cell entries replaced by 0.5)
Controls   Cases  Pattern   OR          95% CI
      91      66  3 1       0.8954      0.6281      1.2764
     101      71  2 2       0.8582      0.6088      1.2098
      73      46  2 1       0.7624      0.5107      1.1382
     104     100  3 2       1.3018      0.9450      1.7933
      21      38  3 3       2.4714      1.4145      4.3180
      29      13  2 3       0.5380      0.2724      1.0622
      12       3  1 1       0.2696      0.0682      1.0664
      13      10  1 2       0.9598      0.4080      2.2582
       2       4  1 3       2.9700      0.4363     20.2170
sum  440     348  Total = 788
p = 3.2162E-003 for table of genotype patterns

=== Randomized adjusted p-values ===
Test SNP 1    2842/5000 = 5.6840E-001 = 0.5684
Test SNP 2    3493/5000 = 6.9860E-001 = 0.6986
     Table     697/5000 = 1.3940E-001 = 0.1394

Initial seed = 72201310377753
Current time:  7 Feb 2013  10:37:13h


For this run, the program picked the two SNPs with the smallest p-values in the allele association test (chi-square with 1 df), then formed genotype patterns and listed all patterns with an expected number of observations of at least 1 in cases and in controls. It turns out that all possible 3 x 3 = 9 patterns occur in this dataset. The pattern with the strongest disease association, judged by the odds ratio, OR = 2.97, is 1-3 (AA-BB) at test SNPs 1 and 2, respectively. However, due to the small number of individuals showing this pattern, the confidence interval is very wide and includes OR = 1. On the other hand, the pattern showing the strongest association with absence of disease is 1-1 (AA-AA), with OR = 0.2696 (1/OR = 3.71). The final significance level of these results, adjusted for testing multiple SNPs, is p = 0.1394.
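The adjusted p-values in the output are proportions of permutations, for example 697/5000 = 0.1394 for the table of patterns. The Python sketch below illustrates the general idea under the usual permutation scheme: shuffle the case/control labels, recompute the statistic, and count how often it is at least as large as the observed one. The exact procedure used by randompat is described in Long et al 2009 and may differ in detail; the function names and the toy data here are hypothetical.

```python
import random


def permutation_p_value(statistic, status, n_perm=5000, seed=1):
    """Fraction of label permutations whose statistic is >= the observed one.

    `statistic` maps a list of disease codes to a number, where larger means
    stronger association. To adjust for SNP selection, it should repeat the
    whole analysis (pick SNPs, form patterns, test) on the labels it is given.
    """
    rng = random.Random(seed)
    observed = statistic(status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            hits += 1
    return hits / n_perm


# Toy example with invented data: difference in mean score between
# cases (status 2) and controls (status 1).
scores = [5, 4, 6, 8, 9, 7]
status = [1, 1, 1, 2, 2, 2]


def mean_difference(labels):
    cases = [x for x, s in zip(scores, labels) if s == 2]
    controls = [x for x, s in zip(scores, labels) if s == 1]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


p = permutation_p_value(mean_difference, status, n_perm=2000, seed=42)
print(p)
```

Because the selection of the best SNPs is repeated inside each permutation, the resulting proportion already accounts for having searched over many SNPs, which is why the adjusted values are much larger than the nominal p-values of the individual SNPs.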

References

Hoh, J., Wille, A., and Ott, J. 2001. Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res 11(12): 2115-2119.

Long, Q., Zhang, Q., and Ott, J. 2009. Detecting disease-associated genotype patterns. BMC Bioinformatics 10(Suppl 1): S75.