Jurg Ott / 10 Jan 2013
Rockefeller University New York
Email:
ott@rockefeller.edu

Institute of Psychology
Chinese Academy of Sciences, Beijing
ottjurg@psych.ac.cn


Documentation to SIMULATE program

Originally written by Joe Terwilliger [1], SIMULATE is a computer program to simulate genotypes in family members for a map of linked markers unlinked to a given affection status locus. The markers are assumed to be in linkage equilibrium when genotypes are assigned to founders in pedigrees (except in SIMULATE3). Output from this program is in SLINK format and is ready for analysis with UNKNOWN, ISIM, LSIM, or MSIM of the SLINK package. The program is written in Free Pascal for Windows (32 bit and 64 bit) and Linux and comes in the following three versions:

(1)
SIMULATE is essentially the original version, in which all marker genotypes are generated based on population marker characteristics (allele frequencies, etc.) and recombination fractions between them.

(2)
SIMULATE2 assumes that for founder individuals with known genotypes (i.e., they are "typed") the original (observed) genotypes are provided. These genotypes will not be modified (generated) in the course of the simulations. For founders with unknown genotypes ("untyped"), marker genotypes will be generated as in the SIMULATE program. However, prior to running SIMULATE2, users may want to impute such genotypes from the original data so that few founders have unknown genotypes. The two program versions have somewhat different input requirements. For example, the random number generator is different – SIMULATE requires 3 seeds and SIMULATE2 requires only one.

(3) SIMULATE3 is an extension of SIMULATE in that markers must be SNPs (have only two alleles) and may be correlated (in linkage disequilibrium) with each other (see below how these correlations are generated). Also, this program version incorporates an updated random number generator, ran (int64) [2].


LOCUS TYPES

All locus types in the LINKAGE programs can be simulated with the following exceptions:

The SIMULATE2 program can only handle marker loci and optionally a trait locus as the first locus.


INPUT FILES

This program requires three input files as follows (exceptions for SIMULATE3 are given in a separate section below):

1.
SIMDATA.DAT  a Standard LINKAGE parameter file (datafile), specifying the map of markers (chromosome order = file order). Note that, as in SLINK, the last input line is not relevant. However, the input line before last specifies recombination fractions between loci.

2.
SIMPED.DAT  a post-MAKEPED LINKAGE pedigree file with an additional line at the top specifying the following numbers:

For subsequent input lines (one line per individual), different rules apply to the SIMULATE and SIMULATE2 programs:

SIMULATE. The fields for id's, sex, and probands are as in standard LINKAGE pedigree files. However, since this program simulates all marker loci, you must only have one digit per marker in the pedigree file (no marker genotypes, or 0 0, as in SLINK; see exception for SIMULATE2 below) to tell whether that marker is to be simulated or left unknown in that individual. A 0 means that marker should be untyped, and a 1 means it should be typed (simulated). See below for locus order and types of loci. That is, each marker may be designated as being known or unknown, not only each individual as in SLINK.

SIMULATE2. Depending on whether an individual is a founder (no parents in pedigree) or non-founder, and whether genotypes are known or not, marker genotypes are coded as follows (after an initial optional affection status locus, only marker genotypes are permitted). Note that the coding scheme below is analogous to that in SLINK but different from that in SIMULATE.

3. PROBLEM.DAT  A file containing the following numbers:

At the end of the simulation, the program writes a new seed to PROBLEM.DAT such that SIMULATE2 may be called repeatedly and each time continues with an updated seed.


OUTPUT FILES

The program creates the following output files:


Markov model for correlated SNPs

A simple model for correlated SNPs is to assume that SNP (i – 1) only depends on SNP i. While such a dependency may not be completely realistic, it can nonetheless serve a useful role in bioinformatics investigations on genomes [3]. Generating such SNP genotypes for each individual may be accomplished as follows.

Consider two adjacent SNPs, locus 1 (alleles 1 and 2) and locus 2 (alleles 1’ and 2’). Let p = P(1) be the minor allele frequency at locus 1, and q = P(1’) be the minor allele frequency at locus 2, and assume that the two alleles in a genotype are in a fixed order; for now we only consider the first allele in a given genotype. We want to make alleles at locus 2 depend on those at locus 1 in the manner indicated in the graph. Thus, for example, if an individual has allele 1 at locus 1, then his probability of having allele 1’ at locus 2 is

P(1’|1) = 1 – s.

This model has three parameters: p, s, and t, with allele frequencies at locus 2 being given by

q = p(1 – s) + (1 – p)t.

The joint distribution of alleles at two loci may be constructed based on the conditional probabilities introduced above, for example, P(1’, 1) = P(1’|1)P(1) = (1 – s)p. This leads to the following table:


Locus 2

Locus 1


1

2

total

1’

(1 – s)p

t(1 – p)

q

2’

sp

(1 – t)(1 – p)

1 – q

total

p

1 – p

1


For given p and q, there is one free parameter (degree of freedom, df) left, which we conveniently choose to be a measure of linkage disequilibrium between the SNPs. Thus, we want to express the square of the correlation coefficient, r2, and the scaled disequilibrium parameter, D’, in terms of p, q, s, and t. To obtain r2, we attach respective values of x = 1 and x = 0 to alleles 1 and 2, and values of y = 1 and y = 0 to alleles 1’ and 2’. The correlation coefficient is defined as

r = [E(xy) – E(x)E(y)]/√[V(x)V(y)],

with E denoting expectation and V denoting variance, for example,

V(x) = E(x2) – E2(x) = p – p2 = p(1 – p).

With this, we obtain the squared correlation coefficient in terms of the parameters in the table above as

r2 = [P(1, 1’) – pq]2/[p(1 – p)q(1 – q)] = p(1 – s – q)2/[(1 – p)q(1 – q)].

Similarly, assuming a positive correlation between the two SNPs, we find the scaled linkage disequilibrium as

D’ = [P(1, 1’) – pq]/min[p(1 – q), (1 – p)q] = p(1 – s – q)/min[p(1 – q), (1 – p)q].

When allele frequencies at loci 1 and 2 are the same, q = p, then the above formulas simplify to

r = 1 – s/(1 – p) and D’ = 1 – s/(1 – p) = r.

To generate (predict) an individual’s genotype at locus 2 given his genotype at locus 1, as noted above, we treat the two alleles at a genotype as independent entities, that is, we assume Hardy-Weinberg equilibrium (HWE) and generate each allele separately. For example, for the first allele in genotypes at loci 1 and 2, allele 1’ (locus 2) will be obtained with probability P(1’|1) = 1 – s if the first allele at locus 1 is 1, and with probability P(1’|2) = t if the first allele at locus 1 is 2. Thus, we need to express s and t in terms of p, q, and r (only r2 and not D’ will be used here to express linkage disequilibrium). Simple algebraic manipulations lead to

s = 1 – q – [r2p(1 – p)q(1 – q)] ½ /p

and

t = q – [r2p(1 – p)q(1 – q)] ½ /(1 – p).

For q = p, these expressions simplify to

s = (1 – p)(1 – r) and t = p(1 – r).

For a given individual, it is convenient to successively generate alleles at all loci to form a very long haplotype along a chromosome, and then to generate a second haplotype at all loci. The two haplotypes will then define the genotypes of this individual. This procedure has been implemented in the Simulate program as Simulate3.


Input files for SIMULATE3

File simdata.dat as described above.

File simped.dat as described above except that the top line must contain (1) the number of pedigrees, followed by (2) the numbers of individuals in each pedigree, and (3) the square of the correlation coefficient, r2, as a measure of linkage disequilibrium between adjacent SNPs. If (3) is missing, r2 = 0 is assumed. Note that, as above, this input file is post-Makeped, and individuals must be sequentially ordered by their IDs.

File problem.dat holding only one number, that is, the number of replicates desired.

File seed.txt holding a positive integer seed.

An example for each of these files is contained in the distribution package.



REFERENCES

1 Terwilliger JD, Speer M, Ott J: Chromosome-based method for rapid computer simulation in human genetic linkage analysis. Genet Epidemiol 1993;10:217-224.

2 Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical recipes 3rd edition: The art of scientific computing, ed 3rd. Cambridge, UK; New York, Cambridge University Press, 2007.

3 Waterman MS: Introduction to computational biology: Maps, sequences and genomes, ed 1. Boca Raton FL, Chapman and Hall/CRC, 1995.