25 August 2006
Jurg Ott
ott@rockefeller.edu

Manual for Simnuclear program

Version 1.01 (test version) 13 May 1999

Version 1.02, 16 Feb 2000 (bug eliminated: at the start of the genome, inheritance of grandparental alleles to offspring was not random and could lead to wrong evidence for linkage)

Version for Windows, Simnuc program: Long file names not supported by the gcc compiler (djgpp)
Program developed and written by Harald Göring

1. Description of program

The program simulates a data set consisting of nuclear pedigrees ascertained (generated) in such a way that some family members are affected by a multi-locus trait. The genetic model underlying the trait is an oligogenic threshold model. As outlined below, the user can vary the proportion of the overall variance due to each of the underlying susceptibility loci.

The data set is output in a pre-MAKEPED LINKAGE-format pedigree file (with trait phenotypes and marker genotypes for autosomal data). The user has control over many input parameters such as size of data set, pedigree size distribution, trait characteristics, ascertainment scheme, number and size of chromosomes, QTL characteristics, marker characteristics, and some output options. Input is via the command line (but can be redirected to a file) and is not checked for input errors. The initial seed for the random number generator is in file "seed.txt", which at program termination will be updated with the final seed. Program output is in an output file ("out.txt").

2. Algorithm of program

Trait model: A random pedigree is simulated; it is then ascertained or discarded. The program assumes a quantitative trait, which is taken to be ~N(0,1) (with a small number of QTLs, the resulting quantitative phenotype will not be very "normal"). The variance is contributed by diallelic QTLs (genes underlying the trait) and individual-specific (VarE.non-shared) and family-specific (VarE.shared) environmental factors. The proportion of the variance due to genetic factors, the heritability, is h2 = ∑i(VarGi), where VarGi is the proportion of variance due to the i-th QTL. Thus, the overall model for the trait variance is Vartotal = h2 + VarE.shared + VarE.non-shared ≡ 1. For more details on such a model, see Hartl DL (1987) A primer of population genetics, 2nd ed. (paperback) Sinauer Associates, p. 232.

Memory usage: The program may require much memory. All marker genotypes for all pedigree members in all pedigrees in the data set are stored simultaneously. For example, for a genome of size 3000 cM, marker spacing of 1 cM, up to 11 children per family, and a data set consisting of 100 families, the program stores 3000 markers × 13 individuals × 100 pedigrees × 2 alleles/genotype > 6,000,000 alleles! The memory requirements could be reduced drastically by just storing and outputting one pedigree at a time.

Program steps

  1. Simulation of the number of children, based on family size distribution.
  2. Trait-locus alleles are simulated randomly for parents, i.e. absence of LD and presence of HWE is assumed.
  3. The genome-wide cross-over point processes for all meioses are simulated. For details, see Terwilliger JD, Speer M, Ott J (1993) Chromosome-based methods for rapid computer simulation in human genetic linkage analysis, Genet Epidemiol 10:217-224. Absence of interference is assumed: On each gamete, crossovers are placed according to the Poisson law, that is, with an average distance between adjacent crossovers of 100 cM, and chromosomes are formed according to their genetic lengths. — The program internally treats the genome as one giant entity. The independent segregation of chromosomes is superimposed on the cross-over point process. For this reason, a small region at the end of each chromosome (constant empty, set to 1 cM or less) has no markers and no cross-overs. A switch in parental origin between 2 consecutive chromosomes (from end of chromosome A to start of chromosome B) is stored as a "cross-over" in the middle of this empty region at the end of each chromosome.
  4. Transmission of trait-locus alleles from parents to offspring. This is fully determined by the cross-over point process.
  5. Simulation of family-specific (shared) and person-specific (non-shared) environmental contributions.
  6. Calculation of quantitative and qualitative trait phenotypes for all pedigree members.
  7. Pedigree ascertainment. If not ascertained, start over with step 1.
  8. Simulation of marker alleles in parents. This is done assuming absence of linkage disequilibrium (between markers and between markers and QTLs) and presence of HWE.
  9. Transmission of marker alleles from parents to offspring. This is fully determined by the cross-over point process.
  10. Output, including simulations whether a parent is phenotyped and genotyped and whether, for example, a particular PCR reaction (one marker in one person) did not work.

3. Input

All input (except for initial seed) is via the command line, but can be redirected to a file (see section 6, below). Note that the input is not checked for errors. Notation used for ranges: [a, b] includes the boundary values; (a, b) excludes them. The user is prompted for the following input parameters:
  1. Number of chromosomes
  2. Lengths of all chromosomes in cM (as many numbers as there are chromosomes)
  3. Number of QTLs
  4. In turn for each QTL:
  5. VarE.shared. Must have h2 + VarE.shared ≤ 1. VarE.non-shared is computed in the program from h2 + VarE.shared + VarE.non-shared = 1.
  6. Prevalence of affected phenotype; real (0, 1). This value is used for ascertainment of families, even when the phenotype is quantitative (see item 17 below).
  7. Marker density in cM (i.e. distance between adjacent markers), in useful relation to chromosome sizes.
  8. Number n of alleles of markers (the same for all markers)
  9. Marker allele frequencies (furnish n − 1 values). The sum of these frequencies must be < 1! The frequency of the nth allele will be computed by the program.
  10. Maximum number m of children in a family
  11. Population frequencies of families with 1, 2, ..., m − 1 children (irrespective of trait). Reals in [0,1]. The sum of these numbers must be ≤ 1. For example, if there should be exactly 2 children per family, this line should contain only one number, i.e. 0.
  12. Number of pedigrees in data set
  13. Minimum number of "affected" sibs for pedigree ascertainment (positive int.) (this must be compatible with the family size distribution)
  14. P(parent phenotyped); real [0, 1], i.e. the expected proportion of parents with known trait phenotypes. All sibs are assumed to be phenotyped.
  15. P(parent genotyped | phenotyped) real [0, 1], i.e. the expected proportion of phenotyped parents who have known genotypes. Unphenotyped parents are assumed to be ungenotyped.
  16. P(a single marker in an individual is NOT genotyped); real [0, 1].
  17. Option for phenotype output: 1 = qualitative, 0 = quantitative
  18. Option for output of sibs: 1 = only affected sibs, 0 = all sibs
The initial seed for the random number generator is contained in an input file (seed.txt) (positive long integer).

4. Output

The key output (pedigree file) is in an output file (out.txt). This file is in pre-MAKEPED LINKAGE pedigree file format for autosomal data. The phenotype is either given as qualitative or quantitative (0 indicates unknown in either case). Unknown marker genotypes are indicated as "0 0".
The updated seed is output to the file seed.txt, replacing the original seed.
A summary of the input parameters is output to the screen and may be captured into a file by redirection of standard output (see below).

5. Program constants

The program has no limiting constants, and memory is allocated as needed. If not enough memory can be allocated, the program gives an error message and aborts.

There are 3 constants (all defined in the beginning of file simnuc.c), which the user may want to change:

FNSEED: name of file containing seed (currently seed.txt)
FNOUT:  name of output file (currently out.txt)
empty:  length of region at the end of each chromosome w/o markers and cross-overs (see above) (currently set to 1 cM)

6. Compiling and running the program

To compile the simnuc program with the gcc compiler (djgpp), simply type make.

To run the program, either answer all the queries it issues or prepare an input in a file. A sample input file, samplein.txt, is provided. Then type simnuc < samplein.txt. The output resulting from this input file is provided as the file, out.txt. To capture the parameters used in a file called sampleout.txt, issue the command simnuc < samplein.txt > sampleout.txt.