Jurg Ott, Ph.D.
Rockefeller University New York
20 March 2013

SLINK: a general simulation program for linkage analysis
Based on an algorithm by Jurg Ott
Programming by Daniel E. Weeks with the help of Mark Lathrop and Jurg Ott
Documentation and exercises by Daniel E. Weeks and Jurg Ott

Notes: This serves as the documentation to the SLINK program (this text is almost identical to the one on Dan Weeks' webpage). Included in this write-up is a set of simple exercises that the user may carry out. However, familiarity with the LINKAGE programs is assumed. You will not be able to correctly use these programs unless you are familiar with the LINKAGE programs. This documentation does not attempt to teach you how to use the LINKAGE programs. Furthermore, it is assumed that you already have the LINKAGE package. The SLINK program distributed on our web site is the original version, written in Pascal. Dr. Weeks (http://watson.hgen.pitt.edu) distributes a modified version, FastSLINK, written in C, that executes much more rapidly.

0. KNOWN BUGS

In the present version of SLINK, pedigree ID's may be names or numbers. If they are numbers, these numbers must not exceed the maximum number of pedigrees specified. It is safest to number pedigrees consecutively starting with 1 or to use names such as P112 (the MAKEPED program will then assign suitable pedigree numbers). This is inconsistent with the LINKAGE programs.

1. INTRODUCTION

Simulation can provide approximate answers to questions that are cumbersome or impossible to answer analytically. For example, in a linkage study of a disease it is important to know if the pedigrees you have collected (or plan to collect) will be sufficient to detect linkage. The power to detect linkage depends on a variety of factors, such as the structure of the pedigree, the number of affected individuals, and the informativeness of the markers. The SLINK program described below allows one to carry out such power calculations by simulating genotypes at one locus given the phenotypes at another locus linked with the first locus. If one is interested in simulating under no linkage (eg. to investigate exclusion possibilities or p-values), the SLINK program may also be used but in no-linkage situations other programs such as SIMULATE are much more efficient than SLINK.

The SLINK package consists of a simulation program (SLINK) and several analysis programs (MSIM, LSIM, ISIM, ELODHET) (this allows analyzing the data under models different from those under which the data were generated). As outlined below, the input files to these programs conform to the rules required by the LINKAGE programs.

2. OVERVIEW OF SLINK

The program SLINK is a general computer simulation program that employs a variation of the algorithm described by Ott (1989). Suppose there are N people in a pedigree. Let x = (x1, x2, ..., xN) represent the vector of phenotypes of the N individuals in the pedigree. Likewise, let g = (g1, g2, ..., gN) represent the vector of multilocus genotypes including phase information. Then the conditional probability distribution of the genotypes given the phenotypes may be calculated by a series of successive risk calculations:

    P(g | x) = P(g1 | x) P(g2 | g1, x) P(g3 | g1, g2, x) ...

In the SLINK simulation algorithm, we calculate the conditional probabilities (or risks) of all the possible multilocus genotypes with phase, P(g1|x), and, based on these, we randomly assign one of the genotypes to person 1. We then calculate P(g2 | g1, x), taking into account the genotype g1 just generated, and randomly assign a genotype to person 2, and so on. The process of successive risk calculations continues until all individuals in the pedigree have been assigned a multilocus genotype. This approach is quite efficient because once a multilocus genotype has been assigned to one individual, it is known in all subsequent steps. Note that this algorithm permits simulation conditional on any combination of phenotypic data. For example, if the pedigree were partially typed at a marker of interest, one could simulate conditional on both the disease phenotypes and the marker data currently available.

SLINK is based on the LINKAGE programs version 4.91 (Lathrop et al. 1984) and accepts slightly modified LINKAGE data files as explained below. The simulated pedigrees may be analyzed by either the standard LINKAGE programs or by the special companion programs (MSIM, ISIM, or LSIM), which have been modified to read one replicate of the pedigrees at a time, rather than reading in the entire pedigree file. In addition, the companion programs provide some simple statistical summaries.

As described above, the data are simulated conditional on the phenotypes given in the pedigree file, at the recombination fraction(s) given in the data file. The loci are conceptually divided into two categories: the 'trait' locus and the marker loci; the trait locus need not be an affection-status locus type but may be a codominant marker as well. There may be either one or no trait locus, while there may be as many marker loci as you desire. In practice, there is usually one trait locus and one marker locus. Multiple marker loci may be emulated by a single "virtual" marker that is tighly linnked with the disease locus. Generating genotypes at multiple marker loci tends to require much computer time.

3. INPUT FILES AND AVAILABILITY CODES

SLINK requires three input files:

  1. 'slinkin.dat' holding various parameter values, which previously had to be furnished interactively. No text is permitted between the values but text may follow the last input value (see example file). The values are:

  2. 'simdata.dat', which is a standard LINKAGE data file in MLINK-format.

  3. 'simped.dat', which is a standard LINKAGE pedigree file with an additional column inserted after the last phenotype column. This additional column in the SLINK pedigree file 'simped.dat' contains the availability code, which controls what type of phenotypes are written to the output file. These codes are consistent with the codes used by the simulation program SIMLINK (Boehnke 1986; Ploughman and Boehnke 1989).

-----------------------------------------------------------------
 Meaning of code for
 ----------------------------------------------------------
Code Markers Trait
-----------------------------------------------------------------
0 Unavailable Use orig. phenotypes as given in 'simped.dat'
1 Available Use simulated phenotypes
2 Available Use orig. phenotypes as given in 'simped.dat'
3 Unavailable Use simulated phenotypes
-----------------------------------------------------------------

A person should be coded as "marker unavailable" if that person will be assigned the phenotype "unknown" at each marker locus in the output file 'pedfile.dat'. This is appropriate if the person will not be typed for any markers because DNA samples are not available from that person (i.e., they are dead or uncooperative). Likewise, a person should be coded as "marker available" if that person will be assigned simulated marker phenotypes in the output file 'pedfile.dat'.

It is usually appropriate to "use original phenotypes" as given in 'simped.dat' at the trait locus, because most simulation studies of diseases are carried out conditional on the observed phenotypes. In this case, the trait phenotype remains as it was in the 'simped.dat' input file and availability codes are 2 for typed (available for genotyping at the marker locus) and 0 for untyped.

Keep in mind that these simulations are carried out conditional on the phenotypes given in the 'simped.dat' input file. Thus, if you want to indicate that someone is unavailable at the trait locus, make their original phenotype unknown in 'simped.dat' and use one of the two "Use original phenotypes" codes.

If a simulation was being carried out using marker data only, then only two availability codes are needed, as indicated in the table below. Every person who has an availability code of 0 will be given the 'unknown' phenotype at each marker locus. Every person who has an availability code of 1 will be given a simulated phenotype at each marker locus.

Code Meaning of code for Markers
-------------------------------------------------------------
0 Unavailable: Assign phenotype 'unknown' at each marker
1 Available: Assign simulated phenotypes at each marker

Note, however, that simulation of marker data only (perhaps with an unlinked disease locus) is carried out much more rapidly with the SIMULATE program.

Please read the example problem below, then start using SLINK with the exercise described in section 4.

EXAMPLE: Small 4-member family

Suppose we want to simulate the marker genotypes for two affected children in a simple nuclear family where the two parents must be carriers for a fully penetrant recessive disease, and they are known to have four different marker alleles (this is a way of specifying a highly polymorphic marker without the need for a large number of alleles). We first create a pedigree file called 'simpre.dat' which, after processing it with MAKEPED (one of the programs of the LINKAGE package), will become 'simped.dat' ('simpre.dat' is shown below; the comments beginning with <== are not part of the file). We use the availability code 2 for everyone, which means "Markers available, Use original trait phenotypes". Thus, the 'pedfile.dat' file (to be created by SLINK) will contain the original trait phenotypes as given in 'simpre.dat' (and 'simped.dat'). At the marker locus, person 1 will have marker genotype 1/2 and person 2 will be 3/4, since the simulations are carried out conditional on the phenotypes given in the 'simped.dat' file. The two affected children (persons 3 and 4) will be assigned simulated marker genotypes, since their original marker genotypes in the 'simped.dat' file are as yet unknown (0 0). This pedigree is used here for exercise purposes; there is only a small number of possible outcomes for the lod scores. More complicated examples for the use of SLINK may be found in Terwilliger and Ott (1994).

File: Simpre.dat

1 1 0 0 1 1 1 2 2 <== Father who is 1/2 at the marker
1 2 0 0 2 1 3 4 2 <== Mother who is 3/4 at the marker
1 3 1 2 2 2 0 0 2 <== First affected child
1 4 1 2 1 2 0 0 2 <== Second affected child

In the corresponding 'simdata.dat' file below, we give penetrances corresponding to a fully penetrant recessive disease at the trait locus and define a codominant four-allele system at the marker locus.

File: Simdata.dat

2 0 0 5 << NO. OF LOCI, RISK LOCUS, SEXLINKED (IF 1) PROGRAM
0 0.0 0.0 0 << MUT LOCUS, MUT MALE, MUT FEM, HAP FREQ (IF 1)
1 2
1 2 << AFFECTION, NO. OF ALLELES
0.9900 0.0100 << GENE FREQUENCIES Trait locus
1 << NO. OF LIABILITY CLASSES
0.0000 0.0000 1.0000 << PENETRANCES
3 4 << ALLELE NUMBERS, NO. OF ALLELES Marker locus
0.250000 0.250000 0.250000 0.250000 << GENE FREQUENCIES
0 0 << SEX DIFFERENCE, INTERFERENCE (IF 1 OR 2)
0.01000 << RECOMBINATION VALUES
1 0.01000 0.50000 << REC VARIED, INCREMENT, FINISHING VALUE

This example is referred to again later in this document.

4. Exercise: HOW TO USE SLINK

Example data: Leal and Ott (1990) have written a program for analytically calculating the expected lod scores in linkage analysis of autosomal recessive traits in nuclear families with various numbers of affected and unaffected offspring. For a recombination fraction of 0.01 between disease and marker, their program gives the following results for a nuclear family where both parents are carriers for the disease and they are known to have four different marker alleles (# aff = number of affected children, # unaff = number of unaffected children, E(Z) = expected lod score, SD = standard deviation):

       #aff    #unaff         E(Z)        SD
       ---------------------------------------
        0        2          0.00680     0.07874
        0        3          0.02356     0.13864
        0        4          0.05896     0.17959
        1        1          0.10094     0.15916
        1        2          0.20367     0.22856
        1        3          0.30776     0.28349
        1        4          0.41310     0.32936
        2        0          0.51759     0.33388
        2        1          0.62484     0.37704
        2        2          0.73311     0.41187
        2        3          0.84209     0.44147
        2        4          0.95157     0.46753
        3        0          1.05966     0.47806
        3        1          1.17014     0.49896
        3        2          1.28068     0.51888
        3        3          1.39127     0.53789
        3        4          1.50189     0.55609
        4        0          1.61202     0.55902
        4        1          1.72270     0.57644
        4        2          1.83342     0.59316
        4        3          1.94415     0.60931
        4        4          2.05490     0.62495

These analytic results may provide one way to check that SLINK is working correctly. We do this by carrying out a simulation to duplicate a portion of this table. Suppose we want to compare the average lod score from a nuclear family with two affected children to that from a nuclear family with three affected children (see the two underlined lines in table above). To duplicate the analytical results, we must assume that the disease is recessive with full penetrance. In addition, we assume that the two parents have four different marker alleles.

To carry out a simulation, proceed as follows (see also the flow chart at the end of this document):

  1. Create a 'simdata.dat' file defining the locus systems and the true thetas.

  2. Create a 'simped.dat' file defining the pedigree structure, phenotypes, and availability codes.

  3. Create the 'slinkin.dat' file defining parameters for the simulation (see section 3, above).

  4. Run SLINK to create a 'pedfile.dat' with the simulated data in it.

  5. Create a 'datafile.dat' defining how the simulated data should be analyzed.

  6. Run UNKNOWN to create a 'speedfile.dat' and an 'ipedfile.dat'.

  7. Run the appropriate analysis program, such as MSIM or LSIM.

These steps are described in detail below. Please read the entire explanation for each step before proceeding with that step.

Step 1) Create a 'simdata.dat' file defining the locus systems and the true recombination fraction values.

The file 'simdata.dat' defines the locus systems and the TRUE thetas (recombination fractions) under which the simulation will be carried out. These true θ (theta) values are given in the RECOMBINATION VALUES line; the subsequent line with increment and finishing value is irrelevant for SLINK. The 'simdata.dat' file must be in standard MLINK format and can be created with the PREPLINK program of the LINKAGE package.

Use PREPLINK to create a 'simdata.dat' file matching the one displayed above in the initial example. The first locus must be a recessive affection status locus with a common (normal) allele and a rare (disease) allele, where the latter may have a frequency of 0.001. The second locus should be an allele-number system with four equally frequent alleles. Choose the true θ to be 0.01, i.e. the marker is at a recombination fraction of 0.01 from the disease.

Suitable 'simdata.dat' file:

 2 0 0 5  << NO. OF LOCI, RISK LOCUS, SEXLINKED (IF 1) PROGRAM
 0 0.0 0.0 0 << MUT LOCUS, MUT MALE, MUT FEM, HAP FREQ (IF 1)
  1  2
1   2  << AFFECTION, NO. OF ALLELES
  0.99900  0.00100 << GENE FREQUENCIES
 1 << NO. OF LIABILITY CLASSES
 0 0  1.0000 << PENETRANCES
3   4  << ALLELE NUMBERS, NO. OF ALLELES
  0.25000  0.25000  0.25000  0.25000 << GENE FREQUENCIES
 0 0  << SEX DIFFERENCE, INTERFERENCE (IF 1 OR 2)
  0.010 << RECOMBINATION VALUES
 1 0.010 0.015 << REC VARIED, INCREMENT, FINISHING VALUE

Step 2) Create a 'simped.dat' file defining the pedigree structure, phenotypes, and availability codes for the following two pedigrees:

First use a text editor to create a pedigree file 'simpre.dat' analogous to the one shown above in the initial example. Include data for two pedigrees (see above): The first one having two affected children and the second having three affected children.

File 'simpre.dat':

1  1 0 0 1  1  1 2  2
1  2 0 0 2  1  3 4  2
1  3 1 2 2  2  0 0  2
1  4 1 2 1  2  0 0  2
2  1 0 0 1  1  1 2  2
2  2 0 0 2  1  3 4  2
2  3 1 2 2  2  0 0  2
2  4 1 2 1  2  0 0  2
2  5 1 2 1  2  0 0  2

Then use MAKEPED to create the 'simped.dat' pedigree file:

1 1 0 0 3 0 0 1 1  1  1 2 2  Ped: 1  Per: 1
1 2 0 0 3 0 0 2 0  1  3 4 2  Ped: 1  Per: 2
1 3 1 2 0 4 4 1 0  2  0 0 2  Ped: 1  Per: 3
1 4 1 2 0 0 0 2 0  2  0 0 2  Ped: 1  Per: 4
2 1 0 0 3 0 0 1 1  1  1 2 2  Ped: 2  Per: 1
2 2 0 0 3 0 0 2 0  1  3 4 2  Ped: 2  Per: 2
2 3 1 2 0 4 4 1 0  2  0 0 2  Ped: 2  Per: 3
2 4 1 2 0 5 5 2 0  2  0 0 2  Ped: 2  Per: 4
2 5 1 2 0 0 0 2 0  2  0 0 2  Ped: 2  Per: 5

The file 'simped.dat' defines the pedigrees to be simulated. It is a LINKAGE pedigree file with an additional column containing the availability codes after the last phenotype field. All simulations are carried out conditional on the phenotypes given in 'simped.dat'.

Step 3) Create the 'slinkin.dat' input file for SLINK holding the parameters defined in section 3, above. Please note that the random number seeds will be changed for every new simulation (SLINK will write a new 'slinkin.dat' file with updated random seeds after completing the simulation).

Example of 'slinkin.dat' file:

-531835011 1000 1  0.000000
Seed, #replicates, trait locus, proportion unlinked families
 

Step 4) Run SLINK to create a 'pedfile.dat' file containing the simulated data. The input files for SLINK are 'simdata.dat', 'simped.dat' and 'slinkin.dat' (see the flow chart at the end of this document).

SLINK outputs the simulated data in the file 'pedfile.dat', while the parameters (as described above) and the θs are written to 'simout.dat'. When one of the companion analysis programs is run, it will put the information from the most recent 'simout.dat' into the output file it creates (see flow chart, appended). Note that the 'pedfile.dat' is a bona-fide LINKAGE pedigree file and can be analyzed by any of the regular LINKAGE programs. However, it may contain many replicates of the pedigrees (one set after another), so that in most cases it is more practical to use the modified versions of the LINKAGE programs (MSIM, ISIM, LSIM) that are designed to analyze the data a replicate at a time.

Step 5) Create a 'datafile.dat' defining how the simulated data should be analyzed.

The analysis programs require five input files (see flow chart). Two are made by UNKNOWN ('ipedfile.dat' and 'speedfile.dat', see step 6). One is made by SLINK ('simout.dat', see step 4). The fourth, 'datafile.dat', must be made using PREPLINK prior to running the desired analysis program. The fifth input file, 'limit.dat', contains three threshold values (e.g., 1 2 3) used to determine the proportion of replicates exceeding a given lod score limit.

The 'datafile.dat' is a standard LINKAGE data file which determines how the analyses of the simulated data will be carried out. It must be in MLINK format for MSIM; ILINK format for ISIM; and LINKMAP format for LSIM.

The example 'datafile.dat' below is in MLINK format and is thus appropriate for input into MSIM (the modified version of MLINK). For this exercise, the easiest way to create a 'datafile.dat' is to copy 'simdata.dat' to 'datafile.dat', with the last two lines changed appropriately. To match the results of Leal and Ott (1990) you must analyze at the same θ (= 0.01) as that used in the simulation. In the interest of time, modify the 'datafile.dat' so that MSIM will evaluate the average lod scores only at the true θ (= 0.01). This can be done by specifying a starting value of the recombination fraction of 0.01, an increment value of 0.01 and a finishing value of 0.015.

File 'datafile.dat':

 2 0 0 5  << NO. OF LOCI, RISK LOCUS, SEXLINKED (IF 1) PROGRAM
 0 0.0 0.0 0 << MUT LOCUS, MUT MALE, MUT FEM, HAP FREQ (IF 1)
  1  2
1   2  << AFFECTION, NO. OF ALLELES
  0.99900  0.00100 << GENE FREQUENCIES
 1 << NO. OF LIABILITY CLASSES
 0 0  1.0000 << PENETRANCES
3   4  << ALLELE NUMBERS, NO. OF ALLELES
  0.25000  0.25000  0.25000  0.25000 << GENE FREQUENCIES
 0 0  << SEX DIFFERENCE, INTERFERENCE (IF 1 OR 2)
  0.01 << RECOMBINATION VALUES
 1 0.01 0.015 << REC VARIED, INCREMENT, FINISHING VALUE

NOTE: If you choose the increment size too small, you will get an error message that maxpnt is too small. Maxpnt is the number of θs (or map positions in LSIM) at which the lod score is evaluated. You may fix this problem by either choosing a larger increment or by inreasing maxpnt and recompiling. In general, a staring value of 0.0001, increment values of 0.05 or 0.10, and a finishing value of 0.45 are suitable.

Step 6) Run UNKNOWN to create a 'speedfile.dat' and an 'ipedfile.dat' file.

Once the 'pedfile.dat' has been created by SLINK, then it is necessary to process this file with the program UNKNOWN (of the LINKAGE package) before running any of the analysis programs. UNKNOWN creates the pedigree file 'ipedfile.dat' and the 'speedfile.dat' ('speedfil.dat' on Windows machines). These two files are needed for input into MSIM, ISIM, or LSIM.

Step 7) Create a lod score threshold file, 'limit.dat', for the appropriate analysis program.

The analysis programs MSIM, ISIM, and LSIM each provide statistical output as to the proportion of replicates with maximum lod scores greater than a given constant, which is an approximation to power. One may want to know what is the probability of getting a lod score over 3, for example, in a given dataset. The user is given the chance to pick three critical lod scores, and the program will keep track of the number of times the given lod scores were exceeded in the simulated pedigrees. The user must produce such a file in a text editor, consisting of three such lod score thresholds, on the same or on separate lines. For the purposes of this exercise, use the thresholds 1, 2, and 3.

File 'limit.dat':

1 2 3

Step 8) Run the appropriate analysis program. The subsequent sections show different ways of analyzing the simulated data.
 

5. TYPICAL APPLICATIONS

The analysis programs, MSIM, LSIM, and ISIM, each require three input files:

  1. A parameter file called 'limit.dat'. This file holds three thresholds/limits for the maximum lod score; the programs will approximate the probability with which the maximum lod score exceeds each of the three thresholds. Typical threshold values are 1, 2, and 3. Note that numbers must have at least one digit to the left of any decimal point, for example, 0.5. A number given as .5 would lead to an error.

  2. A locus file, 'datafile.dat', which is analogous to the one used for the LINKAGE programs. You may simply copy 'simdata.dat' to 'datafile.dat' and modify the 'datafile.dat' file to correspond (i) to the analysis program to be used and (ii) to reflect the θ values at which analysis is to be carried out. Note that 'datafile.dat' is also used as an input file to the UNKNOWN program.

  3. A pedigree file, 'ipedfile.dat', created by the UNKNOWN program.

Details of program usage are given below.

5A. MSIM: Approximating the expected lod score

First we show how to use MSIM, which summarizes its results in the file 'msim.dat'. The first part of 'msim.dat' contains information defining the simulation, such as the random number seed, the number of replicates, the requested proportion of unlinked families, and the trait locus number. This information is taken from the most recent 'simout.dat'. Note that the θs and the locus order presented in the 'simout.dat' section pertain to the model under which the simulation was carried out (as previously specified in the 'simdata.dat' file). These may differ from the θs and locus order used in the analysis of the simulated data. The rest of 'msim.dat' provides statistical information.

If two loci are used, then the results are reported on the traditional lod score scale. When three or more loci are used, multipoint lod scores are computed as the log likelihood at the current θs minus the log likelihood with the θ involving the trait locus set to 0.50. For this reason, when MLINK or MSIM are used, the trait locus must be either the leftmost or rightmost locus for these statistics to make sense. Normally, MSIM, like MLINK, will be used for analyses involving only two loci.

Run MSIM in order to estimate the expected lod score for the data analysed at the true value of the recombination fraction (= 0.01). MSIM will produce the output file 'msim.dat' which will contain a table similar to this:

 ********* Data from most recent SIMOUT.DAT *********
 The random seeds are: 17543 23415  2177
 The number of replications is:   200
 The requested proportion of unlinked families is:  0.000
 The trait locus is locus number:   1
    Summary Statistics about simped.dat
 Number of pedigrees       2
 Number of people          9
 Number of females         5
 Number of males           4
 There were   0 in category:  Marker Unknown; Trait original
 There were   0 in category:  Marker Available; Trait simulated
 There were   9 in category:  Marker Available; Trait original
 There were   0 in category:  Marker Unknown; Trait simulated
LINKAGE (V4.91) WITH  2-POINT AUTOSOMAL DATA
-----------------------------------
 LINKED ORDER OF LOCI:   1  2
-----------------------------------
-----------------------------------
TRUE THETAS FOR LINKED ORDER    0.010000
-----------------------------------
 UNLINKED ORDER OF LOCI:   1  2
-----------------------------------
TRUE THETAS FOR UNLINKED ORDER    0.500000
-----------------------------------
 Elapsed Time for one replicate =  0.22000 seconds
  Elapsed Time =   30.32000 seconds or     0.51 minutes.
-----------------------------------
Actual proportion of unlinked families:  0.000
 ********* End of most recent SIMOUT.DAT *********

 ORDER OF LOCI:   1  2
 Average Lod Scores at Given Thetas
Number of replicates =    200
------------------------------------------------------------------
THETAS  0.010
------------------------------------------------------------------
Pedigree      Average        StdDev           Min            Max
     1       0.516902       0.332081      -1.109958       0.584688
     2       1.058454       0.472900      -0.813337       1.177930
Study        1.575356       0.592840      -1.923295       1.762618
------------------------------------------------------------------

For a family with two affected children, the simulated mean lod score is 0.516902, which is only slightly different from the analytical answer of 0.51759. Likewise, for a family with three affected children, the simulated mean lod score is 1.058454 and the analytical answer is 1.05966. Note that your simulated means will differ slightly from the ones shown here depending on the seeds chosen.

Brief explanation of MSIM output:
The file 'msim.dat' contains information defining the simulation, followed by some tables providing statistical information about the distribution of the simulated lod scores. The 'Average' column provides the expected (or mean) lod score by pedigree and by study (i.e., set of families). The 'StdDev' column provides the standard deviation of the lod score. The 'Min' column lists the minimum lod score encountered in all the replicates, and the 'Max' column lists the maximum lod score encountered in all the replicates (and when the pedigrees are used together in one study). Note that the only columns in which the Study value will equal the sum of the Pedigree values is the Average column (in this example, the minimum and maximum values also add up — the data set is too small — but this is not generally the case). If one is interested in approaching the absolute smallest (or absolute largest) lod score in the whole study, one should add up the min. (or max.) values (if all have the same sign) over pedigrees rather than looking at the Study min. (or max.) value.

5B. ISIM: Average maximum lod score and power

With ISIM (as with ILINK), the recombination fraction between two loci can be iterated and, therefore, the lod score obtained for a given replicate is maximized over the θ values. Thus, ISIM can approximate the average (expectation) of the maximum lod score.

In our example application of ISIM, we use the files from the initial example, where we simulated the marker genotypes for two affected children in a simple nuclear family where the two parents are known to have four different marker alleles. The 'simdata.dat' and 'simped.dat' have been explained above. The 'datafile.dat' must be in ILINK format. If the recombination fraction between the two loci is iterated, then the average of the maximum lod score will be calculated. If the recombination fraction is not iterated, then the average of the lod score (unmaximized) will be calculated at the requested θs. In the file below, we request that the recombination fraction be iterated. Note: ISIM may not maximize the log likelihood over θ correctly if the disease gene frequencies are extreme (even when they are fixed). You may also want to designate locus 1 or locus 2 as the locus with iterated parameters.

File 'datafile.dat':

2 0 0 3 << NO. OF LOCI, RISK LOCUS, SEXLINKED (IF 1) PROGRAM
0 0.0 0.0 0 << MUT LOCUS, MUT MALE, MUT FEM, HAP FREQ (IF 1)
1 2
1 2 << AFFECTION, NO. OF ALLELES
0.9900 0.0100 << GENE FREQUENCIES
1 << NO. OF LIABILITY CLASSES
0.0000 0.0000 1.0000 << PENETRANCES
3 4 << ALLELE NUMBERS, NO. OF ALLELES
0.250000 0.250000 0.250000 0.250000 << GENE FREQUENCIES
0 0 << SEX DIFFERENCE, INTERFERENCE (IF 1 OR 2)
0.01000 << RECOMBINATION VALUES
1 << THIS LOCUS MAY HAVE ITERATED PARS
1

When we run ISIM, we obtain the following output, which reports the average of the maximum lod score.
File 'isim.dat' (edited):

********* Data from most recent SIMOUT.DAT *********
The random seeds are: 29876 28996 12989
The number of replications is: 200
The requested proportion of unlinked families is: 0.000
The trait locus is locus number: 1
 Summary Statistics about simped.dat
Number of pedigrees 1
Number of people 4
Number of females 2
Number of males 2
There were 0 in category: Marker Unknown; Trait original
There were 0 in category: Marker Available; Trait simulated
There were 4 in category: Marker Available; Trait original
There were 0 in category: Marker Unknown; Trait simulated
LINKAGE (V4.91) WITH 2-POINT AUTOSOMAL DATA
-----------------------------------
LINKED ORDER OF LOCI: 1 2
-----------------------------------
-----------------------------------
TRUE THETAS FOR LINKED ORDER 0.010000
-----------------------------------
Actual proportion of unlinked families: 0.000
********* End of most recent SIMOUT.DAT *********
Average Maximum Lod Score
The number of replicates is 200
--------------------------------------------------------
Average Maximum | StdDev | Min | Max
 0.576825 0.117744 0.000000 0.600859
--------------------------------------------------------
--------------------------------------------------------
 Number of maximum lod scores greater than
 a given constant
--------------------------------------------------------
Constant Number Percent
 1 161 80.500 (assumed values)
 2 0 0.000
 3 0 0.000
--------------------------------------------------------

Note that there is no θ value reported with the Average Maximum Lod Score. This is because each maximum lod score (in a replicate) may correspond to a different estimate of θ. Also, note that the average maximum lod score is not exactly additive over families or studies. The average lod score (ELOD), calculated previously, does possess this desirable property of additivity (see Ott, 1999).

The last table appearing in 'isim.dat' gives the number and percentage of replicates in which the maximized lod score was found greater than the given constants. The proportion of replicates with Zmax ³ 3 is an approximation to the power, i.e. to finding a significant linkage.

5C. MSIM: Interpolating Zmax in each replicate

The average of the maximum lod score (calculated by ISIM) may be much more time consuming to compute, because maximization of the lod score requires many evaluations of the lod score. To partially avoid this problem, MSIM can estimate maximum lod scores by quadratic interpolation. The quadratic interpolation option works automatically, but only in the special case where there are two loci, autosomal inheritance, and the 'datafile.dat' for MSIM requests evaluation of the lod score at three or more distinct points. To get the best results from the quadratic interpolation, you should evaluate the lod score at more than three points. Also, it is better not to choose θ = 0.0 as one of the points, because if there is an obligate recombination, then the lod score is negative infinity (use something like θ = 0.00001 or 0.001 instead). For example, if we run the problem above, starting θ at 0.01 and increasing it in increments of 0.10 up to 0.41, for a total of 5 points, we get the results seen below, which are very similar to those obtained above by ISIM:

From the file: 'msim.dat'

----------------------------------------------------------------
Average Maximum Lod Scores based on quadratic interpolation
----------------------------------------------------------------
Pedigree Average StdDev Min Max
 1 0.579848 0.118361 0.000000 0.604008
----------------------------------------------------------------

5D. Finding the peak of the expected lod score curve

Theoretically, the expected lod score should have its maximum (the ELOD) at the true recombination fraction. In a simulation, one may verify this by computing the average lod score at several assumed θ values in the vicinity of the true recombination fraction under which the data are generated. The peak should occur at or near the true recombination fraction. MSIM or LSIM may be used for these calculations.

5E. MSIM and ELODHET - analysis under heterogeneity

As outlined above, SLINK can generate data under heterogeneity (α < 1), where α is the proportion of families with linkage. However, if families generated with α < 1 are analyzed with MSIM, which assumes α = 1, the resulting lod scores are too low compared with those obtained under heterogeneity. Also, the estimates of θ are biased upwards when heterogeneous data are analyzed under homogeneity.

When data are generated under heterogeneity, it is usually most appropriate to also analyze them under heterogeneity. The ELODHET program accomplishes this task. It carries out two types of analyses:

1) For the given values of α (proportion of families with linkage) and θ (recombination fraction in families with linkage) under which the family data were generated, ELODHET calculates expected lod scores Z for each family and, for all families jointly, the probability (power) that the lod score at α and θ exceeds given thresholds (as furnished in the 'limit.dat' file). The expected lod score under heterogeneity is defined as Z(α,θ) = log10[L(α,θ)] - log10[L(1, 0.5)], where L is the usual likelihood under heterogeneity, that is, for the i-th family, Li(α,θ) = α Li(θ) + 1 - α, and Li(θ) is the antilog of the i-th family's lod score.

2) Expected maximum lod scores are calculated by evaluating in each replicate, for all families jointly, the maximum likelihood over a grid of values of α and θ (irrespective of the α and θ values used to generate the data). That is, each simulated replicate furnishes Zmax(α,θ) = log10[Lmax(α,θ)] - log10[L(0, 0.5)], where α and θ are estimates obtained in a given replicate. For the Zmax values, their average is computed and the proportion of replicates in which Zmax exceeds given thresholds (power). Note that Zmax involves two degrees of freedom, and the associated lod score threshold is thus somewhat less conservative than that for the lod score under homogeneity. In this sense, the Zmax under heterogeneity is comparable to obtaining separate estimates for male and female recombination fractions.

To use the ELODHET program, one must first run SLINK and MSIM. For the latter run, lod scores should be evaluated at suitable intervals, for example, starting at θ = 0.00001 and increasing in steps of 0.05 with a limiting value of 0.47. In its present implementation, the step size for α in the analysis is taken to be 0.025. As in the HOMOG programs, no interpolation is carried out to approximate maximum lod scores. The ELODHET program reads the LODFILE.DAT file created by MSIM and furnishes its results in an output file called ELODHET.DAT.



6. TECHNICAL INFORMATION

6A. Program constants

 Program Purpose/Description
-----------------------------------------------
Simulation SLINK General simulation program
Analysis MSIM Modified version of MLINK
 ISIM Modified version of ILINK
 LSIM Modified version of LINKMAP

The analysis programs, MSIM, ISIM, and LSIM, read the number of pedigrees per replicate (i.e. study) from the summary file 'simout.dat' produced by SLINK.

There are several constants that may be changed to modify the behavior of the programs. This requires editing the source code followed by recompilation.

A standard version of one of the LINKAGE programs, such as MLINK, produces a lot of output. For the purposes of conserving disk space and speeding up the SLINK programs, most of the 'standard' output has been turned off by a logical flag. This constant is called 'prn' or 'print' and is identified in the source code of each program by a comment. Set the constant equal to TRUE if you want more output.

For an affection status locus, the unaffected individuals are indicated by the integer constant assigned to 'unaff'.

In the LINKAGE programs, the phenotype at a quantitative locus may consist of several measurements (or traits). However, SLINK supports only ONE measurement (or trait) for a quantitative locus. Unaffected males at a sex-linked locus are indicated by the integer constant assigned to 'unafqu'.

MSIM will write a list of lod scores by pedigree to the file 'lodfile.dat' if the constant 'lodprint' is set to TRUE. However, it only writes out the male θs. This may be useful if you wish to produce a histogram of lod scores.

6B. References for SLINK

Please use the references, Ott (1989) and Weeks et al. (1990), when reporting results based on SLINK.

6C. Acknowledgments

We would like to thank Joseph Terwilliger and Suzanne Leal for their helpful comments. Support by grants MH44292 from NIMH and HG00008 from the National Human Genome Research Institute are gratefully acknowledged.
 

7. LITERATURE

Boehnke M (1986) Estimating the power of a proposed linkage study: a practical computer simulation approach. Am J Hum Genet 39:513-527

Lange K, Matthysse S (1989) Simulation of pedigree genotypes by random walks. Am J Hum Genet 45:959-970

Lathrop GM, Lalouel JM, Julier C, Ott J (1984) Strategies for multilocus linkage analysis in humans. Proc Natl Acad Sci USA 81:3443-3446

Leal SM and Ott J (1990)  Expected lod scores in linkage analysis of autosomal recessive traits for affected and unaffected offspring. Am J Hum Genet 47:A188 (abstract)

Ott J (1985) Analysis of human genetic linkage, first edition. Johns Hopkins University Press, Baltimore

Ott J (1989) Computer-simulation methods in human linkage analysis. Proc Natl Acad Sci USA 86:4175-4178

Ott J (1999) Analysis of Human Genetic Linkage, 3rd edition. Johns Hopkins University Press, Baltimore and London

Ploughman LM, Boehnke M (1989) Estimating the power of a proposed linkage study for a complex genetic trait. Am J Hum Genet 44:543-551

Sandkuyl L, Ott J (1989) Determining informativity of marker typing for genetic counseling in a pedigree. Hum Genet 82:159-162

Sherrington R, Brynjolfsson J, Petursson H, Potter M, Dudleston K, Barraclough B, Wasmuth J, Dobbs M, Gurling H (1988) Localization of a susceptibility locus for schizophrenia on chromosome 5. Nature 336:164-167

Terwilliger JD, Ott J (1994) Handbook of Human Genetic Linkage. Johns Hopkins University Press, Baltimore

Weeks DE, Ott J, Lathrop GM (1990) SLINK: a general simulation program for linkage analysis. Am J Hum Genet 47:A204 (abstr)
 
 
 

Flow chart for the SLINK program