LINKAGE programs for Linux

Jurg Ott, Rockefeller University, New York / 20 April 2022

In linkage analysis with family data, it is convenient and efficient to analyze one marker at a time so that large numbers of markers can be accommodated. Here I describe a shell program, runML (ML), that runs one marker at a time and computes the maximum lod score for each family and for all families combined. ML runs under Linux (e.g., Kubuntu) and is based on fastlink, so it automatically resolves loops (see the included document by Dr. A. Shaffer, "Unknown.pdf"). Input data are expected in plink format as a transposed fileset with marker alleles coded 12. A sample dataset, cousin.tfam and cousin.tped, is provided; it is based on Figure 7-1 in our handbook [1].

[Figure: family pedigree with cousin marriage]
In a similar vein, I have wrapped the pseudomarker program in a shell program (PM) so that it, too, runs one marker at a time. Input to PM and ML is very similar; for details see the PM documentation. The main differences between the two approaches are as follows. PM maximizes log likelihoods over various parameters and, for example, treats marker allele frequencies as nuisance parameters, so that the resulting lod scores (maximized over θ) are virtually independent of marker allele frequencies. ML, on the other hand, estimates marker allele frequencies from the alleles seen in all individuals, that is, under no linkage. Also, ML reports lod scores for each family (and for all families combined), which is not possible in PM. Results from ML and PM tend to be comparable, but PM, while somewhat slower, is more accurate and can lead to higher lod scores. ML is based on the mlink program while PM is based on ilink. Considerable differences between ML and PM may occur in the presence of loops and recessive inheritance, where marker allele frequencies are crucial.

Sample data

Sample data (Figure 7-1) are provided in plink format. There are three markers (sequence variants) on chromosome 16, rs001, rs002, and rs003. For marker rs001, family members have the genotypes shown in the graph above. For marker rs002, genotypes are the same as for marker rs001 except that the genotype of SON1 is 1/2 instead of 1/1; for marker rs003, genotypes are the same as for marker rs001.

Results for the sample data are shown below.

Model 1:

line  chr  marker  bp      freq    F_A     all.fam  sumLodmax  chi2het  Ndf  all1  all2
1     15   rs001   180001  0.4231  0.9662  0.9662   0.9662     0        0    11    15
2     15   rs002   180011  0.3846  0.1646  0.1646   0.1646     0        0    10    16
3     15   rs003   180055  0.4231  0.9662  0.9662   0.9662     0        0    11    15

Model 2:

line  chr  marker  bp      freq    F_A     all.fam  sumLodmax  chi2het  Ndf  all1  all2
1     15   rs001   180001  0.4231  1.2886  1.2886   1.2886     0        0    11    15
2     15   rs002   180011  0.3846  0.3116  0.3116   0.3116     0        0    10    16
3     15   rs003   180055  0.4231  1.2886  1.2886   1.2886     0        0    11    15

Columns have the following meaning:

Model 2 furnishes a larger lod score because it allows for information from unaffected individuals.
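The freq values in the tables are consistent with freq = all1 / (all1 + all2), for example 11 / 26 = 0.4231 for rs001, assuming that all1 and all2 are the counts of alleles 1 and 2 over all individuals. The following is only a sketch of that calculation for a plink transposed file, not the code used inside ML; the function name allele_counts is hypothetical.

```shell
# Sketch: count alleles 1 and 2 per marker in a plink transposed file (.tped).
# Columns 1-4 are chromosome, marker ID, genetic distance, and bp position;
# the remaining columns hold two alleles per individual (0 = missing).
# Assumption: freq is the proportion of allele 1 among non-missing alleles.
allele_counts() {
    awk '{
        n1 = 0; n2 = 0
        for (i = 5; i <= NF; i++) {
            if ($i == "1") n1++
            else if ($i == "2") n2++
        }
        total = n1 + n2
        freq = (total > 0) ? n1 / total : 0
        # Output: marker ID, count of allele 1, count of allele 2, frequency
        printf "%s %d %d %.4f\n", $2, n1, n2, freq
    }' "$1"
}
```

Calling allele_counts cousin.tped prints one line per marker. A marker for which one of the two counts is zero is monomorphic.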

Running the programs

The programs are best run with a parameter file named on the command line. That is, you run the program in a terminal by typing, for example,
runML <runML.param
where the name of the parameter file can be anything you choose.

The sample parameter file included in the program package is as follows:

cousin
R
1
10
/home/joe/
/home/joe/runLINK/

EXPLANATIONS

Line 1: Name of data file, here cousin (omit ".tped"). The files *.tped and *.tfam must be in plink transposed format with alleles coded 12. No trailing text!

Line 2: D for dominant, R for recessive mode of inheritance.

Line 3: model 1 = affected-only analysis, 2 = incomplete penetrance. For details see PM documentation.

Line 4: Number of markers to run. If this is 0 (zero), or the number specified exceeds the actual number of markers present, all markers will be run.

Line 5: Folder into which a status report file will be written after every 2000 markers.

Line 6: Folder containing executables makepedpseudo, mlinkpseudo, and unknownpseudo.
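Because the parameter file must contain exactly these six lines with no trailing text, it can help to generate it with a small script. The sketch below is one possible way to do so; the function name write_param and its argument order are hypothetical and not part of the program package.

```shell
# Sketch: write a six-line parameter file in the layout described above.
# Arguments: dataset mode model n_markers report_dir exe_dir out_file
write_param() {
    # Line 2 must be D (dominant) or R (recessive)
    case "$2" in
        D|R) ;;
        *) echo "mode must be D or R" >&2; return 1 ;;
    esac
    # Line 3 must be 1 (affected-only) or 2 (incomplete penetrance)
    case "$3" in
        1|2) ;;
        *) echo "model must be 1 or 2" >&2; return 1 ;;
    esac
    # Line 4 must be a non-negative integer (0 = run all markers)
    case "$4" in
        ''|*[!0-9]*) echo "number of markers must be an integer" >&2; return 1 ;;
    esac
    # One value per line, nothing after the value
    printf '%s\n' "$1" "$2" "$3" "$4" "$5" "$6" > "$7"
}
```

For example, write_param cousin R 1 10 /home/joe/ /home/joe/runLINK/ runML.param reproduces the sample parameter file shown above.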

To run the program in the background, type, for example,
runML <runML.param > /dev/null 2>&1 &
disown

The second command has the effect that the program will keep running even after you log out. Progress can be monitored by displaying the report file that ML will be writing to the directory on line 5 after every 2000 markers analyzed. For example, you may simply type cat *.report.

The following programs must reside in the folder specified on line 6 of the parameter file: mlinkpseudo, unknownpseudo, makepedpseudo. Members of a given family must be contiguous. Monomorphic variants will be skipped. Family and other IDs can be alphanumeric.
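Two of these requirements can be checked before starting a long run. The following is a minimal sketch, assuming the executables folder and the *.tfam file are passed as arguments; the function names check_executables and check_contiguous are hypothetical and not part of the program package.

```shell
# Sketch: verify that the three required programs exist and are executable.
# Argument: folder given on line 6 of the parameter file
check_executables() {
    rc=0
    for prog in makepedpseudo mlinkpseudo unknownpseudo; do
        if [ ! -x "$1/$prog" ]; then
            echo "missing or not executable: $1/$prog" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Sketch: verify that members of each family occupy consecutive lines.
# Argument: a *.tfam file, whose first column is the family ID
check_contiguous() {
    awk '$1 != prev {
             # A family ID that reappears after another family is an error
             if ($1 in seen) {
                 print "family " $1 " is not contiguous (line " NR ")"
                 bad = 1
             }
             seen[$1] = 1
             prev = $1
         }
         END { exit bad + 0 }' "$1"
}
```

Both functions return a nonzero status on failure, so they can be chained in front of the run, for example check_executables /home/joe/runLINK && check_contiguous cousin.tfam && runML <runML.param.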

References

1 Terwilliger JD, Ott J: Handbook of human genetic linkage. Baltimore, Johns Hopkins University Press, 1994 (available as a pdf file).

2 Morton NE: The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am J Hum Genet 8:80-96, 1956.