Converting files from plink to sumstat format

Jurg Ott  26 Feb 2015


The plink program uses the LINKAGE input format, which is a widely used format in genetic linkage and association analysis. The sumstat suite of programs has been developed using a different input format. Specifically, it lists each SNP on a separate input line (row) rather than an input column. I have written a small utility program to convert formats from plink to sumstat and vice versa. Also, another small program, copylines, is discussed below. The program package may be downloaded as a zip file.

Using the p2s program

First, you need to ask plink to make a file in its alternate (transposed) format, that is, with rows rather than columns representing SNPs. Assume that your plink files are data.ped and data.map. Then you type one of the following commands, depending on your plink version:

plink --file data --transpose --recode12 (in plink version 1.07)
plink --file data --recode 12 transpose (in plink version 1.9)

or, if your data are in binary format, use --bfile data in place of --file data.
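
For orientation, the sketch below reads one row of a transposed plink file in Python. It relies on the standard plink tped layout (chromosome, SNP identifier, genetic distance, base-pair position, then two alleles per individual) and is not part of the p2s package. The names TpedRow and parse_tped_line, and the SNP identifier rs123, are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class TpedRow(NamedTuple):
    """One SNP (one row) of a transposed plink file."""
    chrom: str
    snp_id: str
    genetic_distance: float
    bp_position: int
    genotypes: List[Tuple[str, str]]  # one allele pair per individual


def parse_tped_line(line: str) -> TpedRow:
    """Split a whitespace-delimited tped row into its fixed and genotype parts."""
    fields = line.split()
    if len(fields) < 4 or (len(fields) - 4) % 2 != 0:
        raise ValueError("expected 4 leading columns plus an even number of alleles")
    alleles = fields[4:]
    # Pair consecutive alleles: columns 5-6 are individual 1, 7-8 individual 2, ...
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return TpedRow(fields[0], fields[1], float(fields[2]), int(fields[3]), genotypes)


if __name__ == "__main__":
    # Hypothetical row with three individuals, alleles coded 1/2 and 0 for missing.
    row = parse_tped_line("1 rs123 0 1000 1 2 2 2 0 0")
    print(row.snp_id, row.bp_position, len(row.genotypes))  # prints: rs123 1000 3
```

The number of allele pairs in each row should equal the number of lines in the matching tfam file, which is a quick consistency check before conversion.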

The plink program will then produce two files, plink.tped and plink.tfam. Run the conversion program, p2s, and follow its instructions. The program requires certain parameters before it can run. Below is a list of the questions the program will ask, along with suitable responses. Default responses, shown in parentheses, may be chosen by simply pressing the Enter key.

Name of tped file? (press Enter for plink.*)
     In response to this question you type, for example, data.tped. If your files are plink.tped and plink.tfam then both are read after you simply press the Enter key. Otherwise, the p2s program will ask you for the name of the tfam file.

Should SNPs be ordered by chromosomal position? (y/n)
     If you type y then the p2s program will order the resulting dataset by chromosomal position. It attempts to do this with the Windows sort command. Note: The sort command accepts a maximum record (line) length of 65535 characters, and the p2s program requests this maximum. If your data lines are longer than that limit, the sort command will issue a message saying "Input record exceeds maximum length". In this case, repeat the conversion process but answer n to this question. SNPs will then not be ordered by the p2s program, but they might already have been in chromosomal order in the original plink format.
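
Before choosing y or n, it can help to look at the tped file itself. The following Python sketch, which is not part of the p2s package, reports the longest line in a tped file and whether its rows are already in chromosomal order. It assumes that the lines handed to the sort command are roughly as long as the tped lines, which may not hold for the converted data; the function names and the ordering rule for non-numeric chromosomes are likewise assumptions.

```python
from typing import Tuple

SORT_MAX_RECORD = 65535  # limit quoted in the text for the Windows sort command


def chrom_key(chrom: str) -> Tuple[int, str]:
    """Numeric chromosomes sort by number; others (e.g. X, Y) follow, by name."""
    return (int(chrom), "") if chrom.isdigit() else (10**6, chrom)


def inspect_tped(path: str) -> Tuple[int, bool]:
    """Return (longest line length in characters, True if rows are already ordered)."""
    longest = 0
    ordered = True
    previous = None
    with open(path, "r") as handle:
        for line in handle:
            text = line.rstrip("\r\n")
            if not text.strip():
                continue  # ignore blank lines
            longest = max(longest, len(text))
            # Only the first four columns are needed: chrom, SNP id, cM, bp position.
            fields = text.split(None, 4)
            key = (chrom_key(fields[0]), int(fields[3]))
            if previous is not None and key < previous:
                ordered = False
            previous = key
    return longest, ordered


if __name__ == "__main__":
    longest, ordered = inspect_tped("plink.tped")
    print("longest line:", longest, "characters")
    print("may exceed sort limit:", longest > SORT_MAX_RECORD)
    print("already in chromosomal order:", ordered)
```

If the file is already ordered, answering n loses nothing; if the longest line is far below the limit, answering y should be safe.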

You may want to run the p2s program with a parameter file, for example, p2s.param. An example of such a file is included in the program package. It must contain 3 lines as follows:
1) Name of tped file
2) Name of tfam file
3) y or n (for yes or no for sorting SNPs by chromosomal position).
With this file, you run the program by typing, p2s <p2s.param.
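
When many datasets need converting, the parameter file can be written and fed to the program from a script. The Python sketch below writes the three lines described above and passes the file to p2s on standard input, which is what the < redirection does. It assumes the p2s executable can be found on the search path; write_param_file and run_p2s are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def write_param_file(path: str, tped: str, tfam: str, sort_snps: bool) -> str:
    """Write the 3-line parameter file: tped name, tfam name, y or n."""
    content = f"{tped}\n{tfam}\n{'y' if sort_snps else 'n'}\n"
    Path(path).write_text(content)
    return content


def run_p2s(param_path: str, executable: str = "p2s") -> int:
    """Run p2s with the parameter file on standard input (like p2s <p2s.param)."""
    with open(param_path, "r") as answers:
        return subprocess.run([executable], stdin=answers).returncode


if __name__ == "__main__":
    write_param_file("p2s.param", "plink.tped", "plink.tfam", sort_snps=False)
    print("p2s exit code:", run_p2s("p2s.param"))
```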

Converting from sumstat to plink format

An analogous program, s2p, allows for the reverse conversion. This program may still require some modification. If you encounter problems using it please send me email.

Working with very large files

In Windows, there is generally no text editor that can both handle very large files and display line and column numbers. The most suitable program, although it does not show line numbers, is probably WordPad. To look at only part of a large file, the package includes a small utility program, copylines, which copies a chosen range of lines and columns to a new file. That file should be much smaller and can be viewed with a regular editor such as the Crimson editor. The usage of copylines is self-explanatory. Notepad++ is also very nice and useful.
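
Where copylines is not at hand, the same idea can be sketched in a few lines of Python. This is not the copylines program itself, only an illustration of copying a range of lines and columns; the function name copy_region is made up, and it assumes that "columns" means character positions, counted from 1 and inclusive.

```python
from typing import Optional


def copy_region(src: str, dst: str, first_line: int, last_line: int,
                first_col: int = 1, last_col: Optional[int] = None) -> int:
    """Copy lines first_line..last_line, keeping characters first_col..last_col.

    All positions are 1-based and inclusive. Returns the number of lines copied.
    The source file is read one line at a time, so its size does not matter.
    """
    copied = 0
    with open(src, "r") as infile, open(dst, "w") as outfile:
        for number, line in enumerate(infile, start=1):
            if number < first_line:
                continue
            if number > last_line:
                break  # stop reading once the requested range has been copied
            text = line.rstrip("\r\n")
            outfile.write(text[first_col - 1:last_col] + "\n")
            copied += 1
    return copied


if __name__ == "__main__":
    # Copy the first 20 lines and the first 200 characters of each to a small file.
    n = copy_region("plink.tped", "excerpt.txt", 1, 20, 1, 200)
    print("lines copied:", n)
```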