Jurg Ott / 24 July 2017
Rockefeller University New York (ott@rockefeller.edu)

# NOCOM program

## INTRODUCTION

This document describes how to use the NOCOM program, which estimates parameters (means, variance, proportions of components) of a mixture of normal distributions for independent observations (quantitative data). Power transformation is optional, with estimation of the exponent as well. Thus, for example, lognormal distributions may also be analyzed. The estimation is iterative and is based on a 'counting' (EM) algorithm (Hasselblad 1966, Ott 1979), starting from a set of initial parameter values furnished by the user. NOCOM is written in Fortran and has been compiled with the GNU G77 compiler (GCC-2.95 version) in Windows (nocom.exe, and GNU gfortran in Linux (Kubuntu) (nocom, make executable). The maximum number of observations is currently set at 1,000,000.

As mentioned above, for a given value of the exponent in the power transformation, the parameters are estimated iteratively by the EM algorithm. When also the exponent has to be estimated, its value is changed by an initial step size of D. At each new exponent value, the other parameters are first approximately adjusted for the change in the exponent and then estimated by the counting method. The step size for the exponent is halved whenever the direction of the search changes until the step size falls below the tolerance TOL. If the exponent exceeds the lower bound, RLB, or the upper bound, RUB, an error message is issued. In programming, special care was taken to allow for initial mean estimates to be far from their maximum likelihood value.

Note: With 3 components, the user has the option to impose Hardy-Weinberg equilibrium on the proportions (weights) of the three components.

## ESTIMATING MIXTURE PARAMETERS

To estimate the parameters such as means and proportions, suitable starting values are required. A sensible approach at finding starting values is to produce a histogram of the data and visually identify component means (e.g., using the HIST program). A common standard deviation may be estimated by focussing on one of the components, preferably the predominant one. One imagines a normal density curve for this component and identifies its maximum point and a point of inflection at either side of the maximum. The (horizontal) distance between point of maximum and one of the inflection points is one standard deviation. In practice, starting with too high a value for the standard deviation is better than when the starting value is too small. An example is given below.

## TESTING FOR A MIXTURE

Consider the null hypothesis, H0: µ1 =
µ2, and the alternative hypothesis, H1: µ1µ2, where µi is the mean of the i-th component. To test whether H1 is significantly better than H0, one looks at the log likelihood obtained under each hypothesis. Then, one forms the statistic, G2 = 2[ln(L1) – ln(L0)], where Li is the maximum likelihood obtained under the i-th hypothesis.

Because
µ1 = µ2 forces p1 = 1, G2 does not have an asymptotic chi-square distribution. Often, it is nonetheless taken to approximately be chi-squared with 2 df. Computer simulations show, however, that the p-value so obtained is somewhat too small, that is, the test based on the chi-square distribution with 2 df is liberal (nonconservative) (Thode et al. 1988).

For a chi-square test with 2 df at the 5% significance level, the threshold of the test statistic is equal to 5.99. Instead of this limit, according to Thode et al (1988), the somewhat higher limit, 6.08 + 4.51/√n, should be used, where n is the number of observations.

## DATA INPUT

Observations (all larger than zero except when a constant exponent of 1 is specified; see below) are expected in a file (e.g., one observation per line) to be named by the user. Output will be written to the nocomresults.txt file. Parameter values are entered interactively as follows:
1. Exponent, e, for power transformation of original values (y) into transformed values (x), where x = (ye – 1)/e + e. Enter e = 1 for no transformation
2. Estimate exponent (1) or keep it constant (0)?
3. Number, NC, of components of the normal mixture
For a single component (NC = 1), skip items 4 through 11 below
1. Fixed ratios of standard deviations (2) or separate standard deviation for each component (1)? Note that parameter estimation is basically unstable when separate standard deviations are estimated for each component. Thus, option 2 is recommended. A special case of this option is that standard deviations are the same for each component. While the user will still enter possibly different standard deviations, the program will estimate a single variance factor by which each variance is increased or decreased.
Items 5 through 11, below, must be repeated for each component:
---------
1. Mean of ith component
2. Estimate this mean (1) or keep its value constant (0)?
For variance option (1) [see 4)]:
1. Standard deviation for ith component
2. Estimate this standard deviation (1) or keep it constant (0)?
For variance option (2):
1. Standard deviation for 1st component, or ratio of standard deviation to that of 1st component
2. Proportion (weight) of ith component
3. Estimate this proportion (1) or keep constant (0)?
----------

For variance option (2):
1. Estimate standard deviations (1) or keep constant (0)?
2. Enter one of the following codes for further calculations:
0 to stop
1 to start over (data will be read again)
2 for new initial values with previously used data
3 for new exponent only, same data as before.

The program does all computations on transformed values, x. Input (initial estimates) and output are also in terms of the transformed (x) rather than the original (y) observations.
Note: Exponent e = 1 is equivalent to x = y; e = 0 is equivalent to x = ln(y); e = 0.5 is equivalent to x = √y (square root transformation).

The program dimensions are set to accommodate up to 10,000 observations and up to 9 components of a mixture of normal distributions. At each 25th iteration or whenever the final iteration is reached, results are printed (the iteration number is the first number printed). With exponent estimation, the final iteration may not be the one with the highest log likelihood obtained. Therefore, the maximum log likelihood and its associated estimated exponent are printed after the last iteration.

As a training set, 200 observations are provided in the nocomdata.txt file (to run this automatically see instructions in the nocom.param file). These have been obtained using a random number generator for normal variables, with the standard deviation being equal to 2. The first 150 observations have mean 10, the next 50 observations have mean 15. As a general guideline, to obtain reasonable starting values for means, standard deviation and proportions, one first looks at the distribution of these 200 observations using the HIST program. With 20 classes, one sees two modes and might come up with starting values as shown in the following table. Running NOCOM with a constant exponent of 1, two components, common standard deviation (option 2), one obtains the following results:

 Mean 1 Mean 2 Common std.dev. Prop. 1 Prop. 2 Ln(lik) True values 10 15 2 0.75 0.25 1 component 11.16 – 3.07 1 – -324.490 Starting values, 2 components 10.5 17.0 2.5 0.80 0.20 Final values, 2 components 9.98 15.53 2.07 0.79 0.21 -315.053

The test statistic is then G2 = 2 × (324.490 – 315.053) = 18.874, which is larger than the 5% significance threshold of 6.08 + 4.51/√(200) = 6.399. Thus, there is significant evidence for the presence of two components with different means.

A copy of an interactive session is shown below. Prompts by the NOCOM program are given in red.

G:\pr\Util>nocom

╔═══════════════════════════════════════╗
║                                       ║
║   NOCOM program                       ║
║   Copyright (c) 1987-2016 Jurg Ott    ║
║                                       ║
╚═══════════════════════════════════════╝

The following maximum values are in effect:
MN = 900000 observations
MC =      4 components

See program manual at

Reference the following paper when using Nocom:
Ott J (1979) Detection of rare major genes in lipid levels.
Hum Genet 51, 79-91

Opening "nocom.dat" file for input

Opening "nocom.out" file for output
Using     200 observations

ENTER EXPONENT
1
ESTIMATE EXPONENT (1) or keep it constant (0)?
0
ENTER NUMBER OF COMPONENTS (0 < n < 4)
2
Fixed ratios of standard deviations (2; recommended)
or separate standard deviation for each component (1)?
2

ENTER STARTING PARAMETER VALUES FOR EACH COMPONENT

Mean for component 1:
10.5
Estimate this mean (1) or keep its value constant (0)?
1
Standard deviation for component 1:
2.5
Weight (probability) of component 1:
.8
Estimate weight (1) or keep value constant (0)?
1

Mean for component 2:
17
Estimate this mean (1) or keep its value constant (0)?
1
Ratio of standard deviation to that of comp. 1:
1
Weight (probability) of component 2:
.2
Estimate weight (1) or keep value constant (0)?
1
Estimate standard deviations (1) or keep constant (0)?
1
STARTING VALUES
0  MEANS           10.500000           17.000000
STD.DEV            2.500000            2.500000
PROP.            0.800000            0.200000
EXPO LN(L)              1.0000             -322.677947

23  MEANS            9.981047           15.526054
STD.DEV            2.071605            2.071605
PROP.            0.787352            0.212648
EXPO LN(L)              1.0000             -315.052901

ENTER  0 TO STOP,
1 Start over (data will be read again)
2 Start over, retaining current (transformed) observations
3 New exponent only, same data, with parameter
adjustment due to change of exponent
0

## REFERENCES

Hasselblad V (1966) Estimation of parameters for a mixture of normal distributions. Technometrics 8, 431-444

Ott J (1979) Detection of rare major genes in lipid levels. Hum Genet 51, 79-91

Thode HC, Finch SJ, Mendell NR (1988) Simulated percentage points for the null distribution of the likelihood ratio test for a mixture of two normals. Biometrics 44, 1195-1201