2
forces p1 = 1, G² does not have an asymptotic chi-square distribution.
Nonetheless, it is often taken to be approximately chi-square
distributed with 2 df. Computer simulations show, however, that the
p-value so obtained is somewhat too small; that is, the test based on
the chi-square distribution with 2 df is liberal (nonconservative)
(Thode et al. 1988).
For a chi-square test with 2 df at the 5% significance level, the
critical value of the test statistic is 5.99. According to Thode et al.
(1988), the somewhat higher value 6.08 + 4.51/√n should be used
instead, where n is the number of observations.
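The adjusted critical value is easy to tabulate. The following Python sketch simply evaluates the expression 6.08 + 4.51/√n quoted above; the function name is hypothetical and is not part of NOCOM.

```python
import math


def thode_threshold_5pct(n: int) -> float:
    """Approximate 5% critical value for the test statistic: 6.08 + 4.51/sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    return 6.08 + 4.51 / math.sqrt(n)


# The adjusted value stays above the naive chi-square limit of 5.99
# and decreases toward 6.08 as n grows.
for n in (50, 200, 1000):
    print(n, round(thode_threshold_5pct(n), 3))
```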
DATA INPUT
Observations (all greater than zero, except when a constant exponent of
1 is specified; see below) are expected in a file (e.g., one
observation per line) to be named by the user. Output will be written
to the nocomresults.txt file. Parameter values are entered
interactively as follows:
1) Exponent, e, for power transformation of original values (y) into
   transformed values (x), where x = (y^e – 1)/e + e. Enter e = 1
   for no transformation
2) Estimate exponent (1) or keep it constant (0)?
3) Number, NC, of components of the normal mixture.
   For a single component (NC = 1), skip items 4 through 11 below
4) Fixed ratios of standard deviations (2) or separate standard
   deviation for each component (1)? Note that parameter estimation is
   basically unstable when separate standard deviations are estimated for
   each component. Thus, option 2 is recommended. A special case of this
   option is that standard deviations are the same for each component.
   While the user will still enter possibly different standard deviations,
   the program will estimate a single variance factor by which each
   variance is increased or decreased.
Items 5 through 11, below, must be repeated for each component:
---------
5) Mean of ith component
6) Estimate this mean (1) or keep its value constant (0)?
For variance option (1) [see 4)]:
7) Standard deviation for ith component
8) Estimate this standard deviation (1) or keep it constant (0)?
For variance option (2):
9) Standard deviation for 1st component, or ratio of standard
   deviation to that of 1st component
10) Proportion (weight) of ith component
11) Estimate this proportion (1) or keep constant (0)?
----------
For variance option (2):
12) Estimate standard deviations (1) or keep constant (0)?
13) Enter one of the following codes for further calculations:
    0 to stop
    1 to start over (data will be read again)
    2 for new initial values with previously used data
    3 for new exponent only, same data as before.
The program does all computations on transformed values, x. Input
(initial estimates) and output are also in terms of the transformed (x)
rather than the original (y) observations.
Note: Exponent e = 1 is equivalent to x = y; e = 0 is equivalent to x =
ln(y); e = 0.5 is equivalent to x = √y (square root
transformation).
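As an illustration of the note above, the transformation can be sketched in Python. This assumes the formula x = (y^e – 1)/e + e, with the natural logarithm as the limiting case at e = 0; the function name is hypothetical and the code is not taken from NOCOM.

```python
import math


def power_transform(y: float, e: float) -> float:
    """Transform y into x = (y**e - 1)/e + e (assumed form; ln(y) at e = 0)."""
    if e == 1:
        return y  # no transformation; y may then be zero or negative
    if y <= 0:
        raise ValueError("observations must be greater than zero unless e = 1")
    if e == 0:
        return math.log(y)  # limit of the formula as e approaches 0
    return (y ** e - 1.0) / e + e


# e = 0.5 yields a linear function of the square root of y, which is
# equivalent to the square root transformation for fitting purposes.
for e in (1, 0.5, 0):
    print(e, power_transform(4.0, e))
```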
The program dimensions are set to accommodate up to 10,000 observations
and up to 9 components of a mixture of normal distributions. At every
25th iteration, and whenever the final iteration is reached, results
are printed (the iteration number is the first number printed). With
exponent estimation, the final iteration may not be the one with the
highest log likelihood obtained. Therefore, the maximum log likelihood
and its associated estimated exponent are printed after the last
iteration.
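To give a feel for what such an iterative fit involves, here is a minimal Python sketch of maximum likelihood estimation for two normal components with a common standard deviation and a fixed exponent of 1. It uses the EM algorithm, a standard technique for normal mixtures; it is not NOCOM's source code, and the manual does not state that NOCOM iterates in exactly this way. The function names, random seed, tolerance, and simulated data are all assumptions made for illustration. The log likelihood here includes the normal constant, so absolute values need not match the Ln(L) printed by NOCOM; differences between log likelihoods are unaffected.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def log_likelihood(xs, m1, m2, sd, p1):
    """Log likelihood of the mixture p1*N(m1, sd) + (1 - p1)*N(m2, sd)."""
    return sum(
        math.log(p1 * normal_pdf(x, m1, sd) + (1.0 - p1) * normal_pdf(x, m2, sd))
        for x in xs
    )


def em_common_sd(xs, m1, m2, sd, p1, max_iter=1000, tol=1e-8):
    """EM iterations for two normal components sharing one SD (illustrative)."""
    n = len(xs)
    loglik = log_likelihood(xs, m1, m2, sd, p1)
    for iteration in range(1, max_iter + 1):
        # E-step: posterior probability that each value belongs to component 1.
        w = []
        for x in xs:
            a = p1 * normal_pdf(x, m1, sd)
            b = (1.0 - p1) * normal_pdf(x, m2, sd)
            w.append(a / (a + b))
        # M-step: weighted means, pooled variance, and mixing proportion.
        s1 = sum(w)
        m1 = sum(wi * x for wi, x in zip(w, xs)) / s1
        m2 = sum((1.0 - wi) * x for wi, x in zip(w, xs)) / (n - s1)
        var = sum(
            wi * (x - m1) ** 2 + (1.0 - wi) * (x - m2) ** 2 for wi, x in zip(w, xs)
        ) / n
        sd = math.sqrt(var)
        p1 = s1 / n
        new_loglik = log_likelihood(xs, m1, m2, sd, p1)
        converged = new_loglik - loglik < tol
        loglik = new_loglik
        if converged:
            break
    return m1, m2, sd, p1, loglik, iteration


# Simulated data (hypothetical seed): 150 values from N(10, 2), 50 from N(15, 2).
rng = random.Random(1)
data = [rng.gauss(10.0, 2.0) for _ in range(150)]
data += [rng.gauss(15.0, 2.0) for _ in range(50)]

start = (10.5, 17.0, 2.5, 0.8)  # mean 1, mean 2, common SD, proportion 1
start_loglik = log_likelihood(data, *start)
m1, m2, sd, p1, loglik, n_iter = em_common_sd(data, *start)
print(f"iterations={n_iter} means={m1:.3f}, {m2:.3f} sd={sd:.3f} "
      f"prop1={p1:.3f} ln(L)={loglik:.3f}")
```

Because the data are freshly simulated, the estimates will not reproduce any values reported for NOCOM's own data files.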
As a training set, 200 observations are provided in the nocomdata.txt file (to run this automatically see instructions in the nocom.param file).
These have been obtained using a random number generator for normal
variables, with the standard deviation being equal to 2. The first 150
observations have mean 10, the next 50 observations have mean 15. As a
general guideline, to obtain reasonable starting values for means,
standard deviation and proportions, one first looks at the distribution
of these 200 observations using the HIST program. With 20 classes, one
sees two modes and might come up with starting values as shown in the
following table. Running NOCOM with a constant exponent of 1, two
components, common standard deviation (option 2), one obtains the
following results:
|                               | Mean 1 | Mean 2 | Common std.dev. | Prop. 1 | Prop. 2 | Ln(lik)  |
|-------------------------------|--------|--------|-----------------|---------|---------|----------|
| True values                   | 10     | 15     | 2               | 0.75    | 0.25    |          |
| 1 component                   | 11.16  | –      | 3.07            | 1       | –       | -324.490 |
| Starting values, 2 components | 10.5   | 17.0   | 2.5             | 0.80    | 0.20    |          |
| Final values, 2 components    | 9.98   | 15.53  | 2.07            | 0.79    | 0.21    | -315.053 |
The test statistic is then G² = 2 × (324.490 – 315.053) = 18.874,
which is larger than the 5% significance threshold of
6.08 + 4.51/√(200) = 6.399. Thus, there
is significant evidence for the presence of two components with
different means.
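The arithmetic of this test can be checked with a few lines of Python. The log likelihoods are the values reported above for the one-component and two-component fits; the function and variable names are hypothetical.

```python
import math


def lr_statistic(loglik_null: float, loglik_alt: float) -> float:
    """Twice the gain in log likelihood of the larger model over the smaller one."""
    return 2.0 * (loglik_alt - loglik_null)


n_obs = 200
g2 = lr_statistic(loglik_null=-324.490, loglik_alt=-315.053)
threshold = 6.08 + 4.51 / math.sqrt(n_obs)  # adjusted 5% critical value

print(f"statistic = {g2:.3f}, threshold = {threshold:.3f}, "
      f"significant = {g2 > threshold}")
```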
A copy of an interactive session is shown below. Each user response appears on its own line directly after the corresponding NOCOM prompt.
G:\pr\Util>nocom
╔═══════════════════════════════════════╗
║                                       ║
║             NOCOM program             ║
║   Copyright (c) 1987-2016 Jurg Ott    ║
║                                       ║
╚═══════════════════════════════════════╝
The following maximum values are in effect:
MN = 900000 observations
MC = 4 components
See program manual at
http://www.jurgott.org/linkage/nocom.htm
Reference the following paper when using Nocom:
Ott J (1979) Detection of rare major genes in lipid levels.
Hum Genet 51, 79-91
Opening "nocom.dat" file for input
Opening "nocom.out" file for output
Using 200 observations
ENTER EXPONENT
1
ESTIMATE EXPONENT (1) or keep it constant (0)?
0
ENTER NUMBER OF COMPONENTS (0 < n < 4)
2
Fixed ratios of standard deviations (2; recommended)
or separate standard deviation for each component (1)?
2
ENTER STARTING PARAMETER VALUES FOR EACH COMPONENT
Mean for component 1:
10.5
Estimate this mean (1) or keep its value constant (0)?
1
Standard deviation for component 1:
2.5
Weight (probability) of component 1:
.8
Estimate weight (1) or keep value constant (0)?
1
Mean for component 2:
17
Estimate this mean (1) or keep its value constant (0)?
1
Ratio of standard deviation to that of comp. 1:
1
Weight (probability) of component 2:
.2
Estimate weight (1) or keep value constant (0)?
1
Estimate standard deviations (1) or keep constant (0)?
1
STARTING VALUES
    0  MEANS      10.500000   17.000000
       STD.DEV     2.500000    2.500000
       PROP.       0.800000    0.200000
       EXPO        LN(L)
       1.0000   -322.677947
   23  MEANS       9.981047   15.526054
       STD.DEV     2.071605    2.071605
       PROP.       0.787352    0.212648
       EXPO        LN(L)
       1.0000   -315.052901
ENTER 0 TO STOP,
1 Start over (data will be read again)
2 Start over, retaining current (transformed) observations
3 New exponent only, same data, with parameter
adjustment due to change of exponent
0
REFERENCES
Hasselblad V (1966) Estimation of parameters for a mixture of normal
distributions. Technometrics 8, 431-444
Ott J (1979) Detection of rare major genes in lipid levels. Hum Genet
51, 79-91
Thode HC, Finch SJ, Mendell NR (1988) Simulated percentage points for
the null distribution of the likelihood ratio test for a mixture of two
normals. Biometrics 44, 1195-1201