DESCRIPTION OF THE CANCER DATA FILE

The Prostate Cancer clinical trial data of Byar and Green (1980) is 
reproduced in Andrews and Herzberg (1985, pp 261-274). It is also 
available in Statlib at the URL 

http://lib.stat.cmu.edu/datasets/Andrews/T46.1

Some transformations and deletions have been made to get the version 
of this data set in Cancer.dat; these are described below.

This data set was obtained from a randomized clinical trial comparing 
four treatments for 506 patients with prostatic cancer. These patients 
had been grouped by physicians using clinical criteria into Stage 3 
and Stage 4 of the disease. This classification has not been used by 
Multimix.for to group the data, but it provides a useful reference 
against which to compare the groupings found by Multimix or any other 
clustering program. It could also be used as the basis for a 
discriminant analysis program.
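
As an illustration of such a comparison, the sketch below cross-tabulates 
the clinical stage against cluster labels. It is a minimal Python sketch 
and not part of Multimix: the function names and the short label lists 
are hypothetical toy values, not taken from the trial data.

```python
from collections import Counter


def cross_tabulate(stages, clusters):
    """Count patients for each (clinical stage, cluster label) pair."""
    if len(stages) != len(clusters):
        raise ValueError("stages and clusters must have the same length")
    return Counter(zip(stages, clusters))


def print_table(counts):
    """Print one row per stage and one column per cluster label."""
    stage_values = sorted({stage for stage, _ in counts})
    cluster_values = sorted({cluster for _, cluster in counts})
    header = "".join(f"group {c}".rjust(10) for c in cluster_values)
    print("stage".ljust(8) + header)
    for stage in stage_values:
        cells = "".join(str(counts[(stage, c)]).rjust(10) for c in cluster_values)
        print(str(stage).ljust(8) + cells)


# Toy labels for six hypothetical patients (NOT values from Cancer.dat).
toy_stages = [3, 3, 4, 4, 3, 4]
toy_clusters = [1, 1, 2, 2, 2, 2]
toy_counts = cross_tabulate(toy_stages, toy_clusters)
print_table(toy_counts)
```

A table whose counts concentrate in one cell per row suggests that the 
clusters largely recover the clinical staging.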

There are twelve pre-trial covariates measured on each patient: seven 
may be taken to be continuous and four to be discrete, while one 
(Index of tumour stage and histologic grade) is an index whose values 
nearly all lie between 7 and 15 and which could be considered either 
discrete or continuous. It is listed with the continuous covariates 
below.

In order, the covariates are:

Age, Weight, Performance rating, Cardiovascular disease history, 
Systolic blood pressure, Diastolic blood pressure, 
Electrocardiogram code, Serum haemoglobin, Size of primary tumour, 
Index of tumour stage and histologic grade, 
Serum prostatic acid phosphatase, Bone metastases.

Continuous covariates:
Age, Weight, Systolic blood pressure, Diastolic blood pressure,
Serum haemoglobin, Size of primary tumour,
Index of tumour stage and histologic grade,
Serum prostatic acid phosphatase

Categorical covariates (Number of Levels):
Performance rating (4), Cardiovascular disease history (2),
Electrocardiogram code (7), Bone metastases (2) 
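
The classification above can be written down as a small table for use by 
a reading or checking script. The following is a minimal Python sketch, 
not a format used by Multimix: the name COVARIATES and the helper 
names_of are hypothetical, and placing Bone metastases last in the order 
is an assumption.

```python
# (name, kind, number of levels); levels is None for continuous covariates.
# Assumption: Bone metastases is placed last in the order.
COVARIATES = [
    ("Age", "continuous", None),
    ("Weight", "continuous", None),
    ("Performance rating", "categorical", 4),
    ("Cardiovascular disease history", "categorical", 2),
    ("Systolic blood pressure", "continuous", None),
    ("Diastolic blood pressure", "continuous", None),
    ("Electrocardiogram code", "categorical", 7),
    ("Serum haemoglobin", "continuous", None),
    ("Size of primary tumour", "continuous", None),
    ("Index of tumour stage and histologic grade", "continuous", None),
    ("Serum prostatic acid phosphatase", "continuous", None),
    ("Bone metastases", "categorical", 2),
]


def names_of(kind):
    """Return the covariate names of the given kind, in listed order."""
    return [name for name, k, _ in COVARIATES if k == kind]


continuous_names = names_of("continuous")
categorical_names = names_of("categorical")
```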

A preliminary inspection of the data showed that the size of the 
primary tumour (SZ) and serum prostatic acid phosphatase (AP) were both 
skewed variables. These variables have therefore been transformed to 
achieve approximate normality: a square root transformation was used 
for SZ, and a logarithmic transformation was used for AP. 
Observations that had missing values in any of the twelve pretreatment 
covariates were omitted from further analysis, leaving 475 out of the 
original 506 observations available.
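
The two transformations and the deletion of incomplete observations can 
be sketched as follows. This is a minimal Python illustration, not the 
code used to produce Cancer.dat. It assumes that each record is a 
mapping from covariate names to numbers, that a missing value is stored 
as None, that AP is positive, and that the logarithm is the natural 
logarithm (the base is not stated above). The function names and the toy 
records are hypothetical.

```python
import math


def prepare_record(record):
    """Return a transformed copy of one patient record, or None if incomplete.

    Assumptions: `record` maps covariate names to numbers, a missing value
    is stored as None, and the logarithm is the natural logarithm.
    """
    if any(value is None for value in record.values()):
        return None  # drop observations with any missing covariate
    prepared = dict(record)
    prepared["SZ"] = math.sqrt(record["SZ"])  # square root of tumour size
    prepared["AP"] = math.log(record["AP"])   # log of acid phosphatase
    return prepared


def prepare_dataset(records):
    """Apply prepare_record to every record and keep the complete ones."""
    prepared = (prepare_record(record) for record in records)
    return [record for record in prepared if record is not None]


# Toy records (NOT values from the trial); only three covariates shown.
toy_records = [
    {"Age": 70, "SZ": 4.0, "AP": 1.0},
    {"Age": 65, "SZ": None, "AP": 5.0},  # incomplete, so omitted
]
toy_prepared = prepare_dataset(toy_records)
```

The original records are left unmodified, so the untransformed values 
remain available for checking.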

When the program Missing.for becomes available, a useful exercise will
be to re-estimate the parameters and group assignments using all 506
observations (not forgetting to transform AP and SZ). The parameter
estimates based on the 475 complete observations may be used as
initial parameter estimates for the new iteration. Because Missing.for
is slower to execute than Multimix.for, starting it from complete-case
estimates in this way will be the usual way in which it is used.



