SOFTWARE FOR FITTING MULTILINEAR MODELS TO DATA
Robert T. Ross
Department of Biochemistry, The Ohio State University,
484 West 12th Avenue, Columbus OH 43210 USA
Internet: rtr+@osu.edu
edition of 01 Feb 95
This document describes a family of programs written by the
author to fit multilinear models to data in 2-way, 3-way, and 4-
way arrays. (A matrix is a "2-way array", with one way being the
rows and the other way being the columns. A 3-way array is a
block which extends in three directions.)
All current programs are in FORTRAN77.
PROGRAMS CURRENTLY AVAILABLE:
TPALS - fits general Parafac model to 3-way data
T2A - fits general Tucker2 model to 3-way data
T3A - fits general Tucker3 model to 3-way data
FILES CURRENTLY AVAILABLE VIA WWW SERVER:
(at http://calvin.biosci.ohio-state.edu/docs)
TPALS.F - FORTRAN code for TPALS
TYRACE.DAT - data file in format "TA",
ready for use by TPALS.F
TYRACE.OUT - output generated by TPALS.F
when TYRACE.DAT is used as input
Additional FORTRAN code and data files for testing will be
available soon.
OTHER REQUIREMENTS:
All current programs require the IMSL subroutine library.
WARNING: IN ORDER TO REMOVE THE DEPENDENCE OF TPALS ON A NAGLIB
ROUTINE, ITS LINEAR LEAST SQUARES CALCULATION IN THE CASE IOPT=-1
(NEGATIVE ANSWERS NOT ALLOWED) WAS RECENTLY CONVERTED TO USE IMSL
ROUTINE DLCLSQ. WE HAVE EXPERIENCED OCCASIONAL PROBLEMS WHEN
ATTEMPTING TO SOLVE FOR THREE COMPONENTS WITH A LARGE ARRAY.
UNTIL THIS PROBLEM IS RESOLVED, USERS HAVING NAGLIB AVAILABLE MAY
WISH TO MODIFY SUBROUTINE REGRES TO USE IT RATHER THAN IMSL.
PROGRAM AND SUBROUTINE NAMING CONVENTIONS:
The first character of each program name is "B", "T", or "Q",
designating whether the array used is 2-way, 3-way, or 4-way
(usually fit by bilinear, trilinear, or quadrilinear models,
respectively).
For programs using 3-way arrays, the first two characters are
"TP", "T2" and, "T3", describing programs that fit the PARAFAC,
Tucker2 (T2), and Tucker3 (T3) models.
The remainder of the program name conveys further information
about the model and algorithm used. The program "TPALS" fits
Parafac models to 3-way data using the ALS (alternating least
squares) algorithm.
A similar approach is used in naming subroutines. For example,
subroutine TRESID is used by all programs in the "T" family, and
subroutine TPOUT is used only by "TP" programs.
CHOICES OF ARRAY WAYS:
In many multilinear models, two or more ways can be interchanged
and the resulting permuted model is the same type of multilinear
model. In the general Parafac and Tucker3 models, each of the
three ways is equivalent. In the general Tucker2 model, two of
the three ways are equivalent. However, this software (at least
in its current edition) does not treat the ways equivalently, so
it may make a difference which experimental variable is assigned
to which way.
For "TP" programs, the solution is initialized using a
decomposition which requires that the number of levels in way-1
and way-2 be at least equal to the number of components being
considered, whereas there need only be 2 levels for way-3. You
may need to interchange the identities of the ways in your data
array to meet this condition.
Also, see "Array dimensions" in the next section.
POSSIBLE FORTRAN CODE ADJUSTMENTS REQUIRED:
Array dimensions: As required, each subroutine has a PARAMETER
statement defining the value of NDA, NDB, NDC, NDT, NDF, and ND2.
NDA, NDB, and NDC must be at least equal to the number of levels
in way-1, way-2, and way-3; known as NI, NJ, and NK,
respectively. NDT must be at least equal to the larger of NI and
NJ. NDF must be at least equal to the largest number of
components to be considered. ND2 must be at least equal to the
largest number of data points used in a linear least-squares
step; for TPALS, this is the number of data in any "slice".
Thus, if there are no missing data, ND2 in TPALS must be equal to
the largest of NI*NJ, NI*NK, and NJ*NK. Array sizes may be
adjusted by using an editor to make a global change such as
"change 'NDA=40' to 'NDA=50'".
Machine precision and subroutine names: Adequate accuracy
requires the use of 8-byte floating-point variables (REAL*8). We
have been using a machine in which REAL*8 is regarded as "DOUBLE
PRECISION". Thus, calls to IMSL subroutines are to the double
precision edition, which have names beginning with the prefix
"D". If the code is to be used on a machine for which REAL*8 is
the default precision (such as a CRAY), the prefix "D" will need
to be removed from all IMSL subroutine names (Listing the
character string "CALL D" will locate all of them.). And there
may be a couple of other double precision functions which will
need to be converted to their normal precision equivalents; your
compiler will tell you where they are!
INPUT FILE FORMAT:
The input files for these programs have as common a format as
possible. For the "T" family of programs, the basic input format
is designated "TA". A file in this format is read by SUBROUTINE
READTA. The file can be described as three blocks:
File Block 1. Program Control Parameters
This single line of control parameters is placed at the top of
the file so that it may be easily edited.
NF maximum number of components to be included in a model
(The program will generally fit 1, 2, ..., NF components.)
IOPT program control variable used for different purposes in
different programs. In TPALS it can have two values:
+1 if negative elements are allowed in output vectors, and -1 if
all elements are constrained to be nonnegative.
IDPRT controls the amount of diagnostic output produced.
When IDPRT=0, no diagnostic output is produced, other than error
conditions. When IDPRT=1 or 2, a moderate amount of diagnostic
output is produced which may be of some interest to a user
wishing to observe the detailed working of the program. When
IDPRT=3, a very large volume of output is produced, usually of
interest only when debugging error conditions.
File Block 2. Label Information
This block contains label information and has no influence on the
calculations.
Intended uses:
DAYTIM date and time of data collection
TITLE title of dataset
INAM name of variable used for the first way
JNAM name of variable used for the second way
KNAM name of variable used for the third way
Note added 5 Jan 95: I am about to delete the following segment
from the "TA" format. RTR
NK number of level labels to follow
KLABEL label for level k
File Block 3. The Dataset
NI number of levels in the first way
NJ number of levels in the second way
NK number of levels in the third way
XI(I) value of first-way variable for level i
XJ(J) value of second-way variable for level j
XK(K) value of third-way variable for level k
SCALED a scale factor by which the data has been divided in
order for it to be read as F7.4 accurately
SCALES a scale factor by which the standard deviations have
been divided
DATA(I,J,K) the data
SIGMA(I,J,K) the estimated standard deviation of DATA(I,J,K)
IGOOD(I,J,K) a "data good" flag.
IGOOD(I,J,K)=1 means that DATA(I,J,K) should be included in the
calculations, while IGOOD(I,J,K)=0 means that DATA(I,J,K) is
missing or so unreliable as to be unusable. Using a very large
value of SIGMA(I,J,K) has much the same effect as setting
IGOOD(I,J,K)=0, except that setting IGOOD to zero saves computer
time (and memory) by skipping calculations involving the
corresponding datum.
In many circumstances, including the unmodified use of general
multilinear models such as Parafac, T2, and T3, the values of the
independent variables, XI, XJ, and XK, are not used in the model,
and thus may be assigned dummy values.
OUTPUT FILE FORMAT:
For program TPALS:
In the absence of diagnostic printout (IDPRT=0 and no errors),
the output file is as follows:
From SUBROUTINE READTA: header information
For each of NF=1, 2, etc.:
From TPOUT:
The vectors for each component and each way. The vectors for the
first two ways are normalized to have a maximum of 1, so that the
scale of each component is conveyed by the vector for the third
way.
From TRESID:
An analysis of residuals, including the size and location of the
largest positive and largest negative residuals, a partial sum-
of-squared residuals by level for each way, and a map of
residuals for each 'layer' (third-way level). Each map is coded
so that a missing value is '0'; a small residual
(-1