SOFTWARE FOR FITTING MULTILINEAR MODELS TO DATA


Robert T. Ross

Department of Biochemistry, The Ohio State University,
484 West 12th Avenue, Columbus OH 43210 USA

Internet: rtr+@osu.edu

edition of 01 Feb 95

This document describes a family of programs written by the
author to fit multilinear models to data in 2-way, 3-way, and 4-
way arrays.  (A matrix is a "2-way array", with one way being the
rows and the other way being the columns.  A 3-way array is a
block which extends in three directions.)

All current programs are in FORTRAN77.


PROGRAMS CURRENTLY AVAILABLE: TPALS - fits general Parafac model to 3-way data T2A - fits general Tucker2 model to 3-way data T3A - fits general Tucker3 model to 3-way data FILES CURRENTLY AVAILABLE VIA WWW SERVER: (at http://calvin.biosci.ohio-state.edu/docs) TPALS.F - FORTRAN code for TPALS TYRACE.DAT - data file in format "TA", ready for use by TPALS.F TYRACE.OUT - output generated by TPALS.F when TYRACE.DAT is used as input Additional FORTRAN code and data files for testing will be available soon.
OTHER REQUIREMENTS: All current programs require the IMSL subroutine library. WARNING: IN ORDER TO REMOVE THE DEPENDENCE OF TPALS ON A NAGLIB ROUTINE, ITS LINEAR LEAST SQUARES CALCULATION IN THE CASE IOPT=-1 (NEGATIVE ANSWERS NOT ALLOWED) WAS RECENTLY CONVERTED TO USE IMSL ROUTINE DLCLSQ. WE HAVE EXPERIENCED OCCASIONAL PROBLEMS WHEN ATTEMPTING TO SOLVE FOR THREE COMPONENTS WITH A LARGE ARRAY. UNTIL THIS PROBLEM IS RESOLVED, USERS HAVING NAGLIB AVAILABLE MAY WISH TO MODIFY SUBROUTINE REGRES TO USE IT RATHER THAN IMSL.
PROGRAM AND SUBROUTINE NAMING CONVENTIONS: The first character of each program name is "B", "T", or "Q", designating whether the array used is 2-way, 3-way, or 4-way (usually fit by bilinear, trilinear, or quadrilinear models, respectively). For programs using 3-way arrays, the first two characters are "TP", "T2" and, "T3", describing programs that fit the PARAFAC, Tucker2 (T2), and Tucker3 (T3) models. The remainder of the program name conveys further information about the model and algorithm used. The program "TPALS" fits Parafac models to 3-way data using the ALS (alternating least squares) algorithm. A similar approach is used in naming subroutines. For example, subroutine TRESID is used by all programs in the "T" family, and subroutine TPOUT is used only by "TP" programs.
CHOICES OF ARRAY WAYS: In many multilinear models, two or more ways can be interchanged and the resulting permuted model is the same type of multilinear model. In the general Parafac and Tucker3 models, each of the three ways is equivalent. In the general Tucker2 model, two of the three ways are equivalent. However, this software (at least in its current edition) does not treat the ways equivalently, so it may make a difference which experimental variable is assigned to which way. For "TP" programs, the solution is initialized using a decomposition which requires that the number of levels in way-1 and way-2 be at least equal to the number of components being considered, whereas there need only be 2 levels for way-3. You may need to interchange the identities of the ways in your data array to meet this condition. Also, see "Array dimensions" in the next section.
POSSIBLE FORTRAN CODE ADJUSTMENTS REQUIRED: Array dimensions: As required, each subroutine has a PARAMETER statement defining the value of NDA, NDB, NDC, NDT, NDF, and ND2. NDA, NDB, and NDC must be at least equal to the number of levels in way-1, way-2, and way-3; known as NI, NJ, and NK, respectively. NDT must be at least equal to the larger of NI and NJ. NDF must be at least equal to the largest number of components to be considered. ND2 must be at least equal to the largest number of data points used in a linear least-squares step; for TPALS, this is the number of data in any "slice". Thus, if there are no missing data, ND2 in TPALS must be equal to the largest of NI*NJ, NI*NK, and NJ*NK. Array sizes may be adjusted by using an editor to make a global change such as "change 'NDA=40' to 'NDA=50'". Machine precision and subroutine names: Adequate accuracy requires the use of 8-byte floating-point variables (REAL*8). We have been using a machine in which REAL*8 is regarded as "DOUBLE PRECISION". Thus, calls to IMSL subroutines are to the double precision edition, which have names beginning with the prefix "D". If the code is to be used on a machine for which REAL*8 is the default precision (such as a CRAY), the prefix "D" will need to be removed from all IMSL subroutine names (Listing the character string "CALL D" will locate all of them.). And there may be a couple of other double precision functions which will need to be converted to their normal precision equivalents; your compiler will tell you where they are!
INPUT FILE FORMAT: The input files for these programs have as common a format as possible. For the "T" family of programs, the basic input format is designated "TA". A file in this format is read by SUBROUTINE READTA. The file can be described as three blocks: File Block 1. Program Control Parameters This single line of control parameters is placed at the top of the file so that it may be easily edited. NF maximum number of components to be included in a model (The program will generally fit 1, 2, ..., NF components.) IOPT program control variable used for different purposes in different programs. In TPALS it can have two values: +1 if negative elements are allowed in output vectors, and -1 if all elements are constrained to be nonnegative. IDPRT controls the amount of diagnostic output produced. When IDPRT=0, no diagnostic output is produced, other than error conditions. When IDPRT=1 or 2, a moderate amount of diagnostic output is produced which may be of some interest to a user wishing to observe the detailed working of the program. When IDPRT=3, a very large volume of output is produced, usually of interest only when debugging error conditions. File Block 2. Label Information This block contains label information and has no influence on the calculations. Intended uses: DAYTIM date and time of data collection TITLE title of dataset INAM name of variable used for the first way JNAM name of variable used for the second way KNAM name of variable used for the third way Note added 5 Jan 95: I am about to delete the following segment from the "TA" format. RTR NK number of level labels to follow KLABEL label for level k File Block 3. The Dataset NI number of levels in the first way NJ number of levels in the second way NK number of levels in the third way XI(I) value of first-way variable for level i XJ(J) value of second-way variable for level j XK(K) value of third-way variable for level k SCALED a scale factor by which the data has been divided in order for it to be read as F7.4 accurately SCALES a scale factor by which the standard deviations have been divided DATA(I,J,K) the data SIGMA(I,J,K) the estimated standard deviation of DATA(I,J,K) IGOOD(I,J,K) a "data good" flag. IGOOD(I,J,K)=1 means that DATA(I,J,K) should be included in the calculations, while IGOOD(I,J,K)=0 means that DATA(I,J,K) is missing or so unreliable as to be unusable. Using a very large value of SIGMA(I,J,K) has much the same effect as setting IGOOD(I,J,K)=0, except that setting IGOOD to zero saves computer time (and memory) by skipping calculations involving the corresponding datum. In many circumstances, including the unmodified use of general multilinear models such as Parafac, T2, and T3, the values of the independent variables, XI, XJ, and XK, are not used in the model, and thus may be assigned dummy values.
OUTPUT FILE FORMAT: For program TPALS: In the absence of diagnostic printout (IDPRT=0 and no errors), the output file is as follows: From SUBROUTINE READTA: header information For each of NF=1, 2, etc.: From TPOUT: The vectors for each component and each way. The vectors for the first two ways are normalized to have a maximum of 1, so that the scale of each component is conveyed by the vector for the third way. From TRESID: An analysis of residuals, including the size and location of the largest positive and largest negative residuals, a partial sum- of-squared residuals by level for each way, and a map of residuals for each 'layer' (third-way level). Each map is coded so that a missing value is '0'; a small residual (-1