% Document Type: LaTeX
% Document Previewer: xdvi -margins 1.5
\documentclass{article}
\renewcommand{\vec}[1]{\mbox{\boldmath $#1$}}
\title{%article title
Least-Squares Fitting of a Hyperplane
}
\author{Robert K. Moniot}
\date{\today}
\addtolength{\textwidth}{0.5in}
\addtolength{\oddsidemargin}{-0.25in}
\begin{document}
\maketitle
\abstract{A method is developed for fitting a hyperplane to a
set of data by least-squares, allowing for independent uncertainties in all
coordinates of each data point, and including an error analysis.
\vspace{1ex}
{\em Note:} This paper is adapted from a technical report I wrote as a
graduate student in the Department of Physics, University of
California, Berkeley, in 1976.} Copyright \copyright\ 2002 by Robert
K. Moniot. All rights reserved.
\section{Introduction} A recurring computational problem in the field
of isotopic studies of terrestrial and extraterrestrial materials has
been the interpretation of the observed mass spectra in terms of
mixtures of various source components. Each source
component is characterized by fixed, but often unknown, isotopic
ratios, but it is present in variable amounts in different measured
samples. One would like to verify the hypothesis that a given number
of components is adequate to account for all observations, and, if
possible, not only to determine the source component compositions, but
also to resolve each measured sample into its original components, in
order to separate different processes in its origin for study. This
paper treats only the first steps in this sequence of analysis, i.e.,
the investigation into the number and compositions of possible source
components.

The measured mass spectra may be denoted by $Y_i(\mu)$, where $i=1,
\ldots, n$ is the sample number, and $\mu$ is the mass number. Since
$\mu$ takes on only discrete values $\mu_k,\ k=1, \ldots, p$, each
mass spectrum can be represented by a vector in a $p$-dimensional
vector space, $Y_{ik} = Y_i(\mu_k)$. These are assumed to be made up
of linear combinations of the component spectra,
\begin{equation}
Y_i(\mu) = \sum_{j=1}^m \alpha_{ij}g_j(\mu)
\label{LinearComboEqn}
\end{equation}
where the $\alpha_{ij}$ are scalars between 0 and 1 subject to the
normalization condition $\sum_{j=1}^m \alpha_{ij} = 1,\ i=1, \ldots,
n$, and the $g_j(\mu)$ are the $m$ different component spectra.

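As a concrete illustration of Equation~(\ref{LinearComboEqn}), the
following Python sketch builds $n=4$ synthetic spectra as convex
combinations of $m=2$ component spectra over $p=3$ mass numbers; the
numerical values are invented solely for illustration.

```python
import numpy as np

# Hypothetical component spectra g_j(mu_k): m = 2 components, p = 3 masses.
g = np.array([[1.0, 0.5, 0.2],
              [0.1, 0.8, 0.6]])
# Mixing coefficients alpha_ij for n = 4 samples; each row sums to 1.
alpha = np.array([[1.0, 0.0],
                  [0.7, 0.3],
                  [0.4, 0.6],
                  [0.0, 1.0]])
# Y_ik = sum_j alpha_ij g_j(mu_k): each measured spectrum is a
# linear combination of the component spectra.
Y = alpha @ g
```

The first and last samples reproduce the pure component spectra, while
the middle two are mixtures.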
The problem is analogous to that of curve resolution encountered, for
example, in chromatography or spectrophotometry, where it has been
treated with considerable success using the technique of principal
component analysis. Lawton and Sylvestre (1971), for example, have
considered the case of two source components and have developed a
method for computing two bands of curves, each containing one of the
source components. The method of principal component analysis,
however, runs into difficulties if the data are characterized by
widely different experimental uncertainties. This is often the case
with mass spectroscopic data. Even if the relative uncertainties in
isotopic ratios are similar, the ratios can vary by orders of magnitude.
As Anderson (1963) pointed out, the method of principal component
analysis is justified only if the ratio of the ``uncertainty''
variance to the ``systematic,'' i.e.\ correlation, variance is the
same for all components of the data. Nonconformity with this
requirement may be remedied to some degree by rescaling the data
according to their respective uncertainties. Here we abandon the
method of principal component analysis for an alternative approach
that is on a better statistical footing in that it takes full account
of the estimated uncertainties of the data.

It is easily shown that data points consisting of linear combinations
of components according to Equation~(\ref{LinearComboEqn}) must lie in
an $(m-1)$-dimensional subspace of the full $p$-dimensional vector
space. This subspace is defined by the simplex whose vertices are the
$m$ distinct components. This paper deals with only the first step in
component resolution, namely the determination of the parameters of
this subspace. Furthermore, it considers only the simplest case, in
which $p = m$, that is, the number of components is the same as the
number of coordinates of the space (e.g.\ the number of isotopic
ratios measured in each sample). Thus for 2-dimensional data we seek
the equation of a straight line, for 3-dimensional data a plane, and
in general a hyperplane of dimension one less than the space in
which it is embedded. The general case of arbitrary $m \le p$ is to
be dealt with in a future paper.
\section{Definition of problem} In accordance with the foregoing
discussion, we
assume that the measured data points should ideally lie on a
hyperplane of $m-1$ dimensions in a space of $m$ dimensions. If we
let $\vec{y} = [y_1, y_2, \ldots, y_m]$ denote a point on this
hyperplane, the equation which the data point ideally should satisfy
can be written
\begin{equation}
f(\vec{y}) = \sum_{k=1}^{m-1} a_k y_k + a_m - y_m = 0
\label{HyperplaneEqn}
\end{equation}
The measured data consist of a set of $n$ vectors $\vec{Y_i},\
i=1, \ldots, n$. Each coordinate $Y_{ik}$ of each data point has an
associated experimental uncertainty $\sigma_{ik}$. (The $\sigma_{ik}$
may or may not be known a priori. We assume that at least their
relative magnitudes are known.) The experimental errors will cause
the data points $\vec{Y_i}$ to lie scattered off the hyperplane of
Equation~(\ref{HyperplaneEqn}). Therefore one seeks a ``best fit''
to the data.

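For example, in the straight-line case $m=2$,
Equation~(\ref{HyperplaneEqn}) reduces to
\begin{displaymath}
f(\vec{y}) = a_1 y_1 + a_2 - y_2 = 0,
\end{displaymath}
i.e.\ the familiar line $y_2 = a_1 y_1 + a_2$ with slope $a_1$ and
intercept $a_2$.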
The method of principal component analysis, referred to above, is
equivalent to seeking the hyperplane that minimizes the sum of squares
of perpendicular distances from the measured points to the
hyperplane. As already mentioned, this method is unable to take
proper account of the experimental uncertainties of the data, and is
not invariant under a change of scale of one or more axes.

The usual method that one finds in the literature for obtaining a best
fit of this kind is based on minimizing a sum of squares of the
residuals $f(\vec{Y_i})$. This is called a regression of $y_m$
against $y_1$ through $y_{m-1}$. The sum of squares of these
residuals is either unweighted or weighted by $1/\sigma_{im}^2$
(Bevington and Robinson, 1991). This approach ignores the uncertainties in
coordinates $y_1$ through $y_{m-1}$. It also gives different results
depending on which coordinate is chosen as the ``dependent''
coordinate $y_m$.

The correct treatment that properly takes account of the experimental
uncertainties was formulated by Deming (1943). Given the data
$Y_{ik}$ with associated uncertainties $\sigma_{ik}$, a set of
corresponding ``adjusted'' values $y_{ik}$ are sought which lie
exactly on the hyperplane (\ref{HyperplaneEqn}) and minimize the
variance
\begin{equation}
S = \sum_i \sum_{k=1}^m {1 \over \sigma_{ik}^2} (Y_{ik} - y_{ik})^2
\label{DemingVarianceEqn}
\end{equation}
The solution of this formulation of the problem is not
straightforward. York (1966) first devised an approach, later
improved by Williamson (1968), for the straight-line case. Here we
extend Williamson's solution to arbitrary $m \ge 1$.
\section{Derivation of solution}
We begin with the constraint that the adjusted points $\vec{y_i}$ are
required to satisfy the hyperplane equation, i.e.\ $f(\vec{y_i}) = 0,\
i=1,\ldots, n$ where $f$ is as defined in Equation~(\ref{HyperplaneEqn}).
Defining residuals $V_{ik} = Y_{ik} - y_{ik}$ allows this constraint
to be rewritten as
\begin{equation}
f(\vec{y_i}) = f(\vec{Y_i}) - \sum_{k=1}^{m-1} a_k V_{ik} + V_{im} = 0,
\quad i=1,\ldots, n
\label{ConstraintEqn}
\end{equation}
Now, given any values of the parameters $a_k$, we seek those values of
$y_{ik}$ that will minimize $S$ for that choice of the $a_k$, subject
to the constraint (\ref{ConstraintEqn}). Thus we require
\begin{equation}
\delta S = 0 = \sum_i \sum_{k=1}^m {1 \over \sigma_{ik}^2} V_{ik}
\,\delta V_{ik}
\label{deltaSEqn}
\end{equation}
From (\ref{ConstraintEqn}) we have
\begin{equation}
\delta f(\vec{y_i}) = 0 = - \sum_{k=1}^{m-1} a_k\, \delta V_{ik} +
\delta V_{im}, \quad i=1, \ldots, n
\label{deltafEqn}
\end{equation}
Multiplying each of the $n$ equations (\ref{deltafEqn}) by its own
undetermined multiplier $\lambda_i$ and adding them all to
Equation~(\ref{deltaSEqn}), we obtain
\begin{equation}
\sum_i \sum_{k=1}^{m-1} ({ V_{ik} \over \sigma_{ik}^2} - \lambda_i
a_k)\, \delta V_{ik} + \sum_i ({ V_{im} \over \sigma_{im}^2} +
\lambda_i)\, \delta V_{im} = 0
\label{GrandMinEqn}
\end{equation}
Since the $V_{ik}$ are independent, the coefficients of $\delta
V_{ik}$ in this equation must individually be zero, giving
\begin{equation}
\begin{array}{rcl}
V_{ik} & = & \lambda_i a_k \sigma_{ik}^2, \quad i=1, \ldots, n,\ k=1,
\ldots, m-1 \\
V_{im} & = & -\lambda_i \sigma_{im}^2, \quad i=1, \ldots, n
\end{array}
\label{VarianceSolutionEqn}
\end{equation}
Substituting this result into Equation~(\ref{ConstraintEqn}) and
solving for $\lambda_i$ yields
\begin{equation}
\lambda_i = W_i f(\vec{Y_i})
\label{lambdaEqn}
\end{equation}
where
\begin{equation}
W_i = \left[ \sum_{k=1}^{m-1} a_k^2 \sigma_{ik}^2 + \sigma_{im}^2
\right]^{-1}
\label{WEqn}
\end{equation}
This allows $S$ to be rewritten as
\begin{equation}
S = {\textstyle\sum_i} W_i f(\vec{Y_i})^2
\label{SwithWEqn}
\end{equation}
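Numerically, the sum~(\ref{SwithWEqn}) is straightforward to evaluate;
the following Python sketch (function name ours) computes the weights
of Equation~(\ref{WEqn}) and the sum $S$ for a given parameter vector
$\vec{a} = [a_1, \ldots, a_m]$:

```python
import numpy as np

def weighted_S(a, Y, sigma):
    """Weighted sum of squared residuals S = sum_i W_i f(Y_i)^2.
    a: parameters [a_1, ..., a_m]; Y, sigma: (n, m) arrays of
    measurements and standard errors.  (Illustrative sketch.)"""
    m = Y.shape[1]
    # residuals f(Y_i) = sum_{k<m} a_k Y_ik + a_m - Y_im
    f = Y[:, :m-1] @ a[:m-1] + a[m-1] - Y[:, m-1]
    # weights W_i = [sum_{k<m} a_k^2 sigma_ik^2 + sigma_im^2]^{-1}
    W = 1.0 / ((sigma[:, :m-1]**2) @ (a[:m-1]**2) + sigma[:, m-1]**2)
    return np.sum(W * f**2)
```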
This expresses $S$ in the form of a weighted sum of the
residuals as defined for a conventional regression, but with weights
that properly take account of the individual uncertainties
$\sigma_{ik}$. Now $S$ is to be minimized with respect to the
parameters $a_k$. Setting $\partial S/ \partial a_k = 0$ leads to the
following set of equations analogous to the normal equations of the
conventional regression:
\begin{equation}
\begin{array}{l}
\sum_i W_i f(\vec{Y_i})\, y_{ik} = 0, \quad k=1, \ldots, m-1 \\[1ex]
\sum_i W_i f(\vec{Y_i}) = 0
\end{array}
\label{NormalEqns}
\end{equation}
Since the parameters $a_k$ occur in $W_i$, $f(\vec{Y_i})$, and
$\vec{y_i}$, these equations are nonlinear and cannot be solved in
closed form. However, recognizing that $W_i$ and $\vec{y_i}$ are
only weakly dependent on the parameters of the hyperplane, we can
linearize Equations (\ref{NormalEqns}) by treating those quantities as
constants. Writing out $f$ in terms of the parameters, then, we
obtain
\begin{equation}
\begin{array}{lclcl}
{\displaystyle \sum_{k=1}^{m-1}} (\sum_i W_i y_{ij} Y_{ik})\, a_k
&+& (\sum_i W_i y_{ij})\, a_m
&=& \sum_i W_i y_{ij} Y_{im},
\\ &&&& \quad j = 1, \ldots, m-1 \\
{\displaystyle \sum_{k=1}^{m-1}} (\sum_i W_i Y_{ik})\, a_k
&+& (\sum_i W_i)\, a_m
&=& \sum_i W_i Y_{im} \\
\end{array}
\label{ExpandedNormalEqns}
\end{equation}
Identifying the parenthesized terms in
Equation~(\ref{ExpandedNormalEqns}) as the elements of an $m \times m$
matrix $M$ and the right-hand sides as elements of a vector $\vec{b}$
of length $m$, this set of equations is seen as a linear system of the
form $M \vec{a} = \vec{b}$. Solution proceeds iteratively. Starting
with an initial guess for the vector of coefficients $\vec{a}$, the
matrix $M$ and vector $\vec{b}$ are evaluated and the system
$M\vec{a}=\vec{b}$ is solved for the new value of $\vec{a}$. This new
value is used to re-evaluate $M$ and $\vec{b}$, and the equation is
solved again. The
iteration is continued until convergence is obtained. In practice, it
is not necessary to have a good starting guess for $\vec{a}$. From
Equation~(\ref{WEqn}) it can be seen that the initial choice $\vec{a}=0$
gives, as the result of the first iteration, the same parameters as
would be obtained if the $\sigma_{ik}$ were zero for all $k$
except $m$. This is the same result as would be given by the
conventional weighted regression of $y_m$ against the other
coordinates. Incidentally, this means that weighted averages ($m=1$)
are computed correctly in one iteration. Experience has shown that
the convergence is rapid for data
sets where the fit is justified, and only a few iterations are
necessary to obtain the coefficients to accuracies that are well
within their uncertainties.
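The iteration just described can be sketched in Python as follows; for
simplicity the adjusted coordinates $y_{ij}$ appearing in the normal
matrix are approximated by the measured $Y_{ij}$, and the number of
iterations is fixed rather than tested for convergence (both choices
are ours, not part of the derivation):

```python
import numpy as np

def fit_hyperplane(Y, sigma, iterations=10):
    """Iteratively solve the linearized normal equations M a = b for
    the hyperplane parameters a = [a_1, ..., a_m].
    Y, sigma: (n, m) arrays of measurements and standard errors.
    Sketch: the adjusted y_ij are approximated by the measured Y_ij."""
    n, m = Y.shape
    a = np.zeros(m)   # a = 0: first pass is the conventional regression
    for _ in range(iterations):
        # weights W_i = [sum_{k<m} a_k^2 sigma_ik^2 + sigma_im^2]^{-1}
        W = 1.0 / ((sigma[:, :m-1]**2) @ (a[:m-1]**2) + sigma[:, m-1]**2)
        X = np.hstack([Y[:, :m-1], np.ones((n, 1))])  # rows [Y_i1,...,Y_i,m-1, 1]
        M = (W[:, None] * X).T @ X           # normal matrix
        b = (W[:, None] * X).T @ Y[:, m-1]   # right-hand side
        a = np.linalg.solve(M, b)
    return a
```

With $\vec{a}=0$ initially, the first pass reproduces the conventional
weighted regression of $y_m$ against the other coordinates, as noted
above; subsequent passes merely update the weights $W_i$.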
\subsection{Refinement} It should be mentioned that for the sake of
numerical stability, the measured points $\vec{Y_i}$ should be
translated if necessary into a coordinate system whose origin is close
to the mean of the points. This stability can be achieved
automatically, and the computation simplified somewhat, by
reformulating the solution in the following way.
First, solve for $a_m$ from the last of the equations
(\ref{ExpandedNormalEqns}):
\begin{equation}
a_m = \overline{Y}_{m} - \sum_{k=1}^{m-1} a_k \overline{Y}_{k}
\label{asubmEqn}
\end{equation}
where
\begin{equation}
\overline{Y}_{k} = { {\textstyle \sum_i W_i Y_{ik}} \over
{\textstyle \sum_i W_i}}, \quad k=1, \ldots, m
\label{YbarEqn}
\end{equation}
This shows that the point $\overline{\vec{Y}}$ lies on the best-fit
hyperplane. Now define $\vec{Y_i}' = \vec{Y_i} - \overline{\vec{Y}}$
and $\vec{z_i} = \vec{y_i} - \overline{\vec{Y}}$.
From Equations (\ref{VarianceSolutionEqn}) and (\ref{lambdaEqn}) we have
\begin{equation}
z_{ij} = Y_{ij}' - W_i a_j \sigma_{ij}^2 f(\vec{Y_i}), \quad i=1, \ldots, n,\ j=1, \ldots, m-1
\label{zEqn}
\end{equation}
where we can express $f(\vec{Y_i})$ as
\begin{equation}
f(\vec{Y_i}) = \sum_{k=1}^{m-1} a_k Y_{ik}' - Y_{im}', \quad i=1, \ldots, n
\label{fOfYEqn}
\end{equation}
Then upon
inserting (\ref{asubmEqn}) into the remaining equations
(\ref{ExpandedNormalEqns}) we find
\begin{equation}
\begin{array}{lcl}
{\displaystyle \sum_{k=1}^{m-1}} (\sum_i W_i z_{ij} Y_{ik}')\, a_k
&=& \sum_i W_i z_{ij} Y_{im}', \quad j = 1, \ldots, m-1
\end{array}
\end{equation}
This reformulation has improved the numerical stability and reduced the
order of the set of equations that needs to be solved on each
iteration by 1.
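A Python sketch of this centered reformulation (with the adjusted
$z_{ij}$ approximated by the centered measurements $Y_{ij}'$, and the
function name ours) might read:

```python
import numpy as np

def fit_hyperplane_centered(Y, sigma, iterations=10):
    """Centered variant: solve the reduced (m-1) x (m-1) system for
    a_1, ..., a_{m-1}, then recover a_m from the weighted mean point.
    Sketch: the adjusted z_ij are approximated by the centered Y'_ij."""
    n, m = Y.shape
    a = np.zeros(m)
    for _ in range(iterations):
        W = 1.0 / ((sigma[:, :m-1]**2) @ (a[:m-1]**2) + sigma[:, m-1]**2)
        Ybar = (W @ Y) / W.sum()   # weighted mean point, on the best fit
        Yp = Y - Ybar              # centered coordinates Y'
        A = (W[:, None] * Yp[:, :m-1]).T @ Yp[:, :m-1]
        rhs = (W[:, None] * Yp[:, :m-1]).T @ Yp[:, m-1]
        a[:m-1] = np.linalg.solve(A, rhs)
        a[m-1] = Ybar[m-1] - a[:m-1] @ Ybar[:m-1]  # a_m from the mean point
    return a
```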
\section{Error analysis} The variances of the hyperplane parameters
can be found by evaluating
\begin{equation}
\sigma^2(a_j) = \sum_i \sum_{k=1}^{m} \sigma_{ik}^2 \left({ \partial
a_j \over \partial Y_{ik} }\right)^2
\label{ParamVarianceEqn}
\end{equation}
(This equation assumes the data $Y_{ik}$ are uncorrelated.) Since the
dependence of $a_j$ on $Y_{ik}$ is not linear as
Equation~(\ref{ExpandedNormalEqns}) suggests, due to the dependence of
$W_i$ and $y_{ik}$ on $a_j$, evaluation of this expression is very
complicated. The original version of this paper contained an error in
the result of this calculation, and a corrected calculation has not
yet been done. To first order, however, ignoring the nonlinearity one
obtains the approximation
\begin{equation}
\sigma^2(a_j) \approx M_{jj}^{-1}
\label{ApproxParamVarianceEqn}
\end{equation}
that is, the variances of the parameters are given simply by the
diagonal elements of the inverse of the normal matrix defined in
Equation~(\ref{ExpandedNormalEqns}). (The
off-diagonal elements of this matrix are the covariances of the
parameters.) For well-behaved data such as those used for
illustration by York (1966), this approximation is good to within a
few percent.

If the experimenter does not have standard errors $\sigma_{ik}$ for
the measured quantities $Y_{ik}$, but only relative uncertainties, the
resulting fit is the same using these relative uncertainties, but the
variances in the fitted parameters are given by
expression~(\ref{ApproxParamVarianceEqn}) multiplied by $S/\nu$,
where $\nu = n - m$ is the number of degrees of freedom of the
problem. If the errors $\sigma_{ik}$ are known a priori, then the
goodness of fit can be inferred from the value of $S/\nu$, which
should be close to unity for normally distributed errors. This
constitutes a test of the $m$-component hypothesis as set forth in
the introduction.
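As a numerical sketch of this approximate error analysis (the helper
name is ours, and the normal matrix is built with the adjusted $y$
approximated by the measured $Y$):

```python
import numpy as np

def param_variances(Y, sigma, a, scale_by_chi2=False):
    """Approximate parameter variances: diagonal of the inverse normal
    matrix M.  If only relative uncertainties are known, scale by S/nu
    with nu = n - m degrees of freedom (scale_by_chi2=True)."""
    n, m = Y.shape
    W = 1.0 / ((sigma[:, :m-1]**2) @ (a[:m-1]**2) + sigma[:, m-1]**2)
    X = np.hstack([Y[:, :m-1], np.ones((n, 1))])
    M = (W[:, None] * X).T @ X
    var = np.diag(np.linalg.inv(M)).copy()
    if scale_by_chi2:
        f = X @ a - Y[:, m-1]    # residuals f(Y_i)
        var *= np.sum(W * f**2) / (n - m)
    return var
```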
\section{Bibliography}
\setlength{\parindent}{0pt}
\setlength{\parskip}{1ex}
Anderson, T. W. (1963). Asymptotic theory for principal component
analysis. {\em Annals of Mathematical Statistics,} {\bf 34,} 122--148.

Bevington, P. R. and Robinson, D. K. (1991). {\em Data Reduction
and Error Analysis for the Physical Sciences,} McGraw-Hill, New York.

Deming, W. E. (1943). {\em Statistical Adjustment of Data,} Wiley,
New York.

Lawton, W. H. and Sylvestre, E. A. (1971). Self modeling curve
resolution. {\em Technometrics,} {\bf 13,} 617--633.

Williamson, J. H. (1968). Least-squares fitting of a straight line.
{\em Canadian Journal of Physics,} {\bf 46,} 1845--1846.

York, D. (1966). Least-squares fitting of a straight line.
{\em Canadian Journal of Physics,} {\bf 44,} 1079--1086.
\end{document}