The effect of the nonlinearity on GCV applied to conjugate gradients in computerized tomography
Reginaldo J. Santos^{I}; Álvaro R. de Pierro^{II}
^{I}Departamento de Matemática, ICEx, Universidade Federal de Minas Gerais, CP 702, 30123-970 Belo Horizonte, MG, Brazil. E-mail: regi@mat.ufmg.br
^{II}Department of Applied Mathematics, IMECC, State University of Campinas, CP 6065, 13081-970 Campinas, SP, Brazil. E-mail: alvaro@ime.unicamp.br
ABSTRACT
We study the effect of the nonlinear dependence of the iterate x^{k} of the Conjugate Gradients method (CG) on the data b in the GCV procedure used to stop the iterations. We compare two versions of using GCV to stop CG: in one version we compute the GCV function with the iterate x^{k} depending linearly on the data b, and in the other one depending nonlinearly. We have tested the two versions on a large scale problem: positron emission tomography (PET). Our results suggest that the nonlinearity must be taken into account in the GCV function to obtain a reasonable stopping criterion.
Mathematical subject classification: 62G05, 92C55, 65F10, 65C05.
Keywords: Generalized Cross Validation, Tomography, Conjugate Gradients.
1 Introduction
We consider the problem of estimating a solution x of

b = Ax + ε,     (1.1)

where A is a real m × n matrix, b is an m-vector of observations and ε is an error vector. We assume that the components of ε are random with zero mean, uncorrelated and with (unknown) variance σ^{2}; i.e.,

E(ε) = 0,   E(εε^{t}) = σ^{2} I,     (1.2)

where E denotes the expectation and I is the identity matrix.
If A is a full rank matrix, the Gauss-Markov Theorem states that the least squares solution of (1.1), i.e., x̂ = (A^{t}A)^{-1}A^{t}b, is the best linear unbiased estimator of x, meaning that it is the minimum variance estimator among the linear unbiased ones (see, for example, [26]). But, if A is ill-conditioned, this minimum variance is still large. It is well known that, if we allow the estimator to be biased, the variance can be drastically reduced (see [28, 29]).
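To make this concrete, the following small numerical sketch (the sizes, seed and value of λ are illustrative assumptions, not taken from the paper) compares the mean square error of the least squares estimator with that of a biased Tikhonov-type estimator on an ill-conditioned matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ill-conditioned 50 x 20 matrix with geometrically
# decaying singular values (condition number about 1e6).
m, n = 50, 20
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -6, n)
A = U @ (s[:, None] * V.T)

x_true = rng.standard_normal(n)
sigma = 1e-3  # noise standard deviation

def mse(estimator, trials=200):
    # Monte Carlo estimate of E ||x_hat(b) - x||^2 over noise realizations.
    err = 0.0
    for _ in range(trials):
        b = A @ x_true + sigma * rng.standard_normal(m)
        err += np.sum((estimator(b) - x_true) ** 2)
    return err / trials

ls = lambda b: np.linalg.lstsq(A, b, rcond=None)[0]                     # unbiased
lam = 1e-6
ridge = lambda b: np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)   # biased

assert mse(ridge) < mse(ls)  # the biased estimator has far smaller error
```

With these sizes the least squares error is dominated by noise amplified through the smallest singular values, while the biased estimator trades a small bias for a drastic variance reduction.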
One way to do this is by considering solutions of regularized problems of the form

minimize ||Ax - b||^{2} + λ ||Bx||^{2},     (1.3)

for a positive real number λ, where B is an n × n matrix that introduces a priori information on the problem. For example, B could be the matrix associated with the discretization of derivatives, enforcing smoothness of x.
Another way is to apply an iterative method that converges to the solution of

minimize ||Ax - b||^{2}.     (1.4)

In this case, a regularization effect is obtained by choosing as 'solution' an early iterate of the method (see [31, 9] and the references therein, or [14, Chapter 7]). Here the iteration number k plays the role of the parameter λ of the first approach. For stationary iterative methods it is shown in [10] and [23] that this approach is, in some sense, equivalent to the first one.
A crucial point in both approaches is the choice of the regularization parameter λ in (1.3) or of the iteration number k in the second approach. A possibility is to approximate a value of λ such that the average of the mean square error is minimum, i.e., a λ that solves the problem

minimize_{λ} E(T(λ)),     (1.5)

where

T(λ) = (1/m) ||A x_{λ} - A x||^{2},     (1.6)

x_{λ} is the solution of (1.3) and x is a solution of (1.1). Probably the most popular of these methods is Generalized Cross-Validation (GCV) (see [33, 16]).
If x_{λ} is an estimator for x (the solution of (1.3) or an iterate of an iterative method convergent to the solution of (1.4)), the influence operator A_{λ} is defined by

A_{λ} b = A x_{λ}.     (1.7)

For the solution of (1.3), the influence operator is given by

A_{λ} = A (A^{t}A + λ B^{t}B)^{-1} A^{t}.     (1.8)

If A_{λ} is given by (1.8), the GCV function V(λ) was defined in [6, 13] as

V(λ) = [ (1/m) ||(I - A_{λ}) b||^{2} ] / [ (1/m) Tr(I - A_{λ}) ]^{2}.     (1.9)
GCV chooses the regularization parameter by minimizing V(λ). This estimate is a rotationally invariant version of Allen's PRESS, or Cross-Validation [1].
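For the special case B = I in (1.3), the function (1.9) can be evaluated cheaply from the SVD of A. The sketch below is an illustration under that assumption (the helper name gcv_ridge is ours, not from the paper):

```python
import numpy as np

def gcv_ridge(A, b, lam):
    """GCV function V(lam) of (1.9) for the influence operator
    A_lam = A (A^t A + lam I)^{-1} A^t, i.e. B = I in (1.3)."""
    m = len(b)
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    f = s**2 / (s**2 + lam)        # filter factors: A_lam = U diag(f) U^t
    Ub = U.T @ b
    # ||(I - A_lam) b||^2, split into range-of-A and orthogonal parts
    resid2 = np.sum(((1 - f) * Ub) ** 2) + (b @ b - Ub @ Ub)
    trace = m - np.sum(f)          # Tr(I - A_lam)
    return (resid2 / m) / (trace / m) ** 2
```

Minimizing gcv_ridge over a grid of λ values then yields the GCV choice of the regularization parameter.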
In [6, 13] it was proven that, if A_{λ} is given by (1.8), then the expected value of V(λ) is close to E(T(λ)) + σ^{2} (inequality (1.10)), where µ_{1} = (1/m) Tr(A_{λ}), whenever 0 < µ_{1} < 1. In [8] Eldén presents a method to compute the trace term of the GCV function using bidiagonalization. In [12], Golub and Von Matt proposed an iterative method to approximate the trace term of the GCV function for large scale problems. In [21], Golub, Nguyen and Milanfar extended the derivation of the GCV function to underdetermined systems, with applications to superresolution reconstruction.
Another approach is to apply an iterative method directly to the least squares problem

minimize ||Ax - b||^{2}

and to take as 'solution' an early iterate of the method. In [10] and [23] it is proven that this approach is, in some sense, equivalent to (1.3). Now the role of the parameter λ is played by the iteration number k, and the GCV functional can be used to choose this number, that is, as a stopping rule, as shown in [32, 25] for stationary linear iterative methods. It is well known that the Conjugate Gradients method applied to the normal equations (CGLS) associated with problem (1.1) can achieve its best accuracy significantly faster than stationary methods, and that CGLS can be used as a regularizing algorithm (see [22]). However, the iterates of CG are a nonlinear function of the data vector b and, as pointed out in [15], a very good stopping rule should be used. In [15], Hanke and Hansen suggest a Monte Carlo GCV procedure to stop CGLS and the ν-method ([15], pages 298-299). But the nonlinearity of the influence operator in the case of CGLS is not taken into account: while for the ν-method the influence operator is linear or affine, for CGLS it is nonlinear. Indeed, they observe that the application of their procedure "can suffer severely from the nonlinearity of CGLS". Also in [15] another, even cruder, approximation is used: in the latter case the denominator of the GCV functional is approximated by a fixed expression depending on the matrix rank p, which is the same for every value of b ([15], page 307).
The main goal of this paper is to compare a Monte Carlo GCV procedure, deduced in Section 2, that takes the nonlinearity into account with another one that linearizes the influence operator, for large scale problems like those arising in Positron Emission Tomography (PET).
Another alternative to GCV is the L-curve (see [15]). It is not considered in this article because we wanted to make explicit the importance of the nonlinear part of CG, and because the L-curve lacks most of the statistical information provided by the problem.
In Section 2 we derive an inequality like (1.10) for the case in which A_{λ} is a nonlinear function of b. We also show how to use a Monte Carlo approach, as in [11], in order to compute the denominator of our GCV functional. In Section 3 we describe the implementation of our procedure applied to a preconditioned Conjugate Gradients method (PCCGLS). In Section 4 we present numerical experiments simulating a PET problem, and in Section 5 some concluding remarks.
2 GCV for a nonlinear influence operator
We would like to obtain a good estimate for

T(λ) = (1/m) ||A x_{λ} - A x||^{2},     (2.1)

using the relative residual defined by

U(λ) = (1/m) ||A x_{λ} - b||^{2}     (2.2)

and our information (1.2) on the error ε.
The next result is a technical lemma.
Lemma 1. Let F(λ), g(λ), G(λ), r(λ) and H(λ) be real-valued functions of λ. If
and
then the following expression is valid:
Proof. The claim follows directly from (2.3) and (2.4), which together imply (2.5).
Throughout this paper we will assume that the influence operator, defined by

A_{λ}(b) = A x_{λ},

is continuously differentiable as a function of b, with b varying in an open set Ω that contains Ax. Denoting by DA_{λ}(·) the Jacobian of A_{λ}(·), we have the next result.
Proposition 2. Let {x_{λ}} be a family of estimators for the solution of (1.1), satisfying (1.2) and such that, for each λ, A_{λ}(·) is continuously differentiable in Ω. Then (2.8) holds, where U(λ) is given by (2.2), T(λ) by (2.1), and the function O_{λ}(ε^{t}ε) is such that ||O_{λ}(ε^{t}ε)|| < M ε^{t}ε for some M > 0.

Proof. Expanding the square of the residual of (1.1) and (1.6) at x_{λ}, we get (2.9).
Now, expanding A_{λ}(·) at Ax + ε around the point Ax by Taylor's formula,

A_{λ}(Ax + ε) = A_{λ}(Ax) + DA_{λ}(Ax) ε + O_{λ}(ε^{t}ε),     (2.10)

where O_{λ}(ε^{t}ε) satisfies the hypothesis. Using (1.1) and (2.10) above we obtain (2.11). Dividing (2.9) by m, applying the expectation and using (1.2) and (2.11), we obtain (2.8).
Let us introduce the GCV function for nonlinear influence operators. For Ax ∈ Ω, we define the GCV functional by

V(λ) = U(λ) / [ (1/m) Tr(I - DA_{λ}(Ax)) ]^{2}.     (2.13)

In spite of the fact that we do not know the value of Ax, we will show below how a good approximation of the denominator of (2.13) can be obtained for large scale problems.
Theorem 3. Let {x_{λ}} be a family of estimators for the solution of (1.1) for which (1.2) is valid. Then, for V(λ) given by (2.13), the equality (2.14) holds, where µ_{1}(λ) = Tr(DA_{λ}(Ax)) and r(λ) = E(ε^{t}O_{λ}(ε^{t}ε)).

Proof. By Proposition 2, inequality (2.8) holds. Applying Lemma 1 we get (2.14).
The next Proposition is the basis of our approximation scheme.
Proposition 4. Let w = (w_{1},...,w_{m})^{t} be a vector of random components with normal distribution, that is, w ~ N(0, I_{m×m}). Suppose that A_{λ}(·) is continuously differentiable in the open set Ω that contains Ax, and let x_{λ}^{+} and x_{λ}^{-} denote the estimators corresponding to the data vectors b + δw and b - δw, respectively. Then (2.15) holds.

Proof. Expanding A_{λ}(·) at Ax + ε + δw and at Ax + ε - δw around the point Ax up to first order and subtracting the two expansions, the result follows using a slight extension of Theorem 2.2 of [11].
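Proposition 4 is what makes the denominator of (2.13) computable without forming the Jacobian: a Girard-style [11] randomized trace estimate with central differences. A sketch (the function name, default sample size and step are our assumptions):

```python
import numpy as np

def mc_trace_jacobian(influence, b, delta=1e-4, samples=20, rng=None):
    """Estimate Tr(D influence(b)) from evaluations of the (possibly
    nonlinear) influence operator only, via the central difference
    w^t (influence(b + delta*w) - influence(b - delta*w)) / (2*delta),
    whose expectation over w ~ N(0, I) approaches the trace as delta -> 0."""
    if rng is None:
        rng = np.random.default_rng(0)
    acc = 0.0
    for _ in range(samples):
        w = rng.standard_normal(len(b))
        acc += w @ (influence(b + delta * w) - influence(b - delta * w)) / (2 * delta)
    return acc / samples
```

For a linear influence operator b ↦ Mb the per-sample quantity is exactly w^{t}Mw, whose expectation is Tr(M); the nonlinear case adds only the O(δ^{2}) central-difference error.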
For the sake of consistency with the usual mathematical notation for the iterations of an algorithm, in what follows the parameter λ will be replaced by k.
3 Monte Carlo Implementation of GCV for Conjugate Gradients
Our main goal in this section is to apply our Monte Carlo implementation of GCV as a stopping rule for the Conjugate Gradients method for solving (1.1). Now each iterate x^{k} is an estimate of the solution of (1.1), and k plays the role of λ. In Section 2 we showed that GCV is applicable to influence operators that are continuously differentiable in an open set that contains Ax.
The Conjugate Gradients method of Hestenes and Stiefel [3] applied to the normal equations (CGLS) can be written as follows (for a given x^{0}):
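The displayed recursion did not survive in this version of the text; what follows is a standard CGLS sketch, which we believe matches the method referred to here up to notation (the generator form, yielding every iterate x^{k}, is our choice):

```python
import numpy as np

def cgls(A, b, iters, x0=None):
    """Conjugate Gradients on the normal equations A^t A x = A^t b (CGLS),
    yielding the iterate x^k after each step."""
    m, n = A.shape
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x              # residual of A x = b
    s = A.T @ r                # residual of the normal equations
    p = s.copy()
    gamma = s @ s
    for _ in range(iters):
        q = A @ p
        alpha = gamma / (q @ q)
        x = x + alpha * p
        r = r - alpha * q
        s = A.T @ r
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
        yield x.copy()
```

In exact arithmetic the iterates reach the least squares solution in at most n steps; for ill-conditioned A the interesting regime is the early iterates, which is what the stopping rule must detect.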
For each k > 0, the operator T^{(k)}(b) = x^{k} is nonlinear, and it is continuously differentiable outside the closed termination set M_{k}, where P_{U} denotes the orthogonal projection onto the subspace U (i.e., Ω = ℝ^{m} \ ∪_{k} M_{k}). If A is an ill-conditioned operator, then T^{(k)}(·) is discontinuous on M_{k-1}, as pointed out by Eicke, Louis and Plato in [7] in the infinite dimensional context. The same arguments of [7] can easily be used to understand the behavior of the method in the discrete case. For b in M_{k-1}, we define b^{l} = b + u_{l}, where u_{l} is the singular vector associated with the singular value σ_{l}. From the fact that b^{l} ∈ M_{k} it follows that ||T^{(k)}(b) - T^{(k)}(b^{l})|| is of the order of 1/σ_{l}. Thus, if σ_{l} ≈ 0, then ||T^{(k)}(b) - T^{(k)}(b^{l})|| ≈ ∞; and if b ∈ M_{k-1} and b_{w} is b perturbed by a random vector w, we should also have ||T^{(k)}(b) - T^{(k)}(b_{w})|| ≈ ∞. But usually, in ill-posed problems, the vector b has components along all the singular vectors. This means that b does not belong to a termination set M_{k} for any k.
Taking into account the results of the preceding section, we now describe the algorithm for calculating an approximation of the GCV functional (2.13), using central differences to approximate the directional derivative.
Algorithm 1.
(i) Generate a pseudorandom vector w = (w_{1},...,w_{m})^{t} ∈ ℝ^{m} from a normal distribution with standard deviation equal to one, i.e., w ~ N(0, I_{m×m}).

(ii) For each k, compute the iterates x^{k}, x_{-}^{k} and x_{+}^{k} corresponding to b, b - δw and b + δw, respectively, where δ = 10^{-4}, and take V_{MC}(k) as the value of (2.13) with the trace in the denominator replaced by its central-difference estimate w^{t}(A x_{+}^{k} - A x_{-}^{k}) / (2δ).
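Steps (i) and (ii) can be sketched end to end as follows. Both the CGLS recursion and the exact shape of V_{MC}(k) are reconstructions under standard assumptions (the displayed formulas were lost in this version), so treat the sketch as illustrative rather than the authors' code:

```python
import numpy as np

def cgls_iterates(A, b, iters):
    # Standard CGLS recursion, yielding x^k for k = 1..iters.
    x = np.zeros(A.shape[1])
    r = b - A @ x
    s = A.T @ r
    p = s.copy()
    gamma = s @ s
    for _ in range(iters):
        q = A @ p
        alpha = gamma / (q @ q)
        x = x + alpha * p
        r = r - alpha * q
        s = A.T @ r
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
        yield x

def monte_carlo_gcv(A, b, iters, delta=1e-4, seed=0):
    """Algorithm 1 sketch: V_MC(k) with the trace in the denominator of
    (2.13) estimated by w^t A(x_+^k - x_-^k) / (2*delta)."""
    m = len(b)
    w = np.random.default_rng(seed).standard_normal(m)
    V = []
    for xk, xm, xp in zip(cgls_iterates(A, b, iters),
                          cgls_iterates(A, b - delta * w, iters),
                          cgls_iterates(A, b + delta * w, iters)):
        resid2 = np.sum((b - A @ xk) ** 2)
        trace_est = w @ (A @ (xp - xm)) / (2 * delta)   # ~ Tr(DA_k(b))
        V.append((resid2 / m) / ((m - trace_est) / m) ** 2)
    return V
```

The stopping index is then the iteration k at which the returned values attain their minimum.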
4 Positron emission tomography
The goal of positron emission tomography (PET) is the quantitative determination of moment-to-moment changes in the chemistry and flow physiology of radioactively labelled compounds inside the body. The mathematical problem consists of reconstructing a function representing the distribution of radioactivity in a body cross-section from measured data that are the total activity along lines of known location. One of the main differences between this problem and the one arising in X-ray tomography [17] is that here the measurements tend to be much noisier, so direct inversion using convolution backprojection (CBP) does not necessarily give the best results (see [30]).
In PET [27], the isotope used emits positrons, which annihilate with nearby electrons, generating two photons travelling away from each other in (nearly) opposite directions; the number of such photon pairs (detected in time coincidence) for each line, or pair of detectors, is related to the integral of the concentration of the isotope along the line.
Suppose now that we discretize the problem by subdividing the reconstruction region into n small square-shaped picture elements (pixels, for short) and we assume that the activity in each pixel j is a constant, denoted by x_{j}. If we count b_{i} coincidences along m lines and a_{ij} denotes the probability that a photon emitted by pixel j is detected by detector pair i, then b_{i} is a sample from a Poisson distribution whose expected value is Σ_{j} a_{ij}x_{j}.
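Under this discretized model, generating synthetic data is straightforward to sketch; the tiny sizes and detection probabilities below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny discretization: n = 4 pixels, m = 3 lines.
# a[i, j] = probability that an emission from pixel j is counted on line i.
a = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.3, 0.0, 0.3, 0.0]])
x = np.array([100.0, 80.0, 120.0, 90.0])   # mean emissions per pixel

mean_counts = a @ x            # E(b_i) = sum_j a_ij x_j
b = rng.poisson(mean_counts)   # measured coincidence counts
```

Note that for Poisson data the variance equals the mean, so the homoscedastic model (1.2) is only an approximation here; this is part of what motivates the weighted least squares formulation used below.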
It is well known that ART with small relaxation parameters [18] approximates a weighted least squares solution of (1.1). This fact suggests that the application of Conjugate Gradients to the system (4.21) could give similar or better results [24], and it is a reasonable example to test the capability of GCV as a stopping rule for CG.
With the objective of comparing the GCV procedures in computerized tomography, we apply the Conjugate Gradients method 'preconditioned' with a symmetric ART (from algebraic reconstruction technique) method, as presented recently in [24]; i.e., we apply CG to solve the generalized least squares problem (CGGLS),
where C_{w} = (D + wL), with AA^{t} = L + D + L^{t}, where L is the strictly lower triangular part and D the diagonal of AA^{t}, respectively.
This corresponds to applying CG to the system (4.21).
In the ART method, w is the relaxation parameter. Here it has a different role: it is the weight of the lower triangular part of AA^{t} in the preconditioner. When w is zero, this corresponds to a diagonal scaling.
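A small dense sketch of the preconditioner construction (purely illustrative: in the real problem AA^{t} is never formed explicitly, and the required products are computed as in [2]):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 10))       # toy system: m = 6 equations, n = 10 pixels

AAt = A @ A.T
D = np.diag(np.diag(AAt))              # diagonal of A A^t
L = np.tril(AAt, k=-1)                 # strictly lower triangular part
w = 0.025
C = D + w * L                          # C_w of the text: lower triangular

# Applying C_w^{-1} to a residual is a forward substitution, O(m^2);
# with w = 0 this reduces to the diagonal scaling mentioned above.
r = rng.standard_normal(6)
y = np.linalg.solve(C, r)
```

The point of the triangular structure is that the preconditioner can be applied by forward substitution, without ever inverting C_{w}.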
The Conjugate Gradients method applied to (4.21) can be written in the same way as CGLS, with the residuals premultiplied by C_{w}^{-1}. The elements of the matrix L are not explicitly known, but an efficient way to compute the products A^{t}r and Aw was shown in [2].

In our numerical experiments we used the programming system SNARK93, developed by the Medical Image Processing Group of the University of Pennsylvania [5]. The images to be reconstructed (phantoms) were obtained from a computerized atlas based on typical values inside the brain, as in [18]. The data collection geometry was a divergent one simulating a typical PET data acquisition [5]. We used a discretization with n = 95 × 95 pixels, and the divergent geometry had 300 views of 101 rays each, for a total of m = 30292 equations (8 rays were not considered because of their lack of intersection with the image region). The starting point was a uniform image x^{0} = (a,...,a)^{t}, where a is an approximation of the average density of the phantom.
The choice of a uniform nonzero starting point was advocated by L. Kaufman [20], and it is widely accepted as the best choice by many researchers in PET. The vector b was taken from a pseudorandom number generator with a Poisson distribution (see [19]). The total photon counts were 2 022 085, 991 179, 514 925 and 238 172 (monotonically increasing noise).

We have performed experiments applying Algorithm 1 with each of these four photon counts and with six values of w between 0.0 and 0.025. For each value of w and each photon count we repeated the experiment ten times. The ten tests gave very similar results.
In Figure 1 we plotted, for one of the tests, the functions V_{MC}(k), V_{HH}(k) and ||A(x^{k} - x)||^{2} against the iteration number k, for w = 0.0 (diagonal scaling) and w = 0.025. Here V_{HH} denotes the Monte Carlo GCV function obtained by supposing that PCCGLS is a linear function of the data vector b, as proposed by Hanke and Hansen in [15]. As expected, the minimum of our Monte Carlo GCV function V_{MC}(k) coincides with that of ||A(x^{k} - x)||^{2}, and the curves are very similar. We can see that using V_{HH} can give wrong results in the case of Conjugate Gradients. Figures 2 and 3 show the reconstructions corresponding to the curves in row 1 of Fig. 1.
5 Concluding remarks
Iterative methods in Emission Computed Tomography, like RAMLA [4], that are nowadays used by PET scanners (see for example http://www.medical.philips.com/us/products/pet/products/cpet/), are fast and produce high quality images in a few iterations (because they take into account the Poisson nature of the noise). So it was not our goal in this paper to compare Conjugate Gradients with those methods, but to show that our approximation of GCV is applicable as a stopping rule for the Conjugate Gradients method for large scale ill-posed problems similar to Positron Emission Tomography (very large scale, not severely ill-posed, and with relatively large noise in the data). Some authors (see, for example, [15, 9]) have suggested the use of other approximations to GCV. Our experimental results (Fig. 1) show that Algorithm 1 gives very good approximations for the minimum of ||A(x^{k} - x)||^{2}, much better than those of Hanke et al.
Further research is needed to extend these applications of GCV to more general nonlinear methods and nonlinear problems. We also need more specific asymptotic theoretical results.
Acknowledgements. We are grateful to J. Browne and G.T. Herman for their support on the use of SNARK93.
Álvaro R. De Pierro was partially supported by CNPq Grant No. 300969/2003-1 and FAPESP Grant No. 2002/07153-2, Brazil.
Received: 11/XI/05. Accepted: 30/XI/05.
#647/05.
 [1] D.M. Allen, The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16(1) (1974), 125-127.
 [2] A. Björck and T. Elfving, Accelerated projection methods for computing pseudoinverse solutions of systems of linear equations. BIT, 19 (1979), 145-163.
 [3] Å. Björck, Numerical Methods for Least Squares Problems. SIAM, Philadelphia (1996).
 [4] J. Browne and A.R. De Pierro, A row-action alternative to the EM algorithm for maximizing likelihoods in emission tomography. IEEE Trans. Med. Imag., 15 (1996), 687-699.
 [5] J.A. Browne, G.T. Herman and D. Odhner, SNARK93: a programming system for image reconstruction from projections. Technical Report MIPG198, Department of Radiology, University of Pennsylvania (1993).
 [6] P. Craven and G. Wahba, Smoothing noisy data with spline functions. Numer. Math., 31 (1979), 377-403.
 [7] B. Eicke, A.K. Louis and R. Plato, The instability of some gradient methods for ill-posed problems. Numer. Math., 58 (1990), 129-134.
 [8] L. Eldén, A note on the computation of the generalized cross-validation function for ill-conditioned least squares problems. BIT, 24 (1984), 467-472.
 [9] H. Engl, M. Hanke and A. Neubauer, Regularization of Inverse Problems. Kluwer Academic Publishers Group, Dordrecht (1996).
 [10] H.E. Fleming, Equivalence of regularization and truncated iteration in the solution of ill-posed image reconstruction problems. Linear Algebra and its Appl., 130 (1990), 133-150.
 [11] D.A. Girard, A fast 'Monte-Carlo cross-validation' procedure for large least squares problems with noisy data. Numer. Math., 56 (1989), 1-23.
 [12] G. Golub and U. von Matt, Generalized cross-validation for large-scale problems. Journal of Computational and Graphical Statistics, 6 (1997), 1-34.
 [13] G.H. Golub, M.T. Heath and G. Wahba, Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21 (1979), 215-223.
 [14] M. Hanke, Conjugate Gradient Type Methods for Ill-Posed Problems. Longman Scientific & Technical, Essex, UK (1995).
 [15] M. Hanke and P.C. Hansen, Regularization methods for large-scale problems. Surv. Math. Ind., 3 (1993), 253-315.
 [16] P.C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems. SIAM, Philadelphia (1998).
 [17] G.T. Herman, Image Reconstruction from Projections: The Fundamentals of Computerized Tomography. Academic Press, New York (1980).
 [18] G.T. Herman and L.B. Meyer, Algebraic reconstruction techniques can be made computationally efficient. IEEE Trans. Med. Imaging, 12 (1993), 600-609.
 [19] G.T. Herman and D. Odhner, Performance evaluation of an iterative image reconstruction algorithm for positron emission tomography. IEEE Trans. Med. Imaging, 10 (1991), 336-346.
 [20] L. Kaufman, Implementing and accelerating the EM algorithm for positron emission tomography. IEEE Trans. Med. Imag., 6 (1987), 37-51.
 [21] N. Nguyen, P. Milanfar and G. Golub, Efficient generalized cross-validation with applications to parametric image restoration and resolution enhancement. IEEE Transactions on Image Processing, 10(9) (2001), 1299-1308.
 [22] R. Plato, Optimal algorithms for linear ill-posed problems yield regularization methods. Numer. Funct. Anal. Optim., 11 (1990), 111-118.
 [23] R.J. Santos, Equivalence of regularization and truncated iteration for general ill-posed problems. Linear Algebra and its Appl., 236 (1996), 25-33.
 [24] R.J. Santos, Preconditioning conjugate gradient with symmetric algebraic reconstruction technique (ART) in computerized tomography. Applied Numerical Mathematics, 47 (2003), 255-263.
 [25] R.J. Santos and A.R. De Pierro, A cheaper way to compute generalized cross-validation as a stopping rule for linear stationary iterative methods. Journal of Computational and Graphical Statistics, 12 (2003), 417-433.
 [26] S.D. Silvey, Statistical Inference. Penguin, Harmondsworth (1970).
 [27] M.M. Ter-Pogossian et al., Positron emission tomography. Scientific American, October (1980), 170-181.
 [28] A. van der Sluis and H.A. van der Vorst, Numerical solution of large, sparse linear algebraic systems arising from tomographic problems. In G. Nolet, editor, Seismic Tomography. D. Reidel Pub. Comp., Dordrecht, The Netherlands (1987).
 [29] A. van der Sluis and H.A. van der Vorst, SIRT and CG type methods for the iterative solution of sparse linear least squares problems. Linear Algebra and its Appl., 130 (1990), 257-303.
 [30] Y. Vardi, L.A. Shepp and L. Kaufman, A statistical model for positron emission tomography. J. Amer. Stat. Assoc., 80(389) (1985), 8-37.
 [31] C. Vogel, Computational Methods for Inverse Problems. SIAM, Philadelphia (2002).
 [32] G. Wahba, Three topics in ill-posed problems. In H. Engl and C. Groetsch, editors, Inverse and Ill-Posed Problems. Academic Press, New York (1987).
 [33] G. Wahba, Spline Models for Observational Data. SIAM, Philadelphia (1991).
Publication in this collection: 27 Sept 2006. Date of issue: 2006.