
Chapter 2.13

Statistical biophysical parameter retrieval and emulation with Gaussian processes

Gustau Camps-Valls*, Luis Gómez-Chova, Valero Laparra, Luca Martino, Gonzalo Mateo-García, Jordi Muñoz-Marí, Daniel H. Svendsen and Jochem Verrelst¹

Image Processing Laboratory (IPL), Universitat de València, València, Spain
*Corresponding author.
1. All authors contributed equally, and thus appear sorted alphabetically.

Hyperspectral Imaging. https://doi.org/10.1016/B978-0-444-63977-6.00015-8
Copyright © 2020 Elsevier B.V. All rights reserved.

1. Introduction

The main goal of Earth observation (EO) is to monitor the Earth and its interaction with the atmosphere. Satellite sensors provide useful data to tackle the problem and address many important applications with deep societal, environmental, and economic implications. The analysis of the acquired data can be done either at local or global scales by looking at bio-geochemical cycles, atmospheric situations, and vegetation dynamics [1–5]. All these complex interactions are studied through the definition of biophysical parameters, representing different properties of the land (e.g., surface temperature, crop yield, defoliation, biomass, leaf area coverage), water (e.g., yellow substance, ocean color, suspended matter, or chlorophyll concentration), or the atmosphere (e.g., temperature, moisture, or trace gases). Every single application draws on specific knowledge about the physical, chemical, and biological processes involved, such as energy balance, evapotranspiration, or photosynthesis. However, remotely sensed observations only sample the radiation reflected or emitted by the surface, and thus an intermediate modeling step is necessary to transform the measurements into estimations of the biophysical parameters [6].


This is in general considered an inverse modeling problem: we have access to observations generated by the system, but we are ultimately interested in the unknown parameters that generated them. This chapter reviews recent methodological advances in a distinct and very important field in remote sensing data processing: the estimation of bio-geophysical parameters from remote sensing images (multispectral or hyperspectral), in situ measurements with field spectrometers, simulations from radiative transfer models (RTMs), and ancillary data. All these modalities of data are typically known as Earth observation data. A series of international research programs, such as the International Geosphere-Biosphere Programme (IGBP), the World Climate Research Programme (WCRP), and the National Aeronautics and Space Administration (NASA) Earth Observing System (EOS), established remote sensing model inversion as the most important problem to be solved with EO imagery in the near future. Monitoring the physical processes occurring in the land, ocean, and atmosphere is now possible and reliable due to the improved image characteristics (increasing spatial, spectral, multiangular, and temporal resolutions) and the advances in signal processing, statistics, and machine learning.

Current EO, however, faces two very important challenges that we call the data problem and the model problem. Optical EO satellites, endowed with a high temporal resolution, enable the retrieval and hence monitoring of plant biophysical variables [7,8]. We are witnessing an ever increasing amount of data gathered with current and upcoming EO satellite missions, such as the European Space Agency (ESA) Sentinels and the NASA A-Train satellite constellations. With the superspectral Copernicus Sentinel-2 [9] and Sentinel-3 missions [10], as well as the planned Environmental Mapping and Analysis Programme (EnMAP) [11], Hyperspectral Infrared Imager (HyspIRI) [12], Hyperspectral Precursor and Application Mission (PRISMA) [13], and Fluorescence Explorer (FLEX) [14] imaging spectrometer missions, an unprecedented data stream for land monitoring becomes available. These data influxes require processing techniques that are accurate, robust, reliable, and fast when the retrieval of information on plant growth and health status is envisaged. Over the last few decades a wide diversity of vegetation retrieval methods have been developed, but only a few of them made it into operational processing chains, and many of them are still in their infancy. There is an additional, more scientifically challenging, problem: the lack of a unified mathematical framework for prediction. Both issues call for advances in EO data processing methods.

Solving inverse problems is a recurrent topic of research in physics in general, and in EO in particular. After all, science is about making inferences about physical parameters from sensory data. A very relevant inverse problem is that of estimating vegetation properties from remotely sensed images, and this chapter focuses on this particular problem. Accurate inverse models may help to determine, for example, the phenological stage and health status (e.g., development, productivity, stress) of crops and forests [15].


Leaf chlorophyll content (Chl), leaf area index (LAI), and fractional vegetation cover (FVC) are among the most important vegetation parameters to retrieve from space or airborne observations [16,17]. In general, mechanistic models implement the laws of physics and allow us to compute data values given a model [18]. This is typically referred to as the forward problem. The inverse problem tries to reconstruct the model from a set of measurements; see Fig. 1. A forward model describing the system can be expressed as

x = g(y, u) + n,

where x is a measurement obtained by the satellite (such as the expected radiance); y represents the state of the biophysical variables on the Earth (which we desire to infer or predict, and which is often referred to as the outputs or targets in the inverse modeling literature); u contains a set of controllable configurations (e.g., wavelengths, viewing direction, time, Sun position, and polarization); n is an additive noise vector; and g(·) is a function relating y to x. Such a function g is typically considered to be nonlinear, smooth, and continuous, and the codes implementing such models are very often computationally demanding. Traditionally, the emphasis has been put on developing physically consistent models g; understanding processes is, of course, the ultimate goal. Nevertheless, from a practical viewpoint, one aims to invert such a model with acquired observations, so one is ultimately interested in obtaining an inverse model

y = f(x, θ) + e,

that is, in finding a (possibly nonlinear) function f, parameterized by θ, that approximates the biophysical variables y given the observations x received by the satellite, subject to an error term e.

FIGURE 1 Forward and inverse problems in remote sensing data modeling. RTM, radiative transfer model.


RTMs are typically used to implement the forward direction [19,20]. However, inverting RTMs directly is very complex because the number of unknowns is generally larger than the number of independent radiometric measurements [3]. Also, estimating physical parameters from RTMs is hampered by the presence of high levels of uncertainty and noise, primarily associated with atmospheric conditions and sensor calibration. This translates into inverse problems where seemingly similar spectra may correspond to very diverse parameter values, giving rise to underdetermined, ill-posed problems that are difficult to address unless strong regularization is imposed.

Methods for model inversion and parameter retrieval can be roughly separated into three main families: statistical, physical, and hybrid methods [5]. Statistical inversion predicts a biogeophysical parameter of interest using a training data set of input-output data pairs coming from concurrent measurements of the parameter of interest (e.g., LAI) and the corresponding satellite observations (e.g., reflectances). Statistical methods typically outperform other approaches, but ground truth measurements involving a terrestrial campaign are necessary. Physical inversion reverses RTMs by searching for similar spectra in lookup tables (LUTs) and assigning the parameter corresponding to the most similar observed spectrum. This requires selecting an appropriate cost function and generating a rich, representative LUT from the RTM. The use of RTMs to generate data sets is common practice, and especially convenient because acquisition campaigns are very costly in terms of time, money, and human resources, and are usually limited in terms of parameter combinations. Finally, hybrid inversion exploits the input-output data generated by RTM simulations and trains statistical regression models to invert the RTM. Hybrid models combine the flexibility and scalability of machine learning while respecting the physics encoded in the RTMs. Currently, kernel machines in general [21,22], and Bayesian nonparametric approaches such as Gaussian process regression (GPR) [23] in particular, are among the preferred regression models in this field [24,25].

We will start the chapter in Section 2 with an overview and application of standard GPs in remote sensing parameter retrieval, and then introduce several advances around GPs for forward and inverse modeling in Sections 3 and 4. We will first pay attention to three advances in inverse modeling: (1) how to deal with multiple outputs (variables of interest) simultaneously, in Section 3.1; (2) how to exploit the information contained in both data and models to construct physically aware machine learning models, in Section 3.2; and (3) how to deal with the data deluge by developing GP models that scale well to large data sets, in Section 3.3. Finally, we will approach the interesting problem of modeling the forward RTM with GPs in Section 4. This field is known as emulation or surrogate modeling and has enormous advantages, both computational and in terms of model tractability. We conclude in Section 5 with some remarks and an outline of future work.


2. Gaussian process models for inverse modeling

In this section we show how GPs can be used for the problem of inverse modeling.

2.1 Gaussian processes for parameter retrieval

GPs are state-of-the-art tools for regression and function approximation, and they have recently been shown to excel in biophysical variable retrieval following both statistical [24,25,25a] and hybrid approaches [26,27]; see Fig. 2. Let us consider a set of n pairs of observations or measurements, D_n := {(x_i, y_i)}_{i=1}^n, perturbed by an additive independent noise. The input data pairs (X, y) used to fit the inverse machine learning model f(·) come either from in situ field campaign data (statistical approach) or from simulations by means of an RTM (hybrid approach). We assume the following model,

y_i = f(x_i) + e_i,   e_i ~ N(0, σ²),   (1)

where f(x) is an unknown latent function, x ∈ ℝ^d, and σ² stands for the noise variance. Defining y = [y_1, …, y_n]^T and f = [f(x_1), …, f(x_n)]^T, the conditional distribution of y given f becomes p(y|f) = N(f, σ²I), where I is the n × n identity matrix. Now, in the GP approach, we assume that f follows an n-dimensional Gaussian distribution, f ~ N(0, K) [28]. The covariance matrix K of this distribution is determined by a kernel function encoding similarity between the input points [23]. Very often the radial basis function (RBF) kernel is adopted, which returns similarities between pairs of observations, so the kernel matrix entries are computed according to [K]_{ij} := k(x_i, x_j) = exp(−‖x_i − x_j‖²/(2λ²)), where λ is the kernel lengthscale. The intuition here is the following: the more similar the inputs x_i and x_j are, the more correlated the outputs y_i and y_j should be.

FIGURE 2 Statistical inverse modeling.


Thus, the marginal distribution of y can be written as

p(y) = ∫ p(y|f) p(f) df = N(0, C_n),

where C_n = K + σ²I. Now, we are interested in predicting a new output y_* given an input x_*. The GP framework handles this by constructing a joint distribution over the training and test points,

[y; y_*] ~ N( 0, [[C_n, k_*], [k_*^T, c_*]] ),

where k_* = [k(x_*, x_1), …, k(x_*, x_n)]^T is an n × 1 vector and c_* = k(x_*, x_*) + σ². Then, using standard algebra manipulations and properties of Gaussians, we can find the distribution over y_* conditioned on the training data, i.e., p(y_*|D_n), which is a normal distribution with predictive mean and variance given by

μ_GP(x_*) = k_*^T C_n^{-1} y,
σ²_GP(x_*) = c_* − k_*^T C_n^{-1} k_*.   (2)

Namely, p(y_*|D_n) = N(y_* | μ_GP(x_*), σ²_GP(x_*)). Moreover, denoting f_* = f(x_*), we have that p(f_*|D_n) = N(f_* | μ_GP(x_*), σ²_GP(x_*) − σ²). Note that the two previous densities only differ in the expression of the variance.¹ Thus, GPs yield not only predictions μ_GP(x_*) for test data but also the so-called "error bars" or confidence intervals, σ_GP(x_*), assessing the uncertainty of the mean prediction. The hyperparameters θ = [λ, σ] to be tuned in the GP determine the width of the RBF kernel function and the noise on the observations. This can be done by marginal likelihood maximization or simple grid search, attempting to minimize the squared prediction errors.

1. In the interpolation case, when no measurement noise is considered, p(y_*|D_n) = p(f_*|D_n).
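The predictive equations above translate almost line by line into code. The following is a minimal Python/NumPy sketch of Eq. (2); the lengthscale and noise variance are illustrative fixed values rather than the result of marginal likelihood maximization, and a Cholesky factorization replaces the explicit inverse of C_n for numerical stability.

import numpy as np

def rbf_kernel(A, B, lengthscale):
    # RBF kernel matrix between the rows of A (n x d) and B (m x d).
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

def gp_predict(X, y, X_star, lengthscale=1.0, noise_var=0.1):
    # Predictive mean and variance of Eq. (2) for test inputs X_star.
    C = rbf_kernel(X, X, lengthscale) + noise_var * np.eye(len(X))  # C_n = K + s^2 I
    L = np.linalg.cholesky(C)                                       # stable "inverse"
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))             # C_n^{-1} y
    k_star = rbf_kernel(X_star, X, lengthscale)                     # rows are k_*^T
    mean = k_star @ alpha                                           # mu_GP(x_*)
    V = np.linalg.solve(L, k_star.T)
    c_star = 1.0 + noise_var                 # k(x_*, x_*) = 1 for the RBF kernel
    var = c_star - (V ** 2).sum(axis=0)                             # s^2_GP(x_*)
    return mean, var

# Toy usage: recover a smooth function from noisy samples.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
mu, var = gp_predict(X, y, np.linspace(0, 5, 100)[:, None])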

2.2 Mapping applications with Gaussian processes

2.2.1 Optimizing Gaussian process regression mapping by analyzing associated uncertainties

In this section we present the use of GPs within an integrated scientific toolbox for semiautomatic retrieval of biophysical parameters, the so-called Machine Learning Regression Algorithm (MLRA) toolbox [29], which implements the simple regression MATLAB toolbox [30]. Essentially, the MLRA toolbox enables users to: (1) train various machine learning algorithms, including GPR; (2) evaluate their performances against validation data; and (3) apply a selected model (e.g., the best performing one) to an image for mapping applications.


Special attention is given here to the GPR algorithm, not only because several comparison studies concluded that GPR is a top-performing retrieval algorithm [25,29,31–33], but also because of its additional properties of band relevance ranking and the provision of uncertainty estimates [25,34,35].

Although GPR has proven to be a powerful regression algorithm for mapping applications, just as any other regression algorithm it suffers from spectral collinearity when too many similar bands are fed into the algorithm [36]. This situation is typically the case when applying GPR to hyperspectral data. To mitigate the problem of collinearity, an elegant solution is to make use of dimensionality reduction (DR) techniques. DR methods transform the original spectral data so as to define a small set of new features (components) in a lower-dimensional space which contain the vast majority of the original data set's information. Usually, classical principal component analysis (PCA) [37] is applied in regression studies. While PCA has proven its use in a broad diversity of applications, situations may occur where PCA is not the best choice. As an extension of PCA, partial least squares (PLS) introduces some refinements by looking for projections that maximize the covariance and correlations between the spectral information and the input variables [38]. PLS embedded into regression (PLSR) even became a popular regression method in chemometrics and remote sensing applications (e.g., see Ref. [32] for a review). Nonetheless, the regression part of PLSR and principal component regression has always been restricted to multiple linear regression. The question therefore arose how well DR methods such as PCA and PLS combine with GPs. This was recently tackled in Ref. [36] based on simulated and experimental data, and is summarized below.

The widely used SPectra bARrax Campaign (SPARC) data set [25,39,40] was chosen to evaluate the performances of the combined DR-GPR retrieval strategies. This data set consists of field measurements of biophysical variables (chlorophyll content [Chla], LAI, and FVC) and associated spectral data at the Barrax agricultural site. LAI values were collected over a variety of crop types (garlic, alfalfa, onion, sunflower, corn, potato, sugar beet, vineyard, and wheat). During the SPARC campaign in July 2003, airborne hyperspectral HyMap flight lines were acquired for the study site. HyMap flew with a configuration of 125 contiguous spectral bands, spectrally positioned between 430 and 2490 nm. Three types of analyses were conducted: (1) GPR without DR methods, i.e., trained by all bands; (2) GPR with PCA, i.e., trained by five components; and (3) GPR with PLS, also trained by five components. A 4-fold cross-validation (CV) sampling design was applied to make sure all data were used for training and validation. When inspecting the R²_CV results, however, no gain in accuracy by the DR methods was observed as compared to using all bands. In fact, GPR yielded the best results when using all bands, although the difference in accuracy as compared to combining with the best performing DR method is small: an R²_CV of 0.94 versus 0.93 using PLS.


That the DR methods did not really improve accuracy can be attributed to the versatility of GPR, but also to the nature of the SPARC data set, which consists of relatively few samples (118) and a limited number of hyperspectral bands (125). Considering that real data fluctuate more and are not free from noise, the problem of collinearity is less prevalent in this experimental airborne data set. However, when processing speed is also taken into account, DR methods make the difference. Although the regression algorithms ran fast because they were trained with relatively few samples and bands, applying DR methods still caused a substantial acceleration. For instance, PLS-GPR processed model development and validation about 8 times faster than GPR alone.

To exemplify the mapping of a single variable, LAI, an arbitrary subset of the HyMap flight line was processed twice using a GPR model: (1) without a DR method and (2) in combination with five components of the best performing DR method, i.e., PLS. The conversion of the HyMap subset to an LAI map using GPR alone took 73 s. When using the PLS-GPR model instead, processing time reduced to only 5 s. The gain in processing speed (about 14 times faster) is not trivial, particularly in view of operational processing of hyperspectral data. Inspection of the obtained LAI maps (Fig. 3, top) reveals the contrast between the irrigated circular parcels with high LAI and the surrounding fallow land. These maps are in agreement with earlier mapping approaches [29,33].

FIGURE 3 Leaf area index (LAI) map [m²/m²] processed by Gaussian process regression (GPR) using all 125 bands (top left); LAI map processed by PLS-GPR using five components (top right); associated GPR uncertainty estimates (σ), respectively (bottom). Relative differences in standard deviation (σ) between GPR with all 125 bands and PLS-GPR are also provided (bottom right) [36].


Note that the LAI mean (μ) estimates vary between the two maps, especially for low LAI on senescent, nonirrigated parcels. For instance, the PLS-GPR model is considerably better adapted to assign close-to-zero values to the fallow lands. In contrast, the LAI map obtained by GPR alone looks more heterogeneous, which suggests that the model had more difficulty dealing with the spatial variability of the image. The question arises which map is more correct. To answer this question we use the uncertainty estimates (σ) that GPR provides along with the mean estimates, so that retrieval quality can actually be quantified. The lower the σ, the more confident the retrieval, relative to what has been presented during the training phase. Uncertainty maps are provided in Fig. 3 (bottom). When GPR uses all bands, the uncertainties tend to vary more, with especially high, patchy uncertainties over the harvested or nonirrigated drylands. Low uncertainties are encountered on the green irrigated fields, which can be attributed to the applied sampling design that predominantly focused on crops in vegetative status. Hence, the uncertainty maps of both retrieval algorithms can be compared to draw conclusions about the processing quality of a GPR model. In order to quantify the extent of the reduced uncertainties, the relative differences between the GPR and PLS-GPR maps are shown in Fig. 3 (bottom right). This map is dominated by shades of blue, which indicates that for most of the land covers uncertainties are systematically reduced by the PLS-GPR model, mostly on the order of 50%. This suggests that training the regression algorithm with components from a DR method not only speeds up image processing but also leads to more certain estimates and thus higher-quality maps. Altogether, given not only its fast processing but also its excellent performance, GPR in combination with a DR method would be a first choice when dealing with hyperspectral data.

2.2.2 Operational use of Gaussian process regression for mapping biophysical parameters

We are here interested in developing improved S2 vegetation products for agricultural monitoring applications. Specifically, the idea was to reformulate the classical LAI product. Existing LAI products only provide green LAI estimates, i.e., for when the plant is in a vegetative state. These products underestimate the LAI of annual plants that senesce, e.g., crops: LAI values close to zero are provided even though the senescent leaves are still on the plant. To correct for this, an explicit separation between green and brown LAI is necessary. Thanks to new bands of S2 in the short-wave infrared (SWIR) that overlap with the absorbance regions of woody materials, it becomes possible to quantify the LAI of senescent vegetation. To do so, we strived to quantify green and brown LAI separately by developing two independent GP models. Hence, two GPR models were trained: one for LAI_green and another one for LAI_brown. Only S2 bands at 10 and 20 m were used, leading to a total of 12 bands. Contrary to hyperspectral data, there is no reason to apply a band selection or DR method with such a low number of bands.


Models were trained based on field data of the SPARC data set, using again a 4-fold CV design. This led to an R²_CV of 0.89 for LAI_green and 0.67 for LAI_brown, respectively. These models were used in a processing chain for automated S2 processing. The clear advantage of GPR is the delivery of associated uncertainty estimates, which can then function as a spatial mask. By setting an uncertainty threshold, the LAI_green/brown estimates that fall within the threshold are retained. As such, not only are reliable estimates displayed, but the uncertainty threshold also ensures that any nonvegetated surface is masked out, e.g., water bodies, clouds, man-made surfaces; given that the models were not trained for these areas, they will automatically lead to high uncertainties. To exemplify the new LAI product, two S2 images acquired during summer, one in France (south of Toulouse) and one in Spain (Valladolid region), were processed into maps of LAI_green/brown. The composite maps of LAI_green/brown that fall within 50% uncertainty are shown in Fig. 4. Whereas over the French area green vegetation dominates and only some crops are senescing, the Valladolid region is much drier. There, green areas are only found on irrigated parcels; the other areas are either covered with senescent vegetation (LAI_brown) or are dried-out, fallow land, leading to close-to-zero LAI. Thanks to the uncertainty threshold, any nonvegetated area is directly masked out in both images.
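The masking logic just described can be sketched as follows, reusing the gp_predict helper from the Section 2.1 sketch; the 50% relative-uncertainty threshold mirrors the one used for the composite maps, while the function name, inputs, and threshold handling are illustrative assumptions rather than the operational S2 chain.

import numpy as np

def masked_lai_map(pixels, X_train, y_train, rel_unc_threshold=0.5):
    # Per-pixel GPR prediction: mean LAI and predictive variance.
    mean, var = gp_predict(X_train, y_train, pixels)
    sigma = np.sqrt(var)
    # Relative uncertainty; pixels the model was never trained on
    # (water, clouds, man-made surfaces) naturally score high here.
    rel_unc = sigma / np.maximum(np.abs(mean), 1e-6)
    # Keep only estimates within the threshold; mask out the rest.
    lai = np.where(rel_unc <= rel_unc_threshold, mean, np.nan)
    return lai, sigma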

3. Advanced Gaussian processes for inverse modeling

This section pays attention to several advances in GP modeling of biophysical parameters. We will first tackle the problem of predicting several parameters simultaneously, the so-called multioutput regression problem. Note that the multioutput problem does not only appear when trying to estimate several parameters at the same time but also, as we will see later in Section 4, when we try to estimate several spectral channels in emulation settings. Then, we will see how to combine both in situ and RTM-simulated data in a novel GP model that allows transfer learning across space, time, or crops.

FIGURE 4 Subsets of the S2 composite LAI_green and LAI_brown product [m²/m²] for the south of Toulouse, France (left), and the west of Valladolid, Spain (right). LAI, leaf area index.


We end this section by reviewing methodological advances in GPs to tackle big EO problems.

3.1 Multioutput Gaussian process models

One of the problems with the standard GPR formulation is that it applies only to scalar functions, i.e., we can predict only one variable. A straightforward strategy to deal with several target variables is to develop as many individual GP models as there are variables. While generally good performance is attained in practice, the approach has a clear shortcoming: the obtained models are independent and do not take into account the relationships between the output variables. The field of multioutput regression with GP models has been widely studied, and many implementations are available nowadays. We here review the most common approaches: direct multioutput modeling [25]; the linear model of coregionalization (LMC) [41], also known as co-kriging in the field of geostatistics [42]; and a particular case called the intrinsic coregionalization model (ICM) [41]. Other more sophisticated forms involve defining latent force model (LFM) GPs [43–45], yet they are not considered here for the sake of brevity.

3.1.1 Direct multioutput approach

The classical GP formulation only deals with a single output variable. Nevertheless, GPs can be straightforwardly extended to more than one output. Indeed, let us recall the equations for the predictive mean and variance of a GP given in Section 2,

μ_GP(x_*) = k_*^T (K + σ²I)^{-1} y = k_*^T α,
σ²_GP(x_*) = σ² + k(x_*, x_*) − k_*^T (K + σ²I)^{-1} k_*.

A direct multioutput GP can be trivially obtained if the model weights for each individual output are stacked together as Λ = [α_1 | ⋯ | α_D], a matrix of size n × D. The Λ matrix is obtained using the same formula (and virtually the same computational cost) as in the standard single-output GP by

Λ = (K + σ²I)^{-1} Y,   (3)

where now the outputs are also stacked together in Y = [y_1 | ⋯ | y_D] ∈ ℝ^{n×D}. A direct multioutput model does not take into account the relationships between the output variables explicitly. Nevertheless, when learning hyperparameters, all output variables are taken into account, so the final model is optimized to perform well on all output variables jointly. In prediction mode, each output is obtained independently using only the corresponding column of the Λ matrix and the input samples, as shown in the sketch below.
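A minimal NumPy sketch of Eq. (3) and the corresponding prediction step; note that the only change with respect to the single-output code of Section 2.1 is that the targets, and hence the weights, become matrices.

import numpy as np

def fit_direct_multioutput(K, Y, noise_var):
    # Eq. (3): Lambda = (K + s^2 I)^{-1} Y, one weight column per output.
    n = K.shape[0]
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    return np.linalg.solve(L.T, np.linalg.solve(L, Y))   # (n, D) weights

def predict_direct_multioutput(K_star, Lam):
    # Predictive means for all D outputs at once: (n_test, D).
    # Each output uses only the corresponding column of Lambda.
    return K_star @ Lam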


In order to adjust these multioutput hyperparameters we proceed as follows. The hyperparameters of the single-output model are obtained by maximizing the log marginal likelihood, which, assuming a Gaussian prior, has a closed form given by Ref. [23],

log p(y|X) = −(1/2) y^T (K + σ²I)^{-1} y − (1/2) log|K + σ²I| − (n/2) log 2π,   (4)

where |·| is the matrix determinant. For the multioutput case we just take the diagonal elements for the first term of (4),

log p(Y|X) = −(1/2) diag( Y^T (K + σ²I)^{-1} Y ) − (1/2) log|K + σ²I| − (n/2) log 2π,   (5)

and we end up with a vector of D elements, each one being the log marginal likelihood of one output variable. Then we take the sum of all vector elements. This simple strategy allows us to optimize the model parameters by just maximizing the sum of the individual log marginal likelihoods. Regarding the derivatives of Eq. (5) w.r.t. the hyperparameters θ, we obtain them by adding the individual derivatives for each output, given by

∂/∂θ_j log p(y_i|X, θ) = (1/2) y_i^T K^{-1} (∂K/∂θ_j) K^{-1} y_i − (1/2) Tr( K^{-1} ∂K/∂θ_j )
                       = (1/2) Tr( (α_i α_i^T − K^{-1}) ∂K/∂θ_j ),   (6)

where θ_j is the j-th hyperparameter and α_i = K^{-1} y_i for each output y_i.

3.1.2 Coregionalization models

Multioutput models have several advantages: one model accounts for all outputs, thus reducing the computational burden, and usually slightly more accurate (and physically consistent) results are obtained. Considering all outputs at the same time acts as a kind of regularization that yields smoother functions over all outputs. However, the direct multioutput model does not take into account the relationships between different outputs, and in fact acts as single-output models put together, the only difference being that hyperparameters are optimized considering all outputs jointly instead of separately. A real multioutput formulation has to consider that we want to model a vector function instead of a scalar one. That is, we go from scalar functions, f : X → ℝ, to vector functions, f : X → ℝ^D, where D is the function dimension (i.e., the number of outputs). This implies that our reproducing kernel should be K : X × X → ℝ^{D×D}, so we can express f by means of the representer theorem [46]:

f(x) = Σ_{i=1}^{N} K(x_i, x) c_i,   c_i ∈ ℝ^D,   (7)


where each term K(x_i, x) is a matrix acting on a vector c_i. The coefficients c_i satisfy the linear system

c = ( K(X, X) + λNI )^{-1} y,   (8)

where c and y are ND-vectors obtained by concatenating the coefficients and the outputs, respectively, and K(X, X) is an ND × ND matrix with entries (K(x_i, x_j))_{d,d′} for i, j = 1, …, N and d, d′ = 1, …, D. The blocks of this matrix are N × N matrices of the form K(X_i, X_j), i, j = 1, …, D (assuming all outputs have the same number of training samples), where X_i are the training samples corresponding to the i-th output. The predictions for new samples are given by

f(x_*) = K_{x_*}^T c,

with K_{x_*} ∈ ℝ^{D×ND} composed of blocks (K(x_*, x_j))_{d,d′}. When the matrix K(X, X), composed of block matrices K(X_i, X_j), is a block diagonal matrix, that is, K(X_i, X_j) = 0 for all i ≠ j, then the outputs are independent. The nondiagonal matrices establish the relationships between the outputs. Usually a particular class of kernel functions is considered, which can be expressed as the sum of products between a standard kernel function, defined as usual in the input space, and another positive definite matrix encoding the relationships between the outputs. That is, we have K(x_i, x_j) = k(x_i, x_j)B, where B is a D × D positive semidefinite matrix. In the more general case we have

K(X, X) = Σ_{q=1}^{Q} B_q ⊗ k_q(X, X),   (9)

where ⊗ is the Kronecker product. We will focus now on two particular cases of this class of kernels: the LMC and the intrinsic coregionalization model (ICM). Both models were introduced in the field of geostatistics, where multioutput prediction using GPs is known as co-kriging [42,47]:

• Linear model of coregionalization. In the LMC each output is expressed as a linear combination of independent random functions [42],

f_d(x) = Σ_{q=1}^{Q} a_{d,q} u_q(x),   (10)

where a_{d,q} are scalar coefficients and u_q(x) are latent functions with zero mean and covariance k_q(x, x′). Eq. (10) can be further expanded to explicitly account for the latent functions that share the same covariance, that is, f_d(x) = Σ_{q=1}^{Q} Σ_{i=1}^{R_q} a^i_{d,q} u^i_q(x).


It can be shown [41] that the full covariance (matrix) of this model is the same as in Eq. (9), where each B_q ∈ ℝ^{D×D} is known as a coregionalization matrix [42]. The rank of each matrix B_q determines the number of latent functions that share the same covariance function [41].

• Intrinsic coregionalization model. The ICM is a particular case of the LMC with Q = 1. Therefore the kernel matrix for the whole data set X is just K(X, X) = B ⊗ k(X, X). In the ICM each output is a linear combination of latent functions u_q(x), all having the same covariance function k(x, x′). The rank of the matrix B determines the number of latent functions u_i(x) (all with the same covariance) used in the linear combination.

3.1.3 Gap filling of leaf area index and fraction of absorbed photosynthetically active radiation time series

Approximation with ICM models. ICM models can capture the relationships between a set of output variables. This can be further exploited in cases where we lack information on one or more variables, in order to improve predictions over single-output models. We illustrate this setting with a real multioutput time series data set composed of LAI and fraction of absorbed photosynthetically active radiation (fAPAR) products corresponding to rice crop fields in Spain, ranging from 2003 to 2014. The data were acquired in the context of the Earth obseRvation Model based RicE information Service (ERMES) European project,² which aims to develop a prototype of Global Monitoring for Environment and Security (GMES) downstream services based on the assimilation of EO and in situ data within crop modeling solutions dedicated to the rice sector. In this data set we can observe the interannual variability of rice from 2003 to 2014 at coarse spatial resolution (2 km), which is useful for regional vegetation modeling. Concretely, the ERMES LAI and fAPAR products were derived from two optical sensors onboard satellite platforms (Moderate Resolution Imaging Spectroradiometer, MODIS, and Satellite Pour l'Observation de la Terre, SPOT-Vegetation). We will focus on the data coming from MODIS. Fig. 5 (left) shows the trends for LAI (top) and fAPAR (bottom) from 2003 to 2014. To conduct the experiment and show the potential of ICM models, we removed training samples from 2003 to 2005 for LAI, and from 2011 to 2014 for fAPAR, as shown in Fig. 5A, where black crosses represent training data and red crosses testing points. Fig. 5B shows the results obtained when predicting LAI and fAPAR using individual, single-output models. As we can see, individual models make poor predictions in the areas where training data are missing. We also observe that the predictive variance of the model is considerably larger in these areas, as expected. This problem can be tackled quite efficiently with a multioutput model under the ICM framework.

2. http://www.ermes-fp7space.eu/.


FIGURE 5 (A) LAI and fAPAR MODIS 8-daily time series of rice in Spain from 2003 to 2014. (B) Predictions made using individual, single-output models. (C) Predictions made using the ICM model. fAPAR, fraction of absorbed photosynthetically active radiation; LAI, leaf area index; MODIS, Moderate Resolution Imaging Spectroradiometer.

We built a model with a covariance defined by K_ICM = B ⊗ K_Matérn, where, to ensure that B is positive definite, it is defined as B = WW^T + diag(κ), for some matrix W and vector κ, which are hyperparameters to be tuned, and K_Matérn is the standard Matérn 3/2 kernel. Results of using this coregionalization model can be seen in Fig. 5C. The ICM is able to capture the signal characteristics, and the predictive variance is smaller compared to the individual models.

Approximation with LMC models. The results obtained by the ICM can be further improved using the LMC. For this particular problem we build a covariance as the sum of three different covariance functions accounting for the properties of our signals, that is,

K_LMC = B ⊗ (K_bias + K_linear + K_Matérn) = B_1 ⊗ K_bias + B_2 ⊗ K_linear + B_3 ⊗ K_Matérn,

where the bias kernel (K_bias) accounts for the mean of the signals, the linear kernel (K_linear) accounts for linear trends, and the Matérn 3/2 kernel (K_Matérn) for the underlying signals. A sketch of this kind of covariance construction follows.
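A minimal NumPy sketch of the ICM covariance under the stated parameterization B = WW^T + diag(κ); the Matérn 3/2 routine and the example values of W and κ are illustrative (in practice W and κ are fitted by marginal likelihood maximization), and the LMC variant simply sums several such Kronecker terms, one per base kernel.

import numpy as np

def matern32(t1, t2, lengthscale=30.0):
    # Matern 3/2 kernel between two sets of scalar time stamps.
    r = np.abs(t1[:, None] - t2[None, :]) / lengthscale
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def icm_covariance(t, W, kappa):
    # K_ICM = B kron K_Matern, with B = W W^T + diag(kappa),
    # which is positive (semi)definite by construction.
    B = W @ W.T + np.diag(kappa)
    return np.kron(B, matern32(t, t))   # (nD x nD) over the stacked outputs

# Example: D = 2 outputs (LAI, fAPAR), rank-1 coregionalization,
# 8-daily time stamps over one year (illustrative values).
t = np.arange(0.0, 365.0, 8.0)
K_icm = icm_covariance(t, W=np.array([[1.0], [0.7]]), kappa=np.array([0.1, 0.1]))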


The same strategy can be employed with single-output models: we also fit two GP models with covariances given by the sum of a bias term, a linear term, and a Matérn kernel term. The results obtained for both the single-output and the LMC multioutput models can be compared in Fig. 6. As in the previous experiment, single-output models are unable to capture the signal characteristics in the areas where training samples are missing. On the other hand, the LMC model perfectly captures the signal characteristics, noticeably improving the results obtained by the previous ICM model.


FIGURE 6 (A) Predictions with individual Gaussian processes using a covariance function with three terms: bias, linear, and Matérn kernel. (B) The same predictions using the linear model of coregionalization multioutput model. fAPAR, fraction of absorbed photosynthetically active radiation; LAI, leaf area index.


3.2 Hybrid modeling with joint Gaussian process models

Statistical retrieval models, such as GPs, need data to learn a mapping from an observed spectrum to an underlying biophysical parameter. Unfortunately, real data in a remote sensing context require expensive and time-consuming campaigns. On the other hand, there is a wealth of simulated data available through physical models such as RTMs. There are obvious reasons for attempting to include these simulated data in a statistical model. It is known that a lack of data can lead to overfitting for many predictive models, including GPs. Furthermore, their behavior in regions of scarce or no training data is difficult to predict and usually not ideal. The problem with simulated data, however, is that they rarely follow the same distribution³ as the real data. What we would like, then, is to have the simulated data "guide" our model when extrapolating into the region of scarce data, but without confusing predictions made in the "data-rich" domain. This section presents a method which automatically detects the relative quality of the simulated and real data and combines them accordingly. This is achieved by learning one additional hyperparameter w.r.t. a standard GP model, and fitting the parameters by maximizing the pseudo-likelihood of the real observations.

3. In fact, what we care about is that they follow the same conditional distribution over the output given the input.

3.2.1 Joint Gaussian process regression

Let us now assume that the data set D_n is formed by two disjoint sets: one set of r real data pairs, D_r = {(x_i, y_i)}_{i=1}^r, and one set of s RTM-simulated pairs, D_s = {(x_i, y_i)}_{i=r+1}^n, so that n = r + s and D_n = D_r ∪ D_s. In matrix form, we have X_r ∈ ℝ^{r×d}, y_r ∈ ℝ^{r×1}, X_s ∈ ℝ^{s×d}, and y_s ∈ ℝ^{s×1}, containing all the inputs and outputs of D_r and D_s, respectively. Finally, the n × 1 vector y contains all n outputs, sorted with the real data first, followed by the simulated data. As opposed to (1), we now assume a model that distinguishes between the two data sources in the noise term,

y_i = f(x_i) + e_i,   e_i ~ N(0, σ²) if i ≤ r,   e_i ~ N(0, σ²/γ) if i > r,   (11)

where we shall call γ the trust parameter. Its immediate interpretation is that it models the relative precision of the simulated data w.r.t. the real data. The assumed model (11) results in the following covariance matrix for the distribution of the output vector y:


C_n = K + σ²V,   V = diag(1, …, 1, 1/γ, …, 1/γ),   (12)

where the first r diagonal entries equal 1 (real data) and the remaining s entries equal 1/γ (simulated data),

which may be substituted into (2) to obtain the corresponding joint Gaussian process (JGP) predictive mean and variance. Recall that the data source we are interested in performing new predictions on is that of the real data. Therefore, when fitting the hyperparameters, we maximize only the likelihood of the real data, thus tuning the model for prediction on one type of data. The predictive probability of a single training data point conditioned on the remaining data is a normal distribution determined by Eq. (2), using all data points but the i-th, indicated here with the backslash symbol. Thus, the predictive log-likelihood leaving out training point i can be expressed as

log p(y_i | X_{\i}, y_{\i}, θ) = −(1/2) log 2πσ_i² − (y_i − μ_i)² / (2σ_i²).

The leave-one-out (LOO) likelihood, to be maximized with respect to the hyperparameters, is constructed by summing over the data points. Here, we perform focused inference, by summing only over the real data points,

L_LOO(X, y, θ) = Σ_{i=1}^{r} log p(y_i | X_{\i}, y_{\i}, θ).   (13)

Now, when computing r different predictive means and variances, it would appear that we have to invert r slightly different covariance matrices. Luckily, there is a way around this computationally inefficient approach, which involves computing the inverse of the complete covariance matrix only once [48]. Instead of using Eq. (2) a total of r times to evaluate the likelihood function, the following expressions may be used:

μ_i = y_i − [K^{-1}y]_i / [K^{-1}]_{ii},   σ_i² = 1 / [K^{-1}]_{ii},   (14)

where [·]_i denotes the i-th element of a vector, and [·]_{ii} the i-th diagonal element of a matrix. Maximizing only the likelihood of the main data set, γ takes on the role of a trust parameter determining the relevance of the simulated data when performing prediction on the real data. If the simulated data are not helpful for prediction, γ takes on a low value, which forces the weights of Eq. (3) on the simulated data to zero.
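A minimal sketch, in NumPy, of the JGP covariance of Eq. (12) and the focused LOO objective of Eqs. (13) and (14); in practice γ is optimized jointly with the kernel hyperparameters, which is omitted here.

import numpy as np

def jgp_loo_loglik(K, y, noise_var, gamma, r):
    # Eq. (12): C_n = K + s^2 V, V = diag(1,...,1, 1/g,...,1/g),
    # with the r real samples first and the s simulated ones after.
    n = len(y)
    v = np.ones(n)
    v[r:] = 1.0 / gamma
    C_inv = np.linalg.inv(K + noise_var * np.diag(v))
    # Eq. (14): all LOO means and variances from a single inversion.
    d = np.diag(C_inv)
    mu = y - (C_inv @ y) / d
    s2 = 1.0 / d
    # Eq. (13): focused inference, summing only over the real points.
    ll = -0.5 * (np.log(2.0 * np.pi * s2[:r]) + (y[:r] - mu[:r]) ** 2 / s2[:r])
    return ll.sum()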

3.2.2 Extrapolation across crop types

The JGP framework is designed to help prediction in regions where there is no (or little) data available. Such a scenario could occur, for example, if only crops of low LAI value had been sampled in the SPARC data set described in Section 2.2.1.


FIGURE 7 Projection of the data onto NDVI-LAI space, showing how different crop types tend to cluster together. LAI, leaf area index; NDVI, normalized difference vegetation index.

As can be seen from the projection of the data onto normalized difference vegetation index (NDVI) and LAI space in Fig. 7, this would correspond to having data for bare soil, garlic, poppy, onion, alfalfa, and wheat. If these data were used for training, a normal GP model would have a hard time extrapolating to the high-LAI data (marked with crosses in Fig. 7). In order to aid predictions, we use simulated data generated through the PROSPECT and SAIL radiative transfer model (PROSAIL), which is described in more detail in Section 4.3. We compare different schemes for using RTM-simulated data to perform this extrapolation: a GP trained only on simulated data, a GP trained on the joint database of real and RTM-simulated data without distinguishing between the sources, and the JGP model. In Table 1 these methods are denoted as GP_S, GP_R+S, and JGP, respectively, and their root mean squared error (RMSE) on an independent test set is compared to that of the GPR trained only on real data (GP_R).

TABLE 1 Root mean squared error (RMSE) for different Gaussian process (GP) schemes, when the source crops of the training data are low in leaf area index, and vice versa for the test data.

Method   GP_R    GP_S    GP_R+S    JGP
RMSE     1.72    1.64    1.70      1.60

We see that the joint GP manages to leverage the simulated data so that it performs better than both the GP trained purely on simulated data and that trained only on real data.


In this scenario, the JGP performs best, indicating that it manages to use the simulated data for extrapolation while still exploiting the information in the real data.

3.3 Gaussian process retrievals for large-scale problems

One of the major drawbacks of GPs is dealing with a high volume of training data. In those scenarios, a naïve implementation requires computation that grows as O(n³), where n is the number of training points. Moreover, the memory required to store the covariance matrix K grows as O(n²). This makes GPs impractical when the training set is in the range of hundreds of thousands of points. To overcome this limitation, several approaches have been proposed in the literature. Some computationally efficient approximations are based on approximate matrix-vector multiplications, where the system of linear equations involved is solved with a reduced number of conjugate gradient steps [49]. These approaches are very general and could be applied to solve any of the GP methods presented in this chapter. However, in this section we will focus on a family of GPs specifically proposed for large-scale problems, the sparse approximations, which are based on inducing variables and on spectral representations.

3.3.1 Sparse approximations based on inducing variables

We will introduce sparse approximations in the context of standard single-output regression, following the notation of Section 2. To facilitate the understanding of the formulas, we introduce subindexes in the matrices (indicating size) whenever there might be confusion. The simplest sparse approach consists in subsampling the data before fitting the model. This approach is called subset of data (SoD): we sample m ≪ n points from D_n, forming the subset D_m ⊂ D_n. Then we fit the standard GP using only the subsampled points. The computational cost of this approach is O(m³); however, the model completely discards n − m data points. This can seriously hamper the performance of the model depending on the problem, but may work if the m samples are representative enough. Many approaches based on SoD have been proposed in the literature to leverage all the available data. The common idea is to use a subset of inputs X_m, often called induced inputs or pseudo-inputs, which are assigned to a set of random variables called inducing variables, u = [u_1, …, u_m]. These inducing variables are then estimated using all the available data D_n. This contrasts with the SoD approach, in which the noise-free variables are estimated using only data from D_m. These models include the subset of regressors (SoR), deterministic training conditional (DTC) [50,51], fully independent training conditional (FITC) [52], variational free energy (VFE) [53], and partially independent training conditional (PITC) [54] approximations, among others.


We will briefly present some of these methods following the framework proposed in Refs. [23,54]. First, we fix a prior for the inducing variables u: p(u) = N(0, K_{m,m}), which is the same prior set for the noise-free variables f of the standard GP, but using the inducing inputs X_m. Hereafter in this section, the notation K_{n,m} indicates a matrix K of n rows and m columns. The different methods then establish different relationships between the pseudo-variables u and the noise-free variables f of all the data, p_method(f|u). Finally, integrating over the variable u yields a modified prior over the noise-free f. Thus, the computational efficiency of each method depends on the assumptions made about the relation between f and u.

The SoR method establishes the deterministic relation

p_SoR(f|u) = N( K_{n,m} K_{m,m}^{-1} u, 0 ).

Integrating u gives the prior over f: p_SoR(f) = N(0, Q_{n,n}), where Q_{n,n} = K_{n,m} K_{m,m}^{-1} K_{m,n}. The posterior using this prior for a new input x_* is then given by

p_SoR(f_* | x_*, D_n) = N( Q_{*,n}(Q_{n,n} + σ²I)^{-1} y, Q_{*,*} − Q_{*,n}(Q_{n,n} + σ²I)^{-1} Q_{n,*} )
                      = N( k_{*,m}(K_{m,n}K_{n,m} + σ²K_{m,m})^{-1} K_{m,n} y, σ² k_{*,m}(K_{m,n}K_{n,m} + σ²K_{m,m})^{-1} k_{m,*} ).   (15)

Note that (15) just replaces K by Q in the standard GP posterior. This posterior is the same one would obtain using the Nyström method [55]. The equivalent second expression is obtained by applying the matrix inversion lemma; it is less computationally expensive and is therefore preferred for implementation. The initial computational complexity is O(nm²); then, for each test case, we have O(m) for the predictive mean and O(m²) for the predictive variance. One of the biggest problems of SoR appears when the test data point x_* is far away from the m inducing inputs in the matrix X_m. Using the SoR approach, the predictive variance will go to zero, since Q_{*,*} ≈ 0. This contrasts with the exact GP, in which the predictive variance k_{*,*} is higher when the test data are far away from the training points X. The DTC, FITC, and PITC methods overcome this limitation in different ways. DTC replaces the Q_{*,*} term with k_{*,*} in the predictive variance of (15). FITC proposes a relation between f and u where each f_i is conditionally independent given u,

p_FITC(f|u) = N( K_{n,m} K_{m,m}^{-1} u, diag[K_{n,n} − Q_{n,n}] ).


After integrating out u, the posterior is then given by

p_FITC(f_* | x_*, D_n) = N( Q_{*,n}(Q_{n,n} + Λ)^{-1} y, k_{*,*} − Q_{*,n}(Q_{n,n} + Λ)^{-1} Q_{n,*} ),   (16)

where Λ = diag[K_{n,n} − Q_{n,n} + σ²I]. Again, we can apply the matrix inversion lemma to reduce the computational cost. Finally, PITC relaxes the conditional independence assumption to a block conditional independence, replacing the diag term by a blockdiag.

Optimizing the inducing inputs. So far we have assumed that the X_m locations were given. The most straightforward approach would be to randomly select X_m from the complete training set X. Several greedy optimization approaches have been proposed to select the best subset of a given size under different criteria [50,51]. If we relax the condition that the input locations X_m must be a part of X, we exchange the discrete optimization problem for a continuous one. This is the approach proposed originally in Ref. [52]. The suggestion there is to optimize the marginal likelihood of all the data, which is the standard approach for optimizing the parameters of the kernel (λ) and σ². The proposal is based on the FITC model; however, it can be adapted to any other sparse approximation previously mentioned. The log marginal likelihood of FITC is

log p_FITC(y | X, Θ) = −(1/2) log|Q_{n,n} + Λ| − (1/2) y^T (Q_{n,n} + Λ)^{-1} y − (n/2) log(2π).   (17)

The approach consists in jointly maximizing Eq. (17) with respect to Θ = {X_m, λ, σ²} using gradient-based methods. The computational cost of evaluating the objective function log p_FITC(y|X_m) is O(nm²). Computing the derivatives w.r.t. λ and X_m depends on the choice of the kernel function; in the case of the RBF kernel it is also O(nm²). This results in a total complexity of O(Tnm²), where T is the number of steps in the optimization.

Variational approach. VFE [53] had a large impact on the GP approximation literature. Following the formulation of Ref. [56], we consider the lower bound of the exact GP log marginal likelihood log p(y|Θ) defined as

F(q, Θ) = log p(y|Θ) − KL( q(f) || p(f|y) ),

where KL is the Kullback-Leibler divergence. If we set q to satisfy q(f) = p(f|u)q(u), the optimal distribution q* that maximizes the bound F is the one obtained in the DTC method (see Refs. [53,56] for details). This leads to the posterior

p_VFE(f_* | x_*, D_n) = N( Q_{*,n}(Q_{n,n} + σ²I)^{-1} y, k_{*,*} − Q_{*,n}(Q_{n,n} + σ²I)^{-1} Q_{n,*} ).


So the VFE and DTC approaches result in the same posterior distribution, but for VFE the hyperparameters (Θ = {X_m, λ, σ²}) are optimized according to the bound F:

F(Θ) = −(1/2) log|Q_{n,n} + σ²I| − (1/2) y^T (Q_{n,n} + σ²I)^{-1} y − (n/2) log(2π) − (1/(2σ²)) tr(K_{n,n} − Q_{n,n}).   (18)

Compared with Eq. (17), Eq. (18) has an extra trace term. In Ref. [57], both approaches are theoretically and practically compared. Although both approaches yield similar results, the authors recommend VFE since it exhibits fewer unsatisfactory properties. In Ref. [58], a modified bound suited for stochastic variational inference was proposed, which allows applying stochastic gradient descent to the optimization, reducing the cost of each gradient step to O(m³). Finally, the work in Ref. [59] framed all the described approaches in a unifying framework.
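As an illustration of the inducing-point idea, the following sketch implements the cheaper second form of the SoR posterior of Eq. (15), reusing the rbf_kernel helper from Section 2.1; the inducing inputs are a random subset here, whereas FITC/VFE would optimize them together with the kernel hyperparameters.

import numpy as np

def sor_predict(X, y, X_star, m, lengthscale, noise_var, seed=0):
    # Random inducing inputs X_m (a subset of X; not optimized here).
    rng = np.random.default_rng(seed)
    Xm = X[rng.choice(len(X), size=m, replace=False)]
    Knm = rbf_kernel(X, Xm, lengthscale)               # K_{n,m}
    Kmm = rbf_kernel(Xm, Xm, lengthscale)              # K_{m,m}
    # (K_{m,n} K_{n,m} + s^2 K_{m,m}), with a small jitter for stability.
    A = Knm.T @ Knm + noise_var * Kmm + 1e-8 * np.eye(m)
    w = np.linalg.solve(A, Knm.T @ y)                  # O(n m^2), not O(n^3)
    ksm = rbf_kernel(X_star, Xm, lengthscale)          # k_{*,m}
    mean = ksm @ w                                     # Eq. (15), mean
    var = noise_var * np.einsum('ij,ji->i', ksm, np.linalg.solve(A, ksm.T))
    return mean, var                                   # Eq. (15), variance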

3.3.2 Sparse approximations based on spectral densities

One of the main concerns about the sparse-inducing methods presented above is the lack of expressive power of the resulting kernel. In this context, random Fourier features have been presented as a powerful approach for large-scale kernel machines [60,61]. These methods exploit the fact that every stationary kernel, k(x, x′) = k(x − x′), has a spectral density, which is its Fourier dual. Thus the kernel can be approximated by a finite number of sinusoidal bases,

φ(x) = [cos(ω_1^T x), sin(ω_1^T x), …, cos(ω_m^T x), sin(ω_m^T x)]^T,
k(x, x′) ≈ (σ_0²/m) Σ_{r=1}^{m} cos( ω_r^T (x − x′) ).

These sinusoidal bases are built on a set of m random frequencies, each drawn from a multivariate probability density p(ω), which is the inverse Fourier transform of k. Therefore, the method explicitly yields a mapping function that allows us to "linearize" kernels by applying the feature expansion φ. The sparse spectrum GP (SSGP) [62] extends GPs in this context. The predictive distribution resembles the Bayesian linear regression solution for the transformed x_*:

p_SSGP(y_* | x_*, D_n) = N( φ(x_*)^T Q^{-1} Φ y, σ² + σ² φ(x_*)^T Q^{-1} φ(x_*) ),

where Φ = [φ(x_1) ⋯ φ(x_n)] is the 2m × n matrix of mapped samples, and Q = ΦΦ^T + (mσ²/σ_0²) I_{2m}. In SSGP the hyperparameters of the model (the frequencies {ω_r}_{r=1}^m and the kernel parameters λ and σ²) are learned by maximizing the log marginal likelihood [62]:

log p(y|θ) = −(1/(2σ²)) ( y^T y − y^T Φ^T Q^{-1} Φ y ) − (1/2) log|Q| + m log(mσ²/σ_0²) − (n/2) log(2πσ²).


The computational cost of the predictive distribution and the marginal likelihood is O(nm²); the cost for the mean and variance of a test sample is O(m) and O(m²), respectively; and, since the full covariance matrix (n × n) is replaced by the explicitly mapped data (2m × n), memory requirements are also reduced.
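A sketch of the feature expansion and the SSGP predictive distribution, with frequencies drawn once from the spectral density of the RBF kernel (a Gaussian with scale 1/λ) and then held fixed; SSGP proper would additionally optimize the frequencies by maximizing the log marginal likelihood above.

import numpy as np

def trig_features(X, omegas):
    # phi(x) = [cos(w_r^T x), sin(w_r^T x)]_{r=1..m}, shape (n, 2m).
    proj = X @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)])

def ssgp_predict(X, y, X_star, m, lengthscale, sigma0_sq, noise_var, seed=0):
    rng = np.random.default_rng(seed)
    # Spectral density of the RBF kernel: Gaussian frequencies, scale 1/l.
    omegas = rng.standard_normal((m, X.shape[1])) / lengthscale
    Phi = trig_features(X, omegas).T                   # (2m, n)
    Q = Phi @ Phi.T + (m * noise_var / sigma0_sq) * np.eye(2 * m)
    w = np.linalg.solve(Q, Phi @ y)
    phi_s = trig_features(X_star, omegas)              # (n_test, 2m)
    mean = phi_s @ w
    var = noise_var + noise_var * np.einsum(
        'ij,ji->i', phi_s, np.linalg.solve(Q, phi_s.T))
    return mean, var                                   # time O(n m^2), memory O(m n)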

3.3.3 Estimation of surface temperature

Here we compare the performance of the classical GP, the SSGP, and the sparse approximation based on inducing variables (FITC) in a big EO data problem: the retrieval of surface temperature from satellite data. In particular, we used an orbit observation from the infrared atmospheric sounding interferometer (IASI), and the surface temperature given by the European Centre for Medium-Range Weather Forecasts (ECMWF). The data set, collected in August 2013, consists of 98,000 data points: 50,000 points are used for training and the remaining points for testing (results in Fig. 8). The dimensionality of the data was reduced to 10 dimensions using PCA. Results show that FITC is more efficient at reducing the error for a given number of data points in the kernel, with the counterpart of increasing the training time. The SSGP approximation can be seen as a compromise between the classical GP and FITC in both error and training time. Once the parameters are learned, the testing time is the same for all three methods.

4. Emulation and optimization of radiative transfer models with Gaussian processes

Emulation deals with the challenging problem of building statistical models for complex physical RTMs; such emulators are also called surrogate or proxy models. An emulator is a statistical model which tries to reproduce the behavior of a deterministic and very costly physical (RTM) model. Emulators built with GPs are gaining popularity in remote sensing and the geosciences, since they allow efficient data processing and sensitivity analysis [25,63,64]. Here, we are interested in optimizing emulators such that a minimal number of simulations is run to attain a given approximation error.

FIGURE 8 Different Gaussian process (GP) approximations to estimate surface temperature from infrared atmospheric sounding interferometer radiances: classical GPs, sparse spectrum GPs (SSGPs), and the sparse approximation based on inducing variables (FITC). (Left) Root mean squared error (RMSE) as a function of the number of data points in the kernel (m). (Center) Training time as a function of m. (Right) RMSE as a function of training time.


We describe a general framework, called automatic emulation (AE), which is related to Bayesian optimization and active learning techniques. We first define the generic elements of the AE methodology and then describe one specific implementation based on a GP interpolator: the automatic Gaussian process emulator (AGAPE) [65]. Another alternative was recently proposed [66], based on the relevance vector machine (RVM) technique and Nakagami functions [66–68]. Note that the inputs to RTMs, and hence to emulators thereof, are state vectors (biophysical parameters and configuration values) y, and that the outputs are observations (radiances) x. We want to remark that, in what follows, the notation of inputs and outputs is reversed: y now denotes inputs, whereas x denotes outputs. Therefore, the regression (or interpolation) methods used below act in the forward direction, exactly the opposite direction to that explained in the previous sections.

Notationally, we aim to emulate (i.e., interpolate/mimic) a costly function g(y), choosing the nodes (simulated data) adequately, in order to reduce the interpolation error with the smallest possible number of evaluations of g(y). Given an input matrix of nodes (used for the interpolation) at the t-th iteration, Y_t = [y_1, …, y_{m_t}], of dimension S × m_t (where S is the dimension of each y_i and m_t is the number of points), we have a vector of outputs, x_t = [x_1, …, x_{m_t}]^T, where x_t = g(y_t) is the estimation of the observations (e.g., radiances) at iteration t ∈ ℕ⁺ of the algorithm. Fig. 9 shows the graphical representation of a generic automatic emulator. At each iteration t, one performs an interpolation/regression, obtaining ĝ_t(y | Y_t, x_t), followed by an optimization step that updates the acquisition function A_t(y), updates the set Y_{t+1} = [Y_t, y_{m_t+1}] by adding a new node, and sets m_t ← m_t + 1 and t ← t + 1. The procedure is repeated until a suitable stopping condition is met, such as reaching a maximum number of included points or achieving a desired precision error ε, ‖ĝ_t(y) − ĝ_{t−1}(y)‖ ≤ ε.

FIGURE 9 Scheme of an automatic emulator. RTM, radiative transfer model.


FIGURE 10 General sketch of the Automatic Emulation (AE) procedure. Top: the radiative transfer model $g(y)$ (solid line) and its approximation $\widehat{g}_t(y)$ (dashed line). Bottom: the acquisition function $A_t(y)$; its maximum suggests where a new node should be added.

Fig. 10 shows an illustrative example of the performance of the automatic emulator and of the concept of the acquisition function, which acts as a sort of informed oracle that tells the user the most interesting region of the parameter space to sample from.
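The AE loop of Fig. 9 can be summarized in a few lines of schematic Python. This is a sketch under our own conventions, not the authors' implementation: `fit` and `acquisition` are placeholders for a user-supplied interpolator and acquisition rule, and the maximization of $A_t$ is done crudely over random candidates.

```python
import numpy as np

def automatic_emulation(g, Y0, fit, acquisition, n_iter=50, tol=1e-6,
                        n_cand=10_000, bounds=(0.0, 1.0), seed=0):
    """Generic Automatic Emulation loop (cf. Fig. 9).

    g           -- costly RTM: parameter vector -> output
    fit         -- (Y, X) -> interpolator ghat, with ghat(Ystar) vectorized
    acquisition -- (ghat, Y, Ystar) -> values of A_t at candidate points Ystar
    """
    rng = np.random.default_rng(seed)
    Y = [np.asarray(y) for y in Y0]
    X = [g(y) for y in Y]                        # initial (expensive) simulations
    ghat = fit(np.array(Y), np.array(X))
    for t in range(n_iter):
        # Crude maximization of A_t over random candidates; simulated
        # annealing or any global optimizer can be plugged in here instead.
        cand = rng.uniform(bounds[0], bounds[1], size=(n_cand, Y[0].size))
        y_new = cand[np.argmax(acquisition(ghat, np.array(Y), cand))]
        Y.append(y_new)
        X.append(g(y_new))                       # one new RTM run per iteration
        ghat_new = fit(np.array(Y), np.array(X))
        # Stop when successive emulators agree to within tol on the candidates
        if np.max(np.abs(ghat_new(cand) - ghat(cand))) <= tol:
            ghat = ghat_new
            break
        ghat = ghat_new
    return ghat, np.array(Y)
```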

4.1 Theoretical formulation

The acquisition function, $A_t(\mathbf{y})$, encodes useful information for proposing new points to build the emulator. At each iteration, a new node is added by maximizing $A_t(\mathbf{y})$, i.e.,

$$\mathbf{y}_{m_t+1} = \arg\max_{\mathbf{y} \in \mathcal{Y}} A_t(\mathbf{y}),$$

and we set $\mathbf{Y}_{t+1} = [\mathbf{Y}_t, \mathbf{y}_{m_t+1}]$ and $m_{t+1} = m_t + 1$. Here, we propose to account for both a geometry term $H_t(\mathbf{y})$ and a diversity term $V_t(\mathbf{y})$,

$$A_t(\mathbf{y}) = [H_t(\mathbf{y})]^{\beta_t} V_t(\mathbf{y}), \qquad \beta_t \in [0, 1], \qquad (19)$$

where $A_t(\mathbf{y}): \mathcal{Y} \to \mathbb{R}$, and $\beta_t$ is an increasing function of $t$ with $\lim_{t \to \infty} \beta_t = 1$ (or $\beta_t = 1$ for $t > t'$). The function $H_t(\mathbf{y})$ captures the geometrical information in $g$, while the function $V_t(\mathbf{y})$ depends on the distribution of the points in the current matrix $\mathbf{Y}_t$. More specifically, $V_t(\mathbf{y})$ takes greater values in empty areas of $\mathcal{Y}$, whereas it is approximately zero close to the support points and exactly zero at the support points, i.e., $V_t(\mathbf{y}_i) = 0$ for $i = 1, \ldots, m_t$ and all $t \in \mathbb{N}$. Since $g$ is unknown, the function $H_t(\mathbf{y})$ can only be derived from information acquired in advance or by considering the


approximation $\widehat{g}$. The tempering value $\beta_t$ helps to downweight the likely less informative estimates in the very first iterations. If $\beta_t = 0$, we disregard $H_t(\mathbf{y})$ and $A_t(\mathbf{y}) = V_t(\mathbf{y})$, whereas if $\beta_t = 1$, we have $A_t(\mathbf{y}) = H_t(\mathbf{y}) V_t(\mathbf{y})$.
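As a small illustration of Eq. (19), the helper below combines precomputed geometry and diversity terms under one possible tempering schedule; the schedule itself is our assumption, since any increasing $\beta_t$ with limit 1 qualifies.

```python
import numpy as np

def tempered_acquisition(H, V, t, t0=20):
    """A_t(y) = H_t(y)**beta_t * V_t(y), Eq. (19), with beta_t increasing to 1.

    H, V -- arrays with H_t and V_t evaluated on a grid of candidate points.
    Early on (beta_t ~ 0) the geometry term is downweighted, since gradient
    estimates built from very few nodes are unreliable."""
    beta = min(1.0, t / t0)   # one possible schedule, with beta_t = 1 for t >= t0
    return H**beta * V
```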

4.2 Multioutput automatic Gaussian process emulator

First of all, if we consider a single output we have an RTM model of the type $x = g(\mathbf{y}) + e$, where $e \sim \mathcal{N}(0, \sigma^2)$ (with $\sigma = 0$ in emulation, since we consider the interpolation case). Then, we have $\mathbf{C}_n = \mathbf{K}$ and $c_* = k(\mathbf{y}_*, \mathbf{y}_*)$ in Eq. (2). Therefore, the GP predictive mean and variance at iteration $t$ for a new point $\mathbf{y}_*$ become simply

$$\widehat{g}_t(\mathbf{y}_* \mid \mathbf{Y}_t, \mathbf{x}_t) = \mu_{\mathrm{GP}}(\mathbf{y}_*) = \mathbf{k}_*^{\top} \mathbf{K}^{-1} \mathbf{x}_t = \mathbf{k}_*^{\top} \boldsymbol{\alpha}, \qquad \sigma^2_{\mathrm{GP}}(\mathbf{y}_*) = k(\mathbf{y}_*, \mathbf{y}_*) - \mathbf{k}_*^{\top} \mathbf{K}^{-1} \mathbf{k}_*, \qquad (20)$$

where $\mathbf{k}_* = [k(\mathbf{y}_*, \mathbf{y}_1), \ldots, k(\mathbf{y}_*, \mathbf{y}_{m_t})]^{\top}$ contains the similarities between the input point $\mathbf{y}_*$ and the nodes observed up to iteration $t$, $\mathbf{K}$ is an $m_t \times m_t$ kernel matrix with entries $K_{i,j} := k(\mathbf{y}_i, \mathbf{y}_j)$, and $\boldsymbol{\alpha} = \mathbf{K}^{-1} \mathbf{x}_t$ is the coefficient vector for the interpolation. The interpolation at $\mathbf{y}_*$ can thus be expressed as a linear combination, $\widehat{g}_t(\mathbf{y}_*) = \mathbf{k}_*^{\top} \boldsymbol{\alpha} = \sum_{i=1}^{m_t} \alpha_i k(\mathbf{y}_*, \mathbf{y}_i)$. Note that, for the single-output case, we can choose $V_t(\mathbf{y}) := \sigma^2_{\mathrm{GP}}(\mathbf{y})$, which is induced directly by the GP interpolator after training of the hyperparameters (the hyperparameter $l$ of the kernel function $k$ can be trained with standard procedures, such as cross-validation or marginal likelihood optimization [69-72]). As geometric information, we consider enforcing flatness on the interpolation function, and thus aim to minimize the norm of the gradient of the interpolating function $\widehat{g}_t$ w.r.t. the input data $\mathbf{y}$, i.e.,

$$H_t(\mathbf{y}) := \left\| \nabla_{\mathbf{y}} \widehat{g}_t(\mathbf{y} \mid \mathbf{Y}_t, \mathbf{x}_t) \right\| = \left\| \sum_{i=1}^{m_t} \alpha_i \nabla_{\mathbf{y}} k(\mathbf{y}, \mathbf{y}_i) \right\|. \qquad (21)$$

This intuitively makes wavy regions of $g$ require more support points than flat regions. We can optimize $A_t(\mathbf{y})$ using interacting parallel simulated annealing methods [70-72].
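A minimal NumPy sketch of Eqs. (20) and (21) for the noise-free, single-output case follows; the function names are ours, and a small jitter term is added to the kernel matrix purely for numerical stability.

```python
import numpy as np

def rbf(Y1, Y2, ls):
    """RBF kernel matrix k(y, y') = exp(-||y - y'||^2 / (2 ls^2))."""
    d2 = ((Y1[:, None, :] - Y2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def fit_gp_interpolator(Y, x, ls, jitter=1e-10):
    """Noise-free GP interpolation, Eq. (20): alpha = K^{-1} x."""
    K = rbf(Y, Y, ls) + jitter * np.eye(len(Y))
    return np.linalg.solve(K, x), K

def diversity_V(Ystar, Y, K, ls):
    """V_t(y) = sigma_GP^2(y): zero at the nodes, large in empty regions."""
    k = rbf(Ystar, Y, ls)
    return 1.0 - np.sum(k @ np.linalg.inv(K) * k, axis=1)

def geometry_H(Ystar, Y, alpha, ls):
    """H_t(y) = ||grad ghat_t(y)||, Eq. (21). For the RBF kernel,
    grad_y k(y, y_i) = -k(y, y_i) (y - y_i) / ls^2."""
    k = rbf(Ystar, Y, ls)                        # (n*, m_t)
    diff = Ystar[:, None, :] - Y[None, :, :]     # (n*, m_t, S)
    grads = -(k[:, :, None] * diff / ls**2 * alpha[None, :, None]).sum(axis=1)
    return np.linalg.norm(grads, axis=1)
```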

Now, let us consider the multioutput case, starting from the RTM model

$$\mathbf{x} = \mathbf{g}(\mathbf{y}) + \mathbf{e}, \qquad (22)$$

with $\mathbf{x} = [x^{(1)}, \ldots, x^{(D)}]^{\top} \in \mathbb{R}^D$ and $\mathbf{e} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_D)$, where $\mathbf{I}_D$ is the $D \times D$ identity matrix. Then, we have $D$ outputs for each input $\mathbf{y}$, i.e., $\mathbf{g}(\mathbf{y}) = [g^{(1)}(\mathbf{y}), \ldots, g^{(D)}(\mathbf{y})]^{\top}: \mathcal{Y} \to \mathbb{R}^D$. In order to design an automatic emulator of $\mathbf{g}(\mathbf{y})$, we have to extend the strategy described above, choosing the following elements:


360 SECTION j II Algorithms and methods

1. Choose a multioutput GP interpolator, considering the same input matrix $\mathbf{Y}_t = [\mathbf{y}_1, \ldots, \mathbf{y}_{m_t}]$ for all the outputs and the output matrix $\mathbf{X}_t = [\mathbf{x}_1, \ldots, \mathbf{x}_{m_t}]$ (of dimension $D \times m_t$), providing an approximation $\widehat{\mathbf{g}}_t(\mathbf{y}) = \widehat{\mathbf{g}}_t(\mathbf{y} \mid \mathbf{Y}_t, \mathbf{X}_t) = [\widehat{g}^{(1)}(\mathbf{y} \mid \mathbf{Y}_t, \mathbf{X}_t), \ldots, \widehat{g}^{(D)}(\mathbf{y} \mid \mathbf{Y}_t, \mathbf{X}_t)]^{\top}$. See, for instance, Refs. [23,73] and Section 3.1. The simplest procedure consists in applying an independent interpolator to each output. Considering an RBF kernel per output dimension, $k_d(\mathbf{y}, \mathbf{y}') = \exp(-\|\mathbf{y} - \mathbf{y}'\|^2 / (2 l_d^2))$, we can have a different hyperparameter $l_d$ for each output.

2. Given the matrix of current nodes $\mathbf{Y}_t = [\mathbf{y}_1, \ldots, \mathbf{y}_{m_t}]$, we obtain, in general, $D$ different functions $V_t^{(d)}(\mathbf{y}) = [\sigma^2_{\mathrm{GP}}(\mathbf{y})]^{(d)}$, one for each output. Note that the functions $V_t^{(d)}$ differ even if the input matrix $\mathbf{Y}_t$ is the same for all the outputs, due to the different hyperparameters $l_d$, $d = 1, \ldots, D$, specifically tuned for each output.

3. Given each $\widehat{g}^{(d)}(\mathbf{y} \mid \mathbf{Y}_t, \mathbf{X}_t)$, we obtain the gradient functions $H_t^{(d)}(\mathbf{y})$ in Eq. (21), for each $d = 1, \ldots, D$. Therefore, we can define the complete acquisition function as

$$A_t(\mathbf{y}) = \left[ \prod_{d=1}^{D} H_t^{(d)}(\mathbf{y}) \right]^{\beta_t} \left[ \prod_{d=1}^{D} V_t^{(d)}(\mathbf{y}) \right]. \qquad (23)$$

A sketch combining these elements is given right after this list.
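Assuming the single-output helpers `diversity_V` and `geometry_H` from the sketch above, Eq. (23) can be assembled as follows for D independent RBF interpolators; again, this is a sketch under our naming conventions, not the reference implementation.

```python
import numpy as np

def multioutput_acquisition(Ystar, Y, alphas, Ks, ls_per_output, beta):
    """Eq. (23): A_t(y) = [prod_d H_t^(d)(y)]**beta * prod_d V_t^(d)(y),
    with one independent interpolator (alpha_d, K_d, ls_d) per output.
    Reuses diversity_V and geometry_H from the single-output sketch."""
    H = np.ones(len(Ystar))
    V = np.ones(len(Ystar))
    for alpha, K, ls in zip(alphas, Ks, ls_per_output):
        H *= geometry_H(Ystar, Y, alpha, ls)   # geometry term of output d
        V *= diversity_V(Ystar, Y, K, ls)      # diversity term of output d
    return H**beta * V
```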

Optimizing $A_t(\mathbf{y})$, we find the next node $\mathbf{y}_{m_t+1}$ to incorporate at the next iteration, $\mathbf{Y}_{t+1} = [\mathbf{Y}_t, \mathbf{y}_{m_t+1}]$.

Highly correlated outputs. Above we have described a multioutput approach for emulation where we have the same set $\mathbf{Y}_t$ of nodes for all the outputs, but a different interpolator for each output. In this case, using $D$ independent interpolators and the squared exponential kernel function, we have a different hyperparameter $l_d$ for each output. However, if the outputs are highly correlated, we can consider a unique learning process and a unique hyperparameter $l$, shared by all the outputs. Indeed, considering the $D \times m_t$ matrix $\mathbf{X}_t = [\mathbf{x}_1, \ldots, \mathbf{x}_{m_t}]$, the $m_t \times m_t$ kernel matrix $\mathbf{K} := [K_{i,j}] = k(\mathbf{y}_i, \mathbf{y}_j)$ with $i, j = 1, \ldots, m_t$, and the vector $\mathbf{k}_* = [k(\mathbf{y}_*, \mathbf{y}_1), \ldots, k(\mathbf{y}_*, \mathbf{y}_{m_t})]^{\top}$ of dimension $m_t \times 1$, we can define a direct multioutput GP (cf. Section 3.1):

$$\widehat{\mathbf{g}}_t(\mathbf{y}_*)^{\top} = \boldsymbol{\mu}_{\mathrm{GP}}(\mathbf{y}_*) = \mathbf{k}_*^{\top} \mathbf{K}^{-1} \mathbf{X}_t^{\top}, \qquad V_t(\mathbf{y}_*) = \sigma^2_{\mathrm{GP}}(\mathbf{y}_*) = k(\mathbf{y}_*, \mathbf{y}_*) - \mathbf{k}_*^{\top} \mathbf{K}^{-1} \mathbf{k}_*, \qquad (24)$$

where $\widehat{\mathbf{g}}_t(\mathbf{y}) = \boldsymbol{\mu}_{\mathrm{GP}}(\mathbf{y})^{\top}$ is a vector of dimension $D \times 1$. In this case we have a unique diversity function $V_t(\mathbf{y}_*) = \sigma^2_{\mathrm{GP}}(\mathbf{y}_*)$ for all the outputs. Note that we can define a $D \times D$ matrix $\mathbf{M}$, obtained by taking the logarithm of the extended marginal likelihood (recalling that $\sigma^2 = 0$),

$$\mathbf{M}(l) = \log[p(\mathbf{X}_t \mid l)] = -\frac{1}{2} \mathbf{X}_t \mathbf{K}^{-1} \mathbf{X}_t^{\top} - \frac{D}{2} \log \det \mathbf{K}^{-1} + \xi,$$

where $\xi$ is a constant. Then, the single hyperparameter $l$ can be found by maximizing the trace of $\mathbf{M}(l)$, i.e., $l_{\mathrm{opt}} = \arg\max_{l \in \mathbb{R}^+} \operatorname{trace}[\mathbf{M}(l)]$. Finally, observe that in this specific case the acquisition function simplifies to

$$A_t(\mathbf{y}) = \left[ \prod_{d=1}^{D} H_t^{(d)}(\mathbf{y}) \right]^{\beta_t} [V_t(\mathbf{y})]^D, \qquad (25)$$

due to the single, common diversity function $V_t(\mathbf{y})$. Although we have the same gradient vector of the kernel function for all the outputs,

$$\nabla_{\mathbf{y}} k(\mathbf{y}, \mathbf{y}') = -\frac{k(\mathbf{y}, \mathbf{y}')}{l^2} [(y_1 - y_1'), \ldots, (y_S - y_S')]^{\top},$$

we have different coefficients $\boldsymbol{\alpha}$ for each output. Indeed, $\mathbf{A} = \mathbf{X}_t \mathbf{K}^{-1}$ is a $D \times m_t$ matrix: each row corresponds to a specific output. Therefore, we still have $D$ different gradient-based functions $H_t^{(d)}(\mathbf{y})$, defined as in Eq. (21).
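The shared-hyperparameter predictor of Eq. (24) is equally compact. The sketch below, reusing the `rbf` helper defined earlier (our naming, not the authors' code), returns the D-dimensional predictive mean and the common diversity function:

```python
import numpy as np

def shared_gp_predict(Ystar, Y, X, ls, jitter=1e-10):
    """Direct multioutput GP with a single shared lengthscale, Eq. (24):
    ghat(y*)^T = k*^T K^{-1} X^T,  V(y*) = k(y*, y*) - k*^T K^{-1} k*.

    Y: (m, S) nodes; X: (D, m) outputs.
    Returns (n*, D) predictive means and (n*,) common variances."""
    K = rbf(Y, Y, ls) + jitter * np.eye(len(Y))
    Kinv = np.linalg.inv(K)
    k = rbf(Ystar, Y, ls)                   # (n*, m)
    mean = k @ Kinv @ X.T                   # (n*, D)
    var = 1.0 - np.sum(k @ Kinv * k, axis=1)
    return mean, var
```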

4.3 Automatic emulation of the PROSAIL radiative transfer model

We tested our method on the emulation of the standard leaf-canopy PROSAIL model. PROSAIL has been the most widely used RTM in remote sensing studies over the last 20 years [74]. PROSAIL models canopy reflectance using the turbid medium assumption (i.e., modeling the canopy as a turbid medium in which leaves are randomly distributed), which is particularly well suited for homogeneous canopies. PROSAIL simulates canopy reflectance from 400 to 2500 nm at 1 nm spectral resolution as a function of the biochemistry and structure of the canopy and its leaves, the background soil reflectance, and the sun-sensor geometry. Leaf optical properties are given by the mesophyll structural parameter (N) and the leaf chlorophyll (Chl), dry matter (Cm), water (Cw), carotenoid (Car), and brown pigment (Cbr) contents. At canopy level, PROSAIL is characterized by the leaf area index (LAI), the average leaf angle inclination (ALA), and the hot-spot parameter (Hotspot). The system geometry is described by the solar zenith angle (θs), the view zenith angle (θv), and the relative azimuth angle between them (ΔΘ). In this experiment, for ease of computation, we chose as free parameters the most important variables at leaf and canopy level, namely Chl and LAI, and kept the rest fixed. Table 2 shows the values of the remaining parameters, set to simulate wheat. This results in an input space of dimension two, where we restrict the variables

362 SECTION j II Algorithms and methods

TABLE 2 Values of the physical parameters used for simulating wheat with the PROSAIL model.

N = 1.5 | Cm = 0.01 mg/cm² | Cw = 0.01 mg/cm² | Car = 8 mg/cm² | Cbr = 0
ALA = Spherical | Hotspot = 0.01 | θs = 30 degrees | θv = 10 degrees | ΔΘ = 0

Chl ∈ [0, 80] mg/cm² and LAI ∈ [0, 8]. We scale down the output dimension from 2101 to 20 by PCA. This results in a function g(y), with y = [Chl, LAI], mapping from an input space of dimension two to an output space of dimension D = 20. We compare different approaches to sampling the data points that lead to a good emulator. In order to test the accuracy of the different schemes, we evaluated g(y) at all 4900 possible combinations of 70 values of Chl and 70 values of LAI. This fine grid represents the ground truth in the experiment, and it is on this data set that we perform PCA to obtain the 20 principal components used for dimensionality reduction. We test (a) a random approach choosing points uniformly at each iteration; (b) the Latin hypercube sampling (LHS) approach (see, e.g., Ref. [63]); and (c) the multioutput AGAPE (denoted Automatic Multi-Output Gaussian Process Emulator, AMOGAPE), using simulated annealing for the optimization of $A_t$. We start with 50 points generated by LHS for all the methods. For each added point we compute the test RMSE of our multioutput interpolator on the grid, in the original reflectance domain. In other words, the predicted 20-dimensional vector is projected back into the 2101-dimensional reflectance space and compared to the ground truth. The multioutput RMSE for the $N_t = 4900$ test points with output dimension $D_{\mathrm{refl}} = 2101$ is computed as

$$\mathrm{RMSE} = \sqrt{ \frac{1}{N_t} \sum_{i=1}^{N_t} \frac{1}{D_{\mathrm{refl}}} \sum_{d=1}^{D_{\mathrm{refl}}} \left( x_{d,i} - \widehat{x}_{d,i} \right)^2 }. \qquad (26)$$

The results, averaged over 50 runs, are shown in Fig. 11. The LHS method exhibits a large variance, as it chooses a completely new set of points at each iteration, as opposed to the other approaches, which sequentially add points to the existing data set. We see how AMOGAPE rather quickly identifies the points it needs to build a stronger emulator, while LHS decreases the error more slowly. The completely random sampling method, not being designed to maximize the information gained, does

Statistical biophysical parameter retrieval Chapter j 2.13

363

FIGURE 11 RMSE on test grid computed for emulators using different sampling methods. Each method is initialized with 50 points sampled with the LHS scheme, upon which 50 more are sampled. AMOGAPE, Automatic Multi-Output Gaussian Process Emulator; LHS, Latin hypercube sampling; RMSE, Root mean squared error.

not manage to reach the same level of error as the other methods. This gap is expected to widen as the input dimensionality grows.
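For completeness, the protocol above (an initial Latin hypercube design over [Chl, LAI] and the multioutput RMSE of Eq. (26)) can be reproduced along the lines sketched below. SciPy's qmc module is assumed to be available (SciPy >= 1.7), and the names `components` and `mean_spec` for the PCA back-projection are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import qmc

# 50-point Latin hypercube design over the two free parameters, y = [Chl, LAI]
sampler = qmc.LatinHypercube(d=2, seed=0)
Y0 = qmc.scale(sampler.random(n=50), l_bounds=[0.0, 0.0], u_bounds=[80.0, 8.0])

def multioutput_rmse(X_true, X_pred):
    """Eq. (26): mean squared error over all N_t spectra and all
    D_refl reflectance bands, then the square root."""
    return np.sqrt(np.mean((X_true - X_pred) ** 2))

# In the experiment, predictions live in the 20-dimensional PCA space and are
# back-projected before scoring, e.g. X_pred = Z_pred @ components + mean_spec,
# where `components` and `mean_spec` come from the PCA fitted on the fine grid.
```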

5. Conclusions

We reviewed the recent advances available in statistical biophysical retrieval based on GP modeling. We focused both on inverse modeling and on the emulation of forward RTMs. In the first part of this chapter, we reviewed the standard application of GPs for the prediction of biophysical parameters. In this case, GPs typically rely only on in situ observational data to create the model, or alternatively follow a hybrid approach by inverting simulations from RTMs. We illustrated cases of these approaches, and extended them by designing a joint GP that can combine in situ measurements and simulated data. Another interesting problem that we reviewed was that of considering multiple variables of interest at the same time. We reviewed the field of multioutput GPs and illustrated the performance of the most common methods in gap filling of time series of biophysical parameters. These methods can actually transport information across time series and fill in the gaps smartly. A third family of advanced GP models reviewed was that designed to tackle large-scale data sets. In the last decade, machine learning algorithms have had to face the big data problem, and GPs are not particularly well equipped to cope with it. We reviewed several strategies to tackle large-scale problems involving thousands of observations.


In the second part of the chapter, we introduced a family of algorithms based on GPs to mimic RTMs. Once the models are trained, such emulators are able to produce realistic spectra at a fraction of the time needed by RTMs, and they also inherit tractability from the probabilistic properties of GPs. We showed performance in a set of challenging problems in model emulation and land parameter estimation from space. The developed models demonstrated good performance in all applications, adaptation to the signal characteristics, and transportability to unseen situations. Future work is tied to the design of coupled land-atmosphere models with GPs and to addressing domain adaptation problems by incorporating grounding factors based on physical models.

Acknowledgments

This work was funded by the European Research Council (ERC) under the ERC-CoG-2014 SEDAL project (grant agreement 647423) and the ERC-2017-STG SENTIFLEX project (grant agreement 755617), the Spanish Ministry of Economy, Industry and Competitiveness (MINECO) under the "Network of Excellence" program (grant code TEC2016-81900-REDT), and MINECO (ERDF) under CICYT projects TIN2015-64210-R and TEC2016-77741-R.

References
[1] T.M. Lillesand, R.W. Kiefer, J. Chipman, Remote Sensing and Image Interpretation, John Wiley & Sons, New York, 2008.
[2] S. Liang, Quantitative Remote Sensing of Land Surfaces, John Wiley & Sons, New York, 2004.
[3] S. Liang, Advances in Land Remote Sensing: System, Modeling, Inversion and Applications, Springer Verlag, Germany, 2008.
[4] C.D. Rodgers, Inverse Methods for Atmospheric Sounding: Theory and Practice, World Scientific Publishing Co. Ltd., 2000.
[5] G. Camps-Valls, D. Tuia, L. Gómez-Chova, S. Jiménez, J. Malo (Eds.), Remote Sensing Image Processing, Morgan & Claypool Publishers, LaPorte, CO, USA, September 2011.
[6] F. Baret, S. Buis, Estimating canopy characteristics from remote sensing observations: review of methods and associated problems, in: Advances in Land Remote Sensing: System, Modeling, Inversion and Applications, Springer Verlag, Germany, 2008.
[7] S. Moulin, A. Bondeau, R. Delécolle, Combining agricultural crop models and satellite observations: from field to regional scales, International Journal of Remote Sensing 19 (6) (1998) 1021-1036.
[8] W.A. Dorigo, R. Zurita-Milla, A.J.W. de Wit, J. Brazile, R. Singh, M.E. Schaepman, A review on reflective remote sensing and data assimilation techniques for enhanced agroecosystem modeling, International Journal of Applied Earth Observation and Geoinformation 9 (2) (2007) 165-193.
[9] M. Drusch, U. Del Bello, S. Carlier, O. Colin, V. Fernandez, F. Gascon, B. Hoersch, C. Isola, P. Laberinti, P. Martimort, A. Meygret, F. Spoto, O. Sy, F. Marchese, P. Bargellini, Sentinel-2: ESA's optical high-resolution mission for GMES operational services, Remote Sensing of Environment 120 (2012) 25-36.


[10] C. Donlon, B. Berruti, A. Buongiorno, M.-H. Ferreira, P. Féménias, J. Frerick, P. Goryl, U. Klein, H. Laur, C. Mavrocordatos, J. Nieke, H. Rebhan, B. Seitz, J. Stroede, R. Sciarra, The global monitoring for environment and security (GMES) Sentinel-3 mission, Remote Sensing of Environment 120 (2012) 37-57.
[11] T. Stuffler, C. Kaufmann, S. Hofer, K.P. Farster, G. Schreier, A. Mueller, A. Eckardt, H. Bach, B. Penné, U. Benz, R. Haydn, The EnMAP hyperspectral imager - an advanced optical payload for future applications in earth observation programmes, Acta Astronautica 61 (1-6) (2007) 115-120.
[12] D.A. Roberts, D.A. Quattrochi, G.C. Hulley, S.J. Hook, R.O. Green, Synergies between VSWIR and TIR data for the urban environment: an evaluation of the potential for the hyperspectral infrared imager (HyspIRI) decadal survey mission, Remote Sensing of Environment 117 (2012) 83-101.
[13] D. Labate, M. Ceccherini, A. Cisbani, V. De Cosmo, C. Galeazzi, L. Giunti, M. Melozzi, S. Pieraccini, M. Stagi, The PRISMA payload optomechanical design, a high performance instrument for a new hyperspectral mission, Acta Astronautica 65 (9-10) (2009) 1429-1436.
[14] S. Kraft, U. Del Bello, M. Bouvet, M. Drusch, J. Moreno, FLEX: ESA's Earth Explorer 8 candidate mission, 2012, pp. 7125-7128.
[15] T. Hilker, N.C. Coops, M.A. Wulder, T.A. Black, R.D. Guy, The use of remote sensing in light use efficiency based models of gross primary production: a review of current status and future requirements, The Science of the Total Environment 404 (2-3) (2008) 411-423.
[16] R.H. Whittaker, P.L. Marks, Methods of assessing terrestrial productivity, in: Primary Productivity of the Biosphere, 1975, pp. 55-118.
[17] H.K. Lichtenthaler, Chlorophylls and carotenoids: pigments of photosynthetic biomembranes, Methods in Enzymology 148 (1987) 350-382.
[18] R. Snieder, J. Trampert, Inverse Problems in Geophysics, Springer Vienna, Vienna, 1999, pp. 119-190.
[19] S. Jacquemoud, C. Bacour, H. Poilvé, J.-P. Frangi, Comparison of four radiative transfer models to simulate plant canopies reflectance: direct and inverse mode, Remote Sensing of Environment 74 (3) (2000) 471-481.
[20] W. Verhoef, H. Bach, Simulation of hyperspectral and directional radiance images using coupled biophysical and atmospheric radiative transfer models, Remote Sensing of Environment 87 (2003) 23-41.
[21] G. Camps-Valls, L. Bruzzone, Kernel Methods for Remote Sensing Data Analysis, John Wiley and Sons, 2009.
[22] J.L. Rojo-Álvarez, M. Martínez-Ramón, J. Muñoz-Marí, G. Camps-Valls, Digital Signal Processing with Kernel Methods, Wiley & Sons, West Sussex, UK, February 2018.
[23] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning, The MIT Press, Cambridge, MA, 2006.
[24] J. Verrelst, L. Alonso, G. Camps-Valls, J. Delegido, J. Moreno, Retrieval of vegetation parameters using Gaussian processes techniques, IEEE Transactions on Geoscience and Remote Sensing 49 (2012) 1832-1843.
[25] G. Camps-Valls, J. Verrelst, J. Muñoz-Marí, V. Laparra, F. Mateo-Jimenez, J. Gómez-Dans, A survey on Gaussian processes for earth-observation data analysis: a comprehensive investigation, IEEE Geoscience and Remote Sensing Magazine 4 (2) (2016) 58-78.
[25a] G. Camps-Valls, D. Sejdinovic, J. Runge, M. Reichstein, A perspective on Gaussian processes for earth observation, National Science Review (2019). https://doi.org/10.1093/nsr/nwz028.

[26] M. Campos-Taberner, F.J. García-Haro, A. Moreno, M.A. Gilabert, S. Sánchez-Ruiz, B. Martínez, G. Camps-Valls, Mapping leaf area index with a smartphone and Gaussian processes, IEEE Geoscience and Remote Sensing Letters 12 (12) (2015) 2501-2505.
[27] M. Campos-Taberner, F.J. García-Haro, G. Camps-Valls, G. Grau-Muedra, F. Nutini, A. Crema, M. Boschetti, Multitemporal and multiresolution leaf area index retrieval for operational local rice crop monitoring, Remote Sensing of Environment 187 (2016) 102-118.
[28] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[29] J.P. Rivera-Caicedo, J. Verrelst, J. Muñoz-Marí, J. Moreno, G. Camps-Valls, Toward a semiautomatic machine learning retrieval of biophysical parameters, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (4) (2014) 1249-1259.
[30] G. Camps-Valls, L. Gómez-Chova, J. Muñoz-Marí, M. Lázaro-Gredilla, J. Verrelst, SimpleR: A Simple Educational Matlab Toolbox for Statistical Regression, V3.0, 2016.
[31] J. Verrelst, J. Muñoz, L. Alonso, J. Delegido, J.P. Rivera, G. Camps-Valls, J. Moreno, Machine learning regression algorithms for biophysical parameter retrieval: opportunities for Sentinel-2 and -3, Remote Sensing of Environment 118 (2012) 127-139.
[32] J. Verrelst, G. Camps-Valls, J. Muñoz-Marí, J.P. Rivera, F. Veroustraete, J.G.P.W. Clevers, J. Moreno, Optical remote sensing and the retrieval of terrestrial vegetation bio-geophysical properties - a review, ISPRS Journal of Photogrammetry and Remote Sensing 108 (2015) 273-290.
[33] J. Verrelst, E. Romijn, L. Kooistra, Mapping vegetation density in a heterogeneous river floodplain ecosystem using pointable CHRIS/PROBA data, Remote Sensing 4 (9) (2012) 2866-2889.
[34] J. Verrelst, L. Alonso, G. Camps-Valls, J. Delegido, J. Moreno, Retrieval of vegetation biophysical parameters using Gaussian process techniques, IEEE Transactions on Geoscience and Remote Sensing 50 (5) (2012) 1832-1843.
[35] J. Verrelst, J.P. Rivera, J. Moreno, G. Camps-Valls, Gaussian processes uncertainty estimates in experimental Sentinel-2 LAI and leaf chlorophyll content retrieval, ISPRS Journal of Photogrammetry and Remote Sensing 86 (2013) 157-167.
[36] J.P. Rivera-Caicedo, J. Verrelst, J. Muñoz-Marí, G. Camps-Valls, J. Moreno, Hyperspectral dimensionality reduction for biophysical variable statistical retrieval, ISPRS Journal of Photogrammetry and Remote Sensing 132 (2017) 88-101.
[37] I. Jolliffe, Principal Component Analysis, Springer Verlag, New York, 1986.
[38] H. Wold, Non-linear Estimation by Iterative Least Squares Procedures, 1966.
[39] J. Delegido, J. Verrelst, C.M. Meza, J.P. Rivera, L. Alonso, J. Moreno, A red-edge spectral index for remote sensing estimation of green LAI over agroecosystems, European Journal of Agronomy 46 (2013) 42-52.
[40] D. Tuia, J. Verrelst, L. Alonso, F. Pérez-Cruz, G. Camps-Valls, Multioutput support vector regression for remote sensing biophysical parameter estimation, IEEE Geoscience and Remote Sensing Letters 8 (4) (2011) 804-808.
[41] M.A. Álvarez, L. Rosasco, N.D. Lawrence, Kernels for vector-valued functions: a review, arXiv:1106.6251 [cs, math, stat], June 2011.
[42] A. Journel, C. Huijbregts, Mining Geostatistics, Academic Press, 1978.
[43] D. Luengo-García, M. Campos-Taberner, G. Camps-Valls, Latent force models for Earth observation time series prediction, in: 2016 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2016), Salerno, Italy, September 2016.

[44] L. Martino, D. Luengo, G. Camps-Valls, Latent force models for model-data integration in vegetation monitoring, in: 10th EARSeL SIG Imaging Spectroscopy Workshop, University of Zurich, Switzerland, April 19-21, 2017.
[45] G. Camps-Valls, D. Svendsen, M. Campos, L. Martino, D. Luengo, Vegetation monitoring with Gaussian processes and latent force models, in: European Geosciences Union General Assembly, Vienna, Austria, April 23-28, 2017.
[46] B. Schölkopf, R. Herbrich, A.J. Smola, A generalized representer theorem, in: D. Helmbold, B. Williamson (Eds.), Computational Learning Theory, Springer Berlin Heidelberg, Berlin, Heidelberg, 2001, pp. 416-426.
[47] P. Goovaerts, Geostatistics for Natural Resources Evaluation, Applied Geostatistics Series, Oxford University Press, 1997.
[48] S. Sundararajan, S. Sathiya Keerthi, Predictive approaches for choosing hyperparameters in Gaussian processes, in: Advances in Neural Information Processing Systems, 2000, pp. 631-637.
[49] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, N.D. Lawrence, Dataset Shift in Machine Learning, Neural Information Processing Series, MIT Press, Cambridge, MA, London, 2009.
[50] M. Seeger, C.K.I. Williams, N.D. Lawrence, Fast forward selection to speed up sparse Gaussian process regression, 2003.
[51] L. Csató, M. Opper, Sparse on-line Gaussian processes, Neural Computation 14 (3) (2002) 641-668.
[52] E. Snelson, Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, in: Y. Weiss, B. Schölkopf, J.C. Platt (Eds.), Advances in Neural Information Processing Systems, vol. 18, MIT Press, 2006, pp. 1257-1264.
[53] M.K. Titsias, Variational learning of inducing variables in sparse Gaussian processes, Journal of Machine Learning Research 5 (2009) 567-574.
[54] J. Quiñonero-Candela, C.E. Rasmussen, A unifying view of sparse approximate Gaussian process regression, Journal of Machine Learning Research 6 (2005) 1939-1959.
[55] C.K.I. Williams, M. Seeger, Using the Nyström method to speed up kernel machines, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems, vol. 13, MIT Press, 2001, pp. 682-688.
[56] A.G. de G. Matthews, J. Hensman, R.E. Turner, Z. Ghahramani, On sparse variational methods and the Kullback-Leibler divergence between stochastic processes, in: 19th International Conference on Artificial Intelligence and Statistics, April 2015, pp. 231-239. arXiv:1504.07027.
[57] M. Bauer, M. van der Wilk, C.E. Rasmussen, Understanding probabilistic sparse Gaussian process approximations, in: D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems, vol. 29, Curran Associates, Inc., 2016, pp. 1533-1541.
[58] J. Hensman, N. Fusi, N.D. Lawrence, Gaussian processes for big data, in: Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, September 2013, pp. 282-290. arXiv:1309.6835.
[59] T.D. Bui, J. Yan, R.E. Turner, A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation, arXiv:1605.07066 [cs, stat], May 2016.
[60] A. Rahimi, B. Recht, Random features for large-scale kernel machines, in: Neural Information Processing Systems, 2007.

368 SECTION j II Algorithms and methods [61] A. Pe´rez-Suay, J. Amoro´s-Lo´pez, L. Go´mez-Chova, L. Valero, J. Mun˜oz-Mare´, G. CampsValls, Randomized kernels for large scale earth observation applications, Remote Sensing of Environment (2017). [62] M. La´zaro-Gredilla, J. Quin˜onero-Candela, C.E. Rasmussen, A.R. Figueiras-Vidal, Sparse spectrum Gaussian process regression, The Journal of Machine Learning Research 11 (2010) 1865e1881. [63] D. Busby, Hierarchical adaptive experimental design for Gaussian process emulators, Reliability Engineering & System Safety 94 (2009) 1183e1193. [64] J.P. Rivera, J. Verrelst, J. Go´mez-Dans, J. Mun˜oz-Marı´, J. Moreno, G. Camps-Valls, An emulator toolbox to approximate radiative transfer models with statistical learning, Remote Sensing 7 (7) (2015) 9347e9370. [65] L. Martino, J. Vicent, G. Camps-Valls, Automatic emulator and optimized look-up table generation for radiative transfer models, IEEE International Geoscience and Remote Sensing Symposium (IGARSS) (2017) 1e5. [66] L. Martino, J. Vicent, G. Camps-Valls, Automatic emulation by adaptive relevance vector machines, Scandinavian Conference on Image Analysis (SCIA) (2017) 1e12. [67] D. Luengo, L. Martino, Almost rejectionless sampling from Nakagami-m distributions (m1), IET Electronics Letters 48 (24) (2012) 1559e1561. [68] L. Martino, D. Luengo, Extremely efficient acceptance-rejection method for simulating uncorrelated Nakagami fading channels, Communications in Statistics - Simulation and Computation (2018). [69] L. Martino, V. Laparra, G. Camps-Valls, Probabilistic cross-validation estimators for Gaussian process regression, in: 25th European Signal Processing Conference (EUSIPCO), 2017, pp. 1e5. [70] L. Martino, V. Elvira, D. Luengo, J. Corander, F. Louzada, Orthogonal parallel MCMC methods for sampling and optimization, Digital Signal Processing 58 (2016) 64e84. [71] J. Read, L. Martino, D. Luengo, Efficient Monte Carlo optimization for multi-label classifier chains, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 1e5. [72] L. Martino, V. Elvira, D. Luengo, A. Artes, J. Corander, Smelly parallel MCMC chains, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 1e5. ´ lvarez, D. Luengo, N.D. Lawrence, Linear latent force models using Gaussian [73] M.A. A processes, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (11) (2013) 2693e2705. [74] S. Jacquemoud, W. Verhoef, F. Baret, C. Bacour, P.J. Zarco-Tejada, G.P. Asner, C. Francois, S.L. Ustin, PROSPECT þ SAIL models: a review of use for vegetation characterization, Remote Sensing of Environment 113 (Suppl. 1) (2009) S56eS66.