Useful techniques of validation for spatially explicit land-change models

Ecological Modelling 179 (2004) 445–461

Useful techniques of validation for spatially explicit land-change models Robert Gilmore Pontius Jr∗ , Diana Huffaker, Kevin Denman Department of International Development, Community and Environment, Graduate School of Geography and George Perkins Marsh Institute, Clark University, 950 Main Street, Worcester, MA 01610-1477, USA Received 25 June 2003; received in revised form 14 April 2004; accepted 3 May 2004

Abstract

This paper offers techniques of validation that land-use and -cover change (LUCC) modelers should find useful because the methods give information that is useful to improve LUCC models and to set the agenda for future LUCC research. Specifically, the validation technique: (a) budgets sources of agreement and disagreement between the prediction map and the reference map, (b) compares the predictive model to a Null model that predicts pure persistence, (c) compares the predictive model to a Random model that predicts change evenly across the landscape, and (d) evaluates the goodness-of-fit at multiple resolutions to see how scale influences the assessment. This paper introduces a new criterion called the Null Resolution, which is the spatial resolution at which the predictive model is as accurate as the Null model. For illustration, these techniques are applied to assess an LUCC model called Geomod, which predicts land change in the 22 towns of the Ipswich River Watershed in northeastern Massachusetts, USA. For this application, the Null Resolution is approximately 1 km. At resolutions finer than 1 km, the Null model performs better than Geomod, which performs better than the Random model. At resolutions coarser than 1 km, both the Geomod and Random models perform better than the Null model, but the Geomod and Random models are nearly indistinguishable beyond the 1 km resolution.

© 2004 Elsevier B.V. All rights reserved.

Keywords: LUCC; Model; Null; Scale; Prediction; Resolution; Validation

∗ Corresponding author. Tel.: +1-508-793-7761; fax: +1-508-793-8881. E-mail address: [email protected] (R.G. Pontius Jr).

1. Introduction

1.1. State of land-change modeling

Simulation models of land-use and -cover change (LUCC) typically examine a landscape at initial points in time t0 and t1, and then predict the change from t1 to some subsequent point in time t2. The predicted map of t2 is usually compared to a reference map (i.e., a truth map or observed map) of t2, in order to evaluate the performance of the simulation model. If the predicted map of t2 appears similar to the reference map of t2, then the scientist might conclude that the simulation model performed well. However, a strong agreement between the predicted map of t2 and the reference map of t2 does not necessarily indicate that the simulation model provided additional information beyond what the scientist would have predicted without the model. If the scientist had no simulation model, then the scientist’s best prediction of t2 would probably be simply the map of t1. That is, a Null model would predict pure persistence (i.e., no change) between t1 and t2. We have reviewed scores of publications and attended dozens of scientific conferences.

0304-3800/$ – see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.ecolmodel.2004.05.010


R.G. Pontius Jr et al. / Ecological Modelling 179 (2004) 445–461

For every LUCC model that we have seen, the agreement between the reference map of t1 and the reference map of t2 always appears to be greater than the agreement between the predicted map of t2 and the reference map of t2, at the resolution at which the model was run (Brown et al., 2002; Chen et al., 2002; Lo and Yang, 2002; Geoghegan et al., 2001; Schneider and Pontius, 2001). Should this cause alarm in the LUCC modeling community? If so, what should be done about it? This paper addresses these questions. The initial motivation to write this paper was a sense of disappointment on the part of the first author, who has been an LUCC modeler since 1991. After more than a decade of work, his models never performed better than a Null model, while many of his colleagues were reporting that their models performed well. This led to an investigation concerning the criteria that LUCC modelers use to evaluate models. It was found that scientists usually do not test the performance of LUCC models versus the performance of a Null model (Fielding and Bell, 1997; Baltzer, 2000). In the context of LUCC modeling, we have found only one other author who recommends that such a test be performed (Hagen, 2002, 2003). Wear and Bolstad (1998) describe a procedure to compare the prediction to a “naïve” model, but that model is not a model of pure persistence. In cases where we have asked that a scientist explicitly test the model versus a Null model of pure persistence, the agreement between t1 and t2 was indeed found to be greater than the agreement between the predicted t2 and the reference t2, at the resolution of the raw data that the model used (Manson, 2002; Pontius and Malanson, in press). Most importantly, the majority of LUCC modeling literature that we found lacked any rigorous validation. This paper argues that it is not necessarily a problem that the predictive power of LUCC models is less than the predictive power of a Null model, at the fine resolution of the raw data.
In science, improvement from worse models to better models is an intended and expected consequence of applying good methodology. The most efficient way to improve the predictive power of LUCC models is to incorporate the lessons learned from rigorous validation. Therefore it is a serious problem that LUCC modeling efforts dedicate enormous effort to sophisticated methods of calibration, to the neglect of validation (Silva and Clarke, 2002; Brean et al., 1999; Logofet and Korotkov, 2002).

It is an even more dangerous problem when misleading techniques of validation are used to assess models, because poor techniques of validation can give scientists a false sense of security concerning a model’s certainty. The purpose of this paper is to offer a helpful methodology to assess the performance of LUCC models, so that LUCC scientists can improve models.

1.2. Calibration versus validation

If a scientist is to assess the predictive power of a model, then there must be a clear distinction between the procedures of calibration and validation. Lack of such a distinction usually makes the interpretation of any results difficult and/or misleading. Some papers that we examined present a clear separation of calibration and validation (Mertens and Lambin, 2000; Brown et al., 2002; Geoghegan et al., 2001). In most of the papers, it is difficult to tell whether or how calibration and validation are distinct (Bradshaw and Muller, 1998; Messina and Walsh, 2001; Lo and Yang, 2002). In some of these cases, the lack of apparent distinction is likely a reflection of the lack of clarity in the written description. For example, the words “modeled” and “simulated” fail to distinguish the important difference between “fitted” and “predicted”. In other cases, the lack of distinction is a reflection of the methodology. For example, sometimes the model appears to be successful because the methodology calibrates the model with information from time t2, which should be reserved exclusively for validation (Wu and Webster, 1998). In particular, it is common for researchers to force the prediction to simulate the correct quantity of each cover type, then to assess whether the model predicts the correct location of land cover (Kok et al., 2001; Pontius et al., 2001). These types of practices are sometimes done unintentionally by selecting the validation pixels randomly from the map of time t2 (Wear and Bolstad, 1998).
Any lack of clarity in the methodology to distinguish the calibration information from validation information causes confusion in the scientific community and can lead to a misunderstanding of the model’s certainty. This section clarifies the distinction and explains the need for the distinction. Calibration is “the estimation and adjustment of the model parameters and constraints to improve the agreement between model output and a data set”, whereas validation is “a demonstration that a model within its domain of applicability possesses a satisfactory range of accuracy consistent with the intended application of the model” (Rykiel, 1996). One set of data should be used to calibrate the model and a separate set of data should be used to validate the model. There are a variety of ways to separate calibration data from validation data. Separation through time is one of the most common methods, as described in the opening paragraph. If the goal of the model is to predict the change on the landscape after time t1, then any information at t1 or before t1 is legitimate to use for calibration. For example, a typical method of calibration is to perform statistical regression on the change between times t0 and t1. The results of the regression are fitted estimates of parameters. The fitted parameters and the regression relationship can then be used to extrapolate the change between times t1 and t2. The key is that any information subsequent to time t1 must be prohibited from being used in the calibration process. The validation process compares the predicted map of time t2 to the reference map of time t2. Separation through space is another common method to separate calibration information from validation information. In this strategy, the model uses data from one study site to fit the parameters, and then the fitted model is applied to a different site to predict change. Separation of the calibration process from the validation process is one of the best ways to assure that the model is not over-fitted. To over-fit the model is to perform a calibration procedure so that the model parameters describe both the signal and the noise in the calibration data. The signal is the general pattern in the calibration data that is relevant to the pattern in the validation data, whereas the noise is an unpredictable pattern that is specific to the calibration data only.
There is substantial temptation to over-fit models given today’s environment of supercomputing power and enormous data abundance. A model that is over-fitted can yield strong agreement between its fitted values and the calibration data, which indicates merely that the model describes the data that it was given. A tight fit of calibration is not necessarily a good indication of how the model might perform when used for extrapolation, because an over-fitted model describes both the signal and the noise in the calibration data (Pontius and Pacheco, in press). Only the signal is relevant for accurate prediction. The process of validation indicates how well the model uses the signal to extrapolate a pattern.

1.3. Criteria for validation

During validation, there should be some objective way to compare the predicted map to the reference map in order to assess the level of agreement between the two. There is not a universally agreed-upon criterion to evaluate the goodness-of-fit of validation (Rykiel, 1996), nor should there be. Every different model has a different purpose, and the criterion should be related to the purpose. But even after the purpose of the model is specified, there can be a wide variety of criteria that are relevant to the particular application. How does a scientist decide which criteria to use? Scientists should use criteria that are likely to yield information that is useful to improve the model. Therefore, it is helpful to use a validation technique that: (a) budgets the sources of error, (b) compares the model to a Null model, (c) compares the model to a Random model, and (d) performs the analysis at multiple scales. If the validation technique budgets the sources of error, then the modeler can focus future modeling improvements on the aspects of the model that have the largest potential for benefits. A validation technique that gives just one number to indicate fit, such as percent correct, fails to give information concerning how to improve the model, since it fails to budget various sources of error. It is important to compare the model to both a Null model and a Random model in order to assess the additional predictive power, if any, that the model provides. Scale is important to consider during any comparison of maps, because results can be sensitive to scale and certain patterns may be evident at only certain scales (Kok et al., 2001; Quattrochi and Goodchild, 1997).
The methods section of this paper describes an approach to validation that has these four characteristics and applies to cases where a LUCC model predicts change among categories on a map from time t1 to time t2. Leaders in landscape research have been calling for techniques of assessment that can address these particular concerns (Bian and Walsh, 2002; Lambin et al., 1999; Messina and Walsh, 2001; Scott et al., 2002; Veldkamp and Lambin, 2001; Wu and Webster, 1998). We hope LUCC modelers will use this approach. If LUCC modelers adopt a small set of uniform methods to measure the goodness-of-fit of validation, then communication among us will be easier. The methods section applies the techniques to an ecologically important part of Massachusetts, USA, which the next section describes.

1.4. Ipswich River Watershed example

The Ipswich River is the third most endangered river in the United States (American Rivers, 2003). Parts of the river frequently run completely dry, while approximately 330,000 people depend upon the watershed for drinking water (Zarriello and Ries, 2000). When it flows, the river drains into the Plum Island Ecosystems, which constitute a Long Term Ecological Research site of the United States’ National Science Foundation (NSF). Water quality in Plum Island Sound is affected profoundly by land-use change in the upper sections of the watershed, so NSF funds research on the relationship between water quality and land use within the watershed. Since the middle of last century, the predominant form of land change has been the transformation from forest to new residential area. Fig. 1 shows the 22 towns that comprise the Ipswich River watershed, located in northeastern Massachusetts. This paper uses the towns of the Ipswich River Watershed as a study area to develop and to evaluate techniques of validation of LUCC models.

2. Methods

2.1. Data

The Massachusetts Executive Office of Environmental Affairs supplies most of the data (MassGIS, 2003). All maps are georegistered to the same 30 m by 30 m grid. The upper left portion of Fig. 2 shows the dependent variable, which is land cover of 1985 and 1999. Each pixel in the maps of land cover shows one of two categories, Built or Non-Built, as defined by the standard classification system of Anderson et al. (1976). The upper right portion of Fig. 2 shows the change in land cover between 1985 and 1999. The calibration data includes a map of slope and a map of land use of 1971. These two variables are the only ones that are both available to us in digital form and legitimate to use to calibrate the model, since they both pre-date 1985, which is the time t1 when the LUCC model begins to extrapolate the pattern of change. A map of contemporary legal protection is available in digital form. Legal protection would probably have some predictive power, but it is not legitimate to use to calibrate the model because many of today’s laws were probably instituted after 1985. The metadata for the map of legal protection fails to specify the date at which each parcel of land was put into protection. Similarly, other maps available in digital form show

Fig. 1. The shaded polygons are the 22 towns of the Ipswich River Watershed within the State of Massachusetts.


Fig. 2. Maps on the right show the change between the pairs of maps on the left.


[Fig. 3 chart: Built area in square kilometers (roughly 340–450) vs. year (1971–1999), showing the Actual series, the Calibration period (1971–1985), and the Extrapolated prediction against the Validation data (1985–1999).]
Fig. 3. Linear extrapolation of Built area. Ten square kilometers constitute approximately 1% of the study area.
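The linear extrapolation of quantity illustrated in Fig. 3 amounts to fitting a line through the two calibration observations and extending it; a minimal sketch (the Built-area values below are hypothetical, not the paper's data):

```python
def extrapolate_quantity(t0, q0, t1, q1, t2):
    """Linearly extrapolate the quantity of a land category observed
    at calibration times t0 and t1 to a later validation time t2."""
    slope = (q1 - q0) / (t1 - t0)     # change per year during calibration
    return q1 + slope * (t2 - t1)     # extend the same trend to t2

# Hypothetical Built areas (km^2) for illustration only.
built_1971, built_1985 = 350.0, 400.0
predicted_1999 = extrapolate_quantity(1971, built_1971, 1985, built_1985, 1999)
print(predicted_1999)  # ~450 km^2
```

Any disagreement between this extrapolated quantity and the reference quantity of 1999 becomes the disagreement due to quantity in the validation budget.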

information subsequent to 1985. For example, maps of present day roads are available, but maps of roads of 1985 are not available. Therefore, the strict criterion for calibration data limits the number of legitimate independent variables.

2.2. The LUCC model

The focus of this paper is on a technique of validation that applies to LUCC models that predict the change in land categories. Any such LUCC model must perform two fundamental tasks. The model must predict the quantity of each land type, and it must predict the location of each land type. These two tasks can be envisioned and measured separately. This paper uses the model Geomod as an example (Pontius et al., 2001; Pontius and Malanson, in press). Geomod’s approach explicitly separates the prediction of the quantity of land change from the location of land change. The next two paragraphs explain how Geomod computes each of these two components independently. Geomod uses a simple approach to estimate the quantity, unless a more complicated approach is justified. Fig. 3 shows how Geomod predicts the quantity of land change according to linear extrapolation. The calibration data is from time t0 = 1971 and t1 = 1985. The straight line interpolated between 1971 and 1985 is extrapolated to time t2 = 1999. Fig. 3 shows that this extrapolation results in a small error when compared to the reference data for 1999. Geomod takes

this slightly over-estimated prediction of the quantity of Built area and tries to place it at the correct locations on the landscape. Geomod predicts the location of the additional Built category according to two principles. First, Geomod assumes persistence of the category that grows. That is, if a pixel is Built in 1985, Geomod will predict that it remains Built in 1999. Second, Geomod uses a suitability map to predict which Non-Built pixels of 1985 become Built by 1999. Fig. 4 shows this suitability map. Geomod searches among the Non-Built pixels of 1985 to find the ones that are most suitable for the Built category according to Fig. 4. Geomod converts enough Non-Built pixels to Built in order to satisfy the prediction for quantity of Built area shown in Fig. 3. Two driver maps (i.e. categories of slope and categories of land use of 1971) are used in conjunction with the land cover of 1985 to calibrate the suitability map in the following manner. The calibration procedure computes the proportion of Built in 1985 for each category of slope and then uses those proportions to reclassify each category of the slope map, thus transforming it into a suitability map for post-1985 Built. The suitability map assumes that the post-1985 new Built is likely to occur on the same slope categories that the 1985 Built tended to occupy. In the case of the Ipswich River Watershed, Built tends to occur on flatter land. The same type of calibration procedure is performed using a map of 31 detailed categories of land use of 1971. Consequently, the 1971 land-use map is transformed into a suitability map that shows relatively large suitability values on Mining, Pasture and Cropland. Finally, the two suitability maps derived from slope and 1971 land use are averaged to produce the suitability map of Fig. 4, which Geomod then uses to place growth of post-1985 Built on relatively flat locations that were Mining, Pasture and Cropland in 1971. Pontius et al. 
(2001) give additional details of how Geomod works. The bottom left portion of Fig. 2 shows Geomod’s prediction of the landscape of 1999. The bottom right portion of Fig. 2 shows the location where Geomod predicts conversion from Non-Built to Built. We also consider a Random model and a Null model. The Random model takes the quantity of additional Built predicted for 1985–1999, and spreads it evenly among the Non-Built locations of 1985. The


Fig. 4. Suitability map created by Geomod from slope and 1971 land use.
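The reclassification behind a suitability map like Fig. 4 can be sketched as follows: compute the proportion Built in 1985 within each category of a driver map, then assign that proportion to every pixel of the category. The category labels and six-pixel landscape below are invented for illustration:

```python
from collections import defaultdict

def suitability_from_driver(driver, built_1985):
    """Reclassify a categorical driver map into a suitability map: each
    pixel receives the 1985 proportion Built of its driver category."""
    count = defaultdict(int)   # pixels per category
    built = defaultdict(int)   # Built pixels per category
    for cat, b in zip(driver, built_1985):
        count[cat] += 1
        built[cat] += b
    prop = {cat: built[cat] / count[cat] for cat in count}
    return [prop[cat] for cat in driver]

# Toy 6-pixel landscape: slope category per pixel, 1985 Built (1) / Non-Built (0).
slope_cat  = ["flat", "flat", "flat", "steep", "steep", "flat"]
built_1985 = [1, 1, 0, 0, 0, 1]
print(suitability_from_driver(slope_cat, built_1985))
# [0.75, 0.75, 0.75, 0.0, 0.0, 0.75]
```

Averaging two such maps, one derived from slope and one from the 1971 land-use categories, yields a combined suitability map in the manner the text describes for Fig. 4.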

Null model simply uses the 1985 reference map as the prediction for 1999. The remainder of this section describes how to compare the predictions of Geomod, the Null model, and the Random model.

2.3. Three-way map comparison

The left side of Fig. 2 shows a very strong visual similarity between Geomod’s predicted map of 1999 and the reference map of 1999. This similarity is not necessarily evidence that Geomod is helpful in predicting land change, because the visual comparison of the reference map of 1985 and the reference map of 1999 also shows strong similarity. One must look closely to see any differences among the three maps on the left side of Fig. 2. A more helpful visual comparison is between the two maps on the right side of Fig. 2. The map on the upper right portion of Fig. 2 shows the true change; the bottom right portion shows the change that Geomod predicts. This pair of maps shows that Geomod predicts approximately the correct quantity of new Built area. It is difficult to tell visually whether Geomod

predicts the change at the precisely correct locations. In some towns, it is clear that Geomod would do better to predict the change in larger patches. Fig. 2 shows the need for objective statistical comparison among maps. A purely visual comparison is vulnerable to the subjective opinion of the viewer. Based on Fig. 2, one scientist might call the results good, another might call them poor. Visual inspection of Fig. 2 is important, because the human mind can quickly detect important patterns that any one statistical procedure might miss. Statistical techniques of comparison are also important to detect patterns that the human eye misses and to facilitate communication among scientists. This paper’s technique considers the agreement between two pairs of maps. The first comparison is between the reference map of time t1 and the reference map of time t2 . The second comparison is between the predicted map of time t2 and the reference map of time t2 . Finally, the procedure evaluates the first comparison vis-à-vis the second comparison. The next two sub-sections describe the method to compare a pair of maps.


Fig. 5. Mathematical expressions for five measurements defined by a combination of information of quantity and location.

2.4. Error budget

Fig. 5 gives the mathematical expressions used to budget the components of agreement and disagreement during the comparison of two maps. Each mathematical expression compares the corresponding pixels of two maps, which are denoted R for the reference map and S for the simulated map, i.e. the predicted map. The notation in Fig. 5 is: j, the index for categories; J, the number of categories; n, the index for pixels; Ng, the number of pixels in the map at resolution g; g, the resolution as a multiple of the length of the side of a pixel of the raw data; Wgn, the weight of pixel n at resolution g; Rgnj, the proportion of category j in pixel n at resolution g of the reference map; Sgnj, the proportion of category j in pixel n at resolution g of the simulation map. In some expressions, the symbol “·” replaces the subscript n. This indicates that the term is averaged over all n. Specifically,

$$R_{g \cdot j} = \frac{\sum_{n=1}^{N_g} \left( W_{gn} R_{gnj} \right)}{\sum_{n=1}^{N_g} W_{gn}} \tag{1}$$

$$S_{g \cdot j} = \frac{\sum_{n=1}^{N_g} \left( W_{gn} S_{gnj} \right)}{\sum_{n=1}^{N_g} W_{gn}} \tag{2}$$
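Eqs. (1) and (2) are weighted averages of a category's proportion over all pixels at a given resolution; a quick numeric check with toy numbers for a single category j:

```python
def weighted_mean_proportion(weights, proportions):
    """R_{g.j} or S_{g.j}: the weighted average of one category's
    per-pixel proportion at resolution g (Eqs. (1) and (2))."""
    return sum(w * p for w, p in zip(weights, proportions)) / sum(weights)

# Three coarse pixels; the third is a half-weight edge pixel.
W = [1.0, 1.0, 0.5]
R = [0.8, 0.4, 0.2]   # proportion of category j in each pixel
print(weighted_mean_proportion(W, R))  # (0.8 + 0.4 + 0.1) / 2.5, i.e. ~0.52
```

The weights matter at coarse resolutions, where pixels on the study-area boundary contain less of the landscape than interior pixels.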

The middle expression in the second column and second row of Fig. 5 is the proportion correct based on a pixel-by-pixel comparison of the reference map of t2 and the predicted map of t2 . The other four expressions give the proportion correct between the two maps after the predicted map has been adjusted for various levels of information of quantity and/or location. Specifically, each expression is in a particular column according to the level of information of quantity in the adjusted predicted map. Information of quantity is denoted by the bold letters: n means no information, m means medium information, and p means perfect information. Each expression is in a particular row according to the level of information of location in the adjusted predicted map. Information of location is represented by the capital letters: N(x) means no information, M(x) means medium information, and P(x) means perfect information. Fig. 5 shows a sequence of expressions starting in the lower left corner, climbing up the central column, and ending in the upper right corner. N(n) is the agreement between the reference map and a map in which every pixel is identical and has partial membership of 1/J for each of the J categories. N(m) is the agreement between the reference map and a map where every pixel is identical and has membership equal to the predicted proportion for each of the J categories. M(m) is the agreement between the reference map and


the unadjusted predicted map. P(m) is the agreement between the reference map and an adjusted map in which the location of pixels of the predicted map are rearranged in space to match as closely as possible the pixels in the reference map. P(p) is the agreement between the reference map and a map that has been adjusted perfectly in terms of both location and quantity of pixels, therefore P(p) always equals 1. Each subsequent expression in the sequence shows the agreement between the reference map and an adjusted map that usually has increasingly accurate information, therefore usually 0 < N(n) < N(m) < M(m) < P(m) < P(p) = 1. The difference between each subsequent expression in the sequence is a component of agreement or disagreement. N(n) is the agreement due to chance. N(m) − N(n) is the agreement due to the predicted quantity. M(m) − N(m) is the agreement due to the predicted location. P(m) − M(m) is the disagreement due to the predicted location, which derives from error in the suitability map of Fig. 4. P(p) − P(m) is the disagreement due to the predicted quantity, which derives from error in the extrapolation of Fig. 3. These components form a budget of sources of agreement and disagreement. Pontius (2000, 2002) and Pontius and Suedmeyer (2004) give additional details of this philosophy of map comparison. The disagreement due to location and disagreement due to quantity are the two most important components for the scientist who wants to know how to improve the model. All disagreement due to location could be fixed by moving the predicted pixels in space. In the context of Geomod, this would be accomplished by improving the suitability map (Fig. 4), because the suitability map alone determines the location of predicted new Built. If all pixels in the predicted map were rearranged in space to agree as well as possible with the reference map, then the accuracy would be P(m). 
The disagreement due to quantity must be resolved in order to improve the accuracy beyond P(m). In the context of Geomod, this would be accomplished by improving the extrapolation of quantity (Fig. 3). The expressions of Fig. 5 compute the proportion correct for a pixel-by-pixel comparison at a single resolution, denoted g. The next section explains how to use these expressions to budget sources of agreement and disagreement at multiple resolutions.
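For hard-classified maps at a single resolution with equal pixel weights, the sequence N(n), N(m), M(m), P(m), P(p) and the resulting budget can be computed directly. The sketch below is a simplified rendering of the idea, not the paper's full soft-classification formulas, and the two 8-pixel maps are invented:

```python
def budget(ref, sim, J):
    """Budget of agreement/disagreement for hard-classified maps,
    following the N(n), N(m), M(m), P(m), P(p) sequence."""
    n = len(ref)
    r_bar = [ref.count(j) / n for j in range(J)]   # reference proportions
    s_bar = [sim.count(j) / n for j in range(J)]   # predicted proportions
    Nn = 1.0 / J                                    # agreement due to chance
    Nm = sum(r * s for r, s in zip(r_bar, s_bar))   # quantity info, no location
    Mm = sum(a == b for a, b in zip(ref, sim)) / n  # plain proportion correct
    Pm = sum(min(r, s) for r, s in zip(r_bar, s_bar))  # best possible relocation
    Pp = 1.0                                        # perfect quantity and location
    return {
        "chance": Nn,
        "quantity agreement": Nm - Nn,
        "location agreement": Mm - Nm,
        "location disagreement": Pm - Mm,
        "quantity disagreement": Pp - Pm,
    }

# Toy 8-pixel maps, categories 0 = Non-Built, 1 = Built.
ref = [1, 1, 1, 0, 0, 0, 0, 0]
sim = [1, 1, 0, 1, 0, 0, 0, 0]
print(budget(ref, sim, J=2))
```

Here the predicted quantity is exactly right, so the quantity disagreement is zero and all of the error is disagreement due to location, which is the pattern the text attributes to Geomod.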


2.5. Multiple resolution comparison

Pixel-by-pixel analysis is attractive because it is intuitive and relatively easy to compute, but it has conceptual problems. Pixel-by-pixel analysis can fail to detect spatial patterns because it ignores neighborhood relationships. For example, a pixel is counted as incorrect if the category in the reference map disagrees with the category in the predicted map, regardless of whether the category is found in the neighboring pixel

Fig. 6. Method to aggregate pixels in a geometric progression of resolution.
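The aggregation of Fig. 6 averages non-overlapping blocks of 2 × 2 pixels, so a hard-classified map becomes soft-classified at coarser resolutions; a minimal sketch on an invented 4 × 4 grid:

```python
def aggregate(grid):
    """Average non-overlapping 2x2 blocks: each coarse pixel stores the
    proportion (partial membership) of the category, so the map becomes
    soft-classified at the coarser resolution."""
    n = len(grid)
    return [[(grid[2*i][2*j] + grid[2*i][2*j+1] +
              grid[2*i+1][2*j] + grid[2*i+1][2*j+1]) / 4.0
             for j in range(n // 2)]
            for i in range(n // 2)]

# 4x4 hard-classified map of Built membership (1 = Built, 0 = Non-Built).
fine = [[1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1]]
coarse = aggregate(fine)      # 2x2 map of proportions
print(coarse)                 # [[0.75, 0.0], [0.0, 1.0]]
coarser = aggregate(coarse)   # 1x1: the whole-landscape proportion
print(coarser)                # [[0.4375]]
```

Repeating the step gives the geometric progression of resolutions, ending when the entire study area occupies one coarse pixel whose memberships equal the overall category proportions.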


Fig. 7. Multiple-resolution budget for components of agreement and disagreement between the reference map of 1985 and the reference map of 1999.

or nowhere near the pixel. Multiple resolution analysis seeks to address this issue (Costanza, 1989). Fig. 6 shows a pixel aggregation procedure whereby four neighboring pixels are averaged at each coarser resolution. Therefore, the length of the side of the pixel grows at each subsequent level of aggregation in a geometric progression, i.e. as multiples of 2. The Idrisi software also allows for an arithmetic progression, whereby the length of the pixel side grows by an equal increment at each step in the aggregation process. Both the geometric and arithmetic progressions produce coarse pixels with partial memberships in more than one category. The expressions of Fig. 5

are well suited to compute the agreement between such pixels that are soft-classified. Figs. 7 and 8 plot the components of agreement and disagreement at multiple resolutions. The percent correct is the interface between the agreement due to location and the disagreement due to location. The percent correct rises as resolution becomes coarser. Both agreement due to location and disagreement due to location approach 0 as the resolution approaches the point where the entire study area is in one extremely coarse pixel, which is the resolution of 1692 in the Ipswich River Watershed example. The disagreement due to quantity is independent of resolution, because

Fig. 8. Multiple-resolution budget for components of agreement and disagreement between Geomod’s predicted map of 1999 and the reference map of 1999.


[Fig. 9 chart: percent of landscape (89–100) vs. resolution as a multiple of a pixel side (1–1692), showing the Null, Geomod, and Random percent-correct curves, the shared Geomod & Random asymptote, and the lower Null asymptote.]
Fig. 9. Percent correct between the reference map of 1999 vs. three other maps: (a) Null, (b) Geomod, and (c) Random. Both Random and Geomod perform better than the Null at resolutions coarser than 32 times the original resolution.
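The Null Resolution can be read off a chart like Fig. 9 as the point where the predictive model's percent-correct curve crosses the Null model's; a sketch that locates the first resolution at which the predictive model is at least as accurate as the Null model (the curve values below are invented for illustration, not the paper's results):

```python
def null_resolution(resolutions, model_pct, null_pct):
    """Return the first (finest) resolution at which the predictive
    model's percent correct reaches or exceeds the Null model's."""
    for g, m, n in zip(resolutions, model_pct, null_pct):
        if m >= n:
            return g
    return None  # the model never catches the Null model

# Hypothetical percent-correct curves; resolutions as multiples of a pixel side.
res    = [1,    2,    4,    8,    16,   32,   64]
geomod = [91.0, 92.0, 93.0, 94.0, 94.6, 95.0, 95.3]
null   = [93.5, 93.8, 94.1, 94.4, 94.7, 95.0, 95.1]
print(null_resolution(res, geomod, null))  # 32
```

With these toy curves the crossover falls at 32 times the raw pixel side, which mirrors the paper's finding of a Null Resolution of 32 × 30 m, i.e. approximately 1 km.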

resolution relates only to the location of the categories, not to the quantity of the categories.

2.6. Null Resolution

Fig. 9 overlays crucial information of Figs. 7 and 8. The top horizontal line of Fig. 9 is P(m) for both Geomod and the Random models. It shows the disagreement due to the predicted quantity of Fig. 3. The percent correct curves for the Geomod and Random models approach this limit as resolution becomes coarser. The lower horizontal line of Fig. 9 is the Null model’s P(m), which shows the disagreement due to quantity of the net change in Built between 1985 and 1999. The percent correct for the Null model approaches its limit as resolution becomes coarser. It is typical that the percent correct of a predictive LUCC model is less than the percent correct of the Null model at fine resolutions. It is also common that the disagreement due to quantity of a predictive model is less than the disagreement due to quantity of a Null model. If these two conditions are true, then the percent correct of a predictive model must cross the percent correct of a Null model as resolution changes from fine to coarse. The resolution at which this happens is the Null Resolution. The Null Resolution is the resolution at which the accuracy of the predictive model matches the accuracy of the Null model. The predictive model is less accurate than the Null model at resolutions finer than the Null Resolution. The predictive model is more accurate than the Null model at resolutions coarser than the Null Resolution.

3. Results

Fig. 9 shows the most important results of the statistical comparison among the three models: Random, Geomod and Null. It shows that the Null Resolution for both Geomod and the Random models is 32 times the resolution of the raw data. This Null Resolution is approximately 1 km because the length of the side of a pixel of the raw data is 30 m. At resolutions finer than 1 km, the Null model is more accurate than the Geomod model, which is more accurate than the Random model. At resolutions coarser than 1 km, the accuracies of the Geomod and Random models are nearly identical, and both are more accurate than the Null model. Fig. 10 shows the results mapped at the Null Resolution of 32.

Figs. 7 and 8 budget the components of agreement and disagreement for the Null model and the Geomod model, respectively. The two budgets have some similar characteristics. The largest component of agreement is due to chance, since it accounts for 50% of the landscape at the finest resolution and grows as resolution becomes coarser. The next largest component of agreement is due to location. The agreement due to quantity is negligible, because the proportion of the Built category in 1985 and the predicted proportion of Built are both near 1/2. The most important differences between the budgets are their components of disagreement. For the Null model, the disagreement due to quantity is greater than the disagreement due to location at all resolutions. For Geomod, the disagreement due to quantity is less than the disagreement due to location at nearly all resolutions.

Fig. 10. Same as Fig. 4 but at 32 times coarser resolution.
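The distinction the budgets draw between disagreement due to quantity and disagreement due to location can be computed directly from two maps. The sketch below is a simplified, single-resolution version of such a budget (the paper's full budget also partitions components of agreement across resolutions); the function name and map encoding are our own assumptions.

```python
def disagreement_budget(pred, ref, categories=(0, 1)):
    """Split the total disagreement between two categorical maps (flat lists)
    into a quantity component (mismatched category proportions) and a
    location component (the remainder), in the spirit of the paper's budget."""
    n = len(pred)
    total = sum(p != r for p, r in zip(pred, ref)) / n
    # quantity disagreement: half the summed absolute difference in category counts
    quantity = sum(abs(pred.count(c) - ref.count(c)) for c in categories) / (2 * n)
    location = total - quantity
    return {"total": total, "quantity": quantity, "location": location}
```

For instance, two maps with identical category proportions but swapped cells show pure location disagreement, while a map that overpredicts Built everywhere shows pure quantity disagreement.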

4. Discussion

4.1. Interpretation of scale

The results show that the Null model performs better than Geomod at the 30 m resolution of the raw data, according to the percent correct criterion. This result should neither alarm nor disappoint LUCC modelers. It is merely an indication that the resolution of the raw data is finer than Geomod's ability to predict land change. This is indeed a desirable situation, because the alternative is for the data to be coarser than the model's ability to predict change, in which case the analysis would be limited by the coarseness of the data. In fact, if any predictive model were to perform better than a Null model at the resolution of the raw data, then the immediate question would be "To what extent does the coarseness of the raw data limit the model's ability to predict?"

Geomod begins to predict more accurately than a Null model at a resolution of about 1 km, at which point there is 95% agreement between the predicted map and the reference map. Is this good enough? The most common answer to such a question is, "It depends on the purpose and the a priori goals of the model." While this answer is true, it is not particularly helpful in the context of contemporary LUCC modeling, because we know of no LUCC modeling exercise that has ever specified a priori a desired level of accuracy or precision. In all cases that we have seen, LUCC models simply attempt to perform as accurately and precisely as possible, given the available data. This creates a serious problem in setting the agenda for future LUCC research. If there is no target level of performance, then it is impossible to know whether a LUCC model is sufficiently detailed.

In this paper's example, even if it were possible to predict change at the 30 m resolution, it would probably not be desirable to do so, because the level of detail of the data necessary to predict human activity at a precision of 30 m would overwhelm the research project's resources. To predict change at the 30 m resolution, we would need to follow the individual history of every plot of land as it makes its way through a maze of regulations and criteria before it can be converted from Non-Built to Built. Numerous agents, such as buyers, sellers, developers, bankers, lawyers, environmentalists, engineers, politicians, bureaucrats, and abutters, can influence whether a particular plot of land passes from Non-Built to Built. The necessary research budget would grow exponentially as a function of the desired level of precision.

If the goal were to predict change at the 1 km resolution, then a spatially explicit model would not even be necessary. The results of the Random model show that a LUCC model that distributes the predicted quantity of change evenly in space becomes more accurate than a Null model at a resolution of 1 km. At this resolution, the Random model is 95% correct. If this is sufficiently precise and accurate for the intended application, then the entire modeling exercise can conclude there, at tremendous savings to the research budget.

Given that LUCC models usually do not specify the desired level of accuracy a priori, it is difficult to understand the basis on which scientists routinely claim that results are good or acceptable or useful. How can a model be judged to be good or acceptable when there is no explicitly stated benchmark (Rykiel, 1996)? One hypothesis is that scientists feel professional pressure to claim that their efforts are successful, or at least useful. Scientifically speaking, this practice is unprofessional and does tremendous damage to science, because it makes us blind to areas of potential improvement.
As scientists, we should merely state the level of accuracy as a statistical measurement. If a particular reader feels a need to judge the model as poor or good, then the reader should select a benchmark. A second hypothesis is that some scientists think of validation as a Boolean decision concerning whether the model is correct or not. This view of validation is not particularly helpful for model improvement and is a philosophical quagmire (Oreskes et al., 1994). We find it more helpful to view validation as a standard procedure of science that is designed to show in what respects models perform well and poorly, with the ultimate goal to improve models. Nevertheless, in practice, it is sometimes necessary to make Boolean decisions concerning whether or not to use a particular model, especially when deciding which (if any) model to use from among several possible models. In these cases, the scientist must decide whether or not a model is valid for a particular domain of application or range of scales. The methods of this paper give important measurable criteria to consider when analyzing an appropriate range of scales. A predictive model should be used within a domain where the model predicts at least as well as a Null model. In our Ipswich River Watershed example, Geomod predicts better than a Null model at resolutions coarser than 1 km, so 1 km is the finest resolution at which results should be presented to decision makers (Fig. 10).

4.2. The statistical criterion

This paper uses the percent correct as the conceptual foundation to evaluate the agreement between maps. The advantage of percent correct is that both scientists and non-scientists find the concept intuitive. Wear and Bolstad (1998) use both percent correct and a more complicated, less interpretable approach based on Shannon's theory of information. This paper's multiple resolution assessment of percent correct answers the question "Do the two maps generally look similar?" The answer to this question depends on resolution. For example, if the reference map were a chess board pattern and the prediction map were the same pattern shifted by one column or row, then the agreement would be zero percent correct at the resolution of the raw pixels, but the agreement would be perfect at a resolution of 2. In this sense, this paper's multiple resolution procedure attempts to produce results similar to a human's ability to recognize patterns.
It does this by assessing the agreement of the quantity and location of the categories at various resolutions. After the two general questions concerning quantity and location are addressed, the maps are ready for a more detailed analysis of spatial pattern (Gustafson and Parker, 1992).
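The chess board example can be verified in a few lines. The sketch below, with our own hypothetical function names, compares an 8-by-8 chess board against the same pattern shifted by one column: cell-by-cell agreement is zero, yet after aggregating each map to category proportions in 2-by-2 blocks, the two maps become identical.

```python
def percent_correct(a, b):
    """Cell-by-cell percent agreement between two n-by-n maps."""
    n = len(a)
    matches = sum(a[i][j] == b[i][j] for i in range(n) for j in range(n))
    return 100.0 * matches / (n * n)

def block_proportions(m, k):
    """Aggregate an n-by-n binary map to the proportion of category 1 within
    each k-by-k block (n must be a multiple of k)."""
    n = len(m)
    return [[sum(m[i][j] for i in range(bi, bi + k) for j in range(bj, bj + k)) / (k * k)
             for bj in range(0, n, k)]
            for bi in range(0, n, k)]

# an 8-by-8 chess board, and the same pattern shifted by one column
board = [[(i + j) % 2 for j in range(8)] for i in range(8)]
shifted = [[(i + j + 1) % 2 for j in range(8)] for i in range(8)]
```

Every 2-by-2 block of either map contains exactly half of each category, so the aggregated maps agree perfectly even though no individual cell agrees.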

Other researchers endorse other statistics that measure the details of the spatial patterns in maps (Messina and Walsh, 2001). Measures include number of patches, patch size, patch density, contagion, fractal dimension, etc. While those statistics can be helpful, they should be interpreted in the context of the two fundamental concepts of quantity and location that this paper's methods address. For example, the number of patches on a landscape is constrained by the quantities of each category. If one category dominates the landscape, then it is nearly impossible for that category to be distributed spatially in separate patches. On the other hand, if many categories account for equal proportions of the landscape, then a wide range of numbers of patches is possible. This has important implications for setting the research agenda to improve a model. For example, if a model is very poor at predicting the quantities of each category, then it might be severely constrained in its ability to predict the pattern of patches. The model must first improve its prediction of the quantity of the categories in order eventually to improve its prediction of patch pattern. Also, if the model predicts the categories in generally the correct neighborhood, then the pattern of patches is likely to be similar.

4.3. The bias of masking

Whatever the statistical criterion, it is dangerous to mask out parts of the study area during the validation phase. Results of statistical analysis can be extremely sensitive to any procedure that ignores parts of the study area. For example, it is common practice to mask all water from a study area. In this paper's example, we include all area enclosed by the town boundaries, including inland water such as lakes and streams. In Massachusetts, some of the historically most important types of land transformation have been to or from the water category. Much of Boston used to be wetland and water.
Also, the water supply for Boston comes from huge reservoirs that used to be thriving farming communities. In addition to masking water, another common practice is to mask land that does not change during the interval of extrapolation. After the true persistence is removed, the question is "Did the model predict change at locations where change really occurred?" According to this criterion, a model that predicts change everyplace would always be 100% correct. This type of masking is undesirable because when persistence is removed from the analysis, the researcher cannot answer the question "Did the model predict change at locations where change did not occur?" Yet another common practice is to mask land that is already Built at time t1, i.e. 1985. This would be undesirable because it would make the analysis blind to the fact that land can and does transition from Built to Non-Built. Geomod does not have the ability to simulate the transition from Built to Non-Built. Therefore, if the validation procedure excludes areas that are already Built at t1 = 1985, then the scientist will fail to see this potential weakness of the model. The primary motivation to mask areas of persistence from the validation phase is to focus on the change that the model attempts to predict. Any such masking can introduce bias into the validation assessment. One of the best ways to avoid this type of bias is to include the entire study area in the analysis, and to compare the predictive model versus a Null model of pure persistence.

4.4. Room for improvement

Numerous conversations with modelers have revealed extraordinary optimism concerning the potential predictive power of LUCC models, if only better data were available. We would like to have several more maps of 1985 to help to generate alternative suitability maps, because Fig. 8 shows that nearly all of Geomod's error is attributable to location. If the suitability map were better, then the error of location would be smaller. We suspect that important factors that determine the location of new Built are: roads, age of parcel owners, and legal constraints. Such maps did not exist in easily accessible digital form in 1985 or before. Some scientists have told us that if we were to use such factors, then surely Geomod's prediction would be more accurate.
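The bias that masking persistence introduces (Section 4.3) is easy to demonstrate numerically. In the hypothetical sketch below, a degenerate model that predicts change at every cell scores 100% correct when validation is restricted to the cells that truly changed, yet only 20% correct over the whole study area; the data and function name are invented for illustration.

```python
def percent_correct_masked(pred_change, real_change, mask):
    """Percent agreement between predicted and actual change maps, computed
    only over the cells where mask is True."""
    kept = [i for i, m in enumerate(mask) if m]
    return 100.0 * sum(pred_change[i] == real_change[i] for i in kept) / len(kept)

# 1 = the cell changed between t1 and t2, 0 = it persisted
real_change = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
pred_everywhere = [1] * 10                     # a model that predicts change at every cell

full = [True] * 10                             # validate over the whole study area
changed_only = [c == 1 for c in real_change]   # mask out true persistence
```

Masking persistence rewards the degenerate model because the masked sample can no longer reveal false alarms, which is exactly the bias the text describes.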
As a matter of curiosity, to test whether this level of optimism is justified, we allowed Geomod to use the map of contemporary legal protection to help in the prediction of the location of the transition from Non-Built to Built. For this additional run, Geomod excluded any land where there was some legal constraint on a transition to the Built category. In Massachusetts, the legal status of the land has substantial influence in determining its use and hence its cover. At the finest resolution, this additional model run performs marginally better (90.5% correct) than the Geomod run that does not consider the map of legal constraints (90.3% correct). At coarser resolutions, this additional run is indistinguishable from the other Geomod run and the Random run. This is consistent with other research where a Null model has a higher agreement than even a model that merges calibration information and validation information.

4.5. Comparison with other models

It could be that Geomod's simplicity is the reason why it performs worse than the Null model at the resolution of the raw data. It might be that a more complicated model would be able to predict human behavior more accurately, given the same calibration data. One way to test this hypothesis would be to use any of the plethora of other LUCC models to attempt to predict the same phenomenon of land change in northeastern Massachusetts. Agarwal et al. (2002) and Briassoulis (2000) review many popular LUCC models. Whatever alternative model is chosen, it must perform the same two basic tasks that Geomod performs. Specifically, it must determine the quantity of each category for the predicted landscape and it must determine where to locate each category on the landscape. Many land allocation models, for example, CLUE (Veldkamp and Fresco, 1996) and Markov (Pontius and Malanson, in press; Baltzer et al., 1998), are designed to give the user the ability to specify these two distinct tasks independently. Some cellular automata models, for example, SLEUTH (Clarke et al., 1996; Clarke, 1998), do not allow the modeler to separate these two tasks explicitly; nevertheless, such models usually allow the user to influence the quantity of each predicted category based on various parameters that relate to rates of change. Parker et al.
(2003) review the state of the art of agent-based models, which attempt to simulate changes in landscapes based on complex interactions among decision-making agents. It is not clear whether the added complexity of any of these alternative models improves predictive accuracy, because most LUCC modelers have not yet adopted rigorous validation techniques. Regardless of the alternative model, the methods of this paper would be useful to compare the performance of competing models in terms of errors of quantity, errors of location, and spatial resolution.

5. Conclusions

Validation is the weakest part of contemporary LUCC modeling. It is frequently ignored, and when it is performed, it often uses misleading methods. This paper offers techniques of validation that LUCC modelers should find helpful because the methods are designed specifically to give information that is useful to improve LUCC models and to set the agenda for future LUCC research. Specifically, this paper's validation technique: (a) budgets sources of agreement and disagreement between the prediction map and the reference map, (b) compares the predictive model to a Null model, (c) compares the predictive model to a Random model, and (d) evaluates the goodness-of-fit at multiple resolutions to see how scale influences the assessment. This paper has defined a new criterion, the Null Resolution, which is the resolution at which the predictive model is as accurate as the Null model. It should be fairly easy for the LUCC community to adopt these techniques because they are available in the Validate module of the GIS software Idrisi®. If scientists adopt the methods of this paper, then they will increase their ability: to learn the most important ways to improve models, to define an appropriate domain for a particular model application, to present results at an appropriate spatial resolution, to communicate the validity of model predictions, and to set a research agenda that prioritizes the most important issues.

Acknowledgements

The National Science Foundation supported this work via three of its programs: (1) Research Experience for Undergraduates Summer Fellowship Program in association with the Long Term Ecological Research grant OCE-9726921, (2) Center for Integrated Study of the Human Dimensions of Global Change through a cooperative agreement between Carnegie Mellon University and the National Science Foundation SBR-9521914, and (3) the HERO program through the grant "Infrastructure to Develop a Human-Environment Regional Observatory Network" Award ID 9978052. Clark Labs facilitated this work by making the techniques of this paper part of the Kilimanjaro version of the GIS software Idrisi®.

References

Agarwal, C., Green, G., Grove, J.M., Evans, T., Schweik, C., 2002. A Review and Assessment of Land-Use Change Models: Dynamics of Space, Time and Human Choice. USDA Forest Service, Newtown Square, PA.
American Rivers, 2003. America's most endangered rivers of 2003. http://www.americanrivers.org/docs/MostEndangeredRivers2003.pdf.
Anderson, J.R., Hardy, E.E., Roach, J.T., Witmer, R.E., 1976. A land use and land cover classification system for use with remote sensor data. Geological Survey Professional Paper 964. United States Government Printing Office, Washington, DC.
Baltzer, H., 2000. Markov chain models for vegetation dynamics. Ecol. Model. 126, 139–154.
Baltzer, H., Braun, P.W., Kohler, W., 1998. Cellular automata models for vegetation dynamics. Ecol. Model. 107, 113–125.
Bian, L., Walsh, S., 2002. Characterizing and modeling landscape dynamics: an introduction. Photogram. Eng. Rem. S. 68 (10), 999–1000.
Bradshaw, T.K., Muller, B., 1998. Impacts of rapid urban growth on farmland conversion: application of new regional land use policy models and geographic information systems. Rural Sociol. 63 (1), 1–25.
Brean, W.D., Boyle, S., Breininger, D.R., Schmalzer, P.A., 1999. Coupling past management practice and historical landscape change on John F. Kennedy Space Center, Florida. Landscape Ecol. 14, 291–309.
Briassoulis, H., 2000. Analysis of land use change: theoretical and modeling approaches. In: Loveridge, S. (Ed.), The Web Book of Regional Science. Regional Research Institute, West Virginia University, Morgantown, WV. www.rri.wvu.edu/regsweb.htm.
Brown, D.G., Goovaerts, P., Burnicki, A., Li, M.-Y., 2002. Stochastic simulation of land-cover change using geostatistics and generalized additive models. Photogram. Eng. Rem. S. 68 (10), 1051–1061.
Chen, J., Gong, P., He, C., Luo, W., Tamura, M., Shi, P., 2002. Assessment of the urban development plan of Beijing by using a CA-based urban growth model. Photogram. Eng. Rem. S. 68 (10), 1063–1071.
Clarke, K.C., 1998. Loose-coupling a cellular automaton model and GIS: long-term urban growth prediction for San Francisco and Washington/Baltimore. Int. J. Geogr. Inf. Sci. 12, 699–714.
Clarke, K.C., Hoppen, S., Gaydos, L.J., 1996. A self-modifying cellular automaton model of historical urbanization in the San Francisco Bay area. Environ. Planning B 24, 247–261.
Costanza, R., 1989. Model goodness of fit: a multiple resolution procedure. Ecol. Model. 47, 199–215.
Fielding, A.H., Bell, J.F., 1997. A review of methods for the assessment of prediction errors in conservation presence/absence models. Environ. Conserv. 24 (1), 38–49.
Geoghegan, J., Villar, S.C., Klepeis, P., Mendoza, P.M., Ogneva-Himmelberger, Y., Roy Chowdhury, R., Turner II, B.L., Vance, C., 2001. Modeling tropical deforestation in the southern Yucatan peninsular region: comparing survey and satellite data. Agr. Ecosyst. Environ. 85 (1–3), 25–46.
Gustafson, E.J., Parker, G.R., 1992. Relationships between landcover proportion and indices of landscape spatial pattern. Landscape Ecol. 7 (2), 101–110.
Hagen, A., 2002. Multi-method assessment of map similarity. In: Proceedings of the Fifth AGILE Conference on Geographic Information Science, Palma, Spain, April 25–27, pp. 171–182.
Hagen, A., 2003. Fuzzy set approach to assessing similarity of categorical maps. Int. J. Geogr. Inf. Sci. 17 (3), 235–249.
Kok, K., Farrow, A., Veldkamp, T.A., Verburg, P., 2001. A method and application of multi-scale validation in spatial land use models. Agr. Ecosyst. Environ. 85 (1–3), 223–238.
Lambin, E.F., Baulies, X., Bockstael, N., Fischer, G., Krug, T., Leemans, R., Moran, E.F., Rindfuss, R.R., Sato, Y., Skole, D., Turner II, B.L., Vogel, C., 1999. Land-use and land-cover change implementation strategy. IGBP Report 48. IHDP Report 10. Royal Swedish Academy of Sciences, Stockholm.
Lo, C.P., Yang, X., 2002. Drivers of land-use/land-cover changes and dynamic modeling for the Atlanta, Georgia metropolitan area. Photogram. Eng. Rem. S. 68 (10), 1073–1082.
Logofet, D.O., Korotkov, V.N., 2002. 'Hybrid' optimization: a heuristic solution to the Markov-chain calibration problem. Ecol. Model. 151, 51–61.
Manson, S.M., 2002. Integrated Assessment and Projection of Land-Use and Land-Cover Change in the Southern Yucatán Peninsular Region of Mexico. Doctoral dissertation. Graduate School of Geography, Clark University, Worcester, MA.
MassGIS, 2003. Massachusetts Geographic Information Systems. Executive Office of Environmental Affairs. www.state.ma.us/mgis.
Mertens, B., Lambin, E., 2000. Land-cover-change trajectories in southern Cameroon. Ann. Assoc. Am. Geogr. 90 (3), 467–494.
Messina, J., Walsh, S., 2001. 2.5D morphogenesis: modeling landuse and landcover dynamics in the Ecuadorian Amazon. Plant Ecol. 156, 75–88.
Oreskes, N., Shrader-Frechette, K., Belitz, K., 1994. Verification, validation, and confirmation of numerical models in the Earth sciences. Science 263 (5147), 641–646.
Parker, D.C., Manson, S.M., Janssen, M.A., Hoffmann, M.J., Deadman, P., 2003. Multi-agent systems for the simulation of land-use and land-cover change: a review. Ann. Assoc. Am. Geogr. 93 (2), 314–337.


Pontius Jr., R.G., 2000. Quantification error versus location error in comparison of categorical maps. Photogram. Eng. Rem. S. 66 (8), 1011–1016.
Pontius Jr., R.G., 2002. Statistical methods to partition effects of quantity and location during comparison of categorical maps at multiple resolutions. Photogram. Eng. Rem. S. 68 (10), 1041–1049.
Pontius Jr., R.G., Malanson, J., in press. Comparison of the accuracy of land-change models: cellular automata Markov versus Geomod. Int. J. Geogr. Inf. Sci.
Pontius Jr., R.G., Pacheco, P., in press. A multiple resolution ROC statistic to validate a GIS-based model of forest disturbance in the Western Ghats, India 1920–1990. GeoJournal.
Pontius Jr., R.G., Suedmeyer, B., 2004. Components of agreement between categorical maps at multiple resolutions. In: Lunetta, R.S., Lyon, J.G. (Eds.), Remote Sensing and GIS Accuracy Assessment. CRC Press, Boca Raton, FL.
Pontius Jr., R.G., Cornell, J.D., Hall, C.A.S., 2001. Modeling the spatial pattern of land-use change with GEOMOD2: application and validation for Costa Rica. Agr. Ecosyst. Environ. 85 (1–3), 191–203.
Quattrochi, D.A., Goodchild, M.F. (Eds.), 1997. Scale in Remote Sensing and GIS. Lewis Publishers, Boca Raton, FL.
Rykiel Jr., E.J., 1996. Testing ecological models: the meaning of validation. Ecol. Model. 90, 229–244.
Schneider, L., Pontius, R., 2001. Modeling land-use change in the Ipswich watershed, Massachusetts, USA. Agr. Ecosyst. Environ. 85 (1–3), 83–94.
Scott, M.J., Heglund, P.J., Morrison, M.L., Haufler, J.B., Raphael, M.G., Wall, W.A., Samson, F.B. (Eds.), 2002. Predicting Species Occurrences: Issues of Accuracy and Scale. Island Press, Washington, DC.
Silva, E., Clarke, K., 2002. Calibration of the SLEUTH urban growth model for Lisbon and Porto, Portugal. Comput. Environ. Urban Syst. 26, 525–552.
Veldkamp, A., Fresco, L.O., 1996. CLUE: a conceptual model to study the conversion of land use and its effects. Ecol. Model. 85, 253–270.
Veldkamp, A., Lambin, E.F., 2001. Predicting land-use change. Agr. Ecosyst. Environ. 85, 1–6.
Wear, D., Bolstad, P., 1998. Land-use changes in southern Appalachian landscapes: spatial analysis and forecast evaluation. Ecosystems 1, 575–594.
Wu, F., Webster, C.J., 1998. Simulation of land development through the integration of cellular automata and multicriteria evaluation. Environ. Planning B 25, 103–126.
Zarrielo, P., Ries, K., 2000. A Precipitation-Runoff Model for Analysis of the Effects of Water Withdrawals on Streamflow, Ipswich River Basin, Massachusetts. Report 00–4029. US Department of the Interior and U.S. Geological Survey.