INTRODUCTION

Actual evapotranspiration (ET) is a crucial process that links the terrestrial water and energy balances (Xu et al., 2019). Thus, over recent decades researchers have studied the relationship between ET and its controlling factors to better estimate this process. In general, three ET driving forces have been identified, i.e., net radiation (Rn), atmospheric variables, and surface properties (Qiu et al., 2019). While Rn is the main forcing variable, the soil moisture content (SM) is the most important surface state variable on which ET depends. SM controls the exchange of latent and sensible heat between the surface and the atmosphere (Kim et al., 2018; Purdy et al., 2018). In fact, empirical and semiempirical ET methods, such as those of Penman (1948) and Priestley and Taylor (1972), are based upon Rn; however, neither ET nor Rn is a readily available observation (Jain et al., 2008; Chen, J. et al., 2020).

The most precise field measurements of ET are made with lysimeters, flux towers, and scintillometers (Tikhamarine, Malik, Pandey et al., 2020); these observations are scarce outside the northern hemisphere. Moreover, it can be roughly estimated that there are 100 meteorological stations for each ET observation device in developed countries, and this ratio might decrease in developing countries (Tikhamarine, Malik, Souag-Gamane, and Kisi, 2020).

On the other hand, solar radiation fluxes are observed using pyranometers and solarimeters, among other instruments. These instruments are also rarely found at meteorological stations due to the high cost of installation and maintenance (Yadav and Chandel, 2014). The scarcity of solar radiation data, available from few meteorological stations, led to the development of several clear-sky radiation models using remotely sensed data (Bisht et al., 2005; Zhang et al., 2020) and machine learning (ML) algorithms (Alizamir et al., 2020). Besides, other hybrid forecasting models are reported in the literature (Si et al., 2020).

The impact of the radiation variables on ET estimations has been the focus of different studies. For instance, an error analysis of Rn and ET models was done by Llasat and Snyder (1998). The authors concluded that Rn and ET estimates are sensitive to soil temperature (Ts) errors and insensitive to air temperature (Ta) errors; however, Ta is readily available everywhere on the Earth while Ts is available at coarser temporal resolution. Trnka et al. (2007) analysed the effect of solar radiation estimates on crop yield models and transpiration models. They reported a Root Mean Square Error (RMSE) of about 15 % in crop yield estimates with the Angstrom-Prescott global radiation formula and RMSEs up to 33 % with the formulation presented by Hargreaves, Hargreaves and Riley (1985). Jain et al. (2008) highlighted the importance of solar radiation to compute the reference evapotranspiration (ETo). Their results showed that solar radiation has the major impact (more than 30 %) on the ETo estimation compared to other meteorological variables. However, Majidi et al. (2015) concluded that the effect of calculated missing solar radiation data on the Penman-Monteith (P-M) estimates is negligible, in both semi-humid and semi-arid climate conditions. Mokhtari et al. (2018) also investigated the impact of different solar radiation estimations (using empirical, physically-based data assimilation, and satellite observation models) on the P-M equation in a semiarid region. They found that the ETo error was related to the solar radiation error through a fourth-degree equation.

Carter and Liang (2019) evaluated the effects of remote sensing or observed radiation data, along with vegetation indexes, in 10 ET ML algorithms for diverse ecosystem types around the world. They showed that Global Land Surface Satellite (GLASS) solar radiation produced ET errors similar to those obtained with Rn measurements. Granata (2019) explored four ML algorithms, i.e., Regression Tree (RT), Bootstrap Aggregating (BA), Random Forest (RF), and Support Vector Machine (SVM), to model ET from a grassland site in Florida. The author built three ET models, combining Rn, soil moisture content, relative humidity (RH), Ta, wind speed, and sensible heat flux data. The results emphasized the importance of taking into account Rn data to obtain satisfactory ET results; however, the author did not evaluate the impact of Rn on those results. More recently, Yamac and Todorovic (2020) analysed the influence of meteorological variables on potato ET estimation using the k-Nearest Neighbour (kNN), Artificial Neural Network (ANN) and Adaptive Boosting (AB) techniques. They improved ET estimates by more than 50 % in terms of RMSE when observed solar radiation data were included in the input dataset along with Ta, RH, and wind speed data. Tang et al. (2018) investigated two ML algorithms (ANN and SVM) to estimate ET in a rainfed maize field under mulching and non-mulching conditions, using meteorological and crop data. The authors tested two different input combinations to model ET, i.e., one combination considered meteorological observations (maximum, minimum, and mean Ta; maximum, minimum, and mean RH; solar radiation; wind speed) and crop data (leaf area index, plant height), and the other combination of input variables had only meteorological data. They did not analyse the influence of radiation data on ET errors.

Most of the aforementioned studies were applied using observations of different shortwave and longwave radiations, which are not easily available in developing countries at field scale. Also, some of these studies analysed the influence of the radiation terms on estimating ETo with meteorological variables, and only a few of them used vegetation indexes or crop characteristics along with ML algorithms.

Thus, this work aims to analyse the errors of different Rn substitutes on ML models to estimate ET from two crops: maize (Zea mays L.) and soybean (Glycine max (L.) Merr.). Therefore, in the present study, ET was estimated using five ML methods, including Support Vector Machine (SVM), Kernel Ridge (KR), Decision Tree (DT), Adaptive Boosting (AB), and Multilayer Perceptron (MLP) with three different radiation inputs, i.e., observed Rn (RnO), modelled Rn (RnM), and computed extraterrestrial solar radiation (Ra), in conjunction with meteorological variables and the normalized difference vegetation index (NDVI).

MATERIALS AND METHODS

Meteorological and satellite data

Meteorological data provided by the FLUXNET ground observations network and a Moderate-Resolution Imaging Spectroradiometer (MODIS) satellite product were used in this study.

FLUXNET ground tower sites labelled as croplands (CRO) were selected here, but only those with maize and soybean were processed. Besides, operative stations were considered in this analysis according to the criteria of Purdy et al. (2018) and Walker and Venturini (2019), i.e., stations with high-quality Rn and ET measurements, with more than 60 % of reliable Rn data and few ET outliers. The selected meteorological stations, crop information source, location, time span, the Köppen-Geiger climate class (Beck et al., 2018), the dominant soil group, and the mean observed ET and Rn for each site are listed in Table 1. Data source references can be found on the FLUXNET website. Six of the processed stations are in temperate climates and only two are in continental climates. The mean Rn varies according to the latitude, as expected, and the mean ET varies from approximately 1 to 7.5 mm/d. The dominant soil group information for each site was obtained from the Harmonized World Soil Database (HWSD), published in 2012 by the Food and Agriculture Organization of the United Nations (FAO), the International Institute for Applied Systems Analysis (IIASA), ISRIC-World Soil Information, the Institute of Soil Science - Chinese Academy of Sciences (ISSCAS), and the Joint Research Centre of the European Commission (JRC). Table 1 shows that most of the stations are in different soil types, except for US-Ne1, US-Ne2 and US-Ne3 which, due to their proximity, are in the same dominant soil group. However, based on the USDA texture classification (Shirazi et al., 1988), most of these soils have loam textures.

The FLUXNET network was established for quantifying carbon, water vapor, and energy fluxes (Miralles et al., 2011), and flux data are available at https://fluxnet.org/data/fluxnet2015-dataset/. For this work, the meteorological variables mean Ta (Ta), minimum Ta (Tamin), maximum Ta (Tamax), Ta range (Tar), mean RH (RH), minimum RH (RHmin), maximum RH (RHmax), Rn, and latent heat flux (LE) were considered. LE was converted to a water loss measure, i.e., ET in mm/d. Raw FLUXNET data were pre-processed to remove missing or wrong data and outliers using Tukey's methodology (Schwertman et al., 2004). Then, mean daily values were calculated by integrating only the quality-checked measurements of the daylight hours.
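The two pre-processing steps described above can be illustrated with a minimal sketch. It is not the authors' code: the latent heat of vaporization (2.45 MJ/kg), the 24 h averaging period, and the fence multiplier k = 1.5 are assumptions, and the function names are hypothetical.

```python
import statistics

# Assumption: latent heat of vaporization of water, about 2.45 MJ/kg near 20 degC.
LAMBDA_MJ_PER_KG = 2.45
# Assumption: LE is a mean flux over a full day; the study averaged daylight hours,
# so the averaging period used there may differ.
SECONDS_PER_DAY = 86400


def le_to_et_mm_per_day(le_w_m2):
    """Convert a mean latent heat flux (W/m2) to ET (mm/d)."""
    energy_mj_m2_day = le_w_m2 * SECONDS_PER_DAY / 1e6
    # 1 kg of water per m2 corresponds to a depth of 1 mm.
    return energy_mj_m2_day / LAMBDA_MJ_PER_KG


def tukey_filter(values, k=1.5):
    """Keep only values inside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]
```

With these assumptions, a mean LE of 100 W/m2 corresponds to roughly 3.5 mm/d.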

The NDVI was obtained from MODIS products using the Google Earth Engine platform. NDVI was estimated with the MOD09Q1 V6 product, an eight-day composite dataset, which provides an estimate of the surface spectral reflectance of EOS-Terra MODIS bands 1 and 2 corrected for atmospheric conditions such as gases, aerosols, and Rayleigh scattering. The NDVI has a spatial resolution of about 250 m, comparable to the footprint of the FLUXNET towers. The NDVI time series were obtained for each FLUXNET site and linearly interpolated after passing a moving average filter, to estimate daily values.
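As a rough sketch of the NDVI processing chain described above (index from the red and near-infrared reflectances, moving average, then linear interpolation to daily values), the following Python functions may help. The window length of three composites and all function names are assumptions for illustration; the study does not state the filter width.

```python
def ndvi(red, nir):
    """NDVI from red (MODIS band 1) and near-infrared (band 2) reflectance."""
    return (nir - red) / (nir + red)


def moving_average(series, window=3):
    """Centered moving average; the window shrinks at the series edges."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def interpolate_daily(days, values):
    """Linearly interpolate composite values (e.g. every 8 days) to each day.

    `days` are increasing integer day numbers; returns {day: value}.
    """
    daily = {}
    for i in range(len(days) - 1):
        d0, d1 = days[i], days[i + 1]
        v0, v1 = values[i], values[i + 1]
        for d in range(d0, d1):
            daily[d] = v0 + (v1 - v0) * (d - d0) / (d1 - d0)
    daily[days[-1]] = values[-1]
    return daily
```

Following the order given in the text, a usage would be `interpolate_daily(days, moving_average(ndvi_series))`.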

ML algorithms were calibrated and validated with FLUXNET ET as the output variable, and with FLUXNET meteorological data, Rn substitutes, and NDVI as input variables.

Machine learning

As mentioned in the Introduction, in this work five ML algorithms were used, considering the results published in Carter and Liang (2019). The regressor methods applied here are SVM, KR, DT, AB, and MLP (Carter and Liang, 2019). These methodologies are briefly described below.

Table 1: Summary of the general information for FLUXNET tower sites used in this study

1. Support Vector Machine (SVM)

SVM was developed by Vapnik (1999). SVMs are derived from structural risk minimization theory, which minimizes the empirical risk and the confidence interval of the learning machine. The strength of these methodologies is their solid mathematical basis in statistical theory, and they have demonstrated accurate results in a wide range of real-world problems. Initially developed for solving classification problems, SVM techniques can also be successfully applied to regression problems.

A regression is estimated by using SVM for a given data set {(xi, yi)}, i = 1, ..., n, where xi are the input vectors, yi are the output values and n is the total number of data points (Tang et al., 2018). So, the regression equation can be formulated as:

3. Decision Tree (DT)

DTs are very popular ML techniques since they have a simple format. Also, they are efficient methods for solving classification and regression problems (Xu et al., 2005). Basically, DT algorithms construct a tree with leaves that are labelled with a specific class property and with inner nodes that represent the class attribute. Given an inner node, the branching of that node follows the different values of a descriptive attribute. The result of this process is a decision tree that classifies new information by following a path from the root to a leaf according to the selected descriptive attributes. These models generate a set of rules which can be used for prediction through the repetitive process of splitting.

In a regression problem, X = X1, X2, ..., Xpn are the predictor variables and pn is the total number of predictor variables. Let n be the number of observations and Y = Y1, Y2, ..., Yn a target variable that takes continuous values; vf is a feature variable and th is a threshold value. Let t and g = (vf, tht) be a node and a candidate split, respectively. Then:

The KR method (Saunders et al., 1998) is a special case of SVM, which combines ridge regression with kernel techniques for capturing nonlinear relationships (You et al., 2018). Specifically, in the

where equations 7 and 8 show that Q1 (the left side of the decision tree) and Q2 (the right side of the decision tree) are found by splitting the data at the candidate split g. Then, equation 9 presents the calculation of the mean predicted value at a terminal node.
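The splitting step can be sketched in a few lines of Python, under stated assumptions: samples at a node are divided into a left subset Q1 (feature vf at or below the threshold th) and a right subset Q2 (above it), and a terminal node predicts the mean of its target values. The mean squared error used here to rank candidate splits is an assumption (a common regression criterion), and all function names are hypothetical.

```python
def split_node(samples, vf, th):
    """Split (x, y) samples into Q1 (x[vf] <= th) and Q2 (x[vf] > th)."""
    q1 = [(x, y) for x, y in samples if x[vf] <= th]
    q2 = [(x, y) for x, y in samples if x[vf] > th]
    return q1, q2


def node_mean(q):
    """Mean target value of a node: the prediction at a terminal node."""
    return sum(y for _, y in q) / len(q)


def node_mse(q):
    """Mean squared error of a node around its own mean (assumed impurity)."""
    mean = node_mean(q)
    return sum((y - mean) ** 2 for _, y in q) / len(q)


def best_split(samples):
    """Return the (vf, th) pair minimising the size-weighted MSE of Q1 and Q2.

    Assumes at least one feature takes two or more distinct values.
    """
    n = len(samples)
    best = None
    for vf in range(len(samples[0][0])):
        # Candidate thresholds: observed values, excluding the largest one
        # so that both sides of the split are non-empty.
        for th in sorted({x[vf] for x, _ in samples})[:-1]:
            q1, q2 = split_node(samples, vf, th)
            cost = (len(q1) * node_mse(q1) + len(q2) * node_mse(q2)) / n
            if best is None or cost < best[0]:
                best = (cost, vf, th)
    return best[1], best[2]
```

Applying `best_split` recursively to Q1 and Q2 until a stopping rule is met (for example a maximum depth) would grow the full tree.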

The ability to track and evaluate every step in the DT process is an important advantage that makes these methods applicable to different problems. Indeed, DTs have been applied to remote sensing (Zhang et al., 2017), biology (Darnell et al., 2007), and hydrology (Nourani et al., 2019), among other applications.

4. Adaptive Boosting (AB)

The AB is one of the most used boosting methods given its simplicity and accurate estimation (Wu et al., 2008). It is an ensemble learning algorithm in which weak learners are combined into a weighted sum. The success of this method lies in looking for a strong learner through linear combinations of weak learners:

$$H(x) = \sum_{k=1}^{K} a_k h_k(x) \qquad (10)$$

where hk(x) denotes the kth weak learner; K is the number of weak learners; ak denotes the coefficient of the kth weak learner, and H(x) denotes a strong learner.

The training process is done in three steps: first, a training dataset is randomly selected to begin; secondly, the model is repeatedly trained, selecting the training set based on the errors of the previous results; and finally, the model assigns higher weights to the weaker estimations. The algorithm iterates until the training data is estimated with the minimum error or the maximum number of iterations is reached. AB algorithms can be used in both classification and regression problems (Yamac and Todorovic, 2020).

5. Multilayer Perceptron Regressor (MLP) - Artificial Neural Network (ANN)

The ANN models are the most well-known ML methods, used for modelling soil moisture (García et al., 2019), water balance (Kumar et al., 2011), and solar radiation (Yadav and Chandel, 2014). The method connects neurons (input variables), by assigning a weight to each of them, to find the pattern that explains the output variable. The ANN training process defines the relationship among the input neurons, so that it can be applied to a new dataset to estimate the output variable (Yamac and Todorovic, 2020).

The MLP model consists of multiple layers, classified as input, hidden, and output. Input neurons are the explanatory variables, the output layer is the estimated unknown variable, while the hidden layers are artificial neurons needed to connect the input and the output layers. Hidden layers are critical for modelling nonlinear processes. The MLP model can be mathematically formulated as:

where wkj are weights between hidden and output layers; wji are weights between input and hidden layers; Xi are input variables; m is the number of neurons in a hidden layer; n is the number of neurons in an input layer, Bj and Bk are the bias values of the neurons in the hidden and output layers, respectively; F is the transfer function; and Y is the output.
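For reference, a single-hidden-layer form consistent with the symbols defined above can be written as follows. It is a standard textbook expression offered as a sketch; the use of the same transfer function F in the hidden and output layers is an assumption.

```latex
Y = F\left( \sum_{j=1}^{m} w_{kj} \, F\left( \sum_{i=1}^{n} w_{ji} X_i + B_j \right) + B_k \right)
```

Here the inner sum runs over the n input neurons feeding hidden neuron j, and the outer sum runs over the m hidden neurons feeding the output neuron k.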

Model implementation and hyperparameters selection

In this work, the data were normalized using the mean and the standard deviation, as suggested by Yamac and Todorovic (2020). Then, the data were randomly partitioned into training and testing data sets. Specifically, 80 % of the data were used for the parameter tuning of each ML method, and the remaining 20 % were used for testing. Table 2 presents the main statistics (minimum, maximum, mean, standard deviation and median) of each variable for the training and test datasets of maize and soybean. It can be noted that the proposed variables have similar statistics in both datasets, suggesting that the training and test sets are not significantly different. The maximum ET values in both datasets are about 20 mm/d, although the mean values are around 6 mm/d. It was observed that stations US-Ne1 and US-Ne2 (irrigated fields) present maize ET values as high as 21 mm/d during summer, while in US-Ne3 (rainfed) ET reaches 15 mm/d. The other stations show values lower than 10 mm/d.
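A minimal sketch of the normalization and the 80/20 random partition is given below. The use of the population standard deviation, the fixed random seed, and the function names are assumptions; whether the study computed the normalization statistics on the full dataset or on the training set only is not stated.

```python
import random
import statistics


def zscore(column):
    """Normalize a variable to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)  # assumption: population standard deviation
    return [(v - mean) / sd for v in column]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly partition rows into training and test subsets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test
```

Fixing the seed only makes the partition reproducible; any seed gives the same 80/20 proportions.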

A k-fold cross-validation method was applied to estimate the prediction error and to set the hyperparameters. This method is an iterative process that consists of randomly splitting the dataset into k groups of approximately equal size; k-1 groups are used to train the model and the remaining group is used for testing. This process is repeated k times using a different group for testing in each iteration. The process generates k error estimates whose average is used as the final estimate. Here, the k-fold method was applied dividing the dataset into five subsets, i.e., k=5 (Anguita et al., 2005).
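The k-fold procedure described above can be sketched as follows. The function names and the `fit_and_score` callback are hypothetical: the callback stands for any routine that trains a model on the training indices and returns an error computed on the test indices.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle n sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n, k, fit_and_score, seed=0):
    """Average the k error estimates returned by fit_and_score(train, test)."""
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k
```

For hyperparameter selection, `cross_validate` would be called once per candidate setting and the setting with the lowest average error retained.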

Table 2: Minimum (Min), maximum (Max), mean, standard deviation (SD), and median statistics for each used variable for the training and test dataset

Rn substitutes

As aforesaid, routine Rn measurements are not easy to obtain from meteorological stations. Hence, maize and soybean ET errors were tested with two Rn substitutes, i.e., RnM and Ra.

The machine learning techniques presented here were applied to model Rn using the meteorological data from the stations listed in Table 1. Ra, the solar radiation received at the top of the atmosphere on a horizontal surface, is calculated as a function of the latitude, date, and time of day. Here, Ra was computed according to the methodology proposed by FAO 56 (Allen et al., 1998), through the following formulation:

where Ra is the extraterrestrial solar radiation (W/m2), Gsc is the solar constant (0.0820 MJ/m2/min), dr is the inverse relative distance Earth-Sun, ωs is the sunset hour angle (radians), φ is the latitude (radians) and δ is the solar declination (radians).
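A minimal sketch of the daily FAO 56 calculation (Allen et al., 1998) is given below. It uses the FAO 56 expressions for the inverse relative distance, the solar declination, and the sunset hour angle, with the solar constant of 0.0820 MJ m-2 min-1, and returns Ra in MJ/m2/d. The function names, the clamping of the arccosine argument for high latitudes, and the 24 h averaging used to convert to W/m2 are assumptions; how the study converted Ra to W/m2 is not stated.

```python
import math

GSC = 0.0820  # solar constant, MJ m-2 min-1 (FAO 56)


def extraterrestrial_radiation(lat_deg, day_of_year):
    """Daily extraterrestrial radiation Ra (MJ m-2 d-1), FAO 56 daily form."""
    phi = math.radians(lat_deg)                                      # latitude
    dr = 1 + 0.033 * math.cos(2 * math.pi * day_of_year / 365)      # inverse relative distance
    delta = 0.409 * math.sin(2 * math.pi * day_of_year / 365 - 1.39)  # solar declination
    # Assumption: clamp the argument so polar day/night do not raise an error.
    cos_ws = max(-1.0, min(1.0, -math.tan(phi) * math.tan(delta)))
    ws = math.acos(cos_ws)                                           # sunset hour angle
    return (24 * 60 / math.pi) * GSC * dr * (
        ws * math.sin(phi) * math.sin(delta)
        + math.cos(phi) * math.cos(delta) * math.sin(ws)
    )


def mj_per_day_to_w_m2(ra_mj_m2_day):
    """Assumed conversion: average the daily total over 24 h."""
    return ra_mj_m2_day * 1e6 / 86400
```

As a check, `extraterrestrial_radiation(-20, 246)` reproduces the worked example in FAO 56 for 20°S on 3 September, about 32.2 MJ/m2/d.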

Models performance

In order to analyse the performance of the SVM, KR, DT, AB, and MLP algorithms, the RMSE, the Mean Absolute Error (MAE), Bias, and the determination coefficient (R2) were quantified. Besides, Taylor's diagram was plotted (Taylor, 2001), which comprised the standard deviation (SD), correlation coefficient (r), and the centered Root Mean Square difference (RMS) statistics. The equations of the statistics RMSE, MAE, Bias, R2, SD, r, and RMS used here are the following:

where n is the number of observations, O is the observed data, M is the modelled data, and O̅ and M̅ are the mean values of O and M, respectively. Also, SDM and SDO are the standard deviations of M and O, respectively.
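The statistics listed above can be computed as in the following sketch, which uses their usual definitions. Two points are assumptions rather than statements about the study: R2 is taken as the squared Pearson correlation, and Bias is taken as modelled minus observed. The function name is hypothetical.

```python
import math


def performance_metrics(obs, mod):
    """RMSE, MAE, Bias, R2 and Taylor-diagram statistics for O and M series."""
    n = len(obs)
    o_mean = sum(obs) / n
    m_mean = sum(mod) / n
    errors = [m - o for m, o in zip(mod, obs)]
    sd_o = math.sqrt(sum((o - o_mean) ** 2 for o in obs) / n)
    sd_m = math.sqrt(sum((m - m_mean) ** 2 for m in mod) / n)
    cov = sum((m - m_mean) * (o - o_mean) for m, o in zip(mod, obs)) / n
    r = cov / (sd_o * sd_m)
    # Centered RMS difference: RMSE after removing the mean of each series.
    rms = math.sqrt(
        sum(((m - m_mean) - (o - o_mean)) ** 2 for m, o in zip(mod, obs)) / n
    )
    return {
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "Bias": sum(errors) / n,  # assumption: modelled minus observed
        "R2": r ** 2,             # assumption: squared Pearson correlation
        "SD_O": sd_o,
        "SD_M": sd_m,
        "r": r,
        "RMS": rms,
    }
```

The three Taylor-diagram quantities are linked by RMS² = SD_M² + SD_O² − 2·SD_M·SD_O·r (Taylor, 2001), which is what allows them to be drawn on a single diagram.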

RESULTS AND DISCUSSION

Radiation analysis

The year 2005 was randomly selected to present the results of the Ra calculation and RnO in Figure 1. In this figure, those FLUXNET stations with maize and soybean data during 2005 were plotted. It is clear that Ra is about twice RnO; however, both variables present similar trends. Ra is the incident solar radiation outside of the atmosphere, so it is reasonable to consider it as a radiation input, substituting Rn, in the ML calculation of ET.

The meteorological variables Ta, Tamin, Tamax, RH, RHmin, RHmax, and Ra were the inputs for the SVM, KR, DT, AB, and MLP algorithms to obtain RnM. The data were normalized using the mean and the standard deviation, and randomly partitioned into training and testing data sets. Then, the k-fold cross-validation method was applied, and the results were contrasted with RnO using the aforementioned statistics.

The mean, maximum and minimum RnO for all the processed stations were 361.39 W/m2, 639.35 W/m2, and 28.33 W/m2, respectively, while the same statistics for Ra were 888.57 W/m2, 1059.83 W/m2, and 393.91 W/m2.

Figure 2 shows the results in terms of the median, first (25 %), and third (75 %) quartiles, the data range and outliers of the RnM. The RMSE, MAE, and R2 metrics were added to the box of each method. Clearly, AB best reproduced RnO, with the lowest RMSE of 59.80 W/m2 (16.4 % of the mean RnO). The KR and MLP algorithms computed Rn estimations with errors similar to those obtained with the AB algorithm. The DT and SVM methods presented the poorest agreement with RnO, with RMSEs of 73.78 W/m2 and 72.67 W/m2, respectively.

Figure 1: Comparison of calculated daily Ra (black dotted line) with mean daily RnO (grey solid line) for FLUXNET stations for a randomly selected year (2005) with maize and soybean crops. The black dashed line shows RnO trends

Figure 2: Boxplots of the Rn estimations for SVM, KR, DT, AB, and MLP algorithms. The RMSE, MAE, and R2 between RnO and RnM are presented for each evaluated method

Figure 3 shows the scatter plot between AB Rn (RnM) estimates and RnO, for all the studied sites. It can be observed that AB estimations presented a good correlation with field Rn measurements, although there are important differences between them.

The results of this analysis show that AB yielded the best results compared to field Rn measurements. Hence, AB Rn estimations will be used as a Rn substitute to calculate daily ET.

The results presented in Figure 2 demonstrate that all the evaluated ML methods were able to adequately model RnO. However, the AB, KR, and MLP algorithms exhibited the best RnO estimation accuracy, with AB showing the lowest RMSE (16.4 % of the mean RnO). These results are comparable to those presented by other studies. Wang et al. (2019) applied a Boosting method to estimate surface shortwave Rn, reporting an RMSE of about 11 % of the observed Rn. Similarly, Jiang et al. (2014) published an RMSE of about 16 % of the mean RnO, using the MLP algorithm to simulate field Rn. However, these works modelled Rn from multi-source data, using remote sensing products, surface measurements, and reanalysis products.

Figure 3: Relationship between AB Rn estimations and ground Rn observations (n=7051) for all the study FLUXNET stations. The solid black line represents the 1:1 line

On the other hand, Ojo et al. (2021) used MLP with observed meteorological variables for predicting RnO, obtaining RMSEs of about 8 % of the observed data. Their investigations were conducted only in tropical regions (Ojo et al., 2021). Here, RnO was estimated using routinely measured meteorological variables from stations spatially distributed across the world, producing similar errors.

ET machine learning models for Maize and Soybean

The proposed ML algorithms were applied to estimate daily maize and soybean ET with three different radiation inputs, i.e., RnO, AB RnM, and calculated Ra, in conjunction with seven meteorological variables (Ta, Tamin, Tamax, Tar, RH, RHmin, RHmax), and the vegetation index NDVI. Thus, three scenarios were analysed in this study, as shown in Table 3, to investigate the effect of Rn substitutes estimates in maize and soybean ET errors.

The data were normalized, randomly split and pre-processed, as already explained in sub-section Model implementation and hyperparameters selection. The soundness of the ML methods was evaluated using the proposed statistics.

Results of ML algorithms for daily maize ET, for three input combinations, were contrasted with observed ET and presented in Table 4. As expected, combination 1 yielded the lowest error for each case, given it made use of the most accurate Rn data, i.e., RnO. However, the errors and bias from the Rn substitutes are close to those of combination 1. The KR and AB methods presented the best results compared with field ET measurements for each evaluated combination (see Table 4). Using a simple estimation of the amount of incoming energy, such as Ra, would increase daily maize ET errors by 6 % of the mean observed ET (6.72 mm/d), compared to the RMSE obtained in combination 1.

The efficiency of ML methods with the three input combinations is shown in Figure 4. Taylor's diagrams confirm that RnO produces the lowest RMS compared with field ET measurements. Even so, all the evaluated algorithms yielded correlations higher than 0.87 and similar SD for the different radiation inputs.

From the above results, AB seems to be the most precise ML algorithm, thus it was used to plot the comparison between simulated and observed ET with all the input combinations (see Figure 5).

Best statistics are highlighted in bold.

Figure 4: Taylor's diagrams for comparative assessment of SVM, KR, DT, AB, and MLP daily maize ET estimation, with three different radiation inputs, RnO (a), RnM (b) and Ra (c). The black circle on the x-axis represents observed ET statistics

Results with RnO (AB1) are closer to the 1:1 line than RnM and Ra results (AB2 and AB3, respectively). Nevertheless, the Rn substitutes performance is good, delivering similar ET estimations to AB1 in maize.

The SVM, KR, DT, AB, and MLP daily ET estimations were compared with daily soybean ET measurements. Table 4 presents a summary of the RMSE, MAE, R2, and Bias for each evaluated ML method under the three input combinations. As expected, combination 1, with RnO, presented the lowest errors and the highest R2 with ground ET observations. Nevertheless, combinations 2 and 3 yielded errors and bias comparable to those obtained with RnO.

The KR, AB, and MLP algorithms exhibited the best ET estimation accuracy in soybean for each analysed input combination. Using Rn substitutes in ET estimation with ML methods would increase ET errors by up to 7 % of the RMSE obtained with RnO. Indeed, the mean ML RMSEs with combinations 1 and 3 are 21.13 % and 29.63 % of the mean observed ET (6.50 mm/d), respectively. The DT method gave the worst daily soybean ET estimates for each case.

Figure 5: Relationship between simulated AB ET with all the input combinations (AB1, AB2, and AB3) and observed daily maize ET

Taylor's diagrams were plotted for comparative assessment of SVM, KR, DT, AB, and MLP daily soybean ET estimation with the three input combinations (see Figure 6).

It can be noted that RnM and Ra produce similar RMS compared with ground ET observations; nevertheless, RnO yielded the best results. In general, the proposed ML methods were able to estimate ET with good accuracy using the three input combinations, except for DT which showed the highest errors.

According to the results presented in Table 4, KR, MLP and AB seem to be the most accurate ML methods to estimate daily soybean ET. In particular, KR1, KR2 and AB3 yielded the lowest RMSE for combinations 1, 2, and 3, respectively. So, the KR1, KR2 and AB3 ET estimations were compared with field ET measurements in Figure 7. Results from the RnO (KR1) are the closest to the 1:1 line (see Figure 7a). However, the ET estimations with Rn substitutes (KR2 and AB3) presented a good correlation and bias with observed ET, with moderate dispersion.

ET results discussion

The proposed ML algorithms were applied to estimate daily ET under three different input combinations, as shown in Table 3. Considering that ET represents an important component in the energy balance, similar input variables were used to model Rn and ET. In fact, the ET process expends most of the energy absorbed by the earth surface during a year.

Results presented here demonstrate that ML algorithms are suitable to simulate complex nonlinear processes such as ET. Moreover, the advantage of ML methods, compared with traditional ET equations, is their capability to assimilate substitute variables that represent the dynamics of a process, as a surrogate of accurate field measurements.

All the implemented ML algorithms provided good performances in daily ET estimation using Ra and RnM as inputs, yielding correlations higher than 0.87 and 0.83 for maize and soybean, respectively. Nevertheless, the AB and KR methods exhibited the lowest RMSs compared with observed ET in maize and soybean (see Figures 4 and 6). Comparable findings were published by Carter and Liang (2019), who demonstrated that Boosting and Kernel methods similar to those used here presented the lowest errors when estimating daily cropland ET, compared with other ML algorithms. The success of AB lies in looking for a strong regressor through linear combinations of weak learners, iterating until the training data is estimated with the minimum error (Wu et al., 2008; Yamac and Todorovic, 2020). KR is a suitable method for estimating a nonlinear process using many variables as inputs (Hofmann et al., 2008; Zhang et al., 2013).

In this study, the AB method presented accurate results for daily maize and soybean ET. Indeed, AB yielded RMSEs of about 25 % (1.7 mm/d) in maize and 27 % (1.8 mm/d) in soybean of the mean observed ET using Rn substitutes as input variables. These results are in good agreement with Granata (2019), Yamac and Todorovic (2020), and Fan et al. (2021), who showed that Boosting techniques had a high precision for modelling daily ET and transpiration. These previous studies reported RMSEs varying from 8 to 13 % (Granata, 2019), 4 to 29 % (Yamac and Todorovic, 2020), and 20 to 33 % (Fan et al., 2021) of the mean observed values, when modelling daily grassland ET, potato ET, and maize transpiration, respectively, using observed radiation data. However, the findings of Fan et al. (2021) cannot be directly extrapolated to maize ET, since evaporation is important in early maize stages while transpiration has great importance from the V6 stage.

The MLP algorithm showed good predictive capabilities to model ET using Rn substitutes, with RMSEs of about 26 and 27 % of the mean observed ET for maize and soybean, respectively. Comparable results were presented by Yamac and Todorovic (2020) when they simulated ET from potatoes, with RMSEs ranging from 2 to 29 % of the mean observed ET. They used solar radiation along with different variables to estimate FAO 56 crop evapotranspiration (Allen et al., 1998).

Figure 6: Taylor's diagrams for SVM, KR, DT, AB, and MLP daily soybean ET estimation under the various radiation inputs, RnO (a), RnM (b) and Ra (c). The black circle on the x-axis represents observed ET statistics

The results of this study showed that the SVM algorithm was able to adequately model ET using Rn substitutes, yielding RMSEs of about 27 and 28 % of the mean observed maize and soybean ET, respectively. These results are comparable to those published by Tang et al. (2018), Granata (2019), Chen, Z. et al. (2020), and Fan et al. (2021). Tang et al. (2018) reported RMSEs ranging from 9 to 17 % of the mean observed maize ET, using the wind speed and crop height as input variables. Granata (2019) evaluated the SVM to model ET in a grassland site, with RMSEs ranging from 7 to 14 % of the mean observed ET. Chen, Z. et al. (2020) published RMSEs higher than 20 % using SVM to estimate daily ETo under different combinations of atmospheric variables. Fan et al. (2021) used SVM for daily maize transpiration estimation, with RMSEs ranging from 6 to 31 % of the mean daily observed transpiration.

Figure 7: Relationship between KR1, KR2 (a) and AB3 (b) ET estimations and observed daily soybean ET

CONCLUSION

Empirical and semiempirical ET equations require precise net radiation measurements to obtain accurate results, at any time and space scale. Since Rn is not readily available information, a comparison of five ML methods for obtaining ET from two Rn substitutes was performed here. Our results showed good efficiency of the ML algorithms assessed, yielding acceptable errors with easily obtainable radiation proxies, meteorological variables and NDVI. However, these errors are larger than physics-based model errors, as can be corroborated in the literature. In general, this type of ML method is operative and flexible, but its accuracy is debatable. Indeed, Rn was modelled with the Support Vector Machine, Kernel Ridge, Decision Tree, Adaptive Boosting and Multilayer Perceptron methods, using meteorological variables readily available everywhere. In general, all the evaluated ML methods were able to effectively model Rn, with errors of about 16 % (60 W/m2) of the mean observed Rn. However, Kernel Ridge, Adaptive Boosting, and Multilayer Perceptron presented the most accurate estimations. Hence, Rn substitutes computed from routine meteorological data seem to be an effective alternative to consider in many regions where heat flux and radiation observations are rare.

The proposed ML methods were suitable to estimate ET with extraterrestrial solar radiation and modelled net radiation. Adaptive Boosting and Kernel Ridge presented consistent results for maize and soybean using Rn substitutes, rendering RMSE lower than 26 % (1.75 mm/d) of the mean observed ET. Thus, it can be concluded that Adaptive Boosting and Kernel Ridge techniques can be applied with Rn substitutes for mapping ET with meteorological data and satellite NDVI images.

ACKNOWLEDGEMENTS

The authors thank the National Scientific and Technical Research Council - Argentina, which supported this investigation. Also, the authors acknowledge KILIMO S A for supporting Gianfranco Fagioli's work. Finally, the authors wish to thank the FLUXNET ground observation network for freely providing the in-situ data belonging to its stations.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.