JUNTO Practice: Data Analysis, Historical Crop Yields (Part 3)
Discussed on December 01, 2020.
Analyze this dataset and create a short report:
- "The global dataset of historical yields for major crops 1981–2016." https://doi.org/10.1038/s41597-020-0433-7
Please submit an abstract (~200 words) with your report.
Solutions
Click to see:
Oscar Martinez
Crop Data 3
0 Abstract
Last time we looked at relationships between nation state data and the crop yields of the plots in them. This week we try modeling yield data based on plot specific latitude, longitude, temperature, and precipitation data. We also include previous year yield data because it seemed reasonable it would be available in practice.
We use two benchmarks to evaluate our model. (1) Guessing the mean. (2) Guessing the previous year’s yield (the one year lag).
We tried modeling the data with a simple linear regression as well as a neural network.
Results show that while the Mean Squared Error and Max Absolute Error for the linear regression models are the same as the mean-guessing benchmark, the Median Absolute Error is significantly lower. However, further analysis revealed that the majority of the modeling success came from emulating the lag benchmark. Coefficient analysis and residual plotting supports this.
The neural network was not able to outperform the linear regression model, suggesting the data does not contain enough signal to do better.
We conclude there is not enough signal at the frequency we have the data to meaningfully predict yields with any superiority to just guessing the previous year’s yield. Without more domain expertise, it is difficult to know how useful this is in practice. Further work should bring in domain knowledge to understand what this data is used for in practice, and see if more useful modeling emerges at different granularities than annual yield prediction.
1. Data Construction
We create annual aggregates of monthly temperature and precipitation data.
temperature = nc.Dataset( "climate_data/air.mon.mean.v501.nc" ) print(temperature, "\n") print(temperature.dimensions) print(temperature.variables) nc.num2date( temperature["time"][:], temperature["time"].units, only_use_cftime_datetimes=False, only_use_python_datetimes=True, ) def netcdf_window(dataset, variable, window_size, agg_func): window_stats = [] for i in range( 0, dataset[variable].shape[0], window_size ): window_stats.append( agg_func( dataset[variable][i : i + window_size], 0 ) ) return np.stack(window_stats) temp_means = netcdf_window(temperature, "air", 12, np.mean)[ 80:-2 ] temp_vars = netcdf_window(temperature, "air", 12, np.var)[ 80:-2 ]
years = [y for y in range(1981, 2017)] lat_range = [i for i in range(360)] lon_range = [i for i in range(720)] data = pd.DataFrame( index=pd.MultiIndex.from_product( [years, lat_range, lon_range], names=["year", "lat_id", "lon_id"], ) )
data["lat"] = precipitation["lat"][:].take( data.index.get_level_values(1) ) data["lon"] = precipitation["lon"][:].take( data.index.get_level_values(2) )
for crop in crops: data[f"{crop}_yield"] = np.concatenate( [ crop_matrices[crop][y]["var"][:].ravel() for y in years ] )
data = data.merge( data.groupby(level=[1, 2]) .shift(1) .drop(["lat", "lon"], axis=1), left_index=True, right_index=True, suffixes=("", "_lag"), )
data = data.replace(to_replace=-999000000.0, value=0)
2. Modeling
We set up a loop to train a model for each crop.
# Modeling for crop in crops: target_crop = crop target_col = f"{target_crop}_yield" non_model_cols = [ c_f for c in crops for c_f in (f"{c}_yield", f"{c}_yield_lag") ] # v2: non_model_cols = [f"{c}_yield" for c in crops] # non_model_cols.extend([f"{c}_{m}_lag" for c in ["precipitation","temperature"] for m in ["mean","var"]]) non_model_cols.remove(f"{target_col}_lag") X = ( data.loc[data[target_col] > -999_000_000.0] .loc[1982:] .drop(non_model_cols, axis=1) .replace(to_replace=-999_000_000.0, value=0)[ f"{target_col}_lag" ] ) y = ( data.loc[data[target_col] > -999_000_000.0] .loc[1982:2010][target_col] .replace(to_replace=-999_000_000.0, value=0) ) y_guess_lag = ( data.loc[data[target_col] > -999_000_000.0] .loc[1982:][target_col + "_lag"] .replace(to_replace=-999_000_000.0, value=0) ) # scaler = Normalizer() # scaled_X = scaler.fit_transform(X) yield_lmodel = sm.OLS( y, X, missing="drop", hasconst=True ) result = yield_lmodel.fit() print(f"MSE for model:", result.mse_total) print( f"Max Error for model:", max_error(y, result.fittedvalues), ) print( f"Median Absolute Error for model:", median_absolute_error(y, result.fittedvalues), ) print(f"\nStandard Deviation for y model:", y.std()) print(f"Mean for y:", y.mean()) print( "\nMSE for guessing mean:", mean_squared_error(y, [y.mean()] * len(y)), ) print( f"Max Error for guessing mean:", max_error(y, [y.mean()] * len(y)), ) print( f"Median Absolute Error for guessing mean:", median_absolute_error(y, [y.mean()] * len(y)), ) print( "\nMSE for guessing lag:", mean_squared_error(y, y_guess_lag), ) print( "Median Absolute Error for guessing lag:", median_absolute_error(y, y_guess_lag), ) print(result.summary()) plt.scatter(range(result.resid.shape[0]), result.resid) plt.show() plt.scatter(range(y.shape[0]), y - y.mean()) plt.show() plt.scatter(range(y.shape[0]), y - y_guess_lag) plt.show()
MSE for model: 12.890721416272546
Max Error for model: 27.743002
Median Absolute Error for model: 0.26920652
Standard Deviation for y model: 3.5903652
Mean for y: 3.4950097
MSE for guessing mean: 12.890694
Max Error for guessing mean: 25.76118
Median Absolute Error for guessing mean: 2.2961192
MSE for guessing lag: 1.0769701
Median Absolute Error for guessing lag: 0.26920128
OLS Regression Results
==============================================================================
Dep. Variable: maize_major_yield R-squared: 0.916
Model: OLS Adj. R-squared: 0.916
Method: Least Squares F-statistic: nan
Date: Sat, 28 Nov 2020 Prob (F-statistic): nan
Time: 11:14:31 Log-Likelihood: -6.3568e+05
No. Observations: 436591 AIC: 1.271e+06
Df Residuals: 436590 BIC: 1.271e+06
Df Model: 0
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
maize_major_yield_lag 1.0011 0.000 3121.096 0.000 1.000 1.002
==============================================================================
Omnibus: 329941.720 Durbin-Watson: 0.419
Prob(Omnibus): 0.000 Jarque-Bera (JB): 23804309.948
Skew: 3.046 Prob(JB): 0.00
Kurtosis: 38.657 Cond. No. 1.00
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
MSE for model: 1.177887018166247
Max Error for model: 5.966308
Median Absolute Error for model: 0.20253134
Standard Deviation for y model: 1.085305
Mean for y: 2.0639737
MSE for guessing mean: 1.1778803
Max Error for guessing mean: 8.736028
Median Absolute Error for guessing mean: 0.7471347
MSE for guessing lag: 0.21855998
Median Absolute Error for guessing lag: 0.20260334
OLS Regression Results
==============================================================================
Dep. Variable: soybean_yield R-squared: 0.814
Model: OLS Adj. R-squared: 0.814
Method: Least Squares F-statistic: nan
Date: Sat, 28 Nov 2020 Prob (F-statistic): nan
Time: 11:14:34 Log-Likelihood: -1.1685e+05
No. Observations: 177431 AIC: 2.337e+05
Df Residuals: 177430 BIC: 2.337e+05
Df Model: 0
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
soybean_yield_lag 0.9988 0.000 2058.457 0.000 0.998 1.000
==============================================================================
Omnibus: 69707.792 Durbin-Watson: 0.342
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1032754.720
Skew: 1.489 Prob(JB): 0.00
Kurtosis: 14.438 Cond. No. 1.00
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
MSE for model: 11.880597425507798
Max Error for model: 22.400003
Median Absolute Error for model: 0.2875111
Standard Deviation for y model: 3.4468243
Mean for y: 4.0950184
MSE for guessing mean: 11.880554
Max Error for guessing mean: 20.333187
Median Absolute Error for guessing mean: 2.2512727
MSE for guessing lag: 1.2174584
Median Absolute Error for guessing lag: 0.28756785
OLS Regression Results
==============================================================================
Dep. Variable: rice_yield R-squared: 0.898
Model: OLS Adj. R-squared: 0.898
Method: Least Squares F-statistic: nan
Date: Sat, 28 Nov 2020 Prob (F-statistic): nan
Time: 11:14:35 Log-Likelihood: -4.2029e+05
No. Observations: 276999 AIC: 8.406e+05
Df Residuals: 276998 BIC: 8.406e+05
Df Model: 0
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
rice_yield_lag 1.0007 0.000 2498.300 0.000 1.000 1.001
==============================================================================
Omnibus: 248589.441 Durbin-Watson: 0.460
Prob(Omnibus): 0.000 Jarque-Bera (JB): 29119130.864
Skew: 3.871 Prob(JB): 0.00
Kurtosis: 52.629 Cond. No. 1.00
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
MSE for model: 17.150750067585562
Max Error for model: 17.00377
Median Absolute Error for model: 0.36177808
Standard Deviation for y model: 4.1413465
Mean for y: 4.8026533
MSE for guessing mean: 17.150377
Max Error for guessing mean: 15.626874
Median Absolute Error for guessing mean: 3.5748
MSE for guessing lag: 1.7080196
Median Absolute Error for guessing lag: 0.3617916
OLS Regression Results
==============================================================================
Dep. Variable: rice_second_yield R-squared: 0.900
Model: OLS Adj. R-squared: 0.900
Method: Least Squares F-statistic: nan
Date: Sat, 28 Nov 2020 Prob (F-statistic): nan
Time: 11:14:37 Log-Likelihood: -77617.
No. Observations: 46020 AIC: 1.552e+05
Df Residuals: 46019 BIC: 1.552e+05
Df Model: 0
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
rice_second_yield_lag 1.0008 0.001 1018.591 0.000 0.999 1.003
==============================================================================
Omnibus: 23212.207 Durbin-Watson: 0.331
Prob(Omnibus): 0.000 Jarque-Bera (JB): 586204.736
Skew: 1.897 Prob(JB): 0.00
Kurtosis: 20.068 Cond. No. 1.00
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.