Return to JUNTO

JUNTO Practice: Data Analysis, Historical Crop Yields (Part 3)

Discussed on December 01, 2020.

Analyze this dataset and create a short report:

Please submit an abstract (~200 words) with your report.

Solutions

Click to see:

Oscar Martinez

Crop Data 3

0 Abstract

Last time we looked at relationships between nation state data and the crop yields of the plots in them. This week we try modeling yield data based on plot specific latitude, longitude, temperature, and precipitation data. We also include previous year yield data because it seemed reasonable it would be available in practice.

We use two benchmarks to evaluate our model. (1) Guessing the mean. (2) Guessing the previous year’s yield (the one year lag).

We tried modeling the data with a simple linear regression as well as a neural network.

Results show that while the Mean Squared Error and Max Absolute Error for the linear regression models are the same as the mean-guessing benchmark, the Median Absolute Error is significantly lower. However, further analysis revealed that the majority of the modeling success came from emulating the lag benchmark. Coefficient analysis and residual plotting supports this.

The neural network was not able to outperform the linear regression model, suggesting the data does not contain enough signal to do better.

We conclude there is not enough signal at the frequency we have the data to meaningfully predict yields with any superiority to just guessing the previous year’s yield. Without more domain expertise, it is difficult to know how useful this is in practice. Further work should bring in domain knowledge to understand what this data is used for in practice, and see if more useful modeling emerges at different granularities than annual yield prediction.

1. Data Construction

We create annual aggregates of monthly temperature and precipitation data.

temperature = nc.Dataset(
    "climate_data/air.mon.mean.v501.nc"
)
print(temperature, "\n")
print(temperature.dimensions)
print(temperature.variables)
nc.num2date(
    temperature["time"][:],
    temperature["time"].units,
    only_use_cftime_datetimes=False,
    only_use_python_datetimes=True,
)


def netcdf_window(dataset, variable, window_size, agg_func):
    window_stats = []
    for i in range(
        0, dataset[variable].shape[0], window_size
    ):
        window_stats.append(
            agg_func(
                dataset[variable][i : i + window_size], 0
            )
        )

    return np.stack(window_stats)


temp_means = netcdf_window(temperature, "air", 12, np.mean)[
    80:-2
]
temp_vars = netcdf_window(temperature, "air", 12, np.var)[
    80:-2
]
years = [y for y in range(1981, 2017)]
lat_range = [i for i in range(360)]
lon_range = [i for i in range(720)]
data = pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [years, lat_range, lon_range],
        names=["year", "lat_id", "lon_id"],
    )
)
data["lat"] = precipitation["lat"][:].take(
    data.index.get_level_values(1)
)
data["lon"] = precipitation["lon"][:].take(
    data.index.get_level_values(2)
)
for crop in crops:
    data[f"{crop}_yield"] = np.concatenate(
        [
            crop_matrices[crop][y]["var"][:].ravel()
            for y in years
        ]
    )
data = data.merge(
    data.groupby(level=[1, 2])
    .shift(1)
    .drop(["lat", "lon"], axis=1),
    left_index=True,
    right_index=True,
    suffixes=("", "_lag"),
)
data = data.replace(to_replace=-999000000.0, value=0)

2. Modeling

We set up a loop to train a model for each crop.

# Modeling
for crop in crops:
    target_crop = crop
    target_col = f"{target_crop}_yield"
    non_model_cols = [
        c_f
        for c in crops
        for c_f in (f"{c}_yield", f"{c}_yield_lag")
    ]
    # v2: non_model_cols = [f"{c}_yield" for c in crops]
    # non_model_cols.extend([f"{c}_{m}_lag" for c in ["precipitation","temperature"] for m in ["mean","var"]])
    non_model_cols.remove(f"{target_col}_lag")

    X = (
        data.loc[data[target_col] > -999_000_000.0]
        .loc[1982:]
        .drop(non_model_cols, axis=1)
        .replace(to_replace=-999_000_000.0, value=0)[
            f"{target_col}_lag"
        ]
    )
    y = (
        data.loc[data[target_col] > -999_000_000.0]
        .loc[1982:2010][target_col]
        .replace(to_replace=-999_000_000.0, value=0)
    )

    y_guess_lag = (
        data.loc[data[target_col] > -999_000_000.0]
        .loc[1982:][target_col + "_lag"]
        .replace(to_replace=-999_000_000.0, value=0)
    )

    # scaler = Normalizer()
    # scaled_X = scaler.fit_transform(X)
    yield_lmodel = sm.OLS(
        y, X, missing="drop", hasconst=True
    )
    result = yield_lmodel.fit()

    print(f"MSE for model:", result.mse_total)
    print(
        f"Max Error for model:",
        max_error(y, result.fittedvalues),
    )
    print(
        f"Median Absolute Error for model:",
        median_absolute_error(y, result.fittedvalues),
    )
    print(f"\nStandard Deviation for y model:", y.std())
    print(f"Mean for y:", y.mean())
    print(
        "\nMSE for guessing mean:",
        mean_squared_error(y, [y.mean()] * len(y)),
    )
    print(
        f"Max Error for guessing mean:",
        max_error(y, [y.mean()] * len(y)),
    )
    print(
        f"Median Absolute Error for guessing mean:",
        median_absolute_error(y, [y.mean()] * len(y)),
    )
    print(
        "\nMSE for guessing lag:",
        mean_squared_error(y, y_guess_lag),
    )
    print(
        "Median Absolute Error for guessing lag:",
        median_absolute_error(y, y_guess_lag),
    )
    print(result.summary())

    plt.scatter(range(result.resid.shape[0]), result.resid)
    plt.show()
    plt.scatter(range(y.shape[0]), y - y.mean())
    plt.show()
    plt.scatter(range(y.shape[0]), y - y_guess_lag)
    plt.show()
MSE for model: 12.890721416272546
Max Error for model: 27.743002
Median Absolute Error for model: 0.26920652

Standard Deviation for y model: 3.5903652
Mean for y: 3.4950097

MSE for guessing mean: 12.890694
Max Error for guessing mean: 25.76118
Median Absolute Error for guessing mean: 2.2961192

MSE for guessing lag: 1.0769701
Median Absolute Error for guessing lag: 0.26920128
                            OLS Regression Results                            
==============================================================================
Dep. Variable:      maize_major_yield   R-squared:                       0.916
Model:                            OLS   Adj. R-squared:                  0.916
Method:                 Least Squares   F-statistic:                       nan
Date:                Sat, 28 Nov 2020   Prob (F-statistic):                nan
Time:                        11:14:31   Log-Likelihood:            -6.3568e+05
No. Observations:              436591   AIC:                         1.271e+06
Df Residuals:                  436590   BIC:                         1.271e+06
Df Model:                           0                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
maize_major_yield_lag     1.0011      0.000   3121.096      0.000       1.000       1.002
==============================================================================
Omnibus:                   329941.720   Durbin-Watson:                   0.419
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         23804309.948
Skew:                           3.046   Prob(JB):                         0.00
Kurtosis:                      38.657   Cond. No.                         1.00
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
MSE for model: 1.177887018166247
Max Error for model: 5.966308
Median Absolute Error for model: 0.20253134

Standard Deviation for y model: 1.085305
Mean for y: 2.0639737

MSE for guessing mean: 1.1778803
Max Error for guessing mean: 8.736028
Median Absolute Error for guessing mean: 0.7471347

MSE for guessing lag: 0.21855998
Median Absolute Error for guessing lag: 0.20260334
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          soybean_yield   R-squared:                       0.814
Model:                            OLS   Adj. R-squared:                  0.814
Method:                 Least Squares   F-statistic:                       nan
Date:                Sat, 28 Nov 2020   Prob (F-statistic):                nan
Time:                        11:14:34   Log-Likelihood:            -1.1685e+05
No. Observations:              177431   AIC:                         2.337e+05
Df Residuals:                  177430   BIC:                         2.337e+05
Df Model:                           0                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
soybean_yield_lag     0.9988      0.000   2058.457      0.000       0.998       1.000
==============================================================================
Omnibus:                    69707.792   Durbin-Watson:                   0.342
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          1032754.720
Skew:                           1.489   Prob(JB):                         0.00
Kurtosis:                      14.438   Cond. No.                         1.00
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
MSE for model: 11.880597425507798
Max Error for model: 22.400003
Median Absolute Error for model: 0.2875111

Standard Deviation for y model: 3.4468243
Mean for y: 4.0950184

MSE for guessing mean: 11.880554
Max Error for guessing mean: 20.333187
Median Absolute Error for guessing mean: 2.2512727

MSE for guessing lag: 1.2174584
Median Absolute Error for guessing lag: 0.28756785
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             rice_yield   R-squared:                       0.898
Model:                            OLS   Adj. R-squared:                  0.898
Method:                 Least Squares   F-statistic:                       nan
Date:                Sat, 28 Nov 2020   Prob (F-statistic):                nan
Time:                        11:14:35   Log-Likelihood:            -4.2029e+05
No. Observations:              276999   AIC:                         8.406e+05
Df Residuals:                  276998   BIC:                         8.406e+05
Df Model:                           0                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
rice_yield_lag     1.0007      0.000   2498.300      0.000       1.000       1.001
==============================================================================
Omnibus:                   248589.441   Durbin-Watson:                   0.460
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         29119130.864
Skew:                           3.871   Prob(JB):                         0.00
Kurtosis:                      52.629   Cond. No.                         1.00
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
MSE for model: 17.150750067585562
Max Error for model: 17.00377
Median Absolute Error for model: 0.36177808

Standard Deviation for y model: 4.1413465
Mean for y: 4.8026533

MSE for guessing mean: 17.150377
Max Error for guessing mean: 15.626874
Median Absolute Error for guessing mean: 3.5748

MSE for guessing lag: 1.7080196
Median Absolute Error for guessing lag: 0.3617916
                            OLS Regression Results                            
==============================================================================
Dep. Variable:      rice_second_yield   R-squared:                       0.900
Model:                            OLS   Adj. R-squared:                  0.900
Method:                 Least Squares   F-statistic:                       nan
Date:                Sat, 28 Nov 2020   Prob (F-statistic):                nan
Time:                        11:14:37   Log-Likelihood:                -77617.
No. Observations:               46020   AIC:                         1.552e+05
Df Residuals:                   46019   BIC:                         1.552e+05
Df Model:                           0                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
rice_second_yield_lag     1.0008      0.001   1018.591      0.000       0.999       1.003
==============================================================================
Omnibus:                    23212.207   Durbin-Watson:                   0.331
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           586204.736
Skew:                           1.897   Prob(JB):                         0.00
Kurtosis:                      20.068   Cond. No.                         1.00
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.